This blog post attempts to shed light on when and how to use relational information in presumably richly structured data domains, such as social networks, to most effectively detect spam posted by their users.
Spam on social networks has always been a problem, and most likely always will be, as the arms race between methods of posting spam and methods of detecting it rages on. People spend a lot of time building domain-specific features to plug into traditional machine learning techniques, which assume the data to be independent and identically distributed (IID). These models can be very effective, but they leave a potentially large source of relational information unexploited.
The idea of relational reasoning is straightforward: knowing something about one data point can tell you something about other data points related to that one.
Since it is virtually impossible to eliminate spam completely (depending on your definition of spam in the first place), we should at least try to make successfully posting spam as painful and difficult as possible. This means effectively incorporating all sources of meaningful information to detect spam. So how can we do this?
One way, which has proven to be effective, is to predict all related messages at the same time using a joint prediction model such as a Markov network, in which the predictions of an independent model are propagated between related messages. However, such approaches are few and far between, are usually domain specific, and have been shown to work only over a small number of messages.
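To make the idea of joint prediction concrete, here is a minimal sketch of a tiny pairwise Markov network over three messages. It is an illustration under stated assumptions, not the model used in this post: the function name `joint_marginals`, the `agreement` weight, the example probabilities, and the single relation between messages 0 and 1 are all hypothetical. Inference is done by brute-force enumeration, which only works for a handful of messages.

```python
from itertools import product


def joint_marginals(spam_prior, edges, agreement=2.0):
    """Exact spam marginals for a tiny pairwise Markov network.

    spam_prior: independent-model spam probability per message (assumed input).
    edges: pairs of related message indices (hypothetical relation).
    agreement: factor (> 1) rewarding related messages that share a label.
    """
    n = len(spam_prior)
    totals = [0.0] * n
    z = 0.0  # normalizing constant (partition function)
    for labels in product((0, 1), repeat=n):
        # Node potentials: the independent model's belief in each label.
        score = 1.0
        for p, y in zip(spam_prior, labels):
            score *= p if y == 1 else 1.0 - p
        # Edge potentials: related messages prefer to agree.
        for i, j in edges:
            if labels[i] == labels[j]:
                score *= agreement
        z += score
        for i, y in enumerate(labels):
            if y == 1:
                totals[i] += score
    return [t / z for t in totals]


# Message 0 looks like spam, message 1 is ambiguous, message 2 looks clean.
independent = [0.9, 0.5, 0.1]
# Assume messages 0 and 1 are related (for example, same author).
edges = [(0, 1)]
joint = joint_marginals(independent, edges)
print(joint)
```

In this toy setup the ambiguous message is pulled toward spam (from 0.5 to roughly 0.63) because it is related to a likely spam message, while the unrelated message keeps its independent score. Enumeration costs 2^n, so anything beyond a toy example needs approximate inference.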
If this method of joint prediction truly is more effective at detecting spam, then it seems we should push for more widespread adoption of this approach. Unfortunately, there are no general best practices about when and how to use a relational model.
This blog post analyzes three separate but similar domains: SoundCloud, YouTube, and Twitter. After an in-depth look at the data, we want to answer questions such as: Can relational modeling provide any benefit to that domain? If so, how much benefit? How does one choose which relations to exploit? Is joint modeling the most effective way to capture this information? When does relational modeling not work?
We provide analysis to help answer these questions and develop a general framework to easily test and incorporate any relation into a flexible model that can scale to jointly predict millions of messages at a time.