You may think that search engine spammers look pretty much the same as anyone else, and that is probably true, unless of course you are a spam detection algorithm.
At last week's ACM SIGIR conference in the Netherlands, an interesting paper was presented with the title “Know your Neighbors: Web Spam Detection using the Web Topology”.
Essentially, this describes a spam detection system that uses the link structure of web pages and their content to identify spam. Or, as the abstract puts it: “In this paper we present a spam detection system that uses the topology of the Web graph by exploiting the link dependencies among the Web pages, and the content of the pages themselves.”
The following impressive diagram appears in the paper:
This is a graphical depiction, for a very small part of the web, of domains that have more than 100 links between them. Black nodes are spam and white nodes are non-spam.
Most of the spammers are clustered together in the upper right of the center portion, and here is a magnified view of that section:
The other domains are either in spam clusters or non-spam clusters. Here is a typical spam cluster; it shows what spammers who indulge in nepotistic linking may look like to a spam detection algorithm.
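To make the "know your neighbors" idea a little more concrete, here is a minimal sketch in Python. It is not the method from the paper, only an illustration of the intuition behind the diagram: keep the domain pairs that have more than 100 links between them, then look at what share of each domain's neighbors is already labeled as spam. The domain names, link counts, labels and function names are all made up for the example.

```python
from collections import defaultdict

# Hypothetical data: (source domain, target domain) -> number of links.
link_counts = {
    ("spam-a.example", "spam-b.example"): 250,
    ("spam-b.example", "spam-c.example"): 180,
    ("spam-c.example", "spam-a.example"): 300,
    ("spam-a.example", "news.example"): 3,
    ("news.example", "blog.example"): 140,
    ("blog.example", "shop.example"): 120,
    ("shop.example", "spam-b.example"): 5,
}

# Hypothetical labels: domains that have already been judged to be spam.
known_spam = {"spam-a.example", "spam-b.example"}


def build_graph(link_counts, min_links=100):
    """Connect two domains only if they share more than min_links links."""
    neighbors = defaultdict(set)
    for (src, dst), count in link_counts.items():
        if count > min_links:
            # Treat the connection as undirected, as in a drawn graph.
            neighbors[src].add(dst)
            neighbors[dst].add(src)
    return neighbors


def spam_neighbor_fraction(neighbors, known_spam):
    """For each domain, the share of its neighbors that are labeled spam."""
    return {
        domain: len(linked & known_spam) / len(linked)
        for domain, linked in neighbors.items()
    }


graph = build_graph(link_counts)
scores = spam_neighbor_fraction(graph, known_spam)
for domain, score in sorted(scores.items()):
    print(f"{domain}: {score:.2f}")
```

In this toy data the unlabeled domain spam-c.example ends up with every one of its heavily linked neighbors marked as spam, while the news, blog and shop domains have none. A real system would treat a score like this as one signal among many rather than as a verdict.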
Of course, this is only one line of research into spam detection, but you don’t need to be clairvoyant to know that the major search engines have been including similar components in their ranking algorithms for some time. Good search engine optimizers avoid unnatural linking patterns, and all site owners are well advised to do the same.
You can read the full paper here: Know your Neighbors: Web Spam Detection using the Web Topology, Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri, Proceedings of SIGIR, ACM Press, July 2007, Amsterdam, Netherlands, pp. 423-430.
There is also a very good lecture by Carlos Castillo that gives an insight into various techniques of spam detection. It was recorded at the workshop The Future of Web Search on May 19, 2006, organized by Yahoo! Research Barcelona and the Web Research Group of the Department of Technology, Universitat Pompeu Fabra. You can see the lecture here: Using Rank Propagation and Probabilistic Counting for Link-based Spam Detection.