What do search engine spammers look like?

You may think that search engine spammers look pretty much the same as anyone else, and that is probably true, unless of course you are a spam detection algorithm.

At last week’s ACM SIGIR conference in the Netherlands, an interesting paper was presented with the title “Know your Neighbors: Web Spam Detection using the Web Topology”.

Essentially, this describes a spam detection system that uses the link structure of web pages and their content to identify spam. Or, as the abstract puts it: “In this paper we present a spam detection system that uses the topology of the Web graph by exploiting the link dependencies among the Web pages, and the content of the pages themselves.”

The following impressive diagram appears in the paper:

Hostgraph of a section of the web

This is a graphical depiction (for a very small part of the web) of domains with a connection of over 100 links between them. Black nodes are spam and white nodes are non-spam.
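To make the idea of a hostgraph concrete, here is a minimal sketch of how page-level links could be rolled up into host-level connections and filtered at the 100-link threshold used in the diagram. The function name `build_host_graph`, the input format (pairs of URLs) and the choice to count links in both directions together are my own assumptions for illustration; the paper may well build its graph differently.

```python
from collections import Counter
from urllib.parse import urlsplit


def build_host_graph(page_links, min_links=100):
    """Aggregate page-level links into host-level connections.

    page_links: iterable of (source_url, target_url) pairs (hypothetical input format).
    Returns a dict mapping a sorted (host_a, host_b) pair to the number of
    links between the two hosts, keeping only pairs with MORE than min_links.
    Assumption: links in both directions are counted together.
    """
    counts = Counter()
    for src, dst in page_links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Skip malformed URLs and links that stay within one host.
        if not src_host or not dst_host or src_host == dst_host:
            continue
        counts[tuple(sorted((src_host, dst_host)))] += 1
    return {pair: n for pair, n in counts.items() if n > min_links}
```

Each key that survives the filter corresponds to one line drawn between two nodes in a diagram like the one above.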

Most of the spammers are clustered together in the upper right of the center portion. Here is a magnified view of that section:

Magnified section of hostgraph

 

What spammers look like.

The other domains are either in spam clusters or non-spam clusters. Here is a typical spam cluster. It shows what spammers who indulge in nepotistic linking may look like to a spam detection algorithm.
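One simple way to see why such a cluster stands out is to ask, for each host, what share of its neighbors are already known to be spam. The sketch below does just that. It is only an illustration of the "know your neighbors" intuition, not the classifier from the paper, and the function name `spam_neighbor_fraction` and the label format are assumptions of mine.

```python
from collections import defaultdict


def spam_neighbor_fraction(edges, labels):
    """For each host, return the share of its labeled neighbors that are spam.

    edges: iterable of (host_a, host_b) pairs, treated as undirected.
    labels: dict mapping host -> True (spam) or False (non-spam).
            Hosts missing from labels are unlabeled and ignored.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    fractions = {}
    for host, adjacent in neighbors.items():
        known = [labels[n] for n in adjacent if n in labels]
        if known:  # hosts with no labeled neighbors get no score
            fractions[host] = sum(known) / len(known)
    return fractions
```

A host sitting inside a tight ring of spam sites scores close to 1.0, while a host whose neighbors are mostly ordinary sites scores close to 0.0. A real system would combine a signal like this with content features rather than rely on it alone.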

Of course, this is only one line of research into spam detection, but you don’t need to be clairvoyant to know that the major search engines have been including similar components in their ranking algorithms for some time. Good search engine optimizers avoid unnatural linking patterns, and all site owners are well advised to do the same.

You can read the full paper here: Know your Neighbors: Web Spam Detection using the Web Topology, Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri, Proceedings of SIGIR, ACM Press, July 2007, Amsterdam, Netherlands, pp. 423-430.

There is also a very good lecture by Carlos Castillo that gives an insight into various techniques of spam detection. It was recorded on May 19, 2006 at the workshop The Future of Web Search, organized by Yahoo! Research Barcelona and the Web Research Group of the Department of Technology, Universitat Pompeu Fabra. You can see the lecture here: Using Rank Propagation and Probabilistic Counting for Link-based Spam Detection.

2 Comments »

  1. vingold said,

    August 27, 2007 @ 8:48 pm

    I’ve been mulling this over, and here is my question.

    Suppose you are not a spammer, but you do have a network of websites all of which are linked to each other through external navigation, footer/site info, etc.

    For instance, maybe you run a regional website and all of your sites are related by geography. So you might have a network of 25 websites for Florida - broken out like MiamiBoats.com, FortLauderdaleBoats.com, KeyWestBoats.com, etc.

    Also each site has unique content - all original to the site itself and genuinely unique.

    If you wanted to promote these sites across the whole network, is there a better way to do it than by linking to each other through some form of external navigation?

  2. duz said,

    August 28, 2007 @ 7:04 am

    That is a very good question, vingold, and I have always thought very carefully about interlinking a client’s network. I don’t think there is a ‘one size fits all’ solution, but I do now routinely remove all external sitewide links and all inter-network non-contextual links from clients’ networked sites.
