Archive for News and Comment

Web Analytics: An Hour A Day by Avinash Kaushik

I don’t normally herald the addition of a new book to the Essential Reading section of the sidebar here on SEO Blog, but this is an exception. Web Analytics: An Hour A Day by Avinash Kaushik is the best book I have read for anyone involved in Internet marketing and small businesses. Don’t be put off by the title: the book explains the many complex metrics in simple terms and provides very specific guidelines with a step-by-step approach. Avinash Kaushik was Director of Web Research & Analytics for Intuit but left earlier this year to become an independent consultant. His first assignment is working with Google as an Analytics Evangelist, and he talks more about that role on his excellent blog Occam’s Razor. Avinash is donating proceeds from the sale of Web Analytics to charity, so you will not be the only good cause to benefit when you buy this book!

QR Codes

I was with a Japanese corporate client recently and as usual over dinner I asked them what was new on their cell phone (technically not personally!). I say as usual because I have found that if you ask that question in Japan you will always learn something interesting. This case was no exception but first some background.

It is worth noting that the Japanese are addicted to their cell phones with over 100 million (nearly 80% of the population) using the widely available high-speed 3G systems. Many of these users have been taking advantage of QR (Quick Response) codes which are two dimensional barcodes. QR codes can be read by any mobile device with a camera and the appropriate reader software. QR codes appear all over Japan on billboards, in print, on websites and in store windows. Even the Japanese government uses them with the immigration service stamping QR codes on passports detailing the visa status. I first saw them at a trade show in Tokyo where every stand seemed to have one.

QR codes can store up to 7089 numeric characters, 4296 alphanumeric characters or 1817 characters of Japanese (kanji script). That compares with 20-30 ASCII characters (depending on the standard) for a conventional one-dimensional bar code.
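For the curious, the three capacity figures above can be reproduced with a little arithmetic. This is only a sketch: the constants below (2956 data codewords in the largest symbol at the lowest error correction level, a 4-bit mode indicator, the character count field widths and the bits per group of characters) are my recollection of the QR specification, not something taken from this post, so treat them as assumptions.

```python
# Assumed figures for the largest QR symbol (version 40, error correction
# level L), recalled from the QR specification rather than from the post.
DATA_BITS = 2956 * 8          # 2956 data codewords of 8 bits each
MODE_INDICATOR_BITS = 4       # every segment starts with a 4-bit mode marker


def numeric_bits(n):
    """Digits are packed three at a time into 10 bits; 1 or 2 leftover digits use 4 or 7 bits."""
    return 10 * (n // 3) + (0, 4, 7)[n % 3]


def alphanumeric_bits(n):
    """Characters are packed in pairs into 11 bits; a single leftover uses 6 bits."""
    return 11 * (n // 2) + 6 * (n % 2)


def kanji_bits(n):
    """Each kanji character takes 13 bits."""
    return 13 * n


def max_characters(bits_for, count_field_bits):
    """Largest character count whose encoding fits in the remaining bit budget."""
    budget = DATA_BITS - MODE_INDICATOR_BITS - count_field_bits
    n = 0
    while bits_for(n + 1) <= budget:
        n += 1
    return n


capacities = {
    "numeric": max_characters(numeric_bits, 14),
    "alphanumeric": max_characters(alphanumeric_bits, 13),
    "kanji": max_characters(kanji_bits, 12),
}
print(capacities)
```

Smaller symbols and higher error correction levels hold correspondingly less, which is why the figures are quoted as upper limits.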

If you want to read more about QR codes Nokia has a simple explanation with some useful links and here is a short article on a potential development from NTT called Audio Barcodes.

Back to my dinner conversation. My client who is obviously a conscientious consumer told me that her local supermarket has started to use QR code labels on some fresh produce. She shops with the QR reader software enabled on her cell phone, takes a picture of the label and is then connected to a site with all the supplier’s details.

The labels look like this:

QR code label

You can see the QR code in the bottom right hand corner and the supplier’s details, in this case for a lettuce, look like this:

Lettuce grower's details

My Japanese is not very good but this has all sorts of interesting information; exactly where it was grown, when the seeds were sown, when the lettuce was harvested, the fertilizer used, the insecticide used, the bactericide used, the herbicide used and lots more. I have to say I was impressed!

She also told me that she uses QR codes to put useful RSS feeds on to her cell phone. It transpired that this will work on any RSS feed, not just those especially for mobile devices, because the software adapts the content automatically. I have generated a QR code for the feed on this site:

QR code for SEO Blog feed

This is what you would see on the cell phone:

Cell phone image data

If you would like to generate QR codes for your own RSS feeds you can do so on the Kaywa site.

So when can we expect to see QR codes used widely in the US? Not any time soon would be my guess. As anyone who has visited the major cell phone trade shows like CTIA Wireless will know, the latest and best cell phones on display are labeled ‘Not available in the US’. The three different transmission technologies used in the US, CDMA (Sprint & Verizon), GSM (AT&T & T-Mobile) and iDEN (Nextel), make it far more attractive for manufacturers to launch new products and functionality in the European and Asian markets, where they just use GSM.

Europeans are just beginning to see QR codes but if you have Japanese or Asian customers it’s definitely worth knowing about QR codes and the many ways they are used for marketing.

September 28, 2007 update.

Image search

A reader has emailed me about a new store in fashionable Rue de Turbigo, Paris, France. Called Denim Code, it sells designer clothes with QR codes attached. This publicity photo shows a QR code on a pair of jeans, but do Parisian males need an excuse to take a picture of a lady’s bottom? What message will they receive on their cell phone? I bet those of you with a marketing brain are already thinking of hundreds of novel ideas!

What do search engine spammers look like?

You may think that search engine spammers look pretty much the same as anyone else and that is probably true, unless of course you are a spam detection algorithm.

At last week’s ACM SIGIR conference in the Netherlands an interesting paper was presented with the title “Know your Neighbors: Web Spam Detection using the Web Topology”.

Essentially this describes a spam detection system that uses the link structure of web pages and their content to identify spam. Or as the abstract puts it, “In this paper we present a spam detection system that uses the topology of the Web graph by exploiting the link dependencies among the Web pages, and the content of the pages themselves”.

The following impressive diagram appears in the paper:

Hostgraph of a section of the web

This is a graphical depiction (for a very small part of the web) of domains with a connection of over 100 links between them; black nodes are spam and white nodes are non-spam.

Most of the spammers are clustered together in the upper-right of the center portion and here is a magnified view of that section:

Magnified section of hostgraph

 

What spammers look like.

The other domains are either in spam clusters or non-spam clusters. Here is a typical spam cluster and it shows what spammers, who indulge in nepotistic linking, may look like to a spam detection algorithm.
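To make the “know your neighbors” idea concrete, here is a minimal sketch of my own, not the method from the paper. It keeps only connections of over 100 links between hosts, as in the diagram, and then asks what fraction of a host’s remaining neighbours are already known to be spam. The host names, link counts and the function names are all hypothetical.

```python
from collections import defaultdict


def build_hostgraph(link_counts, min_links=100):
    """Keep only pairs of hosts connected by more than min_links links."""
    neighbours = defaultdict(set)
    for (host_a, host_b), count in link_counts.items():
        if count > min_links and host_a != host_b:
            neighbours[host_a].add(host_b)
            neighbours[host_b].add(host_a)
    return neighbours


def spam_neighbour_fraction(host, neighbours, known_spam):
    """Share of a host's neighbours that are already labelled as spam."""
    linked = neighbours.get(host, set())
    if not linked:
        return 0.0
    return sum(1 for h in linked if h in known_spam) / len(linked)


# Hypothetical link counts between pairs of hosts.
link_counts = {
    ("cheap-pills.example", "casino-win.example"): 450,
    ("cheap-pills.example", "loans-now.example"): 300,
    ("casino-win.example", "loans-now.example"): 220,
    ("mystery.example", "casino-win.example"): 180,
    ("mystery.example", "loans-now.example"): 140,
    ("news.example", "library.example"): 150,
    ("news.example", "cheap-pills.example"): 12,   # below the threshold, dropped
}
known_spam = {"cheap-pills.example", "casino-win.example", "loans-now.example"}

graph = build_hostgraph(link_counts)
for host in ("mystery.example", "news.example"):
    print(host, spam_neighbour_fraction(host, graph, known_spam))
```

A host whose heavy linking is almost entirely with known spam hosts looks, to this crude measure, like part of the cluster, while an occasional stray link to a spam host does not count against a site.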

Of course this is only one line of research into spam detection but you don’t need to be clairvoyant to know that the major search engines have been including similar components in their ranking algorithms for some time. Good search engine optimizers avoid unnatural linking patterns and all site owners are well advised to do the same.

You can read the full paper here: Know your Neighbors: Web Spam Detection using the Web Topology, Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri, proceedings of SIGIR, ACM Press, July 2007, Amsterdam, Netherlands, 423-430.

There is also a very good lecture by Carlos Castillo that gives an insight into various techniques of spam detection. Recorded at the Workshop: The Future of Web Search, May 19, 2006 Organized by Yahoo! Research Barcelona and the Web Research Group of the Department of Technology, Universitat Pompeu Fabra. You can see the lecture here: Using Rank Propagation and Probabilistic Counting for Link-based Spam Detection.

The LSI Myth

In a previous post, ‘What is Latent Semantic Indexing?’, I attempted to give a non-mathematical and simplified explanation of LSI. The document set I chose as an example was every web page, and we saw how this would result in a matrix of huge dimensions. I mentioned that LSI would consume very large amounts of processing power if used on such a huge term-document matrix. If you want to get a feel for just how much processing is required, take a look at Telcordia LSI Engine: Implementation and Scalability Issues. Not only that, but to be meaningful the process would have to index a constant stream of new and updated pages and run continuously, which makes it totally impractical. The algorithm does not scale, and keeping the data in memory for very large datasets is not feasible. Keeping it on disk and making random disk seeks takes too much time. LSI has been shown to work best on small homogeneous document collections, but for large non-homogeneous document collections it remains a research tool of as yet unknown efficacy. Recent experimental results also seem to confirm claims by previous researchers that the retrieval accuracy of the LSI technique may deteriorate with large inhomogeneous datasets (Clustered SVD strategies in latent semantic indexing). The search engines may well have a semantic component of some kind (more on that later) but LSI, no way!

So why would anybody claim that Google or any other search engine was using LSI? There are two possible reasons: simple ignorance or, as Dr. E. Garcia (information retrieval researcher) puts it, “snake oil marketers”, meaning SEO firms and individuals who find some commercial value in pretending they have an understanding of LSI. Here are some typical quotes right off their web pages:

LSI quotes

So what sort of evidence do these people cite to justify their erroneous claims? There appear to be three common misunderstandings. The first concerns Google’s acquisition of Applied Semantics in April 2003. Applied Semantics was purchased for its semantic text processing and online advertising expertise derived from its patented CIRCA technology (Google press release). CIRCA uses a proprietary ontology which consists of hundreds of thousands of concepts and their relationships to each other. This ontology is developed by merging industry-standard knowledge bases with automated tools, together with guidance and direction from a team of lexicographers and computational linguists. The technology is outlined in two Applied Semantics patents: Meaning-based advertising and document relevance determination and Meaning-based information organization and retrieval. CIRCA has absolutely nothing to do with LSI. Google uses CIRCA (by now much improved) to target online advertising and also possibly in much the same way that Yahoo uses its “concept server” (Systems and methods for search processing using superunits and Systems and methods for generating concept units from search queries). The concept server manifests itself as the “Also try:” snippet at the top of the Yahoo SERPs.

The second erroneous justification is associating the Google synonym search operator with LSI. This Google advanced search operator will search not only for your search term but also for its synonyms if you place the tilde sign (~) immediately in front of your search term. As Marissa Mayer, Vice President, Search Products at Google, put it when the operator was launched, “We think this is a powerful and useful way to broaden results. It’s the opposite of disambiguation, which narrows a search”. Anyone who has used it will see immediately that it uses a small and very poor set of real synonyms (sorry Marissa!). For example, ‘shell’ has many synonyms: ammunition, armament, bullet, cartridge, carcass, framework, peel, husk, seashell, etc. However, Google recognizes very few of these with a ~shell search. Obviously it is not based on a synonym thesaurus but it is, as Marissa says, a way to broaden search results. These pseudo-synonyms are almost certainly generated algorithmically (possibly from clickthrough data) but again have absolutely nothing to do with LSI. In any case, as Dr. E. Garcia explains, LSI is far from being a synonym discovery technique (LSI Keyword Research and Co-Occurrence Theory).

The third fallacious argument involves a belief that a raft of recent Google patents ‘proves’ that Google is using LSI. The patents in question are: Multiple index based information retrieval system, Phrase-based searching in an information retrieval system, Phrase-based indexing in an information retrieval system, Phrase-based generation of document descriptions, Phrase identification in an information retrieval system and Detecting spam documents in a phrase based information retrieval system. These patents contain some very interesting concepts and are required reading for the professional SEO. They are, however, only filed patents, and this does not mean that all or any of the ideas in them have been implemented. They should be studied to give an indication of what search engineers are thinking about and which components (if any) may be implemented now and in the future.

The overall concept in these patents involves indexing documents (pages) according to their included phrases, with each potential phrase classified as either a good phrase or a bad phrase. Good phrases are defined as “phrases that tend to occur in more than certain percentage of documents in the document collection and/or are indicated as having a distinguished appearance in such documents, such as delimited by markup tags or other morphological, format, or grammatical markers. Another aspect of good phrases is that they are predictive of other good phrases, and are not merely sequences of words that appear in the lexicon”. Bad phrases are defined as those “…lacking in predictive power”. When a user types in a query, any phrases present in the query are used to search the index, and ranked results are returned according to the phrases that are contained in the document. This is a gross oversimplification :) but to explain the details here is not the point.

The confusion with these patents and LSI arises because as part of the indexing process the proposed algorithm maintains a co-occurrence matrix of good phrases and this is mistaken for the term-document matrix used in LSI. The co-occurrence matrix of good phrases is not only different, it is much smaller and not optimally mapped by SVD as in LSI.
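A toy example may help to show why the two matrices are different animals. The sketch below is mine and is heavily simplified: the three documents and the hand-picked list of ‘good phrases’ are hypothetical, and co-occurrence is counted per document, which is an assumption rather than the definition used in the patents. The point is only the shape: the term-document matrix has one row per term and one column per document, while the co-occurrence matrix is square, phrases by phrases, and does not grow with the number of documents.

```python
# Three hypothetical documents and a hand-picked list of 'good phrases'.
docs = [
    "the president of the united states visited the white house",
    "the white house press office issued a statement",
    "sea shell collectors visited the beach",
]
good_phrases = ["white house", "united states", "sea shell"]


def term_document_matrix(documents):
    """Rows are terms, columns are documents, cells are term counts (as used in LSI)."""
    terms = sorted({word for doc in documents for word in doc.split()})
    matrix = [[doc.split().count(term) for doc in documents] for term in terms]
    return terms, matrix


def phrase_cooccurrence_matrix(documents, phrases):
    """Rows and columns are phrases; cells count documents containing both phrases."""
    return [
        [sum(1 for doc in documents if p in doc and q in doc) for q in phrases]
        for p in phrases
    ]


terms, term_doc = term_document_matrix(docs)
cooccurrence = phrase_cooccurrence_matrix(docs, good_phrases)

print("term-document matrix:", len(term_doc), "terms x", len(term_doc[0]), "documents")
print("co-occurrence matrix:", len(cooccurrence), "x", len(cooccurrence[0]), "phrases")
for phrase, row in zip(good_phrases, cooccurrence):
    print(phrase, row)
```

Add a fourth document and the term-document matrix gains a column (and probably several rows), whereas the co-occurrence matrix keeps its size and only its counts change.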

So what’s the bottom line for the LSI myth? If you hear or read an SEO talking about the importance of LSI in search engine optimization then you can be sure they haven’t a clue what they are talking about and you should simply follow the advice for good copy from a previous post.

Those who have got this far may be wondering what use LSI is if it is not used by the search engines. LSI does in fact have quite a few practical applications and here are some examples to satisfy the curious: Pacific Metrics Corporation are using the Content Analyst Company LSI Patents for automated essay scoring, the analysis of legal documents and creating document summaries for academic funding applications.

May 11, 2007

Professor Michael Berry, head of the Department of Computer Science at the University of Tennessee, wrote to me as follows: “Just for the record, LSI has been used to index on the order of 10 million documents using out-of-core SVD based techniques so you could apply it to subdomains of the Web but the entire Web would be problematic as you point out”. He also recommended an all-inclusive reference book now available on LSA: Handbook of Latent Semantic Analysis, T.K. Landauer, D.S. McNamara, S. Dennis, and W. Kintsch (Eds), Lawrence Erlbaum Associates (2007). Thank you Dr Berry.

Paid Links

“Now Warwick, tell me, even upon thy conscience, is Edward your true king? For I were loath to link with him that were not lawful chosen”. Henry VI, Act 3, Scene 3 by William Shakespeare.

To buy or not to buy, that is the question.

The head of Google’s Webspam team was advising over a year ago that “…if you sell links, you should mark them with the nofollow tag. – Matt Cutts”. A more recent post has caused alarm bells to ring in the minds of those who buy or sell links. The post in question details how to report any sites you find that are selling or buying links. Matt explains that these external reports will be used to test out some new techniques in algorithmic paid link detection.

So why is Google so keen on detecting paid links you might ask? Look no further than Google’s Corporate Information, Philosophy page “Google works because it relies on the millions of individuals posting links on websites to help determine which other sites offer content of value. Google assesses the importance of every web page using a variety of techniques, including its patented PageRank algorithm which analyzes which sites have been “voted” the best sources of information by other pages across the web”. So it is hardly surprising that Google views paid links as ‘paid votes’ and therefore likely to introduce bias into their PageRank algorithm.
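The ‘paid votes’ point can be illustrated with a toy calculation. What follows is a simplified textbook-style PageRank iteration, not Google’s production algorithm; the damping factor of 0.85 is the commonly quoted value, and the page names and link structure are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: each page shares its rank equally among the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over every page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical mini web: nobody links to "shop" on merit.
organic_links = {
    "news": ["guide"],
    "guide": ["news"],
    "blog": ["news"],
    "shop": ["news"],
}
# The same web after "guide" sells a link to "shop".
paid_links = dict(organic_links, guide=["news", "shop"])

before = pagerank(organic_links)
after = pagerank(paid_links)
print("shop before:", round(before["shop"], 4))
print("shop after: ", round(after["shop"], 4))
```

In this sketch the shop’s score rises purely because a vote was bought, which is exactly the bias Google is worried about.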

The Problem.

With the introduction of PageRank (originating from Larry Page and Sergey Brin’s 1997 paper) Google created a new commodity: links that improve ranking. Economists from Karl Marx to Milton Friedman have recognized that for every commodity there will always be a market, and hence the buying and selling of links has become an industry. Text link brokers have been making hay while the sun shines, and Google now feels that it needs to get on top of this problem before the PageRank component of its algorithm breaks. Algorithmically detecting and then discounting paid links is one approach, hence Matt Cutts’ request for data.

Google’s Solution.

As well as improving the detection of paid links Google’s solution includes extending the use of the nofollow tag from its original conception as “…an easy way for a website to tell search engines that the website can’t or doesn’t want to vouch for a link - Matt Cutts” to a “…machine-readable disclosure for paid links… – Matt Cutts”. It appears likely that once Google encounters a paid link without a nofollow then at the very least it will be discounted.
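If you want to check which links on a page carry the nofollow marker (it is actually a value of the link’s rel attribute rather than a tag of its own), a few lines of Python will do it. This is a minimal sketch using the standard library HTML parser; the LinkAudit class name and the sample HTML are hypothetical, and it says nothing about how Google itself detects paid links.

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Sorts the links on a page into followed and nofollowed lists."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel can hold several space-separated values, e.g. rel="external nofollow".
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


# Hypothetical page fragment with one ordinary link and two marked links.
sample_html = """
<p>Read our <a href="https://example.com/guide">free guide</a>.</p>
<p>Sponsored: <a href="https://example.org/offer" rel="nofollow">special offer</a>
and <a href="https://example.net/deal" rel="external NoFollow">another deal</a>.</p>
"""

audit = LinkAudit()
audit.feed(sample_html)
print("followed:  ", audit.followed)
print("nofollowed:", audit.nofollowed)
```

Run against your own pages, anything sold that shows up in the followed list is a link Google would expect to see marked.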

What should you do?

The obvious course of action is to buy links only for traffic and make sure they are nofollowed or, if you are selling links, to make sure they too are all nofollowed. However, you can be sure that no professional SEO will be signing up exclusively to this approach. Paid links are too important a tool in SEO to be given up on Google’s say-so, especially when Google are still in the process of creating an improved detection algorithm. So my advice if you buy links is:

  • Go into stealth mode if you aren’t in it already.
  • Don’t buy links that are advertised or from a broker.
  • Approach site owners directly by telephone.
  • Check the site to make sure it would pass a human inspection for paid links.
  • Make sure your link is embedded in content and that it is relevant content.
  • Make sure the link points to relevant content on your website.
  • Don’t buy home page links.

If you employ an SEO or are about to, make sure that they have a clearly defined policy on buying links based on the above. If you don’t want links purchased for your site make sure that your SEO knows your position on the subject.

June 7, 2007
Added
Google has provided guidelines in its Webmaster Help Center titled Why should I report paid links to Google?

June 12, 2007
Added
Google has put up a Paid Links Reporting Form on Webmaster Tools.

Google paid links reporting form on Webmaster Tools

December 1, 2007
Added
Google have simultaneously published two important posts on paid links in a concerted effort to draw a line in the sand:

On Google Webmaster Central Blog - Information about buying and selling links that pass PageRank

On Matt Cutts’ blog - Selling links that pass PageRank

December 30, 2007
Added
Ted Murphy of Izea (formerly PayPerPost) has published part of an email he received from Matt Cutts: “Google (and probably all search engines) will consider all links in a paid post to be paid”. (My embolding)
