The LSI Myth

In a previous post, 'What is Latent Semantic Indexing?', I attempted to give a non-mathematical and simplified explanation of LSI. The document set I chose as an example was every web page, and we saw how this would result in a matrix of huge dimensions. I mentioned that LSI would consume very large amounts of processing power if applied to such a huge term-document matrix; if you want a feel for just how much processing is required, take a look at Telcordia LSI Engine: Implementation and Scalability Issues. Worse, to be meaningful the process would have to index a constant stream of new and updated pages and run continuously, which makes it totally impractical. The algorithm does not scale: keeping the data in memory for very large datasets is not feasible, and keeping it on disk and making random disk seeks takes too much time. LSI has been shown to work best on small, homogeneous document collections; for large, non-homogeneous collections it remains a research tool of as yet unknown efficacy. Recent experimental results also seem to confirm earlier researchers' claims that the retrieval accuracy of LSI may deteriorate on large, inhomogeneous datasets (Clustered SVD strategies in latent semantic indexing). The search engines may well have a semantic component of some kind (more on that later), but LSI, no way!
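For readers who want to see the mechanics, here is a minimal sketch of LSI on a toy term-document matrix (the vocabulary and documents are my own illustrative invention, not anyone's production system): the matrix is decomposed by SVD, truncated to a few latent dimensions, and a query is ranked against documents by cosine similarity in that reduced space.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# A web-scale collection would have millions of rows and columns,
# which is exactly the scaling problem described above.
terms = ["car", "auto", "engine", "flower", "petal"]
A = np.array([
    [1, 1, 0, 0],  # car
    [0, 1, 1, 0],  # auto
    [1, 1, 1, 0],  # engine
    [0, 0, 0, 1],  # flower
    [0, 0, 1, 1],  # petal
], dtype=float)

# LSI: truncated singular value decomposition of the term-document matrix.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # keep only the k largest singular values (the "latent" dimensions)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the reduced space

# Fold a one-term query ("auto") into the same reduced space.
q = np.array([0, 1, 0, 0, 0], dtype=float)
q_reduced = q @ U[:, :k]

# Rank documents by cosine similarity in the latent space. Note that
# document 0 never contains "auto", yet it still scores well, because
# it shares "car" and "engine" with documents that do: that is the
# latent-association effect LSI is known for.
sims = doc_vectors @ q_reduced / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_reduced))
print(sims.round(3))
```

Even on this 5x5-ish toy the decomposition is the expensive step; the SVD of a genuinely web-sized sparse matrix, recomputed as pages change, is where the approach breaks down.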

So why would anybody claim that Google or any other search engine was using LSI? There are two possible reasons: simple ignorance, or what Dr. E. Garcia (an information retrieval researcher) calls "snake oil marketers", SEO firms and individuals who find some commercial value in pretending they have an understanding of LSI. Here are some typical quotes right off their web pages:

LSI quotes

So what sort of evidence do these people cite to justify their erroneous claims? There appear to be three common misunderstandings. The first concerns Google's acquisition of Applied Semantics in April 2003. Applied Semantics was purchased for its semantic text processing and online advertising expertise derived from its patented CIRCA technology (Google press release). CIRCA uses a proprietary ontology consisting of hundreds of thousands of concepts and their relationships to each other. This ontology is developed by merging industry-standard knowledge bases with automated tools, together with guidance and direction from a team of lexicographers and computational linguists. The technology is outlined in two Applied Semantics patents: Meaning-based advertising and document relevance determination, and Meaning-based information organization and retrieval. CIRCA has absolutely nothing to do with LSI. Google uses CIRCA (by now much improved) to target online advertising, and also possibly in much the same way that Yahoo uses its "concept server" (Systems and methods for search processing using superunits and Systems and methods for generating concept units from search queries). The concept server manifests itself as the "Also try:" snippet at the top of the Yahoo SERPs.

The second erroneous justification is associating the Google synonym search operator with LSI. This Google advanced search operator will search not only for your search term but also for its synonyms if you place the tilde sign (~) immediately in front of your search term. As Marissa Mayer, Vice President of Search Products at Google, put it when the operator was launched: "We think this is a powerful and useful way to broaden results. It's the opposite of disambiguation, which narrows a search". Anyone who has used it will see immediately that it draws on a small and very poor set of real synonyms (sorry Marissa!). For example, 'shell' has many synonyms: ammunition, armament, bullet, cartridge, carcass, framework, peel, husk, seashell, etc. However, Google recognizes very few of these with a ~shell search. Obviously it is not based on a synonym thesaurus, but it is, as Marissa says, a way to broaden search results. These pseudo-synonyms are almost certainly generated algorithmically (possibly from clickthrough data), but again they have absolutely nothing to do with LSI. In any case, as Dr. E. Garcia explains, LSI is far from being a synonym discovery technique (LSI Keyword Research and Co-Occurrence Theory).

The third fallacious argument involves a belief that a raft of recent Google patents 'proves' that Google is using LSI. The patents in question are: Multiple index based information retrieval system, Phrase-based searching in an information retrieval system, Phrase-based indexing in an information retrieval system, Phrase-based generation of document descriptions, Phrase identification in an information retrieval system, and Detecting spam documents in a phrase based information retrieval system. These patents contain some very interesting concepts and are required reading for the professional SEO. They are, however, only filed patents, and this does not mean that all or any of the ideas in them have been implemented. They should be studied for an indication of what search engineers are thinking about and which components (if any) may be implemented now or in the future.

The overall concept in these patents involves indexing documents (pages) according to the phrases they contain, with each potential phrase classified as either a good phrase or a bad phrase. Good phrases are defined as "phrases that tend to occur in more than certain percentage of documents in the document collection and/or are indicated as having a distinguished appearance in such documents, such as delimited by markup tags or other morphological, format, or grammatical markers. Another aspect of good phrases is that they are predictive of other good phrases, and are not merely sequences of words that appear in the lexicon". Bad phrases are defined as those "…lacking in predictive power". When a user types in a query, any phrases present in the query are used to search the index, and ranked results are returned according to the phrases contained in each document. This is a gross oversimplification :) but explaining the details is not the point here.
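To make the "predictive of other good phrases" idea concrete, here is a toy sketch of my own construction (the patents describe an information-gain measure over a real collection; the formula below is a simple stand-in, not theirs): a candidate phrase is kept as "good" if some other candidate phrase co-occurs with it noticeably more often than that phrase's base rate would predict.

```python
from collections import defaultdict
from itertools import combinations

# Toy document collection: each document reduced to its set of
# candidate phrases (invented examples for illustration).
docs = [
    {"latent semantic", "singular value", "term matrix"},
    {"latent semantic", "singular value"},
    {"latent semantic", "term matrix"},
    {"click here", "latent semantic"},
    {"click here"},
]

n_docs = len(docs)
doc_freq = defaultdict(int)   # documents containing each phrase
co_freq = defaultdict(int)    # documents containing each phrase pair
for d in docs:
    for p in d:
        doc_freq[p] += 1
    for pair in combinations(sorted(d), 2):
        co_freq[pair] += 1

def predictiveness(phrase):
    """Best ratio of actual to expected co-occurrence with any other
    phrase (a crude stand-in for the patents' information-gain test)."""
    scores = []
    for other in doc_freq:
        if other == phrase:
            continue
        pair = tuple(sorted((phrase, other)))
        actual = co_freq.get(pair, 0) / doc_freq[phrase]
        expected = doc_freq[other] / n_docs
        scores.append(actual / expected)
    return max(scores, default=0.0)

for p in sorted(doc_freq):
    label = "good" if predictiveness(p) > 1.0 else "bad"
    print(f"{p!r}: {label}")
```

Under this crude test, "click here" comes out bad: it appears often, but it tells you nothing about which other phrases will appear alongside it.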

The confusion between these patents and LSI arises because, as part of the indexing process, the proposed algorithm maintains a co-occurrence matrix of good phrases, and this is mistaken for the term-document matrix used in LSI. The co-occurrence matrix of good phrases is not only a different kind of matrix, it is much smaller, and it is not decomposed by SVD as in LSI.
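A back-of-the-envelope comparison makes the size difference vivid. The figures below are assumptions for illustration only; neither the patents nor Google publish such numbers:

```python
# Illustrative sizes only (assumed, not taken from the patents):
# a web-scale term-document matrix versus a square co-occurrence
# matrix over the much smaller filtered set of good phrases.
n_terms = 1_000_000        # distinct indexed terms (LSI matrix rows)
n_docs = 10_000_000        # documents (LSI matrix columns)
n_good_phrases = 50_000    # good phrases surviving the filtering step

term_doc_entries = n_terms * n_docs
cooccur_entries = n_good_phrases ** 2

print(f"term-document matrix: {term_doc_entries:.1e} entries")
print(f"phrase co-occurrence: {cooccur_entries:.1e} entries")
print(f"the co-occurrence matrix is "
      f"{term_doc_entries // cooccur_entries}x smaller")
```

And the shapes differ too: the LSI input is terms x documents, while the phrase matrix is phrases x phrases, built incrementally during indexing rather than factorized after the fact.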

So what's the bottom line on the LSI myth? If you hear or read an SEO talking about the importance of LSI in search engine optimization, you can be sure they haven't a clue what they are talking about, and you should simply follow the advice for good copy from a previous post.

Those who have got this far may be wondering what use LSI is if the search engines don't use it. LSI does in fact have quite a few practical applications, and here are some examples to satisfy the curious: Pacific Metrics Corporation is using the Content Analyst Company LSI patents for automated essay scoring, the analysis of legal documents, and creating document summaries for academic funding applications.

May 11, 2007

Professor Michael Berry, head of the Department of Computer Science at the University of Tennessee, wrote to me as follows: "Just for the record, LSI has been used to index on the order of 10 million documents using out-of-core SVD based techniques so you could apply it to subdomains of the Web but the entire Web would be problematic as you point out". He also recommended an all-inclusive reference book now available on LSA: Handbook of Latent Semantic Analysis, T. K. Landauer, D. S. McNamara, S. Dennis, and W. Kintsch (Eds.), Lawrence Erlbaum Associates (2007). Thank you, Dr. Berry.

1 Comment

  1. llimllib said,

    June 4, 2007 @ 2:09 pm

    I just want to write in to mention that the paper which introduced LSI is a very worthwhile, and surprisingly readable, source.

