In this post I will explain Latent Semantic Indexing (LSI) in simple terms, without the college-degree math that is usually required. In a follow-up post I will explain why LSI is not used by search engines.
Forget for a moment how search engines like Google rank pages in their search results and let’s take a look at a possible method of indexing and retrieving all the pages relevant to the user’s query before the ranking algorithm is applied.
An obvious method of retrieving relevant pages is to match the terms of a search query against the same text found in web pages. The problem with simple text (lexical) matching, however, is that it is inherently inaccurate: there are many ways for a user to express a given concept using different words (synonymy), and most words have multiple meanings (polysemy). Synonymy means the user's query may not actually match the text on relevant pages, so those pages are overlooked; polysemy means the terms in a user's query will often match terms on irrelevant pages.
LSI is an attempt to overcome these problems by looking at patterns of word distribution across the whole of the web. It considers pages that have many words in common to be close in meaning (semantically close) and pages with few words in common to be semantically distant. The result is an LSI-indexed database with similarity values calculated for every content word and phrase.
In response to a query an LSI indexed database will return the pages it thinks will best fit the search terms. The LSI algorithm doesn’t understand anything about what the words mean and does not require an exact match to return useful results.
Before we look at how LSI is achieved let’s refresh our knowledge of some high school math, in particular Cartesian coordinates.
If you wanted to describe the exact location of the telephone on your desk you might say that it was 10ft from one wall of the room, 5ft from another and 2.4ft from the ground. In general you can describe the location of anything in three dimensional space with just three numerical values x, y and z as in Figure 1. Alternatively the position can be specified by a position vector r which is expressed in terms of the coordinate values.
Now let’s look at how we might plot the position of a web page. Let’s imagine that on the page we choose to plot, the words fertilizer, broadleaf and turf occur a specific number of times. We could represent the position of the page in fertilizer-broadleaf-turf space with the vector r as in Figure 2.
We could also plot the position of every page that contains these words; the resulting space is called a 'term space' and would look something like Figure 3. Each page forms a vector in that space, and the vector's direction and magnitude are determined by how many times the three keywords appear on the page.
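To make this concrete, here is a minimal sketch of how a page's position vector could be computed. The page text and resulting counts are invented for illustration; only the three axis terms come from the article's example.

```python
from collections import Counter

# Hypothetical page text; the three axis terms follow the article's example.
page_text = ("Apply fertilizer to the turf in spring. Broadleaf weeds "
             "respond best when fertilizer reaches the turf roots.")

terms = ["fertilizer", "broadleaf", "turf"]
counts = Counter(page_text.lower().replace(".", "").split())

# The page's position vector r in fertilizer-broadleaf-turf space.
r = [counts[t] for t in terms]
print(r)  # [2, 1, 2]
```

The vector `r` is exactly the triple of coordinate values from Figure 2: one axis per term, one count per axis.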
Because we have used three words we have a 3-dimensional term space that is easy to visualize. If we wanted to plot the occurrence of a fourth or fifth word we would need a 4-dimensional or 5-dimensional term space, which is definitely not easy to visualize. If we wanted to represent every word and every page we might end up with millions of dimensions! Although we cannot imagine such a huge multi-dimensional term space, we can represent it with a very large but simple grid, as in Figure 6. Here we are assuming that we are looking at every web page in existence, which as we shall see later is not practical.
All the different words found, of which there will be millions, are listed down the first column as word 1, word 2, word 3 until we get to the final word, word n. Then all the pages found, of which there will be billions, are listed as separate columns p1, p2, p3 until we get to the final page, pn. If a page contains a particular word (for example p2 contains word 4, as in Figure 6) then we indicate this by putting a one in the appropriate position on the grid; otherwise we put a zero to indicate the word's absence. This kind of grid is called a 'term-document matrix'. Not only is it extraordinarily large, it will also contain many, many times more zeros than ones.
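A toy version of such a grid can be built in a few lines. The three "pages" below are hypothetical stand-ins for the billions of real ones:

```python
# Toy corpus standing in for the web; a real matrix would have millions
# of term rows and billions of page columns.
pages = {
    "p1": "turf fertilizer lawn",
    "p2": "broadleaf weed control",
    "p3": "fertilizer broadleaf turf",
}

# One row per distinct word, sorted for a stable order.
vocab = sorted({w for text in pages.values() for w in text.split()})

# Binary term-document matrix: 1 if the word occurs in the page, else 0.
matrix = [[1 if word in pages[p].split() else 0 for p in pages]
          for word in vocab]

for word, row in zip(vocab, matrix):
    print(f"{word:10s} {row}")
```

Even in this tiny example each row is mostly zeros; at web scale the imbalance is far more extreme.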
Typically a term-document matrix is created from pages that have been pre-processed so that only words likely to carry semantic meaning remain. First, all formatting, including capitalization, punctuation and extraneous markup, is removed. Then prepositions, conjunctions, common verbs, pronouns, articles and common adjectives are removed. Lastly the common endings are stripped from words, leaving just the basic root form (a process called stemming).
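A rough sketch of this pre-processing might look as follows. The stopword list is deliberately tiny, and the suffix-stripping rule is a crude stand-in for a real stemmer such as the Porter stemmer:

```python
import re

# A tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in"}

def preprocess(text):
    # Strip formatting: lowercase everything and drop punctuation.
    words = re.findall(r"[a-z]+", text.lower())
    # Drop stopwords (articles, conjunctions, prepositions, ...).
    words = [w for w in words if w not in STOPWORDS]
    # Crude suffix stripping as a stand-in for proper stemming.
    return [re.sub(r"(ing|ers|er|s)$", "", w) for w in words]

print(preprocess("The gardeners are fertilizing the lawns."))
# ['garden', 'fertiliz', 'lawn']
```

Note how "gardeners", "fertilizing" and "lawns" are reduced toward a root form, so different surface forms of the same word land in the same matrix row.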
So we now have a valid term-document matrix; what next? You may have noticed that we have not yet taken into account the number of times a particular word appears on a page. This is achieved by applying a 'local weighting factor' so that words that appear many times on a page are given greater weight than words that appear only once.
In addition a ‘global weighting factor’ is applied so that words that appear in a small number of pages are given more weight than words that occur widely across all the pages. This is done on the basis that these words are likely to be more significant.
Another step in weighting is called normalization. This is required to put large pages on a level playing field with smaller pages and to remove any bias as a result of page size.
The application of weighting factors is called term weighting. The three processes just described form a common weighting scheme, but they are not the only one that can be used, and the actual factor values, and the ways in which they are calculated and applied, vary. However the basic idea remains the same: to calculate a more useful and valid term-document matrix from the initial simplistic one, in preparation for the next stage.
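Putting the three steps together, a minimal and deliberately simplified weighting pass over a toy count matrix might look like this. The log-based local and global factors are just one common choice among many, and the counts are invented:

```python
import math

# Raw term counts: rows are terms, columns are pages (toy numbers).
counts = [
    [2, 0, 3],   # term appearing in 2 of the 3 pages
    [0, 5, 1],   # rarer term
    [1, 1, 1],   # term that appears on every page
]
n_pages = 3

weighted = []
for row in counts:
    df = sum(1 for c in row if c > 0)      # pages containing the term
    global_w = math.log(n_pages / df)      # rarer terms get more weight
    # Local weight: a dampened count, so many occurrences count for
    # more than one, but not proportionally more.
    weighted.append([(1 + math.log(c)) * global_w if c > 0 else 0.0
                     for c in row])

# Normalization: scale each page (column) to unit length so large pages
# are on a level playing field with smaller ones.
norms = [math.sqrt(sum(weighted[i][j] ** 2 for i in range(len(weighted))))
         for j in range(n_pages)]
normalized = [[w / norms[j] if norms[j] else 0.0 for j, w in enumerate(row)]
              for row in weighted]
```

Notice that only the non-zero entries change, and that the row for the word appearing on every page is zeroed out entirely: occurring everywhere tells us nothing about meaning.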
After term weighting, our term-document matrix might look like Figure 7. Notice that only the non-zero values will have changed as a result of term weighting.
By now you are probably wondering what the term-document matrix has to do with our earlier discussion of multi-dimensional term space. In fact each column in the term-document matrix can be thought of as a list of coordinates giving the exact position of a single page in a multi-dimensional term space. We need to think of it this way because the next stage is to project this large multi-dimensional space into a much smaller one. When this is done, words that are semantically similar get squeezed together, and it is this step that is the heart of LSI.
Imagine that you are in a football stadium watching your favorite team play. Suddenly 50ft above the center of the field appears a 3-dimensional term space populated as in our original Figure 3. You have your camera with you and take a picture. So do lots of other people around the stadium, even the news reporter in an overhead helicopter manages to take a picture. When all these pictures are printed off they will all be different. For example yours might look like Figure 4. However from another position in the stadium or from a helicopter it might look like Figure 5.
In fact every picture will be different. Although there are an infinite number of positions from which a photograph can be taken there will always be at least one position that is superior in the sense that the printed picture will contain more information than the others. For example fewer points will be obscured from view by other points.
The process of transferring data from a higher dimension (the 3-dimensional space that appears above the field) to a lower dimension (the 2-dimensional printed picture) is called mapping. Retaining as much information as possible in the process is called ‘optimal mapping’.
This is exactly what happens in the next stage of our LSI process. The term-document matrix is optimally mapped into a smaller number of dimensions while keeping as much information as possible. When this happens information is lost and content words are superimposed on one another. It transpires that what is actually lost is the noise from our original term-document matrix and this reveals similarities that are latent within the pages.
One of the algorithms that can perform this task is called Singular Value Decomposition (SVD) and Figure 8. shows how our term-document matrix might look after it has been applied. (It is worth mentioning at this point that Bell Communications Research were granted a patent for LSI using SVD in 1989 which is now owned by Content Analyst Company, LLC).
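Using NumPy, the decomposition and truncation step can be sketched as follows. The matrix values are invented for illustration; a real term-document matrix would be vastly larger and sparser:

```python
import numpy as np

# A small weighted term-document matrix (terms x pages); values invented.
A = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 1.0],
])

# Full SVD: A = U @ diag(s) @ Vt, singular values s sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: this is the dimensionality
# reduction at the heart of LSI. The discarded dimensions carry the noise.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(A_k, 2))  # dense; some entries may now be small negatives
```

The reduced matrix `A_k` is the best rank-k approximation of `A`: most zeros become small non-zero values, and some entries can go negative, which is exactly the behavior described next.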
There are two interesting features in the processed data:
Firstly, the matrix contains far fewer zero values, and each page has a similarity value for nearly all the content words. Secondly, some of the similarity values are negative. In the original term-document matrix this would correspond to a page with fewer than zero occurrences of a word, which is impossible. In the processed matrix, the more negative the value, the greater the semantic distance between a term and a page; conversely, the more positive the value, the more semantically related they are.
This finished matrix is what we would use to actually perform a search, and it would work like this: we take the terms in the search query, look up the value for each search-term/page combination, calculate a cumulative score for every page, and then rank the pages by that score. This score is the measure of a page's similarity to the search query. Of course we don't want to rank every page, so in practice there would be a threshold value acting as a cutoff between relevant and irrelevant pages.
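That retrieval step can be sketched as follows, using an invented rank-reduced matrix. Note the negative entry (term and page semantically distant) and the threshold cutoff:

```python
import numpy as np

terms = ["fertilizer", "broadleaf", "turf", "football"]
pages = ["p1", "p2", "p3"]

# Invented rank-reduced similarity matrix (terms x pages), as an SVD step
# might produce: dense, with some negative entries.
A_k = np.array([
    [0.9,  0.1, -0.2],
    [0.7,  0.2, -0.1],
    [0.8,  0.3,  0.0],
    [-0.3, 0.1,  0.95],
])

def search(query, threshold=0.2):
    rows = [terms.index(t) for t in query.split() if t in terms]
    # Cumulative score per page: sum the term/page values over query terms.
    scores = A_k[rows].sum(axis=0)
    ranked = sorted(zip(pages, scores), key=lambda x: -x[1])
    # The threshold separates relevant pages from irrelevant ones.
    return [(p, s) for p, s in ranked if s >= threshold]

print(search("fertilizer turf"))
```

Here "fertilizer turf" ranks p1 first and drops p3 below the threshold, while a query for "football" would surface only p3.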
Although the pages that have been selected are semantically related and ranked according to their similarity, this is obviously not sufficient by itself to present to the user. In practice a search engine will use hundreds of variables to determine relevancy, and a page's position in the LSI results could simply be one of them (but it isn't!).
One final note on the above explanation: it transpires that although the SVD algorithm does a reasonable job, it is computationally inefficient and consumes impossibly large amounts of processing power; so much so, in fact, that it cannot be used on a data set as large as the Web. However there are ways of reducing the size of the initial term-document matrix, for example by first partitioning the data set into a number of smaller partitions with similar 'concept domains', as in this recent Telcordia Technologies patent.
There are also other algorithms apart from SVD that do a similar job or can help speed up the process, for example Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems, Non-Negative Matrix Factorization (NMF) and ULV Decomposition (ULVD). As you can imagine, these are areas actively researched by mathematicians and information retrieval specialists.
The explanation above should give you a good idea of what is involved in LSI. In a follow-up post I will explore the myths surrounding LSI.