Archive for News and Comment

What is Latent Semantic Indexing (LSI)?

In this post I will try to explain Latent Semantic Indexing (LSI) in simple terms and without the college degree math that is usually required. In a follow up post I will explain why LSI is not used by search engines.

Forget for a moment how search engines like Google rank pages in their search results and let’s take a look at a possible method of indexing and retrieving all the pages relevant to the user’s query before the ranking algorithm is applied.

An obvious method of retrieving relevant pages is by matching the terms of a search query with the same text found in all web pages. However the problem with simple text (lexical) matching methods is that they are inherently inaccurate. This is because there are many ways for a user to express a given concept using different words (synonymy) and also because most words have multiple meanings (polysemy). The problem of synonymy means that the user’s query may not actually match the text on relevant pages so they will be overlooked and the problem of polysemy means that the terms in a user’s query will often match terms in irrelevant pages.

Image searchLSI is an attempt to overcome this problem by looking at patterns of word distribution across the whole of the web. In doing so it considers pages that have many words in common to be close in meaning (semantically close) and pages with a few words in common to be semantically distant. The result is an LSI indexed database with similarity values it has calculated for every content word and phrase.

In response to a query an LSI indexed database will return the pages it thinks will best fit the search terms. The LSI algorithm doesn’t understand anything about what the words mean and does not require an exact match to return useful results.

Before we look at how LSI is achieved let’s refresh our knowledge of some high school math, in particular Cartesian coordinates.

Image searchIf you wanted to describe the exact location of the telephone on your desk you might say that it was 10ft from one wall of the room, 5ft from another and 2.4ft from the ground. In general you can describe the location of anything in three dimensional space with just three numerical values x, y and z as in Figure 1. Alternatively the position can be specified by a position vector r which is expressed in terms of the coordinate values.

Now let’s look at how we might plot the position of a web page. Let’s imagine that on the page we choose to plot, the words fertilizer, broadleaf and turf occur a specific number of times. We could represent the position of the page in fertilizer-broadleaf-turf space with the vector r as in Figure 2.

Image searchWe could also plot the position of every page that contained these words which is called a ‘term space’ and would look something like Figure 3. Each page forms a vector in that space and the vector’s direction and magnitude determine how many times the three keywords appear in it.

Because we have used three words we have a 3-dimensional term space that is easy to visualize. If we wanted to plot the occurrence of a fourth or fifth word we would need a 4-dimensionl or 5-dimensional term space which is definitely not easy to visualize. If we wanted to represent every word and every page we might end up with millions of dimensions! Although we cannot imagine such a huge multi-dimensional term space we can represent it with a very large but simple grid as in Figure 6. Here we are assuming that we are looking at every web page in existence, which as we shall see later is not practical.

Image search

All the different words found, of which there will be millions, are listed down the first column as word 1, word 2, word 3 until we get to the final word, word n. Then all the pages found, of which there will be billions, are listed as a separate column p1, p2, p3 until we get to the final page, pn. If a page contains a particular word (for example p2 contains word 4 as in Figure 6.) then we indicate this by putting a one in the appropriate position on the grid. Otherwise we put a zero to indicate the words absence. This kind of grid is called a ‘term-document matrix’. Not only is it extraordinarily large the other thing to note is that it will contain many, many times more zeros than ones.

Typically a term-document matrix is created from pages that have been pre-processed so only words that are likely to have semantic meaning remain. Firstly all formatting from the pages including capitalization, punctuation and extraneous markup are removed. Then prepositions, conjunctions, common verbs, pronouns, articles and common adjectives are also removed. Lastly the common endings are removed from words leaving just the basic root form (this process is called stemming).

So we now have a valid, nice-looking term-document matrix, what next? You may have noticed that we have not yet taken into account the number of times a particular word appears on the page. This is achieved by applying a ‘local weighting factor’ so that words that appear many times on a page are given a greater weight than words that appear only once.

In addition a ‘global weighting factor’ is applied so that words that appear in a small number of pages are given more weight than words that occur widely across all the pages. This is done on the basis that these words are likely to be more significant.

Another step in weighting is called normalization. This is required to put large pages on a level playing field with smaller pages and to remove any bias as a result of page size.

The application of weighting factors is called term weighting and the three processes just described are common but they are not the only weighting scheme that can be used. Also the actual value of the factors that are used in the weighting and the ways in which they can be calculated and applied are various. However the basic idea remains the same, to calculate a more useful and valid term-document matrix from the initial simplistic term-document matrix, in preparation for the next stage.

After term weighting, our term-document matrix might look like Figure 7. Notice that only the non-zero values will have changed as a result of term weighting.

Image search

By now you are probably thinking what the term-document matrix has to do with our earlier discussion on multi-dimensional term space. In fact each column in our term-document matrix can be thought of as a list of coordinates that give the exact position of a single page in a multi-dimensional term space. We now need to think of it this way because the next stage is to project this large multidimensional space into a much smaller multidimensional space. When this is done words that are semantically similar will get squeezed together and it is this step that is the heart of LSI.

Image searchImagine that you are in a football stadium watching your favorite team play. Suddenly 50ft above the center of the field appears a 3-dimensional term space populated as in our original Figure 3. You have your camera with you and take a picture. So do lots of other people around the stadium, even the news reporter in an overhead helicopter manages to take a picture. When all these pictures are printed off they will all be different. For example yours might look like Figure 4. However from another position in the stadium or from a helicopter it might look like Figure 5.

In fact every picture will be different. Although there are an infinite number of positions from which a photograph can be taken there will always be at least one position that is superior in the sense that the printed picture will contain more information than the others. For example fewer points will be obscured from view by other points.

Image searchThe process of transferring data from a higher dimension (the 3-dimensional space that appears above the field) to a lower dimension (the 2-dimensional printed picture) is called mapping. Retaining as much information as possible in the process is called ‘optimal mapping’.

This is exactly what happens in the next stage of our LSI process. The term-document matrix is optimally mapped into a smaller number of dimensions while keeping as much information as possible. When this happens information is lost and content words are superimposed on one another. It transpires that what is actually lost is the noise from our original term-document matrix and this reveals similarities that are latent within the pages.

One of the algorithms that can perform this task is called Singular Value Decomposition (SVD) and Figure 8. shows how our term-document matrix might look after it has been applied. (It is worth mentioning at this point that Bell Communications Research were granted a patent for LSI using SVD in 1989 which is now owned by Content Analyst Company, LLC).

Image search

There are two interesting features in the processed data:

Firstly the matrix contains far fewer zero values and each page has a similarity value for nearly all the content words. Secondly some of the similarity values are negative. In our original term-document matrix this would correspond to a page with less than zero occurrences of a word, which is impossible. What it means in the processed matrix is that the more negative the value the greater the semantic distance between a term and a page. Conversely the more positive the value the more semantically related they are.

This finished matrix is what we would use to actually perform a search and it would work like this: We take however many terms in the search query and look up the values for each search term/page combination. We calculate a cumulative score for every page and then rank the pages by that score. This will be the measure of the pages similarity to the search query. Of course we don’t want to rank every page so in practice there would be a threshold value to act as a cutoff between relevant and irrelevant pages.

Although the pages that have been selected are semantically related and ranked according to their similarity this is obviously not sufficient by itself to present to the user. In practice a search engine will use hundreds of variables to determine relevancy and a pages position in the LSI could simply be one of them (but it isn’t!).

One final note to the above explanation, it transpires that although the SVD algorithm does a reasonable job it is computationally inefficient and consumes impossibly large amounts of processing power. So much so in fact that it cannot be used on a data set as large as the Web. However there are ways of reducing the size of the initial term-document matrix by, for example, first partitioning the data set into a number of smaller partitions having similar ‘concept domains’, as in this recent Telcordia Technologies patent.

Also there are other algorithms apart from SVD which do a similar job or can help speed up the process, for example Scaling Latent Semantic Indexing for Large Peer­to­Peer Systems, Non-Negative Matrix Factorization (NMF) and ULV Decomposition (ULVD). As you can imagine these are areas that are actively researched by mathematicians and information retrieval specialists.

The explanation above should give you a good idea of what is involved in LSI and in a follow up post I will explore the myths surrounding LSI.

Comments

If you are too stupid to use a computer you might try giving SEO advice.

Now that everybody and their uncle is an expert in search engine optimization you get to read some really off the wall advice. I have collected a few gems that had me laughing and reproduced them below. Most of these are from Yahoo Answers and Live QnA but you can find answers like these almost everywhere.

These are all from the last few months and I have corrected the spelling, grammar and replaced any links with <url>.

Q. What is cloaking in SEO?
A. Cloaking is the speed process for your cpu. You can over cloak your cpu but be aware of the overheating, the biggest problem of over cloaking is that it overheats the cpu.

Q. Seo stuff, what are some basic pointers?
A. My friend I have the perfect website for you. It includes over 1000 links to free advertising websites including free directories, free search engine submissions, free viral marketing, free top keywords and much, much more <url> also investing in the big daddy search engine and program hoppers will prove very profitable.

Q. If your site is a cooking site but you add your site as a link to sports related forums, would your site be penalized or banned from the search engines?
A. It is legal to point from one topic to other, you will just get a lower rank but it is better than nothing.

Q. Can someone explain how sub-domains can impact search engine optimization?
A. If you want top ranking in MSN just spam with sub-domains.

Q. Why does page rank fall?
A. The number of hits you get and the frequency that your site is updated changes your ratings.

Q. What does “omitted results” mean e.g. “In order to show you the most relevant results, we have omitted some entries very similar to the 103 already displayed”?
A. It means that there are search results that have been left out.

Q. What is SEO?
A. South East Organization.

Q. Can an expert tell me why Google is not updating our website?
A. No it doesn’t update, you have to resubmit it.

Q. Why are SEO Consultants too expensive for webmasters?
A. I personally used a firm that did a 250,000 site submissions for my site, it worked great.

Q. What are SEO and SEM and how do they differ?
A. SEO typically includes keyword research, density balancing, tagging, linking strategy and website submission. With SEM you can buy advertising to get to the top.

Q. Can any of you SEO experts recommend a good link exchange site that works?
A. I’ve found 45 different link exchange services and the two that I think work the best are <url> and <url>.

Q. Can anyone suggest any SEO tips for my website?
A. I checked your website and one thing is missing, a Visitor’s Guest Book where visitors can insert a message describing their business and website.

Q. What are some SEO tips?
A. The most important item is meta tags, very good keywords and keyword density.

Q. What is the best way to get a website to appear on the first page of search engines, can I do it myself?
A. My experience is that trying to do it yourself does not give the best results. My website has been online for about 5 years now and on average I get about 11 unique hits per day and my meta tags are in order.

Q. How do you get removed from a search result?
A. You don’t. Whatever embarrassing thing you may have committed in a public forum is indelibly etched.

How do I add my website to search engines like Google, Yahoo, AOL?
A. From my understanding, your website will pop up in search engines based on how many people visit your website but how people visit your website if it’s not in a search engine, ironic right?

Comments (2)

Paid Directories

I have been asked several times in the last few weeks, are paid directories worth the money? In my experience there are only two essential paid directories but depending on your product/service there may be more in the form of niche directories.

  • Yahoo!

    The Yahoo directory has a $299.00 non-refundable, recurring annual fee per submission, and is the only paid directory specifically suggested by Google.

    Find the appropriate category for your site in the Yahoo Directory (normally where your competitors are) and click on the ‘Suggest a Site’ link on that page. You will need a Yahoo! ID to complete the submission but you can sign up during the process. After ticking various declarations and agreeing to the terms of service you will be taken to the submission form itself. Fill in the form following the guidelines for completion making sure that your site description does not exceed 150 characters.

  • Business.com

    Business.com directory has a $199 fee for the first year which is renewable thereafter at $149. Included with the site listing are up to four additional links which allow you to list deep site content. Annual directory inclusion and standard listing account set-up can be found here.

  • Niche Directories

    There are three basic ways to find relevant niche directories for your site.

    1. Look in the Business Directories section of DMOZ.
    2. Enter your competitors sites into Yahoo Site Explorer and look to see if they have links from any niche directories.
    3. Search with Google for your industry keywords plus the word directory
      e.g |”lawn care” directory|.

One last suggestion if you find that a large number of your competitors are using a paid general directory other than Yahoo or Business.com then you may wish to consider joining them.

Comments

Google Custom Search Engine

Google has spawned a whole new generation of search engines with the introduction of its Custom Search Engine (CSE). Without any technical knowledge anyone can now use the Google search platform to create a search engine focused on any content they want.

You can choose which pages or sites you want to include in your custom search engine, how the content should be prioritized, whether others can contribute to the index and what the search results page will look like.

Already an increasing number of people are creating niche custom search engines. For example; Specialized Aquarium and Fish Keeping which searches over a hundred aquarium sites, Medical Libraries Blogs which searches all the blogs of medical libraries and librarians or A Dedicated Information Architecture Search Engine.

There is however an interesting anomaly when building a custom search for which I can find no rational explanation. I built a custom search using the seven SEO sites from the SEO Forums http://www.seo-blog.com/seo-forums.php post. It would appear that the order of entry of the sites makes a difference to the order of the results. Here are partial screen shots of the tests that confirm this:

Order of selected urls for test 1.Order of selected urls for test 2.

 

Serps for test 1.Serps for test 2.

The same seven sites in Test Search Engine 1.0 and Test Search Engine 2.0 are searched for the phrase |external nofollow| and produce different results. The only difference in the build was the order of entry of the sites. Other search phrases (but not all) also show similar differences in the order of results.

Surprising…..

November 2, 2006

A Google Co-op Engineer has shed light on this test result as follows:

Theres actually a very subtle bug going on here, in one of the search engines an extra whitespace got added to one of the URLs so its actually searching over:

forums.digitalpoint.com/%20

Which effectively renders this URL useless.

Thanks for pointing this out, we’ll fix it in the next release by making sure we do a better job of stripping whitespace off the URLs!

That explains it!

Comments

SEO Forums

One of the common questions from those just starting to learn about SEO and SEM is which is the best forum to read and ask questions? This post makes some suggestions and offers some guidance on how to get the most out of forum participation. Also we show how to search the major SEO forums all at once and look at the results of a not too serious test of forum response.

There are seven major SEO and SEM forums and they each have their own characteristics and range of topic headings.

To get the most out of any of these forums you should observe these guidelines:
Read the rest of this entry »

Comments (3)

« Previous entries · Next entries »