"Welcome to this free resource for site owners and small businesses. If this is your first visit you may find it helpful to read these two posts; About SEO Blog and Using SEO Blog. To keep up to date you should subscribe to the RSS feed and you are welcome to ask questions or to make a comment but you must register first. Thank you and may your business prosper". - Michael Duz

Paid Links

“Now, Warwick, tell me, even upon thy conscience, is Edward your true king? For I were loath to link with him that were not lawful chosen.” William Shakespeare, Henry VI, Part 3, Act 3, Scene 3.

To buy or not to buy, that is the question.

More than a year ago the head of Google’s webspam team, Matt Cutts, was advising that “…if you sell links, you should mark them with the nofollow tag.” A more recent post has set alarm bells ringing for those who buy or sell links. The post in question details how to report any sites you find that are selling or buying links, and Matt explains that these external reports will be used to test some new techniques in algorithmic paid-link detection.

So why is Google so keen on detecting paid links, you might ask? Look no further than Google’s Corporate Information, Philosophy page: “Google works because it relies on the millions of individuals posting links on websites to help determine which other sites offer content of value. Google assesses the importance of every web page using a variety of techniques, including its patented PageRank algorithm which analyzes which sites have been “voted” the best sources of information by other pages across the web”. It is hardly surprising, then, that Google views paid links as ‘paid votes’ that are likely to introduce bias into its PageRank algorithm.

The Problem.

With the introduction of PageRank (originating from Larry Page and Sergey Brin’s 1997 paper) Google created a new commodity: links that improve ranking. Economists from Karl Marx to Milton Friedman have recognized that for every commodity there will always be a market, and so the buying and selling of links has become an industry. Text link brokers have been making hay while the sun shines, and Google now feels that it needs to get on top of this problem before the PageRank component of its algorithm breaks. Algorithmically detecting and then discounting paid links is one approach, hence Matt Cutts’ request for data.

Google’s Solution.

As well as improving the detection of paid links, Google’s solution includes extending the use of nofollow from its original conception as “…an easy way for a website to tell search engines that the website can’t or doesn’t want to vouch for a link” to a “…machine-readable disclosure for paid links…” (both quotes from Matt Cutts). It appears likely that once Google encounters a paid link without a nofollow, at the very least the link will be discounted.
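For what it’s worth, the disclosure itself is trivial to check for. Below is a minimal sketch of my own (not any tool Google provides) showing how a site owner might audit a page for outbound links that lack rel="nofollow"; the sample HTML and class name are invented purely for illustration.

```python
from html.parser import HTMLParser

class NofollowAudit(HTMLParser):
    """Collect outbound links that are not marked rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.unmarked = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        rel = (attrs.get("rel") or "").lower()
        if href.startswith("http") and "nofollow" not in rel:
            self.unmarked.append(href)

# Hypothetical sponsored paragraph: the first link is disclosed, the second is not.
page = ('<p>Sponsored: <a href="http://example.com/a" rel="nofollow">widgets</a> '
        'and <a href="http://example.com/b">gadgets</a></p>')

audit = NofollowAudit()
audit.feed(page)
print(audit.unmarked)  # ['http://example.com/b']
```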

What should you do?

The obvious course of action is to buy links only for traffic and make sure they are nofollowed, or, if you are selling links, to make sure they too are all nofollowed. However, you can be sure that no professional SEO will be signing up exclusively to this approach. Paid links are too important a tool in SEO to be given up on Google’s say-so, especially while Google is still in the process of creating an improved detection algorithm. So my advice, if you buy links, is:

  • Go into stealth mode if you aren’t in it already.
  • Don’t buy links that are advertised or from a broker.
  • Approach site owners directly by telephone.
  • Check the site to make sure it would pass a human inspection for paid links.
  • Make sure your link is embedded in content and that it is relevant content.
  • Make sure the link points to relevant content on your website.
  • Don’t buy home page links.

If you employ an SEO, or are about to, make sure that they have a clearly defined policy on buying links based on the above. If you don’t want links purchased for your site, make sure that your SEO knows your position on the subject.

June 7, 2007
Added
Google has provided guidelines in its Webmaster Help Center titled Why should I report paid links to Google?

June 12, 2007
Added
Google has put up a Paid Links Reporting Form on Webmaster Tools.


December 1, 2007
Added
Google has simultaneously published two important posts on paid links in a concerted effort to draw a line in the sand:

On Google Webmaster Central Blog - Information about buying and selling links that pass PageRank

On Matt Cutts’ blog - Selling links that pass PageRank

December 30, 2007
Added
Ted Murphy of Izea (formerly PayPerPost) has published part of an email he received from Matt Cutts: “Google (and probably all search engines) will consider all links in a paid post to be paid”. (Emphasis mine.)


What is Latent Semantic Indexing (LSI)?

In this post I will try to explain Latent Semantic Indexing (LSI) in simple terms and without the college-level math that is usually required. In a follow-up post I will explain why LSI is not used by search engines.

Forget for a moment how search engines like Google rank pages in their search results and let’s take a look at a possible method of indexing and retrieving all the pages relevant to the user’s query before the ranking algorithm is applied.

An obvious method of retrieving relevant pages is to match the terms of a search query with the same text found in all web pages. However, the problem with simple text (lexical) matching methods is that they are inherently inaccurate. There are many ways for a user to express a given concept using different words (synonymy), and most words have multiple meanings (polysemy). Synonymy means that the user’s query may not actually match the text on relevant pages, so those pages are overlooked; polysemy means that the terms in a user’s query will often match terms in irrelevant pages.

LSI is an attempt to overcome this problem by looking at patterns of word distribution across the whole of the web. It considers pages that have many words in common to be close in meaning (semantically close) and pages with few words in common to be semantically distant. The result is an LSI-indexed database with similarity values calculated for every content word and phrase.

In response to a query an LSI indexed database will return the pages it thinks will best fit the search terms. The LSI algorithm doesn’t understand anything about what the words mean and does not require an exact match to return useful results.

Before we look at how LSI is achieved let’s refresh our knowledge of some high school math, in particular Cartesian coordinates.

If you wanted to describe the exact location of the telephone on your desk, you might say that it was 10ft from one wall of the room, 5ft from another and 2.4ft from the ground. In general, you can describe the location of anything in three-dimensional space with just three numerical values, x, y and z, as in Figure 1. Alternatively the position can be specified by a position vector r, which is expressed in terms of the coordinate values.

Now let’s look at how we might plot the position of a web page. Let’s imagine that on the page we choose to plot, the words fertilizer, broadleaf and turf occur a specific number of times. We could represent the position of the page in fertilizer-broadleaf-turf space with the vector r as in Figure 2.

We could also plot the position of every page that contains these words. This is called a ‘term space’ and would look something like Figure 3. Each page forms a vector in that space, and how many times each of the three keywords appears on the page determines the vector’s direction and magnitude.
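To make the term-space idea concrete, here is a toy sketch with made-up counts (my own example, not part of any search engine’s code): each page becomes a vector of its fertilizer, broadleaf and turf counts, and the angle between two vectors gives a crude measure of how similar their word usage is.

```python
import numpy as np

# Axes of our 3-dimensional term space: (fertilizer, broadleaf, turf).
# Made-up occurrence counts for three pages.
page_a = np.array([4, 1, 3])   # a lawn-care article
page_b = np.array([5, 0, 2])   # a fertilizer product page
page_c = np.array([0, 6, 1])   # a weed-identification page

# Cosine similarity: the closer two vectors point in the same direction,
# the more alike the pages' use of the three words is.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(page_a, page_b))  # close to 1: similar word usage
print(cosine(page_a, page_c))  # noticeably smaller: less similar
```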

Because we have used three words we have a 3-dimensional term space that is easy to visualize. If we wanted to plot the occurrence of a fourth or fifth word we would need a 4-dimensional or 5-dimensional term space, which is definitely not easy to visualize. If we wanted to represent every word and every page we might end up with millions of dimensions! Although we cannot imagine such a huge multi-dimensional term space, we can represent it with a very large but simple grid as in Figure 6. Here we are assuming that we are looking at every web page in existence, which, as we shall see later, is not practical.

[Figure 6: the term-document matrix grid]

All the different words found, of which there will be millions, are listed down the first column as word 1, word 2, word 3, and so on until we get to the final word, word n. Then every page found, of which there will be billions, gets its own column: p1, p2, p3, and so on until we get to the final page, pn. If a page contains a particular word (for example, p2 contains word 4 in Figure 6) we put a one in the appropriate position on the grid; otherwise we put a zero to indicate the word’s absence. This kind of grid is called a ‘term-document matrix’. Not only is it extraordinarily large, it will also contain many, many times more zeros than ones.
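At toy scale the grid is easy to build. The sketch below (invented pages and words, purely for illustration) constructs a small binary term-document matrix with words as rows and pages as columns.

```python
import numpy as np

words = ["fertilizer", "broadleaf", "turf", "mower"]   # word 1 .. word n
pages = {                                              # p1 .. pn (made-up content)
    "p1": {"fertilizer", "turf"},
    "p2": {"broadleaf", "turf", "mower"},
    "p3": {"fertilizer"},
    "p4": {"turf"},
}

# Binary term-document matrix: 1 if the word occurs on the page, else 0.
A = np.array([[1 if w in pages[p] else 0 for p in pages] for w in words])
print(A)
# [[1 0 1 0]
#  [0 1 0 0]
#  [1 1 0 1]
#  [0 1 0 0]]
```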

Typically a term-document matrix is created from pages that have been pre-processed so that only words likely to carry semantic meaning remain. First, all formatting, including capitalization, punctuation and extraneous markup, is removed. Then prepositions, conjunctions, common verbs, pronouns, articles and common adjectives are removed. Lastly, common endings are stripped from words, leaving just the basic root form (a process called stemming).
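A crude version of that pre-processing might look like the following sketch. The stop-word list is tiny and the suffix-stripping “stemmer” is deliberately naive; a real system would use something like the Porter stemmer, so treat this purely as an illustration.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "it"}
SUFFIXES = ("ing", "ed", "es", "s")   # naive stemming, for illustration only

def stem(word):
    # Strip the first matching common ending, leaving a rough root form.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Strip markup and punctuation, lower-case everything.
    text = re.sub(r"<[^>]+>", " ", text).lower()
    tokens = re.findall(r"[a-z]+", text)
    # Drop stop words, then reduce the rest to root forms.
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>The fertilizers are spread over the turf.</p>"))
# ['fertilizer', 'spread', 'over', 'turf']
```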

So we now have a valid, nice-looking term-document matrix. What next? You may have noticed that we have not yet taken into account the number of times a particular word appears on a page. This is achieved by applying a ‘local weighting factor’, so that words that appear many times on a page are given greater weight than words that appear only once.

In addition a ‘global weighting factor’ is applied so that words that appear in a small number of pages are given more weight than words that occur widely across all the pages. This is done on the basis that these words are likely to be more significant.

Another step in weighting is called normalization. This is required to put large pages on a level playing field with smaller pages and to remove any bias as a result of page size.

The application of weighting factors is called term weighting. The three processes just described are common, but they are not the only weighting scheme that can be used, and the actual values of the factors, and the ways in which they are calculated and applied, vary. The basic idea remains the same: to calculate a more useful and valid term-document matrix from the initial simplistic one, in preparation for the next stage.
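As one concrete example (and only one of many possible schemes, not necessarily the one the figures assume), the sketch below applies a logarithmic local weight, an inverse-document-frequency style global weight, and unit-length normalization of each page’s column to a small matrix of made-up counts.

```python
import numpy as np

# Raw counts: rows = words, columns = pages (made-up numbers).
counts = np.array([
    [3, 0, 1, 0],
    [0, 2, 0, 0],
    [4, 1, 0, 2],
    [0, 1, 0, 1],
], dtype=float)

n_pages = counts.shape[1]

# Local weight: damp repeated occurrences on the same page.
local = np.log1p(counts)                       # log(1 + count)

# Global weight: rarer words across the collection count for more (IDF-style).
df = np.count_nonzero(counts, axis=1)          # number of pages containing each word
global_w = np.log(n_pages / df)

weighted = local * global_w[:, np.newaxis]

# Normalization: scale each page's column to unit length so long pages
# don't dominate short ones.
norms = np.linalg.norm(weighted, axis=0)
norms[norms == 0] = 1.0
weighted = weighted / norms

print(np.round(weighted, 2))
```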

After term weighting, our term-document matrix might look like Figure 7. Notice that only the non-zero values will have changed as a result of term weighting.

[Figure 7: the term-document matrix after term weighting]

By now you are probably wondering what the term-document matrix has to do with our earlier discussion of multi-dimensional term space. In fact each column in the term-document matrix can be thought of as a list of coordinates giving the exact position of a single page in a multi-dimensional term space. We need to think of it this way because the next stage is to project this large multi-dimensional space into a much smaller one. When this is done, words that are semantically similar get squeezed together, and it is this step that is the heart of LSI.

Imagine that you are in a football stadium watching your favorite team play. Suddenly, 50ft above the center of the field, there appears a 3-dimensional term space populated as in our original Figure 3. You have your camera with you and take a picture. So do lots of other people around the stadium; even the news reporter in an overhead helicopter manages to take a picture. When all these pictures are printed they will all be different. For example, yours might look like Figure 4, while from another position in the stadium or from the helicopter it might look like Figure 5.

In fact every picture will be different. Although there are an infinite number of positions from which a photograph can be taken, there will always be at least one position that is superior, in the sense that the printed picture contains more information than the others; for example, fewer points are obscured from view by other points.

The process of transferring data from a higher dimension (the 3-dimensional space that appears above the field) to a lower dimension (the 2-dimensional printed picture) is called mapping. Retaining as much information as possible in the process is called ‘optimal mapping’.

This is exactly what happens in the next stage of our LSI process. The term-document matrix is optimally mapped into a smaller number of dimensions while keeping as much information as possible. When this happens information is lost and content words are superimposed on one another. It transpires that what is actually lost is the noise from our original term-document matrix and this reveals similarities that are latent within the pages.

One of the algorithms that can perform this task is called Singular Value Decomposition (SVD), and Figure 8 shows how our term-document matrix might look after it has been applied. (It is worth mentioning at this point that Bell Communications Research was granted a patent for LSI using SVD in 1989, which is now owned by Content Analyst Company, LLC.)

[Figure 8: the term-document matrix after SVD]
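With a numerical library the dimension-reduction step can be sketched in a few lines: decompose the weighted matrix, keep only the k largest singular values, and multiply back to get a reduced-rank matrix like the one in Figure 8. The matrix values and the choice of k below are my own, purely for illustration.

```python
import numpy as np

# Weighted term-document matrix from the previous step (made-up values).
A = np.array([
    [0.9, 0.0, 0.8, 0.0],
    [0.0, 0.7, 0.0, 0.1],
    [0.8, 0.1, 0.7, 0.2],
    [0.0, 0.6, 0.1, 0.0],
])

# Singular value decomposition: A = U * diag(s) * Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest "concepts" and rebuild the matrix.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(A_k, 2))
# The rank-k matrix typically has few exact zeros, and entries can even go
# slightly negative, reflecting how terms and pages are pulled together or
# pushed apart semantically.
```

In practice systems keep the factored form (U, s, Vt truncated to k dimensions) rather than rebuilding the full matrix, since the factors are far smaller.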

There are two interesting features in the processed data:

Firstly, the matrix contains far fewer zero values, and each page has a similarity value for nearly all the content words. Secondly, some of the similarity values are negative. In our original term-document matrix this would correspond to a page with fewer than zero occurrences of a word, which is impossible. What it means in the processed matrix is that the more negative the value, the greater the semantic distance between a term and a page; conversely, the more positive the value, the more semantically related they are.

This finished matrix is what we would use to actually perform a search, and it would work like this: we take the terms in the search query, look up the value for each search term/page combination, calculate a cumulative score for every page, and then rank the pages by that score. The score is the measure of each page’s similarity to the search query. Of course we don’t want to rank every page, so in practice there would be a threshold value to act as a cutoff between relevant and irrelevant pages.
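Continuing the toy example, the lookup-and-sum step might look like the sketch below; the matrix values and the threshold are invented for illustration.

```python
import numpy as np

words = ["fertilizer", "broadleaf", "turf", "mower"]
pages = ["p1", "p2", "p3", "p4"]

# Reduced-rank similarity matrix from the SVD step (made-up values).
A_k = np.array([
    [0.85,  0.05, 0.82, -0.03],
    [-0.02, 0.68, 0.02,  0.09],
    [0.80,  0.12, 0.74,  0.15],
    [0.03,  0.58, 0.06,  0.02],
])

def search(query_terms, threshold=0.5):
    rows = [words.index(t) for t in query_terms if t in words]
    # Cumulative score per page: sum the term/page values for the query terms.
    scores = A_k[rows, :].sum(axis=0)
    ranked = sorted(zip(pages, scores), key=lambda x: -x[1])
    # Keep only pages above the relevance cutoff.
    return [(p, round(float(s), 2)) for p, s in ranked if s >= threshold]

print(search(["fertilizer", "turf"]))
# [('p1', 1.65), ('p3', 1.56)]
```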

Although the pages that have been selected are semantically related and ranked according to their similarity, this is obviously not sufficient by itself to present to the user. In practice a search engine will use hundreds of variables to determine relevancy, and a page’s position in the LSI space could simply be one of them (but it isn’t!).

One final note to the above explanation: it transpires that although the SVD algorithm does a reasonable job, it is computationally inefficient and consumes impossibly large amounts of processing power, so much so that it cannot be used on a data set as large as the web. However, there are ways of reducing the size of the initial term-document matrix, for example by first partitioning the data set into a number of smaller partitions having similar ‘concept domains’, as in this recent Telcordia Technologies patent.

There are also other algorithms besides SVD that do a similar job or can help speed up the process, for example Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems, Non-Negative Matrix Factorization (NMF) and ULV Decomposition (ULVD). As you can imagine, these are areas that are actively researched by mathematicians and information retrieval specialists.

The explanation above should give you a good idea of what is involved in LSI, and in a follow-up post I will explore the myths surrounding it.


Yahoo User Interface Library

I am continually surprised by how many otherwise good web designers and developers are unaware of the Yahoo! User Interface Library. I am hoping that this post will help more of them discover this useful resource. If you are a site owner you may want to pass the information on to your designer. Who knows, they may thank you for it. :)

The Yahoo! User Interface Library (YUI) is a collection of JavaScript and CSS resources that make it easier to build interactive applications in web browsers. Some, like the Event Utility, simply make in-browser programming easier, while others, like the Menu family of components, make it a snap to add fly-out menus, customized context menus, or application-style menu bars to your website or web application.

Not only is the YUI Library the same high-quality code that Yahoo uses on its own web properties, it is also free for both commercial and non-profit use (subject to minor restrictions).

An additional bonus: if you’re using YUI for your own project, Yahoo offers free hosting for YUI components, both JavaScript and CSS, gzipped and with good cache control, served from their state-of-the-art network.

Support is provided through a Yahoo! User Interface Library Group and there is a YUI Blog for announcements.

I played around with the DataTable control, which provides a powerful API to display screen-reader-accessible tabular data on a web page with sortable columns.

This is what I was able to produce in 15 minutes.


Social Media Marketing - the first step

“Communities can build amazing things, but you have to be part of that community and you can’t abuse them.” Jimmy Wales (co-founder of Wikipedia), keynote speech, South by Southwest Interactive 2006.

Social Media Optimization (SMO), or, as it is more appropriately named in my view, Social Media Marketing (SMM), is becoming mainstream.

Social media is a broad term for the many applications that allow individuals to communicate with one another and to track events across the web in real time, such as Digg, MySpace and YouTube; social networking refers to the exchange of interpersonal information through these websites.

Using social media as a marketing tool is rapidly becoming more popular as companies realize that they need to get in front of their prospective customers wherever they may congregate online.

The most important thing to remember is the quote from Jimmy Wales above: you must become part of the community before you can effectively use the community.

If you haven’t already, you should start to familiarize yourself with the SMM space and prepare for the future. To that end, I suggest you spend time on the sites below to see which may be most suitable for marketing your company’s products or services.

When you have a shortlist, designate someone in your company (or do it yourself) to build a reputation within the chosen communities. Then (and only then) will your future SMM efforts be consistently rewarded.

One tip to start with: choose a username that is associated with your company, and if you are active in more than one community use the same username in all of them.

Just click on a logo to visit the site; there are 55 of them.

Social Shopping

ThisNext is a communal consumer site where users talk about the latest must-have items: gadgets, fashion, products for the home and more, all chosen by the ThisNext community.

Crowdstorm users recommend products and write comments about them, and the best items bubble up to the top.

Kaboodle When users see an item on the web that interests them, they hit a bookmarklet and Kaboodle automatically grabs the image and relevant information from the page. Users can also explore what others are bookmarking and comment on their choices.

Stylehive The Stylehive is a collaborative shopping community where contributors work together to share and discover the hottest stores, designers, trends, and must-have products.

StyleFeeder is a way to find, share and keep track of shopping stuff online using visual bookmarks.



If you are too stupid to use a computer, you might try giving SEO advice.

Now that everybody and their uncle is an expert in search engine optimization, you get to read some really off-the-wall advice. I have collected a few gems that had me laughing and reproduced them below. Most of these are from Yahoo Answers and Live QnA, but you can find answers like these almost everywhere.

These are all from the last few months. I have corrected the spelling and grammar and replaced any links with <url>.

Q. What is cloaking in SEO?
A. Cloaking is the speed process for your cpu. You can over cloak your cpu but be aware of the overheating, the biggest problem of over cloaking is that it overheats the cpu.

Q. Seo stuff, what are some basic pointers?
A. My friend I have the perfect website for you. It includes over 1000 links to free advertising websites including free directories, free search engine submissions, free viral marketing, free top keywords and much, much more <url> also investing in the big daddy search engine and program hoppers will prove very profitable.

Q. If your site is a cooking site but you add your site as a link to sports related forums, would your site be penalized or banned from the search engines?
A. It is legal to point from one topic to other, you will just get a lower rank but it is better than nothing.

Q. Can someone explain how sub-domains can impact search engine optimization?
A. If you want top ranking in MSN just spam with sub-domains.

Q. Why does page rank fall?
A. The number of hits you get and the frequency that your site is updated changes your ratings.

Q. What does “omitted results” mean e.g. “In order to show you the most relevant results, we have omitted some entries very similar to the 103 already displayed”?
A. It means that there are search results that have been left out.

Q. What is SEO?
A. South East Organization.

Q. Can an expert tell me why Google is not updating our website?
A. No it doesn’t update, you have to resubmit it.

Q. Why are SEO Consultants too expensive for webmasters?
A. I personally used a firm that did a 250,000 site submissions for my site, it worked great.

Q. What are SEO and SEM and how do they differ?
A. SEO typically includes keyword research, density balancing, tagging, linking strategy and website submission. With SEM you can buy advertising to get to the top.

Q. Can any of you SEO experts recommend a good link exchange site that works?
A. I’ve found 45 different link exchange services and the two that I think work the best are <url> and <url>.

Q. Can anyone suggest any SEO tips for my website?
A. I checked your website and one thing is missing, a Visitor’s Guest Book where visitors can insert a message describing their business and website.

Q. What are some SEO tips?
A. The most important item is meta tags, very good keywords and keyword density.

Q. What is the best way to get a website to appear on the first page of search engines, can I do it myself?
A. My experience is that trying to do it yourself does not give the best results. My website has been online for about 5 years now and on average I get about 11 unique hits per day and my meta tags are in order.

Q. How do you get removed from a search result?
A. You don’t. Whatever embarrassing thing you may have committed in a public forum is indelibly etched.

Q. How do I add my website to search engines like Google, Yahoo, AOL?
A. From my understanding, your website will pop up in search engines based on how many people visit your website but how people visit your website if it’s not in a search engine, ironic right?

