Search Engine People Blog

Google and Elephants Never Forget – The Challenge for Google


Some might question Google's commitment to its oft-quoted slogan, Don't Be Evil. Nevertheless, its commitment to its mission is clear: to organize the world's information and make it universally accessible and useful.

Founders Larry Page and Sergey Brin named the search engine they built "Google," a play on the word "googol," the mathematical term for a 1 followed by 100 zeros. The name reflects the immense volume of information that exists, and the scope of Google's mission.

Documenting all knowledge is a very tall order and brings in some hidden factors that do affect how your website may appear in Google searches.  Let us explore these potential problems.

It's A Huge Web

Getting a handle on the current size of the web is not easy. Google itself reported in mid-2008 that its systems had seen one trillion unique URLs:

The first Google index in 1998 already had 26 million pages, and by 2000 the Google index reached the one billion mark. Over the last eight years, we've seen a lot of big numbers about how much content is really out there. Our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!

How fast is that number growing? Again, data is not readily available. However, Jakob Nielsen noted in 2006 that the number of websites was still growing, although growth had eased back to a more mature rate of 25% per year.

As the chart shows, the Web has experienced three growth stages:

1991-1997: Explosive growth, at a rate of 850% per year.
1998-2001: Rapid growth, at a rate of 150% per year.
2002-2006: Maturing growth, at a rate of 25% per year.
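
To get a feel for what these annual rates mean when compounded, here is a minimal Python sketch. The rates are the ones quoted above. The number of compounding years per stage (end year minus start year) is an assumption made for illustration, as is the helper name growth_multiplier.

```python
def growth_multiplier(annual_rate_percent: float, years: int) -> float:
    """Return the factor by which a quantity grows after `years` of compounding."""
    return (1 + annual_rate_percent / 100) ** years


# (stage label, annual growth in percent, assumed number of compounding years)
STAGES = [
    ("1991-1997", 850, 6),
    ("1998-2001", 150, 3),
    ("2002-2006", 25, 4),
]

for label, rate, years in STAGES:
    factor = growth_multiplier(rate, years)
    print(f"{label}: about {factor:,.1f} times larger after {years} years")
```

Note that a growth rate of 850% per year means the quantity is multiplied by 9.5 each year, not by 8.5, which is why even a few years at that rate produce very large factors.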

With smartphones now bringing the mobile Web into everyone's eager hands, it seems very likely that growth has accelerated beyond that 25% figure.

To complete the picture we should mention that there is what is being called the "Invisible Web", a.k.a. the "Deep Web".

The "visible web" is what you can find using general web search engines. It's also what you see in almost all subject directories. The "invisible web" is what you cannot find using these types of tools.  ...  Since 2000, search engines' crawlers and indexing programs have overcome many of the technical barriers that made it impossible for them to find a good proportion of the "invisible" web pages.

It is quite clear that this information space is massive and Google has taken on quite a challenge in attempting to make it universally accessible and useful.

Practical Issues for Google

It is interesting to check out the basics of how Google approaches this herculean task.

There are two main processes involved:

Crawling

Crawling is the process by which Googlebot discovers new and updated pages to be added to the Google index.  We use a huge set of computers to fetch (or "crawl") billions of pages on the web. The program that does the fetching is called Googlebot (also known as a robot, bot, or spider). Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.  ....   New sites, changes to existing sites, and dead links are noted and used to update the Google index.

Indexing

Googlebot processes each of the pages it crawls in order to compile a massive index of all the words it sees and their location on each page. In addition, we process information included in key content tags and attributes, such as Title tags and ALT attributes. Googlebot can process many, but not all, content types. For example, we cannot process the content of some rich media files or dynamic pages.
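
The "index of all the words it sees and their location on each page" described above is usually called an inverted index. The following is only a toy sketch in Python of that general idea, not a description of how Google actually builds its index. The example pages, URLs and the function name build_index are made up for illustration.

```python
import re


def build_index(pages: dict) -> dict:
    """Map each word to {url: [positions of that word on the page]}."""
    index = {}
    for url, text in pages.items():
        # Split the page text into lowercase words and record where each occurs.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index.setdefault(word, {}).setdefault(url, []).append(position)
    return index


# Hypothetical pages standing in for crawled content.
pages = {
    "http://example.com/a": "The web is huge",
    "http://example.com/b": "Google crawls the web",
}
index = build_index(pages)
print(index["web"])
```

Looking up a word then returns every page that contains it, together with the word's positions on each page, which is the raw material a search engine needs before any ranking takes place.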

Clearly the process of getting all these web pages into the index is no small task.  The computing power necessary for this means that Google may own more than 2% of all servers in the world.

Nobody outside Google knows exactly how many servers the company has, but there have been a number of estimates through the years. One of the most quoted ones is from 2006, when it was estimated that Google had approximately 450,000 servers. And that was three years ago.  Another estimate showed up in 2007, this time from the analyst firm Gartner, estimating the number of Google servers to one million.

Undoubtedly that number has grown in line with the overall growth of the Web. The basic description of crawling and indexing seems deceptively straightforward, but this is clearly a massive information system, and its sheer scale limits which operations are feasible. One consequence can cause real problems for webmasters and is not often discussed.

Synchronizing databases

Google maintains a number of versions of its databases for computational efficiency and data security. They are distributed across all those million or more servers. In consequence, synchronizing these databases so that they reflect the most current version of each URL and of the content on those web pages is complex, and involves a number of different processes with varying time cycles.

Once indexed, almost never forgotten

Another side to this is that Google does not quickly drop a URL once its web page is no longer online. Such URLs can stay in the index for a year or more, and this is probably an appropriate way to handle things. Web pages are sometimes unavailable for short periods because of server problems. If a page were expunged from the index after the first visit that failed to find it, every brief server outage could cause major problems.

This persistence can cause real problems, even though rarely visited web pages may be relegated to second-class status in what Google calls the supplemental index. That index undoubtedly still exists, even though Google has not commented on it recently.

Usually such persistent old web pages that have been superseded by more current web pages probably do not represent major problems.  However one never knows how the keyword search algorithms may interrogate the databases and attempt to get the most relevant items for any given keyword search.

Some problem cases

Website owners may well have run into problems caused by this persistence in the Google indexes. Sometimes an older version of a page persists and the newer version becomes visible only after some time, despite all the efforts of the website owner. One recent example came up when using geo-targeting in Google Webmaster Tools: switching the target location for a website may take weeks or even months to have an effect.

Another way in which extra erroneous URLs can be created is where someone has indulged in Camel Humping, the practice of capitalizing certain letters in a URL, for example SearchEnginePeople.com. The different upper- and lower-case versions of a URL are all indexed as independent web pages. Even if the practice is stopped and only a single version of the URL is used everywhere, the camel hump versions will tend to persist.
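
One way to spot camel hump duplicates on your own site is to group the URLs you know about (from a sitemap or server logs, say) by their lowercased form. The sketch below is a minimal Python illustration under that assumption. The URLs and the function names are hypothetical. Bear in mind that host names are not case-sensitive while paths generally are, so it is mixed case in the path that is most likely to produce separately indexed pages.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def lowercase_key(url: str) -> str:
    """Return the URL with scheme, host and path lowercased (query left alone)."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.lower(),
        parts.query,
        parts.fragment,
    ))


def find_case_variants(urls):
    """Group URLs that differ only by letter case; keep groups with duplicates."""
    groups = defaultdict(set)
    for url in urls:
        groups[lowercase_key(url)].add(url)
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


# Hypothetical URLs collected from a site's own logs or sitemap.
known_urls = [
    "http://example.com/About-Us",
    "http://example.com/about-us",
    "http://example.com/contact",
]
variants = find_case_variants(known_urls)
print(variants)
```

Any group the function reports is a candidate for consolidating onto a single, consistently cased URL.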

Ways of fixing problems

Google will likely continue to retain no-longer-valid URLs well beyond their best-before date. However, it does provide mechanisms that attempt to avoid the problems this creates.

The most obvious is the 301 redirect. This is a permanent redirection: visitors never see the old web page and are immediately switched to the new version. Although this might look like a get-out-of-jail-free card, it brings its own penalty, as was mentioned in an interview that Eric Enge had with Matt Cutts.

The penalty is that such a 301 redirect may cause a certain reduction in the PageRank contribution that is passed through to the new web page.
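
For readers who have not seen one, a 301 redirect is simply an HTTP response with status code 301 and a Location header naming the new URL. On most sites it would be configured in the web server rather than written by hand, so treat the following Python sketch, built on the standard library's http.server module, purely as an illustration. The paths in REDIRECTS and the names resolve, RedirectHandler and serve are made up for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping of retired paths to their replacements.
REDIRECTS = {"/old-page.html": "/new-page.html"}


def resolve(path):
    """Return (status_code, location) for a requested path."""
    target = REDIRECTS.get(path)
    if target is None:
        return 404, None
    return 301, target


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, location = resolve(self.path)
        self.send_response(status)
        if location is not None:
            # The Location header tells browsers and crawlers where the page moved.
            self.send_header("Location", location)
        self.end_headers()


def serve(port=8000):
    """Start the demo server on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Calling serve() and then requesting /old-page.html from localhost should return the 301 status with the new location, which is the signal a crawler uses to associate the old URL with the new one.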

The other way to avoid problems with multiple URLs pointing to duplicate copies is to use the rel=canonical link tag. Even here, however, Google says it will be regarded only as a hint rather than a firm directive. It is also uncertain how the other search engines handle it in all cases.
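
If you do use rel=canonical, it is worth checking that each page really declares the URL you intended. The following Python sketch uses the standard library's html.parser to pull the canonical URL out of a page's HTML. The sample markup, the class name CanonicalFinder and the function name find_canonical are assumptions for illustration only.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, so check each one.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the declared canonical URL, or None if the page has none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page markup.
sample = '<html><head><link rel="canonical" href="http://www.example.com/page"></head></html>'
print(find_canonical(sample))
```

Running a check like this across the duplicate versions of a page shows whether they all point at the same preferred URL.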

Avoiding old version web page persistency problems

Given that the wrong web pages may get indexed and persist, what is the careful website owner to do? Whatever is done should apply to the other search engines as well as Google. Some of the others are slow to adopt new practices, so the best advice is to stick with time-proven approaches that work for all the search engines.

Plan twice, load up once

This is just as important as the old carpenter's rule it echoes: measure twice, cut once. Google and the other search engines are relatively quick to find new web pages and add them to their indexes. If on reflection the URLs should be modified, it may then be very difficult to clear out the erroneous ones and have them replaced by the new ones.

You can of course use a sand-box approach by putting new web pages in domain folders that are excluded from search engine consideration by a robots.txt file until you are absolutely sure of the right URL to use. Whether the robots.txt file is always respected is again open to question.
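
Before relying on such a sand-box folder, you can at least confirm that your robots.txt rules say what you think they say. Here is a small Python sketch using the standard library's urllib.robotparser. The /staging/ folder name and the example URLs are assumptions for illustration, and the check only tells you what a well-behaved crawler should do, not what any particular crawler will do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers out of a staging folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, user_agent="*"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_crawlable("http://www.example.com/staging/new-page.html"))
print(is_crawlable("http://www.example.com/index.html"))
```

The first URL sits inside the excluded folder and the second does not, so a compliant crawler should skip the former and fetch the latter.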

As usual, the KISS principle is the most likely approach to succeed.  Website architecture should be as simple and as flat as possible.  The KISS principle should also apply to the URL structures themselves and there should be strong consistency in how these are set out, particularly with respect to such items as www., and upper and lower case letters.

The Bottom Line

The final caution to remember is that search engines do not create URLs.  They merely record and index URLs they find.  If there are problems in what has been indexed, then it undoubtedly is you, the webmaster, who is to blame.  You should never upload a website with the assumption that in a second phase you may change the URLs.  This is trouble waiting to happen.

Careful planning ahead of time is the way to avoid such problems.  Then monitor website performance using Google Webmaster Tools to ensure that what has been created is indeed what you intended.