What Are Web Bots?
A web robot is a program that automatically traverses the web, retrieving documents and following the links it finds from page to page and site to site. Web robots are also known as "Spiders", "Crawlers", "Web Bots", "Search Bots", or just "Bots".
You may also see the term "bot" applied to any program that runs automatically to perform a specific task, or set of tasks. These kinds of bots are often used in chat rooms, on auction sites, as game bots (think online solitaire), and as chatterbots (computer programs designed to emulate conversation, such as IRC bots or the popular AIM bot, FriendBot).
What are Agents?
The term "agent" (or user-agent) is the name an application reports when it accesses a document. Well-behaved search bots identify themselves this way on every request, much like your browser (which is also an application) identifies itself when you visit a website. So if you see a Google agent in your logs, that's Googlebot.
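The user-agent ends up in your server's access log with every request, so you can see exactly which bots are visiting. As an illustration (the log line and function name here are hypothetical examples, not from any specific site), a short Python sketch that pulls the agent out of an Apache combined-format log line:

```python
import re

# An Apache "combined" log line ends with the referrer and the
# user-agent, each wrapped in double quotes. (Sample line is made up.)
LOG_LINE = ('66.249.66.1 - - [10/Oct/2006:13:55:36 -0700] '
            '"GET /index.html HTTP/1.1" 200 2326 "-" '
            '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

def user_agent(line):
    """Return the user-agent string from a combined-format log line."""
    quoted = re.findall(r'"([^"]*)"', line)   # all double-quoted fields
    return quoted[-1] if quoted else None     # the agent is the last one

agent = user_agent(LOG_LINE)
print(agent)
print("Googlebot" in agent)  # True
```

Any bot can claim any name, of course, so the user-agent tells you who a visitor *says* it is, not who it really is.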
What Kinds of Web Robots Are There?
Google has several versions of bots:
- Googlebot - crawls pages for the web and news indexes.
- Googlebot-Mobile - crawls pages for their mobile index.
- Googlebot-Image - crawls pages for their image index.
- Mediapartners-Google - crawls pages to determine AdSense content.
- AdsBot-Google - crawls pages to measure AdWords landing page quality.
With several thousand web bots online today "in the wild", it would be impossible to list them all. So here are some of the most well-known "good" search bots:
- MSNbot - owned by MSN.com
- Ask Jeeves/Teoma - owned by Ask.com
- Architext spider - owned by Excite.com
- FAST-WebCrawler - owned by FAST (AllTheWeb.com)
- Slurp - owned by Inktomi.com
- Yahoo Slurp - owned by Yahoo Web Search
- ia_archiver - owned by Alexa.com
- archive.org_bot - owned by Archive.org
- Scooter - owned by AltaVista.com
- Crawler - owned by Crawler.de
- InfoSeek Sidewinder - owned by InfoSeek.com
- Lycos_Spider_(T-Rex) - owned by Lycos.com
The bad bots are the minions of programmers who are up to no good, usually the illegal kind of no good. A bad bot can also be defined as one that ignores your META tags and robots.txt file, follows URLs anyway, and/or revisits your site too often (slowing it down, or even crashing it). The more insidious types of bots are:
- Denial of Service Bots (DoS Bots) flood your site and crash your server. A cybercriminal may even announce a DoS attack to the target site to extort money.
- Identity Theft Bots exist solely to scour the web for your personal information, such as addresses, credit card numbers, and passwords.
- Spam Bots send just that, spam. Mass emails usually loaded with naughty intentions and/or pharmaceutical enhancers.
- Phishing Bots combine the identity theft bot and the spam bot: they send luring emails that tempt you to give up your personal information (such as asking you to enter your PayPal password on a non-PayPal site).
- Click Fraud Bots imitate a person clicking on an advertisement to inflate pay-per-click income.
How to Deal With Bots
1. You can tell the "good" bots what you want them to do when they visit your site by adding a robots META tag to your pages. Use the following examples:
<meta name="robots" content="index,follow"> index this page and follow its links
<meta name="robots" content="noindex,follow"> do not index this page, but follow its links
<meta name="robots" content="index,nofollow"> index this page, but do not follow its links
<meta name="robots" content="noindex,nofollow"> do not index this page or follow its links
But note: not all search bots recognize the robots META tag, and not all of the "good" bots respect it either.
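To see how a compliant crawler reads these tags, here is a sketch using Python's standard html.parser module to extract the robots directives from a page (the sample HTML and the class name are illustrative, not any crawler's real code):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            # directives are comma-separated, e.g. "noindex,follow"
            content = attrs.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]

# A hypothetical page that forbids indexing but allows link-following
page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)  # ['noindex', 'follow']
```

A crawler that respects the tag would see "noindex" here and leave the page out of its index while still following its links.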
2. A better method is to use a robots.txt file to specify which parts of your site the bots can, or cannot, see or follow. Here are some examples:
# Disallow Google's Image bot from accessing your website
User-agent: Googlebot-Image
Disallow: /
# Disallow Yahoo's image bot from accessing your website
User-agent: Yahoo-MMCrawler
Disallow: /
# Disallow Archive.org's bot from accessing your website
User-agent: archive.org_bot
Disallow: /
# Disallow any other bot from the files listed
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /my_private_file.html
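You can check how a compliant crawler would interpret a robots.txt like the one above with Python's standard urllib.robotparser module. This is just a sketch: the file is fed in directly as text rather than fetched from a live site, and "SomeOtherBot" is a made-up agent name:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt like the example above, supplied inline for testing
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /my_private_file.html
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Googlebot-Image is barred from the entire site
print(rp.can_fetch("Googlebot-Image", "/photo.html"))     # False
# Any other bot is barred only from the listed paths
print(rp.can_fetch("SomeOtherBot", "/cgi-bin/form.cgi"))  # False
print(rp.can_fetch("SomeOtherBot", "/index.html"))        # True
```

Remember that robots.txt is purely advisory: good bots check it before crawling, and bad bots simply ignore it.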
3. On Apache servers, you can also use an .htaccess (HyperText Access) file in any directory you want to protect, or in the root of your site to protect all of it. The .htaccess file is a hidden file that tells the server to deny access to the agents you specify. Unlike robots.txt, these rules are enforced by the server itself, so bad bots cannot simply ignore them (though a bot can still disguise itself by faking its user-agent string).
For example, EmailSiphon is a known spam bot. It harvests email addresses from websites to send spam to. To prevent it from accessing your site, you would use the following in your .htaccess file:
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
What happens is that when the bad bot, EmailSiphon, visits your site, it is served a 403 Forbidden error and prevented from going any further. Just add more SetEnvIfNoCase lines for other known bad bots and slam that spam! Bu-bye!
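The matching these directives perform can be sketched in Python. This is a simplified illustration of the idea, not Apache's actual implementation; the deny-list and function name are examples of my own:

```python
import re

# Hypothetical deny-list of user-agent patterns, matched
# case-insensitively like Apache's SetEnvIfNoCase.
BAD_BOT_PATTERNS = [r"^EmailSiphon", r"^WebBandit"]

def check_request(user_agent):
    """Return the HTTP status a request with this user-agent would get."""
    for pattern in BAD_BOT_PATTERNS:
        if re.match(pattern, user_agent, re.IGNORECASE):
            return 403  # Forbidden: user-agent matched the deny-list
    return 200          # OK: allowed through

print(check_request("EmailSiphon/1.3"))               # 403
print(check_request("Mozilla/5.0 (Windows NT 10.0)")) # 200
```

The `^` anchor mirrors the original directive: the agent string must *start* with the bad bot's name, so a browser whose agent merely mentions it elsewhere is not blocked.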
So, whether they are the good, the bad, or the ugly, bots are here to stay. Learning how to deal with the bad ones, keep your images from being harvested, and stop the indexing of parts of your site is well worth the effort.