How and Why to Build a Robots.txt
Some of you have asked, 'How do I keep search engine A from indexing pages
designed for search engine B?' The answer is to use a robots.txt file. There
are also other reasons for wanting to keep search engines from indexing some or
all pages on a site. Therefore, I’ve put together this detailed article to show
you how to do that, and to avoid common mistakes that are made all too often.
If you create different versions of essentially the same doorway page and
every search engine indexes every copy of the page, then you could, in theory,
get in trouble for spamming. AltaVista in particular is known to dislike
duplicate or near duplicate content. Therefore, if you create pages that are too
similar, you run the risk of being red-flagged.
In practice, many people don't worry about having too many duplicate pages
indexed by a single search engine because they are not creating huge numbers of
similar pages. In fact, I spoke to the CEOs of three search engine positioning
companies. Each said they did not use robots.txt files, although their reasons
varied.
If your pages vary enough in the size and number of words, then you should
not need to worry about being red-flagged. If you focus primarily on optimizing
existing pages on your site that have unique content rather than building a lot
of new pages that are similar, you’ll also avoid any potential problems.
Other people simply submit pages designed for a particular search engine to
just the engine(s) to which the page applies. This can be the simplest method to
avoid spamming the search engines. This can work IF there are no other links to
that page from the rest of your site. However, if another search engine spider
manages to find a link to that page, it could index it even though you never
submitted the page to that engine.
Despite this, two of the consulting companies I spoke to said submitting
hallway pages that pointed to just the pages built for that engine worked well
for them and they’d done it successfully for years.
However, if you want to create a lot of doorway pages, targeted for a large
number of engines, where many will be very similar to each other, you should
consider using a robots.txt file. This file can tell the search engine spiders
which pages they are not allowed to index. That way you can build pages for
search engine A and tell search engine B to ignore them. The search engines
like this because it keeps them from indexing pages that don’t apply to them.
Therefore, it benefits the search engines and the people who use them, and
it keeps you from being labeled as a spammer.
I have seen some people debate whether the search engines even honor
robots.txt files since this is a purely voluntary feature of the Web. However,
the search engines have historically been challenged by companies - and in the
courts - about indexing copyrighted materials without the permission of the
copyright holder. The search engines' most prominent argument for being able to
index copyrighted material without permission is that the Web site owner always
has the option to exclude their indexing by creating a robots.txt file.
Therefore, it's unlikely the search engines would intentionally ignore the
robots.txt or they could get themselves into unnecessary legal problems. They
might in theory spider the page and then after checking the robots.txt file drop
it. This may explain reports I've heard from a couple of people that claim the
spider ignored their robots.txt file because they saw it opening the page in
their log file. Another explanation is that the Webmaster used the wrong syntax
when creating the robots.txt. Therefore, always double check your work.
I'll try to point out common errors in this article and give plenty of
examples. Please don't be intimidated. It really is not as hard as it looks.
There is also a method that lets you set it up once and then not have to mess with it
ever again. I’ll explain this method near the end of this article.
To create a robots.txt file, open Windows Notepad or any other editor that
can save plain ASCII .txt files. Use the following syntax to exclude a file name
from a particular search engine spider:
User-agent: {SpiderNameHere}
Disallow: {FilenameHere}
Note: For the purpose of this article, the term spider and search engine may
be used interchangeably.
For example, to tell Excite's spider, called ArchitextSpider, not to index
files called orderform.html, product1.html, and product2.html, create a
robots.txt file as follows:
User-agent: ArchitextSpider
Disallow: /orderform.html
Disallow: /product1.html
Disallow: /product2.html
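A file like the one above can be checked locally before you upload it. The following is only a sketch: it uses Python's standard urllib.robotparser module as a stand-in for a real spider, and the host name www.example.com and the agent name SomeOtherBot are placeholders that do not come from this article.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ArchitextSpider
Disallow: /orderform.html
Disallow: /product1.html
Disallow: /product2.html
"""

def is_allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# The named spider is kept away from the listed file...
blocked = is_allowed(ROBOTS_TXT, "ArchitextSpider",
                     "http://www.example.com/orderform.html")
# ...but may still fetch pages that are not listed.
other_page = is_allowed(ROBOTS_TXT, "ArchitextSpider",
                        "http://www.example.com/index.html")
# A spider with no record of its own (placeholder name) is not restricted.
other_bot = is_allowed(ROBOTS_TXT, "SomeOtherBot",
                       "http://www.example.com/orderform.html")

print(blocked, other_page, other_bot)  # False True True
```

Passing the check in Python does not prove that every search engine reads the file the same way, but it does catch records that block nothing at all.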
According to the original robots.txt specification, field names such as
'User-agent:' are not case-sensitive, so 'User-Agent:' should be read the same
way. The file and directory names in a Disallow line are a different matter:
they should match the case used on your server. To be safe, keep everything in
the case shown here. In addition, make sure you include a forward slash before
the file name if the file is in the root directory.
The User-agent line is the identifier for the search engine you wish to
target. It is like a 'code name' for the search engine's spider that goes around
and indexes pages on the Web. It may be similar to the name of the search engine
or it may be completely different. (I'll list the official User-agent names for
the major engines later in this article).
Once you create your robots.txt, upload this text file to the
root directory of your Web site. Although robots.txt is a voluntary protocol,
most major search engines will honor it. If you do not have your own domain name
but instead use a subdirectory off of your host's domain, then your robots.txt
may never be read, since standard practice is for a spider to look only at the
root directory of the domain. This is just one more reason to invest in your own
domain name!
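To see why a site in a subdirectory of its host's domain is at a disadvantage, it helps to work out where a spider would look for the file. The sketch below assumes the standard practice described above, which is to request /robots.txt at the root of the host; the host names www.mydomain.com and www.myhost.example are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Build the robots.txt URL for the host that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# A site with its own domain controls the file the spider will read.
own_domain = robots_url("http://www.mydomain.com/products/page.html")

# A site living in a subdirectory of its host's domain does not: the spider
# asks for the host's file, not for /mysite/robots.txt.
shared_host = robots_url("http://www.myhost.example/mysite/page.html")

print(own_domain)   # http://www.mydomain.com/robots.txt
print(shared_host)  # http://www.myhost.example/robots.txt
```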
You can add additional lines to exclude pages from other engines by
specifying the User-agent line again in the same file, followed by more
Disallow lines. Each Disallow statement applies to the User-agent line or
lines most recently specified above it.
If you want to exclude an entire directory, use this syntax:
User-agent: ArchitextSpider
Disallow: /mydirectory/
A common mistake is to include the asterisk after the directory name to
indicate that you want to exclude all files in that directory. However, the
proper syntax is to NOT include any asterisks in the Disallow statement.
According to the robots.txt specifications, it is implied that the above
statement will disallow all files in 'mydirectory.'
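The reason the asterisk is unnecessary is that a Disallow value is treated as a prefix of the path. The following is a simplified model of that rule written for this article, not the code of any real spider; the function name is_disallowed is hypothetical.

```python
def is_disallowed(path, disallow_rules):
    """Simplified model: a rule blocks any path that starts with it."""
    # An empty rule blocks nothing, so it is skipped.
    return any(rule and path.startswith(rule) for rule in disallow_rules)

# The directory rule covers every file below it, including nested ones.
with_slash = is_disallowed("/mydirectory/product.htm", ["/mydirectory/"])
nested = is_disallowed("/mydirectory/sub/page.htm", ["/mydirectory/"])

# Under strict prefix matching the asterisk is just another character,
# so this rule matches no real file name.
with_asterisk = is_disallowed("/mydirectory/product.htm", ["/mydirectory/*"])

# Files outside the directory are unaffected.
outside = is_disallowed("/index.html", ["/mydirectory/"])

print(with_slash, nested, with_asterisk, outside)  # True True False False
```

Some spiders may accept the asterisk anyway, but a spider that follows the strict rule would quietly ignore the line.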
To disallow a file named product.htm in the 'mydirectory' subdirectory, do
this:
User-agent: ArchitextSpider
Disallow: /mydirectory/product.htm
You can exclude pages from ALL spiders with this User-agent:
User-agent: *
In the case of the User-agent line, you CAN use the asterisk as a wildcard.
To disallow all pages on your Web site for the specified spider use:
Disallow: /
To reiterate, you use only a forward slash to indicate you want to disallow
your entire site. Do NOT use an asterisk here. It's important that you use the
proper syntax. If you misspell something, it may not work and you won't know it
until it's too late! It is possible that certain search engines may handle
common syntax variations without problems. However, this doesn't guarantee that
they will all tolerate variances in the syntax. Therefore, play it safe. If at
some point you do find that your syntax was wrong, don't panic. Correct the
problem and then re-submit. The search engine will then re-spider the site and
drop the pages that you excluded.
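The difference between the wildcard User-agent and a named one can be checked with a short script. This sketch again relies on Python's standard urllib.robotparser module rather than a real spider, and the page name /index.html is only an example.

```python
from urllib.robotparser import RobotFileParser

def can_fetch(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# The asterisk on the User-agent line addresses every spider.
BLOCK_EVERYONE = """\
User-agent: *
Disallow: /
"""

# A named User-agent line shuts out only that spider.
BLOCK_ONE_SPIDER = """\
User-agent: Scooter
Disallow: /
"""

everyone_scooter = can_fetch(BLOCK_EVERYONE, "Scooter", "/index.html")
everyone_googlebot = can_fetch(BLOCK_EVERYONE, "Googlebot", "/index.html")
one_scooter = can_fetch(BLOCK_ONE_SPIDER, "Scooter", "/index.html")
one_googlebot = can_fetch(BLOCK_ONE_SPIDER, "Googlebot", "/index.html")

print(everyone_scooter, everyone_googlebot)  # False False
print(one_scooter, one_googlebot)            # False True
```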
If you wish to include comments in your robots.txt file, you should precede
them with a # sign like this:
# Here are my comments about this entry.
Each set of disallow statements should be separated by a blank line. For
example, you might have something like the following to exclude different files
from different spiders:
User-agent: ArchitextSpider
Disallow: /mydirectory/product.htm
Disallow: /mydirectory/product2.htm

User-agent: Infoseek
Disallow: /mydirectory/product3.htm
Disallow: /mydirectory/product4.htm
The blank line between the two groups is important to group things into
'records.'
If, on the other hand, you wanted to exclude the same set of files for more
than one spider, you could do something like this:
User-agent: ArchitextSpider
User-agent: Infoseek
Disallow: /mydirectory/product.htm
Disallow: /mydirectory/product2.htm
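Both ways of grouping records can be compared side by side. The sketch below feeds the two files from this section to Python's standard urllib.robotparser module and lists which of the four product pages each spider is kept away from; the helper name blocked_paths is hypothetical, and a real spider may read the file differently.

```python
from urllib.robotparser import RobotFileParser

def blocked_paths(robots_txt, agent, paths):
    """Return the subset of `paths` that `agent` may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]

PATHS = [
    "/mydirectory/product.htm",
    "/mydirectory/product2.htm",
    "/mydirectory/product3.htm",
    "/mydirectory/product4.htm",
]

# Two records, separated by a blank line: each spider gets its own list.
SEPARATE_RECORDS = """\
User-agent: ArchitextSpider
Disallow: /mydirectory/product.htm
Disallow: /mydirectory/product2.htm

User-agent: Infoseek
Disallow: /mydirectory/product3.htm
Disallow: /mydirectory/product4.htm
"""

# One record with two User-agent lines: both spiders share the same list.
SHARED_RECORD = """\
User-agent: ArchitextSpider
User-agent: Infoseek
Disallow: /mydirectory/product.htm
Disallow: /mydirectory/product2.htm
"""

excite_separate = blocked_paths(SEPARATE_RECORDS, "ArchitextSpider", PATHS)
infoseek_separate = blocked_paths(SEPARATE_RECORDS, "Infoseek", PATHS)
infoseek_shared = blocked_paths(SHARED_RECORD, "Infoseek", PATHS)

print(excite_separate)    # product.htm and product2.htm
print(infoseek_separate)  # product3.htm and product4.htm
print(infoseek_shared)    # product.htm and product2.htm
```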
Side note about subdirectories: Some Webmasters like to organize their
doorway pages into different subdirectories according to which search engine
they are optimized for. However, some engines are suspected of assigning lower
rankings to pages appearing in subdirectories versus the root directory of a Web
site. If they perceive that those pages belong to a Web site that shares a
domain with its host, they could discriminate against those pages as being
potentially of a lesser quality. I asked three search engine consultants their
opinion of subdirectories. The general feeling was that pages in the root
directory were probably better, but they’d not seen evidence that subdirectories
caused problems.
If you were still concerned about being penalized for keeping pages in
subdirectories and wished to use them, you could ask your hosting service to
give you 'machine names' like myproduct.mydomain.com that you could submit. The
myproduct.mydomain.com URL could then be configured by your hosting service to
point to your 'myproduct' subdirectory or whatever directory you desired. That
way no discrimination could occur by the search engine since they would not see
the subdirectory in the URL. In addition, you could include keywords in that
machine name which may also improve your rankings. (Note: A machine name is
normally just 'www.' prefixed at the start of your domain name. However, rather
than 'www.' it could be any name you desire and it could point to any location
on any physical machine).
We are often asked about the proper names for the User-agent. The name of the
agent does not always correspond to the name of the search engine. Therefore,
you can't just put in 'AltaVista' in the User-agent and expect AltaVista to
exclude your designated pages. Don't ask me why it can't be that simple. Perhaps
it's a job security plan for professional Webmasters :-)
In any case, there's a lot of confusion in newsgroup forums and on the Web
about what the proper agent names should be. The confusion derives from
Webmasters reading their server log files and noticing all kinds of complicated
agent names being logged such as Scooter/2.0 G.R.A.B. X2.0, InfoSeek
Sidewinder/0.9, or Slurp/2.0. However, the agent names listed in your log are
not necessarily what you are expected to use in your robots.txt file.
The reason is very logical when you think about it. Names like InfoSeek
Sidewinder/0.9 in a robots.txt are not very useful if the search engine updates
their agent software and decides to start using Infoseek Sidewinder/2.0 as their
new name next month. Would it make sense to expect millions of Webmasters to
know this and to all update their robots.txt files to the new name? Would they
expect people to update the file EVERY TIME any search engine updated their agent
version number and do it precisely when the name change occurred? It's not
likely.
In reality, the name that needs to appear in the robots.txt file is whatever
name the search engine spider is programmed to look for. Therefore, the best
source of information for this name is not your log files but the help files on
the search engine itself. In theory, a search engine could look for a wide
variety of name variations. However, in general they will simply look for the
least common denominator such as 'Scooter' rather than 'Scooter/2.0'. If the
search engine is smart they will allow you to use Scooter/2.0 too, but that is
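not guaranteed. Therefore, if you've already set up a robots.txt on your site,
double-check the syntax and the agent names against the list below. The
specification recommends that spiders match names without regard to case, but
not every spider is guaranteed to do so, so copy the names exactly as shown.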
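The matching of a short name in robots.txt against a longer name in the log can be illustrated with Python's standard urllib.robotparser module. Treat this as a sketch of one parser's behavior only: Python drops everything after the first slash in the spider's name and ignores case, and a real search engine may do it differently. The directory and page names are examples.

```python
from urllib.robotparser import RobotFileParser

# The record uses the short name, not the full name seen in the logs.
ROBOTS_TXT = """\
User-agent: Scooter
Disallow: /mydirectory/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Full agent name as it might appear in a server log file.
logged_name = "Scooter/2.0 G.R.A.B. X2.0"

# The short record still applies to the spider announcing the long name...
scooter_blocked = not parser.can_fetch(logged_name, "/mydirectory/page.html")
# ...and does not affect a different spider.
slurp_blocked = not parser.can_fetch("Slurp/2.0", "/mydirectory/page.html")

print(scooter_blocked, slurp_blocked)  # True False
```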
Here are the User-agent names that we have compiled. Most of these came
directly from the search engines' own help files, or when not available, from
other respected sources:
| Search Engine | User-Agent |
|---|---|
| AltaVista | Scooter |
| Infoseek | Infoseek |
| Hotbot | Slurp |
| AOL | Slurp |
| Excite | ArchitextSpider |
| Google | Googlebot |
| Goto | Slurp |
| Lycos | Lycos |
| MSN | Slurp |
| Netscape | Googlebot |
| NorthernLight | Gulliver |
| WebCrawler | ArchitextSpider |
| Iwon | Slurp |
| Fast | Fast |
| DirectHit | Grabber |
| Yahoo Web Pages | Googlebot |
| Looksmart Web Pages | Slurp |
You'll notice that many of the engines use the 'Slurp' agent, which is the
Inktomi spider used on HotBot and other Inktomi-related sites. Unfortunately,
I'm not aware of a way you can exclude pages from the HotBot spider and not
exclude them from all other Inktomi sites. As far as I can tell, they use the
same spider to index the pages and thereby recognize only one User-agent string
in the robots.txt file. (If I am wrong, please reply to this e-mail and let me
know how this is done!)
The individual Inktomi sites tend to rank the pages differently, although
they will often be rather similar. Normally you can create a handful of pages
that will rank well on most of the Inktomi powered sites, so the duplicated
content issue does not normally become a big problem with Inktomi.
If you're now scratching your head on how this all comes together in relation
to the doorway pages you created, check out the two detailed examples of a
robots.txt file we've put together.
The first one shows how you can disallow INDIVIDUAL files.
The second example shows how you can group your doorway pages into
DIRECTORIES and disallow the entire directory.
The advantage to method #1 is that it can be more flexible for working with a
small number of files already on your site, and it 'might' be a little safer.
Some people believe that locating your doorways in the root directory rather
than a subdirectory can give you a ranking advantage. The theory is that the
search engines might discriminate against sites that don't have their own domain
name, so pages submitted in a subdirectory could be perceived as sharing a
domain with their host.
The disadvantage to method #1 is that if you have very many doorway pages
then the size of your robots.txt file could be enormous. This runs the risk that
a search engine might have problems with a robots.txt file that exceeds a
certain reasonable size. It might also slow down the spider from accessing your
site if it must read in an extremely large robots.txt file. Lastly, a robots.txt
with a lot of entries in it could be a red-flag in itself to a search engine.
This is all speculation, but it’s enough that I would avoid excluding a lot of
files individually if you don’t have to.
Example method #2 is to organize your doorway pages into subdirectories for
each search engine. The advantage to method #2 is that it is much easier to
track your doorway pages if they are organized in separate subdirectories. In
addition, the size of your robots.txt will be relatively small. You'll also not
need to update the file every time you upload new doorway pages. Once the
robots.txt is set up with method #2, all you have to do is upload it to the
appropriate directory, submit and you're done!
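With one directory per engine, the robots.txt for method #2 follows a simple pattern: each spider is told to skip every doorway directory except its own. The generator below is a minimal sketch of that pattern. The directory names /altavista/, /excite/ and /inktomi/ are hypothetical, the agent names come from the table in this article, and the function build_robots_txt is not part of any library.

```python
# Hypothetical layout: one doorway directory per spider name.
DOORWAY_DIRS = {
    "Scooter": "/altavista/",
    "ArchitextSpider": "/excite/",
    "Slurp": "/inktomi/",
}

def build_robots_txt(doorway_dirs):
    """Tell each spider to skip every doorway directory except its own."""
    records = []
    for agent in doorway_dirs:
        lines = [f"User-agent: {agent}"]
        for owner, directory in doorway_dirs.items():
            if owner != agent:
                lines.append(f"Disallow: {directory}")
        records.append("\n".join(lines))
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"

robots_txt = build_robots_txt(DOORWAY_DIRS)
print(robots_txt)
```

Because the rules name directories rather than files, the output stays the same no matter how many doorway pages you later upload into each directory.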
So do the engines discriminate against files in subdirectories? The
consultants I talked to did not think so. Based on these conversations, if you
properly design a hallway page in your ROOT directory that links to the doorway
pages in your subdirectory, and submit that hallway page, then you’ll be fine.
This demonstrates to the engine that the pages are most likely sub-pages of the
main site. In addition, it would be dangerous for the search engines to penalize
pages in subdirectories since most large Web sites must organize their pages
into subdirectories to avoid complete chaos. As an added precaution, you could
assign machine names to subdirectories as I mentioned earlier in this article.
If you have any experience, comments, or observations on this issue, please let
me know by replying to this e-mail.
My conclusion: If all your pages have good content and are fairly unique,
don’t worry about robots.txt files. If you focus only on optimizing existing
pages on your site, don’t worry about a robots.txt. If, however, you decide you
need to experiment with more than a handful of pages that are rather similar,
consider making use of the robots.txt file, particularly with AltaVista. Use
example method #1 if you’re only dealing with a small number of pages or special
scenarios. Otherwise, organize your files into directories and use example
method #2.
An online robots.txt syntax checker is also available. However, I tested it on a couple of files and it sometimes complained about
things that were perfectly valid. Therefore, the service in my opinion may be
too buggy to be of great use. If it points out errors in your file, refer to
this article or the article below to verify that the errors it catches are real.
If you know of a better syntax checker, let me know and I'll pass the
information along and give you and your Web site credit for the tip. I will also
try to give you credit for any other search engine marketing tips I end up using
of which I was not aware!
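If no trustworthy checker is at hand, the common mistakes described in this article are simple enough to test for yourself. The following is a rough sketch, not a complete validator: it knows only the two fields used in this article, and the function name lint_robots_txt and its messages are made up for illustration.

```python
KNOWN_FIELDS = {"user-agent", "disallow"}

def lint_robots_txt(text):
    """Return (line_number, message) pairs for a few common mistakes."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            seen_agent = False  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop any comment
        if not line:
            continue  # comment-only line
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            seen_agent = True
        else:  # a Disallow line
            if not seen_agent:
                problems.append((number, "Disallow before any User-agent"))
            if "*" in value:
                problems.append((number, "asterisk in Disallow value"))
            if value and not value.startswith("/"):
                problems.append((number, "path does not start with '/'"))
    return problems

SAMPLE = """\
User-agent: ArchitextSpider
Disalow: /orderform.html
Disallow: /mydirectory/*
Disallow: product1.html
"""

problems = lint_robots_txt(SAMPLE)
clean = lint_robots_txt("User-agent: *\nDisallow: /\n")
for number, message in problems:
    print(f"line {number}: {message}")
```

On the sample file it reports the misspelled field on line 2, the asterisk on line 3 and the missing forward slash on line 4, while the two-line file that blocks all spiders passes without complaint.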
Note: The information presented here is adapted, under license agreement, from FirstPlace Software.