An important aspect of Drupal SEO is the robots.txt file. Drupal 5 was the first version of Drupal that came with a robots.txt file, but it still needs some modifications.
One of the most serious SEO problems with Drupal is duplicate content. With the addition of contributed modules it can get so bad that one might refer to it as druplicate content. (ow...)
A key element of SEO on sites is getting a good, clean crawl. A robots.txt file is important for a clean crawl because it tells robots where they aren't supposed to go. There are many places on a Drupal site that search engine crawlers shouldn't go.
I've attached Drupal 5's default robots.txt file for reference and will address it in sections:
The first thing I would do is remove the Crawl-delay line. Unless you have a very large site or spidering problems, it's not needed. The other robots.txt rules that I mention here and in the Drupal module section should help cut down on the number of pages crawled.
User-agent: *
Crawl-delay: 10
The next section of the default robots.txt file addresses the physical directories created by Drupal:
# Directories
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
That section can be left as-is. Just keep in mind that it will probably also keep search engines out of your logo and image files, because it blocks your /sites/, /modules/, and /themes/ directories. If you use an alternate logo image, rename it so that it includes a keyword and place it in your /files/ directory.
The next section addresses files that are included with Drupal. I've never seen any of these files indexed, but you can leave this section in if you wish. Don't delete your CHANGELOG.txt file as some people recommend, because it lets you know what version of Drupal you are running in case you forget later.
# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt
This is the most important section of the default robots.txt file because it contains some errors:
# Paths (clean URLs)
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Drupal's URLs don't have trailing slashes, so a rule that ends in a slash only matches paths below it and misses the page itself. You may want to remove the trailing slashes from some of the rules as shown below:
Disallow: /admin/
Disallow: /aggregator
Disallow: /comment/reply/
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search/
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
For example, the "Login or register to post comments" link on each node creates URLs like http://example.com/user/login?destination=comment/reply/806%2523comment_form and http://example.com/user/register?destination=comment/reply/806%2523comment_form. Drupal's default robots.txt rules will not block search engines from spidering those URLs, because the rules end in a slash that the actual URLs lack. If you remove the trailing slashes as I've mentioned above, the rules will block them.
The Aggregator Module creates URLs of duplicate content like http://example.com/aggregator?page=3 that are not blocked by the default robots.txt file. Removing the trailing slash on the end of "/aggregator/" in the default robots.txt file will solve that problem.
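The trailing-slash problem is easy to check by hand. The following is a minimal sketch, assuming the basic robots.txt convention that a Disallow rule blocks any URL whose path (plus query string) starts with the rule. The helper name `is_blocked` is hypothetical, and real crawlers may differ in details, so treat this as an illustration rather than a description of any particular search engine.

```python
from urllib.parse import urlsplit


def is_blocked(url, disallow_rules):
    """Return True if any Disallow rule is a prefix of the URL's path plus query.

    Hypothetical helper: models simple prefix matching only (no wildcards).
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return any(target.startswith(rule) for rule in disallow_rules)


# Rules as shipped in the default file, and with the trailing slashes removed.
default_rules = ["/aggregator/", "/user/register/", "/user/login/"]
fixed_rules = ["/aggregator", "/user/register", "/user/login"]

login_url = "http://example.com/user/login?destination=comment/reply/806%2523comment_form"
pager_url = "http://example.com/aggregator?page=3"

for url in (login_url, pager_url):
    # Each URL is missed by the default rules and caught by the fixed ones.
    print(url, is_blocked(url, default_rules), is_blocked(url, fixed_rules))
```

One side effect of prefix matching is that a rule such as `/contact` also blocks any other path that merely starts with those characters, so check for aliases that share a prefix before removing a slash.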
The next section of the robots.txt file addresses paths that should be blocked if you aren't using clean URLs:
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
UPDATE: Please ignore the following lines. Further testing has shown that this rule will block all dynamic URLs in Google. So don't use it!
Most of the people reading these Drupal SEO tutorials are using clean URLs. If you are using clean URLs you can delete that section and replace it with the following line:
Disallow: /?
That line would block all of the URLs that start with ?q= as well as other miscellaneous query strings that might later appear for various reasons.
If you are not using clean URLs, modify the above section using the same logic as for the "clean paths" section above it. If your site has been indexed without clean URLs (for example, the page http://example.com/?q=node/25 has PageRank) and you are going to implement clean URLs, you should use .htaccess to do 301 redirects from the dynamic versions of the URLs to the clean ones. In that case, do not block the dynamic URLs from search engines: crawlers have to be able to fetch them to see the redirect and transfer the PageRank of the dynamic URLs to the clean URLs. If that issue applies to you and my explanation doesn't make sense, please let me know in a comment below and I'll try to explain it another way.
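To make the redirect idea concrete, here is a small sketch of the mapping such a 301 redirect would perform. It is written in Python purely for illustration; on a real site the redirect would live in .htaccess, and the function name `clean_url_for` is hypothetical. It assumes the dynamic URL carries the Drupal path in the `q` parameter and that any other parameters should be kept.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def clean_url_for(url):
    """Map a dynamic '?q=' URL to its clean equivalent, or return None.

    Hypothetical helper: shows the redirect target only, it does not redirect.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    q_values = [value for key, value in params if key == "q"]
    if not q_values:
        return None  # already clean, nothing to redirect
    # Keep any other parameters (for example a pager) on the clean URL.
    rest = [(key, value) for key, value in params if key != "q"]
    path = "/" + q_values[0].strip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(rest), ""))


print(clean_url_for("http://example.com/?q=node/25"))
print(clean_url_for("http://example.com/?q=aggregator&page=3"))
```

Each dynamic URL that maps to a clean one is a URL that crawlers must still be allowed to request, which is why the "?q=" rules should not stay in robots.txt during the transition.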
I also recommend adding the following rules, after carefully reading and understanding the explanations given with them:
Each module potentially adds many extra URLs to the site, which often create massive amounts of duplicate content and increase the crawling load on your server. The following are some extra robots.txt rules for core modules. The Drupal SEO Module Database contains information on additional rules that should be added when using contributed modules.
Disallow: /node$
Disallow: /user$
Disallow: /*sort=
Disallow: /search$
Disallow: /*/feed$
Disallow: /*/track$
Disallow: /tracker?
Leaving /tracker exposed to search engines allows them to rapidly find and index your latest content as it is posted.
Disallow: [front page] (replace with the path to your alternate front page)
An improved version of Drupal's robots.txt file that summarizes the explanations above can be downloaded here.
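The additional rules rely on two pattern characters: `*` matches any run of characters and a trailing `$` anchors the rule to the end of the URL. These are extensions honored by Google and other major crawlers rather than part of the original robots.txt convention, so the sketch below is only an approximation of how such a rule is evaluated. The helper names `rule_to_regex` and `rule_matches` and the sample paths are hypothetical.

```python
import re


def rule_to_regex(rule):
    """Translate a wildcard Disallow rule into a compiled regular expression."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and let each '*' match any run of characters.
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))


def rule_matches(rule, path):
    """Return True if the rule would block the given path (with query string)."""
    return rule_to_regex(rule).match(path) is not None


checks = [
    ("/node$", "/node"),                    # the bare /node listing
    ("/node$", "/node/25"),                 # an individual node
    ("/*sort=", "/forum?sort=asc"),         # any URL carrying a sort parameter
    ("/*/feed$", "/taxonomy/term/3/feed"),  # per-term feeds
    ("/tracker?", "/tracker?page=2"),       # paged tracker listings
    ("/tracker?", "/tracker"),              # the first tracker page
]
for rule, path in checks:
    print(rule, path, rule_matches(rule, path))
```

Under these assumptions `/node$` blocks only the bare `/node` listing and leaves `/node/25` crawlable, and `/tracker?` blocks the paged listings while leaving `/tracker` itself open.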
Please see the Drupal SEO Module Database for instructions about specific rules. If you have questions about a specific module that I haven't covered yet, please contact me and I'll try to review the module as soon as possible.
UPDATE: a patch for some of these issues has been added here: http://drupal.org/node/180379
| Attachment | Size |
|---|---|
| default_robots.txt | 1.64 KB |
| improved_robots.txt | 1.04 KB |
Comments
Hello, is it better enable
Hello, is it better enable indexing of the comments?
You help to position better, no?
Pardon the syntax, not control the English.
indexing of comments
Google seems to discard URL fragments. I believe that a link to http://example.com/page-title#comment-13 is the same as http://example.com/page-title, at least for Google.
If your comments are paginated then it might matter, but if the comments are all on one page it shouldn't matter.
I can't think of a Drupal site with enough comments to cause pagination except for the blogs on Computerworld.com — but I think those comments are created with views and not in the conventional way.
If you know of a Drupal site with paginated comments I could take a look at it.
indexing of comments
Forgive, I am not talking about that.
What I mean is whether to let the comments are indexed to improve positioning.
Do you understand?
Since defaults are nullified by the robots.txt
Drupal comments
The robots.txt rule that says
Disallow: /comment/
blocks the "comment/reply" URLs, not the actual comments.
Drupal comments are indexed because they are on the same page as the node. For example, on this page the comments are indexed at the URL http://drupalzilla.com/robots-txt because they are on the page. The URLs that are blocked are ones like http://drupalzilla.com/comment/reply/5/16. Those URLs just contain a form to add a new comment.
Ok, thanks He was confused.
Ok, thanks
He was confused.