
Friday, July 18, 2008

Robots.txt: how this file helps you

Robots.txt is a plain text file that carries instructions for search engine robots, telling them how to behave when they arrive at your web site. By defining a few rules in this file, you can instruct robots not to crawl and index certain files or directories within your site, or to skip the site entirely. Whenever a search engine crawler discovers your site, it looks for the robots.txt file first and follows the instructions in it, so it is worth spelling out exactly what you want crawlers to do when they visit.

The main purpose of this file is simple: place a robots.txt file whenever your site includes content that you don't want search engines to index.
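To see what this looks like from the crawler's side, here is a minimal sketch using Python's standard urllib.robotparser module. The site (http://www.xyz.com) is just the placeholder used throughout this post, and MyCrawler is a hypothetical user agent:

import urllib.robotparser

# Fetch and parse the site's robots.txt file, just as a polite crawler would.
parser = urllib.robotparser.RobotFileParser()
parser.set_url("http://www.xyz.com/robots.txt")
parser.read()

# can_fetch(user_agent, url) reports whether the rules permit crawling that URL.
if parser.can_fetch("MyCrawler", "http://www.xyz.com/privatedir/page.html"):
    print("Allowed to crawl this URL")
else:
    print("Blocked by robots.txt")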

How do you create robots.txt, and where do you place it?

Create a plain text file, save it as robots.txt, and upload it to your website's root directory, so that it is reachable at http://www.xyz.com/robots.txt. For example:

User-agent: *
Disallow: /

This is the most common syntax: all robots (indicated by "*") are instructed not to index any of your pages (indicated by "/").
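The opposite rule, an empty Disallow value, tells all robots that nothing on the site is off limits:

User-agent: *
Disallow: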

The following rules instruct crawlers to stay out of entire directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/

If you want to disallow specific pages:

User-agent: *
Disallow: /page.html
Disallow: /our/page.html


Some crawlers, most notably Googlebot, support an additional field called "Allow:". As its name implies, "Allow:" lets you explicitly dictate which files and folders can be crawled. For example, the rules below shut out every robot except Googlebot:


User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
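
A convenient way to sanity-check a rule set like this before uploading it is to feed it straight to urllib.robotparser and query each user agent; the URL here is, again, just the placeholder from this post:

import urllib.robotparser

# Parse the rules directly from a list of lines instead of fetching them.
rules = """
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("Googlebot", "http://www.xyz.com/page.html"))      # True
print(parser.can_fetch("SomeOtherBot", "http://www.xyz.com/page.html"))   # False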

If you want to know more about robots.txt, or you have a special case where you don't want certain pages or images crawled, just mail me your problem –
