Sunday, April 28, 2013

Robots.txt

The Robots Exclusion Standard, commonly known as robots.txt, is a file that tells search engine crawlers which parts of a website may be indexed and which parts, or even the whole site, should not be crawled. Crawlers that honor the standard read this file and follow its rules before crawling a site. The file must be placed at the root level of the website; if the owner does not provide one there, crawlers are free to crawl the entire site.
Location of the robots.txt file: http://www.domain.com/robots.txt
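Because the file always lives at the site root, its URL can be derived from any page URL on the site. A minimal sketch in Python (the domain is the placeholder used above):

```python
from urllib.parse import urljoin

# Joining "/robots.txt" against any page URL on the site yields the
# root-level robots.txt location, regardless of how deep the page is.
page = "http://www.domain.com/some/deep/page.html"
robots_url = urljoin(page, "/robots.txt")
print(robots_url)  # http://www.domain.com/robots.txt
```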

    A robots.txt file consists of User-agent and Disallow lines. With these, site owners can tell web robots which parts of a site may be indexed and which parts should be excluded from crawling.
   
A few examples of robots.txt files using User-agent and Disallow:

1. User-agent: *
   Disallow:
The above example lets all search robots visit all files of the website, since * matches every robot and an empty Disallow excludes nothing.

2. User-agent: *
   Disallow: /
The above example, if placed in a robots.txt file, excludes the whole website from all search robots.

3. User-agent: *
  Disallow: /temp/
  Disallow: /cgi-bin/
This example tells all robots not to crawl the specified folders; here 'temp' and 'cgi-bin' are excluded from crawling.

4. User-agent: *
  Disallow: /directory/
This tells all robots not to crawl the specific directory named in the robots.txt file.

5. User-agent: x robot
   Disallow: /
The above example excludes only the robot named 'x robot' from crawling the entire site; all other robots are unaffected.
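The rules above can be checked programmatically. A short sketch using Python's standard-library `urllib.robotparser`, with the rules from example 3 (the domain is the placeholder used earlier):

```python
from urllib import robotparser

# Rules from example 3: all robots are barred from /temp/ and
# /cgi-bin/ but may crawl everything else on the site.
rules = """\
User-agent: *
Disallow: /temp/
Disallow: /cgi-bin/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# can_fetch(user_agent, url) answers: may this robot crawl this URL?
print(rp.can_fetch("*", "http://www.domain.com/index.html"))   # True (allowed)
print(rp.can_fetch("*", "http://www.domain.com/temp/a.html"))  # False (disallowed)
```

In a real crawler, `RobotFileParser.set_url()` and `read()` would fetch the live robots.txt from the site root instead of parsing a hard-coded string.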
