2.38. robots.txt

The robots.txt file consists of groups of rules that determine the behavior of robots (crawlers) on the site.

 

Important points:

  • The file must be named exactly robots.txt and encoded in UTF-8.
  • The robots.txt file must not exceed 32 KB in size.
  • The robots.txt file must be located in the root directory of the site, i.e. it must be accessible in a browser at an address of the form http://www.example.com/robots.txt .
  • A site can have only one robots.txt file.
  • Each directive must start on a new line.
  • By default, robots are allowed to process all pages of the site. Access to specific pages is blocked with the Disallow directive.
  • Rules are case sensitive.

Each group can contain several rules of the same type; this is useful, for example, for specifying multiple robots or pages.

A rule group must follow this order and consist of the following directives:

  1. User-agent - a mandatory directive; it can be specified multiple times in one rule group.
  2. Disallow and Allow - mandatory directives; at least one of them must be present in each rule group.
  3. Host, Crawl-delay, Sitemap - optional directives.
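
As an illustration, a complete rule group that follows this order might look as shown below (the paths and values here are assumptions for the example, not recommendations):

 User-agent: *
 Disallow: /admin
 Allow: /admin/public
 Crawl-delay: 3
 Sitemap: https://example.com/sitemap.xml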

To specify patterns in paths, you can use the following special characters:

  • * - matches a sequence of characters of any length.
  • $ - marks the end of the URL.
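
For example, the following group (the paths are illustrative assumptions) blocks all URLs that start with /admin/ and all URLs ending in .pdf for every robot:

 User-agent: *
 Disallow: /admin/*
 Disallow: /*.pdf$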

The User-agent directive defines the name of the robot to which the rule group applies. To address all robots, use:

 User-agent: *

If a rule group is specified for a robot by name, that robot ignores the group specified with *.

The following directives allow access for the robot named Googlebot and prohibit it for all others:

 User-agent: *
 Disallow: /

 User-agent: Googlebot
 Disallow:

The Disallow directive defines pages to which robots are denied access.

You can deny access to the entire site by specifying:

 Disallow: /

A ban on individual pages can be specified as follows:

 Disallow: /admin

The Allow directive defines pages to which robots are allowed access. It is used to make exceptions to Disallow rules.

The following rules block the robot Googlebot from the whole site except the /pages/ directory:

 User-agent: Googlebot
 Disallow: /
 Allow: /pages/

The Host directive defines the main domain of the site. It is useful when several domain names are bound to the site: for correct search indexing you can specify which domain is the main one, so that the remaining domains are treated as mirrors, technical addresses, etc.

An example of using the directive on a site with the domains example.com and domain.com, where domain.com will be the main domain for all robots:

 User-agent: *
 Disallow:
 Host: domain.com

The Crawl-delay directive defines the interval, in seconds, that a robot must wait between finishing the download of one page and starting the download of the next. It is useful for reducing the number of requests to the site and, consequently, the load on the server.

Usage example:

 User-Agent: *
 Disallow:
 Crawl-delay: 3

The Sitemap directive defines the URL of the sitemap file on the site. It can be specified multiple times. The address must be given in the format protocol://address/path/to/sitemap .

Usage example:

 Sitemap: https://example.com/sitemap.xml
 Sitemap: http://www.example.com/sitemap.xml
 
To implement a dynamic robots.txt, any existing robots.txt file must be removed. In addition, in the site settings either the parameter "Send requests to the backend if the file is not found" must be enabled, or the txt extension must be removed from the list of static files.

If the site uses several domains, for example via aliases, the rules in robots.txt may need to differ for each domain because of SEO optimization or other tasks. To implement a dynamic robots.txt, do the following:

  1. Read the important information in this article and make sure that all conditions are met.
  2. Create files named domain.com-robots.txt in the root directory of the site, replacing domain.com with the domain to which the rules will apply.
  3. Specify the required rules for each domain in the generated files.
  4. Configure the output of the files by adding the following rules at the beginning of the .htaccess file:
      RewriteEngine On
      RewriteCond %{REQUEST_URI} ^/robots\.txt$
      RewriteRule ^robots\.txt$ %{HTTP_HOST}-robots.txt [L]
  5. Check the output of the rules for each of the domains.
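
For example, the output can be checked from the command line with curl, assuming the site uses the domains example.com and domain.com (replace them with your own):

 curl -s http://example.com/robots.txt
 curl -s http://domain.com/robots.txt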