The robots.txt file consists of groups of rules that determine the behavior of robots on the site.
Important points:
- The file must be named exactly robots.txt and must be encoded in UTF-8.
- The robots.txt file must not be larger than 32 KB.
- The robots.txt file must be located in the root directory of the site, that is, it must be accessible in a browser at an address of the form http://www.example.com/robots.txt.
- Only one robots.txt file can exist on a site.
- Each directive must start on a new line.
- By default, robots are allowed to process all pages of the site. Access to specific pages is denied with the Disallow directive.
- Rules are case sensitive.
Syntax
Each group can contain several directives of the same type; for example, this is useful for specifying multiple robots or multiple pages.
A rule group must follow the order below and consist of the following directives:
- User-agent - a mandatory directive; it can be specified multiple times in one rule group.
- Disallow and Allow - mandatory directives; at least one of them must appear in each rule group.
- Host, Crawl-delay, Sitemap - optional directives.
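A sketch of a rule group that follows this order; the paths, domain, and sitemap URL below are illustrative assumptions, not values taken from this article:
User-agent: *
Disallow: /private/
Allow: /private/public.html
Host: example.com
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml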
To specify regular expressions, use:
- * - means a sequence of any characters, of any length.
- $ - means the end of the line.
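For example, the following rules use both characters; the file extension and query parameter are assumed here purely for illustration:
User-agent: *
# Block every URL whose path ends in .pdf
Disallow: /*.pdf$
# Block every URL that contains "?print=" anywhere in it
Disallow: /*?print=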
Basic directives
User-agent
The User-agent directive defines the name of the robot to which the rules apply. To address all robots, you can use:
User-agent: *
If a group is specified for a particular robot by name, that robot ignores the group with *.
The following directives allow access for the robot named Googlebot and deny it to all other robots:
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
Disallow
The Disallow directive defines pages to which robots are denied access.
You can deny access to the entire site by specifying:
Disallow: /
A ban on individual pages can be specified as follows:
Disallow: /admin
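Note that the value is matched as a path prefix; the paths in this sketch are illustrative:
User-agent: *
# Blocks /admin, /admin/, /admin/users and even /administrator
Disallow: /admin
# Blocks everything under /private/, but not /private itself
Disallow: /private/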
Allow
The Allow directive defines pages to which robots are allowed access. It is used to make exceptions to Disallow rules.
The following rules block the entire site for the robot Googlebot except the /pages/ directory:
User-agent: Googlebot
Disallow: /
Allow: /pages/
Host
The Host directive defines the main domain of the site. It is useful when several domain names are bound to the site: for correct search indexing you can specify which domain is the main one, so that the remaining domains are treated as mirrors, technical addresses, and so on.
An example of using the directive on a site with the domains example.com and domain.com, where example.com will be the main domain for all robots:
User-agent: *
Disallow:
Host: example.com
Crawl-delay
The Crawl-delay directive defines the interval, in seconds, between the moment a robot finishes loading one page and starts loading the next. It is useful for throttling requests to the site, which reduces the load on the server.
Usage example:
User-agent: *
Disallow:
Crawl-delay: 3
Sitemap
The Sitemap directive defines the URL of the sitemap file on the site. This directive can be specified multiple times. The address must be given in the form protocol://address/path/to/sitemap.
Usage example:
Sitemap: https://example.com/sitemap.xml
Sitemap: http://www.example.com/sitemap.xml
Multi-domain robots.txt
If the site uses several domains, for example via aliases, the settings specified in robots.txt may differ for each site because of SEO optimization or other tasks.
Important: the static robots.txt file must be removed, and in the site settings either the parameter "Send requests to the backend if the file is not found" must be set or the txt extension must be removed from the list of static files.
To implement a dynamic robots.txt, do the following:
- Read the important information in this article and make sure that all conditions are met.
- Create files named domain.com-robots.txt in the root directory of the site, where instead of domain.com you specify the domain to which the rules will apply.
- Specify the required rules for each domain in the created files.
- Configure the output of the files by adding the following rules at the beginning of the .htaccess file:
  RewriteEngine On
  RewriteCond %{REQUEST_URI} ^/robots\.txt$
  RewriteRule ^robots\.txt$ %{HTTP_HOST}-robots.txt [L]
- Check the output of the rules for each of the domains.
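For a quick check you can request robots.txt for each domain directly; the domain names in this sketch are assumed for illustration:
# Each request should return the rules from the matching <domain>-robots.txt file
curl -s http://example.com/robots.txt
curl -s http://domain.com/robots.txt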