A Beginner's Guide to Understanding robots.txt for Better SEO

If you own a website, you may have already come across the robots.txt file. It applies mainly to self-hosted websites, where administrators can directly access and edit the file. It's a powerful way to dictate how search engine crawlers and content-indexing robots access your website. In this tutorial, we'll learn how to use this file to manage the indexing of a website, which in turn decides how its content appears on search engines. If you're on a hosted platform, you may not have the privilege to access or edit this file, but knowledge of how it works can still benefit a website owner.

Remember, all of your on-page SEO efforts can go to waste if you're not properly handling the crawlers and spiders visiting your website. That's what this guide is going to teach you, in easy steps.

Read Also:
A Step-by-Step Guide to Fixing Core Web Vitals Issues on Your Website

Here I must give you a word of caution. If you do not properly understand what the robots.txt file is all about, do not edit it. Careless edits can harm your website's search engine presence.

Understanding robots.txt File

To begin with, robots.txt is a plain text file that resides in the root directory of a website. It is used to guide content-indexing crawlers about which parts of the website are available for indexing.

For example, you can view this file for any website by opening its address followed by /robots.txt, like https://www.freshtechtips.com/robots.txt

In other words, the robots.txt file contains directives that instruct bots and spiders about which directories and files are accessible and which are not. It's a kind of guide for visiting content crawlers, letting them know where in the website's directory tree they can traverse and gobble up content.

Importance of robots.txt File

Now that we have an idea of what this file is all about, let's talk a little about why it is important for every website and about the scenarios in which it plays an important role.

Protection and Privacy

Through the directives given in the robots.txt file, you can block crawlers from accessing select directories associated with website administration. Access to directories containing sensitive and private information can also be blocked with the help of this file. Keep in mind, though, that robots.txt is publicly readable and compliant crawlers follow it voluntarily, so it's not a substitute for proper access control.

Check on Bandwidth Consumption

By preventing crawler access to certain directories and files, website owners can save a lot of bandwidth. This becomes especially important when some of the files hosted on the web server are large.

Fulfilling SEO Goals

By using the robots.txt file, you can divert web crawlers towards the primary content of your website that you want to get indexed. It can also be used to avoid indexation of duplicate content, which may harm your search engine rankings.

Understanding robots.txt Syntax and Rules

When it comes to adding rules or directives in the robots.txt file, the syntax is pretty simple. Let's take a look and understand the basics of adding guidelines (rules) to this file.

If you're creating a new file, you can use any simple text editor. Make sure the file is named exactly robots.txt (all lowercase) and nothing else. That's how crawlers distinguish it from other files.

You can define multiple rules in a single robots.txt file. These rules are organized into separate groups, and each group targets either a specific web crawler or a set of web crawlers.
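To give you a feel for the structure before we go through each directive, here's a minimal sketch of a file with two groups. The bot name is real, but the directory names are just placeholders, and every directive used here is explained in the sections below.

# Group 1: rules for Google's main crawler
User-agent: Googlebot
Disallow: /drafts/

# Group 2: rules for every other crawler
User-agent: *
Disallow: /private/

A visiting crawler follows the group that matches its user agent most specifically, so in this sketch Googlebot would obey the first group and ignore the second.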

Let's see the syntax for defining rules in this file.

Adding Comments

To add a comment in this file, use the # symbol.

# Rules to block indexing of sales data

As you can see in the example, a comment is just human-readable text and is ignored by web crawlers. It's generally used to remind yourself what the rules in the file are all about.

Targeting a Search Engine Crawler

The next syntax defines how you can address or target a crawler. You can target one crawler or multiple crawlers in one go. Use the User-agent rule to target web crawlers.

User-agent: Googlebot

Here, we are targeting Googlebot, which is the most common and well-known search engine crawler. You can see that the User-agent rule is followed by a colon and a blank space.

The colon after the rule name is mandatory when applying a rule in the robots.txt file, and it's good practice to follow it with a blank space.

Blocking Directories and Pages

If you want to block the crawling and indexing of a directory or a web page, use the Disallow directive. Let's see an example.

Disallow: /

When writing directives (rules), the / character refers to the root directory of the website. Here, we're blocking the indexing of the entire website.

If we're blocking a page, the full path to that page should be mentioned. If it is a directory, end the pathname with a trailing / character.

# Block access to the '/pricing.html' page
Disallow: /pricing.html

# Block access to '/sales/' directory
Disallow: /sales/

In the two examples shown above, we're blocking crawlers from accessing both the pricing.html web page and the /sales/ directory.

Opening/Allowing Indexing of Directories and Pages

We have seen how we can block web crawlers from accessing directories and web pages. And, what if we want to keep specific directories and pages open for the web crawlers? Here's how to do it.

Allow: /

In the example shown above, we're using the Allow directive to open the entire website for the web crawlers. In other words, starting from the root directory of a website, a crawler can access every single directory and web page without any restriction.
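In practice, the Allow directive is most useful in combination with Disallow, when you want to re-open a specific path inside an otherwise blocked directory. Here's a small sketch; the directory names are placeholders.

# Block the whole '/sales/' directory, except the public brochures
User-agent: *
Disallow: /sales/
Allow: /sales/brochures/

Major crawlers such as Googlebot apply the most specific matching rule, so /sales/brochures/ remains crawlable while the rest of /sales/ stays blocked.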

Specifying Sitemap Location

And last but not least is the Sitemap directive. It is used to specify the location of your XML sitemap, which helps web crawlers discover the content of your website more easily. An example is given below.

Sitemap: https://www.freshtechtips.com/sitemap.xml

You may notice that a fully-qualified URL is specified for the sitemap. Make sure to specify the correct http or https prefix too. Test the validity of the sitemap URL before including it in the robots.txt file.
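If your website publishes more than one sitemap, you can list each of them on its own Sitemap line. The file names below are placeholders for illustration.

Sitemap: https://www.example.com/sitemap-posts.xml
Sitemap: https://www.example.com/sitemap-pages.xml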

Wildcard Usage

The Allow and Disallow directives support wildcard characters in their paths: the * character matches any sequence of characters, and the $ character anchors a pattern to the end of a URL. The User-agent line also accepts * on its own to address every crawler. Here's a basic example.

User-agent: *

We've used the * wildcard character to target every web crawler. All the rules written in this group will apply to any web crawler visiting the website.
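Wildcards work inside paths as well. As a quick sketch of a common pattern (not specific to any particular site), the following group asks every crawler to stay away from URLs that contain a query string.

# Block crawling of any URL containing a '?' (e.g. filtered or session URLs)
User-agent: *
Disallow: /*?

Rules like this are often used to keep crawlers from wasting time on endless parameter-based variations of the same page.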

Practical Examples of robots.txt Directives

Now that we're familiar with the syntax of robots.txt directives, let's dive into some of the examples to better understand how these rules and guidelines are implemented in a real-world situation.

Allow the Google AdSense Crawler on a Website

If you're running Google AdSense ads on your website, you should give its bot free access. This way, it can scan your content thoroughly, which in turn helps in serving targeted ads.

User-agent: Mediapartners-Google
Disallow:

An empty Disallow directive means nothing is blocked for the named crawler. To open your website to every web crawler in the same way, replace Mediapartners-Google with the * character, as shown below. It's just an example, though; in a real-world situation, you may never want to open your entire site so freely.
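For reference, here's what that substitution would look like. Treat it purely as an illustration, not a recommended configuration.

# Allow every crawler to access everything
User-agent: *
Disallow: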

Block Access to the Entire Website

If your new website is still under development, you may consider blocking all web crawlers until it is complete and ready for launch. Here's how to do it.

User-agent: *
Disallow: /

Remember, blocking all web crawlers is not a good move and should only be considered under exceptional conditions.

Note that the * wildcard in the User-agent line is ignored by several ad-related bots, such as Google's AdsBot crawlers; they must be targeted by name.

If you want to block a specific ad bot, create a separate rule block and explicitly use the user agent string of that bot. Here's an example.

User-agent: AdsBot-Google
Disallow: /

You can refer to a published list of common web crawlers (most search engines and ad networks document theirs) to get familiar with their user agent strings.

Block a Directory or a Web Page

Sometimes, you may want to block access to certain directories or web pages. To do so, you can add the blocking directives in the robots.txt file. Here's an example.

# Block the '/albums/' directory for the Google image bot
User-agent: Googlebot-Image
Disallow: /albums/

# Block a webpage for all web crawlers
User-agent: *
Disallow: /blog/2012/data/confidential-report.html

In the first example, we're blocking Google's image crawler from accessing the /albums/ directory. And in the second one, all web crawlers are blocked from a confidential web page.

Block Access to All the PDF Files

As mentioned before, wildcards can be used in these directives to target multiple assets in a single blocking rule. Let's see it with an example.

User-agent: *
Disallow: /*.pdf$

In this example, access to all files ending with the .pdf extension is blocked for every web crawler. The * matches any sequence of characters in the path, and the $ anchors the pattern to the end of the URL, so only URLs that actually end in .pdf are affected. Through wildcards like these, you can block access to other file types and directory patterns in the same way.

Miscellaneous Tips and Guidelines for Using robots.txt

Now that we're comfortable with writing and editing robots.txt directives, let's quickly go through some of the important guidelines and tips to help you better manage this important file on your website.

  • Testing and Validation - Before uploading a new or edited robots.txt file to your web server, always test and validate it. An error can block web crawlers from indexing your content.
  • Correct Placement - Always make sure you're uploading the robots.txt file to the root directory of your website. A wrong upload location is like having no such file.
  • Careful Blocking - When adding blocking directives, pay special attention and think twice about whether each one is really necessary.
  • Keep Up to Date - Keep yourself informed about the new web crawlers (good or bad) and any updates related to new rules and directives for the robots.txt file.

If you follow the tips given above, your robots.txt file will remain in a healthy condition and you'll be able to use it in the best possible way.

Conclusion

If you want to take control of how web crawlers access your website's content, use the robots.txt file. It'll also help in protecting sensitive and private content.

Whenever you make changes to this file, make sure to double-check it for syntax errors. And lastly, review your robots.txt file from time to time and update it if required.