Understanding the Robots.txt File

The robots.txt file is a plain text file, placed at the root of your website, that tells search engine crawlers which parts of the site they may access. By creating and maintaining a robots.txt file, you can control which pages are crawled, helping search engines focus on the content you want to appear in results and improving your site's search visibility.
Here is an example of a basic robots.txt file:
# robots.txt for example.com
User-agent: Googlebot
Disallow: /admin/temp/
Allow: /admin/temp/allowed-page.html
User-agent: Bingbot
Disallow: /private/sensitive-data/
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /old-pages/
Disallow: /*.pdf$
Disallow: /*.jpg$
Allow: /images/public-image.jpg
Sitemap: https://www.example.com/sitemap.xml
The robots.txt file consists of directives that specify rules for search engine crawlers. Here are the most common directives (a short sketch showing how a crawler interprets them follows this list):
- User-agent: Specifies the search engine crawler to which the following rules apply. You can target specific crawlers like Googlebot, Bingbot, or all crawlers with an asterisk (*).
- Disallow: Instructs the crawler not to access specific directories or files on your website. Major crawlers such as Googlebot and Bingbot also support wildcards (*) in these paths.
- Allow: Overrides a Disallow directive for a specific page or directory; supported by most major crawlers, including Googlebot and Bingbot.
- Sitemap: Specifies the location of your website's sitemap file, which helps search engines discover and index your pages.
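To see how these directives behave in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. Note that this parser follows the original robots.txt standard, so it does not fully implement Google-style wildcard rules (such as /*.pdf$) or longest-match Allow precedence; the simplified rules below avoid those extensions, and the bot name `SomeOtherBot` is just a hypothetical crawler without its own group.

```python
from urllib import robotparser

# A simplified version of the example file above, limited to rules that
# urllib.robotparser handles according to the original standard.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /admin/temp/

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

parser = robotparser.RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# Googlebot is blocked from /admin/temp/ but can fetch ordinary pages.
print(parser.can_fetch("Googlebot", "https://www.example.com/admin/temp/report.html"))  # False
print(parser.can_fetch("Googlebot", "https://www.example.com/products/widget.html"))    # True

# Crawlers without their own group fall back to the "*" rules.
print(parser.can_fetch("SomeOtherBot", "https://www.example.com/cgi-bin/script.cgi"))   # False
print(parser.can_fetch("SomeOtherBot", "https://www.example.com/about.html"))           # True
```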
Best Practices for Robots.txt
When creating a robots.txt file, consider the following best practices:
- Use specific directives for different search engine crawlers to customize the rules for each.
- Test your robots.txt file with a validator, such as the robots.txt report in Google Search Console, to ensure it is correctly formatted.
- Include a reference to your sitemap file to help search engines discover and index your pages.
- Avoid blocking important pages or resources that you want search engines to index (a quick automated check is sketched after this list).
- Regularly monitor your website's crawl errors in Google Search Console to identify any issues with your robots.txt file.
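Building on the earlier sketch, here is a hedged example of the kind of automated check mentioned above: it fetches your live robots.txt and warns if any page you care about is blocked for Googlebot. The domain and URL list are placeholders, and since `urllib.robotparser` does not evaluate Google-specific wildcard extensions, treat the output as a first-pass check rather than a definitive verdict.

```python
from urllib import robotparser

SITE = "https://www.example.com"           # placeholder domain
IMPORTANT_URLS = [                         # hypothetical pages you want crawled
    f"{SITE}/",
    f"{SITE}/products/",
    f"{SITE}/blog/latest-post.html",
]

parser = robotparser.RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()                              # downloads and parses the live file

for url in IMPORTANT_URLS:
    if parser.can_fetch("Googlebot", url):
        print(f"OK: {url} is crawlable")
    else:
        print(f"WARNING: {url} is blocked for Googlebot")
```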
Filtering Content with Robots.txt
You can use the robots.txt file to prevent search engines from crawling certain types of content on your website. For example, you can block specific file types, such as PDFs or images, by using wildcard directives.
Here are some examples of content filtering using robots.txt:
# Block all PDF files
User-agent: *
Disallow: /*.pdf$
# Block all JPEG images
User-agent: *
Disallow: /*.jpg$
By using wildcard directives like /*.pdf$ and /*.jpg$, you can prevent search engines from crawling specific file types on your website.
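To illustrate how these wildcard patterns match URLs, here is a rough sketch that translates a Google-style pattern into a regular expression, where * matches any sequence of characters and $ anchors the end of the URL. This is an approximation for illustration only, not Google's actual matching code.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regular expression."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile("^" + body + ("$" if anchored else ""))

pdf_rule = pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/reports/annual-2023.pdf")))      # True  -> matched, so blocked
print(bool(pdf_rule.match("/reports/annual-2023.pdf?v=2")))  # False -> $ requires the URL to end in .pdf
print(bool(pdf_rule.match("/reports/summary.html")))         # False -> different file type
```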
Robots.txt Syntax for Filtering
The robots.txt file uses a simple syntax to define rules for search engine crawlers. Here are some common rule patterns:
1. Block Entire Website
Prevents all search engines from accessing any page.
User-agent: *
Disallow: /
2. Block Only HTML Files
Blocks all `.html` pages while allowing other file types.
User-agent: *
Disallow: /*.html$
3. Block a Specific Directory
Prevents crawlers from accessing the `/private/` folder and all files inside it.
User-agent: *
Disallow: /private/
4. Allow a Specific Directory
Overrides a general Disallow rule and allows search engines to access `/public/`.
User-agent: *
Disallow: /
Allow: /public/
5. Block Specific File Types
Prevents crawlers from indexing PDF and TXT files.
User-agent: *
Disallow: /*.pdf$
Disallow: /*.txt$
6. Block a Single Page
Blocks a specific page, such as `thank-you.html`.
User-agent: *
Disallow: /thank-you.html
7. Allow Googlebot but Block Others
Allows Googlebot but prevents all other crawlers from accessing the site (a verification sketch follows this list).
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
8. Block URL Parameters (for Better SEO)
Prevents search engines from indexing URLs with parameters like `?id=123`.
User-agent: *
Disallow: /*?*
9. Block Crawlers Except for a Specific One
Allows Bingbot while blocking all other crawlers.
User-agent: Bingbot
Disallow:
User-agent: *
Disallow: /
10. Block Internal Search Pages
Prevents search engines from crawling site search results, which can create duplicate content issues.
User-agent: *
Disallow: /search
11. Block Crawlers from Crawling Login Pages
Stops search engines from crawling sensitive areas like login and admin pages. Note that robots.txt does not replace proper access controls.
User-agent: *
Disallow: /admin/
Disallow: /login/
12. Block Image Crawling
Prevents Google from indexing images on the site.
User-agent: Googlebot-Image
Disallow: /
13. Block Video Crawling
Prevents Google from indexing video content on your site.
User-agent: Googlebot-Video
Disallow: /
14. Allow Everything (Default Setting)
This is the default state if no robots.txt file exists.
User-agent: *
Disallow:
15. Sitemap Declaration
Helps search engines find the sitemap file for better indexing.
Sitemap: https://example.com/sitemap.xml
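As a quick way to sanity-check a pattern like example 7 above (allow Googlebot, block everything else), the sketch below parses that two-group file with Python's `urllib.robotparser` and confirms which crawlers are allowed. The bot name `SomeOtherBot` is just a stand-in for any crawler without its own group.

```python
from urllib import robotparser

# The rules from example 7: allow Googlebot, block all other crawlers.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/any-page.html"))     # True
print(parser.can_fetch("SomeOtherBot", "https://example.com/any-page.html"))  # False
```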
It is important to note that the robots.txt file is publicly accessible, and not all crawlers honor its directives. While most major search engines follow the rules specified in robots.txt, it is not a foolproof way to keep pages out of search results: a page blocked from crawling can still be indexed if other sites link to it. It is also important to regularly review and update your robots.txt file to ensure that search engines are crawling and indexing your website correctly.