A robots.txt file is a plain text file that sits at the root of your website and instructs search engine crawlers which pages or directories they are permitted to access. It is one of the most fundamental elements of technical SEO, yet it is frequently misconfigured by site owners who do not fully understand what it does.
If you are new to SEO and how it shapes your website's visibility, understanding robots.txt is an important early step. The file follows the Robots Exclusion Protocol, a standard that has been in use since the early days of the web. When Googlebot or any other crawler visits your site, it looks for this file at yourdomain.co.uk/robots.txt before it crawls anything else.
If you need specialist help reviewing or correcting your setup, StudioHawk's technical SEO services cover full crawl audits, robots.txt configuration, and ongoing optimisation for UK businesses of all sizes.
Robots.txt works by providing a set of instructions that crawlers read before deciding which parts of your site to visit. It does not guarantee anything. The file is a request, not an enforcement mechanism. Most reputable search engine crawlers, including Googlebot, will honour the instructions. Malicious bots will not.
Understanding how search engines crawl is essential context here. Googlebot follows a queue of URLs to visit. When it reaches your site, it reads your robots.txt file first, then decides which URLs to crawl based on your directives. Pages blocked by robots.txt will not be crawled, but they can still appear in search results if external inbound links point to them, because Google can discover a URL without crawling it directly.
This is a critical distinction that many site owners miss. Blocking a page in robots.txt does not remove it from Google's index. To prevent indexing, you need to use a noindex meta tag on the page itself. Robots.txt and noindex serve different purposes and should be used together with care.
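For example, to keep a page out of the index, the standard meta robots tag goes in the page's head. Crucially, the page must remain crawlable in robots.txt, because Google can only see the noindex directive if it is allowed to fetch the page:

```html
<!-- In the page's <head>: block indexing but still allow link discovery -->
<meta name="robots" content="noindex, follow">
```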
A robots.txt file is made up of groups called records, each containing a User-agent line and one or more Disallow or Allow directives. The syntax is straightforward, but small errors can have large consequences.
| Directive | What It Does | Example |
|---|---|---|
| User-agent | Specifies which crawler the rule applies to | User-agent: * (all bots) |
| Disallow | Blocks a crawler from accessing a specific path | Disallow: /admin/ |
| Allow | Permits access to a path within a blocked directory | Allow: /admin/public/ |
| Sitemap | Points crawlers to your XML sitemap location | Sitemap: https://yourdomain.co.uk/sitemap.xml |
| Crawl-delay | Requests a pause between crawler requests (not supported by Google) | Crawl-delay: 10 |
Source: Google Search Central, Robots.txt Specifications, 2024
An asterisk (*) in the User-agent field means the rule applies to all crawlers. You can also write rules specific to individual bots, such as User-agent: Googlebot. Including your XML sitemap URL at the bottom of your robots.txt file is best practice, as it helps crawlers find and prioritise your most important pages efficiently.
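Putting those directives together, a minimal robots.txt might look like this (the paths and domain are illustrative, not a recommendation for any particular site):

```txt
# Rules for all crawlers
User-agent: *
Allow: /admin/public/
Disallow: /admin/

# Point crawlers at the XML sitemap
Sitemap: https://yourdomain.co.uk/sitemap.xml
```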
The most damaging robots.txt mistake is accidentally blocking your entire website from being crawled. This happens more often than you might expect, particularly after a website migration or CMS switch when the default configuration has not been updated. The result is an immediate and severe drop in organic visibility.
Here are the most common errors to watch for:
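As an illustration, each of the following lines is a frequent slip. These are generic examples, not rules taken from any real site:

```txt
Disallow: /            # blocks the entire site; often left over from staging
Disallow: /css/        # blocking CSS or JS can stop Google rendering pages properly
Disallow: /Admin/      # paths are case-sensitive: this does not match /admin/
Noindex: /old-page/    # not a valid robots.txt directive; use a meta tag instead
```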
A well-structured site architecture makes robots.txt configuration far easier, because your URL structure is logical and grouped sensibly. When directories are clearly defined, it is straightforward to block what you do not want crawled and leave everything else accessible.
Crawl budget is the number of pages Googlebot will crawl on your site within a given time period. For smaller sites this is rarely a concern, but for larger eCommerce sites or sites with thousands of URLs, it matters significantly. Wasting crawl budget on low-value pages means your most important content may not be crawled and indexed as frequently as it should be.
Using robots.txt strategically helps direct Googlebot towards the pages that matter most. Consider blocking the following types of URLs to preserve crawl budget for high-value content:
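Typical candidates look like the patterns below. These are illustrative only; always check your own URL structure before blocking anything, because Google supports the * wildcard in paths:

```txt
User-agent: *
Disallow: /*?sort=         # sorted duplicates of category pages
Disallow: /*?sessionid=    # URLs carrying session identifiers
Disallow: /search/         # internal site-search result pages
Disallow: /cart/           # basket and checkout pages
```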
It is worth noting that robots.txt alone does not solve all duplication problems. Canonical tags and structured data work alongside robots.txt to give Google the clearest possible picture of your site. Think of robots.txt as one layer in a broader technical foundation, not a standalone fix.
For eCommerce SEO in particular, managing which faceted navigation URLs are crawlable is one of the most impactful uses of robots.txt. StudioHawk's on-page SEO services include full technical audits that address exactly these kinds of crawl efficiency issues.
You should always test your robots.txt file before and after any changes. Google Search Console's robots.txt report (which replaced the older standalone testing tool) shows how Googlebot fetched and parsed your file and flags any syntax errors, while the URL Inspection tool confirms whether a specific URL is being blocked unexpectedly.
Here is a straightforward process for auditing and validating your robots.txt file:
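You can also sanity-check rules offline before deploying them. The sketch below uses Python's standard urllib.robotparser with hypothetical rules; note that Python's parser applies the first matching rule, so the more specific Allow line is listed before the broader Disallow (Google itself uses longest-match precedence):

```python
from urllib import robotparser

# Hypothetical rules for illustration only; paste in your own file's contents.
RULES = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

# can_fetch(user_agent, url) reports whether that crawler may request the URL
for url in ("https://yourdomain.co.uk/",
            "https://yourdomain.co.uk/admin/settings",
            "https://yourdomain.co.uk/admin/public/page"):
    state = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
    print(url, "->", state)
```

Running a script like this against a list of your most important URLs is a quick regression check before a robots.txt change goes live.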
Understanding the role of FAQ schema and other structured markup signals is also part of ensuring Google can properly understand and present your content once it has been crawled. Good robots.txt configuration simply ensures Google can reach that content in the first place. Getting the foundations right is what separates sites that rank consistently from those that do not.
FAQs
What happens if my site has no robots.txt file?
If your site has no robots.txt file, search engine crawlers will simply crawl your entire site without restriction. For most small sites this is fine, but for larger sites it may mean crawlers spend time on low-value pages rather than your most important content. Your server will return a 404 for the missing file, and Google will proceed to crawl the site normally.
Can robots.txt stop a page from appearing in search results?
Not reliably. Robots.txt prevents a page from being crawled, but Google can still index a URL it has not crawled if it discovers the URL through external links. To prevent a page from appearing in search results, you must use a noindex meta tag or the X-Robots-Tag HTTP header on the page itself.
Does Googlebot have to obey robots.txt?
Googlebot respects robots.txt instructions for crawling, but it is not legally obligated to do so. Google has stated that it treats robots.txt as a strong signal but may still discover and index URLs that are blocked by robots.txt if those URLs receive links from other pages. Other crawlers and bots, particularly malicious ones, may ignore robots.txt entirely.
Should I block my staging or development site?
Yes, you should block your staging or development environment from being crawled by search engines. The most common method is adding a Disallow: / directive for all user agents in the staging site's robots.txt file. However, you must ensure this rule is removed or updated before the site goes live, as a leftover blanket block is one of the most frequent causes of a site disappearing from Google overnight.
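A staging robots.txt that blocks everything is only two lines; just remember to replace it on the production site:

```txt
User-agent: *
Disallow: /
```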
Is robots.txt the same as a meta robots tag?
No, they serve different purposes. Robots.txt is a file that controls crawler access to entire directories or paths on your server. A meta robots tag, placed in the HTML of an individual page, controls how that specific page is indexed and whether links on it are followed. Both can be used together, but they operate at different levels.
How quickly do robots.txt changes take effect?
Google typically caches a robots.txt file for up to 24 hours, so changes you make today may not be reflected in Googlebot's behaviour until the following day. For urgent changes, you can request a re-fetch via Google Search Console, though this does not guarantee immediate processing.
If you are unsure whether your robots.txt file is helping or hurting your site's performance, the team at StudioHawk can help. We carry out thorough technical audits for UK businesses, identifying crawl issues, misconfigurations, and missed opportunities before they cost you rankings. Whether you are a small business or a large eCommerce site, we will make sure the foundations are solid and your most important pages are getting the crawl attention they deserve.
Contact our SEO experts today.
Book a free consultation and see what's possible.