Many websites intend only to block scrapers but end up blocking search engine spiders as well.
I’ve also seen robots.txt files full of professional-looking User-Agent entries, none of which actually worked, while the server was still being heavily crawled.
Next, from a practical webmaster’s perspective, let’s talk about how the User-Agent lines in robots.txt should actually be written to avoid the common pitfalls.

The User-agent line is how you tell search engines and other crawlers which bots a set of rules is intended for.
For example, the most common case:

```
User-agent: *
Disallow: /admin/
```
The * here represents all crawlers, including search engine bots, scraping tools, and certain automation scripts.
And if you write it like this:

```
User-agent: Googlebot
Disallow: /test/
```
That means only Google’s crawler is restricted, while other crawlers are unaffected.
So whether your User-Agent is written correctly directly determines whether your robots rules actually take effect.
Many beginners make the mistake of judging identity solely by the crawler name: they see Googlebot in the request header and assume it must be Google’s spider.
In reality, there are too many tools today that can spoof User-Agents, so relying on strings alone is unreliable.
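To see how easy spoofing is, here is a minimal sketch using Python’s requests library; the target URL is just a placeholder, and the UA string follows one of Googlebot’s published formats:

```python
# Minimal sketch: any HTTP client can claim to be Googlebot.
# The target URL is a placeholder for illustration only.
import requests

headers = {
    "User-Agent": (
        "Mozilla/5.0 (compatible; Googlebot/2.1; "
        "+http://www.google.com/bot.html)"
    )
}
response = requests.get("https://example.com/", headers=headers)
print(response.status_code)
```

The server’s access log will record a request that looks exactly like Googlebot, even though it came from an ordinary script.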
This is where User-Agent parsing comes into play:
• Whether it conforms to the official UA specification
• Whether it carries reasonable system information
• Whether it matches the IP range
• Whether its behavior resembles a normal search engine crawler
This is why some webmasters allow Googlebot in robots.txt but still have their servers overwhelmed by abnormal crawling.
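One practical way to catch this is the reverse-DNS check that Google itself documents: reverse-resolve the client IP, confirm the hostname ends in googlebot.com or google.com, then forward-resolve that hostname and make sure it points back to the same IP. A minimal Python sketch (the sample IP is only illustrative):

```python
# Sketch of the documented reverse-DNS / forward-DNS check for Googlebot.
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False

# Example call; 66.249.66.1 is only an illustrative address.
print(is_real_googlebot("66.249.66.1"))
```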
The common “allow everything” template (User-agent: * followed by an empty Disallow:) is fine, but only if you truly don’t intend to restrict any crawlers.
If you have admin panels, test directories, or duplicate-content pages, it’s better to add separate rules.
A safer approach is:
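For example, something along these lines (the directory names are only illustrative; note that a crawler which has its own group ignores the * group, so shared rules need to be repeated there):

```
# Rules for all other crawlers
User-agent: *
Disallow: /admin/
Disallow: /test/

# Googlebot reads only its own group, so repeat the shared rules
User-agent: Googlebot
Disallow: /admin/
Disallow: /test/
Disallow: /old-duplicate-pages/
```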
The benefits of doing this include:
• Better readability
• Easier troubleshooting later
• Avoiding accidental blocking by wildcard rules
Some online tutorials tell you to block a long list of impressive-looking UAs, but many of them don’t actually exist. robots.txt won’t throw an error, yet those rules are effectively useless.
Many modern scrapers and automation tools don’t care about robots.txt at all. What they really focus on is:
• Browser fingerprint detection
• Behavior patterns
• Request frequency
• JavaScript execution capability
In other words, even if your User-Agent rules are written perfectly, an attacker can still simulate a crawler that “looks completely legitimate” unless you have at least basic browser fingerprint detection in place.
This is why many sites now combine fingerprint identification with behavior analysis for access control.
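As a tiny example of the behavioral side, here is a sketch of a per-IP sliding-window frequency check; the 60-second window and 120-request threshold are arbitrary values chosen for illustration:

```python
# Minimal sliding-window rate check per client IP.
# The 60-second window and 120-request threshold are illustrative values.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120

_requests: dict[str, deque] = defaultdict(deque)

def is_rate_limited(client_ip: str) -> bool:
    now = time.monotonic()
    timestamps = _requests[client_ip]
    timestamps.append(now)
    # Drop entries that fell outside the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    return len(timestamps) > MAX_REQUESTS
```

Real deployments usually combine several such signals rather than relying on any single threshold.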
Using the ToDetect fingerprint lookup tool as an aid, you can check:
• Whether the UA is reused by a large number of tools
• Whether there are abnormal fingerprint combinations
• Whether it matches a normal browser environment
• Whether there are obvious automation characteristics
This step is extremely helpful for determining “real spiders vs. fake crawlers,” especially for medium to large websites.
• Paths in robots.txt are case-sensitive, and User-Agent names should follow the official capitalization
• Don’t write conflicting rules under the same User-Agent
• After modifying robots.txt, clear caches and retest
• Search engines apply robots rules with delay—they don’t take effect immediately
Ignoring these details often leads people to mistakenly believe that “robots.txt doesn’t work.”
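For the retesting step, Python’s built-in urllib.robotparser is a quick way to confirm what a given crawler is actually allowed to fetch; the domain and paths below are placeholders:

```python
# Quick check of a live robots.txt with the standard library.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetches and parses the current file

print(parser.can_fetch("Googlebot", "https://example.com/test/page.html"))
print(parser.can_fetch("*", "https://example.com/admin/"))
```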
If your website is already being heavily scraped or flooded with traffic, don’t expect a single User-Agent rule to “block everything.”
User-Agent entries in robots.txt must be real and follow the standard names. Don’t rely on the User-Agent alone; combining UA parsing with behavior analysis is far more reliable.
Preventing scraping starts with robots.txt, but browser fingerprint detection from ToDetect is the real key. If your site is large, you should definitely move one step further into fingerprinting and behavioral analysis.