📚 Learn About Robots.txt
What is robots.txt?
A robots.txt file tells web crawlers which pages or files they can or can't request from your site. It must sit at the root of the host (e.g., yoursite.com/robots.txt); crawlers only look for it there, and each subdomain needs its own file.
Basic Structure
# Allow all bots to crawl everything
User-agent: *
Allow: /
# Block specific directories
Disallow: /admin/
Disallow: /private/
# Sitemap location
Sitemap: https://yoursite.com/sitemap.xml
Remember: robots.txt is publicly accessible and purely advisory, not a security measure. Anyone can read it, and malicious crawlers can simply ignore it, so use proper authentication for sensitive content.
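Well-behaved crawlers check these rules before requesting a URL. Here is a minimal sketch of that consumer side, using Python's built-in urllib.robotparser (it handles plain prefix rules like those above, but not Google-style wildcards):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse an inline copy of the rules above; in practice you would call
# rp.set_url("https://yoursite.com/robots.txt") followed by rp.read().
rp.parse("""\
User-agent: *
Disallow: /admin/
Disallow: /private/
""".splitlines())

print(rp.can_fetch("*", "https://yoursite.com/admin/settings"))  # False
print(rp.can_fetch("*", "https://yoursite.com/blog/hello"))      # True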
Prevent Duplicate Content
# Block parameter-based duplicates (these rules live inside a User-agent group, e.g. User-agent: *)
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Disallow: /*?utm_
Disallow: /*?ref=
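In the matching rules Google and RFC 9309 use, * matches any run of characters and $ anchors the end of the URL; everything else is a literal prefix. One caveat with the patterns above: /*?utm_ only matches when a utm_ parameter comes immediately after the ?, not after an &. A rough Python sketch of the matching logic, for illustration only:

import re

def rule_matches(rule: str, path: str) -> bool:
    """Approximate Google/RFC 9309 matching: '*' is a wildcard,
    a trailing '$' anchors the end, otherwise the rule is a prefix."""
    anchored = rule.endswith("$")
    pattern = re.escape(rule.rstrip("$")).replace(r"\*", ".*")
    return re.match(pattern + ("$" if anchored else ""), path) is not None

print(rule_matches("/*?sort=", "/products?sort=price"))       # True
print(rule_matches("/*?utm_", "/page?utm_source=mail"))       # True
print(rule_matches("/*?utm_", "/page?id=1&utm_source=mail"))  # False: utm_ follows '&'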
Conserve Crawl Budget
# Block low-value pages
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /print/
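A Disallow rule without wildcards is a simple prefix match, so /search/ also blocks everything beneath it, such as /search/red+shoes. A quick way to sanity-check what a rule set would exclude, run here against a few made-up paths:

BLOCKED_PREFIXES = ["/search/", "/cart/", "/checkout/", "/account/", "/print/"]

def is_blocked(path: str) -> bool:
    # Plain Disallow rules match any path that starts with the rule text.
    return any(path.startswith(prefix) for prefix in BLOCKED_PREFIXES)

for path in ["/search/red+shoes", "/cart/items", "/blog/robots-txt-guide"]:
    print(path, "->", "blocked" if is_blocked(path) else "crawlable")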
Multiple Sitemaps
Sitemap: https://yoursite.com/sitemap-posts.xml
Sitemap: https://yoursite.com/sitemap-pages.xml
Sitemap: https://yoursite.com/sitemap-images.xml
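If the list keeps growing, the sitemaps.org protocol also allows a single sitemap index file that points at the others, so robots.txt needs only one Sitemap line (the file names here are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://yoursite.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://yoursite.com/sitemap-pages.xml</loc></sitemap>
  <sitemap><loc>https://yoursite.com/sitemap-images.xml</loc></sitemap>
</sitemapindex>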
Bot-Specific Rules
# Allow major search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Crawl-delay: 2
Allow: /
# Block resource-heavy SEO crawlers (both are known to honor robots.txt)
User-agent: AhrefsBot
Disallow: /
User-agent: MJ12bot
Disallow: /
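A crawler obeys exactly one of these groups: the one whose User-agent value most specifically matches its own name, falling back to * when nothing matches. A simplified Python sketch of that selection (real matching is done on product tokens, but the longest-match idea is the same):

def select_group(crawler_name: str, groups: dict[str, list[str]]) -> list[str]:
    """Pick the group whose user-agent token is the longest case-insensitive
    prefix of the crawler's name; fall back to the '*' group."""
    name = crawler_name.lower()
    candidates = [ua for ua in groups if ua != "*" and name.startswith(ua.lower())]
    best = max(candidates, key=len, default="*")
    return groups.get(best, [])

groups = {
    "Googlebot": ["Allow: /"],
    "Bingbot": ["Crawl-delay: 2", "Allow: /"],
    "AhrefsBot": ["Disallow: /"],
    "*": [],
}
print(select_group("Googlebot-Image", groups))  # follows the Googlebot group
print(select_group("DuckDuckBot", groups))      # no match, so the '*' group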
Smart API Protection
# Expose only the schema endpoint used for rich snippets; block the rest of the API
Allow: /api/schema/
Disallow: /api/
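The order of the two lines doesn't decide the winner. Under Google's rules and RFC 9309, the most specific (longest) matching rule wins, and Allow wins a tie, which is why /api/schema/ stays crawlable while the rest of /api/ is blocked. A sketch of that tie-break, reusing the rule_matches helper from earlier:

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules are (directive, pattern) pairs. The longest matching pattern
    wins; 'allow' beats 'disallow' at equal length. No match means allowed."""
    hits = [(len(pat), d == "allow") for d, pat in rules if rule_matches(pat, path)]
    return max(hits)[1] if hits else True

rules = [("allow", "/api/schema/"), ("disallow", "/api/")]
print(is_allowed("/api/schema/product", rules))  # True: the longer Allow wins
print(is_allowed("/api/users/42", rules))        # False: only Disallow matches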
Performance Optimization
# Crawl-delay is the number of seconds a bot should wait between requests.
# Bing and Yandex honor it; Googlebot ignores it entirely.
User-agent: Bingbot
Crawl-delay: 1
# Heavily throttle everything else: one request every 30 seconds,
# i.e. at most 2,880 requests per day
User-agent: *
Crawl-delay: 30