Sunday, March 17, 2024

How to Block the ChatGPT and GPTBot Crawlers from Indexing Your Blog


In the vast landscape of the internet, search engine bots play a crucial role in indexing and cataloging web content. However, with the rise of AI-powered crawlers, website owners may find it necessary to tailor their crawling permissions to exclude specific AI bots. This can be achieved through the use of the robots.txt file, a simple yet effective tool for controlling bot access.

Blocking AI Crawlers with Robots.txt

The process of blocking AI crawlers using the robots.txt file follows a standard syntax:

User-agent: {AI-Crawler-Bot-Name-Here}
Disallow: /

To effectively block OpenAI crawlers, you can add the following directives to your robots.txt file:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
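
As a quick sketch, here is what a complete robots.txt might look like with these directives in place, assuming you still want to allow all other crawlers:

# Block OpenAI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Allow all other crawlers
User-agent: *
Disallow: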

Considerations for OpenAI Blocking

It's important to note that OpenAI uses separate user agents for web crawling (GPTBot) and for browsing on behalf of users (ChatGPT-User), each with its own published CIDR and IP ranges. Blocking these bots at the network level, rather than simply asking them to stay away via robots.txt, requires some familiarity with networking concepts and root-level access to your server.
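
If you want to see exactly what you would be blocking before touching the firewall, you can print the published GPTBot ranges first; this assumes the gptbot-ranges.txt file used in the script further below is still served as a plain-text list of CIDRs:

# Print the published GPTBot CIDR ranges to the terminal
wget -q -O - https://openai.com/gptbot-ranges.txt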

Implementing Firewall Rules

Implementing firewall rules to block OpenAI's CIDR or IP ranges is an effective strategy for those comfortable with Linux and server administration. Here's an example using the UFW (Uncomplicated Firewall) command:

sudo ufw deny proto tcp from 23.98.142.176/28 to any port 80
sudo ufw deny proto tcp from 23.98.142.176/28 to any port 443

Alternatively, you can use a shell script to automate the blocking process:

#!/bin/bash
# Block the published OpenAI GPTBot CIDR ranges with UFW
file="/tmp/out.txt.$$"

# Download the current list of CIDR ranges
wget -q -O "$file" https://openai.com/gptbot-ranges.txt 2>/dev/null || exit 1

# Deny HTTP and HTTPS traffic from each range
while IFS= read -r cidr
do
    [ -n "$cidr" ] || continue
    sudo ufw deny proto tcp from "$cidr" to any port 80
    sudo ufw deny proto tcp from "$cidr" to any port 443
done < "$file"

# Clean up the temporary file
[ -f "$file" ] && rm -f "$file"
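
After the script finishes, a quick way to confirm the deny rules were added is to list the active UFW rules:

sudo ufw status numbered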

By incorporating these techniques, website owners can effectively manage bot access and maintain control over their online content.
