Web Crawler Archives

Cloudflare Blocking Web Bots from Scraping AI Training Data

By Paula Parisi
July 9, 2024

Cloudflare has a new tool that can block AI from scraping a website’s content for model training. The no-code feature is available even to customers on the free tier. “Declare your ‘AIndependence’” by blocking AI bots, scrapers and crawlers with a single click, the San Francisco-based company urged last week, simultaneously releasing a chart of frequent crawlers by “request volume” on websites using Cloudflare. The ByteDance-owned Bytespider was number one, presumably gathering training data for its large language models “including those that support its ChatGPT rival, Doubao,” Cloudflare says. Amazonbot, ClaudeBot and GPTBot rounded out the top four.

The New York Times Looks to Protect IP Content in Era of AI

By Paula Parisi
August 18, 2023

Newsrooms stand to benefit greatly from AI language models, but at this early stage they have begun setting boundaries so that third parties do not co-opt their data to build artificial intelligence, and so that they survive long enough to create models of their own or license their proprietary IP. As industries await regulations from the federal government, The New York Times has proactively updated its terms of service to prohibit data-scraping of its content for machine learning. The move follows a Google policy refresh that expressly states it uses search data to train AI.