The New York Times Looks to Protect IP Content in Era of AI

Newsrooms stand to benefit greatly from AI language models, but at this early stage they have begun laying down boundaries to ensure that their data is not co-opted by third parties to build artificial intelligence, leaving them free to develop models of their own or license their proprietary IP. As industries await federal regulation, The New York Times has proactively updated its terms of service to prohibit scraping of its content for machine learning. The move follows a Google policy refresh that expressly states the company uses search data to train AI.

“We use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities,” Google’s privacy policy discloses.

Although the concept has yet to be legally tested, content behind paywalls is presumably "publicly available" if anyone can purchase a subscription. The general belief is that existing data has already been scraped to train LLMs like Google's PaLM, OpenAI's GPT, Meta's LLaMA and others.

“The genie is really out of the bottle,” writes Popular Science. “The training data has now been used and, since the models themselves consist of layers of complex algorithms, can’t easily be removed,” PopSci continues, explaining that “the fight is now over access to training data for future models — and, in many cases, who gets compensated.”

Although purveyors of large language AI models have been secretive about how those models are trained, court challenges can potentially force disclosure. Legal action is already underway, and the new NYT terms of service suggest violators may trigger more.

NPR reports that lawyers for NYT "are exploring whether to sue OpenAI" to protect the IP rights associated with its content. A central question will be whether "fair use" applies.

The Verge reports that NYT "signed a $100 million deal with Google back in February that allows the search giant to feature Times content across some of its platforms over the next three years," which means "it's possible that the changes to the NYT terms of service are directed at other companies like OpenAI or Microsoft."

According to Adweek, The New York Times TOS defines content as including but "not limited to text, photographs, images, illustrations, designs, audio clips, video clips, 'look and feel' and metadata, including the party credited as the provider of such content." The updated TOS also forbids Web crawlers, which index for search results, from using data they collect to train AI systems.

OpenAI this month launched the GPTBot web crawler, which Adweek says lets publishers control access to website content, but notes that “significant players in the field, namely Microsoft’s Bing and Google’s Bard, have not added this functionality to their bots, leaving publishers struggling to control what the crawlers scrape.”
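The control OpenAI offers works through the long-standing robots.txt convention: GPTBot identifies itself with its own user-agent token, which publishers can disallow. A minimal sketch of such a file, placed at a site's root, might look like the following (the path in the commented-out variant is hypothetical):

```
# robots.txt — block OpenAI's GPTBot from the entire site
User-agent: GPTBot
Disallow: /

# Variant: permit only a chosen section (hypothetical path)
# User-agent: GPTBot
# Allow: /public/
# Disallow: /
```

The limitation the article describes follows directly: a disallow rule only binds crawlers that declare a dedicated token and honor it, which is why publishers have no equivalent lever for bots that do not offer one.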

An April analysis of Web crawler data by The Washington Post "found evidence that content from 15 million websites, including The New York Times, have been used to train LLMs" including LLaMA and Google's own T5, per Adweek.

One result is that several news orgs have banded together to demand IP protection against model training.

Related:
Potential NYT Lawsuit Could Force OpenAI to Wipe ChatGPT and Start Over, Ars Technica, 8/17/23
