24 October 2025

Reddit Sues Perplexity AI Over Massive Data Scraping

The lawsuit alleges that Perplexity and data partners evaded safeguards to harvest Reddit content, raising stakes in the debate over AI training data and copyright.

Reddit, the online platform famous for its sprawling communities and lively debates, has launched a high-stakes lawsuit against Perplexity AI and three data-scraping companies, thrusting the ongoing battle over digital content and artificial intelligence into the spotlight. Filed on October 22, 2025, in the Southern District of New York, the case accuses Perplexity AI—a San Francisco startup known for its chatbot and “answer engine”—along with Oxylabs UAB, AWMProxy, and SerpApi, of what Reddit calls “industrial-scale” scraping of its user-generated content.

According to court documents and statements reported by The New York Times, Bloomberg, and CNBC, Reddit claims these companies worked together, using sophisticated technical methods to bypass the platform’s barriers and harvest vast quantities of posts, comments, and discussions. The data, Reddit alleges, was then funneled to Perplexity, which integrated it into its AI-driven services—without permission, payment, or licensing agreements.

Reddit’s chief legal officer, Ben Lee, described the situation as fueling an “industrial-scale ‘data laundering’ economy,” with Reddit’s massive trove of user-generated content as a prime target. In a statement to The Post, Lee said, “These data scrapers mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search.” He further accused Perplexity of being “a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself.”

The lawsuit alleges that over a two-week period in July 2025, the three data firms accessed nearly three billion search engine results pages (SERPs), using proxy servers, altered user agents, and other techniques to evade detection. Reddit likened their tactics to “would-be bank robbers, who, knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead.” The company claims that after it issued a cease-and-desist letter to Perplexity, the AI platform’s use of Reddit content actually increased forty-fold, a detail that underscores the escalating nature of the dispute.

Perplexity, for its part, has denied the allegations and accused Reddit of “extortion.” In a statement posted on Reddit, the company argued that it does not train its models on Reddit content but instead offers summaries and citations of public discussions, which it said makes it “impossible” to sign a license agreement. “A year ago, after explaining this, Reddit insisted we pay anyway, despite lawfully accessing Reddit data. Bowing to strong arm tactics just isn’t how we do business,” Perplexity added, characterizing Reddit’s legal action as “a show of force in Reddit’s training data negotiations with Google and OpenAI.” The company maintains that its approach is “principled and responsible,” and that it opposes any threats to openness and the public interest.

Other defendants have responded in kind. SerpApi told CNBC it “strongly disagrees” with Reddit’s claims and will vigorously defend itself in court. Denas Grybauskas, chief governance and strategy officer at Oxylabs, stated to The Post that Oxylabs “will not hesitate to defend itself against these allegations,” insisting that the company “has always been and will continue to be a pioneer and industry leader in public data collection.” AWMProxy did not immediately respond to requests for comment.

This isn’t the first time Reddit has taken legal action over its data. In June 2025, the company filed a similar complaint against Anthropic, another AI firm, and has also struck licensing deals with OpenAI and Google to allow controlled, compensated use of its content. According to Reddit COO Jen Wong, such licensing deals now account for close to 10% of the firm’s revenue—a significant shift as platforms seek to monetize their vast repositories of user-generated material.

Reddit’s complaint also points to broader industry trends. With over 110 million daily active users and more than 22 billion posts and comments, Reddit’s content is a goldmine for AI companies eager to train their models on authentic, human-created data. Researchers and analysts have long noted that Reddit’s moderation and volume make it especially valuable for producing conversational AI outputs. But as the arms race for quality data heats up, questions about copyright, fair use, and the ethics of scraping have become flashpoints in the tech world.

Copyright law, in particular, is proving to be a battleground. While some AI companies have negotiated multimillion-dollar licensing deals with publishers and platforms, others, like Perplexity, have argued that their use of publicly available data falls under “fair use.” Courts are now being asked to determine where the line lies, with recent cases involving Meta and Anthropic resulting in fair use victories for the AI companies. The legal uncertainty has left both publishers and AI developers navigating a murky landscape, where the rules of engagement are far from settled.

In the meantime, publishers have limited tools to protect their content. The robots.txt protocol, for instance, can tell bots what information they can and cannot scrape, but it isn’t legally binding. As a result, many platforms are turning to the courts to enforce their rights and set new precedents for the digital economy.
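
To make the robots.txt mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult such a file before fetching a page. It uses Python's standard `urllib.robotparser` module. The rules, the bot name `ExampleAIBot`, and the `example.com` URLs are hypothetical illustrations, not Reddit's actual robots.txt or any real crawler. Note that the check is entirely voluntary: nothing in the protocol stops a bot that skips it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration only (not Reddit's real file).
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The named bot is barred from the whole site.
print(is_allowed("ExampleAIBot", "https://example.com/r/python/"))  # False
# Any other bot falls under the wildcard rules.
print(is_allowed("OtherBot", "https://example.com/r/python/"))      # True
print(is_allowed("OtherBot", "https://example.com/private/data"))   # False
```

Because the rules are matched against the user-agent string the bot itself reports, a crawler that disguises its user agent, as the lawsuit alleges, would simply be evaluated under a different rule or none at all.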

Industry experts say the outcome of Reddit’s lawsuit could have far-reaching implications. If successful, it might force Perplexity and similar companies to pay hefty damages or enter into licensing agreements, reshaping how AI firms source and value online content. It could also accelerate the adoption of standardized data-sharing frameworks, balancing the need for innovation with respect for intellectual property.

For marketers, the case is a wake-up call. As EMARKETER notes, brands may need to diversify the AI tools they rely on and seek legal indemnification clauses with partners to avoid risks tied to scraped data. The case also highlights the growing importance of first-party data and direct partnerships with content creators.

As the legal battle unfolds, one thing is clear: the fight over who owns, controls, and profits from the digital world’s vast reserves of human expression is only just beginning. The outcome could set the tone for years to come, shaping the relationship between content creators, AI innovators, and the audiences they serve.