A US court allows authors to proceed with a class action accusing Databricks of using pirated books to develop its large language model, marking a significant turn in AI copyright disputes.
Databricks has failed in its bid to throw out a class action brought by a group of authors who say the company helped build its DBRX large language model using pirated copies of their books. In a ruling last week, US District Judge Charles Breyer in Northern California said the writers had put forward enough to keep the case alive, allowing allegations involving roughly 196,000 titles to move forward.
The dispute centres on a chain of development that runs through MosaicML, the AI start-up Databricks bought in 2023. According to the materials described in the case, early Mosaic models were linked to the RedPajama dataset, which incorporated Books3, a collection later removed from Hugging Face over copyright concerns. Databricks has argued that the authors have not shown DBRX itself was trained on the disputed material, but the plaintiffs say the company copied the books during development regardless of what ended up in the final model.
Judge Breyer said the authors had tied their works directly to DBRX and that employee statements supported the inference that the challenged material mattered to the model’s development. The case now appears likely to turn on how much of the training pipeline Databricks can explain, and whether the court accepts the company’s claim that any use of the material was too remote to amount to infringement.
The potential financial exposure is substantial if the writers can persuade the court that any infringement was wilful. Brandon Butler, a copyright lawyer and executive director of Re:Create, told The Register that statutory damages in copyright law can be severe, with awards reaching six figures per work. Among the authors backing the suit are Jason Reynolds, Stewart O’Nan, Brian Keene and Rebecca Makkai, whose The Great Believers was a Pulitzer Prize finalist.
Databricks has not yet advanced the fair use defence that has helped other AI companies in related litigation. Meta won a similar case brought by authors last year, while Anthropic also prevailed on fair use in a separate matter, though it later agreed to set up a $1.5 billion compensation fund after admitting it had ingested pirated books. For now, Databricks is left to fight on with a narrower procedural argument, while the authors insist the company cannot lawfully download, store and reuse pirated works simply because they were not retained in the final training set.
Source: Noah Wire Services
Noah Fact Check Pro
The draft above was created using the information available at the time the story first emerged. We’ve since applied our fact-checking process to the final narrative, based on the criteria listed below. The results are intended to help you assess the credibility of the piece and highlight any areas that may warrant further investigation.
Freshness check
Score:
10
Notes:
The article is dated April 29, 2026, and reports on a recent legal development, indicating high freshness. No evidence of recycled or outdated content was found.
Quotes check
Score:
8
Notes:
The article includes direct quotes from Judge Charles Breyer and Brandon Butler. While the quotes are attributed, they are not independently verifiable through the provided sources. This raises concerns about the ability to confirm the accuracy of these statements.
Source reliability
Score:
7
Notes:
The primary source is The Register, a technology news website. While it is a known publication, it is not as widely recognized as major outlets like the BBC or Reuters. The article cites additional sources, including legal filings and statements from Brandon Butler, a copyright lawyer. However, the reliance on a single source for direct quotes and the lack of independent verification of these quotes reduce the overall reliability.
Plausibility check
Score:
9
Notes:
The claims about the lawsuit and the involvement of authors like Jason Reynolds and Rebecca Makkai are plausible and align with known information. However, the inability to independently verify direct quotes from Judge Breyer and Brandon Butler introduces some uncertainty.
Overall assessment
Verdict (FAIL, OPEN, PASS): CONDITIONAL
Confidence (LOW, MEDIUM, HIGH): MEDIUM
Summary:
The article provides timely and plausible information about the Databricks copyright lawsuit, with direct quotes from Judge Charles Breyer and Brandon Butler. However, the inability to independently verify these quotes and the reliance on a single source for direct quotes reduce the overall reliability. Given these concerns, the content is conditionally acceptable, provided that the direct quotes are independently verified before publication.
