Major publishers restrict Internet Archive over AI training fears, risking loss of web history

As news outlets block the Wayback Machine amid fears of unauthorised AI use, critics warn that crucial web history could vanish, challenging the balance between copyright interests and digital preservation.

The Internet Archive’s Wayback Machine has long functioned as the web’s institutional memory, preserving pages that might otherwise disappear through deletion, redesign or pressure from powerful interests. In the Philippines, the archive helped preserve material linked to the controversy over articles about Senator Tito Sotto and Pepsi Paloma after those pages were taken down, underlining how easily the online record can be altered once a publisher removes content.

That role is now being strained by an argument that has nothing to do with old links and everything to do with artificial intelligence. According to reporting by WIRED, a growing number of major news organisations are blocking the Internet Archive’s crawler because they fear archived material is being used by AI companies to train models without permission. The New York Times has said the issue is that its archived content is being used in ways that violate copyright law and compete with its business.

The scale of the shift is significant. Early this year, Nieman Lab found that 241 outlets across nine countries were restricting at least one of the archive’s bots, while other reports said those blocking the archive now include major publishers such as USA Today Co., The New York Times, Reddit and, in a more limited form, The Guardian. By early 2026, the Wayback Machine had passed one trillion archived pages, according to reporting cited by Rappler, but that expanding library is increasingly being fenced off even as demand for historical web records remains strong.

The contradiction is sharpest where publishers themselves rely on the archive. USA Today has used the Wayback Machine in reporting on changes to US Immigration and Customs Enforcement detention statistics, even as its parent company blocks the archive from preserving its own pages. Mark Graham, who leads the Wayback Machine, told WIRED that publishers can benefit from the archive’s records while limiting access to them, describing the broader standoff with AI firms as collateral damage.

There is, however, a real reason publishers are nervous. A Washington Post analysis found that Internet Archive data has appeared in major training datasets, and that has reinforced fears that publicly archived pages can be repurposed downstream for commercial AI systems. Industry observers say the problem is not simply technical but economic: as search summaries and chatbots reshape how audiences reach news, publishers are looking for ways to stop their content being reused without compensation.

Against that backdrop, digital-rights groups are urging newsrooms not to punish the archive for disputes it did not create. Fight for the Future, the Electronic Frontier Foundation and Public Knowledge backed an open letter praising the Internet Archive’s role in preserving the public record, saying it has helped keep citations alive on millions of Wikipedia entries and does not engage in paywall circumvention or irresponsible scraping. The EFF warned that if major publishers keep locking the archive out, large parts of the historical record may simply vanish, even as courts continue to sort out the separate legal fight over AI training.

Source: Noah Wire Services

Noah Fact Check Pro

The draft above was created using the information available at the time the story first emerged. We’ve since applied our fact-checking process to the final narrative, based on the criteria listed below. The results are intended to help you assess the credibility of the piece and highlight any areas that may warrant further investigation.

Freshness check

Score:
8

Notes:
The article was published on April 15, 2026, and reports on recent actions by major news outlets blocking the Internet Archive’s Wayback Machine. Similar reports have appeared in the past few days, indicating that the narrative is current and not recycled. However, the earliest known publication date of substantially similar content is not specified, so a slight reduction in score is warranted.

Quotes check

Score:
7

Notes:
The article includes direct quotes from Mark Graham, director of the Wayback Machine, and other sources. However, the earliest known usage of these quotes cannot be determined from the available information, raising concerns about their originality. Without independent verification of these quotes, a moderate reduction in score is appropriate.

Source reliability

Score:
6

Notes:
The article originates from Rappler, a reputable news organisation. However, the article relies heavily on reports from other sources, including TechRadar, Fortune, and Forbes, without providing direct access to the original sources. This reliance on secondary reporting raises concerns about the independence and reliability of the information presented. Additionally, the article includes a significant amount of aggregated content, which may affect its originality.

Plausibility check

Score:
7

Notes:
The claims about major news outlets blocking the Wayback Machine due to AI concerns are plausible and align with recent industry trends. However, the article lacks specific factual anchors, such as names, institutions, and dates, which makes it difficult to independently verify the claims. The tone of the article is consistent with typical reporting on this topic, but the lack of detailed supporting evidence raises questions about its credibility.

Overall assessment

Verdict (FAIL, OPEN, PASS): FAIL

Confidence (LOW, MEDIUM, HIGH): MEDIUM

Summary:
The article presents plausible claims about major news outlets blocking the Wayback Machine due to AI concerns. However, it relies heavily on aggregated content from other sources without providing direct access to the original sources, raising concerns about the independence and reliability of the information. The lack of specific factual anchors and independent verification further diminishes the article’s credibility. Therefore, the overall assessment is a FAIL with MEDIUM confidence.