Harvard Launches Million-Book Corpus for AI Training
Harvard University has introduced the Harvard Library Public Domain Corpus, a collection of nearly 1 million public-domain books digitized through the Google Books project. The new resource is roughly five times the size of Books3, an earlier dataset of about 200,000 volumes that was withdrawn over copyright concerns.
The corpus, compiled by Harvard Law Library’s Innovation Lab with funding from Microsoft and OpenAI, includes diverse historical works, legal texts, and books in languages like Czech and Welsh. It is currently accessible only to Harvard students and staff, but plans are underway to make it widely available.
This initiative addresses the ongoing need for high-quality training data for AI models while complying with transparency requirements such as those in the EU’s AI Act. It also shows how public-domain texts can expose AI models to a broader range of perspectives and knowledge.
Source: The Batch