As part of the Europeana Newspapers project, millions of words of public domain text were created via optical character recognition (OCR) of digitized historic newspapers. They were sourced from a number of national and research libraries throughout Europe.
Full details of the OCR process are available at www.europeana-newspapers.eu/wp-content/uploads/2012/04/D-2-2_Specification_of_requirements-2.pdf. More details of the project are available at http://www.europeana-newspapers.eu/
The resulting full-text corpus was made available by the library partners for aggregation at The European Library, where it was made searchable via its portal at http://www.theeuropeanlibrary.org/tel4/newspapers.
Europeana and The European Library are now making the metadata records and full text of this corpus available here. Images are currently not available; we will explore the possibility of making them available in 2016.
The export files currently available here contain only part of the Europeana Newspapers Corpus. Additional content is expected to be made available soon, as formal and technical issues affecting the availability of some content are being resolved at The European Library.
The available corpus contains 426 newspaper titles from 10 European libraries. The total size of the files is 70 gigabytes (after compression).
Provided by: Austrian National Library
Newspaper titles: 84
Size of compressed files: 7,2 gigabytes
Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/AUT/
Provided by: National Library of Estonia
Newspaper titles: 37
Size of compressed files: 2,8 gigabytes
Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/EST/
Provided by: National Library of Finland
Newspaper titles: 11
Size of compressed files: 1,1 gigabytes
Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/FIN/
Provided by: National Library of France
Newspaper titles: 20
Size of compressed files: 32,5 gigabytes
Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/FRA/
Provided by: Berlin State Library, Hamburg State Library
Newspaper titles: 8
Size of compressed files: 18,5 gigabytes
Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/DEU/
Provided by: Teßmann Library
Newspaper titles: 47
Size of compressed files: 4,2 gigabytes
Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/ITA/
Provided by: National Library of Latvia
Newspaper titles: 101
Size of compressed files: 3,1 gigabytes
Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/LVA/
Provided by: National Library of Luxembourg
Newspaper titles: 2
Size of compressed files: 70 megabytes
Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/LUX/
Provided by: National Library of Poland
Newspaper titles: 116
Size of compressed files: 480 megabytes
Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/POL/
The corpus is exported as zip archive files. Each file contains the complete content of one newspaper title, that is, all its issues.
The top-level folder of each zip archive contains a JSON file with a metadata record describing the newspaper title. The metadata fields present in the records are described at the end of this section.
The top-level folder also contains one subfolder for each year in which issues were published and are available in full text. Each year folder contains one folder per issue, named with the date of publication in the format “YYYYMMDD”. Each issue folder contains a JSON file holding a metadata record describing the issue together with the full text of the issue.
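The folder layout described above can be walked with a few lines of Python. The following is a minimal sketch, not an official tool: it assumes that member paths inside the archive end in `<YYYY>/<YYYYMMDD>/<file>.json`, and the function names `group_issue_json` and `issues_in_archive` are hypothetical. The actual file names inside the issue folders are not specified here, so the sketch accepts any `.json` file.

```python
import re
import zipfile
from collections import defaultdict

# Assumed layout (not confirmed by the documentation):
#   <anything>/<YYYY>/<YYYYMMDD>/<file>.json
ISSUE_PATH = re.compile(r"(?:^|/)(\d{4})/(\d{8})/[^/]+\.json$")


def group_issue_json(member_names):
    """Map each issue date (YYYYMMDD) to the JSON members in its folder."""
    issues = defaultdict(list)
    for name in member_names:
        match = ISSUE_PATH.search(name)
        # Keep only entries whose year folder agrees with the issue date.
        if match and match.group(2).startswith(match.group(1)):
            issues[match.group(2)].append(name)
    return dict(issues)


def issues_in_archive(zip_path):
    """List the issue JSON files of one downloaded newspaper title archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return group_issue_json(archive.namelist())
```

The title-level JSON file sits outside any year folder, so it does not match the pattern and is left out of the result.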
The JSON files
The JSON files for newspaper titles and issues use fields named after the properties defined in the DCMI Metadata Terms; please refer to the documentation at http://dublincore.org/documents/dcmi-terms/ for their meaning. The full text is held in fields named “contentAsText”, and each such field contains the text of a single page. The order of these fields follows the order of the pages of the issue.
Finally, note that the field “format” is used to provide an estimate of the quality of the OCR. The field is available in the metadata records of both newspaper titles and issues. In issue records, the measure indicates the average OCR confidence across all words of the issue. In title records, it indicates the average OCR confidence across all issues of the newspaper title. These “format” fields start with the tag “[OCR confidence]” to signal their use for this particular information. Please note that OCR confidence is a native output of the OCR engine; it is therefore merely indicative and should not be confused with OCR accuracy, which can only be determined by evaluation against a Ground Truth. You may find more information about OCR evaluation in this educational resource.
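As an illustration of how these fields might be read, here is a hedged Python sketch. Several details are assumptions rather than documented facts: that each record is a JSON object, that “contentAsText” and “format” may hold either a single string or a list of strings, that the confidence value is the first number following the “[OCR confidence]” tag, and that the files are UTF-8 encoded. All function names are hypothetical.

```python
import json
import re

OCR_TAG = "[OCR confidence]"


def as_list(value):
    """Normalise a JSON value that may be absent, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def ocr_confidence(record):
    """Return the first number after the OCR tag in a "format" value, or None."""
    for entry in as_list(record.get("format")):
        if isinstance(entry, str) and entry.strip().startswith(OCR_TAG):
            rest = entry[entry.index(OCR_TAG) + len(OCR_TAG):]
            # Assumption: the value is a plain decimal, with "." or "," separator.
            number = re.search(r"\d+(?:[.,]\d+)?", rest)
            if number:
                return float(number.group().replace(",", "."))
    return None


def page_texts(record):
    """Return the page texts in the order they appear in the record."""
    return [t for t in as_list(record.get("contentAsText")) if isinstance(t, str)]


def load_record(path):
    """Load one title or issue record (UTF-8 encoding is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Because the value is a confidence estimate and not a measured accuracy, it is better suited to ranking or filtering issues than to reporting error rates.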
All full-text is available under:
Creative Commons Public Domain Mark 1.0 (https://creativecommons.org/publicdomain/mark/1.0/)
All metadata is available under CC0 (https://creativecommons.org/publicdomain/zero/1.0/)