Index of /download/newspapers-by-country

Name          Last modified      Size
README.html   2015-11-04 04:28   14K
POL/          2015-09-27 16:42   -
LVA/          2015-09-27 00:04   -
LUX/          2015-09-27 00:32   -
ITA/          2015-09-26 05:56   -
FRA/          2015-09-25 15:05   -
FIN/          2015-09-26 01:05   -
EST/          2015-09-26 21:32   -
DEU/          2015-09-27 12:05   -
AUT/          2015-09-25 23:50   -

The Europeana Newspapers Corpus

Alastair Dunning

Nuno Freire

October 2015         

Introduction

As part of the Europeana Newspapers project, millions of words of public domain text were created via optical character recognition (OCR) of digitized historic newspapers. The newspapers were sourced from a number of national and research libraries throughout Europe.

Full details of the OCR process are available at www.europeana-newspapers.eu/wp-content/uploads/2012/04/D-2-2_Specification_of_requirements-2.pdf. More details of the project are available at http://www.europeana-newspapers.eu/

The resulting full-text corpus was made available by the library partners for aggregation at The European Library, where it was made searchable via the portal at http://www.theeuropeanlibrary.org/tel4/newspapers.

Europeana and The European Library are now making the metadata records and full text of this corpus available here. Images are currently not available; we will explore the possibility of making them available in 2016.

Status

The export files currently available here contain only part of the Europeana Newspapers Corpus. Additional content is expected to be made available soon, once the formal and technical issues affecting the availability of some content have been processed at The European Library.

The available corpus contains 426 newspaper titles from 10 European libraries. The total size of the files is 70 gigabytes (after compression).

Austria

Provided by: Austrian National Library

Newspaper titles: 84

Size of compressed files: 7,2 gigabytes

Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/AUT/  

Estonia

Provided by: National Library of Estonia

Newspaper titles: 37

Size of compressed files: 2,8 gigabytes

Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/EST/  

Finland

Provided by: National Library of Finland

Newspaper titles: 11

Size of compressed files: 1,1 gigabytes

Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/FIN/    

France

Provided by: National Library of France

Newspaper titles: 20

Size of compressed files: 32,5 gigabytes

Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/FRA/   

Germany

Provided by: Berlin State Library, Hamburg State Library

Newspaper titles: 8

Size of compressed files: 18,5 gigabytes

Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/DEU/  

Italy

Provided by: Teßmann Library

Newspaper titles: 47

Size of compressed files: 4,2 gigabytes

Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/ITA/ 

Latvia

Provided by: National Library of Latvia

Newspaper titles: 101

Size of compressed files: 3,1 gigabytes

Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/LVA/ 

Luxembourg

Provided by: National Library of Luxembourg

Newspaper titles: 2

Size of compressed files: 70 megabytes

Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/LUX/ 

Poland

Provided by: National Library of Poland

Newspaper titles: 116

Size of compressed files: 480 megabytes

Download URL: http://data.theeuropeanlibrary.org/download/newspapers-by-country/POL/ 
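The per-country figures above can be cross-checked against the totals given in the Status section. The following is a minimal sketch of that arithmetic. It assumes that the figure for France is in gigabytes and that 1 gigabyte equals 1000 megabytes; the dictionary name and layout are illustrative only.

```python
# (number of titles, compressed size in gigabytes), copied from the listings above.
# Assumptions: France's "32,5" is in gigabytes; 1 gigabyte = 1000 megabytes,
# so Luxembourg's 70 megabytes is 0.07 and Poland's 480 megabytes is 0.48.
CORPUS = {
    "AUT": (84, 7.2),
    "EST": (37, 2.8),
    "FIN": (11, 1.1),
    "FRA": (20, 32.5),
    "DEU": (8, 18.5),
    "ITA": (47, 4.2),
    "LVA": (101, 3.1),
    "LUX": (2, 0.07),
    "POL": (116, 0.48),
}

total_titles = sum(titles for titles, _ in CORPUS.values())
total_gb = sum(size for _, size in CORPUS.values())

print(total_titles)
print(round(total_gb, 2))
```

Under these assumptions the sums come to 426 titles and just under 70 gigabytes, which agrees with the totals stated in the Status section.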

Technical information about the export files

The corpus is exported to zip archive files, where each file contains the complete content of a newspaper title, that is, all its issues.


The parent (top-level) folder of the zip archive contains a JSON file with a metadata record describing the newspaper title. The metadata fields in these records are described at the end of this section.

The parent folder also contains a subfolder for each year in which issues available in full text were published. Each year folder contains one folder per issue, named with the date of publication in the format “YYYYMMDD”. Each issue folder contains a JSON file holding a metadata record describing the issue, together with the full text of the issue.
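As an illustration of this layout, the sketch below lists the JSON files found inside issue folders of one archive without extracting it. It is not an official tool. The function name is hypothetical, and since the names of the JSON files are not documented here, the code only assumes that they end in ".json" and sit directly inside a folder whose name is eight digits.

```python
import re
import zipfile
from collections import defaultdict

# Issue folders are named with the publication date, YYYYMMDD.
ISSUE_DIR = re.compile(r"^\d{8}$")


def list_issue_files(archive):
    """Map each issue date (YYYYMMDD) to the JSON members in its folder.

    `archive` may be a path or a binary file-like object.
    """
    issues = defaultdict(list)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".json"):
                continue
            parts = name.split("/")
            # parts[-2] is the folder that directly contains the file.
            if len(parts) >= 2 and ISSUE_DIR.match(parts[-2]):
                issues[parts[-2]].append(name)
    # Sorting YYYYMMDD strings gives chronological order.
    return dict(sorted(issues.items()))
```

The title-level JSON file in the parent folder is skipped by this function, because it does not sit inside a date-named folder.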

The JSON files

The JSON files for newspaper titles and issues use fields named after the properties defined in the DCMI Metadata Terms. Please refer to the documentation at http://dublincore.org/documents/dcmi-terms/ for their meaning. The full text of an issue is held in fields named “contentAsText”, and each such field contains the text of a single page. The order of these elements follows the order of the pages in the issue.
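A minimal sketch of reading the page texts from an issue record follows. The exact JSON encoding of “contentAsText” is not specified above, so the code is written under the assumption that it may appear either as one key holding a list of strings or as a key repeated once per page; it accepts both. The function name is hypothetical.

```python
import json


def load_pages(json_text):
    """Return the texts of all "contentAsText" fields in document order."""
    pages = []

    def collect(pairs):
        # json calls this for every JSON object with its (key, value)
        # pairs, so repeated keys are seen here before dict() drops them.
        for key, value in pairs:
            if key == "contentAsText":
                if isinstance(value, list):
                    pages.extend(str(item) for item in value)
                else:
                    pages.append(str(value))
        return dict(pairs)

    json.loads(json_text, object_pairs_hook=collect)
    return pages
```

Joining the returned list, for example with "\n".join(pages), gives the full text of the issue in page order.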

Finally, note that the field “format” is used to provide an estimate of the quality of the OCR. The field is available in the metadata records of both newspaper titles and issues. In issue records, the measure indicates the average OCR confidence across all words of the issue. In title records, it indicates the average OCR confidence across all the issues of the newspaper title. These “format” fields start with the tag “[OCR confidence]” to signal their use for this particular information.

Please note that OCR confidence is a native output of the OCR engine. It is therefore merely indicative and should not be confused with OCR accuracy, which can only be determined by evaluation against a Ground Truth. You may find more information about OCR evaluation in this educational resource.
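The tagged “format” value can be picked out programmatically. The sketch below is written under assumptions, because the layout of the value after the tag and its scale are not documented here: it assumes that a number follows the “[OCR confidence]” tag and that “format” may be a single string or a list of strings. The function name is hypothetical.

```python
import re

OCR_TAG = "[OCR confidence]"
# A decimal number written with either a point or a comma.
NUMBER = re.compile(r"[-+]?\d+(?:[.,]\d+)?")


def ocr_confidence(format_values):
    """Return the first number after the OCR tag, or None if absent."""
    if isinstance(format_values, str):
        format_values = [format_values]
    for value in format_values or []:
        if isinstance(value, str) and value.strip().startswith(OCR_TAG):
            rest = value.strip()[len(OCR_TAG):]
            match = NUMBER.search(rest)
            if match:
                return float(match.group().replace(",", "."))
    return None
```

A returned value should be read as the engine's own confidence estimate, not as a measured accuracy.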

Licence

All full-text is available under:

Creative Commons Public Domain Mark 1.0 (https://creativecommons.org/publicdomain/mark/1.0/)

All metadata is available under CC0 (https://creativecommons.org/publicdomain/zero/1.0/)