ICDAR 2021 Tutorial

The NewsEye pipeline for digitalizing large collections of historical newspapers

7 September 2021, 12:30 - 14:30 CET

Objectives

This tutorial will present a complete digitisation pipeline applied to historical newspapers, from the scanning of documents to ways of accessing them with high-level semantics within a digital library demonstrator. Historical newspapers are a particularly relevant use case: they are unique and detailed records of events, with numerous points of view. Yet, by nature, they were not printed to be preserved; they were produced in large quantities, with the intent to be used on a regular basis and replaced by their next issue. This meant printing on rather cheap paper with cheap ink, with conservation being a very secondary concern.

This tutorial will detail the challenges in digitisation, OCR, layout analysis, article separation, and robust-to-noise, language-independent semantic enrichment, through to indexing and working with large collections of newspapers in multiple languages coming from multiple sources. The tutorial will rely on the document analysis, recognition and understanding pipeline of the H2020 NewsEye project, as well as its newspaper analysis platform.

Speakers

Antoine Doucet (University of La Rochelle)

Antoine Doucet has been a tenured Full Professor in Computer Science at the L3i laboratory of the University of La Rochelle since 2014. Leader of the digital document and contents research group (about 40 people), he is the coordinator of the H2020 project NewsEye, running until 2022 and focusing on improving access to historical newspapers across domains and languages. His main research interests lie in the fields of information retrieval, natural language processing and (text) data mining. The central focus of his work is on the development of methods that scale to very large document collections and that do not require prior knowledge of the data, and are therefore robust to noise (e.g., stemming from OCR) and language-independent. Antoine Doucet has held a PhD in Computer Science from the University of Helsinki (Finland) since 2005 and a research supervision habilitation (HDR) since 2012.

Axel Jean-Caurant (University of La Rochelle)

Dr. Axel Jean-Caurant is a research engineer at the L3i laboratory of the University of La Rochelle. He obtained his PhD in Computer Science in 2018. The focus of his thesis was on the accessibility of documents inside digital libraries. The large number of digital documents available online has changed the way researchers think about information and research. His work addressed two distinct aspects. The first was to understand how researchers use these online platforms and how to train them to understand the changes data has undergone during the digitization process. The second was to study the impact of document quality on accessibility. Axel is now working full time on the NewsEye project and is in charge of developing the demonstrator, which will hold the data that the project's researchers are interested in, along with the different tools developed during the project.

Johannes Michael (University of Rostock)

Johannes Michael, M.Sc., received his Master's degree in Mathematics from the University of Rostock in 2016 with a thesis in the area of Machine Learning with Recurrent Neural Networks. He then immediately started working in the CITlab group on tasks from the Horizon-2020 READ project, including application-oriented academic research, algorithm and technology development, and software and tool implementation, mainly focusing on Automated Text Recognition. His current research in the Horizon-2020 NewsEye project focuses on the visual information extraction of documents such as historical newspapers with the help of Neural Networks.

Jean-Philippe Moreux (National Library of France)

Jean-Philippe Moreux is the Gallica scientific advisor at the Bibliothèque nationale de France (BnF). He works on the BnF's heritage digitization, digital mediation and digital humanities programs, and on applications of AI approaches to heritage institutions. He participates in national and international research projects on these topics. He is also a member and secretary of ai4lam.org, as well as the chairman of the “AI for Libraries” CENL network group. Prior to that, he was the BnF's OCR and digital text formats expert, an IT R&D Engineer and project manager (Alcatel), and a science editor and consultant in the publishing industry (Pearson, Magnard-Vuibert, Le Robert). He holds an engineering degree in computer science from INSA Toulouse (1990) and a mastère spécialisé (MS) in software engineering (CERAM, Sophia Antipolis).

Günter Mühlberger (University of Innsbruck)

Günter Mühlberger works at the Department for German Language and Literature of the University of Innsbruck. He heads the Digitisation and Digital Preservation group as well as the Digital Humanities Research Centre at the University of Innsbruck. Günter is chair of the European Cooperative Society READ-COOP SCE, which runs the Transkribus platform. Since the mid-90s, Günter has worked in the domains of digitization, digital preservation, digital libraries and Digital Humanities. He has initiated and managed a large number of national and international research and digitization projects.

Max Weidemann (University of Rostock)

Max Weidemann, M.Sc., received his Master's degree in Mathematics from the University of Rostock in 2016. In his thesis he applied Conditional Random Fields, a model used in the Machine Learning context, to the field of Automated Text Recognition (ATR). Right after that, he started working in the CITlab group on tasks from the Horizon-2020 READ project, including application-oriented academic research, algorithm and technology development, and software and tool implementation, mainly focusing on ATR. His current research in the Horizon-2020 NewsEye project focuses on the visual information extraction of documents such as historical newspapers with the help of Neural Networks.