NewsEye: The Ins and Outs of the impresso Project with Maud Ehrmann

This autumn, we are asking experts to share their experiences related to themes which are relevant to the NewsEye project. In this interview, Maud Ehrmann, a research scientist at the École polytechnique fédérale de Lausanne (Switzerland), speaks to us about impresso, an interdisciplinary project which focused on text mining 200 years of historical newspapers.

Could you introduce impresso and your role within the project?

Sure! The project impresso: Media Monitoring of the Past is an interdisciplinary endeavour in which a team of computational linguists, designers and historians collaborate on the datafication of a multilingual corpus of digitised historical newspapers. The primary goals of the project are 1) to improve text mining tools for historical text, 2) to enrich historical newspapers with automatically extracted data, and 3) to integrate such data into historical research workflows by means of a newly developed user interface. Beyond the specific challenges of each of these objectives, the question of how best to adapt text mining tools and their use by humanities scholars is at the heart of the impresso enterprise.

Besides launching the project with my colleagues Marten Düring (C2DH) and Simon Clematide (UZH), I provided overall coordination and worked on data management and technical infrastructure, indexing processes and text mining, in particular named entity processing (in this regard, see the HIPE-2020 shared task we organised and its continuation with the upcoming HIPE-2022).

Much like NewsEye, impresso incorporated digitised newspapers from multiple countries and in multiple languages. What kinds of challenges did you face when processing these corpora?

Well, quite a few: historical newspapers are easy neither to obtain nor to handle!

The first challenge relates to newspaper silos, or the fact that newspaper collections are far from all digitised, and that their access modalities are very heterogeneous due to legal restrictions and digitisation policy constraints. It is, therefore, not easy to access the material. Once we have it, and this is the second challenge, we realise that the data is quite 'big and messy': digitised newspapers consist of different types of data (images, OCR, metadata), which come in very heterogeneous formats despite the existence of standards and frequently feature many inconsistencies (such as duplicates, missing content, incorrect language indication, pages in the wrong order, etc.).

Next, texts are noisy and it is a real challenge for existing tools to deal with imperfect OCR and faulty article segmentation, even more so in a context lacking appropriate linguistic resources. How to enable the exploration of historical newspapers and their semantic enrichments (i.e. various annotation layers added by text mining tools) is yet another challenge: until very recently, almost no interfaces supported the search and discovery of relevant content in large collections of historical newspapers.

Furthermore, since an interface ‘constrains’ what researchers can learn about sources and shapes their workflows through a selection of tools and functionalities, its design must be based on a careful assessment of the needs and practices of humanities researchers. In this regard, I believe the impresso interface pioneered some very innovative approaches. Last, but not least, in a historical research context, critical assessment of the biases inherent in digitisation tools, digitised sources and the annotations extracted from them is paramount for informed use of the data.

Meeting these challenges is not easy but very exciting, for it can only be done in an interdisciplinary framework.

Can you explain how users can access the web application created during the project? How is this access impacted by copyright restrictions?

The impresso app accommodates two access levels according to the copyright restrictions of the impresso corpus collections. Concretely speaking, users who are not logged in can only see items that belong to the public domain. For all other titles, users are asked to sign a non-disclosure agreement whereby they commit to make only personal and/or academic use of the data. Once this agreement is signed, users receive an account, can log in and explore all the collections. For more details on how to use the corpus and the annotations, you can visit our FAQ.

What are the future plans for the impresso project outcomes?

A final release of the app, with bug fixes, more functionality and additional collections, is in the making and should be out in early 2022. Stay tuned with @ImpressoProject!
