A perspective on research on digitised newspapers at DH2019

by Mikko Tolonen, University of Helsinki

Mikko Tolonen is head of the Helsinki Computational History Group.

Over the past decade, newspapers and newspaper projects have become a crucial part of the digital humanities. Australia has its Trove; Europeana has been running for more than ten years; the Library of Congress and other libraries in the US are working on newspaper projects; and at the same time individuals and institutions in South America, Asia and Africa are doing their best to preserve our cultural heritage.

Researchers are also finally getting their hands on data and beginning to work together with libraries towards new objectives. In short, digitisation projects are becoming everyone’s business. It is simply not enough to think about newspaper digitisation as a process where printed objects are turned into PDFs without any possibility of accessing and modifying the data for research purposes. Andrew Prescott expressed this eloquently during the DH2019 conference by stating that, with respect to digitised newspaper collections, “searchability has often been an afterthought” and not something integrated into the process from the beginning. Meanwhile, it is good to remember that only a fraction of newspapers has actually been digitised so far (so there is still a chance to learn from past mistakes and establish best practices).

Below I will discuss some aspects of the research on newspapers that was visible at the DH2019 conference in July 2019, and bring in impressions of the conference from its Twitter feed.

ADHO in Utrecht 2019

ADHO’s DH2019 conference in Utrecht turned out to be the largest global digital humanities conference to date, bringing together more than 1000 participants. NewsEye was present both with papers on research undertaken in the project and as an interlocutor (two long papers led by the Helsinki Computational History Group and one panel on newspaper digitisation).

The conference theme was “Complexities”, which is of course crucial with respect to newspapers as well and resonated at DH2019 on different levels. Two full panels were devoted to newspapers alone, and there were at least three posters and close to ten separate presentations on different aspects of newspapers. Some scholars who work on newspapers but were not present in Utrecht also made a conscious effort to contribute.

Libraries, newspapers and methods – some insights from the conference

It was delightful that the relevance of libraries for newspaper projects was not forgotten at DH2019. Questions of copyright and bias were examined both in theory and in practice, for example at the Digital Scholarship institute’s panel, “The Past, Present & Future of Digital Scholarship with Newspaper Collections”, which focused on a new project called Living with Machines at the British Library. NewsEye took part in this panel through a presentation by Jean-Philippe Moreux. With respect to the panel, Melissa Terras noted on Twitter that a visible change in DH is that people are now more critical towards questions of bias and data.

The same applies to the material aspects of print products. While text mining remains the main form of DH engagement with newspapers, questions of metadata and materiality are also taken seriously in many newspaper projects, including NewsEye.

One of the Helsinki Computational History Group’s papers, which links directly to NewsEye, aims to take questions of materiality to the next level. It is devoted entirely to this aspect, tracing the evolution of newspapers as material products that eventually become distinct from newsbooks, pamphlets and books.

There was also discussion of whether this kind of approach should be integrated into all newspaper projects.

The idea is indeed that this type of research should be done in a manner that is reusable and scalable to other datasets and projects as well. A lesson learned at the Oceanic Exchanges project panel related directly to this desire for a common methodology, which needs to be more than just a wish.

For example, the detection of text reuse in newspapers (which is at the core of the Oceanic Exchanges project) is something that can be applied across different datasets. This in turn connects with the crucial question of data quality. Perhaps one of the most interesting recent developments within the NewsEye project is that the Transkribus engine, designed for handwritten text recognition, is also producing wonderful improvements on printed newspaper material. Thus, checking out a poster about handwritten text recognition is becoming even more interesting for historians studying newspapers.

One very delightful aspect of digital research on newspapers is that data releases involving researchers are becoming a more and more common part of different projects. At least two newspaper-related data releases were announced during the conference: one of ground truth data and another of word embeddings.

Future prospects

In the future, context should be taken seriously in digitisation projects. According to Andrew Prescott and others, however, project planning still needs to improve.

Meanwhile, collaboration is another key objective going forward. One of the great moments of the conference was the comment by Maud Ehrmann, who leads the Impresso newspaper project, that for interoperability with respect to newspapers the road to success is collaboration between the Impresso and NewsEye projects.

A new development for collaboration between libraries and researchers was also launched: an “Inventory Researchers Wishlist on digitised newspapers” emerged on the last day of DH2019.

Working groups that bring together people across different countries should also be used for these undertakings. Newspapers don’t necessarily need a new working group in DARIAH (although it would make sense at some level), because many related ones already exist. For example, a new bibliographical data group brings together interests relevant to research on newspapers (https://dariah-ae-2019.sciencesconf.org/261699/document) and should be followed by relevant newspaper projects.

All of this points to the fact that open science (or cross-project collaboration in general) is not easy, and if the ecosystems aren’t built correctly, it can become a nuisance. But open science is the only way to work more effectively on large digitised historical corpora. This development can also be considered natural, because work on newspapers brings memory organisations (national libraries in particular) and researchers together out of necessity. It is not only researchers who have realised this but also library communities (those in LIBER, for example, have been active on this front for some time).

All in all, next time you hear someone complaining about bad OCR in newspapers, think of it as a gateway to realising that our work in digital humanities is inevitably a mutual process. If the projects remain isolated, we are not going to be able to do DH and have an impact on the humanities at large. There should also be incentives from research funders to encourage this cross-project collaboration, so that successfully evolving projects in digital humanities can be sustained.