NewsEye: Navigating Collections of Digitised Historical Newspapers: A Conversation with Ben Lee

A visualisation of American Civil War-era maps from the Library of Congress's collections. (Courtesy of Ben Lee)

This autumn, we are asking experts to share their experiences related to themes which are relevant to the NewsEye project. For this interview, Ben Lee, a Doctoral Candidate in Computer Science & Engineering at the University of Washington and a former Innovator in Residence at the Library of Congress, talked to us about the Newspaper Navigator project and the issues related to enhancing (re)search options within collections of digitised historical newspapers.

Can you tell us about the Newspaper Navigator and your experience at the Library of Congress?

Of course! I began Newspaper Navigator in September 2019, while an Innovator in Residence at the Library of Congress. In partnership with LC Labs, the National Digital Newspaper Program, and IT Design & Development at the Library of Congress, as well as my Ph.D. advisor, Professor Daniel Weld, at the University of Washington, I began the project by extracting visual content across 16 million pages of digitized newspapers in the Chronicling America collection. In order to accomplish this, I leveraged annotations from the Beyond Words crowdsourcing initiative, which engaged the American public by asking volunteers to identify and classify visual content in Chronicling America. In particular, I finetuned an object detection model on these annotations in order to identify and categorize seven different classes of visual content on newspaper pages: photographs, illustrations, maps, comics, editorial cartoons, headlines, and advertisements. I then built an end-to-end pipeline to process all 100 terabytes of image and optical character recognition (OCR) data in Chronicling America using this finetuned model, yielding extracted visual content along with textual captions and other metadata. The Library of Congress and I released the resulting Newspaper Navigator dataset to the American public in May 2020, as the largest dataset of its kind ever produced. In pursuit of the Library’s mission of improving access, we placed the dataset and all code into the public domain for unrestricted re-use.

While caption-based keyword search for images provides much utility, the approach also has fundamental limitations: for example, how do people search for photographs with distinct visual motifs? In the second phase of Newspaper Navigator, I created and deployed a public search application for 1.5 million photographs in the dataset based on the real needs that historians and other users had articulated to us surrounding these limitations. Since launching with the Library of Congress in September 2020, the search application has reached tens of thousands of users and has been permanently adopted by the Library of Congress as part of their search infrastructure. In addition to providing keyword search functionality, the search application enables users to iteratively train machine learning algorithms in order to retrieve visually similar photos according to topics or concepts of interest, such as baseball players. From an exploratory search perspective, I call this search functionality open faceted search because it empowers users to create their own facets dynamically, facilitated by interactive machine learning algorithms that can train and predict over all 1.5 million photos in under a second. Unlike standard faceted search, open faceted search provides a path forward even when metadata is impoverished, making it extensible to a wide range of digitized collections. I presented open faceted search in a demo at UIST 2020.
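The iterative training loop described above can be illustrated with a toy sketch. It assumes each photograph is represented by a precomputed embedding vector and that a user marks a few photos as relevant or not relevant. The logistic-regression classifier, the function names `train_facet` and `rank_photos`, and the two-dimensional toy embeddings are all assumptions chosen for illustration; the interview does not specify which learner the search application uses.

```python
import math


def train_facet(embeddings, positive_ids, negative_ids, epochs=200, lr=0.5):
    """Fit a small logistic-regression classifier on user-labelled photos.

    Hypothetical helper: the classifier type is an assumption, not the
    application's documented method.
    """
    dim = len(embeddings[0])
    weights = [0.0] * dim
    bias = 0.0
    labelled = [(i, 1.0) for i in positive_ids] + [(i, 0.0) for i in negative_ids]
    for _ in range(epochs):
        for idx, label in labelled:
            x = embeddings[idx]
            z = sum(w * v for w, v in zip(weights, x)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))
            err = pred - label  # gradient of the log loss w.r.t. z
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def rank_photos(embeddings, weights, bias):
    """Return photo indices sorted from most to least relevant to the facet."""
    scores = [sum(w * v for w, v in zip(weights, x)) + bias for x in embeddings]
    return sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (made-up values): photos 0, 1, 4 resemble one another,
# as do photos 2, 3, 5.
embeddings = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.85, 0.15],  # unlabelled
    [0.15, 0.85],  # unlabelled
]

# The user marks photos 0 and 1 as relevant and photos 2 and 3 as not relevant.
weights, bias = train_facet(embeddings, positive_ids=[0, 1], negative_ids=[2, 3])
ranking = rank_photos(embeddings, weights, bias)
print(ranking)
```

In this sketch, the unlabelled photo 4 ranks alongside the positive examples, and the user could label more results and retrain to refine the facet further.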

To examine the ways in which a Chronicling America newspaper page is altered and decontextualized during its journey from a physical artifact to a series of probabilistic photographs in Newspaper Navigator, I wrote what I call a data archaeology. I released it with the Newspaper Navigator search application in order to provide scholars and the general public alike with a resource surrounding the ethical considerations and implications of this project (it is forthcoming in Digital Humanities Quarterly). In this data archaeology, I studied the digitization journeys of four different pages in Black newspapers in Chronicling America that reproduce the same photograph of W.E.B. Du Bois. In tracing the pages’ journeys, I unpacked how each step, from microfilming to OCR to image embeddings, propagates bias, marginalization, and erasure via the machine learning algorithms employed.

Ben Lee (Photo by Shawn Miller - Library of Congress)

During the NewsEye International Conference, you presented during a session about ‘Challenges and Perspectives in Digital Research’. What challenges have you faced as a computer scientist working with digitised historical newspapers?

Digitized newspapers are fun to work with from a computational perspective because they present so many challenges, from the variable quality of scans to the complex layouts of different titles. For example, the difficulties of OCR motivate the exploration of search methods for visual content other than keyword search over captions. Indeed, this challenge was a primary motivation for the open faceted search affordance in the Newspaper Navigator search application. Moreover, digitized newspapers present a challenge of scale. I’ve really enjoyed the computational challenges of processing millions of newspaper pages, as well as the challenges of searching over the extracted visual content, which was another motivation for the search application.

Ben Lee's presentation with Nathan Yarasavage (Library of Congress), 'From Chronicling America to Newspaper Navigator: Improving Access to Historic Newspaper Photos at the Library of Congress through Machine Learning', can be viewed below.

What perspectives do you have about the future of digital research using these sources?

Personally, I am very excited about the future of digital research with newspapers! The work being carried out by NewsEye, the National Digital Newspaper Program, and other initiatives across the world is re-imagining how we access digitized newspapers. As access to digitized newspapers continues to improve, I suspect that the already large communities surrounding them will only continue to grow. From a computational perspective, I am excited by the work being carried out to improve entity recognition, OCR correction, article segmentation, and beyond. From a humanistic perspective, I am looking forward to more text analysis across newspaper corpora. Lastly, given the ongoing digitization efforts across the world, I am excited to see scholars accessing new materials and doing scholarship across a diverse range of titles.

What’s coming up next for you?

I’m currently working to develop Newspaper Navigator in a number of directions. First, with my Ph.D. advisor, Dan Weld, I’m expanding open faceted search to encompass fully developed, customizable faceted search interfaces that are refined by the user with the Newspaper Navigator-style interactive machine learners. I’m also working with print scholars Jim Casey, Sarah Salter, and Joshua Ortiz Baco to study the ethnic presses in Chronicling America (we will be presenting our first paper on our collaboration at Computational Humanities Research 2021). Lastly, I am collaborating with Devin Naar at the University of Washington to study the visual culture of the Ladino press at a macroscopic scale for the first time.
