The two-day conference is dedicated to the use of AI models for the digitization and analysis of newspapers and magazines from the early modern period to the present. This covers both the “out-of-the-box” use or fine-tuning of existing models and the training of new models.
The term “AI model” is deliberately defined broadly and includes several subfields of artificial intelligence (e.g., Machine Learning, Deep Learning, Generative AI, NLP) and architectures (e.g., CNNs, BERT, GPT, CLIP) as well as different modalities (text, image, multimodal models) and modes of integration into individual workflows (e.g., through applications such as Transkribus, Newspaper Navigator; through Python libraries like spaCy, flair).
The conference focuses on various application scenarios of AI in relation to newspapers and magazines. The following areas of use are of particular interest:
Due to limited capacity, we ask you to register using our registration form.
14:00-14:30 | Welcome and Introduction | Alexandra N. Lenz, Claudia Resch, Nina C. Rastinger
14:30-15:00 | Hierarchical Structure Extraction from Newspaper Images Using a Transformer-Based Model | William Mocaër, Clément Chatelain, Stéphane Nicolas, Thierry Paquet, Tom Simon, Pierrick Tranouez
15:00-15:30 | The FINLAM Newspaper Dataset - a dataset for end-to-end newspaper recognition | Solène Tarride |
15:30-16:00 | From Image to Machine-Readable Text: AI for Layout Analysis, OCR and Post-Correction for Job Ads from Historical Newspapers | Klara Venglarova, Raven Adam, Georg Vogeler
16:00-16:30 | Coffee break |
16:30-17:00 | Das Darmstädter Tagblatt und zwei KI-Lösungen: Transkribus-Workflows und die Entwicklung eines KI-Assistenten | Dario Kampkaspar, Kevin Kuck, Anna Christina Kupffer |
17:00-17:30 | Werkstattbericht aus der historisch-kritischen digitalen Edition der „Neuen Zeitschrift für Musik“ 1834-1844 | Nelly Krämer-Reinhardt |
09:30-10:00 | AI-Driven Analysis of Female Representations in Fin-de-Siècle Spanish Magazines | Adriana Rodríguez-Alfonso |
10:00-10:30 | Challenges in dealing with historical gossip | Christian Lendl |
10:30-11:00 | Potenziale und Herausforderungen einer KI-unterstützten Medien- und Texterschließung am Beispiel der Gattung “Fotogedicht” | Lisa Hufschmidt |
11:00-11:30 | Coffee break |
11:30-12:00 | LexiMus Project. Advantages and Challenges of Artificial Intelligence in the Analysis of Music Press | Daniel Martín Sáez, María Isabel Jiménez Gutiérrez |
12:00-12:30 | LLM-based list analysis: From semi-structured newspaper texts to structured data | Nina C. Rastinger |
12:30-14:00 | Lunch break (catered) |
14:00-14:30 | Semantische Variationen und Bedeutungswandel im Ukrainischen: Herausforderungen für Multilinguale Sprachmodelle | Nataliia Cheilytko |
14:30-15:00 | Part-of-speech and grammar tagging with German spaCy pipelines from a linguistic perspective: Opportunities and challenges in the annotation of diminutives in forum posts on an Austrian online newspaper article | Katharina Korecky-Kröll |
15:00-16:00 | Concluding exchange over coffee and cake |
William Mocaër, Clément Chatelain, Stéphane Nicolas, Thierry Paquet, Tom Simon, Pierrick Tranouez
LITIS Laboratory
Understanding newspaper images is a challenging task due to their complex hierarchical structures, rendered through a variety of dense layouts. This research introduces a novel transformer-based model specifically designed to tackle these challenges through a comprehensive, end-to-end approach. The proposed model excels at extracting the hierarchical structure of newspapers, including sections and articles. It performs block localization and categorization (title, paragraph, image, table, …) and reading order prediction at multiple levels. The model provides a comprehensive, detailed and consistent analysis of newspaper content.
The approach relies on an iterative process of information extraction through the hierarchy of levels, where each level is processed one after the other. To enhance computational efficiency, each level is executed using a parallel attention mechanism. Relying on high-level structural modeling, the model achieves end-to-end processing without requiring any additional pre- or post-processing, ensuring adaptability to a wide variety of newspaper formats.
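The abstract stays at the conceptual level; purely as an illustration of what query-based, level-by-level decoding of a page in this spirit could look like, a minimal PyTorch sketch is given below. It is not the authors' architecture: the encoder, dimensions, number of levels and prediction heads are all assumptions.

```python
# Illustrative sketch only: level-by-level decoding of a newspaper page with
# learned queries, in the spirit of the iterative, hierarchical approach
# described above. All dimensions and heads are arbitrary assumptions.
import torch
import torch.nn as nn

class HierarchicalPageDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_queries=32, n_classes=8, n_levels=3):
        super().__init__()
        self.n_levels = n_levels
        # One set of learned object queries per hierarchy level (e.g. sections, articles, blocks).
        self.queries = nn.Parameter(torch.randn(n_levels, n_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.box_head = nn.Linear(d_model, 4)          # block localization (x, y, w, h)
        self.cls_head = nn.Linear(d_model, n_classes)  # block categorization
        self.order_head = nn.Linear(d_model, 1)        # reading-order score

    def forward(self, page_features):
        # page_features: (batch, n_patches, d_model) from some image encoder.
        outputs, memory = [], page_features
        for level in range(self.n_levels):
            q = self.queries[level].unsqueeze(0).expand(page_features.size(0), -1, -1)
            h = self.decoder(q, memory)  # all queries of one level attend in parallel
            outputs.append({
                "boxes": self.box_head(h).sigmoid(),
                "classes": self.cls_head(h),
                "order": self.order_head(h).squeeze(-1),
            })
            # Detected elements of this level condition the next, finer level.
            memory = torch.cat([memory, h], dim=1)
        return outputs

# Dummy example: two pages encoded as 196 patches each.
feats = torch.randn(2, 196, 256)
preds = HierarchicalPageDecoder()(feats)
print([p["boxes"].shape for p in preds])
```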
The model is trained using synthetic documents that capture the variability and complexity of real-world newspapers. These synthetic documents enable the model to learn robust representations of newspaper layouts, ensuring its ability to generalize across a wide range of structural configurations. Preliminary evaluations highlight the model's potential in accurately reconstructing newspaper hierarchies and providing insights into their content.
The method offers a promising solution for precise structure extraction in highly structured documents such as newspapers, but could also be applicable to a wider range of documents, addressing the growing need for scalable and efficient AI-based digitization solutions.
William Mocaër is a postdoctoral researcher at the LITIS laboratory (Rouen, France), where he contributes to the FINLAM project in collaboration with the Bibliothèque nationale de France (BnF) and Teklia, focusing on advanced newspaper analysis techniques. Previously, he completed a PhD at IRISA as part of the Shadoc team (Systems for Hybrid Analysis of DOCuments), which specializes in document analysis.
Solène Tarride
TEKLIA
Advances in machine learning and the emergence of visual large language models (VLLMs) have significantly pushed forward the field of automatic document understanding. However, these models often show limited performance on historical or handwritten documents. The ANR FINLAM (Foundation INtegrated models for Libraries Archives and Museum) project specifically aims to develop multimodal models that can handle a wide variety of documents, languages, layouts, writing styles and illustrations. One of the first use cases of this project focuses on historical newspapers.
Historical newspapers present unique challenges for automated processing due to their dense and complex layouts. Tasks such as reading order detection and article separation remain underexplored in the machine learning and document analysis communities. To fill these gaps, we present the FINLAM Newspaper Dataset, an open-source dataset designed for end-to-end training and evaluation of complex newspaper recognition tasks. The FINLAM Newspaper Dataset contains 149 issues of 23 newspapers published in the 19th and 20th centuries, mainly in French, with some newspapers in English. Each issue contains between 2 and 12 pages, and each page is segmented into zones annotated with multimodal features: localization, textual content (extracted by OCR), zone classification (among 13 categories including article titles, intertitles, paragraphs, illustrations, advertisements and free ads), reading order and article separation. The dataset presents significant challenges due to its dense, complex and varied layouts, and it is freely available on the Hugging Face Hub. In this workshop, we will introduce the FINLAM Newspaper Dataset and present benchmark results for key tasks such as OCR, document layout analysis, reading order detection and article separation.
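For readers who want to experiment with the dataset, a minimal loading sketch with the Hugging Face datasets library is shown below; the repository identifier and field names are placeholders, since the abstract does not spell them out.

```python
from datasets import load_dataset

# Placeholder repository id: substitute the actual FINLAM dataset identifier on the Hub.
ds = load_dataset("teklia/finlam-newspapers")
page = ds["train"][0]
# Expected multimodal annotations per zone (assumed field names): localization,
# OCR text, one of 13 zone classes, reading order and article id.
print(page.keys())
```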
Solène Tarride is a machine learning researcher at TEKLIA. During her PhD at IRISA, she focused on deep learning for understanding historical documents. At TEKLIA, she develops new methods for automatic information extraction from historical and modern documents.
Klara Venglarova, Raven Adam, Georg Vogeler
University of Graz
This study describes a comprehensive workflow for extracting machine-readable text from historical newspaper job advertisements, addressing layout analysis, optical character recognition (OCR), and post-correction with state-of-the-art machine learning methods. Leveraging an annotated dataset as ground truth, we evaluate various layout detection tools, including the default ANNO segments, Eynollah, Transkribus, and Tesseract. For evaluation purposes, we also propose a new methodology based on text presence in non-intersecting parts of a predicted region and its ground truth. Eynollah demonstrated the highest segmentation accuracy (72.5%), while other models, such as Kraken, underperformed due to a mismatch between the specific task and the pretrained models.
For OCR, we compared multiple models, including GT4HistOCR (CER: 0.1218), the Tesseract model used in ANNO (CER: 0.1295), and the German_Print model (CER: 0.1202). While several models reached comparable results, Fraktur_GT4HistOCR achieved the best WER. Post-correction further improved text quality, addressing OCR-induced biases. We fine-tuned the hmbyt5-preliminary model on the ICDAR2019-POCR dataset so that OCR post-correction performs better on our dataset. As the manual creation of a gold standard is a time-consuming process, we also explore generative transformer-based methods to support the creation of the training data needed to achieve good performance of a post-correction model.
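As a small illustration of how CER and WER figures like those above can be computed (the jiwer library is chosen only for illustration and is not necessarily the authors' tooling):

```python
import jiwer

reference = "Gesucht wird ein tüchtiger Buchhalter für ein Handelshaus."   # ground truth
hypothesis = "Gesncht wird ein tüchtiger Buchhalter fur ein Handelshans."  # simulated OCR output

print(f"CER: {jiwer.cer(reference, hypothesis):.4f}")
print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")
```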
Our work emphasizes the persistent need for annotated datasets and gold standards in assessing segmentation and recognition performance. By systematically comparing tools and methodologies, we contribute to the transparency of the results on which subsequent data analysis is based.
This work is part of the FWF project P35783 “The making of the incredibly differentiated labor market” (PI Jörn Kleinert).
Georg Vogeler is professor for Digital Humanities at the University of Graz. He is a trained historian (Historical Auxiliary Sciences), graduated from Ludwig-Maximilians-Universität (LMU) Munich, and has worked on late medieval administrative records, Emperor Frederick II (1198-1250), digital scholarly editing, and semantic web technologies for the humanities, with positions at LMU Munich, the Università del Salento in Lecce and the University of Graz. He has recently engaged in the application of machine learning and AI to the analysis of historical records. He was and is PI of numerous projects, among them the ERC Advanced Grant From Digital to Distant Diplomatics (2022-2026).
Klara Venglarova is a PhD student in Linguistics and Digital Humanities at Palacký University in Olomouc, Czech Republic. She is involved in the FWF-funded project The Making of the Incredibly Differentiated Labor Market: Evidence from Job Offers from Ten Decades at the University of Graz (PI Jörn Kleinert), where she works on layout analysis, OCR, post-correction, information extraction and other NLP and machine learning tasks.
Raven Adam is a PhD student at the Department of Environmental Systems Sciences at the University of Graz. His research focuses on NLP applications such as topic modeling, text classification and text generation. He is currently involved in two FWF-funded projects: The Making of the Incredibly Differentiated Labor Market: Evidence from Job Offers from Ten Decades (PI Jörn Kleinert) and Responses to Threat and Solution-Oriented Climate News (PI Marie Lisa Ulrike Kogler).
Dario Kampkaspar, Kevin Kuck, Anna Christina Kupffer
Universitäts- und Landesbibliothek Darmstadt
The digitization project „Darmstädter Tagblatt" is currently building one of the most extensive bodies of historical newspaper sources in the German-speaking world. In two DFG-funded project phases, more than 600,000 pages from almost three centuries are being digitized and made available as full text.¹ In addition to a brief introduction to the project, we would like to offer insights into two aspects of the project and of collaborations that have grown out of it.
1. The Transkribus workflow
Originally contracted out as a service, the OCR and layout recognition of the post-war issues (1949–1986) were carried out externally. While the layout recognition was satisfactory, a strategic decision was made to perform the OCR in-house. Thanks to an infrastructure investment at the ULB Darmstadt, we can use Transkribus as an Epic Member for the Tagblatt project. In a workflow that used TextTitan as well as an existing model from the ZEiD, new versions of the full texts were produced with double keying. This was considerably faster than manually correcting the service provider's output. The layout information from the provider's ALTO files could be converted into PAGE files and thus reused in Transkribus.
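The ALTO-to-PAGE reuse described above is, in practice, an XML transformation. The sketch below is only a simplified illustration of the idea (real converters, e.g. XSLT-based ones, handle far more of both schemas); file names and the restriction to TextBlock coordinates are assumptions.

```python
# Simplified sketch: copy block coordinates from an ALTO file into a minimal
# PAGE document so the layout can be re-imported elsewhere. Not the project's
# actual conversion; schema details are deliberately reduced.
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

def alto_to_page(alto_path: str, image_name: str) -> ET.ElementTree:
    alto = ET.parse(alto_path).getroot()
    # Ignore the ALTO namespace version by matching on local names only.
    blocks = [el for el in alto.iter() if el.tag.split("}")[-1] == "TextBlock"]

    ET.register_namespace("", PAGE_NS)
    pcgts = ET.Element(f"{{{PAGE_NS}}}PcGts")
    page = ET.SubElement(pcgts, f"{{{PAGE_NS}}}Page", imageFilename=image_name)

    for i, b in enumerate(blocks):
        x, y = int(float(b.get("HPOS", 0))), int(float(b.get("VPOS", 0)))
        w, h = int(float(b.get("WIDTH", 0))), int(float(b.get("HEIGHT", 0)))
        region = ET.SubElement(page, f"{{{PAGE_NS}}}TextRegion", id=f"r{i}")
        points = f"{x},{y} {x+w},{y} {x+w},{y+h} {x},{y+h}"
        ET.SubElement(region, f"{{{PAGE_NS}}}Coords", points=points)
    return ET.ElementTree(pcgts)

# Hypothetical usage:
# alto_to_page("issue_1950_001.alto.xml", "issue_1950_001.jpg").write("page.xml")
```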
2. The RAGblatt
With Hessian.AI, Darmstadt is home to a spin-off of TU Darmstadt with AI expertise.² To explore the possibilities, and also to address the challenges that come with a project as extensive as the Tagblatt project, a cooperation with Hessian.AI was initiated. Its current result is the „RAGblatt", an AI-supported assistant that can be used to search the Tagblatt material. The assistant is still at the prototype stage and uses models such as Meta's Llama and Occiglot. It enables text-based queries that generate written answers and include the context of the newspaper article relevant to the query. We hope that this assistant will help users find material and offer an exploratory starting point for research topics. In a lab-like environment, researchers can test potential questions with minimal technical effort or time investment. The talk will also address challenges of both a technical and a disciplinary nature.
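Purely as an illustrative sketch of the retrieval-augmented setup described above (not the RAGblatt implementation; embedding model, corpus and prompt are assumptions), a minimal retrieve-then-prompt loop could look like this:

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant newspaper snippet and build a grounded prompt for a generative model.
from sentence_transformers import SentenceTransformer, util

articles = [
    "Darmstadt, 12. März 1912: Eröffnung der neuen Markthalle ...",
    "Darmstadt, 3. Juli 1925: Bericht über das Stadtjubiläum ...",
]

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
doc_emb = embedder.encode(articles, convert_to_tensor=True)

query = "Wann wurde die Markthalle eröffnet?"
query_emb = embedder.encode(query, convert_to_tensor=True)
best = util.cos_sim(query_emb, doc_emb).argmax().item()

prompt = (
    "Beantworte die Frage anhand des folgenden Zeitungsartikels.\n\n"
    f"Artikel: {articles[best]}\n\nFrage: {query}\nAntwort:"
)
# The prompt would then be passed to a generative model such as Llama or Occiglot,
# e.g. via a Hugging Face text-generation pipeline.
print(prompt)
```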
The assistant is currently only available within the TU Darmstadt network: elektra.ai.tu-darmstadt.de/ulb (accessed 05.02.2025).
Footnotes
1 www.ulb.tu-darmstadt.de/forschen_publizieren/forschen/darmstaedter_tagblatt.en.jsp (accessed 05.02.2025).
2 hessian.ai (accessed 05.02.2025).
Dario Kampkaspar is head of the Centre for Digital Editions at the ULB Darmstadt. Not least since the Wien(n)erisches Diarium, he has been active in the digitization and full-text capture of newspapers. He is also involved in the TEI and in Transkribus (e.g., the TEI export from Transkribus and further tools for digital editions).
Kevin Kuck has been in charge of the „Darmstädter Tagblatt" digitization project since September 2023. At the ULB Darmstadt he also works on the project „Europäischer Religionsfrieden Digital". He studied history at Heidelberg University, where he is also completing his doctorate.
Nelly Krämer-Reinhardt
Julius-Maximilians-Universität Würzburg, Bayerische Akademie der Wissenschaften
In this progress report, we would like to provide insight into the conception of the digital edition of a historical journal.
Starting point:
The academy project „Robert Schumanns Poetische Welt" includes, among other things, the historical-critical edition of the Neue Zeitschrift für Musik (NZfM) from its founding decade, during which the composer and music writer Robert Schumann conceived the journal and was responsible for its editorial direction.
The digital edition is being created in cooperation with the TCDH Trier in the virtual research environment FuD.
Aim:
Schumann's NZfM saw itself as an independent organ for the promotion of talented composers and formed a central platform within the Romantic discourse on music. It is thus an important corpus for musicology.
The planned deeply indexed, annotated reading text, which can be accessed both at issue level and at the level of individual text units, will form the basis for future research. The aim of editing the journal corpus is the generation of deep data.
Process:
Text capture: A specially trained model in the AI-supported software Transkribus precisely detects both the text regions of the journal layout and the lines and their text.
Multimodal capture: Music examples are created with the software mei-friend, collated, and then integrated into TEI-XML (Text Encoding Initiative). The possible uses of OMR are currently still being tested.
Semantic indexing: The journal issues are divided into units of meaning and semantically indexed. The focus thus moves from pure text capture to the substance of the text units, some of which also extend across several issues.
Annotation: All entities are marked up and linked to authority data. The individual units of meaning are assigned to text categories, and explanatory notes facilitate the understanding of passages requiring explanation. Where corresponding sources are available, the genesis of the texts is also presented: on the one hand, handwritten sources are consulted and processed with the software tool Transcribo in FuD; on the other hand, the contributions that Schumann included in his „Gesammelte Schriften" are linked and compared using the collation tool Comparo.
Discussion:
We invite participants to discuss this concept and its methods, to reflect on the limitations of using AI to capture deep data in newspapers and journals, and to consider what role explanatory commentary can and should play in the age of AI.
The speaker, Nelly Krämer-Reinhardt, M.A., studied music education at the Hochschule für Musik Würzburg and musicology at the Julius-Maximilians-Universität Würzburg. Since 2023 she has been a research associate in the academy project Robert Schumanns Poetische Welt; in her dissertation she works on music examples in the Neue Zeitschrift für Musik.
Adriana Rodríguez-Alfonso
University of Tübingen
This presentation examines female representations in three pioneering fin-de-siècle magazines from Spain: Vida nueva (Madrid, 1898-1900), La Vida Literaria (Madrid, 1899), and La vida galante (Barcelona, 1898-1905). This study is part of a Spanish-language magazine digitization project undertaken at the University of Tübingen, Germany. These magazines, conceived as platforms for artistic and literary dissemination, not only brought together prominent painters, illustrators, photographers, and writers of the Hispanic movement but also provided invaluable insights into the prevailing symbolic and social constructs of femininity.
Given the close interrelationship between medicine and art in the late nineteenth century (Jordanova, 1989; Gilman, 1995; Mazzoni, 1996; Clemison and Vázquez, 2009; Tsuchiya, 2011; Alder, 2020), the social imaginaries surrounding women in these magazines frequently intersected with the dominant theories of women's mental health. These theories were influenced by European degeneration models (Nordau, Brachet, Charcot), as well as home-grown adaptations by leading Spanish positivist psychiatrists (Escuder, Giné y Partagás, Bernaldo de Quirós).
Drawing from digital methods and perspectives, this presentation will showcase the results derived from the digitization, processing, and analysis of this corpus of Spanish cultural magazines from the turn of the century. Using Natural Language Processing (NLP) techniques—such as word-sense disambiguation and semantic analysis—the study maps the various semantic frameworks associated with women, their societal roles, and representations within nineteenth-century Spanish society. These techniques will reveal how medical, political, and social connotations were closely intertwined within the discourse surrounding female identity and roles of the time.
This talk will also foster a discussion on the application of computational tools in periodical press analysis, highlighting both the potential and challenges of adapting digital methods across languages—particularly for Spanish-language materials. It will delve into issues such as cross-linguistic adaptation and the nuances involved in applying these methods to historical texts.
Adriana Rodríguez-Alfonso holds a PhD and Master's degree in Spanish and Latin American Literature from the University of Salamanca, and a Bachelor's degree in Hispanic Philology from the University of Havana. She is currently a professor and researcher at the Romanisches Seminar of the University of Tübingen, where she is working on her “Habilitation.”
Her main research focuses on Portuguese and Spanish literature from the 19th and 20th centuries, literary magazines, intellectual fields and networks, and digital humanities. She has published articles and chapters in various essay collections and specialized journals and her book El grupo Shanghai en Argentina: Redes, estéticas y mercados editoriales latinoamericanos was published in 2024 by De Gruyter.
Christian Lendl
University of Vienna
The Wiener Salonblatt was the hot gossip magazine of fin-de-siècle Vienna. The illustrated weekly was published from 1870 to 1938 and mainly consisted of short messages – mostly published by members of the nobility – about personal achievements, travelling, and family issues. These short texts can be seen as typical examples of factoids and served the same purpose as posts on social media networks today: staying connected with one’s peers and presenting oneself to the public.
This dissertation project aims to analyze these (~250,000) factoids with the help of digital methods. The research goals are to better understand the transformation of the late Habsburg nobility by identifying topical trends and geospatial patterns as well as conducting a historical network analysis.
The current focus of this project lies on the automated text recognition of the corpus. To this end, several models are being trained in Transkribus for layout recognition, segmentation, and automated transcription. In addition, a text-processing algorithm is being developed that optimizes the (textual) output of the transcription stage and transfers it into a database. This includes several steps of error correction, normalization and validation, as well as an algorithm to correct the reading order (for all messages on a single magazine page). All these processes are necessary to optimally prepare the factoids for the upcoming stage of this dissertation project: analyzing the factoids with natural language processing methods. While named entity recognition will be used to extract as much information as possible (persons, organizations, places, etc.), topic modelling will identify the topic(s) of each factoid. All these steps have to be completed before the final stages can begin: data analysis and historical interpretation.
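As a small, hedged illustration of the planned topic-modelling stage (tooling, snippets and parameters are assumptions, not the project's actual pipeline), a few factoid-like texts can be modelled with scikit-learn:

```python
# Illustrative sketch: simple LDA topic modelling over factoid-like snippets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

factoids = [
    "Graf X. ist gestern mit seiner Gemahlin nach Abbazia abgereist.",
    "Fürstin Y. gab am Samstag einen glänzenden Ball in ihrem Palais.",
    "Baron Z. weilt zur Kur in Karlsbad und kehrt Ende des Monats zurück.",
]

vec = CountVectorizer()
dtm = vec.fit_transform(factoids)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:]]
    print(f"Topic {i}: {top_terms}")
```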
Christian Lendl is a PhD candidate at the Department of East European History (University of Vienna). His fields of interest include the Austrian nobility in the late Habsburg Empire, the development of portrait and press photography, and the visual coverage in Austrian newspapers. He is also a lecturer in visual marketing at the IMC Krems University of Applied Sciences and holds an MSc in Computer Science from the Vienna University of Technology and an MA in History from the University of Vienna.
Lisa Hufschmidt
Julius-Maximilians-Universität Würzburg
From a literary studies perspective, this talk addresses the partially automated indexing of photo poems (Fotogedichte)¹, i.e. a specific literary form of text-image relationship, with the help of different AI models. The indexing process is being developed on the basis of the DFG project Das Fotogedicht in illustrierten Zeitschriften zwischen 1895 und 1945², launched in 2024, and in cooperation with the Zentrum für Philologie und Digitalität at the University of Würzburg. The project aims to capture the spatial relationship between text and image, the full text, and the image and text motifs used, and, in a second step, to examine the semantizations connected with them. The intended indexing and the research questions associated with it hold not only technical but also communicative potentials and challenges, which will be the focus of the talk.
Footnotes
1 See: Catani, Stephanie/Michael Will (2024): “Das Fotogedicht. Zur (Wieder-)Entdeckung einer intermedialen Gattung.” Zeitschrift für Deutsche Philologie Digital/Zeitschrift für Deutsche Philologie (2), doi:10.37307/j.1868-7806.2024.02.09.
2 www.germanistik.uni-wuerzburg.de/ndl1/forschung-projekte/forschungsstelle-fotolyrik/.
Lisa Hufschmidt studied German studies (B.A.) and German literature (M.A.) in Mannheim and Stuttgart. Since July 2024 she has been a member of staff of the DFG project Das Fotogedicht in illustrierten Zeitschriften zwischen 1895 und 1945 at the Julius-Maximilians-Universität Würzburg, which is devoted to the (re)discovered genre of photo poetry. In her doctoral project, Lisa Hufschmidt is developing an analytical model for photo poems.
Daniel Martín Sáez, María Isabel Jiménez Gutiérrez
University of Salamanca
In this presentation, we will introduce LexiMus, a project aimed at understanding trends in the use of musical lexicon in Spanish throughout history. Specifically, we will focus on the work of the team at the University of Salamanca, which is in charge of studying the press from the 18th to the 21st century, both general and specialized. This involves working with massive text data (millions of words), from which we often need to exclude non-musical information. To address this, we have developed a tool that uses Google Cloud OCR and Vertex AI. This learning platform allows us to train large language models (LLMs) and create automated workflows to extract information in blocks by applying a prompt similar to those used in chatbots. We were able to select and transcribe musical news from hundreds of periodical sources, creating a corpus of over 70 million words. In the past year, we began analyzing this corpus using the Voyant Tools platform, which enables us to study usage trends, create word clouds, and observe their evolution over time. Currently, we are still seeking ways to improve OCR reading, which is hindered by issues with text legibility and column organization, but perhaps the greatest challenge lies in the numerous interpretation problems that AI is far from solving at present, despite efforts in recent years in the field of the history of concepts (e.g., Peter de Bolla, Explorations in the Digital History of Ideas, 2024).
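A hedged sketch of such prompt-based block extraction on Vertex AI is given below; the project id, model name, example text and prompt are illustrative assumptions rather than the LexiMus configuration.

```python
# Illustrative sketch: extract music-related passages from an OCR block with a
# Vertex AI generative model. Project id, region and model name are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="europe-west1")  # placeholder project
model = GenerativeModel("gemini-1.5-flash")

ocr_block = (
    "MADRID, 14 de mayo. Anoche se estrenó en el Teatro Real una nueva zarzuela ... "
    "Bolsa de Madrid: los valores ferroviarios cerraron en alza."
)
prompt = (
    "Extract only the passages of the following newspaper text that report on music "
    "(concerts, premieres, musicians, instruments). Return them verbatim, one per line.\n\n"
    + ocr_block
)
print(model.generate_content(prompt).text)
```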
Dr. Daniel Martín Sáez: Associate Professor of Musicology at the University of Salamanca. Professor of the BA in Musicology and the MA in Hispanic Music at the University of Salamanca. Member of the research team of LexiMus Project.
María Isabel Jiménez Gutiérrez: Predoctoral Fellow at the University of Salamanca. BA in Musicology and MA in Hispanic Music from USAL. Graduate in Higher Artistic Education with a specialization in Clarinet from the Higher Conservatory of Castilla y León.
Nataliia Cheilytko
Friedrich Schiller University Jena
This contribution investigates the extent to which large language models (LLMs) and contextualized embeddings are able to capture nuances of meaning in Ukrainian words, particularly against the background of regional and diachronic variation in the 20th and 21st centuries. The aim of the project is to systematically analyze the dynamics of word meanings in Ukrainian with the help of modern AI models.
As a low-resource language, Ukrainian suffers from a lack of annotated datasets and NLP tools, which complicates semantic representation and analysis. The processing of historical texts, which are often not available in digital archives, is particularly challenging. For initial experiments, data from the General Regionally Annotated Corpus of Ukrainian (GRAC) was used, which comprises texts from different regions of Ukraine from the 20th century onwards.
Two approaches were pursued: first, contextualized embeddings were visualized and clustered to analyze differences in meaning across contexts; second, GPT models were used to determine word meanings in specific sentences. Initial results show that contextualized embeddings can successfully identify semantic change, while LLMs such as GPT-4o fail on historical or regional meanings in some cases.
For example, the model correctly identified the new metaphorical meaning of "bavovna" ("explosion"), but failed on the historical regional meaning of the adjective "povazhnyi" ("strict") in early twentieth-century western Ukrainian.
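As a minimal illustration of the first approach (clustering contextualized embeddings of a target word such as "bavovna"), the sketch below uses a multilingual BERT model and k-means; model choice, sentences and cluster count are assumptions, not the project's setup.

```python
# Illustrative sketch: cluster contextual embeddings of one target word to
# separate its senses across example sentences.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

sentences = [
    "Вночі над містом знову була бавовна.",      # new metaphorical sense ("explosion")
    "З бавовни шиють легкий літній одяг.",       # literal sense ("cotton")
    "Фабрика переробляє бавовну на тканину.",    # literal sense ("cotton")
]
target = "бавовн"  # shared stem of the target word

vectors = []
for s in sentences:
    enc = tok(s, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    start = s.lower().find(target)
    end = start + len(target)
    # average the sub-token vectors overlapping the target word
    ids = [i for i, (a, b) in enumerate(offsets) if a < end and b > start and b > a]
    vec = hidden[ids].mean(dim=0) if ids else hidden.mean(dim=0)
    vectors.append(vec.numpy())

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # sentences sharing a label are treated as sharing a sense
```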
Future work aims to adapt LLMs with specific data in order to better model semantic change in Ukrainian and to achieve finer granularity in lexical analyses.
References
Nataliia Cheilytko is a postdoctoral researcher at Friedrich Schiller University (Jena), a computational linguist, NLP engineer, R&D team leader, and lecturer with more than ten years of experience in various linguistic, NLP, and Semantic Web projects in both academia and industrial startups. Her areas of expertise are Corpus Linguistics, Computational Linguistics, Natural Language Processing, Machine Learning and AI, Large Language Models, Semantic Modeling, Sociolinguistics, Language Variation and Change, Lectometry, Knowledge Representation, Labeled Property Graphs, and the Semantic Web.
Katharina Korecky-Kröll
Austrian Academy of Sciences
The NLP Python library spaCy (Honnibal et al. 2020) is a useful tool for everyone interested in linguistic analyses of large amounts of written data.
Using spaCy, such data can be tokenized and tagged for parts of speech quickly, and a basic morphological annotation for categories of inflectional morphology (e.g., case, gender, and number of nouns) or an annotation of syntactic dependencies or named entities is also possible. All these levels of annotation may serve as a basis for further linguistic analyses.
To date, spaCy supports over 75 languages and has over 80 pretrained pipelines for 25 languages. There are four pipelines for German, which are all based on the TIGER-Corpus, Tiger2Dep and WikiNER and sometimes on additional sources (in round brackets after the name of the pipeline) and which vary regarding the accuracies of their morphological annotation [in square brackets]:
Specific challenges arise when annotating user-generated content in a pluricentric language such as German, which has several national standard varieties and is also characterized by numerous dialects and regiolects resulting in highly diverse word formation patterns (e.g., Ammon 1995; Lenz 2019). Thus, in a 12-million-token corpus of forum posts on an online article of the Austrian newspaper DERSTANDARD.at regarding the COVID-19 pandemic (e.g., Korecky-Kröll 2023; Korecky-Kröll et al. submitted), spaCy assigns a wrong grammatical gender to many diminutive nouns or misclassifies them in another way (e.g., common nouns as proper names).
Using a randomly selected sub-corpus of 1000 diminutive tokens from the above-mentioned corpus, the four spaCy pipelines for German are tested for accuracy, problems at the individual token or lemma level are identified and possible solutions are worked out. As an outlook, the possibility of an additional automatic word formation tagging (e.g., Wartena 2023) is also discussed.
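As a minimal illustration of the kind of check described above (pipeline choice and example sentence are assumptions), a German spaCy pipeline can be queried for POS tags and morphological features of Austrian diminutives:

```python
# Illustrative sketch: inspect POS, morphology and lemma of Austrian German
# diminutives with one of the German spaCy pipelines (requires the model to be
# downloaded, e.g. via `python -m spacy download de_core_news_sm`).
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Das Packerl mit den Maskerln ist gestern angekommen.")

for token in doc:
    print(token.text, token.pos_, token.morph, token.lemma_)
# Manually checking gender, number and lemma of "Packerl"/"Maskerln" reveals the
# kinds of misclassifications discussed in the talk.
```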
References
After completing her PhD in Linguistics at the University of Vienna in 2012, Katharina Korecky-Kröll worked in several postdoc positions. She is now a Senior Lecturer at the Department of German Studies of the University of Vienna and an Academy Scientist at the “Dictionary of Historical Bavarian Dialects in Austria and South Tyrol” of the Research Unit Linguistics of the Austrian Centre for Digital Humanities and Cultural Heritage of the Austrian Academy of Sciences.
7–8 May 2025
Seminar room 1,
Campus of the Austrian Academy of Sciences,
Bäckerstraße 13, 1010 Vienna
Department of Literary and Textual Studies,
Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH),
in cooperation with DHd-AG Zeitungen & Zeitschriften
Nina C. Rastinger
Claudia Resch
The presentations will be held in either German or English.