Conference "Newspapers, Magazines & AI Models: Training and (Re-)Use in the Digital Humanities"

The two-day conference is dedicated to the use of AI models for the digitization and analysis of newspapers and magazines from the early modern period to the present. This covers both the “out-of-the-box” use or fine-tuning of existing models and the training of new models.

The term “AI model” is deliberately defined broadly and includes several subfields of artificial intelligence (e.g., Machine Learning, Deep Learning, Generative AI, NLP) and architectures (e.g., CNNs, BERT, GPT, CLIP) as well as different modalities (text, image, multimodal models) and modes of integration into individual workflows (e.g., through applications such as Transkribus, Newspaper Navigator; through Python libraries like spaCy, flair).

    The conference focuses on various application scenarios of AI in relation to newspapers and magazines. The following areas of use are of particular interest:

    • Layout analysis and structural annotation
    • Automated Text Recognition (HTR, OCR)
    • Text genre classification
    • Semantic/linguistic annotation (e.g., Named Entity Recognition, Part-of-Speech Tagging)
    • Image annotation and classification (Computer Vision)
    • Format transformation and data modeling
    • Corpus design and searchability
    • Data analysis and visualization

    For organizational reasons, registration is now closed.

    Programme

    Day 1

    14:00–14:30  Welcome and Introduction (Alexandra N. Lenz, Claudia Resch, Nina C. Rastinger)

    Panel 1: Digitizing and enriching newspapers & magazines with AI - I

    Moderation: Gabriel Viehhauser

    14:30–15:00  The FINLAM Newspaper Dataset - a dataset for end-to-end newspaper recognition (Solène Tarride)
    15:00–15:30  Das Darmstädter Tagblatt und zwei KI-Lösungen: Transkribus-Workflows und die Entwicklung eines KI-Assistenten (Dario Kampkaspar, Kevin Kuck, Anna Christina Kupffer)
    15:30–16:00  From Image to Machine-Readable Text: AI for Layout Analysis, OCR and Post-Correction for Job Ads from Historical Newspapers (Klara Venglarova, Raven Adam, Georg Vogeler)
    16:00–16:30  Coffee break

     

    Panel 2: Digitizing and enriching newspapers & magazines with AI - II

    Moderation: Nina C. Rastinger

    16:30–17:00  Hierarchical Structure Extraction from Newspaper Images Using a Transformer-Based Model (William Mocaër, Clément Chatelain, Stéphane Nicolas, Thierry Paquet, Tom Simon, Pierrick Tranouez)
    17:00–17:30  Werkstattbericht aus der historisch-kritischen digitalen Edition der „Neuen Zeitschrift für Musik“ 1834-1844 (Nelly Krämer-Reinhardt)

     

    Day 2

    Panel 3: Analyzing magazines with AI

    Moderation: Claudia Resch

    09:30–10:00  AI-Driven Analysis of Female Representations in Fin-de-Siècle Spanish Magazines (Adriana Rodríguez-Alfonso)
    10:00–10:30  Challenges in dealing with historical gossip (Christian Lendl)
    10:30–11:00  Potenziale und Herausforderungen einer KI-unterstützten Medien- und Texterschließung am Beispiel der Gattung „Fotogedicht“ (Lisa Hufschmidt)
    11:00–11:30  Coffee break

     

    Panel 4: Analyzing magazines & newspapers with AI

    Moderation: Christoph Steindl

    11:30–12:00  LexiMus Project. Advantages and Challenges of Artificial Intelligence in the Analysis of Music Press (Daniel Martín Sáez, María Isabel Jiménez Gutiérrez)
    12:00–12:30  AI-assisted Analysis of Arrival Lists: From the “Wienerisches Diarium” to the “Regensburgisches Diarium” (Nina C. Rastinger, Sarah Lentz)
    12:30–14:00  Lunch break (catered)

     

    Panel 5: Analyzing newspapers with AI

    Moderation: Andreas Baumann

    14:00–14:30  Part-of-speech and grammar tagging with German spaCy pipelines from a linguistic perspective: Opportunities and challenges in the annotation of diminutives in forum posts on an Austrian online newspaper article (Katharina Korecky-Kröll)
    14:30–15:30  Concluding exchange over coffee and cake

     

    Abstracts

    Panel 1: Digitizing and enriching newspapers & magazines with AI - I

    The FINLAM Newspaper Dataset - a dataset for end-to-end newspaper recognition

    Solène Tarride

    TEKLIA

    Abstract

    Advances in machine learning and the emergence of Visual Large Language Models (VLLMs) have significantly advanced the field of automatic document understanding. However, these models often show limited performance on historical or handwritten documents. The ANR Finlam (Foundation INtegrated models for Libraries Archives and Museum) project specifically aims to develop multimodal models that can handle a wide variety of documents, languages, layouts, writing styles and illustrations. One of the first use cases of this project focuses on historical newspapers.

    Historical newspapers present unique challenges for automated processing due to their dense and complex layouts. Tasks such as reading order detection and article separation remain underexplored in the machine learning and document analysis communities. To fill these gaps, we present the FINLAM Newspaper Dataset, an open-source dataset designed for end-to-end training and evaluation of complex newspaper recognition tasks. The FINLAM Newspaper Dataset contains 149 issues of 23 newspapers published in the 19th and 20th centuries, mainly in French, with some newspapers in English. Each issue contains between 2 and 12 pages, and each page is segmented into zones annotated with multimodal features: localization, textual content (extracted by OCR), zone classifications (among 13 categories, including article titles, intertitles, paragraphs, illustrations, advertisements and free ads), reading order and article separation. The dataset presents significant challenges due to its dense, complex and varied layouts. It is freely available on Hugging Face. In this talk, we will introduce the FINLAM Newspaper Dataset and present benchmark results for key tasks such as OCR, document layout analysis, reading order detection and article separation.
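    The annotation schema just described can be sketched as a small record type. This is a minimal sketch; the field and category names below are illustrative assumptions, not the dataset's actual column names:

```python
from dataclasses import dataclass

@dataclass
class Zone:
    """One annotated zone on a newspaper page (hypothetical field names)."""
    bbox: tuple          # localization: (x, y, width, height) in pixels
    text: str            # textual content extracted by OCR
    zone_class: str      # one of the 13 zone categories (e.g. "article_title")
    reading_order: int   # position of the zone in the page's reading order
    article_id: int      # article separation: zones sharing an id form one article

def group_articles(zones):
    """Group zones into articles, each sorted by reading order."""
    articles = {}
    for z in sorted(zones, key=lambda z: z.reading_order):
        articles.setdefault(z.article_id, []).append(z)
    return articles
```

    Grouping zones by article id and walking them in reading order reconstructs article-level text, which is what the article-separation and reading-order annotations make possible.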

    Short Biography

    Solène Tarride is a machine learning researcher at TEKLIA. During her PhD at IRISA, she focused on deep learning for understanding historical documents. At TEKLIA, she develops new methods for automatic information extraction from historical and modern documents.



    Das Darmstädter Tagblatt und zwei KI-Lösungen: Transkribus-Workflows und die Entwicklung eines KI-Assistenten

    Dario Kampkaspar, Kevin Kuck, Anna Christina Kupffer

    Universitäts- und Landesbibliothek Darmstadt

    Abstract

    The digitization project „Darmstädter Tagblatt“ is currently producing one of the most extensive bodies of source material on historical newspapers in the German-speaking world. Across two DFG-funded project phases, more than 600,000 pages from almost three centuries are being digitized and made available as full texts.1 Alongside a brief introduction to the project, we offer insights into two aspects of the project and of collaborations that have grown out of it.

    1. The Transkribus workflow

    Originally commissioned as an external service, OCR and layout recognition for the post-war issues (1949–1986) were carried out by a service provider. While layout recognition was satisfactory, a strategic decision was made to redo the OCR in-house. Thanks to an infrastructural investment at the ULB Darmstadt, the Tagblatt project can use Transkribus as an Epic Member. In a workflow combining TextTitan with an existing ZEiD model, new versions of the full texts were produced using double keying; this was considerably faster than manually correcting the service provider's output. The layout information in the provider's ALTO files was converted into PAGE files and could thus be reused in Transkribus.
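    The reuse of external layout analysis via ALTO-to-PAGE conversion can be illustrated with a minimal sketch. It is heavily simplified: the namespaces, text lines, and metadata that real ALTO and PAGE files carry are omitted, and only the block geometry is transferred:

```python
import xml.etree.ElementTree as ET

def alto_blocks_to_page(alto_xml: str) -> str:
    """Convert ALTO TextBlock coordinates into PAGE TextRegion polygons.

    ALTO stores a block as HPOS/VPOS/WIDTH/HEIGHT; PAGE encodes a region
    as a polygon of points. This sketch carries over only that geometry.
    """
    alto = ET.fromstring(alto_xml)
    root = ET.Element("PcGts")
    page = ET.SubElement(root, "Page")
    for i, block in enumerate(alto.iter("TextBlock")):
        x, y = int(block.get("HPOS")), int(block.get("VPOS"))
        w, h = int(block.get("WIDTH")), int(block.get("HEIGHT"))
        region = ET.SubElement(page, "TextRegion", id=f"r{i}")
        # Four corner points of the block, clockwise from the top left.
        points = f"{x},{y} {x+w},{y} {x+w},{y+h} {x},{y+h}"
        ET.SubElement(region, "Coords", points=points)
    return ET.tostring(root, encoding="unicode")
```
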

    2. The RAGblatt

    Darmstadt is home to Hessian.AI, a spin-off of TU Darmstadt with AI expertise.2 To explore the possibilities, and to tackle the challenges that come with a project as large as the Tagblatt project, a cooperation with Hessian.AI was initiated. Its current result is the "RAGblatt", an AI-supported assistant for searching the Tagblatt material. The assistant is still at the prototype stage and uses models such as Meta's Llama and Occiglot. It enables text-based searches that generate written answers and include the context of the newspaper article relevant to the query. We hope the assistant will help users find material and offer an exploratory starting point for research topics. In a lab-like environment, researchers can test potential questions with minimal technical effort or time investment. The talk will also address technical and disciplinary challenges.
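    A schematic sketch of the retrieval-augmented generation pattern behind such an assistant, under stated assumptions: the real system uses embedding-based retrieval and models such as Llama or Occiglot, whereas this stdlib-only toy ranks articles by token overlap and merely assembles the prompt:

```python
def retrieve(query, articles, k=1):
    """Rank newspaper articles by naive token overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(articles,
                    key=lambda a: len(q & set(a["text"].lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, articles):
    """Assemble the generation prompt: retrieved context plus the question."""
    context = "\n".join(f"[{a['date']}] {a['text']}" for a in articles)
    return f"Answer using only this newspaper context:\n{context}\n\nQuestion: {query}"

# Hypothetical mini-corpus standing in for digitized Tagblatt articles.
articles = [
    {"date": "1923-05-01", "text": "Grosse Feier auf dem Marktplatz in Darmstadt"},
    {"date": "1950-02-11", "text": "Neues Theater eroeffnet in der Innenstadt"},
]
prompt = build_prompt("Feier Marktplatz", retrieve("Feier Marktplatz", articles))
```

    The answer an LLM generates from such a prompt stays grounded in the retrieved article, which is what lets the assistant return the relevant newspaper context alongside its reply.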

    The assistant is currently available only within the TU Darmstadt network: elektra.ai.tu-darmstadt.de/ulb (accessed 5 February 2025).

    Footnotes

    1 www.ulb.tu-darmstadt.de/forschen_publizieren/forschen/darmstaedter_tagblatt.en.jsp (accessed 5 February 2025).

    2 hessian.ai (accessed 5 February 2025).

    Short Biographies

    Dario Kampkaspar is head of the Centre for Digital Editions at the ULB Darmstadt. Not least since the Wien(n)erisches Diarium, he has been active in the digitization and full-text capture of newspapers. He is involved, among other things, in the TEI community and in Transkribus (e.g., the TEI export from Transkribus and further tools for digital editions).

    Kevin Kuck has overseen the „Darmstädter Tagblatt“ digitization project since September 2023. At the ULB Darmstadt he also works on the project „Europäischer Religionsfrieden Digital“. He studied history at Heidelberg University, where he is also pursuing his doctorate.



    From Image to Machine-Readable Text: AI for Layout Analysis, OCR and Post-Correction for Job Ads from Historical Newspapers

    Klara Venglarova, Raven Adam, Georg Vogeler

    University of Graz

    Abstract

    This study describes a comprehensive workflow for extracting machine-readable text from historical newspaper job advertisements, addressing layout analysis, optical character recognition (OCR), and post-correction with state-of-the-art machine learning methods. Leveraging an annotated dataset as ground truth, we evaluate various layout detection tools, including default ANNO segments, Eynollah, Transkribus, and Tesseract. For evaluation, we also propose a new methodology based on text presence in non-intersecting parts of a predicted region and its ground truth. Eynollah demonstrated the highest segmentation accuracy (72.5%), while other models, such as Kraken, underperformed due to a mismatch between the specific task and the pretrained models.

    For OCR, we compared multiple models, including GT4HistOCR (CER: 0.1218), the Tesseract model used in ANNO (CER: 0.1295), and the German_Print model (CER: 0.1202). While several models reached comparable results, Fraktur_GT4HistOCR achieved the best WER. Post-correction further improved text quality, addressing OCR-induced biases. We fine-tuned the hmbyt5-preliminary 1 model on the ICDAR2019-POCR dataset for OCR correction to perform better on our data. As the manual creation of a gold standard is time-consuming, we also explore generative transformer-based methods to support creating the training data needed for a well-performing post-correction model.
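    The CER and WER figures above are edit-distance metrics. As a reference, a minimal sketch of their computation: the standard Levenshtein distance normalized by the length of the reference, over characters for CER and over word tokens for WER.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance over reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: the same metric computed over word tokens."""
    return levenshtein(reference.split(), hypothesis.split()) / len(reference.split())
```
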

    Our work emphasizes the persistent need for annotated datasets and gold standards in assessing segmentation and recognition performance. By systematically comparing tools and methodologies, we contribute to the transparency of the results on which subsequent data analysis is based.

    This work is part of the FWF project P35783 “The making of the incredibly differentiated labor market” (PI Jörn Kleinert).

    Short Biographies

    Georg Vogeler is professor for Digital Humanities at the University of Graz. He is a trained historian (historical auxiliary sciences), graduated from Ludwig-Maximilians-Universität (LMU) Munich, and has worked on late medieval administrative records, Emperor Frederick II (1198-1250), digital scholarly editing, and semantic web technologies for the humanities, with positions at the LMU Munich, the Università del Salento in Lecce, and the University of Graz. He has recently engaged in the application of machine learning and AI to the analysis of historical records. He was and is PI of numerous projects, among them the ERC Advanced Grant From Digital to Distant Diplomatics (2022-2026).

    Klara Venglarova is a PhD student of Linguistics and Digital Humanities at the Palacky University in Olomouc, Czech Republic. She is involved in the FWF-funded project The Making of the Incredibly Differentiated Labor Market: Evidence from Job Offers from Ten Decades at the University of Graz (PI Jörn Kleinert), specifically engaged in layout analysis, OCR, post-correction, information extraction and other NLP and machine-learning tasks.

    Raven Adam is a PhD student at the Department of Environmental Systems Sciences at the University of Graz. His research focuses on NLP applications such as topic modeling, text classification, and text generation. He is currently involved in two FWF-funded projects, namely The Making of the Incredibly Differentiated Labor Market: Evidence from Job Offers from Ten Decades (PI Jörn Kleinert) and Responses to Threat and Solution-Oriented Climate News (PI Marie Lisa Ulrike Kogler).



    Panel 2: Digitizing and enriching newspapers & magazines with AI - II

    Hierarchical Structure Extraction from Newspaper Images Using a Transformer-Based Model

    William Mocaër, Clément Chatelain, Stéphane Nicolas, Thierry Paquet, Tom Simon, Pierrick Tranouez

    LITIS Laboratory

    Abstract

    Understanding newspaper images is a challenging task due to their complex hierarchical structures, rendered in a variety of dense layouts. This research introduces a novel transformer-based model specifically designed to tackle these challenges through a comprehensive, end-to-end approach. The proposed model excels at extracting the hierarchical structure of newspapers, including sections and articles. It performs block localization and categorization (title, paragraph, image, table…) and reading order prediction at multiple levels, providing a comprehensive, detailed, and consistent analysis of newspaper content.

    The approach relies on an iterative process of information extraction through the hierarchy of levels, where each level is processed one after the other. To enhance computational efficiency, each level is executed using a parallel attention mechanism. Relying on high-level structural modeling, the model achieves end-to-end processing without requiring any additional pre- or postprocessing, ensuring adaptability to a wide variety of newspaper formats.

    The model is trained using synthetic documents that capture the variability and complexity of real-world newspapers. These synthetic documents enable the model to learn robust representations of newspaper layouts, ensuring its ability to generalize across a wide range of structural configurations. Preliminary evaluations highlight the model's potential in accurately reconstructing newspaper hierarchies and providing insights into their content.

    The method offers a promising solution for precise structure extraction in highly structured documents such as newspapers but could be applicable to a wider range of documents addressing the growing need for scalable and efficient AI-based digitization solutions.

    Short Biography

    William Mocaër is a postdoctoral researcher at the LITIS laboratory (Rouen, France), where he contributes to the FINLAM project in collaboration with the Bibliothèque nationale de France (BnF) and Teklia, focusing on advanced newspaper analysis techniques. Previously, he completed a PhD at IRISA as part of the Shadoc Team (Systems for Hybrid Analysis of DOCuments), specialized in document analysis.



    Werkstattbericht aus der historisch-kritischen digitalen Edition der „Neuen Zeitschrift für Musik“ 1834-1844

    Nelly Krämer-Reinhardt

    Julius-Maximilians-Universität Würzburg, Bayerische Akademie der Wissenschaften

    Abstract

    In this workshop report, we offer insights into the conception of the digital edition of a historical journal.

    Starting point:
    The academy project „Robert Schumanns Poetische Welt“ includes, among other things, the historical-critical edition of the Neue Zeitschrift für Musik (NZfM) of its founding decade, during which the composer and music writer Robert Schumann conceived the journal and was responsible for its editorial content.
    The digital edition is being created in cooperation with the TCDH Trier in the virtual research environment FuD.

    Goal:
    Schumann's NZfM saw itself as an independent organ for the promotion of talented composers and formed a central platform within the Romantic discourse on music. It is thus an important corpus for musicology.
    The planned deeply indexed, annotated reading text, which can be accessed both at the level of individual issues and at the level of individual text units, will form the basis for future research. The aim of editing the journal corpus is the generation of deep data.

    Process
    Text capture: A custom-trained model of the AI-supported software Transkribus precisely captures both the text fields of the journal layout and the lines and their text.
    Multimodal capture: Music examples are created with the software mei-friend, collated, and then integrated into TEI XML (Text Encoding Initiative). The possibilities of OMR are currently still being tested.
    Semantic indexing: The journal issues are divided into units of meaning and semantically indexed. The focus thus shifts from mere text capture to the substance of the text units, some of which extend across several issues.
    Annotation: All entities are marked up and linked to authority data. The individual units of meaning are assigned to text categories, and explanatory commentary facilitates the understanding of passages that require explanation. Where corresponding sources exist, the genesis of the texts is also presented. On the one hand, handwritten sources are consulted, which are processed with the software tool Transcribo in FuD; on the other hand, the contributions that Schumann included in his „Gesammelte Schriften“ are linked and compared with the collation tool Comparo.

    Discussion
    We invite participants to discuss this concept and these methods, to reflect on the limitations of using AI to capture deep data in newspapers and journals, and to consider what role explanatory commentary can and should play in the age of AI.

    Short Biography

    Nelly Krämer-Reinhardt, M.A., studied music education at the Hochschule für Musik Würzburg and musicology at the Julius-Maximilians-Universität Würzburg. Since 2023 she has been a research associate in the academy project Robert Schumanns Poetische Welt; her dissertation deals with music examples in the Neue Zeitschrift für Musik.



    Panel 3: Analyzing magazines with AI

    AI-Driven Analysis of Female Representations in Fin-de-Siècle Spanish Magazines

    Adriana Rodríguez-Alfonso

    University of Tübingen

    Abstract

    This presentation examines female representations in three pioneering fin-de-siècle magazines from Spain: Vida nueva (Madrid, 1898-1900), La Vida Literaria (Madrid, 1899), and La vida galante (Barcelona, 1898-1905). This study is part of a Spanish-language magazine digitization project undertaken at the University of Tübingen, Germany. These magazines, conceived as platforms for artistic and literary dissemination, not only brought together prominent painters, illustrators, photographers, and writers of the Hispanic movement but also provided invaluable insights into the prevailing symbolic and social constructs of femininity.

    Given the close interrelationship between medicine and art in the late nineteenth century (Jordanova, 1989; Gilman, 1995; Mazzoni, 1996; Clemison and Vázquez, 2009; Tsuchiya, 2011; Alder, 2020), the social imaginaries surrounding women in these magazines frequently intersected with the dominant theories of women's mental health. These theories were influenced by European degeneration models (Nordau, Brachet, Charcot), as well as home-grown adaptations by leading Spanish positivist psychiatrists (Escuder, Giné y Partagás, Bernaldo de Quirós).

    Drawing from digital methods and perspectives, this presentation will showcase the results derived from the digitization, processing, and analysis of this corpus of Spanish cultural magazines from the turn of the century. Using Natural Language Processing (NLP) techniques—such as word-sense disambiguation and semantic analysis—the study maps the various semantic frameworks associated with women, their societal roles, and representations within nineteenth-century Spanish society. These techniques will reveal how medical, political, and social connotations were closely intertwined within the discourse surrounding female identity and roles of the time.

    This talk will also foster a discussion on the application of computational tools in periodical press analysis, highlighting both the potential and challenges of adapting digital methods across languages—particularly for Spanish-language materials. It will delve into issues such as cross-linguistic adaptation and the nuances involved in applying these methods to historical texts.

    Short Biography

    Adriana Rodríguez-Alfonso holds a PhD and Master's degree in Spanish and Latin American Literature from the University of Salamanca, and a Bachelor's degree in Hispanic Philology from the University of Havana. She is currently a professor and researcher at the Romanisches Seminar of the University of Tübingen, where she is working on her “Habilitation.”

    Her main research focuses on Portuguese and Spanish literature from the 19th and 20th centuries, literary magazines, intellectual fields and networks, and digital humanities. She has published articles and chapters in various essay collections and specialized journals and her book El grupo Shanghai en Argentina: Redes, estéticas y mercados editoriales latinoamericanos was published in 2024 by De Gruyter.



    Challenges in dealing with historical gossip

    Christian Lendl

    University of Vienna

    Abstract

    The Wiener Salonblatt was the hot gossip magazine of fin-de-siècle Vienna. The illustrated weekly was published from 1870 to 1938 and mainly consisted of short messages, mostly placed by members of the nobility, about personal achievements, travelling, and family matters. These short texts can be seen as typical examples of factoids and served the same purpose as posts on social media networks today: staying connected with one's peers and presenting oneself to the public.

    This dissertation project aims to analyze these (~250,000) factoids with the help of digital methods. The research goals are to better understand the transformation of the late Habsburg nobility by identifying topical trends and geospatial patterns, as well as by conducting a historical network analysis.

    The current focus of this project lies on the automated text recognition of the corpus. To this end, several Transkribus models are being trained for layout recognition, segmentation, and automated transcription. In addition, a text processing algorithm is being developed that optimizes the (textual) output of the transcription stage and transfers it into a database. This includes several steps of error correction, normalization, and validation, as well as an algorithm to correct the reading order of all messages on a single magazine page. All these processes are necessary to optimally prepare the factoids for the next stage of this dissertation project: analyzing the factoids with natural language processing methods. While named entity recognition will be used to extract as much information as possible (persons, organizations, places, etc.), topic modelling will identify the topic(s) of each factoid. All these steps have to be completed before the final stages can begin: data analysis and historical interpretation.
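    The reading-order problem arises because a naive top-to-bottom sort interleaves the columns of a magazine page. A minimal sketch of a column-first ordering follows; the project's actual algorithm is not published here, so this only illustrates the idea, and the block representation is an assumption:

```python
def reading_order(blocks, column_width):
    """Sort text blocks into reading order on a multi-column page.

    Each block is a dict with pixel coordinates "x" and "y". A block is
    assigned to a column by its left edge; columns are read left to
    right, and within each column blocks are read top to bottom.
    """
    def key(block):
        x, y = block["x"], block["y"]
        return (x // column_width, y, x)
    return sorted(blocks, key=key)
```
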

    Short Biography

    Christian Lendl is a PhD candidate at the Department of East European History at the University of Vienna. His fields of interest include the Austrian nobility in the late Habsburg Empire, the development of portrait and press photography, and the visual coverage in Austrian newspapers. He is also a lecturer for visual marketing at the IMC Krems University of Applied Sciences and holds an MSc in Computer Science from the Vienna University of Technology and an MA in History from the University of Vienna.



    Potenziale und Herausforderungen einer KI-unterstützten Medien- und Texterschließung am Beispiel der Gattung „Fotogedicht“

    Lisa Hufschmidt

    Julius-Maximilians-Universität Würzburg

    Abstract

    The talk addresses the partially automated indexing of photo poems1 (i.e., a specific literary form of text-image relationship) with the help of various AI models, from a literary-studies perspective. The indexing process is being developed within the DFG project Das Fotogedicht in illustrierten Zeitschriften zwischen 1895 und 1945,2 which started in 2024, in cooperation with the Zentrum für Philologie und Digitalität at the University of Würzburg. The project aims to capture the spatial relationship between text and image, the full text, and the image and text motifs used, and, in a second step, to examine the semantizations associated with them. Beyond technical aspects, the intended indexing and the research questions connected with it hold communicative potentials as well as challenges, which will be the focus of the talk.

    Footnotes

    1 See: Catani, Stephanie/Michael Will (2024): “Das Fotogedicht. Zur (Wieder-)Entdeckung einer intermedialen Gattung.” Zeitschrift für Deutsche Philologie Digital/Zeitschrift für Deutsche Philologie (2), doi:10.37307/j.1868-7806.2024.02.09.

    2 www.germanistik.uni-wuerzburg.de/ndl1/forschung-projekte/forschungsstelle-fotolyrik/.

    Short Biography

    Lisa Hufschmidt studied German studies (B.A.) and German literature (M.A.) in Mannheim and Stuttgart. Since July 2024 she has been a member of the DFG project Das Fotogedicht in illustrierten Zeitschriften zwischen 1895 und 1945 at the Julius-Maximilians-Universität Würzburg, which is concerned with the rediscovered genre of photo poetry. In her doctoral project, she is developing an analytical model for photo poems.



    Panel 4: Analyzing magazines & newspapers with AI

    LexiMus Project. Advantages and Challenges of Artificial Intelligence in the Analysis of Music Press

    Daniel Martín Sáez, María Isabel Jiménez Gutiérrez

    University of Salamanca

    Abstract

    In this presentation, we will introduce LexiMus, a project aimed at understanding trends in the use of musical lexicon in Spanish throughout history. Specifically, we will focus on the work of the team from the University of Salamanca, which is in charge of studying the press from the 18th to the 21st century, both general and specialized. This involves working with massive text data (millions of words), from which we often need to exclude non-musical information. To address this, we have developed a tool that uses Google Cloud OCR and Vertex AI. This learning platform allows us to train large language models (LLMs) and create automated workflows that extract information in blocks by applying a prompt similar to those used in chatbots. We were able to select and transcribe musical news from hundreds of periodical sources, creating a corpus of over 70 million words. In the past year, we began analyzing this corpus using the Voyant Tools platform, which enables us to study usage trends, create word clouds, and observe their evolution over time. We are still seeking ways to improve OCR reading, which is hindered by issues with text legibility and column organization, but perhaps the greatest challenge lies in the numerous interpretation problems that AI is far from solving at present, despite efforts in recent years in the field of the history of concepts (e.g., Peter de Bolla, Explorations in the Digital History of Ideas, 2024).
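    Usage-trend analysis of the kind performed with Voyant Tools can be sketched as a per-decade frequency count. This is a toy illustration with hypothetical data, not the project's actual tooling:

```python
from collections import Counter, defaultdict

def term_trends(documents, lexicon):
    """Count occurrences of musical lexicon terms per decade.

    `documents` is an iterable of (year, text) pairs; `lexicon` is a set
    of lower-cased musical terms to track.
    """
    trends = defaultdict(Counter)
    for year, text in documents:
        decade = (year // 10) * 10
        for token in text.lower().split():
            token = token.strip(".,;:!?()\"'")
            if token in lexicon:
                trends[decade][token] += 1
    return trends

# Hypothetical mini-corpus of dated press snippets.
docs = [
    (1801, "La sinfonia fue recibida con aplausos."),
    (1888, "El concierto de piano y la sinfonia."),
]
trends = term_trends(docs, {"sinfonia", "concierto", "piano"})
```
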

    Short Biographies

    Dr. Daniel Martín Sáez: Associate Professor of Musicology at the University of Salamanca. Professor of the BA in Musicology and the MA in Hispanic Music at the University of Salamanca. Member of the research team of LexiMus Project.

    María Isabel Jiménez Gutiérrez: Predoctoral Fellow at the University of Salamanca. BA in Musicology and MA in Hispanic Music from USAL. Graduate in Higher Artistic Education with a specialization in Clarinet from the Higher Conservatory of Castilla y León.



    AI-assisted Analysis of Arrival Lists: From the “Wienerisches Diarium” to the “Regensburgisches Diarium”

    Nina C. Rastinger, Sarah Lentz

    Austrian Academy of Sciences, University of Bremen

    Abstract

    Besides more widely known content, such as classical ‘news’ or advertisements, German historical newspapers contain many other, often highly undervalued types of text. One example is arrival lists, i.e., semi-structured texts that list persons who arrived in a certain city and were documented at city gates and/or in their accommodations. As an invaluable historical source, they can, among other things, provide important insights into pre-modern travel and mobility. At the same time, although many arrival lists are already digitally available as parts of larger newspaper collections (e.g., AustriaN Newspapers Online, Deutsches Zeitungsportal, DigiPress), gaining access to the information stored in these lists is usually not straightforward. Instead, researchers face the challenge of transforming unstructured full texts, or even only facsimiles, into structured, systematically analysable data.

    With this challenge in mind, the contribution presents an AI-assisted approach to the (semi-)automatic analysis of arrival lists that combines Transkribus for full-text digitization with LLMs (especially GPT-3.5 and GPT-4o) for Named Entity Recognition (NER). This combination was first employed for the arrival lists of the Wien[n]erisches Diarium (1703-1725) in the context of the case study “Visiting Vienna” (Rastinger 2024) and yielded excellent results, including an approximate Character Error Rate (CER) of 0.8% and an NER F1 score of 0.97. The outcome indicates that arrival lists, as semi-structured newspaper text types with a high relative count of named entities, are an ideal object for AI-assisted automatic annotation.
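    The reported NER F1 score combines precision and recall over entity spans. A minimal sketch of the entity-level computation, assuming exact span-and-label matching:

```python
def ner_f1(gold, predicted):
    """Entity-level precision, recall, and F1.

    `gold` and `predicted` are sets of (start, end, label) tuples; an
    entity counts as correct only if span and label match exactly.
    """
    tp = len(gold & predicted)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```
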

    Building on this idea, a collaboration between the Austrian Centre for Digital Humanities and Cultural Heritage and the University of Bremen is now exploring this potential further. Concretely, the AI-based workflow initially developed for the arrival lists of the Wien[n]erisches Diarium (1703-1725) is currently being adapted and applied to selected arrival lists of the Regensburgisches Diarium (1762-1802). The presentation will therefore not only offer a retrospective on the work completed so far, but also an outlook on next steps, e.g., by discussing the commonalities and differences between the two similarly named newspapers, potentially necessary adjustments to the workflow, and general considerations regarding the transferability of LLM-based approaches.

    References

    Rastinger, Nina C. (2024): Re-Reading Lists in Historical Newspapers: Digital Insights into an Overlooked Text Type, in: Selected papers from the CLARIN Annual Conference 2023. Linköping Electronic Press.

    Short Biographies

    Nina C. Rastinger is a doctoral researcher at the Austrian Centre for Digital Humanities and Cultural Heritage. Her doctoral project, funded through an ÖAW DOC fellowship, deals with periodically published lists in historical newspapers; her areas of interest include early modern texts, digital workflows for corpus-based research, and generative AI.

    Sarah Lentz is a postdoctoral researcher at the Institute of History at the University of Bremen and Associated Junior Fellow at the Hanse-Wissenschaftskolleg, Delmenhorst. She conducts research on mobilities in Early Modern Central Europe and is head of the funded research projects "Inequalities on the Move" and "AI & Marginalized Mobilities".



    Panel 5: Analyzing newspapers with AI

    Part-of-speech and grammar tagging with German spaCy pipelines from a linguistic perspective: Opportunities and challenges in the annotation of diminutives in forum posts on an Austrian online newspaper article

    Katharina Korecky-Kröll

    Austrian Academy of Sciences

    Abstract

    The Python NLP library spaCy (Honnibal et al. 2020) is a useful tool for anyone interested in linguistic analyses of large amounts of written data.

    Using spaCy, such data can be tokenized and tagged for part-of-speech quickly; a basic morphological annotation for categories of inflectional morphology (e.g., case, gender, and number of nouns) as well as an annotation of syntactic dependencies or named entities is also possible. All these levels of annotation may serve as a basis for further linguistic analyses.

    To date, spaCy supports over 75 languages and offers over 80 pretrained pipelines for 25 languages. There are four pipelines for German, all of which are based on the TIGER-Corpus, Tiger2Dep and WikiNER, sometimes supplemented by additional sources (given in round brackets after the pipeline name), and which vary regarding the accuracy of their morphological annotation [in square brackets]:

    • de_core_news_sm [0.91]
    • de_core_news_md (Explosion fastText Vectors (cbow, OSCAR Common Crawl + Wikipedia)) [0.92]
    • de_core_news_lg (like de_core_news_md) [0.92]
    • de_dep_news_trf (bert-base-german-cased) [0.97]

    Specific challenges arise when annotating user-generated content in a pluricentric language such as German, which has several national standard varieties and is also characterized by numerous dialects and regiolects, resulting in highly diverse word-formation patterns (e.g., Ammon 1995; Lenz 2019). Thus, in a 12-million-token corpus of forum posts on an online article of the Austrian newspaper DERSTANDARD.at regarding the COVID-19 pandemic (e.g., Korecky-Kröll 2023; Korecky-Kröll et al. submitted), spaCy assigns the wrong grammatical gender to many diminutive nouns or misclassifies them in other ways (e.g., common nouns as proper names).

    Using a randomly selected sub-corpus of 1,000 diminutive tokens from the above-mentioned corpus, the four German spaCy pipelines are tested for accuracy, problems at the individual token or lemma level are identified, and possible solutions are worked out. As an outlook, the possibility of an additional automatic word-formation tagging (e.g., Wartena 2023) is also discussed.

    References

    • Ammon, Ulrich. 1995. Die deutsche Sprache in Deutschland, Österreich und der Schweiz: das Problem der nationalen Varietäten. Berlin & New York: De Gruyter.
    • Honnibal, Matthew, Ines Montani, Sofie Van Landeghem & Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. doi: 10.5281/zenodo.1212303. 
    • Korecky-Kröll, Katharina. 2023. Diminutives and number: theoretical predictions and empirical evidence from German in Austria. In: Stela Manova, Laura Grestenberger & Katharina Korecky-Kröll. eds. Diminutives across languages, theoretical frameworks and linguistic domains. Berlin: De Gruyter (= Trends in Linguistics. Studies and Monographs 380), 179-204. https://doi.org/10.1515/9783110792874-008
    • Korecky-Kröll, Katharina, Amelie Dorn, Theresa Ziegler, Jan Höll & Alexandra N. Lenz. submitted. Language in times of COVID-19: lexical and morphopragmatic analyses of two Austrian Media Corpora. Submitted to: Digital Scholarship in the Humanities.
    • Lenz, Alexandra N. 2019. Bairisch und Alemannisch in Österreich. In Joachim Herrgen & Jürgen Erich Schmidt. eds. Language and Space. An International Handbook of Linguistic Variation. Vol. 4: Deutsch. Unter Mitarbeit von Hanna Fischer und Brigitte Ganswindt. Berlin & Boston: de Gruyter Mouton (= Handbooks of Linguistics and Communication Science 30.4), 318–363.
    • Wartena, Christian. 2023. The Hanover Tagger (Version 1.1.0) - Lemmatization, Morphological Analysis and POS Tagging in Python. doi: 10.25968/opus-2457. https://serwiss.bib.hs-hannover.de/frontdoor/deliver/index/docId/2457/file/wartena2023-HanTa_v1.1.0.pdf
    Short Biography

    After completing her PhD in Linguistics at the University of Vienna in 2012, Katharina Korecky-Kröll worked in several postdoc positions. She is now a Senior Lecturer at the Department of German Studies of the University of Vienna and an Academy Scientist at the Dictionary of Historical Bavarian Dialects in Austria and South Tyrol of the Research Unit Linguistics of the Austrian Centre for Digital Humanities and Cultural Heritage of the Austrian Academy of Sciences.


    Date

    7–8 May 2025

    Location

    Seminar room 1,
    Campus of the Austrian Academy of Sciences,
    Bäckerstraße 13, 1010 Vienna

    Organization

    Department of Literary and Textual Studies,
    Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH),
    in cooperation with DHd-AG Zeitungen & Zeitschriften

    Contact

    Nina C. Rastinger
    Claudia Resch

    Languages

    The presentations will be held in either German or English.