Conference "Newspapers, Magazines & AI Models: Training and (Re-)Use in the Digital Humanities"

The two-day conference is dedicated to the use of AI models for the digitization and analysis of newspapers and magazines from the early modern period to the present. This covers both the “out-of-the-box” use and fine-tuning of existing models and the training of new models.

The term “AI model” is deliberately defined broadly and includes several subfields of artificial intelligence (e.g., Machine Learning, Deep Learning, Generative AI, NLP) and architectures (e.g., CNNs, BERT, GPT, CLIP) as well as different modalities (text, image, multimodal models) and modes of integration into individual workflows (e.g., through applications such as Transkribus, Newspaper Navigator; through Python libraries like spaCy, flair).

    The conference focuses on various application scenarios of AI in relation to newspapers and magazines. The following areas of use are of particular interest:

    • Layout analysis and structural annotation
    • Automated Text Recognition (HTR, OCR)
    • Text genre classification
    • Semantic/linguistic annotation (e.g., Named Entity Recognition, Part-of-Speech Tagging)
    • Image annotation and classification (Computer Vision)
    • Format transformation and data modeling
    • Corpus design and searchability
    • Data analysis and visualization

    Due to limited capacities, we ask you to register using our registration form.

    Programme

    Day 1

    14:00–14:30  Welcome and Introduction (Alexandra N. Lenz, Claudia Resch, Nina C. Rastinger)

    Panel 1: Digitizing and enriching newspapers & magazines with AI - I

    14:30–15:00  Hierarchical Structure Extraction from Newspaper Images Using a Transformer-Based Model (William Mocaër, Clément Chatelain, Stéphane Nicolas, Thierry Paquet, Tom Simon, Pierrick Tranouez)
    15:00–15:30  The FINLAM Newspaper Dataset - a dataset for end-to-end newspaper recognition (Solène Tarride)
    15:30–16:00  From Image to Machine-Readable Text: AI for Layout Analysis, OCR and Post-Correction for Job Ads from Historical Newspapers (Klara Venglarova, Raven Adam, Georg Vogeler)
    16:00–16:30  Coffee break


    Panel 2: Digitizing and enriching newspapers & magazines with AI - II

    16:30–17:00  Das Darmstädter Tagblatt und zwei KI-Lösungen: Transkribus-Workflows und die Entwicklung eines KI-Assistenten (Dario Kampkaspar, Kevin Kuck, Anna Christina Kupffer)
    17:00–17:30  Werkstattbericht aus der historisch-kritischen digitalen Edition der „Neuen Zeitschrift für Musik“ 1834–1844 (Nelly Krämer-Reinhardt)


    Day 2

    Panel 3: Analyzing magazines with AI

    09:30–10:00  AI-Driven Analysis of Female Representations in Fin-de-Siècle Spanish Magazines (Adriana Rodríguez-Alfonso)
    10:00–10:30  Challenges in dealing with historical gossip (Christian Lendl)
    10:30–11:00  Potenziale und Herausforderungen einer KI-unterstützten Medien- und Texterschließung am Beispiel der Gattung „Fotogedicht“ (Lisa Hufschmidt)
    11:00–11:30  Coffee break


    Panel 4: Analyzing magazines & newspapers with AI

    11:30–12:00  LexiMus Project. Advantages and Challenges of Artificial Intelligence in the Analysis of Music Press (Daniel Martín Sáez, María Isabel Jiménez Gutiérrez)
    12:00–12:30  LLM-based list analysis: From semi-structured newspaper texts to structured data (Nina C. Rastinger)
    12:30–14:00  Lunch break (catered)


    Panel 5: Analyzing newspapers with AI

    14:00–14:30  Semantische Variationen und Bedeutungswandel im Ukrainischen: Herausforderungen für Multilinguale Sprachmodelle (Nataliia Cheilytko)
    14:30–15:00  Part-of-speech and grammar tagging with German spaCy pipelines from a linguistic perspective: Opportunities and challenges in the annotation of diminutives in forum posts on an Austrian online newspaper article (Katharina Korecky-Kröll)
    15:00–16:00  Concluding exchange over coffee and cake


    Abstracts

    Panel 1: Digitizing and enriching newspapers & magazines with AI - I

    Hierarchical Structure Extraction from Newspaper Images Using a Transformer-Based Model

    William Mocaër, Clément Chatelain, Stéphane Nicolas, Thierry Paquet, Tom Simon, Pierrick Tranouez

    LITIS Laboratory

    Abstract

    Understanding newspaper images is a challenging task due to their complex hierarchical structures, rendered in a variety of dense layouts. This research introduces a novel transformer-based model specifically designed to tackle these challenges through a comprehensive, end-to-end approach. The proposed model excels at extracting the hierarchical structure of newspapers, including sections and articles. It performs block localization and categorization (title, paragraph, image, table, ...) and reading-order prediction at multiple levels. The model provides a comprehensive, detailed and consistent analysis of newspaper content.

    The approach relies on an iterative process of information extraction through the hierarchy of levels, where each level is processed one after the other. To enhance computational efficiency, each level is processed using a parallel attention mechanism. Relying on high-level structural modeling, the model achieves end-to-end processing without requiring any additional pre- or post-processing, ensuring adaptability to a wide variety of newspaper formats.
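    The general pattern of level-by-level decoding with parallel attention within each level can be sketched as follows. This is a generic reconstruction of the idea described above, not the authors' actual architecture; dimensions, query counts, and module choices are assumptions.

```python
# Illustrative sketch: decode one hierarchy level at a time, with all
# queries of a level attending to the page features in parallel.
import torch
import torch.nn as nn

class LevelDecoder(nn.Module):
    """One hierarchy level: queries attend to page features in parallel,
    then predict a category and a bounding box per element."""
    def __init__(self, dim: int = 256, heads: int = 8, n_classes: int = 5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classify = nn.Linear(dim, n_classes)  # title, paragraph, image, ...
        self.locate = nn.Linear(dim, 4)            # box (x, y, w, h)

    def forward(self, queries, page_feats):
        out, _ = self.attn(queries, page_feats, page_feats)
        return out, self.classify(out), self.locate(out)

page_feats = torch.randn(1, 400, 256)  # stub for the encoded newspaper image
queries = torch.randn(1, 8, 256)       # initial queries (e.g., sections)
# Levels (e.g., section -> article -> block) are processed one after the
# other; the output of one level seeds the queries of the next.
for level in [LevelDecoder(), LevelDecoder(), LevelDecoder()]:
    queries, class_logits, boxes = level(queries, page_feats)
```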

    The model is trained using synthetic documents that capture the variability and complexity of real-world newspapers. These synthetic documents enable the model to learn robust representations of newspaper layouts, ensuring its ability to generalize across a wide range of structural configurations. Preliminary evaluations highlight the model's potential in accurately reconstructing newspaper hierarchies and providing insights into their content.

    The method offers a promising solution for precise structure extraction in highly structured documents such as newspapers, but could be applicable to a wider range of documents, addressing the growing need for scalable and efficient AI-based digitization solutions.

    Short Biography

    William Mocaër is a postdoctoral researcher at the LITIS laboratory (Rouen, France), where he contributes to the FINLAM project in collaboration with the Bibliothèque nationale de France (BnF) and Teklia, focusing on advanced newspaper analysis techniques. Previously, he completed a PhD at IRISA as part of the Shadoc team (Systems for Hybrid Analysis of DOCuments), which specializes in document analysis.



    The FINLAM Newspaper Dataset - a dataset for end-to-end newspaper recognition

    Solène Tarride

    TEKLIA

    Abstract

    Advances in machine learning and the emergence of Visual Large Language Models (VLLMs) have significantly advanced the field of automatic document understanding. However, these models often show limited performance on historical or handwritten documents. The ANR FINLAM (Foundation INtegrated models for Libraries, Archives and Museums) project specifically aims to develop multimodal models that can handle a wide variety of documents, languages, layouts, writing styles and illustrations. One of the first use cases of this project focuses on historical newspapers.

    Historical newspapers present unique challenges for automated processing due to their dense and complex layouts. Tasks such as reading order detection and article separation remain underexplored in the machine learning and document analysis communities. To fill these gaps, we present the FINLAM Newspaper Dataset, an open-source dataset designed for end-to-end training and evaluation of complex newspaper recognition tasks. The FINLAM Newspaper Dataset contains 149 issues of 23 newspapers published in the 19th and 20th centuries, mainly in French, with some newspapers in English. Each issue contains between 2 and 12 pages, and each page is segmented into zones annotated with multimodal features: localization, textual content (extracted by OCR), zone classifications (among 13 categories including article titles, intertitles, paragraphs, illustrations, advertisements and free ads), reading order and article separation. The dataset presents significant challenges due to its dense, complex and varied layouts. It is freely available on HuggingFace. In this talk, we will introduce the FINLAM Newspaper Dataset and present benchmark results for key tasks such as OCR, document layout analysis, reading order detection and article separation.
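    Datasets published this way can typically be pulled with the Hugging Face datasets library; a hedged sketch follows, in which the repository id and field names are placeholders, not the dataset's actual identifiers.

```python
# Hedged sketch of loading the dataset with the `datasets` library. Check
# the actual FINLAM Newspaper Dataset page on HuggingFace for the real
# repository id and annotation fields.
from datasets import load_dataset

ds = load_dataset("teklia/finlam-newspapers")  # hypothetical repository id
print(ds)                  # available splits and their sizes
example = ds["train"][0]   # one annotated page, assuming a "train" split
print(example.keys())      # e.g., image, zones, text, reading order
```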

    Short Biography

    Solène Tarride is a machine learning researcher at TEKLIA. During her PhD at IRISA, she focused on deep learning for understanding historical documents. At TEKLIA, she develops new methods for automatic information extraction from historical and modern documents.



    From Image to Machine-Readable Text: AI for Layout Analysis, OCR and Post-Correction for Job Ads from Historical Newspapers

    Klara Venglarova, Raven Adam, Georg Vogeler

    University of Graz

    Abstract

    This study describes a comprehensive workflow for extracting machine-readable text from historical newspaper job advertisements, addressing layout analysis, optical character recognition (OCR), and post-correction with state-of-the-art machine learning methods. Leveraging an annotated dataset as ground truth, we evaluate various layout detection tools, including the default ANNO segments, Eynollah, Transkribus, and Tesseract. For evaluation purposes, we also propose a new methodology based on the presence of text in the non-intersecting parts of a predicted region and its ground truth. Eynollah demonstrated the highest segmentation accuracy (72.5%), while other models, such as Kraken, underperformed due to a mismatch between the specific task and the pretrained models.
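    One hedged reading of this overlap-based evaluation idea is sketched below; the bounding-box representation and the scoring rule are illustrative assumptions, not the authors' exact formulation.

```python
# Hedged sketch: score a predicted region by how much of the ground-truth
# text falls inside the intersection of predicted and ground-truth regions,
# rather than in their non-intersecting parts.
from shapely.geometry import Point, box

def region_text_score(pred_box, gt_box, word_centers):
    """pred_box/gt_box: (x0, y0, x1, y1); word_centers: [(x, y), ...]."""
    overlap = box(*pred_box).intersection(box(*gt_box))
    inside = sum(overlap.contains(Point(x, y)) for x, y in word_centers)
    return inside / max(len(word_centers), 1)

# Example: a prediction shifted to the right misses the first word.
print(region_text_score((120, 0, 420, 300), (100, 0, 400, 300),
                        [(110, 50), (200, 50), (300, 80)]))  # -> 0.67
```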

    For OCR, we compared multiple models, including GT4HistOCR (CER: 0.1218), the Tesseract model used in ANNO (CER: 0.1295), and the German_Print model (CER: 0.1202). While several models reached comparable results, Fraktur_GT4HistOCR achieved the best WER. Post-correction further improved text quality, addressing OCR-induced biases. We fine-tuned the hmbyt5-preliminary model on the ICDAR2019-POCR dataset so that it performs better on our data. As the manual creation of a gold standard is time-consuming, we also explore generative transformer-based methods to support the creation of the training data needed for a well-performing post-correction model.
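    For reference, CER values like those above can be computed from a simple Levenshtein distance; a minimal sketch follows (the authors' evaluation setup may normalize text differently).

```python
# Character/word error rate via edit distance (insertions, deletions,
# substitutions), computed with a standard dynamic-programming table.
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(pred: str, truth: str) -> float:
    """Character error rate: edits per ground-truth character."""
    return levenshtein(pred, truth) / max(len(truth), 1)

def wer(pred: str, truth: str) -> float:
    """Word error rate: the same algorithm over word tokens."""
    return levenshtein(pred.split(), truth.split()) / max(len(truth.split()), 1)

print(cer("Stellengesuh", "Stellengesuch"))  # one deletion -> ~0.077
```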

    Our work emphasizes the persistent need for annotated datasets and gold standards in assessing segmentation and recognition performance. By systematically comparing tools and methodologies, we contribute to the transparency of the results on which subsequent data analysis is based.

    This work is part of the FWF project P35783 “The making of the incredibly differentiated labor market” (PI Jörn Kleinert).

    Short Biographies

    Georg Vogeler is Professor of Digital Humanities at the University of Graz. He is a trained historian (historical auxiliary sciences), graduated from Ludwig-Maximilians-Universität (LMU) Munich, and has worked on late medieval administrative records, Emperor Frederick II (1198–1250), digital scholarly editing, and semantic web technologies for the humanities, with positions at the LMU Munich, the Università del Salento in Lecce and the University of Graz. He has recently engaged in the application of machine learning and AI to the analysis of historical records. He was and is PI of numerous projects, among them the ERC Advanced Grant From Digital to Distant Diplomatics (2022–2026).

    Klara Venglarova is a PhD student in Linguistics and Digital Humanities at Palacký University Olomouc, Czech Republic. She is involved in the FWF-funded project The Making of the Incredibly Differentiated Labor Market: Evidence from Job Offers from Ten Decades at the University of Graz (PI Jörn Kleinert), where she works on layout analysis, OCR, post-correction, information extraction and other NLP and machine learning tasks.

    Raven Adam is a PhD student at the Department of Environmental Systems Sciences at the University of Graz. His research focuses on NLP applications such as topic modeling, text classification and text generation. He is currently involved in two FWF-funded projects: The Making of the Incredibly Differentiated Labor Market: Evidence from Job Offers from Ten Decades (PI Jörn Kleinert) and Responses to Threat and Solution-Oriented Climate News (PI Marie Lisa Ulrike Kogler).



    Panel 2: Digitizing and enriching newspapers & magazines with AI - II

    Das Darmstädter Tagblatt und zwei KI-Lösungen: Transkribus-Workflows und die Entwicklung eines KI-Assistenten

    Dario Kampkaspar, Kevin Kuck, Anna Christina Kupffer

    Universitäts- und Landesbibliothek Darmstadt

    Abstract

    The digitization project “Darmstädter Tagblatt” is currently building one of the most extensive collections of historical newspaper sources in the German-speaking world. In two DFG-funded project phases, more than 600,000 pages from almost three centuries are being digitized and made available as full text.¹ In addition to a brief introduction to the project, we would like to offer insights into two aspects of the project and of the collaborations that have grown out of it.

    1. The Transkribus workflow

    Originally contracted out as a service, the OCR and layout recognition of the post-war issues (1949–1986) were carried out externally. While the layout recognition was satisfactory, following a strategic decision the OCR was redone in-house. Thanks to an infrastructural investment at the ULB Darmstadt, the Tagblatt project can use Transkribus as an Epic Member. In a workflow combining TextTitan with an existing model from the ZEiD, new versions of the full texts were produced using double keying. This was considerably faster than manually correcting the service provider's output. The layout information from the latter's ALTO files could be transferred into Transkribus, and thus reused, by transforming them into PAGE files.
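    To illustrate the last step, here is a minimal sketch of an ALTO-to-PAGE conversion, assuming current namespace versions and simple bounding-box regions; the project's actual transformation is more complete.

```python
# Minimal ALTO-to-PAGE sketch: copy each ALTO TextBlock into a PAGE
# TextRegion whose polygon is the block's bounding box.
from lxml import etree

# Namespace versions are assumptions; real ALTO/PAGE files declare their own.
ALTO = "http://www.loc.gov/standards/alto/ns-v4#"
PAGE = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

def alto_to_page(alto_path: str, page_path: str) -> None:
    alto = etree.parse(alto_path)
    pcgts = etree.Element(f"{{{PAGE}}}PcGts")
    page = etree.SubElement(pcgts, f"{{{PAGE}}}Page")  # real PAGE also needs image metadata
    for block in alto.iter(f"{{{ALTO}}}TextBlock"):
        # ALTO stores a bounding box; PAGE expects a polygon of points.
        x = int(float(block.get("HPOS")))
        y = int(float(block.get("VPOS")))
        w = int(float(block.get("WIDTH")))
        h = int(float(block.get("HEIGHT")))
        region = etree.SubElement(page, f"{{{PAGE}}}TextRegion",
                                  id=block.get("ID", "r1"))
        etree.SubElement(region, f"{{{PAGE}}}Coords",
                         points=f"{x},{y} {x+w},{y} {x+w},{y+h} {x},{y+h}")
    etree.ElementTree(pcgts).write(page_path, xml_declaration=True,
                                   encoding="UTF-8")
```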

    2. The RAGblatt

    Darmstadt is home to Hessian.AI, a spin-off of the TU Darmstadt with AI expertise.² To explore the possibilities, and to tackle the challenges that come with a project as extensive as the Tagblatt project, a cooperation with Hessian.AI was initiated. Its current result is the “RAGblatt”, an AI-based assistant that can be used to search the Tagblatt material. The assistant is still at the prototype stage and uses models such as Meta's Llama and Occiglot. It supports text-based queries that generate written answers and include the context of the newspaper article relevant to the query. We hope that this assistant will help users find material and provide an exploratory starting point for research topics. In a lab-like environment, researchers can test potential questions with minimal technical effort or expenditure of time. The talk will also address the technical and disciplinary challenges involved.

    At present, the assistant is only available within the network of the TU Darmstadt: elektra.ai.tu-darmstadt.de/ulb (accessed 5 February 2025).
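    The name “RAGblatt” points to retrieval-augmented generation (RAG). A minimal sketch of that pattern follows; the embedding model, helper names, and prompt are illustrative assumptions, not the project's implementation.

```python
# Minimal RAG sketch: retrieve the articles most similar to a question,
# then build a prompt that a generative model (e.g., Llama or Occiglot)
# answers using only the retrieved context.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

articles = [  # stand-ins for OCR'd newspaper articles
    "Darmstadt, 12. Maerz. Der Gewerbeverein eroeffnete gestern seine Ausstellung.",
    "Theater-Nachrichten: Im Hoftheater wird heute eine neue Oper gegeben.",
]
article_emb = embedder.encode(articles, convert_to_tensor=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k articles most similar to the question."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, article_emb, top_k=k)[0]
    return [articles[hit["corpus_id"]] for hit in hits]

def build_prompt(question: str) -> str:
    """Bundle retrieved article context with the question for the generator."""
    context = "\n---\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("Was wurde im Hoftheater gespielt?"))
```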

    Footnotes

    ¹ www.ulb.tu-darmstadt.de/forschen_publizieren/forschen/darmstaedter_tagblatt.en.jsp (accessed 5 February 2025).

    ² hessian.ai (accessed 5 February 2025).

    Short Biographies

    Dario Kampkaspar is head of the Centre for Digital Editions at the ULB Darmstadt. Not least since the Wien(n)erisches Diarium, he has been active in the digitization and full-text capture of newspapers. He is also involved in the TEI and in Transkribus (e.g., the TEI export from Transkribus and further tools for digital editions).

    Kevin Kuck has been in charge of the “Darmstädter Tagblatt” digitization project since September 2023. At the ULB Darmstadt he also works on the project “Europäischer Religionsfrieden Digital”. Mr Kuck studied history at Heidelberg University, where he is also pursuing his doctorate.



    Werkstattbericht aus der historisch-kritischen digitalen Edition der „Neuen Zeitschrift für Musik“ 1834-1844

    Nelly Krämer-Reinhardt

    Julius-Maximilians-Universität Würzburg, Bayerische Akademie der Wissenschaften

    Abstract

    In this workshop report, we would like to give an insight into the conceptual design of the digital edition of a historical journal.

    Starting point:
    The academy project “Robert Schumanns Poetische Welt” comprises, among other things, the historical-critical edition of the Neue Zeitschrift für Musik (NZfM) during its founding decade, in which the composer and writer on music Robert Schumann conceived the journal and was responsible for its editing.
    The digital edition is being created in cooperation with the TCDH Trier in the virtual research environment FuD.

    Goal:
    Schumann's NZfM saw itself as an independent organ for the promotion of talented composers and formed a central platform within the Romantic discourse on music. It is thus an important corpus for musicology.
    The planned deeply indexed, annotated reading text, which can be accessed both at the level of issues and at the level of individual text units, will form the basis for future research. The goal of editing the journal corpus is the generation of deep data.

    Process:
    Text capture: A custom-trained model in the AI-supported software Transkribus precisely captures both the text fields of the journal layout and the lines and their text.
    Multimodal capture: Music examples are created with the software mei-friend, collated, and then integrated into TEI-XML (Text Encoding Initiative). The possibilities of OMR are currently still being tested.
    Semantic indexing: The journal issues are divided into units of meaning and semantically indexed. The focus thus shifts from mere text capture to the substance of the text units, some of which extend across several issues.
    Annotation: All entities are marked up and linked to authority data. The individual units of meaning are assigned to text categories, and explanatory notes facilitate the understanding of passages that require explanation. In addition, where corresponding sources exist, the textual genesis is presented. For this, handwritten sources are consulted on the one hand, which are processed with the software tool Transcribo in FuD; on the other hand, the contributions that Schumann incorporated into his “Gesammelte Schriften” are linked and compared with the collation tool Comparo.

    Discussion:
    We invite participants to discuss this concept and these methods, to reflect on the limitations of using AI to capture deep data in newspapers and journals, and to consider what role explanatory commentary can and should play in the age of AI.

    Short Biography

    Nelly Krämer-Reinhardt, M.A. studied music education at the Hochschule für Musik Würzburg and musicology at the Julius-Maximilians-Universität Würzburg. Since 2023 she has been a research associate in the academy project Robert Schumanns Poetische Welt; her dissertation deals with music examples in the Neue Zeitschrift für Musik.



    Panel 3: Analyzing magazines with AI

    AI-Driven Analysis of Female Representations in Fin-de-Siècle Spanish Magazines

    Adriana Rodríguez-Alfonso

    University of Tübingen

    Abstract

    This presentation examines female representations in three pioneering fin-de-siècle magazines from Spain: Vida nueva (Madrid, 1898-1900), La Vida Literaria (Madrid, 1899), and La vida galante (Barcelona, 1898-1905). This study is part of a Spanish-language magazine digitization project undertaken at the University of Tübingen, Germany. These magazines, conceived as platforms for artistic and literary dissemination, not only brought together prominent painters, illustrators, photographers, and writers of the Hispanic movement but also provided invaluable insights into the prevailing symbolic and social constructs of femininity.

    Given the close interrelationship between medicine and art in the late nineteenth century (Jordanova, 1989; Gilman, 1995; Mazzoni, 1996; Cleminson and Vázquez, 2009; Tsuchiya, 2011; Alder, 2020), the social imaginaries surrounding women in these magazines frequently intersected with the then-dominant theories of women's mental health. These theories were influenced by European degeneration models (Nordau, Brachet, Charcot), as well as by home-grown adaptations by leading Spanish positivist psychiatrists (Escuder, Giné y Partagás, Bernaldo de Quirós).

    Drawing from digital methods and perspectives, this presentation will showcase the results derived from the digitization, processing, and analysis of this corpus of Spanish cultural magazines from the turn of the century. Using Natural Language Processing (NLP) techniques—such as word-sense disambiguation and semantic analysis—the study maps the various semantic frameworks associated with women, their societal roles, and representations within nineteenth-century Spanish society. These techniques will reveal how medical, political, and social connotations were closely intertwined within the discourse surrounding female identity and roles of the time.

    This talk will also foster a discussion on the application of computational tools in periodical press analysis, highlighting both the potential and challenges of adapting digital methods across languages—particularly for Spanish-language materials. It will delve into issues such as cross-linguistic adaptation and the nuances involved in applying these methods to historical texts.

    Short Biography

    Adriana Rodríguez-Alfonso holds a PhD and Master's degree in Spanish and Latin American Literature from the University of Salamanca, and a Bachelor's degree in Hispanic Philology from the University of Havana. She is currently a professor and researcher at the Romanisches Seminar of the University of Tübingen, where she is working on her “Habilitation.”

    Her main research focuses on Portuguese and Spanish literature from the 19th and 20th centuries, literary magazines, intellectual fields and networks, and digital humanities. She has published articles and chapters in various essay collections and specialized journals, and her book El grupo Shanghai en Argentina: Redes, estéticas y mercados editoriales latinoamericanos was published in 2024 by De Gruyter.



    Challenges in dealing with historical gossip

    Christian Lendl

    University of Vienna

    Abstract

    The Wiener Salonblatt was the hot gossip magazine of fin-de-siècle Vienna. The illustrated weekly was published from 1870 to 1938 and mainly consisted of short messages – mostly placed by members of the nobility – about personal achievements, travelling, and family matters. These short texts can be seen as typical examples of factoids and served the same purpose as posts on social media networks today: staying connected with one's peers and presenting oneself to the public.

    This dissertation project aims to analyze these (~250,000) factoids with the help of digital methods. The research goals are to better understand the transformation of the late Habsburg nobility by identifying topical trends and geospatial patterns, as well as by conducting a historical network analysis.

    The current focus of the project lies on automated text recognition of the corpus. To this end, several models are being trained in Transkribus for layout recognition, segmentation, and automated transcription. In addition, a text-processing algorithm is being developed that optimizes the (textual) output of the transcription stage and transfers it into a database. This includes several steps of error correction, normalization and validation, as well as an algorithm to correct the reading order (for all messages on a single magazine page). All these processes are necessary to optimally prepare the factoids for the next stage of the dissertation project: analyzing the factoids with natural language processing methods. While named entity recognition will be used to extract as much information as possible (persons, organizations, places, etc.), topic modelling will identify the topic(s) of each factoid. All these steps have to be completed before the final stages can begin: data analysis and historical interpretation.
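    To give an idea of the reading-order step, here is a minimal sketch of a column-based heuristic; the Region type, the column-width threshold, and the sorting rule are illustrative assumptions, not the project's actual algorithm.

```python
# Simple reading-order heuristic for a multi-column magazine page: group
# regions into columns by x-position, then read each column top to bottom.
from dataclasses import dataclass

@dataclass
class Region:
    x: int      # left edge of the text region (pixels)
    y: int      # top edge
    text: str

def reading_order(regions: list[Region], col_width: int = 600) -> list[Region]:
    """Sort regions column by column (left to right), top to bottom."""
    return sorted(regions, key=lambda r: (r.x // col_width, r.y))

page = [Region(650, 120, "second column, top"),
        Region(20, 800, "first column, bottom"),
        Region(30, 100, "first column, top")]
for r in reading_order(page):
    print(r.text)
```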

    Short Biography

    Christian Lendl is a PhD candidate at the Department of East European History (University of Vienna). His fields of interest include the Austrian nobility in the late Habsburg Empire, the development of portrait and press photography, and visual coverage in Austrian newspapers. He is also a lecturer in visual marketing at the IMC Krems University of Applied Sciences and holds an MSc in Computer Science from the Vienna University of Technology and an MA in History from the University of Vienna.




    Potenziale und Herausforderungen einer KI-unterstützten Medien- und Texterschließung am Beispiel der Gattung „Fotogedicht“

    Lisa Hufschmidt

    Julius-Maximilians-Universität Würzburg

    Abstract

    From a literary studies perspective, the talk addresses the partially automated indexing of photo poems¹ (i.e., a specific literary form of text-image relationship) with the help of different AI models. The indexing process is being developed within the DFG project Das Fotogedicht in illustrierten Zeitschriften zwischen 1895 und 1945,² which started in 2024, in cooperation with the Zentrum für Philologie und Digitalität at the University of Würzburg. The aim of the project is to capture the spatial relationship between text and image, the full text, and the image and text motifs used, and, in a second step, to examine the semantizations associated with them. The intended indexing and the research questions connected with it entail communicative as well as technical potentials and challenges, which will be the focus of the talk.

    Footnotes

    ¹ See Catani, Stephanie/Michael Will (2024): “Das Fotogedicht. Zur (Wieder-)Entdeckung einer intermedialen Gattung.” Zeitschrift für Deutsche Philologie Digital/Zeitschrift für Deutsche Philologie (2), doi:10.37307/j.1868-7806.2024.02.09.

    ² www.germanistik.uni-wuerzburg.de/ndl1/forschung-projekte/forschungsstelle-fotolyrik/.

    Short Biography

    Lisa Hufschmidt studied German Studies (B.A.) and German Literature (M.A.) in Mannheim and Stuttgart. Since July 2024 she has been a member of the DFG project Das Fotogedicht in illustrierten Zeitschriften zwischen 1895 und 1945 at the Julius-Maximilians-Universität Würzburg, which is devoted to the (re)discovered genre of photo poetry. In her doctoral project she is developing an analytical model for photo poems.



    Panel 4: Analyzing magazines & newspapers with AI

    LexiMus Project. Advantages and Challenges of Artificial Intelligence in the Analysis of Music Press

    Daniel Martín Sáez, María Isabel Jiménez Gutiérrez

    University of Salamanca

    Abstract

    In this presentation, we will introduce LexiMus, a project aimed at understanding trends in the use of musical lexicon in Spanish throughout history. Specifically, we will focus on the work of the team at the University of Salamanca, which is in charge of studying the press from the 18th to the 21st century, both general and specialized. This involves working with massive text data (millions of words), from which we often need to exclude non-musical information. To address this, we have developed a tool that uses Google Cloud OCR and Vertex AI. This learning platform allows us to train large language models (LLMs) and create automated workflows that extract information in blocks by applying a prompt similar to those used in chatbots. We were able to select and transcribe musical news from hundreds of periodical sources, creating a corpus of over 70 million words. In the past year, we began analyzing this corpus with the Voyant Tools platform, which enables us to study usage trends, create word clouds, and observe their evolution over time. Currently, we are still seeking ways to improve OCR, which is hindered by issues with text legibility and column organization. Perhaps the greatest challenge, however, lies in the numerous interpretation problems that AI is far from solving at present, despite efforts in recent years in the field of the history of concepts (e.g., Peter de Bolla, Explorations in the Digital History of Ideas, 2024).
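    The block-wise prompting step might look roughly as follows with the Vertex AI Python SDK; the project id, region, model name, and prompt are illustrative assumptions, not the LexiMus team's actual configuration.

```python
# Hedged sketch of prompt-based filtering on Vertex AI: ask an LLM whether
# an OCR'd block contains musical news. All identifiers are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="europe-west1")
model = GenerativeModel("gemini-1.5-pro")

PROMPT = ("Does the following newspaper excerpt deal with music "
          "(concerts, works, musicians, instruments)? Answer YES or NO.\n\n{}")

def is_musical_news(ocr_block: str) -> bool:
    response = model.generate_content(PROMPT.format(ocr_block))
    return response.text.strip().upper().startswith("YES")
```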

    Short Biographies

    Dr. Daniel Martín Sáez: Associate Professor of Musicology at the University of Salamanca, where he teaches in the BA in Musicology and the MA in Hispanic Music. Member of the research team of the LexiMus project.

    María Isabel Jiménez Gutiérrez: Predoctoral Fellow at the University of Salamanca. BA in Musicology and MA in Hispanic Music from USAL. Graduate in Higher Artistic Education with a specialization in Clarinet from the Higher Conservatory of Castilla y León.



    Panel 5: Analyzing newspapers with AI

    Semantische Variationen und Bedeutungswandel im Ukrainischen: Herausforderungen für Multilinguale Sprachmodelle 

    Nataliia Cheilytko 

    Friedrich Schiller University Jena

    Abstract

    This contribution examines the extent to which large language models (LLMs) and contextualized embeddings are able to capture nuances of meaning in Ukrainian words, particularly against the background of regional and diachronic variation in the 20th and 21st centuries. The aim of the project is to systematically analyze the dynamics of word meanings in Ukrainian with the help of modern AI models.

    As a low-resource language, Ukrainian suffers from insufficient annotated datasets and NLP tools, which complicates semantic representation and analysis. The processing of historical texts, which are often not available in digital archives, is particularly challenging. For initial experiments, data was drawn from the General Regionally Annotated Corpus of Ukrainian (GRAC), which comprises texts from various regions of Ukraine from the 20th century onwards.

    Two approaches were pursued. First, contextualized embeddings were visualized and clustered in order to analyze differences in meaning across contexts. Second, GPT models were used to determine word meanings in specific sentences. Initial results show that contextualized embeddings can successfully identify semantic change, while LLMs such as GPT-4o fail in some cases on historical or regional meanings.

    For example, the model correctly identified the new metaphorical meaning of “bavovna” (“explosion”), but failed on the historical regional meaning of the adjective “povazhnyi” (“strict”) in the Western Ukrainian of the early 20th century.

    Future work aims to adapt LLMs with targeted data in order to better model semantic change in Ukrainian and to achieve finer granularity in lexical analyses.
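    A minimal sketch of the first approach (clustering contextualized embeddings of one target word across sentences) might look as follows; the multilingual model, the naive subtoken matching, and the cluster count are illustrative assumptions, not the project's setup.

```python
# Collect contextualized embeddings of a target word and cluster them to
# surface distinct senses across contexts.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def word_embedding(sentence: str, word: str) -> torch.Tensor:
    """Mean last-hidden-state vector of the target word's subtokens."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    target = tok(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(target) + 1):  # first occurrence only
        if ids[i:i + len(target)] == target:
            return hidden[i:i + len(target)].mean(dim=0)
    raise ValueError(f"{word!r} not found in sentence")

sentences = [
    "Вночі в місті пролунала бавовна.",  # new metaphorical sense
    "На полях дозріває бавовна.",        # literal sense: cotton
]
vectors = torch.stack([word_embedding(s, "бавовна") for s in sentences])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors.numpy())
print(labels)
```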

    References

    • Cheilytko, N. and von Waldenfels, R. (2023): Exploring Word Sense Distribution in Ukrainian with a Semantic Vector Space Model. In Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), pages 73–78, Dubrovnik, Croatia. Association for Computational Linguistics.
    • Cheilytko, N. and von Waldenfels, R. (2024a): Semantic Change and Lexical Variation in Ukrainian with Vector Representations and LLM. In Simon Krek (ed.), Book of Abstracts of the Workshop Large Language Models and Lexicography, pp. 1–5. www.cjvt.si/wp-content/uploads/2024/10/LLM-Lex_2024_Book-of-Abstracts.pdf
    • Cheilytko, N. and von Waldenfels, R. (2024b): Word Embeddings for Detecting Lexical Semantic Change in Ukrainian. In Lexicography and Semantics. Proceedings of the XXI EURALEX International Congress, 8–12 October 2024, Cavtat, Croatia, pp. 231–241. https://euralex.jezik.hr/wp-content/uploads/2021/09/Euralax-XXI-final-web.pdf
    • Geeraerts, D., Speelman, D., Heylen, K., Montes, M., de Pascale, S., Franco, K. and Lang, M. (2023): Lexical Variation and Change. A Distributional Semantic Approach. Oxford Academic.
    • Montes, M. and Geeraerts, D. (2022): How vector space models disambiguate adjectives: A perilous but valid enterprise. Yearbook of the German Cognitive Linguistics Association, 10(1), 7–32. doi.org/10.1515/gcla-2022-0002
    • Shvedova, M., von Waldenfels, R., Yarygin, S., Rysin, A., Starko, V. and Nikolajenko, T. (2017–2024): GRAC: General Regionally Annotated Corpus of Ukrainian. Retrieved July 31, 2024, from uacorpus.org

    Short Biography

    Nataliia Cheilytko is a postdoctoral researcher at Friedrich Schiller University Jena, a computational linguist, NLP engineer, R&D team leader, and lecturer with more than ten years of experience in various linguistic, NLP, and Semantic Web projects in both academia and industrial startups. Her areas of expertise include Corpus Linguistics, Computational Linguistics, Natural Language Processing, Machine Learning and AI, Large Language Models, Semantic Modeling, Sociolinguistics, Language Variation and Change, Lectometry, Knowledge Representation, Labeled Property Graphs, and the Semantic Web.



    Part-of-speech and grammar tagging with German spaCy pipelines from a linguistic perspective: Opportunities and challenges in the annotation of diminutives in forum posts on an Austrian online newspaper article

    Katharina Korecky-Kröll

    Austrian Academy of Sciences

    Abstract

    The NLP Python library spaCy (Honnibal et al. 2020) is a useful tool for anyone interested in linguistic analyses of large amounts of written data.

    Using spaCy, such data can be tokenized and tagged for part-of-speech quickly; basic morphological annotation for categories of inflectional morphology (e.g., case, gender, and number of nouns) and annotation of syntactic dependencies or named entities are also possible. All these levels of annotation may serve as a basis for further linguistic analyses.

    To date, spaCy supports over 75 languages and offers over 80 pretrained pipelines for 25 languages. There are four pipelines for German, all based on the TIGER Corpus, Tiger2Dep and WikiNER, sometimes supplemented by additional sources (given in round brackets after the pipeline name), and varying in the accuracy of their morphological annotation [in square brackets]:

    • de_core_news_sm [0.91]
    • de_core_news_md (Explosion fastText Vectors (cbow, OSCAR Common Crawl + Wikipedia)) [0.92]
    • de_core_news_lg (like de_core_news_md) [0.92]
    • de_dep_news_trf (bert-base-german-cased) [0.97]

    Specific challenges arise when annotating user-generated content in a pluricentric language such as German, which has several national standard varieties and is also characterized by numerous dialects and regiolects, resulting in highly diverse word-formation patterns (e.g., Ammon 1995; Lenz 2019). Thus, in a 12-million-token corpus of forum posts on an online article of the Austrian newspaper DERSTANDARD.at about the COVID-19 pandemic (e.g., Korecky-Kröll 2023; Korecky-Kröll et al. submitted), spaCy assigns the wrong grammatical gender to many diminutive nouns or misclassifies them in other ways (e.g., common nouns as proper names).

    Using a randomly selected sub-corpus of 1,000 diminutive tokens from the above-mentioned corpus, the four German spaCy pipelines are tested for accuracy, problems at the level of individual tokens or lemmas are identified, and possible solutions are worked out. As an outlook, the possibility of additional automatic word-formation tagging (e.g., Wartena 2023) is also discussed.
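    A minimal sketch of such a comparison: run each German pipeline over a sentence containing Austrian diminutives and print the predicted part-of-speech and morphological gender. The sentence is an illustrative example, and the pipelines must be installed locally.

```python
# Compare the four German spaCy pipelines on diminutive nouns.
# Assumes each model is installed: python -m spacy download <name>
import spacy

PIPELINES = ["de_core_news_sm", "de_core_news_md",
             "de_core_news_lg", "de_dep_news_trf"]
TEXT = "Das Sackerl und das Tischerl stehen im Zimmer."

for name in PIPELINES:
    nlp = spacy.load(name)
    for token in nlp(TEXT):
        if token.text.endswith("erl"):  # crude diminutive filter for the demo
            print(f"{name}: {token.text} pos={token.pos_} "
                  f"gender={token.morph.get('Gender')}")
```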

    References

    • Ammon, Ulrich. 1995. Die deutsche Sprache in Deutschland, Österreich und der Schweiz: das Problem der nationalen Varietäten. Berlin & New York: De Gruyter.
    • Honnibal, Matthew, Ines Montani, Sofie Van Landeghem & Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. doi: 10.5281/zenodo.1212303.
    • Korecky-Kröll, Katharina. 2023. Diminutives and number: theoretical predictions and empirical evidence from German in Austria. In: Stela Manova, Laura Grestenberger & Katharina Korecky-Kröll. eds. Diminutives across languages, theoretical frameworks and linguistic domains. Berlin: De Gruyter (= Trends in Linguistics. Studies and Monographs 380), 179–204. https://doi.org/10.1515/9783110792874-008
    • Korecky-Kröll, Katharina, Amelie Dorn, Theresa Ziegler, Jan Höll & Alexandra N. Lenz. submitted. Language in times of COVID-19: lexical and morphopragmatic analyses of two Austrian Media Corpora. Submitted to: Digital Scholarship in the Humanities.
    • Lenz, Alexandra N. 2019. Bairisch und Alemannisch in Österreich. In Joachim Herrgen & Jürgen Erich Schmidt. eds. Language and Space. An International Handbook of Linguistic Variation. Vol. 4: Deutsch. Unter Mitarbeit von Hanna Fischer und Brigitte Ganswindt. Berlin & Boston: de Gruyter Mouton (= Handbooks of Linguistics and Communication Science 30.4), 318–363.
    • Wartena, Christian. 2023. The Hanover Tagger (Version 1.1.0) - Lemmatization, Morphological Analysis and POS Tagging in Python. doi: 10.25968/opus-2457. https://serwiss.bib.hs-hannover.de/frontdoor/deliver/index/docId/2457/file/wartena2023-HanTa_v1.1.0.pdf

    Short Biography

    After completing her PhD in Linguistics at the University of Vienna in 2012, Katharina Korecky-Kröll worked in several postdoc positions. She is now a Senior Lecturer at the Department of German Studies of the University of Vienna and an Academy Scientist at the Dictionary of Historical Bavarian Dialects in Austria and South Tyrol (Research Unit Linguistics, Austrian Centre for Digital Humanities and Cultural Heritage, Austrian Academy of Sciences).


    Date

    7–8 May 2025

    Location

    Seminar room 1,
    Campus of the Austrian Academy of Sciences,
    Bäckerstraße 13, 1010 Vienna

    Organization

    Department of Literary and Textual Studies,
    Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH),
    in cooperation with DHd-AG Zeitungen & Zeitschriften

    Contact

    Nina C. Rastinger
    Claudia Resch

    Languages

    The presentations will be held in either German or English.

    Registration

    Please register for the conference using our registration form:

    Register here