The two-day conference is dedicated to the use of AI models for the digitization and analysis of newspapers and magazines from the early modern period to the present. This covers both the “out-of-the-box” use or fine-tuning of existing models and the training of new models.
The term “AI model” is deliberately defined broadly and includes several subfields of artificial intelligence (e.g., Machine Learning, Deep Learning, Generative AI, NLP) and architectures (e.g., CNNs, BERT, GPT, CLIP) as well as different modalities (text, image, multimodal models) and modes of integration into individual workflows (e.g., through applications such as Transkribus, Newspaper Navigator; through Python libraries like spaCy, flair).
The conference focuses on various application scenarios of AI in relation to newspapers and magazines. The following areas of use are of particular interest:
Due to limited capacity, we ask you to register using our registration form.
14:00-14:30 | Welcome and Introduction | Alexandra N. Lenz, Claudia Resch, Nina C. Rastinger
14:30-15:00 | Hierarchical Structure Extraction from Newspaper Images Using a Transformer-Based Model | William Mocaër, Clément Chatelain, Stéphane Nicolas, Thierry Paquet, Tom Simon, Pierrick Tranouez
15:00-15:30 | The FINLAM Newspaper Dataset - a dataset for end-to-end newspaper recognition | Solène Tarride |
15:30-16:00 | From Image to Machine-Readable Text: AI for Layout Analysis, OCR and Post-Correction for Job Ads from Historical Newspapers | Klara Venglarova, Raven Adam, Georg Vogeler
16:00-16:30 | Coffee break |
16:30-17:00 | Das Darmstädter Tagblatt und zwei KI-Lösungen: Transkribus-Workflows und die Entwicklung eines KI-Assistenten | Dario Kampkaspar, Kevin Kuck, Anna Christina Kupffer |
17:00-17:30 | Werkstattbericht aus der historisch-kritischen digitalen Edition der „Neuen Zeitschrift für Musik“ 1834-1844 | Nelly Krämer-Reinhardt |
09:30-10:00 | AI-Driven Analysis of Female Representations in Fin-de-Siècle Spanish Magazines | Adriana Rodríguez-Alfonso |
10:00-10:30 | Challenges in dealing with historical gossip | Christian Lendl |
10:30-11:00 | Potenziale und Herausforderungen einer KI-unterstützten Medien- und Texterschließung am Beispiel der Gattung “Fotogedicht” | Lisa Hufschmidt |
11:00-11:30 | Coffee break |
11:30-12:00 | LexiMus Project. Advantages and Challenges of Artificial Intelligence in the Analysis of Music Press | Daniel Martín Sáez, María Isabel Jiménez Gutiérrez |
12:00-12:30 | LLM-based list analysis: From semi-structured newspaper texts to structured data | Nina C. Rastinger |
12:30-14:00 | Lunch break (catered) |
14:00-14:30 | Semantische Variationen und Bedeutungswandel im Ukrainischen: Herausforderungen für Multilinguale Sprachmodelle | Nataliia Cheilytko |
14:30-15:00 | Part-of-speech and grammar tagging with German spaCy pipelines from a linguistic perspective: Opportunities and challenges in the annotation of diminutives in forum posts on an Austrian online newspaper article | Katharina Korecky-Kröll |
15:00-16:00 | Concluding exchange over coffee and cake |
William Mocaër, Clément Chatelain, Stéphane Nicolas, Thierry Paquet, Tom Simon, Pierrick Tranouez
LITIS Laboratory
Understanding newspaper images is a challenging task due to their complex hierarchical structures, rendered through a variety of dense layouts. This research introduces a novel transformer-based model specifically designed to tackle these challenges through a comprehensive, end-to-end approach. The proposed model excels at extracting the hierarchical structure of newspapers, including sections and articles. It performs block localization and categorization (title, paragraph, image, table, …) and reading order prediction at multiple levels. The model provides a comprehensive, detailed and consistent analysis of newspaper content.
The approach relies on an iterative process of information extraction through the hierarchy of levels, where each level is processed one after the other. To enhance computational efficiency, each level is executed using a parallel attention mechanism. Relying on high-level structural modeling, the model achieves end-to-end processing without requiring any additional pre- or post-processing, ensuring adaptability to a wide variety of newspaper formats.
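The abstract stays at the conceptual level; purely as an illustration of what query-based, level-by-level decoding of a page in this spirit could look like, a minimal PyTorch sketch is given below. It is not the authors' architecture: the encoder, dimensions, number of levels and prediction heads are all assumptions.

```python
# Illustrative sketch only: level-by-level decoding of a newspaper page with
# learned queries, in the spirit of the iterative, hierarchical approach
# described above. All dimensions and heads are arbitrary assumptions.
import torch
import torch.nn as nn

class HierarchicalPageDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_queries=32, n_classes=8, n_levels=3):
        super().__init__()
        self.n_levels = n_levels
        # One set of learned object queries per hierarchy level (e.g. sections, articles, blocks).
        self.queries = nn.Parameter(torch.randn(n_levels, n_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.box_head = nn.Linear(d_model, 4)          # block localization (x, y, w, h)
        self.cls_head = nn.Linear(d_model, n_classes)  # block categorization
        self.order_head = nn.Linear(d_model, 1)        # reading-order score

    def forward(self, page_features):
        # page_features: (batch, n_patches, d_model) from some image encoder.
        outputs, memory = [], page_features
        for level in range(self.n_levels):
            q = self.queries[level].unsqueeze(0).expand(page_features.size(0), -1, -1)
            h = self.decoder(q, memory)  # all queries of one level attend in parallel
            outputs.append({
                "boxes": self.box_head(h).sigmoid(),
                "classes": self.cls_head(h),
                "order": self.order_head(h).squeeze(-1),
            })
            # Detected elements of this level condition the next, finer level.
            memory = torch.cat([memory, h], dim=1)
        return outputs

# Dummy example: two pages encoded as 196 patches each.
feats = torch.randn(2, 196, 256)
preds = HierarchicalPageDecoder()(feats)
print([p["boxes"].shape for p in preds])
```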
The model is trained using synthetic documents that capture the variability and complexity of real-world newspapers. These synthetic documents enable the model to learn robust representations of newspaper layouts, ensuring its ability to generalize across a wide range of structural configurations. Preliminary evaluations highlight the model's potential in accurately reconstructing newspaper hierarchies and providing insights into their content.
The method offers a promising solution for precise structure extraction in highly structured documents such as newspapers, but could also be applicable to a wider range of documents, addressing the growing need for scalable and efficient AI-based digitization solutions.
William Mocaër is a postdoctoral researcher at the LITIS laboratory (Rouen, France), where he contributes to the FINLAM project in collaboration with the Bibliothèque nationale de France (BnF) and Teklia, focusing on advanced newspaper analysis techniques. Previously, he completed a PhD at IRISA as part of the Shadoc team (Systems for Hybrid Analysis of DOCuments), which specializes in document analysis.
Solène Tarride
TEKLIA
Advances in machine learning and the emergence of visual large language models (VLLMs) have significantly pushed forward the field of automatic document understanding. However, these models often show limited performance on historical or handwritten documents. The ANR FINLAM (Foundation INtegrated models for Libraries Archives and Museum) project specifically aims to develop multimodal models that can handle a wide variety of documents, languages, layouts, writing styles and illustrations. One of the first use cases of this project focuses on historical newspapers.
Historical newspapers present unique challenges for automated processing due to their dense and complex layouts. Tasks such as reading order detection and article separation remain underexplored in the machine learning and document analysis communities. To fill these gaps, we present the FINLAM Newspaper Dataset, an open-source dataset designed for end-to-end training and evaluation of complex newspaper recognition tasks. The FINLAM Newspaper Dataset contains 149 issues of 23 newspapers published in the 19th and 20th centuries, mainly in French, with some newspapers in English. Each issue contains between 2 and 12 pages, and each page is segmented into zones annotated with multimodal features: localization, textual content (extracted by OCR), zone classification (among 13 categories including article titles, intertitles, paragraphs, illustrations, advertisements and free ads), reading order and article separation. The dataset presents significant challenges due to its dense, complex and varied layouts, and it is freely available on the Hugging Face Hub. In this workshop, we will introduce the FINLAM Newspaper Dataset and present benchmark results for key tasks such as OCR, document layout analysis, reading order detection and article separation.
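For readers who want to experiment with the dataset, a minimal loading sketch with the Hugging Face datasets library is shown below; the repository identifier and field names are placeholders, since the abstract does not spell them out.

```python
from datasets import load_dataset

# Placeholder repository id: substitute the actual FINLAM dataset identifier on the Hub.
ds = load_dataset("teklia/finlam-newspapers")
page = ds["train"][0]
# Expected multimodal annotations per zone (assumed field names): localization,
# OCR text, one of 13 zone classes, reading order and article id.
print(page.keys())
```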
Solène Tarride is a machine learning researcher at TEKLIA. During her PhD at IRISA, she focused on deep learning for understanding historical documents. At TEKLIA, she develops new methods for automatic information extraction from historical and modern documents.
Klara Venglarova, Raven Adam, Georg Vogeler
University of Graz
This study describes a comprehensive workflow for extracting machine-readable text from historical newspaper job advertisements, addressing layout analysis, optical character recognition (OCR), and post-correction with state-of-the-art machine learning methods. Leveraging an annotated dataset as ground truth, we evaluate various layout detection tools, including the default ANNO segments, Eynollah, Transkribus, and Tesseract. For evaluation purposes, we also propose a new methodology based on text presence in non-intersecting parts of a predicted region and its ground truth. Eynollah demonstrated the highest segmentation accuracy (72.5%), while other models, such as Kraken, underperformed due to a mismatch between the specific task and the pretrained models.
For OCR, we compared multiple models, including GT4HistOCR (CER: 0.1218), the Tesseract model used in ANNO (CER: 0.1295), and the German_Print model (CER: 0.1202). While several models reached comparable results, Fraktur_GT4HistOCR achieved the best WER. Post-correction further improved text quality, addressing OCR-induced biases. We fine-tuned the hmbyt5-preliminary model on the ICDAR2019-POCR dataset so that OCR post-correction performs better on our dataset. As the manual creation of a gold standard is a time-consuming process, we also explore generative transformer-based methods to support the creation of the training data needed to achieve good performance of a post-correction model.
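As a small illustration of how CER and WER figures like those above can be computed (the jiwer library is chosen only for illustration and is not necessarily the authors' tooling):

```python
import jiwer

reference = "Gesucht wird ein tüchtiger Buchhalter für ein Handelshaus."   # ground truth
hypothesis = "Gesncht wird ein tüchtiger Buchhalter fur ein Handelshans."  # simulated OCR output

print(f"CER: {jiwer.cer(reference, hypothesis):.4f}")
print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")
```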
Our work emphasizes the persistent need for annotated datasets and gold standards in assessing segmentation and recognition performance. By systematically comparing tools and methodologies, we contribute to the transparency of the results on which subsequent data analysis is based.
This work is part of the FWF project P35783 “The making of the incredibly differentiated labor market” (PI Jörn Kleinert).
Georg Vogeler is professor for Digital Humanities at the University of Graz. He is a trained historian (Historical Auxiliary Sciences), graduated from Ludwig-Maximilians-Universität (LMU) Munich, and has worked on late medieval administrative records, Emperor Frederick II (1198-1250), digital scholarly editing, and semantic web technologies for the humanities, with positions at LMU Munich, the Università del Salento in Lecce and the University of Graz. He has recently engaged in the application of machine learning and AI to the analysis of historical records. He was and is PI of numerous projects, among them the ERC Advanced Grant From Digital to Distant Diplomatics (2022-2026).
Klara Venglarova is a PhD student in Linguistics and Digital Humanities at Palacký University in Olomouc, Czech Republic. She is involved in the FWF-funded project The Making of the Incredibly Differentiated Labor Market: Evidence from Job Offers from Ten Decades at the University of Graz (PI Jörn Kleinert), where she works on layout analysis, OCR, post-correction, information extraction and other NLP and machine learning tasks.
Raven Adam is a PhD student at the Department of Environmental Systems Sciences at the University of Graz. His research focuses on NLP applications such as topic modeling, text classification and text generation. He is currently involved in two FWF-funded projects: The Making of the Incredibly Differentiated Labor Market: Evidence from Job Offers from Ten Decades (PI Jörn Kleinert) and Responses to Threat and Solution-Oriented Climate News (PI Marie Lisa Ulrike Kogler).
Dario Kampkaspar, Kevin Kuck, Anna Christina Kupffer
Universitäts- und Landesbibliothek Darmstadt
The digitization project „Darmstädter Tagblatt" is currently building one of the most extensive bodies of historical newspaper sources in the German-speaking world. In two DFG-funded project phases, more than 600,000 pages from almost three centuries are being digitized and made available as full text.¹ In addition to a brief introduction to the project, we would like to offer insights into two aspects of the project and of collaborations that have grown out of it.
1. The Transkribus workflow
Originally contracted out as a service, the OCR and layout recognition of the post-war issues (1949–1986) were carried out externally. While the layout recognition was satisfactory, a strategic decision was made to perform the OCR in-house. Thanks to an infrastructure investment at the ULB Darmstadt, we can use Transkribus as an Epic Member for the Tagblatt project. In a workflow that used TextTitan as well as an existing model from the ZEiD, new versions of the full texts were produced with double keying. This was considerably faster than manually correcting the service provider's output. The layout information from the provider's ALTO files could be converted into PAGE files and thus reused in Transkribus.
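The ALTO-to-PAGE reuse described above is, in practice, an XML transformation. The sketch below is only a simplified illustration of the idea (real converters, e.g. XSLT-based ones, handle far more of both schemas); file names and the restriction to TextBlock coordinates are assumptions.

```python
# Simplified sketch: copy block coordinates from an ALTO file into a minimal
# PAGE document so the layout can be re-imported elsewhere. Not the project's
# actual conversion; schema details are deliberately reduced.
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

def alto_to_page(alto_path: str, image_name: str) -> ET.ElementTree:
    alto = ET.parse(alto_path).getroot()
    # Ignore the ALTO namespace version by matching on local names only.
    blocks = [el for el in alto.iter() if el.tag.split("}")[-1] == "TextBlock"]

    ET.register_namespace("", PAGE_NS)
    pcgts = ET.Element(f"{{{PAGE_NS}}}PcGts")
    page = ET.SubElement(pcgts, f"{{{PAGE_NS}}}Page", imageFilename=image_name)

    for i, b in enumerate(blocks):
        x, y = int(float(b.get("HPOS", 0))), int(float(b.get("VPOS", 0)))
        w, h = int(float(b.get("WIDTH", 0))), int(float(b.get("HEIGHT", 0)))
        region = ET.SubElement(page, f"{{{PAGE_NS}}}TextRegion", id=f"r{i}")
        points = f"{x},{y} {x+w},{y} {x+w},{y+h} {x},{y+h}"
        ET.SubElement(region, f"{{{PAGE_NS}}}Coords", points=points)
    return ET.ElementTree(pcgts)

# Hypothetical usage:
# alto_to_page("issue_1950_001.alto.xml", "issue_1950_001.jpg").write("page.xml")
```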
2. The RAGblatt
With Hessian.AI, Darmstadt is home to a spin-off of TU Darmstadt with AI expertise.² To explore the possibilities, and also to address the challenges that come with a project as extensive as the Tagblatt project, a cooperation with Hessian.AI was initiated. Its current result is the „RAGblatt", an AI-supported assistant that can be used to search the Tagblatt material. The assistant is still at the prototype stage and uses models such as Meta's Llama and Occiglot. It enables text-based queries that generate written answers and include the context of the newspaper article relevant to the query. We hope that this assistant will help users find material and offer an exploratory starting point for research topics. In a lab-like environment, researchers can test potential questions with minimal technical effort or time investment. The talk will also address challenges of both a technical and a disciplinary nature.
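Purely as an illustrative sketch of the retrieval-augmented setup described above (not the RAGblatt implementation; embedding model, corpus and prompt are assumptions), a minimal retrieve-then-prompt loop could look like this:

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant newspaper snippet and build a grounded prompt for a generative model.
from sentence_transformers import SentenceTransformer, util

articles = [
    "Darmstadt, 12. März 1912: Eröffnung der neuen Markthalle ...",
    "Darmstadt, 3. Juli 1925: Bericht über das Stadtjubiläum ...",
]

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
doc_emb = embedder.encode(articles, convert_to_tensor=True)

query = "Wann wurde die Markthalle eröffnet?"
query_emb = embedder.encode(query, convert_to_tensor=True)
best = util.cos_sim(query_emb, doc_emb).argmax().item()

prompt = (
    "Beantworte die Frage anhand des folgenden Zeitungsartikels.\n\n"
    f"Artikel: {articles[best]}\n\nFrage: {query}\nAntwort:"
)
# The prompt would then be passed to a generative model such as Llama or Occiglot,
# e.g. via a Hugging Face text-generation pipeline.
print(prompt)
```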
The assistant is currently only available within the TU Darmstadt network: elektra.ai.tu-darmstadt.de/ulb (accessed 05.02.2025).
Footnotes
1 www.ulb.tu-darmstadt.de/forschen_publizieren/forschen/darmstaedter_tagblatt.en.jsp (accessed 05.02.2025).
2 hessian.ai (accessed 05.02.2025).
Dario Kampkaspar is head of the Centre for Digital Editions at the ULB Darmstadt. Not least since the Wien(n)erisches Diarium, he has been active in the digitization and full-text capture of newspapers. He is also involved in the TEI and in Transkribus (e.g., the TEI export from Transkribus and further tools for digital editions).
Kevin Kuck has been in charge of the „Darmstädter Tagblatt" digitization project since September 2023. At the ULB Darmstadt he also works on the project „Europäischer Religionsfrieden Digital". He studied history at Heidelberg University, where he is also completing his doctorate.
Nelly Krämer-Reinhardt
Julius-Maximilians-Universität Würzburg, Bayerische Akademie der Wissenschaften
In this progress report, we would like to provide insight into the conception of the digital edition of a historical journal.
Starting point:
The academy project „Robert Schumanns Poetische Welt" includes, among other things, the historical-critical edition of the Neue Zeitschrift für Musik (NZfM) from its founding decade, during which the composer and music writer Robert Schumann conceived the journal and was responsible for its editorial direction.
The digital edition is being created in cooperation with the TCDH Trier in the virtual research environment FuD.
Aim:
Schumann's NZfM saw itself as an independent organ for the promotion of talented composers and formed a central platform within the Romantic discourse on music. It is thus an important corpus for musicology.
The planned deeply indexed, annotated reading text, which can be accessed both at issue level and at the level of individual text units, will form the basis for future research. The aim of editing the journal corpus is the generation of deep data.
Process:
Text capture: A specially trained model in the AI-supported software Transkribus precisely detects both the text regions of the journal layout and the lines and their text.
Multimodal capture: Music examples are created with the software mei-friend, collated, and then integrated into TEI-XML (Text Encoding Initiative). The possible uses of OMR are currently still being tested.
Semantic indexing: The journal issues are divided into units of meaning and semantically indexed. The focus thus moves from pure text capture to the substance of the text units, some of which also extend across several issues.
Annotation: All entities are marked up and linked to authority data. The individual units of meaning are assigned to text categories, and explanatory notes facilitate the understanding of passages requiring explanation. Where corresponding sources are available, the genesis of the texts is also presented: on the one hand, handwritten sources are consulted and processed with the software tool Transcribo in FuD; on the other hand, the contributions that Schumann included in his „Gesammelte Schriften" are linked and compared using the collation tool Comparo.
Discussion:
We invite participants to discuss this concept and its methods, to reflect on the limitations of using AI to capture deep data in newspapers and journals, and to consider what role explanatory commentary can and should play in the age of AI.
The speaker, Nelly Krämer-Reinhardt, M.A., studied music education at the Hochschule für Musik Würzburg and musicology at the Julius-Maximilians-Universität Würzburg. Since 2023 she has been a research associate in the academy project Robert Schumanns Poetische Welt; in her dissertation she works on music examples in the Neue Zeitschrift für Musik.
Adriana Rodríguez-Alfonso
University of Tübingen
This presentation examines female representations in three pioneering fin-de-siècle magazines from Spain: Vida nueva (Madrid, 1898-1900), La Vida Literaria (Madrid, 1899), and La vida galante (Barcelona, 1898-1905). This study is part of a Spanish-language magazine digitization project undertaken at the University of Tübingen, Germany. These magazines, conceived as platforms for artistic and literary dissemination, not only brought together prominent painters, illustrators, photographers, and writers of the Hispanic movement but also provided invaluable insights into the prevailing symbolic and social constructs of femininity.
Given the close interrelationship between medicine and art in the late nineteenth century (Jordanova, 1989; Gilman, 1995; Mazzoni, 1996; Clemison and Vázquez, 2009; Tsuchiya, 2011; Alder, 2020), the social imaginaries surrounding women in these magazines frequently intersected with the dominant theories of women's mental health. These theories were influenced by European degeneration models (Nordau, Brachet, Charcot), as well as home-grown adaptations by leading Spanish positivist psychiatrists (Escuder, Giné y Partagás, Bernaldo de Quirós).
Drawing from digital methods and perspectives, this presentation will showcase the results derived from the digitization, processing, and analysis of this corpus of Spanish cultural magazines from the turn of the century. Using Natural Language Processing (NLP) techniques—such as word-sense disambiguation and semantic analysis—the study maps the various semantic frameworks associated with women, their societal roles, and representations within nineteenth-century Spanish society. These techniques will reveal how medical, political, and social connotations were closely intertwined within the discourse surrounding female identity and roles of the time.
This talk will also foster a discussion on the application of computational tools in periodical press analysis, highlighting both the potential and challenges of adapting digital methods across languages—particularly for Spanish-language materials. It will delve into issues such as cross-linguistic adaptation and the nuances involved in applying these methods to historical texts.
Adriana Rodríguez-Alfonso holds a PhD and Master's degree in Spanish and Latin American Literature from the University of Salamanca, and a Bachelor's degree in Hispanic Philology from the University of Havana. She is currently a professor and researcher at the Romanisches Seminar of the University of Tübingen, where she is working on her “Habilitation.”
Her main research focuses on Portuguese and Spanish literature from the 19th and 20th centuries, literary magazines, intellectual fields and networks, and digital humanities. She has published articles and chapters in various essay collections and specialized journals and her book El grupo Shanghai en Argentina: Redes, estéticas y mercados editoriales latinoamericanos was published in 2024 by De Gruyter.
Christian Lendl
University of Vienna
The Wiener Salonblatt was the hot gossip magazine of fin-de-siècle Vienna. The illustrated weekly was published from 1870 to 1938 and mainly consisted of short messages – mostly published by members of the nobility – about personal achievements, travelling, and family issues. These short texts can be seen as typical examples of factoids and served the same purpose as posts on social media networks today: staying connected with one’s peers and presenting oneself to the public.
This dissertation project aims to analyze these (~250,000) factoids with the help of digital methods. The research goals are to better understand the transformation of the late Habsburg nobility by identifying topical trends and geospatial patterns as well as conducting a historical network analysis.
The current focus of this project lies on the automated text recognition of the corpus. To this end, several models are being trained in Transkribus for layout recognition, segmentation, and automated transcription. In addition, a text-processing algorithm is being developed that optimizes the (textual) output of the transcription stage and transfers it into a database. This includes several steps of error correction, normalization and validation, as well as an algorithm to correct the reading order (for all messages on a single magazine page). All these processes are necessary to optimally prepare the factoids for the upcoming stage of this dissertation project: analyzing the factoids with natural language processing methods. While named entity recognition will be used to extract as much information as possible (persons, organizations, places, etc.), topic modelling will identify the topic(s) of each factoid. All these steps have to be completed before the final stages can begin: data analysis and historical interpretation.
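As a small, hedged illustration of the planned topic-modelling stage (tooling, snippets and parameters are assumptions, not the project's actual pipeline), a few factoid-like texts can be modelled with scikit-learn:

```python
# Illustrative sketch: simple LDA topic modelling over factoid-like snippets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

factoids = [
    "Graf X. ist gestern mit seiner Gemahlin nach Abbazia abgereist.",
    "Fürstin Y. gab am Samstag einen glänzenden Ball in ihrem Palais.",
    "Baron Z. weilt zur Kur in Karlsbad und kehrt Ende des Monats zurück.",
]

vec = CountVectorizer()
dtm = vec.fit_transform(factoids)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:]]
    print(f"Topic {i}: {top_terms}")
```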
Christian Lendl is a PhD candidate at the Department of East European History (University of Vienna). His fields of interest include the Austrian nobility in the late Habsburg Empire, the development of portrait and press photography, and the visual coverage in Austrian newspapers. He is also a lecturer in visual marketing at the IMC Krems University of Applied Sciences and holds an MSc in Computer Science from the Vienna University of Technology and an MA in History from the University of Vienna.
Lisa Hufschmidt
Julius-Maximilians-Universität Würzburg
From a literary studies perspective, this talk addresses the partially automated indexing of photo poems (Fotogedichte)¹, i.e. a specific literary form of text-image relationship, with the help of different AI models. The indexing process is being developed on the basis of the DFG project Das Fotogedicht in illustrierten Zeitschriften zwischen 1895 und 1945², launched in 2024, and in cooperation with the Zentrum für Philologie und Digitalität at the University of Würzburg. The project aims to capture the spatial relationship between text and image, the full text, and the image and text motifs used, and, in a second step, to examine the semantizations connected with them. The intended indexing and the research questions associated with it hold not only technical but also communicative potentials and challenges, which will be the focus of the talk.
Footnotes
1 See: Catani, Stephanie/Michael Will (2024): “Das Fotogedicht. Zur (Wieder-)Entdeckung einer intermedialen Gattung.” Zeitschrift für Deutsche Philologie Digital/Zeitschrift für Deutsche Philologie (2), doi:10.37307/j.1868-7806.2024.02.09.
2 www.germanistik.uni-wuerzburg.de/ndl1/forschung-projekte/forschungsstelle-fotolyrik/.
Lisa Hufschmidt studied German studies (B.A.) and German literature (M.A.) in Mannheim and Stuttgart. Since July 2024 she has been a member of staff of the DFG project Das Fotogedicht in illustrierten Zeitschriften zwischen 1895 und 1945 at the Julius-Maximilians-Universität Würzburg, which is devoted to the (re)discovered genre of photo poetry. In her doctoral project, Lisa Hufschmidt is developing an analytical model for photo poems.
Daniel Martín Sáez, María Isabel Jiménez Gutiérrez
University of Salamanca
In this presentation, we will introduce LexiMus, a project aimed at understanding trends in the use of musical lexicon in Spanish throughout history. Specifically, we will focus on the work of the team at the University of Salamanca, which is in charge of studying the press from the 18th to the 21st century, both general and specialized. This involves working with massive text data (millions of words), from which we often need to exclude non-musical information. To address this, we have developed a tool that uses Google Cloud OCR and Vertex AI. This learning platform allows us to train large language models (LLMs) and create automated workflows to extract information in blocks by applying a prompt similar to those used in chatbots. We were able to select and transcribe musical news from hundreds of periodical sources, creating a corpus of over 70 million words. In the past year, we began analyzing this corpus using the Voyant Tools platform, which enables us to study usage trends, create word clouds, and observe their evolution over time. Currently, we are still seeking ways to improve OCR reading, which is hindered by issues with text legibility and column organization, but perhaps the greatest challenge lies in the numerous interpretation problems that AI is far from solving at present, despite efforts in recent years in the field of the history of concepts (e.g., Peter de Bolla, Explorations in the Digital History of Ideas, 2024).
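A hedged sketch of such prompt-based block extraction on Vertex AI is given below; the project id, model name, example text and prompt are illustrative assumptions rather than the LexiMus configuration.

```python
# Illustrative sketch: extract music-related passages from an OCR block with a
# Vertex AI generative model. Project id, region and model name are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="europe-west1")  # placeholder project
model = GenerativeModel("gemini-1.5-flash")

ocr_block = (
    "MADRID, 14 de mayo. Anoche se estrenó en el Teatro Real una nueva zarzuela ... "
    "Bolsa de Madrid: los valores ferroviarios cerraron en alza."
)
prompt = (
    "Extract only the passages of the following newspaper text that report on music "
    "(concerts, premieres, musicians, instruments). Return them verbatim, one per line.\n\n"
    + ocr_block
)
print(model.generate_content(prompt).text)
```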
Dr. Daniel Martín Sáez: Associate Professor of Musicology at the University of Salamanca. Professor of the BA in Musicology and the MA in Hispanic Music at the University of Salamanca. Member of the research team of LexiMus Project.
María Isabel Jiménez Gutiérrez: Predoctoral Fellow at the University of Salamanca. BA in Musicology and MA in Hispanic Music from USAL. Graduate in Higher Artistic Education with a specialization in Clarinet from the Higher Conservatory of Castilla y León.
Nataliia Cheilytko
Friedrich Schiller University Jena
This contribution investigates the extent to which large language models (LLMs) and contextualized embeddings are able to capture nuances of meaning in Ukrainian words, particularly against the background of regional and diachronic variation in the 20th and 21st centuries. The aim of the project is to systematically analyze the dynamics of word meanings in Ukrainian with the help of modern AI models.
As a low-resource language, Ukrainian suffers from a lack of annotated datasets and NLP tools, which complicates semantic representation and analysis. The processing of historical texts, which are often not available in digital archives, is particularly challenging. For initial experiments, data from the General Regionally Annotated Corpus of Ukrainian (GRAC) was used, which comprises texts from different regions of Ukraine from the 20th century onwards.
Two approaches were pursued: first, contextualized embeddings were visualized and clustered to analyze differences in meaning across contexts; second, GPT models were used to determine word meanings in specific sentences. Initial results show that contextualized embeddings can successfully identify semantic change, while LLMs such as GPT-4o fail on historical or regional meanings in some cases.
For example, the model correctly identified the new metaphorical meaning of "bavovna" ("explosion"), but failed on the historical regional meaning of the adjective "povazhnyi" ("strict") in early twentieth-century western Ukrainian.
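As a minimal illustration of the first approach (clustering contextualized embeddings of a target word such as "bavovna"), the sketch below uses a multilingual BERT model and k-means; model choice, sentences and cluster count are assumptions, not the project's setup.

```python
# Illustrative sketch: cluster contextual embeddings of one target word to
# separate its senses across example sentences.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

sentences = [
    "Вночі над містом знову була бавовна.",      # new metaphorical sense ("explosion")
    "З бавовни шиють легкий літній одяг.",       # literal sense ("cotton")
    "Фабрика переробляє бавовну на тканину.",    # literal sense ("cotton")
]
target = "бавовн"  # shared stem of the target word

vectors = []
for s in sentences:
    enc = tok(s, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    start = s.lower().find(target)
    end = start + len(target)
    # average the sub-token vectors overlapping the target word
    ids = [i for i, (a, b) in enumerate(offsets) if a < end and b > start and b > a]
    vec = hidden[ids].mean(dim=0) if ids else hidden.mean(dim=0)
    vectors.append(vec.numpy())

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # sentences sharing a label are treated as sharing a sense
```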
Future work aims to adapt LLMs with specific data in order to better model semantic change in Ukrainian and to achieve finer granularity in lexical analyses.
References
Nataliia Cheilytko is a postdoctoral researcher at Friedrich Schiller University (Jena), a computational linguist, NLP engineer, R&D team leader, and lecturer with more than ten years of experience in various linguistic, NLP, and Semantic Web projects in both academia and industrial startups. Her areas of expertise are Corpus Linguistics, Computational Linguistics, Natural Language Processing, Machine Learning and AI, Large Language Models, Semantic Modeling, Sociolinguistics, Language Variation and Change, Lectometry, Knowledge Representation, Labeled Property Graphs, and the Semantic Web.
Katharina Korecky-Kröll
Austrian Academy of Sciences
The NLP Python library spaCy (Honnibal et al. 2020) is a useful tool for everyone interested in linguistic analyses of large amounts of written data.
Using spaCy, such data can be tokenized and tagged for parts of speech quickly, and a basic morphological annotation for categories of inflectional morphology (e.g., case, gender, and number of nouns) or an annotation of syntactic dependencies or named entities is also possible. All these levels of annotation may serve as a basis for further linguistic analyses.
To date, spaCy supports over 75 languages and has over 80 pretrained pipelines for 25 languages. There are four pipelines for German, which are all based on the TIGER-Corpus, Tiger2Dep and WikiNER and sometimes on additional sources (in round brackets after the name of the pipeline) and which vary regarding the accuracies of their morphological annotation [in square brackets]:
Specific challenges arise when annotating user-generated content in a pluricentric language such as German, which has several national standard varieties and is also characterized by numerous dialects and regiolects resulting in highly diverse word formation patterns (e.g., Ammon 1995; Lenz 2019). Thus, in a 12-million-token corpus of forum posts on an online article of the Austrian newspaper DERSTANDARD.at regarding the COVID-19 pandemic (e.g., Korecky-Kröll 2023; Korecky-Kröll et al. submitted), spaCy assigns a wrong grammatical gender to many diminutive nouns or misclassifies them in another way (e.g., common nouns as proper names).
Using a randomly selected sub-corpus of 1000 diminutive tokens from the above-mentioned corpus, the four spaCy pipelines for German are tested for accuracy, problems at the individual token or lemma level are identified and possible solutions are worked out. As an outlook, the possibility of an additional automatic word formation tagging (e.g., Wartena 2023) is also discussed.
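As a minimal illustration of the kind of check described above (pipeline choice and example sentence are assumptions), a German spaCy pipeline can be queried for POS tags and morphological features of Austrian diminutives:

```python
# Illustrative sketch: inspect POS, morphology and lemma of Austrian German
# diminutives with one of the German spaCy pipelines (requires the model to be
# downloaded, e.g. via `python -m spacy download de_core_news_sm`).
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Das Packerl mit den Maskerln ist gestern angekommen.")

for token in doc:
    print(token.text, token.pos_, token.morph, token.lemma_)
# Manually checking gender, number and lemma of "Packerl"/"Maskerln" reveals the
# kinds of misclassifications discussed in the talk.
```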
References
After completing her PhD in Linguistics at the University of Vienna in 2012, Katharina Korecky-Kröll worked in several postdoc positions. She is now a Senior Lecturer at the Department of German Studies of the University of Vienna and an Academy Scientist at the “Dictionary of Historical Bavarian Dialects in Austria and South Tyrol” of the Research Unit Linguistics of the Austrian Centre for Digital Humanities and Cultural Heritage of the Austrian Academy of Sciences.
7–8 May 2025
Seminar room 1,
Campus of the Austrian Academy of Sciences,
Bäckerstraße 13, 1010 Vienna
Department of Literary and Textual Studies,
Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH),
in cooperation with DHd-AG Zeitungen & Zeitschriften
Nina C. Rastinger
Claudia Resch
The presentations will be held in either German or English.