ACDH-CH Research Day 6 (Forschungstag 6)

Digital Approaches to Timeless Questions: Emerging Voices in the DH

Roberto Busa, who is regarded as a central figure in the founding of the Digital Humanities and is sometimes even described as a kind of ‘founding father’, enlisted the help of a computer in the 1940s, shortly after completing his dissertation, to lemmatise the works of Thomas Aquinas. Inspired by this ‘origin myth’, which tells the story of the Digital Humanities as the invention of an up-and-coming early-career researcher, the Austrian Centre for Digital Humanities and Cultural Heritage and the section Digital Humanities & Quantitative Methods of the Doctoral School of Philological and Cultural Studies at the University of Vienna invite Master's students, doctoral students, and people who have recently completed their doctoral studies to a scholarly exchange as part of their 6th Research Day. The prerequisite for participation is a thematic focus on Digital Humanities in the broadest sense, that is, a focus of study or research situated at one or more intersections of the humanities and digital technologies. The aim of the Research Day is to provide an overview of emerging research fields and early-career research in the Digital Humanities in order to connect researchers across (supposed) disciplinary boundaries.

Programme

08:45-09:00	Welcome and opening remarks	Jan Höll, Markus Pluschkovits, Susanne Schmalwieser, Patrick Zeitlhuber, Theresa Ziegler
09:00-09:30	Developing Handwritten Text Recognition for Garshuni Malayalam: Challenges and Prospects in Multilingual Digital Humanities	Sarayana Chandran (Central European University)
09:30-10:00	‚Kleine‘ Zeitungstextsorten ganz groß: In Richtung einer Typologie periodisch publizierter Listen	Nina Claudia Rastinger (ACDH-CH)
10:00-10:30	פון פּאַפּיר צו פּיקסעלן - From paper to pixels: Digitising and Translating a Yiddish newspaper	Robin Luger (Universität Wien)
10:30-11:00	Coffee Break
11:00-11:30	Computergestützte Judaistik: ארץ in den Psalmen. Eine begriffsanalytische Fallstudie zur Integration digitaler und traditioneller Forschungsansätze	Bianca Plattner (Universität Wien)
11:30-12:00	Computational representation of the language variation and change patterns in the historically attested dialects of Ukrainian	Ilia Afanasev (Universität Wien)
12:00-12:30	AI in the Archive: A Visual Analysis of Jim Hubbard’s AIDS Activist Videos	Manuel Meschiari (Universität Wien)
12:30-13:30	Lunch Break
13:30-14:00	Russian Future Periphrasis in Statistical Perspective	Maximilian Grübsch (Universität Wien)
14:00-14:30	Das gehören-Passiv. Korpuslinguistische Untersuchung einer analytischen Konstruktion.	Claudia Mattes (Universität Wien)
14:30-15:00	Investigating Authorship: Stylometry and the Harvard Project on the Soviet Social System	Sydney Szijarto (Central European University)
15:00-15:30	Coffee Break
15:30-16:00	The opportunities for language preservation afforded by digital communication platforms on the example of Nahuatl	Evelyn Fischer (Universität Wien)
16:00-16:30	Curiositäten- und Memorabilien Lexicon von Wien: Exploring Topics and Co-occurrence Networks in a Mid-19th Century Lexicon	Nikola Krisztian Czindrity (Universität Wien)
16:30-17:00	Emotionsgestaltung als Erzählstrategie am Beispiel von Der Saelden Hort	Dorothea Sichrovsky (Universität Wien)
17:00-17:30	Digitale und diplomatische Darstellungsversuche der Nibelungenhandschrift B durch Transkribus	Sabrina Bach (Universität Wien)
17:30	Closing remarks	Jan Höll, Markus Pluschkovits, Susanne Schmalwieser, Patrick Zeitlhuber, Theresa Ziegler

Abstracts

Sarayana Chandran (Central European University): Developing Handwritten Text Recognition for Garshuni Malayalam: Challenges and Prospects in Multilingual Digital Humanities

The integration of Digital Humanities (DH) tools into the study of historical manuscripts has opened transformative ways of preserving and analyzing cultural heritage. This paper discusses the development of Handwritten Text Recognition (HTR) for Garshuni Malayalam, a script used by the Saint Thomas Christians of Kerala and Jesuit missionaries between the 16th and 20th centuries. Garshuni Malayalam uniquely combines Syriac and Malayalam characters. Like Syriac, it is written from right to left, making it an essential yet challenging script for computational analysis. Despite the digitization of these manuscripts under the SRITE project and their storage in the Hill Museum and Manuscript Library (HMML), the lack of cataloging and accessibility remains a significant barrier to scholarship. This study aims to bridge this gap by presenting an HTR model trained in e-Scriptorium for Garshuni Malayalam manuscripts. This case study is part of my PhD thesis at Central European University, Vienna. The thesis, titled “Developing Digital Humanities tools for Garshuni Malayalam”, includes the development of a keyboard, an HTR system, and an online dictionary database for Garshuni Malayalam.

Garshuni Malayalam manuscripts offer invaluable insights into the colonial, linguistic, and cultural history of Kerala. However, their right-to-left orientation, multilingual content, and idiosyncratic paleographic features pose substantial challenges for HTR development. By leveraging the transcription environment of e-Scriptorium, this study undertakes the task of training an HTR model to recognize and transcribe Garshuni Malayalam texts accurately.

This proposed paper examines three core aspects of the project. First, it discusses the implications of multilingual DH and the necessity of interdisciplinary approaches when working with scripts that amalgamate linguistic systems. Garshuni Malayalam epitomizes the complexities of multilingualism, blending Syriac's liturgical significance with Malayalam's vernacular richness. Second, it identifies technical challenges encountered during the development of the HTR model, such as managing script-specific variations and diacritical marks. The right-to-left orientation of the script required adaptations to existing recognition frameworks, pushing the boundaries of current HTR technologies. Third, the study presents preliminary results from the trained model, highlighting its performance metrics, areas of improvement, and its potential applications in cataloging and creating searchable digital repositories.

This paper contributes to the broader field of multilingual digital humanities by showcasing how advanced computational methods can be adapted to work with non-Latin, right-to-left scripts. It also emphasizes the importance of preserving and making accessible lesser-known textual traditions through innovative digital tools. The paper further reflects on the broader methodological considerations of developing DH tools for historically marginalized languages and scripts. It advocates for collaborative workflows that include linguists, historians, and technologists to ensure the cultural and scholarly utility of such tools.

By presenting the challenges, solutions, and results of this HTR initiative, the paper aims to contribute to the growing discourse on multilingual DH and inspire further research into digital tools for underrepresented scripts. This project underscores the potential of DH to unlock historical knowledge embedded in complex manuscript traditions, thereby democratizing access to cultural heritage and fostering global academic collaboration.

Nina Claudia Rastinger (Austrian Centre for Digital Humanities and Cultural Heritage): ‚Kleine‘ Zeitungstextsorten ganz groß: In Richtung einer Typologie periodisch publizierter Listen

Historical German-language newspapers have undergone mass digitisation in recent decades and are now available to interested users through a large and varied range of digital resources. At the same time, a heterogeneous picture emerges with regard both to newspaper content and to research on it: although newspapers combine a multitude of different text types, attention has so far focused mainly on prototypical components, such as news texts or advertisements, while ‘small(er)’ newspaper text types outside the classic ‘news’ have often gone unnoticed.

Against this background, the dissertation project is devoted to a group of hitherto neglected ‘small’ texts and analyses periodically published lists in historical German-language newspapers from an empirical perspective. The period under investigation extends from 1600 to 1850. In practice, the project draws on methods from corpus linguistics and the Digital Humanities, which are applied within three project pillars. First, periodical lists are identified in existing digital newspaper corpora and collections in order to gain an initial overview of their distribution, frequency, and typology. Second, the historical material is analysed for its textual characteristics by evaluating patterns of content, language, and typography in selected list types. Third, a pilot study tests the potential and challenges of such newspaper lists for Digital Humanities application scenarios, in particular for methods of automatic information extraction. Through this three-step approach, the doctoral project seeks to open up the hitherto largely overlooked phenomenon of periodically published lists in early newspapers systematically for the first time.

The talk gives a brief overview of the project as a whole, discusses the concept of ‘small’ newspaper text types in more detail, and then concentrates on the first and second project pillars, i.e. on the identification of periodically published lists and the analysis of the material with regard to textual patterns. On the basis of the data collected and evaluated so far, different list types (e.g. arrival lists, death lists, marriage lists, birth lists, historical calendars, tables of contents) are to be worked out and located in space and time. This contrastive and diachronic approach opens up new perspectives on the significance of ‘small’ text types in the early modern newspaper landscape and shows what data treasures lie hidden in existing digital resources.

Robin Luger (University of Vienna): פון פּאַפּיר צו פּיקסעלן - From paper to pixels: Digitising and Translating a Yiddish newspaper

From paper to pixels is a born-digital translation of the Yiddish press, annotated and contextualised as a scholarly digital edition. The edition and supporting data are displayed on a static, open-source website.[1] The focus lies on transforming scans of the newspaper into Five-Star Open-Source and Linked Data in a universally understood language without losing the essence of Yiddish press sources.

דער אַמת׳ר יוּד – The Genuine Jew is a Yiddish newspaper that was published from February to October 1904 in Lviv, Ukraine, at that time part of the Austro-Hungarian Monarchy (Galicia, to be precise). The paper features various articles, news items, and stories by different writers and the editor. A major focus is Jewish life in early 20th-century Europe, in particular the religious questions of Jewish people in the diaspora. Every issue also includes general, political, and Jewish news as well as historical retellings important for Jewish heritage and culture.

The scans, either as PDFs or as images, are provided by the Austrian National Library[2] and already include the transcription in Hebrew letters in a separate text document. Starting from those digitised sources, the text is transliterated and then translated manually, as existing automated translation tools are not proficient enough to provide a satisfactory translation of Yiddish, a language with numerous dialects and interchangeable letters. The translation is followed by annotation in TEI/XML. Using the Cookiecutter library and XSL stylesheets, the data is transformed and processed into HTML files, which are then deployed on GitHub Pages. On the website, the newspapers are published as text and viewable as HTML sources. People, places, important events, and Jewish concepts and traditions are linked to indices and to external databases such as Wikidata or GeoNames.

Special focus is put on creating open, findable, accessible, interoperable, and reusable data. This includes the technical backend, which follows the Five-Star Principle and the FAIR principles. The digital edition will be made widely accessible by making use of catalogues and databases, enriching metadata, using universally usable formats, and linking all resources used and code written on the website. Additionally, it will follow the Web Content Accessibility Guidelines[3] to improve universal access, as academia, research, and digital resources widely lack digital accessibility.

Combining all these standards, guidelines, and principles דער אַמת׳ר יוּד is an effort to bring minority language primary sources to the wider research and non-research sphere alike.

[1] lug-robin.github.io/tgj-data/

[2] anno.onb.ac.at/cgi-content/anno

[3] W3C W. A. Initiative, “WCAG 2 Overview,” accessed April 8, 2024, www.w3.org/WAI/standards-guidelines/wcag/.

Bianca Plattner (University of Vienna): Computergestützte Judaistik: ארץ in den Psalmen. Eine begriffsanalytische Fallstudie zur Integration digitaler und traditioneller Forschungsansätze

The shades of meaning of ארץ (Erets/Arets), often translated as “land”, extend far beyond a concrete geographical terrain. ארץ can equally refer to the whole earth or to the ground, and it appears particularly often in connection with the Promised Land or the land of the living.

In this Master's thesis, the various shades of meaning of ארץ in the Masoretic Text of the Psalms (MT-Ps) were compared with those in the medieval Psalms manuscript (MS) T-S A43.10 of the Cambridge University Library¹. Using digital methods such as automatic text recognition, semantic tagging, and computational-linguistic visualisations, a distant-reading analysis (Parts 1 and 2) examined how the polysemy of ארץ is expressed in the Psalter and in the Psalms manuscript. This analysis was then deepened by a close-reading interpretation (Part 3), which used the Midrash Tehillim (MidTeh) as its basis and dealt specifically with two selected psalms, Psalm 46 and Psalm 74. In addition, various translations, rabbinic commentaries, and further relevant secondary literature were consulted in order to elucidate the concept of land in these two psalms.

The manuscript T-S A43.10 was first transcribed automatically with the HTR/OCR tool eScriptorium². The TXT files were then cleaned manually and automatically before being prepared for quantitative analysis with Python³, R⁴, CATMA⁵ and visÁvis⁶. The aim of this quantitative analysis was to identify recurring accumulations of meaning, to visualise the density of ארץ, and to establish syntactic and semantic patterns in the MT-Ps and in the MS.

In order to record the shades of meaning of ארץ systematically, main categories (main tags) were defined in a further step, and both the MT-Ps and the MS were annotated automatically and manually in CATMA. Within these main categories, subcategories (sub-tags) emerged that allowed for even finer differentiation. In this way it was possible, for example, to identify shades of meaning that relate exclusively to geographical or metaphorical contexts in the Psalms.

This presentation introduces and interprets the first results of the quantitative analysis.

1 Cambridge University Library, T-S A43.10; available at: cudl.lib.cam.ac.uk/view/MS-TS-A-00043-00010/1

2 eScriptorium. Sofer Stam. Version v0.14.0, https://www.sofer.info/

3 Python. Python. Version 3.13.1, https://www.python.org.

4 R. The R Project for Statistical Computing. Version 4.4.2. https://www.r-project.org/

5 CATMA. Computer Assisted Text Markup and Analysis. Version 7.1.0, catma.de

6 visÁvis. Pattern Recognition in Annotated Texts. Version 2023, visavis.ouproj.org.il

Ilia Afanasev (University of Vienna): Computational representation of the language variation and change patterns in the historically attested dialects of Ukrainian

The focal point of the research is the prospect of using corpora to search for symbol sequences that provide the data for establishing the order of the phonetic changes within a particular language clade. The latter can aid in outlining the clade history, its inner grouping, and facilitate an overall better reconstruction of the proto-language, at the same time serving the purpose of inner reconstruction for each of the lects (Fox, 1995). While the most frequently used material for this procedure is basic vocabulary lists (Borin, 2012), the study argues that corpora can provide an additional insight into the mechanics of language variation and change.

The main dataset used in the research includes material from the historically attested Carpathian Ukrainian (Bojkian, Lemkian, and Huzulian) lects. The collected small corpus (estimated at 10,000 tokens) undergoes a preliminary manual analysis, aided by the existing body of Slavic historical phonology studies (Shevelov, 1979). This analysis aims at tagging the sequences that contain information relevant to establishing the relative chronology of the phonetic changes. The sequences are not necessarily words but rather their overlapping parts (Zelenkov and Segalovich, 2007). This helps both to increase the effective corpus size and to provide more precise instructions to the automatic methodology used.
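The idea of working with overlapping parts of words rather than whole words can be illustrated with character n-grams. The following is a minimal sketch only: the window size of 3, the boundary marker `#`, and the function names are assumptions made for illustration, and the two transliterated forms in the usage note are invented examples, not data from the corpus.

```python
from collections import Counter


def char_ngrams(token, n=3, pad="#"):
    """Return the overlapping character n-grams of one token.

    The token is wrapped in boundary markers so that word-initial and
    word-final sequences stay distinguishable from word-internal ones.
    """
    padded = f"{pad}{token}{pad}"
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_counts(tokens, n=3):
    """Count every overlapping n-gram across a list of tokens."""
    counts = Counter()
    for token in tokens:
        counts.update(char_ngrams(token, n))
    return counts
```

For instance, `char_ngrams("voda")` splits one four-letter token into four overlapping sequences, and `ngram_counts(["voda", "vody"])` shows which sequences the two forms share. This is why a small corpus yields considerably more sequence observations than it has tokens.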

The research discusses two distinct approaches. The first, descriptive visualisation with the help of heatmaps, gives an overview of the phonetic variation patterns within the data. The second, based on machine learning algorithms originally designed for grammatical error detection (Kasewa et al., 2018), permits automatic detection of similar patterns in new lects and thus helps to automate further research on the clade. Both methods undergo a qualitative cross-evaluation (Afanasev, 2024) with the help of grammatical error detection datasets. This facilitates further exploration of the nature of language evolution and the connection between variation and change within it.

Overall, the study presents a new look at the onomasiological reconstruction (Kassian et al., 2010) and corpus material dynamics in historical comparative linguistics, while showing the benefits of interaction between natural language processing and computational linguistics, and the advantages of enhancing quantitative methods with qualitative analysis.

References:

Afanasev, I. (2024). The Cross-Evaluation Crux for Computational Phylogenetic Linguistics. In M. Bakaev, R. Bolgov, A. V. Chugunov, R. Pereira, E. R, & W. Zhang (Eds.), Digital Geography (pp. 75–89). Springer Nature Switzerland.

Borin, L. (2012). Core Vocabulary: A Useful But Mystical Concept in Some Kinds of Linguistics. In D. Santos, W. Lindén, & W. Ng’ang’a (Eds.), Shall We Play the Festschrift Game? (pp. 53–65). Springer Berlin Heidelberg.

Fox, A. (1995). Linguistic Reconstruction: An Introduction to Theory and Method. Oxford University Press.

Kasewa, S., Stenetorp, P., & Riedel, S. (2018). Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection. In E. Riloff, D. Chiang, J. Hockenmaier, & J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 4977–4983). Association for Computational Linguistics.

Kassian, A. S., Starostin, G., & Dybo, A. V. (2010). The Swadesh wordlist. An attempt at semantic specification. Journal of Language Relationship, 4, 46–89.

Shevelov, G. Y. (1979). A Historical Phonology of the Ukrainian Language. Winter.

Zelenkov, Ju. G., & Segalovich, I. V. (2007). Sravnitel'nyj analiz metodov opredelenija nechetkih dublikatov dlja Web-dokumentov. In Trudy 9oj Vserossijskoj nauchnoj konferencii «Jelektronnye biblioteki: perspektivnye metody i tehnologii, jelektronnye kollekcii», RCDL’2007 (pp. 1–9).

Manuel Meschiari (University of Vienna): AI in the Archive: A Visual Analysis of Jim Hubbard’s AIDS Activist Videos

The application of quantitative and computational methods to film analysis has a long-standing tradition, dating back to the foundational work of Barry Salt with his statistical style analysis of motion pictures (1974). Similar to the transformative impact of Franco Moretti’s practice of distant reading in textual analysis (2013), Artificial Intelligence and Machine Learning present new tools for engaging with audiovisual material and for studying form, aesthetics and artistic expression of images and video. Recent developments such as distant viewing, developed by Taylor Arnold and Lauren Tilton (2023), and cultural analytics by Lev Manovich (2020) represent the scholarly need in the humanities to examine large audiovisual corpora. One compelling area where these methods can be applied is the analysis of amateur and experimental filmmaking within the context of protests and activism. During the AIDS epidemic of the late 1980s, a group of activists from ACT UP New York began capturing their protests on video, taking advantage of the new cheap and broadly available video technology, and documented their struggles in the face of political and institutional neglect. These recordings, labeled as AIDS Activist Videos, went beyond documentation, evolving into a distinctive form of activist filmmaking characterized by subversive aesthetics and experimental techniques. This body of work is credited with inspiring the first wave of New Queer Cinema, which revolutionized the landscape of independent queer film.

An important figure in preserving the AIDS Activist Videos is Jim Hubbard, who, in collaboration with the New York Public Library, took care of archiving and digitizing the material. My research focuses on his short films, which are also part of the archive. His videos can be described as experimental: by playing with film material, montage, and color, they convey the desperation, rage, and urgency of the times.

My Master’s thesis, currently a work in progress, explores how computational tools can shed light on the style and aesthetics of AIDS Activist Videos. By conducting a visual data analysis through distant viewing methods and using computer vision libraries such as OpenCV, my research examines patterns in color schemes, motion dynamics, and scene composition, uncovering the aesthetic strategies that shaped these films' documentation of AIDS activism. Preliminary findings show numerous jumps in brightness values, an extremely short average shot length, and a high rate of detected motion, suggesting intermittent editing and fast-paced footage.
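To make the notions of brightness jumps and average shot length concrete, here is a minimal sketch in plain Python. It is not the thesis pipeline. It assumes that a mean brightness value per frame has already been extracted (for example with a computer vision library), treats any jump above a fixed threshold as a cut, and uses invented numbers. The threshold of 30, the frame rate, and the function names are all illustrative assumptions.

```python
def detect_cuts(brightness, threshold=30.0):
    """Return the frame indices where mean brightness jumps sharply.

    A jump larger than `threshold` between consecutive frames is treated
    as a cut. This is a deliberate simplification of shot detection.
    """
    return [
        i
        for i in range(1, len(brightness))
        if abs(brightness[i] - brightness[i - 1]) > threshold
    ]


def average_shot_length(n_frames, cuts, fps=25.0):
    """Average shot length in seconds for a clip with the given cuts."""
    n_shots = len(cuts) + 1
    return n_frames / n_shots / fps


# Invented per-frame mean brightness values (0-255 scale).
frames = [100, 102, 101, 180, 178, 179, 60, 62, 61]
cuts = detect_cuts(frames)
asl = average_shot_length(len(frames), cuts, fps=3.0)
```

A single brightness threshold will confuse flashes or strong camera movement with cuts, which matters for experimental footage, so the values it returns are best read as a rough indicator rather than a measurement.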

This study situates itself at the intersection of film studies, archival studies, and the Digital Humanities, seeking to bridge computational approaches and the analysis of form and aesthetics in experimental cinema.

References:

Arnold, Taylor, and Lauren Tilton. Distant viewing: computational exploration of digital images. MIT Press, 2023.

Manovich, Lev. Cultural analytics. MIT Press, 2020.

Moretti, Franco. Distant reading. Vol. 93. Verso, 2013.

Salt, Barry. "Statistical style analysis of motion pictures." Film quarterly 28.1, 1974: 13-22.

Maximilian Grübsch (University of Vienna): Russian Future Periphrasis in Statistical Perspective

The Russian imperfective future is formed with the future copula budu (given here in the first person) and an imperfective infinitive. This is not a native construction, but a calque from the Polish equivalent that entered the language in the Late Middle Russian period, that is, in the 16th and 17th centuries (Moser 1998). During that time, a multitude of auxiliaries could lay claim to the status of a grammaticalized future tense, such as stanu, imu and uč’nu. Their distribution can partially be explained in syntactic terms (Penkova 2019), but their semantics in particular are yet to be fully understood.

While Late Middle Russian texts and genres show varying degrees of Polish-Ruthenian and Church Slavonic interference, it is the first time that a somewhat unified Russian-based variety can be observed, the so-called Russe de la chancellerie. It was used primarily in diplomatic notes and juridical documents (Unbegaun 1935). Therefore, at least these texts can be understood and analysed as one variety. Differing style and linguistic interference, however, must still be kept in mind.

It is the task of my PhD thesis to uncover the semantic distribution of the above-mentioned infinitive periphrases and to investigate what shifts they underwent within this variety and in subsequent centuries. While its first part is dedicated to close reading, the second part, which is to be presented in the talk, consists of a corpus-based approach using the Middle Russian subcorpus of the Russian National Corpus (approximately 8 million tokens) to provide a bird's-eye perspective on the topic. As demonstrated in Hilpert (2021), the collexeme behaviour of the periphrases was evaluated using pointwise mutual information. The data was then reduced by principal component analysis and visualised by multidimensional scaling. The indications of previous research (e.g. Penkova 2022) can be reaffirmed and further corroborated, namely the fact that uč’nu + infinitive, despite its inchoative etymology, is closer to budu than to stanu. The latter is rather to be understood as an inchoative auxiliary. Furthermore, stanu, budu and uč’nu are all rather removed from modal semantic components.
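Pointwise mutual information for an auxiliary a and an infinitive v is log2( P(a, v) / (P(a) · P(v)) ). It is positive when the pair co-occurs more often than chance would predict. The sketch below shows one way to compute it from a list of observed pairs. The function name is hypothetical, the four pairs are invented toy data rather than counts from the Russian National Corpus, and the later steps (principal component analysis, multidimensional scaling) are not reproduced.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Compute PMI (base 2) for every observed (auxiliary, infinitive) pair.

    Probabilities are plain relative frequencies over the list of pairs,
    with no smoothing or frequency cut-off.
    """
    pair_counts = Counter(pairs)
    aux_counts = Counter(aux for aux, _ in pairs)
    inf_counts = Counter(inf for _, inf in pairs)
    total = len(pairs)
    return {
        (aux, inf): math.log2(
            (count / total)
            / ((aux_counts[aux] / total) * (inf_counts[inf] / total))
        )
        for (aux, inf), count in pair_counts.items()
    }


# Invented toy observations, for illustration only.
toy_pairs = [
    ("budu", "pisat'"),
    ("budu", "pisat'"),
    ("budu", "govorit'"),
    ("stanu", "govorit'"),
]
scores = pmi_table(toy_pairs)
```

Unsmoothed PMI is known to overrate rare pairs, so a real study would typically apply a minimum frequency or a smoothed variant before comparing auxiliaries.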

References:

Hilpert, M. (2021). Ten Lectures on Diachronic Construction Grammar. Leiden: Brill.

Moser, M. (1998). Die polnische, ukrainische und weißrussische Interferenzschicht im russischen Satzbau des 16. und 17. Jahrhunderts. Frankfurt am Main: Lang.

Penkova, Y. (2019). Imu, uchnu, stanu, budu: A Corpus-Based Study of Periphrastic Future Constructions in Middle Russian. Slavistična revija, 67 (4), pp. 569‒586.

Penkova, Y. (2022). Semantics of Inception: A Corpus-Based Research of Inchoative Verbs in the History of the Russian Language, Vestnik VolGU, Serija 2, Jazykoznanie, 21(6), pp. 57-75.

Unbegaun, B. (1935). La langue Russe au XVIe siècle: 1500 - 1550. 1, La flexion des noms. Paris: Champion.

Claudia Mattes (University of Vienna): Das gehören-Passiv. Korpuslinguistische Untersuchung einer analytischen Konstruktion.

The slogan of the Austrian radio station Ö1, „Ö1 gehört gehört“[1] (roughly, ‘Ö1 ought to be heard’), is one of the prominent examples of the so-called gehören passive. Composed of gehören and a past participle, it resembles the other, more common passive forms, but it is striking that it can only be paraphrased with the help of modal verbs: „Das soll/muss gehört werden!“ (‘That should/must be heard!’). In addition to passivisation, it carries a modal component that appears to be semantic in origin and was retained during the grammaticalisation process. In this respect it resembles the so-called recipient passive with kriegen/bekommen/erhalten more than the werden or sein passive. It can be counted among those constructions that have so far received less attention, either because of their form (in this case the discontinuity of the analytic verb form) or because they lie outside the norm.

I already examined the grammaticalisation of the gehören passive in my Master's thesis (see Mattes 2024), using the Austrian Media Corpus (amc) as the data basis. The experience gained there will now feed into a dissertation, whose broader scope is to be used for an extended investigation on several levels.

In the research literature, the gehören passive has been studied either by citing individual examples (see Szatmári 2002) or by querying specific combinations with CQL (see Stathi 2010; Lasch 2016), with more difficult cases addressed only in passing, if at all. In order to grasp the construction as comprehensively as possible, I have chosen an approach that includes an NLP script in addition to corpus queries. One concern is to use the findings of the previous study and manually annotated data for the further analyses; one step is the creation of a classification parser with the help of the attested examples already known.

The construction gehören + past participle is to be examined in further contexts, because the results suggest that it is a predominantly spoken phenomenon, possibly originating in usage further from the standard (cf. Mattes 2024: 116), in contrast to the written and standard language that fundamentally characterises the amc. Moreover, the origin and emergence of the gehören passive are still not conclusively clarified: a diachronic perspective and a comparison with the other attestations of the verb gehören and with competing constructions of similar semantics are to be drawn on for this purpose. The research interest covers both grammatical and semantic aspects of the construction, in order to capture commonalities and differences in its use in different contexts, among other things through emotion analysis.

Multiple corpora (written and audio data, on the vertical axis between dialect and standard, recent and historical sources) pose different challenges, including the amounts of data that may and can be extracted in order to carry out the filtering with the help of NLP and to collect sufficient attestations. As in the Master's thesis, reflection on the procedure and its challenges takes priority here, in order to do justice both to the theoretical framework of German linguistics and to the digital methods and components in the spirit of the Digital Humanities.

[1] Standard (2020): „Ö1 gehört gehört“. Radiosender darf Slogan unbegrenzt verwenden. www.derstandard.at/story/2000117600042/oe1-gehoert-gehoert-radiosender-darf-sloganunbegrenzt-verwenden [last accessed 14.12.2024]

References:

Lasch, Alexander (2016): Nonagentive Konstruktionen des Deutschen: Sprache und Wissen: Berlin/Boston: De Gruyter.

Lasch, Alexander (2018): „Diese gehören kalt zu geben.“ Die Konstruktion „gehören“ mit Qualitativ. In: Sprachwissenschaft (43 (2)), 159– 185.

Mattes, Claudia (2024): Das gehören-Passiv in der österreichischen Standard(schrift)sprache. Eine Analyse im Austrian Media Corpus (amc). Masterarbeit Universität Wien.

Stathi, Katerina (2010): Is German gehören an auxiliary? The grammaticalization of the construction gehören + participle II: In: Stathi, Katerina / Gehweiler, Elke / König, Ekkehard (Hg.): Grammaticalization: current views and issues. Studies in language companion series (SLCS). Amsterdam/Philadelphia: Benjamins Pub. Co, 323–342.

Szatmári, Petra (2002): Das gehört nicht vom Tisch gewischt... Überlegungen zu einem modalen Passiv und dessen Einordnung ins Passiv-Feld: In: Jezikoslovlje (3.1–2), 171– 192.

Sydney Szijarto (Central European University): Investigating Authorship: Stylometry and the Harvard Project on the Soviet Social System

Although the Harvard Project on the Soviet Social System’s interviews contain a wealth of information on the pre-1945 Soviet sphere and Soviet émigré sentiment, the transcripts researchers have access to today are not the verbatim words of the interviewees. Rather, interviewers produced audio recordings based on their notes and recollections, which transcribers then used to create the typewritten transcripts. Ambiguity in this informational ‘chain of custody’ can make it difficult for researchers to responsibly engage with the Harvard Project. For one, it is difficult to determine whether the transcripts represent the voice of the interviewers themselves or of the transcribers who turned their recordings into text; in short, how detailed were the audio recordings the interviewers created, and can researchers use contextual information about the interviewers to make inferences about transcript tone and word choice? This paper investigates if stylometry can be used to determine whether transcripts by a particular interviewer have a cohesive style, implying that the transcripts represent the voice of the interviewer moreso than the transcriber. To do this, I used the transcripts of interviews by Kent Geiger and Sidney Harcave, familiarizing python with their respective styles by offering it corpora of their interviews. I then applied Chi-squared tests to compare the word choice in the test corpora to those of both Geiger and Harcave. The Chi-squared tests clearly differentiated Geiger and Harcave’s interviews, assigning the test corpora to the correct interviewer. Based on this outcome, I claim that stylometry can shed light on this aspect of the ‘chain of custody’ of Harvard Project interview material. Specifically, I argue that Geiger and Harcave had a strong influence over the style of the transcripts produced, and therefore that their audio recordings must have been very detailed. 
My findings suggest that researchers can approach Geiger and Harcave’s interview transcripts with more confidence that the words and tone they are analyzing represent the perception of the interviewers themselves.
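The chi-squared comparison described above can be illustrated with a minimal sketch. It is not the code used in the paper: it assumes a common form of the chi-squared stylometric distance, in which a test corpus is compared with each reference corpus over the most frequent words of their joint vocabulary and the smaller value indicates the closer style. The function names, the `top_n` parameter, and the placeholder texts are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def chi_squared_distance(reference_tokens, test_tokens, top_n=500):
    """Chi-squared distance between a reference corpus and a test corpus.

    Only the top_n most frequent words of the joint corpus are compared.
    A smaller value means more similar word choice.
    """
    if not reference_tokens or not test_tokens:
        raise ValueError("both corpora must contain at least one token")
    reference_counts = Counter(reference_tokens)
    test_counts = Counter(test_tokens)
    joint_counts = reference_counts + test_counts
    # Share of the joint corpus contributed by the reference corpus.
    total = len(reference_tokens) + len(test_tokens)
    reference_share = len(reference_tokens) / total

    chi_squared = 0.0
    for word, joint_count in joint_counts.most_common(top_n):
        # Expected counts if both corpora used the word at the same rate.
        expected_reference = joint_count * reference_share
        expected_test = joint_count * (1 - reference_share)
        chi_squared += (reference_counts[word] - expected_reference) ** 2 / expected_reference
        chi_squared += (test_counts[word] - expected_test) ** 2 / expected_test
    return chi_squared


def attribute(test_text, reference_corpora, top_n=500):
    """Return the name of the closest reference corpus and all distances."""
    test_tokens = tokenize(test_text)
    scores = {
        name: chi_squared_distance(tokenize(text), test_tokens, top_n)
        for name, text in reference_corpora.items()
    }
    return min(scores, key=scores.get), scores


# Invented placeholder texts, NOT Harvard Project material.
reference_corpora = {
    "interviewer_a": "the respondent stated that the factory was cold and the work was hard " * 20,
    "interviewer_b": "he says he liked school and he says he liked his teachers very much " * 20,
}
test_text = "the respondent stated that the kolkhoz was poor and the work was hard " * 5

best, scores = attribute(test_text, reference_corpora)
print(best, scores)
```

On real transcripts the reference and test corpora would be disjoint sets of interviews by the same interviewer, and the choice of `top_n` would need to be checked for its effect on the result.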

Evelyn Fischer (University of Vienna): The opportunities for language preservation afforded by digital communication platforms on the example of Nahuatl

The digital world is increasingly becoming a medium for culture, and its structures and power dynamics shape the offline world. For endangered indigenous languages, digital communities offer a communication space that can include, crucially, speakers who have migrated away from the community. In this way, digital communication can mitigate the effects of migration on the long-term survival of minority languages.

The question then arises whether existing tools meet the needs of minority groups and whether they foster the use of minority languages. In my master's thesis (Fischer 2024), I collected and analyzed instances of the use of Nahuatl, Mexico's most widely spoken indigenous language, on the internet and found that, while their number is rapidly increasing, the presence of Nahuatl on the internet is relatively low.

However, indigenous communities are not excluded from the digital world (see Peñuelas Peñarroya 2022 and the references therein for examples of the appropriation of digital technologies by indigenous groups). The question therefore arises why indigenous languages are underrepresented online. A partial explanation is offered by an observation I made during a visit to Tepoxteco, a Nahuatl-speaking community of 400 people in Veracruz, in November 2024. Many of the school-aged children and younger adults had mobile phones, while among the older generations owning and using a mobile phone was not widespread. At the same time, Nahuatl is spoken by the older members of the community, while most younger members speak exclusively or predominantly Spanish. It can therefore be inferred that the presence of Nahuatl on the internet is limited because those community members who have internet access communicate online in the majority language, Spanish.

This raises the question of how to structure a digital community that would foster the use of indigenous languages and the exchange of materials in or about them. Initial considerations suggest that making such a community available as an app and offering voice messages could attract a higher number of users. One challenge is how to design a digital community that is attractive to, and can connect, both older and younger users. Further research incorporating feedback from potential users is planned and necessary.

References:

Fischer, Evelyn (2024). El internet en náhuatl: la apropiación de las tecnologías de información y comunicación por una lengua indígena. Master's thesis, University of Vienna. utheses.univie.ac.at/detail/70866/

Peñuelas Peñarroya, Anna (2022). Impacto de los celulares e internet en la movilidad de los indígenas ngäbe entre Panamá y Costa Rica (2021-2022). ODISEA. Revista de Estudios Migratorios, (9), 30-54.

Nikola Krisztian Czindrity (University of Vienna): Curiositäten- und Memorabilien Lexicon von Wien: Exploring Topics and Co-occurrence Networks in a Mid-19th Century Lexicon

The Curiositäten- und Memorabilien Lexicon von Wien (1846), authored under the pseudonym Realis (Walter Ritter von Cöckelberghe-Dützele), is a two-volume lexicon comprising 962 pages, approximately 1,295 entries, and around 490,000 tokens. A curious hybrid of factual, anecdotal, and often humorous content, it offers a unique perspective on mid-19th-century Viennese society. The project presented here applied computational techniques to explore the structure, themes, and relationships within this cultural artefact.

The project began with the digitization and transcription of the lexicon using the HTR platform Transkribus. Field models were employed to identify headers and entries, with the model trained iteratively. Initially, 30 manually annotated pages were used as training data, which was later scaled up to 102 pages. Field models demonstrated good performance and enabled the accurate extraction of entries in the later stages.

The pre-trained HTR model "German Fraktur 18th Century" (WrDiarium_M9) was utilised, resulting in an accurate transcription of the text. This data was then extracted into a TEI-XML format and processed for further computational analysis.

To uncover latent thematic structures within the lexicon, Non-negative Matrix Factorization (NMF) was applied for topic modelling. This analysis revealed ten coherent and well-distributed topics: (1) courtly matters, (2) buildings, (3) culture (strongly associated with theatre), (4) culture (associated with museums and libraries), (5) religion, (6) public spaces and suburbia, (7) entertainment, (8) military matters and buildings, (9) suburbia, and (10) paintings. In a further analysis, the results of t-SNE dimensionality reduction of the entry-topic matrix and the keyword-entry matrix were plotted to visually inspect the distinctiveness of the topics, reinforcing the thematic validity of the model’s outputs.
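To illustrate the factorization step, the following sketch decomposes a small entry-term count matrix V into an entry-topic matrix W and a topic-term matrix H using the standard multiplicative update rules for NMF. It is a standard-library illustration under stated assumptions, not the project's pipeline: the toy matrix, the vocabulary, and all function names are hypothetical, and in practice a library implementation such as scikit-learn's `NMF` class would normally be applied to a TF-IDF matrix of the lexicon entries.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH, i.e. the reconstruction error."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (entries x terms) into W (entries x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_entries, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_entries)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_entries)]
    return w, h


def top_terms(h, vocabulary, n=2):
    """List the n highest-weighted terms for each topic."""
    return [
        [vocabulary[j] for j in sorted(range(len(row)), key=lambda j: -row[j])[:n]]
        for row in h
    ]


# Toy counts (invented): rows are entries, columns are terms.
vocabulary = ["kaiser", "hof", "theater", "oper"]
v = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 3],
]
w, h = nmf(v, n_topics=2)
print(top_terms(h, vocabulary))
print(round(frobenius_error(v, w, h), 3))
```

Each row of W can then be read as an entry's topic weights, which is the entry-topic matrix that a dimensionality reduction such as t-SNE would take as input.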

Finally, a co-occurrence network analysis examined relationships between lexicon entries. In this network, nodes represented entries, and an edge was established whenever the header of another entry appeared within a given entry. The resulting network, comprising 1,102 nodes and 3,417 edges, revealed complex yet meaningful connections between entries. For example, the entry Universität is closely linked to Studenten and Botanischer Garten, reflecting logical real-world relationships. Although broader clustering using metrics such as betweenness centrality proved challenging, inspecting specific subgraphs illuminated localized and interpretable associations within the data. The network as a whole, however, remains too complex for a straightforward visual analysis.
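The edge rule described above (an edge wherever another entry's header appears in an entry's text) can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the project's implementation: the function names are hypothetical, the entry texts are invented placeholders, and exact string matching is a simplification that would miss inflected forms such as "Botanischen Garten".

```python
import re


def build_network(entries):
    """Map each entry header to the set of other headers mentioned in its text.

    `entries` is a dict of header -> entry text. The result is a directed
    adjacency mapping: header -> headers that occur in that entry.
    """
    patterns = {
        header: re.compile(r"\b" + re.escape(header) + r"\b") for header in entries
    }
    edges = {header: set() for header in entries}
    for header, text in entries.items():
        for other, pattern in patterns.items():
            if other != header and pattern.search(text):
                edges[header].add(other)
    return edges


def neighbourhood(edges, node):
    """Return all entries linked to `node` in either direction (a local subgraph)."""
    outgoing = edges.get(node, set())
    incoming = {source for source, targets in edges.items() if node in targets}
    return outgoing | incoming


# Invented placeholder texts; only the headers are taken from the abstract.
entries = {
    "Universität": "Toy text: the Studenten often walk to the Botanischer Garten.",
    "Studenten": "Toy text: see Universität.",
    "Botanischer Garten": "Toy text without any cross-reference.",
}
edges = build_network(entries)
n_edges = sum(len(targets) for targets in edges.values())
print(edges)
print(n_edges, neighbourhood(edges, "Botanischer Garten"))
```

Inspecting the neighbourhood of a single entry in this way corresponds to the subgraph inspection mentioned above, which stays readable even when the full network does not.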

While further investigation is required to disentangle these connections, the project demonstrates a clear and feasible workflow for converting a scanned lexicon into a structured co-occurrence network.

Dorothea Sichrovsky (University of Vienna): Emotionsgestaltung als Erzählstrategie am Beispiel von Der Saelden Hort [The Shaping of Emotion as a Narrative Strategy: The Example of Der Saelden Hort]

Der Saelden Hort is a religious poem composed in Basel around 1300. Its two parts narrate (1) the birth of Jesus and various biblical stories surrounding the beginning of his miracles, and (2) the ministry of Jesus and the life and works of Mary Magdalene.

One aspect of the text stands out at first: although several of the characters are focalized and motivated to a considerable degree, these strategies are most concretely realized in the figure of Mary Magdalene. According to the working hypothesis, this is due on the one hand to a heightened interest in female characters, and in Mary Magdalene in particular; on the other hand, it can be described as a strategy for creating a plot around her figure, and across the text as a whole, despite the few and rather additively arranged canonical Bible passages about Mary Magdalene.

In my seminar paper, which is currently in progress, I examine how emotions are shaped in Der Saelden Hort and, building on that, how action and plot are constructed and elaborated in the text.

Methodologically, I carry out a sentiment analysis for which I train my own model. Its focus is not the type of emotion (e.g. positive/neutral/negative) but the degree of emotionality (e.g. none/a little/a lot). The model is trained on the Middle High German text of Der Saelden Hort (and possibly Lutwin's Eva und Adam; the selection is guided by the texts' date of composition and by the genre and narrative mode attributed to them). To sort the Middle High German training data into these categories, I draw on the Mittelhochdeutsche Begriffsdatenbank (MHDBDB, the Middle High German Conceptual Database) and assign words and phrases on the basis of their use in other contexts.

In parallel with the computational analysis of the text, I carry out a qualitative annotation in TEI/XML using the Oxygen editor, tagging Der Saelden Hort with the <hi> element (and various refinements of it).

The two datasets are then brought together and compared, with a focus on the following questions:

(1) At which points and by which strategies are emotions elaborated, and what purpose does this serve for building a plot and for focalizing the characters in the text?

(2) What can the comparison of the two datasets tell us about (a) strategies of focalization that make information memorable for a text's readers, and (b) a kind of gradually increasing sympathy that builds up over the course of reading, whereas the computational evaluation concentrates on focalization and motivation at individual points in the text? In other words, by comparing the two analytical steps I want to find out whether divergences between them support well-founded hypotheses about which information stands out during reading, and whether conclusions about overarching textual strategies can be drawn on that basis. For example, a human annotator might, unlike the automated sentiment analysis, judge Mary Magdalene's third emotional focalization to be stronger simply because it is the third: by then the annotator has acquired knowledge about the character and holds concrete expectations about the course of the plot.

Sabrina Bach (University of Vienna): Digitale und diplomatische Darstellungsversuche der Nibelungenhandschrift B durch Transkribus [Attempts at a Digital and Diplomatic Representation of Nibelungen Manuscript B with Transkribus]

Codex Sangallensis 857 (mid-13th century, around 1260) is regarded, on account of its scope, as probably one of the most important Middle High German miscellany manuscripts. It contains a large number of epics for whose editions the codex provided the basis. One of these is Nibelungen manuscript B, which takes up 125 of the codex's roughly 700 pages. The manuscript has a fairly simple layout: it is written in two columns with coloured majuscules at the beginning of each stanza, and the initials are artistically designed.

Manuscript B remains relevant to research today. On closer examination, however, it becomes apparent that in recent decades scholarship has moved away from diplomatic representations of the Nibelungenlied and devoted more attention to normalized renderings intended as reading versions. One relevant diplomatic transcription is that by Batts from 1971, but even here the form of the text was not reproduced exactly: interventions are evident above all in the punctuation, but also in the expansion of abbreviations and in the capitalization of place and personal names. Nor is there yet a diplomatic transcription presented alongside the manuscript image.

This is where my project comes in. Using the AI-based platform Transkribus, I have produced a diplomatic transcription that stays as close as possible to the form of the text; nothing was added that is not present in the manuscript. To build my model I relied on base models for both text recognition and layout recognition, but a good deal of manual correction was necessary, and this made up the bulk of my work. On that basis I trained field models that recognize the structure of the page and, in a next step, mark the individual lines and their line polygons (the area around a line within which the text is recognized). I then trained a text recognition model that made transcribing the remaining pages considerably easier. The individual pages are now machine-readable, and the original form of the text is available in the application alongside the transcription.

As a next step I plan to export the data so that it can be reused in the form of a digital edition. My project is still in progress; I expect to have finished the transcription by the end of the semester. I see this kind of treatment as a novel way of engaging with manuscript B rather than relying solely on the common reading versions; at the same time, machine readability makes the manuscript searchable.

It would also be desirable, as a further step, to undertake a complete transcription of the codex, since nothing of the kind exists so far. A model for this would be the “Ambraser Heldenbuch”, with the manuscript image next to the diplomatic transcription.

Date

20 February 2025

Venue

Sitzungssaal
Österreichische Akademie der Wissenschaften
Dr. Ignaz-Seipel-Platz 2
1010 Wien

Contact

Susanne Schmalwieser

Registration

You can register for the event here:

Registration form

