Discovering the Power of AI for Reading Medieval Manuscripts: HTR Winter School 2022

Jan Odstrčilík

23.01.2023

Fig. 1: AI generated image with the prompt "medieval illumination of a flying spaghetti monster on a page", using huggingface.

January 23, 2023 | Jan Odstrčilík | Historical Identity Research Blog

In the last few years, huge strides have been made in artificial intelligence. No longer merely experimental or reserved for academic and big-business environments, it is slowly making its way to us, the everyday users.

We can use it to generate surprisingly high-quality images (e.g., Stable Diffusion), it helps us with text translation (DeepL), we can chat with it (ChatGPT), or use it to create computer code based on our requests (GitHub Copilot). Even this very blog post was written in a new text editor (LEX) that allowed me to write it more quickly, more efficiently and in better English than I – a non-native speaker – would otherwise have managed (though it was still corrected by my kind human colleagues at the institute).

It is therefore no surprise that AI is making its way into the field of medieval studies and it comes with a huge promise: There are hundreds of thousands of medieval manuscripts in libraries and archives. These contain often unknown texts written in complicated scripts. Traditionally, reading these manuscripts requires a lot of training in palaeography and – even more importantly – time. Thanks to the development of the so-called “handwritten text recognition” (HTR), these texts will soon be made searchable and readable much faster than ever before.

At this stage, we still have a lot of work ahead of us: we need to familiarize ourselves with this technology, we need to train the AI to recognize specific types of handwriting, and we have to develop new, appropriate methodologies for our field. None of these tasks is easy.

Our Winter School 2022: Introduction into Handwritten Text Recognition technologies of medieval manuscripts provided an opportunity to experiment with these new tools in an international seminar that was organized by the Institut für Mittelalterforschung of the Austrian Academy of Sciences in cooperation with colleagues from Princeton University (Manuscripts, Rare Books and Archival Studies Initiative, MARBAS), University of Bielefeld (CRC 1288, Practices of Comparing), the University of Vienna (Department of German Studies), the University of Bern (Digital Humanities, Walter Benjamin Kolleg) and the Czech Academy of Sciences (Institute of the Czech Language).

In four Zoom sessions and an in-person workshop held in Vienna (December 19-21), we worked in four groups to employ digital tools to transcribe, research and edit Latin, German and Czech medieval manuscripts. As an international instructor team, we collaborated with more than 40 participants from Europe and the US (undergraduate and graduate students, postdocs, archivists and librarians) to apply these new technologies to the study of medieval manuscripts and to publish the results of their work on GitHub (for a longer report, image and links, see below).

To learn more about ongoing and future experiments, workshops, and projects of this network, please feel free to contact me or Leon Pürstinger.

Currently, there are a variety of tools for automated handwriting transcription, such as eScriptorium or OCR4All. For our winter school, we decided to go with Transkribus, which is arguably the most user-friendly platform with the longest tradition and the broadest user base. Unlike the other platforms, it is a paid service. Luckily, the Transkribus team was kind enough to support our workshop with 10,000 free credits, for which we are very grateful.

We did not want to limit ourselves to just one short workshop, but rather to give the participants the opportunity to achieve concrete results. For this reason, we chose a somewhat unusual concept: four virtual meetings, followed by a three-day workshop in Vienna.

We also did not want to limit ourselves to only one script or to one language. Thus, we created four teams:

●   Carolingian minuscule team led by Helmut Reimitz, Tim Geelhaar and Gerda Heydemann,
●   Late Latin team led by myself, Jan Odstrčilík, Tobias Hodel and Daniela Mairhofer,
●   Medieval German team led by Michael Berger, Sarah Hutterer and Dennis Wegener,
●   Last but not least, Anna Michalcová led the Czech team.

The workshop would also not have been possible without the help of Leon Pürstinger who took care of the digital aspects of the workshop, namely Zoom (including making the recordings), Notion.so and our own Discord server for the coordination of the teams.

The sessions were divided into joint sections and work in the teams. We started with a look at the technological aspects of HTR, thanks to Vicent Bosch from the tranSkriptorium project. This was followed by ever deeper dives into the use of the Transkribus software, through presentations given by Tim Geelhaar, Tobias Hodel and myself.

Meanwhile, the teams worked on their selected manuscripts from the Austrian National Library. This required first and foremost that the participants agree on a common methodology for the transcription (taking into consideration the advantages and disadvantages of various methods, especially regarding how to deal with abbreviations). The teams then prepared training data for Transkribus and created their first models.

The three-day in-person workshop in Vienna, held from 19 to 21 December 2022, focused on further work with the transcriptions obtained from Transkribus and on providing additional context. Thus, on the first day, I introduced the participants to alternative HTR programmes, Tim Geelhaar showed the use of nopaque for the language analysis of transcribed texts, and Dot Porter from the University of Pennsylvania gave a virtual presentation of the VCEditor – a tool that allows for modelling of manuscript structures (collation of quires). The second day focused on the possibilities for publishing transcriptions. I gave a short introduction to TEI-XML – a mark-up language commonly used for digital editions that is also one of the export formats used by Transkribus – and Tobias Hodel showed how our transcriptions can be displayed using TEI-Publisher. In the afternoon, we also visited the manuscript department of the Austrian National Library, where we were kindly allowed to physically inspect the manuscripts we had worked with for the past two months.
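To give a rough idea of what working with a TEI-XML transcription can look like, here is a minimal sketch in Python. It is not the workflow used at the winter school: the sample text is invented, the function names (`read_lines`, `_walk`) are hypothetical, and it assumes that abbreviations are encoded with the standard TEI elements `<choice>`, `<abbr>` and `<expan>` and that line beginnings are marked with `<lb/>`. An actual Transkribus export may be structured differently.

```python
import xml.etree.ElementTree as ET

# TEI namespace, written in the "{namespace}tag" form used by ElementTree.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sample for illustration only (not from the workshop manuscripts).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p><lb/>In nomine <choice><abbr>dni</abbr><expan>domini</expan></choice>
<lb/>amen</p>
    </body>
  </text>
</TEI>"""


def _walk(elem, expand, pieces):
    """Collect text pieces in document order; None marks a line beginning."""
    if elem.tag == TEI + "lb":
        pieces.append(None)
    elif elem.tag == TEI + "choice":
        # Keep either the expanded or the abbreviated reading, not both.
        wanted = elem.find(TEI + ("expan" if expand else "abbr"))
        if wanted is not None:
            pieces.append("".join(wanted.itertext()))
    else:
        if elem.text:
            pieces.append(elem.text)
        for child in elem:
            _walk(child, expand, pieces)
            if child.tail:
                pieces.append(child.tail)


def read_lines(tei_xml, expand=True):
    """Return the transcription as a list of manuscript lines."""
    body = ET.fromstring(tei_xml).find(".//" + TEI + "body")
    pieces = []
    _walk(body, expand, pieces)

    lines, current = [], []
    for piece in pieces:
        if piece is None:  # a <lb/> closes the line collected so far
            lines.append(" ".join("".join(current).split()))
            current = []
        else:
            current.append(piece)
    lines.append(" ".join("".join(current).split()))
    return [line for line in lines if line]


print(read_lines(SAMPLE, expand=True))
print(read_lines(SAMPLE, expand=False))
```

Encoding both readings side by side is one way to avoid having to choose between a transcription that keeps the abbreviations and one that resolves them: the same file can be displayed either way.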

On the last day, each of the teams published their transcriptions with images (the so-called ground truth) on the Zenodo platform and on HTR-United. This means that the work of the participants can be used by other researchers to train their own HTR models. Finally, all teams presented and discussed their results.

Our experience from the workshop in Vienna is summarized by the video made by one of our participants, Giuseppe Paternicò:

Watch VIDEO

We hope to organize a follow-up to this HTR Winter School in the future and to stay in touch with the participants. We are also immensely grateful for the support we received from our institutions, as well as from READ-COOP (Transkribus).

Fig. 2: Participants of the HTR Winter School 2022 in Vienna.

Winter School Introduction into Handwritten Text Recognition: Program & Information

Ground truth available on HTR-United from HTR Winter School 2022:

Carolingian Minuscule
- Vienna, ÖNB, Cod. 2160 f. 164-184. Ground Truth – https://htr-united.github.io/share.html?uri=32f975946

Late Latin
- Vienna, ÖNB, Cod. 3891. Ground Truth – https://htr-united.github.io/share.html?uri=255da7ea1

Medieval Czech
- Vienna, ÖNB, Cod. 1175, Padeřov Bible. Ground Truth – https://htr-united.github.io/share.html?uri=43573de7e

Medieval German
- Klosterneuburg, Stiftsbibl., Cod. 48. Ground Truth: Initial Release – https://htr-united.github.io/share.html?uri=939d02cb9

Instructors (in alphabetical order):

Michael Berger, MA MA, University of Vienna

Dr. Tim Geelhaar, Bielefeld University

Mag. Dr. Gerda Heydemann, Freie Universität Berlin

Prof. Dr. Tobias Hodel, University of Bern

Dr. Sarah Hutterer, MA, University of Stuttgart

Prof. Dr. Daniela Mairhofer, Princeton University

Anna Michalcová, MA, Czech Academy of Sciences

Dr. Jan Odstrčilík, Austrian Academy of Sciences

Leon Pürstinger, BA, Austrian Academy of Sciences

Prof. Dr. Helmut Reimitz, Princeton University

Dr. Dennis Wegener, University of Vienna

Credits:

Fig. 1: AI generated image with the prompt "medieval illumination of a flying spaghetti monster on a page", using https://huggingface.co/spaces/stabilityai/stable-diffusion.
