Data modelling is a central challenge in every Digital Humanities endeavour. This activity comprises both the conceptualization of the domain of discourse (the conceptual model) and the formalization of that conceptual model in concrete schemas.
Created at the outset of every project, the resulting data model is an agreement between the humanities domain experts and the development team: it guides the technical development and sets the boundaries for the kinds of information that can be captured. The data model also determines how interoperable the resulting data will be with other datasets and tools.
Although each project may have specific requirements, we base project-specific solutions on established reference models, standardized terminology, and recommended digital formats that address the specific challenges of humanities data (e.g., sparseness, ambiguity, inconsistency, or lack of information). For example, we use TEI for encoding textual resources in XML, and the CIDOC CRM family of ontologies for structured data in RDF.
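As a minimal illustration of what CIDOC CRM-based structured data looks like, the sketch below serializes a single statement about a person as RDF in N-Triples form. The CRM class (E21 Person) and property (P1 is identified by) are standard, but the entity URI and label are hypothetical examples, not project data, and a real pipeline would use an RDF library rather than string formatting.

```python
# Hypothetical sketch: emitting CIDOC CRM-style statements as N-Triples.
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def triple(s, p, o):
    """Serialize one triple; the object may be a URI or a plain literal."""
    obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
    return f"<{s}> <{p}> {obj} ."

person = "https://example.org/entity/person1"  # hypothetical URI
lines = [
    triple(person, RDF_TYPE, CRM + "E21_Person"),
    triple(person, CRM + "P1_is_identified_by", "Ada Lovelace"),
]
print("\n".join(lines))
```

Keeping the serialization this explicit makes it easy to see how a reference ontology constrains what a project-specific record may say.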
The modelling task must also take into account the formats of the source data, as well as the constraints imposed by the tools used to curate and present the data. Legacy data is often available only in textual documents or spreadsheets, making custom conversions an important part of many data processing workflows.
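A typical conversion step of this kind can be sketched as follows: a spreadsheet export (CSV) is read into structured records, with empty cells dropped so that the sparseness of the source data remains explicit rather than being filled with empty strings. The column names and values here are hypothetical examples, not data from any actual project.

```python
# Sketch of a legacy-data conversion: CSV rows to normalized dictionaries.
# Column names and values are hypothetical.
import csv
import io

legacy = """name,birth_year,place
Ada Lovelace,1815,London
Unknown Scribe,,
"""

def load_records(text):
    """Parse CSV text, dropping empty cells so missing data stays explicit."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: v for k, v in row.items() if v} for row in reader]

records = load_records(legacy)
# records[1] keeps only the fields actually present in the source row.
```

Downstream code can then test for the presence of a key instead of guessing whether an empty string means "unknown" or "not applicable".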
XML files are usually curated by the team in a git repository, which ensures detailed versioning and reproducibility of all changes. Automated pipelines implemented as GitHub Actions convert the XML/TEI documents into HTML pages, creating a static site that has minimal infrastructural requirements and is easy to maintain and host in the future. Usually, this conversion workflow is implemented using our core software, the DSE Static Cookiecutter. In many cases, indexes of named entities (such as persons, institutions, or places) are curated within project-specific instances of APIS, our framework for prosopographic data, thus lowering the barrier between (semi-structured) philological analysis and structured (meta)data sets.
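The core of such a TEI-to-HTML step can be sketched in a few lines. The element names below are standard TEI, but the mapping itself (TEI `<p>` to HTML `<p>`, `<persName>` to a classed `<span>`) is an illustrative simplification, not the actual transformation used by the DSE Static Cookiecutter, which handles far more of the TEI vocabulary.

```python
# Illustrative sketch of a TEI-to-HTML conversion step; the mapping is
# hypothetical and covers only <p> and <persName>.
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>Hello, <persName>Ada</persName>!</p></body></text>
</TEI>"""

def tei_to_html(source):
    """Map TEI <p> to HTML <p>, and inline children to classed <span>s."""
    root = ET.fromstring(source)
    out = []
    for p in root.iter(TEI_NS + "p"):
        html = ET.Element("p")
        html.text = p.text
        for child in p:
            cls = child.tag.replace(TEI_NS, "")  # e.g. "persName"
            span = ET.SubElement(html, "span", {"class": cls})
            span.text = child.text
            span.tail = child.tail
        out.append(ET.tostring(html, encoding="unicode"))
    return "\n".join(out)
```

Running this inside a CI job and committing the generated HTML is what makes the resulting site static: the published pages need no server-side processing at all.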
