Data integration, storage, archiving and open access

(Gerald Hiebel, Peter Andorfer, Matej Durco, Edeltraud Aspöck)


Figure 1: Metadata mapped to the formal definitions of the CIDOC CRM ontology.
Figure 2: System architecture and workflow.


Based on the user requirements defined in 2015 and the requirements for the long-term preservation of data, existing software solutions were evaluated and an analysis of the preferred system architecture was conducted. To create, manage and query the metadata and digital documents of the Tell el-Daba excavation documentation, we identified three main components within the system architecture:

  • Data Creation & Curation
  • Data Integration, Storage & Archiving
  • Data Presentation & Publication

The goal was to develop a system with open and well-defined interfaces between the components. The leading idea is that the data are the most important asset of the project: it should be possible to choose a different software product for each system component and, if necessary, to replace components individually when a better product for a specific purpose becomes available.

We chose Microsoft Excel for the metadata entry and for the management of the controlled vocabularies (see Digitisation of Tell el-Daba resources). We found that the flexibility offered by MS Excel was an advantage compared to other systems, which would have needed the development of a user interface or the customisation of an existing interface to accommodate the needs of the project. Another big advantage is that users are accustomed to this default piece of office software. These arguments outweigh the limitations of Excel in terms of the data modelling and data validation capabilities that a traditional database application would provide (referential integrity, checks for allowed values, concurrent user access, etc.). After defining the main categories of the data structure and creating an identifier policy, we could immediately start the metadata entry process. Excel allows values to be entered quickly (e.g. the same value can be entered into many cells at once, whereas a database form typically accepts one value at a time), which was a main reason to stick with Excel. A disadvantage is that this method is more prone to errors, as identifier handling and management is performed by humans and requires constant monitoring and regular quality assessment. However, as only a few students carry out the data entry, they have in the meantime become experts in the TD documentation and mistakes have become less frequent.
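Because identifier handling and vocabulary use in the spreadsheets are maintained manually, part of the regular quality assessment can be scripted. The following is a minimal sketch of such a check in Python, assuming a hypothetical file name, column names ("identifier", "document_type"), identifier pattern and vocabulary excerpt; the actual spreadsheet layout and identifier policy of the project differ in detail.

    # Minimal sketch of a scripted quality check on one metadata spreadsheet.
    # File name, column names, identifier pattern and vocabulary are assumptions.
    import pandas as pd

    METADATA_FILE = "td_field_drawings.xlsx"          # hypothetical file name
    ID_PATTERN = r"^TD_[A-Z]{1,4}_\d+"                # hypothetical identifier pattern
    ALLOWED_TYPES = {"field drawing", "photo"}        # hypothetical vocabulary excerpt

    df = pd.read_excel(METADATA_FILE, sheet_name=0, dtype=str)

    # Identifiers must be present, unique and well-formed.
    missing = df[df["identifier"].isna()]
    duplicated = df[df["identifier"].duplicated(keep=False)]
    malformed = df[~df["identifier"].fillna("").str.match(ID_PATTERN)]

    # Vocabulary columns may only contain terms from the controlled vocabulary.
    unknown_terms = df[~df["document_type"].isin(ALLOWED_TYPES)]

    for label, rows in [("missing identifiers", missing),
                        ("duplicate identifiers", duplicated),
                        ("malformed identifiers", malformed),
                        ("unknown document types", unknown_terms)]:
        print(f"{label}: {len(rows)} row(s)")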

We used Karma (ISI 2016), a semantic web tool, to map the metadata and the vocabularies to the CIDOC CRM data model and to SKOS. Figure 1 shows how the metadata are mapped to the formal definitions of the CIDOC CRM ontology.

Karma creates a knowledge graph to represent the information and exports it as RDF (Resource Description Framework), a data format that is able to relate logical statements within a network (W3C 2014). The RDF structure was ingested into a triple store, where we linked the resources (metadata elements such as a specific excavation area) through their unique identifiers. This process integrates the metadata of the digitised resources such as field drawings or photos. Resources are linked either on a class level (because they belong to the same CIDOC CRM class, e.g. “document” or “physical thing”), on the SKOS concept level (because the same thesaurus term was attributed to them, e.g. “field drawing”) or on an instance level (because they describe the same excavation area or archaeological feature/find, e.g. “Site TD, Area F/1, SQUARE j/21, Planum 3”). The RDF network in the triple store can be queried using the SPARQL query language (W3C 2013).
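To illustrate the kind of query the integrated graph supports, the sketch below retrieves all documents that refer to a given excavation area together with their thesaurus type. The endpoint URL is a placeholder and the classes and property paths are assumptions based on CIDOC CRM; the real paths depend on the Karma mapping described above.

    # Illustrative SPARQL query against the project triple store.
    # Endpoint URL, classes and property paths are assumptions based on CIDOC CRM;
    # the actual graph structure is defined by the Karma mapping.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://example.org/4dpuzzle/sparql")  # placeholder endpoint
    sparql.setQuery("""
        PREFIX crm:  <http://www.cidoc-crm.org/cidoc-crm/>
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

        SELECT ?document ?typeLabel WHERE {
          ?document a crm:E31_Document ;
                    crm:P2_has_type ?type ;      # thesaurus term, e.g. "field drawing"
                    crm:P70_documents ?area .    # the entity the document refers to
          ?type skos:prefLabel ?typeLabel .
          ?area rdfs:label "Site TD, Area F/1, SQUARE j/21, Planum 3" .
        }
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["document"]["value"], "-", row["typeLabel"]["value"])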

To date we have ingested the metadata of the scanned field drawings and photos of the excavation areas F/I and A/II (Table 3.1) and created queries according to the requirements (see metadata and semantic enrichment). The requirements could be fulfilled; in addition, a de-normalised export of the data was produced and imported into MS Excel. The filter functionality of MS Excel can be used to simulate a faceted search tool, with the ability to drill down on excavation areas, archaeological types such as graves or documentation types such as field drawings or photos, and to retrieve the filenames corresponding to the specified criteria. This approach was intended as a fast way to review, query and validate the entered metadata and also to explore the feasibility and quality of the semantic mapping.
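The de-normalised export mentioned above can be reproduced with a few lines of Python: the SPARQL result bindings (here replaced by a small illustrative sample) are flattened into one table and written to an .xlsx file whose column filters then act as a simple faceted search. Column and file names are again assumptions.

    # Sketch: flatten SPARQL JSON bindings into a de-normalised table for Excel.
    # The sample bindings, column names and file names are illustrative only.
    import pandas as pd

    bindings = [   # normally: results["results"]["bindings"] from the query above
        {"area": {"value": "F/I"},  "doc_type": {"value": "field drawing"},
         "filename": {"value": "TD_FZ_1043.tif"}},
        {"area": {"value": "A/II"}, "doc_type": {"value": "photo"},
         "filename": {"value": "TD_Foto_0001.tif"}},
    ]

    # One row per resource, one column per query variable.
    table = pd.DataFrame({var: [b[var]["value"] for b in bindings]
                          for var in ["area", "doc_type", "filename"]})

    # Excel's column filters can then be used to drill down by area and type.
    table.to_excel("td_metadata_export.xlsx", index=False)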

Workflow

The final workflow, from the metadata to a web application in which all metadata and digitised objects are integrated and searchable, involves a number of components, which are displayed in Figure 2:

  1. Metadata creation/curation continues as before.
  2. The generated spreadsheets containing the metadata are imported into a PostgreSQL database with the PostGIS extension, which allows GIS objects to be modelled and stored. The import script can be run at predefined intervals (as a cron job) or triggered manually (a minimal sketch of such an import follows the list below).
  3. The data in the PostgreSQL database are made accessible through a user-friendly (Python/Django-based) web application (https://4dpuzzle.orea.oeaw.ac.at/), which represents the primary entry point to all digital data of the project.
  4. The Karma models are applied to the data in the PostgreSQL database and the generated RDF (modelled in CIDOC CRM) is ingested into the triple store. The main goal of this transformation is to provide a standard-conformant serialisation of the data and a versatile querying capability via SPARQL, which allows for custom advanced queries on the dataset represented in a semantic conceptual model, beyond the querying functionality offered by the default application.
  5. The actual binary data, i.e. the scans, reside on a file server (see WP2) and are stored in a format which meets all requirements for long-term archiving (.tif, no compression). To make these images easily accessible, compressed (JPEG 2000) derivatives of the ‘master’ images are created and uploaded to an IIIF-conformant (http://iiif.io) image server hosted by the ACDH (e.g. https://4dpuzzle-iiif.acdh.oeaw.ac.at/TD_FZ_1043__TD_F-I_j21_Ostprofil/), which allows dynamic zooming, panning etc. of the images in the user’s browser (see the IIIF request sketch after this list). From this IIIF server the images are fetched and displayed by the 4dpuzzle web application (e.g. https://4dpuzzle.orea.oeaw.ac.at/archiv/fielddrawing/778).
  6. Metadata dumps from the triple store, together with the binary files (the digitised objects), are imported into a repository suitable for long-term archiving, maintained by the ACDH (“ARCHE”, see section below). The repository provides its own generic search and browse capabilities and thus represents an alternative mode of access to the data.
  7. Metadata entries browsable in https://4dpuzzle.orea.oeaw.ac.at/ are back-linked to the corresponding objects in the repository.
  8. GIS data is published via ArcGIS Server (or a feature-equivalent GIS server) and integrated with the main application.
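A minimal sketch of the spreadsheet import described in step 2 is given below, assuming a hypothetical connection string, spreadsheet layout, table name and coordinate reference system; the project's actual import script may be structured differently.

    # Sketch of the spreadsheet import into PostgreSQL/PostGIS (workflow step 2).
    # Connection string, file/table/column names and the SRID are assumptions.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:password@localhost/td_archive")

    df = pd.read_excel("td_field_drawings.xlsx", dtype=str)

    # Replace the staging table on every run, so the script can be executed
    # repeatedly as a cron job or triggered manually.
    df.to_sql("fielddrawing_staging", engine, if_exists="replace", index=False)

    # If geometries are delivered as WKT strings, they can be converted to
    # PostGIS geometries inside the database (column name and SRID are assumptions).
    with engine.begin() as conn:
        conn.exec_driver_sql("""
            ALTER TABLE fielddrawing_staging ADD COLUMN IF NOT EXISTS geom geometry;
            UPDATE fielddrawing_staging
               SET geom = ST_GeomFromText(wkt_geometry, 4326)
             WHERE wkt_geometry IS NOT NULL;
        """)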
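Step 5 relies on the IIIF Image API. The sketch below shows how a client (such as the 4dpuzzle web application) might request the image information document and a down-scaled derivative for the example identifier mentioned above; the requested size and the exact API version supported by the server are assumptions.

    # Sketch of IIIF Image API requests against the ACDH image server (workflow step 5).
    # The identifier is the example from the text; the size parameter and the exact
    # API version supported by the server are assumptions.
    import requests

    BASE = "https://4dpuzzle-iiif.acdh.oeaw.ac.at"
    IDENTIFIER = "TD_FZ_1043__TD_F-I_j21_Ostprofil"

    # info.json describes the image (dimensions, available sizes, tiles).
    info = requests.get(f"{BASE}/{IDENTIFIER}/info.json").json()
    print(info["width"], info["height"])

    # Derivative request pattern: {region}/{size}/{rotation}/{quality}.{format}
    # Here: the full image, scaled to 800 px width, unrotated, default quality, JPEG.
    url = f"{BASE}/{IDENTIFIER}/full/800,/0/default.jpg"
    with open("TD_FZ_1043_preview.jpg", "wb") as fh:
        fh.write(requests.get(url).content)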

ARCHE - A Resource Centre for HumanitiEs

ACDH has been developing and will soon release its new data hosting service – ARCHE.

As part of the CLARIAH-AT infrastructure, ARCHE is primarily intended to be a digital data hosting service for the humanities in Austria. Thus data from all humanities fields including modern languages, classical languages, linguistics, literature, history, jurisprudence, philosophy, archaeology, comparative religion, ethics, criticism and theory of the arts are equally welcome.

The repository builds on the well-established open-source repository software Fedora Commons version 4 which provides a sound technological basis for implementing the OAIS (Open Archival Information System) reference model by taking care of storage, management and dissemination of our content. The core component is accompanied by a set of custom-built components:

  • the “doorkeeper” service represents the single point of access to Fedora and ensures adherence to established business rules (transactions, metadata validation, authentication, etc.). It implements the API of Fedora 4 and is therefore compatible with any Fedora 4 compliant client.
  • Repo-php-util is a library offering high-level functionality to interact with the repository. It is used by all other components of the system that interact with the repository.
  • Repository Browser - implemented as a Drupal 8 module, this is the user-facing component, which allows users to navigate and search through the content of the repository.
  • OAI-PMH endpoint - a simple application implementing the OAI-PMH protocol, which delivers the metadata about the resources in the repository.
  • Validation routine - a PHP application that is run on any dataset before it is ingested into the repository; it automatically checks the structure and formats of the data and provides an overview of the dataset with respect to the size, structure and file formats used.
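The validation routine itself is implemented in PHP; the Python sketch below merely illustrates the kind of checks it performs and the kind of overview it produces. The whitelist of accepted formats and the dataset path are assumptions.

    # Illustrative sketch of a pre-ingest validation/overview routine
    # (the actual ARCHE routine is a PHP application). The whitelist of
    # accepted formats and the dataset directory are assumptions.
    from collections import Counter
    from pathlib import Path

    ACCEPTED_SUFFIXES = {".tif", ".pdf", ".xml", ".csv"}   # assumed whitelist

    def summarise(dataset_dir: str) -> None:
        files = [p for p in Path(dataset_dir).rglob("*") if p.is_file()]
        formats = Counter(p.suffix.lower() for p in files)
        total_size = sum(p.stat().st_size for p in files)
        rejected = [p for p in files if p.suffix.lower() not in ACCEPTED_SUFFIXES]

        print(f"{len(files)} files, {total_size / 1024 ** 2:.1f} MiB in total")
        for suffix, count in formats.most_common():
            print(f"  {suffix or '(no extension)'}: {count}")
        if rejected:
            print("files in formats not suitable for long-term archiving:")
            for p in rejected:
                print(f"  {p}")

    summarise("datasets/td_area_f1")   # hypothetical dataset directory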

These core features are currently in their final stage of development and a series of test imports has been performed.

The aims for the remaining 2.5 years of 4DP in WP 4 will be to …

  • create Karma models for digital resources such as databases that record locus information or detailed information on inventory objects. The goal is to be able to process digital information from recent excavation campaigns in order to have a workflow that can incorporate newly created digital information into the metadata.
  • import ‘real’ data from 4DP into the productive hosting instance; this import is foreseen for the beginning of 2018. Full access to the TD archive will be provided by the end of the project in 01/2020.
  • develop the 4dpuzzle.orea.oeaw.ac.at user interface (an ongoing process until the end of the project):
    • implementation of complex filter/search/browsing interface,
    • implementation of custom detail views per data type (excavation objects, documentations object, …),
    • further interlinking of objects,
    • in parallel to the development of the main application, we will test software solutions for querying our metadata triple store via a user-friendly web interface (Metaphacts: http://www.metaphacts.com/; WissKI: http://wiss-ki.eu/), also aiming to compare the two approaches (triple store vs. traditional relational database) in terms of flexibility, user-friendliness and performance.
  • implement the data transformation/synchronisation workflows (to be completed by the end of 2019)
  • implement the repository ingest workflow (to be completed by the end of 2019)