Publishing and Archiving Data

In CLARIAH-DE, researchers can publish their research data in repositories in accordance with the FAIR principles, thereby archiving them sustainably and making them available to the community.

Research data that are created in the context of one’s own research, or that are re-used and enriched, should be published and archived in suitable research data repositories. There are several good reasons for this. Firstly, it makes research results reproducible for the academic public. Secondly, the data can serve other researchers as a foundation for further research, which makes them citable and a documented part of one’s scientific achievements. Finally, archiving and publishing data is part of good scientific practice, and research funding organisations increasingly require it.

Humanities research data archived by CLARIAH-DE partners are highly diverse: they range from text collections and speech recordings with literary or linguistic annotations, through lexical data such as dictionaries and similar resources, to critical editions that compare different editions and versions of written texts while taking facsimiles into account. For this wide range of research data, CLARIAH-DE and its partners offer various services, from the TextGrid Repository for digital corpora and editions to linguistically oriented collections such as the Bavarian Archive for Speech Signals.

Publishing and Archiving in a FAIR Manner

Archiving and publishing research data enables their citation and re-use. To ensure this, the data are archived and stored according to the FAIR principles. FAIR is an acronym for Findable, Accessible, Interoperable, and Reusable. CLARIAH-DE supports researchers in publishing and archiving their research data according to these principles.

Further background information on the FAIR principles can be found on the website of their initiators, FORCE11.

Archiving and Publishing with CLARIAH-DE

The CLARIAH-DE partners maintain repositories with different foci, which store data and make them available on a long-term basis. In these repositories, research data are saved together with formal descriptions of the data, so-called metadata. These metadata describe the data and contain persistent identifiers, access information, and the conditions for re-use. The CLARIAH-DE partners share these metadata via established technical protocols with their national and international partners, who maintain search engines for data.
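
To illustrate what such a metadata record might hold, here is a minimal sketch in Python. It is not the schema of any CLARIAH-DE repository: the class `MetadataRecord`, its field names, and the example values (including the identifier) are hypothetical and only mirror the four kinds of information named above: description, persistent identifier, access information, and conditions for re-use.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MetadataRecord:
    """Hypothetical metadata record; field names are illustrative only."""

    title: str
    description: str
    persistent_identifier: str
    access: str
    reuse_conditions: str

    def missing_fields(self) -> list:
        """Return the names of all fields that are empty or whitespace-only."""
        return [name for name, value in asdict(self).items() if not value.strip()]


# Example values are invented for illustration, including the identifier.
record = MetadataRecord(
    title="Example letter edition",
    description="TEI-encoded edition of an invented letter collection",
    persistent_identifier="hdl:00000/example-1234",
    access="open",
    reuse_conditions="CC BY 4.0",
)

# Serialize the record so that it could be exchanged with other systems.
serialized = json.dumps(asdict(record), indent=2)

# A record without an identifier is flagged as incomplete.
incomplete = MetadataRecord("Draft", "Unfinished corpus", "", "restricted", "")
```

Checking `incomplete.missing_fields()` shows which entries would still have to be supplied before the data could be found and cited. Real repositories define their own, considerably richer metadata schemas, which are documented on their webpages.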

The path to publication is documented in detail on the webpages of CLARIAH-DE’s repositories. The effort for researchers is kept as low as possible, but it depends on the respective repository’s interest in re-use. Moreover, the repositories specialize in different types of data, languages, epochs, modalities and data formats.

Through its Helpdesk, CLARIAH-DE helps users find partners suited to their data and get in touch with them.

Repositories for Archiving Research Data

Through its connected institutions, CLARIAH-DE makes it possible to archive and publish research data. The repositories have different subject-specific foci and differ in the type of data, language, data formats and technical guidelines. The balance of the data (“balanced corpora”) can also be a relevant criterion, for instance in the case of the German Reference Corpus (DeReKo) and the German Text Archive (DTA). Some repositories provide interfaces for quantitative experiments.

Which repository is suitable depends on the data. Editions created with the help of TextGrid, for instance, fit the TextGrid Repository, whereas TEI-encoded editions available in the DTA base format can be integrated into the German Text Archive. Spoken language data can be archived at the Hamburg Centre for Speech Corpora, at the Bavarian Archive for Speech Signals or at the Leibniz Institute for the German Language.

Some repositories offer support tools for users who make data available only rarely, for instance in edition projects carried out as part of a thesis. Here, the DARIAH-DE Repository can support users via the DARIAH-Publikator, so that they are able to publish research data in an easy, fast and format-independent manner. The TextGrid Repository requires more effort, because the TextGridLab has to be used and the data must follow a format specification (XML), but in return it ensures a tighter disciplinary classification and re-use.

The following overview lists the evaluated and certified repositories maintained by the partners of CLARIAH-DE.


| Institution | Focus | Contact details | Certificate |
| --- | --- | --- | --- |
| Berlin-Brandenburg Academy of Sciences and Humanities | German language, lexica, diachronic corpora (before 1900), digital editions, methods for text recognition (OCR) | geyken@bbaw.de | CoreTrustSeal |
| DARIAH-DE Repository | Humanities and cultural studies research data, collections | info@de.dariah.eu | CoreTrustSeal |
| University of Tübingen, Department of General and Computational Linguistics | Annotated corpora (treebanks), lexical data, experimental data, linguistic knowledge components and web services | clarin-repository@sfs.uni-tuebingen.de | CoreTrustSeal |
| IDS Mannheim | German language, large corpora of German (after 1900), corpora of spoken German, especially variation and interaction corpora | witt@ids-mannheim.de | CoreTrustSeal |
| LMU Munich, BAS | German language and multimodal data, phonetic tools and services, language statistics, pronunciation lexica | bas@bas.uni-muenchen.de | CoreTrustSeal |
| TextGrid Repository | TEI-based editions | info@de.dariah.eu | |
| UDS Saarbrücken | Multilingual corpora and corpus tools | e.teich@mx.uni-saarland.de | CoreTrustSeal |
| UHH Hamburg, HZSK | Multilingual spoken corpora, transcription tools, sign language | kristin.buehrig@uni-hamburg.de | CoreTrustSeal |
| University of Leipzig, ASV | Other languages (not German), contemporary language, lexical data, web services, special reference corpora, public data | heyer@informatik.uni-leipzig.de | CoreTrustSeal |
| University of Stuttgart, IMS | Software for computational linguistics, e.g. corpora and corpus tools, parameterizable tools and web services, written language | clarin@ims.uni-stuttgart.de | CoreTrustSeal |


Keywords:

Repositories, FAIR, metadata, archiving data, publishing data, persistent identifiers