Demonstrators for working with bibliographical data
Bibliographical data is central to the humanities. Many resources, tools and software developments revolve around bibliographical data and cover parts of the data lifecycle, such as extracting, archiving, refining, analysing or visualizing it. During the DESIR project (DESIR stands for DARIAH ERIC Sustainability Refined), a work package addressed bibliographical data in the humanities by means of three demonstrators. The underlying software already existed, so the work focused on connecting the components into a chain that processed the same set of bibliographical data, and on extending the software's functionality. The result of this work is archived by the CLARIAH-DE project on this website to disseminate the demonstrators and encourage further use or adaptation of the software. The demonstrators, described below, are accessible as web services. The respective documentation is also available in the DARIAH ERIC GitHub repository.
Code Sprints 2018/2019
Beyond this work, the DESIR project organized two code sprints to gather public feedback on the demonstrators and to jointly develop further functionality. Starting from the areas of text-analytic services, entity-based search, scholarly content management and visualization, ideas for the DESIR code sprint had been considered and refined since the summer of 2017. To develop concepts and demonstrators for specific requirements of the DARIAH community, the technology partners and UGOE-SUB, as work package leads, organized a code sprint revolving around bibliographical metadata. The code sprint took place from July 31 to August 2, 2018 on the premises of the Institute of Library and Information Science of the Humboldt-Universität zu Berlin. The event was open to anyone interested in programming for Digital Humanities use cases but aimed primarily at developers. It fostered closer cooperation between DARIAH-related partners and institutions as well as the development of service concepts revolving around bibliographical data. Although organised as part of the DESIR project, the event was branded and disseminated as a DARIAH activity, to raise awareness and mark it as an event unmistakably related to the Digital Humanities. The code sprint was split into four coding tracks, all of which, except the AAI track, focused on bibliographical data. The tracks were:
- A: Extraction of bibliographical data and citations from PDF applying GROBID
- B: Import and export of bibliographical data from BibSonomy and ingest in managed collections
- C: Visualization of processed data with added dimensions for journals, topics, or dependency graphs
- D: Securing Online Services in the DARIAH AAI using SAML/Shibboleth
With its first developments starting in 2008, GROBID has become a state-of-the-art (Lipinski 2013; Tkaczyk 2018) open-source library for extracting metadata from technical and scientific documents in PDF format. Beyond simple bibliographical extraction tasks, the goal of the library is to reconstruct the logical structure of raw documents in order to enable advanced digital library processes at large scale. To achieve this, GROBID relies on a fully automated solution based on machine learning (linear-chain Conditional Random Fields) models. The library is today integrated into various commercial and public scientific services such as ResearchGate, Mendeley, CERN Inspire and the HAL national publication repository in France. During the code sprint, a hands-on session guided users through PDF data extraction and processing. The goal of track A was to convert PDF documents into TEI XML, to enrich the information gained from the extraction process by accessing further web services, and to visualize the results extracted from scientific articles in PDF format.
Main functions/developments in DESIR:
- Built a new model for parsing acknowledgements, both from raw text and from PDF files, with GROBID (https://github.com/kermitt2/grobid) and DeLFT (https://github.com/kermitt2/delft/)
- Created an acknowledgement parsing web service in GROBID
- Integrated the results of GROBID acknowledgement parser into a demonstrator (https://github.com/DARIAH-ERIC/DESIR-CodeSprint-TrackA-TextMining)
The GROBID demonstrator is available via CLARIAH-DE.
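GROBID exposes its extraction as a REST service (for full documents, `POST /api/processFulltextDocument` with the PDF as multipart form data) and returns TEI XML. As an illustration of what a client then does with that output, the following sketch parses a minimal, hand-written TEI fragment; a real GROBID response is considerably richer (full-text body, citations, coordinates).

```python
import xml.etree.ElementTree as ET

# Minimal, hand-written sample of the TEI header GROBID returns for a PDF;
# a real response contains much more structure.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">An Example Article</title></titleStmt>
      <sourceDesc><biblStruct><analytic>
        <author><persName><forename>Ada</forename> <surname>Lovelace</surname></persName></author>
      </analytic></biblStruct></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def extract_metadata(tei_xml: str) -> dict:
    """Pull title and author names out of a GROBID TEI document."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    authors = [
        " ".join(t.strip() for t in p.itertext() if t.strip())
        for p in root.findall(".//tei:author/tei:persName", TEI_NS)
    ]
    return {"title": title, "authors": authors}

print(extract_metadata(SAMPLE_TEI))
```

Against a locally running GROBID server (default port 8070), the same function would be applied to the response body of, e.g., `requests.post("http://localhost:8070/api/processFulltextDocument", files={"input": open("paper.pdf", "rb")})`.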
The focus of track B was on simplifying data entry, e.g. by enabling imports from ORCID or via drag'n'drop of PDF files, and on using BibSonomy as a backend for storing and organizing literature references. Through its REST API, BibSonomy enables collaborative storage and retrieval of bibliographical metadata. A tool was built that extracts bibliographical metadata from PDF files using GROBID and stores it in BibSonomy, so that bibliographical metadata can be added to BibSonomy with little effort. The tool comes with a user-friendly interface. The full Java code and an installation guide are published on GitHub: https://github.com/DESIR-CodeSprint/trackB.
Main functions/developments within DESIR:
- Two new ways of submitting data were implemented: uploading text files, and entering text directly in the browser via a text field, which allows users to copy and paste text from other sources
- The tool was provided with an individual user login for BibSonomy, so that users can add bibliographical items to their own BibSonomy accounts
- The user interface was improved with new features, e.g. the removal of specific items from the list of extracted bibliographical items
The BibSonomy demonstrator is available via CLARIAH-DE.
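As a sketch of how such a tool talks to the BibSonomy REST API: posts are submitted as XML to `https://www.bibsonomy.org/api/users/<user>/posts`, authenticated with the user name and API key. The element and attribute names below follow the BibSonomy XML schema as commonly documented, but should be verified against the current API reference; the user, title and tags are made-up examples.

```python
import xml.etree.ElementTree as ET

def build_post_xml(user, title, author, year, bibtex_key, tags):
    """Serialize one publication as a BibSonomy API post body.

    Element/attribute names assume the BibSonomy REST API XML schema;
    check them against the current API documentation before use.
    """
    root = ET.Element("bibsonomy")
    post = ET.SubElement(root, "post")
    ET.SubElement(post, "user", name=user)
    for tag in tags:                      # every post needs at least one tag
        ET.SubElement(post, "tag", name=tag)
    ET.SubElement(post, "bibtex", title=title, author=author,
                  year=year, bibtexKey=bibtex_key, entrytype="article")
    return ET.tostring(root, encoding="unicode")

xml_body = build_post_xml("demo", "An Example Article", "Ada Lovelace",
                          "2018", "lovelace2018", ["desir", "grobid"])
print(xml_body)
```

Submitting the post would then be a single HTTP call, e.g. `requests.post("https://www.bibsonomy.org/api/users/demo/posts", data=xml_body, auth=(user, api_key))`.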
Existing components of the generic visualization framework VisNow (http://visnow.icm.edu.pl) were used in combination with web frameworks. The prototype web frontend for 3D graph visualization was extended with an ego-centred view for nodes (representing authors) and adjacent edges (representing publications with other authors). The 3D interaction concepts were redesigned and example 2D maps were created. Several extensions to the 3D interaction part of the web frontend were implemented and tested in order to work out the interaction schemes between the user and a 3D graph visualization. Data import code was written for BibSonomy data export files and the BibSonomy API. Modifications to the backend data structures for graph creation were tested with an additional data processing and sorting layer in the backend. An additional 2D visualization was introduced on the frontend side using the high-level declarative language Vega-Lite.
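To illustrate the kind of 2D view Vega-Lite enables: specs are plain JSON, so the frontend only has to emit a declarative description of the chart. The following sketch (with made-up data, not the demonstrator's actual chart) builds a bar chart of publications per year:

```python
import json

# Made-up publication records, standing in for data imported from BibSonomy.
publications = [
    {"title": "Paper A", "year": 2016},
    {"title": "Paper B", "year": 2017},
    {"title": "Paper C", "year": 2017},
]

# A Vega-Lite spec is just JSON: data, a mark type, and field encodings.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "description": "Publications per year",
    "data": {"values": publications},
    "mark": "bar",
    "encoding": {
        "x": {"field": "year", "type": "ordinal", "title": "Year"},
        "y": {"aggregate": "count", "title": "Publications"},
    },
}

print(json.dumps(spec, indent=2))
```

Rendering this spec in the browser is then handled entirely by the Vega-Lite runtime; no custom drawing code is needed on the frontend.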
Access the ViStory-Demonstrator.
Main functions/developments within DESIR:
- Design and development of an internal generic data model of entity relations in time
- Conceptualisation, design and implementation of relation timelines (including co-authorship and citation graphs)
- Web-based 3D visualization
- Creation of mapping from RDF model to internal model
- Creation of mapping from JSON model to internal model
- Creation of mapping from BibSonomy REST API model to internal model
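A minimal sketch of such a mapping, assuming a simplified JSON input (the field names here are hypothetical, not BibSonomy's actual schema): each publication contributes author nodes and year-stamped co-authorship edges to the internal model of entity relations in time.

```python
from itertools import combinations

# Hypothetical, simplified JSON publication records.
posts = [
    {"title": "Paper A", "year": 2017, "authors": ["Alice", "Bob"]},
    {"title": "Paper B", "year": 2018, "authors": ["Alice", "Carol"]},
]

def to_internal_model(posts):
    """Map publication records to author nodes plus co-authorship
    edges, each edge stamped with the publication year."""
    nodes, edges = set(), []
    for post in posts:
        nodes.update(post["authors"])
        # one edge per unordered author pair of this publication
        for a, b in combinations(sorted(post["authors"]), 2):
            edges.append({"source": a, "target": b,
                          "label": post["title"], "time": post["year"]})
    return {"nodes": sorted(nodes), "edges": edges}

model = to_internal_model(posts)
print(model)
```

The same internal shape can then be fed to either the 3D graph view or a 2D timeline, which is the point of keeping the model generic.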
DESIR work package 4 explored the possibility of enhancing existing services for bibliographical metadata in DARIAH. The work package focused on entity-based search, scholarly content management, visualization and text-analytic services. The work was undertaken from 2017 to 2019 and culminated in two code sprints with external participants, two workshops on software and infrastructure sustainability and quality, and finally a documentation of the work and its results in the DARIAH ERIC GitHub repository. DESIR has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 731081 (INFRADEV-03-2016-2017, individual support to ESFRI and other world-class research infrastructures).