D4.2 OpenAIRE back-end and Invenio upgrade: specification and release plan¶
Authors: Paolo Manghi (CNR), Lars Holm Nielsen (CERN), Claudio Atzori (CNR), Miriam Baglioni (CNR), Alessia Bardi (CNR)
Overview of updates:
- Updated naming of the new entities to be inserted, according to the OpenAIRE-connect data model extension.
- List of steps performed for the back-end extension.
1. Introduction¶
The aim of this document is to explain in detail the software release plan for upgrading the OpenAIRE back-end services and Invenio. The upgrade will be done according to the data model described in D4.1 OpenAIRE Data model extension. To this end, the document presents the plan for the design, development, testing, and integration into beta and production of the infrastructure services to be delivered by T4.1 and T4.2. The plan’s technical activities will be supervised and led by CNR and carried out by the technical partners CNR and CERN, in synergy with the partners Jisc, UMinho, UniHB, PIN, CNRS, IRD, and ICRE8.
The deliverable is on-going and will be updated at M15 (before first BETA release of the service), M23 (before second BETA release of the service), and M27 (before production release of the service). The first release of this document will report on extensions to be provided by M8.
The OpenAIRE-Connect technical roadmap is organised in three parallel threads, leading to two BETA releases of the services and one final production release (Fig. 1). Initially, from M1 to M8, OpenAIRE back-end services and Invenio will be upgraded to include research methods and packages of artefacts. From M9 to M28 the services will be incrementally realised, released in BETA, tested and assessed in the Pilots, and further refined. From M16 to M30, from the first BETA release of the services until their last round of refinements, the services will be tested and assessed by research communities and content providers in order to validate and improve their functionalities.
Figure 1. OpenAIRE-Connect technical timeline
The TRL6 upgrade of the OpenAIRE back-end services and Invenio is a prerequisite for the development of the other components, and therefore a first TRL6 version is expected to be delivered at M8 (Milestone M4.1). The final version of the upgrade will then be merged into milestone M4.2 for the delivery of the first TRL6 release of the Research Community Dashboard and APIs software packages at M16 (see Gantt chart in Figure 2).
Figure 2. Gantt Chart: technical roadmap of WP4
This document focuses on the release plan for the first thread (from M1 to M8) and it is organised in two main sections: Section 2 describes the release plan for upgrading the OpenAIRE back-end services, while Section 3 describes the plan for the upgrade of Invenio.
2. Specification and release plan for the OpenAIRE back-end services (CNR)¶
In this section we present the extension of the OpenAIRE back-end services due by M8 to allow the deposition and usage of the OpenAIRE-connect entities. To this aim, we first introduce the OpenAIRE services for which extensions are needed.
The OpenAIRE technical infrastructure collects information about objects of the research life-cycle from different types of data sources:
- article metadata and full-texts from institutional and thematic repositories, Open Access journals and publishers;
- dataset metadata from data repositories and data journals;
- metadata about data sources, organisations, projects, and funding programs from entity registries, i.e. authoritative sources such as CORDA for FP7 and H2020 projects, OpenDOAR for publication repositories, DOAJ for Open Access journals;
- metadata about publications, datasets, persons, organisations, projects, funding, equipment and services from CRIS (Current Research Information Systems).
The information is collected by means of a data aggregation workflow (Fig. 3).
The workflow consists of three main phases: collection, validation, and transformation. In the collection phase metadata records are collected from a data source and are stored in a metadata store. The validation phase is an optional phase that can be enabled to validate the collected metadata records according to the OpenAIRE guidelines. Finally, the transformation phase transforms the collected records according to the OpenAIRE data model and stores them in another metadata store.
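The three phases can be pictured with a minimal Python sketch. It is not the OpenAIRE implementation: `MetadataStore`, `aggregate`, and the record fields are hypothetical names introduced for illustration, and the sketch assumes that validation only reports failing records rather than discarding them.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    """In-memory stand-in for a metadata store (hypothetical, for illustration only)."""
    records: list = field(default_factory=list)


def aggregate(source_records, transform, validate=None):
    """Run collection, optional validation, and transformation for one data source."""
    # Collection: keep the records exactly as received from the data source.
    collected = MetadataStore(list(source_records))

    # Validation (optional): report the records that fail the check.
    # Assumption: failing records are reported, not dropped.
    invalid = []
    if validate is not None:
        invalid = [r for r in collected.records if not validate(r)]

    # Transformation: map every collected record to the target model
    # and keep the result in a second metadata store.
    transformed = MetadataStore([transform(r) for r in collected.records])
    return collected, transformed, invalid
```

Keeping the collected and the transformed records in separate stores mirrors the description above: the original records remain available if the transformation rules change.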
The objects and relationships that populate the OpenAIRE Information Space Graph (ISG) are derived from the transformed records. For example, a bibliographic metadata record describing a scientific article will yield one publication object and a set of person objects (one per author) related to it.
Figure 3. The OpenAIRE data aggregation workflow
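The derivation of graph objects from a transformed record, as in the article example above, could look roughly like the following sketch. The identifier scheme, the field names, and the `isAuthorOf` relationship label are assumptions made for this example and are not taken from the OpenAIRE data model.

```python
def derive_graph(record):
    """Derive graph objects and relationships from one bibliographic record.

    Assumes a record with 'id', 'title', and 'authors' keys (hypothetical layout).
    """
    pub_id = f"publication::{record['id']}"
    objects = [{"id": pub_id, "type": "publication", "title": record["title"]}]
    relationships = []
    # One person object per author, each linked to the publication.
    for position, name in enumerate(record["authors"], start=1):
        person_id = f"person::{record['id']}::{position}"
        objects.append({"id": person_id, "type": "person", "name": name})
        relationships.append((person_id, "isAuthorOf", pub_id))
    return objects, relationships
```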
All back-ends (e.g. HBASE, MongoDB, Solr) and processing phases involved in the above workflow must be adapted to represent and interpret the new data model entities. All existing services are running TRL9 technology and are designed to be flexible with respect to the data model. As such, the change will not require software re-design and rewriting, but rather software extension/re-parametrization according to a pre-defined methodology.
Release Plan¶
OpenAIRE-connect extends the OpenAIRE model with two new entities: methods and research packages. The back-end will therefore be extended to allow OpenAIRE-connect to collect this new information as well.
Although the structure of the aggregation workflow does not require changes, extensions have to be realised to the different back-end services adopted in the phases of the aggregation workflow.
The validation phase, which needs a processing extension, will be modified according to the OpenAIRE-connect interoperability guidelines provided in M3.1 – OpenAIRE-connect Interoperability guidelines due by M10.
The actions to be accomplished by M8 are:
- Action 1 Extension of the OpenAIRE internal representation model, used by the metadata store service based on MongoDB, in order to allow the representation of methods and research packages;
- Action 2 Extension of the transformation service to transform methods and research packages collected from communities into the OpenAIRE-connect internal representation (processing extension);
- Action 3 Extension of the ISG model for the representation of methods, research packages, and links between them and other OpenAIRE entities on HBASE, as defined by the OpenAIRE-connect data model extension;
- Action 4 Extension of the Solr index schema for the indexing of methods and research packages.
According to the OpenAIRE-connect data model extension, the concepts of methods and research packages have been replaced by software and other research products. The OpenAIRE back-end has been extended to support the new entities by the following actions:
- Action 1 Extension of the OpenAIRE internal representation model, used by the metadata store service based on MongoDB, in order to allow the representation of software and other research products;
- Action 2 Extension of the transformation service to transform software and other research products collected from communities into the OpenAIRE-connect internal representation (processing extension);
- Action 3 Extension of the ISG model for the representation of software, other research products, and links between them and other OpenAIRE entities on HBASE, as defined by the OpenAIRE-connect data model extension;
- Action 4 Extension of the Solr index schema for the indexing of software and other research products.
The new entities are already part of OpenAIRE, and visible on beta.explore.openaire.eu.
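As a rough illustration of the processing extension in Action 2, the sketch below maps the type declared by a collected record to an internal entity type. The mapping table, the field names, and the choice to treat unrecognised types as other research products are all assumptions made for this example, not the behaviour of the OpenAIRE transformation service.

```python
# Hypothetical mapping from a collected record's declared type to an entity type.
TYPE_MAP = {
    "article": "publication",
    "dataset": "dataset",
    "software": "software",
}


def to_internal(record):
    """Build a simplified internal representation of one collected record."""
    declared = record.get("type", "").lower()
    # Assumption: anything not explicitly mapped becomes an "other research product".
    entity_type = TYPE_MAP.get(declared, "other")
    return {"entity": entity_type, "title": record.get("title"), "source": record}
```

With this kind of table-driven design, supporting a new entity is a matter of adding entries rather than rewriting the transformation code, which is the sense in which the extension is a re-parametrization.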
3. Specification and release plan for Invenio (CERN)¶
In this section we present the extension of Zenodo due by M8 to allow the deposition of the OpenAIRE-connect entities, the exposure of the entities to the OpenAIRE infrastructure, and the alignment of Zenodo communities with the Research Community Dashboard. To this aim, we first introduce Zenodo, for which extensions are needed.
Zenodo is a digital repository running on top of the Invenio digital library framework that allows users to deposit research outputs into a trusted digital archive. Users deposit a new record in Zenodo by uploading one or more files and providing metadata such as title, authors, description and grants. Zenodo registers a new DOI for the record once the user publishes the research output. After publishing, the user can no longer modify the files.
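The deposit lifecycle described above (upload files, provide metadata, publish, files frozen) can be modelled with a small sketch. This is not Zenodo or Invenio code: the `Deposit` class is hypothetical, and the DOI prefix `10.9999` and the counter are placeholders for a real DOI registration.

```python
import itertools


class Deposit:
    """Toy model of a deposit: files are editable until the record is published."""

    _doi_counter = itertools.count(1)  # placeholder for a real DOI registration service

    def __init__(self, title, authors):
        self.metadata = {"title": title, "authors": list(authors)}
        self.files = []
        self.doi = None

    @property
    def published(self):
        return self.doi is not None

    def add_file(self, name):
        # Files can only change while the deposit is still a draft.
        if self.published:
            raise PermissionError("files cannot be modified after publishing")
        self.files.append(name)

    def publish(self):
        # A record needs one or more files; the DOI is registered on publication.
        if not self.files:
            raise ValueError("at least one file is required")
        if not self.published:
            self.doi = f"10.9999/example.{next(self._doi_counter)}"
        return self.doi
```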
For each record the user provides a resource type in the metadata. Currently Zenodo supports: publication, poster, presentation, dataset, image, video, software, lesson. The publication and image resource types also support subtypes (for publications, e.g. journal article, report, thesis and more).
In Zenodo, users can create communities, which allow them to maintain a curated set of records inside Zenodo. Any user can request to have their record added to a community, but it is the community owner who either accepts or rejects the request.
Zenodo exposes the record metadata via both REST APIs and OAI-PMH in multiple metadata formats (according to the OpenAIRE Guidelines) such as Dublin Core, DataCite, MARC21 and JSON. Zenodo also supports depositing records via the REST API.
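A harvester such as OpenAIRE would typically address the OAI-PMH interface with a `ListRecords` request. The sketch below only builds the request URL, using the standard OAI-PMH arguments `verb`, `metadataPrefix`, and `set`; the base URL and the set name in the usage note are placeholders, not Zenodo's actual endpoint or community sets.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for one metadata format."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # The optional 'set' argument restricts harvesting to one set of records.
    if set_spec is not None:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"
```

For example, `oai_list_records_url("https://example.org/oai", "oai_dc", "my-set")` requests Dublin Core records (`oai_dc` is the prefix every OAI-PMH repository must support) from a hypothetical set named `my-set`.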
Zenodo is running TRL9 technology and is designed to support an evolving record and deposit model. As such, the changes will not require software re-design and rewriting, but rather software extension/re-parametrization according to a pre-defined methodology.
The following subsections detail the extensions and changes needed to support the different OpenAIRE-connect entities.
Research Methods¶
In order to support deposition of the OpenAIRE-Connect entity research method, as defined by the OpenAIRE-connect data model extension, the following actions need to be performed:
- Zenodo resource type vocabulary extension: Define the resource types needed to support research methods and the mapping of the vocabulary extensions into DataCite, info:eu-repo and Citation Style Language vocabulary terms. A progressive approach, defining a minimal viable vocabulary and later extending it with further terms, is preferred, because once vocabulary terms enter production it takes at least one year to deprecate them. In addition, it is important that the defined terms help end-users select an appropriate resource type.
- Extend internal data models: The internal record and deposit models defined by JSONSchemas must be adapted, as must some of the internal vocabularies.
- Deposit REST API: The deposit REST API validation and serializations must be extended to support the new resource types and vocabularies.
- Deposit UI form: The user interface form used by most users for the deposition of material must be extended to expose the new resource types.
- Records REST and OAI-PMH APIs: The serialization of records into other metadata formats such as DataCite and Dublin Core must be extended to expose the new resource types. Search and filtering based on the resource types must also be updated.
- REST API documentation: The performed changes in the REST API must be propagated to the public documentation for the REST API.
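The vocabulary extension and its mappings can be sketched as a small lookup table. Everything in the example is an assumption made for illustration: the `ResourceType` structure, the `method` key, and the mapped DataCite, info:eu-repo and Citation Style Language values are not the terms adopted by Zenodo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceType:
    key: str               # internal vocabulary term
    label: str             # text shown to the end-user in the deposit form
    datacite_general: str  # DataCite resourceTypeGeneral value
    eu_repo: str           # info:eu-repo term
    csl: str               # Citation Style Language type


# Illustrative starting vocabulary with a single term.
VOCABULARY = {
    "dataset": ResourceType(
        "dataset", "Dataset", "Dataset", "info:eu-repo/semantics/other", "dataset"
    ),
}


def extend_vocabulary(vocabulary, new_type):
    """Return a copy of the vocabulary with one extra term.

    Existing terms are never redefined, since terms already in production
    are costly to deprecate.
    """
    if new_type.key in vocabulary:
        raise ValueError(f"resource type {new_type.key!r} is already defined")
    extended = dict(vocabulary)
    extended[new_type.key] = new_type
    return extended
```

Adding terms one at a time through a function like `extend_vocabulary` matches the progressive approach preferred above: the vocabulary starts minimal and only grows.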
The concept of Research Methods has been replaced by Software. Software is an entity already supported in Zenodo. In order to support the deposition of Software as defined by the OpenAIRE-connect data model extension, and following the guidelines at https://guidelines.openaire.eu, the Zenodo back-end has been extended by performing the following actions:
- Extend internal data models: The internal record and deposit models defined by JSONSchemas have been adapted to support community-specific vocabularies for Software.
- Deposit REST API: The deposit REST API validation and serializations have been extended to support the new community-specific vocabularies.
- Deposit UI form: The user interface form used by most users for the deposition of material has been extended to expose the new community-specific vocabularies.
Metadata Schema Extensions¶
In order to describe the format of the research methods as defined by the OpenAIRE-connect data model extension, the Zenodo record metadata model should be extended with either a format field or a more specific field such as programming language. The changes needed to add the field are a subset of the changes needed to add a new resource type.
Research Packages¶
Research packages, as defined by the OpenAIRE-connect data model extension, are significantly different from the existing Zenodo deposit models in that research packages have no associated files (only the individual parts of the package have files associated). A research package is thus a metadata-only record. However, part of the metadata, namely the links to the individual subparts, is versioned and cannot be freely edited after a record is published, whereas metadata edits are normally allowed for other resource types. For example, any addition or removal of links to subparts will create a new version of the research package (i.e. a new DOI). For this reason, the extension to support research packages as a new resource type is significantly larger than for other new resource types. The following actions need to be completed:
- Research packages REST API: A new REST API for the deposition of research packages needs to be created. This is primarily done through configuration and allows us to cleanly separate the validation logic and permission handling between packages and other resource types. The serialisation and validation logic will be shared as much as possible between the two deposition REST APIs.
- Programmatic API: The REST API needs to be supported by a new programmatic API which is responsible for implementing most of the deposition workflow in the backend. In particular the programmatic API is needed for the management and indexing of relations between the package and its subparts.
- Deposit User interface: A new deposit form (essentially an extension to the existing deposit form) needs to be developed. This is needed in order to a) remove the file upload part of the existing form and b) support a new widget for easily selecting the subparts that make up the research package from the Zenodo record namespace. In particular, this widget should allow selecting any version of a published record on Zenodo (not only the latest). The deposit form will support most of the metadata fields from the existing form.
- Record search and display (REST and UI): The record landing pages for research packages need to be adapted so that they display the individual subparts of the package instead of files. This is relatively easily achieved via the existing JavaScript search client. Records which are subparts of a package also need to display this relationship. These changes need to be applied both at the UI and REST API level.
- REST API documentation: The new REST API must be publicly documented in order to make it useable by the OpenAIRE-connect partners.
All parts of the required extension should be able to handle research packages with hundreds to thousands of subparts.
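The versioning rule for research packages (editing descriptive metadata keeps the DOI, changing the links to subparts creates a new version with a new DOI) can be sketched as follows. The names `PackageVersion`, `create_package`, `edit_title`, and `set_parts` are hypothetical, and the DOI prefix `10.9999` with its counter stands in for a real DOI registration.

```python
import itertools
from dataclasses import dataclass, replace

_doi_numbers = itertools.count(1)


def _mint_doi():
    """Placeholder for registering a new DOI."""
    return f"10.9999/package.{next(_doi_numbers)}"


@dataclass(frozen=True)
class PackageVersion:
    title: str
    parts: frozenset  # identifiers of the subparts; a set scales to thousands of links
    doi: str
    version: int = 1


def create_package(title, parts):
    return PackageVersion(title, frozenset(parts), _mint_doi())


def edit_title(pkg, title):
    # Descriptive metadata can change without creating a new version.
    return replace(pkg, title=title)


def set_parts(pkg, parts):
    # Any addition or removal of subpart links yields a new version and a new DOI.
    parts = frozenset(parts)
    if parts == pkg.parts:
        return pkg
    return replace(pkg, parts=parts, doi=_mint_doi(), version=pkg.version + 1)
```

Making each version immutable keeps earlier versions citable under their own DOI, which is the reason the links are not freely editable after publishing.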
The concept of Research Package has been replaced by Other Research Products. In order to support the deposition of Other Research Products as defined by the OpenAIRE-connect data model extension, and following the guidelines at https://guidelines.openaire.eu, the Zenodo back-end has been extended by performing the following actions:
- Zenodo resource type vocabulary has been extended to support the new resource type.
- Extend internal data models: The internal record and deposit models defined by JSONSchemas have been adapted, as have some of the internal vocabularies.
- Deposit UI form: The user interface form used by most users for the deposition of material has been extended to expose the new resource types.
- Records REST and OAI-PMH APIs: The serialisation of records into other metadata formats such as DataCite and Dublin Core has been extended to expose the new resource types.
- The deposit REST API validation and serialisations have been extended to support the new resource types and vocabularies.
- The documentation for the REST API has been extended with the new type.
Both Software and Other Research Products, together with their association with OpenAIRE-Connect and community-specific vocabularies, can be deposited via Zenodo (https://www.zenodo.org/deposit?page=1&size=20).
Release Plan¶
All extensions to Zenodo are due by M8. Minor changes are expected afterwards, depending on changes to the OpenAIRE data model as well as the OpenAIRE-connect interoperability guidelines provided in M3.1 – OpenAIRE-connect Interoperability guidelines due by M10.
The extensions will be delivered in two chunks:
- M4-M5: Extend Zenodo with support for research methods and extra metadata fields as described in the above specification.
- M6-M8: Extend Zenodo with support for research packages as described in the above specification.