D4.1 OpenAIRE Data model extension
Authors: Paolo Manghi, Alessia Bardi, Miriam Baglioni, Franco Zoppi, Michele Artini -- ISTI-CNR
Contributors: Achille Felicetti (PIN), Christian Barillot, Stéphane Pesant (UNIHB), Sorina Pop (CNRS), Achilleas Vassilopoulos (ICRE8), Phoebe Koundouri (ICRE8), Julien Barde (IRD)
Updates:
- 2017-12-22: apply changes agreed with the partners: research methods and packages removed; software and other research products added.
- 2018-03-29: update the multiplicity of the model's elements
1. Introduction
The aim of this wiki is to describe how the OpenAIRE data model is extended for the representation of artefacts different from publications and datasets.
In particular, OpenAIRE-Connect extends the OpenAIRE data model with:
- entities for the representation of research communities, software and other research products;
- relationships between those new entities and the already existing OpenAIRE entities;
- new controlled vocabularies for properties of software and other research products;
- new terms to be added to controlled vocabularies already in use by the OpenAIRE infrastructure.
Requirements have been collected from the research communities involved in OpenAIRE-Connect, which helped define the model extensions by specifying their current practices and desiderata regarding the publishing of research artefacts in their specific disciplines. Requirements and expectations from the existing “consumers” of the OpenAIRE infrastructure have also been considered: data sources (providing content to OpenAIRE), portal end-users in various roles (researchers, project coordinators, general public, research communities), OpenAIRE data curators (responsible for the workflows for collecting, harmonising, de-duplicating, and inferring content), and third-party services (accessing content via APIs).
The data model will be subject to changes in the future, depending on the evolution of the requirements of the OpenAIRE infrastructure and of the communities involved in OpenAIRE-Connect.
It is important to underline that the model extension has been designed not for research reproducibility, but to be of aid in the citation of the newly added research products. It is also important to stress that all the metadata collected by OpenAIRE are compliant with the OpenAIRE guidelines, and that they will be mapped to elements of the OpenAIRE Data Model described in this deliverable.
The OpenAIRE-Connect data model is obtained as an extension of the OpenAIRE data model, by adding the concepts that are relevant to support the Open Science as-a-Service functionalities introduced by OpenAIRE-Connect.
Section 2 introduces the OpenAIRE data model and its core entities, which today focus on datasets and literature.
Section 3 describes the data model extensions agreed with the OpenAIRE-Connect communities, which mainly consist of the introduction of three new entities: software, other research product, and community.
Section 4 exemplifies the usage of the OpenAIRE-Connect model by presenting instances of datasets, software, and other research products across the five research community use-cases of OpenAIRE-Connect.
Section 5 synthesises the input given by research communities regarding the controlled vocabularies that need to be set up or updated in the OpenAIRE infrastructure.
The Appendix presents example metadata about software and other research products that communities used as a guide for their contributions in Section 4.
2. OpenAIRE Data Model
The main entities of the OpenAIRE data model are illustrated in Figure 1 and are: dataset, literature, organisation, funder, funding stream, project, and data source, where dataset and literature are sub-entities of type result composed of instances (i.e. different manifestations of a result).
In the following a brief explanation of the OpenAIRE data model is given. For a detailed description please refer to the OpenAIRE data model specification.
Figure 1 OpenAIRE Data Model
Result In the OpenAIRE data model a Result is equivalent to an OpenAIRE-Connect artefact, here intended as a digital object resulting from a scientific process described by metadata. Its sub-entity types are Literature and Dataset, which inherit all Result properties and relationships with other entities and add their specific ones.
Literature includes all digital research artefacts whose intended use is narrative storytelling of a research activity and its results. Examples are scientific articles, reports, slides, data papers, etc. Although there are exceptions, as each scientist has a large degree of freedom in publishing and interlinking their artefacts, it can be generally assumed that literature artefacts are published with a narrative intent. For those specific cases where literature is intended for a different use, we in general do not expect scientists to publish such artefacts as literature artefacts. For example, when an article is a carrier of readable datasets (e.g. articles with tables), the article is often deposited a second time in a data repository, assigned a new DOI, and marked as a dataset of type “textual”; when the full texts of articles are used for natural language processing (NLP), scientists will likely create a dataset of type “collection of articles”.
Datasets include digital research artefacts encoding experimental or real-world observations/measures (e.g. primary data), secondary data derived from programmatic processing of other datasets, or more generally digital representations of facts to be interpreted by a program. The definition is cross-discipline, hence spans multiple interpretations of datasets, where typologies and granularity obey different scientific facets. Examples include, but are not limited to: databases (e.g. WoRMS), records of databases (e.g. proteins in the UniProt database), table files, queries over databases (time-series slices, geospatial maps, SQL queries), media (e.g. images, videos) or collections of media.
Organizations include companies, research centers or institutions involved as project partners or responsible for operating data sources. Information about organizations is currently collected by OpenAIRE from CORDA and CRIS systems, as being related to projects, and from OpenDOAR and re3data, as being related to a data source (e.g. institutional or data repository).
Funders, funding streams and projects. Of crucial interest to OpenAIRE is also the identification of the funders (e.g. European Commission, WellcomeTrust, FCT Portugal, NWO The Netherlands) that co-funded the projects that have led to a given result. Funders can be associated with a list of funding streams (e.g. FP7, H2020 for the EC), which identify the strands of funding. Funding streams can be nested to form a tree of sub-funding streams. Projects are typically associated with the funding stream “leaves” of such trees.
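As an illustration of the nesting described above, the following Python sketch models funding streams as a tree and lists the leaves to which projects are typically attached. The class `FundingStream`, its `leaves` method, and the sub-stream names are assumptions made for this example; they are not part of the OpenAIRE data model specification.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class FundingStream:
    """A node in a funder's tree of funding streams (hypothetical class)."""
    name: str
    children: List["FundingStream"] = field(default_factory=list)

    def leaves(self) -> Iterator["FundingStream"]:
        """Yield the leaf streams, i.e. those without sub-funding streams."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaves()


# FP7 and H2020 come from the text; the sub-stream names are placeholders.
fp7 = FundingStream("FP7", [FundingStream("FP7-sub-A"), FundingStream("FP7-sub-B")])
h2020 = FundingStream("H2020", [FundingStream("H2020-sub-A")])
funder_streams: Dict[str, List[FundingStream]] = {"EC": [fp7, h2020]}

leaf_names = [
    leaf.name for stream in funder_streams["EC"] for leaf in stream.leaves()
]
print(leaf_names)
```

Keeping projects on the leaves only means that the funding path of a project can be read by walking from the leaf up to the funder.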
Data sources Finally, OpenAIRE entity instances are created out of data collected from various data sources of different kinds, such as publication repositories, dataset archives, CRIS systems, funder databases, etc. Data sources export information packages (e.g., XML records, HTTP responses, RDF data, JSON) that may contain information on one or more of such entities and possibly relationships between them. For example, a metadata record about a project carries information for the creation of a Project entity and its participants (as Organization entities). It is important, once each piece of information is extracted from such packages and inserted into the OpenAIRE information space as an entity, for such pieces to keep provenance information relative to the originating data source. This is to give visibility to the data source, but also to enable the reconstruction of the very same piece of information if problems arise.
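The provenance requirement can be pictured with a minimal Python sketch, given here under stated assumptions: the record layout (`title`, `participants`), the `Entity` class, and the data source name are hypothetical and only illustrate how every entity extracted from one package can keep a pointer to the originating data source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entity:
    kind: str            # e.g. "Project" or "Organization"
    label: str
    collected_from: str  # provenance: the originating data source


def entities_from_project_record(record: Dict, datasource: str) -> List[Entity]:
    """Split one project record into a Project and its participant Organizations."""
    entities = [Entity("Project", record["title"], datasource)]
    for name in record.get("participants", []):
        entities.append(Entity("Organization", name, datasource))
    return entities


# Hypothetical record, as it might look after parsing an information package.
record = {"title": "Example project", "participants": ["Org A", "Org B"]}
entities = entities_from_project_record(record, datasource="example-funder-db")

for entity in entities:
    print(entity.kind, entity.label, entity.collected_from)
```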
3. OpenAIRE-Connect data model
The OpenAIRE-Connect data model is obtained as an extension of the OpenAIRE data model. The extension devised in OpenAIRE-Connect includes the result entity subtypes software and other research products, and the new entity community, as described below. To better understand the structure and semantics of software and other research products, we also introduce the definition of Result, in terms of properties and relationships with other entities.
Figure 2 OpenAIRE-Connect Data Model
Result A Result is characterised by the set of properties and relationships described in detail here and reported in short in Tables 1 and 2. Relationships added for the OpenAIRE-Connect extension are highlighted in grey.
Table 1. Result properties
| Property | Multiplicity | Description |
|---|---|---|
| Title | 1..N | The titles of the Result |
| Author | 1..N | The author of the Result |
| Contributor | 0..N | A contributor of the Result |
| Date | 0..N | Date relevant for the Result. The type of date is specified according to the vocabulary dnet:dataCite_date. Examples are publication date, creation date, submission date. |
| BestAccessRights | 1..1 | The "most open" access rights among the licenses of the Result instances. For Literature and Dataset the vocabulary is OPEN > EMBARGO > RESTRICTED > CLOSED > UNKNOWN |
| Publisher | 0..1 | The name of the organization responsible for the publishing of the result. |
| Description | 0..N | Contains the Abstract of the Result |
| PID | 0..N | Unique and persistent identifiers used to identify the result together with the relative identification agencies. A PID is a pair (PID scheme, identifier), where PID schemes are for example: DOI, PDB, etc. |
| Language | 0..1 | The language used in the description or body of the Result, specified according to the ISO639 3-letter language codes. |
| Subject | 0..N | The scientific discipline covered by the Result. A subject is a pair (subject, classification scheme). The list of supported subject classification schemes is defined by the vocabulary dnet:subject_classification_typologies. The list can be enriched with additional subject classification schemes suggested by research communities. |
| Instance | 1..N | An Instance of the Result, which represents a physical location where the Result files (web resources entities, possibly identified by DOIs) can be found |
| External Reference | 0..N | A link to an external resource that is not available as an entity in OpenAIRE. A reference is described by: web site name; web URL; identifier of the referenced resource; the type of the referenced resource. Example: a protein referred to in a publication is represented as an external reference to the Protein Data Bank, with type ‘accessionNumber’. |
| Source | 0..N | Maps the dc:source element. Values are expected to be compliant with the OpenAIRE guidelines. |
| Country | 0..N | The countries of the organizations to which the authors of the Result are affiliated at the moment of publishing; values are specified according to the ISO3166 2-letter country codes. |
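The multiplicities in Table 1 can be checked mechanically. The sketch below is a minimal illustration in Python, not the validation used by the OpenAIRE infrastructure: the function names `satisfies` and `violations` and the dictionary-based record layout are assumptions, and only a subset of the properties is included.

```python
from typing import Dict, List

# Multiplicities copied from a subset of Table 1.
RESULT_MULTIPLICITIES: Dict[str, str] = {
    "Title": "1..N",
    "Author": "1..N",
    "BestAccessRights": "1..1",
    "Publisher": "0..1",
    "Language": "0..1",
    "PID": "0..N",
}


def satisfies(multiplicity: str, count: int) -> bool:
    """Return True if `count` values are allowed by a 'min..max' multiplicity."""
    lower, _, upper = multiplicity.partition("..")
    upper = upper or lower  # a bare "1" means exactly one
    if count < int(lower):
        return False
    return upper.upper() == "N" or count <= int(upper)


def violations(record: Dict[str, List[str]]) -> List[str]:
    """List the properties whose number of values breaks the multiplicity."""
    return [
        name
        for name, multiplicity in RESULT_MULTIPLICITIES.items()
        if not satisfies(multiplicity, len(record.get(name, [])))
    ]


# Hypothetical record: no author, no access rights, two publishers.
record = {"Title": ["A dataset"], "Author": [], "Publisher": ["P1", "P2"]}
print(violations(record))  # → ['Author', 'BestAccessRights', 'Publisher']
```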
Table 2. Results relationships (relationships added for OpenAIRE-Connect are in grey)
| Relationship | Target | Multiplicity | Description |
|---|---|---|---|
| isProducedBy | Project | 0..N | The Project that co-funded the research underlying the Result |
| isRelatedTo | Result | 0..N | Link to Result entities (e.g. related datasets) |
| hasAuthorInstitution | Organization | 0..N | Link to Organization entities to which authors of the results are affiliated |
| aggregatedBy | Other Research Product | 0..N | Link to a product that aggregates the result |
| isRelevantTo | Community | 0..N | Link to Community entities to which the Result is relevant |
| isVersionOf | Result | 0..N | Link to a Result that is another version of this Result |
Instances and web resources Results are always associated with one or more instances, in the sense that different “manifestations” of the same result may exist and be comprised in the same entity. For example, the same article may be kept in two different repositories, both exposing the payload file (e.g., PDF) at different internet locations (URLs). Moreover, an instance of a Result is represented as a combination of one or more web resources, i.e. URLs to the files of the Result (e.g. article PDFs, dataset files), and is associated with the data sources hosting such resources (e.g., repositories).
Properties and relationships of Instances are listed in Tables 3 and 4.
Table 3. Instance properties
| Property | Multiplicity | Description |
|---|---|---|
| Format | 0..N | Maps the dc:format/datacite:format element, describes the file format of the web resources relative to this instance. Guidelines suggest the use of mime-types. |
| Access Rights | 0..1 | Maps dc:right/datacite:right, describes the access rights of the web resources relative to this instance. For Literature and Dataset the vocabulary is OPEN > EMBARGO > RESTRICTED > CLOSED > UNKNOWN |
| Embargo end-date | 0..1 | Date when the embargo ends |
| License | 0..1 | Terms of permission of use of this instance. Vocabulary to be defined, may depend on the specific type of result and community. |
| Type | 0..1 | Type of the result instance. Values must comply with controlled vocabularies based on the specific subclass. For example, the type of publication instances must comply with the vocabulary dnet_publication_resource, while dataset instances must comply with dnet:dataCite_resource. See also the OpenAIRE guidelines for dc:type and datacite:resourceType. In the case of software and other research product, the value of this property is chosen from a controlled vocabulary shaped by the community. |
| Web resource | 1..N | URL of the files relative to the instance |
Table 4. Instance relationships
| Relationship | Target | Multiplicity | Description |
|---|---|---|---|
| collectedFrom | Datasource | 1 | The data source from which this instance has been collected |
| hostedBy | Datasource | 1 | The data source that hosts and makes available the resources of this instance. |
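The BestAccessRights property of a Result is defined as the "most open" value among the access rights of its instances, following the order OPEN > EMBARGO > RESTRICTED > CLOSED > UNKNOWN. A minimal Python sketch of that selection is given below; the function name `best_access_rights` is hypothetical, and treating a missing or unrecognised value as UNKNOWN is an assumption of this example.

```python
from typing import Iterable, Optional

# Ordering taken from Tables 1 and 3: earlier means "more open".
ACCESS_RIGHTS_ORDER = ["OPEN", "EMBARGO", "RESTRICTED", "CLOSED", "UNKNOWN"]


def best_access_rights(instance_rights: Iterable[Optional[str]]) -> str:
    """Return the most open access right found among the instances of a Result."""
    # Assumption: a missing or unrecognised value counts as UNKNOWN.
    normalised = [
        right if right in ACCESS_RIGHTS_ORDER else "UNKNOWN"
        for right in instance_rights
    ]
    if not normalised:
        return "UNKNOWN"
    return min(normalised, key=ACCESS_RIGHTS_ORDER.index)


# One instance is closed, one is under embargo, one has no access rights.
print(best_access_rights(["CLOSED", "EMBARGO", None]))  # → EMBARGO
```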
Software entities represent research software, i.e. software that is an output of a research activity. Examples include, but are not limited to: code scripts, web services and web applications.
As a subclass of Result, Software inherits all properties and relationships of the Result entity. In addition, it bears the specific properties in Table 5.
Table 5. Software properties
| Property | Multiplicity | Description |
|---|---|---|
| Contact person | 0..N | Information on the person responsible for providing further information regarding the resource |
| Contact group | 0..N | Information on the group responsible for providing further information regarding the resource |
| Software type | 1..1 | Specifies the type of the software being described |
| Distribution Location | 0..N | URL of the web location from which the software can be directly downloaded |
| Documentation | 0..N | URL to a resource that provides useful information about the software to the end-users, such as execution tips, FAQs, help forums, etc. |
| Programming Language | 0..N | Programming language in which the software is implemented. |
| Version | 0..1 | Version of the software |
| Tool | 0..N | IT tool/service that can execute the software. E.g. if the software is an R script, then it can be executed by the tool "R Studio". |
| Distribution Form | 0..N | The form in which the software is distributed. E.g. "source code", "executable", "zip". |
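The subclass relationship between Result and Software can be sketched with Python dataclasses. This is an illustration under assumptions, not an official serialisation: the attribute names are hypothetical renderings of the property names, only some mandatory Result properties and a few Software properties are shown, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Result:
    """A subset of the mandatory Result properties of Table 1."""
    titles: List[str]
    authors: List[str]
    best_access_rights: str


@dataclass
class Software(Result):
    """Software inherits the Result properties and adds those of Table 5."""
    software_type: str                                               # 1..1
    programming_languages: List[str] = field(default_factory=list)   # 0..N
    version: Optional[str] = None                                    # 0..1
    tools: List[str] = field(default_factory=list)                   # 0..N


# Hypothetical R script, executable with the tool "R Studio".
script = Software(
    titles=["Example analysis script"],
    authors=["A. Researcher"],
    best_access_rights="OPEN",
    software_type="script",
    programming_languages=["R"],
    tools=["R Studio"],
)
print(isinstance(script, Result), script.software_type)
```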
Other research product includes any research output that is not literature, data, or software. Examples include, but are not limited to: algorithms, scientific workflows/pipelines, protocols, standard operating procedures (SOPs), simulations, mathematical and statistical models, but also research packages. Research packages can group a set of research artefacts, but can also include the encoding of a composition logic that binds them together. Such logic must obey specification or programming languages agreed on across the community, and is to be interpreted by humans or machines. For example, an instance of a workflow is a package as it describes the combination of specific artefacts to implement a scientific process, execute an experiment, etc. It is worth pointing out that, although strongly recommended and convenient for scientific publishing practices, artefacts and links specified within a representation of a research package are not necessarily represented as objects and links in the OpenAIRE graph.
As a subclass of Result, Other Research Product inherits all Result entity properties and relationships introduced above. In addition, it bears the specific properties and relationships in Tables 6 and 7.
Table 6. Other Research Product properties
| Property | Multiplicity | Description |
|---|---|---|
| Contact person | 0..N | Information on the person responsible for providing further information regarding the resource |
| Contact group | 0..N | Information on the group responsible for providing further information regarding the resource |
| Distribution Location | 0..N | URL of the web location from which the research product can be directly downloaded |
| Documentation | 0..N | URL to a resource that provides useful information about the product to the end-users. |
| Version | 0..1 | Version of the research product |
| Tool | 0..N | IT tool/service that can execute the research product. E.g. if the product is a Taverna workflow it can be executed with the Taverna workbench. |
Table 7. Other Research Product relationships
| Relationship | Target | Multiplicity | Description |
|---|---|---|---|
| Aggregates | Result | 0..N | Link to Result entities aggregated by an instance of "Other research product" |
Communities are communities of practice in a research field, composed of researchers willing to publish and share their scientific results (artefacts) among the community itself and beyond. Communities have a set of users associated to them, which are not represented in the model:
Community operators: such users can
- invite and register users to the community;
- set up NLP inference rules to identify different kinds of links;
- mark-up OpenAIRE data sources to specify their results are relevant to the community;
- configure the setting up of the research monitoring tools for the community;
- curate the community controlled vocabularies for software and other research product.
Community researchers: such users can
- deposit community results (artefacts);
- claim community results;
- search and browse the community information space;
- consult the monitoring statistics.
Communities are entities characterised by the properties and relationships in Table 8 and Table 9.
Table 8. Community properties
| Property | Multiplicity | Description |
|---|---|---|
| Name | 1 | The name of the community |
| Subject | 0..N | Field of reference |
| Controlled vocabularies for types and formats of software and other research product | tbd | |
| Inference parameters | tbd | |
| Monitoring parameters | tbd | |
Table 9. Community relationships
| Relationship | Target | Multiplicity | Description |
|---|---|---|---|
| hasRelevantProject | Project | 0..N | Projects relevant to the community |
| hasRelevantResult | Result | 0..N | Results relevant to the community |
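Some relationships in Tables 2, 7 and 9 read as two directions of the same link: aggregatedBy and Aggregates, isRelevantTo and hasRelevantResult. The Python sketch below shows one possible way to keep both directions in step. Treating these pairs as inverses is an assumption of the example, and the `Graph` class and the entity identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumed inverse pairs, using the relationship names of Tables 2, 7 and 9.
INVERSE = {
    "Aggregates": "aggregatedBy",
    "aggregatedBy": "Aggregates",
    "isRelevantTo": "hasRelevantResult",
    "hasRelevantResult": "isRelevantTo",
}


class Graph:
    """A tiny in-memory store of typed links between entity identifiers."""

    def __init__(self) -> None:
        self.edges: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def link(self, source: str, relationship: str, target: str) -> None:
        """Store the link and, when one is known, its inverse."""
        self.edges[(source, relationship)].add(target)
        inverse = INVERSE.get(relationship)
        if inverse:
            self.edges[(target, inverse)].add(source)

    def targets(self, source: str, relationship: str) -> List[str]:
        return sorted(self.edges.get((source, relationship), set()))


graph = Graph()
graph.link("workflow-1", "Aggregates", "dataset-1")
graph.link("dataset-1", "isRelevantTo", "community-A")

print(graph.targets("dataset-1", "aggregatedBy"))         # → ['workflow-1']
print(graph.targets("community-A", "hasRelevantResult"))  # → ['dataset-1']
```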
4. Communities Perspective
This section exemplifies how the data model entities of datasets, software and other research products are used to represent the respective artefacts across the OpenAIRE-Connect communities of: Marine environmental science (UNIHB), Cultural Heritage & Humanities (PIN), Neuroinformatics (CNRS), Fisheries and Aquaculture Management (IRD), Environment and Economy (ICRE8/SDSN Greece). Initial examples, as reported in the DoA, are illustrated in Table 10.
Table 10 Examples of entities from each OpenAIRE-Connect community
| Community | Literature | Dataset | Software | Other Research Product |
|---|---|---|---|---|
| Marine environmental science (UNIHB) | Deliverables, proceedings, presentations (slides), cruise reports, pre-prints, journal articles, blogs, poster, papers,... | Observations, modelling, sequencing, imaging, relational databases | Code scripts, implemented algorithms | Experimental protocols, executable workflows (i.e. the set of tools, scripts composing the workflow and the dataset they are applied to), packages that aggregate presentation slides and the proceedings paper they refer to, possibly together with the eventual used dataset and processing workflow. |
| Cultural Heritage & Humanities (PIN) | Papers, monographs, reports,... | Pictures, GIS, text files, thesauri, gazetteers,... | Processing tools | Literature artefact with datasets and processing tools if used |
| Neuroinformatics (CNRS) | White papers, journal articles, conference articles, book chapters, mooc, presentation slides | Database of images accessible through online services, patient database, derived data bases (coming either from expert annotations or results of processing tools) | Image processing software | Image processing pipelines, an experiment: an image processing pipeline with the dataset of processed images and the processing parameters |
| Fisheries and Aquaculture Management (IRD) | Journal articles, conference articles, pre-print, post-print, theses, reports, technical guides, grey literature | Databases, files, images, remote sensing or in situ sensors data, model outputs. | Software scripts | Protocols for interoperability, Compilation environment, Publication with related (meta)data and processes, experiment execution (apply data to an experiment description), experiment description (set of operations, sampling protocol) |
| Environment and Economy (ICRE8/SDSN Greece) | Journal articles, white papers, pre-prints | SDG Indices, economic data, geophysical data, GIS data, files | Software scripts, tools | Executable workflows, publications and datasets they are derived from together with the used workflow. |
In the following sections, each community reports on the types and formats of the research artefacts they use/produce in their scientific process, as well as on related tooling (if any). The idea is for each community to:
- Align cross-community and intra-community interpretation of the data model;
- Kickstart the definition of community vocabularies for subjects, types and formats of software and other research products.
Firstly, communities provide examples of datasets, software and other research products, describing their types, formats, and tools:
- Type: types are the classes of artefacts that community scientists use every day to distinguish their products; type vocabularies typically differ for datasets, software (e.g. software, script, service), and other (e.g. FAIRdom Research Object [1], RMap/DISCO [2], workflow);
- Format: formats are the classes of file formats used to encode a given file; vocabularies vary depending on the kind of artefact: for software, format is the programming language, for other research products of type “workflow” it could be "Docker image";
- Tool: tools are references to services, e-infrastructures, tools, engines or similar to be used for the execution or interpretation of the given software and other research products. Examples:
- Taverna workflow engine (https://taverna.incubator.apache.org)
- Galaxy (https://galaxyproject.org)
- D4Science Virtual Research Environment (https://services.d4science.org/group/openaire-connect_coordination)
Secondly, communities provide an initial list of terms for the type and format vocabularies they will refer to in describing software and other research products (note that for datasets OpenAIRE refers to types and formats recommended by DataCite; for details please refer to the OpenAIRE guidelines [3]). Such lists are supposed to be dynamic, i.e. they can change over time, and are to be maintained by the communities via dedicated tools.
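A community-maintained vocabulary of types and formats can be pictured as a small, growing set of terms against which a product description is checked. The Python sketch below is only an illustration: the class `CommunityVocabulary` is hypothetical, the terms are taken from the examples in the list above, and matching terms regardless of case is an assumption of this example.

```python
from typing import Dict, Set


class CommunityVocabulary:
    """Type and format terms maintained by one community (hypothetical class)."""

    def __init__(self) -> None:
        self._terms: Dict[str, Set[str]] = {"type": set(), "format": set()}

    def add_term(self, vocabulary: str, term: str) -> None:
        """Add a term; the lists are dynamic, so terms can be added at any time."""
        self._terms[vocabulary].add(term.strip().lower())

    def accepts(self, vocabulary: str, term: str) -> bool:
        """Check a term, ignoring case and surrounding spaces (an assumption)."""
        return term.strip().lower() in self._terms[vocabulary]


vocabulary = CommunityVocabulary()
vocabulary.add_term("type", "workflow")
vocabulary.add_term("format", "Docker image")

print(vocabulary.accepts("type", "Workflow"))           # → True
print(vocabulary.accepts("format", "application/zip"))  # → False

# The community later extends the list with a new format.
vocabulary.add_term("format", "application/zip")
print(vocabulary.accepts("format", "application/zip"))  # → True
```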
4.1 Marine environmental science (UNIHB)
UniHB will engage researchers in Marine Environmental Sciences, a specific branch of Earth and Environmental Science, and then extend the approach to other branches. The research community will deal with artefacts of kind: datasets, collection of datasets, related literature, model codes, model outputs, and packages of those. UniHB will also extend the OAI-PMH API of the PANGAEA thematic data archive in order to expose packages of artefacts described according to the OpenAIRE-Connect interoperability guidelines defined in WP3.
Table 11 provides examples of datasets relevant for the community, together with their types and formats.
Table 11 Examples of datasets from the Marine environmental science community (UNIHB)
| Example | Type | Format |
|---|---|---|
| EBI/ENA http://www.ebi.ac.uk/ena/data/view/ERR538187 or ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR538/ERR538187/ERR538187.fastq.gz | Nucleotide sequence | fastq |
| PANGAEA https://doi.pangaea.de/10.1594/PANGAEA.724492, GBIF http://doi.org/10.15468/ej3i4f | Tabular data set | text/html, text/plain |
| PANGAEA https://doi.pangaea.de/10.1594/PANGAEA.775081 | Image data set | raster files |
| ICES http://www.ices.dk/marine-data/data-portals/Pages/ocean.aspx | Database | text/html |
Table 12 provides examples of software relevant for the community, together with their types and tools.
Table 12 Examples of software from the Marine environmental science community (UNIHB)
| Example | Tool | Type |
|---|---|---|
| A Stepwise Fitting Procedure for automated fitting of Ecopath with Ecosim models https://github.com/ElsevierSoftwareX/SOFTX-D-15-00065 Related publication: http://dx.doi.org/10.1016/j.softx.2016.02.002 | (.NET Framework, https://www.microsoft.com/net/download/framework) | Statistical procedure |
Table 13 provides examples of other research products relevant for the community, together with their types, formats, and tools. The examples show how datasets and software can be combined to form a new research artefact that can be published and re-used by the Marine community.
Table 13 Examples of other research products from the Marine environmental science community (UNIHB)
| Example | Tool | Type | Format |
|---|---|---|---|
| A simple, low cost GC/MS method for the sub-nanogram per litre measurement of organotins in coastal water http://dx.doi.org/10.1016/j.mex.2016.07.001 | | Laboratory procedure | Journal article |
| Cas9 RNP nucleofection for cell lines using Lonza 4D Nucleofector http://dx.doi.org/10.17504/protocols.io.dm649d | | Laboratory procedure | text/html |
| EBI metagenomics pipeline (v 3.0) https://www.ebi.ac.uk/metagenomics/pipelines/3.0 | | Software pipeline | ? |
| PANGAEA data download service - filter by dataset DOI and parameter value https://ws.pangaea.de/dds-fdp/ | | Web Service | |
| Aggregated data from seven stations of the Tara Oceans expedition: https://ws.pangaea.de/dds-fdp/rest/panquery?datasetDOI=doi.pangaea.de/10.1594/PANGAEA.853809&filterParameterValue=Station,TARA_036,TARA_068,TARA_098,TARA_128,TARA_152,TARA_175,TARA_210 | PANGAEA data download service - filter by dataset DOI and parameter value, https://ws.pangaea.de/dds-fdp/ | Web services | text/plain |
| The MAREDAT global database of high performance liquid chromatography marine pigment measurements: data set https://doi.pangaea.de/10.1594/PANGAEA.793246 related to a data article http://www.earth-syst-sci-data.net/5/109/2013/ | (Matlab, https://www.mathworks.com/products/matlab.html) (Excel) (R, https://cran.r-project.org/) | Manual aggregation of data sets | application/zip |
| The MAREDAT global database of high performance liquid chromatography marine pigment measurements: data set https://doi.pangaea.de/10.1594/PANGAEA.774851 related to a data article http://www.earth-syst-sci-data.net/5/109/2013/ | (Matlab, https://www.mathworks.com/products/matlab.html) (Excel) (R, https://cran.r-project.org/) | Manual aggregation of data sets | application/zip |
| The MAREDAT global database of high performance liquid chromatography marine pigment measurements: data set https://doi.pangaea.de/10.1594/PANGAEA.818214 related to a data article http://www.earth-syst-sci-data.net/5/109/2013/ | (Matlab, https://www.mathworks.com/products/matlab.html) (Excel) (R, https://cran.r-project.org/) | Manual aggregation of data sets | application/zip |
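The Tara Oceans example in Table 13 is a query URL made of a dataset DOI and a filter on a parameter. The Python sketch below only reassembles that URL from its parts and sends no request. It assumes that the line breaks inside the URL in the table are wrapping only, and that `filterParameterValue` holds the parameter name followed by the accepted values; both points are inferred from the example and not from the service documentation.

```python
from urllib.parse import urlencode

# Endpoint and parameter names as they appear in the Table 13 example.
BASE_URL = "https://ws.pangaea.de/dds-fdp/rest/panquery"
stations = ["TARA_036", "TARA_068", "TARA_098", "TARA_128",
            "TARA_152", "TARA_175", "TARA_210"]

params = {
    "datasetDOI": "doi.pangaea.de/10.1594/PANGAEA.853809",
    # Assumed layout: parameter name first, then the values to keep.
    "filterParameterValue": ",".join(["Station"] + stations),
}

# Keep "/" and "," unescaped so the result matches the URL in the table.
url = BASE_URL + "?" + urlencode(params, safe="/,")
print(url)
```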
4.2 Digital Humanities & Cultural Heritage (PIN)
The PARTHENOS European research infrastructure (H2020 project led by PIN) is today realizing a “meta” e-infrastructure for the Digital Humanities and Cultural Heritage as an ecosystem of the e-infrastructures of its sub-communities (e.g. CLARIN, Ariadne). Among its objectives, this e-infrastructure will build an “information cloud” for the community by aggregating content from heterogeneous content providers, whose content will be exposed via OAI-PMH according to the OpenAIRE-Connect interoperability guidelines defined in WP3.
Tables 14 and 15 provide examples of datasets and software, respectively, relevant for the Parthenos community of researchers. Table 16 provides examples for Other Research Products.
Table 14 Examples of datasets from the Digital Humanities and Cultural heritage community (PIN/Parthenos).
| Example | Type | Format |
|---|---|---|
| Library of Unpublished Fieldwork Reports https://doi.org/10.5284/1029537 | Text files | application/PDF |
| GIS-material for the archaeological project: Mamorbrottet 1:19 http://doi.org/10.5878/000741 | GIS | application/ZIP. Metadata: DDI Lifecycle 3.2, DDI Codebook 1.2.2, MARCXML (http://xml.snd.gu.se/data/ddi3_2/internal/se.gu.snd.sims.ddi3.SND2237.xml) |
| Öcsöd, Atrácsi csatorna partja http://archeodatabase.hnm.hu/en/node/1973 | Collection - Archaeological geophysical measurement report | Images + metadata |
| Pompeian Plasters Database http://dx.doi.org/10.5878/002567 | Database | application/zip. Metadata: DDI Lifecycle 3.2, DDI Codebook 1.2.2, MARCXML (http://xml.snd.gu.se/data/ddi3_2/internal/se.gu.snd.sims.ddi3.SND0920.xml) |
| Snake head gourd (3D model) http://public.cyi.ac.cy/starcRepo/details/show/60b358392bad6d00878d36f21dcac9c5 | 3D Model(s) | 3D-PDF |
Table 15 Examples of software from the Digital Humanities and Cultural heritage community (PIN/Parthenos)
| Example | Tool | Type |
|---|---|---|
| Vocabulary matching tool https://github.com/cbinding/VocabularyMatchingTool | Mapping tool | Web Application |
Table 16 Examples of other research products from the Digital Humanities and Cultural heritage community (PIN/Parthenos).
| Example | Tool | Type | Format |
|---|---|---|---|
| Snake head gourd (3D model) http://public.cyi.ac.cy/starcRepo/details/show/60b358392bad6d00878d36f21dcac9c5 | Acrobat X Pro | Model | 3D-PDF |
4.3 Neuroinformatics (CNRS)
CNRS will engage researchers in the Neuroinformatics area in depositing artefacts in Zenodo (minting a DOI), claiming artefacts and relationships, and creating packages of artefacts according to their needs. Specifically, they will deal with the following kinds of artefacts: data, tools, methods, workflows, and experiments (which are packages representing the execution of workflows, i.e. sequences of tools applied to data). CNRS will also extend content providers of “tool” and “workflow” artefacts for publishing (push) “tools” and artefact packages in OpenAIRE via OpenAIRE-Publishing APIs.
Tables 17, 18 and 19 show examples of datasets, software and other research products, respectively.
Table 17 Examples of datasets from the Neuroinformatics community (CNRS)
| Example | Type | Format |
|---|---|---|
| Miccai Challenge Dataset | Challenge Dataset | Folder (Images + metadata) |
| Output dataset | Results | Folder (Images + metadata) |
Table 18 Examples of software from the Neuroinformatics community (CNRS)
Example
Tool
Type
Format
Table 19 Examples of other research products from the Neuroinformatics community (CNRS)
Example
Tool
Type
Format
Image processing pipelines
Virtual Imaging Platform
(https://vip.creatis.insa-lyon.fr)
Shanoir
(http://www.shanoir.org)
Workflow
Pipeline
VIP pipeline template
Image processing software
Software
Gwendia
Taverna
JSON (https://raw.githubusercontent.com/boutiques/boutiques/master/schema/descriptor.schema.json)
Executables
Docker containers
An image processing pipeline with the required inputs (images and processing parameters)
Virtual Imaging Platform
(https://vip.creatis.insa-lyon.fr)
VIP pipeline
Challenge -Dataset
Results
application/json
A set of “source medical images” with raw data and manual annotations from experts that define the ground truth of what an image processing solution should retrieve
Shanoir
(http://www.shanoir.org)
Images and associated meta data
Dicom, Nifti, BIDS, XML
A set of images that the image processing softwares used by VIP tool are provided and are attached to the same location as the “source images”
Shanoir
(http://www.shanoir.org)
Images and associated meta data
Nifti, BIDS, XML
4.4 Fisheries and Aquaculture Management (IRD)
IRD will involve researchers of the community to deposit artefacts constituting the elements of a “dynamic publications”, e.g. latex, code, data, and packages wrapping these together and enabling their reproducibility to generate the dynamic publication (possibly linking to applications in BlueBridge Virtual research Environments). Researchers will also claim artefacts and link them to pertain to the community and to funding projects. Secondly, they will integrate repositories of data and methods (scripts) to export metadata information according to the OpenAIRE-Connect guidelines as defined in WP3.
Table 20 provides examples of datasets relevant for the community, together with their types and formats.
Table 20 Examples of datasets from the Fisheries and Aquaculture Management community (IRD)
| Example | Type | Format |
|---|---|---|
| Sea Surface temperature http://mdst-macroes.ird.fr:8080/thredds/catalog/AWA/satellite/NRT/RTOSTIA/catalog.html?dataset=AwaCatalog/satellite/NRT/RTOSTIA/OSTIA-UKMO-L4HRfnd-GLOB-v01-fv02-OSTIA_all_datasets.ncml | Satellite images | NetCDF, HDF, GRIB, Geotiff (gdal formats) and Web Services (OPeNDAP, OGC Web Services) |
| Fisheries Catch data http://mdst-macroes.ird.fr:8080/thredds/catalog/macroes/world/fisheries/catches/balbaya/5d/1m/catalog.html?dataset=testDatasetScan/world/fisheries/catches/balbaya/5d/1m/test_balbaya_5dims_1deg_annual_catches.nc (or http://db-tuna.d4science.org:5432/) or https://irdmail.fr/Redirect/735481B2/iotc.org/data/datasets | Fisheries datasets from RFMOs (Regional Fisheries Bodies) | SQL, CSV, (ogr formats) |
| Vessel or animal tracking | Trajectories | Shapefiles, NetCDF, CSV, XML |
| Sea Surface current (OSCAR dataset) http://mdst-macroes.ird.fr:8080/thredds/catalog/macroes/world/current/global/180/catalog.html?dataset=testDatasetScan/world/current/global/180/oscar-third_vel1992_2016.ncml | Raster data / satellite images / products | NetCDF / NCML and Web Services (OPeNDAP, OGC Web Services) |
Table 21 provides examples of software relevant for the community, together with their types, formats and tools.
Table 21 Examples of software for the Fisheries and Aquaculture Management community (IRD)
| Example | Tool | Type | Format |
|---|---|---|---|
| Sea surface temperature front detection | Thredds, R, IDL, Matlab | Processing | Raster data formats (gdal): NetCDF, processed with IDL or R codes |
| Plots (http://mdst-macroes.ird.fr/tmp/SpeciesByOcean/I1_Katsuwonus_pelamis_Rickshaw.html) | R, IDL, Matlab, Python | Visualization | Png, jpg, json |
| A Lagrangian tool for simulating ichthyoplankton dynamics http://www.ichthyop.org/ | Java model (see https://support.d4science.org/projects/bluebridge/wiki/Ichthyop ) | Model | Java / jar (example) for the model which is driven by datasets like OSCAR |
| Transforming Fisheries data into biodiversity data, http://www.gbif.org/dataset/0e3d6f05-a287-4ffd-852d-4e17db22d810 | GBIF / IPT (Java) | Format transformation / mapping | Postgis databases become Darwin Core occurrences (data format) |
Table 22 provides examples of other research products used by the community. The Dryad data package is the most relevant example of a research package used in the community.
Table 22 Other research products for the Fisheries and Aquaculture Management community (IRD)
| Example | Tool | Type | Format |
|---|---|---|---|
| Dryad package http://datadryad.org/resource/doi:10.5061/dryad.f6p6n | R scripts | Dryad data package | R |
| Tuna Atlas workflow https://github.com/juldebar/IRDTunaAtlas/wiki | R engine, shell | ETL (Extract, Transform, Load) | R codes on a Github repository, shell |
| Aggregation of High resolution data / climatology | SQL (Postgres / Postgis), R, Python | Rasterizations | Raster or vector data formats (gdal) processed with R / Gdal / Postgis spatial functions |
| BlueBridge VRE https://i-marine.d4science.org/web/rstudiolab | RStudio within a Virtual Research Environment | Online Compilation Environment | Website |
4.5 Environment and Economy (ICRE8/SDSN Greece)¶
ICRE8 will engage researchers of the community towards Open Science practices, such as publishing datasets and methods that today are mainly described as part of the literature. In this process they will benefit from OpenAIRE already collecting from REPEC, the main reference for community literature worldwide, and from a LoS provided by the United Nations, which commits UN researchers in the domain to adopting Open Science practices.
Tables 23, 24 and 25 present some examples of datasets, software and other research products that the community is willing to publish in OpenAIRE-Connect.
Table 23 Examples of datasets from the Environment and Economy community (ICRE8/SDSN Greece)
| Example | Type | Format |
|---|---|---|
| Stated Preference data (e.g. https://www.fs.fed.us/nonmarketprimerdata/chapter05/dichotomous_choice_data.xls which is described here: https://www.fs.fed.us/nonmarketprimerdata/contingent_valuation.html ) | Microdata; cross-sectional; responses to surveys | text/plain, text/csv, application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, .dat, .dta, .sav, .lpg, .mtp |
| Resource productivity per country ( http://ec.europa.eu/eurostat/statistics-explained/images/5/51/Resource_productivity_June2016new.xlsx described here: http://ec.europa.eu/eurostat/statistics-explained/index.php/Resource_productivity_statistics ); Real GDP per capita, growth rate and totals | Macrodata; panel; time-series | |
| Air emissions by industry (e.g. http://ec.europa.eu/eurostat/statistics-explained/images/f/fc/Air_pollution_by_industries_and_households_2016.xls ) | Macrodata; panel | |
Table 24 Examples of software from the Environment and Economy community (ICRE8/SDSN Greece)
| Example | Tool | Type |
|---|---|---|
| Plots | (R, https://www.r-project.org/) | script |
For the vocabulary of formats of methods, ICRE8 also suggests the following terms, in addition to those in Table 24: Stata, Matlab, Python, zip.
Table 25 Examples of other research products from the Environment and Economy community (ICRE8/SDSN Greece)
| Example | Tool | Type | Format |
|---|---|---|---|
| A compressed folder containing a research paper, dataset, code scripts and explanatory text | (7zip, http://www.7-zip.org/) | research package | .zip, .gz, .7z |
| Data processing | (SPSS, https://www.ibm.com/analytics/us/en/technology/spss/) | Data Processing; Syntax files | SPSS |
| Fixed and Random Effects Panel Data Analysis | (stata, http://www.stata.com/) | Data analysis; Syntax files | .do, .ado |
5. Vocabularies¶
The valuable input from research communities allows us to define a first version of the controlled vocabularies to be integrated into the OpenAIRE-Connect model. Vocabularies should not be considered as static resources: the terms in a vocabulary can change according to new expectations and requirements of the communities. Dedicated tools and technologies already in use in the production environment of the OpenAIRE infrastructure will be adopted to ensure that vocabularies are consistently applied.
Each of the following sections describes, for each data model entity, which field is controlled by which vocabulary and presents the terms that need to be introduced according to the input from research communities reported in Section 4.
5.1 Result¶
5.1.1 BestAccessRights¶
Current OpenAIRE vocabulary
dnet:access_modes
Proposed Extension: NONE
5.1.2 Date type¶
Current OpenAIRE vocabulary
dnet:dataCite_date
Proposed Extension: NONE
5.1.3 PID schemes¶
Current OpenAIRE vocabulary
dnet:pid_types
Proposed Extension: NONE
5.1.4 Language¶
Current OpenAIRE vocabulary: ISO 639-3
dnet:languages
Proposed Extension: NONE
5.1.5 Subject schemes¶
Current OpenAIRE vocabulary
dnet:subject_classification_typologies
Proposed Extension: NONE
5.1.6 Country¶
Current OpenAIRE vocabulary: ISO 3166-1 alpha-2
dnet:countries
Proposed Extension: NONE
5.1.7 Format (Instance)¶
Current OpenAIRE vocabulary
None. The OpenAIRE guidelines suggest using file extensions or MIME types:
https://guidelines.openaire.eu/en/latest/literature/field_format.html
https://guidelines.openaire.eu/en/latest/data/field_format.html
Proposed Extension: For methods and packages, OpenAIRE-Connect needs more than file extensions and MIME types. For details, please refer to the relevant sub-sections.
5.1.8 Access Rights (Instance)¶
Current OpenAIRE vocabulary
dnet:access_modes
Proposed Extension: NONE
5.1.9 License (Instance)¶
Current OpenAIRE vocabulary
To be defined. Whether different vocabularies are needed for each result typology (literature, dataset, method, research package) is under investigation.
Proposed Extension: To be defined. Possible sources:
http://www.rioxx.net/profiles/v2-0-final/
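As an illustration only, the field-to-vocabulary assignments listed in Section 5.1 can be written down as a lookup table and used to check a record. The following Python sketch is not part of the OpenAIRE infrastructure: the field keys and the `check_record` helper are hypothetical names introduced here, and the terms of each vocabulary must be supplied by the caller. Only the vocabulary names come from this section.

```python
from typing import Mapping, Optional

# Field -> controlling vocabulary, as listed in Section 5.1.
# The field keys are hypothetical; None means that no controlled
# vocabulary is defined (yet) for the field.
RESULT_FIELD_VOCABULARIES: dict[str, Optional[str]] = {
    "bestaccessrights": "dnet:access_modes",
    "datetype": "dnet:dataCite_date",
    "pidscheme": "dnet:pid_types",
    "language": "dnet:languages",
    "subjectscheme": "dnet:subject_classification_typologies",
    "country": "dnet:countries",
    "instance.format": None,
    "instance.accessrights": "dnet:access_modes",
    "instance.license": None,
}


def check_record(record: Mapping[str, str],
                 vocabularies: Mapping[str, set[str]]) -> list[str]:
    """Return one message per field whose value is not in its vocabulary."""
    problems = []
    for field_name, value in record.items():
        vocabulary_name = RESULT_FIELD_VOCABULARIES.get(field_name)
        if vocabulary_name is None:
            continue  # uncontrolled or unknown field: nothing to check
        allowed = vocabularies.get(vocabulary_name, set())
        if value not in allowed:
            problems.append(f"{field_name}: '{value}' not in {vocabulary_name}")
    return problems
```

For example, `check_record({"country": "XX"}, {"dnet:countries": {"IT", "FR"}})` returns a single message for the `country` field, while fields without a controlled vocabulary, such as the instance format, are skipped.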
5.2 Literature¶
5.2.1 Type (Instance)¶
Current OpenAIRE vocabulary
dnet:publication_resource
Proposed Extension: NONE
5.3 Dataset¶
5.3.1 Type (Instance)¶
Current OpenAIRE vocabulary
dnet:datacite_resource
Proposed Extension:
- Nucleotide sequence (Marine Environmental Science)
- Tabular data set (Marine Environmental Science)
- Image data set (Marine Environmental Science)
- Database (Marine Environmental Science, Digital Humanities & Cultural Heritage)
- GIS (Digital Humanities & Cultural Heritage)
- Collection - Archaeological geophysical measurement report (Digital Humanities & Cultural Heritage)
- 3D Model (Digital Humanities & Cultural Heritage)
- Challenge Dataset (Neuroinformatics)
- Satellite Images (Fisheries and Aquaculture management)
- Fisheries datasets from RFMOs (Regional Fisheries Bodies) (Fisheries and Aquaculture management)
- Trajectories (Fisheries and Aquaculture management)
- Raster data (Fisheries and Aquaculture management)
- Raw data (Environment and Economy)
- Processed data (Environment and Economy)
- Microeconomics (Environment and Economy)
- Macroeconomics (Environment and Economy)
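The proposed extension can be pictured as a list of terms, each tagged with the communities that asked for it. The sketch below is illustrative only: `ProposedTerm` and `terms_by_community` are hypothetical names, and only an excerpt of the list above is encoded. It shows how a term proposed by more than one community, such as Database, can be recorded once.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedTerm:
    label: str
    communities: tuple[str, ...]


# Excerpt of the terms proposed for dnet:datacite_resource in Section 5.3.1.
PROPOSED_DATASET_TYPES = [
    ProposedTerm("Nucleotide sequence", ("Marine Environmental Science",)),
    ProposedTerm("Database", ("Marine Environmental Science",
                              "Digital Humanities & Cultural Heritage")),
    ProposedTerm("3D Model", ("Digital Humanities & Cultural Heritage",)),
    ProposedTerm("Challenge Dataset", ("Neuroinformatics",)),
    ProposedTerm("Trajectories", ("Fisheries and Aquaculture management",)),
    ProposedTerm("Raw data", ("Environment and Economy",)),
]


def terms_by_community(terms):
    """Group term labels by the community that proposed them."""
    grouped = defaultdict(list)
    for term in terms:
        for community in term.communities:
            grouped[community].append(term.label)
    return dict(grouped)
```

Keeping the proposing communities next to each term makes it easy to see which terms are shared across disciplines when the vocabulary is revised.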
5.4 Software¶
5.4.1 Type (Instance)¶
Proposed vocabulary terms:
- Statistical procedure (Marine Environmental Science)
- Web application (Digital Humanities & Cultural Heritage)
- Visualization (Fisheries and Aquaculture management)
- Processing (Fisheries and Aquaculture management)
- Model (Fisheries and Aquaculture management)
- Format transformation / mapping (Fisheries and Aquaculture management)
- Data Processing (Environment and Economy)
- Syntax (Environment and Economy)
- Data analysis (Environment and Economy)
- Data Visualization (Environment and Economy)
5.4.2 Format (Instance)¶
Current OpenAIRE vocabulary
None. The OpenAIRE guidelines suggest using file extensions or MIME types:
https://guidelines.openaire.eu/en/latest/literature/field_format.html
https://guidelines.openaire.eu/en/latest/data/field_format.html
In the case of Software, the format field can be used to map the programming language, as described in the OpenAIRE guidelines for software repositories (TODO: add public link)
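A minimal sketch of that idea: if the format field of a software product carries the programming language, a repository could derive it from the file extension of the main source file. The extension-to-language table below is a hypothetical example assembled from languages that the communities mention in Section 4; it is not an OpenAIRE vocabulary, and `software_format` is a name introduced here for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping; extend or replace it according to the actual guidelines.
EXTENSION_TO_LANGUAGE = {
    ".r": "R",
    ".py": "Python",
    ".java": "Java",
    ".m": "Matlab",
    ".do": "Stata",
}


def software_format(filename: str) -> Optional[str]:
    """Return the programming language to store in the format field, if known."""
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)
```

Files whose extension is not in the table yield `None`, so the caller can fall back to a plain file extension or MIME type.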
5.5 Other Research Product¶
5.5.1 Type (Instance)¶
Proposed vocabulary terms:
- Laboratory procedure (Marine Environmental Science)
- Software Pipeline (Marine Environmental Science)
- Web Service (Marine Environmental Science)
- Manual aggregation of data sets (Marine Environmental Science)
- Model (Digital Humanities & Cultural Heritage)
- Workflow Pipeline (Neuroinformatics)
- VIP pipeline template (Neuroinformatics)
- VIP pipeline Challenge -Dataset Results (Neuroinformatics)
- Image processing pipeline (Neuroinformatics)
- Images and associated meta data (Neuroinformatics)
- Dryad data package (Fisheries and Aquaculture management)
- ETL (Extract, Transform, Load) (Fisheries and Aquaculture management)
- Rasterizations (Fisheries and Aquaculture management)
- Online Compilation Environment (Fisheries and Aquaculture management)
- Research package (Environment and Economy)
- Data Processing; Syntax files (Environment and Economy)
5.5.2 Format (Instance)¶
Current OpenAIRE vocabulary
None. The OpenAIRE guidelines suggest using file extensions or MIME types:
https://guidelines.openaire.eu/en/latest/literature/field_format.html
https://guidelines.openaire.eu/en/latest/data/field_format.html
Proposed Extension: Communities agree on the use of file extensions and MIME types. The examples show that no specific package formats (e.g. Research Object bundle, RMap) are currently in use by the communities. Consequently, packages are treated as folders, possibly compressed, containing the files that researchers want to share as a whole.
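To make the convention concrete, the following Python sketch describes the format of each file in a compressed package by its MIME type when the standard library recognises the file name, and by its file extension otherwise. It is an illustration under stated assumptions, not an OpenAIRE component: the function names are hypothetical, and a ZIP archive is assumed as the package container.

```python
import io
import mimetypes
import zipfile
from pathlib import PurePosixPath

# Private lookup table built from the standard library's default MIME types.
_MIME_TYPES = mimetypes.MimeTypes()


def describe_format(filename: str) -> str:
    """Return the MIME type if it can be guessed, else the file extension."""
    mime_type, _encoding = _MIME_TYPES.guess_type(filename)
    if mime_type is not None:
        return mime_type
    return PurePosixPath(filename).suffix.lower()


def package_formats(zip_bytes: bytes) -> dict[str, str]:
    """Map every file in a ZIP package to its format description."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as package:
        return {
            name: describe_format(name)
            for name in package.namelist()
            if not name.endswith("/")  # skip directory entries
        }
```

Formats that have no registered MIME type in the lookup table, which is likely for discipline-specific files such as the statistical formats listed in Section 4.5, are reported by their extension.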
References¶
Katherine Wolstencroft, Olga Krebs, Jacky L. Snoep, Natalie J. Stanford, Finn Bacall, Martin Golebiewski, Rostyk Kuzyakiv, Quyen Nguyen, Stuart Owen, Stian Soiland-Reyes, Jakub Straszewski, David D. van Niekerk, Alan R. Williams, Lars Malmström, Bernd Rinn, Wolfgang Müller, Carole Goble; FAIRDOMHub: a repository and collaboration environment for sharing systems biology research. Nucleic Acids Res 2017; 45 (D1): D404-D407. doi:10.1093/nar/gkw1032
Hanson, K. L., DiLauro, T., and Donoghue, M. (2015). The RMap Project: Capturing and Preserving Associations amongst Multi-Part Distributed Publications. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 281-282). ACM. doi:10.1145/2756406.2756952
The OpenAIRE guidelines. https://guidelines.openaire.eu
Appendix¶