EOSC-Nordic FAIRification webinar on PIDs

EOSC-Nordic

On November 26, 2020, data repository representatives from across the Nordics and Baltics participated in a FAIRification webinar organised by the EOSC-Nordic project. The webinar was the first in a series of four and focused on Persistent Identifiers (PIDs). The idea of having PIDs as the first topic originates from the exhaustive FAIR maturity evaluation made by the project in early 2020. One of the survey conclusions was that as many as 25 % of the data repositories included in our sample did not provide globally unique, persistent, and resolvable identifiers (GUIDs) pointing to their datasets. In order to make research data reusable, the data needs to be findable in the first place, by both humans and machines. The first FAIR principle, F1, lays the groundwork for FAIRness of data by indicating the importance of assigning GUIDs to the metadata and data.

The webinar gave a diverse overview of PIDs, where the current state of play regarding FAIR maturity levels in Nordic and Baltic data repositories set the scene for the discussion. We were introduced to the very meaning of PIDs, the difference it makes to use PIDs in science, the practicalities involved in implementing DOIs in a repository, and lastly, PIDs in the context of different identifier registration agencies.

What led us up to this point?

Bert Meerman from the GO FAIR Foundation hosted the webinar. Bert pointed out that a common misconception is that data cannot be FAIR unless it is openly shared and provided free of charge. That is not the case: a dataset offered, for example, on the commercial market with a fee as a condition for access can be perfectly FAIR. The FAIR principles are for the most part concerned with metadata management and are meant for both human and machine interactions.

Bert reflected on the FAIR maturity evaluation executed earlier this year and put some of the results in the context of this webinar’s theme. The evaluation was performed automatically by FAIR maturity evaluation software, which takes the GUIDs from the evaluated data repositories as input. Starting from this input, the software explores the machine-actionable metadata provided on the landing page or other location the GUID points to. In other words, for this to be a successful exercise, there needs to be a GUID in place that leads to the metadata. Due to this requirement, one-fourth of the data repositories in our sample could not be evaluated at all. This made us realise that we need to start the FAIRification process of data repositories at this end and take it from there.

If the software can locate a GUID, it can evaluate against the FAIR principles and calculate a so-called FAIR maturity score for each of the letters of FAIR. In case a repository fails to score on a particular letter, the project can provide advice on how to improve this.
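To give a feel for what "machine-actionable metadata on the landing page" can mean in practice, here is a minimal sketch that looks for embedded JSON-LD (schema.org) blocks in a page. It is not the evaluation software used by the project; the class name, function name, and sample page are made up for illustration, and JSON-LD is only one of several ways a repository can expose such metadata.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.records.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON is skipped: a machine cannot use it


def extract_jsonld(html):
    """Return every JSON-LD record embedded in an HTML landing page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical landing page, invented for this example.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "identifier": "https://doi.org/10.1234/example", "name": "Example dataset"}
</script>
</head><body><h1>Example dataset</h1></body></html>
"""

for record in extract_jsonld(SAMPLE_PAGE):
    print(record["@type"], record["identifier"])
```

A page without such a block returns an empty list, which loosely mirrors the situation where an automated evaluator finds a landing page but nothing it can read.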

The state of play in the EOSC-Nordic project

Andreas Jaunsen, the FAIR data work package leader from NordForsk, presented an overview of the project and its progress on the FAIR-related work. Overall, the project aims to integrate datasets, software, publications, and other research outputs for researchers in a consistent and orderly way. The way to do that is to go FAIR. FAIR’s overall objective is to make research data more reusable and science more transparent, efficient, and trustworthy.

Andreas gave context to the PID topic by describing the FAIR maturity evaluation process and the most relevant findings. The sample consisted of around 100 data repositories, and from each repository ten (10) datasets were selected manually. Open-source software developed by Mark Wilkinson was used, with minor tweaks, to run the evaluation, which checks 22 entries of the FAIR principles. Both JSON and Google scripts were used to display the results. Out of the sample of 98 data repositories, only 74 passed the minimum requirement of having a GUID in place and could thus be evaluated.

Unsurprisingly, most of the evaluated repositories did not come out very FAIR with regard to machine-actionable metadata. 30 % have no support for machine-actionable metadata whatsoever, while a few repositories support some degree of machine-actionable metadata or have some metadata standards in place. A handful of repositories scored more than 50 %. The average score of the 74 evaluated repositories was 17 %. There was, however, a noticeably higher score among the repositories run on established platforms (Dataverse, Figshare, etc.), 30 %, and among certified repositories, 24 %. The project will offer support for repositories interested in achieving CoreTrustSeal (CTS) certification or completing self-assessments against the CTS requirements. The project will also host additional events like this one, with the primary ambition of helping repositories achieve higher levels of FAIRness, and will perform regular FAIR maturity evaluations throughout its lifespan to monitor changes in FAIRness levels.

Relevance and importance of PIDs in science

Helena Cousijn from DataCite gave a presentation on the importance of PIDs. She started by explaining what a PID is in the first place. In short, a PID is a globally unique string. The ‘persistent’ part is provided by an organization that keeps the identifier alive and accurate, so that the PID continues to resolve over time. PIDs also help with disambiguation, for example when multiple people share the same name.

Helena explained that DataCite is a non-profit organization with 238 members in 43 countries, serving over 2,100 repositories and holding over 20 million registered DOIs. Through the DOIs, DataCite helps make research outputs discoverable, makes it easy to follow best practices, and enables tracking and reporting of science. DOIs can be assigned to various research outputs, for example, research datasets, associated workflows, methods, images, and software. DOIs can also be assigned to grey literature, for example, reports, theses, dissertations, conference papers, and technical standards. DataCite has an automated interface for easy and quick DOI registration.

The data becomes findable through metadata discoverability. All metadata submitted to DataCite are available under a CC0 license, can be harvested in different ways, and are available through various platforms, such as DataCite Commons, OpenAIRE, and Google Dataset Search. The accessibility and interoperability boxes are checked with DataCite’s metadata schema. Additional information describing relations between the datasets and other entities can be added in the DOI management software Fabrica. For example, there is an option to associate datasets with an organisation via the Research Organization Registry (ROR) and to connect data to publications, other contributors, funding, etc., via the PID Graph developed by the FREYA project. Data reuse is supported through DataCite Commons, which allows users to access data metrics and import them into the repositories’ pages with the help of widgets provided by DataCite.
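As a rough illustration of harvesting this openly available metadata, the sketch below prepares a request against DataCite's public REST API for a single DOI record. The helper name and the DOI are hypothetical, and the request is only built, not sent, so the example runs without network access.

```python
import urllib.parse
import urllib.request

# Public DataCite REST API endpoint for individual DOI records.
DATACITE_API = "https://api.datacite.org/dois/"


def datacite_request(doi):
    """Build (but do not send) a GET request for one DOI's metadata record."""
    url = DATACITE_API + urllib.parse.quote(doi, safe="/")
    # The API returns JSON:API documents.
    return urllib.request.Request(url, headers={"Accept": "application/vnd.api+json"})


req = datacite_request("10.1234/example-dataset")  # hypothetical DOI
print(req.full_url)

# To actually fetch the record (requires network access and a real DOI):
#   import json
#   with urllib.request.urlopen(req) as response:
#       record = json.load(response)
```

Building the request separately from sending it keeps the URL construction easy to check before any live call is made.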

Considerations around PID implementation

Örnólfur Thorlacius from DATICE in Iceland provided experience-based information on implementing PIDs in their DATICE data repository. DATICE is a data service and archive for Icelandic research data, established in 2018, located at the Social Science Research Institute within the University of Iceland. Their main goals are to make their data holdings FAIR and ensure high-quality research data by following international standards and examples of best practices on open access to research data. They are the official service provider for Iceland in the Consortium of European Social Science Data Archives (CESSDA).

DATICE chose to implement DOIs offered through the registration agency GESIS Leibniz-Institut in Germany (DARA). The responsibility of DARA is to publish and store associated metadata and connect the DOIs to ensure that they resolve to a landing page specified by DATICE. DATICE’s responsibility is to ensure they provide relevant and accurate metadata and make sure that the URLs lead to the correct landing pages. DATICE was able to decide on the structure of the DOI names by defining the suffix.
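To make the prefix and suffix distinction concrete: a DOI name consists of a prefix, which starts with "10." and identifies the registrant, followed by a slash and a suffix chosen by the repository. The helper below is a minimal sketch of that split; the function name and the example DOI are invented for illustration and are not DATICE's actual identifiers.

```python
def split_doi(doi):
    """Split a DOI name into (prefix, suffix) at the first slash."""
    prefix, separator, suffix = doi.strip().partition("/")
    if not separator or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI name: {doi!r}")
    return prefix, suffix


# Hypothetical DOI; the suffix may itself contain further slashes.
prefix, suffix = split_doi("10.1234/survey/2020-001")
print(prefix)  # registrant part, assigned via the registration agency
print(suffix)  # part whose structure the repository defines
```

Because only the first slash separates the two parts, a repository is free to use slashes, dates, or version markers inside its own suffix scheme.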

DATICE is currently implementing the Dataverse platform, through which one can register DOIs easily with partially automated processes. Easy DOI registration is an important step in becoming even more FAIR, as the platform provides means of accessing information in a machine-actionable manner.

Örnólfur provided some advice on things to consider before implementing DOIs. He specifically stressed the importance of deciding what metadata to include, which metadata standard to follow, for example, the Data Documentation Initiative (DDI), and whether any additional metadata are needed. He also pointed out that one should make the DOI registration workflow as efficient as possible and streamline the ways of collecting and organising the metadata.

PIDs in context

Mike Nason from the University of New Brunswick Libraries in Canada broadened the PID discussion by presenting use cases of Persistent Identifiers in various contexts, in relation to organisational needs and open infrastructures. He pointed out that PIDs are crucial for scholarly publishing and throughout the research process. It is also worth noting that Current Research Information Systems (CRIS), for example, rely heavily on the information they can pull out of PID systems such as ORCID, CrossRef, and DataCite. PIDs are unique IDs assigned to many kinds of things, for example, institutions, datasets, people, monographs, and organisations. Consequently, PIDs enable the tracking of research and resolve ambiguity caused by inconsistent metadata by pointing to the correct contributor, institution, and place.

In all cases, there is a registration agency tied to the PIDs that is responsible for collecting and distributing metadata publicly. These agencies all have metadata schemas of their own and make the metadata accessible through an Application Programming Interface (API). The APIs are publicly accessible and freely available for pulling data out. The metadata schema of DataCite is generally known to be a good fit for, for example, datasets, software, collections, and audio/visuals, and that of CrossRef for, for example, publications, articles, proceedings, and reports.

It is also worth noting that merely minting a DOI is not enough, as it does not become active or resolve unless it is registered. It also has to be maintained by someone for it to keep resolving to content in the long term. Additional good advice is that PIDs are not meant to be human-readable, which means that DOIs should not be turned into nice-looking, custom-made URLs.
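In practice, the advice against custom-made URLs can be followed by always presenting a DOI in its resolver form, https://doi.org/ followed by the DOI name. The sketch below normalises a few commonly seen spellings to that form. The function name and example DOI are hypothetical, and the validity check is deliberately loose rather than a full implementation of the DOI syntax.

```python
import re

# Leading decorations that are often written in front of a DOI name.
_DECORATION = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
# Loose shape check: "10.", a numeric prefix, a slash, a non-empty suffix.
_DOI_NAME = re.compile(r"^10\.\d{4,9}(?:\.\d+)*/\S+$")


def doi_url(value):
    """Return the https://doi.org/ resolver URL for a DOI written in any common form."""
    doi = _DECORATION.sub("", value.strip())
    if not _DOI_NAME.match(doi):
        raise ValueError(f"not a DOI: {value!r}")
    return "https://doi.org/" + doi


# Hypothetical DOI written three different ways; all print the same URL.
for spelling in ["10.1234/abc-123", "doi:10.1234/abc-123", "http://dx.doi.org/10.1234/abc-123"]:
    print(doi_url(spelling))
```

A custom repository URL is rejected by the check, which is the intended behaviour: the landing page address may change, while the resolver URL is the one that is meant to persist.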

Further support measures and webinars on FAIR

The next FAIRification webinar in the series will be organised on February 3, 2021. More information on this will be available closer to the event.

We are more than happy to receive feedback and questions from the research community, so please reach out to us by contacting either FAIRification task leader Bert Meerman at b.meerman(at)gofairfoundation.org, or FAIR work package leader Andreas Jaunsen at andreas.jaunsen(at)nordforsk.org.

The webinar was recorded; you can view the recordings (part 1 and part 2) on our YouTube channel.

Author Josefine Nordling, Open Science Specialist at CSC and WP4 member.