FAIRification STEP 2 on DATA / METADATA webinar

EOSC-Nordic

On February 3, 2021, data repository representatives from across the Nordics and Baltics participated in a FAIRification webinar organised by the EOSC-Nordic project. The webinar was the second in a series and focused on the machine-actionable split between data and metadata. The idea of concentrating on metadata as a second topic originated from the project team's extensive FAIR maturity evaluation in April 2020. One of the survey conclusions was that a large percentage of the data repositories in the sample did not pass the test for FAIR Principle F3, which states that metadata must clearly and explicitly include the identifier of the data it describes.

Principle F3 requires repositories to provide a clear, machine-actionable split between the data and the metadata. Ideally, the metadata points to the data, and the data, in return, points to the metadata. It is crucial that this split is machine-actionable so that a machine agent can find, interpret and process the data based on the metadata found, for instance, on the landing page of the repository. In other words, if we want to make research data reusable, the data needs to be enriched with metadata that can be found, interpreted, and processed by machine agents and not only by the human eye. Principle F3 thus sets a guideline for the FAIRness of data by stressing the relevance and importance of a clear, machine-actionable split between metadata and data. More information on all 15 FAIR Principles is available on the GO FAIR web page.
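To make the principle concrete, here is a minimal sketch of a metadata record that would satisfy F3. It uses schema.org/JSON-LD conventions; the identifiers and URLs are hypothetical, chosen only for illustration.

```python
import json

# Illustrative schema.org/JSON-LD metadata record (hypothetical identifiers):
# it satisfies the F3 idea by explicitly naming the identifier of the data
# it describes, so a machine agent can follow the pointer.
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "identifier": "https://doi.org/10.1234/example-dataset",  # PID of the dataset
    "name": "Example dataset",
    "distribution": {
        "@type": "DataDownload",
        # Explicit, machine-readable pointer to the data itself:
        "contentUrl": "https://repo.example.org/files/data.csv",
        "encodingFormat": "text/csv",
    },
}

def references_data(record: dict) -> bool:
    """Check that the metadata explicitly points to the data (the F3 idea)."""
    dist = record.get("distribution") or {}
    return bool(dist.get("contentUrl"))

print(references_data(metadata))  # True: the metadata identifies its data
```

A record like this can be embedded in a landing page, where a harvester can pick it up without human help.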

The webinar of February 3, 2021, gave a wide-ranging overview of metadata options, standards, and templates and set the scene for the discussion on the current state of play regarding FAIR maturity levels in the Nordic and Baltic repositories. We were introduced to the very meaning of metadata, the concept of FAIR Digital Objects, and the practicalities involved in implementing metadata schemas in and by a repository.

What led us up to this point?

Bert Meerman from the GO FAIR Foundation hosted the webinar. Bert opened by highlighting the 15 FAIR Principles and addressed a few misconceptions around FAIR.

The first and most important misconception is that FAIR was designed only for the human interface; this is not the case. FAIR is also about machine-actionability. Hosting a website with links to several PDF files does not automatically make a repository FAIR. Under the FAIR Principles, a machine agent or algorithm should be able to find, access, interpret and reuse the data and metadata of a repository. FAIR is meant for both human and machine interactions and is all about automated findability and machine-actionability of data and metadata.

Secondly, we often hear that researchers mistakenly think that FAIR data is identical to "open and free" data; this is also not the case. Data can be closed when necessary, or offered under predefined conditions, at a fee or as a subscription, and still be perfectly FAIR.

Bert reminded us of the work done by work package 4 on the FAIR assessment of repositories in the Nordic and Baltic countries. Back in April 2020, the team organised a hackathon event evaluating about 100 repositories based on their machine-findability on the internet. For the evaluations, the team used software that could automatically quantify adherence to the FAIR Principles. Initially, they used the FAIR Maturity Evaluator designed by Mark Wilkinson; later, they also ran assessments with the F-UJI evaluator developed in the EOSC FAIRsFAIR project.

Bert also explained the automated evaluation process: a FAIR maturity evaluator takes an identifier as input and harvests the machine-actionable metadata from the repository's landing page. The tool runs tests against the FAIR Principles, from which a FAIR score can be calculated for each of the letters F, A, I, and R. If a particular test fails, the Nordic project team can provide recommendations on how to improve FAIRness. The evaluation software requires the repositories to be findable through a globally unique identifier (GUID) or, better still, a persistent, resolvable identifier (PID). The input identifier is an absolute mandatory requirement for the evaluation of data repositories.
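The harvesting step described above can be sketched in a few lines. This is not the actual evaluator code, only an illustration of the idea: resolve a PID to a landing page, extract any embedded JSON-LD metadata, and run a simple test against it. The landing page content here is invented for the example.

```python
import json
import re

# Hypothetical landing page a PID might resolve to, with machine-actionable
# metadata embedded alongside the human-readable content.
landing_page_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "identifier": "https://doi.org/10.1234/example"}
</script>
</head><body>Human-readable description here.</body></html>
"""

def harvest_jsonld(html: str):
    """Extract embedded JSON-LD blocks, as metadata harvesters do."""
    pattern = r'<script type="application/ld\+json">(.*?)</script>'
    return [json.loads(m) for m in re.findall(pattern, html, re.DOTALL)]

records = harvest_jsonld(landing_page_html)
# One simple findability-style test: is a globally unique identifier present?
has_guid = any("identifier" in r for r in records)
print(has_guid)  # True
```

A real evaluator such as F-UJI runs many such tests and aggregates them into per-letter scores.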

During this first exercise in April 2020, the project team discovered that they could not properly evaluate some of the repositories due to a lack of a proper PID. Therefore, the team hosted a webinar in November 2020 on being “Findable” through a persistent resolvable ID. During this webinar, named “FAIRification Step 1”, the project team guided repositories to take the necessary steps to comply with this critical requirement.

The exercises done in April 2020 and in November 2020 were a success insofar as over 100 repositories, with an average of ten datasets each, were evaluated. The project team was able to provide several recommendations to repositories that wanted to improve their FAIRness score.

However, the team also noted that a large percentage did not pass the test for Principle F3. This principle recommends that data and metadata be split and published separately with separate (persistent) identifiers. At the very least, the data should be specified explicitly in the metadata, pointing to the location of the data element (e.g., a file) using established semantics. Once this is in place, the evaluator test for Principle F3 will pass. The purpose of this webinar was to inspire repositories to take action on the requirements underlying FAIR Principle F3.

The state of play in the EOSC-Nordic project

Andreas Jaunsen, the FAIR data work package leader from Nordforsk, presented an overview of the project and its progress on the FAIR-related work.

Overall, WP 4 aims to integrate datasets, software, publications, and other research outputs for researchers in a consistent and orderly way. The way to do that is to go FAIR. FAIR's overall objective is to make research data more reusable and science more transparent, efficient, and trustworthy.

Andreas went on to explain the limitations of a repository that concentrates on publishing its data mainly through the human interface and does not give enough attention to the data's machine-actionability. The concept of the FAIR Digital Object plays an important role in permanently and intelligently linking the data and metadata to the related datasets, including reverse referencing.

Another crucial element in the FAIRification process is the growing dependency on data scientists capable of helping researchers and analysts structure and efficiently preserve research data. The role of these data stewards is becoming more important for organisations worldwide; in many countries, universities and research institutes are starting to hire data stewards.

Andreas gave context to the topic by describing the FAIR maturity evaluation process and the most relevant findings. The sample consisted of around 100 data repositories, and from each repository ten datasets were selected manually. Open-source software initially developed by Mark Wilkinson, and the more recent F-UJI tool developed by the FAIRsFAIR project, were used with minor tweaks to run the evaluation, assessing the FAIR Principles via several tests.

Unsurprisingly, most of the evaluated repositories did not come out very FAIR with regard to machine-actionable metadata. A sizable percentage (27%) of the tested repositories had no support for machine-actionable metadata, and approximately 34% had only a small degree of it. Only a handful of repositories (15%) scored more than 40%. The average score of the 75 evaluated repositories was 22%, but scores were noticeably higher among repositories run on established platforms (Dataverse, Figshare, etc.), at 38%, and among certified repositories, at 29%. The project offers support for repositories interested in achieving Core Trust Seal (CTS) certification or completing self-assessments against the CTS requirements. The project will also host additional events like this one, with the primary ambition of helping repositories achieve higher levels of FAIRness.

The project will perform regular FAIR maturity evaluations throughout the project lifespan to monitor increasing FAIRness levels. The main intermediate conclusions:

  • The majority of repositories were evaluated as "not very FAIR," primarily because they do not support machine-actionable metadata.
  • 24% of the sample could not be evaluated due to the lack of a GUID.
  • 27% of the sample shows no support for machine-actionable metadata (score < 0.1).
  • The average score of the 75 evaluated repositories is 0.22.
  • Repositories running on established platforms score an average of 0.38.
  • Certified repositories score an average of 0.29.
  • Machine-actionable metadata (and data) is the way to FAIR.
  • Local, competent support staff (data stewards) are needed to assist in the FAIRification process.
  • Persistent identifiers, (generic) metadata standards, licensing, provenance, and data identifiers are essential FAIR implementation elements.

Relevance and importance of splitting METADATA / DATA

Erik Schultes from GO FAIR gave a presentation on FAIR Principle F3. Erik started by explaining the principle in more detail. Although metadata and data can be combined in a single file, he indicated why it often makes sense to separate the metadata from the data explicitly. Erik argues that:

  • Metadata are often lightweight compared to data (KB instead of GB or TB).
  • The cost of maintaining metadata is significantly lower than the cost of maintaining data.
  • Metadata can be more persistent than data.
  • Repositories can use the same technical stack for building and serving metadata.
  • There is a higher chance of converging on a limited number of metadata standards (templates) than of managing a jungle of "standards."
  • Metadata formats can be shared and reused.
  • RDF is a way to go, in any of its serializations, such as JSON-LD, Turtle (TTL), etc.
  • This separation leads to machine-actionability, even though data will always depend on the underlying data standards (not all data is SPARQL-friendly).
  • Often, researchers can only publish metadata (data access remains restricted).

Erik gave several examples where repositories had explicitly separated the data from the metadata but still failed on Principle F3. In all cases, the reason was that while a human agent could detect the link between the metadata and the data, this link was not automatically detectable by a machine agent (separation alone is not machine-actionability).
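This failure mode can be illustrated with two invented metadata records: both repositories have separated metadata from data, but only the second makes the link machine-actionable.

```python
# Illustrative contrast (hypothetical records): both repositories separate
# metadata from data, but only the second makes the link machine-actionable.
human_only = {
    "@type": "Dataset",
    # A human reader can follow this instruction; a machine agent cannot:
    "description": "The data files can be downloaded from the Files tab.",
}

machine_actionable = {
    "@type": "Dataset",
    "description": "Sea-surface temperature measurements.",
    # Explicit, typed pointer a machine agent can follow:
    "distribution": {"@type": "DataDownload",
                     "contentUrl": "https://repo.example.org/files/sst.csv"},
}

# An F3-style check passes only for the second record:
for record in (human_only, machine_actionable):
    print("distribution" in record)  # False, then True
```

The first record is perfectly usable by a person browsing the landing page, which is exactly why the gap is easy to miss.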

Erik then showed examples of several FAIR Data Point (FDP) platforms used to publish metadata. He used the examples of FDPs in the VODAN Africa project, where real-world COVID patient data is shared between several African countries and universities in the Netherlands (LUMC) and California (UCSD). He explained the structure of an FDP, with the different layers of metadata relevant to machine-actionability. This infrastructure allows researchers to perform "data-visiting activities" over multiple FDPs, whereby the sensitive (patient) datasets stay in their countries, entirely under the control of the data owners. While an FDP is not the only way to separate metadata from data, it is a good reference model for an implementation where protecting sensitive data is crucial.

Recommendations for practical, machine-friendly implementation of F3

Robert Huber from the University of Bremen presented the specifics of F-UJI, the automated FAIR metrics assessment tool developed in cooperation with PANGAEA, the Data Publisher for Earth & Environmental Science, also based in Bremen. The development of F-UJI was part of one of the work packages in the EOSC FAIRsFAIR project. Robert demonstrated how F-UJI works and showed the flowcharts explaining the process of discovering the metadata required to satisfy Principle F3. He gave guidance and examples for the following services:

  • Typed links (Signposting)
  • Schema.org metadata
  • RDF metadata (DCAT)
  • DataCite
  • Domain-specific XML
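Signposting-style typed links deserve a brief illustration, since they are the lightest-weight option on the list: the HTTP response for a landing page carries machine-readable pointers in its `Link` header. The relation types `cite-as`, `describedby` and `item` follow the Signposting conventions; the URLs and the simplified parser below are purely illustrative.

```python
# Hypothetical Link header for a dataset landing page, using
# Signposting-style typed links (URLs invented for the example).
link_header = (
    '<https://doi.org/10.1234/example>; rel="cite-as", '
    '<https://repo.example.org/meta/example.jsonld>; rel="describedby"; '
    'type="application/ld+json", '
    '<https://repo.example.org/files/data.csv>; rel="item"; type="text/csv"'
)

def parse_link_header(header: str) -> dict:
    """Parse a Link header into {rel: url} (simplified, no error handling)."""
    links = {}
    for part in header.split(", <"):
        url = part.split(">")[0].lstrip("<")
        rel_parts = [p for p in part.split(";") if 'rel="' in p]
        if rel_parts:
            rel = rel_parts[0].split('rel="')[1].split('"')[0]
            links[rel] = url
    return links

links = parse_link_header(link_header)
print(links["item"])  # the data file a machine agent should fetch
```

With links like these in place, an evaluator can find the metadata document and the data file without scraping the HTML at all.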

Robert ended with three main recommendations:

  • Avoid storing multiple unrelated data objects within one dataset.
  • Avoid storing additional metadata as part of a dataset (e.g., a PDF).
  • Indicate access levels rather than hiding links for protected files.

File-level identification support in DataverseNO

Philipp Conzett from UiT, the Arctic University of Norway, started by explaining the DataverseNO infrastructure:

  • National generic repository for open research data
  • Operated at UiT, the Arctic University of Norway
  • Currently nine partner institutions/universities
  • Aligned with the FAIR Principles and Core Trust Seal certified
  • Runs on the Dataverse software platform

Philipp explained the different support structures, whereby he distinguished between automated system support and manual support programs. Automated support services from DataverseNO include, among others:

  • PID support (DOI)
  • Support for Tabular Files (using Universal Numeric Fingerprints / UNF)
  • Flexible Image Transport System (FITS)
  • Citation support (including Force-11 aligned references)
  • Verification support
  • Media Type Identification Support (MIME types)

Manual support services include:

  • File-level metadata
  • Access restrictions
  • File hierarchy

Work in progress within DataverseNO:

  • DOI versioning
  • Harvesting file-level identification
  • Improve machine-actionability
  • Include more granular license support

Further support measures and webinars on FAIR

The next FAIRification webinar to further improve the FAIRness of repositories in the Nordic and Baltic countries will be organised in April 2021 (date and time to be announced later). We are more than happy to receive feedback and questions from the research community. Please reach out to us by contacting either FAIRification task leader Bert Meerman at b.meerman(at)gofairfoundation.org or FAIR work package leader Andreas Jaunsen andreas.jaunsen(at)nordforsk.org.

More information and event materials

The webinar was recorded and is available on our YouTube channel.

Webinar recording, part 1

Webinar recording, part 2

Presentations are available on the event page.

More information on the F-UJI tool is available on its website and on the GitHub page of PANGAEA, the Data Publisher for Earth & Environmental Science. Additional information on DataverseNO is available on their website.

Author: Bert Meerman, Director GFF and EOSC-Nordic WP4 member