FAIRification STEP 3 on DATA / METADATA webinar

EOSC-Nordic

21.05.2021

On 29 April 2021, data repository representatives from across the Nordics and Baltics participated in a FAIRification webinar organised by the EOSC-Nordic project. The webinar was the third in a series of multiple steps and focused on Generic Metadata. The idea of focusing on generic metadata as a third topic originates from the exhaustive FAIR maturity evaluation made by the project team in April 2020. One of the survey conclusions was that many data repositories struggled to provide the right object types to datasets. In many cases, the data was not enriched with enough metadata.

It is crucial that the metadata is machine-actionable so that a machine agent can find, interpret and process the data based upon the metadata found, for instance, on the landing page of the repository. In other words, if we want to make research data reusable, the data needs to be enriched with metadata that can be found, interpreted, and processed by machine agents and not only by the human eye. The FAIR Principles set out guidelines for the FAIRness of data by stressing the relevance and importance of enriching datasets with clear machine-actionable metadata. For more information on all 15 FAIR Principles, please visit the GO FAIR web page.

The webinar on 29 April 2021 gave an engaging overview of metadata options, standards, templates, and reference implementations. The current state of play regarding FAIR maturity levels in Nordic and Baltic data repositories set the scene for the discussion. We were introduced to the very meaning of rich metadata, the concepts of Metadata Templates, Metadata for Machine Workshops (M4Ms), the importance of controlled vocabularies, and the practicalities involved in implementing metadata schemas in and by a repository provider.

Summary of the webinar

Bert Meerman from the GO FAIR Foundation hosted the webinar. Bert opened by highlighting the 15 FAIR Principles and addressed a few misconceptions around FAIR. The most important one: under the FAIR Principles, a “machine agent” or algorithm should be capable of finding, accessing, interpreting, and reusing the data and metadata of a repository. FAIR is meant for both human and machine interactions and is therefore all about automated findability and machine-actionability of data and metadata.

When it comes to the FAIR assessment of repositories in the Nordic and Baltic countries, the project team has followed a process to increase the FAIR uptake in the Nordics and has already hosted a series of events to further this goal.
1. April 2020 – First assessment hackathon – Initial exercise
2. November 2020 – Webinar Step 1 – Focus on PID (Global, Unique, Persistent, Resolvable)
3. February 2021 – Webinar Step 2 – Focus on the split between Data and Metadata.
4. April 2021 – Webinar Step 3 – Focus on Generic Metadata.

Summary of the FAIR uptake in the EOSC-Nordic project

Andreas Jaunsen, the FAIR data work package leader from Nordforsk, presented an overview of the project and its progress on the FAIR uptake.
Andreas presented that 20 members in eight (8) countries are involved in this project, and he explained the process of evaluating the FAIR uptake based upon harvesting the landing page of repositories.

He also explained the limitations of repositories that concentrate on publishing their datasets mainly for human consumption and consequently do not give enough attention to the machine-actionability of the (meta)data. A dataset needs to be FAIR for humans as well as for machines. The concept of the “FAIR Digital Object” plays an important role here by permanently and intelligently linking the metadata to the related datasets and vice versa.
Andreas gave context to the topic by describing the FAIR maturity evaluation as a semi-automated assessment process, and he presented the most relevant findings. The investigated sample consisted of around 100 data repositories, and ten (10) datasets were selected manually from each repository. Experiments suggest that a sample of 10 datasets gives a good estimate for the entire population of datasets within a repository.

The tool used for the automated FAIR data assessment is F-UJI, the open-source software developed by staff from PANGAEA as part of the FAIRsFAIR project. F-UJI is capable of testing 17 aspects of the FAIR Principles and is combined with Google Scripts to automate the evaluation of 1000+ datasets.

Andreas presented the results of the FAIR assessments and, unsurprisingly, most of the evaluated repositories did not score well with regard to machine-actionable metadata. Compared to the results reported in webinar Step 2 (3 February 2021), only a marginal improvement in uptake was recorded. About 24% of the repositories could not be evaluated. The overall FAIRness score of the majority of repositories remains low (in the 0 to 10% range). The average score of the 75 evaluated repositories is 0.244 ± 0.007.

While we saw a quantitative improvement in “descriptive core metadata elements” and “automatically retrieved metadata” over the last three months, it is risky to draw firm conclusions. The small change may consist of false positives caused by updates to the evaluation software. The results did show a noticeably higher score among the repositories that run on established platforms (Dataverse, Figshare, etc.). The recorded average score for this sample is 0.42 (compared to 0.24 for the entire population). Certified repositories recorded an average score of 0.31.
The project will offer support for repositories interested in achieving CoreTrustSeal (CTS) certification or completing self-assessments against the CTS requirements.

The project will also host additional events such as this webinar, with the primary ambition to help repositories achieve higher levels of FAIRness.
The project will continue to perform regular FAIR maturity evaluations throughout the project lifespan to monitor increased FAIRness levels.

Relevance and importance of generic metadata.

Erik Schultes from GO FAIR gave a presentation on Generic Metadata. Erik explained the FAIR Principles and indicated the importance of describing the properties and content of a dataset by providing rich metadata.

Erik then demonstrated the FAIR Data Point (FDP) as a reference implementation. He explained where and how rich, machine-actionable metadata can be added to describe the generic properties of a dataset (like the owner, license, provenance, etc.) and the domain-specific metadata (details about the content, based upon chosen, controlled vocabularies). The FDP is a (lightweight) metadata publication platform linked to one or multiple datasets that contain content. He described the structure of an FDP with the different layers of metadata (Catalog, Dataset, and Distribution levels) relevant to machine-actionability. These levels allow users to define what type of datasets are relevant for particular research (reports, documents, graphs, Excel files, RDF files, TIFF files, etc.). This infrastructure allows researchers to perform “data-visiting activities” over multiple FDPs, whereby the sensitive datasets remain at the source, entirely under the control of the data owners.
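
The three metadata layers can be pictured with a short Python sketch. This is only an illustration and not the FDP software itself: the class names, the example catalog, and the example.org URLs are assumptions made here, and only a handful of DCAT and Dublin Core terms are used. It shows how a search can run over metadata alone while the data files stay where they are.

```python
from dataclasses import dataclass, field
from typing import List

# Namespaces of the DCAT and Dublin Core Terms vocabularies.
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"


@dataclass
class Distribution:
    """One concrete file or endpoint through which a dataset is offered."""
    access_url: str
    media_type: str  # simplified to a plain string such as "text/csv"
    byte_size: int

    def to_jsonld(self) -> dict:
        return {
            "@type": DCAT + "Distribution",
            DCAT + "accessURL": {"@id": self.access_url},
            DCAT + "mediaType": self.media_type,
            DCAT + "byteSize": self.byte_size,
        }


@dataclass
class Dataset:
    """Generic description of a dataset (title, license) plus its distributions."""
    title: str
    license_url: str
    distributions: List[Distribution] = field(default_factory=list)

    def to_jsonld(self) -> dict:
        return {
            "@type": DCAT + "Dataset",
            DCT + "title": self.title,
            DCT + "license": {"@id": self.license_url},
            DCAT + "distribution": [d.to_jsonld() for d in self.distributions],
        }


@dataclass
class Catalog:
    """Top layer in this sketch: a collection of dataset descriptions."""
    title: str
    datasets: List[Dataset] = field(default_factory=list)

    def to_jsonld(self) -> dict:
        return {
            "@type": DCAT + "Catalog",
            DCT + "title": self.title,
            DCAT + "dataset": [d.to_jsonld() for d in self.datasets],
        }


def find_distributions(catalog: Catalog, media_type: str):
    """Search the metadata only; the data files themselves are never opened."""
    return [
        (ds.title, dist)
        for ds in catalog.datasets
        for dist in ds.distributions
        if dist.media_type == media_type
    ]


# Hypothetical example content.
catalog = Catalog(
    title="Example catalog",
    datasets=[
        Dataset(
            title="Example measurements",
            license_url="https://creativecommons.org/licenses/by/4.0/",
            distributions=[
                Distribution("https://example.org/data/1.csv", "text/csv", 2048),
                Distribution("https://example.org/data/1.tiff", "image/tiff", 5000000),
            ],
        )
    ],
)
csv_hits = find_distributions(catalog, "text/csv")
```

Here `csv_hits` holds the single CSV distribution together with its dataset title and size, which is the kind of object-type and size information a researcher can filter on before requesting access to the data.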

While an FDP is not the only way to separate metadata from data, it is a good reference model for an implementation where protecting sensitive data is crucial. With this FDP approach, it is relatively easy to indicate the object type and, for instance, the size of a dataset, which is helpful for a researcher in search of specific datasets.
Erik gave examples from the VODAN project, at Kampala International University in Uganda, to demonstrate the power of having a separate platform for publishing the metadata.

Erik sharpened the understanding of FAIR Digital Objects (FDOs) by describing them as “Self-describing Digital Objects.” He also explained how existing standard structures like DCAT and Dublin Core can assist in defining metadata schemas for a community, and how a community can define the metadata it needs by organizing a so-called Metadata for Machines (M4M) workshop. Combining domain expertise and FAIR metadata expertise will result in a (domain-specific) metadata schema that a particular community can use.

Using an online tool developed by CEDAR, Erik demonstrated how communities can build metadata templates in a relatively short period and how these templates can be “reused” by other communities to speed up the process.

The following steps could help to speed up the process:
a. Build a community metadata schema (result from an M4M workshop)
b. Store the metadata schema in a (CEDAR) template.
c. Publish the agreed metadata templates (for instance, in BioPortal).
d. Present the template to researchers as a web form.

Erik ended his presentation by indicating that these published generic metadata schemas can then effectively become certified schemas.

How can metadata schemas be found by machines/tools?

Robert Huber from the University of Bremen started by indicating that FAIR is both for humans and for machines and that FAIR is about the link between data and metadata.

There are three options for exposing metadata when a PID resolves to a machine- or human-readable landing page:
– Embedding the metadata in the HTML landing page.
– Retrieving the metadata using content negotiation.
– Providing machine-readable links in the HTML (signposting, typed links).
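
The three options above can be sketched in Python using only the standard library. This is a minimal illustration and not how F-UJI itself is implemented: the class and function names, the sample page, and the example.org and DOI URLs are all assumptions made for this sketch. The content-negotiation request is only built, not sent.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request

# A few link relation types used by signposting-style typed links.
SIGNPOSTING_RELS = {"cite-as", "describedby", "item", "license", "author", "type"}


class LandingPageParser(HTMLParser):
    """Collects embedded JSON-LD blocks and typed <link> elements."""

    def __init__(self):
        super().__init__()
        self.jsonld_blocks = []  # parsed <script type="application/ld+json"> bodies
        self.typed_links = []    # (rel, href, type) tuples from <link> elements
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []
        elif tag == "link" and attrs.get("href"):
            # rel may hold several space-separated relation types.
            for rel in (attrs.get("rel") or "").split():
                if rel in SIGNPOSTING_RELS:
                    self.typed_links.append((rel, attrs["href"], attrs.get("type")))

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def negotiation_request(pid_url, media_type="application/ld+json"):
    """Build (but do not send) a request asking the resolver for metadata."""
    return Request(pid_url, headers={"Accept": media_type})


# Hypothetical landing page combining options 1 and 3.
SAMPLE_PAGE = """
<html><head>
<link rel="describedby" href="https://example.org/dataset/1/metadata.ttl" type="text/turtle">
<link rel="stylesheet" href="/style.css">
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body>Human-readable description</body></html>
"""

parser = LandingPageParser()
parser.feed(SAMPLE_PAGE)
request = negotiation_request("https://doi.org/10.1234/example")
```

After parsing, `parser.jsonld_blocks` holds the embedded Schema.org record, `parser.typed_links` holds the `describedby` link to a Turtle file, and `request` carries an `Accept` header that a resolver supporting content negotiation could answer with metadata instead of the HTML page.
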
Robert showed the specifics of F-UJI, the automated FAIR Metrics Assessment Tool, developed in cooperation with PANGAEA, the Data Publisher for Earth & Environmental Science, also based in Bremen. The development of F-UJI was part of one of the work packages in the EOSC FAIRsFAIR project.

Robert presented the workflow of F-UJI and explained the process used to discover the metadata required to satisfy the FAIR Principles. (For a detailed description of how F-UJI works, see the workflow slide in his presentation.)
Robert explained that F-UJI could be used for evaluation purposes as well as for self-assessment.

He gave guidance on usage, with pros, cons and examples, for the following widely used metadata standards:
• Dublin Core
• Schema.org metadata
• RDF Metadata (DCAT)
• DataCite

The conclusion was that all four of these domain-agnostic metadata schemas can be used, bearing in mind that each of them can be embedded in the landing page. For specific details on pros and cons, see Robert’s slides. Regarding the community-related tests (like the domain-relevant standards defined in Principle R1.3), Robert explained that F-UJI uses RDA-listed metadata standards, and the software attempts to identify the required metadata elements.

Users are invited to test out the F-UJI evaluator for self-assessment or other purposes. More information on the F-UJI tool is available on its website and on the GitHub page of PANGAEA, the Data Publisher for Earth & Environmental Science.

Generic Dataset Metadata Templates (GDMT)

Nikola Vasiljevic from the Technical University of Denmark (Department of Wind Energy) explained his view on the importance of supporting readability and actionability for both human and machine users. He defined the concept of Metadata Templates to describe the different Fields and Values for generic elements (Title, Creator, Dates, License, etc.). The Fields and Values are represented by a unique URL for each element so that a machine can interpret the metadata. The aim is to avoid “free text input” as much as possible.
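
A minimal Python sketch of this idea follows. It is not the GDMT or the CEDAR template format: the `TEMPLATE` structure, the `fill_template` function, and the choice of Dublin Core properties are assumptions made for illustration. Each field is identified by a URL, and where a field has a controlled list of values, free text is rejected and the chosen label is replaced by its URL.

```python
# Hypothetical controlled list: each allowed label maps to a URL.
LICENSES = {
    "CC BY 4.0": "https://creativecommons.org/licenses/by/4.0/",
    "CC0 1.0": "https://creativecommons.org/publicdomain/zero/1.0/",
}

# Hypothetical template: every field is identified by a property URL.
# "vocabulary" is None where free text is unavoidable (for example a title).
TEMPLATE = {
    "title": {"property": "http://purl.org/dc/terms/title", "vocabulary": None},
    "license": {"property": "http://purl.org/dc/terms/license", "vocabulary": LICENSES},
}


def fill_template(template, values):
    """Return a record keyed by property URL, rejecting values outside a vocabulary."""
    record = {}
    for name, spec in template.items():
        if name not in values:
            raise ValueError(f"missing required field: {name}")
        value = values[name]
        vocabulary = spec["vocabulary"]
        if vocabulary is not None:
            if value not in vocabulary:
                allowed = ", ".join(sorted(vocabulary))
                raise ValueError(f"{name!r} must be one of: {allowed}")
            value = vocabulary[value]  # store the URL, not the human label
        record[spec["property"]] = value
    return record


record = fill_template(TEMPLATE, {"title": "Example dataset", "license": "CC BY 4.0"})
```

The resulting `record` uses URLs for both the field names and the license value, so a machine does not have to guess what a free-text entry such as "open license" was meant to say.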

“Controlled vocabularies” are a great support for these templates, because they restrict input to a limited number of allowed options. These vocabulary choices are recorded in a data model (RDF), written in a specific serialization format (Turtle, JSON-LD, RDF/XML), and expressed in a knowledge-representation language (SKOS or OWL). The combination of RDF, Turtle, and SKOS is strongly recommended as these models and formats are widely supported (W3C recommended).
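
As a rough illustration of that combination, the Python sketch below writes a small SKOS concept scheme as Turtle text by hand. The function names, the vocabulary, and the example.org URLs are hypothetical and were not shown in the webinar; a real project would more likely use an RDF library and a published vocabulary.

```python
def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def concept_scheme_to_turtle(scheme_url, title, concepts):
    """Serialize a flat controlled vocabulary as SKOS in Turtle.

    concepts maps each concept URL to its preferred English label.
    """
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{scheme_url}> a skos:ConceptScheme ;",
        f'    skos:prefLabel "{escape_literal(title)}"@en .',
    ]
    for url, label in concepts.items():
        lines += [
            "",
            f"<{url}> a skos:Concept ;",
            f'    skos:prefLabel "{escape_literal(label)}"@en ;',
            f"    skos:inScheme <{scheme_url}> .",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical vocabulary with two terms.
turtle = concept_scheme_to_turtle(
    "https://example.org/vocab/instruments",
    "Example instrument types",
    {
        "https://example.org/vocab/instruments/lidar": "Lidar",
        "https://example.org/vocab/instruments/anemometer": "Anemometer",
    },
)
print(turtle)
```

Each term gets its own URL, a human-readable label, and a link back to the scheme it belongs to, which is what lets a template field point at the term unambiguously.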

Linked Data and JSON-LD are also good tools to combine human and machine actionability on the web, allowing users to connect and query data from different sources.

Nikola gave practical examples of a GDMT, inspired by DCAT and DataCite, and built in close cooperation with CEDAR, using their online tool. This work was established in two Metadata for Machines workshops (M4Ms) and was crucial for defining the schemas in the wind-energy community.
M4Ms are a great tool to build and further develop controlled terms and vocabularies within a community. The wind-energy community is a prime example of where this approach paid off.

Nikola also shared an interesting outlook on future developments: he foresees software capable of generating metadata automatically.

Further support measures and webinars on FAIR

The next FAIRification webinar to further improve the FAIRness of repositories in the Nordic and Baltic countries will be organized in the second half of 2021. We will publish the date and time as soon as possible.

We are more than happy to receive feedback and questions from the research community, so please reach out to us by contacting either FAIRification task leader Bert Meerman at b.meerman (at) gofairfoundation.org, or FAIR work package leader Andreas Jaunsen at andreas.jaunsen (at) nordforsk.org.