A Generic Workflow for the Data FAIRification Process

The FAIR guiding principles aim to enhance the Findability, Accessibility, Interoperability and Reusability of digital resources such as data, for both humans and machines. The process of making data FAIR (“FAIRification”) can be described in multiple steps. In this paper, we describe a generic step-by-step FAIRification workflow to be performed in a multidisciplinary team guided by FAIR data stewards. The FAIRification workflow should be applicable to any type of data and has been developed and used for “Bring Your Own Data” (BYOD) workshops, as well as for the FAIRification of e.g., rare diseases resources. The steps are: 1) identify the FAIRification objective, 2) analyze data, 3) analyze metadata, 4) define semantic model for data (4a) and metadata (4b), 5) make data (5a) and metadata (5b) linkable, 6) host FAIR data, and 7) assess FAIR data. For each step we describe how the data are processed, what expertise is required, which procedures and tools can be used, and which FAIR principles they relate to.


INTRODUCTION
The FAIR data principles aim to enable efficient and error-free analysis of data from multiple sources by machines and ultimately by humans, through enhancing their Findability, Accessibility, Interoperability and Reusability [1]. Since the initiation of the FAIR principles in 2014, FAIR metrics [2,3,4], FAIR infrastructure [5] and FAIR tools [6] have been developed to aid the process of making data FAIR ("FAIRification"). At the same time, a FAIRification workflow emerged, which was first developed for and subsequently matured through numerous "Bring Your Own Data" (BYOD) workshops [7], and FAIRification projects of rare disease patient registries. The aim was a domain-independent workflow that may be used in a wide range of FAIRification efforts. Indeed, the workflow has been applied in BYODs on e.g., tomato, yeast, cancer and several in the rare disease domain. The rare disease domain provided an excellent use case, considering the evident need to efficiently analyze sparse, heterogeneous and privacy-sensitive data from multiple sources across institutes and countries.
Early drafts of the FAIRification workflow originate from the first BYODs, before the inception of FAIR. These BYODs focused on interoperability, following earlier work that applied Semantic Web technology as a framework for interoperability and machine-readability [8,9]. With the advent of FAIR, the workflow was adapted to also cover the other three facets of FAIR: findability, accessibility and reusability, which rely heavily on metadata (i.e., data descriptors). This evolution has resulted in the workflow presented in this paper, which we intend as a template and a guide: in practice we expect that instantiations of the workflow may vary depending on the data source, use case, domain or FAIRification objective.

A GENERIC FAIRIFICATION WORKFLOW
Figure 1 shows the generic step-by-step FAIRification workflow. The workflow is divided into three phases: pre-FAIRification, FAIRification and post-FAIRification, which are further divided into seven steps. The steps are: 1) identify FAIRification objective, 2) analyze data, 3) analyze metadata, 4) define semantic model for data (4a) and metadata (4b), 5) make data (5a) and metadata (5b) linkable, 6) host FAIR data, and 7) assess FAIR data. The steps do not always need to be followed in a strict sequential order and may be iterated. Please note that within each step there are multiple smaller "steps" that may themselves be iterated. Data sets differ, and practical constraints or new insights may lead to a different order of execution; some steps are often visited multiple times. Each step attempts to enable the implementation of the FAIR principles and aims to enhance the FAIR status (i.e., "FAIRness") of the data set.
Data FAIRification requires different types of expertise and should therefore be carried out in a multidisciplinary team guided by FAIR data steward(s). The required expertise covers i) the data to be FAIRified and how they are managed, ii) the domain and the aims of the data resource within it, iii) architectural features of the software that is (or will be) used for managing the data, iv) access policies applicable to the resource, v) the FAIRification process (guiding and monitoring it), vi) FAIR software services and their deployment, vii) data modelling, viii) global standards applicable to the data resource, and ix) global standards for data access. A good working approach is to organize a team that contains or has access to the required expertise. The core of such a team may be formed by data stewards who have at least expertise in the local environment and in the FAIRification process in general.

Figure 1.
A generic step-by-step workflow for the process of making data FAIR ("FAIRification"). The workflow is divided into three "phases": Pre-FAIRification, FAIRification, and Post-FAIRification (dark grey boxes) that are further specified by "steps" indicating typical aspects of practical FAIRification (light grey boxes): 1) identify FAIRification objective, 2) analyze data, 3) analyze metadata, 4a) define semantic data model, 4b) define semantic metadata model, 5a) make data linkable, 5b) make metadata linkable, 6) host FAIR data, and 7) assess FAIR data. The order is not strict and can be iterative.

Identify FAIRification Objective (Step 1)
The first step is to identify the FAIRification objective and is within the pre-FAIRification phase of the workflow. This step requires access to the data; in the case of privacy-sensitive data (when even the data steward should not get access to the actual information), a sample of anonymized or mock data may be used instead. This step also requires a general knowledge and understanding of the data set, as well as familiarity with the FAIR principles in general. Objectives for FAIRification could be specific requirements of publishers, funders [10] or stakeholder communities [11,12], or to increase the efficiency of using data from multiple sources. We recommend first focusing the FAIRification on a subset of the data elements, in line with the resources available for FAIRification (e.g., time). The workflow can be iterated so that more data elements may be included later. A good way to select the subset is by defining domain-relevant "driving user question(s)" that require at least two data resources. This should be done in a team with both domain and data modelling expertise. Other good drivers for implementing FAIR principles are to enhance the findability, accessibility or reusability of the data, e.g., by improving the metadata. Ultimately, the FAIRification objective depends on the availability of 1) expertise, 2) FAIR solutions that may be reused [11,12], and 3) data management tools with FAIRification features that are appropriate to the data set [13].

Analyze Data (Step 2)
The second step is to analyze the data to prepare for subsequent FAIRification (e.g., improving interoperability) and is within the pre-FAIRification phase of the workflow. This process may include: 1) investigating the data in whatever form(s) they are available (specified in Step 1) and checking whether both the data representation (format) and the meaning of the data elements (the data semantics) are clear and unambiguous, and 2) checking whether the data already contain FAIR features, such as persistent unique identifiers for data elements [14] (FAIR principle F1 [1]), e.g., by using FAIRness assessment tooling [2,3,4]. This step is tightly connected with Step 1, since selecting a relevant subset of the data and defining driving user question(s) both rely heavily on familiarity with the data.
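As a minimal sketch of the second check, the following Python function tests whether the identifiers in a data sample are unique and have the form of an http(s) IRI. The function name, the sample identifiers and the IRI heuristic are assumptions made for illustration only. Persistence cannot be verified from syntax alone, and the FAIRness assessment tooling [2,3,4] performs far richer tests.

```python
from collections import Counter
from urllib.parse import urlparse


def check_identifiers(identifiers):
    """Return duplicated identifiers and those that are not http(s) IRIs."""
    counts = Counter(identifiers)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    not_iri = []
    for identifier in sorted(counts):
        parsed = urlparse(identifier)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            not_iri.append(identifier)
    return {"duplicates": duplicates, "not_iri": not_iri}


# Hypothetical identifiers taken from a mock sample of the data set.
sample_ids = [
    "https://example.org/patient/001",
    "https://example.org/patient/002",
    "patient-003",
    "https://example.org/patient/002",
]
report = check_identifiers(sample_ids)
```

A report of this kind can feed the comparison made later in Step 7, where the FAIR status after FAIRification is compared with the status found here.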

Analyze Metadata (Step 3)
The third step is to analyze the availability of metadata regarding findability, accessibility, and reusability, and is within the pre-FAIRification phase of the workflow (note that metadata are made interoperable in Steps 4b and 5b). This process may include: 1) investigating the metadata describing the data, or if no metadata exist, identifying what metadata should be gathered (which may be highly specific to each stakeholder community), and 2) checking whether the metadata already contain FAIR features, such as rich metadata and provenance descriptions (FAIR principles F2 and R1, R1.1, R1.2, R1.3 [1]), e.g., by using FAIRness assessment tooling [2,3,4]. Improving metadata regarding findability, accessibility, and reusability requires including details such as: license, copyright, contribution statements (e.g., funders, data set creator, publisher), and a description of the conditions for use of and access to the data.
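The inventory of missing metadata can be sketched as a simple completeness check. The example below is an illustration under stated assumptions: the key names, the example record and the function name are hypothetical, and each stakeholder community would substitute its own list of required items.

```python
# Metadata items named in the text above; the key names are hypothetical.
REQUIRED_ITEMS = (
    "license", "copyright", "funders", "creator",
    "publisher", "use_conditions", "access",
)


def missing_metadata(record, required=REQUIRED_ITEMS):
    """List the required items that are absent or empty in a metadata record."""
    missing = []
    for key in required:
        value = record.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


# Hypothetical metadata record for a data resource.
example_record = {
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": "Example Registry Team",
    "publisher": "",
    "access": "On request via the data access committee",
}
gaps = missing_metadata(example_record)
```

The resulting list names the metadata that still need to be gathered before the semantic metadata model is defined in Step 4b.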

Define Semantic Data and Metadata Model (Step 4a and 4b)
The fourth step is to define a semantic model of the data (4a) and the metadata (4b) and is within the FAIRification phase of the workflow. The semantic models are templates for the next step, in which the data (5a) and metadata (5b) are transformed into a machine-readable format. Generating a semantic model is often the most time-consuming step of data FAIRification. However, we expect the modelling effort to diminish as more and more models are made available for reuse over time, especially if such models are treated as FAIR digital objects themselves. Thus, it is important to first check whether a reusable semantic model already exists for the data and the metadata. Where no semantic model is available, a new one needs to be generated. We briefly describe this process below.

Building a semantic data model (4a) can be defined in three steps: 1) creating a conceptual model, 2) searching for ontology terms, and 3) creating a semantic data model from Steps 1) and 2). This requires domain expertise on the data set and expertise in semantic data modelling. The domain expert(s) make sure that the exact meaning of the data is understood by the modeler and the modeler ensures that the semantic representation correctly represents the domain knowledge. It is important that both the data representation (format) and the meaning of the data elements (the data semantics) are clear and unambiguous (as mentioned in Step 2). A good vehicle for this discussion is to first create an abstract "conceptual model", which lists the main concepts and relationships between data elements to be FAIRified, e.g., related to the subset specified in the driving user question(s).
Next, the concepts and relations between the data elements in the data set are substituted with the machine-readable classes and properties from ontologies, vocabularies and thesauri. While acknowledging the differences between these types of resources, we will subsequently use the term "ontologies", as proper ontologies generally best serve the FAIRification process. Ontologies, and the concepts and properties that they describe, can be found using search engines, such as the Ontology Lookup Service (OLS; https://www.ebi.ac.uk/ols/index), BioPortal (https://bioportal.bioontology.org/) and BARTOC (https://bartoc.org). We have found that making optimal choices demands good searching skills and experience. For instance, it is generally insufficient to simply choose the first ontology in the list returned by an ontology search tool. Instead, one should also check the usability license, usage statistics, update activity, whether the ontology contains a good class and property structure (which generally facilitates data integration), and whether a general ontological framework is used (such as OBO Foundry [15]). Nevertheless, it may be very difficult to decide which term from which ontology should be used, i.e., to match the detail in domain-specific ontologies with the detail that is needed to describe data elements correctly. Terms used in human narrative do not always match directly with the ontological representation of the term. If the search is unsuccessful, new ontology terms could be defined and added to existing ontologies, or new ontologies could be developed. This is, however, a time-consuming process that should be undertaken with a team of experts from the domain of the study, in consultation with ontology experts.
Finally, the conceptual model and the ontology terms are used to create a detailed semantic data model that, in contrast to the conceptual model, distinguishes between the data items (instances and their values) and their types (classes). This model is an exact representation of the data and exposes the meaning of the data in machine-readable terms (ideally in the most universal form possible). This enables the transformed FAIR data set to be efficiently incorporated in other systems, analysis workflows, and unforeseen future applications.
For metadata (4b), semantic models describing generic items are available for reuse, e.g., DCAT for describing data sets (see Step 5b for tools with reusable semantic models). Domain-specific items (e.g., those described by principles F2 and R1, R1.1, R1.2, R1.3 [1]) should be decided by each individual self-identified domain [11,12], and thereafter need to be described in a semantic metadata model.

Make Data and Metadata Linkable (Step 5a and 5b)
The fifth step is to make the data (5a) and metadata (5b) linkable, i.e., to transform them into a FAIR representation, and is within the FAIRification phase of the workflow. The method for making data and metadata linkable is highly application and use case dependent. However, it is crucial that a description of the data and metadata is available in a representation framework that is globally understood by machines. Further, the semantic model should be associated with the data and metadata so that it is available for unforeseen future applications and scalable interoperability across all types of data. An example of a linkable, machine-readable global framework is the Resource Description Framework (RDF). It provides a common and straightforward underlying model and creates a powerful global virtual knowledge graph.
In order to transform the data into a machine-readable form (Step 5a) the semantic data model defined (or chosen) in Step 4a is required. Specialized tools are available for this process such as the FAIRifier, which provides insight into the transformation process and makes the process reproducible by tracking intermediate steps [6]. Other similar tools are Karma [16], Rightfield [17], and OntoMaton [18].
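To make the idea of Step 5a concrete, the sketch below applies a small semantic model to tabular data and emits RDF in the N-Triples syntax, using only the Python standard library. It is an illustration under stated assumptions and not a description of how the FAIRifier or the other tools work: the model structure, the example.org IRIs, the column names and the mock rows are all hypothetical, identifiers are assumed to be safe to embed in an IRI, and all values are written as plain literals.

```python
import csv
import io

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical semantic model: the class of each row and the property IRI
# for each column. A real model would reuse ontology terms chosen in Step 4a.
MODEL = {
    "subject_template": "https://example.org/patient/{id}",
    "class": "https://example.org/ontology/Patient",
    "columns": {
        "diagnosis": "https://example.org/ontology/hasDiagnosis",
        "birth_year": "https://example.org/ontology/birthYear",
    },
}


def escape_literal(text):
    """Escape backslashes, quotes and line breaks inside an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def rows_to_ntriples(rows, model):
    """Yield one N-Triples line per statement derived from the rows."""
    for row in rows:
        subject = model["subject_template"].format(id=row["id"])
        yield f"<{subject}> <{RDF_TYPE}> <{model['class']}> ."
        for column, prop in model["columns"].items():
            value = row.get(column)
            if value:  # skip empty or missing cells
                yield f'<{subject}> <{prop}> "{escape_literal(value)}" .'


# Mock data, as suggested in Step 1 for privacy-sensitive sources.
table = "id,diagnosis,birth_year\n001,Example disease,1980\n002,,1975\n"
triples = list(rows_to_ntriples(csv.DictReader(io.StringIO(table)), MODEL))
```

Keeping the model separate from the transformation code reflects the point made in Step 4: the model is a reusable template, so the same function can be applied to another table by swapping in another model.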
For the transformation of the metadata into a machine-readable form (Step 5b) the semantic metadata model defined (or chosen) in Step 4b is required. For some generic metadata items there are several tools available that support this transformation process such as the FAIR Metadata Editor [6], CEDAR [19], and BioschemasGenerator. The FAIR Metadata Editor is a free online tool that demonstrates the concept of structuring metadata in a FAIR-supporting way. Good metadata increases the potential to make a resource more findable. We mention two additional mechanisms to increase the findability of a resource. First, we recommend registering a resource in a domain-relevant registry or index, preferably one that strives for FAIR-compliance. Second, to enable indexing of the data set by general purpose Web search engines such as Google, we recommend including Schema.org markup (or a domain specific variant like Bioschemas) for example using the DataCatalog and Dataset profiles.
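The Schema.org markup recommended above can be sketched as follows. The function name and all example values are hypothetical; the properties used (name, description, url, license, includedInDataCatalog) are defined by Schema.org for the Dataset type, but a Bioschemas profile may require additional properties that are not shown here.

```python
import json


def dataset_jsonld(name, description, url, license_url, catalog_name):
    """Build Schema.org Dataset markup as a JSON-LD string."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "includedInDataCatalog": {
            "@type": "DataCatalog",
            "name": catalog_name,
        },
    }
    return json.dumps(doc, indent=2)


# Hypothetical example values for a registry data set.
markup = dataset_jsonld(
    name="Example rare disease registry",
    description="Mock description of a FAIRified patient registry.",
    url="https://example.org/dataset/registry",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    catalog_name="Example catalog",
)
```

The resulting string is typically embedded in the landing page of the data set inside a script element of type application/ld+json, where Web search engines can read it.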

Host FAIR Data (Step 6)
The sixth step is to host the FAIR data i.e., make it available for consumption and is within the FAIRification phase of the workflow. This enables human and machine use through different interfaces, such as an Application Programming Interface (API), RDF triple store, or Web application. Please note that "FAIR does not mean open" [20] and that access restrictions may be applied at any level of (meta)data on each of the interfaces. There are many different ways to deploy a FAIR resource online and to provide (and manage) access [21,22]. One of these is the general-purpose FAIR data accessor as provided by the FAIR Data Point (FDP) software component [6]. It is developed as an exemplar tool to demonstrate the critical step of using global standards to provide access to structured metadata, and to demonstrate compliance with the FAIR guiding principles [1]. An FDP facilitates transparent, controlled access in a stepwise manner to increasingly detailed information about the data set and eventually the data records to both humans and machines. The human interface consists of a simple Web page providing links to the relevant layers of metadata provided by the FDP. The FDP machine interface will return a machine-readable RDF document.

Assess FAIR Data (Step 7)
The seventh step is to assess the FAIR data and is within the post-FAIRification phase of the workflow. This process may include: 1) evaluating whether the original objectives as defined in Step 1 have been achieved (if not, some of the steps in the workflow may need to be revisited), and 2) checking the FAIR status of the data and metadata, e.g., by using FAIRness assessment tooling [2,3,4], and comparing it with the FAIR status assessed in Steps 2 and 3.
If driving user question(s) were defined in Step 1, they should be "answered" in this step. The answers to these question(s) are obtained by processing the FAIR machine-readable data. If RDF is the machine-readable format used, then RDF data stores (triple stores) are used to store the machine-readable data, and SPARQL queries are used to retrieve the data required to answer the driving user question(s).
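The sketch below illustrates how a driving user question that spans two data resources can be answered once both use the same property and concept IRIs. It is a pure-Python stand-in for a triple store and a SPARQL engine, intended only to show the principle: the `match` function, the example.org IRIs and the two mock registries are hypothetical. In a real setting, the equivalent SPARQL query shown in the comment would be sent to a triple store.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


EX = "https://example.org/"

# Two hypothetical FAIRified resources using the same property and disease IRIs.
registry_a = [
    (EX + "patient/001", EX + "hasDiagnosis", EX + "disease/D1"),
    (EX + "patient/002", EX + "hasDiagnosis", EX + "disease/D2"),
]
registry_b = [
    (EX + "patient/101", EX + "hasDiagnosis", EX + "disease/D1"),
]

# Shared IRIs allow the sources to be merged and queried as one graph.
merged = registry_a + registry_b

# Driving question: which patients across both registries have disease D1?
# Equivalent SPARQL:
#   SELECT ?patient WHERE {
#     ?patient <https://example.org/hasDiagnosis> <https://example.org/disease/D1>
#   }
patients_with_d1 = sorted(
    t[0] for t in match(merged, p=EX + "hasDiagnosis", o=EX + "disease/D1")
)
```

If the two registries had used different identifiers for the same disease or property, the merged query would silently miss results, which is why the semantic modelling in Step 4 is decisive for answering the driving user question(s).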

DISCUSSION AND CONCLUSIONS
In this paper, we have described a generic workflow for the data FAIRification process. It mainly describes the technical hands-on part, but can also be used for other purposes, such as planning, training and dissemination. The purpose of this workflow is to make FAIRification easier. However, there are specific decisions beyond this workflow that need to be made by stakeholders who are part of an organization, institution, consortium, or other relevant collective supporting the FAIRification. These include decisions regarding: 1) a common standard for FAIR (meta)data collection and storage within a given community, 2) the FAIRification plan prior to data capture, and 3) sharing semantic data models within and across communities and domains.
Stakeholders also need to consider managerial aspects of FAIRification, i.e., the required expertise, and building and maintaining capacity on FAIR data stewardship for the longer term. The way a FAIR project is approached depends on the available budget, and on the type and size of an organization. Capacity could be established by organizing a dedicated team of specialists within a larger organization, or by organizing collaboration with experts who are willing to contribute expertise and part-time consultancy to guide the process. In either case, we recommend forming a small team of experts that includes one or more trained FAIR data stewards to maintain a full overview of the FAIRification process as it is implemented. The FAIRification workflow, including its required budget, should be explicitly incorporated in Data Management Plans (DMPs) [13]. Because of the interdisciplinary nature of FAIRification and the early stage of development of support for FAIR data stewards, a DMP typically involves interdisciplinary and cross-organizational collaboration.
The FAIRification workflow presented in this paper is generic, and thus intended to be used in any domain. We also note that our current workflow representation is not intended as a normative or final workflow for the FAIRification process. It should be used as a template, and we expect continuous evolution of the workflow as the awareness and understanding of specific data stewardship issues increase in application communities. For example, we are currently considering whether a planning phase and an inventory phase should be added as explicit separate steps, or whether the workflow should change to accommodate FAIRification-by-design (e.g., before starting data capture) as opposed to FAIRification of an existing data set (post-hoc FAIRification) [23]. The future composition of the workflow depends heavily on which implementation decisions are made [11], and as domains start to reuse solutions from other domains, we expect to see a creolization of the workflow [12].

AUTHOR CONTRIBUTIONS
The workflow presented in the manuscript is a result of many years of experience by all authors, A. Jacobsen