The FAIR Principles: First Generation Implementation Choices and Challenges

1 Leiden University Medical Center, Leiden, 2333 ZA, The Netherlands
2 GO FAIR International Support & Coordination Office (GFISCO), Leiden, The Netherlands
3 National Science Library, Chinese Academy of Sciences, Beijing 100093, China
4 School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100084, China
5 School of Information Management, Nanjing University, Nanjing 210023, China


Data Intelligence
On our journey to machine-assisted research and innovation, communities need guiding implementation considerations. This special issue of the Data Intelligence journal is therefore dedicated to practical first generation implementation choices that are being made by communities of practice and that are relevant for the FAIR status of data and accompanying tooling. This issue also features "opinion articles" that address challenges encountered or anticipated along the implementation trajectory of FAIR for which there are no ready-to-use solutions. These early examples and visionary discussions may inspire the further development of interoperable approaches by various early-mover communities, who are convinced that both data and services (including tooling) should be FAIR to enable the envisioned Internet of FAIR Data and Services (IFDS) [1]. This is all "early stage", because the FAIR principles were published in their current form only in March 2016 [2]. The FAIR principles did not mark a major new insight; rather, they were a consolidation and a comprehensive rephrasing of a series of earlier foundational and pioneering approaches (some decades in the making) to move toward a machine-friendly research infrastructure. Several papers in this issue make this point in historical context [3,4,5].

FAIR "hit a chord"
Why, then, did the FAIR acronym so rapidly spark discussion in the domain sciences and gain wide recognition at the policy level [6] and among funders [7] and repositories [8], not only in Europe and the USA, but also in Africa [9], Latin America [10], and China [11,12]? There may be two important differences with earlier efforts. First, after the inception of the FAIR principles in January 2014 at the Lorentz workshop in Leiden, The Netherlands, the principles were initially posted on the FORCE11 website for community comments and subsequently published in Scientific Data in March 2016 [2]. The article apparently "hit a chord" and was massively read, tweeted, discussed in blogs and cited [see box below].
At the press date of this Data Intelligence special issue on emerging FAIR practices, the commentary had already received 67,000 article accesses and over 1,600 citations in Google Scholar, had an Altmetric score of 1,385, was ranked 1st among articles of a similar age in Scientific Data, and consistently scored in the 99th percentile (ranked 74th) of the 264,581 tracked articles of a similar age in all journals. The article was tweeted 1,243 times, appeared in 77 blogs and was picked up by 84 news outlets. After three years, the rate of citation continues to increase and currently stands at almost two citations per day in Google Scholar.
In addition to a sticky acronym, we strongly believe that the inception of the FAIR principles at the Lorentz workshop in Leiden (January 2014) marked a natural tipping point, also caused by the very visible discussions in the scope of the European Open Science Cloud preparations [1], which in turn led to an "attraction phase" toward a common approach, as described for many other major infrastructures [13].

FAIR is already implemented in communities of practice
A recent student-led survey, referenced in this issue [14], showed that 80% of the papers citing the original FAIR paper actually deal with practical implementations. This indicates that, next to the "political hype" caused by the acronym, a growing number of organizations and communities have actually attempted to forge technical implementation choices that adhere to the FAIR guiding principles. Some of them are described in this issue, but many more can be found in the literature. The implementations are apparently emerging across the entire spectrum of science and innovation domains and include the life sciences (notably biomedicine and health, biodiversity and agriculture), nuclear energy, climate change, ocean research, the humanities, economics, space science, mineralogy and many more. Furthermore, data science related implementations are also numerous, such as ontology mapping, machine learning algorithms, ontology-based access protocols, automation technology, annotation and curation, and many aspects practiced in the emerging profession of data stewardship in data competence centers in institutions all over the world [15]. Unfortunately, 80% of the citations so far are from Europe and the USA [14]. It is very encouraging, though, that this issue reports on strong FAIR related activities not only in Europe [5] and the USA [16], but also in Africa [9] and Latin America [10]. Moreover, international organizations such as the Research Data Alliance (RDA) [5], the Committee on Data for Science and Technology (CODATA) [17,18] and the European Strategy Forum on Research Infrastructures (ESFRI) [4], and scientific unions such as AGU [15] and IUPAC [18], are leading their domain communities toward more mature FAIR choices and infrastructures.
Ever since their publication, the FAIR principles have sparked converging initiatives such as GO FAIR and strong collaborations between GO FAIR and other international data related initiatives, including the Committee on Data of the International Science Council (ISC's CODATA), its World Data System, and the Research Data Alliance (RDA). Based on all the above, it is likely that well over 1,000 communities of practice already work on some implementation aspects guided by the FAIR principles (i.e., 80% of the >1,600 articles citing the FAIR principles paper [2]).

"My Machine knows what I mean"
So, what does "being FAIR enough" actually mean? First of all, this will vary widely for different communities and domains, and it will ultimately be decided by the communities of practice that adopt policies supporting machine-actionable data, that aim to "de-silo", and that strive to overcome disciplinary boundaries.
Still, in this very early implementation phase there has also been quite some confusion and anxiety about what the FAIR principles actually cover. As a result, several "additional acronym letters" have been proposed, even in some early draft articles for this issue. So far, all of these proposed changes could be resolved
without changing the powerful acronym, because they could either be classified as a specific "implementation choice" [19] or because they were "beyond FAIR" [20], since they addressed issues that, "by design", the FAIR principles do not cover, such as ethics, privacy, reproducibility, and data or software quality per se. Many of these very important aspects are implicitly related to the findability of software and data, their accessibility, their interoperability and therefore the ability to reuse these research objects, but they should not be conflated with the FAIR principles themselves, which were designed to strictly cover the inherent machine-FAIRness of data and services. In that sense, even the fabrication of fake data, making them FAIR and publishing them in a CoreTrustSeal repository would not violate the principles, especially when the metadata indicate that the data are fabricated, for example as a machine learning (ML) training set. Conversely, putting high quality data in a mediocre repository cannot be prevented by the principles as such, although obviously, when the only repository in which data or code was published is offline or not findable for other reasons, the FAIR principles are not properly followed. Abuse of the FAIR acronym is related to specific, stakeholder-defined implementations, some of which are tangentially addressed by the FAIR principles and others not.
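To illustrate the point about fabricated data, a metadata record can declare explicitly that a dataset is synthetic, so that machine consumers are not misled. The sketch below is a minimal, hypothetical example: the field names and the identifier are our own illustrative choices, not a standardized FAIR vocabulary (a real implementation would draw on a community schema such as schema.org or DCAT):

```python
import json

# Hypothetical minimal metadata record for a fabricated (synthetic) dataset.
# All field names and the identifier are illustrative, not a FAIR standard.
metadata = {
    "identifier": "https://example.org/dataset/ml-train-001",
    "title": "Synthetic training set for ML benchmarking",
    "provenance": {
        "fabricated": True,  # the data are synthetic, not measured
        "method": "random sampling from a fitted distribution",
    },
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Serialize to JSON so that any machine agent can read the provenance flag.
record = json.dumps(metadata, indent=2)
print(record)
```

A machine agent harvesting this record can check the provenance flag before deciding whether the dataset is suitable for, say, training versus validation against real-world observations.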
In this issue [19], the original conception of the FAIR principles and what they are intended to cover is explained in detail. In an attempt to narrow down to the essence of what the original composers of the FAIR guiding principles had in mind, we would like to introduce an even higher level of abstraction than the principles themselves: the trigger for so much international attention for better data stewardship and Open Science is likely correlated with the data explosion we have created through ever increasing automation and instrumentation advances. It follows that we need "machines", both as creators of data and as analytical assistants, all the time: we had better make them as efficient and collaborative as possible. So at their very core, the FAIR guiding principles should lead us to ensure that "Machines know what it means". Obviously, this does not (yet) take people out of the loop. In fact, the envisioned Internet of FAIR Data and Services [1] should be an environment where our implementation choices support both machines and humans, in a tight and iterative collaboration (i.e., "Social Machines" [21] are the end users).

FAIR Data are the substrate for FAIR tools and services for Open Science
Open Science is in fact a new way of doing and communicating science, with an emphasis on the reusability of data and the accompanying analytics, not only by other researchers, but ostensibly also by machines. Hence the one-liner that captures the essence of the FAIR principles: "Machines know what it means". So, do we trivialize the role of humans in science and innovation? On the contrary: publishing our major research output (data, software tools, derived information and major scientific conclusions and claims) in FAIR format will enable computers to "also" output the relevant information in precise, human-digestible formats, meanwhile mitigating the ambiguity introduced by natural language, effectively crossing jargon, ontological false agreements and false disagreements [22], and ultimately even natural language barriers. It should also be emphasized that "data" should always be published with a supplementary narrative
for humans to judge and evaluate the data and information we provide in machine-readable and actionable formats [23].
Human prosaic narrative, graphical figures and tables, and most supplementary data, in the formats we have used for scholarly communication for centuries, are "a nightmare for machines" and therefore, intrinsically and in their native form, do not comply with the FAIR principles, which obviously does not make them useless for human reuse. The good news is that precise scientific claims in legacy text, as well as the supporting data, can be transformed into FAIR formats with increasing relative ease. Human readable text, tables and figures for human intellectual consumption can also increasingly be produced by machines, for instance from relational databases to RDF and vice versa [24]. Supporting both machines and humans in their collaborative work is therefore the major contribution the FAIR guiding principles are supposed to make to 21st century research and innovation throughout the world. An important notion is also that research objects that are not "digital", or are otherwise not machine interpretable, such as geological and biological specimens, analogue pictures, PDFs and the like, can nonetheless always be adorned with FAIR metadata, creating a "digital twin" [3,5]. In this context, beyond the original coining of the term by Michael Grieves in 2002 [25], we see a "digital twin" of a non-digital research object as a set of machine-readable metadata and instructions that allow machines to detect and resolve to the location of the object via its unique identifier [26], make the best possible interpretation of what the object is, and determine what operations on it are technically possible and what is allowed to be done with the digital objects in the twin. The actual research object can be anything from a molecule, to a packaged data and workflow object [27], to one of the 3 billion biological specimens in natural history museums [3], to citizens in FAIR driven research and care environments such as the Personal Health Train [28].
FAIR digital objects, sometimes pointing to other digital objects and sometimes to objects in the physical world, thus form the basic substrate for machine-assisted science and innovation.
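The notion of a "digital twin" for a non-digital research object, described above, can be sketched as a small machine-readable record. Everything in this sketch, the identifier, the field names and the vocabulary, is an illustrative assumption rather than an established standard; it only shows the shape of the idea: resolvable identity, interpretation hints, and permitted operations in one record:

```python
import json

# Sketch of a "digital twin" for a physical specimen: machine-readable metadata
# that lets software resolve the object's location, interpret what it is, and
# check which operations are permitted. All identifiers, field names and values
# below are hypothetical examples, not a standardized FAIR vocabulary.
digital_twin = {
    "@id": "https://hdl.example.org/21.T0000/specimen-0042",  # persistent identifier (illustrative)
    "@type": "PhysicalSpecimen",
    "label": "Herbarium sheet, Quercus robur",
    "location": "Example Natural History Museum, cabinet 12, drawer 3",
    "digitalRepresentations": [
        {"format": "image/tiff", "contentUrl": "https://example.org/img/0042.tif"}
    ],
    "allowedOperations": ["view", "annotate"],  # what machines and humans may do
}

print(json.dumps(digital_twin, indent=2))
```

A crawler in an Internet of FAIR Data and Services could harvest such records, follow the identifier to the custodian institution, and decide, without human intervention, which of the listed operations it is allowed to perform.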

Pioneering choices in FAIR implementations
The first series of 15 articles in this special issue bundles a relevant set of "first generation" implementations and emerging practices in the context of FAIR. These are followed by 12 articles that focus more on gaps in existing technology and practice, encountered or envisioned, and that offer opinions and propose directional solutions for the relevant communities to develop FAIR guided approaches. The Implementation Articles cover the overall protocols and operations needed to enable efficient handling of FAIR Digital Objects in the Internet and Web environments. The very first requirement is that each FAIR Digital Object has its own unique, persistent, resolvable identifier [26]. Machines subsequently need instructions and workflows, and these can and should be FAIR themselves to effectively participate in the Internet of FAIR Data and Services [27]. Next to that, even data that cannot be Open Access without restrictions, and that are kept as "closed off" as necessary, can nevertheless be FAIR [29]. For all research objects, restricted in use or not, there is a need for machines to independently access the data and "understand" what kinds of (machine and human) operations are possible and allowed [30]. The actual reuse of the data is subsequently always subject to a user license, however liberal [31]. This is, for instance, critical for industrial use of data, as data and other research objects that have not been properly licensed are viewed as having uncertain legal liability,
and thus cannot be easily reused by industry [32]. A very important aspect of the wide acceptance of FAIR data as a first class research output is that data are properly (indeed, automatically) cited upon reuse. Technologies to make effective and scalable data citation possible are in their early stages, but they will soon be well established [33]. A number of pioneering domain-specific implementation efforts and choices to help make data and metadata FAIR are also emerging [34]. In addition, tools that enable the planning of FAIR compliant metadata files and of data management and stewardship plans are being developed and tested [35], and increasingly these interact with tools that expose FAIR standards in dedicated FAIR repositories to stimulate reuse [36]. As stated earlier, FAIR data alone remain a lame substrate unless there are FAIR data consuming workflows, which in the Internet of FAIR Data and Services should be developed according to FAIR principles themselves, posing a whole additional set of choices and challenges [37]. The "implementation section" is completed by a set of articles that describe how the various choices impact their own and potentially other disciplines, such as sensitive personal health data [28], data describing physical rather than digital objects, as in biodiversity collections, biobanks and the geosciences [3], and a massive cross-cutting domain such as chemistry [18]. In all these areas there are different, but also overlapping, legal aspects associated with the reuse of data and workflows that should be addressed when publishing data and code for reuse [38]. Once all these decisions have been made, a final, and very important, decision for FAIR-oriented researchers and data stewards is "Where do I publish and archive my data with the maximum chance of proper reuse?"
This means that data repositories, too, should consider the aspects of FAIR metadata and of the data collections themselves, and how they can support long term reuse and preservation [35]. Finally, it is very important to develop tooling that, as objectively as possible, can measure the maturity of the FAIRness of digital resources [19,20], clearly demonstrating that FAIR is not a "binary" status, but an aspiration to move from scholarly communication assets that are "re-useless" for machines toward increasingly machine-actionable elements of an emerging Internet of FAIR Data and Services [33].
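The "very first requirement" noted above, that each FAIR Digital Object has a unique, persistent, resolvable identifier, can be sketched in a few lines: an identifier becomes actionable once it is prefixed with the URL of its global resolver. The DOI and Handle proxy addresses below are the real public resolvers; the lookup function itself is only an illustrative sketch, not an implementation from this issue:

```python
# Sketch: turning a persistent identifier into a resolvable (dereferenceable) URL.
# doi.org and hdl.handle.net are the real global DOI and Handle proxies;
# the function and its scheme names are illustrative assumptions.
RESOLVERS = {
    "doi": "https://doi.org/",
    "handle": "https://hdl.handle.net/",
}

def resolution_url(scheme: str, identifier: str) -> str:
    """Build the URL at which a persistent identifier can be dereferenced."""
    try:
        return RESOLVERS[scheme] + identifier
    except KeyError:
        raise ValueError(f"unknown identifier scheme: {scheme}")

# The FAIR principles paper itself has a DOI, so machines can resolve it via:
print(resolution_url("doi", "10.1038/sdata.2016.18"))
# → https://doi.org/10.1038/sdata.2016.18
```

Dereferencing such a URL with an HTTP client redirects to the current location of the object, which is what makes the identifier both persistent and resolvable even when the object itself moves.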
We realized early on that FAIR should not, and will not, be an exclusively academic exercise. As happened with the Internet as we know it, sooner rather than later private institutions and industry will (and should) join, and much of what we will see happening on the route to the envisioned Internet of FAIR Data and Services will be done in public-private partnerships and, in some cases, scaled by large industries. It is important that, at such an early stage, many industrial partners are already highly interested and are contributing their views on what their needs are [32]. A balanced development of a professional (indeed, commercial) backbone on which both academic and industrial applications can run will be as important for the Internet of FAIR Data and Services as it is for the current Internet. Moreover, given the commitments to Open Research and the "long tail" of technically-rich disciplines participating in the Internet of FAIR Data and Services, the principles of net neutrality and open standards will necessarily feature prominently and irrevocably, regardless of the emerging business models.

Outlook
It will be obvious after reading this special issue that FAIR compliant data stewardship will require many different skills that are not traditionally covered by the research curricula of contemporary students and researchers. Therefore, extensive training capacity and training materials are needed and must be developed. Some academic tools under development are described in [39], and commercial training options and in-company FAIR competence centers are also being developed [32]. Public research funders [7] as well as data driven private endeavors will increasingly call for proper (and funded) data stewardship plans, not only for research outputs, but also in the data-intensive processes of product approval, legislation and certification [32]. International agreements will be needed on good practices that can form the basis for better, FAIR and Open Science, as well as for well documented innovation and production processes. The envisioned Internet of FAIR Data and Services will form a backbone for this future societal innovation and may have a very high impact on human wellbeing and the responsible stewardship of our planet.
Thus, this special issue is entirely based on the concept of reusable FAIR digital objects that effectively form "one computer with one data set", as suggested by George Strawn in this issue: a computer that is distributed over the planet but functionally interconnected and made interoperable by the FAIR principles, for fair and equal use. This being said, we are fully aware of the limited and lumpy scope of the articles we were able to collect for this issue. Looking forward, the publisher and the co-editors therefore encourage, from the community at large, additional collections of practical Implementation Choices and recognized Challenges as contributions to what we envision will be a recurring special issue.