Unique, Persistent, Resolvable: Identifiers as the Foundation of FAIR

The FAIR principles describe characteristics intended to support access to and reuse of digital artifacts in the scientific research ecosystem. Persistent, globally unique identifiers, resolvable on the Web, and associated with a set of additional descriptive metadata, are foundational to FAIR data. Here we describe some basic principles and exemplars for their design, use and orchestration with other system elements to achieve FAIRness for digital research objects.


INTRODUCTION
The FAIR principles [1] are a model to guide the implementation of interoperable digital research artifacts which are fully discoverable and ultimately reusable in the scientific research ecosystem. These digital artifacts must be preserved long term as the basis of modern scientific research. However, recent publications suggest that up to 80% of scientific research data are lost within 20 years [2]. This estimate does not take into account the effects of practical unavailability of such data, when they have not been robustly and accessibly archived, even when they are not actually "lost". Data attrition blocks the ability to validate results and reuse information in science.
"FAIR" as an approach to reduce data attrition and support reuse has seen growing support in EU-and US-funded projects  and influential science policy recommendations [3].
This article focuses specifically on identifiers as required for practical implementation of the foundational "F" element of FAIR, the "Findability" of digital research artifacts. All the other FAIR principles ("A", "I", and "R") are based upon it. We cannot access, interoperate or reuse data without knowing how to identify them. And ultimately, all criteria of FAIR influence how we must provision identifiers.
FAIR data, software, and other digital research objects collectively support the validity of scientific research as communicated in the literature. Likewise, the scientific literature in turn supports FAIR data, when properly referenced, by describing the methods by which the data have been obtained.

IDENTIFIERS RESOLVABLE ON THE INTERNET
To be "Findable", digital objects must be uniquely specifiable, i.e., they must have unique names, and must be locatable, by association of the unique name with a protocol for retrieval. They must also have at least some associated descriptive metadata to assist discovery and verification of that object when the identifier is not known a priori. Recent work on scientific data "commons" architectures [4] and multicommons "data ecosystem frameworks" [5], highlighted here, reinforces the FAIR Accessibility principle that identifiers must be paired with such descriptive metadata [6,7,8].
An identifier is a unique name given to an object, property, set or class. Digital identifiers, by which we mean ordered strings of characters, uniquely name digital resources. An identifier may sometimes be "read", as if its string revealed meaningful semantics, but fundamentally, the underlying semantics of an identifier is that it means the resource to which it refers. Table 1 shows four common FAIR identifier types, which we discuss in detail later in this article. Identifiers have a scope within which they are verifiably unique. For digital artifacts this may be local scope within some particular data collection, as is the case with typical "accession numbers", the concept for which originated in museum, library and specimen collection practice and was extended to the biomedical sequence and literature databases.
In current best citation practice, identifiers must be globally unique and resolvable on the Internet, where digital identifiers are ordinary Web addresses, technically URIs, or Uniform Resource Identifiers (URIs) [9]. A URI is a compact sequence of characters uniquely identifying a resource. Resources may be physical things (e.g., a dog, building, microscope, star, and person), which are described by associated information), digital (e.g., a document, data set, catalog record, program executable, service endpoint, or the actual objects of interest), or abstract (e.g., a vocabulary term).
URIs begin with a scheme name which specifies the method used to resolve them, e.g., http, urn, ftp, LDAP, etc. In this article we will be speaking primarily of secure http URIs, which begin with "https://" [10], but using the term http URI to mean both secure and insecure variants. And we will discuss their resolution in the context of REST interfaces [11,12], the canonical way to access required metadata, including the resolution endpoints, for any persistent identifier.
An http URI gives a protocol for finding (resolving) and accessing the information resource to which it refers, that is, it is used to locate the resource. As noted, we find the "Accessible" requirement of FAIR is also embedded in secure https URI based identifiers.

IDENTIFIER PERSISTENCE AND INDIRECTION
The hosting of identified resources can change; publishing companies or data archives may merge, or be acquired, or the internal infrastructure at an institution may be revamped, modifying the presentation of a resource's URIs. This means there is a problem if identifiers are dependent upon local hostnames; this can be mitigated by indirection through central public resolvers. These provide resolution for the identifier URI by redirecting http GET requests [12] to a different http URI hosted by the information provider.

Unique, Persistent, Resolvable: Identifiers as the Foundation of FAIR
Identifiers may be made indirect by (Pattern 1) registering each one with a redirecting service (along with some metadata) as it is issued; or (Pattern 2) by registering a unique namespace, identifier pattern and resolver endpoint for the resource provider just once. In either case indirect identifiers become "persistable" in that changes to the endpoint URIs can be made invisible and non-disruptive to users either individually (Pattern 1 above) or in bulk (Pattern 2) when the provider hosting structure changes. Pattern 1 allows us to retrieve core metadata elements for individual identifiers. Pattern 2 allows us to retrofit global resolution and persistence behavior to systems that formerly assigned only locally unique identifiers such as database accession numbers.

EXEMPLAR PERSISTENT, RESOLVABLE IDENTIFIERS
A variety of services have been developed and made available in support of persistent identifiers, facilitating their consistent use/reuse, as well as global collaborative efforts to converge upon best practices in these areas. We highlight some exemplar services here, all of which comply with the recommendations in [13].

Digital Object Identifiers
A very well-known example of the Pattern 1 approach (above) is the Digital Object Identifier (DOI) system [14]. DOIs are widely used in the publishing industry to identify documents, and as data set identifiers and for data citation and interoperability, besides other uses. They provide a canonical example of how Pattern 1 should be implemented.
The International DOI Foundation  maintains the DOI system, supported by its various domain-centric Registration Agencies (RAs). The most important RA for our purposes is DataCite  , a member supported organization that registers data sets [15]. DataCite membership gives an organization the right to mint a certain number of DOIs per year, at various subscription levels. Over 16 million unique DOIs have been registered with DataCite. DOI resolution is free.
DOIs consist of a DOI name which is resolved at https://doi.org, with the full URI formulated according to the pattern https://doi.org/<DOI_name>. DOI names in turn consist of a prefix and a suffix, separated by a forward slash. The prefix is a code indicating the registrant who issues the DOI, e.g., Harvard University Dataverse -10.7910; Dryad Digital Repository -10.5061. The suffix is the identifier, in any form, assigned by the registrant. Metadata, including the data package URI, will have been registered with DataCite at the time the identifier was assigned. Full resolution to the actual data is a multi-step process. First, the client sends an http GET request to the doi.org resolver for the DOI of interest. Then doi.org sends a redirect message to the client with the address of the data set's "landing page", which holds the metadata, including the resolution endpoint(s) for its data.

Unique, Persistent, Resolvable: Identifiers as the Foundation of FAIR
Why is this multi-redirect process required? The first redirect separates the endpoint of the data from their persistent identifier, providing isolation from infrastructure changes at the provider side. The second redirect allows decisions to be made about how to handle the data set prior to access. For example, we may want to ensure that we have the correct data set by scanning the metadata; we may not want to directly download a large data set before we ascertain it is the correct one (via the metadata); and if very large, we may wish to consider transit costs, and transport speed, before we decide how to access it. If the data exist at multiple mirror sites, we might want to consider proximity. These decisions can be made in real time by humans working at a browser, or on a command line; or they can be made algorithmically by "content negotiation" between services [16,17].
Metadata at the DataCite service is provided in both human and machine readable forms. Machine readable metadata is provided both in DataCite XML and in the schema.org [18] vocabulary, serialized as JSON-LD [19]. Future work may also enable incorporation of terms from domain-specific schema.org extensions such as bioschemas  [20].

Archival Resource Keys
Archival Resource Keys (ARKs) are another widely used persistent identifier, supported by the California Digital Library [21], in collaboration with DuraSpace  . ARKs work similarly to DOIs, but are more permissive in design. Over 500 registered organizations have created over 3.2 billion ARKs. There is no fee for registering or resolving ARKs. Metadata is strongly recommended for publicly released ARKs, but is optional and flexible while the resource is being planned, reviewed, tested, etc.
The ARK scheme is decentralized; any organization can run its own resolver, but all can be resolved through the global Name-to-Thing resolver  . N2T operates in two modes. It stores records for millions of ARKs registered by the EZID service  [22] and the Internet Archive. And in collaboration with identifiers.org (below), it stores 3,500 rules for redirecting identifiers that it does not store individually.
ARKs have the syntax: https://NMA/ark:/NAAN/Name [Qualifier] where: · NMA is the Name Mapping Authority; · NAAN is the Name Assigning Authority Number (like a DOI prefix), · Name is the local identifier issued by the NAAN, and · [Qualifier] is an optional qualifier.
Name Mapping Authorities operate ARK resolvers. Name Assigning Authorities have NAANs allowing them to mint globally unique ARKs.

Identifiers.org and CURIEs
The identifiers.org system [23,24,25] has been providing globally resolvable http URI-based identifiers to the scientific community for over a decade. The core system consists of a resolver and a namespace mapper, following Pattern 2 (Section 3). It registers a unique namespace, a namespace prefix, an identifier regex and a local identifier resolver URI, to provide globally unique persistent identifiers. It records metadata only for the identifier namespace, not for individual identifiers, relying upon the data provider to expose the appropriate metadata within the resolved landing page.

Unique, Persistent, Resolvable: Identifiers as the Foundation of FAIR
· Resolver is resolution host, currently either identifiers.org or n2t.net; · Provider is an optional term indicating specific providers for a Namespace, e.g., "RCSB", "PDBe", "PDBj", etc., for the Protein Data Bank's various providers; · Namespace is a registered prefix assigned to an identifier issuing authority; and · Identifier is the locally assigned identifier.
Taken together, a namespace prefix and colon followed by a local identifier constitute a CURIE, or Compact Identifier [13,23].
Identifiers.org records all known resource-specific access URIs for a data set or data record, including alternative access URIs. This information allows numerous equivalent URIs to be considered to be a single canonical URI string. This mapping, retrievable as a data set in its own right, is integral to platforms like EMBL-EBI RDF and Open PHACTS [26], where identifiers.org URIs act as a semantic glue in distributed queries.

PURLs
Persistent Uniform Resource Locators (PURLs) generically are http URIs whose services redirect GET requests to another location-based URI, with the added promise of long-term persistence. All identifier types discussed above meet this definition. However, in common usage, a PURL is an identifier minted and resolved at https://purl.org, hosted since 2016 by the Internet Archive  , after being transferred from OCLC, where the service started in 1995. That the ownership was transferred to the Internet Archive highlights that particular care was taken to preserve and continue operation of this identifier infrastructure and its registrations. A separate PURL registry exists at https://purl.obolibrary.org, hosted by the OBO Foundry. OBO Library PURLs are widely used by the OBO ontologies community to persistently reference their information artifacts, for which they are well suited  .
Users of purl.org register to gain control of a sub-path, e.g., http://purl.org/dc/ and then assign and modify the redirection targets for individual entries like http://purl.org/dc/terms/. The ability to update the PURL target is crucial for persistence, for instance if ownership of a service changes, or the server infrastructure changes URL layout. In this sense a PURL is a counter-measure to unintended violations of "Cool URIs don't change" [27].

DATA CITATION
The Joint Declaration of Data Citation Principles (JDDCP) provides a set of guiding principles for publishing and citing research data. Many implementation guidelines developed for JDDCP [15,16,23,28,29] provide rich foundational practices for FAIR, including globally unique persistent identifiers, indirect Unique, Persistent, Resolvable: Identifiers as the Foundation of FAIR resolution to a landing page or service, and accessibility of human and machine readable metadata. In our view, it is not possible to reliably reuse data without robust links to provenance documentation. Ultimately, literature describing how data were obtained provides the rationale for accepting them as reliable. Conversely, no assertion in the scientific literature is ultimately credible unless backed by reliable data. Good data sharing, archiving and citation practices followed by authors, publishers, repositories, funders and institutions, along with good identifier and metadata practices, and appropriate credit incentives [30,31], will promote creation of normative FAIR data integrally associated with the literature.

IDENTIFIERS PLUS METADATA
Sensible strategies that work across projects, domains and infrastructures permit realization of an open landscape where science can operate reliably with massive amounts of data. Large scale "data commons" projects such as the European Open Science Cloud, the NCI Genomic Data Commons, and the NHLBI Data Stage are initial steps in this direction. Investigators in such programs understand that frameworks for interoperating across commons in a "data ecosystem" [5] of FAIR research objects require not only persistent resolvable identifiers, but well-documented APIs for associated human-and machine-readable metadata. These metadata should optimally include not only provenance information but descriptive "guide" metadata to enable discoverability. A large, active community is engaged in determining and establishing best practices around persistent identifiers, metadata standards and exposure mechanisms; there are many active "interest" and "working" groups engaged in such activities under the auspices of the Research Data Alliance  .
Properly managed persistent identifiers with their associated metadata are the primary access point to enable Findability in the FAIR data ecosystem, and to provide a basis for all the other FAIR attributes.