Distributed Analytics on Sensitive Medical Data: The Personal Health Train

In recent years, as newer technologies have evolved around the healthcare ecosystem, more and more data have been generated. Advanced analytics could harness the data collected from numerous sources, whether from healthcare institutions or generated by individuals themselves via apps and devices, to drive innovations in the treatment and diagnosis of diseases, improve the care given to patients, and empower citizens to participate in the decision-making process regarding their own health and well-being. However, the sensitive nature of health data prohibits healthcare organizations from sharing the data. The Personal Health Train (PHT) is a novel approach, aiming to establish a distributed data analytics infrastructure enabling the (re)use of distributed healthcare data, while data owners stay in control of their own data. The main principle of the PHT is that data remain in their original location, and analytical tasks visit the data sources and are executed there. The PHT provides a distributed, flexible approach to use data in a network of participants, incorporating the FAIR principles. It facilitates the responsible use of sensitive and/or personal data by adopting international principles and regulations. This paper presents the concepts and main components of the PHT and demonstrates how it complies with FAIR principles.


INTRODUCTION: MOVING FROM CENTRALIZED DATA SHARING TO EMPOWERING DATA OWNERS TO GAIN CONTROL OVER DATA REUSE
Data-driven technologies are changing business, our daily lives, and the way we conduct research more than ever. In recent years, more and more data have been generated in the healthcare ecosystem. The data contain potential knowledge to transform health care delivery and life sciences. Advanced analytics could potentially harness the data collected from numerous sources to improve prevention, diagnosis and treatment of diseases, and to support individuals and societies in maintaining their health and well-being.
The era of exponential growth of data has also witnessed an increase in the risk involved in sharing them. Countries are quickly adopting policies and formulating laws that regulate the collection, use, and sharing of personal data. In the USA, the Health Insurance Portability and Accountability Act (HIPAA) limits the sharing of sensitive data. In the European Union, the General Data Protection Regulation sets a well-formulated directive for securing confidentiality and privacy of citizens so that the data are not available publicly without explicit, well informed specific consent, and cannot be used to identify a subject without additional information stored separately [1]. PIPEDA in Canada, the Data Protection Act (DPA) in the UK, the Russian Federal Law on Personal Data, the IT Act in India and the China Data Protection Regulations (CDPR) all reflect the increasing global awareness regarding the importance of data privacy and confidentiality [2,3,4,5,6]. Patients and the general public are becoming more and more aware of the use of their personal data and more reluctant to share data. The current norm is that disclosure of health data without proper consent is a breach of privacy, which harms the fundamental right of freedom from intrusion or interference by others. Organizations that safeguard trusted information thus have a duty to ensure confidentiality [7]. Anonymization and data masking are common solutions applied to protect privacy, although these methods cannot totally mitigate the risk of re-identification [8]. Big data analytics applications increase the risk of (re)identification, since linking various data sources increases the amount and quality of information [3,9]. These high dimensional data sets can be used to infer sensitive information at the individual or at the subpopulation level.

Due to ethical concerns, a huge amount of usable health data is currently trapped inside the organizational boundaries of hospitals and clinics and within patients' devices. Many healthcare institutions implement centralized repositories by pooling data from multiple systems into data warehouses or data lakes [10]. Sharing these data outside the organization's boundaries is not a viable solution, since anonymization may not be possible for certain data types such as genomic data, and since linking data sets increases the re-identification risk. Alternatively, research communities build domain-specific data infrastructures [11], e.g., for bioinformatics, cohort studies, clinical research or biobanks. The problem of accessing data from outside the network remains, and since data are collected for a specific use and duplicated outside of the first data source, this approach limits record linkage and the integration of multimodal data.
As a technical alternative to centralized data sharing, i2b2 and DataShield provide software and tools to support querying and analysis of sensitive data in a distributed fashion, each proposing its own technology stack and tools [12,13,14]. Nonetheless, since health data are generated and stored in highly diverse systems by heterogeneous stakeholders, it is very unlikely that these infrastructures will converge on a single solution.
Another aspect is social and cultural. The sensitive nature of health data makes individuals and institutions hesitant to share their data. From the public perspective, people are more willing to accept and participate in data sharing if they are informed about existing safeguards and governance mechanisms. They are willing to contribute to science for better care and wellbeing; however, they want to decide who can use their data and for what purposes, and to make sure data users are accountable for their actions. A survey among 603 secondary data users shows that 56% of the researchers who are willing to share their data demand a context with access control, and want to have a say in, or at least knowledge of, the use of the data [15]. The current data sharing practice does not allow the owners to decide who can access the data and for which purpose. Data sharing and licensing agreements do set terms and conditions, such as limiting use to a certain number of research purposes, specifying the conditions of data transfer, prohibiting attempts to establish individual identities, or fixing a maximum period before data have to be destroyed. However, once data are outside the institutional boundaries, there is no mechanism to enforce these policies.
The Personal Health Train (PHT) proposes an alternative approach which encompasses both technological and social aspects of sensitive data reuse. When data sharing is not achievable, using distributed analytics on distributed data becomes a viable solution. The PHT does not require the transfer of data from the holding entity. Rather than moving the data to the requester, it moves the analytics tasks to the data repositories and executes the tasks in a secure environment. In this approach, the owner of the data can remain in control and decide which part of the data will be analyzed for which specific purposes and by whom. This new approach requires discovering, understanding, exchanging and executing analytics tasks with minimum human intervention.
FAIR principles become relevant not only for data but also for analytics tasks. In the fragmented landscape of data, interoperability and accessibility can be ensured by applying FAIR principles to the analytics tasks and system components that interact with these tasks. In this paper, we will demonstrate the application of FAIR principles to the Personal Health Train approach.

AN OPEN ECOSYSTEM WHERE DATA MEETS ANALYTICS: MACHINE READABILITY AT THE CORE
The Personal Health Train provides an infrastructure to support distributed and federated solutions that utilize the data at the original location. Typically health data are produced by diverse sources, including care institutions, biomedical researchers, imaging facilities, clinical and population studies, genomic sequencing centers and by citizens themselves. The PHT creates an open ecosystem by making self-contained, machine-readable analytics tasks exchangeable and executable in diverse systems. The PHT does not prescribe any specific standard or technology for data; instead, it only requires publishing individual choices as metadata. The PHT focuses on making data, tasks, processes and algorithms findable, accessible, interoperable and reusable (FAIR). As a result, it enables data providers and data users to match FAIR data to FAIR analytics and empowers them to make informed decisions about participating in specific applications.
The PHT provides an alternative solution to reuse the data in institutional data silos or citizens' personal data stores. It targets maximal interoperability between diverse systems, by focusing on machine-readable and interpretable data, metadata, workflows and services. The core design principle is to give data owners authority to decide and monitor the use of their data. Eventually, this will lead to the creation of the Internet of FAIR data and services that operates on personal health data that can never be completely open.
An example application is the training of a patient survival prediction model. This case requires assessing and analyzing a large amount of real-world, high-dimensional, multimodal personal data. In the health domain, this corresponds to information such as longitudinal medical records, diagnostic tests such as imaging, genomic profiles, and patient-generated health data and outcomes via apps and wearable devices. To discover hidden patterns, the full data set should be made available to the machine learning task, but the privacy-driven requirement of data minimization limits the personal data to those elements deemed directly relevant and necessary to accomplish a specified purpose. The PHT approach could unleash the potential of big data analytics for personal data without compromising privacy. The machine learning model can be sent across various health-care providers through the PHT infrastructure without data ever leaving the organizational boundaries [16,17].
The PHT defines the following three core components:

Station:
Provides curated, confidential data and acts as a FAIR data point. Stations expose data in a discoverable format, define an interface to execute queries, provide computational resources and execute analytic tasks in a secure environment. Stations are registered, and the schemas and the metadata of the data provided by a Station are published through Station Registries.

Train:
Data Consumers intend to access privacy-sensitive data from multiple curators and to execute a data analytics algorithm to derive insights from the data. They formulate the queries and specify the analytics algorithm. The set of all artifacts required to execute the distributed algorithm and return the results is called a "Train". A Train is identified by a Digital Persistent Identifier (PID) and contains a self-sufficient message with all the information required to transfer code and results between the relevant parties. Trains may be simple or complex, with different kinds of wagons that are also uniquely identifiable digital objects. Each wagon may have its own resources of many different types. A Train carries different components, namely metadata that stores the Train's unique digital persistent identifier, a study description, the query used in data extraction, analytics for data utilization and aggregation for result integration. Once the Train is specified, the consumer uploads it to the Train Repository and sends the reference of the Train to the handling Station. Trains are registered in a Train Registry to make them identifiable. The consumer has no direct access to the data sources, and humans are entirely decoupled from the computation phase until the algorithm has finished.

Handler (Track):
Acts as a gateway between the consumer and the curators. It orchestrates communication by receiving self-sufficient Trains from the consumer and forwarding them to selected Stations. It may act as a broker and may aggregate results from multiple curators. It manages Train and Station states and logs the transaction information for future auditing. Essentially, the Track is a centralized point of trust. The Train dispatcher module of the Track transfers the PHT Train either as payload or as a reference. Container execution modules at the Station (platforms in the PHT metaphor) consume the PHT Train and execute the provided algorithm. The results from different Stations are evaluated and aggregated by the Track and sent back to the consumer.
The PHT proposes a technology-agnostic implementation by defining commonly agreed Train metadata. By design, it enables shipment of any analytic task written in any programming language. Figure 1 sketches a high-level representation of the various components of the PHT architecture.
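
To make the interplay of the three components more concrete, the following is a minimal sketch in Python, not the PHT reference implementation. All class names, field names and the PID value are hypothetical and chosen only for illustration; a real Train would be shipped as a self-contained package (e.g., a container) rather than as in-process functions. The sketch shows a cohort-counting Train that visits two Stations through a Handler, with only aggregate results leaving each Station.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Train:
    """Hypothetical Train: metadata plus query, analytics and aggregation."""
    pid: str                                   # persistent identifier (illustrative format)
    description: str                           # study description
    query: Callable[[dict], bool]              # selects matching records at a Station
    analytics: Callable[[List[dict]], int]     # runs on the selected records only
    aggregate: Callable[[List[int]], int]      # combines the per-Station results


@dataclass
class Station:
    """Hypothetical Station: holds private records and executes visiting Trains."""
    station_id: str
    _records: List[dict]                       # private data, never returned to the caller

    def execute(self, train: Train) -> int:
        selected = [r for r in self._records if train.query(r)]
        return train.analytics(selected)       # only the analytics result leaves


@dataclass
class Handler:
    """Hypothetical Handler (Track): dispatches Trains, logs, aggregates."""
    stations: List[Station]
    log: List[tuple] = field(default_factory=list)

    def dispatch(self, train: Train) -> int:
        results = []
        for station in self.stations:
            results.append(station.execute(train))
            self.log.append((train.pid, station.station_id))  # audit trail
        return train.aggregate(results)


# Usage: count lung cancer patients across two Stations with made-up records.
cohort_counter = Train(
    pid="train:example-001",
    description="Count patients with a given diagnosis",
    query=lambda record: record["diagnosis"] == "lung cancer",
    analytics=len,
    aggregate=sum,
)
handler = Handler([
    Station("A", [{"diagnosis": "lung cancer"}, {"diagnosis": "asthma"}]),
    Station("B", [{"diagnosis": "lung cancer"}]),
])
total = handler.dispatch(cohort_counter)
```

In this sketch the consumer only ever sees `total` and never the individual records, which mirrors the principle that the analytics travel to the data rather than the reverse.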

FOLLOWING FAIR PRINCIPLES FOR DISTRIBUTED ANALYTICS
FAIR refers to a set of guiding principles that aim to enhance the ability of machines and individuals to automatically find and use data [18,19]. Although the principles were originally designed for data management and stewardship, with a focus on making data self-explainable and discoverable, they can be applied to any digital object with the goal of creating an integrated and harmonized domain to support reusability [20]. The PHT approach promotes improving the reuse of data by sharing analytics, which can interact with the data and complete their task without giving the end user access to the data. Within the PHT, the FAIR principles are applied to both the Train and Station concepts, keeping in mind that the goal is enhancing the reusability of distributed data with distributed analytics.
Clearly, making data self-explainable and discoverable goes a long way to ensure reusability. However, this may not always be possible, specifically when data are sensitive and have not been collected for research purposes. In the case of data collected during routine healthcare, for example, it is likely that data are stored in heterogeneous systems and follow the data standards imposed by the requirements of daily transactions, such as HL7 or DICOM, which might not support the desired level of metadata and persistent identification schemes. Therefore, the PHT needs to interact with data repositories that may or may not follow FAIR principles, even though having FAIR data is highly desirable. Participating data repositories independently decide to which degree they will support FAIR data. They act as FAIR data points [21] by implementing custom interfaces supporting the computational task that reuses data.
The PHT sets machine readability at the core, aiming for maximal interoperability between diverse systems. Therefore, it is well aligned with FAIR principles. The components of the PHT infrastructure support FAIR principles to varying degrees (Figure 2).

Station Private:
Access to health data has restrictions which derive from the original consent obtained from the patient or from data protection policies of involved institutions [22]. These data should be kept in a secure part of the Station which is not accessible by external data consumers. FAIR is not a requirement for the private part of the data repositories in which data and metadata reside, which may follow preferred institutional standards. However, the access of the analytics tasks should be supported by having a queryable consent, a mechanism to link data sets, and a virtual layer to support integrated queries over diverse data sets. This requires applying a formal and shared knowledge representation. In conclusion, the private part of the Station should support Interoperability.

Station Controlled:
This part of the Station provides a secure environment for executing analytics tasks. It supports Accessibility by following standardized communication protocols to discover and receive Trains. Analytics tasks can be delivered with open, free and universally implementable protocols that mandate authentication and authorization procedures. Access control to data remains under the sovereignty of the Station, but results are communicated with open protocols. Ideally, ontology-based access control can be applied [23].
Station Public: Each Station is uniquely identified with a persistent identifier and registered in a registry with its metadata. It improves findability by publishing the metadata about the data repositories, as well as the computational environment.
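
As an illustration of how published metadata could support a Station's decision to admit a Train, the sketch below compares a Train's declared requirements with a Station's published profile. It is only a sketch under stated assumptions: the field names (`data_standards`, `memory_mb`, `available_memory_mb`), the function `can_admit` and the two checks are hypothetical and are not defined by the PHT; an actual Station would also evaluate consent and access policies.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class TrainRequirements:
    """Hypothetical requirements a Train declares in its metadata."""
    data_standards: FrozenSet[str]   # standards the Train can read, e.g. {"FHIR", "RDF"}
    memory_mb: int                   # computational resources it asks for


@dataclass(frozen=True)
class StationProfile:
    """Hypothetical metadata a Station publishes in a Station Registry."""
    data_standards: FrozenSet[str]   # standards in which the Station exposes data
    available_memory_mb: int         # resources the Station is willing to provide


def can_admit(station: StationProfile, train: TrainRequirements) -> Tuple[bool, str]:
    """Return (decision, reason) based on metadata alone, before any execution."""
    if not (train.data_standards & station.data_standards):
        return False, "no compatible data standard"
    if train.memory_mb > station.available_memory_mb:
        return False, "insufficient computational resources"
    return True, "ok"


# Usage with made-up values.
station = StationProfile(frozenset({"FHIR"}), available_memory_mb=2048)
light_train = TrainRequirements(frozenset({"FHIR", "RDF"}), memory_mb=512)
heavy_train = TrainRequirements(frozenset({"FHIR"}), memory_mb=8192)
omop_train = TrainRequirements(frozenset({"OMOP"}), memory_mb=512)

decisions = [can_admit(station, t) for t in (light_train, heavy_train, omop_train)]
```

Because the decision uses metadata only, it can be taken before the Train is transferred or executed, which also gives the Station a simple way to protect itself against excessive resource demands.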
Trains: Data analytics tasks support all four dimensions of FAIR metrics. They are Findable, as Trains are uniquely and persistently identified resolvable digital objects that are registered in a Train Registry, searchable by their metadata. Train objects are persistently stored in repositories that contain all source and environment information required to execute them. They are Accessible with open, free, and universally implementable protocols allowing authentication and authorization. They are Interoperable since every Train described by metadata uses a formal, accessible, shared and broadly applicable language, e.g., XML, for knowledge representation. The metadata defines both the content and provenance of the analytics task, such as what is the intended use, who developed it, what are the consent requirements, and also the requirements of specific tasks such as dependencies, prescribed data standards and computational resources. They are self-contained, which enables virtualization to support interoperability during execution. Trains are Reusable: they are designed to be reused in multiple locations. License and certification can be assigned to Trains. They keep detailed provenance metadata, including execution history.

What FAIRness means for a PHT Station:
· (F) As a data owner, I want to provide enough metadata to be discoverable and published by Station registry;
· (I+A) As a Station administrator, I want to judge if a specific Train can use my data (e.g. compatible data standards), or if I have the required computational resources (e.g., metadata descriptions of Trains) before I provide a permission;
· (I+A) As a Station administrator/dispatcher, I want to set a mechanism to prevent a high demand for computational resources (e.g., prevent a crash in the Station);
· (I+A) As a Station administrator, I want to interact with Trains through defined interfaces for providing data input, and executing the tasks.

What FAIRness means for a PHT Train:
· (F) As a data consumer, I want to find already implemented Trains for a specific task (e.g. calculate hospital readmission rates for a specific case) (Train metadata);
· (F+R) As a data consumer/owner, I want to find exactly the same Train without any change after two years to replicate the computation (persistency policy for Trains);
· (A+I) As a data owner, I want to guarantee that the Train deposited and persistently identified in a repository, is the same Train that I receive (e.g. methods such as checksum);
· (A) As a data consumer, I want to guarantee that the Train that I am sending over public network is securely transferred (is there a mechanism e.g. some public/private keys);
· (A) As a data consumer/owner, I want to apply authorization and authentication policies to Train repositories for identity management.
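
The checksum idea mentioned for Trains can be sketched as follows. This is a minimal illustration, assuming that the repository stores a SHA-256 digest next to the Train's PID; the function names and the example payload are hypothetical, and the PHT does not prescribe a particular hash function. A checksum alone only detects modification; protection against a malicious party would additionally require signatures (e.g., public/private keys).

```python
import hashlib
import hmac


def train_checksum(payload: bytes) -> str:
    """Compute a SHA-256 digest of the serialized Train (assumed hash choice)."""
    return hashlib.sha256(payload).hexdigest()


def verify_train(payload: bytes, registered_checksum: str) -> bool:
    """Check that the received Train matches the checksum registered with its PID."""
    # compare_digest performs a constant-time comparison of the two hex strings
    return hmac.compare_digest(train_checksum(payload), registered_checksum)


# Usage: the depositor registers the checksum, the Station verifies on receipt.
deposited = b"analytics: count patients; query: diagnosis == 'lung cancer'"
registered = train_checksum(deposited)

unchanged_ok = verify_train(deposited, registered)
tampered_ok = verify_train(deposited + b" ; extra step", registered)
```

Here `unchanged_ok` is true and `tampered_ok` is false, so a Station can refuse to execute a Train whose content differs from what was deposited under its PID.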
Train repositories are the building blocks to achieve the FAIRness of analytics tasks. They should adopt and follow the recommendations set for data repositories, namely persistent identification, application programming interface (API), Train curation and moderation workflows, accessibility, license for reuse, and sustainability [24]. The first recommendation to follow is assigning persistent and global identifiers to each Train. Various identifier schemas such as URIs or DOIs can be employed. Trains which are deposited to private registries should be described with rich descriptive and operational metadata and can be registered to public repositories such as DataCite. Trains should receive a PID, ideally at the earliest workflow state, and in order to support later operations the PID should be embedded in the object [25]. Identification of Trains with PIDs and having associated machine-readable metadata can facilitate distribution of Trains in a Digital Object Architecture [26]. The second recommendation for FAIR Train repositories is to offer a set of well-documented APIs to ensure programmatic access to Trains and Train metadata. The next recommendation is providing a platform to support data scientists to define and moderate their Trains composed of analytics tasks and metadata. Similar to data curation experts, data scientists will require tools where they can check, verify and approve the content. The accessibility requirement of the Train repository should be ensured by open and implementable protocols such as HTTP(S) and FTP. Moreover, licenses for reuse should be clearly defined for Trains.
Currently, there are various options for licensing data and database rights [27]. Further investigation should be carried out to associate licenses to Trains reflecting the intellectual property and copyrights of analytics tasks. The last requirement is sustainability: Train repositories should have a long term preservation strategy.
The PHT Track or Handler monitors the request/response cycle between Trains and Stations and executes the aggregation tasks whenever required. All communication is logged by the Handler. As a result, it improves the transparency and accountability of the involved partners. Table 1 summarizes the supported FAIR principles by the PHT.

What FAIRness means for PHT Tracks:
· As a dispatcher, I want to have enough metadata to understand the content and requirements of a Train, so that I can route them to the relevant Stations;
· As a dispatcher, I want to have enough metadata about Stations, so that I can communicate with them and direct the relevant Trains to matching Stations;
· As a dispatcher/auditor, I want to check who is sending this specific Train, for which purposes and to which data Stations;
· As a dispatcher, I want to have enough information about the status of the computation, so that I can communicate results when an execution step is finished;
· As an auditor, I want to have enough provenance metadata so that I can trace and replicate execution flows when needed.

CONCLUSION AND OUTLOOK
The PHT is a novel approach establishing a FAIR distributed data analytics infrastructure enabling the (re)use of distributed healthcare data, while data owners stay in control of their own data. In summary, the PHT: (i) empowers citizens and organizations to control the use of the data that reside in their own data repositories for the benefit of the individual and society, (ii) improves the usability of health data by lowering the barriers for data protection, by ensuring that the privacy and confidentiality of the data subject will be preserved, (iii) ensures data sovereignty beyond data security and privacy by supporting the responsible use and builds trust between data consumers and data owners by making analytics processes repeatable, transparent and auditable, and (iv) applies FAIR principles to the protocols of how data analytics interacts with FAIR data points by making data analytics tasks themselves FAIR and placing machine readability at its core.
The PHT provides a distributed, flexible approach to use data in a network of participants, incorporating the FAIR principles. The PHT facilitates the responsible use of sensitive and/or personal data by adopting international principles and regulations. It supports accountability by providing provenance of analytics execution and audit mechanisms.
The PHT has already been implemented in various use cases. The Maastro clinic has implemented a Patient Cohort Counter (PCC) "Train" as a demonstration using multiple data representations. The PCC calculates the number of matching patients and cohort statistics for a specific disease at a PHT data Station. The PCC can work with different data sources with different data representations (e.g., FHIR, RDF, OMOP-OHDSI, CDISC-ODM) and is agnostic to the underlying data. The current implementation works with two data sources, one with RDF based on the Radiation Oncology Ontology, and one Station using FHIR. Other applications are the Varian Learning Portal by Varian Medical Systems and the open source software ppDLI by IKNL, which are both example implementations of distributed learning PHT infrastructures in healthcare. One use case demonstration is the development of a distributed Bayesian network model to predict dyspnea after radiotherapy for lung cancer patients, which has been developed and used in the Varian Learning Portal using data from five different hospitals. The ppDLI implementation currently provides a ready-to-use implementation of the distributed Cox Proportional Hazards algorithm [16]. The SMITH and DIFUTURE projects, funded by the German Medical Informatics Initiative, have developed cross-consortia implementations and tested phenotyping use cases [28]. The PHT approach can be applied to various other domains that need to process data but cannot share them due to their sensitive nature, such as the agricultural sector and the courts.