Social Data: CESSDA Best Practices

The European Commission report “Turning FAIR into reality” provides an index of 27 FAIR Action Plan recommendations. This index has been used for a self-assessment of CESSDA, the Consortium of European Social Science Data Archives. CESSDA is performing well on “Concepts for FAIR implementation”, “Skills for FAIR”, and “Investment in FAIR”; there is work in progress on “FAIR culture”, and work to start up on “FAIR ecosystem” and especially on “Incentives and metrics for FAIR data and services”. Next, an analysis of the FAIR components reveals that CESSDA has accomplished the “F”, is working on the “A” (taking into account the sensitivity and security requirements of social data), has just started on the “I”, and that there is a lack of clarity on what the “R” should comprise. On Findability, the CESSDA Data Catalogue is explained, showing the building blocks that need to be in place before one can produce a catalogue. The article ends with a forward look at CESSDA's deployment of the FAIR principles.


Social Data Research Infrastructures
In 2017 CESSDA became a European Research Infrastructure Consortium (ERIC), a European legal entity status assigned by the European Commission. CESSDA has 19 Members and 1 Observer, and each member country must assign a national data service provider. The CESSDA Main Office is in Bergen, Norway.
CESSDA's mission is:
- to provide a distributed and sustainable research infrastructure enabling the research community to conduct high-quality research in the social sciences and contributing to the production of effective solutions to the major challenges facing society today; and
- to facilitate teaching and learning in the social sciences.
CESSDA's capital value is 117 M€ and its operating costs are 39 M€ per year [2] (p. 212). This makes CESSDA one of the major players within the Social & Cultural Innovation Cluster (see Table 1). The table also shows that this cluster is rather small compared to the other clusters; that it can be characterised as a distributive infrastructure (see also [3] Wittenburg, 2019); and that its annual operating costs are relatively high compared to its investment costs.

Table 1 (selected columns). Capital value and annual operating costs per ESFRI cluster [2]:

Cluster | Capital value (M€) | Operating costs (M€/yr) | Operating/capital (%)
Data, Computing & Digital Research Infrastructures | 500 | 60 | 12
Energy | 3,354 | 127 | 4
Environment | 2,298 | 219 | 10
Health & Food | 2,410 | 337 | 14
Physical Sciences & Engineering | 10,336 | 1,070 | 10
Social & Cultural Innovation | 382 | 81 | 21
Total | 19,280 | 1,894 | 10

Next to its distributive structure and high annual running costs, there is another important feature of social data: the amount of sensitive data. Rough estimates indicate that 40% of the data need protective measures, e.g. further anonymisation, remote execution, or secured access. This has implications for the degree of openness and accessibility of social data, and may limit the interoperability of the data (connecting data with contextual data, linking data using semantics).
Other characteristics of social data are its multilingualism, which requires additional effort to make data comparable and thus reusable, and concerns about data quality and provenance: as anybody can generate social data, it is important that there is a quality check on the data. As a first step in providing trust and ensuring data quality, CESSDA requires that its national service providers hold the CoreTrustSeal (CTS, www.coretrustseal.org).

Open Science and FAIR
Open Science is an umbrella term for trends in scientific communication, data sharing, and software development. A good explanation of why we need open science is given in the Lamy report [4].

FAIR ACTION PLAN
The FAIR principles [5] are connected to open science, stressing that research data should be Findable, Accessible, Interoperable and Reusable [6]. The EC Expert Group on FAIR Data, chaired by Simon Hodson and Sarah Jones, developed an Index to FAIR Action Plan Recommendations [1] (p. 60). This index has been used for a self-assessment of CESSDA, a first attempt to map the readiness of the infrastructure against the FAIR Index. There are 27 recommendations, specifying actions for different stakeholders. In the assessment, each recommendation is scored on the recommended activities for service providers:
- Green: activities are taking place in a sufficient way (green does not imply that activities can stop);
- Orange: activities are planned (in the pipeline) or have just started;
- Red: activities are not yet planned.
The scoring is subjective, based on CESSDA activities (strategy, working plans, projects). Ideally such a self-assessment would be checked by an external team, but as this is a first attempt to use the FAIR Index, we are mainly interested in whether this approach is usable at all.
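The traffic-light assessment can be recorded as simple structured data and summarised per action area without collapsing it into a single number. The following Python sketch is illustrative only: the recommendation numbers and statuses shown are a made-up subset, not the full Table 2.

```python
from collections import Counter

# Illustrative subset of the 27 FAIR Action Plan recommendations, grouped
# by action area. Statuses follow the traffic-light scheme from the text:
# "green" = taking place, "orange" = in the pipeline, "red" = not yet planned.
# The concrete values below are hypothetical examples, not CESSDA's scores.
assessment = {
    "Concepts for FAIR implementation": {1: "green", 2: "green", 3: "green"},
    "Skills for FAIR": {10: "green", 11: "green"},
    "FAIR culture": {4: "orange", 5: "green", 6: "orange", 18: "red"},
    "Incentives and metrics": {12: "red", 13: "orange", 25: "red", 26: "orange"},
}

def dashboard(assessment):
    """Count colours per action area -- a status overview, not a score."""
    return {area: dict(Counter(scores.values()))
            for area, scores in assessment.items()}

for area, counts in dashboard(assessment).items():
    print(f"{area}: {counts}")
```

Keeping the colours as categories (rather than adding them up) reflects the argument made in the Conclusion: a dashboard shows weak spots without inviting score-chasing.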

We used the FAIR Action Plan Index to map CESSDA's position (Table 2). The Consortium is performing well on "Concepts for FAIR implementation", "Skills for FAIR", and "Investment in FAIR", as most of the recommendations are already taking place (green in the table) or are in the pipeline (orange in the table). There is work in progress on "FAIR culture" (most recommendations are orange, that is, in the pipeline), and work to start up on "FAIR ecosystem" and especially on "Incentives and metrics for FAIR data and services" (few recommendations are implemented (green); most are in the pipeline (orange) or not yet started (red)).
In detail, the scores are as follows. CESSDA is performing well on:
- "Concepts for FAIR implementation": CESSDA Service Providers put a lot of effort into long-term stewardship, timeliness of data sharing, assessability and legal issues (1); CESSDA has implemented a persistent identifier policy and uses tools to capture metadata from the national service providers (2); CESSDA distinguishes and uses the essential components of the FAIR ecosystem (3). CESSDA is working on case studies to get into contact with research communities (16); it takes part in the Social Sciences and Humanities Open Cloud (SSHOC) project and the ERIC-Forum, aligns with the DDI metadata standard, is one of the co-founders of the Interest Group on Social Sciences Research Data at the Research Data Alliance, and is working on secured access tools (17).
- "Skills for FAIR": As training and sharing expertise are part of CESSDA's mission, much attention and effort goes into training researchers and service providers' staff (10). There is a mentorship programme for new service providers, and there is an internal quality check before training goes online (11).
- "Investment in FAIR": Strategic and coordinated funding (14) seems hardly applicable to service providers, although CESSDA has been discussing the fall-back procedure should one of the national service providers suddenly stop its activities. The setting of being an ERIC provides stability for the funding of the Consortium (15), although concerns remain about the national support for some service providers.
Ensuring that services are FAIR (27) relates to CESSDA's quality assessment of its national service providers (including the CoreTrustSeal), but also to striving for full European coverage and strengthening new service providers (via seminars, consortium meetings and mentorships).
There is work in progress on:
- "FAIR culture": CESSDA is active on making data interoperable in cooperation with research communities, and new platforms and marketplaces for sharing tools and data are being developed in the European EOSC cluster project SSHOC (4). Service Providers encourage the use of Data Management Plans (DMPs), and CESSDA has a widely used online DMP course (5). CESSDA has started a task force on data citation (6), which is a prerequisite for acknowledgement of data sharing. On the costs of data management (18) there are ideas, but no activities yet. Selecting and prioritising FAIR Digital Objects (19) and deposit in Trusted Digital Repositories (20) are core activities of the national service providers, again in cooperation with research communities.
The numbers in brackets refer to the FAIR Index Recommendations (Table 2).

Reaching out to users to encourage and incentivise reuse of FAIR outputs must still start (21); for this we first need to extend our connections with the research communities.

There is work to start up on:
- "FAIR ecosystem": CESSDA has participated in semantic data projects, and this is also being addressed in the 14.5 M€ SSHOC project. Service providers work on new data types (e.g. social media data, registries) to be combined or linked with surveys (7). Automated processing (8) is a goal, but has yet to start: first CESSDA needs to optimise its workflows within the FAIR ecosystem. CESSDA is very active on certification (9), including the CoreTrustSeal and the development of internal assessment procedures. Before the information held in DMPs can be used, the use of DMPs must increase so that there is more content to work with (22); CESSDA is working on metadata profiles for different parts of the DMPs, e.g. on data sets, variables, and questionnaires. Extension of the FAIR components (23) has not started yet: the priority is to work on the existing ones. CESSDA is active on supporting FAIR data, also in cooperation within the Social Sciences and Humanities cluster (SSHOC project) and within the ERIC-Forum (24).

And there is work to start up especially on:
- "Incentives and metrics for FAIR data and services": Metrics on FAIR Digital Objects (12) and monitoring (25) are still to do, preferably with other partners (in SSHOC or the ERIC-Forum). Certification of FAIR services (13) has some work in progress, extending on the CoreTrustSeal. On data citation metrics (26), CESSDA has started a task force.

Conclusion
The FAIR Index [1] has been used to monitor a research infrastructure's status on the implementation of FAIR policies. After this baseline measurement, the index can be used to measure progress: how does the mapping change in a year?
We used only three categories, and colours instead of numeric scores (green: implemented; orange: in the pipeline; red: not yet started). One could be tempted to make the scoring more quantitative, to bring in more objectivity or to be able to add up the scores into one single number. However, adding up scores would suggest false accuracy, and the scores might become a goal in themselves. Using colours gives a clear dashboard and a quick status overview.
The index should not be about reaching a high score, but about making progress: where are the weak spots, and where is there room for improvement? After the monitoring, the organisation should set up an Action Plan with concrete actions to gain more green marks and diminish the red ones. The index should be a helpful tool, not a goal in itself.

Strategy on FAIR -start with Findability
The FAIR Action Plan Index describes FAIR actions, but does not distinguish between F, A, I, and R. In the CESSDA Strategy the FAIR principles are boundary conditions, but we also opted for a sequential approach: first focus on Findability. This is without prejudice to the fact that we are also working on the other principles.
In this section, we present how CESSDA worked towards Findability by implementing a Data Catalogue (Figure 1) with metadata on data sets (studies) from all the national service providers.
The good news: the CESSDA Data Catalogue has been online since the end of 2018. The bad news: it took over three years to produce it and we had to build some back-office tools first.
The catalogue holds over 25,000 studies (19,000 in English, with other languages also available). Besides free-text search (in basic and advanced mode), it allows filtering on language, topic, years, country, publisher (service provider), and language of the data files. It is fast, and it harvests the metadata every night. As the catalogue contains only metadata, there are no privacy or security issues with the data. A user can click on a study to go directly to the service provider and access the data.

To produce the catalogue, CESSDA had to build a metadata harvester that collects the metadata from the service providers. As these providers use different systems, we had to set up different endpoints. Another tool was the metadata profile: we had to agree on the obligatory fields, based on the DDI standard (http://www.ddialliance.org), and ensure compatibility with other standards (e.g. schema.org, OpenAIRE, DataCite). For the filtering (a closed selection of topics) we had to set the vocabularies, and we needed a multilingual thesaurus to support the searching. The technical and architecture specifications allow machine reading of the catalogue.
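Agreeing on obligatory fields makes harvested records checkable by machine. The sketch below illustrates that idea in Python; the element names (`title`, `creator`, etc.) are simplified stand-ins, not the actual field list of CESSDA's DDI-based metadata profile.

```python
import xml.etree.ElementTree as ET

# Hypothetical obligatory fields of a DDI-inspired metadata profile.
# The real CESSDA profile defines its own (different) element set.
OBLIGATORY = ["title", "creator", "publisher", "abstract", "language"]

def missing_fields(record_xml: str) -> list:
    """Return the obligatory fields absent or empty in a harvested record."""
    root = ET.fromstring(record_xml)
    return [field for field in OBLIGATORY
            if root.findtext(field) is None
            or not root.findtext(field).strip()]

# Example harvested record (fabricated for illustration): no abstract.
record = """<study>
  <title>Example study title</title>
  <creator>Example Service Provider</creator>
  <publisher>Example Archive</publisher>
  <language>en</language>
</study>"""

print(missing_fields(record))  # → ['abstract']
```

A harvester can run such a check per endpoint and report incomplete records back to the service provider before the nightly ingest.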
Still in the pipeline (Table 3) are search facilities for questions and for searching within variable names and value labels. Also planned is a secured environment for access and analysis of data, including single sign-on: logging in on CESSDA would give access to the secured environments of the national data service providers. Not in the table, but still an issue, is the use of persistent identifiers (see also [7] and [8]). CESSDA has developed a Persistent Identifier Policy, which is in the pipeline for implementation.
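In its simplest form, the planned search within variable names and value labels amounts to matching a term against variable-level metadata. The Python sketch below uses a hypothetical data structure; in the catalogue this information would come from harvested DDI variable descriptions.

```python
# Hypothetical variable-level metadata for two studies (illustrative only;
# real records would be harvested DDI variable descriptions).
studies = [
    {"id": "STUDY-001",
     "variables": [
         {"name": "vote_choice", "labels": ["Party A", "Party B", "Did not vote"]},
         {"name": "age_group", "labels": ["18-34", "35-64", "65+"]},
     ]},
    {"id": "STUDY-002",
     "variables": [
         {"name": "turnout", "labels": ["Voted", "Did not vote"]},
     ]},
]

def search_variables(studies, term):
    """Return ids of studies whose variable names or value labels match the term."""
    term = term.lower()
    return [s["id"] for s in studies
            if any(term in v["name"].lower()
                   or any(term in label.lower() for label in v["labels"])
                   for v in s["variables"])]

print(search_variables(studies, "vote"))  # → ['STUDY-001', 'STUDY-002']
```

The production version would of course index these fields (e.g. in the catalogue's search engine) rather than scan them linearly, but the matching logic is the same.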

Conclusion
Building a catalogue relies on a good plan: on architecture, on pipelines to collect content in an automated way, on machine-readability. The catalogue is still under development, adding new functionality and new search options. But we think it was a good decision to go live with a basic version, to do further testing on real metadata and endpoints, and to learn from users' feedback.

Strategy on FAIR -also work on Accessibility, Interoperability and Reuse
Access to social data can be very complicated because of security issues. Within CESSDA we are working on a common approach to describing the different access levels in a uniform way.
Interoperability offers big opportunities in the social sciences. Traditional work is on making data comparable across countries; doing this ex post implies a lot of work, so ideally it is done beforehand, as in the European Social Survey and the Survey of Health, Ageing and Retirement in Europe. New work makes use of linked open data techniques and would make it possible to connect related data, e.g. in election studies: newspapers, social network data, party manifestos, surveys on election behaviour and motivations, etc.
Reusability deals with connecting data, with data provenance and quality, and with connecting to research communities. It follows from the other principles and represents the final stage of putting all the principles together. Before we can do this, we need a better understanding of all the principles and more experience in dealing with them (see also [9] for more information on developing FAIR tools).

NEXT STEPS
Using the FAIR Action Plan Recommendations, we could picture where CESSDA stands in implementing the FAIR Data Principles. CESSDA follows a sequential approach, currently realising Findability while working on the other principles.
Next steps will focus on the implementation of new tools, but also on cooperation: within the CESSDA consortium, within the Social Sciences and Humanities (via the SSHOC project), and with other established research infrastructures (via the ERIC-Forum).
But we should look further and seek cooperation with other information service organisations, to build a European Open Science Cloud that is beneficial to researchers of all disciplines, and to professionals and experts outside academia.