Scopus as a curated, high-quality bibliometric data source for academic research in quantitative science studies

Abstract Scopus is among the largest curated abstract and citation databases, with a wide global and regional coverage of scientific journals, conference proceedings, and books, while ensuring only the highest quality data are indexed through rigorous content selection and re-evaluation by an independent Content Selection and Advisory Board. Additionally, extensive quality assurance processes continuously monitor and improve all data elements in Scopus. Besides enriched metadata records of scientific articles, Scopus offers comprehensive author and institution profiles, obtained from advanced profiling algorithms and manual curation, ensuring high precision and recall. The trustworthiness of Scopus has led to its use as a bibliometric data source for large-scale analyses in research assessments, research landscape studies, science policy evaluations, and university rankings. Scopus data have been offered for free for selected studies by the academic research community, such as through application programming interfaces, which has led to many publications employing Scopus data to investigate topics such as researcher mobility, network visualizations, and spatial bibliometrics. In June 2019, the International Center for the Study of Research was launched, with an advisory board consisting of bibliometricians, aiming to work with the scientometric research community and offering a virtual laboratory where researchers will be able to utilize Scopus data.

In 2004, Elsevier launched Scopus as a new search and discovery tool (Schotten, el Aisati, Meester, Steiginga, & Ross, 2017). Scopus is an abstract and citation database consisting of peer-reviewed scientific content. At its launch, it contained about 27 million publication records. Since then, the content of the database has grown to over 76 million records at the time of writing, covering publications from 1788-2019, making it among the largest curated bibliographic abstract and citation databases today. Approximately 3 million new items are being added every year. The content in Scopus is sourced from over 39,100 serial titles (with the most recently published content indexed from over 24,500 titles), 120,000 conferences, and 206,000 books from over 5,000 different publishers worldwide. Scopus is a curated database, which means that content is selected for inclusion in the database through a rigorous process: Serial content (i.e., journals, conference proceedings, and book series) submitted for possible inclusion in Scopus by editors and publishers is reviewed and selected, based on criteria of scientific quality and rigor. This selection process is carried out by an external Content Selection and Advisory Board (CSAB) of editorially independent scientists, each of whom is a subject matter expert in their respective field. This ensures that only high-quality curated content is indexed in the database and affirms the trustworthiness of Scopus. In addition, Scopus being a curated database also refers to the rigorous capturing procedures, publisher agreements, and technical infrastructure in place to source the publications directly from the publishers themselves, ensuring comprehensive, complete, and accurate coverage of the serial content once it has been selected by the CSAB.
Scopus indexes many different elements of scientific publications, obtained from external publishers, such as publication title, abstract, keywords, author names and linked affiliations, references, and drug terms (Berkvens, 2012). Content in Scopus contains publications from scientific publishers from all over the world. Elsevier, the owner of Scopus, is also a scientific publisher. About 9.9% of the serial titles (i.e., journals and book series) in Scopus are published by Elsevier (this amounts to an article share in Scopus of 17.4% between 2012 and 2018); the other 90.1% of serial titles (and 82.6% of articles, respectively) are produced by an extensive list of global publishers (Figure 1a). In addition, subject coverage of the serial titles in Scopus is quite balanced among the four main subject categories (Figure 1b). Scopus also includes non-English content, as long as an English title and abstract, as well as references in Roman script, are available.
In addition to the fields that are provided in the source data for Scopus, Elsevier further enriches the content using a variety of enhancements; citations provided in the full text are structured and clustered together and, where referencing content that is already indexed in Scopus, linked to the cited Scopus records. This allows users to view the citation count (how many times an article was cited). At the time of writing, the precision for citation linking in Scopus is measured at 99.9% and the recall is 98.3%. This means that in general, references that should be linked to Scopus records are linked in 98.3% of cases; and among all reference links, 99.9% are linked to the correct record. Authorships in the Scopus databases are clustered into publication histories called Scopus (author) profiles. Author profiles are generated using a combination of algorithms and manual curation. Elsevier uses a "gold set" of roughly 12,000 randomly selected authors for quality assessment. This set is updated and expanded annually and is independent of sets used for tuning or training of algorithms. The end-to-end accuracy is measured continuously by several metrics. Moreover, regular spot checks are run on aspects of author profiles, such as canonical names or affiliations. Typically, accuracy metrics are averaged over authors to better represent the typical experience of users. Publications in author profiles have an average precision of 98.1% and an average recall of 94.4%. Both precision and recall are measured based on the best matching Scopus profile for a given "gold set" author. The best match is determined based on the Scopus profile containing the largest number of publications for that author (Figure 2). This overlap number divided by the total number of publications by the "gold set" author defines recall (i.e., ratio of publications captured). 
The same overlap number divided by the total number of publications in the Scopus profile defines precision (i.e., ratio of publications correctly assigned to the author).
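The precision and recall definitions above can be made concrete with a short sketch. This is only an illustration of the arithmetic as described in the text, not Elsevier's evaluation code; the function names, profile identifiers, and publication identifiers are all hypothetical.

```python
from typing import Dict, Set, Tuple


def best_match(gold_pubs: Set[str],
               profiles: Dict[str, Set[str]]) -> Tuple[str, float, float]:
    """Return (profile_id, precision, recall) for one "gold set" author.

    The best match is the profile sharing the most publications with the
    gold-set list (ties go to the first profile encountered).
    """
    if not gold_pubs or not profiles:
        raise ValueError("need a non-empty gold set and at least one profile")
    best_id = max(profiles, key=lambda pid: len(gold_pubs & profiles[pid]))
    overlap = len(gold_pubs & profiles[best_id])
    profile_size = len(profiles[best_id])
    # Precision: share of the profile's publications that belong to the author.
    precision = overlap / profile_size if profile_size else 0.0
    # Recall: share of the author's publications captured by the profile.
    recall = overlap / len(gold_pubs)
    return best_id, precision, recall


def average_metrics(gold: Dict[str, Set[str]],
                    profiles: Dict[str, Set[str]]) -> Tuple[float, float]:
    """Average precision and recall over authors (not over publications)."""
    scores = [best_match(pubs, profiles)[1:] for pubs in gold.values()]
    n = len(scores)
    return (sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n)


# Hypothetical toy data: two gold-set authors, three algorithmic profiles.
gold = {
    "Author A": {"p1", "p2", "p3", "p4"},
    "Author B": {"q1", "q2"},
}
profiles = {
    "S1": {"p1", "p2", "p3", "p9"},  # misses p4, wrongly contains p9
    "S2": {"p4"},                    # split-off fragment of Author A
    "S3": {"q1", "q2"},              # perfect match for Author B
}

print(best_match(gold["Author A"], profiles))  # ('S1', 0.75, 0.75)
print(average_metrics(gold, profiles))         # (0.875, 0.875)
```

In this toy case, the split-off profile S2 lowers recall for Author A, while the stray publication p9 lowers precision, which mirrors the two error types the metrics are meant to separate.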
A substantial number of author profiles in Scopus have been curated. Curation can be initiated through a variety of sources. A well-established process is through Open Researcher and Contributor ID (ORCID), an open, non-proprietary registry of unique, persistent author identification codes (What is ORCID?, 2018). ORCID is managed by a non-profit organization with the same name established in 2012. Researchers can export, import, or link their publications and curated metadata between Scopus and ORCID. Similarly, researchers can use a feature on Scopus.com called the "Author Feedback Wizard" (AFW) to improve their author profile.
Finally, yet another process that results in improved, curated Scopus author profiles is a commercial service offered by Elsevier to subscribers of its "Pure" university administration product, called Profile Refinement Service. Pure customers can opt to have their profiles refined upon request or refined every 4 months. Elsevier uses the same service to proactively refine author profiles regardless of Pure subscription whenever needed.
All of the above efforts combined have led to approximately 1.8 million Scopus author profiles that have been manually enhanced (Scopus index, July 2019). This total has been verified using a "manual curated" flag in the XML data of Scopus author profile records. Nonetheless, we must emphasize that Scopus creates author profiles that are being actively updated by algorithms for all authorships (of over 76 million publications) covered. Availability of author profiles throughout the corpus enables author level analytics and benchmarking across the database beyond subsample estimations. Moreover, Scopus author profiles are designed to be a complete publication history and so will not remove or hide select publications for personal gain (i.e., negatively impacting publications, such as retraction notices, errata, or suspiciously funded publications). In fact, feedback that Elsevier receives is algorithmically and manually reviewed before changing any existing profile.
Figure 2. Schematic depicting how precision and recall are calculated for Scopus author profiles. "Author A" and "Author B" represent manually curated "gold set" lists of publications by these authors.
An important aspect of the Scopus database is the high coverage and availability of first names, even for relatively old records: from 25% of authorships in 1970-1974 and 52% in 1995-1999 to 82% of authorships in 2015-2019 (Figure 1). This feature strengthens the disambiguation of authors and allows, for instance, gender-based longitudinal analyses that leverage first names (Elsevier, 2017; Lerchenmueller & Sorenson, 2018). Another relevant aspect in the analytical context is the availability of author-affiliation links in publications throughout the database historically. This enables studies dealing with the mobility of researchers, by analyzing the author affiliations and how they change over time.
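As a rough illustration of how affiliation changes over time can be turned into mobility events, consider the following sketch. It assumes a simplified, hypothetical record format of one (publication year, affiliation) pair per authorship; real Scopus records can carry several affiliations per publication, which this sketch does not handle.

```python
from typing import List, Tuple


def mobility_events(records: List[Tuple[int, str]]) -> List[Tuple[int, str, str]]:
    """Derive (year, from_affiliation, to_affiliation) moves for one author.

    `records` holds one (publication_year, affiliation) pair per authorship.
    A move is recorded whenever the affiliation differs from the one on the
    chronologically preceding publication.
    """
    events = []
    previous = None
    for year, affiliation in sorted(records):
        if previous is not None and affiliation != previous:
            events.append((year, previous, affiliation))
        previous = affiliation
    return events


# Hypothetical publication history, deliberately out of order.
history = [
    (2012, "University A"),
    (2010, "University A"),
    (2015, "University B"),
    (2018, "University B"),
    (2019, "University C"),
]

print(mobility_events(history))
# [(2015, 'University A', 'University B'), (2019, 'University B', 'University C')]
```

The year attached to each move is the year of the first publication at the new affiliation, so it is an upper bound on when the move actually took place.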
An important enrichment in the Scopus database is that of institution profiles, allowing different name variants and hierarchies of institutions to be curated in a similar fashion as authors, thereby allowing automated organizing of information where needed (via an advanced, proprietary, and highly accurate institutional profiling algorithm) and manual modification and instruction, where possible. In addition, the full text of articles sourced by Scopus is processed using natural language processing to identify potential references to funding acknowledgements.
To maintain Scopus as a high-quality data source and push the boundaries of quality forward, Scopus introduced internal review processes to constantly monitor preidentified areas of quality focus, such as processing, profile quality, and completeness and accuracy of source data. This allows the content team to identify early trends in the data and to monitor progress on key initiatives to increase quality. For instance, under this program, digital object identifier (DOI) completion rates went up from 87.8% at the start of the program to 99.8% in December 2018. The completeness was measured across a gold record set for which each should have a DOI. Other main focus areas where significant improvements have been made over the past few years, and where continuous investments are being made, are completeness of indexed publication records for the serial titles covered (by weekly comparisons against the CrossRef database), the removal of duplicate MedLine and Article in Press records, the correctness and completeness of citation links (by measuring against a gold set), of the author and institution profiles and publication record metadata (such as document type classifications, publication years, article numbers, country codes, and funding information), as well as improving the timeliness and currency of newly indexed content. These quality review processes employ machine learning approaches, supplemented with manual validation, and concern legacy content (i.e., content already indexed in Scopus) as well as continuously improving and fine-tuning the capturing procedures for newly indexed content.
In addition to expanding and enriching the content, as well as improving the timeliness of the database, a curated database such as Scopus requires re-evaluation of the appropriateness of new and already indexed titles on an ongoing basis. This is needed to exclude poor-quality journals and "predatory" journals and publishers, a relatively recent phenomenon that is a threat to the integrity of science, as well as an increasing challenge to all research publishing stakeholders: authors, editors, researchers, research institutions, funding bodies, research assessment bodies, and governments. To ensure that only the most reliable scientific articles and content are available in Scopus to each of these stakeholders and that the quality of the existing content is maintained, a rigorous process of continuous monitoring and re-evaluation has been installed. This means that titles that have been selected for inclusion in Scopus may be discontinued and no longer indexed going forward. There are three different identification techniques applied, using (a) external feedback (i.e., formally raised concerns about publication standards), (b) heuristics (metrics) to flag underperforming journals, and (c) a machine learning approach to flag outlier behavior, each of which leads to titles being tagged for re-evaluation. The ultimate decision to (de)select content lies with the external and independent Scopus CSAB (for full details, please see Elsevier, 2019b; Holland, Brimblecombe, Meester, & Steiginga, 2019). For publications that the CSAB determines no longer meet the quality standards for inclusion in Scopus, indexing of new content is discontinued, but content already indexed remains in Scopus, to ensure the integrity of the scientific record as well as the stability and consistency of research trend analytics. A second type of analysis supported by Scopus data is government science policy evaluations.
The UK Department for Business, Innovation and Skills (BIS) commissioned a report that entailed a comparative study of the UK's international research base, in 2011 (Elsevier, 2011, 2013). In 2016, another refresh of this report was issued by the newly renamed Department for Business, Energy, & Industrial Strategy (BEIS) (Elsevier, 2016). More examples of the use of Scopus data for research policy reports include those of the European Research Council (ERC) and other large government bodies, often dealing with program evaluations and research landscape analyses. Many of these evaluations include different data sources of various types (macroeconomic data, for instance), as well as deep qualitative evaluations in which Elsevier works in consortia.
As a further example of this second type of analysis, Scopus data have also been used in the production of bibliometric indicators for the US National Science Foundation's Science and Engineering Indicators (SEI), for the 2016 (National Science Board, 2016) and 2018 (National Science Board, 2018) editions, and will be used in future editions of the SEI reports up to 2022. Scopus data are also the source of bibliometric indicators for the European Research Area in the context of the 2010-2014 study "Analysis and Regular Update of Bibliometric Indicators for the European Commission" (Science-Metrix, 2014) and have recently been selected for the continuation and improvement of this study, now called "Provision and Analysis of Key Indicators in Research and Innovation," for the three coming years.
The third type of analysis is that of university rankings. University rankings are often composed of combinations of evaluations for which only part is a bibliometric resource. Different ranking bodies have variations of subjective and objective data sources to provide ranked lists of universities. These rankings and the media attention they draw provide a platform for academia to engage with the public. Elsevier provides publication output, citation, and international collaboration data from Scopus for each university to organizations in the field of university rankings. These include the World University Rankings and its various derived Regional, Global Subject, Young University, and World Reputation Rankings, as well as the Wall Street Journal/THE College Rankings, which are all issued by Times Higher Education (THE, since 2014); and the World University Rankings and its various derived rankings issued by Quacquarelli Symonds (QS, since 2015); as well as various other regional and subject-specific rankings, such as the Best Chinese Universities Ranking issued by ShanghaiRanking Consultancy (since 2015), Perspektywy in Poland, Maclean's in Canada, National Institutional Ranking Framework (NIRF) in India, the Financial Times Global MBA Ranking, and the Frankfurter Allgemeine Zeitung Economists Ranking in Germany.
The large-scale analyses supported by Scopus are reports in the government and commercial space. In contrast to peer-reviewed scientific studies, where the focus is on access to the data and open reproducibility of results, reports in the evaluation space focus on accountability. Key aspects in those engagements are therefore about how data providers deal with quality concerns, quality assurance, and risk mitigation. For instance, what is the process and timeline in case content is identified as missing?

DATA AVAILABILITY FOR RESEARCH
Next to the aim of supporting the academic community with robust results using reliable data in the analyses mentioned, providing access to raw data is an essential component in securing advancement of the bibliometric field. Until 2014, Elsevier supported bibliometricians with data using the Elsevier Bibliometric Research Program (EBRP). The program provided precompiled data sets to researchers, after a scientific board reviewed and approved a submitted proposal. Its aim was to enable external research groups or individual researchers in the field of bibliometrics and quantitative research assessment to carry out strategic research using Elsevier data and to present the outcomes in peer-reviewed journal papers and at international conferences.
Since 2014, application programming interfaces (APIs) have taken over the role of providing access to raw data, allowing free use for scientific purposes, such as the text-and-data-mining resources (Elsevier, 2019c) and Scopus APIs for academic research purposes (Elsevier, 2019a). Use of the APIs does not require a Scopus subscription; without a subscription, users will have limited access to basic metadata for most citation records, as well as to basic search functionality. Full access to Scopus APIs is only granted to subscribers of Scopus.
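To give a sense of what such API access looks like in practice, the sketch below prepares a request to the Scopus Search API using only the Python standard library. The endpoint URL, the `X-ELS-APIKey` header, and the `query` and `count` parameters are taken from our understanding of Elsevier's public developer documentation rather than from this article, so they should be checked against the current documentation before use; the key `DEMO-KEY` and the helper name `build_search_request` are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint of the Scopus Search API (verify in the developer portal).
SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"


def build_search_request(query: str, api_key: str, count: int = 25) -> Request:
    """Prepare (but do not send) a Scopus Search API request.

    `query` uses Scopus advanced-search syntax; `api_key` is a personal key
    obtained from Elsevier's developer portal.
    """
    params = urlencode({"query": query, "count": count})
    return Request(
        f"{SCOPUS_SEARCH_URL}?{params}",
        headers={"X-ELS-APIKey": api_key, "Accept": "application/json"},
    )


# Placeholder key: replace with a real one before sending the request.
req = build_search_request(
    "TITLE-ABS-KEY(bibliometrics) AND PUBYEAR > 2017",
    api_key="DEMO-KEY",
    count=5,
)
print(req.full_url)
```

With a valid key, the prepared request could be passed to `urllib.request.urlopen` and the JSON response parsed with the `json` module. The metadata fields returned are expected to depend on whether the caller's institution subscribes to Scopus, as described above.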
In addition, Scopus data have been available in bulk for research groups. Research groups working with bulk Scopus data include CWTS (CWTS, 2019), SciMago (Scimago Lab, 2019), DZHW (DZHW, 2019), SciTech Strategies (SciTech Strategies, 2018), and others through tailored agreements that have been established between these groups and Elsevier.
The mission of the International Center for the Study of Research (ICSR) is to advance research evaluation in all fields of knowledge production. To foster this development, the ICSR provides access to a working environment where new ideas and hypotheses can be tested against high-quality, large data sets. This platform, offering a virtual laboratory, will allow researchers to collaboratively develop indicators and methodologies. Elsevier is providing computational access to Scopus data for research purposes on this platform, free of charge. This will also enhance the reproducibility of scientometric studies, by enabling other researchers to verify published research findings using the same data set and methodologies with shared code. Researchers can use the environment for such academic, noncommercial purposes, and access will be organized by the ICSR to review submitted proposals for use of the lab as well as actively reaching out to researchers to collaboratively work on specific research problems. The platform allows researchers to create and extract aggregate derivatives that can be published as part of their work, under the condition that the source of the data is acknowledged. At the moment of writing, this platform is not yet publicly available and will be announced through the ICSR website (International Center for the Study of Research, 2019).

EXAMPLES OF STUDIES USING SCOPUS DATA
Scopus data have been used as a source for many types of different bibliometric studies. The different quality properties of Scopus described support different types of analyses.
For instance, there are studies on mobility, using Scopus' unique historic author-affiliation records, such as by Caroline Wagner and Koen Jonkers on international collaboration, mobility, and openness (Wagner & Jonkers, 2017), funding and collaboration (Leydesdorff, Bornmann, & Wagner, 2019), and author (Pina, Barac, Buljan, Grimaldo, & Marušic, 2019) and institutional (Lee, 2012) collaboration networks. Another example of author-mobility analyses can be found in a bibliometric study to measure knowledge transfer (Aman, 2018). The mobility analysis using Scopus author profiles also informs the research policy of governments, such as through the European Commission's Joint Research Center (JRC) report on the rise of China as an industrial and innovation powerhouse (Preziosi et al., 2019).
In addition, Scopus' availability of author first names historically, combined with author profiling, enables studies using author gender assignments: for example, "The gender gap in early-career transitions in the life sciences" (Lerchenmueller & Sorenson, 2018) and "Gender differences in research areas, methods and topics: Can people and thing orientations explain the results?" (Thelwall, Bailey, Tobin, & Bradshaw, 2019). In addition, Scopus author profiles have been used to study the recent phenomenon of hyperprolific authorships (Ioannidis, Klavans, & Boyack, 2018) and for an author database of highly cited researchers (Ioannidis, Baas, Klavans, & Boyack, 2019; Van Noorden & Singh Chawla, 2019).
There are also examples of studies using the full Scopus database to build new algorithms: Richard Klavans and Kevin Boyack developed algorithms on top of the database, resulting in Topics of Prominence (Klavans & Boyack, 2017), which are now prominently displayed in Elsevier's SciVal research performance product (which uses Scopus data as one of its data sources).
In the more traditional sense of bibliometric analysis, there are many studies available around citation analysis and correlations, such as on the influence of highly cited articles on indicators (Thelwall, 2019; Thelwall & Fairclough, 2015), on correlation between citations and Mendeley readership (Maflahi & Thelwall, 2016; Thelwall & Wilson, 2016), on journal usage (Schloegl & Gorraiz, 2010), and studies revisiting bibliometric laws (Thelwall & Wilson, 2014). Scopus data were also used to analyze initiatives in open science, particularly open access (Solomon, Laakso, & Björk, 2013), citizen science (Follett & Strezov, 2015), and new tools in the scientific space, such as ResearchGate (Thelwall & Kousha, 2017). They have been used to evaluate the fate of rejected manuscripts (Bornmann et al., 2009), to investigate potential citation manipulation by reviewers (Baas & Fennell, 2019; Singh Chawla, 2019), and to study the development of multidisciplinarity (Levitt & Thelwall, 2008). At present, Scopus data are used for bibliometric analysis to inform the EU Open Science Monitor (The Lisbon Council, CWTS, & Esade, 2018).

WHAT CAN WE LEARN FROM SCOPUS DATA, TOGETHER?
In the preceding sections, we have outlined some of the concrete ways in which Scopus data have been used for large-scale evaluative studies (typically at the national or institutional levels) and for exploratory work leading to a plethora of papers on aspects as diverse as topic detection, researcher mobility, and data visualization techniques. But the potential of Scopus to uncover and understand the fundamental forces that drive human knowledge creation through the research endeavor may be limited only by our capability to ask the right questions. How do career paths form and change for individual researchers through space and time? Can we follow people as they develop from "apprentice" to "master" and understand the drift in their topical focus, collaborative patterns, geolocation, and research impact (by citation-based indicators or other means) through careers that may be either very short or very long? Can we identify the conditions that, near the beginning of a research career, predict a long and successful contribution to the knowledge front? And those conditions that foreshadow an early exit from the world of (academic) research? Going beyond Scopus, can we use standardized researcher identifiers, such as ORCID, connected to nonresearch online personas, such as LinkedIn, to pinpoint the exit of trained researchers from publication-centric roles (largely within or adjacent to academia) into careers in organizations in the commercial or charitable sectors? What is the influence of gender, nationality, and early-career mentoring on these outcomes, and how much remains unexplained? Is a career in research more likely the result of persistence or of good fortune, and what does this mean for the development of better and fairer evaluative structures in research? Finally, what are the implications of the answers to these questions for all the actors in research, from educators to public policy experts and from university career advisors to researchers themselves?
This single example shows that those who create and those who use Scopus suffer no lack of imagination to ask challenging questions, and Scopus itself offers a firm base on which to begin seeking answers. The remaining piece of the puzzle is a collective one: How can the bibliometric research community and the creators of Scopus best come together to address these challenges? In June 2019, the ICSR (International Center for the Study of Research, 2019) was launched, with a wide-ranging brief and the support of an advisory board, including experts in research policy, research evaluation, and bibliometrics, to be a place where a dialogue can happen and research of great interest and importance can be pursued, together. Elsevier, Scopus, and the ICSR do not see themselves as apart from the world of research but as part of it, and this spirit will inform our work for many years to come.