Granularity of algorithmically constructed publication-level classifications of research publications: Identification of specialties

Abstract In this work, we build on and use the outcome of an earlier study on topic identification in an algorithmically constructed publication-level classification (ACPLC), and address the issue of how to algorithmically obtain a classification of topics (containing articles), where the classes of the classification correspond to specialties. The methodology we propose, which is similar to that used in the earlier study, uses journals and their articles to construct a baseline classification. The underlying assumption of our approach is that journals of a particular size and focus have a scope that corresponds to specialties. By measuring the similarity between (1) the baseline classification and (2) multiple classifications obtained by topic clustering and using different values of a resolution parameter, we have identified a best performing ACPLC. In two case studies, we could identify the subject foci of the specialties involved, and the subject foci of specialties were relatively easy to distinguish. Further, the class size variation regarding the best performing ACPLC is moderate, and only a small proportion of the articles belong to very small classes. For these reasons, we conclude that the proposed methodology is suitable for determining the specialty granularity level of an ACPLC.


Introduction
In a recent article we proposed a methodology for identification of research topics in an algorithmically constructed publication-level classification of research publications (ACPLC) (Sjögårde & Ahlgren, 2018). We used a large dataset of more than 30 million publications in Web of Science to create an ACPLC, at the granularity level of topics. However, more levels of different granularity are needed for an ACPLC to be used to answer a broader range of questions. In the present study, we use a similar methodology to create a classification whose granularity corresponds to research specialties. In the remainder of this paper, we use the term "specialty" instead of "research specialty".
The identification of specialties is part of a broader aim to develop a standard approach for creating a hierarchical ACPLC of research publications in large citation databases with global coverage, in terms of both geographical uptake and subject areas, such as Web of Science or Scopus. An ACPLC can be used for a great variety of analytical purposes and is especially useful for recurrent analytical activities.
A classification system, including a classification of publications into classes whose sizes correspond to specialties, can be used to study the publication output of different actors within a specialty, the collaboration between actors, dynamics, emergence and decline of specialties, and the relation between specialties. Moreover, a hierarchical classification, including both classes corresponding to topics and classes corresponding to specialties, makes it possible to identify topics within a specialty and, e.g., a shifting focus of a specialty. We therefore suggest that the level of specialties, together with the level of topics, should be included in a standard ACPLC, and that such an ACPLC should be hierarchical.
The purpose of this paper is to find a theoretically grounded, practically applicable and useful granularity level of an ACPLC with respect to specialties. To determine the granularity of specialties, a set of journals is identified and used to construct a baseline classification. ACPLCs with different granularities, constructed by the use of different values of a resolution parameter, are then compared to the baseline classification. The classification that best fits the baseline classification is proposed to be used for bibliometric analyses of specialties. In contrast to earlier work, our aim is to create a classification of publications that can be used to identify all specialties represented in Web of Science from 1980 onwards.
The remainder of this paper is structured as follows. In the next section, a short summary of our previous article on topic identification is given. The framework of the study is outlined in Section 3 and the specialty notion is discussed in Section 4. Data and methods are presented in Section 5, whereas Section 6 gives the results. Conclusions are given in Section 7.

Summary of the Sjögårde-Ahlgren study on identification of topics
To give the reader some background to the present study, this section summarizes the earlier study on topic identification (Sjögårde & Ahlgren, 2018). In that study, we discussed how the resolution parameter given to the software Modularity Optimizer can be calibrated to obtain publication classes corresponding to the size of topics.
A set of about 31 million articles and reviews from Bibmet, KTH Royal Institute of Technology's bibliometric database, which contains Web of Science data, was used for the study. The study involved a methodology consisting of four steps. In the first step, we constructed a baseline classification (BCPt) corresponding to topics, where BCPt contains synthesis articles, operationalized as articles with at least 100 references. Each such article constitutes a class, and its list of cited references points to the reference articles of the class, i.e. to the members of the class. The underlying assumption of this approach is that synthesis publications in general address a topic.
In the second step of the methodology, several ACPLCs of different granularity with respect to the topic level were created by setting the resolution parameter of Modularity Optimizer to different values. Normalized direct citation values between the articles in the dataset were used, as proposed by Waltman and van Eck (2012). For the third step, classifications derived from the ACPLCs were obtained, where each derived classification constitutes a classification of the union of the classes of the baseline classification, BCPt. Thus, the latter classification and a given derived one have exactly the same underlying reference articles. In the fourth and final step of the methodology, the similarity between BCPt and each of the derived classifications from the third step was quantified. For this purpose, the Adjusted Rand Index (ARI) (Hubert & Arabie, 1985) was used. We denoted the ACPLC such that its corresponding derived classification exhibited the largest ARI similarity with BCPt by ACPLCt.
With respect to the results of the study, the class size variation regarding ACPLCt turned out to be moderate, and only a small proportion of the articles belonged to very small classes. Moreover, the outcomes of two case studies showed that the topics of the cases were closely associated with different classes of ACPLCt, and that these classes tend to treat only one topic. We concluded that the proposed methodology is suitable to determine the topic granularity level of an ACPLC and that the ACPLC identified by this methodology is useful for bibliometric analyses.
In the present study, we use a similar methodology to identify specialties. The classes obtained in the previous study are clustered into specialties. A baseline classification is constructed that corresponds to specialties, and a set of journals is used to create the baseline classification.
We need to point out that there is a substantial overlap between our earlier paper (Sjögårde & Ahlgren, 2018) and the present one. The reason for this is that the four-step methodology used in the earlier study, and briefly described above, is also used in the study underlying the present paper.

Framework
As in the previous study, we use a network-based approach to obtain a classification of research publications (Fortunato, 2010). We use the Modularity Optimizer¹ software, created by Waltman and van Eck (2013), and the methodology put forward in Waltman & van Eck (2012). The alternative modularity function is used (Traag, van Dooren, & Nesterov, 2011), together with the SLM algorithm for modularity optimization. We acknowledge that a new algorithm for modularity optimization has been proposed (Traag, Waltman, & van Eck, 2018). However, to be consistent with the previous study, we use the SLM algorithm also in this study. We choose direct citation to express publication-publication relations, rather than bibliographic coupling (Kessler, 1965), co-citations (e.g. Marshakova-Shaikevich, 1973; Small, 1973), textual similarity (e.g. Ahlgren & Colliander, 2009; Boyack et al., 2011) or combined approaches (e.g. Colliander, 2015; Glänzel & Thijs, 2017). Direct citation is more efficient as it gives rise to fewer relations than the mentioned approaches, and there is empirical support that direct citation performs well in comparison with bibliographic coupling and co-citation when it comes to larger datasets (Boyack, 2017).
In Sjögårde & Ahlgren (2018), a network model with two levels of hierarchy, topics and specialties, was presented. This model comprises a logical classification: Each publication is classified into exactly one class at each level of hierarchy.² Moreover, all publications in a class, at a level below the top level, are classified into exactly one, and the same, parent class. It follows that each topic in the model belongs to exactly one specialty. In this study, in which we continue to use logical classifications, we obtain such a relation by clustering topics into specialties, rather than using the alternative approach of clustering publications directly into specialties. Logical classifications have some shortcomings: topics can be addressed by several specialties (Yan, Ding, & Jacob, 2012) or, at a higher level of aggregation, disciplines (Wen, Horlings, van der Zouwen, & van den Besselaar, 2017), phenomena not expressed by logical classifications. However, the relation between a topic and specialties other than the parent specialty, as well as relations between topics, can still be expressed and analyzed by use of the relational strengths associated with the edges in the model.
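The defining property of a logical classification with two levels can be checked mechanically. The following is a minimal sketch, assuming each article carries one topic label and one specialty label; the function name and the toy identifiers are illustrative and not part of the original study.

```python
from collections import defaultdict

def is_logical_hierarchy(topic_of, specialty_of):
    """Return True if every topic has exactly one parent specialty.

    topic_of:     dict article_id -> topic_id (one topic per article)
    specialty_of: dict article_id -> specialty_id (one specialty per article)
    """
    parents = defaultdict(set)
    for article, topic in topic_of.items():
        parents[topic].add(specialty_of[article])
    # A topic whose articles fall into two specialties breaks the hierarchy.
    return all(len(p) == 1 for p in parents.values())

# Hypothetical toy data: topic t1 holds a1 and a2, topic t2 holds a3.
topic_of = {"a1": "t1", "a2": "t1", "a3": "t2"}
ok = is_logical_hierarchy(topic_of, {"a1": "s1", "a2": "s1", "a3": "s1"})
bad = is_logical_hierarchy(topic_of, {"a1": "s1", "a2": "s2", "a3": "s1"})
```

Clustering topics (rather than publications) into specialties guarantees that this check holds by construction.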

For further discussion on the general classification framework and for an explication of a model that expresses the relations between classes at different hierarchical levels in the model, we refer the reader to Sjögårde & Ahlgren (2018).

Specialties
Specialties have been studied since the 1960s in the field of sociology. In this literature, specialties are considered as smaller intellectual units within research disciplines (Chubin, 1976). The researchers within the same specialty communicate with each other. They possess similar competences and can engage in the same, or similar, research problems (Hagstrom, 1970). The notion of specialties is closely related to the notion of invisible colleges (Crane, 1972; Price, 1965). However, as pointed out by Morris and van der Veer Martens (2008), invisible colleges "presuppose that the researchers are in frequent informal contact with one another", which is not the case for specialties.
We use the definition of a specialty that has been given by Morris & van der Veer Martens (2008). They define a specialty as "a self-organized network of researchers who tend to study the same research topics, attend the same conferences, read and cite each other's research papers and publish in the same journals". Further, and in concurrence with others, we consider specialties to be the largest homogeneous units of science "in that each specialty has its own set of problems, a core of researchers, shared knowledge, a vocabulary, and literature" (Scharnhorst, Börner, & Besselaar, 2012) and that they "play an important role in the creation and validation of new knowledge" (Colliander, 2014).
As early as 1974, Small and Griffith argued that publications can be clustered and that the obtained clusters may represent specialties (Small & Griffith, 1974). The single-linkage method was used by Small and Griffith to cluster 1,832 publications, which today would be considered a very small number of publications. They used their results to identify specialties. Since the 1970s, technological advancements and the emergence of the Internet have changed the preconditions for research communication. There has also been a growth in research activity and production of research publications.
More recently, specialties have been identified and analyzed by the use of different clustering techniques (Lucio-Arias & Leydesdorff, 2009; Morris & van der Veer Martens, 2008; Scharnhorst et al., 2012). Different points of departure and different operationalizations of the specialty notion have captured different aspects of specialties. For example, clustering of publications based on citation relations and clustering of researchers based on co-authorship may result in different pictures of a specialty. The former approach identifies a set of publications and the latter a group of researchers belonging to a specialty. We attempt to capture the publications belonging to each specialty, rather than the researchers belonging to the specialty. A researcher can be part of several specialties, a property that cannot be expressed by the co-authorship approach. For this reason, we consider this approach less suitable for the identification of publications belonging to a specialty. We believe that it is preferable to base classifications constructed for the purpose of bibliometric analyses of specialties on the network of publications, rather than on the network of researchers. Our approach makes it possible to identify the researchers within a specialty without forcing every researcher into exactly one specialty. It also makes it possible to analyze the contribution of one researcher to multiple specialties.
Kuhn (1996) estimates the number of core researchers in a specialty to be around 100. Based on Lotka's law (1926), Morris (2005) estimates the total number of researchers within a specialty to be around 1,000, and the number of publications produced by a specialty to be between 100 and 5,000. Boyack et al. (2014) regard specialties to be "ranging from roughly a hundred to a thousand articles per year." We acknowledge that the size of specialties in terms of publications may vary over time. Because the output of research publications has been growing over the last decades, it is likely that the total size of specialties, in terms of number of publications, has been growing. The yearly publication production of active specialties is also likely to be larger on average today than ten or twenty years ago. The size of specialties is an empirical question that we intend to shed light on in the present study.

Data and methods
As in Sjögårde and Ahlgren (2018), KTH Royal Institute of Technology's bibliometric database Bibmet was used for the study. Bibmet contains Web of Science publications from the publication year 1980 onwards. In the present study, we use the same set of publications as in the earlier study. We denote this set, in agreement with the earlier study, by P. P consists of 30,669,365 publications of the two document types "Article" and "Review". In the remainder of this paper, we use the term "article" to refer to both articles and reviews.

Design of the study
We attempt to find a granularity of an ACPLC, where the ACPLC is based on the articles in P, that corresponds to specialties. In order to identify the granularity of specialties, a baseline classification of publications (BCP) is created. The BCP is a set of journals, considered as classes, and each member of a class in BCP is a publication appearing in the class, i.e. appearing in the journal.
The BCP is compared to several ACPLCs with different granularities, where each such ACPLC is obtained by clustering the classes of ACPLCt (see Section 2), which is thereby utilized in the present study. An appropriate granularity is detected and an ACPLC is chosen, the classes of which correspond to specialties. The methodology, which has four steps and a high degree of similarity with the methodology proposed in Sjögårde and Ahlgren (2018), is described in detail in steps I to IV below and schematically illustrated in Figure 1.

I. Creation of baseline classes
We construct a baseline classification to correspond to specialties, which we denote by BCPs. For the creation of BCPs, a subset of journals covered by Web of Science is used. Each journal constitutes a class, and the publications appearing in the journal are the members of the class.
The reason to use journals to obtain BCPs is that researchers within a specialty publish in and read the same journals. The new possibilities to search, retrieve and read research articles have changed the role of journals. Nevertheless, many journals are still focused on specific areas of expertise and the researchers within those areas. Such journals aim to publish articles that are relevant to their audience. For example, we consider bibliometrics as a specialty within the discipline of library and information science, and the scope of Journal of Informetrics as roughly targeting the specialty of bibliometrics. In line with Bradford's law (1948), a researcher within a specialty needs to go to several journals to find all relevant articles within her or his specialty. The boundaries of a specialty are vague and fading rather than sharp. If we consider a journal whose scope roughly covers a specialty, a core set of the articles in such a journal is likely to be of high relevance to the core audience of the journal. The researchers that belong to this core audience can be considered as the backbone of the specialty. The rest of the articles in the journal have a fading relevance to this specialty. Some of these articles will be of higher relevance to other specialties.
When creating BCPs, we attempt to delimit the set of journals to such journals that, regarding their size and scope, can be considered as proxies for specialties. Since BCPs is to be used as a baseline to estimate granularity of an ACPLC regarding specialties, the following three requirements should be addressed:
A. To be able to compare the classifications, the union of the classes in BCPs must be a subset of the union of the classes (i.e. the topics) in ACPLCt.
B. Ideally, each class (journal) in BCPs should address exactly one specialty.
C. Ideally, each pair of distinct classes (journals) should address different specialties.
Now, as a first step to satisfy point A, we kept, for a given journal, only articles, i.e. publications that are of the document types "Article" or "Review". We return to this point at the end of this subsection.
To deal with point B, we first delimited the publication period to one year, namely 2010. By this operation, which resulted in 12,276 journals, the risk of including journals that, for instance, have shifted subject focus over time is lowered. In addition to dealing with point B, the choice of a publication year whose publications have both incoming and outgoing citations can be assumed to have a stabilizing effect when these articles are being clustered, compared to more recent publications.
We then removed all journals belonging to the Web of Science subject category "Multidisciplinary Sciences", since a journal in this category is clearly not focused on a single specialty. After this, 12,233 journals remained. Next, we considered the distribution of articles over journal size. Figure 2 shows the distribution limited to journals with at most 1,000 articles. With the modal interval as the measure of central tendency, a typical article is published in a journal that published 30-40 articles in 2010. By including journals between the 5th and 50th percentiles of the journal size distribution displayed in Figure 2, journals with 28-194 articles were included. With this journal size limitation, the risk of including journals addressing multiple specialties (or journals with a narrower scope than a specialty) is reduced. The limitation reduced the number of journals to 7,481.
Finally, in order to further reduce the risk of including journals addressing multiple specialties, we took journal self-citations into account. The idea is that a one-specialty journal can be assumed to cite itself to a larger extent compared to a journal that covers two or more specialties, other things held constant. In the light of this, we required, for a journal to be included in BCPs, that the self-citation ratio (in %) should be at least 10.³ This further reduced the number of journals to 1,404.
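The journal filters described so far (subject category, size interval and self-citation ratio) can be summarized in a short sketch. It assumes that per-journal counts for 2010 are already available; the `Journal` record, the function names and the example journals are hypothetical, while the thresholds (28-194 articles, at least 10% self-citations) are taken from the text.

```python
from dataclasses import dataclass

@dataclass
class Journal:
    name: str
    categories: set
    n_articles: int          # articles and reviews published in 2010
    self_citations: int      # references pointing to the same journal
    active_references: int   # references pointing to items in the data source

def self_citation_ratio(j):
    """Self-citation ratio in percent (0 if the journal has no active references)."""
    if j.active_references == 0:
        return 0.0
    return 100.0 * j.self_citations / j.active_references

def keep_for_baseline(j, min_size=28, max_size=194, min_ratio=10.0):
    """Apply the category, size and self-citation filters in turn."""
    if "Multidisciplinary Sciences" in j.categories:
        return False
    if not (min_size <= j.n_articles <= max_size):
        return False
    return self_citation_ratio(j) >= min_ratio

# Hypothetical journals, one per rejection reason plus one that passes.
journals = [
    Journal("J-focused", {"Information Science & Library Science"}, 60, 90, 600),
    Journal("J-broad", {"Multidisciplinary Sciences"}, 150, 300, 2000),
    Journal("J-large", {"Chemistry"}, 800, 900, 6000),
    Journal("J-low-self-cite", {"Physics"}, 100, 20, 1000),
]
kept = [j.name for j in journals if keep_for_baseline(j)]
```

In this toy input only the first journal survives all three filters.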
Some of the measures taken to satisfy point B are also relevant for satisfying point C (which states that each pair of distinct classes should address different specialties), for instance the limitation to the publication year 2010. With the aim of further raising the possibilities to satisfy point C, we applied bibliographic coupling between journals. If two journals had an overlap of 8% or more regarding their active cited references, they were considered as specialty overlapping.⁴ The threshold was chosen after browsing a list of journal pairs sorted in descending order by number of shared cited references. We grouped journals so that all journals that were directly or indirectly connected, by a cited reference overlap of 8% or more, were assigned to the same group. For example, if journal j1 has a cited reference overlap of ≥ 8% with journal j2, and j2 has a cited reference overlap of ≥ 8% with j3, then j1, j2 and j3 are assigned to the same group. Note that j1 and j3 are assigned to the same group, even if they do not have an active cited reference overlap of ≥ 8%. Each obtained group of journals was considered as addressing the same specialty. One of the journals was then randomly selected from each group. After the execution of this procedure, 1,119 journals remained. This number is the number of journals (classes) in BCPs.
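Grouping journals that are directly or indirectly connected amounts to finding connected components in a graph whose edges are journal pairs at or above the threshold. The sketch below uses a simple union-find structure; the function name, the overlap values and the random seed are illustrative assumptions, not details of the original implementation.

```python
import random

def group_journals(journals, overlaps, threshold=0.08):
    """Group journals connected, directly or indirectly, by overlap >= threshold.

    overlaps: dict mapping (journal_a, journal_b) -> overlap as a fraction.
    """
    parent = {j: j for j in journals}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), y in overlaps.items():
        if y >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for j in journals:
        groups.setdefault(find(j), []).append(j)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical overlaps: j1-j2 and j2-j3 reach the threshold, j1-j3 does not.
overlaps = {("j1", "j2"): 0.09, ("j2", "j3"): 0.08, ("j1", "j3"): 0.02}
groups = group_journals(["j1", "j2", "j3", "j4"], overlaps)

# One journal is drawn at random from each group to represent it.
rng = random.Random(0)
selected = [rng.choice(g) for g in groups]
```

Here j1 and j3 end up in the same group through j2, and j4 forms a group of its own.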
The number of articles belonging to the classes of BCPs was initially 84,139. However, only articles that belong to a class in ACPLCt, the best-performing ACPLC in the topic study, were kept, in order to satisfy point A above. Thereby, 7% of the articles were removed, and therefore the number of articles in BCPs is 78,217. Note that articles not present in ACPLCt lack citation relations. We denote the union of the classes in BCPs as P'.

³ The self-citation ratio (s) for a journal j is given by $s = c_s / r_a$, where $c_s$ is the number of self-citations in j, and $r_a$ the total number of active references in j. References are considered as active if they point to publications covered by the data source (Waltman, van Eck, van Leeuwen, & Visser, 2013). A reference is considered as a self-citation if the referencing publication and the referenced publication belong to the same journal.

⁴ The overlap (y) between two journals (j1 and j2) is computed from m, A1 and A2, where m is the number of shared cited references, i.e. cited references occurring in both j1 and j2, A1 the number of cited references in j1 and A2 the number of cited references in j2. The reference list of a journal was obtained by concatenating the reference lists of the articles (published in 2010) in the journal. If a reference article has been cited by more than one article in a journal, then this reference is counted multiple times for that journal. For example, if journal j1 has four references to article a and journal j2 has two references to article a, then journals j1 and j2 have two shared cited references with respect to article a. Note that we give the overlap measure threshold as a percentage in the running text.
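The counting rule for shared cited references described above (references counted with multiplicity, and the smaller count taken per cited article) corresponds to a multiset intersection. The sketch below covers only the count m, not the overlap measure itself; the function name and the reference lists are hypothetical.

```python
from collections import Counter

def shared_cited_references(refs_j1, refs_j2):
    """Number of shared cited references m, counted with multiplicity.

    Each argument is the concatenated reference list of a journal, so an
    article cited several times appears several times.
    """
    c1, c2 = Counter(refs_j1), Counter(refs_j2)
    # Counter intersection keeps the minimum count per cited article.
    return sum((c1 & c2).values())

# j1 cites article "a" four times, j2 cites it twice: two shared references.
j1_refs = ["a", "a", "a", "a", "b"]
j2_refs = ["a", "a", "c"]
m = shared_cited_references(j1_refs, j2_refs)
```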

II. Creation of ACPLCs of different granularity with respect to the specialty level
In order to obtain ACPLCs of different granularity, the first step was to measure the relatedness between the classes (topics) of ACPLCt. We measured the relatedness as the average normalized direct citation value between the articles belonging to the two classes: If class C contains m articles and class C' n, the sum of the m × n normalized direct citation values between articles in C and articles in C' was divided by m × n. In the second step, the generated class relatedness values were iteratively given as input to Modularity Optimizer to cluster the classes of ACPLCt, where the resolution parameter was set to different values in the iterations. By this, ACPLCs were created for comparison of similarity with BCPs. We denote the ACPLCs by ACPLC_1, …, ACPLC_k, where k is the number of created ACPLCs.
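The class relatedness measure can be sketched as follows. The sketch assumes that normalized direct citation values are stored once per unordered article pair and that pairs without a citation relation have the value 0 and are simply absent; the function name and the toy values are illustrative.

```python
from collections import defaultdict

def class_relatedness(article_class, citation_values):
    """Average normalized direct citation value between each pair of classes.

    article_class:   dict article_id -> class_id
    citation_values: dict (article_a, article_b) -> normalized citation value
    """
    sizes = defaultdict(int)
    for c in article_class.values():
        sizes[c] += 1

    sums = defaultdict(float)
    for (a, b), value in citation_values.items():
        ca, cb = article_class[a], article_class[b]
        if ca != cb:  # relations within a class do not contribute
            sums[tuple(sorted((ca, cb)))] += value

    # Divide by m * n, the number of article pairs between the two classes.
    return {pair: s / (sizes[pair[0]] * sizes[pair[1]]) for pair, s in sums.items()}

# Hypothetical example: class C has 2 articles, class D has 3.
article_class = {"a1": "C", "a2": "C", "b1": "D", "b2": "D", "b3": "D"}
citation_values = {("a1", "b1"): 0.5, ("a2", "b3"): 0.25, ("a1", "a2"): 1.0}
relatedness = class_relatedness(article_class, citation_values)
```

With 2 × 3 = 6 article pairs between C and D and a summed value of 0.75, the relatedness is 0.125.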

III. Creation of classifications derived from the ACPLCs
For each ACPLC_i (1 ≤ i ≤ k), a classification was derived from ACPLC_i in the following way:
(a) Each class C in ACPLC_i such that C did not contain any articles in P' was removed from ACPLC_i. Let ACPLC_i1 be the subset of ACPLC_i that resulted from the removal.
(b) For each class C in ACPLC_i1, all articles in C that did not belong to P' were removed from C.
Let ACPLC_iP' be the set that resulted from these removal operations.
Clearly, the set ACPLC_iP' constitutes a classification of P', i.e. of the union of the classes of the baseline classification BCPs. Thus, ACPLC_iP' and BCPs have exactly the same underlying articles. We denote the k derived classifications as ACPLC_1P', …, ACPLC_kP'. These classifications then correspond to the classifications ACPLC_1, …, ACPLC_k.
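Steps (a) and (b) amount to restricting every class to P' and dropping the classes that become empty. A minimal sketch, assuming classes are stored as sets of article identifiers (the function name and the toy data are hypothetical):

```python
def derive_classification(acplc, p_prime):
    """Restrict a classification to the articles in P'.

    acplc:   dict class_id -> set of article ids
    p_prime: set of article ids in the baseline classification
    """
    derived = {}
    for class_id, articles in acplc.items():
        kept = articles & p_prime   # step (b): drop articles outside P'
        if kept:                    # step (a): drop classes without P' articles
            derived[class_id] = kept
    return derived

# Hypothetical example: class c2 contains no baseline articles and disappears.
acplc = {"c1": {"a1", "a2", "x1"}, "c2": {"x2", "x3"}, "c3": {"a3"}}
p_prime = {"a1", "a2", "a3"}
derived = derive_classification(acplc, p_prime)
```

The union of the derived classes equals P' whenever every article in P' belongs to some class of the ACPLC, which is what requirement A ensures.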

IV. Quantification of the similarity between BCPs and the ACPLC_iP's
We attempt to optimize the granularity of an ACPLC_iP' so that it exhibits as high similarity as possible with BCPs (Figure 3). As in our topic identification study, we used the Adjusted Rand Index (ARI) (Hubert & Arabie, 1985) to quantify the similarity between BCPs and an ACPLC_iP'. The ARI has a maximum value of 1 and an expected value of 0 for chance agreement; it can take negative values when the agreement is lower than expected by chance. It is advantageous over the original Rand Index proposed by Rand (1971), because it adjusts for chance. The ARI compares two classifications by considering pairs of items in one of the classifications and whether or not each pair is grouped into the same class in the other classification. Note that an ARI value of 1 between BCPs and an ACPLC_iP' corresponds to a situation in which these two classifications are identical. For further information on ARI, we refer the reader to Sjögårde and Ahlgren (2018).
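To make the pair-counting idea concrete, the following sketch computes the ARI of Hubert and Arabie (1985) from the contingency table of two labelings of the same items. It is a plain-Python illustration with toy labels, not the code used in the study.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("labelings must cover the same items")
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivially identical

    # Pairs placed together in both classifications, in a, and in b.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)   # chance level
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in its own class in both
    return (together_both - expected) / (maximum - expected)

# Toy example: the second labeling splits one of the two classes.
identical = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 1])
partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

For identical labelings the index is 1; in the second toy case it is 4/7, since one of the two co-classified pairs is split.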
To find the ACPLC_iP' with the highest ARI similarity with BCPs, we tested the similarity after each run of Modularity Optimizer. A first run was made with a resolution parameter value of 5E-7. This value was chosen based on previous experience and some testing. We then increased the parameter value by 5E-7. This increase resulted in a higher ARI similarity, and we therefore increased the resolution further by 5E-7 for the third run, from 1E-6 to 1.5E-6. We continued by increasing the resolution by 5E-7, in total three more times, and thus six runs were done. The fourth run, with a resolution parameter value of 2E-6, gave rise to the highest ARI similarity (Table 2 and Figure 4, Section 6).
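The search over resolution values is a simple grid scan: run the clustering for each value, score the derived classification against the baseline, and keep the best. In the sketch below, `evaluate` stands in for the whole "cluster, derive, compute ARI" pipeline, and `toy_evaluate` is a placeholder that does not reproduce the ARI values of the study.

```python
def pick_best_resolution(resolutions, evaluate):
    """Return (resolution, score) for the run with the highest score."""
    scored = [(evaluate(r), r) for r in resolutions]
    best_score, best_r = max(scored)
    return best_r, best_score

# Six runs in steps of 5E-7, starting at 5E-7, as described in the text.
resolutions = [k * 5e-7 for k in range(1, 7)]

def toy_evaluate(resolution):
    # Placeholder score, NOT real ARI values: it peaks at 2E-6 for illustration only.
    return -abs(resolution - 2e-6)

best_r, best_score = pick_best_resolution(resolutions, toy_evaluate)
```

A scan of this kind only finds the best value on the grid; a finer step around the maximum would be needed to locate the optimum more precisely.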
In total BCPs consists of 1,118 baseline classes. A given ACPLC_iP' consists of 78,217 articles, which is about 7% of the articles from the year 2010 in the corresponding ACPLC_i. The ACPLC_i such that ACPLC_iP' exhibits the largest ARI similarity with BCPs is proposed to be used for the analyses of specialties. We denote this ACPLC_i by ACPLCs.

Results and discussion
In this section, we first deal with the selection and properties of ACPLCs. Then, as in the earlier study on topic identification (Sjögårde & Ahlgren, 2018), we consider two cases. We examine the specialties of articles belonging to (1) the Web of Science subject category "Information Science & Library Science", and (2) the Web of Science subject category "Medical Informatics".

Selection and properties of ACPLCs
Figure 4 shows a scatter plot of the relation between the resolution value (horizontal axis) used to obtain ACPLC_is and the ARI value (vertical axis), obtained by comparing the ACPLC_iP's with BCPs. ACPLC_4P' has the highest ARI value. ACPLC_4P' corresponds to ACPLC_4, which we consider to be the most proper ACPLC_i with respect to granularity of specialties. In the remainder of this paper, we denote ACPLC_4 as ACPLCs. However, we acknowledge that ACPLC_3P' has an ARI value that is only slightly lower than the value of ACPLC_4P'. Thus, ACPLC_3P' performs almost as well as ACPLC_4P'.
Figure 4: ARI values between ACPLC_iP's and BCPs. The vertical axis shows the ARI value and the horizontal axis shows the value of the resolution parameter used to obtain the corresponding ACPLC_is. The order of ACPLC_iP's corresponds to their order in Table 2.
To get a picture of how well ACPLCs matches BCPs, we calculated the distribution of articles in an average class (journal) in BCPs into classes in ACPLCs. This was done by first calculating the average number of classes in ACPLCs into which the articles in a class in BCPs are distributed, an average that is equal to 24 (after rounding to nearest integer). We then selected all 19 classes in BCPs that were distributed into exactly 24 classes. Let the set of these classes be Psc. The average number of articles in a Psc class is 63. For each of the Psc classes, we calculated the number of its articles in each of the 24 ACPLCs classes and sorted the resulting table in descending order. The ACPLCs class with the highest number of articles (i.e. the class corresponding to the first row in the table) was assigned the rank 1, the second largest class (i.e. the class corresponding to the second row in the table) was assigned the rank 2, etc. In this way, 19 ranked tables were obtained. Finally, averages of the number of articles by rank number, 1,…, 24, were calculated across all the 19 tables (Figure 5).
Given that we consider the classes in ACPLCs as specialties, the distribution of journal articles in a typical BCPs class follows a skewed distribution of specialties. About 49% of the articles in an average BCPs class are distributed into the two most frequent specialties, and 17 specialties (classes 8 to 24) are represented by a single article (after rounding to nearest integer). Hence, a high share of the articles of the average BCPs class is concentrated in a few of the ACPLCs classes. We therefore consider the match between ACPLCs and BCPs as good.
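The rank-averaging procedure described above can be sketched as follows, assuming each baseline class is given as a list of article identifiers and each article has a specialty label. The function name and the toy data are hypothetical, and Python's built-in `round` (which rounds exact halves to the nearest even integer) is used as a stand-in for the rounding in the text.

```python
from collections import Counter

def average_rank_profile(baseline_classes, specialty_of):
    """Average number of articles by rank for 'typical' baseline classes.

    baseline_classes: list of lists of article ids (one list per journal)
    specialty_of:     dict article_id -> specialty class id
    """
    # Per journal: article counts per specialty, sorted in descending order.
    profiles = [
        sorted(Counter(specialty_of[a] for a in articles).values(), reverse=True)
        for articles in baseline_classes
    ]
    # Typical number of specialties per journal (rounded mean).
    typical_n = round(sum(len(p) for p in profiles) / len(profiles))
    selected = [p for p in profiles if len(p) == typical_n]
    if not selected:
        return typical_n, []
    # Average the counts rank by rank across the selected journals.
    averages = [sum(p[r] for p in selected) / len(selected) for r in range(typical_n)]
    return typical_n, averages

# Hypothetical toy data: three journals spread over 2, 2 and 3 specialties.
specialty_of = {1: "s1", 2: "s1", 3: "s1", 4: "s2",
                5: "s1", 6: "s1", 7: "s3", 8: "s3",
                9: "s1", 10: "s2", 11: "s3"}
baseline_classes = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11]]
typical_n, averages = average_rank_profile(baseline_classes, specialty_of)
```

In the toy data the typical journal spans two specialties, and the two selected journals have rank profiles [3, 1] and [2, 2], giving averages of 2.5 and 1.5.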
ACPLCs consists of 60,649 classes, ranging from 1 to 60,608 articles. Most of the classes are small; however, these classes contain only a small share of the total number of articles in ACPLCs. For instance, classes with fewer than 500 articles contain about 0.9% of the publications in ACPLCs. Figure 6 shows a histogram of the distribution of classes by class size (in terms of number of articles). In order to include classes of a substantial size in the figure, classes with fewer than 500 articles have been excluded from the figure. Most specialties of substantial size (minimum of 500 articles) have 7 (10th percentile) to 90 (90th percentile) subordinated topics of substantial size (a minimum of 50 articles), with a mode value of 10, a median of 28 and a mean value of about 40 (Figure 7 and Table 1). In Figure 8 class sizes are plotted by rank order for ACPLCs (= ACPLC_4), as well as for ACPLC_3 and ACPLC_5. A log-10 scale is used on both the vertical axis (showing class size by number of articles) and the horizontal axis (showing ranks). In this figure, all classes are shown, including small size classes. About 3,500 classes contain at least 1,000 articles, about 1,100 classes contain at least 10,000 articles and about 70 classes contain at least 30,000 articles. In agreement with our study on topics, the size of classes is dropping rather slowly, regardless of classification. The increasing granularity, from ACPLC_3 via ACPLCs to ACPLC_5, is reflected by, for example, correspondingly decreasing intercepts.
Figure 9 expresses the number of articles in P (vertical axis) that are associated with different class sizes (horizontal axis). For a randomly selected article a, it is most probable that the size of the specialty class in ACPLCs to which a belongs is 7,000-8,000 articles (cf. the highest bar of the histogram in Figure 9). 80% of the articles belong to classes consisting of 3,765 (10th percentile) to 29,509 (90th percentile) articles (Table 2). The median class size in ACPLCs, taken over articles, is 13,145 and the mean 15,228. This distribution is not as skewed as the corresponding topic distribution (Sjögårde & Ahlgren, 2018, Figure 8).
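Percentiles of this kind are taken over articles rather than over classes, so each class is weighted by its own size. The following sketch shows one possible definition, the smallest class size s such that at least q percent of all articles belong to classes of size at most s; this definition is an assumption, since the text does not specify the percentile method, and the class sizes are toy values.

```python
def article_weighted_percentile(class_sizes, q):
    """Smallest class size s such that at least q percent of all articles
    belong to classes of size <= s (q given as an integer percentage)."""
    total = sum(class_sizes)
    cumulative = 0
    for size in sorted(class_sizes):
        cumulative += size  # a class of this size contributes 'size' articles
        if cumulative * 100 >= q * total:
            return size
    raise ValueError("class_sizes must be non-empty and q at most 100")

# Hypothetical class sizes: 20 articles in total.
sizes = [1, 1, 2, 6, 10]
p10 = article_weighted_percentile(sizes, 10)
p50 = article_weighted_percentile(sizes, 50)
p90 = article_weighted_percentile(sizes, 90)
```

In the toy data the median article sits in a class of size 6, even though the median class has size 2, which illustrates why article-weighted figures are much larger than class-weighted ones.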
The number of articles contributing to a specialty in 2015 (the most recent complete year at the time of data extraction) is between 187 and 2,040, given that we only take the middle 80% of the distribution into account (Table 3 and Figure 10). The median class size is 742. The mean number of articles per specialty class grows approximately linearly across the 10 years (Table 3). This can be expected, considering the linear growth of research publication output in Web of Science.
As mentioned in the introduction, Morris (2005) estimates the size of specialties to be between 100 and 5,000 articles, without specifying a time period, and Boyack et al. (2014) estimate the yearly article output of a specialty to be somewhere between 100 and 1,000 articles. The results of the present study cannot easily be compared to these figures. The estimates of both Morris and Boyack et al. are rough. Further, the work by Morris is rather old, and the size of specialties, in terms of publication output, may have increased since then. Table 3 shows that the number of articles in Web of Science grew by more than 50% between 2006 and 2015. In 2015, the size of specialties ranges from about 200 articles (10th percentile) to about 2,000 articles (90th percentile). Thus, the size of specialties in 2015 is about twice the size estimated by Boyack et al. We regard this difference as rather small, taking into account that Boyack et al. define the next larger level (disciplines) to range from tens to hundreds of thousands of articles per year, which is several orders of magnitude larger than our estimate of the size of specialties.
In agreement with Morris and Boyack et al., we find it reasonable not to regard publication classes below some size threshold as specialties. One solution to the problem of small classes is to reassign such classes (classes below a threshold) based on their relations with larger classes (classes at or above the same threshold), as proposed by Waltman & van Eck (2012). However, how to set the threshold is a question that we do not address in this paper.
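The kind of reassignment discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Waltman & van Eck (2012): the function name, the `relatedness` input (for example, counts of citation links between classes) and the threshold value are all hypothetical. Each class below the threshold is merged into the class at or above the threshold with which it has the strongest relation.

```python
def reassign_small_classes(class_sizes, relatedness, threshold):
    """Map every class to itself or, if smaller than threshold, to the
    most strongly related class of size >= threshold.

    class_sizes: dict class_id -> number of articles
    relatedness: dict (class_a, class_b) -> relation strength, stored once
                 per unordered pair (hypothetical input format)
    """
    large = {c for c, n in class_sizes.items() if n >= threshold}
    assignment = {}
    for c in class_sizes:
        if c in large:
            assignment[c] = c
            continue
        # Strength of the relation between small class c and each large class.
        strengths = {
            g: relatedness.get((c, g), 0) + relatedness.get((g, c), 0)
            for g in large
        }
        # Sorting the candidates first makes tie-breaking deterministic.
        best = max(sorted(strengths), key=strengths.get, default=None)
        # A small class without any relation to a large class stays unassigned.
        assignment[c] = best if best is not None and strengths[best] > 0 else None
    return assignment


# Toy data (illustrative only).
sizes = {"A": 900, "B": 700, "c1": 40, "c2": 12, "c3": 3}
links = {("c1", "A"): 15, ("c1", "B"): 4, ("B", "c2"): 9}
result = reassign_small_classes(sizes, links, threshold=500)
```

In the toy data, `c1` is merged into `A`, `c2` into `B`, and `c3`, which has no relation to any large class, is left unassigned. The sketch takes the threshold as given, since how to set it is not addressed in this paper.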

The case of Library and Information Science
To explore how articles within the discipline of library and information science (LIS) are distributed into classes in ACPLCs, we retrieved all articles in P that belong to a journal classified into the Web of Science subject category "Information Science & Library Science" and were published in the period 2011-2015. In total, 16,278 articles were retrieved. Let Plis be this set of articles.
For each class in ACPLCs, labels were automatically created based on author keywords (Sjögårde & Ahlgren, 2018). To distinguish the scope of each specialty, we used these labels and the labels of the topics in each class. Recall that ACPLCs is obtained by clustering the topics of ACPLCt, the best performing ACPLC with respect to topic identification (Sjögårde & Ahlgren, 2018). Table 4 shows, for the 10 most frequent specialties, the total number of articles in each specialty and the number, and the share, of articles in the specialty that belong to Plis. The top 10 specialties cover about 55% of the articles in Plis. Some of the top 10 specialties are highly concentrated in the analyzed Web of Science subject category (e.g. "ACADEMIC LIBRARIES//INTERLENDING//DOCUMENT DELIVERY", 77%), while other specialties have a low share of their total number of articles in this category (e.g. "CUSTOMER SATISFACTION//CUSTOMER LOYALTY//SERVICE QUALITY", 6%).
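The quantities of the kind reported in Table 4 can be computed along the following lines. This is a minimal sketch with invented records; the names `specialty_shares`, `coverage` and `in_category` are hypothetical, and the sketch makes no assumption about which publication years enter the totals of the actual table.

```python
from collections import Counter


def specialty_shares(articles):
    """articles: iterable of (specialty_label, in_category) pairs, one per
    article; in_category is True if the article belongs to the analyzed
    subject category.

    Returns rows (specialty, total articles, category articles, share),
    sorted by number of category articles in descending order."""
    total = Counter()
    in_cat = Counter()
    for specialty, in_category in articles:
        total[specialty] += 1
        if in_category:
            in_cat[specialty] += 1
    rows = [
        (s, total[s], in_cat[s], in_cat[s] / total[s])
        for s in total
        if in_cat[s] > 0
    ]
    rows.sort(key=lambda r: (-r[2], r[0]))
    return rows


def coverage(rows, top_n, n_category_articles):
    """Share of the category's articles covered by the top_n specialties."""
    return sum(r[2] for r in rows[:top_n]) / n_category_articles


# Toy data (illustrative only): S1 is concentrated in the category, S2 is not.
toy = (
    [("S1", True)] * 3 + [("S1", False)] * 1
    + [("S2", True)] * 1 + [("S2", False)] * 9
    + [("S3", False)] * 5
)
rows = specialty_shares(toy)
top1_coverage = coverage(rows, top_n=1, n_category_articles=4)
```

Here `S1` has 75% of its articles in the category and `S2` only 10%, which corresponds to the contrast between highly concentrated specialties and specialties with a low share described above.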
The highest ranked specialty, "ACADEMIC LIBRARIES//INTERLENDING//DOCUMENT DELIVERY", focuses on library science. This specialty includes topics such as open access, information literacy, e-books, user needs and user behavior, interlending, library systems and reference services. The second ranked specialty, "BIBLIOMETRICS//CITATION ANALYSIS//IMPACT FACTOR", focuses on bibliometric indicators, mapping and evaluation of research and the analysis of scholarly communication. We note that a majority of the largest topics in this specialty are the same topics that were observed in the case study of Journal of Informetrics in the previous topics study (Sjögårde & Ahlgren, 2018, and Appendix 1 in this paper). The specialty "ENTERPRISE RESOURCE PLANNING//IT OUTSOURCING//IT INVESTMENT" includes some topics related to LIS, e.g. IT business value, IT outsourcing, information system planning and information infrastructure. The scopes of specialties 4, 5, 6, 7, 8 and 10 are captured rather well by their labels, and these specialties are all clearly related to LIS. These six specialties include information retrieval, studies of customers and service, as well as library and information aspects of health service and occupation, of innovation and patents, and of media and communication. The LIS relevance of "UNIVERSAL SERVICE//TELECOMMUNICATIONS//ACCESS PRICING" (rank 9) lies in topics such as internet access and the digital divide.
Appendix 1 lists the 10 topics with most publications in Plis for the top 10 ranked specialties with regard to Plis.

The case of Medical Informatics
In analogy with the case of LIS, and to explore how articles within the discipline of medical informatics (MI) are distributed into classes in ACPLCs, we retrieved all articles in P that belong to a journal classified into the Web of Science subject category "Medical Informatics" and were published in the period 2011-2015. In total, 12,516 articles were retrieved. Let Pmi be this set of articles.
Table 5 shows the top 10 specialties in Pmi, ranked by frequency. Only one specialty is highly concentrated in the "Medical Informatics" category, namely "ELECTRONIC HEALTH RECORDS//ELECTRONIC MEDICAL RECORD//MEDICAL INFORMATICS" (which is also present in the LIS case). For the rest of the top 10 specialties, 17% or less of the total number of articles in the specialty belong to Pmi. This might suggest that MI is more interdisciplinary than LIS. It may also be the case that MI articles are published in broader journals, which are not classified into the "Medical Informatics" Web of Science subject category.
The largest specialty in the "Medical Informatics" category focuses on clinical decision support systems, clinical research informatics and electronic health records. The second ranked specialty within the category, "INTERNET//MHEALTH//PERSONAL HEALTH RECORDS", addresses topics within mobile health such as personal health records, online health information and online support groups. The specialty "MISSING DATA//MULTIPLE IMPUTATION//GENERALIZED ESTIMATING EQUATIONS" includes topics related to mathematical and statistical models and methods within the medical sciences, e.g. generalized estimating equations and shared parameter models. The remaining seven of the top 10 ranked specialties have the following foci: (4) Health technology assessment; (5) Prediction and risk models; (6) Clinical trial designs; (7) Medical epistemology, meta-analysis methods and literature searching; (8) Telehealth (can be seen as a predecessor to mobile health); (9) Patient safety (includes incident and error reporting); and (10) Computational techniques such as biomedical text mining, drug target interaction and gene prioritization.
Appendix 2 lists the 10 topics with most publications in Pmi for the top 10 ranked specialties with regard to Pmi.

Conclusions
In this study we have discussed how the resolution parameter given to the Modularity Optimizer software can be calibrated to cluster topics, obtained in a previous study (Sjögårde & Ahlgren, 2018) on topic identification, so that the obtained publication classes correspond in size to specialties. A set of journals has been used as a baseline for the calibration. Journals were selected based on their size and self-citation rate. The underlying assumption of our approach is that journals of a particular size and focus have a scope that corresponds to specialties. By measuring the similarity between (1) the baseline classification and (2) multiple classifications obtained by using different values of the resolution parameter, we have identified a classification, which we denote as ACPLCs, whose granularity corresponds to specialties. Some of the criteria used to evaluate ACPLCt, the best performing ACPLC with respect to topic identification, also apply to the evaluation of ACPLCs. The differences in class sizes should not be too large and "the number of very small clusters should be minimized as much as possible" (Šubelj, van Eck & Waltman, 2016). In ACPLCs, 80% of the articles belong to classes consisting of 3,765-29,509 articles. Further, 80% of the articles belong to classes with a yearly publication rate of 126-1,082 articles in publication year 2006, increasing to 187-2,040 in publication year 2015. Only 0.9% of the articles in P belong to classes with fewer than 500 articles in total. As in the previous study, the distribution follows a typical scientometric distribution, and we therefore consider the results regarding class sizes to be satisfactory.
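The calibration step summarized above can be sketched as follows. The sketch assumes that the similarity measure is the Adjusted Rand Index (the ARI values reported in Table 2 are taken here to refer to it), and it uses invented labels. The function names and the `candidates` dictionary are hypothetical; the clustering itself, performed with the Modularity Optimizer at each resolution value, is not reproduced.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two classifications of the same items
    (requires at least two items)."""
    n = len(labels_a)
    # Number of item pairs placed together in both classifications.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Number of item pairs placed together within each classification.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. two trivial partitions
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def best_resolution(baseline, candidates):
    """candidates: dict resolution value -> class labels of the same articles.
    Returns the resolution whose classification is most similar to the baseline."""
    return max(candidates, key=lambda r: adjusted_rand_index(baseline, candidates[r]))


# Toy data (illustrative only): six articles from two baseline journals.
baseline = ["j1", "j1", "j1", "j2", "j2", "j2"]
candidates = {
    0.5: [0, 0, 0, 0, 0, 0],  # too coarse: one class
    1.0: [0, 0, 0, 1, 1, 1],  # matches the baseline
    2.0: [0, 0, 1, 2, 2, 3],  # too fine: four classes
}
chosen = best_resolution(baseline, candidates)
```

In the toy data the middle resolution reproduces the baseline exactly and is selected, while the coarser and finer classifications obtain lower similarity values.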
In accordance with the previous study, we stress that it is reasonable, for practical reasons, to reassign small classes, which we have not done in this study. Moreover, we consider content labelling of classes to be a topic that needs to be addressed in future work.
Another criterion stated by Šubelj, van Eck & Waltman (2016) is that classes should make intuitive sense. In addition, we stress that it should be possible to identify the focus of a specialty and that two specialties should have subject foci that can be distinguished. Two case studies, in which we have identified specialties within the disciplines LIS and MI, have been performed to evaluate these criteria. We could identify the subject foci of the specialties in these case studies, and the subject foci of the specialties were relatively easy to distinguish. Thus, the two criteria are (approximately) satisfied in our case. Further, several of the specialties identified in the LIS case study have been identified by others (Bauer, Leydesdorff, & Bornmann, 2016; Blessinger & Frasier, n.d.; Figuerola, García Marco, & Pinto, 2017; Janssens, Leta, Glänzel, & De Moor, 2006), and the same holds for several of the specialties identified in the MI case study (Kim & Delen, 2018; Schuemie, Talmon, Moorman, & Kors, 2009; Wang, Topaz, Plasek, & Zhou, 2017). However, more case studies are needed to verify the soundness of the methodology used.
The aforementioned feature of the classification approach used in this study, logical classification, which assigns each topic to exactly one specialty, has some limitations. It is clear that topics can be addressed by several specialties (or, at a higher level, disciplines). For instance, Appendices 1 and 2 show that the topic with the label "NATURAL LANGUAGE PROCESSING//MEDICAL LANGUAGE PROCESSING//CLINICAL TEXT" is addressed by both the LIS and the MI discipline. This topic is forced into exactly one specialty, "ELECTRONIC HEALTH RECORDS//ELECTRONIC MEDICAL RECORD//MEDICAL INFORMATICS". Thus, relations between this topic and, e.g., specialties within the LIS discipline are not expressed by ACPLCs. However, relations between a specialty and topics within other specialties can still be analyzed using, for instance, citation relations. Nevertheless, a logical classification to some extent oversimplifies the complex structure of topic representation in research publications.
The classification of a topic into a specialty may also be counter-intuitive from the point of view of a single researcher active in one of the involved specialties or disciplines. An advantage of logical classification is, however, that such a classification might put topics and specialties in a larger context. As an example, the technology acceptance model (TAM) is an information system theory used within LIS. However, TAM is used not only within LIS but also in other fields, e.g. computer science. In ACPLCs, the topic "TECHNOLOGY ACCEPTANCE MODEL//TECHNOLOGY ACCEPTANCE MODEL TAM//TAM" is categorized in the specialty "CUSTOMER SATISFACTION//CUSTOMER LOYALTY//SERVICE QUALITY", and thereby ACPLCs puts LIS publications within this topic in a larger context.
The combined outcome of our previous study on the classification of topics and the present study on the classification of specialties is a two-level hierarchical classification. We believe that such a classification constitutes a valuable part of a research information system, and we propose that it can be used for bibliometric analyses of topics and specialties.

Figure 1: Illustration of the design of the study.

Figure 2: Number of articles per journal size for journals with 1 to 1,000 articles in 2010.

Figure 3: Two alluvial diagrams (A and B) illustrating the relation between two classifications. A shows two classifications with a high level of similarity. B shows two classifications with a low level of similarity.
shows the resulting average distribution of articles in Psc (to the left) into the 24 ACPLCt classes (to the right). Ranks and average number of articles across the Psc classes are shown for ACPLCs.

Figure 5: Alluvial diagram for an average class. The diagram shows the distribution of journal articles in BCPs into ACPLCs.

Figure 6: Histogram of number of classes by class size for ACPLCs. Classes with fewer than 500 articles are disregarded.

Figure 7: Histogram of number of specialties by number of subordinated topics for ACPLCs. Specialties with fewer than 500 articles and topics with fewer than 50 articles are disregarded.

Figure 8: Distribution of number of articles by class size for three classifications. The classes in ACPLC_3, ACPLC_4 = ACPLCs and ACPLC_5 are shown in descending order of size along the horizontal axis. A log-10 scale is used for both axes.

Figure 9: Histogram of number of articles by class size for ACPLCs.

Table 2: For each ACPLC_iP', the ARI value between ACPLC_iP' and BCPs, and the value of the resolution parameter used to obtain ACPLC_i, are shown, as well as the number of classes with at least 500 articles and class size distribution measures for ACPLC_i.

Figure 10: Histogram of number of articles by class size, for the publication year 2015 and for ACPLCs.

Table 1: Distribution statistics of number of topics per specialty for ACPLCs. Specialties with fewer than 500 articles and topics with fewer than 50 articles are disregarded.

Table 3: For a 10-year period (at the time of data extraction), the table shows class size distribution measures for ACPLCs.

Table 4: Distribution of articles in the Web of Science subject category "Information Science & Library Science" into specialties, 2011-2015.

Table 5: Distribution of articles in the Web of Science subject category "Medical Informatics" into specialties, 2011-2015.