KB4Rec: A Dataset for Linking Knowledge Bases with Recommender Systems

To develop a knowledge-aware recommender system, a key data problem is how we can obtain rich and structured knowledge information for recommender system (RS) items. Existing datasets or methods either use side information from the original recommender systems (containing very few kinds of useful information) or utilize a private knowledge base (KB). In this paper, we present the first public linked KB dataset for recommender systems, named KB4Rec v1.0, which links three widely used RS datasets with the popular KB Freebase. Based on our linked dataset, we first perform some interesting qualitative analysis experiments, in which we discuss the effect of two important factors (i.e., popularity and recency) on whether an RS item can be linked to a KB entity. Finally, we present the comparison of several knowledge-aware recommendation algorithms on our linked dataset.


INTRODUCTION
With the rapid development of Web techniques, various kinds of side information have become available in recommender systems (RS). At an early stage, such context information was usually unstructured, and its availability was limited to specific data domains or platforms [6,8,15]. Recently, more and more efforts have been made by both the research and industry communities to structure world knowledge or domain facts in a variety of data domains. One of the most typical organization forms is the knowledge base (KB) [19]. KBs provide a general and unified way to organize and relate information entities, and they have been shown to be useful in many applications. In particular, KBs have been used in RSs [10,21], an approach usually called knowledge-aware recommendation.
To develop a knowledge-aware recommender system, a key data problem is how we can obtain rich and structured knowledge information for RS items. Overall, there are two main solutions in existing studies. First, side information is collected from the RS platform [6,8,15], and several studies further construct tiny and simple KB-like knowledge structures [20]. The number of attributes or relations is usually limited, and much useful knowledge information is not considered. Second, several works propose to link RSs with private KBs [21], but the linkage results are not publicly available.
To address the need for a linked dataset of RSs and KBs, we present a public linked KB dataset for recommender systems, named KB4Rec v1.0, freely available at https://github.com/RUCDM/KB4Rec. Our basic idea is to heuristically link items from RSs with entities from a public large-scale KB. On the RS side, we select three widely used datasets (i.e., MovieLens [6], LFM-1b [15] and Amazon book [8]) covering three different data domains, namely movie, music and book; on the KB side, we select the well-known Freebase [5]. We try to maximize the applicability of our linked dataset by selecting very popular RS datasets and KBs. We are also aware of some closely related studies [1,13], which also aim to link RS items with KB entities. In contrast, our focus is on Freebase, which is now widely used in NLP and related domains [19].
In our KB4Rec v1.0 dataset, we organize the linkage results as linked ID pairs, each of which consists of an RS item ID and a KB entity ID. We do not share the original datasets, since they are maintained by the original researchers or publishers. All the IDs are inner values from the original datasets. Once such a linkage has been accomplished, it becomes possible to reuse existing large-scale KB data for RSs. For example, the movie "Avatar" from the MovieLens dataset [6] has a corresponding entity entry in Freebase, and we are able to obtain its attribute information by reading out all its associated relation triples in Freebase. Based on the linked dataset, we first perform some interesting qualitative analysis experiments, in which we discuss the effect of two important factors (i.e., popularity and recency) on whether an RS item can be linked to a KB entity. Finally, we present the comparison of several knowledge-aware recommendation algorithms on our linked dataset.
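As an illustration of how linked ID pairs can be used, the following minimal sketch looks up the KB triples of an RS item. It assumes, purely for illustration, that the linkage is stored as tab-separated lines of the form item ID, entity ID, and that the KB triples are already available in memory; the function names and the example IDs and relation names are hypothetical and are not taken from the released files.

```python
from collections import defaultdict


def load_linkage(lines):
    """Parse 'item_id<TAB>entity_id' lines (assumed format) into a dict."""
    links = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item_id, entity_id = line.split("\t")[:2]
        links[item_id] = entity_id
    return links


def index_triples(triples):
    """Group (head, relation, tail) triples by their head entity."""
    by_head = defaultdict(list)
    for head, relation, tail in triples:
        by_head[head].append((relation, tail))
    return by_head


def item_attributes(item_id, links, by_head):
    """Return the (relation, tail) pairs of the entity linked to an item."""
    entity_id = links.get(item_id)
    if entity_id is None:
        return []  # the item has no linked KB entity
    return by_head.get(entity_id, [])


# Hypothetical IDs and relation names, for illustration only.
links = load_linkage(["item_42\tentity_avatar\n"])
kb = index_triples([
    ("entity_avatar", "directed_by", "entity_director"),
    ("entity_avatar", "genre", "entity_scifi"),
])
print(item_attributes("item_42", links, kb))
```

Indexing triples by head entity keeps each lookup cheap, which matters when the same KB is queried for every linked item.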

EXISTING DATASETS AND METHODS
In this section, we briefly review the related datasets and methods.
Early knowledge-aware recommendation algorithms are also called context-aware recommendation algorithms, in which the side information from the original RS platform is considered as context data. For example, social network information from the Epinions dataset is utilized in [11,12], POI property information from the Yelp dataset is utilized in [4], movie attribute information from the MovieLens dataset is utilized in [20], and user profile information from a microblogging dataset is utilized in [22]. These datasets usually contain very few kinds of side information, and the relations between different kinds of side information are ignored.
To make such side information more structured, Heterogeneous Information Networks (HIN) have been proposed as a general technique for modeling information networks [16]. In HINs, we can effectively learn underlying relation patterns (called meta-paths) and organize side information via meta-path-based representations. For example, HIN-based recommendation has been realized in methods such as PER [20] and MCRec [9]. HIN-based algorithms usually rely on graph search algorithms, which makes it difficult for them to deal with large-scale relation pattern finding.
More recently, KBs have become a popular kind of data resource for storing and organizing world knowledge or domain facts. Many studies have been proposed [19] on the construction, inference and applications of KBs. In particular, several pioneering studies try to leverage existing KB information for improving recommendation performance [17,18,21]. They apply a heuristic method for linking RS items with KB entities. However, these studies use a private KB for linkage, which cannot be obtained publicly.
We are also aware of some closely related studies, including [1,13], which also aim to link RS items with KB entities. In contrast, our focus is on Freebase, which is now widely used in NLP and related domains [19].

LINKED DATASET CONSTRUCTION
In our work, we need to prepare two kinds of datasets, namely RS and KB data. Next, we first give the detailed descriptions of the original datasets, and then discuss the linkage method.
RS Datasets. We consider three popular RS datasets for linkage, namely MovieLens, LFM-1b and Amazon book, which cover the three domains of movie, music and book, respectively.
(1) MovieLens dataset [6] describes users' preferences on movies. A preference record takes the form ⟨user, item, rating, timestamp⟩, indicating the rating score of a user for a movie at some time. There have been four MovieLens datasets released, known as 100K, 1M, 10M, and 20M, reflecting the approximate number of ratings in each dataset. We select the largest MovieLens 20M for linkage.
(2) LFM-1b dataset [15] describes users' interaction records on music. It provides information including artists, albums, tracks, and users, as well as individual listening events. It records the listening events of a user on songs, but does not contain rating information.
(3) Amazon book dataset [8] describes users' preferences on book products with the data form of ⟨user, item, rating, timestamp⟩. The dataset is very sparse, containing 22 million ratings from 8 million users across nearly 2.3 million items.
The three RS datasets provide several kinds of side information, such as item titles (all), IMDB ID (movie), writer (book) and artist (music). We utilize such side information for subsequent KB linkage.
KB Dataset. We adopt the large-scale public KB Freebase. Freebase [5] is a KB announced by Metaweb Technologies, Inc. in 2007, which was acquired by Google Inc. on July 16, 2010. Freebase stores facts as triples of the form ⟨head, relation, tail⟩. Since Freebase shut down its services on August 31, 2016, we use its latest public version. We select Freebase because it has been widely applied in the research communities [19].
RS to KB Linkage. With an offline Freebase search API, we retrieve KB entities with item titles as queries. If no KB entity with the same title is returned, we say the RS item is rejected in the linkage process. If at least one KB entity with the same title is returned, we further incorporate one kind of side information as a refined constraint for accurate linkage: IMDB ID, artist name and writer name are used for the three domains of movie, music and book, respectively. We find that only a small number (about one thousand for each domain) of RS items can be neither accurately linked nor rejected via the above procedure, and we simply discard them. During the linkage process, we deal with several problems that affect the results of string matching algorithms, e.g., letter case, abbreviations, and the order of family/given names. Since the LFM-1b dataset is extremely large, we remove all the music items with fewer than ten listening events. Even after filtering, it still contains about 6.5 million music items.
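The linkage heuristic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure actually used to build the dataset: the candidate list stands in for the results of the offline search API, the dictionary keys (`id`, `title`, `side`) and the function names are hypothetical, and abbreviation handling is omitted.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def normalize_name(name):
    """Sort the tokens so that the family/given name order does not matter."""
    return " ".join(sorted(normalize(name).split()))


def link_item(item, candidates):
    """Return (status, entity_id) for one RS item.

    item: dict with 'title' and 'side' (IMDB ID, artist name or writer name).
    candidates: list of dicts with 'id', 'title' and 'side', standing in
    for the entities returned by a KB search (assumed structure).
    """
    title = normalize(item["title"])
    same_title = [c for c in candidates if normalize(c["title"]) == title]
    if not same_title:
        return ("rejected", None)  # no KB entity with the same title
    side = normalize_name(item["side"])
    matched = [c for c in same_title if normalize_name(c["side"]) == side]
    if len(matched) == 1:
        return ("linked", matched[0]["id"])
    return ("unresolved", None)  # neither accurately linked nor rejected


item = {"title": "The Hobbit", "side": "Tolkien, J. R. R."}
candidates = [
    {"id": "e1", "title": "the hobbit", "side": "J. R. R. Tolkien"},
    {"id": "e2", "title": "The Hobbit", "side": "Someone Else"},
]
print(link_item(item, candidates))  # ('linked', 'e1')
```

Returning an explicit status keeps the three outcomes of the procedure (linked, rejected, discarded) separate, which is what the later linkage-ratio analysis relies on.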
Basic Statistics. We summarize the basic statistics of the three linked datasets in the second column of

LINKAGE ANALYSIS
Previously, we have shown the linkage ratios for different datasets. We find that a considerable number of RS items cannot be linked to KB entities. It is interesting to study what factors affect the linkage ratio. We consider two kinds of factors for analysis.
Figure 1: We use A, B, · · · to indicate the bin number in an ordered way. The first three subfigures correspond to the popularity analysis, and the last one corresponds to the recency analysis.
Effect of Popularity on Linkage. Intuitively, a popular RS item should be more likely to be included in a KB than an unpopular item, since it is reasonable to incorporate more "important" RS items, as judged by the RS users, into KBs. The construction of a KB itself usually involves manual effort, which makes it difficult to avoid the bias of human attention. To measure the popularity of an RS item, we adopt a simple frequency-based method by counting the number of users who have interacted with the item. This measure characterizes the attractiveness of an item to the users in an RS. First, we sort the items in ascending order of their popularity values. Then, we divide all the items into five ordered bins with the same number of items. Hence, an item with a larger bin number will be more popular than another with a smaller bin number. We then compute the linkage ratio for each bin, and the results are reported in Fig. 1(a) (the three subfigures on the left). It can be observed that a bin with a larger number has a higher linkage ratio than the ones with a smaller number. The results indicate that popularity is likely to have a positive effect on linkage.
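The binning procedure above can be sketched in a few lines. The function name and the input format, a list of (value, is_linked) pairs, are assumptions made for illustration; the value can be a popularity count, or a release date when the same procedure is reused for other factors.

```python
def linkage_ratio_by_bin(items, num_bins=5):
    """Compute the linkage ratio of equally sized, ordered bins.

    items: list of (value, is_linked) pairs, where value is the quantity
    used for sorting (e.g., the number of interacting users).
    Returns one ratio per bin, from the smallest values to the largest.
    """
    ordered = sorted(items, key=lambda pair: pair[0])
    n = len(ordered)
    ratios = []
    for b in range(num_bins):
        # Integer arithmetic keeps the bin sizes within one item of each other.
        start, end = b * n // num_bins, (b + 1) * n // num_bins
        chunk = ordered[start:end]
        linked = sum(1 for _, is_linked in chunk if is_linked)
        ratios.append(linked / len(chunk) if chunk else float("nan"))
    return ratios


# Toy data: items with popularity 0..9, where only the popular half is linked.
toy = [(popularity, popularity >= 5) for popularity in range(10)]
print(linkage_ratio_by_bin(toy))  # [0.0, 0.0, 0.5, 1.0, 1.0]
```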
Effect of Recency on Linkage. The second factor we consider is recency, i.e., the time when an RS item was created. Our assumption is that if an RS item was created or released at an earlier time, it is more likely to be included in KBs. Since the aggregation of human attention is a gradually growing process, an RS item usually requires a considerable amount of time to become popular. To check this assumption, we need to obtain the release dates of RS items. However, since only the MovieLens 20M dataset contains such attribute information, we only report the analysis result on this dataset. We first sort the items in ascending order of their release dates, and then divide all the items equally into ten ordered bins, following the procedure of the above popularity analysis. Finally, we compute the linkage ratio for each bin. The results are reported in Fig. 1(b). We can see that the linkage ratios gradually decrease over time. The results indicate that recency is likely to have a negative effect on linkage, i.e., an older RS item seems more likely to be included in a KB than a more recent one. In particular, the last bin shows a dramatic drop. A possible reason is that our version of MovieLens dates from April 2015.

EXPERIMENT
In this section, we present the comparison of some existing recommendation algorithms on our linked datasets.
Experimental Setup. Since our linked datasets are very large, we first generate a smaller subset for the following experiments. We take the subset from the last year for the LFM-1b dataset and the subset from year 2005 to 2015 for the MovieLens 20M dataset. We also perform 3-core filtering for the Amazon book dataset and 10-core filtering for the other datasets. The statistics of the datasets used in [10] are reported in Table 1 (the last column). Following [7], we consider the last-item recommendation task for evaluation. We set up such a task since it is a commonly used evaluation setting for RSs, and it makes it easy to compare different methods. Given a user, we first sort the items in ascending order of interaction timestamp, and then put the last item into the test set and the rest into the training set. The final goal is to predict the last item given the previous interaction sequence of a user. Since enumerating all the items as candidates is time-consuming, we pair each ground-truth item with 100 negative items to form a randomly ordered list. Each comparison method then returns a ranked list according to its recommendation confidence.
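A minimal sketch of this last-item split is given below. It assumes interactions are given as (user, item, timestamp) tuples and that negative items are sampled uniformly from the items the user has not interacted with; the sampling distribution is not specified above, so this choice, like the function name, is an assumption.

```python
import random


def last_item_split(interactions, all_items, num_negatives=100, seed=0):
    """Split each user's sequence into a training history and one test item.

    interactions: list of (user, item, timestamp) tuples.
    all_items: list of all item IDs, used as the pool for negative sampling.
    Returns (train, test): train maps user -> ordered history, and test maps
    user -> (ground-truth item, randomly ordered candidate list).
    """
    rng = random.Random(seed)
    by_user = {}
    for user, item, timestamp in interactions:
        by_user.setdefault(user, []).append((timestamp, item))

    train, test = {}, {}
    for user, events in by_user.items():
        events.sort(key=lambda event: event[0])  # ascending by timestamp
        sequence = [item for _, item in events]
        if len(sequence) < 2:
            continue  # a history is needed to predict the last item
        history, target = sequence[:-1], sequence[-1]
        seen = set(sequence)
        # Assumption: negatives are items the user never interacted with.
        pool = [item for item in all_items if item not in seen]
        negatives = rng.sample(pool, min(num_negatives, len(pool)))
        candidates = negatives + [target]
        rng.shuffle(candidates)
        train[user] = history
        test[user] = (target, candidates)
    return train, test
```

Fixing the random seed makes the sampled candidate lists reproducible, so that all compared methods are ranked against the same negatives.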
To evaluate different methods, we adopt a variety of evaluation metrics, including the Mean Reciprocal Rank (MRR), Hit Ratio (HR), and Normalized Discounted Cumulative Gain (NDCG).
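For the last-item task, each test case has a single ground-truth item, so all three metrics depend only on the rank of that item in the returned list. The sketch below uses the standard definitions; the cutoff k = 10 and the function names are assumptions for illustration, since the cutoff is not stated here.

```python
import math


def rank_of_target(scores, target):
    """1-based rank of the target when candidates are sorted by score."""
    target_score = scores[target]
    better = sum(1 for item, score in scores.items()
                 if item != target and score > target_score)
    return 1 + better


def evaluate(ranks, k=10):
    """Average MRR, HR@k and NDCG@k over the ground-truth ranks."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hr = sum(1 for r in ranks if r <= k) / n
    # With one relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 1).
    ndcg = sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / n
    return {"MRR": mrr, "HR": hr, "NDCG": ndcg}


scores = {"a": 0.9, "b": 0.5, "c": 0.7}
print(rank_of_target(scores, "c"))  # 2
print(evaluate([1, 2, 11], k=10))
```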

KB Information Representation.
Our focus is to provide rich knowledge information for recommender systems. A simple way is to represent KB information with a one-hot vector, which is sparse and large. Here we borrow the idea in [2,21] to embed KB data into low-dimensional vectors. The learned embeddings are then used for subsequent recommendation algorithms. To train TransE [2], we start with linked entities as seeds and expand the graph with one-step search. Since not all the relations in KBs are useful, we remove infrequent relations with fewer than 5,000 triples. After that, each linked item is associated with a learned KB embedding vector.
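The preparation of the TransE training data can be sketched as follows. Two points are assumptions rather than statements from the text: one-step search is interpreted as keeping every triple whose head or tail is a linked entity, and relation frequencies are counted on the expanded subgraph. The last function illustrates the translation-based distance that TransE minimizes for true triples, here with the L1 norm.

```python
from collections import Counter


def build_training_triples(triples, seed_entities, min_relation_count=5000):
    """Expand the seeds by one step and drop infrequent relations."""
    seeds = set(seed_entities)
    # One-step search (assumed): keep triples that touch a seed entity.
    subgraph = [t for t in triples if t[0] in seeds or t[2] in seeds]
    # Frequencies are counted on the expanded subgraph (assumed).
    counts = Counter(relation for _, relation, _ in subgraph)
    return [t for t in subgraph if counts[t[1]] >= min_relation_count]


def transe_distance(head, relation, tail):
    """L1 distance ||h + r - t||; small values indicate plausible triples."""
    return sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


# Toy example with a lowered threshold so that the filter is visible.
toy_triples = [
    ("s", "r1", "x"),
    ("y", "r1", "s"),
    ("s", "r2", "z"),
    ("p", "r1", "q"),
]
print(build_training_triples(toy_triples, ["s"], min_relation_count=2))
```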
Methods to Compare. We consider the following methods for performance comparison:
• BPR [14]: It learns a matrix factorization model by minimizing the pairwise ranking loss in a Bayesian framework.
Results and Analysis. The results of different methods for the last-item recommendation are presented in Table 2. We can see that: (1) Among all the methods, BPR performs worst on the three datasets. Overall, the other models perform better than BPR, since they incorporate KB information.
(2) SVDFeature is implemented with a pairwise ranking loss function, and it can be roughly understood as an enhanced BPR model that incorporates the learned KB embeddings. Compared with BPR, SVDFeature is slightly better on the book dataset, which is more sparse, and substantially better on the music dataset and the movie dataset. In SVDFeature, each additional context feature adds a number of parameters (depending on the number of KB embedding dimensions).
(3) Next, we analyze the performance of the knowledge-aware recommendation methods, namely mCKE and KSR. Overall, mCKE does not work as well as expected: it only beats SVDFeature on the LFM-1b dataset. A possible reason is that our implementation of mCKE fixes the learned KB embeddings, while the original CKE model adaptively updates KB embeddings. In comparison, the recently proposed KSR method consistently works best on the three datasets. KSR combines the capacity of Recurrent Neural Networks (RNN) for modeling data sequences with the capacity of Memory Networks (MN) for storing data in the long term. It further enhances MNs with the learned KB embeddings.
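For reference, the pairwise ranking loss that underlies BPR, and that the SVDFeature implementation discussed in (2) also adopts, can be sketched for a single (user, positive item, negative item) triple as below. This is the standard BPR objective without regularization, written with a matrix factorization scorer; it is not the exact implementation used in our experiments, and the function names are illustrative.

```python
import math


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def bpr_loss(user_vec, pos_vec, neg_vec):
    """Return -ln(sigmoid(x_ui - x_uj)) for one training triple.

    x_ui and x_uj are the predicted scores of the positive and the
    negative item, given by inner products of latent factor vectors.
    """
    diff = dot(user_vec, pos_vec) - dot(user_vec, neg_vec)
    # Numerically stable form of log(1 + exp(-diff)).
    return math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)


# The loss shrinks as the positive item is scored above the negative one.
print(bpr_loss([1.0], [2.0], [0.0]) < bpr_loss([1.0], [0.0], [2.0]))  # True
```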

CONCLUSION
This paper introduced a public dataset for linking RSs with KBs, namely KB4Rec v1.0. Our dataset covers three domains and consists of a large number of linked ID pairs. As future work, we will consider linking more RS datasets with Freebase. We will also consider adopting other KB data for linkage, e.g., YAGO and DBpedia.