Entity Linking meets Word Sense Disambiguation: a Unified Approach

Entity Linking (EL) and Word Sense Disambiguation (WSD) both address the lexical ambiguity of language. But while the two tasks are closely related, they differ in a fundamental respect: in EL the textual mention can be linked to a named entity which may or may not contain the exact mention, while in WSD there is a perfect match between the word form (better, its lemma) and a suitable word sense. In this paper we present Babelfy, a unified graph-based approach to EL and WSD based on a loose identification of candidate meanings coupled with a densest subgraph heuristic which selects high-coherence semantic interpretations. Our experiments show state-of-the-art performance on both tasks on 6 different datasets, including a multilingual setting. Babelfy is online at http://babelfy.org


Introduction
The automatic understanding of the meaning of text has been a major goal of research in computational linguistics and related areas for several decades, with ambitious challenges, such as Machine Reading (Etzioni et al., 2006) and the quest for knowledge (Schubert, 2006). Word Sense Disambiguation (WSD) (Navigli, 2009; Navigli, 2012) is a historical task aimed at assigning meanings to single-word and multi-word occurrences within text, a task which is more alive than ever in the research community.
Recently, the collaborative creation of large semistructured resources, such as Wikipedia, and knowledge resources built from them (Hovy et al., 2013), such as BabelNet (Navigli and Ponzetto, 2012a), DBpedia (Auer et al., 2007) and YAGO2 (Hoffart et al., 2013), has favoured the emergence of new tasks, such as Entity Linking (EL) (Rao et al., 2013), and opened up new possibilities for tasks such as Named Entity Disambiguation (NED) and Wikification. The aim of EL is to discover mentions of entities within a text and to link them to the most suitable entry in a reference knowledge base. However, in contrast to WSD, a mention may be partial while still being unambiguous thanks to the context. For instance, consider the following sentence:
(1) Thomas and Mario are strikers playing in Munich.
This example makes it clear how intertwined the two tasks of WSD and EL are. In fact, on the one hand, striker and play are polysemous words which can be disambiguated by selecting the game/soccer playing senses of the two words in a dictionary; on the other hand, Thomas and Mario are partial mentions which have to be linked to the appropriate entries of a knowledge base, that is, Thomas Müller and Mario Gomez, two well-known soccer players.
The two main differences between WSD and EL lie, on the one hand, in the kind of inventory used, i.e., dictionary vs. encyclopedia, and, on the other hand, in the assumption that the mention is complete or potentially partial. Notwithstanding these differences, the tasks are similar in nature, in that they both involve the disambiguation of textual fragments according to a reference inventory. However, the research community has so far tackled the two tasks separately, often duplicating efforts and solutions.
In contrast to this trend, research in knowledge acquisition is now heading towards the seamless integration of encyclopedic and lexicographic knowledge into structured language resources (Hovy et al., 2013), and the main representative of this new direction is undoubtedly BabelNet (Navigli and Ponzetto, 2012a). Given such structured language resources it seems natural to suppose that they might provide a common ground for the two tasks of WSD and EL.
More precisely, in this paper we explore the hypothesis that the lexicographic knowledge used in WSD is also useful for tackling the EL task, and, vice versa, that the encyclopedic information utilized in EL helps disambiguate nominal mentions in a WSD setting. We propose Babelfy, a novel, unified graph-based approach to WSD and EL, which performs two main steps: i) it exploits random walks with restart, and triangles as a support for reweighting the edges of a large semantic network; ii) it uses a densest subgraph heuristic on the available semantic interpretations of the input text to perform a joint disambiguation with both concepts and named entities. Our experiments show the benefits of our synergistic approach on six gold-standard datasets.
Related Work

Word Sense Disambiguation
Word Sense Disambiguation (WSD) is the task of choosing the right sense for a word within a given context. Typical approaches for this task can be classified as i) supervised, ii) knowledge-based, and iii) unsupervised. However, supervised approaches require huge amounts of annotated data (Zhong and Ng, 2010; Shen et al., 2013; Pilehvar and Navigli, 2014), an effort which cannot easily be repeated for new domains and languages, while unsupervised ones suffer from data sparsity and an intrinsic difficulty in their evaluation (Agirre et al., 2006; Brody and Lapata, 2009; Manandhar et al., 2010; Van de Cruys and Apidianaki, 2011; Di Marco and Navigli, 2013). On the other hand, knowledge-based approaches are able to obtain good performance using readily-available structured knowledge (Agirre et al., 2010; Guo and Diab, 2010; Ponzetto and Navigli, 2010; Miller et al., 2012; Agirre et al., 2014). Some of these approaches marginally take into account the structural properties of the knowledge base (Mihalcea, 2005). Other approaches, instead, leverage the structural properties of the knowledge base by exploiting centrality and connectivity measures (Sinha and Mihalcea, 2007; Tsatsaronis et al., 2007; Agirre and Soroa, 2009; Navigli and Lapata, 2010).
One of the key steps of many knowledge-based WSD algorithms is the creation of a graph representing the semantic interpretations of the input text. Two main strategies to build this graph have been proposed: i) exploiting the direct connections, i.e., edges, between the considered sense candidates; ii) populating the graph according to (shortest) paths between them. In our approach we manage to unify these two strategies by automatically creating edges between sense candidates performing Random Walk with Restart (Tong et al., 2006).
The recent upsurge of interest in multilinguality has led to the development of cross-lingual and multilingual approaches to WSD (Lefever and Hoste, 2010; Lefever and Hoste, 2013; Navigli et al., 2013). Multilinguality has been exploited in different ways, e.g., by using parallel corpora to build multilingual contexts (Guo and Diab, 2010; Banea and Mihalcea, 2011; Lefever et al., 2011) or by means of ensemble methods which exploit complementary sense evidence from translations in different languages (Navigli and Ponzetto, 2012b). In this work, we present a novel exploitation of the structural properties of a multilingual semantic network.

Entity Linking
Entity Linking (Erbs et al., 2011; Rao et al., 2013; Cornolti et al., 2013) encompasses a set of similar tasks, which include Named Entity Disambiguation (NED), that is the task of linking entity mentions in a text to a knowledge base (Bunescu and Pasca, 2006; Cucerzan, 2007), and Wikification, i.e., the automatic annotation of text by linking its relevant fragments of text to the appropriate Wikipedia articles. Mihalcea and Csomai (2007) were the first to tackle the Wikification task. In their approach they disambiguate each word in a sentence independently by exploiting the context in which it occurs. However, this approach is local in that it lacks a collective notion of coherence between the selected Wikipedia pages. To overcome this problem, Cucerzan (2007) introduced a global approach based on the simultaneous disambiguation of all the terms in a text and the use of lexical context to disambiguate the mentions. To maximize the semantic agreement, Milne and Witten (2008) introduced the analysis of the semantic relations between the candidate senses and the unambiguous context, i.e., words with a single sense candidate. However, the performance of this algorithm depends heavily on the number of links incident to the target senses and on the availability of unambiguous words within the input text. To overcome this issue, a novel class of approaches has been proposed (Kulkarni et al., 2009; Ratinov et al., 2011; Hoffart et al., 2011) that exploit global and local features. However, these systems either rely on a difficult NP-hard formalization of the problem which is infeasible for long text, or exploit popularity measures which are domain-dependent. In contrast, we show that the semantic network structure can be leveraged to obtain state-of-the-art performance by synergistically disambiguating both word senses and named entities at the same time.
Recently, the explosion of on-line social networking services, such as Twitter and Facebook, has contributed to the development of new methods for the efficient disambiguation of short texts (Ferragina and Scaiella, 2010; Hoffart et al., 2012; Böhm et al., 2012). Thanks to a loose candidate identification technique coupled with a densest subgraph heuristic, we show that our approach is particularly suited for short and highly ambiguous text disambiguation.

The Best of Two Worlds
Our main goal is to bring together the two worlds of WSD and EL. On the one hand, this implies relaxing the constraint of a perfect association between mentions and meanings, which is, instead, assumed in WSD. On the other hand, this relaxation leads to the inherent difficulty of encoding a full-fledged sense inventory for EL. Our solution to this problem is to keep the set of candidate meanings for a given mention as open as possible (see Section 6), so as to enable high recall in linking partial mentions, while providing an effective method for handling this high ambiguity (see Section 7).
A key assumption of our work is that the lexicographic knowledge used in WSD is also useful for tackling the EL task, and vice versa the encyclopedic information utilized in EL helps disambiguate nominal mentions in a WSD setting. We enable the joint treatment of concepts and named entities by enforcing high coherence in our semantic interpretations.

WSD and Entity Linking Together
Task. Our task is to disambiguate and link all nominal and named entity mentions occurring within a text. The linking task is performed by associating each mention with the most suitable entry of a given knowledge base. We point out that our definition is unconstrained in terms of what to link, i.e., unlike Wikification and WSD, we can link overlapping fragments of text. For instance, given the text fragment Major League Soccer, we identify and disambiguate several different nominal and entity mentions: Major League Soccer, major league, league and soccer. In contrast to EL, we link not only named entity mentions, such as Major League Soccer, but also nominal mentions, e.g., major league, to their corresponding meanings in the knowledge base.
Babelfy. We provide a unified approach to WSD and entity linking in three steps:
1. Given a lexicalized semantic network, we associate with each vertex, i.e., either concept or named entity, a semantic signature, that is, a set of related vertices (Section 5). This is a preliminary step which needs to be performed only once, independently of the input text.
2. Given a text, we extract all the linkable fragments from this text and, for each of them, list the possible meanings according to the semantic network (Section 6).
3. We create a graph-based semantic interpretation of the whole text by linking the candidate meanings of the extracted fragments using the previously-computed semantic signatures. We then extract a dense subgraph of this representation and select the best candidate meaning for each fragment (Section 7).

Semantic Network
Our approach requires the availability of a wide-coverage semantic network which encodes structural and lexical information both of an encyclopedic and of a lexicographic kind. Although in principle any semantic network with these properties could be utilized, in our work we used the BabelNet 1.1.1 semantic network (Navigli and Ponzetto, 2012a) since it is the largest multilingual knowledge base, obtained from the automatic seamless integration of Wikipedia and WordNet (Fellbaum, 1998). We consider BabelNet as a directed multigraph which contains both concepts and named entities as its vertices and a multiset of semantic relations as its edges. We leverage the multilingual lexicalizations of the vertices of BabelNet to identify mentions in the input text. For example, the entity FC Bayern Munich can be lexicalized in different languages, e.g., F.C. Bayern de Múnich in Spanish, Die Roten in English and Bayern München in German, among others. As regards semantic relations, the only information we use is that of the end points, i.e., vertices, that these relations connect, while neglecting the relation type.

Building Semantic Signatures
One of the major issues affecting both manually-curated and automatically constructed semantic networks is data sparsity. For instance, we calculated that the average number of incident edges is roughly 10 in WordNet, 50 in BabelNet and 80 in YAGO2, to mention a few. Although automatically-built resources typically provide larger amounts of edges, two issues have to be taken into account: concepts which should be related might not be directly connected despite being structurally close within the network, and, vice versa, weakly-related or even unrelated concepts can be erroneously connected by an edge. For instance, in BabelNet we do not have an edge between playmaker and Thomas Müller, while we have an incorrect edge connecting FC Bayern Munich and Yellow Submarine (song). However, this crisp notion of relatedness can be overcome by exploiting the global structure of the semantic network, thereby obtaining a more precise and higher-coverage measure of relatedness. We address this issue in two steps: first, we provide a structural weighting of the network's edges; second, for each vertex we create a set of related vertices using random walks with restart.
Structural weighting. Our first objective is to assign higher weights to edges which are involved in more densely connected areas of the directed network. To this end, inspired by the local clustering coefficient measure (Watts and Strogatz, 1998) and its recent success in Word Sense Induction (Di Marco and Navigli, 2013), we use directed triangles, i.e., directed cycles of length 3, and weight each edge (v, v') by the number of directed triangles it occurs in:

$$\mathrm{weight}(v, v') := |\{(v, v', v'') : (v, v'), (v', v''), (v'', v) \in E\}| + 1 \qquad (1)$$

where E is the set of edges of the network. We add one to each weight to ensure the highest degree of reachability in the network.
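To make the triangle-based weighting concrete, the following is a minimal Python sketch on a toy edge list. The function name triangle_weights and the toy graph are assumptions made for this illustration and are not part of Babelfy or BabelNet; for simplicity the sketch collapses the parallel edges of the multigraph and ignores self-loops.

```python
from collections import defaultdict


def triangle_weights(edges):
    """Weight each directed edge (u, v) by the number of directed
    triangles u -> v -> w -> u it takes part in, plus one."""
    edge_set = set(edges)            # collapse parallel edges (assumption)
    succ = defaultdict(set)          # successor sets for fast lookup
    for u, v in edge_set:
        succ[u].add(v)
    weights = {}
    for u, v in edge_set:
        # w closes a directed 3-cycle when both v -> w and w -> u exist
        triangles = sum(
            1
            for w in succ.get(v, ())
            if w != u and w != v and u in succ.get(w, ())
        )
        weights[(u, v)] = triangles + 1
    return weights


# Toy graph: a -> b -> c -> a is a directed triangle, a -> d is not in any.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
weights = triangle_weights(edges)
```

In this toy graph the three edges of the cycle receive weight 2, while the dangling edge a -> d keeps the minimum weight 1, so it remains traversable by a random walk.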
Random Walk with Restart. Our goal is to create a semantic signature (i.e., a set of highly related vertices) for each concept and named entity of the semantic network. To do this, we perform a Random Walk with Restart (RWR) (Tong et al., 2006), that is, a stochastic process that starts from an initial vertex of the graph and then, for a fixed number n of steps or until convergence, explores the graph by choosing the next vertex within the current neighborhood or by restarting from the initial vertex with a given, fixed restart probability α. For each edge (v, v') in the network, we model the conditional probability P(v'|v) as the normalized weight of the edge:

$$P(v' \mid v) = \frac{\mathrm{weight}(v, v')}{\sum_{v'' \in V} \mathrm{weight}(v, v'')}$$

where V is the set of vertices of the semantic network and weight(v, v') is the function defined in Equation 1. We then run the RWR from each vertex v of the semantic network for a fixed number n of steps (we show in Algorithm 1 our RWR pseudocode). We keep track of the encountered vertices using the map counts, i.e., we increase the counter associated with vertex v' in counts every time we hit v' during a RWR started from v (see line 11). As a result, we obtain a frequency distribution over the whole set of concepts and entities.
To eliminate weakly-related vertices we keep only those items that were hit at least η times (see lines 16-18). Finally, we save the remaining vertices in the set semSign_v, which is the semantic signature of v (see line 19).
Algorithm 1 Random walk with restart.
1: input: v, the starting vertex; α, the restart probability; n, the number of steps to be executed; P, the transition probabilities; η, the frequency threshold.
2: output: semSign_v, set of related vertices for v.
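The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names transition_probs and semantic_signature, the fixed random seed, and the toy weights are hypothetical, and alpha is treated here as the probability of restarting at each step.

```python
import random
from collections import Counter


def transition_probs(weights):
    """Normalise edge weights so that the outgoing weights of each
    vertex sum to one, giving P(v' | v)."""
    totals = Counter()
    for (u, _), w in weights.items():
        totals[u] += w
    probs = {}
    for (u, v), w in weights.items():
        probs.setdefault(u, {})[v] = w / totals[u]
    return probs


def semantic_signature(start, probs, alpha, n, eta, rng=None):
    """Run n steps of a random walk with restart from `start` and keep
    the vertices that were hit at least `eta` times."""
    rng = rng or random.Random(0)    # fixed seed: assumption for repeatability
    counts = Counter()
    current = start
    for _ in range(n):
        neighbours = probs.get(current)
        # Restart with probability alpha, or when the walk is stuck.
        if rng.random() < alpha or not neighbours:
            current = start
            continue
        targets = list(neighbours)
        current = rng.choices(
            targets, weights=[neighbours[t] for t in targets]
        )[0]
        counts[current] += 1
    return {v for v, c in counts.items() if c >= eta}


# Hypothetical toy weights, e.g. as produced by a triangle-based weighting.
toy_weights = {("a", "b"): 3, ("a", "c"): 1, ("b", "a"): 1, ("c", "a"): 1}
toy_probs = transition_probs(toy_weights)
signature = semantic_signature("a", toy_probs, alpha=0.5, n=2000, eta=10)
```

Raising eta prunes more aggressively: with a threshold larger than the number of steps, the signature is necessarily empty.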
The creation of our set of semantic signatures, one for each vertex in the semantic network, is a preliminary step carried out only once, before processing any input text. We now turn to the candidate identification and disambiguation steps.

Candidate Identification
Given a text as input, we apply part-of-speech tagging and identify the set F of all the textual fragments, i.e., all the sequences of words of maximum length five, which contain at least one noun and that are substrings of lexicalizations in BabelNet, i.e., those fragments that can potentially be linked to an entry in BabelNet. For each textual fragment f ∈ F, i.e., a single- or multi-word expression of the input text, we look up the semantic network for candidate meanings, i.e., vertices that contain f or, only for named entities, a superstring of f as their lexicalization. For instance, for sentence (1) in the introduction, we identify the following textual fragments: Thomas, Mario, strikers, Munich. This output is obtained thanks to our loose candidate identification routine, i.e., based on superstring matching instead of exact matching, which, for instance, enables us to recognize the right candidate Mario Gomez for the mention Mario even if this named entity does not have Mario as one of its lexicalizations (for an analysis of the impact of this routine against the exact matching approach see the discussion in Section 9).
Moreover, as we stated in Section 3, we allow overlapping fragments, e.g., for major league we recognize league and major league. We denote with cand(f) the set of all the candidate meanings of fragment f. For instance, for the noun league we have that cand(league) contains among others the sport word sense and the TV series named entity.
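The loose candidate identification described in this section can be sketched as follows. This is a hedged toy illustration: the inventory dictionary, the vertex labels, the function names and the pre-lemmatized, POS-tagged input are all assumptions made for the example and do not reflect BabelNet's actual data or API. Matching is done on whole tokens rather than on raw characters.

```python
def contains_tokens(haystack, needle):
    """True if `needle` occurs as a contiguous token run in `haystack`."""
    n = len(needle)
    return any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )


def extract_fragments(tagged, inventory, max_len=5):
    """Word sequences of at most `max_len` tokens that contain a noun and
    are token-level substrings of some lexicalization."""
    lexs = [
        tuple(lex.lower().split())
        for lex_set, _ in inventory.values()
        for lex in lex_set
    ]
    fragments = []
    for i in range(len(tagged)):
        for j in range(i + 1, min(i + max_len, len(tagged)) + 1):
            span = tagged[i:j]
            if not any(tag.startswith("N") for _, tag in span):
                continue                      # no noun in this span
            tokens = tuple(word.lower() for word, _ in span)
            if any(contains_tokens(lex, tokens) for lex in lexs):
                fragments.append(" ".join(tokens))
    return fragments


def candidates(fragment, inventory, loose=True):
    """Vertices lexicalized exactly as `fragment`; with loose=True, named
    entities whose lexicalization is a superstring also qualify."""
    frag = tuple(fragment.lower().split())
    found = set()
    for vertex, (lex_set, is_entity) in inventory.items():
        for lex in lex_set:
            tokens = tuple(lex.lower().split())
            if tokens == frag or (
                loose and is_entity and contains_tokens(tokens, frag)
            ):
                found.add(vertex)
                break
    return found


# Hypothetical inventory: vertex -> (lexicalizations, is_named_entity).
inventory = {
    "Thomas Müller": ({"Thomas Müller"}, True),
    "Mario Gomez": ({"Mario Gomez"}, True),
    "Mario (video game)": ({"Mario"}, True),
    "striker (soccer)": ({"striker", "forward"}, False),
    "FC Bayern Munich": ({"FC Bayern Munich", "Bayern München"}, True),
    "Munich (city)": ({"Munich"}, True),
}
# Sentence (1), given as (lemma, POS tag) pairs (assumed preprocessing).
tagged = [("Thomas", "NNP"), ("and", "CC"), ("Mario", "NNP"), ("be", "VBP"),
          ("striker", "NNS"), ("play", "VBG"), ("in", "IN"), ("Munich", "NNP")]
fragments = extract_fragments(tagged, inventory)
```

With exact matching only (loose=False), the mention Mario would not retrieve Mario Gomez in this toy inventory, which is the situation the superstring matching is meant to address.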

Candidate Disambiguation
Semantic interpretation graph. After the identification of fragments (F) and their candidate meanings (cand(·)), we create a directed graph G_I = (V_I, E_I) of the semantic interpretations of the input text. We show the pseudocode in Algorithm 2. V_I contains all the candidate meanings of all fragments, that is,

$$V_I := \{(v, f) : v \in \mathrm{cand}(f),\ f \in F\}$$

where f is a fragment of the input text and v is a candidate Babel synset that has a lexicalization which is equal to or is a superstring of f (see lines 4-8). The set of edges E_I connects related meanings and is populated as follows: we add an edge from (v, f) to (v', f') if and only if f ≠ f' and v' ∈ semSign_v (see lines 9-11). In other words, we connect two candidate meanings of different fragments if one is in the semantic signature of the other. For instance, we add an edge between (Mario Gomez, Mario) and (Thomas Müller, Thomas), while we do not add one between (Mario Gomez, Mario) and (Mario Basler, Mario) since these are two candidate meanings of the same fragment, i.e., Mario. In Figure 1, we show an excerpt of our graph for sentence (1).
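The construction of the semantic interpretation graph can be sketched in a few lines of Python. The function name build_interpretation_graph and the toy candidate sets and semantic signatures below are hypothetical and chosen only to mirror the Mario/Thomas example; they are not taken from BabelNet.

```python
def build_interpretation_graph(cand, sem_sign):
    """Vertices are (meaning, fragment) pairs; a directed edge goes from
    (v, f) to (v2, f2) when f != f2 and v2 is in the signature of v."""
    vertices = {(v, f) for f, meanings in cand.items() for v in meanings}
    edges = {
        ((v, f), (v2, f2))
        for (v, f) in vertices
        for (v2, f2) in vertices
        if f != f2 and v2 in sem_sign.get(v, set())
    }
    return vertices, edges


# Hypothetical candidates and semantic signatures.
cand = {
    "Mario": {"Mario Gomez", "Mario Basler"},
    "Thomas": {"Thomas Müller"},
}
sem_sign = {
    "Mario Gomez": {"Thomas Müller", "Mario Basler"},
    "Thomas Müller": {"Mario Gomez"},
}
vertices, edges = build_interpretation_graph(cand, sem_sign)
```

Although Mario Basler appears in the toy signature of Mario Gomez, no edge is created between them, because both are candidates for the same fragment.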
At this point we have a graph-based representation of all the possible interpretations of the input text. In order to drastically reduce the degree of ambiguity while keeping the interpretation coherence as high as possible, we apply a novel densest subgraph heuristic (see line 12), whose description we defer to the next paragraph. The result is a subgraph which contains those semantic interpretations that are most coherent with each other. However, this subgraph might still contain multiple interpretations for the same fragment, and even unambiguous fragments which are not correct. Therefore, the final step is the selection of the most suitable candidate meaning for each fragment f given a threshold θ to discard semantically unrelated candidate meanings. We score each meaning v ∈ cand(f) with its normalized weighted degree in the densest subgraph:

$$\mathrm{score}((v, f)) = \frac{w_{(v,f)} \cdot \deg((v, f))}{\sum_{v' \in \mathrm{cand}(f)} w_{(v',f)} \cdot \deg((v', f))} \qquad (2)$$

where w_(v,f) is the fraction of fragments the candidate meaning v connects to:

$$w_{(v,f)} := \frac{|\{f' \in F : \exists v' \text{ s.t. } ((v, f), (v', f')) \in E_I \text{ or } ((v', f'), (v, f)) \in E_I\}|}{|F| - 1}$$

The rationale behind this scoring function is to take into account both the semantic coherence, using a graph centrality measure among the candidate meanings, and the lexical coherence, in terms of the number of fragments a candidate relates to.
Finally, we link each f to the highest ranking candidate meaning v* if score((v*, f)) ≥ θ, where θ is a fixed threshold (see lines 14-18 of Algorithm 2). For instance, in sentence (1) and for the fragment Mario we select Mario Gomez as our final candidate meaning and link it to the fragment.
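As a worked illustration of the scoring and thresholding step, the sketch below normalizes the weighted degrees of the candidates of one fragment and applies θ. The function names are hypothetical. The input pairs (number of connected fragments, degree) for the three Mario candidates are the ones used in the Figure 1 example of this section, with |F| = 4 fragments in sentence (1); exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def normalized_scores(stats, num_fragments):
    """stats maps each candidate of one fragment to a pair
    (number of other fragments it connects to, degree in the graph)."""
    raw = {
        v: Fraction(connected, num_fragments - 1) * degree
        for v, (connected, degree) in stats.items()
    }
    total = sum(raw.values())
    return {v: (r / total if total else Fraction(0)) for v, r in raw.items()}


def link(stats, num_fragments, theta):
    """Return the top-scoring candidate, or None if it scores below theta."""
    scores = normalized_scores(stats, num_fragments)
    best = max(scores, key=scores.get)
    return best if scores[best] >= theta else None


mario = {
    "Mario Gomez": (3, 5),
    "Mario Basler": (1, 1),
    "Mario Adorf": (2, 2),
}
scores = normalized_scores(mario, num_fragments=4)
```

The unnormalized values are 5, 1/3 and 4/3, so the normalized scores are 3/4, 1/20 and 1/5. On these illustrative numbers a permissive threshold links Mario to Mario Gomez, while a strict one (for example 0.8) would leave the fragment unlinked; in the actual method the scores are computed on the densest subgraph rather than on the initial graph.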
Linking by densest subgraph. We now illustrate our novel densest subgraph heuristic, used in line 12 of Algorithm 2, for reducing the level of ambiguity of the initial semantic interpretation graph G_I. The main idea here is that the most suitable meanings of each text fragment will belong to the densest area of the graph.
Algorithm 2 (excerpt).
6: for each fragment f ∈ F do
7:   for each candidate v ∈ cand(f) do
The problem of identifying the densest subgraph of size at least k is NP-hard (Feige et al., 1999). Therefore, we define a heuristic for k-partite graphs inspired by a 2-approximation greedy algorithm for arbitrary graphs (Charikar, 2000; Khuller and Saha, 2009). Our adapted strategy for selecting a dense subgraph of G_I is based on the iterative removal of low-coherence vertices, i.e., fragment interpretations. We show the pseudocode in Algorithm 3.
Algorithm 3 Densest Subgraph.
1: input: F, the set of all fragments in the input text; cand, from fragments to candidate meanings; G_I, the full semantic interpretation graph; µ, ambiguity level to be reached.
2: output: G*_I, a dense subgraph.
…
return G*_I

We start with the initial graph G_I^(0) at step t = 0 (see line 5). For each step t (lines 7-16), first, we identify the most ambiguous fragment f_max, i.e., the one with the maximum number of candidate meanings in the graph (see line 7). Next, we discard the weakest interpretation of the current fragment f_max. To do so, we determine the lexical and semantic coherence of each candidate meaning (v, f_max) using Formula 2 (see line 10). We then remove from our graph G_I^(t) the lowest-coherence vertex (v_min, f_max), i.e., the one whose score is minimum (see lines 11-13). For instance, in Figure 1, f_max is the fragment Mario and we have: score((Mario Gomez, Mario)) ∝ 3/3 · 5 = 5, score((Mario Basler, Mario)) ∝ 1/3 · 1 = 0.3 and score((Mario Adorf, Mario)) ∝ 2/3 · 2 = 1.3, so we remove (Mario Basler, Mario) from the graph since its score is minimum.
We then move to the next step, i.e., we set t := t + 1 (see line 16) and repeat the low-coherence removal step. We stop when the number of remaining candidates for each fragment does not exceed the threshold µ, i.e., |{v : (v, f) ∈ V_I^(t)}| ≤ µ ∀f ∈ F (see lines 8-9). During each iteration step t we compute the average degree of the current graph G_I^(t), i.e., avgdeg(G_I^(t)) = 2|E_I^(t)| / |V_I^(t)|. Finally, we select as the densest subgraph of the initial semantic interpretation graph G_I the graph G*_I that maximizes the average degree (see lines 14-15).
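The iterative removal strategy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the unnormalized product of fragment fraction and degree is used for ranking (normalization does not change which candidate is weakest), and the toy graph is invented so that the three Mario candidates have the degrees and fragment counts quoted in the text. It is not the graph of Figure 1.

```python
def coherence(vertex, vertices, edges):
    """Fraction of other fragments the vertex connects to, times its degree."""
    fragments = {f for _, f in vertices}
    linked = {b[1] for a, b in edges if a == vertex}
    linked |= {a[1] for a, b in edges if b == vertex}
    degree = sum(1 for a, b in edges if vertex in (a, b))
    return len(linked) / max(len(fragments) - 1, 1) * degree


def avg_degree(vertices, edges):
    return 2 * len(edges) / len(vertices) if vertices else 0.0


def densest_subgraph(vertices, edges, mu):
    """Repeatedly drop the weakest candidate of the most ambiguous fragment
    until every fragment has at most mu candidates; return the intermediate
    graph with the highest average degree."""
    vertices, edges = set(vertices), set(edges)
    best = (set(vertices), set(edges))
    while vertices:
        by_fragment = {}
        for v, f in vertices:
            by_fragment.setdefault(f, []).append((v, f))
        f_max = max(by_fragment, key=lambda f: len(by_fragment[f]))
        if len(by_fragment[f_max]) <= mu:
            break
        weakest = min(
            by_fragment[f_max], key=lambda c: coherence(c, vertices, edges)
        )
        vertices.discard(weakest)
        edges = {(a, b) for a, b in edges if weakest not in (a, b)}
        if avg_degree(vertices, edges) > avg_degree(*best):
            best = (set(vertices), set(edges))
    return best


# Invented toy graph for sentence (1); edges are illustrative only.
TM, TMI = ("Thomas Müller", "Thomas"), ("Tomás Milián", "Thomas")
MG, MB, MA = (("Mario Gomez", "Mario"), ("Mario Basler", "Mario"),
              ("Mario Adorf", "Mario"))
ST = ("striker", "strikers")
FCB, MUN = ("FC Bayern Munich", "Munich"), ("Munich (city)", "Munich")
V = {TM, TMI, MG, MB, MA, ST, FCB, MUN}
E = {(MG, TM), (TM, MG), (MG, ST), (MG, FCB), (FCB, MG), (MB, FCB),
     (MA, TMI), (MA, MUN), (TM, FCB), (FCB, TM), (TM, ST)}
dense_v, dense_e = densest_subgraph(V, E, mu=1)
```

On this toy input the first vertex removed is (Mario Basler, Mario), as in the worked example, and the surviving subgraph contains only the soccer-related interpretations. The small value mu=1 is chosen for the toy graph only.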

Experimental Setup
Datasets. We carried out our experiments on six datasets, four for WSD and two for EL:
• The SemEval-2013 task 12 dataset for multilingual WSD (Navigli et al., 2013), which consists of 13 documents in different domains, available in 5 languages. For each language, all noun occurrences were annotated using BabelNet, thereby providing Wikipedia and WordNet annotations wherever applicable. The number of mentions to be disambiguated roughly ranges from 1K to 2K per language in the different setups.
• The SemEval-2007 task 7 dataset for coarse-grained English all-words WSD (Navigli et al., 2007). We take into account only nominal mentions, obtaining a dataset containing 1107 nouns to be disambiguated using WordNet.
• The Senseval-3 dataset for English all-words WSD (Snyder and Palmer, 2004), which contains 899 nouns to be disambiguated using WordNet.
• KORE50 (Hoffart et al., 2012), which consists of 50 short English sentences (mean length of 14 words) with a total number of 144 mentions manually annotated using YAGO2, for which a Wikipedia mapping is available.This dataset was built with the idea of testing against a high level of ambiguity for the EL task.
• AIDA-CoNLL (Hoffart et al., 2011), which consists of 1392 English articles, for a total of roughly 35K named entity mentions annotated with YAGO concepts separated in development, training and test sets.
We exploited the POS tags already available in the SemEval and Senseval datasets, while we used the Stanford POS tagger (Toutanova et al., 2003) for the English sentences in the last two datasets.
We fixed the parameters of RWR (Section 5) to the values α = .85, η = 100 and n = 1M, which maximize F1 on a manually created tuning set made up of 10 gold-standard semantic signatures. We tuned our two disambiguation parameters µ = 10 and θ = 0.8 by optimizing F1 on the trial dataset of the SemEval-2013 task on multilingual WSD (Navigli et al., 2013). We used the same parameters on all the other WSD datasets. As for EL, we used the training part of AIDA-CoNLL (Hoffart et al., 2011) to set µ = 5 and θ = 0.0.

Systems
Multilingual WSD. We evaluated our system on the SemEval-2013 task 12 by comparing it with the participating systems:
• UMCC-DLSI (Gutiérrez et al., 2013), a state-of-the-art Personalized PageRank-based approach that exploits the integration of different sources of knowledge, such as WordNet Domains/Affect (Strapparava and Valitutti, 2004), SUMO (Zouaq et al., 2009) and the eXtended WordNet (Mihalcea and Moldovan, 2001);
• DAEBAK! (Manion and Sainudiin, 2013), which performs WSD on the basis of peripheral diversity within subgraphs of BabelNet;
• GETALP (Schwab et al., 2013), which uses an Ant Colony Optimization technique together with the classical measure of Lesk (1986).
We also compared with UKB w2w (Agirre and Soroa, 2009), a state-of-the-art approach for knowledge-based WSD, based on Personalized PageRank (Haveliwala, 2002). We used the same mapping from words to senses that we used in our approach, default parameters (./ukb_wsd -D dict.txt -K kb.bin --ppr_w2w ctx.txt) and BabelNet as the input graph. Moreover, we compared our system with IMS (Zhong and Ng, 2010), a state-of-the-art supervised English WSD system which uses an SVM trained on sense-annotated corpora, such as SemCor (Miller et al., 1993) and DSO (Ng and Lee, 1996), among others. We used the IMS model out-of-the-box with Most Frequent Sense (MFS) as backoff routine since the model obtained using the task trial data performed worse.
We followed the original task formulation and evaluated the synsets in three different settings, i.e., when using BabelNet senses, Wikipedia senses and WordNet senses, thanks to BabelNet being a superset of the other two inventories. We ran our system on a document-by-document basis, i.e., disambiguating each document at once, so as to test its effectiveness on long coherent texts. Performance was calculated in terms of F1 score. We also compared the systems with the MFS baseline computed for the three inventories (Navigli et al., 2013).
Coarse-grained WSD. For the SemEval-2007 task 7 we compared our system with the two top-ranked approaches, i.e., NUS-PT (Chan et al., 2007) and UoR-SSI (Navigli, 2008), which respectively exploited parallel texts and enriched semantic paths in a semantic network, the previously described UKB w2w system, a knowledge-based WSD approach (Ponzetto and Navigli, 2010) which exploits an automatic extension of WordNet, and, as baseline, the MFS.
Fine-grained WSD. For the remaining fine-grained WSD datasets, i.e., Senseval-3 and SemEval-2007 task 17, we compared our approach with the previously described state-of-the-art systems UKB and IMS, and, as baseline, the MFS.
KORE50 and AIDA-CoNLL. For the KORE50 and AIDA-CoNLL datasets we compared our system with six approaches, including state-of-the-art ones (Hoffart et al., 2012; Cornolti et al., 2013):
• MW, i.e., the Normalized Google Distance as defined by Milne and Witten (2008);
• KPCS (Hoffart et al., 2012), which calculates a Mutual Information weighted vector of keyphrases for each candidate and then uses the cosine similarity to obtain candidates' scores;
• KORE and its variants KORE-LSH-G and KORE-LSH-F (Hoffart et al., 2012), based on similarity measures that exploit the overlap between phrases associated with the considered entities (KORE) and a hashing technique to reduce the space needed by the keyphrases associated with the entities (LSH-G, LSH-F);
• Tagme 2.0 (Ferragina and Scaiella, 2012), which uses the relatedness measure defined by Milne and Witten (2008) weighted with the commonness of a sense together with the keyphraseness measure defined by Mihalcea and Csomai (2007) to exploit the context around the target word;
• Illinois Wikifier (Cheng and Roth, 2013), which combines local features, such as commonness and TF-IDF between mentions and Wikipedia pages, with global coherence features based on Wikipedia links and relational inference;
• DBpedia Spotlight (Mendes et al., 2011), which uses LingPipe's string matching algorithm implementation together with a weighted cosine similarity measure to recognize and disambiguate mentions.
Table 1: F1 scores (percentages) of the participating systems of SemEval-2013 task 12 together with MFS, UKB w2w, IMS, our system and its ablated versions on the Senseval-3, SemEval-2007 task 17 and SemEval-2013 datasets. The first system which has a statistically significant difference from the top system is marked with (χ², p < 0.05).
We also compared with UKB w2w, introduced above. Note that we could not use supervised systems, as the training data of AIDA-CoNLL covers less than half of the mentions used in the testing part and less than 10% of the entities considered in KORE50. To enable a fair comparison, we ran our system by restricting the BabelNet sense inventory of the target mentions to the English Wikipedia. As is customary in the literature, we calculated the systems' accuracy for both Entity Linking datasets.

Results
Multilingual WSD. In Table 1 we show the F1 performance on the SemEval-2013 task 12 for the three setups: WordNet, Wikipedia and BabelNet. Using BabelNet we surpass all systems on English and German and obtain performance comparable with the best systems on two other languages (UKB on Italian and UMCC-DLSI on Spanish). Using the WordNet sense inventory, our results are on a par with the best system, i.e., IMS. On Wikipedia our results range between 71.6% (French) and 87.4% F1 (English), i.e., more than 10 points higher than the current state of the art (UMCC-DLSI) in all 5 languages. As for the MFS baseline, which is known to be very competitive in WSD (Navigli, 2009), we beat it in all setups except for German on Wikipedia. Interestingly, we surpass the WordNet MFS by 2.9 points, a significant result for a knowledge-based system (see also (Pilehvar and Navigli, 2014)).
Coarse- and fine-grained WSD. In Table 2, we show the results of the systems on the SemEval-2007 coarse-grained WSD dataset. As can be seen, we obtain the second best result after Ponzetto and Navigli (2010). In Table 1 (first two columns), we show the results of IMS and UKB on the Senseval-3 and SemEval-2007 task 17 datasets. We rank second on both datasets after IMS. However, the differences are not statistically significant. Moreover, Agirre et al. (2014, Table 5) note that using WordNet 3.0, instead of 1.7 or 2.1, to annotate these datasets can cause a more than one percent drop in performance.
Entity Linking. In Table 3 we show the results on the two Entity Linking datasets, i.e., KORE50 and AIDA-CoNLL. Our system outperforms all other approaches, with KORE-LSH-G getting closest, and Tagme and Wikifier lagging behind on the KORE50 dataset. For the AIDA-CoNLL dataset we obtain the third best performance after MW and KPCS; however, the difference is not statistically significant. We note the low performance of DBpedia Spotlight which, even if it achieves almost 100% precision on the identified mentions on both datasets, suffers from low recall due to its candidate identification step, confirming previous evaluations (Derczynski et al., 2013; Hakimov et al., 2012; Ludwig and Sack, 2011). This problem becomes even more accentuated in the latest version of this system (Daiber et al., 2013). Finally, UKB using BabelNet obtains low performance on EL, i.e., between 10.5 and 19.4 points below the state of the art. This result is discussed below.
Discussion. The results obtained by UKB show that the high performance of our unified approach to EL and WSD is not just a mere artifact of the use of a rich multilingual semantic network, that is, BabelNet. In other words, it is not true that any graph-based algorithm could be applied to perform both EL and WSD at the same time equally well. This also shows that BabelNet by itself is not sufficient for achieving high performances for both tasks and that, instead, an appropriate processing of the structural and lexical information of the semantic network is needed. A manual analysis revealed that the main cause of error for UKB in the EL setup stems from its inability to enforce high coherence, e.g., by jointly disambiguating all the words, which is instead needed when considering the high level of ambiguity that we have in our semantic interpretation graph (Cucerzan, 2007). For instance, for sentence (1) in the introduction, UKB disambiguates Thomas as a cricket player and Mario as the popular video game rather than the two well-known soccer players, and Munich as the German city, rather than the soccer team in which they play. Our approach, instead, by enforcing highly coherent semantic interpretations, correctly identifies all the soccer-related entities.
Table 3: Accuracy (percentages) of state-of-the-art EL systems and our system on KORE50 and AIDA-CoNLL. The first system with a statistically significant difference from the top system is marked (χ² test, p < 0.05).
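The significance statements in this section rely on a χ² test at p < 0.05. The text does not detail how the test was set up, so the following is only a minimal sketch under an assumed setup: a 2x2 contingency table of correct and incorrect answers for two systems, Pearson's statistic without continuity correction, and the standard critical value for one degree of freedom. The function names and the counts in the example are hypothetical and are not taken from the tables.

```python
# Critical value of the chi-square distribution, 1 degree of freedom, p = 0.05.
CRITICAL_VALUE_DF1_P05 = 3.841


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square statistic for correct/incorrect counts of two systems."""
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    if 0 in col_totals or 0 in row_totals:
        # Degenerate table: both systems behave identically on this split.
        return 0.0
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


def significantly_different(correct_a, total_a, correct_b, total_b):
    """True if the two accuracies differ at p < 0.05 under the assumed test."""
    stat = chi_square_2x2(correct_a, total_a, correct_b, total_b)
    return stat > CRITICAL_VALUE_DF1_P05


# Hypothetical counts: system A answers 90/100 correctly, system B 70/100.
print(chi_square_2x2(90, 100, 70, 100))
print(significantly_different(90, 100, 70, 100))
```

This treats the two systems' outputs as independent samples. Since both systems are evaluated on the same items, a paired test such as McNemar's could also be considered; the sketch only follows the test named in the Table 3 caption.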
In order to determine the need for our loose candidate identification heuristic (see Section 6), we compared the percentage of times a candidate set contains the correct entity against that obtained by exact string matching between the mention and the sense inventory. On KORE50, our heuristic retrieves the correct entity 98.6% of the time vs. 42.4% when exact matching is used. This demonstrates the inadequacy of exact matching for EL, and the need for a loose matching strategy over a comprehensive sense inventory, as adopted in our approach.
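The coverage comparison above can be illustrated with a small sketch. It assumes that loose matching can be approximated by checking whether a lexicalization contains the mention as a contiguous token sequence; the actual heuristic is the one defined in Section 6, and the inventory, identifiers and gold pairs below are toy data invented for illustration.

```python
def exact_candidates(mention, inventory):
    """Entities with a lexicalization identical to the mention (case-insensitive)."""
    target = mention.lower()
    return {entity for entity, names in inventory.items()
            if any(name.lower() == target for name in names)}


def loose_candidates(mention, inventory):
    """Entities with a lexicalization containing the mention as a contiguous
    token sequence (an assumed stand-in for the loose heuristic)."""
    target = mention.lower().split()

    def contains(name):
        tokens = name.lower().split()
        return any(tokens[i:i + len(target)] == target
                   for i in range(len(tokens) - len(target) + 1))

    return {entity for entity, names in inventory.items()
            if any(contains(name) for name in names)}


def coverage(gold, inventory, candidate_fn):
    """Fraction of gold mentions whose candidate set contains the gold entity."""
    hits = sum(1 for mention, entity in gold
               if entity in candidate_fn(mention, inventory))
    return hits / len(gold)


# Toy sense inventory: entity identifier -> lexicalizations (hypothetical).
inventory = {
    "Thomas_Müller": {"Thomas Müller"},
    "Mario_Gómez": {"Mario Gómez"},
    "Mario_(video_game)": {"Mario"},
    "FC_Bayern_Munich": {"FC Bayern Munich"},
    "Munich_(city)": {"Munich"},
}
# Toy gold annotations: (mention, correct entity).
gold = [
    ("Thomas", "Thomas_Müller"),
    ("Mario", "Mario_Gómez"),
    ("Munich", "FC_Bayern_Munich"),
    ("Mario Gómez", "Mario_Gómez"),
]

print(coverage(gold, inventory, exact_candidates))
print(coverage(gold, inventory, loose_candidates))
```

With exact matching, partial mentions such as Thomas either retrieve no candidate or only the wrong ones, whereas the loose variant keeps the correct entity in the candidate set at the cost of a larger, more ambiguous set that the later disambiguation step has to resolve.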
We also performed different ablation tests by experimenting with the following variants of our system (reported at the bottom of Tables 1, 2 and 3):
• Babelfy using a uniform distribution during the RWR to obtain the concepts' semantic signatures; this test assesses the impact of our weighting and edge creation strategy.
• Babelfy without performing the densest subgraph heuristic, i.e., when line 12 in Algorithm 2 is G*_I = G_I, so as to verify the impact of identifying the most coherent interpretations.
• Babelfy applied to the BabelNet subgraph induced by the entire set of named entity vertices, for the EL task, and that induced by word senses only, for the WSD task; this test aims to stress the impact of our unified approach.
• Babelfy applied on sentences instead of on whole documents.
The component which has the smallest impact on the performance is our triangle-based weighting scheme. The main exception is on the smallest dataset, i.e., SemEval-2007 task 17, for which this version attains an improvement of 2.5 percentage points.
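The first ablation variant replaces the weighted random walk with restart (RWR) by a uniform one. The contrast can be sketched as a Monte Carlo walk in which the next vertex is drawn either proportionally to edge weights or uniformly among neighbours. The edge weights are taken as given here (the triangle-based weighting itself is not reproduced), and the restart probability, number of steps, visit threshold and toy graph are arbitrary assumptions made for illustration.

```python
import random
from collections import Counter


def rwr_visit_counts(graph, start, restart_prob, steps, use_weights=True, seed=0):
    """Simulate a random walk with restart and count vertex visits.

    graph maps each vertex to a dict {neighbour: edge weight}.
    """
    rng = random.Random(seed)
    counts = Counter()
    current = start
    for _ in range(steps):
        neighbours = graph.get(current, {})
        # Restart at the start vertex with the given probability, or when stuck.
        if not neighbours or rng.random() < restart_prob:
            current = start
            continue
        targets = list(neighbours)
        # weights=None makes random.choices draw uniformly among neighbours.
        weights = [neighbours[t] for t in targets] if use_weights else None
        current = rng.choices(targets, weights=weights, k=1)[0]
        counts[current] += 1
    return counts


def semantic_signature(counts, start, min_visits):
    """Vertices (other than the start) visited at least min_visits times."""
    return {v for v, c in counts.items() if v != start and c >= min_visits}


# Toy weighted graph (hypothetical vertices and weights).
graph = {
    "striker": {"soccer": 9.0, "labour_strike": 1.0},
    "soccer": {"striker": 1.0},
    "labour_strike": {"striker": 1.0},
}
weighted = rwr_visit_counts(graph, "striker", restart_prob=0.15, steps=10000)
uniform = rwr_visit_counts(graph, "striker", restart_prob=0.15, steps=10000,
                           use_weights=False)
print(semantic_signature(weighted, "striker", min_visits=1000))
print(semantic_signature(uniform, "striker", min_visits=1000))
```

In the weighted walk the strongly connected neighbour dominates the visit counts, so a visit threshold filters out weakly related vertices; in the uniform walk all neighbours are visited at similar rates and the signature becomes less selective.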
Babelfy without the densest subgraph algorithm is the version which attains the lowest performance on the EL task, with a 9% performance drop on the KORE50 dataset, showing the need for a specially designed approach to cope with the high level of ambiguity that is encountered on this task. On the other hand, on the WSD datasets this version attains almost the same results as the full version, due to the lower number of candidate word senses.
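To make the role of the densest subgraph step more concrete, the following sketch implements the classical greedy peeling procedure for the densest subgraph problem: repeatedly remove a minimum-degree vertex and remember the intermediate subgraph with the highest density (edges per vertex). This is a generic textbook 2-approximation, shown only as an illustration; the heuristic used in Babelfy additionally takes the ambiguity of text fragments into account, which is not reproduced here. The toy graph and vertex names are hypothetical.

```python
def densest_subgraph(adjacency):
    """Greedy peeling on an undirected graph.

    adjacency maps each vertex to the set of its neighbours (symmetric).
    Returns (vertices, density) of the densest intermediate subgraph,
    where density = number of edges / number of vertices.
    """
    adj = {v: set(ns) for v, ns in adjacency.items()}
    if not adj:
        return set(), 0.0
    num_edges = sum(len(ns) for ns in adj.values()) // 2
    best_vertices = set(adj)
    best_density = num_edges / len(adj)
    while len(adj) > 1:
        # Remove a vertex of minimum degree together with its edges.
        victim = min(adj, key=lambda u: len(adj[u]))
        for neighbour in adj[victim]:
            adj[neighbour].discard(victim)
        num_edges -= len(adj[victim])
        del adj[victim]
        density = num_edges / len(adj)
        if density > best_density:
            best_density = density
            best_vertices = set(adj)
    return best_vertices, best_density


# Toy interpretation graph: soccer-related meanings are tightly connected,
# the competing meanings are not (hypothetical edges).
soccer = ["Thomas_Müller", "Mario_Gómez", "FC_Bayern_Munich", "striker"]
adjacency = {v: {u for u in soccer if u != v} for v in soccer}
adjacency["Munich_(city)"] = {"FC_Bayern_Munich"}
adjacency["FC_Bayern_Munich"].add("Munich_(city)")
adjacency["Mario_(video_game)"] = set()
adjacency["Thomas_(cricketer)"] = set()

vertices, density = densest_subgraph(adjacency)
print(sorted(vertices), density)
```

On this toy graph the peeling discards the isolated and weakly connected meanings first and ends up retaining the four mutually connected soccer-related vertices, which mirrors the coherence effect discussed above.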
Babelfy applied on sentences instead of on whole documents shows a lower performance, confirming the significance of higher semantic coherence on whole documents (notwithstanding the two exceptions on the SemEval-2007 task 17 and on the SemEval-2013 German Wikipedia datasets).
Finally, the version in which we restrict our system to named entities only (for EL) and concepts only (for WSD) consistently obtains lower results (notwithstanding the three exceptions on the Spanish SemEval-2013 task 12 using BabelNet and Wikipedia, and on the SemEval-2007 coarse-grained task). This highlights the benefit of our joint use of lexicographic and encyclopedic structured knowledge, on each of the two tasks. The 3.4% performance drop attained on KORE50 is of particular interest, since this dataset aims at testing performance on highly ambiguous mentions within short sentences. This indicates that the semantic analysis of small contexts can be improved by leveraging the coherence between concepts and named entities.

Conclusion
In this paper we presented Babelfy, a novel, integrated approach to Entity Linking and Word Sense Disambiguation, available at http://babelfy.org. Our joint solution is based on three key steps: i) the automatic creation of semantic signatures, i.e., related concepts and named entities, for each node in the reference semantic network; ii) the unconstrained identification of candidate meanings for all possible textual fragments; iii) linking based on a high-coherence densest subgraph algorithm. We used BabelNet 1.1.1 as our multilingual semantic network.
Our graph-based approach exploits the semantic network structure to its advantage: two key features of BabelNet, that is, its multilinguality and its integration of lexicographic and encyclopedic knowledge, make it possible to run our general, unified approach on the two tasks of Entity Linking and WSD in any of the languages covered by the semantic network. However, we also demonstrated that BabelNet in itself does not lead to state-of-the-art accuracy on both tasks, even when used in conjunction with a high-performance graph-based algorithm like Personalized PageRank. This shows the need for our novel unified approach to EL and WSD.
At the core of our approach lies the effective treatment of the high degree of ambiguity of partial textual mentions by means of a 2-approximation algorithm for the densest subgraph problem, which enables us to output a semantic interpretation of the input text with drastically reduced ambiguity, as was previously done with SSI (Navigli, 2008).
Our experiments on six gold-standard datasets show the state-of-the-art performance of our approach, as well as its robustness across languages. Our evaluation also demonstrates that our approach fares well both on long texts, such as those of the WSD tasks, and on short and highly ambiguous sentences, such as the ones in KORE50. Finally, ablation tests and further analysis demonstrate that each component of our system contributes to its state-of-the-art performance on both EL and WSD.
As future work, we plan to use Babelfy for information extraction, where semantics is taking the lead (Moro and Navigli, 2013), and for the validation of semantic annotations (Vannella et al., 2014).

Figure 1: An excerpt of the semantic interpretation graph automatically built for the sentence Thomas and Mario are strikers playing in Munich (the edges connecting the correct meanings are in bold).
input: F, the fragments in the input text; semSign, the semantic signatures; µ, ambiguity level to be reached; cand, fragments to candidate meanings.