J-NERD: Joint Named Entity Recognition and Disambiguation with Rich Linguistic Features

Methods for Named Entity Recognition and Disambiguation (NERD) perform NER and NED in two separate stages. Therefore, NED may be penalized with respect to precision by NER false positives, and suffers in recall from NER false negatives. Conversely, NED does not fully exploit information computed by NER such as types of mentions. This paper presents J-NERD, a new approach to perform NER and NED jointly, by means of a probabilistic graphical model that captures mention spans, mention types, and the mapping of mentions to entities in a knowledge base. We present experiments with different kinds of texts from the CoNLL’03, ACE’05, and ClueWeb’09-FACC1 corpora. J-NERD consistently outperforms state-of-the-art competitors in end-to-end NERD precision, recall, and F1.


Introduction
Motivation: Methods for Named Entity Recognition and Disambiguation, NERD for short, typically proceed in two stages:
• At the NER stage, text spans of entity mentions are detected and tagged with coarse-grained types like Person, Organization, Location, etc. This is typically performed by a trained Conditional Random Field (CRF) over word sequences (e.g., Finkel et al. (2005)).
• At the NED stage, mentions are mapped to entities in a knowledge base (KB) based on contextual similarity measures and the semantic coherence of the selected entities (e.g., Cucerzan (2014); Hoffart et al. (2011); Ratinov et al. (2011)).
This two-stage approach has limitations. First, NER may produce false positives that can misguide NED. Second, NER may miss out on some entity mentions, and NED has no chance to compensate for these false negatives. Third, NED is not able to help NER, for example, by disambiguating "easy" mentions (e.g., of prominent entities with more or less unique names), and then using the entities and knowledge about them as enriched features for NER.
Example: Consider the following sentences: "David played for manu, real, and la galaxy. His wife posh performed with the spice girls." This is difficult for NER because of the absence of upper-case spelling, which is not unusual in social media, for example. Most NER methods will miss out on multi-word mentions or words that are also common nouns ("spice") or adjectives ("posh", "real"). Typically, NER would pass only the mentions "David", "manu", and "la" to the NED stage, which then is prone to many errors like mapping the first two mentions to any prominent people with first names David and Manu, and mapping the third one to the city of Los Angeles. With NER and NED performed jointly, the possible disambiguation of "la galaxy" to the soccer club can guide NER to tag the right mentions with the right types (e.g., recognizing that "manu" could be a short name for a soccer team), which in turn helps NED to map "David" to the right entity David Beckham.
Contribution: This paper presents a novel kind of probabilistic graphical model for the joint recognition and disambiguation of named-entity mentions in natural-language texts. With this integrated approach to NERD, we aim to overcome the limitations of the two-stage NER/NED methods discussed above. Our method, called J-NERD, is based on a supervised, non-linear graphical model that combines multiple per-sentence models into an entity-coherence-aware global model.
The global model detects mention spans, tags them with coarse-grained types, and maps them to entities in a single joint-inference step based on the Viterbi algorithm (for exact inference) or Gibbs sampling (for approximate inference). The J-NERD method comprises the following novel contributions:
• a tree-shaped model for each sentence, whose structure is derived from the dependency parse tree and thus captures linguistic context in a deeper way compared to prior work with CRF's for NER and NED;
• richer linguistic features not considered in prior work, harnessing dependency parse trees and verbal patterns that indicate mention types as part of their nsubj or dobj arguments;
• an inference method that maintains the uncertainty of both mention candidates (i.e., token spans) and entity candidates for competing mention candidates, and makes joint decisions, as opposed to fixing mentions before reasoning on their disambiguation.
We present experiments with three major datasets: the CoNLL'03 collection of newswire articles, the ACE'05 corpus of news and blogs, and the ClueWeb'09-FACC1 corpus of web pages. Baselines that we compare J-NERD with include AIDA-light (Nguyen et al., 2014), Spotlight (Daiber et al., 2013), and TagMe (Ferragina and Scaiella, 2010), and the recent joint NER/NED method of Durrett and Klein (2014). J-NERD consistently outperforms these competitors in terms of both precision and recall.

Related Work
NER: Detecting the boundaries of text spans that denote named entities has been mostly addressed by supervised CRF's over word sequences (McCallum and Li, 2003; Finkel et al., 2005). The work of Ratinov and Roth (2009) [...] et al. (2014) harnessed skip-gram features and external dictionaries for further improvement. An alternative line of NER techniques is based on dictionaries of name-entity pairs, including nicknames, shorthand names, and paraphrases (e.g., "the first man on the moon"). The works of Ferragina and Scaiella (2010) and Mendes et al. (2011) are examples of dictionary-based NER. The work of Spitkovsky and Chang (2012) is an example of a large-scale dictionary that can be harnessed by such methods.
The CRF's additionally output type tags for the recognized word spans, typically limited to coarse-grained types like Person, Organization, and Location (and also Miscellaneous). The most widely used tool of this kind is the Stanford NER Tagger (Finkel et al., 2005). Many NED tools use the Stanford NER Tagger in their first stage of detecting mentions.
Mention Typing: The specific NER task of inferring semantic types has been further refined and extended by various works on fine-grained typing (e.g., politicians, musicians, singers, guitarists) for entity mentions and general noun phrases (Fleischman and Hovy, 2002; Rahman and Ng, 2010; Ling and Weld, 2012; Yosef et al., 2012; Nakashole et al., 2013). Most of these works are based on supervised classification, using linguistic features from mentions and their surrounding text. One exception is the work of Nakashole et al. (2013), which is based on text patterns that connect entities of specific types, acquired by sequence mining from the Wikipedia full-text corpus. In contrast to our work, those are simple surface patterns, and the task addressed there is limited to typing noun phrases that likely denote emerging entities that are not yet registered in a KB.
NED: Methods and tools for NED go back to the seminal work of Dill et al. (2003), Bunescu and Pasca (2006), Cucerzan (2007), and Milne and Witten (2008). More recent advances led to open-source tools like the Wikipedia Miner Wikifier (Milne and Witten, 2013), the Illinois Wikifier (Ratinov et al., 2011), Spotlight (Mendes et al., 2011), Semanticizer (Meij et al., 2012), TagMe (Ferragina and Scaiella, 2010; Cornolti et al., 2014), and AIDA (Hoffart et al., 2011) with its improved variant AIDA-light (Nguyen et al., 2014). We choose some, namely, Spotlight, TagMe and AIDA-light, as baselines for our experiments. These are the best-performing, publicly available systems for news and web texts.
Most of these methods combine contextual similarity measures with some form of consideration for the coherence among a selected set of candidate entities for disambiguation. The latter aspect can be cast into a variety of computational models, like graph algorithms (Hoffart et al., 2011), integer linear programming (Ratinov et al., 2011), or probabilistic graphical models (Kulkarni et al., 2009). All these methods use the Stanford NER Tagger or dictionary-based matching for their NER stages. Kulkarni et al. (2009) use an ILP or LP solver (with rounding) for the NED inference, which is computationally expensive. Note that some of the NED tools aim to link not only named entities but also general concepts (e.g., "world peace") for which Wikipedia has articles. In this paper, we solely focus on proper entities.
Joint NERD: There is little prior work on performing NER and NED jointly. The methods of Sil and Yates (2013) and Durrett and Klein (2014) are the most notable. Sil and Yates (2013) first compile a liberal set of mention and entity candidates, and then perform joint ranking of the candidates. Durrett and Klein (2014) present a CRF model for coreference resolution, mention typing, and mention disambiguation. Our model is also based on CRF's, but distinguishes itself from prior work in three ways: 1) tree-shaped per-sentence CRF's derived from dependency parse trees, as opposed to merely having connections among mentions and entity candidates; 2) linguistic features about verbal phrases from dependency parse trees; 3) maintaining candidates for both mentions and entities and jointly reasoning on their uncertainty. Our experiments include comparisons with the method of Durrett and Klein (2014).
There are also benchmarking efforts on measuring the performance for end-to-end NERD (Cornolti et al., 2013; Carmel et al., 2014; Usbeck et al., 2015), as opposed to assessing NER and NED separately. However, to the best of our knowledge, none of the participants in these competitions considered integrating NER and NED.

Overview
To label a sequence of input tokens $x_1, \dots, x_m$ with a sequence of output labels $y_1, \dots, y_m$, consisting of NER types and NED entities, we devise a family of linear-chain and tree-shaped probabilistic graphical models (Koller et al., 2007). We employ these models to compactly encode a multivariate probability distribution over random variables $\mathcal{X} \cup \mathcal{Y}$, where $\mathcal{X}$ denotes the set of input tokens $x_i$ we may observe, and $\mathcal{Y}$ denotes the set of output labels $y_i$ we may associate with these tokens. By writing $\mathbf{x}$, we denote an assignment of tokens to $\mathcal{X}$, while $\mathbf{y}$ denotes an assignment of labels to $\mathcal{Y}$. In our running example, "David" is the first token $x_1$ with the desired label $y_1$ = PER:David Beckham, where PER denotes the NER type Person and David Beckham is the entity of interest. Consecutive tokens with identical labels are considered to be entity mentions. For example, for $x_5$ = la and $x_6$ = galaxy, the output would ideally be $y_5$ = ORG:Los Angeles Galaxy and $y_6$ = ORG:Los Angeles Galaxy, denoting the soccer club. Upfront these are merely candidate labels, though. Our method may alternatively consider the labels $y_5$ = LOC:Los Angeles and $y_6$ = MISC:Samsung Galaxy. This would yield incorrect output with two single-token mentions and improper entities.
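The label convention above (consecutive tokens with identical labels form one mention) can be illustrated with a minimal Python sketch. The function name `mention_spans` and the label ORG:Real Madrid are illustrative assumptions, not taken from the paper; the remaining labels follow the running example.

```python
from itertools import groupby

OTHER = "Other"  # tag for tokens that are not part of any entity mention


def mention_spans(tokens, labels):
    """Merge consecutive tokens that carry the same label into one mention."""
    spans = []
    position = 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label != OTHER:
            text = " ".join(tokens[position:position + length])
            spans.append((text, label))
        position += length
    return spans


tokens = ["David", "played", "manu", "real", "la", "galaxy"]

# Desired labelling: "la galaxy" becomes one two-token mention.
# ORG:Real Madrid is a hypothetical label chosen for illustration.
good = ["PER:David Beckham", OTHER, "ORG:Manchester United",
        "ORG:Real Madrid", "ORG:Los Angeles Galaxy", "ORG:Los Angeles Galaxy"]

# Alternative candidate labelling: two single-token mentions.
bad = good[:4] + ["LOC:Los Angeles", "MISC:Samsung Galaxy"]

print(mention_spans(tokens, good))
print(mention_spans(tokens, bad))
```

Note that under this convention two adjacent mentions of the same entity would be merged into one span; the sketch does not attempt to handle that case.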
The feature templates $f_1$ through $f_{17}$, which we describe in detail in Section 4, each take the possible assignments $\mathbf{x}$, $\mathbf{y}$ of tokens and labels, respectively, as input and give a binary value or real number as output. Binary values denote the presence or absence of a feature (e.g., a particular token); real-valued ones typically denote frequencies of observed features.
For tractability, probabilistic graphical models are typically constrained by making conditional independence assumptions, thus imposing structure and locality on $\mathcal{X} \cup \mathcal{Y}$. In our models, we postulate that the following conditional independence assumptions hold:
$$p(y_i \mid \mathbf{x}, \mathbf{y}) = p(y_i \mid \mathbf{x}, y_{\mathrm{prev}(i)})$$
That is, the label $y_i$ for the $i$-th token directly depends only on the label $y_{\mathrm{prev}(i)}$ of some previous token at position prev(i) and potentially on all input tokens. The case where prev(i) = i − 1 is the standard setting for a linear-chain CRF, where the label of a token depends only on the label of its preceding token. We generalize this approach to considering prev(i) tokens based on the edges of a dependency parse tree and prev(i) tokens derived from co-references in preceding sentences.
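By the Hammersley-Clifford Theorem, such a graphical model can be factorized into a product form where each factor captures a subset $A \subseteq \mathcal{X} \cup \mathcal{Y}$ of the random variables. Typically, each factor considers only those $\mathcal{X}$ and $\mathcal{Y}$ variables that are coupled by a conditional (in-)dependence assumption, with overlapping $A$ sets of different factors. The probability distribution encoded by the graphical model can then be expressed as follows:
$$p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_{A} F_A(\mathbf{x}_A, \mathbf{y}_A)$$
Here, $F_A(\mathbf{x}_A, \mathbf{y}_A)$ denotes the factors of the model, each of which is of the following form:
$$F_A(\mathbf{x}_A, \mathbf{y}_A) = \exp\Big(\sum_{k} \lambda_k \, \mathit{feature}_k(\mathbf{x}_A, \mathbf{y}_A)\Big)$$
The normalization constant $Z = \sum_{\mathbf{x}, \mathbf{y}} \prod_{A} F_A(\mathbf{x}_A, \mathbf{y}_A)$ ensures that this distribution sums up to 1, while $\lambda_k$ are the parameters of the model, which we aim to learn from various annotated background corpora.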
Our inference objective then is to find the most probable sequence of labels $\mathbf{y}^*$ when given the token sequence $\mathbf{x}$ as evidence:
$$\mathbf{y}^* = \arg\max_{\mathbf{y}} \, p(\mathbf{y} \mid \mathbf{x})$$
That is, in our setting, we fix $\mathbf{x} = tok_1, \dots, tok_m$ to the observed token sequence, while $\mathbf{y} = y_1, \dots, y_m$ ranges over all possible sequences of associated labels. In our approach, which we therefore call J-NERD, each label $y_i$ represents a combination of NER type and NED entity.
State-of-the-art NER methods, such as the Stanford NER Tagger, employ linear-chain factor graphs, known as Conditional Random Fields (CRF's) (Sutton and McCallum, 2012). We also devise more sophisticated tree-shaped factor graphs whose structure is obtained from the dependency parse trees of the input sentences. These per-sentence models are optionally combined into a global factor graph by also adding cross-sentence dependencies (Finkel et al., 2005). These cross-sentence dependencies are added whenever overlapping sets of entity candidates (i.e., potential co-references) are detected among the input sentences. Figure 3 gives an example of such a global graphical model for two sentences.
The search space of candidate labels for our models depends on the candidates for mention spans (with the same NER type) and their NED entities. We use pruning heuristics to restrict this space: candidate spans for mentions are derived from dictionaries, and we consider only the top-20 entity candidates for each candidate mention. For a given sentence, this typically leads to a few thousand candidate labels over which the CRF inference runs. The candidates are determined independently for each sentence.

Features
These models employ a variety of feature templates that generate the factors of the joint probability distribution. Some of the features are fairly standard for NER/NED, whereas others are novel.
• Standard features include lexico-syntactic properties of tokens like POS tags, matches in dictionaries/gazetteers, and similarity measures between token strings and entity names. Also, entity-entity coherence is an important feature for NED; it is not exactly a standard feature, but has been used in some prior works.
• Features about the topical domain of an input text (e.g., politics, sports, football, etc.) are obtained by a classifier based on "easy mentions": those mentions for which the NED decision can be made with very high confidence without advanced features. The use of domains for NED was introduced by Nguyen et al. (2014). Here, we further extend this technique by harnessing domain features for joint inference on NER and NED.
• The third feature group captures typed dependencies from the sentence parsing. To our knowledge, these have not been used in prior work on NER and NED.
The NER types that we consider are the standard types PER for person, LOC for location and ORG for organization. All other types that, for example, the Stanford NER Tagger would mark, are collapsed into a type MISC for miscellaneous. These include labels like date and money (which are not genuine entities anyway) and also entity types like events and creative works such as movies, songs, etc. (which are disregarded by the Stanford NER Tagger). We add two dedicated tags for tokens to express the case when no meaningful NER type or NED entity can be assigned. For tokens that should not be labeled as a named entity at all (e.g., "played" in our example), we use the tag Other. For tokens with a valid NER type, we add the virtual entity Out-of-KB (for "out of knowledge base") to their entity candidates, to prepare for the possible situation where the token (and its surrounding tokens) actually denotes an emerging or long-tail entity that is not contained in the knowledge base.

Linear-Chain Model
In the local models, J-NERD works on each sentence $S = tok_1, \dots, tok_m$ separately. We construct a linear-chain CRF (see Figure 1) by introducing an observed variable $x_i$ for each token $tok_i$ that represents a proper word. For each $x_i$, we additionally introduce a variable $y_i$ that represents the combined NERD label. As in any CRF, the $x_i, y_i$ and $y_i, y_{i+1}$ pairs are connected via factors $F(\mathbf{x}, \mathbf{y})$, whose weights we obtain from the feature functions described in Section 4.
Figure 1: Linear-chain model (CRF), with labels $y_1, \dots, y_6$ over the tokens $x_1, \dots, x_6$ = "David played manu real la galaxy".

Tree Model
The factor graph for the tree-shaped model is constructed in a similar way. However, here we add a factor that links a pair of labels $y_i$, $y_j$ if their respective tokens $tok_i$, $tok_j$ are connected via a typed dependency which we obtain from the Stanford parser. Figure 2 shows an example of such a tree model. Thus, while the linear-chain model adds factors between labels of adjacent tokens only based on their positions in the sentence, the tree model adds factors based on the dependency parse tree to enhance the coherence of labels across tokens.
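To make the structural difference concrete, the following Python sketch contrasts the label-label edges of the two local models. The helper names and the hand-written dependency triples are illustrative assumptions; they are not output of the Stanford parser.

```python
def chain_edges(num_tokens):
    """Linear-chain model: one factor between the labels of adjacent tokens."""
    return [(i, i + 1) for i in range(num_tokens - 1)]


def tree_edges(dependencies):
    """Tree model: one factor per typed dependency between two tokens.

    `dependencies` is a list of (dep_type, head_index, dependent_index).
    """
    return sorted({(min(h, d), max(h, d)) for _, h, d in dependencies})


tokens = ["David", "played", "manu", "real", "la", "galaxy"]

# Hypothetical, hand-written parse of the running example (0-based indices).
dependencies = [
    ("nsubj", 1, 0),  # played -> David
    ("dobj", 1, 2),   # played -> manu
    ("conj", 2, 3),   # manu -> real
    ("conj", 2, 5),   # manu -> galaxy
    ("nn", 5, 4),     # galaxy -> la
]

print(chain_edges(len(tokens)))  # edges by sentence position
print(tree_edges(dependencies))  # edges by dependency structure
```

In this toy parse the tree model links "manu" directly to "galaxy", whereas the chain model only connects them through the intermediate tokens.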

Global Models
For global models, we consider an entire input text consisting of multiple sentences $S_1, \dots, S_n = tok_1, \dots, tok_m$, to augment either the linear-chain model or the tree-shaped model. As shown in Figure 3, cross-sentence edges among pairs of labels $y_i$, $y_j$ are introduced for candidate sets $C_i$, $C_j$ that share at least one candidate entity, such as "David" and "David Beckham". Additionally, we introduce factors for all pairs of tokens in adjacent mentions within the same sentence, such as "David" and "manu".
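The cross-sentence linking rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the (sentence, token) keys, and the candidate sets are hypothetical, and the sketch covers only the shared-candidate rule, not the additional within-sentence factors between adjacent mentions.

```python
from itertools import combinations


def cross_sentence_edges(candidates):
    """Link labels in different sentences whose candidate sets overlap.

    `candidates` maps (sentence_index, token_index) to a set of candidate
    entities for the token at that position.
    """
    edges = []
    for a, b in combinations(sorted(candidates), 2):
        in_different_sentences = a[0] != b[0]
        if in_different_sentences and candidates[a] & candidates[b]:
            edges.append((a, b))
    return edges


# Hypothetical candidate sets for three tokens in two sentences.
candidates = {
    (0, 0): {"David Beckham", "David Bowie"},     # "David"
    (1, 0): {"David Beckham", "Victoria Beckham"},  # "Beckham"
    (1, 3): {"Victoria Beckham"},                 # "posh"
}

print(cross_sentence_edges(candidates))
```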

Inference & Learning
Our inference objective is to find the most probable sequence of NERD labels $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$ according to the objective function we defined in Section 3. Instead of considering the actual distribution for this purpose, we aim to maximize an equivalent objective function as follows. Each factor $A$ in our model couples a label variable $y_t$ with a variable $y_{\mathrm{prev}(t)}$: either its immediately preceding token in the same sentence, or a parsing-dependency-linked token in the same sentence, or a co-reference-linked token in a different sentence. Each of these factors has its feature functions, and we can regroup these features on a per-token basis given the log-linear nature of the objective function. This leads to the following optimization problem, which has its maximum for the same label assignment as the original problem:
$$\mathbf{y}^* = \arg\max_{y_1, \dots, y_m} \exp\Big(\sum_{t=1}^{m} \sum_{k=1}^{K} \lambda_k \, \mathit{feature}_k(y_t, y_{\mathrm{prev}(t)}, x_1, \dots, x_m)\Big)$$
where
• prev(t) is the index of label $y_j$ on which $y_t$ depends,
• $\mathit{feature}_{1..K}$ are the feature functions generated from templates $f_1$ through $f_{17}$ of Section 4,
• and $\lambda_k$ are the feature weights, i.e., the model parameters to be learned.
The actual number of generated features, $K$, depends on the training corpus and the choice of the graphical model. For the CoNLL-YAGO2 training set, the tree models have $K$ = 1,767 parameters. Given a trained model, exact inference with respect to the above objective function can be efficiently performed by variants of the Viterbi algorithm (Sutton and McCallum, 2012) for the local models, both in the linear-chain and tree-shaped cases. For the global models, however, exact solutions are computationally intractable. Therefore, we employ Gibbs sampling (Finkel et al., 2005) to approximate the solution.
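For the linear-chain case, exact inference by the Viterbi algorithm can be sketched as follows. This is a generic textbook Viterbi over summed log-linear scores, not the paper's implementation; the `score` function, the three plain labels, and all weights are hypothetical stand-ins for the weighted sum of feature functions. Because exp is monotonic, maximizing the summed score is equivalent to maximizing the exponentiated objective.

```python
def viterbi(tokens, labels, score):
    """Return the label sequence that maximizes the summed per-token scores.

    `score(prev_label, label, tokens, t)` plays the role of the weighted
    feature sum for position t (prev_label is None at t = 0).
    """
    best = {y: score(None, y, tokens, 0) for y in labels}
    backpointers = []
    for t in range(1, len(tokens)):
        new_best, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + score(p, y, tokens, t))
            new_best[y] = best[prev] + score(prev, y, tokens, t)
            pointers[y] = prev
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical weights: token-label scores and one label-label score.
emission = {("David", "PER"): 2.0, ("played", "Other"): 2.0,
            ("manu", "PER"): 1.0, ("manu", "ORG"): 0.8}
transition = {("Other", "ORG"): 0.5}


def score(prev_label, label, tokens, t):
    return (emission.get((tokens[t], label), 0.0)
            + transition.get((prev_label, label), 0.0))


print(viterbi(["David", "played", "manu"], ["PER", "ORG", "Other"], score))
```

In this toy setting, a purely per-token decision would label "manu" as PER; the label-label term makes ORG the jointly best choice.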
As for the model parameters, J-NERD learns the feature weights $\lambda_k$ from the training data by maximizing a respective conditional likelihood function (Sutton and McCallum, 2012), using a variant of the L-BFGS optimization algorithm (Liu and Nocedal, 1989). We do this for each local model (linear-chain and tree models), and apply the same learned weights to the corresponding global models. Our implementation uses the RISO toolkit for belief networks.

Feature Templates
We define feature templates for detecting the combined NER/NED labels of tokens that denote or are part of an entity mention. Once these labels are determined, the actual boundaries of the mentions, i.e., their token spans, are trivially derived by combining adjacent tokens with the same label (and disregarding all tokens with the tag Other).
Language Preprocessing. We employ the Stanford CoreNLP tool suite for processing input documents. This includes tokenization, sentence detection, POS tagging, lemmatization, and dependency parsing. All of these provide features for our graphical model. In particular, we harness dependency types between noun phrases (de Marneffe et al., 2006), like nsubj, dobj, prep_in, prep_for, etc.
In the following, we introduce the complete set of feature templates $f_1$ through $f_{17}$ used by our method. Templates are instantiated based on the observed input and the candidate space of possible labels for this input, and guided by distant resources like knowledge bases and dictionaries. Templates $f_1$, $f_8$ through $f_{13}$, and $f_{17}$ generate real numbers as values derived from frequencies in training data; all other templates generate binary values denoting presence or absence of certain features. The generated feature values depend on the assignment of input tokens to variables $x_i \in \mathcal{X}$. In addition, our graphical models often consider only a specific subset of candidate labels as assignments to the output variables $y_i \in \mathcal{Y}$. Therefore, we formulate the feature-generation process as a set of feature functions that depend on both (per-factor subsets of) $\mathcal{X}$ and $\mathcal{Y}$. Table 1 illustrates the feature generation by the set of active feature functions for the token "manu" in our running example, using three different candidate labels.
Entity Repository and Name-Entity Dictionary. Many feature templates harness a knowledge base, namely, YAGO2 (Hoffart et al., 2013), as an entity repository and as a dictionary of name-to-entity pairs (i.e., aliases and paraphrases). We import the YAGO2 means and hasName relations, a total of more than 6 million name-entity pairs (for ca. 3 million distinct entities). We derive additional NER-type-specific phrase dictionaries from supporting phrases of GATE (Cunningham et al., 2011), e.g., "Mr.", "Mrs.", "Dr.", "President", etc. for the type PER; "city", "river", "park", etc. for the type LOC; "company", "institute", "Inc.", "Ltd.", etc. for the type ORG.
Pruning the Candidate Space. To reduce the dimensionality of the generated feature space and to make the factor-graph inference tractable, we use pruning techniques based on the knowledge base and the dictionaries. To determine if a token can be a mention or part of a mention, we first perform exact-match lookups of all sub-sequences against the name-entity dictionary. As an option (and by default), this can be limited to sub-sequences that are tagged as noun phrases by the Stanford parser. For higher recall, we then add partial-match lookups when a token sub-sequence matches only some but not all tokens of an entity name in the dictionary. For example, for the sentence "David played for manu, real and la galaxy", we obtain "David", "manu", "real", "la galaxy", "la", and "galaxy" as candidate mentions. For each such candidate mention, we look up the knowledge base for entities and consider only the $n$ highest-ranked candidate entities (using $n$ = 20 in our experiments). The ranking is based on the string similarity between the mention and the entity name, the prior popularity of the entity, and the local context similarity (using feature functions $f_8$, $f_9$, $f_{10}$ described in Subsection 4.1).
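The exact-match part of this pruning step can be sketched in Python as below. The sketch makes several simplifying assumptions: the dictionary is a plain in-memory mapping from lower-cased names to entities with a single precomputed ranking score (all names, entities, and scores are hypothetical), partial-match lookups and the noun-phrase restriction are omitted, and the combined ranking by string similarity, popularity, and context is replaced by that single score.

```python
def candidate_mentions(tokens, dictionary):
    """Exact-match lookups of all contiguous token sub-sequences."""
    found = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            name = " ".join(tokens[start:end]).lower()
            if name in dictionary:
                found.append((start, end, name))
    return found


def top_entities(name, dictionary, n=20):
    """Keep only the n highest-ranked candidate entities for a name."""
    ranked = sorted(dictionary[name].items(), key=lambda kv: kv[1], reverse=True)
    return [entity for entity, _ in ranked[:n]]


# Hypothetical name-to-entity dictionary with illustrative ranking scores.
dictionary = {
    "david": {"David Bowie": 0.4, "David Beckham": 0.3},
    "manu": {"Manchester United": 0.6},
    "real": {"Real Madrid": 0.5},
    "la": {"Los Angeles": 0.9},
    "galaxy": {"Samsung Galaxy": 0.5},
    "la galaxy": {"Los Angeles Galaxy": 0.8},
}

tokens = ["David", "played", "for", "manu", "real", "and", "la", "galaxy"]
mentions = candidate_mentions(tokens, dictionary)
print([name for _, _, name in mentions])
print(top_entities("david", dictionary, n=1))
```

Note that overlapping candidates such as "la", "galaxy", and "la galaxy" are all kept at this stage; choosing among them is left to the joint inference.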

Standard Features
For the following definitions of the feature templates, let $pos_i$ denote the POS tag of $tok_i$, $dic_i$ denote the NER tag from the dictionary lookup of $tok_i$, and $dep_i$ denote the parsing dependency that connects $tok_i$ with another token. Further, we write $sur_i = tok_{i-1}, tok_i, tok_{i+1}$ to refer to the sequence of tokens surrounding $tok_i$. As for the possible labels, we denote by $type_i$ and $ent_i$ an NER type and candidate entity for the current token $tok_i$, respectively.
Token-Type Prior. Feature $f_1(type_i, tok_i)$ captures a prior probability for $tok_i$ being of NER type $type_i$. These probabilities are estimated from an NER-annotated training corpus. In our experiments, we used training subsets of different test corpora such as CoNLL. For example, we may thus obtain a prior of $f_1$(ORG, "Ltd.") = 0.8.
Current POS. Template $f_2(type_i, tok_i)$ generates a binary feature function if token $tok_i$ occurs in the training corpus with POS tag $pos_i$ and NER label $type_i$. For example, $f_2$(PER, "David") = 1 if the current token "David" has occurred with POS tag NNP and NER label PER in the training corpus. For combinations of tokens with POS tags and NER types that do not occur in the training corpus, no actual feature function is generated from the template (i.e., the value of the function would be 0). For the rest of this section, we assume that all binary feature functions are generated from their feature templates in an analogous manner.
In-Dictionary. Template $f_3(type_i, tok_i)$ generates a binary feature function if the current token $tok_i$ occurs in the name-to-entity dictionary for some entity of NER label $type_i$.
Uppercase. Template $f_4(type_i, tok_i)$ generates a binary feature function if the current token $tok_i$ appears in upper-case form and additionally has the NER label $type_i$ in the training corpus.
Surrounding POS. Template $f_5(type_i, tok_i)$ generates a binary feature function if the current token $tok_i$ and the POS sequence of its surrounding tokens $sur_i$ both appear in the training corpus, where $tok_i$ also has the NER label $type_i$.
Surrounding Tokens. Template $f_6(type_i, tok_i)$ generates a binary feature function if the current token $tok_i$ has NER label $type_i$, given that $tok_i$ also appears with surrounding tokens $sur_i$ in the training corpus. When instantiated, this template could possibly lead to a huge number of feature functions. For tractability, we thus ignore sequences that occur only once in the training corpus.
Surrounding In-Dictionary. Template $f_7(type_i, tok_i)$ performs dictionary lookups for surrounding tokens in $sur_i$. Similar to $f_6$, it generates a binary feature function if the current token $tok_i$ and the dictionary lookups of its surrounding tokens $sur_i$ appear in the training corpus, where $tok_i$ also has NER label $type_i$.
Token-Entity Prior. Feature $f_8(ent_i, tok_i)$ captures a prior probability of $tok_i$ having NED label $ent_i$. These probabilities are estimated from co-occurrence frequencies of name-to-entity pairs in the background corpus, thus harnessing link-anchor texts in Wikipedia. For example, we may have a prior of $f_8$(David Beckham, "Beckham") = 0.7, as David is more popular (today) than his wife Victoria. On the other hand, $f_8$(David Beckham, "David") may be lower than $f_8$(David Bowie, "David"), for example, as this still active pop star is more frequently and prominently mentioned than the retired football player.
Token-Entity n-Gram Similarity. Feature $f_9(ent_i, tok_i)$ measures the Jaccard similarity between the character-level n-grams of a name in the dictionary that includes $tok_i$ and those of the primary (i.e., full and most frequently used) name of the entity $ent_i$. For example, for $n$ = 2 the value of $f_9$(David Beckham, "Becks") is 3/11. In our experiments, we set $n$ = 3.
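The n-gram similarity can be checked with a small Python sketch. One assumption is made loudly here: n-grams are extracted per word, without spanning the space between words, because this is the reading that reproduces the value 3/11 for "Becks" versus "David Beckham" with n = 2. The function names are illustrative.

```python
from fractions import Fraction


def char_ngrams(name, n):
    """Character n-grams of each word in a name (assumption: not across spaces)."""
    grams = set()
    for word in name.lower().split():
        grams.update(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def ngram_jaccard(name_a, name_b, n):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of the two n-gram sets."""
    grams_a, grams_b = char_ngrams(name_a, n), char_ngrams(name_b, n)
    if not grams_a and not grams_b:
        return Fraction(0)
    return Fraction(len(grams_a & grams_b), len(grams_a | grams_b))


# "Becks" has 4 bigrams, "David Beckham" has 10, and they share be, ec, ck.
print(ngram_jaccard("Becks", "David Beckham", n=2))
```

With these sets the union has 4 + 10 − 3 = 11 elements, which gives the value 3/11 stated above.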

Token-Entity Token Contexts. Feature $f_{10}(ent_i, tok_i)$ measures the weighted overlap similarity between the token contexts (tok-cxt) of token $tok_i$ and entity $ent_i$. Specifically, we use a weighted generalization of the standard overlap coefficient, WO, between two sets $X$, $Y$ of weighted elements, $X_k \in X$ and $Y_k \in Y$:
$$\mathrm{WO}(X, Y) = \frac{\sum_k \min(X_k, Y_k)}{\min\big(\sum_k X_k, \sum_k Y_k\big)}$$
We set the weights to be tf-idf scores, and hence we obtain:
$$f_{10}(ent_i, tok_i) = \mathrm{WO}\big(\text{tok-cxt}(ent_i), \text{tok-cxt}(tok_i)\big)$$
Entity-Entity Token Coherence. Feature $f_{11}(ent_i, ent_j)$ measures the coherence between the token contexts of two entity candidates $ent_i$ and $ent_j$:
$$f_{11}(ent_i, ent_j) = \mathrm{WO}\big(\text{tok-cxt}(ent_i), \text{tok-cxt}(ent_j)\big)$$
$f_{11}$ allows us to establish cross-dependencies among labels in our graphical model. For example, the two entities David Beckham and Manchester United are highly coherent as they share many tokens in their contexts, such as "champions", "league", "premier", "cup", etc. Thus, they should be mapped jointly.
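A minimal Python sketch of a weighted overlap measure is given below. It assumes the usual weighted generalization of the overlap coefficient: the sum of element-wise minimum weights over shared elements, divided by the smaller of the two total weights. The function name and the context weights are hypothetical; in the setting above the weights would be tf-idf scores.

```python
def weighted_overlap(x, y):
    """Weighted overlap of two contexts given as {element: weight} mappings."""
    denominator = min(sum(x.values()), sum(y.values()))
    if denominator == 0:
        return 0.0
    # Elements missing from one context contribute a minimum of zero.
    shared = sum(min(weight, y[key]) for key, weight in x.items() if key in y)
    return shared / denominator


# Hypothetical token contexts with illustrative weights.
beckham_context = {"champions": 2.0, "league": 1.0, "premier": 1.0}
manu_context = {"champions": 1.0, "league": 2.0, "cup": 1.0}

print(weighted_overlap(beckham_context, manu_context))
```

Dividing by the smaller total weight means that a short context fully contained in a longer one still scores 1.0.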

Domain Features
We use WordNet domains, created by Miller (1995), Magnini and Cavaglià (2000), and Bentivogli et al. (2004), to construct a taxonomy of 46 domains, including Politics, Economy, Sports, Science, Medicine, Biology, Art, Music, etc. We combine the domains with semantic types (classes of entities) provided by YAGO2, by assigning them to their respective domains. This is based on the manual assignment of WordNet synsets to domains, introduced by Magnini and Cavaglià (2000) and Bentivogli et al. (2004), and extends to additional types in YAGO2. For example, Singer is assigned to Music, and Football Player to Football, a sub-domain of Sports. These types include the standard NER types Person (PER), Organization (ORG), Location (LOC), and Miscellaneous (MISC), which are further refined by the YAGO2 subclassOf hierarchy. In total, the 46 domains are enhanced with ca. 350,000 types imported from YAGO2.
J-NERD classifies input texts into domains by means of "easy mentions". An easy mention is a match in the name-to-entity dictionary for which there exist at most three candidate entities (Nguyen et al., 2014). Although the mention boundaries are not explicitly provided as input, J-NERD still can extract these easy mentions from the entirety of all mention candidates.
In the following, let $C^*$ be the set of candidate entities for the "easy" mentions in the input document. For each domain $d$ (see Section 3), we compute the coherence of the easy mentions $M^* = \{m_1, m_2, \dots\}$:
$$\mathrm{coh}(M^*) = \frac{|C^* \cap C_d|}{|C^*|}$$
where $C_d$ is the set of all entities under domain $d$.
We classify the document into the domain with the highest coherence score.
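The classification step can be sketched in Python as follows. Two assumptions are made: the coherence of a domain is taken to be the fraction of easy-mention candidate entities that fall under that domain, and the domain-to-entity sets are a small hypothetical stand-in for the YAGO2-based taxonomy. The function names and the placeholder entities "David #3" and "David #4" are illustrative only.

```python
def easy_candidates(candidates, max_entities=3):
    """Union of candidate entities over mentions with at most three candidates."""
    pool = set()
    for entities in candidates.values():
        if 0 < len(entities) <= max_entities:
            pool |= entities
    return pool


def classify_domain(pool, domain_entities):
    """Pick the domain whose entity set covers the largest share of the pool."""
    if not pool:
        return None

    def coherence(domain):
        return len(pool & domain_entities[domain]) / len(pool)

    return max(domain_entities, key=coherence)


# Hypothetical mention candidates; "david" is not easy (four candidates).
candidates = {
    "beckham": {"David Beckham", "Victoria Beckham"},
    "manu": {"Manchester United"},
    "spice girls": {"Spice Girls"},
    "david": {"David Beckham", "David Bowie", "David #3", "David #4"},
}

# Hypothetical domain taxonomy.
domain_entities = {
    "Football": {"David Beckham", "Manchester United", "Los Angeles Galaxy"},
    "Music": {"Spice Girls", "David Bowie"},
}

pool = easy_candidates(candidates)
print(classify_domain(pool, domain_entities))
```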
Although the mentions and their entities may be inferred incorrectly, the domain classification still tends to work very reliably as it aggregates over all "easy" mention candidates. The following feature templates exploit domains.

Entity-Domain Coherence. Template $f_{12}(ent_i, tok_i)$ generates a binary feature function that captures the coherence between an entity candidate $ent_i$ of token $tok_i$ and the domain $d$ which the input text is classified into. That is, $f_{12}(ent_i, tok_i) = 1$ if $ent_i \in C_d$. Otherwise, the feature value is 0.
Entity-Entity Type Coherence. Feature $f_{13}(ent_i, ent_j)$ computes the relatedness between the Wikipedia categories of two candidate entities $ent_i \in C_i$ and $ent_j \in C_j$, where $C_i$, $C_j$ denote the two sets of candidate entities associated with $tok_i$, $tok_j$, respectively. It builds on the function $rel(c_u, c_v)$, which computes the reciprocal length of the shortest path between categories $c_u$, $c_v$ in the domain taxonomy (Nguyen et al., 2014). Recall that our domain taxonomy contains a few hundred thousand Wikipedia categories integrated in the YAGO2 type hierarchy.

Linguistic Features
Recall that we harvest dependency-parsing patterns by using Wikipedia as a large background corpus.
Here we harness that Wikipedia contains many mentions with explicit links to entities and that the knowledge base provides us with the NER types for these entities.

Typed-Dependency.
Template f 14 (type i , tok i ) generates a binary feature function if the background corpus contains the pattern dep i = deptype(arg1 , arg2 ) where the current token tok i is either arg1 or arg2 , and tok i is labeled with NER label type i .

Typed-Dependency/POS.
Template f 15 (type i , tok i ) captures linguistic patterns that combine parsing dependencies (like in f 14 ) and POS tags (like in f 2 ) learned from an annotated training corpus. It generates binary features if the current token tok i appears in the dependency pattern dep i with POS tag pos i and this combination also occurs in the training data under NER label type i .

Typed-Dependency/In-Dictionary.
Template f 16 (type i , tok i ) captures linguistic patterns that combine parsing dependencies (like in f 14 ) and dictionary lookups (like in f 3 ) learned from an annotated training corpus. It generates a binary feature function if the current token tok i appears in the dependency pattern dep i and has an entry dic i in the name-to-entity dictionary for some entity with NER label type i .
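The typed-dependency template f 14 can be sketched as a lookup against patterns harvested from a labeled background corpus. This is a simplified illustration: the `Dep` tuple, the `harvest_patterns` helper, the exact-match lookup, and the toy corpus are all hypothetical, and the actual system may generalize patterns in ways not shown here.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    """A typed dependency deptype(arg1, arg2)."""
    deptype: str
    arg1: str
    arg2: str


def harvest_patterns(corpus):
    """Collect (dependency, token, NER type) triples from a background corpus.

    `corpus` is an iterable of (dependencies, labels) pairs, where `labels`
    maps a token to its NER type (assumed to come from linked entities).
    """
    patterns = set()
    for deps, labels in corpus:
        for dep in deps:
            for tok in (dep.arg1, dep.arg2):
                if tok in labels:
                    patterns.add((dep, tok, labels[tok]))
    return patterns


def f14(patterns, type_i, tok_i, sentence_deps):
    """Binary feature: 1 if tok_i occurs as an argument of a dependency that
    was seen in the background corpus with tok_i labeled as type_i."""
    for dep in sentence_deps:
        if tok_i in (dep.arg1, dep.arg2) and (dep, tok_i, type_i) in patterns:
            return 1
    return 0


# Toy background corpus (made up for illustration).
background = [([Dep("prep_for", "play", "England")], {"England": "ORG"})]
patterns = harvest_patterns(background)

# Dependencies of a new sentence, e.g. "Woolmer played 19 tests for England".
sentence = [Dep("nsubj", "play", "Woolmer"), Dep("prep_for", "play", "England")]
print(f14(patterns, "ORG", "England", sentence))
print(f14(patterns, "LOC", "England", sentence))
```

Templates f 15 and f 16 would follow the same shape, with the POS tag or a dictionary hit added to the key that is looked up.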

Token-Entity Linguistic Contexts.
Feature f 17 (ent i , tok i ) measures the weighted overlap between the linguistic contexts (ling-cxt) of token tok i and candidate entity ent i :

$$f_{17}(ent_i, tok_i) = \mathrm{WO}\big(\text{ling-cxt}(ent_i),\ \text{ling-cxt}(tok_i)\big)$$
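As an illustration of f 17, the sketch below computes a weighted overlap between two contexts represented as weighted bags of dependency patterns. It assumes that WO is a weighted Jaccard-style measure (sum of minimum weights over sum of maximum weights); the definition of WO is not given in this section and may differ. The context keys and weights are made up.

```python
def weighted_overlap(ctx_a, ctx_b):
    """Weighted Jaccard-style overlap of two {item: weight} contexts.

    Assumption: WO = sum of min weights / sum of max weights over all items.
    """
    keys = set(ctx_a) | set(ctx_b)
    numerator = sum(min(ctx_a.get(k, 0.0), ctx_b.get(k, 0.0)) for k in keys)
    denominator = sum(max(ctx_a.get(k, 0.0), ctx_b.get(k, 0.0)) for k in keys)
    return numerator / denominator if denominator > 0 else 0.0


def f17(ling_cxt_ent, ling_cxt_tok):
    """Feature value: weighted overlap of entity and token contexts."""
    return weighted_overlap(ling_cxt_ent, ling_cxt_tok)


# Hypothetical linguistic contexts: dependency patterns with weights.
entity_context = {"prep_for:play": 2.0, "nsubj:win": 1.0}
token_context = {"prep_for:play": 1.0, "dobj:visit": 1.0}
print(f17(entity_context, token_context))
```

Here only `prep_for:play` is shared, so the numerator is 1 and the denominator is 4.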

Data Collections
Our evaluation is mainly based on the CoNLL-YAGO2 corpus of newswire articles. Additionally, we report on experiments with an extended version of the ACE-2005 corpus and a large sample of the entity-annotated ClueWeb'09-FACC1 Web crawl. CoNLL-YAGO2 is derived from the CoNLL-YAGO corpus (Hoffart et al., 2011) by removing tables where mentions in table cells do not have linguistic context; a typical example is sports results. The resulting corpus contains 1,244 documents with 20,924 mentions including 4,774 Out-of-KB entities. Ground-truth entities in YAGO2 are provided by Hoffart et al. (2011). For a consistent ground-truth set, we derived the NER types from the NED ground-truth entities, fixing some errors in the original annotations related to metonymy (e.g., labeling the mentions in "India beats Pakistan 2:1" incorrectly as LOC, whereas the entities are the sports teams of type ORG). This makes the dataset not only cleaner but also more demanding, as metonymous mentions are among the most difficult cases.
For our evaluation, we use the "testb" subset of CoNLL-YAGO, which, after the removal of tables, has 231 documents with 5,616 mentions including 1,131 Out-of-KB entities. The other 1,045 documents with a total of 17,870 mentions (including 4,057 Out-of-KB mentions) are used for training.

ACE is an extended variant of the ACE 2005 corpus, with additional NED labels by Bentivogli et al. (2010). We consider only proper entities and exclude mentions of general concepts such as "revenue", "world economy", "financial crisis", etc., as they do not correspond to individual entities in a knowledge base. This reduces the number of mentions, but gives the task a crisp focus. We disallow overlapping mention spans and consider only maximum-length mentions, following the rationale of the ERD Challenge 2014. The test set contains 117 documents with 2,958 mentions.

ClueWeb contains two randomly sampled subsets of the ClueWeb'09-FACC1 corpus with Freebase annotations:
• ClueWeb: 1,000 documents (24,289 mentions) each with at least 5 entities.
• ClueWeb long-tail: 1,000 documents (49,604 mentions) each with at least 3 long-tail entities.
We consider an entity to be "long-tail" if it has at most 10 incoming links in the English Wikipedia. Note that these Web documents are very different in style from the news-centric articles in CoNLL and ACE. Also note that the entity markup is automatically generated, but with emphasis on high precision. So the data captures only a small subset of the potential entity mentions, and it may contain a small fraction of false entities.
In addition to these larger test corpora, we ran experiments with several smaller datasets used in prior work: KORE, MSNBC (Cucerzan, 2007), and a subset of AQUAINT (Milne and Witten, 2008). Each of these has only a few hundred mentions, but they exhibit different characteristics. The findings on these datasets are fully in line with those of our main experiments; hence no explicit results are presented here.
In all of these test datasets, the ground-truth considers only individual entities and excludes general concepts, such as "climate change", "harmony", "logic", "algebra", etc. These proper entities are identified by the intersection of Wikipedia articles and YAGO2 entities. This way, we focus on NERD. Systems that are designed for the broader task of "Wikification" are not penalized by their (typically lower) performance on inputs other than proper entity mentions.

Methods under Comparison
We compare J-NERD in its four variants (linear vs. tree and local vs. global) to various state-of-the-art NER/NED methods.
For NER (i.e., mention boundaries and types) we use the recent version 3.4.1 of the Stanford NER Tagger (Finkel et al., 2005) and the recent version 2.8.4 of the Illinois Tagger (Ratinov and Roth, 2009) as baselines. These systems have NER benchmark results on CoNLL'03 that are as good as the result reported in Passos et al. (2014). We retrained this model by using the same corpus-specific training data that we use for J-NERD.
For NED, we compared J-NERD against the following methods for which we obtained open-source software or could call a Web service:
• Berkeley-entity (Durrett and Klein, 2014) uses a joint model for coreference resolution, NER and NED with linkage to Wikipedia.
• AIDA-light (Nguyen et al., 2014) is an optimized variant of the AIDA system (Hoffart et al., 2011), based on YAGO2. It uses the Stanford tool for NER.
• TagMe (Ferragina and Scaiella, 2010) is a Wikifier that maps mentions to entities or concepts in Wikipedia. It uses a Wikipedia-derived dictionary for NER.
• Spotlight (Mendes et al., 2011) links mentions to entities in DBpedia. It uses the LingPipe dictionary-based chunker for NER.
Some systems use confidence thresholds to decide on when to map a mention to Out-of-KB. For each dataset, we used withheld data to tune these system-specific thresholds. Figure 4 illustrates the sensitivity of the thresholds for the CoNLL-YAGO2 dataset.
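The threshold tuning described above can be sketched as a simple search over candidate thresholds on withheld data. This is only an illustration under stated assumptions: the selection criterion (plain accuracy), the candidate grid, and the toy data with placeholder entity names are hypothetical, since the text does not specify how each system's threshold was chosen.

```python
OUT_OF_KB = "Out-of-KB"


def apply_threshold(confidence, entity, threshold):
    """Keep the predicted entity only if the system is confident enough."""
    return entity if confidence >= threshold else OUT_OF_KB


def tune_threshold(held_out, candidates):
    """Pick the candidate threshold with the highest accuracy on held-out data.

    `held_out` is a list of (confidence, predicted_entity, gold_label) triples.
    Assumption: accuracy is the tuning criterion; ties go to the first candidate.
    """
    def accuracy(threshold):
        correct = sum(apply_threshold(conf, ent, threshold) == gold
                      for conf, ent, gold in held_out)
        return correct / len(held_out)

    return max(candidates, key=accuracy)


# Toy withheld data (made up): two confident correct links and two
# low-confidence links whose gold label is Out-of-KB.
held_out = [
    (0.9, "entity_a", "entity_a"),
    (0.8, "entity_b", "entity_b"),
    (0.3, "entity_c", OUT_OF_KB),
    (0.2, "entity_d", OUT_OF_KB),
]
print(tune_threshold(held_out, [0.1, 0.5, 0.95]))
```

With this toy data, the middle threshold separates the confident links from the Out-of-KB cases and is selected.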

Evaluation Measures
We evaluate the output quality at the NER level alone and for the end-to-end NERD task. We do not evaluate NED alone, as this would require giving a ground-truth set of mentions to the systems to rule out that NER errors affect NED. Most competitors do not have interfaces for such a controlled NED-only evaluation.
Each test collection has ground-truth annotations (G) consisting of text spans for mentions, NER types of the mentions, and mappings of mentions to entities in the KB or to Out-of-KB. Recall that the Out-of-KB case captures entities that are not in the KB at all. Let X be the output of system X: detected mentions, NER types, NED mappings. Following the ERD 2014 Challenge (Carmel et al., 2014), we define precision and recall of X for end-to-end NERD as:

$$\mathrm{Prec}(X) = \frac{\#\,\text{agreements between } X \text{ and } G}{\#\,\text{mentions in } X}, \qquad \mathrm{Rec}(X) = \frac{\#\,\text{agreements between } X \text{ and } G}{\#\,\text{mentions in } G}$$

where agreement means that X and G overlap in the text spans (i.e., have at least one token in common) for a mention, have the same NER type, and have the same mapping to an entity or Out-of-KB. The F 1 score of X is the harmonic mean of precision and recall.
For evaluating the mention-boundary detection alone, we consider only the overlap of text spans; for evaluating NER completely, we consider both mention overlap and agreement based on the assigned NER types.
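The three evaluation levels (boundary detection, NER, end-to-end NERD) can be sketched as follows. The `Mention` record, the greedy one-to-one matching of system mentions to ground-truth mentions, and the example annotations with their entity identifiers are assumptions for illustration; the ERD 2014 scorer may resolve multiple overlaps differently.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    start: int      # index of first token (inclusive)
    end: int        # index of last token (inclusive)
    ner_type: str   # e.g. "PER", "ORG", "LOC"
    entity: str     # KB identifier or "Out-of-KB"


def spans_overlap(a, b):
    """True if the two mentions share at least one token."""
    return a.start <= b.end and b.start <= a.end


def agrees(x, g, level):
    """Agreement at level 'boundary', 'ner', or 'nerd'."""
    if not spans_overlap(x, g):
        return False
    if level == "boundary":
        return True
    if x.ner_type != g.ner_type:
        return False
    if level == "ner":
        return True
    return x.entity == g.entity


def evaluate(system, gold, level="nerd"):
    """Return (precision, recall, F1); each gold mention is matched at most once."""
    used = set()
    agreements = 0
    for x in system:
        for idx, g in enumerate(gold):
            if idx not in used and agrees(x, g, level):
                used.add(idx)
                agreements += 1
                break
    precision = agreements / len(system) if system else 0.0
    recall = agreements / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy example (hypothetical annotations and entity identifiers).
gold = [
    Mention(0, 0, "PER", "David_Beckham"),
    Mention(3, 3, "ORG", "Manchester_United"),
    Mention(8, 9, "ORG", "LA_Galaxy"),
]
system = [
    Mention(0, 0, "PER", "David_Beckham"),
    Mention(8, 8, "LOC", "Los_Angeles"),
]
for level in ("boundary", "ner", "nerd"):
    print(level, evaluate(system, gold, level))
```

In this example the second system mention overlaps a ground-truth span, so it counts for boundary detection, but its type and entity are wrong, so it does not count at the NER or NERD level.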

Results for CoNLL-YAGO2
Our first experiment on CoNLL-YAGO2 compares the four CRF variants of J-NERD on three tasks: mention boundary detection, NER typing, and end-to-end NERD. Then, the best model of J-NERD is compared against various baselines and a pipelined configuration of our method. Finally, we test the influence of different feature groups.

Experiments on CRF Variants
Table 2 compares the different CRF variants. All CRFs have the same features, but differ in their factors. Therefore, some features are not effective for the linear model and the tree model. For the linear CRF, the parsing-based linguistic features and the cross-sentence features do not contribute; for the tree CRF, the cross-sentence features are not effective. We see that all variants perform very well on boundary detection and NER typing, with small differences only. For end-to-end NERD, however, J-NERD tree-global outperforms all other variants by a large margin. This results in achieving the best F 1 score of 78.7%, which is 2.6% higher than J-NERD linear-global . We performed a paired t-test between these two variants, and obtained a p-value of 0.01. The local variants of J-NERD lose around 4% of F 1 because they do not capture the coherence among mentions in different sentences.
In the rest of our experiments, we focus on J-NERD tree-global and the task of end-to-end NERD.

Comparison of Joint vs. Pipelined Models and Baselines
In this subsection, we demonstrate the benefits of joint models over pipelined models, including state-of-the-art baselines. In addition to the competitors introduced in Section 5.2, we add a pipelined configuration of J-NERD, called P-NERD. That is, we first run J-NERD in NER mode (thus only considering NER features f 1..7 and f 14..16 ). The best sequence of NER labels is then given to J-NERD to run in NED mode (only considering NED features f 8..13 and f 17 ). The results are shown in Table 3. J-NERD achieves the highest precision of 81.9% for end-to-end NERD, outperforming all competitors by a significant margin. This results in achieving the best F 1 score of 78.7%, which is 1.2% higher than P-NERD and 1.4% higher than AIDA-light. Note that Nguyen et al. (2014) reported higher precision for AIDA-light, but that experiment did not consider Out-of-KB entities, which pose an extra difficulty in our setting. Also, the test corpora, CoNLL-YAGO2 vs. CoNLL-YAGO, are not quite comparable (see above).
TagMe and Spotlight are clearly inferior on this dataset (more than 20% lower in F 1 than J-NERD). These systems are more geared towards efficiency and coping with popular and thus frequent entities, whereas the CoNLL-YAGO2 dataset contains very difficult test cases. For the best F 1 score of J-NERD, we performed a paired t-test against the other methods' F 1 values and obtained a p-value of 0.075.
We also compared the NER performance of J-NERD against state-of-the-art methods for NER alone: the Stanford NER Tagger version 3.4.1 and the Illinois Tagger 2.8.4 (Table 4). For mention boundary detection, J-NERD achieved an F 1 score of 93.1% versus 93.4% by Stanford NER, 93.3% by

Influence of Features
To analyze the influence of the features, we performed an additional ablation study on the global J-NERD tree model, which is the best variant of J-NERD, as follows:
• Standard features only includes the features introduced in Section 4.1.
• Standard and domain features excludes the linguistic features f 14 , f 15 , f 16 , f 17 .
• Standard and linguistic features excludes the domain features f 12 and f 13 .
• All features is the full-fledged J-NERD tree-global model.
Table 5 shows the results, demonstrating that linguistic features are crucial for both NER and NERD. For example, in the sentence "Woolmer played 19 tests for England", the mention "England" refers to an organization (the English cricket team), not to a location. The dependency-type feature prep_for[play, England] is a decisive cue to handle such cases properly. Domain features help in NED to eliminate, for example, football teams when the domain is cricket.
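The feature groups used in the ablation study, and the NER/NED split used for P-NERD, can be written down as plain sets. This is a bookkeeping sketch only: the names are hypothetical, and the assumption that the standard features are exactly f 1 to f 11 is inferred from the feature ranges quoted in this section (Section 4.1 itself is not reproduced here).

```python
# All 17 feature templates, named f1..f17 as in the text.
ALL_FEATURES = {f"f{k}" for k in range(1, 18)}

DOMAIN = {"f12", "f13"}
LINGUISTIC = {"f14", "f15", "f16", "f17"}
# Assumption: everything that is neither domain nor linguistic is "standard".
STANDARD = ALL_FEATURES - DOMAIN - LINGUISTIC

# The four ablation configurations.
ABLATION = {
    "standard": STANDARD,
    "standard+domain": STANDARD | DOMAIN,
    "standard+linguistic": STANDARD | LINGUISTIC,
    "all": ALL_FEATURES,
}

# Feature split for the pipelined P-NERD configuration:
# NER mode uses f1..f7 and f14..f16, NED mode uses f8..f13 and f17.
NER_FEATURES = ({f"f{k}" for k in range(1, 8)}
                | {f"f{k}" for k in range(14, 17)})
NED_FEATURES = {f"f{k}" for k in range(8, 14)} | {"f17"}

for name, features in ABLATION.items():
    print(name, len(features))
```

The NER and NED sets are disjoint and together cover all 17 templates, which is what allows the pipelined configuration to reuse the same feature inventory in two separate passes.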

End-to-End NERD on ACE
For comparison to the recently developed Berkeley-entity system (Durrett and Klein, 2014), the authors of that system provided us with detailed results for the entity-annotated ACE'2005 corpus, which allowed us to discount non-entity (so-called "NOM-type") mappings (see Subsection 5.1). All other systems, including the best J-NERD method, were run on the corpus under the same conditions. J-NERD outperforms P-NERD and Berkeley-entity: F 1 scores are 1.3% and 1.8% better, respectively, with a t-test p-value of 0.05 (Table 6). Following these three best-performing systems, AIDA-light also achieves decent results. The other systems show substantially inferior performance.
The performance gains that J-NERD achieves over Berkeley-entity can be attributed to two factors. First, the rich linguistic features of J-NERD help to correctly cope with more of the difficult cases, e.g., when common nouns are actually names of people. Second, the coherence features of global J-NERD help to properly couple decisions on related entity mentions.

End-to-End NERD on ClueWeb
The results for ClueWeb are shown in Table 7. Again, J-NERD outperforms all other systems with a t-test p-value of 0.05. The differences between J-NERD and fast NED systems such as TagMe or Spotlight become smaller because the number of prominent entities (i.e., prominent people, organizations and locations) is higher on ClueWeb than on CoNLL-YAGO2.

Conclusions
We have shown that coupling the tasks of NER and NED in a joint CRF-like model is beneficial. Our J-NERD method outperforms strong baselines on a variety of test datasets. The strength of J-NERD comes from three novel assets. First, our tree-shaped models capture the structure of dependency parse trees, and we couple multiple such tree models across sentences. Second, we harness non-standard features about domains and novel features based on linguistic patterns derived from parsing. Third, our joint inference maintains uncertain candidates for both mentions and entities and makes decisions as late as possible. In our future work, we plan to explore more use cases for joint NERD, especially for content analytics over news streams and social media.