A Joint Model for Entity Analysis: Coreference, Typing, and Linking

We present a joint model of three core tasks in the entity analysis stack: coreference resolution (within-document clustering), named entity recognition (coarse semantic typing), and entity linking (matching to Wikipedia entities). Our model is formally a structured conditional random field. Unary factors encode local features from strong baselines for each task. We then add binary and ternary factors to capture cross-task interactions, such as the constraint that coreferent mentions have the same semantic type. On the ACE 2005 and OntoNotes datasets, we achieve state-of-the-art results for all three tasks. Moreover, joint modeling improves performance on each task over strong independent baselines.


Introduction
How do we characterize the collection of entities present in a document? Two broad threads exist in the literature. The first is coreference resolution (Soon et al., 2001; Ng, 2010; Pradhan et al., 2011), which identifies clusters of mentions in a document referring to the same entity. This process gives us access to useful information about the referents of pronouns and nominal expressions, but because clusters are local to each document, it is often hard to situate document entities in a broader context. A separate line of work has considered the problem of entity linking or "Wikification" (Cucerzan, 2007; Milne and Witten, 2008; Ji and Grishman, 2011), where mentions are linked to entries in a given knowledge base. This is useful for grounding proper entities, but in the absence of coreference gives an incomplete picture of document content itself, in that nominal expressions and pronouns are left unresolved.
1 System available at http://nlp.cs.berkeley.edu
In this paper, we describe a joint model of coreference, entity linking, and semantic typing (named entity recognition) using a structured conditional random field. Variables in the model capture decisions about antecedence, semantic type, and entity links for each mention. Unary factors on these variables incorporate features that are commonly employed when solving each task in isolation. Binary and higher-order factors capture interactions between pairs of tasks. For entity linking and NER, factors capture a mapping between NER's semantic types and Wikipedia's semantics as described by infoboxes, categories, and article text. Coreference interacts with the other tasks in a more complex way, via factors that softly encourage consistency of semantic types and entity links across coreference arcs, similar to the method of . Figure 1 shows an example of the effects such factors can capture. The non-locality of coreference factors makes exact inference intractable, but we find that belief propagation is a suitable approximation technique and performs well.
Our joint modeling of these three tasks is motivated by their heavy interdependencies, which have been noted in previous work (discussed more in Section 7). Entity linking has been employed for coreference resolution (Ponzetto and Strube, 2006; Rahman and Ng, 2011; Ratinov and Roth, 2012) and coreference for entity linking (Cheng and Roth, 2013). Past work has shown that tighter integration of coreference and entity linking is promising (Hajishirzi et al., 2013); we extend these approaches and model the entire process more holistically. Named entity recognition is improved by simple coreference (Finkel et al., 2005; Ratinov and Roth, 2009) and knowledge from Wikipedia (Kazama and Torisawa, 2007; Ratinov and Roth, 2009; Sil and Yates, 2013). Joint models of coreference and NER have been proposed in Haghighi and Klein (2010) and , but in neither case was supervised data used for both tasks. Technically, our model is most closely related to that of , who handle coreference, named entity recognition, and relation extraction. Our system is novel in three ways: the choice of tasks to model jointly, the fact that we maintain uncertainty about all decisions throughout inference (rather than using a greedy approach), and the feature sets we deploy for cross-task interactions.
In designing a joint model, we would like to preserve the modularity, efficiency, and structural simplicity of pipelined approaches. Our model's feature-based structure permits improvement of features specific to a particular task or to a pair of tasks. By pruning variable domains with a coarse model and using approximate inference via belief propagation, we maintain efficiency and our model is only a factor of two slower than the union of the individual models. Finally, as a structured CRF, it is conceptually no more complex than its component models and its behavior can be understood using the same intuition. We apply our model to two datasets, ACE 2005 and OntoNotes, with different mention standards and layers of annotation. In both settings, our joint model outperforms our independent baseline models. On ACE, we achieve state-of-the-art entity linking results, matching the performance of the system of Fahrni and Strube (2014). On OntoNotes, we match the performance of the best published coreference system (Björkelund and Kuhn, 2014) and outperform two strong NER systems (Ratinov and Roth, 2009;Passos et al., 2014).

Motivating Examples
We first present two examples to motivate our approach. Figure 1 shows an example of a case where coreference is beneficial for named entity recognition and entity linking. The company is clearly coreferent to Dell by virtue of the lack of other possible antecedents; this in turn indicates that Dell refers to the corporation rather than to Michael Dell. This effect can be captured for entity linking by a feature tying the lexical item company to the fact that COMPANY is in the Wikipedia infobox for Dell, thereby helping the linker make the correct decision. This would also be important for recovering the fact that the mention the company links to Dell; however, in the version of the task we consider, a mention like the company actually links to the Wikipedia article for Company.
Figure 2 shows a different example, one where the coreference is now ambiguous but entity linking is transparent. In this case, an NER system based on surface statistics alone would likely predict that Freddie Mac is a PERSON. However, the Wikipedia article for Freddie Mac is unambiguous, which allows us to fix this error. The pronoun his can then be correctly resolved.
These examples justify why these tasks should be handled jointly: there is no obvious pipeline order for a system designer who cares about the perfor-

Model
Our model is a structured conditional random field (Lafferty et al., 2001). The input (conditioning context) is the text of a document, automatic parses, and a set of pre-extracted mentions (spans of text).
Mentions are allowed to overlap or nest: our model makes no structural assumptions here, and in fact we will show results on datasets with two different mention annotation standards (see Section 6.1 and Section 6.3). Figure 3 shows the random variables in our model. We are trying to predict three distinct types of annotation, so we naturally have one variable per annotation type per mention (of which there are n):
• Coreference variables $a = (a_1, \ldots, a_n)$, which indicate antecedents: $a_i \in \{1, \ldots, i-1, \text{NEW}\}$, indicating that the mention refers to some previous mention or that it begins a new cluster.
• Named entity type variables $t = (t_1, \ldots, t_n)$, which take values in a fixed inventory of semantic types.
• Entity link variables $e = (e_1, \ldots, e_n)$, which take values in the set of all Wikipedia titles.
In addition, we have variables $q = (q_1, \ldots, q_n)$ which represent queries to Wikipedia. These are explained further in Section 3; for now, it suffices to remark that they are unobserved during both training and testing. We place a log-linear probability distribution over these variables as follows:
$$p(a, t, e, q \mid x; \theta) \propto \exp\left(\theta^\top f(a, t, e, q, x)\right)$$
where θ is a weight vector, f is a feature function, and x indicates the document text, automatic parses, and mention boundaries.
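As a minimal sketch of the antecedent domains and the log-linear form just described, the following Python enumerates the coreference choices for a toy two-mention document and normalizes exponentiated scores. The feature name `link_2_to_1`, its weight, and all helper names are illustrative assumptions rather than part of the system; the actual model also scores t, e, and q and never enumerates assignments explicitly.

```python
import math
from itertools import product

NEW = "NEW"  # value of a_i when mention i starts a new cluster


def antecedent_domain(i):
    """Domain of a_i for the i-th mention (1-indexed): {1, ..., i-1, NEW}."""
    return list(range(1, i)) + [NEW]


def log_linear(theta, feature_fn, assignments):
    """Return p(y) proportional to exp(theta . f(y)) over the listed assignments."""
    scores = [
        sum(theta.get(name, 0.0) * value for name, value in feature_fn(y).items())
        for y in assignments
    ]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


def toy_features(y):
    """Hypothetical feature: fires when mention 2 picks mention 1 as antecedent."""
    return {"link_2_to_1": 1.0} if y[1] == 1 else {}


# All joint settings of (a_1, a_2) for a two-mention document.
assignments = list(product(antecedent_domain(1), antecedent_domain(2)))
probs = log_linear({"link_2_to_1": math.log(3.0)}, toy_features, assignments)
```

Exhaustive enumeration is only feasible for toy inputs like this one, which is why the full model relies on factor-graph inference instead.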
We represent the features in this model with standard factor graph notation; features over a particular set of output variables (and x) are associated with factors connected to those variables. Figure 3 shows the task-specific factors in the model, discussed next in Section 3.1. Higher-order factors coupling variables between tasks are discussed in Section 3.2.
Figure 3 shows a version of the model with only task-specific factors. Though this framework is structurally simple, it is nevertheless powerful enough for us to implement high-performing models for each task. State-of-the-art approaches to coreference and entity linking (Ratinov et al., 2011) already have this independent structure and Ratinov and Roth (2009) note that it is a reasonable assumption to make for NER as well. 6 In this section, we describe the features present in the task-specific factors of each type (which also serve as our three separate baseline systems).

Coreference
Our modeling of the coreference output space (as antecedents chosen for each mention) follows the mention-ranking approach to coreference (Denis and Baldridge, 2008). Our feature set is that of Durrett and Klein, targeting surface properties of mentions: for each mention, we examine the first word, head word, last word, context words, the mention's length, and whether the mention is nominal, proper or pronominal. Anaphoricity features examine each of these properties in turn; coreference features conjoin various properties between mention pairs and also use properties of the mention pair itself, such as the distance between the mentions and whether their heads match. Note that this baseline does not rely on having access to named entity chunks.

Named Entity Recognition
Our NER model places a distribution over possible semantic types for each mention, which corresponds to a fixed span of the input text. We define the features of a span to be the concatenation of standard NER surface features associated with each token in that chunk. We use surface token features similar to those from previous work (Zhang and Johnson, 2003; Ratinov and Roth, 2009; Passos et al., 2014): for tokens at offsets of {−2, −1, 0, 1, 2} from the current token, we fire features based on 1) word identity, 2) POS tag, 3) word class (based on capitalization, presence of numbers, suffixes, etc.), 4) word shape (based on the pattern of uppercase and lowercase letters, digits, and punctuation), 5) Brown cluster prefixes of length 4, 6, 10, 20 using the clusters from Koo et al. (2008), and 6) common bigrams of word shape and word identity.
6 Pairwise potentials in sequence-based NER are useful for producing coherent output (e.g. prohibiting configurations like O I-PER), but since we have so far defined the task as operating over fixed mentions, this structural constraint does not come into play for our system.
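To make the span featurization concrete, here is a small Python sketch covering three of the six feature families listed above (word identity, POS tag, and word shape) over the {−2, ..., 2} token window. The feature string formats, the `<PAD>` symbol, and the function names are assumptions made for illustration; word class, Brown cluster, and bigram features are omitted because they depend on external resources.

```python
import re


def word_shape(token):
    """Collapse characters into classes: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def token_window_features(tokens, pos_tags, index, offsets=(-2, -1, 0, 1, 2)):
    """Word identity, POS tag, and word shape for tokens around position `index`."""
    feats = []
    for off in offsets:
        j = index + off
        inside = 0 <= j < len(tokens)
        word = tokens[j] if inside else "<PAD>"
        tag = pos_tags[j] if inside else "<PAD>"
        feats.append(f"word[{off}]={word.lower()}")
        feats.append(f"pos[{off}]={tag}")
        feats.append(f"shape[{off}]={word_shape(word)}")
    return feats


def span_features(tokens, pos_tags, start, end):
    """Concatenate the window features of every token in tokens[start:end]."""
    feats = []
    for k in range(start, end):
        for feat in token_window_features(tokens, pos_tags, k):
            feats.append(f"tok{k - start}:{feat}")
    return feats


tokens = ["Freddie", "Mac", "said", "in", "2005"]
pos_tags = ["NNP", "NNP", "VBD", "IN", "CD"]
mention_feats = span_features(tokens, pos_tags, 0, 2)  # features for "Freddie Mac"
```

In a mention-level classifier each of these strings would be conjoined with a candidate semantic type before being looked up in the weight vector.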

Entity Linking
Our entity linking system diverges more substantially from past work than the coreference or NER systems. Most entity linking systems operate in two distinct phases (Cucerzan, 2007;Milne and Witten, 2008;Dredze et al., 2010;Ratinov et al., 2011). First, in the candidate generation phase, a system generates a ranked set of possible candidates for a given mention by querying Wikipedia. The standard approach for doing so is to collect all hyperlinks in Wikipedia and associate each hyperlinked span of text (e.g. Michael Jordan) with a distribution over titles of Wikipedia articles it is observed to link to (Michael Jordan, Michael I. Jordan, etc.). Second, in the disambiguation phase, a learned model selects the correct candidate from the set of possibilities.
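The standard candidate generation step described above can be sketched in a few lines of Python. The hyperlink pairs below are invented toy data, and `build_anchor_index` and `candidates` are hypothetical helper names; a real index would be built from a full Wikipedia dump.

```python
from collections import Counter, defaultdict


def build_anchor_index(hyperlinks):
    """Map each hyperlinked span of text to counts of the titles it links to."""
    index = defaultdict(Counter)
    for anchor_text, title in hyperlinks:
        index[anchor_text][title] += 1
    return index


def candidates(index, query):
    """Return (title, relative frequency) pairs, most frequent first."""
    counts = index.get(query)
    if not counts:
        return []
    total = sum(counts.values())
    return [(title, count / total) for title, count in counts.most_common()]


# Toy hyperlink data (anchor text, target title); the counts are made up.
links = (
    [("Michael Jordan", "Michael Jordan")] * 3
    + [("Michael Jordan", "Michael I. Jordan")]
)
index = build_anchor_index(links)
ranked = candidates(index, "Michael Jordan")
```

A query that was never hyperlinked returns an empty list, which is the situation that motivates choosing the query string carefully.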
As noted by Hachey et al. (2013) and Guo et al. (2013), candidate generation is often overlooked and yet accounts for large gaps in performance between different systems. It is not always clear how to best turn the text of a mention into a query for our set of hyperlinks. For example, the phrase Chief Executive Michael Dell has never been hyperlinked on Wikipedia. If we query the substring Michael Dell, the highest-ranked title is correct; however, querying the substring Dell returns the article on the company.
Our model for entity linking therefore includes both predictions of final Wikipedia titles $e_i$ as well as latent query variables $q_i$ that model the choice of query. Given a mention, possible queries are all prefixes of the mention containing the head with optional truecasing or lemmatization applied. Unary factors on the $q_i$ model the appropriateness of a query based on surface text of the mention, investigating the following properties: whether the mention is proper or nominal, whether the query employed truecasing or lemmatization, the query's length, the POS tag sequence within the query and the tag immediately preceding it, and whether the query is the longest query to yield a nonempty set of candidates for the mention. This part of the model can learn, for example, that queries based on lemmatized proper names are bad, whereas queries based on lemmatized common nouns are good.
Figure 4: Factors that tie predictions between variables across tasks. Joint NER and entity linking factors (Section 3.2.1) tie semantic information from Wikipedia articles to semantic type predictions. Joint coreference and NER factors (Section 3.2.2) couple type decisions between mentions, encouraging consistent type assignments within an entity. Joint coreference and entity linking factors (Section 3.2.3) encourage relatedness between articles linked from coreferent mentions.
Our set of candidate links for a mention is the set of all titles produced by some query. The binary factors connecting $q_i$ and $e_i$ then decide which title a given query should yield. These include: the rank of the article title among all possible titles returned by that query (sorted by relative frequency count), whether the title is a close string match of the query, and whether the title matches the query up to a parenthetical (e.g. Paul Allen and Paul Allen (editor)).
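The query-title features listed above might be computed roughly as follows. This is a sketch under stated assumptions: "close string match" is approximated by a case-insensitive exact match, and the function names and returned dictionary keys are hypothetical.

```python
import re


def strip_parenthetical(title):
    """Drop a trailing parenthetical, e.g. 'Paul Allen (editor)' -> 'Paul Allen'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", title)


def query_title_features(query, ranked_titles, title):
    """Features for a (query, title) pair; ranked_titles is sorted by frequency."""
    q = query.lower()
    exact = title.lower() == q
    return {
        # Position of the title among the titles returned for this query.
        "rank": ranked_titles.index(title),
        # Stand-in for "close string match" (an assumption of this sketch).
        "exact_match": exact,
        # Title equals the query once a trailing parenthetical is removed.
        "matches_up_to_parenthetical": (
            not exact and strip_parenthetical(title).lower() == q
        ),
    }


titles = ["Paul Allen", "Paul Allen (editor)"]
feats_plain = query_title_features("Paul Allen", titles, "Paul Allen")
feats_paren = query_title_features("Paul Allen", titles, "Paul Allen (editor)")
```

In a log-linear model these values would typically be turned into indicator features (for example by bucketing the rank) before being weighted.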
We could also at this point add factors between pairs of variables $(e_i, e_j)$ to capture coherence between choices of linked entities. Integration with the rest of the model, learning, and inference would remain unchanged. However, while such features have been employed in past entity linking systems (Ratinov et al., 2011; Hoffart et al., 2011), Ratinov et al. found them to be of limited utility, so we omit them from the present work.

Cross-task Interaction Factors
We now add factors that tie the predictions of multiple output variables in a feature-based way. Figure 4 shows the general structure of these factors. Each couples variables from one pair of tasks.

Entity Linking and NER
We want to exploit the semantic information in Wikipedia for better semantic typing of mentions. We also want to use semantic types to disambiguate tricky Wikipedia links. We use three sources of semantics from Wikipedia (Kazama and Torisawa, 2007):
• Categories (e.g. American financiers); used by Ponzetto and Strube (2006), Kazama and Torisawa (2007), and Ratinov and Roth (2012)
• Infobox type (e.g. Person, Company)
• Copula in the first sentence (is a British politician); used for coreference previously in Haghighi and Klein (2009)
We fire features that conjoin the information from the selected Wikipedia article with the selected NER type. Because these types of information from Wikipedia are of a moderate granularity, we should be able to learn a mapping between them and NER types and exploit Wikipedia as a soft gazetteer.
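A minimal Python sketch of such conjoined features is shown here. The article record, its field names (`categories`, `infobox`, `copula`), and the feature string format are assumptions made for illustration, not the system's actual data model.

```python
def link_ner_features(article, ner_type):
    """Conjoin each piece of Wikipedia semantics with a candidate NER type."""
    feats = []
    for category in article.get("categories", []):
        feats.append(f"category={category}&type={ner_type}")
    if article.get("infobox"):
        feats.append(f"infobox={article['infobox']}&type={ner_type}")
    if article.get("copula"):
        feats.append(f"copula={article['copula']}&type={ner_type}")
    return feats


# Hypothetical record for one linked article.
article = {
    "categories": ["American financiers"],
    "infobox": "Person",
    "copula": "businessman",
}
feats_person = link_ner_features(article, "PERSON")
feats_org = link_ner_features(article, "ORGANIZATION")
```

Because every feature names both the Wikipedia attribute and the type, learning can assign a high weight to pairs such as `infobox=Person&type=PERSON` and a low weight to mismatched pairs, which is the soft gazetteer behavior described above.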

Coreference and NER
Coreference can improve NER by ensuring consistent semantic type predictions across coreferent mentions; likewise, NER can help coreference by encouraging the system to link up mentions of the same type. The factors we implement for these purposes closely resemble the factors employed for latent semantic clusters in . That structure is as follows:
$$\log F_{i-j}(a_i, t_i, t_j) = \begin{cases} 0 & \text{if } a_i \neq j \\ \theta^\top f(i, j, t_i, t_j) & \text{otherwise} \end{cases}$$
That is, the features between the type variables for mentions i and j do not come into play unless i and j are coreferent. Note that there are quadratically many such factors in the graph (before pruning; see Section 5), one for each ordered pair of mentions (j, i) with j < i. When scoring a particular configuration of variables, only a small subset of the factors is active, but during inference when we marginalize over all settings of variables, each of the factors comes into play for some configuration.
This model structure allows us to maintain uncertainty about coreference decisions but still propagate information along coreference arcs in a soft way. Given this factor definition, we define features that should fire over coreferent pairs of entity types. Our features target:
• The pair of semantic types for the current and antecedent mention
• The semantic type of the current mention and the head of the antecedent mention, and the type of the antecedent and head of the current
We found such monolexical features to improve over just type pairs while not suffering from the sparsity problems of bilexical features.
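The gating behavior of these ternary factors, together with the type-pair and monolexical features, can be illustrated with the short Python sketch below. The feature string formats, the weight dictionary, and the function names are hypothetical; the sketch only shows that the factor is neutral unless mention i selects mention j as its antecedent.

```python
def coref_ner_features(type_cur, head_cur, type_ant, head_ant):
    """Type-pair feature plus the two monolexical (type, head) features."""
    return [
        f"types={type_cur}|{type_ant}",
        f"type_cur={type_cur}&head_ant={head_ant.lower()}",
        f"type_ant={type_ant}&head_cur={head_cur.lower()}",
    ]


def coref_ner_log_factor(a_i, j, features, theta):
    """Log-value of the factor for the pair (j, i): zero unless a_i == j."""
    if a_i != j:
        return 0.0
    return sum(theta.get(feat, 0.0) for feat in features)


# Toy weights (assumed values): reward ORG/ORG pairs and "company" heads.
theta = {"types=ORG|ORG": 1.5, "type_ant=ORG&head_cur=company": 0.5}
feats = coref_ner_features("ORG", "company", "ORG", "Dell")
active = coref_ner_log_factor(a_i=1, j=1, features=feats, theta=theta)
inactive = coref_ner_log_factor(a_i="NEW", j=1, features=feats, theta=theta)
```

Because the inactive case contributes a log-value of zero, a factor for every ordered pair can be present in the graph while only the pairs on selected coreference arcs affect the score of a given configuration.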

Coreference and Entity Linking
As we said in Section 2, coreferent mentions can actually have different entity links (e.g. Dell and Company), so encouraging equality alone is less effective for entity linking than it is for NER. Our factors have the same structure as those for coreference-NER, but features now target overall semantic relatedness of Wikipedia articles using the structure of Wikipedia by computing whether the articles have the same title, share any out links, or link to each other. More complex relatedness schemes such as those described in Ratinov et al. (2011) can be implemented in this framework. Nevertheless, these basic features still promise to help identify related articles as well as name variations by exploiting the abundance of entity mentions on Wikipedia.
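The three relatedness indicators mentioned above could be computed along these lines. The outlink graph is invented toy data, the names are hypothetical, and "link to each other" is interpreted here as a link in either direction, which is an assumption of this sketch.

```python
def relatedness_features(title_a, title_b, outlinks):
    """Indicators of relatedness between two Wikipedia articles.

    outlinks maps a title to the set of titles it links to.
    """
    links_a = outlinks.get(title_a, set())
    links_b = outlinks.get(title_b, set())
    return {
        "same_title": title_a == title_b,
        "share_outlink": bool(links_a & links_b),
        # Assumed interpretation: a link in either direction counts.
        "link_to_each_other": title_b in links_a or title_a in links_b,
    }


# Toy outlink graph (made up for illustration).
outlinks = {
    "Dell": {"Michael Dell", "Texas", "Personal computer"},
    "Michael Dell": {"Dell", "Texas"},
    "Company": {"Corporation"},
}
dell_founder = relatedness_features("Dell", "Michael Dell", outlinks)
dell_company = relatedness_features("Dell", "Company", outlinks)
```

In this toy graph the pair Dell and Company fires none of the indicators, mirroring the difficulty of relating a specific entity to a generic article.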

Learning
Our training data consists of d documents, where a given document consists of a tuple $(x, C^*, t^*, e^*)$. Gold-standard labels for types ($t^*$) and entity links ($e^*$) are provided directly, while supervision for coreference is provided in the form of a clustering $C^*$. Regardless, we can simply marginalize over the uncertainty about $a^*$ and form the conditional log-likelihood of the training labels as follows:
$$\mathcal{L}(\theta) = \sum_{i=1}^{d} \log \sum_{a^* \in A(C_i^*)} p(a^*, t_i^*, e_i^* \mid x_i; \theta)$$
where $A(C^*)$ is the set of antecedent structures consistent with the gold annotation (and the unobserved queries q are summed out of p): the first mention in a cluster must pick the NEW label and subsequent mentions must pick an antecedent from the set of those preceding them in the cluster. This marginalization over latent structure has been employed in prior work as well (Fernandes et al., 2012). We adapt this objective to exploit parameterized loss functions for each task by modifying the distribution as follows:
$$p'(a, t, e \mid x; \theta) \propto p(a, t, e \mid x; \theta) \exp\left[\alpha_c \ell_c(a, C^*) + \alpha_t \ell_t(t, t^*) + \alpha_e \ell_e(e, e^*)\right]$$
where $\ell_c$, $\ell_t$, and $\ell_e$ are task-specific loss functions with weight parameters α. This technique, softmax-margin, allows us to shape the distribution learned by the model and encourage the model to move probability mass away from outputs that are bad according to our loss functions (Gimpel and Smith, 2010). As in , we take $\alpha_c = 1$ and use $\ell_c$ as defined there, penalizing the model by $\alpha_{c,FA} = 0.1$ for linking up a mention that should have been nonanaphoric, by $\alpha_{c,FN} = 3$ for calling nonanaphoric a mention that should have an antecedent, and by $\alpha_{c,WL} = 1$ for picking the wrong antecedent for an anaphoric mention. $\ell_t$ and $\ell_e$ are simply Hamming distance, with $\alpha_t = 3$ and $\alpha_e = 0$ for all experiments. We found that the outcome of learning was not particularly sensitive to these parameters. We optimize our objective using AdaGrad (Duchi et al., 2011) with $L_1$ regularization and λ = 0.001. Our final objective is
$$\mathcal{L}(\theta) = \sum_{i=1}^{d} \log \sum_{a^* \in A(C_i^*)} p'(a^*, t_i^*, e_i^* \mid x_i; \theta) - \lambda \lVert \theta \rVert_1$$
This objective is nonconvex, but in practice we have found that it is very stable.
One reason is that for any mention that has fewer than two antecedents in its cluster, all elements of $A(C^*)$ only contain one possibility for that mention, and even for mentions with ambiguity, the parameters that the model ends up learning tend to place almost all of the probability mass consistently on one antecedent.
(Pradhan et al., 2014). We report accuracy on NER because the set of mentions is fixed and all mentions have named entity types. Coreference and NER are compared to prior work in a more standard setting in Section 6.3. Finally, we also report accuracy of our entity linker (including links to NIL); entity linking is analyzed more thoroughly in Table 2. Bolded values represent statistically significant improvements with p < 0.05 according to a bootstrap resampling test.
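The set of gold-consistent antecedents and the weighted coreference loss from the Learning section can be sketched in Python as follows. Mentions are numbered from 1 in document order, the function names are hypothetical, and treating mentions absent from every gold cluster as singletons is an assumption of this sketch; the penalty values are the ones quoted in the text.

```python
NEW = "NEW"


def consistent_antecedents(clusters, n):
    """Per-mention sets of antecedent choices consistent with a gold clustering."""
    allowed = {}
    for cluster in clusters:
        members = sorted(cluster)
        for position, mention in enumerate(members):
            # First mention of a cluster must be NEW; later ones may pick
            # any earlier member of the same cluster.
            allowed[mention] = {NEW} if position == 0 else set(members[:position])
    for mention in range(1, n + 1):
        allowed.setdefault(mention, {NEW})  # assumption: unlisted = singleton
    return allowed


def coref_loss(predicted, allowed, fa=0.1, fn=3.0, wl=1.0):
    """Sum of false-anaphor, false-new, and wrong-link penalties."""
    loss = 0.0
    for mention, antecedent in predicted.items():
        gold = allowed[mention]
        if antecedent in gold:
            continue
        if gold == {NEW}:
            loss += fa  # linked a mention that should be nonanaphoric
        elif antecedent == NEW:
            loss += fn  # called an anaphoric mention nonanaphoric
        else:
            loss += wl  # picked the wrong antecedent
    return loss


allowed = consistent_antecedents([{1, 3, 4}, {2}], n=4)
loss = coref_loss({1: NEW, 2: 1, 3: NEW, 4: 2}, allowed)
```

In this example only mention 4 has more than one consistent antecedent, which illustrates why the marginalization over the antecedent structures is often trivial.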

Inference
For both learning and decoding, inference consists of computing marginals for individual variables or for sets of variables adjacent to a factor. Exact inference is intractable due to our factor graph's loopiness; however, we can still perform efficient inference using belief propagation, which has been successfully employed for a similar model as well as for other NLP tasks (Smith and Eisner, 2008; Burkett and Klein, 2012). Marginals typically converge in 3-5 iterations of belief propagation; we use 5 iterations for all experiments.
However, belief propagation would still be quite computationally expensive if run on the full factor graph as described in Section 3. In particular, the factors in Section 3.2.2 and Section 3.2.3 are costly to sum over due to their ternary structure and the fact that there are quadratically many of them in the number of mentions. The solution to this is to prune the domains of the coreference variables using a coarse model consisting of the coreference factors trained in isolation. Given marginals $p_0(a_i \mid x)$, we prune values $a_i$ such that $\log p_0(a_i \mid x) < \log p_0(a_i^* \mid x) - k$ for a threshold parameter k, which we set to 5 for our experiments; this is sufficient to prune over 90% of possible coreference arcs while leaving at least one possible gold link for 98% of mentions. With this optimization, our full joint model could be trained for 20 iterations on the ACE 2005 corpus in around an hour.
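The pruning rule can be sketched in Python as follows, assuming that $a_i^*$ denotes the highest-scoring antecedent under the coarse model (the text does not define it explicitly). The marginal values are toy numbers and `prune_antecedents` is a hypothetical name.

```python
import math


def prune_antecedents(marginals, k=5.0):
    """Keep values whose log-marginal is within k of the best value.

    marginals maps each candidate value of a_i to p_0(a_i | x).
    """
    log_probs = {a: math.log(p) for a, p in marginals.items() if p > 0.0}
    best = max(log_probs.values())  # assumed meaning of log p_0(a_i^* | x)
    return {a for a, lp in log_probs.items() if lp >= best - k}


# Toy coarse marginals for one mention with two earlier mentions.
coarse = {1: 0.9, 2: 0.099, "NEW": 0.001}
kept = prune_antecedents(coarse, k=5.0)
```

With k = 5 a value survives only if its coarse probability is at least exp(−5), roughly 0.7%, of the best value's probability, so the NEW option is pruned in this toy case.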
We use minimum Bayes risk (MBR) decoding, where we compute marginals for each variable under the full model and independently return the most likely setting of each variable. Note that for coreference, this implies that we produce the MBR antecedent structure rather than the MBR clustering; the latter is much more computationally difficult to find and would be largely the same, since the posterior distributions of the $a_i$ are quite peaked.
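A short Python sketch of this decoding step is given here: each variable independently takes its most probable value, and the resulting antecedent structure is then converted to clusters. The variable naming scheme and the marginal values are assumptions for illustration.

```python
def mbr_decode(marginals):
    """Pick the most likely value of each variable from its own marginal."""
    return {var: max(dist, key=dist.get) for var, dist in marginals.items()}


def antecedents_to_clusters(antecedents):
    """Turn an antecedent structure (mention -> earlier mention or 'NEW') into clusters."""
    cluster_of = {}
    clusters = []
    for mention in sorted(antecedents):
        antecedent = antecedents[mention]
        if antecedent == "NEW":
            cluster_of[mention] = len(clusters)
            clusters.append([mention])
        else:
            cluster_of[mention] = cluster_of[antecedent]
            clusters[cluster_of[antecedent]].append(mention)
    return clusters


# Toy marginals for the second mention of a document (values are made up).
decoded = mbr_decode({
    "a_2": {1: 0.7, "NEW": 0.3},
    "t_2": {"ORG": 0.6, "PER": 0.4},
})
clusters = antecedents_to_clusters({1: "NEW", 2: "NEW", 3: 1, 4: 3})
```

The clusters are the transitive closure of the chosen antecedent arcs, which is why decoding the antecedent structure is much cheaper than searching over clusterings directly.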

Experiments
We present results on two corpora. First, we use the ACE 2005 corpus (NIST, 2005): this corpus annotates mentions complete with coreference, semantic types (per mention), and entity links (also per mention) later added by Bentivogli et al. (2010). We evaluate on gold mentions in this setting for comparability with prior work on entity linking; we lift this restriction in Section 6.3. Second, we evaluate on the OntoNotes 5 corpus (Hovy et al., 2006) as used in the CoNLL 2012 coreference shared task (Pradhan et al., 2012). This corpus does not contain gold-standard entity links, so we cannot evaluate this portion of our model, though the model still exploits the information from Wikipedia to make coreference and named entity decisions. We will compare to prior coreference and named entity work in the system mentions setting.

ACE Evaluation
We tokenize and sentence-split the ACE dataset using the tools bundled with Reconcile (Stoyanov et al., 2010) and parse it using the Berkeley Parser (Petrov et al., 2006). We use the train/test split from Stoyanov et al. (2009), Haghighi and Klein (2010), and Bansal and Klein (2012). Coreference is evaluated with MUC (Vilain et al., 1995), $B^3$ (Bagga and Baldwin, 1998), and $\text{CEAF}_e$ (Luo, 2005), as well as their average, the CoNLL metric, all computed from the reference implementation of the CoNLL scorer (Pradhan et al., 2014). We see that the joint model improves all three tasks compared to the individual task models in the baseline.
More in-depth entity linking results are shown in Table 2. We evaluate both on overall accuracy (how many mentions are correctly linked) and on two more specific criteria: precision/recall/$F_1$ of non-NIL 9 predictions, and precision/recall/$F_1$ of NIL predictions. This latter measure may be important if a system designer is trying to identify new entities in a document. We compare to the results of the best model from Fahrni and Strube (2014), which is a sophisticated discriminative model incorporating a latent model of mention scope. 10 Our performance is similar to that of Fahrni and Strube (2014), though the results are not exactly comparable for two reasons. First, our models are trained on different datasets: Fahrni and Strube (2014) train on Wikipedia data whereas we train on the ACE training set. Second, they make use of the annotated head spans in ACE whereas we only use detected heads based on automatic parses. Note that this information is particularly beneficial for locating the right query because "heads" may be multiword expressions such as West Bank as part of the phrase southern West Bank.
9 NIL is a placeholder for mentions which do not link to an article in Wikipedia.
10 On the TAC datasets, this FAHRNI model substantially outperforms Ratinov et al. (2011) and has comparable performance to Cheng and Roth (2013).
Table 3: Results of model ablations on the ACE development set. We hold out each type of factor in turn from the JOINT model and add each in turn over the INDEP. model. We evaluate the coreference performance using the CoNLL metric, NER accuracy, and entity linking accuracy.

Model Ablations
To evaluate the importance of the different parts of the model, we perform a series of ablations on the model interaction factors. Table 3 shows the results of adding each interaction factor in turn to the baseline and removing each of the three interaction factors from the full joint model (see Figure 4).
Link-NER interactions. These joint factors are the strongest out of any considered here and give large improvements to entity linking and NER. Their utility is unsurprising: effectively, they give NER access to a gazetteer that it did not have in the baseline model. Moreover, our relatively rich featurization of the semantic information on Wikipedia allows the model to make effective use of it.
Coref-NER interactions. These are moderately beneficial to both coreference and NER. Having reliable semantic types allows the coreference system to be bolder about linking up mention pairs that do not exhibit direct head matches. Part of this is due to our use of monolexical features, which are fine-grained enough to avoid the problems with coarse semantic type matching but still effectively learnable.
Coref-Link interactions. These are the least useful of any of the major factors, providing only a small benefit to coreference. This is likely a result of the ACE entity linking annotation standard: a mention like the company is not linked to the specific company it refers to, but instead the Wikipedia article Company. Determining the relatedness of Company to an article like Dell is surprisingly difficult: many related articles share almost no outlinks and may not explicitly link to one another. Further feature engineering could likely improve the utility of these factors.
The last line of Table 3 shows the results of an experiment where the entity links were not observed during training, i.e. they were left latent. Unsurprisingly, the system is not good at entity linking; however, the model is still able to do as well or even slightly better on coreference and named entity recognition. A possible explanation for this is that even the wrong Wikipedia link can in many cases provide correct semantic information: for example, not knowing which Donald Layton is being referred to is irrelevant for the question of determining that he is a PERSON and may also have little impact on coreference performance. This result indicates that the joint modeling approach is not necessarily dependent on having all tasks annotated. The model can make use of cross-task information even when that information comes via latent variables.

OntoNotes Evaluation
The second part of our evaluation uses the datasets from the CoNLL 2012 Shared Task (Pradhan et al., 2012), specifically the coreference and NER annotations. All experiments use the standard automatic parses from the shared task and mentions detected according to the method of .
Evaluating on OntoNotes carries with it a few complications. First, gold-standard entity linking annotations are not available; we can handle this by leaving the $e_i$ variables in our model latent. Second, and more seriously, NER chunks are no longer the same as coreference mentions, so our assumption of fixed NER spans no longer holds.

Divergent Coreference and NER
Our model can be adapted to handle NER chunks that diverge from mentions for the other two tasks, as shown in Figure 5. We have kept the coreference and entity linking portions of our model the same, now defined over system-predicted mentions. However, we have replaced mention-synchronous type variables with standard token-synchronous BIO-valued variables. The unary NER features developed in Section 3.1.2 are now applied in the standard way, namely they are conjoined with the BIO labels at each token position. Binary factors between adjacent NER nodes enforce appropriate structural constraints and fire indicator features on transitions. In order to maintain tractability in the face of a larger number of variables and factors in the NER portion of our model, we prune the NER variables' domains using the NER model trained in isolation, similar to the procedure that we described for pruning coreference arcs in Section 5.
Cross-task factors that previously would have fired features based on the NE type for a whole mention now instead consult the NE type of that mention's head. 11 In Figure 5, this can be seen with factors involving $e_2$ and $a_2$ touching $t_9$ (company), the head of the second mention. Since the chain structure enforces consistency between adjacent labels, features that strongly prefer a particular label on one node of a mention will implicitly affect other nodes in that mention and beyond.
Training and inference proceed as before, with a slight modification: instead of computing the MBR setting of every variable in isolation, we instead compute the MBR sequence of labeled NER chunks to avoid the problem of producing inconsistent tag sequences, e.g. O I-PER or B-PER I-ORG.
Table 4 shows coreference results from our INDEP. and JOINT models compared to three strong systems: Klein (2013), Fernandes et al. (2012) (the winner of the CoNLL shared task), and Björkelund and Kuhn (2014) (the best reported results on the dataset). Our JOINT method outperforms all three as well as the INDEP. system. 12
Next, we report results on named entity recognition. We use the same OntoNotes splits as for the coreference data; however, the New Testament (NT) portion of the CoNLL 2012 test set does not have gold-standard named entity annotations, so we omit it from our evaluation. This leaves us with exactly the CoNLL 2011 test set. We compare to two existing baselines from the literature: the Illinois NER system of Ratinov and Roth (2009) and the results of Passos et al. (2014). Table 5 shows that we outperform both prior systems in terms of $F_1$, though the ILLINOIS system features higher recall while our system features higher precision.
11 The NER-coreference portion of the model now resembles the skip-chain CRF from Finkel et al. (2005), though with soft coreference.
12 The systems of Chang et al. (2013) and Webster and Curran (2014) perform similarly to the FERNANDES system; changes in the reference implementation of the metrics make exact comparison to printed numbers difficult.
(Ratinov and Roth, 2009) and the system of Passos et al. (2014). Our model outperforms both other systems in terms of $F_1$, and once again joint modeling gives substantial improvements over our baseline system.

Related Work
There are two closely related threads of prior work: those that address the tasks we consider in a different way and those that propose joint models for other related sets of tasks. In the first category, Hajishirzi et al. (2013) integrate entity linking into a sieve-based coreference system (Raghunathan et al., 2010), the aim being to propagate link decisions throughout coreference chains, block coreference links between different entities, and use semantic information to make additional coreference links.  build coreference clusters greedily left-to-right and maintain entity link information for each cluster, namely a list of possible targets in the knowledge base as well as a current best link target that is used to extract features (though that might not be the target that is chosen by the end of inference). Cheng and Roth (2013) use coreference as a preprocessing step for entity linking and then solve an ILP to determine the optimal entity link assignments for each mention based on surface properties of that mention, other mentions in its cluster, and other mentions that it is related to. Compared to these systems, our approach maintains greater uncertainty about all random variables throughout inference and uses features to capture cross-task interactions as opposed to rules or hard constraints, which can be less effective for incorporating semantic knowledge (Lee et al., 2011).
The joint model most closely related to ours is that of , modeling coreference, named entity recognition, and relation extraction. Their techniques differ from ours in a few notable ways: they choose a different objective function than we do and also opt to freeze the values of certain variables during the belief propagation process rather than pruning with a coarse pass. Sil and Yates (2013) jointly model NER and entity linking in such a way that they maintain uncertainty over mention boundaries, allowing information from Wikipedia to inform segmentation choices. We could strengthen our model by integrating this capability; however, the primary cause of errors for mention detection on OntoNotes is parsing ambiguities rather than named entity ambiguities, so we would be unlikely to see improvements in the experiments presented here. Beyond maintaining uncertainty over mention boundaries, we might also consider maintaining uncertainty over the entire parse structure, as in Finkel and Manning (2009), who consider parsing and named entity recognition together with a PCFG.

Conclusion
We return to our initial motivation for joint modeling, namely that the three tasks we address have the potential to influence one another. Table 3 shows that failing to exploit any of the pairwise interactions between the tasks causes lower performance on at least one of them. Therefore, any pipelined system would necessarily underperform a joint model on whatever task came first in the pipeline, which is undesirable given the importance of these tasks. The trend towards broader and deeper NLP pipelines will only exacerbate this problem and make it more difficult to find a suitable pipeline ordering. In addition to showing that joint modeling is high-performing, we have also shown that it can be implemented with relatively low overhead, requiring no fundamentally new learning or inference techniques, and that it is extensible, due to its modular structure and natural partitioning of features. Taken together, these aspects make a compelling case that joint models can provide a way to integrate deeper levels of processing, particularly for semantic layers of annotation, and that this modeling power does not need to come at the expense of computational efficiency, structural simplicity, or modularity.
The Berkeley Entity Resolution System is available at http://nlp.cs.berkeley.edu.