A Hierarchical Distance-dependent Bayesian Model for Event Coreference Resolution

We present a novel hierarchical distance-dependent Bayesian model for event coreference resolution. While existing generative models for event coreference resolution are completely unsupervised, our model allows for the incorporation of pairwise distances between event mentions, information that is widely used in supervised coreference models, to guide the generative clustering process toward better event clustering both within and across documents. We model the distances between event mentions using a feature-rich learnable distance function and encode them as Bayesian priors for nonparametric clustering. Experiments on the ECB+ corpus show that our model outperforms state-of-the-art methods for both within- and cross-document event coreference resolution.


Introduction
The task of event coreference resolution consists of identifying text snippets that describe events, and then clustering them such that all event mentions that share a partition refer to the same unique event. Event coreference resolution can be applied within a single document or across multiple documents and is crucial for many natural language processing tasks including topic detection and tracking, information extraction, question answering and textual entailment (Bejan and Harabagiu, 2010). More importantly, event coreference resolution is a necessary component in any reasonable, broadly applicable computational model of natural language understanding (Humphreys et al., 1997).
In comparison to entity coreference resolution (Ng, 2010), which deals with identifying and grouping noun phrases that refer to the same discourse entity, event coreference resolution has not been extensively studied. This is, in part, because events typically exhibit a more complex structure than entities: a single event can be described via multiple event mentions, and a single event mention can be associated with multiple event arguments that characterize the participants in the event as well as spatio-temporal information (Bejan and Harabagiu, 2010). Hence, coreference decisions for event mentions usually require the interpretation of the mentions and their arguments in context. See, for example, Figure 1, in which five event mentions across two documents all refer to the same underlying event: Plane bombs Yida camp.

Most previous approaches to event coreference resolution (e.g. Ahn (2006), Chen et al. (2009)) operate by extending the supervised pairwise classification model that is widely used in entity coreference resolution (e.g. Ng and Cardie (2002)). In this framework, pairwise distances between event mentions are modeled via event-related features (e.g. features that indicate event argument compatibility), and agglomerative clustering is applied to greedily merge event mentions into clusters. A major drawback of this general approach is that it makes hard decisions on the merging and splitting of clusters based on heuristics derived from the pairwise distances. In addition, it only captures pairwise coreference decisions within a single document and cannot account for signals that commonly appear across documents. More recently, Bejan and Harabagiu (2010) proposed several nonparametric Bayesian models for event coreference resolution that probabilistically infer event clusters both within a document and across multiple documents.
Their method, however, is completely unsupervised, and thus cannot encode any readily available supervisory information to guide the model toward better event clustering.

[Figure 1: Five event mentions across two documents, all referring to the same event: Plane bombs Yida camp.]
To address these limitations, we propose a hierarchical distance-dependent Bayesian model for within- and cross-document event coreference resolution. The approach leverages the advantages of supervised feature-rich modeling of pairwise coreference relations and unsupervised Bayesian modeling of cluster distributions, allowing for three desired properties: (1) probabilistic inference over event clusters; (2) automatic determination of the number of events based on the data; and (3) the encoding of rich event-specific knowledge as pairwise linking preferences during clustering. Our model builds on the framework of the distance-dependent Chinese restaurant process (DDCRP) (Blei and Frazier, 2011), which was introduced to incorporate data dependencies into nonparametric clustering models. Here, however, we extend the DDCRP to allow the incorporation of feature-based, learnable distance functions as clustering priors, thus encouraging event mentions that are close in meaning to belong to the same cluster. In addition, we introduce to the DDCRP a representational hierarchy that allows event mentions to be grouped within a document and within-document event clusters to be grouped across documents.
To investigate the effectiveness of our approach, we conduct extensive experiments on the ECB+ corpus (Cybulska and Vossen, 2014b), an extension to EventCorefBank (ECB) (Bejan and Harabagiu, 2010) and the largest corpus available that contains event coreference annotations within and across documents. We show that integrating pairwise learning of event coreference relations with unsupervised hierarchical modeling of event clustering achieves promising improvements over state-of-the-art approaches for within- and cross-document event coreference resolution.

Related Work
While there has been extensive work on entity coreference resolution (Ng, 2010), the problem of event coreference resolution has been relatively less explored. Humphreys et al. (1997) first addressed event coreference in the context of information extraction, where event mentions of pre-specified types are extracted and grouped together using event-attribute constraints. Later approaches (Ahn, 2006; Chen et al., 2009) applied agglomerative clustering as in entity coreference resolution, but computed mention-pair distances based on event-specific features. Bejan and Harabagiu (2010) proposed several extensions to the Hierarchical Dirichlet Process (HDP) model (Teh et al., 2006) to address event coreference both within and across documents. They consider a rich set of linguistic features, modeling them as independent observations in the data likelihood. Our approach is similar in that it is also a nonparametric hierarchical Bayesian model with rich linguistic features; however, we incorporate the features into a learnable distance function that can capture event similarity. Finally, Lee et al. (2012) proposed a supervised approach that iteratively groups entity and event mentions based on a linear regressor that models the quality of cluster-merging operations. Our approach also models cluster-level merging operations, but instead encodes the preferences for merging clusters as Bayesian priors.
Our model is a hierarchical extension of the distance-dependent Chinese restaurant process (DDCRP) framework (Blei and Frazier, 2011). The DDCRP has been successfully applied to perform infinite clustering while accounting for sequential, temporal or spatial structure in data (Ghosh et al., 2011; Socher et al., 2011). However, there has been limited work exploring distance-based structure in hierarchical clustering models. Kim and Oh (2011) encode temporal distances between documents in topic modeling. Ghosh et al. (2014) encode the geometric distance between paragraphs and documents (assuming each document maps to a pre-defined location) in discourse segmentation. Our work is novel in that we learn the pairwise distances using a log-linear model and employ them as priors at both the within- and cross-document levels. We also derive a Gibbs sampler for posterior inference.

Problem Formulation
We adopt the terminology from ECB+ (Cybulska and Vossen, 2014b), a corpus that extends the widely used EventCorefBank (ECB (Bejan and Harabagiu, 2010)). An event is something that happens or a situation that occurs (Cybulska and Vossen, 2014a). It consists of four components: (1) an Action: what happens in the event; (2) Participants: who or what is involved; (3) a Time: when the event happens; and (4) a Location: where the event happens. We assume that each document in the corpus consists of a set of mentions (text spans) that describe event actions, their participants, times, and locations. Table 1 shows examples of these in the sentence "Sudan bombs Yida refugee camp in South Sudan on Thursday, Nov 10th, 2011." In this paper, we also use the term event mention to refer to the mention of an event action, and event arguments to refer collectively to mentions of the participants, times and locations involved in the event. Event mentions are usually noun phrases or verb phrases that clearly describe events. Two event mentions are considered coreferent if they refer to the same actual event, i.e. a situation involving a particular combination of action, participants, time and location. Note that in text, not all event arguments are always present for an event mention; they may even be distributed over different sentences. Whether two event mentions are coreferential should thus be determined based on the context. For example, in Figure 1, the event mention dropped in DOCUMENT 1 corefers with air strike in the same document as they describe the same event, Plane bombs Yida camp, in the discourse context; it also corefers with dropped in DOCUMENT 2 based on the contexts of both documents.
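As a running illustration, the four-component event structure above can be captured in a small record type. This is a sketch of our own; the class and field names are illustrative and not part of the ECB+ corpus format:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EventMention:
    """An event action mention plus its associated argument mentions."""
    action: str                                            # what happens
    participants: List[str] = field(default_factory=list)  # who/what is involved
    time: Optional[str] = None                             # when it happens
    location: Optional[str] = None                         # where it happens

# The example sentence from Table 1:
m = EventMention(
    action="bombs",
    participants=["Sudan", "Yida refugee camp"],
    time="Thursday, Nov 10th, 2011",
    location="South Sudan",
)
```

Note that, as discussed above, any of the argument fields may be absent for a given mention in text.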
The problem of event coreference resolution can be divided into two sub-problems: (1) event extraction: extracting event mentions and event arguments, and (2) event clustering: grouping event mentions into clusters according to their coreference relations. We consider both within- and cross-document event coreference resolution and hypothesize that leveraging context information from multiple documents will improve both. In the following, we first describe the event extraction step and then focus on the event clustering step.

Event Extraction
The goal of event extraction is to extract from a text all event mentions (actions) and event arguments (the associated participants, times and locations). One might expect that event actions could be extracted reasonably well by identifying verb groups, and event arguments by applying semantic role labeling (SRL) to identify, for example, the Agent and Patient of each predicate. Unfortunately, most SRL systems only handle verbal predicates and so would miss event mentions described via noun phrases. In addition, SRL systems are not designed to capture event-specific arguments. Accordingly, we found that a state-of-the-art SRL system (SwiRL (Surdeanu et al., 2007)) extracted only 56% of the actions, 76% of participants, 65% of times and 13% of locations for events in a development set of ECB+ based on head word matching. (We provide dataset details in Section 6.) To obtain higher recall, we adopt a supervised approach and train an event extractor using sentences from ECB+, which are annotated for event actions, participants, times and locations. Because these mentions vary widely in their length and grammatical type, we employ semi-Markov CRFs (Sarawagi and Cohen, 2004) with the loss-augmented objective of Yang and Cardie (2014), which provides more accurate detection of mention boundaries. We make use of a rich feature set that includes word-level features such as unigrams, bigrams, POS tags, WordNet hypernyms, synonyms and FrameNet semantic roles, and phrase-level features such as phrasal syntax (e.g. NP, VP) and phrasal embeddings (constructed by averaging word embeddings produced by word2vec (Mikolov et al., 2013)). Our experiments on the same (held-out) development data show that the semi-CRF-based extractor correctly identifies 95% of actions, 90% of participants, 94% of times and 74% of locations based on head word matching.
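The phrasal embedding features mentioned above are built by averaging the word vectors of a phrase's tokens. A minimal sketch, with toy 3-dimensional vectors standing in for the 300-dimensional word2vec embeddings (the function name and the fallback behavior for out-of-vocabulary tokens are our own choices):

```python
import numpy as np

def phrase_embedding(phrase, word_vectors, dim=300):
    """Average the word vectors of the tokens in a phrase.

    Tokens missing from the vocabulary are skipped; an all-zero
    vector is returned if no token is known.
    """
    vecs = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy 3-dimensional "embeddings" standing in for word2vec vectors.
toy = {"refugee": np.array([1.0, 0.0, 0.0]),
       "camp":    np.array([0.0, 1.0, 0.0])}
v = phrase_embedding("refugee camp", toy, dim=3)  # -> [0.5, 0.5, 0.0]
```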
Note that the semi-CRF extractor identifies event mentions and event arguments but not relationships among them, i.e. it does not associate arguments with an event mention. Lacking supervisory data in the ECB+ corpus for training an event action-argument relation detector, we assume that all event arguments identified by the semi-CRF extractor are related to all event mentions in the same sentence and then apply SRL-based heuristics to augment and further disambiguate intra-sentential action-argument relations (using the SwiRL SRL). More specifically, we link each verbal event mention to the participants that match its ARG0, ARG1 or ARG2 semantic role fillers; similarly, we associate with the event mention the time and locations that match its AM-TMP and AM-LOC role fillers, respectively. For each nominal event mention, we associate those participants that match the possessor of the mention since these were suggested in Lee et al. (2012) as playing the ARG0 role for nominal predicates.
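The SRL-based linking heuristic for verbal event mentions can be sketched as a simple rule over role fillers. This is an illustrative simplification: the matching criterion here is exact string equality, whereas matching against role fillers in practice would be looser (e.g. head-word based), and the function and argument names are our own:

```python
def link_arguments(srl_roles, participants, times, locations):
    """Heuristically attach argument mentions to a verbal event mention.

    srl_roles maps SRL labels (ARG0, ARG1, ARG2, AM-TMP, AM-LOC) to the
    text spans the SRL system assigned to the predicate.  An argument
    mention is linked if it matches one of the relevant role fillers.
    """
    core = {srl_roles.get(r) for r in ("ARG0", "ARG1", "ARG2")}
    return {
        "participants": [p for p in participants if p in core],
        "time":     [t for t in times if t == srl_roles.get("AM-TMP")],
        "location": [l for l in locations if l == srl_roles.get("AM-LOC")],
    }

# Roles for the predicate "bombs" in the Table 1 sentence (illustrative).
roles = {"ARG0": "Sudan", "ARG1": "Yida refugee camp", "AM-LOC": "South Sudan"}
out = link_arguments(roles,
                     participants=["Sudan", "Yida refugee camp"],
                     times=["Thursday"], locations=["South Sudan"])
```

Here "Thursday" is not linked because the SRL system assigned no AM-TMP filler to the predicate, mirroring the fact that unassociated arguments in the same sentence are only attached when a role filler supports the link.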

Event Clustering
We now describe our proposed Bayesian model for event clustering. Our model is a hierarchical extension of the distance-dependent Chinese restaurant process (DDCRP). It first groups event mentions within a document to form within-document event clusters and then groups these event clusters across documents to form global clusters. The model can account for similarity between event mentions during the clustering process, placing a bias toward clusters of event mentions that are closer to each other in context. To capture event similarity, we use a log-linear model with rich syntactic and semantic features, and learn the feature weights from gold-standard data.

Distance-dependent Chinese Restaurant Process
The distance-dependent Chinese restaurant process (DDCRP) is a generalization of the Chinese restaurant process (CRP), which models distributions over partitions. In a CRP, the generative process can be described by imagining data points as customers in a restaurant and the partitioning of the data as the tables at which the customers sit. The process randomly samples the table assignment for each customer sequentially: the probability of a customer sitting at an existing table is proportional to the number of customers already sitting at that table, and the probability of sitting at a new table is proportional to a scaling parameter. For each customer, an observation is drawn from a distribution determined by the parameter associated with that customer's table. Despite the sequential sampling process, the CRP makes the assumption of exchangeability: permuting the customer ordering does not change the probability of the partitions. The exchangeability assumption may not be reasonable for clustering data with clear interdependencies. The DDCRP allows the incorporation of data dependencies into infinite clustering, encouraging data points that are closer to each other to be grouped together. In its generative process, it samples a customer link for each customer instead of a table assignment, linking the customer either to another customer or to itself. The clustering can be uniquely constructed once the customer links are determined for all customers: two customers belong to the same cluster if and only if one can reach the other by traversing the customer links (treating these links as undirected).
More formally, consider a sequence of customers 1, ..., n, and denote a = (a_1, ..., a_n) as the assignments of the customer links. Each a_i \in \{1, ..., n\} is drawn from

p(a_i = j \mid F, \alpha) \propto \begin{cases} F(i, j) & \text{if } j \neq i \\ \alpha & \text{if } j = i \end{cases}

where F is a distance function, F(i, j) is a value that measures the distance between customers i and j, and \alpha is a scaling parameter measuring self-affinity. For each customer, the observation is generated by the per-table parameters as in the CRP. A DDCRP is said to be sequential if F(i, j) = 0 when i < j, so customers may link only to themselves and to previous customers.
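The sequential DDCRP prior and the link-to-cluster construction can be sketched as follows. This is a prior-only sketch under our own simplifications (no likelihood term; `F` and `alpha` play the roles of the distance function and self-affinity parameter above):

```python
import random

def sample_customer_links(n, F, alpha, seed=0):
    """Sample customer links a_1..a_n from a sequential DDCRP prior:
    customer i links to an earlier customer j with weight F(i, j)
    and to itself with weight alpha."""
    rng = random.Random(seed)
    links = []
    for i in range(n):
        weights = [F(i, j) for j in range(i)] + [alpha]  # earlier customers, then self
        r = rng.random() * sum(weights)
        for j, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        links.append(j if j < i else i)  # the last slot is the self-link
    return links

def clusters_from_links(links):
    """Two customers corefer iff connected by (undirected) customer links."""
    parent = list(range(len(links)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i, j in enumerate(links):
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(links))]
```

For example, the link configuration `[0, 0, 1, 3, 3]` (mentions 1 and 2 link back to 0; mention 4 links to the self-linked mention 3) yields two clusters, {0, 1, 2} and {3, 4}.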

A Hierarchical Extension of the DDCRP
We can model within-document coreference resolution using a sequential DDCRP: imagining customers as event mentions and the restaurant as a document, each mention either refers to an antecedent mention in the document or to no other mention, starting the description of a new event. However, coreference relations also exist across documents, since the same event may be described in multiple documents. Ideally, we would have a two-level clustering model that can group event mentions within a document and further group them across documents. We therefore propose a hierarchical extension of the DDCRP (HDDCRP) that employs a DDCRP twice: the first-level DDCRP links mentions based on within-document distances, and the second-level DDCRP links the within-document clusters based on cross-document distances, forming larger clusters in the corpus. The generative process of the HDDCRP can be described using the same "Chinese restaurant" metaphor. Imagine a collection of documents as a collection of restaurants, and the event mentions in each document as customers entering a restaurant. The local (within-document) event clusters correspond to tables, and the global (within-corpus) event clusters correspond to menus (tables that serve the same menu belong to the same cluster). The hidden variables are the customer links and the table links. Figure 2 shows a configuration of these variables and the corresponding clustering structure.
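The construction of global clusters from the two kinds of links can be sketched with a union-find pass: customer links merge mentions into tables, and table links merge tables into global clusters. The data layout here is our own simplification:

```python
def global_clusters(doc_links, table_links):
    """Construct global clusters by traversing within-document customer
    links and cross-document table links (treated as undirected).

    doc_links[d][i] is the customer link of mention i in document d;
    table_links maps a table-starting mention (d, i) to the (doc, mention)
    pair that its table links to.
    """
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    def union(x, y):
        parent[find(x)] = find(y)
    for d, links in enumerate(doc_links):      # customer links -> local tables
        for i, j in enumerate(links):
            union((d, i), (d, j))
    for src, dst in table_links.items():       # table links -> global clusters
        union(src, dst)
    return {m: find(m) for d, links in enumerate(doc_links)
            for m in ((d, i) for i in range(len(links)))}

# Doc 0 has mentions {0, 1} on one table; doc 1's single mention starts
# its own table, whose table link points at doc 0's table.
z = global_clusters([[0, 0], [0]], {(1, 0): (0, 0)})
```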

The generative process of the HDDCRP is as follows:

1. For each document d, sample a customer link a_{i,d} ~ ddCRP(\alpha_d, F_d) for each event mention i in d, linking mentions within the document into tables.

2. For each customer (i, d) that starts a new table, i.e., with a_{i,d} = (i, d), sample a table link c_{i,d} ~ ddCRP(\alpha_0, F_0) over the table-starting customers in the corpus, linking tables into global clusters.

3. Construct the clustering z(a, c) by traversing the customer links and table links. For each global cluster k, sample a parameter \phi_k ~ G_0(\lambda). For each customer i in cluster k, sample an observation x_i ~ p(x \mid \phi_k).
F_{1:D} and F_0 are distance functions that map a pair of customers to a distance value. We discuss them in detail in Section 5.4.

Posterior Inference with Gibbs Sampling
The central computational problem for the HDDCRP model is posterior inference: computing the conditional distribution of the hidden variables given the observations, p(a, c \mid x, \alpha_0, F_0, \alpha_{1:D}, F_{1:D}). The posterior is intractable due to the combinatorial number of possible link configurations. We therefore approximate the posterior using Markov chain Monte Carlo (MCMC) sampling, specifically a Gibbs sampler.
In developing this Gibbs sampler, we first observe that the generative process is equivalent to one that, in step 2, samples a table link for all customers, and then in step 3, when calculating z(a, c), includes only those table links c_{i,d} originating at customers (i, d) that started a new table, i.e., that chose a_{i,d} = (i, d).
The Gibbs sampler for the HDDCRP iteratively samples a customer link for each customer (i, d) from

p(a_{i,d} = (j, d) \mid a^{-(i,d)}, c, x) \propto \begin{cases} \alpha_d \, H_a(x, z, \lambda) & \text{if } (j, d) = (i, d) \\ F_d(i, j) \, H_a(x, z, \lambda) & \text{otherwise} \end{cases}

where H_a(x, z, \lambda) is the ratio of the marginal likelihood of the clustering implied by the new link to that of the clustering with the link removed; H_a(x, z, \lambda) = 1 when the link does not change the clustering. After sampling all the customer links, it samples a table link for each customer (i, d) according to

p(c_{i,d} = (j, d') \mid a, c^{-(i,d)}, x) \propto \begin{cases} \alpha_0 \, H_c(x, z, \lambda) & \text{if } (j, d') = (i, d) \\ F_0((i, d), (j, d')) \, H_c(x, z, \lambda) & \text{otherwise} \end{cases}

where H_c(x, z, \lambda) is defined analogously. For those customers (i, d) that did not start a new table, i.e., with a_{i,d} \neq (i, d), the table link c_{i,d} does not affect the clustering, and so H_c(x, z, \lambda) = 1 in this case.
To simplify the computation of both H_a(x, z, \lambda) and H_c(x, z, \lambda), note that the likelihood under clustering z(a, c) can be factorized as

p(x \mid z(a, c), \lambda) = \prod_k p(x_{z=k} \mid \lambda)

where x_{z=k} denotes the observations of all customers that belong to the global cluster k, and p(x_{z=k} \mid \lambda) is their marginal probability. It can be computed as

p(x_{z=k} \mid \lambda) = \int \Big[ \prod_{i: z_i = k} p(x_i \mid \phi) \Big] G_0(\phi; \lambda) \, d\phi

where x_i is the observation associated with customer i and the integral is taken against the base distribution G_0(\phi; \lambda) over cluster parameters. In our problem, each customer is an event mention, and the observation corresponds to the lemmatized words in the mention. We model the word distributions using multinomial distributions with Dirichlet priors, so the integral has a closed form.
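Under the multinomial-Dirichlet assumption, the marginal p(x_{z=k} | λ) has a closed form, which makes the likelihood ratios used by the Gibbs sampler cheap to compute. A sketch, assuming a symmetric Dirichlet with parameter `lam` over a vocabulary of size `V` (the function names are ours):

```python
from math import lgamma

def log_marginal(counts, lam, V):
    """Log marginal likelihood of a bag of words under a multinomial with
    a symmetric Dirichlet(lam) prior over a V-word vocabulary (collapsed,
    closed-form Dirichlet-multinomial)."""
    n = sum(counts.values())
    out = lgamma(V * lam) - lgamma(V * lam + n)
    for c in counts.values():
        out += lgamma(lam + c) - lgamma(lam)
    return out

def log_merge_ratio(counts_k, counts_l, lam, V):
    """Log of the likelihood ratio used when a sampled link would merge
    clusters k and l: p(merged) / (p(k) * p(l))."""
    merged = dict(counts_k)
    for w, c in counts_l.items():
        merged[w] = merged.get(w, 0) + c
    return (log_marginal(merged, lam, V)
            - log_marginal(counts_k, lam, V)
            - log_marginal(counts_l, lam, V))

# One observed token over a two-word vocabulary with lam = 1
# has marginal probability 1/2.
lp = log_marginal({"a": 1}, 1.0, 2)  # -> log(1/2)
```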

Feature-based Distance Functions
The distance functions F_{1:D} and F_0 encode the priors for the clustering distribution, preferring to cluster data points that are closer to each other. We consider event mentions as the data points and encode the similarity (or compatibility) between event mentions as priors for event clustering. Specifically, we use a log-linear model to estimate the similarity between a pair of event mentions (x_i, x_j):

f_\theta(x_i, x_j) = \frac{1}{1 + \exp(-\theta^\top \psi(x_i, x_j))}

where \psi(x_i, x_j) is a feature vector containing a rich set of features based on event mentions i and j: (1) head word string match; (2) head POS pair; (3) cosine similarity between the head word embeddings (we use the pre-trained 300-dimensional word embeddings from word2vec); (4) similarity between the words in the event mentions (based on term frequency (TF) vectors); (5) the Jaccard coefficient between the WordNet synonyms of the head words; and (6) similarity between the context words (a window of three words before and after each event mention). If both event mentions involve participants, we include the similarity between the words in the participant mentions based on the TF vectors, and similarly for the time mentions and the location mentions. If SRL role information is available, we also include the similarity between the words in each SRL role, i.e. Arg0, Arg1, and Arg2.
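The resulting similarity is a logistic function of the weighted pairwise features. A minimal sketch with a toy three-feature vector and illustrative (not learned) weights:

```python
import math

def mention_similarity(theta, psi):
    """Log-linear (logistic) similarity between two event mentions,
    given their pairwise feature vector psi and weights theta."""
    score = sum(t * p for t, p in zip(theta, psi))
    return 1.0 / (1.0 + math.exp(-score))

# Toy pairwise features: [head-match, head-embedding cosine, context TF cosine].
psi_coref = [1.0, 0.95, 0.6]   # e.g. "dropped" vs. "dropped" in similar contexts
psi_other = [0.0, 0.10, 0.2]   # e.g. unrelated mention pair
theta = [2.0, 1.5, 1.0]        # illustrative weights, not the learned ones
assert mention_similarity(theta, psi_coref) > mention_similarity(theta, psi_other)
```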
Training. We train the parameters \theta using logistic regression with an L2 regularizer. We construct the training data by considering all ordered pairs of event mentions within a document, as well as pairs of event mentions in similar documents. To measure document similarity, we collect all mentions of events, participants, times and locations in each document and compute the cosine similarity between the TF vectors constructed from all the event-related mentions. We consider two documents to be similar if their TF-based similarity is above a threshold \sigma (we set it to 0.4 in our experiments). After learning \theta, we set the within-document distances as F_d(i, j) = f_\theta(x_i, x_j), and the across-document distances as

F_0((i, d), (j, d')) = f_\theta(x_i, x_j) \cdot \exp(\gamma \, \text{sim}(d, d'))

The term \exp(\gamma \, \text{sim}(d, d')) captures document similarity, where sim(d, d') is the TF-based similarity between documents d and d', and \gamma is a weight parameter (we set \gamma = 1 in our experiments). The intuition is that event clusters are more likely to be shared among similar documents.
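The TF-based document similarity and the threshold σ can be sketched as a plain cosine over term-frequency vectors of the event-related mentions (the toy documents below are illustrative):

```python
from collections import Counter
import math

SIGMA = 0.4  # document-similarity threshold for training-pair selection

def tf_cosine(mentions_a, mentions_b):
    """Cosine similarity between term-frequency vectors built from the
    event-related mentions of two documents."""
    ca, cb = Counter(mentions_a), Counter(mentions_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

d1 = ["bombs", "Sudan", "Yida", "camp"]
d2 = ["air", "strike", "Yida", "camp"]
similar = tf_cosine(d1, d2) >= SIGMA  # cosine = 0.5, so these count as similar
```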

Experiments
We conduct experiments on the ECB+ corpus (Cybulska and Vossen, 2014b), which is the largest available dataset with annotations of both within-document (WD) and cross-document (CD) event coreference. Cross-document coreference annotations exist only among documents that describe the same seminal event (the event that triggers the topic of the document and has interconnections with the majority of events from its surrounding textual context (Bejan and Harabagiu, 2014)). We divide the dataset into a training set (topics 1-20), a development set (topics 21-23), and a test set (topics 24-43). Table 2 shows the statistics of the data.
We performed event coreference resolution on all possible event mentions that are expressed in the documents. Using the event extraction method described in Section 4, we extracted 53,429 event mentions, 43,682 participant mentions, 5,791 time mentions and 3,836 location mentions in the test data, covering 93.5%, 89.0%, 95.0%, 72.8% of the annotated event mentions, participants, time and locations, respectively.
We evaluate both within-and cross-document event coreference resolution.
As in previous work (Bejan and Harabagiu, 2010), we evaluate cross-document coreference resolution by merging all documents from the same seminal event into a meta-document and then evaluating the meta-document as in within-document coreference resolution. However, at inference time, we do not assume knowledge of the mapping of documents to seminal events.
We consider three widely used coreference resolution metrics: (1) MUC (Vilain et al., 1995), which measures how many gold (predicted) cluster-merging operations are needed to recover each predicted (gold) cluster; (2) B^3 (Bagga and Baldwin, 1998), which measures the proportion of overlap between the predicted and gold clusters for each mention and computes the average scores; and (3) CEAF_e (Luo, 2005), which measures the best alignment of the gold-standard and predicted clusters. All scores are computed using the latest version (v8.01) of the official CoNLL scorer (Pradhan et al., 2014).
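As an illustration of one of these metrics, B^3 precision and recall can be computed per mention as the overlap between its predicted and gold clusters, averaged over mentions. A compact sketch (assuming the predicted and gold assignments cover the same mention set):

```python
def b_cubed(pred, gold):
    """B^3 precision/recall: for each mention, the fraction of its
    predicted (resp. gold) cluster that is correct, averaged over
    mentions.  pred and gold map mention id -> cluster id."""
    def clusters(assign):
        out = {}
        for m, c in assign.items():
            out.setdefault(c, set()).add(m)
        return out
    pc, gc = clusters(pred), clusters(gold)
    p = r = 0.0
    for m in pred:
        overlap = len(pc[pred[m]] & gc[gold[m]])
        p += overlap / len(pc[pred[m]])
        r += overlap / len(gc[gold[m]])
    n = len(pred)
    return p / n, r / n

# One gold cluster {1, 2, 3} split by the system into {1, 2} and {3}:
# precision is perfect, recall is penalized.
P, R = b_cubed({1: "a", 2: "a", 3: "b"}, {1: "x", 2: "x", 3: "x"})
```

In practice one would use the official CoNLL scorer cited above rather than a re-implementation; this sketch is only meant to make the metric's definition concrete.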
Baselines. (1) BL_lemma: a lemma-matching baseline that groups two event mentions if they have the same lemmatized head word; (2) PAIRWISE: a single-link agglomerative clustering method previously reported to perform well on event coreference resolution (Chen et al., 2009). Pairwise scores are computed using the log-linear model described in Section 5.4. We considered two cluster-merging thresholds, one for within-document clustering and the other for cross-document clustering, tuned on the development set.
(3) HDP: a hierarchical Dirichlet process model that has been used for within- and cross-document event coreference resolution. It corresponds to the HDP_flat model with lexical features in Bejan and Harabagiu (2010) (we implemented the other HDP variants in that paper and found that HDP_flat performs best in our experiments). The concentration parameter and the base-measure hyperparameter of HDP were set to 1 and 10^-7, respectively, based on the development data.
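The BL_lemma baseline above amounts to grouping mentions by their lemmatized head word. A sketch, with a toy lemmatizer standing in for a real one (e.g. WordNet's):

```python
def lemma_baseline(head_words, lemma):
    """BL_lemma: group event mentions whose lemmatized head words match.

    head_words is a list of mention head words; lemma maps a word to
    its lemma.  Returns clusters as lists of mention indices.
    """
    clusters = {}
    for i, head in enumerate(head_words):
        clusters.setdefault(lemma(head), []).append(i)
    return list(clusters.values())

# Toy lemma table standing in for a real lemmatizer.
toy_lemma = {"bombed": "bomb", "bombs": "bomb", "dropped": "drop"}.get
groups = lemma_baseline(["bombed", "bombs", "dropped"],
                        lambda w: toy_lemma(w, w))
# groups -> [[0, 1], [2]]
```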
HDDCRP. We consider the proposed HDDCRP model and also its variant HDDCRP*, which uses a sequential DDCRP for the within-document clustering and a standard CRP for the cross-document clustering. Comparing the performance of these two models can reveal the effect of incorporating mention dependencies across documents; by comparing both models to HDP, we can see the effect of incorporating distance-based priors.

The generative process of HDDCRP* is similar to the one described in Section 5.2, except that in step 2, for each table t, we sample a cluster assignment c_t according to

p(c_t = k \mid c_{1:t-1}) \propto \begin{cases} n_k & \text{if } k \leq K \\ \alpha & \text{if } k = K + 1 \end{cases}

where K is the number of existing clusters, n_k is the number of existing tables that belong to cluster k, and \alpha is the concentration parameter. In step 3, the clusters z(a, c) are constructed by traversing the customer links and looking up the cluster assignments of the resulting tables. In Gibbs sampling, we iteratively sample the customer links by computing the conditional probability as in Equation 4. If removing a_{i,d} creates a new table, we sample the cluster assignment for the new table using the table-level CRP. After sampling the customer links, we sample the cluster assignments for tables using the sampler for the standard CRP.

The reported results are averaged over 5 MCMC runs, each for 500 iterations (we found mixing in fewer than 200 iterations). We also truncated the pairwise mention similarity to zero if it was below 0.5, as we found that this leads to better performance on the development set. We set \alpha_1 = ... = \alpha_D = 0.5 and \alpha_0 = 0.001 for HDDCRP, \alpha = 1 for HDDCRP*, and \lambda = 10^-7, based on the development data.

Table 3 shows the event coreference results. We can see that lemma matching is a strong baseline for event coreference resolution. HDP provides noticeable improvements over BL_lemma, suggesting the benefit of modeling the distribution of event clusters using an infinite mixture model.
PAIRWISE further improves over HDP for WD resolution; however, it fails to improve for CD resolution. We conjecture that this is due to the combination of ineffective thresholding and prediction errors on the pairwise distances between mention pairs across documents. The HDDCRP* model outperforms all the baselines in CoNLL F1 in both the WD and CD evaluations. The clear performance gains over HDP demonstrate that incorporating mention-pair distances into the clustering priors allows for better modeling of the clustering structure. The gains over PAIRWISE indicate that it is more effective to use mention-pair distances as prior probabilities for generative clustering than for deterministic clustering. Finally, our HDDCRP model further improves on HDDCRP* in WD CoNLL F1 and demonstrates comparable performance in CD CoNLL F1, indicating that incorporating mention dependencies across documents can help. In particular, HDDCRP clearly outperforms HDDCRP* in precision when evaluated with MUC and B^3, though it has lower recall in B^3. A possible explanation is that it is more conservative in merging clusters and tends to under-merge.

To further understand the performance of our HDDCRP models, we analyze the impact of the features in the mention-pair similarity model. Table 4 lists the learned weights of the top features (sorted by weight). They mainly serve to discriminate event mentions based on head word similarity (especially embedding-based similarity) and context word similarity. Event argument information such as SRL Arg1, SRL Arg0, and Participant is also indicative of coreference relations.

Discussion
We found that the HDDCRP models correct many errors made by HDP by modeling the context similarities between event mentions: for example, event mentions such as found and discovered may be strongly associated on their own and yet not refer to the same event in the given context. Compared to PAIRWISE, our model is better at capturing the distributional properties of event clusters. PAIRWISE groups event mentions purely based on their distances, and it requires a careful setting of the threshold values for determining the final clusters. In our experiments, we found that its performance is very sensitive to the threshold setting on the development set, and it often suffers from errors made by the supervised distance learner. Our HDDCRP models are more robust, as they combine the discriminative signals with the distributional signals in generative modeling. For example, they correctly group the event mention "unveiled" in "Apple's Phil Schiller unveiled a revamped MacBook Pro today at WWDC12" together with the event mention "announced" in "this notebook isn't the only laptop Apple announced for the MacBook Pro lineup today.", even though the PAIRWISE model fails to make the connection since there is not strong enough evidence based on the model features.
Despite the promising gains provided by the HDDCRP, there is still a lot of room for improvement. We believe a key problem is how to better model the context. This involves associating event mentions with the correct arguments in the discourse, better representations of event argument information, and better distance measures for event argument compatibility, especially for times and locations, which are less likely to be repeatedly mentioned and in many cases require external knowledge to resolve, e.g. (May 13, Friday) and (Mount Cook, New Zealand's highest peak). The HDDCRP model could also be further improved by a more efficient sampling process so that it can scale to very large corpora.

Conclusion
In this paper we propose a novel hierarchical Bayesian model for within- and cross-document event coreference resolution. It leverages the advantages of generative modeling of coreference resolution and feature-rich discriminative modeling of mention coreference relations. We have shown its power in resolving event coreference by comparing it to a traditional agglomerative clustering approach and a state-of-the-art unsupervised generative clustering approach. It is worth noting that our model is general and can easily be applied to clustering problems over any linguistic objects that exhibit contextual dependencies both within and across documents, by using problem-specific features in the distance functions. While it can effectively resolve coreference of objects of a single type, it would be interesting to extend it to allow joint coreference resolution of objects of multiple related types, e.g. events and entities.