2-Slave Dual Decomposition for Generalized Higher Order CRFs

We show that the decoding problem in generalized Higher Order Conditional Random Fields (CRFs) can be decomposed into two parts: one is a tree labeling problem that can be solved in linear time using dynamic programming; the other is a supermodular quadratic pseudo-Boolean maximization problem, which can be solved in cubic time using a minimum cut algorithm. We use dual decomposition to force their agreement. Experimental results on Twitter named entity recognition and sentence dependency tagging tasks show that our method outperforms spanning tree based dual decomposition.


Introduction
Conditional Random Fields (Lafferty et al., 2001) (CRFs) are popular models for many NLP tasks. In particular, linear chain CRFs exploit local structure for sequence labeling tasks, such as part-of-speech (POS) tagging, named entity recognition (NER), and shallow parsing. Recent studies have shown that the predictive power of CRFs can be strengthened by breaking the locality assumption. They either add long distance dependencies and patterns to linear chains for improved sequence labeling (Galley, 2006; Finkel et al., 2005; Kazama and Torisawa, 2007), or directly use the 4-connected neighborhood lattice (Ding et al., 2008). The resulting non-local models generally suffer from exponential time complexity of inference except in some special cases (Sarawagi and Cohen, 2004; Takhanov and Kolmogorov, 2013; Kolmogorov and Zabih, 2004).
Approximate decoding algorithms have been proposed in the past decade, such as reranking (Collins, 2002b), loopy belief propagation (Sutton and McCallum, 2006), and tree-reweighted belief propagation (Kolmogorov, 2006). In this paper, we focus on dual decomposition (DD), which has attracted much attention recently due to its simplicity and effectiveness (Rush and Collins, 2012). In short, it decomposes the decoding problem into several sub-problems. For each sub-problem, an efficient decoding algorithm is deployed as a slave solver. Finally a simple method forces agreement among different slaves; a popular choice is the sub-gradient algorithm. Martins et al. (2011b) showed that the success of the sub-gradient algorithm is strongly tied to the ability of finding a good decomposition, i.e., one involving few overlapping slaves. However, for generalized higher order graphical models, a lightweight decomposition is not at hand and many overlapping slaves may be involved. Martins et al. (2011b) showed that the sub-gradient algorithm exhibits extremely slow convergence in such cases, and they proposed the alternating directions method (DD-ADMM) to tackle them.
In this paper, we propose a 2-slave dual decomposition approach for efficient decoding in higher order CRFs. One slave is a tree labeling model that can be solved in linear time using dynamic programming. The other is a supermodular quadratic pseudo-Boolean maximization problem, which can be solved in cubic time via minimum cut. Experimental results on Twitter NER and sentence dependency tagging tasks demonstrate the effectiveness of our technique.

Generalized Higher Order CRFs
Given an undirected graph G = (V, E) with N vertices, let x = x_1, x_2, . . . , x_N denote the observations of the vertices. Each observation x_v is assigned one state (or label) s from the state set S. The assignment of the graph can be represented by a binary matrix Y of size N × |S|, where |S| is the cardinality of S, and the element Y_{v,s} indicates whether x_v is assigned state s. In the rest of the paper, the constraint that each row of Y sums to one is required, so that each vertex has exactly one state.
In this paper, we use 𝒴 to denote the space of valid state assignments. The decoding problem is to search for the optimal assignment that maximizes the scoring function,

    max_{Y ∈ 𝒴} ϕ(x, Y),    (1)

where ϕ(x, Y) is a given scoring function. As x is constant in this maximization problem, we omit x for simplicity in the remainder of the paper. The decoding problem becomes

    max_{Y ∈ 𝒴} ϕ(Y).    (2)

The scoring function ϕ(Y) is usually decomposed into small parts, each defined on a subset of vertices c, called a factor. c[·] is the set of all possible assignments of c. For example, factor c = {u, v} denotes the edge (u, v) in the graph, and c[·] = S² is the set of the |S|² transitions. A factor c with a specific state assignment s is called a pattern, denoted as c[s].
For example, v[s] is a pattern of vertex v, and uv[st] is a pattern of edge (u, v), as shown in Figure 1. Note that our definition extends the work of Takhanov and Kolmogorov (2013), where patterns are restricted to state sequences of consecutive vertices. We use Y_c[s] to denote whether pattern c[s] appears in the assignment. Then the scoring function becomes

    ϕ(Y) = Σ_c Σ_{s ∈ c[·]} ϕ_c[s] Y_c[s].    (3)

Many existing CRFs can be represented using Eq (3). For example, the popular linear chain CRFs consider two types of patterns, vertices and edges connecting adjacent vertices, resulting in the following scoring function:

    ϕ(Y) = Σ_v Σ_s ϕ_v[s] Y_v[s] + Σ_{(u,v) ∈ E} Σ_{s,t} ϕ_uv[st] Y_uv[st].

The optimal Y can be found in linear time using the Viterbi algorithm. Another example is the skip-chain CRFs, which add patterns for the interactions between similar vertices:

    ϕ(Y) = Σ_v Σ_s ϕ_v[s] Y_v[s] + Σ_{(u,v) ∈ E} Σ_{s,t} ϕ_uv[st] Y_uv[st] + Σ_{skip edges (u,v)} Σ_s ϕ_uv[ss] Y_uv[ss].

With positive ϕ_uv[ss], the model encourages similar vertices u and v to take the identical state s, and thus yields a more consistent labeling result than linear chain CRFs. Empirically, the use of complex patterns achieves better performance but suffers from the high computational complexity of inference, which is generally NP-hard. Hence an efficient approximate inference algorithm is required to balance this trade-off.
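To make the linear-time decoding concrete, the following is a minimal Viterbi sketch for a linear-chain model with vertex scores ϕ_v[s] and position-independent transition scores ϕ_uv[st]; the nested-list representation and function name are our own illustrative choices, not the paper's implementation.

```python
def viterbi(vertex_scores, trans_scores):
    # vertex_scores[v][s]: score of labeling vertex v with state s
    # trans_scores[t][s]:  score of transition t -> s between adjacent vertices
    N, S = len(vertex_scores), len(vertex_scores[0])
    dp = [vertex_scores[0][:]]      # dp[v][s]: best prefix score ending in s
    back = [[0] * S]                # back-pointers for path recovery
    for v in range(1, N):
        row, ptr = [], []
        for s in range(S):
            best_t = max(range(S), key=lambda t: dp[-1][t] + trans_scores[t][s])
            ptr.append(best_t)
            row.append(dp[-1][best_t] + trans_scores[best_t][s] + vertex_scores[v][s])
        dp.append(row)
        back.append(ptr)
    # follow back-pointers from the best final state
    path = [max(range(S), key=lambda s: dp[-1][s])]
    for v in range(N - 1, 0, -1):
        path.append(back[v][path[-1]])
    return path[::-1]
```

The two nested loops over states give the familiar O(N|S|²) cost, which matches the per-iteration complexity analysis later in the paper.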

Dual Decomposition
Dual decomposition is a popular approach due to its simplicity and effectiveness, and has been successfully applied to many tasks such as machine translation, cross-sentential POS tagging, and joint POS tagging and parsing. Briefly, dual decomposition attempts to solve problems of the following form:

    max_Y Σ_k f_k(Y).

The objective function is the sum of several small components that are tractable in isolation but whose combination is not. These components are called slaves.
Rather than solving the problem directly, dual decomposition considers the equivalent constrained problem

    max_{Y_1, ..., Y_K, Y} Σ_k f_k(Y_k)   s.t.  Y_k = Y for all k.

Using Lagrangian relaxation to eliminate the constraints, we get

    L(λ) = Σ_k max_{Y_k} [ f_k(Y_k) + λ_k · Y_k ],   with Σ_k λ_k = 0,    (4)

which provides an upper bound of the original problem. λ is the Lagrange multiplier, which is typically optimized via sub-gradient algorithms. Martins et al. (2011b) showed that the success of sub-gradient algorithms is strongly tied to the ability of finding a good decomposition, i.e., one involving few slaves. Finding a concise decomposition is usually task dependent. For example, Koo et al. (2010) introduced dual decomposition for parsing with non-projective head automata. They used only two slaves: one is the arc-factored model, and the other is the head automata, which involve adjacent siblings and can be solved using dynamic programming in linear time.
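The sub-gradient agreement loop can be sketched as follows for the two-slave case. The solver callbacks, the sign convention for λ, and the 1/t diminishing step size are illustrative assumptions, not the paper's exact recipe.

```python
def dual_decomposition(solve_f, solve_g, n, iters=100, eta=1.0):
    # solve_f(lam) returns argmax_Y [f(Y) + lam . Y]  (first slave)
    # solve_g(lam) returns argmax_Z [g(Z) - lam . Z]  (second slave)
    lam = [0.0] * n
    for t in range(1, iters + 1):
        Y = solve_f(lam)
        Z = solve_g(lam)
        if Y == Z:               # optimality certificate: the slaves agree
            return Y, True
        step = eta / t           # a common diminishing step-size schedule
        # sub-gradient of the dual w.r.t. lam is (Y - Z); move downhill
        lam = [l - step * (y - z) for l, y, z in zip(lam, Y, Z)]
    return Y, False              # no certificate: fall back to Y
```

A tiny usage example: if the first slave scores variables with weights [1, -1] and the second with [-0.5, 2], the combined optimum is [1, 1], and the loop reaches agreement after a few iterations.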
Dual decomposition is especially efficient for joint learning tasks because a concise decomposition can be derived naturally where each slave solves one subtask. For example, two slaves have been used for the integrated phrase-structure parsing and trigram POS tagging task, one per subtask.
However, for generalized higher order CRFs, a lightweight decomposition may not be at hand. Martins et al. (2011a) showed that sub-gradient algorithms exhibit extremely slow convergence when handling many slaves. For fast convergence, they employed alternating directions dual decomposition (AD³), which relaxes the agreement constraint via augmented Lagrangian relaxation, where an additional quadratic penalty term is added to the Lagrangian (Eq (4)). Similarly, Jojic et al. (2010) added a strongly concave term to the Lagrangian to make it differentiable, resulting in fast convergence.
The work most closely related to ours is that of Komodakis (2011), where dual decomposition was used for decoding general higher order CRFs. Komodakis achieved great empirical success even with the naive decomposition in which each slave processes a single higher order factor. His result demonstrates the effectiveness of the dual decomposition framework. Our work improves on Komodakis's by using a concise decomposition with only two slaves.

Graph Representable Pseudo-Boolean Optimization
One slave in our approach is a graph representable pseudo-Boolean maximization problem, which can be reduced to a supermodular quadratic pseudo-Boolean maximization problem and solved efficiently using a minimum cut algorithm.
A pseudo-Boolean function (PBF) (Boros and Hammer, 2002) is a multilinear function of binary variables,

    f(x_1, ..., x_n) = Σ_{A ⊆ {1,...,n}} c_A Π_{i ∈ A} x_i,   x_i ∈ {0, 1}.

Maximizing a PBF is usually NP-hard; the maximum cut problem is one example (Boros and Hammer, 1991). A pseudo-Boolean function is said to be supermodular iff

    f(x ∧ y) + f(x ∨ y) ≥ f(x) + f(y)   for all x, y,

where x ∧ y and x ∨ y are the element-wise AND and OR of the two vectors, respectively. This is an important concept, because a supermodular pseudo-Boolean function (SPBF) can be maximized in O(n⁶) running time (Orlin, 2009). A necessary and sufficient condition for identifying an SPBF is that all of its discrete second order derivatives are non-negative (Nemhauser et al., 1978), i.e., for all i < j and any fixed assignment of the remaining variables,

    f(x_i=1, x_j=1) − f(x_i=1, x_j=0) − f(x_i=0, x_j=1) + f(x_i=0, x_j=0) ≥ 0.

For example, a quadratic PBF is supermodular if the coefficients of all its quadratic terms are non-negative. Though the general supermodular maximization algorithm can be used for any SPBF, the special structure of some specific problems allows more efficient algorithms to be used. For example, it is well known that the supermodular quadratic pseudo-Boolean maximization problem can be solved in cubic time using min-cut (Billionnet and Minoux, 1985; Kolmogorov and Zabih, 2004).
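The second-derivative test is easy to check by brute force for small n; the sketch below (our own illustration, exponential in n and meant only for verification) enumerates all settings of the remaining variables.

```python
from itertools import product

def is_supermodular(f, n):
    # f takes a tuple of n binary values; check every discrete second
    # derivative: for all i < j and all settings of the other variables,
    # f(1,1) - f(1,0) - f(0,1) + f(0,0) >= 0 in coordinates (i, j).
    for i in range(n):
        for j in range(i + 1, n):
            for rest in product((0, 1), repeat=n - 2):
                def point(xi, xj):
                    x = list(rest)
                    x.insert(i, xi)   # i < j, so insert i first, then j
                    x.insert(j, xj)
                    return f(tuple(x))
                if point(1, 1) - point(1, 0) - point(0, 1) + point(0, 0) < 0:
                    return False
    return True
```

For instance, a quadratic PBF with non-negative quadratic coefficients passes the test, while a single negative quadratic term fails it, matching the criterion stated above.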
In fact, a subset of SPBFs can be maximized using a min-cut algorithm. A pseudo-Boolean function f(x) is called graph representable or graph expressible if there exists a graph G = (V, E) with terminals s and t and a subset of vertices V₀ = V − {s, t} = {v_1, . . . , v_n, u_1, . . . , u_m} such that, for any configuration x_1, . . . , x_n, the value of the function f(x) is equal to a constant plus the cost of the minimum s-t cut among all cuts in which v_i lies on the source side if and only if x_i = 1 (the minimum is also taken over the sides of the extra vertices u_1, . . . , u_m). Our definition extends that of Kolmogorov and Zabih (2004), which focused on quadratic and cubic functions. Vertices u_1, . . . , u_m correspond to the extra binary variables that are introduced to reduce graph representable PBFs to equivalent quadratic forms. For example, the positive-negative PBFs, in which all terms of degree 2 or more have positive coefficients, are graph representable, and each non-linear term requires one extra binary variable to obtain the equivalent quadratic form (Rhys, 1970).
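The one-extra-variable reduction mentioned above can be written and verified directly. The identity below is a standard Rhys-style form, stated here as an illustration; since it is purely Boolean, it can be checked exhaustively.

```python
from itertools import product

def monomial(z):
    # a positive non-linear term z1 * z2 * ... * zk
    p = 1
    for zi in z:
        p *= zi
    return p

def quadratic_reduction(z):
    # one extra binary variable u per term:
    #   z1 * ... * zk = max_u u * (z1 + ... + zk - k + 1)
    # When every zi is 1 the bracket equals 1 (take u = 1);
    # otherwise it is <= 0 and the max is attained at u = 0.
    k = len(z)
    return max(u * (sum(z) - k + 1) for u in (0, 1))
```

After the substitution, all quadratic terms u·z_i have positive coefficients, so the resulting quadratic PBF is supermodular and hence solvable by min-cut.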

The Tree-Cut Decomposition for Generalized Higher Order CRFs
We decompose the decoding problem, i.e., maximization of Eq (3), into two parts: a tree labeling problem and a PBF maximization problem. We show that the PBF can be made graph representable by reparameterizing the scoring function in Eq (3). Then we reduce these pseudo-Boolean functions to quadratic forms based on the recent work of Živný and Jeavons (2010), and finally solve the slave problem via graph cuts.

Fully Connected Pairwise CRFs
We first describe our idea for a simple case, the fully connected pairwise CRFs (Krähenbühl and Koltun, 2011), which are generalizations of linear-chain CRFs and skip-chain CRFs. Formally, the decoding problem in fully connected pairwise CRFs can be formulated as follows:

    max_{Y ∈ 𝒴} Σ_v Σ_s ϕ_v[s] Y_v[s] + Σ_{u<v} Σ_{s,t} ϕ_uv[st] Y_u[s] Y_v[t].

Note that for any edge (u, v), adding a constant ψ_uv to all of its related patterns will not change the optimal solution of the problem, because each vertex takes exactly one state and hence each edge contributes exactly one pattern. In other words, the optimal Y for the following problem is irrelevant to ψ:

    max_{Y ∈ 𝒴} Σ_v Σ_s ϕ_v[s] Y_v[s] + Σ_{u<v} Σ_{s,t} (ϕ_uv[st] + ψ_uv) Y_u[s] Y_v[t].

The reparameterization keeps the optimality of the problem and plays an important role for graph representation, as we will show later. By introducing a new variable Z = Y for the quadratic terms and relaxing the constraint Z = Y using Lagrangian relaxation, we get the relaxed problem

    min_λ [ max_{Y ∈ 𝒴} f_λ(Y) + max_Z g_λ(Z) ].

We split the inner maximization into two subproblems, and a minimizing λ is found using the sub-gradient descent algorithm, which repeatedly finds maximizing assignments for the subproblems individually.
The two subproblems are

    f_λ(Y) = Σ_v Σ_s (ϕ_v[s] + λ_v[s]) Y_v[s],
    g_λ(Z) = Σ_{u<v} Σ_{s,t} (ϕ_uv[st] + ψ_uv) Z_u[s] Z_v[t] − Σ_v Σ_s λ_v[s] Z_v[s].

The first subproblem can be solved in linear time since all vertices are independent.
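Since the first subproblem decomposes over vertices, a per-vertex argmax suffices. A minimal sketch follows; the sign convention for λ here is our reconstruction of the relaxation, not necessarily the paper's.

```python
def solve_vertex_slave(vertex_scores, lam):
    # maximize sum_v sum_s (phi_v[s] - lam_v[s]) * Y_v[s] subject to the
    # one-state-per-vertex constraint: pick each vertex's best state
    # independently of all the others.
    Y = []
    for v, scores in enumerate(vertex_scores):
        best = max(range(len(scores)), key=lambda s: scores[s] - lam[v][s])
        Y.append(best)
    return Y
```

The loop touches each (vertex, state) pair once, so the slave runs in O(N|S|) time.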
The second subproblem is a binary quadratic programming problem. As discussed in Section 2.3, g_λ(Z) can be solved using min-cut if the coefficients of the quadratic terms are non-negative, i.e.,

    ϕ_uv[st] + ψ_uv ≥ 0   for all u, v, s, t.

Hence, we can set

    ψ_uv = max{ 0, −min_{s,t} ϕ_uv[st] }

to guarantee the non-negativity. The resulting supermodular binary quadratic programming problem can be solved via the push-relabel algorithm (Goldberg, 2008) in O((|S|N)³) running time.
Though Z may not satisfy the constraint Z ∈ 𝒴 after the sub-gradient descent based optimization, Y always satisfies Y ∈ 𝒴; hence we use Y as the final solution if Z and Y disagree.

Generalized Higher Order CRFs
Now we consider the general case, maximizing Eq (3). Similar to the pairwise case, we use two slaves. One is a set of independent vertices, and the other is a pseudo-Boolean optimization problem. That is, we redefine

    g_λ(Z) = Σ_{c: |c| ≥ 2} Σ_{s ∈ c[·]} ϕ_c[s] Π_{v ∈ c} Z_v[s_v] − Σ_v Σ_s λ_v[s] Z_v[s].

A sufficient condition for g_λ(Z) to be graph representable is that the coefficients of all non-linear terms are non-negative (Freedman and Drineas, 2005). Hence, we can reparameterize {ϕ_c[s]} to guarantee the non-negativity. In real applications, higher order patterns are sparse, i.e., |{s ∈ c[·] | ϕ_c[s] ≠ 0}| ≪ |S|^|c| (Qian et al., 2009; Ye et al., 2009), which allows fast inference. However, the reparameterization described above may introduce many non-zero terms that destroy the sparsity. For example, in the NER task, a binary feature is defined as true if a word subsequence matches a location name in a gazetteer. Suppose c = "Little York village" is such a word subsequence; then among the |S|³ possible assignments of c, only the one that labels "Little York village" as a location name has non-zero weight. However, the reparameterization may add ψ_c to the other |S|³ − 1 assignments, yielding many new patterns.
Therefore, we use another reparameterization strategy that exploits the sparsity for efficient decomposition. We only reparameterize the weights of edges, i.e., quadratic terms, adding a constant ψ_uv to every pattern of each edge (u, v). The optimal solution is unchanged for any ψ.
In Appendix A, we show that by setting a sufficiently large ψ, g_λ(Z) is graph representable. This reparameterization requires at most N²|S|² new edge patterns to make g_λ(Z) graph representable. It preserves the sparsity of higher order patterns, and hence is more efficient than the naive approach.

Tree-Cut Decomposition
In some cases, the graph is built by adding sparse global patterns to local models such as trees, resulting in nearly tree-structured CRFs. For example, Sutton and McCallum (2006) used skip-chain CRFs for NER, where skip-edges connecting identical words were added to linear chain CRFs. Since the skip-edges are sparse, the resulting graphical models are nearly linear chains. To handle the edges in local models efficiently, we reformulate the decomposition. Let T be a spanning tree of the graph. If edge (u, v) ∈ T, we put its related patterns into the first slave; otherwise, we put its related patterns into the second slave.
For clarity, we formulate the tree-cut decomposition for generalized higher order CRFs. The first slave involves the patterns covered by the spanning tree T; its scoring function f_λ(Y) contains the vertex patterns and the patterns of the edges in T, and can be maximized in linear time by dynamic programming on the tree. The second slave involves the remaining patterns, with scoring function

    h(Z, u) = h₁(Z) + h₂(Z, u) + h₃(Z, u) + h₄(Z, u),

where h₁ involves the edges that are not in T, h₂ involves positive terms of degree 3 or more, h₃ involves negative cubic terms, and h₄ involves negative terms of degree 4 or more.
The relaxed problem for generalized higher order CRFs, i.e., Problem (2), is then

    min_λ [ max_{Y ∈ 𝒴} f_λ(Y) + max_{Z,u} h(Z, u) ],

where Z and u are binary.

Complexity Analysis
In this section, we theoretically analyze the time complexity of each iteration of dual decomposition. The running time of max_{Y ∈ 𝒴} f_λ(Y) is linear in the size of the graph, i.e., O(N|S|²). The running time of max_{Z,u} h(Z, u) is cubic in the number of variables, which is the sum of the numbers of variables in h₁ through h₄. h₁(Z) has at most N²|S|² variables; each pattern in h₂ requires at most (1 + |c|) variables (the |c| variables of the factor plus one extra variable). Similarly, we can count the numbers of variables in h₃ and h₄, as shown in Table 1.
In summary, each pattern in h(Z, u) requires at most 2|c| − 2 variables, so h(Z, u) has no more than Σ_c Σ_{s ∈ c[·]} (2|c| − 2) variables. Finally, the time complexity of each iteration of dual decomposition is

    O( N|S|² + ( Σ_c Σ_{s ∈ c[·]} (2|c| − 2) )³ ),

which is cubic in the total length of the patterns.

Data Sets
Our first experiment is named entity recognition in tweets. Recently, information extraction on Twitter and Facebook data has been attracting much attention (Ritter et al., 2011). Unlike traditional information extraction from news articles, messages posted on these social media websites are short and noisy, making the task more challenging. In this paper, we use generalized higher order CRFs for Twitter NER with discriminative training, and compare our 2-slave dual decomposition approach with a spanning tree based dual decomposition approach and other decoding algorithms.
As far as we know, there are two publicly available data sets for Twitter NER. One is Ritter's (Ritter et al., 2011); the other is from the MSM2013 Concept Extraction Challenge (Basave et al., 2013) 1 . Note that in Ritter's work (Ritter et al., 2011), all of the data are used for evaluating named entity type classification, and none is used for training. However, our approach requires discriminative training, which makes our method not directly comparable with their results. Therefore we choose the MSM2013 dataset in our experiment and compare our system with the MSM2013 official runs.
The MSM2013 corpus has 4 types of named entities: person (PER), location (LOC), organization (ORG), and miscellaneous (MISC). The named entities cover films/movies, entertainment award events, political events, programming languages, sporting events, and TV shows. The data is split into a training set containing 2815 tweets and a test set containing 1526 tweets.

Local Features
We cast the NER task as a structured classification problem and adopt BIESO labeling: for each multi-word entity of class C, the first word is labeled B-C, the intermediate words are labeled I-C, and the last word is labeled E-C; a single-word entity of class C is labeled S-C; and all other words are labeled O.
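The BIESO encoding can be sketched as a small helper; the span format (end-exclusive, non-overlapping) and the function name are our own illustrative choices.

```python
def bieso_encode(n, entities):
    # entities: list of (start, end, cls) spans, end exclusive, non-overlapping
    labels = ["O"] * n
    for start, end, cls in entities:
        if end - start == 1:
            labels[start] = "S-" + cls          # single-word entity
        else:
            labels[start] = "B-" + cls          # first word
            for i in range(start + 1, end - 1):
                labels[i] = "I-" + cls          # intermediate words
            labels[end - 1] = "E-" + cls        # last word
    return labels
```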
1 http://oak.dcs.shef.ac.uk/msm2013/challenge.html

Our baseline NER system is a linear chain CRF. As the MSM2013 competition allows the use of extra resources, we use several additional datasets to generate rich features. Specifically, we trained two POS taggers and two NER taggers on extra datasets. All four taggers are trained as linear chain CRFs with perceptron training. One POS tagger is trained on the Brown and Wall Street Journal corpora in Penn Treebank 3, and the other is trained on the ARK Twitter NLP corpus (Gimpel et al., 2011) with slight modification. One of the NER taggers is trained on the CoNLL 2003 English dataset 2 , and the other is trained on Ritter's dataset.
We used dictionaries in the Ark Twitter NLP toolkit 3 , Ritter's Twitter NLP toolkit 4 , and the Moby Words project 5 to generate dictionary features. We also collected film names and TV shows from the IMDB website and musician groups from Wikipedia. These dictionaries are used to detect candidate named entities in the training and testing datasets using string matching. Matched words are assigned BIESO-style labels, which are used as features.
We also used the unsupervised word cluster features provided by the Ark Twitter NLP toolkit, which significantly improved Twitter POS tagging accuracy (Owoputi et al., 2013). Similar to previous work, we used prefixes of the cluster bit strings with lengths in {2, 4, . . . , 16} as features.
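Extracting such prefix features is straightforward; a sketch, where the feature-string format is our own choice:

```python
def cluster_prefix_features(bitstring, lengths=(2, 4, 6, 8, 10, 12, 14, 16)):
    # e.g. a Brown-style cluster id "110100..." yields the prefix
    # features "11", "1101", "110100", ... up to the string's length
    return ["cluster-prefix=" + bitstring[:k] for k in lengths
            if len(bitstring) >= k]
```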

Global Features
Previous studies showed that document level consistency features (identical phrases in a document tend to have the same entity class) are effective for NER (Kazama and Torisawa, 2007; Finkel et al., 2005). However, unlike news articles, tweets are not organized into documents. To use these document level consistency features, we grouped the tweets in the MSM2013 dataset using a single linkage clustering algorithm, where the similarity between two tweets is the number of overlapping words. If the similarity is greater than 4, we put the two tweets into one group. Unlike standard document clustering, we did not normalize by tweet length, since all tweets are limited to 140 characters. We then extracted group level features as follows. For any two identical phrases x_i . . . x_{i+k} and x_j . . . x_{j+k} in a group, a binary feature is true if they have the same label subsequences. The pattern set of this feature is c = {i, . . . , i + k, j, . . . , j + k} and c[·] = {s | s_i = s_j , . . . , s_{i+k} = s_{j+k}}.
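The grouping step can be sketched with a union-find based single-linkage pass; this is a simplified illustration (similarity is raw word overlap, as described above, and tokenization is plain whitespace splitting).

```python
def cluster_tweets(tweets, threshold=4):
    # single linkage: merge two tweets whenever they share > threshold words
    parent = list(range(len(tweets)))

    def find(i):                       # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    word_sets = [set(t.split()) for t in tweets]
    for i in range(len(tweets)):
        for j in range(i + 1, len(tweets)):
            if len(word_sets[i] & word_sets[j]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(tweets)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```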

Results
We use two evaluation metrics. One is the micro averaged F score, which is used in the CoNLL 2003 shared task. The other is the macro averaged F score, which is used in the MSM2013 official evaluation (Basave et al., 2013).
We compare our approach with two baselines, integer linear programming (ILP) 6 and a naive dual decomposition method.
In naive dual decomposition, we use three types of slaves: a linear chain that captures unigrams and bigrams, spanning trees that cover the skip edges linking identical words, and one slave per pair of identical multi-word phrases. Identical multi-word phrases yield larger factors with more than 4 vertices, which cannot be handled efficiently by belief propagation on spanning trees; therefore, we create multiple slaves, each of which covers a pair of identical multi-word phrases.
To reduce the number of slaves, we use a greedy algorithm to choose the spanning trees. Each time, we select the spanning tree that covers the most uncovered edges. This can be done by running the maximum spanning tree algorithm on the graph where each uncovered edge has unit weight. Let x* denote the most frequent word in a tweet cluster, and let F* be its frequency; then at least (F* − 1)/2 spanning trees are required to cover the complete subgraph spanned by x*.
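A simplified sketch of the greedy cover: each round runs Kruskal-style forest growing on the still-uncovered edges only (the full method described above also lets covered edges participate in the tree; we drop that refinement here, so each round yields a spanning forest rather than a single tree).

```python
def greedy_tree_cover(n, edges):
    # repeatedly grow a spanning forest from the uncovered edges until
    # every edge of the graph has been covered by some forest
    uncovered = set(edges)
    trees = []
    while uncovered:
        parent = list(range(n))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        tree = []
        for u, v in list(uncovered):
            ru, rv = find(u), find(v)
            if ru != rv:              # adding (u, v) keeps the forest acyclic
                parent[ru] = rv
                tree.append((u, v))
                uncovered.discard((u, v))
        trees.append(tree)
    return trees
```

On a triangle (a complete subgraph of 3 identical words), two rounds are needed: the first forest covers two edges, the second covers the remaining one, consistent with the lower bound discussed above.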
For both dual decomposition systems, the averaged perceptron (Collins, 2002a) with 10 iterations is used for parameter estimation. We follow prior work to choose the step size in the sub-gradient algorithm. To compare the convergence speed and optimality of the 2-slave DD and naive DD algorithms, we use the model trained by ILP, and record the micro F scores, averaged dual objectives per instance (the lower the tighter), decoding time, and fraction of optimality certificates across iterations of the two DD algorithms on the test data. Figure 2 shows the performance of the two algorithms relative to decoding time. Our method requires 0.0064 seconds per iteration on average, about four times slower than the naive DD. However, our approach achieves a tighter upper bound and a larger fraction of optimality certificates.

Sentence Dependency Tagging
Our second experiment is the sentence dependency tagging task for Question Answering forums studied in Qu and Liu's work (Qu and Liu, 2012). The goal is to extract the dependency relationships between sentences for automatic question answering. For example, from the posts below, we would need to know that sentence S4 is a comment about sentences S1 and S2, not an answer to S3.

Figure 3: Order-3 factors (e.g., red and blue) connect the 3 vertices in adjacent edge pairs.
A: [S1] I'm having trouble installing my DVB Card.
[S3] What could I do to resolve this problem? B: [S4] I'm having similar problems with Ubuntu.

For a pair of sentences, the depending sentence is called the source sentence, and the depended-on sentence the target sentence. One source sentence can potentially depend on many different target sentences, and one target sentence can also correspond to multiple sources. Qu and Liu (2012) cast the task as a binary classification problem, i.e., whether or not there exists a dependency relation between a pair of sentences. Formally, in this task, Y is an N² × 2 matrix, where N is the number of sentences; Y_{iN+j}[1] = 1 if the i-th sentence depends on the j-th sentence, and otherwise Y_{iN+j}[0] = 1.

We use the annotated corpus from Qu and Liu's work (Qu and Liu, 2012). Following their settings, we randomly split the annotated threads into three disjoint sets and run three-fold cross validation. F score is used as the evaluation metric.

Qu and Liu (2012) used the pairwise CRF with a 4-connected neighborhood system (2D CRF) as their graphical model, where each vertex in the graph represents a sentence pair, and each edge connects adjacent source sentences or target sentences. The key observation is that, given a source/target sentence, there is strong dependency between adjacent target/source sentences. In this paper, we extend their work by connecting the 3 vertices in adjacent edge pairs, resulting in 3-wise CRFs, as shown in Figure 3. We use the same vertex features and edge features as in Qu and Liu's work. For a 3-tuple of vertices, we use the following features: the combination of the sentence types within the tuple, and whether the related sentences are in one post or belong to the same author. Again, we use the perceptron to train the model, and the maximum iteration number for dual decomposition is 200. The spanning tree in our decomposition is the concatenation of all the rows in the graph.
Figure 4: QA sentence dependency tagging using 3-wise CRFs: the F scores, dual objectives, and fraction of optimality certificates relative to decoding time.

Table 3 shows the experimental results. For 2D CRFs, the edges can be covered by 2 spanning trees (one covering all vertical edges and the other covering all horizontal edges), hence the naive dual decomposition has only two slaves. Compared with naive DD, our 2-slave DD achieves competitive performance while being about two times slower. This is because naive DD adopts dynamic programming, which runs in linear time. However, for 3-wise CRFs, the naive dual decomposition requires many small slaves to cover the order-3 factors; therefore our 2-slave method is more effective. The fraction of optimality certificates and the dual objectives of 3-wise CRFs relative to decoding time during testing are shown in Figure 4. For each iteration, our method requires 0.0049 seconds while the naive DD requires 0.00054 seconds, about 10 times faster than ours, but our method converges to a lower (tighter) bound.

Conclusion
We proposed a new decomposition approach for generalized higher order CRFs that uses only two slaves, both of which permit polynomial decoding time. We evaluated our method on two different tasks: Twitter named entity recognition and forum sentence dependency detection. Experimental results show that, though the compact decomposition requires more running time per iteration, it achieves consistently tighter bounds and outperforms the naive dual decomposition. The two experiments demonstrate that our method works for general graphs, even if the graph cannot be decomposed into a few spanning trees (for example, if the graph has large complete subgraphs or large factors).

Appendix A
We show that by setting a sufficiently large ψ, g_λ(Z) in Section 3.2 is graph representable. Let g_λ(Z) = g₁(Z) + g₂(Z) + g₃(Z) + g₄(Z), where g₁(Z) collects the linear and quadratic terms (including the reparameterization weights ψ), g₂(Z) collects the higher order terms with non-negative coefficients, g₃(Z) collects the negative cubic terms, and g₄(Z) collects the negative terms of degree 4 or more.

For g₂(Z), since the coefficients of all terms are non-negative, we can use the fact

    z_1 z_2 ... z_K = max_{u ∈ {0,1}} u (z_1 + ... + z_K − K + 1)

to reduce g₂(Z) to an equivalent quadratic form (Freedman and Drineas, 2005). That is, max_Z g₂(Z) is equivalent to a joint quadratic maximization over Z and the extra variables u, which is graph representable because the coefficients of all the quadratic terms are non-negative.

The coefficients of the terms in g₃(Z) and g₄(Z) are negative, therefore g₃(Z) and g₄(Z) are not supermodular. To make them graph representable, we use the following fact.

Proposition 1 (Živný and Jeavons, 2010) The pseudo-Boolean function p(z) = −Π_{i=1}^{K} z_i is graph representable and can be reduced to quadratic forms; if K = 3, then

    −z_1 z_2 z_3 = max_{u ∈ {0,1}} u ((1 − z_1) + z_2 + z_3 − 2) − z_2 z_3.    (8)

According to Eq (8), each negative cubic term of g₃(Z) splits into two parts. The first part on the right hand side is graph representable, since all quadratic terms in u and the variables (1 − z_1), z_2, z_3 are non-negative. The second part is a quadratic function of Z, and it can be merged into g₁(Z). With a sufficiently large ψ_c in g₁(Z), we can guarantee the non-negativity of all quadratic terms. Similarly, we can apply the K ≥ 4 case of Proposition 1, i.e., Eq (9), to reduce g₄(Z) to graph representable quadratic forms.
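Since the cubic identity is purely Boolean, it can be checked exhaustively. The script below verifies an Eq (8) style reduction over all 8 assignments; the concrete auxiliary form used here is our reconstruction of the reduction, not necessarily the exact form in Živný and Jeavons (2010).

```python
from itertools import product

def lhs(z1, z2, z3):
    # the negative cubic term handled by g3(Z)
    return -z1 * z2 * z3

def rhs(z1, z2, z3):
    # one auxiliary binary variable u; the leftover -z2*z3 term is the
    # quadratic part that is merged into g1(Z) and absorbed by psi
    return max(u * ((1 - z1) + z2 + z3 - 2) for u in (0, 1)) - z2 * z3
```

Because the two sides agree on every binary assignment, each negative cubic term costs one extra variable plus one quadratic term that the reparameterization ψ can absorb.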