Branch and Bound Algorithm for Dependency Parsing with Non-local Features

Graph based dependency parsing is inefficient when handling non-local features due to the high computational complexity of inference. In this paper, we propose an exact and efficient decoding algorithm based on the Branch and Bound (B&B) framework, where non-local features are bounded by a linear combination of local features. Dynamic programming is used to search for the upper bound. Experiments are conducted on the English PTB and Chinese CTB datasets. We achieve competitive Unlabeled Attachment Score (UAS) when no additional resources are available: 93.17% for English and 87.25% for Chinese. Parsing speed is 177 words per second for English and 97 words per second for Chinese. Our algorithm is general and can be adapted to non-projective dependency parsing or other graphical models.


Introduction
For graph based projective dependency parsing, dynamic programming (DP) is popular for decoding due to its efficiency when handling local features. It performs cubic time parsing for arc-factored models (Eisner, 1996; McDonald et al., 2005a) and biquadratic time parsing for higher order models with richer sibling and grandchild features (Carreras, 2007). However, for models with general non-local features, DP is inefficient. There have been numerous studies on global inference algorithms for general higher order parsing. One popular approach is reranking (Collins, 2000; Charniak and Johnson, 2005; Hall, 2007). It typically has two steps: the low level classifier generates the top k hypotheses using local features, then the high level classifier reranks these candidates using global features. Since the reranking quality is bounded by the oracle performance of the candidates, some work has combined the candidate generation and reranking steps using cube pruning (Huang, 2008; Zhang and McDonald, 2012) to achieve higher oracle performance. These methods parse a sentence in bottom up order and keep the top k derivations for each span using k best parsing (Huang and Chiang, 2005). After merging two spans, non-local features are used to rerank the top k combinations. This approach is very efficient and flexible in handling various non-local features. The disadvantage is that it tends to compute non-local features as early as possible so that the decoder can utilize that information at internal spans, hence it may miss long historical features such as long dependency chains.
Smith and Eisner modeled dependency parsing using Markov Random Fields (MRFs) with global constraints and applied loopy belief propagation (LBP) for approximate learning and inference (Smith and Eisner, 2008). Similar work was done for Combinatory Categorial Grammar (CCG) parsing (Auli and Lopez, 2011). They used posterior marginal beliefs for inference to satisfy the tree constraint: for each factor, only legal messages (satisfying global constraints) are considered in the partition function.
A similar line of research investigated the use of integer linear programming (ILP) based parsing (Riedel and Clarke, 2006; Martins et al., 2009). This method is very expressive. It can handle arbitrary non-local features determined or bounded by linear inequalities of local features. For local models, LP is less efficient than DP. The reason is that DP works on a small number of dimensions in each recursion, while for LP, the popular revised simplex method needs to solve an m-dimensional linear system in each iteration (Nocedal and Wright, 2006), where m is the number of constraints, which is quadratic in sentence length for projective dependency parsing (Martins et al., 2009).
Dual Decomposition (DD) is a special case of Lagrangian relaxation. It relies on standard decoding algorithms as oracle solvers for sub-problems, together with a simple method for forcing agreement between the different oracles. This method does not need to consider the tree constraint explicitly, as it resorts to dynamic programming, which guarantees its satisfaction. It works well if the sub-problems can be well defined, especially for joint learning tasks. However, for the task of dependency parsing, using various non-local features may result in many overlapping sub-problems, hence it may take a long time to reach a consensus (Martins et al., 2011).
In this paper, we propose a novel Branch and Bound (B&B) algorithm for efficient parsing with various non-local features. B&B (Land and Doig, 1960) is generally used for combinatorial optimization problems such as ILP. The difference between our method and ILP is that the sub-problem in ILP is a relaxed LP, which requires a numerical solution, while ours bounds the non-local features by a linear combination of local features and uses DP for decoding as well as for calculating the upper bound of the objective function. An exact solution is achieved if the bound is tight. Though the time complexity is exponential in sentence length in the worst case, the algorithm is efficient in practice, especially when adopting a pruning strategy.
Experiments are conducted on the English Penn Treebank (PTB) and Chinese Treebank 5 (CTB5) with the standard train/develop/test split. We achieved 93.17% Unlabeled Attachment Score (UAS) for English at a speed of 177 words per second and 87.25% for Chinese at a speed of 97 words per second.

Problem Definition
Given a sentence $x = x_1, x_2, \ldots, x_n$ where $x_i$ is the $i$-th word of the sentence, dependency parsing assigns exactly one head word to each word, so that dependencies from head words to modifiers form a tree. The root of the tree is a special symbol denoted by $x_0$ which has exactly one modifier. In this paper, we focus on unlabeled projective dependency parsing, but our algorithm can be adapted for labeled or non-projective dependency parsing (McDonald et al., 2005b).
The inference problem is to search for the optimal parse tree $y^*$

$$y^* = \arg\max_{y \in \mathcal{Y}(x)} \phi(x, y)$$

where $\mathcal{Y}(x)$ is the set of all candidate parse trees of sentence $x$. $\phi(x, y)$ is a given score function which is usually decomposed into small parts

$$\phi(x, y) = \sum_{c \subseteq y} \phi_c(x) \quad (1)$$

where $c$ is a subset of edges, and is called a factor. For example, in the all grandchild model, the score function can be represented as

$$\phi(x, y) = \sum_{e_{hm} \in y} \phi_{e_{hm}}(x) + \sum_{e_{gh}, e_{hm} \in y} \phi_{e_{gh}, e_{hm}}(x)$$

where the first term is the sum of scores of all edges $x_h \to x_m$, and the second term is the sum of the scores of all edge chains $x_g \to x_h \to x_m$. In discriminative models, the score of a parse tree $y$ is the weighted sum of the fired feature functions, which can be represented by the sum of the factors

$$\phi(x, y) = w^T f(x, y) = \sum_{c \subseteq y} w^T f(x, c) = \sum_{c \subseteq y} \phi_c(x)$$

where $f(x, c)$ is the feature vector that depends on $c$. For example, we could define a feature for grand-

Dynamic Programming for Local Models
In first order models, all factors $c$ in Eq (1) contain a single edge. The optimal parse tree can be derived by DP with running time $O(n^3)$ (Eisner, 1996). The algorithm has two types of structures: the complete span, which consists of a headword and its descendants on one side, and the incomplete span, which consists of a dependency and the region between the head and modifier. It starts at single word spans, and merges the spans in bottom up order. For second order models, the score function $\phi(x, y)$ adds the scores of siblings (adjacent edges with a common head) and grandchildren. There are two versions of second order models, used respectively by Carreras (2007) and by Koo and Collins. The difference is that Carreras' model only considers the outermost grandchildren, while Koo and Collins's allows all grandchild features. Both models permit $O(n^4)$ running time. Third-order models score edge triples such as three adjacent sibling modifiers, or grand-siblings that score a word, its modifier and its adjacent grandchildren, and the inference complexity is $O(n^4)$.
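The first order DP described above can be sketched as follows. This is a minimal illustration of the complete/incomplete span recursion, assuming a dense matrix `score[h][m]` of arc scores with index 0 as the artificial root; the function name `eisner` and the data layout are choices made for this sketch, and it does not enforce the single-root-modifier constraint mentioned in the problem definition.

```python
NEG_INF = float("-inf")

def eisner(score):
    """First-order projective decoding in O(n^3).
    score[h][m] is the score of arc h -> m; token 0 is the artificial root.
    Returns (head, best_score) with head[0] = -1."""
    n = len(score)
    # C[s][t][d]: complete span; I[s][t][d]: incomplete span.
    # d = 1: head is the left end s; d = 0: head is the right end t.
    C = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    I = [[[NEG_INF, NEG_INF] for _ in range(n)] for _ in range(n)]
    bc = [[[None, None] for _ in range(n)] for _ in range(n)]  # split points
    bi = [[None] * n for _ in range(n)]

    for k in range(1, n):              # span length
        for s in range(n - k):
            t = s + k
            # Incomplete span: two complete spans facing each other plus an arc.
            r = max(range(s, t), key=lambda r: C[s][r][1] + C[r + 1][t][0])
            inner = C[s][r][1] + C[r + 1][t][0]
            bi[s][t] = r
            # The root (index 0) can never be a modifier.
            I[s][t][0] = inner + (score[t][s] if s > 0 else NEG_INF)
            I[s][t][1] = inner + score[s][t]
            # Complete span headed at t: complete span + incomplete span.
            r = max(range(s, t), key=lambda r: C[s][r][0] + I[r][t][0])
            C[s][t][0] = C[s][r][0] + I[r][t][0]
            bc[s][t][0] = r
            # Complete span headed at s: incomplete span + complete span.
            r = max(range(s + 1, t + 1), key=lambda r: I[s][r][1] + C[r][t][1])
            C[s][t][1] = I[s][r][1] + C[r][t][1]
            bc[s][t][1] = r

    head = [-1] * n

    def back_complete(s, t, d):
        if s == t:
            return
        r = bc[s][t][d]
        if d == 0:
            back_complete(s, r, 0)
            back_incomplete(r, t, 0)
        else:
            back_incomplete(s, r, 1)
            back_complete(r, t, 1)

    def back_incomplete(s, t, d):
        if d == 0:
            head[s] = t
        else:
            head[t] = s
        r = bi[s][t]
        back_complete(s, r, 1)
        back_complete(r + 1, t, 0)

    back_complete(0, n - 1, 1)
    return head, C[0][n - 1][1]
```

For a two-word sentence with `score = [[0, 5, 1], [0, 0, 4], [0, 3, 0]]`, the chain root → word 1 → word 2 has the highest total score among the three projective trees, and the sketch recovers it.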
In this paper, we call all the factors/features that can be handled by DP the local factors/features.

Basic Idea
For general high order models with non-local features, we propose to use a Branch and Bound (B&B) algorithm to search for the optimal parse tree. A B&B algorithm has two steps: branching and bounding. The branching step recursively splits the search space $\mathcal{Y}(x)$ into two disjoint subspaces $\mathcal{Y}(x) = \mathcal{Y}_1 \cup \mathcal{Y}_2$ by fixing the assignment of one edge. For each subspace $\mathcal{Y}_i$, the bounding step calculates the upper bound of the optimal parse tree score in the subspace: $UB_{\mathcal{Y}_i} \ge \max_{y \in \mathcal{Y}_i} \phi(x, y)$. If this bound is no more than the score of any parse tree obtained so far, $UB_{\mathcal{Y}_i} \le \phi(x, y')$, then no parse tree in subspace $\mathcal{Y}_i$ is better than $y'$, and $\mathcal{Y}_i$ can be pruned safely.
The efficiency of B&B depends on the branching strategy and the upper bound computation. For example, Sun et al. (2012) used B&B for MRFs, where they proposed two branching strategies and a novel data structure for efficient upper bound computation. Klenner and Ailloud (2009) proposed a variation of the Balas algorithm (Balas, 1965) for coreference resolution, where candidate branching variables are sorted by their weights.
Our bounding strategy is to find an upper bound for the score of each non-local factor $c$ containing multiple edges. The bound is the sum of new scores of edges in the factor plus a constant

$$\phi_c(x) \le \sum_{e \in c} \psi_e(x) + \alpha_c .$$

Based on the new scores $\{\psi_e(x)\}$ and constants $\{\alpha_c\}$, we define the new score of parse tree $y$

$$\psi(x, y) = \sum_{c \subseteq y} \Big( \sum_{e \in c} \psi_e(x) + \alpha_c \Big).$$

The advantage of such a bound is that it is the sum of new edge scores. Hence, its optimum tree $\max_{y \in \mathcal{Y}(x)} \psi(x, y)$ can be found by DP, which is the upper bound of $\max_{y \in \mathcal{Y}(x)} \phi(x, y)$, as for any $y \in \mathcal{Y}(x)$, $\psi(x, y) \ge \phi(x, y)$.

The Upper Bound Function
In this section, we derive the upper bound function $\psi(x, y)$ described above. To simplify notation, we drop $x$ throughout the rest of the paper. Let $z_c$ be a binary variable indicating whether factor $c$ is selected in the parse tree. We reformulate the score function in Eq (1) as

$$\phi(z) = \sum_c \phi_c z_c .$$

Correspondingly, the tree constraint is replaced by $z \in \mathcal{Z}$. Then the parsing task is

$$z^* = \arg\max_{z \in \mathcal{Z}} \sum_c \phi_c z_c . \quad (2)$$

Notice that, for any $z_c$, we have

$$z_c = \min_{e \in c} z_e \quad (3)$$

which means that factor $c$ appears in the parse tree if and only if all its edges $\{e \mid e \in c\}$ are selected in the tree.
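Here $z_e$ is short for $z_{\{e\}}$ for simplicity. Our bounding method is based on the following fact: for a set $\{a_1, a_2, \ldots, a_r\}$ ($a_j$ denotes the $j$-th element), its minimum is

$$\min_j \{a_j\} = \min_{p \in \Delta} \sum_j p_j a_j \quad (4)$$

where $\Delta$ is the probability simplex

$$\Delta = \Big\{ p \;\Big|\; p_j \ge 0, \ \sum_j p_j = 1 \Big\}.$$

We discuss the bound for $\phi_c z_c$ in two cases: $\phi_c \ge 0$ and $\phi_c < 0$.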
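If $\phi_c \ge 0$, we have

$$\phi_c z_c = \phi_c \min_{e \in c} z_e = \phi_c \min_{p_c \in \Delta} \sum_{e \in c} p_c^e z_e = \min_{p_c \in \Delta} \sum_{e \in c} \phi_c p_c^e z_e .$$

The second equation comes from Eq (4). For simplicity, let $g_c(p_c, z) = \sum_{e \in c} \phi_c p_c^e z_e$, so that

$$\phi_c z_c = \min_{p_c \in \Delta} g_c(p_c, z). \quad (5)$$

If $\phi_c < 0$, we have two upper bounds. One is commonly used in ILP when all the variables are binary:

$$a^* = \min_j \{a_j\}_{j=1}^{r} \iff a^* \le a_j \ \ \forall j, \quad a^* \ge \sum_j a_j - (r - 1).$$

According to the last inequality, we have the upper bound for negative scored factors

$$\phi_c z_c \le \phi_c \Big( \sum_{e \in c} z_e - (r_c - 1) \Big) \quad (6)$$

where $r_c$ is the number of edges in $c$. For simplicity, we use the notation $\sigma_c(z) = \phi_c \big( \sum_{e \in c} z_e - (r_c - 1) \big)$. The other upper bound when $\phi_c < 0$ is simple:

$$\phi_c z_c \le 0. \quad (7)$$

Notice that, for any parse tree, one of the upper bounds must be tight. Eq (6) is tight if $c$ appears in the parse tree ($z_c = 1$), otherwise Eq (7) is tight. Therefore $\phi_c z_c = \min\{\sigma_c(z), 0\}$. According to Eq (4), with $p_c = (p_c^1, p_c^2) \in \Delta$ and $h_c(p_c, z) = p_c^1 \sigma_c(z) + p_c^2 \cdot 0$, we have

$$\phi_c z_c = \min_{p_c \in \Delta} h_c(p_c, z). \quad (8)$$

Let

$$\psi(p, z) = \sum_{c, \phi_c \ge 0} g_c(p_c, z) + \sum_{c, \phi_c < 0} h_c(p_c, z).$$

Minimizing $\psi$ with respect to $p$, we have

$$\min_p \psi(p, z) = \min_p \Big( \sum_{c, \phi_c \ge 0} g_c(p_c, z) + \sum_{c, \phi_c < 0} h_c(p_c, z) \Big) = \sum_{c, \phi_c \ge 0} \min_{p_c} g_c(p_c, z) + \sum_{c, \phi_c < 0} \min_{p_c} h_c(p_c, z) = \sum_{c, \phi_c \ge 0} \phi_c z_c + \sum_{c, \phi_c < 0} \phi_c z_c = \phi(z).$$

The second equation holds since, for any two factors $c$ and $c'$, $g_c$ (or $h_c$) and $g_{c'}$ (or $h_{c'}$) are separable. The third equation comes from Eq (5) and Eq (8).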
Based on this, we have the following proposition:

Proposition 1. For any $p$ with $p_c \in \Delta$, and any $z \in \mathcal{Z}$, $\psi(p, z) \ge \phi(z)$.

Therefore, $\psi(p, z)$ is an upper bound function of $\phi(z)$. Furthermore, for fixed $p$, $\psi(p, z)$ is a linear function of the $z_e$ (see Eq (5) and Eq (8)); the variables $z_c$ for large factors are eliminated. Hence $z' = \arg\max_z \psi(p, z)$ can be solved efficiently by DP. After obtaining $z'$, we get an upper bound and a lower bound of $\phi(z^*)$: $\psi(p, z')$ and $\phi(z')$.
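Proposition 1 can be checked numerically on a single factor. The sketch below is only an illustration under assumed values: the helper names (`g_c`, `sigma_c`, `h_c`, `check_bounds`) and the sample weights 2.0 and -1.5 are hypothetical, and the negative case uses only the first simplex coordinate, since the second coordinate multiplies zero. It verifies that both bounds dominate $\phi_c z_c$ for every binary assignment and that the minimum over the simplex vertices recovers $\phi_c z_c$ exactly.

```python
import random
from itertools import product

def g_c(phi, p, z):
    """Upper bound for a factor with phi >= 0: sum_e phi * p_e * z_e."""
    return sum(phi * pe * ze for pe, ze in zip(p, z))

def sigma_c(phi, z):
    """phi * (sum_e z_e - (r_c - 1)), the ILP-style bound for phi < 0."""
    return phi * (sum(z) - (len(z) - 1))

def h_c(phi, p1, z):
    """Upper bound for a factor with phi < 0: p1 * sigma_c(z) + (1 - p1) * 0."""
    return p1 * sigma_c(phi, z)

def check_bounds(r=3, phi_pos=2.0, phi_neg=-1.5, trials=200, seed=0):
    rng = random.Random(seed)
    eps = 1e-12  # tolerance for floating-point rounding of the simplex weights
    vertices = [[1.0 if i == j else 0.0 for i in range(r)] for j in range(r)]
    for z in product((0, 1), repeat=r):
        z_c = min(z)  # the factor fires only if all its edges are selected
        for _ in range(trials):
            w = [rng.random() + 1e-9 for _ in range(r)]
            total = sum(w)
            p = [x / total for x in w]  # random point in the simplex
            p1 = rng.random()
            if g_c(phi_pos, p, z) < phi_pos * z_c - eps:
                return False
            if h_c(phi_neg, p1, z) < phi_neg * z_c - eps:
                return False
        # The minimum over p is attained at a vertex and equals phi_c * z_c.
        if min(g_c(phi_pos, v, z) for v in vertices) != phi_pos * z_c:
            return False
        if min(h_c(phi_neg, q, z) for q in (0.0, 1.0)) != phi_neg * z_c:
            return False
    return True
```

Calling `check_bounds()` with the defaults exercises all eight assignments of a three-edge factor.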
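The upper bound is expected to be as tight as possible. Using the min-max inequality, we get

$$\phi(z^*) = \max_{z \in \mathcal{Z}} \phi(z) = \max_{z \in \mathcal{Z}} \min_p \psi(p, z) \le \min_p \max_{z \in \mathcal{Z}} \psi(p, z)$$

which provides the tightest upper bound of $\phi(z^*)$.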
Since $\max_z \psi(p, z)$ is not differentiable w.r.t. $p$, the projected sub-gradient method (Calamai and Moré, 1987) is used to search for the saddle point. More specifically, in each iteration, we first fix $p$ and search for $z$ using DP, then we fix $z$ and update $p$ by

$$p^{new} = P_\Delta \Big( p - \alpha \frac{\partial \psi}{\partial p} \Big)$$

where $\alpha > 0$ is the step size in line search, and the function $P_\Delta(q)$ denotes the projection of $q$ onto the probability simplex $\Delta$. In this paper, we use the Euclidean projection, that is

$$P_\Delta(q) = \arg\min_{p \in \Delta} \| p - q \|_2^2$$

which can be solved efficiently by sorting (Duchi et al., 2008).
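The sorting-based Euclidean projection onto the simplex can be sketched as below. This follows the well-known procedure of Duchi et al. (2008); the function names and the plain-list representation are choices made for this illustration, and `subgradient_step` assumes the descent direction because $p$ is the minimizing variable.

```python
def project_to_simplex(q):
    """Euclidean projection of q onto {p : p_j >= 0, sum_j p_j = 1}."""
    u = sorted(q, reverse=True)
    cumsum = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        # The condition holds for a prefix of the sorted values;
        # the last index where it holds determines the threshold.
        if uj - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in q]

def subgradient_step(p, grad, alpha):
    """One projected sub-gradient step for the minimization over p."""
    return project_to_simplex([pj - alpha * gj for pj, gj in zip(p, grad)])
```

A point already on the simplex is left unchanged, while `[2.0, 0.0]` is mapped to the vertex `[1.0, 0.0]`.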

Branch and Bound Based Parsing
As discussed in Section 3.1, the B&B recursive procedure yields a binary tree structure called the Branch and Bound tree. Each node of the B&B tree has some fixed $z_e$, specifying some must-select edges and must-remove edges. The root of the B&B tree has no constraints, so it can produce all possible parse trees including $z^*$. Each node has two children. One adds a constraint $z_e = 1$ for a free edge $e$ and the other fixes $z_e = 0$. We can explore the search space $\{z \mid z_e \in \{0, 1\}\}$ by traversing the B&B tree in breadth first order.

Figure 1: A part of the B&B tree. $\phi$, $\psi$ are short for $\phi(z')$ and $\psi(p', z')$ respectively. For each node, some edges of the parse tree are fixed. All parse trees that satisfy the fixed edges compose the subset $S \subseteq \mathcal{Z}$. A min-max problem is solved to get the upper bound and lower bound of the optimal parse tree over $S$. Once the upper bound $\psi$ is less than $LB$, the node is removed safely.
Let $S \subseteq \mathcal{Z}$ be the subspace of parse trees satisfying the constraints in the branch of the node. For each node in the B&B tree, we solve $p', z' = \arg\min_p \max_{z \in S} \psi(p, z)$ to get the upper bound and lower bound of the best parse tree in $S$. A global lower bound $LB$ is maintained, which is the maximum of all obtained lower bounds. If the upper bound of the current node is lower than the global lower bound, the node can be pruned from the B&B tree safely. An example is shown in Figure 1.
When the upper bound is not tight, $\psi > LB$, we need to choose a good branching variable to generate the child nodes. Let $G(z') = \psi(p', z') - \phi(z')$ denote the gap between the upper bound and lower bound. This gap is actually the accumulated gaps of all factors $c$. Let $G_c$ be the gap of $c$

$$G_c = \begin{cases} g_c(p'_c, z') - \phi_c z'_c & \text{if } \phi_c \ge 0 \\ h_c(p'_c, z') - \phi_c z'_c & \text{if } \phi_c < 0 \end{cases}$$

We choose the branching variable heuristically: for each edge $e$, we define its gap as the sum of the gaps of factors that contain it

$$G_e = \sum_{c, e \in c} G_c .$$

The edge with the maximum gap is selected as the branching variable. Suppose there are $N$ nodes on a level of the B&B tree, and correspondingly, we get $N$ branching variables, among which we choose the one with the highest lower bound, as it likely reaches the optimal value faster.
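The gap-based branching heuristic can be sketched as follows. The data layout is an assumption of this illustration: each factor is a triple `(edges, phi, p)`, where `p` is a list of simplex weights aligned with `edges` when `phi >= 0` and the single weight $p_c^1$ when `phi < 0`; edges are hypothetical `(head, modifier)` pairs and `z` maps each edge to 0 or 1.

```python
def factor_gap(edges, phi, p, z):
    """Gap between the bound of one factor and its true score phi * z_c."""
    z_c = min(z[e] for e in edges)
    if phi >= 0:
        bound = sum(phi * pe * z[e] for pe, e in zip(p, edges))
    else:
        sigma = phi * (sum(z[e] for e in edges) - (len(edges) - 1))
        bound = p * sigma  # p is the scalar weight p_c^1 here
    return bound - phi * z_c

def select_branching_edge(factors, z, free_edges):
    """Return the free edge whose containing factors have the largest total gap."""
    gap = {e: 0.0 for e in free_edges}
    for edges, phi, p in factors:
        g = factor_gap(edges, phi, p, z)
        for e in edges:
            if e in gap:
                gap[e] += g
    return max(gap, key=gap.get)
```

In the small example used to exercise this sketch, the edge shared by both loose factors accumulates the largest gap and is chosen.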

Lower Bound Initialization
A large lower bound is critical for efficient pruning. In this section, we discuss an alternative way to initialize the lower bound $LB$. We apply a similar trick to get a lower bound function of $\phi(z)$.
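Similar to Eq (8), for $\phi_c \ge 0$, we have

$$\phi_c z_c = \max\{\sigma_c(z), 0\} = \max_{p_c \in \Delta} \big( p_c^1 \sigma_c(z) + p_c^2 \cdot 0 \big).$$

Using the fact that $\max_j \{a_j\} = \max_{p \in \Delta} \sum_j p_j a_j$, for $\phi_c < 0$ we have

$$\phi_c z_c = \max_{e \in c} \phi_c z_e = \max_{p_c \in \Delta} \sum_{e \in c} \phi_c p_c^e z_e .$$

Putting the two cases together, we get the lower bound function

$$\pi(p, z) = \sum_{c, \phi_c \ge 0} p_c^1 \sigma_c(z) + \sum_{c, \phi_c < 0} \sum_{e \in c} \phi_c p_c^e z_e .$$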

Algorithm 1 Branch and Bound based parsing
Require: $\{\phi_c\}$
Ensure: Optimal parse tree $z^*$
  Solve $p^*, z^* = \arg\max_{p, z} \pi(p, z)$
  Initialize $\mathcal{S} = \{\mathcal{Z}\}$, $LB = \pi(p^*, z^*)$
  while $\mathcal{S} \ne \emptyset$ do
    Set $\mathcal{S}' = \emptyset$ {nodes that survive from pruning}
    foreach $S \in \mathcal{S}$
      Solve $\min_p \max_{z \in S} \psi(p, z)$ to get $LB_S$, $UB_S$
    $LB = \max\{LB, \max_{S \in \mathcal{S}} LB_S\}$, update $z^*$
    foreach $S \in \mathcal{S}$, add $S$ to $\mathcal{S}'$ if $UB_S > LB$
    Select a branching variable $z_e$
    Clear $\mathcal{S} = \emptyset$
    foreach $S \in \mathcal{S}'$
      Add $S_1 = \{z \mid z \in S, z_e = 1\}$ to $\mathcal{S}$
      Add $S_2 = \{z \mid z \in S, z_e = 0\}$ to $\mathcal{S}$
  end while

For any $p$ with $p_c \in \Delta$, and any $z \in \mathcal{Z}$, $\pi(p, z) \le \phi(z)$. $\pi(p, z)$ is not concave; however, we can alternately optimize $z$ and $p$ to get a good approximation, which provides a lower bound for $\phi(z^*)$.
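The control flow of Algorithm 1 can be illustrated on a toy problem. The sketch below makes several simplifying assumptions that differ from the method in the text: the feasible set is an explicit list of binary vectors rather than the set of parse trees, the inner maximization is done by brute force instead of DP, $p$ is held fixed (uniform weights for non-negative factors, $p_c^1 = 0$ for negative ones) instead of being optimized by projected sub-gradient, edges are branched in index order, and the lower bound starts at minus infinity. All names are hypothetical.

```python
def phi(z, factors):
    """True score: a factor (edge_indices, weight) fires if all its edges are on."""
    return sum(w for c, w in factors if all(z[e] for e in c))

def psi(z, factors):
    """Edge-decomposable upper bound with fixed p: uniform weights for
    non-negative factors; negative factors are bounded by 0 (p_c^1 = 0)."""
    return sum(w * sum(z[e] for e in c) / len(c) for c, w in factors if w >= 0)

def branch_and_bound(Z, factors, n_edges):
    """Breadth-first B&B over an explicit feasible set Z of binary tuples."""
    best_z, lb = None, float("-inf")
    nodes = [list(Z)]  # each node is the subset of Z consistent with its fixed edges
    for e in list(range(n_edges)) + [None]:
        bounds = []
        for S in nodes:
            z_ub = max(S, key=lambda z: psi(z, factors))  # stands in for DP
            if phi(z_ub, factors) > lb:
                lb, best_z = phi(z_ub, factors), z_ub
            bounds.append((psi(z_ub, factors), S))
        # Prune nodes whose upper bound cannot beat LB; singletons are resolved.
        nodes = [S for ub, S in bounds if ub > lb and len(S) > 1]
        if e is None or not nodes:
            break
        children = []
        for S in nodes:
            for value in (1, 0):
                part = [z for z in S if z[e] == value]
                if part:
                    children.append(part)
        nodes = children
    return best_z, lb
```

Because `psi` dominates `phi` on every vector, pruning a node whose bound does not exceed the current lower bound never discards the optimum; with a fixed `p` the bound is generally looser than the min-max bound in the text, so more nodes are expanded.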

Summary
We summarize our B&B algorithm in Algorithm 1.
It is worth pointing out that the description so far assumes that the backbone DP uses first order models. However, the backbone DP can also be the second or third order version. The difference is that, for higher order DP, higher order factors such as adjacent siblings and grandchildren are directly handled as local factors.
In the worst case, all the edges are selected for branching, and the complexity grows exponentially in sentence length. However, in practice, it is quite efficient, as we will show in the next section.

Experimental Settings
The datasets we used are the English Penn Tree Bank (PTB) and Chinese Tree Bank 5.0 (CTB5). We use the standard train/develop/test split as described in Table 1.
We extracted dependencies using Joakim Nivre's Penn2Malt tool with standard head rules: Yamada and Matsumoto's (Yamada and Matsumoto, 2003) for English, and Zhang and Clark's (Zhang and Clark, 2008) for Chinese. Unlabeled attachment score (UAS) is used to evaluate parsing quality. The B&B parser is implemented in C++. All the experiments are conducted on an Intel Core i5-2500 CPU at 3.30GHz.
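For reference, UAS is the fraction of words whose predicted head matches the gold head. The sketch below is a minimal illustration with an assumed input format (one list of head indices per sentence); it scores every token and does not apply any punctuation filtering, which evaluation setups often add.

```python
def uas(gold, pred):
    """Unlabeled attachment score over a corpus.
    gold, pred: lists of sentences, each a list of head indices per word."""
    correct = 0
    total = 0
    for g_heads, p_heads in zip(gold, pred):
        for g, p in zip(g_heads, p_heads):
            correct += int(g == p)
            total += 1
    return correct / total if total else 0.0
```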

Baseline: DP Based Second Order Parser
We use the dynamic programming based second order parser (Carreras, 2007) as the baseline. Averaged structured perceptron (Collins, 2002) is used for parameter estimation. We determine the number of iterations on the validation set, which is 6 for both corpora.
For English, we train the POS tagger using linear chain perceptron on training set, and predict POS tags for the development and test data. The parser is trained using the automatic POS tags generated by 10 fold cross validation. For Chinese, we use the gold standard POS tags.
We use five types of features: unigram features, bigram features, in-between features, adjacent sibling features and outermost grand-child features. The first three types of features were first introduced by McDonald et al. (2005a) and the last two types of features are used by Carreras (2007). All the features are concatenations of surrounding words, lower cased words (English only), word length (Chinese only), prefixes and suffixes of words (Chinese only), POS tags, coarse POS tags which are derived from POS tags using a simple mapping table, distance between head and modifier, and direction of edges. For English, we used 674 feature templates to generate a large number of features, and finally got 86.7M non-zero weighted features after training. The baseline parser got 92.81% UAS on the testing set. For Chinese, we used 858 feature templates, and finally got 71.5M non-zero weighted features after training. The baseline parser got 86.89% UAS on the testing set.

B&B Based Parser with Non-local Features
We use the baseline parser as the backbone of our B&B parser. We tried different types of non-local features as listed below:
• All grand-child features. Notice that this feature can be handled by Koo's second order model directly.
• All great grand-child features.
• All sibling features: all the pairs of edges with common head. An example is shown in Figure 2.
• All tri-sibling features: all the 3-tuples of edges with common head.
• Comb features: for any word with more than 3 consecutive modifiers, the set of all the edges from the word to the modifiers form a comb.
• Hand crafted features: We perform cross validation on the training data using the baseline parser, and designed features that may correct the most common errors. We designed 13 hand-crafted features for English in total. One example is shown in Figure 3. For Chinese, we did not add any hand-crafted features, as the errors in the cross validation result vary a lot, and we did not find general patterns to fix them.

Implementation Details
To speed up the solution of the min-max subproblem, for each node in the B&B tree, we initialize $p$ with the optimal solution of its parent node. Since the child node fixes only one additional edge, its optimal point is likely to be close to its parent's. For the root node of the B&B tree, we initialize $p_c^e = \frac{1}{r_c}$ for factors with non-negative weights and $p_c^1 = 0$ for negative weighted factors.

Figure 3: An example of a hand-crafted feature: for the word sequence A . . . rather than A, where A is a preposition, the first A is the head of than, than is the head of rather and the second A. Example sentence: "regulation occurs through inaction, rather than through ..."
Step size $\alpha$ is initialized with $\max_{c, \phi_c \ne 0} \{ \frac{1}{|\phi_c|} \}$, as the vector $p$ is bounded in a unit box. $\alpha$ is updated following a previously proposed strategy. Two stopping criteria are used. One is $0 \le \psi_{old} - \psi_{new} \le \epsilon$, where $\epsilon > 0$ is a given precision. The other checks if the bound is tight: $UB = LB$. Because all features are boolean (note that they can be integer), their weights are integer during each perceptron update, hence the scores of parse trees are discrete. The minimal gap between different scores is $\frac{1}{N \times T}$ after averaging, where $N$ is the number of training samples, and $T$ is the iteration number for perceptron training. Therefore the upper bound can be tightened as $UB = \frac{\lfloor N T \psi \rfloor}{N T}$. During testing, we use the pre-pruning method used in Martins et al. (2009) for both datasets to balance parsing quality and speed. This method uses a simple classifier to select the top $k$ candidate heads for each word and excludes the other heads from the search space. In our experiment, we set $k = 10$.

Table 2: Comparison between our system and the state-of-the-art systems (UAS, %).

System                                English   Chinese
Bohnet and Kuhn (2012)                93.39     87.5
Systems using additional resources
Suzuki et al. (2009)                  93.79     N/A
Koo et al. (2008)                     93.5      N/A
Chen et al. (2012)                    92.76     N/A
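The bound tightening described in the implementation details (rounding $\psi$ down to the nearest multiple of $\frac{1}{NT}$) can be illustrated with exact rational arithmetic. The function name and the use of `Fraction` are choices made for this sketch; it assumes, as stated in the text, that every achievable tree score is a multiple of $\frac{1}{NT}$.

```python
from fractions import Fraction
from math import floor

def tighten_upper_bound(psi, n_samples, n_iters):
    """Round psi down to the nearest multiple of 1 / (N * T).
    Valid only if every achievable score is a multiple of 1 / (N * T)."""
    step = n_samples * n_iters
    return Fraction(floor(psi * step), step)
```

For example, with $N = 2$ and $T = 2$ the score grid has spacing 1/4, so an upper bound of 7/10 is tightened to 1/2.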

Main Result
Experimental results are listed in Table 2. For comparison, we also include results of representative state-of-the-art systems. For the third order parser, we re-implemented Model 1, and removed the longest sentence in the CTB dataset, which contains 240 words, due to the $O(n^4)$ space complexity. For ILP based parsing, we used TurboParser, a speed-optimized parser toolkit. We trained full models (which use all grandchild features, all sibling features and head bigram features (Martins et al., 2011)) for both datasets using its default settings. We also list the performance reported in its documentation on the English corpus. The observation is that the all-sibling features are most helpful for our parser, as some good sibling features cannot be encoded in a DP based parser. For example, a matched pair of parentheses are always siblings, but their head may lie between them. Another observation is that all great grandchild features and all tri-sibling features slightly hurt the performance, and we excluded them from the final system.
When no additional resource is available, our parser achieved competitive performance: 93.17% Unlabeled Attachment Score (UAS) for English at a speed of 177 words per second and 87.25% for Chinese at a speed of 97 words per second. Higher UAS is reported by joint tagging and parsing (Bohnet and Nivre, 2012) or system integration (Bohnet and Kuhn, 2012), which benefits from both transition based parsing and graph based parsing. Previous work shows that a combination of the two parsing techniques can learn to overcome the shortcomings of each non-integrated system (Nivre and McDonald, 2008; Zhang and Clark, 2008). System combination will be an interesting topic for our future research. The highest reported performance on the English corpus is 93.79%, obtained by semi-supervised learning with a large amount of unlabeled data (Suzuki et al., 2009).

Tradeoff Between Accuracy and Speed
In this section, we study the trade off between accuracy and speed using different pre-pruning setups.
In Table 3, we show the parsing accuracy and inference time in the testing stage with different numbers of candidate heads $k$ in the pruning step. We can see that, on the English dataset, when $k \ge 10$, our parser gains a 2-3 times speedup without losing much parsing accuracy. There is a further increase in speed with smaller $k$, at the cost of some accuracy. Compared with TurboParser, our parser is less efficient but more accurate. Zhang and McDonald (2012) is a state-of-the-art system which adopts cube pruning for efficient parsing. Note that they did not use pruning, which seems to increase parsing speed with little loss in accuracy. Moreover, they did labeled parsing, which also makes their speed not directly comparable.
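The pre-pruning step that controls this tradeoff can be sketched as follows. The sketch assumes the candidate-head classifier has already produced a score `head_scores[m][h]` for each modifier `m` and candidate head `h`; the classifier itself, the function name and the data layout are hypothetical.

```python
def prune_heads(head_scores, k):
    """Keep the k highest-scoring candidate heads for every word.
    head_scores[m][h] is an (assumed) classifier score for h heading m.
    Returns one set of allowed head indices per word."""
    allowed = []
    for scores in head_scores:
        ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
        allowed.append(set(ranked[:k]))
    return allowed
```

Edges whose head is not in the allowed set of the modifier are then excluded from the search space, which shrinks both the DP chart and the set of free edges available for branching.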
For each node of B&B tree, our parsing algorithm uses projected sub-gradient method to find the saddle point, which requires a number of calls to a DP, hence the efficiency of Algorithm 1 is mainly determined by the number of DP calls. Figure 4 and

Polynomial Non-local Factors
Our bounding strategy can handle a family of non-local factors that can be expressed as a polynomial function of local factors. To see this, suppose

$$z_c = \sum_i \alpha_i \prod_{e \in E_i} z_e .$$

For each $i$, we introduce a new variable $z_{E_i} = \min_{e \in E_i} z_e$. Because $z_e$ is binary, $z_{E_i} = \prod_{e \in E_i} z_e$. In this way, we replace $z_c$ by several $z_{E_i}$ that can be handled by our bounding strategy.
We give two examples of these polynomial non-local factors. The first is the OR of local factors: $z_c = \max\{z_e, z_{e'}\}$, which can be expressed as $z_c = z_e + z_{e'} - z_e z_{e'}$. The second is the factor of the valency feature (Martins et al., 2009). Let the binary variable $v_{ik}$ indicate whether word $i$ has $k$ modifiers. Given $\{z_e\}$ for the edges with head $i$, $\{v_{ik} \mid k = 1, \ldots, n - 1\}$ can be solved by

$$\sum_{k} k^j v_{ik} = \Big( \sum_{e} z_e \Big)^j, \qquad 1 \le j \le n - 1 .$$

The left side of the equation is a linear function of $v_{ik}$. The right side of the equation is a polynomial function of $z_e$. Hence, $v_{ik}$ can be expressed as a polynomial function of $z_e$.

Figure 5: Average number of calls to DP relative to sentence length with different pruning settings; $k$ denotes the number of candidate heads of each word in the pruning step.
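The OR example can be made concrete by expanding it into conjunctive factors of the kind the bounding strategy handles. In the sketch below, a factor is an assumed pair `(edges, weight)` that fires when all of its edges are selected; `expand_or_factor` and `score` are hypothetical helper names.

```python
def expand_or_factor(phi, e1, e2):
    """Rewrite phi * max(z_e1, z_e2) as phi*z_e1 + phi*z_e2 - phi*z_e1*z_e2,
    i.e. as three conjunctive factors (edges, weight)."""
    return [((e1,), phi), ((e2,), phi), ((e1, e2), -phi)]

def score(factors, z):
    """Sum of weights of factors whose edges are all selected in z."""
    return sum(w * min(z[e] for e in c) for c, w in factors)
```

Note that the expansion introduces a factor with the opposite sign of the original weight, so both the non-negative and the negative case of the bound are exercised.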

k Best Parsing
Though our B&B algorithm is able to capture a variety of non-local features, it is still difficult to handle many kinds of features, such as the depth of the parse tree. Hence, a reranking approach may be useful in order to incorporate such information, where $k$ parse trees can be generated first and then a second pass model is used to rerank these candidates based on more global or non-local features. In addition, k-best parsing may be needed in many applications to use parse information, and especially to utilize information from multiple candidates to optimize task-specific performance. We have not conducted any experiment for k best parsing, hence we only discuss the algorithm. According to Proposition 1, we have

Proposition 2. Given $p$ and subset $S \subseteq \mathcal{Z}$, let $z^k$ denote the $k$-th best solution of $\max_{z \in S} \psi(p, z)$. If a parse tree $z' \in S$ satisfies $\phi(z') \ge \psi(p, z^k)$, then $z'$ is one of the $k$ best parse trees in subset $S$.
Proof. Since $z^k$ is the $k$-th best solution of $\psi(p, z)$, for $z^j$, $j > k$, we have $\psi(p, z^k) \ge \psi(p, z^j) \ge \phi(z^j)$. Since the size of the set $\{z^j \mid j > k\}$ is $|S| - k$, there are at least $|S| - k$ parse trees whose scores $\phi(z^j)$ are no greater than $\psi(p, z^k)$. Because $\phi(z') \ge \psi(p, z^k)$, $z'$ is at least the $k$-th best parse tree in subset $S$.
Therefore, we can search for the $k$ best parse trees in this way: for each sub-problem, we use DP to derive the $k$ best parse trees. For each parse tree $z$, if $\phi(z) \ge \psi(p, z^k)$, then $z$ is selected into the $k$ best set. The algorithm terminates when the $k$-th bound is tight.

Conclusion
In this paper we proposed a new parsing algorithm based on a Branch and Bound framework. The motivation is to use dynamic programming to search for the bound. Experimental results on the PTB and CTB5 datasets show that our method is competitive in terms of both performance and efficiency. Our method can be adapted to non-projective dependency parsing, as well as to the k best MST algorithm (Hall, 2007) for finding the k best candidates.