Efficient Stacked Dependency Parsing by Forest Reranking

This paper proposes a discriminative forest reranking algorithm for dependency parsing that can be seen as a form of efficient stacked parsing. A dynamic programming shift-reduce parser produces a packed derivation forest which is then scored by a discriminative reranker, using the 1-best tree output by the shift-reduce parser as guide features in addition to third-order graph-based features. To improve efficiency and accuracy, this paper also proposes a novel shift-reduce parser that eliminates the spurious ambiguity of arc-standard transition systems. Testing on the English Penn Treebank data, forest reranking gave a state-of-the-art unlabeled dependency accuracy of 93.12.


Introduction
There are two main approaches to data-driven dependency parsing: graph-based and transition-based.
In the graph-based approach, global optimization algorithms find the highest-scoring tree with locally factored models (McDonald et al., 2005). While third-order graph-based models achieve state-of-the-art accuracy, they have $O(n^4)$ time complexity for a sentence of length $n$. Recently, some pruning techniques have been proposed to improve the efficiency of third-order models (Rush and Petrov, 2012; Zhang and McDonald, 2012).
The transition-based approach usually employs the shift-reduce parsing algorithm with linear-time complexity (Nivre, 2008). It greedily chooses the transition with the highest score, so the resulting transition sequence is not always globally optimal. The beam search algorithm improves parsing flexibility in deterministic parsing (Zhang and Clark, 2008; Zhang and Nivre, 2011), and dynamic programming makes beam search more efficient (Huang and Sagae, 2010).
There is also an alternative approach that integrates graph-based and transition-based models (Sagae and Lavie, 2006; Zhang and Clark, 2008; Nivre and McDonald, 2008; Martins et al., 2008). Martins et al. (2008) formulated their approach as stacking of parsers where the output of the first-stage parser is provided to the second as guide features. In particular, they used a transition-based parser for the first stage and a graph-based parser for the second stage. The main drawback of this approach is that the efficiency of the transition-based parser is sacrificed because the second stage employs full parsing. This paper proposes an efficient stacked parsing method through discriminative reranking with higher-order graph-based features, which works on the forests output by the first-stage dynamic programming shift-reduce parser and integrates non-local features efficiently with cube-pruning (Huang and Chiang, 2007). The advantages of our method are as follows:
• Unlike the conventional stacking approach, the first-stage shift-reduce parser prunes the search space of the second-stage graph-based parser.
• In addition to guide features, the second-stage graph-based parser can employ the scores of the first-stage parser, which cannot be incorporated in standard graph-based models.
• In contrast to joint transition-based/graph-based approaches (Zhang and Clark, 2008; Bohnet and Kuhn, 2012), which require a large beam size and make dynamic programming impractical, our two-stage approach can integrate both models with little loss of efficiency.

Figure 1: The arc-standard transition-based dependency parsing system with dynamic programming: the wildcard symbol means "take anything", and a ↷ b denotes that a tree b is attached to a tree a.
In addition, the elimination of spurious ambiguity from the arc-standard shift-reduce parser improves the efficiency and accuracy of our approach.

Arc-Standard Shift-Reduce Parsing
We use a beam search shift-reduce parser with dynamic programming as our baseline system. Figure 1 shows it as a deductive system (Shieber et al., 1995). A state is defined as the following:

$$\ell : \langle i, j, s_d | s_{d-1} | \ldots | s_1 | s_0 \rangle : \pi$$

where $\ell$ is the step size, $[i, j]$ is the span of the topmost stack element $s_0$, and $s_d | s_{d-1} | \ldots | s_1$ shows a stack with $d$ elements at the top, where $d$ is the window size used for defining features. The axiom is initialized with an input sentence of length $n$, $x = w_0 \ldots w_n$, where $w_0$ is a special root symbol $\$_0$. The system takes $2n$ steps for a complete analysis. $\pi$ is a set of pointers to the predictor states, each of which is the state just before shifting the root word of $s_0$ into the stack¹. Dynamic programming merges equivalent states in the same step if they have the same feature values. We add the feature templates shown in Table 1 to the feature templates of Huang and Sagae (2010).

Table 1: Additional feature templates for shift-reduce parsers: q denotes the input queue. h, lc and rc are the head, leftmost child and rightmost child of a stack element s. lc2 and rc2 denote the second leftmost and rightmost children. t and w are a part-of-speech (POS) tag and a word.
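The transition system described above can be illustrated with a minimal sketch that applies a given action sequence without beam search or dynamic programming. The action names `shift`, `left` and `right`, the function `run_arc_standard`, and the choice to start with the root symbol already on the stack are assumptions made for this illustration, not the authors' implementation.

```python
from typing import List

def run_arc_standard(n: int, actions: List[str]) -> List[int]:
    """Apply arc-standard actions to words 1..n and return the head of each word.

    heads[i] is the head index of word i; heads[0] stays -1 for the root symbol.
    """
    stack = [0]                      # assumption: the root symbol starts on the stack
    queue = list(range(1, n + 1))    # remaining input words
    heads = [-1] * (n + 1)
    for act in actions:
        if act == "shift":
            stack.append(queue.pop(0))
        elif act == "left":          # s1 becomes a dependent of s0
            s0 = stack.pop()
            s1 = stack.pop()
            heads[s1] = s0
            stack.append(s0)
        elif act == "right":         # s0 becomes a dependent of s1
            s0 = stack.pop()
            heads[s0] = stack[-1]
        else:
            raise ValueError(f"unknown action: {act}")
    if stack != [0] or queue:
        raise ValueError("incomplete derivation")
    return heads

# "I saw her": "saw" (2) heads "I" (1) and "her" (3); the root heads "saw".
actions = ["shift", "shift", "left", "shift", "right", "right"]
heads = run_arc_standard(3, actions)
```

Under these assumptions a sentence of three words is analysed in six actions (three shifts and three reductions), which matches the 2n step count stated above.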
Dynamic programming not only makes the shift-reduce parser with beam search more efficient but also produces a packed forest that encodes an exponential number of dependency trees. A packed dependency forest can be represented by a weighted (directed) hypergraph. A weighted hypergraph is a pair $H = \langle V, E \rangle$, where $V$ is the set of vertices and $E$ is the set of hyperedges. Each hyperedge $e \in E$ is a tuple $e = \langle T(e), h(e), f_e \rangle$, where $h(e) \in V$ is its head vertex, $T(e) \in V^+$ is an ordered list of tail vertices, and $f_e$ is a weight for $e$.

Figure 2: An example of a packed dependency (derivation) forest: each vertex has information about the topmost stack element of its corresponding state.

Figure 2 shows an example of a packed forest. Each binary hyperedge corresponds to a reduce action, and each leaf vertex corresponds to a shift action. Each vertex also corresponds to a state, and parse histories on the states can be encoded into the vertices. In the example, information about the topmost stack element is attached to the corresponding vertex marked with a non-terminal symbol X.
Weights are omitted in the example. In practice, we attach each reduction weight to the corresponding hyperedge, and add the shift weight to the reduction weight when a shifted word is reduced.
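To make the hypergraph definition concrete, the following sketch stores hyperedges as (tails, head, weight) triples and counts how many trees are packed under each vertex. The class `Hyperedge`, the function `count_trees`, and the toy vertex names are hypothetical; the sketch assumes the caller supplies the vertices in bottom-up order.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Hyperedge:
    tails: Tuple[str, ...]   # T(e): ordered list of tail vertices
    head: str                # h(e): head vertex
    weight: float            # f_e: weight of the hyperedge

def count_trees(edges: List[Hyperedge], leaves: List[str], order: List[str]) -> Dict[str, int]:
    """Count the derivations under each vertex; `order` must be bottom-up."""
    incoming: Dict[str, List[Hyperedge]] = {}
    for e in edges:
        incoming.setdefault(e.head, []).append(e)
    count = {v: 1 for v in leaves}           # a leaf (shift) has exactly one derivation
    for v in order:
        if v in count:
            continue
        total = 0
        for e in incoming.get(v, []):
            product = 1
            for t in e.tails:                # combine the alternatives of every tail
                product *= count[t]
            total += product
        count[v] = total
    return count

# Toy forest: the span ABC can be built as (AB, C) or as (A, BC).
edges = [
    Hyperedge(("A", "B"), "AB", 0.0),
    Hyperedge(("B", "C"), "BC", 0.0),
    Hyperedge(("AB", "C"), "ABC", 0.0),
    Hyperedge(("A", "BC"), "ABC", 0.0),
]
tree_counts = count_trees(edges, ["A", "B", "C"], ["A", "B", "C", "AB", "BC", "ABC"])
```

Because the counts multiply across tails and add across alternative hyperedges, a forest with a modest number of hyperedges can encode a very large number of trees.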

Arc-Standard Shift-Reduce Parsing without Spurious Ambiguity
One solution to remove spurious ambiguity in the arc-standard transition system is to give priority to the construction of left arcs over that of right arcs (or vice versa) like Eisner (1997). For example, an Earley dependency parser (Hayashi et al., 2012) attaches all left dependents to a word before right dependents. The parser uses a scan action to stop the construction of left arcs. We apply this idea to the arc-standard transition system and show the resulting transition system in Figure 3. We introduce the * symbol to indicate that the root node of the topmost element on the stack has not been scanned yet. The shift and reduce ↷ actions can be used only when the root of the topmost element on the stack has already been scanned, and all left arcs are always attached to the head before the head is scanned.
The arc-standard shift-reduce parser without spurious ambiguity takes 3n steps to finish parsing, and the additional n scan actions add surplus vertices and (unary) hyperedges to a packed forest. However, it is easy to remove them from the packed forest because the consequent state of a scan action has a unique antecedent state and all the hyperedges going out from a vertex corresponding to the consequent state can be attached to the vertex corresponding to the antecedent state. The scan weight of the removed unary hyperedge is added to each weight of the hyperedges attached to the antecedent.
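The spurious ambiguity that this section removes can be observed by brute force. The sketch below enumerates every complete arc-standard derivation for a short sentence and counts how many derivations yield each tree. It omits the artificial root and the scan action for brevity, and the function name `derivations` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

def derivations(n: int) -> List[Tuple[int, ...]]:
    """Return the head vector produced by every complete arc-standard derivation."""
    results: List[Tuple[int, ...]] = []

    def step(stack: List[int], next_word: int, heads: Tuple[int, ...]) -> None:
        if next_word == n and len(stack) == 1:
            results.append(heads)            # all words consumed, one tree left
            return
        if next_word < n:                    # shift the next input word
            step(stack + [next_word], next_word + 1, heads)
        if len(stack) >= 2:
            s1, s0 = stack[-2], stack[-1]
            left = list(heads)
            left[s1] = s0                    # s1 becomes a dependent of s0
            step(stack[:-2] + [s0], next_word, tuple(left))
            right = list(heads)
            right[s0] = s1                   # s0 becomes a dependent of s1
            step(stack[:-1], next_word, tuple(right))

    step([], 0, tuple([-1] * n))
    return results

derivation_counts = Counter(derivations(3))
```

For three words there are eight derivations but only seven distinct trees: the tree in which the middle word heads both neighbours is reached twice, once by attaching the left dependent first and once by attaching the right dependent first. Forcing left arcs to be built before right arcs leaves a single derivation for that tree.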

Experiments (Spurious Ambiguity vs. Non-Spurious Ambiguity)
We conducted experiments on the English Penn Treebank (PTB) data to compare spurious and non-spurious shift-reduce parsers. We split the WSJ part of PTB into sections 02-21 for training, section 22 for development, and section 23 for test. We used the head rules (Yamada and Matsumoto, 2003) to convert phrase structure to dependency structure.
Figure 3: The dynamic programming arc-standard transition-based deductive system without spurious ambiguity: the * symbol represents that the root node of the topmost element on the stack has not been scanned yet.

Table 2: Unlabeled accuracy scores (UAS) and parsing times (+forest dumping times, seconds per sentence) for parsing development (WSJ22) and test (WSJ23) data with the spurious shift-reduce parser and the proposed shift-reduce parser (non-sp.) using several beam sizes.
We used an early update version of the averaged perceptron algorithm (Collins and Roark, 2004; Huang et al., 2012) to train two shift-reduce dependency parsers with a beam size of 12. Table 2 shows experimental results of parsing the development and test datasets with each of the spurious and non-spurious shift-reduce parsers using several beam sizes. Parsing accuracies were evaluated by unlabeled accuracy scores (UAS) with and without punctuation. The parsing times were measured on an Intel Core i7 2.8GHz. The average CPU time (per sentence) includes that of dumping packed forests. This result indicates that the non-spurious parser achieves better accuracies than the spurious parser without loss of efficiency.

beam size                        8      32     128
% of distinct trees (10)         93.5   94.8   95.0
% of distinct trees (100)        81.8   84.9   87.2
% of distinct trees (1000)       70.6   73.1   77.6
% of distinct trees (10000)      62.1   64.3   65.6

Table 3: The percentages of distinct dependency trees in 10, 100, 1000 and 10000 best trees extracted from spurious forests with several beam sizes.

Figure 4 shows oracle unlabeled accuracies of spurious k-best lists, non-spurious k-best lists, spurious forests, and non-spurious forests. We extract an oracle tree from each packed forest using the forest oracle algorithm (Huang, 2008). Both forests produce much better results than the k-best lists, and non-spurious forests have almost the same oracle accuracies as spurious forests.

Figure 4: Each plot shows oracle unlabeled accuracies of spurious k-best lists, spurious forests, and non-spurious forests (curves: "kbest", "forest", "non-sp-kbest", "non-sp-forest"). The oracle accuracies are evaluated using UAS with punctuation.
However, as shown in Table 3, spurious forests encode a number of non-unique dependency trees while all dependency trees in non-spurious forests are distinct from each other.
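The evaluation metric used in this section can be sketched as follows. The punctuation tag set `PUNCT_TAGS` and the function name `uas` are assumptions for illustration; the exact set of tokens excluded in the "without punctuation" setting is not listed here.

```python
from typing import FrozenSet, List

# Assumption: a typical PTB punctuation tag set; the excluded tags are not given in the text.
PUNCT_TAGS: FrozenSet[str] = frozenset({",", ".", ":", "``", "''"})

def uas(gold: List[int], pred: List[int], tags: List[str], include_punct: bool = True) -> float:
    """Percentage of tokens whose predicted head equals the gold head."""
    pairs = [
        (g, p)
        for g, p, t in zip(gold, pred, tags)
        if include_punct or t not in PUNCT_TAGS
    ]
    correct = sum(1 for g, p in pairs if g == p)
    return 100.0 * correct / len(pairs)

gold = [2, 0, 2, 2]                 # "I saw her ." with 1-based heads, 0 = root
pred = [2, 0, 2, 3]                 # the period is attached to the wrong head
tags = ["PRP", "VBD", "PRP", "."]
uas_with_punct = uas(gold, pred, tags)
uas_without_punct = uas(gold, pred, tags, include_punct=False)
```

In this toy case the only error is on the punctuation token, so the score rises from 75.0 to 100.0 when punctuation is excluded.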

Discriminative Reranking Model
We define a reranking model based on the graph-based features as the following:

$$\hat{y} = \arg\max_{y \in H} \alpha \cdot f_g(x, y) \qquad (1)$$

where $\alpha$ is a weight vector, $f_g$ is a feature vector ($g$ indicates "graph-based"), $x$ is the input sentence, $y$ is a dependency tree and $H$ is a dependency forest. This model assumes a hyperedge factorization which induces a decomposition of the feature vector as the following:

$$\alpha \cdot f_g(x, y) = \sum_{e \in y} \alpha \cdot f_{g,e}(e) \qquad (2)$$

where $f_{g,e}$ is the local feature vector of hyperedge $e$. The search problem can be solved by simply using the (generalized) Viterbi algorithm (Klein and Manning, 2001). When using non-local features, the hyperedge factorization is redefined to the following:

$$\alpha \cdot f_g(x, y) = \sum_{e \in y} \left( \alpha \cdot f_{g,e}(e) + \alpha \cdot f_{g,e,N}(e) \right) \qquad (3)$$

where $f_{g,e,N}$ is a non-local feature vector. Though the cube-pruning algorithm (Huang and Chiang, 2007) is an approximate decoding technique based on a k-best Viterbi algorithm, it can calculate the non-local scores efficiently. The baseline score can be taken into the reranker as a linear interpolation:

$$\hat{y} = \arg\max_{y \in H} \; \beta \cdot sc_{tr}(x, y) + \alpha \cdot f_g(x, y) \qquad (4)$$

where $sc_{tr}$ is the score from the baseline parser ($tr$ indicates "transition-based"), and $\beta$ is a scaling factor.
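Under a hyperedge factorization, the search reduces to a bottom-up pass over the forest. The following sketch is one way to write the generalized Viterbi step, assuming each hyperedge already carries its local score (the dot product of the weight vector and the hyperedge features) and that vertices are supplied in bottom-up order. The type alias `Edge`, the function `viterbi`, and the toy scores are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[Tuple[str, ...], str, float]   # (tails T(e), head h(e), local score)

def viterbi(edges: List[Edge], leaves: List[str], order: List[str]):
    """Return the best score and the best incoming hyperedge for every vertex.

    `order` must list vertices bottom-up so that tails precede their heads.
    """
    incoming: Dict[str, List[Edge]] = {}
    for e in edges:
        incoming.setdefault(e[1], []).append(e)
    best: Dict[str, float] = {v: 0.0 for v in leaves}
    back: Dict[str, Edge] = {}
    for v in order:
        if v in best:
            continue
        for tails, head, score in incoming[v]:
            total = score + sum(best[t] for t in tails)
            if v not in best or total > best[v]:
                best[v] = total              # keep the highest-scoring derivation
                back[v] = (tails, head, score)
    return best, back

edges = [
    (("A", "B"), "AB", 1.0),
    (("B", "C"), "BC", 2.0),
    (("AB", "C"), "ABC", 0.5),
    (("A", "BC"), "ABC", 0.25),
]
best, back = viterbi(edges, ["A", "B", "C"], ["A", "B", "C", "AB", "BC", "ABC"])
```

Following the back-pointers from the goal vertex recovers the best tree. Non-local features break this decomposition because they depend on more than one hyperedge, which is why an approximate k-best procedure such as cube-pruning is needed in that case.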

Local Features
While the inference algorithm is a simple Viterbi algorithm, the discriminative model can use all tri-sibling features and some grand-sibling features² (Koo and Collins, 2010) as a local scoring factor in addition to the first-order and sibling second-order graph-based features. This is because the first-stage shift-reduce parser uses the features described in Section 2 and this information can be encoded into vertices of a hypergraph.
The reranking model also uses guide features extracted from the 1-best tree predicted by the first-stage shift-reduce parser. We define the guide features as first-order relations like those used in Nivre and McDonald (2008), though our parser handles only unlabeled and projective dependency structures. We summarize the features for the discriminative reranking model as follows:
• First- and second-order features: these features are the same as those used in MST parser³.
• Grand-child features: we define tri-gram POS features with POS tags of grand parent, parent, and rightmost or leftmost child.
• Tri-sibling features: we define tri-gram features with three POS tags of child, sibling, and tri-sibling. We also define tri-gram features with one word and two POS tags of the above.
• Guide features: we define a feature indicating whether an arc from a child to its parent is present in the 1-best tree predicted by the first-stage shift-reduce parser, conjoined with the POS tags of the parent and child.
• PP-Attachment features: when a parent word is a preposition, we define tri-gram features with the parent word and POS tags of grand parent and the rightmost child.
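As an illustration of the guide feature in the list above, the sketch below builds one feature string for a candidate arc. The function name `guide_feature`, the string format, and the "ROOT" tag for the artificial root are assumptions; only the content of the feature (arc presence in the 1-best tree, conjoined with the two POS tags) comes from the description.

```python
from typing import List

def guide_feature(child: int, parent: int, guide_heads: List[int], pos: List[str]) -> str:
    """Feature for a candidate arc parent -> child, given the first-stage 1-best tree."""
    in_guide = guide_heads[child] == parent      # is the arc present in the 1-best tree?
    return f"GUIDE:{in_guide}:p={pos[parent]}:c={pos[child]}"

# "$ I saw her": index 0 is the artificial root (its tag name is an assumption).
pos = ["ROOT", "PRP", "VBD", "PRP"]
guide_heads = [-1, 2, 0, 2]                      # 1-best tree from the first stage
agree = guide_feature(3, 2, guide_heads, pos)    # saw -> her, present in the guide tree
disagree = guide_feature(3, 1, guide_heads, pos) # I -> her, absent from the guide tree
```

Because the feature only inspects a single arc, it can be attached to hyperedges as a local feature without affecting the Viterbi decomposition.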

Non-local Features
To define richer features as a non-local factor, we extend a local reranking algorithm by augmenting each k-best item with all child vertices of its head vertex⁴. Information about all children enables the reranker to calculate the following features when reducing the head vertex:
• Grand-child features: we define tri-gram features with one word and two POS tags of grand parent, parent, and child.
• Grand-sibling features: we define 4-gram POS features with POS tags of grand parent, parent, child and sibling. We also define coordination features with POS tags of grand parent, parent and child when the sibling word is a coordinate conjunction.
• Valency features: we define a feature indicating the number of children of a head, conjoined with each of its word and POS tag.
When using non-local features, we removed the local grand-child features from the model.
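The valency feature from the list above needs the complete set of children of a head, which is why it is non-local. A minimal sketch, assuming the head vector of a finished subtree is available and using a hypothetical feature-string format:

```python
from typing import List

def valency_features(head: int, heads: List[int], words: List[str], pos: List[str]) -> List[str]:
    """Number of children of `head`, conjoined with its word and with its POS tag."""
    n_children = sum(1 for h in heads if h == head)
    return [
        f"VAL:{n_children}:w={words[head]}",
        f"VAL:{n_children}:t={pos[head]}",
    ]

words = ["$", "I", "saw", "her"]
pos = ["ROOT", "PRP", "VBD", "PRP"]      # the root tag name is an assumption
heads = [-1, 2, 0, 2]                    # "saw" heads both "I" and "her"
saw_valency = valency_features(2, heads, words, pos)
```

The count can only be finalized once no further dependents will be attached to the head, so a reranker would compute it when the head vertex is reduced.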

Oracle for Discriminative Training
A discriminative reranking model is trained on packed forests by using their oracle trees as the correct parse. More accurate oracles are essential to train a discriminative reranking model well.
While large size forests have much more accurate oracles than small size forests, large forests have too many hyperedges to train a discriminative model on them, as shown in Figure 4. The usual forest reranking algorithms (Huang, 2008;Hayashi et al., 2011) remove low quality hyperedges from large forests by using inside-outside forest pruning.
However, producing large forests and pruning them is computationally very expensive. Instead, we propose a simpler method to produce small forests which have more accurate oracles by forcing the beam search shift-reduce parser to keep the correct state in the beam buffer. As a result, the correct tree will always be encoded in a packed forest.
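The idea of forcing the correct state to survive beam pruning can be sketched as below. Whether the correct state replaces the lowest-ranked state or is kept as an extra state is not specified in the text; this sketch assumes it is appended as an extra state. The function `prune_beam` and the (score, label) state representation are hypothetical.

```python
from typing import Callable, List, Tuple

State = Tuple[float, str]   # (model score, state label); a stand-in for a parser state

def prune_beam(states: List[State], beam_size: int,
               is_gold: Callable[[State], bool]) -> List[State]:
    """Keep the top `beam_size` states, plus the gold state if it was pruned."""
    ranked = sorted(states, key=lambda s: s[0], reverse=True)
    beam = ranked[:beam_size]
    if not any(is_gold(s) for s in beam):
        pruned_gold = [s for s in ranked[beam_size:] if is_gold(s)]
        beam.extend(pruned_gold[:1])     # assumption: append rather than replace
    return beam

states = [(3.0, "a"), (2.0, "b"), (1.0, "gold"), (0.5, "c")]
beam = prune_beam(states, 2, lambda s: s[1] == "gold")
```

Applying this at every step during forest dumping for training data guarantees that the derivation of the correct tree is never lost, so the oracle tree of the resulting forest is the correct tree.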

Experimental Setting
Following Huang (2008), the training set (WSJ02-21) is split into 20 folds, and each fold is parsed by each of the spurious and non-spurious shift-reduce parsers using beam size 12 with the model trained on sentences from the remaining 19 folds, dumping the outputs as packed forests.
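The fold-wise procedure above can be sketched as a generator that yields, for each fold, the sentences to train the parser on and the held-out sentences to parse. How sentences are assigned to folds is not stated, so the round-robin assignment here is an assumption, and `jackknife_folds` is a hypothetical name.

```python
from typing import Iterator, List, Tuple

def jackknife_folds(sentences: List[str], k: int) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (training sentences, held-out sentences) for each of k folds."""
    folds = [sentences[i::k] for i in range(k)]      # assumption: round-robin split
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]

sentences = [f"sent{i}" for i in range(10)]
splits = list(jackknife_folds(sentences, 5))
```

Each sentence is parsed exactly once by a model that never saw it during training, so the forests used to train the reranker resemble the forests it will see at test time.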
The reranker is modeled by either equation (1) or (4). Based on our preliminary experiments using development data (WSJ22), we modeled the reranker with equation (1) when training, and with equation (4) when testing⁵ (i.e., the scores of the first-stage parser are not considered during training of the reranking model). This prevents the discriminative reranking features from under-training (Sutton et al., 2006; Hollingshead and Roark, 2008). A discriminative reranking model is trained on the packed forests by using the averaged perceptron algorithm with 5 iterations. When training non-local reranking models, we set the k-best size of cube-pruning to 5.
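A minimal sketch of averaged perceptron training with sparse features follows. It uses naive averaging (the weight vector is accumulated after every training example) rather than any optimized scheme, and the class `AveragedPerceptron` and its method names are hypothetical. In the setting above, `gold_feats` would be the features of the oracle tree and `pred_feats` those of the tree found by decoding with the reranking features alone.

```python
from collections import defaultdict
from typing import Dict

class AveragedPerceptron:
    def __init__(self) -> None:
        self.w: Dict[str, float] = defaultdict(float)     # current weights
        self.acc: Dict[str, float] = defaultdict(float)   # sum of weights over steps
        self.steps = 0

    def score(self, feats: Dict[str, float]) -> float:
        return sum(self.w.get(f, 0.0) * v for f, v in feats.items())

    def update(self, gold_feats: Dict[str, float], pred_feats: Dict[str, float]) -> None:
        """Move the weights toward the oracle tree and away from the predicted tree."""
        for f, v in gold_feats.items():
            self.w[f] += v
        for f, v in pred_feats.items():
            self.w[f] -= v

    def tick(self) -> None:
        """Call once per training example, after any update."""
        self.steps += 1
        for f, v in self.w.items():
            self.acc[f] += v

    def averaged(self) -> Dict[str, float]:
        return {f: v / self.steps for f, v in self.acc.items()}

model = AveragedPerceptron()
model.tick()                                   # example 1: prediction was correct
model.update({"a": 1.0}, {"b": 1.0})           # example 2: prediction was wrong
model.tick()
avg = model.averaged()
```

The averaged weights are the ones used at test time, where the first-stage score can additionally be interpolated with the scaling factor.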
For dumping packed forests for test data, spurious and non-spurious shift-reduce parsers are trained by the averaged perceptron algorithm. In all experiments on English data, we fixed beam size to 12 for training both parsers.

Test with Gold POS tags
We show the comparison of dumped spurious and non-spurious packed forests for training data in Table 4. [...] method described in Section 5.3. The 1-best accuracy of the non-spurious forests is higher than that of the spurious forests. As we expected, the results show that there are many non-unique dependency trees in the spurious forests. The spurious forests also get larger than the non-spurious forests. Table 5 shows how long the training on spurious and non-spurious forests took on an Opteron 8356 2.3GHz. It is clear from the results that training on non-spurious forests is more efficient than that on spurious forests. Table 6 shows the statistics of spurious and non-spurious packed forests dumped by shift-reduce parsers using beam size 12 for test data. The trends are similar to those for training data shown in Table 4. We show the results of the forest reranking algorithms for test data in [...] packed forests using four beam sizes 8, 12, 32, and 64. The reranking on non-spurious forests consistently achieves better accuracies and is slightly faster than that on spurious forests.

Table 5: Training times on both spurious and non-spurious packed forests (beam 12): pre-comp. denotes CPU time for feature extraction and attaching features to all hyperedges. The non-local models were trained setting the k-best size of cube-pruning to 5, and non-local features were calculated on-the-fly while training.

Test with Automatic POS tags
To compare the proposed reranking system with other systems, we evaluate its parsing accuracy on test data with automatic POS tags. We used the Stanford POS tagger⁶ with a model trained on sections 02-21 to tag development and test data, and used 10-way jackknifing to tag training data. The tagging accuracies on training, development, and test data were 97.1, 97.2, and 97.5, respectively.

Table 8: Comparison with other systems: the results were evaluated on test data (WSJ23) with automatic POS tags. "label" means labeled dependency parsing, and the CPU times of our systems were taken on an Intel Core i7 2.8GHz.

Table 8 shows the results of our proposed systems together with results from related work. The parsing times are reported in tokens/second for comparison. Note, however, that differences in parsing time do not directly reflect the efficiency of the algorithms because the systems were implemented in different programming languages and the times were measured in different environments. The accuracy of local reranking on non-spurious forests is the best among unlabeled shift-reduce parsers, but slightly behind the third-order graph-based systems (Koo and Collins, 2010; Zhang and McDonald, 2012; Rush and Petrov, 2012). It is likely that the difference comes from the fact that our local reranking model can define only some of the grand-child related features.
To define all grand-child features and other non-local features, we also experimented with the non-local reranking algorithm on non-spurious packed forests. It achieved almost the same accuracy as the previous third-order graph-based algorithms. Moreover, the computational overhead is very small when the k-best size of cube-pruning is set to a small value.

Analysis
One advantage of our reranking approach is that guide features can be defined as in stacked parsing. To analyze the effect of the guide features on parsing accuracy, we remove the guide features from baseline reranking models with and without non-local features used in Section 6.3. The results are shown in Tables 9 and 10. The parsing accuracies of the baseline reranking models are better than those of the models without guide features though the number of guide features is not large. Additionally, each model with guide features is smaller than that without guide features. This indicates that stacking has a good effect on training the models.
To further investigate the effects of guide features, we tried to define unlabeled versions of the second-order guide features used by Martins et al. (2008) and McClosky et al. (2012). However, these features did not produce good results, and finding the cause is important future work.
We also examined parsing errors in more detail. Table 11 shows root and sentence complete rates of three systems, the non-spurious shift-reduce parser, local reranking, and non-local reranking. The two reranking systems outperform the shift-reduce parser significantly, and the non-local reranking system is the best among them. Part of the difference between the shift-reduce parser and reranking systems comes from the correction of coordination errors. Table 12 shows the head correct rate, recall, precision, F-measure and complete rate of coordination structures, by which we mean the head and siblings of a token whose POS tag is CC. The head correct rate measures how often the head of the CC token is correct. The recall, precision, and F-measure are measured by counting arcs between the head and siblings. When the head of the CC token is incorrect, all arcs of the coordination structure are counted as incorrect. Therefore, the recall, precision, and F-measure are greatly affected by the head correct rate, and though the complete rate of non-local reranking is higher than that of local reranking, the results of the first three measures are lower.

Table 13: Recall, precision, and F-measure of grand-child structures whose grand parent is an artificial root symbol: these are measured on test data (WSJ23).
We assume that the improvements of non-local reranking over the others can be mainly attributed to the better prediction of the structures around the sentence root because most of the non-local features are useful for predicting these structures. Table 13 shows the recall, precision and F-measure of grand-child structures whose grand parent is a sentence root symbol $. The results support the above assumption. The root correct rate directly influences the prediction of the overall structure of a sentence, and it is likely that the reduction of root prediction errors brings better results.
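The root and sentence complete rates discussed in this analysis can be computed as in the sketch below. It assumes 1-based head indices with 0 for the artificial root and counts every token, including punctuation; the treatment of punctuation for these rates is not stated, so that choice and the function name `root_and_complete_rates` are assumptions.

```python
from typing import List, Tuple

def root_and_complete_rates(gold: List[List[int]], pred: List[List[int]]) -> Tuple[float, float]:
    """Fraction of sentences with the correct root, and with every head correct."""
    root_ok = 0
    complete = 0
    for g, p in zip(gold, pred):
        gold_roots = [i for i, h in enumerate(g) if h == 0]
        pred_roots = [i for i, h in enumerate(p) if h == 0]
        root_ok += gold_roots == pred_roots      # same token(s) attached to the root
        complete += g == p                       # all heads in the sentence match
    n = len(gold)
    return root_ok / n, complete / n

gold = [[2, 0, 2], [0, 1]]
pred = [[2, 0, 2], [2, 0]]                       # second sentence picks the wrong root
root_rate, complete_rate = root_and_complete_rates(gold, pred)
```

A wrong root usually forces several other heads to be wrong as well, which is consistent with the observation that fewer root errors go together with higher complete rates.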

Experiments on Chinese
We also experiment on the Penn Chinese Treebank (CTB5). Following Huang and Sagae (2010), we split it into training (secs 001-815 and 1001-1136), development (secs 886-931 and 1148-1151), and test (secs 816-885 and 1137-1147) sets, and use the head rules of Zhang and Clark (2008). The training set is split into 10 folds to dump packed forests for training of reranking models.
We set the beam size of both spurious and non-spurious parsers to 12, and the number of perceptron training iterations to 25 for the parsers and to 8 for both rerankers. Table 14 shows the results for the test sets. As we expected, reranking on non-spurious forests outperforms that on spurious forests.

The graph-based approach employs the algorithm of Eisner and Satta (1999), where spurious ambiguities are eliminated by the notion of split head automaton grammars (Alshawi, 1996). However, the arc-standard transition-based parser has the spurious ambiguity problem. Cohen et al. (2012) proposed a method to eliminate the spurious ambiguity of shift-reduce transition systems. Their method covers existing systems such as the arc-standard and non-projective transition-based parsers (Attardi, 2006). Our system copes only with the projective case, but is simpler than theirs and we show its efficacy empirically through some experiments.
The arc-eager shift-reduce parser also has a spurious ambiguity problem. Goldberg and Nivre (2012) addressed this problem by not only training with a canonical transition sequence but also with alternate optimal transitions that are calculated dynamically for a current state.

Methods to Improve Dependency Parsing
Higher-order features like third-order dependency relations are essential to improve dependency parsing accuracy (Koo and Collins, 2010; Rush and Petrov, 2012; Zhang and McDonald, 2012). A reranking approach is one effective solution to introduce rich features to a parser model in the context of constituency parsing (Charniak and Johnson, 2005; Huang, 2008). Hall (2007) applied a k-best maximum spanning tree algorithm to non-projective dependency analysis, and showed that k-best discriminative reranking improves parsing accuracy in several languages. Sangati et al. (2009) proposed a k-best dependency reranking algorithm using a third-order generative model, and Hayashi et al. (2011) extended it to a forest algorithm. Though forest reranking requires some approximations such as cube-pruning to integrate non-local features, it can explore a larger search space than k-best reranking.
The stacking approach (Nivre and McDonald, 2008; Martins et al., 2008) uses the output of one dependency parser to provide guide features for another. Stacking improves the parsing accuracy of second-stage parsers on various language datasets. The joint graph-based and transition-based approach (Zhang and Clark, 2008; Bohnet and Kuhn, 2012) uses an arc-eager shift-reduce parser with a joint graph-based and transition-based model. Though it improves parsing accuracy significantly, the large beam size of the shift-reduce parser harms its efficiency. Sagae and Lavie (2006) showed that combining the outputs of graph-based and transition-based parsers can improve parsing accuracies.

Conclusion
We have presented a discriminative forest reranking algorithm for dependency parsing. This can be seen as a kind of joint transition-based and graph-based approach because the first-stage parser is a shift-reduce parser and the second-stage reranker uses a graph-based model.
Additionally, we have proposed a dynamic programming arc-standard transition-based dependency parser without spurious ambiguity, along with a heuristic that encodes the correct tree in the output packed forest for reranker training, and shown that forest reranking works well on packed forests produced by the proposed parser.
To improve the accuracy of reranking, we will engage in feature engineering. We need to further investigate effective higher-order guide and non-local features. It also seems promising to extend the unlabeled reranker to a labeled one because labeled information often improves unlabeled accuracy.
In this paper, we adopt a reranking approach, but a rescoring approach is more promising to improve efficiency because it does not have the overhead of dumping packed forests.