Efficient Structured Inference for Transition-Based Parsing with Neural Networks and Error States

Transition-based approaches based on local classification are attractive for dependency parsing due to their simplicity and speed, despite producing results slightly below the state-of-the-art. In this paper, we propose a new approach for approximate structured inference for transition-based parsing that produces scores suitable for global scoring using local models. This is accomplished with the introduction of error states in local training, which add information about incorrect derivation paths typically left out completely in locally-trained models. Using neural networks for our local classifiers, our approach achieves 93.61% accuracy for transition-based dependency parsing in English.


Introduction
Transition-based parsing approaches based on local classification of parser actions (Nivre, 2008) remain attractive due to their simplicity, despite producing results slightly below the state-of-the-art. Although the application of online structured prediction and beam search has made transition-based parsing competitive in accuracy (Zhang and Clark, 2008; Huang et al., 2012) while retaining linear time complexity, greedy inference with locally-trained classifiers is still widely used, and techniques for improving the performance of greedy parsing have been proposed recently (Choi and Palmer, 2011; Goldberg and Nivre, 2012; Goldberg and Nivre, 2013; Honnibal et al., 2013). Recent work on the application of neural network classification to drive greedy transition-based dependency parsing has achieved high accuracy (Chen and Manning, 2014), showing how effective locally-trained neural network models are at predicting parser actions, while providing a straightforward way to improve parsing accuracy using word embeddings pre-trained on a large set of unlabeled data.
* Both authors contributed equally to this paper.
We propose a novel approach for approximate structured inference for transition-based parsing that uses locally-trained neural networks that, unlike previous local classification approaches, produce scores suitable for global scoring. This is accomplished with the introduction of error states in local training, which add information about incorrect derivation paths typically left out completely in locally-trained models. Our approach produces high accuracy for transition-based dependency parsing in English, surpassing parsers based on the structured perceptron (Huang and Sagae, 2010; Zhang and Nivre, 2011) by allowing seamless integration of pre-trained word embeddings, while requiring almost none of the feature engineering typically associated with parsing with linear models. Trained without external resources or pre-trained embeddings, our neural network (NN) dependency parser outperforms the NN transition-based dependency parser from Chen and Manning (2014), which uses pre-trained word embeddings trained on external data and more features, thanks to improved search. Our experiments show that naive search produces very limited improvements in accuracy compared to greedy inference, while search in conjunction with error states that mark incorrect derivations produces substantial accuracy improvements.

Background: Transition-Based Parsing
Transition-based approaches are attractive in dependency parsing for their algorithmic simplicity and straightforward data-driven application. Using shift-reduce algorithms, such as those pioneered by Nivre (2008), the task of finding a dependency structure becomes that of predicting each action in the derivation of the desired structure.

Arc-Standard Dependency Parsing
Our parsing models are based on a simple shift-reduce algorithm for dependency parsing known as the arc-standard dependency parsing algorithm (Nivre, 2008). An arc-standard dependency parser maintains one or more parser states $T$, each composed of a stack $S = [s_m, ..., s_1, s_0]$ (where the topmost item is $s_0$) and an input buffer $W = [w_0, w_1, ..., w_n]$ (where the first element of the buffer is $w_0$). In its initial state $T_0$, the stack is empty, and the input buffer contains each token in the input sentence with its part-of-speech tag. One of three actions can be applied to a parser state $T_i$ to create a new parser state $T_j$: shift, which takes the next word in the input buffer (with its part-of-speech tag), and places it as a tree with a single node on top of the stack (i.e. input token $w_0$ is consumed to create the new stack item $s_0$); reduce-right, which pops the top two items on the stack, $s_0$ and $s_1$, and pushes onto the stack a new subtree formed by attaching the root node of $s_0$ as a dependent of $s_1$; and reduce-left, which pops the top two items on the stack, $s_0$ and $s_1$, and pushes onto the stack a new subtree formed by attaching the root node of $s_1$ as a dependent of $s_0$. An alternative formulation keeps only word indices in the stack and input buffer, and includes an additional set of dependency arcs; the two formulations are equivalent.
A greedy arc-standard parser keeps only one parser state, choosing at each step one parser action to apply to the current state, which is replaced once application of the chosen action creates the next state. Once the current state is a final state, parsing terminates. A final state is one where the input buffer is empty, and the stack contains only one element, which is the dependency tree output. Given a way to score parser actions instead of simply choosing one action to apply, a state score can be defined on the sequence of actions resulting in the state. Keeping track of multiple states with scores resulting from the application of different valid actions for a single state creates an exponential search space. Beam search can then be applied to search for a high-scoring state. With global estimation of parameters for scoring parser actions, a beam search can produce more accurate results than greedy parsing by minimizing global loss (Zhang and Clark, 2008).
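The three arc-standard actions can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it uses the alternative formulation with word indices and an explicit arc set, and the names `State`, `valid_actions`, `apply` and `is_final`, as well as the three-word example sentence, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class State:
    """Parser state in the word-index formulation (an illustrative choice)."""
    stack: Tuple[int, ...]                    # last element is s_0 (top of stack)
    buffer: Tuple[int, ...]                   # first element is w_0 (next token)
    arcs: Tuple[Tuple[int, int], ...] = ()    # (head, dependent) pairs


def valid_actions(s: State) -> List[str]:
    """Shift needs a non-empty buffer; both reductions need two stack items."""
    actions = []
    if s.buffer:
        actions.append("shift")
    if len(s.stack) >= 2:
        actions += ["reduce-left", "reduce-right"]
    return actions


def apply(s: State, action: str) -> State:
    """Return the new state created by applying one action."""
    if action == "shift":
        return State(s.stack + (s.buffer[0],), s.buffer[1:], s.arcs)
    s1, s0 = s.stack[-2], s.stack[-1]
    if action == "reduce-right":   # s_0 becomes a dependent of s_1
        return State(s.stack[:-2] + (s1,), s.buffer, s.arcs + ((s1, s0),))
    if action == "reduce-left":    # s_1 becomes a dependent of s_0
        return State(s.stack[:-2] + (s0,), s.buffer, s.arcs + ((s0, s1),))
    raise ValueError(f"unknown action: {action}")


def is_final(s: State) -> bool:
    """Final state: empty buffer and a single item left on the stack."""
    return not s.buffer and len(s.stack) == 1


# Hypothetical sentence "He eats fish" (indices 0, 1, 2), with word 1 as root.
state = State((), (0, 1, 2))
for action in ["shift", "shift", "reduce-left", "shift", "reduce-right"]:
    assert action in valid_actions(state)
    state = apply(state, action)
```

A greedy parser would replace the fixed action list with the classifier's top-scoring valid action at each step; a beam or best-first parser keeps several such states alive at once.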

Local Classification
Initial data-driven transition-based dependency parsing approaches employed locally-trained multiclass models to choose a parser action based on the parser state at each step in the derivation (Yamada and Matsumoto, 2003; Nivre and Scholz, 2004). In these models, classification is based on a set of features extracted from the current state of the parser, and creating training examples for the classifier requires only running the transition-based algorithm to reproduce the trees in a training treebank, while recording the features and actions at each step. A classifier is then trained with the actions as classes.
While this simple procedure has allowed for training of dependency parsers using off-the-shelf classifier implementations, the resulting parsers are restricted to performing greedy search, considering only one tree out of the exponentially many. Although the distribution of class scores for each parser state can be used to create a search space for beam search, the locally normalized scores obtained with these classifiers make searching a largely futile endeavor, since action scores cannot be combined meaningfully to score entire trees or entire derivations. For example, Zhao et al. use maximum entropy classification for local classification of shift-reduce parsing actions with a dynamic programming approach based on the work of Huang and Sagae (2010). Despite using exact search, Zhao et al. report an improvement of only 0.6% in unlabeled dependency accuracy over greedy parsing, reaching an accuracy of 90.7%, far below the 92.2% obtained with a comparable structured perceptron parser with beam search and very similar features (Huang et al., 2012). Similarly, Johansson and Nugues (2006), who used probability estimates from local SVM classification to perform a beam search in transition-based parsing, report some accuracy gains when using a beam of size 2, but no further gains with larger beams. Because transition scores out of each state are normalized locally, the quality of any particular state is in no way captured by the scores that will ultimately result in the overall score for the derivation. In fact, from an incorrect parser state, more incorrect transitions may follow, due to a version of the label bias problem faced by MEMMs (Lafferty et al., 2001; McCallum et al., 2000). In Section 3, we will present our approach that significantly improves search with locally normalized models.

Structured Perceptron
One effective way to create models that score parser transitions globally and allow for effective search is to use the structured perceptron (Collins, 2002). Unlike with local classifiers, weight updates are based on entire derivations, instead of individual states. However, because exact inference is too costly for transition-based parsing with a rich feature set, in practice parsers use beam search to perform approximate inference, and care must be taken to ensure the validity of weight updates (Huang et al., 2012). A widely used approach is to employ early updates, which stop parsing and perform weight updates once the desired structure is no longer in the beam (Collins and Roark, 2004).
Transition-based dependency parsers based on the structured perceptron have reached high accuracy (Zhang and Nivre, 2011; Hatori et al., 2012), but these parsers remain in general less accurate than high-order graph-based parsers that model dependency graphs directly, instead of derivations (Zhang and McDonald, 2014; Martins et al., 2013). The drawback of these more accurate parsers is that they tend to be slower than transition-based parsers.

Parsing with Local Classifiers and Error States
The standard way to train local classifiers to predict actions for transition-based parsers is to run the parsing algorithm using a gold-standard sequence of actions (i.e. a sequence of actions that generates the gold-standard tree from a training set) and record the features corresponding to each parser state, where a parser state includes the parser's stack, input buffer, and set of dependencies created so far. The features corresponding to a state are then associated with the gold-standard action that should be taken from that state, and this constitutes one training example for the local action classifier. Sagae and Tsujii (2007) propose using a maximum entropy classifier to produce conditional probabilities for each action given the features of the state, and score each state using the product of the probabilities of all actions taken up to that state. However, they report that searching through the resulting space for the highest scoring parse does not consistently result in improved parser accuracy over a greedy policy (i.e. pursue only the highest scoring action at each state), suggesting that this strategy for scoring states is a poor choice. This is confirmed by Zhao et al., who report only a small improvement over greedy search despite using exact inference with this state scoring strategy.
Because action probabilities are conditioned on the features of the current state alone and normalized locally, there is no reason to expect that the product of such probabilities along a derivation path up to a state, whether or not it is a final state, should reflect the overall quality of the state. Once an incorrect action is classified as more probable than the correct action in a given state $T_i$, the incorrect state $T_j$ resulting from the application of the incorrect action will have a higher score than the correct state $T_k$ resulting from the application of the correct action. From that point, the action probabilities given state $T_j$ will sum to one, just as the action probabilities given state $T_k$ will sum to one, and there is no reason to expect that the most probable action from $T_j$ should be less probable than the most probable action from $T_k$. In other words, once an error occurs, search is of little help in recovering from it, since scores are based only on local decisions and not on any notion of state quality, and the error occurred precisely because an incorrect action was found to be more probable locally.
Our key contribution is a solution to this problem by introducing a notion of state quality in local action classification. This is done through the addition of a new error class to the local classification model. Unlike the other classes, the error class does not correspond to a parser action. In fact, the error class is not used at all during parsing, and serves to occupy probability mass, keeping it from the actual parser actions. Intuitively, the probability of the error class given the current state can be thought of as the probability that an error has occurred previously and the resulting state belongs to an incorrect derivation path.
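A small worked example may make the effect of the error class concrete. The sketch below compares path scores (products of action probabilities) with and without an error class. All numbers are hypothetical except 0.6, 0.3 and 0.1, which echo the Figure 2 walk-through later in the paper; the function name `path_scores` is likewise an assumption for illustration.

```python
# Probabilities of the first decision: the classifier prefers the wrong action.
p_wrong_first, p_right_first = 0.6, 0.3

# Without an error class, all probability mass in every state goes to real
# actions, so the best follow-up action can look equally confident on both
# paths (0.9 is a made-up value).
best_next_plain = {"after_wrong": 0.9, "after_right": 0.9}

# With an error class, most of the mass in the incorrect state is absorbed by
# "error", so the best real action there gets a low probability.
best_next_errst = {"after_wrong": 0.1, "after_right": 0.9}


def path_scores(best_next):
    """Score each two-step path as the product of its action probabilities."""
    return {
        "wrong": p_wrong_first * best_next["after_wrong"],
        "right": p_right_first * best_next["after_right"],
    }


plain = path_scores(best_next_plain)   # wrong path still ranks first
errst = path_scores(best_next_errst)   # correct path overtakes the wrong one
```

Under these assumed values the incorrect path keeps its lead without an error class, while with an error class the correct path scores higher after one more step, which is the recovery behaviour the error class is meant to enable.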

Training Local Classifiers with Error States
To train a local classifier with error states, the standard way of generating classifier training examples is modified to include parser states that do not belong to the derivation of the gold-standard tree. It is these incorrect parser states that are labeled error states. Figure 1 illustrates the generation of training examples for a local action classifier with error states, assuming unlabeled arc-standard dependency parsing (Nivre, 2008), where the actions are shift, reduce-right and reduce-left. From state 2, the standard way of training local classifiers would be simply to associate features from state 2 to a shift action, generate state 3 (only), associate features from state 3 with a shift action, generate state 6, and continue in this fashion along the derivation of the gold-standard tree. To add error states, from state 2 we do not only generate state 3, but also states 4 and 5, which result from the application of incorrect actions. In addition to associating features from state 3 with shift, we associate features from state 4 with the error class, and features from state 5 with the error class. The desired effect is that any time the parser deviates from a correct derivation, the error class should become probable, while valid parser actions become less probable. Although in principle any state outside of a gold-standard derivation is an error state, we generate only error states resulting from the application of a single incorrect action, which in practice increases the number of state-action pairs used to train the classifier by approximately a factor of three. We leave an investigation of how far into incorrect derivations one should go to generate additional error states as future work.
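The example-generation procedure can be sketched as follows. This is a simplified illustration under stated assumptions: states are reduced to (stack, buffer) pairs of word indices, dependency arcs and feature extraction are omitted, and the names `valid`, `step` and `training_examples` are hypothetical. A real system would extract a feature vector from each state instead of storing the state itself.

```python
from typing import List, Tuple

State = Tuple[Tuple[int, ...], Tuple[int, ...]]   # (stack, buffer); arcs omitted
ACTIONS = ("shift", "reduce-left", "reduce-right")


def valid(state: State, action: str) -> bool:
    stack, buffer = state
    return bool(buffer) if action == "shift" else len(stack) >= 2


def step(state: State, action: str) -> State:
    stack, buffer = state
    if action == "shift":
        return stack + (buffer[0],), buffer[1:]
    # The surviving stack item is the head: s_1 for reduce-right, s_0 for reduce-left.
    head = stack[-2] if action == "reduce-right" else stack[-1]
    return stack[:-2] + (head,), buffer


def training_examples(n_words: int, gold: List[str]) -> List[Tuple[State, str]]:
    """Label gold states with their gold action and one-step deviations with 'error'."""
    examples = []
    state: State = ((), tuple(range(n_words)))
    for gold_action in gold:
        examples.append((state, gold_action))
        for action in ACTIONS:
            if action != gold_action and valid(state, action):
                # State reached by a single incorrect action: an error state.
                examples.append((step(state, action), "error"))
        state = step(state, gold_action)   # continue along the gold derivation
    return examples


# Hypothetical gold derivation for a three-word sentence.
gold = ["shift", "shift", "reduce-left", "shift", "reduce-right"]
examples = training_examples(3, gold)
```

Only states one incorrect action away from the gold derivation are generated, matching the restriction described above; going deeper into incorrect derivations would require recursing from each error state.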

Parsing with Error States
Once a local classifier has been trained with error states, this classifier can be used in a transition-based parser with no modifications; the error class is simply thrown away during parsing. For example, the type of beam search typically used in transition-based parsing with the structured perceptron (Zhang and Clark, 2008; Huang and Sagae, 2010) can be used to pursue several derivations in parallel, and the global score of a derivation can be decomposed as the sum of the scores of all actions in the derivation. Analogously, we score each derivation using the product of the probabilities for all actions in the derivation. Interestingly, local normalization of action scores allows the use of best-first search (Sagae and Tsujii, 2007), which has the potential to arrive at high quality solutions without having to explore as many states as a typical beam search, and even allows efficient exact or nearly exact inference. Once actions are scored for the parser's current state using a classifier, the score of a new state resulting from the application of a valid action to the current state can be computed as the product of the probabilities of all actions applied up to the new state in its derivation path. In other words, the score of each new state is the score of the current state multiplied by the probability of the action applied to the current state to generate the new state. New scored states resulting from the application of each action to the current state are then placed in a priority queue. The highest scoring item in the priority queue is chosen, and the state corresponding to that item is then made the current state. The local classifier is then applied to the current state, and the process is repeated (without clearing the priority queue, which already contains items corresponding to unexplored new states) until the current state is a final state (a state corresponding to a complete parse). This agenda-driven transition-based parsing approach, where the agenda is a priority queue, is optimal since all scores fall between 0 and 1, inclusive, so extending a derivation can never increase its score. In practice, a priority queue with limited capacity can be used to improve efficiency by preventing unbounded exploration of the exponential search space in cases where probabilities are nearly uniform.

Figure 2: Exploration of parser state space using best-first search and error states. States are numbered according to the order in which they become the parser's current state. The local action classifier is trained with four classes: the three valid actions (represented as Sh for shift, L for reduce-left, and R for reduce-right) and an error class. The error class is not used by the parser and not shown in the diagram, but serves to reduce the total probability of valid parser actions by occupying some probability mass in each state, creating a way to reflect the overall quality of individual states.

Figure 2 illustrates arc-standard dependency parsing with error states and best-first search. From states 0 and 1, the only possible action is shift. From state 2, the most probable action according to the model is reduce-left, which is not the correct action, but has probability 0.6. The correct action, shift, has probability 0.3. State 3 is then chosen as the current state, but when the classifier is applied to state 3, the only valid action, shift, is assigned probability 0.1. This is because the classifier assigns most of the probability mass to the error class, which the parser does not use. Because the state resulting from a shift from state 3 would have low probability (0.6 × 0.1 = 0.06, below the 0.3 of the state reached by the correct shift from state 2), the search continues from state 4, and the parser has recovered from the classification error at state 2.
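The best-first procedure can be sketched with Python's `heapq` module. This is an illustrative sketch rather than the authors' parser: the function `best_first`, its callback arguments, and the toy probability table are assumptions. States in the toy table are action histories; the 0.6, 0.3 and 0.1 values follow the Figure 2 walk-through and every other value is invented.

```python
import heapq
from itertools import count


def best_first(initial, action_probs, apply, is_final, max_queue=100):
    """Expand the highest-scoring state until a final state is popped."""
    tie = count()                                  # tie-breaker so states are never compared
    queue = [(-1.0, next(tie), initial)]           # heapq is a min-heap, so store -score
    visited = []
    while queue:
        neg_score, _, state = heapq.heappop(queue)
        visited.append(state)
        if is_final(state):
            return state, -neg_score, visited
        # action_probs returns valid actions only; the error class is dropped.
        for action, p in action_probs(state).items():
            heapq.heappush(queue, (neg_score * p, next(tie), apply(state, action)))
        if len(queue) > max_queue:                 # limited-capacity agenda
            queue = heapq.nsmallest(max_queue, queue)
            heapq.heapify(queue)
    return None, 0.0, visited


# Toy classifier output (hypothetical apart from 0.6, 0.3 and 0.1).
TABLE = {
    (): {"Sh": 1.0},
    ("Sh",): {"Sh": 1.0},
    ("Sh", "Sh"): {"L": 0.6, "Sh": 0.3},           # error class holds the other 0.1
    ("Sh", "Sh", "L"): {"Sh": 0.1},                # error class holds the other 0.9
    ("Sh", "Sh", "Sh"): {"R": 0.9},
    ("Sh", "Sh", "Sh", "R"): {"R": 0.9},
}

final, score, visited = best_first(
    initial=(),
    action_probs=lambda s: TABLE.get(s, {}),
    apply=lambda s, a: s + (a,),
    is_final=lambda s: len(s) == 5,                # 2n - 1 actions for n = 3 words
)
```

In this toy run the state reached by the incorrect reduce-left is expanded once, its successor is queued with a low score and never becomes the current state, and the search completes along the correct derivation.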
In the next section, we will present details of our neural network local classifiers.

Neural Models for Transition Based Parsing
We implement transition-based parsers with error states following two search strategies: the stepwise beam search normally used in transition-based parsers with global models (Zhang and Clark, 2008; Huang and Sagae, 2010) and best-first search (Sagae and Tsujii, 2007), described in the previous section. The trainable components of our transition-based parsers are the local classifiers that predict the next action given features derived from the current state. Following Chen and Manning (2014), we train feed-forward neural networks (NNs) for local classification in our parsers. The NN is trained on pairs of features and actions, $\{(f_n, a_n)\}_{n=1}^{N}$, where $f_n$ is the feature vector extracted from the parser state and $a_n$ is the corresponding correct action. For vanilla arc-standard parsing, $a_n$ is one of {shift, reduce-left, reduce-right}; for parsing with error states, it may additionally be the error class. While parsing, we extract a feature vector $f$ from the current state and make a decision based on the output distribution $P(a \mid f)$ computed by the NN.
We will now describe the basic architecture of our NN classifier and the features that we use. We will also describe how we pre-train embeddings from unannotated data for our word features. Figure 3 shows the basic architecture of our neural network action prediction model for two input features $f = f_1, f_2$, each of which is a one-hot vector with dimension equal to the number of possible feature types. The neural network has two hidden layers and the output softmax layer has the same number of units as the number of parser actions, that is, either 3 (without error states) or 4 (with error states). $D$ is the input embedding matrix that is shared by all the input feature positions. Each feature position $f_i$ has a corresponding position matrix $C_{f_i}$. The two hidden layers $h_1$ and $h_2$ comprise rectified linear units (Nair and Hinton, 2010) having the activation $\max(0, x)$ (Figure 4).

Neural Network Model
Figure 4: Activation function for a rectified linear unit.
The neural network computes the probability distribution over the parser actions, $P(a \mid f)$, as follows. The first hidden layer computes

$$h_1 = \phi\Big(\sum_{i} C_{f_i} D f_i + b_1\Big),$$

where $b_1$ is a vector of biases for $h_1$ and $\phi(x) = \max(0, x)$ is applied elementwise.
The output of the second layer $h_2$ is

$$h_2 = \phi(M h_1 + b_2),$$

where $M$ is a weight matrix between $h_1$ and $h_2$ and $b_2$ is a vector of biases for $h_2$. The output softmax layer computes the probability of an action as

$$P(a \mid f) = \frac{\exp\big(v_a^\top D' h_2 + b^\top v_a\big)}{\sum_{a'} \exp\big(v_{a'}^\top D' h_2 + b^\top v_{a'}\big)},$$

where $D'$ is the matrix of output action embeddings (distinct from the input embedding matrix $D$), $b$ is a vector of action biases, and $v_a$ is the one-hot representation of the action $a$. We learn models that predict over two types of output distributions conditioned on $f$: vanilla arc-standard models that predict over shift, reduce-left and reduce-right, and arc-standard models with error states (Section 3) that predict over shift, reduce-left, reduce-right and error.
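The forward computation described above can be sketched in pure Python. This is a minimal illustration under assumptions: the input embedding matrix is stored with one row per feature id (so multiplying by a one-hot vector becomes a row lookup), the output action embedding matrix is called `D_out` in the code to keep it apart from the input embeddings, and all parameter values and dimensions are toy numbers invented for the example.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def softmax(v: Vector) -> Vector:
    top = max(v)                                   # subtract the max for stability
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(feature_ids, D, C, b1, M, b2, D_out, b) -> Vector:
    """Return P(a | f) for one parser state given its feature ids."""
    pre1 = list(b1)
    for i, fid in enumerate(feature_ids):
        # One-hot input times D selects an embedding; C[i] is the position matrix.
        proj = matvec(C[i], D[fid])
        pre1 = [p + q for p, q in zip(pre1, proj)]
    h1 = relu(pre1)
    h2 = relu([x + y for x, y in zip(matvec(M, h1), b2)])
    logits = [x + y for x, y in zip(matvec(D_out, h2), b)]
    return softmax(logits)


# Toy parameters: 3 feature types, 2 feature positions, 4 output classes
# (shift, reduce-left, reduce-right, error).
D = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
C = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
b1, b2 = [0.0, 0.1], [0.0, 0.0]
M = [[1.0, -1.0], [0.5, 0.5]]
D_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b = [0.0, 0.0, 0.0, 0.0]

probs = forward([0, 2], D, C, b1, M, b2, D_out, b)
```

At parsing time only the probabilities of valid parser actions would be read from `probs`; the entry for the error class is ignored.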

Semi-supervised Learning: Pre-training Word Embeddings
It is often the case that large amounts of domain-specific unannotated data, i.e. raw text, are available in addition to annotated data. For both graph-based and transition-based parsing, many feature templates are defined on words from the input sentence. Previous work has shown benefits of using word representations learned from unannotated data. Koo et al. (2008) use features based on Brown clustering of the BLLIP corpus (Charniak et al., 2000). Chen and Manning (2014) also show a 0.7% improvement on English dependency parsing on PTB using pre-trained English word embeddings from Collobert et al. (2011). We also seek to benefit from pre-trained embeddings to initialize the input feature embeddings, $D$ (Figure 3), in our neural network classifiers. Following both Koo et al. (2008) and Chen and Manning (2014), we learn word embeddings by training a feed-forward neural network language model on a concatenation of the BLLIP corpus and sections 02-21 of the PTB corpus. We use the NPLM toolkit (http://nlg.isi.edu/software/nplm/), which implements noise contrastive estimation training of a two-hidden-layer feed-forward neural network language model with rectified linear units (Vaswani et al., 2013). We train a 7-gram neural language model with input word embedding dimension 64, 512 units in the first hidden layer, 512 units in the second hidden layer and output word embedding dimension of 512. The neural language model is trained for 30 epochs using stochastic gradient descent and an initial learning rate of 1.0. We restrict the vocabulary to about the 100k most frequent words, replacing all the other words with <unk>. We use a validation set of about 250k n-grams and extract the input word embeddings from the epoch that achieves the lowest perplexity on the validation set. To avoid over-fitting, in our dependency parsing experiments, we only use pre-trained embeddings for the words that occurred at least twice in sections 02-21 of the PTB corpus.
Pre-trained embeddings give us significant improvements over randomly initialized embeddings, as our results will show (section 5).
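The vocabulary restriction used when training the language model can be sketched as below. This is an illustration only, not the NPLM toolkit's own preprocessing: the function names `build_vocab` and `replace_rare` and the toy token list are assumptions, and the vocabulary size is set to 2 instead of roughly 100k to keep the example small.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_vocab(tokens: Iterable[str], max_size: int) -> Set[str]:
    """Keep the max_size most frequent word types."""
    return {word for word, _ in Counter(tokens).most_common(max_size)}


def replace_rare(tokens: Iterable[str], vocab: Set[str], unk: str = "<unk>") -> List[str]:
    """Map every out-of-vocabulary token to the unknown-word symbol."""
    return [tok if tok in vocab else unk for tok in tokens]


tokens = "the cat saw the dog and the cat".split()
vocab = build_vocab(tokens, max_size=2)
mapped = replace_rare(tokens, vocab)
```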

Training
We train six different types of NN classifiers for transition-based dependency parsing: one for each search algorithm with error states, with and without pre-trained word embeddings, in addition to two models with no error states, as described below.

Vanilla Arc-standard Parsers
We train NN classifiers for arc-standard transition-based parsers that compute probability distributions over shift, reduce-left, and reduce-right, using the 14 kernel features described by Huang and Sagae (2010), shown in Table 1. We trained models using both pre-trained (Section 4.2) and randomly initialized word embeddings. We denote these parsers as Local-14-pre and Local-14-rand. These models allow us to compare the use of NN classification without error states directly with the structured perceptron, and examine the impact of pre-trained word embeddings.

Table 1: The 14 feature templates used in some of our models. s and q indicate the stack and the input buffer respectively; subscripts start at zero on the top of the stack or in the front of the input buffer. Finally, lc and rc indicate the leftmost left child and the rightmost right child, respectively.
Word features: $s_0.w$, $s_1.w$, $s_2.w$, $q_0.w$, $q_1.w$
POS tag features: $s_0.t$, $s_1.t$, $s_2.t$, $q_0.t$, $q_1.t$
Child features: $s_0.lc.t$, $s_0.rc.t$, $s_1.lc.t$, $s_1.rc.t$

Error State Parsers
We also train NN classifiers that differ from the ones above only in the use of error states: ErrSt-14-pre (error states, pre-trained embeddings), and ErrSt-14-rand (error states, randomly initialized embeddings). These allow us to examine the impact of using error states, and give us a way to compare our approach with NN and error states directly with an existing structured perceptron arc-standard parser (Huang and Sagae, 2010) using the same 14 kernel features and the same search approach. The differences between the two approaches are that Huang and Sagae (2010) use the structured perceptron and an extended set of features based on the kernel features, and we use NN with error states and only the 14 kernel features. Additionally, we train two classifiers that use a more expressive expanded set of 25 features, which we use with best-first search (Sagae and Tsujii, 2007) with error states: ErrSt-25-pre (expanded feature set, error states, pre-trained embeddings), and ErrSt-25-rand (expanded feature set, error states, randomly initialized embeddings). The 25 feature templates used by these classifiers are shown in Table 2. In contrast, Chen and Manning (2014) use 48 feature templates, including higher-order dependency information that has been shown to improve parsing accuracy significantly (Zhang and Nivre, 2011). It is likely that our approach would benefit similarly from the use of these features, but we leave the addition of features as future work.
Table 2: The expanded set of 25 feature templates used in some of our models. s and q indicate the stack and the input buffer respectively; subscripts start at zero on the top of the stack or in the front of the input buffer. lc and rc indicate the leftmost left child and the rightmost right child, respectively. dist(a, b) is the signed distance between the root of a and the root of b in the input sentence, and previous action is the action that was applied to generate the current state.

Finally, for each of the six types of classifiers above, we train models using both Stanford dependencies (de Marneffe and Manning, 2008) and Yamada and Matsumoto (YM) dependencies (Yamada and Matsumoto, 2003) extracted from the Penn Treebank. We create $\{(f^n_{train}, a^n_{train})\}_{n=1}^{N_{train}}$ pairs on Wall Street Journal sections 02-21, and use $\{(f^n_{dev}, a^n_{dev})\}_{n=1}^{N_{dev}}$ pairs from section 22 as a development set, where $N_{train}$ and $N_{dev}$ are the number of training and dev instances. We obtain part-of-speech tags by training a CRF tagger on sections 02-21 with 4-way jackknifing, which achieves a tagging accuracy of 97.2% on section 23. We train our NN classifiers to maximize the log-likelihood of the correct actions given features,

$$\frac{1}{N_{train}} \sum_{n=1}^{N_{train}} \log P(a^n_{train} \mid f^n_{train}).$$
We use mini-batch dropout training, computing gradients using the back-propagation algorithm (Hinton et al., 2012). We use the development set to tune the learning rate, halving it if the perplexity on the development set increases.
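The training objective and the learning-rate schedule can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, the development perplexities are invented, the starting rate of 0.25 is borrowed from the value reported later for ErrSt-25-pre, and comparing each epoch's perplexity with the previous epoch's is an assumption about how "increases" is measured.

```python
import math
from typing import List, Sequence


def mean_log_likelihood(probs_of_gold: Sequence[float]) -> float:
    """Average log P(a_n | f_n) over examples, given P of each gold action."""
    return sum(math.log(p) for p in probs_of_gold) / len(probs_of_gold)


def perplexity(probs_of_gold: Sequence[float]) -> float:
    """Perplexity is the exponential of the negative mean log-likelihood."""
    return math.exp(-mean_log_likelihood(probs_of_gold))


def halve_on_increase(lr: float, dev_perplexities: Sequence[float]) -> List[float]:
    """Return the learning rate in effect after each epoch's dev evaluation."""
    rates = []
    previous = math.inf
    for ppl in dev_perplexities:
        if ppl > previous:          # dev perplexity got worse: halve the rate
            lr /= 2.0
        rates.append(lr)
        previous = ppl
    return rates


# Hypothetical per-epoch development perplexities.
rates = halve_on_increase(0.25, [1.40, 1.30, 1.32, 1.29, 1.31])
```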

Model and Parser Selection
For each of our NN classifiers, there are a few tunable hyper-parameters: hidden layer size ($h_1$), mini-batch size, initial learning rate $lr$, dropout probabilities for $h_1$ and $h_2$ ($d_{h_1}$ and $d_{h_2}$), and random initialization of parameters. We tuned each of these to maximize classification accuracy of the most likely action predicted by the classifier given the feature vector. We calculated classification accuracy as

$$\frac{1}{N_{dev}} \sum_{n=1}^{N_{dev}} \delta\Big(a^n_{dev}, \arg\max_{a} P(a \mid f^n_{dev})\Big),$$

where $\delta(x, y)$ returns 1 if $x$ equals $y$ and 0 otherwise. For each of the classifiers, we first tuned $lr$, mini-batch size, $h_1$ size, and $d_{h_1}$ and $d_{h_2}$ for accuracy. [...] we chose the model that achieved the best classification accuracy on the development set for parsing the test set. We also used the same random seed to initialize our parameters. For Local-14-pre, ErrSt-14-pre, and ErrSt-25-pre, we trained models with different random initializations of the input embeddings ($D$ in Figure 3) that were not pre-trained. For each random initialization, we chose the model with the best classification accuracy on the development set. To pick the final model for parsing on test, we selected the model that maximized parsing accuracy on the development set. We computed our parsing accuracies using the eval07.pl script from the CoNLL 2007 shared task on dependency parsing (Nivre et al., 2007), ignoring punctuation as is standard in English dependency parsing evaluation.
For both YM and Stanford dependencies, the optimal values of $h_1$ were 1536 for ErrSt-25-pre and ErrSt-25-rand, and either 1536 or 2048 for Local-14-pre, Local-14-rand, ErrSt-14-pre, and ErrSt-14-rand. For the best parser on Stanford and YM dependencies (ErrSt-25-pre), we used a mini-batch size of 256 and an initial learning rate of 0.25. For future work, we will explore a larger grid of learning rates, mini-batch sizes, and dropout values.
At parsing time, we pre-multiply the input embeddings, $D$, and the position matrices, $C_{f_i}$, which speeds up computation significantly.
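One way to read the pre-multiplication is as a lookup table: because each input is one-hot, the product of a position matrix with the embedding of a given feature can be computed once per (position, feature) pair and reused. The sketch below illustrates this idea in pure Python; the function names and toy parameters are assumptions, and the embedding matrix is stored with one row per feature id.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def premultiply(C: List[Matrix], D: Matrix) -> List[List[Vector]]:
    """tables[i][k] holds C[i] applied to the embedding of feature id k."""
    return [[matvec(C_i, emb) for emb in D] for C_i in C]


def hidden_input_direct(feature_ids, C, D, b1) -> Vector:
    """Pre-activation of the first hidden layer, multiplying at parse time."""
    total = list(b1)
    for i, k in enumerate(feature_ids):
        total = [t + p for t, p in zip(total, matvec(C[i], D[k]))]
    return total


def hidden_input_cached(feature_ids, tables, b1) -> Vector:
    """Same quantity using only table lookups and vector additions."""
    total = list(b1)
    for i, k in enumerate(feature_ids):
        total = [t + p for t, p in zip(total, tables[i][k])]
    return total


# Toy parameters (invented): 3 feature types, 2 feature positions.
D = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
C = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
b1 = [0.0, 0.1]

tables = premultiply(C, D)
direct = hidden_input_direct([0, 2], C, D, b1)
cached = hidden_input_cached([0, 2], tables, b1)
```

The trade-off is memory: the tables hold one hidden-sized vector per position and feature type, in exchange for removing the matrix products from the per-state computation.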
ErrSt-25-pre uses best-first search with a priority queue of size limited to 100.
Our experiments use the Wall Street Journal portion of the Penn Treebank with the standard split (sections 02-21 for training, 22 for development and 23 for testing), and part-of-speech tags assigned automatically using four-way jackknifing. Tables 3 and 4 present results obtained on the development set with our models trained with and without pre-trained word embeddings. Our baseline arc-standard parser using greedy search (Local-14-pre, beam 1) is as accurate as the best NN dependency parser of Chen and Manning (2014), where both use pre-trained embeddings. In both tables, we can see that increasing the beam size from 1 (greedy parsing) to 4 gives only very modest improvements in accuracy when trained without error states (Local-14-pre and Local-14-rand). As mentioned in Section 2.2, using beam search with vanilla arc-standard parsing with locally normalized models does not produce large improvements over greedy search due to the label bias problem. For both pre-trained and randomly initialized word embeddings, beam search with models trained with error states improves accuracy substantially (ErrSt-14-pre and ErrSt-14-rand). Our best parsers use pre-trained embeddings, best-first search, and a larger feature set (ErrSt-25-pre). In Table 4, we isolate the efficacy of training and search with error states. Even with randomly initialized embeddings, we are able to outperform Chen and Manning's NN dependency parser initialized with embeddings from external sources. Our results show that using error states in parsing can improve parsing accuracy independently of whether beam search or best-first search is used. Figures 5 and 6 show comparisons of the effects of increasing beam sizes in models trained with and without error states. We additionally show that the benefits of using error states are not limited to classification with neural networks.
Figure 7 shows the results obtained on the development set with an arc-standard beam-search parser using maximum entropy classification 2 with L1 regularization and the full set of features used by Huang and Sagae (2010), and increasing beam sizes. The improvement obtained from beam-search with baseline local classification is limited as expected, while the improvement obtained from beam-search with error states is substantially more pronounced. Although the accuracy levels obtained with maximum entropy classification are clearly lower than those obtained with our neural network models, these results do confirm that error states are effective with linear classification.
In Table 5, we compare our best parsers with and without pre-trained word embeddings, ErrSt-25-pre and ErrSt-25-rand, against other published results. On Stanford dependencies, our parser with pre-trained embeddings performs comparably with the state-of-the-art. By using search with error states, we outperform a greedy NN parser (Chen and Manning, 2014) by a wide margin. On YM dependencies, our performance (ErrSt-25-pre) is comparable to that of a structured perceptron second-order graph-based parser using carefully selected features based on Brown clustering of the BLLIP corpus (Koo et al., 2008). Our randomly initialized parser (no pre-trained embeddings), ErrSt-25-rand, performs at a similar level to a structured perceptron transition-based parser (Huang and Sagae, 2010), but below that of parsers with finely tuned higher-order rich feature sets (Zhang and Nivre, 2011). While we leave the use of Zhang and Nivre's rich feature set as future work, for a direct comparison of NN models with error states and structured perceptron for transition-based dependency parsing, we additionally tested a parser (ErrSt-14-rand) that uses the exact same beam search and kernel features used by Huang and Sagae (2010). With NN and error states, we obtained 91.84% accuracy, compared to Huang and Sagae's 92.1%. An advantage of our approach is that we use the kernel features only, which come from the 14 templates shown in Table 1, while Huang and Sagae additionally use an extended set of features composed of carefully tuned concatenation templates involving the kernel features.
The parsing speed of our parser implementation using our best model, ErrSt-25-pre, is approximately 1,000 tokens (or 42 sentences) a second. Of course, such a measurement of speed is dependent on a variety of factors, such as hardware and programming language of the specific implementation, among others, so this figure observed in our experiments serves only as an illustrative sample.
2 We used Yoshimasa Tsuruoka's maximum entropy library, downloaded from http://www.logos.ic.i.u-tokyo.ac.jp/~tsuruoka/maxent/.
[Table caption, fragment] (Chen and Manning, 2014; Huang and Sagae, 2010; Zhang and Nivre, 2011; Weiss et al., 2015) and graph-based approaches (Zhang and McDonald, 2014; Martins et al., 2013; Koo et al., 2008). * These parsers use large sets of unlabeled data.

Related Work
Our work builds on the transition-based parsing work described in Section 2, where local classifiers are trained to predict parser actions (Nivre, 2008), but provides a way to go beyond deterministic parsing. One way to create models capable of global scoring, and therefore effective search, is to parse with the structured perceptron (Zhang and Clark, 2008), which we also discuss in Section 2. Instead of performing global weight updates, our approach relies on local classifiers, but adds information about incorrect derivation paths to approximate a notion of global loss. This gives us a simple way to train neural network models for predicting parser actions locally but still perform effective search.
Our use of error states is conceptually related to the correctness probability estimate proposed by Yazdani and Henderson (2015), which is used only with each shift action of an arc-eager transition-based parsing model. This correctness probability creates a measure of quality of derivations at the point of each shift, which allows a combination of local action scores and the correctness probability to be used with beam search. The beam is then determined only at each shift, while search paths produced by other actions are extended exhaustively. Our error states, in contrast, adjust the scores of every action, making the use of best-first search natural.
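To make the role of per-action scores in best-first search concrete, the following is a minimal sketch, not the implementation used in our parser. It assumes a hypothetical callback `expand(state)` that yields `(action, next_state, log_score)` triples with `log_score <= 0`; how error-state information enters these scores is not modeled here, and the toy probabilities are invented purely for illustration. Because every step has a non-positive log score, the first final state removed from the queue has the highest total score.

```python
import heapq
import itertools
import math


def best_first(initial, is_final, expand, max_pops=10000):
    """Best-first search over derivations scored by summed log scores.

    `expand` is a hypothetical callback yielding (action, next_state, log_score)
    with log_score <= 0, so path cost (negated score) never decreases.
    """
    counter = itertools.count()  # tie-breaker so states are never compared
    heap = [(0.0, next(counter), initial, [])]
    pops = 0
    while heap and pops < max_pops:
        cost, _, state, actions = heapq.heappop(heap)
        pops += 1
        if is_final(state):
            return actions, -cost  # action sequence and its total log score
        for action, nxt, log_score in expand(state):
            heapq.heappush(
                heap, (cost - log_score, next(counter), nxt, actions + [action])
            )
    return None, float("-inf")


# Toy example (invented numbers): a state is the tuple of actions taken so far.
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}


def toy_expand(state):
    for action, p in TOY[state].items():
        yield action, state + (action,), math.log(p)


def toy_is_final(state):
    return len(state) == 2


best_actions, best_score = best_first((), toy_is_final, toy_expand)
```

In this toy problem a greedy parser would commit to `a` (0.6) and end with a path of probability 0.30, while the search recovers the path through `b` with probability 0.36. The sketch only shows the search mechanics; it says nothing about how well locally normalized scores support such search in practice.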
Non-deterministic oracles for transition-based dependency parsing (Goldberg and Nivre, 2012; Goldberg and Nivre, 2013) are also designed to improve the performance of parsers that use local classification of actions by adding to the amount of information used to train the local classifiers. However, non-deterministic oracles aim to allow a deterministic parser to recover from incorrect actions by including information in the training of the local classifiers based on the notion that there may be several correct actions at a given point, as long as a desired tree remains reachable. In contrast, our local classifier, or oracle, is trained to encode a notion of state quality or approximate global loss that is specifically designed for search. In fact, when used with greedy search, our error states have no positive effect on parsing. This suggests that a combination of the benefits of non-deterministic oracles and error states may be possible.
Our training of local classifiers with error states shares with SEARN (Daumé III et al., 2009) and DAGGER (Ross et al., 2010) the idea of creating a notion of global loss in local scores, but SEARN and DAGGER learn to estimate the quality of search states by iteratively training policies using the entire training set, while we train only one policy, but using explicit information about states outside of the optimal path.
Choi and Palmer (2011) show that the idea of iteratively refining policies, in a manner very similar to that proposed in SEARN and DAGGER, can in fact be applied to transition-based dependency parsing to improve the accuracy of deterministic parsing. By creating training examples for local classifiers based on parser states that are likely to occur at run time, but would not be generated with the gold-standard derivation, local classification models can be trained to be more robust in recovery from past errors. A key difference is that this provides a way for the parser to do better assuming that a mistake has already been made and is irrevocable, while our error states are designed to improve search, lowering the score of undesirable paths so that a different path is chosen.
Our greedy neural network parser is similar to Chen and Manning (2014), who are the first to show the benefits of using feed-forward neural network classifiers in greedy transition-based dependency parsing. Unlike us, they use a single hidden layer of cube activation functions, and more features. We follow the neural network architecture of Vaswani et al. (2013), using two hidden layers of rectified linear units. Chen and Manning (2014) use Adagrad (Duchi et al., 2011) and dropout for optimization, while we use stochastic gradient descent with dropout. Recent work by Weiss et al. (2015) produces the highest published accuracy for English dependency parsing with very similar neural network architectures and similar pre-training of word embeddings. The accuracy of the greedy version of their parser is substantially higher than that of our greedy parser, due at least in part to the use of more features. A more interesting difference between their approach and ours is in the way structured prediction is performed. While Weiss et al. add a structured perceptron layer to a network pre-trained locally, we train only locally, but using error states. Both approaches are effective in producing improvements over the respective greedy baselines, and a direct comparison using the same greedy baseline is left for future work.

Conclusion
We presented a new approach for approximate structured inference for transition-based parsing that allows us to obtain high parsing accuracy using neural networks. Using error states, we improved search by producing scores suitable for global scoring using only local models, and showed that our approach is competitive with the structured perceptron in transition-based parsing. Additionally, our approach provides a straightforward way to take advantage of word embeddings in transition-based parsing, which produce high accuracy for transition-based dependency parsing in English, rivaling that of higher-order graph-based parsers. Source code, models and word embeddings for our transition-based dependency parser with error states are available at http://github.com/sagae/nndep.
Our approach for using error states to improve search is quite general, and could be applied to other structured problems that can be approximated using local models, such as sequence labeling and transition-based parsing with recurrent neural networks.
An area of future work is the application of error state training in problems where the local classifier has a high number of classes, as is often the case in labeled dependency parsing. A straightforward application of our approach roughly multiplies the number of training examples for the local classifier by the number of possible classes. For example, in labeled dependency parsing, where dependency labels are typically concatenated to actions, the number of classes is often greater than 30, which would increase the number of training examples more than 30-fold. Preventing such an increase in the number of training examples may be possible by factoring the problem in such a way that structure-building decisions are treated separately from labeling decisions (in labeled dependency parsing this would amount to training an arc labeling classifier separately), or perhaps more generally, by sampling from the possible error states.
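The growth in training examples, and the effect of sampling, can be illustrated with a small sketch. It rests on one reading of the description above, namely that each incorrect action taken from a state on the gold derivation yields one additional training example labeled with an error class. The names `ERROR`, `legal_actions`, `apply_action` and `max_errors` are hypothetical and introduced only for this illustration, and the toy action inventory does not correspond to a real parser.

```python
import random

ERROR = "ERROR"  # hypothetical label for the extra error class


def error_state_examples(gold_path, legal_actions, apply_action,
                         max_errors=None, rng=None):
    """Build (state, label) training examples from a gold derivation.

    gold_path: list of (state, gold_action) pairs.
    For every gold state we keep the gold example and add one ERROR example
    for each incorrect action (assumed reading of error-state training).
    If max_errors is set, at most that many incorrect actions are sampled.
    """
    rng = rng or random.Random(0)
    examples = []
    for state, gold_action in gold_path:
        examples.append((state, gold_action))
        wrong = [a for a in legal_actions(state) if a != gold_action]
        if max_errors is not None and len(wrong) > max_errors:
            wrong = rng.sample(wrong, max_errors)
        for action in wrong:
            # The state reached by a wrong action is labeled as an error state.
            examples.append((apply_action(state, action), ERROR))
    return examples


# Toy setup: 31 classes, every action allowed in every state (a simplification).
ACTIONS = ["shift"] + [f"arc-{i}" for i in range(30)]


def toy_legal_actions(state):
    return ACTIONS


def toy_apply(state, action):
    return state + (action,)


gold_path = []
state = ()
for _ in range(10):
    gold_path.append((state, "shift"))
    state = toy_apply(state, "shift")

full = error_state_examples(gold_path, toy_legal_actions, toy_apply)
sampled = error_state_examples(gold_path, toy_legal_actions, toy_apply,
                               max_errors=3)
```

With 31 classes and 10 gold states, the unsampled set has 310 examples, a 31-fold increase over the 10 gold examples, while sampling three error states per gold state gives 40. Whether sampled error states retain the accuracy gains is exactly the open question raised above.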