Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations

We present a simple and effective scheme for dependency parsing which is based on bidirectional-LSTMs (BiLSTMs). Each sentence token is associated with a BiLSTM vector representing the token in its sentential context, and feature vectors are constructed by concatenating a few BiLSTM vectors. The BiLSTM is trained jointly with the parser objective, resulting in very effective feature extractors for parsing. We demonstrate the effectiveness of the approach by applying it to a greedy transition-based parser as well as to a globally optimized graph-based parser. The resulting parsers have very simple architectures, and match or surpass the state-of-the-art accuracies on English and Chinese.


Introduction
The focus of this paper is on feature representation for dependency parsing, using recent techniques from the neural-networks ("deep learning") literature. Modern approaches to dependency parsing can be broadly categorized into graph-based and transition-based parsers (Kübler et al., 2008). Graph-based parsers (McDonald, 2006) treat parsing as a search-based structured prediction problem in which the goal is learning a scoring function over dependency trees such that the correct tree is scored above all other trees. Transition-based parsers (Nivre, 2004; Nivre, 2008) treat parsing as a sequence of actions that produce a parse tree, and a classifier is trained to score the possible actions at each stage of the process and guide the parsing process. Perhaps the simplest graph-based parsers are arc-factored (first order) models (McDonald, 2006), in which the scoring function for a tree decomposes over the individual arcs of the tree. More elaborate models look at larger (overlapping) parts, requiring more sophisticated inference and training algorithms (Martins et al., 2009; Koo and Collins, 2010). The basic transition-based parsers work in a greedy manner, performing a series of locally-optimal decisions, and boast very fast parsing speeds. More advanced transition-based parsers introduce some search into the process using a beam (Zhang and Clark, 2008) or dynamic programming (Huang and Sagae, 2010a).
Regardless of the details of the parsing framework being used, a crucial step in parser design is choosing the right feature function for the underlying statistical model. Recent work (see Section 2.2 for an overview) attempts to alleviate parts of the feature function design problem by moving from linear to non-linear models, enabling the modeler to focus on a small set of "core" features and leaving it up to the machine-learning machinery to come up with good feature combinations (Chen and Manning, 2014; Pei et al., 2015; Lei et al., 2014; Taub-Tabib et al., 2015). However, the need to carefully define a set of core features remains. For example, the work of Chen and Manning (2014) uses 18 different elements in its feature function, while the work of Pei et al. (2015) uses 21 different elements. Other works, notably Dyer et al. (2015) and Le and Zuidema (2014), propose more sophisticated feature representations, in which the feature engineering is replaced with architecture engineering.
In this work, we suggest an approach which is much simpler in terms of both feature engineering and architecture engineering. Our proposal (Section 3) is centered around BiRNNs (Irsoy and Cardie, 2014; Schuster and Paliwal, 1997), and more specifically BiLSTMs (Graves, 2008), which are strong and trainable sequence models (see Section 2.3). The BiLSTM excels at representing elements in a sequence (i.e., words) together with their contexts, capturing the element and an "infinite" window around it. We represent each word by its BiLSTM encoding, and use a concatenation of a minimal set of such BiLSTM encodings as our feature function, which is then passed to a non-linear scoring function (multi-layer perceptron). Crucially, the BiLSTM is trained with the rest of the parser in order to learn a good feature representation for the parsing problem. If we set aside the inherent complexity of the BiLSTM itself and treat it as a black box, our proposal results in a frustratingly simple feature extractor.

arXiv:1603.04351v2 [cs.CL] 26 May 2016
We demonstrate the effectiveness of the approach by using the BiLSTM feature extractor in two parsing architectures, a transition-based one (Section 4) and a graph-based one (Section 5). In the graph-based parser, we jointly train a structured-prediction model on top of a BiLSTM, propagating errors from the structured objective all the way back to the BiLSTM feature-encoder. To the best of our knowledge, we are the first to perform such end-to-end training of a structured prediction model and a recurrent feature extractor for non-sequential outputs.¹

Aside from the novelty of the BiLSTM feature extractor and the end-to-end structured training, we rely on existing models and techniques from the parsing and structured prediction literature. We stick to the simplest parsers in each category: greedy inference for the transition-based architecture, and a first-order, arc-factored model for the graph-based architecture. Despite the simplicity of the parsing architectures and the feature functions, we achieve near state-of-the-art parsing accuracies in both English (93.1 UAS) and Chinese (86.6 UAS), using a first-order parser with two features and while training solely on Treebank data, without relying on semi-supervised signals such as pre-trained word embeddings (Chen and Manning, 2014), word-clusters (Koo et al., 2008), or techniques such as tri-training (Weiss et al., 2015). When also including pre-trained word embeddings, we obtain further improvements, with accuracies of 93.9 UAS (English) and 87.6 UAS (Chinese) for a greedy transition-based parser with 11 features, and 93.6 UAS (English) / 87.4 UAS (Chinese) for a greedy transition-based parser with 4 features.

Background and Notation
Notation We use $x_{1:n}$ to denote a sequence of n vectors $x_1, \dots, x_n$. $F_\theta(\cdot)$ is a function parameterized with parameters $\theta$. We write $F_L(\cdot)$ as a shortcut to $F_{\theta^L}$, an instantiation of F with a specific set of parameters $\theta^L$. We use $\circ$ to denote a vector concatenation operation, and $v[i]$ to denote an indexing operation taking the ith element of a vector v.

Feature Functions in Dependency Parsing
Traditionally, state-of-the-art parsers rely on linear models over hand-crafted feature functions. The feature functions look at core components (e.g. "word on top of stack", "leftmost child of the second-to-top word on the stack", "distance between the head and the modifier words"), and are comprised of several templates, where each template instantiates a binary indicator function over a conjunction of core elements (resulting in features of the form "word on top of stack is X and leftmost child is Y and . . ."). The design of the feature function (which components to consider and which combinations of components to include) is a major challenge in parser design. Once a good feature function is proposed in a paper it is usually adopted in later works, and sometimes tweaked to improve performance. Examples of good feature functions are the feature-set proposed by Zhang and Nivre (2011) for transition-based parsing (including roughly 20 core components and 72 feature templates), and the feature-set proposed by McDonald et al. (2005) for graph-based parsing, with the paper listing 18 templates for a first-order parser, and the MSTParser's first-order feature-extractor code containing roughly a hundred feature templates.
The core features in a transition-based parser usually look at information such as the word-identity and part-of-speech (POS) tags of a fixed number of words on top of the stack, a fixed number of words on the top of the buffer, the modifiers (usually left-most and right-most) of items on the stack and on the buffer, the number of modifiers of these elements, parents of words on the stack, and the length of the spans spanned by the words on the stack. The core features of a first-order graph-based parser usually take into account the word and POS of the head and modifier items, as well as POS-tags of the items around the head and modifier, POS tags of items between the head and modifier, and the distance and direction between the head and modifier.

Related Research Efforts
Coming up with a good feature-set for a parser is a hard and time consuming task, and many researchers attempt to reduce the required manual efforts. Lei et al. (2014) suggest a low-rank tensor representation to automatically find good feature combinations. Taub-Tabib et al. (2015) suggest a kernel-based approach to implicitly consider all possible feature combinations over sets of core-features. The recent popularity of neural-networks prompted a move from templates of sparse, binary indicator features to dense core feature encodings fed into non-linear classifiers. Chen and Manning (2014) encode each core feature of a greedy transition-based parser as a dense low-dimensional vector, and the vectors are then concatenated and fed into a non-linear classifier (multi-layer perceptron) which can potentially capture arbitrary feature combinations. Weiss et al. (2015) showed further gains using the same approach coupled with a somewhat improved set of core features, a more involved network architecture with skip-layers, beam-search decoding, and careful hyper-parameter tuning. Pei et al. (2015) apply a similar methodology to graph-based parsing. While the move to neural-network classifiers alleviates the need for hand-crafting feature combinations, the need to carefully define a set of core features remains. For example, the feature representation in Chen and Manning (2014) is a concatenation of 18 word vectors, 18 POS vectors and 12 dependency-label vectors.²
The above works tackle the effort in hand-crafting effective feature combinations. A different line of work attacks the feature-engineering problem by suggesting novel neural-network architectures for encoding the parser state, including intermediately-built subtrees, as vectors which are then fed to non-linear classifiers. Titov and Henderson (2007) encode the parser state using incremental sigmoid-belief networks. In the work of Dyer et al. (2015), the entire stack and buffer of a transition-based parser are encoded as stack-LSTMs, where each stack element is itself based on a compositional representation of parse trees. Le and Zuidema (2014) encode each tree node as two compositional representations capturing the inside and outside structures around the node, and feed the representations into a reranker. A similar reranking approach, this time based on convolutional neural networks, is taken by Zhu et al. (2015). Finally, in Kiperwasser and Goldberg (2016) we present an Easy-First parser based on a novel hierarchical-LSTM tree encoding.
In contrast to these, the approach we present in this work results in much simpler feature functions, without resorting to elaborate network architectures or compositional tree representations.
Vinyals et al. (2015) employ a sequence-to-sequence with attention architecture for constituency parsing. Each token in the input sentence is encoded in a deep-BiLSTM representation, and then the tokens are fed as input to a deep-LSTM that predicts a sequence of bracketing actions based on the already predicted bracketing as well as the encoded BiLSTM vectors. A trainable attention mechanism is used to guide the parser to relevant BiLSTM vectors at each stage. This architecture shares with ours the use of BiLSTM encoding and end-to-end training. The sequence of bracketing actions can be interpreted as a sequence of Shift and Reduce operations of a transition-based parser. However, while the parser of Vinyals et al. relies on a trainable attention mechanism for focusing on specific BiLSTM vectors, parsers in the transition-based family we use in Section 4 use a human designed stack and buffer mechanism to manually direct the parser's attention. While the effectiveness of the trainable attention approach is impressive, the stack-and-buffer guidance of transition-based parsers results in more robust learning. Indeed, work by Cross and Huang (2016), published while we were working on the camera-ready version of this paper, shows that the same methodology as ours is also highly effective for greedy, transition-based constituency parsing, surpassing the beam-based architecture of Vinyals et al. (88.3 F for Vinyals et al. vs. 89.8 F) when trained on the PTB dataset and without using orthogonal methods such as ensembling and up-training.

² [...] category, making it hard to tease apart the contribution of the automatic feature-combination component from that of the semi-supervised component.

Bidirectional Recurrent Neural Networks
Recurrent neural networks (RNNs) are statistical learners for modeling sequential data. An RNN allows one to model the ith element in the sequence based on the past: the elements $x_{1:i}$ up to and including it. The RNN model provides a framework for conditioning on the entire history $x_{1:i}$ without resorting to the Markov assumption which is traditionally used for modeling sequences. RNNs were shown to be capable of learning to count, as well as to model line lengths and complex phenomena such as bracketing and code indentation (Karpathy et al., 2015). Our proposed feature extractors are based on a bidirectional recurrent neural network (BiRNN), an extension of RNNs that takes into account both the past $x_{1:i}$ and the future $x_{i:n}$. We use a specific flavor of RNN called a long short-term memory network (LSTM). For brevity, we treat RNN as an abstraction, without getting into the mathematical details of the implementation of the RNNs and LSTMs. For further details on RNNs and LSTMs, the reader is referred to Goldberg (2015) and Cho (2015).
The recurrent neural network (RNN) abstraction is a parameterized function $\mathrm{RNN}_\theta(x_{1:n})$ mapping a sequence of n input vectors $x_{1:n}$, $x_i \in \mathbb{R}^{d_{in}}$, to a sequence of n output vectors $h_{1:n}$, $h_i \in \mathbb{R}^{d_{out}}$. Each output vector $h_i$ is conditioned on all the input vectors $x_{1:i}$, and can be thought of as a summary of the prefix $x_{1:i}$ of $x_{1:n}$. In our notation, we ignore the intermediate vectors $h_{1:n-1}$ and take the output of $\mathrm{RNN}_\theta(x_{1:n})$ to be the vector $h_n$.
A bidirectional RNN is composed of two RNNs, $\mathrm{RNN}_F$ and $\mathrm{RNN}_R$, one reading the sequence in its regular order, and the other reading it in reverse. Concretely, given a sequence of vectors $x_{1:n}$ and a desired index i, the function $\mathrm{BIRNN}_\theta(x_{1:n}, i)$ is defined as:

$$\mathrm{BIRNN}_\theta(x_{1:n}, i) = \mathrm{RNN}_F(x_{1:i}) \circ \mathrm{RNN}_R(x_{n:i})$$

The vector $v_i = \mathrm{BIRNN}(x_{1:n}, i)$ is then a representation of the ith item in $x_{1:n}$, taking into account both the entire history $x_{1:i}$ and the entire future $x_{i:n}$. We can view the BiRNN encoding of an item i as representing the item i together with a context of an infinite window around it.
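The composition of a forward and a reverse RNN described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a toy Elman-style tanh cell stands in for the LSTM, and the names `make_rnn`, `rnn` and `birnn` are hypothetical helpers, not part of the parser's code.

```python
import math
import random

def make_rnn(d_in, d_out, rng):
    """Return parameters (W_x, W_h, b) of a toy Elman cell (a stand-in for an LSTM)."""
    W_x = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    W_h = [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_out)]
    return W_x, W_h, [0.0] * d_out

def rnn(params, xs):
    """Run the cell over xs and return only the final state h_n, as in the text."""
    W_x, W_h, b = params
    h = [0.0] * len(b)
    for x in xs:
        h = [math.tanh(sum(wx * xi for wx, xi in zip(W_x[k], x))
                       + sum(wh * hi for wh, hi in zip(W_h[k], h))
                       + b[k])
             for k in range(len(b))]
    return h

def birnn(fwd, bwd, xs, i):
    """Concatenate RNN_F over x_1..x_i with RNN_R over x_n..x_i (i is 1-based)."""
    forward = rnn(fwd, xs[:i])               # reads x_1 .. x_i
    backward = rnn(bwd, xs[i - 1:][::-1])    # reads x_n .. x_i
    return forward + backward                # list concatenation

rng = random.Random(0)
fwd, bwd = make_rnn(3, 4, rng), make_rnn(3, 4, rng)
xs = [[rng.random() for _ in range(3)] for _ in range(5)]
v = [birnn(fwd, bwd, xs, i) for i in range(1, len(xs) + 1)]
```

For clarity the sketch recomputes both RNNs for every index, which is quadratic in the sentence length; a practical implementation would run each direction once and keep the intermediate states.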
BiRNN Training Initially, the BiRNN encodings $v_i$ do not capture any particular information. During training, the encoded vectors $v_i$ are fed into further network layers, until at some point a prediction is made, and a loss is incurred. The back-propagation algorithm is used to compute the gradients of all the parameters in the network (including the BiRNN parameters) with respect to the loss, and an optimizer is used to update the parameters according to the gradients. The training procedure causes the BiRNN function to extract from the input sequence $x_{1:n}$ the relevant information for the task at hand.

Historical Notes RNNs were introduced by Elman (1990), and extended to BiRNNs by Schuster and Paliwal (1997). The LSTM variant of RNNs is due to Hochreiter and Schmidhuber (1997). BiLSTMs were recently popularized by Graves (2008), and deep BiRNNs were introduced to NLP by Irsoy and Cardie (2014), who used them for sequence tagging. In the context of parsing, Lewis et al. (2016) use a BiLSTM sequence tagging model to assign a CCG supertag for each token in the sentence. The resulting supertag sequence is then fed into an A* CCG parser, producing state-of-the-art CCG parsing results (in that work, the BiLSTM is trained to produce accurate CCG supertags, and is not aware of the global parsing objective).

Our Approach
We propose to replace the hand-crafted feature functions with minimally-defined feature functions which make use of automatically learned Bidirectional LSTM representations.
Given an n-word input sentence s with words $w_1, \dots, w_n$ together with the corresponding POS tags $t_1, \dots, t_n$,³ we associate each word $w_i$ and POS tag $t_i$ with embedding vectors $e(w_i)$ and $e(t_i)$, and create a sequence of input vectors $x_{1:n}$ in which each $x_i$ is a concatenation of the corresponding word and POS vectors:

$$x_i = e(w_i) \circ e(t_i)$$

The embeddings are trained together with the model. This encodes each word in isolation, disregarding its context. We introduce context by representing each input element as its (deep) BiLSTM vector, $v_i$:

$$v_i = \mathrm{BiLSTM}(x_{1:n}, i)$$

Our feature function $\phi$ is then a concatenation of a small number of BiLSTM vectors. The exact feature function is parser dependent and will be discussed when discussing the corresponding parsers. The resulting feature vectors are then scored using a non-linear function, namely a multi-layer perceptron with one hidden layer (MLP):

$$\mathrm{MLP}_\theta(x) = W^2 \cdot \tanh(W^1 \cdot x + b^1) + b^2$$

Besides using the BiLSTM-based feature functions, we make use of standard parsing techniques. Crucially, the BiLSTM is trained jointly with the rest of the parsing objective. This allows it to learn representations which are suitable for the parsing task.
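As a rough illustration of the pipeline just described (embed, concatenate, score with a one-hidden-layer MLP), consider the following sketch. The embedding tables, dimensions and helper names are hypothetical, the tanh non-linearity is an assumption of the sketch, and the BiLSTM step is replaced by the identity so that the example stays short.

```python
import math
import random

def make_mlp(d_in, d_hidden, d_out, rng):
    """Random parameters (W1, b1, W2, b2) for a one-hidden-layer MLP."""
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_hidden)]
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(d_hidden)] for _ in range(d_out)]
    return W1, [0.0] * d_hidden, W2, [0.0] * d_out

def mlp(params, x):
    """Compute W2 * tanh(W1 * x + b1) + b2 (tanh is assumed here)."""
    W1, b1, W2, b2 = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * hj for w, hj in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

def input_vectors(words, tags, word_emb, tag_emb):
    """Build x_i as the concatenation of the word and POS embeddings of token i."""
    return [word_emb[w] + tag_emb[t] for w, t in zip(words, tags)]

rng = random.Random(1)
word_emb = {w: [rng.random() for _ in range(4)] for w in ("the", "dog", "barks")}
tag_emb = {t: [rng.random() for _ in range(2)] for t in ("DT", "NN", "VBZ")}
x = input_vectors(["the", "dog", "barks"], ["DT", "NN", "VBZ"], word_emb, tag_emb)

v = x                    # stand-in: a real model would use BiLSTM encodings here
phi = v[1] + v[2]        # feature vector: concatenation of two token vectors
scores = mlp(make_mlp(len(phi), 8, 3, rng), phi)
```

In a real parser the embeddings, the BiLSTM and the MLP would all be trained jointly; here they are random and serve only to show how the shapes fit together.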
Consider a concatenation of two BiLSTM vectors $(v_i \circ v_j)$ scored using an MLP. The scoring function has access to the words and POS-tags of $v_i$ and $v_j$, as well as the words and POS-tags of the words in an infinite window surrounding them. As LSTMs are known to capture length and sequence position information, it is very plausible that the scoring function can be sensitive also to the distance between i and j, their ordering, and the sequential material between them.
Parsing-time Complexity Once the BiLSTM is trained, parsing is performed by first computing the BiLSTM encoding $v_i$ for each word in the sentence (a linear time operation).⁴ Then, parsing proceeds as usual, where the feature extraction involves a concatenation of a small number of the pre-computed $v_i$ vectors.

Transition-based Parser
We begin by integrating the feature extractor in a transition-based parser (Nivre, 2008). We follow the notation in Goldberg and Nivre (2013). The transition-based parsing framework assumes a transition system, an abstract machine that processes sentences and produces parse trees. The transition system has a set of configurations and a set of transitions which are applied to configurations. When parsing a sentence, the system is initialized to an initial configuration based on the input sentence, and transitions are repeatedly applied to this configuration. After a finite number of transitions, the system arrives at a terminal configuration, and a parse tree is read off the terminal configuration. In a greedy parser, a classifier is used to choose the transition to take in each configuration, based on features extracted from the configuration itself. The parsing algorithm is presented in Algorithm 1 below:

Algorithm 1 Greedy transition-based parsing
1: Input: sentence $s = w_1, \dots, w_n, t_1, \dots, t_n$, parameterized function $\mathrm{SCORE}_\theta(\cdot)$ with parameters $\theta$.
2: $c \leftarrow \mathrm{INITIAL}(s)$
3: while not $\mathrm{TERMINAL}(c)$ do
4:     $\hat{t} \leftarrow \arg\max_{t \in \mathrm{LEGAL}(c)} \mathrm{SCORE}_\theta(\phi(c), t)$
5:     $c \leftarrow \hat{t}(c)$
6: return $\mathrm{tree}(c)$

Given a sentence s, the parser is initialized with the configuration c (line 2). Then, a feature function $\phi(c)$ represents the configuration c as a vector, which is fed to a scoring function SCORE assigning scores to (configuration, transition) pairs. SCORE scores the possible transitions t, and the highest scoring transition $\hat{t}$ is chosen (line 4). The transition $\hat{t}$ is applied to the configuration, resulting in a new parser configuration. The process ends when reaching a final configuration, from which the resulting parse tree is read and returned (line 6).
Transition systems differ by the way they define configurations, and by the particular set of transitions available to them. A parser is determined by the choice of a transition system, a feature function $\phi$ and a scoring function SCORE. Our choices are detailed below.
The Arc-Hybrid System Many transition systems exist in the literature. In this work, we use the arc-hybrid transition system (Kuhlmann et al., 2011a), which is similar to the more popular arc-standard system (Nivre, 2004), but for which an efficient dynamic oracle is available (Goldberg and Nivre, 2012; Goldberg and Nivre, 2013). In the arc-hybrid system, a configuration $c = (\sigma, \beta, T)$ consists of a stack $\sigma$, a buffer $\beta$, and a set T of dependency arcs. Both the stack and the buffer hold integer indices to sentence elements. Given a sentence $s = w_1, \dots, w_n, t_1, \dots, t_n$, the system is initialized with an empty stack, an empty arc set, and $\beta = 1, \dots, n, \mathrm{ROOT}$, where ROOT is the special root index. Any configuration c with an empty stack and a buffer containing only ROOT is terminal, and the parse tree is given by the arc set $T_c$ of c. The arc-hybrid system allows 3 possible transitions, SHIFT, $\mathrm{LEFT}_\ell$ and $\mathrm{RIGHT}_\ell$, defined as:

$$\mathrm{SHIFT}[(\sigma,\ b_0|\beta,\ T)] = (\sigma|b_0,\ \beta,\ T)$$
$$\mathrm{LEFT}_\ell[(\sigma|s_1|s_0,\ b_0|\beta,\ T)] = (\sigma|s_1,\ b_0|\beta,\ T \cup \{(b_0, s_0, \ell)\})$$
$$\mathrm{RIGHT}_\ell[(\sigma|s_1|s_0,\ \beta,\ T)] = (\sigma|s_1,\ \beta,\ T \cup \{(s_1, s_0, \ell)\})$$

The SHIFT transition moves the first item of the buffer ($b_0$) to the stack. The $\mathrm{LEFT}_\ell$ transition removes the first item on top of the stack ($s_0$) and attaches it as a modifier to $b_0$ with label $\ell$, adding the arc $(b_0, s_0, \ell)$. The $\mathrm{RIGHT}_\ell$ transition removes $s_0$ from the stack and attaches it as a modifier to the next item on the stack ($s_1$), adding the arc $(s_1, s_0, \ell)$.
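The arc-hybrid transitions and the greedy parsing loop can be put together in a small sketch. The choice of 0 as the ROOT index, the legality conditions in `legal`, and the function names are assumptions made for this illustration; labels are left as `None`, and the `score` argument is a placeholder for the trained scoring function.

```python
ROOT = 0  # hypothetical index for the special root symbol

def initial(n):
    """Empty stack, buffer 1..n followed by ROOT, empty arc set."""
    return [], list(range(1, n + 1)) + [ROOT], frozenset()

def is_terminal(config):
    stack, buffer, _ = config
    return not stack and buffer == [ROOT]

def legal(config):
    """Transitions whose preconditions hold in this configuration."""
    stack, buffer, _ = config
    moves = []
    if buffer and buffer[0] != ROOT:
        moves.append("SHIFT")
    if stack and buffer:
        moves.append("LEFT")       # attaches s0 to b0
    if len(stack) >= 2:
        moves.append("RIGHT")      # attaches s0 to s1
    return moves

def step(config, move, label=None):
    """Apply one transition and return the new configuration."""
    stack, buffer, arcs = config
    if move == "SHIFT":
        return stack + [buffer[0]], buffer[1:], arcs
    if move == "LEFT":
        return stack[:-1], buffer, arcs | {(buffer[0], stack[-1], label)}
    if move == "RIGHT":
        return stack[:-1], buffer, arcs | {(stack[-2], stack[-1], label)}
    raise ValueError(move)

def greedy_parse(n, score):
    """Repeatedly follow the highest-scoring legal transition."""
    config = initial(n)
    while not is_terminal(config):
        move = max(legal(config), key=lambda t: score(config, t))
        config = step(config, move)
    return set(config[2])

# Toy scorer that always prefers SHIFT, then RIGHT, then LEFT.
preference = {"SHIFT": 2, "RIGHT": 1, "LEFT": 0}
arcs = greedy_parse(3, lambda config, t: preference[t])
```

With this toy scorer the parser shifts all three words, then pops them with RIGHT until a single word remains, which is attached to ROOT by LEFT.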

Scoring Function
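We score transitions with an MLP applied to the feature vector, $\mathrm{SCORE}_\theta(\phi(c), t) = \mathrm{MLP}_\theta(\phi(c))[t]$, where the feature function is the concatenation $\phi(c) = v_{s_2} \circ v_{s_1} \circ v_{s_0} \circ v_{b_0}$ with $v_i = \mathrm{BiLSTM}(x_{1:n}, i)$. This feature function is rather minimal: it takes into account the BiLSTM representations of $s_1$, $s_0$ and $b_0$, which are the items affected by the possible transitions being scored, as well as one extra stack context $s_2$.⁵ Note that, unlike previous work, this feature function does not take into account T, the already built structure. The high parsing accuracies in the experimental sections suggest that the BiLSTM encoding is capable of estimating a lot of the missing information based on the provided stack and buffer elements and the sequential content between them.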
While not explored in this work, relying on only four word indices for scoring an action results in very compact state signatures, making our proposed feature representation very appealing for use in transition-based parsers that employ dynamic-programming search (Huang and Sagae, 2010b; Kuhlmann et al., 2011b).
Extended Feature Function One of the benefits of the greedy transition-based parsing framework is precisely its ability to look at arbitrary features from the already built tree. If we allow a somewhat less minimal feature function, we could add the BiLSTM vectors corresponding to the right-most and left-most modifiers of $s_0$, $s_1$ and $s_2$, as well as the left-most modifier of $b_0$, reaching a total of 11 BiLSTM vectors. We refer to this as the extended feature set. As we'll see in Section 6, using the extended set does indeed improve parsing accuracies when using pre-trained word embeddings, but has a minimal effect in the fully-supervised case.⁶

Details of the Training Algorithm
The training objective is to set the score of correct transitions above the scores of incorrect transitions. We use a margin-based objective, aiming to maximize the margin between the highest scoring correct action and the highest scoring incorrect action. The hinge loss at each parsing configuration c is defined as:

$$\max\Big(0,\ 1 - \max_{t_o \in G} \mathrm{MLP}(\phi(c))[t_o] + \max_{t_p \in A \setminus G} \mathrm{MLP}(\phi(c))[t_p]\Big)$$

where A is the set of possible transitions and G is the set of correct (gold) transitions at the current stage. At each stage of the training process the parser scores the possible transitions A, incurs a loss, selects a transition to follow, and moves to the next configuration based on it. The local losses are summed throughout the parsing process of a sentence, and the parameters are updated with respect to the sum of the losses at sentence boundaries.⁷ The gradients of the entire network (including the MLP and the BiLSTM) with respect to the sum of the losses are calculated using the backpropagation algorithm. As usual, we perform several training iterations over the training corpus, shuffling the order of sentences in each iteration.
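The margin-based objective for a single configuration can be written as a short function. This is a sketch only: `scores` is assumed to be a plain dictionary from transition names to the scores of the legal transitions A, `gold` is the set G, and the margin of 1 follows the hinge formulation described above.

```python
def hinge_loss(scores, gold):
    """Hinge loss between the best gold and the best non-gold transition."""
    best_gold = max(scores[t] for t in gold)
    wrong = [scores[t] for t in scores if t not in gold]
    if not wrong:
        return 0.0                      # nothing to separate the gold from
    return max(0.0, 1.0 - best_gold + max(wrong))

scores = {"SHIFT": 2.0, "LEFT": 1.5, "RIGHT": -1.0}
loss_one_gold = hinge_loss(scores, {"SHIFT"})            # margin violated
loss_two_gold = hinge_loss(scores, {"SHIFT", "LEFT"})    # margin satisfied
```

In training, these per-configuration losses would be summed over the sentence before the gradient step.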

Error-Exploration and Dynamic Oracle Training
We follow Goldberg and Nivre (2013; 2012) in using error exploration training with a dynamic-oracle, which we briefly describe below.

⁶ We did not experiment with other feature configurations. It is well possible that not all of the additional 7 child encodings are needed for the observed accuracy gains, and that a smaller feature set will yield similar or even better improvements.

⁷ To increase gradient stability and training speed, we simulate mini-batch updates by only updating the parameters when the sum of local losses contains at least 50 non-zero elements. Sums of fewer elements are carried across sentences. This assures us a sufficient number of gradient samples for every update, thus minimizing the effect of gradient instability.
At each stage in the training process, the parser assigns scores to all the possible transitions $t \in A$. It then selects a transition, applies it, and moves to the next step. Which transition should be followed? A common approach follows the highest scoring transition that can lead to the gold tree. However, when training in this way the parser sees only configurations that result from following correct actions, and as a result tends to suffer from error propagation at test time. Instead, in error-exploration training the parser follows the highest scoring action in A during training even if this action is incorrect, exposing it to configurations that result from erroneous decisions. This strategy requires defining the set G such that the correct actions to take are well-defined also for states that cannot lead to the gold tree. Such a set G is called a dynamic oracle. We perform error-exploration training using the dynamic-oracle defined in Goldberg and Nivre (2013).
Aggressive Exploration We found that even when using error-exploration, after one iteration the model remembers the training set quite well, and does not make enough errors to make error-exploration effective. In order to expose the parser to more errors, we follow an aggressive-exploration scheme: we sometimes follow incorrect transitions also if they score below correct transitions. Specifically, when the score of the correct transition is greater than that of the wrong transition but the difference is smaller than a margin constant, we choose to follow the incorrect action with probability $p_{agg}$ (we use $p_{agg} = 0.1$ in our experiments).
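The choice of which transition to follow during training, combining error exploration with the aggressive scheme, might look roughly as follows. The function name, the dictionary-based interface and the margin value of 1.0 are assumptions of this sketch; only $p_{agg} = 0.1$ comes from the text.

```python
import random

def choose_transition(scores, gold, p_agg=0.1, margin=1.0, rng=random):
    """Pick the transition to follow at training time.

    scores: dict mapping each legal transition to its score.
    gold:   set of transitions the dynamic oracle considers correct.
    """
    best_gold = max(gold, key=lambda t: scores[t])
    wrong = [t for t in scores if t not in gold]
    if not wrong:
        return best_gold
    best_wrong = max(wrong, key=lambda t: scores[t])
    gap = scores[best_gold] - scores[best_wrong]
    if gap < 0:
        return best_wrong      # error exploration: follow the model's top choice
    if gap < margin and rng.random() < p_agg:
        return best_wrong      # aggressive exploration inside the margin
    return best_gold
```

Passing a seeded `random.Random` instance as `rng` makes training runs reproducible.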

Summary
The greedy transition-based parser follows standard techniques from the literature (margin-based objective, dynamic oracle training, error exploration, MLP-based non-linear scoring function). We depart from the literature by replacing the hand-crafted feature function over carefully selected components of the configuration with a concatenation of BiLSTM representations of a few prominent items on the stack and the buffer, and training the BiLSTM encoder jointly with the rest of the network.

Figure 1: Illustration of the neural model of the graph-based parser when calculating the score of a given parse tree. The parse tree is depicted below the sentence. Each dependency arc in the sentence is scored using an MLP that is fed the BiLSTM encoding of the words at the arc's end points (the colors of the arcs correspond to colors of the MLP inputs above), and the individual arc scores are summed to produce the final score. All the MLPs share the same parameters. The figure depicts a single-layer BiLSTM, while in practice we use two layers. When parsing a sentence, we compute scores for all possible $n^2$ arcs, and find the best scoring tree using a dynamic-programming algorithm.

Graph-based Parser
Graph-based parsing follows the common structured prediction paradigm (Taskar et al., 2005; McDonald et al., 2005):

$$\mathrm{predict}(s) = \arg\max_{y \in \mathcal{Y}(s)} \mathrm{score}_{global}(s, y), \qquad \mathrm{score}_{global}(s, y) = \sum_{part \in y} \mathrm{score}_{local}(s, part)$$

Given an input sentence s (and the corresponding sequence of vectors $x_{1:n}$) we look for the highest-scoring parse tree y in the space $\mathcal{Y}(s)$ of valid dependency trees over s. In order to make the search tractable, the scoring function is decomposed to the sum of local scores for each part independently.
In this work, we focus on the arc-factored graph-based approach presented in McDonald et al. (2005). Arc-factored parsing decomposes the score of a tree to the sum of the scores of its head-modifier arcs (h, m):

$$\mathrm{parse}(s) = \arg\max_{y \in \mathcal{Y}(s)} \sum_{(h,m) \in y} \mathrm{score}(\phi(s, h, m))$$

Given the scores of the arcs, the highest scoring projective tree can be efficiently found using Eisner's decoding algorithm (1996). McDonald et al. and most subsequent work estimate the local score of an arc by a linear model parameterized by a weight vector w, and a feature function $\phi(s, h, m)$ assigning a sparse feature vector for an arc linking modifier m to head h. We follow Pei et al. (2015) and replace the linear scoring function with an MLP.
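The feature extractor $\phi(s, h, m)$ is usually complex, involving many elements (see Section 2.1). In contrast, our feature extractor uses merely the BiLSTM encoding of the head word and the modifier word:

$$\phi(s, h, m) = \mathrm{BIRNN}(x_{1:n}, h) \circ \mathrm{BIRNN}(x_{1:n}, m)$$

The final model is:

$$\mathrm{parse}(s) = \arg\max_{y \in \mathcal{Y}(s)} \sum_{(h,m) \in y} \mathrm{MLP}(v_h \circ v_m), \qquad v_i = \mathrm{BIRNN}(x_{1:n}, i)$$

The architecture is illustrated in Figure 1.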
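To make the arc-factored decomposition concrete, the sketch below scores every head-modifier pair, sums arc scores over a tree, and finds the best tree by exhaustive search. The exhaustive search is a stand-in for Eisner's algorithm: it is exponential, only usable on tiny inputs, and does not enforce projectivity. Index 0 is assumed to be ROOT, and `score_fn` is a placeholder for the MLP.

```python
from itertools import product

def arc_scores(vectors, score_fn):
    """Score every (head, modifier) pair from the concatenated token vectors."""
    n = len(vectors)
    return [[score_fn(vectors[h] + vectors[m]) for m in range(n)] for h in range(n)]

def tree_score(arc_score, heads):
    """Sum of arc scores; heads[m] is the head of word m, heads[0] is unused."""
    return sum(arc_score[heads[m]][m] for m in range(1, len(heads)))

def is_tree(heads):
    """True if every word reaches ROOT (index 0) without running into a cycle."""
    for m in range(1, len(heads)):
        seen, node = set(), m
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True

def best_tree(arc_score):
    """Exhaustive argmax over head assignments (a toy replacement for Eisner)."""
    n = len(arc_score) - 1
    best, best_heads = float("-inf"), None
    for assignment in product(range(n + 1), repeat=n):
        heads = [None] + list(assignment)
        if not is_tree(heads):
            continue
        total = tree_score(arc_score, heads)
        if total > best:
            best, best_heads = total, heads
    return best_heads, best

toy_scores = [[0, 5, 1],
              [0, 0, 4],
              [0, 3, 0]]     # toy_scores[h][m]: score of the arc h -> m
heads, total = best_tree(toy_scores)
```

In the toy matrix, word 1 is attached to ROOT and word 2 to word 1, which gives the highest total among the valid trees.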
Training The training objective is to set the score function such that the correct tree y is scored above incorrect ones. We use a margin-based objective (McDonald et al., 2005; LeCun et al., 2006), aiming to maximize the margin between the score of the gold tree y and the highest scoring incorrect tree y'. We define a hinge loss with respect to a gold tree y as:

$$\max\Big(0,\ 1 - \sum_{(h,m) \in y} \mathrm{MLP}(v_h \circ v_m) + \max_{y' \neq y} \sum_{(h,m) \in y'} \mathrm{MLP}(v_h \circ v_m)\Big)$$

Each of the tree scores is then calculated by activating the MLP on the arc representations. The entire loss can be viewed as the sum of multiple neural networks, which is sub-differentiable. We calculate the gradients with respect to all the parameters of the network (including the BiLSTM encoder and word embeddings).
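The structured hinge loss can be illustrated with precomputed arc scores. In this sketch `rival_heads` is assumed to be the highest scoring incorrect tree returned by the decoder; the helper names and the list-based encoding of trees (heads[m] is the head of word m, index 0 is ROOT) are hypothetical.

```python
def tree_score(arc_score, heads):
    """Sum of arc scores; heads[m] is the head of word m, heads[0] is unused."""
    return sum(arc_score[heads[m]][m] for m in range(1, len(heads)))

def structured_hinge(arc_score, gold_heads, rival_heads):
    """Hinge loss between the gold tree and the best incorrect tree."""
    return max(0.0, 1.0 - tree_score(arc_score, gold_heads)
                        + tree_score(arc_score, rival_heads))

toy_scores = [[0, 5, 1],
              [0, 0, 4],
              [0, 3, 0]]
no_loss = structured_hinge(toy_scores, [None, 0, 1], [None, 0, 0])
some_loss = structured_hinge(toy_scores, [None, 0, 0], [None, 0, 1])
```

When the gold tree already outscores its rival by at least the margin, the loss is zero and no update is needed.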
Labeled Parsing Up to now, we described unlabeled parsing. A possible approach for adding labels is to score the combination of an unlabeled arc (h, m) and its label $\ell$ by considering the label as part of the arc $(h, m, \ell)$. This results in |Labels| × |Arcs| parts that need to be scored, leading to slow parsing speeds and arguably a harder learning problem. Instead, we chose to first predict the unlabeled structure using the model given above, and then predict the label of each resulting arc. Using this approach, the number of parts stays small, enabling fast parsing.
The labeling of an arc (h, m) is performed using the same feature representation $\phi(s, h, m)$ fed into a different MLP predictor:

$$\mathrm{label}(h, m) = \arg\max_{\ell \in labels} \mathrm{MLP}_{LBL}(v_h \circ v_m)[\ell]$$

As before we use a margin-based hinge loss. The labeler is trained on the gold trees.⁸ The BiLSTM encoder responsible for producing $v_h$ and $v_m$ is shared with the arc-factored parser: the same BiLSTM encoder is used in the parser and the labeler. This sharing of parameters can be seen as an instance of multi-task learning (Caruana, 1997). As we show in Section 6, the sharing is effective: training the BiLSTM feature encoder to be good at predicting arc-labels significantly improves the parser's unlabeled accuracy.

Loss augmented inference
In initial experiments, the network learned quickly and overfit the data. In order to remedy this, we found it useful to use loss augmented inference (Taskar et al., 2005). The intuition behind loss augmented inference is to update against trees which have high model scores and are also very wrong. This is done by augmenting the score of each part not belonging to the gold tree by adding a constant to its score. Formally, the loss transforms as follows:

$$\max\Big(0,\ 1 - \mathrm{score}(x, y) + \max_{y' \neq y} \sum_{part \in y'} \big(\mathrm{score}_{local}(x, part) + c \cdot \mathbb{1}_{part \notin y}\big)\Big)$$

where c is the constant added to each part that is not in the gold tree.

Speed improvements The arc-factored model requires the scoring of $n^2$ arcs. Scoring is performed using an MLP with one hidden layer, resulting in $n^2$ matrix-vector multiplications from the input to the hidden layer, and $n^2$ multiplications from the hidden to the output layer. The first $n^2$ multiplications involve larger dimensional input and output vectors, and are the most time consuming. Fortunately, these can be reduced to 2n multiplications and $n^2$ vector additions, by observing that the multiplication $W \cdot (v_h \circ v_m)$ can be written as $W^1 \cdot v_h + W^2 \cdot v_m$, where $W^1$ and $W^2$ are the first and second half of the matrix W, and reusing the products across different pairs.

Summary The graph-based parser is a straightforward first-order parser, trained with a margin-based hinge-loss and loss-augmented inference. We depart from the literature by replacing the hand-crafted feature function with a concatenation of BiLSTM representations of the head and modifier words, and training the BiLSTM encoder jointly with the structured objective. We also introduce a novel MTL-based approach for labeled parsing by training a second-stage arc-labeler sharing the same BiLSTM encoder with the unlabeled parser.
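The speed improvement relies on splitting the first-layer matrix by columns, so that the product with a concatenated vector becomes a sum of two smaller products that can be cached per word. The sketch below checks this identity on a toy example; the helper names are hypothetical and plain Python lists stand in for an optimized matrix library.

```python
def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def split_columns(W, d):
    """Split W column-wise into the first d columns and the remaining ones."""
    return [row[:d] for row in W], [row[d:] for row in W]

def hidden_inputs_fast(W, vectors):
    """Pre-activations W * (v_h concat v_m) for all pairs, using 2n products."""
    d = len(vectors[0])
    W1, W2 = split_columns(W, d)
    head_part = [matvec(W1, v) for v in vectors]   # n products, reused as head
    mod_part = [matvec(W2, v) for v in vectors]    # n products, reused as modifier
    n = len(vectors)
    return [[[a + b for a, b in zip(head_part[h], mod_part[m])]
             for m in range(n)] for h in range(n)]

W = [[1, 2, 3, 4],
     [0, 1, 0, 1]]
vectors = [[1, 2], [3, 4], [5, 6]]
fast = hidden_inputs_fast(W, vectors)
slow = [[matvec(W, vh + vm) for vm in vectors] for vh in vectors]
```

Both computations give the same values; the fast variant performs 2n matrix-vector products and n² vector additions instead of n² products.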
We evaluated our parsing model on English and Chinese data. For comparison purposes we follow the setup of Dyer et al. (2015).
Data For English, we used the Stanford Dependency (SD) (de Marneffe and Manning, 2008) conversion of the Penn Treebank (Marcus et al., 1993), using the standard train/dev/test splits with the same predicted POS-tags as used in Dyer et al. (2015) and Chen and Manning (2014). This dataset contains a few non-projective trees. Punctuation symbols are excluded from the evaluation.
When using external word embeddings, we also use the same data as Dyer et al. (2015).

Implementation Details The parsers are implemented in Python, using the PyCNN toolkit for neural network training. The code will be made available on the first author's website. We use the LSTM variant implemented in PyCNN, and optimize using the Adam optimizer (Kingma and Ba, 2014). Unless otherwise noted, we use the default values provided by PyCNN (e.g. for random initialization, learning rates, etc.).
The word and POS embeddings $e(w_i)$ and $e(p_i)$ are initialized to random values and trained together with the rest of the parsers' networks. In some experiments, we also introduce pre-trained word embeddings. In those cases, the vector representation of a word is a concatenation of its randomly-initialized vector embedding with its pre-trained word vector. Both are tuned during training. We use the same word vectors as in Dyer et al. (2015).
During training, we employ a variant of word dropout (Iyyer et al., 2015), and replace a word with the unknown-word symbol with a probability that is inversely proportional to the frequency of the word. A word $w$ appearing $\#(w)$ times in the training corpus is replaced with the unknown symbol with probability $p_{unk}(w) = \frac{\alpha}{\#(w) + \alpha}$. If a word was dropped, the external embedding of the word is also dropped with probability 0.5.
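The word-dropout scheme can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released code: the function names, the `*UNK*` symbol, the `(trainable_key, external_key)` return format, and the value `alpha = 0.25` in the usage lines are all hypothetical choices made for the example.

```python
import random

def p_unk(count, alpha):
    """Probability of replacing a word seen `count` times: alpha / (count + alpha)."""
    return alpha / (count + alpha)

def apply_word_dropout(words, counts, alpha, rng, unk="*UNK*"):
    """Return (trainable_key, external_key) lookup pairs for one training sentence.

    trainable_key is the word or the unknown symbol; external_key is the key
    for the pre-trained embedding, or None when that embedding is dropped too.
    """
    out = []
    for w in words:
        if rng.random() < p_unk(counts.get(w, 0), alpha):
            # The word is dropped; its external embedding is also dropped
            # with probability one half.
            keep_external = rng.random() >= 0.5
            out.append((unk, w if keep_external else None))
        else:
            out.append((w, w))
    return out

# Usage with made-up counts: frequent words are rarely replaced,
# while a word unseen in training (count 0) is always replaced.
rng = random.Random(0)
counts = {"the": 1000, "parser": 3}
pairs = apply_word_dropout(["the", "parser", "zyzzyva"], counts, 0.25, rng)
```

Because the replacement is sampled anew each time a sentence is visited, rare words are seen both as themselves and as the unknown symbol over the course of training.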
We train the parsers for up to 30 iterations, and choose the best model according to the UAS accuracy on the development set.
Hyperparameter Tuning We performed a very minimal hyper-parameter search with the graph-based parser, and use the same hyper-parameters for both parsers. The hyper-parameters of the final networks used for all the reported experiments are detailed in Table 1.

It is clear that our parsers are very competitive, despite using very simple parsing architectures and minimal feature extractors. When not using external embeddings, the first-order graph-based parser with 2 features outperforms all other systems that are not using external resources, including the third-order TurboParser. The greedy transition-based parser with 4 features also matches or outperforms most other parsers, including the beam-based transition parser with heavily engineered features of Zhang and Nivre (2011) and the Stack-LSTM parser of Dyer et al. (2015), as well as the same parser when trained using a dynamic oracle (Ballesteros et al., 2016). Moving from the simple (4 features) to the extended (11 features) feature set leads to some gains in accuracy for both English and Chinese.
grades. We are not sure why this happens, and leave the exploration of effective semi-supervised parsing with the graph-based model for future work. The greedy parser does manage to benefit from the external embeddings, and when using them we also see gains from moving from the simple to the extended feature set. Both feature sets result in very competitive results, with the extended feature set yielding the best reported results for Chinese, and ranking second for English, after the heavily-tuned beam-based parser of Weiss et al. (2015).

Additional Results
We perform some ablation experiments in order to quantify the effect of the different components on our best models (Table 3).
Loss-augmented inference is crucial for the success of the graph-based parser, and the MTL arc-labeler contributes nicely to the unlabeled scores. Dynamic-oracle training yields nice gains for both English and Chinese.

Conclusion
We presented a frustratingly effective approach for feature extraction for dependency parsing based on a BiLSTM encoder that is trained jointly with the parser, and demonstrated its effectiveness by integrating it into two simple parsing models: a greedy transition-based parser and a globally optimized first-order graph-based parser, yielding very competitive parsing accuracies in both cases.
We use a variant of deep bidirectional RNN (or k-layer BiRNN) which is composed of k BiRNN functions $\mathrm{BIRNN}_1, \dots, \mathrm{BIRNN}_k$ that feed into each other: the output $\mathrm{BIRNN}_\ell(x_{1:n}, 1), \dots, \mathrm{BIRNN}_\ell(x_{1:n}, n)$ of $\mathrm{BIRNN}_\ell$ becomes the input of $\mathrm{BIRNN}_{\ell+1}$. Stacking BiRNNs in this way was empirically shown to be effective. In this work, we use BiRNNs and deep-BiRNNs interchangeably, specifying the number of layers when needed.
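The stacking of BiRNN layers can be sketched as follows. This is an illustrative toy, not the parser's encoder: it substitutes a simple tanh recurrent cell for the LSTM, uses small random weights without training, and all names (`make_cell`, `run_rnn`, `birnn`, `deep_birnn`) and sizes are assumptions made for the example.

```python
import math
import random

def make_cell(in_dim, out_dim, rng):
    """A simple tanh recurrent cell (a stand-in for an LSTM), with its initial state."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim + out_dim)]
         for _ in range(out_dim)]

    def step(state, x):
        z = x + state  # concatenate the input with the previous state
        return [math.tanh(sum(w * v for w, v in zip(row, z))) for row in W]

    return step, [0.0] * out_dim

def run_rnn(cell, xs):
    """Run a cell over a sequence and return the state after each position."""
    step, state = cell
    outputs = []
    for x in xs:
        state = step(state, x)
        outputs.append(state)
    return outputs

def birnn(fwd, bwd, xs):
    """One BiRNN layer: position i gets forward state i concatenated with backward state i."""
    f = run_rnn(fwd, xs)
    b = run_rnn(bwd, xs[::-1])[::-1]
    return [fi + bi for fi, bi in zip(f, b)]

def deep_birnn(layers, xs):
    """k-layer BiRNN: the outputs of layer l are the inputs of layer l + 1."""
    for fwd, bwd in layers:
        xs = birnn(fwd, bwd, xs)
    return xs

# Illustrative sizes only.
rng = random.Random(0)
in_dim, hid, k, n = 3, 4, 2, 5
layers = []
d = in_dim
for _ in range(k):
    layers.append((make_cell(d, hid, rng), make_cell(d, hid, rng)))
    d = 2 * hid  # the next layer reads the concatenated forward/backward states
sentence = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(n)]
vectors = deep_birnn(layers, sentence)
```

Each of the `n` output vectors has dimension `2 * hid` and depends on the whole input sequence, which is the property the parsers rely on when they use these vectors as features.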
Chen and Manning (2014) required the feature function φ(·) to encode non-linearities in the form of combination features. We follow Chen and Manning (2014) and replace the linear scoring model with an MLP.
Traditionally, the scoring function $\mathrm{SCORE}_\theta(x, t)$ is a discriminative linear model of the form $\mathrm{SCORE}_W(x, t) = (W \cdot x)[t]$.

Table 1: Hyper-parameter values used in experiments.

Main Results Table 2 lists the test-set accuracies of our best parsing models, compared to other state-of-the-art parsers from the literature.

Table 2: Test-set parsing results of various state-of-the-art parsing systems on the English (PTB) and Chinese (CTB) datasets. The systems that use embeddings may use different pre-trained embeddings. English results use predicted POS tags (different systems use different taggers), while Chinese results use gold POS tags. PTB-YM: English PTB, Yamada and Matsumoto head rules. PTB-SD: English PTB, Stanford Dependencies (different systems may use different versions of the Stanford converter). CTB:

Table 3: Ablation experiment results (dev set) for the graph-based parser without external embeddings and the greedy parser with external embeddings and extended feature set.

Acknowledgements This research is supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and the Israeli Science Foundation (grant number 1555/15).