Shift-Reduce Constituent Parsing with Neural Lookahead Features

Transition-based models can be fast and accurate for constituent parsing. Compared with chart-based models, they leverage richer features by extracting history information from a parser stack, which consists of a sequence of non-local constituents. On the other hand, during incremental parsing, constituent information on the right hand side of the current word is not utilized, which is a relative weakness of shift-reduce parsing. To address this limitation, we leverage a fast neural model to extract lookahead features. In particular, we build a bidirectional LSTM model, which leverages full sentence information to predict the hierarchy of constituents that each word starts and ends. The results are then passed to a strong transition-based constituent parser as lookahead features. The resulting parser gives 1.3% absolute improvement in WSJ and 2.3% in CTB compared to the baseline, giving the highest reported accuracies for fully-supervised parsing.


Introduction
Transition-based constituent parsers are fast and accurate, performing incremental parsing using a sequence of state transitions in linear time. Pioneering models rely on a classifier to make local decisions, searching greedily for local transitions to build a parse tree (Sagae and Lavie, 2005). Zhu et al. (2013) use a beam-search framework, which preserves the linear time complexity of greedy search while alleviating the disadvantage of error propagation. The model gives state-of-the-art accuracies at a speed of 89 sentences per second on the standard WSJ benchmark (Marcus et al., 1993). Zhu et al. (2013) exploit rich features by extracting history information from the parser stack, which consists of a sequence of non-local constituents. However, due to the incremental nature of shift-reduce parsing, the constituents on the right-hand side of the current word cannot be used to guide the action at each step. Such lookahead features (Tsuruoka et al., 2011) correspond to the outside scores in chart parsing (Goodman, 1998), which have been shown effective for obtaining improved accuracies.
To leverage such information for improving shift-reduce parsing, we propose a novel neural model to predict the constituent hierarchy related to each word before parsing. Our idea is inspired by the work of Roark and Hollingshead (2009) and Zhang et al. (2010b), which shows that shallow syntactic information gathered over the word sequence can be used to prune chart parsers, improving chart parsing speed without sacrificing accuracy. For example, Roark and Hollingshead (2009) predict constituent boundary information on words as a preprocessing step, and use such information to prune the chart. Since such information is much lighter-weight compared to full parsing, it can be predicted relatively accurately using sequence labellers.
Different from Roark and Hollingshead (2009), we collect lookahead constituent information for shift-reduce parsing, rather than pruning information for chart parsing. Our main concern is improving the accuracy rather than improving the speed. Accordingly, our model should predict the constituent hierarchy for each word rather than simple boundary information. For example, in Figure 1(a), the constituent hierarchy that the word "The" starts is "S → NP", and the constituent hierarchy that the word "table" ends is "S → VP → NP → PP → NP".   For each word, we predict both the constituent hierarchy it starts and the constituent hierarchy it ends, using them as lookahead features. The task is challenging. First, it is significantly more difficult compared to simple sequence labelling, since two sequences of constituent hierarchies must be predicted for each word in the input sequence. Second, for high accuracies, global features from the full sentence are necessary since constituent hierarchies contain rich structural information. Third, to retain high speed for shift-reduce parsing, lookahead feature prediction must be executed efficiently. It is highly difficult to build such a model using manual discrete features and structured search.
Fortunately, sequential recurrent neural networks (RNNs) are remarkably effective models for encoding the full input sentence. We leverage RNNs for building our constituent hierarchy predictor. In particular, an LSTM (Hochreiter and Schmidhuber, 1997) is used to learn global features automatically from the input words. For each word, a second LSTM is then used to generate the constituent hierarchies greedily using features from the hidden layer of the first LSTM, in the same way a neural language model decoder generates output sentences for machine translation (Bahdanau et al., 2015). The resulting model addresses all three challenges raised above. For fully-supervised learning, we learn word embeddings as part of the model parameters.
On the standard WSJ (Marcus et al., 1993) and CTB 5.1 (Xue et al., 2005) tests, our method improves over the baseline of Zhu et al. (2013), resulting in accuracies of 91.7 F1 for English and 85.5 F1 for Chinese, which are the best reported results for fully-supervised models in the literature. We release our code, based on ZPar (Zhang and Clark, 2011; Zhu et al., 2013), at https://github.com/SUTDNLP/LookAheadConparser.

The Shift-Reduce System
Shift-reduce parsers process an input sentence incrementally from left to right. A stack is used to maintain partial phrase-structures, while the incoming words are ordered in a buffer. At each step, a transition action is applied to consume an input word or construct a new phrase-structure. The set of transition actions is as follows.

• SHIFT: pop the front word off the buffer, and push it onto the stack.
• REDUCE-L/R-X: pop the top two constituents off the stack (L/R means that the head is the left constituent or the right constituent, respectively), combine them into a new constituent with label X, and push the new constituent onto the stack.
• UNARY-X: pop the top constituent off the stack, raise it to a new constituent X, and push the new constituent onto the stack.
• FINISH: pop the root node off the stack and end parsing.
• IDLE: take no action on a completed state, leaving the stack and buffer unchanged; this ensures that all items in beam search have the same number of actions (Zhu et al., 2013).
The deduction system for the process is shown in Figure 2, where a state is represented as [stack, buffer front index, completion mark, action index], and n is the number of words in the input. For example, given the sentence "They like apples", the action sequence "SHIFT, SHIFT, SHIFT, REDUCE-L-VP, REDUCE-R-S" gives its syntax "(S They (VP like apples) )".
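To make the transition system concrete, the following minimal Python sketch (our own illustration, not the parser's actual ZPar-based implementation) replays the action sequence above on a plain stack and buffer; the L/R head distinction of REDUCE is omitted since it does not change the bracketing.

    # Toy replay of the shift-reduce actions for "They like apples".
    def shift(stack, buffer):
        stack.append(buffer.pop(0))

    def reduce_lr(stack, label):
        right = stack.pop()
        left = stack.pop()
        stack.append((label, left, right))     # binary REDUCE; head direction omitted

    def unary(stack, label):
        stack.append((label, stack.pop()))     # UNARY-X

    stack, buffer = [], ["They", "like", "apples"]
    shift(stack, buffer)
    shift(stack, buffer)
    shift(stack, buffer)
    reduce_lr(stack, "VP")                     # REDUCE-L-VP
    reduce_lr(stack, "S")                      # REDUCE-R-S
    print(stack[0])                            # ('S', 'They', ('VP', 'like', 'apples'))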

Search and Training
Beam-search is used for decoding, with the k best state items at each step kept in the agenda. During initialization, the agenda contains only the initial state [φ, 0, false, 0]. At each step, every state in the agenda is popped and expanded by applying all valid transition actions, and the top k resulting states are put back onto the agenda (Zhu et al., 2013). The process repeats until the agenda is empty, and the best completed state is taken as the output. The score of a state α is the total score of the transition actions that have been applied to build it:

score(\alpha) = \sum_{i=1}^{N} \Phi(\alpha_i) \cdot \vec{\theta},    (1)

where \Phi(\alpha_i) represents the feature vector for the ith action \alpha_i in the state item \alpha, and N is the total number of actions in \alpha.

Table 1: Baseline feature templates, where s_i represents the ith item on the top of the stack and q_i denotes the ith item at the front of the buffer; w denotes the lexical head of an item, c the constituent label of an item, t the POS of a lexical head, u a unary child, and s_i ll the left child of s_i's left child.
UNIGRAM: s0tc, s0wc, s1tc, s1wc, s2tc, s2wc, s3tc, s3wc, q0wt, q1wt, q2wt, q3wt, s0lwc, s0rwc, s0uwc, s1lwc, s1rwc, s1uwc
BIGRAM: s0ws1w, s0ws1c, s0cs1w, s0cs1c, s0wq0w, s0wq0t, s0cq0w, s0cq0t, q0wq1w, q0wq1t, q0tq1w, q0tq1t, s1wq0w, s1wq0t, s1cq0w, s1cq0t
TRIGRAM: s0cs1cs2c, s0ws1cs2c, s0cs1wq0t, s0cs1cs2w, s0cs1cq0t, s0ws1cq0t, s0cs1wq0t, s0cs1cq0w
EXTENDED: s0llwc, s0lrwc, s0luwc, s0rlwc, s0rrwc, s0ruwc, s0ulwc, s0urwc, s0uuwc, s1llwc, s1lrwc, s1luwc, s1rlwc, s1rrwc, s1ruwc
The model parameter vector θ is trained online using the averaged perceptron algorithm with the early-update strategy (Collins and Roark, 2004).
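The following schematic sketch shows how such training could look; it is our simplified illustration (weight averaging is omitted), with decode and features standing in as placeholders for the beam-search decoder and the feature extractors of Tables 1 and 2.

    from collections import defaultdict

    def train_epoch(data, decode, features, theta=None):
        # data: (sentence, gold action sequence) pairs
        theta = theta if theta is not None else defaultdict(float)
        for sentence, gold_actions in data:
            # decode returns the best-scoring action prefix and the gold prefix of
            # the same length; the prefixes are full sequences if the gold item
            # survives in the beam to the end (early update otherwise)
            pred_prefix, gold_prefix = decode(sentence, gold_actions, theta)
            if pred_prefix != gold_prefix:
                for actions, delta in ((gold_prefix, +1.0), (pred_prefix, -1.0)):
                    for i, action in enumerate(actions):
                        for f in features(sentence, actions[:i], action):
                            theta[f] += delta
        return theta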

Baseline Features
Our baseline features are taken from Zhu et al. (2013). As shown in Table 1, they include the UNIGRAM, BIGRAM and TRIGRAM features of Zhang and Clark (2009) and the extended features of Zhu et al. (2013).
Table 2: Lookahead feature templates, where s_i represents the ith item on the top of the stack and q_i denotes the ith item at the front of the buffer. The symbols g_s and g_e denote the next-level constituent in the s-type hierarchy and the e-type hierarchy, respectively.
Templates: s0gs, s0ge, s1gs, s1ge, q0gs, q0ge, q1gs, q1ge

Global Lookahead Features
The baseline features suffer from two limitations, as mentioned in the introduction. First, they are relatively local to the state, considering only the neighbouring nodes of s0 (top of stack) and q0 (front of buffer). Second, they do not consider lookahead information beyond s3 and q3, such as the syntactic structure that the words remaining in the buffer will form. We use an LSTM to capture full sentential information in linear time, representing such global information as a constituent hierarchy for each word, which is fed into the baseline parser. Lookahead features are extracted from the constituent hierarchy to provide top-down guidance for bottom-up parsing.

Constituent Hierarchy
In a constituency tree, each word can start or end a constituent hierarchy. As shown in Figure 1, the word "The" starts a constituent hierarchy "S → NP". In particular, it starts a constituent S in the top level, dominating a constituent NP. The word "table" ends a constituent hierarchy "S → VP → NP → PP → NP". In particular, it ends a constituent hierarchy, with a constituent S on the top level, dominating a VP (starting from the word "like"), and then an NP (starting from the noun phrase "this book"), and then a PP (starting from the word "in"), and finally an NP (starting from the word "the"). The extraction of constituent hierarchies for each word is based on unbinarized grammars, reflecting the unbinarized trees that the word starts or ends. The constituent hierarchy is empty (denoted as φ) if the corresponding word does not start or end a constituent. The constituent hierarchies are added into the shift-reduce parser as soft features (section 3.2).
Formally, a constituent hierarchy is defined as

[c_1 \rightarrow c_2 \rightarrow ... \rightarrow c_m]_{type},

where each c_j is a constituent label (e.g. NP), "→" represents the top-down hierarchy, and type can be s or e, denoting that the current word starts or ends the constituent hierarchy, respectively, as shown in Figure 1. Compared with full parsing, the constituent hierarchies associated with each word have no forced structural dependencies between each other, and can therefore be modelled more easily, for each word individually. Being soft lookahead features rather than hard constraints, their inter-dependencies are not crucial for the main parser.
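As a concrete illustration, the sketch below (our reconstruction of the extraction step, not the released code) reads the s-type and e-type hierarchies off an unbinarized tree represented as nested (label, children...) tuples; an empty list corresponds to φ.

    def hierarchies(tree):
        words, starts, ends = [], [], []

        def visit(node, opened, closed):
            label, children = node[0], node[1:]
            # pre-terminal (POS, word): POS tags are not part of the hierarchies
            if len(children) == 1 and isinstance(children[0], str):
                words.append(children[0])
                starts.append(opened)          # constituents this word starts, top-down
                ends.append(closed)            # constituents this word ends, top-down
                return
            for i, child in enumerate(children):
                visit(child,
                      opened + [label] if i == 0 else [],
                      closed + [label] if i == len(children) - 1 else [])

        visit(tree, [], [])
        return words, starts, ends

    tree = ("S", ("NP", ("PRP", "They")),
                 ("VP", ("VBP", "like"), ("NP", ("NNS", "apples"))))
    print(hierarchies(tree))
    # (['They', 'like', 'apples'],
    #  [['S', 'NP'], ['VP'], ['NP']],         # s-type hierarchies
    #  [['NP'], [], ['S', 'VP', 'NP']])       # e-type hierarchies; [] is phi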

Lookahead Features
The lookahead feature templates are defined in Table 2. To ensure parsing efficiency, only simple feature templates are taken into consideration. The lookahead features of a state are instantiated for the top two items on the stack (i.e., s0 and s1) and on the buffer (i.e., q0 and q1). A new feature function \Phi' is defined to output the lookahead feature vector, and the scoring of a state in our model extends Formula (1) with an additional term \Phi'(\alpha_i) \cdot \vec{\theta}':

score'(\alpha) = \sum_{i=1}^{N} (\Phi(\alpha_i) \cdot \vec{\theta} + \Phi'(\alpha_i) \cdot \vec{\theta}')    (2)

For each word, the lookahead feature represents the next-level constituent in the top-down hierarchy, which can guide bottom-up parsing.
For example, Figure 3 shows two intermediate states during parsing. In Figure 3(a), the s-type and e-type lookahead features of s1 (i.e., the word "The") are extracted from the bottom level of its constituent hierarchies, namely NP and NULL, respectively. In Figure 3(b), on the other hand, the s-type lookahead feature of s1 is extracted from the s-type constituent hierarchy of the same word "The", but at the current hierarchical level it is S. The e-type lookahead feature, in turn, is extracted from the e-type constituent hierarchy of the end word "students" of the VP constituent, which is NULL at the next level. Lookahead features for items on the buffer are extracted in the same way.
Figure 3: Two intermediate states for parsing the sentence "The past and present students like this book on the table". Each item on the stack or buffer has two constituent hierarchies, s-type (left) and e-type (right), in the corresponding box. Note that the e-type constituent hierarchy of the word "students" is incorrectly predicted, yet is used as soft constraints (i.e., features) in our model.

For example, given the intermediate state in Figure 3(a), s0 has an s-type lookahead feature ADJP, and q1 in the buffer has an e-type lookahead feature ADJP. This indicates that the two items are likely to be reduced into the same constituent. Further, s0 cannot end a constituent because of its empty e-type constituent hierarchy. As a result, the final shift-reduce parser will assign a higher score to the SHIFT decision.
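The level-by-level use of a hierarchy can be pictured with the small helper below (a simplification we use for illustration: n_built counts how many constituents of that word's hierarchy already exist in the partial tree).

    def next_level(hierarchy, n_built):
        # hierarchy is top-down, e.g. ["S", "NP"]; parsing consumes it bottom-up
        remaining = hierarchy[:len(hierarchy) - n_built]
        return remaining[-1] if remaining else "NULL"

    print(next_level(["S", "NP"], 0))   # NP   (the situation of Figure 3(a))
    print(next_level(["S", "NP"], 1))   # S    (the situation of Figure 3(b))
    print(next_level(["S", "NP"], 2))   # NULL (the hierarchy is exhausted)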

Constituent Hierarchy Prediction
We propose a novel neural model for constituent hierarchy prediction. Inspired by the encoder-decoder framework for neural machine translation (Bahdanau et al., 2015;Cho et al., 2014), we use an LSTM to capture full sentence features, and another LSTM to generate the constituent hierarchies for each word. Compared with a CRF-based sequence labelling model (Roark and Hollingshead, 2009), the proposed model has three advantages. First, the global features can be automatically represented. Second, it can avoid the exponentially large number of labels if constituent hierarchies are treated as unique labels. Third, the model size is relatively small, and does not have a large effect on the final parser model. As shown in Figure 4, the neural network consists of three main layers, namely the input layer, the encoder layer and the decoder layer. The input layer represents each word using its characters and token information; the encoder hidden layer uses a bidirectional recurrent neural network structure to learn global features from the sentence; and the decoder layer predicts constituent hierarchies according to the encoder layer features, by using the attention mechanism (Bahdanau et al., 2015) to compute the contribution of each hidden unit of the encoder.

Input Layer
The input layer generates a dense vector representation of each input word. We use character embeddings to alleviate OOV problems of word embeddings (Santos and Zadrozny, 2014; Kim et al., 2016), concatenating a character-based representation of a word with its word embedding. Formally, the input representation x_i of the word w_i is computed by

x_i = x^w_i \oplus c^{att}_i,    c^{att}_i = \sum_{j} \alpha_{ij} \hat{c}_{ij},

where x^w_i is the word embedding vector of w_i according to an embedding lookup table, c^{att}_i is the character-based representation of w_i, c_{ij} is the embedding of the jth character in w_i, \hat{c}_{ij} is the character window representation centred at c_{ij}, and \alpha_{ij} is the contribution of \hat{c}_{ij} to c^{att}_i, obtained by normalizing attention scores over the character windows of w_i with a softmax.

(Figure 4 notation: \overleftarrow{h}_i denotes the right-to-left encoder hidden units; s denotes the decoder hidden state vector; y_{ij} is the jth label of the word w_i.)
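A PyTorch-style sketch of this layer is given below; the dimensions, module names and the use of a single linear layer to score each character window are our assumptions, not the authors' exact implementation.

    import torch
    import torch.nn as nn

    class InputLayer(nn.Module):
        def __init__(self, n_words, n_chars, d_word=100, d_char=50, char_win=3):
            super().__init__()
            self.word_emb = nn.Embedding(n_words, d_word)
            self.char_emb = nn.Embedding(n_chars, d_char)
            self.char_window = nn.Linear(char_win * d_char, d_char)  # builds c_hat_{ij}
            self.att = nn.Linear(d_char, 1)                          # scores each c_hat_{ij}

        def forward(self, word_idx, char_windows):
            # char_windows: (chars_in_word, char_win) character indices for one word
            c = self.char_emb(char_windows).flatten(1)
            c_hat = torch.tanh(self.char_window(c))
            alpha = torch.softmax(self.att(c_hat), dim=0)     # contributions alpha_{ij}
            c_att = (alpha * c_hat).sum(dim=0)                # attention-weighted sum
            return torch.cat([self.word_emb(word_idx), c_att], dim=-1)

    layer = InputLayer(n_words=10000, n_chars=100)
    x_i = layer(torch.tensor(42), torch.randint(0, 100, (6, 3)))  # a 6-character word
    print(x_i.shape)                                              # torch.Size([150])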

Encoder Layer
The encoder first uses a window strategy to represent input nodes with their corresponding local context nodes. Formally, the window representation of the input node x_i is the concatenation

\hat{x}_i = x_{i-k} \oplus ... \oplus x_i \oplus ... \oplus x_{i+k}

of the input vectors in a window of size 2k+1 centred on x_i. Second, the encoder scans the input sentence and generates hidden units for each input word using a recurrent neural network (RNN), which represents features of the word based on the global sequence. Formally, given the windowed input nodes \hat{x}_1, \hat{x}_2, ..., \hat{x}_n for the sentence w_1, w_2, ..., w_n, the RNN layer calculates a hidden node sequence h_1, h_2, ..., h_n.
Long Short-Term Memory (LSTM) mitigates the vanishing gradient problem in RNN training by introducing gates (i.e., input i, forget f and output o) and a cell memory vector c. We use the variation of Graves and Schmidhuber (2008). Formally, the values in the LSTM hidden layers are computed as follows:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
h_t = o_t \odot \tanh(c_t),

where \odot is pair-wise multiplication. Further, in order to collect features for x_i from both x_1, ..., x_{i-1} and x_{i+1}, ..., x_n, we use a bidirectional variation (Schuster and Paliwal, 1997; Graves et al., 2013). As shown in Figure 4, the hidden units are generated by concatenating the corresponding hidden states of the left-to-right and right-to-left LSTMs: \overleftrightarrow{h}_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i].
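Continuing the same PyTorch-style sketch (the sizes are ours for illustration), the window concatenation and the bidirectional LSTM over one sentence can be written as:

    import torch
    import torch.nn as nn

    def window(x, k=1):
        # x: (sent_len, d) input vectors; returns (sent_len, (2k+1)*d) windowed vectors
        n, d = x.shape
        padded = torch.cat([torch.zeros(k, d), x, torch.zeros(k, d)], dim=0)
        return torch.cat([padded[i:i + n] for i in range(2 * k + 1)], dim=1)

    encoder = nn.LSTM(input_size=3 * 150, hidden_size=200,
                      num_layers=2, bidirectional=True)
    x = torch.randn(12, 150)                    # input vectors for a 12-word sentence
    h, _ = encoder(window(x).unsqueeze(1))      # (12, 1, 400): [->h_i ; <-h_i] per word
    print(h.shape)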

Decoder Layer
The decoder hidden layer uses two different LSTMs to generate the s-type and e-type sequences of constituent labels from each encoder hidden output, respectively, as shown in Figure 4. Each constituent hierarchy is generated bottom-up, one label at a time. In particular, a sequence of state vectors is generated recurrently, with each state yielding an output constituent label. The process starts with a zero state vector and ends when a NULL constituent is generated. The recurrent state transition is achieved using an LSTM, with the hidden vectors of the encoder layer used as context features. Formally, for word w_i, the value of the jth state unit s_{ij} of the LSTM is computed by

s_{ij} = LSTM(a_{ij}, s_{i,j-1}),

where the context vector a_{ij} is computed by

a_{ij} = \sum_{k=1}^{n} \beta_{ijk} \overleftrightarrow{h}_k.

Here \overleftrightarrow{h}_k refers to the encoder hidden vector for w_k, and the contribution weights \beta_{ijk} are computed using the attention mechanism of Bahdanau et al. (2015).
The constituent labels are generated from each state unit s_{ij}, where each label y_{ij} is the output of a SOFTMAX function:

p(y_{ij} = l) = \frac{\exp(s_{ij} W_l)}{\sum_{k} \exp(s_{ij} W_k)},

where y_{ij} = l denotes that the jth label of the ith word is l (l ∈ L). As shown in Figure 4, the SOFTMAX functions are applied to the state units of the decoder, generating hierarchical labels bottom-up until the default label NULL is predicted.
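The sketch below (same PyTorch-style assumptions; the exact attention parameterization is ours) shows one such decoder: an LSTM cell whose input at every level is an attention-weighted sum of the encoder vectors, conditioned on the current word, emitting labels until NULL.

    import torch
    import torch.nn as nn

    class HierarchyDecoder(nn.Module):
        def __init__(self, d_enc, d_state, labels):
            super().__init__()
            self.labels = labels                       # label set L, including "NULL"
            self.cell = nn.LSTMCell(d_enc, d_state)
            self.att = nn.Linear(2 * d_enc + d_state, 1)
            self.out = nn.Linear(d_state, len(labels))

        def forward(self, enc, i, max_depth=10):
            # enc: (sent_len, d_enc) encoder vectors; i: index of the current word
            s = enc.new_zeros(1, self.cell.hidden_size)
            c = enc.new_zeros(1, self.cell.hidden_size)
            hierarchy = []
            for _ in range(max_depth):
                query = torch.cat([s, enc[i:i + 1]], dim=1).expand(enc.size(0), -1)
                beta = torch.softmax(self.att(torch.cat([enc, query], dim=1)), dim=0)
                a = (beta * enc).sum(dim=0, keepdim=True)   # context vector a_{ij}
                s, c = self.cell(a, (s, c))                 # state unit s_{ij}
                label = self.labels[self.out(s).argmax(dim=1).item()]
                if label == "NULL":
                    break
                hierarchy.append(label)
            return hierarchy                                # bottom-up label sequence

    dec = HierarchyDecoder(d_enc=400, d_state=200, labels=["NULL", "NP", "VP", "S", "PP"])
    print(dec(torch.randn(12, 400), i=0))       # arbitrary labels here: weights are untrained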

Training
We use two separate models to assign the s-type and e-type labels, respectively. For training each constituent hierarchy predictor, we minimize the following training objective:

L(\Theta) = -\sum_{i=1}^{T} \sum_{j=1}^{Z_i} \log p_{ijo} + \frac{\lambda}{2} \lVert\Theta\rVert^2,

where T is the length of the sentence, Z_i is the depth of the constituent hierarchy of the word w_i, p_{ijo} stands for p(y_{ij} = o) as given by the SOFTMAX function, and o is the gold label. We apply back-propagation, using momentum stochastic gradient descent (Sutskever et al., 2013) with a learning rate of η = 0.01 and a regularization parameter of λ = 10^{-6}.
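In code, the objective is a sum of per-level negative log-likelihoods; the sketch below assumes the label distributions are already log-softmax scores (the optimiser line is commented out because the momentum value itself is not given in the text).

    def hierarchy_loss(log_probs, gold):
        # log_probs[i][j]: log-softmax scores over labels at level j of word i
        # gold[i][j]: index of the gold label at that level (the last one being NULL)
        return -sum(log_probs[i][j][gold[i][j]]
                    for i in range(len(gold)) for j in range(len(gold[i])))

    # optimiser = torch.optim.SGD(model.parameters(), lr=0.01,
    #                             momentum=0.9, weight_decay=1e-6)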

Experiment Settings
Our English data are taken from the Wall Street Journal (WSJ) sections of the Penn Treebank (Marcus et al., 1993). We use sections 2-21 for training, section 24 for system development, and section 23 for final performance evaluation. Our Chinese data are taken from version 5.1 of the Penn Chinese Treebank (CTB) (Xue et al., 2005). We use articles 001-270 and 440-1151 for training, articles 301-325 for system development, and articles 271-300 for final performance evaluation. For both the English and Chinese data, we adopt ZPar for POS tagging, and use ten-fold jackknifing to assign POS tags automatically to the training data. In addition, we use ten-fold jackknifing to assign constituent hierarchies automatically to the training data, so that the parser can be trained with lookahead features from the constituent hierarchy predictor. We use the F1 score to evaluate constituent hierarchy prediction. For example, if the prediction is "S → S → VP → NP" and the gold is "S → NP → NP", the evaluation process matches the two hierarchies bottom-up. The precision is 2/4 = 0.5, the recall is 2/3 = 0.66, and the F1 score is 0.57. A label is counted as correct if and only if it occurs at the correct position.
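The bottom-up matching used to score predicted hierarchies can be reproduced with a few lines; this small sketch mirrors the worked example above.

    def hierarchy_f1(pred, gold):
        # match labels bottom-up; a label is correct only at the correct position
        matches = sum(p == g for p, g in zip(reversed(pred), reversed(gold)))
        precision = matches / len(pred) if pred else 0.0
        recall = matches / len(gold) if gold else 0.0
        return 2 * precision * recall / (precision + recall) if matches else 0.0

    print(hierarchy_f1(["S", "S", "VP", "NP"], ["S", "NP", "NP"]))  # 0.571...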

Model Settings
For training the constituent hierarchy prediction model, gold constituent labels are derived from labelled constituency trees in the training data. The hyper-parameters are chosen according to development tests, and the values are shown in Table 3.
For the shift-reduce constituency parser, we set the beam size to 16 for both training and decoding, which achieves a good tradeoff between efficiency and accuracy (Zhu et al., 2013). The optimal number of training iterations is determined on the development sets.

Results of Constituent Hierarchy Prediction
Table 4 shows the results of constituent hierarchy prediction, where word and character embeddings are randomly initialized and fine-tuned during training. The third column shows the development parsing accuracies when the predicted labels are used as lookahead features. As Table 4 shows, when the number of hidden layers increases, both s-type and e-type constituent hierarchy prediction improve. The accuracy of e-type prediction is relatively lower due to right-branching in the treebank, which makes e-type hierarchies longer than s-type hierarchies. In addition, a 3-layer LSTM does not give significant improvements over a 2-layer LSTM. For a better tradeoff between efficiency and accuracy, we choose the 2-layer LSTM as our constituent hierarchy predictor.

Table 5 shows ablation results for constituent hierarchy prediction given by reduced architectures, in which the character embeddings, the input word windows, or both are removed. The original architecture achieves the highest constituent hierarchy prediction performance. Removing only the character embeddings has a relatively small influence on constituent hierarchy prediction, and removing only the input word windows has an even smaller influence. Nevertheless, both of these ablations lead to lower parsing accuracies, and the architecture without both the character embeddings and the input word windows has a relatively low F-score.

Final Results
For English, we compare the final results with previous related work on the WSJ test set. As shown in Table 6, our model achieves a 1.3% F1 improvement over the baseline parser with fully-supervised learning (Zhu et al., 2013), and outperforms the state-of-the-art fully-supervised systems (Carreras et al., 2008; Shindo et al., 2012) by 0.6% F1. In addition, by achieving 91.7% F1 on the WSJ test set, our fully-supervised model also catches up with many state-of-the-art semi-supervised models (Zhu et al., 2013; Huang and Harper, 2009; Huang et al., 2010; Durrett and Klein, 2015). The size of our model is much smaller than that of the semi-supervised model of Zhu et al. (2013), which contains rich features from a large automatically parsed corpus; our model is about the same size as the baseline parser.

Table 6: Comparison of related work on the WSJ test set. * denotes neural parsing; † denotes methods using a shift-reduce framework.

We carry out the Chinese experiments with the same models, and compare the final results with previous related work on the CTB test set. As shown in Table 7, our model achieves a 2.3% F1 improvement over the state-of-the-art baseline system with fully-supervised learning (Zhu et al., 2013), which is by far the best result in the literature. In addition, by achieving 85.5% F1 on the CTB test set, our fully-supervised model is also comparable to many state-of-the-art semi-supervised models (Zhu et al., 2013; Wang and Xue, 2014; Dyer et al., 2016). Note that Wang and Xue (2014) perform joint POS tagging and parsing.

Table 7: Comparison of related work on the CTB 5.1 test set. * denotes neural parsing; † denotes methods using a shift-reduce framework; ‡ denotes joint POS tagging and parsing.

Table 8 shows the running times of various parsers on the test sets, on an Intel 2.2 GHz processor with 16 GB memory. Our parser is much faster than related parsers using the same shift-reduce framework (Sagae and Lavie, 2005; Sagae and Lavie, 2006). Compared to the baseline parser (89.5 sentences per second), our parser runs at 79.2 sentences per second, a modest cost for the added lookahead features.

Table 8: Running times of different parsers.
Parser                      #Sent/Second
Ratnaparkhi (1997)          Unk
Collins (2003)              3.5
Charniak (2000)             5.7
Sagae and Lavie (2005)      3.7
Sagae and Lavie (2006)      2.2
Petrov and Klein (2007)     6.2
Carreras et al. (2008)      Unk
Zhu et al. (2013)           89.5
This work                   79.2

Error Analysis
We conduct error analysis by measuring parsing accuracies against: different phrase types, constituents of different span lengths, and different sentence lengths.

Phrase Type

Table 9 shows the accuracies of the baseline parser and the final parser with lookahead features on 9 common phrase types. As the results show, the parser with lookahead features achieves improvements on all of the frequent phrase types, with relatively larger improvements on VP, S, SBAR and WHNP. The constituent hierarchy predictor has relatively better performance on s-type labels for the constituents VP, WHNP and PP, which are prone to errors by the baseline system; the constituent hierarchy can give the constituent parser guidance for tackling these errors. Compared to the s-type constituent hierarchy, the e-type constituent hierarchy is relatively more difficult to predict, particularly for constituents with long spans such as VP, S and SBAR. Despite this, the e-type constituent hierarchies, even with relatively low accuracies, also benefit the prediction of constituents with long spans.

Table 9: Comparison between the baseline parser and the parser with lookahead features on different phrase types, with the corresponding constituent hierarchy predictor performances.

Span Length

Figure 5 shows the F1-scores of the two parsers on constituents with different span lengths. As the results show, lookahead features are helpful on both large and small spans, and the performance gap between the two parsers grows as the size of the span increases. This reflects the usefulness of the long-range information captured by the constituent hierarchy predictor and the lookahead features.

Sentence Length

Figure 6 shows the F1-scores of the two parsers on sentences of different lengths. As the results show, the parser with lookahead features outperforms the baseline system on both short and long sentences, and the performance gap between the two parsers grows as the length of the sentence increases. The constituent hierarchy predictors generate hierarchical constituents for each input word using global information. For longer sentences, the predictors yield deeper constituent hierarchies, offering correspondingly richer lookahead features. As a result, compared to the baseline parser, the performance of the parser with lookahead features decreases more slowly as sentence length increases.

Related Work
Our lookahead features are similar in spirit to the pruners of Roark and Hollingshead (2009) and Zhang et al. (2010b), which infer the maximum length of the constituents that a particular word can start or end. However, our method is different in three main ways. First, rather than using a CRF with sparse local word window features, a neural network is used to obtain dense global features over the sentence. Second, not only the size of constituents but also the constituent hierarchy is identified for each word. Third, the results are added into a transition-based parser as soft features, rather than being used as hard constraints in a chart parser. Our concept of constituent hierarchies is similar to supertags in the sense that both are shallow parses. For lexicalized grammars such as Combinatory Categorial Grammar (CCG), Tree-Adjoining Grammar (TAG) and Head-Driven Phrase Structure Grammar (HPSG), each word in the input sentence is assigned one or more supertags, which identify the syntactic role of the word and constrain parsing (Clark, 2002; Clark and Curran, 2004; Carreras et al., 2008; Ninomiya et al., 2006; Dridan et al., 2008; Faleńska et al., 2015). For a lexicalized grammar, supertagging can benefit parsing in both accuracy and efficiency by offering almost-parsing information. In particular, Carreras et al. (2008) used the concept of spines for TAG (Schabes, 1992; Vijay-Shanker and Joshi, 1988), which is similar to our constituent hierarchy. However, there are three differences. First, the spine is defined to describe the main syntactic tree structure with a series
of unary projections, while a constituent hierarchy is defined to describe how words can start or end hierarchical constituents (it can be empty if the word cannot start or end constituents). Second, spines are extracted from gold trees and used to prune the search space of parsing as hard constraints; in contrast, we use constituent hierarchies as soft features. Third, Carreras et al. (2008) use spines to prune chart parsing, while we use constituent hierarchies to improve a linear shift-reduce parser. For lexicalized grammars, supertags can benefit parsing significantly since they contain rich syntactic information as almost parsing (Bangalore and Joshi, 1999). Recently, there has been a line of work on better supertagging. Zhang et al. (2010a) proposed efficient methods to obtain supertags for HPSG parsing using dependency information. Xu et al. (2015) and Vaswani et al. (2016) leverage recurrent neural networks for supertagging for CCG parsing. In contrast, our models predict the constituent hierarchy instead of a single supertag for each word in the input sentence.
Our constituent hierarchy predictor is also related to sequence-to-sequence learning (Sutskever et al., 2014), which has been successfully used in neural machine translation (Bahdanau et al., 2015). The neural model encodes the source-side sentence into dense vectors, which are then used to generate the target side word by word. There has also been work that directly applies sequence-to-sequence models to constituent parsing, generating constituent trees from raw sentences (e.g., Luong et al., 2015). Compared to this line of work, which predicts a full parse tree from the input, our predictors tackle a much simpler task, predicting the constituent hierarchies of each word separately. In addition, the outputs of the predictors are used as soft lookahead features in bottom-up parsing, rather than being taken as output structures directly.
By integrating a neural constituent hierarchy predictor, our parser is related to neural network models for parsing, which have given competitive accuracies for both constituency parsing (Dyer et al., 2016; Cross and Huang, 2016; Watanabe and Sumita, 2015) and dependency parsing (Chen and Manning, 2014; Zhou et al., 2015). In particular, our parser is more closely related to neural models that integrate discrete manual features (Socher et al., 2013; Durrett and Klein, 2015). Socher et al. (2013) use neural features to rerank a sparse baseline parser; Durrett and Klein (2015) directly integrate sparse features into the neural layers of a chart parser. In contrast, we integrate neural information into sparse features in the form of lookahead features.
There has also been work on lookahead features for parsing. Tsuruoka et al. (2011) run a baseline parser for a few future steps, and use the output actions to guide the current action. In contrast to their model, our model leverages full sentential information, yet is significantly faster.
Previous work has investigated more efficient parsing without loss of accuracy, which is required by real-time applications such as parsing the web. Zhang et al. (2010b) introduced a chart pruner to accelerate a CCG parser. Kummerfeld et al. (2010) proposed a self-training method focusing on increasing the speed of a CCG parser rather than its accuracy.

Conclusion
We proposed a novel constituent hierarchy predictor based on recurrent neural networks, aiming to capture global sentential information. The resulting constituent hierarchies are fed to a baseline shift-reduce parser as lookahead features, addressing the limitation of shift-reduce parsers in not leveraging right-hand side syntax for local decisions, while maintaining the same model size and speed. The resulting fully-supervised parser outperforms the state-of-the-art baseline parser, achieving 91.7% F1 on the standard WSJ evaluation and 85.5% F1 on the standard CTB evaluation.