Head-Lexicalized Bidirectional Tree LSTMs

Sequential LSTMs have been extended to model tree structures, giving competitive results for a number of tasks. Existing methods model constituent trees by bottom-up combinations of constituent nodes, making direct use of input word information only for leaf nodes. This is different from sequential LSTMs, which contain references to input words for each node. In this paper, we propose a method for automatic head-lexicalization for tree-structure LSTMs, propagating head words from leaf nodes to every constituent node. In addition, enabled by head lexicalization, we build a tree LSTM in the top-down direction, which corresponds to bidirectional sequential LSTMs in structure. Experiments show that both extensions give better representations of tree structures. Our final model gives the best results on the Stanford Sentiment Treebank and highly competitive results on the TREC question type classification task.


Introduction
Both sequence structured and tree structured neural models have been applied to NLP problems. Seminal work uses convolutional neural networks (Collobert and Weston, 2008), recurrent neural networks (Elman, 1990;Mikolov et al., 2010) and recursive neural networks (Socher et al., 2011) for sequence and tree modeling. Long short-term memory (LSTM) networks have significantly improved accuracies in a variety of sequence tasks Bahdanau et al., 2015) compared to vanilla recurrent neural networks. Addressing diminishing gradients effectively, they have been extended to tree structures, achieving promising results for tasks such as syntactic language modeling , sentiment analysis (Li et al., 2015;Le and Zuidema, 2015;Tai et al., 2015;Teng et al., 2016) and relation extraction (Miwa and Bansal, 2016).
Depending on the node type, typical tree structures in NLP can be categorized to constituent trees and dependency trees. A salient difference between the two types of tree structures is in the node. While dependency tree nodes are input words themselves, constituent tree nodes represent syntactic constituents. Only leaf nodes in constituent trees correspond to words. Though LSTM structures have been developed for both types of trees above, we investigate constituent trees in this paper. There are three existing methods for constituent tree LSTMs Tai et al., 2015;Le and Zuidema, 2015), which make essentially the same extension from sequence structured LSTMs. We take the method of  as our baseline. Figure 1 shows the sequence structured LSTM of Hochreiter and Schmidhuber (1997) and the treestructured LSTM of , illustrating the input (x), cell (c) and hidden (h) nodes at a certain time step t. The most important difference between Figure 1(a) and Figure 1(b) is the branching factor. While a cell in the sequence structure LSTM depends on the single previous hidden node, a cell in the tree-structured LSTM depends on a left hidden node and a right hidden node. Such tree-structured extensions of the sequence structured LSTM assume that the constituent tree is binarized, building hidden nodes from the input words in the bottom-up direction. The leaf node structure is shown in Figure 1(c).
A second salient difference between the two types of LSTMs is the modeling of input words. While each cell in the sequence structure LSTM directly depends on its corresponding input word ( Figure  1(a)), only leaf cells in the tree structure LSTM directly depend on corresponding input words ( Figure  1(c)). This corresponds well to the constituent tree structure, where there is no direct association between non-leaf constituent nodes and input words. However, it leaves the tree structure a degraded version of a perfect binary-branching variation of the sequence-structure LSTM, with one important source of information (i.e. words) missing in forming a cell (Figure 1(b)).
We fill this gap by proposing an extension to the tree LSTM model, injecting lexical information into every node in the tree. Our method takes inspiration from work on head-lexicalization, which shows that each node in a constituent tree structure is governed by a head word. As shown in Figure 2, the head word for the verb phrase "visited Mary" is "visited", and the head word of the adverb phrase "this afternoon" is "afternoon". Research has shown that head word information can significantly improve the performance of syntactic parsing (Collins, 2003;Clark and Curran, 2004). Correspondingly, we use the head lexical information of each constituent word as the input node x for calculating the corresponding cell c in Figure 1  Traditional head-lexicalization relies on specific rules (Collins, 2003;Zhang and Clark, 2009), typically extracting heads from constituent treebanks according to certain grammar formalisms. For better generalization, we use a neural attention mechanism to derive head lexical information automatically, rather than relying on linguistic head rules to find the head lexicon of each constituent, which is language-and formalism-dependent.
Based on such head lexicalization, we further make a bidirectional extension of the tree structured LSTM, propagating information in the top-down direction as well as the bottom-up direction. This is analogous to the bidirectional extension of sequence structured LSTMs, which are commonly used for NLP tasks such as speech recognition (Graves et al., 2013), sentiment analysis (Tai et al., 2015;Li et al., 2015) and machine translation Bahdanau et al., 2015) tasks.
Results on a standard sentiment classification benchmark and a question type classification benchmark show that our tree LSTM structure gives significantly better accuracies compared with the method of . We achieve the best reported results for sentiment classification. Interestingly, the head lexical information that is learned automatically from the sentiment treebank consists of both syntactic head information and key sentiment word information. This shows the advantage of automatic head-finding as compared with rule-based head lexicalization. We make our code available under GPL at https://github.com/ zeeeyang/lexicalized_bitreelstm.

Related Work
LSTM Recurrent neural network (RNN) (Elman, 1990;Mikolov et al., 2010) achieves success on 164 modeling linear structures due to its ability to preserve history over arbitrary length sequences. At each step, RNN decides its hidden state based on both the current input and the previous hidden state. In theory, it can carry over unbounded history. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a special type of RNN that leverages multiple gate vectors and a memory cell vector to solve the vanishing and exploding gradient problems of training RNNs. It has been successfully applied to parsing (Vinyals et al., 2015a), sentiment classification (Tai et al., 2015;Li et al., 2015), speech recognition (Graves et al., 2013), machine translation Bahdanau et al., 2015) and image captioning (Vinyals et al., 2015b). There are many variants of sequential LSTMs, such as simple Gated Recurrent Neural Networks (Cho et al., 2014). Greff et al. (2017) compared various architectures of LSTM. In this paper, we take the standard LSTM with peephole connections (Gers and Schmidhuber, 2000) as a baseline.
Structured LSTM There has been a line of research that extends the standard sequential LSTM in order to model more complex structures. Kalchbrenner et al. (2016) proposed Grid LSTMs to process multi-dimensional data. Theis and Bethge (2015) proposed Spatial LSTMs to handle image data. Dyer et al. (2015) designed Stack LSTMs by adding a top pointer to sequential LSTMs to deal with push and pop sequences of a stack. Tai et al. (2015),  and Le and Zuidema (2015) extended sequential LSTMs to Tree-Structured LSTMs (Tree LSTMs) by adding branching factors. Experiments demonstrated that Tree LSTMs can outperform competitive LSTM baselines on several tasks, such as semantic relatedness prediction and sentiment classification. Li et al. (2015) further investigated the effectiveness of Tree LSTMs on various tasks and discussed when Tree LSTMs are necessary. In addition, Li et al. (2016) employed graph gated units to model graph-based structures.
Tree LSTM The idea of extending linear recurrent structures to tree recurrent structures is reminiscent of extending Recurrent Neural Network to Recursive Neural Network (ReNN) (Socher et al., 2013b;Le and Zuidema, 2014) to support information flow over trees. In addition to Tai et al. (2015),  and Le and Zuidema (2015), who explicitly named their models as Tree LSTMs, Cho et al. (2014) designed gated recurrent units over tree structures, and Chen et al. (2015) introduced gate mechanisms to recursive neural networks. These can also be regarded as variants of Tree LSTMs.
Both  and Le and Zuidema (2015) proposed Binary Tree LSTM models, which can be applied to situations where there are exactly two children of each internal node in a tree. The difference between  and Le and Zuidema (2015) is that besides using two forget gates, Le and Zuidema (2015) also make use of two input gates to let a node know its sibling. Tai et al. (2015) introduced Child-Sum Tree LSTM and Nary Tree LSTM. Child-Sum Tree LSTMs can support multiple children, while N-ary Tree LSTMs work for trees with a branching factor of at most N . In this perspective, Binary Tree LSTM is a special case of N-ary Tree LSTM with N = 2.
When a Child-Sum Tree LSTM is applied to a dependency tree, it is referred to as a Dependency Tree LSTM. A Binary Tree LSTM is also referred to as a Constituent Tree LSTM. Based on Tai et al. (2015), Miwa and Bansal (2016) introduced a Tree LSTM model that can handle different types of children. A dependency tree naturally contains lexical information at every node, while only leaf nodes contain lexical information in a constituent tree. None of these methods (Tai et al., 2015;Le and Zuidema, 2015) make direct use of lexical input for internal nodes when using constituent Tree LSTMs.
Bi-LSTM Another common extension to sequential LSTM is to include bidirectional information (Graves et al., 2013), which can model history both left-to-right and right-to-left. The aforementioned Tree LSTM models (Tai et al., 2015;Le and Zuidema, 2015) propagate the history of children to their parent in the bottom-up direction only, while ignoring the top-down information flow from parents to children.  proposed a top-down Tree LSTM to estimate the generation probability of a dependency tree. However, no corresponding bottom-up Tree LSTM is incorporated into their model. Paulus et al. (2014) leveraged bidirectional information over recursive binary trees by propagating global belief down from the tree root to leaf nodes. However, their model is based on recursive neural network rather than LSTM. Miwa and Bansal (2016) adopted a bidirectional Tree LSTM model to jointly extract named entities and relations under a dependency tree structure. For constituent tree structures, however, their model does not work due to lack of word inputs on non-leaf constituent nodes, and in particular the root node. Our head lexicalization allows us to investigate the top-down constituent Tree LSTM. To our knowledge, we are the first to report a bidirectional constituent Tree LSTM.

Baselines
A sequence-structure LSTM estimates a sequence of hidden state vectors given a sequence of input vectors, through the calculation of a sequence of hidden cell vectors using a gate mechanism. For NLP, the input vectors are typically word embeddings (Mikolov et al., 2013), but can also include partof-speech (POS) embeddings, character embeddings or other types of information. For notational convenience, we refer to the input vectors as lexical vectors.
Formally, given an input vector sequence x 1 , x 2 , . . . , x n , each state vector h t is estimated from the Hadamard product of a cell vector c t and a corresponding output gate vector o t Here the cell vector depends on both the previous cell vector c t , and a combination of the previous state vector h t−1 ; the current input vector x t : The combination of c t−1 and g t is controlled by the Hadamard product between a forget gate vector f t and an input gate vector i t , respectively. The gates o t , f t and i t are defined as follows where σ is the sigmoid function.
The bottom-up Tree LSTM of  extends the left-to-right sequence LSTM by splitting the previous state vector h t−1 into a left child state vector h L t−1 and a right child state vector h R t−1 , and the previous cell vector c t−1 into a left child cell vector c L t−1 and a right child cell vector c R t−1 , calculating c t as and the input/output gates i t /o t as The forget gate f t is split into f L t and f R t for regulating c L t−1 and c R t−1 , respectively: g t depends on both h L t−1 and h R t−1 , but as shown in Figure 1 Finally, the hidden state vector h t is calculated in the same way as in the sequential LSTM model shown in Equation

Head Lexicalization
We introduce an input lexical vector x t to the calculation of each cell vector c t via a bottom-up head propagation mechanism. As shown in the shaded nodes in Figure 3 (b), the head propagation mechanism is parallel to the cell propagation mechanism. In contrast, the method of  in Figure  3 (a) does not have the input vector x t for non-leaf constituents.
There are multiple ways to choose a head lexicon for a given binary-branching constituent. One simple method is to choose the head lexicon of the left child as the head (left-headedness). Correspondingly, an alternative is to use the right child for head lexicon. There is less consistency in the governing head lexicons across variations of the same type of constituents with slightly different typologies. Hence, simple baselines can be less effective compared to linguistically motivated head findings.
Rather than selecting head lexicons using manually-defined head-finding rules, which are language-and formalism-dependent (Collins, 2003), we cast head finding as a part of the neural network model, learning the head lexicon of each constituent by a gated combination of the head lexicons of its two children 1 . Formally, where x t represents the head lexicon vector of the current constituent, x L t−1 represents the head lexicon of its left child constituent, and x R t−1 represents the head lexicon of its right child constituent. The gate z t is calculated based on x L t−1 and x R t−1 , Here W L zx , W R zx and b z are model parameters.

Lexicalized Tree LSTM
Given head lexicon vectors for nodes, the Tree LSTM of  can be extended by leveraging x t in calculating the corresponding c t . In particular, x t is used to estimate the input (i t ), output 1 In this paper, we work on binary trees only, which is a common form for CKY and shift-reduce parsing. Typical binarization methods, such as head binarization (Klein and Manning, 2003) , also rely on specific head-finding rules.
(o t ) and forget (f R t and f L t ) gates: In addition, x t is also used in computing g t , With the new definition of i t , f R t , f L t and g t , the computing of c t remains the same as the baseline Tree LSTM model as shown in Equation 4. Similarly, h t remains the Hadamard product of c t and the new o t as shown in Equation 1.
In this model, W xi , W xf , W xg and W xo are newly-introduced model parameters. The use of x t in computing the gate and cell values are consistent with those in the baseline sequential LSTM.

Bidirectional Extensions
Given a sequence of input vectors [x 1 , x 2 , . . . , x n ], a bidirectional sequential LSTM (Graves et al., 2013) computes two sets of hidden state vectors, [h 1 ,h 2 , . . . ,h n ] and [h n ,h n−1 , . . . ,h 1 ] in the left-to-right and the right-to-left directions, respectively. The final hidden state h i of the input x i is the concatenation of the corresponding state vectors in the two LSTMs, The two LSTMs can share the same model parameters or use different parameters. We choose the latter in our baseline experiments.
We make a bidirectional extension to the Lexicalized Tree LSTM in Section 4.2 by following the sequential LSTMs in Section 3, adding an additional set of hidden state vectors in the top-down direction. Different from the bottom-up direction, each hidden state in the top-down LSTM has exactly one predecessor. In fact, the path from the root of a tree down to any node forms a sequential LSTM.
Note, however, that two different sets of model parameters are used when the current node is the left and the right child of its predecessor. Denoting the two sets of parameters as U L and U R , respectively, the hidden state vector h 7 in Figure 4 is calculated from the hidden state vector h 1 using the parameter the computing of h t follows the sequential LSTM model: With the gate values being defined as: Here N ∈ {L, R} and One final note is that the top-down Tree LSTM is enabled by the head propagation mechanism, which allows a head lexicon node to be made available for the root constituent node. Without such information, it would be difficult to build top-down LSTM for constituent trees.

Usage for Classification
We apply the bidirectional Tree LSTM to classification tasks, where the input is a sentence with its binarized constituent tree, and the output is a discrete label. We denote the bottom-up hidden state vector of the root ash ROOT ↑ , the top-down hidden state vector of the root ash ROOT ↓ and the top-down hidden state vectors of the input words x 1 , x 2 , . . . , x n ash 1 ,h 2 , . . . ,h n . We take the concatenation of h ROOT ↑ ,h ROOT ↓ and the average ofh 1 ,h 2 , . . . ,h n as the final representation h of the sentence: A softmax classifier is used to predict the probability p j of sentiment label j from h by where W hl , b hl , W lp and b lp are model parameters, and ReLU is the rectifier function f (x) = max(0, x). During prediction, the largest probability component of P will be taken as the answer.

Training
We train our classifier to maximize the conditional log-likelihood of gold labels of training samples. Formally, given a training set of size |D|, the training objective is defined by where Θ is the set of model parameters, λ is a regularization parameter, y i is the gold label of the ith training sample and p y i is obtained according to Equation 16. For sequential LSTM models, we collect errors over each sequence. For Tree LSTMs, we sum up errors at every node. The model parameters are optimized using ADAM (Kingma and Ba, 2015) without gradient clipping, with the default hyper-parameters of the AdamTrainer in the Dynet toolkits. 2 We also use dropout (Srivastava et al., 2014) at lexical input embeddings with a fixed probability p drop to avoid overfitting. p drop is set to 0.5 for all tasks.
Following Tai et al. (2015), Li et al. (2015),  and Le and Zuidema (2015), we use Glove-300d word embeddings 3 to train our model. The pretrained word embeddings are fine-tuned for all tasks. Unknown words are handled in two steps. First, if a word is not contained in the pretrained word embeddings, but its lowercased form exists in the embedding table, we use the lowercase as a replacement. Second, if both the original word and its lowercased form cannot be found, we treat the word as unk. The embedding vector of the UNK token is initialized as the average of all embedding vectors.
We use one hidden layer and the same dimensionality settings for both sequential and Tree LSTMs. LSTM hidden states are of size 150. The output hidden size is 128 and 64 for the sentiment classification task and the question type classification task, respectively. Each model is trained for 30 iterations. The same training procedure repeats five times using different random seeds, with parameters being evaluated at the end of every iteration on the development set. The model that gives the best development result is used for final tests.

Experiments
The effectiveness of our model is tested mainly on a sentiment classification task and a question type classification task.

Tasks
Sentiment Classification. For sentiment classification, we use the same data settings as . Specifically, we use the Stanford Sentiment Treebank (Socher et al., 2013b). Each sentence is annotated with a constituent tree. Every internal node corresponds to a phrase. Each node is manually assigned an integer sentiment label from 0 to 4, that correspond to five sentiment classes: very negative, negative, neutral, positive and very positive, respectively. The root label represents the sentiment label of the whole sentence.
We perform both binary classification and finegrained classification. Following previous work, we use labels of all phrases for training. Gold-standard 3 http://nlp.stanford.edu/data/glove.840B.300d.zip tree structures are used for training and testing (Le and Zuidema, 2015;Li et al., 2015;Tai et al., 2015). Accuracies are evaluated for both the sentence root labels and phrase labels.
Question Type Classification. For the question type classification task, we use the TREC data (Li and Roth, 2002). Each training sample in this dataset contains a question sentence and its corresponding question type. We work on the sixway coarse classification task, where the six question types are ENTY, HUM, LOC, DESC, NUM and ABBR, corresponding to ENTITY, HUMAN, LOCA-TION, DESCRIPTION, NUMERIC VALUE and AB-BREVIATION, respectively. For example, the type for the sentence "What year did the Titanic sink?" is NUM. The training set consists of 5,452 examples and the test set contains 500 examples. Since there is no development set, we follow Zhou et al. (2015), randomly extracting 500 examples from the training set as a development set. Unlike the sentiment treebank, there is no annotated tree for each sentence. Instead, we obtain an automatically parsed tree for each sentence using ZPar 4 off-the-shelf (Zhang and Clark, 2011). Another difference between the TREC data and the sentiment treebank is that there is only one label, at the root node, rather than a label for each phrase.

Baselines
We consider two models for our baselines. The first is bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997;Graves et al., 2013). Our bidirectional constituency Tree LSTM (BiConTree) is compared against BiLSTM to investigate the effectiveness of tree structures. For the sentiment task, following Tai et al. (2015) and Li et al. (2015), we convert the treebank into sequences to allow the bidirectional LSTM model to make use of every phrase span as a training example. The second baseline model is the bottom-up Tree LSTM model of . We compare this model with our lexicalized bidirectional models to show the effects of adding head lexicalization and top-down information flow.

Main Results
After incorporating head lexicalization into our constituent Tree LSTM, the fine-grained sentiment classification accuracy increases from 51.2 to 52.8, and the binary sentiment classification accuracy increases from 88.5 to 89.2, which demonstrates the effectiveness of the head lexicalization mechanism.
Table 1 also shows that a vanilla top-down Con-Tree LSTM by head-lexicalization (i.e. the topdown half of the final bidirectional model) alone obtains comparable accuracies to the bottom-up Con-Tree LSTM model. The BiConTree model can further improve the classification accuracies by 0.7 points (fine-grained) and 1.3 points (binary) compared to the unidirectional bottom-up lexicalized ConTree LSTM model, respectively. Table 1 includes 5 class accuracies for all nodes. There is no significant difference between different models, consistent with the observation of Li et al. (2015). To our knowledge, these are the best reported results for this sentiment classification task.  (Silva et al., 2011) 95.0 Bidirectional ConTree LSTM 94.8

Training Time and Model Size
Introducing head lexicalization and bidirectional extension to the model increases the model complexity. In this section, we analyze training time and model size with the fine-grained sentiment classification task. We run all the models using an i7-4790 3.60GHz CPU with a single thread. Table 3 shows the average running time for different models over 30 iterations. The baseline ConTree model takes about 1.3 hours to finish the training procedure. Con-Tree+Lex takes about 1.5 times longer than Con-Tree. BiConTree takes about 3.2 hours, which is about 2.5 times longer than that of ConTree. Table 4 compares the model sizes. We did not count the number of parameters in the lookup table since these parameters are the same for all models. Because the size of LSTM models mainly depends on the dimensionality of the state vector h, we change the size of h to study the effect of model size. When |h| = 150, the model size of the baseline model ConTree is the smallest, which consists of about 538K parameters. The model size of Con-Tree+Lex is about 1.4 times as large as that of the baseline model. The bidirectional model BiCon-Tree is the largest, about 1.7 times as large as that of the ConTree+Lex model. However, this parameter set is not very large compared to the modern memory capacity, even for a computer with 16GB RAM. In conclusion, in terms of both time, number of parameters and accuracy, head lexicalization method is   Table 4 also helps to clarify whether the gain of the BiConTree model over the ConTree+Lex model is from the top-down information flow or more parameters. For the same model, increasing the model size can improve the performance to some extent. For example, doubling the size of |h| (75 → 150) increases the performance from 51.5 to 52.8 for the ConTree+Lex model. Similarly, we boost the performance of the BiConTree model when doubling the size of |h| from 75 to 150. However, doubling the size of |h| from 150 to 300 empirically decreases the performance of the Con-Tree+Lex model. The size of the BiConTree model with |h| = 75 is much smaller than that of the Con-Tree+Lex model with |h| = 150. However the performance of these two models is quite close, which indicates that top-down information is useful even for a small model. A ConTree+Lex model with |h| = 215 and a BiConTree model with |h| = 150 are of similar size. The performance of the Con-Tree+Lex model is again worse than that of the Bi-ConTree model (52.5 v.s. 53.5), which shows the effectiveness of top-down information.

Head Lexicalization Methods
In this experiment, we investigate the effect of our head lexicalization method over heuristic baselines. We consider three baseline methods, namely left branching (L), right branching (R) and averaging (A). For L, a parent node accepts lexical information of its left child while ignoring the right child. Correspondingly, for R, a parent node accepts lexical information of its right child while ignoring the left child. For A, a parent node takes the average of the lexical vectors of its children. Table 5 shows the accuracies on the test set, where G denotes our gated head lexicalization method de-Method L R A G Root Accuracy (%) 51.1 51.6 51.8 53.5 Table 5: Test set accuracies of four head lexicalization methods on fine-grained classification. scribed in Section 4.1. R gives better results compared to L due to relatively more right-branching structures in this treebank. A simple average yields similar results compared with right branching. In contrast, G outperforms A method by considering the relative weights of each branch according to treelevel contexts.
We then investigate what lexical heads can be learned by G. Interestingly, the lexical heads contain both syntactic and sentiment information. Some heads correspond well to syntactic rules (Collins, 2003), others are driven by subjective words. Compared to Collins' rules, our method found 30.68% and 25.72% overlapping heads on the development and test sets, respectively.
Based on the cosine similarity between the head lexical vector and its children, we visualize the head of a node by choosing the head of the child that gives the largest similarity value. Figure 5 shows some examples, where <> indicates head words, sentiment labels (e.g. 2, 3) are also included. In Figure 5a, "Emerges" is the syntactic head word of the whole phrase, which is consistent with Collins-style head finding. However, "rare" is the head word of the phrase "something rare", which is different from the syntactic head. Similar observations are found in Figure 5b, where "good" is the head word of the whole phrase, rather than the syntactic head "place". The sentiment label of "good" and the sentiment label of the whole phrase are both 3. Figure 5c shows more complex interactions between syntax and sentiment for deciding the head word. Table 6 shows some example sentences incorrectly predicted by the baseline bottom-up tree model, but correctly labeled by our final model. The head word of sentence #1 by our model is "Gloriously", which is consistent with the sentiment of the whole sentence. This shows how head lexicalization can affect sentiment classification results. Sentences #2 and #3 show the usefulness of top-down informa-(a) "Emerges as something rare".

Error Analysis
(b) "is a good place to start".
(c) "the road to hell is paved with good intentions".  tion for complex semantic structures, where compositionality has subtle effects. Our final model improves the results for the 'very negative' and 'very positive' classes by 10% and 11%, respectively. It also boosts the accuracies for sentences with negation (e.g. "not", "no", and "none") by 4.4%. Figure 6 shows the accuracy distribution accord- Figure 6: Distribution of 5-class accuracies at the root level according to the sentence length.
ing to the sentence length. We find that our model can improve the classification accuracy for longer sentences (>30 words) by 3.5 absolute points compared to the baseline ConTree LSTM of , which demonstrates the strength of our model for handling long range information. By considering bidirectional information over tree structures, our model is aware of more contexts for making better predictions.

Applications
Our main results are obtained on semanticdriven sentence classification tasks, where the automatically-learned head words contain mixed syntactic and semantic information. To further investigate the effectiveness of automatically learned head information on a pure syntactic task, we additionally conduct a simple parser reranking experiment. Further, we discuss findings in language modeling by Kuncoro et al. (2017) on the model of recurrent neural network grammars (Dyer et al., 2016). Finally, we show potential future work leveraging our idea for more tasks.

Syntactic Parsing
We use our tree LSTM models to rerank the 10 best outputs of the Charniak (2000) parser. Given a sentence x, suppose that Y (x) is a set of parse tree candidates generated by a baseline parser for x, the goal of a syntactic reranker is to choose the best parsing hypothesisŷ according to a score function f (x, y; Θ). Formally, For each tree y of sentence x, we follow Socher et al. (2013a) and define the score f (x, y; Θ) as the sum of scores of each constituent node, Without loss of generality, we take a binary node as an example. Given a node A, suppose that its two children are B and C. Let the learned composition state vectors of A, B and C by our proposed Tree-LSTM model be n A , n B and n C , respectively. The head word vector of node A is h A . Score(A; Θ) is defined as: where W L s , W R s and b s are model parameters. Training. Given a training instance x i , Y (x i ) in the training set D, we use a max-margin loss function to train our reranking model. Suppose that the oracle parse tree in Y (x i ) is y i , the loss function L(Θ) is Here λ is a regularization parameter and r i (Θ) is the margin loss between y i and the highest score treeŷ i predicted by the reranking model. r i (Θ) is given by where ∆(y i ,ŷ i ) is the structure loss between y i and y i by counting the number of incorrect nodes in the oracle tree: κ is a scalar. With this loss function, we require the score of the oracle tree to be higher than the other candidates by a score margin. Intuitively, the score of the y i will increase and the score ofŷ i will decrease during training.
Results. We experiment on the WSJ portion of the Penn Treebank, following the standard split (Collins, 2003). Sections 2-21 are used for training, Section 24 and Section 23 are the development set and test set, respectively. The Charniak parser (Charniak, 2000;Charniak and Johnson, 2005) is adopted for our baseline by following the settings of Choe and Charniak (2016).
To obtain N-best lists on the development set and test set, we first train a baseline parser on the training set. To obtain N-best lists on the training data, we split the training data into 20 folds and trained 20 parsers. Each parser was trained on 19 folds data and used to produce the n-best list of the remaining fold. For the neural reranking model, we use the pretrained word vectors from Collobert et al. (2011). The input dimension is 50. The dimension of state vectors in Tree-LSTM model is 60. These parameters are trained with ADAM (Kingma and Ba, 2015) with a batch size of 20. We set κ = 0.1 for all experiments. For practical reasons, we use the Con-Tree+Lex model to learn the node representations and define Y (x i ) to be the 10-best parsing trees of x i . Table 7 shows the reranking results on WSJ test set. The baseline F1 score is 89.7. Our ConTree improves the baseline model to 90.6. Using Con-Tree+Lex model can further improve the performance (90.6 → 90.9). This suggests that automatic heads can also be useful for a syntactic task. Among neural rerankers, our model outperforms Socher et al. (2013a), but underperforms current state-of-theart models, including sequence-to-sequence based LSTM language models (Vinyals et al., 2015a;Choe and Charniak, 2016) and recurrent neural network grammars (Dyer et al., 2016). This is likely due to our simple reranking configurations and settings 5 . Nevertheless, it serves our goal of contrasting the tree LSTM models. Kuncoro et al. (2017) investigate composition functions in recurrent neural network grammars (RNNG) (Dyer et al., 2016), finding that syntactic head information can be automatically learned. Their observa-Model F 1 Baseline (Charniak (2000)) 89.7 ConTree 90.6 ConTree+Lex 90.9 Our 10-best Oracle 94.8 tion is consistent with ours. Formally, an RNNG is a tuple N, Σ, R, S, Θ , where N is the set of nonterminals, Σ is the set of terminals, R is a set of top-down transition-based rules, S is the start symbol and Θ is the set of model parameters. Given S, the derivation process resembles transition-based parsing, which is performed incrementally from left to right. Unlike surface language models, RNNGs model sentences with explicit grammar. Comparing naive sequence-to-sequence models of syntax (Vinyals et al., 2015a), RNNGs have the advantage of explicitly modeling syntactic composition between constituents, by combining the vector representation of child constituents into a single vector representation of their parent using a neural network. Kuncoro et al. (2017) show that such compositions are the key to the success, and further investigate several alternatives neural network structures. In particular, they compare vanilla LSTMs to attention networks when composing child constituents. Interestingly, the attention values represent syntactic heads among the child constituents to some extent. In addition, the vector constituent representation implicitly reflects constituent types. Their finding is consistent with ours in that a neural network can learn pure syntactic head information from constituent vectors.

Relation Extraction
Our head-lexicalized tree model can be used for all tasks that require representation learning for sentences, given their constituent syntax. One example of future work is relation extraction. For example, given the sentence "John is from Google Inc.", a relation 'works in' can be extracted between 'John' and 'Google Inc.'. Miwa and Bansal (2016) solve this task by using the Child-Sum tree representation of Tai et al. (2015) to represent the input sentence, extracting features for the two entities according to their related nodes in the dependency tree, and then conducting rela-tion classification based on these features. Headlexicalization and top-down information can potentially be useful for improving relation extraction in the framework of Miwa and Bansal (2016).

Conclusion
We proposed lexicalized variants for constituent tree LSTMs. Learning the heads of constituents automatically using a neural model, our lexicalized tree LSTM is applicable to arbitrary binary branching trees in CFG, and is formalism-independent. In addition, lexical information on the root further allows a top-down extension to the model, resulting in a bidirectional constituent Tree LSTM. Experiments on two well-known datasets show that head-lexicalization improves the unidirectional Tree LSTM model. In addition, the bidirectional Tree LSTM gives superior labeling results compared to both unidirectional Tree LSTMs and bidirectional sequential LSTMs.