Learning Structured Text Representations

In this paper, we focus on learning structure-aware document representations from data without recourse to a discourse parser or additional annotations. Drawing inspiration from recent efforts to empower neural networks with a structural bias (Cheng et al., 2016; Kim et al., 2017), we propose a model that can encode a document while automatically inducing rich structural dependencies. Specifically, we embed a differentiable non-projective parsing algorithm into a neural model and use attention mechanisms to incorporate the structural biases. Experimental evaluations across different tasks and datasets show that the proposed model achieves state-of-the-art results on document modeling tasks while inducing intermediate structures which are both interpretable and meaningful.

Recent work provides strong evidence that better document representations can be obtained by incorporating structural knowledge (Bhatia et al., 2015; Ji and Smith, 2017; Yang et al., 2016). Inspired by existing theories of discourse, representations of document structure have assumed several guises in the literature, such as trees in the style of Rhetorical Structure Theory (RST; Mann and Thompson, 1988), graphs (Lin et al., 2011; Wolf and Gibson, 2006), entity transitions (Barzilay and Lapata, 2008), or combinations thereof (Lin et al., 2011; Mesgar and Strube, 2015). The availability of discourse annotated corpora (Carlson et al., 2003; Prasad et al., 2008) has led to the development of off-the-shelf discourse parsers (e.g., Feng and Hirst, 2012; Liu and Lapata, 2017), and the common use of trees as representations of document structure. For example, Bhatia et al. (2015) improve document-level sentiment analysis by reweighing discourse units based on the depth of RST trees, whereas Ji and Smith (2017) show that a recursive neural network built on the output of an RST parser benefits text categorization in learning representations that focus on salient content.
Linguistically motivated representations of document structure rely on the availability of annotated corpora as well as a wider range of standard NLP tools (e.g., tokenizers, POS taggers, syntactic parsers). Unfortunately, the reliance on labeled data, which is both difficult and highly expensive to produce, presents a major obstacle to the widespread use of discourse structure for document modeling. Moreover, despite recent advances in discourse processing, the use of an external parser often leads to pipeline-style architectures where errors propagate to later processing stages, affecting model performance.
It is therefore not surprising that there have been attempts to induce document representations directly from data without recourse to a discourse parser or additional annotations. The main idea is to obtain hierarchical representations by first building representations of sentences, and then aggregating those into a document representation (Tang et al., 2015a,b). Yang et al. (2016) further demonstrate how to implicitly inject structural knowledge into the representation using an attention mechanism (Bahdanau et al., 2015) which acknowledges that sentences are differentially important in different contexts. Their model learns to pay more or less attention to individual sentences when constructing the representation of the document.
Our work focuses on learning deeper structure-aware document representations, drawing inspiration from recent efforts to empower neural networks with a structural bias (Cheng et al., 2016). Kim et al. (2017) introduce structured attention networks, which generalize the basic attention procedure and make it possible to learn sentential representations while attending to partial segmentations or subtrees. Specifically, they take into account the dependency structure of a sentence by viewing the attention mechanism as a graphical model over latent variables. They first calculate unnormalized pairwise attention scores for all tokens in a sentence and then use the inside-outside algorithm to normalize the scores with the marginal probabilities of a dependency tree. Without recourse to an external parser, their model learns meaningful task-specific dependency structures, achieving competitive results in several sentence-level tasks. However, for document modeling, this approach has two drawbacks. Firstly, it does not consider non-projective dependency structures, which are common in document-level discourse analysis (Hayashi et al., 2016). As illustrated in Figure 1, the tree structure of a document can be flexible and the dependency edges may cross. Secondly, the inside-outside algorithm involves a dynamic programming process which is difficult to parallelize, making it impractical for modeling long documents.

Figure 1: The document is analyzed in the style of Rhetorical Structure Theory (Mann and Thompson, 1988), and represented as a dependency tree following the conversion algorithm of Hayashi et al. (2016).

In this paper, we propose a new model for representing documents while automatically learning richer structural dependencies. Using a variant of Kirchhoff's Matrix-Tree Theorem (Tutte, 1984), our model implicitly considers non-projective dependency tree structures. We keep each step of the learning process differentiable, so the model can be trained in an end-to-end fashion and induce discourse information that is helpful to specific tasks without an external parser. The inside-outside model of Kim et al. (2017) and our model both have $O(n^3)$ worst-case complexity. However, major operations in our approach can be parallelized efficiently on GPU computing hardware. Although our primary focus is on document modeling, there is nothing inherent in our model that prevents its application to individual sentences. Advantageously, it can induce non-projective structures, which are required for representing languages with free or flexible word order (McDonald and Satta, 2007).

Our contributions in this work are threefold: a model for learning document representations whilst taking structural information into account; an efficient training procedure which allows computing document-level representations of arbitrary length; and a large-scale evaluation study showing that the proposed model performs competitively against strong baselines while inducing intermediate structures which are both interpretable and meaningful.

Background
In this section, we describe how previous work uses the attention mechanism for representing individual sentences. The key idea is to capture the interaction between tokens within a sentence, generating a context representation for each word with weak structural information. This type of intra-sentence attention encodes relationships between words within each sentence and differs from inter-sentence attention, which has been widely applied to sequence transduction tasks like machine translation (Bahdanau et al., 2015) and learns the latent alignment between source and target sequences. Figure 2 provides a schematic view of the intra-sentential attention mechanism.

Figure 2: Intra-sentential attention mechanism; $a_{ij}$ denotes the normalized attention score between tokens $u_i$ and $u_j$.
Given a sentence represented as a sequence of n word vectors $[u_0, u_1, \cdots, u_n]$, for each word pair $u_i$, $u_j$, the attention score $a_{ij}$ is estimated as:

$$f_{ij} = F(u_i, u_j) \tag{1}$$
$$a_{ij} = \frac{\exp(f_{ij})}{\sum_{k} \exp(f_{ik})} \tag{2}$$

where $F()$ is a function for computing the unnormalized score $f_{ij}$, which is then normalized by calculating a probability distribution $a_{ij}$. Individual words collect information from their context based on $a_{ij}$ and obtain a context representation:

$$r_i = \sum_{j} a_{ij} u_j \tag{3}$$

where attention score $a_{ij}$ indicates the (dependency) relation between the i-th and the j-th words and how information from $u_j$ should be fed into $u_i$.

Despite successful application of the above attention mechanism in sentiment analysis (Cheng et al., 2016) and entailment recognition (Parikh et al., 2016), the structural information under consideration is shallow, limited to word-word dependencies. Since attention is computed as a simple probability distribution, it cannot capture more elaborate structural dependencies such as trees (or graphs). Kim et al. (2017) induce richer internal structure by imposing structural constraints on the probability distribution computed by the attention mechanism. Specifically, they normalize $f_{ij}$ with a projective dependency tree using the inside-outside algorithm (Baker, 1979):

$$f_{ij} = F(u_i, u_j) \tag{4}$$
$$a = \text{inside-outside}(f) \tag{5}$$
$$r_i = \sum_{j} a_{ij} u_j \tag{6}$$

This process is differentiable, so the model can be trained end-to-end and learn structural information without relying on a parser. However, efficiency is a major issue, since the inside-outside algorithm has time complexity $O(n^3)$ (where n represents the number of tokens) and does not lend itself to easy parallelization. The high-order complexity renders the approach impractical for real-world applications.
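The simple (unstructured) intra-sentence attention described above can be illustrated with a minimal NumPy sketch. The scoring function F is not fixed by the text, so a plain dot product is used here purely as an assumption; the function name `intra_attention` is likewise hypothetical.

```python
import numpy as np

def intra_attention(U):
    """Simple intra-sentence attention over word vectors.

    U: array of shape (n, d), one row per word vector u_i.
    Returns (A, R): A[i, j] is the normalized score a_ij and
    R[i] is the context representation of word i.
    """
    # Unnormalized scores f_ij. ASSUMPTION: F is a dot product;
    # the paper leaves F unspecified at this point.
    F = U @ U.T
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    F = F - F.max(axis=1, keepdims=True)
    A = np.exp(F)
    A = A / A.sum(axis=1, keepdims=True)
    # Each word collects information from its context, weighted by a_ij.
    R = A @ U
    return A, R

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 4))
A, R = intra_attention(U)
```

Each row of `A` is an independent probability distribution, which is exactly why this form of attention cannot by itself enforce a global constraint such as tree structure.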

Encoding Text Representations
In this section we present our document representation model. We follow previous work (Tang et al., 2015a;Yang et al., 2016) in modeling documents hierarchically by first obtaining representations for sentences and then composing those into a document representation. Structural information is taken into account while learning representations for both sentences and documents and an attention mechanism is applied on both words within a sentence and sentences within a document. The general idea is to force pair-wise attention between text units to form a non-projective dependency tree, and automatically induce this tree for different natural language processing tasks in a differentiable way. In the following, we first describe how the attention mechanism is applied to sentences, and then move on to present our document-level model.

Sentence Model
Let $T = [u_0, u_1, \cdots, u_n]$ denote a sentence containing a sequence of words, each represented by a vector $u$, which can be pre-trained on a large corpus. Long Short-Term Memory Neural Networks (LSTMs; Hochreiter and Schmidhuber, 1997) have been successfully applied to various sequence modeling tasks ranging from machine translation (Bahdanau et al., 2015), to speech recognition (Graves et al., 2013), and image caption generation (Xu et al., 2015). In this paper we use bidirectional LSTMs as a way of representing elements in a sequence (i.e., words or sentences) together with their contexts, capturing the element and an "infinite" window around it. Specifically, we run a bidirectional LSTM over sentence $T$, and take the output vectors $[h_0, h_1, \cdots, h_n]$ as the representations of words in $T$, where $h_t \in \mathbb{R}^{k}$ is the output vector for word $u_t$ based on its context. We then exploit the structure of $T$, which we induce based on an attention mechanism detailed below, to obtain more precise representations. Inspired by recent work (Daniluk et al., 2017; Miller et al., 2016), which shows that the conventional way of using LSTM output vectors for calculating both attention and encoding word semantics is overloaded and likely to cause performance deficiencies, we decompose the LSTM output vector in two parts:

$$[e_t, d_t] = h_t \tag{7}$$

where $e_t \in \mathbb{R}^{k_e}$, the semantic vector, encodes semantic information for specific tasks, and $d_t \in \mathbb{R}^{k_s}$, the structure vector, is used to calculate structured attention.

Figure 3: Sentence representation model (stages: Calculate Structured Attention; Update Semantic Vectors): $u_t$ is the input vector for the t-th word, $e_t$ and $d_t$ are semantic and structure vectors, respectively.
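The decomposition of each LSTM output vector into a semantic part and a structure part can be sketched as a simple slice. The ordering (semantic dimensions first) and the helper name `split_hidden` are assumptions made for illustration; the sizes below reuse the 100/50 split reported later for the SNLI experiments.

```python
import numpy as np

def split_hidden(H, k_e):
    """Split LSTM outputs into semantic and structure vectors.

    H: array of shape (n, k), one output vector h_t per word.
    k_e: number of semantic dimensions; the remaining k - k_e
         dimensions form the structure vector.
    ASSUMPTION: the semantic part occupies the first k_e dimensions.
    """
    E = H[:, :k_e]   # semantic vectors e_t
    D = H[:, k_e:]   # structure vectors d_t
    return E, D

H = np.arange(6 * 150, dtype=float).reshape(6, 150)
E, D = split_hidden(H, k_e=100)
```

Keeping the two parts separate means that the dimensions used to score candidate edges are not also asked to carry the task-specific meaning of the word.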
We use a series of operations based on the Matrix-Tree Theorem (Tutte, 1984) to incorporate the structural bias of non-projective dependency trees into the attention weights. We constrain the probability distributions $a_{ij}$ (see Equation (2)) to be the posterior marginals of a dependency tree structure. We then use the normalized structured attention to build a context vector for updating the semantic vector of each word, obtaining new representations $[r_0, r_1, \cdots, r_n]$. An overview of the model is presented in Figure 3. We describe the attention mechanism in detail in the following section.

Structured Attention Mechanism
Dependency representations of natural language are a simple yet flexible mechanism for encoding words and their syntactic relations through directed graphs. Much work in descriptive linguistics (Mel'čuk, 1988; Tesnière, 1959) has advocated their suitability for representing syntactic structure across languages. A primary advantage of dependency representations is that they have a natural mechanism for representing discontinuous constructions arising from long distance dependencies or free word order through non-projective dependency edges.
More formally, building a dependency tree amounts to finding latent variables $z_{ij}$ for all $i \neq j$, where word i is the parent node of word j, under some global constraints, amongst which the single-head constraint is the most important, since it forces the structure to be a rooted tree. We use a variant of Kirchhoff's Matrix-Tree Theorem (Koo et al., 2007; Tutte, 1984) to calculate the marginal probability of each dependency edge $P(z_{ij} = 1)$ of a non-projective dependency tree, and this probability is used as the attention weight that decides how much information is collected from child unit j to the parent unit i.
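We first calculate unnormalized attention scores $f_{ij}$ with structure vector $d$ (see Equation (7)) via a bilinear function:

$$t_p = \tanh(W_p d_i) \tag{8}$$
$$t_c = \tanh(W_c d_j) \tag{9}$$
$$f_{ij} = t_p^{\top} W_a t_c \tag{10}$$

where $W_p \in \mathbb{R}^{k_s \times k_s}$ and $W_c \in \mathbb{R}^{k_s \times k_s}$ are the weights for building the representations of parent and child nodes, and $W_a \in \mathbb{R}^{k_s \times k_s}$ is the weight of the bilinear transformation. The score matrix $f \in \mathbb{R}^{n \times n}$ can be viewed as a weighted adjacency matrix for a graph $G$ whose nodes correspond to the words in a sentence. We also calculate the root score $f^r_i$, indicating the unnormalized possibility of a node being the root:

$$f^r_i = W_r d_i \tag{11}$$

where $W_r \in \mathbb{R}^{1 \times k_s}$. We calculate $P(z_{ij} = 1)$, the marginal probability of the dependency edge, following Koo et al. (2007):

$$A_{ij} = \begin{cases} 0 & \text{if } i = j \\ \exp(f_{ij}) & \text{otherwise} \end{cases} \tag{12}$$
$$L_{ij} = \begin{cases} \sum_{i'=1}^{n} A_{i'j} & \text{if } i = j \\ -A_{ij} & \text{otherwise} \end{cases} \tag{13}$$
$$\bar{L}_{ij} = \begin{cases} \exp(f^r_j) & \text{if } i = 1 \\ L_{ij} & \text{if } i > 1 \end{cases} \tag{14}$$
$$P(z_{ij} = 1) = (1 - \delta_{1,j}) A_{ij} [\bar{L}^{-1}]_{jj} - (1 - \delta_{i,1}) A_{ij} [\bar{L}^{-1}]_{ji} \tag{15}$$
$$P(\text{root}(i)) = \exp(f^r_i) [\bar{L}^{-1}]_{i1} \tag{16}$$

where $L \in \mathbb{R}^{n \times n}$ is the Laplacian matrix for graph $G$, $\bar{L}$ is a variant of $L$ that takes the root node into consideration, and $\delta$ is the Kronecker delta. The key for the calculation to hold is for $L_{ii}$, the minor of the Laplacian matrix $L$ with respect to row i and column i, to be equal to the sum of the weights of all directed spanning trees of $G$ which are rooted at i. $P(z_{ij} = 1)$ is the marginal probability of the dependency edge between the i-th and j-th words. $P(\text{root}(i) = 1)$ is the marginal probability of the i-th word being headed by the root of the tree. Details of the proof can be found in Koo et al. (2007).

We denote the marginal probabilities $P(z_{ij} = 1)$ as $a_{ij}$ and $P(\text{root}(i))$ as $a^r_i$. These can be interpreted as attention scores which are constrained to converge to a structured object, a non-projective dependency tree in our case. We update the semantic vector $e_i$ of each word with structured attention:

$$p_i = \sum_{k=1}^{n} a_{ki} e_k + a^r_i e_{root} \tag{17}$$
$$c_i = \sum_{k=1}^{n} a_{ik} e_k \tag{18}$$
$$r_i = \tanh(W_r [e_i, p_i, c_i]) \tag{19}$$

where $p_i \in \mathbb{R}^{k_e}$ is the context vector gathered from possible parents of $u_i$, $c_i \in \mathbb{R}^{k_e}$ is the context vector gathered from possible children, and $e_{root}$ is a special embedding for the root node. The context vectors are concatenated with $e_i$ and transformed with weights $W_r \in \mathbb{R}^{k_e \times 3k_e}$ to obtain the updated semantic vector $r_i \in \mathbb{R}^{k_e}$ with rich structural information (see Figure 3).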
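The marginal computation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the released implementation: it follows the single-root Matrix-Tree formulation of Koo et al. (2007), takes the score matrix and root scores as given, and uses hypothetical names (`tree_marginals`, `update_semantic`). The tanh in the update step and the use of child vectors in the child context are also assumptions.

```python
import numpy as np

def tree_marginals(f, f_root):
    """Edge and root marginals of non-projective dependency trees.

    f: (n, n) unnormalized scores, f[i, j] for the edge parent i -> child j.
    f_root: (n,) unnormalized root scores.
    Returns (a, a_root) with a[i, j] = P(z_ij = 1), a_root[i] = P(root(i)).
    """
    n = f.shape[0]
    A = np.exp(f)
    np.fill_diagonal(A, 0.0)                # no self-loops
    L = np.diag(A.sum(axis=0)) - A          # graph Laplacian (column sums on the diagonal)
    L_bar = L.copy()
    L_bar[0, :] = np.exp(f_root)            # first row replaced by root scores
    L_inv = np.linalg.inv(L_bar)
    mask = np.ones(n)
    mask[0] = 0.0                           # implements the (1 - delta) factors
    a = (A * (mask * np.diag(L_inv))[None, :]
         - A * mask[:, None] * L_inv.T)
    a_root = np.exp(f_root) * L_inv[:, 0]
    return a, a_root

def update_semantic(E, a, a_root, e_root, W):
    """Update semantic vectors with parent and child context vectors.

    E: (n, k_e) semantic vectors; e_root: (k_e,); W: (k_e, 3 * k_e).
    ASSUMPTION: a tanh non-linearity follows the linear transformation.
    """
    P = a.T @ E + a_root[:, None] * e_root[None, :]   # context from possible parents
    C = a @ E                                         # context from possible children
    return np.tanh(np.concatenate([E, P, C], axis=1) @ W.T)

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 4))
f_root = rng.normal(size=4)
a, a_root = tree_marginals(f, f_root)
E = rng.normal(size=(4, 3))
R = update_semantic(E, a, a_root, rng.normal(size=3), rng.normal(size=(3, 9)))
```

A useful sanity check is the single-head constraint: for every word, the marginals of all incoming edges plus its root marginal sum to one, and the root marginals themselves sum to one because exactly one word attaches to the root.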

Document Model
We build document representations hierarchically: sentences are composed of words and documents are composed of sentences. Composition on the document level also makes use of structured attention in the form of a dependency graph. Dependency-based representations have been previously used for developing discourse parsers (Hayashi et al., 2016; Li et al., 2014) and in applications such as summarization (Hirao et al., 2013). As illustrated in Figure 4, given a document with n sentences $[s_0, s_1, \cdots, s_n]$, for each sentence $s_i$, the input is a sequence of word embeddings $[u_{i0}, u_{i1}, \cdots, u_{im}]$, where m is the number of tokens in $s_i$. By feeding the embeddings into a sentence-level bi-LSTM and applying the proposed structured attention mechanism, we obtain the updated semantic vectors $[r_{i0}, r_{i1}, \cdots, r_{im}]$. Then a pooling operation produces a fixed-length vector $v_i$ for each sentence. Analogously, we view the document as a sequence of sentence vectors $[v_0, v_1, \cdots, v_n]$ whose embeddings are fed to a document-level bi-LSTM. Application of the structured attention mechanism creates new semantic vectors $[q_0, q_1, \cdots, q_n]$ and another pooling operation yields the final document representation $y$.
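The two-level composition can be sketched as follows. The encoders are passed in as plain functions standing in for "bi-LSTM plus structured attention", and max pooling is an assumption, since the text only says "a pooling operation"; all names here are hypothetical.

```python
import numpy as np

def pool(vectors):
    """Collapse a (length, dim) array into one fixed-length vector.
    ASSUMPTION: element-wise max pooling."""
    return vectors.max(axis=0)

def encode_document(doc, sentence_encoder, document_encoder):
    """Hierarchical composition: words -> sentence vectors -> document vector.

    doc: list of arrays, one (m_i, dim) array of word embeddings per sentence.
    sentence_encoder / document_encoder: functions mapping a (length, dim)
    array to an updated (length, dim') array; they stand in for the
    bi-LSTM and structured attention layers at each level.
    """
    sentence_vectors = np.stack([pool(sentence_encoder(s)) for s in doc])
    return pool(document_encoder(sentence_vectors))

rng = np.random.default_rng(0)
doc = [rng.normal(size=(m, 4)) for m in (3, 5, 2)]
identity = lambda x: x   # placeholder encoder, for illustration only
y = encode_document(doc, identity, identity)
```

Because pooling removes the length dimension at each level, sentences and documents of different lengths all map to vectors of the same size.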

End-to-End Training
Our model can be trained in an end-to-end fashion since all operations required for computing structured attention and using it to update the semantic vectors are differentiable. In contrast to Kim et al. (2017), training can be done efficiently. The major complexity of our model lies in the computation of the gradients of the inverse matrix. Let $A$ denote a matrix depending on a real parameter $x$; assuming all component functions in $A$ are differentiable, and $A$ is invertible for all possible values, the gradient of $A^{-1}$ with respect to $x$ is:

$$\frac{dA^{-1}}{dx} = -A^{-1} \frac{dA}{dx} A^{-1} \tag{20}$$

Multiplication of the three matrices and matrix inversion can be computed efficiently on modern parallel hardware architectures such as GPUs. In our experiments, computation of structured attention takes only 1/10 of training time.
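The derivative of a matrix inverse follows from differentiating the identity $A A^{-1} = I$, which gives $\frac{dA}{dx} A^{-1} + A \frac{dA^{-1}}{dx} = 0$ and hence $\frac{dA^{-1}}{dx} = -A^{-1} \frac{dA}{dx} A^{-1}$. The short NumPy check below compares this closed form against a central finite difference; the matrices and the helper name `inverse_derivative` are illustrative assumptions.

```python
import numpy as np

def inverse_derivative(A, dA):
    """Derivative of A^{-1} with respect to x, given A and dA/dx."""
    A_inv = np.linalg.inv(A)
    return -A_inv @ dA @ A_inv

# Illustrative matrix family A(x) = A0 + x * B, so dA/dx = B.
rng = np.random.default_rng(0)
A0 = 3.0 * np.eye(4) + 0.1 * rng.normal(size=(4, 4))   # well-conditioned
B = rng.normal(size=(4, 4))

analytic = inverse_derivative(A0, B)

# Central finite difference of the inverse around x = 0.
eps = 1e-6
numeric = (np.linalg.inv(A0 + eps * B) - np.linalg.inv(A0 - eps * B)) / (2 * eps)
```

The closed form needs only one inversion and two matrix products, all of which are dense linear algebra operations that map well onto GPUs.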

Experiments
In this section we present our experiments for evaluating the performance of our model. Since sentence representations constitute the basic building blocks of our document model, we first evaluate the performance of structured attention on a sentence-level task, namely natural language inference. We then assess the document-level representations obtained by our model on a variety of classification tasks representing documents of different length, subject matter, and language. Our code is available at https://github.com/nlpyang/structured.

Natural Language Inference
The ability to reason about the semantic relationship between two sentences is an integral part of text understanding. We therefore evaluate our model on recognizing textual entailment, i.e., whether a premise-hypothesis pair is entailing, contradictory, or neutral. For this task we used the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), which contains premise-hypothesis pairs and target labels indicating their relation. After removing sentences with unknown labels, we obtained 549,367 pairs for training, 9,842 for development and 9,824 for testing.

Sentence-level representations obtained by our model (with structured attention) were used to encode the premise and hypothesis by modifying the model of Parikh et al. (2016) as follows. Let $[x^p_0, \cdots, x^p_n]$ and $[x^h_0, \cdots, x^h_m]$ be the input vectors for the premise and hypothesis, respectively. Application of structured attention yields new vector representations $[r^p_0, \cdots, r^p_n]$ and $[r^h_0, \cdots, r^h_m]$. Then we combine these two sets of vectors with inter-sentential attention and apply an average pooling operation, where $MLP()$, the function used in this combination step, is a two-layer perceptron with a ReLU activation function. The new representations $r^p$, $r^h$ are then concatenated and fed into another two-layer perceptron with a softmax layer to obtain the predicted distribution over the labels.

The hidden size of the LSTM was set to 150. The dimensions of the semantic vector were 100 and the dimensions of the structure vector were 50. We used pretrained 300-D GloVe 840B (Pennington et al., 2014) vectors to initialize the word embeddings. All parameters (including word embeddings) were updated with Adagrad (Duchi et al., 2011), and the learning rate was set to 0.05. The hidden size of the two-layer perceptron was set to 200 and dropout was used with ratio 0.2. The mini-batch size was 32. We compared our model (and variants thereof) against several related systems.
Results (in terms of 3-class accuracy) are shown in Table 1. Among the comparison systems, the models of Cheng et al. (2016) and Parikh et al. (2016) include intra-attention encoding relationships between words within each sentence (see Equation (2)). It is also worth noting that some models take structural information into account in the form of parse trees (Bowman et al., 2016; Chen et al., 2016b).

Table 1 (excerpt): 3-class accuracy (%) on SNLI and number of parameters.

Model                                                               Acc    Parameters
(Rocktäschel et al., 2016)                                          83.5   252K
200D Matching LSTMs (Wang and Jiang, 2015)                          86.1   1.9M
450D LSTMN with deep attention fusion (Cheng et al., 2016)          86.3   3.4M
Decomposable Attention over word embeddings (Parikh et al., 2016)   86.8   582K
Enhanced BiLSTM Inference Model (Chen et al., 2016b)                88

The second block of Table 1 presents a version of our model without an intra-sentential attention mechanism as well as three variants with attention, assuming the structure of word-to-word relations and dependency trees. In the latter case we compare our matrix inversion based model against Kim et al.'s (2017) inside-outside attention model. Consistent with previous work (Cheng et al., 2016; Parikh et al., 2016), we observe that simple attention brings performance improvements over no attention. Structured attention further enhances performance. Our own model with tree matrix inversion slightly outperforms the inside-outside model of Kim et al. (2017), overall achieving results in the same ballpark as related LSTM-based models (Chen et al., 2016b; Cheng et al., 2016; Parikh et al., 2016).

Table 2 compares the running speed of the models shown in the second block of Table 1. As can be seen, matrix inversion does not increase running time over the simpler attention mechanism and is considerably faster than inside-outside. The latter is 10-20 times slower than our model on the same platform.

Document Classification
In this section, we evaluate our document-level model on a variety of classification tasks. We selected four datasets which we describe below. Table 3 summarizes some statistics for each dataset.
Yelp reviews were obtained from the 2013 Yelp Dataset Challenge. This dataset contains restaurant reviews, each associated with human ratings on a scale from 1 (negative) to 5 (positive) which we used as gold labels for sentiment classification; we followed the preprocessing introduced in Tang et al. (2015a) and report experiments on their training, development, and testing partitions (80/10/10).
IMDB reviews were obtained from Diao et al. (2014), who randomly crawled reviews for 50K movies. Each review is associated with user ratings ranging from 1 to 10.
Czech reviews were obtained from Brychcín and Habernal (2013). The dataset contains reviews from the Czech Movie Database, each labeled as positive, neutral, or negative. We include Czech in our experiments since it has more flexible word order compared to English, with non-projective dependency structures being more frequent. Experiments on this dataset perform 10-fold cross-validation following previous work (Brychcín and Habernal, 2013).
Congressional floor debates were obtained from a corpus originally created by Thomas et al. (2006), which contains transcripts of U.S. floor debates in the House of Representatives for the year 2005. Each debate consists of a series of speech segments, each labeled by the vote ("yea" or "nay") cast for the proposed bill by the speaker of each segment. We used the pre-processed corpus from Yogatama and Smith (2014) (http://www.cs.cornell.edu/~ainur/data.).

Following previous work (Yang et al., 2016), we only retained words appearing more than five times in building the vocabulary and replaced less frequent words with a special UNK token. Word embeddings were initialized by training word2vec (Mikolov et al., 2013) on the training and validation splits of each dataset. In our experiments, we set the word embedding dimension to 200 and the hidden size for the sentence-level and document-level LSTMs to 100 (the dimensions of the semantic and structure vectors were set to 75 and 25, respectively). We used a mini-batch size of 32 during training, and documents of similar length were grouped in one batch. Parameters were optimized with Adagrad (Duchi et al., 2011), with the learning rate set to 0.05. We used $L_2$ regularization for all parameters except word embeddings, with the regularization constant set to 1e-4. Dropout was applied on the input and output layers with dropout rate 0.3.

Our results are summarized in Table 4. We compared our model against several related models covering a wide spectrum of representations including word-based ones (e.g., paragraph vector and CNN models) as well as hierarchically composed ones (e.g., a CNN or LSTM provides a sentence vector and then a recurrent neural network combines the sentence vectors to form a document-level representation for classification). Previous state-of-the-art results on the three review datasets were achieved by the hierarchical attention network of Yang et al. (2016), which models the document hierarchically with two GRUs and uses an attention mechanism to weigh the importance of each word and sentence.
On the debates corpus, Ji and Smith (2017) obtained best results with a recursive neural network model operating on the output of an RST parser. Table 4 presents three variants of our model: one with structured attention on the sentence level, another with structured attention on the document level, and a third which employs attention on both levels. As can be seen, the combination is beneficial, achieving best results on three out of four datasets. Furthermore, structured attention is superior to the simpler word-to-word attention mechanism, and both types of attention bring improvements over no attention. The structured attention approach also remains efficient, taking only 20 minutes for one training epoch on the largest dataset.

Analysis of Induced Structures
To gain further insight on structured attention, we inspected the dependency trees it produces. Specifically, we used the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) to extract the maximum spanning tree from the attention scores. We report various statistics on the characteristics of the induced trees across different tasks and datasets. We also provide examples of tree output, in an attempt to explain how our model uses dependency structures to model text.
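To make the extraction step concrete, the sketch below finds the highest-scoring tree by exhaustive search. This is explicitly not the Chu-Liu-Edmonds algorithm: it is a brute-force substitute that returns the same maximum spanning tree on very short inputs and is only meant to show what is being maximized. The single-root constraint, the additive score, and all names are assumptions for illustration.

```python
import itertools
import numpy as np

def is_tree(heads):
    """True if every node reaches the root (-1) without revisiting a node."""
    for start in range(len(heads)):
        seen = set()
        j = start
        while j != -1:
            if j in seen:        # cycle (including self-loops)
                return False
            seen.add(j)
            j = heads[j]
    return True

def best_tree_brute_force(a, a_root):
    """Maximum spanning tree by enumeration (exponential; tiny n only).

    a[i, j]: score of edge parent i -> child j; a_root[j]: score of root -> j.
    Returns (heads, score) with heads[j] = parent of j, or -1 for the root edge.
    ASSUMPTION: exactly one node attaches to the root.
    """
    n = len(a_root)
    best_score, best_heads = -np.inf, None
    for heads in itertools.product(range(-1, n), repeat=n):
        if heads.count(-1) != 1 or not is_tree(heads):
            continue
        score = sum(a_root[m] if h == -1 else a[h, m]
                    for m, h in enumerate(heads))
        if score > best_score:
            best_score, best_heads = score, heads
    return list(best_heads), best_score

a = np.full((3, 3), 0.05)
a[0, 1] = 0.9
a[1, 2] = 0.9
a_root = np.array([0.9, 0.05, 0.05])
heads, score = best_tree_brute_force(a, a_root)
```

For realistic sentence or document lengths the enumeration is infeasible, which is why a polynomial-time algorithm such as Chu-Liu-Edmonds is used in practice.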

Sentence Trees
We compared the dependency trees obtained from our model with those produced by a state-of-the-art dependency parser trained on the English Penn Treebank. Table 5 presents various statistics on the depth of the trees produced by our model on the SNLI test set and the Stanford dependency parser. As can be seen, the induced dependency structures are simpler compared to those obtained from the Stanford parser. The trees are generally less deep (their height is 5.78 compared to 8.99 for the Stanford parser), with the majority being of depth 2-4. Almost half of the induced trees have a projective structure, although there is nothing in the model to enforce this constraint. We also calculated the percentage of head-dependency edges that are identical between the two sets of trees. Although our model is not exposed to annotated trees during training, a large number of edges agree with the output of the Stanford parser.

Figure 5 shows examples of dependency trees induced on the SNLI dataset. Although the model is trained without ever being exposed to a parse tree, it is able to learn plausible dependency structures via the attention mechanism. Overall we observe that the induced trees differ from linguistically motivated ones in the types of dependencies they create, which tend to be of shorter length. The dependencies obtained from structured attention are more direct, as shown in the first premise sentence in Figure 5, where the words at and bar are directly connected to the verb drink. This is perhaps to be expected since the attention mechanism uses the dependency structures to collect information from other words, and the direct links will be more effective.
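Statistics such as tree depth and projectivity can be computed directly from a head array. The sketch below is one possible way to do so; the depth convention (a word attached to the root has depth 1) and the function names are assumptions, and the input is assumed to be a valid tree.

```python
def tree_depth(heads):
    """Depth of a dependency tree given as a head array.

    heads[j] is the index of the parent of node j, or -1 if j attaches
    to the root. ASSUMPTION: a node attached to the root has depth 1.
    The input must be a valid tree (no cycles).
    """
    def depth(j):
        d = 1
        while heads[j] != -1:
            j = heads[j]
            d += 1
        return d
    return max(depth(j) for j in range(len(heads)))

def is_projective(heads):
    """True if no two edges cross when nodes are laid out in order.

    The artificial root is placed at position -1, left of all nodes.
    """
    edges = [(min(h, m), max(h, m)) for m, h in enumerate(heads)]
    for l1, r1 in edges:
        for l2, r2 in edges:
            if l1 < l2 < r1 < r2:   # the two arcs interleave
                return False
    return True

chain = [-1, 0, 1]          # root -> 0 -> 1 -> 2
crossing = [2, 3, -1, 2]    # edges 2 -> 0 and 3 -> 1 cross
```

Here `chain` is a projective tree of depth 3, while `crossing` has the same depth but contains two interleaving arcs and is therefore non-projective.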

Document Trees
We also used the Chu-Liu-Edmonds algorithm to obtain document-level dependency trees. Table 6 summarizes various characteristics of these trees. For most datasets, document-level trees are not very deep; they mostly contain nodes up to depth 3. This is not surprising as the documents are relatively short (see Table 3), with the exception of debates, which are longer and whose induced trees are more complex. The fact that most documents exhibit simple discourse structures is further corroborated by the large number (over 70%) of projective trees induced on the Yelp, IMDB, and CZ Movies datasets. Unfortunately, our trees cannot be directly compared with the output of a discourse parser, which typically involves a segmentation process splitting sentences into smaller units. Our trees are constructed over entire sentences, and there is no mechanism currently in the model to split sentences into discourse units.

Figure 6 shows examples of document-level trees taken from Yelp and the Czech Movie dataset. In the first tree, most edges are examples of the "elaboration" discourse relation, i.e., the child presents additional information about the parent. The second tree is non-projective: the edges connecting sentences 1 and 4 and sentences 3 and 5 cross. The third review, perhaps due to its colloquial nature, is not entirely coherent. However, the model manages to link sentences 1 and 3 to sentence 2, i.e., the movie being discussed; it also relates sentence 6 to 4, both of which express highly positive sentiment.

Figure 6 (excerpt), sentences of one example review: (1) great instruction by ryan (2) clean workout facility and friendly people (3) they have a new student membership for 60 per month and classes are mon , weds and fri 6pm 7pm (4) it 's definitely worth money if you want to learn brazilian jiu jitsu (5) i usually go to classes on mondays and fridays , and it 's the best workout i 've had in years

Conclusions
In this paper we proposed a new model for representing documents while automatically learning rich structural dependencies. Our model normalizes intra-attention scores with the marginal probabilities of a non-projective dependency tree based on a matrix inversion process. Each operation in this process is differentiable and the model can be trained efficiently end-to-end, while inducing structural information. We applied this approach to model documents hierarchically, incorporating both sentence- and document-level structure. Experiments on sentence and document modeling tasks show that the representations learned by our model achieve competitive performance against strong comparison systems. Analysis of the induced tree structures revealed that they are meaningful, albeit different from linguistic ones, without ever exposing the model to linguistic annotations or an external parser.

Directions for future work are many and varied. Given appropriate training objectives (Linzen et al., 2016), it should be possible to induce linguistically meaningful dependency trees using the proposed attention mechanism. We also plan to explore how document-level trees can be usefully employed in summarization, e.g., as a means to represent or even extract important content.