ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs

How to model a pair of sentences is a critical issue in many NLP tasks such as answer selection (AS), paraphrase identification (PI) and textual entailment (TE). Most prior work (i) deals with one individual task by fine-tuning a specific system; (ii) models each sentence’s representation separately, rarely considering the impact of the other sentence; or (iii) relies fully on manually designed, task-specific linguistic features. This work presents a general Attention Based Convolutional Neural Network (ABCNN) for modeling a pair of sentences. We make three contributions. (i) The ABCNN can be applied to a wide variety of tasks that require modeling of sentence pairs. (ii) We propose three attention schemes that integrate mutual influence between sentences into CNNs; thus, the representation of each sentence takes into consideration its counterpart. These interdependent sentence pair representations are more powerful than isolated sentence representations. (iii) ABCNNs achieve state-of-the-art performance on AS, PI and TE tasks. We release code at: https://github.com/yinwenpeng/Answer_Selection.

Most prior work derives each sentence's representation separately, rarely considering the impact of the other sentence. This neglects the mutual influence of the two sentences in the context of the task. It also contradicts what humans do when comparing two sentences. We usually focus on key parts of one sentence by extracting parts from the other sentence that are related by identity, synonymy, antonymy and other relations. Thus, human beings model the two sentences together, using the content of one sentence to guide the representation of the other.
Figure 1: Positive ($\langle s_0, s_1^+ \rangle$) and negative ($\langle s_0, s_1^- \rangle$) examples for AS, PI and TE tasks. RH = Random House.
AS, $s_0$: how much did Waterboy gross?
AS, $s_1^+$: the movie earned $161.5 million
AS, $s_1^-$: this was Jerry Reeds final film appearance
PI, $s_0$: she struck a deal with RH to pen a book today
PI, $s_1^+$: she signed a contract with RH to write a book
PI, $s_1^-$: she denied today that she struck a deal with RH
TE, $s_0$: an ice skating rink placed outdoors is full of people
TE, $s_1^+$: a lot of people are in an ice skating park
TE, $s_1^-$: an ice skating rink placed indoors is full of people
Figure 1 demonstrates that each sentence of a pair partially determines which parts of the other sentence we should focus on. For AS, correctly answering $s_0$ requires putting attention on "gross": $s_1^+$ contains a corresponding unit ("earned") while $s_1^-$ does not. For PI, focus should be removed from "today" to correctly recognize $\langle s_0, s_1^+ \rangle$ as paraphrases and $\langle s_0, s_1^- \rangle$ as non-paraphrases. For TE, we need to focus on "full of people" (to recognize TE for $\langle s_0, s_1^+ \rangle$) and on "outdoors" / "indoors" (to recognize non-TE for $\langle s_0, s_1^- \rangle$). These examples show the need for an architecture that computes different representations of $s_i$ for different $s_{1-i}$ ($i \in \{0, 1\}$).
Convolutional neural networks (CNNs; LeCun et al., 1998) are widely used to model sentences (?; ?) and sentence pairs (Yu et al., 2014; Socher et al., 2011; Yin and Schütze, 2015a), especially in classification tasks. CNNs are considered good at extracting robust and abstract features of the input. This work presents ABCNN, an attention-based convolutional neural network that has a powerful mechanism for modeling a sentence pair by taking into account the interdependence between the two sentences. ABCNN is a general architecture that can handle a wide variety of sentence pair modeling tasks.
Some prior work proposes simple mechanisms that can be interpreted as controlling varying attention; e.g., Yih et al. (2013) employ word alignment to match related parts of the two sentences. In contrast, our CNN-based attention scheme models relatedness between two parts fully automatically. Moreover, attention at multiple levels of granularity, not only at the word level, is achieved as we stack multiple convolution layers that increase abstraction.
Prior work on attention in deep learning mostly addresses LSTMs (long short-term memory; Hochreiter and Schmidhuber, 1997). LSTMs usually realize attention in a word-to-word scheme, and the word representations mostly encode the whole context within the sentence (Bahdanau et al., 2015; Rocktäschel et al., 2016). But it is not clear whether this is the best strategy; e.g., in the AS example in Figure 1, it is possible to determine that "how much" in $s_0$ matches "$161.5 million" in $s_1^+$ without taking the entire remaining sentence contexts into account. This observation was also investigated by Yao et al. (2013b), where an information retrieval system retrieves sentences with tokens labeled as DATE by named entity recognition or as CD by part-of-speech tagging if there is a "when" question. However, labels or POS tags require extra tools. CNNs benefit from incorporating attention into representations of local phrases detected by filters; in contrast, LSTMs encode the whole context to form attention-based word representations, a strategy that is more complex than the CNN strategy and (as our experiments suggest) performs less well for some tasks.
Apart from these differences, it is clear that attention has as much potential for CNNs as it does for LSTMs. As far as we know, this is the first NLP paper that incorporates attention into CNNs. Our ABCNN achieves state-of-the-art results on the AS and TE tasks and competitive performance on PI, and it improves further on all three tasks when linguistic features are used.
Section 2 discusses related work. Section 3 introduces BCNN, a network that models two sentences in parallel with shared weights, but without attention. Section 4 presents three different attention mechanisms and their realization in ABCNN, an architecture that is based on BCNN. Section 5 evaluates the models on AS, PI and TE tasks and conducts visual analysis for our attention mechanism. Section 6 summarizes the contributions of this work.

Non-NN Work on Sentence Pair Modeling
Sentence pair modeling has attracted a lot of attention in the past decades. Most tasks can be reduced to a semantic text matching problem. Due to the variety of word choices and inherent ambiguities in natural languages, bag-of-words approaches with simple surface-form word matching tend to produce brittle results with poor prediction accuracy (Bilotti et al., 2007). As a result, researchers put more emphasis on exploiting syntactic and semantic structure. Representative examples include methods based on deeper semantic analysis (Shen and Lapata, 2007; Moldovan et al., 2007), tree edit-distance (Punyakanok et al., 2004; Heilman and Smith, 2010) and quasi-synchronous grammars (Wang et al., 2007) that match the dependency parse trees of the two sentences. Instead of focusing on the high-level semantic representation, Yih et al. (2013) turn their attention to improving the shallow semantic component, lexical semantics, by performing semantic matching based on a latent word-alignment structure (cf. Chang et al. (2010)). Lai and Hockenmaier (2014) explore more fine-grained word overlap and alignment between two sentences using negation, hypernym/hyponym, synonym and antonym relations. Yao et al. (2013a) extend word-to-word alignment to phrase-to-phrase alignment by a semi-Markov CRF. However, such approaches often require more computational resources. In addition, employing syntactic or semantic parsers (which produce errors on many sentences) to find the best match between the structured representations of two sentences is not trivial.

NN Work on Sentence Pair Modeling
To address some of the challenges of non-NN work, much recent work uses neural networks to model sentence pairs for AS, PI and TE.
For AS, Yu et al. (2014) present a bigram CNN to model question and answer candidates. Yang et al. (2015) extend this method and get state-of-the-art performance on the WikiQA dataset (Section 5.2). Feng et al. (2015) test various setups of a bi-CNN architecture on an insurance domain QA dataset. Tan et al. (2015) explore bidirectional LSTM on the same dataset. Our approach is different because we do not model the sentences by two independent neural networks in parallel, but instead as an interdependent sentence pair, using attention.
For PI, Blacoe and Lapata (2012) form sentence representations by summing up word embeddings. Socher et al. (2011) use recursive autoencoder (RAE) to model representations of local phrases in sentences, then pool similarity values of phrases from the two sentences as features for binary classification. Yin and Schütze (2015a) present a similar model in which RAE is replaced by CNN. In all three papers, the representation of one sentence is not influenced by the other, in contrast to our attention-based model.
For TE, Bowman et al. (2015b) use recursive neural networks to encode entailment on SICK (Marelli et al., 2014b). Rocktäschel et al. (2016) present an attention-based LSTM for the Stanford natural language inference corpus (Bowman et al., 2015a). Our system is the first CNN-based work on TE.
Some prior work aims to solve a general sentence matching problem. Hu et al. (2014) present two CNN architectures, ARC-I and ARC-II, for sentence matching. ARC-I focuses on sentence representation learning while ARC-II focuses on matching features on phrase level. Both systems were tested on PI, sentence completion (SC) and tweet-response matching. Yin and Schütze (2015b) propose the MultiGranCNN architecture to model general sentence matching based on phrase matching on multiple levels of granularity and get promising results for PI and SC. Wan et al. (2015) try to match two sentences in AS and SC by multiple sentence representations, each coming from the local representations of two LSTMs. Our work is the first one to investigate attention for the general sentence matching task.

Attention-Based NN in Non-NLP Domains
Even though there is little if any work on attention mechanisms in CNNs for NLP, attention-based CNNs have been used in computer vision for visual question answering (Chen et al., 2015), image classification (Xiao et al., 2015), caption generation (Xu et al., 2015), image segmentation (Hong et al., 2015) and object localization (Cao et al., 2015). Mnih et al. (2014) apply attention in recurrent neural network (RNN) to extract information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Gregor et al. (2015) combine a spatial attention mechanism with RNN for image generation. Ba et al. (2015) investigate attention-based RNN for recognizing multiple objects in images. Chorowski et al. (2014) and Chorowski et al. (2015) use attention in RNN for speech recognition.

Attention-Based NN in NLP
Attention-based deep learning systems have been studied in NLP following their success in computer vision and speech recognition. They mainly rely on recurrent neural networks in end-to-end encoder-decoder systems for tasks such as machine translation (Bahdanau et al., 2015; Luong et al., 2015a) and text reconstruction (Li et al., 2015; Rush et al., 2015). Our work takes the lead in exploring attention mechanisms in CNNs for NLP tasks.

BCNN: Basic Bi-CNN
We now introduce our basic (non-attention) CNN that is based on Siamese architecture (?), i.e., it consists of two weight-sharing CNNs, each processing one of the two sentences, and a final layer that solves the sentence pair task. See Figure 2. We refer to this architecture as BCNN. The next section will then introduce ABCNN, an attention architecture that extends BCNN. In our implementation and also in the mathematical formalization of the model given below, we pad the two sentences to have the same length $s = \max(s_0, s_1)$. However, in the figures we show different lengths because this gives a better intuition of how the model works.
BCNN has four types of layers: input layer, convolution layer, average pooling layer and output layer. We now describe each in turn.
Input layer. In the example in the figure, the two input sentences have 5 and 7 words, respectively. Each word is represented as a $d_0$-dimensional precomputed word2vec (Mikolov et al., 2013) embedding, $d_0 = 300$. As a result, each sentence is represented as a feature map of dimension $d_0 \times s$.
Convolution layer. Let $v_1, v_2, \ldots, v_s$ be the words of a sentence and $c_i \in \mathbb{R}^{w \cdot d_0}$, $0 < i < s + w$, the concatenated embeddings of $v_{i-w+1}, \ldots, v_i$, where embeddings for $v_i$ with $i < 1$ or $i > s$ are set to zero. We then generate the representation $p_i \in \mathbb{R}^{d_1}$ for the phrase $v_{i-w+1}, \ldots, v_i$ using the convolution weights $W \in \mathbb{R}^{d_1 \times w d_0}$ as follows:
$$p_i = \tanh(W \cdot c_i + b) \qquad (1)$$
where $b \in \mathbb{R}^{d_1}$ is the bias. We use wide convolution; i.e., we apply the convolution weights $W$ to words $v_i$ with $i < 1$ or $i > s$, because this makes sure that each word $v_i$, $1 \le i \le s$, can be detected by all weights in $W$, as opposed to only the rightmost (resp. leftmost) weights for initial (resp. final) words in narrow convolution.
Average pooling layer. Pooling, including min pooling, max pooling and average pooling, is commonly used to extract robust features from convolution. In this paper, we introduce attention weighting as an alternative, but use average pooling as a baseline as follows.
For the output feature map of the last convolution layer, we do column-wise averaging over all columns, denoted as all-ap. This will generate a representation vector for each of the two sentences, shown as the top "Average pooling (all-ap)" layer below "Logistic regression" in Figure 2. These two representations are then the basis for the sentence pair decision.
For the output feature map of non-final convolution layers, we do column-wise averaging over windows of $w$ consecutive columns, denoted as w-ap; shown as the lower "Average pooling (w-ap)" layer in Figure 2. For filter width $w$, a convolution layer transforms an input feature map of $s$ columns into a new feature map of $s + w - 1$ columns; average pooling transforms this back to $s$ columns. This architecture supports stacking an arbitrary number of convolution-pooling blocks to extract increasingly abstract features. Input features to the bottom layer are words, input features to the next layer are short phrases and so on. Each level generates more abstract features of higher granularity.
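The shape bookkeeping of one BCNN convolution-pooling block can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names, the toy dimensions and the use of tanh as the nonlinearity are assumptions made for illustration.

```python
import numpy as np

def wide_convolution(F, W, b, w):
    """F: (d0, s) feature map, W: (d1, w*d0), b: (d1,). Returns (d1, s+w-1)."""
    d0, s = F.shape
    # Zero-pad w-1 columns on each side so every unit meets every filter position.
    pad = np.zeros((d0, w - 1))
    padded = np.concatenate([pad, F, pad], axis=1)
    cols = []
    for i in range(s + w - 1):
        # c_i: concatenation of w consecutive (possibly zero-padded) unit vectors.
        c_i = padded[:, i:i + w].T.reshape(-1)
        cols.append(np.tanh(W @ c_i + b))  # tanh is an assumed nonlinearity
    return np.stack(cols, axis=1)

def w_ap(Fc, w):
    """Average over windows of w consecutive columns: (d, s+w-1) -> (d, s)."""
    s = Fc.shape[1] - w + 1
    return np.stack([Fc[:, j:j + w].mean(axis=1) for j in range(s)], axis=1)

def all_ap(Fc):
    """Average over all columns: (d, n) -> (d,)."""
    return Fc.mean(axis=1)

# Toy dimensions (hypothetical): 5 words, 4-dim embeddings, 3 filters, width 3.
rng = np.random.default_rng(0)
d0, d1, s, w = 4, 3, 5, 3
F = rng.normal(size=(d0, s))
W = rng.normal(size=(d1, w * d0))
b = np.zeros(d1)
Fc = wide_convolution(F, W, b, w)   # s + w - 1 = 7 columns
Fp = w_ap(Fc, w)                    # back to s = 5 columns
sentence_vec = all_ap(Fc)           # one vector per sentence
```

Because w-ap restores the input width, the output `Fp` can be fed into another `wide_convolution` call, which is what makes stacking blocks possible.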
Output layer. The last layer is an output layer, chosen according to the task; e.g., for binary classification tasks, this layer is logistic regression (see Figure 2). Other types of output layers are introduced below.
We found that in most cases, performance is boosted if we provide the output of all pooling layers as input to the output layer. For each non-final average pooling layer, we perform w-ap (pooling over windows of $w$ columns) as described above, but we also perform all-ap (pooling over all columns) and forward the result to the output layer. This improves performance because representations from different layers cover the properties of the sentences at different levels of abstraction and all of these levels can be important for a particular sentence pair.

ABCNN-1
ABCNN-1 (Figure 3(a)) employs an attention feature matrix $A$ to influence convolution. Attention features are intended to weight those units of $s_i$ more highly in convolution that are relevant to a unit of $s_{1-i}$ ($i \in \{0, 1\}$); we use the term "unit" here to refer to words on the lowest level and to phrases on higher levels of the network. Figure 3(a) shows two unit representation feature maps in red: this part of ABCNN-1 is the same as in BCNN (see Figure 2). Each column is the representation of a unit, a word on the lowest level and a phrase on higher levels. We first describe the attention feature matrix $A$ informally (layer "Conv input", middle column, in Figure 3(a)). $A$ is generated by matching units of the left representation feature map with units of the right representation feature map such that the attention values of row $i$ in $A$ denote the attention distribution of the $i$-th unit of $s_0$ with respect to $s_1$, and the attention values of column $j$ in $A$ denote the attention distribution of the $j$-th unit of $s_1$ with respect to $s_0$. $A$ can be viewed as a new feature map of $s_0$ (resp. $s_1$) in row (resp. column) direction because each row (resp. column) is a new feature vector of a unit in $s_0$ (resp. $s_1$). Thus, it makes sense to combine this new feature map with the representation feature maps and use both as input to the convolution operation. We achieve this by transforming $A$ into the two blue matrices in Figure 3(a) that have the same format as the representation feature maps. As a result, the new input of convolution has two feature maps for each sentence (shown in red and blue). Our motivation is that the attention feature map will guide the convolution to learn "counterpart-biased" sentence representations.
More formally, let $F_{i,r} \in \mathbb{R}^{d \times s}$ be the representation feature map of sentence $i$ ($i \in \{0, 1\}$). Then we define the attention matrix $A \in \mathbb{R}^{s \times s}$ as follows:
$$A_{i,j} = \text{match-score}(F_{0,r}[:, i], F_{1,r}[:, j]) \qquad (2)$$
The function match-score can be defined in a variety of ways. We found that $1/(1 + |x - y|)$ works well, where $|\cdot|$ is Euclidean distance.
Given attention matrix $A$, we generate the attention feature map $F_{i,a}$ for $s_i$ as follows:
$$F_{0,a} = W_0 \cdot A^\top, \qquad F_{1,a} = W_1 \cdot A$$
The weight matrices $W_0 \in \mathbb{R}^{d \times s}$, $W_1 \in \mathbb{R}^{d \times s}$ are parameters of the model to be learned in training.
We stack the representation feature map $F_{i,r}$ and the attention feature map $F_{i,a}$ as an order 3 tensor and feed it into convolution to generate a higher-level representation feature map for $s_i$ ($i \in \{0, 1\}$). In Figure 3(a), $s_0$ has 5 units, $s_1$ has 7. The output of convolution (shown in the top layer, filter width $w = 3$) is a higher-level representation feature map with 7 columns for $s_0$ and 9 columns for $s_1$.
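As a rough illustration of the ABCNN-1 input construction, the NumPy sketch below computes the attention matrix with the match-score $1/(1 + |x - y|)$ and turns it into attention feature maps. The products `W0 @ A.T` and `W1 @ A` are one reading that is consistent with the stated shapes ($W_i \in \mathbb{R}^{d \times s}$, $A \in \mathbb{R}^{s \times s}$) and should be taken as an assumption; the function names and toy dimensions are hypothetical, and both sentences are assumed to be padded to the same length $s$.

```python
import numpy as np

def match_score(x, y):
    """1 / (1 + Euclidean distance); equals 1 for identical vectors."""
    return 1.0 / (1.0 + np.linalg.norm(x - y))

def attention_matrix(F0, F1):
    """A[i, j] compares unit i of s0 (column i of F0) with unit j of s1."""
    s0, s1 = F0.shape[1], F1.shape[1]
    A = np.empty((s0, s1))
    for i in range(s0):
        for j in range(s1):
            A[i, j] = match_score(F0[:, i], F1[:, j])
    return A

def attention_feature_maps(A, W0, W1):
    """Map rows of A (units of s0) and columns of A (units of s1) to d dims."""
    F0a = W0 @ A.T  # column i is W0 applied to row i of A
    F1a = W1 @ A    # column j is W1 applied to column j of A
    return F0a, F1a

# Toy example (hypothetical sizes): d = 4, padded length s = 6.
rng = np.random.default_rng(0)
d, s = 4, 6
F0r, F1r = rng.normal(size=(d, s)), rng.normal(size=(d, s))
W0, W1 = rng.normal(size=(d, s)), rng.normal(size=(d, s))
A = attention_matrix(F0r, F1r)
F0a, F1a = attention_feature_maps(A, W0, W1)
# Order-3 tensor: representation and attention feature maps as two channels.
conv_input_0 = np.stack([F0r, F0a])
conv_input_1 = np.stack([F1r, F1a])
```

In a trained model `W0` and `W1` would be learned; here they are random only to make the shapes concrete.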

ABCNN-2
ABCNN-1 computes attention weights directly on the input representation with the aim of improving the features computed by convolution. ABCNN-2 (Figure 3(b)) instead computes attention weights on the output of convolution and uses them to reweight this output in pooling. In the example in Figure 3(b), the feature maps output by convolution have 7 columns for $s_0$ and 9 columns for $s_1$; each column is the representation of a unit. The attention matrix $A$ compares all units in $s_0$ with all units of $s_1$. We sum all attention values for a unit to derive a single attention weight for that unit. This corresponds to summing all values in a row of $A$ for $s_0$ ("col-wise sum", resulting in the column vector of size 7 shown) and summing all values in a column for $s_1$ ("row-wise sum", resulting in the row vector of size 9 shown).
More formally, let $A \in \mathbb{R}^{s \times s}$ be the attention matrix, $a_{0,j} = \sum A[j, :]$ the attention weight of unit $j$ in $s_0$, $a_{1,j} = \sum A[:, j]$ the attention weight of unit $j$ in $s_1$ and $F^c_{i,r} \in \mathbb{R}^{d \times (s_i + w - 1)}$ the output of convolution for $s_i$. Then the $j$-th column of the new feature map $F^p_{i,r}$ generated by w-ap is derived by:
$$F^p_{i,r}[:, j] = \sum_{k=j}^{j+w-1} a_{i,k} \, F^c_{i,r}[:, k], \qquad j = 1, \ldots, s_i$$
Note that $F^p_{i,r} \in \mathbb{R}^{d \times s_i}$, i.e., ABCNN-2 pooling generates an output feature map of the same size as the input feature map of convolution. This allows us to stack multiple convolution-pooling blocks to extract features of increasing abstraction.
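The attention-weighted pooling of ABCNN-2 can be sketched in a few lines of NumPy. This is an illustrative reading of the description above, with hypothetical function names and a random stand-in for the attention matrix; the window is taken to span $w$ consecutive columns, matching w-ap.

```python
import numpy as np

def attention_weights(A):
    """One weight per unit: row sums for s0, column sums for s1."""
    a0 = A.sum(axis=1)
    a1 = A.sum(axis=0)
    return a0, a1

def attentive_w_ap(Fc, a, w):
    """Attention-weighted sum over windows of w columns: (d, n) -> (d, n-w+1)."""
    n_out = Fc.shape[1] - w + 1
    cols = []
    for j in range(n_out):
        # Scale each column in the window by its attention weight, then sum.
        cols.append((Fc[:, j:j + w] * a[j:j + w]).sum(axis=1))
    return np.stack(cols, axis=1)

# Sizes from the Figure 3(b) example: 7 and 9 convolution output columns, w = 3.
rng = np.random.default_rng(0)
d, w = 3, 3
Fc0 = rng.normal(size=(d, 7))
Fc1 = rng.normal(size=(d, 9))
A = rng.uniform(size=(7, 9))  # stand-in for a real attention matrix
a0, a1 = attention_weights(A)
Fp0 = attentive_w_ap(Fc0, a0, w)  # 5 columns, the pre-convolution width of s0
Fp1 = attentive_w_ap(Fc1, a1, w)  # 7 columns, the pre-convolution width of s1
```

With all attention weights set to $1/w$, `attentive_w_ap` reduces to plain average pooling, which shows how ABCNN-2 generalizes the BCNN baseline.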
ABCNN-1 and ABCNN-2 differ in three respects. (i) Attention in ABCNN-1 impacts convolution indirectly while attention in ABCNN-2 influences pooling through direct attention weighting. (ii) ABCNN-1 requires the two matrices $W_i$ to convert the attention matrix into attention feature maps; and the input to convolution has two times as many feature maps. Thus, ABCNN-1 has more parameters than ABCNN-2 and is more vulnerable to overfitting. (iii) As pooling is performed after convolution, pooling handles larger-granularity units than convolution; e.g., if the input to convolution has word level granularity, then the input to pooling has phrase level granularity, the phrase size being equal to filter size $w$. Thus, ABCNN-1 and ABCNN-2 implement attention mechanisms for linguistic units of different granularity. The complementarity of ABCNN-1 and ABCNN-2 motivates us to propose ABCNN-3, a third architecture that combines elements of the two.

ABCNN-3
ABCNN-3 combines ABCNN-1 and ABCNN-2 by stacking them. See Figure 3(c). ABCNN-3 combines the strengths of ABCNN-1 and ABCNN-2 by allowing the attention mechanism to operate (i) both on the convolution and on the pooling parts of a convolution-pooling block and (ii) both on the input granularity and on the more abstract output granularity.

Experiments
We test the proposed architectures on three tasks: answer selection (AS), paraphrase identification (PI) and textual entailment (TE).

Common Training Setup
For all tasks, words are initialized by 300-dimensional word2vec embeddings and not changed during training. A single randomly initialized embedding is created for all unknown words by uniform sampling from [-.01, .01]. We employ Adagrad (Duchi et al., 2011) and $L_2$ regularization.

Network configuration
Each network in the experiments below consists of (i) an initialization block $b_1$ that initializes words by word2vec embeddings, (ii) a stack of $k - 1$ convolution-pooling blocks $b_2, \ldots, b_k$, computing increasingly abstract features, and (iii) one final LR layer (logistic regression layer) as shown in Figure 2.
The input to the LR layer consists of $kn$ features: each block provides $n$ similarity scores, e.g., $n$ cosine similarity scores. Figure 2 shows the two sentence vectors output by the final block $b_k$ of the stack ("sentence representation 0", "sentence representation 1"); this is the basis of the last $n$ similarity scores. As we explained in the final paragraph of Section 3, we perform all-ap pooling for all blocks, not just for $b_k$. Thus we get one sentence representation each for $s_0$ and $s_1$ for each block $b_1, \ldots, b_k$. We compute $n$ similarity scores for each block (based on the block's two sentence representations). Thus, we compute a total of $kn$ similarity scores and these scores are input to the LR layer.
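The construction of the $kn$ input features can be made concrete with a small sketch. The helper names (`similarity_features`, `block_reps`) and the toy vectors are hypothetical; the two similarity functions are the ones mentioned in this paper (cosine and $1/(1 + |x - y|)$).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def euclidean_similarity(u, v):
    """Euclidean distance mapped to a similarity in (0, 1]."""
    return float(1.0 / (1.0 + np.linalg.norm(u - v)))

def similarity_features(block_reps, sims):
    """block_reps: k pairs (rep of s0, rep of s1), one pair per block.
    sims: n similarity functions. Returns a vector of k*n scores."""
    return np.array([f(r0, r1) for r0, r1 in block_reps for f in sims])

# Toy example: k = 3 blocks, n = 2 similarity scores per block.
rng = np.random.default_rng(0)
k, dim = 3, 5
block_reps = [(rng.normal(size=dim), rng.normal(size=dim)) for _ in range(k)]
features = similarity_features(block_reps, [cosine, euclidean_similarity])
```

The resulting vector has one group of $n$ scores per block, ordered from the lowest block to the highest.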
Depending on the task, we use different methods for computing the similarity score: see below.

Layerwise training
In our training regime, we first train a network consisting of just one convolution-pooling block $b_2$. We then create a new network by adding a block $b_3$, initialize its $b_2$ block with the previously learned weights for $b_2$ and train $b_3$ keeping the previously learned weights for $b_2$ fixed. We repeat this procedure until all $k - 1$ convolution-pooling blocks are trained. We found that this training regime gives us good performance and shortens training times considerably. Since similarity scores of lower blocks are kept unchanged once they have been learned, this also has the nice effect that "simple" similarity scores (those based on surface features) are learned first and subsequent training phases can focus on complementary scores derived from more complex abstract features.

Classifier
We found that performance increases if we do not use the output of the LR layer as the final decision, but instead train linear SVM or logistic regression with default parameters directly on the input to the LR layer (i.e., on the $kn$ similarity scores that are generated by the $k$-block stack after network training is completed). Direct training of SVMs/LR seems to get closer to the global optimum than gradient descent training of CNNs.
Table 2 shows the values of the hyperparameters. Hyperparameters were tuned on dev.

Shared Baselines
We use addition and LSTM as two shared baselines for all three tasks, i.e., for AS, PI and TE. We now describe these two shared baselines.
(i) Addition. We sum up word embeddings element-wise to form each sentence representation, then concatenate the two sentence representation vectors as classifier input. (ii) A-LSTM. Before this work, most attention mechanisms in NLP were implemented in recurrent neural networks for text generation tasks such as machine translation (e.g., Bahdanau et al. (2015), Luong et al. (2015a)). Rocktäschel et al. (2016) present an attention-LSTM for the natural language inference task. Since this model is the pioneering attention-based RNN system for sentence pair classification, we consider it as a baseline system ("A-LSTM") for all three tasks. A-LSTM has the same configuration as our ABCNN systems in terms of word initialization (300-dimensional word2vec embeddings) and the dimensionality of all hidden layers (50).

Answer Selection
We use WikiQA, an open domain question-answer dataset. We use the subtask that assumes that there is at least one correct answer for a question. The corresponding dataset consists of 20,360 question-candidate pairs in train, 1,130 pairs in dev and 2,352 pairs in test where we adopt the standard setup of only considering questions that have correct answers for evaluation. Following Yang et al. (2015), we truncate answers to 40 tokens.
The task is to rank the candidate answers based on their relatedness to the question. Evaluation measures are mean average precision (MAP) and mean reciprocal rank (MRR).
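For readers unfamiliar with the two ranking measures, the following plain-Python sketch computes MAP and MRR from per-question scores and binary labels. It uses the standard definitions; the function names and the toy data are illustrative, and questions without a correct answer are skipped, in line with the setup described above.

```python
def rank_by_score(scores, labels):
    """Return the labels ordered by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def average_precision(ranked_labels):
    """Mean of precision@rank over the ranks of the correct answers."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def reciprocal_rank(ranked_labels):
    """1 / rank of the first correct answer (0 if there is none)."""
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def map_mrr(questions):
    """questions: list of (scores, labels) pairs, one per question."""
    ranked = [rank_by_score(s, l) for s, l in questions if any(l)]
    aps = [average_precision(r) for r in ranked]
    rrs = [reciprocal_rank(r) for r in ranked]
    return sum(aps) / len(aps), sum(rrs) / len(rrs)

# Toy data: two questions with candidate scores and 0/1 correctness labels.
questions = [
    ([0.9, 0.2, 0.5], [0, 1, 1]),
    ([0.1, 0.8], [0, 1]),
]
map_score, mrr_score = map_mrr(questions)
```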

Task-Specific Setup
We use cosine similarity as the similarity score for this task. In addition, we use sentence lengths, WordCnt (count of the number of non-stopwords in the question that also occur in the answer) and WgtWordCnt (reweight the counts by the IDF values of the question words). Thus, the final input to the LR layer has size $k + 4$: one cosine for each of the $k$ blocks and the four additional features.
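The two overlap features can be sketched as follows. This is an illustration only: the stopword list and IDF table are hypothetical placeholders, tokens are assumed to be lowercased already, and counting question tokens (rather than distinct word types) is an assumption.

```python
def word_cnt(question, answer, stopwords):
    """Number of non-stopword question tokens that also occur in the answer."""
    answer_vocab = set(answer)
    return sum(1 for tok in question
               if tok not in stopwords and tok in answer_vocab)

def wgt_word_cnt(question, answer, stopwords, idf):
    """Same overlap, but each matched token contributes its IDF value."""
    answer_vocab = set(answer)
    return sum(idf.get(tok, 0.0) for tok in question
               if tok not in stopwords and tok in answer_vocab)

# Hypothetical toy inputs.
stopwords = {"how", "much", "did", "the"}
idf = {"waterboy": 2.5, "gross": 1.7}
question = "how much did waterboy gross".split()
answer = "the waterboy movie earned 161.5 million".split()

cnt = word_cnt(question, answer, stopwords)
wgt = wgt_word_cnt(question, answer, stopwords, idf)
# The four extra features: two sentence lengths plus the two overlap counts.
extra_features = [len(question), len(answer), cnt, wgt]
```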
We compare with eleven baselines. The first seven are considered by Yang et al. (2015): (i) WordCnt; (ii) WgtWordCnt; (iii) LCLR (Yih et al., 2013) makes use of rich lexical semantic features, including word/lemma matching, WordNet (Miller, 1995) and distributional models; (iv) PV: Paragraph Vector (Le and Mikolov, 2014); (v) CNN: bigram convolutional neural network (Yu et al., 2014); (vi) PV-Cnt: combine PV with (i) and (ii); (vii) CNN-Cnt: combine CNN with (i) and (ii). Apart from the baselines considered by Yang et al. (2015), we compare with two Addition baselines and two LSTM baselines. Addition and A-LSTM are the baselines described in Section 5.1.4. We also combine both with the four extra features; this gives us two additional baselines that we refer to as Addition(+) and A-LSTM(+).

Results
Table 3 shows performance of the baselines, of BCNN and of the three ABCNN architectures. For CNNs, we test one (one-conv) and two (two-conv) convolution-pooling blocks.
The non-attention network BCNN already performs better than the baselines. If we add attention mechanisms, then the performance further improves by several points. Comparing ABCNN-2 with ABCNN-1, we find ABCNN-2 is slightly better even though ABCNN-2 is the simpler architecture. If we combine ABCNN-1 and ABCNN-2 to form ABCNN-3, we get further improvement. This can be explained by ABCNN-3's ability to take attention at finer granularity into consideration in each convolution-pooling block, while ABCNN-1 and ABCNN-2 consider attention only at convolution input or only at pooling input, respectively. We also find that stacking two convolution-pooling blocks does not bring consistent improvement and therefore do not test deeper architectures.

Paraphrase Identification
We use the Microsoft Research Paraphrase (MSRP) corpus (Dolan et al., 2004). The training set contains 2753 true / 1323 false and the test set 1147 true / 578 false paraphrase pairs. We randomly select 400 pairs from train and use them as dev set; but we still report results for training on the entire training set. For each triple (label, $s_0$, $s_1$) in the training set, we also add (label, $s_1$, $s_0$) to the training set to make best use of the training data. Systems are evaluated by accuracy and $F_1$.
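The symmetric data augmentation and the two evaluation measures are easy to state in code. The sketch below is illustrative: function names and toy data are hypothetical, and $F_1$ is computed for the positive (paraphrase) class using the standard definition.

```python
def augment_symmetric(triples):
    """For every (label, s0, s1), also add the swapped pair (label, s1, s0)."""
    return triples + [(label, s1, s0) for label, s0, s1 in triples]

def accuracy_f1(gold, pred):
    """Accuracy and F1 of the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    acc = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, f1

# Toy data.
train = [(1, "she struck a deal", "she signed a contract")]
train_aug = augment_symmetric(train)
acc, f1 = accuracy_f1(gold=[1, 1, 0, 0], pred=[1, 0, 0, 1])
```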

Task-Specific Setup
In this task, we add the 15 MT features from (Madnani et al., 2012) and the lengths of the two sentences. In addition, we compute ROUGE-1, ROUGE-2 and ROUGE-SU4 (Lin, 2004), which are scores measuring the match between the two sentences on (i) unigrams, (ii) bigrams and (iii) unigrams and skip-bigrams (maximum skip distance of four), respectively. In this task, we found that transforming Euclidean distance into a similarity score by $1/(1 + |x - y|)$ performs better than cosine similarity. Additionally, we use dynamic pooling (Yin and Schütze, 2015a) of the attention matrix $A$ in Equation 2 and forward pooled values of all blocks to the classifier. This gives us better performance than only forwarding sentence-level matching features.
We compare our system with a number of alternative approaches, both with representative neural network (NN) approaches and non-NN approaches: (i) A-LSTM; (ii) A-LSTM(+): A-LSTM plus handcrafted features; (iii) RAE (Socher et al., 2011), recursive autoencoder; (iv) Bi-CNN-MI (Yin and Schütze, 2015a); (v) MPSSM-CNN (He et al., 2015), the state-of-the-art NN system for PI. We consider the following four non-NN systems: (vi) Addition (see Section 5.1.4); (vii) Addition(+): Addition plus handcrafted features; (viii) MT (Madnani et al., 2012), a system that combines machine translation metrics; (ix) MF-TF-KLD (Ji and Eisenstein, 2013), the state-of-the-art non-NN system.

Results
Table 4 shows that BCNN is slightly worse than the state-of-the-art whereas ABCNN-1 roughly matches it. ABCNN-2 is slightly above the state-of-the-art. ABCNN-3 outperforms the state-of-the-art in accuracy and $F_1$. Two convolution layers only bring small improvements over one.

Textual Entailment
SemEval 2014 Task 1 (Marelli et al., 2014a) evaluates system predictions of textual entailment (TE) relations on sentence pairs from the SICK dataset (Marelli et al., 2014b). The three classes are entailment, contradiction and neutral. The sizes of SICK train, dev and test sets are 4439, 495 and 4906 pairs, respectively. We call this dataset ORIG. We also create NONOVER, a copy of ORIG in which the words that occur in both sentences have been removed. A sentence in NONOVER is denoted by the special token <empty> if all words are removed. Table 5 shows three pairs from ORIG and their transformation in NONOVER. We observe that focusing on the non-overlapping parts provides clearer hints for TE than ORIG. In this task, we run two copies of each network, one for ORIG, one for NONOVER; these two networks have a single common LR layer.
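The NONOVER transformation can be sketched in a few lines. The function name is hypothetical, and whitespace tokenization with exact string matching is an assumption (the text does not say how tokens are compared).

```python
EMPTY = "<empty>"

def non_overlap(s0, s1):
    """Remove words that occur in both sentences; use <empty> if none remain."""
    shared = set(s0) & set(s1)
    r0 = [tok for tok in s0 if tok not in shared] or [EMPTY]
    r1 = [tok for tok in s1 if tok not in shared] or [EMPTY]
    return r0, r1

# The negative TE pair from Figure 1.
s0 = "an ice skating rink placed outdoors is full of people".split()
s1 = "an ice skating rink placed indoors is full of people".split()
n0, n1 = non_overlap(s0, s1)      # only "outdoors" and "indoors" remain
e0, e1 = non_overlap(s0, s0)      # identical sentences reduce to <empty>
```

On this pair only the contrasting words survive, which illustrates why the non-overlapping parts can give clearer hints for TE.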
Following Lai and Hockenmaier (2014), we train our final system (after fixing of hyperparameters) on train and dev (4,934 pairs).Our evaluation measure is accuracy.

Task-Specific Setup
We found that for this task forwarding two similarity scores from each block (instead of just one) is helpful. We use cosine similarity and Euclidean distance. As we did for paraphrase identification, we add the 15 MT features for each sentence pair for this task as well; our motivation is that entailed sentences resemble paraphrases more than contradictory sentences do.
We use the following linguistic features.
Negation. Negation obviously is an important feature for detecting contradiction. Feature NEG is set to 1 if either sentence contains "no", "not", "nobody" or "isn't", and to 0 otherwise.
Nyms. Following Lai and Hockenmaier (2014), we use WordNet to detect synonyms, hypernyms and antonyms in the pairs. But we do this on NONOVER (not on ORIG) to focus on what is critical for TE. Specifically, feature SYN is the number of word pairs in $s_0$ and $s_1$ that are synonyms. HYP0 (resp. HYP1) is the number of words in $s_0$ (resp. $s_1$) that have a hypernym in $s_1$ (resp. $s_0$). In addition, we collect all potential antonym pairs (PAP) in NONOVER. We identify the matched chunks that occur in contradictory and neutral, but not in entailed pairs. We exclude synonyms and hypernyms and apply a frequency filter of $n = 2$. In contrast to (Lai and Hockenmaier, 2014), we constrain the PAP pairs to cosine similarity above 0.4 in word2vec embedding space as this discards many noise pairs. Feature ANT is the number of matched PAP antonyms in a sentence pair.
Length. As before we use sentence length, both ORIG lengths (LEN0O and LEN1O) and NONOVER lengths (LEN0N and LEN1N).
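The NEG and length features are simple enough to sketch directly (the WordNet-based features are omitted here). Function names are hypothetical, tokens are assumed to be lowercased, and counting an `<empty>` NONOVER sentence as length 0 is an assumption.

```python
NEGATION_WORDS = {"no", "not", "nobody", "isn't"}

def neg_feature(s0, s1):
    """1 if either sentence contains one of the listed negation words, else 0."""
    return int(any(tok in NEGATION_WORDS for tok in s0 + s1))

def length_features(s0, s1, n0, n1):
    """ORIG lengths of (s0, s1) and NONOVER lengths of (n0, n1)."""
    def length(tokens):
        # Assumption: the <empty> placeholder does not count as a word.
        return sum(1 for tok in tokens if tok != "<empty>")
    return {"LEN0O": length(s0), "LEN1O": length(s1),
            "LEN0N": length(n0), "LEN1N": length(n1)}

# Toy pair with its (hand-written) NONOVER version.
s0 = "a man is not playing a guitar".split()
s1 = "a man is playing a guitar".split()
n0, n1 = ["not"], ["<empty>"]
neg = neg_feature(s0, s1)
lengths = length_features(s0, s1, n0, n1)
```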
Apart from the Addition and LSTM baselines, we further compare with the top-3 systems in SemEval and TrRNTN (Bowman et al., 2015b), a recursive neural network developed for this SICK task.

Results
Table 6 shows that our CNNs outperform A-LSTM (with or without linguistic features added) as well as the top three systems of SemEval. Comparing ABCNN with BCNN, the attention mechanism consistently improves performance. ABCNN-1 performs roughly on par with ABCNN-2, while ABCNN-3 yields a larger improvement: a boost of 1.6 points compared to the previous state of the art (see footnote 9).

Visual Analysis
In Figure 4, we visualize the attention matrices for one TE sentence pair in ABCNN-2 for blocks b 1 (unigrams), b 2 (first convolutional layer) and b 3 (second convolutional layer). Darker shades of blue indicate stronger attention values.
In Figure 4(a), each word corresponds to exactly one row or column. We can see that words in s i with semantic equivalents in s 1−i get high attention while words without semantic equivalents get low attention, e.g., "walking" and "murals" in s 0 and "front" and "colorful" in s 1. This behavior seems reasonable for the unigram level.
Rows/columns of the attention matrix in Figure 4(b) correspond to phrases of length three since filter width w = 3. High attention values generally correlate with close semantic correspondence: the phrase "people are" in s 0 matches "several people are" in s 1; both "are walking outside" and "walking outside the" in s 0 match "are in front" in s 1; "the building that" in s 0 matches "a colorful building" in s 1. More interestingly, looking at the bottom right corner, both "on it" and "it" in s 0 match "building" in s 1; this indicates that ABCNN is able to detect some coreference across sentences. "building" in s 1 has two places in which higher attention values appear: one is with "it" in s 0, the other is with "the building that" in s 0. This may indicate that ABCNN recognizes that "building" in s 1 and "the building that" / "it" in s 0 refer to the same object. Hence, coreference is detected both across sentences and within a sentence. For the attention vectors on the left and the top, we can see that attention has focused on the key parts: "people are walking outside the building that" in s 0, "several people are in" and "of a colorful building" in s 1.

9 If we run ABCNN-3 (two-conv) without the 24 linguistic features, the performance is 84.6.

method                         acc
SemEval Top3
  (Jimenez et al., 2014)       83.1
  (Zhao et al., 2014)          83.6
  (Lai and Hockenmaier, 2014)
Rows/columns of the attention matrix in Figure 4(c) (second layer of convolution) correspond to phrases of length 5 since filter width w = 3 in both convolution layers (5 = 1 + 2 * (3 − 1)). We use "..." to denote words in the middle if a phrase like "several...front" has more than two words. We can see that the attention distribution in the matrix has focused on some local regions. As the granularity of phrases is larger, it makes sense that the attention values are smoother. But we can still find some interesting clues: at the two ends of the main diagonal, higher attention values hint that the first part of s 0 matches well with the first part of s 1; "several murals on it" in s 0 matches well with "of a colorful building" in s 1, which satisfies the intuition that these two phrases are crucial for making a decision on TE in this case. This again shows the potential strength of our system in figuring out which parts of the two sentences refer to the same object. In addition, in the central part of the matrix, we can see that the long phrase "people are walking outside the building" in s 0 matches well with the long phrase "are in front of a colorful building" in s 1.
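The phrase-length arithmetic used above (1 word at the unigram level, 3 after one convolution, 5 after two) can be reproduced with a short sketch. The function name is hypothetical, and the formula assumes stride-1 convolutions with the same filter width in every layer; it simply restates the calculation given in the text and ignores any effect of pooling.

```python
def phrase_length(num_conv_layers, filter_width=3):
    """Number of input words covered by one row/column of the attention matrix.

    Assumes stride-1 convolutions that all share the same filter width, so
    each additional layer widens the covered span by (filter_width - 1) words.
    """
    if num_conv_layers < 0 or filter_width < 1:
        raise ValueError("need num_conv_layers >= 0 and filter_width >= 1")
    return 1 + num_conv_layers * (filter_width - 1)


if __name__ == "__main__":
    for layers in range(3):
        print(layers, phrase_length(layers))
```

With w = 3 this gives spans of 1, 3 and 5 words for blocks b 1, b 2 and b 3, matching Figures 4(a), 4(b) and 4(c).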

Summary
In this work, we presented three mechanisms to integrate attention into convolutional neural networks for general sentence pair modeling tasks.
Our experimental results on AS, PI and TE show that attention-based CNNs perform better than CNNs without attention mechanisms. ABCNN-2 generally outperforms ABCNN-1 and ABCNN-3 surpasses both.
In all tasks, we did not find any big improvement of two layers of convolution over one layer. This is probably due to the limited size of training data. We expect that, as larger training sets become available, deep ABCNNs will show even better performance.
In addition, linguistic features contribute in all three tasks: improvements by 0.0321 (MAP) and 0.0338 (MRR) for AS, improvements by 3.8 (acc) and 2.1 (F 1) for PI and an improvement by 1.6 (acc) for TE. But our ABCNN can still reach or surpass the state of the art even without those features in AS and TE tasks. This shows that ABCNN is generally a strong NN system.
As we discussed in Section 2, attention-based LSTMs have been especially successful in tasks that have a strong generation component like machine translation and summarization. CNNs have not been used for this type of task. This is an interesting area of future work for attention-based CNN systems.
Figure 3: Three ABCNN architectures

Figure 4: Attention visualization for TE

Table 4: Results for PI on MSRP

Table 5: SICK data: Converting the original sentences (ORIG) into the NONOVER format