From Paraphrase Database to Compositional Paraphrase Model and Back

The Paraphrase Database (PPDB; Ganitkevitch et al., 2013) is an extensive semantic resource, consisting of a list of phrase pairs with (heuristic) confidence estimates. However, it is still unclear how it can best be used, due to the heuristic nature of the confidences and its necessarily incomplete coverage. We propose models to leverage the phrase pairs from the PPDB to build parametric paraphrase models that score paraphrase pairs more accurately than the PPDB’s internal scores while simultaneously improving its coverage. They allow for learning phrase embeddings as well as improved word embeddings. Moreover, we introduce two new, manually annotated datasets to evaluate short-phrase paraphrasing models. Using our paraphrase model trained using PPDB, we achieve state-of-the-art results on standard word and bigram similarity tasks and beat strong baselines on our new short phrase paraphrase tasks.


Introduction
Paraphrase detection is the task of analyzing two segments of text and determining if they have the same meaning despite differences in structure and wording.
One component of many such systems is a paraphrase table containing pairs of text snippets, usually automatically generated, that have the same meaning. The most recent work in this area is the Paraphrase Database (PPDB; Ganitkevitch et al., 2013), a collection of confidence-rated paraphrases created using the pivoting technique of Bannard and Callison-Burch (2005) over large parallel corpora. The PPDB is a massive resource, containing 220 million paraphrase pairs. It captures many short paraphrases that would be difficult to obtain using any other resource. For example, the pair {we must do our utmost, we must make every effort} has little lexical overlap but is present in PPDB. The PPDB has recently been used for monolingual alignment (Yao et al., 2013), for predicting sentence similarity (Bjerva et al., 2014), and to improve the coverage of FrameNet (Rastogi and Van Durme, 2014).
Though already effective for multiple NLP tasks, we note some drawbacks of PPDB. The first is lack of coverage: to use the PPDB to compare two phrases, both must be in the database. The second is that PPDB is a nonparametric paraphrase model; the number of parameters (phrase pairs) grows with the size of the dataset used to build it. In practice, it can become unwieldy to work with as the size of the database increases. A third concern is that the confidence estimates in PPDB are a heuristic combination of features, and their quality is unclear.
We address these issues in this work by introducing ways to use PPDB to construct parametric paraphrase models. First, we show that initial skip-gram word vectors (Mikolov et al., 2013a) can be fine-tuned for the paraphrase task by training on word pairs from PPDB. We call the resulting vectors PARAGRAM word vectors. We find additive composition of PARAGRAM vectors to be a simple but effective way to embed phrases for short-phrase paraphrase tasks. We obtain further improvements by training a recursive neural network (RNN; Socher et al., 2010) directly on phrase pairs from PPDB.
We show that our resulting word and phrase representations are effective on a wide variety of tasks, including two new datasets that we introduce. The first, Annotated-PPDB, contains pairs from PPDB that were scored by human annotators. It can be used to evaluate paraphrase models for short phrases. We use it to show that the phrase embeddings produced by our methods are significantly more indicative of paraphrasability than the original heuristic scoring used by Ganitkevitch et al. (2013). Thus we use the power of PPDB to improve its contents.
Our second dataset, ML-Paraphrase, is a reannotation of the bigram similarity corpus from Mitchell and Lapata (2010). The task was originally developed to measure semantic similarity of bigrams, but some annotations are not congruent with the functional similarity central to paraphrase relationships. Our re-annotation can be used to assess paraphrasing capability of bigram compositional models.
In summary, we make the following contributions:
Provide new PARAGRAM word vectors, learned using PPDB, that achieve state-of-the-art performance on the SimLex-999 lexical similarity task (Hill et al., 2014b) and lead to improved performance in sentiment analysis.
Provide ways to use PPDB to embed phrases. We compare additive and RNN composition of PARAGRAM vectors. Both can improve PPDB by reranking the paraphrases in PPDB to improve correlations with human judgments. They can be used as concise parameterizations of PPDB, thereby vastly increasing its coverage. We also perform a qualitative analysis of the differences between additive and RNN composition.
Introduce two new datasets. The first contains PPDB phrase pairs and evaluates how well models can measure the quality of short paraphrases. The second is a new annotation of the bigram similarity task in Mitchell and Lapata (2010) that makes it suitable for evaluating bigram paraphrases.
We release the new datasets, complete with annotation instructions and raw annotations, as well as our code and the trained models.

Related Work
There is a vast literature on representing words as vectors. The intuition behind most methods for creating these vectors (or embeddings) is that similar words have similar contexts (Firth, 1957). Earlier models made use of latent semantic analysis (LSA) (Deerwester et al., 1990). Recently, more sophisticated neural models, originating with Bengio et al. (2003), have been gaining popularity (Mikolov et al., 2013a; Pennington et al., 2014). These embeddings are now being used in new ways as they are tailored to specific downstream tasks (Bansal et al., 2014).
Phrase representations can be created from word vectors using compositional models. Simple but effective compositional models were studied by Mitchell and Lapata (2008) and Blacoe and Lapata (2012). They compared a variety of binary operations on word vectors and found that simple point-wise multiplication of explicit vector representations performed very well. Other works, such as Zanzotto et al. (2010) and Baroni and Zamparelli (2010), also explored composition using models based on operations on vectors and matrices.
More recent work has shown that the extremely efficient neural embeddings of Mikolov et al. (2013a) also do well on compositional tasks simply by adding the word vectors (Mikolov et al., 2013b). Hashimoto et al. (2014) introduced an alternative word embedding and compositional model based on predicate-argument structures that does well on two simple composition tasks, including the one introduced by Mitchell and Lapata (2010).
An alternative approach to composition, used by Socher et al. (2011), is to train a recursive neural network (RNN) whose structure is defined by a binarized parse tree. In particular, they trained their RNN as an unsupervised autoencoder. The RNN captures the latent structure of composition. Recent work has shown that this model struggles in tasks involving compositionality (Blacoe and Lapata, 2012; Hashimoto et al., 2014) (see footnote 4). However, we found success using RNNs in a supervised setting, similar to Socher et al. (2014), who used RNNs to learn representations for image descriptions. The objective function we used in this work was motivated by their multimodal objective function for learning joint image-sentence representations.
Lastly, the PPDB has been used along with other resources to learn word embeddings for several tasks, including semantic similarity, language modeling, predicting human judgments, and classification (Yu and Dredze, 2014; Faruqui et al., 2015). Concurrently with our work, it has also been used to construct paraphrase models for short phrases (Yu and Dredze, 2015).

New Paraphrase Datasets
We created two novel datasets: (1) Annotated-PPDB, a subset of phrase pairs from PPDB which are annotated according to how strongly they represent a paraphrase relationship, and (2) ML-Paraphrase, a re-annotation of the bigram similarity dataset from Mitchell and Lapata (2010), again annotated for strength of paraphrase relationship.

Annotated-PPDB
Our motivation for creating Annotated-PPDB was to establish a way to evaluate compositional paraphrase models on short phrases.
Most existing paraphrase tasks focus on words, like SimLex-999 (Hill et al., 2014b), or entire sentences, such as the Microsoft Research Paraphrase Corpus (Quirk et al., 2004). To our knowledge, there are no datasets that focus on the paraphrasability of short phrases. Thus, we created Annotated-PPDB so that researchers can focus on local compositional phenomena and measure the performance of models directly, avoiding the need to do so indirectly in a sentence-level task. Models that have strong performance on Annotated-PPDB can be used to provide more accurate confidence scores for the paraphrases in the PPDB as well as reduce the need for large paraphrase tables altogether.
Annotated-PPDB was created in a multi-step process (outlined below) involving various automatic filtering steps followed by crowdsourced human annotation. One of the aims for our dataset was to collect a variety of paraphrase types: we wanted to include pairs that were non-trivial to recognize as well as those with a range of similarity and length. We focused on phrase pairs with limited lexical overlap to avoid including those with only trivial differences.
Footnote 4: We also replicated this approach and found training to be time-consuming even using low-dimensional word vectors.
We started with candidate phrases extracted from the first 10M pairs in the XXL version of the PPDB and then executed the following steps.
Filter phrases for quality: Only those phrases whose tokens were in our vocabulary were retained. Next, all duplicate paraphrase pairs were removed; in PPDB, these are distinct pairs that contain the same two phrases with the order swapped.
Filter by lexical overlap: Next, we calculated the word overlap score in each phrase pair and then retained only those pairs that had a score of less than 0.5. By word overlap score, we mean the fraction of tokens in the smaller of the phrases with Levenshtein distance ≤ 1 to a token in the larger of the phrases. This was done to exclude less interesting phrase pairs like {my dad had, my father had} or {ballistic missiles, of ballistic missiles} that only differ in a synonym or the addition of a single word.
Select range of paraphrasabilities: To balance our dataset with both clear paraphrases and erroneous pairs in PPDB, we sampled 5,000 examples from each of ten chunks of the first 10M initial phrase pairs, where a chunk is defined as 1M phrase pairs.
Select range of phrase lengths: We then selected 1,500 phrases from each 5,000-example sample that encompassed a wide range of phrase lengths. To do this, we first binned the phrase pairs by their effective size. Let $n_1$ be the number of tokens of length greater than one character in the first phrase and $n_2$ the same for the second phrase. Then the effective size is defined as $\max(n_1, n_2)$. The bins contained pairs of effective size of 3, 4, and 5 or more, and 500 pairs were selected from each bin. This gave us a total of 15,000 phrase pairs.
Prune to 3,000: 3,000 phrase pairs were then selected randomly from the 15,000 remaining pairs to form an initial dataset, Annotated-PPDB-3K. The phrases were selected so that every phrase in the dataset was unique.
Annotate with Mechanical Turk: The dataset was then rated on a scale from 1-5 using Amazon Mechanical Turk, where a score of 5 denoted phrases that are equivalent in a large number of contexts, 3 meant that the phrases had some overlap in meaning, and 1 indicated that the phrases were dissimilar or contradictory in some way (e.g., can not adopt and is able to accept).
We only permitted workers whose location was in the United States and who had done at least 1,000 HITs with a 99% acceptance rate. Each example was labeled by 5 annotators and their scores were averaged to produce the final rating. Table 1 shows some statistics of the data. Overall, the annotated data had a mean deviation (MD) of 0.80. Table 1 shows that overall, workers found the phrases to be of high quality, as more than two-thirds of the pairs had an average score of at least 3. The table also shows that workers had stronger agreement on very low and very high quality pairs and were less certain in the middle of the range.
Prune to 1,260: To create our final dataset, Annotated-PPDB, we selected 1,260 phrase pairs from the 3,000 annotations. We did this by first binning the phrases into 3 categories: those with scores in the interval [1, 2.5), those with scores in the interval [2.5, 3.5], and those with scores in the interval (3.5, 5]. We took the 420 phrase pairs with the lowest MD in each bin, as these have the most agreement about their label, to form Annotated-PPDB.
Table 1: An analysis of Annotated-PPDB-3K extracted from PPDB. The statistics shown are for the splits of the data according to the average score by workers. MD denotes mean deviation and % of Data refers to the percentage of our dataset that fell into each range.
Score Range | MD | % of Data
4,5] | 0.59 | 36.9
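The filtering quantities used in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not our released code: the function names are hypothetical, tokens are assumed to be whitespace-separated, phrases are assumed non-empty, and mean deviation is assumed here to be the average absolute deviation of the worker scores from their mean.

```python
from statistics import mean


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def word_overlap_score(p1: str, p2: str) -> float:
    """Fraction of tokens in the smaller phrase that are within
    Levenshtein distance 1 of some token in the larger phrase."""
    t1, t2 = p1.split(), p2.split()
    small, large = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    hits = sum(any(levenshtein(s, l) <= 1 for l in large) for s in small)
    return hits / len(small)


def effective_size(p1: str, p2: str) -> int:
    """max(n_1, n_2), counting only tokens longer than one character."""
    count = lambda p: sum(len(tok) > 1 for tok in p.split())
    return max(count(p1), count(p2))


def mean_deviation(scores):
    """Average absolute deviation from the mean (assumed definition of MD)."""
    m = mean(scores)
    return mean(abs(s - m) for s in scores)
```

Under these assumptions, the pair {ballistic missiles, of ballistic missiles} receives an overlap score of 1.0 and would be filtered out, while {we must do our utmost, we must make every effort} scores 0.4 and would be retained.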

ML-Paraphrase
Our second newly-annotated dataset, ML-Paraphrase, is based on the bigram similarity task originally introduced by Mitchell and Lapata (2010); we refer to the original annotations as the ML dataset.
The ML dataset consists of human similarity ratings for three types of bigrams: adjective-noun (JN), noun-noun (NN), and verb-noun (VN). Through manual inspection, we found that the annotations were not consistent with the notion of similarity central to paraphrase tasks. For instance, television set and television programme were the highest rated phrases in the NN section (based on average annotator score). Similarly, one of the highest ranked JN pairs was older man and elderly woman. This indicates that the annotations reflect topical similarity in addition to capturing functional or definitional similarity.
Therefore, we had the data re-annotated by two authors of this paper who are native English speakers. The bigrams were labeled on a scale from 1-5 where 5 denotes phrases that are equivalent in a large number of contexts, 3 indicates the phrases are roughly equivalent in a narrow set of contexts, and 1 means the phrases are not at all equivalent in any context. Following annotation, we collapsed the rating scale by merging 4s and 5s together and 1s and 2s together.
Statistics for the data are shown in Table 2. We show inter-annotator Spearman ρ and Cohen's κ in columns 2 and 3, indicating substantial agreement on the JN and VN portions but only moderate agreement on NN. In fact, when evaluating our NN annotations against those from the original ML data (column 4), we find ρ to be 0.38, well below the average human correlation of 0.49 (final column) reported by Mitchell and Lapata and also surpassed by pointwise multiplication (Mitchell and Lapata, 2010). This suggests that the original NN portion, more so than the others, favored a notion of similarity more related to association than paraphrase.

Paraphrase Models
We now present parametric paraphrase models and discuss training. Our goal is to embed phrases into a low-dimensional space such that cosine similarity in the space corresponds to the strength of the paraphrase relationship between phrases. We use a recursive neural network (RNN) similar to that used by Socher et al. (2014). We first use a constituent parser to obtain a binarized parse of a phrase. For phrase $p$, we compute its vector $g(p)$ through recursive computation on the parse. That is, if phrase $p$ is the yield of a parent node in a parse tree, and phrases $c_1$ and $c_2$ are the yields of its two child nodes, we define $g(p)$ recursively as follows:

$$g(p) = f(W[g(c_1); g(c_2)] + b)$$

where $f$ is an element-wise activation function (tanh), $[g(c_1); g(c_2)] \in \mathbb{R}^{2n}$ is the concatenation of the child vectors, $W \in \mathbb{R}^{n \times 2n}$ is the composition matrix, $b \in \mathbb{R}^{n}$ is the offset, and $n$ is the dimensionality of the word embeddings. If node $p$ has no children (i.e., it is a single token), we define $g(p) = W_w^{(p)}$, where $W_w$ is the word embedding matrix in which particular word vectors are indexed using superscripts. The trainable parameters of the model are $W$, $b$, and $W_w$.
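As an illustration of the recursive computation described above, the following NumPy sketch composes a phrase vector from a binarized tree. It rests on stated assumptions: a tree is represented as either a word index (leaf) or a pair of subtrees, and the word embedding matrix stores one word vector per column. The name `compose` is hypothetical.

```python
import numpy as np


def compose(tree, W_w, W, b):
    """Return g(p) for a binarized tree.

    tree: an int word index (leaf) or a (left, right) tuple of subtrees.
    W_w:  (n, |V|) word embedding matrix, one column per word (assumed layout).
    W:    (n, 2n) composition matrix; b: (n,) offset.
    """
    if isinstance(tree, (int, np.integer)):
        return W_w[:, tree]                    # leaf: look up the word vector
    left, right = tree
    children = np.concatenate([compose(left, W_w, W, b),
                               compose(right, W_w, W, b)])
    return np.tanh(W @ children + b)           # f(W [g(c_1); g(c_2)] + b)
```

For example, `compose(((0, 1), 2), W_w, W, b)` first composes words 0 and 1 and then composes the result with word 2.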

Objective Functions
We now present objective functions for training on pairs extracted from PPDB. The training data consists of (possibly noisy) pairs taken directly from the original PPDB. In subsequent sections, we discuss how we extract training pairs for particular tasks.
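We assume our training data consists of a set $X$ of phrase pairs $\langle x_1, x_2 \rangle$, where $x_1$ and $x_2$ are assumed to be paraphrases. To learn the model parameters $(W, b, W_w)$, we minimize our objective function over the data using AdaGrad (Duchi et al., 2011) with mini-batches. The objective function follows:

$$\min_{W, b, W_w} \frac{1}{|X|} \Bigg( \sum_{\langle x_1, x_2 \rangle \in X} \max(0, \delta - g(x_1) \cdot g(x_2) + g(x_1) \cdot g(t_1)) + \max(0, \delta - g(x_1) \cdot g(x_2) + g(x_2) \cdot g(t_2)) \Bigg) + \lambda_W (\|W\|^2 + \|b\|^2) + \lambda_{W_w} \|W_{w_{initial}} - W_w\|^2 \tag{1}$$

where $\lambda_W$ and $\lambda_{W_w}$ are regularization parameters, $W_{w_{initial}}$ is the initial word embedding matrix, $\delta$ is the margin (set to 1 in all of our experiments), and $t_1$ and $t_2$ are carefully-selected negative examples taken from a mini-batch during optimization.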
The intuition for this objective is that we want the two phrases to be more similar to each other ($g(x_1) \cdot g(x_2)$) than either is to their respective negative examples $t_1$ and $t_2$, by a margin of at least $\delta$.
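The margin-based intuition above can be made concrete with a short NumPy sketch. It is illustrative only: the per-pair term follows the description of the objective (positive pair similarity versus each negative example, with margin δ), and the regularizer is assumed to use squared norms. The function names are hypothetical.

```python
import numpy as np


def hinge_pair_loss(g_x1, g_x2, g_t1, g_t2, delta=1.0):
    """Margin loss for one pair <x1, x2> with negative examples t1 and t2."""
    pos = float(g_x1 @ g_x2)
    loss_1 = max(0.0, delta - pos + float(g_x1 @ g_t1))
    loss_2 = max(0.0, delta - pos + float(g_x2 @ g_t2))
    return loss_1 + loss_2


def regularization(W, b, W_w, W_w_initial, lam_W, lam_Ww):
    """Penalize large composition parameters and drift of the word
    embeddings away from their initialization (squared norms assumed)."""
    comp = lam_W * (np.sum(W ** 2) + np.sum(b ** 2))
    drift = lam_Ww * np.sum((W_w_initial - W_w) ** 2)
    return comp + drift
```

A pair contributes zero loss only when its dot product exceeds the dot product with each negative example by at least the margin.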

Selecting Negative Examples
To select $t_1$ and $t_2$ in Eq. 1, we simply chose the most similar phrase in the mini-batch (other than those in the given phrase pair). E.g., for choosing $t_1$ for a given $\langle x_1, x_2 \rangle$:

$$t_1 = \operatorname*{argmax}_{t : \langle t, \cdot \rangle \in X_b \setminus \{\langle x_1, x_2 \rangle\}} g(x_1) \cdot g(t)$$

where $X_b \subseteq X$ is the current mini-batch. That is, we want to choose a negative example $t_i$ that is similar to $x_i$ according to the current model parameters.
The downside of this approach is that we may occasionally choose a phrase $t_i$ that is actually a true paraphrase of $x_i$. We also tried a strategy in which we selected the least similar phrase that would trigger an update (i.e., $g(t_i) \cdot g(x_i) > g(x_1) \cdot g(x_2) - \delta$), but we found the simpler strategy above to work better and used it for all experiments reported below.
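A minimal sketch of the simpler strategy, choosing the most similar phrase in the mini-batch, is given below. It assumes that the phrase vectors for a batch have already been computed and stacked row-wise, that similarity is measured by dot product under the current parameters, and that candidates are drawn from both sides of every other pair in the batch. The function name and array layout are hypothetical.

```python
import numpy as np


def select_negatives(G1, G2):
    """Pick negative examples for each pair in a mini-batch.

    G1, G2: (B, n) float arrays; row i holds g(x_1) and g(x_2) of pair i.
    Returns (t1_idx, t2_idx): indices into np.vstack([G1, G2]).
    """
    B = G1.shape[0]
    candidates = np.vstack([G1, G2])       # all 2B phrases in the batch
    own = np.arange(B)

    def pick(G):
        sims = G @ candidates.T            # (B, 2B) dot-product similarities
        sims[own, own] = -np.inf           # exclude x_1 of the same pair
        sims[own, own + B] = -np.inf       # exclude x_2 of the same pair
        return sims.argmax(axis=1)

    return pick(G1), pick(G2)
```

Because the selection depends on the current parameters, it would be recomputed for every mini-batch during optimization.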

Discussion
The objective in Eq. 1 is similar to one used by Socher et al. (2014), but with several differences. Their objective compared text and projected images. They also did not update the underlying word embeddings; we do so here, and in a way such that they are penalized for deviating from their initialization. Also, for a given $\langle x_1, x_2 \rangle$, they do not select a single $t_1$ and $t_2$ as we do, but use the entire training set, which can be very expensive with a large training dataset. We also experimented with a simpler objective that sought to directly minimize the squared L2 norm between $g(x_1)$ and $g(x_2)$ in each pair, along with the same regularization terms as in Eq. 1. One problem with this objective function is that the global minimum is 0 and is achieved simply by driving the parameters to 0. We obtained much better results using the objective in Eq. 1.

Training Word Paraphrase Models
To train just word vectors on word paraphrase pairs (again from PPDB), we used the same objective function as above, but simply dropped the composition terms. This gave us an objective that bears some similarity to the skip-gram objective with negative sampling in word2vec (Mikolov et al., 2013a). Both seek to maximize the dot products of certain word pairs while minimizing the dot products of others. This objective function is:

$$\min_{W_w} \frac{1}{|X|} \Bigg( \sum_{\langle x_1, x_2 \rangle \in X} \max(0, \delta - W_w^{(x_1)} \cdot W_w^{(x_2)} + W_w^{(x_1)} \cdot W_w^{(t_1)}) + \max(0, \delta - W_w^{(x_1)} \cdot W_w^{(x_2)} + W_w^{(x_2)} \cdot W_w^{(t_2)}) \Bigg) + \lambda_{W_w} \|W_{w_{initial}} - W_w\|^2 \tag{2}$$

It is like Eq. 1 except with word vectors replacing the RNN composition function and with the regularization terms on $W$ and $b$ removed. We further found we could improve this model by incorporating constraints. From our training pairs, for a given word $w$, we assembled all other words that were paired with it in PPDB and all of their lemmas. These were then used as constraints during the pairing process: a word $t$ could only be paired with $w$ if it was not in its list of assembled words.
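The constraint on negative examples described above can be sketched as follows. This is an illustration under assumptions: `vectors` maps each word to its current NumPy vector, `blocked` maps a word to the set of words (and lemmas) assembled for it from PPDB, and similarity is the dot product. All names are hypothetical.

```python
import numpy as np


def pick_negative_word(w, batch_words, vectors, blocked):
    """Most similar word in the batch that is allowed as a negative for w."""
    forbidden = blocked.get(w, set())
    candidates = [t for t in batch_words if t != w and t not in forbidden]
    if not candidates:
        return None                        # no admissible negative in this batch
    return max(candidates, key=lambda t: float(vectors[w] @ vectors[t]))
```

Without the constraint, a near-synonym that happens to be in the batch would often be the most similar word and would be pushed away from w; the blocked list prevents this for words known to be paired with w.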

Experiments -Word Paraphrasing
We first present experiments on learning lexical paraphrasability.
We train on word pairs from PPDB and evaluate on the SimLex-999 dataset (Hill et al., 2014b), achieving the best results reported to date.

Training Procedure
To learn word vectors that reflect paraphrasability, we optimized Eq. 2. There are many tunable hyperparameters with this objective, so to make training tractable we fixed the initial learning rates for the word embeddings to 0.5 and the margin $\delta$ to 1. Then we did a coarse grid search over a parameter space for $\lambda_{W_w}$ and the mini-batch size. We considered $\lambda_{W_w}$ values in $\{10^{-2}, 10^{-3}, \ldots, 10^{-7}, 0\}$ and mini-batch sizes in {100, 250, 500, 1000}. We trained for 20 epochs for each set of hyperparameters using AdaGrad (Duchi et al., 2011).
For all experiments, we initialized our word vectors with skip-gram vectors trained using word2vec (Mikolov et al., 2013a). The vectors were trained on English Wikipedia (tokenized and lowercased, yielding 1.8B tokens). We used a window size of 5 and a minimum count cut-off of 60, producing vectors for approximately 270K word types. We retained vectors for only the 100K most frequent words, averaging the rest to obtain a single vector for unknown words. We will refer to this set of the 100K most frequent words as our vocabulary.
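The vocabulary truncation described above can be sketched as follows, assuming the pretrained vectors are available as a dictionary of NumPy arrays and the words are given in descending frequency order. The function names are hypothetical.

```python
import numpy as np


def build_vocab(words_by_freq, vectors, keep):
    """Keep the `keep` most frequent words; average the rest into one vector."""
    kept, rest = words_by_freq[:keep], words_by_freq[keep:]
    table = {w: vectors[w] for w in kept}
    dim = len(next(iter(vectors.values())))
    # Single vector shared by every out-of-vocabulary word.
    unk = np.mean([vectors[w] for w in rest], axis=0) if rest else np.zeros(dim)
    return table, unk


def lookup(word, table, unk):
    """Return the vector for `word`, falling back to the unknown-word vector."""
    return table.get(word, unk)
```

In our setting, `keep` would be 100,000 and the remaining word types (roughly 170K) would be averaged.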

Extracting Training Data
For training, we extracted word pairs from the lexical XL section of PPDB. We used the XL data for all experiments, including those for phrases. We used XL instead of XXL because XL has better quality overall while still being large enough so that we could be selective in choosing training pairs. There are a total of 548,085 pairs. We removed 174,766 that either contained numerical digits or words not in our vocabulary. We then removed 260,425 redundant pairs, leaving us with a final training set of 112,894 word pairs.
Table 3: Results on the SimLex-999 (SL999) word similarity task obtained by performing hyperparameter tuning based on 2×WS-S − WS-R and treating SL999 as a held-out test set. n is word vector dimensionality. A * indicates statistical significance (p < 0.05) over the 1000-dimensional skip-gram vectors.

Tuning and Evaluation
Hyperparameters were tuned using the wordsim-353 (WS353) dataset (Finkelstein et al., 2001), specifically its similarity (WS-S) and relatedness (WS-R) partitions (Agirre et al., 2009). In particular, we tuned to maximize 2×WS-S correlation minus the WS-R correlation. The idea was to reward vectors with high similarity and relatively low relatedness, in order to target the paraphrase relationship. After tuning, we evaluated the best hyperparameters on the SimLex-999 (SL999) dataset (Hill et al., 2014b). We chose SL999 as our primary test set as it most closely evaluates the paraphrase relationship. Even though WS-S is a close approximation to this relationship, it does not include pairs that are merely associated and assigned low scores, which SL999 does (see discussion in Hill et al., 2014b).
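The tuning criterion above can be written out explicitly. The sketch below computes Spearman's ρ with average ranks for ties and combines the two partition correlations as 2×WS-S − WS-R; it is a plain-Python illustration with hypothetical function names, not the evaluation script used for the reported numbers.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def tuning_score(rho_ws_s, rho_ws_r):
    """Reward similarity (WS-S) and penalize relatedness (WS-R)."""
    return 2 * rho_ws_s - rho_ws_r
```

Here `x` would hold the model's cosine similarities for the word pairs in a partition and `y` the corresponding human ratings.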
Note that for all experiments we used cosine similarity as our similarity metric and evaluated the statistical significance of dependent correlations using the one-tailed method of Steiger (1980). Table 3 shows results on SL999 when improving the initial word vectors by training on word pairs from PPDB, both with and without constraints. The "PARAGRAM_WS" rows show results when tuning to maximize 2×WS-S − WS-R. We also show results for strong skip-gram baselines and the best results from the literature, including the state-of-the-art results from Hill et al. (2014a) as well as the inter-annotator agreement from Hill et al. (2014b). The table illustrates that, by training on PPDB, we can surpass the previous best correlations on SL999 by 4-6% absolute, achieving the best results reported to date. We also find that we can train low-dimensional word vectors that exceed the performance of much larger vectors. This is very useful, as using large vectors can increase both time and memory consumption in NLP applications.

Results
To generate word vectors to use for downstream applications, we chose hyperparameters so as to maximize performance on SL999. These word vectors, which we refer to as PARAGRAM vectors, had a ρ of 0.57 on SL999. We use them as initial word vectors for the remainder of the paper.

Sentiment Analysis
As an extrinsic evaluation of our PARAGRAM word vectors, we used them in a convolutional neural network (CNN) for sentiment analysis. We used the simple CNN from Kim (2014) and the binary sentence-level sentiment analysis task from Socher et al. (2013). We used the standard data splits, removing examples with a neutral rating. We trained on all constituents in the training set while only using full sentences from development and test, giving us train/development/test sizes of 67,349/872/1,821.
The CNN uses m-gram filters, each of which is an m × n vector. The CNN computes the inner product between an m-gram filter and each m-gram in an example, retaining the maximum match (so-called "max-pooling"). The score of the match is a single dimension in a feature vector for the example, which is then associated with a weight in a linear classifier used to predict positive or negative sentiment.
While Kim (2014) used m-gram filters of several lengths, we only used unigram filters. We also fixed the word vectors during learning (called "static" by Kim). After learning, the unigram filters correspond to locations in the fixed word vector space. The learned classifier weights represent how strongly each location corresponds to positive or negative sentiment. We expect this static CNN to be more effective if the word vector space separates positive and negative sentiment.
In our experiments, we compared baseline skip-gram embeddings to our PARAGRAM vectors. We used an AdaGrad learning rate of 0.1, mini-batches of size 10, and a dropout rate of 0.5. We used 200 unigram filters and rectified linear units as the activation (applied to the filter output + filter bias). We trained for 30 epochs, predicting labels on the development set after each set of 3,000 examples. We recorded the highest development accuracy and used those parameters to predict labels on the test set.
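The forward pass of the unigram-filter CNN described above can be sketched in NumPy. This is a simplified illustration under assumptions: dropout and training are omitted, the classifier is taken to be a logistic layer over the max-pooled features, and all names and shapes are hypothetical.

```python
import numpy as np


def cnn_features(S, F, bias):
    """Max-pooled unigram filter responses for one sentence.

    S:    (T, n) matrix of fixed word vectors, one row per token.
    F:    (k, n) unigram filters; bias: (k,) filter biases.
    """
    activations = np.maximum(0.0, S @ F.T + bias)   # ReLU(filter output + bias)
    return activations.max(axis=0)                  # max-pooling over tokens


def predict_positive_prob(S, F, bias, w, c):
    """Probability of positive sentiment from a linear classifier (assumed logistic)."""
    z = cnn_features(S, F, bias) @ w + c
    return 1.0 / (1.0 + np.exp(-z))
```

With unigram filters, each feature records how closely the best-matching word in the sentence aligns with one filter direction, which is why the geometry of the fixed word vector space matters for this model.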
Results are shown in Table 4. We see improvements over the baselines when using PARAGRAM vectors, even exceeding the performance of higher-dimensional skip-gram vectors.

Experiments -Compositional Paraphrasing
In this section, we describe experiments on a variety of compositional phrase-based paraphrasing tasks. We start with the simplest case of bigrams, and then proceed to short phrases. For all tasks, we again train on appropriate data from PPDB and test on various evaluation datasets, including our two novel datasets (Annotated-PPDB and ML-Paraphrase).

Training Procedure
We trained our models by optimizing Eq. 1 using AdaGrad (Duchi et al., 2011). We fixed the initial learning rates to 0.5 for the word embeddings and 0.05 for the composition parameters, and the margin to 1. Then we did a coarse grid search over a parameter space for $\lambda_{W_w}$, $\lambda_W$, and mini-batch size. For $\lambda_{W_w}$, our search space again consisted of $\{10^{-2}, 10^{-3}, \ldots, 10^{-7}, 0\}$, for $\lambda_W$ it was $\{10^{-1}, 10^{-2}, 10^{-3}, 0\}$, and we explored batch sizes of {100, 250, 500, 1000, 2000}. When initializing with PARAGRAM vectors, the search space for $\lambda_{W_w}$ was shifted upwards to be $\{10, 1, 10^{-1}, 10^{-3}, \ldots, 10^{-6}\}$ to reflect our increased confidence in the initial vectors. We trained for only 5 epochs for each set of parameters. For baselines, we used the same initial skip-gram vectors as in Section 5.

Evaluation and Baselines
For all experiments, we again used cosine similarity as our similarity metric and evaluated the statistical significance using the method of Steiger (1980).
A baseline used in all compositional experiments is vector addition of skip-gram (or PARAGRAM) word vectors. Unlike explicit word vectors, where point-wise multiplication acts as a conjunction of features and performs well on composition tasks (Mitchell and Lapata, 2008), using addition with skip-gram vectors (Mikolov et al., 2013b) gives better performance than multiplication.
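The additive baseline and the cosine similarity used for scoring can be sketched as follows. The sketch assumes whitespace tokenization, a dictionary of NumPy word vectors, and a shared vector for unknown words; the function names are hypothetical.

```python
import numpy as np


def embed_additive(phrase, vectors, unk):
    """Phrase embedding as the sum of its word vectors."""
    return np.sum([vectors.get(tok, unk) for tok in phrase.split()], axis=0)


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

A phrase pair is then scored with `cosine(embed_additive(p1, vectors, unk), embed_additive(p2, vectors, unk))`, and these scores are correlated with the human ratings.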

Bigram Paraphrasability
To evaluate our ability to paraphrase bigrams, we consider the original bigram similarity task from Mitchell and Lapata (2010) as well as our newly annotated version of it: ML-Paraphrase.
Extracting Training Data: Training data for these tasks was extracted from the XL portion of PPDB. The bigram similarity task from Mitchell and Lapata (2010) contains three types of bigrams: adjective-noun (JN), noun-noun (NN), and verb-noun (VN). We aimed to collect pairs from PPDB that mirrored these three types of bigrams.
We found parsing to be unreliable on such short segments of text, so we used a POS tagger (Manning et al., 2014) to tag the tokens in each phrase. We then used the word alignments in PPDB to extract bigrams for training. For JN and NN, we extracted pairs containing aligned, adjacent tokens in the two phrases with the appropriate part-of-speech tag. Thus we extracted pairs like {easy job, simple task} for the JN section and {town meeting, town council} for the NN section. We used a different strategy for extracting training data for the VN subset: we took aligned VN tokens and took the closest noun after the verb. This was done to approximate the direct object that would have been ideally extracted with a dependency parse. An example from this section is {achieve goal, achieve aim}.
Table 5: Results on the test section of the bigram similarity task of Mitchell and Lapata (2010) and our newly annotated version (ML-Paraphrase). (n) shows the word vector dimensionality and ("comp.") shows the composition function used: "+" is vector addition and "RNN" is the recursive neural network. The * indicates statistically significant (p < 0.05) over the skip-gram model, † statistically significant over the {PARAGRAM, +} model, and ‡ statistically significant over Hashimoto et al. (2014).
We removed phrase pairs that (1) contained words not in our vocabulary, (2) were redundant with others, (3) contained brackets, or (4) had Levenshtein distance ≤ 1. The final criterion helps to ensure that we train on phrase pairs with non-trivial differences. The final training data consisted of 133,997 JN pairs, 62,640 VN pairs and 35,601 NN pairs.
Baselines: In addition to RNN models, we report baselines that use vector addition as the composition function, both with our skip-gram embeddings and PARAGRAM embeddings from Section 5.
We also compare to several results from prior work. When doing so, we took their best correlations for each data subset. That is, the JN and NN results from Mitchell and Lapata (2010) use their multiplicative model and the VN results use their dilation model. From Hashimoto et al. (2014) we used their PAS-CLBLM Add_l and PAS-CLBLM Add_nl models. We note that their vector dimensionalities are larger than ours, using n = 2000 and 50 respectively.

Results
Results are shown in Table 5. We report results on the test portion of the original Mitchell and Lapata (2010) dataset (ML) as well as the entirety of our newly-annotated dataset (ML-Paraphrase). RNN results on ML were tuned on the respective development sections and RNN results on ML-Paraphrase were tuned on the entire ML dataset.
Our RNN model outperforms results from the literature on most sections in both datasets and its average correlations are among the highest. The one subset of the data that posed difficulty was the NN section of the ML dataset. We suspect this is due to the reasons discussed in Section 3.2; for our ML-Paraphrase dataset, by contrast, we do see gains on the NN section.
We also outperform the strong baseline of adding 1000-dimensional skip-gram embeddings, a model with 40 times the number of parameters, on our ML-Paraphrase dataset. This baseline had correlations of 0.45, 0.43, and 0.47 on the JN, NN, and VN partitions, with an average of 0.45, below the average ρ of the RNN (0.52) and even the {PARAGRAM, +} model (0.46).
Interestingly, the type of vectors used to initialize the RNN has a significant effect on performance. If we initialize using the 25-dimensional skip-gram vectors, the average ρ on ML-Paraphrase drops to 0.43, below even the {PARAGRAM, +} model.

Phrase Paraphrasability
In this section we show that by training a model based on filtered phrase pairs in PPDB, we can actually distinguish between quality paraphrases and poor paraphrases in PPDB better than the original heuristic scoring scheme from Ganitkevitch et al. (2013).
Extracting Training Data As before, training data was extracted from the XL section of PPDB. Similar to the procedure to create our Annotated-PPDB dataset, phrases were filtered such that only those with a word overlap score of less than 0.5 were kept. We also removed redundant phrases and phrases that contained tokens not in our vocabulary. The phrases were then binned according to their effective size and 20,000 examples were selected from bins of effective sizes of 3, 4, and more than 5, creating a training set of 60,000 examples. Care was taken to ensure that none of our training pairs was also present in our development and test sets.
Baselines We compare our models with strong lexical baselines. The first, strict word overlap, is the percentage of words in the smaller phrase that are also in the larger phrase. We also include a version where the words are lemmatized prior to the calculation.
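The strict word overlap baseline can be sketched as follows. The function returns a fraction in [0, 1] rather than a percentage, and the optional `lemmatize` argument is a hypothetical hook for the lemmatized variant; any callable mapping a token to its lemma could be supplied.

```python
def strict_overlap(p1: str, p2: str, lemmatize=None) -> float:
    """Fraction of tokens in the shorter phrase that also occur in the longer one."""
    t1, t2 = p1.split(), p2.split()
    if lemmatize is not None:
        t1 = [lemmatize(tok) for tok in t1]
        t2 = [lemmatize(tok) for tok in t2]
    small, large = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    if not small:
        return 0.0
    large_set = set(large)
    return sum(tok in large_set for tok in small) / len(small)
```

For instance, `strict_overlap("does not exceed", "is no more than")` is 0.0 because the two phrases share no token, even though they are paraphrases; this is the kind of pair on which a purely lexical baseline fails.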
We also train a support vector regression model (epsilon-SVR) (Chang and Lin, 2011) on the 33 features that are included for each phrase pair in PPDB. We scaled the features such that each lies in the interval [−1, 1] and tuned the parameters using 5-fold cross validation on our dev set. 13 We then trained on the entire dev set after finding the best performing C and ε combination and evaluated on the test set of Annotated-PPDB.
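The preprocessing and tuning setup for the SVR baseline can be sketched in plain Python. The sketch covers only the feature scaling to [−1, 1], the grid of powers of two for C and ε, and a 5-fold split; the regression fit itself is left out, and the interleaved fold assignment is an assumption, since the paper does not say how folds were drawn. All function names are hypothetical.

```python
def fit_minmax(rows):
    """Per-feature (min, max) bounds computed from the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def scale(row, bounds):
    """Linearly map each feature to [-1, 1]; constant features map to 0."""
    scaled = []
    for x, (lo, hi) in zip(row, bounds):
        scaled.append(0.0 if hi == lo else 2.0 * (x - lo) / (hi - lo) - 1.0)
    return scaled


# Candidate values 2^-10, 2^-9, ..., 2^10 for both C and epsilon.
GRID = [2.0 ** k for k in range(-10, 11)]
PARAM_GRID = [(c, eps) for c in GRID for eps in GRID]


def k_folds(n, k=5):
    """Yield (train_indices, validation_indices) for k interleaved folds."""
    indices = list(range(n))
    for fold in range(k):
        val = indices[fold::k]
        val_set = set(val)
        train = [i for i in indices if i not in val_set]
        yield train, val
```

Each of the 441 (C, ε) combinations would be scored by its mean validation performance across the five folds, after which the best combination is refit on the full development set.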

Results
We evaluated on our Annotated-PPDB dataset described in §3.1. Table 6 shows the Spearman correlations on the 1000-example test set. RNN models were tuned on the development set of 260 examples. All other methods had no hyperparameters and therefore required no tuning. We note that the confidence estimates from Ganitkevitch et al. (2013) reach a ρ of 0.25 on the test set, similar to the results of strict overlap. While 25-dimensional skip-gram embeddings only reach 0.20, we can improve this to 0.32 by fine-tuning them using PPDB (thereby obtaining our PARAGRAM vectors). By using the PARAGRAM vectors to initialize the RNN, we reach a correlation of 0.40, which is better than the PPDB confidence estimates by 15% absolute.
13 We tuned both parameters over {2^−10, 2^−9, ..., 2^10}.
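For reference, the Spearman correlation reported here is the Pearson correlation of the rank-transformed scores. The following is a minimal stdlib sketch that assigns average ranks to ties; it is illustrative and not the evaluation script used for the paper.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(gold, predicted):
    """Spearman's rho between gold ratings and model scores."""
    return pearson(average_ranks(gold), average_ranks(predicted))
```

Because only ranks matter, any monotone rescaling of a model's cosine similarities leaves its ρ unchanged.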
We again consider addition of 1000-dimensional skip-gram embeddings as a baseline, and they continue to perform strongly (ρ = 0.37). The RNN initialized with PARAGRAM vectors does reach a higher ρ (0.40), but the difference is not statistically significant (p = 0.16). Thus we can achieve similarly strong results with far fewer parameters.
This task also illustrates the importance of initializing our RNN model with appropriate word embeddings. An RNN initialized with skip-gram vectors has a modest ρ of 0.22, well below the ρ of the RNN initialized with PARAGRAM vectors. Clearly, initialization is important when optimizing non-convex objectives like ours, but it is noteworthy that our best results came from first improving the word vectors and then learning the composition model, rather than jointly learning both from scratch.
We performed a qualitative analysis to uncover sources of error and determine differences between adding PARAGRAM vectors and using an RNN initialized with them. To do so, we took the output of both systems on Annotated-PPDB and mapped their cosine similarities to the interval [1, 5]. We then computed their absolute error as compared to the gold ratings. Table 7 shows how the average of these absolute errors changes with the magnitude of the gold ratings. The RNN performs better (has lower average absolute error) for less similar pairs. Vector addition only does better on the most similar pairs. This is presumably because the most positive pairs have high word overlap and so can be represented effectively with a simpler model.
does not exceed | is no more than | 0.75 | 0.0 | 5.0 | 4.8 | 3.5
Table 8: Illustrative phrase pairs from Annotated-PPDB with gold similarity > 4. The last three columns show the gold similarity score, the similarity score of the RNN model, and the similarity score of vector addition. We note that addition performs better when the pairs have high length ratio (rows 1-2) or overlap ratio (rows 3-4) while the RNN does better when those values are low (rows 5-6 and 7-8 respectively). Boldface indicates smaller error compared to gold scores.
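The error analysis described above, in which cosine similarities are rescaled to [1, 5] and absolute errors are averaged by gold rating, can be sketched as follows. The text does not state how the mapping was done; the sketch assumes a linear map from the cosine range [−1, 1], and the unit-width bins over the gold ratings are likewise an assumption.

```python
import bisect


def to_rating_scale(cos_sim, lo=-1.0, hi=1.0):
    """Linearly map a similarity in [lo, hi] onto the 1-5 rating scale (assumed mapping)."""
    return 1.0 + 4.0 * (cos_sim - lo) / (hi - lo)


def mean_abs_error_by_gold(gold, cos_sims, edges=(1, 2, 3, 4, 5)):
    """Average absolute error per gold-rating bin; None for empty bins.

    Bins are [edges[i], edges[i+1]), with the top edge included in the last bin.
    """
    n_bins = len(edges) - 1
    totals = [0.0] * n_bins
    counts = [0] * n_bins
    for g, c in zip(gold, cos_sims):
        error = abs(to_rating_scale(c) - g)
        b = bisect.bisect_right(edges, g) - 1
        b = max(0, min(b, n_bins - 1))  # clamp ratings at or beyond the outer edges
        totals[b] += error
        counts[b] += 1
    return [t / n if n else None for t, n in zip(totals, counts)]
```

Running this once per system and comparing the per-bin averages gives the kind of breakdown the text attributes to Table 7.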

Qualitative Analysis
To further investigate the differences between these models, we removed those pairs with gold scores in [2,4], in order to focus on pairs with extreme scores. We identified two factors that distinguished the performance between the two models: length ratio and the amount of lexical overlap. We did not find evidence that non-compositional phrases, such as idioms, were a source of error as these were not found in ML-Paraphrase and only appear rarely in Annotated-PPDB.
We define length ratio as simply the number of tokens in the smaller phrase divided by the number of tokens in the larger phrase. Overlap ratio is the number of equivalent tokens in the phrase pair divided by the number of tokens in the smaller of the two phrases. Equivalent tokens are defined as tokens that are either exact matches or are paired up in the lexical portion of PPDB used to train the PARAGRAM vectors. Table 9 shows how the performance of the models changes under different values of length ratio and overlap ratio. 14 The values in this table are the percentage changes in absolute error when using the RNN over the PARAGRAM vector addition model. So negative values indicate superior performance by the RNN.
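The two ratios and the percentage change defined above can be sketched as follows. Counting "equivalent tokens" as the tokens of the shorter phrase that have an exact or PPDB-paired match in the longer phrase is one reading of the definition, and `lexical_pairs` is a hypothetical set of word pairs standing in for the lexical portion of PPDB.

```python
def length_ratio(p1: str, p2: str) -> float:
    """Token count of the shorter phrase divided by that of the longer phrase."""
    n1, n2 = len(p1.split()), len(p2.split())
    if max(n1, n2) == 0:
        return 0.0
    return min(n1, n2) / max(n1, n2)


def overlap_ratio(p1: str, p2: str, lexical_pairs=frozenset()) -> float:
    """Share of tokens in the shorter phrase with an equivalent in the longer one."""
    t1, t2 = p1.split(), p2.split()
    small, large = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    if not small:
        return 0.0

    def equivalent(a, b):
        # Exact match, or listed as a lexical paraphrase in either order.
        return a == b or (a, b) in lexical_pairs or (b, a) in lexical_pairs

    matched = sum(any(equivalent(s, l) for l in large) for s in small)
    return matched / len(small)


def percent_change(rnn_error: float, addition_error: float) -> float:
    """Percentage change in absolute error when using the RNN instead of addition.

    Negative values mean the RNN has the lower error.
    """
    return 100.0 * (rnn_error - addition_error) / addition_error
```

For the pair "does not exceed" and "is no more than", these functions give a length ratio of 0.75 and, with no lexical pairs supplied, an overlap ratio of 0.0.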
A few trends emerge from this table. One is that as the length ratio increases (i.e., the phrase pairs are closer in length), addition surpasses the RNN for positive examples. For negative examples, the trend is reversed. The same trend appears for overlap ratio. [...] Table 8. When considering both positive and negative examples ("Both"), we see that the RNN excels on the most difficult examples (large differences in phrase length and less lexical overlap). For easier examples, the two fare similarly overall (−2.0% to 0.0% change), but the RNN does much better on negative examples. This aligns with the intuition that addition should perform well when two paraphrastic phrases have high lexical overlap and similar length. But when they are not paraphrases, simple addition is misled and the RNN's learned composition function better captures the relationship. This may suggest new architectures for modeling compositionality differently depending on differences in length and amount of overlap.
14 The bin delimiters were chosen to be uniform over the range of output values of the length ratio ([0.4, 1] with one outlier data point removed) and overlap ratio ([0, 1]).

Conclusion
We have shown how to leverage PPDB to learn state-of-the-art word embeddings and compositional models for paraphrase tasks. Since PPDB was created automatically from parallel corpora, our models are also built automatically. Only small amounts of annotated data are used to tune hyperparameters.
We also introduced two new datasets to evaluate compositional models of short paraphrases 15 , filling a gap in the NLP community, as currently there are no datasets created for this purpose. Successful models on these datasets can then be used to extend the coverage of, or provide an alternative to, PPDB.
There remains a great deal of work to be done in developing new composition models, whether with new network architectures or distance functions.
In this work, we based our composition function on constituent parse trees, but this may not be the best approach, especially for short phrases. Dependency syntax may be a better alternative (Socher et al., 2014). Besides improving composition, another direction to explore is how to use models for short phrases in sentence-level paraphrase recognition and other downstream tasks.