Distributional Semantics Beyond Words: Supervised Learning of Analogy and Paraphrase

There have been several efforts to extend distributional semantics beyond individual words, to measure the similarity of word pairs, phrases, and sentences (briefly, tuples; ordered sets of words, contiguous or noncontiguous). One way to extend beyond words is to compare two tuples using a function that combines pairwise similarities between the component words in the tuples. A strength of this approach is that it works with both relational similarity (analogy) and compositional similarity (paraphrase). However, past work required hand-coding the combination function for different tasks. The main contribution of this paper is that combination functions are generated by supervised learning. We achieve state-of-the-art results in measuring relational similarity between word pairs (SAT analogies and SemEval 2012 Task 2) and measuring compositional similarity between noun-modifier phrases and unigrams (multiple-choice paraphrase questions).


Introduction
Firth (1957) hypothesized that words that appear in similar contexts tend to have similar meanings. This hypothesis is the foundation for distributional semantics, in which words are represented by context vectors. The similarity of two words is calculated by comparing the two corresponding context vectors (Lund et al., 1995; Landauer and Dumais, 1997; Turney and Pantel, 2010). Distributional semantics is highly effective for measuring the semantic similarity between individual words.
On a set of eighty multiple-choice synonym questions from the Test of English as a Foreign Language (TOEFL), a distributional approach recently achieved 100% accuracy (Bullinaria and Levy, 2012). However, it has been difficult to extend distributional semantics beyond individual words, to word pairs, phrases, and sentences.
Moving beyond individual words, there are various types of semantic similarity to consider. Here we focus on paraphrase and analogy. Paraphrase is similarity in the meaning of two pieces of text (Androutsopoulos and Malakasiotis, 2010). Analogy is similarity in the semantic relations of two sets of words (Turney, 2008a).
It is common to study paraphrase at the sentence level (Androutsopoulos and Malakasiotis, 2010), but we prefer to concentrate on the simplest type of paraphrase, where a bigram paraphrases a unigram. For example, dog house is a paraphrase of kennel. In our experiments, we concentrate on noun-modifier bigrams and noun unigrams.
Analogies map terms in one domain to terms in another domain (Gentner, 1983). The familiar analogy between the solar system and the Rutherford-Bohr atomic model involves several terms from the domain of the solar system and the domain of the atomic model (Turney, 2008a).
The simplest type of analogy is proportional analogy, which involves two pairs of words (Turney, 2006b). For example, the pair ⟨cook, raw⟩ is analogous to the pair ⟨decorate, plain⟩. If we cook a thing, it is no longer raw; if we decorate a thing, it is no longer plain. The semantic relations between cook and raw are similar to the semantic relations between decorate and plain. In the following experiments, we focus on proportional analogies.
Erk (2013) distinguished four approaches to extend distributional semantics beyond words: In the first, a single vector space representation for a phrase or sentence is computed from the representations of the individual words (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010). In the second, two phrases or sentences are compared by combining multiple pairwise similarity values (Socher et al., 2011; Turney, 2012). Third, weighted inference rules integrate distributional similarity and formal logic (Garrette et al., 2011). Fourth, a single space integrates formal logic and vectors (Clarke, 2012).
Taking the second approach, Turney (2012) introduced a dual-space model, with one space for measuring domain similarity (similarity of topic or field) and another for function similarity (similarity of role or usage). Similarities beyond individual words are calculated by functions that combine domain and function similarities of component words.
The dual-space model has been applied to measuring compositional similarity (paraphrase recognition) and relational similarity (analogy recognition). In experiments that tested for sensitivity to word order, the dual-space model performed significantly better than competing approaches.
A limitation of past work with the dual-space model is that the combination functions were hand-coded. Our main contribution is to show how hand-coding can be eliminated with supervised learning. For ease of reference, we will call our approach SuperSim (supervised similarity). With no modification of SuperSim for the specific task (relational similarity or compositional similarity), we achieve better results than previous hand-coded models.
Compositional similarity (paraphrase) compares two contiguous phrases or sentences (n-grams), whereas relational similarity (analogy) does not require contiguity. We use tuple to refer to both contiguous and noncontiguous word sequences.
We approach analogy as a problem of supervised tuple classification. To measure the relational similarity between two word pairs, we train SuperSim with quadruples that are labeled as positive and negative examples of analogies. For example, the proportional analogy ⟨cook, raw, decorate, plain⟩ is labeled as a positive example.
A quadruple is represented by a feature vector, composed of domain and function similarities from the dual-space model and other features based on corpus frequencies. SuperSim uses a support vector machine (Platt, 1998) to learn the probability that a quadruple ⟨a, b, c, d⟩ consists of a word pair ⟨a, b⟩ and an analogous word pair ⟨c, d⟩. The probability can be interpreted as the degree of relational similarity between the two given word pairs.
We also approach paraphrase as supervised tuple classification. To measure the compositional similarity between an m-gram and an n-gram, we train the learning algorithm with (m + n)-tuples that are positive and negative examples of paraphrases.
SuperSim learns to estimate the probability that a triple ⟨a, b, c⟩ consists of a compositional bigram ab and a synonymous unigram c. For instance, the phrase fish tank is synonymous with aquarium; that is, fish tank and aquarium have high compositional similarity. The triple ⟨fish, tank, aquarium⟩ is represented using the same features that we used for analogy. The probability of the triple can be interpreted as the degree of compositional similarity between the given bigram and unigram.
We review related work in Section 2. The general feature space for learning relations and compositions is presented in Section 3. The experiments with relational similarity are described in Section 4, and Section 5 reports the results with compositional similarity. Section 6 discusses the implications of the results. We consider future work in Section 7 and conclude in Section 8.

Related Work
In SemEval 2012, Task 2 was concerned with measuring the degree of relational similarity between two word pairs (Jurgens et al., 2012) and Task 6 (Agirre et al., 2012) examined the degree of semantic equivalence between two sentences. These two areas of research have been mostly independent, although Socher et al. (2012) and Turney (2012) present unified perspectives on the two tasks. We first discuss some work on relational similarity, then some work on compositional similarity, and lastly work that unifies the two types of similarity.
Rows in the matrix correspond to word pairs (a, b) and columns correspond to patterns that connect the pairs ("a for the b") in a large corpus. This is a holistic (noncompositional) approach to distributional similarity, since the word pairs are opaque wholes; the component words have no separate representations. A compositional approach to analogy has a representation for each word, and a word pair is represented by composing the representations for each member of the pair. Given a vocabulary of $N$ words, a compositional approach requires $N$ representations to handle all possible word pairs, but a holistic approach requires $N^2$ representations. Holistic approaches do not scale up. LRA required nine days to run.
Bollegala et al. (2008) answered the SAT analogy questions with a support vector machine trained on quadruples (proportional analogies), as we do here. However, their feature vectors are holistic, and hence there are scaling problems.
Herdagdelen and Baroni (2009) used a support vector machine to learn relational similarity. Their feature vectors contained a combination of holistic and compositional features.

Compositional Similarity
To extend distributional semantics beyond words, many researchers take the first approach described by Erk (2013), in which a single vector space is used for individual words, phrases, and sentences (Landauer and Dumais, 1997; Mitchell and Lapata, 2008; Mitchell and Lapata, 2010).
In this approach, given the words a and b with context vectors $\mathbf{a}$ and $\mathbf{b}$, we construct a vector for the bigram ab by applying vector operations to $\mathbf{a}$ and $\mathbf{b}$.
Mitchell and Lapata (2010) experiment with many different vector operations and find that element-wise multiplication performs well. The bigram ab is represented by $\mathbf{c} = \mathbf{a} \odot \mathbf{b}$, where $c_i = a_i \cdot b_i$. However, element-wise multiplication is commutative, so the bigrams ab and ba map to the same vector $\mathbf{c}$. In experiments that test for order sensitivity, element-wise multiplication performs poorly.
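The commutativity problem can be seen in a minimal sketch. The vectors below are made-up toy values, not real context vectors, and the function name `elementwise_product` is ours, chosen only for illustration.

```python
def elementwise_product(a, b):
    """Compose two context vectors by element-wise multiplication."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [a_i * b_i for a_i, b_i in zip(a, b)]


# Toy context vectors (hypothetical values, for illustration only).
dog = [0.5, 2.0, 0.0, 1.5]
house = [1.0, 0.5, 3.0, 2.0]

dog_house = elementwise_product(dog, house)
house_dog = elementwise_product(house, dog)

# Both word orders give the same composed vector, so a similarity measure
# built on it cannot tell "dog house" from "house dog".
print(dog_house == house_dog)  # True
```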
We can treat the bigram ab as a unit, as if it were a single word, and construct a context vector for ab from occurrences of ab in a large corpus. This holistic approach to representing bigrams performs well when a limited set of bigrams is specified in advance (before building the word-context matrix), but it does not scale up, because there are too many possible bigrams.
Although the holistic approach does not scale up, we can generate a few holistic bigram vectors and use them to train a supervised regression model (Guevara, 2010; Baroni and Zamparelli, 2010). Given a new bigram cd, not observed in the corpus, the regression model can predict a holistic vector for cd, if c and d have been observed separately. We show in Section 5 that this idea can be adapted to train SuperSim without manually labeled data.
Socher et al. (2011) take the second approach described by Erk (2013), in which two sentences are compared by combining multiple pairwise similarity values. They construct a variable-sized similarity matrix $X$, in which the element $x_{ij}$ is the similarity between the i-th phrase of one sentence and the j-th phrase of the other. Since supervised learning is simpler with fixed-sized feature vectors, the variable-sized similarity matrix is then reduced to a smaller fixed-sized matrix, to allow comparison of pairs of sentences of varying lengths.

Unified Perspectives on Similarity
Socher et al. (2012) represent words and phrases with a pair, consisting of a vector and a matrix. The vector captures the meaning of the word or phrase and the matrix captures how a word or phrase modifies the meaning of another word or phrase when they are combined. They apply this matrix-vector representation to both compositions and relations. Turney (2012) represents words with two vectors, a vector from domain space and a vector from function space. The domain vector captures the topic or field of the word and the function vector captures the functional role of the word. This dual-space model is applied to both compositions and relations.
Here we extend the dual-space model of Turney (2012) in two ways: Hand-coding is replaced with supervised learning and two new sets of features augment domain and function space. Moving to supervised learning instead of hand-coding makes it easier to introduce new features.
In the dual-space model, parameterized similarity measures provided the input values for hand-crafted functions. Each task required a different set of hand-crafted functions. The parameters of the similarity measures were tuned using a customized grid search algorithm. The grid search algorithm was not suitable for integration with a supervised learning algorithm. The insight behind SuperSim is that, given appropriate features, a supervised learning algorithm can replace the grid search algorithm and the hand-crafted functions.

Features for Tuple Classification
We represent a tuple with four types of features, all based on frequencies in a large corpus. The first type of feature is the logarithm of the frequency of a word. The second type is the positive pointwise mutual information (PPMI) between two words (Church and Hanks, 1989; Bullinaria and Levy, 2007). Third and fourth are the similarities of two words in domain and function space.
In the following experiments, we use the PPMI matrix from Turney et al. (2011) and the domain and function matrices from Turney (2012).[1] The three matrices and the word frequency data are based on the same corpus, a collection of web pages gathered from university web sites, containing $5 \times 10^{10}$ words.[2] All three matrices are word-context matrices, in which the rows correspond to terms (words and phrases) in WordNet. The columns correspond to the contexts in which the terms appear; each matrix involves a different kind of context.
Footnote 1: The three matrices and the word frequency data are available on request from the author. The matrix files range from two to five gigabytes when packaged and compressed for distribution.
Footnote 2: The corpus was collected by Charles Clarke at the University of Waterloo. It is about 280 gigabytes of plain text.
Let $x_1, x_2, \ldots, x_n$ be an n-tuple of words. The number of features we use to represent this tuple increases as a function of $n$.
The first set of features consists of log frequency values for each word $x_i$ in the n-tuple. Let freq($x_i$) be the frequency of $x_i$ in the corpus. We define LF($x_i$) as $\log(\mathrm{freq}(x_i) + 1)$. If $x_i$ is not in the corpus, freq($x_i$) is zero, and thus LF($x_i$) is also zero. There are $n$ log frequency features, one LF($x_i$) feature for each word in the n-tuple.
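A minimal sketch of the log frequency features follows. The frequency table is a hypothetical stand-in for the corpus counts, and the natural logarithm is an assumption, since the base of the logarithm is not stated.

```python
import math


def log_frequency_features(words, freq):
    """Return one LF feature per word: log(freq(x_i) + 1).

    `freq` maps a word to its corpus frequency; a word that is missing
    from the table has frequency zero, so its LF value is log(1) = 0.
    """
    return [math.log(freq.get(word, 0) + 1) for word in words]


# Hypothetical corpus counts, for illustration only.
toy_freq = {"fish": 120000, "tank": 45000}

features = log_frequency_features(["fish", "tank", "aquarium"], toy_freq)
print(len(features))  # 3, one feature per word in the triple
print(features[2])    # 0.0, because "aquarium" is not in the toy table
```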
The second set of features consists of positive pointwise mutual information values for each pair of words in the n-tuple. We use the raw PPMI matrix from Turney et al. (2011). Although they computed the singular value decomposition (SVD) to project the row vectors into a lower-dimensional space, we need the original high-dimensional columns for our features. The raw PPMI matrix has 114,501 rows and 139,246 columns with a density of 1.2%. For each term in WordNet, there is a corresponding row in the raw PPMI matrix. For each unigram in WordNet, there are two corresponding columns in the raw PPMI matrix, one marked left and the other right.
Suppose $x_i$ corresponds to the i-th row of the PPMI matrix and $x_j$ corresponds to the j-th column, marked left. The value in the i-th row and j-th column of the PPMI matrix, PPMI($x_i$, $x_j$, left), is the positive pointwise mutual information of $x_i$ and $x_j$ co-occurring in the corpus, where $x_j$ is the first word to the left of $x_i$, ignoring any intervening stop words (that is, ignoring any words that are not in WordNet). If $x_i$ (or $x_j$) has no corresponding row (or column) in the matrix, then the PPMI value is set to zero.
Turney et al. (2011) estimated PPMI($x_i$, $x_j$, left) by sampling the corpus for phrases containing $x_i$ and then looking for $x_j$ to the left of $x_i$ in the sampled phrases (and likewise for right). Due to this sampling process, PPMI($x_i$, $x_j$, left) does not necessarily equal PPMI($x_j$, $x_i$, right). For example, suppose $x_i$ is a rare word and $x_j$ is a common word. With PPMI($x_i$, $x_j$, left), when we sample phrases containing $x_i$, we are relatively likely to find $x_j$ in some of these phrases. With PPMI($x_j$, $x_i$, right), when we sample phrases containing $x_j$, we are less likely to find any phrases containing $x_i$. Although, in theory, PPMI($x_i$, $x_j$, left) should equal PPMI($x_j$, $x_i$, right), they are likely to be unequal given a limited sample.
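For readers unfamiliar with PPMI, the following sketch computes it from raw co-occurrence counts using the standard textbook definition. It does not reproduce the sampling procedure of Turney et al. (2011); the base-2 logarithm and the argument names are assumptions made for illustration.

```python
import math


def ppmi(count_xy, count_x, count_y, total):
    """Positive pointwise mutual information from raw counts.

    PMI = log2( p(x, y) / (p(x) * p(y)) ); PPMI clips negative values to 0.
    A pair that never co-occurs gets a PPMI of 0.
    """
    if count_xy == 0 or count_x == 0 or count_y == 0:
        return 0.0
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return max(0.0, math.log2(p_xy / (p_x * p_y)))


# Toy counts (hypothetical): x and y co-occur more often than chance.
print(ppmi(count_xy=10, count_x=20, count_y=20, total=100) > 0)  # True
# Co-occurrence below chance is clipped to zero.
print(ppmi(count_xy=1, count_x=50, count_y=50, total=100))  # 0.0
```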
From the n-tuple, we select all of the $n(n-1)$ pairs, $\langle x_i, x_j \rangle$, such that $i \neq j$. We then generate two features for each pair, PPMI($x_i$, $x_j$, left) and PPMI($x_i$, $x_j$, right). Thus there are $2n(n-1)$ PPMI values in the second set of features.
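The enumeration of PPMI features can be sketched as follows. The dictionary `ppmi_lookup` is a hypothetical stand-in for the raw PPMI matrix, keyed by (row term, column term, handedness); missing entries default to zero, as described above.

```python
from itertools import permutations


def ppmi_feature_keys(words):
    """List the (x_i, x_j, handedness) keys for all ordered pairs with i != j."""
    keys = []
    for x_i, x_j in permutations(words, 2):  # n(n - 1) ordered pairs
        keys.append((x_i, x_j, "left"))
        keys.append((x_i, x_j, "right"))
    return keys


def ppmi_features(words, ppmi_lookup):
    """Look up each PPMI feature; missing rows or columns give zero."""
    return [ppmi_lookup.get(key, 0.0) for key in ppmi_feature_keys(words)]


triple_keys = ppmi_feature_keys(["fish", "tank", "aquarium"])
quad_keys = ppmi_feature_keys(["cook", "raw", "decorate", "plain"])
print(len(triple_keys))  # 12 = 2 * 3 * 2
print(len(quad_keys))    # 24 = 2 * 4 * 3
```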
The third set of features consists of domain space similarity values for each pair of words in the n-tuple. Domain space was designed to capture the topic of a word. Turney (2012) first constructed a frequency matrix, in which the rows correspond to terms in WordNet and the columns correspond to nearby nouns. Given a term $x_i$, the corpus was sampled for phrases containing $x_i$ and the phrases were processed with a part-of-speech tagger, to identify nouns. If the noun $x_j$ was the closest noun to the left or right of $x_i$, then the frequency count for the i-th row and j-th column was incremented. The hypothesis was that the nouns near a term characterize the topics associated with the term.
The word-context frequency matrix for domain space has 114,297 rows (terms) and 50,000 columns (noun contexts, topics), with a density of 2.6%. The frequency matrix was converted to a PPMI matrix and then smoothed with SVD. The SVD yields three matrices, U, Σ, and V.
A term in domain space is represented by a row vector in $U_k \Sigma_k^p$. The parameter $k$ specifies the number of singular values in the truncated singular value decomposition; that is, $k$ is the number of latent factors in the low-dimensional representation of the term (Landauer and Dumais, 1997). We generate $U_k$ and $\Sigma_k$ by deleting the columns in $U$ and $\Sigma$ corresponding to the smallest singular values. The parameter $p$ raises the singular values in $\Sigma_k$ to the power $p$ (Caron, 2001). As $p$ goes from one to zero, factors with smaller singular values are given more weight. This has the effect of making the similarity measure more discriminating.
The similarity of two words in domain space, Dom($x_i$, $x_j$, $k$, $p$), is computed by extracting the row vectors in $U_k \Sigma_k^p$ that correspond to the words $x_i$ and $x_j$, and then calculating their cosine. Optimal performance requires tuning the parameters $k$ and $p$ for the task (Bullinaria and Levy, 2012). In the following experiments, we avoid directly tuning $k$ and $p$ by generating features with a variety of values for $k$ and $p$, allowing the supervised learning algorithm to decide which features to use.
From the n-tuple, we select all $\frac{1}{2}n(n-1)$ pairs, $\langle x_i, x_j \rangle$, such that $i < j$. For each pair, we generate domain similarity features, Dom($x_i$, $x_j$, $k$, $p$), where $k$ varies from 100 to 1000 in steps of 100 and $p$ varies from 0 to 1 in steps of 0.1. The number of $k$ values, $n_k$, is 10 and the number of $p$ values, $n_p$, is 11; therefore there are 110 features, $n_k n_p$, for each pair, $\langle x_i, x_j \rangle$. Thus there are $\frac{1}{2}n(n-1)n_k n_p$ domain space similarity values in the third set of features.
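The similarity computation above can be sketched in plain Python. The sketch assumes that the SVD has already been computed elsewhere, that each word's row of $U$ is available as a list, and that the singular values are sorted in decreasing order so that truncation keeps the first $k$ factors. All function and variable names are ours.

```python
import math

K_VALUES = tuple(range(100, 1001, 100))      # k = 100, 200, ..., 1000
P_VALUES = tuple(i / 10 for i in range(11))  # p = 0.0, 0.1, ..., 1.0


def weighted_row(u_row, singular_values, k, p):
    """Row of U_k * Sigma_k^p: keep the first k factors, scale by sigma^p."""
    return [u_row[f] * singular_values[f] ** p for f in range(k)]


def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm > 0 else 0.0


def latent_similarity(u_i, u_j, singular_values, k, p):
    """Cosine of two term rows in U_k * Sigma_k^p (Dom or Fun, given the space)."""
    return cosine(weighted_row(u_i, singular_values, k, p),
                  weighted_row(u_j, singular_values, k, p))


def similarity_features(u_i, u_j, singular_values,
                        k_values=K_VALUES, p_values=P_VALUES):
    """One feature per (k, p) combination, 110 with the default grids."""
    return [latent_similarity(u_i, u_j, singular_values, k, p)
            for k in k_values for p in p_values]


# Tiny toy example with three latent factors (hypothetical values).
sigma = [3.0, 2.0, 1.0]
row_a = [0.5, 0.1, -0.2]
row_b = [0.4, -0.3, 0.6]
toy = similarity_features(row_a, row_b, sigma,
                          k_values=(1, 2, 3), p_values=(0.0, 0.5, 1.0))
print(len(toy))  # 9 features for this reduced 3 x 3 grid
```

Using the default grids requires rows with at least 1,000 factors; the toy call passes smaller grids so that it runs on three-factor rows.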
The fourth set of features consists of function space similarity values for each pair of words in the n-tuple. Function space was designed to capture the functional role of a word. It is similar to domain space, except the context is based on verbal patterns, instead of nearby nouns. The hypothesis was that the functional role of a word is characterized by the patterns that relate the word to nearby verbs.
The word-context frequency matrix for function space has 114,101 rows (terms) and 50,000 columns (verb pattern contexts, functional roles), with a density of 1.2%. The frequency matrix was converted to a PPMI matrix and smoothed with SVD.
From the n-tuple, we select all $\frac{1}{2}n(n-1)$ pairs, $\langle x_i, x_j \rangle$, such that $i < j$. For each pair, we generate function similarity features, Fun($x_i$, $x_j$, $k$, $p$), where $k$ and $p$ vary as they did with domain space. Thus there are $\frac{1}{2}n(n-1)n_k n_p$ function space similarity values in the fourth set of features.
Table 1 summarizes the four sets of features and the size of each set as a function of $n$, the number of words in the given tuple. The values of $n_k$ and $n_p$ (10 and 11) are considered to be constants. Table 2 shows the number of elements in the feature vector, as $n$ varies from 1 to 6. The total number of features is $O(n^2)$. We believe that this is acceptable growth and will scale up to comparing sentence pairs.
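The feature vector sizes follow directly from the four formulas above. The short sketch below adds them up; the function names are ours, and the totals for triples and quadruples match the 675 and 1,348 elements reported in the experiments.

```python
def feature_counts(n, n_k=10, n_p=11):
    """Size of each feature set for an n-tuple of words."""
    unordered_pairs = n * (n - 1) // 2
    return {
        "LF": n,
        "PPMI": 2 * n * (n - 1),
        "Dom": unordered_pairs * n_k * n_p,
        "Fun": unordered_pairs * n_k * n_p,
    }


def total_features(n, n_k=10, n_p=11):
    return sum(feature_counts(n, n_k, n_p).values())


print(total_features(3))  # 675, the size used for triples
print(total_features(4))  # 1348, the size used for quadruples
```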
The four sets of features have a hierarchical relationship. The log frequency features are based on counting isolated occurrences of each word in the corpus. The PPMI features are based on direct co-occurrences of two words; that is, PPMI is only greater than zero if the two words actually occur together in the corpus. Domain and function space capture indirect or higher-order co-occurrence, due to the truncated SVD (Lemaire and Denhière, 2006); that is, the values of Dom($x_i$, $x_j$, $k$, $p$) and Fun($x_i$, $x_j$, $k$, $p$) can be high even when $x_i$ and $x_j$ do not actually co-occur in the corpus. We conjecture that there are yet higher orders in this hierarchy that would provide improved similarity measures.

Table 1: The four sets of features and their sizes.
Feature set                         Size of set
LF($x_i$)                           $n$
PPMI($x_i$, $x_j$, handedness)      $2n(n-1)$
Dom($x_i$, $x_j$, $k$, $p$)         $\frac{1}{2}n(n-1)n_k n_p$
Fun($x_i$, $x_j$, $k$, $p$)         $\frac{1}{2}n(n-1)n_k n_p$

SuperSim learns to classify tuples by representing them with these features. SuperSim uses the sequential minimal optimization (SMO) support vector machine (SVM) as implemented in Weka (Platt, 1998; Witten et al., 2011).[4] The kernel is a normalized third-order polynomial. Weka provides probability estimates for the classes by fitting the outputs of the SVM with logistic regression models.
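To give a sense of what a normalized third-order polynomial kernel computes, here is a sketch of one common definition, $K(x, y) / \sqrt{K(x, x) K(y, y)}$ with $K(x, y) = (x \cdot y)^3$. This is an assumption for illustration; it is not taken from the Weka source and may differ from Weka's implementation in details such as lower-order terms.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def normalized_poly_kernel(x, y, degree=3):
    """Polynomial kernel (x . y)^degree, normalized so that K(x, x) = 1."""
    k_xx = dot(x, x) ** degree
    k_yy = dot(y, y) ** degree
    if k_xx == 0 or k_yy == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot(x, y) ** degree / math.sqrt(k_xx * k_yy)


# Toy feature vectors (hypothetical values).
print(normalized_poly_kernel([1.0, 0.0], [0.0, 1.0]))  # 0.0 for orthogonal vectors
```

Normalization makes the kernel value depend on the direction of the feature vectors rather than their length, which is one reason it is a reasonable choice when features are on different scales.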

Relational Similarity
This section presents experiments with learning relational similarity using SuperSim. The training datasets consist of quadruples that are labeled as positive and negative examples of analogies. Table 2 shows that the feature vectors have 1,348 elements.
We experiment with three datasets, a collection of 374 five-choice questions from the SAT college entrance exam (Turney et al., 2003), a modified ten-choice variation of the SAT questions, and the relational similarity dataset from SemEval 2012 Task 2 (Jurgens et al., 2012).
Footnote 4: Weka is available at http://www.cs.waikato.ac.nz/ml/weka/.
Table 3 is an example of a question from the 374 five-choice SAT questions. Each five-choice question yields five labeled quadruples, by combining the stem with each choice. The quadruple ⟨word, language, note, music⟩ is labeled positive and the other four quadruples are labeled negative.

Table 3: An example of a question from the 374 five-choice SAT questions.
Stem:      word:language
Choices:   (1) paint:portrait
           (2) poetry:rhythm
           (3) note:music
           (4) tale:story
           (5) week:year
Solution:  (3) note:music

Five-choice SAT Questions
Since learning works better with balanced training data (Japkowicz and Stephen, 2002), we use the symmetries of proportional analogies to add more positive examples (Lepage and Shin-ichi, 1996). For each positive quadruple, ⟨a, b, c, d⟩, we add three more positive quadruples, ⟨b, a, d, c⟩, ⟨c, d, a, b⟩, and ⟨d, c, b, a⟩. Thus each five-choice question provides four positive and four negative quadruples.
We use ten-fold cross-validation to apply SuperSim to the SAT questions. The folds are constructed so that the eight quadruples from each SAT question are kept together in the same fold. To answer a question in the testing fold, the learned model assigns a probability to each of the five choices and guesses the choice with the highest probability. SuperSim achieves a score of 54.8% correct (205 out of 374). Table 4 gives the rank of SuperSim in the list of the top ten results with the SAT analogy questions. The scores ranging from 51.1% to 57.0% are not significantly different from SuperSim's score of 54.8%, according to Fisher's exact test at the 95% confidence level. However, SuperSim answers the SAT questions in a few minutes, whereas LRA requires nine days, and SuperSim learns its models automatically, unlike the hand-coding of Turney (2012).
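The construction of training quadruples and the answering rule can be sketched as follows. The function names are ours, and `prob` is a hypothetical stand-in for the trained model's probability estimate.

```python
def symmetric_variants(quad):
    """The positive quadruple plus its three symmetric rearrangements."""
    a, b, c, d = quad
    return [(a, b, c, d), (b, a, d, c), (c, d, a, b), (d, c, b, a)]


def training_quadruples(stem, choices, solution_index):
    """Labeled quadruples for one question: 1 = analogy, 0 = not an analogy."""
    examples = []
    for index, choice in enumerate(choices):
        quad = (*stem, *choice)
        if index == solution_index:
            examples.extend((q, 1) for q in symmetric_variants(quad))
        else:
            examples.append((quad, 0))
    return examples


def answer(stem, choices, prob):
    """Guess the index of the choice whose quadruple has the highest probability."""
    return max(range(len(choices)), key=lambda i: prob((*stem, *choices[i])))


# The question from Table 3; the solution is choice (3), index 2.
stem = ("word", "language")
choices = [("paint", "portrait"), ("poetry", "rhythm"), ("note", "music"),
           ("tale", "story"), ("week", "year")]
examples = training_quadruples(stem, choices, solution_index=2)
print(sum(label for _, label in examples))      # 4 positive quadruples
print(sum(1 - label for _, label in examples))  # 4 negative quadruples
```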

Ten-choice SAT Questions
In addition to symmetries, proportional analogies have asymmetries. In general, if the quadruple ⟨a, b, c, d⟩ is positive, ⟨a, d, c, b⟩ is negative. For example, ⟨word, language, note, music⟩ is a good analogy, but ⟨word, music, note, language⟩ is not. Words are the basic units of language and notes are the basic units of music, but words are not necessary for music and notes are not necessary for language.
Turney (2012) used this asymmetry to convert the 374 five-choice SAT questions into 374 ten-choice SAT questions. Each choice ⟨c, d⟩ was expanded with the stem ⟨a, b⟩, resulting in the quadruple ⟨a, b, c, d⟩, and then the order was shuffled to ⟨a, d, c, b⟩, so that each choice pair in a five-choice question generated two choice quadruples in a ten-choice question. Nine of the quadruples are negative examples and the quadruple consisting of the stem pair followed by the solution pair is the only positive example. The purpose of the ten-choice questions is to test the ability of measures of relational similarity to avoid the asymmetric distractors.
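The expansion from five choices to ten can be sketched as follows. The function name is ours; the question is the one shown in Table 3.

```python
def ten_choice_quadruples(stem, choices):
    """Expand each choice (c, d) into (a, b, c, d) and the shuffled (a, d, c, b)."""
    a, b = stem
    quadruples = []
    for c, d in choices:
        quadruples.append((a, b, c, d))  # original order
        quadruples.append((a, d, c, b))  # asymmetric distractor
    return quadruples


stem = ("word", "language")
choices = [("paint", "portrait"), ("poetry", "rhythm"), ("note", "music"),
           ("tale", "story"), ("week", "year")]
expanded = ten_choice_quadruples(stem, choices)
print(len(expanded))  # 10
# Only the stem followed by the solution pair is a positive example.
positive = ("word", "language", "note", "music")
print(sum(1 for quad in expanded if quad != positive))  # 9 negatives
```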
We use the ten-choice questions to compare the hand-coded dual-space approach (Turney, 2012) with SuperSim. We also use these questions to perform an ablation study of the four sets of features in SuperSim. As with the five-choice questions, we use the symmetries of proportional analogies to add three more positive examples, so the training [...] (Table 4), a drop of 3.2%. The difference between SuperSim (52.7%) and the hand-coded dual-space model (47.9%) is not significant according to Fisher's exact test at the 95% confidence level. The advantage of SuperSim is that it does not need hand-coding. The results show that SuperSim can avoid the asymmetric distractors.
Table 5 shows the impact of different subsets of features on the percentage of correct answers to the ten-choice SAT questions. Included features are marked 1 and ablated features are marked 0. The results show that the log frequency (LF) and PPMI features are not helpful (but also not harmful) for relational similarity. We also see that domain space and function space are both needed for good results.

SemEval 2012 Task 2
The SemEval 2012 Task 2 dataset is based on the semantic relation classification scheme of Bejar et al. (1991), consisting of ten high-level categories of relations and seventy-nine subcategories, with paradigmatic examples of each subcategory. For instance, the subcategory taxonomic in the category class inclusion has three paradigmatic examples, flower:tulip, emotion:rage, and poem:sonnet.
Jurgens et al. (2012) used Amazon's Mechanical Turk to create the SemEval 2012 Task 2 dataset in two phases. In the first phase, Turkers expanded the paradigmatic examples for each subcategory to an average of forty-one word pairs per subcategory, a total of 3,218 pairs. In the second phase, each word pair from the first phase was assigned a prototypicality score, indicating its similarity to the paradigmatic examples. The challenge of SemEval 2012 Task 2 was to guess the prototypicality scores.
SuperSim was trained on the five-choice SAT questions and evaluated on the SemEval 2012 Task 2 test dataset. For a given word pair, we created quadruples, combining the word pair with each of the paradigmatic examples for its subcategory. We then used SuperSim to compute the probabilities for each quadruple. Our guess for the prototypicality score of the given word pair was the average of the probabilities. Spearman's rank correlation coefficient between the Turkers' prototypicality scores and SuperSim's scores was 0.408, averaged over the sixty-nine subcategories in the testing set. SuperSim has the highest Spearman correlation achieved to date on SemEval 2012 Task 2 (see Table 6).
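The scoring rule can be sketched as follows. Here `prob` is a hypothetical stand-in for the trained SuperSim model, and placing the paradigmatic pair before the given pair in each quadruple is an assumption, since the order is not specified above.

```python
def prototypicality(pair, paradigms, prob):
    """Average model probability over quadruples built with each paradigm pair."""
    probabilities = [prob((*paradigm, *pair)) for paradigm in paradigms]
    return sum(probabilities) / len(probabilities)


# Paradigmatic examples for the taxonomic subcategory of class inclusion.
paradigms = [("flower", "tulip"), ("emotion", "rage"), ("poem", "sonnet")]


# A stub model with made-up probabilities, for illustration only.
def stub_prob(quad):
    return {"flower": 0.9, "emotion": 0.3, "poem": 0.6}[quad[0]]


score = prototypicality(("bird", "robin"), paradigms, stub_prob)
print(round(score, 3))  # 0.6, the mean of 0.9, 0.3, and 0.6
```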

Compositional Similarity
This section presents experiments using SuperSim to learn compositional similarity. The datasets consist of triples, ⟨a, b, c⟩, such that ab is a noun-modifier bigram and c is a noun unigram. The triples are labeled as positive and negative examples of paraphrases. Table 2 shows that the feature vectors have 675 elements. We experiment with two datasets, seven-choice and fourteen-choice noun-modifier questions.

Table 7: One of the seven-choice noun-modifier questions.
Stem:      fantasy world
Choices:   (1) fairyland
           (2) fantasy
           (3) world
           (4) phantasy
           (5) universe
           (6) ranter
           (7) souring
Solution:  (1) fairyland

Noun-Modifier Questions
The first dataset is a seven-choice noun-modifier question dataset, constructed from WordNet. The dataset contains 680 questions for training and 1,500 for testing, a total of 2,180 questions. Table 7 shows one of the questions.
The stem is a bigram and the choices are unigrams. The bigram is composed of a head noun (world), modified by an adjective or noun (fantasy). The solution is the unigram (fairyland) that belongs to the same WordNet synset as the stem.
The distractors are designed to be difficult for current approaches to composition. For example, if fantasy world is represented by element-wise multiplication of the context vectors for fantasy and world (Mitchell and Lapata, 2010), the most likely guess is fantasy or world, not fairyland.
Each seven-choice question yields seven labeled triples, by combining the stem with each choice. The triple ⟨fantasy, world, fairyland⟩ is labeled positive and the other six triples are labeled negative.
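The labeling of triples can be sketched as follows, using the question from Table 7. The function name is ours.

```python
def labeled_triples(stem, choices, solution):
    """Combine the stem bigram with each choice; 1 = paraphrase, 0 = not."""
    a, b = stem
    return [((a, b, c), 1 if c == solution else 0) for c in choices]


stem = ("fantasy", "world")
choices = ["fairyland", "fantasy", "world", "phantasy",
           "universe", "ranter", "souring"]
triples = labeled_triples(stem, choices, solution="fairyland")
print(len(triples))                        # 7
print(sum(label for _, label in triples))  # 1 positive triple
```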
In general, if ⟨a, b, c⟩ is a positive example, then ⟨b, a, c⟩ is negative. For example, world fantasy is not a paraphrase of fairyland. The second dataset is constructed by applying this shuffling transformation to convert the 2,180 seven-choice questions into 2,180 fourteen-choice questions, which are designed to be difficult for approaches that are not sensitive to word order.
Table 8 shows the percentage of the testing questions that are answered correctly for the two datasets. Because vector addition and element-wise multiplication are not sensitive to word order, they perform poorly on the fourteen-choice questions. For both datasets, SuperSim performs significantly better than all other approaches, except for the holistic approach, according to Fisher's exact test at the 95% confidence level.[8] The holistic approach is noncompositional. The stem bigram is represented by a single context vector, generated by treating the bigram as if it were a unigram. A noncompositional approach cannot scale up to realistic applications. The holistic approach cannot be applied to the fourteen-choice questions, because the bigrams in these questions do not correspond to terms in WordNet, and hence they do not correspond to row vectors in the matrices we use (see Section 3).
Footnote 8: The results for SuperSim are new but the other results in Table 8 [...]
Turney (2012) found it necessary to hand-code a soundness check into all of the algorithms (vector addition, element-wise multiplication, dual-space, and holistic). Given a stem ab and a choice c, the hand-coded check assigns a minimal score to the choice if c = a or c = b. We do not need to hand-code any checking into SuperSim. It learns automatically from the training data to avoid such choices.
Table 9 shows the effects of ablating sets of features on the performance of SuperSim with the fourteen-choice questions. PPMI features are the most important; by themselves, they achieve 59.7% correct, although the other features are needed to reach 68.0%. Domain space features reach the second highest performance when used alone (34.6%), but they reduce performance (from 69.3% to 68.0%) when combined with other features; however, the drop is not significant according to Fisher's exact test at the 95% confidence level.
Since the PPMI features play an important role in answering the noun-modifier questions, let us take a closer look at them. From Table 2, we see that there are twelve PPMI features for the triple ⟨a, b, c⟩, where ab is a noun-modifier bigram and c is a noun unigram. We can split the twelve features into three subsets, one subset for each pair of words, ⟨a, b⟩, ⟨a, c⟩, and ⟨b, c⟩. For example, the subset for ⟨a, b⟩ is the four features PPMI(a, b, left), PPMI(b, a, left), PPMI(a, b, right), and PPMI(b, a, right). Table 10 shows the effects of ablating these subsets.
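The split of the twelve PPMI features into three subsets of four can be sketched as follows. The tuple keys (row term, column term, handedness) and the function names are ours, chosen for illustration.

```python
def ppmi_subset(x, y):
    """The four PPMI features that involve the unordered pair {x, y}."""
    return [(x, y, "left"), (y, x, "left"), (x, y, "right"), (y, x, "right")]


def ppmi_subsets(triple):
    """Group the twelve PPMI features of a triple by word pair."""
    a, b, c = triple
    return {
        ("a", "b"): ppmi_subset(a, b),
        ("a", "c"): ppmi_subset(a, c),
        ("b", "c"): ppmi_subset(b, c),
    }


subsets = ppmi_subsets(("fantasy", "world", "fairyland"))
all_keys = [key for subset in subsets.values() for key in subset]
print(len(all_keys))       # 12
print(len(set(all_keys)))  # 12, so the three subsets do not overlap
```

Ablating a subset then amounts to dropping (or zeroing) its four features before training, which is one plausible reading of the experiment summarized in Table 10.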

Ablation Experiments
The results in Table 10 indicate that all three PPMI subsets contribute to the performance of SuperSim, but the ⟨a, b⟩ subset contributes more than the other two subsets. The ⟨a, b⟩ features help to increase the sensitivity of SuperSim to the order of the words in the noun-modifier bigram; for example, they make it easier to distinguish fantasy world from world fantasy.
Table 11: A seven-choice training question that was generated without using WordNet synsets.
Stem: search engine
Choices: (1) search engine (2) search (3) engine (4) search language (5) search warrant (6) diesel engine (7) steam engine
Solution: (1) search engine
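The word-order sensitivity discussed in the ablation results is what the fourteen-choice questions are built to test. The following is a minimal sketch of the shuffling transformation, under the assumption that each of the seven choices c is paired with both the stem order ⟨a, b, c⟩ and the swapped order ⟨b, a, c⟩, and that only the original order with the correct choice is positive. The function name `shuffle_question` and the distractor words in the example are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def shuffle_question(a: str, b: str, choices: List[str], solution: str) -> List[Tuple[Triple, bool]]:
    """Expand a seven-choice question into fourteen labeled candidate triples."""
    candidates = []
    for c in choices:
        # Original word order: positive only for the correct choice.
        candidates.append(((a, b, c), c == solution))
        # Swapped word order: assumed to be negative for every choice.
        candidates.append(((b, a, c), False))
    return candidates


# Hypothetical question: the stem and solution follow the fantasy world example,
# while the six distractors are invented for illustration.
choices = ["fairyland", "fantasy", "world", "kingdom", "novel", "garden", "planet"]
candidates = shuffle_question("fantasy", "world", choices, "fairyland")
for triple, label in candidates[:2]:
    print(triple, label)
```

A method that scores ⟨a, b, c⟩ and ⟨b, a, c⟩ identically cannot separate the single positive candidate from its swapped counterpart, which is why order-insensitive composition is expected to do poorly on such questions.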

Holistic Training
SuperSim uses 680 training questions to learn to recognize when a bigram is a paraphrase of a unigram; it learns from expert knowledge implicit in WordNet synsets. It would be advantageous to be able to train SuperSim with less reliance on expert knowledge.
Past work with adjective-noun bigrams has shown that we can use holistic bigram vectors to train a supervised regression model (Guevara, 2010;Baroni and Zamparelli, 2010). The output of the regression model is a vector representation for a bigram that approximates the holistic vector for the bigram; that is, it approximates the vector we would get by treating the bigram as if it were a unigram.
SuperSim does not generate vectors as output, but we can still use holistic bigram vectors for training. Table 11 shows a seven-choice training question that was generated without using WordNet synsets. The choices of the form a b are bigrams, but we represent them with holistic bigram vectors; we pretend they are unigrams. We call these a b bigrams pseudo-unigrams. As far as SuperSim is concerned, there is no difference between these pseudo-unigrams and true unigrams. The question in Table 11 is treated the same as the question in Table 7.
We generate 680 holistic training questions by randomly selecting 680 noun-modifier bigrams from WordNet as stems for the questions (search engine), avoiding any bigrams that appear as stems in the testing questions. The solution (search engine) is the pseudo-unigram that corresponds to the   Table 11 versus training with questions like Table 7). The testing set is the standard testing set in both cases. There is a significant drop in performance with holistic training, but the performance still surpasses vector addition, element-wise multiplication, and the hand-coded dual-space model (see Table 8).
Since holistic questions can be generated automatically without human expertise, we experimented with increasing the size of the holistic training dataset, growing it from 1,000 to 10,000 questions in increments of 1,000. The performance on the fourteen-choice questions with holistic training and standard testing varied between 53.3% and 55.1% correct, with no clear trend up or down. This is not significantly different from the performance with 680 holistic training questions (54.4%).
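A comparison of this kind can be checked with Fisher's exact test on a 2x2 table of correct and incorrect answers. The sketch below is a pure-Python two-sided test; it assumes that the same test used elsewhere in the paper applies here, and the testing set size of 1,000 questions is a placeholder chosen only for illustration, so the counts 544 and 551 are hypothetical reconstructions of 54.4% and 55.1%.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_observed * (1 + 1e-9))
    return min(total, 1.0)


# ASSUMPTION: 1,000 testing questions (placeholder), 54.4% vs. 55.1% correct.
n_questions = 1000
correct_small, correct_large = 544, 551
p_value = fisher_exact_two_sided(correct_small, n_questions - correct_small,
                                 correct_large, n_questions - correct_large)
print(p_value > 0.05)  # True means no significant difference at the 95% level
```

With the actual number of testing questions substituted for the placeholder, the same function can be applied to any pair of accuracies reported in the tables.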
It seems likely that the drop in performance with holistic training instead of standard training is due to a difference in the nature of the standard questions (Table 7) and the holistic questions (Table 11). We are currently investigating this issue. We expect to be able to close the performance gap in future work, by improving the holistic questions. However, it is possible that there are fundamental limits to holistic training.
SuperSim performs slightly better (not statistically significant) than the hand-coded dual-space model on relational similarity problems (Section 4), but it performs much better on compositional similarity problems (Section 5). The ablation studies suggest this is due to the PPMI features, which have no effect on ten-choice SAT performance (Table 5), but have a large effect on fourteen-choice noun-modifier paraphrase performance (Table 9).
One advantage of supervised learning over hand-coding is that it facilitates adding new features. It is not clear how to modify the hand-coded equations for the dual-space model of noun-modifier composition to include PPMI information.
SuperSim is one of the few approaches to distributional semantics beyond words that has attempted to address both relational and compositional similarity (see Section 2.3). It is a strength of this approach that it works well with both kinds of similarity.

Future Work and Limitations
Given the promising results with holistic training for noun-modifier paraphrases, we plan to experiment with holistic training for analogies. Consider the proportional analogy hard is to hard time as good is to good time, where hard time and good time are pseudo-unigrams. To a human, this analogy is trivial, but SuperSim has no access to the surface form of a term. As far as SuperSim is concerned, this analogy is much the same as the analogy hard is to difficulty as good is to fun. This strategy automatically converts simple, easily generated analogies into more complex, challenging analogies, which may be suited to training SuperSim.
This also suggests that noun-modifier paraphrases may be used to solve analogies. Perhaps we can evaluate the quality of a candidate analogy ⟨a, b, c, d⟩ by searching for a term e such that ⟨b, e, a⟩ and ⟨d, e, c⟩ are good paraphrases. For example, consider the analogy mason is to stone as carpenter is to wood. We can paraphrase mason as stone worker and carpenter as wood worker. This transforms the analogy to stone worker is to stone as wood worker is to wood, which makes it easier to recognize the relational similarity.
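The search for a bridging term e can be sketched as follows. This is only an illustration of the idea, not an implemented part of SuperSim: the scorer `toy_paraphrase_score` and its numbers are invented, and combining the two paraphrase scores with `min` is an assumption (both paraphrases must be good for the analogy to be supported).

```python
from typing import Callable, Iterable, Optional, Tuple


def best_bridge_term(
    analogy: Tuple[str, str, str, str],
    candidates: Iterable[str],
    paraphrase_score: Callable[[str, str, str], float],
) -> Optional[Tuple[str, float]]:
    """Find the term e that best makes <b, e, a> and <d, e, c> paraphrases."""
    a, b, c, d = analogy
    best = None
    for e in candidates:
        # ASSUMPTION: the weaker of the two paraphrases limits the overall score.
        score = min(paraphrase_score(b, e, a), paraphrase_score(d, e, c))
        if best is None or score > best[1]:
            best = (e, score)
    return best


# Hypothetical scores standing in for a trained paraphrase classifier.
TOY_SCORES = {
    ("stone", "worker", "mason"): 0.9,
    ("wood", "worker", "carpenter"): 0.8,
    ("stone", "cutter", "mason"): 0.7,
    ("wood", "cutter", "carpenter"): 0.3,
}


def toy_paraphrase_score(x: str, y: str, z: str) -> float:
    """Score how well the bigram 'x y' paraphrases the unigram z (toy lookup)."""
    return TOY_SCORES.get((x, y, z), 0.0)


best = best_bridge_term(("mason", "stone", "carpenter", "wood"),
                        ["worker", "cutter", "house"], toy_paraphrase_score)
print(best)
```

In this toy setting the term worker wins because it yields a good paraphrase on both sides of the analogy, whereas cutter fits mason better than carpenter.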
Another area for future work is extending SuperSim beyond noun-modifier paraphrases to measuring the similarity of sentence pairs. We plan to adapt ideas from Socher et al. (2011) for this task. They use dynamic pooling to represent sentences of varying size with fixed-size feature vectors. Using fixed-size feature vectors avoids the problem of quadratic growth and it enables the supervised learner to generalize over sentences of varying length.
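To illustrate why pooling yields fixed-size features, the sketch below reduces a matrix of pairwise word similarities of any shape to a p × p grid. It is a simplified illustration of the general idea only, not the procedure of Socher et al. (2011); the names `pool_bounds` and `dynamic_pool`, the use of `max` within each region, and the example matrix are all assumptions.

```python
from typing import List, Tuple


def pool_bounds(n: int, p: int) -> List[Tuple[int, int]]:
    """Split range(n) into p contiguous regions; regions repeat when n < p."""
    bounds = []
    for i in range(p):
        start = i * n // p
        end = max(start + 1, (i + 1) * n // p)  # every region has at least one index
        bounds.append((start, end))
    return bounds


def dynamic_pool(matrix: List[List[float]], p: int) -> List[List[float]]:
    """Pool a rows x cols similarity matrix down (or up) to a p x p grid."""
    row_bounds = pool_bounds(len(matrix), p)
    col_bounds = pool_bounds(len(matrix[0]), p)
    return [
        [max(matrix[r][c] for r in range(r0, r1) for c in range(c0, c1))
         for (c0, c1) in col_bounds]
        for (r0, r1) in row_bounds
    ]


# Hypothetical similarities between a 3-word and a 5-word sentence.
sims = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
]
pooled = dynamic_pool(sims, 2)
print(pooled)
```

Whatever the two sentence lengths, the learner always receives p × p values, so the number of features no longer grows with the product of the sentence lengths.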
Some of the competing approaches discussed by Erk (2013) incorporate formal logic. The work of Baroni et al. (2012) suggests ways that SuperSim could be developed to deal with logic.
We believe that SuperSim could benefit from more features, with greater diversity. One place to look for these features is higher levels in the hierarchy that we sketch in Section 3.
Our ablation experiments suggest that domain and function spaces provide the most important features for relational similarity, but PPMI values provide the most important features for noun-modifier compositional similarity. Explaining this is another topic for future research.

Conclusion
In this paper, we have presented SuperSim, a unified approach to analogy (relational similarity) and paraphrase (compositional similarity). SuperSim treats them both as problems of supervised tuple classification. The supervised learning algorithm is a standard support vector machine. The main contribution of SuperSim is a set of four types of features for representing tuples. The features work well with both analogy and paraphrase, with no task-specific modifications. SuperSim matches the state of the art on SAT analogy questions and substantially advances the state of the art on the SemEval 2012 Task 2 challenge and the noun-modifier paraphrase questions.
SuperSim runs much faster than LRA (Turney, 2006b), answering the SAT questions in minutes instead of days. Unlike the dual-space model, SuperSim requires no hand-coded similarity composition functions. Since there is no hand-coding, it is easy to add new features to SuperSim. Much work remains to be done, such as incorporating logic and scaling up to sentence paraphrases, but past work suggests that these problems are tractable.
Among the four approaches described by Erk (2013) for extending distributional semantics beyond words, SuperSim is an instance of the second approach: comparing word pairs, phrases, or sentences (in general, tuples) by combining multiple pairwise similarity values. Perhaps the main significance of this paper is that it provides some evidence in support of this general approach.