Good, Great, Excellent: Global Inference of Semantic Intensities

Adjectives like good, great, and excellent are similar in meaning, but differ in intensity. Intensity order information is very useful for language learners as well as in several NLP tasks, but is missing in most lexical resources (dictionaries, WordNet, and thesauri). In this paper, we present a primarily unsupervised approach that uses semantics from Web-scale data (e.g., phrases like good but not excellent) to rank words by assigning them positions on a continuous scale. We rely on Mixed Integer Linear Programming to jointly determine the ranks, such that individual decisions benefit from global information. When ranking English adjectives, our global algorithm achieves substantial improvements over previous work on both pairwise and rank correlation metrics (specifically, 70% pairwise accuracy as compared to only 56% by previous work). Moreover, our approach can incorporate external synonymy information (increasing its pairwise accuracy to 78%) and extends easily to new languages. We also make our code and data freely available. 1


Introduction
Current lexical resources such as dictionaries and thesauri do not provide information about the intensity order of words. For example, both WordNet (Miller, 1995) and Roget's 21st Century Thesaurus (thesaurus.com) present acceptable, great, and superb as synonyms of the adjective good. However, a native speaker knows that these words represent varying intensity and can in fact generally be ranked by intensity as acceptable < good < great < superb. Similarly, warm < hot < scorching are identified as synonyms in these resources. Ranking information, however, is crucial because it allows us to differentiate, e.g., between various intensities of an emotion, and is hence very useful for humans when learning a language or judging product reviews, as well as for automatic text understanding and generation tasks such as sentiment and subjectivity analysis, recognizing textual entailment, question answering, summarization, and coreference and discourse analysis.
1 http://demelo.org/gdm/intensity/
In this work, we attempt to automatically rank sets of related words by intensity, focusing in particular on adjectives. This is made possible by the vast amounts of world knowledge that are now available. We use lexico-semantic information extracted from a Web-scale corpus in conjunction with an algorithm based on a Mixed Integer Linear Program (MILP). Linguistic analyses have identified phrases such as good but not great or hot and almost scorching in a text corpus as sources of evidence about the relative intensities of words. However, pure information extraction approaches often fail to provide enough coverage for real-world downstream applications (Tandon and de Melo, 2010), unless some form of advanced inference is used (Snow et al., 2006; Suchanek et al., 2009).
In our work, we address this sparsity problem by relying on Web-scale data and using an MILP model that extends the pairwise scores to a more complete joint ranking of words on a continuous scale, while maintaining global constraints such as transitivity and giving more weight to the order of word pairs with higher corpus evidence scores. Instead of considering intensity ranking as a pairwise decision process, we thus exploit the fact that individual decisions may benefit from global information, e.g. about how two words relate to some third word.
Previous work (Sheinman and Tokunaga, 2009; Schulam and Fellbaum, 2010; Sheinman et al., 2012) has also used lexico-semantic patterns to order adjectives. They mainly evaluate their algorithm on a set of pairwise decisions, but also present a partitioning approach that attempts to form scales by placing each adjective to the left or right of pivot words. Unfortunately, this approach often fails because many pairs lack order-based evidence even on the Web, as explained in more detail in Section 3.
In contrast, our MILP jointly uses information from all relevant word pairs and captures complex interactions and inferences to produce intensity scales. We can thus obtain an order between two adjectives even when there is no explicit evidence in the corpus (using evidence for related pairs and transitive inference). Our global MILP is flexible and can also incorporate additional synonymy information if available (which helps the MILP find an even better ranking solution). Our approach also extends easily to new languages. We describe two approaches for this multilingual extension: pattern projection and cross-lingual MILPs.
We evaluate our predicted intensity rankings using both pairwise classification accuracy and ranking correlation coefficients, achieving strong results, significantly better than the previous approach by Sheinman & Tokunaga (32% relative error reduction) and quite close to human-level performance.

Method
In this section, we describe each step of our approach to ordering adjectives on a single, relative scale. Our method can also be applied to other word classes and to languages other than English.

Intensity Scales
Near-synonyms may differ in intensity, e.g. joy vs. euphoria, or drizzle vs. rain. This is particularly true of adjectives, which can represent different degrees of a given quality or attribute such as size or age. Many adjectives are gradable and thus allow for grading adverbial modifiers to express such intensity degrees, e.g., a house can be very big or extremely big. Often, however, completely different adjectives refer to varying degrees on the same scale, e.g., huge, gigantic, gargantuan. Even adjectives like enormous (or superb, impossible) that are considered non-gradable from a syntactic perspective can be placed on such a scale.

Weak-Strong Patterns          Strong-Weak Patterns
* (,) but not *               not * (,) just *
* (,) if not *                not * (,) but just *
* (,) although not *          not * (,) still *
* (,) though not *            not * (,) but still *
* (,) (and/or) even *         not * (,) although still *
* (,) (and/or) almost *       not * (,) though still *
not only * but *              * (,) or very *
not just * but *

Table 1: Ranking patterns used in this work (* marks a wildcard slot). Among the patterns represented by the regular expressions above, we use only those that capture at most five words (to fit in the Google n-grams, see Section 2.1.2). Articles (a, an, the) are allowed to appear before the wildcards wherever possible.

Intensity Patterns
Linguistic studies have found lexical patterns like '* but not *' (e.g. good but not great) to reveal order information between a pair of adjectives (Sheinman and Tokunaga, 2009). We assume that we have two sets of lexical patterns that allow us to infer the most likely ordering between two words when encountered in a corpus. A first pattern set, $P_{ws}$, contains patterns that reflect a weak-strong order between a pair of words (the first word is weaker than the second), and a second pattern set, $P_{sw}$, captures the strong-weak order. See Table 1 for the adjective patterns that we used in this work (and see Section 4.1 for implementation details regarding our pattern collection). Many of these patterns also apply to other parts of speech (e.g. 'drizzle but not rain', 'running or even sprinting'), with significant discrimination on the Web in the right direction.
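To make the pattern notation concrete, the following minimal Python sketch expands a pattern template into phrase queries for a word pair. It assumes a notation in which `*` is a word slot, `(,)` an optional comma, and `(and/or)` an optional conjunction, and it assumes that a comma counts as its own token when checking the five-token limit. The function names `expand` and `instantiate` are hypothetical and not taken from the released code.

```python
from itertools import product

def expand(template):
    """Expand optional elements of a template into concrete variants."""
    options = []
    for tok in template.split():
        if tok == "(,)":
            options.append(["", ","])           # optional comma
        elif tok == "(and/or)":
            options.append(["", "and", "or"])   # optional conjunction
        else:
            options.append([tok])
    variants = {" ".join(t for t in choice if t) for choice in product(*options)}
    return sorted(variants)

def instantiate(template, first, second, max_tokens=5):
    """Fill the two '*' slots in order; keep queries that fit in a 5-gram."""
    queries = []
    for variant in expand(template):
        slots = iter((first, second))
        filled = [next(slots) if t == "*" else t for t in variant.split()]
        if len(filled) <= max_tokens:
            queries.append(" ".join(filled))
    return queries

print(instantiate("* (,) but not *", "big", "huge"))
```

For the weak-strong pattern above, this yields the comma and no-comma variants of "big but not huge". Variants longer than five tokens are dropped.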

Pairwise Scores
Given an input set of words to be placed on a scale, we first collect evidence of their intensity order by using the above-mentioned intensity patterns and a large, Web-scale text corpus.
Previous work on information extraction from limited-sized raw text corpora revealed that coverage is often limited (Hearst, 1992; Hatzivassiloglou and McKeown, 1993). Some studies (Chklovski and Pantel, 2004; Sheinman and Tokunaga, 2009) used hit counts from an online search engine, but this is unstable and irreproducible (Kilgarriff, 2007). To avoid these issues, we use the largest available static corpus of counts, the Google n-grams corpus (Brants and Franz, 2006), which contains English n-grams (n = 1 to 5) and their observed frequency counts, generated from nearly 1 trillion word tokens and 95 billion sentences.
We consider each pair of words $(a_1, a_2)$ in the input set in turn. For each pattern $p$ in the two pattern sets (weak-strong $P_{ws}$ and strong-weak $P_{sw}$), we insert the word pair into the pattern as $p(a_1, a_2)$ to get a phrasal query like "big but not huge". This is done by replacing the two wildcards in the pattern by the two words in order. Finally, we scan the Web n-grams corpus in a batch approach similar to Bansal and Klein (2011) and collect frequencies of all our phrase queries. Table 2 depicts some examples of useful intensity-based phrase queries and their frequencies in the Web-scale corpus. We also collect frequencies for the input word unigrams and the patterns for normalization purposes.
Given a word pair $(a_1, a_2)$ and a corpus count function cnt, we define
\[ W_1 = \frac{1}{P_1} \sum_{p_1 \in P_{ws}} \mathrm{cnt}(p_1(a_1, a_2)) \qquad S_1 = \frac{1}{P_2} \sum_{p_2 \in P_{sw}} \mathrm{cnt}(p_2(a_1, a_2)) \]
\[ W_2 = \frac{1}{P_1} \sum_{p_1 \in P_{ws}} \mathrm{cnt}(p_1(a_2, a_1)) \qquad S_2 = \frac{1}{P_2} \sum_{p_2 \in P_{sw}} \mathrm{cnt}(p_2(a_2, a_1)) \]
with
\[ P_1 = \sum_{p_1 \in P_{ws}} \mathrm{cnt}(p_1) \qquad P_2 = \sum_{p_2 \in P_{sw}} \mathrm{cnt}(p_2) \]
such that the final overall weak-strong score is
\[ \mathrm{score}(a_1, a_2) = \frac{(W_1 - S_1) - (W_2 - S_2)}{\mathrm{cnt}(a_1) \cdot \mathrm{cnt}(a_2)} \]
Here $W_1$ and $S_1$ represent Web evidence of $a_1$ and $a_2$ being in the weak-strong and strong-weak relation, respectively. $W_2$ and $S_2$ fit the reverse pair $(a_2, a_1)$ in the patterns and hence represent the strong-weak and weak-strong relations, respectively, in the opposite direction.
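The score computation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `cnt` is a plain dictionary that holds counts for filled phrases, for the bare patterns, and for word unigrams, and every number in the toy table is invented for the example rather than taken from the Google n-grams corpus. The function name `pairwise_score` is hypothetical.

```python
def pairwise_score(a1, a2, cnt, ws_patterns, sw_patterns):
    """Normalized weak-strong score of (a1, a2); positive means a1 is weaker."""
    def fill(pattern, first, second):
        # Replace the two '*' slots in order.
        return pattern.replace("*", first, 1).replace("*", second, 1)

    def total(patterns, first, second):
        return sum(cnt.get(fill(p, first, second), 0) for p in patterns)

    p1 = sum(cnt.get(p, 0) for p in ws_patterns)  # weak-strong pattern mass
    p2 = sum(cnt.get(p, 0) for p in sw_patterns)  # strong-weak pattern mass
    w1 = total(ws_patterns, a1, a2) / p1
    s1 = total(sw_patterns, a1, a2) / p2
    w2 = total(ws_patterns, a2, a1) / p1
    s2 = total(sw_patterns, a2, a1) / p2
    return ((w1 - s1) - (w2 - s2)) / (cnt[a1] * cnt[a2])

# Toy counts, invented for illustration only.
toy_cnt = {
    "* but not *": 1000, "not * just *": 500,
    "good": 100, "great": 50,
    "good but not great": 20, "not great just good": 5,
}
ws, sw = ["* but not *"], ["not * just *"]
print(pairwise_score("good", "great", toy_cnt, ws, sw))
```

With these toy counts, both the weak-strong hit for (good, great) and the strong-weak hit for (great, good) push the score in the same direction, so the result is positive, and swapping the arguments flips its sign.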
Hence, overall, $(W_1 - S_1) - (W_2 - S_2)$ represents the total weak-strong score of the pair $(a_1, a_2)$, i.e. the score of $a_1$ being on the left of $a_2$ on a relative intensity scale, such that $\mathrm{score}(a_1, a_2) = -\mathrm{score}(a_2, a_1)$. The raw frequencies in the score are divided by counts of the patterns and by individual word unigram counts to obtain a pointwise mutual information (PMI) style normalization and hence avoid any bias in the score due to high-frequency patterns or word unigrams.

Global Ordering with an MILP

Objective and Constraints
Given pairwise scores, we now aim at producing a global ranking of the input words that is much more informative than the original pairwise scores. Joint inference from multiple word pairs allows us to benefit from global information: due to the sparsity of the pattern evidence, how two adjectives relate to each other can sometimes only be inferred indirectly, e.g. by observing how each of them relates to some third adjective.
We assume that we are given N input words $A = \{a_1, \ldots, a_N\}$ that we wish to place on a linear scale, say [0, 1]. Thus each word $a_i$ is to be assigned a position $x_i \in [0, 1]$ based on the pairwise weak-strong weights $\mathrm{score}(a_i, a_j)$. A positive value for $\mathrm{score}(a_i, a_j)$ means that $a_i$ is supposedly weaker than $a_j$ and hence we would like to obtain $x_i < x_j$. A negative value for $\mathrm{score}(a_i, a_j)$ means that $a_i$ is assumed to be stronger than $a_j$, so we would want to obtain $x_i > x_j$. Therefore, intuitively, our goal corresponds to maximizing the objective
\[ \sum_{i,j} \mathrm{sgn}(x_j - x_i) \cdot \mathrm{score}(a_i, a_j) \]
Note that it is important to use the signum function sgn() here, because we only care about the relative order of $x_i$ and $x_j$. Maximizing $\sum_{i,j} (x_j - x_i) \cdot \mathrm{score}(a_i, a_j)$ would lead to all words being placed at the edges of the scale, because the highest scores would dominate over all other ones. We do include the score magnitudes in the objective, because they help resolve contradictions in the pairwise scores (e.g., see Figure 1). This is discussed in more detail in Section 2.2.2.
Figure 1: The input weak-strong data may contain one or more cycles, e.g. due to noisy patterns, so the final ranking will have to choose which input scores to honor and which to remove.
In order to maximize this non-differentiable objective, we use Mixed Integer Linear Programming (MILP), a variant of linear programming in which some but not all of the variables are constrained to be integers. Using an MILP formalization, we can find a globally optimal solution in the joint decision space, and unlike previous work, we jointly exploit global information rather than just individual local (pairwise) scores. To encode the objective in an MILP, we need to introduce additional variables $d_{ij}$, $w_{ij}$, $s_{ij}$ to capture the effect of the signum function, as explained below.
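We additionally enable our MILP to make use of any external equivalence (synonymy) information $E \subseteq \{1, \ldots, N\} \times \{1, \ldots, N\}$ that may be available. In this context, two words are considered synonymous if they are close enough in meaning to be placed on (almost) the same position in the intensity scale. If $(i, j) \in E$, we can safely assume that $a_i$, $a_j$ have near-equivalent intensity, so we should encourage $x_i$, $x_j$ to remain close to each other. The MILP is defined as follows:
\[
\begin{aligned}
\text{maximize} \quad & \sum_{(i,j) \notin E} (w_{ij} - s_{ij}) \cdot \mathrm{score}(a_i, a_j) \; - \sum_{(i,j) \in E} (w_{ij} + s_{ij}) \cdot C \\
\text{subject to} \quad & d_{ij} = x_j - x_i \quad \forall i, j \in \{1, \ldots, N\} \\
& d_{ij} - w_{ij} C \le 0 \quad \forall i, j \in \{1, \ldots, N\} \\
& d_{ij} + (1 - w_{ij}) C > 0 \quad \forall i, j \in \{1, \ldots, N\} \\
& d_{ij} + s_{ij} C \ge 0 \quad \forall i, j \in \{1, \ldots, N\} \\
& d_{ij} - (1 - s_{ij}) C < 0 \quad \forall i, j \in \{1, \ldots, N\} \\
& x_i \in [0, 1] \quad \forall i \in \{1, \ldots, N\} \\
& w_{ij} \in \{0, 1\}, \; s_{ij} \in \{0, 1\} \quad \forall i, j \in \{1, \ldots, N\}
\end{aligned}
\]
The difference variables $d_{ij}$ simply capture differences between $x_i$, $x_j$. $C$ is any very large constant greater than $\sum_{i,j} |\mathrm{score}(a_i, a_j)|$; the exact value is irrelevant. The indicator variables $w_{ij}$ and $s_{ij}$ are jointly used to determine the value of the signum function $\mathrm{sgn}(d_{ij}) = \mathrm{sgn}(x_j - x_i)$. Variables $w_{ij}$ become 1 if and only if $d_{ij} > 0$ and hence serve as indicator variables for weak-strong relationships in the output. Variables $s_{ij}$ become 1 if and only if $d_{ij} < 0$ and hence serve as indicator variables for a strong-weak relationship in the output. The objective encourages $w_{ij} = 1$ for $\mathrm{score}(a_i, a_j) > 0$ and $s_{ij} = 1$ for $\mathrm{score}(a_i, a_j) < 0$. When equivalence (synonymy) information is available, then for $(i, j) \in E$ both $s_{ij} = 0$ and $w_{ij} = 0$ are encouraged.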
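For very small clusters, the same objective can be maximized by plain enumeration, which makes the behavior easy to inspect. The sketch below is not an MILP solver: it swaps in an exhaustive search over all assignments of words to discrete levels, which is only feasible for a handful of words. The name `rank_exhaustive`, the index-pair encoding of scores, and the toy scores (which form a cycle) are assumptions made for this illustration.

```python
from itertools import product

def sgn(v):
    """Signum: -1, 0, or 1."""
    return (v > 0) - (v < 0)

def rank_exhaustive(words, score, synonyms=frozenset()):
    """Maximize sum of sgn(x_j - x_i) * score(i, j) by brute force.

    score maps index pairs (i, j) to weak-strong scores; synonyms is a set
    of index pairs that are penalized by a large constant unless tied.
    """
    n = len(words)
    big_c = 1.0 + sum(abs(s) for s in score.values())
    best_val, best_levels = float("-inf"), None
    for levels in product(range(n), repeat=n):
        val = 0.0
        for (i, j), s in score.items():
            if (i, j) not in synonyms:
                val += sgn(levels[j] - levels[i]) * s
        for i, j in synonyms:
            if levels[i] != levels[j]:
                val -= big_c
        if val > best_val:
            best_val, best_levels = val, levels
    # Map the winning levels to evenly spaced positions in [0, 1].
    uniq = sorted(set(best_levels))
    pos = {lvl: k for k, lvl in enumerate(uniq)}
    denom = max(len(uniq) - 1, 1)
    return {w: pos[l] / denom for w, l in zip(words, best_levels)}, best_val

# Toy cycle: a < b (0.5), b < c (0.4), but c < a (weak evidence, 0.1).
cycle = {(0, 1): 0.5, (1, 2): 0.4, (0, 2): -0.1}
print(rank_exhaustive(["a", "b", "c"], cycle))
```

In the toy cycle, the weakest edge is the one that gets overridden, which is the effect the score magnitudes are meant to have. Passing `synonyms={(0, 1)}` forces the first two words onto the same position.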

Discussion
Our MILP uses intensity evidence of all input pairs together and assimilates all the scores via global transitivity constraints to determine the positions of the input words on a continuous real-valued scale. Hence, our approach addresses drawbacks of local or divide-and-conquer approaches, where adjectives are scored with respect to selected pivot words, and hence many adjectives that lack pairwise evidence with the pivots are not properly classified, although they may have order evidence with some third adjective that could help establish the ranking. Optional synonymy information can further help, as shown in Figure 2.
Figure 2: Equivalence Information: Knowing that $a_m$, $a_2$ are synonyms gives the MILP an indication of where to place $a_n$ on the scale with respect to $a_1$, $a_2$, $a_3$.
Moreover, our MILP also gives higher weight to pairs with higher scores, which is useful when breaking global constraint cycles as in the simple example in Figure 1. If we need to break a constraint violating triangle or cycle, we would have to make arbitrary choices if we were ranking based on sgn(score(a, b)) alone. Instead, we can choose a better ranking based on the magnitude of the pairwise scores. A stronger score between an adjective pair doesn't necessarily mean that they should be further apart in the ranking. It means that these two words are attested together on the Web with respect to the intensity patterns more than with other candidate words. Therefore, we try to respect the order of such word pairs more in the final ranking when we are breaking constraint-violating cycles.

Related Work
Hatzivassiloglou and McKeown (1993) presented the first step towards automatic identification of adjective scales, thoroughly discussing the background of adjective semantics and a means of discovering clusters of adjectives that belong on the same scale, thus providing one way of creating the input for our ranking algorithm. Inkpen and Hirst (2006) study near-synonyms and nuances of meaning differentiation (such as stylistic, attitudinal, etc.). They attempt to automatically acquire a knowledge base of near-synonym differences via an unsupervised decision-list algorithm. However, their method depends on a special dictionary of synonym differences to learn the extraction patterns, while we use only a raw Web-scale corpus.
Mohammad et al. (2013) proposed a method of identifying whether two adjectives are antonymous. This problem is related but distinct, because the degree of antonymy does not necessarily determine their position on an intensity scale. Antonyms (e.g., little, big) are not necessarily on the extreme ends of scales.
Sheinman and Tokunaga (2009) and Sheinman et al. (2012) present the most closely related previous work on adjective intensities. They collect lexico-semantic patterns via bootstrapping from seed adjective pairs to obtain pairwise intensities, albeit using search engine 'hits', which are unstable and problematic (Kilgarriff, 2007). While their approach is primarily evaluated in terms of a local pairwise classification task, they also suggest the possibility of ordering adjectives on a scale using a pivot-based partitioning approach. Although intuitive in theory, the extracted pairwise scores are frequently too sparse for this to work. Thus, many adjectives have no score with a particular headword. In our experiments, we reimplemented this approach and show that our MILP method improves over it by allowing individual pairwise decisions to benefit more from global information. Schulam and Fellbaum (2010) apply the approach of Sheinman and Tokunaga (2009) to German adjectives. Our method extends easily to various foreign languages as described in Section 5.
Another related task is the extraction of lexico-syntactic and lexico-semantic intensity-order patterns from large text corpora (Hearst, 1992; Chklovski and Pantel, 2004; Tandon and de Melo, 2010). Sheinman and Tokunaga (2009) follow Davidov and Rappoport (2008) to automatically bootstrap adjective scaling patterns using seed adjectives and Web hits. These methods thus can be used to provide the input patterns for our algorithm.
VerbOcean by Chklovski and Pantel (2004) extracts various fine-grained semantic relations (including the stronger-than relation) between pairs of verbs, using lexico-syntactic patterns over the Web.
Our approach of jointly ranking a set of words using pairwise evidence is also applicable to the VerbOcean pairs, and should help address similar sparsity issues of local pairwise decisions. Such scales will again be quite useful for language learners and language understanding tools.
de Marneffe et al. (2010) infer yes-or-no answers to questions with responses involving scalar adjectives in a dialogue corpus. They correlate adjectives with ratings in a movie review corpus to find that good appears in lower-rated reviews than excellent.
Finally, there has been a lot of work on measuring the general sentiment polarity of words (Hatzivassiloglou and McKeown, 1997; Hatzivassiloglou and Wiebe, 2000; Turney and Littman, 2003; Liu and Seneff, 2009; Taboada et al., 2011; Yessenalina and Cardie, 2011; Pang and Lee, 2008). Our work instead aims at producing a large, unrestricted number of individual intensity scales for different qualities and hence can help in fine-grained sentiment analysis with respect to very particular content aspects.

Data
Input Clusters In order to obtain input clusters for evaluation, we started out with the satellite cluster or 'dumbbell' structure of adjectives in WordNet 3.0, which consists of two direct antonyms as the poles and a number of other satellite adjectives that are semantically similar to each of the poles (Gross and Miller, 1990). For each antonymy pair, we determined an extended dumbbell set by looking up synonyms and words in related (satellite adjective and 'see-also') synonym sets. We cut such an extended dumbbell into two antonymous halves and treated each of these halves as a potential input adjective cluster.
Most of these WordNet clusters are noisy for the purpose of our task, i.e. they contain adjectives that appear unrelatable on a single scale due to polysemy and semantic drift, e.g. violent with respect to supernatural and affected. Motivated by Sheinman and Tokunaga (2009), we split such hard-to-relate adjectives into smaller scale-specific subgroups using the corpus evidence. 4 For this, we consider an undirected edge between each pair of adjectives that has a non-zero intensity score (based on the Web-scale scoring procedure described in Section 2.1.3). The resulting graph is then partitioned into connected components such that any adjectives in a subgraph are at least indirectly connected via some path and thus much more likely to belong to the same intensity scale. While this does break up partitions whenever there is no corpus evidence connecting them, ordering the adjectives within each such partition remains a challenging task. This is because the Web evidence will still not necessarily directly relate all adjectives (in a partition) to each other. Additionally, the Web evidence may still indicate the wrong direction. Figure 3 shows the size distribution of the resulting partitions.
4 Note that we do not use the WordNet dataset of Sheinman and Tokunaga (2009), as it does not provide full scales. Instead, their annotators only made pairwise comparisons with select words, using a 5-way classification scheme (neutral, mild, very mild, intense, very intense).
Patterns To construct our intensity pattern set, we started with a couple of common rankable adjective seed pairs such as (good, great) and (hot, boiling) and used the Web-scale n-grams corpus (Brants and Franz, 2006) to collect the few most frequent patterns between and around these seed pairs (in both directions). Among these, we manually chose a small set of intuitive patterns that are linguistically useful for ordering adjectives, several of which had not been discovered in previous work. These are shown in Table 1. Note that we only collected patterns that were not ambiguous in the two orders; for example, the pattern '* , not *' is ambiguous because it can be used as both 'good, not great' and 'great, not good'. Alternatively, one can easily also use fully-automatic bootstrapping techniques based on seed word pairs (Hearst, 1992; Chklovski and Pantel, 2004; Yang and Su, 2007; Turney, 2008; Davidov and Rappoport, 2008). However, our semi-automatic approach is a simple and fast process that extracts a small set of high-quality and very general adjective-scaling patterns. This process can quickly be repeated from scratch in any other language. Moreover, as described in Section 5.1, the English patterns can also be projected automatically to patterns in other languages.
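The partitioning of noisy input clusters into connected components, described under Input Clusters, can be sketched as follows. This is a minimal Python illustration: the function name `partition_by_evidence` is hypothetical, scores are assumed to be stored in a dictionary keyed by word pairs, and the example scores are invented.

```python
from collections import defaultdict

def partition_by_evidence(words, score):
    """Split a cluster into connected components of the evidence graph."""
    neighbours = defaultdict(set)
    for (a, b), s in score.items():
        if s != 0:  # an undirected edge for every non-zero intensity score
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, parts = set(), []
    for w in words:
        if w in seen:
            continue
        seen.add(w)
        stack, comp = [w], []
        while stack:  # depth-first traversal of one component
            cur = stack.pop()
            comp.append(cur)
            for nb in sorted(neighbours[cur]):
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        parts.append(sorted(comp))
    return parts

# Invented scores: no evidence links the two senses of the cluster.
toy_words = ["violent", "fierce", "savage", "supernatural", "eerie"]
toy_scores = {
    ("violent", "fierce"): 0.2, ("fierce", "savage"): 0.1,
    ("supernatural", "eerie"): -0.3, ("violent", "supernatural"): 0.0,
}
print(partition_by_evidence(toy_words, toy_scores))
```

Note that savage and violent end up in the same partition even though they share no direct evidence, which is exactly the situation in which the later global ordering step has to rely on transitive inference.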
Development and Test Sets Section 2.1 describes the method for collecting the intensity scores for adjective pairs, using Web-scale n-grams (Brants and Franz, 2006). We relied on a small development set to test the MILP structure and the pairwise score setup. For this, we manually chose 5 representative adjective clusters from the full set of clusters.
The final test set, distinct from this development set, consists of 569 word pairs in 88 clusters, each annotated by two native speakers of English. Both the gold test data (and our code) are freely available. 5 To arrive at this data, we randomly drew 30 clusters each for cluster sizes 3, 4, and 5+ from the histogram of partitioned adjective clusters in Figure 3. While labeling a cluster, annotators could exclude words that they deemed unsuitable to fit on a single shared intensity scale with the rest of the cluster. Fortunately, the partitioning described earlier had already separated most such cases into distinct clusters. The annotators ordered the remaining words on a scale. Words that seemed indistinguishable in strength could share positions in their annotation.
As our goal is to compare scale formation algorithms, we did not include trivial clusters of size 2. On such trivial clusters, the Web evidence alone determines the output and hence all algorithms, including the baseline, obtain the same pairwise accuracy (defined below) of 93.3% on a separate set of 30 random clusters of size 2. Figure 4 shows the distribution of cluster sizes in our main gold set. The inter-annotator agreement in terms of Cohen's κ (Cohen, 1960) on the pairwise classification task with 3 labels (weaker, stronger, or equal/unknown) was 0.64. In terms of pairwise accuracy, the agreement was 78.0%.
5 http://demelo.org/gdm/intensity/

Metrics
In order to thoroughly evaluate the performance of our adjective ordering procedure, we rely on both pairwise and ranking-correlation evaluation metrics. Consider a set of input words $A = \{a_1, a_2, \ldots, a_n\}$ and two rankings for this set: a gold-standard ranking $r_G(A)$ and a predicted ranking $r_P(A)$.

Pairwise Accuracy
For a pair of words $a_i$, $a_j$, we may consider the classification task of choosing one of three labels (<, >, =?) for the case of $a_i$ being weaker, stronger, and equal (or unknown) in intensity, respectively, compared to $a_j$:
\[ L(a_i, a_j) = \begin{cases} < & \text{if } r(a_i) < r(a_j) \\ > & \text{if } r(a_i) > r(a_j) \\ =? & \text{if } r(a_i) = r(a_j) \end{cases} \]
For each pair $(a_i, a_j)$, we compute gold-standard labels $L_G(a_i, a_j)$ and predicted labels $L_P(a_i, a_j)$ as above, and then the pairwise accuracy $PW(A)$ for a particular ordering on $A$ is simply the fraction of pairs that are correctly classified, i.e. for which the predicted label is the same as the gold-standard label:
\[ PW(A) = \frac{\left| \{ (a_i, a_j) : i < j, \; L_G(a_i, a_j) = L_P(a_i, a_j) \} \right|}{\binom{n}{2}} \]
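A small Python sketch of this metric follows, assuming that each ranking is given as a dictionary from words to numeric positions (lower means weaker). The function names and the example rankings are illustrative only.

```python
from itertools import combinations

def label(rank, a, b):
    """Three-way label of a relative to b under one ranking."""
    if rank[a] < rank[b]:
        return "<"
    if rank[a] > rank[b]:
        return ">"
    return "=?"

def pairwise_accuracy(gold, pred):
    """Fraction of unordered word pairs whose labels agree."""
    pairs = list(combinations(sorted(gold), 2))
    correct = sum(label(gold, a, b) == label(pred, a, b) for a, b in pairs)
    return correct / len(pairs)

gold = {"good": 1, "great": 2, "excellent": 3}
pred = {"good": 1, "great": 1, "excellent": 2}  # ties good and great
print(pairwise_accuracy(gold, pred))
```

In this example the predicted tie between good and great counts as an error, so two of the three pairs are classified correctly.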

Ranking Correlation Coefficients
Our second type of evaluation assesses the rank correlation between two ranking permutations (gold-standard and predicted). Many studies use Kendall's tau (Kendall, 1938), which measures the total number of pairwise inversions, while others prefer Spearman's rho (Spearman, 1904), which is based on the squared differences between ranks.
Kendall's tau correlation coefficient We use the $\tau_b$ version of Kendall's correlation metric, as it incorporates a correction for ties (Kruskal, 1958; Dou et al., 2008):
\[ \tau_b = \frac{P - Q}{\sqrt{(P + Q + X_0)(P + Q + Y_0)}} \]
where $P$ is the number of concordant pairs, $Q$ is the number of discordant pairs, $X_0$ is the number of pairs tied only in the first ranking, and $Y_0$ is the number of pairs tied only in the second ranking. Given the two rankings of an adjective set $A$, the gold-standard ranking $r_G(A)$ and the predicted ranking $r_P(A)$, two words $a_i$, $a_j$ are:
• concordant iff both rankings have the same strict order of the two elements, i.e., $r_G(a_i) > r_G(a_j)$ and $r_P(a_i) > r_P(a_j)$, or $r_G(a_i) < r_G(a_j)$ and $r_P(a_i) < r_P(a_j)$.
• discordant iff the two rankings have an inverted strict order of the two elements, i.e., $r_G(a_i) > r_G(a_j)$ and $r_P(a_i) < r_P(a_j)$, or $r_G(a_i) < r_G(a_j)$ and $r_P(a_i) > r_P(a_j)$.
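The tie-corrected coefficient can be computed directly from the pair counts. The following Python sketch assumes rankings are dictionaries from words to numeric positions and returns `None` when the denominator is zero, i.e. when one ranking ties every pair. The function name `kendall_tau_b` is chosen for this illustration.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(gold, pred):
    """Kendall's tau-b from concordant, discordant, and tied pair counts."""
    p = q = x0 = y0 = 0
    for a, b in combinations(list(gold), 2):
        g = gold[a] - gold[b]
        r = pred[a] - pred[b]
        if g == 0 and r == 0:
            continue            # tied in both rankings: ignored
        elif g == 0:
            x0 += 1             # tied only in the first ranking
        elif r == 0:
            y0 += 1             # tied only in the second ranking
        elif (g > 0) == (r > 0):
            p += 1              # concordant
        else:
            q += 1              # discordant
    denom = sqrt((p + q + x0) * (p + q + y0))
    return (p - q) / denom if denom > 0 else None

gold = {"good": 1, "great": 2, "excellent": 3}
pred = {"good": 1, "great": 1, "excellent": 2}
print(kendall_tau_b(gold, pred))
```

Here two pairs are concordant and one pair is tied only in the prediction, giving 2 / sqrt(2 * 3), which is roughly 0.82.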
Spearman's rho correlation coefficient For two n-sized ranked lists $\{x_i\}$ and $\{y_i\}$, the Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranks of variables:
\[ \rho = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
Here, $\bar{x}$ and $\bar{y}$ denote the means of the values in the respective lists. We use the standard procedure for handling ties correctly. Tied values are assigned the average of all ranks of items sharing the same value in the ranked list sorted in ascending order of the values.
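The tie handling is the only non-obvious part of this computation, so a short Python sketch may help. It assigns average ranks to tied values and then applies the Pearson formula to the ranks. The function names are illustrative, and `None` is returned when one list is constant, since the coefficient is then undefined.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks in ascending order; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation between the average ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)

print(average_ranks([1, 1, 2]))
print(spearman_rho([1, 2, 3], [1, 1, 2]))
```

For the tied list [1, 1, 2] the first two items share rank 1.5, and the resulting coefficient against [1, 2, 3] is 1.5 / sqrt(3), roughly 0.87.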
Handling Inversions While annotating, we sometimes observed that the ordering itself was very clear but the annotators disagreed about which end of a particular scale was to count as the strong one, e.g. when transitioning from soft to hard or from alpha to beta. We thus also report average absolute values of both correlation coefficients, as these properly account for anticorrelations. Our test set only contains clusters of size 3 or larger, so there is no need to account for inversions in clusters of size 2.

Results
In Table 3, we use the evaluation metrics mentioned above to compare several different approaches.
Web Baseline The first baseline simply reflects the original pairwise Web-based intensity scores. We classify (with one of 3 labels) a given pair of adjectives using the Web-based intensity scores (as described in Section 2.1.3) as follows:
\[ L_{web}(a_i, a_j) = \begin{cases} < & \text{if } \mathrm{score}(a_i, a_j) > 0 \\ > & \text{if } \mathrm{score}(a_i, a_j) < 0 \\ =? & \text{if } \mathrm{score}(a_i, a_j) = 0 \end{cases} \]
Since $\mathrm{score}(a_i, a_j)$ represents the weak-strong score of the two adjectives, a more positive value means a higher likelihood of $a_i$ being weaker (<, on the left) in intensity than $a_j$. In Table 3, we observe that the (micro-averaged) pairwise accuracy, as defined earlier, for the original Web baseline is 48.2%, while the ranking measures are undefined because the individual pairs do not lead to a coherent scale.

Divide-and-Conquer
The divide-and-conquer baseline recursively splits a set of words into three subgroups, placed to the left (weaker), on the same position (no evidence), or to the right (stronger) of a given randomly chosen pivot word.
While this approach shows only a minor improvement in terms of the pairwise accuracy (50.6%), its main benefit is that one obtains well-defined intensity scales rather than just a collection of pairwise scores.
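The pivot-based recursion can be sketched as below. This is an illustrative Python version, not the evaluated implementation: the pivot chooser is passed in as a parameter (defaulting to a random choice) so that the effect of the pivot can be shown deterministically, and the scores are invented.

```python
import random

def lookup(score, a, b):
    """Antisymmetric score lookup; 0 when there is no evidence."""
    if (a, b) in score:
        return score[(a, b)]
    return -score.get((b, a), 0.0)

def divide_and_conquer(words, score, choose=random.choice):
    """Return tie groups ordered from weakest to strongest."""
    if not words:
        return []
    pivot = choose(words)
    weaker, same, stronger = [], [pivot], []
    for w in words:
        if w == pivot:
            continue
        s = lookup(score, w, pivot)  # > 0 means w is weaker than the pivot
        if s > 0:
            weaker.append(w)
        elif s < 0:
            stronger.append(w)
        else:
            same.append(w)           # no evidence: same position as pivot
    return (divide_and_conquer(weaker, score, choose) + [same]
            + divide_and_conquer(stronger, score, choose))

# Invented sparse evidence: no direct score for (good, excellent).
sparse = {("good", "great"): 0.3, ("great", "excellent"): 0.2}
words = ["good", "great", "excellent"]
print(divide_and_conquer(words, sparse, choose=lambda ws: ws[0]))
print(divide_and_conquer(words, sparse, choose=lambda ws: ws[len(ws) // 2]))
```

With good as the pivot, excellent has no direct evidence and is placed on the same position as good. With great as the pivot, the full chain is recovered. This sensitivity to missing pivot evidence is the weakness that a joint formulation avoids.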

Sheinman and Tokunaga
The approach by Sheinman and Tokunaga (2009) involves a similar divide-and-conquer based partitioning in the first phase, except that their method makes use of synonymy information from WordNet and uses all synonyms in WordNet's synset for the headword as neutral pivot elements (if the headword is not in WordNet, then the word with the maximal unigram frequency is chosen). In the second phase, their method performs pairwise comparisons within the more intense and less intense subgroups. We reimplement their approach here, using the Google n-grams dataset instead of online Web search engine hits. We observe a small improvement over the Web baseline in terms of pairwise accuracy. Note that the rank correlation measure scores are undefined for their approach. This is because in some cases their method placed all words on the same position in the scale, which these measures cannot handle even in their tie-corrected versions. Overall, the Sheinman and Tokunaga approach does not aggregate information sufficiently well at the global level and often fails to make use of transitive inference.
MILP Our MILP exploits the same pairwise scores to induce significantly more accurate pairwise labels with 69.6% accuracy, a 41% relative error reduction over the Web baseline, 38% over Divide-and-Conquer, and 32% over Sheinman and Tokunaga (2009). We further see that our MILP method is able to exploit external synonymy (equivalence) information (using synonyms marked by the annotators). The accuracy of the pairwise scores as well as the quality of the overall ranking increase even further to 78.2%, approaching the human inter-annotator agreement. In terms of average correlation coefficients, we observe similar improvement trends from the MILP, but of different magnitudes, because these averages give small clusters the same weight as larger ones.
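As a check on the relative error reductions, the two figures that can be verified from the accuracies stated in the text work out as follows, taking the error to be 100% minus the pairwise accuracy. The Sheinman and Tokunaga figure is not recomputed here, because their exact accuracy is not stated in this section.

```latex
\[
\text{Web baseline:}\quad
\frac{(100 - 48.2) - (100 - 69.6)}{100 - 48.2}
= \frac{21.4}{51.8} \approx 41.3\%
\]
\[
\text{Divide-and-Conquer:}\quad
\frac{(100 - 50.6) - (100 - 69.6)}{100 - 50.6}
= \frac{19.0}{49.4} \approx 38.5\%
\]
```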

Analysis
Confusion Matrices For a given approach, we can study the confusion matrix obtained by cross-tabulating the gold classification with the predicted classification of every unique pair of adjectives in the ground truth data. Table 4 shows the confusion matrix for the Web baseline. We observe that due to the sparsity of pairwise intensity order evidence, the baseline method predicts too many ties. Table 5 provides the confusion matrix for the MILP (without external equivalence information) for comparison. Although the middle column still shows that the MILP predicts more ties than human annotators, we find that a clear majority of all unique pairs are now correctly placed along the diagonal. This confirms that our MILP successfully infers new ordering decisions, although it uses the same input (corpus evidence) as the baseline. The remaining ties are mostly just the result of pairs for which there simply is no evidence at all in the input Web counts. Note that this problem could for instance be circumvented by relying on a crowdsourcing approach: a few dispersed tie-breakers are enough to allow our MILP to correct many other predictions.
Predicted Examples Finally, in Table 6, we provide a selection of real results obtained by our algorithm. For instance, it correctly inferred that terrifying is more intense than creepy or scary, although the Web pattern counts did not provide any explicit information about these word pairs. In some cases, however, the Web evidence did not suffice to draw the right conclusions, or it was misleading due to issues like polysemy (as for the word funny).
While we show results on gold-standard chains here for evaluation purposes, in practice one can also recombine two [0, 1] chains for a pair of antonymic clusters to form a single scale from [−1, 1] that visualizes the full spectrum of available adjectives along a dimension, from adjacent all the way to removed, or from black to glaring.
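One simple way to recombine two chains is to mirror the scale of one pole around zero. The Python sketch below assumes exactly that mapping (positions of the negative pole are negated), which is only one possible reading of the recombination step. The function name and all positions in the example are invented for illustration.

```python
def combine_antonym_chains(negative_pole, positive_pole):
    """Merge two [0, 1] scales of antonymous clusters into one [-1, 1] scale."""
    combined = {w: -x for w, x in negative_pole.items()}  # mirror around zero
    combined.update(positive_pole)
    return sorted(combined.items(), key=lambda kv: kv[1])

# Invented positions for two antonymous halves of a temperature scale.
cold_side = {"cool": 0.2, "cold": 0.6, "freezing": 1.0}
hot_side = {"warm": 0.2, "hot": 0.6, "scorching": 1.0}
for word, position in combine_antonym_chains(cold_side, hot_side):
    print(f"{position:+.1f}  {word}")
```

The strongest word of each cluster ends up at one extreme of the combined scale, and the weakest words of both clusters meet near zero.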

Extension to Multilingual Ordering
Our method for globally ordering words on a scale can easily be applied to languages other than English. The entire process is language-independent as long as the required resources are available and a small number of patterns are chosen. For morphologically rich languages, the information extraction step of course may require additional morphological analysis tools for stemming and aggregating frequencies across different forms.
Alternatively, a cross-lingual projection approach is possible at multiple levels, utilizing information from the English data and ranking. As the first step, the set of words in the target language that we wish to rank can be projected from the English word set if necessary, e.g., as shown in de Melo and Weikum (2009). Next, we outline two projection methods for the ordering step. The first method is based on projection of the English intensity-ordering patterns to the new language, and then using the same MILP as described in Section 2.2. In the second method, we also change the MILP and add cross-lingual constraints to better inform the target language's adjective ranking. A detailed empirical evaluation of these approaches remains future work.

Cross-Lingual Pattern Projection
Instead of creating new patterns, in many cases we obtain quite adequate intensity patterns by using cross-lingual projection. We simply take several adjective pairs, instantiate the English patterns with them, and obtain new patterns using a machine translation system. Filling the wildcards in a pattern, say '* but not *', with good/excellent results in 'good but not excellent'. This phrase is then translated into the target language using the translation system, say into German 'gut aber nicht ausgezeichnet'. Finally, we put back the wildcards in the place of the translations of the adjective words, here gut and ausgezeichnet, to get the corresponding German pattern '* aber nicht *'. Table 7 shows various German intensity patterns that we obtain by projecting from the English patterns as described. The process is repeated with multiple adjective pairs in case different variants are returned, e.g. due to morphology. Most of these translations deliver useful results. Now that we have the target language adjectives and the ranking patterns, we can compute the pairwise intensity scores using large-scale data in that language. We can use the Google n-grams corpora for 10 European languages (Brants and Franz, 2009), and also for Chinese (LDC2010T02) and Japanese (LDC2009T08). For other languages, one can use available large raw-text corpora or Web crawling tools.
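The projection steps can be sketched as follows. This is an illustration only: the `translate` callable stands in for an arbitrary machine translation system and is mocked here by a toy lookup table, the wildcard symbol `*` is our notational assumption, and locating the translated adjectives by exact token match is a simplification.

```python
from typing import Callable, Optional, Tuple

WILDCARD = "*"  # assumed wildcard symbol


def project_pattern(pattern: str, pair: Tuple[str, str],
                    translate: Callable[[str], str]) -> Optional[str]:
    """Project an English intensity pattern into the target language."""
    first, second = pair
    # 1. Fill the two wildcards with the adjective pair, left to right.
    phrase = pattern.replace(WILDCARD, first, 1).replace(WILDCARD, second, 1)
    # 2. Translate the full phrase and each adjective on its own.
    target_phrase = translate(phrase)
    target_first, target_second = translate(first), translate(second)
    # 3. Put the wildcards back; give up if an adjective cannot be located,
    #    e.g. because it appears in a different inflected form.
    tokens = target_phrase.split()
    if target_first not in tokens or target_second not in tokens:
        return None
    return " ".join(WILDCARD if t in (target_first, target_second) else t
                    for t in tokens)


# Toy stand-in for a translation system (hypothetical lookup table).
TOY_MT = {
    "good but not excellent": "gut aber nicht ausgezeichnet",
    "good": "gut",
    "excellent": "ausgezeichnet",
}
german = project_pattern("* but not *", ("good", "excellent"), TOY_MT.__getitem__)
```

A `None` result for one adjective pair is one reason to repeat the process with several pairs and to keep the pattern variants that recur.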

Cross-Lingual MILP
To improve the rankings for lesser-resourced languages, we can further use a joint MILP approach for the new language to which we want to transfer this process. Additional constraints between the English words and their corresponding target language translations, in combination with the English ranking information, allow the algorithm to obtain better rankings for the target words whenever the non-English target language corpus does not provide sufficient intensity order evidence. In this case, the input set A contains words in multiple languages. The Web intensity scores $\mathrm{score}(a_i, a_j)$ should be set to zero when comparing words across languages. We instead link them using a translation table $T \subseteq \{1, \ldots, N\} \times \{1, \ldots, N\}$ from a translation dictionary or phrase table. Here, $(i, j) \in T$ signifies that $a_i$ is a translation of $a_j$. We do not require a bijective relationship between them (i.e., translations need not be unique). The objective function is augmented by adding the new term $\sum_{(i,j) \in T} (w'_{ij} + s'_{ij})\, C_T$ for a constant $C_T > 0$ that determines how much weight we assign to translations as opposed to the corpus count scores. The MILP is extended by adding extra constraints for all $(i, j) \in T$.
The variables $d_{ij}$, as before, encode distances between positions of words on the scale, but now also include cross-lingual pairs of words in different languages. The new constraints encourage translational equivalents to remain close to each other, preferably within a desired (but not strictly enforced) maximum distance $d_{\max}$. The new variables $w'_{ij}$, $s'_{ij}$ are similar to $w_{ij}$, $s_{ij}$ in the standard MILP. However, $w'_{ij}$ becomes 1 if and only if $d_{ij} \geq -d_{\max}$, and $s'_{ij}$ becomes 1 if and only if $d_{ij} \leq d_{\max}$. If both $w'_{ij}$ and $s'_{ij}$ are 1, then the two words have a small distance $-d_{\max} \leq d_{ij} \leq d_{\max}$. The augmented objective function explicitly encourages this for translational equivalents. Overall, this approach thus allows evidence from a language with more Web evidence to improve the process of adjective ordering in lesser-resourced languages.
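The effect of the additional objective term can be illustrated by evaluating it for a fixed assignment of positions. The sketch below does not solve the MILP; it only computes the value that the new indicator variables would contribute. We assume the distance convention d_ij = x_j − x_i, and the positions, translation table, and parameter values are hypothetical.

```python
from typing import List, Sequence, Tuple


def translation_bonus(x: Sequence[float], T: List[Tuple[int, int]],
                      d_max: float, C_T: float) -> float:
    """Value of the cross-lingual objective term for given scale positions x."""
    total = 0.0
    for i, j in T:
        d_ij = x[j] - x[i]                  # assumed distance convention
        w_new = 1 if d_ij >= -d_max else 0  # indicator: d_ij >= -d_max
        s_new = 1 if d_ij <= d_max else 0   # indicator: d_ij <= d_max
        total += (w_new + s_new) * C_T
    return total


# Hypothetical positions for: good, excellent, gut, ausgezeichnet.
positions = [0.30, 0.90, 0.35, 0.40]
T = [(0, 2), (1, 3)]  # good-gut, excellent-ausgezeichnet
bonus = translation_bonus(positions, T, d_max=0.1, C_T=1.0)
```

In this toy assignment, good and gut lie within d_max of each other and earn the full reward of 2 C_T, whereas ausgezeichnet lies far below excellent and earns only C_T; a solver maximizing the augmented objective would thus be encouraged to move ausgezeichnet upward unless the target-language corpus evidence outweighs the reward.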

Conclusion
In this work, we have presented an approach to the challenging and little-studied task of ranking words in terms of their intensity on a continuous scale. We address the issue of sparsity of the intensity order evidence in two ways. First, pairwise intensity scores are computed using linguistically intuitive patterns in a very large, Web-scale corpus. Next, a Mixed Integer Linear Program (MILP) expands on this further by inferring new relative relationships. Instead of making ordering decisions about word pairs independently, our MILP considers the joint decision space and factors in e.g. how two adjectives relate to some third adjective, thus enforcing global constraints such as transitivity.
Our approach is general enough to allow additional evidence such as synonymy in the MILP, and can straightforwardly be applied to other word classes (such as verbs), and to other languages (monolingually as well as cross-lingually). The overall results across multiple metrics are substantially better than previous approaches, and fairly close to human agreement on this challenging task.