Using Pivot-Based Paraphrasing and Sentiment Profiles to Improve a Subjectivity Lexicon for Essay Data

We demonstrate a method of improving a seed sentiment lexicon developed on essay data by using a pivot-based paraphrasing system for lexical expansion coupled with sentiment profile enrichment using crowdsourcing. Profile enrichment alone yields up to 15% improvement in the accuracy of the seed lexicon on 3-way sentence-level sentiment polarity classification of essay data. Using lexical expansion in addition to sentiment profiles provides a further 7% improvement in performance. Additional experiments show that the proposed method is also effective with other subjectivity lexicons and in a different domain of application (product reviews).


Introduction
In almost any sub-field of computational linguistics, creation of working systems starts with an investment in manually-generated or manually-annotated data for computational exploration. In subjectivity and sentiment analysis, annotation of training and testing data and construction of subjectivity lexicons have been the loci of costly labor investment.
Many subjectivity lexicons are mentioned in the literature. The two large manually built lexicons for English, the General Inquirer (Stone et al., 1966) and the lexicon provided with the OpinionFinder distribution (Wiebe and Riloff, 2005), are available for research and education only and under a GNU GPL license that disallows their incorporation into proprietary materials, respectively. Those wishing to integrate sentiment analysis into products, along with those studying subjectivity in languages other than English, in specific domains such as finance, or in particular genres such as MySpace comments, have reported constructing their own lexicons (Taboada et al., 2011; Loughran and McDonald, 2011; Thelwall et al., 2010; Rao and Ravichandran, 2009; Jijkoun and Hofmann, 2009; Pitel and Grefenstette, 2008; Mihalcea et al., 2007).
In this paper, we address the step of expanding a small-scale, manually built subjectivity lexicon (a seed lexicon, typically for a domain or language in question) into a much larger but noisier lexicon using an automatic procedure. We present a novel expansion method using a state-of-the-art paraphrasing system. The expansion yields a 4-fold increase in lexicon size; yet the expansion alone is insufficient to improve performance on sentence-level sentiment polarity classification.
In this paper we test the following hypothesis. We suggest that the effectiveness of the expansion is hampered by (1) introduction of opposite-polarity items, such as introducing resolute as an expansion of forceful, or remarkable as an expansion of peculiar; (2) introduction of weakly polar, neutral, or ambiguous words as expansions of polar seed words, such as generating concern as an expansion of anxiety or future as an expansion of aftermath; (3) inability to distinguish between stronger or clear-cut versus weaker or ambiguous sentiment and to make differential use of those.
We address items (1) and (2) by enriching the lexicon with sentiment profiles (section 3), and propose a way of effectively utilizing this information for the sentence-level sentiment polarity classification task (sections 5 and 6). Profile-enrichment alone yields up to 15% increase in performance for the seed lexicon when using different machine learning algorithms; paraphraser-based expansion with sentiment profiles improves performance by an additional 7%. Overall, we observe an improvement of up to 25% in classification accuracy over the seed lexicon without profiles.
In section 7, we present comparative evaluations, demonstrating the competitiveness of the expanded and profile-enriched lexicon, as well as the effectiveness of the expansion and enrichment paradigm presented here for different subjectivity lexicons, different lexical expansion methods, and in a different domain of application (product reviews).

Building Subjectivity Lexicons
The goal of our sentiment analysis project is to allow for the identification of sentiment in sentences that appear in essay responses to a variety of tasks designed to test English proficiency in both native- and non-native-speaker populations, in standardized assessment as well as in instructional settings. In order to allow for the future use of the sentiment analyzer in a proprietary product and to ensure its fit to the test-taker essay domain, we began our work with the construction of a seed lexicon relying on our materials (section 2.1). We then used a statistical paraphrasing system to expand the seed lexicon (section 2.2).

Seed Lexicon
In order to inform the process of lexicon construction, we randomly sampled 5,000 essays from a corpus of about 100,000 essays containing writing samples across many topics. Essays were responses to several different writing assignments, including graduate school entrance exams, non-native English speaker proficiency exams, and professional licensure exams. Our seed lexicon is a combination of (1) positive and negative sentiment words manually selected from a full list of word types in these data, and (2) words marked in a small-scale annotation of a sample of sentences from these data for all positive and negative words. A more detailed description of the construction of the seed lexicon can be found in Beigman Klebanov et al. (2012). The seed lexicon contains 749 single words, 406 positive and 343 negative.

Expanded Lexicon
We used a pivot-based lexical and phrasal paraphrase generation system (Madnani and Dorr, 2013). The paraphraser implements the pivot-based method as described by Bannard and Callison-Burch (2005) with several additional filtering mechanisms to increase the precision of the extracted pairs. The pivot-based method utilizes the inherent monolingual semantic knowledge from bilingual corpora: We first identify phrasal correspondences between English and a given foreign language F, then map from English to English by following translation units from English to the other language and back. For example, if the two English phrases $e_1$ and $e_2$ both correspond to the same foreign phrase $f$, then they may be considered to be paraphrases of each other with the following probability:

$$p(e_1 \mid e_2) \approx p(e_1 \mid f)\, p(f \mid e_2)$$

If there are several pivot phrases that link the two English phrases, then they are all used in computing the probability:

$$p(e_1 \mid e_2) \approx \sum_{f'} p(e_1 \mid f')\, p(f' \mid e_2)$$
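As an illustration of the pivoting step, the following is a minimal Python sketch, not the paraphraser of Madnani and Dorr (2013). It assumes two toy phrase tables with made-up probabilities; the function name `pivot_paraphrase_probs` and the table layout are hypothetical, and the additional noise filters are not modeled.

```python
from collections import defaultdict

def pivot_paraphrase_probs(p_f_given_e, p_e_given_f, source):
    """Score English paraphrases of `source` by summing over shared pivots.

    p_f_given_e[e][f] holds p(f | e); p_e_given_f[f][e] holds p(e | f).
    Returns {candidate: sum over pivots f of p(candidate | f) * p(f | source)}.
    """
    scores = defaultdict(float)
    for f, p_f in p_f_given_e.get(source, {}).items():
        for candidate, p_c in p_e_given_f.get(f, {}).items():
            if candidate != source:  # a phrase is not its own paraphrase
                scores[candidate] += p_c * p_f
    return dict(scores)

# Toy phrase tables with invented numbers (French pivots), for illustration only.
P_F_GIVEN_E = {"anxiety": {"inquiétude": 0.6, "anxiété": 0.4}}
P_E_GIVEN_F = {
    "inquiétude": {"anxiety": 0.3, "concern": 0.5, "disquiet": 0.2},
    "anxiété": {"anxiety": 0.7, "disquiet": 0.3},
}

scores = pivot_paraphrase_probs(P_F_GIVEN_E, P_E_GIVEN_F, "anxiety")
ranked = sorted(scores, key=scores.get, reverse=True)  # best candidate first
```

In this toy setting, disquiet is reachable through both pivots, so its score accumulates two terms, while concern is reachable through one pivot only.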

Seed       Expansion      Seed        Expansion
abuse      exploitation   costly      onerous
accuse     reproach       dangerous   unsafe
anxiety    disquiet       improve     reinforce
conflict   crisis         invaluable  precious
Table 1: Examples of expansions generated by the paraphraser.

Some examples of expansions generated by the paraphraser are shown in Table 1. More details about this kind of approach can be found in Bannard and Callison-Burch (2005). We use the French-English parallel corpus (approximately 1.2 million sentences) from the corpus of European parliamentary proceedings (Koehn, 2005) as the data on which pivoting is performed to extract the paraphrases. However, the base paraphrase system is susceptible to large amounts of noise due to the imperfect bilingual word alignments. Therefore, we implement additional heuristics in order to minimize the number of noisy paraphrase pairs (Madnani and Dorr, 2013). For example, one such heuristic filters out pairs where a function word may have been inferred as a paraphrase of a content word. For the lexicon expansion experiment reported here, we use the top 15 single-word paraphrases for every word from the seed lexicon, excluding morphological variants of the seed word. This process results in an expanded lexicon of 2,994 different words, 1,666 positive and 1,761 negative (433 words are in both the positive and the negative lists). The expanded lexicon includes the seed lexicon.

Inducing sentiment profiles
Let $\gamma_w$ be the sentiment profile of the word $w$:

$$\gamma_w = (p^{pos}_w,\; p^{neg}_w,\; p^{neu}_w)$$

where $\sum_{i \in \{pos, neg, neu\}} p^i_w = 1$. Thus, a sentiment profile of a word is essentially a 3-sided coin, corresponding to its probability of coming out positive, negative, and neutral, respectively.

Estimating sentiment profiles
Our goal is to estimate the profile using outcomes of multiple trials as follows. For every word, a person is shown the word and asked whether it is positive, negative, or neutral. A person's decision is modeled as flipping the coin corresponding to the word, and recording the outcome: positive, negative, or neutral. We run N=20 such trials for every word in the expanded lexicon using the CrowdFlower crowdsourcing site, for a total cost of $800. We use the maximum likelihood estimate of the sentiment profile:

$$\hat{p}^i_w = n^i_w$$

where $n^i_w$ is the proportion of the N trials on the word $w$ that fell in cell $i \in \{pos, neg, neu\}$. Table 2 shows some estimated profiles.
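The estimation step amounts to computing label proportions per word. A minimal Python sketch follows, assuming the crowd judgements for a word arrive as a list of label strings; the function name `estimate_profile` and the example votes are hypothetical.

```python
from collections import Counter

LABELS = ("pos", "neg", "neu")

def estimate_profile(judgements):
    """Maximum likelihood estimate of a 3-way profile: the share of each label."""
    n_trials = len(judgements)
    if n_trials == 0:
        raise ValueError("need at least one judgement")
    counts = Counter(judgements)
    return {label: counts[label] / n_trials for label in LABELS}

# Hypothetical outcome of N=20 trials for one word.
votes = ["neg"] * 9 + ["neu"] * 8 + ["pos"] * 3
profile = estimate_profile(votes)
```

Labels that never occur simply receive an estimate of 0, which is why the confidence intervals discussed next matter for words with extreme estimates.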
Following Goodman (1965) and Quesenberry and Hurst (1964), we calculate confidence intervals for the parameters $p^i_w$:

$$\frac{B + 2N\hat{p}^i_w \pm \sqrt{B\,\bigl(B + 4N\hat{p}^i_w(1 - \hat{p}^i_w)\bigr)}}{2(N + B)}$$

For confidence level $1 - \alpha$ that all $p^i_w$, $i \in \{pos, neg, neu\}$, are simultaneously within their respective intervals, the value of $B$ is determined as the upper $(\alpha/3) \times 100$th percentile of the $\chi^2$ distribution with one degree of freedom. We use $\alpha = 0.1$, resulting in $B = 4.55$. The resulting interval is about 0.2 around the estimated value when $\hat{p}^i_w$ is close to 0.5, and somewhat narrower for $\hat{p}^i_w$ closer to 0 or 1. We will use this information when inducing features from the profiles.
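The interval widths quoted above can be checked numerically. The sketch below uses the standard closed form for Goodman-style simultaneous multinomial intervals, written in terms of the estimated proportion and N; that this is exactly the variant applied by the authors is an assumption, and the function name `goodman_interval` is hypothetical. The value B=4.55 is taken from the text.

```python
import math

def goodman_interval(p_hat, n_trials, b=4.55):
    """Lower and upper bounds for one multinomial proportion (assumed closed form)."""
    n_i = p_hat * n_trials  # count of trials that fell in this cell
    half = math.sqrt(b * (b + 4.0 * n_i * (n_trials - n_i) / n_trials))
    denom = 2.0 * (n_trials + b)
    return (b + 2.0 * n_i - half) / denom, (b + 2.0 * n_i + half) / denom

low_mid, high_mid = goodman_interval(0.5, 20)   # estimate near 0.5
low_zero, high_zero = goodman_interval(0.0, 20) # estimate at the boundary
```

With N=20, the interval around an estimate of 0.5 extends roughly 0.2 in each direction, and the interval for an estimate of 0 is narrower, in line with the description above.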

Sentiment distributions of the lexicons
The estimated sentiment profiles per word allow us to visualize the distributions of the two lexicons. In Figure 1, we plot the number of entries in the lexicon as a function of the difference in positive and negative parts of the profile, in 0.2-wide bins. Thus, a word $w$ would be in the second-leftmost bin if $-0.8 < (\hat{p}^{pos}_w - \hat{p}^{neg}_w) < -0.6$. While the expansion process more than doubles the number of words in the highest bins for both the positive and the negative polarity, it clearly introduces a large number of words in the low and medium bins into the lexicon. It is in this sense that the expansion process is noisy; apparently, seed words with clear and strong polarity are often expanded into low intensity, neutral, or ambiguous ones, as in pairs like absurd/laughable, deadly/fateful, anxiety/concern shown in Table 2.
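The binning behind Figure 1 can be sketched in a few lines of Python. The profile values below are invented for illustration, the function name `polarity_bin` is hypothetical, and the treatment of values that fall exactly on a bin boundary is an assumption, since the text does not specify it.

```python
import math
from collections import Counter

def polarity_bin(p_pos, p_neg, width=0.2):
    """Return the 0-based index of the bin holding p_pos - p_neg (0 = leftmost)."""
    n_bins = round(2.0 / width)
    index = math.floor((p_pos - p_neg + 1.0) / width)
    return min(max(index, 0), n_bins - 1)  # a difference of +1 stays in the last bin

# Hypothetical estimated profiles: word -> (p_pos, p_neg).
profiles = {
    "deadly": (0.0, 0.95),
    "fateful": (0.1, 0.8),
    "concern": (0.3, 0.35),
    "precious": (0.95, 0.0),
}
histogram = Counter(polarity_bin(p, n) for p, n in profiles.values())
```

Words landing in the middle bins, such as the hypothetical profile for concern here, are the low-intensity or ambiguous entries that the expansion tends to introduce.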

Related Work
The most popular seed expansion methods discussed in the literature are based on WordNet (Miller, 1995) or another lexicographic resource, on distributional similarity with the seeds, or on a mixture thereof (Cruz et al., 2011; Baccianella et al., 2010; Velikovich et al., 2010; Qiu et al., 2009; Mohammad et al., 2009; Esuli and Sebastiani, 2006; Kim and Hovy, 2004; Andreevskaia and Bergler, 2006; Hu and Liu, 2004; Kanayama and Nasukawa, 2006; Strapparava and Valitutti, 2004; Kamps et al., 2004; Takamura et al., 2005; Turney and Littman, 2003; Hatzivassiloglou and McKeown, 1997). The paraphrase-based expansion method is in the distributional similarity camp; we also experimented with WordNet-based expansion as described in section 7.2. The task of assigning sentiment profiles to words in a sentiment lexicon has been addressed in the literature. SentiWordNet assigns profiles to all words in WordNet based on a propagation algorithm from a small seed set manually annotated by a small number of judges (Baccianella et al., 2010; Cerini et al., 2007). Andreevskaia and Bergler (2006) use graph propagation algorithms on WordNet to assign centrality scores in positive and negative categories; a similar approach based on web-scale co-occurrence graphs is discussed in Velikovich et al. (2010). Thelwall et al. (2010) manually annotated a set of words for strength of sentiment and used machine learning to fine-tune it. Taboada et al. (2011) produced an expert annotation of their lexicon with strength of sentiment. Subasic and Huettner (2001) manually built an affect lexicon with intensities. Wiebe and Riloff (2005) classified lexicon entries into weakly and strongly subjective, based on their relative frequency of appearance in subjective versus objective contexts in a large annotated dataset.
Our sentiment profiles are best thought of as relatively fine-grained priors for the sentiment expressed by a given word out of context. These reflect a mixture of strength of sentiment ($\hat{p}^{pos}_{good} > \hat{p}^{pos}_{decent}$), contextual ambiguity (concern can be interpreted as similar to worry or to care, as in "Her condition was causing concern" versus "He showed genuine concern for her"), and dominance of a polar connotation (abandon has $\hat{p}^{neg} = 1$; it has a negative overtone even if the actual sense is not that of desert but of vacate, as in "You must abandon your office").
To the best of our knowledge, this paper presents the first attempt to integrate judgements obtained through crowdsourcing on a large scale into a sentiment lexicon, showing the effectiveness of this lexicon-enrichment procedure for a sentiment classification task.

Using profiles for sentence-level sentiment polarity classification
To evaluate the usefulness of the lexicons, we use them to generate features for machine learning systems, and compare performance on 3-way sentence-level sentiment polarity classification. To ensure robustness of the observed trends, we experiment with a number of machine learning algorithms: SVM Linear and RBF, Naïve Bayes, Logistic Regression (using WEKA (Hall et al., 2009)), and c5.0 Decision Trees (Quinlan, 1993).

Data
We generated the data for training and testing the machine learning systems as follows. We used our pool of 100,000 essays to sample a second, non-overlapping set of 5,000 essays, so that no essay used for lexicon development appears in this set. From these essays, we randomly sampled 550 sentences, and submitted them to sentiment polarity annotation by two experienced research assistants; the 50 double-annotated sentences showed κ=0.8. The TEST set contains the 43 agreed double-annotated sentences and an additional 238 sampled from the 500 single-annotated sentences, 281 sentences in total. The category distribution in the TEST set is 46.6% neutral, 32.4% positive, and 21% negative. The TRAIN set contains the remaining sentences, plus positive, negative, and neutral sentences annotated during lexicon development, for a total of 1,631 sentences. The category distribution in TRAIN is 39% neutral, 35% positive, 26% negative.

From lexicons to features
Our goal is to evaluate the impact of sentiment profiles on sentence-level sentiment polarity classification for the seed and the expanded lexicons, while also looking for the most effective ways to represent this information for machine learners.
We implement two baseline systems. One provides the machine learner with the most detailed information contained in a lexicon: BL-full has 2 features for every lexicon word, taking the values (1,0) for a positive match in a sentence, (0,1) for a negative match, (1,1) for a word in both the positive and the negative parts of the lexicon, and (0,0) otherwise.
The second baseline provides the machine learner with only summary information about the overall sentiment of the sentence. BL-sum uses only 2 features: (1) the total count of positive words in the sentence; (2) the total count of negative words in the sentence, according to the given lexicon.
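The two baseline representations can be sketched as follows. This is a minimal Python illustration under stated assumptions: sentences are pre-tokenized lists of lowercase words, the toy lexicon is invented, and the function names `bl_full` and `bl_sum` are hypothetical.

```python
def bl_full(tokens, vocab, pos_words, neg_words):
    """Two indicator features per lexicon word, in the fixed order given by vocab."""
    present = set(tokens)
    features = []
    for word in vocab:
        hit = word in present
        features += [int(hit and word in pos_words), int(hit and word in neg_words)]
    return features

def bl_sum(tokens, pos_words, neg_words):
    """Counts of positive and negative lexicon words in the sentence."""
    return [sum(t in pos_words for t in tokens), sum(t in neg_words for t in tokens)]

# Toy lexicon; "peculiar" sits in both lists, as some expanded entries do.
POS = {"improve", "peculiar"}
NEG = {"abuse", "peculiar"}
VOCAB = sorted(POS | NEG)  # ["abuse", "improve", "peculiar"]

tokens = "this peculiar policy will improve safety".split()
full_vector = bl_full(tokens, VOCAB, POS, NEG)
sum_vector = bl_sum(tokens, POS, NEG)
```

BL-full grows with the lexicon (two features per entry), whereas BL-sum stays at two features regardless of lexicon size.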
For the sentiment-enriched runs, we construct a number of representations: Int-full, Int-sum, Int-bin, and Int-c. Int-full and Int-sum are parallel to the respective baseline systems. Int-full represents each lexicon word as 2 features corresponding to the word's estimated $\hat{p}^{pos}_w$ and $\hat{p}^{neg}_w$, providing the most detailed information to the machine learner. In the Int-sum condition, we use $\hat{p}^{pos}_w$ and $\hat{p}^{neg}_w$ for every word to induce 2 features: (1) the sum of positive probabilities of all words in the sentence; (2) the sum of negative probabilities for all words in the sentence, according to the given lexicon.
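A minimal Python sketch of Int-full and Int-sum follows. It assumes a lexicon stored as a dictionary from word to an estimated (positive, negative) pair; the profile values and the function names `int_full` and `int_sum` are hypothetical, and summing over tokens rather than word types is an assumption.

```python
def int_full(tokens, profiles, vocab):
    """Per-word (p_pos, p_neg) features; words absent from the sentence get (0, 0)."""
    present = set(tokens)
    features = []
    for word in vocab:
        p_pos, p_neg = profiles[word] if word in present else (0.0, 0.0)
        features += [p_pos, p_neg]
    return features

def int_sum(tokens, profiles):
    """Sums of positive and of negative probabilities over lexicon words in the sentence."""
    p_pos = sum(profiles[t][0] for t in tokens if t in profiles)
    p_neg = sum(profiles[t][1] for t in tokens if t in profiles)
    return [p_pos, p_neg]

# Hypothetical estimated profiles: word -> (p_pos, p_neg).
PROFILES = {"anxiety": (0.0, 0.9), "concern": (0.3, 0.35), "precious": (0.95, 0.0)}
VOCAB = sorted(PROFILES)

tokens = "her concern was precious".split()
full_vector = int_full(tokens, PROFILES, VOCAB)
sum_vector = int_sum(tokens, PROFILES)
```

Compared with the baselines, an ambiguous word such as concern now contributes only a fraction of a count to each polarity.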
For Int-bin runs, we use bins of size 0.2 (half of the maximal confidence interval) to group together words with close estimates. We produce 10 features. For positive bins, the 5 features count the number of words in the sentence that fall in bin $i$, $1 \le i \le 5$, respectively, that is, words with $0.2(i - 1) < \hat{p}^{pos}_w \le 0.2i$. Bin 1 also includes words with $\hat{p}^{pos}_w = 0$, since these cannot be distinguished with high confidence from $\hat{p}^{pos}_w = 0.1$. The 5 negative bins are defined analogously using $\hat{p}^{neg}_w$. Note that we do not provide a scale; we merely represent different ranges with different features. This should allow the machine learners the flexibility to weight the different bins differently when inducing classifiers.
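The Int-bin representation can be sketched as below. This is a minimal Python illustration, assuming that every lexicon word found in the sentence increments one positive bin and one negative bin; the profile values and the names `bin_index` and `int_bin` are hypothetical.

```python
import math

def bin_index(p, width=0.2, n_bins=5):
    """1-based bin i with width*(i-1) < p <= width*i; p == 0 is placed in bin 1."""
    if p <= 0.0:
        return 1
    # The small epsilon guards against float noise at exact boundaries such as 0.6.
    return min(n_bins, max(1, math.ceil(p / width - 1e-9)))

def int_bin(tokens, profiles):
    """Ten counts: positive bins 1-5 followed by negative bins 1-5."""
    features = [0] * 10
    for token in tokens:
        if token in profiles:
            p_pos, p_neg = profiles[token]
            features[bin_index(p_pos) - 1] += 1
            features[5 + bin_index(p_neg) - 1] += 1
    return features

# Hypothetical estimated profiles: word -> (p_pos, p_neg).
PROFILES = {"anxiety": (0.0, 0.9), "concern": (0.3, 0.35), "precious": (0.95, 0.0)}

features = int_bin("her concern was precious".split(), PROFILES)
```

Because each bin is a separate count feature, a learner is free to assign, say, a large weight to the top positive bin and a near-zero weight to the lowest one.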
The Int-c condition represents a coarse-grained setting. We produce 4 features, two for each polarity: (1) the number of words such that $0 \le \hat{p}^{pos}_w < 0.4$; (2) the number of words such that $0.4 \le \hat{p}^{pos}_w \le 1$; similarly for the negative polarity. Table 3 summarizes conditions and features.

Table 3: Description of conditions. Column 2 shows the number of features. In column 3: 1 is an indicator function; L is a lexicon; L_pos is the part of the lexicon containing positive words (same with negatives); S is a sentence for which a feature vector is built; A = L ∩ S. For all w ∈ L − S in the -full conditions, w is represented with (0,0).

... full or BL-sum) for a combination of machine learner and lexicon. The results show that (1) Int-bin > Int-sum > BL = Int-c = Int-full; (2) Expanded > Seed under Int condition. All inequalities are statistically significant at p=0.05 (see caption of Table 4 for details).

Results
First, both the seed and the expanded lexicons benefit from profile enrichment, although, as predicted, the expanded lexicon yields larger gains due to its more varied profiles: The seed lexicon gains up to 15% in accuracy (c5.0 BL-sum vs Int-bin), while the expanded lexicon gains up to 30%, as SVM RBF scores go up from 0.495 to 0.644.
Second, observe that profiling allows the expanded lexicon to leverage its improved coverage: While it is inferior to the best baseline run with the seed lexicon for all systems, it succeeds in improving the seed lexicon accuracies by 5%-12% across the different systems for the Int-bin runs. The best run of the expanded lexicon (Int-bin for SVM RBF) improves upon the best run of the seed lexicon (Int-sum for SVM Linear) by 7%, demonstrating the success of the paraphraser-based expansion once profiles are taken into account. Overall, comparing the best baseline for the seed lexicon with the Int-bin condition of the expanded lexicon, we observe an improvement between 5% (0.598 to 0.626 for Naïve Bayes) and 25% (0.512 to 0.641 for c5.0), demonstrating the effectiveness of the paraphrase-based expansion with profile enrichment paradigm.
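The percentages above are consistent with relative, not absolute, gains in accuracy; reading them this way is an inference from the quoted accuracy pairs. A short Python check, with the hypothetical helper `relative_gain`:

```python
def relative_gain(baseline, improved):
    """Improvement expressed as a fraction of the baseline accuracy."""
    return (improved - baseline) / baseline

# Accuracy pairs quoted in the text.
gain_c50 = relative_gain(0.512, 0.641)          # c5.0, seed baseline vs expanded Int-bin
gain_naive_bayes = relative_gain(0.598, 0.626)  # Naive Bayes, same comparison
gain_svm_rbf = relative_gain(0.495, 0.644)      # SVM RBF, expanded baseline vs Int-bin
```

Rounded to whole percentages, these come out at 25%, 5%, and 30%, matching the figures reported in this section.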
Third, representing profiles using 10 bins (Int-bin) provides a small but consistent improvement over the summary representation (Int-sum) that sums positivity and negativity of the sentiment-bearing words in a sentence, over a coarse-grained representation (Int-c), as well as over the full-information representation (Int-full). Even Naïve Bayes and SVM Linear, known to work well with large feature sets, show better performance in the Int-bin condition for the expanded lexicon. The results indicate that an intermediate degree of detail, between summary-only and coarse-grained representation on the one hand and full-information representation on the other, is the best choice in our setting.

Comparative Evaluations
In this section, we present comparative evaluations of the work presented in this paper with respect to related work. This section shows that the paraphrase expansion + profile enrichment solution proposed in this paper is effective for our task beyond off-the-shelf solutions, and that its effectiveness generalizes to sentiment analysis in a different domain. We also show that profile enrichment can be effectively coupled with other methods of lexical expansion, although the paraphraser-based expansion receives a larger boost in performance from profile enrichment than the alternative expansion methods we consider.
In section 7.1, we demonstrate that the paraphrase-based expansion and profile enrichment yield superior performance on our data relative to state-of-the-art subjectivity lexicons: OpinionFinder, General Inquirer, and SentiWordNet. In section 7.2, we show that profile enrichment can be effectively coupled with other methods of lexical expansion, such as a WordNet-based expansion and an expansion that utilizes Lin's distributional thesaurus. However, we find that the paraphraser-based expansion benefits the most from profile enrichment, and attains better performance on our data than the alternative expansion methods. In section 7.3, we show that the paraphrase-based expansion and profile enrichment paradigm is effective for other subjectivity lexicons on other data. We use a dataset of product reviews annotated for sentence-level positivity and negativity as new data for evaluation (Hu and Liu, 2004). We use subsets of OpinionFinder, General Inquirer, and the sentiment lexicon from Hu and Liu (2004). We demonstrate that paraphrase-based expansion and profile enrichment improve the accuracy of sentiment classification of product reviews for every lexicon and machine learner combination; the magnitude of improvement is 5% on average.

Competitiveness of the Expanded Lexicon
Had we been able to use the OpinionFinder or the General Inquirer lexicons (OFL and GIL) as is, how would the results have compared to those attained using our lexicons? We performed the baseline runs with both lexicons; OFL accuracies were 0.544-0.594 across machine learning systems, and GIL accuracies were 0.491-0.584 (see GIL column in Table 5).
We also experimented with using the weaksubj and strongsubj labels in OFL as somewhat parallel distinctions to the ones presented here (see section 4, Related Work, for a more detailed discussion). We used a (1,0,0) profile for strong positives, (0.3,0,0.7) for weak positives, (0,1,0) for strong negatives, and (0,0.3,0.7) for weak negatives, and ran all the feature representations discussed in section 5.2. Table 5 column OFL shows the best run for every machine learning system, across the different feature representations, and choosing the better performing run between vanilla OFL and the version enriched with weak/strong distinctions.
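The mapping from OFL labels to surrogate profiles can be written as a small lookup table. In the Python sketch below, the profile triples are those given in the text, ordered as (positive, negative, neutral); the polarity label strings and the function name `ofl_profile` are assumptions for illustration.

```python
# (strength label, polarity label) -> (p_pos, p_neg, p_neu), values from the text.
OFL_PROFILES = {
    ("strongsubj", "positive"): (1.0, 0.0, 0.0),
    ("weaksubj", "positive"): (0.3, 0.0, 0.7),
    ("strongsubj", "negative"): (0.0, 1.0, 0.0),
    ("weaksubj", "negative"): (0.0, 0.3, 0.7),
}

def ofl_profile(strength, polarity):
    """Look up the surrogate sentiment profile for an OFL entry."""
    return OFL_PROFILES[(strength, polarity)]

weak_positive = ofl_profile("weaksubj", "positive")
```

Once entries carry such triples, they can be passed through the same Int-full, Int-sum, Int-bin, and Int-c feature extractors as the crowdsourced profiles.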

Table 5: Performance of different lexicons on essay data using various machine learning systems. For each system and lexicon, the best performance across the applicable feature representations from section 5.2 and the variants (see text) is shown. Seed BL column shows the best baseline performance of our seed lexicon, before paraphraser expansion and profile enrichment were applied. Exp. column shows the performance of the Int-bin feature representation for the expanded lexicon after profile enrichment.
Additionally, we experimented with SentiWordNet (Baccianella et al., 2010). SentiWordNet is a resource for opinion mining built on top of WordNet, which assigns each synset in WordNet a score triplet (positive, negative, and objective), indicating the strength of each of these three properties for the words in the synset. The SentiWordNet annotations were automatically generated, starting with a set of manually labeled synsets. Currently, SentiWordNet includes an automatic annotation for all the synsets in WordNet, totaling more than 100,000 words. It is therefore the largest-scale lexicon with intensity information that is currently available.
Since SentiWordNet assigns scores to synsets and since our data is not sense-tagged, we induced SentiWordNet scores in the following ways. We part-of-speech tagged our train and test data using the Stanford tagger (Toutanova et al., 2003). Then, we took the SentiWordNet scores for the top sense for the given part of speech (SWN-1). In a different variant, we took a weighted average of the scores for the different senses, using the weighting algorithm provided on the SentiWordNet website (footnote 6) (SWN-2). Table 5 column SWN shows the best performance figures between SWN-1 and SWN-2, across the feature representations in section 5.2.
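The two ways of collapsing sense-level scores into word-level scores can be sketched as follows. This Python sketch assumes that the senses of a word for a given part of speech are listed in rank order, and that the weighted variant uses 1/rank weights; the latter is our recollection of the sample code and should be treated as an assumption. The score values and function names are hypothetical.

```python
def swn_top_sense(sense_scores):
    """SWN-1 style: take the (pos, neg) scores of the top-ranked sense."""
    return sense_scores[0]

def swn_weighted(sense_scores):
    """SWN-2 style: average (pos, neg) over senses with assumed 1/rank weights."""
    weights = [1.0 / rank for rank in range(1, len(sense_scores) + 1)]
    total = sum(weights)
    pos = sum(w * s[0] for w, s in zip(weights, sense_scores)) / total
    neg = sum(w * s[1] for w, s in zip(weights, sense_scores)) / total
    return pos, neg

# Hypothetical (pos, neg) scores for two senses of one word, top sense first.
senses = [(0.0, 0.5), (0.25, 0.0)]
top = swn_top_sense(senses)
averaged = swn_weighted(senses)
```

The weighted variant lets lower-ranked senses soften the scores of the top sense, which matters for words whose senses differ in polarity.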
The comparative results in Table 5 clearly show that while our vanilla seed lexicon performs comparably to off-the-shelf lexicons on our data, the paraphraser-expanded lexicon with sentiment profiles outperforms OpinionFinder, General Inquirer, and SentiWordNet.

Sentiment Profile Enrichment with Other Lexical Expansion Methods
We presented a novel lexicon expansion method using a paraphrasing system. We also experimented with more standard methods, using WordNet and distributional similarity (Beigman Klebanov et al., 2012; Esuli and Sebastiani, 2006; Kim and Hovy, 2004; Andreevskaia and Bergler, 2006; Hu and Liu, 2004; Kanayama and Nasukawa, 2006; Strapparava and Valitutti, 2004; Kamps et al., 2004; Takamura et al., 2005; Turney and Littman, 2003; Hatzivassiloglou and McKeown, 1997). Specifically, we implemented a WordNet-based (Miller, 1995) expansion that uses the 3 most frequent synonyms of the top sense of the seed word (WN-e). We also implemented a method based on distributional similarity: Using Lin's proximity-based thesaurus (Lin, 1998) trained on our in-house essay data as well as on well-formed newswire texts, we took all words with a proximity score > 1.80 to any of the seed lexicon words (Lin-e). Just like the paraphraser lexicon, both perform worse than the seed lexicon in 9 out of 10 baseline runs (BL-sum and BL-full conditions for the 5 machine learners).
To test the effect of profile enrichment, all words in WN-e and Lin-e underwent profile estimation as described in section 3.1, yielding lexicons WN-e-p and Lin-e-p, respectively. Figure 2 shows the distributions. WN-e-p and Lin-e-p exhibit similar trends to those of the paraphraser. Substituting WN-e-p for Expanded data in Table 4, we find the same relationships between the different feature sets: Int-bin > Int-sum > Int-full = BL. For Lin-e-p, Int-sum deteriorates: Int-bin > Int-sum = Int-full = BL. For the 20 runs in the Int condition, Paraphraser > WN-e-p > Lin-e-p (footnote 7). Note that this is also the order of lexicon sizes: Lin-e is the most conservative expansion (1,907 words), WN-e is the second with 2,527 words, and the lexicon expanded using paraphrasing is the largest with 2,994 words. Table 6 shows the performance of Lin-e-p, WN-e-p, and of the Expanded lexicon from Table 4 using the Int-bin feature representation. The average relative improvements over the best baseline range between 6.6% and 14.6% for the different expansion methods. Profile induction appears to be a powerful lexicon clean-up procedure that works especially well with more aggressive and thus potentially noisier expansions: The machine learners depress low-intensity and ambiguous expansions, thereby allowing the effective utilization of the improved coverage of sentiment-bearing vocabulary.

Footnote 6: http://sentiwordnet.isti.cnr.it/, under "Sample code."

Effectiveness of the Paraphrase Expansion with Profile Enrichment Paradigm in a Different Domain
In order to check whether the paraphrase-based expansion and profile enrichment paradigm discussed in this paper generalizes to other subjectivity lexicons ...

Footnote 7: All > are significant at p=0.05 using the Wilcoxon test.

Table 6: Performance of WordNet-based, Lin-based, and Paraphraser-based expansions with profile enrichment in the Int-bin condition. Seed BL column shows the best baseline performance of the seed lexicon, before expansion and profile enrichment were applied. The last line shows the average relative gain over the best baseline calculated as AG_lex = Σ_{m∈M} ...

Lexicons
We use the OpinionFinder and General Inquirer lexicons (OFL and GIL) as before, as well as the lexicon of positive and negative sentiment and opinion words available along with the Hu and Liu (2004) product reviews dataset (HL). Since each of these lexicons contains more than 3,000 words, enrichment of the full lexicons with profiles is beyond the financial scope of our project. We therefore restrict each of the lexicons to the size of their overlap with our seed lexicon (see section 2.1); the overlaps have between 415 and 467 words. These restricted lexicons are our initial lexicons for the new experiment; they parallel the role of the seed lexicon in the experiments on essay data.
For each of the 3 initial lexicons L, L∈{OFL, GIL, HL}, we follow the paraphrase-based expansion as described in section 2.2. This results in about 4.5-fold expansion of each lexicon, the new lexicons L-e, L∈{OFL, GIL, HL}, numbering between 2,015 and 2,167 words. Both the initial and the expanded lexicons now undergo profile enrichment as described in section 3.1, producing lexicons L-p and L-e-p, L∈{OFL, GIL, HL}.

Data
We use the dataset from Hu and Liu (2004) that contains reviews of 5 products from amazon.com: two digital cameras, a DVD player, an MP3 player, and a cellular phone. The reviews are annotated at sentence level with a label that describes the particular feature that is the subject of the positive or negative evaluation and the polarity and extent of the evaluation. For example, the sentence "The phone book is very user-friendly and the speakerphone is excellent" is labeled as PHONE BOOK[+2], SPEAKERPHONE[+2], while the sentence "I am bored with the silver look" is labeled LOOK[−1]. We used all sentences that were labeled with a numerical score for at least one feature, removing a small number of sentences labeled with both positive and negative scores for different features. We used the sign of the numerical score to label the sentences as positive or negative. The resulting dataset consists of 1,695 sentences, 1,061 positive and 634 negative; accuracy for a majority baseline on this dataset is 0.626. Our experiments on this dataset are done using 5-fold cross-validation. Table 7 shows classification accuracies for the product review data using different lexicons and machine learners. We observe that the combination of paraphrase-based expansion and profile enrichment (L-e-p column in the table) resulted in improved performance over the initial lexicon (L column in the table) in all cases, with an average gain of 5% in accuracy.
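The rule for turning feature-level annotations into sentence labels can be sketched in Python as follows. The sketch assumes that labels are available as strings in the bracketed form shown in the examples above; the actual file format of the dataset may differ, and the function name `sentence_polarity` is hypothetical.

```python
import re

# Matches bracketed scores such as [+2] or [-1]; also accepts the typographic minus.
SCORE = re.compile(r"\[([+\-\u2212])(\d+)\]")

def sentence_polarity(label):
    """Return 'positive' or 'negative', or None for unscored or mixed-sign labels."""
    signs = {"+" if sign == "+" else "-" for sign, _ in SCORE.findall(label)}
    if signs == {"+"}:
        return "positive"
    if signs == {"-"}:
        return "negative"
    return None  # no numerical score, or both polarities present

labels = ["PHONE BOOK[+2], SPEAKERPHONE[+2]", "LOOK[-1]", "SIZE[+1], BATTERY[-2]"]
polarities = [sentence_polarity(label) for label in labels]

# Majority-class baseline from the class counts reported in the text.
majority_baseline = 1061 / 1695
```

Sentences for which the function returns None correspond to those excluded from the experiments, either because they carry no score or because they mix positive and negative scores.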

Results
Furthermore, the contributions of the expansion and the profile enrichment are complementary, since their combination performs better than each in isolation. We note that profile enrichment alone for the initial lexicon did not yield an improvement. This can be explained by the fact that the initial lexicons are highly polar, so profiles provide little additional information: The percentage of words with $\hat{p}^{pos} \ge 0.8$ or $\hat{p}^{neg} \ge 0.8$ is 84%, 86% and 91% for GIL, ...

Table 7: Accuracies on product review data. For each machine learner and lexicon, the best baseline performance is shown as L for the initial lexicon and as L-e for the paraphrase-expanded lexicon. L-p and L-e-p show the performance of the Int-bin feature set on the profile-enriched initial and paraphrase-expanded lexicons, respectively. The three initial lexicons L are OpinionFinder (OFL), General Inquirer (GIL), and Hu and Liu (2004) (HL), each intersected with our seed lexicon. Sizes of the initial and expanded lexicons are provided.

Conclusions
We demonstrated a method of improving a seed sentiment lexicon by using a pivot-based paraphrasing system for lexical expansion and sentiment profile enrichment using crowdsourcing. Profile enrichment alone yielded up to 15% improvement in the performance of the seed lexicon on the task of 3-way sentence-level sentiment polarity classification of test-taker essay data. While the lexical expansion on its own failed to improve upon the performance of the seed lexicon, it became much more effective on top of sentiment profiles, generating a 7% performance boost over the best profile-enriched run with the seed lexicon. Overall, paraphrase-based expansion coupled with profile enrichment yields up to a 25% improvement in accuracy.
Additionally, we showed that our paraphrase-expanded and profile-enriched lexicon performs significantly better on our data than off-the-shelf subjectivity lexicons, namely, OpinionFinder, General Inquirer, and SentiWordNet. Furthermore, our results suggest that paraphrase-based expansion derives more benefit from profiles than two competing expansion mechanisms based on WordNet and on Lin's distributional thesaurus.
Finally, we demonstrated the effectiveness of the paraphraser-based expansion with profile enrichment paradigm on a different dataset. We used the Hu and Liu (2004) product review data with sentence-level sentiment polarity labels. Paraphrase-based expansion with profile enrichment yielded improved performance across all lexicons and machine learning algorithms we tried, with an average improvement of 5% in classification accuracy.
Recent literature argues that sentiment polarity is a property of word senses, rather than of words (Gyamfi et al., 2009; Su and Markert, 2008; Wiebe and Mihalcea, 2006), although Dragut et al. (2012) successfully operate with "mostly negative" and "mostly positive" words based on the polarity distributions of word senses. In future work, we plan to address sense disambiguation for words that have multiple senses with very different sentiment, such as stress, which can mean either anxiety (negative) or emphasis (neutral).