Bootstrap Domain-Specific Sentiment Classifiers from Unlabeled Corpora

There is often the need to perform sentiment classification in a particular domain where no labeled document is available. Although we could make use of a general-purpose off-the-shelf sentiment classifier or a pre-built one for a different domain, the effectiveness would be inferior. In this paper, we explore the possibility of building domain-specific sentiment classifiers with unlabeled documents only. Our investigation indicates that in the word embeddings learned from the unlabeled corpus of a given domain, the distributed word representations (vectors) for opposite sentiments form distinct clusters, though those clusters are not transferable across domains. Exploiting such a clustering structure, we are able to utilize machine learning algorithms to induce a quality domain-specific sentiment lexicon from just a few typical sentiment words (“seeds”). An important finding is that simple linear model based supervised learning algorithms (such as linear SVM) can actually work better than more sophisticated semi-supervised/transductive learning algorithms which represent the state-of-the-art technique for sentiment lexicon induction. The induced lexicon could be applied directly in a lexicon-based method for sentiment classification, but a higher performance could be achieved through a two-phase bootstrapping method which uses the induced lexicon to assign positive/negative sentiment scores to unlabeled documents first, a nd t hen u ses those documents found to have clear sentiment signals as pseudo-labeled examples to train a document sentiment classifier v ia supervised learning algorithms (such as LSTM). On several benchmark datasets for document sentiment classification, our end-to-end pipelined approach which is overall unsupervised (except for a tiny set of seed words) outperforms existing unsupervised approaches and achieves an accuracy comparable to that of fully supervised approaches.
Introduction
Sentiment analysis (Liu, 2015) is a popular research topic which has a wide range of applications, such as summarizing customer reviews, monitoring social media, and predicting stock market trends (Bollen et al., 2011). A basic task in sentiment analysis is to classify the sentiment polarity of a given piece of text (document), i.e., whether the opinion expressed in the text is positive or negative (Pang et al., 2002), which is the focus of this paper.
There are many different approaches to sentiment classification in the Natural Language Processing (NLP) literature, from simple lexicon-based methods (Ding et al., 2008; Thelwall et al., 2010; Thelwall et al., 2012) to learning-based approaches (Pang and Lee, 2004; Turney, 2002; Jo and Oh, 2011; Argamon et al., 2007; Lin and He, 2009), and also hybrid methods in between (Mudinas et al., 2012; Zhang et al., 2011). No matter which approach is taken, a sentiment classifier built for its target domain works well only within that specific domain, but suffers a serious performance loss once the domain boundary is crossed. The same word can drastically change its sentiment polarity (and/or strength) when used in a different domain. For example, being "small" is likely to be negative for a hotel room but positive for a digital camcorder, being "unexpected" may be a good thing for the ending of a movie but not for the engine of a car, and we will probably enjoy "interesting" books but not necessarily "interesting" food. Here, the domain could be defined not by the topic of the documents but by the style of writing. For example, the meanings of words like "gay" and "terrific" would depend on whether the text was written in a historical era or modern times.
When we need to perform sentiment classification in a new domain unseen before, there is usually neither a labeled dictionary available to employ lexicon-based sentiment classifiers nor a labeled corpus available to train learning-based sentiment classifiers. It is, of course, possible to resort to a general-purpose off-the-shelf sentiment classifier, or a pre-built one for a different domain. However, the effectiveness would often be unsatisfactory for the reasons mentioned above. There have been some studies on domain adaptation or transfer learning for sentiment classification (Blitzer et al., 2007; Tan et al., 2009; Pan et al., 2010; Glorot et al., 2011; Yoshida et al., 2011; Bollegala et al., 2013; Xia et al., 2013; Yang and Eisenstein, 2015), but they still require a large amount of labeled training data from a fairly similar source domain, which is not always feasible. Such algorithms also tend to be computationally expensive and time-consuming (Mohammad and Turney, 2010; Fast et al., 2016).
In this paper, we propose an end-to-end pipelined, nearly-unsupervised approach to domain-specific sentiment classification of documents for a new domain, based on distributed word representations (vectors). As shown in Fig. 1, the proposed approach consists of three main stages (components): (1) domain-specific sentiment word embedding, (2) domain-specific sentiment lexicon induction, and (3) domain-specific sentiment classification of documents. Briefly speaking, given a large unlabeled corpus for a new domain, we first set up the vector space for that domain via word embedding, then induce a sentiment lexicon in the discovered vector space from a very small set of seed words as well as a general-purpose lexicon, and finally exploit the induced lexicon in a lexicon-based document sentiment classifier to bootstrap a more effective learning-based document sentiment classifier for that domain. The second stage of our approach outperforms the state-of-the-art unsupervised method for sentiment lexicon induction (Hamilton et al., 2016), which is the most closely related work (see Section 2). The key to the superior performance of our method compared with theirs is the insight, gained from our first stage, that positive and negative sentiment words are largely clustered in the domain-specific vector space but the two clusters have a non-negligible overlap; semi-supervised/transductive learning algorithms can be easily misled by the examples in the overlap and thus actually do not work as well as simple supervised classification algorithms. Overall, the document sentiment classifier resulting from our nearly-unsupervised approach does not require any labeled document for training, and it can outperform the state-of-the-art unsupervised method for document sentiment classification (Eisenstein, 2017). The source code for our implemented system and the datasets for our experiments are open to the research community 1 .
The rest of this paper is organized as follows. In Section 2, we review previous studies on this topic. In Sections 3 to 5, we describe the three main stages of our approach respectively. In Section 6, we draw conclusions and discuss future work.

Related Work
Most of the early sentiment analysis systems took lexicon-based approaches to document sentiment classification, which rely on pre-compiled sentiment lexicons (Owsley et al., 2006). Various methods have been proposed to automatically produce such sentiment lexicons (Hu and Liu, 2004; Ding et al., 2008). Later, the focus of research shifted to learning-based approaches (Pang et al., 2002; Pang and Lee, 2004), as supervised learning algorithms usually deliver a much higher accuracy in sentiment classification than pure lexicon-based methods. However, lexicons have not completely lost their attractiveness: they are usually easier for non-experts to understand and maintain, and they can also be integrated into learning-based sentiment classifiers (Mudinas et al., 2012; Eisenstein, 2017). The lexicon-based sentiment classifier used in our experiments is a publicly-available system called pSenti 2 (https://goo.gl/pj4XAQ) (Mudinas et al., 2012). In addition to a customizable sentiment lexicon, it also uses shallow NLP techniques like part-of-speech (POS) tagging and the detection of sentiment inverters and other modifiers (intensifying and diminishing adverbs).
The introduction of modern word embedding techniques like word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) has opened the possibility of new sentiment analysis methods. Given a large unlabeled corpus, such techniques learn from word co-occurrence information and produce a vector space of hundreds of dimensions, with each word being assigned a corresponding vector. The resulting vector space helps in understanding the semantic relationships between words and allows grouping of words based on their linguistic similarities. Recently, Rothe et al. (2016) proposed the DENSIFIER method, which can reduce the dimensionality of word embeddings without losing semantic information, and explored its application in various domains. For the SemEval-2015 task (Rosenthal et al., 2015), DENSIFIER performed slightly worse than word2vec, though its training time was shorter by a factor of 21. In fact, previous studies such as (Rothe et al., 2016; Cliche, 2017) suggest that word2vec usually provides the best word embeddings for sentiment analysis tasks.
In their recent work, Hamilton et al. (2016) demonstrated that by starting from a small set of seed words and conducting label propagation over the lexical graph derived from the pairwise proximities of word embeddings, they could induce a domain-specific sentiment lexicon comparable to a hand-curated one. Intuitively, the success of their method, named SentProp, requires a relatively clear separation between sentiment words of opposite polarity in the vector space, which, as we will show later, is not very realistic. Moreover, they focused on the induction of sentiment lexicons alone, while we are trying to design an end-to-end pipeline that can turn unlabeled documents in a new domain directly into their sentiment classifications, with domain-specific sentiment lexicon induction as a key component.
Recent advances in deep learning (LeCun et al., 2015) have elevated sentiment analysis to new performance levels (Kim, 2014; Dai and Le, 2015; Hong and Fang, 2015). As reported by Dai and Le (2015), the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) Recurrent Neural Network (RNN) can reach or surpass the performance levels of all previous baselines for sentiment classification of documents. One of the many appeals of LSTM is that it can connect previous information to the current context and allows seamless integration of pre-trained word embeddings as the first (projection) layer of the neural network. Moreover, Radford et al. (2017) discovered the "sentiment unit", a single unit that captures the representation of sentiment, in a multiplicative LSTM with 4096 units, despite the fact that the LSTM was trained for a completely different purpose: to predict the next character in the text of Amazon reviews. Our results are in line with those findings and confirm the superiority of LSTM in building document-level sentiment classifiers.

Zhang et al. (2011) tried to address the low recall problem of lexicon-based methods for Twitter sentiment classification by training a learning-based sentiment classifier using the noisy labels generated by a lexicon-based sentiment classifier (Ding et al., 2008). Although the basic idea of their work is similar to what we do in the third stage of our approach (see Section 5), there are several notable differences. First, they adopted a single general-purpose sentiment lexicon provided by Ding et al. (2008) and used it for all domains, while we induce a different lexicon for each domain. Consequently, their method can have a relatively large variance in document sentiment classification performance because of domain mismatch (e.g., F1 = 0.874 for the "Tangled" tweets and F1 = 0.647 for the "Obama" tweets), whereas our approach performs quite consistently over different domains. Second, they needed to strip out all the previously-known opinion words in their single general-purpose sentiment lexicon from the training documents in order to prevent training bias and force their document sentiment classifier to exploit domain-specific features, but doing so obviously loses the very valuable sentiment signals carried by those opinion words. In contrast, we are able to utilize all terms in the training documents as features, including those opinion words that appear in our automatically induced domain-specific lexicons, when building our document sentiment classifiers. Third, they designed their method specifically for Twitter sentiment classification, while our approach works not only for short texts such as tweets (see Section 5.2) but also for long texts such as customer reviews (see Section 5.1). Fourth, they had to use an intermediate step to identify additional opinionated tweets (according to the opinion indicators extracted through the χ2 test on the results of their lexicon-based sentiment classifier) in order to handle the neutral class, but we do not require that time-consuming step because we use the calibrated probabilistic outputs of our document sentiment classifier to detect the neutral class (see Section 5.3).

Domain-Specific Sentiment Word Embedding
Our approach to domain-specific document-level sentiment classification is built on top of word embeddings: distributed word representations (vectors) that can be learned from an unlabeled corpus to encode the semantic similarities between words (Goldberg, 2017).
In this section, we investigate what the embeddings of sentiment words for a particular domain look like in the domain-specific vector space. To ensure a fair comparison with the state-of-the-art sentiment lexicon induction technique SentProp 3 (Hamilton et al., 2016) later in Section 4, we adopt the same publicly-available pre-trained word embeddings for the following three domains, together with the corresponding sets of sentiment words (i.e., sentiment lexicons).
• Standard-English. We use the Google News word embeddings 4 and the 'General Inquirer' lexicon (Stone et al., 1966) with the sentiment polarity scores collected by Warriner et al. (2013).
• Twitter. We use the word embeddings constructed by Rothe et al. (2016) and the sentiment lexicon from the SemEval-2015 Task 10E (Rosenthal et al., 2015).
• Finance. We use the word embeddings learned using an SVD-based method (Manning et al., 2008) from a collection of "8-K" financial reports 5 (Lee et al., 2014) and the finance sentiment lexicon hand-crafted by Hamilton et al. (2016).

Note that the above three sentiment lexicons are used both for the inspection of sentiment word distributions in this section and for the evaluation of sentiment lexicon induction in the next section. Furthermore, to facilitate a fair comparison with the state-of-the-art unsupervised document sentiment classification technique ProbLex-DCM 6 (Eisenstein, 2017) later in Section 5, we also adopt the following two document collections which they have used.
• IMDB. We use 50k movie reviews in English from IMDB (Maas et al., 2011) with 25k labeled training documents.
• Amazon. We use about 28k product reviews in English across four product categories from Amazon (Blitzer et al., 2007; McAuley and Leskovec, 2013) with 8k labeled training documents.

The word embeddings for the above two domains were trained by us on the respective corpora using word2vec (Mikolov et al., 2013), which employs a two-layer neural network and is by far the most widely used word embedding technique. Specifically, we ran word2vec with skip-gram and a five-word window to construct word vectors of 500 dimensions, as recommended by previous studies 7 . The sentiment lexicon made by Liu (2015) is consistently one of the best for analyzing reviews (Ribeiro et al., 2016), so it is used for both of those domains.
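For concreteness, the following is a minimal sketch of this embedding step using the gensim library (the paper does not prescribe a particular implementation; the corpus file name, the tokenization, and min_count are placeholders, while the skip-gram, window and dimensionality settings match those stated above).

```python
# Minimal sketch of the domain-specific embedding step using gensim 4.x.
# The corpus iterator and file path are illustrative assumptions.
from gensim.models import Word2Vec

class ReviewCorpus:
    """Stream tokenized reviews, one review per line, from a text file."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.lower().split()

model = Word2Vec(
    sentences=ReviewCorpus("imdb_reviews.txt"),  # hypothetical corpus file
    vector_size=500,  # 500-dimensional vectors, as in the paper
    window=5,         # five-word context window
    sg=1,             # skip-gram
    min_count=5,      # illustrative frequency cut-off
    workers=4,
)
model.wv.save("imdb_word_vectors.kv")  # keep the vectors for later stages
```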
Drawing an analogy to the well-known cluster hypothesis in Information Retrieval (IR) (Manning et al., 2008), here we put forward the cluster hypothesis for sentiment analysis: words in the same cluster behave similarly with respect to sentiment polarity in a specific domain. That is to say, we expect positive and negative sentiment words to form distinct clusters, given that they have been represented in an appropriate vector space. To verify this hypothesis, it is useful to visualize the high-dimensional sentiment word vectors in a 2D plane. We tried a number of dimensionality reduction techniques including t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008), but found that simply using the classic Principal Component Analysis (PCA) (Bishop, 2006) works very well for this purpose.
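A minimal sketch of such a 2D projection follows, assuming the word vectors saved above and a toy stand-in lexicon (the real inspection uses the full domain lexicons listed earlier).

```python
# Sketch of the 2D visualization: project sentiment-word vectors with PCA
# and colour them by known polarity. File name and lexicon are placeholders.
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

wv = KeyedVectors.load("imdb_word_vectors.kv")      # hypothetical vectors file
lexicon = {"excellent": 1, "superb": 1, "great": 1,
           "awful": -1, "terrible": -1, "boring": -1}  # toy stand-in lexicon

words = [w for w in lexicon if w in wv]
points = PCA(n_components=2).fit_transform([wv[w] for w in words])

for (x, y), w in zip(points, words):
    plt.scatter(x, y, c="tab:blue" if lexicon[w] > 0 else "tab:red")
    plt.annotate(w, (x, y))
plt.show()
```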
We have found that, in general, the above cluster hypothesis holds for word embeddings within a specific domain. Fig. 2a shows that in the Standard-English domain, the sentiment words with opposite polarities form two distinct clusters. However, it can also be seen that those two clusters overlap with each other. That is because each word carries not only a sentiment value but also its linguistic and semantic information. Zooming into one of the regions of the word vector space, we can see that words with opposite sentiment polarities could be grouped together: 'hail', 'stormy' and 'sunny' are linguistically similar as they all describe weather conditions, yet they convey very different sentiment values. Moreover, as described by Plutchik (1984), sentiment could be grouped into multiple dimensions such as joy-sadness, anger-fear, trust-disgust and anticipation-surprise. Putting that aside, certain sentiment words can be classified sometimes as positive and sometimes as negative, depending on the context. These reasons lead to the phenomenon that many sentiment words are located in the overlapping noisy region between the two clusters in the domain-specific vector space. On visual inspection of the Finance (Fig. 3a) sentiment words and IMDB (Fig. 4a) sentiment words in their respective vector spaces, we can see that positive and negative words form distinct clusters which are largely separable. However, if we consider Finance sentiment words in the IMDB vector space (see Fig. 3b), positive and negative words are mixed together and cannot be separated easily.
One may be surprised that positive and negative sentiment words form their respective clusters, because most of the time they could be used in exactly the same context, which might suggest that they would result in similar word embeddings. For example, we could say "the room is good" and also "the room is bad": both are legitimate sentences. The probable reason for the cluster hypothesis to be true is that in reality people tend to use positive sentiment words together much more often than to mix them with negative sentiment words, and vice versa. For example, we see sentences like "the room is clean and tidy" much more often than "the room is clean but messy". It is a long-established fact in computational linguistics that words with similar meanings tend to occur near each other (Miller and Charles, 1991); sentiment words are no exception (Turney, 2002). Moreover, it has been widely observed that online customer reviews are affected by the so-called love-hate self-selection bias: users tend to rate only products which they either love or hate, leading to a lot more 1-star and 5-star ratings than other (moderate) ratings; if the product is just average or so-so, they probably will not bother to leave reviews. The polarization of online customer reviews would also encourage the clustering of sentiment words into opposite polarities.

Domain-Specific Sentiment Lexicon Induction
Given the word embeddings for a specific domain, we can induce a customized sentiment lexicon from a few typical sentiment words ("seeds") frequently used in that particular domain. Such an induced domain-specific sentiment lexicon plays a crucial role in the pipeline towards domain-specific document-level sentiment classification. Table 1 shows the seed words for five different domains which are identical to those used by Hamilton et al. (2016) except for the two additional domains IMDB and Amazon. The induction of a sentiment lexicon could then be formulated as a simple word sentiment classification problem with two classes (positive vs. negative). Each word is represented as a vector via domain-specific word embedding; the seed words are labeled with their corresponding classes while all the other words (i.e., "candidates") are unlabeled; the task here is to learn a classifier from the labeled examples first and then apply it to predict the sentiment polarity of each unlabeled candidate word. The probabilistic outputs of such a word sentiment classifier could be regarded as the measure of confidence about the predicted sentiment polarity. In the end, those candidate words with a high probability of being either positive or negative would be added to the sentiment lexicon. The final induced sentiment lexicon would include both the seed words and the selected candidate words.
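To make this formulation concrete, here is a minimal sketch of the word-level classification, assuming gensim word vectors and scikit-learn; the seed lists, the candidate list, the vectors file, and the 0.7 confidence cut-off (anticipating the threshold used later in Section 5) are all illustrative.

```python
# Minimal sketch of sentiment lexicon induction as word classification:
# seed words are labeled examples, candidate words are classified, and only
# confident predictions enter the induced lexicon. Seeds, candidates and
# the vectors file are illustrative assumptions.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LogisticRegression

wv = KeyedVectors.load("imdb_word_vectors.kv")       # hypothetical file
pos_seeds = ["good", "excellent", "great", "awesome"]
neg_seeds = ["bad", "terrible", "awful", "horrible"]
candidates = [w for w in ["clean", "messy", "boring", "thrilling"] if w in wv]

X = np.array([wv[w] for w in pos_seeds + neg_seeds])
y = np.array([1] * len(pos_seeds) + [0] * len(neg_seeds))
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(np.array([wv[w] for w in candidates]))[:, 1]
induced = {w: ("positive" if p >= 0.7 else "negative")
           for w, p in zip(candidates, proba)
           if p >= 0.7 or p <= 0.3}   # keep only high-confidence words
print(induced)
```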
As pointed out by Mudinas et al. (2012), if we simply consider all words from the given corpus as candidate words, the above described word sentiment classifier tends to assign sentiment values not only to the actual sentiment words but also to their associated product features or, more generally, the aspects of the expressed view. For example, if a lot of customers do not like the weight of a product, the word sentiment classifier may assign strong negative sentiment to "weight", yet this is not stable: the sentiment polarity of "weight" may be different when a new version of the product is released or the customer population has changed, and furthermore it probably does not apply to other products. To avoid this potential issue, it is necessary to consider only a high-quality list of candidate words which are likely to be genuine sentiment words. Such a list of candidate words can be obtained directly from general-purpose sentiment lexicons. It is also possible to perform NLP on the target domain corpus and extract frequently-occurring adjectives or other typical sentiment indicators like emoticons as candidate words, but that is beyond the scope of this paper.
To examine the effectiveness of different machine learning algorithms for building such domain-specific word sentiment classifiers, we attempt to recreate known sentiment lexicons in three domains: Standard-English, Twitter, and Finance (see Section 3), in the same way as Hamilton et al. (2016) did. Put differently, for the purpose of evaluation, we just use a known sentiment lexicon in the corresponding domain as the list of candidate words and see how different machine learning algorithms classify those candidate words based on their domain-specific word embeddings. For those lexicons with ternary sentiment classification (positive vs. neutral vs. negative), the class-mass normalization method (Zhu et al., 2003) used by Hamilton et al. (2016) has been applied here to identify the neutral category. The quality of each induced lexicon for a specific domain is evaluated by comparing it with its corresponding known lexicon as the ground-truth, according to the same performance metrics as in (Hamilton et al., 2016): the Area Under the Receiver-Operating-Characteristic (ROC) Curve (AUC) for the binary classification (ignoring the neutral class, as is common in previous work), and Kendall's τ rank correlation coefficient with continuous human-annotated polarity scores. Note that Kendall's τ is not suitable for the Finance domain, as its known sentiment lexicon is only binary. Therefore, our experimental setting and performance measures are all identical to those of Hamilton et al. (2016), which ensures the validity of the empirical comparison between our approach and theirs.
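Both metrics are standard; a minimal sketch of how such an evaluation can be computed with scikit-learn and SciPy follows (the ground-truth and score arrays below are placeholders, not data from the paper).

```python
# Sketch of the lexicon-induction evaluation: rank candidate words by the
# classifier's P(positive), then score the ranking against the known lexicon.
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import roc_auc_score

proba = np.array([0.92, 0.81, 0.35, 0.10])        # classifier P(positive)
binary_truth = np.array([1, 1, 0, 0])             # known polarity (neutral removed)
graded_truth = np.array([0.9, 0.4, -0.3, -0.8])   # human-annotated polarity scores

print("AUC:", roc_auc_score(binary_truth, proba))
tau, _ = kendalltau(graded_truth, proba)
print("Kendall's tau:", tau)
```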
In Table 2, we compare a number of typical supervised and semi-supervised/transductive learning algorithms for word sentiment classification in the context of domain-specific sentiment lexicon induction:

• LR - Logistic Regression,
• kNN - k-Nearest Neighbors,
• SVM_lin - Support Vector Machine with the linear kernel (Joachims, 1998),
• SVM_rbf - Support Vector Machine with the non-linear RBF kernel (Joachims, 1998),
• TSVM - Transductive Support Vector Machine (Joachims, 1999),
• S3VM - Semi-Supervised Support Vector Machine (Gieseke et al., 2012),
• CPLE - Contrastive Pessimistic Likelihood Estimation (Loog, 2016),
• SGT - Spectral Graph Transducer (Joachims, 2003),
• SentProp - a label propagation based classification method proposed for the SocialSent system (Hamilton et al., 2016).

The suitable parameter values of the above learning algorithms (such as the C for SVM) are found via grid search with cross-validation, and the probabilistic outputs are given by Platt scaling (Platt, 2000) where they are not provided by the original learning algorithm. A sketch of this tuning step is given below.
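The sketch grid-searches C for a linear SVM and adds Platt-scaled probabilities via scikit-learn's CalibratedClassifierCV; it reuses wv, X, y and candidates from the induction sketch above, and the parameter grid and fold counts are our assumptions.

```python
# Sketch of parameter selection and Platt scaling for the linear SVM.
# wv, X, y and candidates are reused from the induction sketch above.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

search = GridSearchCV(LinearSVC(), param_grid={"C": [0.01, 0.1, 1, 10]}, cv=3)
search.fit(X, y)

# LinearSVC has no predict_proba; Platt scaling fits a sigmoid on its scores.
svm = CalibratedClassifierCV(LinearSVC(C=search.best_params_["C"]),
                             method="sigmoid", cv=3)
svm.fit(X, y)
proba = svm.predict_proba(np.array([wv[w] for w in candidates]))[:, 1]
```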
The experimental results shown in Table 2 demonstrate that in almost every single domain, simple linear model based supervised learning algorithms (LR and SVM_lin) can achieve the optimal or near-optimal accuracy for the sentiment lexicon induction task, and they outperform the state-of-the-art sentiment lexicon induction method SentProp (Hamilton et al., 2016) by a large margin. The performance improvements are statistically significant (p-value < 0.05) according to the sign test. There does not seem to be any benefit in utilizing non-linear models (kNN and SVM_rbf) or semi-supervised/transductive learning algorithms (TSVM, S3VM, CPLE, SGT, and SentProp). A qualitative analysis of the sentiment lexicons induced by different methods shows that they differ only on those borderline, ambiguous words (such as "soft") residing in the noisy overlapping region between the two clusters in the vector space (see Section 3). In particular, SentProp is based on label propagation over the lexical graph of words, so it can be easily misled by noisy borderline words when the sentiment clusters have considerable overlap with each other, a kind of "over-fitting" (Bishop, 2006). Furthermore, according to our experiments on the same machine, those simple linear models are 70+ times faster than SentProp. The speed difference is mainly due to the fact that supervised learning algorithms only need to train on a small number of labeled words ("seeds" in our context), while semi-supervised/transductive learning algorithms need to train on not only a small number of labeled words but also a large number of unlabeled words.
It has also been observed in our experiments that there is a typical precision/recall trade-off (Manning et al., 2008) in the automatic induction of sentiment lexicons. Assuming that the classified candidate words are added to the lexicon in descending order of their probabilities (of being either positive or negative), the induced lexicon becomes noisier as it grows bigger.

Table 2: Comparing the induced lexicons with their corresponding known lexicons (ground-truth) according to the ranking of sentiment words, measured by AUC and Kendall's τ.

Fig. 5 shows that imposing a higher cut-off probability threshold (for candidate words to enter the induced lexicon) decreases the size of the induced lexicon but increases its quality (accuracy). On one hand, the induced lexicon needs to contain a sufficient number of sentiment words, especially when detecting sentiment in short texts, as a lexicon-based method cannot reasonably classify documents with no or too few sentiment words. On the other hand, the noise (misclassified sentiment words) in the induced lexicon obviously has a detrimental impact on the accuracy of the document sentiment classifier built on top of it. Contrary to most previous work, such as that of Qiu et al. (2011), which tries to expand the sentiment lexicon as much as possible and thus maintain a high recall, we put more emphasis on precision and keep a tight control of the lexicon size. For us, having a small sentiment lexicon is affordable, because our proposed approach to document sentiment classification mitigates the low recall problem of lexicon-based methods by combining them with learning-based methods, which we shall discuss next.
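The trade-off can be examined directly by sweeping the cut-off; a tiny illustration, reusing candidates and proba from the sketches above:

```python
# Illustrative sweep of the cut-off probability threshold: higher cut-offs
# yield smaller but cleaner induced lexicons (candidates and proba are
# reused from the earlier sketches).
for cutoff in (0.5, 0.6, 0.7, 0.8, 0.9):
    kept = [w for w, p in zip(candidates, proba)
            if p >= cutoff or p <= 1.0 - cutoff]
    print(f"cut-off {cutoff}: lexicon size {len(kept)}")
```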

Domain-Specific Sentiment Classification of Documents
A domain-specific sentiment lexicon, automatically induced using the above technique, provides a solid basis for building domain-specific document sentiment classifiers. For the experiments here, we use a list of 7866 candidate words constructed by merging two well-known, publicly available general-purpose sentiment lexicons: the 'General Inquirer' (Stone et al., 1966) and the sentiment lexicon from Liu (2012). This set of candidate words is itself a combined, general-purpose sentiment lexicon, so we name it the GI+BL lexicon. Moreover, we set the cut-off probability threshold to a generally good value of 0.7 in our sentiment lexicon induction algorithm. Comparing the IMDB vector space including all the candidate words (Fig. 4a) with that including only the high-probability candidate words (Fig. 4b), it is obvious that the positive and negative sentiment clusters become more clearly separated in the latter.
The induced sentiment lexicon on its own can be applied directly in a lexicon-based method for sentiment classification of documents, and a reasonably good performance can be achieved, as we will show later in Table 4. However, most of the time, lexicon-based sentiment classifiers are not as effective as learning-based sentiment classifiers. One reason is that the former tend to suffer from poor recall. For example, with a limited-size sentiment lexicon, lexicon-based methods often fail to detect the sentiment present in short texts, e.g., from Twitter, due to the lexical gap.
Given the induced sentiment lexicon, we propose to use a lexicon-based sentiment classifier to classify unlabeled documents, and then use those classified documents containing at least three sentiment words as pseudo-labeled training examples for a learning-based sentiment classifier. The condition of "at least three sentiment words" ensures that only reliably classified documents are further utilised as training examples. A sketch of this filtering step follows.
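The sketch below uses a simple word-counting scorer as a stand-in for pSenti (which additionally applies POS tagging and inverter/modifier handling); the document list and lexicon are illustrative.

```python
# Sketch of the pseudo-labeling step in the two-phase bootstrapping method:
# a lexicon-based scorer labels unlabeled documents, and only documents
# matching at least three lexicon words are kept as pseudo-labeled examples.
def lexicon_score(tokens, lexicon):
    """Return (sentiment score, number of lexicon words matched)."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits), len(hits)

lexicon = {"great": 1.0, "excellent": 1.0, "awful": -1.0, "boring": -1.0}
unlabeled_docs = [
    "great acting excellent pacing and a great soundtrack",
    "boring plot awful dialogue boring characters",
]  # stand-in for the real unlabeled corpus

pseudo_labeled = []
for doc in unlabeled_docs:
    score, n_hits = lexicon_score(doc.lower().split(), lexicon)
    if n_hits >= 3 and score != 0:   # keep only reliably classified documents
        pseudo_labeled.append((doc, 1 if score > 0 else 0))
print(pseudo_labeled)
```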

Sentiment Classification of Long Texts
First, we try the induced sentiment lexicons in the lexicon-based sentiment classifier pSenti (Mudinas et al., 2012) to see how good they are. Given a sentiment lexicon, pSenti is able to perform not only binary sentiment classification but also ordinal sentiment classification on a five-point scale. To measure the binary classification performance, we use both micro-averaged F1 (miF1) and macro-averaged F1 (maF1), which are commonly used in text categorization (Yang and Liu, 1999). To measure the five-point scale classification performance, we use both Cohen's κ coefficient (Manning et al., 2008) and the Root-Mean-Square Error (RMSE) (Bishop, 2006). As the baseline, we use the combined general-purpose sentiment lexicon, GI+BL, mentioned previously in Section 4. As we can see from the results shown in Table 3, using the induced sentiment lexicon for the target domain makes the lexicon-based sentiment classifier pSenti perform better than simply employing an existing general-purpose sentiment lexicon. Moreover, using a sentiment lexicon induced from the same domain leads to a much better performance than using a sentiment lexicon induced from a different domain.
Second, to evaluate the proposed two-phase bootstrapping method, we make empirical comparisons on the IMDB and Amazon datasets using a number of representative methods for document sentiment classification:

• pSenti - a concept-level lexicon-based sentiment classifier (Mudinas et al., 2012),
• ProbLex-DCM - probabilistic lexicon-based classification using the Dirichlet Compound Multinomial (DCM) likelihood to reduce effective counts for repeated words (Eisenstein, 2017),
• SVM_lin - Support Vector Machine with the linear kernel (Joachims, 1998),
• CNN - Convolutional Neural Network (Kim, 2014),
• LSTM - Long Short-Term Memory, a Recurrent Neural Network (RNN) that can remember values over arbitrary time intervals (Hochreiter and Schmidhuber, 1997; Dai and Le, 2015).

To apply the deep learning algorithms CNN and LSTM that have a word embedding projection layer, we fix the review size to 500 words, truncating reviews longer than that and padding reviews shorter than that with null values. As pointed out by Greff et al. (2017), the hidden layer size is an important hyperparameter of LSTM: usually, the larger the network, the better the performance but the longer the training time. In our experiments, we used an LSTM network with 400 units on the hidden layer, which is the capacity that a PC with one Nvidia GTX 1080 Ti GPU can afford, and a dropout (Wager et al., 2013) rate of 0.5, which is the most common setting in the research literature (Srivastava et al., 2014; Hong and Fang, 2015; Cliche, 2017). A sketch of this network configuration is given below.
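The sketch assumes a TensorFlow/Keras-style API; the vocabulary size is an illustrative placeholder, embedding_matrix stands for the pre-trained word2vec vectors, while the 500-dimensional frozen projection layer, the 400-unit LSTM, and the 0.5 dropout follow the settings stated above.

```python
# Minimal Keras sketch of the LSTM document sentiment classifier: a frozen
# word2vec projection layer, a 400-unit LSTM with dropout 0.5, and a
# sigmoid output for the binary positive/negative decision.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Dense, Embedding, LSTM

vocab_size, dim = 50000, 500                     # illustrative vocabulary size
embedding_matrix = np.zeros((vocab_size, dim))   # fill row i with word2vec vector of word i

model = Sequential([
    Embedding(vocab_size, dim,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),        # frozen word2vec projection layer
    LSTM(400, dropout=0.5),            # 400 hidden units, dropout rate 0.5
    Dense(1, activation="sigmoid"),    # binary positive/negative output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Inputs are integer word-id sequences padded/truncated to 500 tokens, e.g. via
# tensorflow.keras.preprocessing.sequence.pad_sequences(seqs, maxlen=500);
# training then runs on the pseudo-labeled documents from the previous step.
```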
As shown in Table 4, the above described two-phase bootstrapping method is demonstrably beneficial: the learning-based sentiment classifiers trained on pseudo-labeled data are superior to lexicon-based sentiment classifiers, including the state-of-the-art unsupervised sentiment classifier ProbLex-DCM (Eisenstein, 2017). Furthermore, the two-phase bootstrapping method is a general framework which can utilize any lexicon-based sentiment classifier to produce pseudo-labeled data. Therefore, the more sophisticated ProbLex-DCM could also be used instead of pSenti in this framework, which is likely to deliver an even higher performance. Among the three learning-based sentiment classifiers, LSTM achieved the best performance on both datasets, which is consistent with the observations in other studies such as Dai and Le (2015).
Comparing the LSTM-based sentiment classifiers trained on pseudo-labeled and real labeled data, we can also see that using a large number of pseudo-labeled examples achieves a similar effect to using 25k/4 ≈ 6k and 8k/2 = 4k real labeled examples for IMDB and Amazon respectively. This suggests that the unsupervised approach is actually preferable to the supervised approach if there are only a limited number of labeled documents available.

Sentiment Classification of Short Texts
To evaluate our proposed approach to sentiment classification of short texts, we carried out experiments on the Twitter sentiment classification benchmark dataset from SemEval-2017 Task 4B (Rosenthal et al., 2017), which is to classify 6185 tweets as either positive or negative. In addition to the training set of 20,508 tweets, we also collected unlabeled tweets using the Twitter API. All the tweets were pre-processed by replacing emoticons with their corresponding text representations and encoding URLs as tokens. In addition to the Twitter-domain seed words listed in Table 1, we also made use of common positive/negative emoticons, which are ubiquitous on Twitter, as additional seeds for the task of sentiment lexicon induction. Note that in all our experiments, we do not use the sentiment labels or the topic information provided in the training data.
Making use of the provided training data and our own unlabeled data collected from Twitter, we constructed the domain-specific word embeddings, induced the sentiment lexicon, and bootstrapped the pseudo-labeled tweet data to train the binary tweet sentiment classifier. As the learning algorithm, we chose an LSTM with a hidden layer of 150 units, which is sufficient for tweets as they are quite short (with an average length of only 20 words).
The official performance measures for this short text sentiment classification task (Rosenthal et al., 2017) include Accuracy (Acc) and F1. Although our approach is nearly-unsupervised (without any reliance on labeled documents), its performance on this benchmark dataset is comparable to that of supervised methods: it would be placed roughly in the middle of all the participating systems in this competition (see Table 5).

Detecting Neutral Sentiment
Many real-world applications of sentiment classification (e.g., on social media) are not simply a binary classification task, but involve a neutral category as well. Although many lexicon-based sentiment classifiers including pSenti can detect neutral sentiment, extending the above learning-based sentiment classifier (trained on pseudo-labeled data) to handle the neutral class is not straightforward. For evaluation, we use the benchmark dataset from SemEval-2017 Task 4 (Rosenthal et al., 2017), which is to classify 12379 tweets into an ordinal five-point scale (−2, −1, 0, +1, +2) where 0 represents the neutral class.
One common way to handle neutral sentiment is to treat the set of neutral documents as a separate class for the classification algorithm, which is the method advocated by Koppel and Schler (2006). With the pseudo-labeled training examples of three classes (−1: negative, 0: neutral, and +1: positive), we tried both standard multi-class classification (Hsu and Lin, 2002) and ordinal classification (Frank and Hall, 2001). However, neither of them could deliver a reasonable performance. After carefully inspecting the classification results, we realised that it is very difficult to have a set of representative training examples with good coverage for the neutral class. This is because the neutral class is not homogeneous: a document could be neutral because it is equally positive and negative, or because it does not contain any sentiment. In practice, the latter case is more often seen than the former case, and it implies that the neutral class is more often defined by the absence of sentiment word features rather than their presence, which would be problematic to most supervised learning algorithms.
What we discovered is that the simple method of identifying neutral documents from the binary sentiment classifier's decision boundary works surprisingly well, as long as the right thresholds are found. Specifically, we take the probabilistic outputs of a binary sentiment classifier trained as before, and then put all the documents whose probability of being positive lies not close to 0, not close to 1, but in the middle range, into the neutral class. It turns out that probability calibration (Niculescu-Mizil and Caruana, 2005) is crucially important for this simple method to work. Some supervised learning algorithms for classification give poor estimates of the class probabilities, and some do not support probability prediction at all. For instance, maximum-margin learning algorithms such as SVM focus on hard samples that are close to the decision boundary (the support vectors), which makes their probability prediction biased. The technique of probability calibration allows us to better calibrate the probabilities of a given classifier, or to add support for probability prediction. If a classifier is well calibrated, its probabilistic output can be directly interpreted as a confidence level on the prediction. For example, among the documents to which such a calibrated binary classifier gives a probabilistic output close to 0.8, approximately 80% would actually belong to the positive class.
Using the sigmoid model of Platt (2000) with cross-validation on the pseudo-labeled training data, we carry out probability calibration for our LSTM-based binary sentiment classifier. Fig. 6 shows that the calibrated probability prediction aligns with the true confidence of prediction much better than the raw probability prediction. In this case, the Brier loss (Brier, 1950), which measures the mean squared difference between the predicted probability and the actual outcome, was reduced from 0.182 to 0.153 by probability calibration.
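A minimal sketch of this calibration step follows; in the paper the base model is the LSTM, whereas here a LinearSVC on synthetic stand-in features plays that role (scikit-learn's CalibratedClassifierCV fits Platt's sigmoid via cross-validation, and brier_score_loss measures the fit).

```python
# Sketch of Platt-scaling probability calibration and the Brier loss.
# The feature vectors and pseudo-labels below are synthetic stand-ins.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1, 1, (500, 20)), rng.normal(+1, 1, (500, 20))])
y = np.array([0] * 500 + [1] * 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit a sigmoid on cross-validated scores of the base model (Platt scaling);
# LinearSVC itself offers no predict_proba, so calibration also adds it.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)
proba = calibrated.predict_proba(X_te)[:, 1]
print("Brier loss:", brier_score_loss(y_te, proba))
```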
If we rank the estimated probabilities of being positive from low to high, the curve of probabilities takes an "S"-shape with a distinct middle range where the slope is steeper than at the two ends, as shown in Fig. 7. The documents whose probabilities of being positive fall in this middle range should be neutral. Therefore, the two elbow points in the probability curve make appropriate thresholds for the identification of neutral sentiment, and they can be found automatically by a simple algorithm using the central difference to approximate the second derivative. Let p_L and p_U denote the identified thresholds (p_L < p_U); then we assign class label "−1" to all documents with probability below p_L, "+1" to all documents with probability above p_U, and "0" to all documents with probability within [p_L, p_U].
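One way to implement this elbow detection is sketched below, reusing the calibrated probabilities (proba) from the previous sketch; the choice of taking the strongest curvature change in each half of the sorted curve is our assumption about how the two elbows are picked.

```python
# Sketch of the threshold-finding step: sort the calibrated probabilities,
# approximate the second derivative of the S-shaped curve with central
# differences, and take the two strongest elbows as p_L and p_U.
import numpy as np

p = np.sort(proba)                         # calibrated P(positive), ascending
d2 = p[2:] - 2.0 * p[1:-1] + p[:-2]        # central-difference second derivative
half = len(d2) // 2
p_L = p[1 + int(np.argmax(d2[:half]))]          # lower elbow: slope starts rising
p_U = p[1 + half + int(np.argmin(d2[half:]))]   # upper elbow: slope starts falling

labels = np.where(proba < p_L, -1, np.where(proba > p_U, +1, 0))
```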
The official performance measures for this sentiment classification task (Rosenthal et al., 2017) are MAE_µ and MAE_M, which stand for the micro-averaged and macro-averaged Mean Absolute Error (MAE), respectively. We also report the micro-averaged and macro-averaged F1 scores, denoted miF1 and maF1 respectively. As shown in Fig. 7, the thresholds identified from the raw probability curve are roughly at the 55th and 75th percentiles, which yields MAE_µ = 0.632 and MAE_M = 0.832; the thresholds identified from the calibrated probability curve are roughly at the 40th and 80th percentiles, which yields much better scores of MAE_µ = 0.536 and MAE_M = 0.815. So, with the help of probability calibration, our proposed approach is able to comfortably beat all the baselines, including the lexicon-based method pSenti (Mudinas et al., 2012), and compete with the average (median) participating systems (see Table 6). Please note that this is not a fair comparison: our approach is at a great disadvantage because (i) it is nearly-unsupervised, without any reliance on labeled documents, while all the other systems are supervised; and (ii) it performs only ternary classification while all the other systems make classifications on the full five-point scale.

Conclusions
How far can we go in sentiment classification for a new domain, given only unlabeled data? This paper presents our exploration towards answering the above research question. Specifically, the main contributions of this paper are as follows.
• We have formulated the cluster hypothesis for sentiment analysis (i.e., words with different sentiment polarities form distinct clusters) and verified that, in general, it holds for word embeddings within a specific domain but not across domains.
• We have demonstrated that a quality domain-specific sentiment lexicon can be induced from the word embeddings of that domain together with just a few seed words. Surprisingly, simple linear model based supervised learning algorithms (such as linear SVM) are good enough for this purpose; there is no benefit in utilizing non-linear models or semi-supervised/transductive learning algorithms, due to the noise at the borders of sentiment word clusters. Using such linear models, our system clearly outperforms the state-of-the-art sentiment lexicon induction method, SentProp (Hamilton et al., 2016).
• We have shown that a lexicon-based sentiment classifier can be enhanced by using its outputs as pseudo-labels and employing supervised learning algorithms such as LSTM to train a learning-based sentiment classifier on the pseudo-labeled documents. Our end-to-end pipelined approach, which is overall unsupervised (except for a very small set of seed words), works better than the state-of-the-art unsupervised technique for document sentiment classification, ProbLex-DCM (Eisenstein, 2017), and its performance is at least on par with an average fully supervised sentiment classifier trained on real labeled data (Rosenthal et al., 2017).
• We have revealed the crucial importance of probability calibration to the detection of neutral sentiment, which was overlooked in previous studies (Koppel and Schler, 2006). With the right thresholds found, neutral documents can simply be identified at the binary sentiment classifier's decision boundary.
One promising way to further enhance the LSTM-based sentiment classifier in the proposed approach with the induced sentiment lexicon would be to concatenate the word embeddings with an indicator feature telling whether the current word is positive, neutral, or negative (Ebert et al., 2015). We leave this for future work.