Polite Dialogue Generation Without Parallel Data

Stylistic dialogue response generation, with valuable applications in personality-based conversational agents, is a challenging task because the response needs to be fluent, contextually relevant, and paralinguistically accurate. Moreover, parallel datasets for regular-to-stylistic pairs are usually unavailable. We present three weakly-supervised models that can generate diverse, polite (or rude) dialogue responses without parallel data. Our late fusion model (Fusion) merges the decoder of an encoder-attention-decoder dialogue model with a language model trained on stand-alone polite utterances. Our label-fine-tuning (LFT) model prepends to each source sequence a politeness-score scaled label (predicted by our state-of-the-art politeness classifier) during training, and at test time is able to generate polite, neutral, and rude responses by simply scaling the label embedding by the corresponding score. Our reinforcement learning model (Polite-RL) encourages politeness generation by assigning rewards proportional to the politeness classifier score of the sampled response. We also present two retrieval-based, polite dialogue model baselines. Human evaluation validates that while the Fusion and the retrieval-based models achieve politeness with poorer context-relevance, the LFT and Polite-RL models can produce significantly more polite responses without sacrificing dialogue quality.


Introduction
Generating stylistic, personality-based language is crucial to developing engaging, convincing, and trustworthy conversational agents, for their effective application in intelligent tutoring, home assistance, online reservations/purchasing, health care, etc. Most current chatbots and conversational models lack any such style, which can be a social issue because human users might learn biased styles from such interactions, e.g., kids learning to be rude because the dialogue system encourages short, curt responses and does not itself use politeness to set an example. In this work, we focus on the important and diverse paralinguistic style axis of politeness vs. rudeness (Brown and Levinson, 1987).
Generating stylistic dialogue responses is a substantially challenging task because the generated response needs to be syntactically and semantically fluent, contextually relevant to the conversation, and accurate in the paralinguistic features it conveys. This is further complicated by the fact that content and style are only available in separate unpaired datasets, as opposed to translation-type parallel datasets containing regular-to-stylistic text pairs. Hence, we need indirectly-supervised models that can incorporate style into the generated response in the absence of parallel data (i.e., where the training data for the conversation versus style components comes from two different datasets or domains), while still maintaining conversation relevance.
In this work, we present three such weakly-supervised models (footnote 2) that can generate diverse, natural, and contextually-relevant polite (and rude) dialogue responses, using data from separate style and dialogue domains: the Stanford Politeness Corpus (Danescu-Niculescu-Mizil et al., 2013) with Wikipedia and StackExchange requests, and the MovieTriples Dialogue Corpus (Serban et al., 2016) with IMSDB movie scripts, respectively. Each of our three models is based on a state-of-the-art politeness classifier and a sequence-to-sequence dialogue model. The first model (Fusion) employs a late fusion technique to merge the response generation decoder of the dialogue model with a language model trained on polite utterances chosen by the politeness classifier. The second label-fine-tuning (LFT) model prepends to the input utterance a single politeness label whose embedding is continuously scaled by the politeness score of the target sequence during training. This score is determined by feeding the corresponding ground-truth target sequence to our politeness classifier. During test time, we show that the LFT model is able to control the politeness level of generated responses by simply scaling the label's embedding by the continuous target politeness score of our choice. Our third reinforcement-based model (Polite-RL) encourages politeness generation by using the continuous-scale politeness score of the decoder-sampled sentence as a reward (via mixed-objective policy gradient methods), i.e., polite utterances are encouraged with positive reward, and rude ones discouraged with negative reward.
Hence, our models only need a style classifier (without parallel data) to automatically influence and encourage continuous-scale stylistic language generation in a complex dialogue setup, which also requires maintaining relevance to conversational context. Each of these models requires minimal changes to the architecture of either the underlying sequence-to-sequence (Seq2seq) dialogue base model or the style classifier, and hence can modularly update the architecture with the latest state-of-the-art dialogue models or style classifiers (and for diverse styles). In addition, we also employ two retrieval-based models, where we output the response which has the highest match with the input context from a set of classifier-picked polite responses or manually-picked generic polite utterances. These two retrieval models serve as parallel investigations on the performance of our three proposed generative models above.
Footnote 2: [...] of the LFT model were added to the Feb 1, 2018 resubmission based on reviewer discussions.
We conducted multiple human evaluations (for style and dialogue quality) on Amazon Mechanical Turk (MTurk) (Buhrmester et al., 2011) for all three models plus the base sequence-to-sequence dialogue model and the retrieval-based models, and show that while the Fusion and the two retrieval models increase the politeness level of responses at the cost of poorer dialogue quality, both our LFT and Polite-RL models can successfully produce polite responses (capturing several politeness strategies discussed by Brown and Levinson (1987)), without sacrificing dialogue coherence and relevance compared to the base Seq2seq model (hence a better balance between politeness and dialogue quality). We also compare the output dialogue politeness levels of the continuous LFT model for three different politeness levels. Finally, we present several detailed qualitative and quantitative analyses, including positive and negative output examples, automatic metric results on output responses, classifier error analysis, and visualization of the RL rewards.

Models for Style Transfer
Style Transfer with Parallel Data There have been multiple works on style transfer with parallel data. These tasks can often be solved by directly applying some variation of the translation-based Seq2seq model discussed in the previous section. For example, Xu et al. (2012) use a phrase-based statistical model, and Jhamtani et al. (2017) use a standard Seq2seq model to convert modern language to Shakespeare-style language by treating style transfer as a translation task. Some labeled sequence transduction methods have also been proposed (Kobus et al., 2017; Yamagishi et al., 2016; Johnson et al., 2017). For example, Kikuchi et al. (2016) are able to control the length of the summarization text by feeding to the Seq2seq base model a label that indicates the intended output length in addition to the source input. Our LFT model also adopts this labeling idea, and is able to handle a similar situation but without parallel data, because by labeling each target sequence in the training set with its politeness classifier score, we are essentially converting non-parallel data to (noisy) parallel data (by using a classifier with high accuracy).
Style Transfer without Parallel Data Several previous works have looked at style transfer without parallel data, in both vision (Gatys et al., 2016; Zhu et al., 2017; Liu and Tuzel, 2016; Liu et al., 2017; Taigman et al., 2016; Kim et al., 2017; Yi et al., 2017), and text (Sennrich et al., 2016a; Hu et al., 2017; Ghosh et al., 2017; Zhao et al., 2017; Mueller et al., 2017; Wang et al., 2017; Luan et al., 2017). Among these models, some are bag-of-words based, i.e., they use style-related keywords to annotate the target sequences in the training set. For example, to control how formal the output sequences are in an EN-DE translation task, Sennrich et al. (2016a) labeled each target sequence based on whether it contains formal or informal verbs and pronouns (honorifics). To build a language model that generates utterances with the desired style, Ficler and Goldberg (2017) annotated their text with meta-data and keywords/POS tags based heuristics, while Ghosh et al. (2017) also adopted keyword spotting based on a dictionary of emotional words. The basic ideas of their models are similar to that of our LFT model. However, these keyword-spotting approaches do not fully extend to our politeness generation task, because politeness strategies follow complex patterns of grammar, word order, and phrasing (Danescu-Niculescu-Mizil et al., 2013). For example, the politeness of please depends on where it occurs in a sentence, and what other politeness markers it co-occurs with (e.g., 'could/would you' style counterfactual modals vs. 'can/will you' style indicative modals). Therefore, our novel polite dialogue models are based on an accurate neural classifier, which is better at capturing several compositional paralinguistic features (as visualized in Aubakirova and Bansal (2016), whose politeness classifier we extend). Moreover, our LFT and Polite-RL models can generate a continuum of style levels based on the continuously-scaled (by the politeness score) label embedding or reinforcement rewards.
Lastly, there have also been style transfer models that rely on the latent representation of text and use variational auto-encoders or cross-alignment to disentangle the representation of content and style in text (Hu et al., 2017; Shen et al., 2017; Zhao et al., 2017; Fu et al., 2018). During inference time, the latent style representation is combined with new content to generate stylized, content-preserving text. Although both fall into the category of style transfer, our task differs in two important aspects from their tasks. First, as opposed to the task of strict content preservation when rephrasing a sentence to a different style, our task is about maintaining good relevance to the context when adding style, which is especially useful for dialogue-based tasks. Another distinctive trait of our task is that politeness resides in a spectrum rather than a fixed category or topic (e.g., Shakespearean), and our models can treat politeness as a continuum, i.e., controlling the politeness level by adjusting the fusion rate in the Fusion model, the magnitude of the continuous label in the LFT model, or the RL weight in the Polite-RL model.

Multi-Task Learning and Style Transfer
In order to obtain a persona-based conversational agent, Luan et al. (2017) proposed a multi-task learning (MTL) based approach: they train a Seq2seq model with conversation data and an autoencoder with non-conversational persona-related data from target speakers, and share the decoder parameters of these two models so that the generated responses can be adapted to the style of the target speaker. This way of incorporating MTL into Seq2seq learning was first investigated by Dong et al. (2015) and Luong et al. (2016) to achieve multilingual NMT. In addition, Sennrich et al. (2016b) also employed MTL to improve NMT models with monolingual (non-parallel) data. These approaches are related to our Fusion model, because we use our classifier to obtain noisy polite target sequences (non-parallel data) which a polite language model trains on, and during inference combine the parameters of the language model with a generative dialogue model trained on parallel data. In general, our models are also related to previous works like Johnson et al. (2017), who adopted labeled sequence transduction methods for MTL tasks, because our task also involves adapting generated responses to different politeness styles and optimizing two subtasks' (namely response and politeness generation) loss functions (related to a multi-task setup).

Danescu-Niculescu-Mizil et al. (2013) created the Stanford Politeness Corpus and trained an SVM classifier using a list of useful linguistic features based on strategies from Brown and Levinson's theory of politeness (Brown and Levinson, 1987). Aubakirova and Bansal (2016) recently took an end-to-end neural approach to this politeness classification task by training a CNN model that directly learns to identify polite requests without using any hand-engineered features, while still improving on prediction accuracy. They also visualized what features the CNN model was learning and discovered some new features along the way. Our classifier mainly extends their work by adding a bi-directional LSTM layer (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) before the CNN layer to capture long-distance relationships in the sentence, which leads to higher cross-domain performance.

A related early work in personality-based dialogue is Mairesse and Walker (2007), who study introvert/extrovert personality language based on templated content and sentence planning (via personality dimensions such as hedges, tag questions, negations, subject implicitness, etc.). Relatedly, Sennrich et al. (2016a) use an English to German translation task to present a model that can generate target sequences that are either formal or informal, specifically based on honorifics-related verbs and pronouns. Our task is more general, taking into account several politeness-related paralinguistic features of Brown and Levinson (1987) and allowing end-to-end trainable stylistic dialogue generation with a polite-to-rude spectrum (based on a politeness classifier, without relying on parallel data). Moreover, our approaches allow simply replacing the politeness classifier with any other emotion or personality based language classifier to generate stylistic dialogue for that new style dimension.

Politeness Classification Model
In order to develop an accurate politeness classifier for effective use in stylistic dialogue response generation, we extend and improve upon the state-of-the-art CNN model of Aubakirova and Bansal (2016), and propose a bi-directional LSTM followed by a convolutional layer (see Figure 1), in order to capture both long-distance relationships in the sentence and windowed filter based features. For a sentence $v_{1:n}$ (where each token $v_i$ is a $d$-dim word embedding vector), the LSTM layer first produces hidden states $h_{1:n}$ (where $h_t$ is the concatenation of forward and backward hidden states at time step $t$). A filter $m$ is then applied on a window of $u$ hidden states. This produces a convolution feature $c_i = f(m * h_{i:i+u-1} + b)$, where $f$ is a non-linear function and $b$ is a bias term. The filter is applied to every window, yielding a feature map $c \in \mathbb{R}^{n-u+1}$ with $c = [c_1, ..., c_{n-u+1}]$. The output of the convolutional layer is then fed to a max-pooling layer (Collobert et al., 2011) which gives $C = \max\{c\}$ for the filter. Filters of various sizes are used to obtain multiple features. The result is then passed to a fully-connected softmax layer that outputs probabilities over two labels, namely Polite and Rude.
Our classification model achieves comparable in-domain accuracy and improved cross-domain accuracy over the state-of-the-art results reported in Danescu-Niculescu-Mizil et al. (2013) and Aubakirova and Bansal (2016). We will discuss these results in detail in Section 6.
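The convolution and max-pooling steps described above can be illustrated with a minimal pure-Python sketch for a single filter. The function name `conv_max_pool`, the toy hidden states, and the choice of tanh as the non-linearity f are assumptions made for illustration only; the actual classifier operates on learned BiLSTM states with many filters of several widths.

```python
import math

def conv_max_pool(hidden_states, filt, bias, f=math.tanh):
    """Apply one filter to every window of u hidden states, then max-pool.

    hidden_states: list of n vectors (lists of floats), standing in for h_1..h_n
    filt:          list of u vectors of the same dimensionality (the filter m)
    f:             non-linearity; tanh is an assumption, not taken from the paper
    """
    n, u = len(hidden_states), len(filt)
    feature_map = []
    for i in range(n - u + 1):
        window = hidden_states[i:i + u]
        # Sum of elementwise products between the filter and the window.
        total = sum(w * x
                    for frow, hrow in zip(filt, window)
                    for w, x in zip(frow, hrow))
        feature_map.append(f(total + bias))
    # Max-over-time pooling keeps one value per filter.
    return feature_map, max(feature_map)

# Toy input: n = 4 hidden states of dimension 2, filter width u = 2.
h = [[0.1, 0.2], [0.4, -0.3], [0.0, 0.5], [-0.2, 0.1]]
m = [[1.0, 0.0], [0.0, 1.0]]
feature_map, pooled = conv_max_pool(h, m, bias=0.0)
```

With n = 4 and u = 2 the feature map has n - u + 1 = 3 entries, and the pooled value is the single feature this filter contributes to the softmax layer.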

Polite-Style Dialogue Models
In this section, we first describe our base dialogue model, i.e., the core (backbone) dialogue architecture upon which the three proposed politeness models are built, and then present these three models that can generate polite dialogue responses. As a parallel investigation on the performance of our proposed models, we also employ two retrieval-based polite dialogue models toward the end.

Base Seq2seq Dialogue Model
Our base dialogue model is a simple sequence-to-sequence (Seq2seq) model that consists of a two-layer bi-directional LSTM-RNN encoder to encode the conversation history turns, and a four-layer LSTM-RNN decoder to generate the response. Additive attention from the output of the encoder is applied to the last layer of the decoder. This architecture is almost identical to that proposed by Bahdanau et al. (2015), except with more layers (similar to Shao et al. (2017)). Our base dialogue model achieves perplexity and word error rate results on par with those reported for the popular hierarchical HRED models in Serban et al. (2016), thus serving as a good base model to incorporate style into. Details will be discussed in Section 6.

Fusion Model
Inspired by the 'late fusion' approach in Venugopalan et al. (2016), our Fusion model (Fig. 2) combines the response generation decoder of the base Seq2seq dialogue model with a language model (polite-LM) trained exclusively on polite utterances. These utterances are chosen by feeding to the classifier all response utterances in the MovieTriples training set, and only keeping those with politeness scores greater than a certain threshold (set to 0.8 in our experiments, as will be discussed in Section 4.5).
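The polite-LM model is a two-layer LSTM-RNN based on Jozefowicz et al. (2016).
During inference time, we used the language model to shift the Seq2seq decoder's output toward polite wording: at each decoding step $t$, the output distribution of the combined model is
$$p_t = \alpha\, p_t^{S2S} + (1 - \alpha)\, p_t^{LM} \tag{1}$$
where $p_t^{S2S}$ and $p_t^{LM}$ are the next-token distributions of the Seq2seq decoder and the polite-LM, and the fusion ratio α is a hyperparameter that indicates how much the Seq2seq output will influence the final output.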
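As an illustration of the role of the fusion ratio α, the following minimal sketch mixes two next-token distributions at a single decoding step. It assumes the fusion is a convex combination of the two distributions, which matches the description of α but is not a verified reproduction of the implementation; the function name `late_fusion`, the four-word vocabulary, and the probability values are hypothetical.

```python
def late_fusion(p_seq2seq, p_lm, alpha=0.5):
    """Mix two next-token distributions defined over the same vocabulary.

    alpha weights the Seq2seq decoder; (1 - alpha) weights the polite LM.
    """
    if len(p_seq2seq) != len(p_lm):
        raise ValueError("both distributions must cover the same vocabulary")
    return [alpha * ps + (1.0 - alpha) * pl for ps, pl in zip(p_seq2seq, p_lm)]

vocab = ["no", "please", "thanks", "what"]
p_seq2seq = [0.50, 0.10, 0.10, 0.30]    # hypothetical dialogue-decoder distribution
p_polite_lm = [0.05, 0.50, 0.40, 0.05]  # hypothetical polite-LM distribution

fused = late_fusion(p_seq2seq, p_polite_lm, alpha=0.5)
# Greedy choice under the fused distribution.
best_token = vocab[max(range(len(vocab)), key=fused.__getitem__)]
```

In this toy case the Seq2seq decoder alone would pick "no", while the fused distribution prefers "please". Because the language model never sees the conversation context, a smaller α trades context relevance for politeness.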

Label-Fine-Tuning Model
There are at least two drawbacks of the Fusion model. First, half of its output is determined by a polite language model that has not attended to the conversation context, making the response more likely to be irrelevant. Second, the model does not learn politeness during training, but is forced to be polite only during inference time. To address these two issues, we present our label-fine-tuning (LFT) model, which prepends a predicted continuous style label at the beginning of each input sentence to specify the intended politeness level.
Specifically, we add to the vocabulary a single politeness label and attach to it a trainable word embedding, just as we would for a normal token. Then, the way we make it continuous is by scaling its embedding vector with the (intended) politeness score of the target sequence. During training, this score is obtained by feeding the ground-truth target sequence (response) to the politeness classifier (see Figure 3), while during test time, we are free to scale the prepended politeness label with different scores of our choice (i.e., when we want the model to generate a polite response, we scale the label's embedding by a score between 0.5 and 1.0, while to generate a rude response, we scale the embedding by a score between 0.0 and 0.5). This approach is related to the 'numerically-grounded' language model (Spithourakis et al., 2016), except that we scale the politeness label embedding by its corresponding politeness score, rather than concatenating the two as input to the LSTM. Thus, the LFT model is able to simultaneously produce polite, neutral and rude responses depending on the prepended label, similar to recent multi-label, multi-space, and zero-shot machine translation work using language identity or style labels (Sennrich et al., 2016a; Johnson et al., 2017; Ghosh et al., 2017). Intuitively, this prepended label serves as the prior for the intended style of the generated response sequence, while the source utterance serves as the prior for the content of the generated sequence. In other words, the label and the source sentence cooperatively determine what the overall response looks like.
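The label-scaling step can be sketched as follows. This is a minimal illustration rather than the actual implementation: the function name `prepend_scaled_label`, the 3-dimensional embeddings, and their values are hypothetical, and in a real model the label embedding would be a trainable parameter updated together with the rest of the network.

```python
def prepend_scaled_label(label_embedding, politeness_score, token_embeddings):
    """Scale the single politeness-label embedding and prepend it to the source."""
    scaled = [politeness_score * x for x in label_embedding]
    return [scaled] + token_embeddings

label_emb = [0.2, -0.4, 0.6]                   # hypothetical label embedding
source = [[0.1, 0.0, 0.3], [0.5, 0.2, -0.1]]   # hypothetical source token embeddings

# Training: the score would come from the classifier applied to the gold response.
# Test time: the score is chosen freely to request a style.
polite_input = prepend_scaled_label(label_emb, 1.0, source)
neutral_input = prepend_scaled_label(label_emb, 0.5, source)
rude_input = prepend_scaled_label(label_emb, 0.0, source)
```

Only the first position of the encoder input changes across the three requests; the source tokens are untouched, which reflects the intuition that the label carries the style prior and the source utterance carries the content prior.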

Polite Reinforcement Learning Model
The LFT model incorporates style more directly into its training procedure than the Fusion model, but it still does not fully exploit the value of the style classifier since it only supervises the dialogue model once by initially classifying the style of all the target sequences in the training set. Ideally we would want the classifier to constantly monitor and influence what style the model produces. Moreover, many contexts do not naturally elicit a polite response, in which case we do not want to force the model to generate an utterance that matches the target politeness score, but rather to ask the model to generate as polite and natural a response as it can. These limitations motivate us to propose the third model: the Polite Reinforcement Learning model (Polite-RL), where the style classifier regularly updates the model parameters (via sampling-based policy gradient) with continuous-spectrum rewards that encourage decoder-generated response samples to be polite and discourage them from being rude.
Following work from Paulus et al. (2018), our loss function consists of two terms. The first term is the traditional maximum likelihood loss ($L_{ML}$), which we refer to as the teacher forcing part. The other one is the reinforcement learning loss ($L_{RL}$) based on politeness scores, which we refer to as the reinforce part. The total loss $L$ then takes the form:
$$L = L_{ML} + \beta L_{RL} \tag{2}$$
where β is a hyperparameter indicating how much weight we want to give to the style reward component of the loss. The teacher forcing part minimizes the average of the maximum-likelihood loss at each decoding step. Specifically, let $y^* = \{y^*_1, y^*_2, ..., y^*_n\}$ be the ground-truth response for a given source (conversation history) utterance sequence $x$. The maximum-likelihood training objective is the minimization of the loss:
$$L_{ML} = -\sum_{t=1}^{n} \log p(y^*_t \mid y^*_1, ..., y^*_{t-1}, x) \tag{3}$$
We use a policy gradient method (Williams, 1992; Sutton et al., 2000) to calculate the second term in the objective function. Specifically, we sample a generated response for each input sequence (conversation history) $x$, and assign to it a reward $R$, which in our case is the politeness classifier's probability of the response being classified as polite. Let $y^s = \{y^s_1, y^s_2, ..., y^s_n\}$ be the sampled response; then the reinforce part of the loss is:
$$L_{RL} = -(R - R_b) \sum_{t=1}^{n} \log p(y^s_t \mid y^s_1, ..., y^s_{t-1}, x) \tag{4}$$
where $R_b$ is a baseline that helps reduce variance during training (Ranzato et al., 2016). Note that we can invert the classifier scores or reward (by flipping the first minus sign in Eq. 4), if we want to encourage rudeness as the style, instead of politeness. This also shows that an advantage of our implementations of the LFT model over the Polite-RL model (at the cost of shallower training) is that the LFT model can multitask to simultaneously produce responses of different style labels at test time, whereas reward-based reinforcement learning can only work in one direction at a time (based on the reward sign; see footnote 6).
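To make the interaction of the two loss terms concrete, here is a minimal numeric sketch of the mixed objective. It assumes the total loss is the ML term plus β times the reinforce term, uses plain per-token probabilities in place of differentiable model outputs, and all function names and probability values are hypothetical. The defaults β = 2.0 and a constant baseline of 0.5 follow the values reported in the training details.

```python
import math

def ml_loss(gold_token_probs):
    """Teacher-forcing term: negative log-likelihood of the ground-truth tokens."""
    return -sum(math.log(p) for p in gold_token_probs)

def rl_loss(sampled_token_probs, reward, baseline=0.5):
    """Reinforce term: -(R - R_b) times the log-likelihood of the sampled response."""
    return -(reward - baseline) * sum(math.log(p) for p in sampled_token_probs)

def total_loss(gold_token_probs, sampled_token_probs, reward, beta=2.0, baseline=0.5):
    """Mixed objective: ML term plus beta-weighted reinforce term."""
    return ml_loss(gold_token_probs) + beta * rl_loss(sampled_token_probs, reward, baseline)

gold_probs = [0.5, 0.25]     # model probabilities of the gold tokens (hypothetical)
sample_probs = [0.5, 0.5]    # model probabilities of the sampled tokens (hypothetical)

loss_polite = total_loss(gold_probs, sample_probs, reward=0.9)  # classifier says polite
loss_rude = total_loss(gold_probs, sample_probs, reward=0.1)    # classifier says rude
```

The scalar value of the reinforce term matters less than its gradient. When R exceeds the baseline, the term is a positive multiple of the sample's negative log-likelihood, so minimizing it raises the probability of the polite sample. When R is below the baseline the sign flips and the rude sample is pushed down.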

Retrieval-based Models
We employ two retrieval-based baseline models as a sanity check on the proposed approaches' performance: the first with oracle-level fluency, the second with additional oracle-level politeness.
Footnote 6: However, to make the reward-based model capable of multitasking, one could also prepend various politeness labels to each of the contexts in the training set (thus generating several examples out of one context), and encourage the generated response to be consistent with the given label. We will explore this extension in future work.
Classifier-based Retrieval Following Lowe et al. (2015), for a $[X_1, Y, X_2]$ triple, our retrieval model treats the context $(X_1, Y)$ and each response $(X_2)$ as two documents and converts them to their TF-IDF based vectors (Ramos, 2003) to check for similarity. Specifically, we first obtain all candidate responses in the training set that are polite, and calculate their TF-IDF vectors. Then for each context TF-IDF vector in the test set, we calculate its cosine similarity with that of each such polite-classified candidate response, and output the one with the highest value. Intuitively, for each context we are choosing a response that is both polite and most relevant to (having the most word overlaps with) the context.
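The retrieval step above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under several assumptions: whitespace tokenization, a smoothed IDF variant computed over the candidate set only, and three made-up candidate responses standing in for the classifier-selected polite set. The function names are hypothetical.

```python
import math
from collections import Counter

def tfidf(tokens, idf):
    """Sparse TF-IDF vector (dict) for one tokenized document."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def retrieve(context, polite_candidates):
    """Return the polite candidate whose TF-IDF vector is closest to the context."""
    docs = [c.lower().split() for c in polite_candidates]
    n = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d))
    # Smoothed IDF; one common variant, chosen here for illustration.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    ctx_vec = tfidf(context.lower().split(), idf)
    scores = [cosine(ctx_vec, tfidf(d, idf)) for d in docs]
    return polite_candidates[max(range(n), key=scores.__getitem__)]

candidates = [
    "thank you so much for the coffee",
    "could you please open the door",
    "i would love to see the movie with you",
]
response = retrieve("can you open the door for me", candidates)
```

Here the second candidate wins because it shares the rarer terms "open" and "door" with the context, while terms that occur in every candidate ("you", "the") receive the lowest IDF weight.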

Generic-10
This approach is similar to the one above but uses the 10 manually-chosen most-polite generic utterances as candidate responses for each context. Specifically, we collect all ground-truth polite requests from the Stanford Politeness Corpus, split each one into sentences, and then manually pick the most frequent 10 polite sentences. We then determine which one to retrieve as a response for each input context, based again on the TF-IDF vector similarity method described above.

Datasets
As discussed above, we propose models that can deal with style data coming from an unpaired, non-parallel domain, different from the domain of the dialogue dataset. For our style (politeness) domain, we use the Stanford Politeness Corpus (Danescu-Niculescu-Mizil et al., 2013), which contains a collection of requests from Wikipedia (WIKI) editors' talk pages and the Stack Exchange (SE) question-answering communities. Based on scores from human annotators, these requests are labeled as either polite or rude, with each class equally consisting of 1,089 requests for the Wikipedia domain and 1,651 requests for the Stack Exchange domain. For the content (dialogue) domain, we use the popular MovieTriples dialogue corpus (Serban et al., 2016), which contains 245K conversations extracted from IMSDB movie scripts in X-Y-X triplet-utterance format, where X and Y correspond to two movie characters (and the model's task is to generate the last response).

Evaluation Methods
Human To evaluate our models' ability to generate polite responses without sacrificing dialogue quality, we conducted several comprehensive human evaluation studies on Amazon Mechanical Turk (MTurk). Specifically, we compare the three stylistic models w.r.t. the base model on both dialogue quality (i.e., context relevance and coherence) and politeness level (footnote 9). For this, we randomly sampled 300 contexts covering all types of conversations and their generated responses from the Seq2seq base model, the three stylistic models, and the retrieval-based models. For each source input, the six responses were randomly shuffled to anonymize model identities. Each response was then annotated by two human evaluators who were located in the US, had an approval rate greater than 98%, and had at least 10,000 approved HITs on record (to exclude workers who had just started using MTurk and hence unconditionally enjoyed a high acceptance rate). All our human evaluations are performed by two annotators (for both dialogue quality and politeness level) in order to calculate inter-rater agreement, for which we employ Cohen's Kappa κ (Cohen, 1968), a score that measures the level of inter-rater agreement between two annotators on a classification problem (Artstein and Poesio, 2008). For both dialogue quality and politeness evaluations, the human raters were shown the conversation context (input) and the six shuffled responses (from the six models). Clear instructions (closely following those from Wang et al. (2017)) corresponding to each score were shown in the interface. More specifically, we asked the annotators to first read the context and each of the generated/retrieved responses, and assign to each response a score. They then scored each response on a five-point Likert scale (Likert, 1932) (for both politeness and dialogue quality), hence providing absolute measurements but in an overall comparative (relative) setting (footnote 10). We explicitly stated that it is possible for them to find some conversation disconnected or lacking context, and encouraged them to make the best guess when in doubt. Using similar instructions (and a 300-sized sample), we also performed a separate 3-way LFT model comparison by setting its target politeness scores to 1.0, 0.5 and 0.0, respectively.
Footnote 9: We opted for dialogue quality rather than several separated, fine-grained metrics such as relevance, specificity, informativeness because Lowe et al. (2017) found that little additional information was provided by adding in more metrics on top of overall dialogue quality, and it also confused MTurkers in many scenarios. We had similar observations in our initial human study on MTurk.
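For readers unfamiliar with the agreement statistic, the following sketch computes the basic (unweighted) form of Cohen's kappa for two annotators. Whether the study used this form or a weighted variant is not stated here, so the unweighted form is an assumption; the function name and the toy Likert ratings are hypothetical.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two annotators rating the same items."""
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy five-point Likert ratings for six responses (hypothetical).
annotator_1 = [5, 4, 3, 3, 1, 2]
annotator_2 = [5, 4, 3, 2, 1, 2]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on five of six items (observed agreement 5/6) and chance agreement is 7/36, giving κ = 23/29, or roughly 0.79. The unweighted form treats a disagreement of one Likert point the same as a disagreement of four points, which is one reason weighted variants exist for ordinal scales.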
Automatic Since there do not exist ground-truth stylized versions of the response to the MovieTriples conversations, we only use automatic evaluation metrics as complementary and trend-verification information to the primary human perception studies in this work: we compute BLEU (a phrase-matching based metric; Papineni et al., 2002) as an approximation of dialogue quality as used by some previous work (Ritter et al., 2011; Galley et al., 2015; Li et al., 2016c). Note that we choose to report BLEU scores not in order to draw any immediate conclusion (Liu et al. (2016) found that BLEU does not correlate well with human studies on dialogue quality), but rather to check for match with the trends from human evaluation. We also compute the politeness classifier's scores as an approximation of politeness level. Sec. 6.3 discusses these results.
Footnote 10: The Likert scale is a bipolar scaling method that maps each score to a text item that describes the score, e.g., our politeness level interface uses 'Polite', 'Slightly Polite', 'Neutral', 'Slightly Rude', 'Rude'; and our dialogue quality study uses 'Very good', 'Good', 'Acceptable', 'Poor', and 'Very poor', instead of the abstract scores 1-5. Note that we did not adopt pairwise comparisons because, first, it would create several independent sets of pairwise results (15 sets in our case), which also raises the cost substantially, and second, pairwise comparison does not tell us "by how much" a response is better/equal/worse than the other. In contrast, our absolute scores can help future research compare more directly to our results. We will release our detailed instructions and MTurk interfaces, plus our annotation scores on a public webpage.

Training Details
We now present some important training details.
Embedding Initialization For all our models, we initialized the embedding matrix with word2vec trained on the Google News dataset (about 100 billion words) (Mikolov et al., 2013); we use the Xavier initializer (Glorot and Bengio, 2010) for out-of-vocabulary words.
Pretraining Following Serban et al. (2016), we pretrained the Seq2seq base model for 4 epochs with the Q-A SubTle corpus (Ameixa et al., 2014), which contains around 5.5M movie subtitle Q&A pairs.
Implementation Details We used 300-dim embeddings, the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001, and a dropout rate of 0.2. All models were trained with a minibatch of size 96. The classifier was trained for 3 epochs, and the three proposed stylistic models were each trained for 35 epochs. The polite language model used in the Fusion model was trained until there was no improvement for perplexity on a held-out dev-set (all tuning decisions were made on the respective dev-sets). We use a balanced value of 0.5 for the fusion ratio (α in Eq. 1), and 2.0 for the RL weight (β in Eq. 2) after some light empirical tuning. Also, due to the nearly perfect balance between the number of polite and rude examples in the Stanford Politeness Corpus, we set the baseline reward of Polite-RL ($R_b$ in Eq. 4) to a constant 0.5 at all times (footnote 13). Note that for effective and non-confusing MTurk studies, for all our models (the base model and the three stylistic models), we prevent UNK tokens from appearing in the generated response by not back-propagating the MLE loss for these tokens. We also do the same for a short list (around 10) of very offensive swear words (from Wiktionary).
Footnote 13: We also tried using a self-critical baseline as in Rennie et al. (2017), but found that our way of setting the constant-based baseline led to better responses. We speculate that this is because a self-critical approach tries to make an utterance as polite as possible, which usually leads to a few very generic and very polite responses at convergence (because the model gets a positive reward only when the sampled utterance is more polite than the greedy-decoded one).
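The idea of not back-propagating the MLE loss for certain tokens can be illustrated as a masked loss, in which blocked target positions simply contribute nothing to the sum (and therefore no gradient). This is a sketch of one way such masking could be realized, not a description of the actual training code; the function name, the `<unk>` token string, and the probabilities are hypothetical.

```python
import math

def masked_ml_loss(gold_tokens, gold_token_probs, blocked_tokens):
    """Negative log-likelihood summed only over target tokens that are not blocked."""
    return -sum(
        math.log(p)
        for token, p in zip(gold_tokens, gold_token_probs)
        if token not in blocked_tokens
    )

blocked = {"<unk>"}                        # hypothetical UNK symbol (plus any blocked words)
tokens = ["i", "saw", "<unk>", "today"]    # gold response tokens
probs = [0.5, 0.5, 0.1, 0.5]               # model probability of each gold token

loss = masked_ml_loss(tokens, probs, blocked)
```

Because the blocked position is skipped, the model is never rewarded for predicting it, so its probability is not pushed up during training even though the token still appears in the reference.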

Results
In this results section, we first briefly present our politeness classifier (Sec. 3) and base dialogue model (Sec. 4.1) results, and then focus on the stylistic dialogue results (retrieval and generative).

Politeness Classification Results
Following Danescu-Niculescu-Mizil et al. (2013), we use accuracy (i.e., percentage of correctly labeled messages for binary polite/rude labels) to evaluate our politeness classifier's generalization ability. Specifically, we train on the training set of WIKI, and test on both the test set of WIKI and the entire SE (StackExchange) corpus. We used the same train-validation-test split setup (7:1:2) as in Aubakirova and Bansal (2016).14 As shown in Table 1, our LSTM-CNN model improved cross-domain accuracy (while maintaining comparable in-domain accuracy) compared to that of the SVM and CNN models reported in Aubakirova and Bansal (2016). This is similar to how Zhou et al. (2015) found that a combination of LSTM-RNNs and CNNs is superior to an LSTM-RNN or CNN alone for sentiment classification, likely because the joint model captures both long-distance relationships and local windowed filter-based features, which could make it easier to separate in-domain and out-of-domain properties. We also observe more improvement on cross-domain accuracy because it has much more room for improvement, as opposed to in-domain accuracy, which is already very close to human performance. The higher accuracy is also important because we need a cross-domain-accurate style classifier so that it can effectively stylize responses in diverse dialogue corpora domains such as MovieTriples.

Base Dialogue Model Results
Next, in Table 2, we show that our starting-point base dialogue model is comparable in quality to a popular, representative previous model of Serban et al. (2016), trained on the same corpora with similar model architectures. We use their Perplexity (PPL) and Word Error Rate (WER) metrics. In order to have a meaningful perplexity (i.e., the probability of regenerating a reference response) comparison between two language generation models, they should have the same vocabulary set. Since the vocabulary of our politeness dialogue models is a combination of vocabulary sets drawn from the MovieTriples and Stanford Politeness corpora, for fair comparison in this section, we separately train a base Seq2seq model following exactly the vocabulary (the 10,000 most frequent tokens, plus an UNK for the rest) and preprocessing protocols from Serban et al. (2016). We bootstrapped the model with 4 epochs on the SubTle corpus (see Sec. 5.3), and then trained on MovieTriples until there was no improvement in perplexity on the validation set. The comparison of this base model with their hierarchical-encoder HRED models is presented in Table 2. As shown, we get comparable results overall on all metrics, and hence we have a good starting-point dialogue model to which we next add politeness via three approaches.
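To illustrate why perplexity is only comparable across models that share a vocabulary, the sketch below computes perplexity as the exponential of the average per-token negative log-likelihood (the standard definition; the helper name `perplexity` and the toy inputs are assumptions for illustration). A model that spreads its probability uniformly over a vocabulary of size V has perplexity V, so changing the vocabulary size alone shifts the score.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities of each reference token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A uniform model over a 10,000-word vocabulary versus a 20,000-word one,
# each scored on a 5-token reference response.
ppl_10k = perplexity([math.log(1 / 10_000)] * 5)
ppl_20k = perplexity([math.log(1 / 20_000)] * 5)
```

Here the second model is not "worse" at modeling dialogue; it simply spreads the same uniform belief over a larger token set.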

Primary Human Evaluation Results
In this section, we present our primary human evaluation (MTurk) results on both politeness level and dialogue quality (context-relevance) of the generated response, based on two annotators and a test sample of 300 examples. Table 3 shows the annotator-average scores for each of these two metrics and their absolute difference, based on our Likert scales of 1 to 5 (see Sec. 5.2). We can first see that all three of our stylistic generative models improve on politeness compared to the Seq2seq base model. However, the Fusion model's politeness gain is not statistically significant,15 and moreover it achieves this minor politeness level improvement at the cost of significantly compromising dialogue quality (because its output is half-determined by a standalone politeness-trained LM that ignores context).
Next, we see that the LFT model is the most polite (stat. significance of p < 0.01 over the Seq2seq model), and also has dialogue quality close (statistically equal) to that of Seq2seq. Our final Polite-RL model wins over Seq2seq on politeness (stat. significance of p < 0.01) and also achieves a small improvement in dialogue quality (though not at a stat. significance level; it is, however, stat. significantly better in quality than Retrieval, Generic-10 and Fusion). Moreover, the politeness levels of the LFT and Polite-RL models are statistically equal. Therefore, both models, with their training depth and multitasking trade-offs (see Sec. 4), can produce strong levels of stylistic content without harming context-relevance.
Lastly, we can also see that our two retrieval-based models are both very polite (but not stat. significantly better than LFT); and as expected, they both have dialogue quality lower than Seq2seq, Polite-RL and LFT (stat. significance of p < 0.01). They also feature two of the worst balances between average politeness and dialogue quality score. This is the kind of sacrifice in dialogue quality that we want to avoid when building a stylistic dialogue model.
For inter-annotator agreement, the Kappa score was 0.35 (fair 16) on Dialogue Quality and 0.46 (moderate) on Politeness. If we employ a collapsed-Likert version, where the more ambiguous and extreme scores of {1, 2} and {4, 5} are bucketed together,17 we obtain a Kappa score of 0.42 (moderate) on Dialogue Quality and 0.55 (moderate) on Politeness.
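The effect of collapsing the Likert scale on agreement can be sketched as follows. This is a minimal illustration assuming unweighted Cohen's kappa for two annotators; the ratings are made up for the example and are not our MTurk data, and the helper names `cohen_kappa` and `collapse` are hypothetical.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

def collapse(score):
    """Map a 1-5 Likert score to three buckets: {1,2} -> 0, {3} -> 1, {4,5} -> 2."""
    return 0 if score <= 2 else (1 if score == 3 else 2)

# Made-up ratings from two hypothetical annotators on eight responses.
rater_1 = [1, 2, 3, 4, 5, 4, 2, 3]
rater_2 = [2, 2, 3, 5, 5, 4, 1, 3]

kappa_raw = cohen_kappa(rater_1, rater_2)
kappa_collapsed = cohen_kappa([collapse(s) for s in rater_1],
                              [collapse(s) for s in rater_2])
```

In this toy data every disagreement is between neighboring extreme scores (1 vs. 2 or 4 vs. 5), so bucketing removes all of them and the collapsed kappa is higher than the raw one.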
Human Evaluation Results on 3-way LFT Models
We also present results on a 3-way politeness level comparison MTurk study among the Polite-LFT, Neutral-LFT, and Rude-LFT models, i.e., the LFT model with three levels (scores) of scaling the prepended style label, corresponding to politeness scores 1.0, 0.5 and 0.0, respectively (Table 4, Continuous-LFT column). The table shows that Polite-LFT is significantly more polite than Neutral-LFT (stat. significance of p < 0.01), and Neutral-LFT is in turn more polite than Rude-LFT (stat. significance of p < 0.01). For inter-annotator agreement on this 3-way LFT study, we get a Kappa of 0.51 (moderate), and 0.61 (substantial) for the collapsed-Likert case.
We also experimented earlier with a discrete version of LFT, where we treated responses in the [0.8, 1.0] range as polite, [0.2, 0.8] as neutral, and [0.0, 0.2] as rude. Instead of scaling a single label embedding with continuous politeness scores (as described in Section 4.3), we assigned to each response one of these three labels with no scaling, according to its corresponding politeness bin. The human evaluation scores for that model were 3.52, 3.09 and 2.93, respectively, which shows a smaller score difference between neutral and rude (Table 4, Discrete-LFT column).

16 These levels were defined by Landis and Koch (1977)
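The difference between the continuous and discrete LFT variants can be sketched as below. This is an illustrative sketch only: the function names, the 4-dimensional toy embedding, and the handling of the boundary values 0.2 and 0.8 (which the ranges above leave ambiguous) are assumptions.

```python
import numpy as np

def scaled_label_embedding(label_embedding, politeness_score):
    """Continuous LFT: scale the single label embedding by a score in [0, 1]."""
    if not 0.0 <= politeness_score <= 1.0:
        raise ValueError("politeness score must lie in [0, 1]")
    return politeness_score * np.asarray(label_embedding, dtype=float)

def discrete_label(politeness_score):
    """Discrete LFT: pick one of three unscaled labels by politeness bin.

    Boundary handling is an assumption: 0.8 counts as polite, 0.2 as rude.
    """
    if not 0.0 <= politeness_score <= 1.0:
        raise ValueError("politeness score must lie in [0, 1]")
    if politeness_score >= 0.8:
        return "polite"
    if politeness_score <= 0.2:
        return "rude"
    return "neutral"

# Toy label embedding (hypothetical values).
label = np.ones(4)

# Test-time settings used for Polite-LFT, Neutral-LFT and Rude-LFT.
polite_vec = scaled_label_embedding(label, 1.0)
neutral_vec = scaled_label_embedding(label, 0.5)
rude_vec = scaled_label_embedding(label, 0.0)
```

With continuous scaling, a response scored 0.3 and one scored 0.7 receive different label vectors, whereas the discrete version assigns both the same "neutral" label.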

Automatic Metric Evaluation Results
As discussed in Sec. 5.2, we also use some automatic evaluation metrics to complement and verify the MTurk human study results. In Table 5, we present the average politeness classifier and BLEU-4 scores of responses from each model. First, we can see that our politeness classifier agrees reasonably well with the human politeness judgments in Table 3, since both identify the Retrieval-based models and LFT as the most polite, followed by Polite-RL and Fusion in descending order. We quantified this 'agreement' concretely, and found high correlation between the six human Politeness scores (

Hence, overall, the automatic metric evaluation again shows that without politeness training, the base dialogue model produces neutral responses on average (0.49 score), while the retrieval-based models and all three proposed generative models improve on politeness score. Also, the BLEU scores show, similar to the human study results in Table 3, that among the three proposed models, the Fusion model sacrifices the most dialogue quality to become more polite, whereas the LFT and RL models maintain comparable quality with improved politeness over the base model (Seq2seq). For the retrieval models, we again see that their politeness levels are higher than those of the LFT and RL models, but with a corresponding loss in dialogue quality.

such as gratitude, deference, greeting, positive lexicon, indirection, indicative modal, and negative ones such as negative lexicon, direct question, direct start, 2nd person start. However, it does occasionally give strongly polite or rude scores to some mild or neutral responses, e.g., "they were in a car accident", showing scope for classifier improvements.

Output Examples of Stylistic Dialogue
Next, we show some output examples of our polite dialogue models w.r.t. the base Seq2seq model as well as the retrieval-based models. We use these examples to demonstrate the politeness strategies our proposed generative models have learned (in Table 7). In the first example, our stylistic models use politeness strategies such as indirection, positive lexicon and counterfactual modal (Danescu-Niculescu-Mizil et al., 2013). This example also illustrates the behavior of the Retrieval model, i.e., most of the time it just outputs an utterance that has word overlap with the context but is totally irrelevant to it. Thus, although all its retrieved responses have oracle-level fluency and grammaticality, its average dialogue quality score in the human evaluation is still not as good as that of Seq2seq.
In the second example, Fusion uses indirection, while LFT is being polite even when disagreeing with the abusive language from Y. This example also shows that Generic-10, due to its limited space for retrieval, oftentimes fails to provide a relevant answer, although it is the most polite one since its candidate responses are manually picked. In the third example, Fusion and LFT both use positive lexicon, and RL pays a compliment. In the fourth example, each of the three proposed models uses positive lexicon. It is worth noting that in the last example, while LFT and Polite-RL seem to provide a relevant compliment, they are actually complimenting the wrong person. This kind of issue motivates us toward creating persona-based (Li et al., 2016c) politeness models in future work.

Visualization of Polite-RL Reward
Using derivative saliency (Simonyan et al., 2013; Li et al., 2016a; Aubakirova and Bansal, 2016), we also visualize how much each token in the sampled response contributes to the classifier's reward during training of the Polite-RL model. Fig. 5 shows three such heatmaps that correspond to the magnitudes of the derivative in absolute value with respect to each dimension. The figures clearly show that the classifier has learned to identify multiple politeness strategies, e.g., "smart" (deference), "sir" (polite address), and the two "sorry"s (apologizing).
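The idea behind derivative saliency can be sketched with a toy differentiable scorer. The sketch below is an assumption-laden illustration: it replaces the LSTM-CNN classifier with a hypothetical scorer (a sigmoid over summed tanh features) whose gradient can be written by hand, whereas in practice the gradient would come from automatic differentiation through the real classifier. All names and the random embeddings are made up.

```python
import numpy as np

def score(E, w):
    """Toy classifier score in (0, 1) for token embeddings E of shape (T, D)."""
    z = np.sum(np.tanh(E) @ w)
    return 1.0 / (1.0 + np.exp(-z))

def saliency(E, w):
    """|d score / d E[t, d]| for every token t and embedding dimension d."""
    s = score(E, w)
    # Chain rule: sigmoid'(z) * w[d] * (1 - tanh(E[t, d])^2).
    grad = s * (1.0 - s) * (1.0 - np.tanh(E) ** 2) * w
    return np.abs(grad)

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 3))   # 4 tokens, 3 embedding dimensions (toy sizes)
w = rng.normal(size=3)        # hypothetical classifier weights

heat = saliency(E, w)               # one heatmap row per token
token_importance = heat.sum(axis=1) # aggregate over embedding dimensions
```

Each row of `heat` corresponds to one token and each column to one embedding dimension, which mirrors the per-dimension layout of the heatmaps described above.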

Conclusion and Future Work
We first presented three diverse generative models that can generate rich polite-to-rude spectrum dialogue responses (based on the politeness theories by Brown and Levinson (1987)), without using any parallel data (which is usually assumed for tasks such as machine translation) and only relying on a style classifier. Via multiple human evaluation studies and automatic metrics, we demonstrated that all three models generate more polite responses (displaying several politeness strategies discussed in previous psycholinguistic works), while LFT and Polite-RL are able to do so without losing dialogue quality, as opposed to the Fusion model as well as the two retrieval-based models.
In future work, there is still much room for improvement on the politeness as well as dialogue quality side, and one could employ more recent, advanced models such as variational, adversarial, and decoder-regulation techniques.
Though we focused on politeness for the scope of this paper, our models can be easily generalized to other emotion and personality styles (only relying on a style classifier), hopefully contributing towards the valuable paradigm of human-like and engaging intelligent tutors and personal assistants. In future work, our Polite-RL model could also be extended to stylistic task-based dialogue generation, where both content preservation and style transfer are needed, potentially by disentangling politeness and content of the generated response and then only feeding the politeness portion to the classifier for RL training.

Figure 2: Fusion model: the output probability distributions of the decoder and the polite-LM are linearly mixed to generate the final decoded outputs.

Figure 3: Label-Fine-Tuning model: during training, the embedding of the prepended label is scaled by the style classifier's continuous score on the ground-truth (target) sequence. During testing, we scale the embedding of the label by the desired (continuous) politeness score of the generated response.

Figure 4: Polite-RL model: upper-right shows max-likelihood (ML) training with generated and ground-truth target sequences; lower-right shows RL training with a randomly sampled response generated by the model and the reward it generates after getting fed into the style classifier. Note that the attention mechanism is not shown here for clarity.

Figure 5: Saliency heatmaps of the classifier's attention (reward for sampled responses in Polite-RL model).
Seq2seq model and the LM model at time t, respectively. The final 'fused' distribution p_t for that time step is a linear mixture of the two, weighted by the fusion ratio α (Eq. 1).
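A minimal sketch of such a linear mixture is given below, assuming the convention that α weights the Seq2seq distribution and (1 − α) weights the polite-LM distribution; with the balanced value α = 0.5 used in our experiments the two conventions coincide. The function name `fuse` and the toy distributions over a 3-word vocabulary are illustrative assumptions.

```python
import numpy as np

def fuse(p_s2s, p_lm, alpha=0.5):
    """Linearly mix two next-token distributions over the same vocabulary.

    alpha weights the Seq2seq decoder distribution (assumed convention);
    (1 - alpha) weights the polite language model distribution.
    """
    p_s2s = np.asarray(p_s2s, dtype=float)
    p_lm = np.asarray(p_lm, dtype=float)
    if p_s2s.shape != p_lm.shape:
        raise ValueError("both distributions must share one vocabulary")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_s2s + (1.0 - alpha) * p_lm

# Toy distributions at one decoding step (made-up numbers).
p_s2s = np.array([0.7, 0.2, 0.1])   # context-aware decoder
p_lm = np.array([0.1, 0.3, 0.6])    # context-free polite LM
p_t = fuse(p_s2s, p_lm, alpha=0.5)
```

Because both inputs sum to one and the weights sum to one, the fused vector is again a valid probability distribution; in this toy step, half of the probability mass comes from a model that never sees the dialogue context.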

Table 3: MTurk human evaluation results on politeness level and dialogue quality (as well as the absolute value difference between the two, to show balance) of the Retrieval Models, Seq2seq and the three proposed generative models (avg. of two annotators is shown here). Top results are boldfaced.

Table 4: MTurk human evaluation results on politeness level of 3 LFT models, for both the continuous and the discrete versions.

16 Also see https://en.wikipedia.org/wiki/Cohens_kappa
17 As discussed in Weijters et al. (2010), James et al. (1984), and https://en.wikipedia.org/wiki/Likert_scale, the 'central tendency bias' makes raters avoid using the two extreme response categories.

Table 7: Output dialogue response examples by Retrieval,