Generating Sentences by Editing Prototypes

We propose a new generative language model for sentences that first samples a prototype sentence from the training corpus and then edits it into a new sentence. Compared to traditional language models that generate from scratch either left-to-right or by first sampling a latent sentence vector, our prototype-then-edit model improves perplexity on language modeling and generates higher quality outputs according to human evaluation. Furthermore, the model gives rise to a latent edit vector that captures interpretable semantics such as sentence similarity and sentence-level analogies.


Introduction
The ability to generate sentences is a core component of many NLP systems, including machine translation (Kalchbrenner and Blunsom, 2013; Koehn et al., 2007), summarization (Hahn and Mani, 2000; Nallapati et al., 2016), speech recognition (Jurafsky and Martin, 2000), and dialogue (Ritter et al., 2011). State-of-the-art models are largely based on recurrent neural language models (NLMs) that generate sentences from scratch, often in a left-to-right manner. It is often observed that such NLMs suffer from the problem of favoring generic utterances such as "I don't know". At the same time, naive strategies to increase diversity have been shown to compromise grammaticality (Shao et al., 2017), suggesting that current NLMs may lack the inductive bias to faithfully represent the full diversity of complex utterances.

Figure 1: The prototype-then-edit model generates a sentence by sampling a random example from the training set and then editing it using a randomly sampled edit vector.

Indeed, it is difficult even for humans to write complex text from scratch in a single pass; we often create an initial draft and incrementally revise it (Hayes and Flower, 1986). Inspired by this process, we propose a new generative model of text which we call the prototype-then-edit model, illustrated in Figure 1. It first samples a random prototype sentence from the training corpus, and then invokes a neural editor, which draws a random "edit vector" and generates a new sentence by attending to the prototype while conditioning on the edit vector. The motivation is that sentences from the corpus provide a high quality starting point: they are grammatical, naturally diverse, and exhibit no bias towards shortness or vagueness. The attention mechanism (Bahdanau et al., 2015) then extracts the rich information from the prototype and generalizes to novel sentences.
We train the neural editor by maximizing an approximation to the generative model's log-likelihood. This objective is a sum over lexically similar sentence pairs in the training set, which we can scalably approximate using locality sensitive hashing. We also show empirically that most lexically similar sentences are also semantically similar, thereby endowing the neural editor with additional semantic structure. For example, we can use the neural editor to perform a random walk from a seed sentence to traverse semantic space.
We compare our prototype-then-edit model against approaches which generate from scratch on two fronts: language generation quality and semantic properties. For the former, our model produces higher-quality generations according to human evaluations, and improves perplexity by 13 points on the Yelp corpus and 7 points on the One Billion Word Benchmark. For the latter, we show that latent edit vectors outperform standard sentence variational autoencoders (Bowman et al., 2016) on semantic similarity, locally-controlled text generation, and a sentence analogy task.

Problem statement
Our primary goal is to learn a generative model of sentences. In particular, we model sentence generation as a prototype-then-edit process:
1. Prototype selector: Given a training corpus of sentences X, we randomly sample a prototype sentence x′ ∼ p(x′) (in our case, uniform over X).
2. Neural editor: First we draw an edit vector z from an edit distribution p(z), which encodes the type of edit. Then, we draw the final sentence x from p_edit(x | x′, z), which takes the prototype x′ and the edit vector z.
Under this model, the likelihood of a sentence is:

$$p(x) = \sum_{x' \in \mathcal{X}} p(x')\, p(x \mid x'), \qquad p(x \mid x') = \int p_{\text{edit}}(x \mid x', z)\, p(z)\, dz \tag{1}$$

where both the prototype x′ and the edit vector z are latent variables. Our formulation stems from the observation that many sentences in a large corpus can be represented as minor transformations of other sentences. For example, in the Yelp restaurant review corpus (Yelp, 2017) we find that 70% of the test set is within word-token Jaccard distance 0.5 of a training set sentence, even though almost no sentences are repeated verbatim. This implies that a neural editor which models lexically similar sentences should be an effective generative model for large parts of the test set.
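As a rough illustration of the coverage statistic above, the following sketch computes word-token Jaccard distance and the fraction of test sentences that fall within distance 0.5 of some training sentence. It assumes whitespace tokenization and treats each sentence as a set of tokens; the function names and the toy sentences are hypothetical, and the brute-force scan stands in for the hashing-based lookup that a real corpus would require.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Word-token Jaccard distance: 1 - |A ∩ B| / |A ∪ B| over token sets."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:  # two empty sentences: treat them as identical
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def coverage(test_sentences, train_sentences, threshold=0.5):
    """Fraction of test sentences within `threshold` of some training sentence."""
    if not test_sentences:
        return 0.0
    covered = sum(
        any(jaccard_distance(t, s) < threshold for s in train_sentences)
        for t in test_sentences
    )
    return covered / len(test_sentences)


# Toy data, invented for illustration only.
train = ["my son enjoyed the delicious pizza", "the service was slow"]
test = ["my son enjoyed the delicious macaroni", "parking was impossible to find"]
print(jaccard_distance(test[0], train[0]))  # 1 - 5/7
print(coverage(test, train))                # 0.5
```

The first test sentence shares five of seven distinct tokens with a training sentence and is therefore covered; the second shares at most one token with any training sentence and is not.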
A secondary goal for the neural editor is to capture certain semantic properties; we focus on the following two in particular:
1. Semantic smoothness: an edit should be able to alter the semantics of a sentence by a small and well-controlled amount, while multiple edits should make it possible to accumulate a larger change.
2. Consistent edit behavior: the edit vector z should control the variation in the type of edit that is performed. When we apply the same edit vector to different sentences, the neural editor should perform semantically analogous edits across the sentences.
In Section 4, we show that the neural editor can successfully capture both properties, as reported by human evaluations.

Approach
We would like to train our neural editor by maximizing the marginal likelihood (Equation 1), but the exact likelihood is difficult to maximize because it involves a sum over all latent prototypes x′ (expensive) and integration over the latent edit vector z (intractable).
We therefore propose two approximations to overcome these challenges:
1. We approximate the sum over all prototypes x′ by only summing over x′ that are lexically similar to x.
2. We lower bound the integral over latent edit vectors by modeling z with a variational autoencoder, which admits tractable inference via the evidence lower bound (ELBO). Incidentally, this also provides additional semantic structure.
We describe and motivate both of these approximations in the subsections below.

Approximate sum on prototypes, x′
Equation 1 defines the probability of generating a sentence x as the total probability of reaching x via edits from every prototype x′ ∈ X. However, most prototypes are unrelated and should have very small probability of transforming into x. Therefore, we approximate the marginal distribution over prototypes by only considering the prototypes x′ with high lexical overlap with x, as measured by word-token Jaccard distance d_J. Formally, define a lexical similarity neighborhood as N(x) = {x′ ∈ X : d_J(x, x′) < 0.5}. The neighborhoods N(x) can be constructed efficiently with locality sensitive hashing (LSH) and minhashing. The full procedure is described in Appendix 6.1. Then the log-likelihood of the prototype-then-edit process is lower bounded by:

$$\begin{aligned}
\log p(x) &= \log \sum_{x' \in \mathcal{X}} p(x')\, p(x \mid x') \\
&\ge \log \sum_{x' \in N(x)} p(x')\, p(x \mid x') \\
&\ge \log \Big[ |N(x)|^{-1} \sum_{x' \in N(x)} p(x')\, p(x \mid x') \Big] \\
&\ge |N(x)|^{-1} \sum_{x' \in N(x)} \log \big[ p(x')\, p(x \mid x') \big]
\end{aligned} \tag{2}$$

where the inequalities follow from summing over fewer terms, multiplying by a number that is at most 1, and Jensen's inequality. Since the prototype prior p(x′) is uniform, log p(x′) is a constant. Treating |N(x)|^{-1} as a constant as well and summing over the training set, we arrive at the following objective:

$$\mathcal{L}_{\text{LEX}} = \sum_{x \in \mathcal{X}} \sum_{x' \in N(x)} \log p(x \mid x') \tag{3}$$

Interlude: lexical similarity semantics. We have motivated lexical similarity neighborhoods via computational considerations, but we find that lexical similarity training also captures semantic similarity. One can certainly construct sentences with small lexical distance that differ semantically (e.g., insertion of the word "not"). However, since we mine sentences from a corpus grounded in real world events, most lexically similar sentences are also semantically similar. For example, given "my son enjoyed the delicious pizza", we are far more likely to see "my son enjoyed the delicious macaroni" than "my son hated the delicious pizza".
Human evaluations of 250 edit pairs sampled from lexical similarity neighborhoods on the Yelp corpus support this conclusion: 35.2% of the sentence pairs were judged to be exact paraphrases, while 84% of the pairs were judged to be at least roughly equivalent. Sentence pairs were negated or changed in topic only 7.2% of the time. Thus, a neural editor trained on this distribution should preferentially generate semantically similar edits.

Approximate integration on edit vectors, z
Now let us tackle integration over the latent edit vectors. To do this, we apply the evidence lower bound (ELBO) to the integral over z in Equation 1:

$$\log p(x \mid x') \ge \mathbb{E}_{z \sim q(z \mid x, x')}\big[\log p_{\text{edit}}(x \mid x', z)\big] - \mathrm{KL}\big(q(z \mid x, x') \,\|\, p(z)\big) = \ell(x, x') \tag{4}$$

The important elements of ℓ(x, x′) are the neural editor p_edit(x | x′, z), the edit prior p(z), and the approximate edit posterior q(z | x, x′) (we describe each of these shortly). Combining the ELBO with Equation 3, our final objective function is:

$$\mathcal{L}_{\text{ELBO}} = \sum_{x \in \mathcal{X}} \sum_{x' \in N(x)} \ell(x, x') \tag{5}$$

L_ELBO is now maximized over the parameters of both p_edit and q. Note that q is only used for training, and discarded at test time. With the introduction of q, we are now training p(x | x′) as a conditional variational autoencoder (C-VAE) (it is conditioned on x′). We maximize the objective via SGD, approximating the gradient using the usual Monte Carlo estimate for VAEs (Kingma and Welling, 2014).
Neural editor p_edit(x | x′, z): We implement our neural editor as a left-to-right sequence-to-sequence model with attention, where the prototype is the input sequence and the revised sentence is the output sequence. We employ an encoder-decoder architecture similar to Wu et al. (2016), extending it to condition on an edit vector z by concatenating z to the input of the decoder at each time step. Further details are given in Appendix 6.2.
Edit prior p(z): We sample z from the prior by drawing a random magnitude z_norm ∼ Unif(0, 10) and then drawing a random direction z_dir ∼ vMF(0), where vMF(0) is a uniform distribution over the unit sphere (von Mises-Fisher distribution with concentration = 0). The resulting edit vector is z = z_norm · z_dir.
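Sampling from this prior can be sketched in a few lines. The snippet below is a minimal illustration rather than the authors' implementation: it assumes that a uniform direction on the unit sphere may be obtained by normalizing an isotropic Gaussian draw, and the function name and the dimension of 8 are hypothetical.

```python
import math
import random


def sample_edit_prior(dim: int, max_norm: float = 10.0, rng=random) -> list:
    """Draw z = z_norm * z_dir with z_norm ~ Unif(0, max_norm) and z_dir uniform on the sphere."""
    # Normalizing an isotropic Gaussian sample yields a uniformly distributed
    # direction, which is what a vMF with concentration 0 reduces to.
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        g_norm = math.sqrt(sum(v * v for v in g))
        if g_norm > 0.0:  # guard against the (measure-zero) all-zero draw
            break
    z_norm = rng.uniform(0.0, max_norm)
    return [z_norm * v / g_norm for v in g]


z = sample_edit_prior(dim=8, rng=random.Random(0))  # dim=8 is an arbitrary choice
print(math.sqrt(sum(v * v for v in z)))  # the magnitude lies between 0 and 10
```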
Approximate edit posterior q(z | x, x′): q is named the approximate edit posterior because the ELBO is tight when q matches the true posterior: the best possible estimate of the edit z given both the prototype x′ and the revision x. Our design of q treats the edit vector z as a generalization of word vectors. In the case of a single word insertion, a good edit vector would be the word vector of the inserted word. Extending this to multi-word edits, we would like multi-word insertions to be represented as the sum of the individual word vectors.
Since q is an encoder observing both the prototype x′ and the revised sentence x, it can directly observe the word differences between x and x′. Define I = x \ x′ to be the set of words added to the prototype, and D = x′ \ x to be the words deleted.
We would then like q to output a z equal to

$$f(x, x') = \sum_{w \in I} \Phi(w) \;\oplus\; \sum_{w \in D} \Phi(w)$$

where Φ(w) is the word vector for word w and ⊕ denotes concatenation. The word embeddings Φ are parameters of q. However, q cannot deterministically output f(x, x′): without any entropy in q, the KL term in Equation 5 would be infinite and training would be infeasible. Hence, we design q to output a noisy, perturbed version of f: we perturb the norm of f by adding uniform noise, and we perturb the direction of f by adding von Mises-Fisher noise. The von Mises-Fisher distribution is a distribution over vectors with unit norm, with a mean µ and a precision κ such that the log-likelihood of drawing a vector decays linearly with the cosine similarity to µ.
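The noiseless target f(x, x′) can be illustrated with a small sketch. It assumes whitespace tokenization, treats sentences as word sets, and uses a toy two-dimensional embedding table invented for the example; in the model the embeddings Φ are learned parameters of q, and the function name below is hypothetical.

```python
def edit_vector_mode(prototype: str, revision: str, embed: dict, dim: int) -> list:
    """f(x, x'): sum of inserted-word vectors concatenated with sum of deleted-word vectors."""
    proto_words, rev_words = set(prototype.split()), set(revision.split())
    inserted = rev_words - proto_words  # I = x \ x'
    deleted = proto_words - rev_words   # D = x' \ x

    def vec_sum(words):
        total = [0.0] * dim
        for w in words:
            for i, v in enumerate(embed[w]):
                total[i] += v
        return total

    # List concatenation plays the role of the ⊕ operator.
    return vec_sum(inserted) + vec_sum(deleted)


# Toy 2-d embeddings, invented for illustration only.
embed = {"pizza": [1.0, 0.0], "macaroni": [0.0, 1.0], "fresh": [0.5, 0.5]}
f = edit_vector_mode(
    prototype="my son enjoyed the delicious pizza",
    revision="my son enjoyed the fresh delicious macaroni",
    embed=embed,
    dim=2,
)
print(f)  # [0.5, 1.5, 1.0, 0.0]
```

Here the inserted words "fresh" and "macaroni" sum to the first half of the vector, and the deleted word "pizza" fills the second half.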
Let f_norm = ‖f‖ and f_dir = f / f_norm. Furthermore, define f̃_norm = min(f_norm, 10) to be the truncated norm. Then q draws the direction z_dir from a von Mises-Fisher distribution with mean f_dir and precision κ, and draws the norm z_norm by adding uniform noise of width ε to f̃_norm, where the resulting z = z_dir · z_norm. The resulting distribution q has three parameters: (Φ, κ, ε), where κ and ε are hyperparameters. This distribution is straightforward to use as part of a variational autoencoder, as sampling can be easily performed using the reparameterization trick and the rejection sampler of Wood (1994). Furthermore, the KL term has a closed-form expression independent of z, written in terms of I_n(κ), the modified Bessel function of the first kind, and Γ, the gamma function.
Our design of q differs substantially from the standard choice of a normal distribution with a given mean (Bowman et al., 2016; Kingma and Welling, 2014) for two reasons: First, by construction, edit vectors are sums of word vectors and since cosine distances are traditionally used to measure distances between word vectors, it would be natural to encode distances between edit vectors by the cosine distance. The von Mises-Fisher distribution captures this idea, as the log-likelihood of transforming f_dir into z_dir decays linearly with the cosine similarity.
Second, our parameterization avoids collapsing the latent code, which is a serious problem with variational autoencoders in practice (Bowman et al., 2016). With a Gaussian latent noise variable, the KL-divergence term is instance-dependent, and thus the encoder must decide how much information to pass to the decoder. Even with training techniques such as annealing, the model often learns to ignore the encoder entirely. In our case, this does not occur, since the KL divergence between a von Mises-Fisher with parameter κ and the uniform distribution is independent of the mean f_dir, allowing us to optimize κ separately using binary search. In practice, we never observe issues with encoder collapse using standard gradient training.

Experiments
We divide our experimental results into two parts. In Section 4.3, we evaluate the merits of the prototype-then-edit model as a generative modeling strategy, measuring its improvements on language modeling (perplexity) and generation quality (human evaluations of diversity and plausibility). In Section 4.4, we focus on the semantics learned by the model and its latent edit vector space. We demonstrate that it possesses interpretable semantics, enabling us to smoothly control the magnitude of edits, incrementally optimize sentences for target properties, and perform analogy-style sentence transformations.

Datasets
We evaluate perplexity on the Yelp review corpus (YELP, Yelp (2017)) and the One Billion Word Language Model Benchmark (BILLIONWORD, Chelba et al. (2013)). For qualitative evaluations of generation quality and semantics, we focus on YELP as our primary test case, as we found that human judgments of semantic similarity were much better calibrated in this focused setting.
For both corpora, we used the named-entity recognizer (NER) in spaCy to replace named entities with their NER categories. We replaced tokens outside the top 10,000 most frequent tokens with an "out-of-vocabulary" token.
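The vocabulary truncation step can be sketched as follows. This is a minimal illustration assuming whitespace tokenization and `<unk>` as the out-of-vocabulary token; the function names are hypothetical, and the named-entity replacement step is omitted because it depends on an external NER model.

```python
from collections import Counter


def build_vocab(sentences, max_size=10_000):
    """Keep the `max_size` most frequent whitespace tokens."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return {tok for tok, _ in counts.most_common(max_size)}


def replace_oov(sentence, vocab, oov_token="<unk>"):
    """Map every token outside `vocab` to the out-of-vocabulary token."""
    return " ".join(tok if tok in vocab else oov_token for tok in sentence.split())


# Toy corpus with a tiny vocabulary limit so the effect is visible.
corpus = ["the food was good", "the service was good", "the tiramisu was divine"]
vocab = build_vocab(corpus, max_size=3)
print(replace_oov("the tiramisu was good", vocab))  # the <unk> was good
```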

Approaches
Throughout our experiments, we compare NEURALEDITOR against several baseline generative models, including NLM, KN5, and MEMORIZATION.
When estimating the likelihood of NEURALEDITOR, we sum over training set instances within Jaccard distance < 0.5, and for the VAE term in NEURALEDITOR, we use the one-sample approximation to the lower bound used in Kingma and Welling (2014) and Bowman et al. (2016).
Compared to NLM, our initial result is that NEURALEDITOR is able to drastically improve log-likelihood for a significant number of sentences in the test set (Figure 2) when considering test sentences with at least one similar sentence in the training set. However, it places lower log-likelihood on sentences which are far from any prototypes, as it was not trained to make extremely large edits. Proximity to a prototype seems to be the chief determiner of NEURALEDITOR's performance.
To evaluate NEURALEDITOR's perplexity, we use smoothing with NLM to account for rare sentences not within our Jaccard distance threshold. We find NEURALEDITOR improves perplexity over NLM and KN5. Table 1 shows that this is the case for both YELP and the more general BILLIONWORD, which contains substantially fewer test-set sentences close to the training set. On YELP, we surpass even the best ensemble of NLM and KN5, while on BILLIONWORD we nearly match their performance.
Since NEURALEDITOR draws its strength from sentences in the training set, we also compared against a simpler alternative, in which we ensemble the NLM and MEMORIZATION (retrieval without edits). NEURALEDITOR performs dramatically better than this alternative. Table 2 also qualitatively demonstrates that sentences generated by NEURALEDITOR are substantially different from the original prototypes.
Human evaluation. We now turn to human evaluation of generation quality, focusing on grammaticality and plausibility (footnote 6). We evaluate generations from NEURALEDITOR against an NLM with a temperature parameter on the per-token softmax (footnote 7), which is a popular technique for suppressing incoherent and ungrammatical sentences. Many NLM systems have noted an undesirable tradeoff between grammaticality and diversity, where a temperature low enough to enforce grammaticality results in short and generic utterances. Figure 3 illustrates that both the grammaticality and plausibility of NEURALEDITOR with a temperature of 1.0 are on par with the best tuned temperature for NLM, with a far higher diversity, as measured by unigram entropy. We also find that decreasing the temperature of the NEURALEDITOR can be used to slightly improve the grammaticality, without substantially reducing the diversity of the generations.

Footnote 6: Human raters were asked, "How plausible is it for this sentence to appear in the corpus?" on a scale of 1-3.
Footnote 7: If s_i is the softmax logit for token w_i and τ is a temperature parameter, the temperature-adjusted distribution is p(w_i) ∝ exp(s_i/τ).

Table 2 (prototype x′ and revision x):
Prototype x′: i had the fried whitefish taco which was decent, but i've had much better.
Revision x: i had the <unk> and the fried carnitas tacos, it was pretty tasty, but i've had better.
Prototype x′: "hash browns" are unseasoned, frozen potato shreds burnt to a crisp on the outside and mushy on the inside.
Revision x: the hash browns were crispy on the outside, but still the taste was missing.
Prototype x′: i'm not sure what is preventing me from giving it <cardinal> stars, but i probably should.
Revision x: i'm currently giving <cardinal> stars for the service alone.
Prototype x′: quick place to grab light and tasty teriyaki.
Revision x: this place is good and a quick place to grab a tasty sandwich.
Prototype x′: sad part is we've been there before and its been good.
Revision x: i've been here several times and always have a good time.

A key advantage of edit-based models thus emerges: Prototypes sampled from the training set organically inject diversity into the generation process, even if the temperature of the decoder in the NEURALEDITOR is zero. Hence, we can keep the decoder at a very low temperature to maximize grammaticality and plausibility, without sacrificing sample diversity. In contrast, a zero temperature NLM would collapse to outputting one generic sentence.
This also suggests that the temperature parameter for the NEURALEDITOR captures a more natural notion of diversity: higher temperature encourages more aggressive extrapolation from the training set while lower temperatures favor more conservative mimicking. This is likely to be more useful than the tradeoff for generation-from-scratch, where the temperature also affects the quality of generations.
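For reference, the temperature adjustment p(w_i) ∝ exp(s_i/τ) used in this comparison can be sketched as below. The logits are hypothetical values chosen for illustration, and the function name is not taken from any particular library. A temperature of exactly zero is not covered by the formula; it corresponds to the limiting case of always choosing the highest-scoring token.

```python
import math


def temperature_softmax(logits, tau):
    """Return p(w_i) ∝ exp(s_i / tau) for logits s_i and temperature tau > 0."""
    if tau <= 0:
        raise ValueError("tau must be positive; tau -> 0 approaches greedy decoding")
    scaled = [s / tau for s in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical per-token logits
for tau in (1.0, 0.5, 0.1):
    print(tau, [round(p, 3) for p in temperature_softmax(logits, tau)])
```

Lowering τ concentrates probability mass on the highest-scoring token, which is why a low-temperature NLM drifts toward a single generic output while the prototype-then-edit model retains diversity through the sampled prototype.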

Semantics of the neural editor
In this section, we investigate learned semantics of the NEURALEDITOR, focusing on the two desiderata discussed in Section 2: semantic smoothness, and consistent edit behavior.
In order to establish a baseline for these properties, we consider existing sentence generation techniques which can sample semantically similar sentences. We are not aware of other approaches which attempt to learn a vector space for edits, but there are many approaches which learn a vector space for sentences. Of particular relevance is the sentence variational autoencoder (SVAE) which also imposes semantic structure onto a latent vector space, but uses the latent vector to represent the entire sentence, rather than just an edit. To use the SVAE to "edit" a target sentence into a semantically similar sentence, we perturb its underlying latent sentence vector and then decode the result back into a sentence, the same method used in Bowman et al. (2016).
Semantic smoothness. A good editing system should have fine-grained control over the semantics of a sentence: i.e., each edit should only alter the semantics of a sentence by a small and well-controlled amount. We call this property semantic smoothness.
To study smoothness, we first generate an "edit sequence" by randomly selecting a prototype sentence, and then repeatedly editing via the neural editor (with edits drawn from the edit prior p(z)) to produce a sequence of revisions. We then ask human annotators to rate the size of the semantic changes between revisions. An example is given in Table 3.
For the SVAE baseline, we generate a similar sequence of sentences by first encoding the prototype sentence, and then decoding after the addition of a random Gaussian with variance 0.4 (footnote 8). This process is repeated to produce a sequence of sentences which we can view as the SVAE equivalent of the edit sequence. Figure 4 shows that the neural editor frequently generates paraphrases despite being trained on lexical similarity, and only 1% of edits are unrelated to the prototype. In contrast, the SVAE often repeats sentences exactly, and when it makes an edit it is equally likely to generate unrelated sentences. This suggests that the neural editor produces substantially smoother sentence sequences with a surprisingly high frequency of paraphrases.
Qualitatively (Table 3), NEURALEDITOR seems to generate long, diverse sentences which smoothly change over time, while the SVAE biases towards short sentences with several semantic jumps, presumably due to the difficulty of training a sufficiently informative SVAE encoder.

Footnote 8: The variance was selected so that the SVAE and NEURALEDITOR have the same average human similarity judgement between two successive sentences. This avoids situations where the SVAE produces completely unrelated sentences due to the perturbation size.
Footnote 10: 545 similarity assessment pairs were collected through Amazon Mechanical Turk following Agirre et al. (2014), with the same scale and prompt. Similarity judgements were converted to descriptions by defining Paraphrase (5), Roughly Equivalent (4-3), Same Topic (2-1), Unrelated (0).

Table 3: Example random walks from NEURALEDITOR, where the top sentence is the prototype.
NEURALEDITOR:
1. this food was amazing one of the best i've tried, service was fast and great.
2. this is the best food and the best service i've tried in <gpe>.
3. some of the best <norp> food i've had in <date> i've lived in <gpe>.
4. i have to say this is the best <norp> food i've had in <gpe>.
5. best <norp> food i've had since moving to <gpe> <date>.
6. this was some of the best <norp> food i've had in the <gpe>.
7. i've lived in <gpe> for <date> and every time we come in this is great
8. i've lived in <gpe> for <date> and have enjoyed my haircut at <gpe> since <date>.
SVAE:
1. this food was amazing one of the best i've tried, service was fast and great.
2. this place is a great place to go if you want a quick bite.
3. the food was good, but the service was terrible.
4. this is the best <norp> food in <gpe>.
5. this place is a great place to go if you want to eat.
6. this is the best <norp> food in <gpe>.
7. the food was good, the service was great.
8. the food was good, but the service was terrible.

Figure 5: Neural editors can shorten sentences (left), include common words (center, the word 'service') and rarer words (right, 'pizza') while maintaining similarity.

Table 4:
NEURALEDITOR:
1. the coffee ice cream was one of the best i've ever tried.
2. some of the best ice cream we've ever had!
3. just had the best ice-cream i've ever had!
4. some of the best pizza i've ever tasted!
5. that was some of the best pizza i've had in the area.
SVAE:
1. the coffee ice cream was one of the best i've ever tried.
2. the <unk> was very good and the food was good.
3. the food was good, but not great.
4. the food was good, but not great.
5. the food was good, but the service was n't bad.

Smoothly controlling sentences. We now show that we can selectively choose edits sampled from the neural editor to incrementally optimize a sentence towards desired attributes. This task serves as a useful measure of semantic coverage: if an edit model has high coverage of sentences that are semantically similar to a prototype, it should be able to satisfy the target attribute while deviating minimally from the prototype's original meaning.
We focus on controlling two simple attributes: compressing a sentence to below a desired length (e.g. 7 words), and inserting a target keyword into the sentence (e.g. "service" or "pizza").
Given a prototype sentence, we try to discover a semantically similar sentence satisfying the target attribute using the following procedure: First, we generate 1000 edit sequences using the procedure described earlier. Then, we select the sequence with highest likelihood whose endpoint possesses the target attribute. We repeat this process for a large number of prototypes.
We use almost the same procedure for the SVAE, but instead of selecting by highest likelihood, we select the sequence whose endpoint has shortest latent vector distance from the prototype (as this is the SVAE's metric of semantic similarity).
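The selection step for the neural editor can be sketched as follows. This is only an illustration of the filtering-then-ranking logic: it assumes each edit sequence arrives with a precomputed log-likelihood, and the function name, the toy sentences, and the scores are all invented. For the SVAE variant, the score would instead be the negated latent distance between endpoint and prototype.

```python
def select_edit_sequence(sequences, has_attribute):
    """Pick the highest-scoring sequence whose endpoint satisfies the target attribute.

    `sequences` is a list of (sentences, log_likelihood) pairs. Returns None
    when no endpoint satisfies `has_attribute`.
    """
    valid = [(s, ll) for s, ll in sequences if has_attribute(s[-1])]
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[1])


# Toy edit sequences with made-up log-likelihoods.
sequences = [
    (["great pizza place", "great service here"], -3.2),
    (["great pizza place", "the service was great and fast"], -5.0),
    (["great pizza place", "great pasta place"], -1.1),
]
has_service = lambda sentence: "service" in sentence.split()
best = select_edit_sequence(sequences, has_service)
print(best)  # (['great pizza place', 'great service here'], -3.2)
```

The third sequence has the highest score but is filtered out because its endpoint lacks the keyword, so the first sequence is selected.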
In Figure 5, we then aggregate the sentences from the collected edit sequences, and plot their semantic similarity to the prototype against their success in satisfying the target attribute. Not surprisingly, as target attribute satisfaction rises, semantic similarity drops. However, we also see that the neural editor sacrifices less semantic similarity to achieve the same level of attribute satisfaction as the SVAE. The SVAE is reasonable on tasks involving common words (such as the word service), but fails when the model is asked to generate rarer words such as pizza. Examples from these word inclusion problems show that the SVAE often becomes stuck generating short, generic sentences (Table 4).
Consistent edit behavior: sentence analogies. In the previous results, we showed that edit models learn to generate semantically similar sentences. We now assess whether the edit vector possesses globally consistent semantics. Specifically, applying the same edit vector to different sentences should result in semantically analogous edits.
Formally, suppose we have two sentences, x_1 and x_2, which are related by some underlying semantic relation r. Given a new sentence y_1, we would like to find a y_2 such that the same relation r holds between y_1 and y_2.
Our approach is to estimate the edit vector between x_1 and x_2 as ẑ = f(x_1, x_2), the mode of our edit posterior q. We then apply this edit vector to y_1 using the neural editor to yield ŷ_2 = argmax_x p_edit(x | y_1, ẑ).
Since it is difficult to output ŷ_2 exactly matching y_2, we take the top k candidate outputs of p_edit (using beam search) and evaluate whether the gold y_2 appears among the top k elements.
Evaluation for this task can be difficult, as two arbitrary sentences x_1 and x_2 are often not related by any well-defined relationship. However, prior work already provides well-established sets of word analogies (Mikolov et al., 2013a; Mikolov et al., 2013b). We leverage these to generate a new dataset of sentence analogies, using a simple strategy: given an analogous word pair (w_1, w_2), we mine the Yelp corpus for sentence pairs (x_1, x_2) such that x_1 is transformed into x_2 by removing w_1 and inserting w_2 (allowing for reordering and inclusion/exclusion of stop words).
For example, given the words (good, best), we mine the sentence pair x_1 = "this was a good restaurant" and x_2 = "this was the best restaurant". Given a new sentence y_1 = "The cake was great", we expect y_2 to be "The cake was the greatest".
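The pair-mining test can be sketched as a simple predicate. The version below is an assumption-laden illustration: it follows the (good, best) example in taking x_1 to contain w_1 and x_2 to contain w_2, compares sentences as lowercase word sets so that reordering is ignored, and uses a three-word stop list invented for the demo rather than the one used for the actual dataset.

```python
STOP_WORDS = {"a", "an", "the"}  # illustrative stop list only


def is_analogy_pair(x1: str, x2: str, w1: str, w2: str) -> bool:
    """True if x2 matches x1 with w1 swapped for w2, ignoring order and stop words."""
    t1 = set(x1.lower().split()) - STOP_WORDS
    t2 = set(x2.lower().split()) - STOP_WORDS
    # Both sentences must agree on every remaining word apart from the swapped pair.
    return w1 in t1 and w2 in t2 and t1 - {w1} == t2 - {w2}


print(is_analogy_pair(
    "this was a good restaurant", "this was the best restaurant", "good", "best"
))  # True
```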
For this task, we initially compared against the SVAE, but it had a top-k accuracy close to zero. Hence, we instead compare to the baseline of randomly sampling an edit vector ẑ ∼ p(z), instead of using ẑ derived from f(x_1, x_2). We can also compare to numbers on the original word analogy task, restricted to the relationships we find in the Yelp corpus, although these numbers are for the easier task of computing word-level (not sentence) analogies.
Interestingly, the top-10 performance of our model in Table 5 is nearly as good as the performance of GloVe vectors on the simpler lexical analogy task, despite the fact that the sentence prediction task is harder. In some categories the neural editor at top-10 actually performs better than word vectors, since the neural editor has an understanding of which words are likely to appear in the context of a Yelp review. Examples in Table 6 show the model is accurate and captures lexical analogies requiring word reorderings.

Related work and discussion
Our work connects with a broad literature on neural retrieval and attention-based generation methods, semantically meaningful representations for sentences, and nonparametric statistics.
Neural language models (Bengio et al., 2003) based upon recurrent neural networks and sequence-to-sequence architectures (Sutskever et al., 2014) have been widely used due to their flexibility and performance across a wide range of NLP tasks (Jurafsky and Martin, 2000; Kalchbrenner and Blunsom, 2013; Hahn and Mani, 2000; Ritter et al., 2011). Our work is motivated by an emerging consensus that attention-based mechanisms (Bahdanau et al., 2015) can substantially improve performance on various sequence-to-sequence tasks by capturing more information from the input sequence (Wu et al., 2016; Vaswani et al., 2017). Our work extends the applicability of attention mechanisms beyond sequence-to-sequence tasks by deriving a training method for models which attend to randomly sampled sentences.
There is a growing literature on applying retrieval mechanisms to augment neural sequence-to-sequence models. For example, Song et al. (2016) ensembled a retrieval system and an NLM for dialogue, using the NLM to transform the retrieved utterance, and Gu et al. (2017) used an off-the-shelf search engine system to retrieve and condition on training set examples. Both of these approaches rely on a deterministic retrieval mechanism which selects a prototype x′ using some input c. In contrast, our work treats the prototype x′ as a latent variable.
In terms of generation techniques that capture semantics, the sentence variational autoencoder (SVAE) (Bowman et al., 2016) is closest to our work in that it attempts to impose semantic structure on a latent vector space. However, the SVAE's latent vector is meant to represent the entire sentence, whereas the neural editor's latent vector represents an edit. Our results from Section 4.4 suggest that local variation over edits is easier to model than global variation over sentences.
Our use of lexical similarity neighborhoods is comparable to context windows used in word vector training (Mikolov et al., 2013a). Proximity of words within text and lexical similarity both serve as a filter which reveals semantics through distributional statistics of the corpus. More generally, results in manifold learning demonstrate that a weak metric such as lexical similarity can be used to extract semantic similarity through distributional statistics (Tenenbaum et al., 2000;Hashimoto et al., 2016).
Prototype-then-edit is a semi-parametric approach that remembers the entire training set and uses a neural editor to generalize meaningfully beyond the training set. The training set provides a strong inductive bias: that the corpus can be characterized by prototypes surrounded by semantically similar sentences reachable by edits. Beyond improvements on generation quality as measured by perplexity, the approach also reveals new semantic structures via the edit vector. A natural next step is to apply these ideas in the conditional setting for tasks such as dialogue generation.

Construction of the LSH
The LSH maps each sentence to other lexically similar sentences in the corpus, representing a graph over sentences. To speed up corpus construction, we apply breadth-first search (BFS) over the LSH sentence graph started at randomly selected seed sentences. We store the edges encountered during the BFS, and uniformly sample from this set to form the training set. The BFS ensures that every query to the LSH index returns a valid edit pair, at the cost of adding bias to our training set.
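The traversal described above can be sketched as a breadth-first search that records every edge it crosses. In this illustration a plain dictionary stands in for queries to the LSH index, the sentence labels are placeholders, and the function name and the `max_edges` cap are assumptions added to keep the example bounded.

```python
import random
from collections import deque


def bfs_edit_pairs(neighbors, seeds, max_edges):
    """Collect (sentence, neighbor) edges by BFS over a lexical-similarity graph."""
    visited = set(seeds)
    queue = deque(seeds)
    edges = []
    while queue:
        x = queue.popleft()
        for x_prime in neighbors.get(x, ()):
            edges.append((x, x_prime))  # every stored edge is a valid edit pair
            if len(edges) >= max_edges:
                return edges
            if x_prime not in visited:
                visited.add(x_prime)
                queue.append(x_prime)
    return edges


# Toy graph: "d" and "e" are similar to each other but unreachable from seed "a".
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": ["e"], "e": ["d"]}
edges = bfs_edit_pairs(graph, seeds=["a"], max_edges=10)
training_pairs = random.Random(0).sample(edges, k=2)  # uniform sample of stored edges
print(edges)  # [('a', 'b'), ('a', 'c'), ('b', 'a'), ('c', 'a')]
```

The toy graph also shows the bias mentioned above: sentence pairs in components that no seed reaches never enter the training set.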

Neural editor architecture
Encoder. The prototype encoder is a 3-layer biLSTM. The inputs to each layer are the concatenation of the forward and backward hidden states of the previous layer, with the exception of the first layer, which takes word vectors as input. Word vectors were initialized with GloVe (Pennington et al., 2014).
Decoder. The decoder is a 3-layer LSTM with attention. At each time step, the hidden state of the top layer is used to compute attention over the top-layer hidden states of the prototype encoder. The resulting attention context vector is then concatenated with the decoder's top-layer hidden state and used to compute a softmax distribution over output tokens.