Optimizing Statistical Machine Translation for Text Simplification

Most recent sentence simplification systems use basic machine translation models to learn lexical and syntactic paraphrases from a manually simplified parallel corpus. These methods are limited by the quality and quantity of manually simplified corpora, which are expensive to build. In this paper, we conduct an in-depth adaptation of statistical machine translation to perform text simplification, taking advantage of large-scale paraphrases learned from bilingual texts and a small amount of manual simplifications with multiple references. Our work is the first to design automatic metrics that are effective for tuning and evaluating simplification systems, which will facilitate iterative development for this task.


Introduction
The goal of text simplification is to rewrite an input text so that the output is more readable. Text simplification has applications for reducing input complexity for natural language processing (Siddharthan et al., 2004; Miwa et al., 2010; Chen et al., 2012b) and providing reading aids for people with limited language skills (Petersen and Ostendorf, 2007; Watanabe et al., 2009; Allen, 2009; De Belder and Moens, 2010; Siddharthan and Katsos, 2010) or language impairments such as dyslexia (Rello et al., 2013), autism (Evans et al., 2014), and aphasia.
It is widely accepted that sentence simplification can be implemented by three major types of operations: splitting, deletion and paraphrasing (Feng, 2008). The splitting operation decomposes a long sentence into a sequence of shorter sentences. Deletion removes less important parts of a sentence. The paraphrasing operation includes reordering, lexical substitutions and syntactic transformations. While sentence splitting (Siddharthan, 2006; Petersen and Ostendorf, 2007; Narayan and Gardent, 2014; Angrosh et al., 2014) and deletion (Knight and Marcu, 2002; Clarke and Lapata, 2006; Filippova and Strube, 2008; Filippova et al., 2015; Rush et al., 2015; and others) have been intensively studied, there has been considerably less research on developing new paraphrasing models for text simplification. Most previous work has used off-the-shelf statistical machine translation (SMT) technology and achieved reasonable results (Coster and Kauchak, 2011a,b; Wubben et al., 2012; Štajner et al., 2015). However, it has either treated the MT technology as a black box (Coster and Kauchak, 2011a,b; Narayan and Gardent, 2014; Angrosh et al., 2014; Štajner et al., 2015) or been limited to modifying only one aspect of it, such as the translation model (Zhu et al., 2010; Woodsend and Lapata, 2011) or the reranking component (Wubben et al., 2012).
In this paper, we present a complete adaptation of a syntax-based machine translation framework to perform simplification. Our methodology poses text simplification as a paraphrasing problem: given an input text, rewrite it subject to the constraints that the output should be simpler than the input, while preserving as much of the input's meaning as possible and maintaining the well-formedness of the text. Going beyond previous work, we make direct modifications to four key components in the SMT pipeline: 1) two novel simplification-specific tunable metrics; 2) large-scale paraphrase rules automatically derived from bilingual parallel corpora, which are more naturally and abundantly available than manually simplified texts; 3) rich rule-level simplification features; and 4) multiple reference simplifications collected via crowdsourcing for tuning and evaluation. In particular, we report the first study that shows promising correlations of automatic metrics with human evaluation. Our work answers the call made in a recent TACL paper (Xu et al., 2015) to address problems in current simplification research: we amend human evaluation criteria, develop automatic metrics, and generate an improved multiple-reference dataset.
Our work is primarily focused on lexical simplification (rewriting words or phrases with simpler versions), and to a lesser extent on syntactic rewrite rules that simplify the input. It largely ignores the important subtasks of sentence splitting and deletion. Our focus on lexical simplification does not affect the generality of the presented work, since deletion or sentence splitting could be applied as pre-or post-processing steps. Xu et al. (2015) laid out a series of problems that are present in current text simplification research, and argued that we should deviate from the previous state-of-the-art benchmarking setup.

Background
First, the Simple English Wikipedia data has dominated simplification research since 2010 (Zhu et al., 2010; Siddharthan, 2014), and is used together with Standard English Wikipedia to create the parallel text for training MT-based simplification systems. However, recent studies (Amancio and Specia, 2014; Hwang et al., 2015; Štajner et al., 2015) showed that the parallel Wikipedia simplification corpus contains a large proportion of inadequate (not much simpler) or inaccurate (not aligned or only partially aligned) simplifications. This is one of the leading reasons that existing simplification systems struggle to generate simplifying paraphrases and instead leave the input sentences unchanged (Wubben et al., 2012). Researchers have previously attempted quick fixes by adding phrasal deletion rules (Coster and Kauchak, 2011a) or by reranking n-best outputs based on their dissimilarity to the input (Wubben et al., 2012). In contrast, we exploit data of improved quality and enlarged quantity, namely, large-scale paraphrase rules automatically derived from bilingual corpora and a small amount of manual simplification data with multiple references for tuning parameters. We then systematically design new tuning metrics and rich simplification-specific features within a syntactic machine translation model to enforce optimization towards simplicity. This approach achieves better simplification performance without relying on a manually simplified corpus to learn paraphrase rules, which is important given that Simple Wikipedia and the newly released Newsela simplification corpus are only available for English.
Second, the evaluation methodology previously used in the simplification literature is uninformative and not comparable across models, due to the entanglement of the three different operations of paraphrasing, deletion, and splitting. This, combined with the unreliable quality of Simple Wikipedia as a gold reference, has been the bottleneck for developing automatic metrics. There exist only a few studies (Wubben et al., 2012; Štajner et al., 2014) on automatic simplification evaluation using existing MT metrics, and they show limited correlation with human assessments. In this paper, we restrict ourselves to lexical simplification, where we believe MT-derived evaluation metrics can best be deployed. Our newly proposed metric is the first automatic metric that shows reasonable correlation with human evaluation on the text simplification task. We also introduce multiple references to make automatic evaluation feasible.
The most related work to ours is that of Ganitkevitch et al. (2013) on sentence compression, in which compression of word and sentence lengths can be more straightforwardly implemented in features and the objective function in the SMT framework. We want to stress that sentence simplification is not a simple extension of sentence compression, but a much more complicated task, primarily because high-quality data is much harder to obtain and the solution space is more constrained by word choice and grammar. Our work is also related to other tunable metrics designed to be very simple and light-weight to ensure fast repeated computation for tuning bilingual translation models (Liu et al., 2010; Chen et al., 2012a). To the best of our knowledge, no tunable metric has been attempted for simplification, except for BLEU. Nor do any evaluation metrics exist for simplification, although there are several designed for other text-to-text generation tasks: grammatical error correction (Felice and Briscoe, 2015; Dahlmeier and Ng, 2012), paraphrase generation (Chen and Dolan, 2011; Xu et al., 2012; Sun and Zhou, 2012), and conversation generation (Galley et al., 2015). Another line of related work is lexical simplification, which focuses on finding simpler synonyms of a given complex word (Yatskar et al., 2010; Biran et al., 2011; Specia et al., 2012; Horn et al., 2014).

Adapting Machine Translation for Simplification
We adapt the machinery of statistical machine translation to the task of text simplification by making changes in the following four key components:

Simplification-specific Objective Functions
In the statistical machine translation framework, one crucial element is the design of automatic evaluation metrics to be used as training objectives. Training algorithms, such as MERT (Och, 2003) or PRO (Hopkins and May, 2011), then directly optimize the model parameters such that the end-to-end simplification quality is optimal. Unfortunately, previous work on text simplification has only used BLEU for tuning, which is insufficient, as we show empirically in Section 4. We propose two new light-weight metrics instead: FKBLEU, which explicitly measures readability, and SARI, which implicitly measures it by comparing against the input and the references. Unlike machine translation metrics, which do not compare against the (foreign) input sentence, simplification metrics must compare system outputs against the inputs to assess readability changes. It is also important to keep tunable metrics as simple as possible, since they are repeatedly computed during the tuning process for hundreds of thousands of candidate outputs.

FKBLEU
Our first metric combines a previously proposed metric for paraphrase generation, iBLEU (Sun and Zhou, 2012), with the widely used readability metric, the Flesch-Kincaid Index (Kincaid et al., 1975). iBLEU is an extension of the BLEU metric that measures the diversity as well as the adequacy of the generated paraphrase output. Given a candidate sentence O, human references R and input text I, iBLEU is defined as:

  iBLEU = α × BLEU(O, R) − (1 − α) × BLEU(O, I)

where α balances adequacy against dissimilarity, and is set to 0.9 empirically as suggested by Sun and Zhou (2012). Since the text simplification task aims at improving readability, we include the Flesch-Kincaid Index (FK), which estimates the readability of text using cognitively motivated features (Kincaid et al., 1975):

  FK = 0.39 × (#words / #sentences) + 11.8 × (#syllables / #words) − 15.59

with a lower value indicating higher readability. We adapt FK to score individual sentences, and change it to count punctuation tokens as well as words, treating each punctuation token as one syllable. This prevents the metric from rewarding the arbitrary deletion of punctuation. FK measures readability under the assumption that the text is well-formed, and is therefore insufficient on its own as a metric for generating or evaluating automatically generated sentences. Combining FK and iBLEU captures both readability and adequacy. The resulting objective function, FKBLEU, combines iBLEU with the FK difference between the input and output sentences:

  FKdiff(I, O) = sigmoid(FK(I) − FK(O))
  FKBLEU(I, R, O) = iBLEU(I, R, O) × FKdiff(I, O)

Sentences with higher FKBLEU values are better simplifications with higher readability.
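The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the add-one-smoothed sentence-level BLEU, the vowel-group syllable heuristic, and all function names are our own simplifying assumptions.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, k=4):
    """Sentence-level BLEU with add-one smoothing and a brevity penalty."""
    c = candidate.split()
    refs = [r.split() for r in references]
    log_p = 0.0
    for n in range(1, k + 1):
        cand = ngram_counts(c, n)
        max_ref = Counter()
        for r in refs:
            for g, cnt in ngram_counts(r, n).items():
                max_ref[g] = max(max_ref[g], cnt)
        match = sum(min(cnt, max_ref[g]) for g, cnt in cand.items())
        total = max(sum(cand.values()), 1)
        log_p += math.log((match + 1) / (total + 1)) / k
    ref_len = min(len(r) for r in refs)
    bp = min(1.0, math.exp(1 - ref_len / max(len(c), 1)))
    return bp * math.exp(log_p)

def count_syllables(word):
    # crude heuristic: count maximal vowel groups
    vowels, count, prev = "aeiouy", 0, False
    for ch in word.lower():
        is_v = ch in vowels
        if is_v and not prev:
            count += 1
        prev = is_v
    return max(count, 1)

def fk(sentence):
    # Flesch-Kincaid grade for a single sentence (#sentences = 1)
    words = sentence.split()
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) + 11.8 * (syllables / len(words)) - 15.59

def fkbleu(inp, output, references, alpha=0.9):
    ibleu = alpha * bleu(output, references) - (1 - alpha) * bleu(output, [inp])
    fk_diff = 1 / (1 + math.exp(-(fk(inp) - fk(output))))  # sigmoid of FK drop
    return ibleu * fk_diff
```

An output that substitutes a simpler word (lowering FK) while matching the references scores higher than an output that merely copies the input, which is exactly the behavior the tuner needs.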

SARI
We design a second new metric SARI that principally compares system output against references and against the input sentence. It explicitly measures the goodness of words that are added, deleted and kept by the systems (Figure 1).
We reward addition operations, where an n-gram in the system output O was not in the input I but occurred in any of the references R, i.e. O ∩ Ī ∩ R. We define n-gram precision p_add(n) and recall r_add(n) for addition operations as follows:

  p_add(n) = Σ_{g∈O} min(#_g(O ∩ Ī), #_g(R)) / Σ_{g∈O} #_g(O ∩ Ī)
  r_add(n) = Σ_{g∈O} min(#_g(O ∩ Ī), #_g(R)) / Σ_{g∈O} #_g(R ∩ Ī)

where #_g(·) is a binary indicator of the occurrence of n-gram g in a given set (and is a fractional indicator in some later formulas), and

  #_g(O ∩ Ī) = max(#_g(O) − #_g(I), 0)
  #_g(R ∩ Ī) = max(#_g(R) − #_g(I), 0)

Therefore, in the example below, the addition of the unigram now is rewarded in both p_add(n) and r_add(n), while the addition of you in OUTPUT-1 is penalized in p_add(n):

  INPUT:    About 95 species are currently accepted .
  REF-1:    About 95 species are currently known .
  REF-2:    About 95 species are now accepted .
  REF-3:    95 species are now accepted .
  OUTPUT-1: About 95 you now get in .
  OUTPUT-2: About 95 species are now agreed .
  OUTPUT-3: About 95 species are currently agreed .

The corresponding SARI scores of these three toy outputs are 0.2683, 0.7594 and 0.5890, which match intuitions about their quality. To put this in perspective, the BLEU scores are 0.1562, 0.6435 and 0.6435, respectively. BLEU fails to distinguish between OUTPUT-2 and OUTPUT-3 because matching any one of the references is credited the same. Not all of the references are necessarily complete simplifications, e.g. REF-1 does not simplify the word currently, which gives BLEU too much latitude for matching the input.

Figure 1: Metrics that evaluate the output of monolingual text-to-text generation systems can compare system output against references and against the input sentence, unlike MT metrics, which do not compare against the (foreign) input sentence. The different regions of this Venn diagram are treated differently by our SARI metric.
Words that are retained in both the system output and the references should also be rewarded. When multiple references are used, the number of references in which an n-gram is retained matters. This accounts for the fact that some words/phrases are considered simple and need not (but are still encouraged to) be simplified. We use R' to mark n-gram counts over R weighted by fractions, e.g. if a unigram (the word about in the example above) occurs in 2 out of the total r references, then its count is weighted by 2/r in the computation of precision and recall:

  p_keep(n) = Σ_{g∈I} min(#_g(I ∩ O), #_g(I ∩ R')) / Σ_{g∈I} #_g(I ∩ O)
  r_keep(n) = Σ_{g∈I} min(#_g(I ∩ O), #_g(I ∩ R')) / Σ_{g∈I} #_g(I ∩ R')

where

  #_g(I ∩ O) = min(#_g(I), #_g(O))
  #_g(I ∩ R') = #_g(I) × #_g(R) / r

For deletion, we only use precision, because over-deleting hurts readability much more significantly than not deleting:

  p_del(n) = Σ_{g∈I} min(#_g(I ∩ Ō), #_g(I ∩ R̄')) / Σ_{g∈I} #_g(I ∩ Ō)

where

  #_g(I ∩ Ō) = max(#_g(I) − #_g(O), 0)
  #_g(I ∩ R̄') = max(#_g(I) − #_g(R) / r, 0)

The precision of what is kept also reflects the sufficiency of deletions. The n-gram counts are again weighted in R' to compensate for n-grams, such as the word currently in the example, that are not considered required simplifications by human editors. Together, in SARI, we use the arithmetic average of n-gram precisions P_operation and recalls R_operation:

  SARI = d1 F_add + d2 F_keep + d3 P_del

where d1 = d2 = d3 = 1/3 and, for each operation in {add, keep, del},

  P_operation = (1/k) Σ_{n=1..k} p_operation(n)
  R_operation = (1/k) Σ_{n=1..k} r_operation(n)
  F_operation = 2 × P_operation × R_operation / (P_operation + R_operation)

where k is the highest n-gram order, set to 4 in our experiments.
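A simplified sketch of SARI follows. It uses binary n-gram indicators throughout (the full metric uses counts), weights kept and deleted n-grams by the fraction of references containing them, and averages F-scores per n-gram order rather than pooling precisions and recalls as in the equations above; the function names and these simplifications are ours.

```python
def ngram_set(tokens, n):
    return set(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sari(inp, out, refs, k=4):
    """Simplified SARI: rewards good additions, keeps, and deletions."""
    I, O = inp.split(), out.split()
    R = [ref.split() for ref in refs]
    r = len(R)
    f_add, f_keep, p_del = [], [], []
    for n in range(1, k + 1):
        gi, go = ngram_set(I, n), ngram_set(O, n)
        ref_sets = [ngram_set(ref, n) for ref in R]
        gr = set().union(*ref_sets)
        # fraction of references containing each n-gram
        frac = {g: sum(g in rs for rs in ref_sets) / r for g in gr}

        # addition: n-grams in O but not in I, rewarded if in any reference
        added, addable = go - gi, gr - gi
        good = added & gr
        p = len(good) / len(added) if added else 0.0
        rc = len(good) / len(addable) if addable else 0.0
        f_add.append(2 * p * rc / (p + rc) if p + rc else 0.0)

        # keep: n-grams of I retained in O, weighted by reference agreement
        kept = gi & go
        num = sum(frac.get(g, 0.0) for g in kept)
        denom_r = sum(frac.get(g, 0.0) for g in gi)
        p = num / len(kept) if kept else 0.0
        rc = num / denom_r if denom_r else 0.0
        f_keep.append(2 * p * rc / (p + rc) if p + rc else 0.0)

        # deletion: precision only; credit deletions the references also made
        deleted = gi - go
        num = sum(1.0 - frac.get(g, 0.0) for g in deleted)
        p_del.append(num / len(deleted) if deleted else 0.0)

    avg = lambda xs: sum(xs) / len(xs)
    return (avg(f_add) + avg(f_keep) + avg(p_del)) / 3
```

Note that an output identical to the input earns zero credit for both addition and deletion, which is why unchanged outputs score low under SARI even when they score well under BLEU.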

Incorporating Large-Scale Paraphrase Rules
Another challenge for text simplification is generating an ample set of rewrite rules that potentially simplify an input sentence. Most early work relied on either hand-crafted rules (Chandrasekar et al., 1996; Carroll et al., 1999; Siddharthan, 2006; Vickrey and Koller, 2008) or dictionaries like WordNet (Kaji et al., 2002; Inui et al., 2003). Other more recent studies have relied on the parallel Normal-Simple Wikipedia corpus to automatically extract rewrite rules (Zhu et al., 2010; Woodsend and Lapata, 2011; Coster and Kauchak, 2011b; Wubben et al., 2012; Narayan and Gardent, 2014; Siddharthan and Angrosh, 2014; Angrosh et al., 2014). This technique does manage to learn a small number of transformations that simplify. However, we argue that because the Normal-Simple Wikipedia parallel corpus is quite small (108k sentence pairs with 2 million words), the diversity and coverage of patterns that can be learned is actually quite limited. In this paper we leverage the large-scale Paraphrase Database (PPDB) (Pavlick et al., 2015) as a rich source of lexical, phrasal and syntactic simplification operations. It was created by extracting English paraphrases from bilingual parallel corpora using a technique called "bilingual pivoting" (Bannard and Callison-Burch, 2005). PPDB is represented as a synchronous context-free grammar (SCFG), which is commonly used as the formalism for syntax-based machine translation (Zollmann and Venugopal, 2006; Chiang, 2007; Weese et al., 2011). Table 1 shows some example paraphrase rules in PPDB.
PPDB employs 1000 times more data (106 million sentence pairs with 2 billion words) than the Normal-Simple Wikipedia parallel corpus. The English portion of PPDB contains over 220 million paraphrase rules, consisting of 8 million lexical, 73 million phrasal and 140 million syntactic paraphrase patterns. The key difference between the paraphrase rules in PPDB and the transformations learned by the naive application of SMT to the Normal-Simple Wikipedia parallel corpus is that the PPDB paraphrases are much more diverse. For example, PPDB contains 214 paraphrases for ancient, including antique, ancestral, old, age-old, archeological, former, antiquated, longstanding, archaic, centuries-old, and so on. However, there is nothing inherent in the rule extraction process to say which of the PPDB paraphrases are simplifications.
In this paper, we model the task by incorporating rich features into each rule and letting advances in SMT decoding and optimization determine how well a rule simplifies an input phrase. An alternative way of using PPDB for simplification would be to simply discard any of its rules that do not result in a simplified output, possibly using a simple supervised classifier (Pavlick and Callison-Burch, 2016).

Simplification-specific Features for Paraphrase Rules
Designing good features is an essential aspect of modeling. For each input sentence i and candidate output sentence j, a vector of feature functions φ = {φ_1 ... φ_N} is combined with a weight vector w in a linear model to obtain a single score h_w:

  h_w(i, j) = Σ_{n=1..N} w_n × φ_n(i, j)

In SMT, typical feature functions are phrase translation probabilities, word-for-word lexical translation probabilities, a rule application penalty (which governs whether the system prefers fewer longer phrases or a greater number of shorter phrases), and a language model probability. Together, these features are what the model uses to distinguish between good and bad translations. For monolingual translation tasks, previous research suggests that features like paraphrase probability and distributional similarity are potentially helpful in picking out good paraphrases (Chan et al., 2011) and for text-to-text generation (Ganitkevitch et al., 2012b). While these two features quantify how good a paraphrase rule is in general, they do not indicate how good the rule is for a specific task, such as simplification.
For each paraphrase rule, we use all 33 features that were distributed with PPDB 1.0 and add 9 new features for simplification purposes: length in characters, length in words, number of syllables, language model scores, and the fraction of common English words in each rule. These features are computed for both sides of a paraphrase pattern, for the word with the maximum number of syllables on each side, and for the difference between the two sides, where applicable. We use language models built from the Gigaword corpus and from the Simple Wikipedia corpus collected by Kauchak (2013). We also use a list of the 3000 most common US English words compiled by Paul and Bernice Noll.
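Rule-level features of this kind might be computed as sketched below. The tiny common-word set and the vowel-group syllable counter are stand-ins for the 3000-word list and a real syllabifier, and all names are ours.

```python
def count_syllables(word):
    # crude heuristic: count maximal vowel groups
    vowels, count, prev = "aeiouy", 0, False
    for ch in word.lower():
        is_v = ch in vowels
        if is_v and not prev:
            count += 1
        prev = is_v
    return max(count, 1)

# stand-in for the 3000-most-common-words list
COMMON_WORDS = {"the", "old", "law", "of", "main", "key"}

def side_features(phrase):
    words = phrase.split()
    return {
        "chars": len(phrase),
        "words": len(words),
        "syllables": sum(count_syllables(w) for w in words),
        "max_syllables": max(count_syllables(w) for w in words),
        "common_frac": sum(w in COMMON_WORDS for w in words) / len(words),
    }

def rule_features(source, target):
    """Features for both sides of a rule plus their difference."""
    fs, ft = side_features(source), side_features(target)
    feats = {}
    for k in fs:
        feats["src_" + k] = fs[k]
        feats["tgt_" + k] = ft[k]
        feats["diff_" + k] = fs[k] - ft[k]  # positive: target side is "lighter"
    return feats
```

A rule such as ancient → old then yields positive length and syllable differences, which is the signal the tuned model can exploit to prefer simplifying rules.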

Creating Multiple References
As with machine translation, where there are many equally good translations, in simplification there may be several ways of simplifying a sentence. Most previous work on text simplification uses only a single reference simplification, often from the Simple Wikipedia. This is undesirable, since the Simple Wikipedia contains a large proportion of inadequate or inaccurate simplifications.
In this study, we collect multiple human reference simplifications that focus on simplification by paraphrasing rather than by deletion or splitting. We first selected the Simple-Normal sentence pairs of similar length (≤ 20% difference in number of tokens) from the Parallel Wikipedia Simplification (PWKP) corpus (Zhu et al., 2010), which are more likely to be paraphrase-only simplifications. We then asked 8 workers on Amazon Mechanical Turk to rewrite a selected sentence from Normal Wikipedia (a subset of PWKP) into a simpler version, preserving its meaning without losing any information or splitting the sentence. We removed bad workers by manually inspecting each worker's first several submissions, on the basis of a recent study on crowdsourcing translation suggesting that Turkers' performance stays consistent over time and can be reliably predicted from their first few translations.
In total, we collected 8 reference simplifications for 2350 sentences, and randomly split them into 2000 sentences for tuning and 350 for evaluation. Many crowdsourcing workers were able to provide simplifications of good quality and diversity (see Table 2 for an example and Table 4 for the manual quality evaluation). Having multiple references allows us to develop automatic metrics, similar to BLEU, that take advantage of the variation across many people's simplifications. We leave more in-depth investigations of crowdsourcing simplification (Pellow and Eskenazi, 2014a,b) for future work.

Tuning Parameters
As in statistical machine translation, we set the weights of the linear model w in Equation (8) so that the system's output is optimized with respect to the automatic evaluation metric on the 2000-sentence development set. We use the pairwise ranking optimization (PRO) algorithm (Hopkins and May, 2011) implemented in the open-source Joshua toolkit (Ganitkevitch et al., 2012a; Post et al., 2013) for tuning.
Specifically, we train the system to distinguish a good candidate output j from a worse candidate j′, as measured by an objective function o (Section 3.1), for an input sentence i:

  o(i, j) > o(i, j′)  ⇔  h_w(i, j) > h_w(i, j′)

Thus, the optimization reduces to a binary classification problem. Each training instance is the difference vector φ(i, j) − φ(i, j′) of a pair of candidates, and its training label is positive or negative depending on whether the value of o(i, j) − o(i, j′) is positive or negative. The candidates are generated according to h_w at each iteration, and sampled to make training tractable. We use three different metrics as objectives: BLEU, FKBLEU and SARI.
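The sampling-and-classification step of PRO can be sketched as follows. This is a toy illustration: a plain perceptron stands in for the binary classifier used in practice, and all names are ours.

```python
import random

def pro_training_data(candidates, phi, o, num_samples=50):
    """Sample candidate pairs; each instance is a feature difference vector.

    candidates: candidate outputs for one input sentence
    phi: feature function, candidate -> list of floats
    o: objective function (e.g. SARI), candidate -> float
    """
    data = []
    for _ in range(num_samples):
        j, j2 = random.sample(candidates, 2)
        if o(j) == o(j2):
            continue  # no preference, skip the pair
        diff = [a - b for a, b in zip(phi(j), phi(j2))]
        label = 1 if o(j) > o(j2) else -1
        data.append((diff, label))
    return data

def train_perceptron(data, epochs=20):
    """Binary classifier over difference vectors; returns model weights w."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified pair: update towards its label
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w
```

After training, w scores candidates via h_w so that candidates preferred by the objective also receive higher model scores.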

Experiments and Analyses
We implemented all the proposed adaptations in the open-source syntactic machine translation decoder Joshua (Post et al., 2013; http://joshua-decoder.org/), which we augmented to include the text-to-text generation functionality described in this paper, and conducted the experiments with PPDB and the dataset of 2350 sentences collected in Section 3.4. Most recent end-to-end sentence simplification systems use a basic phrase-based MT model trained on parallel Wikipedia data using the Moses decoder (Štajner et al., 2015, and others). One of the best such systems is PBMT-R by Wubben et al. (2012), which reranks Moses' n-best outputs based on their dissimilarity to the input to promote simplification. We also build a baseline by using BLEU as the tuning metric in our adapted MT framework. We conduct both human and automatic evaluation to demonstrate the advantages of the proposed simplification systems, and to show the effectiveness of the two new metrics for tuning and automatic evaluation.

Table 2 shows a representative example of the simplification results. The PBMT-R model failed to learn any good substitution for the word able-bodied or the phrase are required to from the manually simplified corpora of limited size. In contrast, our proposed method can make use of more paraphrases learned from the more abundant bilingual texts, which improves the applicability of the method to languages other than English, for which no simpler version of Wikipedia is available.

Normal Wikipedia
Jeddah is the principal gateway to Mecca, Islam's holiest city, which able-bodied Muslims are required to visit at least once in their lifetime.

Simple Wikipedia
Jeddah is the main gateway to Mecca, the holiest city of Islam, where able-bodied Muslims must go to at least once in a lifetime.

Mechanical Turk #1
Jeddah is the main entrance to Mecca, the holiest city in Islam, which all healthy Muslims need to visit at least once in their life.

Mechanical Turk #2
Jeddah is the main entrance to Mecca, Islam's holiest city, which pure Muslims are required to visit at least once in their lifetime.

PBMT-R (Wubben et al., 2012)
Jeddah is the main gateway to Mecca, Islam's holiest city, which able-bodied Muslims are required of Muslims at least once in their lifetime.

SBMT (PPDB + BLEU)
Jeddah is the main door to Mecca, Islam's holiest city, which sound Muslims are to go to at least once in their life.

SBMT (PPDB + FKBLEU)
Jeddah is the main gateway to Mecca, Islam's holiest city, which sound Muslims must visit at least once in their life.

SBMT (PPDB + SARI)
Jeddah is the main gateway to Mecca, Islam's holiest city, which sound Muslims have to visit at least once in their life.

Our proposed approach also provides an intuitive way to inspect the ranking of candidate paraphrases in the translation model: each rule in PPDB can be scored by Equation 8 using the weights optimized in the tuning process, as in Table 3. This shows that our proposed method is capable of capturing the notion of simplicity using a small amount of parallel tuning data. It correctly ranks key and main as good simplifications for principal. Its choices are not always perfect, as it prefers sound over healthy for able-bodied. The final simplification outputs are generated according to both the translation model and the language model trained on the Gigaword corpus, to take context into account and to further bias towards more common n-grams.

Quantitative Evaluation of Simplification Systems
For the human evaluation, participants were shown the original English Wikipedia sentence as a reference and asked to judge a set of simplifications that were displayed in random order. They evaluated a simplification from each system, the Simple Wikipedia version, and a Turker simplification. Judges rated each simplification on two 5-point scales for meaning retention and grammaticality (0 is the worst and 4 is the best). We also asked participants to rate Simplicity Gain (Simplicity+) by counting how many successful lexical or syntactic paraphrases occurred in the simplification. We found that this makes the judgment easier, and that it is more informative than rating simplicity directly on a 5-point scale, since the original sentences have very different readability levels to start with. More importantly, using simplicity gain avoids over-punishing errors, which are already penalized for poor meaning retention and grammaticality, and thus reduces the bias towards very conservative models. We collected judgments on these three criteria from five different annotators and report the average scores.

Table 4 shows that our best system, a syntax-based MT system (SBMT) using PPDB as the source of paraphrase rules and tuned towards the SARI metric, achieves better performance on all three simplification measurements than the state-of-the-art system PBMT-R. The relatively small values of simplicity gain, even for the two human references (Simple Wikipedia and Mechanical Turk), clearly show the major challenge of simplification: the need not only to generate paraphrases but also to ensure that the generated paraphrases are simpler while fitting their contexts. Although many researchers have noticed this difficulty, PBMT-R is one of the few systems that tried to address it, by promoting outputs that are dissimilar to the input. Our best system is able to make more effective paraphrases (better Simplicity+) while introducing fewer errors (better Grammar and Meaning).

Table 4: Human evaluation (Grammar, Meaning, Simplicity+) and basic statistics of our proposed systems (SBMTs) and baselines. PBMT-R is a reimplementation of the state-of-the-art system by Wubben et al. (2012).

Table 5 shows the automatic evaluation. An encouraging fact is that the SARI metric ranks all 5 different systems and 3 human references in the same order as the human assessment. Most systems achieve similar FK readability to the human editors, using fewer words or words with fewer syllables. Tuning towards BLEU with all 8 references results in no transformation (output identical to the input), as this achieves a near-perfect BLEU score of 99.05 (out of 100). Table 6 shows the computation time for the different metrics: SARI is only slightly slower than BLEU but achieves much better simplification quality. Table 7 shows the correlation of automatic metrics with human judgment; there are several interesting observations.

Correlation of Automatic Metrics with Human Judgments
First, simplicity is essential in measuring the goodness of simplification. However, none of the existing metrics (i.e. FK, BLEU, iBLEU) demonstrates any significant correlation with the simplicity scores rated by humans, as also noted in previous work (Wubben et al., 2012; Štajner et al., 2014). In contrast, our two new metrics, FKBLEU and SARI, achieve much better correlation with human simplicity judgments while still capturing the notions of grammaticality and meaning preservation. This explains why they are more suitable than BLEU for training simplification models. In particular, SARI provides a balanced and integrative measurement of system performance that can assist iterative development. To date, developing advanced simplification systems has been a difficult and time-consuming process, since it is impractical to run a new human evaluation every time a new model is built or parameters are adjusted.

Table 7: Correlations (and two-tailed p-values) of metrics against the human ratings at the sentence level (also see Figure 3). In this work, we propose to use multiple (eight) references and two new metrics: FKBLEU and SARI. For all three criteria of simplification quality, SARI correlates reasonably with human judgments. In contrast, previous work used only a single reference. The existing metrics BLEU and iBLEU show higher correlations on grammaticality and meaning preservation when using multiple references, but fail to measure the most important aspect of simplification: simplicity.

Second, the correlation of automatic metrics with human judgments of grammaticality and meaning preservation is higher than any reported before (Wubben et al., 2012; Štajner et al., 2014). This validates our argument that constraining simplification to paraphrasing alone removes the complications from deletion and splitting, and thus makes automatic evaluation more feasible. Using multiple references further improves the correlations.

Why Does BLEU Correlate Strongly with Meaning/Grammar, and SARI with Simplicity?
Here we look more deeply at the correlations of BLEU and SARI with human judgments. Our SARI metric has the highest correlation with human judgments of simplicity, but BLEU exhibits higher correlations on grammaticality and meaning preservation. BLEU was designed to evaluate bilingual translation systems. It measures the n-gram precision of a system's output against one or more references; it ignores recall (and compensates for this with its brevity penalty). BLEU prefers an output that is not too short and contains only n-grams that appear in any reference. The role of multiple references in BLEU is to capture allowable variations in translation quality. When applied to monolingual tasks like simplification, BLEU does not take into account anything about the differences between the input and the references. In contrast, SARI takes into account both precision and recall, by looking at the difference between the references and the input sentence.

Figure 2: A scatter plot of BLEU scores vs. SARI scores for the individual sentences in our test set. The metrics' scores for many sentences substantially diverge. Few of the sentences that scored perfectly in BLEU receive a high score from SARI.
In this work, we use multiple references to capture many different ways of simplifying the input.
Unlike in bilingual translation, the more references are created for the monolingual simplification task, the more n-grams of the original input will be included in the references. This means that, with more references, outputs that are close or identical to the input will get high BLEU scores. Outputs with few changes also receive high Grammar/Meaning scores from human judges, but they do not necessarily get high SARI scores, nor are they good simplifications. BLEU therefore tends to favor conservative systems that do not make many changes, while SARI penalizes them. This can be seen in Figure 2, where sentences with a BLEU score of 1.0 receive a range of scores from SARI. The scatter plots in Figure 3 further illustrate this analysis: they show the correlation of high human scores on meaning/grammar with systems that make few changes (which BLEU rewards, but SARI does not). The tradeoff is that conservative outputs with few or no changes do not result in increased simplicity. SARI correctly rewards systems that make changes that simplify the input.

Conclusions and Future Work
In this paper, we presented an effective adaptation of statistical machine translation techniques for text simplification. We find the approach promising, and it suggests two new directions: designing tunable metrics that correlate with human judgments, and using simplicity-enriched paraphrase rules derived from larger data than the Normal-Simple Wikipedia dataset. For future work, we think it might be possible to design a universal metric that works for multiple text-to-text generation tasks (including sentence simplification, compression and error correction), using the same idea of comparing system output against multiple references and against the input. Such a metric could include tunable parameters or weighted human judgments on references to accommodate different tasks. Finally, we are also interested in designing neural translation models for the simplification task.