Problems in Current Text Simplification Research: New Data Can Help

Simple Wikipedia has dominated simplification research in the past 5 years. In this opinion paper, we argue that focusing on Wikipedia limits simplification research. We back up our arguments with corpus analysis and by highlighting statements that other researchers have made in the simplification literature. We introduce a new simplification dataset that is a significant improvement over Simple Wikipedia, and present a novel quantitative-comparative approach to study the quality of simplification data resources.

The Parallel Wikipedia Simplification (PWKP) corpus prepared by Zhu et al. (2010) has become the benchmark dataset for training and evaluating automatic text simplification systems. An associated test set of 100 sentences from Wikipedia has been used for comparing the state-of-the-art approaches. The collection of simple-complex parallel sentences sparked a major advance for machine translation-based approaches to simplification. However, we will show that this dataset is deficient and should be considered obsolete.
In this opinion paper, we argue that Wikipedia as a simplification data resource is suboptimal for several reasons: 1) It is prone to automatic sentence alignment errors; 2) It contains a large proportion of inadequate simplifications; 3) It generalizes poorly to other text genres. These problems are largely due to the fact that Simple Wikipedia is an encyclopedia spontaneously and collaboratively created for "children and adults who are learning English language" without more specific guidelines. We quantitatively illustrate the seriousness of these problems through manual inspection and statistical analysis.
Our manual inspection reveals that about 50% of the sentence pairs in the PWKP corpus are not simplifications. We also introduce a new comparative approach to simplification corpus analysis. In particular, we assemble a new simplification corpus of news articles, re-written by professional editors to meet the readability standards for children at multiple grade levels. This parallel corpus is of higher quality and its size is comparable to the PWKP dataset. It helps us to showcase the limitations of Wikipedia data in comparison and it provides potential remedies that may improve simplification research.

We are not the only researchers to notice problems with Simple Wikipedia. There are many hints in past publications that reflect the inadequacy of this resource, which we piece together in this paper to support our arguments. Several different simplification datasets have been proposed (Bach et al., 2011; Woodsend and Lapata, 2011a; Coster and Kauchak, 2011; Woodsend and Lapata, 2011b), but most of these are derived from Wikipedia and not thoroughly analyzed. Siddharthan's (2014) excellent survey of text simplification research states that one of the most important questions that needs to be addressed is "how good is the quality of Simple English Wikipedia". To the best of our knowledge, we are the first to systematically quantify the quality of Simple English Wikipedia and directly answer this question.

Not Aligned (17%)
[NORM] The soprano ranges are also written from middle C to A an octave higher, but sound one octave higher than written.
[SIMP] The xylophone is usually played so that the music sounds an octave higher than written.

Not Simpler (33%)
[NORM] Chile is the longest north-south country in the world, and also claims of Antarctica as part of its territory.
[SIMP] Chile, which claims a part of the Antarctic continent, is the longest country on earth.
[NORM] Death On 1 October 1988, Strauss collapsed while hunting with the Prince of Thurn and Taxis in the Thurn and Taxis forests, east of Regensburg.
[SIMP] Death On October 1, 1988, Strauß collapsed while hunting with the Prince of Thurn and Taxis in the Thurn and Taxis forests, east of Regensburg.
We make our argument not as a criticism of others or ourselves, but as an effort to refocus research directions in the future (Eisenstein, 2013). We hope to inspire the creation of higher quality simplification datasets, and to encourage researchers to think critically about existing resources and evaluation methods. We believe this will lead to breakthroughs in text simplification research.

Simple Wikipedia is not that simple
The Parallel Wikipedia Simplification (PWKP) corpus (Zhu et al., 2010) contains approximately 108,000 automatically aligned sentence pairs from cross-linked articles between Simple and Normal English Wikipedia. It has become a benchmark dataset for simplification largely because of its size and availability, and because follow-up papers (Woodsend and Lapata, 2011a; Coster and Kauchak, 2011; Wubben et al., 2012; Narayan and Gardent, 2014; Siddharthan and Angrosh, 2014; Angrosh et al., 2014) often compare with Zhu et al.'s system outputs to demonstrate further improvements.
The large quantity of parallel text from Wikipedia made it possible to build simplification systems using statistical machine translation (SMT) technology. But after the initial success of these first-generation systems, we started to suffer from the inadequacy of the parallel Wikipedia simplification datasets. There is scattered evidence in the literature. Bach et al. (2011) mentioned that they attempted to use parallel Wikipedia data, but opted to construct their own corpus of 854 sentences (25% from the New York Times and 75% from Wikipedia) with one manual simplification per sentence. Woodsend and Lapata (2011a) showed that rewriting rules learned from Simple Wikipedia revision histories produce better output compared to the "unavoidably noisy" aligned sentences from Simple-Normal Wikipedia. The Woodsend and Lapata (2011b) model, which used quasi-synchronous grammars learned from Wikipedia revision history, left 22% of the sentences in the test set unchanged. Wubben et al. (2012) found that a phrase-based machine translation model trained on the PWKP dataset often left the input unchanged, since "much of training data consists of partially equal input and output strings". Coster and Kauchak (2011) constructed another parallel Wikipedia dataset using a more sophisticated sentence alignment algorithm with an additional step that first aligns paragraphs. They noticed that 27% of the aligned sentences are identical between simple and normal, and retained them in the dataset "since not all sentences need to be simplified and it is important for any simplification algorithm to be able to handle this case". However, we will show that many sentences that need to be simplified are not simplified in the Simple Wikipedia.
We manually examined the Parallel Wikipedia Simplification (PWKP) corpus and found that it is noisy and that half of its sentence pairs are not simplifications (Table 1). We randomly sampled 200 one-to-one sentence pairs from the PWKP dataset (one-to-many sentence splitting cases make up only 6.1% of the dataset) and classified each sentence pair into one of three categories:

Not Aligned (17%) - The two sentences have different meanings, or have only partial content overlap.
Not Simpler (33%) - The SIMP sentence has the same meaning as the NORM sentence, but is not simpler.
Real Simplification (50%) - The SIMP sentence has the same meaning as the NORM sentence, and is simpler. We further break this category down by whether the simplification is due to deletion or paraphrasing.
Table 1 shows a detailed breakdown and representative examples for each category. Although Zhu et al. (2010) and Coster and Kauchak (2011) have provided a simple analysis of the accuracy of sentence alignment, there are some important facts that cannot be revealed without in-depth manual inspection. The "non-simplification" noise in the parallel Simple-Normal Wikipedia data is a much more serious problem than we all thought. The quality of "real simplifications" also varies: some sentences are simpler by only one word while the rest of the sentence is still complex.
The main causes of non-simplifications and partial-simplifications in the parallel Wikipedia corpus include: 1) The Simple Wikipedia was created by volunteer contributors with no specific objective; 2) Very rarely are the simple articles complete re-writes of the regular articles in Wikipedia (Coster and Kauchak, 2011), which makes automatic sentence alignment errors worse; 3) As an encyclopedia, Wikipedia contains many difficult sentences with complex terminology. The difficulty of sentence alignment between Normal-Simple Wikipedia is highlighted by a recent study by Hwang et al. (2015) that achieves state-of-the-art performance of 0.712 maximum F1 score (over the precision-recall curve) by combining Wiktionary-based and dependency-parse-based sentence similarities. And in fact, even the simple side of the PWKP corpus contains an extensive English vocabulary of 78,009 unique words. 6,669 of these words do not exist in the normal side (Table 2). Below is a sentence from an article entitled "Photolithography" in Simple Wikipedia:

Microphototolithography is the use of photolithography to transfer geometric shapes on a photomask to the surface of a semiconductor wafer for making integrated circuits.
We should use the PWKP corpus with caution and consider other alternative parallel simplification corpora. Alternatives could come from Wikipedia (but better aligned and selected) or from manual simplification of other domains, such as newswire. In the next section, we will present a corpus of news articles simplified by professional editors, called the Newsela corpus. We perform a comparative corpus analysis of the Newsela corpus versus the PWKP corpus to further illustrate concerns about PWKP's quality.

Table 2: The vocabulary size of the Parallel Wikipedia Simplification (PWKP) corpus and the vocabulary difference between its normal and simple sides (as a 2×2 matrix). Only words consisting of the 26 English letters are counted.

What the Newsela corpus teaches us
To study how professional editors conduct text simplification, we have assembled a new simplification dataset that consists of 1,130 news articles. Each article has been re-written 4 times for children at different grade levels by editors at Newsela, a company that produces reading materials for pre-college classroom use. We use Simp-4 to denote the most simplified level and Simp-1 to denote the least simplified level. This data forms a parallel corpus, where we can align sentences at different reading levels, as shown in Table 3.
Unlike Simple Wikipedia, which was created without a well-defined objective, Newsela is meant to help teachers prepare curricula that match the English language skills required at each grade level. It is motivated by the Common Core Standards (Porter et al., 2011) in the United States. All the Newsela articles are grounded in the Lexile readability score, which is widely used to measure text complexity and assess students' reading ability.

Manual examination of Newsela corpus
We conducted a manual examination of the Newsela data similar to the one for Wikipedia data in Table 1. The breakdown of aligned sentence pairs between different versions in Newsela is shown in Figure 1. It is based on 50 randomly selected sentence pairs and shows much more reliable simplification than the Wikipedia data.
We designed a sentence alignment algorithm for the Newsela corpus based on Jaccard similarity (Jaccard, 1912). We first align each sentence in the simpler version (e.g., s1 in Simp-3) to the sentence in the immediately more complex version (e.g., s2 in Simp-2) that has the highest similarity score. We compute the similarity based on overlapping word lemmas:

$$\mathrm{Sim}(s_1, s_2) = \frac{|\mathrm{Lemmas}(s_1) \cap \mathrm{Lemmas}(s_2)|}{|\mathrm{Lemmas}(s_1) \cup \mathrm{Lemmas}(s_2)|}$$

We then align sentences into groups across all 5 versions for each article. For cases where no sentence splitting is involved, we discard any sentence pairs with a similarity smaller than 0.40. If splitting occurs, we set the similarity threshold to 0.20 instead.
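The alignment step can be illustrated with a minimal sketch. It assumes lowercased alphabetic tokens as a stand-in for word lemmas, and the function names (`lemmas`, `jaccard`, `align`) are hypothetical; it covers only the one-to-one case with the 0.40 threshold, not the grouping across all five versions or the splitting case.

```python
import re


def lemmas(sentence):
    """Stand-in for lemmatization (an assumption): lowercased alphabetic tokens."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(s1, s2):
    """Jaccard similarity between the token sets of two sentences."""
    a, b = lemmas(s1), lemmas(s2)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def align(simpler, more_complex, threshold=0.40):
    """Pair each simpler sentence with its most similar sentence in the
    more complex version, discarding pairs scoring below the threshold."""
    pairs = []
    if not more_complex:
        return pairs
    for s1 in simpler:
        best = max(more_complex, key=lambda s2: jaccard(s1, s2))
        score = jaccard(s1, best)
        if score >= threshold:
            pairs.append((s1, best, score))
    return pairs


complex_version = ["The unscrupulous dealer sold the car quickly.", "Rain fell all night."]
simple_version = ["The dealer sold the car.", "Birds sing."]
aligned = align(simple_version, complex_version)
```

In this toy input, only the first simple sentence finds a match above the threshold; the second shares no tokens with either complex sentence and is dropped.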
Newsela's professional editors produce simplifications with noticeably higher quality than Wikipedia's simplifications. Compared to sentence alignment for Normal-Simple Wikipedia, automatically aligning Newsela is more straightforward and reliable. The better correspondence between the simplified and complex articles and the availability of multiple simplified versions in the Newsela data also contribute to the accuracy of sentence alignment.

Table 5: This table shows the vocabulary changes between different levels of simplification in the Newsela corpus (as a 5×5 matrix). Each cell shows the number of unique word types that appear in the corpus listed in the column but do not appear in the corpus listed in the row. We also list the average frequency of those vocabulary items. For example, in the cell marked *, the Simp-4 version contains 583 unique words that do not appear in the Original version. By comparing the cells marked **, we see about half of the words (19,197 out of 39,046) in the Original version are not in the Simp-4 version. Most of the vocabulary that is removed consists of low-frequency words (with an average frequency of 2.6 in the Original).

Vocabulary statistics
Table 4 shows the basic statistics of the Newsela corpus and the PWKP corpus. They are clearly different. Compared to the Newsela data, the Wikipedia corpus contains remarkably longer (more complex) words and the difference in sentence length before and after simplification is much smaller. We use the Penn Treebank tokenizer in the Moses package. Tables 2 and 5 show the vocabulary statistics and the vocabulary difference matrix of the PWKP and Newsela corpora. While the vocabulary size of the PWKP corpus drops only 18% from 95,111 unique words to 78,009, the vocabulary size of the Newsela corpus is reduced dramatically by 50.8% from 39,046 to 19,197 words at its most simplified level (Simp-4). Moreover, in the Newsela data, only several hundred words that occur in the simpler version do not occur in the more complex version. The words introduced are often abbreviations ("National Hurricane Center" → "NHC"), less formal words ("unscrupulous" → "crooked") and shortened words ("chimpanzee" → "chimp"). This implies a more complete and precise degree of simplification in the Newsela dataset than in the PWKP dataset.
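A vocabulary difference matrix of the kind reported in Tables 2 and 5 could be computed along the following lines. This is a sketch under stated assumptions: whitespace tokenization and case folding are simplifications (the paper uses the Penn Treebank tokenizer in Moses), and the function names and the toy sentences are hypothetical.

```python
import re


def vocabulary(sentences):
    """Unique word types consisting only of the 26 English letters.
    Whitespace tokenization and lowercasing are assumptions of this sketch."""
    vocab = set()
    for sentence in sentences:
        for token in sentence.split():
            if re.fullmatch(r"[A-Za-z]+", token):
                vocab.add(token.lower())
    return vocab


def difference_matrix(versions):
    """Map (row, column) to the number of word types that appear in the
    column version but not in the row version."""
    vocabs = {name: vocabulary(sents) for name, sents in versions.items()}
    return {
        (row, col): len(vocabs[col] - vocabs[row])
        for row in vocabs
        for col in vocabs
    }


# Toy data, made up for illustration only.
versions = {
    "Original": ["The unscrupulous chimpanzee escaped"],
    "Simp-4": ["The crooked chimp escaped"],
}
matrix = difference_matrix(versions)
```

Here `matrix[("Original", "Simp-4")]` counts the words introduced by simplification ("crooked", "chimp"), while `matrix[("Simp-4", "Original")]` counts the words removed.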

Log-odds-ratio analysis of words
In this section, we visualize the differences in the topics and degree of simplification between the Simple Wikipedia and the Newsela corpus. To do this, we employ the log-odds-ratio informative Dirichlet prior method of Monroe et al. (2008) to find words and punctuation marks that are statistically overrepresented in the simplified text compared to the original text. The method measures each token by the z-score of its log-odds-ratio as:

$$\frac{\delta_t^{(i-j)}}{\sqrt{\sigma^2\left(\delta_t^{(i-j)}\right)}}$$

It uses a background corpus when calculating the log-odds-ratio $\delta_t$ for token $t$, and controls for its variance $\sigma^2$. Therefore it is capable of detecting differences even in very frequent tokens. Other methods used to discover word associations, such as mutual information, log likelihood ratio, t-test and chi-square, often have problems with frequent words (Jurafsky et al., 2014). We choose the Monroe et al. (2008) method because many function words and punctuation marks are very frequent and play important roles in text simplification.

The log-odds-ratio $\delta$ for token $t$ estimates the difference of the frequency of token $t$ between two text sets $i$ and $j$ as:

$$\delta_t^{(i-j)} = \log\left(\frac{y_t^i + \alpha_t}{n^i + \alpha_0 - (y_t^i + \alpha_t)}\right) - \log\left(\frac{y_t^j + \alpha_t}{n^j + \alpha_0 - (y_t^j + \alpha_t)}\right)$$

where $n^i$ is the size of corpus $i$, $n^j$ is the size of corpus $j$, $y_t^i$ is the count of token $t$ in corpus $i$, $y_t^j$ is the count of token $t$ in corpus $j$, $\alpha_0$ is the size of the background corpus, and $\alpha_t$ is the count of token $t$ in the background corpus. We use the combination of both simple and complex sides in the corpus as the background.

The variance of the log-odds-ratio is estimated by:

$$\sigma^2\left(\delta_t^{(i-j)}\right) \approx \frac{1}{y_t^i + \alpha_t} + \frac{1}{y_t^j + \alpha_t}$$

Table 6 lists the top 50 words and punctuation marks that are the most strongly associated with the complex text. Both corpora significantly reduce function words and punctuation. The content words show the differences of the topics and subject matters between the two corpora. Table 7 lists the top 50 words that are the most strongly associated with the simplified text. The two corpora agree more on what the simple words are than on which complex words need to be simplified.
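As a rough illustration of the z-score computation, the sketch below follows the standard formulation of the Monroe et al. (2008) log-odds-ratio with an informative Dirichlet prior, using the two sides combined as the background corpus. The function name and the toy token lists are hypothetical, and the code is not the implementation used for the reported tables.

```python
import math
from collections import Counter


def log_odds_z_scores(tokens_i, tokens_j):
    """Z-score of the log-odds-ratio of each token between corpus i and
    corpus j, with both corpora combined as the background (prior)."""
    y_i, y_j = Counter(tokens_i), Counter(tokens_j)
    background = y_i + y_j
    n_i, n_j = sum(y_i.values()), sum(y_j.values())
    alpha_0 = sum(background.values())
    z = {}
    # Note: a corpus pair with a single token type would make the
    # denominators below zero; this sketch does not guard against that.
    for token, alpha_t in background.items():
        odds_i = (y_i[token] + alpha_t) / (n_i + alpha_0 - y_i[token] - alpha_t)
        odds_j = (y_j[token] + alpha_t) / (n_j + alpha_0 - y_j[token] - alpha_t)
        delta = math.log(odds_i) - math.log(odds_j)
        variance = 1.0 / (y_i[token] + alpha_t) + 1.0 / (y_j[token] + alpha_t)
        z[token] = delta / math.sqrt(variance)
    return z


# Toy token lists, made up for illustration only.
simple_side = "he said it was big . he said it was good .".split()
complex_side = "which was approximately large , which was approximately fine .".split()
scores = log_odds_z_scores(simple_side, complex_side)
```

With corpus i as the simplified side, a positive score marks a token that is overrepresented in the simplified text, and a negative score one that is overrepresented in the complex text.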
Table 8 shows the frequency and odds ratio of example words from the top 50 complex words. The odds ratio of token $t$ between two text sets $i$ and $j$ is defined as:

$$r_t^{(i-j)} = \frac{y_t^i / y_t^j}{n^i / n^j}$$

It reflects the difference of topics and degree of simplification between the Wikipedia and the Newsela data. The high proportion of clause-related function words, such as "which" and "where", that are retained in Simple Wikipedia indicates the incompleteness of simplification in the Simple Wikipedia. The dramatic frequency decrease of words like "which" and "advocates" in Newsela shows the consistent quality of professional simplifications. Wikipedia has good coverage of certain words, such as "approximately", because of its large volume.
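For a concrete feel of the numbers in Table 8, the sketch below computes a ratio of relative frequencies, (y_i / y_j) / (n_i / n_j), which is assumed here to be the quantity meant by "odds ratio". The function name and the counts in the example are hypothetical and are not taken from the corpora.

```python
def odds_ratio(count_i, count_j, size_i, size_j):
    """Ratio of the relative frequency of a token in corpus i (simplified)
    to its relative frequency in corpus j (complex)."""
    if size_i <= 0 or size_j <= 0:
        raise ValueError("corpus sizes must be positive")
    if count_j == 0:
        return float("inf")
    return (count_i / count_j) / (size_i / size_j)


# Made-up counts: a word seen 50 times in 1,000 complex tokens
# and 5 times in 1,000 simplified tokens.
ratio = odds_ratio(5, 50, 1000, 1000)
```

Under this reading, a value below 1 means the word is used less often after simplification, and the smaller the value, the greater the reduction.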

Log-odds-ratio analysis of syntax patterns
We can also reveal the syntax patterns that are most strongly associated with simple text versus complex text using the log-odds-ratio technique. Table 9 shows syntax patterns that represent "parent node (head word) → child node(s)" structures from a constituency parse tree. To extract these patterns we parsed our corpus with the Stanford Parser (Klein and Manning, 2002) and applied its built-in head word identifier from Collins (2003).
Both the Newsela and Wikipedia corpora exhibit syntactic differences that are intuitive and interesting.However, as with word frequency (Table 8), complex syntactic patterns are retained more often in Wikipedia's simplifications than in Newsela's.
In order to show interesting syntax patterns in the Wikipedia parallel data for Table 9, we first had to discard 3613 sentences in PWKP that contain both "is a commune" and "France". As the word-level analysis in Tables 6 and 7 hints, there is an excessive number of sentences about communes in France in the PWKP corpus, such as the sentence pair below:

[NORM] La Couture is a commune in the Pas-de-Calais department in the Nord-Pas-de-Calais region of France.
[SIMP] La Couture, Pas-de-Calais is a commune. It is found in the region Nord-Pas-de-Calais in the Pas-de-Calais department in the north of France.

This is a template sentence from a stub geographic article and its deterministic simplification. The influence of this template sentence is more overwhelming in the syntax-level analysis than in the word-level analysis: about 1/3 of the top 30 syntax patterns would be related to these sentence pairs if they were not discarded.
Simple Wikipedia is rarely used to study document-level simplification. Woodsend and Lapata (2011b) developed a model that simplifies Wikipedia articles while selecting their most important content. However, they could only use Simple Wikipedia in very limited ways. They noted that Simple Wikipedia is "less mature" with many articles that are just "stubs, comprising a single paragraph of just one or two sentences". We quantify their observation in Figure 2, plotting the document-level compression ratio of Simple vs. Normal Wikipedia articles. The compression ratio is the ratio of the number of characters between each simple-complex article pair. In the plot, we use all 60 thousand article pairs from the Simple-Normal Wikipedia collected by Kauchak (2013) in May 2011. The overall compression ratio is heavily skewed towards 0. For comparison, we also plot the ratio between the simplest version (Simp-4) and the original version (Original) of the news articles in the Newsela corpus. The Newsela corpus has a much more reasonable compression ratio and is therefore likely to be more suitable for studying document-level simplification.
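The document-level compression ratio can be sketched as follows, assuming it is computed as the character count of the simplified article divided by that of its complex counterpart (so that stub articles yield values near 0). The function names and the toy article pairs are hypothetical.

```python
def compression_ratio(simple_text, complex_text):
    """Character count of the simplified article divided by the character
    count of the corresponding complex article."""
    if not complex_text:
        raise ValueError("complex article is empty")
    return len(simple_text) / len(complex_text)


def compression_ratios(article_pairs):
    """Compression ratio for each (simple, complex) article pair."""
    return [compression_ratio(simple, complex_) for simple, complex_ in article_pairs]


# Toy article pairs, made up for illustration only.
pairs = [("abc", "abcdef"), ("ab", "abcdefgh")]
ratios = compression_ratios(pairs)
```

The resulting list of ratios is what a histogram such as Figure 2 would be drawn from.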

Analysis of discourse connectives
Although discourse is known to affect readability, the relation between discourse and text simplification is still under-studied with the use of statistical methods (Williams et al., 2003; Siddharthan, 2006; Siddharthan and Katsos, 2010). Text simplification often involves splitting one sentence into multiple sentences, which is likely to require discourse-level changes such as introducing explicit rhetorical relations. However, previous research that uses Simple-Normal Wikipedia largely focuses on sentence-level transformation, without taking large discourse structure into account.
Figure 3: A radar chart that visualizes the odds ratio (radius axis) of discourse connectives in the simple side vs. the complex side. An odds ratio larger than 1 indicates the word is more likely to occur in the simplified text than in the complex text, and vice versa. Simple cue words (in the shaded region), except "hence", are more likely to be added during Newsela's simplification process than in Wikipedia's. Complex conjunction connectives (in the unshaded region) are more likely to be retained in Wikipedia's simplifications than in Newsela's.

To preserve the rhetorical structure, Siddharthan (2003, 2006) proposed to introduce cue words when simplifying various conjoined clauses. We perform an analysis on discourse connectives that are relevant to readability as suggested by Siddharthan (2003). Figure 3 presents the odds ratios of simple cue words and complex conjunction connectives. The odds ratios are computed for Newsela between the Original and Simp-4 versions, and for Wikipedia between Normal and Simple documents collected by Kauchak (2013). The figure suggests that Newsela exhibits a more complete degree of simplification than Wikipedia, and that it may enable more computational studies of the role of discourse in text simplification in the future.

Newsela's quality is better than Wikipedia
Overall, we have shown that the professional simplification of Newsela is more rigorous and more consistent than Simple English Wikipedia. The language and content also differ between the encyclopedia and news domains. The two are not interchangeable for either developing or evaluating simplification systems. In the next section, we will review the evaluation methodology used in recent research, discuss its shortcomings and propose alternative evaluations.

Such evaluation is insufficient to measure 1) the practical value of a system to a specific target reader population and 2) the performance of individual simplification components: sentence splitting, deletion and paraphrasing. Although the inadequacy of text simplification evaluations has been discussed before (Siddharthan, 2014), we focus on these two common deficiencies and suggest two future directions.

Targeting specific audiences
Simplification has many subtleties, since what constitutes simplification for one type of user may not be appropriate for another. Many researchers have studied simplification in the context of different audiences. However, most recent automatic simplification systems are developed and evaluated with little consideration of the target reader population. There is one attempt by Angrosh et al. (2014), who evaluate their system by asking non-native speakers comprehension questions. They conducted an English vocabulary size test to categorize the users into different levels of language skills.
The Newsela corpus allows us to target children at different grade levels. From the application point of view, making knowledge accessible to all children is an important yet challenging part of education (Scarton et al., 2010; Moraes et al., 2014). From the technical point of view, reading grade level is a clearly defined objective for both simplification systems and human annotators. Once there is a well-defined objective, with constraints such as vocabulary size and sentence length, it is easier to fairly compare different systems. Newsela provides human simplification at different grade levels and reading comprehension quizzes alongside each article.
In addition, readability is widely studied and can be automatically estimated (Kincaid et al., 1975; Pitler and Nenkova, 2008; Petersen and Ostendorf, 2009). Although existing readability metrics assume text is well-formed, they can potentially be used in combination with text quality metrics (Post, 2011; Louis and Nenkova, 2013) to evaluate simplifications. They can also be used to aid humans in the creation of reference simplifications.
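As one example of automatic readability estimation, the sketch below computes the Flesch-Kincaid grade level of Kincaid et al. (1975): 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The syllable counter is a crude vowel-group heuristic and the sentence splitter is a simple punctuation split; both are assumptions of this illustration, as are the function names and example texts.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels,
    with a minimum of one syllable per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level from word, sentence and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Made-up example texts.
simple_grade = flesch_kincaid_grade("The cat sat. The dog ran.")
complex_grade = flesch_kincaid_grade(
    "The unscrupulous administrator deliberately misrepresented approximately everything."
)
```

Because the formula relies only on surface counts, it assumes well-formed input, which is why the text above suggests pairing such metrics with text quality metrics.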

Evaluating sub-tasks separately
It is widely accepted that sentence simplification involves three different elements: splitting, deletion and paraphrasing (Feng, 2008; Narayan and Gardent, 2014). Splitting breaks a long sentence into a few short sentences to achieve better readability. Deletion reduces the complexity by removing unimportant parts of a sentence. Paraphrasing rewrites text into a simpler version via reordering, substitution and occasionally expansion.
Most state-of-the-art systems consist of all or a subset of these three components. However, the popular human evaluation criteria (grammaticality, simplicity and adequacy) do not explain which components in a system are good or bad. More importantly, deletion may be unfairly penalized since shorter output tends to result in lower adequacy judgements (Napoles et al., 2011).
We therefore advocate for a more informative evaluation that separates out each sub-task. We believe this will lead to more easily quantifiable metrics and possibly the development of automatic metrics. For example, early work shows potential use of precision and recall to evaluate splitting (Siddharthan, 2006; Gasperin et al., 2009) and deletion (Riezler et al., 2003; Filippova and Strube, 2008). Several studies have also investigated various metrics for evaluating sentence paraphrasing (Callison-Burch et al., 2008; Chen and Dolan, 2011; Ganitkevitch et al., 2011; Xu et al., 2012, 2013; Weese et al., 2014).
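To make the precision and recall idea concrete, here is one possible formulation, offered only as a sketch: a system's decisions for a sub-task are represented as a set of token positions (for example, positions of deleted tokens, or positions where a sentence is split) and compared against a reference set. This representation and the function name are assumptions, not the definitions used in the cited work.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted positions against gold positions."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up example: the system deletes tokens 3 and 7,
# while the reference simplification deletes tokens 3, 9 and 12.
precision, recall = precision_recall({3, 7}, {3, 9, 12})
```

Scoring each sub-task this way keeps a system's deletion behaviour from being conflated with its splitting or paraphrasing behaviour.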

Summary and recommendations
In this paper, we presented the first systematic analysis of the quality of Simple Wikipedia as a simplification data resource. We conducted a qualitative manual examination and several statistical analyses (including vocabulary change matrices, compression ratio histograms, log-odds-ratio calculations, etc.). We introduced a new, high-quality corpus of professionally simplified news articles, Newsela, as an alternative resource, that allowed us to demonstrate Simple Wikipedia's inadequacies in comparison. We further discussed problems with current simplification evaluation methodology and proposed potential improvements.
Our goal for this opinion paper is to stimulate progress in text simplification research. Simple English Wikipedia played a vital role in inspiring simplification approaches based on statistical machine translation. However, it has so many drawbacks that we recommend that the community drop it as the standard benchmark set for simplification. Other resources like the Newsela corpus are superior, since they provide a more consistent level of quality, target a particular audience, and approach the size of parallel Simple-Normal English Wikipedia. We believe that simplification is an important area of research that has the potential for broader impact beyond NLP research. But we must first adopt appropriate data sets and research methodologies.
Researchers can request the Newsela data following the instructions at: https://newsela.com/data/

Figure 1: Manual classification of aligned sentence pairs from the Newsela corpus. We categorize 50 randomly sampled sentence pairs drawn from Original-Simp2 and 50 sentence pairs from Original-Simp4.

Figure 2: Distribution of document-level compression ratio, displayed as a histogram smoothed by kernel density estimation. The Newsela corpus is more normally distributed, suggesting more consistent quality.
Real Simplification (50%)
[NORM] This article is a list of the 50 U.S. states and the District of Columbia ordered by population density.
[SIMP] This is a list of the 50 U.S. states, ordered by population density.
[NORM] All adult Muslims, with exceptions for the infirm, are required to offer Salat prayers five times daily.
[SIMP] All adult Muslims should do Salat prayers five times a day.

Table 1: Example sentence pairs (NORM-SIMP) aligned between English Wikipedia and Simple English Wikipedia. The breakdown in percentages is obtained through manual examination of 200 randomly sampled sentence pairs in the Parallel Wikipedia Simplification (PWKP) corpus.

Table 3: Example of sentences written at multiple levels of text complexity from the Newsela data set. The Lexile readability score and grade level apply to the whole article rather than individual sentences, so the same sentences may receive different scores, e.g. the above sentences for the 6th and 7th grades. The bold font highlights the parts of the sentence that are different from the adjacent version(s).

Table 4: Basic statistics of the Newsela Simplification corpus vs. the Parallel Wikipedia Simplification (PWKP) corpus. The Newsela corpus consists of 1130 articles with original and 4 simplified versions each. Simp-1 is of the least simplified level, while Simp-4 is the most simplified. The numbers marked by * are slightly different from previously reported, because of the use of different tokenizers.

Table 6: Top 50 tokens associated with the complex text, computed using the Monroe et al. (2008) method. Bold words are shared by the complex version of Newsela and the complex version of Wikipedia.
found is made called started pays said was got are like get can means says has went comes make put used

Table 7: Top 50 tokens associated with the simplified text.

Table 8: Frequency of example words from Table 6. These complex words are reduced at a much greater rate in the simplified Newsela than they are in the Simple English Wikipedia. A smaller odds ratio indicates greater reduction.

Table 9: Top 30 syntax patterns associated with the complex text (left) and simplified text (right). Bold patterns are the top patterns shared by Newsela and Wikipedia.