Grammar Error Correction in Morphologically Rich Languages: The Case of Russian

Until now, most of the research in grammar error correction has focused on English, and the problem has hardly been explored for other languages. We address the task of correcting writing mistakes in morphologically rich languages, with a focus on Russian. We present a corrected and error-tagged corpus of Russian learner writing and develop models that make use of existing state-of-the-art methods that have been well studied for English. Although impressive results have recently been achieved for grammar error correction of non-native English writing, these results are limited to domains where plentiful training data are available. Because annotation is extremely costly, these approaches are not suitable for the majority of domains and languages. We thus focus on methods that use “minimal supervision”, that is, methods that do not rely on large amounts of annotated training data, and show how existing minimal-supervision approaches extend to a highly inflectional language such as Russian. The results demonstrate that these methods are particularly useful for correcting mistakes in grammatical phenomena that involve rich morphology.


Introduction
This paper addresses the task of correcting errors in text. Most of the research in the area of grammar error correction (GEC) has focused on correcting mistakes made by English language learners. One standard approach to dealing with these errors, which proved highly successful in text correction competitions (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2013; Rozovskaya et al., 2017), makes use of a machine-learning classifier paradigm and is based on the methodology for correcting context-sensitive spelling mistakes (Roth, 1996, 1999; Banko and Brill, 2001). In this approach, classifiers are trained for a particular mistake type: for example, preposition, article, or noun number (Gamon, 2010; Rozovskaya and Roth, 2010c,b; Dahlmeier and Ng, 2012). Originally, classifiers were trained on native English data. As several annotated learner datasets became available, models were also trained on annotated learner data.
More recently, statistical machine translation (MT) methods, including neural MT, have gained considerable popularity thanks to the availability of large annotated corpora of learner writing (e.g., Yuan and Briscoe, 2016; Junczys-Dowmunt and Grundkiewicz, 2016; Chollampatt and Ng, 2018). Classification methods work very well on well-defined types of errors, whereas MT is good at correcting interacting and complex types of mistakes, which makes these approaches complementary in some respects (Rozovskaya and Roth, 2016).
Thanks to the availability of large (in-domain) datasets, substantial gains in performance have been made in English grammar correction. Unfortunately, research on other languages has been scarce. Previous work includes efforts to create annotated learner corpora for Arabic, Japanese (Mizumoto et al., 2011), and Chinese (Yu et al., 2014), and shared tasks on Arabic (Rozovskaya et al., 2015) and Chinese error detection (Lee et al., 2016; Rao et al., 2017). However, building robust models in other languages has been a challenge, since an approach that relies on heavy supervision is not viable across languages, genres, and learner backgrounds. Moreover, for languages that are complex morphologically, we may need more data to address the lexical sparsity. This work focuses on Russian, a highly inflectional language from the Slavic group. Russian has over 260M speakers, for 47% of whom Russian is not their native language. 1 We corrected and error-tagged over 200K words of non-native Russian texts. We use this dataset to build several grammar correction systems that draw on and extend the methods that showed state-of-the-art performance on English grammar correction. Because the size of our annotation is limited, compared with what is used for English, one of the goals of our work is to quantify the effect of having limited annotation on existing approaches. We evaluate both the MT paradigm, which requires large amounts of annotated learner data, and the classification approaches that can work with any amount of supervision.
Overall, the results obtained for Russian are much lower than those reported for English. We further find that the minimal-supervision classification methods that can combine large amounts of native data with a small annotated learner sample give the best results on a language with rich morphology and with limited annotation. The system that uses classifiers with minimal supervision achieves an F0.5 score of 21.0, 2 whereas the MT system trained on the same data achieves a score of only 10.6.
This paper makes the following contributions: (1) We describe an error classification schema for Russian learner errors, and present an error-tagged Russian learner corpus. The dataset is available for research 3 and can serve as a benchmark dataset for Russian, which should facilitate progress on grammar correction research, especially for languages other than English. (2) We present an analysis of the annotated data, in terms of error rates, error distributions by learner type (foreign and heritage), as well as comparison to learner corpora in other languages. (3) We extend state-of-the-art grammar correction methods to a morphologically rich language and, in particular, identify classifiers needed to address mistakes that are specific to these languages. (4) We demonstrate that the classification framework with minimal supervision is particularly useful for morphologically rich languages: because of the large variability of word forms, such languages benefit from large amounts of native data, while small amounts of annotation provide good estimates of typical learner errors. (5) We present an error analysis that provides further insight into the behavior of the models on a morphologically rich language.
Section 2 presents related work. Section 3 describes the corpus. Experiments are described in Section 4, and the results are presented in Section 5. We present an error analysis in Section 6 and conclude in Section 7.
1 https://en.wikipedia.org/wiki/Russian_language.
2 This is a standard metric used in grammar correction since the CoNLL shared tasks. Because precision is more important than recall in grammar correction, it is weighted twice as high as recall, and the resulting metric is denoted F0.5. Other metrics have been proposed recently (Felice and Briscoe, 2015; Napoles et al., 2015; Choshen and Abend, 2018a).

Background and Related Work
We first discuss related work in text correction on languages other than English. We then introduce the two frameworks for grammar correction (evaluated primarily on English learner datasets) and discuss the “minimal supervision” approach.

Grammar Correction in Other Languages
The two most prominent attempts at grammar error correction in other languages are shared tasks on Arabic and Chinese text correction. In Arabic, a large-scale corpus (2M words) was collected and annotated as part of the QALB project. The corpus is fairly diverse: it contains machine translation outputs, news commentaries, and essays authored by native speakers and learners of Arabic. The learner portion of the corpus contains 90K words (Rozovskaya et al., 2015), including 43K words for training. This corpus was used in two editions of the QALB shared task (Rozovskaya et al., 2015). There have also been three shared tasks on Chinese grammatical error diagnosis (Lee et al., 2016; Rao et al., 2017, 2018). A corpus of learner Chinese used in the competition includes 4K units for training (each unit consists of one to five sentences). Mizumoto et al. (2011) present an attempt to extract a Japanese learners' corpus from the revision log of a language learning Web site (Lang-8). They collected 900K sentences produced by learners of Japanese and implemented a character-based MT approach to correct the errors. The English learner data from the Lang-8 Web site is commonly used as parallel data in English grammar correction. One problem with the Lang-8 data is a large number of remaining unannotated errors.
In other languages, attempts at automatic grammar detection and correction have been limited to identifying specific types of misuse (grammar or spelling). Imamura et al. (2012) address the problem of particle error correction for Japanese, and Israel et al. (2013) develop a small corpus of Korean particle errors and build a classifier to perform error detection. De Ilarraza et al. (2008) address errors in postpositions in Basque, and Vincze et al. (2014) study definite and indefinite conjugation usage in Hungarian. Several studies focus on developing spell checkers (Ramasamy et al., 2015;Sorokin et al., 2016;Sorokin, 2017).
There has also been work that focuses on annotating learner corpora and creating error taxonomies that do not build a grammar correction system. Dickinson and Ledbetter (2012) present an annotated learner corpus of Hungarian; Hana et al. (2010) and Rosen et al. (2014) build a learner corpus of Czech; and Abel et al. (2014) present KoKo, a corpus of essays authored by German secondary school students, some of whom are non-native writers. For an overview of learner corpora in other languages, we refer the reader to Rosen et al. (2014).

Approaches to Text Correction
There are currently two well-studied paradigms that achieve competitive results on the task in English: MT and machine learning classification. In the classification approach, error-specific classifiers are built. Given a confusion set, for example {a, the, zero article} for articles, each occurrence of a confusable word is represented as a vector of features derived from a context window around it. Classifiers can be trained either on learner or on native data, where each target word occurrence (e.g., the) is treated as a positive training example for the corresponding word. Given a text to correct, for each confusable word, the task is to select the most likely candidate from the relevant confusion set. Error-specific classifiers are typically trained for common learner errors, for example, article, preposition, or noun number in English (Izumi et al., 2003; Han et al., 2006; Gamon et al., 2008; De Felice and Pulman, 2008; Gamon, 2010; Rozovskaya and Roth, 2011; Dahlmeier and Ng, 2012).
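As a rough illustration of how a confusable word can be turned into a training example in the classification approach, the sketch below extracts binary features from a context window around an article. The feature naming scheme, the window size, and the helper names (`context_features`, `native_training_example`) are illustrative assumptions, not the feature set used in the cited systems.

```python
from typing import Dict, List, Tuple

# Confusion set for English articles, as in the text; "" stands for the zero article.
ARTICLE_CONFUSION_SET = ["a", "the", ""]


def context_features(tokens: List[str], position: int, window: int = 2) -> Dict[str, int]:
    """Binary features for the words around tokens[position].

    The target word itself is excluded, so the same features can be
    extracted regardless of which candidate is (or should be) present.
    """
    features: Dict[str, int] = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        idx = position + offset
        word = tokens[idx].lower() if 0 <= idx < len(tokens) else "<pad>"
        features[f"w[{offset:+d}]={word}"] = 1
    return features


def native_training_example(tokens: List[str], position: int) -> Tuple[Dict[str, int], str]:
    """On native data, the word the writer used serves as the label."""
    return context_features(tokens, position), tokens[position].lower()


tokens = "She read the book yesterday".split()
features, label = native_training_example(tokens, 2)
print(label, sorted(features))
```

At correction time, the same features would be extracted for each confusable word in the learner text, and a trained classifier would score every candidate in the confusion set.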
In the MT approach, the error correction problem is cast as a translation task: namely, translating ungrammatical learner text into well-formed grammatical text, and original learner texts and the corresponding corrected texts act as parallel data. MT systems for grammar correction are trained using 20M-50M words of learner texts to achieve competitive performance. The MT approach has shown state-of-the-art results on the benchmark CoNLL-14 test set in English (Junczys-Dowmunt and Grundkiewicz, 2016; Chollampatt and Ng, 2017); it is particularly good at correcting complex error patterns, which is a challenge for the classification methods (Rozovskaya and Roth, 2016). However, phrase-based MT systems do not generalize well beyond the error patterns observed in the training data. Several neural encoder-decoder approaches relying on recurrent neural networks were proposed (Chollampatt et al., 2016; Yuan and Briscoe, 2016; Ji et al., 2017). These initial attempts were not able to reach the performance of the state-of-the-art phrase-based MT systems (Junczys-Dowmunt and Grundkiewicz, 2016), but more recently neural MT approaches have shown competitive results on English grammar correction (Chollampatt and Ng, 2018). 4 However, neural MT systems tend to require even more supervision. For instance, methods developed for low-resource machine translation tasks have been adopted for grammar correction, but they still require parallel corpora in tens of millions of tokens.
Minimal Supervision Framework. As we have noted, classifiers can be trained on either native or learner data. Native data are cheap and available in large quantities. But, when training on learner data, the potentially erroneous word can also be used by the model. Because mistakes made by non-native speakers are not random (Montrul and Slabakova, 2002; Ionin et al., 2008), using the potentially erroneous word and the correction provides the models with knowledge about learner error patterns. For this reason, models trained on error-annotated data often outperform models trained on larger amounts of native data (Gamon, 2010; Dahlmeier and Ng, 2011). But this approach requires large amounts of annotated learner data (Gamon, 2010). The minimal supervision approach (Rozovskaya and Roth, 2014; Rozovskaya et al., 2017) incorporates the best of both modes: it trains on native texts, which makes it possible to use large amounts of data without the need for annotation, and it uses a modest amount of expensive learner data that contains learner error patterns. Importantly, error patterns can be estimated robustly with a small amount of annotation (Rozovskaya et al., 2017). The error patterns can be provided to the model in the form of artificial errors or by changing the model priors. In this work, we use the artificial errors approach; it has been studied extensively for English grammar correction. Several other studies consider the effect of using artificial errors (e.g., Cahill et al., 2013; Felice and Yuan, 2014).

Corpus and Annotation
We annotated data from the Russian Learner Corpus of Academic Writing (RULEC, 560K words) (Alsufieva et al., 2012), which consists of essays and papers written in a university setting in the United States by students learning Russian as a foreign language and heritage speakers (those who grew up in the United States but had exposure to Russian at home). This closely mirrors the datasets used for English grammar correction. The corpus contains data from 15 foreign language learners and 13 heritage speakers. RULEC is freely available for research use. 5

Russian Grammatical Categories
Russian is a fusional language with free word order, characterized by rich morphology and a high number of inflections. Nouns, adjectives, and certain pronouns are specified for gender, number, and case. Modifiers agree with the head nouns; thus, words in these grammatical categories can have up to 24 different word forms. Verbs are marked for number, gender, and person and agree with the grammatical subject. Other categories for verbs are aspect, tense, and voice. These are typically expressed through morphemes corresponding to functional words in English (shall, will, was, have, had, been, etc.).

Annotation
Two annotators, native speakers of Russian with a background in linguistics, corrected a subset of RULEC (12,480 sentences, comprising 206K words). One of the annotators is an English as a Second Language instructor and English-Russian translator. The annotation was performed using a tool built for a similar annotation project for English (Rozovskaya and Roth, 2010a). We refer to the resulting corpus as RULEC-GEC.
When selecting sentences to be annotated, we attempted to include a variety of writers from each group (foreign and heritage speakers). The annotated data include 12 foreign and 5 heritage writers. The essays of each writer were sorted alphabetically by the essay file name; the essays for annotation were selected in that order, and the sentences were selected in the order they appear in each essay. We intentionally selected more essays from non-native authors, as we conjectured that these authors would display a greater variety of grammatical errors and higher error rates. Eventually, for each author, a subset of that writer's essays was included; the number of annotated essays varies by author, from 13 to 159.
The data were corrected, and each mistake was assigned a type. We developed an error classification schema that addresses errors in morphology, syntax, and word usage, and takes into account linguistic properties of the Russian language, by emphasizing those that are most commonly misused. The common phenomena were identified through a pilot annotation, and with the help of sample errors that had been collected with the Russian National Corpus in the process of developing a similar annotation of Russian learner texts. The sample errors were made available to us by the authors (Klyachko et al., 2013). This study resulted in an annotated corpus, available for online search at http://web-corpora.net/ (Rakhilina et al., 2016).
Table 1 (examples of common errors; in each example, the learner's form is marked with an asterisk and is followed by the correction; grammatical tags of the gloss are given in square brackets):

Noun:case
  Это зависит от *показания/показаний очевидцев
  This depends from testimony[gen,*sg/gen,pl] eyewitness[gen,pl]
  'This depends on the testimony of eyewitnesses'
Preposition
  Слова *от/из прошлых уроков
  word[nom,pl] *from/out of previous[gen,pl] lesson[gen,pl]
  'Words from previous lessons'
Verb number agreement
  Все новые здания *разваливается/разваливаются
  All new[nom,pl] building[nom,pl] *fall[pres,imperfect,sg]/fall[pres,imperfect,pl] apart
  'All new buildings are falling apart'
Verb gender agreement
  Лера *пробовал/пробовала флиртовать с ним
  Valerie *try[past,imperfect,masc]/try[past,imperfect,fem] to flirt with him
  'Valerie tried flirting with him'
Lexical choice
  Тогда люди стали *спрашивать/задавать вопросы
  Then people[nom,pl] started *to inquire/to ask questions[acc,pl]
  'Then people started to ask questions'

Our error tagset was developed independently and is smaller than the one in Rakhilina et al. (2016), in order to minimize the annotation burden, while still being able to distinguish among most typical linguistic problems for Russian language learners. We include 23 tags that cover syntactic and morphosyntactic errors, orthography, and lexical errors. Table 1 illustrates some of the common errors, and Table 2 presents annotation statistics. Frequencies for the top 13 errors are shown in Table 3. Note that the top 10 error types account for over 80% of all errors. Not shown are the phenomena that occur less than once per 1,000 words: adj:gender, verb:voice, verb:tense, adj:other, pronoun, adj:number, conjunction, verb:other, noun:gender, noun:other.

Inter-Annotator Agreement
Because annotation for grammatical errors is extremely variable, as there are often multiple ways of correcting the same mistake (Bryant and Ng, 2015), we compute inter-rater agreement following Rozovskaya and Roth (2010a), where the texts corrected by one annotator were given to the second annotator. Agreement is computed as the percentage of sentences that did not have additional corrections on the second pass. After all, our goal is to make the sentence well formed, without enforcing that errors are corrected in the same way. A total of 200 sentences from each annotator were selected and given to the other annotator. Table 4 shows that the error rate of the sentences corrected by annotator A on the second pass was 2.4%, with 68.5% of the sentences remaining unchanged. The sentences corrected by annotator B on the second pass had an error rate of less than 1%, and over 91% of the sentences did not have additional corrections. These agreement numbers are higher than those reported for English, where the percentage of unchanged sentences varied between 37% and 83% (Rozovskaya and Roth, 2010a).
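The agreement measure described above amounts to counting the sentences that the second annotator left untouched. A minimal sketch, assuming the first-pass and second-pass versions of each sentence are available as aligned lists of strings (the function name and the toy data are illustrative, not part of the annotation tool):

```python
from typing import List


def second_pass_agreement(first_pass: List[str], second_pass: List[str]) -> float:
    """Percentage of sentences with no additional corrections on the second pass."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must contain the same sentences")
    if not first_pass:
        raise ValueError("at least one sentence is required")
    unchanged = sum(a == b for a, b in zip(first_pass, second_pass))
    return 100.0 * unchanged / len(first_pass)


# Toy data: 200 sentences, of which the second annotator changed 63.
first = [f"sentence {i}" for i in range(200)]
second = [s if i < 137 else s + " (edited)" for i, s in enumerate(first)]
print(second_pass_agreement(first, second))
```

With 137 of 200 sentences unchanged, the sketch yields 68.5, which matches the figure reported for the sentences first corrected by annotator A.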

Comparison to Other Learner Corpora
Error Rates. In Table 5, we compare the error rates in RULEC-GEC to those in a learner corpus of Arabic and three corpora of learner English: JFLEG (Napoles et al., 2017), FCE (Yannakoudakis et al., 2011), and CoNLL (Ng et al., 2014). The error rates in RULEC are generally lower than in the other learner corpora. The Arabic data have the highest error rate of 28.7%. In the English learner corpora, the error rates range between 6.5% and 25.5%. The error rates are 17.7% (FCE); 18.5-25.5% for JFLEG, annotated independently by four raters; and 10.8-13.6% for CoNLL-test, annotated by two raters. The lowest error rate that is comparable to ours is in CoNLL-train (6.6%). We attribute the differences to the proficiency levels of the RULEC writers, which are fairly advanced. In fact, error rates vary widely by learner group (foreign vs. heritage), as discussed in Section 3.5. the most common mistake types (note that “mechanical errors” in CoNLL group together spelling and punctuation errors). Noun number errors are also common in CoNLL, a corpus produced by learners whose first language is Chinese, whereas these are less common in FCE, produced by learners of diverse linguistic backgrounds. In Russian, spelling, punctuation, and lexical choice are also in the top five.

Most Common Errors
In RULEC-GEC, the top five error categories are spelling, lexical choice, noun:case, punctuation, and missing word. Overall, spelling, punctuation, and lexical errors are in the top five categories for all of the three corpora. As for grammar-related errors, although article and preposition errors also made it to the top of the list in the English corpora, noun case usage is definitely the most challenging and common phenomenon for Russian learners.

Foreign vs. Heritage Speakers
We also compare foreign and heritage speakers. The heritage speaker subcorpus includes 42,187 words, and the foreign speaker partition comprises 164,071 words. The error rates are 4.0% for the heritage group and 6.9% for the foreign group; foreign learners thus make almost twice as many mistakes as heritage speakers. In the foreign group, there is a lot of variation, with five writers exhibiting error rates of 10-13%, two writers whose error rates are below 3%, and five authors having error rates between 5% and 7%. There is not much variation in the heritage group.
The two groups also reveal differences in the error distributions (Table 7): More than 65% of errors in the heritage group are in spelling and punctuation. In fact, 42.4% of errors in the heritage corpus are spelling mistakes vs. 18.6% for foreign speakers. If we consider the number of errors per 1,000 words, we observe that, interestingly, heritage speakers make spelling and punctuation errors more frequently (15.7 spelling and 8.5 punctuation errors in the heritage group vs. 11.7 spelling and 4.8 punctuation errors in the foreign group). As for the other grammatical phenomena, although these are all more challenging for the foreign speaker group, the distributions of these phenomena are quite similar. For example, heritage speakers make 2.9 noun case errors per 1,000 words, whereas foreign speakers make 8.8 noun case errors per 1,000 words; for both types of writers, noun case errors are at the top of the list (second most common for the foreign group and third most common for the heritage group).

Experiments
The experiments investigate the following:
1. How do the two state-of-the-art methods compare under the conditions that we have for Russian (rich morphology and limited annotations)?
2. What is the performance on individual errors and the overall performance compared with results obtained for English grammar correction?
3. How well do the classifiers within the minimal supervision framework perform in morphologically rich languages, on grammatical phenomena that are common in highly inflectional languages such as Russian, as well as on phenomena that also occur in English?
To answer these questions, the following three approaches are implemented:
• Learner-trained classifiers: Error-specific classifiers trained on learner data
• Minimal-supervision classifiers: Error-specific classifiers trained on learner and native data with minimal supervision (see Section 2.2)
• Phrase-based machine translation system
Data. We split the annotated data into training (4,980 sentences, 83,410 words), development (2,500 sentences, 41,163 words), and test (5,000 sentences, 81,693 words). For the native data, we use the Yandex corpus (Borisov and Galinskaya, 2014), a diverse corpus of newswire, fiction, and other genres (18M words). All the data were preprocessed with the Mystem morphological analyzer (Segalovich, 2003) and a part-of-speech tagger (Schmid, 1995).

Classifiers
In the classification framework, we develop classifiers for several common grammar errors: preposition, noun case, verb aspect, and verb agreement (split into number and gender). The rationale for selecting these errors is to evaluate the behavior of the classifiers on phenomena that have been well studied in English (e.g., preposition and verb number agreement), as well as those that have not received much attention (verb aspect); or those that are specific to Russian (noun case and gender agreement). For each error type, a special classifier is developed. The features include word n-grams, POS n-grams, lemma n-grams, and morphological properties of the target word and neighboring words. In addition, in line with Rozovskaya and Roth (2016), we include a punctuation module that inserts missing commas, using patterns mined from the Yandex corpus and the RULEC-GEC training data. We now provide more detail on the grammar phenomena considered.
Noun Case Errors. Noun case usage is the most common error type after spelling and accounts for 14% of all errors. The Russian case system consists of six cases: Nominative, Genitive, Accusative, Dative, Instrumental, and Locative. The case classifier is thus a six-way classifier, with each class corresponding to one of the cases. The labels are obtained by extracting the case information predicted by the morphological analyzer on original and corrected noun forms. It should be noted that the surface form of the noun may be ambiguous with respect to case. For example, the word яблоко (“apple”) in different contexts can be interpreted as nominative or accusative. In that case, the morphological analyzer will list both analyses, and both of these will be included as gold labels for the word. This is because our task is not to predict the case but the surface form of the noun. About 58% of nouns are unambiguous (have one case-related morphological analysis), 34% have two possible case analyses, and 8% of nouns have three or more analyses.
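The treatment of case-ambiguous surface forms can be sketched as follows. The representation of a morphological analysis as a dictionary with a `case` key is a hypothetical format chosen for illustration; it is not the actual output format of the Mystem analyzer, and the helper names are likewise assumptions.

```python
from typing import Dict, List, Set

# The six Russian cases used as classifier labels.
CASES = ("nom", "gen", "acc", "dat", "ins", "loc")


def gold_case_labels(analyses: List[Dict[str, str]]) -> Set[str]:
    """Collect every case listed by the analyzer for the corrected noun form."""
    return {a["case"] for a in analyses if a.get("case") in CASES}


def prediction_is_correct(predicted_case: str, gold_labels: Set[str]) -> bool:
    """A prediction counts as correct if it matches any gold case,
    since all of them yield the same surface form."""
    return predicted_case in gold_labels


# Hypothetical analyzer output for the ambiguous form яблоко ("apple").
yabloko_analyses = [
    {"lemma": "яблоко", "case": "nom", "number": "sg"},
    {"lemma": "яблоко", "case": "acc", "number": "sg"},
]
gold = gold_case_labels(yabloko_analyses)
print(sorted(gold), prediction_is_correct("acc", gold))
```

Under this scheme a six-way classifier is not penalized for choosing the accusative where the annotator intended the nominative, as long as the resulting word form is identical.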

Number and Gender Verb Agreement Errors
Verb agreement functions in a way that is similar to English. In Russian, verbs are specified for number (singular, plural), gender (feminine, masculine, and neuter), and person. Errors in person agreement are rare, and we ignore these.
Preposition Errors. Preposition errors are some of the most common errors for learners of English (Leacock et al., 2010), and are also quite common among the Russian learners, accounting for over 3% of all errors (Table 3). In the classification framework, it is common to consider the top n most frequent prepositions (Dahlmeier and Ng, 2012). In line with work in English, we consider mistakes that involve the top 15 Russian prepositions. 6
Verb Aspect. The Russian verb system is different from English, and verb aspect errors among Russian learners are quite common. Russian has three tenses (present, past, and future), and each tense can be expressed in imperfective or perfective aspect. Although there is no direct correspondence between the Russian aspect usage and the English tenses, the aspect can be weakly aligned with the English tense system. Prior research in English showed that verb tense errors are some of the most difficult mistakes, as verb tense usage is highly semantic rather than grammatical (Lee and Seneff, 2008; Tajiri et al., 2012).
Table 8 lists the confusion sets for each error classifier. In all cases, a discriminative learning framework is used with the Averaged Perceptron algorithm (Rizzolo, 2011).
We use the artificial errors approach (Rozovskaya et al., 2017) to simulate learner errors in training. Learner error patterns (or error statistics) are extracted from the annotated learner data. Specifically, given an error type, we collect all source/label pairs from the annotated sample, where both the source and the label belong to the confusion set, and generate a confusion matrix, where each cell represents Prob(source=s | label=l). Table 9 shows a confusion matrix for noun case errors based on error statistics collected from the training and development data in RULEC-GEC. The values in the confusion matrix are used to generate noun form errors in the training data. For instance, according to the table, given a noun that needs to be in the genitive case, a learner is four times more likely to use the nominative case than the locative case. We use this table both to introduce artificial errors in native training data and to increase the error rates in the learner data by adding artificial mistakes to naturally occurring errors. Adding artificial errors when training on learner data is also useful, as increasing the error rates improves the recall of the system. In both cases, the generated errors are added so that the relative frequencies of different confusions are preserved (e.g., nominative is four times more likely than locative to be used in place of genitive), while the error rates can be varied (higher error rates will improve the recall of the system at the expense of precision).
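The procedure of estimating a confusion matrix from annotated pairs and then injecting artificial errors at a chosen rate can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation of Rozovskaya et al. (2017): the function names, the toy counts, and the sampling scheme (replace the correct label with probability `error_rate`, choosing the erroneous source in proportion to the estimated confusion probabilities) are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def estimate_confusions(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Return probs[label][source] = Prob(source | label) from (source, label) pairs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for source, label in pairs:
        counts[label][source] += 1
    return {
        label: {s: c / sum(row.values()) for s, c in row.items()}
        for label, row in counts.items()
    }


def inject_errors(labels: List[str], probs: Dict[str, Dict[str, float]],
                  error_rate: float, rng: random.Random) -> List[str]:
    """Generate a source form for each correct label.

    With probability error_rate the label is replaced by an erroneous source,
    sampled in proportion to Prob(source | label) over sources != label, so the
    relative frequencies of confusions are preserved at any error rate.
    """
    sources = []
    for label in labels:
        wrong = {s: p for s, p in probs.get(label, {}).items() if s != label}
        if wrong and rng.random() < error_rate:
            candidates = list(wrong)
            weights = [wrong[s] for s in candidates]
            sources.append(rng.choices(candidates, weights=weights, k=1)[0])
        else:
            sources.append(label)
    return sources


# Toy annotated sample (hypothetical counts): for nouns that should be genitive,
# learners wrote genitive 90 times, nominative 8 times, and locative 2 times.
annotated = [("gen", "gen")] * 90 + [("nom", "gen")] * 8 + [("loc", "gen")] * 2
probs = estimate_confusions(annotated)
noisy = inject_errors(["gen"] * 1000, probs, error_rate=0.2, rng=random.Random(0))
print(probs["gen"], Counter(noisy))
```

In this toy sample the nominative is four times as likely as the locative to replace the genitive, and that ratio is kept regardless of the value of `error_rate`, which is the parameter that would be tuned on development data.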

The MT System
One advantage of the MT approach is that error types need not be formulated explicitly. We build a phrase-based MT system that follows the implementation in . Our MT system is trained using Moses (Koehn et al., 2007). The phrase table is trained on the training partition of RULEC-GEC. We use two 4-gram language models-one is trained on the Yandex corpus, and the other one is trained on the corrected side of the RULEC-GEC training data. Both are trained with KenLM (Heafield et al., 2013). Tuning is done on the development dataset with MERT (Och, 2003). We use BLEU (Papineni et al., 2002) as the tuning metric.
We note that several neural MT systems have been proposed recently (see Section 2). Because we only have a small amount of parallel data, we adopt the phrase-based MT, as it is known that neural MT systems have a steeper learning curve with respect to the amount of training data, resulting in worse quality in low-resource settings (Koehn and Knowles, 2017). We also note that Junczys-Dowmunt and Grundkiewicz (2016) present a stronger SMT system for English grammar correction. Their best result, which is due to adding dense and sparse features, is an improvement of 3 to 4 points over the baseline system (they also rely on much larger tuning sets, as required for sparse features). The baseline system is essentially the same as that of . Because our MT result is so much lower than that of the classification system, we do not expect that adding sparse and dense features will close that gap.

Results
We start by comparing performance on individual errors; then the overall performance of the best classification systems and the MT system is compared.

Classifier Performance On Individual Errors
First, we wish to assess the contribution of the minimal-supervision approach compared with training on the learner data for a language with rich morphology. To this end, two types of classifiers are compared: learner-trained (trained on learner data) and minimal-supervision (trained on native data with artificial errors based on error statistics extracted from the learner data; Section 2). The classifiers are tuned on the development partition; that is, the rates at which artificial errors are injected into the training data are optimized on the development data. Performance results on the test data are for models trained on the training+development data (learner-trained models). Similarly, the minimal supervision classifiers use error statistics extracted from training+development. Table 10 shows performance for the five types of errors. For all errors, minimal-supervision models outperform the learner-trained models substantially, by 8 to 32 F0.5 points. This is because the amount of annotation that we have is too small to estimate all parameters, but it is sufficient to provide error estimates in the minimal supervision framework. In addition, the punctuation module achieves an F0.5 score of 30.5 (precision of 47.4 and recall of 12.6).

Classifiers vs. MT
So far, we have evaluated performance of the classifiers with respect to individual errors. Table 11 shows the performance of the three systems on the entire dataset and evaluates with respect to all errors in the data. The results show that when annotation is scarce, MT performs poorly. This result is consistent with findings for English, showing that MT systems outperform classifiers only when the parallel corpus is large (30-40M words) (Rozovskaya and Roth, 2016) but lag behind even when over 1M tokens are available.

Table 11: Performance of the three systems on the entire dataset, evaluated with respect to all errors.

System                      Training data    P     R     F 0.5
Classifiers (learner)       Learner          22.6  4.8   12.9
Classifiers (minimal sup.)  Learner+native   38.0  7.5   21.0
MT                          Learner+native   30.6  2.9   10.6

We combine the MT system and the minimally supervised classifiers following Rozovskaya and Roth (2016). Because MT systems are not restricted to particular error types, the misuse they correct is typically more diverse (see also Section 6). For the combined system, the F 0.5 score improves from 21.0 to 23.8, owing to better recall (10.2 vs. 7.5). However, the precision drops from 38.0 to 35.8, since the MT system has a lower precision than the classifiers.
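
As a consistency check on the figures above, the following sketch recomputes F 0.5 from the reported precision and recall using the standard F-beta definition with beta = 0.5, which weights precision more heavily than recall. The helper name f_beta is ours. Because the inputs are the rounded P and R values, the recomputed scores can differ from the reported ones in the last digit.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# (precision, recall) as reported in this section, in percentage points.
systems = {
    "Classifiers (learner)": (22.6, 4.8),
    "Classifiers (minimal sup.)": (38.0, 7.5),
    "MT": (30.6, 2.9),
    "Combined": (35.8, 10.2),
}

for name, (p, r) in systems.items():
    print(f"{name}: F0.5 = {f_beta(p, r):.1f}")
```

The recomputed values agree with the reported scores to within about 0.1 point, and they make the trade-off in the combined system visible: the gain in recall outweighs the loss in precision even under a precision-weighted metric.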

Discussion and Error Analysis
The current state of the art in English grammar correction on the widely used CoNLL benchmark test set is 50.27 for a single system. System combination, model ensembles, and adding a spell checker boost these numbers by 4 to 6 points (Chollampatt and Ng, 2018). These models are trained on the CoNLL training data and additional learner data (about 30M words). An MT system trained on CoNLL data (1.2M words) obtains an F 0.5 score of 28.25 (Rozovskaya and Roth, 2016). Although these MT systems differ in how they are trained, these numbers should give an idea of the effect that the amount of parallel data has on performance.
A minimal-supervision classification system that uses CoNLL data obtains an F 0.5 score of 36.26 (Rozovskaya and Roth, 2016). In contrast, the classification system for Russian obtains a much lower score of 21.0. This may be due to a larger variety of grammatical phenomena in Russian, lower error rates, and a high proportion of spelling errors (especially among heritage speakers), which we currently do not specifically target. Note also that the CoNLL-2014 results are based on two gold references for each sentence, while we evaluate with respect to one, and having more reference annotations improves performance (Bryant and Ng, 2015; Sakaguchi et al., 2016; Choshen and Abend, 2018b). [7] It should also be noted that the gap between the MT system and the classification system when both are trained with limited supervision is larger for Russian (10.6 vs. 21.0) than for English (28.25 vs. 36.26). This indicates that the MT system suffers more than the classifiers when the amount of supervision is particularly small and the morphological complexity of the language is higher.
Considering Arabic and Chinese, where the training data is also limited, the results are also much lower than in English. In Arabic, where the supervised learner data includes 43K words, the best reported F-score is 27.32 (Rozovskaya et al., 2015). [8] In Chinese, the supervised dataset size is about 50K sentences, and the highest reported scores are 26.93 for detection (Rao et al., 2017) and 17.23 for error correction (Rao et al., 2018). These results confirm that the approaches that rely on large amounts of supervision do not carry over to low-resource settings. It is thus desirable to develop approaches that can be robust with a small amount of supervision, especially when applied to languages that are morphologically more complex than English.

[7] There is ongoing research on the question of the most appropriate evaluation metric and gold references for grammatical error correction. See Sakaguchi et al. (2016), Choshen and Abend (2018b), and Choshen and Abend (2018c).
[8] This result is based on performance that does not take into account some trivial Arabic-specific normalization corrections.

Error Analysis
To understand the challenges of grammar correction in a morphologically rich language such as Russian, we perform error analysis of the MT system and the classification system that uses minimal supervision. The nature of grammar correction is such that multiple different corrections are often acceptable. Furthermore, annotators often disagree on what constitutes a mistake, and some gold errors missed by a system may be considered acceptable usage by another rater. Thus, when a system is compared against the gold truth produced by just one annotator, performance is understated. In fact, the F-score of a system increases with the number of per-sentence annotations (Bryant and Ng, 2015).

Classifiers: False Positives
We start by analyzing the cases where the system flagged an error that was not marked in the gold annotation. False positive cases were manually annotated by one of the annotators and acceptable predictions were identified. As expected, because of the variability in the annotators' judgments and the possibility of multiple acceptable options, there are false positives that actually should be true positives. We re-evaluate the performance of the classifiers based on the error analysis in Table 12. For all error types, except gender agreement (which has a high precision of 67.9%), precision improvements range between 4 points and 16 points. The highest improvement is observed for preposition errors: about 48% of false positives are in fact acceptable suggestions. This improvement mirrors the results in English (precision improves from 30% to 70% [Rozovskaya et al., 2017]) and can be explained by the fact that preposition usage is highly variable (i.e., many contexts license multiple prepositions [Tetreault and Chodorow, 2008]).

В этих местах мало *перспектива/перспектив
In these places few *prospects[pl,nom]/prospects[pl,gen]
'There are few prospects in these places'
Example 1: Case error on a noun following the adverbial "few".

Он обеспечивает *клиентов/клиентам доступ к информации
It supplies *clients[pl,gen]/clients[pl,dat] access[sg,acc] towards information[sg,gen]
'It provides clients with access to information'
Example 2: Case error on a noun governed by the verb "provides".

Она готова была *давать/дать мне все, что нужно
She ready was *give[inf,imperfect]/give[inf,perfect] to me everything, that necessary
'She was ready to give me everything that is necessary'
Example 4: Aspect error on a verb that requires wider context beyond sentence.

Classifiers: Errors Missed by the System
Although the precision of the classifiers is generally quite good, the recall is much lower, ranging from 36.1% and 24.9% for noun case and preposition errors, respectively, to 16% for agreement errors and 9.1% for verb aspect errors.
Among the languages studied in the grammar correction research, noun case errors are unique to Russian. [9] But because the appropriate case choice depends on the word governing the noun, one can view case declension as similar to subject-verb agreement. However, case errors are arguably more challenging because the target noun may be governed by a verb, a preposition, another noun, or even by an adverbial; thus, there is a higher level of ambiguity both when identifying the dependency and when determining the appropriate case. A morphologically rich language such as Russian uses case to express relations that are commonly conveyed by prepositions in English; as a result, verbs that are followed by a direct object and a prepositional object in English appear with two noun phrases, whose relationship to the verb is expressed through appropriate cases. Examples (1) and (2) illustrate two case errors: in the first, the noun is governed by an adverbial, and in the second, by a verb. An additional challenge is that prepositions and verbs can also license multiple cases. For example, the prepositions на and в can denote location when followed by a noun in the locative case, as well as direction when followed by a noun in the accusative case.
Analysis of the missed verb agreement errors reveals several challenges; some of these are specific to morphologically rich languages. The main challenge here is identifying the subject of the target verb. Thus, errors on verbs that are located far from the subject head are typically not handled well in either Russian or English; in the Russian corpus, these account for 20% of all missed errors. Because the system currently does not use a parser, we anticipate that adding a parser will improve performance. However, because of Russian's free word order, there are more options for the location of the subject. It is also not uncommon for a subject to be placed after the verb, and 19% of errors that are currently missed occur when the subject is located after the verb. Finally, about 6% of missed errors occur on verbs that have no explicit subject, as in Example (3). In such cases the verb takes the form of third person singular masculine or third person plural.
Compared with other errors, aspect errors exhibit the lowest performance. Choosing the appropriate aspect form may require understanding the context around the verb, often beyond the sentence boundaries. Example (4) illustrates an error where, without looking at the wider context, both perfective and imperfective forms are possible. Some verb aspect errors are similar to verb tense errors in English. Studies in English also reported poor performance: a precision of 20% at a recall of about 20% on verb aspect errors (Tajiri et al., 2012). Our expectation is that with richer context representation, such as identification of temporal relations, one can do better. Some verbs are also ambiguous with respect to aspect; for example, проводить can be translated as "carry out" (imperfective) and "accompany" (perfective).

The MT System
Because the output of the MT system does not specify the correction type, our annotator manually analyzed the true positives of the system and classified these for type. The most common true positive corrections of the MT system fall into the following categories: spelling (40%), missing comma (36%), noun:case (13%), and lexical (7%).
We also analyze the false positives. About 15% of the false positives are in fact true positives. As a result, the precision and the F-score of the MT system improve from 30.6 to 41.0 and from 10.6 to 11.4, respectively. Even though the current MT system performs poorly, the analysis supports the findings in English that MT systems correct a more diverse set of errors and, if trained with sufficient supervision, should complement a classification system well.
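
The precision adjustment reported above follows from simple arithmetic, sketched below. The sketch assumes that the number of edits proposed by the system is unchanged and that a share of the false positives is simply relabeled as true positives; in that case the adjusted precision is P + a(1 - P), where a is the share of false positives judged acceptable. The function name adjusted_precision is ours, and 0.15 is the approximate share stated in the text.

```python
def adjusted_precision(precision, acceptable_fp_share):
    """Precision after relabeling a share of false positives as true positives.

    With P = TP / (TP + FP) and a share `a` of FP relabeled,
    P' = (TP + a * FP) / (TP + FP) = P + a * (1 - P).
    Both arguments are fractions in [0, 1].
    """
    return precision + acceptable_fp_share * (1.0 - precision)

# MT system: reported precision of 30.6%, with about 15% of false positives acceptable.
mt_adjusted = adjusted_precision(0.306, 0.15)
print(f"Adjusted MT precision: {100 * mt_adjusted:.1f}%")
```

Under this assumption the result is consistent with the reported increase in precision to 41.0. The corresponding change in F-score is not reproduced here, since it also depends on how recall is affected by the re-annotation.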

Conclusion
We address the task of correcting writing mistakes in Russian, a morphologically rich language. We correct and error-tag a corpus of Russian learner data. The release of this corpus should facilitate research efforts in grammar correction for languages other than English that do not have many resources available to them. Experiments on that corpus demonstrate that the MT approach performs poorly due to lack of annotated data. The MT system is outperformed substantially by a minimally supervised machine learning classification approach.