Multilingual Projection for Parsing Truly Low-Resource Languages

We propose a novel approach to cross-lingual part-of-speech tagging and dependency parsing for truly low-resource languages. Our annotation projection-based approach yields tagging and parsing models for over 100 languages. All that is needed are freely available parallel texts, and taggers and parsers for resource-rich languages. The empirical evaluation across 30 test languages shows that our method consistently provides top-level accuracies, close to established upper bounds, and outperforms several competitive baselines.


Introduction
State-of-the-art approaches to inducing part-of-speech (POS) taggers and dependency parsers only scale to a small fraction of the world's ∼6,900 languages. The major bottleneck is the lack of manually annotated resources for the vast majority of these languages, including languages spoken by millions, such as Marathi (73m), Hausa (50m), and Kurdish (30m). Cross-lingual transfer learning, or simply cross-lingual learning, refers to work on using annotated resources in other (source) languages to induce models for such low-resource (target) languages. Even simple cross-lingual learning techniques outperform unsupervised grammar induction by a large margin.
Most work in cross-lingual learning, however, makes assumptions about the availability of linguistic resources that do not hold for the majority of low-resource languages. The best cross-lingual dependency parsing results reported to date were presented by Rasooli and Collins (2015). They use the intersection of languages covered in the Google dependency treebanks project and those contained in the Europarl corpus. Consequently, they only consider closely related Indo-European languages for which high-quality tokenization can be obtained with simple heuristics.
In other words, we argue that recent approaches to cross-lingual POS tagging and dependency parsing are biased toward Indo-European languages, in particular the Germanic and Romance families. The bias is not hard to explain: treebanks, as well as large volumes of parallel data, are readily available for many Germanic and Romance languages. Several factors make cross-lingual learning between these languages easier: (i) We have large volumes of relatively representative, translated texts available for all language pairs; (ii) It is relatively easy to segment and tokenize Germanic and Romance texts; (iii) These languages all have very similar word order, making the alignments much more reliable. Therefore, it is more straightforward to train and evaluate cross-lingual transfer models for these languages.
However, this bias means that we possibly overestimate the potential of cross-lingual learning for truly low-resource languages, i.e., languages with no supporting tools or resources for segmentation, POS tagging, or dependency parsing.
The aim of this work is to experiment with cross-lingual learning via annotation projection, making minimal assumptions about the available linguistic resources. We only want to assume what we can in fact assume for truly low-resource languages. Thus, for the target languages, we do not assume the availability of any labeled data, tag dictionaries, typological information, etc. For annotation projection, we need a parallel corpus, and we therefore have to rely on resources such as the Bible (parts of which are available in 1,646 languages), and publications from the Watchtower Society (up to 583 languages). These texts have the advantage of being translated both conservatively and into hundreds of languages (massively multi-parallel). However, the Bible and the Watchtower are religious texts and are more biased than the corpora that have been assumed to be available in most previous work.
In order to induce high-quality cross-lingual transfer models from noisy and very limited data, we exploit the fact that the available resources are massively multi-parallel. We also present a novel multilingual approach to the projection of dependency structures, projecting edge weights (rather than edges) via word alignments from multiple sources (rather than a single source). Our approach enables us to project more information than previous approaches: (i) by postponing dependency tree decoding to after the projection, and (ii) by exploiting multiple information sources.
Our contributions are as follows: (i) We present the first results on cross-lingual learning of POS taggers and dependency parsers, assuming only linguistic resources that are available for most of the world's written languages, specifically, Bible excerpts and translations of the Watchtower.
(ii) We extend annotation projection of syntactic dependencies across parallel text to the multi-source scenario, introducing a new, heuristics-free projection algorithm that projects weight matrices from multiple sources, rather than dependency trees or individual dependencies from a single source.
(iii) We show that our approach performs significantly better than commonly used heuristics for annotation projection, as well as delexicalized transfer baselines. Moreover, in comparison to these systems, our approach performs particularly well on truly low-resource non-Indo-European languages.
All code and data are made freely available for general use.


Weighted annotation projection

Motivation Our approach is based on the general idea of annotation projection (Yarowsky et al., 2001) using parallel sentences. The goal is to augment an unannotated target sentence with syntactic annotations projected from one or more source sentences through word alignments. The principle is illustrated in Figure 1, where the source languages are German and Croatian, and the target is English.
The simplest case is projecting POS labels, which are observed in the source sentences but unknown in the target language. In order to induce the grammatical category of the target word beginning, we project POS from the aligned words Anfang and početku, both of which are correctly annotated as NOUN. Projected POS labels from several sources might disagree for various reasons, e.g., erroneous source annotations, incorrect word alignments, or legitimate differences in POS between translation equivalents. We resolve such cases by taking a majority vote, weighted by the alignment confidences. By letting several languages vote on the correct tag of each word, our projections become more robust, less sensitive to the noise in our source-side predictions and word alignments.
We can also project syntactic dependencies across word alignments. If (u_s, v_s) is a dependency edge in a source sentence, say the ingoing dependency from das to Wort, u_s (Wort) is aligned to u_t (word), and v_s (das) is aligned to v_t (the), we can project the dependency such that (u_t, v_t) becomes a dependency edge in the target sentence, making the a dependent of word. Obviously, dependency annotation projection is more challenging than projecting POS, as there is a structural constraint: the projected edges must form a dependency tree on the target side. Hwa et al. (2005) were the first to consider this problem, applying heuristics to ensure well-formed trees on the target side. The heuristics were not perfect, as they have been shown to result in excessive non-projectivity and the introduction of spurious relations and tokens (Tiedemann, 2014). These design choices all lead to diminished parsing quality.

Figure 1: An outline of dependency annotation projection, voting, and decoding in our method, using two sources i (German) and j (Croatian) and a target t (English). Part 1 represents the multi-parallel corpus preprocessing, while parts 2 and 3 relate to our projection method. The graphs are represented as adjacency matrices with column indices encoding dependency heads. We highlight how the weight of target edge (u_t = was, v_t = beginning) is computed from the two contributing sources.
We introduce a heuristics-free projection algorithm. The key difference from most previous work is that we project the whole set of potential syntactic relations with associated weights-rather than binary dependency edges-from a large number of multiple sources. Instead of decoding the best tree on the source side-or for a single source-target sentence pair-we project weights prior to decoding, only decoding the aggregated multi-source weight matrix after the individual projections are done. This means that we do not lose potentially relevant information, but rather project dense information about all candidate edges.

Multi-source sentence graph
We assume the existence of n source languages and a target language t. For each tuple of translations in our multi-parallel corpus, our algorithm projects syntactic annotations from the n source sentences to the target sentence.
Projection happens at the sentence-level, taking a tuple of n annotated sentences and an unannotated sentence as input. We formalize the projection step as label propagation in a graph structure where the words of the target and source sentences are vertices, while edges represent dependency edge candidates between words within a sentence (a parse), as well as similarity relations between words of sentences in different languages (word alignments).
Formally, a projection graph is a graph G = (V, E). All edges are weighted by the function w_e : E → R. The vertices can be decomposed into sets V = V_0 ∪ · · · ∪ V_n, where V_i is the set of words in sentence i. We often need to identify the target sentence V_t = V_0 and the source sentences V_s = V_1 ∪ · · · ∪ V_n separately. Edges between V_s and V_t are the result of word alignments. The alignment subgraph is the bipartite graph A = (V_s, V_t, E_A), i.e., the subgraph of G induced by all (alignment) edges, E_A, connecting V_s and V_t.
The subgraph induced by the set of vertices V_i, written as G[V_i], represents the dependency edge candidates between the words of sentence i. In general these subgraphs are dense, i.e., they encode weight matrices of edge scores and not just the single best parse. For the source sentences, we assume that the weights are provided by a parser, while the weights for the syntactic relations of the target sentence are unknown.
With the above definitions, the dependency projection problem amounts to assigning weights to the edges of G[V_t] by transferring the syntactic parse graphs G[V_1], . . . , G[V_n] from the source languages through the alignments A.

Part-of-speech projection
Our annotation projection for POS tagging is similar to the one proposed by Agić et al. (2015), and is presented in Algorithm 1. We first introduce a conditional probability distribution p(l|v) over POS tags l ∈ L for each vertex v in the graph. For all source vertices, the probability distributions are obtained by tagging the corresponding sentences in our multilingual corpus with POS taggers, assigning a probability of one to the best tag for each word, and zero to all other tags. For each target token, i.e., each vertex v_t, the projection gathers evidence for each tag from all source tokens aligned to v_t, weighted by the alignment score:

p(l|v_t) ∝ Σ_{(v_s, v_t) ∈ E_A} w_a(v_s, v_t) · p(l|v_s)

The projected tag for a target vertex v_t is then arg max_l p(l|v_t). When both the alignment weights and the source tag probabilities are in {0, 1}, this reduces to a simple voting scheme that assigns the most frequent POS tag among the aligned words to each target word.

Algorithm 1: Project POS tags. Data: a projection graph G = (V_s ∪ V_t, E); a set of POS labels L; a function p(l|v) assigning probabilities to labels l for word vertices v.
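As an illustration, the aggregation for a single target token could look as follows in Python; this is a minimal sketch of the idea rather than our implementation, and the token ids, tags, and alignment scores are invented:

```python
from collections import defaultdict

def project_pos(alignments, source_tags):
    """Aggregate alignment-weighted tag evidence for one target token.

    `alignments` maps a source token id to its alignment score w_a with the
    target token; `source_tags` maps a source token id to its tag
    distribution p(l | v_s), here a dict from tag to probability.
    """
    evidence = defaultdict(float)
    for v_s, w_a in alignments.items():
        for tag, p in source_tags[v_s].items():
            evidence[tag] += w_a * p
    return max(evidence, key=evidence.get)

# Two sources tag their aligned words NOUN with certainty, one says VERB.
tags = {1: {"NOUN": 1.0}, 2: {"NOUN": 1.0}, 3: {"VERB": 1.0}}
print(project_pos({1: 0.9, 2: 0.7, 3: 0.8}, tags))  # NOUN
```

With alignment weights and tag probabilities restricted to {0, 1}, the same function degenerates to counting votes among aligned words.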

Dependency projection
While in POS projection, we project vertex labels, in dependency projection we project edge scores. Our procedure for dependency annotation projection is given in Algorithm 2. For each source language, we parse the corresponding side of our multi-parallel corpus using a dependency parser trained on the source language treebank. However, instead of decoding to dependency trees, we extract the weights for all potential syntactic relations, importing them into G as edge weights.
The parser we use in our experiments assigns scores w e ∈ R to possible edges. Since the ranges and values of these scores are dependent on the training set size and the number of model updates, we standardize the scores to make them comparable across languages. Standardization centers the scores around zero with a standard deviation of one by subtracting the mean and dividing by the standard deviation. We apply this normalization per sentence.
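Per-sentence standardization of this kind can be sketched as follows (a minimal illustration; the function name and the guard for constant scores are ours):

```python
import statistics

def standardize(scores):
    """Per-sentence z-scoring of edge scores: subtract the mean and divide
    by the standard deviation, yielding zero mean and unit deviation."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    # If all scores are identical there is nothing to rescale.
    return [(s - mean) / sd for s in scores] if sd else [0.0] * len(scores)
```

After this step, edge scores from parsers trained on differently sized treebanks live on a comparable scale.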
Scores are then projected from source edges to target edges via word alignments w a ∈ [0, 1]. Instead of voting among the incoming projections from multiple sources, we sum the projected edge scores. Because alignments vary in quality, we scale the score of the projected source edge by the corresponding alignment probability.
A target edge (u_t, v_t) ∈ G[V_t] can originate from multiple source edges even within a single source sentence, due to m : n alignments. In such cases, we only project the source edge with the maximal resulting score. In the case of a single source sentence pair, the target edge scores are set as follows:

w_e(u_t, v_t) = w_a(u_s, u_t) · w_a(v_s, v_t) · w_e(u_s, v_s)

We note the distinction between edge weights w_e and alignment weights w_a. With multiple sources, the target edge scores w_e(u_t, v_t) are computed as a sum over the contributions of the individual sources i:

w_e(u_t, v_t) = Σ_i w_e^(i)(u_t, v_t)

After projection we have a dense set of weighted edges in the target sentence representing possible syntactic relations. This structure is equivalent to the n × n edge matrix used in ordinary first-order graph-based dependency parsing.
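The projection and summation steps can be sketched as follows; this is our reading of the description, with an invented data layout, and the max-per-source handling of m:n alignments is a simplification:

```python
def project_edge_scores(n_target, sources):
    """Sum alignment-scaled source edge scores into a target weight matrix.

    `sources` is a list of (W, align) pairs: W[u][v] is a standardized source
    edge score for head u -> dependent v, and align[i] is the (target index,
    probability) alignment of source token i, or None if unaligned.  When
    m:n alignments map several source edges onto one target edge, only the
    strongest contribution per source is kept.
    """
    total = [[0.0] * n_target for _ in range(n_target)]
    for W, align in sources:
        contrib = {}
        for u_s, row in enumerate(W):
            for v_s, w in enumerate(row):
                if u_s == v_s or align[u_s] is None or align[v_s] is None:
                    continue
                (u_t, p_u), (v_t, p_v) = align[u_s], align[v_s]
                score = p_u * p_v * w  # scale by alignment probabilities
                key = (u_t, v_t)
                if key not in contrib or score > contrib[key]:
                    contrib[key] = score
        for (u_t, v_t), score in contrib.items():
            total[u_t][v_t] += score  # sum over sources
    return total
```

The result is a dense target-side weight matrix, ready for normalization and decoding.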
Before decoding, the weights are softmax-normalized to form a distribution over each possible head decision. The normalization balances out the contributions of the individual head decisions; in our development setup, we found that omitting this step resulted in a substantial (∼10%) decrease in parsing performance.
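The normalization can be sketched like this, applied to each dependent's vector of candidate-head scores (a minimal, numerically stable version):

```python
import math

def softmax_heads(scores):
    """Turn one dependent's candidate-head scores into a probability
    distribution over head choices (numerically stable softmax)."""
    m = max(scores)  # subtract the max before exponentiating
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```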
We then follow McDonald et al. (2005) in using directed maximum spanning tree (DMST) decoding to identify the best dependency tree in the matrix. We note that DMST decoding on summed projected weight matrices is similar to the idea of re-parsing with DMST decoding of the output on an ensemble of parsers (Sagae and Lavie, 2006), which we use as a baseline in our experiments.
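For illustration, DMST decoding on a tiny weight matrix can be emulated by brute-force search over head assignments; production decoders use the Chu-Liu/Edmonds algorithm instead, and the matrix below is invented:

```python
from itertools import product

def dmst_bruteforce(W):
    """Exhaustive maximum spanning arborescence for tiny sentences.

    W[h][d] scores head h -> dependent d; index 0 is the artificial ROOT.
    Returns the head of each real token (1-based tokens, 0 = ROOT).
    """
    n = len(W)  # includes ROOT at index 0
    best, best_heads = float("-inf"), None
    for heads in product(range(n), repeat=n - 1):
        # keep only assignments where every token reaches ROOT (a tree)
        if not all(_reaches_root(heads, d) for d in range(1, n)):
            continue
        score = sum(W[h][d + 1] for d, h in enumerate(heads))
        if score > best:
            best, best_heads = score, heads
    return list(best_heads)

def _reaches_root(heads, d):
    seen = set()
    while d != 0:
        if d in seen:  # cycle
            return False
        seen.add(d)
        d = heads[d - 1]
    return True
```

Running it on a 2-token sentence with ROOT, `dmst_bruteforce([[0, 5, 1], [0, 0, 4], [0, 2, 0]])` returns `[0, 1]`: token 1 attaches to ROOT and token 2 to token 1.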

Training and test sets
We use source treebanks from the Universal Dependencies (UD) project, version 1.2 (Nivre et al., 2015). They are harmonized in terms of POS tag inventory (17 tags) and dependency annotation scheme. In our experiments, we use the canonical data splits, and disregard lemmas, morphological features, and alternative POS tags from all treebanks.
Out of the 33 languages currently in UD 1.2, we drop languages for which the treebank does not distribute word forms (Japanese), and languages for which we have no parallel unlabeled data (Latin, Ancient Greek, Old Church Slavonic, Irish, Gothic). Languages with more than 60k tokens in the training data are considered source languages; the remaining 6 smaller treebanks (Estonian, Greek, Hungarian, Latin, Romanian, Tamil) are strictly considered targets. This results in 22 treebanks for training source taggers and parsers. We use two additional test sets: Quechua and Serbian. The former does not entirely adhere to UD, but we provide a POS tagset mapping and a few modifications, and include it as a test language to deepen the robustness assessment of our approach across language families. The Serbian test set fully conforms to UD, as a fork of the closely related Croatian UD dataset. This results in a total of 30 target languages.

Multi-parallel corpora
We use two sources of massively parallel text. The first is the Edinburgh Bible Corpus (EBC) collected by Christodouloupoulos and Steedman (2014), containing 100 languages. EBC has either 30k or 10k sentences per language, depending on whether it holds a full Bible or just a translation of the New Testament. We also crawled and scraped the Watchtower Online Library website to collect what we will refer to as the Watchtower Corpus (WTC). The data is from 2002-2016 and the final corpus contains 135 languages, with sentence counts in the range of 26k-145k. While some EBC Bibles are written in dated language, we do not make any modifications to the corpus if the language is also present in WTC. However, as Basque is not represented in WTC, we replace the Basque Bible from 1571 with a contemporary version from 2004, to enable the use of Basque in the parsing experiments. EBC and WTC both consist of religious texts, but they are very different in terms of style and content. Table 1, which shows the most frequent words per corpus, reveals that the English Bible, the King James Version from 1611, contains many archaic verb forms ("hath", "giveth"). In contrast, the English Watchtower is written in contemporary English, both in terms of verb inflection ("does", "says") and vocabulary ("today", "human"). WTC also deals with contemporary topics such as blood "transfusion" (36 mentions) and "computer" (42 mentions).
The other languages also show differences in terms of language modernity and dialect between EBC and WTC. While each Bible translation has its individual history, Watchtower translations are commissioned by the same publisher, following established editorial criteria. Thus, we not only expect Watchtower to yield projected treebanks that are closer to contemporary language, but also more reliable alignments. We expect these properties to make WTC a more suitable parallel corpus for our experiments and for bootstrapping treebanks for new languages.

Table 1: Most frequent words per corpus (English).
EBC: hath, saith, hast, spake, yea, cometh, iniquity, wilt, smote, shew, begat, doth, lo, hearken, thence, verily, neighbour, goeth, shewed, giveth, smite, didst, wherewith, knoweth, night
WTC: bible, does, however, says, today, during, show, human, later, important, really, humans, meetings, personal, states, future, fact, relationship, result, attention, someone, century, attitude, article, different

Preprocessing
Segmentation For the multi-parallel corpora, we apply naive sentence splitting using the full stops, question marks, and exclamation points of the alphabets in our corpora. We collected these trigger symbols from the corpora, provided that they appeared as individual tokens at the ends of lines and belonged to the "Punctuation, Other" Unicode category. After sentence splitting, we use naive whitespace tokenization (https://github.com/bplank/multilingualtokenizer). We also remove short-vowel diacritics from all corpora written in Arabic script.
We use the same sentence splitting and tokenization for EBC and WTC, regardless of Bibles being distributed in a verse-per-line format, which means verses can be split into more than one sentence. The average sentence length across languages is 18.5 tokens in EBC and 16.7 in WTC.
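The trigger collection and naive splitting described above can be sketched as follows (a simplification that assumes punctuation already appears as separate tokens, as in the verse-per-line corpora):

```python
import unicodedata

def collect_triggers(lines):
    """Gather sentence-final punctuation: single-character tokens in the
    Unicode 'Po' (Punctuation, other) category that end a line."""
    triggers = set()
    for line in lines:
        toks = line.split()
        if toks and len(toks[-1]) == 1 and unicodedata.category(toks[-1]) == "Po":
            triggers.add(toks[-1])
    return triggers

def split_sentences(tokens, triggers):
    """Naive splitting: a sentence ends at every trigger token."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in triggers:
            sentences.append(current)
            current = []
    if current:  # trailing material without a final trigger
        sentences.append(current)
    return sentences
```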
The UD treebank tokenization differs from the tokenization used for the multi-parallel corpora. The UD dependency annotation is based on syntactic words, and the tokenization guidelines recommend, for example, splitting clitics from verbs and undoing contractions (Spanish "del" becomes "de el"). These tokens made up of several syntactic words are called multiword tokens in the UD convention, and are included in the treebanks but are not integrated in the dependency trees, i.e., only their forming subtokens are assigned a syntactic head. In order to harmonize the tokenization, we eliminate subtokens from the dependency trees, and incorporate the original multiword tokens, which are more likely to match naive raw tokens, in the trees instead. Each multiword token receives the POS and dependency label of its highest subtoken, namely the subtoken closest to the root. For example, in the case of a verb and its clitics, the chosen subtoken is the verb, and the multiword token is interpreted as a verb. If there are several candidates, we select one through POS ranking.

Alignment We sentence- and word-align all language pairs in both our multi-parallel corpora. We use hunalign (Varga et al., 2005) to perform conservative sentence alignment. The selected sentence pairs then enter word alignment, for which we use two different aligners. The first is the IBM2-based fastalign by Dyer et al. (2013), where we adopt the setup of Agić et al. (2015), who observe a major advantage in using reverse-mode alignment for POS projection (4-5 accuracy points absolute). In addition, we use the IBM1-based aligner efmaral by Östling (2015). The intuition behind using IBM1 is that IBM2 introduces a bias toward more closely related languages, and we confirm this intuition through our experiments. We modify both aligners so that they output the alignment probability for each aligned token pair.
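The multiword token collapsing described above can be sketched as follows; the depth-based selection and the tie-breaking POS ranking are our reading of the description, and the ranking order itself is invented:

```python
# Hypothetical POS preference used to break ties among equally deep subtokens.
POS_RANK = ["VERB", "NOUN", "PROPN", "ADJ", "ADV", "PRON", "ADP", "DET"]

def collapse_multiword(mwt_form, subtokens):
    """Replace a multiword token's subtokens by one node that inherits the
    POS and head of the subtoken closest to the root.

    `subtokens` is a list of dicts with keys form, pos, head, depth, where
    depth is the distance to the sentence root in the dependency tree.
    """
    def key(t):
        rank = POS_RANK.index(t["pos"]) if t["pos"] in POS_RANK else len(POS_RANK)
        return (t["depth"], rank)
    best = min(subtokens, key=key)
    return {"form": mwt_form, "pos": best["pos"], "head": best["head"]}
```

For Spanish "del" = "de" (ADP) + "el" (DET), "de" sits higher in the tree, so the collapsed token keeps the ADP tag and the head of "de".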
Tagging and parsing The source sides of the two multi-parallel corpora, EBC and WTC, are POS-tagged by taggers trained on the respective source languages, using TnT (Brants, 2000). We parse the corpora using TurboParser (Martins et al., 2013). The parser is used in simple arc-factored mode with pruning.

Experiments
Outline For each sentence in a target language corpus, we retrieve the aligned sentences in the source corpora. Then, for each of these source-target sentence pairs, we project POS tags and dependency edge scores via word alignments, aggregating the contributions of the individual sources. Once all contributions are collected, we perform a per-token majority vote on POS tags and DMST decoding on the summed edge scores. This results in a POS-tagged and dependency-parsed target sentence, ready for training a tagger and a parser. We remove target language sentences that contain word tokens without POS labels, which may happen due to unaligned sentences and words. We then proceed to train models.

Setup
Each of the experiment steps involves a number of choices that we outline in this section. We also describe the baseline systems and upper bounds.
POS tagging Below, we present results with POS taggers based on annotation projection with both IBM1 and IBM2; cf. Table 3. We train TnT with default settings on the projected annotations. Note that we use the resulting POS taggers in our dependency parsing experiments, so that our parsers do not assume the existence of POS-annotated corpora.
For a more extensive assessment, we refer to the work of Agić et al. (2015), who report baselines and upper bounds. In contrast to their work, we consider two different alignment models and use the UD POS tagset (17 tags) rather than the 12 tags of Petrov et al. (2012). This makes our POS tagging problem slightly more challenging, but our parsing models potentially benefit from the extended tagset.

Dependency parsing We use arc-factored TurboParser for all parsing models, applying the same setup as in preprocessing. There are three sets of models: our systems, baselines, and upper bounds.
Our systems are trained on the projected EBC and WTC texts, while the rest, except DCA-PROJ (see below), are trained on the (delexicalized) source-language treebanks.
To avoid a bias toward languages with big treebanks, and to make our experiments tractable, we randomly subsample all training sets to a maximum of 20k sentences. In the multi-source systems, this means a uniform sample from all sources up to 20k sentences. This keeps our comparison fair: our systems do not have the advantage of more training data over the baselines.
Our systems We report on four different cross-lingual systems, alternating the use of word aligners (IBM1, IBM2) and the structures we project, which are either (i) arc-factored weight matrices from the parser (GRAPHS) or (ii) the single-best trees provided by the parser after decoding (TREES). See the if-clause in Algorithm 2.
We tune two parameters for these four systems, confidence estimation and normalization, using English as the development set, and we report only the best setups. For the IBM1-based systems, we use the word alignment probabilities in the arc projection, but unit votes in POS voting. The opposite yields the best IBM2 scores: binarizing the alignment scores in dependency projection, while weight-voting the POS tags. We also evaluated a number of different normalization techniques in projection, only to arrive at standardization and softmax as by far the best choices.
Baselines and upper bounds We compare our systems to three competitive baselines, as well as three informed upper bounds or oracles. First, we list our baselines.

DELEX-MS: This is the multi-source direct delexicalized parser transfer baseline of McDonald et al. (2011).

DCA-PROJ: This is the direct correspondence assumption (DCA)-based approach to projection, i.e., the de facto standard for projecting dependencies. First introduced by Hwa et al. (2005), it was recently elucidated by Tiedemann (2014), whose implementation we follow here. In contrast to our approach, DCA projects trees on a source-target sentence pair basis, relying on heuristics and spurious nodes or edges to maintain the tree structure. In our setup, we plug DCA into our projection-voting pipeline in place of our own method.

REPARSE: For this baseline, we parse a target sentence using multiple single-source delexicalized parsers. We then collect the output trees in a graph, unit-vote the individual edge weights, and finally use DMST to compute the best dependency tree (Sagae and Lavie, 2006).

Next, we explain the three upper bounds:

DELEX-SB: This result uses the best single-source delexicalized system for a given target language, following McDonald et al. (2013). We parse a target with multiple single-source delexicalized parsers, and select the best-performing one.

SELF-TRAIN: For this result we parse the target-language EBC and WTC data, train parsers on the output predictions, and evaluate the resulting parsers on the evaluation data. Note that this result is available only for the source languages. Also, while we refer to this as self-training, we do not concatenate the EBC/WTC training data with the source treebank data. This upper bound indicates how useful the parallel corpus texts are.
FULL: Direct in-language supervision, only available for the source languages. We train parsers on the source treebanks, and use them to parse the source test sets.
Evaluation All our datasets (projected, training, and test sets) contain only the following CoNLL-X fields: ID, FORM, CPOSTAG, and HEAD (http://ilk.uvt.nl/conll/#dataformat). For simplicity, we do not predict dependency labels (DEPREL), and we only report unlabeled attachment scores (UAS). The POS taggers are evaluated for accuracy. We use our IBM1 taggers for all the baselines and upper bounds.

Results
Our average results are presented in Figure 2, broken down by language family, and by the languages for which we had training data (Sources) versus those for which we only had test data (Targets). We see that our systems are substantially better than multi-source delexicalized transfer, DCA, and reparsing based on delexicalized transfer models. Focusing on our system results, we see that projection with IBM1 leads to better models than projection with IBM2. We also note that our improvements are biggest for non-Indo-European languages. Our IBM1-based parsers top the ones using IBM2 alignment by 6 points UAS on Indo-European languages, while the difference amounts to almost 10 points UAS on non-Indo-European languages (cf. Table 2). This difference in scores exposes a systematic bias toward more closely related languages in work using even more advanced word alignment (Tiedemann and Agić, 2016).
The detailed results using the Watchtower Corpus are listed in Table 3, where we also list the POS tagging accuracies. Note that these are not directly comparable to Agić et al. (2015), since they use a more coarse-grained tagset, and the results listed here are using WTC. We list the detailed results with the Bible Corpus online. 17 The tendencies are the same, but the results are slightly lower almost consistently across the board.
Finally, we observe that our results are also better than those that can be obtained using a predictive model to select the best source language for delexicalized transfer (Rosa and Žabokrtský, 2015), and better than what can be obtained using an oracle (DELEX-SB) to select the source language. The direct supervision (FULL) upper bound unsurprisingly records the highest scores in the experiment, as it uses biased in-language and in-domain training data. We also experiment with learning curves for direct supervision, with the goal of establishing the number of manually annotated sentences needed to beat our cross-lingual systems. We find that for most languages this number falls within the range of 100-400 in-domain sentences.

Discussion
Function words In UD, a subset of function words (tags: ADP, AUX, CONJ, SCONJ, DET, PUNCT) have to be leaves in the dependency trees, unless, e.g., they participate in multiword expressions. Our predictions show some violations of this constraint (less than 1% of all words with these POS), but this ratio is similar to the amount of violations found in the test data.
Projectivity The UD treebanks are in general largely projective: our UD test languages have an average of 89% fully projective sentences. However, with IBM1, for example, we only predict 55% of all sentences to be projective. Regardless of the differences in UAS, we observe a corpus effect in the projectivity of the predictions between EBC (65%) and WTC (55%). We attribute the higher projectivity of EBC-projected treebanks to Bible sentences being shorter.
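Projectivity can be measured with a simple crossing-arcs test, sketched here on invented toy trees:

```python
def is_projective(heads):
    """True iff no two dependency arcs cross.

    `heads[i]` is the head of token i+1 (1-based token ids, 0 = root).
    """
    # Treat each arc as an interval over token positions, root included.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    # Two arcs cross iff their intervals strictly interleave.
    return not any(a < c < b < e for (a, b) in arcs for (c, e) in arcs)
```

Counting the fraction of sentences for which this returns True gives the projectivity ratios discussed above.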
The least projective predictions are for Farsi (17%) and Hindi (19%), for which we also obtain the lowest UAS. This may be a consequence of our naive tokenization, yielding unreliable alignments. However, projectivity correlates more strongly with UAS (ρ = 0.56) than with POS prediction accuracy (ρ = 0.34).
Dependency length We observe that the average edge length with IBM1 and WTC is 2.95, while for EBC it is 2.67. The average gold edge length is 3.6, which is significantly higher at p < 0.05 (Student's t-test). However, the variance in gold edge length is about 1.2 times that of the predicted edge lengths. In other words, gold edges are often longer and more far-reaching. This difference indicates that our predictions have worse recall for longer dependencies such as subordinate clauses, while being more accurate in local, phrasal contexts.
POS errors Unlike most previous work on cross-lingual dependency parsing, and following the notable exception of McDonald et al. (2011), we rely on POS predictions from cross-lingual transfer models. One may hypothesize that there is significant error propagation from erroneous POS projection. We observe, however, that about 40% of wrong POS predictions are nevertheless assigned the correct syntactic head. We argue that the fairly uniform noise on the POS labels helps the parsers regularize over the POS-dependency relations.
Possible improvements We treat POS and syntactic dependencies as two separate annotation layers and project them independently in our approach. Moreover, we project edge scores for dependencies, in contrast to only the single-best source POS tags. Johannsen et al. (2016) introduce an approach to joint projection of POS and dependencies, showing that exploiting the interactions between the two layers yields even better cross-lingual parsers. Their approach also accounts for transferring tag distributions instead of single-best POS tags.
All the parsers in our experiments are restricted to 20k training sentences. EBC and WTC texts offer up to 120k training instances per language. We observe limited benefits of going beyond our training set cap, indicating a more elaborate instance selection-based approach would be more beneficial than just adding more training data.
In our dependency graph projection, we normalize the edge weights per sentence. As a direction for future work, we note that corpus-level normalization might achieve the same balancing effect while still preserving potentially important language-specific signals for structural disambiguation.
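The sentence-level normalization can be sketched as follows, assuming projected edge scores are stored as a mapping from (head, dependent) pairs to accumulated weights (this representation is our assumption, not a detail fixed by the paper):

```python
# Minimal sketch of per-sentence weight normalization: scale the projected
# edge scores of one sentence so they sum to 1. Edges are assumed to be
# stored as (head, dependent) -> accumulated score.
def normalize_sentence(weights):
    """Return a copy of `weights` rescaled to sum to 1 (no-op if all zero)."""
    total = sum(weights.values())
    if total == 0:
        return dict(weights)
    return {edge: w / total for edge, w in weights.items()}

w = {(0, 1): 2.0, (1, 2): 1.0, (1, 3): 1.0}
norm = normalize_sentence(w)
```

A corpus-level variant would instead divide by a single statistic computed over all sentences, leaving the relative magnitudes between sentences intact.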
EBC and WTC constitute only a (hopefully small) subset of the publicly available multilingual parallel corpora. The outdated EBC texts can be replaced by newer ones, and the EBC itself can be replaced or augmented by other online sources of Bible translations. Other sources include the Universal Declaration of Human Rights, translated into 467 languages, and repositories of movie subtitles, software localization files, and various other parallel resources, such as OPUS (Tiedemann, 2012). Our approach is language-independent and would benefit from extension to datasets beyond EBC and WTC.
Related work

POS tagging While annotation projection of POS labels goes back to Yarowsky's seminal work, Das and Petrov (2011) recently renewed interest in the problem. They go beyond our approach to POS annotation by combining annotation projection with unsupervised learning techniques, but they restrict themselves to Indo-European languages and a coarser tagset. Li et al. (2012) introduce an approach that leverages potentially noisy but sizeable POS tag dictionaries in the form of Wiktionaries for 9 resource-rich languages. Garrette et al. (2013) also consider the problem of learning POS taggers for truly low-resource languages, but instead suggest crowdsourcing such POS tag dictionaries.
Finally, Agić et al. (2015) were the first to introduce the idea of learning models for more than a dozen truly low-resource languages in one go, and our contribution can be seen as a non-trivial extension of theirs.
Parsing With the exception of Zeman and Resnik (2008), initial work on cross-lingual dependency parsing focused on annotation projection (Hwa et al., 2005; Spreyer et al., 2010). McDonald et al. (2011) and Søgaard (2011) simultaneously took up the idea of delexicalized transfer after Zeman and Resnik (2008), but more importantly, they also introduced the idea of multi-source cross-lingual transfer in the context of dependency parsing. McDonald et al. (2011) were the first to combine annotation projection and multi-source transfer, the approach taken in this paper.
Annotation projection has been explored in the context of cross-lingual dependency parsing since Hwa et al. (2005). Notable approaches include the soft projection of reliable dependencies by Li et al. (2014), and the work of Ma and Xia (2014), who make use of the source-side distributions through a training objective function.
Tiedemann and Agić (2016) provide a more detailed overview of model transfer and annotation projection, while introducing a competitive machine translation-based approach to synthesizing dependency treebanks. We note that in their work, the IBM4 word alignments favor more closely related languages, and that building machine translation systems requires parallel data in quantities far surpassing EBC and WTC combined.
The best results reported to date were presented by Rasooli and Collins (2015). They use the intersection of languages represented in the Google dependency treebanks project and the languages represented in the Europarl corpus. Consequently, their approach, like all the other approaches listed in this section, is potentially biased toward closely related Indo-European languages.

Conclusions
We introduced a novel, yet simple and heuristics-free, method for inducing POS taggers and dependency parsers for truly low-resource languages. We assume only the availability of a set of documents translated into many languages. The novelty of our dependency projection method consists in projecting edge scores rather than edges, and specifically in projecting these annotations from multiple sources rather than from a single source. While we built models for more than a hundred languages in our experiments, we evaluated our approach on the 30 languages for which we had test data. The results show that our approach is superior to commonly used transfer methods.