Cross-Lingual Syntactic Transfer with Limited Resources

We describe a simple but effective method for cross-lingual syntactic transfer of dependency parsers, in the scenario where a large amount of translation data is not available. This method makes use of three steps: 1) a method for deriving cross-lingual word clusters, which can then be used in a multilingual parser; 2) a method for transferring lexical information from a target language to source language treebanks; 3) a method for integrating these steps with the density-driven annotation projection method of Rasooli and Collins (2015). Experiments show improvements over the state-of-the-art in several languages used in previous work, in a setting where the only source of translation data is the Bible, a considerably smaller corpus than the Europarl corpus used in previous work. Results using the Europarl corpus as a source of translation data show additional improvements over the results of Rasooli and Collins (2015). We conclude with results on 38 datasets from the Universal Dependencies corpora.


Introduction
Creating manually-annotated syntactic treebanks is an expensive and time-consuming task. Recently there has been a great deal of interest in cross-lingual syntactic transfer, where a parsing model is trained for some language of interest, using only treebanks in other languages. There is a clear motivation for this in building parsing models for languages for which treebank data is unavailable. Methods for syntactic transfer include annotation projection methods (Hwa et al., 2005; Ganchev et al., 2009; McDonald et al., 2011; Ma and Xia, 2014; Rasooli and Collins, 2015; Lacroix et al., 2016), learning of delexicalized models on universal treebanks (Zeman and Resnik, 2008; McDonald et al., 2011; Täckström et al., 2013; Rosa and Zabokrtsky, 2015), treebank translation (Tiedemann et al., 2014; Tiedemann, 2015; Tiedemann and Agić, 2016) and methods that leverage cross-lingual representations of word clusters, embeddings or dictionaries (Täckström et al., 2012; Durrett et al., 2012; Duong et al., 2015a; Zhang and Barzilay, 2015; Xiao and Guo, 2015; Guo et al., 2015; Guo et al., 2016; Ammar et al., 2016a).
This paper considers the problem of cross-lingual syntactic transfer with limited resources of monolingual and translation data. Specifically, we use the Bible corpus of Christodouloupoulos and Steedman (2014) as a source of translation data, and Wikipedia as a source of monolingual data. We deliberately limit ourselves to the use of Bible translation data because it is available for a very broad set of languages: the data from Christodouloupoulos and Steedman (2014) includes data from 100 languages. The Bible data contains a much smaller set of sentences (around 24,000) than other translation corpora, for example Europarl (Koehn, 2005), which has around 2 million sentences per language pair. This makes it a considerably more challenging corpus to work with. Similarly, our choice of Wikipedia as the source of monolingual data is motivated by the availability of Wikipedia data in a very broad set of languages.
* On leave at Google Inc. New York.
We introduce a set of simple but effective methods for syntactic transfer, as follows:
• We describe a method for deriving cross-lingual clusters, where words from different languages with a similar syntactic or semantic role are grouped in the same cluster. These clusters can then be used as features in a shift-reduce dependency parser.
• We describe a method for transfer of lexical information from the target language into source language treebanks, using word-to-word translation dictionaries derived from parallel corpora. Lexical features from the target language can then be integrated in parsing.
• We describe a method that integrates the above two approaches with the density-driven approach to annotation projection described by Rasooli and Collins (2015).
Experiments show that our model outperforms previous work on a set of European languages from the Google universal treebank. We achieve 80.9% average unlabeled attachment score (UAS) on these languages; in comparison, the methods of Zhang and Barzilay (2015), Guo et al. (2016) and Ammar et al. (2016b) have a UAS of 75.4%, 76.3% and 77.8%, respectively. All of these previous works make use of the much larger Europarl (Koehn, 2005) corpus to derive lexical representations. When using Europarl data instead of the Bible, our approach gives 83.9% accuracy, a 1.7% absolute improvement over Rasooli and Collins (2015). Finally, we conduct experiments on 38 datasets (26 languages) in the universal dependencies v1.3 (Nivre et al., 2016) corpus. Our method has an average unlabeled dependency accuracy of 74.8% for these languages, more than 6% higher than the method of Rasooli and Collins (2015). Thirteen datasets (10 languages) have accuracies higher than 80.0%.

Background
This section gives a description of the underlying parsing models used in our experiments, the data sets used, and a baseline approach based on delexicalized parsing models.

The Parsing Model
We assume that the parsing model is a discriminative linear model, where given a sentence x, and a set of candidate parses Y(x), the output from the model is

$$y^*(x) = \arg\max_{y \in Y(x)} \theta \cdot \phi(x, y)$$

where θ ∈ ℝ^d is a parameter vector, and φ(x, y) is a feature vector for the pair (x, y). In our experiments we use the shift-reduce dependency parser of Rasooli and Tetreault (2015), which is an extension of the approach in Zhang and Nivre (2011). The parser is trained using the averaged structured perceptron (Collins, 2002).
We assume that the feature vector φ(x, y) is the concatenation of three feature vectors:
• φ (p) (x, y) is an unlexicalized set of features. Each such feature may depend on the part-of-speech (POS) tag of words in the sentence, but does not depend on the identity of individual words in the sentence.
• φ (c) (x, y) is a set of cluster features. These features require access to a dictionary that maps each word in the sentence to an underlying cluster identity. Clusters may, for example, be learned using the Brown clustering algorithm (Brown et al., 1992). The features may make use of cluster identities in combination with POS tags.
• φ (l) (x, y) is a set of lexicalized features. Each such feature may depend directly on word identities in the sentence. These features may also depend on part-of-speech tags or cluster information, in conjunction with lexical information.
Appendix A has a complete description of the features used in our experiments.
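As a rough illustration of the linear model above, the following Python sketch scores candidate parses with a sparse feature map split into the three groups φ (p), φ (c) and φ (l). The feature templates, the token dictionaries, and the names `phi`, `score` and `predict` are assumptions made for this example only. The actual parser searches with a beam over shift-reduce actions rather than enumerating Y(x), and its features are those listed in Appendix A.

```python
from collections import Counter

def phi(x, y):
    """Sparse feature vector for sentence x and parse y.

    x is a list of token dicts with keys "word", "pos" and "cluster";
    y is a list of (modifier_index, head_index) arcs. The three key
    prefixes mirror the three feature groups; the templates themselves
    are illustrative, not the ones used in the paper.
    """
    feats = Counter()
    for mod, head in y:
        m, h = x[mod], x[head]
        feats[("p", m["pos"], h["pos"])] += 1      # unlexicalized: POS only
        feats[("c", m["cluster"], h["pos"])] += 1  # cluster identity with POS
        if m["word"] is not None:                  # lexical features need a word
            feats[("l", m["word"], h["pos"])] += 1
    return feats

def score(theta, feats):
    """Inner product of theta and a sparse feature vector."""
    return sum(theta.get(f, 0.0) * v for f, v in feats.items())

def predict(theta, x, candidates):
    """Return the highest-scoring parse among an explicit candidate list."""
    return max(candidates, key=lambda y: score(theta, phi(x, y)))

# Toy sentence with two tokens and two competing one-arc parses.
x = [{"word": "dogs", "pos": "NOUN", "cluster": 3},
     {"word": "bark", "pos": "VERB", "cluster": 7}]
theta = {("p", "NOUN", "VERB"): 1.0}
best = predict(theta, x, [[(1, 0)], [(0, 1)]])
```

Because a token whose word is missing simply contributes no `"l"` features, the same scoring function covers both the delexicalized and the lexicalized settings.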

Data Assumptions
Throughout this paper we will assume that we have m source languages L 1 . . . L m , and a single target language L m+1 . We assume the following data sources:
Source language treebanks. We have a treebank T i for each language i ∈ {1 . . . m}.
Part-of-speech (POS) data. We have handannotated POS data for all languages L 1 . . . L m+1 . We assume that the data uses a universal POS set that is common across all languages.
Monolingual data. We have monolingual, raw text for each of the (m + 1) languages. We use D i to refer to the monolingual data for the ith language.
Translation data. We have translation data for all language pairs. We use B i,j to refer to translation data for the language pair (i, j) where i, j ∈ {1 . . . (m + 1)} and i ≠ j.
In our main experiments we use the Google universal treebank as our source language treebanks 2 (this treebank provides universal dependency relations and POS tags), Wikipedia data as our monolingual data, and the Bible from Christodouloupoulos and Steedman (2014) as the source of our translation data. In additional experiments we use the Europarl corpus as a source of translation data, in order to measure the impact of using the smaller Bible corpus.

2 We also train our best performing model on the newly released universal treebank v1.3 (Nivre et al., 2016). See §4.3 for more details.

A Baseline Approach: Delexicalized Parsers with Self-Training
Given the data assumption of a universal POS set, the feature vectors φ (p) (x, y) can be shared across languages. A simple approach is then to train a delexicalized parser using treebanks T 1 . . . T m , using the representation φ(x, y) = φ (p) (x, y) (see Täckström et al. (2013)). Our baseline approach makes use of a delexicalized parser, with two refinements:
WALS properties. We use the six properties from the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013) to select a subset of closely related languages for each target language. These properties are shown in Table 1. The model for a target language is trained on treebank data from languages where at least 4 out of 6 WALS properties are common between the source and target language. This gives a slightly stronger baseline. Our experiments showed an improvement in average labeled dependency accuracy for the languages from 62.52% to 63.18%. Table 2 shows the set of source languages used for each target language. These source languages are used for all experiments in the paper.

Feature   Description
82A       Order of subject and verb
83A       Order of object and verb
85A       Order of adposition and noun phrase
86A       Order of genitive and noun
87A       Order of adjective and noun
88A       Order of demonstrative and noun
Table 1: The six properties from the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013) used to select the source languages for each target language in our experiments.

Self-training. We use self-training (McClosky et al., 2006) to further improve parsing performance. Specifically, we first train a delexicalized model on treebanks T 1 . . . T m ; then use the resulting model to parse a dataset T m+1 that includes target-language sentences which have POS tags but do not have dependency structures. We finally use the automatically parsed data T m+1 as the treebank data and retrain the model. This last model is trained using all features (unlexicalized, clusters, and lexicalized). Self-training in this way gives an improvement in labeled accuracy from 63.18% to 63.91%.
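The WALS-based selection of source languages described above amounts to counting shared property values. The following Python sketch shows one way to write that rule. The function name `select_sources`, the dictionary layout, and the property values in the toy data are all made up for illustration and are not real WALS entries; treating a missing value as a mismatch is likewise an assumption.

```python
# The six WALS feature identifiers listed in Table 1.
WALS_FEATURES = ("82A", "83A", "85A", "86A", "87A", "88A")

def select_sources(target, candidates, wals, min_shared=4):
    """Return candidate languages sharing at least min_shared WALS values with target.

    wals maps a language name to a dict from feature id to value.
    A feature that is missing for either language counts as not shared
    (an assumption of this sketch).
    """
    chosen = []
    for lang in candidates:
        shared = 0
        for feature in WALS_FEATURES:
            value = wals[lang].get(feature)
            if value is not None and value == wals[target].get(feature):
                shared += 1
        if shared >= min_shared:
            chosen.append(lang)
    return chosen

# Hypothetical values, for illustration only.
toy_wals = {
    "tgt": {"82A": "SV", "83A": "VO", "85A": "Prep", "86A": "NG", "87A": "NA", "88A": "DN"},
    "a":   {"82A": "SV", "83A": "VO", "85A": "Prep", "86A": "NG", "87A": "AN", "88A": "DN"},
    "b":   {"82A": "SV", "83A": "OV", "85A": "Post", "86A": "GN", "87A": "AN", "88A": "DN"},
}
sources = select_sources("tgt", ["a", "b"], toy_wals)
```

Raising `min_shared` to 5 gives the stricter setting used later for the Universal Dependencies experiments.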

Translation Dictionaries
Our only use of the translation data B i,j for i, j ∈ {1 . . . (m + 1)} is to construct a translation dictionary t(w, i, j). Here i and j are two languages, w is a word in language L i , and the output w′ = t(w, i, j) is a word in language L j corresponding to the most frequent translation of w into this language. We define the function t(w, i, j) as follows: We first run the GIZA++ alignment process (Och and Ney, 2003) on the data B i,j . We then keep intersected alignments between sentences in the two languages. Finally, for each word w in L i , we define w′ = t(w, i, j) to be the target language word most frequently aligned to w in the aligned data. If a word w is never seen aligned to a target language word w′, we define t(w, i, j) = NULL.
Table 2: Source languages used for each target language. A language is chosen as a source language if it has at least 4 out of 6 WALS properties in common with the target language.
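The dictionary construction can be sketched in a few lines of Python, assuming the word alignments have already been produced by an external aligner. The data layout (token lists plus sets of index pairs) and the names `intersect`, `build_dictionary` and `t` are assumptions of this sketch; Python's `None` stands in for NULL.

```python
from collections import Counter, defaultdict

def intersect(src_to_tgt, tgt_to_src):
    """Keep only links proposed in both alignment directions.

    src_to_tgt holds (i, j) pairs and tgt_to_src holds (j, i) pairs,
    where i indexes the source sentence and j the target sentence.
    """
    return src_to_tgt & {(i, j) for (j, i) in tgt_to_src}

def build_dictionary(aligned_pairs):
    """Map each source word to the target word it is most often aligned to.

    aligned_pairs is an iterable of (src_tokens, tgt_tokens, links),
    where links is a set of intersected (i, j) index pairs.
    """
    counts = defaultdict(Counter)
    for src, tgt, links in aligned_pairs:
        for i, j in links:
            counts[src[i]][tgt[j]] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def t(word, dictionary):
    """Look up a translation; None plays the role of NULL."""
    return dictionary.get(word)

# Toy aligned data for one language pair.
pairs = [
    (["the", "house"], ["das", "haus"], {(0, 0), (1, 1)}),
    (["the", "dog"], ["der", "hund"], {(0, 0), (1, 1)}),
    (["the", "house"], ["das", "haus"], {(0, 0), (1, 1)}),
]
dictionary = build_dictionary(pairs)
```

One dictionary of this kind would be built per ordered language pair (i, j).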

Our Approach
We now describe an approach that gives significant improvements over the baseline. §3.1 describes a method for deriving cross-lingual clusters, allowing us to add cluster features φ (c) (x, y) to the model. §3.2 describes a method for adding lexical features φ (l) (x, y) to the model. §3.3 describes a method for integrating the approach with the density-driven approach of Rasooli and Collins (2015). Finally, §4 describes experiments. We show that each of the above steps leads to improvements in accuracy.

Learning Cross-Lingual Clusters
We now describe a method for learning cross-lingual clusters. This follows previous work on cross-lingual clustering algorithms (Täckström et al., 2012). A clustering is a function C(w) that maps each word w in a vocabulary to a cluster C(w) ∈ {1 . . . K}, where K is the number of clusters. A hierarchical clustering is a function C(w, l) that maps a word w together with an integer l to a cluster at level l in the hierarchy. As one example, the Brown clustering algorithm (Brown et al., 1992) gives a hierarchical clustering. The level l allows cluster features at different levels of granularity.
A cross-lingual hierarchical clustering is a function C(w, l) where the clusters are shared across the (m + 1) languages of interest. That is, the word w can be from any of the (m + 1) languages. Ideally, a cross-lingual clustering should put words across different languages which have a similar syntactic and/or semantic role in the same cluster. There is a clear motivation for cross-lingual clustering in the parsing context. We can use the cluster-based features φ (c) (x, y) on the source language treebanks T 1 . . . T m , and these features will now generalize beyond these treebanks to the target language L m+1 .
Figure 1: The algorithm for learning a cross-lingual clustering. Inputs: 1) Monolingual texts D i for i = 1 . . . (m + 1); 2) a function t(w, i, j) that translates a word w ∈ L i to w′ ∈ L j ; and 3) a parameter α such that 0 < α < 1.
We learn a cross-lingual clustering by leveraging the monolingual data sets D 1 . . . D m+1 , together with the translation dictionaries t(w, i, j) learned from the translation data. Figure 1 shows the algorithm that learns a cross-lingual clustering. The algorithm first prepares a multilingual corpus, as follows: for each sentence s in the monolingual data D i , for each word in s, with probability α, we replace the word with its translation into some randomly chosen language. Once this data is created, we can easily obtain a cross-lingual clustering by running a clustering algorithm on it. The intuition behind this method is that by creating the cross-lingual data in this way, we bias the clustering algorithm towards putting words that are translations of each other in the same cluster.
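The corpus preparation step can be sketched as follows in Python. The function name `make_multilingual_corpus`, the `translate` callback, and the fixed random seed are assumptions of this sketch. The text does not say what happens when the dictionary returns NULL for a sampled word; here the original word is kept, which is a guess rather than the documented behavior.

```python
import random

def make_multilingual_corpus(mono, translate, alpha, seed=0):
    """Build a mixed-language corpus from monolingual text.

    mono maps a language name to a list of sentences (token lists).
    translate(w, i, j) returns the translation of w from language i
    into language j, or None when no translation is known.
    Each word is replaced with probability alpha by its translation
    into a randomly chosen other language.
    """
    rng = random.Random(seed)
    langs = sorted(mono)
    corpus = []
    for i in langs:
        others = [j for j in langs if j != i]
        for sent in mono[i]:
            out = []
            for w in sent:
                if others and rng.random() < alpha:
                    j = rng.choice(others)
                    w2 = translate(w, i, j)
                    # Assumption: fall back to the original word on NULL.
                    out.append(w2 if w2 is not None else w)
                else:
                    out.append(w)
            corpus.append(out)
    return corpus

# Toy two-language setup with a complete dictionary.
table = {("the", "en", "de"): "das", ("house", "en", "de"): "haus",
         ("das", "de", "en"): "the", ("haus", "de", "en"): "house"}
mono = {"en": [["the", "house"]], "de": [["das", "haus"]]}
lookup = lambda w, i, j: table.get((w, i, j))
mixed = make_multilingual_corpus(mono, lookup, alpha=0.5)
```

The resulting token lists would then be written out and passed to an off-the-shelf clustering tool, so that translation pairs share contexts and tend to land in the same cluster.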

Treebank Lexicalization
We now describe how to introduce lexical representations φ (l) (x, y) to the model. Our approach is simple: we take the treebank data T 1 . . . T m for the m source languages, together with the translation lexicons t(w, i, m + 1). For any word w in the source treebank data, we can look up its translation t(w, i, m + 1) in the lexicon, and add this translated form to the underlying sentence. Features can now consider lexical identities derived in this way. In many cases the resulting translation will be the NULL word, leading to the absence of lexical features. However, the representations φ (p) (x, y) and φ (c) (x, y) still apply in this case, so the model is robust to some words having a NULL translation.
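A minimal Python sketch of this lexicalization step is given below, assuming each treebank token is a dictionary with `word`, `pos`, `head` and `label` fields and that the translation lexicon for one source language is a plain dictionary. The field names and the function name `lexicalize` are hypothetical; `None` again stands in for the NULL word.

```python
def lexicalize(treebank, dictionary):
    """Replace every source word by its target-language translation.

    treebank is a list of sentences, each a list of token dicts.
    Tokens without a dictionary entry get word=None (the NULL word),
    so only POS and cluster features remain available for them.
    The input treebank is left unmodified.
    """
    lexicalized = []
    for sent in treebank:
        lexicalized.append(
            [dict(tok, word=dictionary.get(tok["word"])) for tok in sent]
        )
    return lexicalized

# Toy source treebank with one sentence; heads are 1-based, 0 is the root.
source_treebank = [[
    {"word": "red", "pos": "ADJ", "head": 2, "label": "amod"},
    {"word": "house", "pos": "NOUN", "head": 0, "label": "root"},
]]
target_view = lexicalize(source_treebank, {"house": "haus"})
```

The dependency structure and POS tags are untouched, so the lexicalized trees can be pooled across all m source languages.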

Integration with the Density-Driven Projection Method of Rasooli and Collins (2015)
In this section we describe a method for integrating our approach with the cross-lingual transfer method of Rasooli and Collins (2015), which makes use of density-driven projections.
In annotation projection methods (Hwa et al., 2005;McDonald et al., 2011), it is assumed that we have translation data B i,j for a source and target language, and that we have a dependency parser in the source language L i . The translation data consists of pairs (e, f ) where e is a source language sentence, and f is a target language sentence. A method such as GIZA++ is used to derive an alignment between the words in e and f , for each sentence pair; the source language parser is used to parse e. Each dependency in e is then potentially transferred through the alignments to create a dependency in the target sentence f . Once dependencies have been transferred in this way, a dependency parser can be trained on the dependencies in the target language.
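The transfer of dependencies through alignments can be illustrated with a short Python sketch. It assumes a one-to-one alignment stored as a dictionary from source index to target index, and source heads stored as a list with -1 marking the root; these conventions and the name `project` are assumptions of the example, and real projection methods apply further filtering that is not shown here.

```python
def project(src_heads, alignment, tgt_len):
    """Project source dependencies onto a target sentence.

    src_heads[i] is the head index of source word i, or -1 for the root.
    alignment maps source indices to target indices (one-to-one).
    Returns a list of target heads, with None for target words that
    receive no projected dependency.
    """
    tgt_heads = [None] * tgt_len
    for mod, head in enumerate(src_heads):
        if mod not in alignment:
            continue                      # modifier has no target counterpart
        tgt_mod = alignment[mod]
        if head == -1:
            tgt_heads[tgt_mod] = -1       # the root attachment transfers directly
        elif head in alignment:
            tgt_heads[tgt_mod] = alignment[head]
    return tgt_heads

# Source: three words, word 1 is the root and heads words 0 and 2.
# Only source words 0 and 1 are aligned, to target words 0 and 2.
projected = project([1, -1, 1], {0: 0, 1: 2}, tgt_len=3)
```

Target words left as `None` are exactly the gaps that make a projected structure partial rather than full.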
The density-driven approach of Rasooli and Collins (2015) makes use of various definitions of "density" of the projected dependencies. For example, P 100 is the set of projected structures where the projected dependencies form a full projective parse tree for the sentence; P 80 is the set of projected structures where at least 80% of the words in the projected structure are a modifier in some dependency. An iterative training process is used, where the parsing algorithm is first trained on the set P 100 of complete structures, and where progressively less dense structures are introduced in learning.
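The notion of density used to define sets such as P 80 can be sketched in Python as the fraction of words that have a projected head. The representation (a list of heads with `None` for missing ones) and the names `density` and `at_least` are assumptions of this sketch. It does not check that a fully dense structure is a projective tree, which the definition of P 100 additionally requires.

```python
def density(tgt_heads):
    """Fraction of words that are a modifier in some projected dependency."""
    if not tgt_heads:
        return 0.0
    return sum(h is not None for h in tgt_heads) / len(tgt_heads)

def at_least(projected, threshold):
    """Select projected structures whose density meets the threshold."""
    return [heads for heads in projected if density(heads) >= threshold]

# Two projected structures: one complete, one with a missing head.
structures = [[1, -1], [2, None, -1]]
dense = at_least(structures, 0.8)
```

Lowering the threshold step by step mirrors the iterative schedule in which less dense structures are introduced as training proceeds.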
We integrate our approach with the density-driven approach of Rasooli and Collins (2015) as follows: consider the treebanks T 1 . . . T m created using the lexicalization method of §3.2. We add all trees in these treebanks to the set P 100 of full trees used to initialize the method of Rasooli and Collins (2015). In addition we make use of the representations φ (p) , φ (c) and φ (l) throughout the learning process.

Experiments
This section first describes the experimental settings, then reports results.

Data and Tools
Data. In the first set of experiments, we consider 7 European languages studied in several pieces of previous work (Ma and Xia, 2014; Zhang and Barzilay, 2015; Guo et al., 2016; Ammar et al., 2016a; Lacroix et al., 2016). More specifically, we use the 7 European languages in the Google universal treebank (v.2; standard data). As in previous work, gold part-of-speech tags are used for evaluation. We use the concatenation of the treebank training sentences, Wikipedia data and the Bible monolingual sentences as our monolingual raw text. Table 3 shows statistics for the monolingual data. We use the Bible from Christodouloupoulos and Steedman (2014), which includes data for 100 languages, as the source of translations. We also conduct experiments with the Europarl data (both with the original set and a subset of it with the same size as the Bible) to study the effects of translation data size and domain shift. The statistics for the translation data are shown in Table 4.
In a second set of experiments, we run experiments on 38 datasets (26 languages) in the more recent Universal Dependencies v1.3 corpus (Nivre et al., 2016). The full set of languages we use is listed in Table 9.⁴ We use the Bible as the translation data.
Brown Clustering Algorithm. We use the off-the-shelf Brown clustering tool⁵ (Liang, 2005) to train monolingual Brown clusters with 500 clusters. The monolingual Brown clusters are used as features over lexicalized values created in φ (l) , and in self-training experiments. We train our cross-lingual clustering with the off-the-shelf tool⁶ from Stratos et al. (2015). We set the window size to 2 with a cluster size of 500.⁷
Parsing Model. We use the k-beam arc-eager dependency parser of Rasooli and Tetreault (2015), which is similar to the model of Zhang and Nivre (2011). We modify the parser such that it can use both monolingual and cross-lingual word cluster features. The parser is trained using the maximum violation update strategy (Huang et al., 2012). We use three epochs of training for all experiments. We use the DEPENDABLE Tool (Choi et al., 2015) to calculate significance tests on several of the comparisons (details are given in the captions to Tables 5, 6, and 9).
Word alignment. We use the intersected alignments from GIZA++ (Och and Ney, 2003) on translation data. We exclude sentences in translation data with more than 100 words.
⁴ [...] Ancient Greek, Basque, Catalan, Galician, Gothic, Irish, Kazakh, Latvian, Old Church Slavonic, and Tamil). We also excluded Arabic, Hebrew, Japanese and Chinese, as these languages have tokenization and/or morphological complexity that goes beyond the scope of this paper. Future work should consider these languages.
⁵ https://github.com/percyliang/brown-cluster
⁶ https://github.com/karlstratos/singular
⁷ Usually the original Brown clusters are better features for parsing but their training procedure does not scale well to large datasets. Therefore we use the more efficient algorithm from Stratos et al. (2015) on the larger cross-lingual datasets to obtain word clusters.

Results on the Google Treebank
Table 5 shows the dependency parsing accuracy for the baseline delexicalized approach, and for models which add 1) cross-lingual clusters (§3.1); 2) lexical features (§3.2); and 3) integration with the density-driven method of Rasooli and Collins (2015). Each of these three steps gives significant improvements in performance. The final LAS/UAS of 73.9/80.3% is several percentage points higher than the baseline accuracy of 63.9/72.9%.
Table 6: Results for our method using different sources of translation data. "Density" refers to the method of Rasooli and Collins (2015); "This paper" gives results using the methods described in sections 3.1-3.3 of this paper. The "Bible" experiments use the Bible data of Christodouloupoulos and Steedman (2014). The "Europarl" experiments use the Europarl data of Koehn (2005). "Supervised" refers to the performance of the parser trained on fully gold standard data in a supervised fashion (i.e. the practical upper-bound of our model). "avg \en" refers to the average accuracy for all datasets except English.
Comparison to the Density-Driven Approach using Europarl Data. Table 6 shows accuracies for the density-driven approach of Rasooli and Collins (2015), first using Europarl data⁸ and second using the Bible alone (with no cross-lingual clusters or lexicalization). The Bible data is considerably smaller than Europarl (around 100 times smaller), and it can be seen that results using the Bible are several percentage points lower than the results for Europarl (75.7% UAS vs. 81.3% UAS). Integrating the cluster-based and lexicalized features described in the current paper with the density-driven approach closes much of this gap in performance (80.3% UAS). Thus we have demonstrated that we can get close to the performance of the Europarl-based models using only the Bible as a source of translation data. Using our approach on the full Europarl data gives an average UAS of 82.9%, an improvement from the 81.3% UAS of Rasooli and Collins (2015). Table 6 also shows results when we use a random subset of the Europarl data, in which the number of sentences (25,000) is chosen to give a very similar size to the Bible. It can be seen that accuracies using the Bible vs. the Europarl-Sample are very similar (80.3% vs. 80.4% UAS), suggesting that the size of the translation corpus is much more important than the genre.
⁸ Rasooli and Collins (2015) do not report results on English. We use the same setting to obtain the English results.
Comparison to Other Previous Work. Table 7 compares the accuracy of our method to the following related work: 1) Ma and Xia (2014) [...] describe an annotation projection method based on training on partial trees with dynamic oracles; 3) Zhang and Barzilay (2015), who describe a method that learns cross-lingual embeddings and bilingual dictionaries from Europarl data, and uses these features in a discriminative parsing model; 4) Guo et al. (2016), who describe a method that learns cross-lingual embeddings from Europarl data and uses a shift-reduce neural parser with these representations; 5) Ammar et al. (2016b), who use the same embeddings as Guo et al. (2016), within an LSTM-based parser; and 6) Rasooli and Collins (2015), who use the density-driven approach on the Europarl data. Our method gives significant improvements over the first three models, in spite of using the Bible translation data rather than Europarl. When using the Europarl data, our method improves on the state-of-the-art model of Rasooli and Collins (2015).
Performance with Automatic POS Tags. For completeness, Table 8 gives results for our method with automatic part-of-speech tags. The tags are obtained using the model of Collins (2002) trained on the training part of the treebank dataset. Future work should study approaches that transfer POS tags in addition to dependencies.

Results on the Universal Dependencies v1.3
Table 9 gives results on 38 datasets (26 languages) from the newly released universal dependencies corpus (Nivre et al., 2016). Given the number of treebanks and to speed up training, we pick source languages that have at least 5 out of 6 common WALS properties with each target language. Our experiments are carried out using the Bible as our translation data. As shown in Table 9, our method consistently outperforms the density-driven method of Rasooli and Collins (2015), and for many languages the accuracy of our method gets close to the accuracy of the supervised parser. In all the languages, our method is significantly better than the density-driven method using McNemar's test with p < 0.001. Accuracy on some languages (e.g., Persian (fa) and Turkish (tr)) is low, suggesting that future work should consider more powerful techniques for these languages. There are two important facts to note. First, the number of fully projected trees in some languages is so low that the density-driven approach cannot start with a good initialization to fill in partial dependencies. For example, Turkish has only one full tree with only six words, Persian has 25 trees, and Dutch has 28 trees. Second, we observe very low accuracies in supervised parsing for some languages in which the number of training sentences is very low (for example, Latin has only 1326 projective trees in the training data).
Table 9: Results for the density-driven method (Rasooli and Collins, 2015) and ours using the Bible data on the universal dependencies v1.3 (Nivre et al., 2016). The table is sorted by the performance of our method. The last major column shows the performance of the supervised parser. The abbreviations are as follows: bg (Bulgarian), cs (Czech), da (Danish), de (German), el (Greek), en (English), es (Spanish), et (Estonian), fa (Persian (Farsi)), fi (Finnish), fr (French), hi (Hindi), hr (Croatian), hu (Hungarian), id (Indonesian), it (Italian), la (Latin), nl (Dutch), no (Norwegian), pl (Polish), pt (Portuguese), ro (Romanian), ru (Russian), sl (Slovenian), sv (Swedish), and tr (Turkish). All differences in LAS and UAS in this table were found to be statistically significant according to McNemar's sign test with p < 0.001.

Analysis
We conclude with some analysis of the accuracy of the method on different dependency types, across the different languages. Table 10 shows precision and recall on different dependency types in English (using the Google treebank). The improvements in accuracy when moving from the delexicalized model to the Bible or Europarl model apply quite uniformly across all dependency types, with all dependency labels showing an improvement. Table 11 shows the dependency accuracy sorted by part-of-speech tag of the modifier in the dependency. We break the results into three groups: G1 languages, where UAS is at least 80% overall; G2 languages, where UAS is between 70% and 80%; and G3 languages, where UAS is less than 70%. There are some quite significant differences in accuracy depending on the POS of the modifier word. In the G1 languages, for example, ADP, DET, ADJ, PRON and AUX all have over 85% accuracy; in contrast NOUN, VERB, PROPN, ADV all have accuracy that is less than 80%. A very similar pattern is seen for the G2 languages, with ADP, DET, ADJ, and AUX again having greater than 85% accuracy, but NOUN, VERB, PROPN and ADV having lower accuracies. These results suggest that difficulty varies quite significantly depending on the modifier POS, and different languages show the same patterns of difficulty with respect to the modifier POS. Table 12 shows accuracy sorted by the POS tag of the head word of the dependency. By far the most frequent head POS tags are NOUN, VERB, and PROPN (accounting for 85% of all dependencies). The table also shows that for all language groups G1, G2, and G3, the f1 scores for NOUN, VERB and PROPN are generally higher than the f1 scores for other head POS tags.
Finally, Table 13 shows precision and recall for different dependency labels for the G1, G2 and G3 languages. We again see quite large differences in accuracy between different dependency labels. In the G1 languages, the most frequent label, nmod, has an F-score of 75.2. In contrast, the second most frequent label, case, has an F-score of 93.7. Other frequent labels with low accuracy in the G1 languages are advmod, conj, and cc.
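For readers who want to reproduce this kind of breakdown, the Python sketch below computes per-label precision, recall and F-score from gold and predicted analyses. The exact scoring convention behind Tables 10 and 13 is not spelled out in the text, so the rule used here (a predicted dependency counts as correct for a label only if both its head and its label match the gold analysis) is an assumption, as is the function name `label_prf`.

```python
from collections import Counter

def label_prf(gold, pred):
    """Per-label precision, recall and F-score.

    gold and pred are equal-length lists of (head, label) pairs,
    one pair per token. Returns a dict from label to (p, r, f).
    """
    correct, gold_n, pred_n = Counter(), Counter(), Counter()
    for (g_head, g_label), (p_head, p_label) in zip(gold, pred):
        gold_n[g_label] += 1
        pred_n[p_label] += 1
        if g_head == p_head and g_label == p_label:
            correct[g_label] += 1
    scores = {}
    for label in set(gold_n) | set(pred_n):
        p = correct[label] / pred_n[label] if pred_n[label] else 0.0
        r = correct[label] / gold_n[label] if gold_n[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f)
    return scores

# Toy three-token sentence: the last token gets the wrong head and label.
gold = [(2, "nmod"), (0, "root"), (2, "case")]
pred = [(2, "nmod"), (0, "root"), (1, "nmod")]
scores = label_prf(gold, pred)
```

Grouping the same counts by the POS tag of the modifier or of the head, instead of by label, would give breakdowns in the style of Tables 11 and 12.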

Related Work
There has recently been a great deal of work on syntactic transfer. A number of methods (Zeman and Resnik, 2008; McDonald et al., 2011; Cohen et al., 2011; Naseem et al., 2012; Täckström et al., 2013; Rosa and Zabokrtsky, 2015) directly learn delexicalized models that can be trained on universal treebank data from one or more source languages, then applied to the target language. More recent work has introduced cross-lingual representations (for example, cross-lingual word embeddings) that can be used to improve performance (Zhang and Barzilay, 2015; Guo et al., 2015; Duong et al., 2015a; Duong et al., 2015b; Guo et al., 2016; Ammar et al., 2016b). These cross-lingual representations are usually learned from parallel translation data. We show results of several methods (Zhang and Barzilay, 2015; Guo et al., 2016; Ammar et al., 2016b) in Table 7 of this paper.
The annotation projection approach, where dependencies from one language are transferred through translation alignments to another language, has been considered by several authors (Hwa et al., 2005; Ganchev et al., 2009; McDonald et al., 2011; Ma and Xia, 2014; Rasooli and Collins, 2015; Lacroix et al., 2016; Schlichtkrull and Søgaard, 2017).
Table 10: Precision, recall and f-score of different dependency relations on the English development data of the Google universal treebank. The major columns show the dependency labels ("dep."), frequency ("freq."), the baseline delexicalized model ("delex"), and our method using the Bible and Europarl ("EU") as translation data. The rows are sorted by frequency.
Other recent work (Tiedemann et al., 2014; Tiedemann, 2015; Tiedemann and Agić, 2016) has considered treebank translation, where a statistical machine translation system (e.g., MOSES (Koehn et al., 2007)) is used to translate a source language treebank into the target language, complete with reordering of the input sentence. The lexicalization approach described in this paper is a simple form of treebank translation, where we use a word-to-word translation model. In spite of its simplicity, it is an effective approach. A number of authors have considered incorporating universal syntactic properties, such as dependency order, by selectively learning syntactic attributes from similar source languages (Naseem et al., 2012; Täckström et al., 2013; Zhang and Barzilay, 2015; Ammar et al., 2016a). Selective sharing of syntactic properties is complementary to our work. We used a very limited form of selective sharing, through the WALS properties, in our baseline approach. More recently, Wang and Eisner (2016) have developed a synthetic treebank as a universal treebank to help learn parsers for new languages. Martínez Alonso et al. (2017) try a very different approach in cross-lingual transfer by using a ranking approach.
A number of authors (Täckström et al., 2012; Guo et al., 2015; Guo et al., 2016) have introduced methods that learn cross-lingual representations that are then used in syntactic transfer. Most of these approaches introduce constraints to a clustering or embedding algorithm that encourage words that are translations of each other to have similar representations. Our method of deriving a cross-lingual corpus (see Figure 1) is closely related to Duong et al. (2015a), Gouws and Søgaard (2015), and Wick et al. (2015).
Our work has made use of dictionaries that are automatically extracted from bilingual corpora. An alternative approach would be to use hand-crafted translation lexicons, for example, PanLex (Baldwin et al., 2010), which covers 1253 language varieties (e.g., see Duong et al. (2015b)), Google Translate (e.g., see Ammar et al. (2016c)), or Wiktionary (e.g., see Durrett et al. (2012) for an approach that uses Wiktionary for cross-lingual transfer). These resources are potentially very rich sources of information. Future work should investigate whether they can give improvements in performance.

Conclusions
We have described a method for cross-lingual syntactic transfer that is effective in a scenario where a large amount of translation data is not available. We have introduced a simple, direct method for deriving cross-lingual clusters, and for transferring lexical information across treebanks for different languages. Experiments show that the method gives improved performance over previous work that makes use of Europarl, a much larger translation corpus.
Table 13: Precision, recall and f-score for different dependency labels for three groups of languages for the universal dependencies experiments in Table 9: G1 (languages with UAS ≥ 80), G2 (languages with 70 ≤ UAS < 80), G3 (languages with UAS < 70). The rows are sorted by frequency in the G1 languages.