Many Languages, One Parser

We train one multilingual model for dependency parsing and use it to parse sentences in several languages. The parsing model uses (i) multilingual word clusters and embeddings; (ii) token-level language information; and (iii) language-specific features (fine-grained POS tags). This input representation enables the parser not only to parse effectively in multiple languages, but also to generalize across languages based on linguistic universals and typological similarities, making it more effective to learn from limited annotations. Our parser’s performance compares favorably to strong baselines in a range of data scenarios, including when the target language has a large treebank, a small treebank, or no treebank for training.


Introduction
Developing tools for processing many languages has long been an important goal in NLP (Rösner, 1988; Heid and Raab, 1989), but it was only when statistical methods became standard that massively multilingual NLP became economical. The mainstream approach for multilingual NLP is to design language-specific models. For each language of interest, the resources necessary for training the model are obtained (or created), and parameters are fit separately for each language. This approach is simple and effective, and it grants the flexibility of customizing the model and features to the needs of each language, but it is suboptimal for theoretical as well as practical reasons. Theoretically, the study of linguistic typology tells us that many languages share morphological, phonological, and syntactic phenomena (Bender, 2011); therefore, the mainstream approach misses an opportunity to exploit relevant supervision from typologically related languages. Practically, it is inconvenient to deploy or distribute NLP tools that are customized for many different languages because, for each language of interest, we need to configure, train, tune, monitor, and occasionally update the model. Furthermore, code-switching or code-mixing (mixing more than one language in the same discourse), which is pervasive in some genres, in particular social media, presents a challenge for monolingually-trained NLP models (Barman et al., 2014). In parsing, the availability of homogeneous syntactic dependency annotations in many languages (Nivre et al., 2015b; Agić et al., 2015; Nivre et al., 2015a) has created an opportunity to develop a parser that is capable of parsing sentences in multiple languages, addressing these theoretical and practical concerns. A multilingual parser can potentially replace an array of language-specific monolingually-trained parsers (for languages with a large treebank).
The same approach has been used in low-resource scenarios (with no treebank or a small treebank in the target language), where indirect supervision from auxiliary languages improves the parsing quality (Cohen et al., 2011; McDonald et al., 2011; Zhang and Barzilay, 2015; Duong et al., 2015a; Duong et al., 2015b; Guo et al., 2016), but these models may sacrifice accuracy on source languages with a large treebank. In this paper, we describe a model that works well for both low-resource and high-resource scenarios.
We propose a parsing architecture that takes as input sentences in several languages, 4 optionally predicting the part-of-speech (POS) tags and language ID if needed. The parser is trained on the union of available universal dependency annotations in different languages. Our approach integrates and critically relies on several recent developments related to dependency parsing: universal POS tagsets (Petrov et al., 2012), cross-lingual word clusters (Täckström et al., 2012), selective sharing (Naseem et al., 2012), universal dependency annotations (Nivre et al., 2015b; Agić et al., 2015; Nivre et al., 2015a), advances in neural network architectures (Chen and Manning, 2014; Dyer et al., 2015), and multilingual word embeddings (Gardner et al., 2015; Guo et al., 2016; Ammar et al., 2016). We show that our parser compares favorably to strong baselines trained on the same treebanks in three data scenarios: when the target language has a large treebank (Table 3), a small treebank (Table 7), or no treebank (Table 8). Our parser will be made publicly available on GitHub. 5

Overview
Our goal is to train a dependency parser for a set of target languages $L_t$, given universal dependency annotations in a set of source languages $L_s$. Ideally, we would like to have training data in all target languages (i.e., $L_t \subseteq L_s$), but we are also interested in the case where the sets of source and target languages are disjoint (i.e., $L_t \cap L_s = \emptyset$). When all languages in $L_t$ have a large treebank, the mainstream approach has been to train one monolingual parser per target language and route sentences of a given language to the corresponding parser at test time. In contrast, our approach is to train one parsing model with the union of treebanks in $L_s$, then use this single trained model to parse text in any language in $L_t$, hence the name "many languages, one parser" (MALOPA). In a MALOPA parser, we strike a balance between: (1) enabling cross-lingual model transfer via language-invariant input representations; i.e., coarse POS tags, multilingual word embeddings and multilingual word clusters, and (2) tweaking the behavior of the parser depending on the current input language via language-specific representations; i.e., fine-grained POS tags and language embeddings.
4 We discuss data requirements in the next section.
5 https://github.com/clab/language-universal-parser/tree/084eed3b1510fc893c4c92474cdcea1d7c58aa7c
In addition to universal dependency annotations in source languages (see Table 1), we use the following data resources for each language in $L = L_t \cup L_s$:
• universal POS annotations for training a POS tagger,
• a bilingual dictionary with another language in $L$ for adding cross-lingual lexical information,
• language typology information,
• language-specific POS annotations, and
• a monolingual corpus.
Novel contributions of this paper include: (1) using one parser to replace an array of monolingually-trained parsers without sacrificing accuracy on languages with a large treebank, (2) an effective neural network architecture for using language embeddings to improve multilingual parsing, and (3) a study of how predicted language IDs affect the performance of a multilingual dependency parser.
Table 1: Number of sentences (tokens) in each treebank split in Universal Dependency Treebanks version 2.0 (UDT) and Universal Dependencies version 1.2 (UD) for the languages we experiment with. The last row gives the number of unique language-specific fine-grained POS tags used in a treebank.
While not the primary focus of this paper, we also show that a variant of our parser outperforms previous work on multi-source cross-lingual parsing in low-resource scenarios, where languages in $L_t$ have a small treebank (see Table 7) or where $L_t \cap L_s = \emptyset$ (see Table 8). In the small treebank setup with 3K token annotations, we show that our parser consistently outperforms a strong monolingual baseline, by 5.7 absolute LAS points per language on average.

Parsing Model
Recent advances suggest that recurrent neural networks, especially long short-term memory (LSTM) architectures, are capable of learning useful representations for modeling problems of sequential nature (Graves et al., 2013;Sutskever et al., 2014). In this section, we describe our language-universal parser, which extends the stack LSTM (S-LSTM) parser of Dyer et al. (2015).

Transition-based Parsing with S-LSTMs
This section briefly reviews Dyer et al.'s S-LSTM parser, which we modify in the following sections. The core parser can be understood as the sequential manipulation of three data structures:
• a buffer (from which we read the token sequence),
• a stack (which contains partially-built parse trees), and
• a list of actions previously taken by the parser.
The parser uses the arc-standard transition system (Nivre, 2004). 11 At each timestep t, a transition action is applied that alters these data structures according to Table 2.
11 In a preprocessing step, we transform non-projective trees in the training treebanks to pseudo-projective trees using the "baseline" scheme of Nivre and Nilsson (2005). We evaluate against the original non-projective test set.
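The three data structures and the transitions can be sketched as follows. This is a minimal illustration, not the authors' implementation: Table 2 is not reproduced in this text, so the transition semantics below are the standard arc-standard ones, and the naming convention (REDUCE-LEFT keeps the left-hand stack item as head, REDUCE-RIGHT keeps the right-hand one) is an assumption inferred from the adjective example given later in the paper. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ParserConfig:
    buffer: list                                  # tokens still to be read, front = buffer[0]
    stack: list = field(default_factory=list)     # partially built subtrees (their head tokens)
    arcs: list = field(default_factory=list)      # (head, label, dependent) triples
    history: list = field(default_factory=list)   # actions taken so far

def valid_actions(cfg):
    """Actions whose arc-standard preconditions hold in this configuration."""
    acts = []
    if cfg.buffer:
        acts.append("SHIFT")
    if len(cfg.stack) >= 2:
        acts += ["REDUCE-LEFT", "REDUCE-RIGHT"]
    return acts

def apply_action(cfg, action, label=None):
    if action not in valid_actions(cfg):
        raise ValueError(f"{action} is not valid here")
    if action == "SHIFT":
        cfg.stack.append(cfg.buffer.pop(0))
    else:
        top = cfg.stack.pop()
        below = cfg.stack.pop()
        # Assumed convention: REDUCE-LEFT makes the left item (below) the head.
        head, dep = (below, top) if action == "REDUCE-LEFT" else (top, below)
        cfg.arcs.append((head, label, dep))
        cfg.stack.append(head)
    cfg.history.append((action, label))
    return cfg

# French-style noun + postpositive adjective: the noun heads the adjective.
cfg = ParserConfig(buffer=["maison", "blanche"])
for act, lab in [("SHIFT", None), ("SHIFT", None), ("REDUCE-LEFT", "amod")]:
    apply_action(cfg, act, lab)
```

With one SHIFT and two labeled reduce actions per dependency label, the action inventory has 1 + 2 × (number of labels) members, matching the count given in the paper's footnote.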
Along with the discrete transitions of the arc-standard system, the parser computes vector representations for the buffer, stack and list of actions at time step $t$, denoted $b_t$, $s_t$, and $a_t$, respectively. 12 The parser state at time $t$ is given by:
\[ p_t = \max\{0,\; W[s_t; b_t; a_t] + W_{\mathrm{bias}}\} \]
where the matrix $W$ and the vector $W_{\mathrm{bias}}$ are learned parameters. The parser state $p_t$ is then used to define a categorical distribution over possible next actions $z$: 13
\[ p(z \mid p_t) = \frac{\exp(g_z^\top p_t + q_z)}{\sum_{z'} \exp(g_{z'}^\top p_t + q_{z'})} \]
where $g_z$ and $q_z$ are parameters associated with action $z$. The selected action is then used to update the buffer, stack and list of actions, and to compute $b_{t+1}$, $s_{t+1}$ and $a_{t+1}$ accordingly. The model is trained to maximize the log-likelihood of correct actions. At test time, the parser greedily chooses the most probable action in every time step until a complete parse tree is produced.
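The computation of the parser state and the action distribution can be illustrated numerically. The sketch below assumes the rectified-linear form of the parser state used by Dyer et al. (2015) and normalizes only over the currently valid actions; all parameter values are toy numbers chosen for illustration, and the function names are hypothetical.

```python
import math

def parser_state(W, W_bias, s_t, b_t, a_t):
    """p_t = max(0, W [s_t; b_t; a_t] + W_bias), computed row by row."""
    x = s_t + b_t + a_t  # list concatenation of the three summaries
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bias)
            for row, bias in zip(W, W_bias)]

def action_distribution(p_t, g, q, valid):
    """Softmax over valid actions z of the scores g_z . p_t + q_z."""
    scores = {z: sum(gi * pi for gi, pi in zip(g[z], p_t)) + q[z] for z in valid}
    m = max(scores.values())                       # subtract max for numerical stability
    exps = {z: math.exp(s - m) for z, s in scores.items()}
    total = sum(exps.values())
    return {z: e / total for z, e in exps.items()}

# Toy parameters (hypothetical values).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
W_bias = [0.0, 0.0]
p_t = parser_state(W, W_bias, s_t=[1.0], b_t=[0.5], a_t=[-1.0])

g = {"SHIFT": [2.0, 0.0], "REDUCE-LEFT": [0.0, 1.0]}
q = {"SHIFT": 0.0, "REDUCE-LEFT": 0.0}
probs = action_distribution(p_t, g, q, valid=["SHIFT", "REDUCE-LEFT"])
greedy_action = max(probs, key=probs.get)  # what the parser would pick at test time
```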
The following sections describe our extensions of the core parser. More details about the core parser can be found in Dyer et al. (2015).

Token Representations
The vector representations of input tokens feed into the stack-LSTM modules of the buffer and the stack.
12 A stack-LSTM module is used to compute the vector representation for each data structure, as detailed in Dyer et al. (2015).
13 The total number of actions is 1 + 2 × the number of unique dependency labels in the treebank used for training, but we only consider actions which meet the arc-standard preconditions.
For monolingual parsing, we represent each token by concatenating the following vectors:

• a fixed, pretrained embedding of the word type,
• a learned embedding of the word type,
• a learned embedding of the Brown cluster,
• a learned embedding of the fine-grained POS tag, and
• a learned embedding of the coarse POS tag.
For multilingual parsing with MALOPA, we start with a simple delexicalized model where the token representation only consists of learned embeddings of coarse POS tags, which are shared across all languages to enable model transfer. In the following subsections, we enhance the token representation in MALOPA to include lexical embeddings, language embeddings, and fine-grained POS embeddings.
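The difference between the monolingual token representation and the initial delexicalized MALOPA representation can be sketched as a simple concatenation. The embedding tables, their sizes and their values below are hypothetical toy data; only the choice of which vectors are concatenated follows the text.

```python
def concat(*vectors):
    """Concatenate several embedding vectors into one flat list."""
    out = []
    for v in vectors:
        out.extend(v)
    return out

# Toy embedding tables (hypothetical values and dimensions).
TABLES = {
    "pretrained_word": {"house": [0.1, 0.2, 0.3]},   # fixed during training
    "learned_word": {"house": [0.0, 0.5]},
    "cluster": {"0110": [1.0, 0.0]},
    "fine_pos": {"NN": [0.3, 0.3]},
    "coarse_pos": {"NOUN": [0.7, 0.1]},
}

def monolingual_token_vector(tok, t=TABLES):
    """All five vectors listed for the monolingual parser."""
    return concat(
        t["pretrained_word"][tok["word"]],
        t["learned_word"][tok["word"]],
        t["cluster"][tok["cluster"]],
        t["fine_pos"][tok["fine_pos"]],
        t["coarse_pos"][tok["coarse_pos"]],
    )

def delexicalized_token_vector(tok, t=TABLES):
    """Starting point for MALOPA: the coarse POS embedding only."""
    return list(t["coarse_pos"][tok["coarse_pos"]])

token = {"word": "house", "cluster": "0110", "fine_pos": "NN", "coarse_pos": "NOUN"}
mono = monolingual_token_vector(token)
delex = delexicalized_token_vector(token)
```

Because the coarse POS table is shared across languages, the delexicalized vector is defined for any token that has a universal POS tag, regardless of language.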

Lexical Embeddings
Previous work has shown that sacrificing lexical features leads to a substantial decrease in the performance of a dependency parser (Cohen et al., 2011; Täckström et al., 2012; Tiedemann, 2015). Therefore, we extend the token representation in MALOPA by concatenating learned embeddings of multilingual word clusters, and pretrained multilingual embeddings of word types.

Multilingual Brown clusters.
Before training the parser, we estimate Brown clusters of English words and project them via word alignments to words in other languages. This is similar to the 'projected clusters' method in Täckström et al. (2012). To go from Brown clusters to embeddings, we ignore the hierarchy within Brown clusters and assign a unique parameter vector to each cluster.
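One simple way to realize such a projection is sketched below. The text does not spell out the projection rule, so the majority vote over alignment counts used here is an assumption, as are the function name and the toy data. The last step mirrors the text: cluster bit strings are treated as flat identifiers, each of which would index its own parameter vector.

```python
from collections import Counter, defaultdict

def project_clusters(en_clusters, alignments):
    """Assign each non-English word the cluster of the English words it is
    aligned to most often (assumed rule: majority vote by alignment count)."""
    votes = defaultdict(Counter)
    for foreign, english, count in alignments:
        if english in en_clusters:
            votes[foreign][en_clusters[english]] += count
    return {word: c.most_common(1)[0][0] for word, c in votes.items()}

# Toy English clusters (bit strings) and word-alignment counts.
en_clusters = {"house": "0110", "home": "0110", "white": "1011"}
alignments = [
    ("maison", "house", 5),
    ("maison", "home", 2),
    ("maison", "white", 1),
    ("blanche", "white", 4),
]
projected = project_clusters(en_clusters, alignments)

# Ignore the hierarchy: every distinct cluster gets one flat index,
# which would select that cluster's embedding vector.
cluster_index = {c: i for i, c in enumerate(sorted(set(en_clusters.values())))}
```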

Multilingual word embeddings.
We also use Guo et al.'s (2016) 'robust projection' method to pretrain multilingual word embeddings. The first step in 'robust projection' is to learn embeddings for English words using the skip-gram model (Mikolov et al., 2013). Then, we compute an embedding of non-English words as the weighted average of English word embeddings, using word alignment probabilities as weights. The last step computes an embedding of non-English words which are not aligned to any English words by averaging the embeddings of all words within an edit distance of 1 in the same language. We experiment with two other methods (multiCCA and multiCluster, both proposed by Ammar et al. (2016)) for pretraining multilingual word embeddings in §4.1. MultiCCA uses a linear operator to project pretrained monolingual embeddings in each language (except English) to the vector space of pretrained English word embeddings, while multiCluster uses the same embedding for translationally-equivalent words in different languages. The results in Table 6 illustrate that the three methods perform similarly on this task.
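The second and third steps of robust projection can be sketched as below. This is an illustration under stated assumptions rather than the implementation of Guo et al. (2016): the weights are normalized to sum to one, only words that received an alignment-based embedding serve as edit-distance neighbours, and all names and toy vectors are hypothetical.

```python
def within_one_edit(a, b):
    """True if a and b differ by exactly one substitution, insertion or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = sorted((a, b), key=len)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def robust_projection(en_emb, align_prob, vocab):
    # Step 2: alignment-weighted average of English embeddings.
    aligned = {}
    for word, probs in align_prob.items():
        pairs = [(en_emb[e], p) for e, p in probs.items() if e in en_emb]
        total = sum(p for _, p in pairs)
        if total > 0:
            dim = len(pairs[0][0])
            aligned[word] = [sum(p * v[i] for v, p in pairs) / total for i in range(dim)]
    # Step 3: unaligned words average their edit-distance-1 neighbours.
    out = dict(aligned)
    for word in vocab:
        if word not in aligned:
            neighbours = [vec for w, vec in aligned.items() if within_one_edit(word, w)]
            if neighbours:
                out[word] = mean(neighbours)
    return out

en_emb = {"a": [1.0, 0.0], "b": [0.0, 1.0]}            # toy English embeddings
align_prob = {"plays": {"a": 1.0}, "play": {"b": 1.0}, "mixed": {"a": 0.25, "b": 0.75}}
emb = robust_projection(en_emb, align_prob, vocab=["plays", "play", "mixed", "playz"])
```

In this toy run, 'playz' has no alignments, so it receives the average of 'plays' and 'play', which are its only neighbours at edit distance 1.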

Language Embeddings
While many languages, especially ones that belong to the same family, exhibit some similar syntactic phenomena (e.g., all languages have subjects, verbs, and objects), substantial syntactic differences abound. Some of these differences are easy to characterize (e.g., subject-verb-object vs. verb-subject-object, prepositions vs. postpositions, adjective-noun vs. noun-adjective), while others are subtle (e.g., number and positions of negation morphemes). It is not at all clear how to translate descriptive facts about a language's syntax into features for a parser.
Consequently, training a language-universal parser on treebanks in multiple source languages requires caution. While exposing the parser to a diverse set of syntactic patterns across many languages has the potential to improve its performance in each, dependency annotations in one language will, in some ways, contradict those in typologically different languages.
For instance, consider a context where the next word on the buffer is a noun, and the top word on the stack is an adjective, followed by a noun. Treebanks of languages where postpositive adjectives are typical (e.g., French) will often teach the parser to predict REDUCE-LEFT, while those of languages where prepositive adjectives are more typical (e.g., English) will teach the parser to predict SHIFT.
Inspired by Naseem et al. (2012), we address this problem by informing the parser about the input language it is currently parsing. Let $l$ be the input vector representation of a particular language. We consider three definitions for $l$: 14
• one-hot encoding of the language ID,
• one-hot encoding of individual word-order properties, 15 and
• averaged one-hot encoding of WALS typological properties (including word-order properties). 16
It is worth noting that the first definition (language ID) turns out to work best in our experiments.
We use a hidden layer with tanh nonlinearity to compute the language embedding $l'$ as:
\[ l' = \tanh(L\,l + L_{\mathrm{bias}}) \]
where $L$ and $L_{\mathrm{bias}}$ are additional model parameters. We modify the parsing architecture as follows:
• include $l'$ in the token representation (which feeds into the stack-LSTM modules of the buffer and the stack as described in §3.1),
• include $l'$ in the action vector representation (which feeds into the stack-LSTM module that represents previous actions as described in §3.1), and
• redefine the parser state at time $t$ as
\[ p_t = \max\{0,\; W[s_t; b_t; a_t; l'] + W_{\mathrm{bias}}\} \]
Intuitively, the first two modifications allow the input language to influence the vector representation of the stack, the buffer and the list of actions. The third modification allows the input language to influence the parser state, which in turn is used to predict the next action. In preliminary experiments, we found that adding the language embeddings at the token and action level is important. We also experimented with computing more complex functions of $(s_t, b_t, a_t, l')$ to define the parser state, but they did not help.
14 The files which contain these definitions are available at https://github.com/clab/language-universal-parser/tree/master/typological_properties.
15 The World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013) is an online portal documenting typological properties of 2,679 languages (as of July 2015). We use the same set of WALS features used by Zhang and Barzilay (2015), namely 82A (order of subject and verb), 83A (order of object and verb), 85A (order of adposition and noun phrase), 86A (order of genitive and noun), and 87A (order of adjective and noun).
16 Some WALS features are not annotated for all languages. Therefore, we use the average value of all languages in the same genus. We rescale all values to be in the range [−1, 1].
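A small numeric sketch of the language embedding and the redefined parser state follows. It uses the one-hot language ID definition of l and assumes the same rectified-linear parser state as Dyer et al. (2015), extended with the language embedding; the parameter values and function names are hypothetical.

```python
import math

LANGS = ["de", "en", "es", "fr", "it", "pt", "sv"]

def one_hot(lang, langs=LANGS):
    return [1.0 if x == lang else 0.0 for x in langs]

def affine(M, bias, x):
    """Matrix-vector product plus bias, on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(M, bias)]

def language_embedding(L, L_bias, l):
    """l' = tanh(L l + L_bias)."""
    return [math.tanh(v) for v in affine(L, L_bias, l)]

def parser_state(W, W_bias, s_t, b_t, a_t, l_prime):
    """p_t = max(0, W [s_t; b_t; a_t; l'] + W_bias)."""
    return [max(0.0, v) for v in affine(W, W_bias, s_t + b_t + a_t + l_prime)]

# Toy parameters: a 2-dimensional language embedding for 7 languages.
L = [[0.1 * j for j in range(7)], [-0.1 * j for j in range(7)]]
L_bias = [0.0, 0.0]
l_prime = language_embedding(L, L_bias, one_hot("es"))

W = [[1.0, 1.0, 1.0, 1.0, 1.0]]
p_t = parser_state(W, [0.0], [0.5], [0.5], [0.5], l_prime)
```

With a one-hot input, the product L l simply selects the column of L that belongs to the input language, so the hidden layer acts as a per-language lookup followed by tanh.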

Fine-grained POS Tag Embeddings
Tiedemann (2015) shows that omitting fine-grained POS tags significantly hurts the performance of a dependency parser. However, those fine-grained POS tagsets are defined monolingually and are only available for a subset of the languages with universal dependency treebanks.
We extend the token representation to include a fine-grained POS embedding (in addition to the coarse POS embedding). We stochastically drop out the fine-grained POS embedding for each token with 50% probability (Srivastava et al., 2014) so that the parser can make use of fine-grained POS tags when available but stay reliable when the fine-grained POS tags are missing.
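The per-token dropout of the fine-grained POS embedding can be sketched as below. Zeroing the whole vector at once (rather than individual units) is an assumption made for this illustration, and the function name and seed are hypothetical.

```python
import random

def fine_pos_input(fine_vec, rng, training, p_drop=0.5):
    """During training, replace the fine-grained POS embedding by zeros
    with probability p_drop; at test time, pass it through unchanged."""
    if training and rng.random() < p_drop:
        return [0.0] * len(fine_vec)
    return list(fine_vec)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
vec = [0.3, -0.2, 0.9]
train_samples = [fine_pos_input(vec, rng, training=True) for _ in range(1000)]
n_dropped = sum(s == [0.0, 0.0, 0.0] for s in train_samples)
test_sample = fine_pos_input(vec, rng, training=False)
```

Seeing zeroed fine-grained embeddings for roughly half of the training tokens forces the parser to also work from the coarse POS embedding alone.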

Predicting POS Tags
The model discussed thus far conditions on the POS tags of words in the input sentence. However, gold POS tags may not be available in real applications (e.g., parsing the web). Here, we describe two modifications to 1) model both POS tagging and dependency parsing, and 2) increase the robustness of the parser to incorrect POS predictions.
We use slightly different token representations for tagging and parsing in the same model. For tagging, we construct the token representation by concatenating the embeddings of the word type (pretrained), the Brown cluster and the input language. This token representation feeds into the bidirectional LSTM, followed by a softmax layer (at each position) which defines a categorical distribution over possible POS tags. For parsing, we construct the token representation by further concatenating the embeddings of predicted POS tags. This token representation feeds into the stack-LSTM modules of the buffer and stack components of the transition-based parser. This multi-task learning setup enables us to predict both POS tags and dependency trees in the same model. We note that pretrained word embeddings, cluster embeddings and language embeddings are shared for tagging and parsing.
Block dropout. We use an independently developed variant of word dropout (Iyyer et al., 2015), which we call block dropout. The token representation used for parsing includes the embedding of predicted POS tags, which may be incorrect. Block dropout makes the parser more robust to incorrect POS tag predictions by stochastically zeroing out the entire embedding of the POS tag. While training the parser, we replace the POS embedding vector e with another vector (of the same dimensionality) stochastically computed as: e′ = (1 − b)/µ × e, where b ∈ {0, 1} is a Bernoulli-distributed random variable with parameter µ, which is initialized to 1.0 (i.e., always drop out, setting b = 1, e′ = 0) and is dynamically updated to match the error rate of the POS tagger on the development set. At test time, we never drop out the predicted POS embedding, i.e., e′ = e. Intuitively, this method extends the dropout method (Srivastava et al., 2014) to address structured noise in the input layer.
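Block dropout as described above can be sketched in a few lines. The scaling factor follows the formula exactly as stated in the text; the guard for µ = 0 (a tagger with no development-set errors) is an assumption added to avoid a division by zero, and the function name is hypothetical.

```python
import random

def block_dropout(e, mu, rng, training=True):
    """e' = (1 - b) / mu * e with b ~ Bernoulli(mu); e' = e at test time."""
    if not training or mu <= 0.0:   # mu <= 0 guard is an assumption of this sketch
        return list(e)
    b = 1 if rng.random() < mu else 0
    return [(1 - b) / mu * x for x in e]

rng = random.Random(1)
e = [1.0, 2.0]

# mu is initialized to 1.0, so the POS embedding is always zeroed at first.
always_dropped = [block_dropout(e, 1.0, rng) for _ in range(50)]

# Later, mu tracks the tagger's error rate on the development set, e.g. 0.1.
mixed = [block_dropout(e, 0.1, rng) for _ in range(200)]

at_test_time = block_dropout(e, 0.1, rng, training=False)
```

The whole POS embedding is either kept or removed as one block, which mimics the structured noise of a tagger that gets a token's tag entirely right or entirely wrong.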

Experiments
In this section, we evaluate our MALOPA parser in three data scenarios: when the target language has a large treebank (Table 3), a small treebank (Table 7) or no treebank (Table 8).
Data. For experiments where the target language has a large treebank, we use the standard data splits for German (de), English (en), Spanish (es), French (fr), Italian (it), Portuguese (pt) and Swedish (sv) in the latest release (version 1.2) of Universal Dependencies (Nivre et al., 2015a), and experiment with both gold and predicted POS tags. For experiments where the target language has no treebank, we use the standard splits for these languages in the older universal dependency treebanks v2.0 (McDonald et al., 2013) and use gold POS tags, following the baselines (Zhang and Barzilay, 2015; Guo et al., 2016). Table 1 gives the number of sentences and words annotated for each language in both versions. In a preprocessing step, we lowercase all tokens and remove multi-word annotations and language-specific dependency relations. We use the same multilingual Brown clusters and multilingual embeddings of Guo et al. (2016), kindly provided by the authors.
Optimization. We follow Dyer et al. (2015) in parameter initialization and optimization. 17 However, when training the parser on multiple languages in MALOPA, instead of updating the parameters with the gradient of individual sentences, we use minibatch updates which include one sentence sampled uniformly (without replacement) from each language's treebank, until all sentences in the smallest treebank are used (which concludes an epoch). We repeat the same process in following epochs. We found this to help prevent one source language with a larger treebank (e.g., German) from dominating parameter updates, at the expense of other source languages with a smaller treebank (e.g., Swedish).
17 We use stochastic gradient updates with an initial learning rate of $\eta_0 = 0.1$ in epoch #0, and update the learning rate in following epochs as $\eta_t = \eta_0/(1 + 0.1t)$. We clip the $\ell_2$ norm of the gradient to avoid "exploding" gradients. Unlabeled attachment score (UAS) on the development set determines early stopping. Parameters are initialized with uniform samples in $\pm\sqrt{6/(r + c)}$, where r and c are the sizes of the previous and following layer in the neural network (Glorot and Bengio, 2010). The standard deviations of the labeled attachment score (LAS) due to random initialization in individual target languages are 0.36 (de), 0.40 (en), 0.37 (es), 0.46 (fr), 0.47 (it), 0.41 (pt) and 0.24 (sv). The standard deviation of the average LAS scores across languages is 0.17.
Table 3: Dependency parsing: labeled attachment scores (LAS) for monolingually-trained parsers and MALOPA in the fully supervised scenario where $L_t = L_s$. Note that we use the universal dependencies version 1.2, which only includes annotations for ∼13K English sentences, which explains the relatively low scores in English. When we instead use the universal dependency treebanks version 2.0, which includes annotations for ∼40K English sentences (originally from the English Penn Treebank), we achieve UAS score 93.0 and LAS score 91.5.
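The learning-rate schedule and the balanced minibatch construction can be sketched as follows. The treebanks here are toy lists of sentence IDs, and the function names and seed are hypothetical; shuffling each treebank once and reading it in order is one way to sample without replacement.

```python
import random

def learning_rate(t, eta0=0.1):
    """eta_t = eta_0 / (1 + 0.1 t), with t the epoch number starting at 0."""
    return eta0 / (1 + 0.1 * t)

def balanced_epoch(treebanks, rng):
    """Yield minibatches holding one sentence per language, sampled without
    replacement, until the smallest treebank is exhausted."""
    pools = {lang: rng.sample(sents, len(sents)) for lang, sents in treebanks.items()}
    n_batches = min(len(sents) for sents in pools.values())
    for i in range(n_batches):
        yield [(lang, pools[lang][i]) for lang in pools]

rng = random.Random(0)
treebanks = {"de": list(range(10)), "sv": list(range(3))}  # toy sentence IDs
batches = list(balanced_epoch(treebanks, rng))
```

In this toy epoch only three of the ten German sentences are used, but a different random subset is drawn in each following epoch, so the larger treebank is still covered over time without dominating any single update.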

Target Languages with a Treebank ($L_t = L_s$)
Here, we evaluate our MALOPA parser when the target language has a treebank.
Baseline. For each target language, the strong baseline we use is a monolingually-trained S-LSTM parser with a token representation which concatenates: pretrained word embeddings (50 dimensions), 18 learned word embeddings (50 dimensions), coarse (universal) POS tag embeddings (12 dimensions), fine-grained (language-specific, when available) POS tag embeddings (12 dimensions), and embeddings of Brown clusters (12 dimensions), and uses a two-layer S-LSTM for each of the stack, the buffer and the list of actions. We independently train one baseline parser for each target language, and share no model parameters. This baseline, denoted 'monolingual' in Tables 3 and 7, achieves UAS score 93.0 and LAS score 91.5 when trained on the English Penn Treebank, which is comparable to Dyer et al. (2015).
18 These embeddings are treated as fixed inputs to the parser, and are not optimized towards the parsing objective. We use the same embeddings used in Guo et al. (2016).

MALOPA.
We train MALOPA on the concatenation of the training sections of all seven languages. To balance the development set, we only concatenate the first 300 sentences of each language's development section.

Token representations.
The first MALOPA parser we evaluate uses only coarse POS embeddings to construct the token representation. As shown in Table 3, this parser consistently underperforms the monolingual baselines, with a gap of 12.5 LAS points on average.
Augmenting the token representation with lexical embeddings (both multilingual word clusters and pretrained multilingual word embeddings, as described in §3.3) substantially improves the performance of MALOPA, recovering 83% of the gap in average performance.
We experimented with three ways to include language information in the token representation, namely: 'language ID', 'word order' and 'full typology' (see §3.4 for details), and found all three to improve the performance of MALOPA, giving LAS scores of 83.5, 83.2 and 82.5, respectively. It is noteworthy that the model benefits more from language ID than from typological properties. Using 'language ID', we recover another 12% of the original gap. Finally, the best configuration of MALOPA adds fine-grained POS embeddings to the token representation. 20 Surprisingly, adding fine-grained POS embeddings improves the performance even for some languages where fine-grained POS tags are not available (e.g., Spanish). This parser outperforms the monolingual baseline in five out of seven target languages, and wins on average by 0.3 LAS points. We emphasize that this model is only trained once on all languages, and the same model is used to parse the test set of each language, which simplifies the distribution or deployment of multilingual parsing software.
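As a rough check on these figures, the shares of the gap can be converted back into LAS points. This is a back-of-the-envelope sketch that uses only the rounded numbers quoted in the text (a 12.5-point gap, of which 83% and then another 12% are recovered), so the results are approximate and need not match Table 3 exactly.

```latex
\begin{align*}
\text{recovered by lexical embeddings} &\approx 0.83 \times 12.5 \approx 10.4 \text{ LAS points} \\
\text{recovered by language ID}        &\approx 0.12 \times 12.5 = 1.5 \text{ LAS points} \\
\text{remaining gap to the baseline}   &\approx 12.5 - 10.4 - 1.5 \approx 0.6 \text{ LAS points}
\end{align*}
```

On this estimate, the fine-grained POS embeddings would need to contribute roughly 0.6 + 0.3 ≈ 0.9 LAS points on average to turn the remaining gap into the reported 0.3-point average win.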

Qualitative analysis.
To gain a better understanding of the model behavior, we analyze certain classes of dependency attachments/relations in German, which has notably flexible word order, in Table 4. We consider the recall of left attachments (where the head word precedes the dependent word in the sentence), right attachments, root attachments, short attachments (with distance = 1), long attachments (with distance > 6), as well as the following relation groups: nsubj* (nominal subjects: nsubj, nsubjpass), dobj (direct object: dobj), conj (conjunct: conj), *comp (clausal complements: ccomp, xcomp), case (clitics and adpositions: case), *mod (modifiers of a noun: nmod, nummod, amod, appos), neg (negation modifier: neg).
Findings. We found that each of the three improvements (lexical embeddings, language embeddings and fine-grained POS embeddings) tends to improve recall for most classes. Unfortunately, MALOPA underperforms (compared to the monolingual baseline) in some classes: nominal subjects, direct objects and modifiers of a noun. Nevertheless, MALOPA outperforms the baseline in some important classes such as: root, long attachments and conjunctions.
20 Fine-grained POS tags were only available for English, Italian, Portuguese and Swedish. Other languages reuse the coarse POS tags as fine-grained tags instead of padding the extra dimensions in the token representation with zeros.
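The attachment classes used in this analysis can be computed from head indices alone. The sketch below follows the definitions in the text (left means the head precedes the dependent; short is distance 1; long is distance above 6) and scores unlabeled head recall, which is an assumption since the text does not say whether labels are also required to match. Function names and the toy sentence are hypothetical.

```python
from collections import Counter

def attachment_classes(dep, head):
    """Classes of one gold arc; positions are 1-based and head 0 is the root."""
    if head == 0:
        return {"root"}
    classes = {"left" if head < dep else "right"}
    distance = abs(head - dep)
    if distance == 1:
        classes.add("short")
    if distance > 6:
        classes.add("long")
    return classes

def recall_by_class(gold_heads, pred_heads):
    """Fraction of gold arcs in each class whose head was predicted correctly."""
    correct, total = Counter(), Counter()
    for dep, (g, p) in enumerate(zip(gold_heads, pred_heads), start=1):
        for c in attachment_classes(dep, g):
            total[c] += 1
            correct[c] += int(g == p)
    return {c: correct[c] / total[c] for c in total}

gold = [2, 0, 2, 3]   # toy 4-token sentence: token 2 is the root
pred = [2, 0, 2, 2]   # the parser attaches token 4 to the wrong head
recall = recall_by_class(gold, pred)
```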
Predicting language IDs and POS tags. In Table 3, we assume that both gold language ID of the input language and the gold POS tags are given at test time. However, this assumption is not realistic in practical applications. Here, we quantify the degradation in parsing accuracy when language ID and POS tags are only given at training time, but must be predicted at test time. We do not use fine-grained POS tags in these experiments because some languages use a very large fine-grained POS tag set (e.g., 866 unique tags in Portuguese).
In order to predict language ID, we use the langid.py library (Lui and Baldwin, 2012) 22 and classify individual sentences in the test sets to one of the seven languages of interest, using the default models included in the library. The macro average language ID prediction accuracy on the test set across sentences is 94.7%. In order to predict POS tags, we use the model described in §3.6 with both input and hidden LSTM dimensions of 60, and with block dropout. The macro average accuracy of the POS tagger is 93.3%. Table 5 summarizes the four configurations: {gold language ID, predicted language ID} × {gold POS tags, predicted POS tags}. The performance of the parser suffers mildly (-0.8 LAS points) when using predicted language IDs, but suffers significantly (-5.1 LAS points) when using predicted POS tags. As an alternative approach to predicting POS tags, we trained the Stanford POS tagger, for each target language, on the coarse POS tag annotations in the training section of the universal dependency treebanks, 23 then replaced the gold POS tags in the test set of each language with predictions of the monolingual tagger. The resulting degradation in parsing performance between gold vs. predicted POS tags is -6.0 LAS points (on average, compared to a degradation of -5.1 LAS points in Table 5). The disparity in parsing results with gold vs. predicted POS tags is an important open problem, and has been previously discussed in Tiedemann (2015).
The predicted POS results in Table 5 use block dropout. Without using block dropout, we lose an extra 0.2 LAS points in both configurations using predicted POS tags.
22 https://github.com/saffsd/langid.py
23 We used version 3.6.0 of the Stanford POS tagger, with the following pre-packaged configuration files: german-fast-caseless.tagger.props (de), english-caseless-left3words-distsim.tagger.props (en), spanish.tagger.props (es), french.tagger.props (fr). We reused french.tagger.props for (it, pt, sv).
Different multilingual embeddings. Several methods have been proposed for pretraining multilingual word embeddings. We compare three of them:
• multiCCA (Ammar et al., 2016) uses a linear operator to project pretrained monolingual embeddings in each language (except English) to the vector space of pretrained English word embeddings.
• multiCluster (Ammar et al., 2016) uses the same embedding for translationally-equivalent words in different languages.
• robust projection (Guo et al., 2016) first pretrains monolingual English word embeddings, then defines the embedding of a non-English word as the weighted average embedding of English words aligned to the non-English word (in a parallel corpus). The embedding of a non-English word which is not aligned to any English words is defined as the average embedding of words with a unit edit distance in the same language (e.g., 'playz' is the average of 'plays' and 'play'). 24
All embeddings are trained on the same data and use the same number of dimensions (100). Table 6 illustrates that the three methods perform similarly on this task. Aside from Table 6, in this paper, we exclusively use the robust projection multilingual embeddings trained in Guo et al. (2016). The "robust projection" result in Table 6 (which uses 100 dimensions) is comparable to the last row in Table 3 (which uses 50 dimensions).
24 Our implementation of this method can be found at https://github.com/gmulcaire/average-embeddings.
Table 6: Effect of multilingual embedding estimation method on the multilingual parsing with MALOPA. UAS and LAS scores are macro-averaged across seven target languages.
Small target treebank. Duong et al. (2015b) considered a setup where the target language has a small treebank of ∼3K tokens, and the source language (English) has a large treebank of ∼205K tokens. The parser proposed in Duong et al. (2015b) is a neural network parser based on Chen and Manning (2014), which shares most of the parameters between English and the target language, and uses an L2 regularizer to tie the lexical embeddings of translationally-equivalent words. While not the primary focus of this paper, 27 we compare our proposed method to that of Duong et al. (2015b) on five target languages for which multilingual Brown clusters are available from Guo et al. (2016). For each target language, we train the parser on the English training data in the UD version 1.0 corpus (Nivre et al., 2015b) and a small treebank in the target language. 28 Following Duong et al. (2015b), in this setup, we only use gold coarse POS tags, we do not use any development data in the target languages (we use the English development set instead), and we subsample the English training data in each epoch to the same number of sentences in the target language. We use the same hyperparameters specified before for the single MALOPA parser and each of the monolingual baselines. Table 7 shows that our method outperforms Duong et al. (2015b) by 1.4 LAS points on average. Our method consistently outperforms the monolingual baselines in this setup, with an average improvement of 5.7 absolute LAS points.
27 The setup cost involved in recruiting linguists, developing and revising annotation guidelines to annotate a new language ought to be higher than the cost of annotating 3K tokens. After investing substantial resources in a language, it is unrealistic to stop the annotation effort after 3K tokens only.
28 We thank Long Duong for sharing the processed, subsampled training corpora in each target language at https://github.com/longdt219/universal_dependency_parser/tree/master/data/universal-dep/universal-dep

Target Languages without a Treebank
McDonald et al. (2011) established that, when no treebank annotations are available in the target language, training on multiple source languages outperforms training on one (i.e., multi-source model transfer outperforms single-source model transfer).
In this section, we evaluate the performance of our parser in this setup. We use two strong baseline multi-source model transfer parsers with no supervision in the target language:
• Zhang and Barzilay (2015) is a graph-based arc-factored parsing model with a tensor-based scoring function. It takes typological properties of a language as input. We compare to the best reported configuration (i.e., the column titled "OURS" in Table 5 of Zhang and Barzilay, 2015).
• Guo et al. (2016) is a transition-based neural-network parsing model based on Chen and Manning (2014). It uses multilingual embeddings and Brown clusters as lexical features. We compare to the best reported configuration (i.e., the column titled "MULTI-PROJ" in Table 1 of Guo et al., 2016).
Following Guo et al. (2016), for each target language, we train the parser on six other languages in the Google universal dependency treebanks version 2.0 29 (de, en, es, fr, it, pt, sv, excluding whichever is the target language), and we use gold coarse POS tags. Our parser uses the same word embeddings and word clusters used in Guo et al. (2016), and does not use any typology information. 30 The results in Table 8 show that, on average, our parser outperforms both baselines by more than 1 point in LAS, and gives the best LAS results in four (out of six) languages.
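The evaluation protocol above can be sketched in two small pieces: choosing the source languages for a given target, and scoring predictions with the labeled attachment score (LAS). The helper names `source_languages` and `las` are hypothetical, and scoring every token (with no special treatment of punctuation) is an assumption; evaluation conventions differ between papers.

```python
# Language codes as listed in the text.
LANGUAGES = ["de", "en", "es", "fr", "it", "pt", "sv"]


def source_languages(target, languages=LANGUAGES):
    """Return the training languages for a target: every language except it."""
    if target not in languages:
        raise ValueError(f"unknown target language: {target}")
    return [lang for lang in languages if lang != target]


def las(gold, predicted):
    """Labeled attachment score in percent.

    `gold` and `predicted` are equal-length lists of (head_index, label)
    pairs, one per token. A token counts as correct only if both its head
    and its dependency label match the gold annotation.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    print(source_languages("de"))
    gold = [(2, "nsubj"), (0, "root"), (2, "dobj")]
    pred = [(2, "nsubj"), (0, "root"), (2, "iobj")]  # wrong label on token 3
    print(round(las(gold, pred), 2))
```

A difference of "1 point in LAS" therefore corresponds to one additional token per hundred receiving both the correct head and the correct label.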

Related Work
Our work builds on the model transfer approach, which was pioneered by Zeman and Resnik (2008) who trained a parser on a source language treebank then applied it to parse sentences in a target language.
Cohen et al. (2011) and McDonald et al. (2011) trained unlexicalized parsers on treebanks of multiple source languages and applied the parser to different languages. Naseem et al. (2012), Täckström et al. (2013), and Zhang and Barzilay (2015) used language typology to improve model transfer. To add lexical information, Täckström et al. (2012) used multilingual word clusters, while Xiao and Guo (2014), Søgaard et al. (2015) and Guo et al. (2016) used multilingual word embeddings. Duong et al. (2015b) used a neural-network-based model, sharing most of the parameters between two languages, and used an L2 regularizer to tie the lexical embeddings of translationally-equivalent words. We incorporate these ideas in our framework, while proposing a novel neural architecture for embedding language typology (see §3.4), and use a variant of word dropout (Iyyer et al., 2015) for consuming noisy structured inputs. We also show how to replace an array of monolingually-trained parsers with one multilingually-trained parser without sacrificing accuracy, which is related to Vilares et al. (2015).
Neural network parsing models which preceded Dyer et al. (2015) include Henderson (2003), Titov and Henderson (2007), Henderson and Titov (2010) and Chen and Manning (2014). Related to lexical features in cross-lingual parsing is Durrett et al. (2012), who defined lexico-syntactic features based on bilingual lexicons. Other related work includes Östling (2015), which may be used to induce more useful typological information to inform multilingual parsing.
29 https://github.com/ryanmcd/uni-dep-tb/
30 In preliminary experiments, we found language embeddings to hurt the performance of the parser for target languages without a treebank.
Another popular approach for cross-lingual supervision is to project annotations from the source language to the target language via a parallel corpus (Yarowsky et al., 2001; Hwa et al., 2005) or via automatically-translated sentences (Tiedemann et al., 2014). Ma and Xia (2014) used entropy regularization to learn from both parallel data (with projected annotations) and unlabeled data in the target language. Rasooli and Collins (2015) trained an array of target-language parsers on fully annotated trees, by iteratively decoding sentences in the target language with incomplete annotations. One research direction worth pursuing is to find synergies between the model transfer approach and the annotation projection approach.

Conclusion
We presented MALOPA, a single parser trained on a multilingual set of treebanks. We showed that this parser, equipped with language embeddings and fine-grained POS embeddings, on average outperforms monolingually-trained parsers for target languages with a treebank. This pattern of results is quite encouraging. Although languages share underlying syntactic properties, the individual parsing models must behave, on the surface, quite differently, and our model is able to do this while sharing parameters across languages. The value of this sharing is more pronounced in scenarios where the target language's training treebank is small or nonexistent, where our parser outperforms previous cross-lingual multi-source model transfer methods.