Multi-lingual Dependency Parsing Evaluation: a Large-scale Analysis of Word Order Properties using Artificial Data

The growing work in multi-lingual parsing faces the challenge of fair comparative evaluation and performance analysis across languages and their treebanks. The difficulty lies in teasing apart the properties of treebanks, such as their size or average sentence length, from those of the annotation scheme, and from the linguistic properties of languages. We propose a method to evaluate the effects of word order of a language on dependency parsing performance, while controlling for confounding treebank properties. The method uses artificially-generated treebanks that are minimal permutations of actual treebanks with respect to two word order properties: word order variation and dependency lengths. Based on these artificial data on twelve languages, we show that longer dependencies and higher word order variability degrade parsing performance. Our method also extends to minimal pairs of individual sentences, leading to a finer-grained understanding of parsing errors.


Introduction
Fair comparative performance evaluation across languages and their treebanks is one of the difficulties for work on multi-lingual parsing (Buchholz and Marsi, 2006; Nivre et al., 2007; Seddah et al., 2011). The differences in parsing performance can be the result of disparate properties of treebanks (such as their size or average sentence length), choices in annotation schemes, and the linguistic properties of languages. Despite recent attempts to create and apply cross-linguistic and cross-framework evaluation procedures (Seddah et al., 2013), there is no commonly used method of analysis of parsing performance which accounts for different linguistic and extra-linguistic factors of treebanks and teases them apart.
When investigating possible causal factors for observed phenomena, one powerful method, if available, consists in intervening on the postulated causes to observe possible changes in the observed effects. In other words, if A causes B, then changing A or properties of A should result in an observable change in B. This interventionist approach to the study of causality creates counterfactual data and a type of controlled modification that is wide-spread in experimental methodology, but that is not widely used in fields that rely on observational data, such as corpus-driven natural language processing.
In analyses of parsing performance, it is customary to manipulate and control word-level features, such as part-of-speech tags or morphological features. These types of features can be easily omitted or modified to assess their contribution to parsing performance. However, higher-order features, such as linear word order precedence properties, are much harder to define and to manipulate. A parsing performance analysis based on controlled modification of word order, in fact, has not been reported previously. We propose such a method based on word order permutations which allows us to manipulate word order properties analogously to familiar word-level properties and study their effect on parsing performance.
Specifically, given a dependency treebank, we obtain new synthetic data by permuting the original order of words in the sentences, keeping the unordered dependency tree constant. These permuted sentences are not necessarily grammatical in the original language. They constitute an alternative "language" which forms a minimal pair with the original one, where the only changed property is the order of words, but all the other properties of the unordered tree and the confounding variables between the two datasets are kept constant, such as the size of the training data, the average sentence length, the number of PoS tags and the dependency labels.
We perform two types of word order permutations to the treebanks in our sample: a permutation which minimises the lengths of the dependencies in a dependency tree and a permutation which minimises the variability of word order. We then compare how the parsing performances on the original and the permuted trees vary in relation to the quantified measures of the dependency length and word order variation properties of the treebanks. To quantify dependency length, we use the ratio of minimisation of the length of dependencies between words in the tree (dependency length minimisation, DLM (Gildea and Temperley, 2010)). To quantify the property intuitively referred to as variability of word order, we use the entropy of the linear precedence ordering between a head and a child in dependency arcs (Liu, 2010).
The reason to concentrate on these two word order properties comes from previous parsing results. Morphologically-rich languages are known to be hard for parsing, as rich morphology increases the percentage of new words in the test set (Nivre et al., 2007; Tsarfaty et al., 2010). These languages, however, also often exhibit very flexible word order. It has not so far been investigated how much rich morphology contributes to parsing difficulty compared to the difficulty introduced by word order variation in such languages. The length of the dependencies in the tree has also been shown to affect performance: almost all types of dependency parsers, to different degrees, show degraded performance for longer sentences and longer dependencies (McDonald and Nivre, 2011). We use arc direction entropy and DLM ratio, respectively, as the measures of these two word order properties because they are formally defined in the previous literature and can be quantified on a dependency treebank in any language.
To preview our results, in a set of pairwise comparisons between original and permuted treebanks, we confirm the influence of word order variability and dependency length on parsing performance, at the large scale provided by fourteen different treebanks across twelve different languages. Our results suggest, in addition, that word order entropy applies a stronger negative pressure on parsing performance than longer dependencies. Finally, on an example of one treebank, we show how our method can be extended to provide finer-grained analyses at the sentence level and relate the parsing errors to properties of the parsing architecture.

Parsing analysis using synthetic data
In this section, we introduce our new approach to using synthetic data for cross-linguistic analysis of parsing performance.

Methodology
Our experiments with artificial data consist in modifying a treebank $T$ to create its minimal pair $T'$ and evaluating parsing performance by comparing these pairs of treebanks. We create several kinds of artificial treebanks in the same manner: each sentence in $T'$ is a permutation of the words of the original sentence in $T$. We permute words in various ways according to the word order property whose effect on parsing we want to analyse. Importantly, we only change the order of words in a sentence: the dependency tree structure of a permuted sentence in $T'$ is the same as in the original sentence in $T$.
For each treebank in our sample of languages and each type of permutation, we conduct two parsing evaluations: $T_{Train} \rightarrow T_{Test}$ and $T'_{Train} \rightarrow T'_{Test}$. The training-test data split for $T$ and $T'$ is always the same, that is, $T'_{Train} = \mathrm{Permuted}(T_{Train})$ and $T'_{Test} = \mathrm{Permuted}(T_{Test})$. Parsing performance is measured as Unlabelled and Labelled Attachment Scores (UAS and LAS), the proportion of correctly attached arcs in the unlabelled or labelled tree, respectively.
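The two attachment scores can be illustrated with a minimal sketch. It assumes that each token is represented as a (head, label) pair and that the treebank-level score is computed over all tokens pooled together; the function name and the data layout are illustrative choices, not part of the parsers' own evaluation tools.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two aligned lists of (head, label) pairs.

    Assumption of this sketch: one pair per token, pooled over the test set.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    # UAS: the predicted head matches the gold head.
    uas_hits = sum(1 for (g_head, _), (p_head, _) in zip(gold, predicted)
                   if g_head == p_head)
    # LAS: both the head and the dependency label match.
    las_hits = sum(1 for g, p in zip(gold, predicted) if g == p)
    return uas_hits / len(gold), las_hits / len(gold)
```

Under this layout, a token whose head is correct but whose label is wrong counts towards UAS only, which is why LAS can never exceed UAS.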
Given the training-testing setup, the differences in unlabelled attachment scores $UAS(T'_{Test}) - UAS(T_{Test})$ can be directly attributed to the differences in word order properties $o$ between $T$ and $T'$, abstracting away from other treebank properties $h$. More formally, we assume that $UAS(T) = f(o_T, h_T)$ and $UAS(T') = f(o_{T'}, h_T)$. Except for the word order properties $o_T$ and $o_{T'}$, the two equations share all other treebank properties $h_T$ (such as size, average sentence length and size of the PoS tagset), and $f$ is a function that applies to all languages, here embodied by a given parser.
Our method can be further extended to analyse parsing performance at the sentence level. Consider the pair consisting of a sentence in an original treebank and its correspondence in a permuted treebank. The two sentences share all lexical items and underlying dependencies between them: the explanation for different parsing accuracies must be sought therefore in their different word orders. In standard treebank evaluation settings, instead, exact sentence-level comparisons are not possible, as two sentences very rarely constitute a truly minimal pair with respect to any specific syntactic property. Our approach opens up the possibility of deeper understanding of parsing behaviour at the sentence-level and even of individual dependencies based on large sets of minimal pairs.

Word order properties
To be able to compare parsing performance across the actual and the synthetic data, we must manipulate the causal property we want to study. In this work, we concentrate on variability of word order and length of dependencies. We define and discuss these two properties and their measures below.
Arc direction entropy. One dimension that can greatly affect parsing performance across languages is word order freedom, the ability languages have to express the same or similar meaning in the same context with a free choice of different word orders. The extent of word order freedom in a sentence is reflected in the entropy of word order, given the words and the syntactic structure of the sentence, $H(\mathit{order} \mid \mathit{words}, \mathit{tree})$.
One approximation of word order entropy is the entropy of the direction of dependencies in a treebank. This measure has been proposed in several recent works to quantitatively describe the typology of word order freedom in many languages (Liu, 2010; Futrell et al., 2015b).
Arc direction entropy can be used, for instance, to capture the difference between adjective-noun word order properties in Germanic and Romance languages. In English, this word order is fixed, as adjectives appear almost exclusively prenominally; the adjective-noun arc direction entropy will therefore be close to 0. In Italian, by contrast, the same adjective can both precede and follow nouns; the adjective-noun arc direction entropy will be greater than 0.
We calculate the overall entropy of arc directions in a treebank conditioned on the relation type defined by the dependency label $Rel$ and the PoS tags of the head $H$ and the child $C$:
(1) $H(Dir \mid Rel, H, C) = \sum_{rel,h,c} p(rel, h, c) \, H(Dir \mid Rel = rel, H = h, C = c)$
$Dir$ in (1) is the order between the child and the head in the dependency arc (Left or Right). In other words, we compute the entropy of arc direction $H(Dir) = -p(L) \log p(L) - p(R) \log p(R)$ for each observed tuple $(rel, h, c)$ independently and weight these entropies according to the tuple frequency in the corpus.
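The frequency-weighted entropy described above can be sketched as follows. The input layout (one tuple per dependency arc) and the function name are assumptions made for illustration, and so is the use of base-2 logarithms, since the base is not stated in the text.

```python
import math
from collections import Counter, defaultdict


def arc_direction_entropy(arcs):
    """Entropy of arc direction, conditioned on (rel, head PoS, child PoS).

    `arcs` is an iterable of (rel, head_pos, child_pos, direction) tuples,
    with direction equal to "L" or "R". Base-2 logs are an assumption.
    """
    by_tuple = defaultdict(Counter)
    total = 0
    for rel, head_pos, child_pos, direction in arcs:
        by_tuple[(rel, head_pos, child_pos)][direction] += 1
        total += 1
    if total == 0:
        raise ValueError("no arcs given")
    entropy = 0.0
    for direction_counts in by_tuple.values():
        n = sum(direction_counts.values())
        # Entropy of the direction for this tuple alone.
        h_dir = -sum((k / n) * math.log2(k / n)
                     for k in direction_counts.values())
        # Weight by the relative frequency of the tuple in the treebank.
        entropy += (n / total) * h_dir
    return entropy
```

A tuple that always occurs with the same direction contributes zero, so a treebank in which every tuple has a fixed order receives an entropy of zero, as in the English adjective-noun example.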
DLM ratio. Another property that has been shown to affect parsing performance across languages and across parsers is the length of the dependencies in the tree. A global measure of average dependency length of a whole treebank has been proposed in the literature on dependency length minimisation (DLM). This measure allows comparisons across treebanks with sentences of different size and across dependency trees of different topology.
Experimental and theoretical language research has yielded a large and diverse body of evidence showing that languages, synchronically and diachronically, tend to minimise the length of their dependencies (Hawkins, 1994; Gibson, 1998; Demberg and Keller, 2008; Tily, 2010). Languages differ, however, in the degree to which they minimise dependencies. A low degree of DLM is associated with flexibility of word order and in particular with high non-projectivity, i.e., the presence of crossing arcs in a tree, a feature that has been treated in dependency parsing using local word order permutations (Hajičová et al., 2004; Nivre, 2009; Titov et al., 2009; Henderson et al., 2013). To estimate the degree of DLM in a language, we follow previous work which analysed the dependency lengths in a treebank with respect to their random and minimal potential alternatives (Temperley, 2007; Gildea and Temperley, 2010; Futrell et al., 2015a).
We calculate the overall ratio of DLM in a treebank as shown in equation 2.
(2) $\mathrm{DLM\ Ratio} = \sum_s \frac{DL_s}{|s|^2} \Big/ \sum_s \frac{OptDL_s}{|s|^2}$
For each sentence $s$ and its dependency tree $t$, we compute the overall dependency length of the original sentence, $DL(s) = \sum_{arc \in t} DL(arc)$, and its minimal projective dependency length, $OptDL(s) = DL(s')$, where $s'$ is obtained by reordering the words in the sentence $s$ using the algorithm described in the next section (following Gildea and Temperley (2010)). To average these values across all sentences, we normalise them by $|s|^2$, since it has been observed empirically that the relation between the dependency lengths $DL$ and $OptDL$ and the length $|s|$ of a sentence is not linear, but rather quadratic (Ferrer-i-Cancho and Liu, 2014; Futrell et al., 2015a). [4]
In the next section, we illustrate how we create two pairs of $(T, T')$ treebanks, manipulating the two word order properties just discussed.
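The DLM ratio of equation 2 can be sketched in a few lines. The sketch assumes that a sentence is given as a list of 0-based head positions with -1 for the root, that the root contributes no arc, and that the optimal length of each sentence has already been computed elsewhere; all names are illustrative.

```python
def dependency_length(heads):
    """Sum of |position - head position| over all arcs of one sentence.

    Assumption: heads[i] is the 0-based position of the head of token i,
    and -1 marks the root, which contributes no arc.
    """
    return sum(abs(i - h) for i, h in enumerate(heads) if h >= 0)


def dlm_ratio(stats):
    """DLM ratio from per-sentence (DL, OptDL, sentence length) triples."""
    stats = list(stats)
    # Each sentence is normalised by the square of its length before summing.
    numerator = sum(dl / n ** 2 for dl, _, n in stats)
    denominator = sum(opt_dl / n ** 2 for _, opt_dl, n in stats)
    if denominator == 0:
        raise ValueError("the optimal dependency lengths sum to zero")
    return numerator / denominator
```

A treebank whose sentences are already in their optimal order yields a ratio of exactly 1, and any longer set of dependencies pushes the ratio above 1.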

Word order permutations
We create two types of permuted treebanks to optimise for the two word order parameters considered in the previous section.
[4] We follow previous work in using $DL(s)$ as the measure for DLM ratio calculation. Equivalently, we could use the average length of a single dependency $\langle DL(arc) \rangle$. Given that $DL(s) = |s| \cdot \langle DL(arc) \rangle$, the fact that $DL(s) \sim |s|^2$ can be more naturally stated as $\langle DL(arc) \rangle \sim |s|$: the average length of a single dependency is linear with respect to the sentence length.

Creating trees with optimal DL
Given a sentence $s$ and its dependency tree $t$ in a natural language, we employ the algorithm proposed by Gildea and Temperley (2010) to create a new artificial sentence $s'$ with a permuted order of words. The algorithm reorders the words in a sentence $s$ to yield the projective dependency tree with the minimal overall dependency length $DL(s')$. To do so, it recursively places the children on the left and on the right of the head in alternation, so that the children on the same side of the head are ordered by size, with the shortest phrases closest to the head. Children of the same size keep the relative order found in the original sentence.
This algorithm is deterministic and the dependency length of each sentence is optimised independently. We exclude from our analysis sentences with any non-final punctuation tokens and sentences with multiple roots. By definition, the DLM ratio for sentences permuted in such a way is equal to 1.
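The alternating placement can be sketched as below. This is a simplified illustration under stated assumptions rather than a reimplementation of the Gildea and Temperley (2010) algorithm: sentences are lists of 0-based head positions with -1 for the single root, and the rule that decides which side receives the largest child (the side away from the node's own head, or the right side for the root) is an assumption of this sketch, so the result is not guaranteed to reach the true minimum.

```python
def alternating_order(heads):
    """Return the original token indices in their new left-to-right order."""
    kids = [[] for _ in heads]
    root = None
    for i, h in enumerate(heads):
        if h < 0:
            root = i
        else:
            kids[h].append(i)
    size = [1] * len(heads)

    def measure(node):
        # Subtree size = the node itself plus all of its descendants.
        size[node] = 1 + sum(measure(k) for k in kids[node])
        return size[node]

    def place(node, head_side):
        # Largest child first; ties keep the original order.
        ranked = sorted(kids[node], key=lambda k: (-size[k], k))
        away, toward = ranked[0::2], ranked[1::2]
        # Assumption: the largest child goes on the side away from the head.
        left, right = (away, toward) if head_side == "R" else (toward, away)
        left = sorted(left, key=lambda k: (-size[k], k))    # smallest ends next to head
        right = sorted(right, key=lambda k: (size[k], k))   # smallest starts next to head
        order = []
        for child in left:
            order.extend(place(child, "R"))
        order.append(node)
        for child in right:
            order.extend(place(child, "L"))
        return order

    measure(root)
    return place(root, None)


def reorder_heads(heads, order):
    """Express the same tree over the new token positions."""
    new_pos = {old: new for new, old in enumerate(order)}
    new_heads = [0] * len(heads)
    for old, h in enumerate(heads):
        new_heads[new_pos[old]] = new_pos[h] if h >= 0 else -1
    return new_heads


def dependency_length(heads):
    return sum(abs(i - h) for i, h in enumerate(heads) if h >= 0)
```

For a head with three one-word children that all precede it, the sketch moves two of them to the other side of the head, which shortens the total dependency length while leaving the unordered tree unchanged.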

Creating trees with optimal Entropy
To obtain treebanks with a minimal arc direction entropy equal to zero, we can fix the order of each type of dependency, defined by a tuple (rel, h, c). There exist therefore many possible permutations resulting in zero arc direction entropy. We choose to assign the same direction (either Left or Right) to all the dependencies. This results in two permutations yielding fully right-branching (RB) and fully left-branching (LB) treebanks. We order the children on the same side of a head in the same way as in the OptDL permutation: the shortest children are closest to the head. For RB permutation, children of the same size are kept in the order of the original sentence; for LB permutation, this order is reversed, so that the RB and LB orders are symmetrical. These two permutations are particularly interesting, as they give us the two extremes in the space of possible tree-branching structures. Moreover, since the LB/RB word orders for each sentence are completely symmetrical, the two treebanks constitute a minimal pair with respect to the tree-branching parameter.
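The two zero-entropy orders can be sketched as follows. The sketch assumes that right-branching means every head precedes all of its children, that sentences are lists of 0-based head positions with -1 for the single root, and that the LB order can be obtained as the mirror image of the RB order; the function name is illustrative.

```python
def branching_order(heads, direction="RB"):
    """Return original token indices in fully RB or fully LB order."""
    if direction not in ("RB", "LB"):
        raise ValueError("direction must be 'RB' or 'LB'")
    kids = [[] for _ in heads]
    root = None
    for i, h in enumerate(heads):
        if h < 0:
            root = i
        else:
            kids[h].append(i)
    size = [1] * len(heads)

    def measure(node):
        size[node] = 1 + sum(measure(k) for k in kids[node])
        return size[node]

    def place(node):
        # Head first, then children from shortest to longest;
        # children of equal size keep their original order.
        order = [node]
        for child in sorted(kids[node], key=lambda k: (size[k], k)):
            order.extend(place(child))
        return order

    measure(root)
    rb = place(root)
    # Mirroring RB puts every child before its head, keeps the shortest
    # child next to the head and reverses the order of equal-size children.
    return rb if direction == "RB" else rb[::-1]
```

Because the LB order is produced by reversing the RB order, the two outputs have identical dependency lengths, which matches the symmetry described above.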
Importantly, there exist both predominantly right-branching (e.g. English) and left-branching natural languages (e.g. Japanese, Persian) and the comparison between LB/RB-permuted treebanks will show how much of the difference in parsing of typologically different natural languages can be attributed to their different branching directions. Of course, the parsing sensitivity to the parameter depends on the parsing architecture. As discussed in detail below, we investigate both graph-based and transition-based architectures. For a graph-based parser, we do not expect to observe much difference in parsing performance due to directionality, given its global optimisation strategy. On the other hand, a transition-based parser relies on left-to-right processing of words and the fully right-branching or fully left-branching orders can yield different results.

Dependency Treebanks
We use a sample of fourteen dependency treebanks for twelve languages. The treebanks for Bulgarian, English, Finnish, French, German, Italian and Spanish come from the Universal Dependencies project and are annotated with the same annotation scheme (Agić et al., 2015). We use the treebank for Dutch from the CoNLL 2006 shared task (Buchholz and Marsi, 2006). The Polish treebank is described in Woliński et al. (2011) and the Persian treebank in Rasooli et al. (2013). In addition, we use two Latin and two Ancient Greek dependency annotated texts (Haug and Jøhndal, 2008) because these languages are well-known for having very free word order. [6] The quantitative properties of these treebanks are presented in Table 1 (second and third column).
This set of treebanks includes those treebanks which had at least 3,000 sentences in their training set after eliminating sentences not fit for permutation (with non-final punctuation tokens or multiple roots). This excluded from our analysis some otherwise typologically interesting languages such as Basque and Arabic. Where available, we used the training-test split of a treebank provided by its distributors; in other cases we split the treebank randomly with a 9-to-1 training-test set proportion.
[6] The Latin corpora comprise works of Cicero (circa 40 BC) and Vulgate (Bible translation, 4th century AD). The Ancient Greek corpora are works of Herodotus (5th century BC) and New Testament (4th century AD). Despite the fact that they belong to the same language, these pairs of texts of different time periods show quite different word order properties.
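A random 9-to-1 split that stays identical for a treebank and its permuted counterpart can be sketched by splitting sentence indices rather than sentences. The function names, the fixed seed and the rounding rule are assumptions of this sketch.

```python
import random


def split_indices(n_sentences, test_fraction=0.1, seed=0):
    """Randomly split sentence indices into (train, test) lists."""
    if n_sentences < 2:
        raise ValueError("need at least two sentences to split")
    indices = list(range(n_sentences))
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_sentences * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def apply_split(treebank, train_idx, test_idx):
    """Select the same sentences from any treebank aligned by position."""
    return ([treebank[i] for i in train_idx],
            [treebank[i] for i in test_idx])
```

Applying the same index lists to the original and to each permuted treebank keeps every training and test sentence paired with its own permutation.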

Word order properties of original and permuted treebanks
Table 1 presents the treebanks in our sample and the values of DLM ratio and Entropy measures calculated on the training set of the original non-permuted treebanks. From these data, we confirm that the DLM ratio and Entropy measures capture different word order properties as they are not correlated (Spearman correlation r = 0.32, p > 0.1). For example, we can find languages with both low DLM ratio and high Entropy (Finnish) and high DLM ratio and low Entropy (Persian). Furthermore, these two measures are not a simple reflex of genetic similarity between languages of the same family: for example, Polish (Indo-European family) and Finnish (Finno-Ugric family) are clustered together according to their word order properties.
Table 1 also shows how the DLM ratio and Entropy values change when we apply the two permutations to the treebanks. For the treebanks permuted to obtain minimal dependency length (DLM ratio = 1), we present Entropy values in the column 'OptDL Entropy'. For the treebanks permuted to obtain minimal entropy (Entropy = 0), we present DLM ratio values in the column 'LB/RB DLM ratio'. With respect to the values of the original treebanks, the DLM ratio and Entropy values of the artificial treebanks are much more narrowly distributed: 1.17 ± 0.02 (mean ± SD) compared to 1.19 ± 0.07 for DLM ratio and 0.59 ± 0.03 compared to 0.27 ± 0.17 for Entropy.
Notice also that, on average, the treebanks in the LB/RB permuted set have both lower entropy and lower DLM ratio than the original treebanks. The treebanks in the OptDL set have lower DLM ratio, but also higher entropy than the original treebanks.

Parsing setup
To evaluate the impact of word order properties on parsing performance, we use MSTParser (McDonald et al., 2006) and MaltParser (Nivre et al., 2006), two widely used representatives of the two main dependency parsing architectures: a graph-based architecture and a transition-based architecture. The graph-based architecture is known to be less dependent on word order and dependency length than the transition-based one, as it searches the whole space of possible parse trees and solves a global optimisation problem (McDonald and Nivre, 2011). To achieve competitive performance, the transition-based MaltParser must be provided with a list of features tailored for each treebank and each language. We used the MaltOptimizer package (Ballesteros and Nivre, 2012) to find the best features based on the training set. By contrast, MSTParser is trained on all the treebanks in our sample using the default configuration (first-order projective).

Experiments and results
In this section, we illustrate the power of the technique and the fine-grained analyses supported by it with a range of planned, pairwise quantitative and qualitative analyses of the parsing results.
Table 2 presents the parsing results for MaltParser for the original treebanks and the three sets of permuted treebanks (OptDL, LB, RB). Table 3 presents the results on the same data for MSTParser. For MST, the parsing performances on the fully left-branching and right-branching treebanks are identical, as expected, when percentages are rounded to two digits, which is what we report here.
As discussed in the introduction, a comparison between parsers in a multilingual setting is not straightforward. Instead, we attempt to understand their common behaviour with respect to the word order properties of languages. The first observation is that, overall, all three sets of permuted data are easier to parse than the original data, for both parsers. We observe an increase of +1% and +6% UAS for OptDL and LB/RB data, respectively, for Malt, and an increase of +4% and +8% UAS for OptDL and LB/RB data, respectively, for MST. The better results on the LB/RB permuted data must be due to the observation above: the LB/RB data have both lower entropy and lower DLM ratio than the original treebanks.
Overall, the performance of the parsers on our artificial treebanks confirms that the lengths of the dependencies and the word order variability are two factors that negatively affect parsing accuracy. Two illustrative examples are Latin, a language well-known for its variable word order (as confirmed by the high entropy values of 0.42 and 0.43 for our two treebanks), and German, a language known for its long dependencies (as confirmed by its high DLM ratio of 1.24). For the Cicero text, for example, we can conclude that its variable word order is indeed the primary reason for the very low parsing performances (67%-68% UAS). These numbers improve significantly when the treebanks are rearranged in a fixed word order (87%-89% UAS). This permutation reduces DLM by 0.11 and reduces entropy by 0.42, yielding the very considerable increase in UAS of 21%. The other permutation, which optimises DL, reduces DLM by 0.26, but increases entropy by 0.19. This increase in entropy dampens the beneficial effect of DL reduction and performance increases by 12%, less than in the fixed-order permutation.
For German, our analysis gives the same overall results. The DLM ratio in the RB/LB scenario decreases slightly (from 1.24 to 1.21) and its entropy also decreases (-0.21). The performance of the parsers on RB/LB-permuted data is better than on the original data (89%-91% against 86%-87% UAS). Moreover, when DLM is reduced (-0.24, in the OptDL permutation), but entropy is increased (from 0.21 to 0.62), we find a reduction in performance for Malt (from 86% to 84% for UAS). These data weakly suggest that the word order variability of German, minimised in the RB/LB case, has a higher impact on parsing difficulty than its well-known long dependencies.
A more detailed picture emerges when we compare pairwise the original treebanks to the permuted treebanks for each of the languages. For this analysis, we use the measure of unlabelled accuracy, since attachment decisions are more directly dependent on word order than labelling decisions, which are mediated by correct attachments. Hence, we limit our analysis to the space of three parameters: DLM ratio, Entropy and UAS.
Figures 1 (OptDL) and 2 (RB) plot the differences in UAS of MaltParser between pairs of the permuted and the original treebanks for each language against the differences in DLM ratio and Entropy between these treebanks. Our dependent variable is $\Delta UAS = UAS(T') - UAS(T)$, computed from Table 2. The x-axis and y-axis values, $\Delta DLM = \mathrm{DLM\ Ratio}(T) - \mathrm{DLM\ Ratio}(T')$ and $\Delta Entropy = Entropy(T) - Entropy(T')$, are the differences of the measures between the original treebank and the permuted treebank, based on the numbers in Table 1. We have chosen to calculate these differences reversing the two terms, compared to the $\Delta UAS$ value, for better readability of the figures: an increase in the axis values (a larger reduction of entropy or dependency lengths) should correspond to a decrease in parsing difficulty and therefore to an increase of the dependent variable $\Delta UAS$. The same relative values of the measures and the parsing accuracy for MSTParser result in very similar plots, which we do not include here for reasons of space.
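The sign conventions of the three differences can be made explicit with a small sketch. The dictionary keys and the function name are illustrative, and the UAS values in the usage note are placeholders rather than figures from Table 2.

```python
def treebank_deltas(original, permuted):
    """Differences between an original treebank T and its permuted pair T'.

    Both arguments are dicts with the keys "uas", "dlm_ratio" and "entropy".
    """
    return {
        # Accuracy gain: permuted minus original.
        "d_uas": permuted["uas"] - original["uas"],
        # Word order measures: original minus permuted, so that a larger
        # value means a larger reduction achieved by the permutation.
        "d_dlm": original["dlm_ratio"] - permuted["dlm_ratio"],
        "d_entropy": original["entropy"] - permuted["entropy"],
    }
```

With the German values quoted in the text for the RB/LB case (DLM ratio 1.24 against 1.21, Entropy 0.21 against 0) and placeholder UAS values of 86 and 89, all three differences come out positive, which places the language in the upper right region of the plot.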
For the OptDL data (Figure 1), the overall picture is very coherent: the more DLs are minimised and the less entropy is added to the artificial treebank, the larger the gain in parsing performance (blue circles in the lower left corner and red circles in the upper right corner). Again, we observe an interaction between DLM ratio and Entropy parameters: for the languages with originally relatively low DLM ratio and low Entropy, such as English or Spanish, the performance on the permuted data decreases. This is because while DLM decreases, Entropy increases and, for this group of languages, the particular tradeoff between these two properties leads to lower parsing accuracy.
RB-permuted data show similar trends (Figure 2). An interesting regularity is shown by four languages (Latin Vulgate, Ancient Greek New Testament, Dutch and Persian) on the off-diagonal. Although they have different relative Entropy and DLM ratio values, which span from near minimal to maximal values, the improvement in parsing performance on these languages is very similar (as indicated by the same purple colour). This again strongly points to the fact that both DLM ratio and Entropy contribute to the observed parsing performance values.
We can further confirm the effect of dependency length by comparing the parsing accuracy across sentences. Consider the Dutch treebank and its RB-permuted pair. For each sentence and its permuted counterpart, we can compute the difference in their dependency lengths ($\Delta DLM = DLM - DLM_{RB}$) and compare it to the difference in parsing performance ($\Delta UAS = UAS_{RB} - UAS$). We expect to observe that $\Delta UAS$ increases when $\Delta DLM$ increases. Indeed, the parsing results on Dutch show a positive correlation between these two values (r = 0.40, p < 0.001 for Malt and r = 0.55, p < 0.001 for MST).
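The sentence-level comparison can be sketched as follows. The sketch assumes that r is a Pearson correlation coefficient, which the text does not state for this analysis, and that each sentence pair is given as a tuple of dependency lengths and UAS values; the names and the layout are illustrative.

```python
import math


def sentence_deltas(pairs):
    """Per-sentence differences from (dl, dl_rb, uas, uas_rb) tuples."""
    d_dl = [dl - dl_rb for dl, dl_rb, _, _ in pairs]
    d_uas = [uas_rb - uas for _, _, uas, uas_rb in pairs]
    return d_dl, d_uas


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A positive coefficient on the two lists returned by `sentence_deltas` corresponds to the expectation stated above: sentences whose dependencies are shortened more by the permutation also gain more in accuracy.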
All these analyses confirm and quantify that dependency length and, more significantly, word order variability affect parsing performance.

Sentence-level analysis of parsing performance
Referring back to the results in Table 2, we observe that MaltParser shows the same average accuracy for RB and LB-permuted data. However, some languages show significantly different results between their LB and RB-permuted data, especially in their labelled accuracy scores. The New Testament corpus, for example, is much easier to parse when it is rearranged in left-branching order (91% RB vs 93% LB UAS, 73% RB vs 85% LB LAS). Our artificial data allow us to investigate this difference in the scores by looking at parsing accuracy at the sentence level.
The differences in Malt accuracies on RB-permuted and LB-permuted data are striking, because these data have the same head-direction entropy and dependency length properties. The only word order difference is in the branching parameter, resulting in two completely symmetrical word orders for each sentence of the original treebank. To understand the behaviour of MaltParser, and of transition-based parsers in general, we looked at the out-degree, or branching factor, of the syntactic trees. The intuition is that when many children appear on one side of a head, the parser behaviour on head-final and head-initial orders can diverge due to sequences of different operations, such as shift versus attach, that must be chosen in the two cases.
The data for the New Testament show that the branching factor plays a role in the LB/RB differences found in this treebank. For each pair of sentences with LB/RB orders, we computed the parsing accuracies (UAS and LAS) and the branching factor as the average out-degree of the dependency tree. We then tested whether the better performance on the LB data is correlated with the branching factor across the sentences ($UAS_{LB} - UAS_{RB} \sim BF$). The Pearson correlation for UAS values was 0.08 (p = 0.02), but for LAS values the correlation was 0.30 and highly significant (p < 0.001). On sentences with larger branching factors, the labelled accuracy scores on the LB data were better compared to the RB data.
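The branching factor of a sentence can be sketched as below. The text defines it as the average out-degree of the dependency tree; this sketch assumes that the average is taken over the nodes that have at least one child, since an average over all nodes would always equal (n - 1) / n for a tree of n tokens. The head encoding (0-based positions, -1 for the root) and the function name are also illustrative.

```python
from collections import Counter


def branching_factor(heads):
    """Average number of children over the nodes that have children."""
    # Count how many tokens point to each head position.
    out_degree = Counter(h for h in heads if h >= 0)
    if not out_degree:
        return 0.0  # a single-token sentence has no arcs
    return sum(out_degree.values()) / len(out_degree)
```

Under this assumption, a flat tree in which one head governs every other word receives a high value, while a chain in which each word has a single child receives a value of 1.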
We combine our result for the branching factor with an observation based on the confusion matrix of the labels, to provide a more accurate explanation of the comparatively low LAS in the RB-permuted treebank of the New Testament corpus. We found that when a verb or a noun has several one-word children, such as 'aux' (auxiliaries), 'atr' (attributes), 'obl' (obliques), 'adv' (adverbs) etc., these are frequently confused and receive the wrong label if they appear after the head (RB data), but the labels are assigned correctly if these elements appear before the head (LB data). It appears that the leftward placement of children is advantageous for the transition-based MaltParser, as at the moment of the first attachment decision for the child closest to the head it has access to a larger left context. When children appear after the head, the first one is attached before any other children are seen by the parser and the labelling decision is less informed, leading to more labelling errors.
It should be noted that it is not always possible to identify a single source of difficulty in the error analysis. In contrast to the New Testament, Spanish is easier to parse when it is rearranged into the right-branching order (88% RB vs 85% LB UAS, 80% RB vs 76% LB LAS). However, the types of difficult dependencies emerging from the different branching of the LB/RB data were not similar or symmetric to those of the New Testament. In the case of Spanish, we did not observe a distinct dimension of errors which would explain the 4% difference in UAS scores. [9]

General discussion
Our results highlight both the contributions and the challenges of the proposed method. On the one hand, the results show that we can identify and manipulate word order properties of treebanks to analyse the impact of these properties on parsing performance and suggest avenues to improve it. In this respect, our framework is similar to standard analyses of parsing performance based on separate manipulations of individual word-level features (such as omitting morphological annotation or changing coarse PoS tags to fine PoS tags). Similarly to these evaluation procedures, our approach can lead to improved parsing models or a better choice of parsing model by finding out their strengths and weaknesses. The performance of Malt and MST (Tables 2 and 3), while not directly comparable due to differences in the training set-up (Malt features are optimised for each language and permutation), shows that MST performs better on average on permuted datasets than Malt. This can suggest that MST handles the high entropy of the OptDL permuted set as well as the long dependencies of LB/RB permuted sets better, or, conversely, that MaltParser does not perform well on treebanks with high word order variability between the children attached to the same head (see Section 4.2). When two parsing systems are known to have different strengths and weaknesses, they can be successfully combined in an ensemble model for more robust performance (Surdeanu and Manning, 2010).
[9] Overall, the variance in the LB/RB performances on Spanish is relatively high and the mean difference (computed across UAS scores for sentences) is not statistically significant (t-test: p > 0.5), a result we would expect if errors cannot be imputed to clear structural factors.
A contribution of the parsing performance analyses in a multilingual setting is the identification of difficult properties of treebanks. For the Cicero and Herodotus texts, for example, our method reveals that their word order properties are important reasons for the very low parsing performance. This result confirms intuition, but it could not be firmly concluded without factoring out confounds such as the size of the training set or the dissimilarity between the training and test sets, which could also be reasons for low parsing performance. For German, our analysis gives more unexpected results and allows us to conclude that the variability of word order has a more negative effect on parsing performance than long dependencies. Together, the knowledge of word order properties of a language and the knowledge of parsing performance related to these properties give us an a priori estimation of which parsing system could be better suited for a particular language.
On the other hand, our method also raises some complexities. Compared to commonly used parsing performance analyses related to word-level features, the main challenges to a systematic analysis of word order lie in its multifactorial nature and in the large choice of quantifiable properties correlated with parsing performance. First, the multifactorial nature of word order precludes one from considering word order properties separately. The two properties we have looked at (DLM ratio and arc direction entropy) cannot be teased apart completely, since minimising one property leads to an increase in the other.
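As a rough illustration of the second of these properties, arc direction entropy can be estimated from a list of labelled arcs. The sketch below is a simplified version that conditions only on the dependency label; the function name `arc_direction_entropy`, the input format and the toy arcs are assumptions made for illustration, and the measure actually used in the experiments is the one defined earlier in the paper.

```python
import math
from collections import Counter, defaultdict


def arc_direction_entropy(arcs):
    """Weighted average entropy (in bits) of arc direction given the label.

    arcs: iterable of (label, direction) pairs, where direction is "L" if
    the dependent precedes its head and "R" otherwise (an assumed encoding).
    """
    by_label = defaultdict(Counter)
    for label, direction in arcs:
        by_label[label][direction] += 1

    total = sum(sum(counts.values()) for counts in by_label.values())
    if total == 0:
        return 0.0

    entropy = 0.0
    for counts in by_label.values():
        n = sum(counts.values())
        # Entropy of the left/right choice for this label.
        h = -sum((k / n) * math.log2(k / n) for k in counts.values())
        # Weight each label by its relative frequency in the treebank.
        entropy += (n / total) * h
    return entropy


# Toy arcs (made up): subjects vary in direction, objects never do.
toy_arcs = [("nsubj", "L"), ("nsubj", "R"), ("obj", "R"), ("obj", "R")]
toy_entropy = arc_direction_entropy(toy_arcs)
```

A treebank in which every label always attaches in the same direction receives an entropy of zero under this measure, which is the situation that a fully left-branching or right-branching permutation enforces.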
Another challenge is due to the fact that formal quantitative approaches to studying word order variation cross-linguistically are just beginning to appear and not all word order features relevant for parsing performance have been identified. In particular, our results suggest that the relative order between the children (and not only the order between heads and their children) should be taken into account (Section 4.2). However, we are not aware of previous work which proposes a measure for this property and describes it typologically on a large scale.
Finally, our method, which consists in creating artificial treebanks, can prove useful beyond parsing evaluation. For instance, our data could enrich the training data for tasks such as de-lexicalized parser transfer. Word order properties play an important role in computing similarity between languages and finding the source language leading to the best parser performance in the target language (Naseem et al., 2012;Rosa and Zabokrtsky, 2015). A possibly large artificially permuted treebank with word order properties similar to the target language could then be a better training match than a small treebank of an existing target natural language.

Related work
Much previous work has been dedicated to the evaluation of parsing performance, also in a multilingual setting. The shared tasks in multilingual dependency parsing (Buchholz and Marsi, 2006;Nivre et al., 2007) and parsing of morphologically-rich languages (Tsarfaty et al., 2010;Seddah et al., 2013) collected a large set of parsing performance results. Some steps towards comparability of the annotations of multilingual treebanks and the parsing evaluation measures were proposed and undertaken in , Seddah et al. (2013) and, most recently, in the collaborative Universal Dependencies effort (de Marneffe et al., 2014;Nivre et al., 2016). However, little work has suggested an analysis of the differences in parsing performance across languages in connection with the word order properties of treebanks. Some papers have analysed the impact of dependency lengths on parsing performance in English. McDonald and Nivre (2011) demonstrated that parsers make more mistakes in longer sentences and on longer dependencies. Rimell et al. (2009) and Bender et al. (2011) created benchmark test sets of constructions containing long dependencies, such as subject and object relative clauses, and analysed parsing behaviour on these selected constructions. Other analyses on long-distance dependencies can be found in Nivre et al. (2010) and Merlo (2015). We are not familiar with any similar analysis of parsing performance in English addressing other word order variation properties (e.g. head-direction entropy).
In , the parsing performance on several Latin and Ancient Greek texts is analysed with respect to the dependency length and, indirectly, the head-direction entropy. The authors compare the parsing performance across texts of the same language (Latin or Ancient Greek) from separated historical periods which differ slightly in their word order properties. 10   show that texts with longer dependencies and more varied word order are harder to parse. Assuming the same lexical material of the texts, their particular setting allows a more direct comparison of parsing performance than a standard multilingual setting where languages differ in many aspects other than word order.
The calculation of the minimal dependency length through the permutation of a dependency treebank was proposed in the work of Temperley and Gildea (Temperley, 2007;Gildea and Temperley, 2010). In this work and the following work of Futrell et al. (2015a), several types of permutations were employed to compute different lower bounds on dependency length minimisation in English and across dozens of languages.
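To make the idea of a permutation-based lower bound concrete, the following sketch computes the observed dependency length of a sentence and, by exhaustive search, the smallest length reachable by reordering its words. It is not the algorithm of the cited work: the brute-force search, the function names and the 0-based head encoding are assumptions chosen for clarity, the search also visits non-projective orders, and it is only feasible for very short sentences.

```python
from itertools import permutations


def dependency_length(heads, position):
    """Sum of |position of dependent - position of head| over non-root arcs.

    heads[d] is the 0-based index of the head of word d, or -1 for the root.
    position[w] is the slot that word w occupies in the linear order.
    """
    return sum(abs(position[d] - position[h])
               for d, h in enumerate(heads) if h != -1)


def observed_dependency_length(heads):
    """Dependency length of the sentence in its original word order."""
    return dependency_length(heads, list(range(len(heads))))


def min_dependency_length(heads):
    """Smallest dependency length over all orderings (exhaustive search)."""
    best = None
    for order in permutations(range(len(heads))):
        position = [0] * len(heads)
        for slot, word in enumerate(order):
            position[word] = slot
        length = dependency_length(heads, position)
        if best is None or length < best:
            best = length
    return best


# Toy tree (made up): a head followed by its three dependents.
star = [-1, 0, 0, 0]
observed = observed_dependency_length(star)
minimal = min_dependency_length(star)
```

For the toy tree, moving the head between its dependents shortens the arcs, which is the kind of gain that dependency length minimisation measures.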
Artificially permuted treebanks were previously used in Fong and Berwick (2008) as stress-test diagnostics for cognitive plausibility of parsing systems. In particular, Fong and Berwick (2008) permuted the order of words in the English Penn Treebank to obtain 'unnatural' languages. Their permutations included transformations to head-final and head-initial orders (applied with 50%-50% proportion to sentences in the treebank) and reversing the respective order of complements and adjuncts. The parsing performances on these permuted treebanks were 0.5-1 point lower than on the original treebank, which the authors interpreted as too accurate to be a cognitively plausible behaviour for a model of the human parser. From the perspective of our paper, the permuted treebanks of Fong and Berwick (2008) were constructed to have longer dependencies and higher word order variation; the lower performances are therefore in agreement with our own results.
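A head-final or head-initial permutation of a single dependency tree can be sketched as a recursive traversal. The code below is an illustrative assumption rather than the procedure of Fong and Berwick (2008) or our own permutation scripts: the function name `linearize`, the 0-based head encoding and the choice to keep siblings in their original relative order are all hypothetical.

```python
def linearize(heads, head_final):
    """Return the word indices of a tree in head-final or head-initial order.

    heads[i] is the 0-based index of the head of word i, or -1 for the root.
    The tree is assumed to be well formed with exactly one root.
    """
    children = {i: [] for i in range(len(heads))}
    root = None
    for i, h in enumerate(heads):
        if h == -1:
            root = i
        else:
            children[h].append(i)

    def visit(node):
        # Linearise every subtree first, keeping siblings in original order.
        subtree = [w for child in children[node] for w in visit(child)]
        # Head-final: the head follows all of its dependents.
        # Head-initial: the head precedes all of its dependents.
        return subtree + [node] if head_final else [node] + subtree

    return visit(root)


# Toy sentence (made up): "dogs chase cats", with "chase" as the root.
words = ["dogs", "chase", "cats"]
heads = [1, -1, 1]
head_final_order = [words[i] for i in linearize(heads, head_final=True)]
head_initial_order = [words[i] for i in linearize(heads, head_final=False)]
```

Both outputs are projective by construction, since each subtree is emitted as one contiguous block.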

Conclusions
We have proposed a method to analyse parsing performance cross-linguistically. The method is based on the generation and the evaluation of artificial data obtained by permuting the sentences in a natural language treebank. The main advantage of this approach is that it teases apart the linguistic factors from the extra-linguistic factors in parsing evaluation.
First, we have shown how this method can be used to estimate the impact of two word order properties (dependency length and head-direction entropy) on parsing performance. Previous observations that longer dependencies are harder to parse are confirmed on a much larger scale than before, while controlling for confounding treebank properties. We have also found that variability of word order is an even more prominent factor affecting performance.
Second, we have shown that the construction of artificial data opens a new way to analyse the behaviour of parsers using sentence-level observations. Sentence-level evaluations could be a very powerful tool for detailed investigations of how syntactic properties of languages affect parsing performance and could help create more cross-linguistically valid parsing techniques.
Two avenues are open for future work. First, we will investigate more properties related to word order. Specifically, we will apply the method to the non-projectivity property. On the one hand, dependency lengths and non-projectivity are correlated properties, as predicted theoretically (Ferrer-i-Cancho, 2006). Our data confirm this relation empirically: the Pearson correlation between DLM ratio and the percentage of non-projective dependencies across treebanks is 0.66 (p < 0.02). On the other hand, this correlation is not perfect, and both dependency length and non-projectivity should be taken into account to fully explain the variation in parsing performance.
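The two quantities involved in this correlation can be computed along the following lines. This is a minimal sketch under stated assumptions: the function names, the 1-based head encoding and the toy inputs are hypothetical, an arc counts as non-projective when some word between head and dependent is not dominated by the head, and no real treebank values are reproduced here.

```python
import math


def nonprojective_share(heads):
    """Fraction of non-root arcs that are non-projective.

    heads[i] is the 1-based position of the head of word i+1; 0 marks the
    root. The input is assumed to be a well-formed tree (no cycles).
    """
    def dominated_by(node, ancestor):
        # Climb from node towards the root; succeed if ancestor is met.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    arcs = [(h, d) for d, h in enumerate(heads, start=1) if h != 0]
    crossing = 0
    for head, dep in arcs:
        lo, hi = sorted((head, dep))
        # Non-projective if any intervening word is outside the head's subtree.
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            crossing += 1
    return crossing / len(arcs) if arcs else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy tree (made up): the arc from word 4 to word 2 crosses over the root.
share = nonprojective_share([3, 4, 0, 3])
```

Applied per treebank, `nonprojective_share` would give one value per language, and `pearson` would then relate those values to the corresponding DLM ratios.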
Second, we have not attempted in the current work to estimate the function f (see Section 2.1). This task is equivalent to automatic prediction of parsing accuracy of a treebank based on its properties. Ravi et al. (2008) have proposed an accuracy prediction method for one language (English) based on simple lexical and syntactic properties. Combining their insights with our analysis of word order could lead to a first language-independent approximation of f.