Joint Transition-Based Models for Morpho-Syntactic Parsing: Parsing Strategies for MRLs and a Case Study from Modern Hebrew

In standard NLP pipelines, morphological analysis and disambiguation (MA&D) precedes syntactic and semantic downstream tasks. However, for languages with complex and ambiguous word-internal structure, known as morphologically rich languages (MRLs), it has been hypothesized that syntactic context may be crucial for accurate MA&D, and vice versa. In this work we empirically confirm this hypothesis for Modern Hebrew, an MRL with complex morphology and severe word-level ambiguity, in a novel transition-based framework. Specifically, we propose a joint morphosyntactic transition-based framework which formally unifies two distinct transition systems, morphological and syntactic, into a single transition-based system with joint training and joint inference. We empirically show that MA&D results obtained in the joint settings outperform MA&D results obtained by the respective standalone components, and that end-to-end parsing results obtained by our joint system present a new state of the art for Hebrew dependency parsing.


Introduction
NLP research in recent years has shown increasing interest in parsing typologically different languages, as evidenced, for instance, by the Universal Dependencies initiative (Nivre et al., 2016). In particular, much attention has been drawn to parsing morphologically rich languages (MRLs), which differ significantly from English in their structure and characteristics (Tsarfaty et al., 2010).
In MRLs, grammatical information, typically expressed using word order in English, is often manifested in the internally complex structure of the words. Words in MRLs may carry, in addition to lexical content, functional affixes and clitics that correspond to additional pieces of information. In Modern Hebrew, for example, the inflected verb ''ahbtih'' (loved + 1pers.singular.past + 3pers.feminine.singular) corresponds to three different grammatical functions: the subject ''I,'' the predicate ''loved,'' and the direct object ''her.'' Similarly, Spanish dámelo corresponds to a predicate, an indirect object, and a direct object, as in ''give it to me.'' Thus, in MRLs, morphological analysis (MA), which translates raw space-delimited tokens to syntactically relevant ''word'' units, is a necessary condition for any syntactic or semantic downstream task.
However, raw space-delimited tokens in MRLs are often highly ambiguous. In Hebrew, Arabic, and other Semitic languages, this situation is further complicated by the fact that written texts lack diacritics. The Hebrew token ''fmn,'' for instance, may be read as the noun ''oil,'' the adjective ''fat,'' the verb ''lubricated,'' the sequence ''that''+''of,'' or the phrase ''their''+''name,'' only one of which is relevant in context. This has clear ramifications for dependency parsing. Figure 1 shows a lattice that captures all possible analyses of the Hebrew phrase ''bclm hneim,'' literally: ''in-the-shadow-of-them the-pleasant,'' translated ''in their pleasant shadow.'' Each lattice arc corresponds to a potential node in a dependency tree. Dark circles mark morpheme boundaries, and double circles mark token boundaries. The top tree depicts a correct syntactic analysis. In the bottom tree, incorrectly disambiguated tokens lead to a wrong syntactic analysis.
Previous dependency parsing evaluation campaigns (Buchholz and Marsi, 2006; Nivre et al., 2007) assumed that the correct morphological analysis and disambiguation (MA&D) of the input stream is known in advance. In realistic end-to-end parsing scenarios, however, this is of course not so. To overcome this, pipeline architectures where MA&D precedes parsing have been set up. These pipelines are suboptimal since they suffer from error propagation, and since the local linear context available for automatic MA&D may be insufficient for accurate morphological disambiguation. For this, actual syntactic context may be required (Tsarfaty, 2006). To resolve this apparent loop, where morphological analysis is required for syntactic parsing and syntactic analysis is required for morphological disambiguation, Tsarfaty (2006) hypothesised that joint morphosyntactic parsing, where morphological information may assist syntactic disambiguation and vice versa, may be better suited.
This joint morphosyntactic hypothesis has been taken up and successfully confirmed in the context of phrase-structure parsing for Semitic languages (Goldberg and Tsarfaty, 2008; Cohen and Smith, 2007; Green and Manning, 2010). For dependency parsing, Bohnet and Nivre (2012) and Bohnet et al. (2013) present language-agnostic transition-based frameworks for jointly parsing and tagging input words, though without addressing the complex issue of retokenizing ambiguous input tokens. More recently, Seeker and Centinoglu (2015) presented a graph-based framework for lattice parsing of Turkish that also covers morphological segmentation. Their system takes a ''product of experts'' approach wherein the morphological paths and dependency trees are handled via two distinct models (a linear model over bigrams for MD and an arc-factored model for dependencies), reaching agreement via a dual decomposition setup.
In this work, we present a novel, language-agnostic, transition-based framework for end-to-end morphosyntactic dependency parsing. The framework unifies a morphological and a syntactic component into a joint parser encompassing a single transition system, a single objective function, joint learning, and joint decoding. We apply this system to parsing Modern Hebrew and empirically confirm that predicting MA&D in the joint settings improves upon standalone MA&D, and upon recently reported Hebrew MA&D results. Our system further improves end-to-end dependency parsing results in comparison to existing state-of-the-art parsers in pipeline scenarios. It significantly outperforms the joint parser of Seeker and Centinoglu (2015), and it substantially outperforms the dependency parser of Goldberg and Elhadad (2010), so far considered the de facto standard for Hebrew dependency parsing.
The contribution of this paper is thus threefold. First, we define a language-agnostic joint morphosyntactic parser in a transition-based framework. Second, we empirically confirm that MA&D benefits from syntactic parsing, and that in realistic end-to-end parsing scenarios the reverse also holds. Finally, we present a new set of strong Hebrew end-to-end parsing results and deliver an open-source, language-agnostic implementation of the joint parser, for further investigating joint morphosyntactic parsing strategies.

This paper is organized as follows. In Section 2, we present our formal framework (2.1), morphological model (2.2), syntactic model (2.3), and joint framework (2.4). Sections 3 and 4 present our experiments and analysis, respectively. Section 5 discusses related and future work, and Section 6 concludes.

The Proposal: Transition-Based Joint Morpho-Syntactic Parsing

Formal Settings
We cast end-to-end morphosyntactic parsing as a structure prediction function $F : X \to Y$, where $x \in X$ is a sequence of raw input tokens and $y \in Y$ is a dependency representation where the nodes in the tree correspond to disambiguated morphosyntactic units we refer to as morphemes. We assume that $F$ is realized in a transition-based framework augmented with the structure prediction method of Zhang and Clark (2011). We start off with a completely general definition of a transition system as a quadruple $S = (C, T, c_s, C_t)$, with $C$ a set of configurations, $T$ a set of transitions, $c_s$ an initialization function, and $C_t \subset C$ a set of terminal configurations. We then define different instantiations of $S$ for the different (morphological, syntactic, morphosyntactic) parsing tasks. In each instantiation, a transition sequence $y$ for $x$ is a sequence of configurations that are obtained by applying transitions $t_1 \ldots t_n \in T$ sequentially. That is, starting with an initial configuration $c_0 = c_s(x)$, we find $y = c_0, \ldots, c_n$ such that $c_{i+1} = t_{i+1}(c_i)$ and $c_n \in C_t$. Thus, each $y$ depicts a sequence of decisions that constructs a valid analysis for $x$ at the relevant linguistic level.
For each task we employ an objective function $F(x)$ as follows, where $GEN(x)$ holds all the transition sequences that generate relevant candidates:

$$F(x) = \operatorname{argmax}_{y \in GEN(x)} Score(y)$$

To compute $Score(y)$, $y$ is mapped to a global feature vector $\Phi(y)$ of size $d$, which is multiplied by a weights vector $\omega$ of the same size, so that $Score(y) = \Phi(y) \cdot \omega$. The global feature vector $\Phi(y)$ consists of local feature vectors, each of which is defined via a set of functions $\{\phi_i : C \to \mathbb{N}\}_{i=1}^{d}$ which count the occurrences of a prespecified pattern in a given configuration in $y$. Following Zhang and Clark (2011), we learn the weights vector $\omega \in \mathbb{R}^d$ via the generalized perceptron using the early-update averaged variant of Collins and Roark (2004).
Decoding is based on the beam search algorithm, where a number of high-scoring candidate sequences are maintained in the beam in order to mitigate irrecoverable prediction errors that characterize greedy search procedures. At each step, the transition system applies all transitions to all candidates, and keeps the B highest-scoring candidates. During learning, the perceptron algorithm iterates through a gold-annotated corpus. Each sentence is parsed (decoded) with the last known weights, and if the parsed result differs from the gold, the weights are updated. The learning is stopped when overfitting begins.
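The decoding procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (which is written in Go); the toy task, the feature templates in `toy_features`, and the weights in `toy_weights` are hypothetical and chosen only to show how a wider beam can recover from an early locally attractive decision. Learning (the perceptron with early update) is left out.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Config = Tuple[str, ...]  # toy configuration: the transitions applied so far


def score(features: Counter, weights: Dict[Hashable, float]) -> float:
    """Dot product of sparse feature counts with the weights vector."""
    return sum(weights.get(name, 0.0) * count for name, count in features.items())


def beam_search(
    start: Config,
    transitions: Callable[[Config], Iterable[str]],
    is_terminal: Callable[[Config], bool],
    extract: Callable[[Config], Counter],
    weights: Dict[Hashable, float],
    beam_size: int,
) -> Tuple[float, Config]:
    """Keep the beam_size highest-scoring candidates after every step."""
    beam: List[Tuple[float, Config]] = [(0.0, start)]
    while not all(is_terminal(c) for _, c in beam):
        candidates: List[Tuple[float, Config]] = []
        for total, config in beam:
            if is_terminal(config):
                candidates.append((total, config))  # finished; carry over
                continue
            for t in transitions(config):
                nxt = config + (t,)
                candidates.append((total + score(extract(nxt), weights), nxt))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_size]
    return beam[0]


# Toy task (hypothetical): choose three symbols from {"a", "b"}.
# Local features are the last unigram and the last bigram.
def toy_features(config: Config) -> Counter:
    feats = Counter({("uni", config[-1]): 1})
    if len(config) >= 2:
        feats[("bi", config[-2], config[-1])] += 1
    return feats


toy_weights = {("uni", "a"): 1.0, ("uni", "b"): 0.9, ("bi", "b", "b"): 1.0}


def decode(beam_size: int) -> Config:
    return beam_search((), lambda c: "ab", lambda c: len(c) == 3,
                       toy_features, toy_weights, beam_size)[1]


print(decode(1))  # greedy search commits to "a" at the first step
print(decode(2))  # a wider beam keeps "b" alive and ends with "b b b"
```

With a beam of size 1 the search is greedy and never sees the high-scoring ''b b'' bigram; with a beam of size 2 the lower-scoring first step survives long enough to win overall.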

The Morphological Framework
Our departure point for morphological disambiguation (MD) is the transition system of More and Tsarfaty (2016), currently established as the state of the art for Hebrew MA&D. The input to the system is a lattice $L$ that captures the range of valid morphological analyses for the input tokens $x = x_1, \ldots, x_k$, as illustrated in the middle of Figure 1. The goal of the MD system is to select a sequence of contiguous arcs in $L$ which represents the morphological disambiguation of $x$ in context.
Formally, we define for each token $x_i$ its token-lattice $L_i = MA(x_i)$, where each lattice arc in $L_i$ corresponds to a potential node in the dependency tree. Each lattice arc has a morphosyntactic representation (MSR), which we define as a tuple $m = (b, e, f, t, g)$, with $b$ and $e$ the beginning and end indices in $L$, $f$ a form, $t$ a part-of-speech tag, and $g$ a set of attribute:value grammatical properties. $L = MA(x)$ is the sentence lattice obtained by concatenating the token-lattices top to bottom, $L = MA(x_1) \circ \cdots \circ MA(x_k)$. Now, $L$ represents the full range of valid morphological analyses applicable to $x$.

A configuration for any input $x$ in the MD system, $c_{md} = (L, n, i, M)$, consists of its sentence lattice $L = MA(x)$, an index $n$ representing the internal node (dark circle) in $L$ we are at, an index $i$ representing the 0-based current-token index (double circle) in $L$, and a set $M$ of disambiguated MSRs (selected arcs). The initial configuration function $c_s$ sets $L = MA(x)$, $n = bottom(L)$, $i = 0$, and $M = \emptyset$. For traversing the lattice $L$ from bottom to top we define an open set of transitions using the $MD_s$ transition template, with $s = (\_, \_, \_, t, g)$ specifying the delexicalized projection of (any) lattice arc $(b, e, f, t, g)$:

$$MD_s : (L, p, i, M) \rightarrow (L, q, j, M \cup \{m\})$$
This transition selects a single lattice arc at a given position. Now, if p is our current position in the lattice and m = (p, q, f, t, g) is the selected arc, then j = i + 1 if q is at a token boundary (double circle) and j = i if it is not.
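The MD configuration and transition just described can be sketched as follows. This is a minimal illustration, not the authors' code: the class names `MSR` and `MDConfig`, the function `md`, the tags `IN`, `DEF`, and `NN`, and the toy lattice for the token ''bbit'' are all assumptions made for the example. The sketch also assumes that when several arcs share the same delexicalized projection, the first one is taken, and it stores $M$ as an ordered tuple rather than a set.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class MSR:
    b: int   # begin node in the lattice
    e: int   # end node in the lattice
    f: str   # form
    t: str   # part-of-speech tag
    g: FrozenSet[str] = frozenset()  # attribute:value properties


@dataclass(frozen=True)
class MDConfig:
    lattice: Tuple[MSR, ...]
    token_ends: FrozenSet[int]  # lattice nodes that are token boundaries
    n: int = 0                  # current lattice node
    i: int = 0                  # 0-based current-token index
    M: Tuple[MSR, ...] = ()     # selected arcs, kept in order


def md(c: MDConfig, t: str, g: FrozenSet[str] = frozenset()) -> Optional[MDConfig]:
    """Apply MD_s for s = (_, _, _, t, g); return None if no arc matches."""
    for m in c.lattice:
        if m.b == c.n and m.t == t and m.g == g:
            # The token index advances only when the arc ends a token.
            j = c.i + 1 if m.e in c.token_ends else c.i
            return replace(c, n=m.e, i=j, M=c.M + (m,))
    return None


# Hypothetical lattice for the single token "bbit":
# b + h + bit ("in the house") versus b + bit ("in a house").
lattice = (
    MSR(0, 1, "b", "IN"),
    MSR(1, 2, "h", "DEF"),
    MSR(2, 3, "bit", "NN"),
    MSR(1, 3, "bit", "NN"),
)
start = MDConfig(lattice, frozenset({3}))
with_article = md(md(md(start, "IN"), "DEF"), "NN")
without_article = md(md(start, "IN"), "NN")
print([m.f for m in with_article.M], with_article.i)
print([m.f for m in without_article.M], without_article.i)
```

The two paths reach the same token boundary after a different number of transitions, which is exactly the variable-length situation that complicates beam search over lattices.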
The terminal configuration set is defined to be $C_t = \{(L, n, i, M) : n = top(L)\}$, that is, the configurations in which the whole lattice has been traversed and $M$ holds a complete path through $L$.

In order to find this path in a data-driven fashion, we define a parametric model that scores all transitions that can be applied at each step. We define the properties f (form), t (POS tag), g (morphological attribute:value pairs), path (the path in the previously disambiguated token-lattices), and morphs (the set of outgoing morphemes of the current node), and we use unigram, bigram, and trigram combinations of these properties as features for the learning model. Our beam search decoder then applies at each point in the lattice all possible transitions and keeps the B top-scoring candidates at this point. Those that do not make the cut fall off the beam.
Importantly, $|M|$, the number of lattice arcs in the path at each stage, is unknown in advance, since different disambiguation decisions between token boundaries may end up with different path lengths. This can be seen in the lattice of Figure 1, where path lengths vary between 4 and 7 arcs. This is a thorny issue, because it violates a basic assumption of beam search decoding: that the number of transitions is a deterministic function of the input and is known in advance. Such length discrepancies may lead to preferring short sequences in the beam because they reach the end goal early, or to preferring long sequences because their scores are artificially inflated by the multitude of features.
To address this issue, we adopt the solution proposed by More and Tsarfaty (2016), employing a special transition ENDTOKEN (ET) given in (2) which explicitly increments $i$ when reaching a token boundary in $L$. Set apart from the other transitions, ET has its own set of features (of size d). Other than incrementing $i$, ET causes a re-ordering of candidates in the beam at each token boundary. More and Tsarfaty (2016) show that when using this anchor, the features of the ET transition provide a counterbalance to the effects of varied-length sequences and improve the accuracy of Hebrew MD.
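An MD transition sequence thus becomes a union of disjoint sets of configurations $y = y_{md} \cup y_{et}$, and $Score(y)$ is as follows, where $\omega^{md}_j \phi^{md}_j$ score configurations resulting from MD transitions and likewise $\omega^{et}_j \phi^{et}_j$ for ET transitions:

$$Score(y) = \sum_{c \in y_{md}} \sum_{j} \omega^{md}_j \phi^{md}_j(c) + \sum_{c \in y_{et}} \sum_{j} \omega^{et}_j \phi^{et}_j(c) \qquad (3)$$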

The Syntactic Framework
Given a sequence of selected lattice arcs for the input sequence $x$, we can define the syntactic dependency representation for $x$ as a dependency tree where each lattice arc corresponds to a node in the dependency tree. Let $R$ be a set of dependency types and let $M = m_1 \ldots m_l$ be the sequence of $l$ arcs selected by the MD component. (To avoid confusion between lattice arcs and dependency arcs, we refer to lattice arcs $m_i \in M$ as ''morphemes.'') We denote a dependency graph for the sequence $M$ by $G = (M, A)$, with $A \subseteq M \times R \times M$ a set of labeled dependency arcs. A configuration $c_{dep} = (\sigma, \beta, A)$ represents a partial analysis of the input sentence, where the morphemes on the stack $\sigma$ are partially processed morphemes, the morphemes in the buffer $\beta$ are those waiting to be processed, and the arc set $A$ represents a partially built dependency tree (Kübler et al., 2009, Chapter 3). Unless specified otherwise, the set of terminal configurations is $C_t = \{(\sigma, [\,], A) : |\sigma| = 1\}$, that is, configurations with an empty buffer and a single element on the stack.

A transition system can introduce an artificial root node that can head any partial tree in the sentence. The root node allows for multiple partial trees (a forest) to be related only through the root node. We call transition systems with and without a root node root-full and root-less, respectively. In the literature, $\sigma = [m_0]$ is the formal requirement for root-full variations; however, $|\sigma| = 1$ is a generalization that applies to both root-full and root-less cases.

There are various options for defining transitions over such configurations in the dependency parsing literature for English. In particular, three transition systems have been successfully applied to English as well as other languages (cf. Ballesteros and Nivre [2016]):

Arc Standard: A straightforward method of bottom-up left-to-right incremental parsing as proposed in Nivre (2004). We assume the definition by Kübler et al. (2009).

Arc Eager: Following Abney and Johnson (1991), Arc Eager defines a variant of Arc Standard that eagerly attaches a right dependent to its head while still allowing more dependents to attach to it. We assume the definition by Kübler et al. (2009).

Arc (Z)Eager: In our reproduction of the state-of-the-art results presented by Zhang and Nivre (2011) for English, we discovered in the code a variant of Arc Eager that we call Arc (Z)Eager, which has interesting subtle variations from Arc Eager, including a second stack holding head nodes, and certain hard constraints on the application of several transitions.

An empirical study by Nivre (2008) compares the performance of Arc Standard and Arc Eager for 13 languages, amongst them Arabic and Turkish, both considered MRLs with some degree of word-order freedom. For these languages, Arc Standard slightly outperformed Arc Eager. On a different but related note, our preliminary experiments on English and Hebrew show that the Arc ZEager variant always outperforms Arc Eager. However, the question of which of the two, Arc Standard or Arc ZEager, is better suited for parsing Hebrew remains open for our empirical investigation in Section 3.
Defining Features. A significant contribution of Zhang and Nivre (2011) is their proposal of a set of rich non-local features (RNF) for Arc ZEager, adding higher-order information previously found only in graph-based parsers. To facilitate a fair comparison of Arc Standard to Arc (Z)Eager, we have to adapt the feature set of Zhang and Nivre (2011) to the different arc system (to the extent that this is possible), and to the different language type. In particular, the RNF set depends on word order, by encoding the arc direction explicitly. We address the order-dependence of RNF by defining a parallel set of features that is suitable for the more flexible word order in MRLs, and that is applicable to Arc Standard. We call this feature set rich linguistic features (RLF). The essence of the two feature sets is the same, but we replace features relying on positions of nodes with features relying on the labeled grammatical functions of these nodes.

To construct our features, we define properties that capture the linguistic information of selectional preferences and subcategorization frames (Tesnière, 1959; Chomsky, 1965). To capture the distributional characterization of subcat frames, we define $sf_p$ to be the multiset of part-of-speech tags of the dependents of a given head. To capture the functional characterization of subcat frames, we define $sf_f$ to be the multiset of function labels of all dependents of a given head. For valency, we define the property $v_{sf}$, the number of dependents of a given head. For capturing selectional preferences in flexible word order environments, we define order-agnostic bilexical labeled-dependency features, generated separately for each dependent.
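As a rough illustration of the order-agnostic properties above, the following Python sketch computes $sf_p$, $sf_f$, and $v_{sf}$ for one head from a partially built arc set. The function name `subcat_properties`, the arc encoding, and the toy tags and labels are assumptions made for the example; the actual feature templates of the parser are not reproduced here.

```python
from collections import Counter
from typing import Dict, List, Tuple

Arc = Tuple[int, str, int]  # (head index, function label, dependent index)


def subcat_properties(head: int, arcs: List[Arc], pos: Dict[int, str]):
    """Return (sf_p, sf_f, v_sf) for one head as order-agnostic multisets."""
    dependents = [(label, dep) for h, label, dep in arcs if h == head]
    sf_p = Counter(pos[dep] for _, dep in dependents)  # POS tags of dependents
    sf_f = Counter(label for label, _ in dependents)   # function labels
    return sf_p, sf_f, len(dependents)                 # valency


# Hypothetical five-morpheme sentence headed by the verb at index 2.
pos = {1: "NN", 2: "VB", 3: "NN", 4: "IN", 5: "NN"}
arcs = [(2, "subj", 1), (2, "obj", 3), (2, "prep", 4), (4, "pobj", 5)]
sf_p, sf_f, v_sf = subcat_properties(2, arcs, pos)
print(sorted(sf_p.items()), sorted(sf_f.items()), v_sf)
```

Because the properties are multisets, permuting the dependents (as happens with flexible word order) leaves them unchanged, which is the point of replacing position-based features with function-based ones.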
Finally, we augment syntactic features with morphological properties. Our augmentation operator allows for creating multiple instances of the same feature, with and without morphological properties.

The Joint Framework
Given our morphological and syntactic components, we seek an integration such that morphological information aids syntactic disambiguation and vice versa.
We propose to literally embed the two standalone configurations into a single configuration, and to apply transitions via a coherent logic we call a strategy that chooses which processor to apply at a given state.
Formally, let $c_{md}$ and $c_{dep}$ be MD and dependency parser configurations as defined in Sections 2.2 and 2.3, respectively. We define the joint configuration as the pair $c_j = (c_{md}, c_{dep})$. We initialize the embedded MD configuration $c_{md}$ with the MD transition system initialization function, as defined in Section 2.2, but leave $c_{dep}$ empty, with an empty stack and buffer, $\sigma = \beta = [\,]$. Also, as before, in $c_{dep}$ we set $A = \emptyset$. A configuration $c_j$ is terminal if and only if $c_{md}$ and $c_{dep}$ are both terminal configurations of their respective systems.
Let $T = (T_{md}, T_{dep})$ be a pair of transition sets of the MD and dependency parsing transition systems, respectively, let $C = \{C_{md}, C_{dep}\}$ hold the sets of possible non-terminal configurations, and let $C_t = \{C_{t_{md}}, C_{t_{dep}}\}$ hold the respective sets of terminal configurations. A joint strategy is a function that, given a non-terminal joint configuration, chooses exactly one transition system to act on; that is, it maps each such configuration to either $T_{md}$ or $T_{dep}$.

The Pipeline Strategy. Our baseline morphosyntactic parsing strategy is simply a pipeline that first applies the morphological component, which selects the best output morpheme sequence, and then applies the dependency parser to it.
The MDFirst Strategy. If we seek to improve on the simplistic pipeline strategy, we first need to adjust our MD transition system such that its disambiguation decisions feed into the configuration of the dependency parser. We modify the $MD_s$ transition so that, in addition to adding the selected morpheme $m$ to $M$, it also places $m$ in the buffer $\beta$ of the embedded dependency configuration. Now, a simple improvement upon the pipeline approach would be, rather than choosing just the top-scoring candidate of the MD component, passing all B candidates in the beam to the dependency component. We refer to this strategy as MDFirst: the MD transition system is applied until the embedded MD configuration is terminal, and only then is the dependency transition system applied. This simple extension offers the opportunity to maximize a single objective function, and to ''re-rank'' initially locally scored candidates if syntactic processing leads to a better MD result.
The ArcGreedy Strategy. Since both transition systems process their input left to right, there is no inherent constraint preventing the application of a syntactic transition as soon as the embedded dependency configuration meets the minimal state required for a dependency transition to be applied. We therefore propose a set of ArcGreedy$_k$ strategies, in which we greedily choose to apply a syntactic transition if the dependency buffer $\beta$ is populated by at least $k$ morphemes, so that the syntactic processor may look $k$ morphemes ''forward'' in order to predict its next transition.
In contrast with the pipeline architecture, both MDFirst and ArcGreedy$_k$ perform joint morphosyntactic parsing, in the sense that the framework aims to maximize a joint global score over both morphological and dependency transitions. This is formally depicted as follows in (9), where $c_{md}$ and $c_{et}$ are the resulting configurations of MD and ET transitions, respectively, and $c_{dep}$ are the resulting configurations of syntactic transitions:

$$Score(y) = \sum_{c_{md} \in y} \sum_{j} \omega^{md}_j \phi^{md}_j(c_{md}) + \sum_{c_{et} \in y} \sum_{j} \omega^{et}_j \phi^{et}_j(c_{et}) + \sum_{c_{dep} \in y} \sum_{j} \omega^{dep}_j \phi^{dep}_j(c_{dep}) \qquad (9)$$

The theoretical advantage of ArcGreedy$_k$ compared to MDFirst is that the incremental update of the joint global score by the former alternates between MD and syntactic predictions, allowing syntactic and morphological information to interact frequently. Thus, syntax can affect the ordering of candidates during the parsing sequence, correcting local mistakes closer to where they occur.
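To make the difference between the two joint strategies concrete, the following Python sketch simulates which component acts at each step. It is a deliberately simplified toy, not the parser itself: `JointState` tracks only counts, every MD step is assumed to add one morpheme to the buffer, and every dependency step is assumed to consume one (real dependency transitions such as arc attachments do not always shrink the buffer). The rule that ArcGreedy falls back to syntactic transitions once MD is finished, even with fewer than $k$ morphemes in the buffer, is also an assumption, as is the requirement $k \geq 1$.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class JointState:
    remaining: int  # morphemes the MD component has yet to select
    buffer: int     # morphemes waiting in the dependency buffer

    @property
    def md_terminal(self) -> bool:
        return self.remaining == 0

    @property
    def terminal(self) -> bool:
        return self.md_terminal and self.buffer == 0


Strategy = Callable[[JointState], str]


def md_first(state: JointState) -> str:
    """Run MD to completion, then hand over to the dependency parser."""
    return "dep" if state.md_terminal else "md"


def arc_greedy(k: int) -> Strategy:
    """Prefer a syntactic step whenever k morphemes are available (k >= 1)."""
    def strategy(state: JointState) -> str:
        if state.buffer >= k or state.md_terminal:
            return "dep"
        return "md"
    return strategy


def schedule(strategy: Strategy, n_morphemes: int) -> List[str]:
    """Record which component acts at each step of the toy simulation."""
    state = JointState(remaining=n_morphemes, buffer=0)
    steps: List[str] = []
    while not state.terminal:
        choice = strategy(state)
        steps.append(choice)
        if choice == "md":
            state.remaining -= 1
            state.buffer += 1
        else:
            state.buffer -= 1
    return steps


print(schedule(md_first, 4))
print(schedule(arc_greedy(3), 4))
```

In the simulation, MDFirst produces all morphological steps before any syntactic step, whereas ArcGreedy with $k = 3$ interleaves the two as soon as three morphemes are buffered.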

Experiments
Goal: We aim to test the hypothesis that joint syntactic and morphological disambiguation is better than a pipeline by empirically comparing the Pipeline, MDFirst, and ArcGreedy$_3$ parsing strategies in our unified transition-based morphosyntactic framework. We set $k = 3$ because some features of Zhang and Nivre (2011) require three morphemes in the buffer.

Data: We use the Modern Hebrew section of the SPMRL shared task (Seddah et al., 2014), derived from the Hebrew Unified-SD version of Tsarfaty (2013). For the purpose of this work, we harmonized the treebank annotation scheme with the annotation scheme of the lexical resources of Itai and Wintner (2008), and in particular the HEBLEX lexicon of Adler and Elhadad (2006). We use the standard train/dev/test split: we train on the train set (5,000 sentences), conduct a detailed investigation on dev (500), and confirm our results on test (716).
Implementation: We implemented from scratch a fully integrated, transition-based, multilingual natural language processor, written in Go. Our implementation uses a general purpose morphological analyzer, which for Hebrew is backed by the BGU HEBLEX lexicon (Adler and Elhadad, 2006). We implemented the morphological disambiguator, dependency parser, and joint integration strategies defined herein. We implemented and experimented with both the Arc Standard and Arc ZEager transition systems.

Scenarios: In MRLs, out-of-vocabulary (OOV) tokens pose a great challenge to parsing. A raw token may not have been observed during training, even though all its morphemes have been observed in other contexts. To gauge the effect of such OOV items on the quality of Hebrew parses, we evaluate the system in two different scenarios. In the first, infused scenario, we verify that each lattice contains the gold morphological analysis. That is, if the gold path is not present in L = MA(x) (hence, an OOV), we automatically infuse the gold path into L. We contrast this with uninfused scenarios, where we use a realistic morphological analyzer with its (incomplete) lexical coverage as is, compliant with Adler and Elhadad (2006).

Settings: In all experiments, we used a beam of size 64, which, in our preliminary experiments on dev, gave better results for the joint models than a beam of 32, and in any event no worse results than a beam of 128. To avoid both overfitting and underfitting, we define a stopping condition for the training procedure, which we test in each training iteration. During training, we use a sliding window of three iterations and select the first model that precedes two consecutive score drops on dev.
For pipeline models, we test distinct stopping conditions for the morphological and the syntactic models, each based on its own standalone scores.
For joint models, we test the stopping condition with respect to a single overall dependency F 1 score, which we define shortly.
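One possible reading of the stopping condition above is sketched below in Python. The interpretation is an assumption: we take ''a sliding window of three iterations'' to mean that iteration $i$ is selected as soon as the dev score drops at both $i+1$ and $i+2$. The function name `select_model` and the dev scores in the example are hypothetical and serve only as an illustration.

```python
from typing import List, Optional


def select_model(dev_scores: List[float]) -> Optional[int]:
    """Return the index of the first iteration followed by two consecutive
    drops in dev score, or None if no such window of three exists yet."""
    for i in range(len(dev_scores) - 2):
        first, second, third = dev_scores[i:i + 3]
        if second < first and third < second:
            return i
    return None


# Hypothetical dev scores, one per training iteration.
scores = [80.1, 81.0, 81.4, 81.2, 81.3, 81.1, 80.9]
print(select_model(scores))
```

Under this reading, a single dip followed by a recovery (as at iteration 3 in the example) does not stop training; only two drops in a row do.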
Evaluating Morphology: To evaluate morphological disambiguation (MD) results, we report the $F_1$ scores on the set of predicted morphemes versus gold-standard morphemes in the sentence. Formally, let $M_p$, $M_g$ be the sets of predicted and gold morphemes of the sentence, respectively. We define precision, recall, and $F_1$ scores as:

$$P = \frac{|M_p \cap M_g|}{|M_p|}, \qquad R = \frac{|M_p \cap M_g|}{|M_g|}, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

We report two different scores for each MA&D run, one for full MD including segmentation, tagging, and morphological features (MD Full), and one for segmentation and tags only (MD POS).
Evaluating Dependencies: Evaluating joint morpho-syntactic dependency parsing performance is non-trivial, because the gold and parse trees may have a different number of nodes, which precludes the application of standard attachment scores: a single incorrect segmentation early in the sequence suffices for the resulting off-by-one indices to render the arcs in the remainder of the sentence incorrect (Tsarfaty et al., 2012). Let us illustrate this effect. Consider the Hebrew phrase ''bbit'' (translated ''in the house'') that appears as a single space-delimited token. Now consider the two following MD alternatives, with and without the Hebrew covert definite article. We also include here the indices of the disambiguated morphemes in their linear order:

Gold MD: 1.b(''in'') 2.h(''the'') 3.bit(''house'')
Predicted MD: 1.b(''in'') 2.bit(''house'')
Further assume that both the Gold and Predicted dependency trees contain the correct dependency arc between b (''in'') and bit (''house'') labeled pobj. In simple LAS terms, the arcs that would be compared for the purpose of evaluation are:

Gold Dep: pobj(1,3), det(3,2)
Predicted Dep: pobj(1,2)
So the pobj predicted arc will be considered an error, even though the relation between forms is correct, and accordingly both UAS and LAS will be 0.
To address this issue, we define an $F_1$ accuracy measure with respect to the forms of arc edges, rather than their node indices. Formally, let $M_p$ be the predicted morphological disambiguation of $x$, and let $A_p$ be the predicted dependency tree over $M_p$. Likewise, let $M_g$, $A_g$ be the gold-standard morphological disambiguation and dependency tree of $x$. We now replace the index of each node in the arcs of $A_p$, $A_g$ with the form of the corresponding morpheme in $M_p$ and $M_g$. Let $J_p$, $J_g$ be the form-based (rather than index-based) arcs of the predicted and gold representations of $x$. We report both labeled and unlabeled $F_1$ as:

$$P = \frac{|J_p \cap J_g|}{|J_p|}, \qquad R = \frac{|J_p \cap J_g|}{|J_g|}, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

In our example, the revised arcs will now be:

Gold Dep: pobj(b, bit), det(bit, h)
Predicted Dep: pobj(b, bit)

Now, the parser will be credited for identifying the pobj arc correctly, as desired, and the dependency scores will be: $P = 1$, $R = 0.5$, and $F_1 = 0.67$.
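The form-based measure can be sketched in a few lines of Python, using the ''bbit'' example from the text. The function names `form_arcs` and `prf` are hypothetical, as is the convention that index 0 denotes an artificial root. The sketch uses multisets (`Counter`) instead of plain sets so that repeated forms within a sentence are counted separately; this is an assumption of the sketch rather than a detail confirmed by the text.

```python
from collections import Counter
from typing import List, Tuple

Arc = Tuple[str, int, int]  # (label, head index, dependent index), 1-based


def form_arcs(forms: List[str], arcs: List[Arc], labeled: bool = True) -> Counter:
    """Replace node indices by morpheme forms; drop labels if unlabeled."""
    def form(idx: int) -> str:
        return "ROOT" if idx == 0 else forms[idx - 1]
    return Counter(
        (label if labeled else None, form(h), form(d)) for label, h, d in arcs
    )


def prf(pred: Counter, gold: Counter) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over form-based arcs."""
    correct = sum((pred & gold).values())  # multiset intersection
    p = correct / sum(pred.values()) if pred else 0.0
    r = correct / sum(gold.values()) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = form_arcs(["b", "h", "bit"], [("pobj", 1, 3), ("det", 3, 2)])
pred = form_arcs(["b", "bit"], [("pobj", 1, 2)])
p, r, f1 = prf(pred, gold)
print(p, r, round(f1, 2))
```

The predicted pobj arc between ''b'' and ''bit'' is now matched against the gold arc even though the node indices differ, so the single predicted arc counts as correct and only the missing det arc lowers recall.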

Results: Tables 1-4 present our morphosyntactic parsing results for each of our systems under all strategies, pipeline and joint. We report $F_1$ scores, both MD Full and MD POS for morphological disambiguation (MD), and both unlabeled and labeled $F_1$ scores for the dependency trees (Dep). Tables 1 and 3 present results on the Modern Hebrew dev set, and Tables 2 and 4 confirm our results on the test set.

Table 1 presents parsing results for infused morphological lattices; that is, ambiguous MA lattices that are guaranteed to also include the correct MD path. In these experiments, we see that MD results in joint parsing strategies (MDFirst, ArcGreedy) always improve upon the MD standalone/pipeline results. In particular, all MD results across the joint strategies are very close. We observe only a minor advantage for Arc ZEager over Arc Standard for both joint strategies. This increase in MD accuracy unfortunately comes at the expense of syntax, where we observe a slight drop (up to 0.5 point in [un]labeled $F_1$) when switching from pipeline to joint strategies.
We confirm this trend on the test set in Table 2: MD results in the joint strategies are better than in the respective pipelines (although now Arc Standard slightly improves upon Arc ZEager in the ArcGreedy strategy), while dependency parsing results drop in joint scenarios (a slightly larger drop than on dev).

Tables 3 and 4 present parsing results for the more interesting scenario, a realistic parsing scenario where we use uninfused lattices: ambiguous lattices obtained by an existing broad-coverage morphological analyzer, which are not (and cannot be) guaranteed to always include the correct path. As expected, on both the dev set (Table 3) and test set (Table 4), the results drop relative to the respective infused scenarios (Tables 1 and 2, respectively), as some elements from the correct path and tree are no longer reachable within the search space. At the same time, it is interesting to observe that for both dev and test, all MD scores (Full/POS) as well as dependency scores (un/labeled) are better in joint parsing. The specific differences between the joint strategies and transition systems do not matter very much; the robust empirical trend is that switching from pipeline to joint improves both MD and dependency parsing performance.
It is interesting to inquire why in the infused scenario, on both dev and test, dependency parsing results in the joint strategies drop relative to the respective pipelines. As it turns out, when the correct analysis of a rare (OOV) token has been injected artificially into the lattice, training on these lattices may turn out to be misleading. Injecting a correct but rare MSR may lead to an artificial ''certainty'' as to its appropriate syntactic context. Then, if the parser does not apply robust statistics on the general behavior of rare/OOV items in different syntactic contexts (as would be the case in joint uninfused scenarios), selecting the injected MD may lead to a wrong syntactic decision.
The main message coming out of our experiments is that joint morphological disambiguation and syntactic parsing in this transition-based framework is preferred to pipeline settings, in line with the hypothesis that syntactic information aids morphological disambiguation. Furthermore, it is reassuring to observe that when parsing uninfused lattices, as in the more realistic scenario, dependency parsing results improve upon pipeline scenarios, corroborating the findings of Seeker and Centinoglu (2015) in graph-based frameworks and of Cohen and Smith (2007) and Goldberg and Tsarfaty (2008) in phrase-structure parsing.
End-to-End Parsing Performance: To put our end-to-end system performance in context, Tables 5 and 6 present our best results for dependency parsing in a pipeline architecture, assuming gold morphology, on the dev set and the test set, respectively. We compare these results with studies that parsed the same data sets. As Table 5 shows, our parser significantly outperforms the state-of-the-art parser by Goldberg and Elhadad (2010), so far considered the de facto standard for Hebrew parsing. (Goldberg and Elhadad (2010) report only UAS, and only on dev.) As shown in Table 6, the parser also outperforms the results reported by most (though not all) SPMRL shared task participants, using the same data and the same split. Such gold morphology settings are of course not suited for realistic parsing scenarios. So, in Table 7 we compare our best end-to-end parsing results to the most recent dependency parsing results in realistic scenarios on the same data (Seeker and Centinoglu, 2015). Here our best pipeline and joint systems outperform the previously reported pipeline and joint results, thus presenting a new state of the art for Hebrew dependency parsing. Moreover, these results are obtained within a unified formal framework in a single ''all-included'' implementation, providing a further practical advantage of not having to maintain and train separate standalone components.

Our implementation, models, and data are publicly available via https://github.com/OnlpLab/yap. We also provide a web demo of Hebrew raw-to-dependency parsing at http://onlp.openu.org.il/.

Qualitative Error Analysis
To shed more light on the particular ways in which the joint system improves performance over the pipeline, we conducted a qualitative error analysis of 100 sentences from the Modern Hebrew standard dev set, parsed in the more realistic uninfused scenario. More concretely, we sampled the 100 sentences from our parsed corpus, and a linguist manually assigned each error to one of 10 linguistic categories. We then clustered the categories into four different types.
• TYPE 1 errors include true semantic ambiguity, where additional semantic and world knowledge is required for disambiguation.
• TYPE 2 errors include categories that transcend different levels of linguistic structure, for example, when morphological segmentation errors affect syntactic disambiguation.
• TYPE 3 errors include parsing errors that stem from idiosyncrasies of the data and peculiarities of the SPMRL annotation scheme.
• TYPE 4 (other) errors include parse errors that pertain to linguistic structures that characterize Semitic phenomena.
Table 8 shows, for each error category, the number (and percentage) of occurrences of that error in the pipeline versus joint settings. The most notable outcome is that the largest decrease in joint scenarios relative to pipeline scenarios occurs in TYPE 2 errors, which reflect phenomena directly related to the morphosyntactic interface. Moreover, we also see a decrease in the errors concerned with the lexicosyntactic interface (e.g., solving PP attachment ambiguity), which turn out to also benefit from the joint settings. With the other types of errors, there is no clear advantage for joint parsing, and we would not expect one. TYPE 3 errors have to do with train-set inconsistencies, under-specification, or errors in the gold trees. TYPE 4 errors stem from linguistic phenomena which appear harder to disambiguate, and they are equally difficult across scenarios.
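The per-type comparison described above can be sketched as a simple tally. The following is a minimal illustration, not the procedure actually used for Table 8: the category-to-type mapping follows the categories listed in Table 8 (the assignment of the first three categories to TYPE 1 is inferred from the type definitions), while the function name `tally_by_type` and the two toy annotation lists are invented purely for demonstration and do not reflect the reported counts.

```python
from collections import Counter

# Category-to-type mapping, following the categories listed in Table 8.
CATEGORY_TYPE = {
    "could be considered correct": 1,
    "clause attachment": 1,
    "pp attachment": 1,
    "seg/tag err in focus word": 2,
    "seg/tag err in other word": 2,
    "label err due to tagging err": 2,
    "gold is wrong": 3,
    "train is inconsistent": 3,
    "label underspecified": 3,
    "other": 4,
}


def tally_by_type(annotations):
    """Return {type: (count, percentage)} for a list of error categories."""
    counts = Counter(CATEGORY_TYPE[category] for category in annotations)
    total = sum(counts.values())
    return {
        t: (counts[t], 100.0 * counts[t] / total if total else 0.0)
        for t in sorted(set(CATEGORY_TYPE.values()))
    }


# Toy annotations (hypothetical, NOT the paper's data).
pipeline_errors = [
    "pp attachment",
    "seg/tag err in focus word",
    "seg/tag err in other word",
    "gold is wrong",
]
joint_errors = ["pp attachment", "seg/tag err in focus word", "gold is wrong"]

pipeline_tally = tally_by_type(pipeline_errors)
joint_tally = tally_by_type(joint_errors)
for t in pipeline_tally:
    print(f"TYPE {t}: pipeline={pipeline_tally[t][0]} joint={joint_tally[t][0]}")
```

Comparing the two tallies type by type is what exposes where the joint setting helps; with real annotations, the TYPE 2 row is where the text above reports the largest decrease.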

Related and Future Work
Monolingual MA&D for Modern Hebrew has been previously addressed in standalone settings using Hidden Markov Models (Bar-Haim et al., 2008; Adler, 2007). While these results are adequate for some downstream applications, using Adler's MA&D for dependency parsing, for instance, significantly harms parsing performance (Goldberg and Elhadad, 2010). More recently, More and Tsarfaty (2016) presented a standalone transition-based MA&D system which jointly solves morphological segmentation, tagging, and feature assignment. It achieves new state-of-the-art results for Hebrew MA&D and provides the starting point for our study.
In terms of end-to-end dependency parsing for Hebrew, Goldberg and Elhadad (2010) were the first to evaluate the impact of predicted morphology compared to gold morphology across different (transition-based, graph-based, easy-first) frameworks. They demonstrated a significant loss in accuracy for all models in predicted morphology settings, and concluded with a suggestion to attempt joint processing. Recently, Straka et al. (2016) presented UDPipe, a toolkit with standalone components for morphological analysis, segmentation, tagging, feature assignment, and dependency parsing, again using a pipeline architecture, with no way of interleaving the different decisions, as we strive to do here. This work aims to cover all stages of UDPipe, but within a joint architecture, allowing the use of information from any layer when disambiguating another.
Table 8: Error categories by type, with descriptions and examples.

TYPE 1
• Could be considered correct: Cases of true semantic ambiguity, where both analyses could be considered correct. For example, in the phrase mrkz kwx erbi the adjective erbi (''arab'') modifies mrkz (''center'') in gold, while the parser attaches it to kwx (''force''). Both could be correct.
• Clause attachment: In complex sentences with multiple clauses or coordinated structures, the parser often identifies the conjunctions and the predicates correctly, but makes mistakes in connecting clauses. Semantic or world knowledge is required for disambiguation.
• PP attachment: Semantic or world knowledge is also often required to determine PP attachment. For example, in the clause kdi lmnwe hedptm el ewbdim ifralim the parser attaches the PP el ewbdim ifralim (''over Israeli workers'') to the verb lmnwe (''to prevent'') rather than to the required noun hedptm (''their preference'').

TYPE 2
• Seg/Tag err in focus word: Incorrect segmentation of a token may lead to missing or incorrect dependency heads. For example, the parser analyses the token bqrb as a single word (a preposition, ''near'') while in the gold standard it is segmented into three words b + h + qrb (preposition + def + noun, ''in the battle''). This leads to missing dependency heads.
• Seg/Tag err in other word: Incorrect segmentation of a token may also lead to an incorrect dependent. For example, in the phrase bqrb mgnnh the parser analyses the PP b + qrb (preposition + noun, ''in battle'') as a single word bqrb (preposition, ''near''). As a result, the word mgnnh (''defence'') is labeled object of a preposition (pobj) rather than a genitive object of a construct-state noun (gobj).
• Label err due to tagging err: Incorrect tag prediction may lead to an appropriate yet incorrect arc label. For example, in the phrase amcei xi lhpgnwt (''living means for demonstrations'') the parser tags the adjective xi (''living'') as a noun, which is why it attaches xi as gobj (genitive object) to ''means'' rather than as amod.

TYPE 3
• Gold is wrong: The analysis in gold is wrong, while the analysis provided by the parser is correct. For example, in the phrase w+b+silwp ewbdwt (''and in distortion of facts''), the conjunction marker w is labeled comp in gold while the parser correctly picks cc.
• Train is inconsistent: (a) Multiple labels are used for the same type of dependencies. For example, prepmod and comp are both used in the train set for prepositional complements and prepositional modifiers without a clear distinction. (b) Identical structures are analyzed in different ways. For example, in the train set there are different structures used for the same type of partitive construction. In both (a) and (b), the predicted analyses might likewise be inconsistent and arbitrary.
• Label underspecified: The label dep is used instead of different types of dependencies in gold. In several cases the test set uses more specific labels where the parser predicts dep, and vice versa.

TYPE 4
• Other: A smaller number of errors involve linguistic structures that reflect particular Semitic phenomena. For example: (a) Indefinite objects in Hebrew are not case marked, so they are sometimes mislabeled as subject due to flexible word order patterns and object pre-posing. (b) Construct-state nouns may be analysed as names and vice versa. Since Hebrew lacks capitalization, Hebrew names very often string-match common nouns. (c) Adjective attachment errors inside construct-state nouns. For example, in the phrase hjlt qnswt kbdim the parser attaches the adjective kbdim (''heavy'') to the construct-state noun hjlt (''imposition-of'') instead of attaching it to the genitive object qnswt (''fines'').

Joint morphological and syntactic processing has been addressed in the context of phrase-structure parsing for Semitic languages, showing empirical advantages over pipeline architectures (Goldberg and Tsarfaty, 2008; Cohen and Smith, 2007; Green and Manning, 2010). In the context of dependency parsing, Bohnet and Nivre (2012) and Bohnet et al. (2013) integrated tagging and dependency parsing, improving state-of-the-art accuracy for a set of typologically different languages. Andor et al. (2016) use the joint transition system proposed by Bohnet and Nivre (2012), and improve it using a globally normalized neural network. These systems address joint morpho-syntactic analysis for disambiguated words, but without addressing the issue of segmenting and disambiguating raw input tokens. Seeker and Centinoglu (2015) explore the idea of joint morphological and syntactic parsing, including morphological segmentation, in a graph-based framework. Their system integrates two standalone components that reach agreement via a dual-decomposition setup. However, they report suboptimal performance on the standard Hebrew benchmark. For various Chinese parsing tasks, joint systems for word segmentation and syntactic parsing have been shown to outperform pipeline settings (Li et al., 2011; Zhang et al., 2014), but these systems assume transitions over equal-length character-based sequences, and thus they are not applicable to the setup of variable-length lattice paths, as demonstrated in Figure 1.
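To make the point about variable-length lattice paths concrete, the sketch below enumerates the paths of a toy lattice for the token bqrb, using the three analyses mentioned in the error analysis (bqrb, b + qrb, and b + h + qrb). This is only an illustration under assumptions: the dictionary-based lattice encoding, the node ids, and the function name `paths` are hypothetical and are not taken from the authors' implementation.

```python
from typing import Dict, List, Tuple

# A lattice as a DAG: lattice[start_node] = [(end_node, morpheme), ...].
# Node ids are arbitrary; the encoding is an assumption for illustration.
Lattice = Dict[int, List[Tuple[int, str]]]

BQRB: Lattice = {
    0: [(3, "bqrb"), (1, "b")],  # whole token as one word, or prefix b
    1: [(3, "qrb"), (2, "h")],   # noun directly, or the definite marker h
    2: [(3, "qrb")],             # noun following the definite marker
}


def paths(lattice: Lattice, start: int, end: int) -> List[List[str]]:
    """Enumerate every morpheme sequence leading from start to end."""
    if start == end:
        return [[]]
    found = []
    for next_node, morpheme in lattice.get(start, []):
        for rest in paths(lattice, next_node, end):
            found.append([morpheme] + rest)
    return found


all_paths = paths(BQRB, 0, 3)
for path in all_paths:
    print(len(path), " + ".join(path))
```

The three paths have one, two, and three morphemes respectively, so competing analyses of the same token consume a different number of steps. Note also that the path b + h + qrb contains a morpheme (h) with no corresponding character in the surface token, which a purely character-based segmentation cannot produce.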
With the surge of interest in deep learning for NLP (Goldberg, 2016), research in dependency parsing seeks to replace engineered feature models with neural networks that induce a model automatically (Chen and Manning, 2014; Zhou et al., 2015; Andor et al., 2016). Furthermore, the concept of word embeddings introduced by Mikolov et al. (2013) gives words vector representations, such that syntactic and semantic similarities are embodied in the vector space. However, these kinds of architectures are not immediately applicable to parsing Hebrew and other MRLs. Pretraining word embeddings is non-trivial for ambiguous input tokens, unless resorting to pipeline ''segmentation-first'' scenarios. Similarly, parsing architectures based on RNNs require morphologically disambiguated forms as input, which prevents syntax from improving morphological disambiguation, as we argue for here.
In the future, we intend to augment the architecture we present here with neural network models for both the morphological and syntactic components, in a way that would allow them to effectively interact and affect one another, in the hope of achieving further improvements in both tasks.

Conclusion
We present a novel joint transition-based framework for morpho-syntactic parsing, designed to solve end-to-end dependency parsing in realistic scenarios. We consider the properties of MRLs and directly address the disambiguation of raw input tokens by exploiting larger syntactic contexts. We apply this system to Modern Hebrew, and our empirical results support the long-standing conjecture that MA&D can greatly benefit from syntactic parsing. We present a new set of state-of-the-art Hebrew parsing results, in both pipeline and joint scenarios, which can serve as a strong baseline for exploring future neural joint morpho-syntactic architectures that would potentially improve performance on both tasks.