Parsing to Noncrossing Dependency Graphs

We study the generalization of maximum spanning tree dependency parsing to maximum acyclic subgraphs. Because the underlying optimization problem is intractable even under an arc-factored model, we consider the restriction to noncrossing dependency graphs. Our main contribution is a cubic-time exact inference algorithm for this class. We extend this algorithm into a practical parser and evaluate its performance on four linguistic data sets used in semantic dependency parsing. We also explore a generalization of our parsing framework to dependency graphs with pagenumber at most k and show that the resulting optimization problem is NP-hard for k ≥ 2.


Introduction
Dependency parsers provide lightweight representations for the syntactic and the semantic structure of natural language. Syntactic dependency parsing (Kübler et al., 2009) has been an extremely active research area for the last decade or so, resulting in accurate and efficient parsers for a wide range of languages. Semantic dependency parsing has only recently been addressed in the literature (Oepen et al., 2014; Oepen et al., 2015; Du et al., 2015a).
Syntactic dependency parsing has been formalized as the search for maximum spanning trees in weighted digraphs (McDonald et al., 2005b). For semantic dependency parsing, where target representations are not necessarily tree-shaped, it is natural to generalize this view to maximum acyclic subgraphs, with or without the additional requirement of weak connectivity (Schluter, 2014).
While a maximum spanning tree of a weighted digraph can be found in polynomial time (Tarjan, 1977), computing a maximum acyclic subgraph is intractable, and even good approximate solutions are hard to find (Guruswami et al., 2011). In this paper we therefore address maximum acyclic subgraph parsing under the restriction that the subgraph should be noncrossing, which informally means that its arcs can be drawn on the half-plane above the sentence in such a way that no two arcs cross (and without changing the order of the words). The main contribution of this paper is an algorithm that finds a maximum noncrossing acyclic subgraph of a (vertex-ordered) weighted digraph on $n$ vertices in time $O(n^3)$.
After giving some background (Section 2) we introduce the noncrossing condition, compare it to other structural constraints from the literature, and study its empirical coverage (Section 3). We then present our parsing algorithm (Section 4). To turn this algorithm into a practical parser, we combine it with an off-the-shelf feature model and online training (Section 5); the source code of our system is released with this paper. We evaluate the performance of our parser on four linguistic data sets: those used in the recent SemEval task on semantic dependency parsing (Oepen et al., 2015), and the dependency graphs extracted from CCGbank (Hockenmaier and Steedman, 2007). Finally, we explore the limits of our approach by showing that finding the maximum acyclic subgraph under a natural generalization of the noncrossing condition, pagenumber at most $k$, is NP-hard for $k \ge 2$ (Section 6). We conclude the paper by discussing related and future work (Section 7).
Figure 1: A noncrossing dependency graph (Oepen, 2014, DM #20209003). Using the terminology of Sagae and Tsujii (2008), the vertices the, thus, a and from are roots; of these, thus, a and from are covered by arcs.

Background
Dependency parsing is the task of mapping a natural language sentence into a formal representation of its syntactic or semantic structure in the form of a dependency graph.

Dependency Graphs
A directed graph or digraph is a pair $G = (V, A)$ where $V$ is a set of vertices and $A \subseteq V \times V$ is a set of arcs. We consider an arc $(u, v)$ to be directed from $u$ to $v$ and write it as $u \to v$. A path from $u$ to $v$ is a sequence of arcs $v_0 \to v_1, v_1 \to v_2, \ldots, v_{m-1} \to v_m$ where $v_0 = u$ and $v_m = v$. Note that $m$ is allowed to be zero; in this case the path is called empty. A digraph is acyclic if there is no vertex $u$ with a nonempty path from $u$ to itself. A digraph is a tree if there exists a vertex $r$, the root, such that for every vertex $u$ there is exactly one path from $r$ to $u$. Every tree is acyclic, but not every acyclic graph is a tree.
Throughout this paper we write $x$ for the generic sentence with $n$ words. A dependency graph for $x$ is an acyclic digraph $G = (V, A)$ whose vertices correspond one-to-one to the words in $x$. This correspondence imposes a total order on the vertices; we represent this order by equating $V$ with the set of positions in $x$, $V = \{1, \ldots, n\}$. An example dependency graph is shown in Figure 1. (Note that the arcs of the example graph are labeled. To simplify the presentation we mostly ignore this aspect in this paper, but our implementation supports parsing to labeled arcs.) A dependency tree is a dependency graph that forms a tree. The example graph is not a tree.

Maximum Spanning Tree Parsing
McDonald et al. (2005b) cast dependency parsing as the search for a maximum spanning tree of an arc-weighted digraph. They start from the complete graph on $n$ vertices where each arc $i \to j$ carries a real-valued weight $w_{ij}$, defined as a dot product
$$w_{ij} = \mathbf{w} \cdot \Phi(x, i \to j)$$
where $\Phi$ is a function that maps the sentence-arc pair into a feature vector and $\mathbf{w}$ is a global weight vector that is learned from training data. Taking the feature representation and the weight vector to be fixed, parsing under this model amounts to finding a spanning tree of maximum total weight.
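The arc-factored scoring described here can be sketched in a few lines. This is a minimal illustration, assuming sparse feature vectors stored as Python dicts; the feature function `toy_features` and its feature names are hypothetical and are not the feature model used in the paper.

```python
def toy_features(words, i, j):
    """Hypothetical feature map Phi(x, i -> j); positions i, j are 1-based."""
    head, dep = words[i - 1], words[j - 1]
    return {
        f"head={head}": 1.0,
        f"dep={dep}": 1.0,
        f"pair={head}|{dep}": 1.0,
        f"dir={'R' if i < j else 'L'}": 1.0,
    }


def arc_weight(w, feats):
    """Dot product of the weight vector w with a sparse feature dict."""
    return sum(w.get(name, 0.0) * value for name, value in feats.items())


def all_arc_weights(w, words):
    """Weights w_ij for every arc i -> j (i != j) of the complete graph."""
    n = len(words)
    return {
        (i, j): arc_weight(w, toy_features(words, i, j))
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if i != j
    }
```

With the weights fixed, any search procedure over subgraphs only needs the resulting table of arc weights, not the features themselves.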

Maximum Subgraph Parsing
For semantic dependency parsing, where the target representations are not necessarily trees (viz. Figure 1), we generalize the model of McDonald et al. (2005b) to other types of subgraphs. In general we are interested in the following optimization problem:
Maximum Subgraph for Graph Class $\mathcal{G}$: Given an arc-weighted digraph $G = (V, A)$, find a subset $A' \subseteq A$ with maximum total weight such that the induced subgraph $G' = (V, A')$ belongs to $\mathcal{G}$.
The computational complexity of this problem varies with the choice of $\mathcal{G}$. If $\mathcal{G}$ is the set of all dependency trees, then the problem can be solved in time $O(|V|^2)$ (Tarjan, 1977). If $\mathcal{G}$ is the unrestricted set of all dependency graphs, then the Maximum Subgraph problem is equivalent to the following:

Maximum Acyclic Subgraph
Given an arc-weighted digraph $G = (V, A)$, find a subset $A' \subseteq A$ with maximum total weight such that the induced subgraph $G' = (V, A')$ is acyclic.
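To make the problem statement concrete, the following sketch solves Maximum Acyclic Subgraph by exhaustive search. It is exponential in the number of arcs and only meant for tiny inputs; the function names are illustrative and do not come from the paper.

```python
from itertools import combinations


def is_acyclic(vertices, arcs):
    """Repeatedly remove vertices that have no incoming arc from the rest."""
    arcs = set(arcs)
    remaining = set(vertices)
    while remaining:
        sources = {
            v for v in remaining
            if not any(t == v and s in remaining for s, t in arcs)
        }
        if not sources:
            return False  # every remaining vertex has an incoming arc: a cycle exists
        remaining -= sources
    return True


def max_acyclic_subgraph(vertices, weights):
    """Exhaustive search over all arc subsets; weights maps (u, v) to a number."""
    arcs = list(weights)
    best_weight, best_arcs = 0.0, frozenset()
    for r in range(1, len(arcs) + 1):
        for subset in combinations(arcs, r):
            total = sum(weights[a] for a in subset)
            if total > best_weight and is_acyclic(vertices, subset):
                best_weight, best_arcs = total, frozenset(subset)
    return best_weight, best_arcs
```

On a weighted directed triangle, for instance, the search drops the lightest arc of the cycle.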
This problem is known to be NP-hard, and also hard to approximate (Guruswami et al., 2011).
Because of the NP-hardness of Maximum Acyclic Subgraph we cannot expect to find a polynomial-time parsing algorithm for general dependency graphs. In this section we therefore introduce the restricted class of noncrossing dependency graphs.

The Noncrossing Condition
Let $G = (V, A)$ be a dependency graph. Recall that $G$ is vertex-ordered by the correspondence of vertices to positions in $x$. Two arcs $a_1, a_2 \in A$ cross if either
$$\min(a_1) < \min(a_2) < \max(a_1) < \max(a_2) \quad\text{or}\quad \min(a_2) < \min(a_1) < \max(a_2) < \max(a_1)$$
where $\min(a)$ and $\max(a)$ denote the left and right vertex of the arc $a$, respectively. The graph $G$ is called noncrossing if there are no two arcs that cross. For example, the graph in Figure 1 is noncrossing. Its picture is an arc diagram, a graph layout where one places the vertices along a line and draws each arc as a smooth curve in one of the two half-planes bounded by that line. Noncrossing dependency graphs can be characterized as exactly those graphs that permit arc diagrams where
1. the vertices are lined up in their total order,
2. all arcs are drawn on the upper half-plane only,
3. two curves intersect at most at their endpoints.
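The crossing test translates directly into code. The sketch below assumes arcs are given as (source, target) pairs of sentence positions; the function names are illustrative.

```python
def arcs_cross(a1, a2):
    """True if the two arcs cross according to the min/max condition."""
    lo1, hi1 = min(a1), max(a1)
    lo2, hi2 = min(a2), max(a2)
    return lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1


def is_noncrossing(arcs):
    """True if no two arcs in the collection cross."""
    arcs = list(arcs)
    return not any(
        arcs_cross(a, b)
        for idx, a in enumerate(arcs)
        for b in arcs[idx + 1:]
    )
```

Note that arcs sharing an endpoint, and arcs nested inside one another, do not count as crossing; the direction of an arc plays no role in the test.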
Our noncrossing condition is often referred to as planarity; see e.g. Titov et al. (2009). We propose to reserve the term "planar" for its standard use in graph theory. The noncrossing condition is well-known in the area of enumerative combinatorics; see for example the overview article by Flajolet and Noy (1999). In this context, one typically thinks of the vertices of a graph as being laid out on a circle, say in counterclockwise order. Then the noncrossing condition requires that the arcs of the graph can be drawn inside the circle without two arcs crossing.

Related Properties
Projectivity In syntactic dependency parsing, where the target representations are trees, the noncrossing condition is closely related to projectivity.
More specifically, a dependency tree is projective if and only if it is noncrossing and its root is not "covered", meaning that there is no arc $i \to j$ such that the root lies properly between $i$ and $j$. Sagae and Tsujii (2008) propose a generalization of this two-part characterization of projectivity to graphs. We find that the noncrossing condition alone is more practical. For example, the dependency graph in Figure 1 is not projective in the sense of Sagae and Tsujii (2008), but it is noncrossing. Schluter (2015) uses the term "projective" with the same meaning as our term "noncrossing."
Pagenumber A natural generalization of the noncrossing condition is to relax property 2 of the arc diagram characterization and allow arcs to be drawn also in the lower half-plane bounded by the vertex line, or in any of some fixed number $k$ of half-planes. These half-planes may be thought of as the pages of a book, with the vertex line corresponding to the book's spine, and the embedding of a graph into such a structure is known as a book embedding (Bernhart and Kainen, 1979). A graph that permits a crossing-free book embedding with $k$ half-planes is said to have pagenumber at most $k$. Note that we here consider book embeddings where the order of the vertices along the boundary line is known in advance; it is given by the order of the words in the sentence.
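Because the vertex order is fixed, a graph has pagenumber at most $k$ exactly when its arcs can be split into $k$ groups with no crossing inside any group. For $k = 2$ this amounts to checking that the graph of crossings between arcs is bipartite. The following sketch does this with a breadth-first search; it is an illustration with assumed function names, not code from the paper.

```python
from collections import deque


def _cross(a, b):
    lo1, hi1 = min(a), max(a)
    lo2, hi2 = min(b), max(b)
    return lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1


def two_page_assignment(arcs):
    """Return {arc: page} with pages 0/1 and no same-page crossing, or None."""
    arcs = list(dict.fromkeys(arcs))  # drop duplicates, keep order
    page = {}
    for start in arcs:
        if start in page:
            continue
        page[start] = 0
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in arcs:
                if not _cross(a, b):
                    continue
                if b not in page:
                    page[b] = 1 - page[a]  # crossing arcs go on opposite pages
                    queue.append(b)
                elif page[b] == page[a]:
                    return None  # odd cycle of crossings: two pages do not suffice
    return page
```

Three mutually crossing arcs, for example, cannot be placed on two pages, whereas a chain of crossings can.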

Coverage on Linguistic Data
To estimate the practical value of a parser for noncrossing dependency graphs, we look into the coverage of these graphs on relevant linguistic data.

Data Sets
We choose four data sets that are generally available (through the Linguistic Data Consortium) and have already been used to build data-driven parsers.
SDP Dependencies These data sets (Oepen, 2014) consist of aligned sets of bi-lexical dependency graphs over the same Wall Street Journal text in three different representation types, referred to below as DM, PAS and PSD. These were used in the SemEval 2015 Task on Semantic Dependency Parsing (SDP; Oepen et al., 2015).
CCG Dependencies This data set (Hockenmaier and Steedman, 2005) consists of the bi-lexical semantic dependency triples released with CCGbank, which is also based on the Wall Street Journal text. These triples were designed to reflect the underlying predicate-argument structure of the corresponding CCG derivations (Hockenmaier and Steedman, 2007). Recent work views the set of all triples for a sentence as a dependency graph and parses directly into these target representations (Du et al., 2015a).

Results
For the four data sets, Table 1 shows the percentages of complete graphs (G) and individual arcs (A) that can be accounted for under the restriction to noncrossing dependency graphs. These percentages provide upper bounds for the performance of a parser for these graphs under two standard evaluation metrics, exact match and arc-based recall. For comparison, the table also shows results for projective dependency graphs and graphs with pagenumber at most two. The percentages of noncrossing graphs, and hence the maximal possible scores with respect to exact match, vary between 48.23% for CCG and 64.61% for PSD. On all data sets, they are considerably higher than those for projective dependency graphs. We take this as further evidence that the noncrossing property is a more practical restriction than projectivity (as discussed in Section 3.2) when it comes to semantic dependency parsing.
The percentages of arcs that can be accounted for under the noncrossing condition are computed by maximizing, for every graph in the data, the cardinality of a subset of arcs that can be selected such that the subgraph induced by the selected arcs is noncrossing. These percentages, and hence the maximal possible scores with respect to arc-based recall, are close to 96% for PSD and CCG, and exceed 97% for DM and PAS. This suggests that, while a parser restricted to noncrossing dependency graphs would necessarily score low in terms of exact match, it could in principle still obtain relatively high scores in terms of arc-based recall.
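One way to compute such an upper bound on arc recall is a simple interval dynamic program, sketched below. It relies on two observations: every subgraph of an acyclic gold graph is acyclic, so only the noncrossing condition matters here; and in a noncrossing arc set on an interval, once the arcs between the two interval endpoints are set aside, some inner position is not crossed by any remaining arc. The function name is illustrative, and this is not necessarily how the reported numbers were obtained.

```python
from functools import lru_cache


def max_noncrossing_arcs(n, arcs):
    """Largest number of pairwise noncrossing arcs among `arcs` on positions 1..n."""
    between = {}
    for s, t in set(arcs):
        key = (min(s, t), max(s, t))
        between[key] = between.get(key, 0) + 1

    @lru_cache(maxsize=None)
    def best(i, k):
        if i >= k:
            return 0
        own = between.get((i, k), 0)  # arcs between i and k cross nothing inside [i, k]
        if k - i == 1:
            return own
        # Split at a position j that no selected arc passes over.
        return own + max(best(i, j) + best(j, k) for j in range(i + 1, k))

    return best(1, n)
```

Dividing the returned value by the total number of arcs of a graph gives the fraction of its arcs that a noncrossing parser could recover at best.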
The class of graphs with pagenumber at most two has the highest coverage, both with respect to exact match and arc recall. It can account for more than 98% of the graphs and more than 99% of the arcs in each of the four data sets. However, we will show in Section 6 that there is no polynomial-time parsing algorithm for this class of graphs (unless $P = NP$).

Parsing Algorithm
This section contains the main contribution of this paper: a cubic-time exact algorithm for solving the Maximum Subgraph problem for the class of noncrossing dependency graphs.
Theorem 1 For noncrossing dependency graphs, Maximum Subgraph can be solved in time $O(|V|^3)$.

Deduction System
We specify the parsing algorithm by means of a weighted deduction system in the sense of Nederhof (2003). The heart of such a system is a finite set of inference rules that specify how to derive solutions to subproblems of the overall problem from solutions to "simpler" subproblems. These partial solutions are represented by weighted formulas called items. Derivations start from a finite set of initial items called axioms, and the objective is to find the derivation of a goal item with maximal weight. This search can be carried out using a variant of Viterbi's algorithm (Viterbi, 1967).
For the following, we assume that we are given a digraph $G = (V, A)$ with vertices $V = \{1, \ldots, n\}$ and arc weights as described in Section 2. We use $i, j, k$ as metavariables for vertices (positions of the input sentence) where $i \le k$ and $i < j < k$.

Items
We consider subproblems where we construct a maximum noncrossing dependency graph on a given (closed) interval $[i, k] \subseteq V$ of vertices, using only arcs in $A$. Based on the structure of the subgraph, we distinguish five different types of subproblems:
1. Construct a graph that contains the arc $i \to k$. We say that such a graph is min-max-covered.
2. Construct a graph that contains the arc $k \to i$. We say that such a graph is max-min-covered. Note that type 1 and type 2 are mutually exclusive, as the arcs $i \to k$ and $k \to i$ together would form a cycle. If a graph is either min-max-covered or max-min-covered, we say that it is arc-covered.
3. Construct a graph that is not arc-covered but contains a nonempty directed path from $i$ to $k$. We say that such a graph is min-max-connected.
4. Construct a graph that is not arc-covered but contains a nonempty directed path from $k$ to $i$. We say that such a graph is max-min-connected. Type 3 and type 4 are mutually exclusive, as the two paths together would form a cycle.
5. Construct a graph that is not arc-covered and does not contain a nonempty directed path between $i$ and $k$. We say that such a graph is bland.
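The five types can be told apart mechanically. The following sketch classifies a graph on an interval, assuming the graph is given as a set of (source, target) pairs and is acyclic; the function name and the string labels are illustrative.

```python
def classify(i, k, arcs):
    """Type of an acyclic graph with the given arcs on the interval [i, k], i <= k."""
    arcs = set(arcs)
    if i != k and (i, k) in arcs:
        return "min-max-covered"
    if i != k and (k, i) in arcs:
        return "max-min-covered"

    def reaches(src, dst):
        """True if there is a nonempty directed path from src to dst."""
        seen, stack = set(), [src]
        while stack:
            u = stack.pop()
            for s, t in arcs:
                if s == u and t not in seen:
                    if t == dst:
                        return True
                    seen.add(t)
                    stack.append(t)
        return False

    if reaches(i, k):
        return "min-max-connected"
    if reaches(k, i):
        return "max-min-connected"
    return "bland"
```

For example, `classify(1, 3, [(1, 2), (3, 2)])` is bland: both endpoints attach to the middle vertex, but neither endpoint can reach the other along the arc directions.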
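We represent these different types of subproblems by the following items, one for each type: [item symbols over the interval $[i, k]$ for the min-max-covered, max-min-covered, min-max-connected, max-min-connected and bland types]. We shall set up the weight of an item in such a way that it corresponds to the sum of the arc weights of the constructed subgraph.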

Axioms
For each vertex $i$ in $G$, the graph on the one-vertex interval $[i, i]$ is a bland graph. Therefore, the bland items with $i = k$ are sound axioms of the deduction system. Because one-vertex dependency graphs have no arcs, we assign zero weight to these.

Goal Items
Our objective is to construct a maximum noncrossing dependency graph on the full vertex set. Therefore, the goal items of the deduction system are all items over the full interval $[1, n]$.

Inference Rules
The inference rules of the deduction system are shown in Figure 2. For each rule, we let the weight of the consequent be the sum of the weights of the antecedents, plus (for R16-R19) the weight of the arc required by the side condition. Thus, rule R01 for example states that whenever we have constructed a min-max-covered graph on the interval $[i, j]$ with some weight $w_1$ and another min-max-covered graph on the interval $[j, k]$ with some weight $w_2$, it is sound to conclude that we can construct a min-max-connected graph on the interval $[i, k]$ with weight $w_1 + w_2$. Similarly, rule R19 states that whenever we have constructed a bland graph on the interval $[i, k]$ with some weight $w$, then we may add the arc $k \to i$ (if it exists in $G$) and thus construct a max-min-covered graph whose weight is the sum of $w$ and the weight of the new arc.
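The two rules spelled out in the text can be written as small functions over weighted items. This is a sketch of R01 and R19 only, as they are described above; the `Item` type, the string labels and the function names are assumptions made for illustration, and the remaining rules of the deduction system are not covered.

```python
from typing import NamedTuple, Optional


class Item(NamedTuple):
    kind: str      # one of the five item types, spelled out as a string
    i: int         # left end of the interval
    k: int         # right end of the interval
    weight: float  # total weight of the arcs in the constructed subgraph


def rule_r01(left: Item, right: Item) -> Optional[Item]:
    """Two min-max-covered items on [i, j] and [j, k] give a
    min-max-connected item on [i, k]; weights add up."""
    if left.kind == right.kind == "min-max-covered" and left.k == right.i:
        return Item("min-max-connected", left.i, right.k, left.weight + right.weight)
    return None


def rule_r19(item: Item, arc_weights: dict) -> Optional[Item]:
    """A bland item on [i, k] plus the arc k -> i (side condition: the arc
    exists in the input graph) gives a max-min-covered item on [i, k]."""
    arc = (item.k, item.i)
    if item.kind == "bland" and item.i < item.k and arc in arc_weights:
        return Item("max-min-covered", item.i, item.k, item.weight + arc_weights[arc])
    return None
```

Returning `None` when a rule does not apply keeps each rule a pure function, which makes it easy to call them from a chart-filling loop.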

Correctness
We have already argued informally that the axioms and the inference rules of the deduction system are sound. In order to show completeness, we prove the following lemma: For each of the five types listed in Section 4.1.1, whenever the corresponding maximum noncrossing dependency graph on the interval $[i, k]$ has weight $w$, the appropriate item is derived.
We only sketch the (straightforward) proof of this lemma. We define the size of a graph as the total number of its vertices and arcs. The proof is by induction on the size. If the graph has size one (that is, consists of a single vertex), then it is bland and has weight zero; this case is covered by the axioms. For the inductive step, we consider a graph H with size m > 1 and assume that the lemma holds true for all graphs of smaller size. We then show how to derive the relevant item for H from the previously derived items for subgraphs of H using the inference rules.

Implementation and Runtime
In a Viterbi-style search for optimal derivations in the deduction system, we enumerate items by size and derive the weights of items with larger sizes from the weights of items with smaller sizes. Inspecting the maximal number of variables per rule (which is three, for R01-R08), we see that such a search can be implemented to run in time $O(n^3)$.
Figure 2: Deduction system for noncrossing dependency graphs ($1 \le i < k \le n$, $i < j < k$). We write a decorated $k$ as a shorthand for $k - 1$. For each inference rule, the weight of the consequent is the sum of the weights of the antecedents, plus (for rules R16-R19) the weight of the arc required by the side condition.

Uniqueness of Derivations
Besides being sound and complete, the deduction system also has the property that it assigns a unique derivation to every noncrossing dependency graph. The proof, again, is by induction on the size. To illustrate the argument, suppose that we want to construct a min-max-connected graph $H$ on some interval $[i, k]$. The only rules that can be used to derive the corresponding item (of the min-max-connected type) are R01 and R05. In both rules, the graph corresponding to the second antecedent must be a min-max-covered graph on some interval $[j, k]$. This means that the vertex $j$ is uniquely determined; it must be the left endpoint of the longest arc $j \to k$ in $H$. With this, the graph on the remaining interval $[i, j]$ is uniquely determined as well, and both graphs are smaller than $H$; hence we may assume that they have unique derivations. Note that $j$ would not be uniquely determined if the deduction system included (say) an inference rule that derives a min-max-connected item from two other such items: Such a rule would be sound, but it would create derivational ambiguity.
An immediate consequence of the uniqueness of derivations is that our parsing algorithm can be used for counting noncrossing dependency graphs, which is useful for, among other things, testing the correctness of implementations. To count, we change the system such that the weight of an item gives the number of its derivations. The modified system yields sequence A246756 in the On-Line Encyclopedia of Integer Sequences (OEIS Foundation Inc., 2011).
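For very small $n$, the counts can also be obtained by plain enumeration, which gives an independent reference when testing a dynamic-programming implementation. The sketch below is such a brute-force counter; it is not the modified deduction system, and its function name is illustrative. The running time grows as $3^{n(n-1)/2}$, so it is only usable for a handful of vertices.

```python
from itertools import combinations, product


def count_noncrossing_dependency_graphs(n):
    """Count acyclic, noncrossing digraphs on vertices 1..n by enumeration."""
    pairs = list(combinations(range(1, n + 1), 2))

    def crossing(p, q):
        return p[0] < q[0] < p[1] < q[1] or q[0] < p[0] < q[1] < p[1]

    def acyclic(arcs):
        remaining = set(range(1, n + 1))
        while remaining:
            sources = {
                v for v in remaining
                if not any(t == v and s in remaining for s, t in arcs)
            }
            if not sources:
                return False
            remaining -= sources
        return True

    count = 0
    # Each vertex pair is absent (0), oriented left-to-right (1) or right-to-left (2).
    for choice in product((0, 1, 2), repeat=len(pairs)):
        used = [p for p, c in zip(pairs, choice) if c]
        if any(crossing(p, q) for p, q in combinations(used, 2)):
            continue
        arcs = [p if c == 1 else (p[1], p[0]) for p, c in zip(pairs, choice) if c]
        count += acyclic(arcs)
    return count
```

For three vertices, all $3^3 = 27$ orientations of the three vertex pairs are noncrossing, and exactly two of them are directed cycles, which leaves 25 graphs.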

Enforcing Weak Connectivity
Our deduction system can be modified to parse into (and count) various other classes of noncrossing graphs. For example, we can adapt our system to find a maximum noncrossing acyclic subgraph under the additional restriction that this subgraph should be weakly connected (Schluter, 2015). To do so we distinguish between two types of bland subgraphs, (a) weakly connected graphs and (b) graphs with exactly two weakly connected components, and adapt the inference rules and goal items. The change can be implemented without affecting the asymptotic runtime of the algorithm.
Note that when we take the modified deduction system and consider undirected edges instead of directed arcs, we obtain an algorithm for finding maximum connected noncrossing graphs, which are counted by sequence A007297 in the OEIS. These graphs are the target representations of Link Grammar (Sleator and Temperley, 1993).

Practical Parsing
While the main focus of this paper is theoretical, in this section we extend our parsing algorithm into a practical parser. In the context of our general model (Section 2), this requires two additional components: a feature representation and a training algorithm.

Features
We use the arc-based features of TurboParser (Martins et al., 2009), which descend from several other feature models from the literature on syntactic dependency parsing (McDonald et al., 2005a; Carreras et al., 2006; Koo and Collins, 2010). In these models, the feature vector for an arc $i \to j$ represents information about various combinations of the exact forms, lemmas and part-of-speech tags of the words at positions $i$ and $j$; the tags of the immediately surrounding words and the words between $i$ and $j$; as well as the length of the arc and its direction. To support parsing to labeled dependency graphs, we additionally conjoin some of these features with the arc label. For details we refer to the source code.

Training
To learn the feature weights in the weight vector from data we use online passive-aggressive training as described by Crammer et al. (2006). For each gold instance $(x, G)$ in the training data we let the parsing algorithm find the maximum noncrossing dependency graph $\hat{G}$ given the current weight vector $\mathbf{w}$ and update the weight vector according to the passive-aggressive update rule. In this update, $\ell(G, \hat{G})$ is a user-defined loss function and $C > 0$ is a parameter that controls the tradeoff between optimizing the current loss and being close to the old weight vector. We use Hamming loss, defined as the number of (labeled) arcs that are exclusive to either $G$ or $\hat{G}$. Following common practice, we apply weight vector averaging (Freund and Schapire, 1999; Collins, 2002).
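A single training step of this kind can be sketched as follows. The code assumes the PA-I variant of Crammer et al. (2006) with a cost-sensitive hinge loss; whether this is the exact variant used in the paper cannot be determined from the text, and the sparse-dict representation and function names are illustrative. Weight vector averaging is left out.

```python
def hamming_loss(gold_arcs, pred_arcs):
    """Number of arcs that occur in exactly one of the two graphs."""
    return len(set(gold_arcs) ^ set(pred_arcs))


def pa_update(w, feats_gold, feats_pred, loss, C=0.01):
    """In-place PA-I style update of the sparse weight dict w; returns the step size."""
    # Difference of the feature vectors of the gold and the predicted graph.
    diff = dict(feats_gold)
    for name, value in feats_pred.items():
        diff[name] = diff.get(name, 0.0) - value
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return 0.0
    # Margin: score of the gold graph minus score of the predicted graph.
    margin = sum(w.get(name, 0.0) * v for name, v in diff.items())
    hinge = max(0.0, loss - margin)
    tau = min(C, hinge / norm_sq)  # C caps the size of the step
    for name, v in diff.items():
        if v:
            w[name] = w.get(name, 0.0) + tau * v
    return tau
```

Here `feats_gold` and `feats_pred` stand for the summed arc feature vectors of $G$ and $\hat{G}$; a small value of `C` keeps each update close to the previous weight vector.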

Parsing Experiments
We report parsing experiments on the four data sets described in Section 3.3.1 and discuss their results. We use gold-standard lemmas and part-of-speech tags, train each parser for 10 epochs, and report results for the final model on the test data. We use the splits recommended for the respective data sets. Following Carreras (2007), prior to training we transform each dependency graph in the training data to a closest noncrossing dependency graph (footnote 6). In a prestudy using the DM development data we found the best value for the tradeoff parameter $C$ to be 0.01.

Results and Discussion
The experimental results are shown in Table 2. For the SDP data, we report standard metrics used in the SemEval task (Oepen et al., 2015): precision, recall, and F1 on labeled dependencies. For the CCG data, we report the same metrics for unlabeled dependencies; these take into account only the two dependent words but not the lexical category containing the dependency relation or the argument slot (footnote 7). Given the low coverage of noncrossing dependency graphs on the four data sets (recall Section 3.3.2) and the use of a simple arc-factored model with its off-the-shelf features originally developed for syntactic parsing, it is not surprising that the parser does not achieve state-of-the-art results. We still consider our results to be a useful reference for the emerging field of semantic dependency parsing. On the SDP data, the averaged labeled F1 of the parser is 79.63, which is 5.70 points below the corresponding score for the best-performing system in the task, Peking (Du et al., 2015b). Labeled F1 is highest on PAS (86.24) and lowest on PSD (70.01). On the CCG data, the parser achieves an unlabeled F1 of 89.74, 4.24 points below the best reported result for parsing with gold-standard part-of-speech tags (Auli and Lopez, 2011). On all four data sets, precision exceeds recall by a significant margin, more than one might expect given the relatively high upper bounds on arc-based recall that we observed in Section 3.3.2.
Footnote 6: This is done by running the parser with an oracle model that assigns a score of $+1$ to correct and $-1$ to incorrect arcs.
Footnote 7: The high number of different arc labels in the CCG data exceeds what can be handled by the current version of our parser.
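For reference, the precision, recall and F1 figures discussed above are arc-based. A minimal sketch of the computation for a single graph is given below; it assumes arcs are hashable tuples (with or without a label component) and does not reproduce the official scorers, which aggregate counts over a whole test set.

```python
def precision_recall_f1(gold_arcs, pred_arcs):
    """Arc-based precision, recall and F1 of a predicted graph against a gold graph."""
    gold, pred = set(gold_arcs), set(pred_arcs)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

A parser that predicts only a subset of the gold arcs, as a noncrossing parser must do on a crossing gold graph, can reach perfect precision while its recall stays below one.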
The speed of the parser is about 0.01 seconds per sentence or 110 sentences per second on each of the four test sets. Training for 10 epochs takes around 40 minutes per training set. Experiments were performed on an iMac 3.4 GHz Intel Core i5 CPU with 4 cores and Java 1.8.0.

Intractability for Higher Pagenumbers
The parsing algorithm presented in Section 4 has many attractive theoretical properties, but its practical usefulness is limited by the relatively low coverage of noncrossing dependency graphs on the linguistic data. It would be desirable to generalize the algorithm to more expressive classes of graphs. A natural candidate is the class of dependency graphs with pagenumber at most $k$, whose coverage is excellent already for $k = 2$, as we saw in Section 3.3.2. However, we shall now prove the following:
Theorem 2 For dependency graphs with pagenumber at most $k$, Maximum Subgraph is NP-hard whenever $k \ge 2$.
For the proof of this theorem we do not consider the maximization problem defined in Section 2.3 but the following decision version:
Maximum Subgraph for Graph Class $\mathcal{G}$, Decision Version: Given a digraph $G = (V, A)$ and an integer $m \ge 0$, is there a subset $A' \subseteq A$ with $|A'| \ge m$ such that the induced subgraph $G' = (V, A')$ belongs to $\mathcal{G}$?
Note that any polynomial-time algorithm for the maximization problem can be turned into a polynomial-time algorithm for the decision problem: Given an instance for the decision problem, assign each arc weight 1 and solve the maximization problem; then, test whether the solution contains at least $m$ arcs. To show the NP-hardness of the maximization problem it therefore suffices to show it for the decision version of the problem.
Figure 3: A circle graph with a corresponding chord drawing, and a dependency graph as constructed by the algorithm specified in Section 6.2 (right). The example is adapted from Unger (1992).

Circle Graphs
To show that (the decision version of) Maximum Subgraph for dependency graphs with pagenumber at most $k$ is NP-hard we present a polynomial reduction from a decision problem on circle graphs. A circle graph is an undirected graph that represents the intersection graph of a set of chords of a circle. That is, for every circle graph $G$ we can draw a circle and a set of chords of that circle such that the chords are in one-to-one correspondence with the vertices of $G$ and two chords cross each other if and only if the corresponding vertices are adjacent in $G$. An example of a circle graph and a corresponding chord drawing is given in Figure 3. Note that one and the same circle graph may have many different chord drawings. Here, without loss of generality, we restrict our attention to drawings in which no two chords have a common endpoint. In these drawings, there are exactly twice as many points on the circle as there are vertices in the circle graph $G$.
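The definition can be illustrated by computing the circle graph that belongs to a given chord drawing. The sketch assumes that the points on the circle are numbered consecutively around the circle and that each chord is given by its two endpoint numbers; two chords then cross exactly when their endpoints interleave. The function name is illustrative.

```python
from itertools import combinations


def circle_graph(chords):
    """Edge set of the intersection graph of a chord drawing.

    `chords` maps each vertex name to the two distinct endpoint positions
    of its chord, numbered around the circle.
    """
    def cross(c, d):
        (a, b), (x, y) = sorted(c), sorted(d)
        return a < x < b < y or x < a < y < b

    return {
        frozenset((u, v))
        for u, v in combinations(chords, 2)
        if cross(chords[u], chords[v])
    }
```

For instance, the chords `{"a": (1, 3), "b": (2, 5), "c": (4, 6)}` yield a path: `a` crosses `b`, `b` crosses `c`, and `a` and `c` do not meet.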

Reduction
The relevant decision problem on circle graphs is given below. For a graph $G = (V, E)$ and a subset $V' \subseteq V$, we let $G|V'$ denote the subgraph of $G$ induced by $V'$, that is, $G|V'$ has vertex set $V'$ and contains each edge in $E$ between vertices in $V'$.
k-Colorable Induced Subgraph for Circle Graphs (k-CIG): Given a circle graph $G = (V, E)$ and an integer $m \ge 0$, is there a subset $V' \subseteq V$ with $|V'| \ge m$ such that the induced graph $G|V'$ is $k$-colorable?
Cong and Liu (1991) show that this problem is NP-complete if $k \ge 2$. For $k = 2$ and $k \ge 4$ their proof is based on earlier results by Sarrafzadeh and Lee (1989) and Unger (1988), respectively.
The following procedure transforms an arbitrary circle graph $G$ into a dependency graph $H$. For an illustration, see Figure 3.
1. Construct a chord drawing for $G$. Recall that for this drawing there is a one-to-one correspondence between the chords and the vertices of $G$.
2. Cut the circle of the chord drawing and straighten it out into a line. This yields a total order on the endpoints of the chords. We identify each endpoint with its position in that order and denote the left and the right endpoint of a chord $c$ by $c_L$ and $c_R$, respectively. Because we assume that no two chords have a common endpoint, there will be exactly twice as many points on the line as there are chords and therefore vertices in $G$.
3. Direct each chord $c$ from $c_L$ to $c_R$. This defines a dependency graph whose arcs are the directed chords and whose vertices are (the positions of) the chords' endpoints.
The crucial property of this construction is that it establishes a one-to-one correspondence between the vertices of $G$ and the arcs of $H$. More formally, let $G = (V_G, E)$ and $H = (V_H, A)$. The construction establishes a bijection $a \colon V_G \to A$ with $a(v) = c_L \to c_R$, where $c$ is the unique chord corresponding to the vertex $v$ in the chord drawing of $G$.
The dependency graph H can be computed in polynomial time in the size of G. The non-obvious part is step 1; this can be carried out in time quadratic in the number of vertices of G using the algorithm of Spinrad (1994).
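Steps 2 and 3 of the construction are easy to write down once a chord drawing is available. The sketch below takes the chord drawing as given (step 1, the algorithm of Spinrad (1994), is not implemented) and assumes that each chord is specified by the positions of its two endpoints along the cut circle; the function name is illustrative.

```python
def chords_to_dependency_graph(chords):
    """Turn a chord drawing into a dependency graph H.

    `chords` maps each vertex v of G to the two endpoint positions of its
    chord on the cut circle. Returns (vertices of H, arcs of H, bijection a),
    where a[v] is the arc c_L -> c_R for the chord c of v.
    """
    endpoints = [p for pair in chords.values() for p in pair]
    if len(set(endpoints)) != 2 * len(chords):
        raise ValueError("chords must not share endpoints")
    # Identify each endpoint with its position in the left-to-right order.
    rank = {p: r for r, p in enumerate(sorted(endpoints), start=1)}
    a = {v: (rank[min(pair)], rank[max(pair)]) for v, pair in chords.items()}
    vertices = list(range(1, len(endpoints) + 1))
    return vertices, set(a.values()), a
```

Since every arc points from left to right, the resulting digraph is acyclic, and two arcs of $H$ cross exactly when the corresponding chords do.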
We now claim that an instance $(G, m)$ of k-CIG yields a "yes" answer if and only if the corresponding instance $(H, m)$ of Maximum Subgraph for dependency graphs with pagenumber at most $k$ does.
Assume that $(G, m)$ yields a positive answer. This means that there exists a subset of vertices $V'_G \subseteq V_G$ with $|V'_G| \ge m$ and such that the vertices in this set can be colored with at most $k$ colors in such a way that no two adjacent vertices have the same color. Without loss of generality we assume that the colors are numbers between 1 and $k$. Let $f(v)$ denote the color assigned to $v$. Consider the set of arcs corresponding to $V'_G$, $A' = \{ a(v) \mid v \in V'_G \}$. We claim that $A'$ is a solution set for $(H, m)$. Because $a$ is a bijection, $|A'| = |V'_G| \ge m$, and clearly $H' = (V_H, A')$ is a dependency graph. We show that $H'$ can be embedded into a $k$-book. Place every arc $a(v) \in A'$ on page $f(v)$. Assume now that two arcs $a_1 = a(v_1)$ and $a_2 = a(v_2)$ are placed on the same page and cross each other. This implies that $f(v_1) = f(v_2)$ and that there is an edge between $v_1$ and $v_2$ in $G$. Because $v_1, v_2 \in V'_G$, we see that this edge also appears in $G|V'_G$. This contradicts the assumption that $f$ is a proper coloring of $G|V'_G$.
Conversely, assume that $(H, m)$ yields a positive answer. This means that there exists a set of arcs $A'$ with $|A'| \ge m$ and such that there is an assignment of page numbers to the arcs in this set such that no two arcs with the same page number cross each other. Let $f(a)$ denote the page number assigned to $a$. Consider the set of vertices corresponding to $A'$, $V'_G = \{ v \in V_G \mid a(v) \in A' \}$. We claim that $V'_G$ is a solution set for $(G, m)$. Clearly $|V'_G| \ge m$. To show that $G|V'_G$ is $k$-colorable, we interpret the page assignment $f$ as a $k$-coloring of the vertices in $V'_G$ in the obvious way: We color each vertex $v \in V'_G$ with the page number of the corresponding arc $a(v) \in A'$. Now consider two arbitrary vertices $v_1$ and $v_2$ that are adjacent in $G|V'_G$. By construction, the arcs $a(v_1)$ and $a(v_2)$ cross in $H$; therefore $f(a(v_1)) \ne f(a(v_2))$ and $v_1$ and $v_2$ are assigned different colors in $G|V'_G$.
This completes the proof of Theorem 2.
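The correspondence used in the proof can be checked by brute force on a small instance. The example below is not taken from the paper: it uses a 5-cycle as the circle graph, one possible chord drawing for it (chosen by hand and already cut open into a line), and exhaustive search on both sides of the reduction. All names are illustrative.

```python
from itertools import combinations, product


def largest_k_colorable(nodes, conflicts, k):
    """Size of a largest subset of nodes that admits a k-coloring in which
    no conflicting pair inside the subset shares a color (brute force)."""
    nodes = list(nodes)

    def colorable(sub):
        for colors in product(range(k), repeat=len(sub)):
            col = dict(zip(sub, colors))
            if all(col[u] != col[v] for u, v in conflicts if u in col and v in col):
                return True
        return False

    for size in range(len(nodes), -1, -1):
        if any(colorable(sub) for sub in combinations(nodes, size)):
            return size


def cross(a, b):
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


# Circle graph G: a 5-cycle on v1..v5, with chord endpoints at positions 1..10.
edges_G = [("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("v4", "v5"), ("v5", "v1")]
chord = {"v1": (1, 4), "v2": (3, 6), "v3": (5, 8), "v4": (7, 10), "v5": (2, 9)}

# Dependency graph H: each chord directed from its left to its right endpoint.
arcs_H = list(chord.values())
crossings_H = [(a, b) for a, b in combinations(arcs_H, 2) if cross(a, b)]


def sizes(k):
    """(largest k-colorable induced subgraph of G, largest k-page arc set of H)."""
    return (largest_k_colorable(chord, edges_G, k),
            largest_k_colorable(arcs_H, crossings_H, k))
```

For every $k$ the two numbers returned by `sizes(k)` agree, as the reduction predicts: an instance $(G, m)$ has a positive answer exactly when $(H, m)$ does.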

Conclusion
The goal of this paper was to generalize maximum spanning tree dependency parsing to target structures that are not necessarily tree-shaped. Because parsing to unrestricted dependency graphs is intractable, we studied the problem under the restriction to noncrossing graphs, which generalize projective dependency trees as they are known from syntactic parsing. We presented a cubic-time parsing algorithm for this class of graphs, extended the algorithm into a practical parser that we evaluated on four linguistic data sets, and finally proved that the (natural) step beyond the noncrossing condition to dependency graphs with pagenumber at most k renders parsing intractable.
The main contributions of this paper are theoretical. To the best of our knowledge, ours is the first exact-inference algorithm for the full class of noncrossing dependency graphs. The similar algorithm by Schluter (2015), which was developed contemporaneously but independently of ours, is restricted to (weakly) connected graphs. This is a severe practical limitation when semantically vacuous tokens are analyzed as unconnected nodes, as such an analysis renders most graphs unconnected. Another difference between our algorithm and the algorithm of Schluter (2015) is that the latter does not have the uniqueness property discussed in Section 4.4.
An interesting follow-up question to our result in Section 6.2 is whether Maximum Subgraph remains NP-hard when the candidate space is restricted to trees with pagenumber at most $k$. This question was raised by Gómez-Rodríguez and Nivre (2013), who present a greedy parser for the case $k = 2$.
We view our algorithm as a canonical generalization of the Eisner and Satta (1999) parsing algorithm for projective dependency trees, and expect it to serve as a similar point of departure for future extensions of the paradigm. On the one hand, it seems interesting to explore the use of more expressive feature models, including the generalization of our arc-factored model to the grandparent and sibling features of Carreras (2007) and Koo and Collins (2010). On the other hand, given the low coverage of noncrossing dependency graphs, it seems necessary to explore the generalization to new, "mildly" crossing classes of graphs, such as a graph version of the 1-endpoint crossing trees of Pitler et al. (2013). Pitler (2014) shows that the two directions may also be combined, which we hope will lead to new, more accurate algorithms for semantic dependency parsing.