Learning to Prune: Exploring the Frontier of Fast and Accurate Parsing

Pruning hypotheses during dynamic programming is commonly used to speed up inference in settings such as parsing. Unlike prior work, we train a pruning policy under an objective that measures end-to-end performance: we search for a fast and accurate policy. This poses a difficult machine learning problem, which we tackle with the LOLS algorithm. LOLS training must continually compute the effects of changing pruning decisions: we show how to make this efficient in the constituency parsing setting, via dynamic programming and change propagation algorithms. We find that optimizing end-to-end performance in this way leads to a better Pareto frontier, i.e., parsers which are more accurate for a given runtime.


Introduction
Decades of research have been dedicated to heuristics for speeding up inference in natural language processing tasks, such as constituency parsing (Pauls and Klein, 2009; Caraballo and Charniak, 1998) and machine translation (Petrov et al., 2008; Xu et al., 2013). Such research is necessary because of a trend toward richer models, which improve accuracy at the cost of slower inference. For example, state-of-the-art constituency parsers use grammars with millions of rules, while dependency parsers routinely use millions of features. Without heuristics, these parsers take minutes to process a single sentence.
To speed up inference, we will learn a pruning policy. During inference, the pruning policy is invoked to decide whether to keep or prune various parts of the search space, based on features of the input and (potentially) the state of the inference process.
Our approach searches for a policy with maximum end-to-end performance (reward) on training data, where the reward is a linear combination of problem-specific measures of accuracy and runtime, namely reward = accuracy − λ · runtime. The parameter λ ≥ 0 specifies the relative importance of runtime and accuracy. By adjusting λ, we obtain policies with different speed-accuracy tradeoffs.
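To make the role of λ concrete, the following minimal sketch scores three hypothetical policies under reward = accuracy − λ · runtime and reports which one wins as λ grows. The operating points and the names `reward`, `POLICIES` and `best_policy` are illustrative assumptions, not measurements or code from our experiments.

```python
def reward(accuracy, runtime, lam):
    """Linear speed-accuracy tradeoff: accuracy minus lam times runtime."""
    return accuracy - lam * runtime


# Hypothetical operating points: name -> (accuracy, runtime).
POLICIES = {
    "aggressive": (80.0, 1.0),
    "moderate": (88.0, 3.0),
    "cautious": (90.0, 10.0),
}


def best_policy(policies, lam):
    """Return the name of the policy with the highest reward for this lam."""
    return max(policies, key=lambda name: reward(*policies[name], lam))


for lam in (0.0, 1.0, 10.0):
    print(lam, best_policy(POLICIES, lam))
```

With λ = 0 only accuracy matters, so the slowest policy wins; as λ increases, the preferred operating point moves toward faster policies.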
For learning, we use Locally Optimal Learning to Search (LOLS) (Chang et al., 2015b), an algorithm for learning sequential decision-making policies, which accounts for the end-to-end performance of the entire decision sequence jointly. Unfortunately, executing LOLS naively in our setting is prohibitive because it would run inference from scratch millions of times under different policies, training examples, and variations of the decision sequence. Thus, this paper presents efficient algorithms for repeated inference, which are applicable to a wide variety of NLP tasks, including parsing, machine translation and sequence tagging. These algorithms, based on change propagation and dynamic programming, dramatically reduce time spent evaluating similar decision sequences by leveraging problem structure and sharing work among evaluations.
We evaluate our approach by learning pruning heuristics for constituency parsing. In this setting, our approach is the first to account for end-to-end performance of the pruning policy, without making independence assumptions about the reward function, as in prior work (Bodenstab et al., 2011). In the larger context of learning-to-search for structured prediction, our work is unusual in that it learns to control a dynamic programming algorithm (i.e., graph-based parsing) rather than a greedy algorithm (e.g., transition-based parsing). Our experiments show that accounting for end-to-end performance in training leads to better policies along the entire Pareto frontier of accuracy and runtime.

Weighted CKY with pruning
A simple yet effective approach to speeding up parsing was proposed by Bodenstab et al. (2011), who trained a pruning policy π to classify whether or not spans of the input sentence w 1 · · · w n form plausible constituents based on features of the input sentence. These predictions enable a parsing algorithm, such as CKY, to skip expensive steps during its execution: unlikely constituents are pruned. Only plausible constituents are kept, and the parser assembles the highest-scoring parse from the available constituents.
Alg. 1 provides pseudocode for weighted CKY with pruning. Weighted CKY aims to find the highest-scoring derivation (parse tree) of a given sentence, where a given grammar specifies a non-negative score for each derivation rule and a derivation's score is the product of the scores of the rules it uses.¹ CKY uses a dynamic programming strategy to fill in a three-dimensional array β, known as the chart. The score β ikx is the score of the highest-scoring subderivation with fringe w i+1 . . . w k and root x. This value is computed by looping over the possible ways to assemble such a subderivation from smaller subderivations with scores β ijy and β jkz (lines 17-22). Additionally, we track a witness (backpointer) for each β ikx, so that we can easily reconstruct the corresponding subderivation at line 23. The chart is initialized with lexical grammar rules (lines 3-9), which derive words from grammar symbols.
The key difference between pruned and unpruned CKY is an additional "if" statement (line 14), which queries the pruning policy π to decide whether to compute the several values β ikx associated with a span (i, k). Note that width-1 and width-n spans are always kept because all valid parses require them.
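The control flow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than a transcription of Alg. 1: it omits unary rules, uses context-independent rule scores, and the function name `parse`, its argument layout, and the three-word toy grammar are all hypothetical.

```python
def parse(lexical, binary, words, policy, root="S"):
    """Weighted CKY with span pruning (simplified sketch).

    lexical: dict mapping a word to a list of (symbol, score) pairs
    binary:  list of (x, y, z, score) tuples for rules x -> y z
    policy:  callable (words, i, k) -> True to keep span (i, k), False to prune
    Returns (score, tree), or (0.0, None) if no parse survives pruning.
    """
    n = len(words)
    beta = {}      # (i, k, x) -> best score of words[i:k] rooted at x
    witness = {}   # (i, k, x) -> (j, y, z) backpointer

    # Width-1 spans come from lexical rules and are never pruned.
    for i, word in enumerate(words):
        for x, score in lexical.get(word, ()):
            if score > beta.get((i, i + 1, x), 0.0):
                beta[i, i + 1, x] = score

    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            # The width-n span is always kept; other spans consult the policy.
            if width < n and not policy(words, i, k):
                continue
            for j in range(i + 1, k):          # split points
                for x, y, z, g in binary:
                    s = beta.get((i, j, y), 0.0) * beta.get((j, k, z), 0.0) * g
                    if s > beta.get((i, k, x), 0.0):
                        beta[i, k, x] = s
                        witness[i, k, x] = (j, y, z)

    def build(i, k, x):
        """Follow backpointers to reconstruct the best subderivation."""
        if k - i == 1:
            return (x, words[i])
        j, y, z = witness[i, k, x]
        return (x, build(i, j, y), build(j, k, z))

    score = beta.get((0, n, root), 0.0)
    return score, (build(0, n, root) if score > 0.0 else None)


# Hypothetical toy grammar and sentence.
WORDS = ["the", "dog", "barks"]
LEXICAL = {"the": [("DT", 1.0)], "dog": [("NN", 0.5)], "barks": [("VP", 0.3)]}
BINARY = [("S", "NP", "VP", 1.0), ("NP", "DT", "NN", 0.8)]

print(parse(LEXICAL, BINARY, WORDS, lambda w, i, k: True)[1])
print(parse(LEXICAL, BINARY, WORDS, lambda w, i, k: (i, k) != (0, 2))[1])
```

In this toy case, pruning span (0, 2) removes the only NP and therefore the only parse, which illustrates how a single pruning decision can have a cascading effect on the output.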
¹ As is common practice, we assume the grammar has been binarized. We focus on pre-trained grammars, leaving co-adaptation of the grammar and pruning policy to future work. As indicated at lines 6 and 19, a rule's score may be made to depend on the context in which that rule is applied (Finkel et al., 2008), although the pre-trained grammars in our present experiments are ordinary PCFGs for which this is not the case.

End-to-end training
Bodenstab et al. (2011) train their pruning policy as a supervised classifier of spans. They derive direct supervision as follows: try to keep a span if it appears in the gold-standard parse, and prune it otherwise. They found that using an asymmetric weighting scheme helped find the right balance between false positives and false negatives. Intuitively, failing to prune is only a slight slowdown, whereas pruning a good item can ruin the accuracy of the parse.

Algorithm 1 Weighted CKY with pruning (lines 14-24).
14:     if π(w, i, k) = prune :
15:         continue
16:     ▷ Fill in span by considering each split point j
17:     for j := i + 1 to k − 1 :
18:         for (x → y z) ∈ rules(G) :
19:             [...]
                witness(i, k, x) := (j, y, z)
23: d := follow backpointers from (0, n, ROOT)
24: return (β, d)
Our end-to-end training approach improves upon asymmetric weighting by jointly evaluating the sequence of pruning decisions, measuring its effect on the test-time evaluation metric by actually running pruned CKY (Alg. 1). To estimate the value of a pruning policy π, we call PARSE(G, w^(i), π) on each training sentence w^(i), and apply the reward function, r = accuracy − λ · runtime. The empirical value of a policy is its average reward on the training set:
$$R(\pi) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[\, r\left(\mathrm{PARSE}(G, w^{(i)}, \pi)\right) \right] \qquad (1)$$
where m is the number of training sentences. The expectation in the definition may be dropped if PARSE, π, and r are all deterministic, as in our setting.² Our definition of r depends on the user parameter λ ≥ 0, which specifies the amount of accuracy the user would sacrifice to save one unit of runtime. Training under a range of values for λ gives rise to policies covering a number of operating points along the Pareto frontier of accuracy and runtime.
End-to-end training gives us a principled way to decide what to prune. Rather than artificially labeling each pruning decision as inherently good or bad, we evaluate its effect in the context of the particular sentence and the other pruning decisions. Actions that prune a gold constituent are not equally bad: some cause cascading errors, while others are "worked around" in the sense that the grammar still selects a mostly-gold parse. Similarly, actions that prune a non-gold constituent are not equally good: some provide more overall speedup (e.g., pruning narrow constituents prevents wider ones from being built), and some even improve accuracy by suppressing an incorrect but high-scoring parse.
More generally, the gold vs. non-gold distinction is not even available in NLP tasks where one is pruning potential elements of a latent structure, such as an alignment (Xu et al., 2013) or a finer-grained parse (Matsuzaki et al., 2005). Yet our approach can still be used in such settings, by evaluating the reward on the downstream task that the latent structure serves.
Past work on optimizing end-to-end performance is discussed in §8. One might try to scale these techniques to learning to prune, but in this work we take a different approach. Given a policy, we can easily find small ways to improve it on specific sentences by varying individual pruning actions (e.g., if π currently prunes a span then try keeping it instead). Given a batch of improved action sequences (trajectories), the remaining step is to search for a policy which produces the improved trajectories. Conveniently, this can be reduced to a classification problem, much like the asymmetric weighting approach, except that the supervised labels and misclassification costs are not fixed across iterations, but rather are derived from interaction with the environment (i.e., PARSE and the reward function). This idea is formalized as a learning algorithm called Locally Optimal Learning to Search (Chang et al., 2015b), described in §4.
The counterfactual interventions we require (evaluating how reward would change if we changed one action) can be computed more efficiently using our novel algorithms (§5) than by the default strategy of running the parser repeatedly from scratch. The key is to reuse work among evaluations, which is possible because LOLS only makes tiny changes.

Learning algorithm
Pruned inference is a sequential decision process. The process begins in an initial state s 0. In pruned CKY, s 0 specifies the state of Alg. 1 at line 10, after the chart has been initialized from some selected sentence. Next, the policy is invoked to choose action a 0 = π(s 0) (in our case at line 14), which affects what the parser does next. Eventually the parser reaches some state s 1 from which it calls the policy to choose action a 1 = π(s 1), and so on. When the policy is invoked at state s t, it selects action a t based on features extracted from the current state s t, a snapshot of the input sentence, grammar and parse chart at time t.³ We call the state-action sequence s 0 a 0 s 1 a 1 · · · s T a trajectory, where T is the trajectory length. At the final state, the reward function is evaluated, r(s T).
The LOLS algorithm for learning a policy is given in Alg. 2,⁴ with a graphical illustration in Fig. 1. At a high level, LOLS alternates between evaluating and improving the current policy π i.
The evaluation phase first samples a trajectory from π i, called a roll-in: s 0 a 0 s 1 a 1 · · · s T ∼ ROLL-IN(π i). In our setting, s 0 is derived from a randomly sampled training sentence, but the rest of the trajectory is then deterministically computed by π i given s 0. Then we revisit each state s in the roll-in (line 7), and try each available action ā ∈ A(s) (line 9), executing π i thereafter (a rollout) to measure the resulting reward r[ā] (line 10). Our parser is deterministic, so a single rollout is an unbiased, zero-variance estimate of the expected reward. This process is repeated many times, yielding a large list Q i of pairs ⟨s, r⟩, where s is a state that was encountered in some roll-in and r maps the possible actions A(s) in that state to their measured rewards.
Figure 1: Example LOLS iteration (lines 6-10). Roll-in with the current policy π i (starting with a random sentence), s 0 a 0 s 1 a 1 · · · s 5 ∼ ROLL-IN(π i). Perform interventions at each state along the roll-in (only t = 2 is shown). The intervention tries alternative actions at each state (e.g., ā 2 = prune at s 2). We roll out after the intervention by following π i until a terminal state, s̄ 3 ā 3 s̄ 4 ā 4 s̄ 5 ∼ ROLLOUT(π i, s 2, ā 2), and evaluate the reward of the terminal state r(s̄ 5).
Algorithm 2 LOLS algorithm for learning to prune.
14: ▷ Finalize: Pick the best policy over all iterations
15: return argmax i R(π i)
The improvement phase now trains a new policy π i+1 to try to choose high-reward actions, seeking a policy that will "on average" get high rewards r[π i+1 (s)]. Good generalization is important: the policy must select high-reward actions even in states s that are not represented in Q i, in case they are encountered when running the new policy π i+1 (or when parsing test sentences). Thus, beyond just regularizing the training objective, we apply dataset aggregation: we take the training set to include not just Q i but also the examples from previous iterations (line 13). This also ensures that the sequence of policies π 1, π 2, . . . will be "stable" and will eventually converge.
So line 13 seeks to find a good classifier π i+1 using a training set: a possible classifier π would receive from each training example ⟨s, r⟩ a reward of r[π(s)]. In our case, where A(s) = {keep, prune}, this cost-sensitive classification problem is equivalent to training an ordinary binary classifier, after converting each training example ⟨s, r⟩ to ⟨s, argmax a r[a]⟩ and giving this example a weight of |r_{t,keep} − r_{t,prune}|. Our specific classifier is described in §6.
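The reduction from cost-sensitive to weighted binary classification can be sketched as follows. The function names, the string labels, and the tie-breaking rule (ties go to keep, with weight zero) are assumptions made for illustration; the optional normalization mirrors the weight rescaling mentioned in §6.

```python
def to_weighted_examples(Q):
    """Convert (state, {action: reward}) pairs into (state, label, weight)
    triples suitable for an ordinary weighted binary classifier."""
    examples = []
    for state, r in Q:
        # Label is the higher-reward action; ties (weight 0) default to keep.
        label = "keep" if r["keep"] >= r["prune"] else "prune"
        # Weight is how much reward is lost by choosing the other action.
        weight = abs(r["keep"] - r["prune"])
        examples.append((state, label, weight))
    return examples


def normalize(examples):
    """Rescale example weights to sum to 1 (no-op if all weights are 0)."""
    total = sum(w for _, _, w in examples)
    if total == 0:
        return examples
    return [(s, y, w / total) for s, y, w in examples]


# Hypothetical rollout results for three spans of one sentence.
Q = [
    ("span(0,2)", {"keep": 0.25, "prune": -0.5}),
    ("span(1,3)", {"keep": 0.25, "prune": 0.5}),
    ("span(2,4)", {"keep": 0.5, "prune": 0.5}),
]
print(normalize(to_weighted_examples(Q)))
```

An example whose two actions yield the same reward receives weight zero, so it has no influence on the classifier regardless of its label.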
In summary, the evaluation phase of LOLS collects training data for a cost-sensitive classifier, where the inputs (states), outputs (actions), and costs are obtained by interacting with the environment. LOLS concocts a training set and repeatedly revises it, similar to the well-known Expectation-Maximization algorithm. This enables end-to-end training of systems with discrete decisions and nondecomposable reward functions. LOLS gives us a principled framework for deriving (nonstationary) "supervision" even in tricky cases such as latent-variable inference (mentioned in §3). LOLS has strong theoretical guarantees, though in pathological cases, it may take exponential time to converge (Chang et al., 2015b).
The inner loop of the evaluation phase performs roll-ins, interventions and rollouts. Roll-ins ensure that the policy is (eventually) trained under the distribution of states it tends to encounter at test time. Interventions and rollouts force π i to explore the effect of currently disfavored actions.
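The roll-in, intervention and rollout steps can be sketched naively as below, where every rollout re-executes the environment from the intervened state. This is the from-scratch strategy, not the efficient one, and the toy environment (three pruning decisions, one gold span, reward = accuracy − λ · runtime) and all function names are assumptions made for illustration.

```python
def roll_in(policy, s0, step, is_terminal):
    """Follow policy from s0; return the non-terminal states visited."""
    states, s = [], s0
    while not is_terminal(s):
        states.append(s)
        s = step(s, policy(s))
    return states


def rollout(policy, s, step, is_terminal):
    """Follow policy from s until a terminal state and return that state."""
    while not is_terminal(s):
        s = step(s, policy(s))
    return s


def evaluate(policy, s0, step, is_terminal, reward, actions=("keep", "prune")):
    """One roll-in, then one rollout per (state, action) intervention."""
    Q = []
    for s in roll_in(policy, s0, step, is_terminal):
        r = {a: reward(rollout(policy, step(s, a), step, is_terminal))
             for a in actions}
        Q.append((s, r))
    return Q


# Toy environment: a state is the tuple of decisions made so far.
T, GOLD, LAM = 3, {0}, 0.25

def step(s, a):
    return s + (a,)

def is_terminal(s):
    return len(s) == T

def toy_reward(s):
    accuracy = float(all(s[t] == "keep" for t in GOLD))  # 1 if gold span kept
    runtime = sum(a == "keep" for a in s)                # number of kept spans
    return accuracy - LAM * runtime

def keep_all(s):
    return "keep"

for state, r in evaluate(keep_all, (), step, is_terminal, toy_reward):
    print(state, r)
```

In this toy setting the rollouts reveal that pruning the gold span lowers reward, while pruning either of the other two spans raises it, which is the kind of supervision the improvement phase then fits.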

Efficient rollouts
Unlike most applications of LOLS and related algorithms, such as SEARN (Daumé III, 2006) and DAGGER, executing the policy is a major bottleneck in training. Because our dynamic programming parser explores many possibilities (unlike a greedy, transition-based decoder), its trajectories are quite long. This not only slows down each rollout: it means we must do more rollouts.
In our case, the trajectory has length T = n(n+1)/2 − 1 − n for a sentence of length n, where T is also the number of pruning decisions: one for each span other than the root and width-1 spans. LOLS must then perform T rollouts on this example. This means that to evaluate policy π i, we must parse each sentence in the minibatch hundreds of times (e.g., 189 for n = 20, 434 for n = 30, and 779 for n = 40).
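The counts quoted above can be checked directly. The sketch below compares the closed form with an explicit enumeration of the spans that receive a pruning decision; the function names are hypothetical, and the closed form is only meaningful for n ≥ 2.

```python
def num_pruning_decisions(n):
    """Closed form T = n(n+1)/2 - 1 - n, valid for sentence length n >= 2."""
    return n * (n + 1) // 2 - 1 - n


def decision_spans(n):
    """All spans (i, k) of width >= 2, excluding the root span (0, n)."""
    return [(i, k)
            for i in range(n)
            for k in range(i + 2, n + 1)
            if (i, k) != (0, n)]


for n in (20, 30, 40):
    print(n, num_pruning_decisions(n), len(decision_spans(n)))
```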
We can regard each policy π as defining a pruning mask m, an array that maps each of the T spans (i, k) to a decision m ik (1 = keep, 0 = prune). Each rollout tries flipping a different bit in this mask. We could spend less time on each sentence by sampling only some of its T rollouts (see §6). Regardless, the rollouts we do on a given sentence are related: in this section we show how to get further speedups by sharing work among them. In §5.2, we leverage the fact that rollouts will be similar to one another (differing by a single pruning decision). In §5.3, we show that the reward of all T rollouts can be computed simultaneously by dynamic programming under some assumptions about the structure of the reward function (described later). We found these algorithms to be crucial to training in a "reasonable" amount of time (see the empirical comparison in §7.2).

Background: Parsing as hypergraphs
It is convenient to present our efficient rollout algorithms in terms of the hypergraph structure of Alg. 1 (Klein and Manning, 2001;Huang, 2008;Li and Eisner, 2009;Eisner and Blatz, 2007). A hypergraph describes the information flow among related quantities in a dynamic programming algorithm. Many computational tricks apply generically to hypergraphs.
A hypergraph edge e (or hyperedge) is a "generalized arrow" e.head ≺ e.Tail with one output and a list of inputs. We regard each quantity β ikx, m ik, or G(. . .) in Alg. 1 as the value of a corresponding hypergraph vertex β̇ ikx, ṁ ik, or Ġ(. . .). Thus, value(v̇) = v for any vertex v̇. Each ṁ ik's value is computed by the policy π or chosen by a rollout intervention. Each Ġ's value is given by the grammar.
Values of β̇ ikx, by contrast, are computed at line 19 if k − i > 1. To record the dependence of β ikx on other quantities, our hypergraph includes [...] If k − i = 1, then values of β ikx are instead computed at line 6, which does not access any other β values or the pruning mask. Thus our hypergraph includes [...]
With this setup, the value β ikx is the maximum score of any derivation of vertex β̇ ikx (a tree rooted at β̇ ikx, representing a subderivation), where the score of a derivation is the product of its leaf values. Alg. 1 computes it by considering hyperedges β̇ ikx ≺ T and the previously computed values of the vertices in the tail T. For a vertex v̇, we write In(v̇) and Out(v̇) for its sets of incoming and outgoing hyperedges. Our algorithms follow these hyperedges implicitly, without the overhead of materializing or storing them.

Change propagation (CP)
Change propagation is an efficient method for incrementally re-evaluating a computation under a change to its inputs (Acar and Ley-Wild, 2008; Filardo and Eisner, 2012). In our setting, each roll-in at Alg. 2 line 6 evaluates the reward r(PARSE(G, w^(i), π)) from (1), which involves computing an entire parse chart via Alg. 1. The inner loop at line 10 performs T interventions per roll-in, which ask how reward would have changed if one bit in the pruning mask m had been different. Rather than reparsing from scratch (T times) to determine this, we can simply adjust the initial roll-in computation (T times).
CP is efficient when only a small fraction of the computation needs to be adjusted. In principle, flipping a single pruning bit can change up to 50% of the chart, so one might expect the bookkeeping overhead of CP to outweigh the gains. In practice, however, 90% of the interventions change < 10% of the β values in the chart. The reason is that β ikx is a maximum over many quantities, only one of which "wins." Changing a given β ijy rarely affects this maximum, and so changes are unlikely to propagate from vertex β̇ ijy to β̇ ikx. Since changes are not very contagious, the "epidemic of changes" does not spread far.
Alg. 3 provides pseudocode for updating the highest-scoring derivation found by Alg. 1. We remark that RECOMPUTE is called only when we flip a bit from keep to prune, which removes hyperedges and potentially decreases vertex values. The reverse flip only adds hyperedges, which can only increase vertex values and is handled via a running max (lines 12-14).
After determining the effect of flipping a bit, we must restore the original chart before trying a different bit (the next rollout). The simplest approach is to call Alg. 3 again to flip the bit back.⁵
Algorithm 3 (excerpt): value(ṡ) = s; witness(ṡ) = e
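The idea that changes stop spreading once a vertex's maximum is unaffected can be illustrated on a small explicit hypergraph. The sketch below is a generic change-propagation routine under stated assumptions, not Alg. 3: it always recomputes a touched vertex from its incoming hyperedges rather than distinguishing the running-max case, it stores hyperedges explicitly, and the vertex names and values are hypothetical.

```python
import heapq
from math import prod


def best(value, tails):
    """Max over incoming hyperedges of the product of tail values."""
    return max((prod(value[u] for u in tail) for tail in tails), default=0.0)


def evaluate(leaves, edges, order):
    """From-scratch evaluation; order lists internal vertices topologically."""
    value = dict(leaves)
    for v in order:
        value[v] = best(value, edges[v])
    return value


def build_index(edges, order):
    """Precompute topological ranks and, per vertex, the vertices reading it."""
    rank = {v: r for r, v in enumerate(order)}
    out = {}
    for v, tails in edges.items():
        for tail in tails:
            for u in tail:
                out.setdefault(u, set()).add(v)
    return rank, out


def propagate(value, edges, rank, out, leaf, new_value):
    """Set leaf to new_value, repair value in place, and return the
    internal vertices that had to be recomputed (in topological order)."""
    value[leaf] = new_value
    heap = [(rank[v], v) for v in out.get(leaf, ())]
    heapq.heapify(heap)
    queued = {v for _, v in heap}
    recomputed = []
    while heap:
        _, v = heapq.heappop(heap)
        recomputed.append(v)
        new = best(value, edges[v])
        if new == value[v]:
            continue                  # the change did not win the max: stop
        value[v] = new
        for w in out.get(v, ()):
            if w not in queued:
                queued.add(w)
                heapq.heappush(heap, (rank[w], w))
    return recomputed


# Hypothetical hypergraph: "m" plays the role of a pruning-mask bit.
LEAVES = {"g1": 0.5, "g2": 0.2, "m": 1.0, "p": 0.6, "q": 0.3}
EDGES = {
    "A": [("g1", "m")],
    "B": [("g2",)],
    "C": [("A", "p"), ("B", "q")],
    "D": [("B", "p")],
    "root": [("C", "D")],
}
ORDER = ["A", "B", "C", "D", "root"]

value = evaluate(LEAVES, EDGES, ORDER)
rank, out = build_index(EDGES, ORDER)
print(propagate(value, EDGES, rank, out, "m", 0.0))   # flip the bit to prune
print(propagate(value, EDGES, rank, out, "m", 1.0))   # flip it back
print(propagate(value, EDGES, rank, out, "q", 0.1))   # a change that dies out
```

Processing touched vertices in topological order ensures each is recomputed at most once per change, after all of its changed inputs have settled. Flipping the bit back with a second call restores the original values, as in the simplest restoration strategy described above.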

Dynamic programming (DP)
The naive rollout algorithm runs the parser T times, once for each variation of the pruning mask. The reader may be reminded of the finite difference approximation to the gradient of a function, which also measures the effects from perturbing each input value individually. In fact, for certain reward functions, the naive algorithm can be precisely regarded as computing a gradient, and thus we can use a more efficient algorithm, back-propagation, which finds the entire gradient vector of reward as fast (in the big-O sense) as computing the reward once. The overall algorithm is $O(|E| + T)$ where $|E|$ is the total number of hyperedges, whereas the naive algorithm is $O(|E'| \cdot T)$ where $|E'| \le |E|$ is the maximum number of hyperedges actually visited on any rollout.
What accuracy measure must we use? Let $r(d)$ denote the recall of a derivation $d$, i.e., the fraction of gold constituents that appear as vertices in the derivation. A simple accuracy metric would be 1-best recall, the recall $r(\hat{d})$ of the highest-scoring derivation $\hat{d}$ that was not pruned. In this section, we relax that to expected recall,⁶ $\bar{r} = \sum_d p(d)\, r(d)$. Here we interpret the pruned hypergraph's values as an unnormalized probability distribution over derivations, where the probability $p(d) = \tilde{p}(d)/Z$ of a derivation is proportional to its score $\tilde{p}(d) = \prod_{u \in \mathrm{leaves}(d)} \mathrm{value}(u)$.
Though $\bar{r}$ is not quite our evaluation metric, it captures more information about the parse forest, and so may offer some regularizing effect when used in a training criterion (see §7.1). In any case, $\bar{r}$ is close to $r(\hat{d})$ when probability mass is concentrated on a few derivations, which is common with heavy pruning.
We can re-express $\bar{r}$ as $\tilde{r}/Z$, where
$$\tilde{r} = \sum_d \tilde{p}(d)\, r(d), \qquad Z = \sum_d \tilde{p}(d).$$
These can be efficiently computed by dynamic programming (DP), specifically by a variant of the inside algorithm (Li and Eisner, 2009). Since $\tilde{p}(d)$ is a product of rule weights and pruning mask bits at $d$'s leaves (§5.1), each appearing at most once, both $\tilde{r}$ and $Z$ vary linearly in any one of these inputs provided that all other inputs are held constant. Thus, the exact effect on $\tilde{r}$ or $Z$ of changing an input $m_{ik}$ can be found from the partial derivatives with respect to it.
In particular, if we increased $m_{ik}$ by $\Delta \in \{-1, 1\}$ (to flip this bit), the new value of $\bar{r}$ would be exactly
$$\frac{\tilde{r} + \Delta \cdot \partial \tilde{r} / \partial m_{ik}}{Z + \Delta \cdot \partial Z / \partial m_{ik}}.$$
It remains to compute these partial derivatives. All partials can be jointly computed by back-propagation, which is equivalent to another dynamic program known as the outside algorithm (Eisner, 2016).
The inside algorithm only needs to visit the $|E'|$ unpruned edges, but the outside algorithm must also visit some pruned edges, to determine the effect of "unpruning" them (changing their $m_{ik}$ input from 0 to 1) by finding $\partial \tilde{r}/\partial m_{ik}$ and $\partial Z/\partial m_{ik}$. On the other hand, these partials are 0 when some other input to the hyperedge is 0. This case is common when the hypergraph is heavily pruned ($|E'| \ll |E|$), and means that back-propagation need not descend further through that hyperedge.
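The linearity argument can be checked numerically on a toy example. The sketch below enumerates derivations by brute force instead of running the inside and outside algorithms, so it demonstrates only the flip formula, not its efficient computation; the derivation list, weights, recall values and function names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from math import prod


def totals(derivations, mask):
    """Return (r_tilde, Z) by summing over an explicit list of derivations.
    Each derivation is (rule_weight_product, spans_used, recall)."""
    r_tilde = Z = Fraction(0)
    for weight, spans, recall in derivations:
        p = weight * prod(mask[s] for s in spans)   # unnormalized score
        r_tilde += p * recall
        Z += p
    return r_tilde, Z


def partials(derivations, mask, span):
    """Partial derivatives of r_tilde and Z with respect to mask[span]."""
    d_r = d_Z = Fraction(0)
    for weight, spans, recall in derivations:
        if span in spans:
            rest = weight * prod(mask[s] for s in spans if s != span)
            d_r += rest * recall
            d_Z += rest
    return d_r, d_Z


def expected_recall_after_flip(derivations, mask, span):
    """Expected recall if mask[span] were flipped, via the linear update."""
    delta = 1 - 2 * mask[span]          # +1 if currently pruned, -1 if kept
    r_tilde, Z = totals(derivations, mask)
    d_r, d_Z = partials(derivations, mask, span)
    new_Z = Z + delta * d_Z
    return (r_tilde + delta * d_r) / new_Z if new_Z != 0 else None


# Two hypothetical derivations of a 4-word sentence.
DERIVATIONS = [
    (Fraction(3, 5), {(0, 2), (2, 4)}, Fraction(1)),
    (Fraction(2, 5), {(1, 3), (1, 4)}, Fraction(1, 2)),
]
MASK = {(0, 2): 1, (2, 4): 1, (1, 3): 1, (1, 4): 1}

for span in MASK:
    print(span, expected_recall_after_flip(DERIVATIONS, MASK, span))
```

Each printed value agrees with what one would obtain by flipping the bit and recomputing the two sums from scratch, because every mask bit appears at most once in each derivation's score.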
Note that the DP method computes only the accuracies of rollouts-not the runtimes. In this paper, we will combine DP with a very simple runtime measure that is trivial to roll out (see §7). An alternative would be to use CP to roll out the runtimes. This is very efficient: to measure just runtime, CP only needs to update the record of which constituents or edges are built, and not their scores, so the changes are easier to compute than in §5.2, and peter out more quickly.

Parser details⁷
Setup: We use the standard English parsing setup: the Penn Treebank (Marcus et al., 1993) with the standard train/dev/test split, and standard tree normalization.⁸ For efficiency during training, we restrict the length of sentences to ≤ 40. We do not restrict the length of test sentences. We experiment with two grammars: coarse, the "no frills" left-binarized treebank grammar, and fine, a variant of the Berkeley split-merge level-6 grammar (Petrov et al., 2006) as provided by Dunlop (2014, ch. 5). The parsing algorithms used during training are described in §5. Our test-time parsing algorithm uses the left-child loop implementation of CKY (Dunlop et al., 2010). All algorithms allow unary rules (though not chains). We evaluate accuracy at test time with the F 1 score from the official EVALB script (Sekine and Collins, 1997).
Training: Note that we never retrain the grammar weights; we train only the pruning policy. To TRAIN our classifiers (Alg. 2 line 13), we use L2-regularized logistic regression, trained with L-BFGS optimization. We always rescale the example weights in the training set to sum to 1 (otherwise as LOLS proceeds, dataset aggregation overwhelms the regularizer). For the baseline (defined in the next section), we determine the regularization coefficient by sweeping {2^−11, 2^−12, 2^−13, 2^−14, 2^−15} and picking the best value (2^−13) based on the dev frontier. We re-used this regularization parameter for LOLS. The number of LOLS iterations is determined by a 6-day training-time limit⁹ (meaning some jobs run many fewer iterations than others). For LOLS minibatch size we use 10K on the coarse grammar and 5K on the fine grammar. At line 15 of Alg. 2, we return the policy that maximized reward on development data, using the reward function from training.
⁷ Code for experiments is available at http://github.com/timvieira/learning-to-prune.
⁸ Data train/dev/test split (by section) 2-21 / 22 / 23. Normalization operations: Remove function tags, traces, spurious unary edges (X → X), and empty subtrees left by other operations. Relabel ADVP and PRT|ADVP tags to PRT.
⁹ On the 7th day, LOLS rested and performance was good.
Features: We use similar features to Bodenstab et al. (2011), but we have removed features that depend on part-of-speech tags. We use the following 16 feature templates for span (i, k) with 1 < k−i < N: bias, sentence length, boundary words, conjunctions of boundary words, conjunctions of word shapes, span shape, width bucket. Shape features map a word or phrase into a string of character classes (uppercase, lowercase, numeric, spaces); we truncate substrings of identical classes to length two; punctuation characters are never modified in any way. Width buckets use the following partition: 2, 3, 4, 5, [6, 10], [11, 20], [21, ∞). We use feature hashing (Weinberger et al., 2009) with MurmurHash3 (Appleby, 2008) and project to 2^22 features. Conjunctions are taken at positions (i−1, i), (k, k+1), (i−1, k+1) and (i, k). We use special begin and end symbols when a template accesses positions beyond the sentence boundary. Hall et al. (2014) give examples motivating our feature templates and show experimentally that they are effective in multiple languages. Boundary words are strong surface cues for phrase boundaries. Span shape features are also useful as they (minimally) check for matched parentheses and quotation marks.
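The shape and width-bucket templates can be sketched as follows. The class symbols ("A", "a", "0", " "), the bucket labels and the function names are assumptions made for illustration, and `zlib.crc32` is used purely as a standard-library stand-in for MurmurHash3 when projecting feature strings to 2^22 indices.

```python
import zlib
from itertools import groupby


def _classify(ch):
    """Map a character to (symbol, is_class); punctuation maps to itself."""
    if ch.isupper():
        return "A", True
    if ch.islower():
        return "a", True
    if ch.isdigit():
        return "0", True
    if ch.isspace():
        return " ", True
    return ch, False


def shape(text):
    """Character-class string with runs of one class truncated to length 2.
    Runs of punctuation are left exactly as they appear."""
    pieces = []
    for (symbol, is_class), run in groupby(map(_classify, text)):
        length = sum(1 for _ in run)
        pieces.append(symbol * (min(length, 2) if is_class else length))
    return "".join(pieces)


def width_bucket(width):
    """Bucket a span width using the partition 2, 3, 4, 5, 6-10, 11-20, 21+."""
    if width <= 5:
        return str(width)
    if width <= 10:
        return "6-10"
    if width <= 20:
        return "11-20"
    return "21+"


def hashed_index(feature, bits=22):
    """Project a feature string to an index below 2**bits.
    crc32 is a stand-in here; it is not the MurmurHash3 function."""
    return zlib.crc32(feature.encode("utf-8")) % (1 << bits)


print(shape("McDonald's"), shape("(U.S.)"), shape("1999"))
print(width_bucket(4), width_bucket(7), width_bucket(35))
print(hashed_index("span_shape=" + shape("( the dog )")))
```

Because punctuation passes through unchanged, the span shape of a phrase retains its parentheses and quotation marks, which is what allows the classifier to notice unmatched brackets.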

Experimental design and results
Reward functions and surrogates: Each user has a personal reward function. In this paper, we choose to specify our true reward as accuracy − λ · runtime, where accuracy is given by labeled F 1 percentage and runtime by mega-pushes (mpush), millions of calls per sentence to lines 6 and 19 of Alg. 1, which is in practice proportional to seconds per sentence (correlation > 0.95) and is more replicable. We evaluate accordingly (on test data), but during LOLS training we approximate these metrics. We compare:
• r CP (fast): Use change propagation (§5.2) to compute accuracy on a sentence as F 1 of just that sentence, and to approximate runtime as ||β|| 0, the number of constituents that were built.¹⁰
• r DP (faster): Use dynamic programming (§5.3) to approximate accuracy on a sentence as expected recall.¹¹ This time we approximate runtime more crudely as ||m|| 0, the number of nonzeros in the pruning mask for the sentence (i.e., the number of spans whose constituents the policy would be willing to keep if they were built).
We use these surrogates because they admit efficient rollout algorithms. Less importantly, they preserve the training objective (1) as an average over sentences. (Our true F 1 metric on a corpus cannot be computed in this way, though it could reasonably be estimated by averaging over minibatches of sentences in (1).)
Controlled experimental design: Our baseline system is an adaptation of Bodenstab et al. (2011) to learning-to-prune, as described in §3 and §6. Our goal is to determine whether such systems can be improved by LOLS training. We repeat the following design for both reward surrogates (r CP and r DP) and for both grammars (coarse and fine).
x We start by training a number of baseline models by sweeping the asymmetric weighting parameter. For the coarse grammar we train 8 such models, and for the fine grammar 12.
y For each baseline policy, we estimate a value of λ for which that policy is optimal (among baseline policies) according to surrogate reward. 12 10 When using rCP, we speed up LOLS by doing ≤ 2n rollouts per sentence of length n. We sample these uniformly without replacement from the T possible rollouts ( §5), and compensate by upweighting the resulting training examples by T /(2n).
11 Considering all nodes in the binarized tree, except for the root, width-1 constituents, and children of unary rules. 12 We estimate λ by first fitting a parametric model yi = h(xi) ymax · sigmoid(a · log(xi + c) + b) to the baseline runtime-accuracy measurements on dev data (shown in green in Fig. 2) by minimizing mean squared error. We then use the fitted curve's slope h to estimate each λi = h (xi), where xi is the runtime of baseline i. The resulting choice of reward function y − λi · x increases along the green arrow in Fig. 2, and is indeed maximized (subject to y ≤ h(x), and in the region where h is concave) at x = xi. As a sanity check, notice since λi is a derivative of the function y = h(x), its units are in units of y (accuracy) per unit of x (runtime), as appropriate for use in the expression y − λi · x. Indeed, this procedure will construct the same reward function regardless of the units we use to express x. Our specific parametric model h is a sigmoidal curve, with z For each baseline policy, we run LOLS with the same surrogate reward function (defined by λ) for which that baseline policy was optimal. We initialize LOLS by setting π 0 to the baseline policy. Furthermore, we include the baseline policy's weighted training set Q 0 in the at line 13. Fig. 2 shows that LOLS learns to improve on the baseline, as evaluated on development data.
{ But do these surrogate reward improvements also improve our true reward? For each baseline policy, we use dev data to estimate a value of λ for which that policy is optimal according to our true reward function. We use blind test data to compare the baseline policy to its corresponding LOLS policy on this true reward function, testing significance with a paired permutation test. The improvements hold up, as shown in Fig. 3.
The rationale behind this design is that a user who actually wishes to maximize accuracy−λ·runtime, for some specific λ, could reasonably start by choosing the best baseline policy for this reward function, and then try to improve that baseline by running LOLS with the same reward function. Our experiments show this procedure works for a range of λ values.
In the real world, a user's true objective might instead be some nonlinear function of runtime and accuracy. For example, when accuracy is "good enough," it may be more important to improve runtime, and vice-versa. LOLS could be used with such a nonlinear reward function as well. In fact, a user does not even have to quantify their global preferences by writing down such a function. Rather, they could select manually among the baseline policies, choosing one with an attractive speed-accuracy tradeoff, and then specify λ to indicate a local direction of desired improvement (like the green arrows in Fig. 2), modifying this direction periodically as LOLS runs.

Discussion
As previous work has shown, learning to prune gives us excellent parsers with less than 2% overhead for deciding what to prune (i.e., pruning feature extraction and span classification). Even the baseline pruner has access to features unavailable to the grammar, and so it learns to override the grammar, improving an unpruned coarse parser's accuracy from 61.1 to as high as 70.1% F1 on test data (i.e., beneficial search error). It is also 8.1x faster! [13] LOLS simply does a better job at figuring out where to prune, raising accuracy 2.1 points to 72.2 (while maintaining a 7.4x speedup). Where pruning is more aggressive, LOLS has even more impact on accuracy.

[13] We measure runtime as best of 10 runs (recommended by Dunlop (2014)). All parser timing experiments were performed on a Linux laptop with the following specs: Intel® Core™ i5-2540M 2.60GHz CPU, 8GB memory, 32K/256K/3072K L1/L2/L3 cache. Code is written in the Cython language.

Figure 2 (caption fragment): ... accuracy → y_max asymptotically as runtime → ∞. It obtains an excellent fit by placing accuracy and runtime on the log-logit scale (that is, log(x_i + c) and logit(y_i / y_max) transforms are used to convert our bounded random variables x_i and y_i to unbounded ones) and then assuming they are linearly related. (1) The green curve shows the performance of the baseline policies. (2) For each baseline policy, a green arrow points along the gradient of surrogate reward, as defined by the λ that would identify that baseline as optimal. (In case a user wants a different value of λ but is unwilling to search for a better baseline policy outside our set, the green cones around each baseline arrow show the range of λs that would select that baseline from our set.) (3) The LOLS trajectory is shown as a series of purple points, and the purple arrow points from the baseline policy to the policy selected by LOLS with early stopping (§6). This improves surrogate reward if the purple arrow has a positive inner product with the green arrow. LOLS cannot move exactly in the direction of the green arrow because it is constrained to find points that correspond to actual parsers. Typically, LOLS ends up improving accuracy, either along with runtime or at the expense of runtime.
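The Fig. 2 criterion, that surrogate reward improves when the purple arrow has a positive inner product with the green arrow, follows from the linear reward accuracy − λ · runtime, whose gradient in (runtime, accuracy) coordinates is (−λ, 1). The sketch below checks that condition; the function name and the numbers in the usage note are hypothetical.

```python
def improves_surrogate_reward(baseline, candidate, lam):
    """Check whether moving from `baseline` to `candidate` raises
    accuracy - lam * runtime.

    Both points are (runtime, accuracy) pairs. The gradient of the linear
    reward is (-lam, 1.0), so the move helps exactly when its inner
    product with that gradient is positive.
    """
    d_runtime = candidate[0] - baseline[0]
    d_accuracy = candidate[1] - baseline[1]
    gradient = (-lam, 1.0)
    inner_product = gradient[0] * d_runtime + gradient[1] * d_accuracy
    return inner_product > 0
```

For example, with made-up points, trading 0.2 units of extra runtime for 0.02 of extra accuracy helps when λ = 0.05 but hurts when λ = 0.5.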
Even on the fine grammar, where there is less room to improve accuracy, the most accurate LOLS system improves on an unpruned parser by 0.16% F1 with an 8.6x speedup. For comparison, the most accurate baseline loses 0.03% F1 with a 9.7x speedup.
With the fine grammar, we do not see much improvement over the baseline in the regions where accuracy > 85. This is because the supervision specified by asymmetric weighting is similar to what LOLS surmises via rollouts. However, in lower-accuracy regions we see that LOLS can significantly improve reward over its baseline policy. This is because the baseline supervision does not teach which plausible constituents are "safest" to prune, nor can it learn strategies such as "skip all long sentences." We discuss why LOLS does not help as much in the high-accuracy regions further in §7.3.

Figure 3: Test set results on coarse (top) and fine (bottom) grammars. Each curve or column represents a different training regimen. Accuracy is measured in F1 percentage; runtime is measured by millions of hyperedges built per sentence. (4) Here, the green arrows point in the direction of true reward. Dashed lines connect each green baseline point to the two LOLS-improved points. Starred points and bold values indicate a significant improvement over the baseline reward (paired permutation test, p < 0.05). In no case was there a statistically significant decrease. In 4 cases (marked with '−') the policy chosen by early stopping was the initial baseline policy. We also report words per second ×10^3 (kw/s).
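The significance claims in Figure 3 rest on a paired permutation test. The sketch below shows one common Monte Carlo form of such a test, which randomly flips the sign of each per-sentence reward difference. The function name, the number of resamples, and the add-one correction are assumptions; the paper's exact procedure may differ.

```python
import random


def paired_permutation_test(rewards_a, rewards_b, n_resamples=10_000, seed=0):
    """Two-sided Monte Carlo paired permutation test on the summed difference.

    `rewards_a[i]` and `rewards_b[i]` are the rewards of two systems on the
    same sentence i. Under the null hypothesis the two labels in each pair
    are exchangeable, so each difference keeps or flips its sign with
    probability 1/2.
    """
    if len(rewards_a) != len(rewards_b):
        raise ValueError("inputs must be paired and of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(rewards_a, rewards_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_resamples):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

A returned value below 0.05 would correspond to the starred points in the figure, under the assumptions stated above.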
In a few cases in Fig. 2, LOLS finds no policy that improves surrogate reward on dev data. In these cases, surrogate reward does improve slightly on training data (not shown), but early stopping just keeps the initial (baseline) policy since it is just as good on dev data. Adding a bit of additional random exploration might help break out of this initialization.
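The early-stopping behavior described here, where the initial policy is kept unless a later one is strictly better on dev data, can be sketched as a selection over per-iteration dev rewards. The function name and the tie-breaking rule (earliest policy wins) are assumptions made for illustration.

```python
def select_by_early_stopping(dev_rewards):
    """Return the index of the policy with the highest dev reward.

    Index 0 is taken to be the initial (baseline) policy. A later policy
    replaces the current choice only if it is strictly better, so ties
    are resolved in favor of the earlier policy.
    """
    if not dev_rewards:
        raise ValueError("need at least the initial policy's dev reward")
    best_index = 0
    for index, reward in enumerate(dev_rewards):
        if reward > dev_rewards[best_index]:
            best_index = index
    return best_index
```

With dev rewards such as `[0.70, 0.69, 0.70]`, the function returns 0, matching the cases where LOLS keeps its baseline.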
Interestingly, LOLS training with r_DP finds higher-accuracy policies than the corresponding training with r_CP, despite a greater mismatch in surrogate accuracy definitions. We suspect that r_DP's approach of trying to improve expected accuracy may provide a useful regularizing effect, which smooths out the reward signal and provides a useful bias (§5.3).
The most pronounced qualitative difference due to LOLS training is substantially lower rates of parse failure in the mid-to-high λ range on both grammars (not shown). Since LOLS does end-to-end training, it can advise the learner that a certain pruning decision catastrophically results in no parse being found.

Training speed and convergence
Part of the contribution of this paper is faster algorithms for performing LOLS rollouts during training (§5). Compared to the naive strategy of running the parser from scratch T times, r_CP achieves speedups of 4.9-6.6x on the coarse grammar and 1.9-2.4x on the fine grammar. r_DP is even faster: 10.4-11.9x on coarse and 10.5-13.8x on fine. Most of the speedup comes from longer sentences, which take up most of the runtime for all methods. Our new algorithms enable us to train on fairly long sentences (length ≤ 40). We note that our implementations of r_CP and r_DP are not as highly optimized as our test-time parser, so there may be room for improvement.
Orthogonal to the cost per rollout is the number of training iterations. LOLS may take many steps to converge if trajectories are long (i.e., T is large) because each iteration of LOLS training attempts to improve the current policy by a single action. In our setting, T is quite large (discussed extensively in §5), but we are able to circumvent slow convergence by initializing the policy (via the baseline method). This means that LOLS can focus on fine-tuning a policy which is already quite good. In fact, in 4 cases, LOLS did not improve from its initial policy.
We find that when λ is large (the cases where we get meaningful improvements because the initial policy is far from locally optimal), LOLS steadily and smoothly improves the surrogate reward on both training and development data. Because these are fast parsers, LOLS was able to run on the order of 10 (fine grammar) or 100 (coarse grammar) epochs within our 6-day limit; usually it was still improving when we terminated it. By contrast, for the slower and more accurate small-λ parsers (which completed fewer training epochs), LOLS still improves surrogate reward on training data, but without systematically improving on development data. Often the reward on development data fluctuates, and early stopping simply picks the best of this small set of "random" variants.

Understanding the LOLS training signal
In §3, we argued that LOLS gives a more appropriate training signal for pruning than the baseline method of consulting the gold parse, because it uses rollouts to measure the full effect of each pruning decision in the context of the other decisions made by the policy.
To better understand the results of our previous experiments, we analyze how often a rollout does determine that the baseline supervision for a span is suboptimal, and how suboptimal it is in those cases.
We specifically consider LOLS rollouts that evaluate the r_CP surrogate (because r_DP is a cruder approximation to true reward). These rollouts Q_i tell us what actions LOLS is trying to improve in its current policy π_i for a given λ, although there is no guarantee that the learner in §4 will succeed at classifying Q_i correctly (due to limited features, regularization, and the effects of dataset aggregation).
We define regret of the baseline oracle. Let best(s) = argmax_a ROLLOUT(π, s, a) and regret(s) = ROLLOUT(π, s, best(s)) − ROLLOUT(π, s, gold(s)). Note that regret(s) ≥ 0 for all s, and let diff(s) be the event that regret(s) > 0 strictly. We are interested in analyzing the expected regret over all gold and non-gold spans, which we break down as in Eq. (4), where expectations are taken over s ∼ ROLL-IN(π).
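Since regret(s) = 0 whenever diff(s) does not hold, the law of total expectation gives one decomposition that is consistent with these definitions and with the quantities discussed for Fig. 4. Here "gold" denotes the event that s is a gold span. This is a reconstruction under those assumptions and may differ in form from the paper's Eq. (4):

```latex
\begin{align*}
\mathbb{E}[\mathrm{regret}]
  &= p(\mathrm{diff}) \cdot \mathbb{E}[\mathrm{regret} \mid \mathrm{diff}] \\
  &= p(\mathrm{gold}) \cdot p(\mathrm{diff} \mid \mathrm{gold})
       \cdot \mathbb{E}[\mathrm{regret} \mid \mathrm{diff}, \mathrm{gold}] \\
  &\quad + p(\neg\mathrm{gold}) \cdot p(\mathrm{diff} \mid \neg\mathrm{gold})
       \cdot \mathbb{E}[\mathrm{regret} \mid \mathrm{diff}, \neg\mathrm{gold}]
\end{align*}
```

Each factor isolates one effect: how often the baseline oracle disagrees with the best action, and how costly the disagreement is when it occurs, separately for gold and non-gold spans.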
Empirical analysis of regret: To show where the benefit of the LOLS oracle comes from, Fig. 4 graphs the various quantities that enter into the definition (4) of baseline regret, for different π, λ, and grammar. The LOLS oracle evolves along with the policy π, since it identifies the best action given π. We thus evaluate the baseline oracle against two LOLS oracles: the one used at the start of LOLS training (derived from the initial policy π_1 that was trained on baseline supervision), and the one obtained at the end (derived from the LOLS-trained policy π* selected by early stopping). These comparisons are shown by solid and dashed lines respectively.
Class imbalance (black curves): In all graphs, the aggregate curves primarily reflect the non-gold spans, since only 8% of spans are gold.
Gold spans (gold curves): The top graphs show that a substantial fraction of the gold spans should be pruned (whereas the baseline tries to keep them all), although the middle row shows that the benefit of pruning them is small. In most of these cases, pruning a gold span improves speed but leaves accuracy unchanged, because that gold span was missed anyway by the highest-scoring parse. Such cases become both more frequent and more beneficial as λ increases and we prune more heavily. In a minority of cases, however, pruning a gold span also improves accuracy (through beneficial search error).
Non-gold spans (purple curves): Conversely, the top graphs show that a few non-gold spans should be kept (whereas the baseline tries to prune them all), and the middle row shows a large benefit from keeping them. They are needed to recover from catastrophic errors and get a mostly-correct parse.
Coarse vs. fine (left vs. right): The two grammars differ mainly for small λ, and this difference comes especially from the top row. With a fine grammar and small λ, the baseline parses are more accurate, so LOLS has less room for improvement: fewer gold spans go unused, and fewer non-gold spans are needed for recovery.
Effect of λ: Aggressive pruning (large λ) reduces accuracy, so its effect on the top row is similar to that of using a coarse grammar. Aggressive pruning also has an effect on the middle row: there is more benefit to be derived from pruning unused gold spans (surprisingly), and especially from keeping those non-gold spans that are helpful (presumably they enable recovery from more severe parse errors). These effects are considerably sharper with r_DP reward (not shown here), which more smoothly evaluates the entire weighted pruned parse forest rather than trying to coordinate actions to ensure a good single 1-best tree; the baseline oracle is excellent at choosing the action that gets the better forest when the forest is mostly present (small λ) but not when it is mostly pruned (large λ).
Effect on retraining the policy: The black lines in the bottom graphs show the overall regret (on training data) if we were to perfectly follow the baseline oracle rather than the LOLS oracle. In practice, retraining the policy to match the oracle will not match it perfectly in either case. Thus the baseline method has a further disadvantage: when it trains a policy, its training objective weights all gold or all non-gold examples equally, whereas LOLS invests greater effort in matching the oracle on those states where doing so would give greater downstream reward.
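The regret quantities analyzed above can be estimated from a table of rollout values. The sketch below assumes a hypothetical `Rollout` record holding the reward of each action for one sampled state, and treats the baseline oracle as keeping gold spans and pruning all others. The data layout and names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    is_gold: bool   # whether the span appears in the gold parse
    keep: float     # rollout reward if the span is kept
    prune: float    # rollout reward if the span is pruned


def regret(r: Rollout) -> float:
    """Reward lost by following the baseline oracle instead of the best action."""
    baseline_value = r.keep if r.is_gold else r.prune
    return max(r.keep, r.prune) - baseline_value


def summarize(rollouts):
    """Map each group to (p(diff), E[regret | diff], E[regret])."""
    groups = {
        "all": list(rollouts),
        "gold": [r for r in rollouts if r.is_gold],
        "non-gold": [r for r in rollouts if not r.is_gold],
    }
    stats = {}
    for name, members in groups.items():
        regrets = [regret(r) for r in members]
        positive = [x for x in regrets if x > 0]  # states where diff holds
        n = len(regrets)
        p_diff = len(positive) / n if n else 0.0
        mean_given_diff = sum(positive) / len(positive) if positive else 0.0
        mean = sum(regrets) / n if n else 0.0
        stats[name] = (p_diff, mean_given_diff, mean)
    return stats
```

The three values per group correspond to the top, middle, and bottom rows discussed for Fig. 4. The per-state regret could also serve as an example weight in cost-sensitive training, which is the contrast with equal weighting drawn above.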

Related work
Our experiments have focused on using LOLS to improve a reasonable baseline. Fig. 5 shows that our resulting parser fits reasonably among state-of-the-art constituency parsers trained and tested on the Penn Treebank. These parsers include a variety of techniques that improve speed or accuracy. Many are quite orthogonal to our work here: e.g., the SpMV method (which is necessary for Bodenstab's parser to beat ours) is a set of cache-efficient optimizations (Dunlop, 2014) that could be added to our parser (just as it was added to Bodenstab's), while Hall et al. (2014) and Fernández-González and Martins (2015) replace the grammar with faster scoring models that have more conditional independence. Overall, other fast parsers could also be trained using LOLS, so that they quickly find parses that are accurate, or at least helpful to the accuracy of some downstream task.

Figure 4: Comparison of the LOLS and baseline training signals based on the regret decomposition in Eq. (4) as we vary π, λ, and grammar. Solid lines show where the baseline oracle is suboptimal on its own system π_1 and dashed lines show where it is suboptimal on the LOLS-improved system π*. Each plot shows an overall quantity in black as well as that quantity broken down by gold and non-gold spans.
Pruning methods can use classifiers not only to select spans but also to prune at other granularities (Roark and Hollingshead, 2008; Bodenstab et al., 2011). Prioritization methods do not prune substructures, but instead delay their processing until they are needed, if ever (Caraballo and Charniak, 1998).
This paper focuses on learning pruning heuristics that have trainable parameters. In the same way, Stoyanov and Eisner (2012) learn to turn off unneeded factors in a graphical model, and Jiang et al. (2012) and Berant and Liang (2015) train prioritization heuristics (using policy gradient). In both of those 2012 papers, we explicitly sought to maximize accuracy − λ · runtime as we do here. Some previous "coarse-to-fine" work does not optimize heuristics directly but rather derives heuristics for pruning (Charniak et al., 2006; Petrov and Klein, 2007; Weiss and Taskar, 2010; Rush and Petrov, 2012) or prioritization (Klein and Manning, 2003; Pauls and Klein, 2009) from a coarser version of the model. Combining these automatic methods with LOLS would require first enriching their heuristics with trainable parameters, or parameterizing the coarse-to-fine hierarchy itself as in the "feature pruning" work of He et al. (2013) and Strubell et al. (2015).

Figure 5:
System | F1 | words/sec
Dyer et al. (2016a); Dyer et al. (2016b) | 93.3 | -
Zhu et al. (2013) | 90.4 | 1290
Fernández-González and Martins (2015) | |

Dynamic features are ones that depend on previous actions. In our setting, a policy could in principle benefit from considering the full state of the chart at Alg. 1 line 14. While coarse-to-fine methods implicitly use certain dynamic features, training with dynamic features is a fairly new goal that is challenging to treat efficiently. It has usually been treated with some form of simple imitation learning, using a heuristic training signal much as in our baseline (Jiang, 2014; He et al., 2013). LOLS would be a more principled way to train such features, but for efficiency, our present paper restricts to static features that only access the state via π(w, i, k). This permits our fast CP and DP rollout algorithms. It also reduces the time and space cost of dataset aggregation. [15]

LOLS attempts to do end-to-end training of a sequential decision-making system, without falling back on black-box optimization tools (Och, 2003; Chung and Galley, 2012) that ignore the sequential structure. In NLP, sequential decisions are more commonly trained with step-by-step supervision (Kuhlmann et al., 2011), using methods such as local classification (Punyakanok and Roth, 2001) or beam search with early update (Collins and Roark, 2004). LOLS tackles the harder setting where the only training signal is a joint assessment of the entire sequence of actions. It is an alternative to policy gradient, which does not scale well to our long trajectories because of high variance in the estimated gradient and because random exploration around (even good) pruning policies most often results in no parse at all. LOLS uses controlled comparisons, resulting in more precise "credit assignment" and tighter exploration.

[15] LOLS repeatedly evaluates actions given (w, i, k). We consolidate the resulting training examples by summing their reward vectors r, so the aggregated dataset does not grow over time.
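The consolidation described in footnote 15 can be sketched as a dictionary keyed by the static state (w, i, k), with each new reward vector added elementwise to a running sum. The class name, the two-action layout, and the use of a sentence identifier for `w` are assumptions for illustration.

```python
from collections import defaultdict


class AggregatedDataset:
    """One summed reward vector per (w, i, k).

    Repeated evaluations of the same state update an existing entry, so
    the number of entries is bounded by the number of distinct states.
    """

    def __init__(self, n_actions=2):
        self.n_actions = n_actions
        self.rewards = defaultdict(lambda: [0.0] * n_actions)

    def add(self, w, i, k, reward_vector):
        """Add one rollout result; `w` must be hashable (e.g. a sentence id)."""
        if len(reward_vector) != self.n_actions:
            raise ValueError("reward vector has the wrong number of actions")
        total = self.rewards[(w, i, k)]
        for action, reward in enumerate(reward_vector):
            total[action] += reward

    def __len__(self):
        return len(self.rewards)
```

Adding the same (w, i, k) twice leaves the dataset size unchanged and only updates the stored sums, which is the property the footnote relies on.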
We would be remiss not to note that current transition-based parsers, for constituency parsing (Zhu et al., 2013; Crabbé, 2015) as well as dependency parsing (Chen and Manning, 2014), are both incredibly fast and surprisingly accurate. This may appear to undermine the motivation for our work, or at least for its application to fast parsing. However, transition-based parsers do not produce marginal probabilities of substructures, which can be useful features for downstream tasks. Indeed, the transition-based approach is essentially greedy and so it may fail on tasks with more ambiguity than parsing. Current transition-based parsers also require step-by-step supervision, whereas our method can also be used to train in the presence of incomplete supervision, latent structure, or indirect feedback. Our method could also be used immediately to speed up dynamic programming methods for MT, synchronous parsing, parsing with non-context-free grammar formalisms, and other structured prediction problems for which transition systems have not (yet) been designed.

Conclusions
We presented an approach to learning pruning policies that optimizes end-to-end performance on a user-specified speed-accuracy tradeoff. We developed two novel algorithms for efficiently measuring how varying policy actions affects reward. In the case of parsing, given a performance criterion and a good baseline policy for that criterion, the learner consistently manages to find a higher-reward policy. We hope this work inspires a new generation of fast and accurate structured prediction models with tunable runtimes.