Approximation-Aware Dependency Parsing by Belief Propagation

We show how to train the fast dependency parser of Smith and Eisner (2008) for improved accuracy. This parser can consider higher-order interactions among edges while retaining O(n³) runtime. It outputs the parse with maximum expected recall, but for speed, this expectation is taken under a posterior distribution that is constructed only approximately, using loopy belief propagation through structured factors. We show how to adjust the model parameters to compensate for the errors introduced by this approximation, by following the gradient of the actual loss on training data. We find this gradient by back-propagation. That is, we treat the entire parser (approximations and all) as a differentiable circuit, as others have done for loopy CRFs (Domke, 2010; Stoyanov et al., 2011; Domke, 2011; Stoyanov and Eisner, 2012). The resulting parser obtains higher accuracy with fewer iterations of belief propagation than one trained by conditional log-likelihood.


Introduction
Recent improvements to dependency parsing accuracy have been driven by higher-order features. Such a feature can look beyond just the parent and child words connected by a single edge to also consider siblings, grandparents, etc. By including increasingly global information, these features provide more information for the parser, but they also complicate inference. The resulting higher-order parsers depend on approximate inference and decoding procedures, which may prevent them from predicting the best parse.
For example, consider the dependency parser we will train in this paper, which is based on the work of Smith and Eisner (2008). Ostensibly, this parser finds the minimum Bayes risk (MBR) parse under a probability distribution defined by a higher-order dependency parsing model. In reality, however, it achieves O(n³T) runtime by relying on three approximations during inference: (1) variational inference by loopy belief propagation (BP) on a factor graph, (2) early stopping of inference after t_max iterations prior to convergence, and (3) a first-order pruning model to limit the number of edges considered in the higher-order model. Such parsers are traditionally trained as if the inference had been exact (Smith and Eisner, 2008). In contrast, we train the parser such that the approximate system performs well on the final evaluation function. Stoyanov and Eisner (2012) call this approach ERMA, for "empirical risk minimization under approximations." We treat the entire parsing computation as a differentiable circuit, and backpropagate the evaluation function through our approximate inference and decoding methods to improve its parameters by gradient descent.
Our primary contribution is the application of Stoyanov and Eisner's learning method in the parsing setting, for which the graphical model involves a global constraint. Smith and Eisner (2008) previously showed how to run BP in this setting (by calling the inside-outside algorithm as a subroutine). We must backpropagate the downstream objective function through their algorithm so that we can follow its gradient. We carefully define our objective function to be smooth and differentiable, yet equivalent to the accuracy of the minimum Bayes risk (MBR) parse in the limit. Further, we introduce a new, simpler objective function based on the L2 distance between the approximate marginals and the "true" marginals from the gold data.

Figure 1: Factor graph for dependency parsing of a 4-word sentence; the special node <ROOT> is the root of the dependency graph. In this figure, the boolean variable y_{h,m} encodes whether the edge from parent h to child m is present. The unary factor (black) connected to this variable scores the edge in isolation (given the sentence). The PTREE factor (red) coordinates all variables to ensure that the edges form a tree. The drawing shows a few of the higher-order factors (purple factors for grandparents, green factors for arbitrary siblings); these are responsible for the graph being cyclic ("loopy").
The goal of this work is to account for the approximations made by a system rooted in structured belief propagation. Taking such approximations into account during training enables us to improve the speed and accuracy of inference at test time. To this end, we compare our training method with the standard approach of conditional log-likelihood. We evaluate our parser on 19 languages from the CoNLL-2006 (Buchholz and Marsi, 2006) and CoNLL-2007 (Nivre et al., 2007) Shared Tasks as well as the English Penn Treebank (Marcus et al., 1993). On English, the resulting parser obtains higher accuracy with fewer iterations of BP than standard conditional log-likelihood (CLL) training. On the CoNLL languages, we find that on average it yields higher-accuracy parsers than CLL training, particularly when limited to few BP iterations.

Dependency Parsing by Belief Propagation
This section describes the parser that we will train.
Model A factor graph (Frey et al., 1997; Kschischang et al., 2001) is a bipartite graph between factors α and variables y_i, and defines the factorization of a probability distribution over a set of variables {y_1, y_2, ...}. The factor graph contains edges between each factor α and a subset of variables y_α. Each factor has a local opinion about the possible assignments to its neighboring variables. Such opinions are given by the factor's potential function ψ_α, which assigns a nonnegative score to each configuration of a subset of variables y_α. We define the probability of a given assignment y to be proportional to a product of potential functions:

    p(y) = (1/Z) ∏_α ψ_α(y_α)    (1)

Smith and Eisner (2008) define a factor graph for dependency parsing of a given n-word sentence: O(n²) binary variables {y_1, y_2, ...} indicate which of the directed arcs are included (y_i = ON) or excluded (y_i = OFF) in the dependency parse. One of the factors plays the role of a hard global constraint: ψ_PTREE(y) is 1 or 0 according to whether the assignment encodes a projective dependency tree. Another O(n²) factors (one per variable) evaluate the individual arcs given the sentence, so that p(y) describes a first-order dependency parser. A higher-order parsing model is achieved by including higher-order factors, each scoring configurations of two or more arcs, such as grandparent and sibling configurations. Higher-order factors add cycles to the factor graph. See Figure 1 for an example factor graph.
We define each potential function to have a log-linear form: ψ_α(y_α) = exp(θ · f_α(y_α, x)). Here x is the vector of observed variables such as the sentence and its POS tags; f_α extracts a vector of features; and θ is our vector of model parameters. We write the resulting probability distribution over parses as p_θ(y), to indicate that it depends on θ.
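As a concrete illustration, a potential of this log-linear form is just the exponential of a sparse dot product. This is a hedged sketch: the feature names and weights below are invented toy values, not the paper's feature templates.

```python
import math

def potential(theta, features):
    """Return psi_alpha(y_alpha) = exp(theta . f) for one configuration."""
    return math.exp(sum(theta.get(name, 0.0) * value
                        for name, value in features.items()))

# Toy weights and features for a unary factor scoring one arc as ON.
theta = {"head=VB,child=NN": 1.5, "arc_len=1": 0.3}
f_on = {"head=VB,child=NN": 1.0, "arc_len=1": 1.0}
score = potential(theta, f_on)   # exp(1.8)
```

Because the score is an exponential, it is always nonnegative, as the definition of ψ_α requires.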
Loss For dependency parsing, our loss function is the number of missing edges in the predicted parse ŷ, relative to the reference (or "gold") parse y*:

    ℓ(ŷ, y*) = Σ_{i: y*_i = ON} I(ŷ_i = OFF)    (2)

Because ŷ and y* each specify exactly one parent for each word token, ℓ(ŷ, y*) equals the number of word tokens whose parent is predicted incorrectly; that is, directed dependency error.
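Since gold and predicted parses each assign exactly one parent per token, counting missing gold edges is the same as counting wrongly attached tokens. A minimal sketch with invented toy parent maps:

```python
# Toy parses: token index -> parent index (0 = root); values are invented.
gold_parent = {1: 0, 2: 1, 3: 2, 4: 2}
pred_parent = {1: 0, 2: 1, 3: 1, 4: 2}

# Number of missing gold edges == directed dependency error.
loss = sum(1 for tok in gold_parent if pred_parent[tok] != gold_parent[tok])
# loss == 1: only token 3 has the wrong parent
```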
Decoder To obtain a single parse as output, we use a minimum Bayes risk (MBR) decoder, which attempts to find the tree with minimum expected loss under the model's distribution (Bickel and Doksum, 1977). For our directed dependency error loss function, we obtain the following decision rule:

    h_θ(x) = argmax_ŷ Σ_{i: ŷ_i = ON} p_θ(y_i = ON | x)    (3)

Here ŷ ranges over well-formed parses. Thus, our parser seeks a well-formed parse h_θ(x) whose individual edges have a high probability of being correct according to p_θ. MBR is the principled way to take a loss function into account under a probabilistic model. By contrast, maximum a posteriori (MAP) decoding does not consider the loss function.
It would return the single highest-probability parse even if that parse, and its individual edges, were unlikely to be correct. All systems in this paper use MBR decoding to consider the loss function at test time. This implies that the ideal training procedure would be to find the true p_θ so that its marginals can be used in (3). Our baseline system attempts this. In practice, however, we will not be able to find the true p_θ (model misspecification) nor exactly compute the marginals of p_θ (computational intractability). Thus, this paper proposes a training procedure that compensates for the system's approximations, adjusting θ to reduce the actual loss of h_θ(x) as measured at training time.
To find the MBR parse, we first run inference to compute the marginal probability p_θ(y_i = ON) for each edge. Then we maximize (3) by running a first-order dependency parser with edge scores equal to those probabilities. When our inference algorithm is approximate, we replace the exact marginal with its approximation: the normalized belief from BP, given by b_i(ON) in (6) below.
The algorithm proceeds by iteratively sending messages from variables y_i to factors ψ_α:

    m^(t)_{i→α}(y_i) ∝ ∏_{β ∈ N(i)\α} m^(t−1)_{β→i}(y_i)    (4)

and from factors to variables:

    m^(t)_{α→i}(y_i) ∝ Σ_{y_α ∼ y_i} ψ_α(y_α) ∏_{j ∈ N(α)\i} m^(t−1)_{j→α}(y_j)    (5)

where N(i) and N(α) denote the neighbors of y_i and ψ_α respectively, and where y_α ∼ y_i is standard notation to indicate that y_α ranges over all assignments to the variables participating in the factor α provided that the ith variable has value y_i. Note that the messages at time t are computed from those at time (t − 1). Messages at the final time t_max are used to compute the beliefs at each factor and variable:

    b_i(y_i) ∝ ∏_{α ∈ N(i)} m^(t_max)_{α→i}(y_i)    (6)

    b_α(y_α) ∝ ψ_α(y_α) ∏_{i ∈ N(α)} m^(t_max)_{i→α}(y_i)    (7)

Each of the functions defined by equations (4)-(7) can be optionally rescaled by a constant at any time, e.g., to prevent overflow/underflow. Below, we specifically assume that each function b_i has been rescaled such that Σ_{y_i} b_i(y_i) = 1. This b_i approximates the marginal distribution over y_i values.
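The update equations above can be sketched for a generic factor graph with small discrete variables. This is a minimal synchronous implementation for illustration only, not the paper's structured BP (which never enumerates the PTREE factor's assignments explicitly); the data structures are invented.

```python
from itertools import product

def run_bp(domains, factors, t_max):
    """Synchronous BP. domains: var -> list of values.
    factors: name -> (scope tuple, table mapping assignment tuple -> potential)."""
    nbrs = {v: [f for f, (scope, _) in factors.items() if v in scope]
            for v in domains}
    v2f = {(v, f): {x: 1.0 for x in domains[v]}          # uniform initialization
           for v in domains for f in nbrs[v]}
    f2v = {(f, v): {x: 1.0 for x in domains[v]}
           for f, (scope, _) in factors.items() for v in scope}
    for _ in range(t_max):
        # Variable-to-factor updates, from the previous factor-to-variable messages.
        new_v2f = {}
        for (v, f) in v2f:
            m = {x: 1.0 for x in domains[v]}
            for g in nbrs[v]:
                if g != f:
                    for x in m:
                        m[x] *= f2v[(g, v)][x]
            z = sum(m.values())                          # rescale to avoid underflow
            new_v2f[(v, f)] = {x: m[x] / z for x in m}
        # Factor-to-variable updates: sum over all assignments consistent with y_i.
        new_f2v = {}
        for (f, v) in f2v:
            scope, table = factors[f]
            m = {x: 0.0 for x in domains[v]}
            for assign in product(*(domains[u] for u in scope)):
                p = table[assign]
                for u, xu in zip(scope, assign):
                    if u != v:
                        p *= v2f[(u, f)][xu]
                m[assign[scope.index(v)]] += p
            z = sum(m.values())
            new_f2v[(f, v)] = {x: m[x] / z for x in m}
        v2f, f2v = new_v2f, new_f2v
    # Variable beliefs from the final messages, normalized to sum to 1.
    beliefs = {}
    for v in domains:
        b = {x: 1.0 for x in domains[v]}
        for f in nbrs[v]:
            for x in b:
                b[x] *= f2v[(f, v)][x]
        z = sum(b.values())
        beliefs[v] = {x: b[x] / z for x in b}
    return beliefs
```

On a tree-structured graph this recovers the exact marginals once messages have propagated; on loopy graphs it returns the approximate beliefs discussed below.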
Messages continue to change indefinitely if the factor graph is cyclic, but in the limit, the rescaled messages may converge. Although the equations above update all messages in parallel, convergence is much faster if only one message is updated per timestep, in some well-chosen serial order.

For the PTREE factor, the summation over variable assignments required for m^(t)_{α→i}(y_i) in Eq. (5) equates to a summation over exponentially many projective parse trees. However, we can use an inside-outside variant of the algorithm of Eisner (1996) to compute this in polynomial time (we describe this as hypergraph parsing in § 3). The resulting "structured BP" inference procedure is exact for first-order dependency parsing, and approximate when higher-order factors are incorporated. The advantage of BP is that it enables fast approximate inference when exact inference is too slow. See Smith and Eisner (2008) for details.


Approximation-aware Learning

We aim to find the parameters θ* that minimize a regularized objective function over the training sample of sentence/parse pairs {(x^(d), y^(d))}_{d=1}^D:

    θ* = argmin_θ Σ_{d=1}^D J(θ; x^(d), y^(d)) + λ‖θ‖²₂    (8)

where λ > 0 is the regularization coefficient and J(θ; x, y) is a given differentiable function, possibly nonconvex. We locally minimize this objective using ℓ₂-regularized AdaGrad with Composite Mirror Descent (Duchi et al., 2011): a variant of stochastic gradient descent that uses mini-batches, an adaptive learning rate per dimension, and sparse lazy updates from the regularizer.

Objective Functions As in Stoyanov et al. (2011), our aim is to minimize expected loss on the true data distribution over sentence/parse pairs (X, Y):

    θ* = argmin_θ E[ℓ(h_θ(X), Y)]    (9)

Since the true data distribution is unknown, we substitute the expected loss over the training sample, and regularize our objective to reduce sampling variance. Specifically, we aim to minimize the regularized empirical risk, given by (8) with J(θ; x^(d), y^(d)) = ℓ(h_θ(x^(d)), y^(d)). Using our MBR decoder h_θ in (3), this loss function would not be differentiable because of the argmax in the definition of h_θ (3). We will address this below by substituting a differentiable softmax. This is the "ERMA" method of Stoyanov and Eisner (2012). We will also consider simpler choices of J(θ; x^(d), y^(d)) that are commonly used in training neural networks. Finally, the standard convex objective is conditional log-likelihood (§ 4).

Gradient Computation
To compute the gradient ∇_θ J(θ; x, y*) of the loss on a single sentence (x, y*) = (x^(d), y^(d)), we apply automatic differentiation (AD) in reverse mode (Griewank and Corliss, 1991). This yields the same type of "backpropagation" algorithm that has long been used for training neural networks (Rumelhart et al., 1986).
In effect, we are regarding (say) 5 iterations of the BP algorithm on sentence x, followed by (softened) MBR decoding and comparison to the target output y*, as a kind of neural network that computes ℓ(h_θ(x), y*). It is important to note that the resulting gradient computation algorithm is exact up to floating-point error, and has the same asymptotic complexity as the original decoding algorithm, requiring only about twice the computation. The AD method applies provided that the original function is indeed differentiable with respect to θ, an issue that we take up below.
In principle, it is possible to compute the gradient with minimal additional coding. There exists AD software (some listed at autodiff.org) that could be used to derive the necessary code automatically. Another option would be to use the perturbation method of Domke (2010). However, we implemented the gradient computation directly, and we describe it here.

Inference, Decoding, and Loss as a Feedforward Circuit
The backpropagation algorithm is often applied to neural networks, where the topology of a feedforward circuit is statically specified and can be applied to any input. Our BP algorithm, decoder, and loss function similarly define a feedforward circuit that computes our function J. However, the circuit's topology is defined dynamically (per sentence x^(d)) by "unrolling" the computation into a graph.
Figure 2 shows this topology for one choice of objective function. The high-level modules consist of (A) computing potential functions, (B) initializing messages, (C) sending messages, (D) computing beliefs, and (E) decoding and computing the loss. We zoom in on two submodules: the first computes messages from the PTREE factor efficiently (C.1-C.3); the second computes a softened version of our loss function (E.1-E.3). Both of these submodules are made efficient by the inside-outside algorithm.
The remainder of this section describes additional details of how we define the function J (the forward pass) and how we compute its gradient (the backward pass). Backpropagation computes the derivative of any given function specified by an arbitrary circuit consisting of elementary differentiable operations (e.g. +, −, ×, ÷, log, exp). This is accomplished by repeated application of the chain rule.
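As a toy illustration of this reverse-mode bookkeeping (unrelated to the parser itself), consider J = log(exp(a) + exp(b)) unrolled into elementary operations; each adjoint is computed from the adjoints of later values by the chain rule:

```python
import math

a, b = 0.5, -1.0
# Forward pass: record each intermediate value.
ea, eb = math.exp(a), math.exp(b)
s = ea + eb
J = math.log(s)
# Backward pass: adjoints in reverse order of computation.
dJ = 1.0
ds = dJ / s               # d log(s)/ds = 1/s
dea, deb = ds, ds         # s = ea + eb, so both addends get ds
da = dea * ea             # d exp(a)/da = exp(a)
db = deb * eb
# da is the softmax weight exp(a) / (exp(a) + exp(b))
```

The backward pass revisits each forward step exactly once, which is why the total cost is only about twice that of the forward computation.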
Backpropagating through an algorithm proceeds by similar application of the chain rule, where the intermediate quantities are determined by the topology of the circuit. Doing so with the circuit from Figure 2 poses several challenges. Eaton and Ghahramani (2009) and Stoyanov et al. (2011) showed how to backpropagate through the basic BP algorithm, and we reiterate the key details below (§ 3.3). The remaining challenges form the primary technical contribution of this paper:

1. Our true loss function ℓ(h_θ(x), y*) by way of the decoder (3) contains an argmax over trees and is therefore not differentiable. We show how to soften this decoder, making it differentiable (§ 3.2).

2. Empirically, we find the above objective difficult to optimize. To address this, we substitute a simpler L2 loss function (commonly used in neural networks). This is easier to optimize and yields our best parsers in practice (§ 3.2).

3. We show how to run backprop through the inside-outside algorithm on a hypergraph (§ 3.5), and thereby through the softened decoder and the computation of messages from the PTREE factor. This allows us to go beyond Stoyanov et al. (2011) and train structured BP in an approximation-aware and loss-aware fashion.

Differentiable Objective Functions
Annealed Risk Directed dependency error, ℓ(h_θ(x), y*), is not differentiable due to the argmax in the decoder h_θ. We therefore redefine J(θ; x, y*) to be a new differentiable loss function, the annealed risk R^{1/T}_θ(x, y*), which approaches the loss ℓ(h_θ(x), y*) as the temperature T → 0. This is done by replacing our non-differentiable decoder h_θ with a differentiable one (at training time). As input, it still takes the marginals p_θ(y_i = ON | x), or in practice, their BP approximations b_i(ON). We define a distribution over parse trees:

    q^{1/T}_θ(ŷ) ∝ exp( Σ_{i: ŷ_i = ON} p_θ(y_i = ON | x) / T )    (10)

Imagine that at training time, our decoder stochastically returns a parse ŷ sampled from this distribution. Our risk is the expected loss of that decoder:

    R^{1/T}_θ(x, y*) = E_{ŷ ∼ q^{1/T}_θ}[ℓ(ŷ, y*)]

As T → 0 ("annealing"), the decoder almost always chooses the MBR parse,[7] so our risk approaches the loss of the actual MBR decoder that will be used at test time. However, as a function of θ, it remains differentiable (though not convex) for any T > 0.
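To see the annealing effect concretely, here is a small sketch with three hypothetical parses; each toy score stands in for the sum of a parse's edge marginals, and the distribution collapses onto the highest-scoring parse as T shrinks:

```python
import math

# Toy scores: sum of p_theta(y_i = ON | x) over each parse's edges (invented).
scores = {"tree_A": 2.1, "tree_B": 2.0, "tree_C": 0.5}

def annealed(scores, T):
    """q(y) proportional to exp(score(y) / T), in the spirit of Eq. (10)."""
    w = {y: math.exp(s / T) for y, s in scores.items()}
    z = sum(w.values())
    return {y: w[y] / z for y in w}

q_warm = annealed(scores, T=1.0)    # fairly flat over the three parses
q_cold = annealed(scores, T=0.01)   # nearly all mass on tree_A, the best parse
```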
To compute the annealed risk, observe that (up to a constant that does not depend on θ) it simplifies to the negated expected recall of a parse ŷ ∼ q^{1/T}_θ:

    R^{1/T}_θ(x, y*) = − Σ_{i: y*_i = ON} q^{1/T}_θ(ŷ_i = ON)

We obtain the required marginals q^{1/T}_θ(ŷ_i = ON) from (10) by running inside-outside, where the edge weight for edge i is given by exp(p_θ(y_i = ON | x)/T).
With the annealed risk as our J function, we can compute ∇_θ J by backpropagating through the computation in the previous paragraph. The computations of the edge weights and the expected recall are trivially differentiable. The only challenge is differentiating the function computed by this call to the inside-outside algorithm; we address this in Section 3.5. Figure 2 (E.1-E.3) shows where these computations lie within the circuit.
Whether our test-time system computes the marginals of p_θ exactly or does so approximately via BP, our new training objective approaches (as T → 0) the true empirical risk of the test-time parser that performs MBR decoding from the computed marginals. Empirically, however, we will find that it is not the most effective training objective (§ 5.2). Stoyanov et al. (2011) postulate that the nonconvexity of empirical risk may make it a difficult function to optimize (even with annealing). Our next two objectives provide alternatives.
L2 Distance We can view our inference, decoder, and loss as defining a form of deep neural network, whose topology is inspired by our linguistic knowledge of the problem (e.g., the edge variables should define a tree). This connection to deep learning allows us to consider training methods akin to supervised layer-wise training. We temporarily remove the top layers of our network (i.e. the decoder and loss module, Fig. 2 (E)) so that the output layer of our "deep network" consists of the normalized variable beliefs b_i(y_i) from BP. We can then define a supervised loss function directly on these beliefs.

[7] Recall from (3) that the MBR parse is the tree ŷ that maximizes the sum Σ_{i: ŷ_i = ON} p_θ(y_i = ON | x). As T → 0, the right-hand side of (10) grows fastest for this ŷ, so its probability under q^{1/T}_θ approaches 1 (or 1/k if there is a k-way tie for the MBR parse).
We don't have supervised data for this layer of beliefs, but we can create it artificially. Use the supervised parse y* to define "target beliefs" by b*_i(y_i) = I(y_i = y*_i) ∈ {0, 1}. To find parameters θ that make BP's beliefs close to these targets, we can minimize an L2 distance loss function:

    J(θ; x, y*) = Σ_i Σ_{y_i} (b_i(y_i) − b*_i(y_i))²

We can use this L2 distance objective function for training, adding the MBR decoder and loss evaluation back in only at test time.
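A minimal numeric sketch of this objective, with toy beliefs over two edge variables (0 = OFF, 1 = ON); all numbers are invented:

```python
# Toy BP beliefs b_i(y_i) and gold values y*_i.
beliefs = {"y_1,2": {0: 0.2, 1: 0.8},
           "y_2,1": {0: 0.7, 1: 0.3}}
gold = {"y_1,2": 1, "y_2,1": 0}

# J = sum_i sum_{y_i} (b_i(y_i) - b*_i(y_i))^2, with b*_i(y_i) = I(y_i = y*_i).
J = sum((beliefs[i][v] - (1.0 if v == gold[i] else 0.0)) ** 2
        for i in beliefs for v in beliefs[i])
# J = 0.04 + 0.04 + 0.09 + 0.09 = 0.26
```

The objective is zero exactly when every belief puts all its mass on the gold value, and it never consults the decoder.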
Layer-wise Training Just as in layer-wise training of neural networks, we can take a two-stage approach to training. First, we train to minimize the L2 distance. Then, we use the resulting θ as initialization to optimize the annealed risk, which does consider the decoder and loss function (i.e. the top layers of Fig. 2). Stoyanov et al. (2011) found mean squared error (MSE) to give a smoother training objective, though still non-convex, and similarly used it to find an initializer for empirical risk. Though their variant of the L2 objective did not completely dispense with the decoder as ours does, it is a similar approach to our proposed layer-wise training.

Backpropagation through BP
Belief propagation proceeds iteratively by sending messages. We can label each message with a timestamp t (e.g. m^(t)_{i→α}) indicating the time step at which it was computed. Figure 2 (B) shows the messages at time t = 0, denoted m^(0)_{i→α}, which are initialized to the uniform distribution. Figure 2 (C) depicts the computation of all subsequent messages via Eqs. (4) and (5). Messages at time t are computed from messages at time t − 1 or before and the potential functions ψ_α. After the final iteration t_max, the beliefs b_i(y_i), b_α(y_α) are computed from the final messages via Eqs. (6) and (7).

Except for the messages sent from the PTREE factor, each step of BP computes some value from earlier values using a simple formula. Backpropagation differentiates these simple formulas. This lets it compute J's partial derivatives with respect to the earlier values, once its partial derivatives have been computed with respect to later values. Explicit formulas can be found in the appendix of Stoyanov et al. (2011).

BP and backpropagation with PTREE
The PTREE factor has a special structure that we exploit for efficiency during BP. Stoyanov et al. (2011) assume that BP takes an explicit sum in (5). For the PTREE factor, this equates to a sum over all projective dependency trees (since ψ_PTREE(y) = 0 for any assignment y which is not a tree). There are exponentially many such trees. However, Smith and Eisner (2008) point out that for α = PTREE, the summation has a special structure that can be exploited by dynamic programming.
To compute the factor-to-variable messages from α = PTREE, they first run the inside-outside algorithm where the edge weight for edge i is given by the ratio of the messages to PTREE:

    w_i = m^(t)_{i→α}(ON) / m^(t)_{i→α}(OFF)

Then they multiply each resulting edge marginal given by inside-outside by the product of all the OFF messages ∏_i m^(t)_{i→α}(OFF) to get the marginal factor belief b_α(y_i). Finally, they divide the belief by the incoming message m^(t)_{i→α}(y_i) to obtain the outgoing message m_{α→i}(y_i). These computations correspond to Figure 2 (C.1-C.3), and are repeated each time we send a message from the PTREE factor. The derivatives of the message ratios and products mentioned here are trivial. Though we focus here on projective dependency parsing, our techniques are also applicable to non-projective parsing and the TREE factor; we leave this to future work. In the next subsection, we explain how to backpropagate through the inside-outside algorithm.

Backpropagation through Inside-Outside on a Hypergraph
Both the annealed risk loss function (§ 3.2) and the computation of messages from the PTREE factor use the inside-outside algorithm for dependency parsing. Here we describe inside-outside and the accompanying backpropagation algorithm over a hypergraph. This more general treatment shows the applicability of our method to other structured factors such as for CNF parsing, HMM forward-backward, etc. In the case of dependency parsing, the structure of the hypergraph is given by the dynamic programming algorithm of Eisner (1996).
For the forward pass of the inside-outside module, the input variables are the hyperedge weights w_e ∀e, and the outputs are the marginal probabilities p_w(i) ∀i of each node i in the hypergraph. For each node i, we define the set of incoming edges I(i) and outgoing edges O(i); the antecedents of an edge e are T(e), the parent of the edge is H(e), and its weight is w_e. The marginals are a function of the inside probabilities β_i and outside probabilities α_i, computed by the recurrences below (with β_i = 1 for leaf nodes and the initialization α_root = 1):

    β_i = Σ_{e ∈ I(i)} w_e ∏_{j ∈ T(e)} β_j    (12)

    α_j = Σ_{e ∈ O(j)} w_e α_{H(e)} ∏_{k ∈ T(e): k ≠ j} β_k    (13)

    p_w(i) = β_i α_i / β_root    (15)

Below we use the concise notation of an adjoint ðy = ∂J/∂y, a derivative with respect to the objective J. For the backward pass through the inside-outside AD module, the inputs are ðp_w(i) ∀i and the outputs are ðw_e ∀e. We also compute the adjoints of the intermediate quantities, ðβ_j and ðα_i. We first compute the ðα_i bottom-up; next the ðβ_j are computed top-down; the adjoints ðw_e are then computed in any order. For example, for some edge e, let i = H(e) be the parent of the edge and let j, k ∈ T(e) be among its antecedents; applying the chain rule to (12) and (13) gives

    ðw_e = ðβ_i ∏_{j ∈ T(e)} β_j + Σ_{j ∈ T(e)} ðα_j α_i ∏_{k ∈ T(e): k ≠ j} β_k
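The forward pass above can be sketched for an arbitrary hypergraph. This is a hedged illustration on a toy hypergraph (not Eisner-algorithm parsing): each hyperedge is a (head, tail, weight) triple, leaves have inside score 1, and the marginal of a node is the fraction of total derivation weight passing through it.

```python
from collections import defaultdict

def prod(xs):
    r = 1.0
    for x in xs:
        r *= x
    return r

def inside_outside(nodes, edges, root, order):
    """edges: list of (head, tail_tuple, weight); order: topological, leaves first."""
    incoming = defaultdict(list)
    for head, tail, w in edges:
        incoming[head].append((tail, w))
    beta = {}
    for i in order:                        # inside pass, bottom-up
        ins = incoming[i]
        beta[i] = sum(w * prod(beta[j] for j in tail)
                      for tail, w in ins) if ins else 1.0
    alpha = {i: 0.0 for i in nodes}
    alpha[root] = 1.0
    for i in reversed(order):              # outside pass, top-down
        for tail, w in incoming[i]:
            for j in tail:
                alpha[j] += w * alpha[i] * prod(beta[k] for k in tail if k != j)
    # Node marginal: inside times outside, normalized by the root's inside score.
    return {i: beta[i] * alpha[i] / beta[root] for i in nodes}

# Toy hypergraph: root R derived from A; A derived from leaf x (weight 2)
# or leaf y (weight 3). Total derivation weight at the root is 5.
p = inside_outside(["x", "y", "A", "R"],
                   [("A", ("x",), 2.0), ("A", ("y",), 3.0), ("R", ("A",), 1.0)],
                   "R", ["x", "y", "A", "R"])
# p["A"] == 1.0 (A is in every derivation); p["x"] == 0.4; p["y"] == 0.6
```

Backpropagating through this function amounts to running the two passes in reverse, accumulating the adjoints of beta, alpha, and the edge weights by the chain rule, as described above.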

Other Learning Settings
Loss-aware Training with Exact Inference Backpropagating through inference, decoder, and loss need not be restricted to approximate inference algorithms. Li and Eisner (2009) optimize Bayes risk with exact inference on a hypergraph for machine translation. Each of our differentiable loss functions (§ 3.2) can also be coupled with exact inference. For a first-order parser, BP is exact. Yet, in place of modules (B), (C), and (D) in Figure 2, we can use a standard dynamic programming algorithm for dependency parsing, which is simply another instance of inside-outside on a hypergraph (§ 3.5).
The exact marginals from inside-outside (15) are then fed forward into the decoder/loss module (E).
Conditional and Surrogate Log-likelihood The standard approach to training is conditional log-likelihood (CLL) maximization (Smith and Eisner, 2008), which does not take inexact inference into account. When inference is exact, this baseline computes the true gradient of CLL. When inference is approximate, this baseline uses the approximate marginals from BP in place of their exact values in the gradient. The literature refers to this approximation-unaware training method as surrogate likelihood training, since it returns the "wrong" model even under the assumption of infinite training data (Wainwright, 2006). Despite this, the surrogate likelihood objective is commonly used to train CRFs. CLL and approximation-aware training are not mutually exclusive: training a standard factor graph with ERMA and a log-likelihood objective recovers CLL exactly (Stoyanov et al., 2011).

Setup
Features As the focus of this work is on a novel approach to training, we look to prior work for model and feature design. We add O(n³) second-order grandparent and arbitrary-sibling factors as in Riedel and Smith (2010) and Martins et al. (2010). We use standard feature sets for first-order (McDonald et al., 2005) and second-order (Carreras, 2007) parsing. Following Rush and Petrov (2012), we also include a version of each part-of-speech (POS) tag feature with the coarse POS tags from Petrov et al. (2012). We use feature hashing (Ganchev and Dredze, 2008; Attenberg et al., 2009) and restrict to at most 20 million features. We leave the incorporation of third-order features to future work.
Pruning To reduce the time spent on feature extraction, we enforce the type-specific dependency length bounds from Eisner and Smith (2005) as used by Rush and Petrov (2012): the maximum allowed dependency length for each tuple (parent tag, child tag, direction) is given by the maximum observed length for that tuple in the training data. Following Koo and Collins (2010), we train an (exact) first-order model and, for each token, prune any parents for which the marginal probability is less than 0.0001 times the maximum parent marginal for that token. On a per-token basis, we further restrict to the ten parents with highest marginal probability, as in Martins et al. (2009). The pruning model uses a simpler feature set, as in Rush and Petrov (2012).
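The per-token pruning rule can be sketched as a simple filter over first-order marginals; the marginal values and function name below are invented for illustration:

```python
def prune_parents(marginals, ratio=1e-4, top_k=10):
    """Keep parents whose marginal is at least ratio * best, then at most top_k.
    marginals: candidate parent position -> first-order marginal (toy values)."""
    best = max(marginals.values())
    kept = {h: p for h, p in marginals.items() if p >= ratio * best}
    return sorted(kept, key=kept.get, reverse=True)[:top_k]

cands = prune_parents({0: 0.6, 3: 0.3, 5: 1e-6})
# parents 0 and 3 survive; parent 5 is pruned since 1e-6 < 1e-4 * 0.6
```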
Data We consider 19 languages from the CoNLL-2006 (Buchholz and Marsi, 2006) and CoNLL-2007 (Nivre et al., 2007) Shared Tasks. We also convert the English Penn Treebank (PTB) (Marcus et al., 1993) to dependencies using the head rules from Yamada and Matsumoto (2003) (PTB-YM). We evaluate unlabeled attachment accuracy (UAS) using gold POS tags for the CoNLL languages, and predicted tags from TurboTagger (http://www.cs.cmu.edu/~afm/TurboParser) for the PTB. Unlike most prior work, we hold out 10% of each CoNLL training dataset as development data.

Some of the CoNLL languages contain non-projective edges. With the projectivity constraint, the model assigns zero probability to such trees. For approximation-aware training this is not a problem; however, CLL training cannot handle such trees. For CLL only, we projectivize the training trees following Carreras (2007) by finding the maximum projective spanning tree under an oracle model which assigns score +1 to edges in the gold tree and 0 to the others. We always evaluate on the non-projective trees for comparison with prior work.
Learning Settings We compare three learning settings. The first, our baseline, is conditional log-likelihood training (CLL) (§ 4). As is common in the literature, we conflate two distinct learning settings (conditional log-likelihood/surrogate log-likelihood) under the single name "CLL," allowing the inference method (exact/inexact) to differentiate them. The second learning setting is approximation-aware learning (§ 3) with either our L2 distance objective (L2) or our layer-wise training method (L2+AR), which takes the L2-trained model as an initializer for our annealed risk (§ 3.2). The annealed risk objective requires an annealing schedule: over the course of training, we linearly anneal from initial temperature T = 0.1 to T = 0.0001, updating T at each iteration of stochastic optimization. The third uses the same two objectives, L2 and L2+AR, but with exact inference (§ 4). The ℓ2-regularizer weight is λ = 1/(0.1 D). Each method is trained by AdaGrad for 10 epochs with early stopping (i.e. the model with the highest score on dev data is returned). The learning rate for each training run is dynamically tuned on a sample of the training data.
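The annealing schedule described above amounts to linear interpolation of the temperature over training iterations; a minimal sketch, with the function name invented:

```python
def temperature(step, num_steps, t_start=0.1, t_end=0.0001):
    """Linearly anneal T from t_start (first step) to t_end (last step)."""
    frac = step / max(1, num_steps - 1)
    return t_start + frac * (t_end - t_start)

temps = [temperature(s, 5) for s in range(5)]
# temps[0] == 0.1, temps[-1] == 0.0001, strictly decreasing in between
```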

Results
Our goal is to demonstrate that our approximation-aware training method leads to improved parser accuracy as compared with the standard training approach of conditional log-likelihood (CLL) maximization (Smith and Eisner, 2008), which does not take inexact inference into account. The two key findings of our experiments are that our learning approach is more robust to (1) decreasing the number of iterations of BP and (2) adding additional cycles to the factor graph in the form of higher-order factors. In short: our approach leads to faster inference and creates opportunities for more accurate parsers.
Speed-Accuracy Tradeoff Our first experiment is on English dependencies. For English PTB-YM, Figure 3 shows accuracy as a function of the number of BP iterations for our second-order model with both arbitrary-sibling and grandparent factors. We find that our training methods (L2 and L2+AR) obtain higher accuracy than standard training (CLL), particularly when a small number of BP iterations is used and the inference is a worse approximation. Notice that with just two iterations of BP, the parsers trained by our approach obtain accuracy equal to the CLL-trained parser with four iterations. Contrasting the two objectives for our approximation-aware training, we find that our simple L2 objective performs very well. In fact, in only one case, at 6 iterations, does the additional annealed risk (L2+AR) improve performance on test data. In our development experiments, we also evaluated AR without using L2 for initialization, and we found that it performed worse than either CLL or L2 alone. That AR performs only slightly better than L2 (and not worse) in the case of L2+AR is likely due to early stopping on dev data, which guards against selecting a worse model.
Increasingly Cyclic Models Figure 4 contrasts accuracy with the type of second-order factors (grandparent, sibling, or both) included in the model for English, for a fixed budget of 4 BP iterations. As we add additional higher-order factors, the model has more loops, thereby making the BP approximation more problematic for standard CLL training. By contrast, our training performs well even when the factor graphs have many cycles.
Notice that our advantage is not restricted to the case of loopy graphs. Even when we use a first-order model, for which BP inference is exact, our approach yields higher-accuracy parsers than CLL training. We postulate that this improvement comes from our choice of the L2 objective function. Note the following subtle point: when inference is exact, the CLL estimator is actually a special case of our approximation-aware learner; that is, CLL computes the same gradient that our training by backpropagation would if we used log-likelihood as the objective. Despite its appealing theoretical justification, the AR objective that approaches empirical risk minimization in the limit consistently provides no improvement over our L2 objective.
Exact Inference with Grandparents When our factor graph includes unary and grandparent factors, exact inference in O(n^4) time is possible using the dynamic-programming algorithm for Model 0 of Koo and Collins (2010).

Discussion
The purpose of this work was to explore ERMA and related training methods for models which incorporate structured factors. We applied these methods to a basic higher-order dependency parsing model, because that was the simplest and first instance of structured BP (Smith and Eisner, 2008). Another natural extension of this work is to explore other types of factors: here we considered only exponential-family potential functions (commonly used in CRFs), but any differentiable function would be appropriate, such as a neural network. Our primary contribution is approximation-aware training for structured BP. While our experiments consider only dependency parsing, our approach is applicable to any constraint factor that amounts to running the inside-outside algorithm on a hypergraph. Prior work has used this structured form of BP for dependency parsing (Smith and Eisner, 2008), CNF grammar parsing (Naradowsky et al., 2012), TAG (Auli and Lopez, 2011), ITG constraints for phrase extraction (Burkett and Klein, 2012), and graphical models over strings (Dreyer and Eisner, 2009). Our training methods could be applied to such tasks as well.
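The inside pass that such a constraint factor runs can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hyperedge encoding, variable names, and the assumption that hyperedges arrive in bottom-up order are all ours.

```python
def inside(hyperedges, root):
    """Inside algorithm on an acyclic hypergraph.

    Each hyperedge is a tuple (head, tail_nodes, weight).  Nodes that
    never appear as a head are leaves and get inside score 1.  The
    inside score of a node sums, over its incoming hyperedges, the
    edge weight times the product of the tails' inside scores.

    Assumes `hyperedges` is listed in bottom-up (topological) order.
    """
    heads = {h for h, _, _ in hyperedges}
    tails_all = {t for _, tails, _ in hyperedges for t in tails}
    beta = {t: 1.0 for t in tails_all if t not in heads}  # leaves
    for head, tails, weight in hyperedges:
        prod = weight
        for t in tails:
            prod *= beta[t]
        beta[head] = beta.get(head, 0.0) + prod
    return beta[root]
```

On a toy hypergraph with two lexical hyperedges feeding a root, e.g. `[("X", ("x",), 2.0), ("Y", ("y",), 3.0), ("S", ("X", "Y"), 1.0)]`, the inside score of `S` is the product 2.0 * 3.0 = 6.0; an outside pass (not shown) would make the whole computation differentiable for backpropagation.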

Conclusions
We introduce a new approximation-aware learning framework for belief propagation with structured factors. We present differentiable objectives both for empirical risk minimization (a la ERMA) and for a novel objective based on the L2 distance between the inferred beliefs and the true edge indicator functions. Experiments on the English Penn Treebank and on 19 languages from CoNLL-2006/2007 show that, by taking approximations into account, our estimator trains more accurate dependency parsers with fewer iterations of belief propagation than standard conditional log-likelihood training. Our code is available in a general-purpose library for structured BP, hypergraphs, and backprop.
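The L2 objective mentioned above can be sketched in a few lines: it is the squared distance between the inferred edge beliefs and the 0/1 gold edge indicators. The function name and argument layout here are ours, for illustration only.

```python
import numpy as np

def l2_objective(beliefs, gold_edges):
    """Squared distance between inferred edge beliefs (probabilities
    in [0, 1], one per candidate edge) and the gold 0/1 indicator of
    whether each edge is in the true parse.  Because the beliefs are
    produced by a differentiable (if approximate) inference circuit,
    this loss can be backpropagated through BP to the parameters."""
    b = np.asarray(beliefs, dtype=float)
    y = np.asarray(gold_edges, dtype=float)
    return float(np.sum((b - y) ** 2))
```

A perfectly confident, correct set of beliefs gives zero loss; uncertain beliefs such as 0.5 on every edge are penalized in proportion to their distance from the gold indicators.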

Figure 2: Feed-forward topology of inference, decoding, and loss. (E) shows the annealed risk, one of the objective functions we consider.
The messages m_{i→α} are computed using Eqs. (6) and (7). Optionally, we can normalize the messages, as well as the beliefs, after each step to avoid overflow (not shown in the figure).
This backpropagation method is used for both C.2 and E.2 of Figure 2.
Figure 3: Speed/accuracy tradeoff: UAS vs. the number of BP iterations for standard conditional log-likelihood training (CLL) and our approximation-aware training with either an L2 objective (L2) or a staged training of L2 followed by annealed risk (L2+AR). Note that the x-axis shows the number of iterations used for both training and testing. We use a 2nd-order model with Grand.+Sib. factors.
Table 1 compares four parsers, by considering two training approaches and two inference methods. The training approaches are CLL and approximation-aware training with an L2 objective.

Table 2: Results on 19 languages from CoNLL-2006/2007. For languages appearing in both datasets, the 2006 version was used, except for Chinese (ZH). Evaluation follows the 2006 conventions and excludes punctuation. We report absolute UAS for the baseline (CLL) and the improvement in UAS for L2 over CLL (L2-CLL), with positive/negative differences in blue/red. The average UAS and average difference across all languages (AVG) are given.
In future work, we hope to explore further models with structured factors, particularly those which jointly account for multiple linguistic strata (e.g., syntax, semantics, and topic).