Dual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction

Due to the nature of complex NLP problems, structured prediction algorithms have been important modeling tools for a wide range of tasks. While there exists evidence showing that the linear Structural Support Vector Machine (SSVM) algorithm performs better than the structured Perceptron, the SSVM algorithm is still less frequently chosen in the NLP community because of its relatively slow training speed. In this paper, we propose a fast and easy-to-implement dual coordinate descent algorithm for SSVMs. Unlike algorithms such as Perceptron and stochastic gradient descent, our method keeps track of dual variables and updates the weight vector more aggressively. As a result, this training process is as efficient as existing online learning methods, and yet derives consistently better models, as evaluated on four benchmark NLP datasets for part-of-speech tagging, named-entity recognition and dependency parsing.


Introduction
Complex natural language processing tasks are inherently structured. From sequence labeling problems like part-of-speech tagging and named entity recognition to tree construction tasks like syntactic parsing, strong dependencies exist among the labels of individual components. By modeling such relations in the output space, structured output prediction algorithms have been shown to significantly outperform simple binary or multi-class classifiers (Lafferty et al., 2001; Collins, 2002). Among existing structured output prediction algorithms, the linear Structural Support Vector Machine (SSVM) algorithm (Tsochantaridis et al., 2004) has shown outstanding performance in several NLP tasks, such as bilingual word alignment (Moore et al., 2007), constituency and dependency parsing (Taskar et al., 2004b; Koo et al., 2007), sentence compression (Cohn and Lapata, 2009) and document summarization (Li et al., 2009). Nevertheless, as a learning method for NLP, the SSVM algorithm has been less popular than algorithms such as the structured Perceptron (Collins, 2002). This may be due to the fact that current SSVM implementations often suffer from several practical issues. First, state-of-the-art implementations of SSVM, such as cutting plane methods, are typically complicated. Second, while methods like stochastic gradient descent are simple to implement, tuning their learning rates can be difficult. Finally, while SSVM models can achieve superior accuracy, this often requires long training time.
In this paper, we propose a novel optimization method for efficiently training linear SSVMs. Our method is not only easy to implement, but also has excellent training speed, competitive with both the structured Perceptron (Collins, 2002) and MIRA. When evaluated on several NLP tasks, including POS tagging, NER and dependency parsing, this optimization method also outperforms other approaches in terms of prediction accuracy. Our final algorithm is a dual coordinate descent (DCD) algorithm for solving a structured output SVM problem with a 2-norm hinge loss function. The algorithm consists of two main components. One component behaves analogously to online learning methods and updates the weight vector immediately after inference is performed. The other component is similar to the cutting plane method and updates the dual variables (and the weight vector) without running inference. Conceptually, this hybrid approach operates at a balanced trade-off point between inference and weight update, performing better than either component alone.
Our contributions in this work can be summarized as follows. Firstly, our proposed algorithm shows that even for structured output prediction, an SSVM model can be trained as efficiently as a structured Perceptron one. Secondly, we conducted a careful experimental study on three NLP tasks using four different benchmark datasets. When compared with previous methods for training SSVMs (Joachims et al., 2009), our method achieves similar performance using less training time. When compared to commonly used learning algorithms such as Perceptron and MIRA, the model trained by our algorithm performs consistently better when given the same amount of training time. We believe our method can be a powerful tool for many different NLP tasks.
The rest of our paper is organized as follows. We first describe our approach by formally defining the problem and notation in Sec. 2, where we also review some existing, closely related structured-output learning algorithms and optimization techniques. We introduce the detailed algorithmic design in Sec. 3. The experimental comparisons of variations of our approach and the existing methods on several NLP benchmark tasks and datasets are reported in Sec. 4. Finally, Sec. 5 concludes the paper.

Background and Related Work
We first introduce the notation used throughout this paper. An input example is denoted by x and an output structure is denoted by y. The feature vector Φ(x, y) is a function defined over an input-output pair (x, y). We focus on linear models, with predictions made by solving the decoding problem:

$$\arg\max_{y \in \mathcal{Y}(x)} \; \mathbf{w}^T \Phi(x, y). \quad (1)$$

The set Y(x_i) represents all possible (exponentially many) structures that can be generated from the example x_i. Let y_i be the true structured label of x_i. The difference between the feature vectors of the correct label y_i and a structure y is denoted as Φ_{y_i,y}(x_i) ≡ Φ(x_i, y_i) − Φ(x_i, y). We define ∆(y_i, y) as a distance function between two structures.

Perceptron and MIRA
Structured Perceptron First introduced by Collins (2002), the structured Perceptron algorithm runs two steps iteratively: first, it finds the best structured prediction y for an example with the current weight vector using Eq. (1); then the weight vector is updated according to the difference between the feature vectors of the true label and the prediction: w ← w + Φ_{y_i,y}(x_i). Inspired by Freund and Schapire (1999), Collins (2002) also proposed the averaged structured Perceptron, which maintains an averaged weight vector throughout the training procedure. This technique has been shown to improve the generalization ability of the model.
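The two-step loop above can be sketched compactly. In the sketch below, `phi` and `candidates` are hypothetical stand-ins for the feature function Φ(x, y) and the inference routine over Y(x); a real implementation would use Viterbi or a parser rather than brute-force enumeration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_train(examples, phi, candidates, epochs=10):
    """Averaged structured Perceptron (a sketch, not the paper's code).

    examples:   list of (x, y_true) pairs
    phi:        feature function phi(x, y) -> list of floats
    candidates: enumerates Y(x); stands in for real (e.g. Viterbi) inference
    """
    dim = len(phi(*examples[0]))
    w = [0.0] * dim
    w_sum = [0.0] * dim          # running sum for the averaged Perceptron
    n_seen = 0
    for _ in range(epochs):
        for x, y_true in examples:
            # Step 1: decode with the current weights (Eq. (1))
            y_hat = max(candidates(x), key=lambda y: dot(w, phi(x, y)))
            # Step 2: additive update on a mistake
            if y_hat != y_true:
                diff = [a - b for a, b in zip(phi(x, y_true), phi(x, y_hat))]
                w = [wi + di for wi, di in zip(w, diff)]
            w_sum = [s + wi for s, wi in zip(w_sum, w)]
            n_seen += 1
    return [s / n_seen for s in w_sum]   # averaged weights
```

On a toy tagging task where the correct tag simply copies the token identity, a few epochs suffice to learn the mapping.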

MIRA
The Margin Infused Relaxed Algorithm (MIRA) explicitly uses the notion of margin to update the weight vector. MIRA calculates the step size by solving

$$\min_{\mathbf{w}} \; \frac{1}{2}\|\mathbf{w} - \mathbf{w}_0\|^2 \quad \text{s.t.} \quad \mathbf{w}^T \Phi_{y_i,y}(x_i) \ge \Delta(y_i, y), \; \forall y \in H_k,$$

where H_k is a set containing the best k structures according to the weight vector w_0. MIRA is a very popular method in the NLP community and has been applied to NLP tasks like word segmentation and part-of-speech tagging (Kruengkrai et al., 2009), NER and chunking (Mejer and Crammer, 2010) and dependency parsing.
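For k = 1 (1-best MIRA), the quadratic program above has a closed-form solution. The sketch below assumes a C-bounded variant with a single constraint; `phi_diff` and `delta` stand for Φ_{y_i,y}(x_i) and ∆(y_i, y), and the helper name is illustrative.

```python
def mira_update(w, phi_diff, delta, C=1.0):
    """One 1-best MIRA step: the smallest change to w that makes the
    margin on the current prediction at least delta (single-constraint
    case; step size capped at C in the C-bounded variant).

    w:        current weight vector (list of floats)
    phi_diff: Phi(x, y_true) - Phi(x, y_pred)
    delta:    structured loss Delta(y_true, y_pred)
    """
    score = sum(wi * di for wi, di in zip(w, phi_diff))
    norm_sq = sum(di * di for di in phi_diff)
    if norm_sq == 0.0:
        return w                      # prediction identical in feature space
    tau = (delta - score) / norm_sq   # optimal step size for the QP
    tau = min(C, max(0.0, tau))       # no step if the margin is satisfied
    return [wi + tau * di for wi, di in zip(w, phi_diff)]
```

Once the margin constraint holds, a repeated call leaves the weights unchanged, reflecting MIRA's "smallest sufficient change" principle.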

Structural SVM
Structural SVM (SSVM) is a maximum-margin model for the structured output prediction setting. Training an SSVM is equivalent to solving the following global optimization problem:

$$\min_{\mathbf{w}, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{l} \ell(\xi_i) \quad \text{s.t.} \quad \mathbf{w}^T \Phi_{y_i,y}(x_i) \ge \Delta(y_i, y) - \xi_i, \; \forall i, \; \forall y \in \mathcal{Y}(x_i), \quad (2)$$

where l is the number of labeled examples and C is the regularization parameter. The typical choice of ℓ is ℓ(a) = a^t. If t = 2 is used, we refer to the SSVM defined in Eq. (2) as the L2-Loss SSVM; if t = 1 is used, we refer to it as the L1-Loss SSVM. Note that the function ∆ is not only necessary (without ∆(y_i, y) in Eq. (2), the optimal w would be zero), but also enables us to use more information on the differences between the structures in the training phase. For example, using Hamming distance for sequence labeling is a reasonable choice, as the model can express finer distinctions between structures y_i and y.
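As a concrete instance of ∆ for sequence labeling, the Hamming distance mentioned above can be written in a few lines:

```python
def hamming_distance(y_gold, y_pred):
    """Delta(y_i, y): the number of positions where the two tag sequences
    disagree, so structures closer to the gold labeling incur less loss."""
    assert len(y_gold) == len(y_pred)
    return sum(1 for a, b in zip(y_gold, y_pred) if a != b)
```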
When training an SSVM model, we often need to solve the loss-augmented inference problem,

$$\arg\max_{y \in \mathcal{Y}(x_i)} \left[ \Delta(y_i, y) + \mathbf{w}^T \Phi(x_i, y) \right]. \quad (3)$$

Note that this is a different inference problem from the decoding problem in Eq. (1).
Algorithms for training SSVM Cutting plane (CP) methods (Tsochantaridis et al., 2004) have been the dominant approach for learning the L1-Loss SSVM. Eq. (2) contains an exponential number of constraints. CP methods iteratively select a subset of active constraints for each example, then solve a sub-problem containing only the active constraints to improve the model. CP has proven useful for solving SSVMs. For instance, Joachims et al. (2009) proposed using CP methods to solve a 1-slack variable formulation, and showed that solving the 1-slack formulation is much faster than solving the l-slack one (Eq. (2)). Chang et al. (2010) also proposed a variant of the cutting plane method for solving the L2-Loss SSVM. This method uses a dual coordinate descent algorithm to solve the sub-problems. We call their approach the CPD method. Several other algorithms also aim at solving the L1-Loss SSVM. Stochastic gradient descent (SGD) (Bottou, 2004; Shalev-Shwartz et al., 2007) is a technique for optimizing general convex functions and has been applied to the L1-Loss SSVM (Ratliff et al., 2007). Taskar et al. (2004a) proposed a structured SMO algorithm. Because that algorithm solves the dual formulation of the L1-Loss SSVM, it requires picking a violating pair for each update. In contrast, because each dual variable can be updated independently in our DCD algorithm, the implementation is relatively simple. The extragradient algorithm has also been applied to the L1-Loss SSVM (Taskar et al., 2005). Unlike our DCD algorithm, the extragradient method requires the learning rate to be specified.
The connections between dual methods and online algorithms have been discussed previously. Specifically, Shalev-Shwartz and Singer (2006) connect dual methods to a wide range of online learning algorithms. Martins et al. (2010) apply similar techniques to L1-Loss SSVMs and show that the proposed algorithm can be faster than the SGD algorithm.
Exponentiated Gradient (EG) descent (Kivinen and Warmuth, 1995; Collins et al., 2008) has recently been applied to solving the L1-Loss SSVM. Compared to other SSVM learners, EG requires manual tuning of the step size. In addition, EG requires solving a sum-product inference problem, which can be more expensive than solving Eq. (3) (Taskar et al., 2006). Very recently, Lacoste-Julien et al. (2013) proposed a block-coordinate descent algorithm for the L1-Loss SSVM based on the Frank-Wolfe algorithm (FW-Struct), which has been shown to outperform the EG algorithm significantly. Similar to our DCD algorithm, FW-Struct calculates the optimal learning rate when updating the dual variables.
The Sequential Dual Method (SDM) (Shevade et al., 2011) is probably the work most closely related to this paper. SDM solves the L1-Loss SSVM problem using multiple updating policies, which is similar to our approach. However, there are several important differences in the detailed algorithmic design. As will be clear in Sec. 3, our dual coordinate descent (DCD) algorithm is very simple, while SDM (which is not a DCD algorithm) uses a complicated procedure to balance its different update policies. By targeting the L2-Loss SSVM formulation, our methods can update the weight vector more efficiently, since there are no equality constraints in the dual.

Dual Coordinate Descent Algorithms for Structural SVM
In this work, we focus on solving the dual of the linear L2-Loss SSVM, which can be written as follows:

$$\min_{\alpha \ge 0} \; D(\alpha) = \frac{1}{2}\Big\|\sum_{i,y} \alpha_{i,y}\,\Phi_{y_i,y}(x_i)\Big\|^2 + \frac{1}{4C}\sum_i \Big(\sum_y \alpha_{i,y}\Big)^2 - \sum_{i,y} \Delta(y_i, y)\,\alpha_{i,y}. \quad (4)$$

In the above equation, a dual variable α_{i,y} is associated with each structure y ∈ Y(x_i). Therefore, the total number of dual variables can be extremely large. The connection between the dual variables and the weight vector w at optimal solutions is given by

$$\mathbf{w} = \sum_i \sum_{y \in \mathcal{Y}(x_i)} \alpha_{i,y}\,\Phi_{y_i,y}(x_i). \quad (5)$$

Advantages of L2-Loss SSVM The use of the 2-norm hinge loss function eliminates the need for equality constraints; only non-negativity constraints (α_{i,y} ≥ 0) remain. This is important because each dual variable can now be updated without changing the values of the other dual variables; we can update one single dual variable at a time. As a result, this dual formulation allows us to design a simple and principled dual coordinate descent (DCD) optimization method. DCD algorithms consist of two iterative steps:

1. Pick a dual variable α_{i,y}.
2. Update the dual variable and the weight vector. Go to 1.
In the normal binary classification case, how to select dual variables is not an issue, as choosing them randomly works effectively in practice (Hsieh et al., 2008). However, this is not a practical scheme for training SSVM models, given that the number of dual variables in Eq. (4) can be very large because of the exponentially many legitimate output structures. To address this issue, we introduce the concept of a working set below.
Working Set The number of non-zero variables in the optimal α can be small when solving Eq. (4). Hence, it is often feasible to use a small working set W_i ⊆ Y(x_i) for each example to keep track of the structures with non-zero α's. Intuitively, the working set W_i records the output structures that are similar to the true structure y_i. We set all dual variables to zero initially (therefore, w = 0 as well), so W_i = ∅ for all i. The algorithm then builds the working sets during the training procedure. After training, the weight vector is computed using only the dual variables in the working sets, and is thus equivalent to

$$\mathbf{w} = \sum_i \sum_{y \in W_i} \alpha_{i,y}\,\Phi_{y_i,y}(x_i). \quad (6)$$

Connections to Structured Perceptron The process of updating a dual variable is in fact very similar to the update rules used in Perceptron and MIRA. Take the structured Perceptron as an example: its weight vector can be written as

$$\mathbf{w} = \sum_i \sum_{y \in \Gamma(x_i)} \beta_{i,y}\,\Phi_{y_i,y}(x_i), \quad (7)$$

where Γ(x_i) is the set containing all structures the Perceptron predicts for x_i during training, and β_{i,y} is the number of times the Perceptron predicts y for x_i during training. By comparing Eq. (6) and Eq. (7), it is clear that SSVM is just a more principled way to update the weight vector, as the α's are computed based on the notion of margin.

Updating Dual Variables and Weights After picking a dual variable α_{i,ȳ}, we first show how to update it optimally. Recall that a dual variable α_{i,ȳ} is associated with the i-th example and a structure ȳ. The optimal update size d for α_{i,ȳ} can be calculated analytically from the following one-variable optimization problem (derived from Eq. (4)):

$$\min_{d} \; D(\alpha + d\,e_{i,\bar{y}}) \quad \text{s.t.} \quad \alpha_{i,\bar{y}} + d \ge 0,$$

where w is defined in Eq. (6). Compared to stochastic gradient descent, DCD algorithms keep track of dual variables and do not need a tuned learning rate. Instead of updating one dual variable at a time, our algorithm updates all dual variables in the working set at once. This step is important for the convergence of the DCD algorithms. The exact update procedure is presented in Algorithm 1.

Algorithm 1 UPDATEWEIGHT(i, w): update the weight vector w and the dual variables in the working set of the i-th example. C is the regularization parameter defined in Eq. (2).
1: Shuffle the elements in W_i (but let the newest member of the working set be updated first; see Theorem 1 for the reason)
2: for all ȳ ∈ W_i do
3:   d ← (∆(y_i, ȳ) − w^T Φ_{y_i,ȳ}(x_i) − (1/2C) Σ_{y∈W_i} α_{i,y}) / (‖Φ_{y_i,ȳ}(x_i)‖² + 1/(2C))
4:   d ← max(−α_{i,ȳ}, d)
5:   w ← w + d Φ_{y_i,ȳ}(x_i)
6:   α_{i,ȳ} ← α_{i,ȳ} + d
7: end for

Line 3 calculates the optimal step size (the analytical solution to the above optimization problem). Line 4 makes sure that dual variables remain non-negative. Lines 5 and 6 update the weight vector and the dual variable. Note that every update ensures that the objective in Eq. (4) does not increase.
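A minimal sketch of this update pass follows, assuming dense feature vectors and a dictionary-based working set (the record layout and helper names are illustrative, not from the paper); the step size is the closed-form minimizer of the one-variable restriction of the L2-loss dual.

```python
def update_weight(w, working_set, C):
    """One pass of dual coordinate descent over the working set of a
    single example (a sketch in the spirit of Algorithm 1).

    working_set: one record per cached structure y-bar, each a dict with
        'alpha'    current dual value
        'phi_diff' Phi(x_i, y_i) - Phi(x_i, y-bar), a list of floats
        'delta'    Delta(y_i, y-bar)
    Clipping the step at -alpha keeps every dual variable >= 0.
    """
    for rec in working_set:
        phi, delta = rec['phi_diff'], rec['delta']
        alpha_sum = sum(r['alpha'] for r in working_set)
        # gradient-over-curvature step for this single dual variable
        num = (delta - sum(wj * pj for wj, pj in zip(w, phi))
               - alpha_sum / (2.0 * C))
        d = num / (sum(pj * pj for pj in phi) + 1.0 / (2.0 * C))
        d = max(-rec['alpha'], d)            # enforce non-negativity
        w = [wj + d * pj for wj, pj in zip(w, phi)]
        rec['alpha'] += d
    return w
```

A single cached structure reaches its one-variable optimum in one step, after which further passes leave both the weights and the dual value unchanged.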

Two DCD Optimization Algorithms
Now we are ready to present two novel DCD algorithms for L2-Loss SSVM: DCD-Light and DCD-SSVM.

DCD-Light
The basic idea of DCD-Light is similar to online learning algorithms. Instead of performing inference for the whole batch of examples before updating the weight vector in each iteration, as done in CPD and the 1-slack formulation of SVM-Struct, DCD-Light updates the model weights after solving the inference problem for each individual example. Algorithm 2 depicts the detailed steps. In Line 5, the loss-augmented inference (Eq. (3)) is performed; then the weight vector is updated in Line 9: all of the structures and dual variables in the working set are used to update the weight vector. Note that the δ parameter in Line 6 controls how precisely we solve the SSVM problem. As suggested by Hsieh et al. (2008), we shuffle the examples in each iteration (Line 3), as doing so helps the algorithm converge faster.
DCD-Light has several noticeable differences from the most popular online learning method, the averaged Perceptron. First, DCD-Light performs the loss-augmented inference (Eq. (3)) at Line 5 instead of the argmax inference (Eq. (1)). Second, the algorithm updates the weight vector with all structures in the working set. Finally, DCD-Light does not average the weight vectors.

DCD-SSVM
Observing that DCD-Light does not fully utilize the dual variables saved in the working sets, we propose a hybrid approach called DCD-SSVM, which combines ideas from DCD-Light and cutting plane methods. In short, after running the updates on a batch of examples, we refine the model by further optimizing the dual variables in the current working sets.
The key advantage of keeping track of these dual variables is that it allows us to update the saved dual variables without performing any inference, which is often an expensive step in structured prediction algorithms.
DCD-SSVM is summarized in Algorithm 3. Lines 10 to 16 are from DCD-Light. In Lines 3 to 8, we borrow the idea from cutting plane methods, updating the weight vector using the saved dual variables in the working sets without any inference (note that Lines 3 to 8 have no effect in the first iteration). By revisiting the dual variables, we can derive a better intermediate model, and thus run the inference procedure less frequently. Similar to DCD-Light, we also shuffle the examples in each iteration.
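The inference-free refinement that distinguishes DCD-SSVM can be isolated as a routine that sweeps the cached working sets r times, reusing the stored feature differences and losses so that no decoder call is needed (the record layout is illustrative, not from the paper).

```python
def refine(w, working_sets, C=1.0, r=5):
    """Inference-free refinement (the cutting-plane-style component of
    DCD-SSVM): r sweeps of dual coordinate descent over every cached
    structure, reusing 'phi_diff' and 'delta' computed earlier.

    working_sets: list of per-example working sets; each record is a
        dict with keys 'alpha', 'phi_diff' (list of floats), 'delta'.
    """
    for _ in range(r):
        for ws in working_sets:
            for rec in ws:
                asum = sum(x['alpha'] for x in ws)
                num = (rec['delta']
                       - sum(wj * pj for wj, pj in zip(w, rec['phi_diff']))
                       - asum / (2.0 * C))
                d = num / (sum(p * p for p in rec['phi_diff'])
                           + 1.0 / (2.0 * C))
                d = max(-rec['alpha'], d)   # keep alpha >= 0
                w = [wj + d * pj for wj, pj in zip(w, rec['phi_diff'])]
                rec['alpha'] += d
    return w
```

Because the sweeps touch only cached quantities, their cost is small relative to inference, which is why revisiting the dual variables is cheap compared with running the decoder again.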
DCD algorithms are similar to column generation algorithms for linear programming (Desrosiers and Lübbecke, 2005): the master problem solves the dual problem restricted to the variables in the working sets, and the subproblem finds new variables for the working sets. In Sec. 4, we will demonstrate the importance of balancing these two problems by comparing DCD-SSVM and DCD-Light.

Convergence Analysis
We now present the theoretical analysis of both DCD-Light and DCD-SSVM, addressing two main questions: (1) whether the working sets grow exponentially, and (2) the convergence rate. Due to space limitations, we show only the main theorems.
Leveraging an existing result, we can prove that the DCD algorithms add only a limited number of variables to the working sets, and we have the following theorem.

Theorem 1. The number of times that DCD-Light or DCD-SSVM adds structures into working sets is bounded by
We next discuss the convergence rates of our DCD algorithms under two conditions: when the working sets are fixed, and the general case. If the working sets are fixed, the DCD algorithms become cyclic dual coordinate descent methods. In this case, we denote the minimization problem in Eq. (4) as F(α). For fixed working sets {W_i}, we denote by F_S(α) the minimization problem restricted to the dual variables in the working sets. By applying the results of (Luo and Tseng, 1993; Wang and Lin, 2013) to the L2-Loss SSVM, we have the following theorem.
Theorem 2. For any given non-empty working sets {W_i}, if the DCD algorithms do not extend the working sets (i.e., Lines 6-8 in Algorithm 2 are not executed), then the DCD algorithms obtain an ε-optimal solution of F_S(α) in O(log(1/ε)) iterations.
Based on Theorem 1 and Theorem 2, we have the following theorem.
Theorem 3. DCD-SSVM obtains an ε-optimal solution in O((1/ε²) log(1/ε)) iterations.

To the best of our knowledge, this is the first convergence analysis result for the L2-Loss SSVM. Compared to other theoretical analysis results for the L1-Loss SSVM, a tighter bound might exist given a better analysis; we leave this for future work. Note that Theorem 2 also shows that if all possible structures are placed in the working sets (i.e., F(α) = F_S(α)), the DCD algorithms obtain an ε-optimal solution in O(log(1/ε)) iterations.

Experiments
In order to verify the effectiveness of the proposed algorithm, we conduct a set of experiments on different optimization and learning algorithms. Before going to the experimental results, we first introduce the tasks and settings used in the experiments.

Tasks and Data
We evaluated our method and existing structured output learning approaches on named entity recognition (NER), part-of-speech tagging (POS) and dependency parsing (DP) on four benchmark datasets.
NER-MUC7 The MUC-7 data contains a subset of the North American News Text Corpora annotated with many types of entities. We followed the settings of Ratinov and Roth (2009) and considered three main entity categories: PER, LOC and ORG. We evaluated the results using phrase-level F1.

NER-CoNLL
This is the English dataset from the CoNLL 2003 shared task (Tjong Kim Sang and De Meulder, 2003). The dataset labels sentences from the Reuters Corpus, Volume 1 (Lewis et al., 2004) with four different entity types: PER, LOC, ORG and MISC. We evaluated the results using phrase-level F1.

POS-WSJ
The standard dataset for evaluating the performance of a part-of-speech tagger. The training, development and test sets consist of sections 0-18, 19-21 and 22-24 of the Penn Treebank (Marcus et al., 1993), respectively. We evaluated the results by token-level accuracy.

DP-WSJ
We took sections 02-21 of the Penn Treebank as the training set, section 00 as the development set and section 23 as the test set. We implemented a simple version of hash kernels to speed up the training procedure for this task (Bohnet, 2010). We report the unlabeled attachment accuracy for this task.

Features and Inference Algorithms
For the sequence labeling tasks, NER and POS, we followed standard discriminative HMM settings and defined the features as the concatenation of emission features conjoined with tag indicators and transition features over adjacent tags, where Φ_emi is the feature vector dedicated to the i-th token (the emission features), N is the number of tokens in the sequence, y_i is the i-th tag in the sequence y, [y_i = 1] is an indicator variable and k is the number of tags. The inference problems are solved by the Viterbi algorithm. The emission features used in both POS and NER are the standard ones, including word features, word-shape features, etc. For NER, we used additional simple gazetteer features and word cluster features (Turian et al., 2010). For dependency parsing, we used features similar to those in prior work. The decoding algorithm is the first-order Eisner algorithm (Eisner, 1997).

Algorithms and Implementation Detail
For all SSVM algorithms (including SGD), C was chosen from the set {0.01, 0.05, 0.1, 0.5, 1, 5} according to the accuracy/F1 on the development set. For each task, the same features were used by all algorithms (adding Wikipedia gazetteers would likely increase NER performance significantly (Ratinov and Roth, 2009)). For NER-MUC7, NER-CoNLL and POS-WSJ, we ran the online algorithms and DCD-SSVM for 25 iterations. For DP-WSJ, we let the algorithms run for only 10 iterations, as the inference procedure is computationally very expensive. The algorithms in the experiments are:

DCD Our dual coordinate descent methods on the L2-Loss SSVM. For DCD-SSVM, r is set to 5. For both DCD-Light and DCD-SSVM, following prior work, if the value of a dual variable becomes zero, its corresponding structure is removed from the working set to improve speed.

SVM-Struct
We used the latest version (v3.10) of SVM-Struct. This version uses the cutting plane method on a 1-slack variable formulation for the L1-Loss SSVM. SVM-Struct is implemented in C, while all the other algorithms are implemented in C#. We did not apply SVM-Struct to DP-WSJ because there is no native implementation.
Perceptron This refers to the averaged structured Perceptron method introduced by Collins (2002). To speed up convergence, we shuffle the training examples at each iteration.
MIRA The Margin Infused Relaxed Algorithm (MIRA) is an online learning algorithm that explicitly uses the notion of margin to update the weight vector. We use 1-best MIRA in our experiments. To speed up convergence, we shuffle the training examples at each iteration. Following prior work, we did not tune the C parameter for the MIRA algorithm.
SGD Stochastic gradient descent (SGD) (Bottou, 2004) is a technique for optimizing general convex functions. In this paper, we use SGD as an alternative baseline for optimizing the L1-Loss SSVM objective function (Eq. (2) with hinge loss). When using SGD, the learning rate must be carefully tuned. Following Bottou (2004), the learning rate is η_0 / (1.0 + η_0 T / C)^{0.75}, where C is the regularization parameter, T is the number of updates so far and η_0 is the initial step size. The parameter η_0 was selected from the set {2^{-1}, 2^{-2}, 2^{-3}, 2^{-4}, 2^{-5}} by running the SGD algorithm on a set of 1000 randomly sampled examples and choosing the η_0 with the lowest primal objective value on those examples.
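As a small sketch, the decaying step-size schedule described above can be written as follows; the exact functional form is our reading of Bottou's recipe, so treat it as an assumption rather than the paper's code.

```python
def sgd_learning_rate(eta0, T, C):
    """Decaying SGD step size in the style of Bottou (2004):
    eta0 / (1 + eta0 * T / C) ** 0.75, where T is the number of updates
    made so far and C is the regularization parameter."""
    return eta0 / (1.0 + eta0 * T / C) ** 0.75
```

The rate starts at η_0 and decays polynomially, so early updates are large and later updates become progressively more conservative.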
In order to improve the training speed, we cached all the feature vectors generated by the gold labeled data once computed. This applied to all algorithms except SVM-Struct, which has its own caching mechanism. We report the performance of the averaged weight vectors for Perceptron and MIRA.

Results
We present the experimental results below on comparing different dual coordinate descent methods, as well as comparing our main algorithm, DCD-SSVM, with other structured learning approaches.

Comparisons of DCD Methods
We compared three DCD methods: DCD-Light, DCD-SSVM and CPD. CPD is a cutting plane method proposed by Chang et al. (2010), which uses a dual coordinate descent algorithm to solve the internal sub-problems. We specifically included CPD as it also targets the L2-Loss SSVM.
Because different optimization strategies will reach the same objective values eventually, comparing them on prediction accuracy of the final models is not meaningful. Instead, here we compare how fast each algorithm converges as shown in Figure 1. Each marker on the line in this figure represents one iteration of the corresponding algorithm. Generally speaking, CPD improves the model very slowly in the early stages, but much faster after several iterations. In comparison, DCD-Light often behaves much better initially, and DCD-SSVM is generally the most efficient algorithm here.
The reason behind the slow performance of CPD is clear. During early rounds of the algorithm, the weight vector is far from optimal, so CPD spends too much time using "bad" weight vectors to find the most violated structures. On the other hand, DCD-Light updates the weight vector more frequently, so it behaves much better in general. DCD-SSVM spends more time on updating models during each batch, while spending the same amount of time on inference as DCD-Light. As a result, it finds a better trade-off between inference and learning.

DCD-SSVM, SVM-Struct and FW-Struct
Joachims et al. (2009) proposed a 1-slack variable method for the L1-Loss SSVM. They showed that solving the 1-slack variable formulation is an order of magnitude faster than solving the original (l-slack) formulation. Nevertheless, from Figure 2, we can see the clear advantage of DCD-SSVM over SVM-Struct. Although using the 1-slack formulation has improved the learning speed, SVM-Struct still converges more slowly than DCD-SSVM. In addition, the performance of models trained by SVM-Struct in the early stages is quite unstable, which makes early stopping an ineffective strategy in practice when training time is limited.
We also compared our algorithms to FW-Struct. Our results agree with (Lacoste-Julien et al., 2013), which shows that FW-Struct outperforms SVM-Struct. In our experiments, we found that our DCD algorithms were competitive with, and sometimes converged faster than, FW-Struct.

DCD-SSVM, MIRA, Perceptron and SGD
As in binary classification, where large-margin methods like SVMs often perform better than algorithms like Perceptron and SGD (Hsieh et al., 2008; Shalev-Shwartz and Zhang, 2013), we observe similar behavior in the structured output domain. Table 1 shows the final test accuracy or F1 scores of models trained by Perceptron, MIRA and SGD, compared to those of the SSVM models trained by DCD-SSVM. Among the benchmark datasets and tasks we experimented with, DCD-SSVM derived the most accurate models, except on DP-WSJ when compared to SGD.
Perhaps a more interesting comparison concerns training speed, shown in Figure 3. Compared to other online algorithms, DCD-SSVM can take advantage of cached dual variables and structures. We show that the training speed of DCD-SSVM is competitive with that of the online learning algorithms, unlike SVM-Struct. Note that SGD is not very stable on NER-MUC7, even though we tuned the step size very carefully.

Conclusion
In this paper, we present a novel approach for learning the L2-Loss SSVM model. By combining the ideas of dual coordinate descent and cutting plane methods, the hybrid approach, DCD-SSVM, outperforms other SSVM training methods both in terms of objective value reduction and test error reduction. As demonstrated in our experiments on several NLP tasks, our approach also tends to learn more accurate models than commonly used structured learning algorithms, including the structured Perceptron, MIRA and SGD. Perhaps more interestingly, our SSVM learning method is very efficient: the model training time is competitive with online learning algorithms such as the structured Perceptron and MIRA. These qualities make DCD-SSVM an excellent choice for solving a variety of complex NLP problems.
In the future, we would like to compare our algorithm to other structured prediction approaches, such as conditional random fields (Lafferty et al., 2001) and exponential gradient descent methods (Collins et al., 2008). Expediting the learning process further by leveraging approximate inference is also an interesting direction to investigate.