Learning Structural Kernels for Natural Language Processing

Structural kernels are a flexible learning paradigm that has been widely used in Natural Language Processing. However, the problem of model selection in kernel-based methods is usually overlooked. Previous approaches mostly rely on setting default values for kernel hyperparameters or using grid search, which is slow and coarse-grained. In contrast, Bayesian methods allow efficient model selection by maximizing the evidence on the training data through gradient-based methods. In this paper we show how to perform this in the context of structural kernels by using Gaussian Processes. Experimental results on tree kernels show that this procedure results in better prediction performance compared to hyperparameter optimization via grid search. The framework proposed in this paper can be adapted to other structures besides trees, e.g., strings and graphs, thereby extending the utility of kernel-based methods.


Introduction
Kernel-based methods are a staple machine learning approach in Natural Language Processing (NLP). Frequentist kernel methods like the Support Vector Machine (SVM) pushed the state of the art in many NLP tasks, especially classification and regression. One interesting aspect of kernels is their ability to be defined directly on structured objects like strings, trees and graphs. This approach has the potential to move the modelling effort from feature engineering to kernel engineering. This is useful when we do not have much prior knowledge about how the data behaves, as we can more readily define a similarity metric between inputs instead of trying to characterize which features are the best for the task at hand. Kernels are a very flexible framework: they can be combined and parameterized in many different ways. Complex kernels, however, lead to the problem of model selection, where the aim is to obtain the best kernel configuration in terms of hyperparameter values. The usual approach for model selection in frequentist methods is to employ grid search on some development data disjoint from the training data. This approach can rapidly become impractical when using complex kernels which increase the number of model hyperparameters. Grid search also requires the user to explicitly set the grid values, making it difficult to fine-tune the hyperparameters. Recent advances in model selection tackle some of these issues, but have several limitations (see §6 for details).
Our proposed approach for model selection relies on Gaussian Processes (GPs) (Rasmussen and Williams, 2006), a widely used Bayesian kernel machine. GPs allow efficient and fine-grained model selection by maximizing the evidence on the training data using gradient-based methods, dropping the requirement for development data. As a Bayesian procedure, GPs also naturally balance between model capacity and generalization. GPs have been shown to achieve state of the art performance in various regression tasks (Hensman et al., 2013; Cohn and Specia, 2013). Therefore, we base our approach on this framework.
While prediction performance is important to consider (as we show in our experiments), we are mainly interested in two other significant aspects that are enabled by our approach:
• Gradient-based methods are more efficient than grid search for high dimensional spaces. This allows us to easily propose new rich kernel extensions that rely on a large number of hyperparameters, which in turn can result in better modelling capacity.
• Since the model selection process is now fine-grained, we can interpret the resulting hyperparameter values, depending on how the kernel is defined.
arXiv:1508.02131v1 [cs.CL] 10 Aug 2015
In this work we focus on tree kernels, which have been successfully used in a number of NLP tasks (see §6). In most cases, these kernels are used as an SVM component and model selection is not considered an important issue. Hyperparameters are usually set to default values, which work reasonably well in terms of prediction performance. However, this is only possible due to the small number of hyperparameters these kernels contain.
We perform experiments comprising synthetic data (§4) and two real NLP regression tasks: Emotion Analysis (§5.1) and Translation Quality Estimation (§5.2). Our findings show that our approach outperforms SVMs using the same kernels.

Gaussian Process Regression
Our definition of GPs closely follows that of Rasmussen and Williams (2006).
Consider a setting where we have a dataset X = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i is a d-dimensional input and y_i the corresponding output. Our goal is to infer an underlying function f, over which we place a GP prior,

$$ f(x) \sim \mathcal{GP}(\mu(x), k(x, x')) $$

where µ(x) is the mean function, which is usually the 0 constant, and k(x, x′) is the kernel function.
In a regression setting, we assume that the response variables are noisy latent function evaluations, i.e., y_i = f(x_i) + η, where η ∼ N(0, σ_n²) is added white noise. We assume a Gaussian likelihood, which allows us to obtain a closed formula solution for the posterior, namely

$$ y_* \sim \mathcal{N}\left( k_*^\top (K + \sigma_n^2 I)^{-1} y,\; k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_* \right) $$

where x_* and y_* are respectively the test input and its response variable, K is the Gram matrix corresponding to the training inputs and k_* = [k(x_1, x_*), k(x_2, x_*), ..., k(x_n, x_*)] is the vector of kernel evaluations between the test input and each training input.
A key property of GP models is their ability to perform efficient model selection. This is achieved by employing gradient-based methods to maximize the marginal likelihood,

$$ p(y \mid X, \theta) = \int p(y \mid X, \theta, f)\, p(f)\, df $$

where θ represents the vector of model hyperparameters and y is the vector of response variables from the training data. For a Gaussian likelihood, we can take the log of the expression above to obtain, in closed form,

$$ \log p(y \mid X, \theta) = \underbrace{-\frac{1}{2} y^\top G^{-1} y}_{\text{data fit}} \; \underbrace{-\, \frac{1}{2} \log |G|}_{\text{complexity penalty}} \; \underbrace{-\, \frac{n}{2} \log 2\pi}_{\text{constant}} $$

where G = K + σ_n² I. The data fit term is dependent on the training response variables, while the complexity penalty term relies only on the kernel and training inputs. Since the first two terms have conflicting objectives, optimizing the log marginal likelihood will naturally achieve a compromise and thus limit overfitting (without the need for any validation step or additional data).
To enable gradient-based optimization we need to derive the gradients w.r.t. the hyperparameters:

$$ \frac{\partial}{\partial \theta_j} \log p(y \mid X, \theta) = \frac{1}{2} y^\top G^{-1} \frac{\partial G}{\partial \theta_j} G^{-1} y - \frac{1}{2} \operatorname{tr}\left( G^{-1} \frac{\partial G}{\partial \theta_j} \right) $$

The gradients of G depend on the underlying kernel. Therefore we can employ any kind of valid kernel in this procedure as long as its gradients can be computed. This not only allows for fine-tuning of hyperparameters but also allows for kernel extensions which are richly parameterized.
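The quantities in this section can be made concrete with a minimal sketch in plain Python. It is not the implementation used in the experiments: the function names (`cholesky`, `chol_solve`, `log_marginal_likelihood`, `lml_gradient`, `predict`) are hypothetical, matrices are assumed to be lists of lists, and G = K + σ_n² I is assumed to be positive definite. The sketch evaluates the log marginal likelihood, its gradient for a given ∂G/∂θ, and the predictive mean and variance.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def chol_solve(low, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def add_noise(gram, noise_var):
    """G = K + noise_var * I."""
    n = len(gram)
    return [[gram[i][j] + (noise_var if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def log_marginal_likelihood(gram, y, noise_var):
    n = len(y)
    low = cholesky(add_noise(gram, noise_var))
    a = chol_solve(low, y)                                     # G^{-1} y
    data_fit = -0.5 * sum(yi * ai for yi, ai in zip(y, a))
    complexity = -sum(math.log(low[i][i]) for i in range(n))   # -1/2 log|G|
    constant = -0.5 * n * math.log(2.0 * math.pi)
    return data_fit + complexity + constant

def lml_gradient(gram, dgram, y, noise_var):
    """Gradient of the log marginal likelihood, given dgram = dG/dtheta."""
    n = len(y)
    low = cholesky(add_noise(gram, noise_var))
    a = chol_solve(low, y)
    quad = sum(a[i] * dgram[i][j] * a[j] for i in range(n) for j in range(n))
    # tr(G^{-1} dG): solve one system per column of dG and pick the diagonal entry
    trace = sum(chol_solve(low, [dgram[i][j] for i in range(n)])[j] for j in range(n))
    return 0.5 * quad - 0.5 * trace

def predict(gram, y, noise_var, k_star, k_star_star):
    """Predictive mean and variance for one test input."""
    low = cholesky(add_noise(gram, noise_var))
    mean = sum(ks * ai for ks, ai in zip(k_star, chol_solve(low, y)))
    var = k_star_star - sum(ks * vi for ks, vi in zip(k_star, chol_solve(low, k_star)))
    return mean, var
```

For the noise variance, ∂G/∂σ_n² is the identity matrix, so `lml_gradient(gram, identity, y, noise_var)` gives the corresponding gradient. For a kernel hyperparameter, `dgram` would hold the element-wise kernel derivatives.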

Tree Kernels
The seminal work on Convolution Kernels by Haussler (1999) defines a broad class of kernels on discrete structures by counting and weighting the number of substructures they share. Applying Haussler's formulation to trees we reach a general formula for a tree kernel between two trees t_1 and t_2, namely

$$ k(t_1, t_2) = \sum_{f \in F} w(f)\, c_1(f)\, c_2(f) \tag{1} $$

where F is the set of all tree fragments, c_1(f) and c_2(f) return the counts for fragment f in trees t_1 and t_2, respectively, and w(f) assigns a weight to fragment f. In other words, we can consider the kernel a weighted dot product over vectors of fragment counts. The actual fragment set F is deliberately left undefined: different concepts of tree fragments define different tree kernels.
In this paper, we will focus on Subset Tree Kernels (henceforth SSTK), first introduced by Collins and Duffy (2001). This kernel considers tree fragments that contain complete grammar rules (see Figure 1 for an example). Consider the set of nodes in the two trees as N_1 and N_2 respectively. We define I_i(n) as an indicator function that returns 1 if fragment f_i ∈ F has root n and 0 otherwise. A SSTK can then be defined as:

$$ k(t_1, t_2) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} \Delta(n_1, n_2) \tag{2} $$

where

$$ \Delta(n_1, n_2) = \sum_{i=1}^{|F|} \lambda^{s(i)}\, I_i(n_1)\, I_i(n_2) $$

and s(i) is the number of nodes in fragment f_i with at least one child.
The formulation in Equation 2 is the same as the one shown in Equation 1, except that we are now restricting the weights w(f) to be a function of a hyperparameter λ. The original goal of λ is to act as a decay factor that penalizes contributions from larger fragments compared to smaller ones (and therefore, it should be in the [0, 1] interval). Without this factor, the resulting distribution over tree pairs is skewed, giving extremely large values when trees are equal and rapidly decreasing for small differences over fragment counts. The decay factor helps to spread this distribution, effectively giving smaller weights to larger fragments.
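The function ∆ can be defined recursively,

$$ \Delta(n_1, n_2) = \begin{cases} 0 & pr(n_1) \neq pr(n_2) \\ \lambda & pr(n_1) = pr(n_2) \wedge preterm(n_1) \\ \lambda\, g(n_1, n_2) & \text{otherwise,} \end{cases} $$

where pr(n) is the grammar production at node n and preterm(n) returns true if n is a pre-terminal node. The function g is defined as follows:

$$ g(n_1, n_2) = \prod_{i=1}^{|n_1|} \left( \alpha + \Delta(c^i_{n_1}, c^i_{n_2}) \right) \tag{3} $$

where |n| is the number of children of node n and c^i_n is the i-th child of node n. This recursive definition is calculated efficiently by employing dynamic programming to cache intermediate ∆ results.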
Equation 3 also adds another hyperparameter, α. This hyperparameter was introduced by Moschitti (2006b) as a way to select between two different tree kernels. If α = 1, we get the original SSTK; if α = 0, we obtain the Subtree Kernel, which only allows fragments with terminal symbols as leaves. We can also interpret the Subtree Kernel as a "sparse" version of the SSTK, where the "non-subtree" fragments have their weights equal to zero.
Even though fragment weights are affected by both kernel hyperparameters, previous work did not discuss their effects. The usual procedure fixes α to 1 (selecting the original SSTK) and sets λ to a default value (around 0.4). As explained in §2, the GP model selection procedure enables us to learn fine-grained values for these hyperparameters, which can lead to better performing models and aid interpretation. Furthermore, it also allows us to extend these kernels by adding new hyperparameters. We propose one such kernel in the next Section.
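The recursion for ∆ and g can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments. It assumes (hypothetically) that a tree is a nested tuple `(label, child, ...)`, that words are plain strings, and that words only appear under pre-terminal nodes. The helper names are illustrative, and `functools.lru_cache` stands in for the dynamic programming table.

```python
from functools import lru_cache

def production(node):
    """Grammar production at a node: its label plus the labels of its children."""
    return (node[0], tuple(c if isinstance(c, str) else c[0] for c in node[1:]))

def is_preterminal(node):
    """A pre-terminal node has only words (strings) as children."""
    return all(isinstance(c, str) for c in node[1:])

def internal_nodes(tree):
    """All non-terminal nodes of the tree, in pre-order."""
    nodes = [tree]
    for child in tree[1:]:
        if not isinstance(child, str):
            nodes.extend(internal_nodes(child))
    return nodes

def sstk(t1, t2, lam=0.4, alpha=1.0):
    """Unnormalized Subset Tree Kernel: double sum of Delta over node pairs."""

    @lru_cache(maxsize=None)  # caches intermediate Delta values
    def delta(n1, n2):
        if production(n1) != production(n2):
            return 0.0
        if is_preterminal(n1):
            return lam
        g = 1.0
        for c1, c2 in zip(n1[1:], n2[1:]):
            g *= alpha + delta(c1, c2)
        return lam * g

    return sum(delta(n1, n2)
               for n1 in internal_nodes(t1)
               for n2 in internal_nodes(t2))

# Toy tree: (S (NP (D the) (N dog)) (VP (V barks)))
toy = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))
print(sstk(toy, toy, lam=1.0, alpha=1.0))  # 24.0
print(sstk(toy, toy, lam=1.0, alpha=0.0))  # 6.0
```

With λ = 1 and α = 1 the kernel simply counts shared fragments (24 for the toy tree against itself). With α = 0 only complete subtrees survive, one per non-terminal node (6), which matches the Subtree Kernel interpretation given above.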

Symbol-aware Subset Tree Kernel
While varying the SSTK hyperparameters can lead to different weight schemes, they do that in a very coarse way. For some applications, it may be necessary to give more weight to specific fragments or sets of fragments (e.g., NPs being more important than ADVPs in an information extraction setting). The Symbol-aware Subset Tree Kernel (henceforth, SASSTK), which we introduce here, allows a more fine-grained control over the weights by employing one λ and one α hyperparameter for each non-terminal symbol in the training data. The calculation uses a similar recursive formula to the SSTK, namely:

$$ \Delta(n_1, n_2) = \begin{cases} 0 & pr(n_1) \neq pr(n_2) \\ \lambda_x & pr(n_1) = pr(n_2) \wedge preterm(n_1) \\ \lambda_x\, g_x(n_1, n_2) & \text{otherwise,} \end{cases} $$

where x is the symbol at node n_1 and

$$ g_x(n_1, n_2) = \prod_{i=1}^{|n_1|} \left( \alpha_x + \Delta(c^i_{n_1}, c^i_{n_2}) \right) \tag{4} $$

The SASSTK can be interpreted as a generalization of the SSTK: we can recover the latter by tying all λ and setting all α = 1. By employing different hyperparameter values for each specific symbol, we can effectively modify the weights of all fragments where the symbol appears. Table 1 shows an example where we unrolled a kernel computation into its corresponding feature space, showing the resulting weighted counts for each feature.
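A minimal sketch of the symbol-aware recursion follows, under the same hypothetical tree encoding as before (nested tuples `(label, child, ...)`, words as strings under pre-terminals only). The dictionaries `lam` and `alpha`, which map each non-terminal symbol to its hyperparameter, are an illustrative choice and not the data structure used in the experiments.

```python
from functools import lru_cache

def production(node):
    """Grammar production at a node: its label plus the labels of its children."""
    return (node[0], tuple(c if isinstance(c, str) else c[0] for c in node[1:]))

def is_preterminal(node):
    return all(isinstance(c, str) for c in node[1:])

def internal_nodes(tree):
    nodes = [tree]
    for child in tree[1:]:
        if not isinstance(child, str):
            nodes.extend(internal_nodes(child))
    return nodes

def sasstk(t1, t2, lam, alpha):
    """Symbol-aware SSTK: lam[x] and alpha[x] are looked up per symbol x."""

    @lru_cache(maxsize=None)
    def delta(n1, n2):
        if production(n1) != production(n2):
            return 0.0
        x = n1[0]  # symbol at node n1
        if is_preterminal(n1):
            return lam[x]
        g = 1.0
        for c1, c2 in zip(n1[1:], n2[1:]):
            g *= alpha[x] + delta(c1, c2)
        return lam[x] * g

    return sum(delta(n1, n2)
               for n1 in internal_nodes(t1)
               for n2 in internal_nodes(t2))

toy = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))
symbols = ["S", "NP", "VP", "D", "N", "V"]

# Tying all lambdas and setting all alphas to 1 recovers the SSTK value.
tied_lam = dict.fromkeys(symbols, 0.5)
ones = dict.fromkeys(symbols, 1.0)
print(sasstk(toy, toy, tied_lam, ones))       # 5.234375

# Untying only the 'S' symbol reweights every fragment rooted at S.
untied_lam = dict(tied_lam, S=1.0)
print(sasstk(toy, toy, untied_lam, ones))     # 7.09375
```

The second call mirrors the reduced configuration used later in the paper, where only the hyperparameters of one symbol are untied from the rest.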

Kernel Gradients
To enable hyperparameter optimization via gradient descent we must provide gradients for the kernels.
In this Section we derive the gradients for SASSTK.
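From Equation 2 we know that the kernel is a double summation over the ∆ function. Therefore all gradients are also double summations, but over the gradients of ∆. We can obtain these in a vectorized way, by considering the gradients of ∆ with respect to the hyperparameter vectors λ and α. Let k be the number of symbols considered in the model and λ and α be k-dimensional vectors containing the respective hyperparameters.
In the following, we use the notation ∆_i as a shorthand for ∆(c^i_{n_1}, c^i_{n_2}) and we also omit the parameters of g_x. We start with the λ gradient:

$$ \frac{\partial \Delta(n_1, n_2)}{\partial \boldsymbol{\lambda}} = \begin{cases} 0 & pr(n_1) \neq pr(n_2) \\ u & pr(n_1) = pr(n_2) \wedge preterm(n_1) \\ u\, g_x + \lambda_x \dfrac{\partial g_x}{\partial \boldsymbol{\lambda}} & \text{otherwise,} \end{cases} $$

where x is the symbol at n_1, g_x is defined in Equation 4 and u is the k-dimensional unit vector with the element corresponding to symbol x equal to 1 and all others equal to 0. The gradient in the third case is defined recursively,

$$ \frac{\partial g_x}{\partial \boldsymbol{\lambda}} = \sum_{i=1}^{|n_1|} \frac{g_x}{\alpha_x + \Delta_i} \frac{\partial \Delta_i}{\partial \boldsymbol{\lambda}} $$

The α gradient is derived in a similar way,

$$ \frac{\partial \Delta(n_1, n_2)}{\partial \boldsymbol{\alpha}} = \begin{cases} 0 & pr(n_1) \neq pr(n_2) \vee preterm(n_1) \\ \lambda_x \dfrac{\partial g_x}{\partial \boldsymbol{\alpha}} & \text{otherwise,} \end{cases} $$

and the gradient at the second case is also defined recursively,

$$ \frac{\partial g_x}{\partial \boldsymbol{\alpha}} = \sum_{i=1}^{|n_1|} \frac{g_x}{\alpha_x + \Delta_i} \left( u + \frac{\partial \Delta_i}{\partial \boldsymbol{\alpha}} \right) $$

Gradients can be efficiently obtained using dynamic programming. In fact, they can be calculated at the same time as ∆ to improve performance since they all share many terms in their derivations. Finally, we can easily obtain the gradients for the original SSTK by letting u = 1.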
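As a sanity check on these derivations, the following sketch computes ∆ together with its λ and α derivatives for the original SSTK (the u = 1 case), again under the hypothetical nested-tuple tree encoding. It is an illustration, not the implementation used in the experiments. To avoid dividing by α + ∆_i, it writes g/(α + ∆_i) as the product of the remaining factors, which is algebraically the same quantity.

```python
import math
from functools import lru_cache

def production(node):
    return (node[0], tuple(c if isinstance(c, str) else c[0] for c in node[1:]))

def is_preterminal(node):
    return all(isinstance(c, str) for c in node[1:])

def internal_nodes(tree):
    nodes = [tree]
    for child in tree[1:]:
        if not isinstance(child, str):
            nodes.extend(internal_nodes(child))
    return nodes

def sstk_with_gradients(t1, t2, lam, alpha):
    """Return (k, dk/dlam, dk/dalpha) for the unnormalized SSTK."""

    @lru_cache(maxsize=None)
    def delta(n1, n2):
        # Returns (Delta, dDelta/dlam, dDelta/dalpha), computed in one pass.
        if production(n1) != production(n2):
            return 0.0, 0.0, 0.0
        if is_preterminal(n1):
            return lam, 1.0, 0.0
        children = [delta(c1, c2) for c1, c2 in zip(n1[1:], n2[1:])]
        factors = [alpha + d for d, _, _ in children]
        g = math.prod(factors)
        dg_dlam = 0.0
        dg_dalpha = 0.0
        for i, (_, d_lam, d_alpha) in enumerate(children):
            rest = math.prod(factors[:i] + factors[i + 1:])  # g / (alpha + Delta_i)
            dg_dlam += d_lam * rest
            dg_dalpha += (1.0 + d_alpha) * rest
        return lam * g, g + lam * dg_dlam, lam * dg_dalpha

    k = dk_dlam = dk_dalpha = 0.0
    for n1 in internal_nodes(t1):
        for n2 in internal_nodes(t2):
            d, d_lam, d_alpha = delta(n1, n2)
            k += d
            dk_dlam += d_lam
            dk_dalpha += d_alpha
    return k, dk_dlam, dk_dalpha

toy = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))
print(sstk_with_gradients(toy, toy, 1.0, 1.0))  # (24.0, 68.0, 30.0)
```

For the toy tree, k(t, t) = 3λ + λ(α + λ)² + λ(α + λ) + λ(α + λ(α + λ)²)(α + λ(α + λ)), and differentiating this expression by hand at λ = α = 1 gives the same values, 68 and 30.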

Kernel Normalization
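It is common practice when using tree kernels to normalize the kernel. This helps reduce the random effect of tree size. Normalization can be achieved using the following, where k̂ is the normalized kernel:

$$ \hat{k}(t_1, t_2) = \frac{k(t_1, t_2)}{\sqrt{k(t_1, t_1)\, k(t_2, t_2)}} $$

To apply this normalized version in the optimization procedure we must also derive gradients for the normalization function. In the following equation, we use k_ij and k̂_ij as a shorthand for k(t_i, t_j) and k̂(t_i, t_j), respectively:

$$ \frac{\partial \hat{k}_{ij}}{\partial \theta} = \frac{\dfrac{\partial k_{ij}}{\partial \theta}}{\sqrt{k_{ii}\, k_{jj}}} - \hat{k}_{ij}\, \frac{\dfrac{\partial k_{ii}}{\partial \theta} k_{jj} + k_{ii} \dfrac{\partial k_{jj}}{\partial \theta}}{2\, k_{ii}\, k_{jj}} $$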
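The normalization and its gradient translate directly into code. The sketch below uses hypothetical function names and scalar arguments (kernel values and their derivatives w.r.t. a single hyperparameter θ), and checks the gradient formula on a small numeric example.

```python
import math

def normalize(k_ij, k_ii, k_jj):
    """Normalized kernel value k_hat_ij."""
    return k_ij / math.sqrt(k_ii * k_jj)

def normalize_grad(k_ij, k_ii, k_jj, dk_ij, dk_ii, dk_jj):
    """Derivative of k_hat_ij w.r.t. theta, given the unnormalized derivatives."""
    k_hat = normalize(k_ij, k_ii, k_jj)
    return (dk_ij / math.sqrt(k_ii * k_jj)
            - k_hat * (dk_ii * k_jj + k_ii * dk_jj) / (2.0 * k_ii * k_jj))

# Numeric example: k_ij = 3, k_ii = 4, k_jj = 9 gives k_hat = 0.5.
print(normalize(3.0, 4.0, 9.0))
# A tree compared with itself always has k_hat = 1, so its gradient vanishes.
print(normalize_grad(5.0, 5.0, 5.0, 2.0, 2.0, 2.0))
```

The second call illustrates a useful property: the diagonal of a normalized Gram matrix is constant, so it contributes nothing to the hyperparameter gradients.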

Other Extensions
Many other structural kernels rely on recursive definitions and dynamic programming to perform their calculations. Examples include other tree kernels like the Partial Tree Kernel (Moschitti, 2006a) and string kernels like the ones defined on character n-grams (Lodhi et al., 2002) or word sequences (Cancedda et al., 2003). While in this paper we focus on the SSTK (and our proposed SASSTK), our approach can easily be extended to these other kernels, as long as all the corresponding recursive definitions are differentiable.

Synthetic Data Experiments
A natural question that arises in the proposed method is how much data is needed to accurately learn the kernel hyperparameters. To answer this question, we run a set of experiments using synthetic data. We generate this data by using a set of 1000 natural language syntactic trees, where we fix a random subset of 200 instances for testing and use the remaining 800 instances as training. For each training set size we define a GP over the full dataset, sample a function from it and use the function output as the response variable for each tree. We try two different GP priors, one using the SSTK and another one using the SASSTK.
The conditions above provide a controlled environment to check the modelling capacities of our approach since we know the exact distribution where the data comes from. The reasoning behind these experiments is that to be able to provide benefits in real tasks, where the data distribution is not known, our models have to be learnable in this controlled setting as well using a reasonable amount of data.
Finally, we also provide an empirical evaluation comparing the speed performance between our approach and grid search.
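The data generation step can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used in the experiments: it assumes that responses are drawn jointly as y ∼ N(0, K + σ_n² I), where K is the Gram matrix of the chosen tree kernel over all trees, and it uses a small hand-written Gram matrix in place of real kernel evaluations. The function names are hypothetical.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_responses(gram, noise_var, rng):
    """Draw one response vector y = L z with z ~ N(0, I) and L L^T = K + noise_var * I."""
    n = len(gram)
    g = [[gram[i][j] + (noise_var if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    low = cholesky(g)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]

# Stand-in Gram matrix for three trees (assumed values, for illustration only).
toy_gram = [[1.0, 0.8, 0.1],
            [0.8, 1.0, 0.2],
            [0.1, 0.2, 1.0]]
y = sample_responses(toy_gram, noise_var=0.01, rng=random.Random(0))
print(len(y))  # 3
```

Trees with a high kernel value (the first two in the toy matrix) tend to receive similar sampled responses, which is what makes the original hyperparameters recoverable from the sampled data.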

SSTK Prior
Our first experiments use a SSTK as the kernel with λ = 0.001, α = 1 and σ_n² = 0.01. After obtaining the input trees and their sampled labels, we define a new GP model using only the training data plus the obtained response variables, this time using a SSTK with randomized hyperparameter values. Then we optimize the GP and check if the learned hyperparameters are close to the original ones, using 10 random restarts to limit the effect of local optima. We also use the optimized GP to predict response variables on the test set and measure Root Mean Squared Error (RMSE). Our hypothesis is that with a reasonable sample size we can retrieve the original hyperparameter values and obtain low RMSE. For each training set size, we repeat the experiment 20 times.
Figure 2 shows the results of these experiments. For small sizes the variance in the resulting hyperparameter values is large but as soon as we reach 200 instances we are able to retrieve the original values with high confidence. In other words, in an ideal setting 200 instances are enough to learn the kernel. It is also interesting to note that test RMSE after optimization steadily decreases as we increase training data size. This shows that if one is more interested in predictions themselves, it is still worth optimizing hyperparameters even if the training data is small.

SASSTK Prior
The large number of hyperparameters of the SASSTK makes it more prone to optimization and overfitting issues when compared to the SSTK. This raises the question of how much data is needed to justify its use. To address this question, we run similar experiments to those above for the SSTK, except that now we sample from a GP using a SASSTK as the kernel.
Instead of optimizing all hyperparameters freely we use a simpler version where we tie λ and α for each symbol to the same value, except for the symbol 'S'. Effectively this version has one extra λ and one extra α (henceforth λ_S and α_S) when compared to the SSTK. The GP prior hyperparameter values are set to λ = 0.001, λ_S = 0.5, α = 0.1, α_S = 1 and σ_n² = 0.01. For each training set size, we train two GPs, one using this SASSTK and one using the original SSTK, optimize them using 10 random restarts and measure RMSE on the test set.
Results are shown in Figure 3. For all training set sizes the SASSTK reaches lower RMSE than SSTK, with a substantial difference after reaching 100 instances. This shows that even for small datasets our proposed kernel manages to capture aspects which cannot be explained by the original SSTK. Note that this is an ideal setting, and real datasets may need to be larger to realize gains from SASSTK. Nevertheless, these are promising results since they give evidence of a small lower bound on the dataset size for SASSTK to be effective.

Performance Experiments
To provide an overview of how efficient the gradient-based method is compared to grid search, we also run a set of experiments measuring wall clock training time vs. RMSE on a test set. For both GP and SVM models we employ the SSTK as the kernel and we use the same synthetic data from the previous experiments. We perform 20 runs, keeping the test set as the same 200 instances for all runs and randomly sampling 200 instances from the remaining instances as training data.
Figure 4 shows the curves for both GP and SVM models. The GP curve is obtained by increasing the maximum number of iterations of the gradient-based method (in this case, L-BFGS) and the SVM curve is obtained by increasing the granularity of the grid size. We can see that optimizing the GP model is consistently much faster than doing grid search on the SVM model (notice the logarithmic scale), even though it shows some variance when letting L-BFGS run for a larger number of iterations. The GP model is also able to make better predictions in general. Even when taking the variances into account, grid search would still need around 10 times more computation time to achieve the same predictions obtained by the GP model. In real settings, SVM predictions tend to be more on par with the ones provided by a GP (as shown in §5) but nevertheless these figures show that the GP can be much more time efficient when optimizing hyperparameters of a tree kernel.
An important performance aspect to take into account is parallelization. Grid search is embarrassingly parallelizable since each grid point can run in a different core. However, the GP optimization can also benefit from multiple cores by running each kernel computation inside the Gram matrix in parallel.
To keep the comparisons simpler, the results shown in this section use a single core but all experiments in §5 employ parallelization at the Gram matrix computation level (for both SVM and GP models).

NLP Experiments
Our experiments with NLP data address two regression tasks: Emotion Analysis and Quality Estimation. For both tasks, we use the Stanford parser (Manning et al., 2014) to obtain constituency trees for all sentences. Also, rather than using the official data splits, we perform 5-fold cross-validation in order to obtain more reliable results.

Emotion Analysis
The goal of Emotion Analysis is to automatically detect emotions in a text (Strapparava and Mihalcea, 2008). This problem is closely related to Opinion Mining (Pang and Lee, 2008), with similar applications, but it is usually done at a more fine-grained level and involves the prediction of a set of labels for each text (one for each emotion) instead of a single label. Beck et al. (2014a) used a multi-task GP for this task with a bag-of-words feature representation. In theory, it is possible to combine their multi-task kernel with our tree kernels, but to keep the focus of the experiments on testing tree kernel approaches, here we use independently trained models, one per emotion.
Dataset We use the dataset provided by the "Affective Text" shared task in SemEval2007 (Strapparava and Mihalcea, 2007), which is composed of 1000 news headlines annotated in terms of six emotions: Anger, Disgust, Fear, Joy, Sadness and Surprise. For each emotion, a score between 0 and 100 is given, 0 meaning total lack of emotion and 100, maximally emotional. Scores are mean-normalized before training the models.

Models
We perform experiments using the following tree kernels:
• SSTK: the SSTK formulation introduced by Moschitti (2006b);
• SASSTK full: our proposed Symbol-Aware SSTK;
• SASSTK S: same as before, but using only two λ and two α hyperparameters: one for symbols corresponding to full sentences [5] and another for all other symbols. This configuration is similar to that in Section 4.2.
For all kernels, we also use a variation fixing the α hyperparameters to 1 to emulate the original SSTK.
Baselines and evaluation Our results are compared against three baselines:
• SVM SSTK: a SVM using an SSTK kernel.
• SVM BOW: same as before, but using an RBF kernel with a bag-of-words representation.
• GP BOW: same as SVM BOW but using a GP instead.
The SVM models are trained using a wrapper for LIBSVM [6] (Chang and Lin, 2001) provided by the scikit-learn toolkit [7] (Pedregosa et al., 2011) and optimized via grid search. Following previous work, we use Pearson's correlation coefficient as evaluation metric. Pearson's scores are obtained by concatenating all six emotions outputs together.
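The evaluation protocol can be sketched in a few lines. The function names and the dictionary layout (one list of scores per emotion) are assumptions made for illustration; the only detail taken from the text is that predictions for all emotions are concatenated before a single Pearson's r is computed.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def pooled_pearson(predicted, gold, emotions):
    """Concatenate per-emotion outputs in a fixed order, then correlate once."""
    all_pred = [v for e in emotions for v in predicted[e]]
    all_gold = [v for e in emotions for v in gold[e]]
    return pearson(all_pred, all_gold)

# Toy scores for two emotions (made-up numbers, for illustration only).
gold = {"joy": [10.0, 40.0, 5.0], "fear": [60.0, 0.0, 20.0]}
pred = {"joy": [12.0, 35.0, 9.0], "fear": [50.0, 4.0, 25.0]}
print(round(pooled_pearson(pred, gold, ["joy", "fear"]), 3))
```

Pooling in this way yields a single score per model rather than six per-emotion scores, so emotions with a wider score range can weigh more heavily in the result.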
Table 2 shows the results. The best GP model with tree kernels outperforms the SVMs, showing that the fine-grained model selection procedure provided by the GP models is helpful when dealing with tree kernels. However, using the SASSTK models does not help in the case of free α and the SASSTK full actually performs worse than the original SSTK, even though the optimized marginal likelihood was higher. This is evidence that the SASSTK full model is overfitting the training data, probably due to its large number of hyperparameters. Another interesting finding in Table 2 is that fixing the α values often harms performance. Inspecting the free α models showed that the values found by the optimizer were very close to zero. This indicates that the model selection procedure tends towards giving smaller weights to incomplete tree fragments. We can interpret this as the model selecting a more lexicalized feature space, which also explains why the GP RBF model on bag-of-words performed the best in this task.
[5] In this dataset, symbols are S, SQ, SBARQ and SINV.
[6] www.csie.ntu.edu.tw/~cjlin/libsvm
[7] http://scikit-learn.org
Finally, to understand how the optimized kernels could provide more interpretability, Table 3 shows the top 15 λ values obtained by the SASSTK full (fixed α variant) with their corresponding symbols. In this specific case the kernel does not give the best performance so there are limitations in doing a full linguistic analysis. Nevertheless, we believe this example shows the potential for developing more interpretable kernels. This is especially interesting because these models take into account a much richer feature space than what is allowed by parametric models.

Quality Estimation
The goal of Quality Estimation is to provide a quality prediction for new, unseen machine translated texts (Blatz et al., 2004; Bojar et al., 2014). Examples of applications include filtering machine translated sentences that would require more post-editing effort than translation from scratch (Specia et al., 2009), selecting the best translation from different MT systems (Specia et al., 2010) or between an MT system and a translation memory (He et al., 2010), and highlighting segments that need revision (Bach et al., 2011). While various quality metrics exist, here we focus on post-editing time prediction. Tree kernels have been used before in this task (with SVMs) by Hardmeier (2011) and Hardmeier et al. (2012). While their best models combine tree kernels with a set of explicit features, they also show good results using only the tree kernels. This makes Quality Estimation a good benchmark task to test our models.
Datasets We use two publicly available datasets containing post-edited machine translated sentences. Both are composed of a set of source sentences, their machine translated outputs and the corresponding post-editing time.
• English-Spanish (en-es): This dataset was used in the WMT14 Quality Estimation shared task (Bojar et al., 2014), containing 858 sentences translated from English into Spanish and post-edited by an expert translator.
For each dataset, post-editing times are first divided by the translation output length (obtaining the post-editing time per word) and then mean normalized.
Models Since our data consists of pairs of trees, our models in this task use a pair of tree kernels. We combine these two kernels by either summing or multiplying them. As for underlying tree kernels, we try both SSTK and SASSTK S. As in the Emotion Analysis task, we also experiment with a set of kernel configurations with the α hyperparameters fixed at 1. We also test models that combine our tree kernels with an RBF kernel on a set of 17 features extracted using the QuEst framework (Specia et al., 2013). These features are part of a strong baseline model used by the WMT14 shared task.
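Combining the source and target kernels amounts to element-wise operations on their Gram matrices, and the gradients follow from the sum and product rules. The sketch below illustrates this with hypothetical function names and small made-up Gram matrices; it is not the code used in the experiments.

```python
def combine(gram_src, gram_tgt, mode):
    """Element-wise sum or product of the source and target Gram matrices."""
    if mode == "sum":
        op = lambda a, b: a + b
    elif mode == "product":
        op = lambda a, b: a * b
    else:
        raise ValueError("mode must be 'sum' or 'product'")
    return [[op(a, b) for a, b in zip(row_s, row_t)]
            for row_s, row_t in zip(gram_src, gram_tgt)]

def combine_grad_src(dgram_src, gram_tgt, mode):
    """Gradient of the combined Gram matrix w.r.t. a source-kernel hyperparameter.

    The target kernel does not depend on that hyperparameter, so the sum rule
    passes the source gradient through and the product rule scales it by the
    target kernel values.
    """
    if mode == "sum":
        return [row[:] for row in dgram_src]
    if mode == "product":
        return [[d * b for d, b in zip(row_d, row_t)]
                for row_d, row_t in zip(dgram_src, gram_tgt)]
    raise ValueError("mode must be 'sum' or 'product'")

# Made-up normalized Gram matrices for two sentence pairs.
k_src = [[1.0, 0.5], [0.5, 1.0]]
k_tgt = [[1.0, 0.2], [0.2, 1.0]]
print(combine(k_src, k_tgt, "sum"))
print(combine(k_src, k_tgt, "product"))
```

Both combinations preserve positive semi-definiteness, so the result is again a valid kernel that can be plugged into the GP marginal likelihood and its gradient.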
Baselines and evaluation We compare our results with a number of SVM models:
• SVM SSTK: same as in the Emotion Analysis task, using either a sum (+) or a product (×) of SSTKs.
• SVM RBF: this is an SVM trained on the 17 features extracted by QuEst.
• SVM RBF SSTK: a combination of the two models above.
For further comparison, we also show results obtained using a GP model and an RBF kernel on the QuEst-only features. Following previous work, we measure prediction performance using both Mean Absolute Error (MAE) and RMSE.
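For reference, the two error metrics can be computed as in the following sketch. The function names and the example numbers are illustrative only.

```python
import math

def mae(predicted, gold):
    """Mean Absolute Error."""
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)

def rmse(predicted, gold):
    """Root Mean Squared Error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold))

# Made-up post-editing times per word (after normalization).
gold = [0.0, 1.0, -1.0, 2.0]
pred = [0.5, 1.0, -1.0, 0.0]
print(mae(pred, gold))   # 0.625
print(rmse(pred, gold))
```

RMSE penalizes the single large error in the last example more heavily than MAE does, which is why the two metrics can rank models differently.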
The prediction results are given in Table 4. They indicate a number of interesting findings:
• For the fr-en dataset, the GP models combining tree kernels with an RBF kernel outperform all other models. Results for the en-es dataset are less consistent, probably due to the small size of the dataset, but on average they are better than their SVM counterparts.
• The SVMs using a combination of kernels perform worse than using the RBF kernel alone. Inspecting the models, we found that grid search actually harms performance. For instance, for the fr-en dataset, MAE and RMSE for the RBF + SSTK × model before grid search are 0.4681 and 0.6016, respectively. On the other hand, for this dataset all GP models achieve better results after optimization.
• Unlike in the Emotion Analysis task, fixing α results in better performance, even though the resulting models have lower marginal likelihood than the ones with free α. The same effect happened when comparing the SASSTK models with the SSTK ones for the en-es dataset. Both cases are evidence of model overfitting.
We also inspect the resulting hyperparameters to obtain insights about the features used by the model. Table 5 shows the optimized λ values for the GP SSTK models with fixed α for the fr-en dataset. The λ values obtained are higher for the target sentence kernels than for the source sentence ones. We can interpret this as the model giving preference to features from the target trees instead of the source trees, which is what we would expect for this task.

Overfitting
Both NLP tasks show evidence that the GP models with large number of hyperparameters (SASSTK full in the case of Emotion Analysis and the free α models in Quality Estimation) are overfitting the corresponding datasets. While the Bayesian formulation for the marginal likelihood does help limit overfitting, it does not prevent it completely. Small datasets or invalid assumptions about the Gaussian distribution of the data may still lead to poorly fitting models. Another means of reducing overfitting is by taking a fully Bayesian approach in which hyperparameters are considered as random variables and are marginalized out (Osborne, 2010); this is a research direction we plan to pursue in the future.

              λ src     λ tgt
GP SSTK +     0.1394    0.3108
GP SSTK ×     0.1405    0.2641

Table 5: Learned hyperparameters for the GP SSTK models in the fr-en dataset, with α fixed at 1. λ src and λ tgt are the hyperparameters corresponding to the kernels on the source and target parse trees, respectively. The values shown are averaged over the cross-validation results.

Extensions to Other Tasks
The GP framework introduced in Section 2 can be extended to non-regression problems by changing the likelihood function. For instance, models for classification (Rasmussen and Williams, 2006, Chap. 3), ordinal regression (Chu and Ghahramani, 2005) and structured prediction (Altun et al., 2004; Bratières et al., 2013) were proposed in the literature. Since the likelihood is independent of the kernel, a natural future step is to apply the kernels and models introduced in this paper to different NLP tasks.
In light of that, we did initial experiments in constituency parsing reranking. The first results were inconclusive but we do believe this is because we employed naive approaches using classification (1-best result vs. all) and regression (using PARSEVAL metrics as the response variable) models. A more appropriate way to tackle this task is by employing a reranking-based likelihood and this is a direction we plan to pursue in the future.

Related Work
Interest in model selection procedures for kernel-based methods has been growing in the last years. One widely used approach for that is Multiple Kernel Learning (MKL) (Gönen and Alpaydın, 2011). MKL is based on the idea of using combinations of kernels to model the data and developing algorithms to tune the kernel coefficients. This is different from our method, where we focus on learning the hyperparameters of a single structural kernel. An approach similar to ours was proposed by Igel et al. (2007). They combine oligo kernels (a kind of n-gram kernel) with MKL, derive their gradients and optimize towards a kernel alignment metric. Compared to our approach, they restrict the length of the n-grams being considered, while we rely on dynamic programming to explore the whole substructure space. Also, their method does not take into account the underlying learning algorithm. Another recent approach proposed for model selection is random search (Bergstra and Bengio, 2012). Like grid search, it has the drawback of not employing gradient information, as it is designed for any kind of hyperparameters (including categorical ones).
Structural kernels have been successfully employed in a number of NLP tasks. The original SSTK proposed by Collins and Duffy (2001) was used to rerank the output of syntactic parsers. Recently, this reranking idea was also applied to discourse parsing (Joty and Moschitti, 2014). Other tree kernel applications include Semantic Role Labelling (Moschitti et al., 2008) and Relation Extraction (Plank and Moschitti, 2013). String kernels were mostly used in Text Classification (Lodhi et al., 2002; Cancedda et al., 2003), while graph kernels have been used for recognizing Textual Entailment (Zanzotto and Dell'Arciprete, 2009). However, these previous works focused on frequentist methods like SVM or voted perceptron while we employ a Bayesian approach.
Gaussian Processes are a major framework in machine learning nowadays: applications include Robotics (Ko et al., 2007), Geolocation (Schwaighofer et al., 2004) and Computer Vision (Sinz et al., 2004). Only very recently have they been successfully employed in a few NLP tasks such as translation quality estimation (Cohn and Specia, 2013; Beck et al., 2014b), detection of temporal patterns in text (Preoţiuc-Pietro and Cohn, 2013), semantic similarity (Rios and Specia, 2014) and emotion analysis (Beck et al., 2014a). In terms of feature representations, previous work focused on vectorial inputs and applied well-known kernels for these inputs, e.g. the RBF kernel. As shown in §5.2, our approach is orthogonal to these previous ones, since kernels can be easily combined in different ways.
It is important to note that we are not the first ones to combine GPs with kernels on structured inputs. Driessens et al. (2006) employed a combination of GPs and graph kernels for reinforcement learning. However, unlike our approach, they did not attempt model selection, evaluating only a few hyperparameter values empirically.

Conclusions
This paper describes a Bayesian approach for structural kernel learning, based on Gaussian Processes for easy model selection. Experiments applying our models to synthetic data showed that it is possible to learn structural kernel hyperparameters using a fairly small amount of data. Furthermore we obtained promising results in two NLP tasks, including Quality Estimation, where we beat the state of the art. Finally, we showed how these rich parameterizations can lead to more interpretable kernels.
Beyond empirical improvements, an important goal of this paper is to present a method that enables new kernel developments through the extension of the number of hyperparameters. We focused on the Subset Tree Kernel, proposing an extension and then deriving its gradients. This approach can be applied to any structural kernel, as long as gradients are available. It is our hope that this work will serve as a starting point for future developments in these research directions.

Figure 1: An example tree and the respective set of tree fragments defined by a SSTK.

Figure 2: Results of synthetic experiments optimizing SSTK. The x axes correspond to different training set sizes and the y axes are the obtained hyperparameter values in the first three plots and RMSE in the last plot. Dashed lines show the original hyperparameter values. Points are offset in RMSE chart for legibility.

Figure 3: Results from synthetic experiments comparing SSTK and SASSTK. The x axis is training set size while the y axis corresponds to RMSE.

Figure 4: Results from performance experiments. The x axis corresponds to wall clock time in seconds and it is in log scale. The y axis shows RMSE on the test set. The blue dashed line corresponds to the RMSE value obtained after L-BFGS converged. Error bars are obtained by measuring one standard deviation over the 20 runs made in each experiment.

Table 1: Resulting fragment weighted counts for the kernel evaluation k(t, t), for different values of hyperparameters, where t is the tree in Figure 1.

Table 2: Pearson's correlation scores for the Emotion Analysis task (higher is better).

Table 3: Top 15 symbols sorted according to their obtained λ values in the SASSTK full model with fixed α. The numbers are the corresponding λ values, averaged over all six emotions.

Table 4: Error scores for the Quality Estimation task (lower is better). Results are in terms of post-editing time per word. Bold scores are the best ones for each dataset.