A Large Scale Evaluation of Distributional Semantic Models: Parameters, Interactions and Model Selection

This paper presents the results of a large-scale evaluation study of window-based Distributional Semantic Models on a wide variety of tasks. Our study combines a broad coverage of model parameters with a model selection methodology that is robust to overfitting and able to capture parameter interactions. We show that our strategy allows us to identify parameter configurations that achieve good performance across different datasets and tasks.


Introduction
Distributional Semantic Models (DSMs) are employed to produce semantic representations of words from co-occurrence patterns in texts or documents (Sahlgren, 2006; Turney and Pantel, 2010). Building on the Distributional Hypothesis (Harris, 1954), DSMs quantify the amount of meaning shared by words as the degree of overlap of the sets of contexts in which they occur.
A widely used approach operationalizes the set of contexts as co-occurrences with other words within a certain window (e.g., 5 words). A window-based DSM can be represented as a co-occurrence matrix in which rows correspond to target words, columns correspond to context words, and cells store the co-occurrence frequencies of target words and context words. The co-occurrence information is usually weighted by some scoring function and the rows of the matrix are normalized. Since the co-occurrence matrix tends to be very large and sparsely populated, dimensionality reduction techniques are often used to obtain a more compact representation. Landauer and Dumais (1997) claim that dimensionality reduction also improves the semantic representation encoded in the co-occurrence matrix. Finally, distances between the row vectors of the matrix are computed and, according to the Distributional Hypothesis, interpreted as a correlate of the semantic similarities between the corresponding target words.
The construction and use of a DSM involves many design choices, such as: selection of a source corpus; size of the co-occurrence window; choice of a suitable scoring function, possibly combined with an additional transformation; whether to apply dimensionality reduction, and the number of reduced dimensions; metric for measuring distances between vectors. Different design choices (technically, the DSM parameters) can result in quite different similarities for the same words (Sahlgren, 2006).
1 The analysis presented in this paper is complemented by supplementary materials, which are available for download at http://www.linguistik.fau.de/dsmeval/. This page will also be kept up to date with the results of follow-up experiments.
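As an illustration of the window-based counting step described above, the following minimal sketch collects co-occurrence frequencies in an undirected window that does not cross sentence boundaries. The function name `cooccurrence_counts` and the toy corpus are hypothetical; this is not the UCS or wordspace implementation used in the study.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count (target, context) co-occurrences in an undirected window.

    `sentences` is a list of token lists; the window is clipped at the
    sentence boundaries. Hypothetical helper for illustration only.
    """
    counts = Counter()
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a token is not its own context
                    counts[(target, tokens[j])] += 1
    return counts


# Toy corpus (invented); window of 1 word to the left and right.
toy_corpus = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "dog", "chased", "the", "cat"],
]
counts = cooccurrence_counts(toy_corpus, window=1)
```

Each key of `counts` corresponds to one cell of the co-occurrence matrix (row = target, column = context word); cells that never occur are simply absent, which mirrors the sparsity of the matrix.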
Despite such progress, a full understanding of the different parameters governing a DSM and their influence on model performance has not been achieved yet. The present paper is a contribution towards this goal: it presents the results of a large-scale evaluation of window-based DSMs on a wide variety of semantic tasks. More complex tasks building on distributional representations (e.g., vector composition or relational analogies) will also benefit from our findings, allowing them to choose optimal parameters for the underlying word-level DSMs.
At the level of parameter coverage, this work evaluates most of the relevant parameters considered in comparable state-of-the-art studies (Bullinaria and Levy, 2007; Bullinaria and Levy, 2012); it also introduces an additional one, which has received little attention in the literature: the index of distributional relatedness, which connects distances in the DSM space to semantic similarity. We compare direct use of distance measures to neighbor rank. Neighbor rank has already been successfully used to model priming effects with DSMs (Hare et al., 2009; Lapesa and Evert, 2013); the present study extends its evaluation to standard tasks. We show that neighbor rank consistently improves the performance of DSMs compared to distance, but the degree of this improvement varies from task to task. At the level of task coverage, the present study includes most of the standard datasets used in comparative studies (Bullinaria and Levy, 2007; Baroni and Lenci, 2010; Bullinaria and Levy, 2012). We consider three types of evaluation tasks: multiple choice (TOEFL test), correlation to human similarity ratings, and semantic clustering.
At the level of methodology, our work adopts the approach to model selection proposed by Lapesa and Evert (2013), which is described in detail in section 4. Our results show that parameter interactions play a crucial role in determining model performance.
This paper is structured as follows. Section 2 briefly reviews state-of-the-art studies on DSM evaluation. Section 3 describes the experimental setting in terms of tasks and evaluated parameters. Section 4 outlines our methodology for model selection. In section 5 we report the results of our evaluation study. Finally, section 6 summarizes the main findings and sketches ongoing and future work.

Previous work
In this section we summarize the results of previous evaluation studies of Distributional Semantic Models. Among the existing work on DSM evaluation, we can identify two main types of approaches.
One possibility is to evaluate a distributional model with certain new features on a range of tasks, applying little or no parameter tuning, and to compare it to competing models; examples are Padó and Lapata's (2007) Dependency Vectors as well as Baroni and Lenci's (2010) Distributional Memory. Since both studies focus on testing a single new model with fixed parameters (or a small number of new models), we will not go into further detail concerning them.
Alternatively, the evaluation may be conducted via incremental tuning of parameters, which are tested sequentially to identify their best performing values on a number of tasks, as has been done by Bullinaria and Levy (2007; 2012), Polajnar and Clark (2014), and Kiela and Clark (2014).
Bullinaria and Levy (2007) report on a systematic study of the impact of a number of parameters (shape and size of the co-occurrence window, distance metric, association score for co-occurrence counts) on a number of tasks (including the TOEFL synonym task, which is also evaluated in our study). Evaluated models were based on the British National Corpus. Bullinaria and Levy (2007) found that vectors scored with Pointwise Mutual Information, built from very small context windows with as many context dimensions as possible, and using cosine distance ensured the best performance across all tasks at issue. Bullinaria and Levy (2012) extend the evaluation reported in Bullinaria and Levy (2007). Starting from the optimal configuration identified in the first study, they test the impact of three further parameters: application of stop-word lists, stemming, and dimensionality reduction using Singular Value Decomposition. DSMs were built from the ukWaC corpus, and evaluated on a number of tasks (including TOEFL and noun clustering on the dataset of , also evaluated in our study). Neither stemming nor the application of stop-word lists resulted in a significant improvement of DSM performance. Positive results were achieved by performing SVD dimensionality reduction and discarding the initial components of the reduced matrix.
Polajnar and Clark (2014) evaluate the impact of context selection (for each target, only the most relevant context words are selected, and the remaining vector entries are set to zero) and vector normalization (used to vary model sparsity and the range of values of the DSM vectors) in standard tasks related to word and phrase similarity. Context selection and normalization improved DSM performance on word similarity and compositional tasks, both with and without SVD.
Kiela and Clark (2014) evaluate window-based and dependency-based DSMs on a variety of tasks related to word and phrase similarity. A wide range of parameters are involved in this study: source corpus, window size, number of context dimensions, use of stemming, lemmatization and stopwords, similarity metric, score for feature weighting. Best results were obtained with large corpora and small window sizes, around 50000 context dimensions, stemming, Positive Mutual Information, and a mean-adjusted version of cosine distance.
Even though we adopt a different approach than these incremental tuning studies, there is considerable overlap in the evaluated parameters and tasks, which will be pointed out in section 3.
An alternative to incremental tuning is the methodology proposed by Lapesa and Evert (2013) and Lapesa et al. (2014). They systematically test a large number of parameter combinations and use linear regression to determine the importance of individual parameters and their interactions. As their evaluation methodology is adopted in the present work and described in more detail in section 4, we will not discuss it here and instead focus on the main results. DSMs are evaluated in the task of modeling semantic priming. This task, albeit not standard in DSM evaluation, is of great interest as priming experiments provide a window into the structure of the mental lexicon. Both studies showed that neighbor rank outperforms distance in capturing priming effects. They also found that the scoring function has a crucial influence on model performance and interacts strongly with an additional logarithmic transformation. Lapesa et al. (2014) focused on a comparison of syntagmatic and paradigmatic relations. They found that discarding the initial SVD dimensions is only beneficial for certain relations, suggesting that these dimensions may encode syntagmatic information if larger context windows are used. Concerning the scope of the evaluation, both studies consider a wide range of parameters 2 but target only a very specific task. Our study aims at extending their parameter set and evaluation methodology to standard tasks.

Experimental setting

Tasks
The evaluation of DSMs has been conducted on three standard types of semantic tasks.
The first task is a multiple choice setting: distributional relatedness between a target word and two or more other words is used to select the best, i.e. most similar, candidate. Performance in this task is quantified by the decision accuracy. The evaluated dataset is the well-known TOEFL multiple-choice synonym test (Landauer and Dumais, 1997), which was also included in the studies of Bullinaria and Levy (2007; 2012) and Kiela and Clark (2014).
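The multiple choice decision can be sketched as follows: for each item, the candidate closest to the target is selected, and accuracy is the proportion of items answered correctly. The vectors, items and function names below are invented for illustration, and `1 - cos` is used as a stand-in for the angle between vectors (both produce the same ordering of candidates).

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; orders candidates like the angle would."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def answer_item(vectors, target, candidates):
    """Pick the candidate with the smallest distance to the target."""
    return min(candidates,
               key=lambda c: cosine_distance(vectors[target], vectors[c]))


def accuracy(vectors, items):
    """Share of items for which the selected candidate is the gold answer."""
    correct = sum(answer_item(vectors, target, candidates) == gold
                  for target, candidates, gold in items)
    return correct / len(items)


# Invented toy vectors and items (target, candidates, gold answer).
toy_vectors = {
    "big": [1.0, 0.1, 0.0],
    "large": [0.9, 0.2, 0.0],
    "small": [0.1, 1.0, 0.0],
    "tiny": [0.0, 0.9, 0.2],
    "red": [0.0, 0.1, 1.0],
}
toy_items = [
    ("big", ["large", "small", "red"], "large"),
    ("small", ["big", "tiny", "red"], "tiny"),
]
toy_accuracy = accuracy(toy_vectors, toy_items)
```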
In the second task, we measure the correlation between distributional relatedness and native speaker judgments of semantic similarity or relatedness. Following previous studies (Baroni and Lenci, 2010; Padó and Lapata, 2007), performance in this task is quantified in terms of Pearson correlation. 3 Evaluated datasets are the Rubenstein and Goodenough dataset (RG65) of 65 noun pairs (Rubenstein and Goodenough, 1965), also evaluated by Kiela and Clark (2014), and the WordSim-353 dataset (WS353) of 353 noun pairs (Finkelstein et al., 2002), included in the study of Polajnar and Clark (2014).
The third evaluation task is noun clustering: distributional similarity between words is used to assign them to a pre-defined number of semantic classes. Performance in this task is quantified in terms of cluster purity. Clustering is performed with an algorithm based on partitioning around medoids (Kaufman and Rousseeuw, 1990, Ch. 2), using the R function pam with standard settings. 4 Evaluated datasets for the clustering task are the Almuhareb-Poesio set (henceforth, AP) containing 402 nouns grouped into 21 classes (Almuhareb, 2006); the Battig set, containing 83 concrete nouns grouped into 10 classes (Van Overschelde et al., 2004); the ESSLLI 2008 set, containing 44 concrete nouns grouped into 6 classes; 5 and the Mitchell set, containing 60 nouns grouped into 12 classes, also employed by Bullinaria and Levy (2012).
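Cluster purity, the evaluation measure named above, can be computed as in the following sketch: each cluster is credited with the size of its largest gold class, and the total is divided by the number of clustered items. The function name and the toy labels are hypothetical; the clustering itself (pam in the study) is assumed to have been run already.

```python
from collections import Counter


def cluster_purity(cluster_ids, gold_classes):
    """Fraction of items that belong to the majority gold class of their cluster."""
    by_cluster = {}
    for cluster, gold in zip(cluster_ids, gold_classes):
        by_cluster.setdefault(cluster, Counter())[gold] += 1
    # For each cluster, count only the members of its most frequent gold class.
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(gold_classes)


# Invented toy assignment of 10 nouns to 3 clusters.
toy_clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
toy_gold = ["bird", "bird", "tool", "tool", "tool",
            "tool", "fruit", "fruit", "fruit", "bird"]
toy_purity = cluster_purity(toy_clusters, toy_gold)
```

In the toy data the majority classes cover 2, 3 and 2 items of the three clusters, so 7 of the 10 items count as correctly grouped.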

Parameters
DSMs evaluated in this paper belong to the class of window-based models. All models use the same large vocabulary of target words (27522 lemma types), which is based on the vocabulary of Distributional Memory (Baroni and Lenci, 2010) and has been extended to cover all items in our datasets. Distributional models were built using the UCS toolkit 6 and the wordspace package for R (Evert, 2014). The following parameters have been evaluated: 7
• Source Corpus (abbreviated in the plots as corpus): the corpora from which we compiled our DSMs differ in both size and quality, and they represent standard choices in DSM evaluation. Evaluated corpora in this study are: British National Corpus 8 ; ukWaC; WaCkypedia_EN 9 ;
• Context window:
  - Direction* (win.direction): we collected co-occurrence counts both using a directed window (i.e., separate co-occurrence counts for context words to the left and to the right of the target) and an undirected window (no distinction between left and right context);
  - Size (win.size)* †: we expect this parameter to be crucial as it determines the amount of shared context involved in the computation of similarity. We tested windows of 1, 2, 4, 8, and 16 words to the left and right of the target, limited by sentence boundaries;
• Context selection: Context words are filtered by part-of-speech (nouns, verbs, adjectives, and adverbs). From the full co-occurrence matrix, we further select dimensions (i.e., columns, corresponding to context words) according to the following two parameters:
  - Criterion for context selection (criterion): marginal frequency; number of nonzero co-occurrence counts;
  - Threshold for context selection (context.dim)* †: from the context dimensions ranked according to this criterion, we select the top 5000, 10000, 20000, 50000 or 100000 dimensions;
• Score for feature weighting (score)* †: we compare plain co-occurrence frequency to tf.idf and to the following association measures: Dice coefficient; simple log-likelihood; Mutual Information (MI); t-score; z-score; 10
• Feature transformation (transformation): to reduce the skewness of feature scores, it is possible to apply a transformation function. We evaluate square root, sigmoid (tanh) and logarithmic transformation vs. no transformation.
• Distance metric (metric)* †: cosine distance (i.e., angle between vectors); Manhattan distance 11 ;
• Dimensionality reduction: we optionally apply Singular Value Decomposition to 1000 dimensions, using randomized SVD (Halko et al., 2009) for performance reasons. For the SVD-based models, there are two additional parameters:
  - Number of latent dimensions (red.dim): out of the 1000 SVD dimensions, we select the first 100, 300, 500, 700, 900 dimensions (i.e. those with the largest singular values);
  - Number of skipped dimensions (dim.skip): when selecting the reduced dimensions, we exclude the first 0, 50 or 100 dimensions. This parameter has already been evaluated by Bullinaria and Levy (2012), who achieved best performance by discarding the initial components of the reduced matrix, i.e., those with the highest variance.
• Index of distributional relatedness (rel.index). Given two words a and b represented in a DSM, we consider two alternative ways of quantifying the degree of relatedness between a and b. The first option (and standard in DSM modeling) is to compute the distance (cosine or Manhattan) between the vectors of a and b. The alternative choice, proposed in this work, is based on neighbor rank. Neighbor rank has already been successfully used for capturing priming effects (Hare et al., 2009; Lapesa and Evert, 2013; Lapesa et al., 2014) and for quantifying the semantic relatedness between derivationally related words (Zeller et al., 2014); however, its performance on standard tasks has not been tested yet. For the TOEFL task, we compute rank as the position of the target among the nearest neighbors of each synonym candidate. 12 For the correlation and clustering tasks, we compute a symmetric rank measure as the average of log rank(a, b) and log rank(b, a). An exploration of the effects of directionality on the prediction of similarity ratings and its use in clustering tasks (i.e., experiments involving rank(a, b) and rank(b, a) as indexes of relatedness) is left for future work.
4 Other clustering studies have often been carried out using the CLUTO toolkit (Karypis, 2003) with standard settings, which corresponds to spectral clustering of the distributional vectors. Unlike pam, which operates on a pre-computed dissimilarity matrix, CLUTO cannot be used to test different distance measures or neighbor rank. Comparative clustering experiments showed no substantial differences for cosine similarity; in the rank-based setting, pam consistently outperformed CLUTO clustering.
5 http://wordspace.collocations.de/doku.php/data:esslli2008:concrete_nouns_categorization
6 http://www.collocations.de/software.html
7 Parameters also evaluated by Bullinaria and Levy (2007; 2012), albeit with a different range of values, are marked with an asterisk (*); those evaluated by Kiela and Clark (2014) and/or Polajnar and Clark (2014) are marked with a dagger (†).
8 http://www.natcorp.ox.ac.uk/
9 Both ukWaC and WaCkypedia_EN are available from http://wacky.sslmit.unibo.it/doku.php?id=corpora.
10 See Evert (2008) for a thorough description of the association measures and details on their calculation (Fig. 58.4 on p. 1225 and Fig. 58.9 on p. 1235). We selected these measures because they have widely been used in previous work on DSMs (tf.idf, MI and log-likelihood) or are popular choices for the identification of multiword expressions. Based on statistical hypothesis tests, log-likelihood, t-score and z-score measure the significance of association between a target and feature term; MI shows how much more frequently they co-occur than expected by chance; and Dice captures the mutual predictability of target and feature term. Note that we compute sparse versions of the association measures with negative values clamped to zero in order to preserve the sparseness of the co-occurrence matrix. For example, our MI measure corresponds to Positive MI in the other evaluation studies.
11 In this study, the range of evaluated metrics is restricted to cosine vs. Manhattan for a number of reasons: (i) cosine is considered a standard choice in DSM modeling and is adopted by most evaluation studies (Bullinaria and Levy, 2007; Bullinaria and Levy, 2012; Polajnar and Clark, 2014); (ii) for our normalized vectors, Euclidean distance is fully equivalent to cosine; (iii) preliminary experiments with the maximum distance measure resulted in very low performance.
12 Note that using the positions of the synonym candidates among the neighbors of the target would have been equivalent to direct use of the distance measure, since the transformation from distance to rank is monotonic in this case.
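The symmetric rank measure defined for the index of distributional relatedness can be illustrated with a small sketch. The toy vectors, the function names, the use of the natural logarithm and the tie handling (strictly closer neighbors only) are assumptions made for illustration; the paper does not specify these details.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity (same neighbor ordering as the angle)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def neighbor_rank(vectors, a, b):
    """Position of b among the nearest neighbors of a (1 = nearest)."""
    d_ab = cosine_distance(vectors[a], vectors[b])
    closer = sum(1 for word, vec in vectors.items()
                 if word not in (a, b)
                 and cosine_distance(vectors[a], vec) < d_ab)
    return closer + 1


def symmetric_log_rank(vectors, a, b):
    """Average of log rank(a, b) and log rank(b, a); natural log assumed."""
    return 0.5 * (math.log(neighbor_rank(vectors, a, b))
                  + math.log(neighbor_rank(vectors, b, a)))


def unit_vector(degrees):
    """Toy 2-d vector at the given angle."""
    rad = math.radians(degrees)
    return [math.cos(rad), math.sin(rad)]


# Invented toy space: "dog" sits in a dense region, "poodle" in a sparse one.
toy_space = {
    "animal": unit_vector(0),
    "dog": unit_vector(10),
    "cat": unit_vector(-8),
    "poodle": unit_vector(30),
    "car": unit_vector(90),
}
forward = neighbor_rank(toy_space, "poodle", "dog")   # dog among poodle's neighbors
backward = neighbor_rank(toy_space, "dog", "poodle")  # poodle among dog's neighbors
relatedness = symmetric_log_rank(toy_space, "dog", "poodle")
```

The example shows why rank is not symmetric although distance is: "dog" is the nearest neighbor of "poodle", while "poodle" is only the third neighbor of "dog" because "animal" and "cat" are closer.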

Model selection
As has already been pointed out in the introductory section, one of the main open issues in DSM evaluation is the need for a systematic investigation of the interactions between DSM parameters. Another issue that large-scale evaluation studies face is overfitting: if a large number of models (i.e. parameter combinations) is evaluated, it makes little sense to look at the best model (i.e. the best parameter combination), which will be subject to heavy overfitting, especially on small datasets such as TOEFL. The methodology for model selection applied in this work successfully addresses both issues.
In our evaluation study, we tested all possible combinations of the parameters described in section 3.2. This resulted in a total of 537600 model runs (33600 in the unreduced setting, 504000 in the dimensionality-reduced setting). The models were generated and evaluated on a large HPC cluster within approximately 5 weeks.
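The run counts follow directly from the parameter values listed in section 3.2, as the following sketch of the full factorial grid shows. The parameter names are the abbreviations used in the paper; the labels chosen for the individual values (e.g. "nonzero", "rank") are hypothetical.

```python
from itertools import product
from math import prod

# Parameters shared by unreduced and reduced runs (value labels are illustrative).
shared = {
    "corpus": ["bnc", "ukwac", "wackypedia"],
    "win.direction": ["directed", "undirected"],
    "win.size": [1, 2, 4, 8, 16],
    "criterion": ["frequency", "nonzero"],
    "context.dim": [5000, 10000, 20000, 50000, 100000],
    "score": ["frequency", "tf.idf", "Dice", "simple-ll", "MI", "t-score", "z-score"],
    "transformation": ["none", "sqrt", "sigmoid", "log"],
    "metric": ["cosine", "manhattan"],
    "rel.index": ["distance", "rank"],
}
# Additional parameters that only exist for SVD-reduced models.
svd_only = {
    "red.dim": [100, 300, 500, 700, 900],
    "dim.skip": [0, 50, 100],
}


def grid_size(grid):
    """Number of configurations in a full factorial design."""
    return prod(len(values) for values in grid.values())


def iter_configs(grid):
    """Lazily enumerate every parameter combination as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


n_unreduced = grid_size(shared)
n_reduced = grid_size({**shared, **svd_only})
n_total = n_unreduced + n_reduced
first_config = next(iter_configs(shared))
```

With these value lists, the sketch reproduces the figures reported above: 33600 unreduced runs, 504000 reduced runs and 537600 runs in total.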
Following Lapesa and Evert (2013), DSM parameters are considered predictors of model performance: we analyze the influence of individual parameters and their interactions using general linear models with performance (accuracy, correlation, purity) as a dependent variable and the model parameters as independent variables, including all two-way interactions. More complex interactions are beyond the scope of this paper and are left for future work. Analysis of variance, which is straightforward for our full factorial design, is used to quantify the importance of each parameter or interaction. Robust optimal parameter settings are identified with the help of effect displays (Fox, 2003), which show the partial effect of one or two parameters by marginalizing over all other parameters. Unlike coefficient estimates, they allow an intuitive interpretation of the effect sizes of categorical variables irrespective of the dummy coding scheme used.
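The idea of marginalizing over all other parameters can be illustrated with a simplified sketch. It averages the observed performance over all runs that share the same values of the displayed parameters, which is a plain marginal mean in a balanced full factorial design; it is a simplification and not the regression-based effect display procedure of Fox (2003). The toy performance function and all numbers are invented.

```python
from collections import defaultdict
from itertools import product
from statistics import mean


def partial_effect(runs, factors):
    """Mean performance per level combination of `factors`, averaged over the rest."""
    groups = defaultdict(list)
    for config, performance in runs:
        groups[tuple(config[f] for f in factors)].append(performance)
    return {levels: mean(values) for levels, values in groups.items()}


# Invented toy design with three parameters.
toy_levels = {
    "score": ["frequency", "MI"],
    "transformation": ["none", "log"],
    "win.size": [2, 8],
}


def toy_performance(config):
    """Made-up performance with a score/transformation interaction."""
    value = 50.0
    if config["score"] == "MI":
        value += 10.0
    if config["transformation"] == "log":
        value += 5.0
    if config["score"] == "frequency" and config["transformation"] == "log":
        value += 8.0  # interaction: log helps raw frequency more than MI
    if config["win.size"] == 8:
        value -= 2.0
    return value


toy_runs = []
for values in product(*toy_levels.values()):
    config = dict(zip(toy_levels, values))
    toy_runs.append((config, toy_performance(config)))

effect = partial_effect(toy_runs, ["score", "transformation"])
```

Reading `effect` row by row shows the interaction: the gain from the log transformation is 13 points for raw frequency but only 5 points for MI, a pattern that a look at main effects alone would hide.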

Results
This section reports the results of the modeling experiments outlined in section 3. Table 1 summarizes the evaluation results: for each dataset, we report minimum, maximum and mean performance, comparing unreduced and reduced runs. The column Difference of Means shows the average difference in performance between an unreduced model and its reduced counterpart (with dimensionality reduction parameters set to the values of the general best setting identified in section 5.5) and the p-value 13 of a Wilcoxon signed rank test with continuity correction. It is evident that dimensionality reduction improves model performance for all datasets. 14 While the improvements are only minimal in some cases, dimensionality reduction never has a detrimental effect while offering practical advantages in memory usage and computation speed. Therefore, in our analysis, we focus on the runs involving dimensionality reduction. In the following subsections, we present detailed results for each of the three tasks. In each case, we first discuss the impact of DSM parameters on performance, and then describe the optimal parameter values.

TOEFL
In the TOEFL task, the linear model achieves an adjusted R² of 89%, showing that it explains the influence of model parameters on TOEFL accuracy very well. Figure 1 displays the ranking of the evaluated parameters according to their importance in a feature ablation setting. The R² values in the plots refer to the proportion of variance explained by the respective parameter together with all its interactions, corresponding to the reduction in adjusted R² if this parameter is left out. We do not rely on significance values for model selection because, given the large number of measurements, virtually all parameters have a highly significant effect. Table 2 reports all parameter interactions for the TOEFL task that explain more than 0.5% of the total variance (i.e. R² ≥ 0.5%), as well as the corresponding degrees of freedom (df) and R². On the basis of their influence in determining model performance, we can identify three parameters that are crucial for the TOEFL task, and which will also turn out to be very influential in the other tasks at issue: distance metric, feature score and feature transformation.
13 * = p < 0.05; *** = p < 0.001; n.s. = not significant.
14 Difference of means and Wilcoxon p-value on Spearman's rho for ratings datasets: RG65, −0.061***; WS353, −0.091***.
The best distance metric is cosine distance: this is one of the consistent findings of our evaluation study and it is in accordance with Bullinaria and Levy (2007) and, to a lesser extent, Kiela and Clark (2014). 15 Score and transformation always have a fundamental impact on model performance: these parameters affect the distributional space independently of tasks and datasets. We will show that they are systematically involved in a strong interaction and that it is possible to identify a score/transformation combination with robust performance across all tasks. The interaction between score and transformation is displayed in figure 2. The best results are achieved by association measures based on significance tests (simple-ll, t-score, z-score), followed by MI. This result is in line with previous studies (Bullinaria and Levy, 2012; Kiela and Clark, 2014), which found Pointwise MI or Positive MI to be the best feature scores. The best choice, simple log-likelihood, exhibits a strong variation in performance across different transformations. For all three significance measures, the best feature transformation is consistently a logarithmic transformation. Raw co-occurrence frequency, tf.idf and Dice only perform well in combination with a square root transformation. The best window size, as shown in figure 3, is a 2-word window for all evaluated transformations.
(a mean-adjusted version of cosine similarity). The latter, however, turned out to be more robust across different corpora and weighting schemes.
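To make the score and transformation parameters discussed for the TOEFL task more concrete, the following sketch computes a sparse (positive) MI score, with negative values clamped to zero as described in section 3.2, and then applies a logarithmic transformation. The toy counts are invented, and the form log(x + 1) of the transformation is an assumption; the exact definition used in the study is not given in the text.

```python
import math
from collections import Counter


def positive_mi(counts):
    """Sparse MI: log2(observed / expected), with negative values set to zero.

    `counts` maps (target, context) pairs to non-zero co-occurrence counts.
    """
    total = sum(counts.values())
    target_totals = Counter()
    context_totals = Counter()
    for (target, context), freq in counts.items():
        target_totals[target] += freq
        context_totals[context] += freq
    scores = {}
    for (target, context), observed in counts.items():
        expected = target_totals[target] * context_totals[context] / total
        scores[(target, context)] = max(0.0, math.log2(observed / expected))
    return scores


def log_transform(scores):
    """Assumed form of the logarithmic transformation: log(x + 1)."""
    return {pair: math.log(score + 1.0) for pair, score in scores.items()}


# Invented toy counts (20 co-occurrence tokens in total).
toy_counts = {
    ("dog", "bark"): 5, ("dog", "the"): 4, ("dog", "purr"): 1,
    ("cat", "purr"): 5, ("cat", "the"): 4, ("cat", "bark"): 1,
}
toy_scores = positive_mi(toy_counts)
toy_transformed = log_transform(toy_scores)
```

In the toy data, "dog" and "bark" co-occur more often than expected and receive a positive score, "dog" and "the" co-occur exactly as often as expected (score 0), and the negative association of "dog" and "purr" is clamped to 0, so the matrix stays sparse.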
The SVD parameters (number of latent dimensions and number of skipped dimensions) play a significant role in determining model performance. They are particularly important for the TOEFL task, but we will see that their explanatory power is also quite strong in the other tasks. Interestingly, they show a tendency to participate in interactions with other parameters, but do not interact among themselves. We display the interaction between metric and number of latent dimensions in figure 4: the steep performance increase for both metrics shows that the widely-used choice of 300 latent dimensions (Landauer and Dumais, 1997) is suboptimal for the TOEFL task. The best value in our experiment is 900 latent dimensions, and additional dimensions would probably lead to a further improvement. The interaction between metric and number of skipped dimensions is displayed in figure 5. While manhattan performs poorly no matter how many dimensions are skipped, cosine is positively affected by skipping 100 and (to a lesser extent) 50 dimensions. The latter trend has already been discussed by Bullinaria and Levy (2012).
Inspection of the remaining interaction plots, not shown here for reasons of space, reveals that the best DSM performance in the TOEFL task is achieved by selecting ukWaC as corpus and 10000 original dimensions. The index of distributional relatedness has a very low explanatory power in the TOEFL task, with neighbor rank being the best choice (see figures 16 and 17 in section 5.4).
Given the minimal explanatory power of the direction of the context window and the criterion for context selection in all three tasks, we will not further consider these parameters in our analysis. We recommend setting them to an "unmarked" option: undirected and frequency.
The best setting identified by inspecting all effects is shown in table 5, together with its performance and with the performance of the (over-trained) best model in this task. Parameters of the latter are reported in appendix A.

Ratings
Table 3 reports all interactions that explain more than 0.5% of the total variance in both datasets. For reasons of space, we only discuss the interactions and best parameter values on RG65; the corresponding plots for WS353 are shown only if there are substantial differences.
As already noted for the TOEFL task, score and transformation have a large explanatory power and they are involved in a strong interaction. The analysis of the main effects shows that for both datasets WaCkypedia is the best option as a source corpus, suggesting that this task benefits from a trade-off between quality and quantity (WaCkypedia being smaller and cleaner than ukWaC, but less balanced than the BNC).
Index of distributional relatedness plays a much more important role than for the TOEFL task, with neighbor rank clearly outperforming distance (see figures 16 and 17 and the discussion in section 5.4 for more details).
The choice of the optimal window size depends on transformation: on the RG65 dataset, figure 7 shows that for a logarithmic transformation, which we already identified as the best transformation in combination with significance association measures, the highest performance is achieved with a 4 word window. The corresponding effect display for WS353 (figure 8) suggests that a further small improvement may be obtained with an 8 word window in this case. One possible explanation for this observation is the different composition of the WS353 dataset, which includes examples of semantic relatedness beyond attributional similarity. The 4 word window is a robust choice across both datasets, though.
The number of latent dimensions is involved in a strong interaction with the distance metric (figure 9). Best results are achieved with the cosine metric and at least 300 latent dimensions, as well as 50 skipped dimensions. The interaction plot between metric and number of original dimensions in figure 10 shows that 50000 context dimensions are sufficient for good performance, and no further improvement can be expected from even higher-dimensional spaces. Best settings for both datasets are summarized in table 5. Refer to appendix A for best models.

Clustering
Figure 11 displays the importance of the evaluated parameters in the clustering task (adj. R² of the full linear model: AP: 82%; BATTIG: 77%; ESSLLI: 58%; MITCHELL: 73%). Parameter ranking is determined by the average of the feature ablation R² values over all four datasets. Table 4 reports all parameter interactions that explain more than 0.5% of the total variance for each of the four datasets.
In the following discussion, we focus on the AP dataset, which is larger and thus more reliable than the other three datasets. We mention remarkable differences between the datasets in terms of best parameter values. For a full overview of the best parameter setting for each dataset, see table 5.
As already discussed for TOEFL and the ratings task, we find score and transformation at the top of the feature ablation ranking. Table 4 confirms that the two parameters are involved in a strong interaction. The interaction plot (figure 12) shows the behavior we are already familiar with: significance measures (simple-ll, t-score and z-score) reach the best performance in combination with log transformation. This combination is also a robust choice for the other datasets, with minor differences that can be observed in table 5. The interaction between window size and metric is displayed in figure 13: best performance is achieved with a 2 or 4 word window in combination with cosine distance. Results on the other datasets suggest a preference for the 4 word window. This is confirmed by interaction plots with source corpus (figure 14), which also reveal that WaCkypedia is again the best compromise between size and quality.
A very clear picture concerning the number of skipped dimensions emerges from figure 15 and is the same for all datasets: skipping dimensions is not necessary to achieve good performance (even though skipping 50 dimensions turned out at least to be not detrimental for BATTIG and MITCHELL).
Further effect displays, not shown here for reasons of space, suggest that 300 or 500 latent dimensions (with some variation across the datasets, cf. table 5) and a medium-sized co-occurrence matrix (20000 or 50000 dimensions) are needed to achieve good performance. Neighbor rank is the best choice as index of distributional relatedness (see section 5.4). See appendix A for best models.

Relatedness index
A novel contribution of our work is the systematic evaluation of a parameter that has received little attention in DSM research so far, and only in studies limited to a narrow choice of datasets (Lapesa and Evert, 2013; Lapesa et al., 2014; Zeller et al., 2014): the index of distributional relatedness.
The aim of this section is to provide a full overview of the impact of this parameter in our experiments. Despite the main focus of the paper on the reduced setting, in this section we also show results from the unreduced setting, for two reasons: first, since this parameter is relatively novel and evaluated here for the first time on standard tasks, we consider it necessary to provide a full picture concerning its behavior; second, relatedness index turned out to be much more influential in the unreduced setting than in the reduced one. Figures 16 and 17 display the partial effect of relatedness index for each dataset, in the unreduced and reduced setting respectively. To allow for a comparison between the different measures of performance, correlation and purity values have been converted to percentages. The picture emerging from the two plots is very clear: neighbor rank is the best choice for both settings across all seven datasets. The degree of improvement over vector distance, however, shows considerable variation between different datasets. The rating task benefits the most from the use of neighbor rank.
Figure 17: Reduced Setting
On the other hand, neighbor rank has very little effect for the TOEFL task in a reduced setting, where its high computational complexity is clearly not justified; the improvement on the AP clustering dataset is also fairly small. While the TOEFL result seems to contradict the substantial improvement of neighbor rank found by Lapesa and Evert (2013) for a multiple-choice task based on stimuli from priming experiments, there were only two choices (consistent and inconsistent prime) in this case rather than four. We do not rule out that a more refined use of the rank information (for example, different strategies for rank combinations) may produce better results on the TOEFL and AP datasets.
As discussed in section 3.2, we have not yet explored the potential of neighbor rank in modeling directionality effects in semantic similarity. Unlike Lapesa and Evert (2013), who adopt four different indexes of distributional relatedness (vector distance; forward rank, i.e., rank of the target in the neighbors of the prime; backward rank, i.e., rank of the prime in the neighbors of the target; average of backward and forward rank), we used only a single rank-based index (cf. section 3.2), mostly for reasons of computational complexity. We consider the results of this study more than encouraging, and expect further improvements from a full exploration of directionality effects in the tasks at issue.
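The four indexes attributed to Lapesa and Evert (2013) above can be made concrete with a small sketch. The word list, the distance values, and the helper names (`symmetric_distances`, `neighbor_rank`, `relatedness_indexes`) are invented for illustration; ties are resolved by counting only strictly closer neighbors, which is an assumption of this sketch.

```python
def symmetric_distances(pairs):
    """Build a nested dict dist[a][b] = dist[b][a] from unordered word pairs."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist

def neighbor_rank(dist, source, candidate):
    """Rank of `candidate` among the neighbors of `source` (1 = nearest)."""
    d = dist[source][candidate]
    # Count the other neighbors of `source` that are strictly closer.
    closer = sum(1 for other, d_other in dist[source].items()
                 if other != candidate and d_other < d)
    return closer + 1

def relatedness_indexes(dist, prime, target):
    """Vector distance plus forward, backward and average neighbor rank."""
    forward = neighbor_rank(dist, prime, target)    # target among neighbors of prime
    backward = neighbor_rank(dist, target, prime)   # prime among neighbors of target
    return {
        "distance": dist[prime][target],
        "forward": forward,
        "backward": backward,
        "average": (forward + backward) / 2,
    }

# Hypothetical distances between four words (toy values).
toy = symmetric_distances({
    ("dog", "cat"): 0.3, ("dog", "puppy"): 0.1, ("dog", "car"): 0.9,
    ("cat", "puppy"): 0.4, ("cat", "car"): 0.8, ("puppy", "car"): 0.95,
})
indexes = relatedness_indexes(toy, prime="dog", target="cat")
```

In this toy example the distance between "dog" and "cat" is symmetric, but the ranks are not: "cat" is only the second neighbor of "dog", whereas "dog" is the nearest neighbor of "cat". This asymmetry is what makes rank-based indexes candidates for modeling directionality effects.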

Best settings
We conclude the result overview by evaluating the best parameter combinations identified for each task and data set, showing how well our approach to model selection works in practice. Table 5 summarizes the optimal parameter settings identified for each task and compares the performance of this model (B.set = best setting) with the over-trained best run in the experiment (B.run = best run). 16 In most cases, the result of our robust parameter optimization is close to the best run. The only exception is the ESSLLI dataset, which is smaller than the other datasets and particularly susceptible to over-training (cf. the low R² of the regression analysis in section 5.3). Table 5 also reports the current state of the art for each task (SoA = state-of-the-art), taken from the ACL wiki 17 where available (TOEFL and similarity ratings), from Baroni and Lenci (2010) for the clustering tasks, and from more recent studies of which we are aware. Our results are comparable to the state of the art, even though the latter includes a much broader range of approaches than our window-based DSMs. In one case (BATTIG), our optimized model even improves on the best previous result.
A side-by-side inspection of the main effects and interaction plots for different data sets allowed us to identify parameter settings that are robust across datasets and even across tasks.
Table 5: Best Settings
[...] particular dataset) and a more general setting that achieves good performance in all three tasks. Evaluation results for these settings on each dataset are reported in table 7. In most cases, the general model is close to the performance of the task- and dataset-specific settings. Our robust evaluation methodology has enabled us to find a good trade-off between portability and performance.

Conclusion
In this paper, we reported the results of a large-scale evaluation of window-based Distributional Semantic Models, involving a wide range of parameters and tasks. Our model selection methodology is robust to overfitting and sensitive to parameter interactions.
It allowed us to identify parameter configurations that perform well across different datasets within the same task, and even across different tasks. We recommend the setting highlighted in bold font in table 5 as a general-purpose DSM for future research. We believe that many applications of DSMs (e.g. vector composition) will benefit from using such a parameter combination that achieves robust performance in a variety of semantic tasks. Moreover, an extensive evaluation based on a robust methodology like the one presented here is the first necessary step for further comparisons of bag-of-words DSMs to different techniques for modeling word meaning, such as neural embeddings (Mikolov et al., 2013). Let us now summarize our main findings.
• Our experiments show that a cluster of three parameters, namely score, transformation and distance metric, plays a consistently crucial role in determining DSM performance. These parameters also show a homogeneous behavior across tasks and datasets with respect to best parameter values: simple-ll, log transformation and cosine distance. These tendencies confirm the results in Polajnar and Clark (2014) and Kiela and Clark (2014). In particular, the finding that sparse association measures (with negative values clamped to zero) achieve the best performance can be connected to the positive impact of context selection highlighted by Polajnar and Clark (2014): ongoing work targets a more specific analysis of their "thinning" effect on distributional vectors.
• Another group of parameters (corpus, window size, dimensionality reduction parameters) is also influential in all tasks, but shows more variation with respect to the best parameter values. Except for the TOEFL task, best results are obtained with the WaCkypedia corpus, confirming the observation of Sridharan and Murphy (2012) that corpus quality compensates for size to some extent. Window size and dimensionality reduction show a more task-specific behavior, even though it is possible to find a good compromise in a 4-word window, a reduced space of 500 dimensions and skipping of the first 50 dimensions. The latter result confirms the findings of Bullinaria and Levy (2007; 2012) in their clustering experiments.
• The number of context dimensions turned out to be less crucial. While very high-dimensional spaces usually result in better performance, the increase beyond 20000 or 50000 dimensions is rarely sufficient to justify the increased processing cost.
• A novel contribution of our work is the systematic evaluation of a parameter that has been given little attention in DSM research so far: the index of distributional relatedness. Our results show that, even if the parameter is not among the most influential ones, neighbor rank consistently outperforms distance. Without SVD dimensionality reduction, the difference is more pronounced: this result is particularly interesting for compositionality tasks, where SVD has been reported to be detrimental (Baroni and Zamparelli, 2010). In such cases, the benefits of using neighbor rank clearly outweigh the increased (but manageable) computational complexity.
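The cluster of score, transformation and distance metric summarized in the first finding can be sketched as follows. The sketch assumes the usual formulation of simple log-likelihood, 2 · (O · ln(O/E) − (O − E)), with negative associations (O ≤ E) clamped to zero, expected frequencies estimated from the marginals of the toy matrix, a log(x + 1) transformation, and cosine distance taken as 1 minus cosine similarity. All function names and counts are hypothetical, and details such as the window-size correction of expected frequencies are omitted.

```python
import math

def sparse_simple_ll(observed, expected):
    """Simple log-likelihood score; negative associations are clamped to 0."""
    if observed <= expected:
        return 0.0
    return 2.0 * (observed * math.log(observed / expected) - (observed - expected))

def weight_matrix(counts):
    """Apply sparse simple-ll followed by a log(x + 1) transformation."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    weighted = []
    for i, row in enumerate(counts):
        new_row = []
        for j, observed in enumerate(row):
            # Expected frequency under independence (assumption of this sketch).
            expected = row_sums[i] * col_sums[j] / total
            new_row.append(math.log(sparse_simple_ll(observed, expected) + 1.0))
        weighted.append(new_row)
    return weighted

def cosine_distance(u, v):
    """1 - cosine similarity; both vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Toy co-occurrence counts: 3 target words x 3 context words.
counts = [[10, 0, 2],
          [8, 1, 3],
          [0, 9, 7]]
weighted = weight_matrix(counts)
d_similar = cosine_distance(weighted[0], weighted[1])
d_different = cosine_distance(weighted[0], weighted[2])
```

In this toy matrix the first two rows keep a non-zero weight only for the first context word, so their cosine distance is close to zero, while the third row shares no positively associated context with the first.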
Ongoing work focuses on the extension of the evaluation setting to further parameters (e.g., new distance metrics and association scores, Caron's (2001) exponent p) and tasks (e.g., compositionality tasks, meaning in context), as well as the evaluation of dependency-based models. We are also working on a refined model selection methodology involving a systematic analysis of three-way interactions and the exclusion of inferior parameter values (such as Manhattan distance, sigmoid transformation and Dice score), which may have a confounding effect on some of the effect displays.