Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results

“Based on theoretical reasoning it has been suggested that the reliability of findings published in the scientific literature decreases with the popularity of a research field” (Pfeiffer and Hoffmann, 2009). As we know, deep learning is very popular and the ability to reproduce results is an important part of science. There is growing concern within the deep learning community about the reproducibility of results that are presented. In this paper we present a number of controllable, yet unreported, effects that can substantially change the effectiveness of a sample model, and thusly the reproducibility of those results. Through these environmental effects we show that the commonly held belief that distribution of source code is all that is needed for reproducibility is not enough. Source code without a reproducible environment does not mean anything at all. In addition the range of results produced from these effects can be larger than the majority of incremental improvement reported.


Introduction
The recent "reproducibility crisis" (Baker, 2016) in various scientific fields (particularly Psychology and Social Sciences) indicates that some introspection is needed in all fields, particularly those that are experimental by nature. The efforts of Collberg's repeatability studies highlight the state of affairs within the computer systems research community (Moraila et al., 2014;Collberg et al., 2015). 1 Other fields have also begun to push for more stringent presentation of 1 http://reproducibility.cs.arizona.edu results, for example, the information retrieval community has been aware for some time of the issues surrounding weak baselines (Armstrong et al., 2009) and more recently reproducibility (Arguello et al., 2016;. The issue of reproducibility in the deep-learning community has also started to become a growing concern, with the need for replicable and reproducible results being included in a list of challenges for the ACL (Nivre, 2017). In reinforcement learning, Henderson et al. (2017) showed that there are a number of effects that would change the results obtained by published authors and call for more rigorous testing, and reporting, of state-of-the-art methods. There is also an ongoing project by OpenAI to provide baselines in reinforcement learning that are reproduced from published descriptions, but even they admit that their scores are only "roughly on par with the scores in published papers." 2 Reimers and Gurevych (2017) investigated over 50,000 combinations of hyper-parameter settings, such as word embedding sources and the optimizer across five different NLP tasks and found that these settings have a significant impact on both the variability, and the relative effectiveness of models.
In this paper we present a number of controllable environment settings that often go unreported, and illustrate that these are factors that can cause irreproducibility of results as presented in the literature. These environmental factors have an effect on the effectiveness of neural networks due to the nonconvexity of the optimization surface, meaning that even minor changes in computation can lead the network to fall into one of a multitude of local minima. Because these effect sizes are comparable to the largest incremental improvements that have been reported brings into question those improvements, and associated claims of progress.

Experimental Setup
In order to limit the scope of this paper, we specifically focus our efforts on a single natural language processing task-answer selection within question answering-elaborated upon in Section 2.1. We also further limit our discussion to look at how these environmental effects manifest in a single implementation of a single model, described in Section 2.2. These restrictions, however, do not mean that our results are only applicable to this model on this task, rather our discussion generalizes to all neural network based research.
To isolate the effect that each environmental factor has all other settings related to the network are fixed; that is, the hyper-parameters are static across all experiments, and only the environmental variable of interest is manipulated. Along with each of the presented factors we include suggestions on how to respond to these in order to best ensure that the work, as presented, is reproducible.

Exemplar Task
Answer selection is one important aspect of opendomain question answering. Given a question, q, and a set of candidate sentences, A, the answer selection task is to rank the sentences contained in A such that those candidates that answer the question are ranked at the top of the list. From this ranked list and assessments of whether the candidate contains an answer to the question, common information retrieval metrics average precision (AP) and reciprocal rank (RR) can be calculated to assess the effectiveness of the system. These metrics are the de facto metrics to evaluate answer selection, and as such the metrics are reported within this paper. Descriptions of these metrics are easily found in the literature.
Worryingly, in the literature, it is becoming increasingly common to not conduct statistical significance testing, rather a higher metric value is taken as evidence that the model performs better. Due to the nature of this paper, we only perform significance testing between results in the same condition, and not across conditions. Within each condition we identify a "baseline"/default setting to compare against. Conducting this many significance tests would normally call for a correction method to be applied, but we do not do so, as we only wish to indicate that selecting the higher number may result in an absolute difference, but not necessarily a statistically significantly one. To calculate significance we use a paired Wilcoxon signed rank test.

Exemplar Model
To perform our experiments we utilize the model released by Sequiera et al. (2017), a simplified Py-Torch implementation of the model proposed by Severyn and Moschitti (2015). The model was chosen because of its simplicity, it is quick to train which supports a fast iteration of experiments, and it has also been reimplemented with similar effectiveness additional times . Figure 1 shows a diagram of the model, which adopts a "Siamese" structure with two sub-networks to process the question and candidate sentence. We emphasize that this model was only selected to serve as an exemplar; the effects that are observed in relative performance will also be present in other models. Indeed, because of the simplicity of this model, it is likely that the environmental effects described will have a more substantial impact on the network effectiveness of more complicated models.

Datasets
The experiments reported in this paper are all performed against the TrecQA dataset that was first released by Wang et al. (2007) and further elab-  orated upon by Yao et al. (2013), as well as the WikiQA dataset released by Yang et al. (2015). Both datasets consists of pre-determined training, development and test sets. For each question, each candidate answer is labelled positive if it contains the answer to the question, otherwise negative. The ratios of these labels and size of the splits for both datasets are shown in Table 1.
The TrecQA dataset has further diverged into two versions, named RAW and CLEAN. The CLEAN version has removed those questions that had no positive labelled answers. Results on these two variants are not directly comparable to each other , and experiments in this paper are performed against the RAW variant. Similar manipulation of the WikiQA dataset has also been performed, although no analysis of the comparability of the results has been conducted.

Weak Baselines
As observed by Armstrong et al. (2009) in the information retrieval field, the use of weak baselines is a factor that should be considered when discussing results as using a weak baseline shows a greater improvement than could otherwise be claimed. Table 2 shows the state-of-the-art results in answer selection, as replicated from the ACL Wiki, Table 3 likewise shows the (potentially incomplete) state-ofthe-art results on the WikiQA dataset, sourced by inspection of relevant papers. The TrecQA dataset contains an additional row that presents a simple baseline-the sum of IDF weights for terms in both  the question and candidate sentence-that performs no learning of any sort. (1) Equation 1 shows the function that produces these results, where D is the document collection, d is a document from this collection, q is the query, a is the candidate sentence, and t is a term. This calculation is done after removal of stopwords. 3 This baseline outperforms a number of the older stateof-the-art methods. Therefore these older results, and some results afterwards are comparing against a weak baseline. In fairness, this baseline was first reported in the literature by Yih et al. (2013)  a result that is frequently overlooked in the literature that followed. However, we also note that the results for this simple baseline differ between our reported value and that of Yih et al. (2013), which is substantially lower-0.6531 AP, 0.7071 RR. For these reasons we repeat this result here.

Confounding Variables
In this section we document a number of confounding variables that often go unreported in the literature, and can have a substantial effect on whether a result would be considered state-of-the-art or not, and the reproducibility of that result. These range from controllable factors, to factors that are not controllable, but need to be reported. To aid discussion, the state-of-the-art tables have been recreated and annotated with the change in AP and RR over the then state-of-the-art result (Table 2 and Table 3). Unless otherwise stated, all experiments are performed using Docker containers that are derived from common, shared, base images. This substantially eases the fixing of all environment and versioning issues that can be observed when running under a native environment. All the data that is required to reproduce the results in this paper is publicly available. Including Docker images, scripts to create and use those images, and resulting pretrained model files. 4

Software Versions
There are numerous points of software in which the version of the software being used can impact the end results substantially. These are the model definitions, they framework software and the libraries that the framework uses.

Model Definition
We refer to the code that is used to define the model and to run the experiments as the model definition.
These are changing artifacts, and when the software is made available to researchers, then it must be accompanied by the version of that software being used. A cursory glance of the commit history of some of these repositories shows a non-zero amount of bug fixing commits. Because of the nature of deep learning, these bugs may actually improve effectiveness, as anecdotal evidence suggests. 5 Whether the commits fix bugs or add features, the models being compared are inherently different and can result in different outcomes. Authors should specify which version of the code is being run to obtain the results presented.  of changing the version of the code on the model's effectiveness. As can be seen, there is a reasonable shift in results, and until authors specify the exact version of the model that is used for experimentation their results are non-reproducible. The version used for all further experiments in this paper is cf0e269.

Framework Version
Specifying the framework that is being used would be an important first step. Different framework versions could give different results for the same model code. To illustrate this we ran the sample model over a range of different versions of PyTorch. Table 5 shows the impact of changing the version of PyTorch used in the training of the sample model. It shows that newer (0.2.0) is not necessarily better, although this depends on the dataset. The version used throughout the rest of the paper is 0.1.12, as this is the version that was used in prior work for this model. Version 0.1.8 and earlier would not run the sample model due to use of features introduced in 0.1.9. The results are stable for 0.1.x versions across datasets. One possible cause could be that the underlying libraries PyTorch relies on were pinned to specific versions across Py-Torch versions. Alternatively, the model code may not be using features of PyTorch that were changing across these versions.

Framework Dependencies
While fixing the framework version is a good step, these frameworks often themselves rely on other libraries. Of particular interest to the neural network  community is the math library that underpins all the matrix and vector operations. By default PyTorch installs a version of the library that is linked against Intel's Math Kernel Library (MKL). When running the sample model on different hardware, we identify the effectiveness of the model changes. Table 6 shows the results of running the MKLbacked version on Intel and AMD hardware, compared to an OpenBLAS, which results in the same answers regardless of hardware. Intel also notes that the results of the same floating point calculation may be different across their own hardware. 6 It should not surprise the reader that Intel's math library gives different results on different architectures; after all, Intel knows with great detail the architecture of Intel chipsets and is not necessarily inclined to produce optimal code for competing platforms. The sensitivity of the network to the backing math library is dependent on the dataset. This difference in effectiveness is likely due to the relative non-convexity of the optimization surface for the two datasets, where the TrecQA surface has a large number of local minima.
Changing the library, or even the backend, within the same library, in which the model is implemented can substantially change the effectiveness of the model. For example Simon (2017) observed a 16% increase of test accuracy for the same model (from 0.5438 to 0.6197) by changing the computation backend from Tensorflow to MXNet.

Threads
Threading introduces a number of possibilities for non-reproducible results, as results from threads can be returned in differing orders. This is because floating point arithmetic is non-associative as well as non-commutative; however, these effects can be controlled by using the appropriate functions and settings in the library. Training the sample model repeatedly achieves the same results, suggesting that these settings are being utilized inside the PyTorch library, although we implore readers to discover this for their library of choice. However, while threading itself does not impact the results within PyTorch, the number of threads used does. Other than by never varying the number of threads used, this effect cannot be controlled for. The reason for this is related to the non-associativity of floating point maths. For example, given the mathematical relations a + b = e, and c + d = f , the floating point specification does not ensure that the mathematical equality a + b + c + d = e + f holds. A result calculated on two threads may perform the e + f calculation, while on four threads the a+b+c+d calculation may be performed, resulting in potential differences.
For these experiments we use PyTorch v0.1.12 with Intel's MKL library on an Intel i7-6800K processor. Using the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables, as well as the set_num_threads function in Py-Torch, we can control the number of threads used in training. We range this from 1-6 on our machine, as this is the number of hardware cores on the CPU, and therefore the maximum number of threads that OpenMP will spawn. Table 7 shows the results of this experiment. Interestingly the results are consistent within datasets when using an odd number of threads, although this is most likely coincidental. The range of differences is small, but is, again, larger than some of the incremental improvements reported in the literature. The exact environment variables, or code settings, that need to be modified will depend on the framework being used.
There is no solution to this given the nonassociative nature of the floating-point and the splitting of workload among differing numbers of threads. The only recommendation is that authors report the number of threads used for training, although we do suggest a smaller number to err on the side of caution, as OpenMP will not create more threads than there are hardware cores.

GPU Computation
The variation of GPUs available for deep learning research is arguably larger than that of CPUs. There are many models, and each manufacturer is free to deviate from the reference models provided by nVidia or AMD, although it is unclear just how many choose to do so. There are also more uncontrollable factors, for instance, the number of threads that are used by the GPU is uncontrollable meaning that results are unlikely to be the same across different GPUs, unlike CPU training. Table 8 shows the results of enabling GPU computation on the sample model. We report on both enabling the cuDNN backend, as this is the default, as well as disabling it globally. The cuDNN backend is known to contain some non-deterministic kernels. 7 In addition, nVidia provides a white-paper that describes some of the implementation details and compliance issues of the IEEE 754 floating point specification and their impact on nVidia GPUs (Whitehead and Fit-Florea, 2011). The paper also presents examples where compiling for a 32-bit x86 architecture and a 64-bit x86-64 architecture can yield different results.
In addition to running the experiment on our own GPU, an Asus branded nVidia GeForce 1080GTX (revision a1) we also repeated the experiment on  an Amazon EC2 p2.xlarge instance. This instance comes equipped with a single nVidia Tesla K80 GPU. Other instances come equipped with multiple GPUs, but as the model is both small and does not take advantage of multiple GPUs experiments were not performed on these instances. Of note is the presence or absence of the cuDNN library has no effect on the K80, but does on the 1080GTX GPU. We suspect that the reason for this is that the K80 is designed as a compute card, while the 1080GTX is primarily designed for graphics processing. This design difference could manifest itself in different instructions that can be taken advantage of by cuDNN kernels. Even with just two GPUs and using the cuDNN backend, there is already evidence that the performance of the network depends on both the dataset and the underlying hardware. There is a clear dependence on the dataset for the relative performance. Further results reported on the GPU are reported on the GeForce 1080GTX with cuDNN disabled. Although this is not the default it maximizes reproducibility by avoiding non-reproducible kernels.

Random Seed
Perhaps the most obvious feature of machine learning that can impact the effectiveness is the random seed. Thus far the experiments in this paper have used a fixed seed, and like most prior research this was only implied rather than explicitly stated. The seed in question was 1234.
Randomness is a crucial part of machine learning and values from the random generator are widely used. For example, random values are used for the initial values of weights, for selecting which nodes to drop in drop-out layers, and for selecting set embeddings for terms that have no associated embeddings. As Goldberg (2017, Section 5.3.2 p59) rightly makes note of-"When debugging, and for reproducibility of results [emphasis added], it is advised to used a fixed random seed." Figure 2 shows the variance in AP and RR when specifying different seeds, for 200 randomly chosen seeds (selected using the bash (version 4.3.48(1)) RANDOM built-in, itself initialized/seeded to 1234 prior to performing runs).
Noting the generator of the random numbers is important, as different languages and libraries may use different generators. Most languages default to a pseudo-random generator for performance reasons, which carries the additional benefit that sequences can be reconstructed from a given start state, commonly referred to as a seed. For example, the bash version used to generate the seeds for Figure 2 uses a linear congruential generator (Park and Miller, 1988). A more commonly used generator is MT19937, a Mersenne Twister based on the Mersenne prime 2 19937 − 1, the standard implementation that uses a 32-bit word length. Another implementation, MT19937-64, uses a 64-bit word length and generates a different sequence. To specify the generator used it is often enough to specify the language version and platform being used. PyTorch and dependent libraries use the aforementioned MT19937 generator.
The spread of results shows that the results are either marginally worse than prior work (which would likely mean the result from this model would not be published), or better than work that was reported on afterward (meaning these latter results may not have been published). Significance testing across these models was not performed. Table 9 shows the agreements, calculated using Kendall's τ and Spearman's ρ, on rankings for each of the datasets, comparing the two metrics, and two computational backends used, all results shown are statistically significant at the p < 0.01 level.
On the TrecQA dataset, these variations in AP and RR show only moderate agreement in rankings of the model when the same training computational backend is used, and only weak agreement across computation backend. For the TrecQA dataset, the CPU results covered a range of 0.0393 for AP, and 0.0599 for RR, while the GPU covered 0.0379 AP, and 0.0492 RR respectively.
The WikiQA dataset exhibits stronger agreements about model rankings on the same computation backend, but similarly weak agreements when comparing across computational backends. The range of AP and RR values on this dataset are even larger, covering 0.0712 AP, and 0.0727 RR on GPU; and 0.0705 AP and 0.0755 RR on the CPU.
These ranges in AP and RR values are greater than a large proportion of incremental improvements reported in prior answer selection research (see Table 2 and Table 3), and indeed are an order of magnitude larger than a typically reported improvement in either metric on the WikiQA datasets. In these cases the model was trained to target AP, another setting that is not commonly reported. While some software for models made available specifies a seed, this detail is often omitted from the paper, making replication-from-paper efforts nigh on impossible. Reagen et al. (2017, Chapter 4) discuss this variance in results from seeding, calling it Iso-Training Noise. They use this concept to frame discus-  sion over whether optimizations, such as using fixed point arithmetic over floating point, are safe to perform. They define an optimization as safe if the results are within one standard deviation of the mean of the results observed from multiple seeded runs. We suggest that specifying the random seed used in training is the bare minimum, necessary step that should be taken, although given the potential for different pseudo-random generators, and differences in implementation, this may not be enough. Indeed, the best approach is to stop reporting single-value results, and instead report the distribution of results from a range of seeds. Doing so allows for a fairer comparison across models, by discarding potential comparisons of lucky and unlucky seeds. In addition, these result populations can be statistically compared for significance, allowing for stronger claims on improvement.

Interactions
Thus far this paper has presented a number of effects that can affect the results of a neural network. Each of these has been presented in isolation, after fixing the prior effects. These effects clearly have potential for interaction and the interaction is unpredictable. In this section we briefly examine one of these interactions, namely the seed selection combined with either CPU or GPU training.
The results presented in Table 8 show that for a given seed the models exhibit different effectiveness based on the hardware used for training. In Section 4.4 it was shown that the seed has a significant impact on the relative effectiveness of the model regardless of this computational backend. The correlation coefficients across devices presented in Table 9 leads us to suspect that there can be substantial changes in effectiveness when switching the backend from CPU to GPU and vice versa. Figure 3 shows the effect of changing from CPU training to GPU training, using the same 200 seeds that were used in Section 4.4. The relationship observed in Figure 2 between AP and RR is still present, but there is no telling, given a fixed seed, whether training on GPU or CPU would result in better effectiveness. In addition, these deltas can be larger than a substantial number of incremental improvements reported. For example, a middling result on the CPU may be transformed to either a top or bottom result if switching to GPU training, with everything else fixed.
By reporting results as single numbers the variation due to the hardware on which the training is performed is hidden, and this could lead authors to conclude that their model is a substantial improvement on state-of-the-art. The changes in AP and RR that are observed are representative of even the larger improvements in state-of-the-art. However, when comparing the distributions of the scores across the backends by visual inspection of Figure 2 there is clearly not any difference in the populations. Statistical significance testing (p ≫ 0.05 in a paired t-test, both two-and single-tailed) bears out this intuition. Using these population based results would then lead authors to a different conclusion than if the seed was "lucky" for the training hardware. This is a concrete example of the differences between reporting result distributions compared with single values.

Reporting Rounding
The final aspect of result reporting that is controllable for is the rounding of results. For example,  Table 2) of reporting results to three decimal points. In this case even such a minor difference can result in state-ofthe-art or not, statistical significance notwithstanding. For example a result of 0.7484 would round down, while 0.7486 would round up, overemphasizing the difference by a factor of 5. We concede, however, that the same argument can be applied regardless of which decimal point cut-off is used, although we observe that trec_eval, the de facto tool used to calculate AP and RR, reports to four. We recommend that reviewers be skeptical of such minor improvements on state-of-the-art when single results are reported, the recommendation here follows that of Section 4.4, in that ideally multiple seeds are used, and testing is performed on the population of results to determine improvement.

Conclusions
In this paper we have demonstrated a number of factors that are present during training of a model and affect the results of said model. These parameters, and their settings, often go unreported in the literature. The result is that a large amount of prior work in answer selection is inherently irreproducible. Furthermore, the differences in results illustrated by these effects can be much larger than the majority of improvements reported as gains in the literature.
The effects that we presented are not stand-alone effects. Interaction between effects also has an additional impact, one of which was discussed in Section 4.5. Other results presented in this paper do not consider this interaction. For example Table 6 suggested that a model trained using OpenBLAS produces worse results for the TrecQA dataset than one trained using Intel's MKL library, which is true. . . for that version of the model code, for that version of the framework, for that random seed, when trained on a single thread on that CPU, for that dataset. We reserve investigating the interaction effects of these individual effects for future work.
It is simply no longer adequate to report a single value when evaluating results from neural networks, especially without the presence of statistical testing on those results. By far the largest source of variability in the experiments presented in this paper was when the network was seeded with different random starting points. The range of results produced cover ranges of results that can be an order of magnitude larger than typically imported improvements.
As well as repeating experiments for multiple seeds, the specifications of the hardware on which the experiments were performed should be reported alongside the results, as changing the hardware can change the results by an order of magnitude. Additionally, the number of threads and the math library used impact on the results and should be reported.
Finally, beyond the hardware effects, the software that is used to both run the model, and define the model, has an impact. For this reason both the model definition and library versions, as well as all the required dependencies, should be pinned to a specified version. These issues are easily avoidable by the use of common packaging tools such as Docker, which also provides opportunities to fix most of the nonversioning environmental issues as well.
In cases where authors are unable to provide a Docker image, or equivalent, then making the trained models available is one alternative. Loading pre-trained models is an action that is supported by a number of frameworks. PyTorch, for example, provides functions to load a model from a URL. The pre-trained models appear to provide consistent results even when the inference pass is performed using settings that would have provided different results in training.
A sentence is all that it takes to describe the environment used for training. For example: "our model was written against PyTorch v0.1.12, and training was conducted on an Intel i7-6800K using a single thread and Intel's Math Kernel Library". Beyond this we implore reviewers to be wary of such minor reported improvements in the light of these issues.