Forecasting Conditional Probabilities of Binary Outcomes under Misspecification

Abstract. We consider constructing probability forecasts from a parametric binary choice model under a large family of loss functions (“scoring rules”). Scoring rules are weighted averages over the utilities that heterogeneous decision makers derive from a publicly announced forecast (Schervish, 1989). Using analytical and numerical examples, we illustrate how different scoring rules yield asymptotically identical results if the model is correctly specified. Under misspecification, the choice of scoring rule may be inconsequential under restrictive symmetry conditions on the data-generating process. If these conditions are violated, the choice of a scoring rule typically favors some decision makers over others.


I. Introduction
Consider the problem of forecasting an as yet unobserved outcome represented by the random variable $Y$, which takes on values in $\{0, 1\}$, given a vector of observables $X$. The conditional probability that $Y = 1$ given $X = x$, denoted by $p[x]$, is equivalent to the conditional mean and distributional forecast for the binary outcome $Y$. By contrast, a point forecast in this situation equals either 0 or 1. Point forecasting thus corresponds to choosing a binary action, and it is natural if the forecaster and the decision maker are one entity (Elliott & Lieli, 2013; Lieli & White, 2010). In situations where the forecaster and the decision maker are separate entities, probability forecasts are often provided since they allow decision makers to construct their own point forecasts using their respective loss functions. Examples of such "public forecasting" scenarios include recession probabilities (e.g., forecasts by Hamilton & Chinn, 2014), sovereign default probabilities (e.g., Deutsche Bank, 2014), or probabilities of binary weather outcomes, such as rain (e.g., Mass et al., 2009).
It is common in practice to estimate the conditional probability forecast using a parametric model, such as logit or probit. To estimate this model, the forecaster must choose a loss function. If the model is correctly specified, the consistency and efficiency of the maximum likelihood estimator (MLE) justify its choice as an estimation strategy. In practice, the model is very likely to be misspecified, and the choice of loss function will typically matter for the forecast, even asymptotically. Loss functions for distributional forecasts, such as predicted probabilities, are called "scoring rules" (Gneiting & Raftery, 2007). The log score and the Brier score, which give rise to the MLE and nonlinear least squares, respectively, are widely used in practice. However, there are many other scoring rules that may be of interest.

We thank the editor and two anonymous referees, as well as seminar and conference participants at the University of Konstanz (January 2013), Heidelberg University (May 2013), and IAAE (London, June 2014) for helpful comments and suggestions. All numerical computations in this paper were done using the R programming language (R Core Team, 2015), whereby the ggplot2 package (Wickham, 2009) was used for some of the graphical illustrations. The third author thanks UCSD for its hospitality during a research visit, as well as the Klaus Tschira Foundation for infrastructural support at the Heidelberg Institute for Theoretical Studies (HITS). He gratefully acknowledges funding from the Deutsche Forschungsgemeinschaft (grant PO 375/13-1) and the European Union Seventh Framework Programme (grant agreement no. 290976). A supplemental appendix is available online at http://www.mitpressjournals.org/doi/suppl/10.1162/REST_a_00564.
A first contribution of this paper is to review the scattered theoretical literature on scoring rules for the binary case. In the context of a double binary decision problem (two outcomes, two possible actions), scoring rules are weighted averages over the utility functions of heterogeneous decision makers (Shuford et al., 1966; Schervish, 1989). This heterogeneity stems from the different costs that individual decision makers face under false positives versus false negatives. For instance, in forecasting the probability of currency crises (Inoue & Rossi, 2008), currency traders' costs of false positives and false negatives will vary with their degree of exposure. Another example is forecasting default probabilities of federally insured student loans (Knapp & Seaks, 1992). Lenders will tend to prefer false positives to false negatives, where the degree to which they prefer the former over the latter depends on their exposure and other loans on their menu. Student borrowers will prefer false negatives to false positives at varying degrees depending on their neediness, the loan amount, and how much they value their future credit history.
Popular scoring rules are based on a certain symmetry in these weighted averages. This symmetry may or may not be appropriate for a given empirical problem. Asymmetric rules relate to situations in which false negatives are much more costly than false positives or vice versa. For example, Lieli and Springborn (2013) analyze the environmental policy decision of whether to admit possibly invasive biological imports. From the consumer's point of view, it may be devastating to mistakenly classify an import as safe; in contrast, classifying a safe import as unsafe is typically less harmful. From the importer's perspective, however, a false positive is much more costly than a false negative. A symmetric scoring rule, such as the log score, weighs the consumers' and importers' utility functions equally, which may not be appropriate if the safety of the general public is at stake. There are many other settings where symmetry may not be justified, such as forecasting natural disasters (say, wildfires or earthquakes) and economic disasters (say, recessions).
Despite their empirical relevance, such asymmetries are not reflected in common rules such as the log or Brier score. We hence show how to construct proper asymmetric scoring rules in the spirit of Buja, Stuetzle, and Shen (2005). This involves prioritizing some decision makers over others, and hence resembles the aggregation of utilities in a social planner's problem (Lieli & Nieto-Barthaburu, 2010). In order to actually use a wide variety of scoring rules for parameter estimation, convergence of the resulting estimators is a key concern. We therefore provide conditions for a weak law of large numbers, allowing for time series dependence in the data.
Under misspecification, different choices of scoring rules may lead to different estimates of the forecast model. Matching the scoring rule used for parameter estimation with the one used for forecast evaluation has been recommended in the literature for this reason. However, there has been less work on examining the magnitude of these effects. In (nonbinary) MSE-based forecasting, Weiss and Andersen (1984) make this point with respect to using autoregressions as forecast models. Granger (1993) makes this suggestion without elaboration, while Weiss (1996) makes the point more generally, giving results. Hand and Vinciotti (2003) make the same suggestion in examining models for binary forecasting, while Gneiting (2011) and Patton (2015) consider several types of point forecasts.
We contribute to this literature by providing novel analytical and numerical evidence for the binary case. With the aid of an analytical example, we illustrate how the choice of scoring rule is inconsequential in the case of correct specification. In the case of misspecification, we characterize the conditions under which the choice is inconsequential. These conditions consist of specific symmetry conditions on the data-generating process that are likely to be violated in practice. A Monte Carlo study illustrates that if a subset of the conditions is violated, the choice of scoring rule affects parameter estimates, forecasts, and, most importantly, decisions. While these effects are qualitatively robust, their magnitude is necessarily case specific and depends on factors such as the true DGP, the set of scoring rules being compared, and the preferences of the decision maker. These preferences determine whether the differences between two predicted probabilities (obtained under scoring rules A and B, say) are relevant in the sense of leading to different decisions.
The plan of this paper is as follows. Section II presents some theoretical results. Section IIA reviews results that link the scoring rule and the decision-maker loss functions. We present results for the consistency of estimators using a wide variety of scoring rules in section IIB. Section IIC uses an analytical example to illustrate why matching loss functions for estimation and evaluation may be important in the case of misspecification. Section III provides a Monte Carlo demonstration of the theoretical points. Section IV concludes.

A. Characterization of Scoring Rules
Consider the problem of forecasting a binary random variable $Y$ with outcomes $y \in \{0, 1\}$ given some predictors denoted by the random variable(s) $X$ with outcomes $x \in \mathcal{X}$, where $\mathcal{X}$ denotes the support of $X$.^1 We denote by $p[X] \in \mathcal{P}$ models of the conditional probability that $Y = 1$, where $p_0[X]$ is the correctly specified model. We use brackets to distinguish $p[x]$ from the notation $p(x)$. The latter notation implies that $p$ is a function defined on the support of $X$. For $p$ defined on the support of a linear index $x'\theta$, $p[x] = p(x'\theta)$. The brackets hence allow us to subsume $\theta$. We do not assume that $p_0[X] \in \mathcal{P}$. For notational brevity, we will often refer to $p$ and $p_0$ instead of $p[\cdot]$ and $p_0[\cdot]$.

A decision maker chooses a function $f(X)$ mapping the support of $X$ into $\{0, 1\}$. The optimal choice of this function depends on both the conditional probability that $Y = 1$ and the utility function of the decision maker. The decision maker's utility function has the form

$$U(y, f, c) = \begin{cases} 0 & \text{if } f = y, \\ -c & \text{if } f = 1,\ y = 0, \\ -(1 - c) & \text{if } f = 0,\ y = 1, \end{cases} \tag{1}$$

where $0 < c < 1$. Now the utility function can be rewritten as

$$U(y, f, c) = -c\,1(f = 1,\ y = 0) - (1 - c)\,1(f = 0,\ y = 1), \tag{2}$$

where $1(A)$ denotes the indicator function of the event $A$.
Note that the utility function is normalized such that a correct decision yields 0 utility. The utilities for incorrect decisions are normalized to sum to 1 in absolute value. This is without loss of generality when $U(y, f, c)$ depends only on these two outcomes.^2 Thus, $c = 0.5$ indicates a decision maker's indifference between false positives ($f = 1$ and $y = 0$) and false negatives ($f = 0$ and $y = 1$).
To motivate the problem, we use the example of forecasting a storm at a coastal location. We consider two types of decision makers, local restaurant owners and fishermen. Let $y$ denote whether a storm takes place ($y = 1$) or not ($y = 0$). If $f = 1$, the restaurant owners will allocate fewer staff members, since they expect to be serving fewer customers. If $f = 0$, the restaurants will hire their full staff. The fishermen will go fishing only if $f = 0$. We expect restaurant owners to prefer false negatives to false positives: $c \ge 0.5$. In the case of a false negative, tourists will be visiting the coastal location expecting good weather. Since a storm occurs, they will be spending more time at restaurants. Restaurant owners will have hired their full staff and hence are prepared to serve many customers. In the case of a false positive, the restaurants do not hire additional staff, and fewer tourists will be visiting the location. Thus, the restaurant owners' profits will be smaller. The fishermen are likely to prefer false positives to false negatives: $c \le 0.5$. In the case of a false positive (staying at home but no storm), even though they lose the catch, they save on fuel. In the case of a false negative (going fishing when there is a storm), they may lose their equipment or even put their own lives in danger. In reality, we have a continuum of heterogeneous fishermen and restaurant owners with different values of $c$. The exact value $c \in [0, 0.5]$ in a fisherman's utility function will depend on the value of his or her equipment and the number of staff on his or her crew. Similarly, a restaurant owner's $c \in [0.5, 1]$ will depend on the restaurant size, menu, and how much additional staff he or she hires.
Optimal forecasts for this problem are to set

$$f^*(x) = 1(p_0[x] > c)$$

(Schervish, 1989; Boyes, Hoffman, & Low, 1989; Granger & Pesaran, 2000). This result assumes knowledge of the true conditional probability. In practice, the unknown true probability $p_0[x]$ is replaced by an estimate $p[x]$ that can be frequentist (as in our analysis below) or Bayesian (Lieli & Springborn, 2013). The utility function can now be written as

$$U(y, p[x], c) = -c\,1(p[x] > c,\ y = 0) - (1 - c)\,1(p[x] \le c,\ y = 1). \tag{3}$$

It is important to note that the parameter $c$ plays a dual role in equation (3). In addition to determining the decision maker's preference over false positives and negatives, $c$ is part of the optimal forecasting rule and determines how the decision maker interprets a probability forecast. If $p \le c$, the decision maker will interpret it as $f = 0$; otherwise, the decision maker will interpret it as $f = 1$. Coming back to our example, consider a fisherman with $c = 0.25$ and a restaurant owner with $c = 0.75$. In this case, the optimal forecasting rule has the following implications. If $p \le 0.25$, then neither the fisherman nor the restaurant owner will interpret the probability forecast to indicate that a storm will occur. If $0.25 < p \le 0.75$, then only the fisherman will interpret the probability forecast to indicate that $f = 1$ and will not go fishing. If $p > 0.75$, then both the restaurant owner and the fisherman will interpret the probability forecast to indicate that $f = 1$. This shows how the preference over false positives and negatives informs the interpretation of the conditional probability forecast under the optimal forecasting rule.
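The cutoff rule can be made concrete with a small sketch. The Python snippet below is illustrative only: the function names `utility`, `optimal_action`, and `expected_utility` are hypothetical, and the cost normalization (a false positive costs c, a false negative costs 1 - c) is an assumption chosen to agree with the text's statements that c = 0.5 means indifference and that a forecast p ≤ c is read as f = 0.

```python
def utility(y, f, c):
    """Utility of action f in {0, 1} when the outcome is y in {0, 1}.

    Assumed normalization: correct decisions yield 0, a false positive
    costs c, and a false negative costs 1 - c.
    """
    if f == 1 and y == 0:
        return -c
    if f == 0 and y == 1:
        return -(1.0 - c)
    return 0.0


def optimal_action(p, c):
    """Cutoff rule: act (f = 1) only if the probability forecast p exceeds c."""
    return 1 if p > c else 0


def expected_utility(p_true, f, c):
    """Expected utility of action f when P(Y = 1) equals p_true."""
    return p_true * utility(1, f, c) + (1.0 - p_true) * utility(0, f, c)


# Fisherman (c = 0.25) and restaurant owner (c = 0.75) from the storm example.
decision_makers = {"fisherman": 0.25, "restaurant owner": 0.75}
for p in (0.10, 0.50, 0.90):
    actions = {name: optimal_action(p, c) for name, c in decision_makers.items()}
    print(p, actions)
```

The same forecast is thus translated into different actions by decision makers with different values of c, which is why a published probability is more informative than a published point forecast.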
To construct a forecast, we require an estimate of the true conditional probability of $Y = 1$, that is, an estimate of $p_0[x]$, or a procedure that directly estimates $1(p_0[x] > c)$. Manski and Thompson (1989) and Elliott and Lieli (2013) examine the latter approach and show how direct estimation lessens the need for an exact understanding of the true conditional probability. Essentially, the function $1(p_0[x] > c)$ is easier to specify correctly than $p_0[x]$ since the former is a step function. Our example shows why we might estimate the conditional probability instead, since it gives the individual decision makers (i.e., the restaurant owners and fishermen) the opportunity to interpret the forecast in a manner that is optimal based on their own preferences. In general, when there are users with a range of utility functions (i.e., values for $c$), then provision of an estimate for $p_0[x]$ enables all users to construct their own forecast rules (see Lieli & Nieto-Barthaburu, 2010).

When constructing a model $p$ for the conditional probability, we require a scoring rule for estimation. By definition, a proper scoring rule $S(y, p)$ is a function for which $E[S(y, p)]$ is finite and maximized at $p = p_0$. It is considered to be a strictly proper scoring rule if this maximum is unique: the rule is maximized only at the true value for the probability (see Gneiting & Raftery, 2007). From an econometric perspective, this means that the conditional probability is identified by the scoring rule. For binary outcomes, all proper scoring rules have the form

$$S(y, p) = y\,f_1(p) + (1 - y)\,f_2(p). \tag{4}$$

Schervish (1989, theorem 4.2) shows that proper scoring rules can be seen as weighted averages of many utility functions, where the weights are over different cutoff values $c$. Denote by $\nu(c)$ a nonnegative weighting function over $c$ defined on $[0, 1]$.
By integrating the utility for a single decision maker in equation (3), we obtain the weighted average utility function

$$S(y, p) = \int_0^1 U(y, p, c)\,\nu(c)\,dc = -y \int_p^1 (1 - c)\,\nu(c)\,dc - (1 - y) \int_0^p c\,\nu(c)\,dc. \tag{5}$$

The weight $\nu(c)$ may be viewed as the density of the preference parameter $c$ in a population of decision makers that the forecaster seeks to inform. Hence, it has an intuitive economic interpretation.
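As a numerical illustration of this averaging, the sketch below integrates the utilities of a continuum of decision makers over c with a flat weight ν(c) = 1 and compares the result with the half Brier score, -0.5(y - p)^2. The function names and the midpoint-rule integration are choices made for this illustration, and the cost normalization (false positive costs c, false negative costs 1 - c) is an assumption consistent with the cutoff rule "act if p > c".

```python
def utility(y, p, c):
    """Utility of a decision maker with cutoff c who acts (f = 1) iff p > c."""
    f = 1 if p > c else 0
    if f == 1 and y == 0:
        return -c            # false positive
    if f == 0 and y == 1:
        return -(1.0 - c)    # false negative
    return 0.0


def weighted_average_utility(y, p, weight, n=20000):
    """Midpoint-rule approximation of the integral of U(y, p, c) * weight(c) over [0, 1]."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        c = (i + 0.5) * h
        total += utility(y, p, c) * weight(c)
    return total * h


def uniform_weight(c):
    """Flat weighting over all decision makers."""
    return 1.0


def half_brier(y, p):
    return -0.5 * (y - p) ** 2


for y in (0, 1):
    for p in (0.2, 0.7):
        averaged = weighted_average_utility(y, p, uniform_weight)
        print(y, p, round(averaged, 6), round(half_brier(y, p), 6))
```

Replacing `uniform_weight` by another nonnegative function of c gives the score implied by a different population of decision makers.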
Equating equations (4) and (5), we see that

$$f_1(p) = -\int_p^1 (1 - c)\,\nu(c)\,dc, \qquad f_2(p) = -\int_0^p c\,\nu(c)\,dc.$$

As Schervish (1989) shows, scoring rules with this form are strictly proper if $\nu(c)$ gives a nonzero weighting over all $c \in [0, 1]$. These results are useful in a number of ways. First, through specification of $\nu(c)$, they provide a constructive approach to designing scoring rules. Second, for existing scoring rules, they show the implicit weighting over decision makers' utility functions that underlies the construction of that particular scoring rule. Table 1, which extends table 1 of Gneiting and Raftery (2007), gives several scoring rules and their implicit weights, $\nu(c)$. This includes popular approaches. Notice that the log scoring rule is simply (pseudo) maximum likelihood for a parameterized model of $p[x]$. This is the most common approach to parametrically estimating models of the conditional probability, where the models are typically either logit or probit. (See Lieli & Nieto-Barthaburu, 2010, for an economic interpretation of the weighting that underlies maximum likelihood.)

It is also possible to construct proper scoring rules by starting from a definition of $f_1(p)$. If $f_1(p)$ and $f_2(p)$ are once differentiable such that $\partial f_1(p)/\partial p > 0$ for $p \in (0, 1)$ and

$$\frac{\partial f_1(p)/\partial p}{1 - p} = -\frac{\partial f_2(p)/\partial p}{p}, \tag{6}$$

then $S(y, p)$ in equation (4) is a proper scoring rule with

$$\nu(c) = \frac{1}{1 - c}\,\frac{\partial f_1(p)}{\partial p}\bigg|_{p = c}.$$

This result was obtained by Shuford et al. (1966) and restated in theorem 4.1 of Schervish (1989). Notice that this relates $f_1(p)$ to $f_2(p)$ through their slopes at any $p$. Hence, one can construct a proper scoring rule by defining $f_1(p)$ and constructing $f_2(p)$ using this restriction. For example, set $f_1(p) = p$, so $\partial f_1(p)/\partial p = 1 > 0$ for all $p$. Now $\nu(c) = (1 - c)^{-1} > 0$ for $c \in (0, 1)$. Using this $\nu(c)$ results in a proper scoring rule. To obtain $f_2(p)$, we solve

$$\frac{\partial f_2(p)}{\partial p} = -\frac{p}{1 - p}.$$

Integrating, we obtain $f_2(p) = p + \ln(1 - p)$ and hence

$$S(y, p) = y\,p + (1 - y)\,[\,p + \ln(1 - p)\,].$$

This is a strictly proper scoring rule.

[Table 1: scoring rules and their implicit weights $\nu(c)$; extends table 1 of Gneiting and Raftery (2007). See equations (4) and (5) for details.]
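The constructed rule with f_1(p) = p and f_2(p) = p + ln(1 - p) can be checked numerically. The sketch below is an illustration rather than part of the original analysis: it verifies on a grid that the expected score is maximized by reporting the true probability, that the two slopes satisfy the restriction linking f_1 and f_2, and that the implied weight is 1/(1 - c). The helper names, the grid, and the finite-difference step are hypothetical choices.

```python
import math


def f1(p):
    return p


def f2(p):
    return p + math.log(1.0 - p)


def score(y, p):
    """Scoring rule of the form y * f1(p) + (1 - y) * f2(p)."""
    return y * f1(p) + (1 - y) * f2(p)


def expected_score(p, p0):
    """Expected score of forecast p when the true probability is p0."""
    return p0 * score(1, p) + (1.0 - p0) * score(0, p)


def best_forecast(p0, grid_size=999):
    """Forecast on the grid 0.001, ..., 0.999 with the highest expected score."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(grid, key=lambda p: expected_score(p, p0))


def derivative(func, p, eps=1e-6):
    """Central finite-difference approximation of func'(p)."""
    return (func(p + eps) - func(p - eps)) / (2.0 * eps)


def slope_gap(p):
    """f1'(p) / (1 - p) + f2'(p) / p, which should be zero for a proper rule."""
    return derivative(f1, p) / (1.0 - p) + derivative(f2, p) / p


def implied_weight(c):
    """Weight on decision makers with cutoff c implied by f1."""
    return derivative(f1, c) / (1.0 - c)


for p0 in (0.3, 0.8):
    print(p0, best_forecast(p0), slope_gap(p0), implied_weight(p0))
```

Because the implied weight 1/(1 - c) increases in c, this rule gives most weight to decision makers with high cutoffs.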
It is also worth mentioning that convex combinations of strictly proper scoring rules are also strictly proper.^3 Whichever of these two directions is taken (specifying $\nu(c)$ or specifying $f_1(p)$), we obtain an understanding of $\nu(c)$, the weights over the individual decision makers. The weighting functions for various popular scoring rules given in table 1 are pictured in figure 1. We see that the log score and Boosting loss correspond to U-shaped weighting functions, each placing very similar weights over $c$. The weighting functions of the Brier and spherical score are flat and bell shaped, respectively. All of the popular rules are symmetric around $c = 0.5$, the point of indifference between false negatives and false positives. There is no obvious reason why this might be appropriate in general for situations where distributional forecasts are to be provided. In our simplified example of forecasting storms, where we have the restaurant owners ($c \ge 0.5$) and the fishermen ($c \le 0.5$), the forecaster may prefer to weigh the restaurant owners' and fishermen's utility functions, for example, according to their proportion in the region's population or based on economic revenues. Thus, this weighting may have a social, political, or economic motivation. This paper is not concerned with the justification of a particular weighting scheme over another. Our goal is to show that the choice of the weighting function, and thereby the scoring rule, may have consequences for conditional probability forecasts and individual decision making in practice.

Buja et al. (2005) and Merkle and Steyvers (2013) use the beta distribution to parameterize the weighting function $\nu(c)$. This leads to a flexible two-parameter family of scoring rules. A somewhat simpler approach is to directly choose a given shape for $\nu(c)$. This is exemplified by the As1 and As2 scoring rules shown in figure 1. In the first of these rules, we set $\nu(c)$ so that it heavily weights small values of $c$ relative to large values. This would be a situation where decision makers who are extremely averse to losses from false negatives are heavily weighted. The specification of As2 does the reverse of this, heavily weighting decision makers who are strongly averse to false positives.^4 By specifying $\nu(c)$ directly according to a reasonable weighting function, the results presented above allow us to construct economically meaningful scoring rules that are strictly proper. This situation, which draws on the existence of the Schervish (1989) representation for scoring rules, helps to bridge the gap between economic and statistical forecast evaluation criteria.

^3 To see this, consider the example of a convex combination of the log score and As1 score, which we use as an example in section IIC. Each of these scores is maximized in expectation by the true probability. Hence, any convex combination of the two is also maximized in expectation by the true probability and thus defines a strictly proper scoring rule itself. Alternatively, note that a convex combination of the log and As1 scores again satisfies the relationship in equation (6), and thus inherits strict propriety.

[Figure 1: weighting functions $\nu(c)$ of the scoring rules in table 1. All functions have been multiplied by a scaling factor for comparability.]

B. Scoring Rules and Parameter Estimation
For all of the choices of proper scoring rules, one can consider estimating $p[x]$ using the scoring rule as a loss function. Consider linear index models of the form $p[x_t] = p(x_t'\theta)$. Given observations $(y_t, x_t)$, $t = 1, \ldots, T$, we can consider estimating the parameters of the model from the maximization

$$\hat\theta = \arg\max_{\theta \in \Theta} \sum_{t=1}^{T} S(y_t, p(x_t'\theta)).$$
For example, for the log scoring rule and $p(x_t'\theta) = e^{x_t'\theta}/(1 + e^{x_t'\theta})$, this would be maximum likelihood estimation of the logit model. For the same model with the half Brier score as the scoring rule, this would be nonlinear least squares estimation of the logit model. Various combinations of scoring rules and models could be employed to obtain a parameter estimate $\hat\theta$, and from this, an estimate of the conditional probability that the outcome is 1, $p(x'\hat\theta)$, for any possible $x$. Under fairly general conditions, $\hat\theta \xrightarrow{p} \theta^*$, where $\theta^*$ maximizes the expected score. The following theorem provides a set of conditions for achieving this consistency result. As detailed below, the theorem can be seen as a special case of M-estimation (Wooldridge, 1994), adapted to the situation of using strictly proper scoring rules and binary models.

c. For each $y_t \in \{0, 1\}$ and $x_t \in \mathcal{X}$, $f_1(p)$ and $f_2(p)$ are measurable and differentiable in $p$, and $p(x_t'\theta)$ is measurable and differentiable in $\theta$.
Proof. The proof follows from using these conditions to show that the conditions of theorems 4.2 and 4.3 of Wooldridge (1994) hold. First, via theorem 4.2, the conditions are sufficient that

$$\max_{\theta \in \Theta} \left| T^{-1} \sum_{t=1}^{T} S(y_t, p(x_t'\theta)) - E[S(y_t, p(x_t'\theta))] \right| \xrightarrow{p} 0,$$

so the objective function converges to its expected value uniformly in $\theta$. Conditions b and c yield the theorem's conditions i and ii. For theorem 4.2 part iv, notice that for all $\theta_1, \theta_2 \in \Theta$, the mean value theorem and equation (6) yield a stochastic upper bound on $|S(y_t, p(x_t'\theta_1)) - S(y_t, p(x_t'\theta_2))|$ that is evaluated at $\bar\theta$, an intermediate value for $\theta$. Parts iii and ivb require that $S(y_t, p(x_t'\theta))$ and $x_t'x_t$ satisfy a WLLN pointwise in $\theta$. These results follow from assumptions d through f, which are sufficient for a WLLN via corollary 3.48 of White (2001). Notice that $E|S(y_t, p(x_t'\theta))|^q$ is finite for any integer $q \le r + \delta$ given assumption d. A WLLN for $x_t'x_t$ follows directly from assumptions e and f. Consistency of the estimate $\hat\theta$ follows from the conditions being sufficient for theorem 4.3 of Wooldridge (1994). Conditions M1 and M2 follow directly from assumptions b and c, along with the results presented for uniform convergence of the average of the objective function. Condition M3 follows directly from assumption a.
Assumption a ensures that the scoring rule is strictly proper and thus admits the decomposition in equation (4). Furthermore, it ensures that the objective function attains a unique maximum at $\theta^*$. If $S(\cdot)$ is relaxed to be a proper scoring rule that is not necessarily strictly proper, then a result similar to theorem 1 still holds. In this case, $E[S(y_t, p(x_t'\hat\theta))] \xrightarrow{p} \max_{\theta \in \Theta} E[S(y_t, p(x_t'\theta))]$, which is sufficient to justify the procedure. Assumption b is the standard requirement to ensure a maximum. Assumption c is a standard regularity condition. The conditions in d relate to the scoring rule and model being employed, and are functions of both of these choices. The first part of d ensures that expected loss exists. The second part is employed as part of the requirements for uniform consistency of the objective function. This assumption seems strong; however, it holds widely since these objects are functions of $p(x_t'\theta)$, which is bounded between 0 and 1 for all $x_t$ and $\theta$. For example, consider the half Brier score with a logit model for the conditional probability that $y_t = 1$, and, for notational brevity, let $s_t = x_t'\theta$. Then $f_1'(p(s_t)) = 1 - p(s_t)$, and the term entering the second part of assumption d is bounded by 0.5 in absolute value; hence, the second part of assumption d is satisfied. In this case, $E|f_1(p(s_t))|^{r+\delta} = E|-0.5(1 - p(s_t))^2|^{r+\delta} \le 0.5^{r+\delta}$, and so is finite for all finite $r$ and $\delta$. For the log score with a logit model for the conditional probability, the corresponding term is also bounded.
The mixing assumptions in e impose a limit on the degree of time series dependence in the data. The requirement of strict stationarity gives meaning to the idea that we obtain the true conditional probability, at least asymptotically, when the model is correctly specified. If the data are not strictly stationary, then we can still obtain consistency results; however, the interpretation of $\theta^*$ changes to being a limiting value that minimizes the average expected losses over time. It is worth noting that some strictly proper scoring rules, such as the half Brier and spherical scores considered in this paper, are bounded. For bounded scoring rules, the assumptions of theorem 1 can be relaxed. First, the technical requirements of assumption d either become obsolete or are trivially satisfied. Second, assumption f, which ensures a WLLN for $x_t'x_t$, is not required.^5

Conceptually, the main purpose of theorem 1 is to illustrate that the structure of scoring rules makes them well suited for designing (consistent) parameter estimators in the tradition of M-estimators (e.g., Hayashi, 2000; see also Gneiting & Raftery, 2007). There are many possible sets of assumptions (e.g., various ways of restricting time series dependence) that could lead to the statement of theorem 1. Our chosen set of assumptions aims to strike a balance between generality and clarity of presentation, although results under alternative trade-offs between conditions are possible. Furthermore, results under more primitive conditions are available in more specialized settings. For example, de Jong and Woutersen (2011) analyze consistency in the important special case of a correctly specified probit model with lagged dependent variables, estimated via maximum likelihood. See their theorems 1 and 2 for low-level conditions that guarantee limited dependence properties of the data-generating process and their theorem 3 on consistency.

^5 To see this, note that if $\sup_{\theta \in \Theta} |S(y_t, p(x_t'\theta))| < C$, it follows that $|S(y_t, p(x_t'\theta_1)) - S(y_t, p(x_t'\theta_2))| < 2C$. This means that the stochastic upper bound in the proof of theorem 1 can be replaced by a constant. Hence, a WLLN for the upper bound is trivially satisfied, and assumption f becomes obsolete. Furthermore, the second part of assumption d, which is used in the general case to bound the term $|S(y_t, p(x_t'\theta_1)) - S(y_t, p(x_t'\theta_2))|$, is no longer required. Finally, the first part of assumption d is automatically satisfied since the $f_i(p(x_t'\theta))$, $i = 1, 2$, are bounded.
In terms of understanding the results for forecasting binary outcomes, two results follow directly. First, if the model is correctly specified, that is, $p_0[x_t] = p(x_t'\theta_0)$, then for all strictly proper scoring rules, $\theta^* = \theta_0$. Hence, all strictly proper scoring rules will give the same true conditional probability asymptotically. This follows directly from the fact that strictly proper scoring rules are uniquely maximized by the true conditional probability. Thus, if the model is correctly specified, the choice of the best scoring rule to use depends not on the reasonableness of $\theta^*$ but instead on the efficiency of the estimator $\hat\theta$ obtained by maximizing a particular scoring rule. The popularity of the MLE derives from the fact that it is an efficient parameter estimator under correct specification. However, it should be understood that the latter strong assumption is crucial in establishing the optimality of the MLE.
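A small simulation illustrates the correctly specified case. The sketch below is only an illustration under stated assumptions: a scalar regressor drawn from a standard normal distribution, a logit model without intercept, a true parameter of 1, and a golden-section search that presumes the sample objective is unimodal on the bracket [0, 3]. None of these choices, nor the function names, come from the original analysis. It estimates the parameter by maximizing the average log score (maximum likelihood) and the average half Brier score (nonlinear least squares).

```python
import math
import random


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_score(y, p):
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)


def half_brier(y, p):
    return -0.5 * (y - p) ** 2


def simulate_logit(theta0, n, seed):
    """Draw (y, x) pairs with x ~ N(0, 1) and P(y = 1 | x) = logistic(theta0 * x)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < logistic(theta0 * x) else 0
        sample.append((y, x))
    return sample


def average_score(theta, sample, score):
    return sum(score(y, logistic(theta * x)) for y, x in sample) / len(sample)


def golden_section_max(objective, lo, hi, tol=1e-6):
    """Maximize a unimodal function of one variable on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = objective(c), objective(d)
    while b - a > tol:
        if fc > fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = objective(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = objective(d)
    return 0.5 * (a + b)


def estimate(sample, score, lo=0.0, hi=3.0):
    """Scoring-rule estimator: maximize the average score over theta."""
    return golden_section_max(lambda t: average_score(t, sample, score), lo, hi)


THETA0 = 1.0
sample = simulate_logit(THETA0, n=4000, seed=0)
theta_log = estimate(sample, log_score)
theta_brier = estimate(sample, half_brier)
print("log score estimate:", round(theta_log, 3))
print("half Brier estimate:", round(theta_brier, 3))
```

Both estimates should lie close to the true parameter, differing only through sampling noise; repeating the simulation over many seeds would show the efficiency difference between the two estimators.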
When the model is not correctly specified, there is no reason that $\theta^*$ should be the same over different scoring rules. In practice, the values of $\theta^*$ will differ, and then so will the estimated conditional probability, even asymptotically. Hence, decisions made for any particular loss function for the decision maker (value for $c$) will also differ across scoring rules. Scoring rules placing more weight on high values of $c$ will provide probability forecasts that are most useful for decision makers with high values of $c$, and vice versa. In order to attain (asymptotic) optimality, the scoring rule chosen for estimating the parameters of the model should match the scoring rule used to evaluate the probability forecast. Under the conditions above, the magnitude of this effect depends on how $\theta^*$ varies with the choice of scoring rule. Since both scoring rules and models tend to be very nonlinear, this relationship will generally be complex. The analytical example in section IIC and the Monte Carlo results in section III provide evidence on this issue.
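The dependence of the limiting parameter on the scoring rule can be seen in a deliberately simple population-level sketch. The data-generating process below is hypothetical and chosen only so that the computation is transparent: X takes the values 1 and 2 with equal probability, the true probabilities are 0.5 and 0.9, and the fitted model is a logit without intercept, which cannot match both probabilities. The function names and the grid-plus-refinement search are likewise illustrative choices.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_score(y, p):
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)


def half_brier(y, p):
    return -0.5 * (y - p) ** 2


# Hypothetical DGP (an assumption for illustration only):
# each entry is (x, P(X = x), true P(Y = 1 | X = x)).
SUPPORT = [(1.0, 0.5, 0.5), (2.0, 0.5, 0.9)]


def expected_score(theta, score):
    """Population expected score of the model p(x * theta) = logistic(x * theta)."""
    total = 0.0
    for x, weight, p0 in SUPPORT:
        p = logistic(theta * x)
        total += weight * (p0 * score(1, p) + (1.0 - p0) * score(0, p))
    return total


def golden_section_max(objective, lo, hi, tol=1e-8):
    """Maximize a unimodal function of one variable on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = objective(c), objective(d)
    while b - a > tol:
        if fc > fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = objective(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = objective(d)
    return 0.5 * (a + b)


def limiting_theta(score, lo=-3.0, hi=3.0, step=0.01):
    """Coarse grid search followed by a local golden-section refinement."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    best = max(grid, key=lambda t: expected_score(t, score))
    return golden_section_max(lambda t: expected_score(t, score), best - step, best + step)


theta_log = limiting_theta(log_score)
theta_brier = limiting_theta(half_brier)
for name, theta in (("log", theta_log), ("half Brier", theta_brier)):
    fitted = [round(logistic(theta * x), 4) for x, _, _ in SUPPORT]
    print(name, round(theta, 4), fitted)
```

Whether the gap between the two sets of fitted probabilities matters depends on the decision maker: it changes a decision only for those whose cutoff c falls between the two fitted values at the relevant x.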
In choosing between scoring rules, a forecaster needs to trade off the loss from using a scoring rule other than the log score under correct specification with the gains this approach brings when the model is misspecified. The first consideration then is how plausible it would be to assume that the model is correctly specified. In most applications, especially those unmotivated by any underlying economic or scientific theory, this would be a difficult assumption to make. Nonetheless, it would generally be considered, and the answer is specific to the forecasting problem.

THE REVIEW OF ECONOMICS AND STATISTICS
The second consideration is how large the gains are from using the matching strategy under misspecification of the model.

C. Misspecification: An Analytical Example
In order to illustrate how the choice of scoring rule matters, we give an analytical example in which we examine the effect of trading off between two specific scoring rules on $\theta^*$. The scoring rules we consider, the log and As1 scores, are given in table 1. The former is the log likelihood; the latter is an asymmetric scoring rule that emphasizes a better fit for smaller probabilities versus larger probabilities. We can write a composite scoring rule indexed by $\lambda \in [0, 1]$ that nests both of them:

$$S_\lambda(y, p(x'\theta)) = (1 - \lambda)\,S^{\log}(y, p(x'\theta)) + \lambda\,S^{\mathrm{As1}}(y, p(x'\theta)), \tag{7}$$

where $S^{\log}$ and $S^{\mathrm{As1}}$ denote the log and As1 scores. For $\lambda = 0$, $S_0(y, p(x'\theta))$ gives the log score. For $\lambda = 1$, $S_1(y, p(x'\theta))$ is the As1 score. For all $\lambda \in [0, 1]$, $S_\lambda(y, p(x'\theta))$ is a proper scoring rule, since propriety carries over to convex combinations of two proper scoring rules. Let $\theta^*$ denote the maximizer of the objective function defined by the scoring rule, with

$$\theta^* = \arg\max_{\theta \in \Theta} E_{X,Y}\big[S_\lambda(Y, p(X'\theta))\big].$$

We can write the conditional expectation in the above objective function as follows:

$$E\big[S_\lambda(Y, p(x'\theta)) \mid X = x\big] = p_0[x]\,S_\lambda(1, p(x'\theta)) + (1 - p_0[x])\,S_\lambda(0, p(x'\theta)).$$

For simplicity, we assume in the following that $X$ is scalar. As demonstrated in section A of the online appendix, extending the example to include an intercept is possible but appears to complicate the analysis without a compensating gain in insight.
The probit and logit link functions $p(\cdot)$ are by far the most common choices in the literature.6 Koenker and Yoon (2009) survey several other choices. Here, we do not make specific assumptions about $p$, except that it does not depend on estimands other than $\theta$. Assuming sufficient regularity conditions to apply the dominated convergence theorem (DCT), the first-order condition for a maximum is

$$\int \frac{\partial}{\partial \theta} E_{Y|X=x}\left[S_\lambda(Y, p(x\theta))\right]\Big|_{\theta = \theta^*} f_X(x)\, dx = 0,$$

where $f_X(\cdot)$ is the unconditional probability density function (pdf) of $X$. Computing the expression by the chain rule and suppressing the argument $x$, we obtain

$$\int \left[ p_0\, \frac{\partial S_\lambda(1, p)}{\partial p} + (1 - p_0)\, \frac{\partial S_\lambda(0, p)}{\partial p} \right] p'\, x\, f_X\, dx = 0, \tag{8}$$

where $p \equiv p(x\theta^*)$, $p_0 \equiv p_0[x]$ is the true conditional probability, and $p' \equiv \partial p(z)/\partial z\,|_{z = x\theta^*}$. The first-order condition in equation (8) gives a highly nonlinear implicit characterization of the limiting parameter estimate $\theta^*$. However, our setting (involving a single parameter $\lambda$ to characterize the employed scoring rule) nevertheless allows us to analyze how $\theta^*$ varies across scoring rules. Again assuming applicability of the DCT and using implicit differentiation, we obtain

$$\frac{\partial \theta^*}{\partial \lambda} = - \left. \frac{\partial^2 E_{X,Y}\left[S_\lambda(Y, p(X\theta))\right] / \partial\theta\, \partial\lambda}{\partial^2 E_{X,Y}\left[S_\lambda(Y, p(X\theta))\right] / \partial\theta^2} \right|_{\theta = \theta^*}. \tag{9}$$

Equation (9) makes explicit how a change in the scoring rule, which is expressed here by differentiating with respect to $\lambda$, affects the probability limit $\theta^*$ of the parameter estimator. The denominator of the above expression is always negative because it is the second derivative of the objective function, $E_{X,Y}[S_\lambda(Y, p(X\theta))]$, evaluated at $\theta^*$. However, the sign of the numerator may change depending on the truth, the model, and the density of $X$. Under correct specification ($p_0[X] = p(X\theta_0)$), the numerator in equation (9) becomes 0, and we get $\partial\theta^*/\partial\lambda = 0$. This result holds irrespective of the link function $p$ and the distribution $f_X(x)$. It mirrors the fact that the composite scoring rule in equation (7) is strictly proper for any $\lambda \in [0, 1]$. This implies that $\theta^* = \theta_0$ regardless of the value of $\lambda$. Hence, the choice of scoring rule is irrelevant under correct specification, at least in terms of the probability limit of the parameter estimator.

6 In the probit case, $p(\cdot)$ is the cumulative distribution function (cdf) of a standard normal variable. For the logit, $p(z) = [1 + \exp(-z)]^{-1}$ is the cdf of a standard logistic variable.
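A small numerical experiment illustrates how the limiting parameter responds to $\lambda$. The sketch below computes the maximizer of the expected composite score by grid search for a scalar logit model without intercept, once when the truth lies inside the model and once when it does not. Everything specific in it is an assumption made for illustration: the two-point distribution of $X$, the true probabilities 0.9 and 0.6, and the use of the Brier score in place of As1.

```python
import math

def logistic(z):
    """Logit link: cdf of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

def log_score(y, p):
    return y * math.log(p) + (1 - y) * math.log(1 - p)

def brier_score(y, p):
    return -(y - p) ** 2  # sign flipped so that larger is better

def composite(lam):
    """Composite rule: (1 - lam) * log score + lam * Brier score."""
    return lambda y, p: (1 - lam) * log_score(y, p) + lam * brier_score(y, p)

def expected_score(score, theta, support, p0):
    """E_X E_{Y|X}[score] for the scalar logit model p(x * theta)."""
    total = 0.0
    for x, weight in support:
        p = logistic(x * theta)
        total += weight * (p0(x) * score(1, p) + (1 - p0(x)) * score(0, p))
    return total

def theta_star(score, support, p0, lo=-3.0, hi=3.0, steps=6000):
    """Grid-search maximizer of the expected score over theta in [lo, hi]."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: expected_score(score, t, support, p0))

support = [(1.0, 0.5), (2.0, 0.5)]                 # hypothetical two-point law of X
correct = lambda x: logistic(1.0 * x)              # truth inside the model, theta_0 = 1
misspecified = lambda x: {1.0: 0.9, 2.0: 0.6}[x]   # no single theta matches both values

for lam in (0.0, 0.5, 1.0):
    print(lam,
          theta_star(composite(lam), support, correct),
          theta_star(composite(lam), support, misspecified))
```

In the correctly specified column the maximizer stays at the true value for every $\lambda$, whereas in the misspecified column it moves with $\lambda$.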
Under misspecification, it still holds that the denominator of equation (9) is always negative. Hence, the numerator will determine the sign of $\partial\theta^*/\partial\lambda$. We first examine when this derivative is 0 (i.e., the choice of scoring rule does not matter) in the following theorem.

Table notes: $f_X$ denotes the pdf of $X$, $F(s) = \exp(s)/(1 + \exp(s))$ denotes the cdf of the logistic distribution, and $U(a, b)$ denotes the uniform distribution with limits $a$ and $b$. DGP #1 is taken from Elliott and Lieli (2013) and fulfills conditions a-c of theorem 2. a Indicates that a DGP violates condition c of theorem 2. b Indicates that a DGP violates condition a of theorem 2.

of the scoring rule has no effect on the conditional probability. This is true not only for the choice between the log and As1 scores, as shown in theorem 2, but also holds for all other scoring rules we examine. The plot clearly shows other implications of theorem 2: the prediction error is point-symmetric about the origin, the prediction error changes its sign on $X^+$ and $X^-$, and a weighted average of the positive and negative prediction error on $X^+$ as well as $X^-$ would be equal, as indicated by equation (10). For all the other DGPs, where condition a, condition c, or both are violated, our numerical results clearly show that the choice of scoring rule has an effect on the conditional probability approximation.

Figure 2: The plots show the predicted conditional probability, whereby the parameter $\theta^*_j$ is computed using a sample of 1,000,000 draws.

Figure 3: The plots show the classification curves of the conditional probability given $x = 2$ ($\theta^*_j$ is computed using a sample of size 1,000,000).
For DGP #2, where only the symmetry condition on the true conditional probability is violated, the only scoring rules that result in different predicted conditional probabilities than the log score are the asymmetric scoring rules. For DGPs #3 and #4, we observe differences in the predicted probabilities for all pairs of scoring rules, even if both scoring rules under comparison are symmetric (such as the log versus Brier score).

As discussed earlier, the binary action of an individual decision maker (such as a fisherman or restaurant owner in our example) is determined by whether the predicted probability, $F(\theta^*_j X)$, exceeds the threshold $c$. For a given value of the regressor $X$, the chosen action may thus depend on the scoring rule $j$ used for parameter estimation. Figure 3 illustrates this point for $x = 2$. It shows that the scoring rules generally yield different classification curves.

Rather than looking at a single design point (such as $x = 2$ above), we next consider a broader summary measure of differences between scoring rules, which is the probability (computed over the distribution of $X$) that scoring rule $j$ implies a different binary action than the log score. Figure 4 shows that the choice of scoring rule under DGP #1 is inconsequential, in the sense of almost always leading to identical decisions. For DGP #2, the choice between the log and the two asymmetric rules is the only one that leads to different classifications. For $c = 0.6$ and $c = 0.7$, the probability of different classifications is 0.05 and 0.12, respectively. For DGPs #3 and #4, the choice between the log and any other scoring rule leads to different classifications. What is particularly interesting here is that even though choosing between the log and other symmetric rules, such as Brier, may be relatively inconsequential for $c = 0.6$, it can lead to probabilities of different classifications of 0.15 and 0.175 at greater values of $c$ for DGPs #3 and #4, respectively. Thus, whether the differences across scoring rules matter depends on a decision maker's preferences as embodied in $c$.
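The summary measure described above can be sketched in a few lines of Python. The code computes the probability that two parameter values imply different binary actions at threshold $c$, using a logistic link. It is a hedged illustration: the discrete five-point distribution of $X$ and the two parameter values are made up, and the function names are not taken from the paper.

```python
import math

def logistic(z):
    """Cdf of the standard logistic distribution, F(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def action(theta, x, c):
    """Binary action: 1 if the predicted probability F(theta * x) exceeds c."""
    return 1 if logistic(theta * x) > c else 0

def prob_different_action(theta_j, theta_log, c, support):
    """Probability, over a discrete law of X, that the two fits disagree.

    `support` is a list of (x, probability) pairs.
    """
    return sum(weight for x, weight in support
               if action(theta_j, x, c) != action(theta_log, x, c))

# Hypothetical inputs: five equally likely regressor values, two limit values.
support = [(-2.0, 0.2), (-1.0, 0.2), (0.0, 0.2), (1.0, 0.2), (2.0, 0.2)]
for c in (0.5, 0.6, 0.7):
    print(c, prob_different_action(theta_j=1.0, theta_log=0.5, c=c, support=support))
```

With a continuous regressor, the sum over `support` would be replaced by an average over simulated draws of $X$.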
Figure 4: The plots show the unconditional probability of different classifications using the log score as opposed to other scoring rules; see equation (11). $\theta^*_j$ is computed using a sample of size 1,000,000.

As a further check, figure 5 provides evidence on the probability that the estimator defined by scoring rule $j$ delivers a correct classification. This probability is given by equation (12).

Put slightly differently, this number represents the probability that a decision maker with preference parameter $c$ makes a correct decision when using a prediction model fitted via scoring rule $j$. Figure 5 shows how the ranking of the scoring rules differs across thresholds $c$. For example, consider the comparison of As1 and As2 in the figure for DGP #2 (upper-right panel): While As1 performs better for values of $c$ between 0.2 and 0.4, the reverse is true for $c$ between 0.6 and 0.8. This result is closely in line with the fact that, when used as an estimation criterion, As1 places an emphasis on fitting small thresholds $c$ correctly, whereas As2 focuses on high values of $c$ (see section IIA). This analysis demonstrates how forecasters who are willing to favor a certain clientele (say, decision makers characterized by small values of $c$, such as the fishermen in our example) can achieve this goal by issuing predictions based on an appropriate scoring rule (in this case, As1).
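The following Python sketch shows one way to compute such a probability of correct classification. It rests on an assumption that should be stated loudly: a decision is counted as "correct" when the action implied by the fitted model agrees with the action implied by the true conditional probability at the same threshold $c$. The discrete distribution of $X$, the true parameter value, and all names are hypothetical.

```python
import math

def logistic(z):
    """Cdf of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

def prob_correct_action(theta, c, support, p0):
    """Probability that the fitted and the true probability imply the same action.

    Assumption: "correct" means agreement with the action based on p0(x).
    """
    total = 0.0
    for x, weight in support:
        fitted_action = logistic(theta * x) > c
        ideal_action = p0(x) > c
        if fitted_action == ideal_action:
            total += weight
    return total

# Hypothetical setup: five equally likely regressor values, truth with theta = 1.
support = [(-2.0, 0.2), (-1.0, 0.2), (0.0, 0.2), (1.0, 0.2), (2.0, 0.2)]
true_prob = lambda x: logistic(1.0 * x)

for theta in (0.5, 1.0):
    row = [round(prob_correct_action(theta, c, support, true_prob), 3)
           for c in (0.3, 0.5, 0.7)]
    print(theta, row)
```

Evaluating the function on a grid of thresholds for several fitted parameters gives curves of the kind compared across scoring rules.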

B. Finite-Sample Results
Figure 5: The plots show the probability of correct classification (see equation [12]) for various thresholds $c$. $\theta^*_j$ is computed using a sample of size 1,000,000.

All of our results until now are for the case in which the limiting parameter values $\theta^*_j$ are known. We now briefly turn to the effects of sampling uncertainty. Specifically, we consider a rolling window estimation scheme for $\theta^*_j$, which is popular in practice (see the discussion by Giacomini & White, 2006, p. 1548), using a window length of 120. Furthermore, we consider a forecast evaluation period of 100 periods.9 In each Monte Carlo iteration, we thus simulate 120 + 100 observations. For the first rolling window, we use observations 1 to 120 to estimate the parameter $\theta$ and make a forecast for observation 121. For the second rolling window, we use observations 2 to 121 for estimation and make a forecast for observation 122, and so forth. The online appendix reports variants of figures 2 and 3 for the rolling window case, which we construct by averaging the probability and classification curves for each estimate of $\theta$. These figures show that on average, the rolling window parameter estimates are very similar to their asymptotic counterparts.

9 These sample sizes are typical in forecasting studies using quarterly macroeconomic data, for example, when using an estimation sample from 1960 to 1989 (30 × 4 = 120 observations) and an evaluation sample from 1990 to 2014 (25 × 4 = 100 observations).
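The rolling window scheme described above can be sketched as follows in Python. The estimator and the forecast rule are passed in as callbacks; `fit` and `predict` are hypothetical names, and the toy estimator in the usage lines (the sample frequency of $Y = 1$) only serves to make the example runnable.

```python
def rolling_forecasts(y, x, fit, predict, window=120):
    """One-step-ahead forecasts from a rolling estimation window.

    `fit(y_window, x_window)` returns a parameter estimate and
    `predict(theta, x_next)` returns a probability forecast. Both are
    hypothetical callbacks supplied by the user.
    """
    forecasts = []
    for start in range(len(y) - window):
        stop = start + window                      # window covers indices [start, stop)
        theta = fit(y[start:stop], x[start:stop])  # re-estimate on the current window
        forecasts.append(predict(theta, x[stop]))  # forecast the next observation
    return forecasts

# Toy usage with 120 + 100 made-up observations (0-based indices).
y = [i % 2 for i in range(220)]
x = [0.0] * 220
frequency_fit = lambda y_window, x_window: sum(y_window) / len(y_window)
constant_predict = lambda theta, x_next: theta

forecasts = rolling_forecasts(y, x, frequency_fit, constant_predict)
print(len(forecasts), forecasts[0])
```

With 0-based indexing, the first window covers indices 0 to 119 and targets index 120, which corresponds to observations 1 to 120 and observation 121 in the text.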
Theorem 1 implies that in the asymptotic case, it is generally optimal to use the same scoring rule for estimation and evaluation. We next analyze to what extent this statement carries over to the rolling window scenario. To this end, table 4 summarizes the parameter estimates and predictive performance obtained under each scoring rule. The median estimates for each scoring rule (upper panel of table 4) are very close to their asymptotic limits in table 3. This is in line with the similarity of the prediction and classification curves noted above. Estimators defined by alternative scoring rules clearly differ in their sampling variability, as measured by interdecile ranges (middle panel of table 4), especially for DGPs #3 and #4.10 That said, there is no simple relationship between the choice of scoring rule and the variability of the estimator it defines. For example, the spherical score defines the most precise estimator for DGP #1, whereas it defines the (by far) least precise estimator for DGP #4.

10 We use interdecile ranges, rather than variances, to eliminate the effect of outliers, which are not surprising given the scale of our Monte Carlo.
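To make the variability measure concrete, here is a minimal Python sketch of an interdecile range, taken as the difference between the 90% and 10% quantiles. The quantile interpolation method and the made-up estimates are assumptions; the paper does not state which quantile definition it uses.

```python
import statistics

def interdecile_range(values):
    """Difference between the 90% and 10% quantiles of the values."""
    # quantiles(..., n=10) returns the nine cut points between deciles.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return deciles[-1] - deciles[0]

# Made-up parameter estimates with one extreme outlier.
estimates = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 1.0, 0.98, 1.02, 1.0, 25.0]
print(interdecile_range(estimates), statistics.pvariance(estimates))
```

The single outlier inflates the variance far more than the interdecile range, which is the motivation given in the footnote.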
The bottom panel of table 4 compares the forecast performance of two strategies: (a) using the same scoring rule for estimation and evaluation ("matching") and (b) simply using MLE for parameter estimation while using a different scoring rule for evaluation. In order to compare the performance of the two, we simply report the share of Monte Carlo iterations for which the first strategy performs better, whereby we average over an out-of-sample period of 100 observations in each Monte Carlo iteration. This share has a natural scaling between 0 and 1. It is therefore easily interpretable and comparable across scoring rules.11

Table 4 notes: a In each MC iteration, we compute the average score over an out-of-sample period of 100 observations. "Matching" means using the same scoring rule for estimation and evaluation.
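The share reported in the bottom panel can be computed as in the following Python sketch. It assumes positively oriented scores (larger is better) and counts ties as "not better"; the tie rule and the example numbers are assumptions made for illustration.

```python
def share_matching_better(scores_matching, scores_mle):
    """Share of Monte Carlo iterations in which matching beats MLE.

    Each list holds one average out-of-sample score per iteration,
    with larger scores being better. Ties count as "not better".
    """
    if len(scores_matching) != len(scores_mle):
        raise ValueError("both strategies need one score per iteration")
    wins = sum(m > e for m, e in zip(scores_matching, scores_mle))
    return wins / len(scores_matching)

# Made-up average scores from four Monte Carlo iterations.
matching = [-0.20, -0.25, -0.19, -0.30]
mle = [-0.21, -0.24, -0.20, -0.31]
print(share_matching_better(matching, mle))
```

A value above 0.5 indicates that matching outperforms MLE in more than half of the iterations.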
In thirteen of the twenty cases, shares in the bottom panel of table 4 are strictly above 0.5, indicating better performance of matching compared to MLE. For some of these cases, we find that matching leads to a substantially different median estimate as well as lower variability as measured by the interdecile range. This holds true for As1 under DGP #2, as well as Brier, Spherical, and As1 under DGP #3. There are also some cases where the matching estimator is more variable than MLE but nevertheless performs better out of sample. This happens for As2 under DGP #2, Boosting and As2 under DGP #3, as well as Spherical under DGP #4. In these cases, it seems that the relative gain from using an estimator that converges to the maximizer of the scoring rule in question outweighs the relative loss in precision. For another subset of the cases where matching performs better, such as Brier and Spherical under DGP #1, Boosting under DGP #2, and As2 under DGP #4, the medians and interdecile ranges of the MLE and matching estimator are practically indistinguishable. Our conjecture is that in these cases, the matching strategy's improvement over MLE is marginal. Along the same lines, the six cases in which MLE does better than matching appear very close, with both strategies attaining similar medians and interdecile ranges.
To summarize, our results show that under misspecification, the "correct location" of the matching estimator puts it at an advantage over the MLE, which generally does not converge to the maximizer of the scoring rule in question. To compensate for this, the MLE must be more precise (smaller interdecile range) in order to outperform matching in terms of out-of-sample scores.

IV. Conclusion
This paper explores the nuances in forecasting conditional probabilities under misspecification. The natural choice under correct specification, regardless of the scoring rule used for out-of-sample evaluation, is indeed MLE. It is not only consistent for the maximizer of the scoring rule in question but also efficient. Under misspecification, however, there is no clear natural choice. The MLE is neither consistent for the maximizer of the scoring rule in question nor necessarily "efficient" in the sense of attaining lower sampling variability than other estimators. The paper shows in an analytical example that under certain symmetry conditions, the choice of scoring rule is inconsequential for parameter estimation. With the aid of numerical results for the asymptotic problem, we then illustrate how the violation of these conditions can lead to different probability limits of the parameter estimators and different conditional probability forecasts. We also show how these different forecasts would lead to different actions by heterogeneous decision makers. In finite samples, we find an interesting relationship between the sampling distribution of the parameter estimators and the relative performance of the MLE (compared to the estimator that maximizes the scoring rule considered for evaluation).
Finally, our analysis has conceptual implications pertaining to the literature on distributional forecasting. It has been argued (Geweke & Amisano, 2011) that the provision of distributional forecasts is superior to the provision of point forecasts because distributional forecasts can be employed to construct point forecasts for any loss function. While this argument seems valid in many situations (see section I), it should not be misunderstood as saying that distributional forecasts are "loss function independent." Specifically, this paper illustrates that probability forecasts, which are clearly distributional, are not loss function independent. A loss function is required for estimation, and this choice makes explicit trade-offs regarding which aspects of the data to fit correctly, at the cost of neglecting other aspects.