# Philosophy of Statistics

*First published Tue Aug 19, 2014*

Statistics investigates and develops specific methods for evaluating hypotheses in the light of empirical facts. A method is called statistical, and thus the subject of study in statistics, if it relates facts and hypotheses of a particular kind: the empirical facts must be codified and structured into data sets, and the hypotheses must be formulated in terms of probability distributions over possible data sets. The philosophy of statistics concerns the foundations and the proper interpretation of statistical methods, their input, and their results. Since statistics is relied upon in almost all empirical scientific research, serving to support and communicate scientific findings, the philosophy of statistics is of key importance to the philosophy of science. It has an impact on the philosophical appraisal of scientific method, and on the debate over the epistemic and ontological status of scientific theory.

The philosophy of statistics harbors a large variety of topics and debates. Central to these is the problem of induction, which concerns the justification of inferences or procedures that extrapolate from data to predictions and general facts. Further debates concern the interpretation of the probabilities that are used in statistics, and the wider theoretical framework that may ground and justify the correctness of statistical methods. A general introduction to these themes is given in Section 1 and Section 2. Section 3 and Section 4 provide an account of how these themes play out in the two major theories of statistical method, classical and Bayesian statistics respectively. Section 5 directs attention to the notion of a statistical model, covering model selection and simplicity, but also discussing statistical techniques that do not rely on statistical models. Section 6 briefly mentions relations between the philosophy of statistics and several other themes from the philosophy of science, including confirmation theory, evidence, causality, measurement, and scientific methodology in general.

- 1. Statistics and induction
- 2. Foundations and interpretations
- 3. Classical statistics
- 4. Bayesian statistics
- 5. Statistical models
- 6. Related topics
- Bibliography
- Academic Tools
- Other Internet Resources
- Related Entries

## 1. Statistics and induction

Statistics is a mathematical and conceptual discipline that focuses on the relation
between data and hypotheses. The *data* are recordings of
observations or events in a scientific study, e.g., a set of
measurements of individuals from a population. The data actually
obtained are variously called the *sample*, the *sample
data*, or simply the *data*, and all possible samples from
a study are collected in what is called a *sample
space*. The *hypotheses*, in turn, are general
statements about the target system of the scientific study, e.g.,
expressing some general fact about all individuals in the population.
A *statistical hypothesis* is a general statement that can
be expressed by a probability distribution over sample space, i.e., it
determines a probability for each of the possible samples.

Statistical methods provide the mathematical and conceptual means to evaluate statistical hypotheses in the light of a sample. To this end the methods employ probability theory, and occasionally generalizations thereof. The evaluations may determine how believable a hypothesis is, whether we may rely on the hypothesis in our decisions, how strong the support is that the sample gives to the hypothesis, and so on. Good introductions to statistics abound (e.g., Barnett 1999, Mood and Graybill 1974, Press 2002).

To set the stage, an example taken from Fisher (1935) will be helpful.

The tea tasting lady.

Consider a lady who claims that she can, by taste, determine the order in which milk and tea were poured into the cup. Now imagine that we prepare five cups of tea for her, tossing a fair coin to determine the order of milk and tea in each cup. We ask her to pronounce the order, and we find that she is correct in all cases! Now if she is guessing the order blindly then, owing to the random way we prepare the cups, she will answer correctly 50% of the time. This is our statistical hypothesis, referred to as the null hypothesis. It gives a probability of \(1/2\) to a correct guess and hence a probability of \(1/2\) to an incorrect one. The sample space consists of all strings of answers the lady might give, i.e., all series of correct and incorrect guesses, but our actual data sit in a rather special corner of this space. On the assumption of our statistical hypothesis, the probability of the recorded events is a mere 3%, or \(1/2^{5}\) more precisely. On this ground, we may decide to reject the hypothesis that the lady is guessing.

According to the so-called *null hypothesis test*, such a
decision is warranted if the data actually obtained are included in a
particular region within sample space, whose total probability does
not exceed some specified limit, standardly set at 5%. Now consider
what is achieved by the statistical test just outlined. We started
with a hypothesis on the actual tea tasting abilities of the lady,
namely, that she did not have any. On the assumption of this
hypothesis, the sample data we obtained turned out to be surprising
or, more precisely, highly improbable. We therefore decided that the
hypothesis that the lady has no tea tasting abilities whatsoever can
be rejected. The sample points us to a negative but general conclusion
about what the lady can, or cannot, do.
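The arithmetic of the example can be retraced in a few lines of Python. The following is only a minimal sketch: the names `sample_space`, `prob_under_null` and `SIGNIFICANCE_LIMIT` are illustrative choices, and the region of rejection is assumed to consist of the single sample in which all five guesses are correct.

```python
from fractions import Fraction
from itertools import product

N_CUPS = 5

# Each sample is a tuple of five guesses: True = correct, False = incorrect.
sample_space = list(product([True, False], repeat=N_CUPS))

def prob_under_null(sample):
    """Probability of a sample if every guess is correct with chance 1/2."""
    return Fraction(1, 2) ** len(sample)

# The data actually obtained: all five cups identified correctly.
observed = (True,) * N_CUPS
p_observed = prob_under_null(observed)

# Conventional limit on the total probability of the region of rejection.
SIGNIFICANCE_LIMIT = Fraction(5, 100)
reject_null = p_observed <= SIGNIFICANCE_LIMIT

print(len(sample_space))   # 32
print(p_observed)          # 1/32
print(reject_null)         # True
```

Exact fractions are used so that the value \(1/2^{5}\) is not blurred by rounding.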

The basic pattern of a statistical analysis is thus familiar from
inductive inference: we input the data obtained thus far, and the
statistical procedure outputs a verdict or evaluation that transcends
the data, i.e., a statement that is not entailed by the data alone. If
the data are indeed considered to be the only input, and if the
statistical procedure is understood as an inference, then statistics
is concerned with *ampliative* inference: roughly speaking, we
get out more than we have put in. And since the ampliative inferences
of statistics pertain to future or general states of affairs, they are
inductive. However, the association of statistics with ampliative and
inductive inference is contested, both because statistics is
considered to be non-inferential by some (see
Section 3) and
non-ampliative by others (see
Section 4).

Despite such disagreements, it is insightful to view statistics as
a response to the problem of induction (cf. Howson 2000 and the
entry on the
problem of induction).
This problem, first discussed by Hume in his *Treatise of Human
Nature* (Book I, part 3, section 6) but prefigured already by
ancient sceptics like Sextus Empiricus (see the entry on
ancient skepticism),
is that there is no proper justification for
inferences that run from given experience to expectations about the
future. Transposed to the context of statistics, it reads that there
is no proper justification for procedures that take data as input and
that return a verdict, an evaluation, or some other piece of advice
that pertains to the future, or to general states of affairs. Arguably,
much of the philosophy of statistics is about coping with this
challenge, by providing a foundation of the procedures that statistics
offers, or else by reinterpreting what statistics delivers so as to
evade the challenge.

It is debatable whether philosophers of statistics are ultimately concerned with the delicate, even ethereal issue of the justification of induction. In fact, many philosophers and scientists accept the fallibility of statistics, and find it more important that statistical methods are understood and applied correctly. As is so often the case, the fundamental philosophical problem serves as a catalyst: the problem of induction guides our investigations into the workings, the correctness, and the conditions of applicability of statistical methods. The philosophy of statistics, understood as the general header under which these investigations are carried out, is thus not concerned with ephemeral issues, but presents a vital and concrete contribution to the philosophy of science, and to science itself.

## 2. Foundations and interpretations

While there is large variation in how statistical procedures and inferences are organized, they all agree on the use of modern measure-theoretic probability theory (Kolmogorov), or a near kin, as the means to express hypotheses and relate them to data. By itself, a probability function is simply a particular kind of mathematical function, used to express the measure of a set (cf. Billingsley 1995).

Let \(W\) be a set with elements \(s\), and consider an initial
collection of subsets of \(W\), e.g., the singleton sets \(\{ s
\}\). Now consider the operation of taking the complement \(\bar{R}\)
of a given set \(R\): the complement \(\bar{R}\) contains exactly and
all those \(s\) that are not included in \(R\). Next consider the join
\(R \cup Q\) given sets \(R\) and \(Q\): an element \(s\) is a member
of \(R \cup Q\) precisely when it is a member of \(R\), \(Q\), or
both. The collection of sets generated by the operations of complement
and join is called an
*algebra*, denoted \({\cal S}\). In statistics we interpret \(W\) as
the set of samples, and we can associate sets \(R\) with specific
events or observations. A specific sample \(s\) includes a record of
the event denoted with \(R\) exactly when \(s \in R\). We take the
algebra of sets like \(R\) as a language for making claims about the
samples.

A *probability function* is defined as an additive
normalized measure over the algebra: a function
\[ P: {\cal S} \rightarrow [0, 1] \]
such that \(P(R \cup Q) = P(R) + P(Q)\) if \(R \cap Q = \emptyset\)
and \(P(W) = 1\). The *conditional probability* \(P(Q \mid R)\)
is defined as
\[ P(Q \mid R) \; = \; \frac{P(Q \cap R)}{P(R)} , \]
whenever \(P(R) > 0\). It determines the relative size of the set
\(Q\) within the set \(R\). It is often read as the probability of the
event \(Q\) *given that* the event \(R\) occurs. Recall that
the set \(R\) consists of all samples \(s\) that include a record of
the event associated with \(R\). By looking at \(P(Q \mid R)\) we zoom
in on the probability function within this set \(R\), i.e., we
consider the condition that the associated event occurs.
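For a finite set of samples these definitions can be made concrete. The sketch below is illustrative only: it assumes a hypothetical set \(W\) of four samples (the outcomes of two guesses, with `c` for correct and `i` for incorrect), a uniform assignment of probability to the singletons, and helper names (`algebra_from_singletons`, `P`, `conditional`) that do not appear in the text.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy set of samples: two guesses, c = correct, i = incorrect.
W = frozenset({"cc", "ci", "ic", "ii"})

def algebra_from_singletons(w):
    """Close the singleton subsets of w under complement and join (union)."""
    sets = {frozenset({s}) for s in w}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        for r in current:
            complement = w - r
            if complement not in sets:
                sets.add(complement)
                changed = True
        for r, q in combinations(current, 2):
            join = r | q
            if join not in sets:
                sets.add(join)
                changed = True
    return sets

# Assumed uniform probability over the individual samples.
point_mass = {s: Fraction(1, len(W)) for s in W}

def P(event):
    """Additive, normalized measure: sum the masses of the samples in the event."""
    return sum((point_mass[s] for s in event), Fraction(0))

def conditional(q, r):
    """P(Q | R) = P(Q and R) / P(R), defined only when P(R) > 0."""
    if P(r) == 0:
        raise ValueError("conditional probability undefined when P(R) = 0")
    return P(q & r) / P(r)

algebra = algebra_from_singletons(W)
Q = frozenset({"cc", "ci"})          # first guess correct
R = frozenset({"cc", "ci", "ic"})    # at least one guess correct

print(len(algebra))        # 16
print(conditional(Q, R))   # 2/3
```

Because \(W\) is finite and the initial collection contains all singletons, the generated algebra is simply the collection of all subsets of \(W\).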

Now what does the probability function mean? The mathematical notion of probability does not provide an answer. The function \(P\) may be interpreted as

- *physical*, namely the frequency or propensity of the occurrence of a state of affairs, often referred to as the chance, or else as
- *epistemic*, namely the degree of belief in the occurrence of the state of affairs, the willingness to act on its assumption, a degree of support or confirmation, or similar.

This distinction should not be confused with that between objective and subjective probability. Both physical and epistemic probability can be given an objective and subjective character, in the sense that both can be taken as dependent or independent of a knowing subject and her conceptual apparatus. For more details on the interpretation of probability, the reader is invited to consult Galavotti (2005), Gillies (2000), Mellor (2005), von Plato (1994), the anthology by Eagle (2010), the handbook of Hajek and Hitchcock (forthcoming), or indeed the entry on interpretations of probability. In this context the key point is that the interpretations can all be connected to foundational programmes for statistical procedures. Although the match is not exact, the two major types specified above can be associated with the two major theories of statistics, classical and Bayesian statistics, respectively.

### 2.1 Physical probability and classical statistics

In the sciences, the idea that probabilities express physical states
of affairs, often called chances or stochastic processes, is most
prominent. They are relative *frequencies* in series of events
or, alternatively, they are tendencies or *propensities* in the
systems that realize those events. More precisely, the probability
attached to the property of an event type can be understood as the
frequency or tendency with which that property manifests in a series
of events of that type. For instance, the probability of a coin
landing heads is a half exactly when in a series of similar coin
tosses, the coin lands heads half the time. Or alternatively, the
probability is half if there is an even tendency towards both possible
outcomes in the setup of the coin tossing. The mathematician Venn
(1888) and scientists like Quetelet and Maxwell (cf. von Plato 1994)
are early proponents of this way of viewing probability. Philosophical
theories of propensities were first coined by Peirce (1910), and
developed by Popper (1959), Mellor (1971), Bigelow (1977), and Giere
(1976); see Handfield (2012) for a recent overview. A rigorous theory
of probability as frequency was first devised by von Mises (1981),
also defended by Reichenbach (1938) and beautifully expounded in
van Lambalgen (1987).

The notion of physical probability is connected to one of the major
theories of statistical method, which has come to be called
*classical statistics*. It was developed roughly in the first
half of the 20th century, mostly by mathematicians and working
scientists like Fisher (1925, 1935, 1956), Wald (1939, 1950), Neyman
and Pearson (1928, 1933, 1967), and refined by very many classical
statisticians of the last few decades. The key characteristic of this
theory of statistics aligns naturally with viewing probabilities as
physical chances, hence pertaining to observable and repeatable
events. Physical probability cannot meaningfully be attributed to
statistical hypotheses, since hypotheses do not have tendencies or
frequencies with which they come about: they are categorically true or
false, once and for all. Attributing probability to a hypothesis seems to entail that
the probability is read epistemically.

Classical statistics is often called *frequentist*, owing to
the centrality of frequencies of events in classical procedures and
the prominence of the frequentist interpretation of probability
developed by von Mises. In this interpretation, chances are
frequencies, or proportions in a class of similar events or
items. They are best thought of as analogous to other physical
quantities, like mass and energy. It deserves emphasis that
frequencies are thus conceptually prior to chances. In propensity
theory the probability of an individual event or item is viewed as a
tendency in nature, so that the frequencies, or the proportions in a
class of similar events or items, manifest as a consequence of the law
of large numbers. In the frequentist theory, by contrast, the
proportions lay down, indeed define, what the chances are. This
leads to a central problem for frequentist probability, the
so-called *reference class problem*: it is not clear what
class to associate with an individual event or item (cf. Reichenbach
1949, Hajek 2007). One may argue that the class needs to be as narrow
as it can be, but in the extreme case of a singleton class of events,
the chances of course trivialize to zero or one. Since classical
statistics employs non-trivial probabilities that attach to the single
case in its procedures, a fully frequentist understanding of
statistics is arguably in need of a response to the reference class
problem.

To illustrate physical probability, we briefly return to the example of the tea tasting lady.

Physical probability

We denote the null hypothesis that the lady is merely guessing by \(h\). Say that we follow the rule indicated in the example above: we reject this null hypothesis, i.e., deny that the lady is merely guessing, whenever the sampled data \(s\) is included in a particular set \(R\) of possible samples, so \(s \in R\), where \(R\) has a summed probability of 5% according to the null hypothesis. Now imagine that we are supposed to judge a whole population of tea tasting ladies, scattered in tea rooms throughout the country. Then, by running the experiment and adopting the rule just cited, we know that we will falsely attribute special tea tasting talents to 5% of those ladies for whom the null hypothesis is true, i.e., who are in fact merely guessing. In other words, this percentage pertains to the physical probability of a particular set of events, which by the rule is connected to a particular error in our judgment.

Now say that we have found a lady for whom we reject the null hypothesis, i.e., a lady who passes the test. Does she have the tea tasting ability or not? Unfortunately this is not the sort of question that can be answered by the test at hand. A good answer would presumably involve the proportion of ladies who indeed have the special tea tasting ability among those whose scores exceeded a certain threshold, i.e., those who answered correctly on all five cups. But this latter proportion, namely of ladies for whom the null hypothesis is false among all those ladies who passed the test, differs from the proportion of ladies who passed the test among those ladies for whom it is false. It will depend also on the proportion of ladies who have the ability in the population under scrutiny. The test, by contrast, only involves proportions within a group of ladies for whom the null hypothesis is true: we can only consider probabilities for particular events on the assumption that the events are distributed in a given way.
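The dependence on the composition of the population can be made vivid with a small computation. The sketch below rests on assumptions that are not part of the test itself: ladies who have the ability are taken to guess each cup correctly with chance 3/4, passing the test means answering correctly on all five cups, and the base rates are purely hypothetical.

```python
from fractions import Fraction

# Chance of passing (five out of five correct) for a lady who merely guesses.
P_PASS_IF_GUESSING = Fraction(1, 2) ** 5
# Chance of passing for a lady with the ability; the accuracy 3/4 is an assumption.
P_PASS_IF_ABLE = Fraction(3, 4) ** 5

def share_able_among_passers(base_rate):
    """Proportion of ladies with the ability among all ladies who pass the test."""
    able_and_pass = base_rate * P_PASS_IF_ABLE
    guessing_and_pass = (1 - base_rate) * P_PASS_IF_GUESSING
    return able_and_pass / (able_and_pass + guessing_and_pass)

# Hypothetical proportions of able ladies in the population under scrutiny.
for base_rate in (Fraction(1, 100), Fraction(1, 10), Fraction(1, 2)):
    share = share_able_among_passers(base_rate)
    print(f"base rate {float(base_rate):.2f}: "
          f"share able among passers {float(share):.3f}")
```

The error rate of the test is the same in all three populations, yet the proportion of genuinely able ladies among those who pass varies widely with the base rate.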

### 2.2 Epistemic probability and statistical theory

There is an alternative way of viewing the probabilities that appear in statistical methods: they can be seen as expressions of epistemic attitudes. We are again facing several interrelated options. Very roughly speaking, epistemic probabilities can be doxastic, decision-theoretic, or logical.

#### 2.2.1 Types of epistemic probability

Probabilities may be taken to represent *doxastic* attitudes
in the sense that they specify opinions about data and hypotheses of
an idealized rational agent. The probability then expresses the strength
or degree of belief, for instance regarding the correctness of the
next guess of the tea tasting lady. They may also be taken as
*decision-theoretic*, i.e., as part of a more elaborate
representation of the agent, which determines her dispositions towards
decisions and actions about the data and the hypotheses. Oftentimes a
decision-theoretic representation involves doxastic attitudes
alongside preferential and perhaps other ones. In that case, the
probability may for instance express a willingness to bet on the lady
being correct. Finally, the probabilities may be taken as
*logical*. More precisely, a probabilistic model may be
taken as a logic, i.e., a formal representation that fixes a normative
ideal for uncertain reasoning. According to this latter option,
probability values over data and hypotheses have a role that is
comparable to the role of truth values in deductive logic: they serve
to secure a notion of valid inference, without carrying the suggestion
that the numerical values refer to anything psychologically
salient.

The epistemic view on probability came into development in the 19th and the first half of the 20th century, first by the hand of De Morgan (1847) and Boole (1854), later by Keynes (1921), Ramsey (1926) and de Finetti (1937), and by decision theorists, philosophers and inductive logicians such as Carnap (1950), Savage (1962), Levi (1980), and Jeffrey (1992). Important proponents of these views in statistics were Jeffreys (1961), Edwards (1972), Lindley (1965), Good (1983), Jaynes (2003) as well as very many Bayesian philosophers and statisticians of the last few decades (e.g., Goldstein 2006, Kadane 2011, Berger 2006, Dawid 2004). All of these have a view that places probabilities somewhere in the realm of the epistemic rather than the physical, i.e., not as part of a model of the world but rather as a means to model a representing system like the human mind.

The above division is certainly not complete and it is blurry at the edges. For one, the doxastic notion of probability has mostly been spelled out in a behaviorist manner, with the help of decision theory. Many have adopted so-called Dutch book arguments to make the degree of belief precise, and to show that it is indeed captured by the mathematical theory of probability (cf. Jeffrey 1992). According to such arguments, the degree of belief in the occurrence of an event is given by the price of a betting contract that pays out one monetary unit if the event manifests. However, there are alternatives to this behaviorist take on probability as doxastic attitude, using accuracy or proximity to the truth. Most of these are versions or extensions of the arguments proposed by de Finetti (1974). Others have developed an axiomatic approach based on natural desiderata for degrees of belief (e.g., Cox 1961).

Furthermore, and as alluded to above, within the doxastic conception of probability we can make a further subdivision into subjective and objective doxastic attitudes. The defining characteristic of an objective doxastic probability is that it is constrained by the demand that the beliefs are calibrated to some objective fact or state of affairs, or else by further rationality criteria. A subjective doxastic attitude, by contrast, is not constrained in such a way: from a normative perspective, agents are free to believe as they see fit, as long as they comply with the probability axioms.

#### 2.2.2 Statistical theories

For present concerns the important point is that each of these
epistemic interpretations of the probability calculus comes with its
own set of foundational programs for statistics. On the whole,
epistemic probability is most naturally associated with *Bayesian
statistics*, the second major theory of statistical methods (Press
2002, Berger 2006, Gelman et al 2013). The key
characteristic of Bayesian statistics flows directly from the
epistemic interpretation: under this interpretation it becomes
possible to assign probability to a statistical hypothesis and to
relate this probability, understood as an expression of how strongly
we believe the hypothesis, to the probabilities of events. Bayesian
statistics allows us to express how our epistemic attitudes towards a
statistical hypothesis, be they logical, decision-theoretic, or
doxastic, change under the impact of data.

To illustrate the epistemic conception of probability in Bayesian statistics, we briefly return to the example of the tea tasting lady.

Epistemic probability

As before we denote the null hypothesis that the lady is guessing randomly with \(h\), so that the distribution \(P_{h}\) gives a probability of 1/2 to any guess made by the lady. The alternative \(h'\) is that the lady performs better than a fair coin. More precisely, we might stipulate that the distribution \(P_{h'}\) gives a probability of 3/4 to a correct guess. At the outset we might find it rather improbable that the tea tasting lady has special tea tasting abilities. To express this we give the hypothesis of her having these abilities only half the probability of her not having the abilities: \(P(h') = 1/3\) and \(P(h) = 2/3\). Now, leaving the mathematical details to Section 4.1, after receiving the data that she guessed all five cups correctly, our new belief in the lady's special abilities has more than reversed. We now think it roughly four times more probable that the lady has the special abilities than that she is merely a random guesser: \(P(h') = 243/307 \approx 4/5\) and \(P(h) \approx 1/5\).

The take-home message is that the Bayesian method allows us to express our epistemic attitudes to statistical hypotheses in terms of a probability assignment, and that the data impact on this epistemic attitude in a regulated fashion.
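The numbers in this example can be checked with a short computation. The following sketch applies Bayes' theorem to the two hypotheses; the labels `h` and `h_alt` and the function names are illustrative choices, and the priors and accuracies are those stipulated in the example.

```python
from fractions import Fraction

# Prior probabilities and per-cup accuracies as stipulated in the example.
prior = {"h": Fraction(2, 3), "h_alt": Fraction(1, 3)}
accuracy = {"h": Fraction(1, 2), "h_alt": Fraction(3, 4)}

def likelihood(theta, n_correct, n_total):
    """Probability of one particular sequence with n_correct correct guesses."""
    return theta ** n_correct * (1 - theta) ** (n_total - n_correct)

def posterior(prior, accuracy, n_correct, n_total):
    """Bayes' theorem: weigh each prior by its likelihood, then normalize."""
    joint = {k: prior[k] * likelihood(accuracy[k], n_correct, n_total)
             for k in prior}
    evidence = sum(joint.values())
    return {k: v / evidence for k, v in joint.items()}

post = posterior(prior, accuracy, n_correct=5, n_total=5)
print(post["h_alt"])                      # 243/307
print(post["h"])                          # 64/307
print(float(post["h_alt"] / post["h"]))   # 3.796875
```

The posterior odds of 243 to 64 are what the example describes as roughly four to one in favour of the lady's special abilities.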

It should be emphasized that Bayesian statistics is not the sole user of an epistemic notion of probability. Indeed, a frequentist understanding of probabilities assigned to statistical hypotheses seems nonsensical. But it is perfectly possible to read the probabilities of events, or elements in sample space, as epistemic, quite independently of the statistical method that is being used. As further explained in the next section, several philosophical developments of classical statistics employ epistemic probability, most notably fiducial probability (Fisher 1955 and 1956; see also Seidenfeld 1992 and Zabell 1992), likelihoodism (Hacking 1965, Edwards 1972, Royall 1997), and evidential probability (Kyburg 1961), or connect the procedures of classical statistics to inference and support in some other way. In all these developments, probabilities and functions over sample space are read epistemically, i.e., as expressions of the strength of evidence, the degree of support, or similar.

## 3. Classical statistics

The collection of procedures that may be grouped under classical
statistics is vast and multi-faceted. By and large, classical
statistical procedures share the feature that they only rely on
probability assignments over sample spaces. As indicated, an important
motivation for this is that those probabilities can be interpreted as
frequencies, from which the term *frequentist
statistics* originates. Classical statistical procedures are
typically defined by some function over sample space, where this
function depends, often exclusively, on the distributions that the
hypotheses under consideration assign to the sample space. For the
range of samples that may be obtained, the function then points to one
of the hypotheses, or perhaps to a set of them, as being in some sense
the best fit with that sample. Or, conversely, it discards candidate
hypotheses that render the sample too improbable.

In sum, classical procedures employ the data to narrow down a set
of hypotheses. Put in such general terms, it becomes apparent that
classical procedures provide a response to the problem of
induction. The data are used to get from a weak general statement
about the target system to a stronger one, namely from a set of
candidate hypotheses to a subset of them. The central concern in the
philosophy of statistics is how we are to understand these procedures,
and how we might justify them. Notice that the pattern of classical
statistics resembles that of *eliminative induction*: in view
of the data we discard some of the candidate hypotheses. Indeed
classical statistics is often seen in loose association with Popper's
falsificationism, but this association is somewhat misleading. In
classical procedures statistical hypotheses are discarded when they
render the observed sample too improbable, which of course differs
from discarding hypotheses that deem the observed sample
impossible.

### 3.1 Basics of classical statistics

The foregoing already provided a short example and a rough sketch of classical statistical procedures. These are now specified in more detail, on the basis of Barnett (1999) as primary source. The following focuses on two very central procedures, hypothesis testing and estimation. The first has to do with the comparison of two statistical hypotheses, and invokes theory developed by Neyman and Pearson. The second concerns the choice of a hypothesis from a set, and employs procedures devised by Fisher. While these figures are rightly associated with classical statistics, their philosophical views diverge. We return to this below.

#### 3.1.1 Hypothesis testing

The procedure of Fisher's *null hypothesis test* was already
discussed briefly in the foregoing. Let \(h\) be the hypothesis of
interest and, for the sake of simplicity, let \(S\) be a finite sample
space. The hypothesis \(h\) imposes a distribution over the sample
space, denoted \(P_{h}\). Every point \(s\) in the space represents a
possible sample of data. We now define a function \(F\) on the sample
space that identifies when we will reject the null hypothesis by
marking the samples \(s\) that lead to rejection with \(F(s) = 1\), as
follows:
\[ F(s) = \begin{cases} 1 \quad \text{if } P_{h}(s) < r,\\ 0 \quad
\text{otherwise.} \end{cases} \]
Notice that the definition of the region of rejection, \(R_{r} = \{
s:\: F(s) = 1 \}\), hinges on the probability of the data under the
assumption of the hypothesis, \(P_{h}(s)\). This expression is often
called the
*likelihood* of the hypothesis on the sample \(s\). We can set
the threshold \(r\) for the likelihood to a suitable value, such that
the total probability of the region of rejection \(R_{r}\) is below a
given level of error, for example, \(P_{h}(R_{r}) < 0.05\).
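The construction of the region of rejection can be sketched in code. In the five-cup example all samples are equally probable under the null hypothesis, so a threshold on \(P_{h}(s)\) cannot separate them. The sketch therefore assumes a hypothetical variant with ten cups in which a sample records only the number of correct guesses; the function names are likewise illustrative.

```python
from fractions import Fraction
from math import comb

# Hypothetical variant: ten cups, and a sample records the number of correct guesses.
N_CUPS = 10

# Distribution P_h over the sample space under the null hypothesis (chance 1/2 per cup).
null_dist = {n: Fraction(comb(N_CUPS, n), 2 ** N_CUPS) for n in range(N_CUPS + 1)}

def rejection_region(dist, r):
    """All samples s with F(s) = 1, i.e., with P_h(s) < r."""
    return {s for s, p in dist.items() if p < r}

def region_probability(dist, region):
    """Total probability that the hypothesis assigns to the region."""
    return sum((dist[s] for s in region), Fraction(0))

def largest_admissible_threshold(dist, level):
    """Largest threshold, among the probabilities occurring in dist,
    whose region of rejection has total probability below the level."""
    best = None
    for r in sorted(set(dist.values())):
        if region_probability(dist, rejection_region(dist, r)) < level:
            best = r
    return best

r = largest_admissible_threshold(null_dist, Fraction(5, 100))
region = rejection_region(null_dist, r)

print(r)                                      # 45/1024
print(sorted(region))                         # [0, 1, 9, 10]
print(region_probability(null_dist, region))  # 11/512
```

On the definition of \(F\) as stated, the region contains very low as well as very high numbers of correct guesses, since both are improbable under the null hypothesis.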

It soon appeared that comparisons between two rival hypotheses were
far more informative, in particular because little can be said about
error rates if the null hypothesis is in fact false. Neyman and
Pearson (1928, 1933, and 1967) devised the so-called *likelihood
ratio test*, a test that compares the likelihoods of two rivaling
hypotheses. Let \(h\) and \(h'\) be the null and the alternative
hypothesis respectively. We can compare these hypotheses by the
following test function \(F\) over the sample space:
\[ F(s) = \begin{cases} 1 \quad \text{if } \frac{P_{h'}(s)}{P_{h}(s)}
> r,\\ 0 \quad \text{otherwise,} \end{cases} \]
where \(P_{h}\) and \(P_{h'}\) are the probability distributions over
the sample space determined by the statistical hypotheses \(h\) and
\(h'\) respectively. If \(F(s) = 1\) we decide to reject the null
hypothesis \(h\), else we accept \(h\) for the time being and so
disregard \(h'\).

The decision to accept or reject a hypothesis is associated with
the so-called significance and power of the test. The
*significance* is the probability, according to the null
hypothesis \(h\), of obtaining data that leads us to falsely reject
this hypothesis \(h\):
\[ \text{Significance}_{F} = \alpha = P_{h}(R_{r}) = \sum_{s \in S}
F(s) P_{h}(s) . \]
The probability \(\alpha\) is alternatively called the *type-I
error*, and it is often denoted as the
*significance* or the *p-value*. The *power* is
the probability, according to the alternative hypothesis \(h'\), of
obtaining data that leads us to correctly reject the null hypothesis
\(h\):
\[ \text{Power}_{F} = 1 - \beta = P_{h'}(R_{r}) = \sum_{s \in S} F(s)
P_{h'}(s) . \]
The probability \(\beta\) is called the *type-II error* of
falsely accepting the null hypothesis. An optimal test is one that
minimizes both the errors \(\alpha\) and \(\beta\). In their
fundamental lemma, Neyman and Pearson proved that the decision has
optimal significance and power for, and only for, likelihood-ratio
test functions \(F\). That is, an optimal test depends only on a
threshold for the ratio \(P_{h'}(s) / P_{h}(s)\).

The example of the tea tasting lady allows for an easy illustration of the likelihood ratio test.

Neyman-Pearson test

Next to the null hypothesis \(h\) that the lady is randomly guessing, we now consider the alternative hypothesis \(h'\) that she has a chance of \(3/4\) to guess the order of tea and milk correctly. The samples \(s\) are binary 5-tuples that record guesses as correct and incorrect. To determine the likelihoods of the two hypotheses, and thereby the value of the test function for each sample, we only need to know the so-called *sufficient statistic*, in this case the number of correct guesses \(n\) independently of the order. Denoting a particular sequence of guesses in which the lady has \(n\) correct guesses out of \(t\) with \(s_{n/t}\), we have \(P_{h}(s_{n/5}) = 1/2^{5}\) and \(P_{h'}(s_{n/5}) = 3^{n} / 4^{5}\), so that the likelihood ratio becomes \(3^{n} / 2^{5}\). If we require that the significance is lower than 5%, then it can be calculated that only the samples with \(n = 5\) may be included in the region of rejection. Accordingly we may set the cut-off point \(r\) such that \(r \geq 3^{4} / 2^{5}\) and \(r \lt 3^{5} / 2^{5}\), e.g., \(r = 3^{4} / 2^{5}\).
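The calculation alluded to here can be spelled out by enumerating the sample space. The sketch below is one way of doing so under the assumptions of the example; the function names `prob`, `lr_test`, `significance` and `power` are illustrative.

```python
from fractions import Fraction
from itertools import product

THETA_NULL = Fraction(1, 2)   # chance of a correct guess under h
THETA_ALT = Fraction(3, 4)    # chance of a correct guess under h'
N_CUPS = 5

# Binary 5-tuples: 1 = correct guess, 0 = incorrect guess.
samples = list(product([1, 0], repeat=N_CUPS))

def prob(sample, theta):
    """Probability of a sample when each guess is correct with chance theta."""
    n = sum(sample)
    return theta ** n * (1 - theta) ** (len(sample) - n)

def lr_test(sample, r):
    """Test function F: 1 (reject h) if the likelihood ratio exceeds r, else 0."""
    ratio = prob(sample, THETA_ALT) / prob(sample, THETA_NULL)
    return 1 if ratio > r else 0

def significance(r):
    """Probability under h of landing in the region of rejection."""
    return sum(lr_test(s, r) * prob(s, THETA_NULL) for s in samples)

def power(r):
    """Probability under h' of landing in the region of rejection."""
    return sum(lr_test(s, r) * prob(s, THETA_ALT) for s in samples)

r = Fraction(3 ** 4, 2 ** 5)
print(significance(r))                # 1/32
print(power(r))                       # 243/1024

# A lower cut-off that also admits samples with four correct guesses.
print(significance(Fraction(27, 32))) # 3/16
```

With the cut-off \(r = 3^{4}/2^{5}\) the significance is about 3%, whereas admitting the samples with four correct guesses raises it to almost 19%, well above the 5% limit.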

The threshold of 5% significance is part of statistical convention and very often fixed before even considering the power. Notice that the statistical procedure associates expected error rates with a decision to reject or accept. Especially Neyman has become known for interpreting this in a strictly behaviourist fashion. For further discussion on this point, please see Section 3.2.2.

#### 3.1.2 Estimation

In this section we briefly consider parameter estimation by maximum
likelihood, as first devised by Fisher (1956). While in the foregoing
we used a finite sample space, we now employ a space with infinitely
many possible samples. Accordingly, a probability distribution over
sample space is written down in terms of a so-called *density
function*, denoted \(P(s) ds\), which technically speaking
expresses the infinitely small probability assigned to an infinitely
small patch \(ds\) around the point \(s\). This probability density
works much like an ordinary probability function.

Maximum likelihood estimation, or MLE for short, is a tool for
determining the best among a set of hypotheses, often called a
*statistical model*. Let \(M = \{h_{\theta} :\: \theta \in
\Theta \}\) be the model, labeled by the parameter \(\theta\), let
\(S\) be the sample space, and \(P_{\theta}\) the distribution
associated with \(h_{\theta}\). Then define the *maximum
likelihood estimator* \(\hat{\theta}\) as a function over the
sample space:
\[ \hat{\theta}(s) = \left\{ \theta :\: \forall h_{\theta'}
\bigl(P_{\theta'}(s)ds \leq P_{\theta}(s)ds \bigr) \right\}. \]
So the estimator is a set, typically a singleton, of values of
\(\theta\) for which the likelihood of \(h_{\theta}\) on the data
\(s\) is maximal. The associated best hypothesis we denote with
\(h_{\hat{\theta}}\). This can again be illustrated for the tea
tasting lady.

Maximum likelihood estimation

A natural statistical model for the case of the tea tasting lady consists of hypotheses \(h_{\theta}\) for all possible levels of accuracy that the lady may have, \(\theta \in [0, 1]\). Now the number of correct guesses \(n\) and the total number of guesses \(t\) are the sufficient statistics: the probability of a sample only depends on those numbers. For any particular sequence \(s_{n/t}\) of \(t\) guesses with \(n\) successes, the associated likelihoods of \(h_{\theta}\) are \[ P_{\theta}(s_{n/t}) = \theta^{n} (1 - \theta)^{t - n} . \] For any number of trials \(t\) the maximum likelihood estimator then becomes \(\hat{\theta} = n / t\).

We suppose that the number of cups served to the lady is fixed at \(t\) so that sample space is finite again. Notice, finally, that \(\hat{\theta}\) is the hypothesis that makes the data most probable and not the hypothesis that is most probable in the light of the data.
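As a rough numerical check of the claim that \(\hat{\theta} = n / t\), the sketch below maximizes the likelihood over a finite grid of parameter values. The grid search and the names `likelihood` and `grid_mle` are assumptions made for illustration; they are not part of the estimation theory itself.

```python
from fractions import Fraction

def likelihood(theta, n, t):
    """Likelihood of h_theta for one sequence with n successes in t trials."""
    return theta**n * (1 - theta)**(t - n)

def grid_mle(n, t, steps=1000):
    """Grid value in [0, 1] at which the likelihood is largest."""
    grid = [Fraction(i, steps) for i in range(steps + 1)]
    return max(grid, key=lambda theta: likelihood(theta, n, t))

print(grid_mle(3, 5))   # 3/5
print(grid_mle(5, 5))   # 1
```

The second call reproduces the situation of five correct guesses out of five, in which the likelihood is maximal at the boundary value \(\theta = 1\).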

There are several requirements that we might impose on an estimator
function. One is that the estimator must be consistent. This means
that for larger samples the estimator function \(\hat{\theta}\)
converges to the parameter values associated with the distribution
\(\theta^{\star}\) of the data generating system, or the true
parameter values for short. Another requirement is that the estimator
must be unbiased, meaning that there is no discrepancy between the
expected value of the estimator and the true parameter values. The MLE
procedure is certainly not the only one used for estimating the value
of a parameter of interest on the basis of statistical data. A simpler
technique is the minimization of a particular target function, e.g.,
minimizing the sum of the squares of the distances between the
prediction of the statistical hypothesis and the data points, also
known as the *method of least squares*. A more general
perspective, first developed by Wald (1950), is provided by measuring
the discrepancy between the predictions of the hypothesis and the
actual data in terms of a loss function. The summed squares and the
likelihoods may be taken as expressions of this loss.

Often the estimation is coupled to a so-called *confidence
interval* (cf. Cumming 2012). For ease of exposition, assume that
\(\Theta\) consists of the real numbers and that every sample \(s\) is
labelled with a unique \(\hat{\theta}(s)\). We define the set
\(R_{\tau} = \{ s:\: \hat{\theta}(s) = \tau \}\), the set of samples
for which the estimator function has the value \(\tau\). We can now
collate a region in sample space within which the estimator function
\(\hat{\theta}\) is not too far off the mark, i.e., not too far from
the true value \(\theta^{\star}\) of the parameter. For example,
\[ C^{\star}_{\Delta} = \{ R_{\tau} :\: \tau \in [ \theta^{\star} -
\Delta , \theta^{\star} + \Delta ] \} . \]
So this set is the union of all \(R_{\tau}\) for which \(\tau \in [
\theta^{\star} - \Delta , \theta^{\star} + \Delta ]\). Now we might
set this region in such a way that it covers a large portion of the
sample space, say \(1 - \alpha\), as measured by the true distribution
\(P_{\theta^{\star}}\). We choose \(\Delta\) such that
\[ P_{\theta^{\star}}(C^{\star}_{\Delta}) = \int_{\theta^{\star} -
\Delta}^{\theta^{\star} + \Delta} P_{\theta^{\star}}(R_{\tau}) d\tau =
1 - \alpha .\]
Statistical folklore typically sets \(\alpha\) at a value of
5%. Relative to this number, the size of \(\Delta\) says something
about the quality of the estimate. If we were to repeat the collection
of the sample over and over, we would find the estimator
\(\hat{\theta}\) within a range \(\Delta\) of the true value
\(\theta^{\star}\) in 95% of all samples. This leads us to define the
symmetric 95% confidence interval:
\[ CI_{95} = [ \hat{\theta} - \Delta , \hat{\theta} + \Delta ] \]
The interpretation is the same as in the foregoing: with repeated
sampling we find the true value within \(\Delta\) of the estimate in
95% of all samples.
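The repeated-sampling reading can be illustrated by simulation. The sketch below assumes a concrete model that the general discussion leaves open: \(n\) independent observations from a normal distribution with unit variance, with the sample mean as estimator, so that \(\Delta\) equals the 97.5% standard normal quantile divided by \(\sqrt{n}\). The names `half_width` and `coverage`, the seed and the number of repetitions are hypothetical choices.

```python
import random
from statistics import NormalDist

def half_width(n, alpha=0.05):
    """Delta for the mean of n unit-variance normal observations."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / n**0.5

def coverage(theta_star, n, reps=2000, seed=0):
    """Fraction of simulated samples whose estimate lies within Delta of theta_star."""
    rng = random.Random(seed)
    delta = half_width(n)
    hits = 0
    for _ in range(reps):
        estimate = sum(rng.gauss(theta_star, 1.0) for _ in range(n)) / n
        if abs(estimate - theta_star) <= delta:
            hits += 1
    return hits / reps

print(half_width(100))      # close to 0.196
print(coverage(2.0, 25))    # should be near 0.95
```

Note that the simulation fixes the true value \(\theta^{\star}\) and varies the sample, in line with the frequentist reading of the interval.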

It is crucial that we can provide an unproblematic frequentist interpretation of the event that \(\hat{\theta} \in [\theta^{\star} - \Delta, \theta^{\star} + \Delta]\), under the assumption of the true distribution. In a series of estimations, the fraction of times in which the estimator \(\hat{\theta}\) is further away from \(\theta^{\star}\) than \(\Delta\), and hence outside this interval, will tend to 5%. The smaller this region, the more reliable the estimate. Note that this interval is defined in terms of the unknown true value \(\theta^{\star}\). However, especially if the size of the interval \(2 \Delta\) is independent of the true parameter \(\theta^{\star}\), it is tempting to associate the 95% confidence interval with the frequency with which the true value lies within a range of \(\Delta\) around the estimate \(\hat{\theta}\). Below we come back to this interpretation.

There are of course many more procedures for estimating a variety of statistical targets, and there are many more expressions for the quality of the estimation (e.g., bootstrapping, see Efron and Tibshirani 1993). Theories of estimation often come equipped with a rich catalogue of situation-specific criteria for estimators, reflecting the epistemic and pragmatic goals that the estimator helps to achieve. However, in themselves the estimator functions do not present guidelines for belief and, importantly, confidence intervals do not either.

### 3.2 Problems for classical statistics

Classical statistics is widely discussed in the philosophy of statistics. In what follows two problems with the classical approach are outlined, to wit, its problematic interface with belief and the fact that it violates the so-called likelihood principle. Many more specific problems can be seen to derive from these general ones.

#### 3.2.1 Interface with belief

Consider the likelihood ratio test of Neyman and Pearson. As indicated, the significance or p-value of a test is an error rate that will manifest if data collection and testing is repeated, assuming that the null hypothesis is in fact true. Notably, the p-value does not tell us anything about how probable the truth of the null hypothesis is. However, many scientists do use hypothesis testing in this manner, and there is much debate over what can and cannot be derived from a p-value (cf. Berger and Sellke 1987, Casella and Berger 1987, Cohen 1994, Harlow et al 1997, Wagenmakers 2007, Ziliak and McCloskey 2008, Spanos 2007, Greco 2011, Sprenger forthcoming-a). After all, the test leads to the advice to either reject the hypothesis or accept it, and this seems conceptually very close to giving a verdict of truth or falsity.

While the evidential value of *p*-values is much debated,
many admit that the probability of data according to a hypothesis
cannot be used straightforwardly as an indication of how believable
the hypothesis is (cf. Gillies 1971, Spielman 1974 and 1978). Such
usage runs into the so-called *base-rate fallacy.* The example
of the tea tasting lady is again instructive.

Base-rate fallacy

Imagine that we travel the country to perform the tea tasting test with a large number of ladies, and that we find a particular lady who guesses all five cups correctly. Should we conclude that the lady has a special talent for tasting tea? The problem is that this depends on how many ladies among those tested actually have the special talent. If the ability is very rare, it is more attractive to put the five correct guesses down to a chance occurrence. By comparison, imagine that all the ladies enter the lottery. In analogy to a lady guessing all cups correctly, consider a lady who wins one of the lottery's prizes. Of course winning a prize is very improbable, unless one is in cahoots with the bookmaker, i.e., the analogon of having a special tea tasting ability. But surely if a lady wins the lottery, this is not a good reason to conclude that she must have committed fraud and call for her arrest. Similarly, if a lady has guessed all cups correctly, we cannot simply conclude that she has special abilities.
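The dependence on the base rate can be made explicit with a small calculation. The sketch below rests on assumptions that go beyond the example: the base rate is read as the relative frequency of talented ladies among those tested, a talented lady is taken to guess each cup correctly with chance \(3/4\) (the alternative hypothesis used earlier), and the particular base rates are invented for illustration. The function name `posterior_talent` is likewise hypothetical.

```python
from fractions import Fraction

def posterior_talent(base_rate, p_talent=Fraction(3, 4), cups=5):
    """Fraction of talented ladies among those who guess all cups correctly."""
    like_talent = p_talent**cups           # chance of all correct, given talent
    like_guess = Fraction(1, 2)**cups      # chance of all correct, given guessing
    joint_talent = base_rate * like_talent
    joint_guess = (1 - base_rate) * like_guess
    return joint_talent / (joint_talent + joint_guess)

print(posterior_talent(Fraction(1, 1000)))   # 9/1193, less than 1%
print(posterior_talent(Fraction(1, 2)))      # 243/275, about 88%
```

With a rare talent, the overwhelming majority of ladies who guess all five cups correctly are lucky guessers, even though the probability of that result under the null hypothesis is only \(1/32\).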

Essentially the same problem occurs if we consider the estimations
of a parameter as direct advice on what to believe, as made clear by
an example of Good (1983, p. 57) that is presented here in the tea
tasting context. After observing five correct guesses, we have
\(\hat{\theta} = 1\) as maximum likelihood estimator. But it is hardly
believable that the lady will in the long run be 100% accurate. The
point that estimation and belief maintain complicated relations is
also put forward in discussions of *Lindley's paradox* (Lindley
1957, Spanos 2013, Sprenger forthcoming-b). In short, it seems
wrongheaded to turn the results of classical statistical procedures
into beliefs.

It is a matter of debate whether any of this can be blamed on classical statistics. Initially, Neyman was emphatic that their procedures could not be taken as inferences, or as in some other way pertaining to the epistemic status of the hypotheses. Their own statistical philosophy was strictly behaviorist (cf. Neyman 1957), and it may be argued that the problems disappear if only scientists abandon their faulty epistemic use of classical statistics. As explained in the foregoing, we can uncontroversially associate error rates with classical procedures, and so with the decisions that flow from these procedures. Hence, a behavioural and error-based understanding of classical statistics seems just fine. However, both statisticians and philosophers have argued that an epistemic reading of classical statistics is possible, and in fact preferable (e.g., Fisher 1955, Royall 1997). Accordingly, many have attempted to reinterpret or develop the theory, in order to align it with the epistemically oriented statistical practice of scientists (see Mayo 1996, Mayo and Spanos 2011, Spanos 2013b).

#### 3.2.2 The nature of evidence

Hypothesis tests and estimations are sometimes criticised because
their results generally depend on the probability functions over the
entire sample space, and not exclusively on the probabilities of the
observed sample. That is, the decision to accept or reject the null
hypothesis depends not just on the probability of what has actually
been observed according to the various hypotheses, but also on the
probability assignments over events that could have been observed but
were not. A well-known illustration of this problem concerns so-called
*optional stopping* (Robbins 1952, Roberts 1967, Kadane et al
1996, Mayo 1996, Howson and Urbach 2006).

Optional stopping is here illustrated for the likelihood ratio test of Neyman and Pearson but a similar story can be run for Fisher's null hypothesis test and for the determination of estimators and confidence intervals.

Optional stopping

Imagine two researchers who are both testing the same lady on her ability to determine the order in which milk and tea were poured in her cup. They both entertain the null hypothesis that she is guessing at random, with a probability of \(1/2\), against the alternative of her guessing correctly with a probability of \(3/4\). The more diligent researcher of the two decides to record six trials. The more impatient researcher, on the other hand, records at most six trials, but decides to stop recording at the first trial that the lady guesses incorrectly. Now imagine that, in actual fact, the lady guesses all but the last of the cups correctly. Both researchers then have the exact same data of five successes and one failure, and the likelihoods for these data are the same for the two researchers too. However, while the diligent researcher cannot reject the null hypothesis, the impatient researcher can.

This might strike us as peculiar: statistics should tell us the
objective impact that the data have on a hypothesis, but here the
impact seems to depend on the *sampling plan* of the researcher
and not just on the data themselves. As further explained in
Section 3.2.3,
the results of the two researchers differ because of
differences in how samples that were not observed are factored into
the procedure.

Some will find this dependence unacceptable: the intentions and
plans of the researcher are irrelevant to the evidential value of the
data. But others argue that it is just right. They maintain that the
impact of data on the hypotheses should depend on the *stopping
rule* or protocol that is followed in obtaining it, and not only
on the likelihoods that the hypotheses have for those data
(e.g. Mayo 1996). The motivating intuition is that upholding the
irrelevance of the stopping rule makes it impossible to ban
opportunistic choices in data collection. In fact, defenders of
classical statistics turn the table on those who maintain that
optional stopping is irrelevant. They submit that it opens up the
possibility of reasoning to a foregone conclusion by, for example,
*persistent experimentation*: we might decide to cease
experimentation only if the preferred result is reached. However, as
shown in Kadane *et al*. (1996) and further discussed in Steele
(2012), persistent experimentation is not guaranteed to be
effective, as long as we make sure to use the correct, in this case
Bayesian, procedures.

The debate over optional stopping is eventually concerned with the
appropriate evidential impact of data. A central concern in this wider
debate is the so-called *likelihood principle* (see Hacking
1965 and Edwards 1972). This principle has it that the likelihoods of
hypotheses for the observed data completely fix the evidential impact
of those data on the hypotheses. In the formulation of Berger and
Wolpert (1984), the likelihood principle states that two samples \(s\)
and \(s'\) are evidentially equivalent exactly when \(P_{i}(s) =
kP_{i}(s')\) for all hypotheses \(h_{i}\) under consideration, given
some constant \(k\). Famously, Birnbaum (1962) offers a proof of the
principle from more basic assumptions. This proof relies on the
assumption of *conditionality*. Say that we first toss a coin,
find that it lands heads, then do the experiment associated with this
outcome, to record the sample \(s\). Compare this to the case where we
do the experiment and find \(s\) directly, without randomly picking
it. The conditionality principle states that this second sample has
the same evidential impact as the first one: what we could have found,
but did not find, has no impact on the evidential value of the
sample. Recently, Mayo (2010) has taken issue with Birnbaum's
derivation of the likelihood principle.

The classical view sketched above entails a violation of this: the impact of the observed data may be different depending on the probability of other samples than the observed one, because those other samples come into play when determining regions of acceptance and rejection. The Bayesian procedures discussed in Section 4, on the other hand, uphold the likelihood principle: in determining the posterior distribution over hypotheses only the prior and the likelihood of the observed data matter. In the debate over optional stopping and in many of the other debates between classical and Bayesian statistics, the likelihood principle is the focal point.
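The claim that only the prior and the likelihoods of the observed data matter can be illustrated with the tea tasting data. In the sketch below, the equal prior over the two hypotheses and the constant `k` are arbitrary assumptions; `k` stands in for any factor, such as one contributed by a stopping rule, that multiplies the likelihoods of all hypotheses alike. The function name `posterior` is hypothetical.

```python
from fractions import Fraction

def posterior(prior, likelihoods):
    """Posterior over hypotheses from a prior and the likelihoods of the data."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

prior = [Fraction(1, 2), Fraction(1, 2)]           # assumed prior over h and h'
base = [Fraction(1, 64), Fraction(3**5, 4**6)]     # five successes, one failure
k = Fraction(1, 7)                                 # arbitrary common factor
scaled = [k * l for l in base]

print(posterior(prior, base))     # posteriors 64/307 and 243/307
print(posterior(prior, scaled))   # identical: the factor k cancels
```

Two samples whose likelihoods differ only by such a common factor therefore lead to the same posterior, which is the sense in which they are evidentially equivalent according to the likelihood principle.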

#### 3.2.3 Excursion: optional stopping

The view that the data reveal more, or something else, than what is expressed by the likelihoods of the hypotheses at issue merits detailed attention. Here we investigate this issue further with reference to the controversy over optional stopping.

Let us consider the analyses of the two above researchers in some numerical detail by constructing the regions of rejection for both of them.

Determining regions of rejection

The *diligent* researcher considers all 6-tuples of success and failure as the sample space, and takes the number of successes as the sufficient statistic. The event of six successes, or six correct guesses, has a probability of \(1 / 2^{6} = 1/64\) under the null hypothesis that the lady is merely guessing, against a probability of \(3^{6} / 4^{6}\) under the alternative hypothesis. If we set \(r < 3^{6} / 2^{6}\), then this sample is included in the region of rejection of the null hypothesis. Each sample with five successes has a probability of \(1/64\) under the null hypothesis too, against a probability of \(3^5 / 4^{6}\) under the alternative. By lowering the threshold on the likelihood ratio by a factor of 3, we include all these samples in the region of rejection. But this will lead to a total probability of false rejection of \(7/64\), which is larger than 5%. So these samples cannot be included in the region of rejection, and hence the diligent researcher does not reject the null hypothesis upon finding five successes and one failure.

For the *impatient* researcher, on the other hand, the sample space is much smaller. Apart from the sample consisting of six successes, all samples consist of a series of successes ending with a failure, differing only in the length of the series. Yet the probabilities over the two samples of length six are the same as for the diligent researcher. As before, the sample of six successes is again included in the region of rejection. Similarly, the sequence of five successes followed by one failure also has a probability of \(1/64\) under the null hypothesis, against a probability of \(3^5 / 4^{6}\) according to the alternative. The difference is that lowering the threshold on the likelihood ratio to include this sample in the region of rejection leads to the inclusion of this sample only. And if we include it in the region of rejection, the probability of false rejection becomes \(1/32\) and hence does not exceed 5%. Consequently, on the basis of these data the impatient researcher can reject the null hypothesis that the lady is merely guessing.
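The two probabilities of false rejection can be verified by enumerating both sample spaces. The sketch below is an illustration under the stated sampling plans; the names `seq_prob`, `smallest_region` and `false_rejection` are hypothetical, and successes and failures are coded as 1 and 0.

```python
from fractions import Fraction
from itertools import product

P_NULL, P_ALT = Fraction(1, 2), Fraction(3, 4)

def seq_prob(p, seq):
    """Probability of a sequence of successes (1) and failures (0)."""
    prob = Fraction(1)
    for x in seq:
        prob *= p if x else 1 - p
    return prob

# Diligent plan: all 6-tuples.
diligent = list(product((0, 1), repeat=6))
# Impatient plan: stop at the first failure, with at most six cups.
impatient = [(1,) * k + (0,) for k in range(6)] + [(1,) * 6]

def smallest_region(space, observed):
    """Likelihood-ratio rejection region that just includes the observed sample."""
    def ratio(s):
        return seq_prob(P_ALT, s) / seq_prob(P_NULL, s)
    return [s for s in space if ratio(s) >= ratio(observed)]

def false_rejection(space, observed):
    """Probability under the null hypothesis of landing in that region."""
    return sum(seq_prob(P_NULL, s) for s in smallest_region(space, observed))

observed = (1, 1, 1, 1, 1, 0)
print(false_rejection(diligent, observed))    # 7/64, above 5%
print(false_rejection(impatient, observed))   # 1/32, below 5%
```

The observed sample and its likelihoods are the same in both calls. Only the surrounding sample space differs.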

It is instructive to consider why exactly the impatient researcher can reject the null hypothesis. In virtue of his sampling plan, the other samples with five successes, namely the ones which kept the diligent researcher from including the observed sample in the region of rejection on pain of exceeding the error probability, could not have been observed. This exemplifies that the results of a classical statistical procedure do not only depend on the likelihoods for the actual data, which are indeed the same for both researchers. They also depend on the likelihoods for data that we did not obtain.

In the above example, it may be considered confusing that the protocol used for optional stopping depends on the data that is being recorded. But the controversy over optional stopping also emerges if this dependence is absent. For example, imagine a third researcher who samples until the diligent researcher is done, or before that if she starts to feel peckish. Furthermore we may suppose that with each new cup offered to the lady, the probability of feeling peckish is \(\frac{1}{2}\). This peckish researcher will also be able to reject the null hypothesis if she completes the series of six cups. And it certainly seems at variance with the objectivity of the statistical procedure that this rejection depends on the physiology and the state of mind of the researcher: if she had not kept open the possibility of a snack break, she would not have rejected the null hypothesis, even though she did not actually take that break. As Jeffreys famously quipped, this is indeed a “remarkable procedure”.

Yet the case is not as clear-cut as it may seem. For one, the peckish researcher is arguably testing two hypotheses in tandem, one about the ability of the tea tasting lady and another about her own peckishness. Together the combined hypotheses have a different likelihood for the actual sample than the simple hypothesis considered by the diligent researcher. The likelihood principle given above dictates that this difference does not affect the evidential impact of the actual sample, but some retain the intuition that it should. Moreover, in some cases this intuition is shared by those who uphold the likelihood principle, namely when the stopping rule depends on the process being recorded in a way not already expressed by the hypotheses at issue (cf. Robbins 1952, Howson and Urbach 2006, p. 365). In terms of our example, if the lady is merely guessing, then it may be more probable that the researcher gets peckish out of sheer boredom, than if the lady performs far below or above chance level. In such a case the act of stopping itself reveals something about the hypotheses at issue, and this should be reflected in the likelihoods of the hypotheses. This would make the evidential impact that the data have on the hypothesis dependent on the stopping rule after all.

### 3.3 Responses to criticism

There have been numerous responses to the above criticisms. Some of those responses effectively reinterpret the classical statistical procedures as pertaining only to the evidential impact of data. Other responses develop the classical statistical theory to accommodate the problems. Their common core is that they establish or at least clarify the connection between two conceptual realms: the statistical procedures refer to physical probabilities, while their results pertain to evidence and support, and even to the rejection or acceptance of hypotheses.

#### 3.3.1 The strength of evidence

Classical statistics is often presented as providing us with advice for actions. The error probabilities do not tell us what epistemic attitude to take on the basis of statistical procedures, rather they indicate the long-run frequency of error if we live by them. Specifically Neyman advocated this interpretation of classical procedures. Against this, Fisher (1935a, 1955), Pearson, and other classical statisticians have argued for more epistemic interpretations, and many more recent authors have followed suit.

Central to the above discussion on classical statistics is the
concept of likelihood, which reflects how the data bears on the
hypotheses at issue. In the works of Hacking (1965), Edwards (1972),
and more recently Royall (1997), the likelihoods are taken as a
cornerstone for statistical procedures and given an epistemic
interpretation. They are said to express the strength of the evidence
presented by the data, or the comparative degree of support that the
data give to a hypothesis. Hacking formulates this idea in the
so-called *law of likelihood* (1965, p. 59): if the sample
\(s\) is more probable on the condition of \(h_{0}\) than on
\(h_{1}\), then \(s\) supports \(h_{0}\) more than it supports
\(h_{1}\).

The position of likelihoodism is based on a specific combination of views on probability. On the one hand, it only employs probabilities over sample space, and avoids putting probabilities over statistical hypotheses. It thereby avoids the use of probability that cannot be given a physical interpretation. On the other hand, it does interpret the probabilities over sample space as components of a support relation, and thereby as pertaining to the epistemic rather than the physical realm. Notably, the likelihoodist approach fits well with a long history in formal approaches to epistemology, in particular with confirmation theory (see Fitelson 2007), in which probability theory is used to spell out confirmation relations between data and hypotheses. Measures of confirmation invariably take the likelihoods of hypotheses as input components. They provide a quantitative expression of the support relations described by the law of likelihood.

Another epistemic approach to classical statistics is presented by
Mayo (1996) and Mayo and Spanos (2011). Over the past decade or so,
they have done much to push the agenda of classical statistics in the
philosophy of science, which had become dominated by Bayesian
statistics. Countering the original behaviourist tendencies of Neyman,
the *error statistical approach* advances an epistemic reading
of classical test and estimation procedures. Mayo and Spanos argue
that classical procedures are best understood as inferential: they
license inductive inferences. But they readily admit that the
inferences are defeasible, i.e., they could lead us
astray. Classical procedures are always associated with particular
error probabilities, e.g., the probability of a false rejection or
acceptance, or the probability of an estimator falling within a
certain range. In the theory of Mayo and Spanos, these error
probabilities obtain an epistemic role, because they are taken to
indicate the reliability of the inferences licensed by the
procedures.

The error statistical approach of Mayo and others comprises a
general philosophy of science as well as a particular viewpoint on the
philosophy of statistics. We briefly focus on the latter, through a
discussion of the notion of a severe test (cf. Mayo and Spanos 2006).
The claim is that we gain knowledge of experimental effects on the
basis of *severely testing* hypotheses, which can be
characterized by the significance and power. In Mayo's definition, a
hypothesis passes a severe test on two conditions: the data must agree
with the hypothesis, and the probability must be very low that
the data agree with the alternative hypothesis. Ignoring potential
controversy over the precise interpretation of “agree” and “low
probability”, we can recognize the criteria of Neyman and Pearson in
these requirements. The test is severe if the significance is low,
since the data must agree with the hypothesis, and the power is high,
since those data must not agree, or else have a low probability of
agreeing, with the alternative.

#### 3.3.2 Theoretical developments

Apart from re-interpretations of the classical statistical procedures, numerous statisticians and philosophers have developed the theory of classical statistics further in order to make good on the epistemic role of its results. We focus on two developments in particular, to wit, fiducial and evidential probability.

The theory of *evidential probability* originates in Kyburg
(1961), who developed a logical system to deal consistently with the
results of classical statistical analyses. Evidential probability
thus falls within the attempts to establish the epistemic use of
classical statistics. Haenni et al (2011) and Kyburg and Teng (2001)
present an insightful introduction to evidential probability. The
system is based on a version of default reasoning: statistical hypotheses
come attached with a confidence level, and the logical system
organizes how such confidence levels are propagated in inference, and
thus advises which hypothesis to use for predictions and
decisions. Particular attention is devoted to the propagation of
confidence levels in inferences that involve multiple instances of the
same hypothesis tagged with different confidences, where those
confidences result from diverse data sets that are each associated
with a particular population. Evidential probability assists in
selecting the optimal confidence level, and thus in choosing
the appropriate population for the case under consideration. In
other words, evidential probability helps to resolve the reference
class problem alluded to in the foregoing.

Fiducial probability presents another way in which classical
statistics can be given an epistemic status. Fisher (1930, 1933,
1935c, 1956/1973) developed the notion of *fiducial
probability* as a way of deriving a probability assignment over
hypotheses without assuming a prior probability over statistical
hypotheses at the outset. The fiducial argument is controversial, and
it is generally agreed that its applicability is limited to particular
statistical problems. Dempster (1964), Hacking (1965), Edwards (1972),
Seidenfeld (1996) and Zabell (1996) provide insightful
discussions. Seidenfeld (1979) presents a particularly detailed study
and a further discussion of the restricted applicability of the
argument in cases with multiple parameters. Dawid and Stone (1982)
argue that in order to run the fiducial argument, one has to assume
that the statistical problem can be captured in a functional model
that is smoothly invertible. Dempster (1966) provides generalizations
of this idea for cases in which the distribution over \(\theta\) is
not fixed uniquely but only constrained within upper and lower bounds
(cf. Haenni et al 2011). Crucially, such constraints on the
probability distribution over values of \(\theta\) are obtained
without assuming any distribution over \(\theta\) at the outset.

#### 3.3.3 Excursion: the fiducial argument

To explain the *fiducial argument* we first set up a simple
example. Say that we estimate the mean \(\theta\) of a normal
distribution with unit variance over a variable \(X\). We collect a
sample \(s\) consisting of measurements \(X_{1}, X_{2}, \ldots
X_{n}\). The maximum likelihood estimator for \(\theta\) is the
average value of the \(X_{i}\), that is, \(\hat{\theta}(s) = \sum_{i}
X_{i} / n\). Under an assumed true value \(\theta\) we then have a
normal distribution for the estimator \(\hat{\theta}(s)\), centred on
the true value and with a variance \(1 / n\). Notably, this
distribution has the same shape for all values of \(\theta\). Because
of this, argued Fisher, we can use the distribution over the estimator
\(\hat{\theta}(s)\) as a stand-in for the distribution over the true
value \(\theta\). We thus derive a probability distribution
\(P(\theta)\) on the basis of a sample \(s\), seemingly without
assuming a prior probability.

There are several ways to clarify this so-called fiducial argument.
One way employs a so-called *functional model*, i.e., the
specification of a statistical model by means of a particular
function. For the above model, the function is
\[ f(\theta, \epsilon) = \theta + \epsilon = \hat{\theta}(s) . \]
It relates possible parameter values \(\theta\) to a quantity based on
the sample, in this case the estimator of the observations
\(\hat{\theta}\). The two are related through a stochastic component
\(\epsilon\) whose distribution is known, and the same for all the
samples under consideration. In our case \(\epsilon\) is distributed
normally with mean zero and variance \(1 / n\). Importantly, the distribution
of \(\epsilon\) is the same for every value of \(\theta\). The
interpretation of the function \(f\) may now be apparent. Relative to
the choice of a value of \(\theta\), which then obtains the role of
the true value \(\theta^{\star}\), the distribution over \(\epsilon\)
dictates the distribution over the estimator function
\(\hat{\theta}(s)\).

The idea of the fiducial argument can now be expressed succinctly. It is to project the distribution over the stochastic component back onto the possible parameter values. The key observation is that the functional relation \(f(\theta, \epsilon)\) is smoothly invertible, i.e., the function \[ f^{-1}(\hat{\theta}(s), \epsilon) = \hat{\theta}(s) - \epsilon = \theta \] points each combination of \(\hat{\theta}(s)\) and \(\epsilon\) to a unique parameter value \(\theta\). Hence, we can invert the claim of the previous paragraph: relative to fixing a value for \(\hat{\theta}\), the distribution over \(\epsilon\) fully determines the distribution over \(\theta\). In virtue of the inverted functional model, we can thus transfer the normal distribution over \(\epsilon\) to the values \(\theta\) around \(\hat{\theta}(s)\). This yields a so-called fiducial probability distribution over the parameter \(\theta\). The distribution is obtained because, conditional on the value of the estimator, the parameters and the stochastic terms become perfectly correlated. A distribution over the latter is then automatically applicable to the former (cf. Haenni et al 2011, 52–55 and 119–122).

Another way of explaining the same idea invokes the notion of a
*pivotal quantity*. Because of how the above statistical model
is set up, we can construct the pivotal quantity \(\hat{\theta}(s) -
\theta\). We know the distribution of this quantity, namely normal and
with the aforementioned variance. Moreover, this distribution is
independent of the sample, and it is such that fixing the sample to
\(s\), and so fixing the value of \(\hat{\theta}\), uniquely
determines a distribution over the parameter values \(\theta\). The
fiducial argument thus allows us to construct a probability
distribution over the parameter values on the basis of the observed
sample. The argument can be run whenever we can construct a pivotal
quantity like that or, equivalently, whenever we can express the
statistical model as a functional model.
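
The construction can be illustrated with a small numerical sketch. It assumes the unit-variance normal model of the example, in which the estimator has standard deviation \(1/\sqrt{n}\) (variance \(1/n\)). The function names and the four measurements are hypothetical, introduced only for illustration, and the code takes no stand on how the resulting distribution should be interpreted.

```python
import math
from statistics import NormalDist


def fiducial_distribution(sample):
    """Fiducial distribution over theta for a unit-variance normal model.

    The pivotal quantity theta_hat - theta is normal with mean 0 and
    standard deviation 1/sqrt(n), whatever the value of theta. Fixing
    theta_hat and inverting theta = theta_hat - epsilon transfers that
    distribution to theta, centred on the observed estimate.
    """
    n = len(sample)
    theta_hat = sum(sample) / n  # maximum likelihood estimate
    return NormalDist(mu=theta_hat, sigma=1.0 / math.sqrt(n))


def fiducial_interval(sample, level=0.95):
    """Central interval carrying `level` of the fiducial probability."""
    dist = fiducial_distribution(sample)
    tail = (1.0 - level) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


# Hypothetical measurements, chosen only for illustration.
sample = [1.0, 2.0, 3.0, 2.0]
low, high = fiducial_interval(sample)
print(f"theta_hat = {sum(sample) / len(sample):.2f}")
print(f"95% fiducial interval: [{low:.2f}, {high:.2f}]")
```

Numerically the interval coincides with the classical 95% confidence interval for this model. The fiducial argument changes the reading of the interval, not the arithmetic.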

A warning is in order here. As revealed in many of the above references, the fiducial argument is highly controversial. The mathematical results are there, but the proper interpretation of the results is still up for discussion. In order to properly appreciate the precise inferential move and its wobbly conceptual basis, it will be instructive to consider the use of fiducial probability in interpreting confidence intervals. A proper understanding of this requires first reading Section 3.1.2.

Recall that confidence intervals, which are standardly taken to
indicate the quality of an estimation, are often interpreted
epistemically. The 95% confidence interval is often misunderstood as
the range of parameter values that includes the true value with 95%
probability, a so-called *credal interval*:
\[ P(\theta \in [\hat{\theta} - \Delta, \hat{\theta} + \Delta]) = 0.95
. \]
This interpretation is at odds with classical statistics but, as will
become apparent, it can be motivated by an application of the fiducial
argument. Say that we replace the integral determining the size
\(\Delta\) of the confidence interval by the following:
\[ \int_{\hat{\theta}(s) - \Delta}^{\hat{\theta}(s) + \Delta}
P_{\theta}(R_{\hat{\theta}(s)}) d\theta = 0.95 .\]
In words, we fix the estimator \(\hat{\theta}(s)\) and then integrate
over the parameters \(\theta\) in \(P_{\theta}(R_{\hat{\theta}(s)})\),
rather than assuming \(\theta^{\star}\) and then integrating over the
parameters \(\tau\) in \(R_{\tau}\). Sure enough we can calculate this
integral. But what ensures that we can treat the integral as a
probability? Notice that it runs over a continuum of probability
distributions and that, as it stands, there is no reason to think that
the terms \(P_{\theta}(R_{\hat{\theta}(s)})\) add up to a proper
distribution in \(\theta\).

The assumptions of the fiducial argument, here explained in terms of the invertibility of the functional model, ensure that the terms indeed add up, and that a well-behaved distribution will surface. We can choose the statistical model in such a way that the sample statistic \(\hat{\theta}(s)\) and the parameter \(\theta\) are related in the right way: relative to the parameter \(\theta\), we have a distribution over the statistic \(\hat{\theta}\), but by the same token we have a distribution over parameters relative to this statistic. As a result, the probability function \(P_{\theta}(R_{\hat{\theta}(s) + \epsilon})\) over \(\epsilon\), where \(\theta\) is fixed, can be transferred to a fiducial probability function \(P_{\theta + \epsilon}(R_{\hat{\theta}(s)})\) over \(\epsilon\), where \(\hat{\theta}(s)\) is fixed. The function \(P_{\theta}(R_{\hat{\theta}})\) of the parameter \(\theta\) is thus a proper probability function, from which a credal interval can be constructed.

Even then, it is not clear why we should take this distribution as an appropriate expression of our belief, so that we may support the epistemic interpretation of confidence intervals with it. And so the debate continues. In the end fiducial probability is perhaps best understood as a half-way house between the classical and the Bayesian view on statistics. Classical statistics grew out of a frequentist interpretation of probability, and accordingly the probabilities appearing in the classical statistical methods are all interpreted as frequencies of events. Clearly, the probability distribution over hypotheses that is generated by a fiducial argument cannot be interpreted in this way, so that an epistemic interpretation of this distribution seems the only option. Several authors (e.g., Dempster 1964) have noted that fiducial probability indeed makes most sense in a Bayesian perspective. It is to this perspective that we now turn.

## 4. Bayesian statistics

Bayesian statistical methods are often presented in the form of an
inference. The inference runs from a so-called *prior*
probability distribution over statistical hypotheses, which expresses
the degree of belief in the hypotheses before data has been collected,
to a *posterior* probability distribution over the
hypotheses, which expresses the beliefs after the data have been
incorporated. The posterior distribution follows, via the axioms of
probability theory, from the prior distribution and the
*likelihoods* of the hypotheses for the data obtained, i.e.,
the probability that the hypotheses assign to the data. Bayesian
methods thus employ data to modulate our attitude towards
a designated set of statistical hypotheses, and in this respect
they achieve the same as classical statistical procedures. Both types
of statistics present a response to the problem of induction. But
whereas classical procedures select or eliminate elements from the set
of hypotheses, Bayesian methods express the impact of data in a
posterior probability assignment over the set. This posterior is fully
determined by the prior and the likelihoods of the hypotheses, via the
formalism of probability theory.

The defining characteristic of Bayesian statistics is that it
considers probability distributions over statistical hypotheses as
well as over data. It embraces the epistemic interpretation of
probability whole-heartedly: probabilities over hypotheses are
interpreted as degrees of belief, i.e., as expressions of epistemic
uncertainty. The philosophy of Bayesian statistics is concerned
with determining the appropriate interpretation of these input
components, and of the mathematical formalism of probability itself,
ultimately with the aim to justify the output. Notice that the general
pattern of a Bayesian statistical method is that of
*inductivism* in the cumulative sense: under the impact of
data we move to more and more informed probabilistic opinions about
the hypotheses. However, in the following it will appear that Bayesian
methods may also be understood as deductivist in nature.

### 4.1 Basic pattern of inference

Bayesian inference always starts from a *statistical
model*, i.e., a set of statistical hypotheses. While the general
pattern of inference is the same, we treat models with a finite number
and a continuum of hypotheses separately and draw parallels with
hypothesis testing and estimation, respectively. The exposition is
mostly based on Press 2002, Howson and Urbach 2006, Gelman et al
2013, and Earman 1992.

#### 4.1.1 Finite model

Central to Bayesian methods is a theorem from probability theory
known as *Bayes' theorem*. Relative to a prior probability
distribution over hypotheses, and the probability distributions over
sample space for each hypothesis, it tells us what the adequate
posterior probability over hypotheses is. More precisely, let \(s\) be
the sample and \(S\) be the sample space as before, and let \(M = \{
h_{\theta} :\: \theta \in \Theta \}\) be the space of statistical
hypotheses, with \(\Theta\) the space of parameter values. The
function \(P\) is a probability distribution over the entire space \(M
\times S\), meaning that every element \(h_{\theta}\) is associated
with its own sample space \(S\), and its own probability distribution
over that space. For the latter, which is fully determined by the
likelihoods of the hypotheses, we write the probability of the sample
conditional on the hypothesis, \(P(s \mid h_{\theta})\). This differs
from the expression \(P_{h_{\theta}}(s)\), written in the context of
classical statistics, because in contrast to classical statisticians,
Bayesians accept \(h_{\theta}\) as an argument for the probability
distribution.

Bayesian statistics is first introduced in the context of a finite set of hypotheses, after which a generalization to the infinite case is provided. Assume the prior probability \(P(h_{\theta})\) over the hypotheses \(h_{\theta} \in M\). Further assume the likelihoods \(P(s \mid h_{\theta})\), i.e., the probability assigned to the data \(s\) conditional on the hypotheses \(h_{\theta}\). Then Bayes' theorem determines that \[ P(h_{\theta} \mid s) \; = \; \frac{P(s \mid h_{\theta})}{P(s)} P(h_{\theta}) . \] Bayesian statistics outputs the posterior probability assignment, \(P(h_{\theta} \mid s)\). This expression gets the interpretation of an opinion concerning \(h_{\theta}\) after the sample \(s\) has been accommodated, i.e., it is a revised opinion. Further results from a Bayesian inference can all be derived from the posterior distribution over the statistical hypotheses. For instance, we can use the posterior to determine the most probable value for the parameter, i.e., picking the hypothesis \(h_{\theta}\) for which \(P(h_{\theta} \mid s)\) is maximal.

In this characterization of Bayesian statistical inference the probability of the data \(P(s)\) is not presupposed, because it can be computed from the prior and the likelihoods by the law of total probability, \[ P(s) \; = \; \sum_{\theta \in \Theta} P(h_{\theta}) P(s \mid h_{\theta}) . \] The result of a Bayesian statistical inference is not always reported as a posterior probability. Often the interest is only in comparing the ratio of the posteriors of two hypotheses. By Bayes' theorem we have \[ \frac{P(h_{\theta} \mid s)}{P(h_{\theta'} \mid s)} \; = \; \frac{P(h_{\theta}) P(s \mid h_{\theta})}{P(h_{\theta'}) P(s \mid h_{\theta'})} , \] and if we assume equal priors \(P(h_{\theta}) = P(h_{\theta'})\), we can use the ratio of the likelihoods of the hypotheses, the so-called Bayes factor, to compare the hypotheses.
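
As a minimal sketch of these formulas, the following code computes a posterior over a finite model from a prior and a set of likelihoods, using exact fractions. The three-hypothesis model, the sample, and the function names are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def posterior(prior, likelihood):
    """Return P(h | s) for every hypothesis h in a finite model.

    prior:      dict mapping hypothesis label -> P(h)
    likelihood: dict mapping hypothesis label -> P(s | h)
    """
    # Law of total probability: P(s) = sum over h of P(h) * P(s | h).
    p_s = sum(prior[h] * likelihood[h] for h in prior)
    # Bayes' theorem for each hypothesis.
    return {h: prior[h] * likelihood[h] / p_s for h in prior}


def bayes_factor(likelihood, h, h_alt):
    """Ratio of likelihoods; equals the posterior ratio under equal priors."""
    return likelihood[h] / likelihood[h_alt]


# Hypothetical model: three Bernoulli hypotheses with equal priors, and a
# sample s that is one particular sequence with 3 successes in 4 trials.
biases = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = {theta: Fraction(1, 3) for theta in biases}
likelihood = {theta: theta**3 * (1 - theta) for theta in biases}

post = posterior(prior, likelihood)
for theta in biases:
    print(f"P(h_{theta} | s) = {post[theta]}")
print("Bayes factor, h_3/4 against h_1/2:",
      bayes_factor(likelihood, Fraction(3, 4), Fraction(1, 2)))
```

Because the priors are equal in this illustration, the Bayes factor of 27/16 is also the ratio of the two posterior probabilities.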

Here is a Bayesian procedure for the example of the tea tasting lady.

Bayesian statistical analysis

Consider the hypotheses \(h_{1/2}\) and \(h_{3/4}\), which in the foregoing were used as null and alternative, \(h\) and \(h'\), respectively. Instead of choosing among them on the basis of the data, we assign a prior distribution over them so that the null is twice as probable as the alternative: \(P(h_{1/2}) = 2/3\) and \(P(h_{3/4}) = 1/3\). Denoting a particular sequence of guessing \(n\) out of 5 cups correctly with \(s_{n/5}\), we have that \(P(s_{n/5} \mid h_{1/2}) = 1 / 2^{5}\) while \(P(s_{n/5} \mid h_{3/4}) = 3^{n} / 4^{5}\). As before, the likelihood ratio of five guesses thus becomes \[ \frac{P(s_{n/5} \mid h_{3/4})}{P(s_{n/5} \mid h_{1/2})} \; = \; \frac{3^{n}}{2^{5}} . \] The posterior ratio after 5 correct guesses is thus \[ \frac{P(h_{3/4} \mid s_{5/5})}{P(h_{1/2} \mid s_{5/5})} \; = \; \frac{3^{5}}{2^{5}}\, \frac{1}{2} \approx 3.8 . \] This posterior is derived by the axioms of probability theory alone, in particular by Bayes' theorem. It tells us how believable each of the hypotheses is after incorporating the sample data into our beliefs.
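
The arithmetic of this example can be checked, and extended to fewer than five correct guesses, with a short sketch. The priors and likelihoods are those given above; the dictionary labels and function names are illustrative only.

```python
from fractions import Fraction

# Priors and success probabilities as stated in the tea tasting example.
PRIOR = {"h_1/2": Fraction(2, 3), "h_3/4": Fraction(1, 3)}
BIAS = {"h_1/2": Fraction(1, 2), "h_3/4": Fraction(3, 4)}
CUPS = 5


def likelihood(hypothesis, n_correct):
    """P(s_{n/5} | h): probability of one particular sequence of guesses."""
    p = BIAS[hypothesis]
    return p**n_correct * (1 - p) ** (CUPS - n_correct)


def posterior_ratio(n_correct):
    """P(h_3/4 | s_{n/5}) divided by P(h_1/2 | s_{n/5})."""
    numerator = PRIOR["h_3/4"] * likelihood("h_3/4", n_correct)
    denominator = PRIOR["h_1/2"] * likelihood("h_1/2", n_correct)
    return numerator / denominator


for n in range(CUPS + 1):
    ratio = posterior_ratio(n)
    prob_alt = ratio / (1 + ratio)  # posterior probability of h_3/4
    print(f"n = {n}: ratio = {ratio} ({float(ratio):.2f}), "
          f"P(h_3/4 | s) = {float(prob_alt):.2f}")
```

With five correct guesses the ratio is \(243/64\), roughly 3.8. With the stated prior the alternative only becomes the more probable hypothesis once at least four of the five guesses are correct.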

Notice that in the above exposition, the posterior probability is written as \(P(h_{\theta} \mid s_{n/5})\). Some expositions of Bayesian inference prefer to express the revised opinion as a new probability function \(P'( \cdot )\), which is then equated to the old \(P( \cdot \mid s)\). For the basic formal workings of Bayesian inference, this distinction is inessential. But we will return to it in Section 4.3.3.

#### 4.1.2 Continuous model

In many applications the model is not a finite set of hypotheses,
but rather a continuum labelled by a real-valued parameter. This leads
to some subtle changes in the definition of the distribution over
hypotheses and the likelihoods. The prior and posterior must be
written down as a so-called *probability density function*,
\(P(h_{\theta}) d\theta\). The likelihoods need to be defined by a
limit process: the probability \(P(h_{\theta})\) is infinitely small
so that we cannot define \(P(s \mid h_{\theta})\) in the normal
manner. But other than that the Bayesian machinery works exactly the
same:
\[ P(h_{\theta} \mid s) d\theta \;\; = \;\; \frac{P(s \mid
h_{\theta})}{P(s)} P(h_{\theta}) d\theta. \]
Finally, summations need to be replaced by integrations:
\[ P(s) \; = \; \int_{\theta \in \Theta} P(h_{\theta}) P(s \mid
h_{\theta}) d\theta . \]
This expression is often called the *marginal likelihood* of
the model: it expresses how probable the data is in the light of the
model as a whole.

The posterior probability density provides a basis for conclusions
that one might draw from the sample \(s\), and which are similar to
estimations and measures for the accuracy of the estimations. For one,
we can derive an expectation for the parameter \(\theta\), where we
assume that \(\theta\) varies continuously:
\[ \bar{\theta} \;\; = \;\; \int_{\Theta}\, \theta P(h_{\theta} \mid s)
d\theta. \]
If the model is parameterized by a convex set, which it typically is,
then there will be a hypothesis \(h_{\bar{\theta}}\) in the
model. This hypothesis can serve as a Bayesian estimation. In analogy
to the confidence interval, we can also define a so-called
*credal interval* or *credibility interval*
from the posterior probability distribution: an interval of size
\(2d\) around the expectation value \(\bar{\theta}\), written
\([\bar{\theta} - d, \bar{\theta} + d]\), such that
\[ \int_{\bar{\theta} - d}^{\bar{\theta} + d} P(h_{\theta} \mid s)
d\theta = 1-\epsilon . \]
This range of values for \(\theta\) is such that the posterior
probability of the corresponding \(h_{\theta}\) adds up to
\(1-\epsilon\) of the total posterior probability.
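
The following sketch approximates these integrals on a grid. It assumes, purely for illustration, a model of Bernoulli hypotheses with \(\theta \in [0, 1]\), a uniform prior density, and a sample consisting of one particular sequence with 7 successes in 10 trials. The grid size, the bisection search for \(d\), and all function names are choices of the sketch rather than part of the Bayesian formalism.

```python
def grid_posterior(prior_density, likelihood, grid_size=2000):
    """Midpoint-rule approximation of the posterior density on [0, 1]."""
    width = 1.0 / grid_size
    thetas = [(i + 0.5) * width for i in range(grid_size)]
    unnormalised = [prior_density(t) * likelihood(t) for t in thetas]
    marginal = sum(unnormalised) * width  # approximates P(s)
    density = [u / marginal for u in unnormalised]
    return thetas, density, width, marginal


def posterior_mean(thetas, density, width):
    """Approximate expectation of theta under the posterior."""
    return sum(t * p for t, p in zip(thetas, density)) * width


def interval_mass(thetas, density, width, centre, d):
    """Approximate posterior probability of [centre - d, centre + d]."""
    return sum(p for t, p in zip(thetas, density) if abs(t - centre) <= d) * width


def credal_interval(thetas, density, width, eps=0.05):
    """Interval around the posterior mean holding about 1 - eps of the mass.

    The half-width d is found by bisection. Near the edges of the
    parameter space the interval may extend beyond [0, 1].
    """
    centre = posterior_mean(thetas, density, width)
    lo, hi = 0.0, 1.0
    for _ in range(30):
        mid = (lo + hi) / 2.0
        if interval_mass(thetas, density, width, centre, mid) >= 1.0 - eps:
            hi = mid
        else:
            lo = mid
    return centre - hi, centre + hi


# Illustrative assumptions: uniform prior, 7 successes in 10 trials.
thetas, density, width, marginal = grid_posterior(
    prior_density=lambda t: 1.0,
    likelihood=lambda t: t**7 * (1.0 - t) ** 3,
)
mean = posterior_mean(thetas, density, width)
low, high = credal_interval(thetas, density, width)
print(f"marginal likelihood ~ {marginal:.6f}")
print(f"posterior mean ~ {mean:.4f}")
print(f"95% credal interval ~ [{low:.3f}, {high:.3f}]")
```

For this illustration the posterior expectation is close to \(8/12\), whereas the maximum likelihood estimate is \(7/10\). The difference reflects the influence of the uniform prior.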

There are many other ways of defining Bayesian estimations and credal intervals for \(\theta\) on the basis of the posterior density. The specific type of estimation that the Bayesian analysis offers can be determined by the demands of the scientist. Any Bayesian estimation will to some extent resemble the maximum likelihood estimator due to the central role of the likelihoods in the Bayesian formalism. However, the output will also depend on the prior probability over the hypotheses, and generally speaking it will only tend to the maximum likelihood estimator when the sample size tends to infinity. See Section 4.2.2 for more on this so-called “washing out” of the priors.

### 4.2 Problems with the Bayesian approach

Most of the controversy over the Bayesian method concerns the probability assignment over hypotheses. One important set of problems surrounds the interpretation of those probabilities as beliefs, as having to do with a willingness to act, or the like. Another set of problems pertains to the determination of the prior probability assignment, and the criteria that might govern it.

#### 4.2.1 Interpretations of the probability over hypotheses

The overall question here is how we should understand the probability assigned to a statistical hypothesis. Naturally the interpretation will be epistemic: the probability expresses the strength of belief in the hypothesis. It makes little sense to attempt a physical interpretation since the hypothesis cannot be seen as a repeatable event, or as an event that might have some tendency of occurring.

This leaves open several interpretations of the probability
assignment as a strength of belief. One very influential
interpretation of probability as degree of belief relates probability
to a willingness to bet against certain odds (cf. Ramsey 1926, De
Finetti 1937/1964, Earman 1992, Jeffrey 1992, Howson 2000). According
to this interpretation, assigning a probability of \(3/4\) to a
proposition, for example, means that we are prepared to pay at most
$0.75 for a betting contract that pays out $1 if the
proposition is true, and that turns worthless if the proposition is
false. The claim that degrees of belief are correctly expressed in a
probability assignment is then supported by a so-called *Dutch book
argument*: if an agent does not comply with the axioms of
probability theory, a malign bookmaker can propose a set of bets that
seems fair to the agent but that leads to a certain monetary loss, and
that is therefore called Dutch, presumably owing to the Dutch's
mercantile reputation. This interpretation associates beliefs directly
with their behavioral consequences: believing something is the same as
having the willingness to engage in a particular activity, e.g., in a
bet.
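
A toy instance of such a Dutch book can be sketched as follows. The credences are hypothetical: an agent assigns \(3/5\) both to a proposition and to its negation, in violation of the probability axioms, and buys a $1 betting contract on each at the price she considers fair.

```python
from fractions import Fraction


def net_payoff(credences, outcome):
    """Net gain of an agent who buys a $1 bet on every listed proposition.

    credences: dict mapping proposition -> degree of belief, which is also
               the price the agent is prepared to pay for a $1 bet on it
    outcome:   the proposition that turns out true
    """
    cost = sum(credences.values())
    winnings = sum(1 for proposition in credences if proposition == outcome)
    return winnings - cost


# Hypothetical incoherent credences: they sum to 6/5 rather than 1.
incoherent = {"A": Fraction(3, 5), "not A": Fraction(3, 5)}
# Coherent credences for comparison.
coherent = {"A": Fraction(3, 4), "not A": Fraction(1, 4)}

for label, credences in (("incoherent", incoherent), ("coherent", coherent)):
    payoffs = {outcome: net_payoff(credences, outcome) for outcome in credences}
    print(label, {k: str(v) for k, v in payoffs.items()})
```

Whichever proposition is true, the incoherent agent loses $0.20, while the coherent agent breaks even on the same pair of bets.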

There are several problems with this interpretation of the probability assignment over hypotheses. For one, it seems to make little sense to bet on the truth of a statistical hypothesis, because such hypotheses cannot be falsified or verified. Consequently, a betting contract on them will never be cashed. More generally, it is not clear that beliefs about statistical hypotheses are properly framed by connecting them to behavior in this way. It has been argued (e.g., Armendt 1993) that this way of framing probability assignments introduces pragmatic considerations on beliefs, to do with navigating the world successfully, into a setting that is by itself more concerned with belief as a truthful representation of the world.

A somewhat different problem is that the Bayesian formalism, in particular its use of probability assignments over statistical hypotheses, suggests a remarkable closed-mindedness on the part of the Bayesian statistician. Recall the example of the foregoing, with the model \(M = \{ h_{1/2}, h_{3/4} \}\). The Bayesian formalism requires that we assign a probability distribution over these two hypotheses, and further that the probability of the model is \(P(M) = 1\). It is quite a strong assumption, even of an ideally rational agent, that she is indeed equipped with a real-valued function that expresses her opinion over the hypotheses. Moreover, the probability assignment over hypotheses seems to entail that the Bayesian statistician is certain that the true hypothesis is included in the model. This is an unduly strong claim to which a Bayesian statistician will have to commit at the start of her analysis. It sits badly with broadly shared methodological insights (e.g., Popper 1934/1956), according to which scientific theory must be open to revision at all times (cf. Mayo 1996). In this regard Bayesian statistics does not do justice to the nature of scientific inquiry, or so it seems.

The problem just outlined obtains a mathematically more
sophisticated form in the problem that Bayesians expect to be
*well-calibrated*. This problem, as formulated in Dawid
(1982), concerns a Bayesian forecaster, e.g., a weatherman who
determines a daily probability for precipitation on the next day. It
is then shown that such a weatherman believes of himself that in
the long run he will converge onto the correct probability with
probability 1. Yet it seems reasonable to suppose that the weatherman
realizes something could potentially be wrong with his meteorological
model, and so sets his probability for correct prediction below 1. The
weatherman is thus led to incoherent beliefs. It seems that Bayesian
statistical analysis places unrealistic demands, even on an ideal
agent.

#### 4.2.2 Determination of the prior

For the moment, assume that we can interpret the probability over hypotheses as an expression of epistemic uncertainty. Then how do we determine a prior probability? Perhaps we already have an intuitive judgment on the hypotheses in the model, so that we can pin down the prior probability on that basis. Or else we might have additional criteria for choosing our prior. However, several serious problems attach to procedures for determining the prior.

First consider the idea that the scientist who runs the Bayesian analysis provides the prior probability herself. One obvious problem with this idea is that the opinion of the scientist might not be precise enough for a determination of a full prior distribution. It does not seem realistic to suppose that the scientist can transform her opinion into a single real-valued function over the model, especially not if the model itself consists of a continuum of hypotheses. But the more pressing problem is that different scientists will provide different prior distributions, and that these different priors will lead to different statistical results. In other words, Bayesian statistical inference introduces an inevitable subjective component into scientific method.

It is one thing that the statistical results depend on the initial
opinion of the scientist. But it may so happen that the scientist has
no opinion whatsoever about the hypotheses. How is she supposed to
assign a prior probability to the hypotheses then? The prior will have
to express her ignorance concerning the hypotheses. The leading idea
in expressing such ignorance is usually the *principle of
indifference*: ignorance means that we are indifferent between any
pair of hypotheses. For a finite number of hypotheses, indifference
means that every hypothesis gets equal probability. For a continuum of
hypotheses, indifference means that the probability density
function must be uniform.

Nevertheless, there are different ways of applying the principle of indifference and so there are different probability distributions over the hypotheses that can count as an expression of ignorance. This insight is nicely illustrated in Bertrand's paradox.

Bertrand's paradox

Consider a circle drawn around an equilateral triangle, and now imagine that a knitting needle whose length exceeds the circle's diameter is thrown onto the circle. What is the probability that the section of the needle lying within the circle is longer than the side of the equilateral triangle? To determine the answer, we need to parameterize the ways in which the needle may be thrown, determine the subset of parameter values for which the included section is indeed longer than the triangle's side, and express our ignorance over the exact throw of the needle in a probability distribution over the parameter, so that the probability of the said event can be derived. The problem is that we may provide any number of ways to parameterize how the needle lands in the circle. If we use the angle that the needle makes with the tangent of the circle at the intersection, then the included section of the needle is only going to be longer if the angle is between \(60^{\circ}\) and \(120^{\circ}\). If we assume that our ignorance is expressed by a uniform distribution over these angles, which ranges from \(0^{\circ}\) to \(180^{\circ}\), then the probability of the event is going to be \(1/3\). However, we can also parameterize the ways in which the needle lands differently, namely by the shortest distance of the needle to the centre of the circle. A uniform probability over the distances will lead to a probability of \(1/2\).
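
The two answers can be reproduced with a small simulation. The sketch assumes a circle of radius 1, so that the side of the inscribed equilateral triangle is \(\sqrt{3}\), and uses two elementary facts: a chord at angle \(\alpha\) to the tangent has length \(2 \sin \alpha\), and a chord at distance \(d\) from the centre has length \(2\sqrt{1 - d^{2}}\). The number of trials, the random seed, and the function names are arbitrary choices.

```python
import math
import random

# Side of the equilateral triangle inscribed in a circle of radius 1.
SIDE = math.sqrt(3.0)


def chord_from_angle(alpha):
    """Chord length for a needle at angle alpha (radians) to the tangent."""
    return 2.0 * math.sin(alpha)


def chord_from_distance(d):
    """Chord length for a needle at distance d from the centre."""
    return 2.0 * math.sqrt(1.0 - d * d)


def estimate(sample_chord, trials=100_000, seed=0):
    """Fraction of simulated throws whose chord is longer than SIDE."""
    rng = random.Random(seed)
    hits = sum(sample_chord(rng) > SIDE for _ in range(trials))
    return hits / trials


# Indifference over the angle with the tangent, from 0 to 180 degrees.
p_angle = estimate(lambda rng: chord_from_angle(rng.uniform(0.0, math.pi)))
# Indifference over the distance to the centre, from 0 to the radius.
p_distance = estimate(lambda rng: chord_from_distance(rng.uniform(0.0, 1.0)))

print(f"uniform over angles:    {p_angle:.3f}")
print(f"uniform over distances: {p_distance:.3f}")
```

The two estimates settle near \(1/3\) and \(1/2\) respectively, although both runs express indifference by a uniform distribution.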

Jaynes (1973 and 2003) provides a very insightful discussion of this riddle and also argues that it may be resolved by relying on invariances of the problem under certain transformations. But the general message for now is that the principle of indifference does not lead to a unique choice of priors. The point is not that ignorance concerning a parameter is hard to express in a probability distribution over its values. It is rather that in some cases, we do not even know which parameters to express our ignorance over.

In part the problem of the subjectivity of Bayesian analysis may be
resolved by taking a different attitude to scientific theory, and by
giving up the ideal of absolute objectivity. Indeed, some will argue
that it is just right that the statistical methods accommodate
differences of opinion among scientists. However, this response misses
the mark if the prior distribution expresses ignorance rather than
opinion: it seems harder to defend the rationality of differences of
opinion that stem from different ways of spelling out ignorance. Now
there is also a more positive answer to worries over objectivity,
based on so-called *convergence results* (e.g., Blackwell and
Dubins 1962 and Gaifman and Snir 1982). It turns out that the impact
of prior choice diminishes with the accumulation of data, and that in
the limit the posterior distribution will converge to a set, possibly
a singleton, of best hypotheses, determined by the sampled data and
hence completely independent of the prior distribution. However, in
the short and medium run the influence of subjective prior choice
remains.
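
A simple special case conveys the idea, though it is far less general than the cited theorems. The sketch assumes Bernoulli hypotheses and two scientists with different Beta priors, for which the posterior expectation of \(\theta\) after \(n\) successes in \(t\) trials is \((a + n)/(a + b + t)\). The two priors and the data, in which 70% of the trials are successes, are hypothetical.

```python
from fractions import Fraction


def posterior_mean(a, b, successes, trials):
    """Posterior expectation of theta under a Beta(a, b) prior.

    Uses the standard conjugacy of Beta priors with Bernoulli data.
    """
    return Fraction(a + successes, a + b + trials)


def disagreement(prior_1, prior_2, successes, trials):
    """Absolute difference between two agents' posterior expectations."""
    return abs(posterior_mean(*prior_1, successes, trials)
               - posterior_mean(*prior_2, successes, trials))


# Hypothetical priors: one agent expects a low bias, the other a high bias.
sceptic = (1, 9)     # prior expectation 1/10
enthusiast = (9, 1)  # prior expectation 9/10

for trials in (0, 10, 100, 1000, 10000):
    successes = 7 * trials // 10  # illustrative data: 70% successes
    gap = disagreement(sceptic, enthusiast, successes, trials)
    print(f"t = {trials:>5}: gap between posterior expectations = {float(gap):.4f}")
```

In this case the gap equals \(8/(10 + t)\): it vanishes in the limit but is still substantial after ten observations.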

Summing up, it remains problematic that Bayesian statistics is sensitive to subjective input. The undeniable advantage of the classical statistical procedures is that they do not need any such input, although arguably the classical procedures are in turn sensitive to choices concerning the sample space (Lindley 2000). Against this, Bayesian statisticians point to the advantage of being able to incorporate initial opinions into the statistical analysis.

### 4.3 Responses to criticism

The philosophy of Bayesian statistics offers a wide range of responses to the problems outlined above. Some Bayesians bite the bullet and defend the essentially subjective character of Bayesian methods. Others attempt to remedy or compensate for the subjectivity, by providing objectively motivated means of determining the prior probability or by emphasizing the objective character of the Bayesian formalism itself.

#### 4.3.1 Strict but empirically informed subjectivism

One very influential view on Bayesian statistics buys into the
subjectivity of the analysis (e.g., Goldstein 2006, Kadane 2011).
So-called *personalists* or *strict subjectivists* argue
that it is just right that the statistical methods do not provide any
objective guidelines, pointing to radically subjective sources of any
form of knowledge. The problems on the interpretation and choice of
the prior distribution are thus dissolved, at least in part: the
Bayesian statistician may choose her prior at will, and it is an
expression of her beliefs. However, it deserves emphasis that a
subjectivist view on Bayesian statistics does not mean that all
constraints deriving from empirical fact can be disregarded. Nobody
denies that if you have further knowledge that imposes constraints on
the model or the prior, then those constraints must be
accommodated. For example, today's posterior probability may be used
as tomorrow's prior, in the next statistical inference. The point is
that such constraints concern the rationality of belief and not the
consistency of the statistical inference per se.

Subjectivist views are most prominent among those who interpret
probability assignments in a pragmatic fashion, and motivate the
representation of belief with probability assignments by the
afore-mentioned Dutch book arguments. Central to this approach is the
work of Savage and De Finetti. Savage (1962) proposed to axiomatize
statistics in tandem with *decision theory*, a mathematical
theory about practical rationality. He argued that by themselves the
probability assignments do not mean anything at all, and that they can
only be interpreted in the context where an agent faces a choice
between actions, i.e., a choice among a set of bets. In a similar vein,
De Finetti (e.g., 1974) advocated a view on statistics in which only
the empirical consequences of the probabilistic beliefs, expressed in
a willingness to bet, mattered, but he did not make statistical
inference fully dependent on decision theory. Remarkably, it thus
appears that the subjectivist view on Bayesian statistics is based on
the same behaviorism and empiricism that motivated Neyman and Pearson
to develop classical statistics.

Notice that all this makes one aspect of the interpretation problem of Section 4.2.1 reappear: how will the prior distribution over hypotheses make itself apparent in behavior, so that it can rightfully be interpreted in terms of belief, here understood as a willingness to act? One response to this question is to turn to different motivations for representing degrees of belief by means of probability assignments. Following work by De Finetti, several authors have proposed vindications of probabilistic expressions of belief that are not based on behavioral goals, but rather on the epistemic goal of holding beliefs that accurately represent the world, e.g., Rosenkrantz (1981), Joyce (2001), Leitgeb and Pettigrew (2010), Easwaran (2013). A strong generalization of this idea is achieved in Schervish, Seidenfeld and Kadane (2009), which builds on a longer tradition of using scoring rules for achieving statistical aims. An alternative approach holds that any formal representation of belief must respect certain logical constraints; e.g., Cox provides an argument for the expression of belief in terms of probability assignments on the basis of the nature of partial belief per se.

However, the original subjectivist response to the issue that a
prior over hypotheses is hard to interpret came from De Finetti's
so-called *representation theorem*, which shows that every
prior distribution can be associated with its own set of predictions,
and hence with its own behavioral consequences. In other words, De
Finetti showed how priors are indeed associated with beliefs that can
carry a betting interpretation.

#### 4.3.2 Excursion: the representation theorem

De Finetti's representation theorem relates rules for
prediction, as functions of the given sample data, to Bayesian
statistical analyses of those data, against the background of a
statistical model. See Festa (1996) and Suppes (2001) for useful
introductions. De Finetti considers a process that generates a series
of time-indexed observations, and he then studies prediction rules
that take finite segments of this series as input and return a probability over
future events, using a statistical model that can analyze such samples
and provide the predictions. The key result of De Finetti is that a
particular statistical model, namely the set of all distributions in
which the observations are independently and identically
distributed, can be equated with the class of *exchangeable
prediction rules*, namely the rules whose predictions do not
depend on the order in which the observations come in.

Let us consider the representation theorem in some more formal detail. For simplicity, say that the process generates time-indexed binary observations, i.e., 0's and 1's. The prediction rules take such bit strings of length \(t\), denoted \(S_{t}\), as input, and return a probability for the event that the next bit in the string is a 1, denoted \(Q^{1}_{t+1}\). So we write the prediction rules as partial probability assignments \(P(Q^{1}_{t+1} \mid S_{t})\). Exchangeable prediction rules are rules that deliver the same prediction independently of the order of the bits in the string \(S_{t}\). If we write the event that the string \(S_{t}\) has a total of \(n\) observations of 1's as \(S_{n/t}\), then exchangeable prediction rules are written as \(P(Q^{1}_{t+1} \mid S_{n/t})\). The crucial property is that the value of the prediction is not affected by the order in which the 0's and 1's show up in the string \(S_{t}\).

De Finetti relates this particular set of exchangeable prediction
rules to a Bayesian inference over a specific type of statistical
model. The model that De Finetti considers comprises the so-called
*Bernoulli hypotheses* \(h_{\theta}\), i.e., hypotheses for
which
\[ P(Q^{1}_{t+1} \mid h_{\theta} \cap S_{t}) = \theta . \]
This likelihood does not depend on the string \(S_{t}\) that has gone
before. The hypotheses are best thought of as determining a fixed bias
\(\theta\) for the binary process, where \(\theta \in \Theta = [0,
1]\). The *representation theorem* states that there is a
one-to-one mapping of priors over Bernoulli hypotheses and
exchangeable prediction rules. That is, every prior distribution
\(P(h_{\theta})\) can be associated with exactly one exchangeable
prediction rule \(P(Q^{1}_{t+1} \mid S_{n/t})\), and conversely. Next
to the original representation theorem derived by De Finetti, several
other and more general representation theorems were proved, e.g., for
partially exchangeable sequences and hypotheses on Markov processes
(Diaconis and Freedman 1980, Skyrms 1991), for clustering predictions
and partitioning processes (Kingman 1975 and 1978), and even for
sequences of graphs and their generating process (Aldous 1981).
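
For the binary case described above, the correspondence between priors and exchangeable prediction rules can be illustrated numerically. The following is a minimal sketch under stated assumptions, not a rendering of De Finetti's proof: the prior over the Bernoulli hypotheses is taken to be a Beta density for convenience, the integral over \(\theta\) is approximated on a grid, and the function name `predict_next_one` is hypothetical.

```python
from math import isclose


def predict_next_one(bits, a=1.0, b=1.0, grid=20000):
    """Approximate P(Q^1_{t+1} | S_t) by averaging over Bernoulli hypotheses.

    Assumption of this sketch: the prior density over theta is proportional to
    theta^(a-1) * (1-theta)^(b-1), a Beta prior; a = b = 1 is the uniform prior.
    """
    n, t = sum(bits), len(bits)
    numerator = 0.0
    denominator = 0.0
    for i in range(grid):
        theta = (i + 0.5) / grid  # midpoint rule on [0, 1]
        prior = theta ** (a - 1) * (1 - theta) ** (b - 1)
        likelihood = theta ** n * (1 - theta) ** (t - n)
        numerator += theta * prior * likelihood
        denominator += prior * likelihood
    return numerator / denominator


# Two strings with the same number of 1's but a different order.
p_first = predict_next_one([1, 1, 0, 1, 0])
p_second = predict_next_one([0, 0, 1, 1, 1])
```

Both strings contain three 1's among five bits, so the two predictions coincide: the rule induced by the prior only looks at \(n\) and \(t\). With the uniform prior the prediction comes out as \((n + 1)/(t + 2) = 4/7\), and other choices of `a` and `b` yield other exchangeable rules.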

Representation theorems equate a prior distribution over statistical hypotheses to a prediction rule, and thus to a probability assignment that can be given a subjective and behavioral interpretation. This removes the worry expressed above, that the prior distribution over hypotheses cannot be interpreted subjectively because it cannot be related to belief as a willingness to act: priors relate uniquely to particular predictions. However, for De Finetti the representation theorem provided a reason for doing away with statistical hypotheses altogether, and hence for the removal of a notion of probability as anything other than subjective opinion (cf. Hintikka 1970): hypotheses whose probabilistic claims could be taken to refer to intangible chancy processes are superfluous metaphysical baggage.

Not all subjectivists are equally dismissive of the use of
statistical hypotheses. Jeffrey (1992) has proposed so-called
*mixed Bayesianism* in which subjectively interpreted
distributions over the hypotheses are combined with a physical
interpretation of the distributions that hypotheses define over sample
space. Romeijn (2003, 2005, 2006) argues that priors over hypotheses
are an efficient and more intuitive way of determining inductive
predictions than specifying properties of predictive systems directly. This advantage of using hypotheses seems in agreement with the practice of science,
in which hypotheses are routinely used, and often motivated by mechanistic knowledge on the data generating process. The fact that statistical hypotheses can
strictly speaking be eliminated does not take away from their utility in making predictions.

#### 4.3.3 Bayesian statistics as logic

Despite its—seemingly inevitable—subjective character,
there is a sense in which Bayesian statistics might lay claim to
objectivity. It can be shown that the Bayesian formalism meets certain
objective criteria of rationality, coherence, and
calibration. Bayesian statistics thus answers to the requirement of
objectivity at a meta-level: while the opinions that it deals with
retain a subjective aspect, the way in which it deals with these
opinions, in particular the way in which data impacts on them, is
objectively correct, or so it is argued. Arguments supporting the
Bayesian way of accommodating data, namely by
*conditionalization*, have been provided in a pragmatic context
by *dynamic Dutch book arguments*, whereby probability is
interpreted as a willingness to bet (cf. Maher 1993, van Fraassen
1989). Similar arguments have been advanced on the grounds that our
beliefs must accurately represent the world along the lines of De
Finetti (1974), e.g., Greaves and Wallace (2006) and Leitgeb and
Pettigrew (2010).

An important distinction must be made in arguments that support the
Bayesian way of accommodating evidence: the distinction between Bayes'
theorem, as a mathematical given, and *Bayes' rule*, as a
principle of coherence over time. The theorem is simply a mathematical
relation among probability assignments,
\[ P(h \mid s) \; = \; P(h) \frac{P(s \mid h)}{P(s)} , \]
and as such not subject to debate. Arguments that support the
representation of the epistemic state of an agent by means of
probability assignments also provide support for Bayes' theorem as a
constraint on degrees of belief. The conditional probability \(P(h
\mid s)\) can be interpreted as the degree of belief attached to the
hypothesis \(h\) on the condition that the sample \(s\) is obtained,
as an integral part of the epistemic state captured by the probability
assignment. Bayes' rule, by contrast, presents a constraint on
probability assignments that represent epistemic states of an agent at
different points in time. It is written as
\[ P_{s}(h) \; = P(h \mid s) , \]
and it determines that the new probability assignment, expressing the
epistemic state of the agent after the sample has been obtained, is
systematically related to the old assignment, representing the
epistemic state before the sample came in. In the philosophy of
statistics many Bayesians adopt Bayes' rule implicitly, but in what
follows I will only assume that Bayesian statistical inferences rely
on Bayes' theorem.
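
To make the distinction concrete, consider a small computation. The sketch below is merely illustrative: the two hypotheses, their prior probabilities, and the sample are made up, and the function name `posterior` is hypothetical. The function does no more than apply Bayes' theorem within a single probability assignment.

```python
def posterior(prior, likelihood):
    """Apply Bayes' theorem: P(h | s) = P(h) * P(s | h) / P(s)."""
    # P(s) follows from the law of total probability.
    p_s = sum(prior[h] * likelihood[h] for h in prior)
    return {h: prior[h] * likelihood[h] / p_s for h in prior}


# Made-up example: two Bernoulli hypotheses and the sample s = 1, 1, 0.
prior = {"theta=0.5": 0.5, "theta=0.8": 0.5}
likelihood = {"theta=0.5": 0.5 ** 3, "theta=0.8": 0.8 ** 2 * 0.2}
post = posterior(prior, likelihood)
```

The dictionary `post` contains the conditional probabilities \(P(h \mid s)\), which are already part of the original assignment. Adopting these values as the new unconditional opinion \(P_{s}(h)\) once the sample has actually been obtained is the further step licensed by Bayes' rule, and nothing in the computation itself forces that step.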

Whether the focus lies on Bayes' rule or on Bayes' theorem, the
common theme in the above-mentioned arguments is that they approach
Bayesian statistical inference from a logical angle, and focus on its
internal coherence or consistency (cf. Howson 2003). While its use in
statistics is undeniably inductive, Bayesian inference thereby obtains
a deductive, or at least non-ampliative character: everything that is
concluded in the inference is somehow already present in the
premises. In Bayesian statistical inference, those premises are given
by the prior over the hypotheses, \(P(h_{\theta})\) for \(\theta \in
\Theta\), and the likelihood functions, \(P(s \mid h_{\theta})\), as
determined for each hypothesis \(h_{\theta}\) separately. These
premises fix a single probability assignment over the space \(M \times
S\) at the outset of the inference. The conclusions, in turn, are
straightforward consequences of this probability assignment. They can
be derived by applying theorems of probability theory, most notably
Bayes' theorem. Bayesian statistical inference thus becomes an
instance of *probabilistic logic* (cf. Hailperin 1986, Halpern
2003, Haenni *et al* 2011).

Summing up, there are several arguments showing that statistical inference by Bayes' theorem, or by Bayes' rule, is objectively correct. These arguments invite us to consider Bayesian statistics as an instance of probabilistic logic. Such appeals to the logicality of Bayesian statistical inference may provide a partial remedy for its subjective character. Moreover, a logical approach to the statistical inferences avoids the problem that the formalism places unrealistic demands on the agents, and that it presumes the agent to have certain knowledge. Much like in deductive logic, we need not assume that the inferences are psychologically realistic, nor that the agents actually believe the premises of the arguments. Rather the arguments present the agents with a normative ideal and take the conditional form of consistency constraints: if you accept the premises, then these are the conclusions.

#### 4.3.4 Excursion: inductive logic and statistics

An important instance of probabilistic logic is presented in
*inductive logic*, as devised by Carnap, Hintikka and others
(Carnap 1950 and 1952, Hintikka and Suppes 1966, Carnap and Jeffrey
1970, Hintikka and Niiniluoto 1980, Kuipers 1978, and Paris 1994, Nix
and Paris 2006, Paris and Waterhouse 2009). Historically, Carnapian
inductive logic developed prior to the probabilistic logics referenced
above, and more or less separately from the debates in the philosophy
of statistics. But the logical systems of Carnap can quite easily be
placed in the context of a logical approach to Bayesian inference, and
doing this is in fact quite insightful.

For simplicity, we choose a setting that is similar to the one used
in the exposition of the representation theorem, namely a binary data
generating process, i.e., strings of 0's and 1's. A prediction rule
determines a probability for the event, denoted \(Q^{1}_{t+1}\), that
the next bit in the string is a 1, on the basis of a given string of
bits with length \(t\), denoted by \(S_{t}\). Carnap and followers
designed specific exchangeable prediction rules, mostly variants of
the *straight rule* (Reichenbach 1938), for example
\[ P(Q^{1}_{t+1} \mid S_{n/t}) = \frac{n + 1}{t + 2} , \]
where \(S_{n/t}\) denotes a string of length \(t\) of which \(n\)
entries are 1's. Carnap derived such rules from constraints on the
probability assignments over the samples. Some of these constraints
boil down to the axioms of probability. Other constraints,
exchangeability among them, are independently motivated, by an appeal
to the so-called *logical interpretation of probability*. Under
this logical interpretation, the probability assignment must respect
certain invariances under transformations of the sample space, in
analogy to logical principles that constrain truth valuations over a
language in a particular way.

Carnapian inductive logic is an instance of probabilistic logic, because its sequential predictions are all based on a single probability assignment at the outset, and because it relies on Bayes' theorem to adapt the predictions to sample data (cf. Romeijn 2011). One important difference with Bayesian statistical inference is that, for Carnap, the probability assignment specified at the outset only ranges over samples and not over hypotheses. However, by De Finetti's representation theorem Carnap's exchangeable rules can be equated to particular Bayesian statistical inferences. A further difference is that Carnapian inductive logic gives preferred status to particular exchangeable rules. In view of De Finetti's representation theorem, this comes down to the choice for a particular set of preferred priors. As further developed below, Carnapian inductive logic is thus related to objective Bayesian statistics. It is a moot point whether further constraints on the probability assignments can be considered as logical, as Carnap and followers have it, or whether the title of logic is best reserved for the probability formalism in isolation, as De Finetti and followers argue.

#### 4.3.5 Objective priors

A further set of responses to the subjectivity of Bayesian statistical inference targets the prior distribution directly: we might provide further rationality principles, with which priors can be chosen objectively. The literature proposes several objective criteria for filling in the prior over the model. Each of these lays claim to being the correct expression of complete ignorance concerning the value of the model parameters, or of minimal information regarding the parameters. Three such criteria are discussed here.

In the context of Bertrand's paradox we already discussed
the principle of indifference, according to which probability
should be distributed evenly over the available possibilities. A
further development of this idea is presented by the requirement that
a distribution should have maximum entropy. Notably, the use of
entropy maximization for determining degrees of beliefs finds much
broader application than only in statistics: similar ideas are taken
up in diverse fields like epistemology (e.g., Shore and Johnson 1980,
Williams 1980, Uffink 1996, and also Williamson 2010), inductive logic
(Paris and Vencovska 1989), statistical mechanics (Jaynes 2003)
and decision theory (Seidenfeld 1986, Grunwald and Halpern 2004). In
*objective Bayesian statistics*, the idea is applied to the
prior distribution over the model (cf. Berger 2006). For a finite
number of hypotheses the entropy of the distribution \(P(h_{\theta})\)
is defined as
\[ E[P] \; = \; - \sum_{\theta \in \Theta} P(h_{\theta}) \log
P(h_{\theta}) . \]
This requirement unequivocally leads to equiprobable
hypotheses. However, for continuous models the maximum entropy
distribution depends crucially on the metric over the parameters in
the model. The burden of subjectivity is thereby moved to the
parameterization, but of course it may well be that we have strong
reasons for preferring a particular parameterization over others (cf.
Jaynes 1973).
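
For the finite case, the claim that maximum entropy leads to equiprobable hypotheses is easily checked. The sketch below uses the standard Shannon entropy, \(-\sum p \log p\), and two made-up distributions over four hypotheses; the function name `entropy` is hypothetical.

```python
from math import log


def entropy(dist):
    """Shannon entropy -sum p*log(p), with the convention 0*log(0) = 0."""
    return -sum(p * log(p) for p in dist if p > 0)


# Two made-up priors over four hypotheses.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
```

Here `entropy(uniform)` equals \(\log 4\), the largest value attainable with four hypotheses, while `entropy(skewed)` is strictly smaller.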

There are other approaches to the objective determination of
priors. In view of the above problems, a particularly attractive
method for choosing a prior over a continuous model is proposed by
Jeffreys (1961). The general idea of so-called *Jeffreys
priors* is that the prior probability assigned to a small patch in
the parameter space is proportional to, what may be called, the
density of the distributions within that patch. Intuitively, if a lot
of distributions, i.e., distributions that differ quite a lot among
themselves, are packed together on a small patch in the parameter
space, this patch should be given a larger prior probability than a
similar patch within which there is little variation among the
distributions (cf. Balasubramanian 2005). More technically, such a
density is expressed by a prior distribution that is proportional to
the square root of the determinant of the *Fisher information*. A key
advantage of these priors is
that they are invariant under reparameterizations of the parameter
space: a new parameterization naturally leads to an adjusted density
of distributions.
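
As an illustration, consider the Bernoulli hypotheses \(h_{\theta}\) used earlier. The sketch below assumes a model for a single binary observation, for which the Fisher information is \(1/(\theta(1 - \theta))\); in the one-parameter case the Jeffreys density is taken proportional to the square root of this quantity. The function names are hypothetical.

```python
from math import sqrt


def fisher_information(theta):
    """Fisher information of a single Bernoulli observation with bias theta."""
    return 1.0 / (theta * (1.0 - theta))


def jeffreys_density_unnormalized(theta):
    """Jeffreys prior density up to a constant: sqrt of the Fisher information."""
    return sqrt(fisher_information(theta))
```

After normalization this is the Beta(1/2, 1/2) density. It puts more weight near \(\theta = 0\) and \(\theta = 1\) than near \(\theta = 1/2\), reflecting that a small change in the bias makes a larger difference to the distribution when the bias is extreme.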

A final method of defining priors goes under the name of
*reference priors* (Berger et al 2009). The proposal
starts from the observation that we should minimize the subjectivity
of the results of our statistical analysis, and hence that we should
minimize the impact of the prior probability on the posterior. The
idea of reference priors is exactly that it will allow the sample data
a maximal say in the posterior distribution. But since at the outset
we do not know what sample we will obtain, the prior is chosen so as
to maximize the expected impact of the data. The expectation must
itself be taken with respect to some distribution over sample space,
but again, it may well be that we have strong reasons for this latter
distribution.

#### 4.3.6 Circumventing priors

A different response to the subjectivity of priors is to extend the Bayesian formalism, in order to leave the choice of prior to some extent open. The subjective choice of a prior is in that case circumvented. Two such responses will be considered in some detail.

Recall that a prior probability distribution over statistical
hypotheses expresses our uncertain opinion on which of the hypotheses
is right. The central idea behind *hierarchical Bayesian
models* (Gelman et al 2013) is that the same pattern of putting a
prior over statistical hypotheses can be repeated on the level of
priors itself. More precisely, we may be uncertain over which prior
probability distribution over the hypotheses is right. If we
characterize possible priors by means of a set of parameters, we can
express this uncertainty about prior choice in a probability
distribution over the parameters that characterize the shape of the
prior. In other words, we move our uncertainty one level up in a
hierarchy: we consider multiple priors over the statistical
hypotheses, and compare the performance of these priors on the sample
data as if the priors were themselves hypotheses.
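
The idea of treating priors as hypotheses can be sketched for the binary process discussed earlier. The example assumes that the candidate priors are Beta densities over the bias \(\theta\), characterized by two parameters `a` and `b`, and that the distribution over these candidates is uniform; the candidates, the data, and all function names are made up for illustration.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def marginal_likelihood(n, t, a, b):
    """P(S_t | Beta(a, b) prior) for one particular string with n ones in t bits."""
    return exp(log_beta(a + n, b + t - n) - log_beta(a, b))


def posterior_over_priors(n, t, candidates, hyperprior):
    """Treat each candidate prior as a hypothesis and update the hyperprior on it."""
    joint = [w * marginal_likelihood(n, t, a, b)
             for (a, b), w in zip(candidates, hyperprior)]
    total = sum(joint)
    return [j / total for j in joint]


# Three candidate priors: uniform, peaked around 0.5, and favouring extreme biases.
candidates = [(1.0, 1.0), (20.0, 20.0), (0.5, 0.5)]
weights = posterior_over_priors(n=18, t=20, candidates=candidates,
                                hyperprior=[1 / 3, 1 / 3, 1 / 3])
```

With 18 ones among 20 observations, the prior that is peaked around \(\theta = 1/2\) performs poorly on the data and loses most of its weight to the other two candidates.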

The idea of hierarchical Bayesian modeling (Gelman et al 2013)
relates naturally to the Bayesian comparison of Carnapian prediction
rules (e.g., Skyrms 1993 and 1996, Festa 1996), and also to the
estimation of optimum inductive methods (Kuipers 1986, Festa 1993).
Hierarchical Bayesian modeling can also be related to another
tool for choosing a particular prior distribution over hypotheses,
namely the method of *empirical Bayes*, which estimates the
prior that leads to the maximal marginal likelihood of the model. In
the philosophy of science, hierarchical Bayesian modeling has made a
first appearance due to Henderson *et al* (2010).

There is also a response that avoids the choice of a prior
altogether. This response starts with the same idea as hierarchical
models: rather than considering a single prior over the hypotheses in
the model, we consider a parameterized set of them. But instead of
defining a distribution over this set, proponents of
*interval-valued* or *imprecise probability* claim that
our epistemic state regarding the priors is better expressed by this
set of distributions, and that sharp probability assignments must
therefore be replaced by lower and upper bounds to the
assignments. Now the idea that uncertain opinion is best captured by a
set of probability assignments, or a *credal set* for short,
has a long history and is backed by an extensive literature (e.g., De
Finetti 1974, Levi 1980, Dempster 1967 and 1968, Shafer 1976, Walley
1991). In light of the main debate in the philosophy of statistics,
the use of interval-valued priors indeed forms an attractive extension
of Bayesian statistics: it allows us to refrain from choosing a
specific prior, and thereby presents a rapprochement to the classical
view on statistics.
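
A simple sketch may clarify how a set of priors leads to lower and upper bounds. It assumes, purely for illustration, that the credal set consists of Beta priors over the bias of a binary process, each characterized by a pair `(a, b)`; under such a prior the probability that the next bit is a 1, after \(n\) ones among \(t\) observations, is \((n + a)/(t + a + b)\). The particular set and the function name are hypothetical.

```python
def predictive_bounds(n, t, prior_set):
    """Lower and upper probability that the next bit is 1 over a set of Beta priors.

    Each prior is a pair (a, b); under Beta(a, b) the prediction is
    (n + a) / (t + a + b).
    """
    predictions = [(n + a) / (t + a + b) for a, b in prior_set]
    return min(predictions), max(predictions)


# Hypothetical credal set: Beta priors with a + b = 2 and a on a grid in (0, 2).
credal_set = [(a / 10, 2 - a / 10) for a in range(1, 20)]
lower_small, upper_small = predictive_bounds(3, 5, credal_set)
lower_large, upper_large = predictive_bounds(300, 500, credal_set)
```

The interval between the lower and upper prediction is wide for the small sample and narrows considerably for the large one, so that the indeterminacy in the prior matters less as the data accumulate.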

These theoretical developments may look attractive, but the fact is
that they mostly enjoy a cult status among philosophers of statistics
and that they have not moved the statistician in the street. On the
other hand, standard Bayesian statistics has seen a steep rise in
popularity over the past decade or so, owing to the availability of
good software and numerical approximation methods. And most of the
practical use of Bayesian statistics is more or less insensitive to
the potentially subjective aspects of the statistical results,
employing uniform priors as a neutral starting point for the analysis
and relying on the afore-mentioned convergence results to wash out the
remaining subjectivity (cf. Gelman and Shalizi 2013). However, this
practical attitude of scientists towards modelling should not be
mistaken for a principled answer to the questions raised in the
philosophy of statistics (see Morey *et al* 2013).

## 5. Statistical models

In the foregoing we have seen how classical and Bayesian statistics
differ. But the two major approaches to statistics also have a lot in
common. Most importantly, all statistical procedures rely on the
assumption of a *statistical model*, here referring to any
restricted set of statistical hypotheses. Moreover, they are both
aimed at delivering a verdict over these hypotheses. For example, a
classical likelihood ratio test considers two hypotheses, \(h\) and
\(h'\), and then offers a verdict of rejection and acceptance, while a
Bayesian comparison delivers a posterior probability over these two
hypotheses. Whereas in Bayesian statistics the model presents a very
strong assumption, classical statistics does not endow the model with
a special epistemic status: it simply comprises the hypotheses currently
entertained by the scientist. But across the board, the adoption of a
model is absolutely central to any statistical procedure.

A natural question is whether anything can be said about the quality of the statistical model, and whether any verdict on this starting point for statistical procedures can be given. Surely some models will lead to better predictions, or be a better guide to the truth, than others. The evaluation of models touches on deep issues in the philosophy of science, because the statistical model often determines how the data-generating system under investigation is conceptualized and approached (Kieseppa 2001). Model choice thus resembles the choice of a theory, a conceptual scheme, or even of a whole paradigm, and thereby might seem to transcend the formal frameworks for studying theoretical rationality (cf. Carnap 1950, Jeffrey 1980). Despite the fact that some considerations on model choice will seem extra-statistical, in the sense that they fall outside the scope of statistical treatment, statistics offers several methods for approaching the choice of statistical models.

### 5.1 Model comparisons

There are in fact very many methods for evaluating statistical models (Claeskens and Hjort 2008, Wagenmakers and Waldorp 2006). In first instance, the methods occasion the comparison of statistical models, but very often they are used for selecting one model over the others. In what follows we only review prominent techniques that have led to philosophical debate: Akaike's information criterion, the Bayesian information criterion, and furthermore the computation of marginal likelihoods and posterior model probabilities, both associated with Bayesian model selection. We leave aside methods that use cross-validation as they have, unduly, not received as much attention in the philosophical literature.

#### 5.1.1 Akaike's information criterion

*Akaike's information criterion*, modestly termed An
Information Criterion or AIC for short, is based on the classical
statistical procedure of estimation (see Burnham and Anderson 2002,
Kieseppa 1997). It starts from the idea that a model \(M\) can be
judged by the estimate \(\hat{\theta}\) that it delivers, and more
specifically by the proximity of this estimate to the distribution
with which the data are actually generated, i.e., the true
distribution. This proximity is often equated with the expected
predictive accuracy of the estimate, because if the estimate and the
true distribution are closer to each other, their predictions will be
better aligned to one another as well. In the derivation of the AIC,
the so-called relative entropy or *Kullback-Leibler divergence*
of the two distributions is used as a measure of their proximity, and
hence as a measure of the expected predictive accuracy of the
estimate.

Naturally, the true distribution is not known to the statistician who is evaluating the model. If it were, then the whole statistical analysis would be useless. However, it turns out that we can give an unbiased estimation of the divergence between the true distribution and the distribution estimated from a particular model, \[ \text{AIC}[M] = - 2 \log P( s \mid h_{\hat{\theta}(s)} ) + 2 d , \] in which \(s\) is the sample data, \(\hat{\theta}(s)\) is the maximum likelihood estimate (MLE) of the model \(M\), and \(d = dim(\Theta)\) is the number of dimensions of the parameter space of the model. The MLE of the model thereby features in an expression of the model quality, i.e., in a role that is conceptually distinct from the estimator function.

As can be seen from the expression above, a model with a smaller AIC is preferable: we want the fit to be optimal at little cost in complexity. Notice that the number of dimensions, or independent parameters, in the model increases the AIC and thereby lowers the eligibility of the model: if two models achieve the same maximum likelihood for the sample, then the model with fewer parameters will be preferred. For this reason, statistical model selection by the AIC can be seen as an independent motivation for preferring simple models over more complex ones (Sober and Forster 1994). But this result also invites some critical remarks. For one, we might impose other criteria than merely the unbiasedness on the estimation of the proximity to the truth, and this will lead to different expressions for the approximation. Moreover, it is not always clearcut what the dimensions of the model under scrutiny really are. For curve fitting this may seem simple, but for more complicated models or different conceptualizations of the space of models, things do not look so easy (cf. Myung et al 2001, Kieseppa 2001).

A prime example of model selection is presented in *curve
fitting*. Given a sample \(s\) consisting of a set of points in
the plane \((x, y)\), we are asked to choose the curve that fits these
data best. We assume that the models under consideration are of the
form \(y = f(x) + \epsilon\), where \(\epsilon\) is a normal
distribution with mean 0 and a fixed standard deviation, and where
\(f\) is a polynomial function. Different models are characterized by
polynomials of different degrees that have different numbers of
parameters. Estimations fix the parameters of these polynomials. For
example, for the 0-degree polynomial \(f(x) = c_{0}\) we estimate the
constant \(\hat{c_{0}}\) for which the probability of the data is
maximal, and for the 1-degree polynomial \(f(x) = c_{0} + c_{1}\, x\)
we estimate the slope \(\hat{c_{1}}\) and the offset
\(\hat{c_{0}}\). Now notice that for a total of \(n\) points, we can
always find a polynomial of degree \(n\) that intersects with all
points exactly, resulting in a comparatively high maximum likelihood
\(P(s \mid \{\hat{c_{0}}, \ldots \hat{c_{n}} \})\). Applying the AIC,
however, we will typically find that some model with a polynomial of
degree \(k < n\) is preferable. Although \(P(s \mid \{\hat{c_{0}},
\ldots \hat{c_{k}} \})\) will be somewhat lower, this is compensated
for in the AIC by the smaller number of parameters.
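
The curve fitting example can be sketched in code. The following is an illustration under stated assumptions rather than a definitive implementation: the noise has a known standard deviation `sigma` (set to 1 here), the maximum likelihood estimate is obtained by ordinary least squares, the data points are made up, and all function names are hypothetical.

```python
from math import log, pi


def solve(matrix, vector):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    size = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, size))
        solution[r] = (aug[r][size] - tail) / aug[r][r]
    return solution


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients c_0..c_degree, the MLE under normal noise."""
    powers = [[x ** k for k in range(degree + 1)] for x in xs]
    normal = [[sum(p[i] * p[j] for p in powers) for j in range(degree + 1)]
              for i in range(degree + 1)]
    rhs = [sum(p[i] * y for p, y in zip(powers, ys)) for i in range(degree + 1)]
    return solve(normal, rhs)


def aic(xs, ys, degree, sigma=1.0):
    """AIC = -2 log P(s | MLE) + 2d, with d = degree + 1 and fixed noise sd sigma."""
    coeffs = fit_polynomial(xs, ys, degree)
    rss = sum((y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
              for x, y in zip(xs, ys))
    log_lik = -0.5 * len(xs) * log(2 * pi * sigma ** 2) - rss / (2 * sigma ** 2)
    return -2 * log_lik + 2 * (degree + 1)


# Made-up data lying roughly on a straight line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]
scores = {degree: aic(xs, ys, degree) for degree in range(0, 5)}
best_degree = min(scores, key=scores.get)
```

For these data the higher-degree polynomials fit slightly better than the straight line, but the improvement in likelihood does not make up for the additional parameters, so the model with the 1-degree polynomial obtains the smallest AIC.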

#### 5.1.2 Bayesian evaluation of models

Various other prominent model selection tools are based on methods
from Bayesian statistics. They all start from the idea that the
quality of a model is expressed in the performance of the model on the
sample data: the model that, on the whole, makes the sampled data most
probable is to be preferred. Because of this, there is a close
connection with the hierarchical Bayesian modelling referred to
earlier (Gelman et al 2013). The central notion in the Bayesian model
selection tools is thus the marginal likelihood of the model, i.e.,
the weighted average of the likelihoods over the model, using the
prior distribution as a weighting function:
\[ P(s \mid M_{i}) \; = \; \int_{\theta \in \Theta_{i}} P(h_{\theta}) P(s
\mid h_{\theta}) d\theta . \]
Here \(\Theta_{i}\) is the parameter space belonging to model
\(M_{i}\). The marginal likelihoods can be combined with a prior
probability over models, \(P(M_{i})\), to derive the
so-called *posterior model probability*, using Bayes'
theorem. One way of evaluating models, known as *Bayesian model
selection*, is by comparing the models on their marginal
likelihood, or else on their posteriors (cf. Kass and Raftery
1995).
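
A small worked example may help. The sketch below compares two made-up models for a binary process: \(M_{0}\), which contains only the hypothesis \(\theta = 1/2\), and \(M_{1}\), which contains all Bernoulli hypotheses with a uniform prior over \(\theta\). For the latter the integral can be computed exactly, since \(\int_{0}^{1} \theta^{n} (1 - \theta)^{t - n} d\theta = n! (t - n)! / (t + 1)!\). The equal prior probabilities for the models and all function names are assumptions of the example.

```python
from math import exp, lgamma


def marginal_fair_coin(n, t):
    """Model M_0 contains only theta = 0.5, so P(s | M_0) is just the likelihood."""
    return 0.5 ** t


def marginal_uniform_prior(n, t):
    """Model M_1: all Bernoulli hypotheses with a uniform prior over theta.

    The integral of theta^n (1 - theta)^(t - n) over [0, 1] is n! (t - n)! / (t + 1)!.
    """
    return exp(lgamma(n + 1) + lgamma(t - n + 1) - lgamma(t + 2))


def posterior_model_probabilities(marginals, model_priors):
    """Bayes' theorem on the level of models."""
    joint = [m * p for m, p in zip(marginals, model_priors)]
    total = sum(joint)
    return [j / total for j in joint]


# Made-up sample: a string with 16 ones among 20 bits.
n, t = 16, 20
post_m0, post_m1 = posterior_model_probabilities(
    [marginal_fair_coin(n, t), marginal_uniform_prior(n, t)], [0.5, 0.5])
```

For this sample the marginal likelihood of \(M_{1}\) is roughly ten times that of \(M_{0}\), so the larger model receives most of the posterior probability. For a sample with about as many 0's as 1's the comparison would favour \(M_{0}\), because \(M_{1}\) spreads its prior over many hypotheses that fit such data badly.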

Usually the marginal likelihood cannot be computed analytically.
Numerical approximations can often be obtained, but for practical
purposes it has proved very useful, and quite sufficient, to employ an
approximation of the marginal likelihood. This approximation has
become known as the *Bayesian information criterion*, or BIC
for short (Schwarz 1978, Raftery 1995). It turns out that this
approximation shows remarkable similarities to the AIC:
\[ \text{BIC}[M] \; = \; - 2 \log P(s \mid h_{\hat{\theta}(s)}) + d \log
n . \]
Here \(\hat{\theta}(s)\) is again the maximum likelihood estimate of
the model, \(d = dim(\Theta)\) is the number of independent parameters, and
\(n\) is the number of data points in the sample. The latter
dependence is the only difference with the AIC, but a major difference
in how the model evaluation may turn out.
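
The difference can be seen in a small numerical comparison. The maximized log-likelihoods and parameter counts below are made up for illustration, and the function names are hypothetical.

```python
from math import log


def aic(log_likelihood, d):
    """AIC = -2 log P(s | MLE) + 2d."""
    return -2 * log_likelihood + 2 * d


def bic(log_likelihood, d, n):
    """BIC = -2 log P(s | MLE) + d log n."""
    return -2 * log_likelihood + d * log(n)


# Made-up maximized log-likelihoods for a simple and a complex model, n = 100.
simple = {"log_likelihood": -120.0, "d": 2}
complex_ = {"log_likelihood": -117.0, "d": 4}
n = 100
aic_prefers_complex = aic(**complex_) < aic(**simple)
bic_prefers_complex = bic(n=n, **complex_) < bic(n=n, **simple)
```

In this example the AIC prefers the complex model while the BIC prefers the simple one. Since \(\log n > 2\) as soon as the sample contains eight or more data points, the BIC penalizes additional parameters more heavily than the AIC for all but very small samples.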

The concurrence of the AIC and the BIC seems to give a further
motivation for our intuitive preference for simple models over more
complex ones. Indeed, other model selection tools, like the
*deviance information criterion* (Spiegelhalter et al 2002) and
the approach based on *minimum description length* (Grunwald
2007), also result in expressions that feature a term that penalizes
complex models. However, this is not to say that the dimension term
that we know from the information criteria exhausts the notion of
model complexity. There is ongoing debate in the philosophy of science
concerning the merits of model selection in explications of the notion
of simplicity, informativeness, and the like (see, for example, Sober
2004, Romeijn and van de Schoot 2008, Romeijn et al 2012, Sprenger
2013).

### 5.2 Statistics without models

There are also statistical methods that refrain from the use of a particular model, by focusing exclusively on the data or by generalizing over all possible models. Some of these techniques are properly located in descriptive statistics: they do not concern an inference from data but merely serve to describe the data in a particular way. Statistical methods that do not rely on an explicit model choice have unfortunately not attracted much attention in the philosophy of statistics, but for completeness' sake they will be briefly discussed here.

#### 5.2.1 Data reduction techniques

One set of methods, and a quite important one for many practicing
statisticians, is aimed at *data reduction*. Often the sample
data are very rich, e.g., consisting of a set of points in a space of
very many dimensions. The first step in a statistical analysis may
then be to pick out the salient variability in the data, in order to
scale down the computational burden of the analysis itself.

The technique of *principal component analysis* (PCA) is
designed for this purpose (Jolliffe 2002). Given a set of points in a
space, it seeks out the set of vectors along which the variation in
the points is large. As an example, consider two points in a plane
parameterized as \((x, y)\): the points \((0, 0)\) and \((1, 1)\). In
the \(x\)-direction and in the \(y\)-direction the variation is \(1\),
but over the diagonal the variation is maximal, namely
\(\sqrt{2}\). The vector on the diagonal is called the principal
component of the data. In richer data structures, and using a more
general measure of variation among points, we can find the first
component in a similar way. Moreover, we can repeat the procedure
after subtracting the variation along the last found component, by
projecting the data onto the plane perpendicular to that
component. This allows us to build up a set of principal components of
diminishing importance.
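
The search for the first principal component can be sketched as follows. The example takes the variance of the points as the measure of variation and finds the direction of largest variance by power iteration on the covariance matrix, which is only one of several ways of computing it; the function name is hypothetical.

```python
from math import sqrt


def first_principal_component(points, iterations=100):
    """Direction of largest variance, by power iteration on the covariance matrix."""
    count, dim = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / count for k in range(dim)]
    centered = [[p[k] - mean[k] for k in range(dim)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / count for j in range(dim)]
           for i in range(dim)]
    vector = [1.0] + [0.0] * (dim - 1)  # arbitrary starting direction
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dim))
                   for i in range(dim)]
        norm = sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    return vector


# The two points from the example in the text.
component = first_principal_component([(0.0, 0.0), (1.0, 1.0)])
```

For the two points \((0, 0)\) and \((1, 1)\) the result is the unit vector along the diagonal. The sketch fails if the starting direction happens to be perpendicular to the principal component, a limitation that a more careful implementation would have to address.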

PCA is only one item from a large collection of techniques that are
aimed at keeping the data manageable and finding patterns in it, a
collection that also includes *kernel methods* and *support
vector machines* (e.g., Vapnik and Kotz 2006). For present
purposes, it is important to stress that such tools should not be
confused with statistical analysis: they do not involve the testing or
evaluation of distributions over sample space, even though they build
up and evaluate models of the data. This sets them apart from, e.g.,
confirmatory and exploratory factor analysis (Bartholomew 2008), which
is sometimes taken to be a close relative of PCA because both
sets of techniques allow us to identify salient dimensions within
sample space, along which the data show large variation.

Practicing statisticians often employ data reduction tools to arrive at conclusions on the distributions from which the data were sampled. Machine learning and data mining techniques are already widely used in the sciences, and we may expect even more use of them in the future, because so much data is now becoming available for scientific analysis. However, in the philosophy of statistics there is as yet little debate over the epistemic status of conclusions reached by means of these techniques. Philosophers of statistics would do well to direct some attention here.

#### 5.2.2 Formal learning theory

An entirely different approach to statistics is presented by
*formal learning theory*.
This is again a vast area of research, primarily located in
computer science and artificial intelligence. The discipline is
mentioned here briefly, as another example of an approach to statistics
that avoids the choice of a statistical model altogether and merely
identifies patterns in the data. We leave aside the theory of
*neural networks*, which also concerns predictive systems that
do not rely on a statistical model, and focus on the theory of
learning algorithms because, of all these approaches, it has received
the most philosophical attention.

Pioneering work on formal learning was done by Solomonoff (1964). As before, the setting is one in which the data consist of strings of 0's and 1's, and in which an agent is attempting to identify the pattern in these data. So, for example, the data may be a string of the form \(0101010101\ldots\), and the challenge is to identify this string as an alternating sequence. The central idea of Solomonoff is that all possible computable patterns must be considered by the agent, and therefore that no restrictive choice of statistical hypotheses is warranted. Solomonoff then defined a formal system in which indeed all patterns can be taken into consideration, effectively using a Bayesian analysis with a cleverly constructed prior over all computable hypotheses.
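The Bayesian flavour of this idea can be illustrated with a toy sketch. It is not Solomonoff's system, which ranges over all computable patterns and is not itself computable. As a stand-in, the sketch assumes a small hypothesis class of periodic binary patterns, and a prior that gives a repeating block of length \(p\) the weight \(4^{-p}\), so that shorter descriptions count for more. The names `periodic_hypotheses`, `predict_next`, and `max_period` are hypothetical and chosen for illustration only.

```python
from fractions import Fraction
from itertools import product


def periodic_hypotheses(max_period):
    """Toy hypothesis class: every binary block of length 1..max_period,
    read as 'the data repeat this block forever'.
    Assumed prior weight 4**(-p) for a block of length p, so that
    shorter blocks (simpler patterns) receive more weight."""
    hyps = {}
    for p in range(1, max_period + 1):
        for block in product("01", repeat=p):
            hyps["".join(block)] = Fraction(1, 4 ** p)
    return hyps


def predict_next(data, max_period=4):
    """Posterior probability that the next symbol is '1'.
    Each hypothesis is deterministic, so conditioning on the data
    simply discards the hypotheses that the data contradict."""
    n = len(data)
    consistent = {
        block: weight
        for block, weight in periodic_hypotheses(max_period).items()
        if all(data[i] == block[i % len(block)] for i in range(n))
    }
    total = sum(consistent.values())
    if total == 0:
        raise ValueError("no hypothesis in the toy class fits the data")
    mass_one = sum(
        weight
        for block, weight in consistent.items()
        if block[n % len(block)] == "1"
    )
    return mass_one / total
```

On the alternating string from the text, `predict_next("0101010101")` assigns probability 0 to a further 1, because only the blocks `01` and `0101` survive the data. The restriction to periodic patterns is exactly the kind of restrictive choice that Solomonoff's construction is meant to avoid, so the sketch shows the mechanism of updating a simplicity-weighted prior and nothing more.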

This general idea can also be identified in a rather new field at
the intersection of Bayesian statistics and machine learning,
*Bayesian nonparametrics* (e.g., Orbanz and Teh
2010, Hjort et al 2010). Rather than specifying, at the outset, a
confined set of distributions from which a statistical analysis is
supposed to choose on the basis of the data, the idea is that the data
are confronted with a potentially infinite-dimensional space of
possible distributions. The set of distributions taken into
consideration is then made relative to the data obtained: the
complexity of the model grows with the sample. The result is a
predictive system that performs an online model selection alongside a
Bayesian accommodation of the posterior over the model.
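One way to see how the complexity of the model can grow with the sample is the so-called Chinese restaurant process, a standard construction associated with Dirichlet process models in Bayesian nonparametrics. The text above does not name this construction, so its use here is an assumption made for illustration. In the sketch below, each new observation joins an existing cluster with probability proportional to the size of that cluster, or opens a new cluster with probability proportional to a concentration parameter `alpha`. The function name, the default `alpha`, and the fixed seed are hypothetical choices.

```python
import random


def chinese_restaurant_process(n, alpha=1.0, seed=0):
    """Simulate cluster assignments for n observations.
    Returns the final cluster sizes and, for each sample size,
    the number of clusters in use at that point."""
    rng = random.Random(seed)
    cluster_sizes = []
    clusters_used = []
    for _ in range(n):
        # Existing clusters are weighted by their size; the final
        # weight, alpha, is the weight for opening a new cluster.
        weights = cluster_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[choice] += 1
        clusters_used.append(len(cluster_sizes))
    return cluster_sizes, clusters_used
```

The list `clusters_used` never decreases, and new clusters keep appearing as more data arrive, though at a slowing rate. No fixed number of components is specified at the outset, which is the sense in which the set of distributions under consideration is made relative to the data obtained.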

Current formal learning theory is a lively field, to which
philosophers of statistics also contribute (e.g., Kelly 1996, Kelly et
al 1997). Particularly salient for the present concerns is that the
systems of formal learning are set up to achieve some notion of
adequate *universal prediction*, without confining themselves
to a specific set of hypotheses, and hence by imposing minimal
constraints on the set of possible patterns in the data. It is a
matter of debate whether this is at all possible, and to what extent
the predictions of formal learning theory thereby rely on, e.g.,
implicit assumptions on the structure of the sample space. Philosophical
reflection on this is only in its infancy.

## 6. Related topics

There are numerous topics in the philosophy of science that bear direct relevance to the themes covered in this entry. A few central topics are mentioned here to direct the reader to related entries in the encyclopedia.

One very important topic that is immediately adjacent to the philosophy of statistics is confirmation theory, the philosophical theory that describes and justifies relations between scientific theory and empirical evidence. Arguably, the theory of statistics is a proper part of confirmation theory, as it describes and justifies the relation that obtains between statistical theory and evidence in the form of samples. It can be insightful to place statistical procedures in this wider framework of relations between evidence and theory. Zooming out even further, the philosophy of statistics is part of the philosophical topic of methodology, i.e., the general theory on whether and how science acquires knowledge. Thus conceived, statistics is one component in a large collection of scientific methods comprising concept formation, experimental design, manipulation and observation, confirmation, revision, and theorizing.

There are also a fair number of specific topics from the philosophy
of science that are spelled out in terms of statistics or that are
located in close proximity to it. One of these topics is the process
of measurement, in particular the measurement of latent variables on
the basis of statistical facts about manifest variables. The
so-called *representational theory of measurement* (Krantz et al
1971) relies on statistics, in particular on factor analysis, to
provide a conceptual clarification of how mathematical structures
represent empirical phenomena. Another important topic from the
philosophy of science is causation (see the entries on
probabilistic causation and Reichenbach's common cause principle).
Philosophers have employed probability theory to capture causal
relations ever since Reichenbach (1956), but more recent work in
causality and statistics (e.g., Spirtes et al 2001) has given the
theory of *probabilistic causality* an enormous impetus. Here
again, statistics provides a basis for the conceptual analysis of
causal relations.

And there is so much more. Several specific statistical techniques, like factor analysis and the theory of Bayesian networks, invite conceptual discussion of their own accord. Numerous topics within the philosophy of science lend themselves to statistical elucidation, e.g., the coherence, informativeness, and surprise of evidence. And in turn there is a wide range of discussions in the philosophy of science that inform a proper understanding of statistics. Among them are debates over experimentation and intervention, concepts of chance, the nature of scientific models, and theoretical terms. The reader is invited to consult the entries on these topics to find further indications of how they relate to the philosophy of statistics.

## Bibliography

- Aldous, D.J., 1981, “Representations for Partially Exchangeable Arrays of Random Variables”, *Journal of Multivariate Analysis*, 11: 581–598.
- Armendt, B., 1993, “Dutch books, Additivity, and Utility Theory”, *Philosophical Topics*, 21: 1–20.
- Auxier, R.E., and L.E. Hahn (eds.), 2006, *The Philosophy of Jaakko Hintikka*, Chicago: Open Court.
- Balasubramanian, V., 2005, “MDL, Bayesian inference, and the geometry of the space of probability distributions”, in: *Advances in Minimum Description Length: Theory and Applications*, P.J. Grünwald et al (eds.), Boston: MIT Press, 81–99.
- Bandyopadhyay, P., and Forster, M. (eds.), 2011, *Handbook for the Philosophy of Science: Philosophy of Statistics*, Elsevier.
- Barnett, V., 1999, *Comparative Statistical Inference*, Wiley Series in Probability and Statistics, New York: Wiley.
- Bartholomew, D.J., F. Steele, J. Galbraith, I. Moustaki, 2008, *Analysis of Multivariate Social Science Data*, Statistics in the Social and Behavioral Sciences Series, London: Taylor and Francis, 2nd edition.
- Berger, J., 2006, “The Case for Objective Bayesian Analysis”, *Bayesian Analysis*, 1(3): 385–402.
- Berger, J.O., J.M. Bernardo, and D. Sun, 2009, “The formal definition of reference priors”, *Annals of Statistics*, 37(2): 905–938.
- Berger, J.O., and R.L. Wolpert, 1984, *The Likelihood Principle*, Hayward (CA): Institute of Mathematical Statistics.
- Berger, J.O. and T. Sellke, 1987, “Testing a point null hypothesis: The irreconcilability of P-values and evidence”, *Journal of the American Statistical Association*, 82: 112–139.
- Bernardo, J.M. and A.F.M. Smith, 1994, *Bayesian Theory*, New York: John Wiley.
- Bigelow, J. C., 1977, “Semantics of probability”, *Synthese*, 36(4): 459–72.
- Billingsley, P., 1995, *Probability and Measure*, Wiley Series in Probability and Statistics, New York: Wiley, 3rd edition.
- Birnbaum, A., 1962, “On the Foundations of Statistical Inference”, *Journal of the American Statistical Association*, 57: 269–306.
- Blackwell, D. and L. Dubins, 1962, “Merging of Opinions with Increasing Information”, *Annals of Mathematical Statistics*, 33(3): 882–886.
- Boole, G., 1854, *An Investigation of The Laws of Thought on Which are Founded the Mathematical Theories of Logic and Probabilities*, London: Macmillan, reprinted 1958, London: Dover.
- Burnham, K.P. and D.R. Anderson, 2002, *Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach*, New York: Springer, 2nd edition.
- Carnap, R., 1950, *Logical Foundations of Probability*, Chicago: The University of Chicago Press.
- –––, 1952, *The Continuum of Inductive Methods*, Chicago: University of Chicago Press.
- Carnap, R. and Jeffrey, R.C. (eds.), 1970, *Studies in Inductive Logic and Probability*, Volume I, Berkeley: University of California Press.
- Casella, G., and R. L. Berger, 1987, “Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem”, *Journal of the American Statistical Association*, 82: 106–111.
- Claeskens, G. and N. L. Hjort, 2008, *Model selection and model averaging*, Cambridge: Cambridge University Press.
- Cohen, J., 1994, “The Earth is Round (p < .05)”, *American Psychologist*, 49: 997–1001.
- Cox, R.T., 1961, *The Algebra of Probable Inference*, Baltimore: Johns Hopkins University Press.
- Cumming, G., 2012, *Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis*, New York: Routledge.
- Dawid, A.P., 1982, “The Well-Calibrated Bayesian”, *Journal of the American Statistical Association*, 77(379): 605–610.
- –––, 2004, “Probability, causality and the empirical world: A Bayes-de Finetti-Popper-Borel synthesis”, *Statistical Science*, 19: 44–57.
- Dawid, A.P. and P. Grünwald, 2004, “Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory”, *Annals of Statistics*, 32: 1367–1433.
- Dawid, A.P. and M. Stone, 1982, “The functional-model basis of fiducial inference”, *Annals of Statistics*, 10: 1054–1067.
- De Finetti, B., 1937, “La Prévision: ses lois logiques, ses sources subjectives”, *Annales de l'Institut Henri Poincaré*, reprinted as “Foresight: its Logical Laws, Its Subjective Sources”, in: Kyburg, H. E. and H. E. Smokler (eds.), *Studies in Subjective Probability*, 1964, New York: Wiley.
- –––, 1974, *Theory of Probability*, Volumes I and II, New York: Wiley, translation by A. Machi and A.F.M. Smith.
- De Morgan, A., 1847, *Formal Logic or The Calculus of Inference*, London: Taylor & Walton, reprinted by London: Open Court, 1926.
- Dempster, A.P., 1964, “On the Difficulties Inherent in Fisher's Fiducial Argument”, *Journal of the American Statistical Association*, 59: 56–66.
- –––, 1966, “New Methods for Reasoning Towards Posterior Distributions Based on Sample Data”, *Annals of Mathematical Statistics*, 37(2): 355–374.
- –––, 1967, “Upper and lower probabilities induced by a multivalued mapping”, *The Annals of Mathematical Statistics*, 38(2): 325–339.
- –––, 1968, “A generalization of Bayesian inference”, *Journal of the Royal Statistical Society*, Series B, Vol. 30: 205–247.
- Diaconis, P., and D. Freedman, 1980, “De Finetti’s theorem for Markov chains”, *Annals of Probability*, 8: 115–130.
- Eagle, A. (ed.), 2010, *Philosophy of Probability: Contemporary Readings*, London: Routledge.
- Earman, J., 1992, *Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory*, Cambridge (MA): MIT Press.
- Easwaran, K., 2013, “Expected Accuracy Supports Conditionalization - and Conglomerability and Reflection”, *Philosophy of Science*, 80(1): 119–142.
- Edwards, A.W.F., 1972, *Likelihood*, Cambridge: Cambridge University Press.
- Efron, B. and R. Tibshirani, 1993, *An Introduction to the Bootstrap*, Boca Raton (FL): Chapman & Hall/CRC.
- Festa, R., 1993, *Optimum Inductive Methods*, Dordrecht: Kluwer.
- –––, 1996, “Analogy and exchangeability in predictive inferences”, *Erkenntnis*, 45: 89–112.
- Fisher, R.A., 1925, *Statistical Methods for Research Workers*, Edinburgh: Oliver and Boyd.
- –––, 1930, “Inverse probability”, *Proceedings of the Cambridge Philosophical Society*, 26: 528–535.
- –––, 1933, “The concepts of inverse probability and fiducial probability referring to unknown parameters”, *Proceedings of the Royal Society*, Series A, 139: 343–348.
- –––, 1935a, “The logic of inductive inference”, *Journal of the Royal Statistical Society*, 98: 39–82.
- –––, 1935b, *The Design of Experiments*, Edinburgh: Oliver and Boyd.
- –––, 1935c, “The fiducial argument in statistical inference”, *Annals of Eugenics*, 6: 317–324.
- –––, 1955, “Statistical Methods and Scientific Induction”, *Journal of the Royal Statistical Society*, B 17: 69–78.
- –––, 1956, *Statistical Methods and Scientific Inference*, New York: Hafner, 3rd edition 1973.
- Fitelson, B., 2007, “Likelihoodism, Bayesianism, and relational confirmation”, *Synthese*, 156(3): 473–489.
- Forster, M. and E. Sober, 1994, “How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions”, *British Journal for the Philosophy of Science*, 45: 1–35.
- Fraassen, B. van, 1989, *Laws and Symmetry*, Oxford: Clarendon Press.
- Gaifman, H. and M. Snir, 1982, “Probabilities over Rich Languages”, *Journal of Symbolic Logic*, 47: 495–548.
- Galavotti, M. C., 2005, *Philosophical Introduction to Probability*, Stanford: CSLI Publications.
- Gelman, A., J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin, 2013, *Bayesian Data Analysis*, revised edition, New York: Chapman & Hall/CRC.
- Gelman, A., and C. Shalizi, 2013, “Philosophy and the practice of Bayesian statistics (with discussion)”, *British Journal of Mathematical and Statistical Psychology*, 66: 8–18.
- Giere, R. N., 1976, “A Laplacean Formal Semantics for Single-Case Propensities”, *Journal of Philosophical Logic*, 5(3): 321–353.
- Gillies, D., 1971, “A Falsifying Rule for Probability Statements”, *British Journal for the Philosophy of Science*, 22: 231–261.
- –––, 2000, *Philosophical Theories of Probability*, London: Routledge.
- Goldstein, M., 2006, “Subjective Bayesian analysis: principles and practice”, *Bayesian Analysis*, 1(3): 403–420.
- Good, I.J., 1983, *Good Thinking: The Foundations of Probability and Its Applications*, University of Minnesota Press, reprinted London: Dover, 2009.
- –––, 1988, “The Interface Between Statistics and Philosophy of Science”, *Statistical Science*, 3(4): 386–397.
- Goodman, N., 1965, *Fact, Fiction and Forecast*, Indianapolis: Bobbs-Merrill.
- Greaves, H. and D. Wallace, 2006, “Justifying Conditionalization: Conditionalization Maximizes Expected Epistemic Utility”, *Mind*, 115(459): 607–632.
- Greco, D., 2011, “Significance Testing in Theory and Practice”, *British Journal for the Philosophy of Science*, 62: 607–37.
- Grünwald, P.D., 2007, *The Minimum Description Length Principle*, Boston: MIT Press.
- Hacking, I., 1965, *The Logic of Statistical Inference*, Cambridge: Cambridge University Press.
- Haenni, R., Romeijn, J.-W., Wheeler, G., Andrews, J., 2011, *Probabilistic Logics and Probabilistic Networks*, Berlin: Springer.
- Hailperin, T., 1996, *Sentential Probability Logic*, Lehigh University Press.
- Hájek, A., 2007, “The reference class problem is your problem too”, *Synthese*, 156: 563–585.
- Hájek, A. and C. Hitchcock (eds.), 2013, *Oxford Handbook of Probability and Philosophy*, Oxford: Oxford University Press.
- Halpern, J.Y., 2003, *Reasoning about Uncertainty*, MIT Press.
- Handfield, T., 2012, *A Philosophical Guide to Chance: Physical Probability*, Cambridge: Cambridge University Press.
- Harlow, L.L., S.A. Mulaik, and J.H. Steiger (eds.), 1997, *What if there were no significance tests?*, Mahwah (NJ): Erlbaum.
- Henderson, L., N.D. Goodman, J.B. Tenenbaum, and J.F. Woodward, 2010, “The Structure and Dynamics of Scientific Theories: A Hierarchical Bayesian Perspective”, *Philosophy of Science*, 77(2): 172–200.
- Hjort, N., C. Holmes, P. Mueller, and S. Walker (eds.), 2010, *Bayesian Nonparametrics*, Cambridge Series in Statistical and Probabilistic Mathematics, nr. 28, Cambridge: Cambridge University Press.
- Howson, C., 2000, *Hume's problem: induction and the justification of belief*, Oxford: Oxford University Press.
- –––, 2003, “Probability and logic”, *Journal of Applied Logic*, 1(3–4): 151–165.
- –––, 2011, “Bayesianism as a pure logic of Inference”, in: Bandyopadhyay, P. and Forster, M. (eds.), *Philosophy of statistics, Handbook of the Philosophy of Science*, Oxford: North Holland, 441–472.
- Howson, C. and P. Urbach, 2006, *Scientific Reasoning: The Bayesian Approach*, La Salle: Open Court, 3rd edition.
- Hintikka, J., 1970, “Unknown Probabilities, Bayesianism, and de Finetti's Representation Theorem”, in *Proceedings of the Biennial Meeting of the Philosophy of Science Association*, Vol. 1970, Boston: Springer, 325–341.
- Hintikka, J. and I. Niiniluoto, 1980, “An axiomatic foundation for the logic of inductive generalization”, in R.C. Jeffrey (ed.), *Studies in Inductive Logic and Probability*, volume II, Berkeley: University of California Press, 157–181.
- Hintikka J. and P. Suppes (eds.), 1966, *Aspects of Inductive Logic*, Amsterdam: North-Holland.
- Hume, D., 1739, *A Treatise of Human Nature*, available online.
- Jaynes, E.T., 1973, “The Well-Posed Problem”, *Foundations of Physics*, 3: 477–493.
- –––, 2003, *Probability Theory: The Logic of Science*, Cambridge: Cambridge University Press. First 3 chapters available online.
- Jeffrey, R., 1992, *Probability and the Art of Judgment*, Cambridge: Cambridge University Press.
- Jeffreys, H., 1961, *Theory of Probability*, Oxford: Clarendon Press, 3rd edition.
- Jolliffe, I.T., 2002, *Principal Component Analysis*, New York: Springer, 2nd edition.
- Kadane, J.B., 2011, *Principles of Uncertainty*, London: Chapman and Hall.
- Kadane, J.B., M.J. Schervish, and T. Seidenfeld, 1996, “When Several Bayesians Agree That There Will Be No Reasoning to a Foregone Conclusion”, *Philosophy of Science*, 63: S281–S289.
- –––, 1996, “Reasoning to a Foregone Conclusion”, *Journal of the American Statistical Association*, 91(435): 1228–1235.
- Kass, R. and A. Raftery, 1995, “Bayes Factors”, *Journal of the American Statistical Association*, 90: 773–790.
- Kelly, K., 1996, *The Logic of Reliable Inquiry*, Oxford: Oxford University Press.
- Kelly, K., O. Schulte, and C. Juhl, 1997, “Learning Theory and the Philosophy of Science”, *Philosophy of Science*, 64: 245–67.
- Keynes, J.M., 1921, *A Treatise on Probability*, London: Macmillan.
- Kieseppä, I. A., 1997, “Akaike Information Criterion, Curve-Fitting, and the Philosophical Problem of Simplicity”, *British Journal for the Philosophy of Science*, 48(1): 21–48.
- –––, 2001, “Statistical Model Selection Criteria and the Philosophical Problem of Underdetermination”, *British Journal for the Philosophy of Science*, 52(4): 761–794.
- Kingman, J.F.C., 1975, “Random discrete distributions”, *Journal of the Royal Statistical Society*, 37: 1–22.
- –––, 1978, “Uses of exchangeability”, *Annals of Probability*, 6(2): 183–197.
- Kolmogorov, A.N., 1933, *Grundbegriffe der Wahrscheinlichkeitsrechnung*, Berlin: Julius Springer.
- Krantz, D. H., R. D. Luce, A. Tversky and P. Suppes, 1971, *Foundations of Measurement*, Volumes I and II, Mineola: Dover Publications.
- Kuipers, T.A.F., 1978, *Studies in Inductive Probability and Rational Expectation*, Dordrecht: Reidel.
- –––, 1986, “Some estimates of the optimum inductive method”, *Erkenntnis*, 24: 37–46.
- Kyburg, Jr., H.E., 1961, *Probability and the Logic of Rational Belief*, Middletown (CT): Wesleyan University Press.
- Kyburg, H.E. Jr. and C.M. Teng, 2001, *Uncertain Inference*, Cambridge: Cambridge University Press.
- van Lambalgen, M., 1987, *Random sequences*, Ph.D. dissertation, Department of Mathematics and Computer Science, University of Amsterdam, available online.
- Leitgeb, H. and Pettigrew, R., 2010a, “An Objective Justification of Bayesianism I: Measuring Inaccuracy”, *Philosophy of Science*, 77(2): 201–235.
- –––, 2010b, “An Objective Justification of Bayesianism II: The Consequences of Minimizing Inaccuracy”, *Philosophy of Science*, 77(2): 236–272.
- Levi, I., 1980, *The enterprise of knowledge: an essay on knowledge, credal probability, and chance*, Cambridge MA: MIT Press.
- Lindley, D.V., 1957, “A statistical paradox”, *Biometrika*, 44: 187–192.
- –––, 1965, *Introduction to Probability and Statistics from a Bayesian Viewpoint*, Vols. I and II, Cambridge: Cambridge University Press.
- –––, 2000, “The Philosophy of Statistics”, *Journal of the Royal Statistical Society*, D (The Statistician), Vol. 49(3): 293–337.
- MacKay, D.J.C., 2003, *Information Theory, Inference, and Learning Algorithms*, Cambridge: Cambridge University Press.
- Maher, P., 1993, *Betting on Theories*, Cambridge Studies in Probability, Induction and Decision Theory, Cambridge: Cambridge University Press.
- Mayo, D.G., 1996, *Error and the Growth of Experimental Knowledge*, Chicago: The University of Chicago Press.
- –––, 2010, “An error in the argument from conditionality and sufficiency to the likelihood principle”, in: D. Mayo, A. Spanos (eds.), *Error and Inference: Recent exchanges on experimental reasoning, reliability, and the objectivity and rationality of science*, Cambridge: Cambridge University Press, 305–314.
- Mayo, D.G., and A. Spanos, 2006, “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction”, *The British Journal for the Philosophy of Science*, 57: 323–357.
- –––, 2011, “Error Statistics”, in P.S. Bandyopadhyay and M.R. Forster, *Philosophy of Statistics, Handbook of the Philosophy of Science*, Vol. 7, Elsevier.
- Mellor, D. H., 2005, *The Matter of Chance*, Cambridge: Cambridge University Press.
- –––, 2005, *Probability: A Philosophical Introduction*, London: Routledge.
- von Mises, R., 1981, *Probability, Statistics and Truth*, 2nd revised English edition, New York: Dover.
- Mood, A. M., F. A. Graybill, and D. C. Boes, 1974, *Introduction to the Theory of Statistics*, Boston: McGraw-Hill.
- Morey, R., J.W. Romeijn and J. Rouder, 2013, “The Humble Bayesian”, *British Journal of Mathematical and Statistical Psychology*, 66(1): 68–75.
- Myung, J., V. Balasubramanian, and M.A. Pitt, 2000, “Counting probability distributions: Differential geometry and model selection”, *Proceedings of the National Academy of Sciences*, 97(21): 11170–11175.
- Nagel, E., 1939, *Principles of the Theory of Probability*, Chicago: University of Chicago Press.
- Neyman, J., 1957, “Inductive Behavior as a Basic Concept of Philosophy of Science”, *Revue Institute Internationale De Statistique*, 25: 7–22.
- –––, 1971, “Foundations of Behavioristic Statistics”, in: V. Godambe and D. Sprott (eds.), *Foundations of Statistical Inference*, Toronto: Holt, Rinehart and Winston of Canada, 1–19.
- Neyman, J. and E. Pearson, 1928, “On the use and interpretation of certain test criteria for purposes of statistical inference”, *Biometrika*, A20: 175–240 and 264–294.
- Neyman, J. and E. Pearson, 1933, “On the problem of the most efficient tests of statistical hypotheses”, *Philosophical Transactions of the Royal Society*, A 231: 289–337.
- –––, 1967, *Joint Statistical Papers*, Cambridge: Cambridge University Press.
- Nix, C. J. and J. B. Paris, 2006, “A continuum of inductive methods arising from a generalised principle of instantial relevance”, *Journal of Philosophical Logic*, 35: 83–115.
- Orbanz, P. and Y.W. Teh, 2010, “Bayesian Nonparametric Models”, *Encyclopedia of Machine Learning*, New York: Springer.
- Paris, J.B., 1994, *The uncertain reasoner’s companion*, Cambridge: Cambridge University Press.
- Paris, J.B. and A. Vencovska, 1989, “On the applicability of maximum entropy to inexact reasoning”, *International Journal of Approximate Reasoning*, 4(3): 183–224.
- Paris, J., and P. Waterhouse, 2009, “Atom exchangeability and instantial relevance”, *Journal of Philosophical Logic*, 38(3): 313–332.
- Peirce, C. S., 1910, “Notes on the Doctrine of Chances”, in C. Hartshorne and P. Weiss (eds.), *Collected Papers of Charles Sanders Peirce*, Vol. 2, Cambridge MA: Harvard University Press, 405–14, reprinted 1931.
- Plato, J. von, 1994, *Creating Modern Probability*, Cambridge: Cambridge University Press.
- Popper, K.R., 1934/1959, *The Logic of Scientific Discovery*, New York: Basic Books.
- –––, 1959, “The Propensity Interpretation of Probability”, *British Journal for the Philosophy of Science*, 10: 25–42.
- Predd, J.B., R. Seiringer, E.H. Lieb, D.N. Osherson, H.V. Poor, and S.R. Kulkarni, 2009, “Probabilistic Coherence and Proper Scoring Rules”, *IEEE Transactions on Information Theory*, 55(10): 4786–4792.
- Press, S. J., 2002, *Bayesian Statistics: Principles, Models, and Applications* (Wiley Series in Probability and Statistics), New York: Wiley.
- Raftery, A.E., 1995, “Bayesian model selection in social research”, *Sociological Methodology*, 25: 111–163.
- Ramsey, F.P., 1926, “Truth and Probability”, in R.B. Braithwaite (ed.), *The Foundations of Mathematics and other Logical Essays*, Ch. VII, p. 156–198, printed in London: Kegan Paul, 1931.
- Reichenbach, H., 1938, *Experience and prediction: an analysis of the foundations and the structure of knowledge*, Chicago: University of Chicago Press.
- –––, 1949, *The theory of probability*, Berkeley: University of California Press.
- –––, 1956, *The Direction of Time*, Berkeley: University of California Press.
- Renyi, A., 1970, *Probability Theory*, Amsterdam: North Holland.
- Robbins, H., 1952, “Some Aspects of the Sequential Design of Experiments”, *Bulletin of the American Mathematical Society*, 58: 527–535.
- Roberts, H.V., 1967, “Informative stopping rules and inferences about population size”, *Journal of the American Statistical Association*, 62(319): 763–775.
- Romeijn, J.W., 2004, “Hypotheses and Inductive Predictions”, *Synthese*, 141(3): 333–364.
- –––, 2005, *Bayesian Inductive Logic*, PhD dissertation, University of Groningen.
- –––, 2006, “Analogical Predictions for Explicit Similarity”, *Erkenntnis*, 64: 253–280.
- –––, 2011, “Statistics as Inductive Logic”, in Bandyopadhyay, P. and M. Forster (eds.), *Handbook for the Philosophy of Science*, Vol. 7: Philosophy of Statistics, 751–774.
- Romeijn, J.W. and van de Schoot, R., 2008, “A Philosophical Analysis of Bayesian model selection”, in Hoijtink, H., I. Klugkist and P. Boelen (eds.), *Null, Alternative and Informative Hypotheses*, 329–357.
- Romeijn, J.W., van de Schoot, R., and Hoijtink, H., 2012, “One size does not fit all: derivation of a prior-adapted BIC”, in Dieks, D., W. Gonzales, S. Hartmann, F. Stadler, T. Uebel, and M. Weber (eds.), *Probabilities, Laws, and Structures*, Berlin: Springer.
- Rosenkrantz, R.D., 1977, *Inference, method and decision: towards a Bayesian philosophy of science*, Dordrecht: Reidel.
- –––, 1981, *Foundations and Applications of Inductive Probability*, Ridgeview Press.
- Royall, R., 1997, *Scientific Evidence: A Likelihood Paradigm*, London: Chapman and Hall.
- Savage, L.J., 1962, *The foundations of statistical inference*, London: Methuen.
- Schervish, M.J., T. Seidenfeld, and J.B. Kadane, 2009, “Proper Scoring Rules, Dominated Forecasts, and Coherence”, *Decision Analysis*, 6(4): 202–221.
- Schwarz, G., 1978, “Estimating the Dimension of a Model”, *Annals of Statistics*, 6: 461–464.
- Seidenfeld, T., 1979, *Philosophical Problems of Statistical Inference: Learning from R.A. Fisher*, Dordrecht: Reidel.
- –––, 1986, “Entropy and uncertainty”, *Philosophy of Science*, 53(4): 467–491.
- –––, 1992, “R.A. Fisher's Fiducial Argument and Bayes Theorem”, *Statistical Science*, 7(3): 358–368.
- Shafer, G., 1976, *A Mathematical Theory of Evidence*, Princeton: Princeton University Press.
- –––, 1982, “On Lindley’s paradox (with discussion)”, *Journal of the American Statistical Association*, 378: 325–351.
- Shore, J. and Johnson, R., 1980, “Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy”, *IEEE Transactions on Information Theory*, 26(1): 26–37.
- Skyrms, B., 1991, “Carnapian inductive logic for Markov chains”, *Erkenntnis*, 35: 439–460.
- –––, 1993, “Analogy by similarity in hypercarnapian inductive logic”, in Massey, G.J., J. Earman, A.I. Janis and N. Rescher (eds.), *Philosophical Problems of the Internal and External Worlds: Essays Concerning the Philosophy of Adolf Gruenbaum*, Pittsburgh: Pittsburgh University Press, 273–282.
- –––, 1996, “Carnapian inductive logic and Bayesian statistics”, in: Ferguson, T.S., L.S. Shapley, and J.B. MacQueen (eds.), *Statistics, Probability, and Game Theory: papers in honour of David Blackwell*, Hayward: IMS lecture notes, 321–336.
- –––, 1999, *Choice and Chance: An Introduction to Inductive Logic*, Wadsworth, 4th edition.
- Sober, E., 2004, “Likelihood, model selection, and the Duhem-Quine problem”, *Journal of Philosophy*, 101(5): 221–241.
- Spanos, A., 2010, “Is Frequentist Testing Vulnerable to the Base-Rate Fallacy?”, *Philosophy of Science*, 77: 565–583.
- –––, 2013a, “Who should be afraid of the Jeffreys–Lindley paradox?”, *Philosophy of Science*, 80: 73–93.
- –––, 2013b, “A frequentist interpretation of probability for model-based inductive inference”, *Synthese*, 190: 1555–1585.
- Spiegelhalter, D.J., N.G. Best, B.P. Carlin, and A. van der Linde, 2002, “Bayesian measures of model complexity and fit”, *Journal of the Royal Statistical Society*, B 64: 583–639.
- Spielman, S., 1974, “The Logic of Significance Testing”, *Philosophy of Science*, 41: 211–225.
- –––, 1978, “Statistical Dogma and the Logic of Significance Testing”, *Philosophy of Science*, 45: 120–135.
- Sprenger, J., 2013, “The role of Bayesian philosophy within Bayesian model selection”, *European Journal for Philosophy of Science*, 3(1): 101–114.
- –––, forthcoming-a, “Bayesianism vs. Frequentism in Statistical Inference”, in Hájek, A. and C. Hitchcock (eds.), *Oxford Handbook of Probability and Philosophy*, Oxford: Oxford University Press.
- –––, forthcoming-b, “Testing a precise null hypothesis: The case of Lindley’s paradox”, *Philosophy of Science*.
- Spirtes, P., Glymour, C. and Scheines, R., 2001, *Causation, Prediction, and Search*, Boston: MIT Press, 2nd edition.
- Solomonoff, R.J., 1964, “A formal theory of inductive inference”, parts I and II, *Information and Control*, 7: 1–22 and 224–254.
- Steele, K., 2013, “Persistent experimenters, stopping rules, and statistical inference”, *Erkenntnis*, 78(4): 937–961.
- Suppes, P., 2001, *Representation and Invariance of Scientific Structures*, Chicago: University of Chicago Press.
- Uffink, J., 1996, “The constraint rule of the maximum entropy principle”, *Studies in History and Philosophy of Modern Physics*, 27: 47–79.
- Vapnik, V.N. and S. Kotz, 2006, *Estimation of Dependences Based on Empirical Data*, New York: Springer.
- Venn, J., 1888, *The Logic of Chance*, London: MacMillan, 3rd edition.
- Wagenmakers, E.J., 2007, “A practical solution to the pervasive problems of p values”, *Psychonomic Bulletin and Review*, 14(5): 779–804.
- Wagenmakers, E.J., and L.J. Waldorp (eds.), 2006, *Journal of Mathematical Psychology*, 50(2). Special issue on model selection, 99–214.
- Wald, A., 1939, “Contributions to the Theory of Statistical Estimation and Testing Hypotheses”, *Annals of Mathematical Statistics*, 10(4): 299–326.
- –––, 1950, *Statistical Decision Functions*, New York: John Wiley and Sons.
- Walley, P., 1991, *Statistical Reasoning with Imprecise Probabilities*, New York: Chapman & Hall.
- Williams, P.M., 1980, “Bayesian conditionalisation and the principle of minimum information”, *British Journal for the Philosophy of Science*, 31: 131–144.
- Williamson, J., 2010, *In Defence of Objective Bayesianism*, Oxford: Oxford University Press.
- Ziliak, S.T. and D.N. McCloskey, 2008, *The Cult of Statistical Significance*, Ann Arbor: University of Michigan Press.
- Zabell, S.L., 1992, “R. A. Fisher and Fiducial Argument”, *Statistical Science*, 7(3): 358–368.
- –––, 1982, “W. E. Johnson's ‘Sufficientness’ Postulate”, *Annals of Statistics*, 10(4): 1090–1099.
