#### Supplement to The Problem of Induction

## Basic Probability

For comprehensive accounts of probability, consult the articles on interpretations of probability, inductive logic, and Bayes' theorem. In the interest of avoiding redundancy, only a simple sketch is given here.

Probability is easily defined on a first-order language ℒ
described in section 5.1.1 above. Any function *P* that assigns
numerical values to the sentences of ℒ in accordance with the
following laws is a probability:

For all sentences *X* and *Y*:

0 ≤ P(X) ≤ 1

P(¬X) = 1 − P(X)

If X and Y are logically inconsistent then P(X ∨ Y) = P(X) + P(Y)

If X is valid (i.e. true in all models) then P(X) = 1

The laws have as immediate consequences:

If X is inconsistent (i.e. true in no models) then P(X) = 0

P(X ∨ Y) = P(X) + P(Y) − P(X ∧ Y)
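These laws can be checked on a toy probability that counts models (a minimal sketch; the uniform weighting of four truth assignments is assumed purely for illustration):

```python
from itertools import product

# Toy probability: four equally weighted truth assignments (models) for X, Y.
models = list(product([True, False], repeat=2))

def P(sentence):
    # Probability of a sentence = proportion of models in which it is true.
    return sum(1 for x, y in models if sentence(x, y)) / len(models)

p_x   = P(lambda x, y: x)           # 0.5
p_or  = P(lambda x, y: x or y)      # 0.75
p_and = P(lambda x, y: x and y)     # 0.25

# P(X ∨ Y) = P(X) + P(Y) − P(X ∧ Y), and P(¬X) = 1 − P(X):
assert p_or == p_x + P(lambda x, y: y) - p_and
assert P(lambda x, y: not x) == 1 - p_x
```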

The conditional probability of *Y* given *X*, written
*P*(*Y* | *X*), is defined as the ratio

P(X ∧ Y) / P(X)

when *P*(*X*) > 0.

This definition supports Bayes' Theorem:

If X_{1}, …, X_{k} are mutually exclusive and jointly exhaustive then

P(X_{1} | Y) = P(Y | X_{1})P(X_{1}) / ∑_{i} P(Y | X_{i})P(X_{i})
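A minimal numerical illustration of the theorem (the two-urn setup with its priors and likelihoods is invented for this sketch):

```python
# Two exclusive and exhaustive hypotheses about which urn was sampled:
# X1 = "urn 1" (2/3 red), X2 = "urn 2" (1/3 red); Y = "a red ball is drawn".
prior = {"X1": 0.5, "X2": 0.5}            # P(Xi)
likelihood = {"X1": 2 / 3, "X2": 1 / 3}   # P(Y | Xi)

# Bayes' theorem: P(X1 | Y) = P(Y | X1)P(X1) / sum_i P(Y | Xi)P(Xi)
evidence = sum(likelihood[h] * prior[h] for h in prior)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

assert abs(posterior["X1"] - 2 / 3) < 1e-12  # a red draw favors urn 1
```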

An alternative, and non-equivalent, development takes conditional probability as primitive and allows it to be defined when the probability of the condition is zero. See the references in interpretations of probability, section 1.

Conditional probability supports the definition of the critical concept of probabilistic independence. Intuitively, propositions are independent if the truth of either leaves the probability of the other unchanged. Explicitly:

X and Y are independent =_{df} P(Y | X) = P(Y)

It follows immediately that:

X and Y are independent iff P(X ∧ Y) = P(X)P(Y)

Independence is symmetrical; if *X* and *Y* are
independent then *Y* and *X* are independent.

If *X* and *Y* are independent then *X* and
¬*Y*, ¬*X* and *Y*, ¬*X* and
¬*Y* are all independent.
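These facts can be checked on an assumed joint distribution (the weights below are invented so that *X* and *Y* come out independent):

```python
# Joint distribution over the four combinations of X and Y, chosen so
# that P(X) = 0.4, P(Y) = 0.3 and P(X ∧ Y) = P(X)P(Y).
joint = {
    (True, True): 0.12,
    (True, False): 0.28,
    (False, True): 0.18,
    (False, False): 0.42,
}
p_x = joint[(True, True)] + joint[(True, False)]
p_y = joint[(True, True)] + joint[(False, True)]

# Independence: P(X ∧ Y) = P(X)P(Y), equivalently P(Y | X) = P(Y).
assert abs(joint[(True, True)] - p_x * p_y) < 1e-12
assert abs(joint[(True, True)] / p_x - p_y) < 1e-12
# X and ¬Y are then independent too: P(X ∧ ¬Y) = P(X)(1 − P(Y)).
assert abs(joint[(True, False)] - p_x * (1 - p_y)) < 1e-12
```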

Thorough independence is a stronger notion than simple or pairwise independence:

The propositions X_{1}, …, X_{n} are *thoroughly independent* =_{df}

If Y_{1}, …, Y_{m} are any distinct propositions among the X_{i} then P(Y_{1} ∧ … ∧ Y_{m}) = P(Y_{1})P(Y_{2}) … P(Y_{m})

If *X*_{1}, …, *X*_{n} are thoroughly independent,
*P*(*X*_{1} ∧ … ∧ *X*_{n}) is positive and
*Y*_{1}, …, *Y*_{m}, *Y*_{m+1} are any distinct propositions among
the *X*_{i} then

P(Y_{m+1} | Y_{1} ∧ … ∧ Y_{m}) = P(Y_{m+1})

If *X*_{1}, …, *X*_{n} are thoroughly independent and
*X*_{1}′, …, *X*_{n}′ result from negating some or all of them,
then *X*_{1}′, …, *X*_{n}′ are thoroughly independent.

Notice that the propositions *X*_{1}, …,
*X*_{n} may be pairwise independent
(i.e. every two of them may be independent) without being thoroughly
independent.
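The pairwise-versus-thorough distinction can be seen in a standard counterexample, sketched here in Python (the three propositions about a pair of fair bits are invented for illustration):

```python
from itertools import product

# X1: "first bit is 1", X2: "second bit is 1", X3: "the two bits agree".
outcomes = list(product([0, 1], repeat=2))   # four equiprobable models

def P(pred):
    return sum(1 for o in outcomes if pred(o)) / len(outcomes)

X1 = lambda o: o[0] == 1
X2 = lambda o: o[1] == 1
X3 = lambda o: o[0] == o[1]

# Every pair is independent: P(A ∧ B) = P(A)P(B) = 1/4.
for A, B in [(X1, X2), (X1, X3), (X2, X3)]:
    assert P(lambda o: A(o) and B(o)) == P(A) * P(B)

# But the three are not thoroughly independent:
# P(X1 ∧ X2 ∧ X3) = 1/4 while P(X1)P(X2)P(X3) = 1/8.
assert P(lambda o: X1(o) and X2(o) and X3(o)) == 0.25
assert P(X1) * P(X2) * P(X3) == 0.125
```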

A set Γ ⊆ ℒ of sentences of ℒ is *maximal* if no
proper superset of Γ is consistent; if Γ is maximal and *X* is any
sentence not in Γ, then Γ ∪ {*X*} is inconsistent. If Γ
is maximal and consistent and *X* is any sentence of ℒ
then either *X* is in Γ or ¬*X* is in Γ.
(See
modal fictionalism,
section 4.5 and
actualism, section 1.1).
If *x* is any (normal) model then the set of all sentences
true in *x* (sometimes called the *diagram* of
*x*) is maximal and consistent. Conversely every maximal
consistent subset of an interpreted language (the referents of the
non-logical constants being fixed) determines a unique normal
model. Maximal consistent sets thus correspond with models in a clear
and simple way.

The introduction touched on stronger relativized notions of necessity determined by restricting the class of models. There it was a question of arithmetic models, those in which the arithmetical signs had their standard arithmetical meaning. In the same way models may be restricted to those in which certain contingent propositions are true, and this sort of relativization is virtually imperative in discussing induction. Thus when experts on automobile safety conduct experiments to test hypotheses about the damage to cars in controlled crashes, they will assume the truth of laws about the strength of materials, not to speak of the laws of classical physics. They will not count as possible any model in which any of these laws do not obtain.

Now given an interpreted language ℒ we can refer interchangeably
to the models (perhaps restricted) of ℒ and the maximal
consistent sets (perhaps in consequence also restricted) of
ℒ. If the number of (perhaps restricted) models is finite then
ℒ is said to be a finite language. There will of course be
denumerably infinitely many sentences of ℒ, but there will be
only finitely many maximal consistent classes of sentences. Thus, for
example, if the models in question consist of the eight possible
outcomes of three draws from an urn containing three balls, each of
which is either red or black, we may represent this in a language with
just one predicate, *R* (for red), and three individual
constants *a*, *b*, and *c*, for the three draws. Each
sequence, such as *Ra* ∧ ¬*Rb* ∧ *Rc*, will determine a maximal
consistent class or atom, namely the class of all of its logical
consequences, and there will be just eight of these classes, each of
which determines a model.

Probability fits naturally in this framework. In this example any assignment of non-negative numbers summing to one to the eight atoms fixes the probability of every sentence in the language:

P(X) = ∑_{X ∈ α_{i}} P(α_{i})

where α_{i} ranges over the eight atoms.
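As a sketch, this atom-based definition can be written out for the urn example (the uniform weights are an assumption; any non-negative weights summing to one would serve):

```python
from itertools import product

# Atoms: the eight truth assignments to Ra, Rb, Rc (uniform weights assumed).
atoms = list(product([True, False], repeat=3))
weight = {atom: 1 / 8 for atom in atoms}

def P(sentence):
    """P(X) = sum of the weights of the atoms in which X holds."""
    return sum(w for atom, w in weight.items() if sentence(*atom))

assert P(lambda ra, rb, rc: ra) == 0.5                      # P(Ra)
assert P(lambda ra, rb, rc: ra and not rb and rc) == 0.125  # one atom
assert abs(sum(weight.values()) - 1) < 1e-12
```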

Let *R* be a monadic predicate and *a*_{1}, …,
*a*_{n} distinct individual constants. We write
*R*_{i} for
*R*(*a*_{i}). There is just one sequence
*R*_{1} ∧ … ∧ *R*_{n}; there are *n* sequences with
(*n*−1) occurrences of *R* and one of ¬*R*. In general there are

(n choose k) = n! / k!(n−k)! = n(n−1)(n−2) … (n−k+1) / k(k−1)(k−2) … 1

distinct sequences with *k* occurrences of *R* and
(*n*−*k*) of ¬*R*.
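The two forms of the binomial coefficient given above can be checked directly, e.g. with Python's standard library:

```python
from math import comb, factorial

n, k = 5, 2

# Factorial form: n! / (k!(n-k)!)
assert comb(n, k) == factorial(n) // (factorial(k) * factorial(n - k))

# Falling-factorial form: n(n-1)...(n-k+1) / k(k-1)...1
numerator = 1
for i in range(k):
    numerator *= n - i
assert comb(n, k) == numerator // factorial(k)

print(comb(n, k))  # 10 distinct sequences of 2 Rs and 3 ¬Rs
```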

Suppose that *Ra*_{1}, …, *Ra*_{n} are thoroughly independent
and equiprobable, i.e. *P*(*Ra*_{1}) = … =
*P*(*Ra*_{n}) = *p*. Then
*P*(*Ra*_{1} ∧ … ∧ *Ra*_{n}) =
*p*^{n}. And, in general,
*P*(*Ra*_{1} ∧ … ∧ *Ra*_{k} ∧ ¬*Ra*_{k+1} ∧ … ∧ ¬*Ra*_{n}) =
*p*^{k}*q*^{n−k}
where *q* = 1 − *p*. Indeed, if
*Ra*_{1}′, …, *Ra*_{n}′ result from negating
any *n*−*k* of *Ra*_{1}, …, *Ra*_{n} then
*P*(*Ra*_{1}′ ∧ … ∧ *Ra*_{n}′) =
*p*^{k}*q*^{n−k}.

Sequences of independent and equiprobable sentences like this are
known as *Bernoulli trials*. They provide probabilistic
descriptions of many random phenomena in which an experiment, such as
tosses of a coin or draws with replacement of balls of two sorts from
an urn, is repeated in uniform conditions. One outcome (red ball,
heads) is typically labeled “success”. The
*frequency* of success in such a sequence is just the number of
successful outcomes. If there are *k* red balls and
*n*−*k* black balls drawn in *n* trials,
then the frequency of success is *k*. The *relative
frequency* (r.f.) of success is the frequency of success divided
by the number of trials: *k*/*n* in the present
example.

In the Bernoulli case the probability of *k* successes in
*n* trials, in any order whatever, is

b(n, k, p) = (n choose k) p^{k}q^{n−k}

This is the *binomial distribution*, frequently encountered in
works on probability and induction.
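A minimal sketch of the binomial distribution in code (the parameter values are arbitrary illustrations):

```python
from math import comb

def b(n, k, p):
    """Probability of exactly k successes in n Bernoulli trials."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

# The values b(n, k, p) for k = 0..n sum to one, as a distribution must:
assert abs(sum(b(10, k, 2 / 3) for k in range(11)) - 1) < 1e-12
# Two red balls in three draws with p = 2/3:
assert abs(b(3, 2, 2 / 3) - 12 / 27) < 1e-12
```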

Many inductive inferences concern the relations between probabilities and relative frequencies. Relative frequencies conform to the laws of probability:

r.f. of ¬R = 1 − r.f. of R

If r.f. of R ∧ S = 0 then r.f. of R ∨ S = r.f. of R + r.f. of S

In the case of Bernoulli trials we expect the relative frequency of
success to be close to the probability of success. If two-thirds of
the balls in the urn are red it is plausible to identify this ratio
with the probability of success and to expect that the r.f of success
will be close to 2/3. This is of course not a sure thing: the r.f. of
success in actual draws may be anything from zero to one. The precise
relationship between r.f. and probability is expressed in
*Bernoulli's weak law of large numbers*. This law says that in
the Bernoulli case with probability *p* of success, the
proportion of sequences in which the r.f. of success differs from
*p* by more than an arbitrarily small quantity ε
approaches zero as the number of trials increases without
bound. Symbolically, where

p = probability of success

f = frequency of success

n = number of trials

Prob = r.f. of sequences satisfying the condition

Prob[|*f*/*n* − *p*| > ε]
approaches zero as *n* increases without bound. (See Cramér
1955, section 6.3 for an accessible proof.)
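The law can be illustrated by computing the tail probability Prob[|f/n − p| > ε] exactly from the binomial distribution for increasing n (the choices p = 2/3 and ε = 0.1 are arbitrary):

```python
from math import comb

def tail(n, p, eps):
    """Exact Prob[|f/n - p| > eps] under the binomial distribution."""
    return sum(
        comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(n + 1)
        if abs(k / n - p) > eps
    )

# The tail probability shrinks as the number of trials grows:
probs = [tail(n, 2 / 3, 0.1) for n in (10, 100, 1000)]
assert probs[0] > probs[1] > probs[2]
print(probs)
```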

Remarks:

- The theorem does not say that r.f. approaches probability in the limit. That is a frequent misunderstanding.
- The theorem does not depend upon a frequency interpretation of
probability. It is required of the probability of success only that it
obey the laws of probability; apart from that the probability
*p* is left without interpretation or definition.

### Random variables and expectations, both badly named

Most uses of probability depend upon random variables and
expectations. Random variables, to begin with them, are not variables,
they are functions. A random variable (r.v.) is a function that
assigns a quantity—positive, negative, or zero—to each
member of a given collection with a given probability. The function
that gives the number of pips on each side of a fair die is a
r.v. that takes on the values 1–6 with probabilities each =
1/6. The *distribution* of a r.v. lists its possible values
and gives their probabilities. The *expectation* of a random
variable is the weighted sum of its values, the weights being the
probabilities of the respective values. Thus in the case of the fair
die we write

E(N) = ∑_{x} Pr[N = x] x = 3.5

In this case the expectation is just the mean of the values of the r.v. If, on the other hand, the die is shaved so as to favor the ace:

Pr[N = 1] = 1/5, Pr[N = 2] = … = Pr[N = 6] = 4/25

the expectation will shift accordingly.

E(N) = 3.4
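Both expectations can be recomputed from the definition (a minimal sketch; the two distributions are those given above):

```python
def expectation(dist):
    """E(N) = sum over values x of x * Pr[N = x]."""
    return sum(x * p for x, p in dist.items())

fair = {x: 1 / 6 for x in range(1, 7)}                 # fair die
shaved = {1: 1 / 5, **{x: 4 / 25 for x in range(2, 7)}}  # shaved die

assert abs(expectation(fair) - 3.5) < 1e-9
assert abs(expectation(shaved) - 3.4) < 1e-9
```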

Continuous random variables are those that may take on all values in some continuous interval. Such a r.v. is finite if it takes on only a finite number of distinct values. Certain families of distributions are studied for different reasons. Perhaps the best known is the normal or Gaussian distribution; this is the familiar bell-shaped curve. A normal distribution on an interval of the real line is specified by just two parameters: its mean and its variance, which is, roughly, the average squared distance of its values from the mean. The bell curve actually graphs not the normal distribution, but the normal frequency function. Its equation is

φ(x) = [1 / √(2π)] e^{−x^{2}/2}

The normal distribution function is the integral of the frequency function; it says what the probability is that the value of the r.v. will be no greater than a given value. Its equation is

Φ(x) = [1 / √(2π)] ∫_{−∞}^{x} e^{−t^{2}/2} dt

A second important distribution is the binomial distribution. This is
the distribution described above that gives the probability of
*k* successes in *n* trials. One of the most important
results in the study of induction is the *de Moivre-Laplace limit
theorem*, which states that *as n and k increase without bound
the binomial distribution approaches the normal distribution as a
limit*.
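A rough numerical check of the approximation (the parameters n = 1000, p = 1/2, k = 520 are arbitrary choices; the standardization by √(npq) follows the usual statement of the theorem):

```python
from math import comb, exp, pi, sqrt

def b(n, k, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def phi(x):
    """Normal frequency function (standard normal density)."""
    return exp(-x * x / 2) / sqrt(2 * pi)

n, p = 1000, 0.5
sigma = sqrt(n * p * (1 - p))
k = 520

exact = b(n, k, p)
approx = phi((k - n * p) / sigma) / sigma
assert abs(exact - approx) / exact < 0.01  # close agreement for large n
```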

A few simple principles mark off conditional probability from logical implication.

- Logical implication is *monotonic*. This means (roughly; see the article on non-monotonic logic for the full story) that if a set Γ implies a sentence *A* and if Γ ⊆ Δ, then Δ implies *A*. Conditional probability is, however, non-monotonic: *P*(*C* | *A* ∧ *B*) may be equal to, greater than, or less than *P*(*C* | *A*). Again, *A* may raise the probability of *C* while *A* ∧ *B* lowers the probability of *C*.
- Conditional probability does not allow contraposition: *P*(*B* | *A*) may be equal to, greater than, or less than *P*(¬*A* | ¬*B*).
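Both failures can be exhibited with a fair die (a minimal sketch; the events are invented for illustration):

```python
# Outcomes of a fair die; events as sets of outcomes.
outcomes = range(1, 7)

def P(event, given=None):
    """P(event | given), with given = None meaning no conditioning."""
    space = [o for o in outcomes if given is None or o in given]
    return len([o for o in space if o in event]) / len(space)

A, B, C = {1, 2, 3, 4}, {2, 4, 6}, {1, 3}

# Non-monotonicity: A raises the probability of C, A ∧ B lowers it.
assert abs(P(C) - 1 / 3) < 1e-12
assert P(C, given=A) == 0.5          # conditioning on A raises it
assert P(C, given=A & B) == 0.0      # conditioning on A ∧ B lowers it

# Contraposition fails: P(B | A) = 1/2 but P(¬A | ¬B) = 1/3.
notA = set(outcomes) - A
notB = set(outcomes) - B
assert P(B, given=A) != P(notA, given=notB)
```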