## Notes to Inductive Logic

1.
Although enumerative
inductive arguments may seem similar to what classical statisticians
call *estimation*, they are not really the same thing. As
classical statisticians are quick to point out, *estimation*
does not use the sample to *inductively support* a conclusion
about the whole population. *Estimation* is not supposed to be
a kind of inductive inference. Rather, *estimation* is a
decision strategy. The sample frequency will be within two standard
deviations of the population frequency in about 95% of all
samples. So, if one adopts the strategy of *accepting as true*
the claim that the population frequency is within two standard
deviations of the sample frequency, and if one uses this strategy
repeatedly for various samples, one should be right about 95% of the
time. We will discuss enumerative induction in much more detail later
in the article.
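The 95% figure can be checked with a quick simulation. The Python sketch below is an added illustration, not part of the original text; the population frequency, sample size, and trial count are arbitrary choices. It repeatedly applies the two-standard-deviation acceptance strategy and reports how often the accepted claim turns out true.

```python
import random
import math

def strategy_success_rate(pop_freq=0.3, sample_size=500, trials=1000, seed=0):
    """Simulate the two-standard-deviation acceptance strategy.

    On each trial, draw a sample, compute the sample frequency, and
    accept the claim that the population frequency lies within two
    standard deviations of it.  Return the fraction of trials on which
    the accepted claim is true.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        k = sum(rng.random() < pop_freq for _ in range(sample_size))
        f = k / sample_size
        # standard deviation of the sample frequency, estimated from the sample
        sd = math.sqrt(f * (1 - f) / sample_size)
        if abs(f - pop_freq) <= 2 * sd:
            hits += 1
    return hits / trials

rate = strategy_success_rate()
print(round(rate, 2))  # close to 0.95
```

As the note says, the strategy is not an inference from the sample to the population; it is a policy that happens to succeed about 95% of the time.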

2.
Another way of
understanding axiom (5) is to view it as a generalization of the
*deduction theorem and its converse*. The *deduction
theorem and converse* says this:
C ⊨ (B⊃A) if and only if (C·B) ⊨ A. Given axioms (1-4), axiom (5) is
equivalent to the following:

5*. (1 − P_{α}[(B⊃A) | C]) = (1 − P_{α}[A | (B·C)]) · P_{α}[B | C].

The conditional probability P_{α}[A | (B·C)]
completely discounts the possibility that B is false, whereas the
probability of the conditional P_{α}[(B⊃A) | C]
depends significantly on how probable B is (given C), and must
approach 1 if P_{α}[B | C] is near 0. Rule (5*) captures
how this difference between the conditional probability and the
probability of a conditional works. It says that the distance below 1
of the support-strength of C for (B⊃A) equals *the product
of* the distance below 1 of the support strength of (B·C)
for A *and* the support strength of C for B. This makes good
sense: the support of C for (B⊃A) (i.e., for
(~B∨A))
is closer to 1 than the support of (B·C) for A by the
multiplicative factor P_{α}[B | C]: the smaller this factor
(i.e., the more strongly C supports ~B), the closer the support for
(B⊃A) comes to 1. According to Rule (5*), then, for any
fixed value of P_{α}[A | (B·C)] < 1,
as P_{α}[B | C] approaches 0,
P_{α}[(B⊃A) | C] must approach 1.
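Rule (5*) is easy to verify numerically. The following sketch is an added illustration: an arbitrary toy distribution over the four truth-value assignments to A and B stands in for P_{α}[· | C].

```python
# Toy probability space: weights over the four truth-value assignments
# to (A, B), playing the role of P[· | C].  The weights are arbitrary.
weights = {(True, True): 0.3, (True, False): 0.2,
           (False, True): 0.1, (False, False): 0.4}

def p(event):
    """Probability (given C) of the worlds where `event` holds."""
    return sum(w for (a, b), w in weights.items() if event(a, b))

p_b = p(lambda a, b: b)                          # P[B | C]
p_a_given_bc = p(lambda a, b: a and b) / p_b     # P[A | B·C]
p_cond = p(lambda a, b: (not b) or a)            # P[B⊃A | C], i.e. P[~B∨A | C]

# Rule (5*): the distance below 1 of P[B⊃A | C] equals the distance
# below 1 of P[A | B·C] times P[B | C].
lhs = 1 - p_cond
rhs = (1 - p_a_given_bc) * p_b
print(lhs, rhs)
```

With these weights both sides come out to .1, and the identity holds for any choice of weights.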

3.
This is not what is
commonly referred to as *countable additivity*. Countable
additivity requires a language in which infinitely long disjunctions
are defined. It would then specify that
P_{α}[(B_{1}∨B_{2}∨…) | C] = ∑_{i} P_{α}[B_{i} | C]. The
present result may be derived (without appealing to *countable
additivity*) as follows. For each distinct i and j, let C ⊨ ~(B_{i}·B_{j}); and suppose that
P_{α}[D | C] < 1 for at least one sentence D. First
notice that we have, for each i, C ⊨ (~(B_{i}·B_{i+1})·…·~(B_{i}·B_{n})); so C ⊨ ~(B_{i}·(B_{i+1}∨…∨B_{n})). Then, for each finite
list of the B_{i},
P_{α}[(B_{1}∨B_{2}∨…∨B_{n}) | C] = P_{α}[B_{1} | C] +
P_{α}[(B_{2}∨…∨B_{n}) | C] = … =
∑_{i=1}^{n} P_{α}[B_{i} | C]. By definition, ∑_{i} P_{α}[B_{i} | C] = lim_{n} ∑_{i=1}^{n} P_{α}[B_{i} | C]. So,
lim_{n} P_{α}[(B_{1}∨B_{2}∨…∨B_{n}) | C] = ∑_{i} P_{α}[B_{i} | C].

4.
Here are the usual
axioms when *unconditional probability* is taken as basic:

P_{α} is a function from statements to real numbers between 0 and 1 that satisfies the following rules:

- if ⊨ A (i.e. if A is a logical truth), then P_{α}[A] = 1;
- if ⊨ ~(A·B) (i.e. if A and B are logically incompatible), then P_{α}[(A∨B)] = P_{α}[A] + P_{α}[B].

Definition: if P_{α}[B] > 0, then by definition P_{α}[A | B] = P_{α}[(A·B)] / P_{α}[B].
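The definitional rule for conditional probability can be illustrated with a toy example (added here; the fair six-sided die is an arbitrary choice, not from the text):

```python
from fractions import Fraction

# Worked example of the definitional rule P[A | B] = P[A·B] / P[B],
# using a fair six-sided die as the probability space.
outcomes = range(1, 7)

def pr(event):
    """Probability of an event under the uniform distribution on 1..6."""
    hits = [o for o in outcomes if event(o)]
    return Fraction(len(hits), 6)

A = lambda o: o >= 5          # "the roll is at least 5"
B = lambda o: o % 2 == 0      # "the roll is even"

p_a_and_b = pr(lambda o: A(o) and B(o))   # only the roll 6: 1/6
p_b = pr(B)                                # 3/6
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 1/3
```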

5.
Bayesians often refer
to the probability of an evidence statement on a hypothesis, P[e | h·b·c], as the *likelihood of the
hypothesis*. This can be a somewhat confusing convention since it
is clearly the evidence that is made likely to whatever degree by the
hypothesis. So, we will disregard the usual convention here. Also,
presentations of probabilistic inductive logic often suppress c and b,
and simply write ‘P[e | h]’. But c and b are important
parts of the logic of the likelihoods. So we will continue to make
them explicit.

6. These attempts have not been wholly satisfactory thus far, but research continues. For an illuminating discussion of the logic of direct inference and the difficulties involved in providing a formal account, see the series of papers (Levi, 1977), (Kyburg, 1978) and (Levi, 1978). Levi (1980) develops a very sophisticated approach.

Kyburg has developed a logic of statistical inference based solely on logical direct inference probabilities (Kyburg, 1974). Kyburg's logical probabilities do not satisfy the usual axioms of probability theory. The series of papers cited above compares Kyburg's approach to a kind of Bayesian inductive logic championed by Levi (e.g. in Levi, 1967).

7.
This idea should not
be confused with *positivism*. A version of *
positivism* applied to likelihoods would hold that if two theories
assign the same likelihood values to all possible evidence claims,
then they are essentially the same theory, though they may be couched
in different words. In short: *same likelihoods* implies
*same theory*. The view suggested here, however, is not
*positivism*, but its inverse, which should be much less
controversial: *different likelihoods* implies *different
theories*. That is, given that all of the relevant background
knowledge is made explicit (represented in ‘b’), if two
scientists disagree significantly about the likelihoods of important
evidence claims on a given hypothesis, they must understand the
empirical content of that hypothesis quite differently. To that
extent, though they may employ the same sentences, the same syntactic
expressions, they use them to express empirically distinct
hypotheses.

8.
Call an object
*grue* at a given time *just in case* either the time is
earlier than the first second of the year 2030 and the object is
green or the time is not earlier than the first second of 2030 and the
object is blue. Now the statement ‘All emeralds are green (at
all times)’ has the same syntactic structure as ‘All
emeralds are grue (at all times)’. So, if syntactic structure
determines priors, then these two hypotheses should have the same
prior probabilities. Indeed, both should have prior probabilities
approaching 0. For there are infinitely many competitors of
these two hypotheses, each sharing the same syntactic structure:
consider the hypotheses ‘All emeralds are grue_{n} (at
all times)’, where an object is grue_{n} at a given time
*just in case* either the time is earlier than the first second
of the n^{th} day after January 1, 2030, and the object is
green *or* the time is not earlier than the first second of the
n^{th} day after January 1, 2030, and the object is blue. A
purely syntactic specification of the priors should assign all of
these hypotheses the same prior probability. But these are mutually
exclusive hypotheses; so their prior probabilities must sum to a value
no greater than 1. The only way this can happen is for ‘All
emeralds are green’ and each of its grue_{n} competitors
to have prior probability values either equal to 0 or extremely close
to it.

9.
This assumption may be
substantially relaxed without affecting the analysis below; we might
instead only suppose that the ratios P_{α}[c^{n} | h_{j}·b]/P_{α}[c^{n} | h_{i}·b] are bounded so as not to get exceptionally far
from 1. If *that* supposition were to fail, then the mere
occurrence of the experimental conditions would count as very strong
evidence for or against hypotheses — a highly implausible
effect. Our analysis could include such bounded condition-ratios, but
this would only add inessential complexity.

10.
For example, when a
new disease is discovered, a new hypothesis h_{u+1} about that
disease being a possible cause of patients’ symptoms is made
explicit. The old catch-all was, “the symptoms are caused by
some unknown disease — some disease other than
h_{1},…, h_{u}”. So the new catch-all
hypothesis must now state that “the symptoms are caused by one
of the remaining unknown diseases — some disease other than
h_{1},…, h_{u}, h_{u+1}”. And,
clearly, P_{α}[h_{K} | b] =
P_{α}[~h_{1}·…·~h_{u} | b] =
P_{α}[~h_{1}·…·~h_{u}· (h_{u+1}∨~h_{u+1}) | b] =
P_{α}[~h_{1}·…·~h_{u}·~h_{u+1} | b] + P_{α}[h_{u+1} | b] =
P_{α}[h_{K*} | b] +
P_{α}[h_{u+1} | b]. Thus, the new
hypothesis h_{u+1} is “peeled off” of the old
catch-all hypothesis h_{K}, leaving a new catch-all hypothesis
h_{K*} with a prior probability value equal to that of the old
catch-all minus the prior of the new hypothesis.
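The "peeling off" arithmetic can be checked with a toy prior distribution (the numerical values below are illustrative assumptions, not from the text):

```python
from fractions import Fraction as F

# A toy prior over an exhaustive set of mutually exclusive hypotheses.
# h1 and h2 are the explicit hypotheses; h3 is the newly articulated
# one; 'rest' covers everything still unarticulated.
priors = {'h1': F(2, 5), 'h2': F(1, 5), 'h3': F(1, 10), 'rest': F(3, 10)}
assert sum(priors.values()) == 1

# Old catch-all hK = ~h1·~h2; its prior is everything but h1 and h2.
p_hK = 1 - priors['h1'] - priors['h2']

# When h3 is made explicit, the new catch-all hK* excludes it too,
# so its prior is the old catch-all's prior minus the prior of h3.
p_hK_star = p_hK - priors['h3']
print(p_hK, p_hK_star)  # 2/5 and 3/10
```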

11.
This claim depends,
of course, on h_{i} being empirically distinct from each
alternative h_{j}. I.e., there must be conditions
c_{k} with possible outcomes o_{ku} on which the
likelihoods differ: P[o_{ku} | h_{i}·b·c_{k}] ≠
P[o_{ku} | h_{j}·b·c_{k}].
Otherwise h_{i} and h_{j} are empirically equivalent,
and no amount of evidence can support one over the other. (Did you
think a confirmation theory could possibly do better? — could
somehow employ evidence to confirm the true hypothesis over
*empirically equivalent* rivals?) If the true hypothesis has
empirically equivalent rivals, then convergence just implies that the
odds against *the disjunction* of the true hypothesis with
these rivals very probably goes to 0, and so the posterior probability
of this *disjunction* goes to 1. Among empirically equivalent
hypotheses the ratio of their posterior probabilities equals the ratio
of their priors:
P_{α}[h_{j} | b·c^{n}·e^{n}]
/ P_{α}[h_{i} | b·c^{n}·e^{n}] = P_{α}[h_{j} | b] /
P_{α}[h_{i} | b]. So the true hypothesis will have
a posterior near 1 (after evidence drives the posteriors of empirically distinguishable
rivals near 0) *just in case* non-evidential considerations make its
evidence-independent plausibility much higher than the sum of the plausibility
ratings of any empirically equivalent rivals.

12.
This is a good place
to describe one reason for thinking that *inductive support
functions* must be distinct from subjectivist or personalist
*degree-of-belief functions*. Although likelihoods have a high
degree of objectivity in many scientific contexts, it is difficult for
*belief functions* to properly represent objective likelihoods.
This is an aspect of the *problem of old evidence*.

*Belief functions* are supposed to provide an idealized
model of belief strengths for agents. They extend the notion of
ideally consistent belief to a probabilistic notion of ideally
coherent belief strengths. There is no harm in this kind of
idealization. It is supposed to supply a normative guide for real
decision making. An agent is supposed to make decisions based on her
belief-strengths about the state of the world, her belief strengths
about possible consequences of actions, and her assessment of the
desirability (or *utility*) of these consequences. But the very
role that *belief functions* are supposed to play in decision
making makes them ill-suited to inductive inferences where the
*likelihoods* are often supposed to be objective, or at least
possess inter-subjectively agreed values that represent the empirical
import of hypotheses. For the purposes of decision making,
degree-of-belief functions *should* represent the agent's
belief strengths *based on everything she presently knows*. So,
degree-of-belief likelihoods must represent how strongly the agent
would believe the evidence if the hypothesis were added to
*everything else she presently knows*. However,
support-function likelihoods are supposed to represent what the
hypothesis (together with explicit background and experimental
conditions) *says* or *implies* about the evidence. As a
result, *degree-of-belief* likelihoods are saddled with a
version of the *problem of old evidence* – a problem not
shared by support function likelihoods. And it turns out that the old
evidence problem for likelihoods is much worse than is usually
recognized.

Here is the problem. If the agent is already certain of an
evidence statement e, then her *belief-function* likelihoods
for that statement must be 1 on every hypothesis. I.e., if
Q_{γ} is her *belief function* and
Q_{γ}[e] = 1, then it follows from the axioms of
probability theory that Q_{γ}[e | h_{i}·b·c] = 1, regardless of what h_{i}
says — even if h_{i} implies that e is quite unlikely
(given b·c). But the problem goes even deeper. It not only
applies to evidence that the agent *knows with certainty*. It
turns out that almost anything the agent learns that can change how
strongly she believes e will also influence the value of her
*belief-function* likelihood for e, because
Q_{γ}[e | h_{i}·b·c] represents
the agent's belief strength given *everything she
knows*.

To see the difficulty with less-than-certain evidence, consider
the following example. (I'll suppress the b and c here, as
subjectivist Bayesians often do, since they will make no difference
for present purposes.) A physician intends to test her patient for
heart disease, h, with a treadmill test. She knows from medical
studies that there is a 10% false negative rate for this test; so her
belief-strength for a negative result, e, given heart disease is
present, h, is Q_{γ}[e | h] = .10. Now, her nurse is
very professional and is usually unaffected by patients’ test
results. So, if asked, the physician would say her belief strength
that her nurse will feel devastated, d, if the test is positive
(i.e. if ~e) is around Q_{γ}[d | ~e] = .05. Let us
suppose, as seems reasonable, that this belief-strength is independent
of whether h is in fact true — i.e. Q_{γ}[d | ~e·h] = Q_{γ}[d | ~e]. The nurse then says to the
physician, in a completely convincing way, “he is such a nice
guy — if *his* test comes out positive, I'll be
devastated.” The physician's new belief function
likelihood for a false negative must then become
Q_{γnew}[e | h] =
Q_{γ}[e | h·(~e⊃d)] = .69 (since
Q_{γ}[e | h·(~e⊃d)] =
Q_{γ}[~e⊃d | h·e] ·
Q_{γ}[e | h] /
(Q_{γ}[~e⊃d | h·e] ·
Q_{γ}[e | h] + Q_{γ}[~e⊃d | h·~e] · Q_{γ}[~e | h]) =
Q_{γ}[e | h] / (Q_{γ}[e | h] +
Q_{γ}[d | ~e·h] · Q_{γ}[~e | h]) = .1/(.1 + (.05)(.9)) = .69).
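The physician's updated likelihood can be recomputed directly. This sketch is an added illustration using the numbers from the example:

```python
# Recomputation of the physician example: how learning the trivial
# conditional (~e ⊃ d) shifts the belief-function likelihood Q[e | h].
q_e_given_h = 0.10        # false-negative rate: Q[e | h]
q_not_e_given_h = 0.90    # Q[~e | h]
q_d_given_not_e_h = 0.05  # Q[d | ~e·h]; independence from h, as in the text

# Q[~e⊃d | h·e] = 1 (the conditional holds trivially when e does),
# and Q[~e⊃d | h·~e] = Q[d | ~e·h].  Bayes' theorem then gives:
new_likelihood = q_e_given_h / (q_e_given_h
                                + q_d_given_not_e_h * q_not_e_given_h)
print(round(new_likelihood, 2))  # 0.69
```

So a .05 belief strength in an apparently irrelevant conditional moves the likelihood of a false negative from .10 to about .69.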

The point is that even the most trivial knowledge of conditional
(or disjunctive) claims involving e may completely upset the value of
the likelihood for an agent's belief function. And an agent will
almost always have some such trivial knowledge. E.g., the physician
in the previous example may also learn that if the treadmill test is
negative for heart disease, *then*, (1) the patient's
worried mother will throw a party, (2) the patient's insurance
company won't cover additional tests, (3) it will be the
thirty-seventh negative treadmill test result she has received for a
patient this year,…, etc. Updating on such conditionals can
force physicians’ *belief functions* to deviate widely
from the evidentially relevant objective, textbook values of test
result likelihoods.

More generally, it can be shown that the incorporation into
Q_{γ} of almost any kind of evidence for or against the
truth of a prospective evidence claim e — even uncertain
evidence for e, as may come through Jeffrey updating —
completely undermines the objective or inter-subjectively agreed
likelihoods that a belief function might have expressed before
updating. This should be no surprise. The agent's belief
function likelihoods reflect her *total degree-of-belief* in e,
based on h together with *everything else* *she knows*
about e. So the agent's present belief function may capture
appropriate, public likelihoods for e only if e is completely isolated
from the agent's other beliefs. And this will rarely be the case.

One Bayesian subjectivist response to this kind of problem is that
the *belief functions* employed in scientific inductive
inferences should often be “counterfactual” belief
functions, which represent what the agent *would believe* if e
were subtracted (in some suitable way) from everything else she knows
(see, e.g. Howson & Urbach, 1993). However, our examples show that
merely subtracting e won't do. One must also subtract any
conditional statements containing e. And one must subtract any
uncertain evidence for or against e as well. So the counterfactual
belief function idea needs a lot of working out if it is to rescue the
idea that *subjectivist Bayesian belief functions* can provide
a viable account of the likelihoods employed by the sciences in
inductive inferences.

13.
To see the point
more clearly, consider an example. To keep things simple, let's
suppose our background b says that the chance of *heads* for
tosses of this coin is some whole percentage between 0% and 100%. Let
c say that the coin is tossed in the usual random way; let e say that
the coin comes up heads; and for each r that is a whole multiple of
1/100 between 0 and 1, let h_{[r]} be the *simple statistical
hypothesis* asserting that the chance of heads on each random toss
of this coin is r. Now consider the *composite statistical
hypothesis* h_{[>.65]}, which asserts that the chance
of heads on each random toss is greater than .65. From the axioms of
probability we derive the following relationship:
P_{α}[e | h_{[>.65]}·b] =
P[e | h_{[.66]}·b] ·
P_{α}[h_{[.66]} | h_{[>.65]}·b] + P[e | h_{[.67]}·b] ·
P_{α}[h_{[.67]} | h_{[>.65]}·b] + …+ P[e | h_{[1]}·b] · P_{α}[h_{[1]} | h_{[>.65]}·b]. The issue for the
*likelihoodist* is that the values of the terms of form
P_{α}[h_{[r]} | h_{[>.65]}·b]
are not objectively specified by the composite hypothesis
h_{[>.65]} (together with b). But the value of the
likelihood P_{α}[e | h_{[>.65]}·b]
depends essentially on these non-objective factors. So it fails to
possess the kind of objectivity that *likelihoodists*
require.
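The non-objectivity complaint can be made vivid numerically. In the sketch below (an added illustration; the weighting schemes are arbitrary assumptions), the likelihood of heads on h_{[>.65]} is computed as the weighted average over the simple hypotheses, and different admissible weightings yield quite different values:

```python
# The simple statistical hypotheses h[.66], h[.67], ..., h[1]; on each,
# the likelihood of heads on a random toss is just r.
rs = [r / 100 for r in range(66, 101)]

def composite_likelihood(weights):
    """P[e | h[>.65]·b] as the weighted average of simple-hypothesis
    likelihoods, with weights playing the role of P_a[h[r] | h[>.65]·b]."""
    assert abs(sum(weights) - 1) < 1e-9
    return sum(w * r for w, r in zip(weights, rs))

uniform = [1 / len(rs)] * len(rs)            # equal weight on each h[r]
skewed_high = [0.0] * (len(rs) - 1) + [1.0]  # all weight on h[1]

print(composite_likelihood(uniform))      # 0.83
print(composite_likelihood(skewed_high))  # 1.0
```

Both weightings are consistent with h_{[>.65]} and b, yet they assign the evidence quite different likelihoods; nothing objective selects among them.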

14.
The **Law of
Likelihood** and the **Likelihood Principle** have
been formulated in slightly different ways by various logicians and
statisticians. The **Law of Likelihood** was first
identified by that name in Hacking (1965), and has been invoked more
recently by the *likelihoodist* statisticians A.F.W. Edwards
(1972) and R. Royall (1997). R.A. Fisher (1922) argued for the
**Likelihood Principle** early in the 20^{th}
century, though he didn't call it that. One of the first places
it is discussed under that name is (Savage et al., 1962). It is
also advocated by Edwards (1972) and Royall (1997).

15.
To say that S is a
random sample of population B with respect to attribute A means this:
either, (1) the sample S is generated by a process that gives every
member of B an equal chance of being selected into S, or (2) there is
a subclass of B, call it C, from which S is generated by a process
that gives every member of C an equal chance of being selected into S,
where C is *representative* of B with respect to A in the sense
that the frequency of A in C is almost precisely the same as the
frequency of A in B. The idea is this. Ideally a poll of registered
voters, B, should select a sample S in a way that gives every
registered voter the same chance of getting into S. But that may be
impractical. However, it suffices if the sample is selected from a
representative subpopulation C of B — e.g., from registered
voters who answered the telephone between the hours of 7 PM and 9 PM
in the middle of the week. Of course, the claim that a given
subpopulation C is *representative* is itself a hypothesis that
is open to inductive support by evidence. Professional polling
organizations do a lot of research to calibrate their sampling
technique, to find out what sort of subpopulations C they may draw on
as highly representative. For example, one way to see if registered
voters who answer the phone during the evening, mid-week, are likely
to constitute a representative sample is to conduct a large poll of
such voters immediately after an election, when the result is known,
to see how closely the vote count from the subpopulation matches the
actual vote count.

16. This is a simple version of the Stable-Estimation Theorem of (Edwards, Lindman, & Savage, 1963).

17.
To get a better idea
of the import of this theorem, let's consider some specific
values. First notice that the
factor r·(1−r)
can never be larger than 1/2·1/2 = 1/4; and the
closer r is to 1 or 0, the smaller
r·(1−r)
becomes. So, whatever the value of r, the
factor
q/(r·(1−r)/n)^{½} ≥
2·q·n^{½}.
Thus, for any chosen value of q,

P[r−q < F[A,B∩S] < r+q | F[A,B] = r · Random[S,B,A] · Size[S] = n] ≥ 1 − 2·Φ[−2·q·n^{½}].

For example, if q = .05 and n = 400, then we have (for any value of r),

P[r−.05 < F[A,B∩S] < r+.05 | F[A,B] = r · Random[S,B,A] · Size[S] = 400] ≥ .95.

For n = 900 (and margin q = .05) this lower bound rises to .997:

P[r−.05 < F[A,B∩S] < r+.05 | F[A,B] = r · Random[S,B,A] · Size[S] = 900] ≥ .997.

If we are interested in a smaller margin of error q, we can keep the same sample size and find the value of the lower bound for that value of q. For example,

P[r−.03 < F[A,B∩S] < r+.03 | F[A,B] = r · Random[S,B,A] · Size[S] = 900] ≥ .928.

By increasing the sample size the bound on the likelihood can be made as close to 1 as we want, for any margin q we choose. For example:

P[r−.01 < F[A,B∩S] < r+.01 | F[A,B] = r · Random[S,B,A] · Size[S] = 38000] ≥ .9999.

As the sample size n becomes larger, it becomes extremely likely that the sample frequency will get as close to the true frequency r as anyone may desire.
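These lower-bound values are easy to recompute. The following sketch (an added illustration) evaluates 1 − 2·Φ[−2·q·n^{½}] for the sample sizes and margins discussed above, using the standard normal CDF expressed via the error function:

```python
import math

def lower_bound(q, n):
    """1 − 2·Φ[−2·q·n^(1/2)]: the lower bound, valid for every r, on the
    likelihood that the sample frequency falls within q of r."""
    # Φ[x] = (1/2)·(1 + erf(x/√2)) for the standard normal CDF.
    phi = 0.5 * (1 + math.erf(-2 * q * math.sqrt(n) / math.sqrt(2)))
    return 1 - 2 * phi

for q, n in [(0.05, 400), (0.05, 900), (0.03, 900), (0.01, 38000)]:
    print(q, n, round(lower_bound(q, n), 4))
```

The printed values recover .95, .997, .928, and .9999 for the four cases in the text.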

18.
That is, for each
inductive support function P_{α}, the
posterior P_{α}[h_{j} | b·c^{n}·e^{n}] must go to 0 if the ratio
P_{α}[h_{j} | b·c^{n}·e^{n}] /
P_{α}[h_{i} | b·c^{n}·e^{n}] goes to 0; and that will
occur if the likelihood ratios
P[e^{n} | h_{j}·b·c^{n}]
/ P[e^{n} | h_{i}·b·c^{n}]
approach 0 and the prior P_{α}[h_{i} | b] is
greater than 0. The Likelihood Ratio Convergence Theorem will show
that when h_{i}·b is true, it is very likely that the
evidence will indeed be such as to drive the likelihood ratios as near
to 0 as you please, for a long enough evidence stream. If that
happens, the only way a Bayesian agent can *avoid* having his
inductive support function yield *posterior probabilities* for
h_{j} approaching 0 (as n gets large) is to continually switch
among support functions (moving from P_{α} to
P_{β} to P_{γ} to …) in a way that
revises the pre-evidential prior probability of h_{i} downward
towards 0. And even then, he can only *avoid* having the
posterior probability for h_{j} approach 0 for each
*current* support function, as he switches among them, by
continually switching to new support functions at a rate that keeps
the revised priors P_{ε}[h_{i} | b] for
h_{i} diminishing towards 0 at least as quickly as the
likelihood ratios diminish towards 0 (with increasing n). For,
suppose, *on the contrary*, that
P[e^{n} | h_{j}·b·c^{n}] / P[e^{n} | h_{i}·b·c^{n}]
approaches 0 *faster* than sequence
P_{ε}[h_{i} | b], for changing
P_{ε} and increasing n — i.e. *approaches 0
faster* in the sense that
(P[e^{n} | h_{j}·b·c^{n}]
/ P[e^{n} | h_{i}·b·c^{n}]) /
P_{ε}[h_{i} | b] goes to 0, for changing
P_{ε} and increasing n. Then, we have
(P[e^{n} | h_{j}·b·c^{n}]
/ P[e^{n} | h_{i}·b·c^{n}]) /
P_{ε}[h_{i} | b] >
(P[e^{n} | h_{j}·b·c^{n}]
/ P[e^{n} | h_{i}·b·c^{n}])
· (P_{ε}[h_{j} | b] /
P_{ε}[h_{i} | b]) =
P_{ε}[h_{j} | b·c^{n}·e^{n}] /
P_{ε}[h_{i} | b·c^{n}·e^{n}]. So,
P_{ε}[h_{j} | b·c^{n}·e^{n}] /
P_{ε}[h_{i} | b·c^{n}·e^{n}] must still go to 0, for
changing P_{ε} and increasing n; and so must
P_{ε}[h_{j} | b·c^{n}·e^{n}].

For a thorough presentation of the most prominent Bayesian convergence results and a discussion of their weaknesses see (Earman, 1992, Ch. 6). However, Earman does not discuss the convergence theorems under consideration here.

19.
In scientific
contexts the most prominent kind of case where data may fail to be
*result-independent* is where some quantity of past data helps
tie down the numerical value of a parameter not completely specified
by the hypothesis at issue, but where the value of this parameter
influences the likelihoods of outcomes of lots of other
experiments. Such hypotheses effectively contain *disjunctions*
of more specific hypotheses, where each distinct disjunct is a version
of the original hypothesis, but with a specific value filled in for
the parameter. Evidence that “fills in the value” for the
parameter for the original, less specific hypothesis just amounts to
evidence that refutes (via likelihood ratios) those specific disjuncts
in it that possess incorrect parameter values. So, for the purposes of
inductive logic, in many cases where it is helpful to treat evidence
as composed of *result-independent* chunks one may decompose
the less specific *composite disjunctive hypotheses* into their
more specific disjuncts. For each of them,
*result-independence* should be satisfied (when the data is
appropriately chunked, as discussed in the text).

20.
Technically,
suppose that O_{k} can be further “subdivided”
into more outcome-descriptions by replacing o_{kv} with two
“parts”, o_{kv}^{*} and
o_{kv}^{#}, to produce new outcome space
O_{k}^{*} =
{o_{k1},…,o_{kv}^{*},o_{kv}^{#},…,o_{kw}},
where P[o_{kv}^{*}·o_{kv}^{#} | h_{i}·b·c_{k}] = 0 and
P[o_{kv}^{*} | h_{i}·b·c_{k}] +
P[o_{kv}^{#} | h_{i}·b·c_{k}] =
P[o_{kv} | h_{i}·b·c_{k}];
and suppose similar relationships hold for h_{j}. Then the new
EQI* (based on O_{k}^{*}) is greater than or equal to
EQI (based on O_{k}); and EQI^{*} > EQI *just in
case* at least one of the new likelihood ratios, e.g.,
P[o_{kv}^{*} | h_{i}·b·c_{k}] /
P[o_{kv}^{*} | h_{j}·b·c_{k}], differs in value from
the “undivided” outcome's likelihood ratio,
P[o_{kv} | h_{i}·b·c_{k}] /
P[o_{kv} | h_{j}·b·c_{k}].

21.
The likely rate of
convergence will almost always be much faster than the worst case
bound provided by Theorem 2. To see the point more clearly,
let's look at a very simple example. Suppose h_{i} says
that a certain bent coin has a propensity for “heads” of
2/3 and h_{j} says the propensity is 1/3. Let the evidence
stream consist of outcomes of tosses. In this case the average EQI
equals the EQI of each toss, which is 1/3; and the smallest possible
likelihood ratio occurs for “heads”, which yields the
value γ = ½. So, the value of the lower bound given by
Theorem 2 for the likelihood of getting an outcome sequences with a
likelihood ratio below ε (for h_{j} over
h_{i}) is

1 − (1/n)(log ½)^{2}/((1/3) + (log ε)/n)^{2} = 1 − 9/(n·(1 + 3(log ε)/n)^{2}).

Thus, according to the theorem, the
likelihood of getting an outcome sequence with a likelihood ratio less
than ε = 1/16 (= .0625) when h_{i} is true and the number
of tosses is n = 52 is *at least* .70; and for n = 204 tosses
the likelihood is *at least* .95.

To see how much lower than necessary the lower bound provided by
the theorem really is, consider what the usual binomial distribution
for the coin tosses in this example implies about the likely values of
the likelihood ratios. The likelihood ratio for exactly k
“heads” in n tosses is ((1/3)^{k} (2/3)^{n−k}) / ((2/3)^{k} (1/3)^{n−k}) =
2^{n−2k}, which we want to have a value less than ε. A
bit of algebra yields that to get a likelihood ratio below ε,
the percentage of “heads”
must be k/n > ½ − ½(log ε)/n. Using the normal approximation to
the binomial distribution (with mean = 2/3 and variance =
(2/3)·(1/3)/n) the actual likelihood of obtaining an outcome
sequence having a likelihood ratio less than ε is given by

Φ[(mean − (½ − ½(log ε)/n))/(variance)^{½}] = Φ[(1/8)^{½}n^{½}(1 + 3(log ε)/n)]

(where Φ[x] gives the value of the standard normal
distribution from −∞ to x). Now let ε = 1/16 (=
.0625), as before. So the actual likelihood of obtaining a stream of
outcomes with likelihood ratio this small when h_{i} is true
and the number of tosses is n = 52 is
Φ[1.96] > .975, whereas
the lower bound given by Theorem 2 was .70. And if the number of
tosses is increased to n = 204, the likelihood of obtaining an outcome
sequence with a likelihood ratio this small (i.e., ε = 1/16)
is
Φ[4.75]
>.999999, whereas the lower bound from Theorem 2 for this
likelihood is .95. Indeed, to actually get a likelihood of .95 that
the evidence stream will produce a likelihood ratio less than
ε = 1/16, the number of tosses actually needed is only n = 43
tosses, rather than the 204 tosses required by the bound given by the
theorem. (Note: These examples employ “identically
distributed” trials — repeated tosses of a coin — as
an illustration. But Convergence Theorem 2 applies much more
generally. It applies to any evidence sequence, no matter how diverse
the probability distributions for the various experiments or
observations in the sequence, and regardless of whether the outcomes
are independent.)
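The comparison between the worst-case bound of Theorem 2 and the normal-approximation values can be reproduced as follows (an added illustration; logarithms are taken base 2, as required by the text's use of log ½ and log ε):

```python
import math

def phi(x):
    """Standard normal CDF: Φ[x] = (1/2)·(1 + erf(x/√2))."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def theorem2_bound(n, eps):
    """Worst-case lower bound from Theorem 2 for the 2/3-vs-1/3 coin:
    1 − 9/(n·(1 + 3·(log eps)/n)²), with log base 2."""
    log_eps = math.log2(eps)
    return 1 - 9 / (n * (1 + 3 * log_eps / n) ** 2)

def normal_estimate(n, eps):
    """Likelihood of a ratio below eps via the normal approximation:
    Φ[(1/8)^(1/2)·n^(1/2)·(1 + 3·(log eps)/n)]."""
    log_eps = math.log2(eps)
    return phi(math.sqrt(n / 8) * (1 + 3 * log_eps / n))

eps = 1 / 16
for n in (52, 204):
    print(n, round(theorem2_bound(n, eps), 2), round(normal_estimate(n, eps), 4))
```

For n = 52 this recovers the bound of about .70 against an actual likelihood above .975, and for n = 204 the bound .95 against a likelihood above .999999, as in the text.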

22.
It should now be
clear why the boundedness of EQI above 0 is important. Convergence
Theorem 2 applies only when EQI[c^{n} | h_{i}/h_{j} | b] >
−(log ε)/n. But this requirement is not a strong
assumption. For, the **Nonnegativity of EQI Theorem**
shows that the empirical distinctness of two hypotheses on a single
possible outcome *suffices* to make the average EQI positive
for the whole sequence of experiments. So, given any small fraction
ε > 0, the value of −(log ε)/n (which has to
be greater than 0) will eventually become smaller than EQI,
provided that the degree to which the hypotheses are empirically
distinct for the various observations c_{k} does not on
average degrade too much as the length n of the evidence stream
increases. This seems a reasonable condition on the empirical
distinctness of hypotheses. And Convergence Theorem 2 relies on
it.

When the possible outcomes for the sequence of observations are independent and identically distributed, Theorems 1 and 2 essentially reduce to L. J. Savage's Bayesian Convergence Theorem (Savage, pp. 52-54). Independent, identically distributed outcomes most commonly result from the repetition of identical statistical experiments (e.g., repeated tosses of a coin, or repeated measurements of quantum systems prepared in identical states). In such experiments a hypothesis will specify the same likelihoods for the same kinds of outcomes from one observation to the next. So EQI will remain constant as the number of experiments, n, increases. However, Theorems 1 and 2 are much more general. They continue to hold when the sequence of observations encompasses completely unrelated experiments that have different distributions on outcomes — experiments that have nothing in common except their connection to the hypotheses they test.

23.
In many scientific
contexts this is the best we can hope for. But it still provides a
very reasonable representation of inductive support. Let's
consider, for example, the hypothesis that the land masses of Africa
and South America separated and drifted apart over the eons, the
*drift hypothesis*, as opposed to the hypothesis that the
continents have fixed positions acquired when the earth first formed
and cooled and contracted, the *contraction hypothesis*. One
may not be able to determine anything like precise likelihoods, on
each hypothesis, that the shape of the east coast of South America
should match the shape of the west coast of Africa as closely as it in
fact does, or that the geology of the two coasts should match up so
well, or that the plant and animal species on these distant continents
should be as similar as they are. But experts may readily agree that
each of these observations is much more likely on the *drift
hypothesis* than on the *contractionist
hypothesis*. Jointly these observations should constitute very
strong evidence for *drift* over *contraction*.

Historically, the case of continental drift is more
complicated. Geologists tended to largely dismiss this evidence until
the 1960s. This was not because the evidence wasn't strong in
its own right. Rather, this evidence was found unconvincing because it
was not sufficient to overcome prior plausibility considerations that
made the *drift* hypothesis seem extremely implausible —
much less plausible than the *contraction* hypothesis. The
problem was that there seemed to be no plausible mechanism by which
*drift* might occur. It was argued, quite plausibly, that no
known force could push or pull the continents apart, and that the less
dense continental material could not push through the denser material
that makes up the ocean floor. These plausibility objections were
overcome when a plausible mechanism was articulated — i.e. the
continental crust floats atop molten material and moves apart as
convection currents in the molten material carry it along. The case
was pretty well clinched when evidence for this mechanism was found in
the form of “spreading zones” containing alternating
strips of magnetized material at regular distances from mid-ocean
ridges. The magnetic alignments of materials in these strips
correspond closely to the magnetic alignments found in magnetic
materials in dateable sedimentary layers at other locations on the
earth. These magnetic alignments indicate time periods when the
direction of the earth's magnetic field has reversed. And this gave
geologists a way of measuring the rate at which the sea floor might
spread and the continents move apart. Although geologists may not be
able to determine anything like precise values for the likelihoods of
any of this evidence on each of the alternative hypotheses, the
evidence is universally agreed to be *much* more likely on the
*drift* hypothesis than on the *contractionist*
alternative. And, with the emergence of the possibility of a plausible
mechanism, the *drift* hypothesis no longer seems so
overwhelmingly implausible *prior* to the evidence, either. So,
the *value of a likelihood ratio* may be *objective or
public enough*, even when precise values for individual
likelihoods are not available.

24. To see the point of the last clause, suppose it were violated. That is, suppose there are possible outcomes for which the likelihood ratio is very near 1 for just one of the two support functions. Then, even a very long sequence of such outcomes might leave the likelihood ratio for one support function almost equal to 1, while the likelihood ratio for the other support function goes to an extreme value. If that can happen for support functions in a class that represent likelihoods for various scientists in the community, then the empirical contents of the hypotheses is either too vague or too much in dispute for meaningful empirical evaluation to occur.

25.
Even if there are a
few directionally controversial likelihood ratios, where
P_{α} says the ratio is somewhat greater than 1 while
P_{β} assigns a value somewhat less than 1, these may
not greatly affect the trend of P_{α} and
P_{β} towards agreement on the refutation and support of
hypotheses *provided that* the controversial ratios are not so
extreme as to overwhelm the stream of other evidence on which the
likelihood ratios do directionally agree.