In statistical hypothesis testing, we use data to decide whether to reject a statistical hypothesis (usually stated as a null hypothesis) in favor of an alternative hypothesis. To illustrate, we will take the example of the binomial distribution for 2 samples, each of n trials, and take as our null hypothesis that the data come from a common binomial distribution, that is, $H_0: p_1 = p_2$. For a null hypothesis to be useful, there must be 1 or more plausible alternatives to compare it to; to keep things simple, we will propose that under the alternative hypothesis $H_A: p_2 = p_1 + \delta$, that is, that the populations differ by some specified amount $\delta$. The approach is then to calculate a test statistic from the data
(e.g., a t statistic) and compare its value to the statistic’s distribution assuming that the null hypothesis is true. We then decide to “reject” the null hypothesis if the test statistic exceeds some critical value based on the null, and “accept” the null otherwise. Clearly, since we do not know whether the hypothesis is true, and since our test result is only one random, and possibly aberrant, sample outcome, we may be mistaken in our decision. If we reject the null hypothesis when it is actually true, we have committed a Type I error. If we accept the null when it is false, we have committed a Type II error. Otherwise, we have made the correct decision.
The probability of making a Type I error is designated as $\alpha$, and is defined more-or-less formally as the probability of rejection given that the null is true, that is, $\alpha = P(T > c \mid H_0)$, where T is the test statistic value and c is the critical value. Conventionally, c is chosen to keep $\alpha$ at or below some fixed but fairly arbitrary value, e.g., 0.01, 0.05, or 0.10. By comparison, the probability of making a Type II error is designated as $\beta$ and is defined as the probability of accepting the null given that the alternative is true, or $\beta = P(T \le c \mid H_A)$. Statistical power, or simply power, is defined as the complement of the Type II error probability, or $1 - \beta = P(T > c \mid H_A)$.
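
The definitions above can be turned into a small numerical sketch. The code below is only an illustration under stated assumptions: it assumes a one-sided test of equal binomial proportions against the alternative that the second proportion is larger, and a normal approximation to the pooled two-sample z statistic. The function name `error_rates` and the example values (0.2, 0.3, n = 100) are hypothetical choices, not part of the original exercise.

```python
from math import sqrt
from statistics import NormalDist


def error_rates(p1, p2, n, alpha=0.05):
    """Critical value, Type II error, and power for a one-sided z-test.

    Assumes H0: p1 = p2 vs HA: p2 > p1, n trials per sample, and a normal
    approximation to the pooled two-sample z statistic T (reject when T > c).
    """
    z = NormalDist()
    c = z.inv_cdf(1 - alpha)  # chosen so that P(T > c | H0) = alpha
    p_bar = (p1 + p2) / 2
    # Standard error of the observed difference as used in T (pooled form)
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n)
    # Standard error of the observed difference when HA is true
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    # Power = P(T > c | HA); beta is its complement
    power = 1 - z.cdf((c * se_null - (p2 - p1)) / se_alt)
    return {"c": c, "alpha": alpha, "beta": 1 - power, "power": power}


rates = error_rates(p1=0.2, p2=0.3, n=100)
print(rates)
```

Note that the Type II error and power always sum to 1, while the critical value depends only on the chosen Type I error rate.
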
Power depends on several things. First of all, because it depends on $H_A$ being true, it depends on some notion of an actual difference under the alternative hypothesis, for example an experimental effect the investigator thinks is likely to occur and is trying to detect with the test. For example, the experiment might involve assigning 10 animals to a treatment and 10 animals to a control and observing the proportion p that die in each. Under $H_0$ we expect these to be equal, and under $H_A$ we might expect a specific increase in mortality, say 0.10. So in this example we would have $H_0: p_1 = p_2$ and $H_A: p_2 = p_1 + 0.10$, where $p_1$ is the control mortality and $p_2$ is the treatment mortality.
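
To get a feel for how much power a 10-versus-10 experiment has, we can enumerate every possible pair of outcomes exactly. This is a rough sketch, not the applet's method: it assumes a control mortality of 0.20 (a hypothetical value; the text specifies only the 0.10 increase), a one-sided pooled z statistic, and a Type I error target of 0.05. The helper names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_pmf(k, n, p):
    """Probability of exactly k deaths among n animals with mortality p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def rejects(x1, x2, n, c):
    """Pooled z statistic for H0: p1 = p2 vs HA: p2 > p1; reject when T > c."""
    pooled = (x1 + x2) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:  # all animals died or all survived: no evidence of a difference
        return False
    return (x2 - x1) / n / se > c


def rejection_probability(p1, p2, n, alpha=0.05):
    """Exact probability of rejecting H0, summed over all outcomes (x1, x2)."""
    c = NormalDist().inv_cdf(1 - alpha)
    return sum(
        binom_pmf(x1, n, p1) * binom_pmf(x2, n, p2)
        for x1 in range(n + 1)
        for x2 in range(n + 1)
        if rejects(x1, x2, n, c)
    )


# Assumed control mortality 0.20 (hypothetical); treatment is 0.10 higher.
size = rejection_probability(0.20, 0.20, n=10)   # actual Type I error rate
power = rejection_probability(0.20, 0.30, n=10)  # power under the alternative
print(f"actual Type I error: {size:.3f}, power: {power:.3f}")
```

Under these assumptions the power comes out far below 0.8, which suggests that 10 animals per group is much too few to detect a 0.10 increase in mortality.
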
Second, power depends on the critical value chosen for rejection. Since this is usually based on trying to control Type I error, power will therefore depend on $\alpha$. If $\alpha$ is made very low, then it will take a very large critical value to reject, and this will tend to lower the power of the test. If $\alpha$ is allowed to be higher, then rejection will occur at lower critical values, resulting in a higher test power. The value of $\alpha$ actually determines the lower limit that power can be: if the null hypothesis really is true, you will still reject it $\alpha \times 100\%$ of the time, because of Type I error.
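
The trade-off between the Type I error rate and power can be checked numerically. The sketch below assumes a one-sided pooled z-test with a normal approximation; the function `approx_power` and the proportions 0.2 and 0.3 with n = 100 are hypothetical choices made only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p1, p2, n, alpha=0.05):
    """Approximate power of a one-sided pooled z-test of H0: p1 = p2
    against HA: p2 > p1, with n trials per sample (normal approximation)."""
    z = NormalDist()
    c = z.inv_cdf(1 - alpha)  # a lower alpha means a larger critical value
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    return 1 - z.cdf((c * se_null - (p2 - p1)) / se_alt)


alphas = [0.01, 0.05, 0.10, 0.20]
powers = [approx_power(0.2, 0.3, n=100, alpha=a) for a in alphas]
for a, pw in zip(alphas, powers):
    print(f"alpha = {a:.2f}  power = {pw:.3f}")

# With no true difference, the rejection rate falls back to alpha itself.
floor = approx_power(0.2, 0.2, n=100, alpha=0.05)
print(f"rejection rate when H0 is true: {floor:.3f}")
```

Power rises as the Type I error rate is relaxed, and when the two proportions are equal the rejection rate equals the Type I error rate, which is the lower limit described above.
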
The
third “ingredient” in power is the amount of variability that occurs
in the test statistic. If we have a very precise experiment, we will tend to see a very “tight” distribution for the test statistic, and this will make it much easier to reject the null if it is really false. The variance of the test statistic is a function of (1) experimental error (sometimes reducible, sometimes not), and (2) sample size. All other things being equal, the main thing the experimenter has under his or her control is sample size. As you recall from our earlier manipulations of 2-sample binomial distributions, increasing n makes it much easier to determine (from samples) that 2 parameters really are different. Now we are going to apply this idea a little more formally, to the problem of choosing sample size for a binomial experiment.
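
The effect of sample size on power can be sketched with the same kind of calculation. As before, this assumes a one-sided pooled z-test under a normal approximation, and the proportions 0.2 and 0.3 are hypothetical example values rather than anything prescribed by the exercise.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p1, p2, n, alpha=0.05):
    """Approximate power of a one-sided pooled z-test of H0: p1 = p2
    against HA: p2 > p1, with n trials per sample (normal approximation)."""
    z = NormalDist()
    c = z.inv_cdf(1 - alpha)
    p_bar = (p1 + p2) / 2
    # Both standard errors shrink like 1/sqrt(n), tightening the distribution
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    return 1 - z.cdf((c * se_null - (p2 - p1)) / se_alt)


sizes = [10, 25, 50, 100, 200, 400]
powers = [approx_power(0.2, 0.3, n) for n in sizes]
for n, pw in zip(sizes, powers):
    print(f"n = {n:4d}  power = {pw:.3f}")
```

Because the standard error of the observed difference shrinks as n grows, the same true difference of 0.10 becomes steadily easier to detect.
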
Applet Exercises
The binomial power applet allows you to experiment with these ideas for a simple experiment involving 2 binomial samples. The 2 sliders at the top allow one to manipulate the binomial p values. The bottom slider allows one to select the value of $\alpha$. Once the binomial p values and $\alpha$ are specified, you can pick a desired power and look up the needed sample size; or, conversely, look up the power you would obtain for a given sample size. Clicking on the graph at the appropriate spot will display these values at the top of the screen.
Set one of the p values and vary the other one, from close to the first one (the closest allowed is a difference of 0.01) to very big differences. What happens to the sample size needed to achieve a decent (say 0.8 or higher) power? Now try $\alpha$ values in the range of 0.01 to 0.5; what happens to power and sample size?
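
If the applet is not available, a similar exploration can be sketched in code. The example below uses the standard normal-approximation sample size formula for comparing two proportions with a one-sided test; it is not necessarily the calculation the applet performs. The function `sample_size`, the fixed first proportion of 0.2, and the list of differences are all hypothetical choices.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size(p1, p2, power=0.8, alpha=0.05):
    """Approximate n per sample for a one-sided pooled z-test of
    H0: p1 = p2 vs HA: p2 > p1 to reach the requested power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # critical value for the chosen alpha
    z_power = z.inv_cdf(power)      # quantile matching the desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil((numerator / (p2 - p1)) ** 2)


p1 = 0.2  # assumed fixed first proportion (hypothetical)
differences = [0.01, 0.05, 0.10, 0.20, 0.40]
needed = [sample_size(p1, p1 + d) for d in differences]
for d, n in zip(differences, needed):
    print(f"difference = {d:.2f}  n per sample = {n}")
```

The needed sample size grows roughly with the inverse square of the difference, so very small differences require very large samples. Changing the `alpha` argument shows the second part of the exercise: a larger Type I error rate lowers the sample size needed for the same power.
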
So, what should Type I error ($\alpha$) and power or Type II error ($\beta$) be? The answer is: “it depends.” The reality is that you cannot eliminate either error, and focusing on one to the exclusion of the other will drive the other through the roof. Most statistics texts emphasize Type I error, but there are times when Type II error is actually
much more serious. For example, before marketing a drug, a manufacturer
must determine (1) if it works, and (2) if it is safe. In each case, you would administer the drug to a treatment group and a placebo to a control group, and observe (1) whether more treatment subjects than control subjects improved from an ailment common to each, and (2) whether more treatment subjects than control subjects (in groups of healthy subjects) suffered side effects (such as mortality). For the first question, committing a Type I error (you say it works when it doesn’t) leads to the marketing of an ineffective drug, whereas a Type II error means you might not market a drug that you could have; therefore, Type I error should probably be kept quite low. For
the second question, committing a Type II error (claiming the drug is safe when it is not) results in marketing a drug that is unsafe, lawsuits, and bankruptcy, so you probably want to minimize Type II error and may not care at all about Type I error. Similar problems arise in trying to decide if proposed activities will have adverse environmental impacts, if declines in endangered species populations are “significant”, etc. In all these cases, insistence on arbitrary (often very low) $\alpha$ levels necessarily inflicts high Type II error rates, and conversely. This has led many applied statisticians to “reject” the idea of hypothesis testing altogether, in favor of (1) estimation of effect sizes, including confidence intervals (see the Anderson et al. 2000 paper), and (2) incorporation of statistical results into optimal decision-making methods. We still see a role for conventional hypothesis testing when designed experiments are used to test meaningful null hypotheses, but agree with Anderson et al. (2000) that hypothesis testing and “significance” reporting have been severely abused in the ecological literature.