Statistical Power

In statistical hypothesis testing, we use data to decide whether to reject a statistical hypothesis (usually stated as a null hypothesis) in favor of an alternative hypothesis. To illustrate, we will take the example of the binomial distribution for 2 samples, each of n trials, and take as our null hypothesis that the data come from a common binomial distribution, that is

$H_0: p_1 = p_2$.

For a null hypothesis to be useful, there must be 1 or more plausible alternatives to compare it to; to keep things simple we will propose that under the alternative hypothesis $H_a: p_2 = p_1 + \delta$, that is, that the populations differ by some specified amount $\delta$.  The approach is then to calculate a test statistic from the data (e.g., from a t-test) and compare its value to the statistic's distribution assuming that the null hypothesis is true.  We then decide to “reject” the null hypothesis if the test result exceeds some critical value based on the null, and “accept” the null otherwise.  Clearly, since we don’t know whether the hypothesis is true, and since our test result is only one random, and possibly aberrant, sample outcome, we may be mistaken in our decision.  If we reject the null hypothesis when it is actually true, we have committed a Type I error.  If we accept the null when it is false, we have committed a Type II error. Otherwise, we have made the correct decision.

 

 

The probability of making a Type I error is designated as $\alpha$, and is defined more-or-less formally as the probability of rejection given that the null is true, that is

$\alpha = P(T > c \mid H_0)$

where T is the test statistic value and c is the critical value.  Conventionally, c is chosen to keep $\alpha$ at or below some fixed but fairly arbitrary value, e.g. 0.01, 0.05, 0.10, etc.  By comparison, the probability of making a Type II error is designated as $\beta$ and is defined as the probability of accepting the null given that the alternative is true, or

$\beta = P(T \le c \mid H_a)$

Statistical power, or simply power, is defined as the complement of Type II error, or

$\text{power} = 1 - \beta = P(T > c \mid H_a)$

Power depends on several things. 

First of all, because it depends on $H_a$ being true, it depends on some notion of an actual difference under the alternative hypothesis, for example an experimental effect the investigator thinks is likely to occur and is trying to detect with the test. For example, the experiment might involve assigning 10 animals to a treatment and 10 animals to a control and observing the proportion p that die in each.  Under $H_0$ we expect these to be equal, and under $H_a$ we might expect a specific increase in mortality, say 0.10.  So in this example, writing $p_1$ for control mortality and $p_2$ for treatment mortality, we would have $H_0: p_1 = p_2$ and $H_a: p_2 = p_1 + 0.10$, with $n = 10$ animals per group.

Second, power depends on the critical value chosen for rejection.  Since this is usually based on trying to control Type I error, power will therefore depend on $\alpha$.  If $\alpha$ is made very low, then it will take a very large critical value to reject, and this will tend to lower the power of the test. If $\alpha$ is allowed to be higher, then rejection will occur at lower critical values, resulting in a higher test power.  The value of $\alpha$ actually determines the lower limit that power can be: if the null hypothesis really is true, you will still reject it $\alpha \times 100$% of the time, because of Type I error.
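To make the role of the rejection threshold concrete, the following Python sketch computes rejection probabilities exactly for an experiment with 10 animals per group, by summing binomial probabilities over every possible pair of outcomes. The pooled one-sided z statistic, the baseline mortality of 0.2, and all function names are assumptions chosen for illustration; the text does not specify them. Because the counts are discrete, the rejection rate when the null is true only approximates the nominal Type I error rate.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """Probability of x deaths among n animals, each dying with probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def z_stat(x1, x2, n):
    """Pooled two-sample z statistic for H0: p1 = p2 (equal group sizes n)."""
    pooled = (x1 + x2) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 0.0  # no deaths, or all deaths, in both groups: no evidence
    return (x2 - x1) / n / se


def rejection_prob(p1, p2, n, alpha):
    """P(T > c), summed exactly over every possible pair of outcomes."""
    c = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value (assumed)
    total = 0.0
    for x1 in range(n + 1):
        for x2 in range(n + 1):
            if z_stat(x1, x2, n) > c:
                total += binom_pmf(x1, n, p1) * binom_pmf(x2, n, p2)
    return total


# Assumed baseline mortality 0.2; alternative adds 0.10 in the treatment group.
for alpha in (0.01, 0.05, 0.10):
    size = rejection_prob(0.2, 0.2, 10, alpha)   # null hypothesis true
    power = rejection_prob(0.2, 0.3, 10, alpha)  # alternative true
    print(f"alpha={alpha:.2f}  P(reject | H0)={size:.3f}  power={power:.3f}")
```

Running the loop shows both columns rising as the threshold is relaxed, with the power column always above the rejection rate under the null.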

The third “ingredient” in power is the amount of variability that occurs in the test statistic.  If we have a very precise experiment, we will tend to see a very “tight” distribution for the test statistic, and this will make it much easier to reject the null if it is really false.  The variance of the test statistic is a function of (1) experimental error (sometimes reducible, sometimes not), and (2) sample size. All other things being equal, the main thing the experimenter has under his or her control is sample size.  As you recall from our earlier manipulations of 2-sample  binomial distributions, increasing n makes it much easier to determine (from samples) that 2 parameters really are different.  Now we are going to apply this idea a little more formally, to the problem of choosing sample size for a binomial experiment.
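The effect of sample size can be sketched with a normal approximation to the power of a one-sided two-sample test. This is an illustrative approximation, not necessarily the calculation the applet performs; the function name, the one-sided form, and the example proportions (0.2 versus 0.3) are assumptions.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(p1, p2, n, alpha=0.05):
    """Normal-approximation power of a one-sided test of p2 > p1, n per group."""
    pbar = (p1 + p2) / 2
    se_null = sqrt(2 * pbar * (1 - pbar) / n)           # SE assumed under H0
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)  # SE under the alternative
    c = STD_NORMAL.inv_cdf(1 - alpha)                   # critical z value
    return STD_NORMAL.cdf(((p2 - p1) - c * se_null) / se_alt)


# Hypothetical example: mortality 0.2 in the control, 0.3 in the treatment.
for n in (10, 50, 100, 200, 400):
    print(f"n={n:4d}  approx. power={approx_power(0.2, 0.3, n):.3f}")
```

Both standard errors shrink in proportion to 1 over the square root of n, so the same difference in proportions sits further from the critical value as n grows. When the two proportions are equal, the formula returns the Type I error rate itself, which is the lower limit on power.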

 

Applet Exercises


The binomial power applet allows you to experiment with these ideas for a simple experiment involving 2 binomial samples.  The 2 sliders at the top allow you to manipulate the binomial p values.  The bottom slider allows you to select the value of $\alpha$. Once the binomial p values and $\alpha$ are specified, you can pick a desired power and look up the needed sample size, or conversely look up the power you would obtain for a given sample size.  Clicking on the graph at the appropriate spot will display these values at the top of the screen.

Set one of the p values and vary the other one, from close to the first one (the closest allowed is a difference of 0.01) to very big differences.  What happens to the sample size needed to achieve a decent (say 0.8 or higher) power?  Next, try $\alpha$ values in the range of 0.01 to 0.5; what happens to power and sample size?
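For readers working without the applet, the exercise can be approximated with a short Python sketch that inverts the usual normal-approximation power formula to get a per-group sample size. The one-sided test, the baseline value of 0.20, and the function name are assumptions for illustration; the applet's own calculation may differ.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group n for a one-sided two-sample binomial test."""
    z = NormalDist().inv_cdf
    pbar = (p1 + p2) / 2
    a = sqrt(2 * pbar * (1 - pbar))           # SD term under the null
    b = sqrt(p1 * (1 - p1) + p2 * (1 - p2))   # SD term under the alternative
    return ceil(((z(1 - alpha) * a + z(power) * b) / (p2 - p1)) ** 2)


p1 = 0.20  # assumed baseline proportion
for diff in (0.01, 0.05, 0.10, 0.20, 0.40):
    n = n_per_group(p1, p1 + diff)
    print(f"p1={p1:.2f}  p2={p1 + diff:.2f}  n per group = {n}")
```

Because the difference enters the formula squared in the denominator, halving the difference roughly quadruples the required sample size. Passing other values for alpha or power shows how those choices trade off against n.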

So, what should Type I error ($\alpha$) and power or Type II error ($\beta$) be?  The answer is: “it depends.”  The reality is that you cannot eliminate either error, and focusing on one to the exclusion of the other will drive the other through the roof.  Most statistics texts emphasize Type I error, but there are times when Type II error is actually much more serious. For example, before marketing a drug, a manufacturer must determine (1) if it works, and (2) if it’s safe.  In each case, you would administer the drug to a treatment group and a placebo to a control group, and observe (1) if more treatment subjects than control subjects improved from an ailment common to each, and (2) if more treatment subjects than control subjects (in groups of healthy subjects) suffered side effects (such as mortality). For the first question, committing a Type I error (you say it works when it doesn’t) leads to the marketing of an ineffective drug, whereas a Type II error means you might not market a drug that you could have; therefore, Type I error should probably be kept quite low.  For the second question, committing a Type II error (claiming the drug is safe when it is not) results in marketing a drug that is unsafe, lawsuits, and bankruptcy, so you probably want to minimize Type II error and may not care at all about Type I error. Similar problems arise in trying to decide if proposed activities will have adverse environmental impacts, if declines in endangered species populations are “significant”, etc.  In all these cases, insistence on arbitrary (often very low) $\alpha$ levels necessarily inflicts high Type II error rates, and conversely.  This has led many applied statisticians to “reject” the idea of hypothesis testing altogether, in favor of (1) estimation of effect sizes, including confidence intervals (see the Anderson et al. 2000 paper), and (2) incorporation of statistical results into optimal decision making methods.
We still see a role for conventional hypothesis testing when designed experiments are used to test meaningful null hypotheses, but agree with Anderson et al. (2000) that hypothesis testing and “significance” reporting have been severely abused in the ecological literature.