In the last blog, we learned how to compute and graphically interpret both main effects and interaction effects. Eventually, the statistically significant effects will be used to develop a predictive model. But how do we determine which effects are statistically significant?

Conceptually, we first develop an “error” distribution that represents the distribution of Insignificant Effects. If we have an idea of what the Insignificant Effects look like, we can determine which of the effects we compute look significant by comparison.

There are several ways to estimate the experimental error in a study, depending on the type of experiment. The three main methods are briefly described below:

- If replication is performed, we can use the variation among the replicates to estimate the experimental error. This is the cleanest and most accurate method, since we directly observe the error. Theoretically, the replicates should all have the same value, since the treatments are identical. But because we cannot control everything perfectly, and because other sources of variation exist, there will typically be variation among replicates. The details of how to use the replicates to estimate the error are beyond the scope of this blog series. A bit later we will illustrate how to test for significance using software output.
- If replicates are not used, another common approach is to assume that effects of 3rd order and higher are insignificant. Those effects may then be used to develop the distribution of Insignificant Effects. Also, during model-building, effects that are excluded because they appear insignificant are treated as error terms and help estimate the overall experimental error. This approach may not be feasible initially if the higher-order effects are confounded with lower-order effects (Main Effects and Two-Factor interactions), as is typically the case in fractional factorial studies (Note: Fractional Factorial experiments will be discussed in a future blog).
- The final method, which may be used at the outset, is a graphical approach or outlier method that distinguishes the significant effects from the insignificant ones.
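As a rough sketch of the first method, the snippet below estimates the experimental error from replicates in a hypothetical replicated 2x2 factorial. The factor names (A, B), the coded levels, and every response value are made up for illustration; they are not taken from any study in this series. It assumes an equal number of replicates per run.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical replicated 2x2 factorial (made-up numbers).
# Key: (level of A, level of B), coded -1 = low and +1 = high.
# Value: the replicate responses observed for that run.
runs = {
    (-1, -1): [20.1, 19.5, 20.4],
    (+1, -1): [25.2, 24.6, 25.5],
    (-1, +1): [21.0, 20.2, 20.6],
    (+1, +1): [30.1, 29.3, 29.9],
}

def effect(contrast):
    """Mean response where the contrast is +1 minus mean response where it is -1."""
    high = [y for levels, ys in runs.items() if contrast(levels) > 0 for y in ys]
    low = [y for levels, ys in runs.items() if contrast(levels) < 0 for y in ys]
    return mean(high) - mean(low)

effects = {
    "A": effect(lambda lv: lv[0]),
    "B": effect(lambda lv: lv[1]),
    "A*B": effect(lambda lv: lv[0] * lv[1]),
}

# Pooled error variance: average of the within-run sample variances
# (valid as a simple average because every run has the same number of replicates).
pooled_var = mean(variance(ys) for ys in runs.values())
n_total = sum(len(ys) for ys in runs.values())
error_df = n_total - len(runs)

# An effect is the difference of two means of n_total / 2 observations each,
# so its variance is 4 * sigma^2 / n_total.
std_error = sqrt(4 * pooled_var / n_total)

for name, estimate in effects.items():
    print(f"{name}: effect = {estimate:.2f}, t-ratio = {estimate / std_error:.1f}")
print(f"standard error = {std_error:.3f} on {error_df} error degrees of freedom")
```

Statistical software typically converts each t-ratio into a p-value using a t distribution with the error degrees of freedom; that step is left to the software here.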

These methods are not discussed in detail in this blog. Fortunately, most statistical software makes it relatively easy to determine which effects are statistically significant.

It’s useful to realize that deciding whether effects are statistically significant is a hypothesis test.

Since the null hypothesis is assumed unless strong evidence exists to the contrary, the default assumption is that all main and interaction effects are insignificant. A good analogy is a criminal trial in the U.S. justice system. The null hypothesis is that the accused is innocent, and the alternative hypothesis is that the accused is guilty. A conviction requires all 12 jurors to agree that the accused is guilty beyond a reasonable doubt.

Of course, we can make errors when determining whether an effect is significant or not. If we reject the null hypothesis incorrectly (that is, we conclude significance when the factor is actually not important), then we end up with a useless predictor in our model. This is called a Type I error, and we will likely discover it eventually. We control this error by establishing the significance level (α, usually 0.05 or 0.10). If the significance level is 0.05, then when we reject, we do so with 95% confidence. However, if we fail to reject the null hypothesis when we should (that is, we conclude that a real effect is insignificant), we have made a bigger error in this instance. This is called a Type II error. The probability of a Type II error depends on several factors (the size of the actual effect, the significance level, and the sample size).

When using software, we can use p-values to determine if we should reject the null hypothesis or not. Each estimated effect will have a p-value. If the p-value is less than the significance level (usually 0.05) then we reject the null hypothesis and conclude that the effect is statistically significant. Some people remember this by saying “if p is low, the null must go”.
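To see how the p-value rule relates to the two kinds of error, here is a small simulation sketch. It makes a simplifying assumption that real DOE software does not: the standard error of an effect is treated as known, so a normal (z) test is used instead of a t test. The constants `STD_ERROR`, `N_TRIALS`, and the true effect size of 2.0 are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist

ALPHA = 0.05       # significance level
STD_ERROR = 1.0    # assumed (known) standard error of an effect estimate
N_TRIALS = 20_000  # number of simulated experiments per scenario

rng = random.Random(1)  # fixed seed so the run is repeatable
std_normal = NormalDist()

def p_value(estimate, std_error=STD_ERROR):
    """Two-sided p-value for H0: true effect = 0, treating std_error as known."""
    z = abs(estimate) / std_error
    return 2 * (1 - std_normal.cdf(z))

def rejection_rate(true_effect):
    """Fraction of simulated experiments in which H0 is rejected at ALPHA."""
    rejections = 0
    for _ in range(N_TRIALS):
        estimate = rng.gauss(true_effect, STD_ERROR)
        if p_value(estimate) < ALPHA:  # "if p is low, the null must go"
            rejections += 1
    return rejections / N_TRIALS

# Null is true: every rejection is a Type I error.
type_1_rate = rejection_rate(true_effect=0.0)
# A real effect exists: every failure to reject is a Type II error.
type_2_rate = 1 - rejection_rate(true_effect=2.0)

print(f"Type I error rate  (true effect = 0): {type_1_rate:.3f}")
print(f"Type II error rate (true effect = 2): {type_2_rate:.3f}")
```

Under these assumptions the Type I error rate should land close to the chosen significance level, while the Type II error rate changes if you vary the true effect size, `ALPHA`, or `STD_ERROR` (which shrinks as the sample size grows).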

In the statistical output below for a golfing DOE, we can see that the significant factors are “Driver”, “Course Length”, and the “Driver*Course Length” 2-factor interaction. All the other effects have p-values greater than 0.05 so they are not statistically significant.

Note that some effects may have borderline p-values. It can be useful to remove the least significant terms from the model first, because the p-values for the remaining terms may then change. As terms are removed, the estimate of the experimental error changes, and the error estimate gains degrees of freedom.
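The sketch below illustrates why this happens, assuming a hypothetical unreplicated two-level design with 8 runs and three factors. The effect estimates are made up for illustration. In such a design each effect carries a sum of squares of N × effect² / 4 on one degree of freedom, and dropping a term moves that sum of squares into the error estimate.

```python
from math import sqrt

N_RUNS = 8  # hypothetical unreplicated 2^3 design

# Hypothetical effect estimates (made-up numbers).
effects = {
    "A": 6.0, "B": -4.5, "A*B": 3.0, "C": 0.8,
    "A*C": -0.5, "B*C": 0.4, "A*B*C": 0.3,
}

def error_estimate(dropped):
    """Pool the dropped effects into error.

    Returns (standard error of an effect, error degrees of freedom).
    """
    # Each two-level effect has sum of squares N * effect^2 / 4 on 1 df.
    ss_error = sum(N_RUNS * effects[name] ** 2 / 4 for name in dropped)
    df_error = len(dropped)
    ms_error = ss_error / df_error
    return sqrt(4 * ms_error / N_RUNS), df_error

# Remove terms step by step, least significant (smallest) first.
steps = [
    ["A*B*C"],
    ["A*B*C", "B*C", "A*C"],
    ["A*B*C", "B*C", "A*C", "C"],
]
for dropped in steps:
    std_error, df_error = error_estimate(dropped)
    t_ratio_a = effects["A"] / std_error
    print(f"dropped {len(dropped)} term(s): std error = {std_error:.3f}, "
          f"error df = {df_error}, t-ratio for A = {t_ratio_a:.1f}")
```

The effect estimates themselves stay the same at every step because the design is orthogonal; only the error estimate and its degrees of freedom move, and those are what shift the p-values reported by the software.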

Some software programs also indicate the significance of factors using graphical methods such as a Pareto chart or probability plot. These are illustrated below.

In the next blog, we will progress to developing a predictive model using the significant effects.