Diagnostics Report

This section contains descriptions of each case statistic. The values in this report table are used to produce the diagnostics graphs.

Run Order: The randomized order for the experiments.

Actual Value: The measured response data for this particular run, \(y_i\).

Predicted Value: The value predicted from the model, generated using the prediction equation. It includes block and center-point corrections when they are part of the design.

\(\hat{Y} = X\hat{\beta}\)

Residual: The difference between the Actual and Predicted values for each point.

\(e = Y - \hat{Y}\)
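
As an illustrative sketch of these two formulas (not part of the report output), the following Python (NumPy) snippet fits a model by least squares and computes the predicted values and residuals. The two-factor design and response values are made up for illustration, and block and center-point corrections are not included.

    import numpy as np

    # Hypothetical two-factor design with two center points (illustrative data):
    # n = 6 runs, 3 model coefficients including the intercept.
    X = np.array([[1, -1, -1], [1, 1, -1], [1, -1, 1],
                  [1, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
    Y = np.array([52.0, 61.0, 58.0, 70.0, 59.0, 60.0])  # actual values, one per run

    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]  # least-squares coefficients
    Y_hat = X @ beta_hat                             # predicted values
    e = Y - Y_hat                                    # residuals
    print(np.column_stack([Y, Y_hat, e]))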

Leverage: The leverage of a point varies from 0 to 1 and indicates how much an individual design point influences the model’s predicted values. A leverage of 1 means the predicted value at that particular case will exactly equal the observed value of the experiment, i.e., the residual will be 0. The sum of the leverage values across all cases equals the number of coefficients (including the constant) fit by the model. The maximum leverage an experiment can have is 1/k, where k is the number of times the experiment is replicated.

\(H = X(X^T X)^{-1}X^T\)

\(Leverage = diag(H)\)

where X is the model matrix, with one row for each run in the design (n) and one column for each term in the model (p). H is therefore an \(n \times n\) symmetric matrix, often called the hat matrix, and its diagonal elements are the leverages. Leverage represents the fraction of the error variance, associated with the point estimate, carried into the model. A leverage of 1 means that any error (experimental, measurement, etc.) associated with an observation is carried into the model and included in the prediction.
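
A minimal sketch of the hat matrix and leverage computation, reusing the same hypothetical design from the earlier snippet:

    import numpy as np

    X = np.array([[1, -1, -1], [1, 1, -1], [1, -1, 1],
                  [1, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)  # n = 6 runs

    H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix, n x n and symmetric
    leverage = np.diag(H)                 # the diagonal elements are the leverages

    print(leverage)        # each center point gets 1/6, below the 1/2 bound for k = 2 replicates
    print(leverage.sum())  # 3.0, the number of coefficients in the model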

Internally Studentized Residual: The residual divided by the estimated standard deviation (Std Dev) of that residual. It measures the number of standard deviations separating the actual and predicted values.

\(r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - Leverage_i}}\)
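
A sketch of this calculation, with \(\hat{\sigma}^2\) taken as the residual mean square of the same hypothetical fit:

    import numpy as np

    X = np.array([[1, -1, -1], [1, 1, -1], [1, -1, 1],
                  [1, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
    Y = np.array([52.0, 61.0, 58.0, 70.0, 59.0, 60.0])

    n, p = X.shape
    e = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]      # raw residuals
    leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    sigma2 = (e @ e) / (n - p)                            # residual mean square
    r = e / np.sqrt(sigma2 * (1.0 - leverage))            # internally studentized residuals
    print(r)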

Externally Studentized Residual (a.k.a. Outlier t-value, RStudent): Calculated by leaving each run out of the analysis, one at a time, and estimating the response from the remaining runs. (See Weisberg page 115.) The t-value is the number of standard deviations between this predicted value and the actual response. It tests whether the run in question follows the model with coefficients estimated from the rest of the runs, that is, whether this run is consistent with the rest of the data for this model. Runs with large t-values should be investigated.

\(\hat{\sigma}^2_{(-i)} = \frac{(n - p)\cdot{\hat{\sigma}^2} - {\frac{e^2_i}{(1 - Leverage_i)}}}{n - p - 1}\)

\(t_i = \frac{e_i}{\sqrt{{\hat{\sigma}^2_{(-i)}}(1 - Leverage_i)}}\)

where n is the total number of runs and p is the number of terms in the model, including the intercept. The deletion variance can also be computed by brute-force model fitting, leaving each run out one at a time.
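
The deletion variance and t-value can be sketched directly from the formulas above, using the same hypothetical design; a brute-force check would instead refit the model n times, once with each run removed.

    import numpy as np

    X = np.array([[1, -1, -1], [1, 1, -1], [1, -1, 1],
                  [1, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
    Y = np.array([52.0, 61.0, 58.0, 70.0, 59.0, 60.0])

    n, p = X.shape
    e = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    sigma2 = (e @ e) / (n - p)

    # Deletion variance: the full-model variance with run i's contribution removed
    sigma2_del = ((n - p) * sigma2 - e**2 / (1.0 - h)) / (n - p - 1)
    t = e / np.sqrt(sigma2_del * (1.0 - h))  # externally studentized residuals
    print(t)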

DFFITS: Measures the influence the ith observation has on the predicted value. (See Myers page 284.) It is the studentized difference between the predicted value with observation i and the predicted value without observation i:

\(\hat{Y}_{(-i)} = Y - \frac{e}{1 - Leverage}\)

\(DFFITS = \frac{\hat{Y} - \hat{Y}_{(-i)}}{\sqrt{\hat{\sigma}^2_{(-i)} \cdot Leverage}}\)
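
A sketch of both formulas, continuing the same hypothetical example:

    import numpy as np

    X = np.array([[1, -1, -1], [1, 1, -1], [1, -1, 1],
                  [1, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
    Y = np.array([52.0, 61.0, 58.0, 70.0, 59.0, 60.0])

    n, p = X.shape
    Y_hat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - Y_hat
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    # Deletion variance; note e @ e equals (n - p) * sigma^2-hat
    sigma2_del = ((e @ e) - e**2 / (1.0 - h)) / (n - p - 1)

    Y_hat_del = Y - e / (1.0 - h)  # prediction for run i with run i left out
    dffits = (Y_hat - Y_hat_del) / np.sqrt(sigma2_del * h)
    print(dffits)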

DFBETAS: Not shown on the report but present on the diagnostics graphs. This statistic is calculated for each coefficient at each run; the influence tool has a pull-down menu to pick which term’s graph is shown. It measures the influence the ith observation has on each regression coefficient. (See Myers page 284.) \(DFBETAS_{j,i}\) is the number of standard errors that the jth coefficient changes if the ith observation is removed.

\(DFBETAS_{j, (-i)} = \frac{\hat{\beta}_j - \hat{\beta}_{j,(-i)}}{\sqrt{\hat{\sigma}^2_{(-i)} \cdot (X^T X)^{-1}_{jj}}}\)

A large \(DFBETAS_{j,(-i)}\) value indicates that the ith observation has extra influence on the jth regression coefficient.
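
A sketch that computes DFBETAS by brute force, refitting the model with each run deleted (same hypothetical design; the scaling follows the formula above):

    import numpy as np

    X = np.array([[1, -1, -1], [1, 1, -1], [1, -1, 1],
                  [1, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
    Y = np.array([52.0, 61.0, 58.0, 70.0, 59.0, 60.0])

    n, p = X.shape
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ beta
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    sigma2_del = ((e @ e) - e**2 / (1.0 - h)) / (n - p - 1)  # deletion variance
    C_jj = np.diag(np.linalg.inv(X.T @ X))                   # (X'X)^-1 diagonal

    dfbetas = np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i
        beta_del = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]  # refit without run i
        dfbetas[i] = (beta - beta_del) / np.sqrt(sigma2_del[i] * C_jj)
    print(dfbetas)  # entry (i, j): standard errors coefficient j moves when run i is dropped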

Cook’s Distance: A measure of how much the regression would change if the case were omitted from the analysis. Relatively large values are associated with cases that have high leverage and large studentized residuals. Cases with large \(D_i\) values relative to the other cases should be investigated; they could be caused by recording errors, an incorrect model, or a design point far from the remaining cases.

Cook’s distance \((D_i)\) is a product of the square of the ith internally studentized residual and a monotonic function of the leverage:

\(D_i = \frac{r^2_i}{p + 1}\left(\frac{Leverage_i}{1 - Leverage_i}\right)\)

where p is the number of model terms, not counting the intercept, so p + 1 coefficients are estimated.

A large value of \(D_i\) may be due to a large \(r_i\), large leverage, or both.

Cook’s distance can be thought of as the average squared difference between the predictions from the full data set and those from a reduced data set (with the ith observation deleted), scaled by the mean squared error of the fitted model. An equivalent interpretation of \(D_i\) is as a standardized, weighted distance between the vector of regression coefficients obtained from the full model and the vector obtained after deleting the ith case. If \(D_i\) is substantially less than 1, deleting the ith case will not change the estimates of the regression coefficients very much.

In a perfectly balanced orthogonal array, Cook’s distance and the externally studentized residual are directly related and thus give the same information. In general regression problems there can be considerable differences in the information contained in the two statistics, in other words, different runs may be identified for investigation.
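
A sketch of the Cook’s distance calculation on the same hypothetical design; here n_coef is the total number of estimated coefficients, which corresponds to p + 1 in the formula above:

    import numpy as np

    X = np.array([[1, -1, -1], [1, 1, -1], [1, -1, 1],
                  [1, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
    Y = np.array([52.0, 61.0, 58.0, 70.0, 59.0, 60.0])

    n, n_coef = X.shape                   # n_coef = p + 1 (model terms plus intercept)
    e = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    sigma2 = (e @ e) / (n - n_coef)
    r = e / np.sqrt(sigma2 * (1.0 - h))   # internally studentized residuals

    D = (r**2 / n_coef) * (h / (1.0 - h))  # Cook's distance
    print(D)  # runs with D_i large relative to the others warrant investigation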

Standard Order: A conventional “textbook” ordering of the array of low and high factor levels.

For further reading:

  • Christensen, Pearson, and Johnson. Case-deletion diagnostics for mixed models. Technometrics, 34(1):38–45, 1992.

  • Raymond H. Myers. Classical and Modern Regression with Applications. Duxbury Press, 1986.

  • Sanford Weisberg. Applied Linear Regression. John Wiley & Sons, Inc., 1985.