Response Analysis Configuration

Transformation

Transformation of the response is an important component of any data analysis. Transformation is needed if the errors (residuals) are a function of the magnitude of the response (the predicted values). Stat-Ease 360 provides extensive diagnostic capabilities to check whether the statistical assumptions underlying the data analysis are met. The normal plot of the residuals tests their normality. The residuals versus predicted response values plot will indicate a problem if a pattern exists. Unless the ratio of the maximum response to the minimum response is large, transforming the response will not make much difference.

The Box-Cox plot on the Diagnostics button will provide a recommended transformation from the power family. The two non-power transformations, the logit for bounded data and the arcsine square root for proportions, must be applied based on the type of response; the Box-Cox plot will often recommend a square-root transformation when the data are proportions, and the log transformation when the data are bounded.

Stat-Ease 360 provides a broad range of possible transformations - most are from the power family, plus there are two additional transformations, the logit and the arcsine square root.

Most data transformations can be described by the power function, \(\sigma = f(\mu^{\alpha})\), where \(\sigma\) is the standard deviation associated with an observation and \(\mu\) is its mean. If the standard deviation is proportional to the mean raised to the \(\alpha\) power, then transforming the observation by the \(1-\alpha\) power gives a scale satisfying the equal variance requirement of the statistical model.

The appropriate choice of a response transformation relies on subject-matter knowledge and/or statistical considerations. The available transformations and examples for their use are:

Power Law/Standard

Square Root – count, frequency data

Natural log – variance or growth data

Base 10 log – variance or growth data

Inverse square root

Inverse – rate/time, decay rate

Power – for more extreme transformation needs

The power transformation allows transformation to any power in the range –3 to +3, provided the data are positive. You may add a constant to the data to avoid powers of negative numbers. If the standard deviation associated with an observation is proportional to the mean raised to some power, then transforming the observation by a power gives a scale satisfying the equal variance requirement of the ANOVA. The Box-Cox plot is provided in the Diagnostics plots to help you choose an appropriate power transformation.
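As a rough sketch of how a power transformation might be chosen outside the software, the snippet below uses SciPy's boxcox routine to estimate a variance-stabilizing exponent for a strictly positive response; the example data are placeholders, not output from Stat-Ease 360.

    import numpy as np
    from scipy import stats

    # Hypothetical positive-valued response; Box-Cox requires y > 0,
    # so add a constant to the data first if any values are non-positive.
    y = np.array([1.2, 3.5, 2.1, 8.9, 4.4, 15.2, 6.7, 11.3])

    # boxcox returns the transformed data and the maximum-likelihood lambda.
    y_transformed, lam = stats.boxcox(y)
    print(f"Estimated lambda: {lam:.2f}")

    # A lambda near 0.5 suggests a square root, near 0 a log,
    # and near -1 an inverse transformation.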

Special Cases

Logit

The logit transformation is used when the response has lower and upper physical limits that are approached but never quite reached. One example is the yield of a chemical reaction: the physical bounds are 0% and 100%, but in practice the actual yields will not quite reach 100% due to impurities, energy loss, etc. The logit transform spreads out the values near the boundaries. When using this transformation, it is very important to set the lower and upper limits to the natural limits of the response.

\[\log_{e}\left[\frac{Y\: -\: \text{lower limit of } Y}{\text{upper limit of } Y\: -\: Y}\right]\]
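A minimal sketch of the logit transform, assuming hypothetical limits of 0 and 100 for a yield response (the limits and data below are illustrative only):

    import numpy as np

    lower, upper = 0.0, 100.0                 # assumed natural limits of the response
    y = np.array([12.5, 55.0, 88.0, 97.5])    # example yields strictly inside the limits

    # Logit transform: spreads out values near the lower and upper bounds.
    y_logit = np.log((y - lower) / (upper - y))

    # Back-transform a predicted value on the logit scale to the original units.
    def inverse_logit(z, lower=lower, upper=upper):
        return lower + (upper - lower) / (1.0 + np.exp(-z))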

Arcsine square root

The arcsine square root transformation should be used for proportion data. A proportion is a fraction between 0 and 1, inclusive. The assumption is that a batch of size “n” is generated by the settings of each run; each individual member of the batch has a binomial outcome, either passing or failing a specified criterion.

\[\arcsin\left(\sqrt{Y}\right)\]
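A minimal sketch of the arcsine square root transform applied to illustrative proportion data (the values are placeholders):

    import numpy as np

    # Proportions between 0 and 1, e.g., fraction of parts passing inspection per run.
    p = np.array([0.05, 0.32, 0.50, 0.88, 0.97])

    # Arcsine square root transform stabilizes the variance of binomial proportions.
    p_trans = np.arcsin(np.sqrt(p))

    # Back-transform a prediction to the proportion scale.
    p_back = np.sin(p_trans) ** 2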

References

  • D. Miller. Reducing transformation bias in curve fitting. The American Statistician, 38(2):124–126, 1984.

Special Models

Logistic Regression

Logistic regression analysis estimates the odds or probability of an event by modeling the log odds of that event as a polynomial function of the input factors.

Note that odds and probability are related by:

\[\mathrm{odds} = \frac{\mathrm{probability}}{1-\mathrm{probability}} \leftrightarrow \mathrm{probability} = \frac{\mathrm{odds}}{1+\mathrm{odds}}\]
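For example, odds of 3 (a 3:1 ratio) correspond to a probability of 3/(1+3) = 0.75, and a probability of 0.75 corresponds to odds of 0.75/0.25 = 3.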

Because odds are usually given as a ratio, a:b, the software opts to report the probability and models the natural logarithm of the odds (the logit) as a polynomial function of the input factors:

\[\ln(\mathrm{odds}) = \mathrm{logit}(p) = \ln\left[\frac{p(y=1)}{1 - p(y=1)}\right] = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} = z\]

An iteratively reweighted least squares algorithm is used to estimate the coefficients for the polynomial model:

\[\hat{z} = \hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\cdots+ \hat{\beta}_{k}x_{k}\]

Finally, the probability is estimated by applying the inverse transformation:

\[\hat{p}=\frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\cdots+ \hat{\beta}_{k}x_{k}}}{1+e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\cdots+ \hat{\beta}_{k}x_{k}}}=\frac{1}{1+e^{-\left(\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\cdots+ \hat{\beta}_{k}x_{k}\right)}} = \frac{1}{1+e^{-\hat{z}}}\]

This estimate is always bounded between 0 and 1, consistent with a probability.
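As an illustrative sketch (not Stat-Ease 360 itself), the snippet below fits a logistic regression with statsmodels, whose GLM routine also uses iteratively reweighted least squares; the factor settings and pass/fail responses are invented for the example.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical two-factor experiment with a pass/fail (0/1) response.
    x1 = np.array([-1, -1,  1,  1, 0, 0, -1,  1], dtype=float)
    x2 = np.array([-1,  1, -1,  1, 0, 0,  1, -1], dtype=float)
    y  = np.array([ 0,  0,  1,  1, 0, 1,  1,  0])

    X = sm.add_constant(np.column_stack([x1, x2]))   # adds the intercept column

    # Binomial family with the default logit link; fit() uses IRLS.
    model = sm.GLM(y, X, family=sm.families.Binomial()).fit()

    # Predicted probabilities are already back-transformed to the 0-1 scale.
    p_hat = model.predict(X)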

Poisson Regression

Poisson regression is used to model the relationship between input factors and a response that represents a count. The mean count, \(y\), is related to the input factors through:

\[\ln\left(y\right) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} = z\]

An iteratively reweighted least squares algorithm is used to estimate the coefficients for the polynomial model:

\[\hat{z} = \hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\cdots+ \hat{\beta}_{k}x_{k}\]

Finally, the mean count is estimated by applying the inverse transformation:

\[\hat{y} = e^{\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+\cdots+ \hat{\beta}_{k}x_{k}} = e^{\hat{z}}\]

Notice that this expression is always non-negative, consistent with count data. However, it represents a mean count, so it is not necessarily an integer.
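An analogous sketch for count data, again using statsmodels as a stand-in for the software; the factor settings and counts are invented for the example.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical single-factor experiment with a count response (e.g., defects per part).
    x1 = np.array([-1, -0.5, 0, 0.5, 1, 1, 0, -1], dtype=float)
    counts = np.array([12, 9, 7, 5, 3, 4, 6, 11])

    X = sm.add_constant(x1)

    # Poisson family with the default log link; fit() uses IRLS.
    model = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

    # Predicted mean counts, back-transformed with exp(); not necessarily integers.
    mean_counts = model.predict(X)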

References

  • Douglas C. Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining. Introduction to Linear Regression Analysis. Wiley, 5th edition, 2012.