[Disclaimer: I’m not a statistician. Nor do I want you to think that I am. I am a marketing guy (with a few years of biochemistry lab experience) learning the basics of statistics, design of experiments (DOE) in particular. This series of blog posts is meant to be a light-hearted chronicle of my travels in the land of DOE, not be a text book for statistics. So please, take it as it is meant to be taken. Thanks!]

So, I’ve gotten thru some of the basics (**Greg's DOE Adventure: Important Statistical Concepts behind DOE** and **Greg’s DOE Adventure - Simple Comparisons**). These are the ‘building blocks’ of design of experiments (DOE). However, I haven’t explored actual DOE. I start today with factorial design.

Factorial design (aka factorial DOE) allow you to experiment on many factors (oh, that’s where the name comes from!) at the same time. A simple version of this: 2 factors, each has two levels. [Mathematically, this is represented by 2 to the power of 2, or 2^{2}.] Example time! Cooking Spaghetti. The two factors are temperature of the water and cooking time in that water. Levels are high temperature (100 deg C) and low temperature (80 deg C); and short time in the water and long time in the water. What’s the reason for the experiment? Optimization of the process to make the best tasting (al dente) spaghetti.

We can illustrate like this:

We can illustrate like this:

In this case, the horizontal line (x-axis) is time and vertical line (y-axis) is temperature. The area in the box formed is called the Experimental Space. Each corner of this box is labeled as follows:

1 – low time, low temperature (resulting in crunchy, matchstick-like pasta), which can be coded minus-minus (-,-)

2 – high time, low temperature (+,-)

3 – low time, high temperature (-,+)

4 – high time, high temperature (making a mushy mass of nasty) (+,+)

One takeaway at this point is that when a test is run at each point above, we have 2 results for each level of each factor (i.e. 2 tests at low time, 2 tests at high time). In factorial design, the estimates of the effects (that the factors have on the results) is based on the average of these two points; increasing the statistical power of the experiment.

Power is the chance that an effect will be found, when there is an effect to be found. In statistical speak, power is the probability that an experiment correctly rejects the null hypothesis when the alternate hypothesis is true.

If we look at the same experiment from the perspective of altering just one factor at a time (OFAT), things change a bit. In OFAT, we start at point #1 (low time, low temp) just like in the Factorial model we just outlined (illustrated below).

Here, we go from point #1 to #2 by lengthening the time in the water. Then we would go from #1 to #3 by changing the temperature. See how this limits the number of data points we have? To get the same power as the Factorial design, the experimenter will have to make 6 different tests (2 runs at each point) in order to get the same power in the experiment.

After seeing these results of Factorial Design vs OFAT, you may be wondering why OFAT is still used. First of all, OFAT is what we are taught from a young age in most science classes. It’s easy for us, as humans, to comprehend. When multiple factors are changed at the same time, we don’t process that information too well. The advantage these days is that we live in a computerized world. A computer running software like Design-Expert®, can break it all down by doing the math for us and helping us visualize the results.

Additionally, with the factorial design, because we have results from all 4 corners of the design space, we have a good idea what is happening in the upper right-hand area of the map. This allows us to look for interactions between factors.

That is my introduction to Factorial Design. I will be looking at more of the statistical end of this method in the next post or two. I’ll try to dive in a little deeper to get a better understanding of the method.

Disclaimer: I’m not a statistician. Nor do I want you to think that I am. I am a marketing guy (with a few years of biochemistry lab experience) learning the basics of statistics, design of experiments (DOE) in particular. This series of blog posts is meant to be a light-hearted chronicle of my travels in the land of DOE, not be a textbook for statistics. So please, take it as it is meant to be taken. Thanks!

As I learn about design of experiments, it’s natural to start with simple concepts; such as an experiment where one of the inputs is changed to see if the output changes. That seems simple enough.

For example, let’s say you know from historical data that if 100 children brush their teeth with a certain toothpaste for six months, 10 will have cavities. What happens when you change the toothpaste? Does the number with cavities go up, down, or stay the same? That is a simple comparative experiment.

“Well then,” I say, “if you change the toothpaste and 6 months later 9 children have cavities, then that’s an improvement.”

Not so fast, I’m told. I’ve already forgotten about that thing called *variability* that I defined in my last post. Great.

In that first example, where 10 kids got cavities. That result comes from that particular sample of 100 kids. A different sample of 100 kids may produce an outcome of 9, other times it’s 11. There is some variability in there. It’s not 10 every time.

[Note: You can and should remove as much variability as you can. Make sure the children brush their teeth twice a day. Make sure it’s for exactly 2 minutes each time. But there is still going to be some variation in the experiment. Some kids are just more prone to cavities than others.]

How do you know when your observed differences are due to the changes to the inputs, and not from the variation?

It’s called the F-Test.

I’ve seen it written as:

Where:

s = standard deviation

s2 = variance

n = sample size

y = response

ӯ (“y bar”) = average response

In essence, this is the amount of variance for individual observations in the new experiment (multiplied by the number of observations) divided by the total variation in the experiment.

Now that, by itself, does not mean much to me (see disclaimer above!). But I was told to think of it as the ratio of signal to noise. The top part of that equation is the amount of signal you are getting from the new condition; it’s the amount of change you are seeing from the established mean with the change you made (new toothpaste). The bottom part is the total variation you see in all your data. So, the way I’m interpreting this F-Test is (again, see disclaimer above): measuring the amount of change you see versus the amount of change that is naturally there.

If that ratio is equal to 1, more than likely there is no difference between the two. In our example, changing the toothpaste probably makes no difference in the number of cavities.

As the F-value goes up, then we start to see differences that can likely be credited to the new toothpaste. The higher the value of the F-test, the less likely it is that we are seeing that difference by chance and the more likely it is due to the change in the input (toothpaste).

*Trivia*

**Question**: Why is this thing called an F-Test?

**Answer**: It is named after Sir Ronald Fisher. He was a geneticist who developed this test while working on some agricultural experiments. “Gee Greg, what kind of experiments?”. He was looking at how different kinds of manure effected the growth of potatoes. Yup. How “Peculiar Poop Promotes Potato Plants”. At least that would have been my title for the research.

If you read my previous post, you will remember that design of experiments (DOE) is a systematic method used to find cause and effect. That systematic method includes a lot of (frightening music here!) statistics.

[I’ll be honest here. I was a biology major in college. I was forced to take a statistics course or two. I didn’t really understand why I had to take it. I also didn’t understand what was being taught. I know a lot of others who didn’t understand it as well. But it’s now starting to come into focus.]

Before getting into the concepts of DOE, we must get into the basic concepts of statistics (as they relate to DOE).

Basic Statistical Concepts:

**Variability**

In an experiment or process, you have inputs you control, the output you measure, and uncontrollable factors that influence the process (things like humidity). These uncontrollable factors (along with other things like sampling differences and measurement error) are what lead to variation in your results.

**Mean/Average**

We all pretty much know what this is right? Add up all your scores, divide by the number of scores, and you have the average score.

**Normal distribution**

Also known as a bell curve due to its shape. The peak of the curve is the average, and then it tails off to the left and right.

**Variance**

Variance is a measure of the variability in a system (see above). Let’s say you have a bunch of data points for an experiment. You can find the average of those points (above). For each data point subtract that average (so you see how far away each piece of data is away from the average). Then square that. Why? That way you get rid of the negative numbers; we only want positive numbers. Why? Because the next step is to add them all up, and you want a sum of all the differences without negative numbers getting in the way. Now divide that number by the number of data points you started with. You are essentially taking an average of the squares of the differences from the mean.

That is your variance. Summarized by the following equation:

\(s^2 = \frac{\Sigma(Y_i - \bar{Y})^2}{(n - 1)}\)

In this equation:

**Yi** is a data point**Ȳ** is the average of all the data points**n** is the number of data points

**Standard Deviation**

Take the square root of the variance. The variance is the average of the squares of the differences from the mean. Now you are taking the square root of that number to get back to the original units. One item I just found out: even though standard deviations are in the original units, you can’t add and subtract them. You have to keep it as variance (s2), do your math, then convert back.

Hi there. I’m Greg. I’m starting a trip. This is an educational journey through the concept of design of experiments (DOE). I’m doing this to better understand the company I work for (Stat-Ease), the product we create (Design-Expert® software), and the people we sell it to (industrial experimenters). I will be learning as much as I can on this topic, then I’ll write about it. So, hopefully, you can learn along with me. If you have any comments or questions, please feel free to comment at the bottom.

So, off we go. First things first.

*What exactly is design of experiments (DOE)?*

When I first decided to do this, I went to Wikipedia to see what they said about DOE. No help there.

“The design of experiments (DOE, DOX, or experimental design) is the design of any task that aims to describe or explain the variation of information under conditions that are hypothesized to reflect the variation.” –Wikipedia

The what now?

That’s not what I would call a clearly conveyed message. After some more research, I have compiled this ‘definition’ of DOE:

Design of experiments (DOE), at its core, is a systematic method used to find cause-and-effect relationships. So, as you are running a process, DOE determines how changes in the inputs to that process change the output.

Obviously, that works for me since I wrote it. But does it work for you?

So, conceptually I’m off and running. But why do we need ‘designed experiments’? After all, isn’t all experimentation about combining some inputs, measuring the outputs, and looking at what happened?

The key words above are ‘systematic method’. Turns out, if we stick to statistical concepts we can get a lot more out of our experiments. That is what I’m here for. Understanding these ‘concepts’ within this ‘systematic method’ and how this is advantageous.

Well, off I go on my journey!

Hello, Design-Expert® software users, Stat-Ease clients and statistics fans! I’m Rachel, the Client Specialist *[ed. 2023: now I'm the Senior Sales & Marketing Specialist]* here at Stat-Ease. If you’ve ever called our general line, I’m probably the one who picked up; I’m the one who prints and binds your workshop materials when you take our courses. I am not, by any stretch of the imagination, a statistician. So why am I, a basic office administrator who hasn’t taken a math class since high school, writing a blog post for Stat-Ease? It’s because I entered this year’s Minnesota State Fair Creative Activities Contest thanks to Design-Expert and help from the Stat-Ease consultant team.

I’m what you’d call a subject matter expert when it comes to baking **challah bread**. Challah is a Jewish bread served on Shabbat, typically braided, and made differently depending if you’re of Ashkenazi or Sephardi heritage. I started making challah with my mom when I was 8 years old (Ashkenazi style), and have been making it regularly since I left home for college. As I developed my own cooking and baking styles, I began to feel like my mother’s recipe had gotten a bit stale. So I’ve started to add things to the dough — just a little vanilla extract at first, then a dash of almond extract, then a batch with cinnamon and raisins, another one with chocolate chips, a Rosh Hashanah version that swaps honey for sugar and includes apple bits (we eat apples and honey for a sweet New Year), even one batch with red food coloring and strawberry bits for a breast cancer awareness campaign. None of these additions were tested in a terribly scientific way; I’m a baker, not a lab chemist. So when I decided I wanted to enter the State Fair with my challah this year, I got to wondering: what is actually the best way to make this challah? And lucky me, I’m employed at the best place in the world to find out.

I brought up the idea of running a designed experiment on my bread with my supervisor, and one of our statisticians, Brooks Henderson, was assigned to me as my “consultant” on the project. Before designing the experiment, we first needed to narrow down the factors we wanted to test and the results we wanted to measure. I set a hard line on not changing any of my mother’s original recipe — I know what Mom’s challah tastes like, I know it’s good, and I don’t want to mess with the complex chemistry involved in baking. We settled on adjusting the amount of vanilla and almond extracts I add to the dough, and since the Fair required me to submit a smaller loaf than Mom’s recipe makes, we tested the time and temperature required to bake. For our results, we asked our coworkers to judge 7 attributes of the bread, including taste, texture, and overall appeal. A statistician and I judged the color of each loaf and measured the thickness of the crust.

It sounds so simple, right? That’s what I thought: plug the factors into Design-Expert, let it work its magic, and poof! the best bread recipe. But that just shows you how little I know! If you’re a formulator, or you’ve taken our **Mixture Design for Optimal Formulations** workshop, you know what the first hurdle was: even though we only changed two ingredients, we were still dealing with a combined mixture/process design. Since mixture designs work with ratios of ingredients as opposed to independent amounts, adding 5g of vanilla extract and 3g of almond extract is a different ratio within the dough, and therefore a different mixture, than adding 2g of vanilla and 6g of almond. To make this work, the base recipe had to become a third part of the mixture. Consultant Wayne Adams stepped in at that point to help us design the experiment. He and Brooks built a mixture/numeric combined design that specified proportions of the 3 ingredients (base recipe, vanilla, and almond), along with the time and temperature settings.

Our second major problem was the time constraint. I brought up the idea for this bread experiment on July 18, and I had to bring my loaves to the fairgrounds on the morning of August 20. We wanted our coworkers to taste this bread, and I had a required family vacation to attend that first week of August. When we accounted for that, along with the time it took to design the experiment, we were left with just 14 days of tasting. At a rate of 2 loaves per weeknight, 4 per weekend, and at the cost of my social life, our maximum budget allowed for a design with only 26 runs. I’m sure there are some of you reading this and wondering how on earth I’d get any meaningful model out of a paltry 26 runs. Well, you’ve got reason to: we just barely got information I could use. Brooks ran through a number of different designs before he got one with even halfway decent power, and we also had to accept that, if there were any curvature to the results, we would not be able to model it with much certainty. Our final design had just two center points to find any curvature related to time or temperature, with no budgeted time for follow-up. Since our working hypothesis was that we’d see a linear relationship between time and temperature, not a quadratic one, the center points were to check this assumption and ensure it was correct. We got a working model, yes, but we took a big risk — and the fact that I didn’t even place in the top 5 entries only underlines that.

On top of all these constraints? I’m only human, and as you well know, human operators make mistakes. My process notes are littered with “I messed up and…” Example: the time I stacked my lunchbox on top of a softer loaf of challah in my bicycle bag for the half-hour ride to work. I’ll give you three guesses how that one rated on “uniformity” and “symmetry,” and your first two don’t count. If we had more time, we could have added more runs and gotten data that didn’t have that extra variability, but the fair submission date was my hard deadline. Mark Anderson, a Stat-Ease principal, tells me this is a common issue in many industries. When there is a “real-time rush to improve a product,” it may not be the best science to accept flawed data, but you make do and account for variations as best you can.

During the analysis, we used the Autoselect tool in Design-Expert to determine which factors had significant effects on the responses (mostly starting with the 2FI model). Another statistician here at Stat-Ease, Martin Bezener, just presented a webinar about this incredible tool —** visit our web site** to view a recording and learn more about it. When all of our tasters’ ratings were averaged together, we got significant models for Aroma, Appeal, Texture, Overall Taste, Color, and Crust Thickness, with Adj. R² values above 0.8 in most cases. This means that our models captured 80% of the variation in the data, with about 20% unexplained variation (noise) leftover. In general, the time and temperature effects seem to be the most important — we didn’t learn much about the two extracts. Almond only showed up as an effect (and a minor one at that) in one model for the aroma response, and vanilla didn’t show up at at all!

The other thing that surprised me was that I expected to be able to block this experiment. Blocking is a technique covered in our **Modern DOE for Process Optimization** workshop by which it’s possible to account for variation between any impossible-to-change source of variation, such as personal differences between tasters. However, since our tasters weren’t always present at every tasting and because we had so few runs in the experiment, we had too few degrees of freedom to block the results and still get a powerful model. It turned out that blocking wouldn’t have shown us much. We looked at a few individual tasters’ results individually, and that didn’t seem to illuminate anything different from what we saw before — which tells us that blocking the whole experiment wouldn’t have uncovered anything new, either.

In the end, I’m happy with our kludged-together experiment. I got a lot of practice baking, and determined the best process for my bread. If we were to do this again, I’d want to start in April to train my tasters better, determine appropriate amounts of other additions like chocolate chips, and really delve into ingredient proportions in a proper mixture design. And of course, I couldn’t have done any of this without the Stat-Ease consulting team. If you have questions on how our consultants can help design and analyze your experiments, **send us an email**.