Green Belt Case study
Six Sigma Green Belt Book 4 | Module 4
©2020 Bisk Education, Inc. and Villanova University. All rights reserved.
No part of this document may be reproduced in any form or by any electronic or mechanical means,
including information storage and retrieval systems, without written permission from the copyright owner.
Company, product, and service names used herein may be trademarks of their respective owners and
are used in an editorial fashion with no intention of infringement of the respective owner’s trademark
rights. The information in this study guide is distributed on an “as is” basis, without warranty. Neither
the copyright owner nor the author(s) shall have any liability to any person or entity with respect to any
actual or alleged damage caused by the information contained herein.
Permission to print this document is limited to one copy per student.
Six Sigma Green Belt | 3
Table of Contents
Module 4
Introduction
Objectives
Assignment Checklist
Introduction to the Measure Phase
Variation and Measurement Discrimination
The Normal Curve
Other Distributions
Populations and Samples
Normality Test
Measures of Central Tendency
Measures of Dispersion
Histogram Analysis
Graphical Representations of Data
Pareto Chart
Run Chart
Module 4 Introduction
During Week 4, we discuss the next step in the DMAIC process, the Measure phase. Throughout this
module, you will learn about variation, the normal curve, and populations and samples. You will also
learn about measures of central tendency, measures of dispersion, and histogram analysis. You will be
taught how to determine bin size when dealing with histograms as well as how to represent data
graphically. Finally, you will become familiar with Pareto and run charts.
Objectives
When you complete this module, you should be able to:
• Apply a Pareto analysis.
• Define your data collection plan.
• Interpret histograms.
• Describe the Motorola Shift.
• Calculate the process performance indices Pp and Ppk based on the current process.
Assignment Checklist
Introduction to the Measure Phase
Introduction
Recall that Six Sigma deploys a DMAIC Model approach to quality improvement: Define, Measure,
Analyze, Improve, and Control (DMAIC). For each phase, the Six Sigma team has a set of “deliverables,”
or tollgates. We have already covered the three tollgates of the Define phase: develop project charter,
assess customer needs and requirements, and construct a working process map. Now, we are ready to
discuss the second major phase of DMAIC: the Measure phase.
The Measure phase also has three tollgates: establish appropriate performance indicators, develop a
data collection plan for any data that is required, and establish baseline performance measures. In
the Measure phase, the Six Sigma team must obtain a clear understanding of both the data and the
processes. It is in the Measure phase that the team assesses the adequacy of the measurement system
and establishes the performance metrics and their baseline values that will become central to the
project. These tasks are straightforward conceptually and sometimes challenging in practice.
• Establish appropriate performance indicators: what are the critical-to-quality variables? Which
performance measures are necessary for the project as it has been defined?
• Develop a data collection plan: what do we have and what do we need to collect? The project will
be driven by the data we utilize, so the team must have valid and appropriate data.
• Establish baseline performance measures: the team must obtain an estimate of the current state
of the process regarding the ability to meet customer requirements. This “before” picture of the
important processes will allow a realistic assessment of current performance and will be utilized to
assess the team’s progress.
Leading vs. Lagging Indicators
It is important to distinguish between leading and lagging indicators when assessing performance.
• Leading indicators – inputs measured upstream in the process.
• Lagging indicators – the outcomes of the leading activities that occur upstream.
For example, what would be the lagging indicators and the leading indicators of weight?
• Lagging indicator – stepping on a scale (no matter how often) would be a lagging indicator because
it is an outcome resulting from things further up in the process.
• Leading indicators – diet (calories, fat grams, sodium, carbohydrates, etc.) and exercise would be
leading indicators because they are inputs that occur upstream in the process.
Focusing on one or two leading indicators thought to influence some lagging goal will help us get
started through the Measure phase of DMAIC.
Tollgates of the Measure Phase
Performance indicators identified. Questions you can ask include:
• Are those things measurable?
• What effect do they have on the customers?
• What are the customer’s requirements?
• Are there leading indicators?
Data collection plan. Questions you can ask include:
• What type of measure are we looking at? What is the type of data?
• Operational definitions are also important here: anything the team can identify and relate to as
something to measure (a starting point and stopping point, targets, etc.).
Baseline Performance Measurement
Note – reaching Six Sigma means there are only 3.4 defects per million opportunities. For
example, if we all live to 95 years old and we divide our lives into hours, how many bad hours would
we experience in our lifetimes? At Six Sigma level, we would experience fewer than three bad hours
(about 832,000 hours × 3.4 ÷ 1,000,000 ≈ 2.8).
• It starts with yield. How many out of a hundred? Out of a thousand? Out of a million?
• Sigma helps break it down into manageable chunks.
• A baseline sigma value is often computed.
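The arithmetic behind a baseline measure can be sketched in a few lines of Python. This is only an illustration: the function names, the 365.25-day year, and the baseline counts (17 defects in 500 units with 10 opportunities each) are assumptions made for the example and do not come from the course material.

```python
# Illustrative sketch: DPMO and the "bad hours in a lifetime" arithmetic.
HOURS_PER_YEAR = 365.25 * 24   # assumption: average year length, leap years included
SIX_SIGMA_DPMO = 3.4           # defects per million opportunities at Six Sigma


def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities for an observed process."""
    return defects / (units * opportunities_per_unit) * 1_000_000


def expected_defects(opportunities, dpmo_rate):
    """Expected number of defects for a given number of opportunities."""
    return opportunities * dpmo_rate / 1_000_000


lifetime_hours = 95 * HOURS_PER_YEAR
bad_hours = expected_defects(lifetime_hours, SIX_SIGMA_DPMO)
print(f"Hours in 95 years: {lifetime_hours:,.0f}")
print(f"Expected bad hours at 3.4 DPMO: {bad_hours:.2f}")

# Hypothetical baseline: 17 defects found in 500 units with 10 opportunities each.
print(f"Baseline DPMO: {dpmo(17, 500, 10):,.0f}")
```

Under these assumptions the lifetime example comes out a little under three bad hours, and the hypothetical baseline process sits far from the Six Sigma level.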
Variation and Measurement Discrimination
Introduction
This lecture will discuss variation and how it always exists if you measure finely enough.
Measurement Discrimination
Measurement discrimination is the ability of a measurement to tell items apart; it depends on the
degree of resolution of the measurement.
Let us say there are four bowls labeled A, B, C, and D as shown in Figure 1. The thickness of the
bowls appears to be the same; however, the results differ depending on the resolution of your
measurement.
Read to tenths of an inch (one decimal place), the thickness of all the bowls is 0.1:
Figure 1
Read to hundredths of an inch (two decimal places), B and C are 0.12 while A and D are 0.15:
Figure 2
Measuring to a thousandth of an inch shows that the thicknesses of the four bowls are all different – A
is 0.153, B is 0.121, C is 0.126, and D is 0.155:
Figure 3
We call this measurement discrimination. What if we measure the height of everyone in the facility?
• If we measure to the nearest yard, it would look like everyone is two yards tall.
• If we measure to the nearest foot, the heights vary from four to seven feet tall.
These examples illustrate measurement discrimination.
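The bowl example can be reproduced with a short Python sketch. It assumes the instrument simply drops the digits beyond its resolution (truncation), since that is what reproduces the values quoted for the figures; the function name `read_at_resolution` is made up for this illustration.

```python
from decimal import Decimal, ROUND_DOWN

# Bowl thicknesses in inches, as given in the example.
thickness = {
    "A": Decimal("0.153"),
    "B": Decimal("0.121"),
    "C": Decimal("0.126"),
    "D": Decimal("0.155"),
}


def read_at_resolution(value, resolution):
    """Return the reading an instrument with the given resolution would show.

    Assumption: digits beyond the resolution are dropped (truncated).
    """
    return value.quantize(Decimal(resolution), rounding=ROUND_DOWN)


def distinct_readings(resolution):
    """Count how many different readings the four bowls produce."""
    return len({read_at_resolution(t, resolution) for t in thickness.values()})


for resolution in ("0.1", "0.01", "0.001"):
    readings = {bowl: str(read_at_resolution(t, resolution))
                for bowl, t in thickness.items()}
    print(resolution, readings, "distinct values:", distinct_readings(resolution))
```

The count of distinct readings grows as the resolution gets finer, which is the point of the example: the variation was always there, but a coarse measurement could not see it.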
Variation
As we know, reducing process variation is a fundamental objective of the Six Sigma methodology and
tools. Reducing process variation can increase customer satisfaction, market share, and the overall
financial bottom line.
• Some degree of process variation is natural and unavoidable. Businesses must be able to rely on
their processes to consistently produce acceptable levels of variation.
• It is the responsibility and obligation of the supplier to produce products or services meeting the
predetermined customer specifications and/or requirements.
• There are many causes of variation both within and outside the process.
• Many businesses know that they have fluctuations from the expected parameters of their processes.
They accept it and choose to live with it because:
- It is at a level that does not impact the quality of the product/service.
- The level of variation continues to allow the products/services to meet customer specifications.
- It would be extremely costly to attempt to reduce the variation further and the results of this
reduction would not justify the costs.
- The technology does not exist which would allow decreasing the variation further.
• However, not knowing the causes of process variation and how to control it is a very costly mistake.
As we have learned, the cost of poor quality can approach 30% in an organization.
Customers value consistency of product and service and want to have minimal variation from lot/batch/
usage to lot/batch/usage. This reliability makes their processes more predictable and gives higher
quality. All other things being equal, a customer is more likely to want to purchase products and
services having less variation. As a matter of fact, market research has shown that many customers
are willing to pay 20% more for higher quality.
Types of Variation
Common cause variation is naturally occurring and is present in all processes. Special cause variation
is unexpected and usually results from an unusual event or occurrence. Control charts and run charts
are good tools to visualize variation. Processes must be stable in order for assessment and/or
improvement to take place.
Conclusion
Remember: if you do not think there is variation in your process, there is. In a Lean Six Sigma project,
you will be able to measure at a resolution fine enough to see it. This requires the appropriate calibrated
instruments or gauges. When you start improving a process and reducing variation, it may become
harder to identify the variation as it decreases.
The Normal Curve
Introduction
One of the most often used models in statistics is the normal curve, or the normal probability distribution
(commonly referred to as “the bell curve”). The data we encounter can often be modeled using the
normal probability distribution (though not always!). Many inferential statistical methods assume that
some aspect of the data is indeed normally distributed. The shape of the normal distribution is easily
recognized.
Figure 7
Characteristics of the Normal Probability Distribution
In general, if we know the data are normally distributed, by using the mean and the standard deviation,
we may make some statements about the spread of the data. This result is sometimes referred to as the
Empirical Rule. Note that the area under the curve of a normal probability distribution – like all valid
probability distributions – sums to one.
• The area between plus and minus one standard deviation contains about 68 percent of the data.
• The area between plus and minus two standard deviations contains about 95 percent of the data.
• The area between plus and minus three standard deviations contains about 99.7 percent of the data.
We should note, however, that these percentages assume the data is a very good fit with a theoretical
normal probability distribution.
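The Empirical Rule percentages can be checked against the theoretical standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the helper name `coverage` is an assumption made for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)


def coverage(k):
    """Share of a normal distribution within plus or minus k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within ±{k} standard deviations: {coverage(k):.2%}")
```

Because every normal distribution is a shifted and rescaled version of the standard one, the same shares apply to any member of the normal family, provided the data really do follow that shape.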
Since the normal probability distribution is differentiated by the mean and the standard deviation, there
are many different normal distributions (all with this basic shape). In statistics, we refer to this as a
“family” of normal distributions. Additionally, we can make the following observations:
• The highest point on the curve is at the mean.
• A normal distribution is symmetric about its mean.
• The standard deviation determines how “flat” and “wide” the curve is.
Other Distributions
Introduction
Although we will make extensive use of the normal distribution or Z distribution in this course, there
are other distributions that you need to be aware of. These include:
• T-distribution.
• Binomial.
• Poisson.
• Chi-square.
Don’t be concerned if you don’t completely understand these distributions at this point. We are just
making you aware of their existence and they will be covered in more detail as the course proceeds.
T-Distribution
Characteristics of the T-distribution:
• Utilized for variables or continuous data.
• Bell shaped and symmetrical about the mean.
• Looks similar to the Z-distribution.
Use of the T-statistic:
• Used for small sample sizes of less than 30. With larger sample sizes, either the T- or Z-distribution
can be used, although the Z-distribution is easier for large sample sizes.
• Hypothesis testing about means.
• Testing the significance of regression coefficients and other statistics.
• Drawing conclusions about populations from a small sample size of less than 30.
• Used when the standard deviation of the population, which is needed for the Z-statistic, is not known.
• To calculate it, you need to know the mean of the sample group, the mean of the population, the sample
standard deviation, and the sample size.
• Probabilities are obtained from a T-table. You need to know the degrees of freedom, which is n – 1.
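As a rough sketch of how those four inputs combine, the T-statistic for a single sample mean is usually written as t = (sample mean – population mean) ÷ (s ÷ √n). The code below assumes that one-sample form; the sample values and the population mean of 4.0 are hypothetical and chosen only to make the example runnable.

```python
import math
import statistics


def t_statistic(sample, population_mean):
    """One-sample T-statistic and its degrees of freedom (n - 1)."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)        # divides by n - 1
    standard_error = sample_sd / math.sqrt(n)
    t = (sample_mean - population_mean) / standard_error
    return t, n - 1


# Hypothetical sample of nine measurements and a hypothetical population mean.
sample = [4, 5, 6, 7, 8, 3, 5, 7, 2]
t, degrees_of_freedom = t_statistic(sample, population_mean=4.0)
print(f"t = {t:.3f} with {degrees_of_freedom} degrees of freedom")
```

The resulting value and the degrees of freedom are what you would take to a T-table to look up a probability.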
Binomial Distribution
Use of the binomial distribution:
• Binomial means “two names.”
• Used for attribute or discrete data.
• Two possible outcomes (e.g., yes/no, heads/tails, good/bad, etc.).
• Purpose is to understand how likely outcomes are – the probability.
• Criteria for use:
- Small sample sizes.
- Population of more than 50.
- Replace the sample into the population between trials.
• If you can’t meet these criteria, then use one of the other distributions.
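To make "how likely outcomes are" concrete, here is a minimal sketch of the standard binomial probability formula, P(X = k) = C(n, k) · p^k · (1 – p)^(n – k). The scenario (10 units inspected, each with a 10% chance of being bad) is hypothetical, and the function names are assumptions made for the illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k 'bad' outcomes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """Probability of k or fewer 'bad' outcomes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


# Hypothetical scenario: inspect 10 units, each with a 10% chance of being bad.
n_trials, p_bad = 10, 0.10
print(f"P(no bad units)       = {binomial_pmf(0, n_trials, p_bad):.4f}")
print(f"P(at most 1 bad unit) = {binomial_cdf(1, n_trials, p_bad):.4f}")
```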
Poisson Distribution
Use of the Poisson distribution:
• Used for attribute or discrete data.
• Don’t know how many good results we could have.
• Count the number of defects. Don’t know how many defects didn’t occur.
• Things that occur randomly over time or over some other continuum (e.g., distances, areas, volumes, etc.).
• Rates (i.e., count per something [per day, per square foot, etc.]).
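A small sketch of the standard Poisson probability formula, P(X = k) = e^(–λ) · λ^k ÷ k!, shows how a rate turns into probabilities of counts. The rate of 2 defects per day is hypothetical, and the function name is an assumption made for the illustration.

```python
import math


def poisson_pmf(k, rate):
    """Probability of observing exactly k events when the average count is `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


# Hypothetical rate: an average of 2 defects per day.
defects_per_day = 2.0
for k in range(5):
    print(f"P({k} defects in a day) = {poisson_pmf(k, defects_per_day):.4f}")
```

Note that only the average rate is needed; there is no count of "defects that didn't occur," which is what separates this situation from the binomial one.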
Chi-Square Distribution
• Uses sample variances to determine whether two factors are related or dependent on one another.
• This distribution is beyond the scope of this Green Belt course.
Conclusion
In addition to the normal or Z-distribution, we explored the T-distribution, the binomial distribution,
and the Poisson distribution and their usage in this course. This has just been an introduction with
more in-depth discussion to follow.
Populations and Samples
Introduction
“Data! Data! Data! I can’t make bricks without clay,” exclaimed Sherlock Holmes. And so it is. Lean
Six Sigma is data-based decision making. Where do we obtain our data? In the language of statistics,
from populations and samples.
Populations and Samples
A population is a complete collection of people, things, or data that you can analyze and draw
conclusions about. Samples are subsets of populations. A statistically valid random sample is a sample
that is (1) randomly selected, (2) representative of the population, and (3) of sufficient size so that
any inferences we make utilizing those samples are reliable. In the practice of Lean Six Sigma, we will
usually (but not always) make use of samples. We make extensive use of sampling for the following
reasons:
• The population is simply too large and testing or inspecting all elements of the population would
be too expensive.
• Obtaining the data from the sample elements requires a destructive test.
• The population is dynamic; that is, the process that produces the elements of the population is not
static, so we don’t have a fixed population. Think of an ongoing production or service process.
Below are some examples of populations vs. samples:
• An entire elementary school of students is a population.
• One hundred randomly selected second graders is a sample.
• All North American data centers are a population.
• The thirty largest data centers in the U.S. are a sample.
One topic that often arises in Lean Six Sigma is sample size. That is, we regularly confront the question:
what is a sufficient sample size? Unfortunately, there is no simple answer to this question. There is no
single value for the appropriate sample size that is sufficient for all our purposes. There are many
approaches for estimating a sample size in statistics. The sample size required depends on many
factors, including the amount of variation in the data and the resulting level of precision we wish to
achieve. It is best to speak with a statistician about sample size issues.
Parameters and Statistics
An important distinction between populations and samples involves the concept of parameters and
statistics. A parameter is a measurement, such as a mean or standard deviation, which represents
an entire population. Parameters are denoted by a Greek letter. For example, the population standard
deviation is denoted by the Greek letter sigma (σ).
A statistic is a measurement constructed from a random sample or a subset of that population. A
statistic is denoted by a lowercase letter of the Latin alphabet. For example, the sample standard deviation
is denoted by s. In many statistical inferences, the sample standard deviation (s) is used as a point
estimate of the population standard deviation (σ).
Normality Test
Introduction
Not all data exhibiting a bell shape is normally distributed. The normality of the data has implications
for how the data is analyzed and what inferences can be made from the data. To check the normality of
the data, a normality test can be done.
Anderson-Darling Normality Test
A common test for normality is the Anderson-Darling Normality Test.
In the Anderson-Darling Normality Test, we have two hypotheses:
• Null Hypothesis (Ho) – the data points are normally distributed.
• Alternative Hypothesis (Ha) – the data points are not normally distributed.
The test calculates a p value to determine whether we reject or fail to reject the null hypothesis:
• If p ≥ .05, we fail to reject Ho and treat the data as normally distributed.
• If p < .05, we reject Ho in favor of Ha and conclude the data is not normally distributed.
Conclusion
Rather than speculating qualitatively about normality based solely on the shape of the data distribution,
we can quantitatively assess normality using the Anderson-Darling Normality Test. If p ≥ .05, the
data can be treated as normally distributed. If p < .05, the data is not normally distributed.
Measures of Central Tendency
Introduction
Measures of central tendency (also known as measures of location) attempt to identify the center of a
set of data. The most often used measures are mean, median, and mode. Since we typically are working
with samples in Six Sigma, we will focus on sample statistics (a “statistic” is a measure computed from
a sample, while a “parameter” is a measure of the population).
Measures of Central Tendency
Perhaps the measure of location that is most familiar is the sample mean. The mean provides a
measure of central location. The mean of a sample is the sum of all values in the sample divided by
the number of data values:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Note that $x_i$ is just notation to cover each element of the sample. So, a sample of 5 elements is
generically described as values $x_1$ through $x_5$. The summation symbol (∑) is a mathematical operator
that tells us to sum the data.
To find the mean, add up the numbers and divide by the count of numbers you added. For example,
consider the following sample data: 4, 5, 6, 7, 8, 3, 5, 7, 2.
• The sum is 47.
• There are nine data values.
• Therefore, the mean is 47 ÷ 9 = 5.22.
The median is a very different kind of measure of location. To obtain the median for a sample, we must:
1. Sort the data into ascending or descending order.
2. For a sample that has an odd number of data values, the median is the data value in the middle.
3. For a sample with an even number of data values, the median is the average of the two values in
the center.
Consider the same sample data as above: 4, 5, 6, 7, 8, 3, 5, 7, 2.
• To obtain the median, we first sort the data: 2, 3, 4, 5, 5, 6, 7, 7, 8 (sorted from smallest to largest).
• Since the sample size is an odd number (nine), the median is the value in the middle (that is, it
has four data values below it and four data values above it). Therefore, the median is in the fifth
place, which is 5.
Note that if the sample size had been an even number (like eight), then we would have to average the
middle two values after sorting the data.
Finally, the mode is simply the sample value that occurs most often. Note that it is possible to have more
than one mode.
Again, consider the same data as above: 4, 5, 6, 7, 8, 3, 5, 7, 2. The mode is the data value that
occurs most often, so this sample data has two modes: 5 and 7.
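The three measures can be verified for this sample with Python's standard `statistics` module. This is just a quick check of the worked example; `multimode` (Python 3.8 or later) is used because it returns every mode rather than only one.

```python
import statistics

# Sample data from the worked example.
data = [4, 5, 6, 7, 8, 3, 5, 7, 2]

total = sum(data)
mean = statistics.mean(data)         # sum divided by the number of values
median = statistics.median(data)     # middle value of the sorted data
modes = statistics.multimode(data)   # every value tied for most frequent

print(f"sum = {total}, n = {len(data)}")
print(f"mean = {mean:.2f}")
print(f"sorted data = {sorted(data)}")
print(f"median = {median}")
print(f"modes = {modes}")
```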
Measures of Dispersion
Introduction
A measure of location is of limited use without a measure of dispersion (also known as a measure of
variation). Much of our time in Six Sigma is spent trying to decide if our data indicates some type of
change that calls for action. As Donald Wheeler puts it: “While every data set contains noise, some
data sets may contain signals. Therefore, before you can detect a signal within a given data set, you
must first filter out the noise.” See his book Understanding Variation. Measures of variation are one of
the ways in which we filter data in our search for a signal.
Measures of Dispersion
There are four very common sample measures of variation: the range, the interquartile range, the
standard deviation, and the variance.
The range is the most easily computed measure of variation. To obtain the sample range, we take the
largest value in the sample and we subtract the smallest value. We will need to use the sample range
when we discuss the process tool known as a control chart.
Sample Range = Largest value – Smallest value
The interquartile range is a measure of variation computed by estimating the third quartile of the
sample (Q3, or the 75th percentile) and the first quartile (Q1, or 25th percentile) and subtracting Q1
from Q3. Thus, the interquartile range is a measure of variation based on the middle fifty percent of the
sample data.
Interquartile Range = Q3 – Q1
The percentiles of a sample can be estimated in different ways, but the calculation is easily completed
with spreadsheet software like Excel.
The most often used measure of variation is the sample standard deviation. The sample standard
deviation measures the variation of the sample data by summing the squared deviations
of each data value from the sample average, dividing that sum by n – 1, and taking the square root of the
result:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
We use the sample standard deviation extensively. While it may look complicated, it is easily computed
with a basic statistics calculator or Excel.
Finally, the sample variance is just the square of the sample standard deviation.
Sample Variance = $s^2$
Example: Consider the following sample data: 4, 5, 6, 7, 8, 3, 5, 7, 2.
The sample range is just the largest value minus the smallest value in the sample. Therefore:
Range = 8 – 2 = 6.
The sample standard deviation formula is applied as follows.
Again, the sample standard deviation is the square root of the sum
of the squared deviations from the average (mean)
divided by the sample size minus one (n – 1).
x     Avg.    (x – Avg.)²
4     5.22     1.4884
5     5.22     0.0484
6     5.22     0.6084
7     5.22     3.1684
8     5.22     7.7284
3     5.22     4.9284
5     5.22     0.0484
7     5.22     3.1684
2     5.22    10.3684
Sum           31.5556
Figure 8
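With a sum of squared deviations of 31.5556 and n – 1 = 8, the sample standard deviation is
$s = \sqrt{31.5556 / 8} = 1.986$.
The sample variance is just the square of the sample standard deviation.
Sample variance = $s^2$ = (1.986)² = 3.944.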
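The same measures can be computed for this sample with Python's standard `statistics` module. One caveat: as noted earlier, percentiles can be estimated in different ways, so the quartiles below reflect the default method of `statistics.quantiles` (Python 3.8 or later) and may differ slightly from what Excel or another tool reports.

```python
import statistics

# Sample data from the worked example.
data = [4, 5, 6, 7, 8, 3, 5, 7, 2]

sample_range = max(data) - min(data)

# Quartiles using the module's default ("exclusive") estimation method.
q1, q2, q3 = statistics.quantiles(data, n=4)
interquartile_range = q3 - q1

sample_sd = statistics.stdev(data)            # divides by n - 1
sample_variance = statistics.variance(data)   # square of the sample standard deviation

print(f"range = {sample_range}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {interquartile_range}")
print(f"s = {sample_sd:.3f}")
print(f"s^2 = {sample_variance:.3f}")
```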
Histogram Analysis
Introduction
A histogram is a graph that displays data with the class interval on the horizontal axis and the class
frequency (or relative frequency) on the vertical axis. The frequencies (or relative frequencies) are
indicated by rectangles whose areas are proportional to the frequencies (or relative frequencies)
represented by the data. A histogram provides us a graphical representation of the data.
Histogram Analysis
Example: Suppose we have data from a simple random sample of the wait times (in minutes) for n =
20 patients at a local urgent care center. The wait time data are given below.
32 49 27 31 39 44 40 45 36 28 35 31 33 40 47 39 31 36 35 42
Figure 9
There are different approaches to the construction of a histogram and, of course, software like Excel will
easily construct histograms. What follows is one of the standard approaches to constructing histograms
manually. Basically, to construct a histogram, we need to determine the following:
1. The number of non-overlapping classes.
2. The width of each of the non-overlapping classes.
3. The limits (lower and upper) of each of the non-overlapping classes.
Step 1
First, we need to divide the data into non-overlapping classes (or class intervals) so that we capture all
the data for our histogram. So, how many class intervals should we use? One answer to this question
is known as the “$2^k$ Rule.” Keep in mind that there is some “trial and error” involved in generating
histograms. The $2^k$ Rule suggests that we choose k number of class intervals such that:
$2^k > n$
Since our sample size (n) is 20, we want to choose k such that $2^k > 20$. The first value of k that meets
this condition is k = 5 ($2^4 = 16$ is too small, while $2^5 = 32$). So, we consider five class intervals.
Step 2
Next, we need to determine the width of each class. Again, there is a “rule of thumb” that assists us
here. The rule of thumb suggests we choose the width (i) such that:
$i \geq \frac{\text{largest value} - \text{smallest value}}{k}$
Step 3
Using this rule, we obtain the following:
$i \geq \frac{49 - 27}{5} = 4.40$
If possible, it is sometimes more convenient to use integers for our class widths, so let’s round 4.40
up to 5.
Step 4
Finally, we need to determine the limits (lower and upper) of our classes. In general, it is helpful to
choose the limits such that the lower limit of our first class is below our smallest data value and the
upper limit of our last class is greater than our largest data value. See the following:
Class       Class Frequency   Class Relative Frequency
26 to 30     2                0.10
31 to 35     7                0.35
36 to 40     6                0.30
41 to 45     3                0.15
46 to 50     2                0.10
Total       20                1.00
Figure 10
Next, we inspect our data and count the number of values that fall into each of our five classes (this
provides us with the class frequencies). We can also divide each class frequency by the total sample
size and obtain the class relative frequency. Either of these (class frequency or class relative frequency)
can be used to generate a histogram. Below is a histogram of the data using class frequency.
Figure 11
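The manual steps above can be sketched in Python as a cross-check. Two assumptions are made here: the width rule is taken to be (largest value – smallest value) ÷ k, which reproduces the 4.40 quoted in the text, and the first lower limit of 26 is simply the hand-picked value used in the class table. The text-based bars are only a rough stand-in for the plotted histogram.

```python
import math

# Wait times (in minutes) for the 20 urgent care patients.
wait_times = [32, 49, 27, 31, 39, 44, 40, 45, 36, 28,
              35, 31, 33, 40, 47, 39, 31, 36, 35, 42]
n = len(wait_times)

# Step 1: smallest k such that 2**k > n.
k = 1
while 2 ** k <= n:
    k += 1

# Steps 2 and 3: width of at least (largest - smallest) / k, rounded up to an integer.
raw_width = (max(wait_times) - min(wait_times)) / k
width = math.ceil(raw_width)

# Step 4: class limits, starting just below the smallest value (26 is hand-picked).
first_lower_limit = 26
classes = [(first_lower_limit + j * width, first_lower_limit + (j + 1) * width - 1)
           for j in range(k)]

# Count the values in each class, then convert to relative frequencies.
frequencies = [sum(1 for x in wait_times if low <= x <= high) for low, high in classes]
relative_frequencies = [f / n for f in frequencies]

print(f"k = {k}, raw width = {raw_width:.2f}, width = {width}")
for (low, high), f, r in zip(classes, frequencies, relative_frequencies):
    print(f"{low} to {high}  {f:>2}  {r:.2f}  {'#' * f}")
```

The inclusive integer limits work here because the wait times are whole minutes; data with decimals would need limits defined so that no value can fall between two classes.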
Histogram Analysis
To illustrate the construction of a histogram, a small number of data values (n = 20) was used.
However, note that for a histogram to provide a reliable picture of the shape of a distribution, it often
requires much more data than this – maybe as much as 150 or 200 data values. That is, we would
typically not attempt to assess whether or not a data set is normally distributed with a histogram
generated from just 20 values.
Using Histograms for Analysis
In Six Sigma, histograms are often one of the first methods we use to provide us a graphical idea of
the distribution of the data relative to the process requirements. Again, if there is sufficient data, the
graph might be capable of providing us an initial hint at the shape (for example, that the data might
be normally distributed). However, to reliably answer the question of what shape the data is, we will
need more sophisticated tests. But histograms are good graphical tools for capturing the distribution of
the data relative to process requirements. Below are some examples of this use of histograms.
The histogram (Figure 12) illustrates the situation where the process is producing too close to a
requirement limit. All the data is within the customer specifications, but the distribution is nearing one
of the spec limits and, so, the risk of exceeding the upper requirement limit is increasing. Remember,
one of the operational definitions of quality in Six Sigma is “minimum deviation from the appropriate
target.”
Figure 12
The histogram of this distribution (Figure 13) shows the process is centered in the middle of
the tolerance, but there is a lot of variation in the process. If the variation becomes any larger, it is
likely this process will exceed the limits. Thus, there is too much variation in the process output.
Remember, whenever the data breaches those limits, a defect has occurred.
Figure 13
This histogram (Figure 14) indicates the process center is off the target and the process is currently
generating output that does not meet the upper requirement limit. Some action is required in order to
satisfy the customer and move the distribution back toward the middle and within the requirements.
Figure 14
We call this (Figure 15) a skewed distribution; that is, the data is “skewed” toward the larger values.
Additionally, the mean is not in the center of the two specifications.
Figure 15
This histogram (Figure 16) shows a bimodal distribution, which typically represents data from
two different processes being mixed together: day shift vs. night shift, or machine A vs. machine B.
This also represents inconsistent performance, which is not desirable.
Figure 16
Conclusion
Pay attention to the histograms of your process. You can learn a lot about a process just by looking
carefully at the data.
Graphical Representations of Data
Introduction
There is a saying in management that goes something like this: a manager would rather live with a
problem they can’t solve than accept a solution they don’t understand. As a practitioner of Six Sigma,
this is an observation we must carefully consider, and it has a bearing on the topic of presenting data.
Six Sigma is evidence-based decision making – it is driven by data. We certainly desire to develop data
analysis skills so that we may more effectively analyze data. However, there is another imperative: we
must be able to communicate effectively to our audience. Graphical summaries of data will not only
aid us in our analysis – they are indispensable when it comes to presenting our results.
The Stem-and-Leaf Plot
Stem-and-leaf plots allow us to present the data in a way that maintains the visibility of the data
values. Moreover, a stem-and-leaf plot can afford us an early look at what might be the shape of a
distribution. In the stem-and-leaf plot, each data value is split into a stem (the first digit of the number) and a
leaf (the last digit of the number). The stem-and-leaf plot is especially useful for displaying a small
amount of data.
Example: suppose we are given the following heights of a group of two-year-olds:
2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, and 5.0.
We can use the whole number as the stem and the decimal value as the leaf to obtain the following
stem-and-leaf plot.
Stem Leaves
2 3, 5, 5, 7, 8
3 2, 6, 6
4 5
5 0
Figure 17
Again, the stem-and-leaf plot organizes the data in a way that indicates the spread of the data and
hints at the possible shape (in this case, skewed to the right). Two benefits of a stem-and-leaf
plot are that it affords us a very concise presentation of the data and that it maintains the visibility of the
individual data values. Recall that a histogram organizes the data into class intervals (so the individual
values aren’t visible).
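The construction above can be sketched in a few lines of Python. This is only an illustration using the ten heights from the example; the variable names are our own, and we assume every value has exactly one decimal place.

```python
# Heights from the example; each value has one decimal place.
heights = [2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0]

plot = {}
for value in sorted(heights):
    # Work in whole tenths to avoid floating-point surprises,
    # then split into stem (whole number) and leaf (decimal digit).
    tenths = round(value * 10)
    stem, leaf = divmod(tenths, 10)
    plot.setdefault(stem, []).append(leaf)

print("Stem  Leaves")
for stem in sorted(plot):
    print(f"{stem:<5} {', '.join(str(leaf) for leaf in plot[stem])}")
```

The printed rows should match the stem-and-leaf plot shown in Figure 17.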
The Box-and-Whiskers Plot
The box-and-whiskers plot (also called a boxplot) is another graphical representation that is quite
useful to us. It has the benefit of incorporating some numerical summaries (referred to as the five-number
summary) of the data within the plot. It was introduced by John Tukey (one of the most influential
statisticians of the 20th century) in his book Exploratory Data Analysis.
Example: suppose we have data from a simple random sample of the wait times (in minutes) for n =
20 patients at a local urgent care center. The wait time data are given below.
32 49 27 31 39 44 40 45 36 28 35 31 33 40 47 39 31 36 35 42
Figure 18
The first thing we need to do is sort the data into ascending order and indicate the rank.
Data 27 28 31 31 31 32 33 35 35 36 36 39 39 40 40 42 44 45 47 49
Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Figure 19
The “five-number summary” contained in the box-and-whiskers plot consists of the following values:
• The maximum value of the data.
• The minimum value of the data.
• The median of the data.
• The first quartile (Q1), also called the 25th percentile.
• The third quartile (Q3), also called the 75th percentile.
Just by inspection of the data, we can see the maximum value is 49 and the minimum value is 27.
Recall that the median is the “middle” of the data once the data has been ordered. Since we have an
even number of data values (n = 20), to obtain the median, we must average the two middle values
(that is, the values 36 and 36, which are in ranks 10 and 11). Therefore, the median of the data is 36.0.
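As a quick check of these three values, here is a minimal Python sketch using the 20 wait times from the example. The variable names are our own, and the manual median assumes an even sample size, as in this example.

```python
from statistics import median

# Wait times (minutes) from the example, in the order they were given.
wait_times = [32, 49, 27, 31, 39, 44, 40, 45, 36, 28,
              35, 31, 33, 40, 47, 39, 31, 36, 35, 42]

ordered = sorted(wait_times)
n = len(ordered)

# With an even n, the median is the average of ranks n/2 and n/2 + 1
# (list indices are zero-based, so subtract 1 from each rank).
middle_pair = (ordered[n // 2 - 1], ordered[n // 2])
manual_median = sum(middle_pair) / 2

print("minimum:", ordered[0])
print("maximum:", ordered[-1])
print("median :", manual_median, "(library check:", median(wait_times), ")")
```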
The quartiles (or percentiles) are a bit more complicated to estimate precisely (there are several location
algorithms that can do this). We will employ one of the common algorithms to estimate the location
of a percentile. To use this approach, we first identify the location (rank) of the desired percentile and
then we estimate its value.
Estimating Q1 and Q3
First, we need to find the location of the 25th percentile. To do so, we use the following formula:
Lp = (n + 1)(p / 100)
Here, p is the percentile and n is the sample size. Therefore, we obtain the following for the position
of Q1, or the 25th percentile:
L25 = (20 + 1)(25 / 100) = 5.25
The Lp of 5.25 tells us the 25th percentile (or first quartile) is in rank (or location) 5.25 in our data.
That is, Q1 is one-quarter of the way between ranks 5 and 6. So, the 25th percentile (or Q1) is 31.25,
found by:
Q1 = 31 + (0.25)(32 – 31) = 31.25
That is, we begin with the value in rank 5 (31) and move 25% of the distance toward rank 6 (32).
To estimate Q3, we follow the same process:
L75 = (20 + 1)(75 / 100) = 15.75
Therefore, Q3 is in location (or rank) 15.75, or 75% of the distance from the value in rank 15 (40)
towards the value in rank 16 (42).
Q3 = 40 + (0.75)(42 – 40) = 41.50
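The two-step estimate (find the location, then interpolate) can be sketched as a small Python function. This is one possible implementation, assuming the location rule (n + 1)(p / 100) implied by the ranks 5.25 and 15.75 used above; the function name and the handling of locations that fall outside the data are our own choices.

```python
import math

def percentile_by_location(data, p):
    """Estimate the p-th percentile using location = (n + 1) * p / 100."""
    ordered = sorted(data)
    n = len(ordered)
    location = (n + 1) * p / 100       # rank, counted from 1
    lower_rank = math.floor(location)  # whole part of the rank
    fraction = location - lower_rank   # distance toward the next rank
    # Assumption: if the location falls off either end, use the end value.
    if lower_rank < 1:
        return ordered[0]
    if lower_rank >= n:
        return ordered[-1]
    lower = ordered[lower_rank - 1]    # value at the lower rank
    upper = ordered[lower_rank]        # value at the next rank
    return lower + fraction * (upper - lower)

wait_times = [32, 49, 27, 31, 39, 44, 40, 45, 36, 28,
              35, 31, 33, 40, 47, 39, 31, 36, 35, 42]

q1 = percentile_by_location(wait_times, 25)
q3 = percentile_by_location(wait_times, 75)
print("Q1 =", q1)
print("Q3 =", q3)
```

Other software may use a different location rule and report slightly different quartiles. In Python's standard library, `statistics.quantiles(wait_times, n=4)` with its default "exclusive" method follows the same (n + 1) rule.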
Now we have all the values for our five-number summary, and we are ready to create the
box-and-whiskers plot.
• The maximum value of the data = 49: this is the upper whisker.
• The minimum value of the data = 27: this is the lower whisker.
• The median of the data = 36: this is the middle of the box.
• The first quartile (Q1), also called the 25th percentile = 31.25: this is the bottom of the box.
• The third quartile (Q3), also called the 75th percentile = 41.5: this is the top of the box.
Also, note that the graph includes the average (37) indicated by an “X” in the box.
Figure 20
The box-and-whiskers plot contains a significant amount of information. For example,
• Since the first quartile (Q1, or the 25th percentile) is 31.25, then we can infer that 25% of the
data is less than 31.25 and 75% of the data is greater than 31.25.
• Since the third quartile (Q3, or the 75th percentile) is 41.5, then we can infer that 75% of the data
is less than 41.5 and 25% of the data is greater than 41.5.
• Recall that the interquartile range (IQR) is Q3 – Q1. In this case, the IQR is 41.5 – 31.25, or
10.25. The IQR is the width of the interval that contains the middle 50% of the data.
• Also, since the box is close to the center of the distribution of the data, the median and mean are
close to each other, and the whiskers are fairly symmetric (not perfectly so, but relatively speaking),
we will not be surprised if a formal statistical test indicates the data is close to being normally
distributed.
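These readings can be checked directly against the 20 wait times. The sketch below takes Q1 = 31.25 and Q3 = 41.5 from the text rather than recomputing them; the variable names are our own.

```python
from statistics import mean, median

wait_times = [32, 49, 27, 31, 39, 44, 40, 45, 36, 28,
              35, 31, 33, 40, 47, 39, 31, 36, 35, 42]
q1, q3 = 31.25, 41.5   # quartiles taken from the text

n = len(wait_times)
iqr = q3 - q1
share_below_q1 = sum(x < q1 for x in wait_times) / n
share_above_q3 = sum(x > q3 for x in wait_times) / n
share_in_box = sum(q1 <= x <= q3 for x in wait_times) / n

print("IQR:", iqr)
print("share below Q1:", share_below_q1)
print("share above Q3:", share_above_q3)
print("share inside the box:", share_in_box)
print("mean:", mean(wait_times), "median:", median(wait_times))
```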
Finally, one of the advantages of a boxplot is the ability to generate comparisons. For example, suppose
we have data on the time to complete a task using two different methods. The data can easily be
compared using a box-and-whiskers plot.
Figure 21
The Defect Location Check Sheet
A defect location check sheet (also known as a defect map) is a structured, prepared form for collecting
and analyzing data that provides a visual image of the item being evaluated. It is used when a simple
visual count is needed to determine how often defects occur and where they recur.
To construct a defect location check sheet, place a visual cue, such as an orange dot, on an image or
sketch of the item to indicate the location of a defect.
Example: we use a sketch of a person to identify the location of injuries.
Figure 22
A defect map is a common quality assurance practice when we need to count the defects and note the
location of occurrence. Some of the benefits of using a defect map are that it collects data with minimal
effort and the data is easily converted into useful information.
Pareto Chart
Introduction
A Pareto chart – also called a Pareto distribution diagram – is a vertical bar graph in which values are
plotted in decreasing order of relative frequency from left to right. The Pareto chart is named for Vilfredo
Pareto, an Italian economist who developed the 80/20 rule while analyzing income distribution. The
80/20 rule, or Pareto Principle, suggests 80% of the most important issues are generated by 20% of
the sources. Today we often apply the principle as follows: 80% of the results are from 20% of the
actions employed.
One of the challenges for a Six Sigma project team is correctly identifying the most critical relationships
between process variables and performance metrics. By helping us prioritize problems, the Pareto
chart increases the likelihood we capture the most significant relationships.
Example: suppose a small pizza restaurant has collected data on customer complaints concerning
pizza delivery. Using this data, they generate a Pareto chart.
Complaint                  Frequency
Pizza arrives late         22
Incorrect pizza toppings   9
Incorrect crust            11
Incorrect size             4
Pizza damaged              3
Figure 23
Figure 24
Note the following characteristics of the Pareto chart:
• The causes are placed in descending order of occurrence as indicated by the bars.
• The graph shows the percentages and the cumulative percentages. For example, “Pizza arrives
late” accounts for 44.9% of the complaints and the first three causes account for 85.7% of all
complaints.
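The ordering and the percentages behind a Pareto chart can be reproduced with a short Python sketch. Note one assumption: we use a count of 22 for “Pizza arrives late”, since that is the value consistent with the 44.9% and 85.7% figures quoted above (22 of 49 complaints). The variable names are our own.

```python
# Complaint counts; 22 for "Pizza arrives late" is an assumption that
# matches the 44.9% and 85.7% figures quoted in the text.
complaints = {
    "Pizza arrives late": 22,
    "Incorrect pizza toppings": 9,
    "Incorrect crust": 11,
    "Incorrect size": 4,
    "Pizza damaged": 3,
}

total = sum(complaints.values())
# A Pareto chart orders the causes from most to least frequent.
ordered = sorted(complaints.items(), key=lambda item: item[1], reverse=True)

rows = []
running = 0
for cause, count in ordered:
    running += count
    percent = round(100 * count / total, 1)
    cumulative = round(100 * running / total, 1)
    rows.append((cause, count, percent, cumulative))

print(f"{'Complaint':<26}{'Count':>6}{'%':>8}{'Cum. %':>8}")
for cause, count, percent, cumulative in rows:
    print(f"{cause:<26}{count:>6}{percent:>8}{cumulative:>8}")
```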
Run Chart
Introduction
A run chart is a type of line graph in which the data is plotted in time order. That is, the vertical
axis represents the values of the data and the horizontal axis is a measure of time. It is customary to
place a horizontal line to indicate the average of the data and to provide a visual idea of the variation
in the data across time.
By displaying our data over time, we can examine the run chart for evidence of patterns. But keep in
mind that we will need to follow our examination of the run chart with a more rigorous statistical
test for any pattern or change that we might notice. That is, a run chart is not a control chart or a
hypothesis test regarding the average, so it is not as powerful as other tools for detecting changes.
However, we can still use a run chart to help identify potential sources of special cause variation by
employing tests for randomness in the data (or lack thereof).
Example: suppose we want to use a run chart to examine the data from our golf outings. The data is
shown below.
Golf Outings
Month       1    2    3    4    5    6
May         83   84   86   89   86
June        84   80   85   86   88   87
July        85   86   87   92   88   84
August      86   87   87   85   83   82
September   83   87   85   87
Figure 25
To construct a run chart, we place the data in a time series in order of occurrence.
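As a rough sketch of that step, the Python below strings the monthly rows into one series in order of occurrence and computes the average for the center line. We assume the scores in each row are listed in the order the outings took place; the variable names are our own.

```python
from statistics import mean

# Golf scores by month, in the order the outings occurred (assumed).
scores_by_month = {
    "May": [83, 84, 86, 89, 86],
    "June": [84, 80, 85, 86, 88, 87],
    "July": [85, 86, 87, 92, 88, 84],
    "August": [86, 87, 87, 85, 83, 82],
    "September": [83, 87, 85, 87],
}

# Flatten the table into a single time-ordered series.
series = [score for scores in scores_by_month.values() for score in scores]
center_line = mean(series)  # the horizontal line on the run chart

print("number of outings:", len(series))
print("center line (average):", round(center_line, 2))
for outing, score in enumerate(series, start=1):
    side = "above" if score > center_line else "below"
    print(f"outing {outing:>2}: {score} ({side} the average)")
```

Plotting `series` against the outing number, with a horizontal line at `center_line`, gives the run chart.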
Figure 26
The first thing we notice about the run chart is simply how much variation there is in the golf scores.
There are statistical tests for non-random behavior that can be applied to a run chart, but those tests
are best performed with appropriate statistical software. Using those tests, we could check for
non-random behaviors like trends, clustering, mixtures, and oscillation.
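To give a feel for what such tests examine, the sketch below simply tallies the runs (consecutive points on the same side of the center line) in the golf scores. It is not a formal test: it produces no p-value, and using the average as the reference line is our assumption here; some software uses the median instead.

```python
from itertools import groupby

# Golf scores in order of occurrence (May through September).
series = [83, 84, 86, 89, 86,
          84, 80, 85, 86, 88, 87,
          85, 86, 87, 92, 88, 84,
          86, 87, 87, 85, 83, 82,
          83, 87, 85, 87]

center = sum(series) / len(series)

# Label each point by its side of the center line
# (no score equals the average here, so there are no ties).
sides = ["above" if score > center else "below" for score in series]

# A run is a stretch of consecutive points on the same side.
runs = [(side, len(list(group))) for side, group in groupby(sides)]

print("number of runs:", len(runs))
print("longest run:", max(length for _, length in runs))
```

Very few runs can hint at clustering or a trend, while very many can hint at mixtures or oscillation. Judging what counts as “few” or “many” is the job of the formal test.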