Green Belt Case study

Six Sigma Green Belt Book 4 | Module 4

©2020 Bisk Education, Inc. and Villanova University. All rights reserved.

No part of this document may be reproduced in any form or by any electronic or mechanical means,

including information storage and retrieval systems, without written permission from the copyright owner.

Company, product, and service names used herein may be trademarks of their respective owners and

are used in an editorial fashion with no intention of infringement of the respective owner’s trademark

rights. The information in this study guide is distributed on an “as is” basis, without warranty. Neither

the copyright owner nor the author(s) shall have any liability to any person or entity with respect to any

actual or alleged damage caused by the information contained herein.

Permission to print this document is limited to one copy per student.

Table of Contents

Module 4

Introduction
Objectives
Assignment Checklist
Introduction to the Measure Phase
Variation and Measurement Discrimination
The Normal Curve
Other Distributions
Populations and Samples
Normality Test
Measures of Central Tendency
Measures of Dispersion
Histogram Analysis
Graphical Representations of Data
Pareto Chart
Run Chart

Module 4 Introduction

During Week 4, we discuss the next step in the DMAIC process, the Measure phase. Throughout this module, you will learn about variation, the normal curve, and populations and samples. You will also learn about measures of central tendency, measures of dispersion, and histogram analysis. You will be taught how to determine bin size when dealing with histograms as well as how to graphically represent data. Finally, you will become knowledgeable about Pareto and run charts.

Objectives

When you complete this module, you should be able to:

• Apply a Pareto analysis.
• Define your data collection plan.
• Interpret histograms.
• Describe the Motorola Shift.
• Calculate the process performance indices, Pp and Ppk, based on the current process.

Assignment Checklist

☐ ____________________________________
☐ ____________________________________
☐ ____________________________________
☐ ____________________________________
☐ ____________________________________
☐ ____________________________________
☐ ____________________________________
☐ ____________________________________
☐ ____________________________________
☐ ____________________________________
☐ ____________________________________
☐ ____________________________________
☐ ____________________________________
☐ ____________________________________

Introduction to the Measure Phase

Introduction

Recall that Six Sigma deploys a DMAIC Model approach to quality improvement: Define, Measure, Analyze, Improve, and Control (DMAIC). For each phase, the Six Sigma team has a set of “deliverables,” or tollgates. We have already covered the three tollgates of the Define phase: develop project charter, assess customer needs and requirements, and construct a working process map. Now, we are ready to discuss the second major phase of DMAIC: the Measure phase.

The Measure phase also has three tollgates: establish appropriate performance indicators, develop a data collection plan for any data that is required, and establish baseline performance measures. In the Measure phase, the Six Sigma team must obtain a clear understanding of both the data and the processes. It is in the Measure phase that the team assesses the adequacy of the measurement system and establishes the performance metrics and their baseline values that will become central to the project. These tasks are straightforward conceptually and sometimes challenging in practice.

• Establish appropriate performance indicators: what are the critical-to-quality variables? Which performance measures are necessary for the project as it has been defined?

• Develop a data collection plan: what do we have and what do we need to collect? The project will be driven by the data we utilize, so the team must have valid and appropriate data.

• Establish baseline performance measures: the team must obtain an estimate of the current state of the process regarding the ability to meet customer requirements. This “before” picture of the important processes will allow a realistic assessment of current performance and will be utilized to assess the team’s progress.

Leading vs. Lagging Indicators

It is important to distinguish between leading and lagging indicators when assessing performance.

• Leading indicators – inputs measured upstream in the process.

• Lagging indicators – the outcomes of the things that go on upstream.

For example, what would be the lagging indicators and the leading indicators of weight?

• Lagging indicator – stepping on a scale (no matter how often) would be a lagging indicator because it is an outcome resulting from things further up in the process.

• Leading indicators – diet (calories, fat grams, sodium, carbohydrates, etc.) and exercise would be leading indicators because they are inputs that occur upstream in the process.

Focusing on one or two leading indicators thought to influence some lagging goal will help us get started through the Measure phase of DMAIC.

Tollgates of the Measure Phase

Performance indicators identified. Questions you can ask include:

• Are those things measurable?

• What effect do they have on the customers?

• What are the customer’s requirements?

• Are there leading indicators?

Data collection plan. Questions you can ask include:

• What type of measure are we looking at? What is the type of data?

• Operational definitions are also important here. They cover anything the team can identify and relate to as something to measure (a starting point and stopping point, targets, etc.).

Baseline Performance Measurement

Note – reaching Six Sigma means there are only 3.4 defects per million opportunities. For example, if we all live to 95 years old and we divide our lives into hours, how many bad hours would we experience in our lifetimes? At the Six Sigma level, we would experience fewer than three bad hours (95 years is roughly 832,000 hours, and 832,000 × 3.4 ÷ 1,000,000 ≈ 2.8).

• It starts with yield. How many out of a hundred? Out of a thousand? Out of a million?

• Sigma helps break it down into sizable chunks.

• A baseline sigma value is often computed.
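
The lifetime arithmetic in the note above can be checked with a few lines of Python. This is only an illustrative sketch and not part of the course material: the function name `expected_defects` and the 365-day year are assumptions made for the example.

```python
# Illustrative check of the "bad hours in a lifetime" example.
# Assumptions: a 365-day year and 3.4 defects per million opportunities.

SIX_SIGMA_DPMO = 3.4  # defects per million opportunities at the Six Sigma level


def expected_defects(opportunities: float, dpmo: float) -> float:
    """Expected number of defects for a given number of opportunities."""
    return opportunities * dpmo / 1_000_000


lifetime_hours = 95 * 365 * 24  # every hour of a 95-year life is one opportunity
bad_hours = expected_defects(lifetime_hours, SIX_SIGMA_DPMO)
print(f"{lifetime_hours} hours -> about {bad_hours:.1f} bad hours")
```

Under these assumptions the result is a little under three hours, which is the order of magnitude the example is meant to convey.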

Variation and Measurement Discrimination

Introduction

This lecture will discuss variation and how it always exists if you measure finely enough.

Measurement Discrimination

Measurement discrimination is the ability of a measurement to detect differences between items; it depends on the degree of resolution of the measurement.

Let us say there are four bowls labeled A, B, C, and D as shown in Figure 1. The thickness of the bowls appears to be the same; however, the results are different depending on the resolution of your measurement.

Measuring to a tenth of an inch (reading one decimal place), the thickness of every bowl is 0.1:

Figure 1

Measuring to a hundredth of an inch (reading two decimal places) reveals that B and C read 0.12 while A and D read 0.15:

Figure 2

Measuring to a thousandth of an inch shows that the thicknesses of the four bowls are all different – A is 0.153, B is 0.121, C is 0.126, and D is 0.155:

Figure 3

We call this measurement discrimination. What if we measure the height of everyone in the facility?

• If we measure to the nearest yard, it would look like everyone is two yards tall.

• If we measure to the nearest foot, the heights vary from four to seven feet tall.

These examples illustrate measurement discrimination.
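
The bowl example can be reproduced with a short Python sketch. It assumes the instrument simply drops the digits it cannot resolve (truncation), which is the reading that matches the figures above (0.126 reads as 0.12 at two decimal places). The helper name `read_at_resolution` is hypothetical.

```python
from decimal import Decimal, ROUND_DOWN

# Bowl thicknesses (inches) from the example, kept as strings so that the
# decimal digits are exact.
thickness = {"A": "0.153", "B": "0.121", "C": "0.126", "D": "0.155"}


def read_at_resolution(value: str, decimals: int) -> Decimal:
    """Reading shown by an instrument that resolves `decimals` decimal places.

    Assumption: digits beyond the resolution are dropped, not rounded.
    """
    step = Decimal(1).scaleb(-decimals)  # 0.1, 0.01, 0.001, ...
    return Decimal(value).quantize(step, rounding=ROUND_DOWN)


for decimals in (1, 2, 3):
    readings = {bowl: read_at_resolution(v, decimals) for bowl, v in thickness.items()}
    distinct = len(set(readings.values()))
    shown = ", ".join(f"{bowl}={r}" for bowl, r in readings.items())
    print(f"{decimals} decimal place(s): {shown} -> {distinct} distinct value(s)")
```

The same four bowls show one, two, and then four distinct thicknesses as the resolution increases, which is the point of the example: the variation was always there.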

Variation

As we know, reducing process variation is a fundamental objective of the Six Sigma methodology and tools. Reducing process variation can increase customer satisfaction, market share, and the overall financial bottom line.

• Some degree of process variation is natural and unavoidable. Businesses must be able to rely on their processes to consistently produce output with acceptable levels of variation.

• It is the responsibility and obligation of the supplier to produce products or services meeting the predetermined customer specifications and/or requirements.

• There are many causes of variation both within and outside the process.

• Many businesses know that they have fluctuations from the expected parameters of their processes. They accept it and choose to live with it because:

- It is at a level that does not impact the quality of the product/service.

- The level of variation continues to allow the products/services to meet customer specifications.

- It would be extremely costly to attempt to reduce the variation further and the results of this reduction would not justify the costs.

- The technology that would allow the variation to be decreased further does not exist.

• However, not knowing the causes of process variation and how to control it is a very costly mistake. As we have learned, the cost of poor quality can approach 30% in an organization.

Customers value consistency of product and service and want to have minimal variation from lot/batch/usage to lot/batch/usage. This reliability makes their processes more predictable and gives higher quality. All other things being equal, a customer is more likely to want to purchase products and services having less variation. As a matter of fact, market research has shown that many customers are willing to pay 20% more for higher quality.

Types of Variation

Common cause variation is naturally occurring and is present in all processes. Special cause variation is unexpected and usually results from an unusual event or occurrence. Control charts and run charts are good tools to visualize variation. Processes must be stable in order for assessment and/or improvement to take place.

Conclusion

Remember: if you do not think there is variation in your process, there is. In a Lean Six Sigma project, you will be able to measure at a resolution fine enough to see it. This requires the appropriate calibrated instruments or gauges. When you start improving a process and reducing variation, it may become harder to identify the variation as it decreases.

The Normal Curve

Introduction

One of the most often used models in statistics is the normal curve, or the normal probability distribution (commonly referred to as “the bell curve”). The data we encounter can often be modeled using the normal probability distribution (though not always!). Many inferential statistical methods assume that some aspect of the data is indeed normally distributed. The shape of the normal distribution is easily recognized.

Figure 7

Characteristics of the Normal Probability Distribution

In general, if we know the data are normally distributed, by using the mean and the standard deviation, we may make some statements about the spread of the data. This result is sometimes referred to as the Empirical Rule. Note that the total area under the curve of a normal probability distribution – like all valid probability distributions – equals one.

• The area between plus and minus one standard deviation contains approximately 68 percent of the data.

• The area between plus and minus two standard deviations contains approximately 95 percent of the data.

• The area between plus and minus three standard deviations contains approximately 99.7 percent of the data.

We should note, however, that these percentages assume the data is a very good fit with a theoretical normal probability distribution.
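
The Empirical Rule percentages can be verified numerically. The sketch below uses the standard relationship between the normal curve and the error function (`math.erf` in Python's standard library); the function name `area_within` is an assumption made for illustration.

```python
import math


def area_within(k: float) -> float:
    """Area under a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within ±{k} standard deviation(s): {area_within(k):.2%}")
```

The printed values are the theoretical areas for a perfectly normal distribution; real data will only approximate them.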

Since the normal probability distribution is determined by the mean and the standard deviation, there are many different normal distributions (all with this basic shape). In statistics, we refer to this as a “family” of normal distributions. Additionally, we can make the following observations:

• The highest point on the curve is at the mean.

• A normal distribution is symmetric about its mean.

• The standard deviation determines how “flat” and “wide” the curve is.

Other Distributions

Introduction

Although we will make extensive use of the normal distribution or Z distribution in this course, there are other distributions that you need to be aware of. These include:

• T-distribution.

• Binomial.

• Poisson.

• Chi-square.

Don’t be concerned if you don’t completely understand these distributions at this point. We are just making you aware of their existence and they will be covered in more detail as the course proceeds.

T-Distribution

Characteristics of the T-distribution:

• Utilized for variables or continuous data.

• Bell shaped and symmetrical about the mean.

• Looks similar to the Z-distribution.

Use of the T-statistic:

• Used for small sample sizes of less than 30. With larger sample sizes, either the T- or Z-distribution can be used, although the Z-distribution is easier for large sample sizes.

• Hypothesis testing about means.

• Testing the significance of regression coefficients and other statistics.

• Drawing conclusions about populations from a small sample size of less than 30.

• Used when we don’t know the standard deviation of the population, which is needed for the Z-statistic.

• To calculate it, we need to know the mean of the sample group, the mean of the population, the sample standard deviation, and the sample size.

• Probabilities are obtained from a T-table. We need to know the degrees of freedom, which is n – 1.
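
As a rough illustration of how the four ingredients listed above combine, the sketch below computes a one-sample T-statistic as (sample mean – population mean) ÷ (s ÷ √n). The measurements, the population mean of 10.0, and the function name `one_sample_t` are all hypothetical; the resulting value would then be looked up in a T-table using n – 1 degrees of freedom.

```python
import math
import statistics


def one_sample_t(sample, population_mean):
    """Return (T-statistic, degrees of freedom) for a single sample."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    t = (sample_mean - population_mean) / (s / math.sqrt(n))
    return t, n - 1


# Hypothetical measurements compared with a hypothetical population mean.
sample = [9.8, 10.2, 10.4, 9.9, 10.1]
t, df = one_sample_t(sample, 10.0)
print(f"t = {t:.3f} with {df} degrees of freedom")
```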

Binomial Distribution

Use of the binomial distribution:

• Binomial means “two names.”

• Used for attribute or discrete data.

• Two possible outcomes (e.g., yes/no, heads/tails, good/bad, etc.).

• Purpose is to understand how likely outcomes are – the probability.

• Criteria for use:

- Small sample sizes.

- Population of more than 50.

- Replace the sample into the population between trials.

• If you can’t meet these criteria, then use the other distributions.
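
To make “how likely outcomes are” concrete, here is a minimal sketch of the binomial probability formula, P(k) = C(n, k) × p^k × (1 – p)^(n – k). The scenario (10 items, each with a 10% chance of being bad) and the function name `binomial_pmf` are assumptions chosen for illustration.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k occurrences in n independent two-outcome trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical example: 10 items checked, each with a 10% chance of being bad.
n, p = 10, 0.10
p_zero_bad = binomial_pmf(0, n, p)
p_at_most_one_bad = sum(binomial_pmf(k, n, p) for k in (0, 1))
print(f"P(0 bad) = {p_zero_bad:.4f}")
print(f"P(at most 1 bad) = {p_at_most_one_bad:.4f}")
```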

Poisson Distribution

Use of the Poisson distribution:

• Used for attribute or discrete data.

• We don’t know how many good results we could have.

• We count the number of defects, but we don’t know how many defects didn’t occur.

• Things that occur randomly over time or space (e.g., along distances or across areas or volumes).

• Rates (i.e., count per something [per day, per square foot, etc.]).
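
A short sketch of the Poisson probability formula, P(k) = rate^k × e^(–rate) ÷ k!, shows how a rate turns into probabilities for counts. The rate of 2 defects per day and the function name `poisson_pmf` are hypothetical values chosen only to illustrate the calculation.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of counting exactly k events when `rate` events are expected."""
    return rate**k * math.exp(-rate) / math.factorial(k)


# Hypothetical process averaging 2 defects per day.
rate = 2.0
p_none = poisson_pmf(0, rate)
p_more_than_three = 1 - sum(poisson_pmf(k, rate) for k in range(4))
print(f"P(no defects in a day) = {p_none:.4f}")
print(f"P(more than 3 defects in a day) = {p_more_than_three:.4f}")
```

Note that only the count of defects and the rate are needed; the number of defects that did not occur never enters the formula.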

Chi-Square Distribution

• Uses sample variances to determine whether two factors are related or dependent on one another.

• This distribution is beyond the scope of this Green Belt course.

Conclusion

In addition to the normal or Z-distribution, we explored the T-distribution, the binomial distribution, and the Poisson distribution and their usage in this course. This has just been an introduction with more in-depth discussion to follow.

Populations and Samples

Introduction

“Data! Data! Data! I can’t make bricks without clay,” exclaimed Sherlock Holmes. And so it is. Lean Six Sigma is data-based decision making. Where do we obtain our data? In the language of statistics, from populations and samples.

Populations and Samples

Populations are a complete collection of people, things, or data that you can analyze and make conclusions about. Samples are subsets of populations. A statistically valid random sample is a sample that is (1) randomly selected, (2) representative of the population, and (3) of sufficient size so that any inferences we make utilizing those samples are reliable. In the practice of Lean Six Sigma, we will usually (but not always) make use of samples. We make extensive use of sampling for the following reasons:

• The population is simply too large and testing or inspecting all elements of the population would be too expensive.

• Obtaining the data from the sample elements requires a destructive test.

• The population is dynamic; that is, the process that produces the elements of the population is not static, so we don’t have a fixed population. Think of an ongoing production or service process.

Below are some examples of populations vs. samples:

• An entire elementary school of students is a population.

• One hundred randomly selected second graders is a sample.

• All North American data centers are a population.

• The thirty largest data centers in the U.S. are a sample.

One topic that often arises in Lean Six Sigma is sample size. That is, we regularly confront the question: what is a sufficient sample size? Unfortunately, there is no simple answer to this question. There is no single value for the appropriate sample size that is sufficient for all our purposes. There are many approaches for estimating a sample size in statistics. The sample size required depends on many factors, including the amount of variation in the data and the resulting level of precision we wish to achieve. It is best to speak with a statistician about sample size issues.

Parameters and Statistics

An important distinction between populations and samples involves the concept of parameters and statistics. A parameter is a measurement, such as a mean or standard deviation, which represents an entire population. Parameters are denoted by a Greek letter. For example, the population standard deviation is denoted by the Greek letter sigma (σ).

A statistic is a measurement constructed from a random sample or a subset of that population. A statistic is denoted by a lowercase letter of the alphabet. For example, the sample standard deviation is denoted by s. In many statistical inferences, the sample standard deviation (s) is used as a point estimate of the population standard deviation (σ).

Normality Test

Introduction

Not all data exhibiting a bell shape is normally distributed. The normality of the data will have implications for how the data is analyzed and what inferences can be made from the data. To check the normality of the data, a normality test can be done.

Anderson-Darling Normality Test

A common test for normality is the Anderson-Darling Normality Test.

In the Anderson-Darling Normality Test, we have two hypotheses:

• Null Hypothesis (Ho) – the data points are normally distributed.

• Alternative Hypothesis (Ha) – the data points are not normally distributed.

The test calculates a p value to determine whether we reject or fail to reject the null hypothesis:

• If p ≥ .05, we fail to reject Ho and treat the data as normally distributed.

• If p < .05, we reject Ho and conclude that the data is not normally distributed.

Conclusion

Rather than speculating qualitatively about normality based solely on the shape of the data distribution, we can quantitatively assess normality using the Anderson-Darling Normality Test. If p ≥ .05, we treat the data as normally distributed. If p < .05, the data is not normally distributed.
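
Statistical software normally performs this test, but the decision rule can be sketched in plain Python. The code below is an illustration under stated assumptions, not the method of any particular package: it computes the Anderson-Darling statistic with the mean and standard deviation estimated from the sample, applies a small-sample adjustment, and converts it to a p value with a commonly cited piecewise approximation. The function names are hypothetical, and the data are the 20 urgent care wait times used in the Histogram Analysis section of this module.

```python
import math
import statistics


def normal_cdf(z: float) -> float:
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def anderson_darling_normal(data):
    """Return (adjusted A-squared statistic, approximate p value)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    cdf = [normal_cdf((x - mean) / s) for x in sorted(data)]
    total = sum(
        (2 * i - 1) * (math.log(cdf[i - 1]) + math.log(1 - cdf[n - i]))
        for i in range(1, n + 1)
    )
    a2 = -n - total / n
    a2_adj = a2 * (1 + 0.75 / n + 2.25 / n**2)  # small-sample adjustment
    # Assumed piecewise approximation for the p value (illustrative only).
    if a2_adj >= 0.6:
        p = math.exp(1.2937 - 5.709 * a2_adj + 0.0186 * a2_adj**2)
    elif a2_adj > 0.34:
        p = math.exp(0.9177 - 4.279 * a2_adj - 1.38 * a2_adj**2)
    elif a2_adj > 0.2:
        p = 1 - math.exp(-8.318 + 42.796 * a2_adj - 59.938 * a2_adj**2)
    else:
        p = 1 - math.exp(-13.436 + 101.14 * a2_adj - 223.73 * a2_adj**2)
    return a2_adj, p


wait_times = [32, 49, 27, 31, 39, 44, 40, 45, 36, 28,
              35, 31, 33, 40, 47, 39, 31, 36, 35, 42]
a2_wait, p_wait = anderson_darling_normal(wait_times)
decision = "fail to reject Ho" if p_wait >= 0.05 else "reject Ho"
print(f"A-squared = {a2_wait:.3f}, p = {p_wait:.3f}: {decision}")

# A deliberately non-normal, hypothetical data set for contrast.
skewed = [2**i for i in range(20)]
a2_skewed, p_skewed = anderson_darling_normal(skewed)
print(f"A-squared = {a2_skewed:.3f}, p = {p_skewed:.6f}")
```

In practice the p value reported by a statistics package should be preferred; the sketch is only meant to show where the number that is compared with .05 comes from.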

Measures of Central Tendency

Introduction

Measures of central tendency (also known as measures of location) attempt to identify the center of a set of data. The most often used measures are mean, median, and mode. Since we typically are working with samples in Six Sigma, we will focus on sample statistics (a “statistic” is a measure computed from a sample, while a “parameter” is a measure of the population).

Measures of Central Tendency

Perhaps the measure of location that is most familiar is the sample mean. The mean provides a measure of central location. The mean of a sample is the sum of all values in the sample divided by the number of data values:

x̄ = (∑ xi) ÷ n

Note that xi is just notation to cover each element of the sample. So, a sample of 5 elements is generically described as values x1 through x5. The summation symbol (∑) is a mathematical operator that tells us to sum the data.

To find the mean, add up the numbers and divide by the count of numbers you added. For example, consider the following sample data: 4, 5, 6, 7, 8, 3, 5, 7, 2.

• The sum is 47.

• There are nine data values.

• Therefore, the mean is 47 ÷ 9 = 5.22.

The median is a very different kind of measure of location. To obtain the median for a sample, we must:

1. Sort the data into ascending or descending order.

2. For a sample that has an odd number of data values, the median is the data value in the middle.

3. For a sample with an even number of data values, the median is the average of the two values in the center.

Consider the same sample data as above: 4, 5, 6, 7, 8, 3, 5, 7, 2.

• To obtain the median, we first sort the data: 2, 3, 4, 5, 5, 6, 7, 7, 8 (sorted from smallest to largest).

• Since the sample size is an odd number (nine), the median is the value in the middle (that is, it has four data values below it and four data values above it). Therefore, the median is in the fifth place, which is 5.

Note that if the sample size had been an even number (like eight), then we would have to average the middle two values after sorting the data.

Finally, the mode is simply the sample value that occurs most often. Note that it is possible to have more than one mode.

Again, consider the same data as above: 4, 5, 6, 7, 8, 3, 5, 7, 2. The mode is the data value that occurs most often, so this sample data has two modes: 5 and 7.
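
The three measures for this sample can be checked with Python's standard `statistics` module. This is a small sketch for verification only; it prints the sum and the count as well so that each step of the arithmetic can be confirmed.

```python
import statistics

data = [4, 5, 6, 7, 8, 3, 5, 7, 2]

mean = statistics.mean(data)        # sum of the values divided by how many there are
median = statistics.median(data)    # middle value once the data is sorted
modes = statistics.multimode(data)  # every value tied for most frequent

print(f"sorted data: {sorted(data)}")
print(f"sum = {sum(data)}, n = {len(data)}, mean = {mean:.2f}")
print(f"median = {median}, modes = {modes}")
```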

Measures of Dispersion

Introduction

A measure of location is of limited use without a measure of dispersion (also known as a measure of variation). Much of our time in Six Sigma is spent trying to decide if our data indicates some type of change that calls for action. As Donald Wheeler puts it in his book Understanding Variation: “While every data set contains noise, some data sets may contain signals. Therefore, before you can detect a signal within a given data set, you must first filter out the noise.” Measures of variation are one of the ways in which we filter data in our search for a signal.

Measures of Dispersion

There are four very common sample measures of variation: the range, the interquartile range, the standard deviation, and the variance.

The range is the most easily computed measure of variation. To obtain the sample range, we take the largest value in the sample and we subtract the smallest value. We will need to use the sample range when we discuss the process tool known as a control chart.

Sample Range = Largest value – Smallest value

The interquartile range is a measure of variation computed by estimating the third quartile of the sample (Q3, or the 75th percentile) and the first quartile (Q1, or the 25th percentile) and subtracting Q1 from Q3. Thus, the interquartile range is a measure of variation based on the middle fifty percent of the sample data.

Interquartile Range = Q3 – Q1

The percentiles of a sample can be estimated in different ways, but the calculation is easily completed with spreadsheet software like Excel.

The most often used measure of variation is the sample standard deviation. The sample standard deviation measures the variation of the sample data by summing the squared deviations of each data value from the sample average, dividing that sum by n – 1, and taking the square root of the result:

s = √( ∑ (xi – x̄)² ÷ (n – 1) )

We use the sample standard deviation extensively. While it may look complicated, it is easily computed with a basic statistics calculator or Excel.

Finally, the sample variance is just the square of the sample standard deviation.

Sample Variance = s²

Example: Consider the following sample data: 4, 5, 6, 7, 8, 3, 5, 7, 2.

The sample range is just the largest value minus the smallest value in the sample. Therefore:

Range = 8 – 2 = 6.

The sample standard deviation formula is applied as follows. Again, the sample standard deviation is the square root of the sum of the squared deviations from the average (mean) divided by the sample size minus one (n – 1).

x     Avg.    (x – Avg.)²
4     5.22    1.4884
5     5.22    0.0484
6     5.22    0.6084
7     5.22    3.1684
8     5.22    7.7284
3     5.22    4.9284
5     5.22    0.0484
7     5.22    3.1684
2     5.22    10.3684
Sum           31.5556

Figure 8

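Dividing the sum of the squared deviations by n – 1 = 8 and taking the square root gives the sample standard deviation: s = √(31.5556 ÷ 8) ≈ 1.986.

The sample variance is just the square of the sample standard deviation.

Sample variance = s² = (1.986)² ≈ 3.944.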
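
The worked example can be reproduced step by step in Python. The sketch below follows the same route as the table (squared deviations, divide by n – 1, square root) and then compares the result with the standard library's `statistics.stdev`; the variable names are chosen for illustration.

```python
import math
import statistics

data = [4, 5, 6, 7, 8, 3, 5, 7, 2]
n = len(data)

sample_range = max(data) - min(data)

mean = statistics.mean(data)
sum_sq_dev = sum((x - mean) ** 2 for x in data)  # sum of squared deviations
variance = sum_sq_dev / (n - 1)                  # sample variance
std_dev = math.sqrt(variance)                    # sample standard deviation

print(f"range = {sample_range}")
print(f"sum of squared deviations = {sum_sq_dev:.4f}")
print(f"variance = {variance:.3f}, standard deviation = {std_dev:.3f}")
print(f"library check: {statistics.stdev(data):.3f}")
```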

Histogram Analysis

Introduction

A histogram is a graph that displays data with the class interval on the horizontal axis and the class frequency (or relative frequency) on the vertical axis. The frequencies (or relative frequencies) are indicated by rectangles whose areas are proportional to the frequencies (or relative frequencies) represented by the data. A histogram provides us a graphical representation of the data.

Histogram Analysis

Example: Suppose we have data from a simple random sample of the wait times (in minutes) for n = 20 patients at a local urgent care center. The wait time data are given below.

32 49 27 31 39 44 40 45 36 28 35 31 33 40 47 39 31 36 35 42

Figure 9

There are different approaches to the construction of a histogram and, of course, software like Excel will easily construct histograms. What follows is one of the standard approaches to constructing histograms manually. Basically, to construct a histogram, we need to determine the following:

1. The number of non-overlapping classes.

2. The width of each of the non-overlapping classes.

3. The limits (lower and upper) of each of the non-overlapping classes.

Step 1

First, we need to divide the data into non-overlapping classes (or class intervals) so that we capture all the data for our histogram. So, how many class intervals should we use? One answer to this question is known as the “2^k Rule.” Keep in mind that there is some “trial and error” involved in generating histograms. The 2^k Rule suggests that we choose k number of class intervals such that:

2^k > n

Since our sample size (n) is 20, we want to choose k such that 2^k > 20. The first value of k that meets this condition is k = 5 (2^4 = 16 is too small, while 2^5 = 32). So, we consider five class intervals.

Step 2

Next, we need to determine the width of each class. Again, there is a “rule of thumb” that assists us here. The rule of thumb suggests we choose the width (i) such that:

i ≥ (largest value – smallest value) ÷ k

Step 3

Using this rule, we obtain the following:

i ≥ (49 – 27) ÷ 5 = 4.40

If possible, it is sometimes more convenient to use integers for our class widths, so let’s round 4.40 up to 5.

Step 4

Finally, we need to determine the limits (lower and upper) of our classes. In general, it is helpful to choose the limits such that the lower limit of our first class is below our smallest data value and the upper limit of our last class is greater than our largest data value. See the following:

Class       Class Frequency    Class Relative Frequency
26 to 30    2                  0.10
31 to 35    7                  0.35
36 to 40    6                  0.30
41 to 45    3                  0.15
46 to 50    2                  0.10
Total       20                 1.00

Figure 10

Next, we inspect our data and count the number of values that fall into each of our five classes (this provides us with the class frequencies). We can also divide each class frequency by the total sample size and obtain the class relative frequency. Either of these (class frequency or class relative frequency) can be used to generate a histogram. Below is a histogram of the data using class frequency.

Figure 11
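
The manual procedure above can be sketched in a few lines of Python. This is an illustration rather than the only way to bin data: the starting lower limit of 26 is taken from the class table in this example (it is a judgment call, chosen to sit just below the smallest value), and the text-based bars merely stand in for the plotted histogram.

```python
import math

wait_times = [32, 49, 27, 31, 39, 44, 40, 45, 36, 28,
              35, 31, 33, 40, 47, 39, 31, 36, 35, 42]
n = len(wait_times)

# Number of classes: the smallest k for which 2**k exceeds n.
k = 1
while 2**k <= n:
    k += 1

# Class width: (largest - smallest) / k, rounded up to a whole number.
raw_width = (max(wait_times) - min(wait_times)) / k
width = math.ceil(raw_width)

# Class limits. Assumption: the first class starts at 26, as in the table.
first_lower = 26
classes = [(first_lower + j * width, first_lower + (j + 1) * width - 1)
           for j in range(k)]

# Class frequencies and relative frequencies.
counts = [sum(lower <= x <= upper for x in wait_times) for lower, upper in classes]

print(f"k = {k}, raw width = {raw_width:.2f}, width used = {width}")
for (lower, upper), count in zip(classes, counts):
    print(f"{lower} to {upper}: {'#' * count:<8} {count:>2}  {count / n:.2f}")
```

Changing `first_lower` or `width` shifts values between classes, which is the “trial and error” mentioned in Step 1.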

Histogram Analysis

To illustrate the construction of a histogram, a small number of data values (n = 20) was used. However, note that for a histogram to provide a reliable picture of the shape of a distribution, it often requires much more data than this – maybe as much as 150 or 200 data values. That is, we would typically not attempt to assess whether or not a data set is normally distributed with a histogram generated from just 20 values.

Using Histograms for Analysis

In Six Sigma, histograms are often one of the first methods we use to provide us a graphical idea of the distribution of the data relative to the process requirements. Again, if there is sufficient data, the graph might be capable of providing us an initial hint at the shape (for example, that the data might be normally distributed). However, to reliably answer the question of what shape the data is, we will need more sophisticated tests. But histograms are good graphical tools for capturing the distribution of the data relative to process requirements. Below are some examples of this use of histograms.

The histogram (Figure 12) illustrates the situation where the process is producing too close to a requirement limit. All the data is within the customer specifications, but the distribution is nearing one of the spec limits, so the risk of exceeding the upper requirement limit is increasing. Remember, one of the operational definitions of quality in Six Sigma is “minimum deviation from the appropriate target.”

Figure 12

The histogram of this distribution (Figure 13) shows the process is centered in the middle of the tolerance, but there is a lot of variation in the process. If the variation becomes any larger, it is likely this process will exceed the limits. Thus, there is too much variation in the process output. Remember, whenever the data breaches those limits, a defect has occurred.

Figure 13

This histogram (Figure 14) indicates the process center is off the target and the process is currently generating output that does not meet the upper requirement limit. Some action is required in order to satisfy the customer and move the distribution back toward the middle and within the requirements.

Figure 14

We call this (Figure 15) a skewed distribution; that is, the data is “skewed” toward the larger values. Additionally, the mean is not in the center of the two specifications.

Figure 15

This histogram (Figure 16) shows a bimodal distribution, which typically represents data from two different processes being mixed together: day shift vs. night shift, or machine A vs. machine B. This also represents inconsistent performance, which is not desirable.

Figure 16

Conclusion

Pay attention to the histograms of your process. You can learn a lot about a process just by looking carefully at the data.

Graphical Representations of Data

Introduction

There is a saying in management that goes something like this: a manager would rather live with a problem they can’t solve than accept a solution they don’t understand. As practitioners of Six Sigma, this is an observation we must carefully consider, and it has a bearing on the topic of presenting data. Six Sigma is evidence-based decision making – it is driven by data. We certainly desire to develop data analysis skills so that we may more effectively analyze data. However, there is another imperative: we must be able to communicate effectively to our audience. Graphical summaries of data will not only aid us in our analysis – they are indispensable when it comes to presenting our results.


The Stem-and-Leaf Plot

Stem-and-leaf plots allow us to present the data in a way that maintains the visibility of the data values. Moreover, a stem-and-leaf plot can afford us an early look at what might be the shape of a distribution. In the stem-and-leaf plot, data is split into the stem (the first digit of the number) and a leaf (the last digit of the number). The stem-and-leaf plot is especially useful for displaying a small amount of data.

Example: suppose we are given the following heights of a group of two-year-olds:

2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, and 5.0.

We can use the whole number as the stem and the decimal value as the leaf to obtain the following stem-and-leaf plot.

Stem | Leaves
2    | 3, 5, 5, 7, 8
3    | 2, 6, 6
4    | 5
5    | 0

Figure 17

Again, the stem-and-leaf plot organizes the data in a way that indicates the spread of the data and hints at the possible shape (in this case, skewed to the right). Two benefits of a stem-and-leaf plot are that it affords us a very concise presentation of the data and that it maintains the visibility of the individual data values. Recall that a histogram organizes the data into class intervals (so the individual values aren’t visible).
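The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `stem_and_leaf` is hypothetical, and it assumes every value has exactly one decimal place, as in the example data.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group one-decimal values by whole-number stem; leaves are the tenths digit."""
    plot = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # e.g. 2.3 -> 23; rounding avoids float truncation
        stem, leaf = divmod(tenths, 10)   # 23 -> stem 2, leaf 3
        plot[stem].append(leaf)
    return dict(plot)


# Heights from the example in the text
heights = [2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0]
plot = stem_and_leaf(heights)

for stem in sorted(plot):
    print(stem, "|", ", ".join(str(leaf) for leaf in plot[stem]))
```

Because the leaves are kept in sorted order, the length of each printed row gives a rough sideways histogram while every original value stays readable.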

The Box-and-Whiskers Plot

The box-and-whiskers plot (also called a boxplot) is another graphical representation that is quite useful to us. It has the benefit of incorporating some numerical summaries (referred to as five-number summaries) of the data within the plot. It was introduced by John Tukey (one of the most influential statisticians of the 20th century) in his book Exploratory Data Analysis.

Example: suppose we have data from a simple random sample of the wait times (in minutes) for n = 20 patients at a local urgent care center. The wait time data are given below.

32 49 27 31 39 44 40 45 36 28 35 31 33 40 47 39 31 36 35 42

Figure 18


The first thing we need to do is sort the data into ascending order and indicate the rank.

Data 27 28 31 31 31 32 33 35 35 36 36 39 39 40 40 42 44 45 47 49

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Figure 19

The “five-number summary” contained in the box-and-whisker plot consists of the following values:

• The maximum value of the data.
• The minimum value of the data.
• The median of the data.
• The first quartile (Q1), also called the 25th percentile.
• The third quartile (Q3), also called the 75th percentile.

Just by inspection of the data, we can see the maximum value is 49 and the minimum value is 27. Recall that the median is the “middle” of the data once the data has been ordered. Since we have an even number of data values (n = 20), to obtain the median, we must average the two middle values (that is, the values 36 and 36, which are in ranks 10 and 11). Therefore, the median of the data is 36.0.

The quartiles (or percentiles) are a bit more complicated to estimate precisely (there are several location algorithms that can do this). We will employ one of the common algorithms to estimate the location of a percentile. To use this approach, we first identify the location (rank) of the desired percentile and then we estimate its value.

Estimating Q1 and Q3

First, we need to find the location of the 25th percentile. To do so, we use the following formula:

Lp = (p / 100)(n + 1)

That is, p is the percentile and n is the sample size. Therefore, we obtain the following for the position of Q1, or the 25th percentile:

L25 = (25 / 100)(20 + 1) = 5.25

The Lp of 5.25 tells us the 25th percentile (or first quartile) is in rank (or location) 5.25 in our data. That is, Q1 is one-quarter of the way between ranks 5 and 6. So, the 25th percentile (or Q1) is 31.25, found by:

Q1 = 31 + (0.25)(32 - 31) = 31.25


That is, we begin with the value in rank 5 (31) and move 25% of the distance toward rank 6 (32).

To estimate Q3, we follow the same process:

L75 = (75 / 100)(20 + 1) = 15.75

Therefore, Q3 is in location (or rank) 15.75, or 75% of the distance from the value in rank 15 (40) towards the value in rank 16 (42).

Q3 = 40 + (0.75)(42 - 40) = 41.50

Now we have all the values for our five-number summaries, and we are ready to create the box-and-whiskers plot.

• The maximum value of the data = 49: this is the upper whisker.
• The minimum value of the data = 27: this is the lower whisker.
• The median of the data = 36: this is the middle of the box.
• The first quartile (Q1) also called the 25th percentile = 31.25: this is the bottom of the box.
• The third quartile (Q3) also called the 75th percentile = 41.5: this is the top of the box.

Also, note that the graph includes the average (37) indicated by an “X” in the box.
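The five-number summary for the wait-time data can be reproduced with a short Python sketch. It assumes the location formula Lp = (p / 100)(n + 1), which is the rule that yields the locations 5.25 and 15.75 used in this example; other software may use a different percentile rule and return slightly different quartiles. The function names are illustrative, not taken from any library.

```python
def percentile_location(p, n):
    """Rank (1-based) of the p-th percentile, assuming Lp = (p / 100)(n + 1)."""
    return (p / 100) * (n + 1)


def value_at_location(sorted_data, location):
    """Interpolate linearly between the two ranks that bracket `location`."""
    n = len(sorted_data)
    if location <= 1:
        return sorted_data[0]
    if location >= n:
        return sorted_data[-1]
    lower_rank = int(location)              # whole part, e.g. 5 for 5.25
    fraction = location - lower_rank        # fractional part, e.g. 0.25
    lower = sorted_data[lower_rank - 1]     # ranks are 1-based, lists are 0-based
    upper = sorted_data[lower_rank]
    return lower + fraction * (upper - lower)


# Wait times (minutes) from the example in the text
waits = sorted([32, 49, 27, 31, 39, 44, 40, 45, 36, 28,
                35, 31, 33, 40, 47, 39, 31, 36, 35, 42])
n = len(waits)

summary = {
    "min": waits[0],
    "Q1": value_at_location(waits, percentile_location(25, n)),
    "median": value_at_location(waits, percentile_location(50, n)),
    "Q3": value_at_location(waits, percentile_location(75, n)),
    "max": waits[-1],
}
mean = sum(waits) / n

print(summary)
print("mean =", mean)
```

Note that the same location rule also recovers the median: the 50th percentile falls at rank 10.5, halfway between ranks 10 and 11.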

Figure 20

The box-and-whiskers plot contains a significant amount of information. For example,

• Since the first quartile (Q1, or the 25th percentile) is 31.25, then we can infer that 25% of the data is less than 31.25 and 75% of the data is greater than 31.25.
• Since the third quartile (Q3, or the 75th percentile) is 41.5, then we can infer that 75% of the data is less than 41.5 and 25% of the data is greater than 41.5.


• Recall that the interquartile range (IQR) is Q3 - Q1. In this case, the IQR is 41.5 - 31.25, or 10.25. The IQR spans the middle 50% of the data.
• Also, since the box is close to the center of the distribution of the data, the median and mean are close to each other, and the whiskers are fairly symmetric (not perfectly so, but relatively speaking), then we will not be surprised if a formal statistical test indicates the data is close to being normally distributed.
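These inferences can be checked directly against the wait-time data. The sketch below takes the Q1 and Q3 values quoted in the text as fixed inputs and counts how much of the sample falls below, between, and above them; the variable names are illustrative.

```python
# Wait times (minutes) from the example in the text
waits = [32, 49, 27, 31, 39, 44, 40, 45, 36, 28,
         35, 31, 33, 40, 47, 39, 31, 36, 35, 42]

# Quartiles as computed in the text
q1, q3 = 31.25, 41.5

iqr = q3 - q1                                               # width of the box
n = len(waits)
share_below_q1 = sum(1 for w in waits if w < q1) / n        # left of the box
share_above_q3 = sum(1 for w in waits if w > q3) / n        # right of the box
share_in_box = sum(1 for w in waits if q1 <= w <= q3) / n   # inside the box

print("IQR =", iqr)
print("below Q1:", share_below_q1)
print("inside box:", share_in_box)
print("above Q3:", share_above_q3)
```

For this sample the split is exact (five values below the box, ten inside it, five above it); with other sample sizes or tied values the shares will only be approximately 25%, 50%, and 25%.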

Finally, one of the advantages of a boxplot is the ability to generate comparisons. For example, suppose we have data on the time to complete a task using two different methods. The data can easily be compared using a box-and-whiskers plot.

Figure 21

The Defect Location Check Sheet

A defect location check sheet (also known as a defect map) is a structured, prepared form for collecting and analyzing data that provides a visual image of the item being evaluated. It is used when a visual and simple count is needed to determine the frequency of the defects and the recurring locations of the incidences.

To construct a defect location check sheet, place a visual cue, such as an orange dot, on an image or sketch of the item to indicate the location of a defect.


Example: we use a sketch of a person to identify the location of injuries.

Figure 22

A defect map is a common quality assurance practice when we need to count the defects and note the location of occurrence. Some of the benefits of using a defect map are that it collects data with minimal effort and the data is easily converted into useful information.

Pareto Chart

Introduction

A Pareto chart, also called a Pareto distribution diagram, is a vertical bar graph in which values are plotted in decreasing order of relative frequency from left to right. The Pareto chart is named for Vilfredo Pareto, an Italian economist who developed the 80/20 rule while analyzing income distribution. The 80/20 rule, or Pareto Principle, suggests 80% of the most important issues are generated by 20% of the sources. Today we often apply the principle as follows: 80% of the results are from 20% of the actions employed.

One of the challenges for a Six Sigma project team is correctly identifying the most critical relationships between process variables and performance metrics. By helping us prioritize problems, the Pareto chart increases the likelihood we capture the most significant relationships.


Example: suppose a small pizza restaurant has collected data on customer complaints concerning pizza delivery. Using this data, they generate a Pareto chart.

Complaint Frequency

Pizza arrives late 22

Incorrect pizza toppings 9

Incorrect crust 11

Incorrect size 4

Pizza damaged 3

Figure 23

Figure 24

Note the following characteristics of the Pareto chart:

• The causes are placed in descending order of occurrence as indicated by the bars.
• The graph shows the percentages and the cumulative percentages. For example, “Pizza arrives late” accounts for 44.9% of the complaints and the first three causes account for 85.7% of all complaints.
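The percentages behind a Pareto chart are straightforward to compute. The sketch below assumes a frequency of 22 for “Pizza arrives late”, which is the count consistent with the 44.9% and 85.7% figures quoted above (22 of 49 complaints, and 42 of 49 for the first three causes); the function name `pareto_table` is illustrative.

```python
# Complaint counts; 22 for "Pizza arrives late" is an assumption chosen
# to match the percentages quoted in the text.
complaints = {
    "Pizza arrives late": 22,
    "Incorrect pizza toppings": 9,
    "Incorrect crust": 11,
    "Incorrect size": 4,
    "Pizza damaged": 3,
}


def pareto_table(counts):
    """Return (cause, count, percent, cumulative percent) rows, largest cause first."""
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((cause, count, 100 * count / total, 100 * running / total))
    return rows


rows = pareto_table(complaints)
for cause, count, pct, cum_pct in rows:
    print(f"{cause:<26} {count:>3} {pct:6.1f}% {cum_pct:6.1f}%")
```

Sorting before accumulating is what makes the cumulative column useful: reading down it shows how few causes need to be addressed to cover most of the complaints.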


Run Chart

Introduction

A run chart is a type of line graph in which the data is plotted along a time series. That is, the vertical axis represents the values of the data and the horizontal axis is a measure of time. It is customary to place a horizontal line to indicate the average of the data and to provide a visual idea of the variation in the data across time.

By displaying our data over time, we can examine the run chart for evidence of patterns. But keep in mind that we will need to follow our examination of the run chart with a more rigorous statistical test for any pattern or change that we might notice. That is, a run chart is not a control chart or a hypothesis test regarding the average, so it is not as powerful as other tools for detecting changes. However, we can still use a run chart to help identify potential sources of special cause variation by employing tests for randomness in the data (or lack thereof).

Example: suppose we want to use a run chart to examine the data from our golf outings. The data is shown below.

Golf Outings

1 2 3 4 5 6

May 83 84 86 89 86

June 84 80 85 86 88 87

July 85 86 87 92 88 84

August 86 87 87 85 83 82

September 83 87 85 87

Figure 25


To construct a run chart, we place the data in a time series in order of occurrence.

Figure 26

The first thing we notice about the run chart is simply how much variation there is in the golf scores. There are statistical tests for non-random behavior that can be applied to run chart data, but those tests are best performed with appropriate statistical software. Using those tests, we could check for non-random behaviors like trends, clustering, mixtures, and oscillation.
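The data preparation behind a run chart can be sketched in Python: flatten the monthly scores into one series in order of occurrence, compute the center line, and count the runs of consecutive points on the same side of it. This is only an illustration. The run count shown here is a descriptive tally, not the formal randomness test that statistical software performs, and the helper name `count_runs` is hypothetical.

```python
# Golf scores by month, in order of occurrence (from the table in the text)
scores_by_month = {
    "May": [83, 84, 86, 89, 86],
    "June": [84, 80, 85, 86, 88, 87],
    "July": [85, 86, 87, 92, 88, 84],
    "August": [86, 87, 87, 85, 83, 82],
    "September": [83, 87, 85, 87],
}

# Dictionaries keep insertion order, so this preserves the time sequence
series = [score for month in scores_by_month.values() for score in month]

# Center line: the average, as described in the text
center = sum(series) / len(series)


def count_runs(values, center_line):
    """Count runs of consecutive points on the same side of the center line.

    Points exactly on the line are skipped.
    """
    sides = [v > center_line for v in values if v != center_line]
    if not sides:
        return 0
    changes = sum(1 for a, b in zip(sides, sides[1:]) if a != b)
    return changes + 1


runs = count_runs(series, center)
print("points:", len(series))
print("center line:", round(center, 2))
print("runs about the center line:", runs)
```

Plotting `series` against its index, with a horizontal line at `center`, gives the run chart itself; very few runs would hint at clustering or a trend, and very many at oscillation or mixtures.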

