Info project 4
INFO 1010
SINGLE VARIABLE MEASURES OF VARIABILITY
Measures of Variability
It is often desirable to consider measures of variability (dispersion), as well as measures of location.
For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.
Measures of Variability
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
Interquartile Range
The interquartile range of a data set is the difference between the third quartile and the first quartile.
It is the range for the middle 50% of the data.
Because it uses only the middle 50% of the data, it overcomes the range's sensitivity to extreme data values.
Example: Apartment Rents (70 monthly rents, in ascending order)
525 530 530 535 535 535 535 535 540 540
540 540 540 545 545 545 545 545 550 550
550 550 550 550 550 560 560 560 565 565
565 570 570 572 575 575 575 580 580 580
580 585 590 590 590 600 600 600 600 610
610 615 625 625 625 635 649 650 670 670
675 675 680 690 700 700 700 700 715 715
Interquartile Range
3rd Quartile (Q3) = 625
1st Quartile (Q1) = 545
IQR = Q3 - Q1 = 625 - 545 = 80
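The quartile arithmetic above can be checked with a short Python sketch. The names `rent_counts` and `rents` are illustrative, and the choice of `method="exclusive"` in `statistics.quantiles` is an assumption: it reproduces the slide values for this data set, but other quartile conventions can give slightly different results.

```python
import statistics

# Apartment rents from the slides, stored as value -> count to keep the listing short.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [value for value, count in rent_counts.items() for _ in range(count)]

# With n=4, quantiles() returns the three quartile cut points (Q1, median, Q3).
q1, median, q3 = statistics.quantiles(rents, n=4, method="exclusive")

# The interquartile range is the spread of the middle 50% of the data.
iqr = q3 - q1
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```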
Variance
The variance is a measure of variability that utilizes all the data.
It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).
The variance is useful in comparing the variability of two or more variables.
=VAR.S(data cell range)
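As a rough illustration of what a sample-variance function such as VAR.S computes, the sketch below sums the squared deviations from the mean and divides by n - 1. The variable names are illustrative, and the data are the apartment rents used elsewhere in these slides.

```python
import statistics

# Apartment rents from the slides, stored as value -> count to keep the listing short.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [value for value, count in rent_counts.items() for _ in range(count)]

n = len(rents)
mean = sum(rents) / n

# Squared difference between each observation and the sample mean.
squared_deviations = [(x - mean) ** 2 for x in rents]

# Sample variance divides by n - 1, not n.
sample_variance = sum(squared_deviations) / (n - 1)

# The standard library uses the same n - 1 divisor, so the two should agree.
library_variance = statistics.variance(rents)
print(round(sample_variance, 2), round(library_variance, 2))
```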
Standard Deviation
The standard deviation of a data set is the positive square root of the variance.
It is measured in the same units as the data, making it more easily interpreted than the variance.
=STDEV.S(data cell range)
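Sample Variance, Standard Deviation, and Coefficient of Variation
Example: Apartment Rents
Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$
Standard Deviation: $s = \sqrt{s^2} = \sqrt{2{,}996.16} = 54.74$
Coefficient of Variation: $\left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{54.74}{590.80} \times 100\right)\% = 9.27\%$
The standard deviation is about 9% of the mean.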
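The standard deviation and coefficient of variation for the apartment rents can be reproduced with a minimal Python sketch. The names `s`, `mean`, and `cv_percent` are illustrative; the coefficient of variation is taken here as the standard deviation divided by the mean, times 100.

```python
import statistics

# Apartment rents from the slides, stored as value -> count to keep the listing short.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [value for value, count in rent_counts.items() for _ in range(count)]

mean = statistics.mean(rents)

# Sample standard deviation: the positive square root of the sample variance.
s = statistics.stdev(rents)

# Coefficient of variation: standard deviation as a percentage of the mean.
cv_percent = s / mean * 100

print(f"mean = {mean:.2f}, s = {s:.2f}, CV = {cv_percent:.1f}%")
```

Because the coefficient of variation is unit-free, it is a convenient way to compare variability between variables measured on different scales.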
Measures of Distribution
Distribution Shape
z-Scores
Chebyshev’s Theorem
Empirical Rule
Detecting Outliers
Distribution Shape: Skewness
An important measure of the shape of a distribution is called skewness.
The formula for the skewness of sample data is
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$
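A minimal Python sketch of a sample skewness calculation follows. It assumes the adjusted formula n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations, which is the form used by Excel's SKEW function; the function name `sample_skewness` is hypothetical. Applied to the apartment rents from these slides, it gives a value of about .92.

```python
import statistics


def sample_skewness(data):
    """Adjusted sample skewness (assumed formula, matching Excel's SKEW)."""
    n = len(data)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    # Sum of cubed standardized deviations, scaled by the small-sample factor.
    cubed = sum(((x - mean) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed


# Apartment rents from the slides, stored as value -> count.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [value for value, count in rent_counts.items() for _ in range(count)]

rent_skewness = sample_skewness(rents)
print(f"Skewness = {rent_skewness:.2f}")
```

A positive result indicates a longer right tail, and a symmetric data set such as 1, 2, 3, 4, 5 gives a skewness of zero.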
Distribution Shape: Skewness
Symmetric (not skewed)
[Figure: relative frequency histogram, vertical axis from 0 to .35; Skewness = 0]
Skewness is zero.
Mean and median are equal.
Distribution Shape: Skewness
Moderately Skewed Left
[Figure: relative frequency histogram, vertical axis from 0 to .35; Skewness = -.31]
Skewness is negative.
Mean will usually be less than the median.
Distribution Shape: Skewness
Moderately Skewed Right
[Figure: relative frequency histogram, vertical axis from 0 to .35; Skewness = .31]
Skewness is positive.
Mean will usually be more than the median.
Distribution Shape: Skewness
Highly Skewed Right
[Figure: relative frequency histogram, vertical axis from 0 to .35; Skewness = 1.25]
Skewness is positive (often above 1.0).
Mean will usually be more than the median.
Distribution Shape: Skewness
Example: Apartment Rents
Seventy efficiency apartments were randomly sampled in a college town. The monthly rent prices for the apartments, in ascending order, are the same 70 values listed earlier with the interquartile range example.
Distribution Shape: Skewness
Example: Apartment Rents
[Figure: relative frequency histogram of the apartment rents, vertical axis from 0 to .35]
Skewness = .92
Empirical Rule
When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean.
The empirical rule is based on the normal distribution, which is covered in Chapter 6.
Empirical Rule
For data having a bell-shaped distribution:
68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean.
95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean.
99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean.
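The percentages in the empirical rule come from the normal distribution, and they can be recomputed with Python's `statistics.NormalDist`. This is only a sketch; the helper name `share_within` is hypothetical. The computed values may differ from the slide figures in the last digit, which would be expected if the slide values were read from rounded normal tables.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal variable's values within +/- k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Shares for 1, 2, and 3 standard deviations around the mean.
shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within +/- {k} standard deviation(s): {share:.2%}")
```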
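Empirical Rule
[Figure: bell-shaped curve over x, with the axis marked at μ - 3σ, μ - 2σ, μ - 1σ, μ, μ + 1σ, μ + 2σ, and μ + 3σ; 68.26% of values fall within 1 standard deviation of the mean, 95.44% within 2, and 99.72% within 3]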
z-Scores
Standardized Values for Apartment Rents
z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a data value $x_i$ is from the mean.
$z_i = \frac{x_i - \bar{x}}{s}$
Excel’s STANDARDIZE function can be used to compute the z-score.
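A z-score is the distance of a value from the sample mean, measured in sample standard deviations. The sketch below standardizes the apartment rents from these slides; the variable names are illustrative. The smallest rent (525) comes out near -1.20 and the largest (715) near 2.27.

```python
import statistics

# Apartment rents from the slides, stored as value -> count to keep the listing short.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [value for value, count in rent_counts.items() for _ in range(count)]

mean = statistics.mean(rents)
s = statistics.stdev(rents)  # sample standard deviation

# z-score: how many standard deviations each rent lies from the mean.
z_scores = [(x - mean) / s for x in rents]

print(f"z for {rents[0]}: {z_scores[0]:.2f}")
print(f"z for {rents[-1]}: {z_scores[-1]:.2f}")
```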
z-Scores
A data value less than the sample mean will have a z-score less than zero.
A data value greater than the sample mean will have a z-score greater than zero.
A data value equal to the sample mean will have a z-score of zero.
An observation’s z-score is a measure of the relative location of the observation in a data set.
Z-Scores and Distributions
Detecting Outliers
An outlier is an unusually small or unusually large value in a data set.
A data value with a z-score less than -3 or greater than +3 might be considered an outlier.
It might be:
an incorrectly recorded data value
a data value that was incorrectly included in the data set
a correctly recorded data value that belongs in the data set
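The z-score screen described above can be sketched as a small Python function. The name `z_score_outliers` and the added rent of 1200 are hypothetical, used only to show a flagged value; the apartment rents themselves produce no z-score beyond 3 in absolute value.

```python
import statistics


def z_score_outliers(data, threshold=3.0):
    """Return values whose z-score is below -threshold or above +threshold."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs((x - mean) / s) > threshold]


# Apartment rents from the slides, stored as value -> count.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [value for value, count in rent_counts.items() for _ in range(count)]

# No rent in the sample is more than 3 standard deviations from the mean.
flagged = z_score_outliers(rents)

# Hypothetical extra observation, added only to illustrate a flagged value.
flagged_with_extra = z_score_outliers(rents + [1200])
print(flagged, flagged_with_extra)
```

A flagged value is only a candidate for review: as the slide notes, it may be a recording error, a value that does not belong in the data set, or a legitimate observation.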
Chebyshev’s Theorem
At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1.
Chebyshev’s theorem requires z > 1, but z need not be an integer.
At least 75% of the data values must be within z = 2 standard deviations of the mean.
At least 89% of the data values must be within z = 3 standard deviations of the mean.
At least 94% of the data values must be within z = 4 standard deviations of the mean.
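The bound in Chebyshev's theorem is simple enough to evaluate directly. The sketch below uses a hypothetical helper, `chebyshev_lower_bound`, that returns 1 - 1/z² and rejects z values of 1 or less; it also shows a non-integer z.

```python
def chebyshev_lower_bound(z):
    """Minimum share of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


# Bounds for the z values on the slide, plus one non-integer example.
for z in (1.5, 2, 3, 4):
    print(f"z = {z}: at least {chebyshev_lower_bound(z):.0%}")
```

Unlike the empirical rule, this bound makes no assumption about the shape of the distribution, which is why its percentages are lower.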
Five-Number Summaries and Box Plots
Summary statistics and easy-to-draw graphs can be used to quickly summarize large quantities of data.
Two tools that accomplish this are five-number summaries and box plots.
Five-Number Summary
1. Smallest Value
2. First Quartile
3. Median
4. Third Quartile
5. Largest Value
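A five-number summary for the apartment rents from these slides can be assembled in a few lines of Python. The variable names are illustrative, and the quartiles rely on the default `exclusive` method of `statistics.quantiles`, which is an assumption about the quartile convention.

```python
import statistics

# Apartment rents from the slides, stored as value -> count to keep the listing short.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [value for value, count in rent_counts.items() for _ in range(count)]

# Quartile cut points: first quartile, median, third quartile.
q1, median, q3 = statistics.quantiles(rents, n=4)

# Smallest value, Q1, median, Q3, largest value.
five_number_summary = (min(rents), q1, median, q3, max(rents))
print(five_number_summary)
```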
Box Plot
A box plot is a graphical summary of data that is based on a five-number summary.
A key to the development of a box plot is the computation of the median and the quartiles Q1 and Q3.
Box plots provide another way to identify outliers.
Box Plot
Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits.
[Figure: box plot of the apartment rents on an axis from 500 to 725]
Smallest value inside limits = 525
Largest value inside limits = 715
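The slide does not say how the limits are set. The sketch below assumes the common convention of placing them 1.5 IQR below Q1 and 1.5 IQR above Q3; that rule and the variable names are assumptions, not taken from the slides. Under this assumption the whiskers for the apartment rents end at 525 and 715, matching the figure, and no value falls outside the limits.

```python
import statistics

# Apartment rents from the slides, stored as value -> count to keep the listing short.
rent_counts = {
    525: 1, 530: 2, 535: 5, 540: 5, 545: 5, 550: 7, 560: 3, 565: 3, 570: 2,
    572: 1, 575: 3, 580: 4, 585: 1, 590: 3, 600: 4, 610: 2, 615: 1, 625: 3,
    635: 1, 649: 1, 650: 1, 670: 2, 675: 2, 680: 1, 690: 1, 700: 4, 715: 2,
}
rents = [value for value, count in rent_counts.items() for _ in range(count)]

q1, _, q3 = statistics.quantiles(rents, n=4)
iqr = q3 - q1

# Assumed convention: limits sit 1.5 * IQR beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers extend to the most extreme values that are still inside the limits.
inside = [x for x in rents if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

# Values outside the limits would be plotted individually as outliers.
outliers = [x for x in rents if x < lower_limit or x > upper_limit]
print(lower_limit, upper_limit, whisker_low, whisker_high, outliers)
```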
Data Dashboards: Adding Numerical Measures to Improve Effectiveness
The addition of numerical measures, such as the mean and standard deviation of key performance indicators (KPIs), to a data dashboard is often critical.
Data dashboards are not limited to graphical displays.
Dashboards are often interactive.
Drilling down refers to functionality in interactive dashboards that allows the user to access information and analyses at an increasingly detailed level.
Standardized Values for Apartment Rents (z-scores, in the same ascending order as the rents)
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27