Quantitative Analysis
Descriptive Statistics
2
Quantitative Analysis
If you put one foot in a bucket of freezing water and the other foot in a bucket of boiling water, on average you’ll be just fine.
Homer J. Simpson
3
Descriptive Statistics
Parameters and statistics
Statistical inference
Box and whisker plot
Describing data
Central tendency
Dispersion
Symmetry
Peakedness
4
Summarizing Numeric Data
Suppose you need to order lunches for high school volunteers you are supervising for a service project this weekend.
What might be good things to know regarding the number of volunteers?
How would you collect the information (data)?
How would you go about analyzing the data?
How would you use the information?
5
Parameters versus Statistics
Parameters refer to a population
Denoted by Greek letters: μ, σ
Statistics refer to a subset of the population (i.e., a sample)
Denoted by Latin letters: x̄, s
Examples of statistics?
Average household income in US
Batting averages in MLB
“58% of people oppose hand gun legislation”
Why/when do we need statistics?
6
Process of Statistical Inference
1. Population consists of all N elements; parameters (μ, σ) are unknown
2. A sample of size n is examined
3. The sample data provide a sample statistic (x̄, s)
4. The value of the sample statistic is used to make an estimate of the population parameter
7
Descriptive Statistics
The number of volunteers from West Forsyth and Reynolds High Schools for several randomly selected prior service projects is provided in the table below
Determine five ways to measure:
The typical number of volunteers from each school
The consistency in the number of volunteers from each school
Under what circumstances is one of your measures better than another?
| WF High | Reynolds High |
| 31 | 27 |
| 17 | 10 |
| 31 | 16 |
| 23 | 22 |
| 17 | 32 |
| 20 | 15 |
| 28 | 13 |
| 24 | 20 |
| 21 | 33 |
| 21 | 29 |
| | 15 |
8
Graphically Examine Data
Discrete frequency table . . .
| Number of volunteers | WFHS frequency | RHS frequency |
| 3 to 7 | 0 | 0 |
| 8 to 12 | 0 | 1 |
| 13 to 17 | 2 | 4 |
| 18 to 22 | 3 | 2 |
| 23 to 27 | 2 | 1 |
| 28 to 32 | 3 | 2 |
| 33 to 37 | 0 | 1 |
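The frequency table above can be reproduced with a short script. This is a minimal sketch in Python (the slides themselves use Excel); the function name `frequency_table` and its parameters are illustrative assumptions, and the class boundaries (start at 3, width 5) are taken from the table.

```python
from collections import Counter

# Volunteer counts copied from the slide's data table.
wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
rhs = [27, 10, 16, 22, 32, 15, 13, 20, 33, 29, 15]

def frequency_table(data, start=3, width=5, bins=7):
    """Count observations in classes start..start+width-1, and so on (hypothetical helper)."""
    counts = Counter((x - start) // width for x in data)
    table = {}
    for k in range(bins):
        low = start + k * width
        table[f"{low} to {low + width - 1}"] = counts.get(k, 0)
    return table

wfhs_freq = frequency_table(wfhs)
rhs_freq = frequency_table(rhs)
for label in wfhs_freq:
    print(f"{label:>9} | {wfhs_freq[label]} | {rhs_freq[label]}")
```

Changing `start` and `width` changes the classes, which is why the histograms that follow (classes starting at 0) show different bar heights for the same data.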
9
Graphically Examine Data
West Forsyth
. . . and resulting histograms
Frequency Distribution
| Number of Volunteers | Frequency |
| 0 to 4 | 0 |
| 5 to 9 | 0 |
| 10 to 14 | 0 |
| 15 to 19 | 2 |
| 20 to 24 | 5 |
| 25 to 29 | 1 |
| 30 to 34 | 2 |
10
Graphically Examine Data
Reynolds
. . . and resulting histograms
Frequency Distribution
| Number of Volunteers | Frequency |
| 0 to 4 | 0 |
| 5 to 9 | 0 |
| 10 to 14 | 2 |
| 15 to 19 | 3 |
| 20 to 24 | 2 |
| 25 to 29 | 2 |
| 30 to 34 | 2 |
11
Box and Whisker Plot
Shows max and min, IQR, median, and average
What can you surmise based on this plot?
[Figure: box and whisker plots of Number of Volunteers for WFHS and RHS]
| | WFHS | RHS |
| Min | 17 | 10 |
| Quartile 1 | 19.25 | 15 |
| Median | 22 | 20 |
| Quartile 3 | 28.75 | 29 |
| Max | 31 | 33 |
| Average | 23.3 | 21.1 |
12
Descriptive Stats
How would you generally describe the service project data?
Central tendency
Dispersion
Shape
Symmetry
Peakedness
Unusual (outlying) number of volunteers
13
Measures of Central Tendency
Arithmetic Mean
Average value
Weighted mean
Importance of observation
Median
Middle value
Mode
Most frequent value
Geometric mean
Multiplicative effects
14
Example: Arithmetic Mean Number of Volunteers WFHS
Number of volunteers for n = 10 prior service projects
31, 17, 31, . . . , 21
Add up all numbers of volunteers and divide by the total number of prior service projects
31 + 17 + 31 + . . . + 21 = 233
233/10 = 23.3
On average, each project had about 23 volunteers
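The arithmetic above can be checked in a few lines. This is a minimal Python sketch, not part of the original Excel workflow; the variable names are illustrative.

```python
from statistics import mean

# WFHS volunteer counts for the n = 10 prior service projects on the slide.
wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]

total = sum(wfhs)          # add up all numbers of volunteers
x_bar = total / len(wfhs)  # divide by the number of prior projects

print(total, x_bar)        # sum and sample mean
print(mean(wfhs))          # standard-library cross-check
```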
15
Arithmetic Mean
Average or expected value
Denoted by μ for a population mean, and x̄ (“x bar”) for a sample mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where xi = ith observation for i = 1, 2, 3, …, n observations (sample), or i = 1, 2, 3, …, N observations (population)
Σ, the Greek capital letter “Sigma”, means to add
16
Example: Weighted Mean Number of Volunteers WFHS
Number of volunteers for n = 10 service projects
31, 31, . . . 21, 21
2/10ths of the projects had 17 volunteers, 2/10ths had 21 volunteers, and 2/10ths had 31 volunteers
The projects that had 17, 21, and 31 volunteers are given more weight because they occurred more frequently than other numbers of volunteers
(.2)17 + (.2)21 + (.2)31 + . . . + (.1)24 = 23.3
The weighted average number of volunteers was about 23
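The same result can be obtained by weighting each distinct value by its relative frequency. This is a minimal Python sketch with illustrative variable names; it only restates the slide's calculation.

```python
from collections import Counter

wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
n = len(wfhs)

# Weight of each distinct value = share of projects with that many volunteers.
weights = {value: count / n for value, count in Counter(wfhs).items()}

weighted_mean = sum(w * value for value, w in weights.items())
print(weights)
print(round(weighted_mean, 1))
```

Because the weights are relative frequencies, the weighted mean here equals the ordinary arithmetic mean of the ten observations.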
17
Weighted Average
Ordinary average gives same weight to all elementary units
Weighted average allows different weights
Weights must add up to 1
If not, then divide each by their total
Ordinary average:
$$\bar{X} = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n$$
Weighted average:
$$\bar{X} = w_1 X_1 + w_2 X_2 + \cdots + w_n X_n$$
where
$$w_1 + w_2 + \cdots + w_n = 1$$
18
Weighted Average (cont’d)
Arithmetic average is per elementary unit
The average of your course grades is your “average per course”
Weighted average is per unit of weight
Your GPA (grade point average) is a weighted average, using credit hours to define the weights. The weighted average is your “average per credit hour”
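The difference between “average per course” and “average per credit hour” can be illustrated with a small sketch. The grade points and credit hours below are hypothetical values invented for illustration; they do not come from the slides.

```python
# Hypothetical grade points and credit hours (assumed values, illustration only).
grades = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 1]

# Credit hours do not add up to 1, so divide each by their total to get weights.
total_hours = sum(credit_hours)
weights = [h / total_hours for h in credit_hours]

average_per_course = sum(grades) / len(grades)     # ordinary average
gpa = sum(w * g for w, g in zip(weights, grades))  # average per credit hour

print(average_per_course, gpa)
```

In this made-up case the GPA is higher than the per-course average because the low grade carries only one credit hour.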
19
Median
Also summarizes the data
The middle value
Half of observations are above and half of observations are below the median
Put data in ascending or descending order
Pick middle value (or average middle two values if n is even)
Median (17, 17, 20, . . . 31) = 22
Rank of the median is (n+1)/2
If n=3, rank is (1+3)/2 = 2
If n=10, rank is (1+10)/2 = 5.5 (so average 5th and 6th values)
| WF High | WF High sorted |
| 31 | 17 |
| 17 | 17 |
| 31 | 20 |
| 23 | 21 |
| 17 | 21 |
| 20 | 23 |
| 28 | 24 |
| 24 | 28 |
| 21 | 31 |
| 21 | 31 |
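The rank rule (n+1)/2 can be written out directly. This is a minimal Python sketch; `median_by_rank` is a hypothetical helper name, and the standard-library `statistics.median` is used only as a cross-check.

```python
import math
from statistics import median

wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]

def median_by_rank(data):
    """Median via the rank (n + 1) / 2, averaging the two middle values if needed."""
    ordered = sorted(data)
    rank = (len(ordered) + 1) / 2                  # 1-based rank, e.g. 5.5 for n = 10
    low, high = math.floor(rank), math.ceil(rank)  # 5th and 6th values when n = 10
    return (ordered[low - 1] + ordered[high - 1]) / 2

print(median_by_rank(wfhs), median(wfhs))
```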
20
Median (cont’d)
A representative, central number
If data set has a center
Less sensitive to outliers than the average
For skewed data, represents the “typical case” better than the average does
21
Mode
Also summarizes the data
Most common or frequent data value
Tallest histogram bar
Can also have bi- and multi-modal distributions
Frequency Distribution
[Histogram of WFHS Number of Volunteers: the tallest bar is the 20 to 24 class, with a frequency of 5]
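For raw (unbinned) data, the most frequent values can be listed directly. This is a minimal Python sketch using the standard-library `statistics.multimode` (Python 3.8 or later), offered as a stand-in for the Excel approach used in the slides.

```python
from collections import Counter
from statistics import multimode

wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]

modes = multimode(wfhs)  # every value tied for the highest count
counts = Counter(wfhs)

print(modes)
print({value: counts[value] for value in modes})
```

The raw data are multi-modal (three values occur twice each), while the histogram has a single tallest bar, which shows how the mode can depend on how the data are grouped.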
22
Which measure to use?
Mean
Best for “normal” data
Median
May not be any central number
Good for skewed data or data with outliers
Mode
May depend on how you draw the histogram
May be more than one mode
Only summary measure computable for nominal data
23
Which measure? (cont’d)
Average requires quantitative data (numbers)
Median works with quantitative or ordinal data
Mode works with quantitative, ordinal, or nominal data
| | Quantitative | Ordinal | Nominal |
| Average | Yes | - | - |
| Median | Yes | Yes | - |
| Mode | Yes | Yes | Yes |
24
Geometric Mean
Used to compute average rates of return/growth rates/percentage change over time
Suppose a portfolio of stock has grown as follows over the past five years:
What is the portfolio’s average rate of return?
| Year | Rate of return | Growth factor |
| 1 | 27% | 1.27 |
| 2 | 10% | 1.10 |
| 3 | 6% | 1.06 |
| 4 | 2% | 1.02 |
| 5 | -12% | .88 |
25
Geometric Mean
For n observations:
For stock portfolio
Portfolio five year average rate of return was 5.9 percent
Arithmetic mean rate of return of 6.6 percent overstates true rate of return
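The two averages can be compared numerically. This is a minimal Python sketch, assuming the geometric mean is the nth root of the product of the n growth factors from the table; variable names are illustrative.

```python
import math

# Growth factors from the table (1 + rate of return for each year).
growth_factors = [1.27, 1.10, 1.06, 1.02, 0.88]
n = len(growth_factors)

geometric_mean = math.prod(growth_factors) ** (1 / n)    # nth root of the product
geometric_rate = geometric_mean - 1                      # average rate of return
arithmetic_rate = sum(g - 1 for g in growth_factors) / n

print(round(geometric_mean, 3), round(geometric_rate, 3), round(arithmetic_rate, 3))
```

Compounding the geometric rate for five years reproduces the portfolio's total growth, which the arithmetic rate does not.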
26
Examples of Variation
Production
Not all machines produce the same quality output
Stock prices
Not the same, day after day
Uncertain outcomes?
As a basketball coach, how would you decide between:
Player A who averaged 20 PPG over the last five games of last season, or
Player B who also averaged 20 PPG over the last five games of last season
27
Measures of Variation
Also known as variance, deviation, dispersion, uncertainty, diversity, risk
How to measure?
Range
Interquartile range
Sum of absolute deviations and mean absolute deviation (MAD) from the mean
Variance
Standard deviation
Coefficient of variation
28
Range
Range
highest value – lowest value
max(xi) – min(xi)
Sort data
Range = 31 – 17 = 14
| WF High | WF High Sorted |
| 31 | 17 |
| 17 | 17 |
| 31 | 20 |
| 23 | 21 |
| 17 | 21 |
| 20 | 23 |
| 28 | 24 |
| 24 | 28 |
| 21 | 31 |
| 21 | 31 |
29
Graphical Interpretation of Interquartile Range
Divide sorted data into fourths (quartiles)
Interquartile range is the difference between the third and first quartiles
IQR = Q3 – Q1
IQR = 28 – 20 = 8 (reading the quartiles directly off the sorted data; interpolating between data points gives 9.5)
| WF High | WF High Sorted |
| 31 | 17 |
| 17 | 17 |
| 31 | 20 |
| 23 | 21 |
| 17 | 21 |
| 20 | 23 |
| 28 | 24 |
| 24 | 28 |
| 21 | 31 |
| 21 | 31 |
Q3 = 28, median = 22, Q1 = 20
30
Interquartile Range
More formally:
| WF High | WF High Sorted |
| 31 | 17 |
| 17 | 17 |
| 31 | 20 |
| 23 | 21 |
| 17 | 21 |
| 20 | 23 |
| 28 | 24 |
| 24 | 28 |
| 21 | 31 |
| 21 | 31 |
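A sketch of the formal calculation, assuming the kth quartile sits at position k(n+1)/4 in the sorted data with linear interpolation between neighbouring values (the rule behind Excel's QUARTILE.EXC). The helper name `quartile` is hypothetical; `statistics.quantiles` with `method="exclusive"` serves as a cross-check. The range is included for comparison.

```python
from statistics import quantiles

wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]

def quartile(data, k):
    """k-th quartile (k = 1, 2, 3) at position k(n + 1)/4, linearly interpolated."""
    ordered = sorted(data)
    if len(ordered) < 3:
        raise ValueError("this sketch assumes at least 3 observations")
    position = k * (len(ordered) + 1) / 4  # 1-based, may be fractional (e.g. 2.75)
    lower = int(position)                  # rank of the data point just below
    fraction = position - lower
    value = ordered[lower - 1]
    if fraction:
        value += fraction * (ordered[lower] - ordered[lower - 1])
    return value

q1, q3 = quartile(wfhs, 1), quartile(wfhs, 3)
iqr = q3 - q1
data_range = max(wfhs) - min(wfhs)

print(q1, q3, iqr, data_range)
print(quantiles(wfhs, n=4, method="exclusive"))
```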
31
Measures of Variation
Sum of absolute deviation from the mean
Mean absolute deviation (MAD) from the mean
Are range, interquartile range, and mean absolute deviation from the mean biased? Why?
See Mean Absolute Deviation vs. Standard Deviation.xlsx
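The sum of absolute deviations and the MAD for the WFHS data can be sketched as follows. This is an illustrative Python calculation, not the contents of the spreadsheet referenced above.

```python
wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
n = len(wfhs)
x_bar = sum(wfhs) / n

# Absolute distance of each observation from the mean.
abs_deviations = [abs(x - x_bar) for x in wfhs]

sum_abs_dev = sum(abs_deviations)
mad = sum_abs_dev / n  # mean absolute deviation from the mean

print(round(sum_abs_dev, 1), round(mad, 2))
```

Without the absolute value, the deviations from the mean would cancel and sum to zero.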
32
Example: Volunteers
Average is 23.3. Sum of squared deviations is
(31 – 23.3)² + (17 – 23.3)² + . . . + (21 – 23.3)²
= 242.1
Divide by 10
On average, the number of volunteers for each event varies by about 24 students squared
average squared deviation = 242.1/10 = 24.2 volunteers²
33
Example: Volunteers
We would like to have our measure of spread in our original units (number of student volunteers)
Take the square root: 24.2^(1/2) = 4.9
On average, the number of volunteers for each service project varies by 4.9 students
Is this measure biased?
34
Variance and Standard Deviation
Dividing by n = 10 makes us think we are more certain about our statistic (have more information or degrees of freedom than we actually do)
As before, sum of squared deviations is 242.1
Divide by 10 – 1 = 9 to get variance and take square root to get standard deviation
variance = 242.1/9 = 26.9
standard deviation = √26.9 = 5.2
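The effect of dividing by n versus n – 1 can be seen side by side. This is a minimal Python sketch with illustrative variable names; the `statistics` functions are used only to cross-check the hand calculation.

```python
from statistics import pstdev, stdev, variance

wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
n = len(wfhs)
x_bar = sum(wfhs) / n

sum_sq = sum((x - x_bar) ** 2 for x in wfhs)  # sum of squared deviations

var_n = sum_sq / n                # divide by n = 10
var_n_minus_1 = sum_sq / (n - 1)  # divide by n - 1 = 9 (sample variance)
sd_n = var_n ** 0.5
sd_n_minus_1 = var_n_minus_1 ** 0.5

print(round(sum_sq, 1), round(var_n, 1), round(sd_n, 1))
print(round(var_n_minus_1, 1), round(sd_n_minus_1, 1))
```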
35
Variance and Standard Deviation
On average, the number of volunteers for each service project varies from the mean by about 5 students
Note greater uncertainty in the number of volunteers when dividing by n – 1 = 9 rather than by n = 10
5.2 (sample standard deviation) versus 4.9 (square root of the average squared deviation; the MAD itself is 4.2)
Less information (9 versus 10 “free” observations) so should be more uncertain
36
Variance
Denoted by σ² for a population variance, and s² for a sample variance
$$\sigma^2 = \frac{(X_1 - \mu)^2 + (X_2 - \mu)^2 + \cdots + (X_N - \mu)^2}{N}$$
$$s^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}$$
Divide by n – 1 so that the sample variance is an unbiased estimator of the population variance
37
Standard Deviation
Denoted by σ for population standard deviation, and s for sample standard deviation
Measures variability by answering:
“Approximately how far from average are the data values?”
Returns same measurement units as original data
Calculated by taking the square root of the variance
38
Working Formula
Population variance and standard deviation
39
Working Formula
Sample variance and standard deviation
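A working (shortcut) formula avoids computing each deviation: for a sample, s² = (Σx² – n·x̄²)/(n – 1), and for a population, σ² = Σx²/N – μ². The sketch below checks both against the definitional forms. Treating the WFHS data as a whole population in the second half is an assumption made purely for illustration.

```python
wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
n = len(wfhs)
x_bar = sum(wfhs) / n
sum_of_squares = sum(x * x for x in wfhs)  # sum of x_i squared

# Sample variance: definition versus working formula.
sample_definition = sum((x - x_bar) ** 2 for x in wfhs) / (n - 1)
sample_working = (sum_of_squares - n * x_bar ** 2) / (n - 1)

# Population variance (assumption: data treated as a population, so mu = x_bar).
population_definition = sum((x - x_bar) ** 2 for x in wfhs) / n
population_working = sum_of_squares / n - x_bar ** 2

print(round(sample_definition, 1), round(sample_working, 1))
print(round(population_definition, 1), round(population_working, 1))
```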
40
Coefficient of Variation
Relative measure of dispersion
Measures amount of variation relative to size of observation
Unit-less measure—describes dispersion without having to use a variable’s unit of measurement
Enables comparison of dispersion across variables with different means and standard deviations or dissimilar units of measurement
How does the consistency in volunteering for service projects for WFHS students compare to that of Reynolds HS students?
| WF High | Reynolds High |
| 31 | 27 |
| 17 | 10 |
| 31 | 16 |
| 23 | 22 |
| 17 | 32 |
| 20 | 15 |
| 28 | 13 |
| 24 | 20 |
| 21 | 33 |
| 21 | 29 |
| | 15 |
| Mean = 23.3 | Mean= 21.1 |
| Stdev = 5.2 | Stdev = 8.1 |
41
Coefficient of Variation
Standard deviation as a percentage of the mean
WFHS students are relatively more consistent in volunteering for service projects
| WF High | Reynolds High |
| 31 | 27 |
| 17 | 10 |
| 31 | 16 |
| 23 | 22 |
| 17 | 32 |
| 20 | 15 |
| 28 | 13 |
| 24 | 20 |
| 21 | 33 |
| 21 | 29 |
| | 15 |
| Mean = 23.3 | Mean = 21.1 |
| Stdev = 5.2 | Stdev = 8.1 |
| cv = .22 | cv = .38 |
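The comparison in the table can be reproduced as follows. This is a minimal Python sketch, assuming cv is the sample standard deviation divided by the sample mean; `coefficient_of_variation` is a hypothetical helper name.

```python
from statistics import mean, stdev

wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
rhs = [27, 10, 16, 22, 32, 15, 13, 20, 33, 29, 15]

def coefficient_of_variation(data):
    """Sample standard deviation expressed as a fraction of the mean."""
    return stdev(data) / mean(data)

cv_wfhs = coefficient_of_variation(wfhs)
cv_rhs = coefficient_of_variation(rhs)

print(round(mean(wfhs), 1), round(stdev(wfhs), 1), round(cv_wfhs, 2))
print(round(mean(rhs), 1), round(stdev(rhs), 1), round(cv_rhs, 2))
```

The smaller cv for WFHS is what supports the statement that WFHS students are relatively more consistent.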
42
Distribution Shapes
Continuous versus discrete data
What data might give rise to the following distributions?
43
Distribution Shapes (cont’d)
Normal
Symmetric
Bell-Shaped
Skewed
Not symmetric
Can cause trouble
Log transform?
Bimodal
Two clear groups
Find out why!
Analyze separately?
44
Describing Distributions
Measures of shape
Symmetry (skew)
Peakedness (kurtosis)
45
Measures of Shape
Skew
Areas under tails of distribution are unequal
Types
Right (positive) skew: median < mean
Left (negative) skew: median > mean
Skew allows you to better estimate if a current (or future) observation will be more or less than the mean
Mean, median, and mode are different
46
Measures of Shape
Skew (mathematical Excel formula)
n = sample size
s = sample standard deviation
x̄ = sample average
Excel function
=SKEW(WFHS) = .45
–.5 ≤ skew ≤ +.5 “considered” to be “symmetric” (i.e., “normal” distribution)
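A sketch of the skew calculation, assuming the coefficient is n/((n – 1)(n – 2)) times the sum of cubed standardized deviations ((x – x̄)/s)³, which is the form Excel documents for SKEW. The function name `sample_skew` is hypothetical.

```python
from statistics import mean, stdev

wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]

def sample_skew(data):
    """Coefficient of skew from cubed standardized deviations (needs n >= 3)."""
    n = len(data)
    x_bar, s = mean(data), stdev(data)  # s uses the n - 1 divisor
    cubed = sum(((x - x_bar) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed

skew_wfhs = sample_skew(wfhs)
print(round(skew_wfhs, 2))
```

The positive value reflects the slightly longer right tail of the WFHS data, consistent with the mean (23.3) being above the median (22).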
47
Measures of Shape (cont’d)
Kurtosis
Measures peakedness/flatness of distribution
Types
Leptokurtic: kurtosis > 0
More peaked, fatter tails
Extreme values more likely than we would “normally” expect
Mesokurtic (normal): kurtosis = 0
Platykurtic: kurtosis < 0
Less peaked, thinner tails
48
Measures of Shape (cont’d)
Kurtosis (mathematical Excel formula)
n = sample size
s = sample standard deviation
x̄ = sample average
Excel function
=KURT(WFHS) = -1.07
Measures excess kurtosis (excess kurtosis = 0 for normal distribution)
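A sketch of the excess kurtosis calculation, assuming the form Excel documents for KURT: n(n + 1)/((n – 1)(n – 2)(n – 3)) times the sum of ((x – x̄)/s)⁴, minus 3(n – 1)²/((n – 2)(n – 3)). The function name `excess_kurtosis` is hypothetical.

```python
from statistics import mean, stdev

wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]

def excess_kurtosis(data):
    """Sample excess kurtosis (0 for a normal distribution); needs n >= 4."""
    n = len(data)
    x_bar, s = mean(data), stdev(data)  # s uses the n - 1 divisor
    fourth = sum(((x - x_bar) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

kurt_wfhs = excess_kurtosis(wfhs)
print(round(kurt_wfhs, 2))
```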
49
Excel Commands
See Descriptive Statistics Demonstration Sol’n.xls
| mean | =AVERAGE(WFHS) | 23.3 |
| median | =MEDIAN(WFHS) | 22 |
| mode | =MODE.MULT(WFHS) | 31 |
| | =MODE.MULT(WFHS) | 17 |
| | =MODE.MULT(WFHS) | 21 |
| | =MODE.MULT(WFHS) | #N/A |
| range | =MAX(WFHS)-MIN(WFHS) | 14 |
| IQR | =QUARTILE.EXC(WFHS,3)-QUARTILE.EXC(WFHS,1) | 9.5 |
| MAD | =AVEDEV(WFHS) | 4.2 |
| variance | =VAR.S(WFHS) | 26.9 |
| stdev | =STDEV.S(WFHS) | 5.2 |
| cv | =STDEV.S(WFHS)/AVERAGE(WFHS) | 0.22 |
| skew | =SKEW(WFHS) | 0.45 |
| kurtosis | =KURT(WFHS) | -1.07 |
50
Descriptive Statistics
Must use ALL information available
Mean, median, mode
Standard deviation
Skew
Kurtosis
Mean and standard deviation, the most commonly used and reported descriptive statistics, become less appropriate measures of central tendency and variability when data are skewed and kurtotic
On average, 23 students from West Forsyth, give or take five, will volunteer for a service project.
Population mean:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
Geometric mean for n observations:
$$GM = \sqrt[n]{(x_1)(x_2)(x_3)\cdots(x_n)}$$
Geometric mean for the stock portfolio:
$$GM = \sqrt[5]{(1.27)(1.10)(1.06)(1.02)(.88)} = 1.059$$
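Quartiles and interquartile range (WFHS, n = 10):
$$Q_1 \text{ position} = \frac{(n+1)}{4} = \frac{(10+1)}{4} = 2.75^{\text{th}} \text{ data point}$$
$$Q_1 = 17 + .75(20 - 17) = 19.25$$
$$Q_3 \text{ position} = \frac{3(n+1)}{4} = \frac{3(10+1)}{4} = 8.25^{\text{th}} \text{ data point}$$
$$Q_3 = 28 + .25(31 - 28) = 28.75$$
$$IQR = Q_3 - Q_1 = 28.75 - 19.25 = 9.5$$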
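Sum of absolute deviations from the mean:
$$\sum_{i=1}^{n} \left| x_i - \bar{x} \right|$$
Mean absolute deviation (MAD) from the mean:
$$\frac{\sum_{i=1}^{n} \left| x_i - \bar{x} \right|}{n}$$
Variance and standard deviation for the volunteers example:
$$\text{variance} = \frac{242.1}{10 - 1} = 26.9 \qquad \text{standard deviation} = \sqrt{26.9} = 5.2$$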
Population variance and standard deviation:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \qquad \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Sample variance and standard deviation:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \qquad s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
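Working formula, population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} \left( x_i^2 - 2\mu x_i + \mu^2 \right)}{N}$$
$$= \frac{\sum_{i=1}^{N} x_i^2}{N} - 2\mu \frac{\sum_{i=1}^{N} x_i}{N} + \frac{\sum_{i=1}^{N} \mu^2}{N}$$
$$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2$$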
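Working formula, sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} \left( x_i^2 - 2 x_i \bar{x} + \bar{x}^2 \right)}{n - 1}$$
$$= \frac{\sum_{i=1}^{n} x_i^2}{n - 1} - \frac{2\bar{x} \sum_{i=1}^{n} x_i}{n - 1} + \frac{\sum_{i=1}^{n} \bar{x}^2}{n - 1}$$
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{so} \quad n\bar{x} = \sum_{i=1}^{n} x_i$$
$$= \frac{\sum_{i=1}^{n} x_i^2}{n - 1} - \frac{2\bar{x}(n\bar{x})}{n - 1} + \frac{n\bar{x}^2}{n - 1}$$
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2}{n - 1} - \frac{n\bar{x}^2}{n - 1}$$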
Coefficient of variation:
$$cv = \frac{s}{\bar{x}}$$
$$cv_{WFHS} = \frac{5.2}{23.3} = .22$$
$$cv_{RHS} = \frac{8.1}{21.1} = .38$$
Skew:
$$\text{coefficient of skew (cs)} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$
Kurtosis:
$$\text{coefficient of kurtosis (ck)} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$