quantitative analysis

DescriptiveStatistics.pptx

1

Descriptive Statistics

1

2

Quantitative Analysis

If you put one foot in a bucket of freezing water and the other foot in a bucket of boiling water, on average you’ll be just fine.

Homer J. Simpson

3

Descriptive Statistics

Parameters and statistics

Statistical inference

Box and whisker plot

Describing data

Central tendency

Dispersion

Symmetry

Peakedness

4

Summarizing Numeric Data

Suppose you need to order lunches for high school volunteers you are supervising for a service project this weekend.

What might be good things to know regarding the number of volunteers?

How would you collect the information (data)?

How would you go about analyzing the data?

How would you use the information?

5

Parameters versus Statistics

Parameters refer to a population

Denoted by Greek letters: μ, σ

Statistics refer to a subset of the population (i.e., a sample)

Denoted by Latin letters: x̄, s

Examples of statistics?

Average household income in US

Batting averages in MLB

“58% of people oppose hand gun legislation”

Why/when do we need statistics?

6

Process of Statistical Inference

1. Population consists of all N elements; parameters (μ, σ) are unknown

2. A sample of size n is examined

3. The sample data provide a sample statistic (x̄, s)

4. The value of the sample statistic is used to make an estimate of the population parameter

7

Descriptive Statistics

The number of volunteers from West Forsyth and Reynolds High Schools for several randomly selected prior service projects is provided to the right

Determine five ways to measure:

The typical number of volunteers from each school

The consistency in the number of volunteers from each school

Under what circumstances is one of your measures better than another?

WF High Reynolds High
31 27
17 10
31 16
23 22
17 32
20 15
28 13
24 20
21 33
21 29
- 15

8

Graphically Examine Data

Discrete frequency table . . .

Number of volunteers WFHS frequency RHS frequency
3 to 7 0 0
8 to 12 0 1
13 to 17 2 4
18 to 22 3 2
23 to 27 2 1
28 to 32 3 2
33 to 37 0 1
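The class-interval counts in the table above can be reproduced with a short Python sketch. The function name `frequency_table` and the `bins` list are illustrative choices, not part of the original slides; the data are the volunteer counts listed earlier.

```python
# Volunteer counts from the slides (10 WF High projects, 11 Reynolds projects).
wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
rhs = [27, 10, 16, 22, 32, 15, 13, 20, 33, 29, 15]

# Class intervals used in the frequency table (both ends inclusive).
bins = [(3, 7), (8, 12), (13, 17), (18, 22), (23, 27), (28, 32), (33, 37)]


def frequency_table(values, bins):
    """Count how many observations fall inside each inclusive interval."""
    return [sum(low <= v <= high for v in values) for low, high in bins]


wfhs_freq = frequency_table(wfhs, bins)
rhs_freq = frequency_table(rhs, bins)

for (low, high), f_w, f_r in zip(bins, wfhs_freq, rhs_freq):
    print(f"{low} to {high}: WFHS {f_w}, RHS {f_r}")
```

Changing `bins` (for example to 0 to 4, 5 to 9, and so on) gives the counts behind a differently drawn histogram of the same data.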

9

Graphically Examine Data

West Forsyth

. . . and resulting histograms

Frequency Distribution (histogram of Frequency by Number of Volunteers)

Number of Volunteers Frequency
0 to 4 0
5 to 9 0
10 to 14 0
15 to 19 2
20 to 24 5
25 to 29 1
30 to 34 2

10

Graphically Examine Data

Reynolds

. . . and resulting histograms

Frequency Distribution (histogram of Frequency by Number of Volunteers)

Number of Volunteers Frequency
0 to 4 0
5 to 9 0
10 to 14 2
15 to 19 3
20 to 24 2
25 to 29 2
30 to 34 2

11

Box and Whisker Plot

Shows max and min, IQR, median, and average

What can you surmise based on this plot?

Box and Whisker Plot (vertical axis: Number of Volunteers)

Statistic WFHS RHS
Min 17 10
Quartile 1 19.25 15
Median 22 20
Quartile 3 28.75 29
Max 31 33
Average 23.3 21.1

12

Descriptive Stats

How would you generally describe the service project data?

Central tendency

Dispersion

Shape

Symmetry

Peakedness

Unusual (outlying) number of volunteers

13

Measures of Central Tendency

Arithmetic Mean

Average value

Weighted mean

Importance of observation

Median

Middle value

Mode

Most frequent value

Geometric mean

Multiplicative effects

14

Example: Arithmetic Mean Number of Volunteers WFHS

Number of volunteers for n = 10 prior service projects

31, 17, 31, . . . , 21

Add up all numbers of volunteers and divide by the total number of prior service projects

31 + 17 + 31 + . . . + 21 = 233

233/10 = 23.3

On average, each project had about 23 volunteers
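As a quick check of the arithmetic above, a minimal Python sketch; the helper name `arithmetic_mean` is an illustrative choice, not from the slides.

```python
# Volunteer counts for the n = 10 prior West Forsyth service projects.
wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]


def arithmetic_mean(values):
    """Add up all observations and divide by the number of observations."""
    return sum(values) / len(values)


total = sum(wfhs)                  # 233
mean_wfhs = arithmetic_mean(wfhs)  # 233 / 10 = 23.3
print(total, mean_wfhs)
```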

15

Arithmetic Mean

Average or expected value

Denoted by μ for a population mean, and x̄ (“x bar”) for a sample mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

where x_i = ith observation for i = 1, 2, 3, …, n observations (sample), or i = 1, 2, 3, …, N observations (population)

Σ, the Greek letter “Sigma,” means to add

16

Example: Weighted Mean Number of Volunteers WFHS

Number of volunteers for n = 10 service projects

31, 31, . . . 21, 21

2/10ths of the projects had 17 volunteers, 2/10ths had 21 volunteers, and 2/10ths had 31 volunteers

The projects that had 17, 21, and 31 volunteers are given more weight because they occurred more frequently than other numbers of volunteers

(.2)17 + (.2)21 + (.2)31 + . . . + (.1)24 = 23.3

The weighted average number of volunteers was about 23
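The same calculation can be sketched in Python by using each distinct value's relative frequency as its weight. The variable names are illustrative; only the data come from the slides.

```python
from collections import Counter

# Volunteer counts for the n = 10 West Forsyth service projects.
wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
n = len(wfhs)

# Weight of each distinct value = share of projects with that many volunteers.
counts = Counter(wfhs)
weights = {value: count / n for value, count in counts.items()}

# Weighted mean: sum of weight * value over the distinct values.
weighted_mean = sum(weight * value for value, weight in weights.items())

print(weights[17], weights[21], weights[31])  # each 0.2 (2 of 10 projects)
print(round(weighted_mean, 1))                # 23.3, same as the arithmetic mean
```

With frequency weights the weighted mean necessarily equals the ordinary arithmetic mean; the two differ only when the weights reflect something other than how often a value occurred.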

17

Weighted Average

Ordinary average gives same weight to all elementary units

Weighted average allows different weights

Weights must add up to 1

If not, then divide each by their total

$$\bar{X} = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \dots + \frac{1}{n}X_n$$

$$\bar{X} = w_1 X_1 + w_2 X_2 + \dots + w_n X_n$$

$$w_1 + w_2 + \dots + w_n = 1$$

18

Weighted Average (cont’d)

Arithmetic average is per elementary unit

The average of your course grades is your “average per course”

Weighted average is per unit of weight

Your GPA (grade point average) is a weighted average, using credit hours to define the weights. The weighted average is your “average per credit hour”
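To make the “per course” versus “per credit hour” distinction concrete, here is a small sketch. The transcript below is hypothetical (made up for illustration, not from the slides); it also shows the step of dividing the credit hours by their total so the weights add up to 1.

```python
# Hypothetical transcript: (grade points, credit hours). Not data from the slides.
courses = [(4.0, 3), (3.0, 4), (2.0, 1)]

grades = [grade for grade, _ in courses]
credit_hours = [hours for _, hours in courses]

# Ordinary average: every course counts the same ("average per course").
average_per_course = sum(grades) / len(grades)

# Credit hours do not add up to 1, so divide each by their total.
total_hours = sum(credit_hours)
weights = [hours / total_hours for hours in credit_hours]

# Weighted average: "average per credit hour", i.e. the GPA.
gpa = sum(w * g for w, g in zip(weights, grades))

print(average_per_course, gpa)  # 3.0 and 3.25
```

The GPA is higher than the per-course average here because the lowest grade was earned in the course with the fewest credit hours.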

19

Median

Also summarizes the data

The middle value

Half of observations are above and half of observations are below the median

Put data in ascending or descending order

Pick middle value (or average middle two values if n is even)

Median (17, 17, 20, . . . 31) = 22

Rank of the median is (n+1)/2

If n=3, rank is (1+3)/2 = 2

If n=10, rank is (1+10)/2 = 5.5 (so average 5th and 6th values)

WF High WF High sorted
31 17
17 17
31 20
23 21
17 21
20 23
28 24
24 28
21 31
21 31

Median = 22
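The rank rule (n+1)/2 can be sketched directly in Python. The function name `median_by_rank` is an illustrative choice; Python's standard library also offers `statistics.median`, which should give the same result.

```python
def median_by_rank(values):
    """Median via the 1-based rank (n + 1) / 2 on the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (n + 1) / 2
    if rank.is_integer():
        # Odd n: the rank points at a single middle value.
        return ordered[int(rank) - 1]
    # Even n: rank ends in .5, so average the two neighbouring values.
    lower = ordered[int(rank) - 1]
    upper = ordered[int(rank)]
    return (lower + upper) / 2


wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
wfhs_median = median_by_rank(wfhs)  # rank 5.5: average of 21 and 23
print(wfhs_median)                  # 22.0
```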

20

Median (cont’d)

A representative, central number

If data set has a center

Less sensitive to outliers than the average

For skewed data, represents the “typical case” better than the average does

21

Mode

Also summarizes the data

Most common or frequent data value

Tallest histogram bar

Can also have bi- and multi-modal distributions

Frequency Distribution (West Forsyth histogram of Frequency by Number of Volunteers)

Number of Volunteers Frequency
0 to 4 0
5 to 9 0
10 to 14 0
15 to 19 2
20 to 24 5
25 to 29 1
30 to 34 2

22

Which measure to use?

Mean

Best for “normal” data

Median

May not be any central number

Good for skewed data or data with outliers

Mode

May depend on how you draw the histogram

May be more than one mode

Only summary measure computable for nominal data

23

Which measure? (cont’d)

Average requires quantitative data (numbers)

Median works with quantitative or ordinal data

Mode works with quantitative, ordinal, or nominal data

Quantitative Ordinal Nominal

Average Yes - -

Median Yes Yes -

Mode Yes Yes Yes

24

Geometric Mean

Used to compute average rates of return/growth rates/percentage change over time

Suppose a portfolio of stock has grown as follows over the past five years:

What is the portfolio’s average rate of return?

Year Rate of return Growth factor
1 27% 1.27
2 10% 1.10
3 6% 1.06
4 2% 1.02
5 -12% .88

25

Geometric Mean

For n observations:

For stock portfolio

Portfolio five year average rate of return was 5.9 percent

Arithmetic mean rate of return of 6.6 percent overstates true rate of return
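The portfolio figures can be checked with a few lines of Python. This is a sketch using the growth factors from the table; the variable names are illustrative.

```python
import math

# Yearly growth factors and rates of return from the portfolio table.
growth_factors = [1.27, 1.10, 1.06, 1.02, 0.88]
rates = [0.27, 0.10, 0.06, 0.02, -0.12]
n = len(growth_factors)

# Geometric mean: nth root of the product of the n growth factors.
geometric_mean = math.prod(growth_factors) ** (1 / n)
geometric_rate = geometric_mean - 1

# Arithmetic mean of the yearly rates, for comparison.
arithmetic_rate = sum(rates) / n

print(round(geometric_mean, 3))   # 1.059
print(round(arithmetic_rate, 3))  # 0.066
```

Compounding at the geometric rate for five years reproduces the portfolio's actual total growth; compounding at the arithmetic rate of 6.6 percent would overshoot it.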

26

Examples of Variation

Production

Not all machines produce the same quality output

Stock prices

Not the same, day after day

Uncertain outcomes?

As a basketball coach, how would you decide between:

Player A who averaged 20 PPG over the last five games of last season, or

Player B who also averaged 20 PPG over the last five games of last season

27

Measures of Variation

Also known as variance, deviation, dispersion, uncertainty, diversity, risk

How to measure?

Range

Interquartile range

Sum of absolute deviations and mean absolute deviation (MAD) from the mean

Variance

Standard deviation

Coefficient of variation

28

Range

Range

highest value – lowest value

max(xi) – min(xi)

Sort data

Range = 31 – 17 = 14

WF High WF High Sorted
31 17
17 17
31 20
23 21
17 21
20 23
28 24
24 28
21 31
21 31

29

Graphical Interpretation of Interquartile Range

Divide sorted data into fourths (quartiles)

Interquartile range is the difference between the third and first quartiles

IQR = Q3 – Q1

IQR = 28 – 20 = 8

WF High WF High Sorted
31 17
17 17
31 20
23 21
17 21
20 23
28 24
24 28
21 31
21 31

Q3 = 28, median = 22, Q1 = 20

30

Interquartile Range

More formally:

WF High WF High Sorted
31 17
17 17
31 20
23 21
17 21
20 23
28 24
24 28
21 31
21 31
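A sketch of the formal quartile calculation, assuming the position rule q(n + 1)/4 with linear interpolation between neighbouring sorted values (the same convention as Excel's QUARTILE.EXC). The function name `quartile` is illustrative, and the sketch assumes at least three observations.

```python
def quartile(values, q):
    """q-th quartile (q = 1, 2, 3) at 1-based position q(n + 1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    position = q * (n + 1) / 4         # may be fractional, e.g. 2.75
    lower_rank = int(position)         # whole part of the position
    fraction = position - lower_rank   # how far to move toward the next value
    upper_rank = min(lower_rank + 1, n)
    lower = ordered[lower_rank - 1]
    upper = ordered[upper_rank - 1]
    return lower + fraction * (upper - lower)


wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
q1 = quartile(wfhs, 1)  # position 2.75: 17 + .75(20 - 17) = 19.25
q3 = quartile(wfhs, 3)  # position 8.25: 28 + .25(31 - 28) = 28.75
iqr = q3 - q1
print(q1, q3, iqr)      # 19.25 28.75 9.5
```

This formal value (9.5) differs from the graphical reading (28 – 20 = 8) because the positions 2.75 and 8.25 fall between data points and are interpolated.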

31

Measures of Variation

Sum of absolute deviations from the mean

Mean absolute deviation (MAD) from the mean

Are range, interquartile range, and mean absolute deviation from the mean biased? Why?

See Mean Absolute Deviation vs. Standard Deviation.xlsx
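A minimal sketch of both measures for the West Forsyth data; variable names are illustrative. The raw deviations are included to show why the absolute value is needed.

```python
wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
n = len(wfhs)
mean = sum(wfhs) / n  # 23.3

# Raw deviations from the mean cancel out, so their sum is (numerically) zero.
sum_dev = sum(x - mean for x in wfhs)

# Taking absolute values keeps positive and negative deviations from cancelling.
sum_abs_dev = sum(abs(x - mean) for x in wfhs)

# Mean absolute deviation: average distance from the mean.
mad = sum_abs_dev / n

print(round(sum_abs_dev, 1), round(mad, 2))  # 41.6 and 4.16
```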

32

Example: Volunteers

Average is 23.3. Sum of squared deviations is

(31 – 23.3)² + (17 – 23.3)² + . . . + (21 – 23.3)²

= 242.1

Divide by 10

On average, the number of volunteers for each event varies by about 24 students squared

average squared deviation = 242.1/10 = 24.2 volunteers²

33

Example: Volunteers

We would like to have our measure of spread in our original units (number of student volunteers)

Take square root: √24.2 = 4.9

On average, the number of volunteers for each service project varies by 4.9 students

Is this measure biased?

34

Variance and Standard Deviation

Dividing by n = 10 makes us think we are more certain about our statistic (have more information or degrees of freedom than we actually do)

As before, sum of squared deviations is 242.1

Divide by 10 – 1 = 9 to get variance and take square root to get standard deviation

standard deviation =

variance =

35

Variance and Standard Deviation

On average, the number of volunteers for each service project varies from the mean by about 5 students

Note greater uncertainty in the number of volunteers when dividing by n – 1 (sample standard deviation) rather than by n

5.2 (standard deviation, dividing by n – 1) versus 4.9 (square root of the average squared deviation, dividing by n)

Less information (9 versus 10 “free” observations) so should be more uncertain
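The effect of dividing by n versus n – 1 can be seen side by side in a short sketch (variable names are illustrative; the data are the West Forsyth counts):

```python
import math

wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
n = len(wfhs)
mean = sum(wfhs) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in wfhs)  # about 242.1

# Dividing by n treats all 10 observations as "free".
var_n = sum_sq_dev / n                 # about 24.2
sd_n = math.sqrt(var_n)                # about 4.9

# Dividing by n - 1 reflects the 9 degrees of freedom left once the
# sample mean has been estimated from the same data.
var_sample = sum_sq_dev / (n - 1)      # about 26.9
sd_sample = math.sqrt(var_sample)      # about 5.2

print(round(sd_n, 1), round(sd_sample, 1))
```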

36

Variance

Denoted by σ² for a population variance, and s² for a sample variance

$$\sigma^2 = \frac{(X_1 - \mu)^2 + (X_2 - \mu)^2 + \dots + (X_N - \mu)^2}{N}$$

$$S^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n - 1}$$

Divide by n – 1 so that the sample variance is an unbiased estimator of the population variance

37

Standard Deviation

Denoted by σ for population standard deviation, and s for sample standard deviation

Measures variability by answering:

“Approximately how far from average are the data values?”

Returns same measurement units as original data

Calculated by taking the square root of the variance

38

Working Formula

Population variance and standard deviation

39

Working Formula

Sample variance and standard deviation
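The working (computational) formulas avoid computing each deviation by using the sum of squared observations instead. A sketch that checks them against Python's `statistics.pvariance` and `statistics.variance`, assuming the usual forms σ² = Σx²/N – μ² and s² = (Σx² – n·x̄²)/(n – 1):

```python
import statistics

wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
n = len(wfhs)
mean = sum(wfhs) / n

# Only two running totals are needed: the sum and the sum of squares.
sum_of_squares = sum(x * x for x in wfhs)  # 5671

# Population working formula: mean of the squares minus the squared mean.
pop_var_working = sum_of_squares / n - mean ** 2

# Sample working formula: (sum of squares - n * mean^2) / (n - 1).
sample_var_working = (sum_of_squares - n * mean ** 2) / (n - 1)

print(round(pop_var_working, 2), statistics.pvariance(wfhs))
print(round(sample_var_working, 2), statistics.variance(wfhs))
```

Treating the ten projects as a population here is only for illustration of the formula; the slides treat them as a sample.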

40

Coefficient of Variation

Relative measure of dispersion

Measures amount of variation relative to size of observation

Unit-less measure—describes dispersion without having to use a variable’s unit of measurement

Enables comparison of dispersion across variables with different means and standard deviations or dissimilar units of measurement

How does the consistency in volunteering for service projects for WFHS students compare to that of Reynolds HS students?

WF High Reynolds High
31 27
17 10
31 16
23 22
17 32
20 15
28 13
24 20
21 33
21 29
- 15
Mean = 23.3 Mean = 21.1
Stdev = 5.2 Stdev = 8.1

41

Coefficient of Variation

Standard deviation as a percentage of the mean

WFHS students are relatively more consistent in volunteering for service projects

WF High Reynolds High
31 27
17 10
31 16
23 22
17 32
20 15
28 13
24 20
21 33
21 29
- 15
Mean = 23.3 Mean = 21.1
Stdev = 5.2 Stdev = 8.1
cv = .22 cv = .38
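The comparison can be reproduced with the standard library. This sketch assumes the sample standard deviation (dividing by n – 1) is the one used, which matches the values shown; the function name is illustrative.

```python
import statistics

wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
rhs = [27, 10, 16, 22, 32, 15, 13, 20, 33, 29, 15]


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a fraction of the mean."""
    return statistics.stdev(values) / statistics.mean(values)


cv_wfhs = coefficient_of_variation(wfhs)
cv_rhs = coefficient_of_variation(rhs)

print(round(cv_wfhs, 2), round(cv_rhs, 2))  # 0.22 and 0.38
```

The smaller value for WFHS is what supports the statement that West Forsyth students are relatively more consistent.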

42

Distribution Shapes

Continuous versus discrete data

What data might give rise to the following distributions?

43

Distribution Shapes (cont’d)

Normal

Symmetric

Bell-Shaped

Skewed

Not symmetric

Can cause trouble

Log transform?

Bimodal

Two clear groups

Find out why!

Analyze separately?

44

Describing Distributions

Measures of shape

Symmetry (skew)

Peakedness (kurtosis)

45

Measures of Shape

Skew

Areas under tails of distribution are unequal

Types

Right (positive) skew: median < mean

Left (negative) skew: median > mean

Skew allows you to better estimate if a current (or future) observation will be more or less than the mean

Mean, median, and mode are different

46

Measures of Shape

Skew (mathematical Excel formula)

n = sample size

s = sample standard deviation

x̄ = sample average

Excel function

=SKEW(WFHS) = .45

–.5 ≤ skew ≤ +.5 “considered” to be “symmetric” (i.e., “normal” distribution)
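A sketch of the sample skew calculation, assuming the bias-adjusted formula documented for Excel's SKEW function, n/((n – 1)(n – 2)) · Σ((xᵢ – x̄)/s)³. The function name `sample_skew` is illustrative; the sketch needs at least three observations and a nonzero standard deviation.

```python
import math


def sample_skew(values):
    """Bias-adjusted sample skewness (same form as Excel's SKEW)."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (divide by n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    # Sum of cubed standardized deviations, scaled by n / ((n - 1)(n - 2)).
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
skew_wfhs = sample_skew(wfhs)
print(round(skew_wfhs, 2))  # 0.45
```

The positive sign is consistent with the mean (23.3) lying above the median (22).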

47

Measures of Shape (cont’d)

Kurtosis

Measures peakedness/flatness of distribution

Types

Leptokurtic: kurtosis > 0

More peaked, fatter tails

Extreme values more likely than we would “normally” expect

Mesokurtic (normal): kurtosis = 0

Platykurtic: kurtosis < 0

Less peaked, thinner tails

48

Measures of Shape (cont’d)

Kurtosis (mathematical Excel formula)

n = sample size

s = sample standard deviation

x̄ = sample average

Excel function

=KURT(WFHS) = -1.07

Measures excess kurtosis (excess kurtosis = 0 for normal distribution)
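A sketch of the excess kurtosis calculation, assuming the bias-adjusted formula documented for Excel's KURT function. The function name `sample_excess_kurtosis` is illustrative; the sketch needs at least four observations and a nonzero standard deviation.

```python
import math


def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (same form as Excel's KURT)."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (divide by n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    # Sum of standardized deviations raised to the fourth power.
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    # Subtracting this term makes the result 0 for a normal distribution.
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]
kurt_wfhs = sample_excess_kurtosis(wfhs)
print(round(kurt_wfhs, 2))  # -1.07
```

A negative value means the West Forsyth counts are flatter than a normal distribution with the same mean and standard deviation.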

49

Excel Commands

See Descriptive Statistics Demonstration Sol’n.xls

mean =AVERAGE(WFHS) 23.3
median =MEDIAN(WFHS) 22
mode =MODE.MULT(WFHS) 31
=MODE.MULT(WFHS) 17
=MODE.MULT(WFHS) 21
=MODE.MULT(WFHS) #N/A
range =MAX(WFHS)-MIN(WFHS) 14
IQR =QUARTILE.EXC(WFHS,3)-QUARTILE.EXC(WFHS,1) 9.5
MAD =AVEDEV(WFHS) 4.2
variance =VAR.S(WFHS) 26.9
stdev =STDEV.S(WFHS) 5.2
cv (stdev/mean) 0.22
skew =SKEW(WFHS) 0.45
kurtosis =KURT(WFHS) -1.07
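For readers working outside Excel, most of these summary measures have close counterparts in Python's standard `statistics` module. The mapping below is a sketch, not part of the original demonstration workbook; it relies on `statistics.quantiles` with `method="exclusive"` using the same (n + 1)-based positions as QUARTILE.EXC.

```python
import statistics

wfhs = [31, 17, 31, 23, 17, 20, 28, 24, 21, 21]

mean = statistics.mean(wfhs)           # AVERAGE
median = statistics.median(wfhs)       # MEDIAN
modes = statistics.multimode(wfhs)     # MODE.MULT (all most-frequent values)
data_range = max(wfhs) - min(wfhs)     # MAX - MIN

# QUARTILE.EXC: three cut points that split the data into fourths.
q1, q2, q3 = statistics.quantiles(wfhs, n=4, method="exclusive")
iqr = q3 - q1

# AVEDEV: mean absolute deviation from the mean.
mad = sum(abs(x - mean) for x in wfhs) / len(wfhs)

variance = statistics.variance(wfhs)   # VAR.S (divides by n - 1)
stdev = statistics.stdev(wfhs)         # STDEV.S
cv = stdev / mean                      # coefficient of variation

print(mean, median, modes, data_range, iqr)
print(round(mad, 1), round(variance, 1), round(stdev, 1), round(cv, 2))
```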

50

Descriptive Statistics

Must use ALL information available

Mean, median, mode

Standard deviation

Skew

Kurtosis

Mean and standard deviation, the most commonly used and reported descriptive statistics, become less appropriate measures of central tendency and variability when data are skewed and kurtotic

On average, 23 students from West Forsyth, give or take five, will volunteer for a service project.

Arithmetic Mean (population mean):

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \dots + x_N}{N}$$

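Geometric Mean (for n observations):

$$GM = \sqrt[n]{(x_1) \times (x_2) \times (x_3) \times \dots \times (x_n)}$$

Geometric Mean (for stock portfolio):

$$GM = \sqrt[5]{(1.27) \times (1.10) \times (1.06) \times (1.02) \times (.88)} = 1.059$$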

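Interquartile Range (more formally, for WF High with n = 10):

$$Q_1 = \frac{(n+1)}{4} = \frac{(10+1)}{4} = 2.75^{\text{th}} \text{ data point}$$

$$Q_1 = 17 + .75(20 - 17) = 19.25$$

$$Q_3 = \frac{3(n+1)}{4} = \frac{3(10+1)}{4} = 8.25^{\text{th}} \text{ data point}$$

$$Q_3 = 28 + .25(31 - 28) = 28.75$$

$$IQR = Q_3 - Q_1 = 28.75 - 19.25 = 9.5$$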

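Measures of Variation (sum of absolute deviations from the mean):

$$\sum_{i=1}^{n} |x_i - \bar{x}|$$

Measures of Variation (mean absolute deviation from the mean):

$$\frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$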

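Variance and Standard Deviation (WF High example):

$$\text{variance} = \frac{242.1}{10 - 1} = 26.9$$

$$\text{standard deviation} = \sqrt{26.9} = 5.2$$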

Variance (summation form):

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

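Standard Deviation:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$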

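Working Formula (population variance and standard deviation):

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} (x_i^2 - 2 x_i \mu + \mu^2)}{N}$$

$$= \frac{\sum_{i=1}^{N} x_i^2}{N} - 2\mu \frac{\sum_{i=1}^{N} x_i}{N} + \frac{\sum_{i=1}^{N} \mu^2}{N}$$

$$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2$$

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2}$$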

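Working Formula (sample variance and standard deviation):

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i^2 - 2 x_i \bar{x} + \bar{x}^2)}{n - 1}$$

$$= \frac{\sum_{i=1}^{n} x_i^2}{n - 1} - 2\bar{x} \frac{\sum_{i=1}^{n} x_i}{n - 1} + \frac{\sum_{i=1}^{n} \bar{x}^2}{n - 1}$$

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{so} \quad n\bar{x} = \sum_{i=1}^{n} x_i$$

$$= \frac{\sum_{i=1}^{n} x_i^2}{n - 1} - \frac{2\bar{x}(n\bar{x})}{n - 1} + \frac{n\bar{x}^2}{n - 1}$$

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2}{n - 1} - \frac{n\bar{x}^2}{n - 1}$$

$$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n - 1} - \frac{n\bar{x}^2}{n - 1}}$$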

Coefficient of Variation:

$$cv = \frac{s}{\bar{x}}$$

$$cv_{WFHS} = \frac{5.2}{23.3} = .22$$

$$cv_{RHS} = \frac{8.1}{21.1} = .38$$

Measures of Shape (skew):

$$\text{coefficient of skew } (cs) = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$

Measures of Shape (kurtosis):

$$\text{coefficient of kurtosis } (ck) = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$