Statistics question

5B-Numericalandgraphicalsummaries.pdf

DATA: SUMMARIZING & EXPLORING DATA

Numerical & Graphical Summaries
Biol/Stat 2244 – Peter

Objectives

By the end of this module, you should be able to:

• Use vocabulary relevant to summarizing data (e.g. univariate, bivariate, distribution, relationship, etc.);
• Recognize and interpret a variety of numerical and graphical summaries;
• Select appropriate summaries based on variable type(s) and characteristic(s) of interest.

Scientific Inquiry Framework: PPDAC

Problem: Define the research question.
Plan: Decide how to address the research Problem.
Data: Execute your Plan and examine your Data.
Analysis: Extract meaning from your Data.
Conclusion: Interpret your results in the context of the Problem.

(MacKay & Oldford, 2000)

Data

Collect, monitor the quality of, and conduct a preliminary exploration of the data.

• Does the data collection method need ‘tweaking’ to ensure quality (monitoring)?
• Are there any patterns, trends, or associations apparent in the data?
• Are there any outliers or missing values? If so, how will you handle them?

Selecting a summary?

What do you want to do?
• Inspecting or describing a distribution: univariate
• Exploring relationships between variables: how many variables?
  • Just 2: bivariate
  • 3 or more: multivariate

What type of variable(s)?

What characteristic(s) or relationship do you want to emphasize/explore?

Background on data

Does aromatherapy foot massage have a therapeutic effect on blood pressure?

• Response variables:
✓ Blood pressure (mmHg)
• Cofactors:
✓ sex
✓ age
✓ smoking status
✓ personality type

Eguchi et al. 2016. PLoS ONE 11(3): e0151712. doi:10.1371/journal.pone.0151712

Frequency distributions

Summarizing the frequency of categorical variables as counts or relative frequencies
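As a rough illustration of counts versus relative frequencies, the sketch below tabulates one categorical variable. The `smoking_status` values and the helper name `frequency_table` are made up for this example; they are not data from the study.

```python
from collections import Counter

# Hypothetical smoking-status responses (illustrative only, not study data).
smoking_status = ["never", "former", "never", "current", "never",
                  "former", "never", "current", "never", "never"]


def frequency_table(values):
    """Return {category: (count, relative frequency)} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {category: (count, count / total) for category, count in counts.items()}


table = frequency_table(smoking_status)
for category, (count, rel_freq) in sorted(table.items()):
    print(f"{category:<8} count={count:>2}  relative frequency={rel_freq:.2f}")
```

Counts depend on sample size, while relative frequencies always sum to 1, which makes groups of different sizes easier to compare.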

Measures of centre

Describe the ‘typical’ value of a distribution

Mean

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

• incorporates all values from distribution

Median

$\tilde{x}$: the centremost value in the ordered list

• “resistant” to outliers or skew
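A small sketch of why the median is called “resistant”: replacing one value with an extreme outlier moves the mean a lot and leaves the median unchanged. The numbers are invented for illustration.

```python
from statistics import mean, median

# Hypothetical measurements (illustrative values only).
values = [4, 4, 6, 9, 10]
with_outlier = [4, 4, 6, 9, 100]  # same data, largest value replaced by an outlier

# The mean uses every value, so the outlier pulls it upward.
print("mean:  ", mean(values), "->", mean(with_outlier))

# The median depends only on the centremost ordered value, so it does not move.
print("median:", median(values), "->", median(with_outlier))
```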

Percentiles

A value below which a particular percentage of a distribution lies

Example (ordered data): 4 4 6 9 10 14 14 19 20 25
80th percentile: 80% of the distribution lies below it and 20% above

Quartiles divide the distribution into 4 equal-size sections
• Q1: first quartile/25th percentile
• Q2: second quartile/50th percentile/median
• Q3: third quartile/75th percentile
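One way to compute percentiles and quartiles for the ten ordered values on this slide is sketched below. It assumes linear interpolation between neighbouring ordered values, which is only one of several conventions; statistical software may use another and return slightly different answers. The function name `percentile` is introduced here for illustration.

```python
def percentile(values, p):
    """Percentile by linear interpolation between ordered values (one convention of several)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    position = (p / 100) * (len(ordered) - 1)  # fractional index into the ordered list
    lower = int(position)                       # index at or just below the position
    upper = min(lower + 1, len(ordered) - 1)    # next index, capped at the last value
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


data = [4, 4, 6, 9, 10, 14, 14, 19, 20, 25]  # values shown on the slide
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
p80 = percentile(data, 80)
print("Q1, median, Q3:", q1, q2, q3)
print("80th percentile:", p80)
```

Under this convention the 80th percentile falls between 19 and 20, so eight of the ten values lie below it and two lie above.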

Measures of spread

Characterise the variability in a distribution

range = maximum - minimum
• inflated by outliers and skew

5-number summary

min    Q1      median ($\tilde{x}$)    Q3    max
51     64.5    69                      78    107

Distances between consecutive values: 13.5, 4.5, 9, 29

Interquartile range (IQR) = Q3 – Q1 = 78 – 64.5 = 13.5

Measures of spread

Variance

sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

Standard deviation

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

• suitable for distributions without extreme outliers and/or skew

Bar graph

Visual representation of the frequency distribution for one or more categorical variables

[Figure: example bar graphs; panels: Univariate, Bivariate]

What about pie charts?

See https://www.data-to-viz.com/caveat/pie.html

Mosaic plot (aka treemap)

Areas of the rectangles reflect the relative frequencies from a contingency table for two or more categorical variables
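To make the link between a contingency table and rectangle areas concrete, the sketch below cross-tabulates two categorical variables and converts each cell to a relative frequency, which is the quantity a mosaic plot would draw as area. The observations and function names are hypothetical.

```python
from collections import Counter

# Hypothetical (sex, smoking status) observations; illustrative only.
observations = [
    ("female", "smoker"), ("female", "non-smoker"), ("female", "non-smoker"),
    ("female", "non-smoker"), ("male", "smoker"), ("male", "smoker"),
    ("male", "non-smoker"), ("male", "non-smoker"),
]


def contingency_table(pairs):
    """Count how often each combination of the two categories occurs."""
    return Counter(pairs)


def relative_frequencies(table):
    """Convert cell counts to proportions of the grand total."""
    total = sum(table.values())
    return {cell: count / total for cell, count in table.items()}


counts = contingency_table(observations)
areas = relative_frequencies(counts)
for (sex, status), share in sorted(areas.items()):
    print(f"{sex:<6} {status:<10} count={counts[(sex, status)]}  share={share:.3f}")
```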

Histograms and dotplots

Visual representation of the frequency distribution for a single quantitative variable

[Figure: panels: Histogram, Dotplot]

Describing shape: symmetry

Degree to which the distribution looks like a mirror image when split down the centre

[Figure: panels: Symmetric, Right-skewed]

Describing shape: modality

The number of prominent peaks in the distribution

[Figure: bimodal distribution]

Choice of binwidth can influence apparent shape
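The effect of binwidth can be seen without any plotting library. The sketch below bins the ten values from the percentile example with two different widths and prints a text histogram for each; the helper `bin_counts` and the choice to start bins at 0 are assumptions made for this illustration.

```python
from collections import Counter


def bin_counts(values, width, start=0):
    """Count values per bin [left, left + width), including empty bins in between."""
    counts = Counter(int((v - start) // width) for v in values)
    first, last = min(counts), max(counts)
    return {start + k * width: counts.get(k, 0) for k in range(first, last + 1)}


data = [4, 4, 6, 9, 10, 14, 14, 19, 20, 25]
narrow = bin_counts(data, width=5)
wide = bin_counts(data, width=10)

for width, bins in ((5, narrow), (10, wide)):
    print(f"bin width = {width}")
    for left_edge, count in bins.items():
        print(f"  [{left_edge:>2}, {left_edge + width:>2})  {'*' * count}")
```

With the narrower bins a peak and a long right tail are visible; with the wider bins the same data look much flatter.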

Boxplots

Visual representation of the five-number summary

[Figure: boxplot labelled with Min, Q1, median ($\tilde{x}$), Q3, Max, outliers, and 1.5*IQR]
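The “1.5*IQR” label most likely refers to the common convention of flagging points more than 1.5 × IQR below Q1 or above Q3 as outliers; that reading is an assumption, since the figure itself is not reproduced here. The sketch applies the rule to the five-number summary given earlier (51, 64.5, 69, 78, 107). The function name `outlier_fences` is hypothetical.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cut-offs beyond which points are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Five-number summary from the measures-of-spread slide.
minimum, q1, med, q3, maximum = 51, 64.5, 69, 78, 107

lower_fence, upper_fence = outlier_fences(q1, q3)
has_low_outlier = minimum < lower_fence    # is the smallest value below the lower fence?
has_high_outlier = maximum > upper_fence   # is the largest value above the upper fence?

print(f"fences: {lower_fence} to {upper_fence}")
print(f"low outlier present: {has_low_outlier}, high outlier present: {has_high_outlier}")
```

Under this rule the maximum of 107 lies above the upper fence, so a boxplot of these data would draw it as a separate outlier point.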

Stripcharts

Plot each data point of a quantitative response variable across categorical groups

[Figure: panels: bivariate, multivariate]

Mean plots

A visual representation of the mean, often with a measure of spread (e.g. SD) as error bars

Be careful to identify what the error bars represent!
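Error bars can look very different depending on what they show. As a hedged illustration, the sketch below computes the standard deviation (SD) and the standard error of the mean (SE = SD / √n) for two groups; SE is brought in here only as a contrasting example of another quantity often drawn as error bars. The blood-pressure readings and group names are invented.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical blood-pressure readings (mmHg) for two groups; illustrative only.
groups = {
    "control": [120, 125, 130, 135, 140],
    "massage": [110, 118, 122, 126, 134],
}


def mean_with_error_bars(values):
    """Return the mean with two candidate error-bar half-widths: SD and SE."""
    n = len(values)
    sd = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return {"mean": mean(values), "sd": sd, "se": sd / sqrt(n)}


summary = {name: mean_with_error_bars(values) for name, values in groups.items()}
for name, stats in summary.items():
    print(f"{name:<8} mean={stats['mean']:.1f}  SD={stats['sd']:.2f}  SE={stats['se']:.2f}")
```

Because SE shrinks as the sample size grows while SD does not, the same data can produce short or long bars depending on which one is plotted.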

Scatterplots

Plots data from two quantitative variables as (x,y) coordinates

Adding a categorical variable

Summary

Always keep your goal (exploration or communication) clear when selecting a relevant summary.

Prompt: Each of the summaries we’ve seen has specific applications, with pros and cons. In your opinion, which is the ‘best’ summary overall? Why?

Post to “Which summary is the best?”