Statistics question
DATA: SUMMARIZING & EXPLORING DATA
Numerical & Graphical Summaries
Biol/Stat 2244 – Peter
Objectives
By the end of this module, you should be able to:
• Use vocabulary relevant to summarizing data (e.g. univariate, bivariate, distribution, relationship, etc.);
• Recognize and interpret a variety of numerical and graphical summaries;
• Select appropriate summaries based on variable type(s) and characteristic(s) of interest.
Scientific Inquiry Framework: PPDAC
Problem: Define the research question.
Plan: Decide how to address the research Problem.
Data: Execute your Plan and examine your Data.
Analysis: Extract meaning from your Data.
Conclusion: Interpret your results in the context of the Problem.
(MacKay & Oldford, 2000)
Data
Collect, monitor the quality of, and conduct a
preliminary exploration of the data
• Does the data collection method need
‘tweaking’ to ensure quality (monitoring)?
• Are there any patterns, trends, or
associations apparent in the data?
• Are there any outliers or missing values? If so,
how will you handle them?
Selecting a summary?
What do you want to do?
• Inspecting or describing a distribution: univariate
• Exploring relationships between variables
How many variables?
• Just 2: bivariate
• 3 or more: multivariate
What type of variable(s)?
What characteristic(s) or relationship do you want to emphasize/explore?
Background on data
Does aromatherapy foot massage
have a therapeutic effect on blood
pressure?
• Response variables
✓ Blood pressure (mmHg)
• Cofactors:
✓ sex
✓ age
✓ smoking status
✓ personality type
Eguchi et al. 2016. PLoS ONE 11(3): e0151712. doi:10.1371/journal.pone.0151712
Frequency distributions
Summarizing the frequency of categorical
variables as counts or relative frequencies
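A minimal sketch of a frequency distribution for one categorical variable. The smoking-status responses are hypothetical (made up for illustration, not taken from the study), and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical responses for a categorical variable (illustrative only).
smoking_status = ["never", "current", "never", "former", "never", "current"]

# Counts: how many observations fall in each category.
counts = Counter(smoking_status)

# Relative frequencies: each count divided by the total number of observations.
n = len(smoking_status)
relative = {level: count / n for level, count in counts.items()}

for level in counts:
    print(f"{level}: count={counts[level]}, relative frequency={relative[level]:.2f}")
```

Relative frequencies always sum to 1, which makes them easier to compare across groups of different sizes.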
Measures of centre
Describe the ‘typical’ value of a distribution
Mean
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• incorporates all values from distribution
Median ($\tilde{x}$)
The centremost value in the ordered list
• “resistant” to outliers or skew
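A small sketch of why the median is called "resistant": replacing one value with an extreme outlier shifts the mean but leaves the median unchanged. Both lists are hypothetical values chosen for illustration.

```python
import statistics

# Hypothetical values (illustrative only).
values = [4, 4, 6, 9, 10]
with_outlier = [4, 4, 6, 9, 100]  # same list, largest value replaced by an outlier

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)      # the mean is pulled toward the outlier
print("median:", median_before, "->", median_after)  # the median does not move
```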
Percentiles
A value below which a particular percentage of a
distribution lies
Example: 4 4 6 9 10 14 14 19 20 25
80th percentile: 80% of the values lie below it and 20% lie above it (here, between 19 and 20)
Quartiles divide the distribution into 4 equal-size sections
• Q1: first quartile/25th percentile
• Q2: second quartile/50th percentile/median
• Q3: third quartile/75th percentile
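A rough sketch using the ten values from the percentile example. The helper `proportion_below` is a hypothetical name introduced here. Note that software packages use several slightly different interpolation rules for quartiles, so Q1 and Q3 may differ a little between tools; the median (Q2) is the same under all of them.

```python
import statistics

# The ten ordered values from the percentile example.
data = [4, 4, 6, 9, 10, 14, 14, 19, 20, 25]

def proportion_below(values, cutoff):
    """Fraction of the values that lie strictly below the cutoff."""
    return sum(v < cutoff for v in values) / len(values)

# Any cutoff between 19 and 20 has 80% of the values below it.
share = proportion_below(data, 19.5)
print("proportion below 19.5:", share)

# Quartiles via Python's standard library (one of several common conventions).
q1, q2, q3 = statistics.quantiles(data, n=4)
print("Q1, Q2, Q3:", q1, q2, q3)
```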
Measures of spread
characterise the variability in a distribution
range = maximum - minimum
• inflated by outliers and skew
5-number summary
min    Q1     median   Q3    max
51     64.5   69       78    107
Interquartile range (IQR) = Q3 – Q1 = 78 – 64.5 = 13.5
Width of each quarter: 13.5, 4.5, 9, 29
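The arithmetic behind the range and IQR can be sketched directly from the five-number summary on this slide. The dictionary layout and variable names are assumptions made for illustration.

```python
# Five-number summary from the slide.
five_number = {"min": 51, "Q1": 64.5, "median": 69, "Q3": 78, "max": 107}

# Range uses only the two most extreme values, so outliers inflate it.
data_range = five_number["max"] - five_number["min"]

# IQR covers the middle 50% of the data, so it ignores the extremes.
iqr = five_number["Q3"] - five_number["Q1"]

# Width of each quarter of the distribution (min to Q1, Q1 to median, ...).
order = ["min", "Q1", "median", "Q3", "max"]
quarter_widths = [five_number[b] - five_number[a] for a, b in zip(order, order[1:])]

print("range:", data_range)
print("IQR:", iqr)
print("quarter widths:", quarter_widths)
```

The top quarter (29) is much wider than the others, which hints at right skew or a high outlier.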
Measures of spread
Variance
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Standard deviation
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
• suitable for distributions without extreme outliers and/or skew
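A short sketch of the variance and standard deviation calculations, reusing the ten values from the percentile example. Treating the same values first as a sample and then as a whole population is purely illustrative; the only difference is the divisor (n - 1 versus N).

```python
import math
import statistics

data = [4, 4, 6, 9, 10, 14, 14, 19, 20, 25]
n = len(data)

mean = sum(data) / n
# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in data)

sample_variance = sum_sq_dev / (n - 1)   # divide by n - 1
population_variance = sum_sq_dev / n     # divide by N
sample_sd = math.sqrt(sample_variance)   # SD is the square root of the variance

print("mean:", mean)
print("sample variance:", round(sample_variance, 3))
print("population variance:", round(population_variance, 3))
print("sample SD:", round(sample_sd, 3))
```

The standard library functions `statistics.variance`, `statistics.pvariance` and `statistics.stdev` compute the same quantities.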
Bar graph
Visual representation of the frequency distribution for
one or more categorical variables
Univariate Bivariate
What about pie charts?
See https://www.data-to-viz.com/caveat/pie.html
Mosaic plot (aka treemap)
The areas of the rectangles reflect the relative frequencies from a contingency table for two or more categorical variables
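A minimal sketch of the contingency table that sits behind such a plot, for two categorical variables. The records are hypothetical (not the study data); each relative frequency corresponds to the area one rectangle would occupy.

```python
from collections import Counter

# Hypothetical (sex, smoking status) pairs, one per participant (illustrative only).
records = [
    ("F", "smoker"), ("F", "non-smoker"), ("M", "smoker"), ("F", "non-smoker"),
    ("M", "non-smoker"), ("M", "smoker"), ("F", "non-smoker"), ("M", "non-smoker"),
]

# Contingency table: count of participants in each combination of categories.
table = Counter(records)

# Relative frequency of each cell, i.e. its share of the total.
total = len(records)
relative = {cell: count / total for cell, count in table.items()}

for (sex, status), count in sorted(table.items()):
    print(f"{sex:2} {status:11} count={count} share={relative[(sex, status)]:.3f}")
```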
Histograms and dotplots
Visual representation of the frequency
distribution for a single quantitative variable
Histogram Dotplot
Describing shape: symmetry
degree to which the distribution looks like a
mirror image when split down the centre
Symmetric Right-skewed
Describing shape: modality
the number of prominent peaks in the distribution
bimodal
Choice of binwidth
can influence
apparent shape
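To see how bin width changes the picture, here is a rough text-only sketch that bins the ten values from the percentile example at two widths. The helper `bin_counts` is a hypothetical name, and bins are assumed to start at multiples of the width.

```python
from collections import Counter

data = [4, 4, 6, 9, 10, 14, 14, 19, 20, 25]

def bin_counts(values, width):
    """Count values per bin; each bin is labelled by its lower edge."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

for width in (5, 10):
    print(f"bin width = {width}")
    for lower_edge, count in bin_counts(data, width).items():
        # One '#' per observation gives a sideways text histogram.
        print(f"  [{lower_edge:2}, {lower_edge + width:2}) {'#' * count}")
```

With width 5 the counts look uneven and bumpy; with width 10 the same data look like a smooth decline.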
Boxplots
Visual representation of the five-number summary
Plot labels: Min, Q1, median ($\tilde{x}$), Q3, Max, outliers, 1.5*IQR
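A sketch of how the 1.5*IQR label is typically used, assuming the common convention that points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are drawn as outliers. The quartiles, minimum and maximum are the values from the five-number summary slide; the fence names are assumptions.

```python
# Values from the five-number summary slide.
minimum, q1, q3, maximum = 51, 64.5, 78, 107

iqr = q3 - q1

# Fences under the common 1.5*IQR convention (assumed here).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

def is_outlier(value):
    """True if the value falls outside the fences."""
    return value < lower_fence or value > upper_fence

print("fences:", lower_fence, upper_fence)
print("minimum flagged as outlier?", is_outlier(minimum))
print("maximum flagged as outlier?", is_outlier(maximum))
```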
Stripcharts
Plot each data point of a quantitative response
variable across categorical groups
bivariate multivariate
Mean plots
a visual representation of the mean, often with
a measure of spread (e.g. SD) as error bars
Be careful to
identify what the
error bars
represent!
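Why it matters what the error bars represent: a sketch comparing two common choices for the same group. The readings are hypothetical, and the standard error (SD divided by the square root of n) is an assumption about what an alternative error bar might show; it is not mentioned on the slide.

```python
import math
import statistics

# Hypothetical blood pressure readings for one group, in mmHg (illustrative only).
readings = [118, 124, 130, 122, 126]
n = len(readings)

mean = statistics.mean(readings)
sd = statistics.stdev(readings)   # spread of the individual observations
se = sd / math.sqrt(n)            # uncertainty in the estimate of the mean

print(f"mean = {mean}")
print(f"error bars as SD: {mean - sd:.2f} to {mean + sd:.2f}")
print(f"error bars as SE: {mean - se:.2f} to {mean + se:.2f}")
```

The SE bars are much narrower than the SD bars for the same data, so an unlabelled plot can easily be misread.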
Scatterplots
Plots data from two quantitative variables as
(x,y) coordinates
Adding a categorical variable
Summary
Always keep your goal—for exploration or
communication—clear when selecting a relevant
summary.
Prompt: Each of the summaries we’ve seen has
specific applications, with pros and cons. In your opinion,
which is the ‘best’ summary overall? Why?
Post to “Which summary is the best?”