DATA: SUMMARIZING &
EXPLORING DATA
Missing values and Outliers
Biol/Stat 2244 – Peter
Objectives
By the end of this module, you should be able to:
• use and understand the vocabulary associated with missing values and outliers;
• recognize examples of missing values and outliers in a dataset;
• select an appropriate approach for dealing with outliers;
• use guiding questions to consider an approach to deal with missing values.
Scientific Inquiry Framework: PPDAC
Problem: Define the research question.
Plan: Decide how to address the research Problem.
Data: Execute your Plan and examine your Data.
Analysis: Extract meaning from your Data.
Conclusion: Interpret your results in the context of the Problem.
Sampling & Study Designs (Ch. 6, 7)
Summarizing & Exploring Data (Ch. 1, 2)
Inference Procedures (Ch. 3, 4, 9-20, 23, 24)
Included in 2244 Lab Component
Data
Collect, monitor the quality of, and conduct a
preliminary exploration of the data
• Does the data collection method need
‘tweaking’ to ensure quality (monitoring)?
• Are there any patterns, trends, or
associations apparent in the data?
• Are there any outliers or missing values? If so,
how will you handle them?
Outliers
Outliers are observations inconsistent with the pattern or ‘range’ of the rest of the dataset.
[Figure label: (possible) outliers]
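As a rough illustration of "inconsistent with the range of the rest of the dataset", the sketch below flags possible outliers with the 1.5 × IQR convention. This rule is an assumption brought in for illustration; the slides do not name a detection rule. The function name `flag_possible_outliers` and the lengths are hypothetical.

```python
from statistics import quantiles

def flag_possible_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical lengths in cm; one value sits far from the rest.
lengths_cm = [14.8, 15.0, 15.2, 15.5, 15.9, 16.1, 152.0]
possible = flag_possible_outliers(lengths_cm)
print(possible)
```

A flagged value is only a possible outlier: the rule says nothing about why the value is unusual. Quartile conventions also differ between software packages, so the cut-offs can shift slightly.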
Why do outliers exist?
• not part of the sampling frame/population of interest
[Figure labels: Survey of first-year student experience; First year biology; Upper years; Sample]
• rare, random variation
• measurement and/or reporting mistake
Each outlier may exist in the dataset for a different reason.
Handling outliers
The nature of the outlier should dictate how we ‘handle’ the value:
• correct the mistake
• re-measure the value
• remove, if the value is impossible
• conduct analyses with and without the outlier
[Figure labels: datasheet shows 152 cm; datasheet shows 15.2 cm]
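The "with and without" option can be sketched as follows: compute the same summaries twice and compare them. The heights, the suspected value, and the helper `summarize` are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical heights in cm; 152.0 is the suspected outlier.
heights_cm = [14.8, 15.0, 15.5, 15.9, 16.1, 152.0]
suspect = 152.0

def summarize(values):
    """Return the mean and median of a list of numbers, rounded to 2 places."""
    return {"mean": round(mean(values), 2), "median": round(median(values), 2)}

with_outlier = summarize(heights_cm)
without_outlier = summarize([v for v in heights_cm if v != suspect])

print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

If the two sets of results lead to the same conclusion, the outlier matters little. If they differ, reporting both lets the reader see how much the conclusion depends on that one value.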
Missing values (‘NAs’)
A missing value is an observation in the dataset for which there is no value.
Why do missing values exist?
• Planned in study design
• Chance failure of measurement device/researcher
• Non-chance avoidance of measurement
Each missing value may exist in the dataset for a different reason.
How to handle missing values
• Delete the value from the variable and/or from the entire dataset
• Replace the missing value with another value (e.g. the mean, the mode, or a predicted value)
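The two approaches can be sketched on a small made-up dataset. The records, variable names, and helper functions below are hypothetical; `None` stands in for a missing value (‘NA’), and the mean is used as the replacement value, which is only one of the options listed above.

```python
from statistics import mean

# Hypothetical records; None marks a missing value (an 'NA').
records = [
    {"id": 1, "height_cm": 15.0, "mass_g": 2.0},
    {"id": 2, "height_cm": None, "mass_g": 2.4},
    {"id": 3, "height_cm": 16.0, "mass_g": 2.2},
    {"id": 4, "height_cm": 17.0, "mass_g": None},
]

def observed(rows, variable):
    """Deletion from one variable: keep only the non-missing values."""
    return [r[variable] for r in rows if r[variable] is not None]

def drop_incomplete_rows(rows):
    """Deletion from the entire dataset: drop any row with a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def replace_with_mean(rows, variable):
    """Replacement: fill missing values with the mean of the observed ones."""
    fill = mean(observed(rows, variable))
    return [{**r, variable: fill if r[variable] is None else r[variable]}
            for r in rows]

print(observed(records, "height_cm"))
print(len(drop_incomplete_rows(records)))
print(replace_with_mean(records, "height_cm")[1])
```

Deleting whole rows discards the observed values in those rows as well. Replacing with the mean keeps every row but understates the variability of the variable, so the reason each value is missing should guide the choice.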