Statistics question

5C-Handlingmissingvaluesandoutliers.pdf

DATA: SUMMARIZING & EXPLORING DATA

Missing values and Outliers

Biol/Stat 2244 – Peter

Objectives

By the end of this module, you should be able to:

• use and understand the vocabulary associated with missing values and outliers;

• recognize examples of missing values and outliers in a dataset;

• select an appropriate approach for dealing with outliers;

• use guiding questions to consider an approach to deal with missing values.

Scientific Inquiry Framework: PPDAC

Problem: Define the research question.

Plan: Decide how to address the research Problem.

Data: Execute your Plan and examine your Data.

Analysis: Extract meaning from your Data.

Conclusion: Interpret your results in the context of the Problem.

Sampling & Study Designs (Ch. 6, 7)

Summarizing & Exploring Data (Ch. 1, 2)

Inference Procedures (Ch. 3, 4, 9-20, 23, 24)

Included in 2244 Lab Component

Data

Collect, monitor the quality of, and conduct a preliminary exploration of the data

• Does the data collection method need ‘tweaking’ to ensure quality (monitoring)?

• Are there any patterns, trends, or associations apparent in the data?

• Are there any outliers or missing values? If so, how will you handle them?

Outliers

observations inconsistent with the pattern/‘range’ of the rest of the dataset

[Figure: (possible) outliers labelled]
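One common way to flag values that sit outside the ‘range’ of the rest of the data is the 1.5 × IQR rule. The slides do not name a specific rule, so the sketch below is an assumption: the function name `flag_possible_outliers`, the multiplier `k`, and the sample values are all hypothetical. It only flags *possible* outliers; deciding what to do with them is a separate step.

```python
import statistics

def flag_possible_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles.

    Assumption: the 1.5 * IQR rule is used here for illustration only;
    it is not prescribed by the slides.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # lower quartile, median, upper quartile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements in cm; one value is far from the rest.
lengths_cm = [14.8, 15.0, 15.2, 15.5, 15.9, 16.1, 152.0]
print(flag_possible_outliers(lengths_cm))
```

A flagged value is a prompt to investigate, not an automatic deletion.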

Why do outliers exist?

• not part of the sampling frame/population of interest

• rare, random variation

• measurement and/or reporting mistake

each outlier may exist in the dataset for a different reason

[Figure: Survey of first-year student experience; labels: First year biology, Upper years, Sample]

Handling outliers

nature of outlier should dictate how we ‘handle’ the value

• correct mistake

• re-measure value

• conduct analyses with and without

• remove, if value is impossible

[Figure: example comparing two cases, ‘datasheet shows 152 cm’ and ‘datasheet shows 15.2 cm’]
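The ‘conduct analyses with and without’ option can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the helper names `summarize` and `with_and_without`, the choice of mean and median as the ‘analysis’, and the data are assumptions made for the example.

```python
import statistics

def summarize(values):
    """Return sample size, mean, and median for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

def with_and_without(values, suspect_index):
    """Summarize the data twice: once as recorded, once with one value left out."""
    kept = [v for i, v in enumerate(values) if i != suspect_index]
    return summarize(values), summarize(kept)

# Hypothetical measurements in cm; the last value is the suspected outlier.
lengths_cm = [14.8, 15.0, 15.2, 15.5, 152.0]
full, reduced = with_and_without(lengths_cm, suspect_index=4)
print("with:   ", full)
print("without:", reduced)
```

If the two summaries lead to the same conclusion, the outlier matters little; if they differ, both results are worth reporting.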

Missing values (‘NAs’)

an observation in the data set for which there is no value
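To recognize missing values in practice, a first step is often to count them per variable. The sketch below assumes missing entries are stored as `None` in a list of records; the function name `count_missing` and the survey data are hypothetical.

```python
def count_missing(rows):
    """Count missing (None) entries per variable in a list of dict records."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            # (value is None) is True/False, which adds as 1/0.
            counts[name] = counts.get(name, 0) + (value is None)
    return counts

# Hypothetical survey records; None marks a missing value (an 'NA').
survey = [
    {"id": 1, "age": 18, "hours_studied": 10.0},
    {"id": 2, "age": None, "hours_studied": 7.5},
    {"id": 3, "age": 19, "hours_studied": None},
    {"id": 4, "age": None, "hours_studied": 12.0},
]
print(count_missing(survey))
```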

Why do missing values exist?

• Planned in study design

• Chance failure of measurement device/researcher

• Non-chance avoidance of measurement

each missing value may exist in the dataset for a different reason

How to handle missing values

• Deletion of the value from the variable and/or from the entire dataset

• Replace the missing value with another value (e.g. the mean, mode, a predicted value, etc.)
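The two approaches above can be sketched in a few lines. This is an illustration under stated assumptions: missing values are represented as `None`, replacement uses the mean of the observed values (one of several options listed above), and the function names `drop_missing`, `complete_cases`, and `impute_mean` are hypothetical.

```python
import statistics

def drop_missing(values):
    """Deletion from one variable: keep only the observed values."""
    return [v for v in values if v is not None]

def complete_cases(rows):
    """Deletion from the entire dataset: keep records with no missing entry."""
    return [row for row in rows if all(v is not None for v in row.values())]

def impute_mean(values):
    """Replacement: fill each missing entry with the mean of the observed values."""
    fill = statistics.mean(drop_missing(values))
    return [fill if v is None else v for v in values]

# Hypothetical data; None marks a missing value.
hours = [2.0, None, 4.0]
records = [{"age": 18, "hours": 2.0}, {"age": 19, "hours": None}]

print(drop_missing(hours))
print(impute_mean(hours))
print(complete_cases(records))
```

Deletion shrinks the sample, while replacement keeps the sample size but can understate variability, so the reason the value is missing should guide the choice.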
