Statistics question
DATA: SUMMARIZING & EXPLORING DATA
Data stage and Data organization Biol/Stat 2244 – Peter
Objectives
By the end of this module, you should be able to:
• identify the objective and guiding questions associated with the Data stage of the PPDAC framework
• describe and recognize characteristics of good data file organization
• create appropriate metadata to accompany a data file
Scientific Inquiry Framework: PPDAC
Problem: Define the research question.
Plan: Decide how to address the research Problem.
Data: Execute your Plan and examine your Data.
Analysis: Extract meaning from your Data.
Conclusion: Interpret your results in the context of the Problem.
Sampling & Study Designs (Ch. 6, 7)
Summarizing & Exploring Data (Ch. 1, 2)
Inference Procedures (Ch. 3, 4, 9-20, 23, 24)
Included in 2244 Lab Component
Data
Collect, monitor the quality of, and conduct a
preliminary exploration of the data
• Does the data collection method need
‘tweaking’ to ensure quality (monitoring)?
• Are there any patterns, trends, or
associations apparent in the data?
• Are there any outliers or missing values? If so,
how will you handle them?
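As a rough illustration of the last guiding question, the sketch below counts missing values and flags possible outliers in one quantitative variable. The measurements are made up, `None` is assumed to mark a missing value, and the 1.5 × IQR rule is only one common convention for flagging outliers; none of these choices come from the slides.

```python
import statistics

# Hypothetical measurements (not real data); None marks a missing value.
heights_cm = [162.0, 158.5, None, 171.2, 165.0, 250.0, 168.3, None, 160.1]


def summarize_quality(values):
    """Count missing values and flag outliers with the 1.5 x IQR rule."""
    observed = [v for v in values if v is not None]
    n_missing = len(values) - len(observed)

    # Quartiles of the observed values (Python's default "exclusive" method).
    q1, _median, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Values outside the fences are flagged for review, not deleted.
    outliers = [v for v in observed if v < low or v > high]
    return {"n_missing": n_missing, "outliers": outliers}


report = summarize_quality(heights_cm)
print(report)
```

A flagged value is only a prompt to check the data collection record; whether to keep, correct, or exclude it is a separate decision.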
Data vocabulary
Datasets are typically organized as a table or matrix
with rows and columns
Variable: contains all values of a particular characteristic
Observation: contains all values measured on the same unit (e.g. individual, day)
‘Tidy’ data rules
1. Each variable has its own column
2. Each observation has its own row
3. Each value has its own cell
“Figure 12.1” © Grolemund & Wickham in R for Data Science, licensed under CC BY-NC-ND 3.0 US
Tidy data are consistent with “long format”.
Example of tidy vs. messy data
[Figure: example tables labelled “Tidy” and “Messy”]
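To make the tidy rules concrete, here is a minimal sketch that reshapes a small "messy" (wide) table into long format using plain Python. The plant-height table, the column names, and the helper `to_long` are all hypothetical and are not taken from the slide's figure.

```python
# Hypothetical "messy" (wide) table: one row per plant, one column per week.
# The week is a variable, but here it is hidden in the column names.
messy = [
    {"plant": "A", "week1": 2.1, "week2": 3.4},
    {"plant": "B", "week1": 1.8, "week2": 2.9},
]


def to_long(rows, id_col, var_name, value_name):
    """Reshape wide rows into long format: one measurement per row."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue  # the identifier is repeated on each new row instead
            long_rows.append(
                {id_col: row[id_col], var_name: col, value_name: value}
            )
    return long_rows


tidy = to_long(messy, id_col="plant", var_name="week", value_name="height_cm")
for r in tidy:
    print(r)
```

In the reshaped table each variable (plant, week, height) has its own column, each measurement occasion has its own row, and each cell holds a single value.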
Metadata
A descriptive paragraph, table of information, or more complex file type that helps others understand and use the data. It may include:
• description of data collection
• description of variables and measurement units
• where to access the data
• known problems or inconsistencies
• quality check characteristics
➢ number of rows/columns
➢ sum for quantitative columns
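The quality check characteristics listed above could be computed along these lines. This is a sketch only: the CSV content, the column names, and the helper `quality_check` are invented for illustration, and the file is read from an in-memory string so the example is self-contained.

```python
import csv
import io

# Hypothetical data file contents (in practice, read from the shared file).
csv_text = """plant,week,height_cm
A,1,2.1
A,2,3.4
B,1,1.8
B,2,2.9
"""


def quality_check(text, quantitative_cols):
    """Return row count, column count, and sums of quantitative columns."""
    rows = list(csv.DictReader(io.StringIO(text)))
    n_cols = len(rows[0]) if rows else 0
    # Rounding avoids reporting floating-point noise in the metadata.
    sums = {
        col: round(sum(float(r[col]) for r in rows), 6)
        for col in quantitative_cols
    }
    return {"n_rows": len(rows), "n_cols": n_cols, "column_sums": sums}


checks = quality_check(csv_text, ["height_cm"])
print(checks)
```

Someone who downloads the file can rerun the same check and compare the numbers with those recorded in the metadata to confirm that the file arrived intact.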
Suggestions for better data sharing and description: http://doi.org/10.4033/iee.2013.6b.6.f