DATA: SUMMARIZING & EXPLORING DATA

Data stage and Data organization

Biol/Stat 2244 – Peter

Objectives

By the end of this module, you should be able to:

• identify the objective and guiding questions associated with the Data stage of the PPDAC framework

• describe and recognize characteristics of good data file organization

• create appropriate metadata to accompany a data file


Scientific Inquiry Framework: PPDAC

Problem: Define the research question.

Plan: Decide how to address the research Problem.

Data: Execute your Plan and examine your Data.

Analysis: Extract meaning from your Data.

Conclusion: Interpret your results in the context of the Problem.

Sampling & Study Designs (Ch. 6, 7)

Summarizing & Exploring Data (Ch. 1, 2)

Inference Procedures (Ch. 3, 4, 9-20, 23, 24)

Included in 2244 Lab Component


Data

Collect, monitor the quality of, and conduct a preliminary exploration of the data

• Does the data collection method need ‘tweaking’ to ensure quality (monitoring)?

• Are there any patterns, trends, or associations apparent in the data?

• Are there any outliers or missing values? If so, how will you handle them?
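
A preliminary check for the last question could be sketched as follows. This is only an illustration: the measurements are made up, the function name `screen_column` is hypothetical, and the 1.5 × IQR fences are one common convention for flagging potential outliers, not a rule stated in this module.

```python
import statistics

def screen_column(values):
    """Count missing entries (None) and flag values outside the 1.5 * IQR fences.

    The 1.5 * IQR rule is an assumed convention, used here only for illustration.
    """
    present = [v for v in values if v is not None]
    n_missing = len(values) - len(present)

    # Quartiles of the non-missing values (Python's default 'exclusive' method)
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = [v for v in present if v < low or v > high]
    return n_missing, outliers

# Hypothetical measurements with one missing value and one suspicious value
lengths_cm = [12.1, 11.8, None, 12.4, 30.0, 12.0, 11.9]
n_missing, outliers = screen_column(lengths_cm)
print("missing:", n_missing, "flagged:", outliers)
```

A flagged value is only a candidate for follow-up (for example, checking for a recording error); how to handle it remains a decision for the analyst.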


Data vocabulary

Datasets are typically organized as a table or matrix with rows and columns

Variable: a column, which contains all values of a particular characteristic

Observation: a row, which contains all values measured on the same unit (e.g. individual, day)
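
To make the vocabulary concrete, here is a minimal sketch of a small dataset held as a list of rows. The column names (`plant_id`, `height_cm`, `flowering`) and the values are invented for illustration.

```python
# Hypothetical dataset: each dict is one row of the table
dataset = [
    {"plant_id": "A1", "height_cm": 12.1, "flowering": "yes"},
    {"plant_id": "B2", "height_cm": 11.8, "flowering": "no"},
    {"plant_id": "C3", "height_cm": 12.4, "flowering": "yes"},
]

# An observation: all values measured on the same unit (here, plant B2)
observation = dataset[1]

# A variable: all values of one characteristic (here, height in cm)
variable = [row["height_cm"] for row in dataset]

print(observation)
print(variable)
```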


‘Tidy’ data rules

1. Each variable has its own column

2. Each observation has its own row

3. Each value has its own cell

“Figure 12.1” © Grolemund & Wickham in R for Data Science, licensed under CC BY-NC-ND 3.0 US

Tidy data are consistent with “long format”

Example of tidy vs. messy data

[Figure: two example tables, one labelled “Tidy” and one labelled “Messy”]
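
As an illustration of the tidy vs. messy distinction, the sketch below reshapes a small "wide" table, where one variable (height) is spread across several columns, into long format. The data, the column names, and the helper `to_long` are all hypothetical.

```python
# Messy (wide): the measurement day is hidden in the column names,
# so the single variable "height" occupies two columns
messy = [
    {"plant_id": "A1", "day1": 2.0, "day2": 2.6},
    {"plant_id": "B2", "day1": 1.8, "day2": 2.1},
]

def to_long(rows, id_col, key_name, value_name):
    """Turn every non-id column into its own row (one observation per row)."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_col:
                continue
            tidy.append({id_col: row[id_col], key_name: key, value_name: value})
    return tidy

# Tidy (long): columns are plant_id, day, height_cm; one row per plant per day
tidy = to_long(messy, id_col="plant_id", key_name="day", value_name="height_cm")
for row in tidy:
    print(row)
```

After reshaping, each variable (plant, day, height) has its own column and each plant-day measurement has its own row.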


Metadata

A descriptive paragraph, table of information, or more complex file type that helps others understand and use the data


• description of data collection

• description of variables and measurement units

• where to access the data

• known problems or inconsistencies

• quality check characteristics

➢ number of rows/columns

➢ sum for quantitative columns
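
The quality check characteristics above could be computed along these lines. This is a sketch under assumptions: the CSV content and the function name `quality_checks` are made up, and a column is treated as quantitative simply when every one of its entries parses as a number.

```python
import csv
import io

# Hypothetical data file contents
csv_text = """plant_id,height_cm,leaf_count
A1,12.1,5
B2,11.8,4
C3,12.4,6
"""

def quality_checks(text):
    """Return row/column counts and per-column sums for numeric columns.

    Assumes the file has a header row and at least one data row.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys())

    sums = {}
    for col in columns:
        try:
            # Rounded so the recorded sum is stable against float noise
            sums[col] = round(sum(float(r[col]) for r in rows), 6)
        except ValueError:
            # Non-numeric column (e.g. an identifier): no sum recorded
            continue

    return {"n_rows": len(rows), "n_columns": len(columns), "column_sums": sums}

checks = quality_checks(csv_text)
print(checks)
```

Recording these numbers in the metadata lets someone who downloads the file recompute them and confirm that the copy they received is complete and unaltered.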

Suggestions for better data sharing and description: http://doi.org/10.4033/iee.2013.6b.6.f