week 8-data science

Data Science and Big Data Analytics

Chap 3: Data Analytics Using R

This chapter has three sections:

An overview of R

Exploratory data analysis in R using visualization

A brief review of statistical inference, covering hypothesis testing and analysis of variance

3.1 Introduction to R

Generic R functions are functions that share the same name but behave differently depending on the type of arguments they receive (polymorphism)

Some important functions used in this chapter (most are generic)

head() displays the first six rows of a dataset (by default)

summary() generates descriptive statistics

plot() can generate a scatter plot of one variable against another

lm() fits a linear regression model relating one variable to another

hist() generates a histogram

help() provides details of a function
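As a minimal sketch of how a generic function behaves, the calls below apply summary() to a numeric vector, a factor, and a fitted model. The built-in mtcars dataset is an assumption chosen for illustration; it is not the dataset used in the slides.

```r
# Illustration only: mtcars ships with base R and is not the slides' dataset
data(mtcars)

first_rows <- head(mtcars)                 # first six rows by default
print(first_rows)

# summary() is generic: its output depends on the class of its argument
print(summary(mtcars$mpg))                 # numeric vector: min, quartiles, mean, max
cyl_counts <- summary(factor(mtcars$cyl))  # factor: number of observations per level
print(cyl_counts)

fit <- lm(mpg ~ wt, data = mtcars)         # linear regression of mpg on weight
print(summary(fit))                        # lm object: coefficient table and fit statistics

plot(mtcars$wt, mtcars$mpg)                # scatter plot of mpg against weight
hist(mtcars$mpg)                           # histogram of mpg
# help(summary) opens the documentation page for summary()
```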

3.1 Introduction to R: Example – number of orders vs sales

lm(formula = sales$sales_total ~ sales$num_of_orders)

Fitted coefficients: intercept = -154.1, slope = 166.2
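The sales data behind this example is not reproduced here, so the sketch below builds a hypothetical stand-in data frame with the same column names. The generating intercept and slope are borrowed from the slide purely for illustration, and the noise level and sample size are assumptions.

```r
# Hypothetical stand-in for the slides' sales data (synthetic, not the real file)
set.seed(123)
num_of_orders <- sample(1:20, 200, replace = TRUE)
sales_total <- -154.1 + 166.2 * num_of_orders + rnorm(200, mean = 0, sd = 50)
sales <- data.frame(num_of_orders, sales_total)

# The data argument avoids repeating sales$ inside the formula
fit <- lm(sales_total ~ num_of_orders, data = sales)
print(coef(fit))                      # estimated intercept and slope

plot(sales$num_of_orders, sales$sales_total,
     xlab = "Number of orders", ylab = "Total sales")
abline(fit, col = "red")              # overlay the fitted regression line
```

Passing the data frame through the data argument is a common alternative to the sales$ prefix used on the slide; both forms fit the same model.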

3.1 Introduction to R

3.1.1 R Graphical User Interfaces

Getting R and RStudio

3.1.2 Data Import and Export

Necessary for project work

3.1.3 Attributes and Data Types

Vectors, matrices, data frames

3.1.4 Descriptive Statistics

summary(), mean(), median(), sd()
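The topics in this outline can be sketched in a few lines. The values, column names, and temporary file below are hypothetical and exist only to make the example self-contained.

```r
# Attributes and data types: a vector, a matrix, and a data frame
v <- c(2, 4, 4, 5, 10)                 # numeric vector
m <- matrix(1:6, nrow = 2)             # 2 x 3 integer matrix
df <- data.frame(id = 1:5, value = v)  # data frame with two columns

# Data export and import: round trip through a temporary CSV file
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
df2 <- read.csv(path)

# Descriptive statistics on the imported column
print(summary(df2$value))
print(mean(df2$value))
print(median(df2$value))
print(sd(df2$value))
```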

3.1.1 Getting R and RStudio

Download R and install (32-bit and 64-bit)

https://www.r-project.org/

R-3.5.1 for Windows (32/64 bit)

https://cran.cnr.berkeley.edu/bin/windows/base/R-3.5.1-win.exe

Download RStudio and install

https://www.rstudio.com/products/RStudio/#Desk

3.1.1 RStudio GUI

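3.2 Exploratory Data Analysis: Scatterplots show possible relationships

x <- rnorm(50) # 50 normal draws; default is mean=0, sd=1

y <- x + rnorm(50, mean=0, sd=0.5) # y depends linearly on x, plus noise

plot(x, y) # x on the horizontal axis, y on the vertical axis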

3.2 Exploratory Data Analysis

3.2.1 Visualization before Analysis

3.2.2 Dirty Data

3.2.3 Visualizing a Single Variable

3.2.4 Examining Multiple Variables

3.2.5 Data Exploration versus Presentation

3.2.1 Visualization before Analysis: Anscombe’s quartet – 4 datasets, nearly identical summary statistics

3.2.1 Visualization before Analysis: Anscombe’s quartet – visualized
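The slide's figure is not reproduced here. As a sketch, base R ships the quartet as the built-in anscombe data frame (columns x1 to x4 and y1 to y4), so the near-identical statistics and the very different shapes can both be checked directly. The helper names below are illustrative.

```r
data(anscombe)   # built-in data frame with columns x1..x4 and y1..y4

# Summary statistics and regression coefficients for each of the four datasets
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2]))
})
colnames(quartet_stats) <- paste0("dataset", 1:4)
print(round(quartet_stats, 2))

# Plot the four datasets in a 2 x 2 grid, each with its regression line
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Dataset", i))
  abline(lm(y ~ x), col = "red")
}
par(old_par)     # restore the previous plotting layout
```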

3.2.1 Visualization before Analysis: Anscombe’s quartet – RStudio exercise

Enter and plot Anscombe’s dataset #3 and obtain the linear regression line

(More regression coming in Chapter 6)

x <- 4:14

x

y <- c(5.39,5.73,6.08,6.42,6.77,7.11,7.46,7.81,8.15,12.74,8.84)

y

summary(x)

var(x)

summary(y)

var(y)

plot(y~x)

lm(y~x)

3.2.2 Dirty Data: Age distribution of bank account holders

What is wrong here?

3.2.2 Dirty Data: Age of mortgage

What is wrong here?
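The figures for these two slides are not reproduced here, so the following is only a general sketch of how a histogram and a few checks can surface dirty values. The age vector is synthetic, and the validity thresholds (0 and 110) are assumptions, not values from the slides.

```r
# Synthetic ages with a few deliberately bad entries (hypothetical data)
set.seed(7)
age <- c(sample(18:90, 95, replace = TRUE), 0, 0, 0, -1, NA)

# A histogram makes implausible values visible as isolated bars
hist(age, breaks = 30, main = "Age of account holders (synthetic)", xlab = "Age")

n_missing <- sum(is.na(age))                      # count missing values
suspect <- !is.na(age) & (age <= 0 | age > 110)   # assumed validity range
n_suspect <- sum(suspect)
print(c(missing = n_missing, suspect = n_suspect))

# na.rm = TRUE skips missing values; without it the mean would be NA
print(mean(age, na.rm = TRUE))

# Keep only plausible, non-missing ages for further analysis
clean_age <- age[!is.na(age) & !suspect]
print(summary(clean_age))
```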

3.2.3 Visualizing a Single Variable: Example Visualization Functions

3.2.3 Visualizing a Single Variable: Dotchart – MPG of Car Models

3.2.3 Visualizing a Single Variable: Barplot – Distribution of Car Cylinder Counts
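The dotchart and barplot figures are not reproduced here. A minimal sketch of both, assuming the built-in mtcars dataset as the source of the car data, might look like this:

```r
data(mtcars)   # assumed data source: built-in car data with mpg and cyl columns

# Dotchart: one labelled dot per car model
dotchart(mtcars$mpg, labels = row.names(mtcars), cex = 0.7,
         main = "Miles per gallon (MPG) of car models",
         xlab = "MPG")

# Barplot: number of cars for each cylinder count
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts,
        main = "Distribution of car cylinder counts",
        xlab = "Number of cylinders", ylab = "Count")
```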

3.2.3 Visualizing a Single Variable: Histogram – Income

3.2.3 Visualizing a Single Variable: Density – Income (log10 scale)

On a log10 scale, the density plot emphasizes the approximately log-normal shape of the income distribution

The rug() function adds a one-dimensional plot along the bottom axis, with one tick mark per observation, to show where the data are concentrated
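As a sketch of the histogram, density, and rug plots, the code below uses synthetic log-normal incomes. The data and its parameters are assumptions made for illustration; they are not the income data shown on the slides.

```r
# Synthetic, right-skewed incomes (hypothetical data)
set.seed(42)
income <- rlnorm(1000, meanlog = log(40000), sdlog = 0.75)

# Histogram on the raw scale: most values crowd into the leftmost bins
hist(income, breaks = 50, main = "Histogram of income", xlab = "Income")

# Density of log10(income): the skew largely disappears on the log scale
log_income <- log10(income)
dens <- density(log_income)
plot(dens, main = "Density of income (log10 scale)", xlab = "log10(income)")

# rug() adds one tick per observation along the bottom axis
rug(log_income)
```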

3.2.3 Visualizing a Single Variable: Density plots – diamond prices and the log of diamond prices

3.2.4 Examining Multiple Variables: Examining two variables with regression

Red line = linear regression

Blue line = LOESS curve fit
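A minimal sketch of overlaying a linear fit and a LOESS fit on one scatterplot follows. The data is synthetic and the cubic generating curve is an assumption, chosen so that the two fits visibly disagree.

```r
# Synthetic nonlinear data (hypothetical); x is sorted so lines() draws left to right
set.seed(2024)
x <- sort(runif(75, min = 0, max = 10))
y <- 200 + x^3 - 10 * x^2 + x + rnorm(75, mean = 0, sd = 20)

linear_fit <- lm(y ~ x)       # straight-line fit
loess_fit <- loess(y ~ x)     # local regression (LOESS) fit

plot(x, y)
abline(linear_fit, col = "red")              # red line = linear regression
lines(x, predict(loess_fit), col = "blue")   # blue line = LOESS curve fit
```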

3.2.4 Examining Multiple Variables: Dotchart – MPG of car models grouped by cylinder

3.2.4 Examining Multiple Variables: Barplot – visualize multiple variables
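The grouped dotchart and the multi-variable barplot can be sketched as below, again assuming the built-in mtcars dataset. Using gear as the second variable in the barplot is an illustrative choice.

```r
data(mtcars)   # assumed data source

# Dotchart of MPG, with car models grouped by number of cylinders
cars <- mtcars[order(mtcars$mpg), ]   # sort so dots rise within each group
cars$cyl <- factor(cars$cyl)
dotchart(cars$mpg, labels = row.names(cars), groups = cars$cyl, cex = 0.7,
         main = "MPG of car models grouped by cylinder", xlab = "MPG")

# Barplot of two variables: gear counts within each cylinder count
counts <- table(mtcars$gear, mtcars$cyl)   # rows = gears, columns = cylinders
barplot(counts, beside = TRUE,
        legend.text = rownames(counts),
        args.legend = list(x = "top", title = "Number of gears"),
        xlab = "Number of cylinders", ylab = "Count")
```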

3.2.4 Examining Multiple Variables: Box-and-whisker plot – income versus region

Box contains central 50% of data

Line inside box is median value

Shows data quartiles
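The income-by-region data is not available here, so this sketch draws a box-and-whisker plot of mpg by cylinder count from the built-in mtcars dataset instead (a substitution made only for illustration). The value returned by boxplot() exposes the numbers behind each box.

```r
data(mtcars)   # stand-in data; the slide uses income versus region

# One box per group; boxplot() invisibly returns the plotted statistics
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Number of cylinders", ylab = "MPG")

# Rows of bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$stats)

# The middle row matches the per-group medians
group_medians <- tapply(mtcars$mpg, mtcars$cyl, median)
print(group_medians)
```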

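3.2.4 Examining Multiple Variables: Scatterplot and hexbinplot – income vs education

The hexbinplot combines the ideas of the scatterplot and the histogram: points are grouped into hexagonal bins, and each bin is shaded by its count

For high-volume data, a hexbinplot may be better than a scatterplot, because heavily overlapping points obscure the pattern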

3.2.4 Examining Multiple Variables: Matrix of Scatterplots
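A scatterplot matrix can be sketched with pairs(). The built-in iris dataset and the color choices are assumptions for illustration; the slide's figure is not reproduced here.

```r
data(iris)   # assumed data source: four numeric measurements plus Species

# One fill color per observation, chosen by species
species_cols <- c("red", "green3", "blue")[unclass(iris$Species)]

# Every pairwise scatterplot of the four numeric columns
pairs(iris[1:4], main = "Iris measurements", pch = 21, bg = species_cols)
```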

3.2.4 Examining Multiple Variables: Variable over time – airline passenger counts
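As a sketch, base R includes the monthly AirPassengers time series, which is assumed here to be a suitable stand-in for the slide's airline passenger figure.

```r
data(AirPassengers)   # built-in monthly time series (class "ts")

# plot() is generic: for a ts object it draws a line chart against time
plot(AirPassengers, xlab = "Year", ylab = "Passengers (thousands)",
     main = "Airline passenger counts over time")

print(start(AirPassengers))       # first observation (year, month)
print(frequency(AirPassengers))   # observations per year
```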

3.2.5 Exploration vs Presentation

Data visualization for data exploration is different from presenting results to stakeholders

Data scientists prefer graphs that are technical in nature

Nontechnical stakeholders prefer simple, clear graphics that focus on the message rather than the data

3.2.5 Exploration vs Presentation: Density plots better for data scientists

3.2.5 Exploration vs Presentation: Histograms better to show stakeholders

3.3 Statistical Methods for Evaluation: Statistics helps answer data analytics questions

Model Building

What are the best input variables for the model?

Can the model predict the outcome given the input?

Model Evaluation

Is the model accurate?

Does the model perform better than an obvious guess?

Does the model perform better than other models?

Model Deployment

Is the prediction sound?

Does the model have the desired effect (e.g., reducing cost)?

3.3 Statistical Methods for Evaluation: Subsections

3.3.1 Hypothesis Testing

3.3.2 Difference of Means

3.3.3 Wilcoxon Rank-Sum Test

3.3.4 Type I and Type II Errors

3.3.5 Power and Sample Size

3.3.6 ANOVA (Analysis of Variance)

3.3.1 Hypothesis Testing

The basic concept is to form an assertion and test it with data

The common default assumption is that there is no difference between samples

Statisticians refer to this as the null hypothesis (H0)

The alternative hypothesis (HA) is that there is a difference between samples

3.3.1 Hypothesis Testing: Example Null and Alternative Hypotheses

3.3.2 Difference of Means: Two populations – same or different?

3.3.2 Difference of Means: Two Parametric Methods

Student’s t-test

Assumes two normally distributed populations, and that they have equal variance

Welch’s t-test

Assumes two normally distributed populations, and they don’t necessarily have equal variance
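Both tests are available through t.test() in base R; the var.equal argument selects between them. The samples below are synthetic, and their sizes, means, and standard deviations are assumptions chosen for illustration.

```r
# Two synthetic samples from normal populations (hypothetical data)
set.seed(10)
x <- rnorm(10, mean = 100, sd = 5)
y <- rnorm(20, mean = 105, sd = 5)

# Student's t-test: pooled variance, so equal variances are assumed
student <- t.test(x, y, var.equal = TRUE)
print(student)

# Welch's t-test: the default (var.equal = FALSE), no equal-variance assumption
welch <- t.test(x, y)
print(welch)

# Student's test uses n1 + n2 - 2 degrees of freedom; Welch's are usually fractional
print(c(student_df = unname(student$parameter), welch_df = unname(welch$parameter)))
```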

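3.3.3 Wilcoxon Rank-Sum Test: A Nonparametric Method

Does not assume that the populations follow any particular probability distribution, such as the normal; it compares the ranks of the observations rather than their values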
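In base R the rank-sum test is run with wilcox.test(). The skewed synthetic samples below are assumptions, picked so that a normality assumption would be doubtful.

```r
# Two synthetic, right-skewed samples (hypothetical data)
set.seed(5)
x <- rexp(15, rate = 1)          # exponential with mean 1
y <- rexp(15, rate = 1) + 0.8    # same shape, shifted upward by 0.8

# Two-sample Wilcoxon rank-sum test; conf.int = TRUE also estimates the shift
w <- wilcox.test(x, y, conf.int = TRUE)
print(w)
```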

3.3.4 Type I and Type II Errors

A hypothesis test may result in two types of errors

Type I error – rejection of the null hypothesis when the null hypothesis is TRUE

Type II error – failure to reject the null hypothesis when the null hypothesis is FALSE
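Both error rates can be estimated by simulation. This is a sketch under stated assumptions: a significance level of 0.05, samples of size 20, unit standard deviation, and a true mean difference of 0.5 in the second scenario are all illustrative choices.

```r
set.seed(99)
alpha <- 0.05   # assumed significance level

# Scenario 1: the null hypothesis is TRUE (both samples share the same mean)
p_null <- replicate(2000, t.test(rnorm(20), rnorm(20))$p.value)
type1_rate <- mean(p_null < alpha)    # share of false rejections, near alpha

# Scenario 2: the null hypothesis is FALSE (means differ by 0.5)
p_alt <- replicate(2000, t.test(rnorm(20), rnorm(20, mean = 0.5))$p.value)
type2_rate <- mean(p_alt >= alpha)    # share of missed differences

print(c(type1 = type1_rate, type2 = type2_rate))
```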

3.3.5 Power and Sample Size

The power of a test is the probability of correctly rejecting the null hypothesis, that is, of rejecting it when it is false

The power of a test increases as the sample size increases

Effect size d = difference between the means

It is important to consider an appropriate effect size for the problem at hand
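The relationship between power and sample size can be sketched with power.t.test() from base R. The effect size (0.5), standard deviation (1), significance level (0.05), and target power (0.8) below are assumptions chosen for illustration.

```r
# Power of a two-sample t-test at several per-group sample sizes
sizes <- c(10, 20, 50, 100)
powers <- sapply(sizes, function(n) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05)$power
})
print(data.frame(n_per_group = sizes, power = round(powers, 3)))

# Solve the other way: per-group sample size needed to reach 80% power
n_needed <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)$n
print(ceiling(n_needed))
```

Leaving exactly one of n, delta, sd, sig.level, and power unspecified tells power.t.test() which quantity to solve for.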

3.3.6 ANOVA (Analysis of Variance)

A generalization of the hypothesis test for the difference of two population means

Good for analyzing more than two populations

ANOVA tests whether any of the population means differ from the other population means
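A one-way ANOVA can be sketched with aov(). The built-in PlantGrowth dataset (plant weight for a control group and two treatment groups) is an assumption used for illustration; it does not appear in the slides.

```r
data(PlantGrowth)   # built-in: weight measured in three groups

# One-way ANOVA: does mean weight differ across the groups?
fit <- aov(weight ~ group, data = PlantGrowth)
anova_table <- summary(fit)
print(anova_table)

# Extract the p-value of the F-test from the ANOVA table
p_value <- anova_table[[1]][["Pr(>F)"]][1]
print(p_value)

# ANOVA only says that some means differ; Tukey's HSD indicates which pairs
print(TukeyHSD(fit))
```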
