Analytics Capstone Project

Jason07070926
DataScience_HANDOUT.pdf

Analytics Capstone Project

Course Overview

1. Part I: Data Science Fundamentals I Data

Science Concepts and Process I Exploratory

Data Analysis

I Cleaning & Manipulating Data

I Presenting Results

2. Part II: Graphs & Statistical Methods

I Basic Graphics

Data Science Concepts and Process I Advanced Graphics I Probability &

Statistical Methods

3. Part III: Modeling Methods I Linear and

Logistic Regression

I Model Selection and Evaluation I Unsupervised

Methods

I Advanced Modeling Methods

I Data science is more than statistical analysis

I Emphasis on collaboration and project definition

I Project roles

I Project sponsor I

Client or SME

I Data scientist

I Data architect I

Operations

I Data science project life cycle . . .

Data Science Concepts and Process

Data Science Concepts and Process

I Project goal – why are we doing this?

I Data collection, quality, sufficiency, and management

I Model development

I Model evaluation and sufficiency

I Presentation to stakeholders, project documentation, and

reproducibility

Communicating Results

I You’re telling a story

I What are the questions you’re seeking to answer?

I Why are these interesting questions?

Data Science Concepts and Process

General structure of a paper

1. Abstract – Brief summary of the question(s) of interest, the

methodology and the results

2. Introduction – Clear statement of the scientific question,

objectives of the study, background information to put the

question and research into perspective

3. Methods/Methodology – Describe the design of the study and

the analytical methodology you used

4. Results – Describe the key results of your data analysis and

interpret the results with respect to the objectives of the study

and the question(s) of interest

Data Science Concepts and Process

General structure of a paper (cont’d)

1. Conclusions – Summary of the main findings and a discussion

of how these findings relate to the question(s) of interest,

comparison to results from other studies, and ideas for future

research

2. References – Citations for literature you used

3. Technical Appendices (if necessary) – Describe more

complicated analyses, or derivations and proofs, or any other

material that would disrupt the flow of the paper if it were

included in the body

Data Science Concepts and Process

Use graphs to clarify not confuse or mis-lead

Data Science Concepts and Process

Data Science Concepts and Process

Data Science Concepts and Process

Data Science Concepts and Process

Exploratory Data Analysis

I EDA allows us to get a sense for what data we have if/and how

they are related

I Summary statistics

> summary(ldl) age gender ldl.pre ldl.post

Length :100 Length :100 Min. :106.5 Min. :102.9 Class :character Class :character 1st Qu. :116.9 1st Qu. :112.8 Mode :character Mode :character Median :120.7 Median :117.2 Mean :120.4 Mean :116.7

3rd Qu. :123.9 3rd Qu. :119.9

Max. :132.6 Max. :130.6

> ldl$fgender <- factor(ldl$gender)

Exploratory Data Analysis

> ldl$fage <- factor(ldl$age)

> summary(ldl[,-c(1,2)]) ldl.pre ldl.post fgender fage Min. :106.5 Min. :102.9 f:50 gt35:51 1st Qu. :116.9 1st Qu. :112.8 m:50 le35:49 Median :120.7 Median :117.2

Mean :120.4 Mean :116.7

3rd Qu. :123.9 3rd Qu. :119.9

Max. :132.6 Max. :130.6

I Visualizing data

Exploratory Data Analysis

Exploratory Data Analysis

I Identifying data problems through data summaries and

visualization I Missing data

I Invalid data and outliers

I Data ranges and comparable units

Exploratory Data Analysis

> # create a random sample of 100 numbers (size) from 1 through

100, and allow repeated values

> x <- sample(x = 1:100, size = 100, replace =

TRUE)

> # calculate the mean()

> mean(x)

[1] 48.78

> # make a copy of x

> y <- x

> # Draw a random sample of size 20 from our vector y and set

those values to NA > y[sample(x = 1:100, size = 20, replace =

FALSE)] <- NA

> mean(y)

[1] NA

Exploratory Data Analysis

> mean(y, na.rm = TRUE) > library(scales)

> ggplot(custdata) + geomdensity(aes(x=income)) + scalexcontinuous(labels =

dollar)

Exploratory Data Analysis

> ggplot(custdata) + geomdensity(aes(x=income)) + scalexlog10(breaks =

c(100,1000,10000,100000), labels=dollar) + annotationlogticks(sides = "bt")

Exploratory Data Analysis

> custdata2 <- subset(custdata, (custdata$age > 0 & custdata$age < 100 &

custdata$income > 0))

> ggplot(custdata2, aes(x=age, y=income)) + geompoint()

Exploratory Data Analysis

Cleaning & Manipulating Data

I Data transformations

I It is often helpful/necessary to transform data for analysis

I As we saw earlier, the transformed income data helped in

visualization

I When fitting a linear model a log transform will reduce

nonlinearity

I Invalid data values

I Data ranges and units

I Handling missing data -– just deleting is not always good

Cleaning & Manipulating Data

I Data may be missing for a number of reasons I It is important

to understand what data are missing and why

I Identifying missing data

I Visualizing missing data patterns

I We can remove the observations with missing data (avoid if

possible)

I Complete-case analysis – listwise deletion I Pair-

wise deletion

I We can replace the missing values

I Simple imputation

Cleaning & Manipulating Data

I Multiple imputation

Classifying Missing Data

I Missing completely at random (MCAR)

I Missing at Random (MAR)

I Not Missing at Random (NMAR)

Cleaning & Manipulating Data

Cleaning & Manipulating Data

Source: Kabacoff (2015)

Presenting Results

I Clear communication of results

I Target presentation to your audience I Project

sponsor or company executives

I End-users

I Other data scientists, analysts or researchers

Presenting Results

Project sponsor or company executives

I Summarize the project’s goals motivation for doing it I State

the results

I Provide details as needed

I Discuss recommendations or future work

Presenting Results

Think Journalistic Style . . .

End-users

I Summarize the project’s goals and motivation for doing it I

Show how the model fits into the users’ workflow and improves

the workflow

I Show how to use the model

Presenting Results

Other data scientists, analysts or researchers

I Introduce the problem I

Discuss related work

I Discuss your approach

Presenting Results

I Results and findings I

Discuss future work

This reflects the structure of a research article in a journal

Basic Graphics

I Scatterplots Simple plotting of two variables

plot(mtcars$mpg ˜ mtcars$wt, xlab = "Weight", ylab =

"MPG") plot(mtcars$mpg ˜ mtcars$wt, xlab = "Weight",

ylab = "MPG", pch = 19)

I pch allows you to specify the plotting character

Basic Graphics

I Bar plots Display the distribution (frequency) of a categorical

variable

install.packages("vcd") library(vcd) counts <-

table(Arthritis$Improved) barplot(counts,

main="Simple Bar Plot", xlab="Improvement",

ylab="Frequency") barplot(counts, main="Horizontal

Bar Plot", xlab="Frequency", ylab="Improvement",

horiz=TRUE)

Basic Graphics

If your data is a matrix rather than a vector, the resulting graph will

be a stacked or grouped bar plot

Basic Graphics

barplot(counts2, main="Stacked Bar Plot", xlab="Treatment",

ylab="Frequency", col=c("red", "yellow","green"),

legend=rownames(counts))

barplot(counts2, main="Grouped Bar Plot", xlab="Treatment",

ylab="Frequency", col=c("red", "yellow", "green"),

legend=rownames(counts), beside=TRUE)

By setting beside = TRUE we get a grouped bar plot

Basic Graphics

Basic Graphics

I As we finish our discussion of bar plots, let’s take a look at a

specialized version called a spinogram

I A spinogram is a stacked bar plot is rescaled so that the height

of each bar is 1 and the segment heights represent

proportions

I Spinograms are created with the spine() function of the vcd

package

attach(Arthritis) counts3 <- table(Treatment, Improved)

spine(counts, main="Spinogram Example")

detach(Arthritis)

Basic Graphics

I Histograms Display the distribution of a continuous variable by

dividing the range a number of bins on the x-axis and

displaying the frequency in each bin on the y-axis

Basic Graphics

I The following code creates four histograms – from basic to more

complex

par(mfrow=c(2,2)) hist(mtcars$mpg)

hist(mtcars$mpg, breaks=12, col="red", xlab="Miles

Per Gallon", main="Colored histogram with 12 bins")

hist(mtcars$mpg, freq=FALSE, breaks=12, col="red",

xlab="Miles Per Gallon", main="Histogram, rug plot,

density curve") rug(jitter(mtcars$mpg))

lines(density(mtcars$mpg), col="blue", lwd=2)

x <- mtcars$mpg h<-hist(x, breaks=12, col="red", xlab="Miles

Per Gallon", main="Histogram with normal curve and box")

xfit<-seq(min(x), max(x), length=40) yfit<-dnorm(xfit,

Basic Graphics

mean=mean(x), sd=sd(x)) yfit <- yfit*diff(h$mids[1:2])*length(x)

lines(xfit, yfit, col="blue", lwd=2) box()

Basic Graphics

Basic Graphics

I Density plots Kernel density estimation is a nonparametric

method for estimating the probability density function of a

random variable I Generally, density plots can be an effective way

to view the distribution of a continuous variable

par(mfrow=c(2,1)) d <- density(mtcars$mpg) plot(d) d <-

density(mtcars$mpg) plot(d, main="Kernel Density of Miles Per Gallon")

polygon(d, col="red", border="blue") rug(mtcars$mpg, col="brown")

Basic Graphics

Basic Graphics

I Boxplots A box-and-whiskers plot describes the distribution of a

continuous variable by plotting its five-number summary: the

minimum, first quartile (25th percentile), median (50th

percentile), third quartile (75th percentile), and maximum

I Includes observations that may be outliers (values outside the

range of ±1.5 ∗ IQR, where IQR is the interquartile range defined

as the third quartile minus the first quartile) I By default, each

whisker extends to the most extreme data point, which is no

more than 1.5 times the interquartile range

I Values outside this range are depicted as dots and may be

outliers

Basic Graphics

Source: Kabacoff (2015)

Basic Graphics

I Dot plots Provide a method of plotting a large number of

labeled values on a simple horizontal scale

Advanced Graphics

There are four graphics packages in R:

grid provides drawing primitives, but no tools for producing

statistical graphics; We will not cover grid

Base graphics are fine for assignments/projects–We will

look more advanced packages so you are aware that they exist

Advanced Graphics

I lattice provides a comprehensive graphical system for

visualizing univariate and multivariate data

I lattice has the ability to easily generate trellis graphs I A trellis

graph displays the distribution of a variable, or the relationship

between variables, separately for each level of one or more other

variables

I Let’s look at an example: How do the heights of singers in the

New York Choral Society vary by their vocal parts?

library(lattice) histogram(˜height | voice.part, data = singer,

main="Distribution of Heights by Voice Pitch", xlab="Height

(inches)")

Advanced Graphics

Graph types and functions from the lattice package

Advanced Graphics

Advanced Graphics – lattice Examples

library(lattice) attach(mtcars)

gear <- factor(gear, levels=c(3, 4, 5), labels=c("3 gears", "4 gears",

"5 gears")) cyl <- factor(cyl, levels=c(4, 6, 8), labels=c("4

cylinders", "6 cylinders",

"8 cylinders"))

densityplot(˜mpg, main="Density Plot", xlab="Miles per Gallon")

densityplot(˜mpg | cyl, main="Density Plot by Number of

Cylinders", xlab="Miles per Gallon")

lattice

Advanced Graphics – Examples

bwplot(cyl ˜ mpg | gear, main="Box Plots by Cylinders and

Gears", xlab="Miles per Gallon", ylab="Cylinders")

xyplot(mpg ˜ wt | cyl * gear, main="Scatter Plots by Cylinders

and Gears", xlab="Car Weight", ylab="Miles per Gallon")

cloud(mpg ˜ wt * qsec | cyl, main="3D Scatter Plots

by Cylinders")

dotplot(cyl ˜ mpg | gear, main="Dot Plots by Number of Gears and

Cylinders",

xlab="Miles Per Gallon")

Advanced Graphics – lattice Examples

splom(mtcars[c(1, 3, 4, 5, 6)], main="Scatter Plot Matrix for

mtcars Data")

detach(mtcars)

Advanced Graphics

I ggplot2 developed by Hadley Wickham I R

package for producing graphics

I ggplot2 is designed to work by adding layers of annotations and

statistical summaries

I Takes a little practice learning the “grammar” of ggplot2

Advanced Graphics – Examples

I Read the ggplot2 book posted on Blackboard

ggplot2

library(ggplot2) set.seed(1410) # Make the sample reproducible dsmall

<- diamonds[sample(nrow(diamonds), 100), ]

qplot(carat, price, data = diamonds) qplot(log(carat), log(price), data =

diamonds) qplot(carat, x * y * z, data = diamonds) qplot(carat, price, data

= dsmall, colour = color) qplot(carat, price, data = dsmall, shape = cut)

qplot(carat, price, data = dsmall, geom = c("point",

"smooth"))

Advanced Graphics – Examples

Analytic Geometry

Advanced Graphics – Examples

ggplot2

## Use GAM as a smoother

library(mgcv) qplot(carat, price, data = dsmall,

geom = c("point", "smooth"), method = "gam",

formula = y ˜ s(x))

## qplot does a lot for us but we can build

## layer by layer

qplot(sleep_rem / sleep_total, awake, data = msleep, geom = c("point",

"smooth"))

ggplot(msleep, aes(sleep_rem / sleep_total, awake)) geom_point() +

geom_smooth()

Advanced Graphics – Examples

Analytic Geometry

Probability & Statistical Methods

Descriptors for probability distributions. . .

I Location – the center or the peak of the distribution typically

measured by the mean (average or expected value) or median

I Shape – the degree of symmetry or skewness

I Scale or Variability – the spread of the distribution typically

measured by variance or standard deviation, the range or

interquartile range, or a scale parameter

I Deviations – possible outliers, or unusual points that are not

consistent with the rest of the data

Probability & Statistical Methods

I Probability distributions I

Normal distribution

f

I Functions for the normal distribution incluide rnorm, dnorm,

pnorm, qnorm

Probability & Statistical Methods

Source: Doane and Seward (2011)

I Summary statistics I

Sample mean

Probability & Statistical Methods

= x1 + x2 + ··· + xn = Pni=1 xi x

n n

I Sample variance

s n − 1

√ . . . and standard deviation s = s2

I Median - middle value of a data set not sensitive to extreme values First sort the data x1 ≤ x2 ≤ ···xn

If n is odd

m = x(n+1)/2

If n is even x(n/2) + x(n+2)+1

Probability & Statistical Methods

m

I The sample mean we calculated above is a point estimate I While

this is informative it is important to understand the uncertainty in

the estimate

I Confidence intervals do just that I Suppose we wish to find a

confidence interval for the mean of a distribution based on a

sample

√ x ± Zα/2 · σ/ n

where α represents the significance level, Z is the quantile

from the standard normal distribution, σ is the standard

deviation, and n is the sample size I The confidence level is 1 − α

Probability & Statistical Methods

I The confidence level is the probability that the interval includes

the true mean

Example A sample of size n = 40 is drawn from a population with a variance σ2

= 10, and x = 7.164. What is a 95% confidence interval for µ?

!

which is (6.184,8.144)

I Sampling distributions

The sampling distribution of an estimator is the probability

distribution of all possible values the statistic may assume when

a random sample of size n is taken

I The sampling distribution of the sample mean

Probability & Statistical Methods

X ∼ N

Try this . . .

library(lattice) library(resample) x <- ldl$ldl.post samp.idx <- samp.bootstrap(n = 100, R = 10000, size

=20) mean(rowMeans(matrix(x[samp.idx], nrow = 10000, byrow = FALSE))) mean(x)

histogram(rowMeans(matrix(x[samp.idx], nrow = 10000, byrow = FALSE)))

Probability & Statistical Methods

I Covariance and correlation measure the association between two

random variables given random variables X and Y

Probability & Statistical Methods

Cov(X,Y) = E(X − EX)(Y − EY)

= EXY − EX EY

and

Cov(X,Y)

ρ(X,Y) =

p

Var(X)Var(Y)

I While both measure association, correlation ρ is normalized and

ranges from −1 to 1 making it easier to interpret

I We can also write correlation as

rxy

(n − 1)SXSY

Probability & Statistical Methods

when working with a sample, X is the sample mean for X, Y is

the sample mean for Y, SX and SY are the sample standard

deviation for X and Y respectively, and n is the sample size

I In R we use the cor function for correlation and cov for covariance

library(ggplot2) cor(economics$pce,

economics$psavert)

[1] -0.837069

I Hypothesis testing is used to test assumptions and theories and

guide our decisions

I State the level of significance α: the probability that you

incorrectly reject H0

I The levels of significance correspond to various quantiles of the

standard normal distribution . . .

Probability & Statistical Methods

I α = 0.05,Z = 1.96,

I α = 0.10,Z = 1.645

I Refer to a Z table that provides the area under the curve for the

cumulative normal distribution

I Steps in hypothesis testing:

Step 1: State the null and alternative hypotheses, H0, and Ha (or H1), determine if this is a one-sided or two-sided test

Step 2: State α, the level of significance Step 3: Calculate the test statistic from the sample mean and variance

X − µ Z =

ps2/n

Step 4: Compare the calculated test statistic to the critical value

Probability & Statistical Methods

Step 5: Accept H0 if Z is in the acceptance region, or reject H0 if Z is in the

critical or rejection region

Example Historically the test scores on a given math test are normally

distributed with mean 75 and variance 36. A group of 16 students

take the test this year, and score an average of 82. Do this group’s

math scores differ from the historical average? Test this at a 0.10

level of significance.

Solution

Z .

Since α = 0.1 we have a 1 − α confidence level, and the critical value for

Z = 1.645.

Probability & Statistical Methods

Since 4.67 > 1.645 we reject H0 and conclude that this year’s math

scores have a higher average than the historical average.

Example The specifications for a cable call for a mean breaking strength of

2000 pounds. For a sample of the cable the mean breaking strngth is

1955 pounds with a standard error of the mean of 25. Using α = 0.05

determine if the difference in mean breaking strengths is statistically

significant.

I In these examples the variance is known so we use the standard

normal distribution

I If we don’t know the variance and the sample size is small

(rule of thumb n < 30) then use a t-distribution

Probability & Statistical Methods

X − µ t

= ps2/n

I Compare this test statistic to the critical value from a t-table, for n

− 1 degrees of freedom

Exercise The copier at Seidenberg can make 45 copies per minute on average.

The machine is altered to increase output, and three test runs

produce 46, 47, and 48 copies per minute. Is this increase

statistically significant? Use α = 0.05. Note this is a one-tailed test.

Conduct a t-test in R using the function t-test

> t.test(ldl$ldl.post, mu = 117) One Sample t-test

Probability & Statistical Methods

data: ldl$ldl.post t = -0.62329, df = 99, p-value = 0.5345 alternative hypothesis:

true mean is not equal to 11 95 percent confidence interval:

115.6613 117.6987 sample

estimates:

mean of x

116.68

We can do a two-sample t-test testing the equality of the sample

means, and the null hypothesis is the difference is zero:

H0 : µ1 − µ2 = 0

> t.test(ldl$ldl.post˜ldl$gender) Welch Two Sample t-

test

data: ldl$ldl.post by ldl$gender t = 0.65255, df = 88.233, p-value = 0.5157 alternative hypothesis:

true difference in means is not equal to 0 95 percent confidence interval: -1.374449 2.718449 sample

estimates: mean in group f mean in group m

Probability & Statistical Methods

117.016 116.344 When testing for a difference in means, the test statistic

(X1 − X2) − (µ1 − µ2)

t =

SX1−X2

has a t-distribution when n1 + n2 ≤ 30

SX

2 ni

i i

1 Si = =

ni(ni − 1) ni − 1

Probability & Statistical Methods

For a given significance level, α, the critical value is tα/2,df where df =

n1 + n2 − 2

We accept H0 if −tα/2,df ≤ t ≤ tα/2,df

Probability & Statistical Methods

Probability & Statistical Methods

I Arguably regression analysis is the most commonly used

statistical model I Linear regression explores relationships

between variables that are linear

I Many problems can be analyzed using linear regression

. . .

I And when the relationships are not quite linear, the

transformation of the original variables results in linear

relationships among the transformed variables

I The functional relation between two variables is

Y = f(X)

Probability & Statistical Methods

Example Let’s consider the relation between total dollar sales (Y) of a product

and the number of units sold (X). If the unit price is $2 then the

functional relation is

Y = 2X.

I Unlike a functional relation a statistical relation is not perfect

I The observations do not fall on directly the curve I Two key

concepts of a statistical relationship captured by a regression

model:

1. The tendency of the dependent variable Y to vary with the

independent variable X 2. The points scatter around the curve of the statistical relationship

I The linear regression equation is

Probability & Statistical Methods

Y

for a simple linear regression and

Y (8.1)

for a multiple regression model

In (8.1) The βi are the regression coefficients corresponding the the

Xi independent variables, aka predictor variables, or explanatory

variables, and is the random error,

When we fit a regression model we estimate the coefficients and our

model is denoted by

Yˆ = βˆ0 + βˆ1X1 + βˆ2X2 + ··· + βˆpXp

Probability & Statistical Methods

The following graph shows the data points and the fitted regression

line for Ozone regressed on Temp from the

airquality data frame

Probability & Statistical Methods

Neter et al. (1983)

Probability & Statistical Methods

Formulæfor βˆ1 and βˆ0 :

,

βˆ0 = Y − βˆ1X

Ordinary least squares: find the coefficients that minimize the sum

of the squared error terms

NOTE: Let . To minimize,

take the partial derivatives, and , set them to zero and solve

for βˆ0 and β ˆ 1.

Formal assumptions of regression analysis:

Probability & Statistical Methods

I The relation is, in fact, linear, so that the errors all have expected

value for all i

I The errors all have the same variance, for all i

I The errors are independent of each other

I The errors are all normally distributed,

The last two points are often stated as the errors are independent

and identically distributed, or iid

Fitting a regression model in R

## Create a link to the website AdLink <- "http://www-bcf.usc.edu/˜gareth/ISL/Advertising.csv"

## Use read.csv to get the data Advertising <- read.csv(file = AdLink, sep = ",", header = TRUE)

## Fit the regression ad.lm <- lm(Sales˜TV, data =

Advertising)

Probability & Statistical Methods

## View the results summary(ad.lm)

SOURCE: The data are taken from ”An Introduction to Statistical Learning, with applications in R” (Springer, 2013)

I Analysis of Variance (ANOVA)

I Used in experimental situations

I Apply several treatments or treatment combinations to randomly

selected experimental units

I Compare the treatment means for some response y

I To test the effects of these additives we need more than one car

for each type of additive

Probability & Statistical Methods

I Suppose we have three cars for each type our ANOVA model is

y , y y

, y

or

yij ij

and we test if the average y for treatment 1 is the same as the

average y for treatment 2

Assumptions for Analysis of Variance

I The response variable is normally distributed for all the groups

I The variance of the response variable, denoted σ2, is the same for

all the groups

I The observations are independent

Probability & Statistical Methods

I Analysis of Variance is misleading because the objective in ANOVA

is to analyze differences among the group means, not the

variances

I By analyzing the variation among and within the groups, you can

reach conclusions about possible differences in group means

I Total variation is subdivided into variation that is due to

differences among the groups and variation that is due to

differences within the groups

I Within-group variation measures random variation I Among-

group variation is due to differences from group to group

I The symbol c (in this text) is used to indicate the number of

groups

Probability & Statistical Methods

Total Variation c nj

SST = XX(Xij − X)2 (8.2)

j=1 i=1

where

Probability & Statistical Methods

c nj

PP Xij j=1 i=1

X = = grand mean

n

Xij = ith value in group j nj =

number of values in group j

n = total number of observations (all groups)

c = number of groups

Among-Group Variation

c

SSA = Xnj(Xj − X)2 (8.3) j=1

Probability & Statistical Methods

where

c = number of groups

nj = number of values in group j

Xj = sample mean of group j

X = grand mean

Within-Group Variation

c nj

SSW = XX(Xij − Xj)2 (8.4)

j=1 i=1

where

c = number of groups

Probability & Statistical Methods

Xij = ith value in group j

Xj = sample mean of group j

To test the null hypothesis

H0 : µ1 = µ2 = ··· = µc

against the alternative

H1 : Not all µj areequal

we create an ANOVA summary table

Probability & Statistical Methods

− =

− =

Pr( > ∗

− =

− −

Probability & Statistical Methods

Fitting an ANOVA model in R

## Create a link to the website AutoLink <- "http://www-bcf.usc.edu/˜gareth/ISL/Auto.csv"

## Use read.csv to get the data Auto <- read.csv(AutoLink, sep = ",", header = TRUE)

## Make origin a factor Auto["origin"] <- lapply(Auto["origin"], factor) # OR Auto$origin <- factor(Auto$origin)

## Fit the ANOVA model auto.aov <- aov(mpg˜origin, data = Auto)

## View the results summary(auto.aov)

SOURCE: The data are taken from ”An Introduction to Statistical Learning, with applications in R” (Springer, 2013)

Probability & Statistical Methods

I summary(auto.aov) shows us there is a significant difference in

the means across origins but doesn’t tell us what the

differences are

I Tukey Honest Significant Differences provides us with that

information

TukeyHSD(auto.aov, ordered = TRUE) plot(TukeyHSD(auto.aov, "origin"),

xlim = c(-1,12))

Probability & Statistical Methods

I χ2 test of independence

Probability & Statistical Methods

I A test of independence for two categorical variables

I Create a contingency table that has r rows and c columns

I H0 : The two categorical variables are independent

I H1 : The two categorical variables are dependent

I Compute the test statistic

2 X (fo − fe)2 χ =

fe

I Reject H0 if χ2 > χ2α, the critical value with (r − 1)(c − 1) df

I Non-parametric statistics: Wilcoxon Rank Sum Test

I A nonparametric procedure does not depend on the assumption of

normality for the two populations

Probability & Statistical Methods

I Use when the sample sizes are small and you cannot assume that

the data in each sample are from normally distributed populations

Calculating the χ2 statistic

fo = observed values in the cell fe = expected values

in a cell, calculated as follows:

row total × column total

fe = n

Probability & Statistical Methods

Golden Palm Palm Royale Palm Princess

Price 23 7 37

Location 39 13 8

Accommodation 13 5 13

Other 13 8 8

Enter the numerical data in a text file separated by commas, and

execute the following code chiexam <- read.csv("chidata.txt", header = FALSE) names(chiexam) <- c("Golden Palm","Palm

Royale","Palm Princess") rownames(chiexam) <- c("Price", "Location", "Accommodation", "Other")

chisq.test(chiexam)

> chisq.test(chiexam) Pearson’s Chi-squared

test

data: chiexam X-squared = 27.41, df = 6, p-value = 0.000121

Probability & Statistical Methods

Annual Storm Count

Wind Speed (km/h) 1 2 3 or more

119 ≤ Wi < 154 13 12 28

154 ≤ Wi < 179 2 6 12

179 ≤ Wi 12 6 17

Conduct a χ2-test to determine if wind speed and annual storm

count are independent

Source: Parisi and Lund (2000)

I The Wilcoxon rank sum test tests whether there is a difference in

the medians from two samples

Probability & Statistical Methods

I If the two sample sizes are unequal, n1 represents the smaller

sample and n2 the larger sample

I The Wilcoxon rank sum test statistic, T1, is defined as the sum of

the ranks assigned to the n1 values in the smaller sample

I If T2 is the sum of the ranks assigned to the n2 items in the second

sample, then

n(n + 1) T1

+ T2 =

2

which follows from the well known result from mathematical

induction

I The prior point checks for the accuracy in assigning the ranks

Probability & Statistical Methods

When the sample sizes are large the test statistic T1 is

approximately normally distributed with

n1(n + 1)

µT1 = 2

and

The test statistics is

T1 − n 1(n+1)

Z

Probability & Statistical Methods

which approximately follows a standard normal distribution I Steps

for performing the Wilcoxon rank sum test:

I Replace the values in the two samples of size n1 and n2 with their

combined ranks (unless the data are the ranks) I Define n = n1 + n2

as the total sample size

I Assign the ranks so that rank 1 is given to the smallest of the n

combined values, rank 2 is given to the second smallest, and so

on

I For ties, assign each value the average of the ranks that would have

been assigned had there been no ties

I Compute T1 and Z

I Compare Z to the critical value and accept or reject H0 as appropriate Table 8.1: Time to second heart attack for smokers and nonsmokers

Smoker NonSmoker

Probability & Statistical Methods

1.09 1.96

3.50 0.07 3.69 0.01 0.63 10.21 0.35 0.10 3.34 0.25 1.93 0.41 1.57 3.26 1.79 3.57 0.86 0.24 0.03 0.51 0.19 0.33 0.98 7.02 0.26 5.85 1.01 1.83

Probability & Statistical Methods

When n1 and n2 are both ≤ 10 you use the table of lower and upper

critical values

Modeling Methods

I Match the problem to an analytical method

I Evaluate the model’s quality – quantifying the performance of a

model using a number of different measures

I Validate the model’s soundness – checking that the model

will work in the real world as

well as it did during

training

Modeling Methods

Modeling Methods

I Most problems in data science fall into one of two

categories: supervised or unsupervised I Supervised

learning methods include:

I linear regression

I logistic regression

I generalized additive models (GAM) I

support vector machines (SVM)

I Unsupervised learning methods include:

I K-means clustering

Modeling Methods

I A priori algorithm for association rules I

Nearest neighbor

I Use supervised methods when there is a known outcome

I Classification – observations fall into two or distinct groups

I Scoring – predicting the value of some measure based on other

variables

I Use unsupervised methods when there are no known outcomes

but you’re looking for patterns and relationships in your data

I Segmentation into an unknown number of groups (not pre-

determined)

I Make associations based on similar qualities or activities

Modeling Methods I

As a data scientist you must be able to map your problem to

the most appropriate method

I Your intended uses of the model influence the methods you

should use

I If you want to understand the relationship between input

variables and outcome then a regression method would be a

good choice

I If you want to know what single variable influences a

categorization, then try a decision tree

I Classification is an example of supervised learning: in order to

learn how to classify objects, you need a dataset of objects

that have already been classified

Modeling Methods I

I Cluster analysis and Association rules are examples of

unsupervised learning: look for patterns in the data, not

predict an outcome

Classification Using Nearest Neighbors predict something

about a data point p (like a customer’s future purchases)

based on data that are most similar to p

I Classification Using Naive Bayes use data about prior events to

estimate the probability of future events (weather forecasting)

I Classification Using Decision Trees and Rules divide data into

smaller and smaller portions to identify patterns that can be

used for prediction

I Regression Methods used for forecasting numeric data and

quantifying the size and strength of a relationship between

Modeling Methods I

the dependent and independent variables logistic regression

can be used to model a binary

dependent variable, and Poisson regression for count data

Neural Networks model the relationship between input signals

and an output signal using a model derived from our

understanding of how a brain responds to stimuli from sensory

inputs

I Support Vector Machines find a surface that makes a boundary

between various points of data in multidimensional space

according to their feature values

I Association Rules specify patterns of relationships among items

such as

{peanut butter, jelly} ⇒ {bread}

Modeling Methods I

if peanut butter and jelly are purchased, then bread is also

likely to be purchased

I k-means Clustering is an unsupervised method that

automatically divides the data into clusters, or group of similar

items

Modeling Methods

Model Selection and Evaluation

k

Model Selection and Evaluation

I After building a model we need to if the model even works on the

data it was trained from

I We use a set of quantitative measures to assess model

performance

I The measures we use vary by class of model I To help us

decide if our scores are “good enough” we compare our

model to several ideal models

I A null model gives us the low end of performance

I A Bayes rate model gives us high end of performance

I A single-variable model tells us what a simple model can achieve

I Null Model

I A null model is the best simple model you’re trying to outperform

Model Selection and Evaluation

I We use null models to lower-bound desired performance I We

usually compare to a best null model

I Bayes Model

I A Bayes rate model is a best possible model given the data I The

Bayes rate model is the perfect model

I The Bayes rate model is an upper bound on a model evaluation score

I Single-variable Model

I Compare your model against the best single-variable model

I A complicated model that doesn’t outperform the best single-variable

model can’t be justified

Evaluating classification models

I The most common measure of classifier quality is accuracy

Model Selection and Evaluation

I The confusion matrix is a powerful tool for measuring classifier

performance

I The confusion matrix is a table counting the number of known

outcomes versus each prediction type

Predicted

Actual Default Non-default

Default 264 14

Non-default 22 158

Evaluating classification models

Predicted

Actual TRUE FALSE

TRUE True Positive (TP) False Negative (FN)

Model Selection and Evaluation

FALSE False Positive (FP) True Negative (TN)

Evaluating classification models

Accuracy number of items categorized correctly divided by the total

number of items

TP + TN

Accuracy =

TP + TN + FP + FN

Evaluating classification models

Model Selection and Evaluation

Precision fraction of the items the classifier flags as being in the class

actually are in the class

TP

Evaluating classification models

Recall fraction of the things that are in the class are

detected by the classifier; what fraction of class

members are identified as positive

TP

= +

Model Selection and Evaluation

Sensitivity Same as Recall; the true positive rate

Specificity True negative rate – what fraction of class members are

identified as negative

TN

Specificity =

TN + FP

Evaluating scoring models

I Main measure is residuals – the difference between the fitted

value and the actual value for Y

= +

Model Selection and Evaluation

I R2 tells us how much of the variability in the dependent variable is

explained by our model

I The significance of each variable is given the the corresponding p-

values

I The overall model fit is measured by the F-statistic and p-value

Evaluating scoring models

# Make up some data d <-

data.frame(y=(1:10)ˆ2,x=1:10,z=1:10 +rpois(10,lambda =

2))

# fit a regression model using the lm() function model <- lm(y˜x+z,data=d) summary(model)

Call:

Model Selection and Evaluation

lm(formula = y ˜ x + z, data = d)

Residuals: Min 1Q Median 3Q Max -10.37 -4.66 0.51 2.19 12.48

Coefficients: Estimate Std. Error t value Pr(>|t|)

(Intercept) -17.23 5.87 -2.94 0.0218 * x 14.99 2.61 5.75 0.0007 *** z -3.76 2.33 -1.61 ---

0.1509

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.42 on 7 degrees of freedom Multiple R-squared: 0.963,Adjusted R-squared: 0.953 F-statistic: 92.1 on 2 and 7 DF, p-value: 9.41e-06

Linear Regression Models

I Recall that the linear regression model has the form

Yˆi = βˆ0 + βˆ1X1i + βˆ2X2i + ··· + βˆpXpi i = 1,2...n

I A model with only one independent variable is a simple linear

regression model; with more than one it’s multiple regression

I This is the ordinary least squares (OLS) model and we find the βˆ0s

by minimizing the sum of squared residuals (or errors)

n n n

i=1 i=1 i=1

I Linear regression is the cornerstone prediction method

Linear Regression Models

I Models the expected value of a dependent or response variable

given a set of independent of explanatory variables

I Fitting a regression model in R

Linear Regression Models

Linear Regression Models

1. Using the database women in the base R installation fit a

regression of weight on height (don’t forget to save your

model)

Linear Regression Models

2. Display a summary of your model, and interpret the results

3. Display the fitted values

4. Display the residuals

5. Plot weight (y-axis) and height (x-axis)

A polynomial regression is a regression model where we use use

powers of the explanatory variables

Refit the the regression as a quadratic model and interpret the

results as before

fit2 <- lm(weight ˜ height + I(heightˆ2), data=women)

Note this is still a linear model even though we have a quadratic term

– it’s a linear combination of the βˆ0s

An example of a nonlinear model is

Linear Regression Models

Yˆ = βˆ0 + βˆ1eX/βˆ2

You can fit nonlinear models with nls()

The car library has a scatterplot() function that can create an

informative plot

library(car) scatterplot(weight ˜ height, data=women, spread=FALSE,

smoother.args=list(lty=2), pch=19, main="Women Age 30-39",

xlab="Height (inches)", ylab="Weight (lbs.)")

Linear Regression Models

Linear Regression Models

I What we have been discussing is simple linear regression I When

there are more than one independent variable we have multiple

linear regression

I When working with several independent variables we need to

check for correlation among them

I This gives insight into possible interactions and multicollinearity

I We’ll us the built-in data set state.x77

I Since state.x77 is a matrix we need to convert it to a data.frame

Linear Regression Models

states <as.data.frame(state.x77[,c("Murder", "Population",

"Illiteracy", "Income",

"Frost")])

I Use cor() to get a correlation matrix cor(states)

Murder Population Illiteracy Income Frost

I cor() produces a correlation matrix but doesn’t tell us anything

about the statistical significance of the estimates

Murder 1.00 0.34 0.70 -0.23 -0.54

Population 0.34 1.00 0.11 0.21 -0.33

Illiteracy 0.70 0.11 1.00 -0.44 -0.67

Income -0.23 0.21 -0.44 1.00 0.23

Frost -0.54 -0.33 -0.67 0.23 1.00

Linear Regression Models

I cor.test() calculates the correlation and tests for significance but

only works on a pair at a time

I Years ago I created a function called Mat.cor.test that does what

cor.test does and works with a matrix

## matrix cor.test Mat.cor.test <- function(obj, alt = "two.sided", meth = "pearson") { n <- dim(obj)[2] est <- matrix(0, nrow = n, ncol = n, dimnames = list(names(obj),

names(obj))) pval <- matrix(0, nrow = n, ncol = n, dimnames = list(names(obj),

names(obj))) for(i in 1:(n - 1)) { for(j in (i + 1):n) { temp.cor <- cor.test(obj[, i], obj[, j],

alternative = alt, method = meth) est[i, j] <- temp.cor$estimate pval[i, j] <-

temp.cor$p.value est[j, i] <- temp.cor$estimate pval[j, i] <- temp.cor$p.value } } diag(est) = 1 list(estimates = round(est, 2), p.values = zapsmall(round(pval, 2))) } Mat.cor.test(states) $estimates Murder Population Illiteracy Income Frost

Linear Regression Models

Murder 1.00 0.34 0.70 -0.23 -0.54 Population 0.34 1.00 0.11 0.21 -0.33 Illiteracy 0.70 0.11 1.00 -0.44 -0.67 Income -0.23 0.21 -0.44 1.00 0.23 Frost -0.54 -0.33 -0.67 0.23 1.00 $p.values Murder Population Illiteracy Income Frost Murder 0.00 0.01 0.00 0.11 0.00 Population 0.01 0.00 0.46 0.15 0.02 Illiteracy 0.00 0.46 0.00 0.00 0.00 Income 0.11 0.15 0.00 0.00 0.11 Frost 0.00 0.02 0.00 0.11 0.00

I As you recall from our discussions of EDA, it is important to

visualize the data

I Let’s turn to the car library again

Linear Regression Models

library(car) scatterplotMatrix(states, spread=FALSE,

smoother.args=list(lty=2), main="Scatter Plot Matrix")

Linear Regression Models

Linear Regression Models

I Fit a multiple regression using the states data frame, with

Murder as the dependent variable...

I fit <- lm(Murder ∼ Population + Illiteracy

+

Income + Frost, data=states)

I summary(fit) I

anova(fit)

I This model shows use the main effects of each of the variables on

the murder rate

I We should look at the interaction between variables when we fit

regression models

I Let’s look at another model

Linear Regression Models

fit <- lm(mpg hp + wt + hp:wt, data=mtcars)

I When we fit a model with interactions we can visualize the

interaction terms to learn more about the relationship

I We can use the effects package to help out

library(effects) plot(effect("hp:wt", fit,,

list(wt=c(2.2,3.2,4.2))), multiline=TRUE)

Linear Regression Models

After fitting a regression model it’s important to look at diagnostics

to check the assumptions

Linear Regression Models

1. Normality: the residuals should be normally distributed with a

mean of 0 and variance σ2

2. Independence: the dependent variable values are independent

3. Linearity: the dependent variable is linearly related to the

independent variables

4. Homoscedasticity: the variance of the residuals is constant and

does not vary with the dependent variable

We also look for outliers, high-leverage observations, and

influential observations (Cook’s distance)

Let’s return to our earlier model of regessing weight on height

fit <- lm(weight ∼ height, data=women)

par(mfrow=c(2,2)) plot(fit)

Linear Regression Models

par(mfrow=c(1,1)) ## return parameters to normal

Linear Regression Models

Look again at the states model

fit <- lm(Murder Population + Illiteracy + Income + Frost,

data=states) par(mfrow=c(2,2)) plot(fit) par(mfrow=c(1,1))

Linear Regression Models

Linear Regression Models

The car package provides several additional functions that enhance

model evaluation

Linear Regression Models

I Multicollinearity is when two or more independent variables are

highly correlated I Results in large confidence intervals for the

parameter estimates (uncertainty)

I Makes interpretation of of the parameters difficult

I May show variables are not significant when they are I

May switch the sign of some parameters I Variance Inflation

Factor (VIF) is a measure to detect multicollinearity – vif() in

the car library

The math behind the R output

I Standard error of the estimates is the square root of the variance

Linear Regression Models

SE

where σ2 is the variance of the residuals

2 RSS σˆ =

n − p

− 1

recall n

RSS = X(Yi − Yˆi)2 = Xε2

i=1

SE

Linear Regression Models

when there’s only one independent variable, and in general

SE(β)2 = σ2(X0X)−1

or

q

SE(β) = σ2(X0X)−1

where X is the model matrix

Linear Regression Models

I For regression models

H0 : βi = 0 I The t-statistic is

or

βˆi − βi

t =

SE(βi)

βˆi

t =

SE(βi)

Linear Regression Models

I Total sum of squares

SSTO = X(Yi − Y)2

I Regression sum of squares

SSR = X(Yˆi − Y)2

I Error sum of squares

Linear Regression Models

SSE = X(Yi − Yˆ)2 = RSS

I The residual standard error is

RSE

I Regression Mean Square

SSR

MSR =

p

where p is the number of independent variables in the

regression model

Linear Regression Models

I Error Mean Square

SSE

MSE =

n − p − 1

I F-statistic

MSR F

=

MSE

I R2 is the proportion of variance explained by the model

I R2 always takes a value between 0 and 1

I R2 is independent of the scale of the dependent variable Y

Linear Regression Models

R2 = SSTO − SSE = SSR = 1 − SSE

SSTO SSTO SSTO

and adjusted R2

R2adj R2)

After fitting a regression model and interpreting the output, it

may be necessary to make revisions I Delete observations

I Transform variables

I Add or delete variables

I Try another regression modeling approach

Finding the “best” model

Linear Regression Models

I Comparing models using anova – based on the F for a reduced

model vs. a full model

> anova(fit.02b, fit.02) Analysis of Variance Table

Model 1: Murder ˜ Population + Illiteracy Model 2: Murder ˜ Population + Illiteracy + Income + Frost

Res.Df RSS Df Sum of Sq F Pr(>F) 1 47 289.2 2 45 289.2 2 0.07851 0.006 0.994

I and with AIC

> AIC(fit.02b, fit.02) df AIC

fit.02b 4 237.657 fit.02 6 241.643 I Variable selection using step-wise regression, and all subsets

regression

Linear Regression Models

I Assessing how well a model works in the real world

I Cross-validation

I A portion of the data is selected as the training sample, and a

portion is selected as the hold-out sample

I A regression equation is developed on the training sample and then

applied to the hold-out sample

I The performance on this sample is a more accurate estimate of the

operating characteristics of the model with new data

I Relative Importance

I Which variables are most important in predicting the outcome?

I Standardize your data before fitting the regression

I The coefficients measure how many standard deviations your

dependent variable changes with a 1 standard deviation of your

independent variable

Linear Regression Models

I Puts the effects of each variable on the same footing, and

comparable

Logistic Regression Models

I In our discussion of OLS we noted that the assumption of

normality must hold for the dependent variable Y

I Generalized linear models extend the linear-model framework to

include dependent variables that are non-normal

I In R we use the glm() function to fit generalized linear models

I Two popular models in this framework are logistic regression

(dependent variable is categorical) and Poisson regression

(dependent variable is a count variable)

I Logistic regression is a good first choice for binary classification

problems

I Logistic regression can directly predict values that are restricted

to the (0,1) interval, like probabilities

Logistic Regression Models

I Logistic regression assumes that is a linear function of

the X0s

I In the linear regression framework we model the expected value

of Y given X1,X2,...,Xp

p

µY = β0 + XβiXi i=1

I In the generalized linear model framework we model a function

of the expected value of Y

p

g(µY) = β0 + XβiXi

Logistic Regression Models

i=1

where g(µY) is called a link function

I The format for the glm() is as follows:

glm(formula, family=family(link=function), data=)

where family is the probability distribution, and function is the

link function

Logistic Regression Models

I Logistic regression applies to situations in which the response

variable is dichotomous (0 or 1)

I The model assumes that Y follows a binomial distribution

I It takes the form

Logistic Regression Models

iXi i=1

where π = µY is the conditional mean of Y (that is, the

probability that Y = 1 given a set of X values), is the odds

that Y = 1, and is the log odds, or logit

I is the link function

I The probability distribution is binomial

I Suppose we have a data frame called mydata with dependent

variable Y and independent variables X1,X2, and X3

I We can fit our logistic regression model as follows:

Logistic Regression Models

glm(Y∼X1+X2+X3, family=binomial(link="logit"), data=mydata)

Let’s consider an example..

install.packages("AER") data(Affairs, package =

"AER") summary(Affairs) table(Affairs$affairs)

Looking at the table we see Affairs$affairs contains count data

– we can convert count data to a binary variable for our logistic

regression

## create a dichotomous variable from the counts

Affairs$ynaffair[Affairs$affairs > 0] <- 1

Affairs$ynaffair[Affairs$affairs == 0] <- 0 Affairs$ynaffair <-

factor(Affairs$ynaffair, levels=c(0,1), labels=c("No","Yes"))

table(Affairs$ynaffair)

Logistic Regression Models

Now our dependent variable is in a form we can use for logistic

regression

Fit a logistic regression model using all the variables

fit.full <- glm(ynaffair ˜ gender + age

+ yearsmarried + children

+ religiousness + education + occupation

+rating, data=Affairs, family=binomial())

summary(fit.full) What do you observe?

Fit a reduced logistic regression model using only the significant

variables

fit.reduced <- glm(ynaffair ˜ age

Logistic Regression Models

+ yearsmarried + religiousness + rating,

data=Affairs, family=binomial())

summary(fit.reduced)

What do you observe?

Use anova() to compare nested models anova(fit.reduced, fit.full,

test="Chisq") Let’s try to interpret the model

coef(fit.reduced)

(Intercept) age yearsmarried religiousness rating

1.931 -0.035 0.101 -0.329 -0.461

Since the logistic output is log odds the coefficients are difficult to

interpret

Logistic Regression Models

Let’s look at the odds instead...

exp(coef(fit.reduced))

(Intercept) age yearsmarried religiousness rating

6.895 0.965 1.106 0.720 0.630

Here’s what we observe

exp(coef(fit.reduced))

(Intercept) age yearsmarried religiousness rating

6.895 0.965 1.106 0.720 0.630

I For each year increase in yearsmarried the odds of having an

affair increase by a factor of 1.106

I The odds decrease for a unit increase in each of age,

religiousness, and rating

Logistic Regression Models

I Since none of the variables takes on a value of 0 the intercept has

no interpretation

For the effect of an n unit change we use (eβˆj)n

What is often most useful is the probability of the “outcome” so we

transform the output to a probability

eβˆ0+PβˆiXi

Pr(Y = 1|X) = ˆ

+PβˆiXi 1 + eβ0

in R we just use the predict() function

Let’s use our fitted model to test the effect of rating on the

probability of having an affair

Logistic Regression Models

testdata <- data.frame(rating=c(1, 2, 3, 4, 5), age=mean(Affairs$age),

yearsmarried=mean(Affairs$yearsmarried),

religiousness=mean(Affairs$religiousness))

testdata$prob <- predict(fit.reduced,

newdata=testdata, type="response")

testdata

Now let’s use our fitted model to test the effect of age on the

probability of having an affair

testdata <- data.frame(rating=mean(Affairs$rating), age=seq(17, 57, 10),

yearsmarried=mean(Affairs$yearsmarried),

religiousness=mean(Affairs$religiousness))

testdata$prob <- predict(fit.reduced,

newdata=testdata, type="response")

Logistic Regression Models

testdata

I Recall that in the logistic model Y is assumed to have a binomial

distribution

I We validate this assumption by testing for overdispersion I

Overdispersion occurs when the observed variance of the

response variable is larger than what would be expected from a

binomial distribution; this can lead to inaccurate tests of

significance

I One way to detect overdispersion is to compare the residual

deviance to the residual degrees of freedom in the model

Residual Deviance

φ =

Resdiual df

if φ >> 1 there is evidence of overdispersion

Logistic Regression Models

summary(fit.reduced)$deviance/df.residual(fit.reduced

I Or fit the same model but use family=quasibinomial() instead of

family=binomial() fit.od <- glm(ynaffair ˜ age + yearsmarried + religiousness + rating, family =

quasibinomial(), data = Affairs)

pchisq(summary(fit.od)$dispersion * fit.full$df.residual, fit.full$df.residual, lower = F) I While using the results of a logistic regression model to estimate

the probability or likelihood of the event there are other useful

interpretations

I We saw earlier that the (conditional) probability is the odds ratio

divided by one plus the odds ratio

I Relative risk is the ratio of the conditional probability of one

group to another

Logistic Regression Models

I Relative risk allows us to compare the risk between two groups

I Using our earlier result we see the probability of a 17 year old

having an extra-marital affair is 0.335; for a 57 year old it is

0.109

I

RR = 0.335/0.109 ≈ 3.07

a 17 year old is about 3 times more likely to have an extra-

marital affair than a 57 year old

I Probabilities estimated from logistic regression models are point

estimates πˆ

I We can compute a confidence interval for the probability

I First we need the standard error of the probability estimate

Logistic Regression Models

SE(πˆ) = πˆ(1 − πˆ)SE(logit)

I A 95% confidence interval is

πˆ ± 1.96 × SE(πˆ)

Extensions of Logistic Regression

I Robust logistic regression –The glmRob() function in the robust

package can be used to fit a robust logistic regression model

helpful when fitting logistic regression models to data

containing outliers and influential observations

I Multinomial logistic regression –When the response variable has

more than two unordered categories (for example,

Logistic Regression Models

arried/widowed/divorced), you can fit a polytomous logistic

regression using the mlogit() function in the mlogit package

I Ordinal logistic regression – When the response variable is a set of

ordered categories (for example, credit risk as

poor/good/excellent), you can fit an ordinal logistic regression

using the lrm() function in the rms package

Logistic Regression Models – Summary

I Logistic regression is a generalized linear model used for

modeling binary, multi-level, and ordered dependent variables

I The transformation for a logistic regression converts the binary

outcomes to the log of the odds ratio

I We can transform the predicted values into a probability I

Logistic regression is good first technique for modeling

probabilities

I When we set a threshold probability value we can use a logistic

regression model as a classifier

I If the Predicted proability exceed the threshold we classify the

observation as true, otherwise it is false

I We use the glm() function in R to fit a logistic regression

Poisson Regression Models

I In the example using the Affairs data frame we converted the

number of affairs to a binary variable I If we wanted to predict the

number of affairs rather than the probability of having an affair we

would fit a Poisson regression

data(Affairs, package = "AER") pois.fit <- glm(affairs ˜

gender+age+yearsmarrie children+religiousness+education+

occupation+rating, data=Affairs, family = poisson()) summary(pois.fit)

The Poisson regression has the form

p

log(Y) = XβiXi i=0

Poisson Regression Models

I We update the model to remove the insignificant variables

pois.fit2 <- update(pois.fit, .˜.-gender-childre

-education) summary(pois.fit2)

I Let’s use our Poisson model to predict the number of affairs

newaffairs <- data.frame(age=c(22,32,42),

yearsmarried=mean(Affairs$yearsmarried),

religiousness=mean(Affairs$religiousness), occupation=mean(Affairs$occupation),

rating=mean(Affairs$rating)) predict(pois.fit2, newdata = newaffairs,

type="response")

Poisson Regression Models – Summary

I Poisson regression is useful when the dependent variable

represents count data

I The transformed dependent variable is log(Y)

I When predicting from a Poisson model remember to use

type=‘‘response’’ or exponentiate the raw output

Unsupervised Methods

I Unlike supervised learning, in unsupervised learning the data do

not contain a “right” classification

I This makes unsupervised learning more challenging I Often

unsupervised learning is used as part of exploratory data analysis

I And validating results is difficult because we do not know the

true answer

Unsupervised Methods

I Unsupervised learning techniques are growing in importance,

especially in certain fields of study

I Cancer researchers may look for subgroups among breast cancer

samples to get a better understanding of the disease

I Online shopping sites may try to identify groups of shoppers with

similar histories and show items in which the shopper is likely to

have an interest

I Or one may study how the population moves around based on

sociioeconomic factors

Cluster Analysis

I Cluster analysis is a data-reduction technique used to uncover

subgroups within a dataset

I Roughly speaking, cluster is a group of observations that are

more similar to each other than they are to the observations

in the other groups

I The two most common clustering approaches are hierarchical

clustering and partitioning clustering

I In hierarchical clustering each observation is its own cluster to

start then we combine observations into larger clusters until

we reach a single cluster

I In partitioning approaches, you specify the number of clusters

k, then the observations are divided into the k groups

Cluster Analysis

I There are many algorithms within each of these approaches,

and we will explore a couple

I We will look at some data sets within the flexcust and rattle

packages

I We will make use of functions from the following packages:

cluster, NbClust, flexclust, fMultivar, ggplot2, and rattle

Steps in Cluster Analysis

1. Select the variables that you believe are important in

understanding differences among the groups

2. Scale the data – it is best to “standardize” your data so that

variables with larger ranges do not dominate your results

3. Check for outliers – either eliminate them if possible or use

an approach that is robust like partitioning around medoids

Cluster Analysis

4. Calculate the distances

5. Select a clustering algorithm

6. Obtain one or more solutions

7. Determine the number of clusters – try several different

numbers and find the “best” one

Steps in Cluster Analysis....cont’d

8. Find the final clustering solution based on 6 and 7 above

9. Visualize your results – this will help you understand the

solution and its usefulness

10. Interpret the clusters – e.g., what do the observations in

each cluster have in common?

11. Validate the results – Do these groupings make sense, are

they real?

Cluster Analysis

I A critical step in every cluster analysis is the calculation of the

distance, or dissimilarity between each observation, or the

complement, proximity

I The most common measure of distance is Euclidean distance

I The Euclidean distance between two observations is given by

vu p

dij

where i and j are observations and p is the number of

variables

The nutrients dataset in the flexclust package includes

measurements on 27 types of foods

Cluster Analysis

> data(nutrient, package="flexclust")

> head(nutrient, 4) energy protein fat

calcium iron

BEEF BRAISED 340 20 28 9 2.6

HAMBURGER 245 21 17 9 2.7

BEEF ROAST 420 15 39 7 2.0

BEEF STEAK 375 19 32 9 2.6

The Euclidean distance between beef braised and hamburger is

q

d = (340 − 245)2 + (20 − 21)2 + (28 − 17)2 + (9 − 9)2 + (26 − 27)2 or d =

95.64

Or we can use the dist() function in R...

Cluster Analysis

> d <- dist(nutrient, method = "euclidean") # default > as.matrix(d)[1:4,1:4] BEEF BRAISED HAMBURGER BEEF ROAST BEEF STEAK BEEF BRAISED 0.0 95.6 80.9 35.2 HAMBURGER 95.6 0.0 176.5 130.9 BEEF ROAST 80.9 176.5 0.0 45.8 BEEF STEAK 35.2 130.9 45.8 0.0 which gives us the same result of course

Other distance and similarity measures include

I Hamming distance: for categorical variables, counts the number

of mismatches

I Manhattan (or city block) distance

p

dij = X|xik − xjk| k=i

Cluster Analysis

I Cosine similarity commonly used in text analysis

xikxjk similarity

and distance = cos−1(similarity)/π

Hierarchical Cluster Analysis

1. Define each observation (row, case) as a cluster

2. Calculate the distances between every cluster and every

other cluster

3. Combine the two clusters that have the smallest distance this

reduces the number of clusters by one Repeat steps 2 and 3

until all clusters are merged into into a single cluster Let’s

look at R

Cluster Analysis

I The previous result helps us understand the similarities among

food groups based on their nutrients

Cluster Analysis

I If we want to classify these food into a number of smaller

groups we need to do more analysis

I The NbClust package will help us out –back to R Partitioning

Cluster Analysis

I Partitioning divides the observations into k groups I The

groups are shuffled to make up the most cohesive clusters

possible according to the criterion

I Two popular approaches are k-means and partitioning around

medoids (PAM)

k-means Clustering

1. Select k centroids (rows) chosen at random

2. Assign each data point to its closest centroid

Cluster Analysis

3. Recalculate all the centroids by averaging all the data points

in each cluster

4. Assign the data points to clusters based on the new centroids

5. Continue 3 and 4 until the observations don’t reassign

anymore

6. Let’s go to R for another example

Partitioning Around Medoids

1. Randomly select K observations (call each a medoid)

2. Calculate the distance/dissimilarity of every observation to

each medoid

3. Assign each observation to its closest medoid

4. Calculate the sum of the distances of each observation from

its medoid (total cost)

Cluster Analysis

5. Select a point that isn’t a medoid, and swap it with its

medoid

6. Reassign every point to its closest medoid

7. Calculate the total cost

8. If this total cost is smaller, keep the new point as a medoid

9. Repeat steps 5-8 until the medoids don’t change

10. Let’s look at an example in R

Cluster Analysis – Summary

I There are many methods for cluster analysis

I Cluster analysis helps us discover subgroups within our data

I Common methods we covered in R are hierarchical cluster

analysis, and partitioning methods that include k-means and

partitioning around medoids

Classification Models

I Machine Learning Methods

I Demonstrate classification models I

Logistic Regression

I Decision Trees

I Random Forests

I Support Vector Machines I

Evaluating Models

I Data Mining with rattle()

Classification Models

I We develop classification models so we can predict future

observations

I Classification methods predict to which class an observation

belongs based on its features

I We start with EDA then partition the data

I For these examples we’ll create a training data set and a

validation set

I We build the model with the training data

I We evaluate how well the model does with the validation data

Classification Models - Setting up pkgs <- c("rpart", "rpart.plot", "parity", "randomForest", "e1071") install.packages(pkgs,

dependencies = TRUE)

loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/" ds <- "breast-cancer- wisconsin/breast-cancer-wisconsin.data" url <- paste(loc, ds, sep="")

breast <- read.table(url, sep=",", header=FALSE, na.strings="?") names(breast) <- c("ID",

"clumpThickness", "sizeUniformity", "shapeUniformity", "maginalAdhesion", "singleEpithelialCellSize", "bareNuclei", "blandChromatin", "normalNucleoli", "mitosis", "class")

df <- breast[-1] df$class <- factor(df$class, levels=c(2,4),

labels=c("benign", "malignant"))

set.seed(1234) train <- sample(nrow(df), 0.7*nrow(df))

df.train <- df[train,] df.validate <- df[-train,]

table(df.train$class) table(df.validate$class)

Classification Models - Logistic Regression

I In our earlier discussion of logistic regression we learned that

model predicts the logit, or log of the odds ratio I We can

exponentiate the logit to get the odds ratio I ...and we can ”undo”

the transformation to get probability of the event we are

modeling

I If we set a threshold for the predicted probability we can turn

the model into a classifier

Classification Models - Logistic Regression

fit.logit <- glm(class˜., data=df.train, family=binomial()) summary(fit.logit)

prob <- predict(fit.logit, df.validate, type="response") logit.pred <-

factor(prob > .5, levels=c(FALSE, TRUE), labels=c("benign", "malignant")) logit.perf

<- table(df.validate$class, logit.pred, dnn=c("Actual", "Predicted")) logit.perf

I We create a binary outcome from the probability by setting the

threshold at 0.5

I Any observation with a predicted probability greater than

0.5 is considered malignant (in this example)

I Lastly, we create a confusion matrix

Classification Models - Decision Trees

I Decision trees are popular in data mining I Starting at the top

(root) we follow a set of binary splits that can be used to classify

new observations

I Two types include classical trees and conditional trees I A

classical tree segregates the observations based on

homogeneity

I A conditional tree segregates based on significance test

Classification Models - Decision Trees

1. Choose a predictor variable that splits the data into two

groups and that maximizes homogeneity

2. Separate the data into the two groups and repeat the

process for of the two new groups

3. Continue 1 and 2 until no splits reduce the impurity

4. Classify an observation by going down the tree until you

reach terminal node library(rpart) set.seed(1234) dtree <- rpart(class ˜ ., data=df.train, method="class", parms=list(split="information")) dtree$cptable

Classification Models - Decision Trees

plotcp(dtree) dtree.pruned <- prune(dtree, cp=.0125)

library(rpart.plot) prp(dtree.pruned, type = 2, extra = 104, fallen.leaves = TRUE,

main="Decision Tree") dtree.pred <- predict(dtree.pruned, df.validate, type="class")

dtree.perf <- table(df.validate$class, dtree.pred, dnn=c("Actual", "Predicted")) dtree.perf

Classification Models - Random Forest

I A random forest is an ensemble learning approach I Many

predictive models are developed then aggregated

I If we have Nobservations in the training sample and M variables,

then the algorithm is as follows:

1. Grow a large number of decision trees by sampling N cases with

replacement from the training set 2. Sample m < M variables at each node keep m constant at each

node; these variables are considered candidates for splitting in

that node

Classification Models - Support Vector Machines

3. Grow each tree fully without pruning (the minimum node size is

set to 1) 4. Terminal nodes are assigned to a class based on the mode of

cases in that node 5. Classify new observations by sending them down all the trees

tracking the outcomes—majority rules

I Build random forests with randomForest() in the random-

Forest package

Classification Models - Random Forest

library(randomForest) set.seed(1234) fit.forest <- randomForest(class˜.,

data=df.train, na.action=na.roughfix, importance=TRUE)

fit.forest

forest.pred <- predict(fit.forest, df.validate) forest.perf <-

table(df.validate$class, forest.pred, dnn=c("Actual", "Predicted"))

forest.perf I A group of supervised machine-learning methods I Can

be used for classification and regression

Classification Models - Support Vector Machines

I We seek an optimal hyperplane for separating two classes in a

multidimensional space I The chosen hyperplane maximizes the

margin between the two classes’ closest points

I The points on the boundary of the margin are called

support vectors (they help define the margin) I The middle

of the margin is the separating hyperplane

I For an N-dimensional space the optimal hyperplane has N − 1

dimensions

I If there are two variables, the surface is a line I For

three variables, the surface is a plane

I For 10 variables, the surface is a 9-dimensional hyperplane

Classification Models - Support Vector Machines

Classification Models - Support Vector Machines

library(e1071) set.seed(1234) fit.svm <- svm(class˜.,

data=df.train) fit.svm

svm.pred <- predict(fit.svm, na.omit(df.validate)) svm.perf <-

table(na.omit(df.validate)$class, svm.pred, dnn=c("Actual", "Predicted"))

svm.perf

Classification Models - Choosing the Best Model

I We can use the measures we discussed previously:

accuracy, sensitivity, specificity to evaluate how well our

model works

Classification Models - Support Vector Machines

I Instead of calculating each separately we can borrow a

function defined in the book R in Action

Classification Models - Choosing the Best Model

performance <- function(table, n=2){ if(!all(dim(table) == c(2,2))) stop("Must be a

2 x 2 table") tn = table[1,1] fp = table[1,2] fn = table[2,1] tp = table[2,2] sensitivity

= tp/(tp+fn) specificity = tn/(tn+fp) ppp = tp/(tp+fp) npp = tn/(tn+fn) hitrate =

(tp+tn)/(tp+tn+fp+fn) result <- paste("Sensitivity = ", round(sensitivity, n), "\nSpecificity = ", round(specificity, n), "\nPositive Predictive Value = ", round(ppp, n), "\nNegative Predictive Value = ", round(npp, n), "\nAccuracy = ", round(hitrate, n),

"\n", sep="") cat(result) }

I Rattle – R Analytic Tool To Learn Easily

I GUI for data mining in R

Classification Models - Data Mining with Rattle

I Point-and-click access to supervised and unsupervised models

I Includes the ability to transform data and has data-visualization

tools

I install.packages("rattle")

I library(rattle)

rattle() Launches the rattle interface

See Williams (2011) for more on Rattle

loc <- "http://archive.ics.uci.edu/ml/machine-learning-database ds <- "pima-indians-

diabetes/pima-indians-diabetes.data" url <- paste(loc, ds, sep="") diabetes <- read.table(url,

sep=",", header=FALSE)

Classification Models - Data Mining with Rattle

names(diabetes) <- c("npregant", "plasma", "bp", "triceps", "insulin", "bmi", "pedigree", "age", "class") diabetes$class <-

factor(diabetes$class, levels=c(0,1), labels=c("normal", "diabetic")) library(rattle)

rattle()

cv <- matrix(c(145, 50, 8, 27), nrow=2) performance(as.table(cv))

Classification Models - Data Mining with Rattle

Classification Models – Summary

I We looked at several machine-learning methods for

classifying observations into one of two groups I The

methods vary from low complexity like logistic regression

and decision trees to high complexity like random forests

and support vector machines I Classification models apply to

many fields (beyond medicine): computer science, finance,

marketing, etc. I We looked at problems with two grouops

Time Series

but these methods extend to multigroup classification

problems

I Time series is a collection of observations Xt, each one being

recorded at time t

I Time could be discrete, t = 1,2,3,..., or continuous t > 0 I Time

series data occur very often in reality and thus it is important to

know how to deal with them

I Let’s look at some examples of time series data

Time Series

Time Series

Time Series

Time Series

I Time series show up in many fields including engineering,

science, sociology, and economics

I We analyze time series to draw inferences from them

I A time series is a sequence of random variables {Xt}

measured over time

I We often denote a time series as

Xt = mt + st + Yt

for an additive model and

(14.1)

Xt = mt ∗ st ∗ Yt (14.2)

for a multiplicative model

Time Series

I This is known as classical decomposition

I In equations (14.1) and (14.2) on the previous slide, mt is the

trend component, st is the seasonal component, and Yt is the

random or irregular component

I mt is a slowly varying function and is often estimated using least

squares

I That is, we fit a function for example

mt = a0 + a1t + a2t2 ··· + aptp

to the data {x1,...,xn} by finding parameters that minimize

n

Time Series

X − mt)2

(xt t=1

I For a linear trend we just have mt = a0 + a1t

Time Series – General Approach to Modeling

I Plot the time series and look for features do we have

1. a trend 2. a seasonal component 3. any sudden change in the series 4. any outliers

I Remove any trend and seasonal components to get

stationary residuals

I Choose a model to fit the residuals

Time Series

I Forecast based on the residuals and invert any transformations

to get the original series I An alternative approach is to express

the series in terms of

Fourier components (we won’t cover this)

A time series {Xt} is (loosely speaking) stationary if it has similar

statistical properties to those of the “time shifted” series {Xt+h} for

each integer h

Definition Let {Xt} be a time series with EXt2 < ∞. The mean function of {Xt} is

µX(t) = E(Xt).

Time Series

The covariance function of {Xt} is γX(r,s) = Cov(Xr,Xs) = E[(Xr −

µX(r))(Xs − µX(s))] for all integers r, s, and t.

Definition {Xt} is (weakly) stationary if

(i) µX(t) is independent of t, and

(ii) γX(t + h,t) is independent of t

for each h.

Time Series

Definition Let {Xt} be a stationary time series. The autocovariance

function (ACVF) of {Xt} is

γX(h) = Cov(Xt+h,Xt).

The autocorrelation function (ACF) of {Xt} is

γX(h) ρX(h) ≡

= Cor(Xt+h,Xt).

γX(0)

Some examples...

Time Series

I Creating time series

I Plotting to inform our analysis

I Seasonal decomposition to remove trend and seasonality

I Forecasting methods

1. Exponential modeling 2. Autoregressive integrated moving averages (ARIMA) models

Time Series

Time Series

Smoothing with Simple Moving Averages

Time Series

I Smoothing dampens fluctuations so we may see patterns in the

data I Simple moving averages is the simplest method to smooth

time series

I We replace each data point with the mean of that

observation and one or more points before and after

I This is a centered moving average

St = (Yt−q + ··· + Yt + ··· + Yt+q)/(2q + 1)

I R code

Time Series

I Time series data often have trend and seasonal

components

I We decompose the series into its trend, seasonal, and irregular

or random components

I After we decompose the series we can model the random

component

Exponential Smoothing Models

I Simple or single exponential models only a level with

irregular variation, no trend or seasonal

Time Series

I Double exponential aka Holt exponential model has level with

irregular variation, and a trend

I Triple exponential aka Holt-Winters has level with irregular

variation, and a trend and seasonal component

I R code

I ARIMA models forecast values as a linear function of one or more

recent values and recent errors

I Before modeling we need to understand key terms

I Lag – shift the time series back by one or more periods

Time Series

I Autocorrelation – measures the association between observations

in different time periods

I Partial autocorrelation – measures the association between two

observations, Yt and Yt−k after removing the effects of all the

observations in between

I Differencing – replace each value in the series with the difference

between successive values, removing any trend

I Stationarity – the statistical properties (mean, variance,

autocorrelations for any lag k) of the series do not change over

time the Augmented Dickey-Fuller (ADF) test is used to evaluate

stationarity

Time Series

Autoregressive Model of order p

AR(p) : Yt t

Moving Average Model of order q

MA(q) : Yt t

ARMA model of order (p,q)

ARMA t

Time Series

Typically we subtract the mean so µ = 0

Guidelines for model selection using ACF and PACF

I ARIMA(p,d,q) – Autoregressive Integrated Moving Average

model with AR order p, MA order q series differenced d times

I R code for an ARIMA example

Time Series Methods – Summary

I Time series data appear in every field of study

I Fundamental to time series analysis is to forecast future values

I There are many techniques for time series modeling and

forecasting including the two we looked at: exponential

smoothing and ARIMA(p,d,q) models

I Be aware that we are forecasting beyond the observed data

based on historical behavior and forecasts are prone to error if

things change

I This is why forecast errors increase the further out you go I This

is less of a concern when working with say stable natural

phenomena, or other phenomena where the historical behavior is

less volatile

References

Doane, D. P. and Seward, L. E. (2011).

Applied Statistics in Business and Economics. McGraw-

Hill/Irwin, third edition.

Kabacoff, R. I. (2015).

R in Action.

Manning, Shelter Island, NY, second edition.

Neter, J., Wasserman, W. and Kutner, M. H. (1983).

Applied LInear Regression Models.

Richard D. Irwin, Inc, Homewood, IL.

References (cont.)

Parisi, F. and Lund, R. (2000).

Seasonality and return periods of landfalling Atlantic basin

hurricanes.

Australian & New Zealand Journal of Statistics 42, 271–282.

Williams, G. (2011).

Data Mining with Rattle and R.

Springer, New York.