Analytics Capstone Project
Analytics Capstone Project
Course Overview
1. Part I: Data Science Fundamentals I Data
Science Concepts and Process I Exploratory
Data Analysis
I Cleaning & Manipulating Data
I Presenting Results
2. Part II: Graphs & Statistical Methods
I Basic Graphics
Data Science Concepts and Process I Advanced Graphics I Probability &
Statistical Methods
3. Part III: Modeling Methods I Linear and
Logistic Regression
I Model Selection and Evaluation I Unsupervised
Methods
I Advanced Modeling Methods
I Data science is more than statistical analysis
I Emphasis on collaboration and project definition
I Project roles
I Project sponsor I
Client or SME
I Data scientist
I Data architect I
Operations
I Data science project life cycle . . .
Data Science Concepts and Process
Data Science Concepts and Process
I Project goal – why are we doing this?
I Data collection, quality, sufficiency, and management
I Model development
I Model evaluation and sufficiency
I Presentation to stakeholders, project documentation, and
reproducibility
Communicating Results
I You’re telling a story
I What are the questions you’re seeking to answer?
I Why are these interesting questions?
Data Science Concepts and Process
General structure of a paper
1. Abstract – Brief summary of the question(s) of interest, the
methodology and the results
2. Introduction – Clear statement of the scientific question,
objectives of the study, background information to put the
question and research into perspective
3. Methods/Methodology – Describe the design of the study and
the analytical methodology you used
4. Results – Describe the key results of your data analysis and
interpret the results with respect to the objectives of the study
and the question(s) of interest
Data Science Concepts and Process
General structure of a paper (cont’d)
1. Conclusions – Summary of the main findings and a discussion
of how these findings relate to the question(s) of interest,
comparison to results from other studies, and ideas for future
research
2. References – Citations for literature you used
3. Technical Appendices (if necessary) – Describe more
complicated analyses, or derivations and proofs, or any other
material that would disrupt the flow of the paper if it were
included in the body
Data Science Concepts and Process
Use graphs to clarify not confuse or mis-lead
Data Science Concepts and Process
Data Science Concepts and Process
Data Science Concepts and Process
Data Science Concepts and Process
Exploratory Data Analysis
I EDA allows us to get a sense for what data we have if/and how
they are related
I Summary statistics
> summary(ldl) age gender ldl.pre ldl.post
Length :100 Length :100 Min. :106.5 Min. :102.9 Class :character Class :character 1st Qu. :116.9 1st Qu. :112.8 Mode :character Mode :character Median :120.7 Median :117.2 Mean :120.4 Mean :116.7
3rd Qu. :123.9 3rd Qu. :119.9
Max. :132.6 Max. :130.6
> ldl$fgender <- factor(ldl$gender)
Exploratory Data Analysis
> ldl$fage <- factor(ldl$age)
> summary(ldl[,-c(1,2)]) ldl.pre ldl.post fgender fage Min. :106.5 Min. :102.9 f:50 gt35:51 1st Qu. :116.9 1st Qu. :112.8 m:50 le35:49 Median :120.7 Median :117.2
Mean :120.4 Mean :116.7
3rd Qu. :123.9 3rd Qu. :119.9
Max. :132.6 Max. :130.6
I Visualizing data
Exploratory Data Analysis
Exploratory Data Analysis
I Identifying data problems through data summaries and
visualization I Missing data
I Invalid data and outliers
I Data ranges and comparable units
Exploratory Data Analysis
> # create a random sample of 100 numbers (size) from 1 through
100, and allow repeated values
> x <- sample(x = 1:100, size = 100, replace =
TRUE)
> # calculate the mean()
> mean(x)
[1] 48.78
> # make a copy of x
> y <- x
> # Draw a random sample of size 20 from our vector y and set
those values to NA > y[sample(x = 1:100, size = 20, replace =
FALSE)] <- NA
> mean(y)
[1] NA
Exploratory Data Analysis
> mean(y, na.rm = TRUE) > library(scales)
> ggplot(custdata) + geomdensity(aes(x=income)) + scalexcontinuous(labels =
dollar)
Exploratory Data Analysis
> ggplot(custdata) + geomdensity(aes(x=income)) + scalexlog10(breaks =
c(100,1000,10000,100000), labels=dollar) + annotationlogticks(sides = "bt")
Exploratory Data Analysis
> custdata2 <- subset(custdata, (custdata$age > 0 & custdata$age < 100 &
custdata$income > 0))
> ggplot(custdata2, aes(x=age, y=income)) + geompoint()
Exploratory Data Analysis
Cleaning & Manipulating Data
I Data transformations
I It is often helpful/necessary to transform data for analysis
I As we saw earlier, the transformed income data helped in
visualization
I When fitting a linear model a log transform will reduce
nonlinearity
I Invalid data values
I Data ranges and units
I Handling missing data -– just deleting is not always good
Cleaning & Manipulating Data
I Data may be missing for a number of reasons I It is important
to understand what data are missing and why
I Identifying missing data
I Visualizing missing data patterns
I We can remove the observations with missing data (avoid if
possible)
I Complete-case analysis – listwise deletion I Pair-
wise deletion
I We can replace the missing values
I Simple imputation
Cleaning & Manipulating Data
I Multiple imputation
Classifying Missing Data
I Missing completely at random (MCAR)
I Missing at Random (MAR)
I Not Missing at Random (NMAR)
Cleaning & Manipulating Data
Cleaning & Manipulating Data
Source: Kabacoff (2015)
Presenting Results
I Clear communication of results
I Target presentation to your audience I Project
sponsor or company executives
I End-users
I Other data scientists, analysts or researchers
Presenting Results
Project sponsor or company executives
I Summarize the project’s goals motivation for doing it I State
the results
I Provide details as needed
I Discuss recommendations or future work
Presenting Results
Think Journalistic Style . . .
End-users
I Summarize the project’s goals and motivation for doing it I
Show how the model fits into the users’ workflow and improves
the workflow
I Show how to use the model
Presenting Results
Other data scientists, analysts or researchers
I Introduce the problem I
Discuss related work
I Discuss your approach
Presenting Results
I Results and findings I
Discuss future work
This reflects the structure of a research article in a journal
Basic Graphics
I Scatterplots Simple plotting of two variables
plot(mtcars$mpg ˜ mtcars$wt, xlab = "Weight", ylab =
"MPG") plot(mtcars$mpg ˜ mtcars$wt, xlab = "Weight",
ylab = "MPG", pch = 19)
I pch allows you to specify the plotting character
Basic Graphics
I Bar plots Display the distribution (frequency) of a categorical
variable
install.packages("vcd") library(vcd) counts <-
table(Arthritis$Improved) barplot(counts,
main="Simple Bar Plot", xlab="Improvement",
ylab="Frequency") barplot(counts, main="Horizontal
Bar Plot", xlab="Frequency", ylab="Improvement",
horiz=TRUE)
Basic Graphics
If your data is a matrix rather than a vector, the resulting graph will
be a stacked or grouped bar plot
Basic Graphics
barplot(counts2, main="Stacked Bar Plot", xlab="Treatment",
ylab="Frequency", col=c("red", "yellow","green"),
legend=rownames(counts))
barplot(counts2, main="Grouped Bar Plot", xlab="Treatment",
ylab="Frequency", col=c("red", "yellow", "green"),
legend=rownames(counts), beside=TRUE)
By setting beside = TRUE we get a grouped bar plot
Basic Graphics
Basic Graphics
I As we finish our discussion of bar plots, let’s take a look at a
specialized version called a spinogram
I A spinogram is a stacked bar plot is rescaled so that the height
of each bar is 1 and the segment heights represent
proportions
I Spinograms are created with the spine() function of the vcd
package
attach(Arthritis) counts3 <- table(Treatment, Improved)
spine(counts, main="Spinogram Example")
detach(Arthritis)
Basic Graphics
I Histograms Display the distribution of a continuous variable by
dividing the range a number of bins on the x-axis and
displaying the frequency in each bin on the y-axis
Basic Graphics
I The following code creates four histograms – from basic to more
complex
par(mfrow=c(2,2)) hist(mtcars$mpg)
hist(mtcars$mpg, breaks=12, col="red", xlab="Miles
Per Gallon", main="Colored histogram with 12 bins")
hist(mtcars$mpg, freq=FALSE, breaks=12, col="red",
xlab="Miles Per Gallon", main="Histogram, rug plot,
density curve") rug(jitter(mtcars$mpg))
lines(density(mtcars$mpg), col="blue", lwd=2)
x <- mtcars$mpg h<-hist(x, breaks=12, col="red", xlab="Miles
Per Gallon", main="Histogram with normal curve and box")
xfit<-seq(min(x), max(x), length=40) yfit<-dnorm(xfit,
Basic Graphics
mean=mean(x), sd=sd(x)) yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2) box()
Basic Graphics
Basic Graphics
I Density plots Kernel density estimation is a nonparametric
method for estimating the probability density function of a
random variable I Generally, density plots can be an effective way
to view the distribution of a continuous variable
par(mfrow=c(2,1)) d <- density(mtcars$mpg) plot(d) d <-
density(mtcars$mpg) plot(d, main="Kernel Density of Miles Per Gallon")
polygon(d, col="red", border="blue") rug(mtcars$mpg, col="brown")
Basic Graphics
Basic Graphics
I Boxplots A box-and-whiskers plot describes the distribution of a
continuous variable by plotting its five-number summary: the
minimum, first quartile (25th percentile), median (50th
percentile), third quartile (75th percentile), and maximum
I Includes observations that may be outliers (values outside the
range of ±1.5 ∗ IQR, where IQR is the interquartile range defined
as the third quartile minus the first quartile) I By default, each
whisker extends to the most extreme data point, which is no
more than 1.5 times the interquartile range
I Values outside this range are depicted as dots and may be
outliers
Basic Graphics
Source: Kabacoff (2015)
Basic Graphics
I Dot plots Provide a method of plotting a large number of
labeled values on a simple horizontal scale
Advanced Graphics
There are four graphics packages in R:
grid provides drawing primitives, but no tools for producing
statistical graphics; We will not cover grid
Base graphics are fine for assignments/projects–We will
look more advanced packages so you are aware that they exist
Advanced Graphics
I lattice provides a comprehensive graphical system for
visualizing univariate and multivariate data
I lattice has the ability to easily generate trellis graphs I A trellis
graph displays the distribution of a variable, or the relationship
between variables, separately for each level of one or more other
variables
I Let’s look at an example: How do the heights of singers in the
New York Choral Society vary by their vocal parts?
library(lattice) histogram(˜height | voice.part, data = singer,
main="Distribution of Heights by Voice Pitch", xlab="Height
(inches)")
Advanced Graphics
Graph types and functions from the lattice package
Advanced Graphics
Advanced Graphics – lattice Examples
library(lattice) attach(mtcars)
gear <- factor(gear, levels=c(3, 4, 5), labels=c("3 gears", "4 gears",
"5 gears")) cyl <- factor(cyl, levels=c(4, 6, 8), labels=c("4
cylinders", "6 cylinders",
"8 cylinders"))
densityplot(˜mpg, main="Density Plot", xlab="Miles per Gallon")
densityplot(˜mpg | cyl, main="Density Plot by Number of
Cylinders", xlab="Miles per Gallon")
lattice
Advanced Graphics – Examples
bwplot(cyl ˜ mpg | gear, main="Box Plots by Cylinders and
Gears", xlab="Miles per Gallon", ylab="Cylinders")
xyplot(mpg ˜ wt | cyl * gear, main="Scatter Plots by Cylinders
and Gears", xlab="Car Weight", ylab="Miles per Gallon")
cloud(mpg ˜ wt * qsec | cyl, main="3D Scatter Plots
by Cylinders")
dotplot(cyl ˜ mpg | gear, main="Dot Plots by Number of Gears and
Cylinders",
xlab="Miles Per Gallon")
Advanced Graphics – lattice Examples
splom(mtcars[c(1, 3, 4, 5, 6)], main="Scatter Plot Matrix for
mtcars Data")
detach(mtcars)
Advanced Graphics
I ggplot2 developed by Hadley Wickham I R
package for producing graphics
I ggplot2 is designed to work by adding layers of annotations and
statistical summaries
I Takes a little practice learning the “grammar” of ggplot2
Advanced Graphics – Examples
I Read the ggplot2 book posted on Blackboard
ggplot2
library(ggplot2) set.seed(1410) # Make the sample reproducible dsmall
<- diamonds[sample(nrow(diamonds), 100), ]
qplot(carat, price, data = diamonds) qplot(log(carat), log(price), data =
diamonds) qplot(carat, x * y * z, data = diamonds) qplot(carat, price, data
= dsmall, colour = color) qplot(carat, price, data = dsmall, shape = cut)
qplot(carat, price, data = dsmall, geom = c("point",
"smooth"))
Advanced Graphics – Examples
Analytic Geometry
Advanced Graphics – Examples
ggplot2
## Use GAM as a smoother
library(mgcv) qplot(carat, price, data = dsmall,
geom = c("point", "smooth"), method = "gam",
formula = y ˜ s(x))
## qplot does a lot for us but we can build
## layer by layer
qplot(sleep_rem / sleep_total, awake, data = msleep, geom = c("point",
"smooth"))
ggplot(msleep, aes(sleep_rem / sleep_total, awake)) geom_point() +
geom_smooth()
Advanced Graphics – Examples
Analytic Geometry
Probability & Statistical Methods
Descriptors for probability distributions. . .
I Location – the center or the peak of the distribution typically
measured by the mean (average or expected value) or median
I Shape – the degree of symmetry or skewness
I Scale or Variability – the spread of the distribution typically
measured by variance or standard deviation, the range or
interquartile range, or a scale parameter
I Deviations – possible outliers, or unusual points that are not
consistent with the rest of the data
Probability & Statistical Methods
I Probability distributions I
Normal distribution
f
I Functions for the normal distribution incluide rnorm, dnorm,
pnorm, qnorm
Probability & Statistical Methods
Source: Doane and Seward (2011)
I Summary statistics I
Sample mean
Probability & Statistical Methods
= x1 + x2 + ··· + xn = Pni=1 xi x
n n
I Sample variance
s n − 1
√ . . . and standard deviation s = s2
I Median - middle value of a data set not sensitive to extreme values First sort the data x1 ≤ x2 ≤ ···xn
If n is odd
m = x(n+1)/2
If n is even x(n/2) + x(n+2)+1
Probability & Statistical Methods
m
I The sample mean we calculated above is a point estimate I While
this is informative it is important to understand the uncertainty in
the estimate
I Confidence intervals do just that I Suppose we wish to find a
confidence interval for the mean of a distribution based on a
sample
√ x ± Zα/2 · σ/ n
where α represents the significance level, Z is the quantile
from the standard normal distribution, σ is the standard
deviation, and n is the sample size I The confidence level is 1 − α
Probability & Statistical Methods
I The confidence level is the probability that the interval includes
the true mean
Example A sample of size n = 40 is drawn from a population with a variance σ2
= 10, and x = 7.164. What is a 95% confidence interval for µ?
!
which is (6.184,8.144)
I Sampling distributions
The sampling distribution of an estimator is the probability
distribution of all possible values the statistic may assume when
a random sample of size n is taken
I The sampling distribution of the sample mean
Probability & Statistical Methods
X ∼ N
Try this . . .
library(lattice) library(resample) x <- ldl$ldl.post samp.idx <- samp.bootstrap(n = 100, R = 10000, size
=20) mean(rowMeans(matrix(x[samp.idx], nrow = 10000, byrow = FALSE))) mean(x)
histogram(rowMeans(matrix(x[samp.idx], nrow = 10000, byrow = FALSE)))
Probability & Statistical Methods
I Covariance and correlation measure the association between two
random variables given random variables X and Y
Probability & Statistical Methods
Cov(X,Y) = E(X − EX)(Y − EY)
= EXY − EX EY
and
Cov(X,Y)
ρ(X,Y) =
p
Var(X)Var(Y)
I While both measure association, correlation ρ is normalized and
ranges from −1 to 1 making it easier to interpret
I We can also write correlation as
rxy
(n − 1)SXSY
Probability & Statistical Methods
when working with a sample, X is the sample mean for X, Y is
the sample mean for Y, SX and SY are the sample standard
deviation for X and Y respectively, and n is the sample size
I In R we use the cor function for correlation and cov for covariance
library(ggplot2) cor(economics$pce,
economics$psavert)
[1] -0.837069
I Hypothesis testing is used to test assumptions and theories and
guide our decisions
I State the level of significance α: the probability that you
incorrectly reject H0
I The levels of significance correspond to various quantiles of the
standard normal distribution . . .
Probability & Statistical Methods
I α = 0.05,Z = 1.96,
I α = 0.10,Z = 1.645
I Refer to a Z table that provides the area under the curve for the
cumulative normal distribution
I Steps in hypothesis testing:
Step 1: State the null and alternative hypotheses, H0, and Ha (or H1), determine if this is a one-sided or two-sided test
Step 2: State α, the level of significance Step 3: Calculate the test statistic from the sample mean and variance
X − µ Z =
ps2/n
Step 4: Compare the calculated test statistic to the critical value
Probability & Statistical Methods
Step 5: Accept H0 if Z is in the acceptance region, or reject H0 if Z is in the
critical or rejection region
Example Historically the test scores on a given math test are normally
distributed with mean 75 and variance 36. A group of 16 students
take the test this year, and score an average of 82. Do this group’s
math scores differ from the historical average? Test this at a 0.10
level of significance.
Solution
Z .
Since α = 0.1 we have a 1 − α confidence level, and the critical value for
Z = 1.645.
Probability & Statistical Methods
Since 4.67 > 1.645 we reject H0 and conclude that this year’s math
scores have a higher average than the historical average.
Example The specifications for a cable call for a mean breaking strength of
2000 pounds. For a sample of the cable the mean breaking strngth is
1955 pounds with a standard error of the mean of 25. Using α = 0.05
determine if the difference in mean breaking strengths is statistically
significant.
I In these examples the variance is known so we use the standard
normal distribution
I If we don’t know the variance and the sample size is small
(rule of thumb n < 30) then use a t-distribution
Probability & Statistical Methods
X − µ t
= ps2/n
I Compare this test statistic to the critical value from a t-table, for n
− 1 degrees of freedom
Exercise The copier at Seidenberg can make 45 copies per minute on average.
The machine is altered to increase output, and three test runs
produce 46, 47, and 48 copies per minute. Is this increase
statistically significant? Use α = 0.05. Note this is a one-tailed test.
Conduct a t-test in R using the function t-test
> t.test(ldl$ldl.post, mu = 117) One Sample t-test
Probability & Statistical Methods
data: ldl$ldl.post t = -0.62329, df = 99, p-value = 0.5345 alternative hypothesis:
true mean is not equal to 11 95 percent confidence interval:
115.6613 117.6987 sample
estimates:
mean of x
116.68
We can do a two-sample t-test testing the equality of the sample
means, and the null hypothesis is the difference is zero:
H0 : µ1 − µ2 = 0
> t.test(ldl$ldl.post˜ldl$gender) Welch Two Sample t-
test
data: ldl$ldl.post by ldl$gender t = 0.65255, df = 88.233, p-value = 0.5157 alternative hypothesis:
true difference in means is not equal to 0 95 percent confidence interval: -1.374449 2.718449 sample
estimates: mean in group f mean in group m
Probability & Statistical Methods
117.016 116.344 When testing for a difference in means, the test statistic
(X1 − X2) − (µ1 − µ2)
t =
SX1−X2
has a t-distribution when n1 + n2 ≤ 30
SX
2 ni
i i
1 Si = =
ni(ni − 1) ni − 1
Probability & Statistical Methods
For a given significance level, α, the critical value is tα/2,df where df =
n1 + n2 − 2
We accept H0 if −tα/2,df ≤ t ≤ tα/2,df
Probability & Statistical Methods
Probability & Statistical Methods
I Arguably regression analysis is the most commonly used
statistical model I Linear regression explores relationships
between variables that are linear
I Many problems can be analyzed using linear regression
. . .
I And when the relationships are not quite linear, the
transformation of the original variables results in linear
relationships among the transformed variables
I The functional relation between two variables is
Y = f(X)
Probability & Statistical Methods
Example Let’s consider the relation between total dollar sales (Y) of a product
and the number of units sold (X). If the unit price is $2 then the
functional relation is
Y = 2X.
I Unlike a functional relation a statistical relation is not perfect
I The observations do not fall on directly the curve I Two key
concepts of a statistical relationship captured by a regression
model:
1. The tendency of the dependent variable Y to vary with the
independent variable X 2. The points scatter around the curve of the statistical relationship
I The linear regression equation is
Probability & Statistical Methods
Y
for a simple linear regression and
Y (8.1)
for a multiple regression model
In (8.1) The βi are the regression coefficients corresponding the the
Xi independent variables, aka predictor variables, or explanatory
variables, and is the random error,
When we fit a regression model we estimate the coefficients and our
model is denoted by
Yˆ = βˆ0 + βˆ1X1 + βˆ2X2 + ··· + βˆpXp
Probability & Statistical Methods
The following graph shows the data points and the fitted regression
line for Ozone regressed on Temp from the
airquality data frame
Probability & Statistical Methods
Neter et al. (1983)
Probability & Statistical Methods
Formulæfor βˆ1 and βˆ0 :
,
βˆ0 = Y − βˆ1X
Ordinary least squares: find the coefficients that minimize the sum
of the squared error terms
NOTE: Let . To minimize,
take the partial derivatives, and , set them to zero and solve
for βˆ0 and β ˆ 1.
Formal assumptions of regression analysis:
Probability & Statistical Methods
I The relation is, in fact, linear, so that the errors all have expected
value for all i
I The errors all have the same variance, for all i
I The errors are independent of each other
I The errors are all normally distributed,
The last two points are often stated as the errors are independent
and identically distributed, or iid
Fitting a regression model in R
## Create a link to the website AdLink <- "http://www-bcf.usc.edu/˜gareth/ISL/Advertising.csv"
## Use read.csv to get the data Advertising <- read.csv(file = AdLink, sep = ",", header = TRUE)
## Fit the regression ad.lm <- lm(Sales˜TV, data =
Advertising)
Probability & Statistical Methods
## View the results summary(ad.lm)
SOURCE: The data are taken from ”An Introduction to Statistical Learning, with applications in R” (Springer, 2013)
I Analysis of Variance (ANOVA)
I Used in experimental situations
I Apply several treatments or treatment combinations to randomly
selected experimental units
I Compare the treatment means for some response y
I To test the effects of these additives we need more than one car
for each type of additive
Probability & Statistical Methods
I Suppose we have three cars for each type our ANOVA model is
y , y y
, y
or
yij ij
and we test if the average y for treatment 1 is the same as the
average y for treatment 2
Assumptions for Analysis of Variance
I The response variable is normally distributed for all the groups
I The variance of the response variable, denoted σ2, is the same for
all the groups
I The observations are independent
Probability & Statistical Methods
I Analysis of Variance is misleading because the objective in ANOVA
is to analyze differences among the group means, not the
variances
I By analyzing the variation among and within the groups, you can
reach conclusions about possible differences in group means
I Total variation is subdivided into variation that is due to
differences among the groups and variation that is due to
differences within the groups
I Within-group variation measures random variation I Among-
group variation is due to differences from group to group
I The symbol c (in this text) is used to indicate the number of
groups
Probability & Statistical Methods
Total Variation c nj
SST = XX(Xij − X)2 (8.2)
j=1 i=1
where
Probability & Statistical Methods
c nj
PP Xij j=1 i=1
X = = grand mean
n
Xij = ith value in group j nj =
number of values in group j
n = total number of observations (all groups)
c = number of groups
Among-Group Variation
c
SSA = Xnj(Xj − X)2 (8.3) j=1
Probability & Statistical Methods
where
c = number of groups
nj = number of values in group j
Xj = sample mean of group j
X = grand mean
Within-Group Variation
c nj
SSW = XX(Xij − Xj)2 (8.4)
j=1 i=1
where
c = number of groups
Probability & Statistical Methods
Xij = ith value in group j
Xj = sample mean of group j
To test the null hypothesis
H0 : µ1 = µ2 = ··· = µc
against the alternative
H1 : Not all µj areequal
we create an ANOVA summary table
Probability & Statistical Methods
− =
− =
Pr( > ∗
− =
− −
Probability & Statistical Methods
Fitting an ANOVA model in R
## Create a link to the website AutoLink <- "http://www-bcf.usc.edu/˜gareth/ISL/Auto.csv"
## Use read.csv to get the data Auto <- read.csv(AutoLink, sep = ",", header = TRUE)
## Make origin a factor Auto["origin"] <- lapply(Auto["origin"], factor) # OR Auto$origin <- factor(Auto$origin)
## Fit the ANOVA model auto.aov <- aov(mpg˜origin, data = Auto)
## View the results summary(auto.aov)
SOURCE: The data are taken from ”An Introduction to Statistical Learning, with applications in R” (Springer, 2013)
Probability & Statistical Methods
I summary(auto.aov) shows us there is a significant difference in
the means across origins but doesn’t tell us what the
differences are
I Tukey Honest Significant Differences provides us with that
information
TukeyHSD(auto.aov, ordered = TRUE) plot(TukeyHSD(auto.aov, "origin"),
xlim = c(-1,12))
Probability & Statistical Methods
I χ2 test of independence
Probability & Statistical Methods
I A test of independence for two categorical variables
I Create a contingency table that has r rows and c columns
I H0 : The two categorical variables are independent
I H1 : The two categorical variables are dependent
I Compute the test statistic
2 X (fo − fe)2 χ =
fe
I Reject H0 if χ2 > χ2α, the critical value with (r − 1)(c − 1) df
I Non-parametric statistics: Wilcoxon Rank Sum Test
I A nonparametric procedure does not depend on the assumption of
normality for the two populations
Probability & Statistical Methods
I Use when the sample sizes are small and you cannot assume that
the data in each sample are from normally distributed populations
Calculating the χ2 statistic
fo = observed values in the cell fe = expected values
in a cell, calculated as follows:
row total × column total
fe = n
Probability & Statistical Methods
Golden Palm Palm Royale Palm Princess
Price 23 7 37
Location 39 13 8
Accommodation 13 5 13
Other 13 8 8
Enter the numerical data in a text file separated by commas, and
execute the following code chiexam <- read.csv("chidata.txt", header = FALSE) names(chiexam) <- c("Golden Palm","Palm
Royale","Palm Princess") rownames(chiexam) <- c("Price", "Location", "Accommodation", "Other")
chisq.test(chiexam)
> chisq.test(chiexam) Pearson’s Chi-squared
test
data: chiexam X-squared = 27.41, df = 6, p-value = 0.000121
Probability & Statistical Methods
Annual Storm Count
Wind Speed (km/h) 1 2 3 or more
119 ≤ Wi < 154 13 12 28
154 ≤ Wi < 179 2 6 12
179 ≤ Wi 12 6 17
Conduct a χ2-test to determine if wind speed and annual storm
count are independent
Source: Parisi and Lund (2000)
I The Wilcoxon rank sum test tests whether there is a difference in
the medians from two samples
Probability & Statistical Methods
I If the two sample sizes are unequal, n1 represents the smaller
sample and n2 the larger sample
I The Wilcoxon rank sum test statistic, T1, is defined as the sum of
the ranks assigned to the n1 values in the smaller sample
I If T2 is the sum of the ranks assigned to the n2 items in the second
sample, then
n(n + 1) T1
+ T2 =
2
which follows from the well known result from mathematical
induction
I The prior point checks for the accuracy in assigning the ranks
Probability & Statistical Methods
When the sample sizes are large the test statistic T1 is
approximately normally distributed with
n1(n + 1)
µT1 = 2
and
The test statistics is
T1 − n 1(n+1)
Z
Probability & Statistical Methods
which approximately follows a standard normal distribution I Steps
for performing the Wilcoxon rank sum test:
I Replace the values in the two samples of size n1 and n2 with their
combined ranks (unless the data are the ranks) I Define n = n1 + n2
as the total sample size
I Assign the ranks so that rank 1 is given to the smallest of the n
combined values, rank 2 is given to the second smallest, and so
on
I For ties, assign each value the average of the ranks that would have
been assigned had there been no ties
I Compute T1 and Z
I Compare Z to the critical value and accept or reject H0 as appropriate Table 8.1: Time to second heart attack for smokers and nonsmokers
Smoker NonSmoker
Probability & Statistical Methods
1.09 1.96
3.50 0.07 3.69 0.01 0.63 10.21 0.35 0.10 3.34 0.25 1.93 0.41 1.57 3.26 1.79 3.57 0.86 0.24 0.03 0.51 0.19 0.33 0.98 7.02 0.26 5.85 1.01 1.83
Probability & Statistical Methods
When n1 and n2 are both ≤ 10 you use the table of lower and upper
critical values
Modeling Methods
I Match the problem to an analytical method
I Evaluate the model’s quality – quantifying the performance of a
model using a number of different measures
I Validate the model’s soundness – checking that the model
will work in the real world as
well as it did during
training
Modeling Methods
Modeling Methods
I Most problems in data science fall into one of two
categories: supervised or unsupervised I Supervised
learning methods include:
I linear regression
I logistic regression
I generalized additive models (GAM) I
support vector machines (SVM)
I Unsupervised learning methods include:
I K-means clustering
Modeling Methods
I A priori algorithm for association rules I
Nearest neighbor
I Use supervised methods when there is a known outcome
I Classification – observations fall into two or distinct groups
I Scoring – predicting the value of some measure based on other
variables
I Use unsupervised methods when there are no known outcomes
but you’re looking for patterns and relationships in your data
I Segmentation into an unknown number of groups (not pre-
determined)
I Make associations based on similar qualities or activities
Modeling Methods I
As a data scientist you must be able to map your problem to
the most appropriate method
I Your intended uses of the model influence the methods you
should use
I If you want to understand the relationship between input
variables and outcome then a regression method would be a
good choice
I If you want to know what single variable influences a
categorization, then try a decision tree
I Classification is an example of supervised learning: in order to
learn how to classify objects, you need a dataset of objects
that have already been classified
Modeling Methods I
I Cluster analysis and Association rules are examples of
unsupervised learning: look for patterns in the data, not
predict an outcome
Classification Using Nearest Neighbors predict something
about a data point p (like a customer’s future purchases)
based on data that are most similar to p
I Classification Using Naive Bayes use data about prior events to
estimate the probability of future events (weather forecasting)
I Classification Using Decision Trees and Rules divide data into
smaller and smaller portions to identify patterns that can be
used for prediction
I Regression Methods used for forecasting numeric data and
quantifying the size and strength of a relationship between
Modeling Methods I
the dependent and independent variables logistic regression
can be used to model a binary
dependent variable, and Poisson regression for count data
Neural Networks model the relationship between input signals
and an output signal using a model derived from our
understanding of how a brain responds to stimuli from sensory
inputs
I Support Vector Machines find a surface that makes a boundary
between various points of data in multidimensional space
according to their feature values
I Association Rules specify patterns of relationships among items
such as
{peanut butter, jelly} ⇒ {bread}
Modeling Methods I
if peanut butter and jelly are purchased, then bread is also
likely to be purchased
I k-means Clustering is an unsupervised method that
automatically divides the data into clusters, or group of similar
items
Modeling Methods
Model Selection and Evaluation
k
Model Selection and Evaluation
I After building a model we need to if the model even works on the
data it was trained from
I We use a set of quantitative measures to assess model
performance
I The measures we use vary by class of model I To help us
decide if our scores are “good enough” we compare our
model to several ideal models
I A null model gives us the low end of performance
I A Bayes rate model gives us high end of performance
I A single-variable model tells us what a simple model can achieve
I Null Model
I A null model is the best simple model you’re trying to outperform
Model Selection and Evaluation
I We use null models to lower-bound desired performance I We
usually compare to a best null model
I Bayes Model
I A Bayes rate model is a best possible model given the data I The
Bayes rate model is the perfect model
I The Bayes rate model is an upper bound on a model evaluation score
I Single-variable Model
I Compare your model against the best single-variable model
I A complicated model that doesn’t outperform the best single-variable
model can’t be justified
Evaluating classification models
I The most common measure of classifier quality is accuracy
Model Selection and Evaluation
I The confusion matrix is a powerful tool for measuring classifier
performance
I The confusion matrix is a table counting the number of known
outcomes versus each prediction type
Predicted
Actual Default Non-default
Default 264 14
Non-default 22 158
Evaluating classification models
Predicted
Actual TRUE FALSE
TRUE True Positive (TP) False Negative (FN)
Model Selection and Evaluation
FALSE False Positive (FP) True Negative (TN)
Evaluating classification models
Accuracy number of items categorized correctly divided by the total
number of items
TP + TN
Accuracy =
TP + TN + FP + FN
Evaluating classification models
Model Selection and Evaluation
Precision fraction of the items the classifier flags as being in the class
actually are in the class
TP
Evaluating classification models
Recall fraction of the things that are in the class are
detected by the classifier; what fraction of class
members are identified as positive
TP
= +
Model Selection and Evaluation
Sensitivity Same as Recall; the true positive rate
Specificity True negative rate – what fraction of class members are
identified as negative
TN
Specificity =
TN + FP
Evaluating scoring models
I Main measure is residuals – the difference between the fitted
value and the actual value for Y
= +
Model Selection and Evaluation
I R2 tells us how much of the variability in the dependent variable is
explained by our model
I The significance of each variable is given the the corresponding p-
values
I The overall model fit is measured by the F-statistic and p-value
Evaluating scoring models
# Make up some data d <-
data.frame(y=(1:10)ˆ2,x=1:10,z=1:10 +rpois(10,lambda =
2))
# fit a regression model using the lm() function model <- lm(y˜x+z,data=d) summary(model)
Call:
Model Selection and Evaluation
lm(formula = y ˜ x + z, data = d)
Residuals: Min 1Q Median 3Q Max -10.37 -4.66 0.51 2.19 12.48
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.23 5.87 -2.94 0.0218 * x 14.99 2.61 5.75 0.0007 *** z -3.76 2.33 -1.61 ---
0.1509
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.42 on 7 degrees of freedom Multiple R-squared: 0.963,Adjusted R-squared: 0.953 F-statistic: 92.1 on 2 and 7 DF, p-value: 9.41e-06
Linear Regression Models
I Recall that the linear regression model has the form
Yˆi = βˆ0 + βˆ1X1i + βˆ2X2i + ··· + βˆpXpi i = 1,2...n
I A model with only one independent variable is a simple linear
regression model; with more than one it’s multiple regression
I This is the ordinary least squares (OLS) model and we find the βˆ0s
by minimizing the sum of squared residuals (or errors)
n n n
i=1 i=1 i=1
I Linear regression is the cornerstone prediction method
Linear Regression Models
I Models the expected value of a dependent or response variable
given a set of independent of explanatory variables
I Fitting a regression model in R
Linear Regression Models
Linear Regression Models
1. Using the database women in the base R installation fit a
regression of weight on height (don’t forget to save your
model)
Linear Regression Models
2. Display a summary of your model, and interpret the results
3. Display the fitted values
4. Display the residuals
5. Plot weight (y-axis) and height (x-axis)
A polynomial regression is a regression model where we use use
powers of the explanatory variables
Refit the the regression as a quadratic model and interpret the
results as before
fit2 <- lm(weight ˜ height + I(heightˆ2), data=women)
Note this is still a linear model even though we have a quadratic term
– it’s a linear combination of the βˆ0s
An example of a nonlinear model is
Linear Regression Models
Yˆ = βˆ0 + βˆ1eX/βˆ2
You can fit nonlinear models with nls()
The car library has a scatterplot() function that can create an
informative plot
library(car) scatterplot(weight ˜ height, data=women, spread=FALSE,
smoother.args=list(lty=2), pch=19, main="Women Age 30-39",
xlab="Height (inches)", ylab="Weight (lbs.)")
Linear Regression Models
Linear Regression Models
I What we have been discussing is simple linear regression I When
there are more than one independent variable we have multiple
linear regression
I When working with several independent variables we need to
check for correlation among them
I This gives insight into possible interactions and multicollinearity
I We’ll us the built-in data set state.x77
I Since state.x77 is a matrix we need to convert it to a data.frame
Linear Regression Models
states <as.data.frame(state.x77[,c("Murder", "Population",
"Illiteracy", "Income",
"Frost")])
I Use cor() to get a correlation matrix cor(states)
Murder Population Illiteracy Income Frost
I cor() produces a correlation matrix but doesn’t tell us anything
about the statistical significance of the estimates
Murder 1.00 0.34 0.70 -0.23 -0.54
Population 0.34 1.00 0.11 0.21 -0.33
Illiteracy 0.70 0.11 1.00 -0.44 -0.67
Income -0.23 0.21 -0.44 1.00 0.23
Frost -0.54 -0.33 -0.67 0.23 1.00
Linear Regression Models
I cor.test() calculates the correlation and tests for significance but
only works on a pair at a time
I Years ago I created a function called Mat.cor.test that does what
cor.test does and works with a matrix
## matrix cor.test Mat.cor.test <- function(obj, alt = "two.sided", meth = "pearson") { n <- dim(obj)[2] est <- matrix(0, nrow = n, ncol = n, dimnames = list(names(obj),
names(obj))) pval <- matrix(0, nrow = n, ncol = n, dimnames = list(names(obj),
names(obj))) for(i in 1:(n - 1)) { for(j in (i + 1):n) { temp.cor <- cor.test(obj[, i], obj[, j],
alternative = alt, method = meth) est[i, j] <- temp.cor$estimate pval[i, j] <-
temp.cor$p.value est[j, i] <- temp.cor$estimate pval[j, i] <- temp.cor$p.value } } diag(est) = 1 list(estimates = round(est, 2), p.values = zapsmall(round(pval, 2))) } Mat.cor.test(states) $estimates Murder Population Illiteracy Income Frost
Linear Regression Models
Murder 1.00 0.34 0.70 -0.23 -0.54 Population 0.34 1.00 0.11 0.21 -0.33 Illiteracy 0.70 0.11 1.00 -0.44 -0.67 Income -0.23 0.21 -0.44 1.00 0.23 Frost -0.54 -0.33 -0.67 0.23 1.00 $p.values Murder Population Illiteracy Income Frost Murder 0.00 0.01 0.00 0.11 0.00 Population 0.01 0.00 0.46 0.15 0.02 Illiteracy 0.00 0.46 0.00 0.00 0.00 Income 0.11 0.15 0.00 0.00 0.11 Frost 0.00 0.02 0.00 0.11 0.00
I As you recall from our discussions of EDA, it is important to
visualize the data
I Let’s turn to the car library again
Linear Regression Models
library(car) scatterplotMatrix(states, spread=FALSE,
smoother.args=list(lty=2), main="Scatter Plot Matrix")
Linear Regression Models
Linear Regression Models
I Fit a multiple regression using the states data frame, with
Murder as the dependent variable...
I fit <- lm(Murder ∼ Population + Illiteracy
+
Income + Frost, data=states)
I summary(fit) I
anova(fit)
I This model shows use the main effects of each of the variables on
the murder rate
I We should look at the interaction between variables when we fit
regression models
I Let’s look at another model
Linear Regression Models
fit <- lm(mpg hp + wt + hp:wt, data=mtcars)
I When we fit a model with interactions we can visualize the
interaction terms to learn more about the relationship
I We can use the effects package to help out
library(effects) plot(effect("hp:wt", fit,,
list(wt=c(2.2,3.2,4.2))), multiline=TRUE)
Linear Regression Models
After fitting a regression model it’s important to look at diagnostics
to check the assumptions
Linear Regression Models
1. Normality: the residuals should be normally distributed with a
mean of 0 and variance σ2
2. Independence: the dependent variable values are independent
3. Linearity: the dependent variable is linearly related to the
independent variables
4. Homoscedasticity: the variance of the residuals is constant and
does not vary with the dependent variable
We also look for outliers, high-leverage observations, and
influential observations (Cook’s distance)
Let’s return to our earlier model of regessing weight on height
fit <- lm(weight ∼ height, data=women)
par(mfrow=c(2,2)) plot(fit)
Linear Regression Models
par(mfrow=c(1,1)) ## return parameters to normal
Linear Regression Models
Look again at the states model
fit <- lm(Murder Population + Illiteracy + Income + Frost,
data=states) par(mfrow=c(2,2)) plot(fit) par(mfrow=c(1,1))
Linear Regression Models
Linear Regression Models
The car package provides several additional functions that enhance
model evaluation
Linear Regression Models
I Multicollinearity is when two or more independent variables are
highly correlated I Results in large confidence intervals for the
parameter estimates (uncertainty)
I Makes interpretation of of the parameters difficult
I May show variables are not significant when they are I
May switch the sign of some parameters I Variance Inflation
Factor (VIF) is a measure to detect multicollinearity – vif() in
the car library
The math behind the R output
I Standard error of the estimates is the square root of the variance
Linear Regression Models
SE
where σ2 is the variance of the residuals
2 RSS σˆ =
n − p
− 1
recall n
RSS = X(Yi − Yˆi)2 = Xε2
i=1
SE
Linear Regression Models
when there’s only one independent variable, and in general
SE(β)2 = σ2(X0X)−1
or
q
SE(β) = σ2(X0X)−1
where X is the model matrix
Linear Regression Models
I For regression models
H0 : βi = 0 I The t-statistic is
or
βˆi − βi
t =
SE(βi)
βˆi
t =
SE(βi)
Linear Regression Models
I Total sum of squares
SSTO = X(Yi − Y)2
I Regression sum of squares
SSR = X(Yˆi − Y)2
I Error sum of squares
Linear Regression Models
SSE = X(Yi − Yˆ)2 = RSS
I The residual standard error is
RSE
I Regression Mean Square
SSR
MSR =
p
where p is the number of independent variables in the
regression model
Linear Regression Models
I Error Mean Square
SSE
MSE =
n − p − 1
I F-statistic
MSR F
=
MSE
I R2 is the proportion of variance explained by the model
I R2 always takes a value between 0 and 1
I R2 is independent of the scale of the dependent variable Y
Linear Regression Models
R2 = SSTO − SSE = SSR = 1 − SSE
SSTO SSTO SSTO
and adjusted R2
R2adj R2)
After fitting a regression model and interpreting the output, it
may be necessary to make revisions I Delete observations
I Transform variables
I Add or delete variables
I Try another regression modeling approach
Finding the “best” model
Linear Regression Models
I Comparing models using anova – based on the F for a reduced
model vs. a full model
> anova(fit.02b, fit.02) Analysis of Variance Table
Model 1: Murder ˜ Population + Illiteracy Model 2: Murder ˜ Population + Illiteracy + Income + Frost
Res.Df RSS Df Sum of Sq F Pr(>F) 1 47 289.2 2 45 289.2 2 0.07851 0.006 0.994
I and with AIC
> AIC(fit.02b, fit.02) df AIC
fit.02b 4 237.657 fit.02 6 241.643 I Variable selection using step-wise regression, and all subsets
regression
Linear Regression Models
I Assessing how well a model works in the real world
I Cross-validation
I A portion of the data is selected as the training sample, and a
portion is selected as the hold-out sample
I A regression equation is developed on the training sample and then
applied to the hold-out sample
I The performance on this sample is a more accurate estimate of the
operating characteristics of the model with new data
I Relative Importance
I Which variables are most important in predicting the outcome?
I Standardize your data before fitting the regression
I The coefficients measure how many standard deviations your
dependent variable changes with a 1 standard deviation of your
independent variable
Linear Regression Models
I Puts the effects of each variable on the same footing, and
comparable
Logistic Regression Models
I In our discussion of OLS we noted that the assumption of
normality must hold for the dependent variable Y
I Generalized linear models extend the linear-model framework to
include dependent variables that are non-normal
I In R we use the glm() function to fit generalized linear models
I Two popular models in this framework are logistic regression
(dependent variable is categorical) and Poisson regression
(dependent variable is a count variable)
I Logistic regression is a good first choice for binary classification
problems
I Logistic regression can directly predict values that are restricted
to the (0,1) interval, like probabilities
Logistic Regression Models
I Logistic regression assumes that is a linear function of
the X0s
I In the linear regression framework we model the expected value
of Y given X1,X2,...,Xp
p
µY = β0 + XβiXi i=1
I In the generalized linear model framework we model a function
of the expected value of Y
p
g(µY) = β0 + XβiXi
Logistic Regression Models
i=1
where g(µY) is called a link function
I The format for the glm() is as follows:
glm(formula, family=family(link=function), data=)
where family is the probability distribution, and function is the
link function
Logistic Regression Models
I Logistic regression applies to situations in which the response
variable is dichotomous (0 or 1)
I The model assumes that Y follows a binomial distribution
I It takes the form
Logistic Regression Models
iXi i=1
where π = µY is the conditional mean of Y (that is, the
probability that Y = 1 given a set of X values), is the odds
that Y = 1, and is the log odds, or logit
I is the link function
I The probability distribution is binomial
I Suppose we have a data frame called mydata with dependent
variable Y and independent variables X1,X2, and X3
I We can fit our logistic regression model as follows:
Logistic Regression Models
glm(Y∼X1+X2+X3, family=binomial(link="logit"), data=mydata)
Let’s consider an example..
install.packages("AER") data(Affairs, package =
"AER") summary(Affairs) table(Affairs$affairs)
Looking at the table we see Affairs$affairs contains count data
– we can convert count data to a binary variable for our logistic
regression
## create a dichotomous variable from the counts
Affairs$ynaffair[Affairs$affairs > 0] <- 1
Affairs$ynaffair[Affairs$affairs == 0] <- 0 Affairs$ynaffair <-
factor(Affairs$ynaffair, levels=c(0,1), labels=c("No","Yes"))
table(Affairs$ynaffair)
Logistic Regression Models
Now our dependent variable is in a form we can use for logistic
regression
Fit a logistic regression model using all the variables
fit.full <- glm(ynaffair ˜ gender + age
+ yearsmarried + children
+ religiousness + education + occupation
+rating, data=Affairs, family=binomial())
summary(fit.full) What do you observe?
Fit a reduced logistic regression model using only the significant
variables
fit.reduced <- glm(ynaffair ˜ age
Logistic Regression Models
+ yearsmarried + religiousness + rating,
data=Affairs, family=binomial())
summary(fit.reduced)
What do you observe?
Use anova() to compare nested models anova(fit.reduced, fit.full,
test="Chisq") Let’s try to interpret the model
coef(fit.reduced)
(Intercept) age yearsmarried religiousness rating
1.931 -0.035 0.101 -0.329 -0.461
Since the logistic output is log odds the coefficients are difficult to
interpret
Logistic Regression Models
Let’s look at the odds instead...
exp(coef(fit.reduced))
(Intercept) age yearsmarried religiousness rating
6.895 0.965 1.106 0.720 0.630
Here’s what we observe
exp(coef(fit.reduced))
(Intercept) age yearsmarried religiousness rating
6.895 0.965 1.106 0.720 0.630
I For each year increase in yearsmarried the odds of having an
affair increase by a factor of 1.106
I The odds decrease for a unit increase in each of age,
religiousness, and rating
Logistic Regression Models
I Since none of the variables takes on a value of 0 the intercept has
no interpretation
For the effect of an n unit change we use (eβˆj)n
What is often most useful is the probability of the “outcome” so we
transform the output to a probability
eβˆ0+PβˆiXi
Pr(Y = 1|X) = ˆ
+PβˆiXi 1 + eβ0
in R we just use the predict() function
Let’s use our fitted model to test the effect of rating on the
probability of having an affair
Logistic Regression Models
testdata <- data.frame(rating=c(1, 2, 3, 4, 5), age=mean(Affairs$age),
yearsmarried=mean(Affairs$yearsmarried),
religiousness=mean(Affairs$religiousness))
testdata$prob <- predict(fit.reduced,
newdata=testdata, type="response")
testdata
Now let’s use our fitted model to test the effect of age on the
probability of having an affair
testdata <- data.frame(rating=mean(Affairs$rating), age=seq(17, 57, 10),
yearsmarried=mean(Affairs$yearsmarried),
religiousness=mean(Affairs$religiousness))
testdata$prob <- predict(fit.reduced,
newdata=testdata, type="response")
Logistic Regression Models
testdata
I Recall that in the logistic model Y is assumed to have a binomial
distribution
I We validate this assumption by testing for overdispersion I
Overdispersion occurs when the observed variance of the
response variable is larger than what would be expected from a
binomial distribution; this can lead to inaccurate tests of
significance
I One way to detect overdispersion is to compare the residual
deviance to the residual degrees of freedom in the model
Residual Deviance
φ =
Resdiual df
if φ >> 1 there is evidence of overdispersion
Logistic Regression Models
summary(fit.reduced)$deviance/df.residual(fit.reduced
I Or fit the same model but use family=quasibinomial() instead of
family=binomial() fit.od <- glm(ynaffair ˜ age + yearsmarried + religiousness + rating, family =
quasibinomial(), data = Affairs)
pchisq(summary(fit.od)$dispersion * fit.full$df.residual, fit.full$df.residual, lower = F) I While using the results of a logistic regression model to estimate
the probability or likelihood of the event there are other useful
interpretations
I We saw earlier that the (conditional) probability is the odds ratio
divided by one plus the odds ratio
I Relative risk is the ratio of the conditional probability of one
group to another
Logistic Regression Models
I Relative risk allows us to compare the risk between two groups
I Using our earlier result we see the probability of a 17 year old
having an extra-marital affair is 0.335; for a 57 year old it is
0.109
I
RR = 0.335/0.109 ≈ 3.07
a 17 year old is about 3 times more likely to have an extra-
marital affair than a 57 year old
I Probabilities estimated from logistic regression models are point
estimates πˆ
I We can compute a confidence interval for the probability
I First we need the standard error of the probability estimate
Logistic Regression Models
SE(πˆ) = πˆ(1 − πˆ)SE(logit)
I A 95% confidence interval is
πˆ ± 1.96 × SE(πˆ)
Extensions of Logistic Regression
I Robust logistic regression –The glmRob() function in the robust
package can be used to fit a robust logistic regression model
helpful when fitting logistic regression models to data
containing outliers and influential observations
I Multinomial logistic regression –When the response variable has
more than two unordered categories (for example,
Logistic Regression Models
arried/widowed/divorced), you can fit a polytomous logistic
regression using the mlogit() function in the mlogit package
I Ordinal logistic regression – When the response variable is a set of
ordered categories (for example, credit risk as
poor/good/excellent), you can fit an ordinal logistic regression
using the lrm() function in the rms package
Logistic Regression Models – Summary
I Logistic regression is a generalized linear model used for
modeling binary, multi-level, and ordered dependent variables
I The transformation for a logistic regression converts the binary
outcomes to the log of the odds ratio
I We can transform the predicted values into a probability I
Logistic regression is good first technique for modeling
probabilities
I When we set a threshold probability value we can use a logistic
regression model as a classifier
I If the Predicted proability exceed the threshold we classify the
observation as true, otherwise it is false
I We use the glm() function in R to fit a logistic regression
Poisson Regression Models
I In the example using the Affairs data frame we converted the
number of affairs to a binary variable I If we wanted to predict the
number of affairs rather than the probability of having an affair we
would fit a Poisson regression
data(Affairs, package = "AER") pois.fit <- glm(affairs ˜
gender+age+yearsmarrie children+religiousness+education+
occupation+rating, data=Affairs, family = poisson()) summary(pois.fit)
The Poisson regression has the form
p
log(Y) = XβiXi i=0
Poisson Regression Models
I We update the model to remove the insignificant variables
pois.fit2 <- update(pois.fit, .˜.-gender-childre
-education) summary(pois.fit2)
I Let’s use our Poisson model to predict the number of affairs
newaffairs <- data.frame(age=c(22,32,42),
yearsmarried=mean(Affairs$yearsmarried),
religiousness=mean(Affairs$religiousness), occupation=mean(Affairs$occupation),
rating=mean(Affairs$rating)) predict(pois.fit2, newdata = newaffairs,
type="response")
Poisson Regression Models – Summary
I Poisson regression is useful when the dependent variable
represents count data
I The transformed dependent variable is log(Y)
I When predicting from a Poisson model remember to use
type=‘‘response’’ or exponentiate the raw output
Unsupervised Methods
I Unlike supervised learning, in unsupervised learning the data do
not contain a “right” classification
I This makes unsupervised learning more challenging I Often
unsupervised learning is used as part of exploratory data analysis
I And validating results is difficult because we do not know the
true answer
Unsupervised Methods
I Unsupervised learning techniques are growing in importance,
especially in certain fields of study
I Cancer researchers may look for subgroups among breast cancer
samples to get a better understanding of the disease
I Online shopping sites may try to identify groups of shoppers with
similar histories and show items in which the shopper is likely to
have an interest
I Or one may study how the population moves around based on
sociioeconomic factors
Cluster Analysis
I Cluster analysis is a data-reduction technique used to uncover
subgroups within a dataset
I Roughly speaking, cluster is a group of observations that are
more similar to each other than they are to the observations
in the other groups
I The two most common clustering approaches are hierarchical
clustering and partitioning clustering
I In hierarchical clustering each observation is its own cluster to
start then we combine observations into larger clusters until
we reach a single cluster
I In partitioning approaches, you specify the number of clusters
k, then the observations are divided into the k groups
Cluster Analysis
I There are many algorithms within each of these approaches,
and we will explore a couple
I We will look at some data sets within the flexcust and rattle
packages
I We will make use of functions from the following packages:
cluster, NbClust, flexclust, fMultivar, ggplot2, and rattle
Steps in Cluster Analysis
1. Select the variables that you believe are important in
understanding differences among the groups
2. Scale the data – it is best to “standardize” your data so that
variables with larger ranges do not dominate your results
3. Check for outliers – either eliminate them if possible or use
an approach that is robust like partitioning around medoids
Cluster Analysis
4. Calculate the distances
5. Select a clustering algorithm
6. Obtain one or more solutions
7. Determine the number of clusters – try several different
numbers and find the “best” one
Steps in Cluster Analysis....cont’d
8. Find the final clustering solution based on 6 and 7 above
9. Visualize your results – this will help you understand the
solution and its usefulness
10. Interpret the clusters – e.g., what do the observations in
each cluster have in common?
11. Validate the results – Do these groupings make sense, are
they real?
Cluster Analysis
I A critical step in every cluster analysis is the calculation of the
distance, or dissimilarity between each observation, or the
complement, proximity
I The most common measure of distance is Euclidean distance
I The Euclidean distance between two observations is given by
vu p
dij
where i and j are observations and p is the number of
variables
The nutrients dataset in the flexclust package includes
measurements on 27 types of foods
Cluster Analysis
> data(nutrient, package="flexclust")
> head(nutrient, 4) energy protein fat
calcium iron
BEEF BRAISED 340 20 28 9 2.6
HAMBURGER 245 21 17 9 2.7
BEEF ROAST 420 15 39 7 2.0
BEEF STEAK 375 19 32 9 2.6
The Euclidean distance between beef braised and hamburger is
q
d = (340 − 245)2 + (20 − 21)2 + (28 − 17)2 + (9 − 9)2 + (26 − 27)2 or d =
95.64
Or we can use the dist() function in R...
Cluster Analysis
> d <- dist(nutrient, method = "euclidean") # default > as.matrix(d)[1:4,1:4] BEEF BRAISED HAMBURGER BEEF ROAST BEEF STEAK BEEF BRAISED 0.0 95.6 80.9 35.2 HAMBURGER 95.6 0.0 176.5 130.9 BEEF ROAST 80.9 176.5 0.0 45.8 BEEF STEAK 35.2 130.9 45.8 0.0 which gives us the same result of course
Other distance and similarity measures include
I Hamming distance: for categorical variables, counts the number
of mismatches
I Manhattan (or city block) distance
p
dij = X|xik − xjk| k=i
Cluster Analysis
I Cosine similarity commonly used in text analysis
xikxjk similarity
and distance = cos−1(similarity)/π
Hierarchical Cluster Analysis
1. Define each observation (row, case) as a cluster
2. Calculate the distances between every cluster and every
other cluster
3. Combine the two clusters that have the smallest distance this
reduces the number of clusters by one Repeat steps 2 and 3
until all clusters are merged into into a single cluster Let’s
look at R
Cluster Analysis
I The previous result helps us understand the similarities among
food groups based on their nutrients
Cluster Analysis
I If we want to classify these food into a number of smaller
groups we need to do more analysis
I The NbClust package will help us out –back to R Partitioning
Cluster Analysis
I Partitioning divides the observations into k groups I The
groups are shuffled to make up the most cohesive clusters
possible according to the criterion
I Two popular approaches are k-means and partitioning around
medoids (PAM)
k-means Clustering
1. Select k centroids (rows) chosen at random
2. Assign each data point to its closest centroid
Cluster Analysis
3. Recalculate all the centroids by averaging all the data points
in each cluster
4. Assign the data points to clusters based on the new centroids
5. Continue 3 and 4 until the observations don’t reassign
anymore
6. Let’s go to R for another example
Partitioning Around Medoids
1. Randomly select K observations (call each a medoid)
2. Calculate the distance/dissimilarity of every observation to
each medoid
3. Assign each observation to its closest medoid
4. Calculate the sum of the distances of each observation from
its medoid (total cost)
Cluster Analysis
5. Select a point that isn’t a medoid, and swap it with its
medoid
6. Reassign every point to its closest medoid
7. Calculate the total cost
8. If this total cost is smaller, keep the new point as a medoid
9. Repeat steps 5-8 until the medoids don’t change
10. Let’s look at an example in R
Cluster Analysis – Summary
I There are many methods for cluster analysis
I Cluster analysis helps us discover subgroups within our data
I Common methods we covered in R are hierarchical cluster
analysis, and partitioning methods that include k-means and
partitioning around medoids
Classification Models
I Machine Learning Methods
I Demonstrate classification models I
Logistic Regression
I Decision Trees
I Random Forests
I Support Vector Machines I
Evaluating Models
I Data Mining with rattle()
Classification Models
I We develop classification models so we can predict future
observations
I Classification methods predict to which class an observation
belongs based on its features
I We start with EDA then partition the data
I For these examples we’ll create a training data set and a
validation set
I We build the model with the training data
I We evaluate how well the model does with the validation data
Classification Models - Setting up pkgs <- c("rpart", "rpart.plot", "parity", "randomForest", "e1071") install.packages(pkgs,
dependencies = TRUE)
loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/" ds <- "breast-cancer- wisconsin/breast-cancer-wisconsin.data" url <- paste(loc, ds, sep="")
breast <- read.table(url, sep=",", header=FALSE, na.strings="?") names(breast) <- c("ID",
"clumpThickness", "sizeUniformity", "shapeUniformity", "maginalAdhesion", "singleEpithelialCellSize", "bareNuclei", "blandChromatin", "normalNucleoli", "mitosis", "class")
df <- breast[-1] df$class <- factor(df$class, levels=c(2,4),
labels=c("benign", "malignant"))
set.seed(1234) train <- sample(nrow(df), 0.7*nrow(df))
df.train <- df[train,] df.validate <- df[-train,]
table(df.train$class) table(df.validate$class)
Classification Models - Logistic Regression
I In our earlier discussion of logistic regression we learned that
model predicts the logit, or log of the odds ratio I We can
exponentiate the logit to get the odds ratio I ...and we can ”undo”
the transformation to get probability of the event we are
modeling
I If we set a threshold for the predicted probability we can turn
the model into a classifier
Classification Models - Logistic Regression
fit.logit <- glm(class˜., data=df.train, family=binomial()) summary(fit.logit)
prob <- predict(fit.logit, df.validate, type="response") logit.pred <-
factor(prob > .5, levels=c(FALSE, TRUE), labels=c("benign", "malignant")) logit.perf
<- table(df.validate$class, logit.pred, dnn=c("Actual", "Predicted")) logit.perf
I We create a binary outcome from the probability by setting the
threshold at 0.5
I Any observation with a predicted probability greater than
0.5 is considered malignant (in this example)
I Lastly, we create a confusion matrix
Classification Models - Decision Trees
I Decision trees are popular in data mining I Starting at the top
(root) we follow a set of binary splits that can be used to classify
new observations
I Two types include classical trees and conditional trees I A
classical tree segregates the observations based on
homogeneity
I A conditional tree segregates based on significance test
Classification Models - Decision Trees
1. Choose a predictor variable that splits the data into two
groups and that maximizes homogeneity
2. Separate the data into the two groups and repeat the
process for of the two new groups
3. Continue 1 and 2 until no splits reduce the impurity
4. Classify an observation by going down the tree until you
reach terminal node library(rpart) set.seed(1234) dtree <- rpart(class ˜ ., data=df.train, method="class", parms=list(split="information")) dtree$cptable
Classification Models - Decision Trees
plotcp(dtree) dtree.pruned <- prune(dtree, cp=.0125)
library(rpart.plot) prp(dtree.pruned, type = 2, extra = 104, fallen.leaves = TRUE,
main="Decision Tree") dtree.pred <- predict(dtree.pruned, df.validate, type="class")
dtree.perf <- table(df.validate$class, dtree.pred, dnn=c("Actual", "Predicted")) dtree.perf
Classification Models - Random Forest
I A random forest is an ensemble learning approach I Many
predictive models are developed then aggregated
I If we have Nobservations in the training sample and M variables,
then the algorithm is as follows:
1. Grow a large number of decision trees by sampling N cases with
replacement from the training set 2. Sample m < M variables at each node keep m constant at each
node; these variables are considered candidates for splitting in
that node
Classification Models - Support Vector Machines
3. Grow each tree fully without pruning (the minimum node size is
set to 1) 4. Terminal nodes are assigned to a class based on the mode of
cases in that node 5. Classify new observations by sending them down all the trees
tracking the outcomes—majority rules
I Build random forests with randomForest() in the random-
Forest package
Classification Models - Random Forest
library(randomForest) set.seed(1234) fit.forest <- randomForest(class˜.,
data=df.train, na.action=na.roughfix, importance=TRUE)
fit.forest
forest.pred <- predict(fit.forest, df.validate) forest.perf <-
table(df.validate$class, forest.pred, dnn=c("Actual", "Predicted"))
forest.perf I A group of supervised machine-learning methods I Can
be used for classification and regression
Classification Models - Support Vector Machines
I We seek an optimal hyperplane for separating two classes in a
multidimensional space I The chosen hyperplane maximizes the
margin between the two classes’ closest points
I The points on the boundary of the margin are called
support vectors (they help define the margin) I The middle
of the margin is the separating hyperplane
I For an N-dimensional space the optimal hyperplane has N − 1
dimensions
I If there are two variables, the surface is a line I For
three variables, the surface is a plane
I For 10 variables, the surface is a 9-dimensional hyperplane
Classification Models - Support Vector Machines
Classification Models - Support Vector Machines
library(e1071) set.seed(1234) fit.svm <- svm(class˜.,
data=df.train) fit.svm
svm.pred <- predict(fit.svm, na.omit(df.validate)) svm.perf <-
table(na.omit(df.validate)$class, svm.pred, dnn=c("Actual", "Predicted"))
svm.perf
Classification Models - Choosing the Best Model
I We can use the measures we discussed previously:
accuracy, sensitivity, specificity to evaluate how well our
model works
Classification Models - Support Vector Machines
I Instead of calculating each separately we can borrow a
function defined in the book R in Action
Classification Models - Choosing the Best Model
performance <- function(table, n=2){ if(!all(dim(table) == c(2,2))) stop("Must be a
2 x 2 table") tn = table[1,1] fp = table[1,2] fn = table[2,1] tp = table[2,2] sensitivity
= tp/(tp+fn) specificity = tn/(tn+fp) ppp = tp/(tp+fp) npp = tn/(tn+fn) hitrate =
(tp+tn)/(tp+tn+fp+fn) result <- paste("Sensitivity = ", round(sensitivity, n), "\nSpecificity = ", round(specificity, n), "\nPositive Predictive Value = ", round(ppp, n), "\nNegative Predictive Value = ", round(npp, n), "\nAccuracy = ", round(hitrate, n),
"\n", sep="") cat(result) }
I Rattle – R Analytic Tool To Learn Easily
I GUI for data mining in R
Classification Models - Data Mining with Rattle
I Point-and-click access to supervised and unsupervised models
I Includes the ability to transform data and has data-visualization
tools
I install.packages("rattle")
I library(rattle)
rattle() Launches the rattle interface
See Williams (2011) for more on Rattle
loc <- "http://archive.ics.uci.edu/ml/machine-learning-database ds <- "pima-indians-
diabetes/pima-indians-diabetes.data" url <- paste(loc, ds, sep="") diabetes <- read.table(url,
sep=",", header=FALSE)
Classification Models - Data Mining with Rattle
names(diabetes) <- c("npregant", "plasma", "bp", "triceps", "insulin", "bmi", "pedigree", "age", "class") diabetes$class <-
factor(diabetes$class, levels=c(0,1), labels=c("normal", "diabetic")) library(rattle)
rattle()
cv <- matrix(c(145, 50, 8, 27), nrow=2) performance(as.table(cv))
Classification Models - Data Mining with Rattle
Classification Models – Summary
I We looked at several machine-learning methods for
classifying observations into one of two groups I The
methods vary from low complexity like logistic regression
and decision trees to high complexity like random forests
and support vector machines I Classification models apply to
many fields (beyond medicine): computer science, finance,
marketing, etc. I We looked at problems with two grouops
Time Series
but these methods extend to multigroup classification
problems
I Time series is a collection of observations Xt, each one being
recorded at time t
I Time could be discrete, t = 1,2,3,..., or continuous t > 0 I Time
series data occur very often in reality and thus it is important to
know how to deal with them
I Let’s look at some examples of time series data
Time Series
Time Series
Time Series
Time Series
I Time series show up in many fields including engineering,
science, sociology, and economics
I We analyze time series to draw inferences from them
I A time series is a sequence of random variables {Xt}
measured over time
I We often denote a time series as
Xt = mt + st + Yt
for an additive model and
(14.1)
Xt = mt ∗ st ∗ Yt (14.2)
for a multiplicative model
Time Series
I This is known as classical decomposition
I In equations (14.1) and (14.2) on the previous slide, mt is the
trend component, st is the seasonal component, and Yt is the
random or irregular component
I mt is a slowly varying function and is often estimated using least
squares
I That is, we fit a function for example
mt = a0 + a1t + a2t2 ··· + aptp
to the data {x1,...,xn} by finding parameters that minimize
n
Time Series
X − mt)2
(xt t=1
I For a linear trend we just have mt = a0 + a1t
Time Series – General Approach to Modeling
I Plot the time series and look for features do we have
1. a trend 2. a seasonal component 3. any sudden change in the series 4. any outliers
I Remove any trend and seasonal components to get
stationary residuals
I Choose a model to fit the residuals
Time Series
I Forecast based on the residuals and invert any transformations
to get the original series I An alternative approach is to express
the series in terms of
Fourier components (we won’t cover this)
A time series {Xt} is (loosely speaking) stationary if it has similar
statistical properties to those of the “time shifted” series {Xt+h} for
each integer h
Definition Let {Xt} be a time series with EXt2 < ∞. The mean function of {Xt} is
µX(t) = E(Xt).
Time Series
The covariance function of {Xt} is γX(r,s) = Cov(Xr,Xs) = E[(Xr −
µX(r))(Xs − µX(s))] for all integers r, s, and t.
Definition {Xt} is (weakly) stationary if
(i) µX(t) is independent of t, and
(ii) γX(t + h,t) is independent of t
for each h.
Time Series
Definition Let {Xt} be a stationary time series. The autocovariance
function (ACVF) of {Xt} is
γX(h) = Cov(Xt+h,Xt).
The autocorrelation function (ACF) of {Xt} is
γX(h) ρX(h) ≡
= Cor(Xt+h,Xt).
γX(0)
Some examples...
Time Series
I Creating time series
I Plotting to inform our analysis
I Seasonal decomposition to remove trend and seasonality
I Forecasting methods
1. Exponential modeling 2. Autoregressive integrated moving averages (ARIMA) models
Time Series
Time Series
Smoothing with Simple Moving Averages
Time Series
I Smoothing dampens fluctuations so we may see patterns in the
data I Simple moving averages is the simplest method to smooth
time series
I We replace each data point with the mean of that
observation and one or more points before and after
I This is a centered moving average
St = (Yt−q + ··· + Yt + ··· + Yt+q)/(2q + 1)
I R code
Time Series
I Time series data often have trend and seasonal
components
I We decompose the series into its trend, seasonal, and irregular
or random components
I After we decompose the series we can model the random
component
Exponential Smoothing Models
I Simple or single exponential models only a level with
irregular variation, no trend or seasonal
Time Series
I Double exponential aka Holt exponential model has level with
irregular variation, and a trend
I Triple exponential aka Holt-Winters has level with irregular
variation, and a trend and seasonal component
I R code
I ARIMA models forecast values as a linear function of one or more
recent values and recent errors
I Before modeling we need to understand key terms
I Lag – shift the time series back by one or more periods
Time Series
I Autocorrelation – measures the association between observations
in different time periods
I Partial autocorrelation – measures the association between two
observations, Yt and Yt−k after removing the effects of all the
observations in between
I Differencing – replace each value in the series with the difference
between successive values, removing any trend
I Stationarity – the statistical properties (mean, variance,
autocorrelations for any lag k) of the series do not change over
time the Augmented Dickey-Fuller (ADF) test is used to evaluate
stationarity
Time Series
Autoregressive Model of order p
AR(p) : Yt t
Moving Average Model of order q
MA(q) : Yt t
ARMA model of order (p,q)
ARMA t
Time Series
Typically we subtract the mean so µ = 0
Guidelines for model selection using ACF and PACF
I ARIMA(p,d,q) – Autoregressive Integrated Moving Average
model with AR order p, MA order q series differenced d times
I R code for an ARIMA example
Time Series Methods – Summary
I Time series data appear in every field of study
I Fundamental to time series analysis is to forecast future values
I There are many techniques for time series modeling and
forecasting including the two we looked at: exponential
smoothing and ARIMA(p,d,q) models
I Be aware that we are forecasting beyond the observed data
based on historical behavior and forecasts are prone to error if
things change
I This is why forecast errors increase the further out you go I This
is less of a concern when working with say stable natural
phenomena, or other phenomena where the historical behavior is
less volatile
References
Doane, D. P. and Seward, L. E. (2011).
Applied Statistics in Business and Economics. McGraw-
Hill/Irwin, third edition.
Kabacoff, R. I. (2015).
R in Action.
Manning, Shelter Island, NY, second edition.
Neter, J., Wasserman, W. and Kutner, M. H. (1983).
Applied LInear Regression Models.
Richard D. Irwin, Inc, Homewood, IL.
References (cont.)
Parisi, F. and Lund, R. (2000).
Seasonality and return periods of landfalling Atlantic basin
hurricanes.
Australian & New Zealand Journal of Statistics 42, 271–282.
Williams, G. (2011).
Data Mining with Rattle and R.
Springer, New York.