week 8-data science
Data Science and Big Data Analytics
Chap 7: Adv Analytical Theory and Methods: Classification
1
Chapter Sections
7.1 Decision Trees
7.2 Naïve Bayes
7.3 Diagnostics of Classifiers
7.4 Additional Classification Models
Summary
2
7 Classification
Classification is widely used for prediction
Most classification methods are supervised
This chapter focuses on two fundamental classification methods
Decision trees
Naïve Bayes
3
7.1 Decision Trees
Tree structure specifies sequence of decisions
Given input X={x1, x2,…, xn}, predict output Y
Input attributes/features can be categorical or continuous
Node = tests a particular input variable
Root node, internal nodes, leaf nodes return class labels
Depth of node = minimum steps to reach node
Branch (connects two nodes) = specifies decision
Two varieties of decision trees
Classification trees: categorical output, often binary
Regression trees: numeric output
4
7.1 Decision Trees 7.1.1 Overview of a Decision Tree
Example of a decision tree
Predicts whether customers will buy a product
5
7.1 Decision Trees 7.1.1 Overview of a Decision Tree
Example: will bank client subscribe to term deposit?
6
7.1 Decision Trees 7.1.2 The General Algorithm
Construct a tree T from training set S
Requires a measure of attribute information
Simplistic method (data from previous Fig.)
Purity = probability of corresponding class
E.g., P(no)=1789/2000=89.45%, P(yes)=10.55%
Entropy methods
Entropy measures the impurity of an attribute
Information gain measures purity of an attribute
7
7.1 Decision Trees 7.1.2 The General Algorithm
Entropy methods of attribute information
Hx = the entropy of X
Information gain of an attribute = base entropy – conditional entropy
8
7.1 Decision Trees 7.1.2 The General Algorithm
Construct a tree T from training set S
Choose root node = most informative attribute A
Partition S according to A’s values
Construct subtrees T1, T2… for the subsets of S recursively until one of following occurs
All leaf nodes satisfy minimum purity threshold
Tree cannot be further split with min purity threshold
Other stopping criterion satisfied – e.g., max depth
9
7.1 Decision Trees 7.1.3 Decision Tree Algorithms
ID3 Algorithm
T=training set, P=output variable, A=attribute
10
7.1 Decision Trees 7.1.3 Decision Tree Algorithms
C4.5 Algorithm
Handles missing data
Handles both categorical and sontinuous variables
Uses bottom-up pruning to address overfitting
CART (Classification And Regression Trees)
Also handles continuous variables
Uses Gini diversity index as info measure
11
7.1 Decision Trees 7.1.4 Evaluating a Decision Tree
Decision trees are greedy algorithms
Best option at each step, maybe not best overall
Addressed by ensemble methods: random forest
Model might overfit the data
Blue = training set
Red = test set
Overcome overfitting:
Stop growing tree early
Grow full tree, then prune
12
7.1 Decision Trees 7.1.4 Evaluating a Decision Tree
Decision trees -> rectangular decision regions
13
7.1 Decision Trees 7.1.4 Evaluating a Decision Tree
Advantages of decision trees
Computationally inexpensive
Outputs are easy to interpret – sequence of tests
Show importance of each input variable
Decision trees handle
Both numerical and categorical attributes
Categorical attributes with many distinct values
Variables with nonlinear effect on outcome
Variable interactions
14
7.1 Decision Trees 7.1.4 Evaluating a Decision Tree
Disadvantages of decision trees
Sensitive to small variations in the training data
Overfitting can occur because each split reduces training data for subsequent splits
Poor if dataset contains many irrelevant variables
15
7.1 Decision Trees 7.1.5 Decision Trees in R
# install packages rpart,rpart.plot
# put this code into Rstudio source and execute lines via Ctrl/Enter
library("rpart")
library("rpart.plot")
setwd("c:/data/rstudiofiles/")
banktrain <- read.table("bank-sample.csv",header=TRUE,sep=",")
## drop a few columns to simplify the tree
drops<-c("age", "balance", "day", "campaign", "pdays", "previous", "month")
banktrain <- banktrain [,!(names(banktrain) %in% drops)]
summary(banktrain)
# Make a simple decision tree by only keeping the categorical variables
fit <- rpart(subscribed ~ job + marital + education + default + housing + loan + contact + poutcome,method="class",data=banktrain,control=rpart.control(minsplit=1),
parms=list(split='information'))
summary(fit)
# Plot the tree
rpart.plot(fit, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=3)
16
7.2 Naïve Bayes
The naïve Bayes classifier
Based on Bayes’ theorem (or Bayes’ Law)
Assumes the features contribute independently
Features (variables) are generally categorical
Discretization of continuous variables is the process of converting continuous variables into categorical ones
Output is usually class label plus probability score
Log probability often used instead of probability
17
7.2 Naïve Bayes 7.2.1 Bayes Theorem
Bayes’ Theorem
where C = class, A = observed attributes
Used because doctor’s frequently get this wrong
18
7.2 Naïve Bayes 7.2.2 Naïve Bayes Classifier
Conditional independence assumption
And dropping common denominator, we get
Find cj that maximizes P(cj|A)
19
7.2 Naïve Bayes 7.2.2 Naïve Bayes Classifier
Example: client subscribes to term deposit?
The following record is from a bank client. Is this client likely to subscribe to the term deposit?
20
7.2 Naïve Bayes 7.2.2 Naïve Bayes Classifier
Compute probabilities for this record
21
7.2 Naïve Bayes 7.2.2 Naïve Bayes Classifier
Compute Naïve Bayes classifier outputs: yes/no
The client is assigned the label subscribed = yes
The scores are small, but the ratio is what counts
Using logarithms helps avoid numerical underflow
22
7.2 Naïve Bayes 7.2.3 Smoothing
A smoothing technique assigns a small nonzero probability to rare events that are missing in the training data
E.g., Laplace smoothing assumes every output occurs once more than occurs in the dataset
Smoothing is essential – without it, a zero conditional probability results in P(cj|A)=0
23
7.2 Naïve Bayes 7.2.4 Diagnostics
Naïve Bayes advantages
Handles missing values
Robust to irrelevant variables
Simple to implement
Computationally efficient
Handles high-dimensional data efficiently
Often competitive with other learning algorithms
Reasonably resistant to overfitting
Naïve Bayes disadvantages
Assumes variables are conditionally independent
Therefore, sensitive to double counting correlated variables
In its simplest form, used only for categorical variables
24
7.2 Naïve Bayes 7.2.5 Naïve Bayes in R
This section explores two methods of using the naïve Bayes Classifier
Manually compute probabilities from scratch
Tedious with many R calculations
Use naïve Bayes function from e1071 package
Much easier – starts on page 222
Example: subscribing to term deposit
25
7.2 Naïve Bayes 7.2.5 Naïve Bayes in R
Get data and e1071 package
> setwd("c:/data/rstudio/chapter07")
> sample<-read.table("sample1.csv",header=TRUE,sep=",")
> traindata<-as.data.frame(sample[1:14,])
> testdata<-as.data.frame(sample[15,])
> traindata #lists train data
> testdata #lists test data, no Enrolls variable
> install.packages("e1071", dep = TRUE)
> library(e1071) #contains naïve Bayes function
26
7.2 Naïve Bayes 7.2.5 Naïve Bayes in R
Perform modeling
> model<-naiveBayes(Enrolls~Age+Income+JobSatisfaction+Desire,traindata)
> model # generates model output
> results<-predict(model,testdata)
> Results # provides test prediction
Using a Laplace parameter gives same result
27
The book covered three classifiers
Logistic regression, decision trees, naïve Bayes
Tools to evaluate classifier performance
Confusion matrix
7.3 Diagnostics of Classifiers
28
Bank marketing example
Training set of 2000 records
Test set of 100 records, evaluated below
7.3 Diagnostics of Classifiers
29
Evaluation metrics
7.3 Diagnostics of Classifiers
30
Evaluation metrics on bank marketing 100 test set
poor
poor
7.3 Diagnostics of Classifiers
31
ROC curve: good for evaluating binary detection
Bank marketing: 2000 training set + 100 test set
> banktrain<-read.table("bank-sample.csv",header=TRUE,sep=",")
> drops<-c("balance","day","campaign","pdays","previous","month")
> banktrain<-banktrain[,!(names(banktrain) %in% drops)]
> banktest<-read.table("bank-sample-test.csv",header=TRUE,sep=",")
> banktest<-banktest[,!(names(banktest) %in% drops)]
> nb_model<-naiveBayes(subscribed~.,data=banktrain)
> nb_prediction<-predict(nb_model,banktest[,-ncol(banktest)],type='raw')
> score<-nb_prediction[,c("yes")]
> actual_class<-banktest$subscribed=='yes'
> pred<-prediction(score,actual_class) # code problem
7.3 Diagnostics of Classifiers
32
ROC curve: good for evaluating binary detection
Bank marketing: 2000 training set + 100 test set
7.3 Diagnostics of Classifiers
33
7.4 Additional Classification Methods
Ensemble methods that use multiple models
Bagging: bootstrap method that uses repeated sampling with replacement
Boosting: similar to bagging but iterative procedure
Random forest: uses ensemble of decision trees
These models usually have better performance than a single decision tree
Support Vector Machine (SVM)
Linear model using small number of support vectors
34
Summary
How to choose a suitable classifier among
Decision trees, naïve Bayes, & logistic regression
35
Midterm Exam – 10/28/15 6:10-9:00 – 2 hours, 50 minutes
30% - Clustering: k-means example
30% - Association Rules: store transactions
30% - Regression: simple linear example
10% - Ten multiple choice questions
Note: for each of the three main problems
Manually compute algorithm on small example
Complete short answer sub questions
36