week 8-data science

profilerav
Chapter_72.pptx

Data Science and Big Data Analytics

Chap 7: Adv Analytical Theory and Methods: Classification

1

Chapter Sections

7.1 Decision Trees

7.2 Naïve Bayes

7.3 Diagnostics of Classifiers

7.4 Additional Classification Models

Summary

2

7 Classification

Classification is widely used for prediction

Most classification methods are supervised

This chapter focuses on two fundamental classification methods

Decision trees

Naïve Bayes

3

7.1 Decision Trees

Tree structure specifies sequence of decisions

Given input X={x1, x2,…, xn}, predict output Y

Input attributes/features can be categorical or continuous

Node = tests a particular input variable

Root node, internal nodes, leaf nodes return class labels

Depth of node = minimum steps to reach node

Branch (connects two nodes) = specifies decision

Two varieties of decision trees

Classification trees: categorical output, often binary

Regression trees: numeric output

4

7.1 Decision Trees 7.1.1 Overview of a Decision Tree

Example of a decision tree

Predicts whether customers will buy a product

5

7.1 Decision Trees 7.1.1 Overview of a Decision Tree

Example: will bank client subscribe to term deposit?

6

7.1 Decision Trees 7.1.2 The General Algorithm

Construct a tree T from training set S

Requires a measure of attribute information

Simplistic method (data from previous Fig.)

Purity = probability of corresponding class

E.g., P(no)=1789/2000=89.45%, P(yes)=10.55%

Entropy methods

Entropy measures the impurity of an attribute

Information gain measures purity of an attribute

7

7.1 Decision Trees 7.1.2 The General Algorithm

Entropy methods of attribute information

Hx = the entropy of X

Information gain of an attribute = base entropy – conditional entropy

8

7.1 Decision Trees 7.1.2 The General Algorithm

Construct a tree T from training set S

Choose root node = most informative attribute A

Partition S according to A’s values

Construct subtrees T1, T2… for the subsets of S recursively until one of following occurs

All leaf nodes satisfy minimum purity threshold

Tree cannot be further split with min purity threshold

Other stopping criterion satisfied – e.g., max depth

9

7.1 Decision Trees 7.1.3 Decision Tree Algorithms

ID3 Algorithm

T=training set, P=output variable, A=attribute

10

7.1 Decision Trees 7.1.3 Decision Tree Algorithms

C4.5 Algorithm

Handles missing data

Handles both categorical and sontinuous variables

Uses bottom-up pruning to address overfitting

CART (Classification And Regression Trees)

Also handles continuous variables

Uses Gini diversity index as info measure

11

7.1 Decision Trees 7.1.4 Evaluating a Decision Tree

Decision trees are greedy algorithms

Best option at each step, maybe not best overall

Addressed by ensemble methods: random forest

Model might overfit the data

Blue = training set

Red = test set

Overcome overfitting:

Stop growing tree early

Grow full tree, then prune

12

7.1 Decision Trees 7.1.4 Evaluating a Decision Tree

Decision trees -> rectangular decision regions

13

7.1 Decision Trees 7.1.4 Evaluating a Decision Tree

Advantages of decision trees

Computationally inexpensive

Outputs are easy to interpret – sequence of tests

Show importance of each input variable

Decision trees handle

Both numerical and categorical attributes

Categorical attributes with many distinct values

Variables with nonlinear effect on outcome

Variable interactions

14

7.1 Decision Trees 7.1.4 Evaluating a Decision Tree

Disadvantages of decision trees

Sensitive to small variations in the training data

Overfitting can occur because each split reduces training data for subsequent splits

Poor if dataset contains many irrelevant variables

15

7.1 Decision Trees 7.1.5 Decision Trees in R

# install packages rpart,rpart.plot

# put this code into Rstudio source and execute lines via Ctrl/Enter

library("rpart")

library("rpart.plot")

setwd("c:/data/rstudiofiles/")

banktrain <- read.table("bank-sample.csv",header=TRUE,sep=",")

## drop a few columns to simplify the tree

drops<-c("age", "balance", "day", "campaign", "pdays", "previous", "month")

banktrain <- banktrain [,!(names(banktrain) %in% drops)]

summary(banktrain)

# Make a simple decision tree by only keeping the categorical variables

fit <- rpart(subscribed ~ job + marital + education + default + housing + loan + contact + poutcome,method="class",data=banktrain,control=rpart.control(minsplit=1),

parms=list(split='information'))

summary(fit)

# Plot the tree

rpart.plot(fit, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=3)

16

7.2 Naïve Bayes

The naïve Bayes classifier

Based on Bayes’ theorem (or Bayes’ Law)

Assumes the features contribute independently

Features (variables) are generally categorical

Discretization of continuous variables is the process of converting continuous variables into categorical ones

Output is usually class label plus probability score

Log probability often used instead of probability

17

7.2 Naïve Bayes 7.2.1 Bayes Theorem

Bayes’ Theorem

where C = class, A = observed attributes

Typical medical example

Used because doctor’s frequently get this wrong

18

7.2 Naïve Bayes 7.2.2 Naïve Bayes Classifier

Conditional independence assumption

And dropping common denominator, we get

Find cj that maximizes P(cj|A)

19

7.2 Naïve Bayes 7.2.2 Naïve Bayes Classifier

Example: client subscribes to term deposit?

The following record is from a bank client. Is this client likely to subscribe to the term deposit?

20

7.2 Naïve Bayes 7.2.2 Naïve Bayes Classifier

Compute probabilities for this record

21

7.2 Naïve Bayes 7.2.2 Naïve Bayes Classifier

Compute Naïve Bayes classifier outputs: yes/no

The client is assigned the label subscribed = yes

The scores are small, but the ratio is what counts

Using logarithms helps avoid numerical underflow

22

7.2 Naïve Bayes 7.2.3 Smoothing

A smoothing technique assigns a small nonzero probability to rare events that are missing in the training data

E.g., Laplace smoothing assumes every output occurs once more than occurs in the dataset

Smoothing is essential – without it, a zero conditional probability results in P(cj|A)=0

23

7.2 Naïve Bayes 7.2.4 Diagnostics

Naïve Bayes advantages

Handles missing values

Robust to irrelevant variables

Simple to implement

Computationally efficient

Handles high-dimensional data efficiently

Often competitive with other learning algorithms

Reasonably resistant to overfitting

Naïve Bayes disadvantages

Assumes variables are conditionally independent

Therefore, sensitive to double counting correlated variables

In its simplest form, used only for categorical variables

24

7.2 Naïve Bayes 7.2.5 Naïve Bayes in R

This section explores two methods of using the naïve Bayes Classifier

Manually compute probabilities from scratch

Tedious with many R calculations

Use naïve Bayes function from e1071 package

Much easier – starts on page 222

Example: subscribing to term deposit

25

7.2 Naïve Bayes 7.2.5 Naïve Bayes in R

Get data and e1071 package

> setwd("c:/data/rstudio/chapter07")

> sample<-read.table("sample1.csv",header=TRUE,sep=",")

> traindata<-as.data.frame(sample[1:14,])

> testdata<-as.data.frame(sample[15,])

> traindata #lists train data

> testdata #lists test data, no Enrolls variable

> install.packages("e1071", dep = TRUE)

> library(e1071) #contains naïve Bayes function

26

7.2 Naïve Bayes 7.2.5 Naïve Bayes in R

Perform modeling

> model<-naiveBayes(Enrolls~Age+Income+JobSatisfaction+Desire,traindata)

> model # generates model output

> results<-predict(model,testdata)

> Results # provides test prediction

Using a Laplace parameter gives same result

27

The book covered three classifiers

Logistic regression, decision trees, naïve Bayes

Tools to evaluate classifier performance

Confusion matrix

7.3 Diagnostics of Classifiers

28

Bank marketing example

Training set of 2000 records

Test set of 100 records, evaluated below

7.3 Diagnostics of Classifiers

29

Evaluation metrics

7.3 Diagnostics of Classifiers

30

Evaluation metrics on bank marketing 100 test set

poor

poor

7.3 Diagnostics of Classifiers

31

ROC curve: good for evaluating binary detection

Bank marketing: 2000 training set + 100 test set

> banktrain<-read.table("bank-sample.csv",header=TRUE,sep=",")

> drops<-c("balance","day","campaign","pdays","previous","month")

> banktrain<-banktrain[,!(names(banktrain) %in% drops)]

> banktest<-read.table("bank-sample-test.csv",header=TRUE,sep=",")

> banktest<-banktest[,!(names(banktest) %in% drops)]

> nb_model<-naiveBayes(subscribed~.,data=banktrain)

> nb_prediction<-predict(nb_model,banktest[,-ncol(banktest)],type='raw')

> score<-nb_prediction[,c("yes")]

> actual_class<-banktest$subscribed=='yes'

> pred<-prediction(score,actual_class) # code problem

7.3 Diagnostics of Classifiers

32

ROC curve: good for evaluating binary detection

Bank marketing: 2000 training set + 100 test set

7.3 Diagnostics of Classifiers

33

7.4 Additional Classification Methods

Ensemble methods that use multiple models

Bagging: bootstrap method that uses repeated sampling with replacement

Boosting: similar to bagging but iterative procedure

Random forest: uses ensemble of decision trees

These models usually have better performance than a single decision tree

Support Vector Machine (SVM)

Linear model using small number of support vectors

34

Summary

How to choose a suitable classifier among

Decision trees, naïve Bayes, & logistic regression

35

Midterm Exam – 10/28/15 6:10-9:00 – 2 hours, 50 minutes

30% - Clustering: k-means example

30% - Association Rules: store transactions

30% - Regression: simple linear example

10% - Ten multiple choice questions

Note: for each of the three main problems

Manually compute algorithm on small example

Complete short answer sub questions

36