Data Mining ANLY600
Week 5 Assignment Instructions and Sample R Code
Classification Tree Analysis
This assignment gives you hands-on experience using R to conduct a classification tree analysis on a real-world data set. Please refer to Chapter 9 of the reference textbook (through the link at the bottom under "Lessons") for details about how to generate classification tree models and evaluate model performance. Then open the website listed under Reference below, go over the mushrooms.csv example, and use the same R code to reproduce the results step by step. Study how the model is explained and how the results are evaluated. The example follows these steps:
Step 1: Install and load libraries
Step 2: Import the data set
Step 3: Data Cleaning
Step 4: Data Exploration and Analysis
Step 5: Data Splicing
Step 6: Building a model
Step 7: Visualising the tree
Step 8: Testing the model
Step 9: Calculating accuracy
Now open the file mushrooms2.csv (slightly different from the sample data set) and repeat the same analysis as on the website, conducting a classification tree analysis that follows the steps above. Please copy/paste screen images of your work in R into a Word document for submission. Be sure to provide a narrative of your answers (i.e., do not just copy/paste your answers without explaining what you did or what you found). Please include an Introduction, R code with outputs, and figures with explanations, along with cover and reference pages. A good conclusion to wrap up the assignment is also expected. Please follow APA format as well.
Reference
https://www.edureka.co/blog/decision-tree-algorithm/
#Installing libraries
install.packages('rpart')
install.packages('caret')
install.packages('rpart.plot')
install.packages('rattle')
#Loading libraries
library(rpart,quietly = TRUE)
library(caret,quietly = TRUE)
library(rpart.plot,quietly = TRUE)
library(rattle)
#Reading the data set as a dataframe
getwd() # shows which working directory you are in
# set the working directory to the folder that holds the data, for example your desktop
setwd("C:/Users/alpha/Desktop")
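# stringsAsFactors = TRUE reads the text columns as factors (the default became FALSE in R 4.0.0)
mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)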
# structure of the data
str(mushrooms)
# number of rows with missing values
nrow(mushrooms) - sum(complete.cases(mushrooms))
# deleting redundant variable `veil.type`
mushrooms$veil.type <- NULL
# analyzing the odor variable
table(mushrooms$class,mushrooms$odor)
# for each feature, count the zero cells in its class-by-level table
number.perfect.splits <- apply(X=mushrooms[-1], MARGIN = 2, FUN = function(col){
  t <- table(mushrooms$class,col)
  sum(t == 0)
})
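The counting step above can be illustrated on a tiny made-up data set. This is only a sketch: the data frame `toy` and its values are hypothetical and are not taken from mushrooms.csv. A zero cell in the class-by-level table means that every observation with that feature level belongs to a single class, which is what the listing counts as a "perfect split".

```r
# Hypothetical toy data, for illustration only (not from mushrooms.csv)
toy <- data.frame(
  class = c("e", "e", "e", "p", "p", "p"),
  odor  = c("a", "a", "n", "f", "f", "n")
)

# Cross-tabulate the class against the levels of one feature
toy.table <- table(toy$class, toy$odor)
print(toy.table)

# Count the zero cells: levels "a" and "f" each occur with only one class,
# while level "n" occurs with both classes
zero.cells <- sum(toy.table == 0)
print(zero.cells)
```

In this toy table, odor levels "a" and "f" separate the classes completely, while "n" does not.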
# Descending order of perfect splits
# (the index vector is named split.order so it does not mask the order() function)
split.order <- order(number.perfect.splits,decreasing = TRUE)
number.perfect.splits <- number.perfect.splits[split.order]
# Plot graph
par(mar=c(10,2,2,2))
# features are on the x-axis; bar height is the number of perfect splits
barplot(number.perfect.splits,main="Number of perfect splits vs feature", xlab="", ylab="Number of perfect splits", las=2, col="wheat")
#data splicing: 80% of the rows for training, the remaining 20% for testing
set.seed(12345)
train <- sample(1:nrow(mushrooms),size = ceiling(0.80*nrow(mushrooms)),replace = FALSE)
# training set
mushrooms_train <- mushrooms[train,]
# test set
mushrooms_test <- mushrooms[-train,]
# penalty matrix (rows = actual class, columns = predicted class)
penalty.matrix <- matrix(c(0,1,10,0), byrow=TRUE, nrow=2)
# building the classification tree with rpart
tree <- rpart(class~.,
              data=mushrooms_train,
              parms = list(loss = penalty.matrix),
              method = "class")
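To make the penalty matrix easier to read, the sketch below rebuilds it with row and column labels. It assumes the class levels sort as "e" (edible) then "p" (poisonous), and it follows the rpart convention that entry [i, j] is the loss for classifying a true class i as class j. The labels are added for display only and are not part of the original listing.

```r
# Same values as the penalty matrix in the listing
penalty.matrix <- matrix(c(0, 1, 10, 0), byrow = TRUE, nrow = 2)

# Assumption: class levels are "e" (edible) and "p" (poisonous), in that order.
# The dimnames are for readability only.
dimnames(penalty.matrix) <- list(actual = c("e", "p"), predicted = c("e", "p"))
print(penalty.matrix)

# Loss for predicting "edible" when the mushroom is actually poisonous
cost.false.edible <- penalty.matrix["p", "e"]

# Loss for predicting "poisonous" when the mushroom is actually edible
cost.false.poisonous <- penalty.matrix["e", "p"]
```

Under these assumptions, mistaking a poisonous mushroom for an edible one is penalised ten times as heavily as the opposite mistake, so the tree is pushed toward cautious predictions.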
# Visualize the decision tree with rpart.plot
rpart.plot(tree, nn=TRUE)
#Testing the model
pred <- predict(object=tree,mushrooms_test[-1],type="class")
#Calculating accuracy
# confusionMatrix() expects the predictions in the rows and the actual classes in the columns
t <- table(pred,mushrooms_test$class)
confusionMatrix(t)
pred
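As a cross-check on the accuracy reported by confusionMatrix(), accuracy can also be computed by hand from a confusion table. The sketch below uses hypothetical counts chosen only for illustration. They are not results from the mushroom data.

```r
# Hypothetical counts, for illustration only (not results from the mushroom data)
conf <- as.table(matrix(c(50, 2,
                          0, 48),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(predicted = c("e", "p"),
                                        actual = c("e", "p"))))
print(conf)

# Accuracy = correctly classified observations / all observations
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)

# Number of poisonous mushrooms predicted as edible (the costly error)
false.edible <- conf["e", "p"]
print(false.edible)
```

Reporting the count of poisonous mushrooms predicted as edible alongside the overall accuracy is useful here, because that is the error the penalty matrix is designed to discourage.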