DataScienceProjectEDA.Rmd

---
title: "Kwame Darko-Mensah Data Science Project EDA"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# import the libraries
library(tidyverse)
library(ggplot2)
library(corrplot)
```

## R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

```{r }
library(knitr)
library(dplyr)
# read the raw bank data; the first row holds the column names
bank <- read.csv("bank.txt", header = TRUE)
```

## Including Plots

You can also embed plots, for example:

```{r }
dim(bank)
```

Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.

```{r }
# label the categorical variables, as they were imported as numeric
bank$b_tgt <- ifelse(bank$b_tgt == "1", "yes", "no")
bank$demog_ho <- ifelse(bank$demog_ho == "1", "yes", "no")
bank$demog_genf <- ifelse(bank$demog_genf == "1", "yes", "no")
bank$demog_genm <- ifelse(bank$demog_genm == "1", "yes", "no")
```

Before exploring the data, some variables have to be declared as factors. The code below does that.

```{r }
# declare variables as factors
cols <- c("b_tgt", "cat_input1", "cat_input2", "demog_ho", "demog_genf", "demog_genm")
bank[cols] <- lapply(bank[cols], factor)
```

Removing the dollar signs from the observations is an integral part of cleaning and wrangling the data before applying functions to the data set for exploration. The code below is used to remove dollar signs from the specific column observations.
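To make the effect of that substitution concrete before it is applied to the real columns, here is a minimal sketch on made-up values. The names `toy_amounts`, `toy_clean` and `toy_bank` are hypothetical and the values are not taken from `bank.txt`; the sketch only assumes that the affected columns hold strings with a leading dollar sign.

```{r }
# Toy values only; these are not taken from bank.txt
toy_amounts <- c("$120.50", "$0", NA, "$87")

# "[$]" is a character class matching a literal dollar sign,
# equivalent to the "[\\$]" pattern used in this document
toy_clean <- as.numeric(gsub("[$]", "", toy_amounts))
toy_clean

# The same step applied to every column of a toy data frame at once
toy_bank <- data.frame(rfm1 = c("$10", "$20.5"), rfm2 = c("$3", NA))
toy_bank[] <- lapply(toy_bank, function(x) as.numeric(gsub("[$]", "", x)))
str(toy_bank)
```

Missing values stay missing after the conversion, so they still show up in the later `is.na()` counts.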
```{r }
# the code below removes the dollar signs and converts the columns to numeric
bank$int_tgt <- as.numeric(gsub("[\\$]", "", bank$int_tgt))
bank$demog_homeval <- as.numeric(gsub("[\\$]", "", bank$demog_homeval))
bank$demog_inc <- as.numeric(gsub("[\\$]", "", bank$demog_inc))
bank$rfm1 <- as.numeric(gsub("[\\$]", "", bank$rfm1))
bank$rfm2 <- as.numeric(gsub("[\\$]", "", bank$rfm2))
bank$rfm3 <- as.numeric(gsub("[\\$]", "", bank$rfm3))
bank$rfm4 <- as.numeric(gsub("[\\$]", "", bank$rfm4))
```

Now we can get a general overview of the entire original data and deal with issues as they come up during the exploration.

```{r }
# Get a high-level overview of the data
summary(bank)
dim(bank)
view(bank)
```

```{r }
bankdata <- bank # copy used for processing and cleaning purposes
# rename some variables to more relatable forms for the code and parts of the EDA
names(bankdata)[names(bankdata) == "b_tgt"] <- "new.product"
names(bankdata)[names(bankdata) == "int_tgt"] <- "new.sales"
names(bankdata)[names(bankdata) == "cnt_tgt"] <- "new.prodcount"
names(bankdata)[names(bankdata) == "cat_input1"] <- "acct.actilvl"
names(bankdata)[names(bankdata) == "cat_input2"] <- "customer.valvl"
sapply(bankdata, function(x) sum(is.na(x))) # how many missing values do we have for each variable
# now that we can see the variables with NA values in the bankdata exploratory data set, I will create a subset with no missing values for all variables.
```

```{r }
# The dataset variable declares whether an observation is in the train, validation or test set.
bank_train <- bank[which(bank$dataset == "1"), ]
bank_validation <- bank[which(bank$dataset == "2"), ]
bank_test <- bank[which(bank$dataset == "3"), ]
```

The code below prepares and cleans the data for exploration. For the columns that have many missing values, I will replace the missing values with the median of that column. For the columns with fewer missing values, I have chosen to replace them with zero.
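The two replacement rules can be illustrated on a small example. This is a minimal sketch on invented values, not the actual cleaning step; the data frame `toy` and the helper `fill_na()` are hypothetical names introduced only for illustration.

```{r }
# Toy columns only; the values are invented for illustration
toy <- data.frame(
  age   = c(34, NA, 51, 62, NA, 45), # many gaps: fill with the median
  count = c(1, 0, NA, 2, 0, 1)       # few gaps: fill with zero
)

# Hypothetical helper: replace every NA in x with a fixed value
fill_na <- function(x, value) {
  x[is.na(x)] <- value
  x
}

toy$age   <- fill_na(toy$age, median(toy$age, na.rm = TRUE))
toy$count <- fill_na(toy$count, 0)

colSums(is.na(toy)) # number of missing values left in each column
```

Median imputation leaves the centre of a column where it was but narrows its spread, which is worth keeping in mind when reading the dispersion measures later on.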
The prepared data set is stored as my subset "bankdata", which I will use to explore variables and the relationships between them to find any trends.

```{r }
# creating a subset with no missing values in the variables
# missing values are replaced with the column median, or with zero where only a few are missing
bankdata$new.sales[is.na(bankdata$new.sales)] <- median(bankdata$new.sales, na.rm = TRUE)
bankdata$demog_age[is.na(bankdata$demog_age)] <- median(bankdata$demog_age, na.rm = TRUE)
bankdata$new.prodcount[is.na(bankdata$new.prodcount)] <- 0
bankdata$rfm1[is.na(bankdata$rfm1)] <- 0
bankdata$rfm3[is.na(bankdata$rfm3)] <- median(bankdata$rfm3, na.rm = TRUE)
bankdata$rfm4[is.na(bankdata$rfm4)] <- 0
sapply(bankdata, function(x) sum(is.na(x))) # check whether any NA values remain in the variables
dim(bank)
dim(bankdata)
```

I will first examine univariate data (exploratory data analysis conducted on each variable on its own) and then some multivariate data as well, to see if any trends arise in the data set before proceeding to modeling.

- Univariate data: samples of one variable, used to describe the data.
- Discrete variable: an example is age in this dataset. It has a limited set of values.
- Continuous variable: an example is income. It can be any number.

Key things to discover in the EDA: central tendency, dispersion measures, graphical techniques and variable correlation analysis.

Types of graphs that will be used:

- box plot
- histogram
- density plot

```{r }
# Central tendency
## Looking at the central tendency below using the mean and median. With the mean and median not being that different, there is not much of a skew.
## We can see this from the boxplot as well. There is no particular skew; however, we see 6 outliers.
summary(bankdata$demog_inc)
boxplot(bankdata$demog_inc)
sd(bankdata$demog_inc)
var(bankdata$demog_inc)
# Dispersion measures (spread).
# This will be performed on income to show the income density as well.
hist(bankdata$demog_inc)
plot(density(bankdata$demog_inc), main = "Income Density Spread")
```

```{r }
### CUSTOMER AGE
## Looking at the central tendency below using the mean and median. With the mean and median being different, there is a skew.
## We can see this from the boxplot, histogram and density plots as well.
summary(bankdata$demog_age)
sd(bankdata$demog_age)
var(bankdata$demog_age)
boxplot(bankdata$demog_age)
hist(bankdata$demog_age)
plot(density(bankdata$demog_age), main = "Age Density Spread")
# From the histogram, a large number of people are between 55 and 60 years of age.
```

```{r }
### Male or Female Demographic and Homeowners or Non-homeowners
# Looking at some categorical variables: the demographic inputs that have Yes or No options (homeowners, males and females).
# The summary shows the number belonging to each category, and the plots give a graphical representation.
summary(bankdata$demog_genf)
summary(bankdata$demog_genm)
summary(bankdata$demog_ho)
plot(bankdata$demog_genf)
plot(bankdata$demog_genm)
plot(bankdata$demog_ho)
```

The exploratory data analysis below looks at the variable DEMOG_PR, which shows the percentage of retired people in the area. The summary shows the mean and median being a point apart. The box plot shows the distribution of the retired population percentage graphically, including some outliers.

```{r }
### Percentage of retired people in the area
summary(bankdata$demog_pr)
boxplot(bankdata$demog_pr)
hist(bankdata$demog_pr)
sd(bankdata$demog_pr)
var(bankdata$demog_pr)
plot(density(bankdata$demog_pr), main = "Retired Persons Density Spread")
```

```{r }
### ACCOUNT ACTIVITY LEVEL
## Looking at the summary for account activity, account X has the highest activity level, account Y the second highest and Z the lowest activity level.
summary(bankdata$acct.actilvl)
plot(bankdata$acct.actilvl, xlab = "Account Activity", ylab = "level")
pie(table(bankdata$acct.actilvl), main = "Account Activity Level")
```

```{r }
### CUSTOMER VALUE LEVEL
## Looking at the summary for customer value levels, we have levels A, B, C, D and E and the number of customers at each of those value levels.
summary(bankdata$customer.valvl)
plot(bankdata$customer.valvl, xlab = "Customer Value", ylab = "level")
pie(table(bankdata$customer.valvl), main = "Customer Value Level")
class(bankdata$customer.valvl)
```

Examining Multivariate Data. This can be a relationship between one categorical variable and one continuous variable, between two categorical variables, or between two continuous variables. The code below uses EDA to answer some questions about the relationships between variables in the data.

```{r }
# Comparing account activity level to income. This answers the question of how much account activity there is relative to how much people earn.
by(bankdata$demog_inc, bankdata$acct.actilvl, summary)
# here we can see the spread of income within the various account activity levels.
boxplot(bankdata$demog_inc ~ bankdata$acct.actilvl, notch = FALSE, col = c("grey", "gold", "grey"), main = "Income distribution among Account Activity Levels")
# We can see income outliers in activity levels X and Y. Most notably, higher income levels go with higher account activity levels.
```

The next chunk looks at the relationship between one of the categorical inputs (customer value level) and age. This is to see whether age is a factor in customer value, that is, whether older people are more valued customers to the bank or not. Questions like this help find ways of targeting customers to improve revenue for the bank.
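A numeric summary can complement a plot when comparing age across value levels. The following is a minimal sketch on invented data: `toy_cust` and `age_by_value` are hypothetical names, the ages and levels are made up, and only the column names mirror the ones used in this document.

```{r }
# Toy data only; ages and value levels are invented
toy_cust <- data.frame(
  demog_age      = c(25, 31, 47, 52, 58, 63, 66, 70),
  customer.valvl = factor(c("A", "A", "B", "B", "C", "C", "D", "D"))
)

# Median age within each customer value level
age_by_value <- aggregate(demog_age ~ customer.valvl, data = toy_cust, FUN = median)
age_by_value
```

If the medians rise or fall steadily across the levels, that suggests age is related to customer value; similar medians across levels suggest it is not.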
```{r }
# age density, one curve per customer value level
ggplot(data = bankdata, aes(x = demog_age, group = customer.valvl)) +
  geom_density(aes(color = customer.valvl), adjust = 1.5)
```

Looking at customers who are homeowners or non-homeowners and whether they buy new products and services of the bank.

```{r }
# Bought new product vs homeowner or not
ggplot(data = bankdata) +
  geom_bar(aes(x = demog_ho, fill = as.factor(new.prodcount))) +
  ggtitle(label = "barplot of homeowners and new products purchased")
# We see that non-homeowners are less likely to buy the different range of new products and services offered by the bank
```

I will construct some plots and correlation plots highlighting the relationships among the variables in my data set.

```{r }
# Correlation between categorical variables is not ideal, as we can see from the pairs plots.
# However, we can also see the correlations between the interval inputs, with a few of them being only slightly positively correlated.
library(corrplot)
pairs(bankdata[1:3])
pairs(bankdata[4:5])
pairs(bankdata[6:10])
```

Finally in this EDA, I will calculate the pairwise correlation coefficients between the important target variable, the total number of new products and services, and the other predictor variables in the dataset.

```{r }
cor.test(bankdata$new.prodcount, bankdata$rfm1)
cor.test(bankdata$new.prodcount, bankdata$rfm2)
cor.test(bankdata$new.prodcount, bankdata$rfm3)
cor.test(bankdata$new.prodcount, bankdata$rfm4)
cor.test(bankdata$new.prodcount, bankdata$rfm5)
cor.test(bankdata$new.prodcount, bankdata$rfm6)
cor.test(bankdata$new.prodcount, bankdata$rfm7)
cor.test(bankdata$new.prodcount, bankdata$rfm8)
cor.test(bankdata$new.prodcount, bankdata$rfm9)
cor.test(bankdata$new.prodcount, bankdata$rfm10)
cor.test(bankdata$new.prodcount, bankdata$rfm11)
cor.test(bankdata$new.prodcount, bankdata$rfm12)
cor.test(bankdata$new.prodcount, bankdata$demog_age)
cor.test(bankdata$new.prodcount, bankdata$demog_homeval)
cor.test(bankdata$new.prodcount, bankdata$demog_inc)
cor.test(bankdata$new.prodcount,
         bankdata$demog_pr)
# The general trend is mixed. There is a strong positive correlation between new product sales and the past 3 years' direct promo responses,
# the purchases in the last 3 years and the average sales in the last 3 years. These are the top 3 highest coefficient values.
# This shows a general trend: customers who made purchases in the last 3 years are more likely to purchase a new product or service rollout.
```

Next, I plot boxplots of the variables that are highly correlated with the number of new products and services purchased by customers.

```{r }
boxplot(bankdata$new.prodcount, bankdata$rfm1, names = c("new.prodcount", "rfm1"))
boxplot(bankdata$new.prodcount, bankdata$rfm5, names = c("new.prodcount", "rfm5"))
boxplot(bankdata$new.prodcount, bankdata$rfm7, names = c("new.prodcount", "rfm7"))
```

The code below looks at the correlation between the target variable New Sales (interval) and the other predictor variables.

```{r }
cor.test(bankdata$new.sales, bankdata$rfm1)
cor.test(bankdata$new.sales, bankdata$rfm2)
cor.test(bankdata$new.sales, bankdata$rfm3)
cor.test(bankdata$new.sales, bankdata$rfm4)
cor.test(bankdata$new.sales, bankdata$rfm5)
cor.test(bankdata$new.sales, bankdata$rfm6)
cor.test(bankdata$new.sales, bankdata$rfm7)
cor.test(bankdata$new.sales, bankdata$rfm8)
cor.test(bankdata$new.sales, bankdata$rfm9)
cor.test(bankdata$new.sales, bankdata$rfm10)
cor.test(bankdata$new.sales, bankdata$rfm11)
cor.test(bankdata$new.sales, bankdata$rfm12)
cor.test(bankdata$new.sales, bankdata$demog_age)
cor.test(bankdata$new.sales, bankdata$demog_homeval)
cor.test(bankdata$new.sales, bankdata$demog_inc)
cor.test(bankdata$new.sales, bankdata$demog_pr)
ggplot(bankdata, aes(demog_pr, new.sales)) + geom_point()
boxplot(bankdata$demog_pr, bankdata$new.sales, names = c("demog_pr", "new.sales"))
## With the difference in shape between the box plots here, the percentage of retired people may be an important factor in how people respond to new sales attempts by the bank
boxplot(bankdata$demog_inc, bankdata$new.sales, names = c("demog_inc", "new.sales"))
# with the difference in shape between the box plots here, we can see that
# income levels of the people affect their response to new sales attempts by the bank
```

Reporting on some basic findings of the data: certain predictor variables have an impact on the data and will definitely be factored into preparing the best models. These include average sales in the past 3 years (`rfm1`), count of purchases in the last 3 years (`rfm5`), count of direct promos in the past year (`rfm11`), customer age (`demog_age`), homeowner status (`demog_ho`) and income (`demog_inc`). These will be used to build models against the 3 given target variables. The goal will be to arrive at the champion model.
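The repeated `cor.test()` calls used earlier in this EDA could also be collected into a single table, which makes it easier to rank predictors such as the ones listed above. The following is a minimal sketch on simulated data, not a result from the bank data: `toy_cor`, `predictors` and `cor_table` are hypothetical names, and only the column names mimic those in this document.

```{r }
# Toy data only; values are simulated, column names mimic the document's
set.seed(1)
toy_cor <- data.frame(rfm1 = rnorm(50), rfm2 = rnorm(50), demog_age = rnorm(50))
toy_cor$new.prodcount <- toy_cor$rfm1 + rnorm(50, sd = 0.5)

predictors <- c("rfm1", "rfm2", "demog_age")

# Run cor.test() once per predictor and keep the estimate and the p-value
cor_table <- do.call(rbind, lapply(predictors, function(p) {
  ct <- cor.test(toy_cor$new.prodcount, toy_cor[[p]])
  data.frame(predictor = p, r = unname(ct$estimate), p.value = ct$p.value)
}))

# Strongest absolute correlations first
cor_table[order(-abs(cor_table$r)), ]
```

Sorting by the absolute coefficient puts strongly negative predictors next to strongly positive ones, so neither is overlooked when shortlisting inputs for modeling.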