Project 2
College of Computing and Informatics
Ensemble classification
Create an IPython notebook that answers the following questions. Any diagram should be plotted in the notebook and copied to the report for analysis. In the report, include descriptions, discussions …etc.
Dataset
Orange Telecom's Churn Dataset
Exploratory Data Analysis
· Provide summary statistics for all variables (0.1). Discover if there are any anomalies in the results (0.2). Use diagrams to investigate the variables with potential outliers (0.2) (0.5 mark)
· Create a heat map of the correlation matrix that shows correlation coefficients among all the variables in the dataset. (0.25) What are your observations (0.25)? (0.5 mark)
· What external data sources would be useful to enrich this dataset and why? Note that you should not collect any additional data. (0.5 mark)
· What is the state where most churns occur (0.2)? Is there statistical difference between the number of customer service calls received in this state compared to the remaining states (0.1)? What do you observe (0.2)? (0.5 mark)
Data cleaning
· What are the issues (e.g., missing values) that you noticed in the dataset (0.5)? Apply any cleaning method that you find fit and provide justification of your decisions (1.5). (2 marks) Your data cleaning should be comprehensive.
Classification model development
· Develop and compare at least three classification models that predict customer churn (1 mark). You are expected to perform hyperparameter tuning and choose the best combination (1 mark). (2 marks)
· Develop and compare three ensemble classification models. (2 marks)
· Analyze the results of your best performing classifier (0.5). Is it statistically different than the least performing classifier (0.5)? (1 mark)
· Novelty and innovation (1 mark)
College of Computing and
Informatics
Ensemble classification
Create an IPython
notebook that answers the following questions. Any diagram should be
plotted in the notebook and copied to the report for analysis. In the report, include descriptions,
discussions …etc.
Dataset
Orange Telecom's Churn Dataset
Exploratory Data Analysis
-
Pr
ovide summary statistics for all variables (0.1). Discover if there are any anomalies in the results
(0.2). Use diagrams to investigate the variables with potential outliers (0.2) (
0.5 mark
)
-
Create a
heat map of the correlation matrix that shows correlatio
n
coefficients among all the
variables in the dataset. (0.25) What are your observations (0.25)?
(
0.5 mark
)
-
What external data sources would be useful to enrich this dataset and why? Note that you should
not collect any additional data. (
0.5 mark
)
-
What is
the state where most churns occur (0.2)? Is there statistical difference between the number
of customer service calls received in this state compared to the remaining states (0.1)? What do you
observe (0.2)? (
0.5 mark
)
Data cleaning
-
What are the issues (
e.g., missing values) that you noticed in the dataset (0.5)? Apply any cleaning
method that you find fit
and provide justification of your decisions (1.5)
. (
2 marks
) Your data
cleaning should be comprehensive.
Classification model development
-
Develop and
compare at least three classification models that predict customer churn (1 mark).
You
are expected to perform hyperparameter tuning and choose the best combination (1 mark).
(
2 marks
)
-
Develop and compare three ensemble classification models. (
2 marks
)
College of Computing and Informatics
Ensemble classification
Create an IPython notebook that answers the following questions. Any diagram should be
plotted in the notebook and copied to the report for analysis. In the report, include descriptions,
discussions …etc.
Dataset
Orange Telecom's Churn Dataset
Exploratory Data Analysis
- Provide summary statistics for all variables (0.1). Discover if there are any anomalies in the results
(0.2). Use diagrams to investigate the variables with potential outliers (0.2) (0.5 mark)
- Create a heat map of the correlation matrix that shows correlation coefficients among all the
variables in the dataset. (0.25) What are your observations (0.25)? (0.5 mark)
- What external data sources would be useful to enrich this dataset and why? Note that you should
not collect any additional data. (0.5 mark)
- What is the state where most churns occur (0.2)? Is there statistical difference between the number
of customer service calls received in this state compared to the remaining states (0.1)? What do you
observe (0.2)? (0.5 mark)
Data cleaning
- What are the issues (e.g., missing values) that you noticed in the dataset (0.5)? Apply any cleaning
method that you find fit and provide justification of your decisions (1.5). (2 marks) Your data
cleaning should be comprehensive.
Classification model development
- Develop and compare at least three classification models that predict customer churn (1 mark). You
are expected to perform hyperparameter tuning and choose the best combination (1 mark). (2 marks)
- Develop and compare three ensemble classification models. (2 marks)