Project 2

College of Computing and Informatics

Ensemble classification

Create an IPython notebook that answers the following questions. Plot every diagram in the notebook and copy it to the report for analysis. In the report, include descriptions, discussions, etc.

Dataset

Orange Telecom's Churn Dataset

Exploratory Data Analysis

· Provide summary statistics for all variables (0.1). Identify any anomalies in the results (0.2). Use diagrams to investigate the variables with potential outliers (0.2). (0.5 mark)

· Create a heat map of the correlation matrix that shows correlation coefficients among all the variables in the dataset. (0.25) What are your observations (0.25)? (0.5 mark)

· What external data sources would be useful to enrich this dataset and why? Note that you should not collect any additional data. (0.5 mark)

· In which state do most churns occur (0.2)? Is there a statistically significant difference between the number of customer service calls received in this state and in the remaining states (0.1)? What do you observe (0.2)? (0.5 mark)
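One possible way to approach the "statistical difference" question is a two-sided permutation test on the difference in mean customer service calls. The sketch below is only an illustration under stated assumptions: the call counts are made up, the names `permutation_test`, `calls_top_state`, and `calls_other_states` are hypothetical, and the assignment does not prescribe this particular test. It uses only the Python standard library so that it runs in any notebook.

```python
import random
from statistics import fmean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference (mean of A minus mean of B) and an
    estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = fmean(group_a) - fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled values to the two groups.
        rng.shuffle(pooled)
        diff = fmean(pooled[:n_a]) - fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up call counts, for illustration only (not taken from the dataset).
calls_top_state = [1, 2, 4, 3, 5, 2, 4, 6, 3, 4]
calls_other_states = [1, 0, 2, 1, 3, 1, 2, 0, 1, 2, 1, 3, 2, 1, 0]

observed_diff, p_value = permutation_test(calls_top_state, calls_other_states)
print(f"difference in means: {observed_diff:.2f}, p-value: {p_value:.4f}")
```

A permutation test makes no normality assumption, which may suit a count variable such as the number of calls. A rank-based test or a t-test could be used instead, as long as the choice is justified in the report.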

Data cleaning

· What issues (e.g., missing values) did you notice in the dataset (0.5)? Apply whichever cleaning methods you find appropriate and justify your decisions (1.5). Your data cleaning should be comprehensive. (2 marks)
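As a rough sketch of what a cleaning pass might involve, the example below normalises inconsistent text labels, drops exact duplicate rows, and imputes a missing numeric value with the median. Everything in it is an assumption made for illustration: the rows are invented, the column names (`state`, `total_day_minutes`, `churn`) are hypothetical and may not match the actual dataset, and the function name `clean` is not part of any library.

```python
from statistics import median

# Invented rows; column names and values are hypothetical.
raw_rows = [
    {"state": "KS", "total_day_minutes": 265.1, "churn": "False"},
    {"state": "ks ", "total_day_minutes": None, "churn": "false"},
    {"state": "OH", "total_day_minutes": 161.6, "churn": "True"},
    {"state": "OH", "total_day_minutes": 161.6, "churn": "True"},  # exact duplicate
    {"state": "NJ", "total_day_minutes": 243.4, "churn": " TRUE"},
]


def clean(rows):
    """Return a cleaned copy of rows; the input list is left unchanged."""
    # 1. Normalise text fields: trim whitespace, unify case, parse the label.
    normalised = [
        {
            "state": row["state"].strip().upper(),
            "total_day_minutes": row["total_day_minutes"],
            "churn": row["churn"].strip().lower() == "true",
        }
        for row in rows
    ]

    # 2. Drop exact duplicates, keeping the first occurrence.
    seen = set()
    unique = []
    for row in normalised:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # 3. Impute missing numeric values with the median of the observed ones.
    observed = [r["total_day_minutes"] for r in unique
                if r["total_day_minutes"] is not None]
    fill_value = median(observed)
    for row in unique:
        if row["total_day_minutes"] is None:
            row["total_day_minutes"] = fill_value
    return unique


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

The median is used here because it is less sensitive to outliers than the mean. When the cleaned data feeds a classifier, the fill value should be computed from the training split only, so that no information from the test split leaks into training.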

Classification model development

· Develop and compare at least three classification models that predict customer churn (1 mark). You are expected to perform hyperparameter tuning and choose the best combination (1 mark). (2 marks)

· Develop and compare three ensemble classification models. (2 marks)

· Analyze the results of your best-performing classifier (0.5). Is its performance statistically different from that of the worst-performing classifier (0.5)? (1 mark)

· Novelty and innovation (1 mark)
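For the comparison between the best and the worst classifier, one commonly used option is McNemar's test, which looks only at the test examples on which the two models disagree. The sketch below computes the exact (binomial) two-sided version with the standard library. It is an illustration under assumptions, not a required method: the labels and predictions are made up, and the names `mcnemar_exact`, `pred_best`, and `pred_worst` are hypothetical.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test for two classifiers on the same test set.

    Returns (b, c, p_value), where b counts examples that only model A
    classified correctly and c counts examples that only model B did.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree

    # Under the null hypothesis, each disagreement favours either model
    # with probability 0.5, so min(b, c) follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return b, c, p_value


# Made-up labels and predictions, for illustration only.
y_true = [1, 0] * 10

pred_best = list(y_true)
pred_best[0] = 1 - pred_best[0]        # one mistake

pred_worst = list(y_true)
for i in range(1, 10):
    pred_worst[i] = 1 - pred_worst[i]  # nine mistakes

b, c, p_value = mcnemar_exact(y_true, pred_best, pred_worst)
print(f"only best correct: {b}, only worst correct: {c}, p-value: {p_value:.4f}")
```

Both models must be evaluated on the same held-out examples for this test to apply. A paired test over cross-validation fold scores is another possibility, provided the report states which test was used and why.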
