PYTHON PROGRAMMING

profileHussain2018
ExploratoryDataAnalysis2017.pdf

1

Exploratory Data Analysis (EDA)

by Melvin Ott, PhD

September, 2017

Introduction

The Masters in Predictive Analytics program at Northwestern University offers

graduate courses that cover predictive modeling using several software products

such as SAS, R and Python. The Predict 410 course is one of the core courses and

this section focuses on using Python.

Predict 410 will follow a sequence in the assignments. The first assignment will ask

you to perform an EDA(See Ratner1 Chapters 1&2) for the Ames Housing Data

dataset to determine the best single variable model. It will be followed by an

assignment to expand to a multivariable model. Python software for boxplots,

scatterplots and more will help you identify the single variable. However, it is easy

to get lost in the programming and lose sight of the objective. Namely, which of

the variable choices best explain the variability in the response variable?

(You will need to be familiar with the data types and level of measurement. This

will be critical in determining the choice of when to use a dummy variable for model

building. If this topic is new to you review the definitions at Types of Data before

reading further.)

This report will help you become familiar with some of the tools for EDA and allow

you to interact with the data by using links to a software product, Shiny, that will

demonstrate and interact with you to produce various plots of the data. Shiny is

located on a cloud server and will allow you to make choices in looking at the plots

for the data. Study the plots carefully. This is your initial EDA tool and leads to

your model building and your overall understanding of predictive analytics.

Single Variable Linear Regression EDA

1. Become Familiar With the Data

2

Identify the variables that are categorical and the variables that are quantitative.

For the Ames Housing Data, you should review the Ames Data Description pdf file.

2. Look at Plots of the Data

For the variables that are quantitative, you should look at scatter plots vs the

response variable saleprice. For the categorical variables, look at boxplots vs

saleprice. You have sample Python code to help with the EDA and below are some

links that will demonstrate the relationships for the a different building_prices

dataset.

For the boxplots with Shiny:

Click here

For the scatterplots with Shiny:

Click here

3. Begin Writing Python Code

Start with the shell code and improve on the model provided.

3

Single Variable Logistic Regression EDA

1. Become Familiar With the Data

In 411 you will have an introduction to logistic regression and again will ask you to

perform an EDA. See the file credit data for more info. Make sure you recognize

which variables are quantitative and which are categorical. And, for several of

these variables, what is the level of measurement?

2. Look at Plots of the Data

For logistic regression, the response variable is of the type yes/no. In this

dataset it is coded as good/bad. So, the EDA may include histograms for

quantitative variables with a separate histogram for each of the response values.

For numeric coded explanatory categorical variables, if the response good/bad is

recoded as 0/1 then the mean for the response variable for each of the categories

will indicate if there is a relationship.

For the histograms with Shiny:

Click here

For the means with Shiny:

Click here

3. Begin Writing Python Code

OK. You have looked at the plots, which variable do you think will be most useful

for predicting or explaining bad credit? After you answer this question, begin

writing Python code to see if you can replicate these plots.

4

The data set CREDIT contains information on 1000 customers. There are 21 variables in

the data set:

Name Model

Role

Measurement

Level

Description

AGE Input Interval Age in years

AMOUNT Input Interval Amount of credit requested

CHECKING Input Nominal or

Ordinal

Balance in existing checking account:

1 = less than 0 DM

2 = more than 0 but less than 200 DM

3 = at least 200 DM

4 = no checking account

COAPP Input Nominal Other debtors or guarantors:

1 = none

2 = co-applicant

3 = guarantor

DEPENDS Input Interval Number of dependents

DURATION Input Interval Length of loan in months

EMPLOYED Input Ordinal Time at present employment:

1 = unemployed

2 = less than 1 year

3 = at least 1, but less than 4 years

4 = at least 4, but less than 7 years

5 = at least 7 years

EXISTCR Input Interval Number of existing accounts at this bank

FOREIGN Input Binary Foreign worker:

1 = Yes

2 = No

GOOD_BAD Target Binary Credit Rating Status (good or bad)

5

HISTORY Input Ordinal Credit History:

0 = no loans taken / all loans paid back in

full and on time

1 = all loans at this bank paid back in full

and on time

2 = all loans paid back on time until now

3 = late payments on previous loans

4 = critical account / loans in arrears at

other banks

HOUSING Input Nominal Rent/Own:

1 = rent

2 = own

3 = free housing

INSTALLP Input Interval Debt as a percent of disposable income

JOB Input Ordinal Employment status:

1 = unemployed / unskilled non-resident

2 = unskilled resident

3 = skilled employee / official

4 = management / self-employed / highly

skilled employee / officer

MARITAL Input Nominal Marital status and gender

1 = male – divorced/separated

2 = female – divorced/separated/married

3 = male – single

4 = male – married/widowed

5 = female – single

OTHER Input Nominal or

Ordinal

Other installment loans:

1 = bank

2 = stores

3 = none

PROPERTY Input Nominal or

Ordinal

Collateral property for loan:

1 – real estate

2 = if not 1, building society savings

agreement / life insurance

3 = if not 1 or 2, car or others

4 = unknown / no property

6

PURPOSE Input Nominal Reason for loan request:

0 = new car

1 = used car

2 = furniture/equipment

3 = radio / television

4 = domestic appliances

5 = repairs

6 = education

7 = vacation

8 = retraining

9 = business

x = other

RESIDENT Input Interval Years at current address

SAVINGS Input Nominal or

Ordinal

Savings account balance:

1 = less than 100 DM

2 = at least 100, but less than 500 DM

3 = at least 500, but less than 1000 DM

4 = at least 1000 DM

5 = unknown / no savings account

TELEPHON Input Binary Telephone:

1 = none

2 = yes, registered under the customer’s

name

Exploratory Data Analysis (EDA)

Ratner1 describes ‘data mining’ “as any process that finds unexpected structures in

data and uses the EDA framework to ensure that the process explores the data,

not exploits it.” Unexpected suggests that the word exploratory is very

appropriate to this process.

Tukey2 in his book and in many presentations gave structure to EDA. Others have

extended it to include ‘big’ data. Big data has occurred due to our ability to

capture huge datasets, store it on servers cost effectively, and analyze it with

software that will handle it.

Shiny App s

To learn more about Shiny applications with RStudio click on the link below:

http://rstudio.github.io/shiny/tutorial/

7

Types of Data

Quantitative data are numeric and represent counts or measurements.

Categorical data are names or labels such as a,b,c but can often be shown as 1,2,3.

They do not suggest counts or measurements.

Discrete data are finite or countable numeric data.

Continuous data are values that represent a continuous scale of measurement.

A nominal level of measurement suggests names or categories. There is no

apparent order suggested.

Ordinal level data suggest a sequential ordering but mathematical calculations

should not be performed on this data.

Interval level data are ordinal plus the difference between two data values is

meaningful. And, there is no zero level.

Ratio level data are interval and have a zero level plus differences and ratios may

be calculated.

References:

1. Ratner, B. (2012). Statistical and Machine-Learning Data Mining: Techniques for Better

Predictive Modeling and Analysis of Big Data (2nd ed.). New York: CRC Press

[ISBN-13: 9781439860915]

2. Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley.