Big Data Analytics Tools Exam and project

BigDataAnalyticsTools..zip

Big Data Analytics Tools./ Final Exam/PROJECT - BETTER UNDERSTAND ATTRITION.docx

FINAL EXAM – EXERCISE – To Better Understand Attrition.

This is a final project – you are going to examine the HR-BalancedSheet dataset and write a short report on what you found. I will guide you through the analysis, but as we go through the analysis you are going to need to capture data for the final report.

1. Load the dataset into Statistica

2. Generate Histograms for all of the data

a. Make notes on what you observe from the histograms. Can you learn anything about the business from these histograms?

b. Capture all of the histograms.

3. Now generate a correlation matrix to see if any variables are highly correlated. If variables are highly correlated and you are doing a supervised method (e.g., decision tree), then one of them must be omitted from the analysis. Do you know why?

Go to Statistics -> Nonparametrics -> Correlations and click OK.

Now select ALL of the variables and select “Spearman rank R”.

4. Let’s copy this out to Excel.

a. Open a blank Excel file

b. Go to Statistica – the output correlation matrix –

i. Hit Ctrl – A - this will select everything.

ii. Right Click - select “Copy with Headers”

iii. Go To Excel – select Paste

5. Select all of the numbers in Excel

a. Go To Conditional Formatting

i. Highlight all values greater than 0.70

6. This tells you the values that are highly correlated. Record what they are – these cannot be used in a supervised modeling exercise together. For example, JobLevel and TotalWorkingYears are highly correlated.

a. Make a list of all of the variables that are highly correlated (>0.7).
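Steps 3 through 6 can also be sketched outside Statistica and Excel. The following is a minimal Python sketch, not the course's method: it ranks each column, applies Pearson correlation to the ranks (which gives Spearman's rank correlation), and lists the pairs above 0.70. The column names are borrowed from the exercise, but every value is invented for illustration, and using the absolute value (so that strong negative correlations are also flagged) is an assumption.

```python
from itertools import combinations
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical sample: names follow the exercise, values are invented.
data = {
    "JobLevel":          [1, 2, 2, 3, 4, 5],
    "TotalWorkingYears": [2, 6, 5, 10, 18, 25],
    "DistanceFromHome":  [7, 1, 20, 3, 15, 9],
}

THRESHOLD = 0.70  # same cut-off as the conditional formatting step
high_pairs = [
    (a, b, round(spearman(data[a], data[b]), 3))
    for a, b in combinations(data, 2)
    if abs(spearman(data[a], data[b])) > THRESHOLD
]
print(high_pairs)
```

With this invented sample only the JobLevel / TotalWorkingYears pair is flagged, so only one of those two would go into the tree.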

BUSINESS PROBLEM: The company has employee data for the last several years. In this data set we have a wide range of data, including whether or not they left the company (i.e., Attrition). If Attrition is set to “Yes”, they left the company. If Attrition is set to “No”, they did not leave the company.

The first thing we want to do is take a “high” level look at those people who left the company.

Go to Selection Criteria – that is accessible through the Sel:Off setting at the bottom of the Statistica window. Click on “Sel:Off”

Set the selection criteria to Attrition = “Yes”.

7. Generate Histograms for all of the data

a. Make notes on what you observe from the histograms. Can you learn anything about the business from these histograms?

b. Capture the histograms that tell you something about the business.

Go back to the selection criteria and turn the Sel: back to “Off”.

8. Now build a decision tree (C&RT) to see if we can find out what influences whether or not individuals decide to leave the company.

If you exclude the variables that are highly correlated, you can generate a tree.

Generate a C&RT tree

Pick your variables (Quick)

· Attrition is your dependent variable

· Select the categorical and continuous variables that you reasonably think could be an issue with respect to attrition.

· Select your response codes

· ALL

Don’t do anything on Classification (YET) – you may want to go back and play with the classification weights – but, don’t do that yet.

On the “Stopping” tab, change the minimum n to 20. This will allow it to build a deeper tree.

Select V-fold cross validation on the Validation tab

Set Surrogate to 2 on the Advanced tab and hit OK.

Look at your tree –

Look at the Predicted Versus Observed – under classification.

Look at “Importance” on the Summary tab – this tells you which variables have the greatest impact.

This is your initial tree ---

Now – the best you’re going to be able to do is get about 80% accuracy on both Predicting yes and no.
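To make "accuracy on both predicting yes and no" concrete, here is a minimal Python sketch of how per-class accuracy can be read off a predicted-versus-observed comparison. The labels below are invented; in practice they would come from the tree's Predicted Versus Observed output.

```python
from collections import Counter

def per_class_accuracy(observed, predicted):
    """Return {class: fraction of that class's cases predicted correctly}."""
    totals = Counter(observed)
    correct = Counter(o for o, p in zip(observed, predicted) if o == p)
    return {cls: correct[cls] / totals[cls] for cls in totals}

# Invented labels for illustration only.
observed  = ["Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "Yes", "Yes", "No",  "No", "No", "No", "Yes", "No"]

accuracy = per_class_accuracy(observed, predicted)
print(accuracy)
```

Looking at each class separately matters because a tree can score well overall while still missing most of the "Yes" cases.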

So – play with it and see how good you can get it.

· Play with the classification costs

· You may try to create a stratified subsample using Attrition as the strata variable
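The stratified subsample idea can be sketched as follows. This is a minimal Python illustration under stated assumptions: the records and the EmployeeID field are invented, and drawing the same number of rows from each Attrition group is one possible design, not necessarily what Statistica does.

```python
import random
from collections import defaultdict

def stratified_subsample(rows, strata_key, per_stratum, seed=0):
    """Draw up to per_stratum rows at random from each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata_key]].append(row)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Invented, imbalanced records: 3 "Yes" and 8 "No".
rows = [{"EmployeeID": i, "Attrition": "Yes" if i < 3 else "No"} for i in range(11)]
balanced = stratified_subsample(rows, "Attrition", per_stratum=3)
print([r["Attrition"] for r in balanced])
```

Balancing the two groups this way keeps the tree from simply predicting the majority class.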

Big Data Analytics Tools./ Final Exam/FINAL EXAM - 2018.docx

FINAL EXAM NAME: _________________________

1. When we evaluate models we often discuss things like predictive accuracy, speed, robustness, scalability, and interpretability. Briefly discuss what is meant by “interpretability” and why it is important.

2. You have been hired by the county government to help automate a system to detect fraudulent spending by government employees.

You have been given a database of transactions for the past 10 years to work with. Each record in the database contains all of the details of each transaction as well as information related to the particular employee. In this database accountants have manually gone through the data and marked each transaction as either “Good” or “Fraudulent”.

The goal – build a model based on the historical data that will flag future transactions as either “Good” or “Fraudulent”. This will eliminate the need for the accountants to have to go through each transaction manually in the future. What type of modeling technique (e.g., decision tree, association analysis, clustering, etc.) would you use and why?

3. AT&T has been losing customers to Verizon. They want to try to understand why this is the case. They have customer records for the past 5 years that contain demographic information (age, gender, etc.) for the customers, the type of plan that they have, the number of interactions they have had with customer support and whether or not those customers left AT&T.

AT&T wants you to build a model that can be used to predict whether or not a customer is going to leave and switch to another provider. What type of technique (e.g., decision tree, association analysis, clustering, etc.) would you use and why?

4. Kroger is trying to find ways to improve sales. They have all of their receipts for the past 5 years. The receipts contain information about what was purchased, who purchased, and the date and time of the transaction.

Your task is to analyze sales patterns and make recommendations with respect to store layouts that, you hope, will increase sales. What type of modeling technique (e.g., decision trees, association analysis, clustering, etc.) would you use and why?

5. You work for a cable service provider. You provide a variety of services for your customers. Your company provides cable TV, home phone, security systems, and internet services.

Your customer base is very diverse. Your customers could be male/female, young/old, single/married/married with children, etc. You have a wide range of ethnic backgrounds and income levels.

You want to make your marketing campaigns more effective. This means targeting the right groups with the right messages using the right media. (For example, marketing via social media may be more or less effective for 18 year olds as compared to 80 year olds).

You have been tasked to use the customer database and determine what the different customer segments are and what they look like. Then, once you figure out what the unique groups are, you can go through and develop a targeted campaign for each group. What type of modeling technique (e.g., decision trees, association analysis, clustering, etc.) would you use to determine the different market segments and why?

Text and Web Analytics

6. When we do text analytics, we read in the data, we transform the data into documents, and then we must generate a term/document matrix. This term/document matrix is what we use to perform analysis.

Generation of the term/document matrix involves some processing of the document (see figure on the right).

Briefly describe each step and what it does.

a. Tokenize:

b. Transform Cases:

c. Filter Stopwords (English):

d. Filter Tokens by Length:

e. Stem (Porter):

7. Briefly describe what “Sentiment Analysis” is and how it might be used by a company.

8. What is the difference between text mining and data mining?

9. FINAL EXAM – 2018 – BETTER UNDERSTAND ATTRITION “projects”

Write me a short report that tells me the following (I’d like for this report to be uploaded in a separate standalone word file and look like something you would give an employer):

Business Scenario – write this like you worked for the company. Tell me what the issue is you are exploring and why.

What you did and why you did it – just discuss the technique you used, why it was appropriate and what you did. If you did several iterations, let me know what the final configuration was. I don’t need to know everything that went on – just what you did to get the final results.

What you found – tell me everything you found/learned. Include screen shots, graphs, etc. Anything appropriate to communicate what you found. Do NOT show me everything that was generated – just those things that support your “findings”.

Recommendations - What impact this would have to the business AND what your recommendations are for the business.

Big Data Analytics Tools./ Final Exam/HR-BalancedSheet.sta

Big Data Analytics Tools./Final Exam materials/1 BINS 4352 - Association Analysis.pptx

Association Analysis BINS 4352

Learning Objectives

Gain an understanding of how Association Analysis is used

Understand how Associations are created and how to interpret/evaluate those Associations

Discuss and understand Association metrics – Lift, Support, and Confidence

Gain familiarity with RapidMiner

Association Analysis (Market Basket Analysis)

This is a widely used and, in many ways, one of the most successful data mining algorithms.

It can be used to determine what products people purchase together.

Uses

Stores can use this information to determine store layout and product placement

Direct marketers can use this information to determine which new products to offer to their current customers.

Inventory policies can be improved if reorder points reflect the demand for the complementary products.

Any application where you are looking to see if there is a pattern where strong associations are present

Parable Of “Beer And Diapers”

Customers who bought diapers at a grocery store between 5-7pm also tend to buy beer.

This is a good example of the business value present in big data analytics.

More than a parable – it was the result of a study commissioned by Osco in the 1990s and represented a starting point in big data analytics

The finding led to the notion that discovering uncommon relationships in data can be used to drive business value.

Association Rules for Market Basket Analysis

Rules are written in the form “left-hand side implies right-hand side” and an example is:

Yellow Peppers IMPLIES Red Peppers, Bananas

To make effective use of a rule, three numeric measures about that rule must be considered:

(1) support

(2) confidence

(3) lift

Measures of Predictive Ability

Support and Confidence: An Illustration

RULE	SUPPORT	CONFIDENCE	LIFT
A => D	2/5	2/3	(2/3)/(2/5) = 1.67
C => A	2/5	2/4	(2/4)/(2/5) = 1.25
A => C	2/5	2/3	(2/3)/(2/5) = 1.67
B & C => D	1/5	1/3	(1/3)/(1/5) = 1.67

A Note On Lift

Lift is an interesting measurement and one that has undergone a great deal of scrutiny

For our purposes we defined Lift as Confidence/Support

However, there are other ways to calculate this measure

Some have argued that one must take into account the frequency of the observation

You don’t necessarily want a product that is in 100,000 transactions to be penalized over a product that is involved in 10 transactions simply due to the number of occurrences (or vice versa)

As such – when looking at this value in a tool keep in mind that it is the “relative” value that is important and not the “absolute” value.

Market Basket Analysis Methodology

We first need a list of transactions and what was purchased.

Receipts from stores

This may have to be “reformatted” depending on the tool that you’re using

Next, we choose a list of products to analyze, and tabulate how many times each was purchased with the others.

The diagonal of the table shows how often a product is purchased in any combination, and the off-diagonals show which combinations were bought.

A Convenience Store Example

Consider the following simple example about five transactions at a convenience store:

Transaction 1: Frozen pizza, cola, milk

Transaction 2: Milk, potato chips

Transaction 3: Cola, frozen pizza

Transaction 4: Milk, pretzels

Transaction 5: Cola, pretzels

These need to be cross tabulated and displayed in a table.

A Convenience Store Example (cont)

The diagonal shows how many times a product was purchased (in any combination)

Pizza and Cola sell together more often than any other combo; a cross-marketing opportunity?

Milk sells well with everything – people probably come here specifically to buy it.

Product Bought	Pizza also	Milk also	Cola also	Chips also	Pretzels also
Pizza	2	1	2	0	0
Milk	1	3	1	1	1
Cola	2	1	3	0	1
Chips	0	1	0	1	0
Pretzels	0	1	1	0	2
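The cross tabulation above can be reproduced from the five transactions. This is a minimal Python sketch; the short product names (Pizza, Chips) follow the table's column headings.

```python
from itertools import combinations

transactions = [
    {"Pizza", "Cola", "Milk"},
    {"Milk", "Chips"},
    {"Cola", "Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]
products = ["Pizza", "Milk", "Cola", "Chips", "Pretzels"]

# counts[a][b] = number of transactions containing both a and b;
# the diagonal counts[a][a] is the number of transactions containing a.
counts = {a: {b: 0 for b in products} for a in products}
for basket in transactions:
    for item in basket:
        counts[item][item] += 1
    for a, b in combinations(sorted(basket), 2):
        counts[a][b] += 1
        counts[b][a] += 1

for a in products:
    print(a.ljust(9), [counts[a][b] for b in products])
```

Each printed row matches the corresponding row of the table, with the diagonal giving the total purchases of that product.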

Using The Results

The tabulations can immediately be translated into association rules and the numerical measures computed.

Comparing this week’s table to last week’s table can immediately show the effect of this week’s promotional activities.

But, you need to be careful that the results were not impacted by some external event (e.g., bad weather)

Some rules are going to be trivial (hot dogs and buns sell together) or inexplicable (toilet rings sell only when a new hardware store is opened).
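As an illustration of turning the tabulation into a rule with its numerical measures, here is a minimal Python sketch for the rule Pizza => Cola on the five convenience store transactions. One assumption is labeled in the code: lift is computed with the commonly used definition, confidence divided by the support of the right-hand side, which is one of the "other ways to calculate this measure" and will not match the simplified Confidence/Support shorthand used earlier.

```python
transactions = [
    {"Pizza", "Cola", "Milk"},
    {"Milk", "Chips"},
    {"Cola", "Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def rule_measures(lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs => rhs."""
    rule_support = support(lhs | rhs)
    confidence = rule_support / support(lhs)
    # Assumption: common definition of lift, confidence / support(rhs).
    lift = confidence / support(rhs)
    return rule_support, confidence, lift

s, c, l = rule_measures({"Pizza"}, {"Cola"})
print(f"Pizza => Cola: support={s:.2f}, confidence={c:.2f}, lift={l:.2f}")
```

Here both pizza baskets also contain cola, so confidence is 1.0, and the lift above 1 says cola shows up with pizza more often than its overall frequency would suggest.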

Using The Results Barbie® => Candy

Forbes (Palmeri 1997) reported that a major retailer had determined that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars. The retailer was unsure what to do with this nugget. The online newsletter Knowledge Discovery Nuggets invited suggestions (Piatetsky-Shapiro 1998):

Put them closer together in the store.

Put them far apart in the store.

Package candy bars with the dolls.

Package Barbie + candy + poorly selling item.

Raise the price on one, lower it on the other.

Barbie accessories for proofs of purchase.

Do not advertise candy and Barbie together.

Offer candies in the shape of a Barbie Doll.

Augmenting Data to Yield More Insights

The sales data can be augmented with the addition of virtual items.

For example, we could record that the customer was new to us or had children.

The transaction record might look like:

Item 1: Sweater Item 2: Jacket Item 3: New

This might allow us to see what patterns new customers have versus existing customers.

Limitations to Market Basket Analysis

A large number of real transactions are needed to do an effective basket analysis, but the data’s accuracy is compromised if all the products do not occur with similar frequency.

The analysis can sometimes capture results that were due to some external event

For example:

The success of previous marketing campaigns (and not natural tendencies of customers).

Weather or natural disaster.

Association Analysis in RapidMiner

The Dataset

The data is organized into “Transactions”

Each transaction represents a grocery store receipt

The items we are interested in include

Herring	Baguette	Avocado	Heineken
Olives	Sardines	Corned Beef	Peppers
Soda	Cracker	Bourbon	Artichoke
Coke	Apples	Chicken	Ham
Turkey	Ice Cream	Steak	Bordeaux

Data is coded where “YES” indicates that it was purchased and “NO” indicates that it was not purchased

Running Association Analysis in RapidMiner

Select

New Process

RapidMiner Studio Professional Main Menu

RapidMiner Studio is very similar in layout to SAS Enterprise Miner

Design Pane – where you lay out the analysis you want to run

Drag/Drop Objects from the Operator list into the Design Space

Importing Dataset

There are several ways to import data

I am going to read the Excel file that has pre-processed grocery store receipt data

I drag the “Read Excel” operator into the design space.

Connect the inp port on the side of the design space to the fil port on the operator

Many operators have 2 output ports – one for processed data and the other for an original data “pass through”

Configuring The Read Excel Operator

Parameters associated with the Read Excel Operator appear on the right side of the screen when the operator is selected.

Go to “Import Configuration”

Select The Excel File

Select the data file that you want to import

Select “Next”

Preview The Data

The file that you import can contain multiple sheets

At this point you can select the sheet and the range of cells that you wish to import

The data file we are working with has 1 sheet and by default all of the entries are selected

Select “Next”

Annotating Data

Now you have the opportunity to add annotations to the data

We don’t need to set an attribute name for this data set

Select “Next”

Selecting attribute types

RapidMiner tries to determine the types of the attributes from the data.

For Association Analysis we need to set the types to either “Binomial” or “Nominal”

I am going to select “Binomial”

Select “Next”

Finished !

Once all of the data types are changed to Binomial – Select “Finish”

Attribute Selection

Next we need to select the attributes that will be used in the analysis.

This can be found on the Operator Pane under Blending -> Attributes -> Selection

Drag/Drop Select Attributes into the Design

Select Attributes

Connect the “out” port of Read Excel to the “exa” port on the Select Attributes.

Select the Select Attributes Operator – the parameters for the operator will appear on the right side of the screen.

Selecting Attributes

On the Parameters Pane

Select the “subset” Attribute filter type

Select “Select Attributes”

The Select Attributes dialog box appears

Select all of the attributes except for “Transaction”

This is an ID for the transaction and is not needed for the analysis

FP-Growth (Frequency Calculations)

Next we drag/drop an FP-Growth operator into the design

Connect the “exa” port of Select Attributes to “exa” of FP-Growth

The FP Growth operator determines the “frequent item sets”

A frequent item set denotes the items (products) in the set that have been purchased together frequently (in a certain ratio of transactions)

We also need to define the positive value (open advanced parameters)
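To show what "frequent item sets" means, here is a minimal Python sketch that finds them by brute-force enumeration. It is not the FP-Growth algorithm (which reaches the same sets far more efficiently on large data), the baskets are invented using item names from the slide, and the 0.4 minimum support is a hypothetical setting.

```python
from itertools import combinations

# Invented baskets using item names from the slide; not the course data.
baskets = [
    {"Turkey", "Baguette", "Ham", "Olives"},
    {"Turkey", "Baguette", "Ham"},
    {"Baguette", "Olives"},
    {"Turkey", "Ham", "Olives"},
    {"Coke", "Ice Cream"},
]

def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            sup = sum(cset <= b for b in baskets) / len(baskets)
            if sup >= min_support:
                result[cset] = sup
                found_any = True
        if not found_any:
            break  # no frequent set of this size, so none larger either
    return result

frequent = frequent_itemsets(baskets, min_support=0.4)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The support printed next to each set is the "certain ratio of transactions" in which those items were purchased together.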

Create Association Rules

Drag/drop “Create Association Rules” from the Operator Pane into the Design space.

Connect the “fre” (frequencies) of FP-Growth to “ite” of Create Association Rules

Parameters driving the rule creation can be set (Confidence, lift, … and Thresholds)
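The effect of a confidence threshold on rule creation can be sketched in a few lines of Python. This is an illustration only: the baskets are invented, the 0.8 minimum confidence is a hypothetical setting, and the function name rules_from is made up for this sketch rather than taken from RapidMiner.

```python
from itertools import combinations

# Invented baskets using item names from the slide; not the course data.
baskets = [
    {"Turkey", "Baguette", "Ham", "Olives"},
    {"Turkey", "Baguette", "Ham"},
    {"Baguette", "Olives"},
    {"Turkey", "Ham", "Olives"},
    {"Coke", "Ice Cream"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def rules_from(itemset, min_confidence):
    """Yield (lhs, rhs, confidence) for each split of itemset that passes."""
    itemset = frozenset(itemset)
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            confidence = support(itemset) / support(lhs)
            if confidence >= min_confidence:
                yield sorted(lhs), sorted(itemset - lhs), confidence

rules = list(rules_from({"Turkey", "Baguette", "Ham"}, min_confidence=0.8))
for lhs, rhs, conf in rules:
    print(lhs, "=>", rhs, round(conf, 2))
```

Raising the threshold keeps fewer, stronger rules; lowering it lets through rules such as Turkey => Baguette, Ham that hold only part of the time.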

Ready To Run !

Connect the “rul” and “ite” ports of the Create Association Rules operator to the output (res) of the design space.

Select “Run” -

Output

We get 2 sets of output

One tab is for the FP-Growth operator and shows the Frequent Item Sets

The other contains the Association Rules

The Frequency data shows you the support for every combination of products in the data set

Association Rules (Sorted By Support)

Association Rules (Sorted By Confidence)

Association Rules (Sorted By Lift)

Interpreting The Rules

Rule: IF (turkey, baguette) THEN (ham, olives)

Support: The percentage of the time that the rule was true

26.7% of the time the basket contained both (turkey,baguette) and (ham, olives)

Confidence: The percentage of the time that the baskets that did contain (turkey, baguette) also contained (ham, olives)

85% of the time when the basket contained (turkey, baguette) it also contained (ham, olives)

Lift: a relative measure of how many times larger the Confidence is than the expected level (similar to what we discussed earlier: better than a baseline model)

Greater than 1 is desired

The larger the value the better

Association Analysis in Statistica

Same Data File

Reformatted data just a little for Statistica (in Excel)

Each line contains what was sold for that transaction

Link Analysis

Go to Data Mining -> Link Analysis

Select Non-sequential association analysis

Select Variables as we have done in the past

Transaction – is the Transaction ID

Food items – are Multi-response variables

Database Selection

If you run the analysis multiple times, you may have to select a database name.

This should only be the case if you exit the tool, reload data, and try to run again.

Select “OK” and run the analysis

Results

Association Rules

Frequent Itemsets

Rule Graph

Web Graph

Questions?

Big Data Analytics Tools./Final Exam materials/2 BINS 4352- Data Preparation.pptx

Data Preparation

Objectives

Provide a perspective on how data may have to be prepared for a given Analytics task

Understand the relationship between the data mining technique and the business problem being addressed

Gain an understanding of the types of data issues that might be found

Understand options when it comes to “fixing” or “addressing” data issues

Understand what impact “uncleaned” data may have on the analysis

Provide a “practical” and “applied” understanding of statistical concepts

Data in Data Mining

Data: a collection of facts usually obtained as the result of experiences, observations, or experiments.

Data may consist of numbers, words, images, …

Data: lowest level of abstraction (from which information and knowledge are derived).

Nominal – mutually exclusive categories with no inherent order (e.g., male, female)

Ordinal – order matters (e.g., Freshman, Sophomore, Junior, Senior).

Interval – a measure where the difference between two values is meaningful.

Ratio – all the properties of Interval, but with an absolute zero that means a complete absence of the quantity.

Inspecting Data and Preparation in Excel

Preparing Data In Excel

We are looking for anomalies in the data

Missing values

Values out of a defined range

Etc.

Once we find these values – we can

Repair the data

Impute missing values

Delete the response

DATA Dictionary

The data dictionary tab describes everything that is in the data set

If you download a data set from the Internet, it will often include a data dictionary in a separate file.

Raw data

The raw data is the data as it was collected.

There have been no changes/modifications to the data at this point.

Highlight Blanks

All of the data was selected in the spreadsheet

Go to FIND & SELECT in Excel (on the HOME Tab)

Select “GO TO SPECIAL”

Select “Blanks”

Then set the fill color to RED
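The same search for blanks can be done in code. The following is a minimal Python sketch, assuming the sheet has been exported as comma-separated text; the column names (id, q1, q2, q3) and the values are invented.

```python
import csv
import io

# Invented survey extract; empty fields stand for blank spreadsheet cells.
raw = """id,q1,q2,q3
1,4,,5
2,3,2,
3,5,4,4
"""

def find_blanks(text):
    """Return (row id, column name) for every empty cell."""
    blanks = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if value is None or value.strip() == "":
                blanks.append((row["id"], column))
    return blanks

blanks = find_blanks(raw)
print(blanks)
```

The resulting list plays the role of the red fill: it tells you which responses need a decision before analysis.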

Analyze Row / Columns

Count non-blank responses

Calculate the frequency of the “expected values”

Calculate difference from the means

Calculate single answer bias

How To Handle Anomalies In The Data

Decide what you’re going to do about the anomalies you found in the data.

Filter values

Repair the file (impute values)

Leave them alone

Delete responses

Etc.

If you are going to “change” data or “delete” data, move the original values to a separate sheet.

You are documenting what you did

This makes it easier to “undo” your change if you need to.
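A minimal Python sketch of that practice: impute missing values while keeping a log of what was changed, and leave the original rows untouched. Filling with the column mean is an assumption (the slides do not name an imputation method), and the age values are invented.

```python
from statistics import mean

def impute_with_mean(rows, column):
    """Fill None values in column with the column mean; log each change."""
    known = [r[column] for r in rows if r[column] is not None]
    fill = mean(known)
    change_log = []
    cleaned = []
    for index, row in enumerate(rows):
        new_row = dict(row)  # copy, so the original rows are not modified
        if row[column] is None:
            change_log.append(
                {"row": index, "column": column, "original": None, "new": fill}
            )
            new_row[column] = fill
        cleaned.append(new_row)
    return cleaned, change_log

# Invented data with one missing age.
rows = [{"age": 30}, {"age": None}, {"age": 40}, {"age": 50}]
cleaned, change_log = impute_with_mean(rows, "age")
print(cleaned[1], change_log)
```

The change log plays the role of the separate sheet: it documents each edit and makes it possible to undo it.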

Statistica Data

This sheet contains the “cleaned” data that you’re going to load into Statistica for further analysis.

The Excel manipulation was intended to address the “obvious” problems

Graphical Inspection Credit Scoring Data in Statistica

Application

As we have discussed practically all data will need some preparation

Moreover – that preparation may be slightly different based on the application and the type of analysis that you are doing

It is important to have a good understanding of both the data and what you’re trying to accomplish through the data mining process

Data preparation

Handling missing data and outliers

Selecting important variables

Sampling

Data preparation is specific to BOTH the data set and the task. The preparation method and decisions made during data preparation may change if either changes

Application For Credit Scoring Data

Business Need

A financial institution has data about their past customers.

These customers are classified as either good or bad credit risks based on their history with the institution.

The classification (good/bad) is based on whether or not the loan payment was delinquent and the magnitude of the loss

A financial institution needs a way to decide if and how much credit to extend to customers who apply for loans.

Business goal: reduce the losses due to bad loans

Goals of the data mining process

Determine the variables that are best predictors of credit risk

Find a high performance predictive model that classifies customers

Deploy that model to make decisions on future credit applications

Update the model as more data is collected

Credit Scoring Data Set

We are going to explore the credit scoring data set

This data will be used to explore

Data preparation

Classification

Try to keep in mind – this will be a “classification exercise”.

It can be applied to different data sets and domains where classification is appropriate

Examine the data

Data	Type
Credit rating	Categorical
Balance of current account	Categorical
Duration of credit	Continuous
Payment of previous credit	Categorical
Purpose of credit	Categorical
Amount of credit	Continuous
Value of savings	Categorical
Employed by current employer	Categorical
Installment in % of available income	Categorical
Marital status	Categorical
Gender	Categorical
Living in current household for	Categorical
Most valuable asset	Categorical
Age	Continuous
Further running credits	Categorical
Type of apartment	Categorical
Number of previous credits at this bank	Categorical
Occupation	Categorical
TrainTest	Categorical

Credit Scoring

Start by looking at the credit scoring application and the business need

Review the variables in the credit risk data set

Discuss the next steps for the data mining process

Classification

Classification can be used to classify a variable with 2 or more groups

Find the probability of a particular predicted classification.

For example:

Loan denied

Loan approved

Examine The Data

Below is the data in Statistica

It was opened by FILE-> OPEN

Look At Some Histograms Of The Data

Histograms

Credit rating is the dependent variable – it is the one we want to make predictions for

Notice that there have been more than twice as many customers with good credit as compared to bad

This may mean that we need to adjust our sample to keep the analysis from being “good” credit biased

Histograms

Here we have the number of previous credits at the bank

5-6 and 7 or more are relatively small compared to the other categories

Hence, we may want to recode the data to have a 5 or more category

This is a good general rule of thumb
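Recoding sparse categories into one combined category can be sketched as below. This is a minimal Python illustration; the category labels follow the histogram, while the counts are invented.

```python
from collections import Counter

# Map the two sparse categories to one merged label.
RECODE = {"5-6": "5 or more", "7 or more": "5 or more"}

def recode(value):
    """Map sparse categories to the merged label; leave others unchanged."""
    return RECODE.get(value, value)

# Invented counts, chosen only so the two top categories are small.
previous_credits = ["one"] * 6 + ["2-4"] * 3 + ["5-6"] * 2 + ["7 or more"]
recoded = [recode(v) for v in previous_credits]
print(Counter(recoded))
```

After recoding, the merged "5 or more" group has enough cases to be treated like the other categories.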

Remaining Variables

Note the majority of customers either have no previous credit or have paid back their previous loans

Remaining Variables

Note that there have been more than twice as many male customers as female customers

Remaining Variables

Age is interesting in that you would expect that customers need to be at least 18 years old to apply for credit, so we need to make sure that is the case in the 15-20 year old group

Remaining Variables

Next Steps

We have stated the business goals and have data available to do the analysis

We have visually inspected the data in order to gain a high-level understanding of the data

We still need to do more work here

But, we have identified our dependent variable (Credit Rating) and the potential predictor variables

We need to continue exploring the data and look at things that we can do to prepare the data for analysis

Further Analysis

Go to the interactive drill down tool

Select the “Drill Variables”

Select “Payment of Previous Credits”

Drill Down

Select “No Previous Credits”

Go back to the “Drill down variables” and select “Number of previous credits at this bank”

Select “Brush”

Previous payment

So here we have an apparent contradiction.

We have drilled down to look at customers where the Payment of Previous Credits = no previous credits

But, yet the number of previous credits at this bank has values for 2-4, 5-6, and 7 or more?
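A cross-field check like this drill down can also be written as a rule. The sketch below is a minimal Python illustration with invented records and made-up field names; treating the category "one" as consistent with "no previous credits" (on the reading that it counts the current credit) is an assumption.

```python
# Invented records; field names are made up for this sketch.
records = [
    {"id": 1, "payment_of_previous_credits": "no previous credits",
     "previous_credits_at_bank": "one"},
    {"id": 2, "payment_of_previous_credits": "no previous credits",
     "previous_credits_at_bank": "2-4"},
    {"id": 3, "payment_of_previous_credits": "paid back",
     "previous_credits_at_bank": "5-6"},
    {"id": 4, "payment_of_previous_credits": "no previous credits",
     "previous_credits_at_bank": "7 or more"},
]

def contradictory(record):
    """Flag 'no previous credits' combined with more than one credit at the bank."""
    return (
        record["payment_of_previous_credits"] == "no previous credits"
        and record["previous_credits_at_bank"] != "one"
    )

flagged = [r["id"] for r in records if contradictory(r)]
print(flagged)
```

Flagged records are candidates for review with whoever owns the data, since either field could be the one that is wrong.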

Scatter Plot Matrix

A scatter plot matrix can help us look for potential outliers

Scatter Plot Matrix

We can combine this with the Drill Down and look at scatter plots of all of the continuous variables with respect to those where the Payment of Previous Credits = no previous credits

Scatter Plot – Duration Of Credit Vs Amount Of Credit

Scatter Plot

We can look for outliers here

If we know that there is a maximum loan amount, then we can remove those that are greater than that value

If we know that there is a maximum duration, we can remove all of those that are greater than that

Other Graphical Techniques

Box Plots

A box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles.

Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.

Outliers may be plotted as individual points.

Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution.

The spacing between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers.

There Are Several Variations of Box Plots

The “box” :

The band inside the box is the second quartile (the median).

Statistica gives you the option to make this the “mean”

The ends of the “whiskers” can represent several possible alternative values:

Min and Max of all of the data

Lowest datum still within 1.5 interquartile range (IQR) of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile (often called the Tukey boxplot)

one standard deviation above and below the mean of the data

Etc.
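The Tukey variant can be worked through in code. This is a minimal Python sketch with invented values; it uses the standard library's default quartile method, and other tools (including Statistica and SPSS) may compute quartiles slightly differently.

```python
from statistics import quantiles

def tukey_summary(values, k=1.5):
    """Quartile-based box plot summary with Tukey-style whiskers."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        # Whiskers end at the most extreme data points inside the fences.
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Invented durations (months); 45 is planted as an extreme value.
summary = tukey_summary([7, 9, 10, 12, 13, 14, 15, 18, 45])
print(summary)
```

The whiskers stop at 7 and 18, the last data points within 1.5 IQR of the box, and 45 is reported separately as an outlier.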

Box Plot

These may seem more “primitive” than a histogram, but they do have some advantages

They take up less space – so they are particularly useful in comparing distributions among several groups

The number and width of bins can greatly impact the appearance of a histogram

Box plots are particularly good at identifying outliers in continuous data

Box Plots in Statistica

Go to “GRAPHS” and select “Box Plots”

Graph Type

Box Whiskers

Regular (will give you 1 variable per graph)

Multiple (will give you 1 graph with all on it)

Variables – where you select the variables for analysis

Box Plots

Leave the defaults for Grouping Intervals

Middle point

Value – median

Style – determines the graphics that show the median in the box

Box Plots

I have re-selected “Multiple” Graph type

I prefer the median be marked with a “Line”

Box Plots

The graph is fairly crowded – but, I can see outliers that are identified by the small circles.

I will re-run this with fewer variables to see if we can see it better

Box Plots

Now we can more clearly see where the outliers are in the data set.

These need to be examined for deletion

Next we will look at this in SPSS

Marking The Outliers

Right clicking on the outliers will pull up a menu where you can tell Statistica to “Mark the Outliers”

SPSS – Box Plots

The data is loaded into SPSS

This can be done by loading the same file we used for Statistica

Go to “Graphs” and then to “Legacy Dialogs” and Select “Box Plots”

Select “Simple” and “Summaries of separate variables”

SPSS – Box Plots

Next we select the variables and move them to the “Boxes Represent” area

Select “OK”

SPSS – Box Plots

We see output very similar to Statistica.

The outliers are marked with “o” on the graph

In SPSS the default is to display the “number” of the response with the outlier

Inspecting Outliers In Box Plot

We want to be very conservative when identifying responses as “outliers”

Count how many “outliers” each respondent has

For example, response 188 has 3 outliers, 187 has 4 outliers, and so on.

I may delete 1 or 2 responses that have a “large” number of outliers and then rerun the box plot.

This will cause things to shift a bit

I then iterate until I’m happy with the data.

Leaving outliers in the data is fine – this may be a “true” response
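
The counting step described above (how many "outliers" each respondent has) can be sketched as follows. This is a minimal sketch under stated assumptions: outliers are defined by the 1.5 × IQR rule, the helper name `outlier_counts` is hypothetical, and the eight responses are invented survey answers, not the course dataset.

```python
import statistics

def outlier_counts(rows, k=1.5):
    """For each row (respondent), count how many of its values fall
    outside the 1.5 * IQR fences of their column (variable)."""
    counts = [0] * len(rows)
    for c in range(len(rows[0])):
        column = [row[c] for row in rows]
        q1, _median, q3 = statistics.quantiles(column, n=4, method="inclusive")
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        for i, value in enumerate(column):
            if value < low or value > high:
                counts[i] += 1
    return counts

# Hypothetical responses: 8 respondents x 3 survey items
responses = [
    [4, 5, 4],
    [4, 4, 5],
    [5, 5, 4],
    [5, 4, 5],
    [4, 5, 5],
    [5, 4, 4],
    [4, 5, 4],
    [1, 1, 7],
]
counts = outlier_counts(responses)
print(counts)
```

A respondent with a high count (the last one here) is a candidate for closer inspection; after removing a response the fences shift, which is why the slide suggests rerunning the box plot.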

QUESTIONS?

Data

Categorical: Nominal, Ordinal

Numerical: Interval, Ratio

[Figure: Histogram of Credit Rating. CreditScoring 19v*1000c. Fitted curve: Credit Rating = 1000*1*Normal(Location=0.7, Scale=0.4585). Categories: bad, good. Y-axis: No of obs.]

[Figure: Histogram of Number of previous credits at this bank. CreditScoring 19v*1000c. Fitted curve: Number of previous credits at this bank = 1000*1*Normal(Location=1.407, Scale=0.5777). Categories: one, 2-4, 5-6, 7 or more. Y-axis: No of obs.]

[Figure: Histogram of Balance of Current Account. CreditScoring 19v*1000c. Fitted curve: Balance of Current Account = 1000*1*Normal(Location=2.577, Scale=1.2576). Categories: no running account, no balance, <= $300, > $300. Y-axis: No of obs.]

[Figure: Histogram of Duration of Credit. CreditScoring 19v*1000c. Fitted curve: Duration of Credit = 1000*10*Normal(Location=20.903, Scale=12.0588). X-axis: Duration of Credit (-10 to 80). Y-axis: No of obs.]

[Figure: Histogram of Payment of Previous Credits. CreditScoring 19v*1000c. Fitted curve: Payment of Previous Credits = 1000*1*Normal(Location=2.545, Scale=1.0831). Categories: hesitant, problematic running accounts, no previous credits, no problems with current credits, paid back. Y-axis: No of obs.]

[Figure: Histogram of Purpose of Credit. CreditScoring 19v*1000c. Fitted curve: Purpose of Credit = 1000*1*Normal(Location=2.828, Scale=2.7444). Categories: other, new car, used car, furniture, television, household appliances, repair, education, vacation, retraining, business. Y-axis: No of obs.]

[Figure: Histogram of Amount of Credit. CreditScoring 19v*1000c. Fitted curve: Amount of Credit = 1000*5000*Normal(Location=4579.7472, Scale=3951.8525). X-axis: Amount of Credit (-$5,000.00 to $30,000.00). Y-axis: No of obs.]

[Figure: Histogram of Value of Savings. CreditScoring 19v*1000c. Fitted curve: Value of Savings = 1000*1*Normal(Location=2.105, Scale=1.58). Categories: no savings, <140, 140-700, 700-1400, >1400. Y-axis: No of obs.]

[Figure: Histogram of Employed by Current Employer for. CreditScoring 19v*1000c. Fitted curve: Employed by Current Employer for = 1000*1*Normal(Location=3.384, Scale=1.2083). Categories: unemployed, <1 year, 1-5 years, 5-8 years, >8 years. Y-axis: No of obs.]

[Figure: Histogram of Installment in % of Available Income. CreditScoring 19v*1000c. Fitted curve: Installment in % of Available Income = 1000*1*Normal(Location=2.973, Scale=1.1187). Categories: >35, 25-35, 15-25, <15. Y-axis: No of obs.]

[Figure: Histogram of Marital Status. CreditScoring 19v*1000c. Fitted curve: Marital Status = 1000*1*Normal(Location=2.682, Scale=0.7081). Categories: divorced/living apart, divorced/living apart/married, single, married/widowed. Y-axis: No of obs.]

[Figure: Histogram of Gender. CreditScoring 19v*1000c. Fitted curve: Gender = 1000*1*Normal(Location=1.31, Scale=0.4627). Categories: male, female. Y-axis: No of obs.]

[Figure: Histogram of Living in Current Household for. CreditScoring 19v*1000c. Fitted curve: Living in Current Household for = 1000*1*Normal(Location=2.845, Scale=1.1037). Categories: <1 year, 1-5 years, 5-8 years, >8 years. Y-axis: No of obs.]

[Figure: Histogram of Most Valuable Assets. CreditScoring 19v*1000c. Fitted curve: Most Valuable Assets = 1000*1*Normal(Location=2.358, Scale=1.0502). Categories: no assets, car, life insurance, ownership of house or land. Y-axis: No of obs.]

[Figure: Histogram of Age. CreditScoring 19v*1000c. Fitted curve: Age = 1000*5*Normal(Location=33.544, Scale=11.3498). X-axis: Age (10 to 80). Y-axis: No of obs.]

[Figure: Histogram of Further running credits. CreditScoring 19v*1000c. Fitted curve: Further running credits = 1000*1*Normal(Location=2.675, Scale=0.7056). Categories: at other banks, at department store, no further running credits. Y-axis: No of obs.]

[Figure: Histogram of Type of Apartment. CreditScoring 19v*1000c. Fitted curve: Type of Apartment = 1000*1*Normal(Location=1.928, Scale=0.5302). Categories: free, rented, owned. Y-axis: No of obs.]

[Figure: Histogram of Occupation. CreditScoring 19v*1000c. Fitted curve: Occupation = 1000*1*Normal(Location=2.904, Scale=0.6536). Categories: unskilled with no permanent residence, unskilled with permanent residence, skilled employee, executive/self-employed. Y-axis: No of obs.]

[Figure: Histogram for brushing: Number of previous credits at this bank. N Total: 1000, Selected: 530. Selection: Payment of Previous Credits = no previous credits. X-axis: Number of counts (0 to 500). Categories: one, 2-4, 5-6, 7 or more.]

[Figure: Correlations. N Total: 1000, Selected: 1000. Variables: Duration of Credit, Amount of Credit, Age.]

[Figure: Correlations. N Total: 1000, Selected: 530. Selection: Payment of Previous Credits = no previous credits. Variables: Duration of Credit, Amount of Credit, Age.]

[Figure: Scatterplot of Amount of Credit against Duration of Credit. CreditScoring 19v*1000c. Fitted line: Amount of Credit = 298.4367+204.818*x. X-axis: Duration of Credit (0 to 80). Y-axis: Amount of Credit (-$2,000.00 to $28,000.00).]

[Figure: Box Plot of multiple variables. 1-ORIGINAL-FULL-DATASET-WALK-THROUGH 61v*261c. Median; Box: 25%-75%; Whisker: Non-Outlier Range; Outliers and Extremes marked. Variables: CCSC_1, CCOC_1, CCAC_1, CCAC_6, HCA_5, UCA_4, PEOU_3, PU_4, PP_5, BI_5, PF_4, PIIT_1.]

[Figure: Box Plot of multiple variables. 1-ORIGINAL-FULL-DATASET-WALK-THROUGH 61v*261c. Median; Box: 25%-75%; Whisker: Non-Outlier Range; Outliers and Extremes marked. Variables: UCA_1, UCA_3, UCA_5, PEOU_1, PEOU_3, PU_1, PU_3, PP_1, PP_3, PP_5.]

Big Data Analytics Tools./Final Exam materials/2 BINS 4352 - Cluster Analysis.pptx

Cluster Analysis

Cluster Analysis - Review

Outline

What is it? What is a cluster?

How is it different from a decision tree?

What is distance and linkage?

What is hierarchical clustering?

What is a scree plot and a dendrogram?

What is non-hierarchical clustering (k-means)?

How to learn it in detail?

Simple Case Study

Student	Physics	Calculus
Joe	15	20
Bill	20	15
Paula	26	21
Jane	44	52
Jack	50	45
Carlos	57	38
Carla	80	85
Russell	90	88
Eddie	98	98

If we look at the student data on the left, we can easily see a pattern emerging

There are natural groupings of students that seem to be apparent in the data.

Plotting these on a 2-dimensional scatter plot makes it more visible

Similarities And Dissimilarities

If we look at the objects in any one of the groupings, they are all very similar to the other objects in that grouping

Also, if we look at any one object in a grouping, it is very dissimilar to any object in another grouping

This gives rise to the notion of:

Homogeneous within and Heterogeneous across based on characteristics

Clusters

These groupings are clusters

They represent a “natural grouping of similar objects” based on a collection of input parameters

Now, if you add a third or fourth dimension (e.g., English, History), then the clusters may change

However, the way in which they are constructed is the same

Clusters

There is no objective function (i.e., an equation to be optimized given certain constraints with variables that need to be minimized or maximized such as trying to express a business goal in mathematical terms)

There is no dependent variable

This is sometimes called subjective segmentation

The segmentation is developed on its own based on the values of the input variables

It is called an unsupervised learning technique.

Once the segments are developed

They need to be understood

You need to decide how you are going to deal with the segments that have emerged

How Is This Different From A Decision Tree

Decision tree technique requires a clearly defined dependent variable

For example, GOOD/BAD credit

The technique is based on identifying those variables (characteristics) which are closely associated with the dependent variable.

Distances

Distances (cont)

[Figure: scatter plot of the students, with points labeled Paula, Bill, Joe, Jane, Jack, Carlos, Carla, Russell, Eddie]

Distance Between 2 Clusters

Sometimes called the linkage function: the inter-cluster distance

How do you calculate the distance between 2 clusters?

Single linkage

Calculate the distance between each point in one cluster to each point in the neighboring cluster and then find the shortest distance.

Complete linkage

Similar to single linkage – except we look for the furthest distance

Centroid distance

Calculate the “center” of each cluster and then calculate the distance between centroids
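
The three linkage definitions above can be sketched on two of the student groups from the case study. This is only an illustration: it assumes Euclidean distance between points (the slides do not state which distance measure is used), and the function names are hypothetical.

```python
import math

def single_linkage(a, b):
    """Shortest distance between any point in a and any point in b."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Furthest distance between any point in a and any point in b."""
    return max(math.dist(p, q) for p in a for q in b)

def centroid(cluster):
    """Mean point of a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_distance(a, b):
    """Distance between the centers of the two clusters."""
    return math.dist(centroid(a), centroid(b))

# (Physics, Calculus) scores from the case study table
low = [(15, 20), (20, 15), (26, 21)]   # Joe, Bill, Paula
mid = [(44, 52), (50, 45), (57, 38)]   # Jane, Jack, Carlos

single = single_linkage(low, mid)
complete = complete_linkage(low, mid)
center = centroid_distance(low, mid)
print(round(single, 2), round(complete, 2), round(center, 2))
```

Single linkage is driven by the closest pair (Paula and Jack), complete linkage by the furthest pair (Joe and Carlos), and the centroid distance falls between the two.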

Hierarchical Clustering

Used when you have a small number of observations (usually hundreds)

You cannot use this method for a large dataset because it becomes computationally impractical

The way this works is to join objects (cases) together into successively larger clusters using some measure of similarity or distance

During the clustering process it shows how the clusters are formed

The result of the clustering is the hierarchical tree

In SAS this is done by using the “proc cluster” command

In Statistica this is called Joining or Tree Clustering

Hierarchical Clustering

We begin with each case in a cluster by itself

In each step we slowly “relax” the criterion which defines “uniqueness”

In other words, we lower the threshold for declaring that two or more objects belong to the same cluster

As a result we group more and more objects together and each “layer” consists of increasingly dissimilar objects

In the last step – all objects are grouped together into a single cluster

Scree Plot

When we look at a scree plot what we have is the “within cluster variance”

Total variance = between variance + within variance

The “elbow” in the graph indicates the optimal number of clusters

When the number of clusters = 1, the within group variance = total variance
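
The decomposition above can be checked numerically on the student data. The sketch works with sums of squared deviations rather than variances (the two differ only by a constant divisor), and it assumes the three "natural" groups from the case study as the clustering; the helper names are hypothetical.

```python
def mean_point(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def sum_sq(points, center):
    """Sum of squared Euclidean distances from each point to center."""
    return sum((x - center[0]) ** 2 + (y - center[1]) ** 2 for x, y in points)

# Assumed clustering: the three visible groups of students
clusters = [
    [(15, 20), (20, 15), (26, 21)],
    [(44, 52), (50, 45), (57, 38)],
    [(80, 85), (90, 88), (98, 98)],
]
all_points = [p for cluster in clusters for p in cluster]
grand_mean = mean_point(all_points)

total = sum_sq(all_points, grand_mean)
within = sum(sum_sq(c, mean_point(c)) for c in clusters)
between = sum(len(c) * sum_sq([mean_point(c)], grand_mean) for c in clusters)

print(round(total, 1), round(within, 1), round(between, 1))
```

A scree plot tracks the within part as the number of clusters grows; with a single cluster the cluster mean equals the grand mean, so the within part equals the total.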

Dendrogram

When the data contain a clear structure in terms of clusters, then its structure is often reflected in the hierarchical tree

The result of a successful analysis is that you can detect and interpret the structure by looking at the branches

Hierarchical Clustering - Students

Everything starts out in a cluster by itself

Find the closest objects.

Merge those into a cluster.

Recalculate distances (from center of new cluster)

Find the closest objects.

Merge those into a cluster.

Recalculate distances (from center of new cluster)

Joe Bill Paula

Jack

Jane

…
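
The merge loop on this slide can be sketched in a few lines. This is a minimal sketch, not Statistica's Joining (Tree Clustering) implementation: it assumes Euclidean distance, measures cluster-to-cluster distance between centers (as the slide's "from center of new cluster" suggests), and stops at a requested number of clusters. The function name `agglomerate` is hypothetical.

```python
import math

def centroid(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def agglomerate(named_points, target_clusters):
    """Merge the two closest clusters until target_clusters remain."""
    # Everything starts out in a cluster by itself.
    clusters = [[name] for name in named_points]
    while len(clusters) > target_clusters:
        best = None
        # Find the closest pair of clusters (distance between centers).
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = centroid([named_points[n] for n in clusters[i]])
                cj = centroid([named_points[n] for n in clusters[j]])
                d = math.dist(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge those into a cluster; centers are recomputed on the next pass.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

students = {
    "Joe": (15, 20), "Bill": (20, 15), "Paula": (26, 21),
    "Jane": (44, 52), "Jack": (50, 45), "Carlos": (57, 38),
    "Carla": (80, 85), "Russell": (90, 88), "Eddie": (98, 98),
}
result = agglomerate(students, target_clusters=3)
print(result)
```

On the student data the first merge joins Joe and Bill, then Paula joins them, which matches the grouping shown on the slide. Recording the merge distance at each step would give the heights needed to draw the hierarchical tree.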

K-means Clustering

Non-hierarchical

Used when you have a large number of observations

You decide up front how many clusters you need (k)

K-means algorithm

Partition objects into k non-empty subsets (randomly)

Compute the centroid for each of the clusters

Centroid is the center (i.e., mean point of the cluster)

This defines the “seed” point for each cluster

Assign each object to the cluster with the nearest seed point

Go back to step 2 and repeat until the assignment does not change

In SAS you do this by using the “proc fastclus” command
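
The four steps of the algorithm above can be sketched in plain Python. This is an illustrative sketch, not the SAS or Statistica implementation: it assumes Euclidean distance, the `initial_labels` parameter is a hypothetical convenience so the example is reproducible (leave it out for the random partition the slide describes), and the handling of an empty cluster is an assumption.

```python
import math
import random

def centroid(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def k_means(points, k, initial_labels=None, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: partition the objects into k subsets (randomly unless given).
    if initial_labels is not None:
        labels = list(initial_labels)
    else:
        labels = [rng.randrange(k) for _ in points]
    centers = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid (seed point) of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
            elif centers[c] is None:
                # Assumed fallback: an empty cluster is seeded with a random point.
                centers[c] = rng.choice(points)
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Step 4: repeat until the assignment does not change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# (Physics, Calculus) scores from the case study, in table order
scores = [(15, 20), (20, 15), (26, 21), (44, 52), (50, 45),
          (57, 38), (80, 85), (90, 88), (98, 98)]
labels, centers = k_means(scores, k=3, initial_labels=[0, 1, 2, 0, 1, 2, 0, 1, 2])
print(labels)
```

Starting from a deliberately mixed-up partition, the assignments settle on the three natural groups after two reassignment passes. K-means can converge to different solutions from different starting partitions, so in practice it is usually run several times.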

EM Clustering in Statistica

Getting started

EM algorithm

Uses the distributions of the (continuous) data to find the clusters

You specify the distribution technique

Very similar to K-means clustering

First step – hypothesize how many clusters will be in the data

With K-means and EM, the optimum number of clusters can be determined with V-fold cross validation

Setting Up A Cluster Analysis

Go to DATA MINING -> Cluster

This will pull up the K-means / EM clustering dialog box

Use EM clustering

Select all of the variables (Except for Sample)

Select Variables And Configure V-fold Validation

Select EM algorithm

Number of clusters 2

Number of iterations 50

Keep the defaults on the EM tab; we have limited options here because all of the variables are categorical

On the Validation tab select

V-fold cross validation

Statistica will search for the optimum number of clusters between 2 and 25 clusters

Select “OK”

Results - Similar To K-means

We see that it used the EM algorithm with normal distribution

V-fold cross validation was used with 5 folds

3 clusters were created

Graph Of Cost Sequence

Classification Probabilities For Each Case

Graph of Frequencies

This provides a graphical representation of the information contained in the frequency table(s)

What Do We Do With This Data?

Generally what we want to do with the results of a cluster analysis is start to build a narrative that describes the classification that occurred

We want to explain how this classification scheme can be used to describe the phenomena we are interested in

Questions?

[Figure: Graph of Cost Sequence. Best number of clusters: 3. X-axis: Number of clusters (2, 3, 4). Y-axis: -2 * log-likelihood (28.0 to 31.0).]

[Figure: Graph of frequencies for ANNUALINC. Number of clusters: 3 (Cluster 1, Cluster 2, Cluster 3). Categories: $75,000 or more; Less than $10,000; $50,000 to $74,999; $30,000 to $39,999; $10,000 to $14,999; $20,000 to $24,999; $40,000 to $49,999; $25,000 to $29,999; $15,000 to $19,999. Y-axis: Frequencies.]

[Figure: Graph of frequencies for SEX. Number of clusters: 3 (Cluster 1, Cluster 2, Cluster 3). Categories: Female; Male. Y-axis: Frequencies.]

[Figure: Graph of frequencies for MARSTATUS. Number of clusters: 3 (Cluster 1, Cluster 2, Cluster 3). Categories: Married; Single, never married; Divorced or separated; Living together, not married; Widowed. Y-axis: Frequencies.]

[Figure: Graph of frequencies for AGE. Number of clusters: 3 (Cluster 1, Cluster 2, Cluster 3). Categories: 45 thru 54; 25 thru 34; 14 thru 17; 55 thru 64; 18 thru 24; 65 and Over; 35 thru 44. Y-axis: Frequencies.]

[Figure: Graph of frequencies for EDUCATION. Number of clusters: 3 (Cluster 1, Cluster 2, Cluster 3). Categories: 1 to 3 years of college; College graduate; Grades 9 to 11; Graduated high school; Grad Study; Grade 8 or less. Y-axis: Frequencies.]

Big Data Analytics Tools./Final Exam materials/BINS 4352- Comparing Models and Model Deployment.pptx

Comparing Models and Model Deployment

Lift And Gains Charts

Lift chart

Shows the effectiveness of the model compared to no model

Gains chart

Shows the percentage of observations correctly classified for a given category (e.g., our dependent variable DEFAULT=“BAD”)

Creating Model Assessment Charts

Using the model, produce estimated probabilities for the target event for each case

Sort all cases by decreasing estimated probability.

If the model is good, cases where the target event actually shows up should have higher estimated probabilities, therefore should end up higher in the list

Split the cases into, say 10 bins, so that:

Bin #1 has the highest probabilities

Bin #10 the lowest probabilities

Now look at the number of cases where the target event actually happens.
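
The sort-and-bin procedure above can be sketched as follows. The helper names and the 20 scored cases are hypothetical; in practice the probabilities would come from the fitted model and the actual outcomes from the data.

```python
def decile_bins(cases, n_bins=10):
    """cases: list of (estimated_probability, actual_target) pairs,
    where actual_target is 1 if the target event occurred, else 0.
    Returns n_bins lists, bin #1 (highest probabilities) first."""
    ranked = sorted(cases, key=lambda case: case[0], reverse=True)
    size, extra = divmod(len(ranked), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        # Spread any remainder over the first few bins.
        end = start + size + (1 if b < extra else 0)
        bins.append(ranked[start:end])
        start = end
    return bins

def targets_per_bin(bins):
    """Number of cases in each bin where the target event actually happened."""
    return [sum(actual for _, actual in b) for b in bins]

# Hypothetical scored cases: the 6 highest-probability cases are targets
cases = [(i / 20, 1 if i >= 14 else 0) for i in range(20)]
bins = decile_bins(cases)
hits = targets_per_bin(bins)
print(hits)
```

For a good model the counts are concentrated in the first few bins, as in this idealized example.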

Creating Model Assessment Charts

Actually observed values of the target variable

Predicted Probabilities

Looking Inside 10 Bins

Good models assign high probabilities to cases which will actually have the target event (defaulting, donating, etc.)

The red balls represent where the target event “actually” occurred

The black balls represent where the target event did not actually occur

Bin #1

(high probabilities)

Bin #10

(low probabilities)

Percent Response

Percent Response shows what percentage of each bin’s contents are target cases

For example, (15/18) ~83% of the cases that ended up in bin 1 were observed to have actually become target events.

So here we are just looking at each “bin” separately.

83% of the cases in Bin 1 were the target

72% of the cases in Bin 2 were the target

39% of the cases in Bin 3 were the target

So on

Bin #1

(high probabilities)

Bin #10

(low probabilities)

Percent Captured Response

Percent Captured Response shows you what percentage of all target cases are “captured” by each bin

For example, the first bin captured (15/(15+13+7+3+3)) ~37%

So, here we are looking at “All of the target cases” and determining the percentage captured by each of the bins

Bin 1 captured 37% of the target cases

Bin 2 captured 32% of the target cases

Bin 3 captured 17% of the target cases

So on
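
Both measures can be computed directly from the per-bin target counts used in the slides (15, 13, 7, 3, 3, with no targets in bins 6 to 10). The sketch assumes every bin holds 18 cases, which is inferred from the 15/18 in the Percent Response example.

```python
targets = [15, 13, 7, 3, 3, 0, 0, 0, 0, 0]  # target cases per bin, bin #1 first
bin_size = 18                                # assumed cases per bin (from 15/18)

# Percent Response: share of each bin's cases that are targets
percent_response = [100 * t / bin_size for t in targets]

# Percent Captured Response: share of all targets that fall in each bin
percent_captured = [100 * t / sum(targets) for t in targets]

for b, (resp, capt) in enumerate(zip(percent_response, percent_captured), start=1):
    print(f"Bin {b}: response {resp:.1f}%, captured {capt:.1f}%")
```

Percent Response is computed within each bin, so the values do not sum to 100; Percent Captured Response is computed across all targets, so its values do.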

Bin #1

(high probabilities)

Bin #10

(low probabilities)

Baseline Model (No Model At All)

Assume target cases are assigned randomly into the 10 bins.

This would create a baseline model

Comparing our model to the baseline model allows us to see how much better our model can do.

This is shown in a “lift chart”

Lift Chart

For each bin, take the number of target cases in the model’s bin and divide it by the number of target cases in the corresponding baseline bin.

This is the lift value for that “bin”.

So, the lift chart shows the ratio of what the model captured to what the baseline captured

Bin #1 = 15/3 = 5

Bin #2 = 13/3 = 4.3

Bin #3 = 7/3 = 2.3
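
The per-bin ratios above amount to one division per bin. The sketch below reproduces them, taking the baseline of 3 target cases per bin directly from the slide; the function name `lift_values` is hypothetical.

```python
def lift_values(model_targets, baseline_targets_per_bin):
    """Lift for each bin: targets found by the model / targets expected at random."""
    return [t / baseline_targets_per_bin for t in model_targets]

# Target counts for the first five bins and the slide's baseline of 3 per bin
lift = lift_values([15, 13, 7, 3, 3], baseline_targets_per_bin=3)
print([round(v, 1) for v in lift])
```

A lift of 1.0 means the bin does no better than the baseline; values above 1.0 in the early bins are what the lift chart is meant to show.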

Lift value for Bin #1 = 15/3 = 5

Lift Chart

So, what we want is for more of the targets to be contained in the high probability buckets

The higher the model line is over the baseline (no model), the better our model

Lift value for Bin #1 = 15/3 = 5

[Figure: lift chart with the model line above the baseline; bin values labeled 15/3, 13/3, 7/3, 3/3, 0/3]

Lift Charts

A visual representation that shows the effectiveness of the model compared to no model at all (the baseline model)

Here we want the graph to be high on the left side and go down as it moves to the right

This indicates that more responses were correctly identified (higher probability) in the “earlier” bins

Gains Charts Are Very Similar To Lift Charts

Here when looking at the bins consider how many of the target responses have been identified.

Since we have 10 bins, the baseline model assumes that when 10% of the responses have been looked at, 10% of the targets will have been found.

However, after looking at the first 10% of the responses (bin #1), 36.6% of the targets have been found.

This represents an “improvement” over the baseline model

Gain for Bin #1 = 15/41 ≈ 0.366

Gains Chart

With the gain chart the “further” the line for the model is from the baseline model, the better the model

In other words, we want to maximize the area under the curve between the model and the baseline

Bin	1	2	3	4	5	6	7	8	9	10
Cumulative targets found	15/41	28/41	35/41	38/41	41/41	41/41	41/41	41/41	41/41	41/41
Gain	0.37	0.68	0.85	0.93	1.00	1.00	1.00	1.00	1.00	1.00
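
The cumulative gains can be reproduced from the per-bin target counts (15, 13, 7, 3, 3, then none). The sketch also builds the baseline, under which each additional 10% of cases is assumed to find another 10% of the targets.

```python
from itertools import accumulate

targets = [15, 13, 7, 3, 3, 0, 0, 0, 0, 0]  # target cases per bin, bin #1 first
total = sum(targets)                         # 41 target cases in all

cumulative = list(accumulate(targets))       # running count of targets found
gains = [c / total for c in cumulative]      # model curve
baseline = [(i + 1) / len(targets) for i in range(len(targets))]  # no-model curve

for b, (g, base) in enumerate(zip(gains, baseline), start=1):
    print(f"After bin {b}: model {g:.2f}, baseline {base:.2f}")
```

The gap between the two curves at each bin is what the gains chart displays; the larger the gap, the better the model.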

Gains Chart

Shows the percentage of observations correctly classified for a given category

The goal is to maximize the area between the model and the baseline model.

So, on the left the classification tree (red) performs better than the linear model (green)

Baseline Versus Perfect Model

In order to investigate the differences in model performance, we contrast our model to two models with extreme performance:

The baseline (bad performance)

The perfect model (excellent performance)

Model Deployment

Data Mining Models

Statistica can generate code for deployment using

Statistica Visual Basic

For deployment in the Statistica Data Miner Workspace

C and C++

Provide a framework for development of a custom deployment tool

PMML

Used in Statistica via Rapid Deployment

SAS

Used for scoring against data stored in a SAS database

Deployment to Statistica Enterprise

Code generator button on the Reports dialog will produce deployment code

Build Deployment Code

Build a C&RT Model

I built a C&RT model, adjusting parameters until I got fairly reasonable performance.

Deployment

You have the model the way that you want it and are ready to deploy

We are going to deploy the model so that it can be used in Rapid Deployment within Statistica

So –

Go to the Report TAB

Select Code Generator

We want to generate PMML code

PMML Code

Saving the PMML Code

Right click on the Tree PMML deployment code

Select Save Item As

Name the file

Make sure the file type is set to XML

Repeat The Process For CHAID

Adjust parameters and generate a model

Deploy the model in PMML format

Repeat For Boosted Tree

Adjust parameters and generate a model

Deploy the model in PMML format

Repeat For Random Forest

Adjust parameters and generate a model

Deploy the model in PMML format

Pull Models into Rapid Deployment

Rapid Deployment is on the Data Mining TAB

Select “Load Models from disk”

Select the models that were saved and select “Open”

Comparing the Models

4 models have been successfully loaded into Rapid Deployment

Notice that it pulled the variable information from the saved models

Check the box for “Include pred. probs. in output”

Also – Check the box for “Predict case(s) with missing data in inputs” – this is because our dataset has so much missing data

Start Comparing

Select the Summary of Predicted and residual values

It appears from this that the Boosted Tree has the lowest error rate of the 4 models.

Gains Chart

Go to the Lift Chart TAB

Select “Gains chart”

Click on “Lift chart” to see the chart

Here we want to maximize the area under the curve.

Boosted Tree looks like the best model

Lift Chart

Go to the Lift Chart TAB

Select “Lift chart (response %)”

Click on “Lift chart” to see the chart

Here we are looking at how well the models did for predictive accuracy

Again the Boosted Tree model performs best

Ensemble Models

What Is Ensemble Modeling?

Ensemble modeling is the process of running two or more related but different analytical models and then synthesizing the results into a single score or spread in order to improve the accuracy of the prediction.

A Random Forest may be thought of as a type of ensemble model.

A random forest generates a collection of trees and then those trees “vote” on what the prediction may be.

These trees are “similar” with respect to the variable definitions

But, are parameterized in such a way as to produce “different” trees

E.g., we may limit the number of predictor variables that can be considered at each node

Voting Across Models

Voting across models (a.k.a. Bagging) is particularly useful when we are working with smaller data sets

[Diagram: Input Data feeds four models (C&RT, CHAID, Random Forest, Boosted Tree), whose outputs are combined into a single Prediction]

Why Worry About Voting?

In data mining there are a variety of model building tools and techniques available.

No model is perfect and we do our best to “tune” the model to be as accurate as possible within necessary constraints

No model can capture all of the underlying relationships that exist in the data

Some may capture some aspects of the data, while other models capture other relationships

Using an ensemble modeling technique can improve the predictive accuracy

The collection of models may indeed be more accurate than any one model taken alone
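
A simple majority vote across models can be sketched as below. This is an illustration of the idea only, not how Statistica's Rapid Deployment computes its voted prediction: the function name is hypothetical, the example predictions are invented, and the alphabetical tie-break is an assumption.

```python
from collections import Counter

def voted_prediction(predictions):
    """predictions: dict mapping model name -> predicted class for one case.
    Returns the class predicted by the most models."""
    counts = Counter(predictions.values())
    best_count = max(counts.values())
    tied = sorted(label for label, n in counts.items() if n == best_count)
    return tied[0]  # assumed tie-break: first class in alphabetical order

# Hypothetical predictions for a single case
case = {
    "C&RT": "GOOD",
    "CHAID": "GOOD",
    "Random Forest": "BAD",
    "Boosted Tree": "GOOD",
}
vote = voted_prediction(case)
print(vote)
```

With an even number of models a tie is possible, so a real deployment needs an explicit rule for that case, for example voting on averaged predicted probabilities instead of class labels.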

Model Stability

Data mining can uncover complex relationships between variables

Data scientists use the term model “stability” to refer to how sensitive the resulting models are to variation in the training data.

This is often an issue with high dimensional spaces

Or, when too little data is available

Domain experts (users) have little confidence in models that change radically based on the training data used to create them

Using multiple models with voting helps to counteract the issue of instability

Often the ensemble models outperform any individual model

Rapid Deployment

Rapid deployment in Statistica can be used to output the predictions made by individual models as well as the voted prediction

Going back to Rapid Deployment – if we scroll all the way to the right we will see the “Voted Prediction”

Looking At The Voted Predictions

Using the Summary of Deployment spreadsheet as input

Right click on the file in the browser

Select “Use as Active Input”

Go to Statistics – Basic Statistics

Select “Tables and banners”

Cross-tabulation Of Observed Value And The Voted Predicted Value

Select “Specify tables (select variables)”

Select “DEFAULT” – the observed variable in List 1

Select “Voted Predicted” – the variable in List 2

Select “OK”

On the Options “TAB” check

Percentages of total count

Percentages of row counts

Percentages of column counts

Select “Summary”

Results – Similar To What We Have Seen For The Other Models

These are just a few statistics that we can look at when we are trying to develop our models for rapid deployment

Questions ?

[Figure: Tree 11 graph for DEFAULT. Num. of non-terminal nodes: 17, Num. of terminal nodes: 18. Node classes: BAD, GOOD.]

Nodes (ID: N, class): 1: 3364, GOOD; 2: 3261, GOOD; 4: 3060, GOOD; 6: 1561, GOOD; 8: 641, BAD; 10: 497, BAD; 12: 101, BAD; 13: 396, BAD; 18: 382, GOOD; 21: 335, GOOD; 23: 112, BAD; 9: 920, GOOD; 40: 848, GOOD; 42: 725, GOOD; 43: 123, GOOD; 65: 97, GOOD; 7: 1499, GOOD; 14: 18, GOOD; 15: 83, BAD; 20: 47, BAD; 22: 223, GOOD; 30: 33, GOOD; 31: 79, BAD; 19: 14, BAD; 11: 144, GOOD; 44: 712, GOOD; 45: 13, BAD; 64: 26, BAD; 66: 89, GOOD; 67: 8, BAD; 41: 72, BAD; 70: 4, BAD; 71: 1495, GOOD; 5: 201, BAD; 3: 103, BAD

Splits (variable: <= threshold / > threshold): DEBTINC: 43.679856; DELINQ: 1.500000; CLAGE: 178.660677; VALUE: 85602.500000; LOAN: 21050.000000; VALUE: 51090.000000; CLAGE: 68.939073; NINQ: 3.500000; CLAGE: 81.311610; YOJ: 7.500000; VALUE: 58251.000000; DEROG: 0.500000; MORTDUE: 111705.000000; DEBTINC: 42.300754; VALUE: 166063.000000; MORTDUE: 197597.500000; DEBTINC: 13.745088

[Figure: Tree graph for DEFAULT. Num. of non-terminal nodes: 8, Num. of terminal nodes: 19. Node classes: BAD, GOOD.]

Nodes (ID: N, class): 1: 3364, GOOD; 2: 3052, GOOD; 5: 2611, GOOD; 13: 287, GOOD; 19: 247, GOOD; 24: 86, GOOD; 7: 157, GOOD; 3: 200, GOOD; 10: 261, GOOD; 11: 246, GOOD; 12: 260, GOOD; 22: 116, GOOD; 23: 171, GOOD; 14: 258, GOOD; 15: 259, GOOD; 16: 275, GOOD; 17: 268, GOOD; 18: 250, GOOD; 26: 69, GOOD; 27: 17, BAD; 25: 161, GOOD; 6: 284, GOOD; 20: 128, GOOD; 21: 29, BAD; 8: 187, GOOD; 9: 13, BAD; 4: 112, GOOD

Splits (variable: branches): DEROG: <= 0.000000, <= 1.000000, > 1.000000; DELINQ: <= 0.000000, <= 1.000000, > 1.000000; DEBTINC: <= 24.380390, <= 27.920662, <= 30.686106, <= 33.419779, <= 35.113185, <= 36.696630, <= 38.332317, <= 39.891654, <= 41.473001, > 41.473001; JOB: = Other , ... / = Office , ...; CLAGE: <= 151.957418, > 151.957418; JOB: = Other , ... / = Sales , ...; DELINQ: <= 3.000000, > 3.000000; DELINQ: <= 2.000000, > 2.000000

Classification matrix 0 (hmeq in hmeq)

Dependent variable: DEFAULT

Options: Categorical response, Analysis sample

Observed	Statistic	Predicted BAD	Predicted GOOD	Row Total
BAD	Number	38	262	300
BAD	Column Percentage	64.41%	7.93%	
BAD	Row Percentage	12.67%	87.33%	
BAD	Total Percentage	1.13%	7.79%	8.92%
GOOD	Number	21	3043	3064
GOOD	Column Percentage	35.59%	92.07%	
GOOD	Row Percentage	0.69%	99.31%	
GOOD	Total Percentage	0.62%	90.46%	91.08%
All Groups	Count	59	3305	3364
All Groups	Total Percent	1.75%	98.25%	
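
The percentages in the classification matrix above follow from the four cell counts (38, 262, 21, 3043). The sketch below recomputes them; the variable names are hypothetical, and the overall accuracy at the end is a derived figure that does not appear in the Statistica output itself.

```python
# (observed, predicted) -> count, taken from the classification matrix above
counts = {
    ("BAD", "BAD"): 38, ("BAD", "GOOD"): 262,
    ("GOOD", "BAD"): 21, ("GOOD", "GOOD"): 3043,
}
classes = ["BAD", "GOOD"]

total = sum(counts.values())
row_total = {o: sum(counts[(o, p)] for p in classes) for o in classes}
col_total = {p: sum(counts[(o, p)] for o in classes) for p in classes}

def pct(part, whole):
    """Percentage rounded to two decimals, as in the Statistica output."""
    return round(100 * part / whole, 2)

for o in classes:
    for p in classes:
        n = counts[(o, p)]
        print(f"observed {o}, predicted {p}: n={n}, "
              f"column {pct(n, col_total[p])}%, row {pct(n, row_total[o])}%, "
              f"total {pct(n, total)}%")

# Share of cases on the diagonal (observed class == predicted class)
accuracy = pct(sum(counts[(c, c)] for c in classes), total)
print("accuracy:", accuracy)
```

The row percentages are the ones to watch here: only 12.67% of the observed BAD cases are predicted as BAD, even though the overall share of correct predictions is high, because GOOD cases dominate the sample.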

[Figure: Summary of Boosted Trees. Response: DEFAULT. Optimal number of trees: 199; Maximum tree size: 3. Series: Train data, Test data, Optimal number. X-axis: Number of Trees (20 to 200). Y-axis: Average Multinomial Deviance (0.15 to 0.55).]

[Table: Classification matrix. Analysis sample; Number of trees: 199]

[Figure: Summary of Random Forest. Response: DEFAULT. Number of trees: 100; Maximum tree size: 100. Series: Train data, Test data. X-axis: Number of Trees (10 to 100). Y-axis: Misclassification Rate (0.02 to 0.24).]

Gains Chart - Response/Total Response %

Cumulative

Selected category of DEFAULT: BAD

[Figure: Gains (y-axis, up to 100) versus Percentile (x-axis, 0 to 100), with curves for Baseline, BoostTreeModel, TreeModel, ExhaustiveCHAIDModel, and RandomForestModel.]

Lift Chart - Response %

Cumulative

Selected category of DEFAULT: BAD

[Figure: Response % (y-axis) versus Percentile (x-axis, 0 to 110), with curves for Baseline, BoostTreeModel, TreeModel, ExhaustiveCHAIDModel, and RandomForestModel.]
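
Cumulative gains and lift charts like those titled above can be computed from a model's scores. The sketch below uses one common definition (gain = percent of all BAD cases captured in the top-scoring cases; lift = gain divided by the percent of cases examined). It is an assumption that this matches how Statistica builds its charts, and the scores and labels are made up for illustration.

```python
def cumulative_gains(scores, labels, positive="BAD", n_bins=10):
    """Return a list of (percentile, gain_pct, lift) tuples.

    scores: predicted probability (or score) of the positive class per case.
    labels: observed class per case.
    """
    # Rank cases from highest to lowest score.
    ranked = [lab for _, lab in sorted(zip(scores, labels),
                                       key=lambda t: t[0], reverse=True)]
    n = len(ranked)
    total_pos = sum(1 for lab in ranked if lab == positive)
    rows = []
    for b in range(1, n_bins + 1):
        cutoff = round(n * b / n_bins)          # cases examined so far
        captured = sum(1 for lab in ranked[:cutoff] if lab == positive)
        percentile = 100.0 * b / n_bins
        gain = 100.0 * captured / total_pos     # % of all positives captured
        lift = gain / percentile                # improvement over random selection
        rows.append((percentile, gain, lift))
    return rows

# Hypothetical data: ten cases, three of which are BAD.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = ["BAD", "BAD", "GOOD", "BAD", "GOOD",
          "GOOD", "GOOD", "GOOD", "GOOD", "GOOD"]

rows = cumulative_gains(scores, labels)
for percentile, gain, lift in rows:
    print(f"{percentile:5.0f}%  gain={gain:6.2f}%  lift={lift:4.2f}")
```

A baseline model that ranks cases at random has a lift of about 1 at every percentile, which is why the charts include a Baseline curve for comparison.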

Bivariate Distribution: DEFAULT x Voted prediction


Big Data Analytics Tools./Final Exam materials/1 BINS 4352 Cluster Analysis.pptx

Cluster Analysis

Introduction to Clustering

Cluster: A collection of data objects

Large similarity among objects in the same cluster

Dissimilarity among objects in different clusters

Clustering is an Unsupervised Classification technique: no pre-determined classes

Typical applications of clustering:

As a stand-alone analysis, to gain insight on the data

As a pre-processing step for other predictive models

Unsupervised Classification

Unsupervised classification (clustering) has an UNKNOWN TARGET

Clustering Applications

Marketing:

Customer segmentation - Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use:

Identification of areas of similar land use in an earth observation database

Insurance:

Identifying groups of motor insurance policy holders with a high average claim cost

City-planning:

Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies:

Observed earthquake epicenters should be clustered along continental faults

Data Types In Clustering Analysis

Interval-scaled variables

Binary variables

Nominal, ordinal, and ratio variables

Variables of mixed types

Interval-valued Variables

Standardize data

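Calculate the mean absolute deviation of variable f over the n objects:

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$

where m_f is the mean of variable f:

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$$

Calculate the standardized measurement (z-score):

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

Using the mean absolute deviation is more robust to outliers than using the standard deviation, because the deviations from the mean are not squared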
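
As a minimal sketch of the standardization steps above, the function below computes z-scores using the mean absolute deviation. The function name and the income values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def standardize_mad(values):
    """Standardize values using the mean absolute deviation.

    Returns a list of z-scores: (x - mean) / mean_absolute_deviation.
    """
    n = len(values)
    mean = sum(values) / n
    # Mean absolute deviation: average distance from the mean.
    mad = sum(abs(x - mean) for x in values) / n
    if mad == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mean) / mad for x in values]

# Hypothetical incomes in thousands: mean is 40, mean absolute deviation is 12.
incomes = [20, 30, 40, 50, 60]
z_scores = standardize_mad(incomes)
print([round(z, 3) for z in z_scores])
```

After standardization the values are dimensionless, so a variable measured in dollars and one measured in years can contribute on a comparable scale to a distance calculation.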

Why Standardize The Data?

Standardization recasts the units of measure of attributes into dimensionless units.

This addresses the potential of the chosen units impacting the measured similarities among objects.

Also, standardizing makes attributes contribute more equally to the similarities among objects.

In essence, it equalizes the ranges of the variables, ensuring that a variable with a greater range does not dominate the analysis over a variable with a smaller range.

Similarity / Dissimilarity

Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects

A popular one is the Minkowski distance:

$$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer

For q=1, we get the MANHATTAN DISTANCE

For q=2, we get the EUCLIDEAN DISTANCE

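Manhattan Distance

If q = 1, d is Manhattan distance:

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

Euclidean Distance

If q = 2, d is Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$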

Properties

d(i,j) ≥ 0

d(i,i) = 0

d(i,j) = d(j,i)

d(i,j) ≤ d(i,k) + d(k,j)

Example

Student	Exam 1	Exam 2
John	92	87
Jane	100	90

If we only consider Exam 1, what is the distance from John to Jane?

If we consider both Exam 1 and Exam 2, what is the distance from John to Jane?
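
The two questions above can be worked through with a small Minkowski distance function. This is a sketch in plain Python; the function name is illustrative, and the scores come from the John and Jane table.

```python
def minkowski(a, b, q):
    """Minkowski distance between two equal-length numeric sequences.

    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    """
    if len(a) != len(b):
        raise ValueError("objects must have the same number of dimensions")
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1.0 / q)

john = (92, 87)   # (Exam 1, Exam 2)
jane = (100, 90)

# Exam 1 only: with a single variable, every q gives |92 - 100| = 8.
exam1_only = minkowski(john[:1], jane[:1], q=2)

# Both exams.
manhattan = minkowski(john, jane, q=1)   # |92 - 100| + |87 - 90| = 8 + 3
euclidean = minkowski(john, jane, q=2)   # sqrt(8**2 + 3**2) = sqrt(73)

print(exam1_only, manhattan, round(euclidean, 3))
```

The Euclidean distance (about 8.54) is smaller than the Manhattan distance (11) because it measures the straight-line path rather than the sum of the per-exam differences.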

Proximity Measures For Binary Attributes

A contingency table for binary data (cross-tabulating Object i against Object j)

Distance measure for symmetric binary variables

Distance measure for asymmetric binary variables

Jaccard coefficient (similarity measure for asymmetric binary variables)
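
The slide's contingency table and formulas did not survive extraction, so the sketch below uses the standard textbook definitions with an assumed notation: a = variables where both objects are 1, b = object i is 1 and object j is 0, c = object i is 0 and object j is 1, d = both are 0. The function names and example vectors are hypothetical.

```python
def binary_counts(x, y):
    """Return the contingency counts (a, b, c, d) for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def symmetric_distance(x, y):
    """Symmetric binary distance: mismatches over all variables."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)

def jaccard_similarity(x, y):
    """Jaccard coefficient: 1-1 matches over variables where either is 1."""
    a, b, c, _ = binary_counts(x, y)
    if a + b + c == 0:
        return 1.0  # assumed convention: two all-zero objects count as identical
    return a / (a + b + c)

def asymmetric_distance(x, y):
    """Asymmetric binary distance: ignores 0-0 matches; equals 1 - Jaccard."""
    return 1.0 - jaccard_similarity(x, y)

# Hypothetical objects with six binary attributes (1 = attribute present).
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 0, 0, 0, 1]

print(binary_counts(obj_i, obj_j))        # a=1, b=1, c=1, d=3
print(symmetric_distance(obj_i, obj_j))   # 2 mismatches out of 6 variables
print(asymmetric_distance(obj_i, obj_j))  # 2 mismatches out of 3 non-0-0 variables
print(jaccard_similarity(obj_i, obj_j))
```

The asymmetric measure is larger here because the three 0-0 matches, which carry little information when a 1 is the rare or important outcome, are excluded from the denominator.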

Nominal Variable Distance

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching

$$d(i,j) = \frac{p - m}{p}$$

m: # of matches, p: total # of variables

Method 2: use a large number of binary variables

creating a new binary variable for each of the M nominal states
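
Both methods for nominal variables can be sketched in a few lines. The distance for Method 1 is taken here as (p - m) / p, the share of variables that do not match; the function names and the example records are hypothetical.

```python
def simple_matching_distance(x, y):
    """Method 1: share of nominal variables on which two objects differ."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    p = len(x)                                    # total number of variables
    m = sum(1 for u, v in zip(x, y) if u == v)    # number of matches
    return (p - m) / p

def one_hot(value, states):
    """Method 2: encode one nominal value as M binary variables."""
    return [1 if value == s else 0 for s in states]

# Hypothetical objects described by (color, marital status, household status).
obj_i = ("red", "single", "rent")
obj_j = ("red", "married", "own")
distance = simple_matching_distance(obj_i, obj_j)   # 1 match out of 3

# One binary variable per state of the nominal variable.
colors = ["red", "yellow", "blue", "green"]
encoded = one_hot("blue", colors)

print(distance, encoded)
```

Once a nominal variable is encoded this way, the binary proximity measures can be applied to the resulting 0/1 variables.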


ANNUALINC	Annual income
SEX	Gender
MARSTATUS	Marital status
AGE	Age
EDUCATION	Highest level of education
OCCUPATION	Occupation
LNGOFSTAY	Length of time in current residence
DUALINC	For those married do they have a dual source of income (yes/no)
NOMEMBERS	Number of members in the family
NOOFMEM<18	Number of members in the family under the age of 18
HSEHLDSTATUS	Whether or not they own, rent or live with parents/family
HOMETYPE	Home type
ETHNICCLASS	Ethnic class
LANGUAGEHME	Language spoken in the home
SAMPLE	Training flag

	Cluster 1	Cluster 2	Cluster 3
Annual Income	Medium income 10-40K	High income >50K	Low income <15K
Gender	Female	Female	Male
Marital Status	Single	Married	Single
Age	Medium 25-34 yrs	High 35-44 yrs	Low 18-24 yrs
No Mem <18	1 or fewer	1 or fewer	1 or fewer
Household status	Rent	Own	Live with parent/family

Factor	Description
MARKETS	Strong competition; expanding global markets; booming electronic markets (Internet); innovative methods; opportunities for outsourcing; need for real-time, on-demand transactions
CONSUMER DEMAND	Desire for customization; desire for quality, diversity of products, and speed of delivery; more powerful, less loyal customers
TECHNOLOGY	More innovative new products and services; increasing obsolescence rate; increasing information overload; social networking, Web 2.0, …
SOCIETAL	Changing government regulation (deregulation); workforce diversification; concerns about security; increasing social responsibility of companies; greater emphasis on sustainability

	Predicted Good	Predicted Bad
Actual Good	True Positive	False Negative
Actual Bad	False Positive	True Negative

	Predicted Good Credit	Predicted Bad Credit
Observed Good Credit	Correct	Missed Opportunity Cost = $1
Observed Bad Credit	Lost Revenue Cost = $3	Correct
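
The cost matrix above can be applied to a model's error counts to compare models by cost rather than by raw accuracy. In this sketch the two unit costs come from the table, while the function name and the error counts for the two models are hypothetical.

```python
# Unit costs from the cost matrix.
COST_MISSED_OPPORTUNITY = 1   # observed good credit, predicted bad
COST_LOST_REVENUE = 3         # observed bad credit, predicted good

def total_cost(good_predicted_bad, bad_predicted_good):
    """Total misclassification cost in dollars for one model."""
    return (good_predicted_bad * COST_MISSED_OPPORTUNITY
            + bad_predicted_good * COST_LOST_REVENUE)

# Hypothetical error counts for two models scored on the same cases.
model_a = {"good_predicted_bad": 21, "bad_predicted_good": 262}
model_b = {"good_predicted_bad": 100, "bad_predicted_good": 200}

cost_a = total_cost(**model_a)   # 21 * 1 + 262 * 3
cost_b = total_cost(**model_b)   # 100 * 1 + 200 * 3

print("Model A:", sum(model_a.values()), "errors, cost", cost_a)
print("Model B:", sum(model_b.values()), "errors, cost", cost_b)
```

In this illustration Model B makes more errors in total but costs less, because it makes fewer of the expensive errors (bad credit predicted as good).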

Name	Model Role	Measurement Level	Description
BAD	Target	Binary	1=defaulted on loan, 0=paid back loan
REASON	Input	Binary	HomeImp=home improvement, DebtCon=debt consolidation
JOB	Input	Nominal	Six occupational categories
LOAN	Input	Interval	Amount of loan request
MORTDUE	Input	Interval	Amount due on existing mortgage
VALUE	Input	Interval	Value of current property
DEBTINC	Input	Interval	Debt-to-income ratio
YOJ	Input	Interval	Years at present job
DEROG	Input	Interval	Number of major derogatory reports
CLNO	Input	Interval	Number of trade lines
DELINQ	Input	Interval	Number of delinquent trade lines
CLAGE	Input	Interval	Age of oldest trade line in months
NINQ	Input	Interval	Number of recent credit inquiries

Some of the companies that lost your data in 2015
Vtech 4.8M Customer Records (CRs)	Ashley Madison 37M Client Records	UCLA Health 4.5M Patient Records
LastPass Millions of User Passwords	Scottrade 4.6M CRs	T-Mobile 15M CRs
TRUMP Hotels thousands of visitors	CVS unknown millions credit cards	Excellus BlueCross BlueShield 10M CRs
Carphone Warehouse (UK) 2.4M CRs	IRS 100,000 Taxpayers	Anthem (healthcare) 80M CRs
US Office of Personnel Mgmt 22M and counting Employee/Applicant Records	CIA thousands of arrestees' data after John Brennan's email was breached	70M phone records of prison inmates given to reporters
Patreon (crowdfunding service) 15GB data breach included names/SS numbers/ etc.

		True Class
		Positive	Negative
Predicted Class	Positive	14008	234
	Negative	442	13005
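
The usual summary measures can be derived from the four counts in the table above, where rows are the predicted class and columns are the true class. This is a sketch in plain Python; the variable names are illustrative.

```python
# Counts from the table: rows = predicted class, columns = true class.
tp = 14008   # predicted positive, truly positive
fp = 234     # predicted positive, truly negative
fn = 442     # predicted negative, truly positive
tn = 13005   # predicted negative, truly negative

total = tp + fp + fn + tn

accuracy = (tp + tn) / total        # share of all cases classified correctly
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of true positives that were found
specificity = tn / (tn + fp)        # share of true negatives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"total cases : {total}")
print(f"accuracy    : {accuracy:.4f}")
print(f"precision   : {precision:.4f}")
print(f"recall      : {recall:.4f}")
print(f"specificity : {specificity:.4f}")
print(f"F1 score    : {f1:.4f}")
```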

SUMMARY OUTPUT

Regression Statistics
Multiple R	0.80860012
R Square	0.653834153
Adjusted R Square	0.634602718
Standard Error	0.435014172
Observations	20

ANOVA
	df	SS	MS	F	Significance F
Regression	1	6.43372807	6.43372807	33.99819734	1.59668E-05
Residual	18	3.40627193	0.189237329
Total	19	9.84

	Coefficients	Standard Error	t Stat	P-value
Intercept	-1.6995614	0.72677682	-2.338491483	0.031101063
GMAT	0.008399123	0.001440476	5.830797316	1.59668E-05
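
The summary statistics in the regression output above are related to one another, and the coefficients define a prediction equation. The sketch below recomputes the derived values from the ANOVA table and predicts the response for one GMAT score. The output does not name the response variable, so it is referred to only as "response" here, and the GMAT value of 600 is a hypothetical input.

```python
# Values copied from the regression output.
n = 20
ss_regression = 6.43372807
ss_residual = 3.40627193
ss_total = 9.84
intercept = -1.6995614
slope_gmat = 0.008399123
se_slope = 0.001440476

df_regression = 1
df_residual = n - df_regression - 1          # 18

r_square = ss_regression / ss_total
adj_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
ms_residual = ss_residual / df_residual
f_stat = (ss_regression / df_regression) / ms_residual
standard_error = ms_residual ** 0.5          # standard error of the regression
t_slope = slope_gmat / se_slope              # t statistic for the GMAT coefficient

def predict(gmat):
    """Predicted response for a given GMAT score (fitted line only)."""
    return intercept + slope_gmat * gmat

print(f"R Square          : {r_square:.6f}")
print(f"Adjusted R Square : {adj_r_square:.6f}")
print(f"Standard Error    : {standard_error:.6f}")
print(f"F                 : {f_stat:.4f}")
print(f"t Stat (GMAT)     : {t_slope:.4f}")
print(f"predict(600)      : {predict(600):.4f}")
```

With a single predictor, the F statistic equals the square of the slope's t statistic, which is why both share the same p-value in the output.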

DOC #	Author	Doc #	Author
81	AH	82	AH
81	AH	83	AH
76	AH	77	AH
47	JM	48	JM
65	AH	66	AH
64	JJ	75	AH
32	AH	33	AH
65	AH	81	AH
80	AH	82	AH
66	AH	75	AH

Age	Age - MD	Age -K3	Age-K6
38		31	29.66667
42		25.333333	31.33333
18		32	31.5
53		38.333333	29.83333
46		26.666667	27.83333
28		26.666667	29.33333
31		31.333333	32.5
37		26.666667	31.16667
22		31	29.33333
28		25.666667	28.16667
49		35	34.5
28		27	26.33333
28		35.333333	31.5
36		21.333333	29.83333
45		32	31.83333

Big Data Analytics Tools./ Final Exam/PROJECT - BETTER UNDERSTAND ATTRITION.docx

Big Data Analytics Tools./ Final Exam/FINAL EXAM - 2018.docx

Big Data Analytics Tools./ Final Exam/HR-BalancedSheet.sta

Big Data Analytics Tools./Final Exam materials/1 BINS 4352 - Association Analysis.pptx

Big Data Analytics Tools./Final Exam materials/2 BINS 4352- Data Preparation.pptx

Big Data Analytics Tools./Final Exam materials/2 BINS 4352 - Cluster Analysis.pptx

Big Data Analytics Tools./Final Exam materials/BINS 4352- Comparing Models and Model Deployment.pptx

Big Data Analytics Tools./Final Exam materials/1 BINS 4352 Cluster Analysis.pptx

Big Data Analytics Tools./Final Exam materials/3 BINS 4352 - Boosted Trees.pptx

Big Data Analytics Tools./Final Exam materials/BINS 4352 - Overview Presentation.pptx

Big Data Analytics Tools./Final Exam materials/BINS 4352 - Introduction Presentation.pptx

Big Data Analytics Tools./Final Exam materials/3 BINS 4352 - Trees.pptx

Big Data Analytics Tools./Final Exam materials/MARSplines.pptx

Big Data Analytics Tools./Final Exam materials/1 BINS 4352 - Data Mining.pptx

Big Data Analytics Tools./Final Exam materials/Regression Analysis.pptx

Big Data Analytics Tools./Final Exam materials/Web Mining.pptx

Big Data Analytics Tools./Final Exam materials/BINS 4352 Twitter Sentiment Analysis.pptx

Big Data Analytics Tools./Final Exam materials/3 BINS 4352 - Sampling - Subsampling.pptx

Big Data Analytics Tools./Final Exam materials/Neural Networks.pptx

Big Data Analytics Tools./Final Exam materials/1 BINS 4352- Text Mining - updated for version 9.pptx

Big Data Analytics Tools./Final Exam materials/3 BINS 4352- Data Warehousing.pptx

Big Data Analytics Tools./Final Exam materials/Statistics Concepts.docx

The Mode

Big Data Analytics Tools./Final Exam materials/2 BINS 4352 - Variable Selection.pptx

Big Data Analytics Tools./Final Exam materials/2 BINS 4352 - Text Analysis - 2.pptx

Big Data Analytics Tools./Final Exam materials/1 BINS 4352 - Preparing Data - Statistal Tests and Missing Data.pptx

Big Data Analytics Tools./Final Exam materials/Statistical Concepts.pptx

Big Data Analytics Tools./Final Exam materials/2 BINS 4352 - Big Data Analytics.pptx