Big Data Analytics Tools Exam and project

BigDataAnalyticsTools..zip

Big Data Analytics Tools./ Final Exam/PROJECT - BETTER UNDERSTAND ATTRITION.docx

FINAL EXAM – EXERCISE – To Better Understand Attrition.

This is a final project – you are going to examine the HR-BalancedSheet dataset and write a short report on what you found. I will guide you through the analysis, but as we go through the analysis you are going to need to capture data for the final report.

1. Load the dataset into Statistica

2. Generate Histograms for all of the data

a. Make notes on what you observe from the histograms. Can you learn anything about the business from these histograms?

b. Capture all of the histograms.

3. Now generate a correlation matrix to see if any variables are highly correlated. If variables are highly correlated and you are doing a supervised method (e.g., decision tree), then one of them must be omitted from the analysis. Do you know why?

Statistics -> Nonparametrics -> Correlations, then click OK.

Now select ALL of the variables and select “Spearman rank R”.

4. Let’s copy this out to Excel.

a. Open a blank Excel file

b. Go to Statistica – the output correlation matrix –

i. Press Ctrl+A – this will select everything.

ii. Right Click - select “Copy with Headers”

iii. Go To Excel – select Paste

5. Select all of the numbers in Excel

a. Go To Conditional Formatting

i. Highlight all values greater than 0.70

6. This tells you the values that are highly correlated. Record what they are – these cannot be used in a supervised modeling exercise together. For example, JobLevel and TotalWorkingYears are highly correlated.

a. Make a list of all of the variables that are highly correlated (>0.7).
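
For anyone who wants to double-check steps 3 to 6 outside Statistica and Excel, the same idea can be sketched in plain Python. This is only a minimal sketch: the column names and numbers below are made up for illustration (they are not taken from the HR dataset), and the helper names `average_ranks`, `spearman`, and `highly_correlated` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists (non-constant inputs)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank R: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def highly_correlated(columns, threshold=0.70):
    """Return (name_a, name_b, r) for every pair with r above the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = spearman(columns[a], columns[b])
        if r > threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical toy data, NOT taken from the HR dataset.
toy = {
    "JobLevel": [1, 2, 2, 3, 4, 5],
    "TotalWorkingYears": [2, 5, 6, 10, 15, 22],
    "DistanceFromHome": [9, 1, 7, 3, 8, 2],
}
flagged = highly_correlated(toy)
print(flagged)
```

The threshold mirrors the "greater than 0.70" highlighting rule in step 5. To catch strong negative correlations as well, compare `abs(r)` to the threshold instead.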

BUSINESS PROBLEM: The company has employee data for the last several years. In this data set we have a wide range of data, including whether or not they left the company (i.e., Attrition). If Attrition is set to “Yes”, they left the company. If Attrition is set to “No”, they did not leave the company.

The first thing we want to do is take a “high” level look at those people who left the company.

Go to Selection Criteria – that is accessible through the Sel:Off setting at the bottom of the Statistica window. Click on “Sel:Off”

Set the selection criteria to Attrition = “Yes”.

7. Generate Histograms for all of the data

a. Make notes on what you observe from the histograms. Can you learn anything about the business from these histograms?

b. Capture the histograms that tell you something about the business.

Go back to the selection criteria and turn the Sel: back to “Off”.

8. Now build a decision tree (C&RT) to see if we can find out what influences whether or not individuals decide to leave the company.

If you exclude the variables that are highly correlated, you can generate a tree.

Generate a C&RT tree

Pick your variables (Quick)

· Attrition is your dependent variable

· Select the categorical and continuous variables that you reasonably think could be an issue with respect to attrition.

· Select your response codes

· ALL

Don’t do anything on Classification (YET) – you may want to go back and play with the classification weights – but, don’t do that yet.

On the “Stopping” tab, change the minimum n to 20. This will allow it to build a deeper tree.

Select V-fold cross validation on the Validation tab

Set Surrogate to 2 on the Advanced tab and hit OK.

Look at your tree –

Look at the Predicted Versus Observed – under classification.

Look at “Importance” on the Summary tab – this tells you which variables have the greatest impact.

This is your initial tree ---

Now – the best you’re going to be able to do is get about 80% accuracy on predicting both “Yes” and “No”.
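
To make "accuracy on both Yes and No" concrete, here is a small sketch of how the per-class accuracy can be read off a Predicted Versus Observed comparison. The labels below are hypothetical, not Statistica output, and `per_class_accuracy` is an illustrative helper name.

```python
from collections import Counter


def per_class_accuracy(observed, predicted):
    """Share of each observed class that the model predicted correctly."""
    totals = Counter(observed)
    correct = Counter(o for o, p in zip(observed, predicted) if o == p)
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical labels for ten employees, not output from Statistica.
observed = ["Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes", "No"]

accuracy = per_class_accuracy(observed, predicted)
print(accuracy)
```

Looking at the two classes separately matters because a tree can score well overall while missing most of the people who actually left.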

So – play with it and see how good you can get it.

· Play with the classification costs

· You may try to create a stratified subsample using Attrition as the strata variable
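
The stratified subsample idea can be sketched as follows. This is a minimal illustration in plain Python, not the Statistica procedure: the records are hypothetical, and `stratified_subsample` and its `per_stratum` parameter are names invented for this example.

```python
import random
from collections import defaultdict


def stratified_subsample(rows, strata_key, per_stratum, seed=0):
    """Draw up to `per_stratum` rows at random from each stratum."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata_key]].append(row)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical records: 8 employees who stayed, 2 who left.
employees = [{"id": i, "Attrition": "No"} for i in range(8)]
employees += [{"id": i, "Attrition": "Yes"} for i in range(8, 10)]

balanced = stratified_subsample(employees, "Attrition", per_stratum=2)
counts = defaultdict(int)
for row in balanced:
    counts[row["Attrition"]] += 1
print(dict(counts))
```

Drawing the same number of rows from each Attrition group keeps the larger group from dominating the tree.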

Big Data Analytics Tools./ Final Exam/FINAL EXAM - 2018.docx

FINAL EXAM NAME: _________________________

1. When we evaluate models we often discuss things like predictive accuracy, speed, robustness, scalability, and interpretability. Briefly discuss what is meant by “interpretability” and why it is important.

2. You have been hired by the county government to help automate a system to detect fraudulent spending by government employees.

You have been given a database of transactions for the past 10 years to work with. Each record in the database contains all of the details of each transaction as well as information related to the particular employee. In this database accountants have manually gone through the data and marked each transaction as either “Good” or “Fraudulent”.

The goal – build a model based on the historical data that will flag future transactions as either “Good” or “Fraudulent”. This will eliminate the need for the accountants to have to go through each transaction manually in the future. What type of modeling technique (e.g., decision tree, association analysis, clustering, etc.) would you use and why?

3. AT&T has been losing customers to Verizon. They want to try to understand why this is the case. They have customer records for the past 5 years that contain demographic information (age, gender, etc.) for the customers, the type of plan that they have, the number of interactions they have had with customer support and whether or not those customers left AT&T.

AT&T wants you to build a model that can be used to predict whether or not a customer is going to leave and switch to another provider. What type of technique (e.g., decision tree, association analysis, clustering, etc.) would you use and why?

4. Kroger is trying to find ways to improve sales. They have all of their receipts for the past 5 years. The receipts contain information about what was purchased, who purchased, and the date and time of the transaction.

Your task is to analyze sales patterns and make recommendations with respect to store layouts that, you hope, will increase sales. What type of modeling technique (e.g., decision trees, association analysis, clustering, etc.) would you use and why?

5. You work for a cable service provider. You provide a variety of services for your customers. Your company provides cable TV, home phone, security systems, and internet services.

Your customer base is very diverse. Your customers could be male/female, young/old, single/married/married with children, etc. You have a wide range of ethnic backgrounds and income levels.

You want to make your marketing campaigns more effective. This means targeting the right groups with the right messages using the right media. (For example, marketing via social media may be more or less effective for 18-year-olds as compared to 80-year-olds.)

You have been tasked to use the customer database to determine what the different customer segments are and what they look like. Then, once you figure out what the unique groups are, you can develop a targeted campaign for each group. What type of modeling technique (e.g., decision trees, association analysis, clustering, etc.) would you use to determine the different market segments and why?

Text and Web Analytics

6. When we do text analytics, we read in the data, we transform the data into documents, and then we must generate a term/document matrix. This term/document matrix is what we use to perform analysis.

Generation of the term/document matrix involves some processing of the document (see figure on the right).

Briefly describe each step and what it does.

a. Tokenize:

b. Transform Cases:

c. Filter Stopwords (English):

d. Filter Tokens by Length:

e. Stem (Porter):

7. Briefly describe what “Sentiment Analysis” is and how it might be used by a company.

8. What is the difference between text mining and data mining?

9. FINAL EXAM – 2018 – BETTER UNDERSTAND ATTRITION “projects”

Write me a short report that tells me the following (I’d like for this report to be uploaded in a separate standalone word file and look like something you would give an employer):

Business Scenario – write this like you worked for the company. Tell me what the issue is you are exploring and why.

What you did and why you did it – just discuss the technique you used, why it was appropriate and what you did. If you did several iterations, let me know what the final configuration was. I don’t need to know everything that went on – just what you did to get the final results.

What you found – tell me everything you found/learned. Include screen shots, graphs, etc. Anything appropriate to communicate what you found. Do NOT show me everything that was generated – just those things that support your “findings”.

Recommendations - What impact this would have to the business AND what your recommendations are for the business.

Big Data Analytics Tools./ Final Exam/HR-BalancedSheet.sta

Big Data Analytics Tools./Final Exam materials/1 BINS 4352 - Association Analysis.pptx

Association Analysis BINS 4352

Learning Objectives

Gain an understanding of how Association Analysis is used

Understand how Associations are created and how to interpret/evaluate those Associations

Discuss and understand Association metrics – Lift, Support, and Confidence

Gain familiarity with RapidMiner

Association Analysis (Market Basket Analysis)

This is a widely used and, in many ways, one of the most successful data mining algorithms.

It can be used to determine what products people purchase together.

Uses

Stores can use this information to determine store layout and product placement

Direct marketers can use this information to determine which new products to offer to their current customers.

Inventory policies can be improved if reorder points reflect the demand for the complementary products.

Any application where you are looking to see if there is a pattern where strong associations are present

Parable Of “Beer And Diapers”

Customers who bought diapers at a grocery store between 5 and 7 pm also tend to buy beer.

This is a good example of the business value present in big data analytics.

More than a parable – it was the result of a study commissioned by Osco in the 1990s and represented a starting point in big data analytics

The finding led to the notion that discovering uncommon relationships in data can be used to drive business value.

Association Rules for Market Basket Analysis

Rules are written in the form “left-hand side implies right-hand side” and an example is:

Yellow Peppers IMPLIES Red Peppers, Bananas

To make effective use of a rule, three numeric measures about that rule must be considered:

(1) support

(2) confidence

(3) lift

Measures of Predictive Ability

Support and Confidence: An Illustration

Transaction 1: A, B, C

Transaction 2: A, C, D

Transaction 3: B, C, D

Transaction 4: A, D, E

Transaction 5: B, C, E

RULE          SUPPORT   CONFIDENCE   LIFT
A => D        2/5       2/3          (2/3)/(2/5) = 1.67
C => A        2/5       2/4          (2/4)/(2/5) = 1.25
A => C        2/5       2/3          (2/3)/(2/5) = 1.67
B & C => D    1/5       1/3          (1/3)/(1/5) = 1.67
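
The numbers in the illustration can be reproduced with a few lines of Python. This is a sketch under two assumptions: the fifteen letters in the illustration are read as five transactions of three items each, and Lift is computed the way this deck defines it (Confidence divided by the Support of the rule). The function names are illustrative.

```python
from fractions import Fraction

# The five transactions from the illustration (assumed grouping).
transactions = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "D", "E"},
    {"B", "C", "E"},
]


def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(lhs, rhs):
    """Support, confidence and lift for the rule lhs => rhs."""
    both = count(lhs | rhs)
    support = Fraction(both, len(transactions))
    confidence = Fraction(both, count(lhs))
    lift = confidence / support  # the definition used in this deck
    return support, confidence, lift


rules = [({"A"}, {"D"}), ({"C"}, {"A"}), ({"A"}, {"C"}), ({"B", "C"}, {"D"})]
for lhs, rhs in rules:
    s, c, l = rule_metrics(lhs, rhs)
    print(sorted(lhs), "=>", sorted(rhs), s, c, round(float(l), 2))
```

Many tools instead divide confidence by the support of the right-hand side alone. Swapping that in changes only the `lift` line, which is one reason lift values from different tools are not directly comparable.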

A Note On Lift

Lift is an interesting measurement and one that has undergone a great deal of scrutiny

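For our purposes we defined Lift as Confidence/Support, where Support is the support of the rule itself

However, there are other ways to calculate this measure; a common one divides Confidence by the support of the right-hand side alone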

Some have argued that one must take into account the frequency of the observation

You don’t necessarily want a product that is in 100,000 transactions to be penalized over a product that is involved in 10 transactions simply due to the number of occurrences (or vice versa)

As such – when looking at this value in a tool keep in mind that it is the “relative” value that is important and not the “absolute” value.

Market Basket Analysis Methodology

We first need a list of transactions and what was purchased.

Receipts from stores

This may have to be “reformatted” depending on the tool that you’re using

Next, we choose a list of products to analyze, and tabulate how many times each was purchased with the others.

The diagonal of the table shows how often a product is purchased in any combination, and the off-diagonals show which combinations were bought.

A Convenience Store Example

Consider the following simple example about five transactions at a convenience store:

Transaction 1: Frozen pizza, cola, milk

Transaction 2: Milk, potato chips

Transaction 3: Cola, frozen pizza

Transaction 4: Milk, pretzels

Transaction 5: Cola, pretzels

These need to be cross tabulated and displayed in a table.

A Convenience Store Example (cont)

The diagonal shows how many times a product was purchased (in any combination)

Pizza and Cola sell together more often than any other combo; a cross-marketing opportunity?

Milk sells well with everything – people probably come here specifically to buy it.

Product Bought Pizza also Milk also Cola also Chips also Pretzels also
Pizza 2 1 2 0 0
Milk 1 3 1 1 1
Cola 2 1 3 0 1
Chips 0 1 0 1 0
Pretzels 0 1 1 0 2
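
The cross tabulation above can be rebuilt from the five transactions with a short script. This is a sketch in plain Python using the shortened product names from the table; the variable names are illustrative.

```python
from itertools import combinations

baskets = [
    ["Pizza", "Cola", "Milk"],
    ["Milk", "Chips"],
    ["Cola", "Pizza"],
    ["Milk", "Pretzels"],
    ["Cola", "Pretzels"],
]
products = ["Pizza", "Milk", "Cola", "Chips", "Pretzels"]

# table[a][b] = number of baskets containing both a and b;
# table[a][a] = number of baskets containing a at all (the diagonal).
table = {a: {b: 0 for b in products} for a in products}
for basket in baskets:
    items = set(basket)
    for a in items:
        table[a][a] += 1
    for a, b in combinations(items, 2):
        table[a][b] += 1
        table[b][a] += 1

for a in products:
    print(f"{a:<9}", *(f"{table[a][b]:>2}" for b in products))
```

The table is symmetric, so each off-diagonal pair is counted once per basket and written to both cells.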

Using The Results

The tabulations can immediately be translated into association rules and the numerical measures computed.

Comparing this week’s table to last week’s table can immediately show the effect of this week’s promotional activities.

But you need to be careful that the results were not impacted by some external event (e.g., bad weather)

Some rules are going to be trivial (hot dogs and buns sell together) or inexplicable (toilet rings sell only when a new hardware store is opened).

Using The Results Barbie® => Candy

Forbes (Palmeri 1997) reported that a major retailer had determined that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars. The retailer was unsure what to do with this nugget. The online newsletter Knowledge Discovery Nuggets invited suggestions (Piatetsky-Shapiro 1998):

Put them closer together in the store.

Put them far apart in the store.

Package candy bars with the dolls.

Package Barbie + candy + poorly selling item.

Raise the price on one, lower it on the other.

Barbie accessories for proofs of purchase.

Do not advertise candy and Barbie together.

Offer candies in the shape of a Barbie Doll.

Augmenting Data to Yield More Insights

The sales data can be augmented with the addition of virtual items.

For example, we could record that the customer was new to us or had children.

The transaction record might look like:

Item 1: Sweater Item 2: Jacket Item 3: New

This might allow us to see what patterns new customers have versus existing customers.

Limitations to Market Basket Analysis

A large number of real transactions are needed to do an effective basket analysis, but the data’s accuracy is compromised if all the products do not occur with similar frequency.

The analysis can sometimes capture results that were due to some external event

For example:

The success of previous marketing campaigns (and not natural tendencies of customers).

Weather or natural disaster.

Association Analysis in RapidMiner

The Dataset

The data is organized into “Transactions”

Each transaction represents a grocery store receipt

The items we are interested in include

Herring, Baguette, Avocado, Heineken
Olives, Sardines, Corned Beef, Peppers
Soda, Cracker, Bourbon, Artichoke
Coke, Apples, Chicken, Ham
Turkey, Ice Cream, Steak, Bordeaux

Data is coded where “YES” indicates that it was purchased and “NO” indicates that it was not purchased

Running Association Analysis in RapidMiner

Select

New Process

RapidMiner Studio Professional Main Menu

RapidMiner Studio is very similar in layout to SAS Enterprise Miner

Design Pane – where you lay out the analysis you want to run

Drag/Drop Objects from the Operator list into the Design Space

Importing Dataset

There are several ways to import data

I am going to read the Excel file that has pre-processed grocery store receipt data

I drag the “Read Excel” operator into the design space.

Connect the inp port on the side of the design space to the fil port on the operator

Many operators have 2 output ports – one for processed data and the other for an original data “pass through”

Configuring The Read Excel Operator

Parameters associated with the Read Excel Operator appear on the right side of the screen when the operator is selected.

Go to “Import Configuration”

Select The Excel File

Select the data file that you want to import

Select “Next”

Preview The Data

The file that you import can contain multiple sheets

At this point you can select the sheet and the range of cells that you wish to import

The data file we are working with has 1 sheet and by default all of the entries are selected

Select “Next”

Annotating Data

Now you have the opportunity to add annotations to the data

We don’t need to set an attribute name for this data set

Select “Next”

Selecting attribute types

RapidMiner tries to determine the types of the attributes from the data.

For Association Analysis we need to set the types to either “Binominal” or “Nominal”

I am going to select “Binominal”

Select “Next”

Finished !

Once all of the data types are changed to Binominal – Select “Finish”

Attribute Selection

Next we need to select the attributes that will be used in the analysis.

This can be found on the Operator Pane under Blending -> Attributes -> Selection

Drag/Drop Select Attributes into the Design

Select Attributes

Connect the “out” port of Read Excel to the “exa” port on the Select Attributes.

Select the Select Attributes Operator – the parameters for the operator will appear on the right side of the screen.

Selecting Attributes

On the Parameters Pane

Select the “subset” Attribute filter type

Select “Select Attributes”

The Select Attributes dialog box appears

Select all of the attributes except for “Transaction”

This is an ID for the transaction and is not needed for the analysis

FP-Growth (Frequency Calculations)

Next we drag/drop an FP-Growth operator into the design

Connect the “exa” port of Select Attributes to “exa” of FP-Growth

The FP Growth operator determines the “frequent item sets”

A frequent item set denotes the items (products) in the set that have been purchased together frequently (in a certain ratio of transactions)

We also need to define the positive value (open advanced parameters)

Create Association Rules

Drag/drop “Create Association Rules” from the Operator Pane into the Design space.

Connect the “fre” (frequencies) of FP-Growth to “ite” of Create Association Rules

Parameters driving the rule creation can be set (Confidence, lift, … and Thresholds)
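
To show what the two operators produce, here is a rough sketch in plain Python. It is NOT the FP-Growth algorithm: it simply tries every combination of items, which gives the same frequent item sets on tiny data but would not scale. The receipts are hypothetical, and `min_support` and `min_confidence` stand in for the thresholds set on the operators.

```python
from itertools import combinations

# Hypothetical receipts coded YES/NO as in the deck; values are made up.
rows = [
    {"Turkey": "YES", "Baguette": "YES", "Ham": "YES", "Olives": "NO"},
    {"Turkey": "YES", "Baguette": "YES", "Ham": "YES", "Olives": "YES"},
    {"Turkey": "NO", "Baguette": "YES", "Ham": "NO", "Olives": "YES"},
    {"Turkey": "YES", "Baguette": "NO", "Ham": "YES", "Olives": "NO"},
]
# "YES" is the positive value: keep only the items that were purchased.
baskets = [frozenset(k for k, v in r.items() if v == "YES") for r in rows]


def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def frequent_itemsets(min_support):
    """Brute force: test every combination of items (fine for tiny data)."""
    items = sorted(set().union(*baskets))
    found = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo))
            if s >= min_support:
                found[frozenset(combo)] = s
    return found


def association_rules(itemsets, min_confidence):
    """Split each frequent item set into premise => conclusion."""
    rules = []
    for itemset, s in itemsets.items():
        for size in range(1, len(itemset)):
            for premise in combinations(sorted(itemset), size):
                premise = frozenset(premise)
                # Every subset of a frequent item set is itself frequent.
                confidence = s / itemsets[premise]
                if confidence >= min_confidence:
                    rules.append(
                        (sorted(premise), sorted(itemset - premise), s, confidence)
                    )
    return rules


itemsets = frequent_itemsets(min_support=0.5)
rules = association_rules(itemsets, min_confidence=0.9)
for premise, conclusion, s, c in rules:
    print(premise, "=>", conclusion, f"support={s:.2f}", f"confidence={c:.2f}")
```

Raising `min_support` shrinks the list of frequent item sets, and raising `min_confidence` shrinks the list of rules built from them.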

Ready To Run !

Connect the “rul” and “ite” ports of the Create Association Rules operator to the output (res) of the design space.

Select “Run” -

Output

We get 2 sets of output

One tab is for the FP-Growth operator and shows the Frequent Item Sets

The other contains the Association Rules

The Frequency data shows you the support for every combination of products in the data set

Association Rules (Sorted By Support)

Association Rules (Sorted By Confidence)

Association Rules (Sorted By Lift)

Interpreting The Rules

Rule: IF (turkey, baguette) THEN (ham, olives)

Support: The percentage of the time that the rule was true

26.7% of the time the basket contained both (turkey,baguette) and (ham, olives)

Confidence: The percentage of the baskets that contained (turkey, baguette) that also contained (ham, olives)

85% of the time when the basket contained (turkey, baguette) it also contained (ham, olives)

Lift: the relative measure of how many times larger Confidence is than the expected level (similar to what we discussed earlier – better than a baseline model)

Greater than 1 is desired

The larger the value the better

Association Analysis in Statistica

Same Data File

Reformatted data just a little for Statistica (in Excel)

Each line contains what was sold for that transaction

Link Analysis

Go to Data Mining -> Link Analysis

Select Non-sequential association analysis

Select Variables as we have done in the past

Transaction – is the Transaction ID

Food items – are Multi-response variables

Database Selection

If you run the analysis multiple times, you may have to select a database name.

This should only be the case if you exit the tool, reload data, and try to run again.

Select “OK” and run the analysis

Results

Association Rules

Frequency Itemsets

Rule Graph

Web Graph

Questions?

Big Data Analytics Tools./Final Exam materials/2 BINS 4352- Data Preparation.pptx

Data Preparation

Objectives

Provide a perspective on how data may have to be prepared for a given Analytics task

Understand the relationship between the data mining technique and the business problem being addressed

Gain an understanding of the types of data issues that might be found

Understand options when it comes to “fixing” or “addressing” data issues

Understand what impact “uncleaned” data may have on the analysis

Provide a “practical” and “applied” understanding of statistical concepts

Data in Data Mining

Data: a collection of facts usually obtained as the result of experiences, observations, or experiments.

Data may consist of numbers, words, images, …

Data: lowest level of abstraction (from which information and knowledge are derived).

Nominal – mutually exclusive, but not ordered, categories (e.g., male, female)

Ordinal – order matters (e.g., Freshman, Sophomore, Junior, Senior).

Interval – a measure where the difference between two values is meaningful.

Ratio – all the properties of Interval, but with an absolute 0, which means a complete lack of that variable.

Inspecting Data and Preparation in Excel

Preparing Data In Excel

We are looking for anomalies in the data

Missing values

Values out of a defined range

Etc.

Once we find these values – we can

Repair the data

Impute missing values

Delete the response

DATA Dictionary

The data dictionary tab describes everything that is in the data set

If you download a data set from the Internet, it will often include a data dictionary in a separate file.

Raw data

The raw data is the data as it was collected.

There have been no changes/modifications to the data at this point.

Highlight Blanks

All of the data was selected in the spreadsheet

Go to FIND & SELECT in Excel (on the HOME Tab)

Select “GO TO SPECIAL”

Select “Blanks”

Then set the fill color to RED

Analyze Row / Columns

Count non-blank responses

Calculate the frequency of the “expected values”

Calculate difference from the means

Calculate single answer bias
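
The row checks listed above are done with Excel formulas in class, but the logic can be sketched in Python as well. The responses below are hypothetical, `summarize` is an illustrative helper, and "single answer bias" is interpreted here as a respondent giving the same answer to every question they answered, which is an assumption about what the slide intends.

```python
def summarize(answers):
    """Count answered cells and flag a respondent who repeats one answer."""
    filled = [a for a in answers if a.strip()]
    return {
        "non_blank": len(filled),
        "blank": len(answers) - len(filled),
        # Assumption: "single answer bias" means every answered question
        # received the same value.
        "single_answer": len(filled) > 1 and len(set(filled)) == 1,
    }


# Hypothetical survey responses; "" marks a blank cell.
responses = {
    "r1": ["4", "5", "3", "4"],
    "r2": ["3", "", "", "2"],
    "r3": ["5", "5", "5", "5"],
}

summary = {rid: summarize(answers) for rid, answers in responses.items()}
for rid, stats in summary.items():
    print(rid, stats)
```

Rows with many blanks or a single repeated answer are candidates for a closer look, not automatic deletion.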

How To Handle Anomalies In The Data

Decide what you’re going to do about the anomalies you found in the data.

Filter values

Repair the file (impute values)

Leave them alone

Delete responses

Etc.

If you are going to “change” data or “delete” data, move the original values to a separate sheet.

You are documenting what you did

This makes it easier to “undo” your change if you need to.

Statistica Data

This sheet contains the “cleaned” data that you’re going to load into Statistica for further analysis.

The Excel manipulation was intended to address the “obvious” problems

Graphical Inspection Credit Scoring Data in Statistica

Application

As we have discussed practically all data will need some preparation

Moreover, that preparation may be slightly different based on the application and the type of analysis that you are doing

It is important to have a good understanding of both the data and what you’re trying to accomplish through the data mining process

Data preparation

Handling missing data and outliers

Selecting important variables

Sampling

Data preparation is specific to BOTH the data set and the task. The preparation method and decisions made during data preparation may change if either changes.

Application For Credit Scoring Data

Business Need

A financial institution has data about their past customers.

These customers are classified as either good or bad credit risks based on their history with the institution.

The classification (good/bad) is based on whether or not the loan payment was delinquent and the magnitude of the loss

A financial institution needs a way to decide if and how much credit to extend to customers who apply for loans.

Business goal: reduce the losses due to bad loans

Goals of the data mining process

Determine the variables that are best predictors of credit risk

Find a high performance predictive model that classifies customers

Deploy that model to make decisions on future credit applications

Update the model as more data is collected

Credit Scoring Data Set

We are going to explore the credit scoring data set

This data will be used to explore

Data preparation

Classification

Try to keep in mind – this will be a “classification exercise”.

It can be applied to different data sets and domains where classification is appropriate

Examine the data

Variable                                   Data Type
Credit rating                              Categorical
Balance of current account                 Categorical
Duration of credit                         Continuous
Payment of previous credit                 Categorical
Purpose of credit                          Categorical
Amount of credit                           Continuous
Value of savings                           Categorical
Employed by current employer               Categorical
Installment in % of available income       Categorical
Marital status                             Categorical
Gender                                     Categorical
Living in current household for            Categorical
Most valuable asset                        Categorical
Age                                        Continuous
Further running credits                    Categorical
Type of apartment                          Categorical
Number of previous credits at this bank    Categorical
Occupation                                 Categorical
TrainTest                                  Categorical

Credit Scoring

Start by looking at the credit scoring application and the business need

Review the variables in the credit risk data set

Discuss the next steps for the data mining process

Classification

Classification can be used to classify a variable with 2 or more groups

Find the probability of a particular predicted classification.

For example:

Loan denied

Loan approved

Examine The Data

Below is the data in Statistica

It was opened by FILE-> OPEN

Look At Some Histograms Of The Data

Histograms

Credit rating is the dependent variable – it is the one we want to make predictions for

Notice that there have been more than twice as many customers with good credit as compared to bad

This may mean that we need to adjust our sample to keep the analysis from being “good” credit biased

Histograms

Here we have the number of previous credits at the bank

5-6 and 7 or more are relatively small compared to the other categories

Hence, we may want to recode the data to have a 5 or more category

This is a good general rule of thumb
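
Recoding sparse categories is a simple lookup. The sketch below uses made-up counts (not the actual CreditScoring frequencies) and an illustrative `recode` mapping to merge the two small groups into a single "5 or more" category.

```python
from collections import Counter

# Hypothetical values for "Number of previous credits at this bank".
values = ["one"] * 6 + ["2-4"] * 3 + ["5-6", "7 or more"]

# Map the two sparse categories onto one combined category.
recode = {"5-6": "5 or more", "7 or more": "5 or more"}
recoded = [recode.get(v, v) for v in values]

before = Counter(values)
after = Counter(recoded)
print(before)
print(after)
```

Categories that are not in the mapping pass through unchanged, so the recode can be rerun safely on data that has already been recoded.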

Remaining Variables

Note the majority of customers either have no previous credit or have paid back their previous loans

Note that there have been more than twice as many male customers as female customers

Age is interesting in that you would expect that customers need to be at least 18 years old to apply for credit, so we need to make sure that is the case in the 15-20 year old group

Next Steps

We have stated the business goals and have data available to do the analysis

We have visually inspected the data in order to gain a high-level understanding of the data

We still need to do more work here

But we have identified our dependent variable (Credit rating) and the potential predictor variables

We need to continue exploring the data and look at things that we can do to prepare the data for analysis

Further Analysis

Go to the interactive drill down tool

Select the “Drill Variables”

Select “Payment of Previous Credits”

Drill Down

Select “No Previous Credits”

Go back to the “Drill down variables” and select “Number of previous credits at this bank”

Select “Brush”

Previous payment

So here we have an apparent contradiction.

We have drilled down to look at customers where the Payment of Previous Credits = no previous credits

Yet the number of previous credits at this bank has values of 2-4, 5-6, and 7 or more.
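
The same consistency check can be expressed as a filter. This is a sketch with hypothetical records; the field names `payment_history` and `credits_at_bank` are illustrative stand-ins for the two variables, and treating the three larger categories as contradictory is an assumption based on the observation above.

```python
# Hypothetical records; field names are stand-ins for "Payment of previous
# credits" and "Number of previous credits at this bank".
customers = [
    {"id": 1, "payment_history": "no previous credits", "credits_at_bank": "one"},
    {"id": 2, "payment_history": "no previous credits", "credits_at_bank": "2-4"},
    {"id": 3, "payment_history": "paid back", "credits_at_bank": "5-6"},
    {"id": 4, "payment_history": "no previous credits", "credits_at_bank": "7 or more"},
]

# Assumption: these three categories contradict "no previous credits".
MULTIPLE_CREDITS = {"2-4", "5-6", "7 or more"}

suspect_ids = [
    c["id"]
    for c in customers
    if c["payment_history"] == "no previous credits"
    and c["credits_at_bank"] in MULTIPLE_CREDITS
]
print(suspect_ids)
```

Records flagged this way need a decision from someone who knows how the two fields were collected before anything is repaired or deleted.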

Scatter Plot Matrix

A scatter plot matrix can help us look for potential outliers

Scatter Plot Matrix

We can combine this with the Drill Down and look at scatter plots of all of the continuous variables with respect to those where the Payment of Previous Credits = no previous credits

Scatter Plot – Duration Of Credit Vs Amount Of Credit

Scatter Plot

We can look for outliers here

If we know that there is a maximum loan amount, then we can remove those that are greater than that value

If we know that there is a maximum duration, we can remove all of those that are greater than that

Other Graphical Techniques

Box Plots

A box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles.

Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.

Outliers may be plotted as individual points.

Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution.

The spacing between the different parts of the box indicates the degree of dispersion (spread) and skewness in the data, and shows outliers.

There Are Several Variations of Box Plots

The “box” :

The band inside the box is the second quartile (the median).

Statistica gives you the option to make this the “mean”

The ends of the “whiskers” can represent several possible alternative values:

Min and Max of all of the data

Lowest datum still within 1.5 interquartile range (IQR) of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile (often called the Tukey boxplot)

one standard deviation above and below the mean of the data

Etc.
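
As an illustration of the Tukey variant, the whisker ends and outliers can be computed directly. This is a sketch using Python's `statistics.quantiles` with the "inclusive" method; quartile conventions differ between tools, so Statistica or SPSS may place the box edges slightly differently. The durations are hypothetical and `tukey_whiskers` is an illustrative name.

```python
from statistics import quantiles


def tukey_whiskers(data):
    """Return (lower_whisker, upper_whisker, outliers) for a Tukey box plot."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points still inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return min(inside), max(inside), outliers


# Hypothetical credit durations in months.
durations = [6, 12, 12, 18, 24, 24, 30, 36, 72]
result = tukey_whiskers(durations)
print(result)
```

Note that the whiskers stop at actual data values, not at the fences themselves, which is why the upper whisker here is shorter than 1.5 IQR.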

Box Plot

These may seem more “primitive” than a histogram, but they do have some advantages

They take up less space – so they are particularly useful in comparing distributions among several groups

The number and width of bins can greatly impact the appearance of a histogram

Box plots are particularly good at identifying outliers in continuous data

Box Plots in Statistica

Go to “GRAPHS” and select “Box Plots”

Graph Type

Box Whiskers

Regular (will give you 1 variable per graph)

Multiple (will give you 1 graph with all variables on it)

Variables – where you select the variables for analysis

Box Plots

Leave the defaults for Grouping Intervals

Middle point

Value – median

Style – determines the graphics that show the median in the box

Box Plots

I have re-selected “Multiple” Graph type

I prefer the median be marked with a “Line”

Box Plots

The graph is fairly crowded – but, I can see outliers that are identified by the small circles.

I will re-run this with fewer variables to see if we can see it better

Box Plots

Now we can more clearly see where the outliers are in the data set.

These need to be examined for deletion

Next we will look at this in SPSS

Marking The Outliers

Right clicking on the outliers will pull up a menu where you can tell Statistica to “Mark the Outliers”

SPSS – Box Plots

The data is loaded into SPSS

This can be done by loading the same file we used for Statistica

Go to “Graphs” and then to “Legacy Dialogs” and Select “Box Plots”

Select “Simple” and “Summaries of separate variables”

SPSS – Box Plots

Next we select the variables and move them to the “Boxes Represent” area

Select “OK”

SPSS – Box Plots

We see output very similar to Statistica.

The outliers are marked with “o” on the graph

In SPSS the default is to display the “number” of the response with the outlier

Inspecting Outliers In Box Plot

We want to be very conservative when identifying responses as “outliers”

Count how many “outliers” each respondent has

For example, response 188 has 3 outliers, 187 has 4 outliers, and so on.

I may delete 1 or 2 responses that have a “large” number of outliers and then rerun the box plot.

This will cause things to shift a bit

I then iterate until I’m happy with the data.

Leaving outliers in the data is fine – this may be a “true” response
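
Counting how many outliers each respondent has can also be done in code. The sketch below assumes the Tukey 1.5 × IQR rule for flagging outliers and uses a small hypothetical table; the column names are borrowed from the box plot labels, but the values and the function names are invented for illustration.

```python
import statistics
from collections import Counter


def outlier_rows(column, k=1.5):
    """Return the row indices whose value lies outside the Tukey fences."""
    q1, _median, q3 = statistics.quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(column) if v < low or v > high]


def outliers_per_respondent(table):
    """Count, for every row index, in how many columns it is an outlier."""
    counts = Counter()
    for column in table.values():
        counts.update(outlier_rows(column))
    return counts


# Hypothetical survey data: one list per variable, one position per respondent.
survey = {
    "PU_1": [4, 5, 6, 5, 4, 6, 1, 4, 6, 5],
    "PU_3": [5, 4, 6, 5, 4, 6, 1, 5, 4, 6],
    "PP_1": [6, 5, 4, 6, 5, 4, 5, 1, 6, 4],
}

# Respondents with the most outliers come first.
for respondent, n_outliers in outliers_per_respondent(survey).most_common():
    print(f"respondent {respondent}: {n_outliers} outlier(s)")
```

Respondents at the top of this list are the candidates to inspect first before rerunning the box plot.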

QUESTIONS?

Data

Categorical: Nominal, Ordinal

Numerical: Interval, Ratio

[Figures: histograms from the CreditScoring dataset (19 variables, 1000 cases); y-axis: No of obs; x-axis categories or range in parentheses]

Histogram of Credit Rating (bad, good)

Histogram of Number of previous credits at this bank (one, 2-4, 5-6, 7 or more)

Histogram of Balance of Current Account (no running account, no balance, <= $300, > $300)

Histogram of Duration of Credit (numeric, -10 to 80)

Histogram of Payment of Previous Credits (hesitant, problematic running accounts, no previous credits, no problems with current credits, paid back)

Histogram of Purpose of Credit (other, new car, used car, furniture, television, household appliances, repair, education, vacation, retraining, business)

Histogram of Amount of Credit (numeric, -$5,000.00 to $30,000.00)

Histogram of Value of Savings (no savings, <140, 140-700, 700-1400, >1400)

Histogram of Employed by Current Employer for (unemployed, <1 year, 1-5 years, 5-8 years, >8 years)

Histogram of Installment in % of Available Income (>35, 25-35, 15-25, <15)

Histogram of Marital Status (divorced/living apart, divorced/living apart/married, single, married/widowed)

Histogram of Gender (male, female)

Histogram of Living in Current Household for (<1 year, 1-5 years, 5-8 years, >8 years)

Histogram of Most Valuable Assets (no assets, car, life insurance, ownership of house or land)

Histogram of Age (numeric, 10 to 80)

Histogram of Further running credits (at other banks, at department store, no further running credits)

Histogram of Type of Apartment (free, rented, owned)

Histogram of Occupation (unskilled with no permanent residence, unskilled with permanent residence, skilled employee, executive/self-employed)

[Figure: Histogram for brushing: Number of previous credits at this bank (one, 2-4, 5-6, 7 or more); N Total: 1000, Selected: 530; selection: Payment of Previous Credits = no previous credits; y-axis: Number of counts]

[Figure: Correlations of Duration of Credit, Amount of Credit, and Age; N Total: 1000, Selected: 1000]

[Figure: Correlations of Duration of Credit, Amount of Credit, and Age; N Total: 1000, Selected: 530; selection: Payment of Previous Credits = no previous credits]

[Figure: Scatterplot of Amount of Credit against Duration of Credit (CreditScoring 19v*1000c); fitted line: Amount of Credit = 298.4367+204.818*x]

[Figure: Box Plot of multiple variables (1-ORIGINAL-FULL-DATASET-WALK-THROUGH 61v*261c); Median; Box: 25%-75%; Whisker: Non-Outlier Range; Outliers and Extremes marked; variables: CCSC_1, CCOC_1, CCAC_1, CCAC_6, HCA_5, UCA_4, PEOU_3, PU_4, PP_5, BI_5, PF_4, PIIT_1; y-axis: 0 to 8]

[Figure: Box Plot of multiple variables (1-ORIGINAL-FULL-DATASET-WALK-THROUGH 61v*261c); Median; Box: 25%-75%; Whisker: Non-Outlier Range; Outliers and Extremes marked; variables: UCA_1, UCA_3, UCA_5, PEOU_1, PEOU_3, PU_1, PU_3, PP_1, PP_3, PP_5; y-axis: 0 to 8]

Big Data Analytics Tools./Final Exam materials/2 BINS 4352 - Cluster Analysis.pptx

Cluster Analysis

Cluster Analysis - Review

Outline

What is it? What is a cluster?

How is it different from a decision tree?

What is distance and linkage?

What is hierarchical clustering?

What are a scree plot and a dendrogram?

What is non-hierarchical clustering (k-means)?

How to learn it in detail?

Simple Case Study

Student Physics Calculus
Joe 15 20
Bill 20 15
Paula 26 21
Jane 44 52
Jack 50 45
Carlos 57 38
Carla 80 85
Russell 90 88
Eddie 98 98

If we look at the student data on the left, we can easily see a pattern emerging

There are natural groupings of students that seem to be apparent in the data.

Plotting these on a 2-dimensional scatter plot makes it more visible

Similarities And Dissimilarities

If we look at the objects in any one of the groupings, they are all very similar to the other objects in that grouping

Also, if we look at any one object in a grouping, it is very dissimilar to any object in another grouping

This gives rise to the notion of:

Homogeneous within and Heterogeneous across based on characteristics

Clusters

These groupings are clusters

They represent a “natural grouping of similar objects” based on a collection of input parameters

Now, if you add a third or fourth dimension (e.g., English, History), then the clusters may change

However, the way in which they are constructed is the same

Clusters

Clusters

There is no objective function (i.e., an equation to be optimized given certain constraints with variables that need to be minimized or maximized such as trying to express a business goal in mathematical terms)

There is no dependent variable

This is sometimes called subjective segmentation

The segmentation is developed on its own based on the values of the input variables

It is called an unsupervised learning technique.

Once the segments are developed

They need to be understood

You need to decide how you are going to deal with the segments that have emerged

How Is This Different From A Decision Tree

Decision tree technique requires a clearly defined dependent variable

For example, GOOD/BAD credit

The technique is based on identifying those variables (characteristics) which are closely associated with the dependent variable.

Distances

Distances (cont)

[Figure: scatter plot of the students, with points labeled Paula, Bill, Joe, Jane, Jack, Carlos, Carla, Russell, Eddie]
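
The distance formulas on these slides did not survive extraction. As a hedged illustration only, the sketch below assumes plain Euclidean distance on the (Physics, Calculus) scores from the case study table; the slides may have used a different measure.

```python
import math
from itertools import combinations

# (Physics, Calculus) scores from the Simple Case Study table.
students = {
    "Joe": (15, 20), "Bill": (20, 15), "Paula": (26, 21),
    "Jane": (44, 52), "Jack": (50, 45), "Carlos": (57, 38),
    "Carla": (80, 85), "Russell": (90, 88), "Eddie": (98, 98),
}


def euclidean(a, b):
    """Straight-line distance between two score pairs (assumed measure)."""
    return math.dist(a, b)


# The pair of students with the smallest distance between them.
closest = min(
    combinations(students, 2),
    key=lambda pair: euclidean(students[pair[0]], students[pair[1]]),
)
print(closest, round(euclidean(students[closest[0]], students[closest[1]]), 2))
```

Students in the same apparent grouping sit a short distance apart, while students in different groupings are far apart.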

Distance Between 2 Clusters

Sometimes called the linkage function: the inter-cluster distance

How do you calculate the distance between 2 clusters?

Single linkage

Calculate the distance between each point in one cluster to each point in the neighboring cluster and then find the shortest distance.

Complete linkage

Similar to single linkage – except we look for the furthest distance

Centroid distance

Calculate the “center” of each cluster and then calculate the distance between centroids
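
The three linkage rules can be compared on two of the student groupings. This is a minimal sketch that assumes Euclidean distance between points; the cluster names `low` and `mid` and the function names are chosen for this example only.

```python
import math
from itertools import product


def single_linkage(a, b):
    """Shortest distance between any point in a and any point in b."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Farthest distance between any point in a and any point in b."""
    return max(math.dist(p, q) for p, q in product(a, b))


def centroid(points):
    """Mean point of a cluster."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def centroid_distance(a, b):
    """Distance between the centers of the two clusters."""
    return math.dist(centroid(a), centroid(b))


low = [(15, 20), (20, 15), (26, 21)]   # Joe, Bill, Paula
mid = [(44, 52), (50, 45), (57, 38)]   # Jane, Jack, Carlos

print("single:  ", round(single_linkage(low, mid), 2))
print("complete:", round(complete_linkage(low, mid), 2))
print("centroid:", round(centroid_distance(low, mid), 2))
```

For the same pair of clusters the three rules give different distances, which is why the choice of linkage can change which clusters are merged first.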

Hierarchical Clustering

Used when you have a small number of observations (usually hundreds)

You cannot use this method for a large dataset because it becomes computationally impractical

The way this works is to join objects (cases) together into successively larger clusters using some measure of similarity or distance

During the clustering process it shows how the clusters are formed

The result of the clustering is the hierarchical tree

In SAS this is done by using the “proc cluster” command

In Statistica this is called Joining or Tree Clustering

Hierarchical Clustering

We begin with each case in a cluster by itself

In each step we slowly “relax” the criterion which defines “uniqueness”

In other words, we lower the threshold for declaring that two or more objects belong to the same cluster

As a result we group more and more objects together and each “layer” consists of increasingly dissimilar objects

In the last step – all objects are grouped together into a single cluster

Scree Plot

A scree plot shows the “within cluster variance” plotted against the number of clusters

Total variance = between variance + within variance

The “elbow” in the graph indicates the optimal number of clusters

When the number of clusters = 1, the within group variance = total variance
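
The variance decomposition can be checked numerically on the student scores. The sketch below works with sums of squared distances (variance multiplied by the number of cases), which is an assumption about the exact quantity Statistica plots; the three-cluster grouping is the one that appears natural in the case study.

```python
def mean_point(points):
    """Mean of a list of points, coordinate by coordinate."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)


def within_ss(clusters):
    """Squared distances of points to their own cluster mean, summed."""
    return sum(sum_of_squares(c, mean_point(c)) for c in clusters)


def between_ss(clusters):
    """Squared distances of cluster means to the grand mean, weighted by size."""
    grand = mean_point([p for c in clusters for p in c])
    return sum(len(c) * sum_of_squares([mean_point(c)], grand) for c in clusters)


low = [(15, 20), (20, 15), (26, 21)]
mid = [(44, 52), (50, 45), (57, 38)]
high = [(80, 85), (90, 88), (98, 98)]
all_points = low + mid + high

total_ss = sum_of_squares(all_points, mean_point(all_points))
three = [low, mid, high]
print("total:  ", round(total_ss, 1))
print("within: ", round(within_ss(three), 1))
print("between:", round(between_ss(three), 1))
```

Plotting `within_ss` for 1, 2, 3, ... clusters gives the scree curve; with a single cluster it equals the total.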

Dendrogram

When the data contain a clear structure in terms of clusters, then its structure is often reflected in the hierarchical tree

The result of a successful analysis is that you can detect and interpret the structure by looking at the branches

Hierarchical Clustering - Students

Everything starts out in a cluster by itself

Find the closest objects.

Merge those into a cluster.

Recalculate distances (from center of new cluster)

Find the closest objects.

Merge those into a cluster.

Recalculate distances (from center of new cluster)

[Figure: clustering steps on the student data, with labels Joe, Bill, Paula, Jack, Jane]
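
The find, merge, recalculate loop can be sketched as follows. This is an illustrative version that assumes Euclidean distance and measures the distance between clusters from their centers, as the slide describes; `agglomerate` and its return format are names invented for this example.

```python
import math
from itertools import combinations

students = {
    "Joe": (15, 20), "Bill": (20, 15), "Paula": (26, 21),
    "Jane": (44, 52), "Jack": (50, 45), "Carlos": (57, 38),
    "Carla": (80, 85), "Russell": (90, 88), "Eddie": (98, 98),
}


def center(points, members):
    """Mean point of the named members."""
    coords = [points[name] for name in members]
    return tuple(sum(axis) / len(coords) for axis in zip(*coords))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in points]   # everything starts alone
    merges = []
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (distance between centers).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(center(points, clusters[ij[0]]),
                                     center(points, clusters[ij[1]])),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Replace the pair with the merged cluster; centers are recomputed
        # on the next pass through the loop.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


clusters, merges = agglomerate(students, n_clusters=3)
for members in clusters:
    print(sorted(members))
```

The `merges` list records the order in which clusters were joined, which is the information a dendrogram displays.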

K-means Clustering

Non-hierarchical

Used when you have a large number of observations

You decide up front how many clusters you need (k)

K-means algorithm

1. Partition objects into k non-empty subsets (randomly)

2. Compute the centroid for each of the clusters

Centroid is the center (i.e., mean point of the cluster)

This defines the “seed” point for each cluster

3. Assign each object to the cluster with the nearest seed point

4. Go back to step 2 and repeat until the assignment does not change

In SAS you do this by using the “proc fastclus” command
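
The k-means steps above can be sketched in plain Python. Several details are assumptions made for this illustration and are not taken from the slides: Euclidean distance, a shuffled round-robin split to guarantee non-empty starting subsets, re-seeding an empty cluster from a random data point, and the `seed` and `max_iter` parameters.

```python
import math
import random


def centroid(points):
    """Mean point of a cluster."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Random partition into k non-empty subsets (round-robin over a shuffle).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k

    centers = []
    for _ in range(max_iter):
        # Compute the centroid (seed point) of each cluster.
        centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Assumption: an empty cluster is re-seeded from a random point.
            centers.append(centroid(members) if members else rng.choice(points))
        # Assign each object to the cluster with the nearest seed point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


scores = [(15, 20), (20, 15), (26, 21), (44, 52), (50, 45),
          (57, 38), (80, 85), (90, 88), (98, 98)]
labels, centers = k_means(scores, k=3)
print(labels)
```

Because the starting partition is random, different seeds can converge to different solutions, so it is common to run the procedure several times.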

EM Clustering in Statistica

Getting started

EM algorithm

Uses distributions of the (continuous) data to find the clusters

You specify the distribution technique

Very similar to K-means clustering

First step – hypothesize how many clusters will be in the data

With K-means and EM, the optimum number of clusters can be determined with V-fold cross validation
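
To make the EM idea concrete, here is a minimal sketch for one continuous variable and two hypothesized clusters, each modeled as a normal distribution. It is not Statistica's implementation: the starting values, the variance floor, the function names, and the toy data are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(xs, n_iter=50):
    # Hypothesize two clusters; start the means at the extremes of the data.
    mus = [min(xs), max(xs)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: probability that each case belongs to each cluster.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, mus, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each cluster from its weighted cases.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(spread, 1e-6)   # floor avoids a zero variance
            weights[k] = nk / len(xs)
    return weights, mus, variances, resp


data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]   # hypothetical values
weights, mus, variances, resp = em_two_gaussians(data)
print([round(m, 2) for m in mus])
```

The `resp` rows play the role of the classification probabilities for each case: unlike k-means, every case gets a probability for every cluster rather than a single hard assignment.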

Setting Up A Cluster Analysis

Go to DATA MINING -> Cluster

This will pull up the K-means / EM clustering dialog box

Use EM clustering

Select all of the variables (Except for Sample)

Select Variables And Configure V-fold Validation

Select EM algorithm

Number of clusters 2

Number of iterations 50

Keep defaults on EM tab – we have limited options here because all of the variables are categorical

On the Validation tab select

V-fold cross validation

Statistica will search for the optimum number of clusters between 2 and 25 clusters

Select “OK”

Results - Similar To K-means

We see that it used the EM algorithm with normal distribution

V-fold cross validation was used with 5 folds

3 clusters were created

Graph Of Cost Sequence

Classification Probabilities For Each Case

Graph of Frequencies

This provides a graphical representation of the information contained in the frequency table(s)

What Do We Do With This Data?

Generally what we want to do with the results of a cluster analysis is start to build a narrative that describes the classification that occurred

We want to explain how this classification scheme can be used to describe the phenomena we are interested in

Questions?

[Figure: Graph of Cost Sequence (EM); best number of clusters: 3; x-axis: Number of clusters (2, 3, 4); y-axis: -2 * log-likelihood]

[Figures: graphs of frequencies by cluster (Number of clusters: 3; series: Cluster 1, Cluster 2, Cluster 3); y-axis: Frequencies; x-axis categories in parentheses]

Graph of frequencies for ANNUALINC ($75,000 or more; Less than $10,000; $50,000 to $74,999; $30,000 to $39,999; $10,000 to $14,999; $20,000 to $24,999; $40,000 to $49,999; $25,000 to $29,999; $15,000 to $19,999)

Graph of frequencies for SEX (Female; Male)

Graph of frequencies for MARSTATUS (Married; Single, never married; Divorced or separated; Living together, not married; Widowed)

Graph of frequencies for AGE (45 thru 54; 25 thru 34; 14 thru 17; 55 thru 64; 18 thru 24; 65 and Over; 35 thru 44)

Graph of frequencies for EDUCATION (1 to 3 years of college; College graduate; Grades 9 to 11; Graduated high school; Grad Study; Grade 8 or less)

Big Data Analytics Tools./Final Exam materials/BINS 4352- Comparing Models and Model Deployment.pptx

Comparing Models and Model Deployment

Lift And Gains Charts

Lift chart

Shows the effectiveness of the model compared to no model

Gains chart

Shows the percentage of observations correctly classified for a given category (e.g., our dependent variable DEFAULT=“BAD”)

Creating Model Assessment Charts

Using the model, produce estimated probabilities for the target event for each case

Sort all cases by decreasing estimated probability.

If the model is good, cases where the target event actually shows up should have higher estimated probabilities, therefore should end up higher in the list

Split the cases into, say 10 bins, so that:

Bin #1 has the highest probabilities

Bin #10 the lowest probabilities

Now look at the number of cases where the target event actually happens.
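
The sort-and-bin procedure can be sketched as below. The probabilities and outcomes are hypothetical, and the example uses 5 bins only because it has 10 cases; `targets_per_bin` is a name chosen for this illustration.

```python
def targets_per_bin(probabilities, actual, n_bins=10):
    """Sort cases by decreasing probability, split into bins, count targets.

    probabilities: estimated probability of the target event for each case
    actual: 1 if the target event actually happened for that case, else 0
    """
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    counts = [0] * n_bins
    for rank, i in enumerate(order):
        bin_index = rank * n_bins // len(order)   # bin 0 holds the highest scores
        counts[bin_index] += actual[i]
    return counts


# Hypothetical model output for 10 cases.
probabilities = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1, 0.6, 0.3, 0.95, 0.05]
actual = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
print(targets_per_bin(probabilities, actual, n_bins=5))
```

A good model concentrates the target cases in the first bins, which is the pattern the charts that follow are designed to reveal.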

Creating Model Assessment Charts

Actually observed values of the target variable

Predicted Probabilities

Looking Inside 10 Bins

Good models assign high probabilities to cases which will actually have the target event (defaulting, donating, etc.)

The red balls represent where the target event “actually” occurred

The black balls represent where the target event did not actually occur

[Figure: cases sorted into 10 bins, from Bin #1 (high probabilities) to Bin #10 (low probabilities)]

Percent Response

Percent Response shows what the percentage of each bin’s contents are target cases

For example, (15/18) ~83% of cases that ended up in bin 1 were observed to have actually become target events.

So here we are just looking at each “bin” separately.

83% of the cases in Bin 1 were the target

72% of the cases in Bin 2 were the target

39% of the cases in Bin 3 were the target

So on
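
Percent response is a per-bin ratio and can be computed directly. The target counts below are the ones shown in this example (15, 13, 7, 3, 3, then zeros); the bin size of 18 comes from the 15/18 figure and is assumed here to apply to every bin.

```python
# Target cases observed in Bins #1 through #10 of the example.
targets = [15, 13, 7, 3, 3, 0, 0, 0, 0, 0]
bin_size = 18   # assumption: every bin holds 18 cases, as in 15/18

# Share of each bin's contents that are target cases.
percent_response = [100 * t / bin_size for t in targets]

for number, value in enumerate(percent_response, start=1):
    print(f"Bin {number}: {value:.1f}%")
```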

Percent Captured Response

Percent Captured Response shows you what percentage of all target cases are “captured” by each bin

For example, the first bin captured (15/(15+13+7+3+3)) ~37%

So, here we are looking at “All of the target cases” and determining the percentage captured by each of the bins

Bin 1 captured 37% of the target cases

Bin 2 captured 32% of the target cases

Bin 3 captured 17% of the target cases

So on
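
Percent captured response uses the same counts but divides by the total number of target cases rather than by the bin size. A small sketch, using the target counts from this example:

```python
# Target cases observed in Bins #1 through #10 of the example.
targets = [15, 13, 7, 3, 3, 0, 0, 0, 0, 0]
total_targets = sum(targets)   # 41 target cases in all

# Share of all target cases that landed in each bin.
percent_captured = [100 * t / total_targets for t in targets]

for number, value in enumerate(percent_captured, start=1):
    print(f"Bin {number}: {value:.1f}%")
```

The values sum to 100% because every target case falls in exactly one bin.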

Baseline Model (No Model At All)

Assume target cases are assigned randomly into the 10 bins.

This would create a baseline model

Comparing our model to the baseline model allows us to see how much better our model can do.

This is shown in a “lift chart”

Lift Chart

For each bin, get the ratio of the “target cases in the models bin” and divide that by the target cases in the baseline.

This is the lift value for that “bin”.

So, the lift chart represents a ratio of the baseline with what was captured

Bin #1 = 15/3 = 5

Bin #2 = 13/3 = 4.3

Bin #3 = 7/3 = 2.3
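
The lift values can be reproduced with a one-line ratio. In this sketch the baseline count per bin is a parameter, and 3 is simply the value used in the slide's arithmetic; `lift_per_bin` is a name chosen for this example.

```python
def lift_per_bin(targets, baseline_per_bin):
    """Ratio of the model's target count to the baseline count, bin by bin."""
    return [t / baseline_per_bin for t in targets]


# Target cases observed in Bins #1 through #10 of the example.
targets = [15, 13, 7, 3, 3, 0, 0, 0, 0, 0]
lifts = lift_per_bin(targets, baseline_per_bin=3)

for number, value in enumerate(lifts, start=1):
    print(f"Bin {number}: lift = {value:.1f}")
```

If the baseline is instead derived as the total number of targets divided by the number of bins, pass that value as `baseline_per_bin`.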

Lift value for Bin #1 = 15/3 = 5

Lift Chart

So, what we want is for more of the targets to be contained in the high probability buckets

The higher the model line is over the baseline (no model), the better our model

Lift values for Bins #1 through #10: 15/3, 13/3, 7/3, 3/3, 3/3, 0/3, 0/3, 0/3, 0/3, 0/3

Lift Charts

Visual representation that show the effectiveness of the model compared to no model at all – or the baseline model

Here we want the graph to be high on the left side and go down as it moves to the right

This indicates that more responses were correctly identified (higher probability) in the “earlier” bins

Gains Charts Are Very Similar To Lift Charts

Here when looking at the bins consider how many of the target responses have been identified.

Since we have 10 bins, the baseline model would assume that when 10% of the responses have been looked at, 10% of the targets would have been found.

However, in bin #1, after looking at 10% of the responses, 36.6% of the targets have been found.

This represents an “improvement” over the baseline model

Gain for Bin #1 = 15/41 ≈ 0.366

Gains Chart

With the gains chart, the “further” the line for the model is from the baseline model, the better the model

In other words, we want to maximize the area under the curve between the model and the baseline

Cumulative gains for Bins #1 through #10:

15/41 28/41 35/41 38/41 41/41 41/41 41/41 41/41 41/41 41/41

0.37 0.68 0.85 0.93 1.00 1.00 1.00 1.00 1.00 1.00
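
The cumulative gains are a running total of the per-bin target counts divided by the total number of targets. A short sketch with the counts from this example, compared against a baseline that finds 10% of the targets per bin:

```python
from itertools import accumulate

# Target cases observed in Bins #1 through #10 of the example.
targets = [15, 13, 7, 3, 3, 0, 0, 0, 0, 0]
total_targets = sum(targets)

# Share of all targets found after looking at bins 1..i.
gains = [running / total_targets for running in accumulate(targets)]

# Baseline: after i of n bins, i/n of the targets have been found.
baseline = [(i + 1) / len(targets) for i in range(len(targets))]

for number, (g, b) in enumerate(zip(gains, baseline), start=1):
    print(f"after bin {number}: model {g:.2f}, baseline {b:.2f}")
```

The gap between `gains` and `baseline` at each bin is what the gains chart displays as the distance between the two lines.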

Gains Chart

Shows the percentage of observations correctly classified for a given category

The goal is to maximize the area between the model and the baseline model.

So, on the left the classification tree (red) performs better than the linear model (green)

Baseline Versus Perfect Model

In order to investigate the differences in model performance, we contrast our model to two models with extreme performance:

The baseline (bad performance)

The perfect model (excellent performance)

Model Deployment

Data Mining Models

Statistica can generate code for deployment using

Statistica Visual Basic

For deployment in the Statistica Data Miner Workspace

C and C++

Provide a framework for development of a custom deployment tool

PMML

Used in Statistica via Rapid Deployment

SAS

Used for scoring against data stored in a SAS database

Deployment to Statistica Enterprise

Code generator button on the Reports dialog will produce deployment code

Build Deployment Code

Build a C&RT Model

I built a C&RT model, adjusting parameters until I got fairly reasonable performance.

Deployment

You have the model the way that you want it and are ready to deploy

We are going to deploy the model so that it can be used in Rapid Deployment within Statistica

So –

Go to the Report TAB

Select Code Generator

We want to generate PMML code

PMML Code

Saving the PMML Code

Right click on the Tree PMML deployment code

Select Save Item As

Name the file

Make sure the file type is set to XML

Repeat The Process For CHAID

Adjust parameters and generate a model

Deploy the model in PMML format

Repeat For Boosted Tree

Adjust parameters and generate a model

Deploy the model in PMML format

Repeat For Random Forest

Adjust parameters and generate a model

Deploy the model in PMML format

Pull Models into Rapid Deployment

Rapid Deployment is on the Data Mining TAB

Select “Load Models from disk”

Select the models that were saved and select “Open”

Comparing the Models

4 models have been successfully loaded into Rapid Deployment

Notice that it pulled the variable information from the saved models

Check the box for “Include pred. probs. in output”

Also – Check the box for “Predict case(s) with missing data in inputs” – this is because our dataset has so much missing data

Start Comparing

Select the Summary of Predicted and residual values

It appears from this that the Boosted Tree has the lowest error rate of the 4 models.

Gains Chart

Go to the Lift Chart TAB

Select “Gains chart”

Click on “Lift chart” to see the chart

Here we want to maximize the area under the curve.

Boosted Tree looks like the best model

Lift Chart

Go to the Lift Chart TAB

Select “Lift chart (response %)”

Click on “Lift chart” to see the chart

Here we are looking at how well the models did for predictive accuracy

Again the Boosted Tree model performs best

Ensemble Models

What Is Ensemble Modeling?

Ensemble modeling is the process of running two or more related but different analytical models and then synthesizing the results into a single score or spread in order to improve the accuracy of the prediction.

A Random Forest may be thought of as a type of ensemble model.

A random forest generates a collection of trees and then those trees “vote” on what the prediction may be.

These trees are “similar” with respect to the variable definitions

But, are parameterized in such a way as to produce “different” trees

E.g., we may limit the number of predictor variables that can be considered at each node

Voting Across Models

Voting across models (a.k.a. Bagging) is particularly useful when we are working with smaller data sets

[Diagram: Input Data feeds four models (C&RT, CHAID, Random Forest, Boosted Tree), whose outputs are combined into a single Prediction]
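
Voting across models can be sketched as a simple majority vote. The model names come from the diagram; the individual predictions are hypothetical, `voted_prediction` is a name invented for this example, and Statistica's own voting rule may differ (for instance in how it breaks ties).

```python
from collections import Counter


def voted_prediction(predictions):
    """Return the class predicted by the largest number of models.

    Ties go to the class that was seen first, which is an arbitrary choice
    made for this sketch.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions for a single case.
case_predictions = {
    "C&RT": "GOOD",
    "CHAID": "GOOD",
    "Random Forest": "BAD",
    "Boosted Tree": "GOOD",
}
print(voted_prediction(case_predictions.values()))
```

With an even number of models ties are possible, so in practice a tie-breaking rule (or weighting by predicted probability) needs to be decided up front.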

Why Worry About Voting?

In data mining there are a variety of model building tools and techniques available.

No model is perfect and we do our best to “tune” the model to be as accurate as possible within necessary constraints

No model can capture all of the underlying relationships that exist in the data

Some may capture some aspects of the data, while other models capture other relationships

Using an ensemble modeling technique can improve the predictive accuracy

The collection of models may indeed be more accurate than any one model taken alone

Model Stability

Data mining can uncover complex relationships between variables

Data scientists use the term model “stability” to refer to how sensitive the resulting models are to variation in the training data.

This is often an issue with high dimensional spaces

Or, when too little data is available

Domain experts (users) have little confidence in models that change radically based on the training data used to create them

Using multiple models with voting helps to counteract the issue of instability

Often the ensemble models outperform any individual model

Rapid Deployment

Rapid deployment in Statistica can be used to output the predictions made by individual models as well as the voted prediction

Going back to Rapid Deployment – if we scroll all the way to the right we will see the “Voted Prediction”

Looking At The Voted Predictions

Using the Summary of Deployment spreadsheet as input

Right click on the file in the browser

Select “Use as Active Input”

Go to Statistics – Basic Statistics

Select “Tables and banners”

Cross-tabulation Of Observed Value And The Voted Predicted Value

Select “Specify tables (select variables)”

Select “DEFAULT” – the observed variable in List 1

Select “Voted Predicted” – the variable in List 2

Select “OK”

On the Options “TAB” check

Percentages of total count

Percentages of row counts

Percentages of column counts

Select “Summary”

Results – Similar To What We Have Seen For The Other Models

These are just a few statistics that we can look at when we are trying to develop our models for rapid deployment

Questions ?

[Figure: Tree 11 graph for DEFAULT; Num. of non-terminal nodes: 17, Num. of terminal nodes: 18; root node ID=1, N=3364, GOOD; split variables: DEBTINC, DELINQ, CLAGE, VALUE, LOAN, NINQ, YOJ, DEROG, MORTDUE; node classes: BAD, GOOD]

[Figure: Tree graph for DEFAULT; Num. of non-terminal nodes: 8, Num. of terminal nodes: 19; root node ID=1, N=3364, GOOD; split variables: DEROG, DELINQ, DEBTINC, JOB, CLAGE; node classes: BAD, GOOD]

Classification matrix 0 (hmeq in hmeq)

Dependent variable: DEFAULT

Options: Categorical response, Analysis sample

Observed    Statistic          Predicted BAD  Predicted GOOD  Row Total
BAD         Number             38             262             300
BAD         Column Percentage  64.41%         7.93%
BAD         Row Percentage     12.67%         87.33%
BAD         Total Percentage   1.13%          7.79%           8.92%
GOOD        Number             21             3043            3064
GOOD        Column Percentage  35.59%         92.07%
GOOD        Row Percentage     0.69%          99.31%
GOOD        Total Percentage   0.62%          90.46%          91.08%
All Groups  Count              59             3305            3364
All Groups  Total Percent      1.75%          98.25%
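
The percentages in this classification matrix follow directly from the four counts. The sketch below recomputes them; `crosstab_percentages` is a name invented for this illustration, not a Statistica function.

```python
def crosstab_percentages(counts):
    """counts maps (observed, predicted) to a number of cases."""
    total = sum(counts.values())
    row_totals, col_totals = {}, {}
    for (observed, predicted), n in counts.items():
        row_totals[observed] = row_totals.get(observed, 0) + n
        col_totals[predicted] = col_totals.get(predicted, 0) + n
    table = {}
    for (observed, predicted), n in counts.items():
        table[(observed, predicted)] = {
            "column_pct": round(100 * n / col_totals[predicted], 2),
            "row_pct": round(100 * n / row_totals[observed], 2),
            "total_pct": round(100 * n / total, 2),
        }
    return table


# Counts from the classification matrix, keyed by (observed, predicted).
counts = {
    ("BAD", "BAD"): 38, ("BAD", "GOOD"): 262,
    ("GOOD", "BAD"): 21, ("GOOD", "GOOD"): 3043,
}
table = crosstab_percentages(counts)

# Overall error rate: cases where observed and predicted disagree.
misclassified = counts[("BAD", "GOOD")] + counts[("GOOD", "BAD")]
error_rate = misclassified / sum(counts.values())

print(table[("BAD", "BAD")])
print(f"error rate: {error_rate:.2%}")
```

The row percentage for observed BAD cases shows how few of the actual defaults this model catches, which the overall error rate alone hides.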

[Figure: Summary of Boosted Trees. Response: DEFAULT. Optimal number of trees: 199; maximum tree size: 3. Average multinomial deviance (y-axis, 0.15 to 0.55) versus number of trees (x-axis, 20 to 200) for train and test data, with the optimal number marked.]

Classification matrix

Analysis sample; Number of trees: 199

[Figure: Summary of Random Forest. Response: DEFAULT. Number of trees: 100; maximum tree size: 100. Misclassification rate (y-axis, 0.02 to 0.24) versus number of trees (x-axis, 10 to 100) for train and test data.]

[Figure: Gains Chart - Response/Total Response %, cumulative. Selected category of DEFAULT: BAD. Gains (y-axis, 0 to 100) versus percentile (x-axis, 0 to 100) for Baseline, BoostTreeModel, TreeModel, ExhaustiveCHAIDModel, and RandomForestModel.]

[Figure: Lift Chart - Response %, cumulative. Selected category of DEFAULT: BAD. Response % (y-axis, 15 to 95) versus percentile (x-axis, 0 to 110) for Baseline, BoostTreeModel, TreeModel, ExhaustiveCHAIDModel, and RandomForestModel.]

[Figure: Bivariate Distribution: DEFAULT x Voted prediction]

Big Data Analytics Tools./Final Exam materials/1 BINS 4352 Cluster Analysis.pptx

Cluster Analysis

Introduction to Clustering

Cluster: A collection of data objects

Large similarity among objects in the same cluster

Dissimilarity among objects in different clusters

Clustering is an Unsupervised Classification technique: no pre-determined classes

Typical applications of clustering:

As a stand-alone analysis, to gain insight on the data

As a pre-processing step for other predictive models

Unsupervised Classification

Unsupervised classification (clustering) has an UNKNOWN TARGET

Clustering Applications

Marketing:

Customer segmentation - Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use:

Identification of areas of similar land use in an earth observation database

Insurance:

Identifying groups of motor insurance policy holders with a high average claim cost

City-planning:

Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies:

Observed earthquake epicenters should cluster along continental faults

Data Types In Clustering Analysis

Interval-scaled variables

Binary variables

Nominal, ordinal, and ratio variables

Variables of mixed types

Interval-valued Variables

Standardize data

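Calculate the mean absolute deviation:

$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$

where

$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$

Calculate the standardized measurement (z-score):

$z_{if} = \frac{x_{if} - m_f}{s_f}$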

Using mean absolute deviation is more robust than using standard deviation
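
As a small illustration of the standardization described on this slide, the Python sketch below computes z-scores for one variable using the mean absolute deviation in place of the standard deviation. The function name and the sample values are assumptions made for the example.

```python
def standardize(values):
    """Standardize one variable using the mean absolute deviation.

    m is the mean of the variable, s is the mean absolute deviation
    around m, and each z-score is (x - m) / s.
    """
    n = len(values)
    m = sum(values) / n
    s = sum(abs(x - m) for x in values) / n
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m) / s for x in values]


# Hypothetical measurements for a single interval-scaled variable.
scores = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
print(scores)
```

Because the deviations are not squared, a single extreme value inflates s less than it would inflate a standard deviation, so the z-score of an outlier stays large and the outlier remains easy to spot.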

Why Standardize The Data?

Standardization recasts the units of measure of attributes into dimensionless units.

This addresses the potential of the chosen units impacting the measured similarities among objects.

Also, standardizing makes attributes contribute more equally to the similarities among objects.

In essence, it equalizes the ranges of the variables, ensuring that a variable with a greater range does not unduly influence the analysis over a variable with a smaller range.

Similarity / Dissimilarity

[Figure: objects separated by distances d1, d2, and d3]

Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects

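Some popular ones include the Minkowski distance:

$d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q}$

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer

For q=1, we get the MANHATTAN DISTANCE

For q=2, we get the EUCLIDEAN DISTANCE

Manhattan Distance

If q = 1, d is the Manhattan distance:

$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$

Euclidean Distance

If q = 2, d is the Euclidean distance:

$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$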

Properties

d(i,j) ≥ 0

d(i,i) = 0

d(i,j) = d(j,i)

d(i,j) ≤ d(i,k) + d(k,j)

Example

Student   Exam 1   Exam 2
John      92       87
Jane      100      90

If we only consider Exam 1, what is the distance from John to Jane?

If we consider both Exam 1 and Exam 2, what is the distance from John to Jane?
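
One way to work the example is with a small Minkowski distance function in Python. The function name is an assumption; the scores are the ones in the table above.

```python
def minkowski(a, b, q):
    """Minkowski distance of order q between two equal-length points."""
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1 / q)


john = (92, 87)
jane = (100, 90)

# Exam 1 only: a single dimension, so the distance is |92 - 100| = 8.
exam1_only = minkowski(john[:1], jane[:1], 2)

# Both exams, q = 1 (Manhattan): |92 - 100| + |87 - 90| = 11.
manhattan = minkowski(john, jane, 1)

# Both exams, q = 2 (Euclidean): sqrt(8^2 + 3^2) = sqrt(73), about 8.54.
euclidean = minkowski(john, jane, 2)

print(exam1_only, manhattan, round(euclidean, 3))
```

Adding Exam 2 increases the distance under either measure, and the Manhattan distance is never smaller than the Euclidean distance for the same pair of points.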

Proximity Measures For Binary Attributes

A contingency table for binary data

Distance measure for symmetric binary variables

Distance measure for asymmetric binary variables

Jaccard coefficient (similarity measure for asymmetric binary variables)

Object i

Object j

Nominal Variable Distance

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching

$d(i,j) = \frac{p - m}{p}$

m: # of matches, p: total # of variables

Method 2: use a large number of binary variables

creating a new binary variable for each of the M nominal states
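
Method 1 (simple matching) can be sketched in Python as the share of variables on which two objects disagree. The function name and the two example objects are hypothetical.

```python
def simple_matching_distance(a, b):
    """Distance between two objects described by nominal variables.

    p is the total number of variables and m the number of matches,
    so the distance is (p - m) / p.
    """
    if len(a) != len(b):
        raise ValueError("objects must have the same number of variables")
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


# Hypothetical objects with three nominal variables each.
obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
distance = simple_matching_distance(obj_i, obj_j)
print(distance)  # one mismatch out of three variables
```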

Ordinal Variables

An ordinal variable can be discrete or continuous

order is important, e.g., rank

Can be treated like interval-scaled

replacing $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$

map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$

compute the dissimilarity using methods for interval-scaled variables
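
The rank mapping for ordinal variables can be illustrated with a short Python sketch. The function name and the four-level example scale are assumptions.

```python
def ranks_to_unit_interval(ranks, num_states):
    """Map ranks 1..num_states onto [0, 1] as (r - 1) / (num_states - 1)."""
    if num_states < 2:
        raise ValueError("an ordinal variable needs at least two states")
    return [(r - 1) / (num_states - 1) for r in ranks]


# Hypothetical ordinal scale with four ordered states, ranked 1 to 4.
mapped = ranks_to_unit_interval([1, 2, 3, 4], num_states=4)
print(mapped)
```

Once every ordinal variable is on the same [0, 1] scale, the interval-scaled distance measures (Manhattan, Euclidean) can be applied to the mapped values directly.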

Similarity between documents: the vector space model (VSM)

Document (Text) Classification is made possible through calculations of VSM-based document similarities

The same similarity metric is used by search engines to calculate similarity between query texts and retrieved documents

Every document is represented as a sum vector of its index terms

Cosine of angle between vectors determines relevance:

[Figure: documents drawn as vectors from the origin of the vector space, with terms as axes; similar documents are vectors separated by a small angle.]
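
The cosine measure mentioned above can be sketched in Python for two documents represented as term-count vectors over a shared vocabulary. The function name, the vocabulary, and the counts are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an empty document as unrelated to everything
    return dot / (norm_u * norm_v)


# Hypothetical counts over the vocabulary ("cluster", "tree", "loan").
doc_a = [3, 0, 1]
doc_b = [6, 0, 2]   # same direction as doc_a, twice as long
doc_c = [0, 4, 0]   # shares no terms with doc_a

same_direction = cosine_similarity(doc_a, doc_b)
no_overlap = cosine_similarity(doc_a, doc_c)
print(round(same_direction, 6), no_overlap)
```

The measure depends on direction, not length: a document that repeats another's terms twice as often still scores close to 1, while documents with no terms in common score 0.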

K-means Clustering Method

Given k, the k-means algorithm is implemented in the following steps (Olson & Shi, p. 75):

1. Select the desired number of clusters k

2. Select k initial observations as seeds

3. Calculate average cluster values (Cluster Centroids) over each variable (for the initial iteration, this will simply be the initial seed observations)

4. Assign each of the other training observations to the cluster with the nearest centroid

5. Recalculate cluster centroids (averages) based on the assignments from step 4

6. Iterate between steps 4 and 5; stop when there are no more new assignments

Pick 3 points at random to be the “centers” of the clusters.

Then calculate distances, putting each observation in the cluster with the closest “center”.

Calculate “new centers”, drop the old cluster assignments, and start again.
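
The k-means steps listed above can be sketched in plain Python. This is a minimal illustration, not the Statistica or SAS implementation: it uses the first k observations as seeds (an assumption made to keep the result reproducible), squared Euclidean distance, and hypothetical two-dimensional points.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Cluster points into k groups; returns (centroids, assignments)."""
    # Seeds: the first k observations (assumption for reproducibility).
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):
        # Assign every observation to the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop when no observation changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recalculate each centroid as the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Hypothetical data: two visually separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, labels = k_means(points, k=2)
print(labels)
print(centroids)
```

With different seeds the algorithm can settle on a different, locally optimal solution, which is one reason tools often rerun it from several starting points.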

Strengths And Weaknesses Of K-means Clustering

Strength

Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.

Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms

Weakness

Applicable only when a mean is defined, which raises the question of how to handle categorical data

Need to specify k, the number of clusters, in advance

Unable to handle noisy data and outliers

Not suitable to discover clusters with non-convex shapes

Example Output From SAS Enterprise Miner

Data is imported

Imputation – missing values are addressed

Cluster analysis is run

Interpreting The Cluster Analysis

Example Using - SAS

/* IMPORT DATA: read Sheet3 of the Excel workbook into WORK.TempData */
PROC IMPORT OUT= WORK.TempData
    DATAFILE= "D:\All Folders\faculty studentsindividuals\Vess, Johson\Cluster Analysis\cluster-0707-4Vess.xls"
    DBMS=EXCEL2000 REPLACE;
    RANGE="Sheet3$";
    GETNAMES=Yes;
RUN;

/* RUNNING CLUSTER ANALYSIS */
/* Ward's hierarchical clustering on every variable whose name starts with TIME */
proc cluster data=TempData method=ward print=15 ccc pseudo out=myWARDclusterout;
    var TIME:;
    copy INTERACT_CIO;
    title " Ward proc cluster";
run;

/* Clustering into at most 4 clusters; the %ClusterTimeVar macro is not defined in this excerpt */
proc fastclus data=TempData random= 1289 out=clusteroutput mean=mean1 maxc=4 maxiter=0 summary;
    var %ClusterTimeVar;
    title "proc fastclus results";
run;

Example Using - SAS

/* Compare the mean of each cluster against the other three clusters. */
/* Each contrast's coefficients sum to zero; FDR adjusts the p-values for multiple testing. */
Proc Multtest Data=clusteroutput /*Bonferroni Sidak FDR Permutation holm*/ FDR;
    Class Cluster;
    Contrast '1 vs others' 1 -.33 -.33 -.34;
    Test Mean( %ClusterTimeVar);
    title "Cluster 1 versus others";
Run;

Proc Multtest Data=clusteroutput /*Bonferroni Sidak FDR Permutation holm*/ FDR;
    Class Cluster;
    Contrast '2 vs others' -.33 1 -.33 -.34;
    Test Mean( %ClusterTimeVar);
    title "Cluster 2 versus others";
Run;

Proc Multtest Data=clusteroutput /*Bonferroni Sidak FDR Permutation holm*/ FDR;
    Class Cluster;
    Contrast '3 vs others' -.33 -.33 1 -.34;
    Test Mean( %ClusterTimeVar);
    title "Cluster 3 versus others";
Run;

Proc Multtest Data=clusteroutput /*Bonferroni Sidak FDR Permutation holm*/ FDR;
    Class Cluster;
    Contrast '4 vs others' -.33 -.33 -.34 1;
    Test Mean( %ClusterTimeVar);
    title "Cluster 4 versus others";
Run;

Cluster Analysis In Statistica

Cluster analysis is an unsupervised learning technique that seeks to divide cases into clusters whose members share similar qualities

No specific target value is specified

Simply exploring the structure of the data

Clustering Options In Statistica

K-means

Maximizes the difference in cluster means

EM algorithm

Uses distributions of the (continuous) data to find the clusters

Kohonen Network Clustering

Based on topological properties of the brain

Clustering Options in Statistica (cont)

How many clusters?

For K-means, what distance measure should be used?

Number of clusters can be determined with v-fold cross validation

You can select from several different distance measures

Euclidean distance

Squared Euclidean distance

City block or Manhattan distance

Chebychev (chessboard) distance
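
To see how the four distance measures in this list differ, the Python sketch below evaluates each one on the same pair of points. The function names and the points are assumptions for illustration and are not Statistica's internals.

```python
def euclidean(a, b):
    """Straight-line distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def squared_euclidean(a, b):
    """Euclidean distance without the square root; penalizes large gaps more."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def city_block(a, b):
    """City block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebychev(a, b):
    """Chebychev distance: the largest absolute difference on any variable."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical pair of cases measured on two variables.
p, q = (1, 2), (4, 6)
results = {
    "euclidean": euclidean(p, q),
    "squared_euclidean": squared_euclidean(p, q),
    "city_block": city_block(p, q),
    "chebychev": chebychev(p, q),
}
print(results)
```

For these points the differences are 3 and 4, so the measures give 5, 25, 7, and 4 respectively; the choice changes how much weight a large gap on one variable carries.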

For EM, what distributions should be used?

Number of clusters can be determined with v-fold cross validation

You can select from

Normal distribution

Log-normal distribution

Poisson distribution

For Kohonen, what dimensions are appropriate?

Specify dimensions of the cluster space

Marketing Data Set

Marketing Data Application

A marketing firm would like to segment their customers into similar groups or clusters in order to better understand consumer behavior and to more effectively market their products.

The goals of the data mining project are:

Determine the variables that best distinguish customers into clusters in the marketing data

Find a high-performance predictive model that assigns cases to clusters

Deploy that model to assign new customers to clusters

Marketing Data Variables

The data set contains demographic information such as

Annual income

Gender

Marital status

Age

Education

Occupation

Steps to The Analysis

Identify the project goals

This gives us business understanding

Understand the data – view it graphically

Prepare the data for analysis

If we need to clean the data in some way, we will do it at this point

Model

Evaluate

Deploy

Load The Dataset Into Statistica

FILE -> OPEN -> MARKETING

Variables

ANNUALINC Annual income
SEX Gender
MARSTATUS Marital status
AGE Age
EDUCATION Highest level of education
OCCUPATION Occupation
LNGOFSTAY Length of time in current residence
DUALINC For those married do they have a dual source of income (yes/no)
NOMEMBERS Number of members in the family
NOOFMEM<18 Number of members in the family under the age of 18
HSEHLDSTATUS Whether or not they own, rent or live with parents/family
HOMETYPE Home type
ETHNICCLASS Ethnic class
LANGUAGEHME Language spoken in the home
SAMPLE Training flag

View The Data Using Histograms

Select all of the variables except for sample (last variable) and say OK

You will get a histogram for all of the variables in the dataset

Some of The Plots

Looking For Data Issues

We are going to examine the data looking for variables that have more than 10% missing data

This would be considered sparse and would therefore need to be removed

Select ALL of the variables and then select OK.

Statistica will remove the variables and cases that have more than 10% missing data

Filtered Data --

Prior to filtering: 15 variables and 8993 cases

After filtering: 14 variables and 8559 cases

Length of Stay was removed

Cases removed were so sparse they weren’t adding information
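
The idea behind the sparse-data filter can be sketched outside Statistica in a few lines of Python. This is only an illustration under stated assumptions: the function name, the toy rows, the use of None for missing values, and the order of operations (variables first, then cases) are all hypothetical, and Statistica's own procedure may differ.

```python
def drop_sparse(rows, columns, max_missing=0.10):
    """Drop variables, then cases, with more than max_missing missing values."""
    n = len(rows)
    # Keep a variable if at most 10% of its values are missing.
    keep_cols = [
        c for c in columns
        if sum(r[c] is None for r in rows) / n <= max_missing
    ]
    # Keep a case if at most 10% of its remaining values are missing.
    kept_rows = [
        {c: r[c] for c in keep_cols}
        for r in rows
        if sum(r[c] is None for c in keep_cols) / len(keep_cols) <= max_missing
    ]
    return keep_cols, kept_rows


# Toy data: 10 cases, 3 variables (hypothetical names A, B, C).
rows = [{"A": i, "B": i, "C": i} for i in range(10)]
rows[0]["B"] = None   # B is missing in 2 of 10 cases (20%)
rows[1]["B"] = None
rows[2]["A"] = None   # A is missing in 1 of 10 cases (10%)

kept_columns, kept_rows = drop_sparse(rows, ["A", "B", "C"])
print(kept_columns, len(kept_rows))
```

In the toy data, B is removed as a sparse variable, and the one case that is missing A is removed as a sparse case, mirroring the drop from 15 to 14 variables and the loss of some cases described above.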

We Still Have Missing Data

Go back to Filter/Recode and select process missing data

DATA -> Filter/Recode -> Process Missing Data

Select all of the variables

Set the recode value to “other”

OK

We will get a new spreadsheet where the missing data has been replaced with “other”

Next We May Want To Check For Any Inconsistencies In The Data

Example – we have number of people in the household and number of people in the household under 18.

We would hope that the number under 18 is <= the total number in the household

STATISTICS -> Basic Statistics -> Tables and Banners

Specify tables (select variables)

NOMEMBERS

NOOFMEM<18

This could represent a problem and may need to be explored further
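
The same consistency check can be expressed as a simple rule in Python. The sketch assumes both variables are stored as plain counts, which may not match the coding in the actual MARKETING file, and the sample rows are made up.

```python
def inconsistent_households(rows):
    """Return rows where members under 18 exceed total household members."""
    return [r for r in rows if r["NOOFMEM<18"] > r["NOMEMBERS"]]


# Hypothetical cases; the second one is internally inconsistent.
households = [
    {"NOMEMBERS": 4, "NOOFMEM<18": 2},
    {"NOMEMBERS": 2, "NOOFMEM<18": 3},
    {"NOMEMBERS": 1, "NOOFMEM<18": 0},
]
flagged = inconsistent_households(households)
print(flagged)
```

Flagged cases are candidates for a closer look rather than automatic deletion, in line with the advice not to throw away good data.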

Further Cleaning

If you’d like to look at the dataset further –

Go ahead and try to clean it up

Go back to what we discussed early on and look for “obvious” issues with the dataset

Don’t take this too far !

You don’t want to throw away good data.

Save The Cleaned Dataset

Load The Cleaned Dataset Into Statistica

Setting Up a Cluster Analysis

Go to DATA MINING -> Cluster

This will pull up the K-means / EM clustering dialog box

Use K-Means clustering

Select all of the variables (Except for Sample)

Validation

On the Validation tab select

V-fold cross validation

Statistica will search for the optimum number of clusters between 2 and 25 clusters

K-means

On the K-means tab change the distance measure to “Squared Euclidean Distances”

Feel free to play with the different distance measures to see if and how it changes the results of the analysis

When you select “OK” Statistica will begin clustering the data

Review Results

Now we are ready to review the results

You can see from the results dialog box that

K-means method was used

Squared Euclidean distances were used as the distance measure

3 clusters were created

Cluster Means

Cluster means will show us the clusters and the category that was most prevalent in each cluster.

Cluster 1:

Females making less than 10K, 25-34yrs old and single

Cluster 2:

Females making 50-75K, 35-44yrs old and married

Cluster 3:

Males making less than 10K, 18-24 yrs old and single

Graph of Cost Sequence

This basically shows us how much information was gained as each cluster was added.

Here the optimal number of clusters is 3

Adding an additional (4th) cluster did not add much information

This shows how close the cases were to the cluster centroid

Decrease from 3 to 4 was not significant enough to merit the addition of a 4th cluster
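
The reasoning above (keep adding clusters only while the cost drops enough) can be sketched as a stopping rule in Python. The cost values and the 5% threshold below are hypothetical choices for illustration; they are not read from the Statistica output, and Statistica's v-fold cross-validation has its own settings.

```python
def choose_k(costs, min_relative_drop=0.05):
    """Pick the last k whose cost improved by at least min_relative_drop.

    costs maps a number of clusters to its cluster cost (lower is better).
    """
    ks = sorted(costs)
    best = ks[0]
    for prev, cur in zip(ks, ks[1:]):
        drop = (costs[prev] - costs[cur]) / costs[prev]
        if drop < min_relative_drop:
            break  # the extra cluster did not add enough information
        best = cur
    return best


# Hypothetical cost sequence: a clear drop from 2 to 3, little from 3 to 4.
costs = {2: 5.9, 3: 5.1, 4: 5.0}
best_k = choose_k(costs)
print(best_k)
```

With these illustrative numbers the rule stops at 3 clusters, because going from 3 to 4 lowers the cost by only about 2%.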

Members and distances

Members and distances creates an output spreadsheet that has the final classification for each case in the dataset

This shows us the cluster assignment and the distance from the centroid.

Advanced Tab Frequency Table

Here you get a table output for each variable

You can see what values fell into each cluster

Graph of Frequencies

This provides a graphical representation of the information contained in the frequency table(s)

Creating a Deployment Script

Deployment scripts are created so that new cases can be assigned to clusters as they come in

What Do We Do With This Data?

Generally what we want to do with the results of a cluster analysis is start to build a narrative that describes the classification that occurred

We want to explain how this classification scheme can be used to describe the phenomena we are interested in

Constructing A Narrative

The analysis resulted in 3 groupings or clusters

Variable           Cluster 1              Cluster 2          Cluster 3
Annual Income      Medium income 10-40K   High income >50K   Low income <15K
Gender             Female                 Female             Male
Marital Status     Single                 Married            Single
Age                Medium 25-34 yrs       High 35-44 yrs     Low 18-24 yrs
No Mem <18         1 or fewer             1 or fewer         1 or fewer
Household status   Rent                   Own                Live with parent/family

Questions ?

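Equations from the Cluster Analysis slides:

Mean absolute deviation: $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$

Mean: $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$

Standardized measurement (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$

Minkowski distance: $d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q}$

Manhattan distance (q = 1): $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$

Euclidean distance (q = 2): $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$

Simple matching distance for nominal variables: $d(i,j) = \frac{p - m}{p}$

Rank of an ordinal value: $r_{if} \in \{1, \dots, M_f\}$

Rank mapped onto [0, 1]: $z_{if} = \frac{r_{if} - 1}{M_f - 1}$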

[Figure: Histogram of ANNUALINC (Marketing 15v*8993c), with fitted curve ANNUALINC = 8993*1*Normal(Location=4.895, Scale=2.7737). x-axis: ANNUALINC, in brackets from "Less than $10,000" to "$75,000 or more"; y-axis: No of obs.]

[Figure: Histogram of SEX (Marketing 15v*8993c), with fitted curve SEX = 8993*1*Normal(Location=1.5469, Scale=0.4978). x-axis: SEX (Male, Female); y-axis: No of obs.]

[Figure: Histogram of AGE (Marketing 15v*8993c), with fitted curve AGE = 8993*1*Normal(Location=3.4152, Scale=1.6376). x-axis: AGE, in bands from "14 thru 17" to "65 and Over"; y-axis: No of obs.]

[Figure: Histogram of EDUCATION (Marketing 15v*8993c). x-axis: EDUCATION (1 to 3 years of college, College graduate, Grades 9 to 11, Graduated high school, Grad Study, Grade 8 or less); y-axis: No of obs.]

[Figure: Histogram of OCCUPATION (Marketing 15v*8993c). x-axis: OCCUPATION (Homemaker, Professional/Managerial, Student, HS or College, Retired, Unemployed, Factory Worker/Laborer/Driver, Sales Worker, Clerical/Service Worker, Military); y-axis: No of obs.]

[Figure: Histogram of DUALINC (Marketing 15v*8993c). x-axis: DUALINC (Not Married, Yes, No); y-axis: No of obs.]

[Figure: Graph of Cost Sequence, k-Means. Best number of clusters: 3. x-axis: Number of clusters (2 to 4); y-axis: Cluster cost (4.9 to 6.0).]

[Figures: Graphs of frequencies by cluster (Number of clusters: 3; series Cluster 1, Cluster 2, Cluster 3) for ANNUALINC, SEX, MARSTATUS, AGE, EDUCATION, OCCUPATION, NOOFMEM<18, and HSEHLDSTATUS. x-axis: the categories of each variable; y-axis: Frequencies.]

Big Data Analytics Tools./Final Exam materials/3 BINS 4352 - Boosted Trees.pptx

Boosted Trees and Random Forests

Boosted Trees

A series of very simple trees are generated

Each tree on its own has limited predictive value

However, using these together can create a fairly strong model

The final classification is based on “voting” from the simple trees

Boosted Trees in Statistica

Variables and classification values are selected in the same way that they are for C&RT and CHAID trees

Learning rate of 0.1 or smaller typically will result in the best models

Number of Additive Terms = the number of trees to generate

Random test sample = amount of data to be held out for a test sample

Of the remaining data (training sample), subsample is the percentage that will be selected for each of the trees generated

Stopping parameters should be set to generate “simple” trees.

Hence the maximum number of nodes is set to 3

We want very simple trees – weak “learners”
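
The idea of many weak trees adding up to a stronger model can be sketched in a few lines of Python. This is an illustration only, not Statistica's algorithm: it boosts two-leaf trees (one split, three nodes) on a single numeric predictor using squared error rather than multinomial deviance, and the function names, the toy data, and the 0/1 coding of the response are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (a 3-node tree)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds leave both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return t, left_mean, right_mean


def boost(xs, ys, n_trees=200, learning_rate=0.1):
    """Fit n_trees stumps in sequence, each on the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        # Each tree only nudges the prediction by a fraction (the learning rate).
        preds = [p + learning_rate * (left_mean if x <= t else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the scaled contributions of all stumps for one new value."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if x <= t else rm) for t, lm, rm in stumps)


# Toy data: the response is 0 for small x and 1 for large x (hypothetical).
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
low, high = predict(model, 1), predict(model, 4)
print(round(low, 3), round(high, 3))
```

A single stump scaled by a learning rate of 0.1 barely moves the prediction, but after a couple of hundred of them the predictions sit close to 0 and 1, which is the sense in which the set of weak trees, not any one tree, makes the prediction.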

Viewing Results - Summary

Multinomial deviance refers to the quality of fit.

Bar Plot Of Predictor Importance

Tree Graphs

The tree graphs (all 200) will all be simple 3 node trees

It’s the “set” of weak trees that creates the prediction

Prediction Tab – Predicted Values

Select “All Samples” so that we can view the results across the entire data set.

Classification Tab – Predicted Versus Observed

Random Forests

Random forests build a series of trees

Each tree in the forest predicts a classification

They in essence “vote” on the overall prediction

The random forest then predicts the classification predicted by the most trees in the forest

If the majority of trees predict “Good”, then the prediction is “Good”

If the majority of trees predict “Bad”, then the prediction is “Bad”
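
The voting rule described above can be written as a short Python function. The function name and the example votes are assumptions; building the individual trees is not shown.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the class predicted by the most trees in the forest."""
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for one loan applicant.
votes = ["GOOD", "BAD", "GOOD", "GOOD", "BAD"]
decision = forest_vote(votes)
print(decision)
```

Using an odd number of trees for a two-class response avoids ties; with an even number, some tie-breaking rule would have to be chosen.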

Advanced Tab – Number Of Predictors

Advanced Tab – Cont

We want a large number of trees in the forest – so we set Number of Trees to 100

The random test data proportion and subsample proportion are similar to those used for boosted trees to split the dataset and generate random samples for tree generation

Stopping parameters also are similar to what we have done in the other trees

Quick Tab - Summary

Quick Tab – Risk Estimates

Here we can see misclassification rates

Quick Tab – Predictor Importance

Quick – Tree Graphs

But – remember we don’t use individual trees.

They work together to give a prediction

Classification Predicted Versus Observed

Questions?

[Figure: Summary of Boosted Trees. Response: DEFAULT. Optimal number of trees: 199; maximum tree size: 3. Average multinomial deviance (y-axis, 0.15 to 0.55) versus number of trees (x-axis, 20 to 200) for train and test data, with the optimal number marked.]

[Figure: Importance plot. Dependent variable: DEFAULT. Predictors in plotted order: DEBTINC, CLAGE, LOAN, DEROG, DELINQ, VALUE, CLNO, NINQ, JOB, MORTDUE, YOJ, REASON. Axis: Importance (0.0 to 1.2).]

[Figure: Tree graph for DEFAULT. Tree number: 1; Category: GOOD. Non-terminal nodes: 1; terminal nodes: 2. Root node ID=1 (N=2438, Mu=0.002977, Var=0.171747) splits on DELINQ at 0.500000 into ID=2 (N=1630, Mu=0.175083, Var=0.134483) and ID=3 (N=524, Mu=-0.708828, Var=0.248685).]

Classification matrix

All samples; Number of trees: 199

[Figure: Summary of Random Forest. Response: DEFAULT. Number of trees: 100; maximum tree size: 100. Misclassification rate (y-axis, 0.04 to 0.24) versus number of trees (x-axis, 10 to 100) for train and test data.]

[Figure: Importance plot. Dependent variable: DEFAULT. Predictors in plotted order: DEBTINC, DELINQ, DEROG, CLAGE, NINQ, VALUE, LOAN, MORTDUE, CLNO, YOJ, JOB, REASON. Axis: Importance (0.0 to 1.2).]

[Figure: Tree graph for DEFAULT. Tree number: 1. Non-terminal nodes: 3; terminal nodes: 4. Root node N=2464. Splits on DEBTINC (44.752658), VALUE (295590.500000), and DELINQ (0.500000); nodes are labeled GOOD or BAD.]

[Figure: Tree graph for DEFAULT. Tree number: 2. Non-terminal nodes: 7; terminal nodes: 8. Root node N=2449. Splits on DELINQ (0.500000), DEBTINC (45.656896), NINQ (3.500000), CLAGE (80.891906), YOJ (23.500000), MORTDUE (42538.500000), and DEBTINC (43.570586); nodes are labeled GOOD or BAD.]

[Figure: Tree graph for DEFAULT. Tree number: 3. Non-terminal nodes: 15; terminal nodes: 16. Root node N=2464. Splits on DEROG, CLAGE, DELINQ, LOAN, DEBTINC, NINQ, JOB, CLNO, and VALUE; nodes are labeled GOOD or BAD.]

Big Data Analytics Tools./Final Exam materials/BINS 4352 - Overview Presentation.pptx

BINS 4352 Business Intelligence

Course Description

To demonstrate a basic knowledge of terminology, issues, and opportunities related to business intelligence.

To apply selection/critical success criteria for adoption and application of business intelligence technologies.

To analyze management issues and opportunities related to the integration of business intelligence into existing corporate infrastructures.

To understand the security risks of conducting business intelligence activities.

Changing Business Environment

Changing Business Environments

Companies are moving aggressively to computerized support of their operations

Business Intelligence

Business Pressures–Responses–Support Model

Business pressures result from today's competitive business climate

Responses to counter the pressures

Support to better facilitate the process

The Business Environment

The environment in which organizations operate today is becoming more and more complex, creating

Opportunities

Problems

Business environment factors:

markets, consumer demands, technology, and societal…

Business Environment Factors

Factor | Description
MARKETS | Strong competition; Expanding global markets; Booming electronic markets (Internet); Innovative methods; Opportunities for outsourcing; Need for real-time, on-demand transactions
CONSUMER DEMAND | Desire for customization; Desire for quality, diversity of products, speed of delivery; More powerful customers – less loyal
TECHNOLOGY | More innovative, new products and services; Increasing obsolescence rate; Increasing information overload; Social networking, web 2.0 …
SOCIETAL | Changing government regulation (deregulation); Workforce diversification; Concerns about security; Increasing social responsibility of companies; Greater emphasis on sustainability

Business Response

Organizational Responses

Be Reactive, Anticipative, Adaptive, and Proactive

Managers may take actions, such as

Employ strategic planning.

Use new and innovative business models.

Restructure business processes.

Participate in business alliances.

Improve corporate information systems.

… more

Closing the Strategy Gap

One of the major objectives of computerized decision support is to facilitate closing the gap between the Current Performance of an organization and its Desired Performance, as expressed in its mission, objectives, and goals, and the strategy to achieve them.

Business Intelligence

A Framework for Business Intelligence (BI)

BI is an evolution of decision support concepts over time

Then: Executive Information System

Now: Everybody’s Information System (BI)

BI systems are enhanced with additional visualizations, alerts, and performance measurement capabilities

The term BI emerged from industry

Definition of BI

BI is an umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies

BI is a content-free expression, so it means different things to different people

BI's major objective is to enable easy access to data (and models) to provide business managers with the ability to conduct analysis

BI helps transform data, to information (and knowledge), to decisions, and finally to action

A Brief History of BI

The term BI was coined by the Gartner Group in the mid-1990s

However, the concept is much older

1970s - MIS reporting - static/periodic reports

1980s - Executive Information Systems (EIS)

1990s - OLAP, dynamic, multidimensional, ad-hoc reporting -> coining of the term “BI”

2010s - Data/Text/Web Mining; Web-based Portals, Dashboards, Big Data, Social Media, and Visual Analytics

2020s - yet to be seen

The Evolution of BI Capabilities

The Architecture of BI 4 Major Components

Data Warehouse – contains the source data

Forms the cornerstone of any medium to large scale BI system.

Originally only included historical data – organized and summarized for easy manipulation

Today they may include access to current data as well.

Differs from a traditional (operational) database, which is typically structured for high-speed data entry; a data warehouse is structured for querying and analysis

2014 – Facebook 300+ petabytes with daily incoming data rate of about 600 TB.

2009 – eBay’s two data warehouses contained a total of 8.5 petabytes of data

The Architecture of BI 4 Major Components

Business Analytics – a collection of tools for manipulating, mining, and analyzing the data in the data warehouse

These are the tools that help transform the data into knowledge (e.g., queries, data/text mining tools, etc.)

Business Performance Management (BPM), also referred to as corporate performance management (CPM) – an emerging portfolio of applications within the BI framework that provides enterprises with the tools they need to better manage their operations

Monitoring and analyzing performance

The Architecture of BI 4 Major Components

User Interface (Data Visualization) – provides a comprehensive graphical/pictorial view of corporate performance measures, trends, and exceptions

Data Processing

Transaction Processing Versus Analytic Processing

Online transaction processing (OLTP) systems are constantly involved in handling updates (add/edit/delete) to what we might call operational databases

ATM withdrawal transaction, sales order entry via an ecommerce site – updates DBs

OLTP – handles routine on-going business

ERP, SCM, CRM systems generate and store data in OLTP systems

The main goal is to have high efficiency

Transaction Processing Versus Analytic Processing

Online analytic processing (OLAP) systems are involved in extracting information from data stored by OLTP systems

Routine sales reports by product, by region, by sales person, by …

Often built on top of a data warehouse where the data is not transactional

Main goal is the effectiveness (and then, efficiency) – provide correct information in a timely manner
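
The contrast between the two workloads can be sketched in a few lines. This is a minimal illustration only, using Python's built-in sqlite3 module; the sales table, its columns, and the rows are hypothetical, and a real OLAP system would normally run against a separate data warehouse rather than the operational table itself.

```python
import sqlite3

# In-memory database standing in for an operational (OLTP) store.
# The table name, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")

# OLTP-style work: many small writes, one per business transaction.
orders = [
    ("East", "widget", 20.0),
    ("East", "gadget", 35.0),
    ("West", "widget", 20.0),
    ("West", "widget", 20.0),
]
with conn:  # commits the inserts on success
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", orders)

# OLAP-style work: a read-only summary across many rows (sales by region).
sales_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
print(sales_by_region)
```

The inserts touch one row at a time and need to be fast; the summary query reads every row and needs to be correct and timely, which is the efficiency-versus-effectiveness distinction made above.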

Summary: OLTP versus OLAP

BI Implementations

Successful BI Implementation

Implementing and deploying a BI initiative is a lengthy, expensive, and risky endeavor!

Success of a BI system is measured by its widespread usage for better decision making

The typical BI user community includes

Not just the top executives (as was for EIS)

All levels of the management hierarchy

Provide what is needed to whoever needs it

A successful BI system must be of benefit to the enterprise as a whole…

BI - Alignment with Business Strategy

To be successful, BI must be aligned with the company’s business strategy

BI cannot/should not be a technical exercise for the information systems department

BI changes the way a company conducts business by

Improving business processes

Transforming decision making to a more data/fact/information driven activity

BI should help execute the business strategy and not be an impediment for it!

Issues for Successful BI

Developing vs. Acquiring BI systems

Justifying via cost-benefit analysis

It is easier to quantify costs

Harder to quantify benefits

Security and Protection of Privacy

Integration of Systems and Applications

Real-Time, On-Demand BI Is Attainable

The demand for “real-time” BI is growing!

Is “real-time” BI attainable?

Technology is getting there…

Automated, faster data collection (RFID, sensors,… )

Database and other software technologies (agents, SOA, …) are advancing

Telecommunication infrastructure is improving

Computational power is increasing while the cost for these technologies is decreasing

Trend -> Business Activity Management

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

Sir Arthur Conan Doyle: Sherlock Holmes, "A Scandal in Bohemia"

Analytics

Problem in a Nutshell

How do we effectively utilize all of the information available in order to aid in decision making and lead to actions that can positively impact the business?

Amazon Cloud – Cyber Monday 2013: 36.8 million items were ordered (426 items per second)

YouTube

More than 1 billion users and 300 videos uploaded every minute (48 hours of video)

Facebook

1 billion active accounts each month (70 languages)

30 Billion pieces of content posted per day (2.7 Billion comments/likes) and 125 Billion friend connections

Twitter – Typical posting volume exceeds 500 million tweets per day (5,700 per second)

August 2013 – 143,000 tweets were posted in 1 second

US – Debit / Credit Card Transaction (2012)

Credit Card transactions (total) – 26.2 billion and Debit Card transactions (total) – 41.4 billion

From the beginning of time until 2003 – 5 exabytes of data were created (5 billion gigabytes)

2011 – Same amount of data was created every 2 days

Overwhelmed with data

Data Deluge

Hospital Patient Registries

Electronic Point-of-Sale Data

Stock Trades OLTP Telephone Calls

Catalog Orders Bank Transactions

Remote Sensing Image Processing

Tax Information

Airline Reservations

Credit Card Transactions, etc.

Analytics Overview

Analytics?

Something new or just a new name for …

A Simple Taxonomy of Analytics

Descriptive Analytics

Predictive Analytics

Prescriptive Analytics

Analytics or Data Science?

Analytics Overview

Data Analytics / Data Mining

Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

Also referred to as:

Knowledge Extraction

Pattern Analysis

Knowledge Discovery

Information Harvesting

Pattern Searching

Etc.

Types of data mining

Hypothesis Driven (supervised)

I know what I am looking for and I am looking to find it in the data

My analysis is trying to find support for some hypothesis that I propose

Does performance on the GMAT exam predict performance in Graduate School (GPA)?

Should we approve or deny a customer's credit application?

Is this most recent credit card transaction fraudulent?

Analyze Twitter tweets following an “event” to determine public sentiment regarding that event.

Discovery Driven (unsupervised)

I am looking at data to see what pattern(s) emerge

I begin the analysis with no pre-conceived notions of what might be there

I am analyzing store receipts to discover if there are groups of products that often are purchased together by consumers.

I am looking at customer purchasing patterns and profiles to determine if I can identify the market segments and how to best service those segments.

Taxonomy of data mining tasks

What is - Big Data Analytics ?

Big Data?

Not just big!

Volume

Variety

Velocity

QUESTIONS?

Big Data Analytics Tools./Final Exam materials/BINS 4352 - Introduction Presentation.pptx

BINS 4352 Big Data Analytics Tools

Fall 2018

Overview

Welcome

Course Objectives / Syllabus

Introduction

Questions?

Welcome to BINS 4352

TEXTBOOKS AND OTHER MATERIALS

Selected readings will be provided during the semester and will be posted on Blackboard.

Check Blackboard frequently or setup notifications so that you can see when I post material

We do not have a textbook for this course

Welcome to BINS 4352

You will need to download and have access to 2 software packages

Statistica Ultimate Academic Bundle – Single User (you should be able to get it by using the following link).

https://ualr.onthehub.com/WebStore/ProductsByMajorVersionList.aspx?cmi_mnuMain_child_child=2ff73789-74c7-e011-ae14-f04da23e67f6&cmi_mnuMain_child=d05ce6ce-371d-e611-941a-b8ca3a5db7a1

Fee is $24.99 for 6 months

YOU NEED WINDOWS TO RUN Statistica

RapidMiner Studio 8.2 – This software runs on both Windows and Mac (you should be able to find it at https://rapidminer.com – get an academic license)

This is free

Course Description

Students will have hands-on experience analyzing real-world data sets using industry standard analysis tools. Topics covered will include:

Data analysis processes – SEMMA, CRISP-DM, etc.

Data preparation and visualization

Analysis techniques covered in the class will include:

Decision Trees Neural Networks

Marsplines Cluster Analysis

Association Analysis Text/Web Analytics

Sentiment Analysis Similarity Analysis

Model evaluation and comparison

Course Objectives

Develop an understanding of the problems and opportunities that are available when dealing with extremely large databases.

Review visualization SW used for interpreting complex patterns in multidimensional data.

Be able to identify what information is useful and what data is not in a large database.

Have a strong understanding of different data analysis techniques: decision trees, regression, neural networks, cluster analysis, association analysis and text mining.

Develop the ability to compare models, interpret, and evaluate the appropriateness of a given model.

Develop an understanding of predictive and exploratory models and algorithms.

Understand phases of decision making including: discovery and data query, data analysis and confirmation, and presentation and implementation of results.

Syllabus / Blackboard

Syllabus

SUBJECT TO CHANGE – unexpected scheduling problems, class discussions, and class interests will cause the schedule and/or the covered topics to be modified.

The class outline/schedule represents a basic outline of the course; it is not a contractual agreement.

CHECK BLACKBOARD FREQUENTLY

Power Point Slides

Supplemental Information

Policies

ATTENDANCE

Regular punctual attendance is expected

If you MUST miss class

If possible, email me 24 hours prior to missing a class.

You are responsible for getting material from another student.

Help your fellow classmates when they have questions about assignments.

Interacting with other students is an excellent way to learn!

Each student must complete his or her own assignments—no copying of files allowed!

Unless an assignment is clearly defined as a “group” assignment with a single “group” deliverable, it is an individual assignment that must be completed on your own.

Policies

ACADEMIC DISHONESTY

Students will adhere to the highest professional and ethical standards.

All work submitted will be the result of each individual student’s own effort.

Sharing homework files that are not your own, whether complete or works in progress, is considered cheating.

You may discuss homework and help each other; but the assignments submitted must have been totally completed yourself (all text/data/formulas keyed by YOU!).

Policies

DISABILITY ACCOMMODATIONS

See me after class

PATH TO SUCCESS

Arrive to class on time and follow the policy on professionalism

Read the assigned material prior to the class period and participate in class discussions

Complete the required work on time

MOST IMPORTANTLY, KNOW THAT LEARNING IS A PARTICIPATORY ACT WHERE EVERYONE SHARES WHAT HE OR SHE KNOWS IN A RESPECTFUL MANNER IN AN ENVIRONMENT OF HIS OR HER PEERS.

Grades

We will have 3 exams that will constitute 30% of your grade

Participation is 10% of your grade – if you miss class, then you cannot participate

Exercises – PLEASE NOTE – these will count 50% of your grade.

Late work will NOT be accepted

Introduction Dr. Vess Johnson

EDUCATION

BS – Theoretical Mathematics (Emphasis in Electrical Engineering)

BA – Philosophy

MS - Computer Science

PhD – Business (Concentration Computer Information Systems and Decision Sciences)

20+ YEARS OF INDUSTRY EXPERIENCE

Military Aerospace, Software, and Semiconductor

Startup – Turnaround Specialist

President / CEO of 7 startup and turnaround companies

HOBBIES

Cycling (Road)

Music

Camping

Electronics

QUESTIONS?

Big Data Analytics Tools./Final Exam materials/3 BINS 4352 - Trees.pptx

Classification Decision Trees

Objectives

Conceptually understand how decision trees work

Understand what control the user has over construction of the decision tree

Understand how to adjust parameters to improve performance

Understand how to evaluate and interpret results of a decision tree

Recap

Preparing the Data

Data preparation is critical and can account for 85% of the effort

You must have an understanding of both the data and the business domain

Decide on the appropriate data mining / data analysis technique to use

You may try multiple modeling techniques and then compare the models to see which is most appropriate

Data Mining Techniques

Hypothesis Driven (supervised)

I know what I am looking for and I am looking to find it in the data

My analysis is trying to find support for some hypothesis that I propose

Does performance on the GMAT exam predict performance in Graduate School (GPA)?

What factors impact credit approvals?

Discovery Driven (unsupervised)

I am looking at data to see what pattern(s) emerge

I begin the analysis with no pre-conceived notions of what might be there

I want to analyze store receipts to discover if there are groups of products that are purchased together by consumers.

Taxonomy of data mining tasks

Recursive Partitioning

Recursive Partitioning Methods

Recursive partitioning refers to the process for creating a decision tree which is basically a system of questions that lead along a path to the final prediction

Predictor variables are used to make the splits, which make the “tree branches” and divide the data into more and more similar groups

These groups are “leaves” of the tree or “nodes”

When sufficient splits are made we reach “terminal nodes”

This is where we stop splitting – no more splits are made

The prediction is then made based on the makeup of the terminal node

Example – Charge Card approval

Tree diagram: splits on Balance of Current Account (>= $300 / No balance or no running account) and Further Running Credits (No further running credits / Other credits at banks and department stores); terminal nodes are labeled Good or Bad.

Recursive Partitioning Methods

Advantages

Easily interpreted models

Don’t require calculations as would be the case with linear models

Allows specification of misclassification costs

We can assign weights to predictions

Predicting Good when Bad may be weighted higher than predicting Bad when Good

Good accuracy

Allows for missing data in deployment

Via surrogates or an alternative variable to make splits when data is missing

Disadvantages

Can require evaluation to find the right size of the trees

May require a great deal of experience to find the “right” size tree

Too many splits may result in a model that “over fits” the data

Too few splits may miss out on good predictive accuracy

More challenging for continuous targets

Decision Trees

The Curse Of Dimensionality

The dimension of a problem refers to the number of input variables (actually, degrees of freedom).

Data mining problems are often massive in both the number of cases and the dimension.

The curse of dimensionality refers to the exponential increase in data required to densely populate space as the dimension increases.

For example,

Eight points fill the one-dimensional space but become more separated as the dimension increases.

In 100-dimensional space, they would be like distant galaxies.

The curse of dimensionality limits our practical ability to fit a flexible model to noisy data (real data) when there are a large number of input variables.

A densely populated input space is required to fit highly complex models.
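
The exponential growth described above can be made concrete with a short calculation. This is only a sketch under a simplifying assumption: that "densely populated" means keeping the same number of points along every axis, so eight points in one dimension become 8^d points in d dimensions. The function name is hypothetical.

```python
def points_needed(points_per_axis: int, dimensions: int) -> int:
    """Points required to keep the same spacing along every axis."""
    return points_per_axis ** dimensions


# Eight points fill a line; the same density needs 8^d points in d dimensions.
for d in (1, 2, 3, 10):
    print(d, points_needed(8, d))
```

Under this assumption, ten input variables already call for more than a billion cases, which is why deleting redundant or irrelevant variables matters before fitting a flexible model.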

Addressing The Curse Of Dimensionality (Deleting variables)

The two principal reasons for eliminating a variable are

Redundancy:

A redundant input does not give any new information that has not already been explained.

Useful methods: principal components, factor analysis, variable clustering.

Irrelevancy

An irrelevant input is not useful in explaining variation in the target.

Interactions and partial associations make irrelevancy more difficult to detect than redundancy.

It is often useful to first eliminate redundant dimensions and then tackle irrelevancy.

Model Complexity

A naïve modeler might assume that the most complex model should always outperform the others, but this is not the case.

An overly complex model might be too flexible.

This will lead to overfitting – accommodating nuances of the random noise in the particular sample (high variance).

GOAL: A model with just enough flexibility will give the best generalization.

Creating Trees

Split Search

Which splits are to be considered?

Splitting Criterion

Which split is best?

Stopping Rule

When should the splitting stop?

Pruning Rule

Should some branches be lopped off?

Number of Possible Splits

Splitting Criteria

How is the best split determined?

In some situations, the worth of a split is obvious.

If the expected target is the same in the child nodes as in the parent node, no improvement was made, and the split is worthless!

In contrast, if a split results in pure child nodes, the split is undisputedly best.

For classification trees, the three most widely used splitting criteria are based on:

Pearson chi-squared test

The Gini index

Entropy

All three measure the difference in class distributions across the child nodes.

The three methods usually give similar results.

Definition – Pure node: all of the samples at that node have the same class label, so there is no need for further splitting
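
Two of the criteria named above, the Gini index and entropy, can be sketched as impurity measures, with the worth of a split taken as the drop in impurity from the parent node to the size-weighted child nodes. This is a minimal sketch, not the exact computation any particular tool performs; the function names and the class counts are hypothetical, and the chi-squared criterion is left out.

```python
from math import log2


def gini(counts):
    """Gini impurity of a node, given the number of cases in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node, given the number of cases in each class."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def split_worth(parent, children, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = [50, 50]                    # hypothetical node: 50 GOOD, 50 BAD
worthless = [[25, 25], [25, 25]]     # children look just like the parent
perfect = [[50, 0], [0, 50]]         # both children are pure nodes

print(split_worth(parent, worthless, gini), split_worth(parent, perfect, gini))
print(split_worth(parent, worthless, entropy), split_worth(parent, perfect, entropy))
```

Both measures score the worthless split at zero and give the pure split the largest possible value, which matches the two obvious cases described above.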

Controlling Tree Growth: Stunting

A universally accepted rule is to stop growing if the node is pure.

Two other popular rules for stopping tree growth are to

Stop if the number of cases in a node falls below a specified limit

Stop when the split is not statistically significant at a specified level.

This is called Pre-pruning, or stunting.

Controlling Tree Growth: Pruning

Pruning (also called post-pruning) creates a sequence of trees of increasing complexity.

An assessment criterion is needed for deciding the best (sub) tree.

The assessment criteria are usually based on performance on holdout samples (validation data or with cross-validation).

Cost or profit considerations can be incorporated into the assessment.

Decision Tree Algorithms

Example – Credit Risk

We have a total of 10 people.

6 are good risks and 4 are bad.

We apply splits to the tree based on employment status.

When we break this down, we find that there are 7 employed and 3 not employed.

Of the 3 that are not employed, all of them are bad credit risks and thus we have learned something about our data.

Note that we cannot split this node any further, since all of its cases fall into one class (bad credit risk).

This is called a pure node.

The other node, however, can be split again based on a different criterion.

So we can continue to grow the tree on the left hand side.
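
The bookkeeping in this example can be checked with a few lines. The counts come from the example above; the dictionary layout and helper names are hypothetical, and the employed node's makeup (6 good, 1 bad) is derived by subtraction rather than stated in the slides.

```python
# Root node: 10 people, 6 good risks and 4 bad.
root = {"good": 6, "bad": 4}

# Not-employed branch: 3 people, all bad credit risks.
not_employed = {"good": 0, "bad": 3}

# The employed branch holds whatever is left over.
employed = {label: root[label] - not_employed[label] for label in root}


def is_pure(node):
    """A node is pure when only one class has any cases in it."""
    return sum(1 for count in node.values() if count > 0) == 1


def majority(node):
    """The prediction for a node is its most frequent class."""
    return max(node, key=node.get)


print(employed, is_pure(employed), majority(employed))
print(not_employed, is_pure(not_employed), majority(not_employed))
```

The not-employed node is pure and predicts "bad"; the employed node still mixes both classes, so it is the one that can be split again.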

Rule Induction Algorithms

Recursive algorithms that identify data partitions of progressive separation with respect to the outcome.

The partitions are then organized into a decision tree.

Common Algorithms:

1R

ID3

C4.5/C5.0

CART (C&RT)

CHAID

CN2

BruteDL

SDL

1R algorithm

1R – Inferring Rudimentary Rules

1R: learns a 1-level decision tree

In other words, generates a set of rules that all test on one particular attribute

Basic version (assuming nominal attributes)

One branch for each of the attribute’s values

Each branch assigns most frequent class

Error rate: proportion of instances that don’t belong to the majority class of their corresponding branch

Choose attribute with lowest error rate

ALGORITHM

For each attribute,

For each value of the attribute, make a rule as follows:

count how often each class appears

find the most frequent class

make the rule assign that class to this attribute-value

Calculate the error rate of the rules

Choose the rules with the smallest error rate

Simple Example – Play Or Don’t Play

What are the variables?

Dependent variable – what we are trying to predict

Play

Independent variables – what drives the prediction

Outlook

Temperature

Humidity

Windy

Evaluate The Weather Attributes

* When there is a “tie” we just randomly choose between two equally likely outcomes

Sunny

There are 5 sunny days

3 are “no” we don’t go out and play

2 are “yes” we do go out and play

So the rule is Sunny->No (since there are more “no” days than “yes” days)

The rule is wrong on 2 out of 5 sunny days

Outcome For The Decision Tree

Apply The Algorithm

Consider the first (outlook) of the 4 attributes (outlook, temp, humidity, windy). Consider all values (sunny, overcast, rainy) and make 3 corresponding rules. Continue until you get all 4 sets of rules.

Although 1R is quite simple, it has been found to perform just about as well as other more complex algorithms

Simplicity is good!!
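
The 1R procedure above can be sketched in a few lines of Python. The slides do not reproduce the data table, so the 14 rows below are an assumption: they are the commonly used "play / don't play" weather data, which agrees with the sunny-day counts given above (5 sunny days, 3 "no", 2 "yes"). Ties are resolved by keeping the first candidate found rather than choosing randomly, so that the output is repeatable.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ("outlook", "temperature", "humidity", "windy")

# Assumed data: (outlook, temperature, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def one_r(rows, attributes):
    """Return (attribute, rules, errors) for the attribute with the fewest errors."""
    best = None
    for i, name in enumerate(attributes):
        # Count how often each class appears for each value of this attribute.
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[i]][row[-1]] += 1
        # Each value predicts its most frequent class.
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        # Errors are the cases that do not belong to the majority class.
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        # Strict "<" keeps the first attribute when two attributes tie.
        if best is None or errors < best[2]:
            best = (name, rules, errors)
    return best


attribute, rules, errors = one_r(WEATHER, ATTRIBUTES)
print(attribute, rules, f"{errors}/{len(WEATHER)}")
```

On this data, outlook and humidity tie on total errors, so the rule set that gets reported depends on how the tie is broken.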

Discretizing Continuous variables

Decision Tree Example

HMEQ (Home Equity Loan) Case overview

The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit.

To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model.

The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting.

The model will be built from predictive modeling tools, but the created model must be sufficiently interpretable so as to provide a reason for any adverse actions (rejections).

The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans.

The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent.

This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.

The Home Equity Loan Process

An applicant comes forward with a specific property and a reason for the loan (Home-Improvement, Debt-Consolidation)

Background info related to job and credit history is collected

The loan gets approved or rejected

Upon approval, the Applicant becomes a Customer

Information related to how the loan is serviced is maintained, including the Status of the loan (Current, Delinquent, Defaulted, Paid-Off)

HMEQ Data Set

Name | Model Role | Measurement Level | Description
BAD | Target | Binary | 1=defaulted on loan, 0=paid back loan
REASON | Input | Binary | HomeImp=home improvement, DebtCon=debt consolidation
JOB | Input | Nominal | Six occupational categories
LOAN | Input | Interval | Amount of loan request
MORTDUE | Input | Interval | Amount due on existing mortgage
VALUE | Input | Interval | Value of current property
DEBTINC | Input | Interval | Debt-to-income ratio
YOJ | Input | Interval | Years at present job
DEROG | Input | Interval | Number of major derogatory reports
CLNO | Input | Interval | Number of trade lines
DELINQ | Input | Interval | Number of delinquent trade lines
CLAGE | Input | Interval | Age of oldest trade line in months
NINQ | Input | Interval | Number of recent credit inquiries

Selecting The Task-relevant Attributes

HMEQ – Modeling Goal(s)

The credit scoring model should compute the probability of a given loan applicant to default on loan repayment.

A threshold is to be selected such that all applicants whose probability of default is in excess of the threshold are recommended for rejection.

Classification and Regression Trees (C&RT)

A non-parametric data mining approach

No distributional assumptions are made about the data

Splits are made by variables that best differentiate the target variable

Each node can be split into 2 child nodes

Stopping rules govern the size of the tree

Misclassification cost

Misclassification is inevitable – no model is perfect

“All models are wrong, some are useful” George Box

Some misclassifications are worse than others, so Statistica allows you to account for misclassification costs.

Actual class | Predicted Good | Predicted Bad
Actual Good | True Positive | False Negative
Actual Bad | False Positive | True Negative

Observed class | Predicted Good Credit | Predicted Bad Credit
Observed Good Credit | Correct | Missed Opportunity, Cost = $1
Observed Bad Credit | Lost Revenue, Cost = $3 | Correct
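
One way to see what these costs imply is to compare the expected cost of approving and of rejecting an applicant whose probability of default is known. This is a simple decision sketch using the $1 and $3 costs from the table above; it is not a description of how Statistica uses costs while building the tree, and the function and variable names are hypothetical.

```python
COST_APPROVE_BAD = 3.0   # observed bad, predicted good (lost revenue)
COST_REJECT_GOOD = 1.0   # observed good, predicted bad (missed opportunity)


def decide(p_bad):
    """Pick the action with the lower expected misclassification cost."""
    cost_if_approved = p_bad * COST_APPROVE_BAD
    cost_if_rejected = (1.0 - p_bad) * COST_REJECT_GOOD
    return "reject" if cost_if_rejected < cost_if_approved else "approve"


# The two expected costs are equal at this probability of default.
threshold = COST_REJECT_GOOD / (COST_REJECT_GOOD + COST_APPROVE_BAD)

print(threshold, decide(0.20), decide(0.30))
```

With equal costs the break-even probability would be 0.5; weighting the bad-loan error three times as heavily pulls it down to 0.25, so more applicants are recommended for rejection.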

Stopping Conditions

Decision tree pruning

Misclassification error or deviance

Select a minimum number of cases for a node to be considered for splitting

A node with fewer cases than the threshold will not be split

Select a maximum number of total nodes

Controls overall tree complexity

FACT direct stopping

Select the fraction of objects for determining if a node should be split

E.g., 9/10: when 9/10 (or more) of the cases in a node belong to the same class, no additional splits are needed

Cross Validation

Cross validation is a method to prevent over fitting the data and failing to generalize to new data

Situation where the “current” data set is “learned” very well, but the model cannot be applied to future data sets.

Cross validation techniques

V-Fold Cross Validation

Good for smaller data sets, when holding out a test sample is not feasible

Repeats the analysis on “V” different random samples taken from the data and compares the resulting trees

Train – test sample cross validation

Test sample data is used to determine if the right size tree was found based on how well the tree performs on test data.

Cross Validation (cont)

V-Fold Cross Validation (10)

1st Run: TEST TRAIN TRAIN TRAIN TRAIN TRAIN TRAIN TRAIN TRAIN TRAIN
2nd Run: TRAIN TEST TRAIN TRAIN TRAIN TRAIN TRAIN TRAIN TRAIN TRAIN
3rd Run: TRAIN TRAIN TEST TRAIN TRAIN TRAIN TRAIN TRAIN TRAIN TRAIN
4th Run: TRAIN TRAIN TRAIN TEST TRAIN TRAIN TRAIN TRAIN TRAIN TRAIN
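
The rotation shown above, where each fold takes one turn as the test set, can be sketched as follows. This is a minimal illustration with hypothetical names; cases are assigned to folds in order so the result is repeatable, whereas in practice they would normally be shuffled first.

```python
def v_fold_splits(n_cases, v):
    """Yield (train, test) index lists, with each fold used as TEST exactly once."""
    indices = list(range(n_cases))
    # Deal the cases into v folds (no shuffling here, to keep the output fixed).
    folds = [indices[i::v] for i in range(v)]
    for test_fold in range(v):
        test = folds[test_fold]
        train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
        yield train, test


splits = list(v_fold_splits(n_cases=20, v=10))
for run, (train, test) in enumerate(splits[:4], start=1):
    print(f"Run {run}: test={test}, train size={len(train)}")
```

Every case lands in a test set exactly once, so all of the data contributes to both training and evaluation without a separate holdout sample.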

Surrogates

In deployment, surrogate splits are used in place of the actual split variable when its value is missing

The surrogate is the next best split variable

Each split can have multiple surrogates
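
The fallback behaviour can be sketched as a routing function that tries the primary split variable first and then each surrogate in order. This is a simplified illustration, not Statistica's implementation: the variable names come from the HMEQ data set, but the thresholds, the choice of surrogates, and the assumption that every surrogate sends low values left are all hypothetical.

```python
def route(case, primary, surrogates, default="left"):
    """Send a case left or right, using the first split variable that has a value."""
    for variable, threshold in [primary, *surrogates]:
        value = case.get(variable)
        if value is not None:
            return "left" if value <= threshold else "right"
    # Nothing usable: fall back to a default branch.
    return default


primary = ("DEBTINC", 43.5)                       # hypothetical threshold
surrogates = [("DELINQ", 0.5), ("DEROG", 0.5)]    # hypothetical surrogates

print(route({"DEBTINC": 30.0}, primary, surrogates))
print(route({"DEBTINC": None, "DELINQ": 2}, primary, surrogates))
print(route({}, primary, surrogates))
```

The second case has no debt-to-income ratio, so it is routed by the first surrogate instead of being dropped.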

Run HMEQ in Statistica

Load the Data into Statistica

HMEQ is an Excel file, so you want to make sure that you

Take the top option – loading the Excel Data

Check – that the variable names are in the first row

Data in Statistica

Notice that I added a new variable called “Default”

This is so that when we run the model it will report in terms of “BAD” and “GOOD” as opposed to “1” and “0”

Make sure you DON’T use “BAD” as a predictor for “DEFAULT”
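
The same recoding can be sketched outside Statistica. The two rows below are hypothetical stand-ins for HMEQ records; the point is only that DEFAULT is derived directly from BAD, so BAD has to be left out of the predictor list or the model would simply read the answer off it.

```python
# Hypothetical records with a subset of the HMEQ columns.
rows = [
    {"BAD": 1, "LOAN": 1100, "DEBTINC": 45.2},
    {"BAD": 0, "LOAN": 1300, "DEBTINC": 31.7},
]

# Add a DEFAULT column that reports "BAD"/"GOOD" instead of 1/0.
for row in rows:
    row["DEFAULT"] = "BAD" if row["BAD"] == 1 else "GOOD"

# Predictors exclude both the target and the column it was derived from.
target = "DEFAULT"
predictors = [name for name in rows[0] if name not in ("BAD", target)]

print([row["DEFAULT"] for row in rows], predictors)
```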

Select the C&RT Technique

Select “Data Mining”

Next select C&RT

Selecting The Classification Type and Specification Method

We will use the standard C&RT tool and Quick specs dialog

Select “OK”

Variable Selection

Since the dependent variable is categorical (DEFAULT) we want to check “Categorical Response”

Next, we select “Variables” and specify the dependent and independent variables (Note: You can check – Show appropriate variables only)

Select the “Response Codes” to specify the levels of the dependent variable that are important

Classification Costs

Since we have specified a “Response Code”, we can now specify “costs” or “weights” associated with a misclassification.

This will cause the system to view a particular “misclassification” as worse than another.

Stopping Criteria

Here we are going to use “Prune on Misclassification”

Simply means that it will take “costs” into account

Stopping parameters are

Min Cases = 596

Max Nodes = 1000

Surrogates

We will use 2 surrogates for each split

On hitting “OK” the tree is built.

Results

Select “Scrollable Tree”

Results – “Tree Structure”

Gives you a spreadsheet of output

Shows variables used for each split

Everything in the tree given in a spreadsheet format

Importance

This will show how important each variable is in constructing the decision tree.

Observational Tab

I get the observed value / predicted value and the probability associated with each one

Classification

Predicted vs Observed

109 were predicted bad and were observed as bad

3045 were predicted good and were observed as good

Misclassified

191 were predicted as good – but, were observed as bad

19 were predicted as bad – but, were observed as good
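
A few summary rates make these four counts easier to read. The counts are taken from the results above; the variable names are hypothetical, and this is a plain arithmetic sketch rather than anything Statistica reports directly.

```python
# Counts from the classification results, keyed by (observed, predicted).
confusion = {
    ("BAD", "BAD"): 109,
    ("GOOD", "GOOD"): 3045,
    ("BAD", "GOOD"): 191,
    ("GOOD", "BAD"): 19,
}

total = sum(confusion.values())
correct = sum(n for (obs, pred), n in confusion.items() if obs == pred)
accuracy = correct / total

# Of the loans that were actually bad, how many did the tree flag as bad?
observed_bad = sum(n for (obs, _), n in confusion.items() if obs == "BAD")
bad_detected = confusion[("BAD", "BAD")] / observed_bad

print(f"accuracy = {accuracy:.3f}")
print(f"share of observed BAD predicted as BAD = {bad_detected:.3f}")
```

Overall accuracy looks high, but only about a third of the observed bad loans are caught, because most cases are good loans and predicting GOOD is usually right.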

Re-Run BUT, Add Weight To The Observed Bad/Predicted Good

Go back and change the Classification – to User Specified.

Increase the cost/weight of the Observed bad/predicted good – to 3

Compare Results - Classification

Classification weights 1/1 Classification weights 3/1

Something to Think About

As we look at the results of the modeling – we see that the system seems to be somewhat biased to predict that a Home Equity Loan will be “GOOD”.

Why do you think this might be the case?

Questions ?

Big Data Analytics Tools./Final Exam materials/MARSplines.pptx

MARSplines Multivariate Adaptive Regression Splines

MARSPLINES

MARSplines are a technique developed in the early 1990s for solving regression-type problems.

Goal is to predict the values of a continuous dependent variable from a set of independent or predictor variables

MARSplines are a non-parametric regression procedure that makes no assumptions about the underlying functional relationship between the dependent and independent variables

Goal is to construct the mapping from a set of coefficients and basis functions that are entirely driven by the data


Divide and Conquer

In essence the MARSplines method is based on a divide and conquer strategy

The input space is divided into “regions”

Each region is analyzed and a regression equation is generated

You can think of this as building a “piecewise” model from a collection of very simple linear regression equations.

This makes MARSplines particularly useful when there are a large number of variables – where the curse of dimensionality might present issues

It can be used with both continuous and categorical independent variables – but, most commonly used with continuous


Strengths / Popularity

Does not impose many of the restrictions on the data that you would find in other regression methods (e.g., linear regression, logistic regression, etc.)

Relationship between the dependent and independent variable does not have to be monotonic.

Does not assume that data are normally distributed

Does not assume a linear relationship exists between X and Y.

This technique will automatically look for interactions between variables and non-linear relationships


[Figure: five example scatter plots, labeled Linear / Monotonic; Linear / Monotonic; Weak Linear / Monotonic; Non-Linear / Non-Monotonic; Non-Linear / Monotonic]

Model Features

Creates easy to understand and interpretable models (compared to boosted trees, random forests, and neural networks)

It produces a model, that in the end, can be written out as an equation

Model is not greatly affected by outliers

Can be used for either regression or classification tasks

Accepts a large number of predictor variables

Can be used as a method for selection of predictor variables as the input for another analysis technique


Working with MARSplines

Basis functions use a knot parameter or a breakpoint to find the non-linear relationships

Increasing the number of basis functions gives the potential for a more complex model

Using the degree of interactions – you can specify the amount of interaction that is modeled between variables


The penalty parameter can be thought of as a “cost” for selecting each basis function

Increasing this parameter can help to protect against overfitting of the model
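
A basis function with a knot can be sketched as a "hinge" that is zero on one side of the knot and linear on the other. The example below is a minimal Python illustration, not Statistica's implementation; the intercept, coefficients, and the knot at 5 are hypothetical values chosen only to show how a piecewise-linear prediction is assembled.

```python
def hinge(x, knot, direction=+1):
    """Basis function: max(0, x - knot) for direction +1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))

# Hypothetical fitted model: an intercept plus two basis functions sharing one knot.
INTERCEPT = 2.0
TERMS = [
    (1.5, 5.0, +1),   # (coefficient, knot, direction): slope 1.5 to the right of 5
    (-0.5, 5.0, -1),  # coefficient -0.5 on max(0, 5 - x): slope +0.5 to the left of 5
]

def predict(x):
    """The model written out as an equation: intercept + sum of coefficient * basis."""
    return INTERCEPT + sum(c * hinge(x, k, d) for c, k, d in TERMS)

for x in (3.0, 5.0, 7.0):
    print(x, predict(x))
```

Adding more terms (more basis functions) lets the curve bend in more places, which is why a penalty per basis function helps guard against overfitting.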

Model Output

The model output will be very similar to other regression type models that we have looked at

Variable importance

Model coefficients

Predicted values

Residuals

Custom predictions – this will allow you to define a new case on the fly and generate a prediction for the chosen dependent variable


MARSplines in Statistica


Launch MARSplines from the Data Mining tab

Select variables as we did in all of the prior exercises

Options

On the options TAB increase the Degree of Interactions to 3

This will allow the analysis to consider more interactions between variables and create a more complex model


Results

Look at Predictor Importance


The larger the number the more important the variable

0 – indicates that the variable was not used

Results

Coefficients – gives us another way to look at the equation parameters

Select Dependent variable and look at predictions


Plots

Let’s look at the observed versus predicted

What we are looking for is a trend from lower left to upper right


Custom Predictions

Here we can type in values for the predictor variables and Statistica will calculate a prediction for the dependent variable


QUESTIONS?

[Figure: 2D Scatter Plot of PC Volume (Obs.) vs. PC Volume (Pred.) (Beverage Manufacturing). Axes: PC Volume (Obs.), 0.00 to 0.55; PC Volume (Pred.), 0.14 to 0.48]


Big Data Analytics Tools./Final Exam materials/1 BINS 4352 - Data Mining.pptx

BINS 4352 Analytics / Data Mining

Learning Objectives

Define data mining as an enabling technology for business intelligence

Understand the objectives and benefits of business analytics and data mining

Recognize the wide range of applications of data mining

Learn the standardized data mining processes

CRISP-DM

SEMMA

KDD


Learning Objectives

Understand the steps involved in data preprocessing for data mining

Learn different methods and algorithms of data mining

Build awareness of the existing data mining software tools

Commercial versus free/open source

Understand the pitfalls and myths of data mining

Motivation: “Necessity is The Mother of Invention”

Data explosion problem

Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

We are drowning in data, but starving for knowledge!

Data warehousing and data mining

Data warehousing and on-line analytical processing

Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases


Analysis Versus Analytics

Is there a difference between analysis and analytics?

The terms are often used interchangeably

Analysis: Dividing a large problem into parts so that the parts can be critically examined at a more granular level

Forced to decompose complex systems into their component parts

Putting the pieces back together is called synthesis

Analytics: A variety of methods, technology, and associated tools for creating new knowledge/insight to solve complex problems and make better faster decisions.

Use mathematical models and data to make sense of a complex world

Analytics includes analysis; it also includes other complementary tasks and processes, plus the synthesis step that pulls things back together.

Goal: convert data into actionable insights

Definition of Data Mining

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases. - Fayyad et al., (1996)

Keywords in this definition:

Non trivial Process

Valid

Novel

Useful

Understandable


Data Mining at the Intersection of Many Disciplines


Why the Sudden Popularity of Analytics?

Need

In today’s global economy a company’s success or even survival can depend on the ability of the business to be agile and of its managers to make the best possible decisions in a timely manner

Availability / Affordability

Technological advancements have enabled organizations to collect a tremendous amount of data about their markets, customers, etc.

Culture

Ongoing shift from the old-fashioned intuition-based decision making process to new-age fact-/evidence-based decision making.


Main Challenges Of Analytics

Analytics talent – data scientists are scarce

Culture – the shift from intuition-based management to fact-based management is difficult

ROI – it is difficult for companies to quantify the benefits associated with large investments in DA infrastructure

Data – big data is a big challenge!

It is unstructured

It is arriving at a speed that prohibits traditional data collection and processing means

It is usually messy


Main Challenges Of Analytics (Cont)

Technology – although it is available, it still requires significant investment

Security / Privacy – this is one of the most common criticisms of data and analytics.

Concerns are always raised when we start talking about the collection and analysis of vast amounts of data (e.g., google searches, Facebook and twitter posts, health records, etc.)


Some of the companies that lost your data in 2015

Vtech: 4.8M Customer Records (CRs)
Ashley Madison: 37M Client Records
UCLA Health: 4.5M Patient Records
LastPass: Millions of User Passwords
Scottrade: 4.6M CRs
Tmobile: 15M CRs
TRUMP Hotels: thousands of visitors
CVS: unknown millions credit cards
Excellus BlueCross BlueShield: 10M CRs
Carphone Warehouse (UK): 2.4M CRs
IRS: 100,000 Taxpayers
Anthem (healthcare): 80M CRs
US Office of Personnel Mgmt: 22M and counting Employee/Applicant
CIA: thousands of arrestee’s data after John Brennan’s email breached
70M phone records of prison inmates given to reporters
Patreon (crowdfunding service): 15GB data breach included names/SS numbers/ etc.

Source of data for DM is often a consolidated data warehouse (not always!).

Data is the most critical ingredient for DM which may include soft/unstructured data.

The miner is often an end user

Striking it rich requires creative thinking

Data mining tools’ capabilities and ease of use are essential (Web, Parallel processing, etc.)

Data Mining Characteristics/Objectives


What Kinds of Patterns Can Data Mining Discover?

Models are usually mathematical representations that identify the relationships among attributes of the objects (e.g., customers)

Simple linear correlations and/or highly complex nonlinear relationships

Some patterns will be explanatory

Explaining the interrelationships and affinities among attributes

Some patterns will be predictive

Projecting the future values of certain attributes


Example

I have a large HR dataset – it has tracked employees and captured data on attrition

The goal is to develop a system using historical data that can predict when an employee is at risk

Why?

Because the cost to the company to “replace” an employee is 4x greater than the cost to “save and retain” an employee

What Data do I Have?

Age
Attrition
BusinessTravel
DailyRate
Department
DistanceFromHome
Education
EducationField
EmployeeCount
EmployeeNumber
EnvironmentSatisfaction
Gender
HourlyRate
JobInvolvement
JobLevel
JobRole
JobSatisfaction
MaritalStatus
MonthlyIncome
MonthlyRate
NumCompaniesWorked
Over18
OverTime
PercentSalaryHike
PerformanceRating
RelationshipSatisfaction
StandardHours
StockOptionLevel
TotalWorkingYears
TrainingTimesLastYear
WorkLifeBalance
YearsAtCompany
YearsInCurrentRole
YearsSinceLastPromotion
YearsWithCurrManager

I have an Excel file with records for ~1500 employees

I have a wide-range of data.

80% of my time will be spent trying to understand the data, cleaning the data, and identifying what is and what is not important.

Preparing the Data

I will go through the dataset

Look for missing data

Look for data inconsistencies

Are variables highly correlated?

What variables appear to be the most important

Etc.

Run the Analysis

Evaluate the model

Am I finding patterns that I can understand and appear to be reasonable?

If so, then I can take those findings to management and suggest actions that can be taken.

If not, then I may reexamine the data and/or change some parameters – and – run it again.

Classification of Learning Algorithms

Hypothesis Driven (supervised)

I know what I am looking for and I am looking to find it in the data

My analysis is trying to find support for some hypothesis that I propose

EX: Does performance on the GMAT exam predict performance in Graduate School (GPA)?

Does attending class improve student performance?

Discovery Driven (unsupervised)

I am looking at data to see what pattern(s) emerge

I begin the analysis with no pre-conceived notions of what might be there

EX: I am analyzing store receipts to discover if there are groups of products that are purchased together by consumers.

I want to determine whether or not to approve a home loan.

I want to do a market segmentation in order to determine the profile of customers that spend the most.

I want to gain a better understanding of which customers may be at risk of moving to a competitor


Types of Patterns

Searching for patterns in the data

Patterns – numeric and/or symbolic relationships among data items

Associations (e.g., Market Basket Analysis)

Example: How can products be organized in the store in order to improve the customer experience and take advantage of buying patterns?

Prediction (e.g., Regression, Decision Trees)

Example: Can we look at enrollment and student data to identify factors that impact student success and the 4-year graduation rate?

Cluster (e.g., Cluster Analysis)

Used for automatic identification of the natural grouping of things

Example: Can we look at our customer base and identify some natural grouping of customers?

Sequential or Time Series (e.g., Predictive Analytics)

Example: Can we look at past buying patterns in order to gain insight into how to better meet customer demands and better manage inventory?


Data Mining Applications

Customer Relationship Management

Maximize return on marketing campaigns

Improve customer retention (churn analysis)

Maximize customer value (cross-, up-selling)

Identify and treat most valued customers

Banking & Other Financial

Automate the loan application process

Detecting fraudulent transactions

Maximize customer value (cross-, up-selling)

Optimizing cash reserves with forecasting

Retailing and Logistics

Optimize inventory levels at different locations

Improve the store layout and sales promotions

Optimize logistics by predicting seasonal effects

Minimize losses due to limited shelf life

Manufacturing and Maintenance

Predict/prevent machinery failures

Identify anomalies in production systems to optimize the use of manufacturing capacity

Discover novel patterns to improve product quality


Data Mining Applications (cont.)

Brokerage and Securities Trading

Predict changes on certain bond prices

Forecast the direction of stock fluctuations

Assess the effect of events on market movements

Identify and prevent fraudulent activities in trading

Insurance

Forecast claim costs for better business planning

Determine optimal rate plans

Optimize marketing to specific customers

Identify and prevent fraudulent claim activities

Computer hardware and software

Science and engineering

Government and defense

Homeland security and law enforcement

Travel industry

Healthcare

Medicine

Entertainment industry

Sports

Etc.


Data Mining Processes

Data Mining Process

A manifestation of best practices

A systematic way to conduct DM projects

Different companies/groups have different versions

Most common standard processes:

CRISP-DM (Cross-Industry Standard Process for Data Mining)

SEMMA (Sample, Explore, Modify, Model, and Assess)

KDD (Knowledge Discovery in Databases)


Data Mining Process

Source: KDNuggets.com


Data Mining Process: CRISP-DM

Step 1: Business Understanding

Step 2: Data Understanding

Step 3: Data Preparation (!)

Step 4: Model Building

Step 5: Testing and Evaluation

Step 6: Deployment

85% of time spent in Steps 1-3

The process is highly repetitive and experimental (DM: art versus science?)


Data Preparation – A Critical DM Task


Data Mining Process: SEMMA


Advantages / Short-comings of SEMMA

ADVANTAGES

It is easy to understand and follow.

Allows for the organized, well-structured, and adequate development of DM solutions

Its conception-creation-evolution methodology helps to map business problems to the DM solution.

SHORT COMINGS

SEMMA mainly focuses on the modeling tasks of data mining projects, leaving the business aspects out.

SEMMA is closely tied to SAS Enterprise Miner Software and is meant to help guide the user on the implementation of DM applications.

Therefore, applying it outside Enterprise Miner can be ambiguous.


KDD (Knowledge Discovery in Databases)

Classification

Data Mining Methods: Classification

Most frequently used DM method

Part of the machine-learning family

Employ supervised learning

Learn from past data, classify new data

The output variable is categorical (nominal or ordinal) in nature


Predictive accuracy

Hit rate

Speed

Model building; predicting

Robustness

Scalability

Interpretability

Transparency, explainability

Assessment Methods for Classification


Accuracy of Classification Models

In classification problems, the primary source for accuracy estimation is the confusion matrix


                             True Class
                             Positive    Negative
Predicted Class   Positive   14008       234
                  Negative   442         13005

Accuracy of Classification Models (cont)

Accuracy = 97.6%

True Pos Rate = 98.4%

True Neg Rate = 96.7%

Precision = 96.9%

Recall = 98.4%

                             True Class
                             Positive    Negative
Predicted Class   Positive   14008       234
                  Negative   3002        6084

Accuracy = 86.1%

True Pos Rate = 98.4%

True Neg Rate = 67.0%

Precision = 82.4%

Recall = 98.4%

                             True Class
                             Positive    Negative
Predicted Class   Positive   7503        4021
                  Negative   3002        6084

Accuracy = 65.9%

True Pos Rate = 65.1%

True Neg Rate = 67.0%

Precision = 71.4%

Recall = 65.1%
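
The rates quoted with each table can be recomputed from the four cell counts. The sketch below is a minimal Python version with an illustrative function name (`rates`). It makes one loud assumption: each table row is read as one true class (first row: actual positives, second row: actual negatives), because that is the reading under which the quoted rates for the first table are reproduced.

```python
def rates(tp, fn, fp, tn):
    """Standard rates computed from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "true_pos_rate": tp / (tp + fn),   # also called recall or sensitivity
        "true_neg_rate": tn / (tn + fp),   # also called specificity
        "precision": tp / (tp + fp),
    }

# First table, under the assumption stated above:
# first row = 14008 correct + 234 missed positives, second row = 442 false alarms + 13005 correct.
first = rates(tp=14008, fn=234, fp=442, tn=13005)
for name, value in first.items():
    print(f"{name}: {value:.1%}")
```

Accuracy alone hides the difference between the first two tables: both have the same true positive rate, but the true negative rate drops sharply in the second.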

Revisiting our HR Example (Decision Tree)

74.68% of the time when the value of Attrition was “YES” the model predicted “YES”

77.78% of the time when the value of Attrition was “NO” the model predicted “NO”

Revisiting our HR Example (Decision Tree)

The most important variable in predicting Attrition – JobRole

The least important variable - Gender

Estimation Methodologies for Classification

Simple split (or holdout or test sample estimation)

Split the data into 2 mutually exclusive sets: training (~70%) and testing (~30%)

For ANN, the data is split into three sub-sets (training [~60%], validation [~20%], testing [~20%])
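
A simple split can be sketched in a few lines of Python. This is an illustration only: `simple_split` is a hypothetical helper, the records are stand-in integers, and the fixed seed is there just to make the shuffle repeatable.

```python
import random

def simple_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records, then cut them into mutually exclusive train/test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Stand-in for ~1500 employee records.
train, test = simple_split(range(1500))
print(len(train), len(test))  # 1050 450
```

A three-way split for ANN can be built the same way by cutting the shuffled list at two points instead of one.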


Estimation Methodologies for Classification

V-Fold Cross Validation (rotation estimation)

Split the data into v mutually exclusive subsets

Use each subset as testing while using the rest of the subsets as training

Repeat the experimentation v times

Aggregate the test results for a true estimation of prediction accuracy
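
The rotation can be sketched as follows. This is a minimal Python illustration with hypothetical names; the "model" is a majority-class baseline standing in for whatever classifier is being evaluated, and the ten labels are made-up data.

```python
from collections import Counter

def v_fold_indices(n_cases, v):
    """Yield (train_indices, test_indices) once for each of the v folds."""
    folds = [list(range(i, n_cases, v)) for i in range(v)]  # v disjoint subsets
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validated_accuracy(labels, v):
    """Score a majority-class baseline (a stand-in for a real model) with v folds."""
    fold_scores = []
    for train, test in v_fold_indices(len(labels), v):
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]  # "training"
        hits = sum(labels[i] == majority for i in test)                    # "testing"
        fold_scores.append(hits / len(test))
    return sum(fold_scores) / len(fold_scores)  # aggregate over the v folds

labels = ["No"] * 8 + ["Yes"] * 2   # hypothetical attrition labels
print(cross_validated_accuracy(labels, v=5))  # 0.8
```

Every case is used for testing exactly once and for training v - 1 times, which is what makes the aggregated estimate less dependent on one lucky or unlucky split.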

[Figure: the data divided into subsets S1 to S8, each labeled TRAIN or TEST for a given round]


Estimation Methodologies for Classification

Other estimation methodologies

Leave-one-out

Similar to the k-fold method where k equals the number of data points (each test set holds a single case)

Every data point is tested against the many models developed

Sometimes viable for small datasets

Bootstrapping

A fixed number of instances are sampled (with replacement) for training and the rest is used for testing

Repeat as desired

Jackknifing

Similar to leave-one-out; accuracy is calculated by leaving one sample out at each iteration of the estimation process.

Area Under The ROC Curve

A graphical assessment technique that plots the true-positive rate (Y axis) against the false-positive rate (X axis)

Area under the curve indicates the accuracy of the classifier
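
A ROC curve and its area can be computed directly from predicted scores. The sketch below is a minimal Python illustration with hypothetical labels and scores; it assumes the scores are all distinct (ties would need to be grouped into one threshold) and uses the trapezoidal rule for the area.

```python
def roc_points(labels, scores):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal rule over consecutive ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]                 # 1 = positive class (hypothetical)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]     # predicted probability of positive
auc = area_under_curve(roc_points(labels, scores))
print(round(auc, 3))  # 0.889
```

An area of 1.0 means every positive is scored above every negative; 0.5 is what random scoring would give.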

Estimation Methodologies for Classification – ROC Curve


Classification Techniques

Decision tree analysis

The most popular classification technique

Statistical analysis

Logistic regression and discriminant analysis

Assumption: there is a linear relationship between the inputs and outputs, normally distributed, etc.

Neural networks

Among the most popular machine learning techniques

Support vector machines

Case-based reasoning

Historical cases are used to recognize commonalities in order to assign new case to the most probable category


Classification Techniques

Bayesian classifiers

Uses probability theory to build classification models based on the past occurrences that are capable of placing a new instance into the right category

Genetic algorithms

Use of the analogy of natural evolution to build directed search-based mechanisms to classify data samples

Rough sets

Takes into account partial membership of class labels to predefined categories in building models (collection of rules)


Decision Trees

Decision Trees

1. Create a root node and assign all of the training data to it.

2. Select the best splitting attribute.

3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive subsets along the lines of the specific split.

4. Repeat steps 2 and 3 for each and every leaf node until the stopping criteria are reached.

A general algorithm for decision tree building

Employs the divide and conquer method

Recursively divides a training set until each division consists of examples from one class


Decision Trees

DT algorithms mainly differ on

Splitting criteria

Which variable, what value, etc.

Stopping criteria

When to stop building the tree

Pruning (generalization method)

Pre-pruning versus post-pruning

Most popular DT algorithms include

ID3, C4.5, C5; CART; CHAID; M5


Decision Trees

Alternative splitting criteria

Gini index determines the purity of a specific class as a result of a decision to branch along a particular attribute/value

Used in CART

Information gain uses entropy to measure the extent of uncertainty or randomness of a particular attribute/value split

Used in ID3, C4.5, C5

Chi-square statistics (used in CHAID)
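
The first two criteria can be compared on a small example. This is a minimal Python sketch with illustrative function names and a hypothetical eight-case node; it computes the Gini impurity reduction and the entropy-based information gain for one candidate split.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 0 for a pure node, larger when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: 0 for a pure node, 1 for a 50/50 two-class node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def weighted(measure, branches):
    """Size-weighted average of an impurity measure over the branches of a split."""
    total = sum(len(b) for b in branches)
    return sum(len(b) / total * measure(b) for b in branches)

parent = ["Yes"] * 4 + ["No"] * 4                                   # hypothetical node
split = [["Yes", "Yes", "Yes", "No"], ["Yes", "No", "No", "No"]]    # candidate split

gini_reduction = gini(parent) - weighted(gini, split)
information_gain = entropy(parent) - weighted(entropy, split)
print(round(gini_reduction, 3), round(information_gain, 3))  # 0.125 0.189
```

A tree-building algorithm evaluates a score like this for every candidate attribute/value and keeps the split with the largest improvement.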


Clustering

Cluster Analysis for Data Mining

Used for automatic identification of natural groupings of things

Part of the machine-learning family

Employ unsupervised learning

Learns the clusters of things from past data, then assigns new instances

There is not an output variable

Also known as segmentation


Cluster Analysis for Data Mining

Clustering results may be used to

Identify natural groupings of customers

Identify rules for assigning new cases to classes for targeting/diagnostic purposes

Provide characterization, definition, labeling of populations

Decrease the size and complexity of problems for other data mining methods

Identify outliers in a specific domain (e.g., rare-event detection)


Cluster Analysis for Data Mining

Analysis methods

Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on.

Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])

Fuzzy logic (e.g., fuzzy c-means algorithm)

Genetic algorithms


Cluster Analysis for Data Mining

How many clusters?

There is not a “truly optimal” way to calculate it

Heuristics are often used

Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items.

Euclidean, Manhattan/Rectilinear … distance
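
The two distance measures named above differ only in how the per-dimension gaps are combined. A minimal Python sketch, with illustrative function names and made-up points:

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Rectilinear distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

Because both measures add up raw differences, variables on large scales dominate the result unless the data are normalized first.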


Cluster Analysis for Data Mining

k-Means Clustering Algorithm

k : pre-determined number of clusters

Algorithm (Step 0: determine value of k)

Step 1: Randomly generate k points as initial cluster centers.

Step 2: Assign each point to the nearest cluster center.

Step 3: Re-compute the new cluster centers.

Repetition step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable).
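
The steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: the initial centers are sampled from the data points themselves (a common variant of "randomly generate k points"), distance is squared Euclidean, and the six 2-D points are made up.

```python
import random

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # Step 1: k initial centers taken from the data
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest center (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        if new_assignment == assignment:     # convergence: assignments are stable
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:                      # keep the old center if a cluster is empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = k_means(points, k=2)
print(assignment)
```

The cluster numbers themselves are arbitrary; what matters is which points share a number.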


Cluster Analysis for Data Mining - k-Means Clustering Algorithm


Association Analysis

Association Rule Mining

A very popular DM method in business

Finds interesting relationships (affinities) between variables (items or events)

Part of machine learning family

Employs unsupervised learning

There is no output variable

Also known as market basket analysis

Often used as an example to describe DM to ordinary people, such as the famous “relationship between diapers and beers!”


Association Rule Mining

Input: the simple point-of-sale transaction data

Output: Most frequent affinities among items

Example: according to the transaction data…

“Customers who bought a laptop computer and virus protection software also bought an extended service plan 70 percent of the time."

How do you use such a pattern/knowledge?

Put the items next to each other

Promote the items as a package

Place items far apart from each other!


Association Rule Mining

Representative applications of association rule mining include

In business: cross-marketing, cross-selling, store design, catalog design, e-commerce site design, optimization of online advertising, product pricing, and sales/promotion configuration

In medicine: relationships between symptoms and illnesses; diagnosis and patient characteristics and treatments (to be used in medical DSS); and genes and their functions (to be used in genomics projects)


Association Rule Mining

Are all association rules interesting and useful?

A Generic Rule: X => Y [S%, C%]

X, Y: products and/or services

X: Left-hand-side (LHS)

Y: Right-hand-side (RHS)

S: Support: how often X and Y go together

C: Confidence: how often Y appears in the transactions that contain X

Example: {Laptop Computer, Antivirus Software} => {Extended Service Plan} [30%, 70%]
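
Support and confidence can be computed by simple counting. The sketch below is a minimal Python illustration: the function names and the ten baskets are hypothetical, chosen so the rule's support comes out at 30%; the confidence here is 75%, not the 70% of the slide's example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Of the transactions containing the LHS, the fraction that also contain the RHS."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Ten hypothetical baskets.
baskets = (
    [{"laptop", "antivirus", "service plan"}] * 3
    + [{"laptop", "antivirus"}]
    + [{"laptop"}, {"antivirus"}, {"mouse"}, {"mouse"}, {"printer"}, {"printer"}]
)
lhs, rhs = {"laptop", "antivirus"}, {"service plan"}
print(f"support = {support(baskets, lhs | rhs):.0%}")
print(f"confidence = {confidence(baskets, lhs, rhs):.0%}")
```

A rule with high confidence but very low support describes only a handful of baskets, which is why both numbers are reported together.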


Association Rule Mining

Algorithms are available for generating association rules

Apriori

Eclat

FP-Growth

+ Derivatives and hybrids of the three

The algorithms help identify the frequent item sets, which are then converted to association rules


Association Rule Mining

Apriori Algorithm

Finds subsets that are common to at least a minimum number of the itemsets

Uses a bottom-up approach

frequent subsets are extended one item at a time (the size of frequent subsets increases from one-item subsets to two-item subsets, then three-item subsets, and so on), and

groups of candidates at each level are tested against the data for minimum support
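
The bottom-up search can be sketched in a few lines of Python. This is a simplified illustration, not an optimized implementation: the function name is hypothetical, the six transactions are the SKU lists from this deck's Apriori example figure, and the minimum support of 3 is an assumption chosen to match the itemsets that survive in that figure.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    level = {frozenset([i]) for i in items}      # start with one-item candidates
    frequent, size = {}, 1
    while level:
        # Test this level's candidates against the data for minimum support.
        counted = {c: sum(c <= t for t in transactions) for c in level}
        kept = {c: s for c, s in counted.items() if s >= min_support}
        frequent.update(kept)
        size += 1
        # Extend frequent itemsets by one item, then prune any candidate
        # that has an infrequent subset one item smaller.
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        level = {c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, size - 1))}
    return frequent

transactions = [{1, 2, 3, 4}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 4}]
frequent = apriori(transactions, min_support=3)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

The pruning step is what keeps the search manageable: {1, 3} fails the support test at the two-item level, so no three-item candidate containing it is ever counted.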


Association Rule Mining Apriori Algorithm


Artificial Neural Networks (ANN)

Artificial Neural Networks for Data Mining

Artificial neural networks (ANN or NN) are a brain metaphor for information processing

a.k.a. Neural Computing

Very good at capturing highly complex non-linear functions!

Many uses – prediction (regression, classification), clustering/segmentation

Many application areas - finance, medicine, marketing, manufacturing, service operations, information systems, …


Biological versus Artificial Neural Networks

Elements/Concepts of ANN

Processing element (PE)

Information processing

Network structure

Feedforward vs. recurrent vs. multi-layer…

Learning parameters

Supervised/unsupervised, backpropagation, learning rate, momentum

ANN Software – NN shells, integrated modules in comprehensive DM software, …
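
A single processing element can be sketched as a weighted sum followed by a transfer function. The Python below is a minimal illustration: the sigmoid is one common choice of transfer function and is an assumption here, and the inputs and weights are made-up numbers.

```python
from math import exp

def processing_element(inputs, weights):
    """One PE: summation of inputs times weights, then a transfer function."""
    s = sum(x * w for x, w in zip(inputs, weights))   # summation S
    return 1.0 / (1.0 + exp(-s))                      # transfer function f(S), a sigmoid here

print(processing_element([1.0, 0.5], [0.4, -0.8]))  # 0.5
```

Learning consists of adjusting the weights so that the outputs move closer to the known answers; the learning rate controls how large each adjustment is.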

Wrapping Up

Data Mining Software

Source: KDNuggets.com

Commercial

IBM SPSS Modeler (formerly Clementine)

SAS - Enterprise Miner

IBM - Intelligent Miner

StatSoft – Statistica Data Miner

… many more

Free and/or Open Source

RapidMiner

Weka

R, …


Data Mining Myths

Data mining …

provides instant solutions/predictions

is not yet viable for business applications

requires a separate, dedicated database

can only be done by those with advanced degrees

is only for large firms that have lots of customer data

is another name for the good-old statistics


Common Data Mining Blunders

Selecting the wrong problem for data mining

Ignoring what your sponsor thinks data mining is and what it really can/cannot do

Not leaving sufficient time for data acquisition, selection and preparation

Looking only at aggregated results and not at individual records/predictions

Being sloppy about keeping track of the data mining procedure and results

…more


Dark side of Data Mining

Data that is collected, stored and analyzed in data mining often contains information about real people

Name, address, social security number, driver’s license number, employee number …

Age, sex, ethnicity, marital status, number of children …

Salary, gross family income, checking/savings account balance, home ownership, loan information, …

What is bought – when/where – from vendor transaction records

Anniversary, pregnancy, illness, loss in family, bankruptcy filing …

Often data is “de-identified” prior to use


Dark side of data mining (cont)

In 2003, JetBlue provided more than a million customer records to Torch Concepts (a government contractor)

The data was augmented with additional information such as family size and social security numbers (information purchased from Acxiom)

The consolidated information was to be used to develop potential terrorist profiles

Resulted in numerous lawsuits

Same has happened with social networking companies selling customer-specific data


QUESTIONS?

[Figure: Data Mining at the Intersection of Many Disciplines. DATA MINING sits at the center of Statistics, Management Science & Information Systems, Artificial Intelligence, Databases, Pattern Recognition, Machine Learning, and Mathematical Modeling]

[Figure: CRISP-DM cycle built around the Data Sources: 1 Business Understanding, 2 Data Understanding, 3 Data Preparation, 4 Model Building, 5 Testing and Evaluation, 6 Deployment]

[Figure: Data Preparation. Real-world Data becomes Well-formed Data through four stages]

Data Consolidation: collect data, select data, integrate data

Data Cleaning: impute missing values, reduce noise in data, eliminate inconsistencies

Data Transformation: normalize data, discretize/aggregate data, construct new attributes

Data Reduction: reduce number of variables, reduce number of cases, balance skewed data

[Figure: SEMMA cycle]

Sample (Generate a representative sample of the data)

Explore (Visualization and basic description of the data)

Modify (Select variables, transform variable representations)

Model (Use variety of statistical and machine learning models)

Assess (Evaluate the accuracy and usefulness of the models)

[Figure: confusion matrix layout. Columns: True Class (Positive, Negative). Rows: Predicted Class (Positive, Negative)]

Predicted Positive: True Positive Count (TP), False Positive Count (FP)

Predicted Negative: False Negative Count (FN), True Negative Count (TN)

[Figure: simple split. Preprocessed Data is divided into Training Data (2/3), used for Model Development to produce a Classifier, and Testing Data (1/3), used for Model Assessment (scoring) to produce Prediction Accuracy]

[Figure: ROC curves A, B, and C. Horizontal axis: False Positive Rate (1 - Specificity), 0 to 1. Vertical axis: True Positive Rate (Sensitivity), 0 to 1]

[Figure: Apriori worked example]

Raw Transaction Data (SKUs / Item No per transaction)
1, 2, 3, 4
2, 3, 4
2, 3
1, 2, 4
1, 2, 3, 4
2, 4

Step 1: One-item Itemsets (Itemset: Support)
1: 3
2: 6
3: 4
4: 5

Step 2: Two-item Itemsets (Itemset: Support)
1, 2: 3
1, 3: 2
1, 4: 3
2, 3: 4
2, 4: 5
3, 4: 3

Step 3: Three-item Itemsets (Itemset: Support)
1, 2, 4: 3
2, 3, 4: 3

[Figure: two biological neurons (dendrites, axon, synapse) beside an artificial processing element (PE). Inputs $x_1, x_2, \ldots, x_n$ with weights $w_1, w_2, \ldots, w_n$ feed the PE, which computes the summation $S = \sum_{i=1}^{n} X_i W_i$ and applies the transfer function $f(S)$ to produce the outputs $Y_1, Y_2, \ldots, Y_n$]

Biological NN vs Artificial NN

Neuron: Node (or PE)
Dendrites: Input
Axon: Output
Synapse: Weight
Slow: Fast
Many ($10^9$): Few ($10^2$)

[Figure: bar chart of data mining software (scale 0 to 120; series: Total (w/ others), Alone). Tools shown: Thinkanalytics; Miner3D; Clario Analytics; Viscovery; Megaputer; Insightful Miner/S-Plus (now TIBCO); Bayesia; C4.5, C5.0, See5; Angoss; Orange; Salford CART, Mars, other; Statsoft Statistica; Oracle DM; Zementis; Other free tools; Microsoft SQL Server; KNIME; Other commercial tools; MATLAB; KXEN; Weka (now Pentaho); Your own code; R; Microsoft Excel; SAS / SAS Enterprise Miner; RapidMiner; SPSS PASW Modeler (formerly Clementine)]


Big Data Analytics Tools./Final Exam materials/Regression Analysis.pptx

Regression Analysis

Regression

Regression analysis is a statistical process for estimating the relationships among variables

It includes many techniques for modeling and analyzing several variables

The focus is on finding the relationship between a dependent variable and one or more independent variables

More importantly regression analysis helps us to understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the others are held fixed

There are many techniques for carrying out regression

Familiar methods include linear regression and ordinary least squares regression - these are “parametric”

Non-parametric approaches are techniques that allow the regression function to lie in a specific set of functions which may be infinite-dimensional


Linear Regression

Say we have a simple data file where we have GMAT scores and the GPA of MBA students.

We want to know if GMAT is a good predictor of performance in business school

One way we might do this is to run a simple linear regression where

GMAT is the independent variable (X)

GPA is the dependent variable (Y)

3

Linear Regression In Excel

4

Linear Regression In Excel

Load the data into Excel

Go to the DATA tab and select the Data Analysis Option

It is located on the right side of the menu

If it is not there, then you may need to go to Options -> Add-Ins and load the Analysis ToolPak

5

Then select Regression and select OK

Linear Regression In Excel

Select the Input Y range (dependent variable – GPA)

Select the Input X range (independent variable – GMAT)

Check the following

Labels

Residuals

Standardized Residuals

Residual Plots

Line fit Plots

Normal Probability Plots

6

Linear Regression In Excel

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.80860012
R Square            0.653834153
Adjusted R Square   0.634602718
Standard Error      0.435014172
Observations        20

ANOVA
            df   SS           MS            F             Significance F
Regression  1    6.43372807   6.43372807    33.99819734   1.59668E-05
Residual    18   3.40627193   0.189237329
Total       19   9.84

            Coefficients   Standard Error   t Stat         P-value
Intercept   -1.6995614     0.72677682       -2.338491483   0.031101063
GMAT        0.008399123    0.001440476      5.830797316    1.59668E-05

R Square of 0.6538 means that 65.38% of the variance in GPA can be explained by GMAT.

The P-value for GMAT (1.6E-05, well below 0.05) means that we reject the null hypothesis that the two variables are unrelated. In other words, there is a statistically significant linear relationship between GMAT and GPA.
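As a sanity check, the statistics in the summary output can be recomputed from the sums of squares and coefficients that Excel reports. The sketch below only re-derives numbers already shown in the output; the GMAT score of 600 used for the prediction is a hypothetical value chosen for illustration.

```python
import math

# Values copied from the Excel SUMMARY OUTPUT.
ss_regression = 6.43372807
ss_residual = 3.40627193
ss_total = 9.84
n_obs = 20
n_predictors = 1
intercept = -1.6995614
slope = 0.008399123
slope_se = 0.001440476

df_residual = n_obs - n_predictors - 1  # 18 residual degrees of freedom

# R Square: share of the total variation explained by the regression.
r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)
adj_r_square = 1 - (1 - r_square) * (n_obs - 1) / df_residual

# Standard Error: square root of the residual mean square.
ms_residual = ss_residual / df_residual
standard_error = math.sqrt(ms_residual)

# F and t statistics; with one predictor, t squared equals F.
f_stat = (ss_regression / n_predictors) / ms_residual
t_stat = slope / slope_se


def predict_gpa(gmat):
    """Predicted GPA from the fitted line: intercept + slope * GMAT."""
    return intercept + slope * gmat


print(f"R Square = {r_square:.6f}, Adjusted R Square = {adj_r_square:.6f}")
print(f"F = {f_stat:.4f}, t = {t_stat:.4f}")
print(f"Predicted GPA for a hypothetical GMAT of 600: {predict_gpa(600):.2f}")
```

Note that the fitted line should only be trusted inside the range of GMAT scores present in the data file.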

Linear regression in excel

8

Linear Regression In SPSS

9

Linear regression in spss

Load the dataset into SPSS

Analyze menu -> Regression -> Linear

OK

10

Linear Regression SPSS

11

Linear Regression In Statistica

12

Linear Regression In Statistica

Load the data set into Statistica

Statistics tab -> Multiple Regression

Select the variables and then select OK

13

Linear Regression Statistica

14

Questions


Big Data Analytics Tools./Final Exam materials/Web Mining.pptx

Web Mining

WEB MINING

2

Web Mining Overview

Web is the largest repository of data

Data is in HTML, XML, text format

Challenges (of processing Web data)

The Web is too big for effective data mining

The Web is too complex

The Web is too dynamic

The Web is not specific to a domain

The Web has everything

Opportunities and challenges are great!

3

Web Mining

Web mining (or Web data mining) is the process of discovering intrinsic relationships from Web data (textual, linkage, or usage)

4

Web Content/Structure Mining

Mining the textual content on the Web

Data collection via Web crawlers

Web pages include hyperlinks

Authoritative pages

Hubs

Hyperlink-induced topic search (HITS)
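To make hubs and authorities concrete, here is a minimal sketch of the HITS iteration in plain Python. The four-page link graph (portal, blog, news, shop) is hypothetical, and the fixed iteration count and L2 normalization are choices made for this illustration rather than details taken from the slides.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a dict mapping page -> pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages that link to it.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        # Normalize so the scores do not grow without bound.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {page: v / auth_norm for page, v in auth.items()}
        hub = {page: v / hub_norm for page, v in hub.items()}
    return hub, auth


# Hypothetical link graph: "portal" and "blog" point at content pages.
links = {"portal": ["news", "shop"], "blog": ["news"], "news": [], "shop": []}
hub, auth = hits(links)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
print("Best hub:", best_hub)
print("Best authority:", best_authority)
```

A good hub links to many good authorities, and a good authority is linked to by many good hubs, which is why the two scores are updated in turn.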

5

Web Usage Mining

Extraction of information from data generated through Web page visits and transactions…

Data stored in server access logs, referrer logs, agent logs, and client-side cookies

User characteristics and usage profiles

Metadata, such as page attributes, content attributes, and usage data

Clickstream data

Clickstream analysis

6

Web Usage Mining

Web usage mining applications

Determine the lifetime value of clients

Design cross-marketing strategies across products

Evaluate promotional campaigns

Target electronic ads and coupons at user groups based on user access patterns

Predict user behavior based on previously learned rules and users' profiles

Present dynamic information to users based on their interests and profiles

7

Web Usage Mining (Clickstream Analysis)

8

Web Mining Success Stories

Amazon.com, Ask.com, Scholastic.com, …

Website Optimization Ecosystem

9

Web Mining Tools

10

Web Mining in RapidMiner

Web Mining

There are several ways that we can perform Web mining

We can look at web content

We can look at links between webpages

Etc.

Data File

Excel File with one column - LINK

12

The Design

13

Read Excel

Read Excel

You will walk through the Import Configuration Wizard as you have in the past.

14

Get pages

Here we need to set the link attribute to LINK

Accept the default on connection time and read timeout.

We are only reading the top-level webpage

15

Data to Documents

Data to documents – this converts the web pages into documents

Do not “select attributes and weights”

Just let the data from the webpage flow through this operator.

16

Process Documents

Data to documents – this converts the web pages into documents

Process documents

Generate a TF-IDF Matrix

17

Pushing Into Process Documents

Here we insert the “Extract Content” in “Process Documents”

This in essence strips out the html tags

The minimum text block length specifies in “words” the smallest text block to extract (shorter blocks will be discarded)

18

Data to Similarity

Data to Similarity

Here we are using the Numerical Measures and Cosine Similarity
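The TF-IDF and cosine similarity steps can be sketched outside RapidMiner in a few lines of Python. This is an illustration under stated assumptions: the three short "pages" are hypothetical, and the weighting used here (relative term frequency times log(N/df)) is one common TF-IDF variant, so RapidMiner's exact numbers may differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical page texts standing in for the extracted web content.
docs = {
    "news_a": "election news report today",
    "news_b": "news report election results",
    "shop_a": "buy shoes sale today",
}
vectors = tfidf_vectors(docs)
sim_news = cosine_similarity(vectors["news_a"], vectors["news_b"])
sim_mixed = cosine_similarity(vectors["news_a"], vectors["shop_a"])
print(f"news_a vs news_b: {sim_news:.3f}")
print(f"news_a vs shop_a: {sim_mixed:.3f}")
```

The two news pages share more weighted terms, so their similarity comes out higher than the news/shopping pair.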

19

Clustering

We continue using the Numerical Measures and Cosine Similarity

I reduced k in this example to 2

The algorithm will attempt to assign the webpages to two clusters

Results - Similarity

We sorted the Similarity Measure Object (Data to Similarity) output by Similarity

Here, the higher the Similarity value, the more similar the two documents are.

20

Results - Cluster

Cluster Model (clustering) – this shows the distribution of the documents (in this case web pages) across the 2 clusters

Here we see that there are

4 items in cluster 0

3 items in cluster 1

The 4 items in cluster_0 are all of the “news” websites

The 3 items in cluster_1 are the shopping websites

21

QUESTIONS?

[Diagram: the three branches of Web Mining]

Web Structure Mining. Source: the uniform resource locator (URL) links contained in the Web pages

Web Content Mining. Source: unstructured textual content of the Web pages (usually in HTML format)

Web Usage Mining. Source: the detailed description of a Web site’s visits (sequence of clicks by sessions)

[Diagram: extracting knowledge from Web usage data (User / Customer, Website, Weblogs)]

Pre-Process Data: Collecting, Merging, Cleaning, Structuring (identify users, identify sessions, identify page views, identify visits)

Extract Knowledge: Usage patterns, User profiles, Page profiles, Visit profiles, Customer value

Outcomes: How to better the data, How to improve the Web site, How to increase the customer value

[Diagram: Website Optimization Ecosystem. Web Analytics, Voice of Customer and Customer Experience Management turn Customer Interaction on the Web into Analysis of Interactions and Knowledge about the Holistic View of the Customer]


Big Data Analytics Tools./Final Exam materials/BINS 4352 Twitter Sentiment Analysis.pptx

Text Mining - Twitter Sentiment and Association Analysis

Capturing Twitter Tweets in RapidMiner

2

Capturing Twitter Tweets

We are going to capture twitter tweets

Select the following attributes

From-User, Text, Retweet-Count

Save the data to an excel file

3

Notice that nothing is connected to the “inp” or the “res” ports of the design pane.

We simply have

Search Twitter

Select Attributes

Write Excel

Search Twitter

4

You will need to establish a Twitter connection

If you haven’t established a connection, you can request one from within RapidMiner

You will also need to specify the query

The query can be a hashtag, text string or a list of strings and/or hashtags

You can specify the result types you want to see

Limit – allows you to specify how many tweets you want to capture

Additional filters are available

Select Attributes

Here we filter out everything except the attributes we want to keep.

On the parameters pane

Select “Subset” as the attribute filter type

Under select attributes – select only the “from user”, “text”, and “retweet count”

Check “include special attributes” – this will filter out the RM generated ID

5

Write to Excel

Write Excel operator

Select the path and file name where you want to save the file

Select the file format

Specify the sheet name (RapidMiner Data – is the default)

Set any additional attributes

Run the design…

6

Results

7

Sentiment Analysis Aylien / Rosette

8

Sentiment Analysis Using a Third Party Module

Download and install the Aylien Text Analytics plugin

This will give you some basic text analytics capabilities

Note: they do sell a subscription – but, there is a “no cost” starter version that you can try to get a feel for the product.

Optional: Download and Install Rosette Text Analytics

9

Sentiment Analysis Using a Third Party Module

You will need to Register and get a key for both models (Aylien and Rosette)

They may ask for email validation – but that is fairly straightforward

Aylien, in particular, also offers an API that can be used from Python, C#, and other environments.

10

Setting it up in RapidMiner

11

For Aylien – you have to establish the connection and set the analysis type to tweet

For Rosette – you have to specify the field where the tweet text is and check a box that instructs the operator to include the sentiment measure with the output.

Results

12

Association Analysis

13

Association Analysis

When we do association analysis on text or web data we will find the words that often occur together

This is very similar to what we did with the grocery store receipts

We found what products/goods often were sold together

14

The Design

The design looks very similar to what we had before when we did association analysis

We do have to modify “Process Documents” and add an operator to convert numerical data to binary data

15

Process Documents

Process documents – now we need to create the vectors using the Binary Term Occurrences

You will remember from the Association Analysis we did earlier that we coded the data with 0/1, indicating whether or not a product existed in the shopping cart

Similarly, here the Binary Term Occurrences option will code each word as 0 (does not exist in the document) or 1 (exists in the document)

16

Numerical to Binomial

We add the Numerical to Binomial operator – to convert the 0/1 “numbers” in the word vector to “binomial” values.

We do this simply because that is what is needed as input to FP-Growth

Lower cutoff on FP-Growth to ensure more data makes it through
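The idea behind the design can be sketched in plain Python: code each tweet as a set of words (binary term occurrence), then compute support and confidence for word pairs. The four tweets are hypothetical, and the brute-force pair search below is a stand-in for illustration, not the FP-Growth algorithm that RapidMiner uses.

```python
from itertools import combinations


def binary_term_occurrences(documents):
    """Code each document as the set of words it contains (present / not present)."""
    return [set(doc.lower().split()) for doc in documents]


def support(itemset, transactions):
    """Fraction of documents that contain every word in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """All word pairs whose support is at least min_support (brute force)."""
    vocabulary = sorted(set().union(*transactions))
    pairs = {}
    for a, b in combinations(vocabulary, 2):
        s = support({a, b}, transactions)
        if s >= min_support:
            pairs[(a, b)] = s
    return pairs


# Hypothetical tweets returned by a search on "convention".
tweets = [
    "convention speech tonight",
    "convention speech crowd",
    "convention crowd",
    "weather tonight",
]
transactions = binary_term_occurrences(tweets)
pairs_high_cutoff = frequent_pairs(transactions, min_support=0.5)
pairs_low_cutoff = frequent_pairs(transactions, min_support=0.25)
conf_speech_to_convention = confidence({"speech"}, {"convention"}, transactions)
print(pairs_high_cutoff)
print(len(pairs_low_cutoff), "pairs survive the lower cutoff")
print("confidence(speech -> convention) =", conf_speech_to_convention)
```

Lowering the minimum support from 0.5 to 0.25 lets more word pairs through, which is the effect of lowering the cutoff on FP-Growth.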

17

Results – Searched On “Convention”

Sorted by highest support

18

Results – searched on “Convention”

Sorted by confidence

19

Graph – showing rules linking terms

20

Questions


Big Data Analytics Tools./Final Exam materials/3 BINS 4352 - Sampling - Subsampling.pptx

Sampling and Subsampling

Sampling

In Statistics – sampling is the selection of a subset of individuals from a population

The goal is that the sample provides an accurate representation of the population as a whole

There are a variety of sampling methods

Simple Random Sampling

Convenience (accidental) Sampling

Stratified Sampling

Snowball Sampling

Systematic Sampling

Etc.

Sampling Methods

Simple Random Sample

Subset of a statistical population in which each member of the population has an equal probability of being selected

Strengths

Ease of Use

Meant to be unbiased representation since everyone has an equal opportunity to participate

Weaknesses

Sampling error can occur in that it may not accurately represent the population

Sampling Methods (cont)

Convenience (accidental) Sampling

You simply include individuals that are easy to reach

You may select people from your workplace, school, club, local mall, etc.

It is, in essence, the opposite of random sampling

Strengths

Easy to get a sample

Inexpensive

Participants are readily available

Weaknesses

May not represent the target population

Results biased based on who participates and who doesn’t

Sampling Methods (cont)

Snowball Sampling

Sampling technique where study subjects recruit future subjects from people they may know

This technique is used when working with hidden, difficult or hard-to-reach populations

Strengths

Locating hidden populations

Locating people from a specific population / hard to reach

Weaknesses

Community bias – the first participants have a large impact on the sample

Anchoring – the lack of definite knowledge as to whether or not the sample reflects the target population

Sampling Methods (cont)

Stratified Sampling

Process of dividing members of the population into homogeneous subgroups prior to sampling

The “strata” should be mutually exclusive and collectively exhaustive

Strengths

Can result in smaller estimation errors – if the within strata standard deviations are low

Often this technique is easier to manage

More accurately represent the makeup of the target population

Weaknesses

Cannot be used unless the population can be divided into disjoint subgroups

Can lead to Simpson’s paradox – where a trend that appears within each group disappears or reverses when the groups are combined

Home Equity Loan Dataset

The Home Equity Loan Process

An applicant comes forward with a specific property and a reason for the loan (Home-Improvement, Debt-Consolidation)

Background info related to job and credit history is collected

The loan gets approved or rejected

Upon approval, the Applicant becomes a Customer

Information related to how the loan is serviced is maintained, including the Status of the loan (Current, Delinquent, Defaulted, Paid-Off)

8

HMEQ Data Set

9

Name     Model Role  Measurement Level  Description
BAD      Target      Binary             1=defaulted on loan, 0=paid back loan
REASON   Input       Binary             HomeImp=home improvement, DebtCon=debt consolidation
JOB      Input       Nominal            Six occupational categories
LOAN     Input       Interval           Amount of loan request
MORTDUE  Input       Interval           Amount due on existing mortgage
VALUE    Input       Interval           Value of current property
DEBTINC  Input       Interval           Debt-to-income ratio
YOJ      Input       Interval           Years at present job
DEROG    Input       Interval           Number of major derogatory reports
CLNO     Input       Interval           Number of trade lines
DELINQ   Input       Interval           Number of delinquent trade lines
CLAGE    Input       Interval           Age of oldest trade line in months
NINQ     Input       Interval           Number of recent credit inquiries

I added a variable – Default

The values of this variable are GOOD and BAD

I did this because it makes it easier to interpret the results

HMEQ – Modeling Goal(s)

The credit scoring model should compute the probability of a given loan applicant to default on loan repayment.

A threshold is to be selected such that all applicants whose probability of default is in excess of the threshold are recommended for rejection.
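The threshold rule can be written as a one-line decision function. The applicant IDs, the predicted probabilities and the 0.30 cutoff below are all hypothetical values used only to show the mechanics.

```python
def recommend(probability_of_default, threshold):
    """Recommend rejection when the predicted default probability exceeds the threshold."""
    return "reject" if probability_of_default > threshold else "approve"


# Hypothetical scored applicants: applicant id -> predicted probability that BAD = 1.
scored = {"A-101": 0.08, "A-102": 0.35, "A-103": 0.62}
threshold = 0.30  # hypothetical cutoff

decisions = {applicant: recommend(p, threshold) for applicant, p in scored.items()}
print(decisions)
```

Raising the threshold approves more applicants (and accepts more defaults); lowering it rejects more applicants (and turns away more good loans).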

10

Sub-sampling

Review – Home Equity

I ran the Home Equity example

I left the classification weights as equal

I reduced the minimum number of cases per node to 50

I turned on V-fold cross-validation (10)

I specified 2 surrogates

Results indicate that the model is still biased to predict “GOOD” (88.13%)

Why is this the case??

Distribution of Dependent Variable

Go to “Graph” and select “Histogram”

Select the “Default” variable

Over 4x as many of the cases are “GOOD” loans

This means if the tool always predicts “GOOD”, then it is correct about 80% of the time!

So – our data set is biasing the model to predict “GOOD”

Stratified Sub-Sample in Statistica

Generate a New Sample

Go to the “Data” tab

Select Sampling

Select a Strata Variable

Go to the Stratified Sampling Tab

Select “Strata Variable”

Select DEFAULT

Show the Codes and Counts

Select “Codes”

Specify “ALL” hit “OK”

Select “Count N”

This will show you the codes (values) for the variable DEFAULT and the number of cases with each value

We Are Going to Specify Counts of Cases

Go to the Options tab

Select “Calculate based on count of cases”

Go back to the Stratified Sampling Tab

Specify The Count from Each Strata Group

I am going to Specify 900 from each Strata Group

To do this just click under the “N” and type the number

Select “OK” and a new sub-sample will be generated
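The same balancing step can be sketched in Python: group the cases by the strata variable and draw a fixed count from each group. The data below is hypothetical (4,000 GOOD and 1,000 BAD cases, chosen to give a 4:1 ratio), and unlike Statistica, which returns approximately the requested count, this sketch draws exactly 900 per stratum.

```python
import random
from collections import Counter, defaultdict


def majority_baseline(rows, key):
    """Accuracy obtained by always predicting the most common value of `key`."""
    counts = Counter(row[key] for row in rows)
    return max(counts.values()) / len(rows)


def stratified_subsample(rows, strata_key, n_per_stratum, seed=0):
    """Draw up to n_per_stratum cases at random from each value of strata_key."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata_key]].append(row)
    sample = []
    for members in groups.values():
        k = min(n_per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data set with a 4:1 ratio of GOOD to BAD loans.
rows = [{"DEFAULT": "GOOD"} for _ in range(4000)] + [{"DEFAULT": "BAD"} for _ in range(1000)]

baseline_before = majority_baseline(rows, "DEFAULT")
subsample = stratified_subsample(rows, "DEFAULT", n_per_stratum=900)
counts_after = Counter(row["DEFAULT"] for row in subsample)
baseline_after = majority_baseline(subsample, "DEFAULT")

print("Always-GOOD accuracy before:", baseline_before)
print("Counts after sub-sampling:", dict(counts_after))
print("Always-GOOD accuracy after:", baseline_after)
```

After sub-sampling, always predicting “GOOD” is right only half the time, so the model has to learn from the predictors.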

New Sub-Sample

The new sub-sample has “approximately” 1800 cases

Generating a new Histogram for “DEFAULT” will show that the new data set has a more even distribution of values

Now – it will be harder for the “tool” to be “correct” when predicting the dependent variable value

Re-run The Model (New Sample)

Re-run the Home Equity example

I left the classification weights as equal

I reduced the minimum number of cases per node to 50

I turned on V-fold cross-validation (10)

I specified 2 surrogates

Run prior to Stratified Sampling

After creating the Stratified Sample

QUESTIONS?


Big Data Analytics Tools./Final Exam materials/Neural Networks.pptx

Neural Networks

Neural Network Models

Often regarded as a mysterious and powerful predictive modeling technique.

A nonparametric modeling tool that mirrors biological neurons to “learn” from the data

They are nonlinear by nature

Can overcome the curse of dimensionality

The usual link for the derived input’s model is inverse hyperbolic tangent, a shift and rescaling of the logit function (discussed later)

Ability to approximate virtually any continuous association between the inputs and the target

You simply need to specify the correct number of derived inputs

2

What Are Neural Networks

Neural networks learn by example

Model “learns” the structure of the data from a representative sample

The user needs to have some

Heuristic knowledge of how to select/prepare data

How to select an appropriate neural network

How to interpret the results

But, user knowledge for successfully building neural networks is lower than it might be for other techniques

Uses a series of weights and hidden neurons to detect complex relationships

Can perform well in the presence of complicated, noisy, and/or imprecise data

Appropriate for – classification, regression, time series analysis and clustering

3

Artificial Intelligence and Data Mining

Neural networks are useful for data mining and decision-support applications.

People are good at generalizing from experience.

Computers excel at following explicit instructions over and over.

Neural networks bridge this gap by modeling, on a computer, the neural behavior of human brains.

4

Anatomy of a Neural Network

Neural Networks map a set of input-nodes to a set of output-nodes

Number of inputs/outputs is variable

The Network itself is composed of an arbitrary number of nodes with an arbitrary topology

5

Input Layer, Hidden Layer, Output Layer

Multi-layer perceptron models were originally inspired by neurophysiology and the interconnections between neurons.

The basic model form arranges neurons in layers.

The input layer connects to a layer of neurons called a hidden layer, which, in turn, connects to a final layer called the target, or output, layer.

In reality there can be multiple levels of hidden layers

The structure of a multi-layer perceptron lends itself to a graphical representation called a network diagram.

6

[Network diagram: inputs x1 and x2, hidden units H1, H2 and H3, output p]

A Simple Perceptron

Binary logic application

fH(x) [linear threshold]

Wi = random(-1,1)

Y = µ(W0X0 + W1X1 + Wb)

7

Perceptron Training

Adjust weights based on how well the current weights match an objective

Perceptron Learning Rule

ΔWi = η * (D - Y) * Ii

η = Learning Rate

D = Desired Output
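A minimal sketch of the perceptron learning rule on a binary logic problem (the AND function) follows. The weights start at random values in (-1, 1) as on the earlier slide; the learning rate of 0.1, the random seed, the epoch limit and the choice of AND as the target are assumptions made for this illustration.

```python
import random


def step(s):
    """Linear threshold function: fire (1) when the weighted sum is non-negative."""
    return 1 if s >= 0 else 0


def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(samples, eta=0.1, max_epochs=1000, seed=42):
    """Apply delta W_i = eta * (D - Y) * I_i until an epoch produces no errors."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    weights = [rng.uniform(-1, 1) for _ in range(n_inputs)]
    bias = rng.uniform(-1, 1)  # bias weight Wb; its input is fixed at 1
    for _ in range(max_epochs):
        errors = 0
        for inputs, desired in samples:
            delta = desired - predict(weights, bias, inputs)  # (D - Y)
            if delta != 0:
                errors += 1
                weights = [w + eta * delta * x for w, x in zip(weights, inputs)]
                bias += eta * delta
        if errors == 0:
            break  # every training case is classified correctly
    return weights, bias


# Truth table for logical AND: ((Input 0, Input 1), Desired Output).
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_gate]
print("Weights:", weights, "Bias:", bias)
print("Predictions:", predictions)
```

AND is linearly separable, so a single perceptron can learn it; a function such as XOR is not, and needs a hidden layer.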

8

Neural Network Learning

From experience: examples / training data

Strength of connection between the neurons is stored as a weight-value for the specific connection

Learning the solution to a problem = changing the connection weights

9

Neural Network Learning

Continuous Learning Process

Evaluate output

Adapt weights

Take new inputs

Learning causes stable state of the weights

10

Evaluate

Outputs

Adapt weights

Take new inputs

Generate Outputs

Learning performance

Supervised

Need to be trained ahead of time with lots of data

Unsupervised networks adapt to the input

Applications in Clustering and reducing dimensionality

Learning may be very slow

No help from the outside

No training data, no information available on the desired output

Learning by doing

Used to pick out structure in the input:

Clustering

Compression

11

Advantages / Disadvantages

Advantages

Adapt to unknown situations

Robustness: fault tolerance due to network redundancy

Autonomous learning and generalization

Disadvantages

Not exact

Large complexity of the network structure

Difficult to explain / interpret

12

Mathematics

13

Neural Network Model

14

Neural network diagram

15

[Diagram labels: Input Layer, Hidden Layers (made up of Hidden Units), Output Layer]

Neural Networks As A Universal Approximator

16

6+A-2B+3C

A

B

C

An Example

17

INPUT: AGE, INCOME (INC)

HIDDEN: each hidden unit applies a COMBINATION followed by an ACTIVATION

$A = \tanh(\beta_1 + \beta_2\,\mathrm{AGE} + \beta_3\,\mathrm{INC})$

$B = \tanh(\beta_4 + \beta_5\,\mathrm{AGE} + \beta_6\,\mathrm{INC})$

$C = \tanh(\beta_7 + \beta_8\,\mathrm{AGE} + \beta_9\,\mathrm{INC})$

OUTPUT: COMBINATION

$\beta_{10} + \beta_{11} A + \beta_{12} B + \beta_{13} C$ -> RESPONSE TO PROMOTION
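The example network (inputs AGE and INC, three tanh hidden units A, B and C, and one output combination) can be evaluated with a short forward pass. All thirteen weights below are hypothetical; the output weights 6, 1, -2, 3 are borrowed from the 6+A-2B+3C expression on the earlier slide, and the inputs are assumed to be already scaled to a small range.

```python
import math


def forward(age, inc, beta):
    """Forward pass of the 2-input, 3-hidden-unit, 1-output example network.

    beta maps the index 1..13 to a weight, following the numbering on the slide.
    """
    # Hidden layer: combination (weighted sum) followed by tanh activation.
    a = math.tanh(beta[1] + beta[2] * age + beta[3] * inc)
    b = math.tanh(beta[4] + beta[5] * age + beta[6] * inc)
    c = math.tanh(beta[7] + beta[8] * age + beta[9] * inc)
    # Output layer: combination of the hidden unit outputs.
    return beta[10] + beta[11] * a + beta[12] * b + beta[13] * c


# Hypothetical weights, for illustration only.
beta = {
    1: 0.2, 2: 1.5, 3: -0.7,
    4: -0.4, 5: 0.3, 6: 0.9,
    7: 0.1, 8: -1.1, 9: 0.5,
    10: 6.0, 11: 1.0, 12: -2.0, 13: 3.0,
}

# Hypothetical, already-scaled inputs.
response = forward(age=0.4, inc=0.6, beta=beta)
print(f"Network output: {response:.4f}")
```

Because each tanh output lies between -1 and 1, the output with these weights must fall between 0 and 12.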

Objective function

18

Neural network training

19

Neural network training

20

Convergence

Training concludes when small changes in the parameter values no longer decrease the value of the objective function.

The network is said to have reached a local minimum in the objective.

21

Neural network training

22

Overgeneralization

A small value for the objective function, when calculated on training data, need not imply a small value for the function on validation data.

Typically, improvement on the objective function is observed on both the training and the validation data over the first few iterations of the training process.

At convergence, however, the model is likely to be highly overgeneralized and the values of the objective function computed on training and validation data may be quite different.

23

Final model

To compensate for overgeneralization, the overall average profit, computed on validation data, is examined.

The final parameter estimates for the model are taken from the training iteration with the maximum validation profit.
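Choosing the final model this way amounts to picking the training iteration with the best validation result. In the sketch below the per-iteration numbers are hypothetical: the training objective keeps falling, while the validation profit peaks and then declines.

```python
def select_final_iteration(validation_profit):
    """Index of the training iteration with the maximum validation profit."""
    return max(range(len(validation_profit)), key=validation_profit.__getitem__)


# Hypothetical values recorded after each training iteration.
training_objective = [0.90, 0.70, 0.55, 0.45, 0.40, 0.37, 0.35]  # keeps improving
validation_profit = [1.10, 1.35, 1.52, 1.60, 1.58, 1.49, 1.41]   # peaks, then falls

best_iteration = select_final_iteration(validation_profit)
print("Use the parameter estimates from iteration", best_iteration)
print("Validation profit there:", validation_profit[best_iteration])
```

The parameter estimates saved at that iteration become the final model, even though later iterations fit the training data more closely.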

24

Neural network final model

25

Statistica

26

Statistica Automated Neural Networks (SANN)

It is easy to use for those new to neural network analysis

But, also offers a wide-range of customization options

SANN first transforms the data using a linear function scaling the data to (0,1)

This is even true if the original data is on the scale of millions

This is done because the neural network functions are sensitive to values in a very limited range

This step is hidden from the user
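A linear rescaling of this kind can be sketched with min-max scaling. The slides do not say exactly which linear function SANN applies, so treat the formula below as an assumption; the three values in the millions are hypothetical.

```python
def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical variable measured on the scale of millions.
revenue = [1_000_000, 2_500_000, 4_000_000]
scaled = min_max_scale(revenue)
print(scaled)
```

The ordering and relative spacing of the values are preserved; only the scale changes.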

27

Statistica Automated Neural Networks (SANN)

Statistica offers both

Multilayer Perceptron networks (MLP)

Radial Basis function networks (RBF)

Both can be explored during training

Predictions can then be made by individual networks or from ensembles (collection of networks)

If several networks have high performance, then an ensemble of those networks can often improve accuracy (due to generalization).

28

Network Types

Multilayer perceptron

Allows the selection of activation functions

Identity, logistic, hyperbolic tangent, negative exponential and sine

These are used for both hidden and output neurons

Offers multiple training algorithms

Gradient descent, conjugate gradient descent, BFGS

Radial basis function

Model the relationship between inputs and targets in 2 phases

First, the probability distribution function is approximated

Then the relationship between inputs and the targets is trained

29

Analysis Options

30

ANS explores both MLP and RBF

Have the option of the number of networks to train

A large number of networks allows the tool to explore a wide variety of options

Comparing network complexities

This is the tool to use when all the problem aspects are not well known

Custom Neural Network (CNN) is appropriate when the problem is well studied and understood

Subsampling – explores the data looking at multiple subsets of the data and can yield good generalization results

Automated Network Search

Using ANS you will have the options

How many networks to train and retain

This can be anywhere from 1 to 100,000’s

The choice of network type and number of hidden units impact the complexity

Weight decay is used when overfitting is a concern

31

Neural Network Output

The summary output shows information for all of the networks retained (network type, neurons, activations, etc.)

Predictions and residuals can be output for all models and ensembles

Sensitivity analysis shows which variables contributed to each network

32

Load The Beverage Data From The Example Data Directory

33

Neural Networks

These can be chosen from either the Statistics or the Data Mining tabs

34

Neural Network Regression Model

Notice that we can perform a variety of analysis (regression, classification, time series, and cluster)

For this exercise we will do regression

SELECT OK

35

SANN Data Selection

Similar to the other analyses we ran we first need to select our variables

36

Variable Selection

First, I select my continuous target (dependent variable) – PC Volume

Next, I select my continuous predictor variables – you can select [predictor variables] at the top and it will select a pre-defined set of predictor variables.

37

Automated Network Search

We are going to use an Automated Network Search

We want to explore both MLP and RBF networks

We want to explore a variety of network complexities

38

ANS – Sampling Tab

We will keep the defaults

Notice that we are dividing the data into 3 parts

Training 70% of the data

Test 15% of the data

Validation 15% of the data

The training and test data sets will be used to train the model

The validation will be “held out” (not used during training) and used to see how good the final model is on data not used during training
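The three-way split can be sketched as a shuffle followed by slicing. The 70/15/15 proportions come from the defaults above; the 200-case data set, the random seed and the helper name `partition` are hypothetical.

```python
import random


def partition(rows, train=0.70, test=0.15, seed=0):
    """Shuffle the cases and split them into training, test and validation samples."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    train_rows = shuffled[:n_train]
    test_rows = shuffled[n_train:n_train + n_test]
    validation_rows = shuffled[n_train + n_test:]  # held out until training is done
    return train_rows, test_rows, validation_rows


# Hypothetical data set of 200 case numbers.
cases = list(range(200))
train_rows, test_rows, validation_rows = partition(cases)
print(len(train_rows), len(test_rows), len(validation_rows))
```

Every case lands in exactly one sample, so the validation cases never influence training.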

SELECT OK

39

SANN – ANS

We will explore both MLP and RBF network types

We want to explore a variety of number of hidden units

We increase the training networks to 200 – exploring a wide range

We only retain the 5 best networks

40

Weight Decay

Go to the Weight decay tab

Select – use weight decay for both the hidden and output layers

This will help to protect against overfitting

SELECT TRAIN

41

Training the Model

Statistica will start generating models using a variety of network types and activation functions

The notation 12-x-1 displayed means

12 – the number of inputs

x – the number of hidden units

1 – the number of targets

In addition, you can see the training and test error as it runs

42

Results Dialog

Select the Summary output

43

Summary Output

The summary output shows which networks were retained

3 were MLP’s and 2 were RBF’s and the number in the middle shows the number of hidden units

Next you have training, test and validation performance and also error

You can also see what training algorithm was used

Also, the hidden and output activation functions are shown for each model

44

Predictions Spreadsheet

Now we can look at the predictions for both standalone and ensembles

For the training, test and validation samples

Select target and outputs – this way we will get the observed and predicted values from the networks

SELECT PREDICTIONS

45

Predictions (Target Vs Output) For Each Model And The Ensemble

46

This shows the sample that the value came from (validation, train, test)

It also shows the target value and all of the predicted values by the different models and the ensemble

Graph Of Target Vs Output

On the Graphs tab

Look at target vs output

(X Y Plot)

Go to Select\Deselect active Networks to limit it to a single network

47

Graphs

We can look at a variety of plots.

I have selected one of the models in order for it to be easier to read.

X and Y plot of Target versus Output

Histogram of residuals

48

Sensitivity Analysis

49

Sensitivity analysis shows how strongly certain variables impacted this particular network

The variables are sorted by how much they influenced the model

Hyd Pressure3 had the most influence

Next Alch Rel, then Bowl setpoint and so on

Custom Predictions

This is used if we wanted to make a prediction for a case that is not in the original dataset.

To deploy the network you would select Save Networks

50

Questions?

[Diagram: a Neural Network block mapping Input 0, Input 1, ..., Input n to Output 0, Output 1, ..., Output m]

[Diagram: simple perceptron. Input 0 and Input 1 are weighted by W0 and W1, summed (+) together with the bias weight Wb, and passed through fH(x) to produce the Output]

$\tanh(u) = \dfrac{\sinh(u)}{\cosh(u)} = \dfrac{e^{u} - e^{-u}}{e^{u} + e^{-u}}$

[Scatter plots: PC Volume (Target) vs. PC Volume (Output). X axis: PC Volume (Target); Y axis: PC Volume (Output). Samples: Train in the first plot; Train, Test, Validation, Missing in the remaining plots. Networks shown: 1.MLP 12-11-1, 2.MLP 12-11-1, 3.MLP 12-6-1, 4.MLP 12-11-1, 5.MLP 12-5-1]


Big Data Analytics Tools./Final Exam materials/1 BINS 4352- Text Mining - updated for version 9.pptx

Text Mining

Learning Objectives

Understand how Text Mining works and the challenges facing data scientists when trying to work with Text data

Understand the process associated with working with text data

Process text data using RapidMiner

Text Mining

Text Mining: Application of Information Retrieval and Data Mining techniques that accommodate text as an input variable in knowledge discovery or predictive modeling - or -

3

Text -> A Miracle Occurs -> Numbers

Data Mining Versus Text Mining

Both seek novel and useful patterns

Both are semi-automated processes

Difference is the nature of the data:

Structured versus unstructured data

Structured data: in databases

Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on

Text mining – first, impose structure on the data, then mine the structured data.

4

Text Mining Concepts

85-90 percent of all corporate data is in some kind of unstructured form (e.g., text)

Unstructured corporate data is doubling in size every 18 months

Tapping into these information sources is not an option, but a need to stay competitive

Answer: text mining

A semi-automated process of extracting knowledge from unstructured data sources

a.k.a. text data mining or knowledge discovery in textual databases

5

Text Mining Concepts

Benefits of text mining are obvious, especially in text-rich data environments

e.g., law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), marketing (customer comments), etc.

Electronic communication records (e.g., Email)

Spam filtering

Email prioritization and categorization

Automatic response generation

6

Text Mining Terminology

Unstructured or semi-structured data

Corpus (and corpora)

Terms

Concepts

Stemming

Stop words (and include words)

Synonyms (and polysemes)

Tokenizing

Text Mining Terminology

Term dictionary

Word frequency

Part-of-speech tagging

Morphology

Term-by-document matrix

Occurrence matrix

Singular value decomposition

Latent semantic indexing

Natural Language Processing (NLP)

Structuring a collection of text

Old approach: bag-of-words

New approach: natural language processing

NLP is …

a very important concept in text mining

a subfield of artificial intelligence and computational linguistics

the study of "understanding" natural human language

Syntax versus semantics-based text mining

Natural Language Processing (NLP)

What is “Understanding”?

Humans understand; what about computers?

Natural language is vague, context driven

True understanding requires extensive knowledge of a topic

Can/will computers ever understand natural language the same/accurate way we do?

Natural Language Processing (NLP)

Challenges in NLP

Part-of-speech tagging

Text segmentation

Word sense disambiguation

Syntax ambiguity

Imperfect or irregular input

Speech acts

Dream of AI community

to have algorithms that are capable of automatically reading and obtaining knowledge from text

I made her duck!

She bent over at the waist to keep from hitting her head.

I cooked duck for dinner.

I made the wooden duck figure that she has on her desk.

“Duck” is an example of a polyseme

Punctuation

Natural Language Processing (NLP)

WordNet

A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets.

A major resource for NLP.

Need automation to be completed.

Sentiment Analysis

A technique used to detect favorable and unfavorable opinions toward specific products and services

SentiWordNet

NLP Task Categories

Information retrieval, information extraction

Named-entity recognition

Question answering

Automatic summarization

Natural language generation & understanding

Machine translation

Foreign language reading & writing

Speech recognition

Text proofing, optical character recognition

Difficulties In Text Quantification

Abstract Concepts are difficult to quantify

Synonymy: Multiple synonyms create multiple text representations of the same concept

Polysemy: One term can be related to multiple concepts, depending on the context

The curse of dimensionality: Text representations result in high dimensionality (tens of thousands of dimensions)

Applications Of Text Mining

Automotive early warning system:

Text mining for warranty analysis

Medical information management:

TextWise Labs uses sophisticated text mining methodology to extract medical information from disparate data sources on the Internet.

Computer Science Innovations Inc. is developing an application for the National Cancer Institute that automatically converts medical records into XML data.

Insurance companies employ Special Investigative Units (SIU) to investigate claims for fraud.

Data mining methods can be employed to automate the process of referral.

Text mining methods are applied to claims examiner notes, physician reports, and other textual data to enhance predictive accuracy.

Interest In Text Mining

A 2011 Report from The Data Warehouse Institute identifies the analysis of unstructured text as an area of high potential growth

Text Mining is now the 5th most frequently used algorithm

Text Mining is now ahead of Factor Analysis, Association Rules, and Neural Nets!

Information Retrieval

Many Methods For Document Classifiers

Inductive Learning

Probabilistic and Decision Tree classifiers

Support Vector Machines

Use of Taxonomies

K-Nearest Neighbor

Text categorization using supervised learning

ETC.

Vector Space Model (VSM)

Introduced by Gerard Salton in 1968. Used to determine document similarity.

Assigns weights to index terms in a document

Uses Inverse Document Frequency (IDF) weights or binary weights

Every document is represented as a sum vector of its index terms

Cosine of angle between vectors determines relevance
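
The cosine measure can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: it assumes raw term counts as weights (rather than IDF weights), whitespace tokenization, and made-up example strings.

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between two raw term-count vectors."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Dot product only needs the terms the two documents share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical query and documents, for illustration only.
query = "data warehouse design"
doc_1 = "data warehouse design and data marts"
doc_2 = "stemming and stop words"

sim_1 = cosine_similarity(query, doc_1)
sim_2 = cosine_similarity(query, doc_2)
print(round(sim_1, 3), round(sim_2, 3))
```

A value near 1 means the two vectors point in almost the same direction (high relevance); a value of 0 means the documents share no terms.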

K-means Document Clustering

Document Clustering is made possible through calculations of VSM-based document similarities

Latent Semantic Indexing (LSI)

Method of automatic indexing and retrieval

Reveals the essence of a text by discarding surface elements.

Utilizes the Singular Value Decomposition (SVD) algorithm to reduce space dimensionality

Builds a semantic space in which similar words and documents are next to each other

Modern Text Mining packages, such as SAS® Text Miner™, Megaputer® Text Analyst™, and IBM® Content Analyst™, provide LSI (also called latent semantic analysis, LSA) capabilities.

Reducing Dimensionality

Term Filtering

In order to reduce the term space dimensionality, unique terms (terms appearing only once in the entire collection of documents) are excluded.

Common English words (typically 400-600 stopwords) are excluded as well.
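
A rough sketch of these two filters in Python follows. The stop list here is a tiny stand-in (an assumption for illustration; real lists hold several hundred words), and the documents are invented.

```python
from collections import Counter

# Tiny illustrative stop list; real lists typically hold several hundred words.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def filter_terms(documents):
    """Drop stop words, then drop terms that occur only once in the whole collection."""
    tokenized = [[w for w in doc.lower().split() if w not in STOPWORDS]
                 for doc in documents]
    totals = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if totals[w] > 1] for doc in tokenized]

# Hypothetical warranty notes.
docs = [
    "the warranty claim is in the queue",
    "a warranty claim of the engine",
    "engine noise and a warranty note",
]
filtered = filter_terms(docs)
print(filtered)
```

Terms such as "queue", "noise", and "note" appear only once across the collection, so they are removed along with the stop words.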

stopword removal

unique term removal

Term Stemming

In order to further reduce the term space dimensionality, term suffixes are removed.

BIG: BIG, BIGGER, BIGGEST

REACH: REACH, REACHES, REACHED, REACHING

WORK: WORK, WORKS, WORKED, WORKING

Term stemming

Term Frequency Matrix

The raw term frequencies are weighted using one of the available transformation functions – e.g., TF-IDF, Entropy, Log-Entropy, Normal, Chi-Square, etc.
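
As an illustration of one such transformation, the sketch below computes a basic TF-IDF weighting. Several TF-IDF variants exist; this one assumes weight = term frequency × ln(N / document frequency), whitespace tokenization, and invented documents.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    weighted = []
    for doc in tokenized:
        counts = Counter(doc)
        weighted.append({t: c * math.log(n_docs / doc_freq[t])
                         for t, c in counts.items()})
    return weighted

# Hypothetical mini-corpus.
docs = ["risk risk project", "project software", "project development risk"]
weights = tf_idf(docs)
for row in weights:
    print(row)
```

Note that "project" appears in every document, so its weight drops to zero: a term found everywhere does not help tell documents apart.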

[Figure: a documents-by-terms matrix of raw term frequencies is transformed into a matrix of weighted frequencies.]

Text Mining Process

The three-step text mining process

Step 1: Establishing the corpus

Collect all relevant unstructured data (e.g., textual documents, XML files, emails, Web pages, short notes, voice recordings…)

Digitize, standardize the collection (e.g., all in ASCII text files)

Place the collection in a common place (e.g., in a flat file, or in a directory as separate files)

Step 2: Term-by-Document Matrix Creation

Step 2: Term-by-Document Matrix Creation

Should all terms be included?

Stop words, include words

Synonyms, homonyms

Stemming

What is the best representation of the indices (values in cells)?

Raw counts; binary frequencies; log frequencies;

Inverse document frequency
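
The choices of cell value listed above can be compared on a toy matrix. This is a sketch under stated assumptions: whitespace tokenization, rows as documents, and "log frequency" taken to mean 1 + ln(count) for non-zero counts, which is one common convention rather than the only one.

```python
import math
from collections import Counter

def term_document_matrix(documents, representation="raw"):
    """Build a documents x terms matrix as a list of rows, plus the sorted term list."""
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({t for doc in tokenized for t in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        if representation == "raw":
            row = [counts[t] for t in terms]
        elif representation == "binary":
            row = [1 if counts[t] > 0 else 0 for t in terms]
        elif representation == "log":
            row = [1 + math.log(counts[t]) if counts[t] > 0 else 0.0 for t in terms]
        else:
            raise ValueError(f"unknown representation: {representation}")
        rows.append(row)
    return terms, rows

# Hypothetical two-document corpus.
docs = ["risk risk project", "project software"]
terms, raw_rows = term_document_matrix(docs, "raw")
_, binary_rows = term_document_matrix(docs, "binary")
_, log_rows = term_document_matrix(docs, "log")
print(terms, raw_rows, binary_rows, log_rows)
```

The log representation dampens the effect of a term that is repeated many times within one document.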

Step 2: Term-by-Document Matrix Creation

TDM is a sparse matrix. How can we reduce the dimensionality of the TDM?

Manual - a domain expert goes through it

Eliminate terms with very few occurrences in very few documents (?)

Transform the matrix using singular value decomposition (SVD)

SVD is similar to principal component analysis

Step 3: Extract Patterns / Knowledge

Classification (text categorization)

Clustering (natural groupings of text)

Improve search recall

Improve search precision

Scatter/gather

Query-specific clustering

Association

Trend Analysis (…)

Document Preparation In RapidMiner

Loading Documents

There Are Multiple Ways to Load Documents

From Excel

One document per line

From Text Files

From Directory of Text Files

Etc.

Mark Twain From Text File

Mark Twain Example

File MarkTwain.txt contains the complete work of Mark Twain.

First thing we need to do is load the document into RapidMiner

Locate and place the “Read Document” operator in the design pane

Connect the output to the “res” port

Do not connect anything to the “fil” port

Set the parameters on the “Read Document” operator

File name: MarkTwain.txt

Check – extract text only

Check – use file extension as type

Encoding – SYSTEM (this is the default)

Run the model

Text Files From Directories

Directory of Interviews

Place the “Process Documents from Files” operator

Specify the directory and the File Pattern

Documents in an Excel File

Read the Excel File

Walk Through Data Import

As you walk through the process (wizard), you will notice that the Excel file contains 2 columns

Author

Text

Both Author and Text are type “Polynominal”

Previous versions of RM allowed you to change this to “Text”

In the current version this takes an additional step

Change in New Version of RapidMiner

Insert the “Nominal to Text” Operator

You would select the operator

Select Subset

Pick the variables you want to transform to “TEXT”

Data to Documents

Select attributes and weights

Specify weights – “Text” is given a weight of 1

Results

Processing Documents

Results of reading the document

This will take a minute or two to run. It is 15M of text!

You should see something like this:

Processing the document

Place a “Process Documents” operator into the design pane.

Notice the “layered look” of the “Process Documents” operator

This means we need to push into it to provide detail

Double “click” to push inside the Process Documents operator

Tokenize the document

Next we want to turn the document into a collection of individual tokens.

You can consider this as constructing a “bag of words” from the document.

Place the “Tokenize” operator into the design pane and connect the “doc” port on the left of the pane to the “doc” port on the left of the “Tokenize” operator.

Sometimes we will construct n-grams at this point, or use other processing techniques to extract noun-verb-predicate tuples

Transform Cases

Transform all of the characters to lower case

This is so that the words “Everyday” and “everyday” are counted as the same word.

Place the “Transform Cases” operator into the design pane.

Connect the output of the “Tokenize” operator to the input of the “Transform Cases” operator

Filter out “Stop Words”

Next we want to filter out the “stop words”.

Remember the “stop words” are the words that really provide little or no meaning to the document itself

Examples: a, the, an, of, etc.

RM provides a module to remove those words from the document

Select the “Filter Stopwords (English)” operator and connect the “doc” port on the right side of the “Transform Cases” operator to the “doc” port on the left side of the “Filter Stopwords (English)” operator

Filter Words By Length

Next, filter out “short” words.

These are words that I consider to have little impact on classification of the document.

Place the “Filter Tokens (by Length)” operator into the design pane.

Set min chars to 4 (this will filter out anything shorter than 4 characters)

Set max chars to 250 – here I am setting it arbitrarily large because I don’t want any “long” words filtered from the analysis

Connect it to the “Filter Stopwords (English)” operator in the same way the other operators were connected.

Stemming

Next, we want to perform Stemming on the words

Remove the “ing”, “ed”, …

There are a variety of Stemming operators

Here we chose Stem (Porter)

Results

You should get a Word List (Process Documents) tab

This will show you the words found and the occurrences and the number of documents they were found in

You’ve processed a text document….
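
For readers who want to see the same chain of steps outside RapidMiner, here is a rough Python analogue of the operator sequence (Tokenize, Transform Cases, Filter Stopwords, Filter Tokens by Length, Stem). It is a sketch, not RapidMiner's implementation: the stop list is a small stand-in, the tokenizer is assumed to split on non-letter characters, and `naive_stem` is a hypothetical helper that is much cruder than the Porter stemmer.

```python
import re
from collections import Counter

# Small stand-in stop list; RapidMiner's English list is much longer.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "is", "was", "it"}

def naive_stem(token: str) -> str:
    """Strip a few common suffixes. Far cruder than the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def process_document(text: str, min_chars: int = 4, max_chars: int = 250) -> Counter:
    """Return a word list (stem -> occurrences) for one document."""
    tokens = re.findall(r"[A-Za-z]+", text)                 # Tokenize on non-letters
    tokens = [t.lower() for t in tokens]                    # Transform Cases
    tokens = [t for t in tokens if t not in STOPWORDS]      # Filter Stopwords
    tokens = [t for t in tokens
              if min_chars <= len(t) <= max_chars]          # Filter Tokens (by Length)
    tokens = [naive_stem(t) for t in tokens]                # Stem
    return Counter(tokens)

word_list = process_document("The boys worked and the boy works; Working is work.")
print(word_list)
```

Note that the length filter runs before stemming, as in the walkthrough, so the three-letter token "boy" is dropped while "boys" survives and is then stemmed to "boy".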

Questions?

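Cosine similarity between two vectors q and d (for example, a query and a document) in the Vector Space Model: sim(q, d) = (q · d) / (|q| × |d|)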

The three-step text mining process (figure text)

Inputs: a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.

Task 1: Establish the Corpus. Collect and organize the domain-specific unstructured data. Output: a collection of documents in some digitized format for computer processing.

Task 2: Create the Term-Document Matrix. Introduce structure to the corpus. Output: a flat file called the term-document matrix, where the cells are populated with the term frequencies.

Task 3: Extract Knowledge. Discover novel patterns from the T-D matrix. Output: a number of problem-specific classification, association, and clustering models and visualizations.

Feedback arrows link the tasks.

[Figure: example term-by-document matrix. Rows are documents (Document 1 through Document 6, ...); columns are terms (investment risk, project management, software engineering, development, SAP, ...); the cells hold term counts such as 1, 2, and 3.]

Big Data Analytics Tools./Final Exam materials/3 BINS 4352- Data Warehousing.pptx

BINS 4352 Data Warehousing

4 Major Components of BI

Data Warehouse

Business Analytics

Business Performance Management

Data Visualization

Objectives

Understand basic definitions / concepts of data warehouses

Learn about DW architectures and types (advantages / disadvantages)

How to manage and develop a DW

Understand DW operations

Understand the role of DW in decision support

Become familiar with the ETL (extract transform and load) process

Objectives (continued)

Describe real-time (right-time or active) DW

Understand DW administration and security issues

Data warehouse - Definition

A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format

“The data warehouse is a collection of integrated, subject-oriented databases designed to support DSS functions, where each unit of data is non-volatile and relevant to some moment in time”

Historical Perspective

1970’s

Mainframe computers

Simple data entry

Routine Reporting

Primitive database structures

1980’s

Mini/personal computers

Business apps for PCs

Distributed DBMS

Relational DBMS

Business Data Warehouse coined

1990’s

Centralized data storage

Data Warehousing was born

Inmon, Building the Data Warehouse

Kimball, The Data Warehouse Toolkit

EDW architecture design

2000’s

Exponentially growing Web data

Data warehouse appliances emerge

Business Intelligence popularized

Data mining and predictive SW

SaaS, PaaS, Cloud computing

2010’s

Big Data analytics

Social media analytics

Text/Web Analytics

Hadoop, MapReduce, NoSQL

In-memory, in-database

Characteristics - DW

Subject oriented

Integrated

Time-variant (time series)

Nonvolatile

Summarized

Not normalized

Metadata

Web based, relational/multi-dimensional

Client/server, real/right-time, active

Data Marts

A department size DW

Small-scale DW that stores only limited/relevant data

Dependent Data Mart

A subset that is created directly from a Data Warehouse

Independent Data Mart

A small (standalone) Data Warehouse that is designed for a strategic business unit or department

Data Warehouse Architectures

Generic Architecture

Data Sources: ERP, Legacy, POS, OLTP/Web, External Data

ETL Process: Select, Extract, Transform, Integrate, Load

Enterprise Data Warehouse, with Metadata; Replication to Data Marts (Marketing, Engineering, Finance, ...), or the no data mart option

Access: API / Middleware

Applications: Routine Business Reporting, Data/Text Mining, OLAP, Dashboard, Custom Apps

Issues to consider: DW architectures

Which database management (DBMS) system to use?

Will parallel processing and/or partitioning be used?

Will data migration tools be used to load data into the DW?

What tools will be used to support data retrieval and analysis?

Each architecture has advantages and disadvantages

Independent Data Marts Architecture

There is no centralized DW

Data is brought in and stored in independent data mart(s)

Data Mart Bus Architecture with Linked Dimensional Data Marts

There is no centralized DW

Data is brought in and stored in Data Marts

These data marts are linked together to provide a more consistent view of the data

Hub and Spoke Architecture (Corporate Information Factory)

There is a centralized data warehouse

Dependent data marts are created from the centralized data warehouse

[Figure: a central DW feeding four dependent data marts (DM)]

Centralized Data Warehouse Architecture

There is a centralized DW

Users interact, more or less, directly with the centralized DW

Federated Architecture

In this architecture I may or may not have a DW

Data is kept in multiple dispersed systems

This uses logical/physical integration to map data to make it appear like there is a single DW

Factors influencing architectural decisions

Information independence between operational units

Upper management’s needs

Urgency of need for a data warehouse

Nature of end-user tasks

Constraints on resources

Strategic view of the DW prior to implementation

Compatibility with existing systems

Perceived ability of the in-house IT staff

Technical issues

Social/political factors

Extraction, Transformation, and Load (ETL)

Data Integration and ETL Processes

ETL = Extract Transform Load

Data Integration – comprises 3 major processes

Data Access

Data Federation

Change Capture

Data Integration and ETL

Enterprise Application Integration (EAI)

A technology that provides a vehicle for pushing data from the source systems to the DW

Enterprise Information Integration (EII)

An evolving tool space that promises real-time data integration

Data is pulled from source systems (e.g., relational databases, multidimensional databases, web services, etc.) as needed

Data Integration and ETL

ETL

Issues impacting the purchase of ETL tools

COST: data transformation tools are expensive

LEARNING CURVE: it may take a long time to get people up to speed using these transformation tools

Criteria for selecting an ETL tool

Ability to read from and write to an unlimited number of data sources/architectures

Automatic capturing and delivery of metadata

History of conforming to standards

Easy-to-use interface for the developer and the functional user

Data Warehouse Development

Data warehouse approaches

Inmon Model: EDW approach (top-down)

Kimball Model: Data mart approach (bottom-up)

Which is best?

Another alternative is to use a hosted warehouse

Requires minimal infrastructure investment

Makes a powerful solution affordable

Often offers better quality equipment and software

Frees up capacity on in-house systems

Frees up cash flow

Enables solutions that provide for growth

Representation of Data in DW

Dimensional Modeling

A retrieval-based system that supports high-volume query access

Star schema

The most commonly used and the simplest style of dimensional modeling

Contains a fact table surrounded by and connected to several dimension tables

Snowflake schema

An extension of star schema where the diagram resembles a snowflake in shape
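
A star schema can be tried out with Python's built-in `sqlite3` module. The sketch below is illustrative only: the table and column names (`fact_sales`, `dim_time`, `dim_product`, `units_sold`) loosely follow the SALES / TIME / PRODUCT figure in these slides, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "by what" of each measurement.
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, brand TEXT);
-- The fact table holds the measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER
);
INSERT INTO dim_time VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO dim_product VALUES (1, 'Acme'), (2, 'Bolt');
INSERT INTO fact_sales VALUES (1, 1, 10), (1, 2, 5), (2, 1, 7), (2, 2, 3);
""")

# A typical dimensional query: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT t.quarter, p.brand, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_time AS t ON t.time_id = f.time_id
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY t.quarter, p.brand
    ORDER BY t.quarter, p.brand
""").fetchall()
print(rows)
```

In a snowflake schema, a dimension such as `dim_product` would itself be split further (for example, into separate brand and category tables), at the cost of extra joins.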

Multidimensionality

The ability to organize, present, and analyze data by several dimensions, such as sales by region, by product, by salesperson, and by time (four dimensions)

Multidimensional presentation

Dimensions: products, salespeople, market segments, business units, geographical locations, distribution channels, country, or industry

Measures: money, sales volume, head count, inventory profit, actual versus forecast

Time: daily, weekly, monthly, quarterly, or yearly

Star versus Snowflake Schema

Online Analytic Processing (OLAP)

OLAP Operations

Slice - a subset of a multidimensional array

Dice - a slice on more than two dimensions

Drill Down/Up - navigating among levels of data ranging from the most summarized (up) to the most detailed (down)

Roll Up - computing all of the data relationships for one or more dimensions

Pivot - used to change the dimensional orientation of a report or an ad hoc query-page display
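
Some of these operations can be mimicked on a tiny in-memory cube. This is a toy sketch in plain Python, not how an OLAP server works internally; the cube contents and the helper names (`slice_cube`, `dice_cube`, `roll_up`) are invented for illustration.

```python
from collections import defaultdict

# Each cell of the cube: (product, region, quarter) -> sales volume (made-up numbers).
cube = {
    ("Laptop", "East", "Q1"): 10, ("Laptop", "East", "Q2"): 12,
    ("Laptop", "West", "Q1"): 8,  ("Laptop", "West", "Q2"): 9,
    ("Phone",  "East", "Q1"): 20, ("Phone",  "East", "Q2"): 25,
    ("Phone",  "West", "Q1"): 15, ("Phone",  "West", "Q2"): 18,
}
DIMS = ("product", "region", "quarter")

def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def dice_cube(cube, **allowed):
    """Dice: keep cells whose coordinates fall in the allowed sets on several dimensions."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}

def roll_up(cube, keep):
    """Roll up: sum away every dimension not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[tuple(k[i] for i in idx)] += v
    return dict(totals)

q1 = slice_cube(cube, "quarter", "Q1")
east_phones = dice_cube(cube, product={"Phone"}, region={"East"}, quarter={"Q1", "Q2"})
by_product = roll_up(cube, ["product"])
print(sum(q1.values()), east_phones, by_product)
```

Drilling down is the reverse of rolling up: moving from `by_product` back toward the full cube, one dimension at a time.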

OLAP

Slicing Operations on a Simple Three-Dimensional

Data Cube

Massive DW and Scalability

Scalability

The main issues pertaining to scalability:

The amount of data in the warehouse

How quickly the warehouse is expected to grow

The number of concurrent users

The complexity of user queries

Good scalability means that the cost of queries and other data-access functions grows (ideally) no faster than linearly with the size of the warehouse

Real-Time/Active DW/BI

Enabling real-time data updates for real-time analysis and real-time decision making is growing rapidly

Push vs. Pull (of data)

Concerns about real-time BI

Not all data should be updated continuously

Mismatch of reports generated minutes apart

May be cost prohibitive

May also be infeasible

Enterprise Decision Evolution and Data Warehousing

Traditional versus Active DW

DW Administration and Security

Data warehouse administrator (DWA)

DWA should…

have the knowledge of high-performance software, hardware and networking technologies

possess solid business knowledge and insight

be familiar with the decision-making processes so as to suitably design/maintain the data warehouse structure

possess excellent communications skills

Security and privacy are pressing issues in DW

Safeguarding the most valuable assets

Government regulations (HIPAA, etc.)

Must be explicitly planned and executed

QUESTIONS?

Star Schema (figure): fact table SALES (UnitsSold, ...) connected to dimension tables TIME (Quarter, ...), PEOPLE (Division, ...), PRODUCT (Brand, ...), and GEOGRAPHY (Country, ...).

Snowflake Schema (figure): fact table SALES (UnitsSold, ...) connected to dimension tables DATE (Date, ...), PEOPLE (Division, ...), PRODUCT (LineItem, ...), and STORE (LocID, ...), with further normalized dimension tables BRAND (Brand, ...), CATEGORY (Category, ...), LOCATION (State, ...), MONTH (M_Name, ...), and QUARTER (Q_Name, ...).

A 3-dimensional OLAP cube with slicing operations (figure): the axes are Product, Time, and Geography, and the cells are filled with numbers representing sales volumes. The three slices shown are: sales volumes of a specific Product on variable Time and Region; sales volumes of a specific Region on variable Time and Products; and sales volumes of a specific Time on variable Region and Products.

Big Data Analytics Tools./Final Exam materials/Statistics Concepts.docx

STATISTICAL CONCEPTS

The Mean

When people talk about statistical averages, they are referring to the mean. To calculate the mean, simply add all of your numbers together. Next, divide the sum by however many numbers you added. The result is your mean or average score.

The Median

The median is the middle value in a data set. To calculate it, place all of your numbers in increasing order. If you have an odd number of values, the median is the middle number on your list. If you have an even number of values, the median is the mean of the two middle numbers.

The Mode

In statistics, the mode of a list of numbers is the value (or values) that occurs most frequently. Unlike the median and mean, the mode is about the frequency of occurrence. There can be more than one mode or no mode at all; it all depends on the data set itself.
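
The three measures can be checked quickly with Python's standard `statistics` module. The scores below are made up for illustration.

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]          # hypothetical data, already sorted

mean = statistics.mean(scores)        # (2 + 3 + 3 + 5 + 7 + 10) / 6
median = statistics.median(scores)    # even count: average of the two middle values, 3 and 5
modes = statistics.multimode(scores)  # list of every most-frequent value

print(mean, median, modes)
```

`statistics.multimode` returns a list, so a data set with two equally frequent values (for example `[1, 1, 2, 2]`) yields both of them.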

Quartiles

The median divides the data into a lower half and an upper half. The lower quartile is the middle value of the lower half. The upper quartile is the middle value of the upper half. The interquartile range is the difference between the upper and lower quartiles.

Examples – 2 sets of 50 numbers in each. What do you notice about the mean, median, and quartiles?
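
The original 50-number sets are not reproduced here, so the sketch below uses two small invented sets to make the same point. It assumes the median-of-halves rule described above, with the median itself left out of both halves when the count is odd (conventions differ on this detail).

```python
import statistics

def quartiles(values):
    """Return (lower quartile, median, upper quartile) by the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, skip the middle value so it belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

set_a = [48, 49, 50, 50, 50, 51, 52]   # tightly clustered (made-up)
set_b = [10, 30, 50, 50, 50, 70, 90]   # widely spread (made-up)

for name, data in (("A", set_a), ("B", set_b)):
    q1, med, q3 = quartiles(data)
    print(name, "mean:", statistics.mean(data), "median:", med,
          "quartiles:", (q1, q3), "IQR:", q3 - q1)
```

Both sets share the same mean and median, yet their quartiles and interquartile ranges differ sharply, which is why the mean alone says little about spread.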

Standard Deviation

In statistics, the standard deviation (SD, also represented by the Greek letter sigma σ or the Latin letter s) is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. The standard deviation of a random variable, statistical population, data set, or probability distribution is the square root of its variance.

https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Standard_deviation_diagram.svg/400px-Standard_deviation_diagram.svg.png

Plot of a normal distribution. Each band of the normal distribution has a width of 1 standard deviation.

Variance

In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of (random) numbers are spread out from their average value. Variance has a central role in statistics, where some ideas that use it include descriptive statistics, statistical inference, hypothesis testing, goodness of fit, and Monte Carlo sampling.

Variance is an important tool in the sciences, where statistical analysis of data is common. The variance is the square of the standard deviation, the second central moment of a distribution, and the covariance of the random variable with itself, and it is often represented by σ², s², or Var(X).
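
A short check of the relationship between variance and standard deviation, using Python's `statistics` module on made-up data. Note the distinction between the population forms (divide by n) and the sample forms (divide by n - 1).

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical values; the mean is 5

pop_var = statistics.pvariance(data)     # population variance: sum of squared deviations / n
pop_sd = statistics.pstdev(data)         # population standard deviation
sample_var = statistics.variance(data)   # sample variance: divides by n - 1 instead

print(pop_var, pop_sd, sample_var)
print(math.isclose(pop_sd ** 2, pop_var))  # the variance is the square of the SD
```

The squared deviations here sum to 32, so the population variance is 32 / 8 and the sample variance is 32 / 7.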

Covariance

In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e., the variables tend to show similar behavior, the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables.

The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.
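
The point about normalization can be seen numerically. The sketch below uses hand-written helper functions and invented data (years of experience versus salary); rescaling one variable changes the covariance but leaves the correlation coefficient unchanged.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

years = [1, 2, 3, 4, 5]
salary_k = [30, 35, 41, 44, 50]               # made-up salaries, in thousands
salary = [s * 1000 for s in salary_k]         # the same salaries, in single units

print(covariance(years, salary_k), covariance(years, salary))    # differs by a factor of 1000
print(correlation(years, salary_k), correlation(years, salary))  # essentially identical
```

This is why a correlation matrix, rather than a covariance matrix, is the usual tool for spotting highly related variables.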

Correlation / Dependence

In statistics, dependence or association is any statistical relationship, whether causal or not, between two random variables or bivariate data. Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to how close two variables are to having a linear relationship with each other.

Statistical Test Theory

In statistics, a null hypothesis is a statement that one seeks to nullify with evidence to the contrary. Most commonly it is a statement that the phenomenon being studied produces no effect or makes no difference.

An example of a null hypothesis is the statement "This diet has no effect on people's weight."

Usually, an experimenter frames a null hypothesis with the intent of rejecting it: that is, intending to run an experiment which produces data that shows that the phenomenon under study does make a difference. In some cases there is a specific alternative hypothesis that is opposed to the null hypothesis, in other cases the alternative hypothesis is not explicitly stated, or is simply "the null hypothesis is false" – in either event, this is a binary judgment, but the interpretation differs and is a matter of significant dispute in statistics.

A type I error (or error of the first kind) is the incorrect rejection of a true null hypothesis. Usually a type I error leads one to conclude that a supposed effect or relationship exists when in fact it doesn't.

Examples of type I errors include a test that shows a patient to have a disease when in fact the patient does not have the disease, a fire alarm going off indicating a fire when in fact there is no fire, or an experiment indicating that a medical treatment should cure a disease when in fact it does not.

A type II error (or error of the second kind) is the failure to reject a false null hypothesis.

Examples of type II errors would be a blood test failing to detect the disease it was designed to detect, in a patient who really has the disease; a fire breaking out and the fire alarm does not ring; or a clinical trial of a medical treatment failing to show that the treatment works when really it does.

If the probability of obtaining a result as extreme as the one obtained, supposing that the null hypothesis were true, is lower than a pre-specified cut-off probability (for example, 5%), then the result is said to be statistically significant and the null hypothesis is rejected.

Examples:

Example 1

Hypothesis: "Adding water to toothpaste protects against cavities."

Null hypothesis (H0): "Adding water to toothpaste has no effect on cavities."

This null hypothesis is tested against experimental data with a view to nullifying it with evidence to the contrary.

A type I error occurs when detecting an effect (adding water to toothpaste protects against cavities) that is not present. The null hypothesis is true (i.e., it is true that adding water to toothpaste has no effect on cavities), but this null hypothesis is rejected based on bad experimental data.

Example 2

Hypothesis: "Adding fluoride to toothpaste protects against cavities."

Null hypothesis (H0): "Adding fluoride to toothpaste has no effect on cavities."

This null hypothesis is tested against experimental data with a view to nullifying it with evidence to the contrary.

A type II error occurs when failing to detect an effect (adding fluoride to toothpaste protects against cavities) that is present. The null hypothesis is false (i.e., adding fluoride is actually effective against cavities), but the experimental data is such that the null hypothesis cannot be rejected.

Example 3

Hypothesis: "The evidence produced before the court proves that this man is guilty."

Null hypothesis (H0): "This man is innocent."

A type I error occurs when convicting an innocent person (a miscarriage of justice). A type II error occurs when letting a guilty person go free (an error of impunity).

A positive correct outcome occurs when convicting a guilty person. A negative correct outcome occurs when letting an innocent person go free.

p-value

In statistical hypothesis testing, the p-value or probability value is the probability for a given statistical model that, when the null hypothesis is true, the statistical summary (such as the sample mean difference between two compared groups) would be the same as or of greater magnitude than the actual observed results. The use of p-values in statistical hypothesis testing is common in many fields of research.

The p-value is used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence. Null hypothesis testing is a reductio ad absurdum argument adapted to statistics. In essence, a claim is assumed valid if its counter-claim is improbable.

https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/P-value_in_statistical_significance_testing.svg/800px-P-value_in_statistical_significance_testing.svg.png

In Statistica, when I am looking for likely predictors (independent variables) of some dependent variable, I run an analysis and check the p-value for each candidate. For each independent variable X and its influence on the dependent variable Y, the hypotheses are:

H0: X does not predict Y

Ha: X does predict Y

When we run the statistical test and check p-values, we look for p-values below some pre-defined threshold (5% = 0.05). If the p-value is below the 0.05 threshold, we can reject the null hypothesis (H0: X does not predict Y) and treat X as a likely predictor of Y.

T-statistic versus p-value

When you perform a t-test, you're usually trying to find evidence of a significant difference between population means (2-sample t) or between the population mean and a hypothesized value (1-sample t). The t-value measures the size of the difference relative to the variation in your sample data. Put another way, T is simply the calculated difference represented in units of standard error. The greater the magnitude of T (it can be either positive or negative), the greater the evidence against the null hypothesis that there is no significant difference. The closer T is to 0, the more likely there isn't a significant difference.

http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/bc5183a42a169d45632fd4f6c0b153b3/distribution_plot_t_2.8

Look familiar? The t-test and the p-value are inextricably linked.

P-value

The probability, assuming the null hypothesis is true, that the test statistic takes the observed value or one more extreme in the direction of the alternative hypothesis

Test Statistic

A measurement, in standardized units, of how far a sample statistic is from the assumed parameter if the null hypothesis is true

So – you can use a t-table to find the p-value. In essence, if you are looking to reject the null hypothesis, you are looking for small p-values – smaller than a threshold such as 0.05. Equivalently, you are looking for a t-value that is large in magnitude compared with a critical value, which depends on the degrees of freedom and on whether the test is one-sided or two-sided (at the 0.05 level, roughly 1.7 to 1.9 for a one-sided test and 2.0 to 2.4 for a two-sided test when there are between 7 and 30 degrees of freedom).

Chi Square Test

A chi-square test, also written as χ² test, is any statistical hypothesis test wherein the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true. Without other qualification, 'chi-squared test' often is used as short for Pearson's chi-squared test. The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.

In the standard applications of the test, the observations are classified into mutually exclusive classes, and there is some theory, or say null hypothesis, which gives the probability that any observation falls into the corresponding class. The purpose of the test is to evaluate how likely the observed frequencies would be if the null hypothesis were true.

Chi-squared tests are often constructed from a sum of squared errors, or through the sample variance. Test statistics that follow a chi-squared distribution arise from an assumption of independent normally distributed data, which is valid in many cases due to the central limit theorem. A chi-squared test can be used to attempt rejection of the null hypothesis that the data are independent.

Chi Square test for variance in a normal population

If a sample of size n is taken from a population having a normal distribution, then there is a result (see distribution of the sample variance) which allows a test to be made of whether the variance of the population has a pre-determined value.

For example, a manufacturing process might have been in stable condition for a long period, allowing a value for the variance to be determined essentially without error. Suppose that a variant of the process is being tested, giving rise to a small sample of n product items whose variation is to be tested. The test statistic T in this instance could be set to be the sum of squares about the sample mean, divided by the nominal value for the variance (i.e. the value to be tested as holding). Then T has a chi-squared distribution with n − 1 degrees of freedom. For example, if the sample size is 21, the acceptance region for T with a significance level of 5% is between 9.59 and 34.17.
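The variance test described above can be sketched in a few lines. The sample and the nominal variance of 4.0 below are hypothetical; the acceptance region (9.59 to 34.17) is the one quoted in the text for a sample of 21 at the 5% level.

```python
import statistics

def variance_test_statistic(sample, nominal_variance):
    """Sum of squares about the sample mean, divided by the nominal variance."""
    mean = statistics.fmean(sample)
    return sum((x - mean) ** 2 for x in sample) / nominal_variance

# Acceptance region from the text: n = 21 (20 degrees of freedom), 5% level
LOWER, UPPER = 9.59, 34.17

def within_acceptance_region(t_stat, lower=LOWER, upper=UPPER):
    return lower <= t_stat <= upper

# Hypothetical sample of 21 items, evenly spaced from 5.0 to 15.0
sample = [10 + 0.5 * (i - 10) for i in range(21)]
t_stat = variance_test_statistic(sample, nominal_variance=4.0)  # assumed nominal variance
print(t_stat, within_acceptance_region(t_stat))
```

If T falls outside the acceptance region, the hypothesis that the process variance equals the nominal value is rejected at the 5% level.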

Big Data Analytics Tools./Final Exam materials/2 BINS 4352 - Variable Selection.pptx

Data Selection

Overview

Dividing data sets into Train, Test and Validation samples

What does sampling accomplish

Methods of sampling

Issues related to sample size

Sampling

Training Sample – used to create models and find patterns of predictive value

Test Sample – prevents the models from learning only train data and helps the model to be generalizable to new cases

This sample is also used during the model building process

We want to ensure that patterns as a “whole” are detected

Sometimes we say that this will help keep us from “over fitting” the model

Validation Sample – estimates and compares the performance of the model

This data is “hold-out” data from the model building process

The data is NOT used during the model building process

Used to show model performance

If the model performs well on the validation set, then it is reasonable to believe that the model has discovered significant relationships

Training

Test

Validate

What Sampling Accomplishes

Random sample

Ideally, the patterns and relationships that are present in the population being studied will also be present in the sample

Using a “sample” makes the analysis more efficient and the computations can be performed much more quickly

This also allows us to safeguard against “over-fitting” the data by being able to use a test and validation sample to check the created model

An over-fit model can be so “finely” tuned to the training data that it misses the actual relationships in the data

Therefore, we say that the model fails to generalize to the population

Model Fit

Took a data set “Athletic Donations” and tried to fit a model.

Did a simple linear regression with 3 variables

All of the variables were insignificant and the model fit (as you can see) is poor

This model accounts for 4% of the variance.

Model fit

Re-ran the model using 42 variables

This time including sin, cos, tan of variables etc.

All variables are significant and the model looks to fit well.

This accounted for 96.68% of the variance

However, this model has “over-fit” the data

Reduced the number of predictors

I reduced the number of predictor variables from 42 to 18

The model explains about 54% of the variance.

But it is a better predictor when tested against the hold-out sample

Sampling Methods

In Statistica you can create samples based on a variety of options.

It allows you to create a sample that is reasonable to work with

You can create a sample based on either the approximate sample size or a percentage of the whole

Spreadsheet formulas can be used to create a variable that identifies cases as either a part of “training”, “test” or “validation”

Sizing of the sample is up to the user – and may be based on the amount of data you have to start with.

Sample Size

The accuracy with which patterns in the sample will reflect the patterns in the population is a function of sample size and not population size

A random sample of 1000 will provide just as accurate results whether it is taken from a population of 100,000 or a population of 100,000,000.

When the sample size is small and the population is large, there is always ample data left to

Validate the model

Test new hypotheses
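One way to check the sample-size claim numerically is the standard error of a sample proportion with the finite population correction. This is only a sketch: it assumes simple random sampling without replacement and a true proportion of 0.5, neither of which comes from the slides.

```python
import math

def standard_error_of_proportion(p, n, population_size):
    """Standard error of a sample proportion, with the finite population correction."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return math.sqrt(p * (1 - p) / n) * fpc

# Same sample size (1000), two very different population sizes
se_small_pop = standard_error_of_proportion(0.5, 1000, 100_000)
se_large_pop = standard_error_of_proportion(0.5, 1000, 100_000_000)
print(f"{se_small_pop:.5f} vs {se_large_pop:.5f}")
```

The two standard errors are nearly identical, because the correction factor stays close to 1 whenever the sample is a small share of the population.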

Creating A Random Sample

Creating A Training And Test And Validation Sample

The random sampling tool will split the data set into 2 samples

A sample for model building with training and test

A sample for validation

Here we have selected the Random Sample Tool (Data – Sample)

We select the “Split node random sampling” and set the percentage to 15%

This will put 15% into the validation sample and 85% into the training/test sample

Creating A Training And Test And Validation Sample

When we “click” OK 2 new data sets are created

Spreadsheet1 has 21 variables and 287 cases

This is our validation sample

Spreadsheet2 has 21 variables and 1713 cases

This is our training / testing sample

Once we build and test the model using the training/testing sample, we can deploy the model and run the validation data set

This will enable us to evaluate how well the model generalizes to the new hold-out data
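Outside Statistica, the same kind of hold-out split can be sketched as below. The function name and the seed are illustrative assumptions. This sketch splits by an exact count, so 2000 cases give 300 and 1700 rather than the 287 and 1713 produced by the Statistica tool.

```python
import random

def split_holdout(rows, validation_fraction=0.15, seed=42):
    """Shuffle a copy of the rows and split off a validation (hold-out) sample."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[:n_validation], shuffled[n_validation:]

rows = list(range(2000))  # stand-in for 2000 cases
validation, train_test = split_holdout(rows)
print(len(validation), len(train_test))
```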

Dividing The Training / Testing Data Set

To divide the cases into “Training” and “Test” in Statistica use spreadsheet formulas.

This enables us to keep the training and test cases in a single sample

Create a variable – by going to “Data” – Variables - Add

Creating New Variable

Here we are creating a variable with the name “sample”

It will be added after the variable “Occupation”

It will be generated according to the formula that we entered

iif(rnd(1)<0.8,'Train','Test')

Creating a New variable

We have generated a variable

Sample

Approximately 80% of the entries have the value “Train” and approximately 20% have the value “Test”

Then we can generate a histogram of the new variable and see the distribution
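The logic of the spreadsheet formula (draw a uniform random number per case and compare it with 0.8) can be sketched in Python as follows. The function name and seed are assumptions for illustration; because each case is assigned independently, the split is only approximately 80/20.

```python
import random

def assign_sample_labels(n_cases, train_fraction=0.8, seed=1):
    """Label each case 'Train' with probability train_fraction, otherwise 'Test'."""
    rng = random.Random(seed)
    return ["Train" if rng.random() < train_fraction else "Test" for _ in range(n_cases)]

labels = assign_sample_labels(1713)
train_share = labels.count("Train") / len(labels)
print(f"Train share: {train_share:.3f}")
```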

Case Selection

Case selection conditions

Case selection conditions are a way to specify the data that you want to work with for analysis – without altering the data in any way

Case selection conditions have the option to show which cases are being used

This allows you to visually look at the data and make sure that the data to be used is what is expected

Benefits of using Case Selection Conditions

You can utilize various subsets of the data without having to create multiple copies of the data

When Statistica generates output it will show the data subset that was used

Example

Suppose we want to look at the CreditScoring data and focus on those customers that are requesting credit above $5000.

Looking at a histogram of the data before we get started shows us that we would have several cases both above and below $5000

Example

We can access Case Selection from a variety of places

Going to: Tools

Select “Enable Selection Conditions”

Then cases can be

Included

Excluded

Or – a combination of both

Now Generate A New Histogram

Notice that “Set Cond” has been turned on

The resulting histogram looks a lot different than it did before

The amounts shown are all greater than $5000

The inclusion condition is shown in the title of the graph

Another Histogram

I select “Credit Rating”

Generate the histogram

Here again – we have the breakdown of good and bad credit ratings for the subset defined by ‘Amount of Credit’>5000

Descriptive Statistics

If we want to generate descriptive statistics for some variables, then we see that this is also based on the Include Condition specified

Finally, if you want to turn-off the selection conditions, simply go back to the selection conditions dialog and “un-check” the enable button.
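The idea behind a case selection condition (analyze a subset without changing the data) can be sketched as a filter that leaves the original list untouched. The five cases below are made-up values; only the variable names and the condition come from the example.

```python
from collections import Counter

# Hypothetical cases using the variable names from the CreditScoring example
cases = [
    {"Amount of Credit": 1200, "Credit Rating": "good"},
    {"Amount of Credit": 7400, "Credit Rating": "bad"},
    {"Amount of Credit": 5000, "Credit Rating": "good"},
    {"Amount of Credit": 9100, "Credit Rating": "good"},
    {"Amount of Credit": 15800, "Credit Rating": "bad"},
]

def select_cases(cases, condition):
    """Return the cases that satisfy the condition; the original list is not modified."""
    return [case for case in cases if condition(case)]

# Include condition: 'Amount of Credit' > 5000
selected = select_cases(cases, lambda case: case["Amount of Credit"] > 5000)
rating_counts = Counter(case["Credit Rating"] for case in selected)
print(len(selected), dict(rating_counts))
```

Any later summary (here, the count of good and bad ratings) is computed on the selected subset, while the full data set stays available.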

Variable Selection

Why Screen Variables in Analysis?

We often see a large number of variables in data sets

Sometimes not all of these are beneficial to our data mining goals

It is easy to see that some variables are not meaningful for the analysis

e.g., an ID number, a social security number, phone numbers, etc.

Others may at first seem important, but turn out to be not so good as predictors

Some variables may be difficult to monitor (i.e., collect data for)

Time consuming

Costly

The screening process may let us exclude the costly variables from the analysis

More importantly – a simple model is easier to deploy (may be easier to explain, may require fewer inputs, etc.)

Feature Selection and Variable Screening

In Statistica we can select

A specific number of variables

Select variables based on significance (Chi-Square measure of significance)

Once we have found the “best” predictor variables, we can define and manage variable bundles

A bundle is a way of grouping and preselecting variables for analysis

Feature Selection and Screening Tool

Selecting variables – the “Show appropriate variables” option will pre-sort the variables into categorical and continuous variables

Selecting Variables

Our dependent variable is “Credit Rating” and it is categorical

Then we can select the remaining variables as continuous and categorical as appropriate

Selecting Variables

Once the variables are selected

We select “OK” and move to the Feature Selection dialog (below)

We can select the best predictors

Increase the “10” to “12” and select “Summary”

Selecting Variables

This will give us the Chi-Squared and p-values for the “12” best predictors

We can also look at a histogram of the predictors and see that “Balance of Current Account” is the most significant predictor of credit rating
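To show what a chi-square importance score measures, here is a sketch of the Pearson chi-square statistic for a cross-tabulation of one categorical predictor against the credit rating. The counts are invented, and this is not a description of how Statistica computes its scores internally (for example, how it treats continuous predictors).

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical counts: rows = predictor category, columns = credit rating (bad, good)
strong_predictor = [[60, 40], [20, 80]]
weak_predictor = [[41, 59], [39, 61]]

scores = {
    "strong": chi_square_statistic(strong_predictor),
    "weak": chi_square_statistic(weak_predictor),
}
print(scores)
```

A predictor whose categories split the ratings very differently gets a large chi-square value and ranks higher in the importance plot.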

Select by p Value

We can also select by p-value

Select a maximum p-value

Select Summary

Select by p Value

Then we can generate a histogram

Report of Best K Predictors

Now I can create new bundles based on the best predictor variables

New Bundles

I go to Data

Bundles

Select New

And create new bundles based on the variables that will be the best predictors

Impact of Too Many Variables

The curse of dimensionality

Data mining performance

Deployment complexity

Capitalizing on chance

Curse of Dimensionality

For each added predictor variable in a data mining model, the number of data points (cases) needed grows rapidly

Prior to analysis, variables that are known to be unnecessary as well as those that add only a small amount of information should be excluded from the analysis

This is particularly important for neural networks

Neural networks in particular can be adversely impacted by the inclusion of variables that are unrelated to the target

Data Mining Performance And Deployment Complexity

Data mining performance

By prescreening the input variables we can improve data mining performance in several ways

We can decrease the time needed to construct the model

Removal of variables that are not meaningful for the analysis can improve model accuracy

Deployment complexity

Each predictor used by the data mining model is required to deploy/apply that model to new cases

Ex: Requiring 50 parameters for a Credit Checking model will require all new cases to provide those 50 parameters

If good predictive accuracy is attainable with a smaller set of inputs, deployment is made easier.

Capitalizing On Chance

Using feature selection and variable screening – or any other prescreening method – in conjunction with traditional statistical hypothesis testing can cause problems

This is called capitalizing on chance

The variables used in the analysis were “predetermined” to be related to “Y” (the dependent variable)

Therefore, significance tests derived from general linear models should be interpreted carefully

Variable Redundancy

We talked about this a little earlier

We used the Spearman test to see if 2 variables were highly correlated

What are redundant variables?

These are variables that convey essentially the same information

When this occurs we should keep one, but delete the others

Detecting Variable Redundancy

Categorical Data

We can detect redundant variables by using Cross-tabulation and Bivariate Histograms

Continuous Data

Correlation and Spearman Rank

Correlation can detect redundancy

Impact Of Variable Redundancy On The Analysis

Redundant variables can mask the contribution of an effect in the model

If you used both Yearly Income and Income on Last Pay Stub in a model and these two variables contain the same information, then the result could be a reduced sensitivity in the model for both

Also, since the redundant variable adds no new information – it adds unnecessary complexity to the model

Correlation Between Continuous Variables

Statistics -> Basic Statistics

Correlation

Select the variables of interest

Summary

Correlation Between Continuous Variables

Now we can check to see if there are any variables that are highly correlated

It appears that Obesity and Adiposity are correlated (>0.70)

Rule of Thumb: a threshold of 0.70 correlation is used for variable redundancy. However, the exact threshold used may differ from domain to domain and application to application.

We can get a better look at these two variables if we look at a scatter plot

Correlation Between Continuous Variables

The scatter plot below confirms that the two variables are correlated.

What we are looking for is an indication that there is a strong linear relationship present

The scatter plot on the left is an example where the two variables are not highly correlated.

The scatter plot on the right is an example where the two variables are extremely highly correlated (>0.90)
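The 0.70 rule of thumb can be applied programmatically. The sketch below computes the Pearson correlation for every pair of continuous variables and flags pairs above the threshold. The variable names echo the HeartDisease example, but the values are invented for illustration.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def redundant_pairs(columns, threshold=0.70):
    """Return (name_a, name_b, r) for every pair whose |r| exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson_r(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical values
columns = {
    "Obesity": [22.0, 25.5, 28.1, 30.2, 33.9, 36.4],
    "Adiposity": [19.5, 24.8, 27.0, 31.5, 35.2, 39.9],
    "Tobacco": [3.0, 0.5, 7.2, 1.1, 4.0, 2.5],
}
flagged = redundant_pairs(columns)
print(flagged)
```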

Categorical Data Cross-tabulation

To check categorical data to see if there is redundancy

Statistics -> Basic Statistics

Tables and banners

We select the variables we want to look at

Select OK

Review Summary Tables

Categorical Data Cross-tabulation

Looking at the data for time of employment and employment record:

Those employed for more than 1 year are all satisfactory

Those less than 1 year have insufficient information

Those where the record is unsatisfactory are unemployed

Selecting a 3-D histogram will show the same information in a slightly different format

Combination Of Continuous And Categorical

In the credit scoring data I added a column “Age Category”

Here I created values – “over 70”, “61-70”, “51-60”… “under 20”

One would assume that this would be highly correlated to the continuous variable – Age

How do we check this?

Go to the “Statistics” tab and select “Nonparametrics”

Select “Correlations (Spearman, …)”

Combination Of Continuous And Categorical

Select the variables

Select – Spearman rank R

Here we see, as expected, Age and Age Category are highly correlated (>0.70)

NOTE: this type of test is effective only if the categorical data is ordinal data - “ordered”

Must have a natural “ordering” to the data
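Spearman rank R is the ordinary correlation computed on ranks, which is why the categories need a natural order. A minimal sketch follows; the ages and the numeric codes assigned to the age categories are assumptions for illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # average of 1-based positions
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

ages = [19, 25, 34, 38, 47, 52, 66, 73]
# Hypothetical ordinal codes: 1 = "under 20", 2 = "21-30", ..., 7 = "over 70"
age_category = [1, 2, 3, 3, 4, 5, 6, 7]
rho = spearman_r(ages, age_category)
print(f"Spearman R = {rho:.3f}")
```

If the category codes had no natural order (for example, codes for "Purpose of Credit"), the ranks would be arbitrary and the coefficient would not be meaningful.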

QUESTIONS?

[Figure: Histogram of sample (Spreadsheet2, 22v*1713c). Categories: Train, Test. Y-axis: No of obs.]

[Figure: Histogram of Amount of Credit (CreditScoring - Copy, 19v*1000c). X-axis: Amount of Credit. Y-axis: No of obs.]

[Figure: Histogram of Amount of Credit (CreditScoring - Copy, 19v*1000c). Include condition: 'Amount of Credit'>5000 (304 cases). X-axis: Amount of Credit. Y-axis: No of obs.]

[Figure: Histogram of Credit Rating (CreditScoring - Copy, 19v*1000c). Include condition: 'Amount of Credit'>5000 (304 cases). Categories: bad, good. Y-axis: No of obs.]

[Figure: Importance plot. Dependent variable: Credit Rating. X-axis: Importance (Chi-square). Predictors shown: Further running credits, Employed by Current Employer for, Purpose of Credit, Type of Apartment, Most Valuable Assets, Amount of Credit, Value of Savings, Duration of Credit, Payment of Previous Credits, Balance of Current Account.]

[Figure: Scatterplot of Adiposity against Obesity (HeartDisease, 17v*463c). Fitted line: Adiposity = -9.0534+1.3231*x.]

[Figure: Scatterplot of Tobacco Intake (kg) against Stress Type-A behavior (HeartDisease, 17v*463c). Fitted line: Tobacco Intake (kg) = 3.9986-0.0068*x.]

Big Data Analytics Tools./Final Exam materials/2 BINS 4352 - Text Analysis - 2.pptx

Further Analysis Using Text Data

Learning Outcomes

Understand how to apply data mining techniques such as Similarity Analysis and Clustering to Text Data

Understand the difference between Text and Data Mining

Similarity Analysis: The Federalist Papers

The Dataset

The complete Federalist Papers Collection was downloaded

Each essay was put in a cell in an Excel file

The excel file contains the following information

The author(s)

AH: Alexander Hamilton

JM: James Madison

JJ: John Jay

Text – the essay

The Design

The design of the analysis is very similar to what was done before

This time data was read in from an Excel file

Also, in this design the Word List from Process Documents was passed on to the output

Read Excel

Here we walked through the “Import Configuration Wizard”

The Excel file was selected

Author and Text were input into RapidMiner

Nominal to Text

Next we need to convert the text data in the field “TEXT”

Select subset

Then move the TEXT field to be selected

Data to Documents

Next, we need to convert the text data into Documents

Place a Data to Documents operator in the design

Specify a weight of 1 for the TEXT attribute

Connect the outputs to the res ports

Run the design

Process Documents

Process Documents is a hierarchical operator

This is where all of the work gets done and the word-document matrix is generated

Place it in the design, then double click on it to enter the operator

Tokenize the document

Next we want to turn the document into a collection of individual tokens.

You can consider this as constructing a “bag of words” from the document.

Place the “Tokenize” operator into the design pane and connect the “doc” port on the left of the pane to the “doc” port on the left of the “Tokenize” operator.

Transform Cases

Transform all of the characters to lower case

This is so that the words “Everyday” and “everyday” are counted as the same word.

Place the “Transform Cases” operator into the design pane.

Connect the output of the “Tokenize” operator to the input of the “Transform Cases” operator

Filter out “Stop Words”

Next we want to filter out the “stop words”.

Remember the “stop words” are the words that really provide little or no meaning to the document itself

Examples: a, the, an, of, etc.

RM provides a module to remove those words from the document

Select the “Filter Stopwords (English)” operator and connect the “doc” port on the right side of the “Transform Cases” operator to the “doc” port on the left side of the “Filter Stopwords (English)” operator

Filter Words By Length

Next, filter out “short” words.

These are words that I consider to have little impact on classification of the document.

Place the “Filter Tokens (by Length)” operator into the design pane.

Set min chars to 4 (this will filter out anything shorter than 4 characters)

Set max chars to 250 – here I am setting it arbitrarily large because I don’t want any “long” words filtered from the analysis

Connect it to the “Filter Stopwords (English)” operator in the same way the other operators were connected.

Stemming

Next, we want to perform Stemming on the words

Remove the “ing”, “ed”, …

There are a variety of Stemming operators

Here we chose Stem (Porter)
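The operator chain inside Process Documents (Tokenize, Transform Cases, Filter Stopwords, Filter Tokens by Length, Stem) can be mimicked in plain Python to see what each step does to the text. This is a rough sketch and not RapidMiner's implementation: the stop word list is a tiny illustrative set, splitting on non-letter characters is an assumption, and `crude_stem` is a stand-in that only strips a few suffixes (it is not the Porter algorithm).

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "that"}  # tiny illustrative list

def crude_stem(token):
    """Strip a few common suffixes. A stand-in for illustration, NOT the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text, min_chars=4, max_chars=250):
    tokens = re.findall(r"[A-Za-z]+", text)                            # tokenize
    tokens = [t.lower() for t in tokens]                               # transform cases
    tokens = [t for t in tokens if t not in STOP_WORDS]                # filter stop words
    tokens = [t for t in tokens if min_chars <= len(t) <= max_chars]   # filter by length
    return [crude_stem(t) for t in tokens]                             # stemming

tokens = preprocess("The People of the State are establishing the powers granted to the Union")
print(tokens)
```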

The Design

Data to Similarity

Here the measure types – Numerical

Mixed Measure – Cosine Similarity

Results

The Example Set (Process Documents) shows the TF-IDF values for each word/document combination

This is what will be used to create the word vectors for the documents
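As a sketch of how TF-IDF word vectors lead to a similarity score, the code below weights each term by its frequency in a document times the log of its rarity across documents, then compares documents with cosine similarity. The exact TF-IDF formula and normalization used by RapidMiner may differ; the three token lists are invented.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical pre-processed documents
docs = [
    ["court", "judge", "power", "law"],
    ["court", "judge", "trial", "law"],
    ["treaty", "senate", "power", "nation"],
]
vectors = tf_idf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
sim_12 = cosine_similarity(vectors[1], vectors[2])
print(sim_01, sim_02, sim_12)
```

Documents that share many distinctive terms end up with larger similarity values, which is what the sorted similarity table reports.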

Results

We once again sort the Similarity data by “Similarity” – here those with the largest values are the most “Similar”

Here we see that essay 81 and 82 are the most similar followed by 81 and 83 and so on

Doc #  Author  Doc #  Author
81     AH      82     AH
81     AH      83     AH
76     AH      77     AH
47     JM      48     JM
65     AH      66     AH
64     JJ      75     AH
32     AH      33     AH
65     AH      81     AH
80     AH      82     AH
66     AH      75     AH

Clustering The Federalist Papers

Splitting Paths In RapidMiner

One of the things that is very easy to do in RapidMiner is running multiple paths through the design

This can aid in comparing models

Looking at different aspects of the data

Etc.

This is done using the “Multiply” operator

The multiply operator takes the output of one operator and replicates it so that it can be used as input in other operators

As you make connections to the output port of the operator, new output ports will appear

Clustering

In order to do a K-means clustering with the Federalist Papers data, we can

Go back to the design used in the previous exercise

Add a Multiply operator to the Process Documents (exa) output port

Split that output into both

Data to Similarity

Cluster

Clustering

Select K-means clustering

This is identical to what we did earlier in the semester – we are just now doing it with text data

This operator is going to try to cluster “similar” documents together

Select k=8, I typically start out with a number ~10% of the number of documents (unless that number of documents is very large)

Measure types

Numerical Measures

Cosine Similarities

Clustering Results

ExampleSet (Multiply) – this will show you the cluster assignment.

Since we selected “keep text”, you will notice that the values are carried through to the output

Clustering Results

Cluster Model (Clustering) – this shows how the documents were distributed across the 8 clusters

The “Centroid Table” shows the distances based on the cluster id and each of the terms in the corpus

Questions?

Big Data Analytics Tools./Final Exam materials/1 BINS 4352 - Preparing Data - Statistal Tests and Missing Data.pptx

Preparing Data Using Statistical Tests

Common Statistical Tests

Most of these tests – you are defining a “range” and the points outside that range are considered to be “outliers”

Tests that look for outliers in continuous data include

Grubbs test

The test computes a G statistic for each case and tags the case as an outlier if it is greater than some critical G value

The user provides a parameter for “xp” (which can be between 0-1) where the smaller the parameter the more conservative the test

The test removes one outlier at a time and then the test iterates until no outliers remain

The test can be either run “one-sided” or “two-sided”

One-sided – is the minimum value an outlier?

Two-sided – considering both the minimum and maximum is there an outlier
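The G statistic itself is simple to compute: it is the largest absolute deviation from the mean, measured in sample standard deviations. The sketch below covers only the two-sided statistic; the critical G value (which is derived from the t distribution and depends on the sample size and the chosen parameter) is left out, and the data are invented.

```python
import statistics

def grubbs_statistic(values):
    """Two-sided Grubbs G: largest |x - mean| divided by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    suspect = max(values, key=lambda x: abs(x - mean))
    return abs(suspect - mean) / sd, suspect

# Hypothetical measurements with one suspicious value
values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 14.5]
g, suspect = grubbs_statistic(values)
print(f"G = {g:.3f} for suspect value {suspect}")
```

In the iterative form of the test, the suspect is removed if G exceeds the critical value, and the statistic is recomputed on the remaining cases.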

Common Statistical Tests

Normal distribution

Values that are greater than “X” (which can be between 1-10) times the standard deviation away from the mean are identified as outliers

The test works similarly to the Grubbs test for one-sided or two-sided tests

This test can be run iteratively until no outliers remain
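A minimal two-sided sketch of the normal distribution rule is shown below. The use of the sample standard deviation and the data are assumptions; the parameter `k` plays the role of “X”.

```python
import statistics

def normal_rule_outliers(values, k=3.0):
    """Flag values more than k sample standard deviations away from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mean) > k * sd]

# Hypothetical data: 20 ordinary values and one extreme value
values = [48, 52, 50, 49, 51] * 4 + [95]
print(normal_rule_outliers(values))         # default k = 3
print(normal_rule_outliers(values, k=5.0))  # larger k is more conservative
```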

Percentiles

The user specifies a percentile (between 0-100) and all values that lie beyond that percentile are marked as outliers

The test works similarly to the Grubbs test for one-sided or two-sided tests

For the two-sided test – if you enter 10, then all points below the 10th percentile and those above the 90th percentile are removed

Common Statistical Tests

Tukey

This is very similar to how a box plot works

Outliers are determined based on a user-specified outlier coefficient (aka Tukey hinge distance factor).

For the two-sided Tukey test a data value is considered an outlier if

Data_Point > UV + OC * (UV - LV)

Data_Point < LV - OC * (UV - LV)

Where, UV = 75th percentile, LV = 25th percentile, OC = outlier coefficient (between 1 and 5)
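The two-sided Tukey rule above translates directly into code. In this sketch the quartiles come from Python's `statistics.quantiles` with the "inclusive" method, which may not match the percentile definition Statistica uses; the data are invented.

```python
import statistics

def tukey_outliers(values, outlier_coefficient=1.5):
    """Flag values outside [LV - OC*(UV - LV), UV + OC*(UV - LV)]."""
    lv, _, uv = statistics.quantiles(values, n=4, method="inclusive")  # 25th, 50th, 75th
    spread = uv - lv
    lower = lv - outlier_coefficient * spread
    upper = uv + outlier_coefficient * spread
    return [x for x in values if x < lower or x > upper]

# Hypothetical data
values = [10, 12, 13, 14, 15, 16, 17, 18, 20, 28, 33]
print(tukey_outliers(values))        # OC = 1.5
print(tukey_outliers(values, 2.0))   # larger OC widens the range, so fewer outliers
```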

Handling Bad Data

Handling Outliers

Once outliers are found what are we going to do about it?

The answer to this question may be different based on

The nature of the project

The cause of the outlier

Remove outliers

Are outliers data entry errors?

Are the outliers due to entries being “out of place”?

Entered into the wrong data set

Keep outliers

Are the outliers actually legitimate responses that simply have extreme values?

Recoding Outliers in Statistica (Data Set: Excel File)

Go to the “Find Feature” key-in field

Type “Recode” and select “Recode Outliers”

Then select the variables to be inspected.

Statistica will automatically pick up the Measurement Type from the data set (continuous / categorical)

You should make sure that “Create a new spreadsheet” is selected

By selecting the Measurement, Test, Parameters, Type, and Values field – you can customize the setting from a pull-down.

Recoding Outliers in Statistica

Measurement

Continuous

Categorical

Test

Categorical

Normal, Grubbs, Percentile, and Tukey

One-side upper

One-side lower

Two-side

Recoding Outliers in Statistica

Once you have selected the “Test” you can then re-set the “Parameter” values accordingly

Normal (1-10): larger values more conservative

Grubbs (0-1): smaller values more conservative

Percentile (0-100):

One sided lower

Smaller value more conservative

One sided higher

Larger value more conservative

Tukey (1-5): larger value more conservative

Recoding Outliers in Statistica

Default values

Normal (1-10): 3

Grubbs (0-1): 0.05

Percentile (0-100): 99

Tukey (1-5): 1.5

Type

No Recode

Recode to MD (missing value)

Recode to Value (then you have to set the Value)

Recode to Mean

Recode to Mode

Recode to Percentile

Recode to Boundary

Recoding Outliers in Statistica

Here I have used the “raw” data set – not cleaned in Excel

I have selected “Repeat until all outliers have been recoded”

I have set all of the parameters to

Test: Tukey Two Sided

Parameter: 1.5

Type: MD (removed)

I will mark the observations with “!”

And, I will generate a new spreadsheet

Recoding Outliers in Statistica

Statistica ran through the data and generated a new spreadsheet

The observations with outliers are marked with the “!”

I will rerun the analysis adjusting the Tukey test to be more conservative

I will increase the parameter to 2.0

Recoding Outliers in Statistica

As you can see, setting the Tukey parameter to 2.0 (increasing the range for “valid” responses) decreased the number of outliers detected.

Hence, you should begin your analysis being conservative (a larger parameter value, so fewer points are flagged) and then slowly decrease the value if more outliers need to be flagged

Also, feel free to try other techniques to see which one gives you the most acceptable response

We would like to err on the side of leaving more data in the data set

As long as you are always generating a new spreadsheet you can try other methods and decide on which is best

Parameter = 1.5

Parameter = 2.0

Merging Categories

Back To The Credit Scoring Data Set

Number of previous credits at this bank

When we last looked at the data we noted that the categories 5-6 and 7 or more were quite small

We suggested that we may want to combine these into a larger category, 5 or more

Recoding Items

We go to the feature finder and find “Recode”

(Data->Recode)

Select the “variable” as in past exercises

Type in the changes

Select OK

Recoding

‘Number of previous credits at this bank’ = ‘5- 6’

‘Number of previous credits at this bank’ = ‘7 or more’

To

5 or more

Recoding Items

We have shifted the observations into the new category 5 or more.

We now need to go and delete the old labels

5-6

7 or more

Recoding Items

Make sure the data is visible (you may have to go to view and select cascade windows)

Select Text Labels – here we can delete the unused labels and renumber the ones that remain

Recoding Items

Delete the now “empty” rows in the label editor

Recoding Items

Change the numeric value of the “5 or more” text label to 3

Select “OK”

Look at the new histogram

Recoding Items

You have successfully updated the data
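The same merge of small categories can be expressed as a simple mapping. In this sketch the labels "5-6", "7 or more" and "5 or more" come from the example; the other labels and the list of values are hypothetical.

```python
from collections import Counter

# Old labels mapped to the merged category
RECODE_MAP = {"5-6": "5 or more", "7 or more": "5 or more"}

def recode(values, mapping):
    """Replace labels found in the mapping; leave all other labels unchanged."""
    return [mapping.get(value, value) for value in values]

# Hypothetical values for 'Number of previous credits at this bank'
previous_credits = ["1", "2-4", "1", "5-6", "7 or more", "2-4", "5-6"]
recoded = recode(previous_credits, RECODE_MAP)
counts = Counter(recoded)
print(counts)
```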

Missing Data

Data Cleaning – Missing Data

Missing data is a reality for most data mining projects

It can cause issues for some data mining techniques

Some data mining techniques cannot handle missing data

As a result cases (rows) that contained missing data are ignored when training a model

This may result in cases with valuable information in other areas being ignored

Results in a potential loss of information

Can bias the results (when systematic relationships exist between missing data and the dependent variable)

Some techniques for replacing missing values include

Mean replacement

Median replacement

Replacement with specific values

However, this can result in an artificial decrease in variance in the data which can also impact correlation
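The variance effect is easy to demonstrate: filling gaps with the mean adds points exactly at the center of the distribution, so the spread shrinks. The ages below are invented, with `None` standing for a missing value.

```python
import statistics

def impute_mean(values):
    """Replace None (missing) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

# Hypothetical ages with three missing values
ages = [38, None, 18, 53, None, 28, 31, None, 22, 49]
observed = [v for v in ages if v is not None]
filled = impute_mean(ages)

variance_before = statistics.variance(observed)
variance_after = statistics.variance(filled)
print(f"variance before: {variance_before:.1f}, after: {variance_after:.1f}")
```

The mean is unchanged, but the variance of the filled column is smaller than the variance of the observed values.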

K Nearest Neighbors

In this case the user selects a value for K

Statistica then selects the “K” cases in the data set that are most similar to the one with missing data and uses this information to impute the missing observations

Each missing data case is filled in with a value found specifically for that case

Statistica – Process Missing Data Tool

Process missing data can be found on the Data Tab under “Filter/Recode” – OR – it can be accessed via the Feature Finder

Select the variables

Then you can specify the action you want Statistica to take

Recode MD to Value

Recode MD to Mean

Recode MD to Median

Recode MD to Mode

Flag MD

Recode MD to Value and Flag

Note it is wise to always select “Create new spreadsheet”

Statistica – Process Missing Data Tool

In the “Additional MD Values” field you can put values that may “represent” a missing value

For example, you may have “N/A” in some fields that represents – not available

In this case the N/A would be replaced based on the technique specified
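
As a rough illustration of the same idea outside Statistica, the sketch below treats placeholder strings as missing, recodes them to the mode, and keeps a flag column. The sample column and the names `additional_md_values` and `md_flag` are hypothetical.

```python
import statistics

# Hypothetical raw column in which "N/A" and "" stand in for missing data.
raw = ["rent", "own", "N/A", "rent", "", "rent", "own"]
additional_md_values = {"N/A", ""}

# Step 1: treat the listed placeholder values as missing (None).
cleaned = [None if v in additional_md_values else v for v in raw]

# Step 2: recode missing data to the mode of the observed values.
observed = [v for v in cleaned if v is not None]
mode_value = statistics.mode(observed)
filled = [mode_value if v is None else v for v in cleaned]

# Flag column recording which cases were originally missing.
md_flag = [v is None for v in cleaned]
print(filled, md_flag)
```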

K Nearest Neighbor

In Statistica –

Go to the Filter/Recode option on the “Data” tab

As with any other feature, this can be located with “Feature Finder”

Go to “MD Imputation” where “MD” stands for missing data

K Nearest Neighbor

Here we select the variable with missing data as the “Target Variable” and we also select the “Input Variables” that we feel would likely be able to predict the missing values

In the example below, Age has missing values

We select “Employed by Current Employer”, “Living in Current Household” and “Type of Apartment” as good predictors of “Age”

K Nearest Neighbor

Select “OK”

Statistica then generates a new spreadsheet with data filled in for the missing data based on inferring the values from “similar” cases in the data set.

Example

I saved CreditScoring as an Excel Spreadsheet (Age)

I then deleted “Ages” for cases in the spreadsheet at random (Age – MD)

I loaded Age-MD into Statistica and used K Nearest Neighbor to impute values for the missing data

Age-K3 uses a value of K=3 to compute the missing value

Age-K6 uses a value of K=6 to compute the missing value

Age    Age - MD     Age-K3       Age-K6
38     (missing)    31           29.66667
42     (missing)    25.333333    31.33333
18     (missing)    32           31.5
53     (missing)    38.333333    29.83333
46     (missing)    26.666667    27.83333
28     (missing)    26.666667    29.33333
31     (missing)    31.333333    32.5
37     (missing)    26.666667    31.16667
22     (missing)    31           29.33333
28     (missing)    25.666667    28.16667
49     (missing)    35           34.5
28     (missing)    27           26.33333
28     (missing)    35.333333    31.5
36     (missing)    21.333333    29.83333
45     (missing)    32           31.83333

Summary

We have looked at several techniques to address missing values

We have examined simple techniques of replacing missing values with defined values – mean, median, mode, specific values

We have examined K nearest neighbor that infers a value for missing data by looking at “similar” cases in the data set

The important thing is – now we have addressed missing data and hopefully we can move on to the analysis phase without losing information

QUESTIONS?

[Figure: Histogram of Number of previous credits at this bank. CreditScoring 19v*1000c. Categories: one, 2- 4, 5- 6, 7 or more. Y axis: No of obs (0 to 700). Fitted curve: 1000*1*Normal(Location=1.407, Scale=0.5777)]

[Figure: Histogram of Number of previous credits at this bank. CreditScoring - Copy 19v*1000c. Categories: one, 2- 4, 5- 6, 7 or more, 5 or more. Y axis: No of obs (0 to 700). Fitted curve: 1000*1*Normal(Location=1.469, Scale=0.811)]

[Figure: Histogram of Number of previous credits at this bank. CreditScoring - Copy 19v*1000c. Categories: one, 2- 4, 5 or more. Y axis: No of obs (0 to 700). Fitted curve: 1000*1*Normal(Location=1.401, Scale=0.5554)]

Big Data Analytics Tools./Final Exam materials/Statistical Concepts.pptx

Statistical Concepts

Mean

When people talk about statistical averages, they are referring to the mean.

To calculate the mean, simply add all of your numbers together.

Next, divide the sum by however many numbers you added.

The result is your mean or average score.

Median

The median is the middle value in a data set.

To calculate it, place all of your numbers in increasing order.

If you have an odd number of values, the median is the middle number on your list. If you have an even number of values, it is the average of the two middle numbers.

Mode

In statistics, the mode of a list of numbers is the value (or values) that occurs most frequently.

Unlike the median and mean, the mode is about the frequency of occurrence.

There can be more than one mode or no mode at all; it all depends on the data set itself.
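
A minimal sketch of all three measures using Python's standard `statistics` module. The list `scores` is made up for illustration. The last line shows the even-count case for the median, where the two middle values are averaged.

```python
import statistics

scores = [4, 8, 6, 5, 3, 8, 9]

mean_score = statistics.mean(scores)      # sum of the values divided by their count
median_score = statistics.median(scores)  # middle value after sorting
modes = statistics.multimode(scores)      # list of the most frequent value(s)

# With an even number of values the median averages the two middle values.
even_median = statistics.median([1, 3, 5, 7])

print(mean_score, median_score, modes, even_median)
```

Note that `multimode` returns every value when all of them occur equally often, which corresponds to the "no mode at all" situation.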

Quartiles

The median divides the data into a lower half and an upper half.

The lower quartile is the middle value of the lower half.

The upper quartile is the middle value of the upper half.
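
The definition above can be written directly as code. This is a sketch of the median-of-halves rule as described here. Other quartile conventions exist and can give slightly different numbers, and the helper name `quartiles` is hypothetical.

```python
import statistics

def quartiles(values):
    """Lower quartile, median, upper quartile via the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count the middle value belongs to neither half.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

q1, q2, q3 = quartiles([7, 15, 36, 39, 40, 41])
print(q1, q2, q3)
```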

Standard Deviation

Standard deviation (SD, also represented by the Greek letter sigma σ or the Latin letter s) is a measure that is used to quantify the amount of variation or dispersion of a set of data values.

A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.

The standard deviation of a random variable, statistical population, data set, or probability distribution is the square root of its variance.
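
A short worked example of that relationship, using a made-up list of values and the population form of the variance (dividing by the number of values):

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
# Population variance: average squared distance from the mean.
variance = sum((v - mean) ** 2 for v in values) / len(values)
# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)
```

The library function `statistics.pstdev` gives the same population standard deviation; `statistics.stdev` divides by one less than the count and is the usual choice for a sample.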

Standard Deviation

Plot of a normal distribution.

Each band of the normal distribution has a width of 1 standard deviation.

Variance

Covariance

Covariance is a measure of the joint variability of two random variables.

If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e., the variables tend to show similar behavior, the covariance is positive.

In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative.

The sign of the covariance therefore shows the tendency in the linear relationship between the variables.

Covariance

The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables.

The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.
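
The scale dependence of covariance, and the way correlation removes it, can be checked numerically. This is a minimal sketch with made-up data. The helper names `covariance` and `correlation` are defined here for illustration and use the sample (n - 1) form.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]
score_x10 = [s * 10 for s in score]  # same data expressed in different units

# Covariance changes with the units; correlation does not.
print(covariance(hours, score), covariance(hours, score_x10))
print(correlation(hours, score), correlation(hours, score_x10))
```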

Correlation / Dependence

Dependence or association is any statistical relationship, whether causal or not, between two random variables or bivariate data.

Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to how close two variables are to having a linear relationship with each other.

Statistical Test Theory

In statistics, a null hypothesis is a statement that one seeks to nullify with evidence to the contrary.

Most commonly it is a statement that the phenomenon being studied produces no effect or makes no difference.

An example of a null hypothesis is the statement “This diet has no effect on people's weight.”

Usually, an experimenter frames a null hypothesis with the intent of rejecting it: that is, intending to run an experiment which produces data that shows that the phenomenon under study does make a difference.

In some cases there is a specific alternative hypothesis that is opposed to the null hypothesis, in other cases the alternative hypothesis is not explicitly stated, or is simply "the null hypothesis is false" – in either event, this is a binary judgment, but the interpretation differs and is a matter of significant dispute in statistics.

Errors

A type I error (or error of the first kind) is the incorrect rejection of a true null hypothesis. Usually a type I error leads one to conclude that a supposed effect or relationship exists when in fact it doesn't.

A type II error (or error of the second kind) is the failure to reject a false null hypothesis.

Errors - Examples

Examples of type I errors include a test that shows a patient to have a disease when in fact the patient does not have the disease, a fire alarm going off indicating a fire when in fact there is no fire, or an experiment indicating that a medical treatment should cure a disease when in fact it does not.

Examples of type II errors would be a blood test failing to detect the disease it was designed to detect, in a patient who really has the disease; a fire breaking out and the fire alarm does not ring; or a clinical trial of a medical treatment failing to show that the treatment works when really it does.

Errors

If the probability of obtaining a result as extreme as the one obtained, supposing that the null hypothesis were true, is lower than a pre-specified cut-off probability (for example, 5%), then the result is said to be statistically significant and the null hypothesis is rejected.

In English – if there is less than a 5% chance of seeing a result this extreme when the null hypothesis is actually true, then the result is statistically significant

p-values

In statistical hypothesis testing, the p-value or probability value is the probability for a given statistical model that, when the null hypothesis is true, the statistical summary (such as the sample mean difference between two compared groups) would be the same as or of greater magnitude than the actual observed results.

The use of p-values in statistical hypothesis testing is common in many fields of research.

The p-value is used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence.

Null hypothesis testing is a reductio ad absurdum argument adapted to statistics.

In essence, a claim is assumed valid if its counter-claim is improbable.

p-values

The “typical” cutoff is 5% (p<=0.05).

What we are looking for are values less than or equal to 5%

In Statistica

So in Statistica, when I am looking for likely variables to be used as predictors (independent variables) for some dependent variable, I run an analysis. My hypotheses for each independent variable X and its influence on the dependent variable Y are:

Ho: X does not predict Y

Ha: X does predict Y

When we run the statistical test and check the p-values, we look for p-values below some pre-defined threshold (5% = 0.05). If the p-value is below the 0.05 threshold, then we can reject the null hypothesis (Ho: X does not predict Y) and treat X as a likely predictor of Y.
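
To make the p-value logic concrete, here is a small sketch of an exact permutation test in Python. It is not the test Statistica runs; it is one simple way to compute a p-value for the question "does group membership (X) predict the value (Y)?". The function name and the two groups are hypothetical.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    extreme = total = 0
    # Under Ho the group labels are arbitrary, so try every relabelling.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        total += 1
    return extreme / total

# Hypothetical Y values for two levels of X.
group_a = [12, 14, 15, 16, 18]
group_b = [22, 24, 25, 27, 29]

p = permutation_p_value(group_a, group_b)
reject_null = p <= 0.05
print(p, reject_null)
```

The p-value is the share of relabellings that produce a difference at least as large as the one observed. Here the two groups do not overlap at all, so only a tiny share qualifies and Ho is rejected at the 0.05 threshold.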

QUESTIONS?

Big Data Analytics Tools./Final Exam materials/2 BINS 4352 - Big Data Analytics.pptx

Big Data and Analytics

Learning Objectives

Learn what Big Data is and how it is changing the world of analytics

Understand the motivation for and business drivers of Big Data analytics

Become familiar with the wide range of enabling technologies for Big Data analytics

Learn about Hadoop, MapReduce, and NoSQL as they relate to Big Data analytics

Understand the role of and capabilities/ skills for data scientist as a new analytics profession

Learning Objectives

Compare and contrast the complementary uses of data warehousing and Big Data

Become familiar with the vendors of Big Data tools and services

Understand the need for and appreciate the capabilities of stream analytics

Learn about the applications of stream analytics

The Data Size Is Getting Big, Bigger, …

Hadron Collider - 1 PB/sec

Boeing jet - 20 TB/hr

Facebook - 500 TB/day.

YouTube – 1 TB/4 min.

The proposed Square Kilometer Array telescope (the world’s proposed biggest telescope) – 1 EB/day

Big Data - Definition and Concepts

Big Data is a misnomer!

Big Data is more than just “big”

The Vs that define Big Data

Volume

Variety

Velocity

Variability

Value

A High-level Conceptual Architecture for Big Data Solutions (by AsterData / Teradata)

Fundamentals of Big Data Analytics

Big Data by itself, regardless of the size, type, or speed, is worthless

Big Data + “big” analytics = value

With the value proposition, Big Data also brought about big challenges

Effectively and efficiently capturing, storing, and analyzing Big Data

New breed of technologies needed (developed, purchased, hired, or outsourced …)

Big Data Considerations

You can’t process the amount of data that you want to because of the limitations of your current platform.

You can’t include new/contemporary data sources (e.g., social media, RFID, Sensory, Web, GPS, textual data) because it does not comply with the data storage schema

You need to (or want to) integrate data as quickly as possible to be current on your analysis.

You want to work with a schema-on-demand data storage paradigm because of the variety of data types involved.

The data is arriving so fast at your organization’s doorstep that your traditional analytics platform cannot handle it.

Critical Success Factors for Big Data Analytics

A clear business need (alignment with the vision and the strategy)

Strong, committed sponsorship (executive champion)

Alignment between the business and IT strategy

A fact-based decision-making culture

A strong data infrastructure

The right analytics tools

Right people with right skills

Critical Success Factors for Big Data Analytics

Enablers of Big Data Analytics

In-memory analytics

Storing and processing the complete data set in RAM

In-database analytics

Placing analytic procedures close to where data is stored

Grid computing & MPP

Use of many machines and processors in parallel (MPP- massively parallel processing)

Appliances

Combining hardware, software and storage in a single unit for performance and scalability

Challenges of Big Data Analytics

Data volume

The ability to capture, store, and process the huge volume of data in a timely manner

Data integration

The ability to combine data quickly and at reasonable cost

Processing capabilities

The ability to process the data quickly, as it is captured (i.e., stream analytics)

Data governance (… security, privacy, access)

Skill availability (… data scientist)

Solution cost (ROI)

Business Problems Addressed by Big Data Analytics

Process efficiency and cost reduction

Brand management

Revenue maximization, cross-selling/up-selling

Enhanced customer experience

Churn identification, customer recruiting

Improved customer service

Identifying new products and market opportunities

Risk management

Regulatory compliance

Enhanced security capabilities

Big Data Technologies

MapReduce …

Hadoop …

Hive

Pig

HBase

Flume

Oozie

Ambari

Avro

Mahout, Sqoop, HCatalog, …

Big Data Technologies MapReduce

MapReduce distributes the processing of very large multi-structured data files across a large cluster of ordinary machines/processors

Goal - achieving high performance with “simple” computers

Developed and popularized by Google

Good at processing and analyzing large volumes of multi-structured data in a timely manner

Example tasks: indexing the Web for search, graph analysis, text analysis, machine learning, …

Big Data Technologies MapReduce

How does MapReduce work?

Count the occurrences of the different shapes

Raw Data is split into multiple parts

“Parts” are processed independently

Results are aggregated
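
The three steps can be sketched on a single machine in plain Python. This is only an illustration of the split / map / reduce pattern, not Hadoop code; the list of shapes and the function names are hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical raw data: a stream of shapes to be counted.
raw_data = ["square", "circle", "triangle", "circle", "star",
            "square", "circle", "star", "triangle", "square"]

def split(data, n_parts):
    """Split the raw data into roughly equal parts."""
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_part(part):
    """Map step: count shapes within one part, independently of the others."""
    return Counter(part)

def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

parts = split(raw_data, 3)
partial_counts = [map_part(p) for p in parts]
totals = reduce(reduce_counts, partial_counts, Counter())
print(totals)
```

Because each part is counted without reference to the others, the map calls could run on different machines; only the small partial counts need to be brought together for the reduce step.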

Big Data Technologies Hadoop

Hadoop is an open source framework for storing and analyzing massive amounts of distributed, unstructured data

Originally created by Doug Cutting at Yahoo!

Hadoop clusters run on inexpensive commodity hardware so projects can scale-out inexpensively

Hadoop is now part of Apache Software Foundation

Open source - hundreds of contributors continuously improve the core technology

MapReduce + Hadoop = Big Data core technology

Big Data Technologies Hadoop

How Does Hadoop Work?

Access unstructured and semi-structured data (e.g., log files, social media feeds, other data sources)

Break the data up into “parts,” which are then loaded into a file system made up of multiple nodes running on commodity hardware using HDFS

Each “part” is replicated multiple times and loaded into the file system for failsafe processing

A node acts as the Facilitator and another as Job Tracker

Jobs are distributed to the clients, and once completed the results are collected and aggregated using MapReduce
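
The splitting and replication steps can be pictured with a toy sketch. This is not HDFS code: the round-robin placement, the block size, and the names `place_blocks` and `nodes` are assumptions made for illustration, and real HDFS placement is more involved. The replica count of 3 matches the usual HDFS default.

```python
def place_blocks(data, block_size, nodes, replicas=3):
    """Split data into blocks and assign each block to `replicas` distinct nodes.

    Assumes replicas <= len(nodes) so the chosen nodes are distinct.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for b in range(len(blocks)):
        # Simple round-robin placement (an assumption of this sketch).
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replicas)]
    return blocks, placement

nodes = ["node1", "node2", "node3", "node4"]
data = "log line " * 10
blocks, placement = place_blocks(data, block_size=32, nodes=nodes)

# Simulate losing one node: every block should still have a surviving copy.
failed = "node1"
still_readable = all(any(n != failed for n in holders) for holders in placement.values())
print(placement, still_readable)
```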

Big Data Technologies Hadoop

Hadoop Technical Components

Hadoop Distributed File System (HDFS)

Name Node (primary facilitator)

Secondary Node (backup to Name Node)

Job Tracker

Slave Nodes (the grunts of any Hadoop cluster)

Additionally, the Hadoop ecosystem is made up of a number of complementary sub-projects: NoSQL (Cassandra, HBase), DW (Hive), …

NoSQL = not only SQL

Big Data Technologies Hadoop - Demystifying Facts

Hadoop consists of multiple products

Hadoop is open source but available from vendors, too

Hadoop is an ecosystem, not a single product

HDFS is a file system, not a DBMS

Hive resembles SQL but is not standard SQL

Hadoop and MapReduce are related but not the same

MapReduce provides control for analytics, not analytics

Hadoop is about data diversity, not just data volume.

Hadoop complements a DW; it’s rarely a replacement.

Hadoop enables many types of analytics, not just Web analytics.

Big Data And Data Warehousing

What is the impact of Big Data on DW?

Big Data and RDBMS do not go nicely together

Will Hadoop replace data warehousing/RDBMS?

Use Cases for Hadoop

Hadoop as the repository and refinery

Hadoop as the active archive

Use Cases for Data Warehousing

Data warehouse performance

Integrating data that provides business value

Interactive BI tools

Hadoop versus Data Warehouse When to Use Which Platform

Coexistence of Hadoop and DW

Use Hadoop for storing and archiving multi-structured data

Use Hadoop for filtering, transforming, and/or consolidating multi-structured data

Use Hadoop to analyze large volumes of multi-structured data and publish the analytical results

Use a relational DBMS that provides MapReduce capabilities as an investigative computing platform

Use a front-end query tool to access and analyze data

Coexistence of Hadoop and DW

Big Data Vendors

Big Data vendor landscape is developing very rapidly

A representative list would include

Cloudera - cloudera.com

MapR – mapr.com

Hortonworks - hortonworks.com

Also, IBM (Netezza, InfoSphere), Oracle (Exadata, Exalogic), Microsoft, Amazon, Google, …

Software,

Hardware,

Service, …

Top 10 Big Data Vendors with Primary Focus on Hadoop

How to Succeed with Big Data

Simplify

Coexist

Visualize

Empower

Integrate

Govern

Evangelize

Big Data And Stream Analytics

Data-in-motion analytics and real-time data analytics

One of the Vs in Big Data = Velocity

Analytic process of extracting actionable information from continuously flowing/streaming data
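
A minimal sketch of the idea in Python: readings are examined one at a time as they arrive, only a short window of recent values is kept, and unusual readings are flagged immediately. The function name, window size, threshold, and meter readings are all hypothetical.

```python
from collections import deque

def rolling_alerts(stream, window=3, threshold=1.5):
    """Yield (reading, baseline, is_anomaly) as each reading arrives."""
    recent = deque(maxlen=window)  # only the last `window` readings are stored
    for reading in stream:
        if len(recent) == window:
            baseline = sum(recent) / window
            is_anomaly = reading > threshold * baseline
        else:
            # Not enough history yet to judge the reading.
            baseline, is_anomaly = None, False
        yield reading, baseline, is_anomaly
        recent.append(reading)

# Hypothetical smart-meter readings arriving one at a time.
meter_readings = [10, 11, 9, 10, 30, 10, 11]
alerts = [r for r, _, flagged in rolling_alerts(meter_readings) if flagged]
print(alerts)
```

The generator never holds the full history, which is the point of analyzing data in motion rather than storing everything first.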

Why Stream Analytics?

It may not be feasible to store the data, or the data may lose its value

Stream Analytics Versus Perpetual Analytics

Critical Event Processing?

Stream Analytics A Use Case in Energy Industry

Stream Analytics Applications

e-Commerce

Telecommunication

Law Enforcement and Cyber Security

Power Industry

Financial Services

Health Services

Government

Data Scientist

Data Scientist

Data Scientist = Big Data guru

One with skills to investigate Big Data

Very high salaries, very high expectations

Where do Data Scientists come from?

M.S./Ph.D. in MIS, CS, IE,… and/or Analytics

Very few specific degree programs for DS

PE, PML, … DSP (Data Science Professional)

Skills That Define a Data Scientist

Questions

[Figure labels: Math and Stats; Data Mining; Business Intelligence; Applications; Languages; Marketing]

[Figure: Unified Data Architecture, System Conceptual View. Components: Analytic Tools & Apps; Users; Discovery Platform; Integrated Data Warehouse; Data Platform; Access; Manage; Move; Event Processing. Users: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts. Big Data Sources: ERP, SCM, CRM, Images, Audio and Video, Machine Logs, Text, Web and Social]

[Figure: Keys to Success with Big Data Analytics. A clear business need; Strong, committed sponsorship; Alignment between the business and IT strategy; A fact-based decision-making culture; A strong data infrastructure; The right analytics tools; Personnel with advanced analytical skills]

[Figure: Raw Data, Map Function, Reduce Function. Values shown: 4, 3, 3, 3, 3]

[Chart axis: $0 to $70 in steps of $10]

[Figure: Stream analytics in the energy industry. Inputs: Sensor Data (Energy Production System Status); Meteorological Data (Wind, Light, Temperature, etc.); Usage Data (Smart Meters, Smart Grid Devices). Components: Energy Production System (Traditional and Renewable); Energy Consumption System (Residential and Commercial); Data Integration and Temporary Staging; Permanent Storage Area; Streaming Analytics (Predicting Usage, Production and Anomalies). Outputs: Capacity Decisions; Pricing Decisions]

[Figure: Data Scientist skills. Curiosity and Creativity; Internet and Social Media/Social Networking Technologies; Programming, Scripting and Hacking; Data Access and Management (both traditional and new data systems); Domain Expertise, Problem Definition and Decision Modeling; Communication and Interpersonal]
