big data and data scientist

vamshi
Lesson3Slides.pdf

Copyright © 2014 EMC Corporation. All rights reserved.

Advanced Analytics - Theory and Methods

Module 4: Analytics Theory/Methods

Advanced Analytics – Theory and Methods

Upon completion of this module, you should be able to:

• Examine analytic needs and select an appropriate technique based on business objectives, initial hypotheses, and the data's structure and volume
• Apply some of the more commonly used methods in analytics solutions
• Explain the algorithms and the technical foundations for the commonly used methods
• Explain the environment (use case) in which each technique can provide the most value
• Use appropriate diagnostic methods to validate the models created
• Use R and in-database analytical functions to fit, score and evaluate models

The objectives of this module are listed. The analytical methods covered are:

Categorization (unsupervised):
1. K-means clustering
2. Association Rules

Regression:
3. Linear
4. Logistic

Classification (supervised):
5. Naïve Bayesian classifier
6. Decision Trees

7. Time Series Analysis
8. Text Analysis

Where “R” we?

• In Module 3 we reviewed R skills and basic statistics

• You can use R to:
  - Generate summary statistics to investigate a data set
  - Visualize data
  - Perform statistical tests to analyze data and evaluate models

• Now that you have data, and you can see it, you need to plan the analytic model and determine the analytic method to be used

Module 4 focuses on the most commonly used analytic methods, detailing:

a) Prominent use cases for the method

b) Algorithms to implement the method

c) Diagnostics that are most commonly used to evaluate the effectiveness of the method

d) The Reasons to Choose (+) and Cautions (-) (where the method is most and least effective)

Applying the Data Analytics Lifecycle

[Figure: the Data Analytics Lifecycle, with phases Discovery, Data Prep, Model Planning, Model Building, Communicate Results, Operationalize]

• In a typical Data Analytics Problem - you would have gone through:

• Phase 1 – Discovery - have the problem framed

• Phase 2 – Data Preparation - have the data prepared

• Now you need to plan the model and determine the method to be used.

Here we recall the phases of the analytic lifecycle that we would have completed before planning which analytic method to use with the data.

Phase 3 - Model Planning

[Figure: the Data Analytics Lifecycle, with phases Discovery, Data Prep, Model Planning, Model Building, Communicate Results, Operationalize]

Do I have a good idea about the type of model to try? Can I refine the analytic plan?

Is the model robust enough? Have we failed for sure?

How do people generally solve this problem with the kind of data and resources I have?

• Does that work well enough? Or do I have to come up with something new?

• What are related or analogous problems? How are they solved? Can I do that?

Model planning is the process of determining the appropriate analytic method based on the problem. It also depends on the type of data and the computational resources available.

What Kind of Problem do I Need to Solve? How do I Solve it?

The Problem to Solve | The Category of Techniques | Covered in this Course
I want to group items by similarity. I want to find structure (commonalities) in the data | Clustering | K-means clustering
I want to discover relationships between actions or items | Association Rules | Apriori
I want to determine the relationship between the outcome and the input variables | Regression | Linear Regression, Logistic Regression
I want to assign (known) labels to objects | Classification | Naïve Bayes, Decision Trees
I want to find the structure in a temporal process. I want to forecast the behavior of a temporal process | Time Series Analysis | ACF, PACF, ARIMA
I want to analyze my text data | Text Analysis | Regular expressions, Document representation (Bag of Words), TF-IDF

This table lists the typical business questions (column 1) addressed by each category of techniques or analytical methods (column 2).

Some typical business questions for the different categories of techniques are listed below:

Clustering: How do I group these documents by topic? How do I group these images by similarity? (More businesslike questions)

Association Rules: What do other people like this person tend to like/buy/watch?

Regression: I want to predict the lifetime value of this customer. I want to predict the probability that this loan will default.

Classification: Where in the catalog should I place this product? Is this email spam?

Time Series Analysis: What is the likely future price of this stock? What will my sales volume be next month?

Text Analysis: Is this a positive product review or a negative one?

As can be observed, these categories of techniques overlap in the types of problems they can be used to solve.

Questions such as "How do I group these documents?", "Is this email spam?", and "Is this a positive product review?" can all be answered with clustering or classification. But these questions can also be considered text analysis problems, which we cover in this module. Text analysis is the term for the specific process of representing, manipulating, and predicting or learning over text. The tasks themselves can often be classified as clustering or classification.

Similarly, more than one method can be used to solve the same problem. For example, Time Series Analysis can be used to predict prices over time. Time series analysis is used in cases where the past is observable to the participants, which is often true of stocks and real estate. Sometimes we can use regression methods as well. However, regression is most effective when assigning effects to complicated patterns of treatment.

Column 3 in the table above lists the specific analytical methods that are detailed in the subsequent lessons in this module.

Why These Example Techniques?

• Most popular, frequently used:
  - Provide the foundation for Data Science skills on which to build
• Relatively easy for new Data Scientists to understand
• Applicable to a broad range of problems in several verticals

In this module we present K-means clustering, the Apriori algorithm for association rules, linear and logistic regression, classification with the Naïve Bayesian method and decision trees, time series analysis with Box-Jenkins ARIMA modeling, and key text analysis concepts such as regular expressions, document representation with “bag of words”, and TF-IDF.

These techniques were chosen from among the many available to Data Scientists for solving analytic problems. The reasons for choosing them are listed on this slide.

Module 4: Advanced Analytics – Theory and Methods

During this lesson the following topics are covered:

• Clustering – Unsupervised learning method

• K-means clustering:
  - Use cases
  - The algorithm
  - Determining the optimum value for K
  - Diagnostics to evaluate the effectiveness of the method
  - Reasons to Choose (+) and Cautions (-) of the method

Lesson 1: K-means Clustering

This lesson covers K-means clustering with these topics.

Clustering

How do I group these documents by topic?

How do I group my customers by purchase patterns?

• Sort items into groups by similarity:
  - Items in a cluster are more similar to each other than they are to items in other clusters.
  - Need to detail the properties that characterize “similarity”
  - Or of distance, the "inverse" of similarity

• Not a predictive method; finds similarities, relationships

• Our Example: K-means Clustering

In machine learning, “unsupervised” refers to the problem of finding hidden structure within unlabeled data. In this lesson and the following lesson we discuss two unsupervised learning methods: clustering and Association Rules.

Clustering is a popular method used to form homogeneous groups within a data set based on their internal structure. It is often used for exploratory analysis of the data. Clustering makes no “predictions” of any values; it only finds similarities between data points and groups them into clusters.

The notion of similarity can be explained with the following examples.

Consider questions such as:

1. How do I group these documents by topic?

2. How do I perform customer segmentation to allow for targeted or special marketing programs?

The definition of “similarity” is specific to the problem domain. Here, similar data points are documents with the same “topic” tag, or customers who can be profiled into the same “age group/income/gender” or “purchase pattern”.

If we have a vector of measurements of attributes of the data, the data points that are grouped into a cluster will have measurement values closer to each other than to those of data points grouped in a different cluster. In other words, the distance (an inverse of similarity) between points within a cluster is generally lower than the distance between points in different clusters. A cluster ends up as a tight (homogeneous) group of data points that are far apart from the data points in other clusters.

There are many clustering techniques. In this lesson we discuss one of the most popular clustering methods, known as “K-means clustering”.

K-Means Clustering - What is it?

• Used for clustering numerical data, usually a set of measurements about objects of interest.

• Input: numerical. There must be a distance metric defined over the variable space.
  - Euclidean distance
• Output: The centers of each discovered cluster, and the assignment of each input datum to a cluster.
  - Centroid

K-means clustering is used to cluster numerical data.

In K-means we define two measures of distance: the distance between two data points (records) and the distance between two clusters. Distance can be measured (calculated) in a number of ways, but four principles tend to hold true.

1. Distance is not negative (it is stated as an absolute value).

2. Distance from one record to itself is zero.

3. Distance from record I to record J is the same as the distance from record J to record I. Again, since the distance is stated as an absolute value, the start and end points can be reversed.

4. Distance between two records cannot be greater than the sum of the distances between each record and a third record.

Euclidean distance is the most popular method for calculating distance. It is the “ordinary” straight-line distance between two points that one could measure with a ruler. In a single dimension, the Euclidean distance is the absolute value of the difference between two points. In a plane with $p_1$ at $(x_1, y_1)$ and $p_2$ at $(x_2, y_2)$, it is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.

In N dimensions, the Euclidean distance between two points p and q is $\sqrt{\sum_{i=1}^{N} (p_i - q_i)^2}$, where $p_i$ (or $q_i$) is the coordinate of p (or q) in dimension i.

Though there are many other distance measures, the Euclidean distance is the most commonly used, and many packages use it.

Euclidean distance has some limitations. First, it is influenced by the scale of the variables: changing the scale (for example, from feet to inches) can significantly influence the results. Second, the equation ignores the relationship between variables. Lastly, the clustering algorithm is sensitive to outliers. If the data has outliers and removing them is not possible, the results of the clustering can be substantially distorted.
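The following minimal Python sketch illustrates the N-dimensional distance formula and its sensitivity to scale. The records and the helper names `euclidean` and `length_in_inches` are hypothetical, chosen only for illustration; they are not from the course material.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def length_in_inches(record):
    """Rescale only the first variable from feet to inches; leave the second unchanged."""
    return (record[0] * 12, record[1])


# Hypothetical records: (length in feet, weight in pounds).
a = (6.0, 150.0)
b = (5.0, 152.0)
c = (6.0, 155.0)

# With length measured in feet, b is the nearer neighbour of a.
d_ab_feet = euclidean(a, b)
d_ac_feet = euclidean(a, c)

# With length measured in inches, c becomes the nearer neighbour of a.
d_ab_inches = euclidean(length_in_inches(a), length_in_inches(b))
d_ac_inches = euclidean(length_in_inches(a), length_in_inches(c))

print(d_ab_feet < d_ac_feet, d_ab_inches < d_ac_inches)
```

Changing the unit of one variable reverses which record is closest to `a`, which is why the choice of scale can change the clusters that are found.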

The centroid is the center of a discovered cluster, and K-means clustering provides it as an output. When the number of clusters is fixed to k, K-means clustering can be formally defined as an optimization problem: find the k cluster centers and assign each object to its nearest cluster center, such that the sum of squared distances from the objects to their assigned cluster centers is minimized.

Use Cases

• Often an exploratory technique:
  - Discover structure in the data
  - Summarize the properties of each cluster
• Sometimes a prelude to classification:
  - "Discovering the classes"
• Examples:
  - The height, weight and average lifespan of animals
  - Household income, yearly purchase amount in dollars, number of household members of customer households
  - Patient record with measures of BMI, HBA1C, HDL

K-means clustering is often used as a lead-in to classification. It is primarily an exploratory technique to discover structure in the data that you might not have noticed before, and a prelude to more focused analysis or decision processes.

Some examples of the sets of measurements on which clustering can be performed are detailed in the slide.

For patient records with measures such as BMI, HBA1C, and HDL, we could cluster patients into groups that define varying degrees of risk of heart disease.

In classification the labels are known, whereas in clustering they are not. Hence clustering can be used to determine the structure in the data and summarize the properties of each cluster in terms of the measured centroids for the group. The clusters can define what the initial classes could be.

In low dimensions we can visualize the clusters. Visualization gets very hard as the number of dimensions increases.

There are many applications of K-means clustering; examples include pattern recognition, classification analysis, artificial intelligence, image processing, and machine vision.

In principle, if you have several objects, each with several attributes, and you want to group the objects based on those attributes, you can apply this algorithm. For Data Scientists, K-means is an excellent tool to understand the structure of data and validate some of the assumptions that domain experts provide about the data. We will look into a specific use case in the following slide.

Use-Case Example – On-line Retailer

LTV – Lifetime Customer Value

Here we present a fabricated example of an on-line retailer. The unique selling point of this retailer is that they make returns simple, on the assumption that this policy encourages use and that “frequent customers are more valuable”. Let us validate this assumption.

We took a sample set of customers clustered on purchase frequency, return rate, and lifetime customer value (LTV).

We define purchase frequency as the number of visits a customer made in a month on average that had a shopping cart transaction.

We can easily see that return rate has an important effect on customer value.

We clustered the customers into 4 groups and then plotted 3 graphs, each showing two of the attributes. The data points are represented in the graphs by a different color for each cluster, and the larger “dot” represents the centroid of the group.

The groups can be defined broadly as follows:

GP1: Visit less frequently, low return rate, moderate LTV (ranked 3rd)

GP2: Visit often, return a lot of their purchases. Lowest avg LTV (counterintuitive)

GP3: Visit often, return things moderately, high LTV (ranked 2nd) (happy medium)

GP4: Visit rarely, don't return purchases. Highest avg LTV

It appears that GP3 is the ideal group – they visit often, return things moderately, and are high value. The next questions are

- Why is it that GP3 is ideal?

- What are the people in these different groups buying?

- Is that affecting LTV?

- Can we raise the LTV of our frequent customers, perhaps by lowering the cost of returns, or by somehow discouraging customers who return goods too frequently?

- Can we encourage GP4 customers to visit more (without lowering their LTV?)

- Are more frequent customers more valuable?

You can see the range of questions that a Data Scientist can address with the initial analysis with k-means clustering.

The Algorithm

1. Choose K; then select K random "centroids". In our example, K=3.

2. Assign records to the cluster with the closest centroid

Step 1 - K-means clustering begins with the data set segmented into K clusters.

Step 2- Observations are moved from cluster to cluster to help reduce the distance from the observation to the cluster centroid.

The Algorithm (Continued)

3. Recalculate the resulting centroids

Centroid: the mean value of all the records in the cluster

4. Repeat steps 2 & 3 until record assignments no longer change

Model Output:

• The final cluster centers

• The final cluster assignments of the training data

Step 3 - When observations are moved to a new cluster, the centroids of the affected clusters need to be recalculated.

Step 4 - This movement and recalculation is repeated until moving observations no longer results in an improvement.

The model output is the final cluster centers and the final cluster assignments for the data.
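The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the course labs: the function name `kmeans`, the `seed` and `max_iter` parameters, and the six sample points are all hypothetical, and the initial centroids are simply K records drawn at random from the data.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, assignments) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: choose K records at random as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each record to the cluster with the closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once record assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recalculate each centroid as the mean of its records.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Hypothetical data: two visibly separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

The two return values correspond to the model output described above: the final cluster centers and the final cluster assignment of each record. Because the result depends on the random starting centroids, it is common to run the procedure several times with different seeds.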

Selecting the appropriate number of clusters, K, can be done upfront if you possess some knowledge on what the right number may be. Alternatively you can try the exercise with different values for K and decide which clusters best suit your needs. Since it is rare that the appropriate number of clusters in a dataset is known, it is good practice to select a few values for k and compare the results.

The first partitioning should be done with the same knowledge used to select the appropriate value of K, for example domain knowledge about the market or industries.

If K was selected without external knowledge, the partitioning can be done without any inputs.

Once all observations are assigned to their closest cluster, the clusters can be evaluated for their “in-cluster dispersion.” Clusters with the smallest average distance are the most homogeneous. We can also examine the distance between clusters and decide whether it makes sense to combine clusters that are located close together. We can also use the distance between clusters to assess how successful the clustering exercise has been. Ideally, the clusters should be well separated rather than located close together.

Picking K

Heuristic: find the "elbow" of the within-sum-of-squares (wss) plot as a function of K.

K: number of clusters

n_i: number of points in the i-th cluster

c_i: centroid of the i-th cluster

x_ij: j-th point of the i-th cluster
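The formula image from the slide did not survive in this text. Using the symbols defined above, the standard within-sum-of-squares definition, which is assumed here to be the one the slide showed, can be written as:

$$\mathrm{WSS} = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \lVert x_{ij} - c_i \rVert^2$$

That is, for every cluster, add up the squared Euclidean distance from each of its points to the cluster centroid, then sum over all K clusters.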

"Elbows" at k=2,4,6

In practice, a value for K is picked based on domain knowledge and the centroids are computed. Then a different K is chosen and the modeling is repeated to observe whether it improves the cohesiveness of the data points within each cluster. If there is no apparent structure in the data, we may have to try multiple values for K. It is an exploratory process.

We present here one of the heuristic approaches used for picking the optimal K for a given dataset. “Within Sum of Squares” (WSS) is a measure of how tight on average each cluster is. For k=1, WSS can be considered the overall dispersion of the data. WSS is primarily a measure of homogeneity. In general, more clusters result in tighter clusters, but having too many clusters is over-fitting. The formula that defines WSS is shown. The graph depicts the value of WSS on the Y-axis and the number of clusters on the X-axis. The graph shown here was generated from the online retailer example data we reviewed earlier. We repeated the clustering for 12 different values of K. Going from one cluster to two produces a significant drop in the value of WSS, since two clusters give more homogeneity. We look for the elbow of the curve, which suggests the optimal number of clusters for the given data.
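To make the behaviour of WSS concrete, here is a small Python sketch that computes it for hand-specified groupings of four hypothetical points. The groupings are written out by hand rather than produced by running K-means, and the function name `wss` is an assumption for illustration.

```python
def wss(clusters):
    """Sum of squared distances from each point to the centroid of its cluster."""
    total = 0.0
    for members in clusters:
        # Centroid: the per-dimension mean of the points in this cluster.
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(point, centroid))
            for point in members
        )
    return total


# Hypothetical data: two tight pairs of points that are far from each other.
left = [(1.0, 1.0), (1.0, 3.0)]
right = [(9.0, 1.0), (9.0, 3.0)]

wss_k1 = wss([left + right])                  # everything in one cluster
wss_k2 = wss([left, right])                   # the two natural groups
wss_k4 = wss([[p] for p in left + right])     # every point is its own cluster

print(wss_k1, wss_k2, wss_k4)
```

The large drop from K=1 to K=2 followed by only a small further gain is the "elbow" pattern: WSS keeps shrinking as K grows and reaches zero when every point is its own cluster, so the smallest WSS alone is not a useful criterion.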

Visualizing the data helps in confirming the optimal number of clusters. Reviewing the three pair-wise graphs we plotted for the online retailer example earlier, you can see that four groups sufficiently explain the data, and from the graph above we can also see that the elbow of the curve is at 4.

Diagnostics – Evaluating the Model

• Do the clusters look separated in at least some of the plots when you do pair-wise plots of the clusters?
  - Pair-wise plots can be used when there are not many variables
• Do you have any clusters with few data points?
  - Try decreasing the value of K
• Are there splits on variables that you would expect, but don't see?
  - Try increasing the value of K
• Do any of the centroids seem too close to each other?
  - Try decreasing the value of K

How do we know that we have good clusters?

Pair-wise plots of the clusters provide a good visual confirmation that the clusters are homogeneous. When the number of dimensions in the data is not large, this method helps in determining the optimal number of clusters. With these plots you should be able to determine whether the clusters look separated in at least some of the plots. They won’t be very separated in all of the plots. This can be seen even with the on-line retailer example we saw earlier: some of the clusters get mixed together in some dimensions.

If you feel that your clusters are too small, it indicates that K is too large and needs to be reduced (try a smaller K). Outliers in the data may tend to form clusters with fewer data points.

Alternatively if you see there are splits that you expected but are not seen in the clusters, for example you expect two different income groups and you don’t see them, you should try a bigger value for K.

If the centroids seem too close to each other then you should try decreasing the value of K.

K-Means Clustering - Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+):
• Easy to implement
• Easy to assign new data to existing clusters
  - Which is the nearest cluster center?
• Concise output
  - Coordinates of the K cluster centers

Cautions (-):
• Doesn't handle categorical variables
• Sensitive to initialization (first guess)
• Variables should all be measured on similar or compatible scales
  - Not scale-invariant!
• K (the number of clusters) must be known or decided a priori
  - Wrong guess: possibly poor results
• Tends to produce "round" equi-sized clusters
  - Not always desirable

K-means clustering is easy to implement and produces concise output. It is easy to assign new data to the existing clusters by determining which centroid the new data point is closest to.
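Assigning a new record to an existing cluster only requires the stored centroids. A minimal Python sketch, assuming two hypothetical centroids and a helper named `nearest_cluster` that is introduced here for illustration:

```python
import math


def nearest_cluster(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


# Hypothetical cluster centers from an earlier K-means run.
centroids = [(1.0, 1.5), (8.5, 8.5)]

first = nearest_cluster((2.0, 1.0), centroids)
second = nearest_cluster((7.0, 9.0), centroids)
print(first, second)
```

No re-clustering is needed for this lookup, which is part of why the output of K-means is described as concise.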

However, K-means works only on numerical data and does not handle categorical variables. It is sensitive to the initial guess of the centroids. It is important that the variables all be measured on similar or compatible scales. If you measure the living space of a house in square feet and the cost of the house in thousands of dollars (that is, 1 unit is $1000), and then you change the cost of the house to dollars (so one unit is $1), the clusters may change. K should be decided ahead of the modeling process. Wrong guesses for K may lead to improper clustering.
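One common way to put variables on compatible scales is z-score standardization. The slides do not prescribe a specific rescaling method, so the sketch below is an assumption; the `standardize` helper and the house prices are hypothetical.

```python
import statistics


def standardize(column):
    """Rescale values to mean 0 and standard deviation 1 (z-scores).

    Assumes the column is not constant; a zero standard deviation would
    cause a division by zero.
    """
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(value - mean) / sd for value in column]


# Hypothetical house costs, first in thousands of dollars, then in dollars.
cost_thousands = [200.0, 250.0, 400.0]
cost_dollars = [value * 1000 for value in cost_thousands]

z_thousands = standardize(cost_thousands)
z_dollars = standardize(cost_dollars)
print(z_thousands)
print(z_dollars)
```

After standardization both versions of the cost variable give the same values up to floating-point rounding, so the choice of unit no longer affects the distances used for clustering.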

K-means tends to produce rounded and equal sized clusters. If you have clusters which are elongated or crescent shaped, K-means may not be able to find these clusters appropriately. The data in this case may have to be transformed before modeling.

Check Your Knowledge

1. Why do we consider K-means clustering an unsupervised machine learning algorithm?

2. How do you use “pair-wise” plots to evaluate the effectiveness of the clustering?

3. Detail the four steps in the K-means clustering algorithm.

4. How do we use WSS to pick the value of K?

5. What is the most common measure of distance used with K-means clustering algorithms?

6. The attributes of a data set are “purchase decision (Yes/No), Gender (M/F), income group (<10K, 10-50K, >50K)”. Can you use K-means to cluster this data set?

Your Thoughts?

Record your answers here.

Module 4: Advanced Analytics – Theory and Methods

During this lesson the following topics were covered:

• Clustering – Unsupervised learning method

• What is K-means clustering

• Use cases with K-means clustering

• The K-means clustering algorithm

• Determining the optimum value for K

• Diagnostics to evaluate the effectiveness of K-means clustering

• Reasons to Choose (+) and Cautions (-) of K-means clustering

Lesson 1: K-means Clustering - Summary

A summary of the key topics presented in this lesson is listed. Take a moment to review them.

Advanced Analytics – Theory and Methods

During this lesson the following topics are covered:

▪ Association Rules mining

▪ Apriori Algorithm

▪ Prominent use cases of Association Rules

▪ Support and Confidence parameters

▪ Lift and Leverage

▪ Diagnostics to evaluate the effectiveness of rules generated

▪ Reasons to Choose (+) and Cautions (-) of the Apriori algorithm

Lesson 2: Association Rules

The topics covered in this lesson are listed.

Association Rules

Which of my products tend to be purchased together?

What do other people like this person tend to like/buy/watch?

• Discover "interesting" relationships among variables in a large database
  - Rules of the form “If X is observed, then Y is also observed"
  - The definition of "interesting" varies with the algorithm used for discovery

• Not a predictive method; finds similarities, relationships

Association Rules is another unsupervised learning method. No “prediction” is performed; the method is used to discover relationships within the data. The example questions are:

• Which of my products tend to be purchased together?
• What will other people who are like this person or product tend to buy, watch, or click on for other products we may have to offer?

In the online retailer example we analyzed in the previous lesson, we could use association rules to discover which products are purchased together within the group that yielded the maximum LTV. For example, if we set up the data appropriately, we could explore which products people in GP4 tend to buy together, and look for logical reasons behind a high rate of returns in a group such as GP2. We can also discover the profile of purchases for people in different groups (for example, people who buy high heel shoes and expensive purses tend to be in GP4, or people who buy walking shoes and camping gear tend to be in GP2).

The goal with Association Rules is to discover “interesting” relationships among the variables, and the definition of “interesting” depends on the algorithm used for the discovery. The rules you discover are of the form “when I observe X, I also tend to observe Y”. One example of an “interesting” relationship is a rule whose “confidence”, as measured on the data, is at or above a pre-defined threshold.

Association Rules - Apriori

• Specifically designed for mining over transactions in databases
• Used over itemsets: sets of discrete variables that are linked:
  - Retail items that are purchased together
  - A set of tasks done in one day
  - A set of links clicked on by one user in a single session

• Our Example: Apriori

Association Rules are specifically designed for in-database mining over transactions.

Association rules are used over transactions that consist of “itemsets”.

Itemsets are discrete sets of items that are linked together. For example, they could be a set of retail items purchased together in one transaction. Association rules are sometimes referred to as Market Basket Analysis, and you can think of an itemset as everything in your shopping basket.

We can also group the tasks done in one day, or the set of links clicked by a user in a single session, into a basket or an itemset for discovering associations.

“Apriori” is one of the earliest and most commonly used algorithms for association rules, and we will focus on Apriori in the rest of this lesson.

Apriori Algorithm - What is it? Support

• Earliest of the association rule algorithms

• Frequent itemset: a set of items L that appears together "often enough":
  - Formally: meets a minimum support criterion
  - Support: the % of transactions that contain L
• Apriori Property: Any subset of a frequent itemset is also frequent
  - It has at least the support of its superset


We will now detail the Apriori algorithm.

The Apriori algorithm uses the notion of a frequent itemset. As the name implies, a frequent itemset is a set of items “L” that appear together “often enough”. The term “often enough” is formally defined with a minimum support criterion, where support is the percentage of transactions that contain “L”.

For example:

Suppose we define L as the itemset {shoes, purses} and set the minimum support to 50%. If at least 50% of the transactions contain this itemset, then we say that L is a “frequent itemset”. It follows that if 50% of the transactions contain {shoes, purses}, then at least 50% of the transactions contain {shoes} and at least 50% contain {purses}. This is the Apriori property, which states that any subset of a frequent itemset is also frequent. The Apriori property provides the basis for the Apriori algorithm that we will detail in the subsequent slides.
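The definition of support can be illustrated with a small Python sketch. The four baskets below are made up for illustration and the function name `support` is our own choice, not part of any library.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= frozenset(t)) / len(transactions)


# Hypothetical toy baskets, invented for this illustration.
transactions = [
    {"shoes", "purses", "hats"},
    {"shoes", "purses"},
    {"shoes", "hats"},
    {"purses"},
]

pair = support({"shoes", "purses"}, transactions)   # 2 of the 4 baskets
shoes = support({"shoes"}, transactions)            # 3 of the 4 baskets
purses = support({"purses"}, transactions)          # 3 of the 4 baskets

# Apriori property: each subset has at least the support of its superset.
print(pair, shoes, purses)
```

With a 50% minimum support, {shoes, purses} is frequent in this toy data, and so are {shoes} and {purses}.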


Apriori Algorithm (Continued) Confidence

• Iteratively grow the frequent itemsets from size 1 to size K (or until we run out of support).

 Apriori property tells us how to prune the search space

• Frequent itemsets are used to find rules X->Y with a minimum confidence:

 Confidence: The % of transactions that contain X, which also contain Y

• Output: The set of all rules X -> Y with minimum support and confidence


Apriori is a bottom-up approach. We start with all the itemsets of size 1 (for example {shoes}, {purses}, {hats}) and determine the support of each. Then we start pairing the frequent ones and find the support for, say, {shoes, purses}, {shoes, hats} and {purses, hats}.

Suppose we set our threshold at 50%; we then look for the itemsets that appear in at least 50% of all transactions. We scan the transactions, "prune away" the itemsets that have less than 50% support, and keep the ones that have sufficient support. The word "prune" is used as it would be in gardening, where you prune away the excess branches of your bushes.

The Apriori property provides the basis for pruning the search space of candidate itemsets: once an itemset fails the support threshold, none of its supersets can be frequent, so we stop growing it. If the support criterion is met, we grow the itemset and repeat the process until we reach the specified number of items in an itemset or we run out of support.

We now use the frequent itemsets to find rules of the form X implies Y. Confidence is the percentage of transactions containing X that also contain Y. For example, suppose we have the frequent itemset {shoes, purses, hats} and consider its subset {shoes, purses}. If 80% of the transactions that contain {shoes, purses} also contain {hats}, then the confidence of the rule {shoes, purses} implies {hats} is 80%.

The output of Apriori is the set of rules that meet the minimum support and confidence.
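As a rough illustration of the confidence calculation, the Python sketch below counts the baskets that contain X and, among those, the ones that also contain Y. The baskets are invented so that the result matches the 80% figure in the example; the helper name `confidence` is an assumption of this sketch.

```python
def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: share of baskets with X that also hold Y."""
    x, y = frozenset(x), frozenset(y)
    n_x = sum(1 for t in transactions if x <= t)
    n_xy = sum(1 for t in transactions if (x | y) <= t)
    return n_xy / n_x if n_x else 0.0


# Hypothetical baskets: five contain {shoes, purses}, four of those also contain hats.
transactions = (
    [{"shoes", "purses", "hats"}] * 4
    + [{"shoes", "purses"}]
    + [{"hats"}]
)

conf = confidence({"shoes", "purses"}, {"hats"}, transactions)
print(f"confidence = {conf:.0%}")
```

Note that confidence is not symmetric: the reverse rule {hats} -> {shoes, purses} is computed against the baskets that contain hats and generally has a different value.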


Lift and Leverage


The common measures used by the Apriori algorithm are Support and Confidence. We rank all the rules by support and confidence and select the most “interesting” rules. There are other measures for evaluating candidate rules, and we will define two of them: Lift and Leverage. Lift measures how many times more often X and Y occur together than would be expected if they were statistically independent. It indicates whether X and Y are genuinely related rather than coincidentally occurring together. Leverage is a similar notion, but it is a difference instead of a ratio: it measures the difference between the probability of X and Y appearing together in the data set and the probability that would be expected if X and Y were statistically independent.

For more measures refer to: http://michael.hahsler.net/research/association_rules/measures.html
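Written out, the two measures are Lift(X -> Y) = Support(X and Y) / (Support(X) x Support(Y)) and Leverage(X -> Y) = Support(X and Y) - Support(X) x Support(Y). The Python sketch below computes both from raw counts; the function names and the counts are hypothetical, chosen only to show the two cases.

```python
def lift(n_xy, n_x, n_y, n):
    """Ratio of observed co-occurrence to that expected under independence."""
    return (n_xy / n) / ((n_x / n) * (n_y / n))


def leverage(n_xy, n_x, n_y, n):
    """Difference between observed and expected co-occurrence."""
    return n_xy / n - (n_x / n) * (n_y / n)


# Hypothetical counts out of 100 transactions: 40 contain X, 50 contain Y.
related = (lift(30, 40, 50, 100), leverage(30, 40, 50, 100))      # 30 contain both
independent = (lift(20, 40, 50, 100), leverage(20, 40, 50, 100))  # 20 contain both

print(related)       # lift above 1, leverage above 0
print(independent)   # lift about 1, leverage about 0
```

When X and Y are independent, lift is 1 and leverage is 0; both measures are symmetric in X and Y.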


Association Rules Implementations

• Market Basket Analysis  People who buy milk also buy cookies 60% of the time.

• Recommender Systems  "People who bought what you bought also purchased…"

• Discovering web usage patterns  People who land on page X click on link Y 76% of the time.


Listed are some example use cases with Association Rules.

Market basket analysis is an application of association rule mining that many companies use for purposes such as:

• Broad-scale approach to better merchandising

• Cross-merchandising between products and high-margin or high-ticket items

• Placement of product (in racks) within related category of products

• Promotional programs - Multiple product purchase incentives managed through loyalty card program

Recommender systems are used by many online retailers, such as Amazon.

Web usage log files generated on web servers contain huge amounts of information, and association rules can help web usage analysts extract useful knowledge from them.


Use Case Example: Credit Records

Credit ID Attributes

1 credit_good, female_married, job_skilled, home_owner, …

2 credit_bad, male_single, job_unskilled, renter, …

Frequent Itemset            Support
credit_good                 70%
male_single                 55%
job_skilled                 63%
home_owner                  71%
home_owner, credit_good     53%

Minimum Support: 50%

The itemset {home_owner, credit_good} has minimum support.

The possible rules are credit_good -> home_owner and home_owner -> credit_good


We present an example to detail the Apriori algorithm. We have a set of artificially created transaction records detailing several attributes of people. Let's say that we found that credit_good, male_single, job_skilled, home_owner and {home_owner, credit_good} each have a support of over 50%.

As the itemset {home_owner, credit_good} meets the minimum support of 50%, we can state the following candidate rules:

credit_good -> home_owner

home_owner -> credit_good

Let us compute the confidence and lift of these rules.


Computing Confidence and Lift

              free_housing   home_owner   renter   total
credit_bad              44          186       70     300
credit_good             64          527      109     700
total                  108          713      179    1000

Suppose we have 1000 credit records:

713 home_owners, 527 have good credit.

home_owner -> credit_good has confidence 527/713 = 74%

700 with good credit, 527 of them are home_owners

credit_good -> home_owner has confidence 527/700 = 75%

The lift of these two rules is

0.527 / (0.700 * 0.713) = 1.056


Suppose we have 1000 credit records of individuals. The table cross-tabulates credit status against housing status and shows the number of individuals with each pair of attributes. We can see that among the 1000 individuals, 700 have credit_good and 300 have credit_bad.

We also see that among the 713 home owners, 527 have good credit. The confidence for the rule

home_owner -> credit_good is 527/713 = 74%

The confidence for the rule

credit_good -> home_owner is 527/700 = 75%

The lift is the ratio P(home_owner and credit_good) / (P(home_owner) x P(credit_good)),

which is 0.527 / (0.700 * 0.713) = 1.056

A lift close to 1 indicates that home_owner and credit_good occur together about as often as they would if they were independent, so the rule may be purely coincidental. With larger values of lift (say > 1.5) we may say the rule is “true” and not coincidental.
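The arithmetic above can be checked with a short Python sketch that derives the marginal totals from the table and then computes confidence, lift and leverage. The dictionary layout and variable names are our own; only the counts come from the slide.

```python
# Counts from the slide's table: credit status by housing status.
table = {
    "credit_bad":  {"free_housing": 44, "home_owner": 186, "renter": 70},
    "credit_good": {"free_housing": 64, "home_owner": 527, "renter": 109},
}

n = sum(sum(row.values()) for row in table.values())         # all records
n_good = sum(table["credit_good"].values())                  # credit_good total
n_owner = sum(row["home_owner"] for row in table.values())   # home_owner total
n_both = table["credit_good"]["home_owner"]                  # both attributes

conf_owner_to_good = n_both / n_owner    # home_owner -> credit_good
conf_good_to_owner = n_both / n_good     # credit_good -> home_owner
lift = (n_both / n) / ((n_owner / n) * (n_good / n))
leverage = n_both / n - (n_owner / n) * (n_good / n)

print(f"{conf_owner_to_good:.1%} {conf_good_to_owner:.1%} {lift:.3f} {leverage:.4f}")
```

The leverage comes out slightly above zero, which tells the same story as a lift slightly above 1: the two attributes co-occur only a little more often than independence would predict.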


A Sketch of the Algorithm

• If Lk is the set of frequent k-itemsets:  Generate the candidate set Ck+1 by joining Lk to itself

 Prune out the (k+1)-itemsets that don't have minimum support. Now we have Lk+1

• We know this catches all the frequent (k+1)-itemsets by the Apriori property

 a (k+1)-itemset can't be frequent if any of its subsets aren't frequent

• Continue until we reach kmax, or run out of support

• From the union of all the Lk, find all the rules with minimum confidence


Here we formally define the Apriori algorithm.

Step 1 is to identify the frequent 1-itemsets: the individual items in the transactions that meet the minimum support level. Then we grow the itemsets by joining the frequent itemsets with one another and determine the support of each newly grown itemset.

We prune all the itemsets that do not meet the minimum support.

We repeat the growing and pruning until we reach the specified number of items in an itemset or we run out of support.

Then, from the union of all the frequent itemsets that we retained, we form the rules that meet the minimum confidence threshold.
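The steps above can be sketched in Python as follows. This is a minimal, unoptimized illustration under our own naming (`apriori`, `generate_rules`, `k_max`) with made-up baskets; it is not the implementation used by R or by any in-database function.

```python
from itertools import combinations


def apriori(transactions, min_support, k_max=None):
    """Return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent 1-itemsets.
    current = {}
    for item in {i for t in transactions for i in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 1
    while current and (k_max is None or k < k_max):
        # Join Lk to itself to build the candidate (k+1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Apriori property: drop candidates that have an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        # Scan the transactions and prune by minimum support.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def generate_rules(frequent, min_confidence):
    """Return (X, Y, confidence) for every rule X -> Y meeting the threshold."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = sup / frequent[lhs]  # lhs is frequent by the Apriori property
                if conf >= min_confidence:
                    rules.append((lhs, itemset - lhs, conf))
    return rules


# Hypothetical toy baskets, invented for this illustration.
baskets = [{"shoes", "purses", "hats"}, {"shoes", "purses"}, {"shoes", "hats"}, {"purses"}]
frequent = apriori(baskets, min_support=0.5)
rules = generate_rules(frequent, min_confidence=0.9)
print(len(frequent), rules)
```

In this toy data the 3-itemset {shoes, purses, hats} is never counted: its subset {purses, hats} is infrequent, so the Apriori property removes it from the candidate set.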

We will go back to our credit records example and understand the algorithm.


Step 1: 1-itemsets (L1)

• let min_support = 0.5

• 1000 credit records

• Scan the database

• Prune

Itemset             Count
credit_good           700
credit_bad            300   (pruned)
male_single           550
male_mar_or_wid        92   (pruned)
female                310   (pruned)
job_skilled           631
job_unskilled         200   (pruned)
home_owner            710
renter                179   (pruned)


The first step starts with the 1-element itemsets, with the minimum support set to 0.5. We scan the database and count the occurrences of each attribute.

The itemsets that meet the support criterion are the ones that are not pruned (struck off on the slide): with 1000 records, those with a count of at least 500.
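As a small illustration, the pruning in step 1 amounts to a filter over the counts. The Python sketch below uses the counts from the slide; the variable names are our own.

```python
# Item counts from the slide, out of 1000 credit records.
counts = {
    "credit_good": 700, "credit_bad": 300,
    "male_single": 550, "male_mar_or_wid": 92, "female": 310,
    "job_skilled": 631, "job_unskilled": 200,
    "home_owner": 710, "renter": 179,
}
n_records = 1000
min_support = 0.5

# Keep only the items whose support meets the threshold.
L1 = {item: c for item, c in counts.items() if c / n_records >= min_support}
print(sorted(L1))
```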


Step 2: 2-itemsets (L2)

• Join L1 to itself

• Scan the database to get the counts

• Prune

Itemset                       Count
credit_good, male_single        402   (pruned)
credit_good, job_skilled        544
credit_good, home_owner         527
male_single, job_skilled        340   (pruned)
male_single, home_owner         408   (pruned)
job_skilled, home_owner         452   (pruned)


The itemsets that we end up with at step 1 are {credit_good}, {male_single}, {home_owner} and {job_skilled}.

In step 2 we join (grow) these itemsets into 2-element itemsets: {credit_good, male_single}, {credit_good, home_owner}, {credit_good, job_skilled}, {male_single, job_skilled}, {male_single, home_owner} and {job_skilled, home_owner}, and determine the support for each of these combinations.

What survives the pruning is {credit_good, job_skilled} and {credit_good, home_owner}.
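The join and prune of step 2 can be sketched in Python as below. The pair counts are taken from the slide; in a real run they would come from scanning the database, and the variable names here are assumptions of this sketch.

```python
from itertools import combinations

# Frequent 1-itemsets carried over from step 1.
L1 = ["credit_good", "male_single", "job_skilled", "home_owner"]

# Pair counts from the slide, out of 1000 credit records.
pair_counts = {
    frozenset({"credit_good", "male_single"}): 402,
    frozenset({"credit_good", "job_skilled"}): 544,
    frozenset({"credit_good", "home_owner"}): 527,
    frozenset({"male_single", "job_skilled"}): 340,
    frozenset({"male_single", "home_owner"}): 408,
    frozenset({"job_skilled", "home_owner"}): 452,
}
n_records = 1000
min_support = 0.5

# Join L1 to itself: every unordered pair of frequent items is a candidate.
candidates = [frozenset(pair) for pair in combinations(L1, 2)]

# Prune the candidates that fall below the minimum support.
L2 = {c: pair_counts[c] for c in candidates if pair_counts[c] / n_records >= min_support}
print([sorted(c) for c in L2])
```

Four items give six candidate pairs, of which two survive.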


Step 3: 3-itemsets

• We have run out of support.

• Candidate rules come from L2:  credit_good -> job_skilled

 job_skilled -> credit_good

 credit_good -> home_owner

 home_owner -> credit_good

Itemset                                   Count
credit_good, job_skilled, home_owner        428   (pruned)


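When we grow the itemsets to size 3 we run out of support: the only candidate, {credit_good, job_skilled, home_owner}, appears in 428 of the 1000 records, which is below the 50% threshold. Its subset {job_skilled, home_owner} was already pruned in step 2, so the Apriori property rules this candidate out even before counting. We stop and generate rules from the results of step 2; these rules are shown on the slide. Obviously, depending on what we are trying to do (predict who will have good credit, or identify the characteristics of people with good credit), some rules are more useful than others, independently of confidence.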


Finally: Find Confidence Rules

Rule                              Set           Cnt   Set                            Cnt   Confidence
IF credit_good THEN job_skilled   credit_good   700   credit_good AND job_skilled    544   544/700 = 78%
IF credit_good THEN home_owner    credit_good   700   credit_good AND home_owner     527   527/700 = 75%
IF job_skilled THEN credit_good   job_skilled   631   job_skilled AND credit_good    544   544/631 = 86%
IF home_owner THEN credit_good    home_owner    710   home_owner AND credit_good     527   527/710 = 74%

If we want confidence > 80%:

IF job_skilled THEN credit_good


Once we have the candidate rules, we compute the confidence for each rule. The table lists the rules and the computation of confidence.

We see that job_skilled -> credit_good has 86% confidence, so it is the only rule that passes a confidence threshold of 80%.
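The confidence table can be reproduced with a few lines of Python. The counts are those shown on the slides; the data layout and names are our own.

```python
# Counts from the slides, out of 1000 credit records.
item_counts = {"credit_good": 700, "job_skilled": 631, "home_owner": 710}
pair_counts = {
    ("credit_good", "job_skilled"): 544,
    ("credit_good", "home_owner"): 527,
}

# Each frequent pair yields two candidate rules, one in each direction.
rules = []
for (a, b), n_ab in pair_counts.items():
    for lhs, rhs in ((a, b), (b, a)):
        rules.append((lhs, rhs, n_ab / item_counts[lhs]))

for lhs, rhs, conf in rules:
    print(f"IF {lhs} THEN {rhs}: {conf:.1%}")

# Keep only the rules above the 80% confidence threshold.
strong = [(lhs, rhs) for lhs, rhs, conf in rules if conf > 0.8]
```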


Diagnostics

• Do the rules make sense?  What does the domain expert say?

• Make a "test set" from hold-out data:  Enter some market baskets with a few items missing (selected at random). Can the rules determine the missing items?

 Remember, some of the test data may not cause a rule to fire.

• Evaluate the rules by lift or leverage.  Some associations may be coincidental (or obvious).


The first check on the output is to determine whether the rules make sense. Domain experts provide input for this.

In the example of credit records we had 1000 transactions that we worked with for the discovery of rules. Let us assume that we had 1500 transactions instead. We can randomly select 500 of them, set them aside, and run the discovery of rules on the remaining 1000 transactions. The 500 records we set aside are known as the hold-out data.

We can use the hold-out data as a test set by randomly dropping some items from its transactions. We then apply the discovered rules to the test set and determine whether they predict the items that were dropped. It should be noted that some of the test data may not cause any rule to fire.
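One possible way to set up this check is sketched below in Python. It drops one randomly chosen item from each hold-out basket and asks whether any rule whose left-hand side is still present predicts the dropped item. The function name `evaluate_rules`, the rule format and the sample baskets are assumptions made for illustration.

```python
import random


def evaluate_rules(rules, holdout, seed=0):
    """Drop one random item per basket and count how often the rules recover it."""
    rng = random.Random(seed)
    fired = hits = 0
    for basket in holdout:
        basket = set(basket)
        if len(basket) < 2:
            continue  # nothing would be left to match a rule against
        missing = rng.choice(sorted(basket))
        observed = basket - {missing}
        predicted = set()
        for antecedent, consequent in rules:
            if antecedent <= observed:            # the rule fires on this basket
                predicted |= consequent - observed
        if predicted:
            fired += 1
            if missing in predicted:
                hits += 1
    return {"baskets": len(holdout), "fired": fired, "hits": hits}


# Hypothetical rules (X, Y) and hold-out baskets.
rules = [
    (frozenset({"milk"}), frozenset({"cookies"})),
    (frozenset({"cookies"}), frozenset({"milk"})),
]
holdout = [{"milk", "cookies"}, {"tea", "jam"}]
result = evaluate_rules(rules, holdout)
print(result)
```

Reporting the number of baskets on which a rule fired separately from the number of correct predictions keeps the two effects apart: a rule set can be accurate when it fires yet fire on only a small part of the test data.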

It is important to evaluate the rules with “Lift” or “Leverage”. When mining data with association rules, many of the rules generated are purely coincidental.

If 95% of your customers buy X and 90% of customers buy Y, then X and Y would occur together about 85% of the time (0.95 x 0.90 = 0.855) even if there is no relationship between the two. The measure of Lift helps ensure that “interesting” rules are identified rather than the coincidental ones.


Apriori - Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+)

• Easy to implement

• Uses a clever observation to prune the search space: the Apriori property

• Easy to parallelize

Cautions (-)

• Requires many database scans

• Exponential time complexity

• Can mistakenly find spurious (or coincidental) relationships; addressed with Lift and Leverage measures


While the Apriori algorithm is easy to implement and parallelize, it is computationally expensive. One of its major drawbacks is that it tends to generate many spurious rules that are of little practical use. These spurious rules arise from coincidental relationships between the variables.

Lift and Leverage measures must be used to prune out these rules.


Check Your Knowledge

1. What is the Apriori property and how is it used in the Apriori algorithm?

2. List three popular use cases of the Association Rules mining algorithms.

3. What is the difference between Lift and Leverage? How is Lift used in evaluating the quality of rules discovered?

4. Define Support and Confidence.

5. How do you use a “hold-out” dataset to evaluate the effectiveness of the rules generated?

Your Thoughts?


Record your answers here.


Module 4: Advanced Analytics – Theory and Methods

During this lesson the following topics were covered:

▪ Association Rules mining

▪ Apriori Algorithm

▪ Prominent use cases of Association Rules

▪ Support and Confidence parameters

▪ Lift and Leverage

▪ Diagnostics to evaluate the effectiveness of rules generated

▪ Reasons to Choose (+) and Cautions (-) of the Apriori algorithm

Lesson 2: Association Rules - Summary


This lesson covered these topics. Please take a moment to review them.
