Discussion

profilenickyr
ITS632_Chapter6PPT.ppt

Data Mining
Association Analysis: Basic Concepts
and Algorithms

Lecture Notes for Chapter 6

Introduction to Data Mining

by

Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 *

Association Rule Mining

  • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

Example of Association Rules

{Diaper}  {Beer},
{Milk, Bread}  {Eggs,Coke},
{Beer, Bread}  {Milk},

Implication means co-occurrence, not causality!

TID

Items

1

Bread, Milk

2

Bread, Diaper, Beer, Eggs

3

Milk, Diaper, Beer, Coke

4

Bread, Milk, Diaper, Beer

5

Bread, Milk, Diaper, Coke

Definition: Frequent Itemset

  • Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
  • k-itemset
  • An itemset that contains k items
  • Support count ()
  • Frequency of occurrence of an itemset
  • E.g. ({Milk, Bread,Diaper}) = 2
  • Support
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5
  • Frequent Itemset
  • An itemset whose support is greater than or equal to a minsup threshold

TID

Items

1

Bread, Milk

2

Bread, Diaper, Beer, Eggs

3

Milk, Diaper, Beer, Coke

4

Bread, Milk, Diaper, Beer

5

Bread, Milk, Diaper, Coke

Definition: Association Rule

  • Association Rule
  • An implication expression of the form X  Y, where X and Y are itemsets
  • Example:
    {Milk, Diaper}  {Beer}

  • Rule Evaluation Metrics
  • Support (s)
  • Fraction of transactions that contain both X and Y
  • Confidence (c)
  • Measures how often items in Y
    appear in transactions that
    contain X

Example:

TID

Items

1

Bread, Milk

2

Bread, Diaper, Beer, Eggs

3

Milk, Diaper, Beer, Coke

4

Bread, Milk, Diaper, Beer

5

Bread, Milk, Diaper, Coke

Association Rule Mining Task

  • Given a set of transactions T, the goal of association rule mining is to find all rules having
  • support ≥ minsup threshold
  • confidence ≥ minconf threshold
  • Brute-force approach:
  • List all possible association rules
  • Compute the support and confidence for each rule
  • Prune rules that fail the minsup and minconf thresholds

 Computationally prohibitive!

Mining Association Rules

Example of Rules:

{Milk,Diaper}  {Beer} (s=0.4, c=0.67)
{Milk,Beer}  {Diaper} (s=0.4, c=1.0)

{Diaper,Beer}  {Milk} (s=0.4, c=0.67)

{Beer}  {Milk,Diaper} (s=0.4, c=0.67)
{Diaper}  {Milk,Beer} (s=0.4, c=0.5)

{Milk}  {Diaper,Beer} (s=0.4, c=0.5)

Observations:

  • All the above rules are binary partitions of the same itemset:
    {Milk, Diaper, Beer}
  • Rules originating from the same itemset have identical support but
    can have different confidence
  • Thus, we may decouple the support and confidence requirements

TID

Items

1

Bread, Milk

2

Bread, Diaper, Beer, Eggs

3

Milk, Diaper, Beer, Coke

4

Bread, Milk, Diaper, Beer

5

Bread, Milk, Diaper, Coke

Mining Association Rules

  • Two-step approach:

Frequent Itemset Generation

Generate all itemsets whose support  minsup

Rule Generation

Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

  • Frequent itemset generation is still computationally expensive

Frequent Itemset Generation Strategies

  • Reduce the number of candidates (M)
  • Complete search: M=2d
  • Use pruning techniques to reduce M

  • Reduce the number of transactions (N)
  • Reduce size of N as the size of itemset increases
  • Used by DHP and vertical-based mining algorithms

  • Reduce the number of comparisons (NM)
  • Use efficient data structures to store the candidates or transactions
  • No need to match every candidate against every transaction

Reducing Number of Candidates

  • Apriori principle:
  • If an itemset is frequent, then all of its subsets must also be frequent

  • Apriori principle holds due to the following property of the support measure:
  • Support of an itemset never exceeds the support of its subsets
  • This is known as the anti-monotone property of support

Illustrating Apriori Principle

Found to be Infrequent

Pruned supersets

Apriori Algorithm

  • Method:

  • Let k=1
  • Generate frequent itemsets of length 1
  • Repeat until no new frequent itemsets are identified
  • Generate length (k+1) candidate itemsets from length k frequent itemsets
  • Prune candidate itemsets containing subsets of length k that are infrequent
  • Count the support of each candidate by scanning the DB
  • Eliminate candidates that are infrequent, leaving only those that are frequent

Reducing Number of Comparisons

  • Candidate counting:
  • Scan the database of transactions to determine the support of each candidate itemset
  • To reduce the number of comparisons, store the candidates in a hash structure
  • Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

N�

k�

Buckets�

Hash Structure�

Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3:

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:

  • Hash function
  • Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)

2 3 4

5 6 7

1 4 5

1 3 6

1 2 4

4 5 7

1 2 5

4 5 8

1 5 9

3 4 5

3 5 6

3 5 7

6 8 9

3 6 7

3 6 8

1,4,7

2,5,8

3,6,9

Hash function

Factors Affecting Complexity

  • Choice of minimum support threshold
  • lowering support threshold results in more frequent itemsets
  • this may increase number of candidates and max length of frequent itemsets
  • Dimensionality (number of items) of the data set
  • more space is needed to store support count of each item
  • if number of frequent items also increases, both computation and I/O costs may also increase
  • Size of database
  • since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
  • Average transaction width
  • transaction width increases with denser data sets
  • This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)

Compact Representation of Frequent Itemsets

  • Some itemsets are redundant because they have identical support as their supersets

  • Number of frequent itemsets

  • Need a compact representation

FP-growth Algorithm

  • Use a compressed representation of the database using an FP-tree
  • Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets

Tree Projection

  • Items are listed in lexicographic order
  • Each node P stores the following information:
  • Itemset for node P
  • List of possible lexicographic extensions of P: E(P)
  • Pointer to projected database of its ancestor node
  • Bitvector containing information about which transactions in the projected database contain the itemset

Projected Database

Original Database:

Projected Database for node A:

For each transaction T, projected transaction at node A is T  E(A)

Sheet1

TID Items
1 {A,B}
2 {B,C,D}
3 {A,C,D,E}
4 {A,D,E}
5 {A,B,C}
6 {A,B,C,D}
7 {B,C}
8 {A,B,C}
9 {A,B,D}
10 {B,C,E}

Sheet2

Sheet3

Sheet1

TID Items
1 {B}
2 {}
3 {C,D,E}
4 {D,E}
5 {B,C}
6 {B,C,D}
7 {}
8 {B,C}
9 {B,D}
10 {}

Sheet2

Sheet3

Rule Generation

  • Given a frequent itemset L, find all non-empty subsets f  L such that f  L – f satisfies the minimum confidence requirement
  • If {A,B,C,D} is a frequent itemset, candidate rules:

ABC D, ABD C, ACD B, BCD A,
A BCD, B ACD, C ABD, D ABC
AB CD, AC  BD, AD  BC, BC AD,
BD AC, CD AB,

  • If |L| = k, then there are 2k – 2 candidate association rules (ignoring L   and   L)

Rule Generation

  • How to efficiently generate rules from frequent itemsets?
  • In general, confidence does not have an anti-monotone property

c(ABC D) can be larger or smaller than c(AB D)

  • But confidence of rules generated from the same itemset has an anti-monotone property
  • e.g., L = {A,B,C,D}:

    c(ABC  D)  c(AB  CD)  c(A  BCD)

  • Confidence is anti-monotone w.r.t. number of items on the RHS of the rule

Effect of Support Distribution

  • Many real data sets have skewed support distribution

Support distribution of a retail data set

Effect of Support Distribution

  • How to set the appropriate minsup threshold?
  • If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)

  • If minsup is set too low, it is computationally expensive and the number of itemsets is very large
  • Using a single minimum support threshold may not be effective

Multiple Minimum Support

  • How to apply multiple minimum supports?
  • MS(i): minimum support for item i
  • e.g.: MS(Milk)=5%, MS(Coke) = 3%,
    MS(Broccoli)=0.1%, MS(Salmon)=0.5%
  • MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli))
    = 0.1%

  • Challenge: Support is no longer anti-monotone
  • Suppose: Support(Milk, Coke) = 1.5% and
    Support(Milk, Coke, Broccoli) = 0.5%
  • {Milk,Coke} is infrequent but {Milk,Coke,Broccoli} is frequent

Multiple Minimum Support (Liu 1999)

  • Order the items according to their minimum support (in ascending order)
  • e.g.: MS(Milk)=5%, MS(Coke) = 3%,
    MS(Broccoli)=0.1%, MS(Salmon)=0.5%
  • Ordering: Broccoli, Salmon, Coke, Milk
  • Need to modify Apriori such that:
  • L1 : set of frequent items
  • F1 : set of items whose support is  MS(1)
    where MS(1) is mini( MS(i) )
  • C2 : candidate itemsets of size 2 is generated from F1
    instead of L1

Multiple Minimum Support (Liu 1999)

  • Modifications to Apriori:
  • In traditional Apriori,
  • A candidate (k+1)-itemset is generated by merging two
    frequent itemsets of size k
  • The candidate is pruned if it contains any infrequent subsets
    of size k
  • Pruning step has to be modified:
  • Prune only if subset contains the first item
  • e.g.: Candidate={Broccoli, Coke, Milk} (ordered according to
    minimum support)
  • {Broccoli, Coke} and {Broccoli, Milk} are frequent but
    {Coke, Milk} is infrequent

Candidate is not pruned because {Coke,Milk} does not contain
the first item, i.e., Broccoli.

Pattern Evaluation

  • Association rule algorithms tend to produce too many rules
  • many of them are uninteresting or redundant
  • Redundant if {A,B,C}  {D} and {A,B}  {D}
    have same support & confidence

  • Interestingness measures can be used to prune/rank the derived patterns
  • In the original formulation of association rules, support & confidence are the only measures used

Statistical Independence

  • Population of 1000 students
  • 600 students know how to swim (S)
  • 700 students know how to bike (B)
  • 420 students know how to swim and bike (S,B)
  • P(SB) = 420/1000 = 0.42
  • P(S)  P(B) = 0.6  0.7 = 0.42
  • P(SB) = P(S)  P(B) => Statistical independence
  • P(SB) > P(S)  P(B) => Positively correlated
  • P(SB) < P(S)  P(B) => Negatively correlated

Statistical-based Measures

  • Measures that take into account statistical dependence

Properties of A Good Measure

  • Piatetsky-Shapiro:
    3 properties a good measure M must satisfy:
  • M(A,B) = 0 if A and B are statistically independent
  • M(A,B) increase monotonically with P(A,B) when P(A) and P(B) remain unchanged
  • M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged

Support-based Pruning

  • Most of the association rule mining algorithms use support measure to prune rules and itemsets
  • Study effect of support pruning on correlation of itemsets
  • Generate 10000 random contingency tables
  • Compute support and pairwise correlation for each table
  • Apply support-based pruning and examine the tables that are removed

Effect of Support-based Pruning

  • Investigate how support-based pruning affects other measures

  • Steps:
  • Generate 10000 contingency tables
  • Rank each table according to the different measures
  • Compute the pair-wise correlation between the measures

Subjective Interestingness Measure

  • Objective measure:
  • Rank patterns based on statistics computed from data
  • e.g., 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc).

  • Subjective measure:
  • Rank patterns according to user’s interpretation
  • A pattern is subjectively interesting if it contradicts the
    expectation of a user (Silberschatz & Tuzhilin)
  • A pattern is subjectively interesting if it is actionable
    (Silberschatz & Tuzhilin)

Interestingness via Unexpectedness

  • Web Data (Cooley et al 2001)
  • Domain knowledge in the form of site structure
  • Given an itemset F = {X1, X2, …, Xk} (Xi : Web pages)
  • L: number of links connecting the pages
  • lfactor = L / (k  k-1)
  • cfactor = 1 (if graph is connected), 0 (disconnected graph)
  • Structure evidence = cfactor  lfactor
  • Usage evidence

  • Use Dempster-Shafer theory to combine domain knowledge and evidence from data

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Beer

}

Diaper

,

Milk

{

Þ

4

.

0

5

2

|

T

|

)

Beer

Diaper,

,

Milk

(

=

=

=

s

s

67

.

0

3

2

)

Diaper

,

Milk

(

)

Beer

Diaper,

Milk,

(

=

=

=

s

s

c

)

(

)

(

)

(

:

,

Y

s

X

s

Y

X

Y

X

³

Þ

Í

"

null

ABACADAEBCBDBECDCEDE

ABCDE

ABCABDABEACDACEADEBCDBCEBDECDE

ABCDABCEABDEACDEBCDE

ABCDE

null

ABACADAEBCBDBECDCEDE

ABCDE

ABCABDABEACDACEADEBCDBCEBDECDE

ABCDABCEABDEACDEBCDE

ABCDE

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

N

Transactions

Hash Structure

k

Buckets

TIDA1A2A3A4A5A6A7A8A9A10B1B2B3B4B5B6B7B8B9B10C1C2C3C4C5C6C7C8C9C10

1

1111111111

00000000000000000000

2

1111111111

00000000000000000000

3

1111111111

00000000000000000000

4

1111111111

00000000000000000000

5

1111111111

00000000000000000000

60000000000

1111111111

0000000000

70000000000

1111111111

0000000000

80000000000

1111111111

0000000000

90000000000

1111111111

0000000000

100000000000

1111111111

0000000000

1100000000000000000000

1111111111

1200000000000000000000

1111111111

1300000000000000000000

1111111111

1400000000000000000000

1111111111

1500000000000000000000

1111111111

å

=

÷

ø

ö

ç

è

æ

´

=

10

1

10

3

k

k

TIDItems

1{A,B}

2{B,C,D}

3{A,C,D,E}

4{A,D,E}

5{A,B,C}

6{A,B,C,D}

7{B,C}

8{A,B,C}

9{A,B,D}

10{B,C,E}

TIDItems

1{B}

2{}

3{C,D,E}

4{D,E}

5{B,C}

6{B,C,D}

7{}

8{B,C}

9{B,D}

10{}

)]

(

1

)[

(

)]

(

1

)[

(

)

(

)

(

)

,

(

)

(

)

(

)

,

(

)

(

)

(

)

,

(

)

(

)

|

(

Y

P

Y

P

X

P

X

P

Y

P

X

P

Y

X

P

t

coefficien

Y

P

X

P

Y

X

P

PS

Y

P

X

P

Y

X

P

Interest

Y

P

X

Y

P

Lift

-

-

-

=

-

-

=

=

=

f

)

...

(

)

...

(

2

1

2

1

k

k

X

X

X

P

X

X

X

P

È

È

È

=

I

I

I