Discussion
Data Mining
Association Analysis: Basic Concepts
and Algorithms
Lecture Notes for Chapter 6
Introduction to Data Mining
by
Tan, Steinbach, Kumar
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 *
Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
Example of Association Rules
{Diaper} {Beer},
{Milk, Bread} {Eggs,Coke},
{Beer, Bread} {Milk},
Implication means co-occurrence, not causality!
TID |
Items |
|
1 |
Bread, Milk |
|
2 |
Bread, Diaper, Beer, Eggs |
|
3 |
Milk, Diaper, Beer, Coke |
|
4 |
Bread, Milk, Diaper, Beer |
|
5 |
Bread, Milk, Diaper, Coke |
Definition: Frequent Itemset
- Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset
- An itemset that contains k items
- Support count ()
- Frequency of occurrence of an itemset
- E.g. ({Milk, Bread,Diaper}) = 2
- Support
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold
TID |
Items |
|
1 |
Bread, Milk |
|
2 |
Bread, Diaper, Beer, Eggs |
|
3 |
Milk, Diaper, Beer, Coke |
|
4 |
Bread, Milk, Diaper, Beer |
|
5 |
Bread, Milk, Diaper, Coke |
Definition: Association Rule
- Association Rule
- An implication expression of the form X Y, where X and Y are itemsets
- Example:
{Milk, Diaper} {Beer}
- Rule Evaluation Metrics
- Support (s)
- Fraction of transactions that contain both X and Y
- Confidence (c)
- Measures how often items in Y
appear in transactions that
contain X
Example:
TID |
Items |
|
1 |
Bread, Milk |
|
2 |
Bread, Diaper, Beer, Eggs |
|
3 |
Milk, Diaper, Beer, Coke |
|
4 |
Bread, Milk, Diaper, Beer |
|
5 |
Bread, Milk, Diaper, Coke |
Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold
- Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Mining Association Rules
Example of Rules:
{Milk,Diaper} {Beer} (s=0.4, c=0.67)
{Milk,Beer} {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} {Milk} (s=0.4, c=0.67)
{Beer} {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} {Milk,Beer} (s=0.4, c=0.5)
{Milk} {Diaper,Beer} (s=0.4, c=0.5)
Observations:
- All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer} - Rules originating from the same itemset have identical support but
can have different confidence - Thus, we may decouple the support and confidence requirements
TID |
Items |
|
1 |
Bread, Milk |
|
2 |
Bread, Diaper, Beer, Eggs |
|
3 |
Milk, Diaper, Beer, Coke |
|
4 |
Bread, Milk, Diaper, Beer |
|
5 |
Bread, Milk, Diaper, Coke |
Mining Association Rules
- Two-step approach:
Frequent Itemset Generation
Generate all itemsets whose support minsup
Rule Generation
Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive
Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
- Complete search: M=2d
- Use pruning techniques to reduce M
- Reduce the number of transactions (N)
- Reduce size of N as the size of itemset increases
- Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
- Use efficient data structures to store the candidates or transactions
- No need to match every candidate against every transaction
Reducing Number of Candidates
- Apriori principle:
- If an itemset is frequent, then all of its subsets must also be frequent
- Apriori principle holds due to the following property of the support measure:
- Support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support
Illustrating Apriori Principle
Found to be Infrequent
Pruned supersets
Apriori Algorithm
- Method:
- Let k=1
- Generate frequent itemsets of length 1
- Repeat until no new frequent itemsets are identified
- Generate length (k+1) candidate itemsets from length k frequent itemsets
- Prune candidate itemsets containing subsets of length k that are infrequent
- Count the support of each candidate by scanning the DB
- Eliminate candidates that are infrequent, leaving only those that are frequent
Reducing Number of Comparisons
- Candidate counting:
- Scan the database of transactions to determine the support of each candidate itemset
- To reduce the number of comparisons, store the candidates in a hash structure
- Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
N�
k�
Buckets�
Hash Structure�
Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
- Hash function
- Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)
2 3 4
5 6 7
1 4 5
1 3 6
1 2 4
4 5 7
1 2 5
4 5 8
1 5 9
3 4 5
3 5 6
3 5 7
6 8 9
3 6 7
3 6 8
1,4,7
2,5,8
3,6,9
Hash function
Factors Affecting Complexity
- Choice of minimum support threshold
- lowering support threshold results in more frequent itemsets
- this may increase number of candidates and max length of frequent itemsets
- Dimensionality (number of items) of the data set
- more space is needed to store support count of each item
- if number of frequent items also increases, both computation and I/O costs may also increase
- Size of database
- since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
- Average transaction width
- transaction width increases with denser data sets
- This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)
Compact Representation of Frequent Itemsets
- Some itemsets are redundant because they have identical support as their supersets
- Number of frequent itemsets
- Need a compact representation
FP-growth Algorithm
- Use a compressed representation of the database using an FP-tree
- Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets
Tree Projection
- Items are listed in lexicographic order
- Each node P stores the following information:
- Itemset for node P
- List of possible lexicographic extensions of P: E(P)
- Pointer to projected database of its ancestor node
- Bitvector containing information about which transactions in the projected database contain the itemset
Projected Database
Original Database:
Projected Database for node A:
For each transaction T, projected transaction at node A is T E(A)
Sheet1
| TID | Items | |
| 1 | {A,B} | |
| 2 | {B,C,D} | |
| 3 | {A,C,D,E} | |
| 4 | {A,D,E} | |
| 5 | {A,B,C} | |
| 6 | {A,B,C,D} | |
| 7 | {B,C} | |
| 8 | {A,B,C} | |
| 9 | {A,B,D} | |
| 10 | {B,C,E} |
Sheet2
Sheet3
Sheet1
| TID | Items | |
| 1 | {B} | |
| 2 | {} | |
| 3 | {C,D,E} | |
| 4 | {D,E} | |
| 5 | {B,C} | |
| 6 | {B,C,D} | |
| 7 | {} | |
| 8 | {B,C} | |
| 9 | {B,D} | |
| 10 | {} |
Sheet2
Sheet3
Rule Generation
- Given a frequent itemset L, find all non-empty subsets f L such that f L – f satisfies the minimum confidence requirement
- If {A,B,C,D} is a frequent itemset, candidate rules:
ABC D, ABD C, ACD B, BCD A,
A BCD, B ACD, C ABD, D ABC
AB CD, AC BD, AD BC, BC AD,
BD AC, CD AB,
- If |L| = k, then there are 2k – 2 candidate association rules (ignoring L and L)
Rule Generation
- How to efficiently generate rules from frequent itemsets?
- In general, confidence does not have an anti-monotone property
c(ABC D) can be larger or smaller than c(AB D)
- But confidence of rules generated from the same itemset has an anti-monotone property
- e.g., L = {A,B,C,D}:
c(ABC D) c(AB CD) c(A BCD)
- Confidence is anti-monotone w.r.t. number of items on the RHS of the rule
Effect of Support Distribution
- Many real data sets have skewed support distribution
Support distribution of a retail data set
Effect of Support Distribution
- How to set the appropriate minsup threshold?
- If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
- If minsup is set too low, it is computationally expensive and the number of itemsets is very large
- Using a single minimum support threshold may not be effective
Multiple Minimum Support
- How to apply multiple minimum supports?
- MS(i): minimum support for item i
- e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5% - MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli))
= 0.1%
- Challenge: Support is no longer anti-monotone
- Suppose: Support(Milk, Coke) = 1.5% and
Support(Milk, Coke, Broccoli) = 0.5% - {Milk,Coke} is infrequent but {Milk,Coke,Broccoli} is frequent
Multiple Minimum Support (Liu 1999)
- Order the items according to their minimum support (in ascending order)
- e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5% - Ordering: Broccoli, Salmon, Coke, Milk
- Need to modify Apriori such that:
- L1 : set of frequent items
- F1 : set of items whose support is MS(1)
where MS(1) is mini( MS(i) ) - C2 : candidate itemsets of size 2 is generated from F1
instead of L1
Multiple Minimum Support (Liu 1999)
- Modifications to Apriori:
- In traditional Apriori,
- A candidate (k+1)-itemset is generated by merging two
frequent itemsets of size k - The candidate is pruned if it contains any infrequent subsets
of size k - Pruning step has to be modified:
- Prune only if subset contains the first item
- e.g.: Candidate={Broccoli, Coke, Milk} (ordered according to
minimum support) - {Broccoli, Coke} and {Broccoli, Milk} are frequent but
{Coke, Milk} is infrequent
Candidate is not pruned because {Coke,Milk} does not contain
the first item, i.e., Broccoli.
Pattern Evaluation
- Association rule algorithms tend to produce too many rules
- many of them are uninteresting or redundant
- Redundant if {A,B,C} {D} and {A,B} {D}
have same support & confidence
- Interestingness measures can be used to prune/rank the derived patterns
- In the original formulation of association rules, support & confidence are the only measures used
Statistical Independence
- Population of 1000 students
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S,B)
- P(SB) = 420/1000 = 0.42
- P(S) P(B) = 0.6 0.7 = 0.42
- P(SB) = P(S) P(B) => Statistical independence
- P(SB) > P(S) P(B) => Positively correlated
- P(SB) < P(S) P(B) => Negatively correlated
Statistical-based Measures
- Measures that take into account statistical dependence
Properties of A Good Measure
- Piatetsky-Shapiro:
3 properties a good measure M must satisfy: - M(A,B) = 0 if A and B are statistically independent
- M(A,B) increase monotonically with P(A,B) when P(A) and P(B) remain unchanged
- M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
Support-based Pruning
- Most of the association rule mining algorithms use support measure to prune rules and itemsets
- Study effect of support pruning on correlation of itemsets
- Generate 10000 random contingency tables
- Compute support and pairwise correlation for each table
- Apply support-based pruning and examine the tables that are removed
Effect of Support-based Pruning
- Investigate how support-based pruning affects other measures
- Steps:
- Generate 10000 contingency tables
- Rank each table according to the different measures
- Compute the pair-wise correlation between the measures
Subjective Interestingness Measure
- Objective measure:
- Rank patterns based on statistics computed from data
- e.g., 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc).
- Subjective measure:
- Rank patterns according to user’s interpretation
- A pattern is subjectively interesting if it contradicts the
expectation of a user (Silberschatz & Tuzhilin) - A pattern is subjectively interesting if it is actionable
(Silberschatz & Tuzhilin)
Interestingness via Unexpectedness
- Web Data (Cooley et al 2001)
- Domain knowledge in the form of site structure
- Given an itemset F = {X1, X2, …, Xk} (Xi : Web pages)
- L: number of links connecting the pages
- lfactor = L / (k k-1)
- cfactor = 1 (if graph is connected), 0 (disconnected graph)
- Structure evidence = cfactor lfactor
- Usage evidence
- Use Dempster-Shafer theory to combine domain knowledge and evidence from data
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Beer
}
Diaper
,
Milk
{
Þ
4
.
0
5
2
|
T
|
)
Beer
Diaper,
,
Milk
(
=
=
=
s
s
67
.
0
3
2
)
Diaper
,
Milk
(
)
Beer
Diaper,
Milk,
(
=
=
=
s
s
c
)
(
)
(
)
(
:
,
Y
s
X
s
Y
X
Y
X
³
Þ
Í
"
null
ABACADAEBCBDBECDCEDE
ABCDE
ABCABDABEACDACEADEBCDBCEBDECDE
ABCDABCEABDEACDEBCDE
ABCDE
null
ABACADAEBCBDBECDCEDE
ABCDE
ABCABDABEACDACEADEBCDBCEBDECDE
ABCDABCEABDEACDEBCDE
ABCDE
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
N
Transactions
Hash Structure
k
Buckets
TIDA1A2A3A4A5A6A7A8A9A10B1B2B3B4B5B6B7B8B9B10C1C2C3C4C5C6C7C8C9C10
1
1111111111
00000000000000000000
2
1111111111
00000000000000000000
3
1111111111
00000000000000000000
4
1111111111
00000000000000000000
5
1111111111
00000000000000000000
60000000000
1111111111
0000000000
70000000000
1111111111
0000000000
80000000000
1111111111
0000000000
90000000000
1111111111
0000000000
100000000000
1111111111
0000000000
1100000000000000000000
1111111111
1200000000000000000000
1111111111
1300000000000000000000
1111111111
1400000000000000000000
1111111111
1500000000000000000000
1111111111
å
=
÷
ø
ö
ç
è
æ
´
=
10
1
10
3
k
k
TIDItems
1{A,B}
2{B,C,D}
3{A,C,D,E}
4{A,D,E}
5{A,B,C}
6{A,B,C,D}
7{B,C}
8{A,B,C}
9{A,B,D}
10{B,C,E}
TIDItems
1{B}
2{}
3{C,D,E}
4{D,E}
5{B,C}
6{B,C,D}
7{}
8{B,C}
9{B,D}
10{}
)]
(
1
)[
(
)]
(
1
)[
(
)
(
)
(
)
,
(
)
(
)
(
)
,
(
)
(
)
(
)
,
(
)
(
)
|
(
Y
P
Y
P
X
P
X
P
Y
P
X
P
Y
X
P
t
coefficien
Y
P
X
P
Y
X
P
PS
Y
P
X
P
Y
X
P
Interest
Y
P
X
Y
P
Lift
-
-
-
=
-
-
=
=
=
f
)
...
(
)
...
(
2
1
2
1
k
k
X
X
X
P
X
X
X
P
È
È
È
=
I
I
I