Msapriori implementation(Data Mining)

web_data_mining.pdf

Home >Computer Science homework help >Msapriori implementation(Data Mining)

24 2 Association Rules and Sequential Patterns

ket. Worst still, they will almost certainly cause combinatorial explosion! For itemsets involving such items to be useful, their supports have to be much higher. Similarly, knowing that 0.006% of the customers buy the four items in f2 together is also meaningless because Bread, Egg and Milk

are purchased on almost every grocery shopping trip.

This dilemma is called the rare item problem. Using a single minsup for the whole data set is inadequate because it cannot capture the inherent natures and/or frequency differences of items in the database. By the na- tures of items we mean that some items, by nature, appear more frequently than others. For example, in a supermarket, people buy FoodProcessor and CookingPan much less frequently than Bread and Milk. The situation is the same for online stores. In general, those durable and/or expensive goods are bought less often, but each of them generates more profit. It is thus im- portant to capture rules involving less frequent items. However, we must do so without allowing frequent items to produce too many meaningless rules with very low supports and cause combinatorial explosion [344].

One common solution to this problem is to partition the data into several smaller blocks (subsets), each of which contains only items of similar fre- quencies. Mining is then done separately for each block using a different minsup. This approach is, however, not satisfactory because itemsets or rules that involve items across different blocks will not be found.

A better solution is to allow the user to specify multiple minimum sup- ports, i.e., to specify a different minimum item support (MIS) to each item. Thus, different itemsets need to satisfy different minimum supports depending on what items are in the itemsets. This model thus enables us to achieve our objective of finding itemsets involving rare items without causing frequent items to generate too many meaningless itemsets. This method helps solve the problem of f1. To deal with the problem of f2, we prevent itemsets that contain both very frequent items and very rare items from being generated. A constraint will be introduced to realize this.

An interesting by-product of this extended model is that it enables the user to easily instruct the algorithm to generate only itemsets that contain certain items but not itemsets that contain only the other items. This can be done by setting the MIS values to more than 100% (e.g., 101%) for these other items. This capability is very useful in practice because in many ap- plications the user is only interested in certain types of itemsets or rules.

2.4.1 Extended Model

To allow multiple minimum supports, the original model in Sect. 2.1 needs to be extended. In the extended model, the minimum support of a rule is

2.4 Mining with Multiple Minimum Supports 25

expressed in terms of minimum item supports (MIS) of the items that appear in the rule. That is, each item in the data can have a MIS value specified by the user. By providing different MIS values for different items, the user effectively expresses different support requirements for dif- ferent rules. It seems that specifying a MIS value for each item is a diffi- cult task. This is not so as we will see at the end of Sect. 2.4.2.

Let MIS(i) be the MIS value of item i. The minimum support of a rule R is the lowest MIS value among the items in the rule. That is, a rule R,

i1, i2, …, ik ik+1, …, ir,

satisfies its minimum support if the rule’s actual support in the data is greater than or equal to:

min(MIS(i1), MIS(i2), …, MIS(ir)).

Minimum item supports thus enable us to achieve the goal of having higher minimum supports for rules that involve only frequent items, and having lower minimum supports for rules that involve less frequent items.

Example 8: Consider the set of items in a data set, {Bread, Shoes, Clothes}. The user-specified MIS values are as follows:

MIS(Bread) = 2% MIS(Clothes) = 0.2% MIS(Shoes) = 0.1%.

The following rule doesn’t satisfy its minimum support:

Clothes Bread [sup = 0.15%, conf = 70%].

This is so because min(MIS(Bread), MIS(Clothes)) = 0.2%. The following rule satisfies its minimum support:

Clothes Shoes [sup = 0.15%, conf = 70%].

because min(MIS(Clothes), MIS(Shoes)) = 0.1%.

As we explained earlier, the downward closure property holds the key to pruning in the Apriori algorithm. However, in the new model, if we use the Apriori algorithm to find all frequent itemsets, the downward closure property no longer holds.

Example 9: Consider the four items 1, 2, 3 and 4 in a data set. Their minimum item supports are:

MIS(1) = 10% MIS(2) = 20% MIS(3) = 5% MIS(4) = 6%.

If we find that itemset {1, 2} has a support of 9% at level 2, then it does not satisfy either MIS(1) or MIS(2). Using the Apriori algorithm, this itemset is discarded since it is not frequent. Then, the potentially frequent itemsets {1, 2, 3} and {1, 2, 4} will not be generated for level 3. Clearly, itemsets {1,

26 2 Association Rules and Sequential Patterns

2, 3} and {1, 2, 4} may be frequent because MIS(3) is only 5% and MIS(4) is 6%. It is thus wrong to discard {1, 2}. However, if we do not discard {1,

2}, the downward closure property is lost.

Below, we present an algorithm to solve this problem. The essential idea is to sort the items according to their MIS values in ascending order to avoid the problem.

Note that MIS values prevent low support itemsets involving only fre- quent items from being generated because their individual MIS values are all high. To prevent very frequent items and very rare items from appear- ing in the same itemset, we introduce the support difference constraint.

Let sup(i) be the actual support of item i in the data. For each itemset s, the support difference constraint is as follows:

maxi s{sup(i)} mini s{sup(i)} ,

where 0 1 is the user-specified maximum support difference, and it is the same for all itemsets. The constraint basically limits the difference between the largest and the smallest actual supports of items in itemset s to

. This constraint can reduce the number of itemsets generated dramati- cally, and it does not affect the downward closure property.

2.4.2 Mining Algorithm

The new algorithm generalizes the Apriori algorithm for finding frequent itemsets. We call the algorithm, MS-Apriori. When there is only one MIS value (for all items), it reduces to the Apriori algorithm.

Like Apriori, MS-Apriori is also based on level-wise search. It generates all frequent itemsets by making multiple passes over the data. However, there is an exception in the second pass as we will see later.

The key operation in the new algorithm is the sorting of the items in I in ascending order of their MIS values. This order is fixed and used in all subsequent operations of the algorithm. The items in each itemset follow this order. For example, in Example 9 of the four items 1, 2, 3 and 4 and their given MIS values, the items are sorted as follows: 3, 4, 1, 2. This or- der helps solve the problem identified above.

Let Fk denote the set of frequent k-itemsets. Each itemset w is of the fol- lowing form, {w[1], w[2], …, w[k]}, which consists of items, w[1], w[2],

…, w[k], where MIS(w[1]) MIS(w[2]) … MIS(w[k]). The algorithm MS-Apriori is given in Fig. 2.6. Line 1 performs the sorting on I according to the MIS value of each item (stored in MS). Line 2 makes the first pass over the data using the function init-pass(), which takes two arguments, the

2.4 Mining with Multiple Minimum Supports 27

data set T and the sorted items M, to produce the seeds L for generating candidate itemsets of length 2, i.e., C2. init-pass() has two steps:

1. It first scans the data once to record the support count of each item. 2. It then follows the sorted order to find the first item i in M that meets

MIS(i). i is inserted into L. For each subsequent item j in M after i, if

j.count/n MIS(i), then j is also inserted into L, where j.count is the support count of j, and n is the total number of transactions in T.

Frequent 1-itemsets (F1) are obtained from L (line 3). It is easy to show that all frequent 1-itemsets are in F1.

Example 10: Let us follow Example 9 and the given MIS values for the four items. Assume our data set has 100 transactions (not limited to the four items). The first pass over the data gives us the following support counts: {3}.count = 6, {4}.count = 3, {1}.count = 9 and {2}.count = 25. Then,

L = {3, 1, 2}, and F1 = {{3}, {2}}.

Item 4 is not in L because 4.count/n < MIS(3) (= 5%), and {1} is not in F1 because 1.count / n < MIS(1) (= 10%).

For each subsequent pass (or data scan), say pass k, the algorithm per- forms three operations.

Algorithm MS-Apriori(T, MS, ) // MS stores all MIS values 1 M sort(I, MS); // according to MIS(i)’s stored in MS 2 L init-pass(M, T); // make the first pass over T 3 F1 {{l} | l L, l.count/n MIS(l)}; // n is the size of T 4 for (k = 2; Fk 1 ; k++) do 5 if k = 2 then 6 Ck level2-candidate-gen(L, ) // k = 2 7 else Ck MScandidate-gen(Fk 1, ) 8 endif; 9 for each transaction t T do 10 for each candidate c Ck do 11 if c is contained in t then // c is a subset of t 12 c.count++ 13 if c – {c[1]} is contained in t then // c without the first item 14 (c – {c[1]}).count++ 15 endfor 16 endfor 17 Fk {c Ck | c.count/n MIS(c[1])} 18 endfor 19 return F k Fk;

Fig. 2.6. The MS-Apriori algorithm

28 2 Association Rules and Sequential Patterns

1. The frequent itemsets in Fk 1 found in the (k–1)th pass are used to gener- ate the candidates Ck using the MScandidate-gen() function (line 7). However, there is a special case, i.e., when k = 2 (line 6), for which the candidate generation function is different, i.e., level2-candidate-gen().

2. It then scans the data and updates various support counts of the candi- dates in Ck (line 9–16). For each candidate c, we need to update its sup- port count (lines 11–12) and also the support count of c without the first item (lines 13–14), i.e., c – {c[1]}, which is used in rule generation and will be discussed in Sect. 2.4.3. If rule generation is not required, lines 13 and 14 can be deleted.

3. The frequent itemsets (Fk) for the pass are identified in line 17.

We present candidate generation functions level2-candidate-gen() and MScandidate-gen() below.

Level2-candidate-gen function: It takes an argument L, and returns a su- perset of the set of all frequent 2-itemsets. The algorithm is given in Fig.

2.7. Note that in line 5, we use |sup(h) sup(l)| because sup(l) may not be lower than sup(h), although MIS(l) MIS(h).

Example 11: Let us continue with Example 10. We set = 10%. Recall the MIS values of the four items are (in Example 9):

MIS(1) = 10% MIS(2) = 20% MIS(3) = 5% MIS(4) = 6%.

The level2-candidate-gen() function in Fig. 2.7 produces

C2 = {{3, 1}}.

{1, 2} is not a candidate because the support count of item 1 is only 9 (or 9%), less than MIS(1) (= 10%). Hence, {1, 2} cannot be frequent. {3, 2} is not a candidate because sup(3) = 6% and sup(2) = 25% and their difference

is greater than = 10%

Note that we must use L rather than F1 because F1 does not contain those items that may satisfy the MIS of an earlier item (in the sorted order) but not the MIS of itself, e.g., item 1 in the above example. Using L, the prob- lem discussed in Sect. 2.4.1 is solved for C2.

MScandidate-gen function: The algorithm is given in Fig. 2.8, which is similar to the candidate-gen function in the Apriori algorithm. It also has two steps, the join step and the pruning step. The join step (lines 2–6) is the same as that in the candidate-gen() function. The pruning step (lines 8– 12) is, however, different.

For each (k-1)-subset s of c, if s is not in Fk 1, c can be deleted from Ck. However, there is an exception, which is when s does not include c[1]

2.4 Mining with Multiple Minimum Supports 29

(there is only one such s). That is, the first item of c, which has the lowest MIS value, is not in s. Even if s is not in Fk 1, we cannot delete c because we cannot be sure that s does not satisfy MIS(c[1]), although we know that it does not satisfy MIS(c[2]), unless MIS(c[2]) = MIS(c[1]) (line 9).

Example 12: Let F3 = {{1, 2, 3}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {1, 4, 5}, {1, 4, 6}, {2, 3, 5}}. Items in each itemset are in the sorted order. The join step produces (we ignore the support difference constraint here)

{1, 2, 3, 5}, {1, 3, 4, 5} and {1, 4, 5, 6}.

The pruning step deletes {1, 4, 5, 6} because {1, 5, 6} is not in F3. We are then left with C4 = {{1, 2, 3, 5}, {1, 3, 4, 5}}. {1, 3, 4, 5} is not deleted al- though {3, 4, 5} is not in F3 because the minimum support of {3, 4, 5} is MIS(3), which may be higher than MIS(1). Although {3, 4, 5} does not sat- isfy MIS(3), we cannot be sure that it does not satisfy MIS(1). However, if

MIS(3) = MIS(1), then {1, 3, 4, 5} can also be deleted.

Function level2-candidate-gen(L, )

1 C2 ; // initialize the set of candidates 2 for each item l in L in the same order do

3 if l.count/n MIS(l) then 4 for each item h in L that is after l do

5 if h.count/n MIS(l) and |sup(h) sup(l)| then

6 C2 C2 {{l, h}}; // insert the candidate {l, h} into C2

Fig. 2.7. The level2-candidate-gen function

Function MScandidate-gen(Fk 1, )

1 Ck ; // initialize the set of candidates

2 forall f1, f2 Fk // find all pairs of frequent itemsets

3 with f1 = {i1, … , ik 2, ik 1} // that differ only in the last item

4 and f2 = {i1, … , ik 2, i’k 1}

5 and ik-1 < i’k 1 and |sup(ik-1) sup(i’k 1)| do

6 c {i1, …, ik 1, i’k 1}; // join the two itemsets f1 and f2 7 Ck Ck {c}; // insert the candidate itemset c into Ck 8 for each (k 1)-subset s of c do

9 if (c[1] s) or (MIS(c[2]) = MIS(c[1])) then

10 if (s Fk 1) then 11 delete c from Ck; // delete c from the set of candidates 12 endfor 13 endfor 14 return Ck; // return the generated candidates

Fig. 2.8. The MScandidate-gen function

30 2 Association Rules and Sequential Patterns

The problem discussed in Sect. 2.4.1 is solved for Ck (k > 2) because, due to the sorting, we do not need to extend a frequent (k 1)-itemset with any item that has a lower MIS value. Let us see a complete example.

Example 13: Given the following seven transactions,

Beef, Bread Bread, Clothes

Bread, Clothes, Milk Cheese, Boots Beef, Bread, Cheese, Shoes Beef, Bread, Cheese, Milk Bread, Milk, Clothes

and MIS(Milk) = 50%, MIS(Bread) = 70%, and 25% for all other items. Again, the support difference constraint is not used. The following fre- quent itemsets are produced:

F1 = {{Beef}, {Cheese}, {Clothes}, {Bread}} F2 = {{Beef, Cheese}, {Beef, Bread}, {Cheese, Bread}

{Clothes, Bread}, {Clothes, Milk}}

F3 = {{Beef, Cheese, Bread}, {Clothes, Milk, Bread}}.

To conclude this sub-section, let us further discuss two important issues:

1. Specify MIS values for items: This is usually done in two ways:

Assign a MIS value to each item according to its actual sup- port/frequency in the data set T. For example, if the actual support of item i in T is sup(i), then the MIS value for i may be computed with

sup(i), where is a parameter (0 1) and is the same for all items in T.

Group items into clusters (or blocks). Items in each cluster have simi- lar frequencies. All items in the same cluster are given the same MIS value. We should note that in the extended model frequent itemsets involving items from different clusters will be found.

2. Generate itemsets that must contain certain items: As mentioned earlier, the extended model enables the user to instruct the algorithm to generate itemsets that must contain certain items, or not to generate any itemsets consisting of only the other items. Let us see an example.

Example 14: Given the data set in Example 13, if we want to generate frequent itemsets that must contain at least one item in {Boots, Bread, Cheese, Milk, Shoes}, or not to generate itemsets involving only Beef and/or Clothes, we can simply set

MIS(Beef) = 101%, and MIS(Clothes) = 101%

2.4 Mining with Multiple Minimum Supports 31

Then the algorithm will not generate the itemsets, {Beef}, {Clothes} and {Beef, Clothes}. However, it will still generate such frequent item-

sets as {Cheese, Beef} and {Cheese, Bread, Beef}.

In many applications, this feature comes quite handy because the user is often only interested in certain types of itemsets or rules.

2.4.3 Rule Generation

Association rules are generated using frequent itemsets. In the case of a single minsup, if f is a frequent itemset and fsub is a subset of f, then fsub must also be a frequent itemset. All their support counts are computed and recorded by the Apriori algorithm. Then, the confidence of each possible rule can be easily calculated without seeing the data again.

However, in the case of MS-Apriori, if we only record the support count of each frequent itemset, it is not sufficient. Let us see why.

Example 15: Recall in Example 8, we have

MIS(Bread) = 2% MIS(Clothes) = 0.2% MIS(Shoes) = 0.1%.

If the actual support for the itemset {Clothes, Bread} is 0.15%, and for the itemset {Shoes, Clothes, Bread} is 0.12%, according to MS-Apriori, {Clothes, Bread} is not a frequent itemset since its support is less than MIS(Clothes). However, {Shoes, Clothes, Bread} is a frequent itemset as its actual support is greater than

min(MIS(Shoes), MIS(Clothes), MIS(Bread)) = MIS(Shoes)).

We now have a problem in computing the confidence of the rule,

Clothes, Bread Shoes

because the itemset {Clothes, Bread} is not a frequent itemset and thus its support count is not recorded. In fact, we may not be able to compute the confidences of the following rules either:

Clothes Shoes, Bread Bread Shoes, Clothes

because {Clothes} and {Bread} may not be frequent.

Lemma: The above problem may occur only when the item that has the lowest MIS value in the itemset is in the consequent of the rule (which may have multiple items). We call this problem the head-item problem.

Proof by contradiction: Let f be a frequent itemset, and a f be the item with the lowest MIS value in f (a is called the head item). Thus, f uses