Data Mining


7.8 Exercises

1. Consider the traffic accident data set shown in Table 7.10.

Table 7.10. Traffic accident data set.

Weather     Driver's           Traffic                  Seat belt   Crash
Condition   Condition          Violation                            Severity
Good        Alcohol-impaired   Exceed speed limit       No          Major
Bad         Sober              None                     Yes         Minor
Good        Sober              Disobey stop sign        Yes         Minor
Good        Sober              Exceed speed limit       Yes         Major
Bad         Sober              Disobey traffic signal   No          Major
Good        Alcohol-impaired   Disobey stop sign        Yes         Minor
Bad         Alcohol-impaired   None                     Yes         Major
Good        Sober              Disobey traffic signal   Yes         Major
Good        Alcohol-impaired   None                     No          Major
Bad         Sober              Disobey traffic signal   No          Major
Good        Alcohol-impaired   Exceed speed limit       Yes         Major
Bad         Sober              Disobey stop sign        Yes         Minor

(a) Show a binarized version of the data set.

(b) What is the maximum width of each transaction in the binarized data?

(c) Assuming that the support threshold is 30%, how many candidate and frequent itemsets will be generated?

(d) Create a data set that contains only the following asymmetric binary attributes: (Weather = Bad, Driver's condition = Alcohol-impaired, Traffic violation = Yes, Seat Belt = No, Crash Severity = Major). For Traffic violation, only None has a value of 0. The rest of the attribute values are assigned to 1. Assuming that the support threshold is 30%, how many candidate and frequent itemsets will be generated?

(e) Compare the number of candidate and frequent itemsets generated in parts (c) and (d).
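As a rough illustration of what binarization in part (a) involves, the sketch below turns each (attribute, value) pair into its own 0/1 item and reports the width of each transaction. It uses only the first three records of Table 7.10, read column-wise from the table; the shortened attribute names and the `binarize` helper are assumptions made for this example, not notation from the text.

```python
from typing import Dict, List, Tuple

# First three records of Table 7.10 (illustrative subset, shortened attribute names).
records: List[Dict[str, str]] = [
    {"Weather": "Good", "Driver": "Alcohol-impaired",
     "Violation": "Exceed speed limit", "Seat belt": "No", "Severity": "Major"},
    {"Weather": "Bad", "Driver": "Sober",
     "Violation": "None", "Seat belt": "Yes", "Severity": "Minor"},
    {"Weather": "Good", "Driver": "Sober",
     "Violation": "Disobey stop sign", "Seat belt": "Yes", "Severity": "Minor"},
]


def binarize(rows: List[Dict[str, str]]) -> Tuple[List[str], List[List[int]]]:
    """Create one binary item per distinct attribute=value pair."""
    items = sorted({f"{attr}={val}" for row in rows for attr, val in row.items()})
    matrix = []
    for row in rows:
        present = {f"{attr}={val}" for attr, val in row.items()}
        matrix.append([1 if item in present else 0 for item in items])
    return items, matrix


items, matrix = binarize(records)
# Width of a transaction = number of items whose value is 1.
widths = [sum(row) for row in matrix]

print(items)
for row, width in zip(matrix, widths):
    print(row, "width =", width)
```

Because every record takes exactly one value per attribute, each binarized row contains as many 1s as there are original attributes.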

2. (a) Consider the data set shown in Table 7.11. Suppose we apply the following discretization strategies to the continuous attributes of the data set.

D1: Partition the range of each continuous attribute into 3 equal-sized bins.

D2: Partition the range of each continuous attribute into 3 bins, where each bin contains an equal number of transactions.

Chapter 7 Association Analysis: Advanced Concepts

Table 7.11. Data set for Exercise 2.

TID   Temperature   Pressure   Alarm1   Alarm2   Alarm3
1     95            1105       0        0        1
2     85            1040       1        1        0
3     103           1090       1        1        1
4     97            1084       1        0        0
5     80            1038       0        1        1
6     100           1080       1        1        0
7     83            1025       1        0        1
8     86            1030       1        0        0
9     101           1100       1        1        1

For each strategy, answer the following questions:

i. Construct a binarized version of the data set.

ii. Derive all the frequent itemsets having support > 30%.

(b) The continuous attribute can also be discretized using a clustering approach.

i. Plot a graph of temperature versus pressure for the data points shown in Table 7.11.

ii. How many natural clusters do you observe from the graph? Assign a label (C1, C2, etc.) to each cluster in the graph.

iii. What type of clustering algorithm do you think can be used to identify the clusters? State your reasons clearly.

iv. Replace the temperature and pressure attributes in Table 7.11 with asymmetric binary attributes C1, C2, etc. Construct a transaction matrix using the new attributes (along with attributes Alarm1, Alarm2, and Alarm3).

v. Derive all the frequent itemsets having support > 30% from the binarized data.
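The difference between strategies D1 and D2 in part (a) can be seen in a small sketch. The code below bins the Pressure column of Table 7.11 (values as read from the table, so treat them as illustrative) in two ways: equal-width intervals over the attribute's range, and equal-frequency groups by rank. The function names and the choice to label bins 0, 1, 2 are assumptions for this example.

```python
from typing import List

# Pressure values for TID 1-9 as read from Table 7.11 (illustrative).
pressure = [1105, 1040, 1090, 1084, 1038, 1080, 1025, 1030, 1100]


def equal_width_bins(values: List[float], k: int) -> List[int]:
    """D1-style: split [min, max] into k intervals of equal width.

    Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would fall into bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values: List[float], k: int) -> List[int]:
    """D2-style: rank the values, then give each bin the same number of them."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * k // len(values)
    return bins


print("equal width:    ", equal_width_bins(pressure, 3))
print("equal frequency:", equal_frequency_bins(pressure, 3))
```

With skewed data the equal-width scheme can leave a bin empty, while the equal-frequency scheme always fills every bin; this is one reason the two strategies can yield different frequent itemsets.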

3. Consider the data set shown in Table 7.12. The first attribute is continuous, while the remaining two attributes are asymmetric binary. A rule is considered to be strong if its support exceeds 15% and its confidence exceeds 60%. The data given in Table 7.12 supports the following two strong rules:

(i) {(1 < A < 2), B = 1} → {C = 1}
(ii) {(5 < A < 8), B = 1} → {C = 1}
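For rules of this shape, support is the fraction of all records that satisfy both the antecedent and the consequent, and confidence is the fraction of antecedent-matching records that also satisfy the consequent. The sketch below computes both on a small made-up data set; the `rows` values are hypothetical and are not the contents of Table 7.12, and the interval bounds are treated as strict, matching how the rules are printed here.

```python
from typing import List, Tuple

# Hypothetical (A, B, C) records for illustration only; NOT Table 7.12.
rows: List[Tuple[float, int, int]] = [
    (1.5, 1, 1), (1.2, 1, 0), (6.0, 1, 1),
    (7.0, 1, 1), (3.0, 0, 1), (9.0, 1, 0),
]


def rule_stats(data: List[Tuple[float, int, int]],
               lo: float, hi: float) -> Tuple[float, float]:
    """Support and confidence of {(lo < A < hi), B = 1} -> {C = 1}."""
    n = len(data)
    # Records that satisfy the antecedent.
    antecedent = [r for r in data if lo < r[0] < hi and r[1] == 1]
    # Records that satisfy antecedent and consequent together.
    both = [r for r in antecedent if r[2] == 1]
    support = len(both) / n
    confidence = len(both) / len(antecedent) if antecedent else 0.0
    return support, confidence


for lo, hi in [(1, 2), (5, 8)]:
    s, c = rule_stats(rows, lo, hi)
    print(f"{lo} < A < {hi}: support = {s:.2f}, confidence = {c:.2f}")
```

A rule would then count as strong under the stated thresholds when the returned support exceeds 0.15 and the confidence exceeds 0.60.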

(a) Compute the support and confidence for both rules.

(b) To find the rules using the traditional Apriori algorithm, we need to discretize the continuous attribute A. Suppose we apply the equal width
