DATA MINING
Answer the following questions (10 points each):
1. Consider the traffic accident data set shown in the table below.
Traffic accident data set.
Weather Condition | Driver’s Condition | Traffic Violation      | Seat Belt | Crash Severity
Good              | Alcohol-impaired   | Exceed speed limit     | No        | Major
Bad               | Sober              | None                   | Yes       | Minor
Good              | Sober              | Disobey stop sign      | No        | Minor
Bad               | Alcohol-impaired   | Exceed speed limit     | Yes       | Major
Bad               | Alcohol-impaired   | Disobey traffic signal | No        | Major
Bad               | Alcohol-impaired   | Disobey stop sign      | Yes       | Minor
Bad               | Alcohol-impaired   | None                   | Yes       | Major
Good              | Sober              | Disobey traffic signal | Yes       | Minor
Good              | Alcohol-impaired   | None                   | No        | Minor
Bad               | Sober              | None                   | Yes       | Major
Good              | Alcohol-impaired   | Exceed speed limit     | Yes       | Major
Bad               | Sober              | Disobey stop sign      | Yes       | Minor
a. Show a binarized version of the data set.
Answer:
b. What is the maximum width of each transaction in the binarized data?
Answer:
c. Assuming that the support threshold is 30%, how many candidate and frequent itemsets will be generated?
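As an aid for part (a), one common way to binarize categorical data is to create one asymmetric binary item per attribute=value pair. The following is a minimal Python sketch of that idea, not a prescribed solution. The records are transcribed from the traffic accident table on the assumption that each flattened column lists its values in row order; the short column names and the helper names (`binarize`, `item_universe`, `to_binary_matrix`) are illustrative choices, not part of the question.

```python
# Records transcribed from the traffic accident table (assumed row order).
COLUMNS = ["Weather", "Driver", "Violation", "SeatBelt", "Severity"]
RECORDS = [
    ("Good", "Alcohol-impaired", "Exceed speed limit", "No", "Major"),
    ("Bad", "Sober", "None", "Yes", "Minor"),
    ("Good", "Sober", "Disobey stop sign", "No", "Minor"),
    ("Bad", "Alcohol-impaired", "Exceed speed limit", "Yes", "Major"),
    ("Bad", "Alcohol-impaired", "Disobey traffic signal", "No", "Major"),
    ("Bad", "Alcohol-impaired", "Disobey stop sign", "Yes", "Minor"),
    ("Bad", "Alcohol-impaired", "None", "Yes", "Major"),
    ("Good", "Sober", "Disobey traffic signal", "Yes", "Minor"),
    ("Good", "Alcohol-impaired", "None", "No", "Minor"),
    ("Bad", "Sober", "None", "Yes", "Major"),
    ("Good", "Alcohol-impaired", "Exceed speed limit", "Yes", "Major"),
    ("Bad", "Sober", "Disobey stop sign", "Yes", "Minor"),
]


def binarize(records, columns):
    """Turn each record into the set of 'attribute=value' items it contains."""
    return [frozenset(f"{c}={v}" for c, v in zip(columns, row)) for row in records]


def item_universe(transactions):
    """Sorted list of every distinct item that occurs in any transaction."""
    return sorted(set().union(*transactions))


def to_binary_matrix(transactions, items):
    """One 0/1 row per transaction, one column per item."""
    return [[1 if item in t else 0 for item in items] for t in transactions]


transactions = binarize(RECORDS, COLUMNS)
items = item_universe(transactions)
matrix = to_binary_matrix(transactions, items)
```

Under this encoding, the width of a transaction is the sum of its row in `matrix`, and `len(items)` is the number of binary columns.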
2. Consider the data set shown in the table below. The first attribute is continuous, while the remaining two attributes are asymmetric binary. A rule is considered to be strong if its support exceeds 15% and its confidence exceeds 60%. The data in the table support the following two strong rules:
(i) {(1 ≤ A ≤ 2), B = 1} → {C = 1}
(ii) {(5 ≤ A ≤ 8), B = 1} → {C = 1}
A  | B | C
1  | 1 | 1
2  | 1 | 1
3  | 1 | 0
4  | 1 | 0
5  | 1 | 1
6  | 0 | 1
7  | 0 | 0
8  | 1 | 1
9  | 0 | 0
10 | 0 | 0
11 | 0 | 0
12 | 0 | 1
a. Compute the support and confidence for both rules.
Answer:
Support({(1 ≤ A ≤ 2), B = 1} → {C = 1}) =
Confidence({(1 ≤ A ≤ 2), B = 1} → {C = 1}) =
Support({(5 ≤ A ≤ 8), B = 1} → {C = 1}) =
Confidence({(5 ≤ A ≤ 8), B = 1} → {C = 1}) =
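For reference, the support of a rule is the fraction of all rows that satisfy both its antecedent and its consequent, and its confidence is that same count divided by the number of rows satisfying the antecedent. A minimal Python sketch of these two definitions follows, using the A, B, C columns of the Question 2 table. The function name `support_confidence` and its parameters are illustrative and do not appear in the question.

```python
# Columns of the Question 2 table, one entry per row.
A = list(range(1, 13))
B = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
C = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1]
ROWS = list(zip(A, B, C))


def support_confidence(rows, lo, hi):
    """Support and confidence of the rule {lo <= A <= hi, B = 1} -> {C = 1}."""
    # Rows that satisfy the antecedent.
    antecedent = [r for r in rows if lo <= r[0] <= hi and r[1] == 1]
    # Rows that satisfy the antecedent and the consequent.
    both = [r for r in antecedent if r[2] == 1]
    support = len(both) / len(rows)
    confidence = len(both) / len(antecedent) if antecedent else 0.0
    return support, confidence


rule_i = support_confidence(ROWS, 1, 2)
rule_ii = support_confidence(ROWS, 5, 8)
```

Each result is a `(support, confidence)` pair of fractions that can be compared with the 15% and 60% thresholds stated in the question.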
3. Consider the data set shown in the table below. Suppose we are interested in extracting the following association rule:
{α1 ≤ Age ≤ α2, Play Piano = Yes} → {Enjoy Classical Music = Yes}
Age | Play Piano | Enjoy Classical Music
9   | Yes        | Yes
11  | Yes        | Yes
14  | Yes        | No
17  | Yes        | No
19  | Yes        | Yes
21  | No         | No
25  | No         | No
29  | Yes        | No
33  | Yes        | No
39  | Yes        | Yes
41  | No         | Yes
47  | No         | Yes
To handle the continuous attribute, we apply the equal-frequency approach with 3, 4, and 6 intervals. Categorical attributes are handled by introducing as many new asymmetric binary attributes as the number of categorical values. Assume that the support threshold is 10% and the confidence threshold is 70%.
(a) Suppose we discretize the Age attribute into 3 equal-frequency intervals. Find a pair of values for α1 and α2 that satisfy the minimum support and minimum confidence requirements.
Answer:
(b) Repeat part (a) by discretizing the Age attribute into 4 equal-frequency intervals. Compare the extracted rules against the ones you had obtained in part (a).
Answer:
(c) Repeat part (a) by discretizing the Age attribute into 6 equal-frequency intervals. Compare the extracted rules against the ones you had obtained in part (a).
Answer:
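The procedure in parts (a) to (c) can be explored with a short Python sketch. It rests on several assumptions that are not stated in the question: each equal-frequency interval is bounded by the smallest and largest observed age in its bin, a candidate pair (α1, α2) may span one bin or a run of adjacent bins, and the thresholds are treated as inclusive. The function names are illustrative.

```python
# Columns of the Question 3 table, one entry per row.
AGE = [9, 11, 14, 17, 19, 21, 25, 29, 33, 39, 41, 47]
PIANO = [v == "Yes" for v in "Yes Yes Yes Yes Yes No No Yes Yes Yes No No".split()]
MUSIC = [v == "Yes" for v in "Yes Yes No No Yes No No No No Yes Yes Yes".split()]


def equal_frequency_bins(values, k):
    """Split the sorted values into k bins of equal size (assumes k divides len(values))."""
    ordered = sorted(values)
    size = len(ordered) // k
    return [(ordered[i * size], ordered[(i + 1) * size - 1]) for i in range(k)]


def rule_stats(lo, hi):
    """Support and confidence of {lo <= Age <= hi, Play Piano = Yes} -> {Enjoy = Yes}."""
    rows = range(len(AGE))
    antecedent = [i for i in rows if lo <= AGE[i] <= hi and PIANO[i]]
    both = [i for i in antecedent if MUSIC[i]]
    support = len(both) / len(AGE)
    confidence = len(both) / len(antecedent) if antecedent else 0.0
    return support, confidence


def qualifying_ranges(k, minsup=0.10, minconf=0.70):
    """(alpha1, alpha2) pairs built from runs of adjacent bins that meet both thresholds."""
    bins = equal_frequency_bins(AGE, k)
    hits = []
    for i in range(k):
        for j in range(i, k):
            lo, hi = bins[i][0], bins[j][1]
            support, confidence = rule_stats(lo, hi)
            if support >= minsup and confidence >= minconf:
                hits.append((lo, hi))
    return hits
```

Calling `qualifying_ranges(3)`, `qualifying_ranges(4)`, and `qualifying_ranges(6)` lists the candidate pairs for each discretization, which makes it easier to compare how the number of intervals changes the rules that can be extracted.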
4. For each of the sequences w = <e_1, ..., e_last> below, determine whether it is a subsequence of the following data sequence:
<{A, B}{C, D}{A, B}{C, D}{A, B}{C, D}>
subject to the following timing constraints:
mingap = 0 (interval between last event in e_i and first event in e_(i+1) is > 0)
maxgap = 2 (interval between first event in e_i and last event in e_(i+1) is ≤ 2)
maxspan = 6 (interval between first event in e_1 and last event in e_last is ≤ 6)
ws = 1 (time between first and last events in e_i is ≤ 1)
a. w = < {A}{B}{C}{D}> Answer:
b. w = < {A} {B, C, D} {A}> Answer:
c. w = < {A} {B, C, D} {A}> Answer:
d. w = < {B, C} {A, D} {B, C}> Answer:
e. w = < {A, B, C, D} {A, B, C, D}> Answer:
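The four timing constraints can be checked mechanically with a brute-force search. The sketch below is one possible reading of the constraints, not a definitive implementation: it assumes the i-th element of the data sequence occurs at timestamp i, and that an element of w may draw its events from several timestamps as long as the first and last timestamps used differ by at most ws. The names `windows_for` and `is_subsequence` are illustrative.

```python
from itertools import product

# Data sequence from the question; element i is assumed to occur at timestamp i + 1.
DATA = [{"A", "B"}, {"C", "D"}, {"A", "B"}, {"C", "D"}, {"A", "B"}, {"C", "D"}]


def windows_for(element, data, ws):
    """All (first, last) timestamp pairs whose events can jointly supply `element`."""
    found = []
    n = len(data)
    for first in range(1, n + 1):
        for last in range(first, min(first + ws, n) + 1):
            covered = set().union(*data[first - 1:last])
            if not element <= covered:
                continue
            # When the window spans several timestamps, two distinct items must
            # be available at its two ends so both ends are actually used.
            ends_used = first == last or any(
                x != y
                for x in element & data[first - 1]
                for y in element & data[last - 1]
            )
            if ends_used:
                found.append((first, last))
    return found


def is_subsequence(w, data, mingap, maxgap, maxspan, ws):
    """True if some placement of w in data satisfies every timing constraint."""
    options = [windows_for(set(e), data, ws) for e in w]
    for choice in product(*options):
        gaps_ok = all(
            nxt[0] - cur[1] > mingap and nxt[1] - cur[0] <= maxgap
            for cur, nxt in zip(choice, choice[1:])
        )
        if gaps_ok and choice[-1][1] - choice[0][0] <= maxspan:
            return True
    return False


# Example call for sequence (a) with the constraints given in the question.
result_a = is_subsequence([{"A"}, {"B"}, {"C"}, {"D"}], DATA, 0, 2, 6, 1)
```

The same call with a different first argument checks the other sequences. Because the search enumerates every combination of windows, it is only practical for short sequences like the ones in this question.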
5. Draw all candidate subgraphs obtained from joining the pair of graphs shown in the figure below. Assume the edge-growing method is used to expand the subgraphs.
Answer: