Data Mining - Practice Example
Quiz Practice Questions - Solutions
Disclaimer: These questions are provided as a study tool; however, it is not guaranteed that they fully
reflect the level and emphasis of the Final Quiz.
Question 1: Internet Advertisements Data Set
A company named FreeOfAds wishes to develop an internet add-on that filters irrelevant advertisements out of webpages. For that, it collected a dataset that represents a set of possible advertisements on Internet pages (data source: Machine Learning Repository). The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text.
Based on this dataset, FreeOfAds is interested in predicting whether an image is an advertisement ("ad") or not ("nonad"). The data contain 3279 records, of which 2821 are classified as ‘nonad’ and 458 as ‘ad’, measured on 1558 attributes (3 continuous, the rest binary). The data variables are described below.
Name              Type                     Description
height            Continuous               image height
width             Continuous               image width
aspect ratio      Continuous               ratio of width to height
local             Binary                   indicator of whether the image URL is in the same domain as the website
caption features  Binary (457 variables)   a set of binary features that encode keywords in the image title (e.g., 'gift')
alt features      Binary (495 variables)   a set of binary features that encode keywords in the image alternative (<ALT>) tag
base features     Binary (472 variables)   a set of binary features that encode keywords in the base (current) URL
target features   Binary (111 variables)   a set of binary features that encode keywords in the destination (link) URL
image features    Binary (19 variables)    a set of binary features that encode keywords in the image source URL
ad/nonad          Output                   actual classification
The data were partitioned into two sets: training (71%) and validation (29%). The
following graph depicts the actual (true) classification of the validation set.
a) For the naïve rule, fill in the 4 values in the confusion matrix of the validation set:
                  predicted
                  ad    nonad
actual   ad        0      166
         nonad     0      784
b) Compute the overall error of the naïve rule on the validation set.
%Error = 166/(166+784) = 166/950 ≈ 17.5%
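The naïve-rule calculation can be checked with a short sketch. It assumes the validation counts from the confusion matrix in part (a) (166 'ad', 784 'nonad'); the function name `naive_rule_error` is illustrative and not part of the original exercise.

```python
def naive_rule_error(class_counts):
    """Error rate of always predicting the most frequent class."""
    total = sum(class_counts.values())
    majority = max(class_counts.values())
    # Every record outside the majority class is misclassified.
    return (total - majority) / total

# Validation counts taken from the confusion matrix in part (a).
validation_counts = {"ad": 166, "nonad": 784}
error = naive_rule_error(validation_counts)
print(f"Naive rule error: {error:.1%}")
```

Because the naïve rule ignores all predictors, its error is simply the share of the minority class in the validation set.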
Three classifiers were fit to the data:
i. Classification tree with all 1558 predictors
ii. Logistic regression with the following predictors: height, width, aspect-ratio and local
c) Compare the prediction accuracy of the Classification tree and the Logistic regression model
below (and assume that the cut-offs are appropriately set). Which model predicts the data best?
There are multiple ways we can compare models. We decide to calculate the
sensitivity, specificity and the overall accuracy.
                      Accuracy   Sensitivity   Specificity
Logistic regression   0.908421   0.573171      0.978372
Classification tree   0.966316   0.841463      0.992366
From the table we see that the classification tree outperforms the logistic regression on all three measures, and therefore we would choose that model. If there were no clear winner, we would have to weigh the costs of the two kinds of error: for example, how does the cost of removing a legitimate image from a page compare to the cost of leaving an ad on a page?
Confusion Matrix for Classification Tree and Logistic Regression
Tree              predicted
                  ad    nonad   total
actual   ad      138       26     164
         nonad     6      780     786
         total   144      806     950
LR                predicted
                  ad    nonad   total
actual   ad       94       70     164
         nonad    17      769     786
         total   111      839     950
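As a check on the comparison table, the three measures can be recomputed from the confusion matrices. This is a minimal sketch that treats 'ad' as the positive class; the function name `classification_metrics` is an illustrative assumption.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity with 'ad' as the positive class."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual ads predicted as ad
        "specificity": tn / (tn + fp),  # share of actual nonads predicted as nonad
    }

# Counts read off the two confusion matrices (rows = actual, columns = predicted).
tree = classification_metrics(tp=138, fn=26, fp=6, tn=780)
logit = classification_metrics(tp=94, fn=70, fp=17, tn=769)

for name, metrics in [("Classification tree", tree), ("Logistic regression", logit)]:
    print(name, {k: round(v, 6) for k, v in metrics.items()})
```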
Question 2: Crime Rate
Data on crime rate were collected from 51 cities in the US (data source: Life in America's Small Cities, by G.S. Thomas). We are interested in understanding the factors that affect the crime rate.
For each city, the following information was collected:
CRIME_RATE: Total overall reported crime rate per 1 million residents
REPORTED: Reported violent crime rate per 100,000 residents
FUNDING: Annual police funding in $/resident
LOCATION: One of two locations: east=1, west=0
a) The purpose of this study is (circle one):
Explanatory / Predictive
Question 3: k-NN
Consider the one-dimensional data below.
x:    2    4    6    8   10   15   20   25   30   35   40   45
y:    1   -1    1   -1    1   -1    1   -1    1   -1    1   -1

x:   50   55   60   65   70   75   80   85   90   95  100  200
y:    1   -1    1   -1    1   -1    1   -1    1   -1    1   -1
a) We want to classify a new record that has x=53. Calculate the distance to its nearest neighbor using the
Euclidean distance metric.
The nearest neighbor to 53 is 55 (the next closest point, 50, is 3 away). Therefore the distance is sqrt((55-53)^2) = 2.
b) Now classify the points x=1, x=11 and x=100 using the k-nearest neighbor classifier with k=5. Show
your work and write your answers below.
For x=1, the nearest neighbors are 2,4,6,8,10, and the most common class amongst those points
is 1, therefore we classify x=1 as class 1.
For x=11, the nearest neighbors are 4,6,8,10,15 and the most common class amongst those
points is -1, therefore we classify x=11 as class -1.
For x=100, the nearest neighbors are 80,85,90,95,100 and the most common class
amongst those points is 1, therefore we classify x=100 as class 1.
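The k-NN answers can be reproduced with a small sketch. It assumes the 24 training records from the table in this question, with classes alternating 1, -1 down the table. The helper names `k_nearest` and `knn_classify` are illustrative, and distance ties (none occur for these query points) would be broken by table order.

```python
from collections import Counter

# Training data from the Question 3 table; classes alternate 1, -1 down the table.
xs = [2, 4, 6, 8, 10, 15, 20, 25, 30, 35, 40, 45,
      50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 200]
ys = [1, -1] * 12
data = list(zip(xs, ys))

def k_nearest(x_new, k):
    """Return the k training records closest to x_new.

    In one dimension the Euclidean distance sqrt((x - x_new)^2) is |x - x_new|.
    """
    return sorted(data, key=lambda rec: abs(rec[0] - x_new))[:k]

def knn_classify(x_new, k=5):
    """Majority vote among the k nearest neighbors."""
    votes = Counter(y for _, y in k_nearest(x_new, k))
    return votes.most_common(1)[0][0]

# Part (a): distance from x=53 to its nearest neighbor.
nearest_x, _ = k_nearest(53, 1)[0]
print("nearest to 53:", nearest_x, "distance:", abs(nearest_x - 53))

# Part (b): classify the three query points with k=5.
for x_new in (1, 11, 100):
    neighbors = sorted(x for x, _ in k_nearest(x_new, 5))
    print(f"x={x_new}: neighbors {neighbors} -> class {knn_classify(x_new)}")
```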
Question 4: Performance Measures
In the table below, the value of p represents the confidence with which a certain classifier predicts the
class to be 1 (a larger value of p means that the classifier is more confident that the class is 1). Compute
the sensitivity and accuracy with respect to class 1 when a cutoff of p=.90 is used.
p      True class   Predicted class
1      1            1
0.99   1            1
0.98   1            1
0.96   0            1
0.86   1            0
0.62   1            0
0.12   0            0
Accuracy: 4/7 = 57.1%
Sensitivity: 3/5 = 60%
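The same numbers follow from applying the cutoff programmatically. This sketch assumes a record is predicted as class 1 when p is at least the cutoff; no record sits exactly at 0.90, so the choice between ">" and ">=" does not affect the result. The function name `evaluate` is illustrative.

```python
# (p, true class) pairs from the Question 4 table.
records = [(1.00, 1), (0.99, 1), (0.98, 1), (0.96, 0),
           (0.86, 1), (0.62, 1), (0.12, 0)]

def evaluate(records, cutoff):
    """Return (accuracy, sensitivity) for class 1 at the given cutoff."""
    # Assumption: predict class 1 when p >= cutoff, otherwise class 0.
    pairs = [(1 if p >= cutoff else 0, actual) for p, actual in records]
    correct = sum(pred == actual for pred, actual in pairs)
    true_positives = sum(pred == 1 and actual == 1 for pred, actual in pairs)
    actual_positives = sum(actual == 1 for _, actual in pairs)
    return correct / len(pairs), true_positives / actual_positives

accuracy, sensitivity = evaluate(records, cutoff=0.90)
print(f"Accuracy: {accuracy:.1%}, Sensitivity: {sensitivity:.0%}")
```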
Question 5: Draw the decision boundaries
The following pictures all show the same dataset. For each figure draw the approximate decision
boundary using a cutoff value of .5.
[Figure: Classification Tree. Scatter plot of the class=1 and class=0 points; the solution marks one region "Classified as class 1" and the other "Classified as class 0".]
[Figure: k-NN, when k=1. Scatter plot of the same class=1 and class=0 points.]
Question 6: Use the table below for the following two questions.
Transaction ID   Items Bought
1                {a,d,e}
24               {a,b,c,e}
12               {a,b,d,e}
31               {a,c,d,e}
15               {b,c,e}
22               {b,d,e}
29               {c,d}
40               {a,b,c}
33               {a,d,e}
38               {a,b,c}
a) Treating each transaction as a market basket, compute the confidence for the rule {b} → {a}.
# of transactions with {a,b} = 4
# of transactions with {b} = 6
The confidence = 4/6 = 66.67%
b) Explain the meaning of the confidence value in the context of the previous problem.
A randomly chosen transaction that contains {b} has a 66.67% chance of also containing {a}; that is, two thirds of the baskets with item b also include item a.
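The confidence calculation can be written as a short sketch over the ten transactions in the table. The function name `confidence` is an illustrative assumption; it counts baskets containing the antecedent and, among those, baskets that also contain the consequent.

```python
# Transactions from the Question 6 table: ID -> items bought.
transactions = {
    1: {"a", "d", "e"},
    24: {"a", "b", "c", "e"},
    12: {"a", "b", "d", "e"},
    31: {"a", "c", "d", "e"},
    15: {"b", "c", "e"},
    22: {"b", "d", "e"},
    29: {"c", "d"},
    40: {"a", "b", "c"},
    33: {"a", "d", "e"},
    38: {"a", "b", "c"},
}

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    # `<=` on sets tests whether the left set is a subset of the basket.
    with_antecedent = [t for t in transactions.values() if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)

conf_b_a = confidence({"b"}, {"a"})
print(f"confidence({{b}} -> {{a}}) = {conf_b_a:.2%}")
```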
Question 7 – Multiple choice and short answer
a) Compute the (Euclidean) distance between the points (2,2,3) and (5,7,10).
sqrt((2 − 5)^2 + (2 − 7)^2 + (3 − 10)^2) = sqrt(9 + 25 + 49) = sqrt(83) ≈ 9.11
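A quick numerical check of this distance, as a minimal sketch using only the standard library (the variable names are illustrative):

```python
import math

p = (2, 2, 3)
q = (5, 7, 10)

# Squared coordinate differences: (2-5)^2, (2-7)^2, (3-10)^2.
squared_diffs = [(a - b) ** 2 for a, b in zip(p, q)]
distance = math.sqrt(sum(squared_diffs))

print(squared_diffs, round(distance, 2))
```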
b) Which of the following statements is true for bagging? (circle either true or false)
True/False
Bagging combines simple base classifiers by upweighting data points which are classified
incorrectly
True/False
Bagging builds different classifiers by training on repeated samples (with replacement)
from the data
True/False
Bagging usually gives zero training error, but rarely overfits which is very curious
c) Which of the following statements are true? (circle either true or false)
True/False
When running the association rule algorithm, increasing the minimum support
requirement will always result in a higher number of rules
True/False
When running a Hierarchical clustering algorithm we start out with each record as a
cluster. The first step is to join the two closest records. In this first step, which records
are joined up depends on the cluster distance measure selected (single linkage,
complete linkage, average linkage or the average group linkage)
d) The figure shows time series plots of monthly sales of fortified Australian wines for 1980-1994 (The
first data point is January 1980, the last data point is December 1994). The units are thousands of
liters. You have been hired to analyze the data for short-term forecasting purposes (2-3 months)
You start by partitioning the data using the period until December 1993 as the training period.
Which of the following models are appropriate to use? (Select all that apply.)
i. Holt-Winters with multiplicative seasonality (of Sales)
ii. Multiple linear regression with trend and seasonality using Sales as the dependent variable
iii. Multiple linear regression with trend and seasonality using Log(Sales) as the dependent variable
iv. Double exponential smoothing (of Sales)
Question 8 - Clustering
Hacker Pschorr (HP) is one of the oldest beer brewing companies in Munich.
They have collected data on beer preferences as well as demographic
information of 100 of their customers, out of which 50 prefer regular beer over
the lighter type. The figure below shows a snapshot of the first rows of data.
Gender   Married   Income    Age   Preference
0        0         $31,779   46    Regular
1        1         $32,739   50    Regular
1        1         $24,302   46    Regular
1        1         $64,709   70    Regular
1        1         $41,882   54    Regular
1        0         $38,990   36    Regular
1        0         $22,408   40    Regular
1        1         $25,440   51    Regular
0        1         $30,784   52    Regular
1        0         $31,916   43    Regular
1        0         $23,234   31    Regular
0        1         $51,094   46    Regular
1        0         $38,176   40    Regular
1        0         $28,513   34    Regular
0        1         $44,955   53    Regular
0        1         $42,051   58    Regular
The Gender variable =1 for males, 0 for females. The Married variable =1 if the customer is married, 0 otherwise. The Income variable is the annual household income (sample average $40,000, stdev $9725) and the Age variable is in years (sample average 44.2 years, stdev 12 years).
In order to facilitate data modeling, HP codes the Preference as Y, which equals 1 if the customer prefers light beer and equals 0 if the customer prefers regular beer.
HP is interested in better understanding the different groups of customers and their preferences and runs a clustering analysis to gain some insights.
HP runs the k-means clustering algorithm; the settings are in Exhibit 1.
The following pivot table was created from the clustering output.
Cluster #   Average of Gender   Average of Married   Average of Income   Average of Age
1           1                   0.040                36904               40.7
2           0.118               1.000                48940               41.9
3           0                   1                    34387               46.4
4           1                   0.923                39325               49.5
5           1                   1                    61728               66.0
6           0                   0                    35421               37.8
a) How would you characterize cluster 4?
Cluster 4 consists of men (100%) who are mostly married (92%) and on average older than the
typical customer (49.5 vs. 44.2 years), with roughly average income ($39,325 vs. $40,000).
b) An HP data analyst decides to run some analysis on beer preferences within each cluster. The
following table shows the average value of Y within each cluster. How would you characterize the
customers that are the most likely to prefer light beer?
Cluster # Avg of Y
1 0.4
2 0.8
3 0.22
4 0.4
5 0.5
6 0.6
The customers most likely to prefer light beer correspond to cluster 2 (average Y = 0.8): married
customers, mostly female (about 88%), from higher-income households.
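The reading of the two tables can be sketched in code: pick the cluster with the highest average Y, then compare its profile with the sample averages quoted in the problem ($40,000 income, 44.2 years). The dictionary layout and key names are illustrative assumptions; the numbers are copied from the pivot table and the average-Y table.

```python
# Cluster profiles from the pivot table, plus average Y from part (b).
clusters = {
    1: {"gender": 1.0,   "married": 0.040, "income": 36904, "age": 40.7, "avg_y": 0.40},
    2: {"gender": 0.118, "married": 1.000, "income": 48940, "age": 41.9, "avg_y": 0.80},
    3: {"gender": 0.0,   "married": 1.000, "income": 34387, "age": 46.4, "avg_y": 0.22},
    4: {"gender": 1.0,   "married": 0.923, "income": 39325, "age": 49.5, "avg_y": 0.40},
    5: {"gender": 1.0,   "married": 1.000, "income": 61728, "age": 66.0, "avg_y": 0.50},
    6: {"gender": 0.0,   "married": 0.000, "income": 35421, "age": 37.8, "avg_y": 0.60},
}

# Sample averages stated in the problem description.
SAMPLE_INCOME = 40000
SAMPLE_AGE = 44.2

# Cluster whose members are most likely to prefer light beer (Y = 1).
best = max(clusters, key=lambda c: clusters[c]["avg_y"])
profile = clusters[best]

share_female = 1 - profile["gender"]  # Gender = 1 for males, 0 for females
print(f"Cluster {best}: avg Y = {profile['avg_y']}")
print(f"  female share: {share_female:.0%}, married share: {profile['married']:.0%}")
print(f"  income vs. sample average: {profile['income'] - SAMPLE_INCOME:+d}")
print(f"  age vs. sample average: {profile['age'] - SAMPLE_AGE:+.1f}")
```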
Exhibit 1: Run Parameters