Data Mining - Practice Example
Quiz Practice Questions
Disclaimer: These questions are provided as a study tool; however, it is not guaranteed that they fully
reflect the level and emphasis of the final Quiz.
Question 1: Internet Advertisements Data Set
A company named FreeOfAds wishes to develop an internet add-on that filters webpages from irrelevant advertisements. For that, it collected a dataset that represents a set of possible advertisements on Internet pages (data source: Machine Learning Repository). The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text.
Based on this dataset, FreeOfAds is interested in predicting whether an image is an advertisement ("ad") or not ("nonad"). The data contain 3279 records, of which 2821 are classified as "nonad" and 458 as "ad", measured on 1558 attributes (3 continuous, the rest binary). The data variables are described below.
Name               Type                     Description
height             continuous               image height
width              continuous               image width
aspect ratio       continuous               ratio of width to height
local              binary                   indicator of whether the image URL is in the same domain as the website
caption features   binary (457 variables)   a set of binary features that encode keywords in the image title (e.g., 'gift')
alt features       binary (495 variables)   a set of binary features that encode keywords in the image alternative (<ALT>) tag
base features      binary (472 variables)   a set of binary features that encode keywords in the base (current) URL
target features    binary (111 variables)   a set of binary features that encode keywords in the destination (link) URL
image features     binary (19 variables)    a set of binary features that encode keywords in the image source URL
ad/nonad           output                   actual classification
The data were partitioned into two sets: training (71%) and validation (29%). The
following graph depicts the actual (true) classification of the validation set.
a) For the naïve rule, fill in the 4 values in the confusion matrix of the validation set:
                 predicted
                 ad      nonad
actual   ad      ____    ____
         nonad   ____    ____
b) Compute the overall error of the naïve rule on the validation set.
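The naïve rule ignores all predictors and assigns every record to the majority class. The following is a minimal Python sketch of parts (a) and (b), not an official solution. It assumes the validation set contains 164 "ad" and 786 "nonad" records (the row totals of the confusion matrices given in part (c)) and that "nonad" is also the majority class in the training data, as it is in the full dataset. All variable names are illustrative.

```python
# Validation-set class counts, taken from the row totals of the
# confusion matrices in part (c) (assumption: same validation partition).
actual_counts = {"ad": 164, "nonad": 786}

# The naive rule predicts the majority class for every record.
majority = max(actual_counts, key=actual_counts.get)

# Confusion matrix stored as {actual class: {predicted class: count}}.
confusion = {
    actual: {pred: (n if pred == majority else 0) for pred in actual_counts}
    for actual, n in actual_counts.items()
}

total = sum(actual_counts.values())
errors = total - actual_counts[majority]  # every minority record is misclassified
naive_error = errors / total

print(confusion)
print(f"naive-rule error: {naive_error:.4f}")
```

Under these assumptions every "ad" record is misclassified and every "nonad" record is classified correctly, so the overall error equals the share of "ad" records in the validation set.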
Two classifiers were fit to the data:
i. Classification tree with all 1558 predictors
ii. Logistic regression with the following predictors: height, width, aspect-ratio and local
c) Compare the prediction accuracy of the Classification tree and the Logistic regression model
below (and assume that the cut-offs are appropriately set). Which model predicts the data best?
Confusion Matrix for Classification Tree and Logistic Regression
Tree             predicted
                 ad      nonad   total
actual   ad      138     26      164
         nonad   6       780     786
total            144     806     950
LR               predicted
                 ad      nonad   total
actual   ad      94      70      164
         nonad   17      769     786
total            111     839     950
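One way to compare the two classifiers is to derive the standard summary measures from each confusion matrix. The sketch below is illustrative only; it treats "ad" as the class of interest, and the helper name `summarize` is an assumption, not part of the original exercise.

```python
def summarize(tp, fn, fp, tn):
    """Return accuracy, error, sensitivity and specificity, with 'ad' as the positive class."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "sensitivity": tp / (tp + fn),  # share of actual ads that were caught
        "specificity": tn / (tn + fp),  # share of actual nonads that were kept
    }

# Counts read directly from the two confusion matrices above.
tree = summarize(tp=138, fn=26, fp=6, tn=780)
lr = summarize(tp=94, fn=70, fp=17, tn=769)

for name, stats in (("tree", tree), ("LR", lr)):
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

On these counts the classification tree misclassifies 32 of 950 validation records and the logistic regression 87 of 950, so the tree has the higher overall accuracy as well as the higher sensitivity and specificity.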
Question 2: Crime Rate
Data on crime rates were collected from 51 cities in the US (data source: Life in America's Small Cities, by G.S. Thomas). We are interested in understanding the factors that affect the crime rate.
For each city, the following information was collected:
CRIME_RATE: Total overall reported crime rate per 1 million residents
REPORTED: Reported violent crime rate per 100,000 residents
FUNDING: Annual police funding in $/resident
LOCATION: One of two locations: east=1, west=0
a) The purpose of this study is (circle one):
Explanatory / Predictive
Question 3: k-NN
Consider the one-dimensional data below.
x y
2 1
4 -1
6 1
8 -1
10 1
15 -1
20 1
25 -1
30 1
35 -1
40 1
45 -1
50 1
55 -1
60 1
65 -1
70 1
75 -1
80 1
85 -1
90 1
95 -1
100 1
200 -1
a) We want to classify a new record that has x=53. Using the Euclidean distance metric, calculate
the distance to its nearest neighbor.
b) Now classify the points x=1, x=11 and x=100 with the k-nearest-neighbor classifier using k=5. Show
your work and write your answers below.
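The k-NN procedure for this one-dimensional data can be sketched in a few lines of Python. This is a minimal illustration rather than a reference solution; the function names `nearest` and `knn_classify` are hypothetical, and majority voting is assumed as the classification rule.

```python
from collections import Counter

# (x, y) pairs copied from the table above.
data = [(2, 1), (4, -1), (6, 1), (8, -1), (10, 1), (15, -1), (20, 1),
        (25, -1), (30, 1), (35, -1), (40, 1), (45, -1), (50, 1), (55, -1),
        (60, 1), (65, -1), (70, 1), (75, -1), (80, 1), (85, -1), (90, 1),
        (95, -1), (100, 1), (200, -1)]

def nearest(x_new, k):
    """Return the k training points closest to x_new (in 1-D, Euclidean distance is |x - x_new|)."""
    return sorted(data, key=lambda p: abs(p[0] - x_new))[:k]

def knn_classify(x_new, k):
    """Classify x_new by majority vote among its k nearest neighbors."""
    votes = Counter(y for _, y in nearest(x_new, k))
    return votes.most_common(1)[0][0]

# Part (a): distance from x=53 to its nearest neighbor.
nn_x, _ = nearest(53, 1)[0]
print("nearest neighbor of 53:", nn_x, "distance:", abs(nn_x - 53))

# Part (b): 5-NN neighbors and predicted class for each query point.
for x in (1, 11, 100):
    print(x, nearest(x, 5), knn_classify(x, 5))
```

With k=5 and two classes a vote can never tie, and for these three query points there is no distance tie at the boundary of the five nearest neighbors, so the sort order does not affect the result.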
Question 4: Performance Measures
In the table below, the value of p represents the confidence with which a certain classifier predicts the
class to be 1 (a larger value of p means that the classifier is more confident that the class is 1). Compute
the sensitivity and accuracy with respect to class 1 when a cutoff of p=.90 is used.
p       True class
1       1
0.99    1
0.98    1
0.96    0
0.86    1
0.62    1
0.12    0
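The calculation can be sketched in Python as follows. This is a hedged illustration: it assumes a record is predicted as class 1 when p is at or above the cutoff (no value in the table sits exactly at 0.90, so the choice between "at or above" and "strictly above" does not change the result here).

```python
# (p, true class) pairs copied from the table above.
records = [(1.00, 1), (0.99, 1), (0.98, 1), (0.96, 0),
           (0.86, 1), (0.62, 1), (0.12, 0)]
cutoff = 0.90

# Assumption: predict class 1 when p >= cutoff, class 0 otherwise.
predicted = [(1 if p >= cutoff else 0, actual) for p, actual in records]

tp = sum(1 for pred, act in predicted if pred == 1 and act == 1)
fn = sum(1 for pred, act in predicted if pred == 0 and act == 1)
fp = sum(1 for pred, act in predicted if pred == 1 and act == 0)
tn = sum(1 for pred, act in predicted if pred == 0 and act == 0)

sensitivity = tp / (tp + fn)          # share of true class-1 records detected
accuracy = (tp + tn) / len(records)   # share of all records classified correctly

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"sensitivity={sensitivity:.3f} accuracy={accuracy:.3f}")
```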
Question 5: Draw the decision boundaries
The following pictures all show the same dataset. For each figure draw the approximate decision
boundary using a cutoff value of .5.
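Classification tree of level 1 (also called a tree stump)
[Figure: plot of the dataset; horizontal axis from 0 to 1, vertical axis from 0 to 1.2; legend: class=1, class=0]
k-NN, when k=1
[Figure: plot of the same dataset; horizontal axis from 0 to 1, vertical axis from 0 to 1.2; legend: class=1, class=0]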
Question 6: Use the table below for the following two questions.
Transaction ID   Items Bought
1                {a,d,e}
24               {a,b,c,e}
12               {a,b,d,e}
31               {a,c,d,e}
15               {b,c,e}
22               {b,d,e}
29               {c,d}
40               {a,b,c}
33               {a,d,e}
38               {a,b,c}
a) Treating each transaction as a market basket, compute the confidence for the rule {b} → {a}.
b) Explain the meaning of the confidence value in the context of the previous problem.
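The confidence of a rule {X} → {Y} is the number of transactions containing both X and Y divided by the number of transactions containing X. A minimal Python sketch for the rule {b} → {a} on the table above; the helper `support_count` is a hypothetical name introduced for illustration.

```python
# Transactions copied from the table above, keyed by transaction ID.
transactions = {
    1: {"a", "d", "e"},
    24: {"a", "b", "c", "e"},
    12: {"a", "b", "d", "e"},
    31: {"a", "c", "d", "e"},
    15: {"b", "c", "e"},
    22: {"b", "d", "e"},
    29: {"c", "d"},
    40: {"a", "b", "c"},
    33: {"a", "d", "e"},
    38: {"a", "b", "c"},
}

def support_count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

antecedent = {"b"}
consequent = {"a"}

# confidence(b -> a) = count(transactions with a and b) / count(transactions with b)
confidence = support_count(antecedent | consequent) / support_count(antecedent)

print(support_count(antecedent), support_count(antecedent | consequent))
print(f"confidence = {confidence:.3f}")
```

Here six transactions contain b and four of those also contain a, so the confidence can be read as the estimated proportion of baskets with b that also include a.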
Question 7 – Multiple choice and short answer
a) Compute the (Euclidean) distance between the points (2,2,3) and (5,7,10).
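The Euclidean distance is the square root of the sum of squared coordinate differences. As a quick check, the sketch below computes it both by hand and with the standard-library function `math.dist` (available from Python 3.8); the variable names are illustrative.

```python
import math

p = (2, 2, 3)
q = (5, 7, 10)

# Sum of squared coordinate differences: 3^2 + 5^2 + 7^2.
squared_sum = sum((a - b) ** 2 for a, b in zip(p, q))
distance = math.sqrt(squared_sum)

# math.dist computes the same Euclidean distance directly.
print(squared_sum, distance, math.dist(p, q))
```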
b) Which of the following statements are true for bagging? (Circle either true or false for each statement.)
True/False   Bagging combines simple base classifiers by upweighting data points which are classified
             incorrectly.
True/False   Bagging builds different classifiers by training on repeated samples (with replacement)
             from the data.
True/False   Bagging usually gives zero training error but rarely overfits, which is very curious.
c) Which of the following statements are true? (Circle either true or false for each statement.)
True/False   When running the association rule algorithm, increasing the minimum support
             requirement will always result in a higher number of rules.
True/False   When running a hierarchical clustering algorithm we start out with each record as a
             cluster. The first step is to join the two closest records. In this first step, which records
             are joined up depends on the cluster distance measure selected (single linkage,
             complete linkage, average linkage or the average group linkage).
d) The figure shows time series plots of monthly sales of fortified Australian wines for 1980-1994 (The first
data point is January 1980, the last data point is December 1994). The units are thousands of liters.
You have been hired to analyze the data for short-term forecasting purposes (2-3 months).
You start by partitioning the data using the period until December 1993 as the training period.
Which of the following models are appropriate to use? (Select all that apply.)
i. Holt-Winters with multiplicative seasonality (of Sales)
ii. Multiple linear regression with trend and seasonality using Sales as the dependent variable
iii. Multiple linear regression with trend and seasonality using Log(Sales) as the dependent variable
iv. Double exponential smoothing (of Sales)
Question 8 - Clustering
Hacker Pschorr (HP) is one of the oldest beer brewing companies in Munich.
They have collected data on beer preferences as well as demographic
information of 100 of their customers, out of which 50 prefer regular beer over
the lighter type. The figure below shows a snapshot of the first rows of data.
The Gender variable =1 for males, 0 for females.
The Married variable =1 if the customer is married,
0 otherwise. The Income variable is the annual
household income (sample average $40,000, stdev
$9725) and the Age variable is in years (sample
average 44.2 years, stdev 12 years).
In order to facilitate data modeling HP codes the
Preference as Y, which equals 1 if the customer
prefers light beer and equals 0 if the customer prefers
regular beer.
HP is interested in better understanding the
different groups of customers and their preferences
and runs a clustering analysis to gain some insights.
HP runs the k-means clustering algorithm; the
XLMiner settings are in Exhibit 1.
Gender Married Income Age Preference
0 0 $31,779 46 Regular
1 1 $32,739 50 Regular
1 1 $24,302 46 Regular
1 1 $64,709 70 Regular
1 1 $41,882 54 Regular
1 0 $38,990 36 Regular
1 0 $22,408 40 Regular
1 1 $25,440 51 Regular
0 1 $30,784 52 Regular
1 0 $31,916 43 Regular
1 0 $23,234 31 Regular
0 1 $51,094 46 Regular
1 0 $38,176 40 Regular
1 0 $28,513 34 Regular
0 1 $44,955 53 Regular
0 1 $42,051 58 Regular
The following pivot table was created from the clustering output.
Cluster #   Average of Gender   Average of Married   Average of Income   Average of Age
1           1                   0.040                36904               40.7
2           0.118               1.000                48940               41.9
3           0                   1                    34387               46.4
4           1                   0.923                39325               49.5
5           1                   1                    61728               66.0
6           0                   0                    35421               37.8
a) How would you characterize cluster 4?
b) An HP data analyst decides to run some analysis on beer preferences within each cluster. The
following table shows the average value of Y within each cluster. How would you characterize the
customers that are the most likely to prefer light beer?
Cluster # Avg of Y
1 0.4
2 0.8
3 0.22
4 0.4
5 0.5
6 0.6
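One way to approach part (b) is to pick the cluster with the highest average Y and compare its profile with the sample averages given in the problem statement. The sketch below is illustrative only; the dictionary names are assumptions, and the z-scores use the sample means and standard deviations quoted above (income: $40,000 and $9725; age: 44.2 and 12 years).

```python
# Cluster profiles from the pivot table:
# (average gender, average married, average income, average age).
profiles = {
    1: (1, 0.040, 36904, 40.7),
    2: (0.118, 1.000, 48940, 41.9),
    3: (0, 1, 34387, 46.4),
    4: (1, 0.923, 39325, 49.5),
    5: (1, 1, 61728, 66.0),
    6: (0, 0, 35421, 37.8),
}

# Average of Y (1 = prefers light beer) within each cluster.
avg_y = {1: 0.4, 2: 0.8, 3: 0.22, 4: 0.4, 5: 0.5, 6: 0.6}

# Cluster whose members are most likely to prefer light beer.
best = max(avg_y, key=avg_y.get)
gender, married, income, age = profiles[best]

# Standardize income and age against the sample statistics from the text.
income_z = (income - 40000) / 9725
age_z = (age - 44.2) / 12

print(f"cluster {best}: share male={gender}, share married={married}")
print(f"income z-score={income_z:.2f}, age z-score={age_z:.2f}")
```

Since Gender is coded 1 for males and Married is coded 1 for married customers, the cluster averages of these two variables can be read directly as the share of males and the share of married customers in each cluster.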
Exhibit 1: Run Parameters