Data Mining - Practice Example


Quiz Practice Questions - Solutions

Disclaimer: These questions are provided as a study tool; however, it is not guaranteed that they fully reflect the level and emphasis of the Final Quiz.

Question 1: Internet Advertisements Data Set

A company named FreeOfAds wishes to develop an internet add-on that filters irrelevant advertisements out of webpages. For that, it collected a dataset that represents a set of possible advertisements on Internet pages (data source: Machine Learning Repository). The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text.

Based on this dataset, FreeOfAds is interested in predicting whether an image is an advertisement ("ad") or not ("nonad"). The data contain 3279 records, of which 2821 are classified as 'nonad' and 458 as 'ad', measured on 1558 attributes (3 continuous, the rest binary). The data variables are described below.

Name               Type                     Description

height             Continuous               image height
width              Continuous               image width
aspect ratio       Continuous               ratio of width to height
local              Binary                   indicator of whether the image URL is in the same domain as the website
caption features   Binary (457 variables)   a set of binary features that encode keywords in the image title (e.g., 'gift')
alt features       Binary (495 variables)   a set of binary features that encode keywords in the image alternative (<ALT>) tag
base features      Binary (472 variables)   a set of binary features that encode keywords in the base (current) URL
target features    Binary (111 variables)   a set of binary features that encode keywords in the destination (link) URL
image features     Binary (19 variables)    a set of binary features that encode keywords in the image source URL
ad/noad            Output                   actual classification

The data were partitioned into two sets: training (71%) and validation (29%). The following graph depicts the actual (true) classification of the validation set.

a) For the naïve rule, fill in the 4 values in the confusion matrix of the validation set:

                   predicted
                   ad     nonad
actual   ad         0       166
         nonad      0       784

b) Compute the overall error of the naïve rule on the validation set.

%Error = 166/(166+784) = 166/950 ≈ 17.5%
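The naïve-rule calculation above can be sketched in a few lines of Python. This is only an illustration using the class counts from the confusion matrix in part (a); the variable names are hypothetical and not part of the original quiz.

```python
# Validation-set class counts, taken from the naive-rule confusion matrix above.
actual_counts = {"ad": 166, "nonad": 784}

# The naive rule ignores all predictors and predicts the majority class.
majority_class = max(actual_counts, key=actual_counts.get)

# Every record that is not in the majority class is misclassified.
total = sum(actual_counts.values())
errors = total - actual_counts[majority_class]
error_rate = errors / total

print(majority_class)          # nonad
print(round(error_rate, 4))    # 0.1747
```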

Two classifiers were fit to the data:

i. Classification tree with all 1558 predictors
ii. Logistic regression with the following predictors: height, width, aspect-ratio and local

c) Compare the prediction accuracy of the Classification tree and the Logistic regression model below (and assume that the cut-offs are appropriately set). Which model predicts the data best?

There are multiple ways we can compare models. We decide to calculate the sensitivity, specificity and the overall accuracy.

                      Accuracy   Sensitivity   Specificity
Logistic Regression   0.908421   0.573171      0.978372
Classification Tree   0.966316   0.841463      0.992366

From the table we notice that the classification tree outperforms the logistic regression on all three measures, and therefore we would choose that model. If there were no clear winner, we would have to weigh the costs of each type of error: for example, how does the cost of wrongly removing a legitimate image from a page compare with the cost of leaving an ad on the page?


Confusion Matrix for Classification Tree and Logistic Regression

Tree               predicted
                   ad     nonad   total
actual   ad       138        26     164
         nonad      6       780     786
         total    144       806     950

LR                 predicted
                   ad     nonad   total
actual   ad        94        70     164
         nonad     17       769     786
         total    111       839     950
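The accuracy, sensitivity, and specificity values reported in part (c) can be reproduced from the two confusion matrices above. The sketch below treats "ad" as the positive class; the function and variable names are illustrative assumptions, not part of the original solution.

```python
def metrics(tp, fn, fp, tn):
    """Return accuracy, sensitivity and specificity, with 'ad' as the positive class."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of actual ads that are caught
        "specificity": tn / (tn + fp),   # share of actual nonads that are kept
    }

# Counts read off the two validation confusion matrices above.
tree = metrics(tp=138, fn=26, fp=6, tn=780)
logistic = metrics(tp=94, fn=70, fp=17, tn=769)

for name, m in [("tree", tree), ("logistic", logistic)]:
    print(name, {k: round(v, 6) for k, v in m.items()})
```

Running it prints the same six values as the comparison table, which confirms that the tree is ahead on every measure.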

Question 2: Crime Rate

Data on crime rate were collected from 51 cities in the US (data source: Life in America's Small Cities, by G.S. Thomas). We are interested in understanding the factors that affect the crime rate.

For each city, the following information was collected:

CRIME_RATE: Total overall reported crime rate per 1 million residents
REPORTED: Reported violent crime rate per 100,000 residents
FUNDING: Annual police funding in $/resident
LOCATION: One of two locations: east=1, west=0

a) The purpose of this study is (circle one):

Explanatory / Predictive


Question 3: k-NN

Consider the one-dimensional data below.

  x     y
  2     1
  4    -1
  6     1
  8    -1
 10     1
 15    -1
 20     1
 25    -1
 30     1
 35    -1
 40     1
 45    -1
 50     1
 55    -1
 60     1
 65    -1
 70     1
 75    -1
 80     1
 85    -1
 90     1
 95    -1
100     1
200    -1

a) We want to classify a new record that has x=53. Calculate the distance to its nearest neighbor using the Euclidean distance metric.

The nearest neighbor to 53 is 55. Therefore the distance is sqrt((55-53)^2) = 2.

b) Now find the k-nearest neighbor classification for the points x=1, x=11 and x=100 using k=5. Show your work and write your answers below.

For x=1, the nearest neighbors are 2, 4, 6, 8, 10, and the most common class amongst those points is 1; therefore we classify x=1 as class 1.

For x=11, the nearest neighbors are 4, 6, 8, 10, 15, and the most common class amongst those points is -1; therefore we classify x=11 as class -1.


For x=100, the nearest neighbors are 80, 85, 90, 95, 100, and the most common class amongst those points is 1; therefore we classify x=100 as class 1.
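The k-NN answers for this question can be checked with a short script. This is a minimal sketch, assuming a plain majority vote with no distance weighting; the name knn_predict is hypothetical. The labels in the data table alternate 1, -1, 1, ... in order of x, which the code uses to build the y list.

```python
from collections import Counter

# Training data from the table in this question.
xs = [2, 4, 6, 8] + list(range(10, 101, 5)) + [200]
ys = [1 if i % 2 == 0 else -1 for i in range(len(xs))]  # labels alternate 1, -1, ...

def knn_predict(x_new, k):
    """Return (majority class, sorted neighbor x values) for the k nearest points."""
    # In one dimension the Euclidean distance reduces to |x - x_new|.
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], sorted(x for x, _ in nearest)

for x_new in (1, 11, 100):
    print(x_new, knn_predict(x_new, k=5))
```

Calling knn_predict(53, k=1) likewise returns 55 as the single nearest neighbor, matching part (a).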

Question 4: Performance Measures

In the table below, the value of p represents the confidence with which a certain classifier predicts the class to be 1 (a larger value of p means that the classifier is more confident that the class is 1). Compute the sensitivity and accuracy with respect to class 1 when a cutoff of p=.90 is used.

p       True class   Predicted class
1       1            1
0.99    1            1
0.98    1            1
0.96    0            1
0.86    1            0
0.62    1            0
0.12    0            0

Accuracy: 4/7 = 57.1%
Sensitivity: 3/5 = 60%
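As a check on the accuracy and sensitivity figures, the cutoff can be applied to the p values programmatically. The sketch below assumes that a record is predicted as class 1 when p is at or above the cutoff (no p in the table equals 0.90, so the choice of >= versus > does not matter here); the variable names are illustrative.

```python
# (p, true class) pairs from the table in this question.
records = [(1.00, 1), (0.99, 1), (0.98, 1), (0.96, 0),
           (0.86, 1), (0.62, 1), (0.12, 0)]
cutoff = 0.90

# Assumption: predict class 1 when p >= cutoff, otherwise class 0.
predicted = [1 if p >= cutoff else 0 for p, _ in records]
actual = [y for _, y in records]

correct = sum(p == y for p, y in zip(predicted, actual))
true_pos = sum(p == 1 and y == 1 for p, y in zip(predicted, actual))

accuracy = correct / len(records)      # 4 correct out of 7
sensitivity = true_pos / sum(actual)   # 3 of the 5 actual class-1 records
print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.2f}")
```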

Question 5: Draw the decision boundaries

The following pictures all show the same dataset. For each figure draw the approximate decision boundary using a cutoff value of .5.

[Figure: Classification Tree. Scatter plot of the dataset with legend class=1 / class=0; the solution labels one region "Classified as class 1" and the other "Classified as class 0".]


[Figure: k-NN, when k=1. Scatter plot of the same dataset with legend class=1 / class=0.]

Question 6: Use the table below for the following two questions.

Transaction ID   Items Bought
 1               {a,d,e}
24               {a,b,c,e}
12               {a,b,d,e}
31               {a,c,d,e}
15               {b,c,e}
22               {b,d,e}
29               {c,d}
40               {a,b,c}
33               {a,d,e}
38               {a,b,c}

a) Treating each transaction as a market basket, compute the confidence for the rule {b} → {a}.

# of transactions with {a,b} = 4
# of transactions with {b} = 6
Confidence = 4/6 = 66.67%

b) Explain the meaning of the confidence value in the context of the previous problem.

With 66.67% chance, a random transaction that contains {b} will also contain {a}.
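The confidence calculation can be sketched directly from the transaction table. The helper names support_count and confidence are hypothetical, introduced only for this illustration.

```python
# Transactions from the table in this question, keyed by transaction ID.
transactions = {
    1: {"a", "d", "e"},
    24: {"a", "b", "c", "e"},
    12: {"a", "b", "d", "e"},
    31: {"a", "c", "d", "e"},
    15: {"b", "c", "e"},
    22: {"b", "d", "e"},
    29: {"c", "d"},
    40: {"a", "b", "c"},
    33: {"a", "d", "e"},
    38: {"a", "b", "c"},
}

def support_count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= items for items in transactions.values())

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return support_count(antecedent | consequent) / support_count(antecedent)

print(support_count({"a", "b"}), support_count({"b"}))
print(f"{confidence({'b'}, {'a'}):.2%}")
```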

6

Question 7 – Multiple choice and short answer

a) Compute the (Euclidean) distance between the points (2,2,3) and (5,7,10).

sqrt((2 − 5)^2 + (2 − 7)^2 + (3 − 10)^2) = sqrt(9 + 25 + 49) = sqrt(83) ≈ 9.11
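The same distance can be verified numerically. The sketch below computes it by hand and also with math.dist from the Python standard library (available in Python 3.8 and later); the variable names are illustrative.

```python
import math

p, q = (2, 2, 3), (5, 7, 10)

# Squared differences per coordinate, then the square root of their sum.
squared_diffs = [(a - b) ** 2 for a, b in zip(p, q)]
manual = math.sqrt(sum(squared_diffs))

# Standard-library equivalent.
builtin = math.dist(p, q)

print(squared_diffs, round(manual, 2))  # [9, 25, 49] 9.11
```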

b) Which of the following statements is true for bagging? (circle either true or false)

True/False: Bagging combines simple base classifiers by upweighting data points which are classified incorrectly.

True/False: Bagging builds different classifiers by training on repeated samples (with replacement) from the data.

True/False: Bagging usually gives zero training error, but rarely overfits, which is very curious.

c) Which of the following statements are true? (circle either true or false)

True/False: When running the association rule algorithm, increasing the minimum support requirement will always result in a higher number of rules.

True/False: When running a hierarchical clustering algorithm we start out with each record as a cluster. The first step is to join the two closest records. In this first step, which records are joined up depends on the cluster distance measure selected (single linkage, complete linkage, average linkage or the average group linkage).

d) The figure shows time series plots of monthly sales of fortified Australian wines for 1980-1994 (the first data point is January 1980, the last data point is December 1994). The units are thousands of liters. You have been hired to analyze the data for short-term forecasting purposes (2-3 months).


You start by partitioning the data using the period until December 1993 as the training period. Which of the following models are appropriate to use? (Select all that apply.)

i. Holt-Winter's with multiplicative seasonality (of Sales)
ii. Multiple linear regression with trend and seasonality using Sales as the dependent variable
iii. Multiple linear regression with trend and seasonality using Log(Sales) as the dependent variable
iv. Double exponential smoothing (of Sales)

Question 8 - Clustering

Hacker Pschorr (HP) is one of the oldest beer brewing companies in Munich. They have collected data on beer preferences as well as demographic information of 100 of their customers, out of which 50 prefer regular beer over the lighter type. The figure below shows a snapshot of the first rows of data.

Gender   Married   Income    Age   Preference
0        0         $31,779   46    Regular
1        1         $32,739   50    Regular
1        1         $24,302   46    Regular
1        1         $64,709   70    Regular
1        1         $41,882   54    Regular
1        0         $38,990   36    Regular
1        0         $22,408   40    Regular
1        1         $25,440   51    Regular
0        1         $30,784   52    Regular
1        0         $31,916   43    Regular
1        0         $23,234   31    Regular
0        1         $51,094   46    Regular
1        0         $38,176   40    Regular
1        0         $28,513   34    Regular
0        1         $44,955   53    Regular
0        1         $42,051   58    Regular

The Gender variable =1 for males, 0 for females. The Married variable =1 if the customer is married, 0 otherwise. The Income variable is the annual household income (sample average $40,000, stdev $9725) and the Age variable is in years (sample average 44.2 years, stdev 12 years).

In order to facilitate data modeling, HP codes the Preference as Y, which equals 1 if the customer prefers light beer and equals 0 if the customer prefers regular beer.

HP is interested in better understanding the different groups of customers and their preferences, and runs a clustering analysis to gain some insights.

HP runs the k-means clustering algorithm; the settings are in Exhibit 1.


The following pivot table was created from the clustering output.

Cluster #   Average of Gender   Average of Married   Average of Income   Average of Age
1           1                   0.040                36904               40.7
2           0.118               1.000                48940               41.9
3           0                   1                    34387               46.4
4           1                   0.923                39325               49.5
5           1                   1                    61728               66.0
6           0                   0                    35421               37.8

a) How would you characterize cluster 4?

Cluster 4 consists of mostly married (92%) men (100%) who are on average older than the typical customer (49.5 years vs. the sample average of 44.2), but with roughly average income ($39,325 vs. the sample average of $40,000).
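One way to make "older than typical" and "about average income" concrete is to express the cluster 4 averages in standard deviations from the sample means quoted in the problem statement. This is only an illustrative sketch; the quiz solution itself does not compute z-scores, and the variable names are assumptions.

```python
# Sample statistics quoted in the problem statement.
sample_mean = {"income": 40_000, "age": 44.2}
sample_std = {"income": 9_725, "age": 12}

# Cluster 4 averages from the pivot table.
cluster4 = {"income": 39_325, "age": 49.5}

# Distance of each cluster average from the sample mean, in standard deviations.
z = {k: (cluster4[k] - sample_mean[k]) / sample_std[k] for k in cluster4}

for name, value in z.items():
    print(f"{name}: {value:+.2f} sd")
```

The age average sits a little under half a standard deviation above the sample mean, while the income average is within a tenth of a standard deviation of it.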

b) An HP data analyst decides to run some analysis on beer preferences within each cluster. The following table shows the average value of Y within each cluster. How would you characterize the customers that are the most likely to prefer light beer?

Cluster #   Avg of Y
1           0.4
2           0.8
3           0.22
4           0.4
5           0.5
6           0.6

The customers most likely to prefer light beer correspond to cluster 2 (average Y of 0.8), which consists mostly of married females from higher-income households.

Exhibit 1: Run Parameters
