Data Mining- Practice Example

Quiz Practice Questions

Disclaimer: These questions are provided as a study tool; however, it is not guaranteed that they fully reflect the level and emphasis of the final Quiz.

Question 1: Internet Advertisements Data Set

A company named FreeOfAds wishes to develop an internet add-on that filters irrelevant advertisements from webpages. For that purpose, it collected a dataset that represents a set of possible advertisements on Internet pages (data source: Machine Learning Repository). The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text.

Based on this dataset, FreeOfAds is interested in predicting whether an image is an advertisement ("ad") or not ("nonad"). The data contain 3279 records, of which 2821 are classified as "nonad" and 458 as "ad", measured on 1558 attributes (3 continuous, the rest binary). The data variables are described below.

Name              Type                     Description
height            continuous               image height
width             continuous               image width
aspect ratio      continuous               ratio of width to height
local             binary                   indicator of whether the image URL is in the same domain as the website
caption features  binary (457 variables)   a set of binary features that encode keywords in the image title (e.g., 'gift')
alt features      binary (495 variables)   a set of binary features that encode keywords in the image alternative (<ALT>) tag
base features     binary (472 variables)   a set of binary features that encode keywords in the base (current) URL
target features   binary (111 variables)   a set of binary features that encode keywords in the destination (link) URL
image features    binary (19 variables)    a set of binary features that encode keywords in the image source URL
ad/nonad          output                   actual classification

The data were partitioned into two sets: training (71%) and validation (29%). The following graph depicts the actual (true) classification of the validation set.

a) For the naïve rule, fill in the 4 values in the confusion matrix of the validation set:

                predicted ad   predicted nonad
actual ad       ____           ____
actual nonad    ____           ____

b) Compute the overall error of the naïve rule on the validation set.
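
The arithmetic behind parts a) and b) can be sketched as follows. This is a minimal illustration, not an official solution. It assumes the validation set has 164 "ad" and 786 "nonad" records, which are the row totals of the confusion matrices given in part c), and it assumes the naive rule assigns every record to the majority class ("nonad" is also the majority in the full data, 2821 versus 458). The variable names are hypothetical.

```python
# Validation-set class counts (assumption: taken from the "total" column
# of the confusion matrices shown in part c of this question).
actual_counts = {"ad": 164, "nonad": 786}
classes = ["ad", "nonad"]

# The naive rule ignores all predictors and predicts the majority class.
majority = max(actual_counts, key=actual_counts.get)

# confusion[actual][predicted]: every record lands in the majority column.
confusion = {
    a: {p: (actual_counts[a] if p == majority else 0) for p in classes}
    for a in classes
}

total = sum(actual_counts.values())
errors = sum(confusion[a][p] for a in classes for p in classes if a != p)
naive_error = errors / total

for a in classes:
    print(f"actual {a}: {confusion[a]}")
print(f"overall error = {errors}/{total} = {naive_error:.4f}")
```

Under these assumptions the only misclassified records are the 164 actual ads, so the overall error is 164/950.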

Two classifiers were fit to the data:

i. Classification tree with all 1558 predictors

ii. Logistic regression with the following predictors: height, width, aspect ratio, and local

c) Compare the prediction accuracy of the classification tree and the logistic regression model below (and assume that the cut-offs are appropriately set). Which model predicts the data best?

Confusion Matrix for Classification Tree and Logistic Regression

Tree
                predicted ad   predicted nonad   total
actual ad       138            26                164
actual nonad    6              780               786
total           144            806               950

LR
                predicted ad   predicted nonad   total
actual ad       94             70                164
actual nonad    17             769               786
total           111            839               950
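
One way to compare the two confusion matrices is to compute the standard summary measures from their four cells. The sketch below assumes "ad" is treated as the positive class; the helper name `summarize` is hypothetical and chosen only for illustration.

```python
def summarize(tp, fn, fp, tn):
    """Summary measures from one 2x2 confusion matrix ("ad" = positive)."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,      # share classified correctly
        "error": (fp + fn) / total,         # share misclassified
        "sensitivity": tp / (tp + fn),      # ads correctly flagged as ads
        "specificity": tn / (tn + fp),      # nonads correctly left alone
    }

# Cell counts copied from the two confusion matrices above.
tree = summarize(tp=138, fn=26, fp=6, tn=780)
logit = summarize(tp=94, fn=70, fp=17, tn=769)

for name, result in [("Tree", tree), ("LR", logit)]:
    print(name, {k: round(v, 4) for k, v in result.items()})
```

On these counts the tree classifies 918 of 950 validation records correctly and the logistic regression 863 of 950, so the tree has the higher accuracy as well as the higher sensitivity and specificity.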

Question 2: Crime Rate

Data on crime rates were collected from 51 cities in the US (data source: Life in America's Small Cities, by G.S. Thomas). We are interested in understanding the factors that affect the crime rate.

For each city, the following information was collected:

CRIME_RATE: Total overall reported crime rate per 1 million residents

REPORTED: Reported violent crime rate per 100,000 residents

FUNDING: Annual police funding in $/resident

LOCATION: One of two locations: east=1, west=0

a) The purpose of this study is (circle one):

Explanatory / Predictive

Question 3: k-NN

Consider the one-dimensional data below.

x y

2 1

4 -1

6 1

8 -1

10 1

15 -1

20 1

25 -1

30 1

35 -1

40 1

45 -1

50 1

55 -1

60 1

65 -1

70 1

75 -1

80 1

85 -1

90 1

95 -1

100 1

200 -1

a) We want to classify a new record that has x=53. Using the Euclidean distance metric, calculate the distance to its nearest neighbor.

b) Now classify the points x=1, x=11 and x=100 using the k-nearest-neighbor classifier with k=5. Show your work and write your answers below.
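
The mechanics of both parts can be sketched in a few lines. This is an illustrative sketch rather than a model answer: the function names are hypothetical, the classifier uses a simple majority vote over the ±1 labels, and distance ties (none occur for the query points here) would be broken in favor of the smaller x because Python's sort is stable.

```python
# (x, y) pairs copied from the table above.
data = [
    (2, 1), (4, -1), (6, 1), (8, -1), (10, 1), (15, -1), (20, 1), (25, -1),
    (30, 1), (35, -1), (40, 1), (45, -1), (50, 1), (55, -1), (60, 1),
    (65, -1), (70, 1), (75, -1), (80, 1), (85, -1), (90, 1), (95, -1),
    (100, 1), (200, -1),
]

def nearest(x, k):
    """Return the k training points closest to x (1-D Euclidean distance)."""
    return sorted(data, key=lambda point: abs(point[0] - x))[:k]

def knn_classify(x, k):
    """Majority vote over the k nearest labels; k is odd, so no vote ties."""
    votes = sum(label for _, label in nearest(x, k))
    return 1 if votes > 0 else -1

# Part a: nearest neighbor of x=53 and its distance.
neighbor_x, neighbor_y = nearest(53, 1)[0]
print("nearest to 53:", neighbor_x, "distance:", abs(neighbor_x - 53))

# Part b: 5-NN classification of the three query points.
for query in (1, 11, 100):
    print(query, [p[0] for p in nearest(query, 5)], knn_classify(query, 5))
```

In one dimension the Euclidean distance reduces to the absolute difference |x1 - x2|, which is why no square root appears in the code.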

Question 4: Performance Measures

In the table below, the value of p represents the confidence with which a certain classifier predicts the class to be 1 (a larger value of p means that the classifier is more confident that the class is 1). Compute the sensitivity and accuracy with respect to class 1 when a cutoff of p=0.90 is used.

p True class

1 1

0.99 1

0.98 1

0.96 0

0.86 1

0.62 1

0.12 0
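
The computation can be sketched as below. It assumes a record is predicted as class 1 when p is greater than or equal to the cutoff; the question does not say how a value exactly at the cutoff is handled, and no record here has p=0.90, so the choice does not affect the result.

```python
# (p, true class) pairs copied from the table above.
records = [(1.00, 1), (0.99, 1), (0.98, 1), (0.96, 0),
           (0.86, 1), (0.62, 1), (0.12, 0)]
cutoff = 0.90

# Count the four confusion-matrix cells with class 1 as the positive class.
tp = sum(1 for p, y in records if p >= cutoff and y == 1)
fp = sum(1 for p, y in records if p >= cutoff and y == 0)
fn = sum(1 for p, y in records if p < cutoff and y == 1)
tn = sum(1 for p, y in records if p < cutoff and y == 0)

sensitivity = tp / (tp + fn)          # true 1s that are predicted as 1
accuracy = (tp + tn) / len(records)   # all records predicted correctly

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"sensitivity = {tp}/{tp + fn}, accuracy = {tp + tn}/{len(records)}")
```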

Question 5: Draw the decision boundaries

The following pictures all show the same dataset. For each figure, draw the approximate decision boundary using a cutoff value of 0.5.

Classification Tree of level 1 – Also called a tree-stump

[Figure: plot of the dataset (legend: class=1, class=0); horizontal axis from 0 to 1, vertical axis from 0 to 1.2]

k-NN, when k=1

[Figure: plot of the dataset (legend: class=1, class=0); horizontal axis from 0 to 1, vertical axis from 0 to 1.2]

Question 6: Use the table below for the following two questions.

Transaction ID | Items Bought

1 {a,d,e}

24 {a,b,c,e}

12 {a,b,d,e}

31 {a,c,d,e}

15 {b,c,e}

22 {b,d,e}

29 {c,d}

40 {a,b,c}

33 {a,d,e}

38 {a,b,c}

a) Treating each transaction as a market basket, compute the confidence for the rule {b} → {a}.

b) Explain the meaning of the confidence value in the context of the previous problem.
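
The confidence of a rule X → Y is the number of transactions containing both X and Y divided by the number containing X. A minimal sketch on the table above follows; the helper names `support_count` and `confidence` are hypothetical.

```python
# Transaction ID -> items bought, copied from the table above.
transactions = {
    1: {"a", "d", "e"},
    24: {"a", "b", "c", "e"},
    12: {"a", "b", "d", "e"},
    31: {"a", "c", "d", "e"},
    15: {"b", "c", "e"},
    22: {"b", "d", "e"},
    29: {"c", "d"},
    40: {"a", "b", "c"},
    33: {"a", "d", "e"},
    38: {"a", "b", "c"},
}

def support_count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support_count(antecedent | consequent) / support_count(antecedent)

conf_b_a = confidence({"b"}, {"a"})
print(f"{support_count({'a', 'b'})}/{support_count({'b'})} = {conf_b_a:.3f}")
```

Read this way, the value is the share of baskets containing b that also contain a.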

Question 7 – Multiple choice and short answer

a) Compute the (Euclidean) distance between the points (2,2,3) and (5,7,10).
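
For part a), the Euclidean distance is the square root of the sum of squared coordinate differences. A short sketch of that arithmetic, with hypothetical variable names:

```python
import math

p = (2, 2, 3)
q = (5, 7, 10)

# Squared difference in each coordinate: (2-5)^2, (2-7)^2, (3-10)^2.
squared_diffs = [(a - b) ** 2 for a, b in zip(p, q)]

# Distance = sqrt(9 + 25 + 49) = sqrt(83).
distance = math.sqrt(sum(squared_diffs))
print(squared_diffs, sum(squared_diffs), round(distance, 4))
```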

b) Which of the following statements are true for bagging? (Circle either True or False for each.)

True/False

Bagging combines simple base classifiers by upweighting data points that are classified incorrectly.

True/False

Bagging builds different classifiers by training on repeated samples (with replacement) from the data.

True/False

Bagging usually gives zero training error but rarely overfits, which is very curious.

c) Which of the following statements are true? (Circle either True or False for each.)

True/False

When running the association rule algorithm, increasing the minimum support requirement will always result in a higher number of rules.

True/False

When running a hierarchical clustering algorithm, we start out with each record as its own cluster. The first step is to join the two closest records. In this first step, which records are joined depends on the cluster distance measure selected (single linkage, complete linkage, average linkage, or average group linkage).

d) The figure shows time series plots of monthly sales of fortified Australian wines for 1980-1994 (the first data point is January 1980, the last data point is December 1994). The units are thousands of liters. You have been hired to analyze the data for short-term forecasting purposes (2-3 months).

You start by partitioning the data using the period until December 1993 as the training period. Which of the following models are appropriate to use? (Select all that apply.)

i. Holt-Winters with multiplicative seasonality (of Sales)

ii. Multiple linear regression with trend and seasonality using Sales as the dependent variable

iii. Multiple linear regression with trend and seasonality using Log(Sales) as the dependent variable

iv. Double exponential smoothing (of Sales)

Question 8 - Clustering

Hacker Pschorr (HP) is one of the oldest beer brewing companies in Munich. They have collected data on beer preferences as well as demographic information for 100 of their customers, 50 of whom prefer regular beer over the lighter type. The figure below shows a snapshot of the first rows of the data.

The Gender variable equals 1 for males and 0 for females. The Married variable equals 1 if the customer is married and 0 otherwise. The Income variable is the annual household income (sample average $40,000, standard deviation $9,725), and the Age variable is in years (sample average 44.2 years, standard deviation 12 years).

To facilitate data modeling, HP codes the Preference as Y, which equals 1 if the customer prefers light beer and 0 if the customer prefers regular beer.

HP is interested in better understanding the different groups of customers and their preferences, and runs a clustering analysis to gain some insights. HP runs the k-means clustering algorithm; the XLMiner settings are in Exhibit 1.

Gender Married Income Age Preference

0 0 $31,779 46 Regular

1 1 $32,739 50 Regular

1 1 $24,302 46 Regular

1 1 $64,709 70 Regular

1 1 $41,882 54 Regular

1 0 $38,990 36 Regular

1 0 $22,408 40 Regular

1 1 $25,440 51 Regular

0 1 $30,784 52 Regular

1 0 $31,916 43 Regular

1 0 $23,234 31 Regular

0 1 $51,094 46 Regular

1 0 $38,176 40 Regular

1 0 $28,513 34 Regular

0 1 $44,955 53 Regular

0 1 $42,051 58 Regular

The following pivot table was created from the clustering output.

Cluster # | Average of Gender | Average of Married | Average of Income | Average of Age

1 1 0.040 36904 40.7

2 0.118 1.000 48940 41.9

3 0 1 34387 46.4

4 1 0.923 39325 49.5

5 1 1 61728 66.0

6 0 0 35421 37.8

a) How would you characterize cluster 4?

b) An HP data analyst decides to run some analysis on beer preferences within each cluster. The following table shows the average value of Y within each cluster. How would you characterize the customers that are the most likely to prefer light beer?

Cluster # Avg of Y

1 0.4

2 0.8

3 0.22

4 0.4

5 0.5

6 0.6
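
One way to read the two tables together is to rank the clusters by their average Y (the share preferring light beer, since Y=1 means light) and to express each cluster's average income and age in standard deviations from the sample averages given in the problem. The sketch below is only an aid for interpretation, not part of the XLMiner output; the dictionary layout and the `profile` helper are hypothetical.

```python
# Sample statistics stated in the problem text.
MEAN_INCOME, SD_INCOME = 40000, 9725
MEAN_AGE, SD_AGE = 44.2, 12

# Cluster averages copied from the pivot table and the "Avg of Y" table.
clusters = {
    1: {"gender": 1, "married": 0.040, "income": 36904, "age": 40.7, "avg_y": 0.4},
    2: {"gender": 0.118, "married": 1.000, "income": 48940, "age": 41.9, "avg_y": 0.8},
    3: {"gender": 0, "married": 1, "income": 34387, "age": 46.4, "avg_y": 0.22},
    4: {"gender": 1, "married": 0.923, "income": 39325, "age": 49.5, "avg_y": 0.4},
    5: {"gender": 1, "married": 1, "income": 61728, "age": 66.0, "avg_y": 0.5},
    6: {"gender": 0, "married": 0, "income": 35421, "age": 37.8, "avg_y": 0.6},
}

def profile(cluster_id):
    """Cluster income and age as z-scores relative to the whole sample."""
    c = clusters[cluster_id]
    return {
        "income_z": (c["income"] - MEAN_INCOME) / SD_INCOME,
        "age_z": (c["age"] - MEAN_AGE) / SD_AGE,
    }

# Clusters ordered from most to least likely to prefer light beer.
ranked = sorted(clusters, key=lambda cid: clusters[cid]["avg_y"], reverse=True)

for cid in ranked:
    z = profile(cid)
    print(cid, clusters[cid]["avg_y"],
          f"income_z={z['income_z']:+.2f}", f"age_z={z['age_z']:+.2f}")
```

A z-score near zero means the cluster is close to the sample average on that variable, which helps separate clusters that differ mainly in gender or marital status from those that differ in income or age.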

Exhibit 1: Run Parameters
