
Homework 2: Data Preprocessing (CS 5810)

February 26, 2021

Total Points: 100

Submission Deadline: Friday, March 5, 11:59 PM

Submission Guideline: Submit your answers in a .docx or .pdf file, inserting tables where appropriate. You may work on the homework only with your team member, but each of you must submit individually through Blackboard. State your team member's name at the top of the document, and end the document with a peer-evaluation paragraph describing your team member's contribution to the homework. If you prefer to work alone on this homework even though you have a team, you are welcome to do so; in that case, no peer evaluation is needed.

Problem 1: Handling Missing Values [20 points]

The following data set has 10 data points and 4 attributes; the final column gives the label of each data point. Some values are missing (marked "?"). Replace each missing value using any one of the methods discussed in class, and explain why you think the method you chose is the most suitable.

ID     Age    Income    Spending    Label
S1     18     12K       12K         Low
S2     22     ?         13K         Low
S3     27     70K       ?           Middle
S4     29     120K      70K         Middle
S5     ?      125K      65K         Middle
S6     35     250K      80K         Upper
S7     42     ?         100K        Upper
S8     ?      350K      150K        Upper
S9     36     130K      ?           Middle
S10    19     35K       32K         Low

*Using R is optional for this task.
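As one illustration of how a missing-value method can be applied to this table, the sketch below fills each "?" with the mean of that attribute over the rows that share the same label (class-conditional mean imputation). This is only one candidate; which methods were discussed in class is not stated here, so the choice is an assumption. The function name `impute_by_class_mean` and the encoding of "?" as `None` are hypothetical, and Income and Spending are stored as plain numbers in thousands.

```python
from statistics import mean

# Rows from the Problem 1 table: (ID, Age, Income in K, Spending in K, Label).
# None stands in for a missing value ("?").
rows = [
    ("S1", 18, 12, 12, "Low"),
    ("S2", 22, None, 13, "Low"),
    ("S3", 27, 70, None, "Middle"),
    ("S4", 29, 120, 70, "Middle"),
    ("S5", None, 125, 65, "Middle"),
    ("S6", 35, 250, 80, "Upper"),
    ("S7", 42, None, 100, "Upper"),
    ("S8", None, 350, 150, "Upper"),
    ("S9", 36, 130, None, "Middle"),
    ("S10", 19, 35, 32, "Low"),
]


def impute_by_class_mean(rows, n_attrs=3):
    """Replace each None with the mean of that attribute within the same label.

    Assumes every label has at least one observed value per attribute;
    otherwise statistics.mean raises StatisticsError.
    """
    filled = []
    for row in rows:
        label = row[-1]
        new_row = list(row)
        for col in range(1, 1 + n_attrs):  # columns 1..3 are the numeric attributes
            if row[col] is None:
                same_class = [
                    r[col] for r in rows
                    if r[-1] == label and r[col] is not None
                ]
                new_row[col] = round(mean(same_class), 1)
        filled.append(tuple(new_row))
    return filled


for row in impute_by_class_mean(rows):
    print(row)
```

Using the label to group rows keeps, for example, a Low-class income from being pulled upward by Upper-class incomes, which a single global mean would do. Whether that makes it the most suitable method is the judgment the problem asks you to argue.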

Problem 2: Smoothing Univariate Data [25 points]

Suppose the data you are given to analyze has a single attribute, age. The age values, in unsorted order, are:

22 46 13 70 35 20 19 45 25 40 15 22 33 35 33 30 35 35 21 36 52 25 20 25 16 16 25 17

(a) Use smoothing by bin means to smooth these data, using a bin depth of 4. Illustrate your steps.

(b) Use smoothing by bin boundaries to smooth these data, using a bin depth of 4. Illustrate your steps.

(c) How might you determine the outliers in the data?

*You do not need to use R to solve the above problem. Type your answer in a file.
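The steps in (a) through (c) can be sketched in a few lines of Python. The block below sorts the ages, cuts them into equal-depth bins of 4, and smooths each bin by its mean and by its boundaries. It also applies the 1.5 x IQR rule as one possible outlier check; that rule is an assumption here, not necessarily the method intended by the course. All function names are hypothetical, and the tie-breaking rule in boundary smoothing (ties go to the lower boundary) is likewise an assumption.

```python
from statistics import quantiles

# Age values from Problem 2, in the order given.
ages = [22, 46, 13, 70, 35, 20, 19, 45, 25, 40, 15, 22, 33, 35,
        33, 30, 35, 35, 21, 36, 52, 25, 20, 25, 16, 16, 25, 17]


def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean (rounded to 2 places)."""
    return [[round(sum(b) / len(b), 2)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: a value equidistant from both boundaries goes to the lower one.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (one common rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


bins = equal_depth_bins(ages, depth=4)
for raw, means, bounds in zip(bins, smooth_by_means(bins), smooth_by_boundaries(bins)):
    print(raw, "-> means:", means, "-> boundaries:", bounds)
print("IQR outliers:", iqr_outliers(ages))
```

With 28 values and a depth of 4, the data divide evenly into 7 bins. Note that quartile definitions differ between tools, so a different quantile method (for example R's default) may place the outlier fences slightly differently.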

Problem 3: Data Transformation [20 points]

Using the data given in Problem 2, answer the following:

(a) Use min-max normalization to transform the data. What is the range of values after the transformation?

(b) Use z-score normalization to transform the data. What is the range of values after the transformation?

(c) Use normalization by decimal scaling to transform the data. What is the range of values after the transformation?

(d) Comment on which method you would prefer for the given data, and explain why.

*You can use R to normalize the data with the three methods and copy the results into three tables in your HW file. Do not forget to state the range of values for (a)-(c).
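For reference, the three normalization methods in (a) through (c) can be written as short functions. The Python sketch below is a minimal illustration, not a prescribed solution. The function names are hypothetical, min-max normalization is assumed to target [0, 1], and the z-score uses the population standard deviation (R's `sd()` uses the sample standard deviation, so values computed in R will differ slightly).

```python
from statistics import mean, pstdev

# Age values from Problem 2, in the order given.
ages = [22, 46, 13, 70, 35, 20, 19, 45, 25, 40, 15, 22, 33, 35,
        33, 30, 35, 35, 21, 36, 52, 25, 20, 25, 16, 16, 25, 17]


def min_max(values, new_min=0.0, new_max=1.0):
    """v' = (v - min) / (max - min) * (new_max - new_min) + new_min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """v' = (v - mean) / std, using the population standard deviation (assumption)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """v' = v / 10**j, with j the smallest integer such that max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


for name, fn in [("min-max", min_max), ("z-score", z_score),
                 ("decimal scaling", decimal_scaling)]:
    result = fn(ages)
    print(f"{name}: range [{min(result):.3f}, {max(result):.3f}]")
```

The printed ranges show how each method treats the spread of the data: min-max pins the extremes to the chosen interval, the z-score range depends on how far the extremes lie from the mean, and decimal scaling only shifts the decimal point.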