Data Mining Hw #2
Homework 2: Data Preprocessing (CS 5810)
February 26, 2021
Total Points: 100
Submission Deadline: Friday, March 5, 11:59 PM
Submission Guideline: Submit your answers in a .docx or .pdf file, inserting tables where appropriate. You may work on the homework only with your team member, but each of you must submit individually through Blackboard. State your team member’s name at the top of the document, and end the document with a peer-evaluation paragraph describing your team member’s contribution to the homework. If you have a team but prefer to work alone on this homework, you are welcome to do so; in that case, no peer evaluation is needed.
Problem 1: Handling Missing Values [20 points]
The following data set has 10 data points and 4 attributes; the final column gives the label of each data point. Some values are missing (marked "?"). Replace the missing values using any one of the methods discussed in class, and explain why you think the method you chose is the most suitable.
| ID | Age | Income | Spending | Label |
| S1 | 18 | 12K | 12K | Low |
| S2 | 22 | ? | 13K | Low |
| S3 | 27 | 70K | ? | Middle |
| S4 | 29 | 120K | 70K | Middle |
| S5 | ? | 125K | 65K | Middle |
| S6 | 35 | 250K | 80K | Upper |
| S7 | 42 | ? | 100K | Upper |
| S8 | ? | 350K | 150K | Upper |
| S9 | 36 | 130K | ? | Middle |
| S10 | 19 | 35K | 32K | Low |
*Using R is optional for this task.
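For illustration only, here is a minimal Python sketch of one possible strategy: filling a missing value with the mean of that attribute among records of the same class. It is not necessarily the method discussed in class. The function name `impute_by_class_mean`, the `None` marker for missing values, and the toy records are all assumptions made for this example; the records are hypothetical and are not the homework data.

```python
from statistics import mean

def impute_by_class_mean(rows, attribute, label_key="label"):
    """Return a copy of rows in which missing (None) values of `attribute`
    are replaced by the mean of that attribute within the same class."""
    # Collect the observed (non-missing) values of the attribute per class.
    observed = {}
    for row in rows:
        if row[attribute] is not None:
            observed.setdefault(row[label_key], []).append(row[attribute])
    class_means = {label: mean(values) for label, values in observed.items()}

    # Fill each gap with the mean of its own class; the input is not modified.
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[attribute] is None:
            new_row[attribute] = class_means[new_row[label_key]]
        filled.append(new_row)
    return filled

# Hypothetical toy records (NOT the homework data); None marks a missing value.
toy = [
    {"id": "A", "income": 10, "label": "Low"},
    {"id": "B", "income": None, "label": "Low"},
    {"id": "C", "income": 20, "label": "Low"},
    {"id": "D", "income": 100, "label": "High"},
    {"id": "E", "income": None, "label": "High"},
]
result = impute_by_class_mean(toy, "income")
print([r["income"] for r in result])
```

This sketch raises a `KeyError` if a class has no observed value for the attribute. Other choices, such as the overall mean or median, or a constant, trade simplicity against how well they respect class structure.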
Problem 2: Smoothing Univariate Data [25 points]
Suppose you are given data to analyze that contain a single attribute, age. The age values, in unsorted order, are:
22 46 13 70 35 20 19 45 25 40 15 22 33 35 33 30 35 35 21 36 52 25 20 25 16 16 25 17
(a) Use smoothing by bin means to smooth these data, using a bin depth of 4. Illustrate your steps.
(b) Use smoothing by bin boundaries to smooth these data, using a bin depth of 4. Illustrate your steps.
(c) How might you determine the outliers in the data?
*You do not need to use R to solve the above problem. Type your answers in a file.
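To make the mechanics of parts (a) and (b) concrete, here is a minimal Python sketch of equal-depth binning followed by both smoothing variants. The function names and the six-value toy list are hypothetical; the toy list is not the age data above and uses a depth of 3 rather than 4. The tie rule in boundary smoothing (an equidistant value goes to the lower boundary) is an assumption, so check it against the convention used in class.

```python
def equal_depth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_bin_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_bin_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum.
    Assumption: a value equally far from both goes to the lower boundary."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Hypothetical toy values (NOT the homework data), binned with depth 3.
toy = [9, 4, 21, 8, 15, 24]
bins = equal_depth_bins(toy, depth=3)
by_means = smooth_by_bin_means(bins)
by_boundaries = smooth_by_bin_boundaries(bins)

print(bins)           # [[4, 8, 9], [15, 21, 24]]
print(by_means)       # [[7.0, 7.0, 7.0], [20.0, 20.0, 20.0]]
print(by_boundaries)  # [[4, 9, 9], [15, 24, 24]]
```

If the number of values is not a multiple of the depth, the last bin in this sketch is simply shorter than the others.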
Problem 3: Data Transformation [20 points]
Using the data given in Problem 2, answer the following:
(a) Use min-max normalization to transform the data. What is the range of values after the transformation?
(b) Use z-score normalization to transform the data. What is the range of values after the transformation?
(c) Use normalization by decimal scaling to transform the data. What is the range of values after the transformation?
(d) State which method you would prefer for the given data, and explain why.
*You can use R to normalize the data with the three methods and copy the results into three tables in your HW file. Do not forget to state the range of values for (a)-(c).
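As a reference for the three definitions, here is a minimal Python sketch on hypothetical toy values (not the age data). The function names are assumptions made for this example. The z-score here uses the population standard deviation (`statistics.pstdev`); R's `sd` divides by n − 1 instead, so results computed in R will differ slightly unless you adjust for that.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the minimum maps to new_min and the maximum to new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer with max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Hypothetical toy values (NOT the homework data).
toy = [10, 20, 30, 40, 50]
mm = min_max(toy)
zs = z_score(toy)
ds = decimal_scaling(toy)

print(mm)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in zs])
print(ds)
```

Both `min_max` and `z_score` divide by zero when all values are identical; the sketch leaves that case unhandled.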