CS699 SC1 Summer 2020
Final Exam
Note:
· You must write your answers in a document file, in Word or PDF format.
· If you create your answer document in some other file format, you must convert it to a pdf file.
· Name your answer document LastName_FirstName_Final.docx or LastName_FirstName_Final.pdf.
· If you have multiple files to submit, including the answer document, you must combine all files into a single archive file and name it LastName_FirstName_Final.ext, where ext is an appropriate file extension. These are example file names: Smith_John_Final.zip or Smith_John_Final.rar.
· Unless otherwise noted, you must show all your calculations.
· If solving a problem requires multiple steps, you must show all intermediate steps, intermediate calculations, and intermediate results in your answer document.
· If you use another program or software (such as R, a spreadsheet program, etc.) to solve a problem, you must include the relevant screenshots in your answer document or submit the relevant files. If you submit files, make sure that each file name includes the problem number, such as Problem1.xlsx.
· When you run algorithms on Weka or JMP Pro, make sure that you do not change any parameters/options, unless a problem says otherwise.
Problem 1 (5 points). Use the final-p1.csv file for this problem. This dataset has 6 predictor attributes plus a class attribute.
Suppose that you are exploring the dataset (before you start classification) and you want to know which predictor variable would be the most relevant in predicting the class attribute. In general, there are different methods you can use for this. For this problem, you are required to use three different methods to do that. Specific requirements for this problem are:
(1). Briefly describe three different methods you are using.
(2). Apply each method to the given dataset and determine the most relevant predictor attribute.
Note that, for each method you apply to the dataset, you must show all intermediate steps/calculations as needed (or evidence of them). If you show only an answer, you will not receive any points.
Problem 2 (5 points). Use the final-p2.arff file for this problem. The dataset has 11 attributes and 500 tuples.
Run NaïveBayes, J48, MultilayerPerceptron, KNN with k = 5, and RandomForest on Weka on the given dataset. Make sure that 10-fold cross-validation is selected as a test method.
(1). For each algorithm, capture the screenshot of classifier output that shows all performance measures and the confusion matrix, and include them in your answer document.
(2). Choose the best model for each of the following three different data mining goals.
| Data Mining Goal | Best Model |
| A model that has the highest overall accuracy | |
| A model that predicts class 1 tuples with the highest accuracy | |
| A model that predicts class 2 tuples with the highest accuracy | |
Problem 3 (5 points). Use the final-p3.arff file for this problem. This dataset has 3 predictor attributes and a class attribute and it has 50 tuples.
Run the Logistic algorithm of Weka on this dataset.
(1). Show all coefficients generated by the algorithm.
(2). Using the fitted model, classify the following two objects, X1 and X2. Assume that the classification threshold is 0.5.
X1: <Age = 28, BMI = 25, Glucose = 90>
X2: <Age = 60, BMI = 20, Glucose = 160>
Note that you build the fitted model using Weka, but you must classify the objects yourself. You need to show all intermediate steps and calculations.
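The manual classification step can be sketched as follows. The intercept and coefficients below are hypothetical placeholders, not values fitted to final-p3.arff; they would be replaced by the values Weka reports. It is also worth checking which class the reported coefficients refer to before interpreting the probability.

```python
import math

# Hypothetical coefficients (NOT from the exam dataset); replace with Weka's output.
intercept = -8.0
coef = {"Age": 0.05, "BMI": 0.10, "Glucose": 0.03}

def predict_probability(x):
    """Return P(modeled class | x) = 1 / (1 + exp(-z)), with z = b0 + sum(b_i * x_i)."""
    z = intercept + sum(coef[name] * x[name] for name in coef)
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, threshold=0.5):
    """True when the probability of the modeled class reaches the threshold."""
    return predict_probability(x) >= threshold

x1 = {"Age": 28, "BMI": 25, "Glucose": 90}
x2 = {"Age": 60, "BMI": 20, "Glucose": 160}
for name, x in (("X1", x1), ("X2", x2)):
    print(name, round(predict_probability(x), 4), classify(x))
```

The printed probabilities depend entirely on the placeholder coefficients, so they say nothing about the actual answers.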
Problem 4 (5 points). Use the final-p4.csv file for this problem. This dataset is a subset of a large transactional database and it has purchase records of two items – tea and coffee. Each tuple in the dataset represents a transaction and the notation “t” indicates a transaction contains the item and the notation “?” means a transaction does not contain the item.
(1). Create a contingency table from the given dataset.
(2). Calculate all_conf and Kulczynski measures and determine whether there is a correlation between the purchase of tea and the purchase of coffee.
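As a rough illustration of the two measures, the sketch below computes all_conf and Kulczynski from a contingency table. The counts are hypothetical and are not taken from final-p4.csv.

```python
# Hypothetical contingency counts (NOT from final-p4.csv).
both = 150          # transactions containing both tea and coffee
tea_only = 50       # tea but not coffee
coffee_only = 350   # coffee but not tea
neither = 450       # not used below: both measures are null-invariant

sup_tea = both + tea_only
sup_coffee = both + coffee_only

# all_conf(A, B) = sup(A and B) / max(sup(A), sup(B))
all_conf = both / max(sup_tea, sup_coffee)

# Kulc(A, B) = 0.5 * (P(A|B) + P(B|A))
kulc = 0.5 * (both / sup_tea + both / sup_coffee)

print(f"all_conf = {all_conf:.3f}, Kulczynski = {kulc:.3f}")
```

Both measures range from 0 to 1. Values well above 0.5 suggest a positive association, values well below 0.5 suggest a negative one, and values near 0.5 are inconclusive.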
Problem 5 (5 points). Use the final-p5.csv file for this problem.
You are comparing two classifier models, M1 and M2, using the hypothesis test method that we discussed in class. You performed 5-fold cross-validation, and the results are in the final-p5.csv file. In the results, E1 contains the error rates of classifier M1 and E2 contains the error rates of classifier M2. Calculate the test statistic and state your conclusion. Assume that the significance level is α = 0.05. You must perform this test yourself and must show all intermediate steps and intermediate results.
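One common form of this comparison is a paired t-test on the per-fold error differences, sketched below. The error rates are hypothetical, not the contents of final-p5.csv. The sketch assumes the sample variance (division by k − 1); some textbook presentations divide by k instead, so the variant used in class should take precedence.

```python
import math
from statistics import mean, variance

# Hypothetical per-fold error rates (NOT from final-p5.csv).
e1 = [0.30, 0.32, 0.28, 0.35, 0.30]   # classifier M1
e2 = [0.25, 0.27, 0.26, 0.28, 0.24]   # classifier M2

k = len(e1)
diffs = [a - b for a, b in zip(e1, e2)]
d_bar = mean(diffs)
var_d = variance(diffs)               # sample variance, divides by k - 1 (assumption)

# t = mean difference / sqrt(var / k), with k - 1 degrees of freedom
t_stat = d_bar / math.sqrt(var_d / k)

# Two-sided critical value for alpha = 0.05 and 4 degrees of freedom (from a t-table).
T_CRIT = 2.776
reject = abs(t_stat) > T_CRIT

print(f"mean diff = {d_bar:.4f}, t = {t_stat:.3f}, reject H0: {reject}")
```

Rejecting the null hypothesis means the difference between the two mean error rates is statistically significant at the chosen level.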
Problem 6 (5 points). Use the final-p6.csv file for this problem. The dataset has five transactions, where items are represented as integers.
Run the Apriori algorithm that we discussed in class and mine all frequent itemsets. Show all candidate itemsets and frequent itemsets. You should follow the process described in the book and lecture (i.e., C1 → L1 → C2 → L2 → …). Minimum support = 60% (i.e., 3 or more transactions). You must not use a data mining tool, such as Weka, JMP Pro, or R. You must run the Apriori algorithm yourself and you must show all intermediate steps.
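For checking hand-worked results, the level-wise process C1 → L1 → C2 → L2 → … can be sketched as below. The transactions are hypothetical and are not the contents of final-p6.csv. The join step is simplified: it unions any two frequent (k−1)-itemsets that form a k-itemset and then prunes, which yields the same pruned candidate set as the prefix-based join but does not show the unpruned candidates separately.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return a list of (k, candidate_counts, frequent_counts), one entry per level."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]            # C1
    levels, k = [], 1
    while candidates:
        # Scan the database once to count each candidate's support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        levels.append((k, counts, frequent))
        k += 1
        prev = list(frequent)
        # Join: unions of two frequent (k-1)-itemsets that contain exactly k items.
        joined = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: keep a candidate only if every (k-1)-subset is frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    return levels

# Hypothetical transactions (NOT from final-p6.csv).
transactions = [[1, 2, 3], [1, 2, 4], [1, 2], [2, 3], [1, 3, 4]]
levels = apriori(transactions, min_count=3)

for k, counts, frequent in levels:
    print(f"C{k}:", {tuple(sorted(c)): n for c, n in counts.items()})
    print(f"L{k}:", {tuple(sorted(c)): n for c, n in frequent.items()})
```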
Problem 7 (5 points). Use the final-p7-1.csv and final-p7-2.csv files for this problem.
(1). The final-p7-1.csv file has 10 frequent 2-itemsets (or L2) that were mined from a transactional database. In the file, items are encoded into integers. Which of the following 3-itemsets cannot be frequent? In your answer document, write those 3-itemsets that cannot be frequent.
(1)-a. {1, 3, 5}
(1)-b. {2, 5, 7}
(1)-c. {1, 2, 6}
(1)-d. {1, 3, 6}
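The reasoning behind this kind of question is the Apriori property: if any 2-item subset of a 3-itemset is missing from L2, the 3-itemset cannot be frequent. A minimal sketch follows, using a hypothetical L2 rather than the contents of final-p7-1.csv; the function name is illustrative.

```python
from itertools import combinations

# Hypothetical frequent 2-itemsets (NOT from final-p7-1.csv).
frequent_2 = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (2, 4)]}

def passes_apriori_check(itemset):
    """True if every 2-item subset is in L2; False means the itemset cannot be frequent."""
    return all(frozenset(pair) in frequent_2 for pair in combinations(itemset, 2))

for candidate in [{1, 2, 3}, {1, 2, 4}]:
    print(sorted(candidate), passes_apriori_check(candidate))
```

Passing the check only means the itemset may be frequent; its actual support would still have to be counted in the database.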
(2). This question is about the XCS algorithm that we discussed in the class. Consider the current set of rules in the final-p7-2.csv file. Suppose that a sample 10010010 10 is extracted from the training dataset.
(2)-a. Generate the match set.
(2)-b. Determine the action from the match set.
(2)-c. Generate the action set.
(2)-d. Which rules are rewarded?
Problem 8 (5 points). Use the final-p8.csv file for this problem. In the file, CLASS is the class attribute.
Discretize A1 to two distinct values, Low and High, using entropy in such a way that the information gain of A1 is maximized. You must show all intermediate steps and intermediate results, and the discretized values of A1.
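An entropy-based binary split can be sketched as follows. The values and class labels are hypothetical, not the contents of final-p8.csv, and the use of midpoints between adjacent distinct values as candidate thresholds is an assumption.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, information_gain) for the best Low/High split."""
    pairs = list(zip(values, labels))
    base = entropy(labels)
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2                      # candidate split point
        low = [c for v, c in pairs if v <= threshold]
        high = [c for v, c in pairs if v > threshold]
        # Weighted entropy after the split.
        expected = (len(low) * entropy(low) + len(high) * entropy(high)) / len(pairs)
        gain = base - expected
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical data (NOT from final-p8.csv).
a1 = [10, 20, 30, 40, 50, 60]
cls = ["N", "N", "N", "N", "Y", "Y"]
threshold, gain = best_binary_split(a1, cls)
discretized = ["Low" if v <= threshold else "High" for v in a1]
print(threshold, round(gain, 4), discretized)
```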
Problem 9 (5 points). Use the final-p9.arff dataset for this problem. This dataset has two attributes and 50 tuples.
(1) Run the SimpleKMeans algorithm of Weka on this dataset with k = 2, 3, 4, 5, 6, and 7. For each k, record the SSE and determine an optimal number of clusters using the elbow method that we discussed in class. You must show all SSE values and explain how you chose the optimal number of clusters. When you run SimpleKMeans on Weka, do not change any options/parameters, except the number of clusters.
(2) Using the optimal number of clusters which you determined in Problem 9-(1), run SimpleKMeans again and characterize the generated clusters using the two attribute values. The following is an example of characterization of clusters:
Cluster 0:
· A1 is mostly between 100 and 200, mean of A1 is 150
· A2 is mostly between 10 and 20, mean of A2 is 15
Cluster 1:
· A1 is mostly between 200 and 300, mean of A1 is 250
· A2 is mostly between 20 and 30, mean of A2 is 25
. . .
Problem 10 (5 points). Use the final-p10.csv file for this problem. This file contains five 2-dimensional objects that are split into two clusters. Cluster C1 has objects a and b and Cluster C2 has objects c, d, and e.
(1) Calculate the distance between the two clusters using the mean distance method. You must use the Euclidean distance measure when calculating a distance between two objects/points.
You must show all intermediate calculations and intermediate results.
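Assuming the mean distance method refers to the Euclidean distance between the two cluster means (centroids), the calculation can be sketched as below. The coordinates are hypothetical and are not the contents of final-p10.csv.

```python
import math

# Hypothetical 2-dimensional objects (NOT from final-p10.csv).
c1 = {"a": (0.0, 0.0), "b": (2.0, 0.0)}
c2 = {"c": (4.0, 4.0), "d": (5.0, 4.0), "e": (3.0, 4.0)}

def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    points = list(cluster.values())
    return tuple(sum(coord) / len(points) for coord in zip(*points))

m1 = centroid(c1)
m2 = centroid(c2)

# Mean distance (assumed definition): Euclidean distance between the cluster means.
mean_distance = math.dist(m1, m2)
print(m1, m2, mean_distance)
```

This differs from the average distance, which averages the Euclidean distances over all cross-cluster pairs of objects.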
Extra Credit Question (5 points). Use the final-ec.csv file for this problem, which is used as a test dataset. Suppose that the following decision tree was built from a training dataset:
(1). Classify the 10 tuples in the test dataset using the above decision tree and show the classes of all 10 tuples.
(2). Show the confusion matrix of the test result.
(3). Calculate the TP rate and FP rate for class 1 (i.e., risk = 1).
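The two rates follow directly from the confusion matrix, as in the sketch below. The counts are hypothetical and are not derived from final-ec.csv; class 1 is treated as the positive class.

```python
# Hypothetical confusion-matrix counts with class 1 as the positive class
# (NOT derived from final-ec.csv).
tp, fn = 4, 1   # actual class 1: predicted 1, predicted 0
fp, tn = 2, 3   # actual class 0: predicted 1, predicted 0

tp_rate = tp / (tp + fn)   # fraction of actual class 1 tuples predicted as class 1
fp_rate = fp / (fp + tn)   # fraction of actual class 0 tuples predicted as class 1

print(f"TP rate = {tp_rate:.2f}, FP rate = {fp_rate:.2f}")
```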