Advanced Database (Weka Explorer)

Use Weka 3 in Data Mining

1. Download

(1) Go to https://www.cs.waikato.ac.nz/ml/weka/ and follow the instructions to download and install Weka 3.

(2) Go to http://storm.cis.fordham.edu/~gweiss/data-mining/datasets.html and download these datasets:

· soybean.arff

· iris.arff

· segment-challenge.arff

· segment-test.arff

2. Use

After you install Weka 3, run the program. The "Weka GUI Chooser" window appears.

Click "Explorer". The Weka Explorer window opens.

Click "Open file", and open one of your arff files, for example, "iris.arff".
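If you are curious what an arff file contains, the sketch below reads a small one in Python. It is only an illustration and is not how Weka loads files: the function name read_arff is made up, the sample text is a hand-written fragment in ARFF style, and the reader assumes simple comma-separated data with no quoted values or sparse rows.

```python
def read_arff(lines):
    """Return (attribute_names, data_rows) from the lines of a simple ARFF file."""
    attributes, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):    # skip blank lines and comments
            continue
        if in_data:
            # After @data, every line is one instance.
            rows.append([value.strip() for value in line.split(",")])
        elif line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])  # the second token is the name
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows


# Hand-written fragment in ARFF style, for illustration only.
sample = """\
% A comment line
@relation iris
@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
""".splitlines()

attributes, rows = read_arff(sample)
print(attributes)
print(len(rows), rows[0][-1])
```

To try it on a downloaded file, you could pass an open file object instead of the sample, for example read_arff(open("iris.arff")).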

Click the "Classify" tab. Then click the "Choose" button to choose a classifier.

Here we first choose "trees -> J48", which is Weka's implementation of the C4.5 Decision Tree.

Then, under "Test options", select "Cross-validation" and set the number of folds. The default is 10. Let's change it to 5.

Then we can click "Start" to test the classifier using cross validation.
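To see what 5-fold cross validation means, here is a minimal Python sketch of the splitting step. The data is cut into 5 folds, and each fold is used once as the test set while the other 4 are used for training. The function names and the seed are made up for this illustration, and the sketch does not try to reproduce Weka's own fold assignment.

```python
import random


def k_fold_indices(n_instances, k, seed=1):
    """Shuffle the indices 0..n_instances-1 and deal them into k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_instances, k, seed=1):
    """Yield (train, test) index lists, using each fold once as the test set."""
    folds = k_fold_indices(n_instances, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 150 instances (the size of the iris data) and 5 folds.
splits = list(train_test_splits(150, 5))
for train, test in splits:
    print(len(train), len(test))  # prints "120 30" for each of the 5 folds
```

Every instance is tested exactly once, so the final accuracy is computed over all 150 instances.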

Here you can see a Confusion Matrix. Each row corresponds to one actual class, and the columns show which labels the classifier gave to the instances of that class.

In this matrix, we can see that:

· There are 50 instances of class "a". 49 of them were correctly labelled as "a", while 1 of them was incorrectly labelled as "b".

· There are 50 instances of class "b". 47 of them were correctly labelled as "b", while 3 of them were incorrectly labelled as "c".

· There are 50 instances of class "c". 48 of them were correctly labelled as "c", while 2 of them were incorrectly labelled as "b".

The Confusion Matrix provides a good illustration of the correctness of the classification. In the classifier output, you can also find other stats such as the error rate.
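As a worked example, the accuracy and error rate can be computed by hand from the confusion matrix described above. The Python sketch below uses the counts from the three bullet points, and the variable names are chosen only for this illustration.

```python
# Rows are actual classes, columns are predicted classes (order: a, b, c).
confusion = [
    [49, 1, 0],   # actual "a": 49 labelled "a", 1 labelled "b"
    [0, 47, 3],   # actual "b": 47 labelled "b", 3 labelled "c"
    [0, 2, 48],   # actual "c": 2 labelled "b", 48 labelled "c"
]

total = sum(sum(row) for row in confusion)
# Correct predictions sit on the diagonal of the matrix.
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total
error_rate = 1 - accuracy

print(f"accuracy = {accuracy:.2%}, error rate = {error_rate:.2%}")
# prints: accuracy = 96.00%, error rate = 4.00%
```

In other words, 144 of the 150 instances were classified correctly.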

We may also try other classifiers. You will find these classifiers:

· lazy -> IBk is the KNN classifier.

· functions -> SMO is the SVM classifier.

There are also a lot of other classifiers you can use. Let's try KNN first.

After you choose IBk, you may click the textbox to modify its parameters.

In the pop-up window, you can set the value of each parameter. For example, the first value "KNN" specifies the number of neighbors K. By default, it is 1, which means it is a simple Nearest Neighbor classifier. Let's change it to 3 and click "OK".

Then we can click "Start" to test it. We will see a new Confusion Matrix.
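To see why the value of K matters, here is a minimal Python sketch of the KNN idea: find the K closest training points and take a majority vote. The function name knn_predict and the toy points are made up for this illustration (they are not taken from iris.arff), and IBk itself may differ, for example in how it scales attributes and breaks ties.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest points."""
    # train is a list of (feature_vector, label) pairs; distance is Euclidean.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy points for illustration only.
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((3.0, 3.0), "b"),
    ((3.2, 2.9), "b"),
    ((1.6, 1.6), "b"),  # a stray "b" point sitting close to the "a" group
]
query = (1.5, 1.5)

print(knn_predict(train, query, k=1))  # prints: b (the stray point is closest)
print(knn_predict(train, query, k=3))  # prints: a (two of the three nearest are "a")
```

With K = 1 a single unusual training point decides the label, while K = 3 lets the two neighboring "a" points outvote it.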

Similarly, you can try other classifiers, such as SVM. If you do not know the meaning of the parameters of a classifier, you may leave them as the default values.

Now let's see how we can train a classifier using a training dataset, and test it with a different test dataset. 

Go back to the tab "Preprocess" and open "segment-challenge.arff". Then go to "Classify".

Choose any classifier, for example SVM (functions -> SMO). 

In Test options, choose "Supplied test set", and click "Set...".

Browse and open "segment-test.arff". Then click "Start" to test it. You will see the result in the classifier output.
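The idea behind a supplied test set can be sketched in a few lines of Python: the classifier is built from the training data only, and the accuracy is then measured on separate data it has never seen. Everything in this sketch is an assumption made for illustration: the function names, the one-attribute toy data, and the use of a simple nearest neighbor rule as a stand-in for the classifier.

```python
def fit_nearest_neighbor(train_set):
    """Return a predict function that copies the label of the closest training value."""
    def predict(x):
        return min(train_set, key=lambda item: abs(item[0] - x))[1]
    return predict


def evaluate_on_test_set(fit, train_set, test_set):
    """Build the classifier from train_set only, then return accuracy on test_set."""
    predict = fit(train_set)
    correct = sum(predict(x) == label for x, label in test_set)
    return correct / len(test_set)


# Made-up toy data: (attribute value, class label).
train_set = [(1.0, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")]
test_set = [(1.5, "a"), (7.0, "b"), (4.0, "b"), (8.5, "b")]

result = evaluate_on_test_set(fit_nearest_neighbor, train_set, test_set)
print(result)  # prints: 0.75 (the test value 4.0 is closer to the "a" points)
```

Unlike cross validation, the training data is never split here. The two files play fixed roles.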

For more information on Weka, please refer to its documentation:  https://www.cs.waikato.ac.nz/ml/weka/documentation.html  

Please try these classifiers on these data:

· Use C4.5 Decision Tree on "iris.arff". Test the accuracy using 5-fold cross validation.

· Use SVM on "iris.arff". Test the accuracy using 5-fold cross validation.

· Use KNN on "iris.arff". Try K = 1. Test the accuracy using 5-fold cross validation.

· Use KNN on "iris.arff". Try K = 3. Test the accuracy using 5-fold cross validation.

· Use C4.5 Decision Tree on "soybean.arff". Test the accuracy using 5-fold cross validation.

· Use SVM on "soybean.arff". Test the accuracy using 5-fold cross validation.

· Use KNN on "soybean.arff". Try K = 1. Test the accuracy using 5-fold cross validation.

· Use KNN on "soybean.arff". Try K = 3. Test the accuracy using 5-fold cross validation.

· Use C4.5 Decision Tree on "segment-challenge.arff". Test the accuracy using "segment-test.arff".

· Use SVM on "segment-challenge.arff". Test the accuracy using "segment-test.arff".

· Use KNN on "segment-challenge.arff". Try K = 1. Test the accuracy using "segment-test.arff".

· Use KNN on "segment-challenge.arff". Try K = 3. Test the accuracy using "segment-test.arff".

Write a report. Include screenshots of your results, and compare the performance of the three classifiers (C4.5 Decision Tree, SVM, and KNN). Which one of them has the best accuracy? Which one of them is the fastest?

Upload your report to the submission folder by the due date.