
Data Mining 3

Classify Data Using Logistic Regression

In DM3, we are going to build and evaluate a logistic regression model. Our final process will look like the image below:

Task 1: Prepare Dataset for Analysis

1. Add Data.

a) Click the Add Data button in the Repository panel.
b) Click My Computer in the ‘Import Data – Where is your data?’ pop-up.
c) Navigate to your Week 5 content folder and download the file named ‘DM3.iris.xlsx.’ Click Next.
d) Import all cells (click Next) and do not change any settings in ‘Format Your Column’ (click Next). On the ‘Where to store data’ screen, ensure the ‘data’ folder is highlighted. Click the Finish button at the bottom of the screen.
e) In the Repository panel, go to the Local Repository, open the data folder, and drag the ‘DM3.iris’ data set into the Process panel.

2. Set Role.

a) Drag the Set Role operator into the Process window. Connect it with the Retrieve operator.
b) Use the Set Role operator to set the irisSpecies attribute to ‘label’ and the sampleNum attribute to ‘id.’ (You can refer back to HW2 for more detailed directions on how to do this.)

You should be familiar with these first two subtasks from DM1. Now we will do some additional preparation before building a model using logistic regression.

3. Split Data.

Up to this point, we have used a distinct training set and testing set. We first built the model using the training set. We then simulated collecting new data with the testing set, and we classified this new data using the model. However, in real life, we will often want to build and test a model using already collected data. We can easily split data into multiple sets using RapidMiner’s Split Data operator.

a) In the Operators panel, search for the Split Data operator.
b) Drag the Split Data operator into the Process. Connect the “exa” output port from the Set Role operator to the “exa” input port of the Split Data operator.
c) Select the Split Data operator.
d) In the Parameters pane for the Split Data operator, click the Edit Enumeration button.
e) In the Edit Parameter Lists popup, click the Add Entry button.
f) Set the text box under ratio to 0.7.


g) Click Add Entry again.
h) This time, set the ratio text box to 0.3.
i) Click OK.

The Split Data operator splits the data into subsets based on the values in the ratio parameter. We just split the data into two subsets. The first subset contains 70% (or 0.7) of the data and the second subset contains 30% (or 0.3) of the data. The sets are non-overlapping.
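For readers who want to see the same idea outside RapidMiner, here is a minimal Python sketch of a 70/30 split. It is not how the Split Data operator is implemented; the function name split_data, the fixed seed, the shuffling step, and the 100 stand-in example ids are all assumptions made for illustration.

```python
import random

def split_data(examples, ratios=(0.7, 0.3), seed=42):
    """Shuffle the examples, then cut them into non-overlapping subsets by ratio."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumption, not RapidMiner's documented behavior
    subsets, start = [], 0
    for i, ratio in enumerate(ratios):
        # The last subset takes whatever remains so that no example is dropped.
        if i == len(ratios) - 1:
            end = len(shuffled)
        else:
            end = start + round(ratio * len(shuffled))
        subsets.append(shuffled[start:end])
        start = end
    return subsets

# Hypothetical stand-in for the sampleNum ids of 100 examples.
sample_ids = range(1, 101)
train, test = split_data(sample_ids)
print(len(train), len(test))  # 70 30
```

Because each example lands in exactly one slice of the shuffled list, the two subsets cannot overlap, which mirrors the behavior described above.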

Task 2: Build a Logistic Regression Model

1. Build a Logistic Regression Model.

a) Search the Operators panel for the Logistic Regression operator.
b) Drag the Logistic Regression operator into the Process panel.
c) Connect the first partition “par” output port of the Split Data operator to the training “tra” input port of the Logistic Regression operator.
d) Connect the model “mod” output port of the Logistic Regression operator to the results “res” port.
e) Run the process.

Congratulations! You have just created a logistic regression model. In the results window, select the Logistic Regression tab. Here, you can see the coefficients for the model. If we were to write these out as an equation, we would have the following:

f(x) = −8.1 + 41.9 * petal width − 7.9 * sepal width


• The decision boundary is the line that is created when f(x) = __________________?

• Which species does the model predict when f(x) > 0? ______________? (Hint: plug in values for one of the plants into the logistic regression equation)

• Which species does the model predict given the following attributes? _____________ sepal width = 2.3, petal width = 1
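If you would like to check your arithmetic for these questions, the equation can be evaluated with a few lines of Python. This is only a sketch: the measurements below are hypothetical values chosen for illustration (not a plant from the data set), and the logistic function is the standard one used to turn f(x) into a value between 0 and 1. Which species that value refers to is not stated here, so you still need to check it against a plant from the data as the hint suggests.

```python
import math

def f(petal_width, sepal_width):
    """Linear part of the model, using the coefficients from the equation above."""
    return -8.1 + 41.9 * petal_width - 7.9 * sepal_width

def logistic(z):
    """Standard logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical measurements, chosen only to illustrate the calculation.
score = f(petal_width=0.2, sepal_width=3.5)
probability = logistic(score)

print(round(score, 2))  # -27.37
print(probability)      # a value very close to 0
```

To answer the questions, swap in the attribute values of the plant you are interested in and look at the sign of the score.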

In addition to the coefficients, the logistic regression model returns coefficients resulting from standardizing the data (i.e., the Std. Coefficients column). Standardizing puts every attribute on the same scale, so the magnitudes of the standardized coefficients give a more easily interpreted assessment of the relative importance of each attribute.

• Which attribute is the most important for classifying iris species? _________________

• How much more important is it than the next most important factor? ______________
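To see what standardizing does, here is a small Python sketch. It assumes the common z-score definition (subtract the mean, divide by the sample standard deviation); whether RapidMiner computes its Std. Coefficients column in exactly this way is an assumption. The petal widths are hypothetical values, and 41.9 is the petal width coefficient from the equation above.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values], mean, sd

# Hypothetical petal widths, for illustration only.
petal_widths = [0.2, 0.4, 1.3, 1.5, 1.0, 0.3]
z_scores, mean, sd = standardize(petal_widths)

raw_coefficient = 41.9
# In a linear predictor, dividing an attribute by its standard deviation
# multiplies its coefficient by that standard deviation.
std_coefficient = raw_coefficient * sd

for x, z in zip(petal_widths, z_scores):
    # Both forms describe the same contribution of petal width to f(x).
    assert abs(raw_coefficient * (x - mean) - std_coefficient * z) < 1e-9
```

An attribute measured on a wide scale gets a small raw coefficient even if it matters a lot, which is why the standardized column is the fairer comparison.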

Finally, you will see the z-Value and p-Value for each attribute. You may remember using these values in your statistics class. However, understanding and interpreting these values is beyond the scope of this course.

Task 3: Evaluate the Model

1. Apply Model.

Now that we’ve built the model, let’s apply it to the 30% testing subset we previously created using the Split Data operator.

a) Search for the operator Apply Model and drag it into the process.
b) Connect the model (“mod”) output port of Logistic Regression to the model (“mod”) input port of the Apply Model operator.
c) Connect the 2nd partition (“par”) output port from the Split Data operator to the unlabeled data (“unl”) input port of the Apply Model operator.
d) Connect the labeled data (“lab”) output port to the results (“res”) port.
e) Run the process.

Since the original data was labeled (using the Set Role operator), the 30% testing subset is also labeled. So, we can now compare the model’s prediction with the actual labels. A quick scan of the results indicates the model correctly labeled all the test examples!


2. Evaluate Model Performance.

Manually scanning results is tedious. Imagine if we actually had thousands or even millions of examples! Fortunately, there is a better way.

a) Search for the Performance operator and drag it into the process. (Make sure you grab the Performance operator and not any of its derivatives, e.g., Performance (Classification) or Performance (Binomial Classification). You may need to scroll down the operator pane to find it.)
b) Connect the labeled data (“lab”) output port from the Apply Model operator to the labeled data (“lab”) input port on the Performance operator.
c) Connect the performance (“per”) output port of the Performance operator to the results (“res”) port.
d) Run the process.

In the results window, you will find the matrix below:

Notice that accuracy is 100%. As you might expect, accuracy is the number of correct classifications divided by the number of examples. In this case, the model labeled 30/30 correctly for an accuracy of 100%. Our model is seemingly perfect. However, accuracy alone is rarely used to evaluate a model.


The performance matrix shows some of the other measures used to evaluate a model, such as precision and recall. We will discuss model evaluation further in future lessons.
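As a preview, the three measures can be computed by hand from a list of actual labels and a list of predicted labels. The Python sketch below uses the standard definitions; the function name evaluate and the labels "A" and "B" are hypothetical stand-ins, and the data is made up so that the three numbers differ (it is not the output of our iris process).

```python
def evaluate(actual, predicted, positive):
    """Compute accuracy, precision, and recall for one class treated as 'positive'."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels for ten examples; the third example is misclassified.
actual    = ["A", "A", "A", "B", "B", "B", "B", "A", "B", "A"]
predicted = ["A", "A", "B", "B", "B", "B", "B", "A", "B", "A"]

metrics = evaluate(actual, predicted, positive="A")
print(metrics)  # {'accuracy': 0.9, 'precision': 1.0, 'recall': 0.8}
```

Here every example predicted as "A" really is an "A" (precision 1.0), but one actual "A" was missed (recall 0.8), which accuracy alone would not reveal.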

3. Save your process.

a) Click the Save icon (alternatively, you can press Ctrl-S).
b) Name the process ‘Iris_LogRegres’ and make sure it is being saved in the processes folder of the Repository. Click OK to Save.

Congratulations! You have added logistic regression to your growing repertoire of data mining techniques. You are on your way to becoming a data scientist!

To summarize:

• Linear classification uses a line to partition data

• Logistic regression is one way to create the line by maximizing the likelihood that each point is correctly classified

• Accuracy is one of several ways to evaluate a model

DELIVERABLE:

• Submit answers via iLEARN.

• Export your process as ‘[lastName].Iris_LogRegres.rmp’ to your iLearn Data Mining 3 quiz