(Watch the Video Lecture and repeat my steps)
1- Create a folder on your computer and name it Your_Last_name_Logistic Regression.
2- Download the two data sets labeled Training and Scoring from Blackboard and put them inside the same folder, Your_Last_name_Logistic Regression.
3- Open RapidMiner and create a new repository by clicking on the arrow next to “import Data.”
4- You should see a window similar to the one below. Make sure to uncheck the selection next to “Use Standard Location”, then click the folder icon (in yellow) and keep browsing until you find the folder you created in Step 1 (above). Select it and finish the process.
5- Although the process window is empty now, go ahead and save the process as Your_Last_name_Logistic Regression by right-clicking on the folder (as shown below) and selecting “Store Process Here”.
6- Save it with your name, as shown below. YOU WILL HAVE to SAVE this Process again when you are done with this Exercise.
7- Use the Read CSV operator, as shown below, to import the two data sets, Training.csv and Scoring.csv, into the RapidMiner Process. Your folder should look like the picture below. Then select the first Read CSV operator and click “Import Configuration Wizard” at the top right-hand side.
8- Then complete the steps below:
1. Begin by importing the training data set. This can be done by importing the data set into a RapidMiner repository or via a Read CSV operator. For the most part, the process will be the same as in past exercises, but for logistic regression there are a few subtle differences. Be sure to set the first row as the attribute names, as we have always done. On the final step of the Import Wizard, when setting data types and attribute roles, you will need to make at least one change: set the 2nd_Heart_Attack data type to "binominal" rather than "polynominal." Because it is a yes/no field, it has exactly two values, and the Logistic Regression operator we'll be using in our modeling phase expects the label to be binominal. Setting the data type now, as we import the data, means we will not have to convert this target attribute later in the process. Use the little black down arrow next to the gear icon to change the data type of this attribute. This is shown in Figure 1:
Figure 1: Setting the 2nd_Heart_Attack attribute's data type to "binominal" during import.
2. At this time, you can also change the 2nd_Heart_Attack attribute's role to "label" if you wish. We have not done this in Figure 1, and subsequently we will be adding a Set Role operator to our stream as we continue our data preparation.
3. Complete the data import process for the training data, ensuring it is included in a new blank Process. Rename the data set's operator (Retrieve, or Read CSV if you imported the file that way) as "Training."
4. Import the scoring data set now. Be sure the data type for all attributes is "integer." This should be the default, but double-check to make sure. Since the 2nd_Heart_Attack attribute is not included in the scoring data set, you don't need to worry about changing it as you did in step 1. Complete the import process, and include the scoring data set in your Process. Rename this data set's operator as "Scoring." Your model should now appear similar to Figure 2. Note that we have used Read CSV operators.
Figure 2: The training and scoring data sets in a new Process window in RapidMiner.
5. Run the model and compare the ranges for all attributes between the scoring and training result set tabs (Figure 3 and Figure 4, respectively). You should find that the ranges are the same. As was the case with linear regression, the scoring values must all fall within the lower and upper bounds set by the corresponding values in the training data set. We can see in Figure 3 and Figure 4 that this is the case, so our data are very clean. They were prepared during extraction from Sonia's source database, so we will not need to do further data preparation to filter out observations with inconsistent values or to handle missing values.
Figure 3: Statistics metadata for the scoring data set (note absence of 2nd_Heart_Attack attribute).
Figure 4: Metadata for the training data set (2nd_Heart_Attack attribute is present with the binominal data type). Note that scoring range values (Min/Max) fall within training range values for all attributes.
6. Switch back to Design perspective and add a Set Role operator to your training stream. Remember that if you designated 2nd_Heart_Attack to have a "label" role during the data import process, you won't need to add a Set Role operator at this time. We did not do this in the book example, so we need the operator to designate 2nd_Heart_Attack as our label, our target attribute:
Figure 5: Configuring the 2nd_Heart_Attack attribute's role in preparation for logistic regression mining.
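The range comparison from step 5 can also be expressed in a few lines of code. The following is a minimal Python sketch, not part of the RapidMiner process; the attribute names (Age, Cholesterol) and all values are hypothetical stand-ins for the real Training and Scoring files.

```python
def attribute_ranges(rows):
    """Return {attribute: (min, max)} for a list of dict rows."""
    ranges = {}
    for row in rows:
        for name, value in row.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges


def out_of_range(training_rows, scoring_rows):
    """List (row_index, attribute, value) for scoring values outside the training range."""
    train = attribute_ranges(training_rows)
    problems = []
    for i, row in enumerate(scoring_rows):
        for name, value in row.items():
            lo, hi = train[name]
            if not lo <= value <= hi:
                problems.append((i, name, value))
    return problems


# Hypothetical miniature data sets; names and values are illustrative only.
training = [
    {"Age": 55, "Cholesterol": 150},
    {"Age": 74, "Cholesterol": 239},
]
scoring = [
    {"Age": 61, "Cholesterol": 170},
    {"Age": 80, "Cholesterol": 200},  # Age is above the training maximum
]

problems = out_of_range(training, scoring)
print(problems)
```

An empty list would correspond to the clean situation described in step 5; any entry identifies a scoring observation that would need to be filtered out before applying the model.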
Modeling
7. Using the search field in the Operators tab, locate the Logistic Regression operator. You will see that if you just search for the word "logistic" (as has been done in Figure 6), there are several different logistic regression operators available to you in RapidMiner—more if you have the Weka extension installed. We will use the first one in this example; however, you are certainly encouraged to experiment with the others as you would like. Drag the Logistic Regression operator into your training stream.
Figure 6: The Logistic Regression operator in our training stream.
8. The Logistic Regression operator will generate coefficients for each of our predictor attributes in much the same way that the linear regression operator did. If you would like to see these, you can run your model now. The algebraic formula for logistic regression is different and a bit more complicated than the one for linear regression. We are no longer calculating the slope of a straight line, but rather we are trying to determine the likelihood of an observation falling at a given point along a curved and less well-defined imaginary line through a data set. The coefficients for logistic regression are used in that formula.
9. If you ran your model to see your coefficients, return now to Design perspective. Add an Apply Model operator to your stream to bring the training and scoring data sets together. Be sure your lab and mod ports are both connected to res ports.
Figure 7: Applying the model to the scoring data set.
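To make the role of the coefficients in step 8 more concrete, here is a minimal Python sketch of the standard logistic function. It is an illustration under assumptions: the intercept, coefficients, and attribute names are hypothetical, and RapidMiner's operator may differ in its internal details.

```python
import math


def logistic_probability(intercept, coefficients, values):
    """Probability of the positive class under a logistic regression model."""
    # Linear combination of the predictors, exactly as in linear regression...
    z = intercept + sum(coefficients[name] * values[name] for name in coefficients)
    # ...then squeezed onto the 0-1 interval by the S-shaped logistic curve.
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and attribute values, for illustration only.
intercept = -10.0
coefficients = {"Cholesterol": 0.04, "Trait_Anxiety": 0.05}
person = {"Cholesterol": 178, "Trait_Anxiety": 50}

p_yes = logistic_probability(intercept, coefficients, person)
p_no = 1.0 - p_yes
print(round(p_yes, 3), round(p_no, 3))
```

The linear part is what the coefficients feed into; the curve is what turns that sum into a likelihood between 0 and 1 rather than a point on a straight line.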
Evaluation
Figure 8: Coefficients for each predictor attribute.
The initial tab shown in Results perspective is a list of our coefficients. These coefficients are used in the logistic regression algorithm to predict whether or not each person in our scoring data set will suffer a second heart attack, and if so, how confident we are that the prediction will come true. Switch to the Scoring results tab. We will look first at the Statistics tab (Figure 9).
Figure 9: Statistics for our scoring predictions.
We can see in this figure that RapidMiner has generated three new attributes for us: confidence (Yes), confidence (No), and prediction(2nd_Heart_Attack). In our Value column, we find that out of the 690 people represented, we're predicting that 340 will not suffer second heart attacks, and that 350 will. Sonia's hope is that she can engage these 350, and perhaps even some of the 340 with low confidence levels on their "No" prediction, in programs to improve their health and thus increase their chances of avoiding another heart attack. Let's switch to the Data view.
Figure 10: Predictions for our 690 patients who have suffered a first heart attack.
In Figure 10, we can see that each person has been given a prediction of "No" (they won't suffer a second heart attack) or "Yes" (they will). It is critically important to remember at this point of our evaluation that if this were real and not a textbook example, these would be real people, with names, families, and lives. Yes, we are using data to evaluate their health, but we shouldn't treat these people like numbers. Hopefully our work and analysis will help our imaginary client Sonia in her efforts to serve these people better. When mining data, we should always keep the human element in mind.
So, we have these predictions that some people in our scoring data set are on the path to a second heart attack and others are not, but how confident are we in these predictions? The confidence (Yes) and confidence (No) attributes can help us answer that question. To start, let's just consider the person represented on Row 1. This is a single (never been married) 61-year-old man. He has been classified as overweight, but he has lower than average cholesterol (the mean shown in our metadata in Figure 9 is just over 178). He scored right in the middle on our trait anxiety test at 50 and has attended stress management class. With these personal attributes compared with those in our training data, our model offers us a 91.8% level of confidence that the "No" prediction is correct. This leaves us with 8.2% doubt in our prediction. The confidence (No) and confidence (Yes) values will always total 1, or in other words, 100%. For each person in the data set, their attributes are fed into the logistic regression model, and a prediction with confidence percentages is calculated.
Let's consider one other person as an example in Figure 10. Look at Row 11. This is a 66-year-old man who's been divorced. He's above the average values in every attribute. While he's not as old as some in our data set, he is getting older, and he's obese. His cholesterol is among the highest in our data set, he scored higher than average on the trait anxiety test, and he hasn't been to a stress management class. We're predicting with 99.3% confidence that this man will suffer a second heart attack. The warning signs are all there, and Sonia can now see them fairly easily.
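The relationship between the two confidence columns and the prediction column can be sketched in Python as follows. This assumes the prediction is simply the class with the higher confidence (a 0.5 cutoff); the first two values come from the Row 1 and Row 11 examples above, and the third is a hypothetical borderline case.

```python
from collections import Counter


def predict(confidence_yes):
    """Return (prediction, confidence_no) for one confidence(Yes) value."""
    confidence_no = 1.0 - confidence_yes  # the two confidences always total 1
    # Assumption: the predicted class is the one with the higher confidence.
    prediction = "Yes" if confidence_yes >= 0.5 else "No"
    return prediction, confidence_no


# confidence(Yes) for Row 1 (0.082) and Row 11 (0.993),
# plus one hypothetical borderline value (0.55).
confidences_yes = [0.082, 0.993, 0.55]

predictions = [predict(c)[0] for c in confidences_yes]
counts = Counter(predictions)
print(predictions, dict(counts))
```

Tallying the predictions this way mirrors the Yes/No counts that the Statistics tab reports for the whole scoring data set.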
Deployment
In the context of the person represented on Row 11, it seems pretty obvious that Sonia should try to contact this gentleman right away, offering help in every aspect. She may want to help him find a weight-loss support group, provide information about dealing with divorce and/or stress, and encourage him to work with his doctor to better regulate his cholesterol through diet and perhaps medication. There may be a number of the 690 individuals who clearly need specific help.
Click twice on the attribute name confidence (Yes). Clicking on a column heading (the attribute name) in RapidMiner Results perspective will sort the data set by that attribute. Click it once to sort in ascending order, twice to re-sort in descending order, and a third time to return the data set to its original state. Figure 11 shows our results sorted in descending order on the confidence (Yes) attribute.
Figure 11: Results sorted by confidence (Yes) in descending order (two clicks on the attribute name).
If you were to count down from the first record (Row 128) to the point at which our confidence(Yes) value is 0.950, you would find that there are 155 individuals in the data set for whom we have a 95% or better confidence level that they are at risk for heart attack recurrence (and that's not rounding up those who have a 0.948 in the "Yes" column). So there are some who are fairly easy to spot. You might notice that many are divorced, but several are also widowed. Loss of a spouse by any means is difficult, so perhaps Sonia can begin by offering more programs to support those who fit this description. Most of these individuals are obese and have cholesterol levels over 200, and none have participated in stress management classes. Sonia has several opportunities to help these individuals, and she would probably offer these people opportunities to participate in several programs or create one program that offers a holistic approach to physical and mental well-being. Because there are a good number of these individuals who share so many high-risk traits, this may be an excellent way to create support groups for them.
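The sort-and-count procedure described above can be sketched in Python. The row numbers echo those mentioned in this exercise, but every confidence value below is illustrative only and not copied from the actual results.

```python
# Hypothetical (row number, confidence(Yes)) pairs; values are illustrative only.
results = [(1, 0.082), (11, 0.993), (95, 0.100), (128, 0.999), (554, 0.918), (600, 0.948)]

# Sort by confidence(Yes), highest first (the equivalent of two clicks on the column heading).
by_confidence = sorted(results, key=lambda pair: pair[1], reverse=True)

# Keep people at or above the 0.950 cutoff; 0.948 is deliberately not rounded up.
high_risk = [row for row, conf in by_confidence if conf >= 0.950]
print(by_confidence[0], len(high_risk))
```

Using a strict numeric cutoff rather than a rounded display value is what keeps a 0.948 record out of the 95% group.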
But there are also those individuals in the data set who may need help, but aren't quite as obvious, and perhaps only need help in one or two areas. Click confidence (Yes) a third time to return the results data to its original state (sorted by Row No.). Now, scroll down until you find Row 95 (highlighted in Figure 12). Make a note of this person's attributes.
Figure 12: Examining the first of two similar individuals with different risk levels.
Next, locate Row 554 (Figure 13).
Figure 13: The second of two similar individuals with different risk levels.
The two people represented on Rows 95 and 554 have a lot in common. First of all, they're both in this data set because they've suffered heart attacks. They are both 70-year-old women whose spouses have died. Both have trait anxiety of 65 points. And yet we are predicting with 90% certainty that the first will not suffer another heart attack, while predicting with almost 92% certainty that the other will. Even their weight categories are similar, though being overweight certainly plays into the second woman's risk. But what is really evident in comparing these two women is that the second woman has a cholesterol level that nearly touches the top of our range in this data set (the upper bound shown in Figure 9 is 239), and she hasn't been to stress management classes. Perhaps Sonia can use such comparisons to help this woman understand just how dramatically she can improve her chances of avoiding another heart attack. In essence, Sonia could say: "There are women who are a lot like you who have almost zero chance of suffering another heart attack. By lowering your cholesterol, learning to manage your stress, and perhaps getting your weight down closer to a normal level, you can almost eliminate your risk for another heart attack." Sonia could follow up by offering programs for this woman targeted specifically at cholesterol, weight loss, or stress management.
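A side-by-side comparison like this one can be automated with a small helper that reports only the attributes on which two records differ. The sketch below is hypothetical: the attribute names are assumed, and the values marked as placeholders are not taken from the data set.

```python
def differing_attributes(a, b):
    """Return {attribute: (value_in_a, value_in_b)} where the two records differ."""
    return {name: (a[name], b[name]) for name in a if a[name] != b[name]}


# Shared traits (age 70, widowed, trait anxiety 65) follow the description above.
row_95 = {
    "Age": 70,
    "Marital_Status": "Widowed",
    "Trait_Anxiety": 65,
    "Cholesterol": 170,          # placeholder value
    "Stress_Management": "Yes",  # placeholder value
}
row_554 = {
    "Age": 70,
    "Marital_Status": "Widowed",
    "Trait_Anxiety": 65,
    "Cholesterol": 235,          # placeholder near the upper bound of 239
    "Stress_Management": "No",
}

differences = differing_attributes(row_95, row_554)
print(differences)
```

Listing only the differences makes it easier to see which one or two areas a targeted program should address.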