R- regression
AD699: Data Mining for Business Analytics
Individual Assignment #2
You will submit two files via Blackboard:
(1) Your write-up. This should be a PDF that includes your written answers to any questions that ask for written answers, along with the other things asked for in the prompt.
(2) Your R Script. This is the script that contains the code for your assignment. If you use R Markdown, you’ll submit an .Rmd file rather than a .R file.
As always, remember to take advantage of your available resources: We’ll have four live Q&A sessions next week, in addition to unlimited opportunities to schedule a Zoom session on any other day or time. For this assignment in particular, the video library can be quite helpful. As the course slogan says, “Get After It!”
For each step, your write-up should clearly display your code and your results. For any step in the prompt that includes a question, the question should be answered in written sentences.
This model will be used to predict the AVG_SALARY, per year, of a National Basketball Association (NBA) player’s contract. This assignment will not require any specific domain knowledge from outside of the dataset description, dataset, and prompt.
Main Topics: Simple Linear Regression & Multiple Linear Regression
Tasks:
● Simple Linear Regression: For this assignment, we will use the dataset nba_contracts.csv, which can be found on our class Blackboard page. Start by downloading this dataset.
1. Read the dataset into your environment in R.
2. Create a new variable called ppg. This new variable, which stands for “points per game,” will be created by dividing points by games played.
3. Let’s explore the relationship between points per game and average salary. Using ggplot, create a scatterplot that depicts average salary on the y-axis and ppg on the x-axis. Add a best-fit line to this scatterplot. What does this plot suggest about the relationship between these variables? Does this make intuitive sense to you? Why or why not?
4. Now, find the correlation between these variables. Then, use cor.test() to see whether this correlation is significant.
What is this correlation? Is it a strong one? Is the correlation significant?
5. Using your assigned seed value, create a data partition. Assign approximately 60% of the records to your training set, and the other 40% to your validation set. Keep in mind that a seed value has no relationship to the data itself -- it’s just an arbitrary number.
6. Using your training set, create a simple linear regression model, with AVG_SALARY as your outcome variable and ppg as your input variable. Use the summary() function to display the results of your model.
7. What are the minimum and maximum residual values in this model?
a. Find the player whose salary generated the highest residual value. What was his actual salary? What did the model predict that it would be? How is the residual calculated from the two numbers that you just found?
b. Find the player whose salary generated the lowest residual value. What was his actual salary? What did the model predict that it would be? How is the residual calculated from the two numbers that you just found?
c. It might be unfair to say that the person in 7a is overpaid, or that the person in 7b is underpaid. Why might it be unfair to say this? (Note: You do *not* need to be a basketball fan, or to know about the NBA, in order to answer this. However, you should look at the dataset and the data description, and give this just a bit of thought before answering.) You can answer this question in 2-3 sentences.
8. What is the regression equation generated by your model? Make up a hypothetical input value and explain what it would predict as an outcome. To show the predicted outcome value, you can either use a function in R, or just explain what the predicted outcome would be, based on the regression equation and some simple math.
9. Using the accuracy() function from the forecast package, assess the accuracy of your model against both the training set and the validation set. For this answer, focus on the differences between the training and validation sets. To assess the model, focus mainly on RMSE and MAE.
10. How does your model’s RMSE compare to the standard deviation of average salary in the dataset? What can such a comparison teach us about the model?
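The mechanics of steps 2 through 10 can be sketched in base R as follows. This is only an illustration under stated assumptions: the data is synthetic rather than nba_contracts.csv, the column names `points` and `games` are hypothetical (the prompt names only AVG_SALARY and ppg), the seed is a placeholder, and RMSE and MAE are computed by hand as a stand-in for the accuracy() function from the forecast package.

```r
# Synthetic stand-in for nba_contracts.csv. Only AVG_SALARY and ppg are named
# in the prompt; `points` and `games` are hypothetical column names.
set.seed(699)  # placeholder; substitute your assigned seed value
n <- 200
nba <- data.frame(
  points = round(runif(n, 50, 2000)),
  games  = sample(20:82, n, replace = TRUE)
)
nba$ppg <- nba$points / nba$games                            # step 2
nba$AVG_SALARY <- 1e6 + 6e5 * nba$ppg + rnorm(n, sd = 3e6)   # invented relationship

# Step 4: correlation and its significance test
r  <- cor(nba$ppg, nba$AVG_SALARY)
ct <- cor.test(nba$ppg, nba$AVG_SALARY)
print(ct)

# Step 5: roughly 60% training, 40% validation
train_idx <- sample(seq_len(n), size = round(0.6 * n))
train <- nba[train_idx, ]
valid <- nba[-train_idx, ]

# Step 6: simple linear regression fitted on the training set only
fit <- lm(AVG_SALARY ~ ppg, data = train)
print(summary(fit))

# Step 7: residual = actual value minus predicted value
train$pred  <- predict(fit, newdata = train)
train$resid <- train$AVG_SALARY - train$pred
print(train[which.max(train$resid), ])  # record with the highest residual
print(train[which.min(train$resid), ])  # record with the lowest residual

# Step 8: prediction for a made-up input of 15 points per game
b <- coef(fit)
manual <- b[["(Intercept)"]] + b[["ppg"]] * 15
auto   <- predict(fit, newdata = data.frame(ppg = 15))

# Step 9: RMSE and MAE by hand (stand-in for forecast::accuracy())
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))
valid_pred <- predict(fit, newdata = valid)
metrics <- rbind(
  training   = c(RMSE = rmse(train$AVG_SALARY, train$pred),
                 MAE  = mae(train$AVG_SALARY, train$pred)),
  validation = c(RMSE = rmse(valid$AVG_SALARY, valid_pred),
                 MAE  = mae(valid$AVG_SALARY, valid_pred))
)
print(metrics)

# Step 10: standard deviation of the outcome, for comparison with RMSE
salary_sd <- sd(nba$AVG_SALARY)
print(salary_sd)
```

Predicting the mean salary for every player would give an RMSE close to the standard deviation of salary, so that is the baseline against which the comparison in step 10 is made.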
● K-Nearest Neighbors:
The model that we’ll build will aim to predict whether a college will have a high graduation rate. To answer this question, we will use the College dataset from the ISLR package in R. A description of this dataset can be found on our class Blackboard page, in the same folder where you found this assignment prompt.
1. Bring this dataset into your R environment. Once you have brought the ISLR package into your environment, you can do this with: > data(College)
2. We are going to build a classification model with Grad.Rate as our response variable. Call the str() function on your dataset and show the results.
a. What type of variable is Grad.Rate?
b. If Grad.Rate is not currently a factor, convert it into a factor by binning it. Use the median to create two levels for this factor -- any records at or above the median should be labeled “High Rate” and any records below the median should be labeled “Low Rate.”
3. Are there any NAs in this dataset? Show the code that you used to find this out. If there are any NA values in any particular column, replace them with the median value for that column.
4. Creating two new features:
a. Create a new variable called ‘selective.’ Selective should be found by taking Accept divided by Apps. (Accept/Apps)
b. Create another new variable called ‘yield.’ Yield should be found by taking Enroll divided by Accept. (Enroll/Accept)
5. Using your assigned seed value, partition your entire dataset into training (60%) and validation (40%) sets.
6. Make up a fake college (yes, really!)
a. Give your college a name (there’s no R code needed here, and you won’t use the name when you run k-nn...but give the school a name anyway, and just write it here).
b. Use the runif() function to give your college values for each of these numeric predictor attributes: Expend, S.F.Ratio, perc.alumni, selective, and yield. Use the min and max values from your training set as the lower and upper boundaries for runif().
7. Normalize your data using the preProcess() function from the caret package. Use Table 7.2 from the book as a guide for this.
8. Using the knn() function from the FNN package, and using a k-value of 7, generate a predicted classification for your college. For your input variables, use Expend, S.F.Ratio, perc.alumni, selective, and yield. What outcome category was it predicted to belong to? Also, who were your college’s 7 nearest neighbors? How many of them were High Rate, and how many were Low Rate? Be sure to show their outcome classes in your write-up.
9. Use your validation set to help you determine an optimal k-value. Use Table 7.3 from the textbook as a guide here.
10. Using either the base graphics package or ggplot, make a scatterplot with the various k values that you used in Step 9 on your x-axis, and the accuracy metrics on the y-axis.
11. Re-run your knn() function with the optimal k-value that you found previously. What result did you obtain? Was it different from the result you saw when you first ran the k-nn function? Also, what were the outcome classes for each of your college’s k-nearest neighbors? Be sure to show their outcome classes in your write-up.
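The k-nearest neighbors steps above can be sketched in base R as follows. This is a hedged illustration, not the required solution: the data frame is a synthetic stand-in that borrows the College column names, the seed is a placeholder, the z-score scaling stands in for preProcess() from caret, and the hand-written helper `knn_predict()` (a hypothetical name) stands in for knn() from FNN. The set of candidate k values is also an assumption.

```r
set.seed(699)  # placeholder; substitute your assigned seed value
n <- 300
# Synthetic stand-in for ISLR's College data (values are invented)
college <- data.frame(
  Apps        = round(runif(n, 500, 20000)),
  Expend      = runif(n, 3000, 30000),
  S.F.Ratio   = runif(n, 5, 30),
  perc.alumni = runif(n, 0, 60),
  Grad.Rate   = round(runif(n, 20, 100))
)
college$Accept <- round(college$Apps * runif(n, 0.3, 0.95))
college$Enroll <- round(college$Accept * runif(n, 0.1, 0.8))

# Step 3: count NAs per column, then fill numeric NAs with the column median
print(colSums(is.na(college)))
for (col in names(college)) {
  if (is.numeric(college[[col]])) {
    college[[col]][is.na(college[[col]])] <- median(college[[col]], na.rm = TRUE)
  }
}

# Step 2b: bin Grad.Rate at the median into a two-level factor
med <- median(college$Grad.Rate)
college$Grad.Rate <- factor(ifelse(college$Grad.Rate >= med, "High Rate", "Low Rate"))

# Step 4: two new features
college$selective <- college$Accept / college$Apps
college$yield     <- college$Enroll / college$Accept

# Step 5: 60/40 partition
preds <- c("Expend", "S.F.Ratio", "perc.alumni", "selective", "yield")
train_idx <- sample(seq_len(n), size = round(0.6 * n))
train <- college[train_idx, ]
valid <- college[-train_idx, ]

# Step 6b: one random value per predictor, bounded by the training min and max
new_college <- as.data.frame(lapply(train[preds], function(x) runif(1, min(x), max(x))))

# Step 7: z-score scaling with training-set means and standard deviations
mu    <- sapply(train[preds], mean)
sigma <- sapply(train[preds], sd)
z <- function(df) scale(as.matrix(df[preds]), center = mu, scale = sigma)
train_z <- z(train)
valid_z <- z(valid)
new_z   <- z(new_college)

# Hand-written k-NN: Euclidean distance, then majority vote among k neighbors
knn_predict <- function(train_x, train_y, query, k) {
  d  <- sqrt(rowSums(sweep(train_x, 2, query, "-")^2))
  nn <- order(d)[seq_len(k)]
  votes <- table(train_y[nn])
  list(class = names(votes)[which.max(votes)], neighbors = nn)
}

# Step 8: classify the made-up college with k = 7 and list its neighbors
res <- knn_predict(train_z, train$Grad.Rate, new_z[1, ], k = 7)
print(res$class)
print(train[res$neighbors, c(preds, "Grad.Rate")])

# Steps 9 and 10: validation accuracy for several odd k values, then a plot
ks <- seq(1, 15, by = 2)
acc <- sapply(ks, function(k) {
  pred <- apply(valid_z, 1, function(q) knn_predict(train_z, train$Grad.Rate, q, k)$class)
  mean(pred == as.character(valid$Grad.Rate))
})
best_k <- ks[which.max(acc)]
plot(ks, acc, type = "b", xlab = "k", ylab = "Validation accuracy")

# Step 11: re-run with the best k found above
res_best <- knn_predict(train_z, train$Grad.Rate, new_z[1, ], k = best_k)
print(table(train$Grad.Rate[res_best$neighbors]))
```

Only odd k values are tried here so that a two-class vote cannot tie. The validation records and the made-up college are scaled with the training-set means and standard deviations, so every distance is measured on the same scale as the training data.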