R- regression


AD699: Data Mining for Business Analytics
Individual Assignment #2

You will submit two files via Blackboard:

(1) Your write-up. This should be a PDF that includes your written answers to any questions that ask for written answers, along with the other things asked for in the prompt.

(2) Your R Script. This is the script that you will use to write your assignment. If you use Markdown, you’ll submit an .RMD rather than a .R file.

As always, remember to take advantage of your available resources: We’ll have four live Q&A sessions next week, in addition to unlimited opportunities to schedule a Zoom session on any other day or time. For this assignment in particular, the video library can be quite helpful. As the course slogan says, “Get After It!”

For each step, your write-up should clearly display your code and your results. For any step in the prompt that includes a question, the question should be answered in written sentences.

This model will be used to predict the AVG_SALARY, per year, of a National Basketball Association (NBA) player’s contract. This assignment will not require any specific domain knowledge from outside of the dataset description, dataset, and prompt.

Main Topics: Simple Linear Regression & Multiple Linear Regression

Tasks:

● Simple Linear Regression:

For this assignment, we will use the dataset nba_contracts.csv, which can be found on our class Blackboard page.

Start by downloading this dataset.

1. Read the dataset into your environment in R.

2. Create a new variable called ppg. This new variable, which stands for “points per game,” will be created by dividing points by games played.

3. Let’s explore the relationship between points per game and average salary. Using ggplot, create a scatterplot that depicts average salary on the y-axis and ppg on the x-axis. Add a best-fit line to this scatterplot.

What does this plot suggest about the relationship between these variables? Does this make intuitive sense to you? Why or why not?

4. Now, find the correlation between these variables. Then, use cor.test() to see whether this correlation is significant.

What is this correlation? Is it a strong one? Is the correlation significant?

5. Using your assigned seed value, create a data partition. Assign approximately 60% of the records to your training set, and the other 40% to your validation set. Keep in mind that a seed value has no relationship to the data itself -- it’s just an arbitrary number.

6. Using your training set, create a simple linear regression model, with AVG_SALARY as your outcome variable and ppg as your input variable. Use the summary() function to display the results of your model.

7. What are the minimum and maximum residual values in this model?

a. Find the player whose salary generated the highest residual value. What was his actual salary? What did the model predict that it would be? How is the residual calculated from the two numbers that you just found?

b. Find the player whose salary generated the lowest residual value. What was his actual salary? What did the model predict that it would be? How is the residual calculated from the two numbers that you just found?

c. It might be unfair to say that the person in 7a is overpaid, or that the person in 7b is underpaid. Why might it be unfair to say this? (Note: You do *not* need to be a basketball fan, or to know about the NBA, in order to answer this. However, you should look at the dataset and the data description, and give this just a bit of thought before answering.) You can answer this question in 2-3 sentences.

8. What is the regression equation generated by your model? Make up a hypothetical input value and explain what it would predict as an outcome. To show the predicted outcome value, you can either use a function in R, or just explain what the predicted outcome would be, based on the regression equation and some simple math.

     

9. Using the accuracy() function from the forecast package, assess the accuracy of your model against both the training set and the validation set. For this answer, focus on the differences between the training and validation sets. To assess the model, focus mainly on RMSE and MAE.

10. How does your model’s RMSE compare to the standard deviation of average salary in the dataset? What can such a comparison teach us about the model?
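The regression steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame is synthetic, the column names NAME, PTS, and GP are hypothetical stand-ins for whatever nba_contracts.csv actually contains, the seed is arbitrary, and RMSE and MAE are computed by hand rather than through accuracy() from the forecast package.

```r
# Synthetic stand-in for nba_contracts.csv. NAME, PTS and GP are assumed
# column names, and every value here is made up for illustration.
set.seed(699)  # arbitrary; substitute your assigned seed value
n <- 200
nba <- data.frame(
  NAME = paste0("player_", seq_len(n)),
  GP   = sample(40:82, n, replace = TRUE)
)
nba$PTS        <- round(nba$GP * runif(n, 2, 28))
nba$AVG_SALARY <- pmax(5e5, 1e6 + 6e5 * (nba$PTS / nba$GP) + rnorm(n, sd = 2e6))

# Step 2: points per game = points / games played
nba$ppg <- nba$PTS / nba$GP

# Step 4: correlation and its significance test
ct <- cor.test(nba$ppg, nba$AVG_SALARY)
ct$estimate  # correlation coefficient
ct$p.value   # small p-value suggests a significant correlation

# Step 5: roughly 60% training, 40% validation
train_idx <- sample(seq_len(n), size = round(0.6 * n))
train <- nba[train_idx, ]
valid <- nba[-train_idx, ]

# Step 6: simple linear regression fitted on the training set only
fit <- lm(AVG_SALARY ~ ppg, data = train)
summary(fit)

# Step 7: residual = actual salary - predicted salary
train$predicted <- unname(fitted(fit))
train$residual  <- train$AVG_SALARY - train$predicted
cols <- c("NAME", "AVG_SALARY", "predicted", "residual")
train[which.max(train$residual), cols]  # highest residual
train[which.min(train$residual), cols]  # lowest residual

# Step 8: prediction for a hypothetical 15 ppg player, two ways
b          <- coef(fit)
by_hand    <- b[["(Intercept)"]] + b[["ppg"]] * 15
by_predict <- predict(fit, newdata = data.frame(ppg = 15))

# Steps 9-10: RMSE and MAE by hand, then RMSE against the salary spread
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae  <- function(actual, pred) mean(abs(actual - pred))
rmse_train <- rmse(train$AVG_SALARY, train$predicted)
rmse_valid <- rmse(valid$AVG_SALARY, predict(fit, newdata = valid))
mae_train  <- mae(train$AVG_SALARY, train$predicted)
mae_valid  <- mae(valid$AVG_SALARY, predict(fit, newdata = valid))
sd(nba$AVG_SALARY)
```

An RMSE well below the standard deviation of AVG_SALARY would suggest the model predicts better than simply guessing the mean salary for everyone; a large gap between training and validation error would hint at overfitting.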

● K-Nearest Neighbors:

The model that we’ll build will aim to predict whether a college will have a high graduation rate. To answer this question, we will use the College dataset from the ISLR package in R.

A description of this dataset can be found on our class Blackboard page, in the same folder where you found this assignment prompt.

   

1. Bring this dataset into your R environment. Once you have brought the ISLR package into your environment, you can do this with:

> data(College)

2. We are going to build a classification model with Grad.Rate as our response variable. Call the str() function on your dataset and show the results.

a. What type of variable is Grad.Rate?

b. If Grad.Rate is not currently a factor, convert it into a factor by binning it. Use the median to create two levels for this factor -- any records at or above the median should be labeled “High Rate” and any records below the median should be labeled “Low Rate.”

3. Are there any NAs in this dataset? Show the code that you used to find this out. If there are any NA values in any particular column, replace them with the median value for that column.

4. Create two new features:

a. Create a new variable called ‘selective.’ Selective should be found by taking Accept divided by Apps. (Accept/Apps)

b. Create another new variable called ‘yield.’ Yield should be found by taking Enroll divided by Accept. (Enroll/Accept)

 

5. Using your assigned seed value, partition your entire dataset into training (60%) and validation (40%) sets.

6. Make up a fake college (yes, really!)

a. Give your college a name (there’s no R code needed here, and you won’t use the name when you run k-nn...but give the school a name anyway, and just write it here).

b. Use the runif() function to give your college values for each of these numeric predictor attributes: Expend, S.F.Ratio, perc.alumni, selective, and yield. Use the min and max values from your training set as the lower and upper boundaries for runif().

7. Normalize your data using the preProcess() function from the caret package. Use Table 7.2 from the book as a guide for this.

8. Using the knn() function from the FNN package, and using a k-value of 7, generate a predicted classification for your college. For your input variables, use Expend, S.F.Ratio, perc.alumni, selective, and yield. What outcome category was it predicted to belong to? Also, who were your college’s 7 nearest neighbors? How many of them were High Rate, and how many were Low Rate? Be sure to show their outcome classes in your write-up.

9. Use your validation set to help you determine an optimal k-value. Use Table 7.3 from the textbook as a guide here.

10. Using either the base graphics package or ggplot, make a scatterplot with the various k values that you used in step 9 on your x-axis, and the accuracy metrics on the y-axis.

11. Re-run your knn() function with the optimal k-value that you found previously. What result did you obtain? Was it different from the result you saw when you first ran the k-nn function? Also, what were the outcome classes for each of your college’s k-nearest neighbors? Be sure to show their outcome classes in your write-up.
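The k-nearest neighbors steps above can be sketched in base R as follows. This is a hand-rolled stand-in, not the FNN or caret workflow the prompt asks for: it assumes center-and-scale normalization using training-set statistics, uses Euclidean distance with a majority vote, and runs on a synthetic data frame that only borrows the College column names. The seed, the helper knn_predict(), and the range of k values tried are all illustrative choices.

```r
# Synthetic stand-in for ISLR's College data; values are made up.
set.seed(699)  # arbitrary; substitute your assigned seed value
n <- 300
college <- data.frame(
  Apps        = sample(500:20000, n, replace = TRUE),
  Expend      = runif(n, 3000, 40000),
  S.F.Ratio   = runif(n, 3, 30),
  perc.alumni = runif(n, 0, 60),
  Grad.Rate   = round(runif(n, 20, 100))
)
college$Accept <- round(college$Apps * runif(n, 0.3, 0.95))
college$Enroll <- round(college$Accept * runif(n, 0.1, 0.7))

# Step 2b: bin Grad.Rate at the median into a two-level factor
med <- median(college$Grad.Rate)
college$Grad.Rate <- factor(ifelse(college$Grad.Rate >= med, "High Rate", "Low Rate"))

# Step 3: count NAs per column
colSums(is.na(college))

# Step 4: derived features
college$selective <- college$Accept / college$Apps
college$yield     <- college$Enroll / college$Accept

# Step 5: 60/40 partition
train_idx <- sample(seq_len(n), size = round(0.6 * n))
train <- college[train_idx, ]
valid <- college[-train_idx, ]

# Step 6b: fake college drawn uniformly within the training ranges
vars <- c("Expend", "S.F.Ratio", "perc.alumni", "selective", "yield")
fake <- as.data.frame(lapply(train[vars], function(x) runif(1, min(x), max(x))))

# Step 7: normalize everything with the training means and sds
mu      <- colMeans(train[vars])
sdev    <- sapply(train[vars], sd)
train_z <- scale(train[vars], center = mu, scale = sdev)
valid_z <- scale(valid[vars], center = mu, scale = sdev)
fake_z  <- scale(fake[vars],  center = mu, scale = sdev)

# Hypothetical helper: Euclidean k-NN with a majority vote
knn_predict <- function(train_x, train_y, new_x, k) {
  d     <- sqrt(colSums((t(train_x) - as.numeric(new_x))^2))
  nn    <- order(d)[seq_len(k)]
  votes <- table(train_y[nn])
  list(class = names(votes)[which.max(votes)], neighbors = nn)
}

# Step 8: classify the fake college with k = 7 and list its neighbors' classes
first <- knn_predict(train_z, train$Grad.Rate, fake_z, k = 7)
first$class
table(train$Grad.Rate[first$neighbors])

# Step 9: validation accuracy for a range of k values
ks  <- 1:15
acc <- sapply(ks, function(k) {
  pred <- apply(valid_z, 1, function(row) {
    knn_predict(train_z, train$Grad.Rate, row, k)$class
  })
  mean(pred == as.character(valid$Grad.Rate))
})
best_k <- ks[which.max(acc)]

# Step 11: re-run with the best k found on the validation set
second <- knn_predict(train_z, train$Grad.Rate, fake_z, k = best_k)
second$class
```

For step 10, a call such as plot(ks, acc, type = "b") would draw accuracy against k with base graphics. Because the synthetic predictors here are unrelated to the outcome, the accuracies should hover near chance; with the real College data the curve is expected to be more informative.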