RStudio: Simple Linear Regression & Multiple Linear Regression


For this assignment, you will use the baseballdata.csv file, which can be found on our Blackboard page. To complete this assignment, you will analyze the data for your year. Each student in the class will be assigned a unique year, so no two submissions will be the same. Submit your entire set of answers, which will include some things typed by you as well as several screenshots, as a single PDF file via the class Blackboard site.

Your Tasks:

I. The CSV file in front of you is pretty massive, but thankfully, you only have to deal with one year’s worth of data. First, though, you need to isolate your data.

A. Working with the spreadsheet, delete all the rows that pertain to years other than yours, or copy the rows that do pertain to your year, and paste them into a new sheet.

B. Once you have done this, resave your csv file with a new name.

Note: The data were first sorted by year so that each year’s rows were contiguous. All rows except those for 1989 were then deleted.

II. Read your csv into R.

A. Using the mean() function in R, find the average number of wins for all teams in your season.

a. What code did you enter into R to accomplish this? __________________
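As a rough sketch of this step, the year can also be isolated inside R rather than in the spreadsheet. The column names `Year`, `Tm`, `W`, and `L` are assumptions, not taken from baseballdata.csv, and the numbers are made-up toy values written to a temporary file so the example runs on its own.

```r
# Toy stand-in for baseballdata.csv: made-up numbers, hypothetical column names.
toy <- data.frame(
  Year = c(1988, 1988, 1989, 1989, 1989, 1989),
  Tm   = c("AAA", "BBB", "AAA", "BBB", "CCC", "DDD"),
  W    = c(90, 72, 95, 85, 77, 67),
  L    = c(72, 90, 67, 77, 85, 95)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the file and keep only the rows for the assigned year.
baseball <- read.csv(csv_path)
season   <- subset(baseball, Year == 1989)

# Average number of wins across all teams in that season.
mean_wins <- mean(season$W)
print(mean_wins)  # 81 for the toy data
```

With the real file, `csv_path` would point at the saved csv, and the column names would need to match its header row.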

This study resource was shared via CourseHero.com

B. Now, use the mean function again to determine the average number of losses for all teams in your season.

a. What code did you enter into R to accomplish this? ___________________
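A minimal sketch of the same call for losses, again on made-up toy values with hypothetical column names `W` and `L`. It also compares the league-wide totals, which is a quick way to sanity-check the two means against each other.

```r
# Toy season: made-up values, hypothetical column names.
season <- data.frame(
  W = c(95, 85, 77, 67),
  L = c(67, 77, 85, 95)
)

# Average number of losses across all teams.
mean_losses <- mean(season$L)
print(mean_losses)

# Compare league-wide totals of wins and losses.
same_total <- sum(season$W) == sum(season$L)
print(same_total)
```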

b. What did you notice about the results from the last two functions you used? Why does this make sense?

The two outputs are the same. Every win for one team is a loss for its opponent, so the total number of wins across the league must equal the total number of losses. Because both totals are divided by the same number of teams, the two averages must also be equal.

C. Make a scatterplot that shows wins per team as a function of runs. What do you notice about this relationship? Include a screenshot of your R source code, and the scatterplot that it generates. Be sure to label your x and y axes.
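One way such a scatterplot might be drawn with base R graphics is sketched below. The column names `R` (runs) and `W` (wins) are assumptions, and the values are toy numbers rather than the 1989 data.

```r
# Toy season: made-up values, hypothetical column names.
season <- data.frame(
  R = c(750, 720, 690, 650, 600, 640),
  W = c(95, 88, 81, 75, 70, 77)
)

# Wins as a function of runs, with both axes labeled.
plot(season$R, season$W,
     xlab = "Runs",
     ylab = "Wins",
     main = "Team Runs versus Team Wins",
     pch  = 19)

# Strength of the linear relationship in the toy data.
runs_wins_cor <- cor(season$R, season$W)
print(runs_wins_cor)
```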


Figure 1. Team Runs versus Team Wins

The plot shows a generally positive relationship: the more runs a team scored, the more games it tended to win. This is understandable. Because the total number of games (161 or 162) was almost the same for all teams, more total runs means a higher average number of runs per game. The higher a team’s average runs per game, the more likely it is to win any given game, and therefore the more total wins it is likely to have.

D. Now, add a line of best fit to the scatterplot. Show a screenshot of your R source code as well as the resulting output.
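A line of best fit can be added by fitting a simple linear regression with `lm()` and passing the fitted model to `abline()`. The sketch below uses toy values and assumed column names `R` and `W`, so the coefficients it prints say nothing about the 1989 season.

```r
# Toy season: made-up values, hypothetical column names.
season <- data.frame(
  R = c(750, 720, 690, 650, 600, 640),
  W = c(95, 88, 81, 75, 70, 77)
)

# Simple linear regression of wins on runs.
fit <- lm(W ~ R, data = season)

# Scatterplot first, then the fitted line drawn on top of it.
plot(W ~ R, data = season,
     xlab = "Runs", ylab = "Wins",
     main = "Wins versus Runs with line of best fit",
     pch  = 19)
abline(fit, col = "red", lwd = 2)

# Intercept and slope of the fitted line.
print(coef(fit))
```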


Figure 2. Linear regression of team wins on team runs

E. Was any league dominant this season? Create a vertical barplot that enables a viewer to compare the win totals for each league (NL West, AL East, etc.) side-by-side. Give each league’s bar a unique color. Show a screenshot of your R source code as well as the resulting output.
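A possible approach is to total the wins per league with `tapply()` and pass the result to `barplot()`, which draws vertical bars by default. The column name `Lg` and all values below are assumptions for illustration only.

```r
# Toy season: made-up values, hypothetical column names.
season <- data.frame(
  Lg = c("AL East", "AL East", "AL West", "AL West",
         "NL East", "NL East", "NL West", "NL West"),
  W  = c(89, 73, 99, 83, 93, 69, 92, 70)
)

# Total wins per league (groups come back in alphabetical order).
wins_by_league <- tapply(season$W, season$Lg, sum)
print(wins_by_league)

# One bar per league, each with its own color.
barplot(wins_by_league,
        col  = c("steelblue", "tomato", "darkgreen", "gold"),
        xlab = "League",
        ylab = "Total wins",
        main = "Wins by League")
```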

[Scatterplot axes: x = Runs (550 to 750), y = Wins (60 to 100)]


Figure 3. Wins by League in 1989

The results show that all leagues have a similar number of wins, but the AL West has around 100 more wins (21.27%) than the last-place league, the NL West.

F. Create a histogram that shows the number of wins per team on the x-axis, and the frequency on the y-axis. Now, suppose you want to see a finer level of detail. How can you increase the number of bins in your histogram? Show a screenshot with your R source code and the resulting output.
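In base R, the `breaks` argument of `hist()` controls the binning. A single number is treated only as a suggestion, while a vector of break points is used exactly. The sketch below shows both forms on made-up win totals.

```r
# Toy win totals: made-up values.
wins <- c(95, 88, 81, 75, 70, 77, 92, 83, 86, 73, 63, 99)

# Default binning.
h_default <- hist(wins,
                  xlab = "Wins per team",
                  main = "Histogram of wins (default bins)")

# A single number is a suggestion that R rounds to "pretty" break points.
h_suggest <- hist(wins, breaks = 12,
                  xlab = "Wins per team",
                  main = "Histogram of wins (breaks = 12)")

# An explicit vector of break points gives exact control: 8 bins of width 5.
h_fine <- hist(wins, breaks = seq(60, 100, by = 5),
               xlab = "Wins per team",
               main = "Histogram of wins (bins of width 5)")

print(length(h_fine$counts))
```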


Figure 4. Histogram for number of wins in 1989

The shape of the distribution is not clear from this histogram. In my opinion, the number of wins per team should follow a binomial distribution.

F. Using the GGally package, create a scatter plot matrix that shows the relationship among all of the following variables: Wins, Losses, Runs, Runs Against, Average Batter’s Age, Average Pitcher’s Age. Show a screenshot of your R source code as well as the resulting output.
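A scatter plot matrix of this kind is typically produced with `ggpairs()` from GGally. The sketch below assumes column names `W`, `L`, `R`, `RA`, `BatAge`, and `PAge`, uses made-up values, and falls back to base R's `pairs()` if GGally is not installed.

```r
# Toy season: made-up values, hypothetical column names.
season <- data.frame(
  W      = c(95, 88, 81, 75, 70, 77),
  R      = c(750, 720, 690, 650, 600, 640),
  RA     = c(600, 640, 700, 690, 720, 680),
  BatAge = c(28.5, 29.1, 27.8, 28.0, 27.2, 29.4),
  PAge   = c(29.0, 28.2, 27.5, 28.8, 26.9, 27.7)
)
season$L <- 162 - season$W  # every toy team plays 162 games

vars <- c("W", "L", "R", "RA", "BatAge", "PAge")

if (requireNamespace("GGally", quietly = TRUE)) {
  # Scatterplots below the diagonal, correlations above it.
  print(GGally::ggpairs(season[, vars]))
} else {
  # Base-R fallback: scatterplots only, no correlation panels.
  pairs(season[, vars])
}
```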


Figure 5. Correlation Matrix

Wins and Losses have a correlation of -1 because each team’s wins and losses add up to (almost) the same number of games, so Losses is essentially a constant minus Wins. For the same reason, in each column the correlation with Wins and the correlation with Losses sum to 0. Wins has a clear positive correlation with Runs and a clear negative correlation with Runs Against: the more runs a team’s opponents score, the more likely the team is to lose. Furthermore, pitcher’s age also has a positive correlation with Wins, which may be because older pitchers have more experience and pitch better.

G. Now, build a heatmap correlation matrix that shows the relationship between the same variables from part F. Show a screenshot of your R source code as well as the resulting output.
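One base-R way to sketch this is to compute the correlation matrix with `cor()` and draw it with `heatmap()`, turning off clustering and rescaling so the cells show the raw correlations. Column names and values are assumptions. The last lines also check that, when Losses is a constant minus Wins, the correlations with Wins and with Losses cancel in every column.

```r
# Toy season: made-up values, hypothetical column names.
season <- data.frame(
  W      = c(95, 88, 81, 75, 70, 77),
  R      = c(750, 720, 690, 650, 600, 640),
  RA     = c(600, 640, 700, 690, 720, 680),
  BatAge = c(28.5, 29.1, 27.8, 28.0, 27.2, 29.4),
  PAge   = c(29.0, 28.2, 27.5, 28.8, 26.9, 27.7)
)
season$L <- 162 - season$W  # every toy team plays 162 games

vars    <- c("W", "L", "R", "RA", "BatAge", "PAge")
cor_mat <- cor(season[, vars])
print(round(cor_mat, 2))

# Heatmap of the correlations: no dendrograms, no row rescaling.
heatmap(cor_mat,
        Rowv = NA, Colv = NA,
        scale = "none", symm = TRUE,
        col = colorRampPalette(c("blue", "white", "red"))(21),
        main = "Correlation heatmap")

# Correlation with Wins plus correlation with Losses, per column.
win_plus_loss <- cor_mat["W", ] + cor_mat["L", ]
print(round(win_plus_loss, 10))
```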


Figure 6. Heatmap of Correlation Matrix

H. Run the prcomp() function on three of the variables in your data set -- wins, runs, and runs against. How many Principal Components did it require for you to account for more than 80% of the variation in the data? Show a screenshot of your R source code as well as the resulting output.
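A minimal sketch of the PCA step, assuming column names `W`, `R`, and `RA` and using made-up values. The cumulative proportion of variance comes from the `importance` table returned by `summary()`, and the first component whose cumulative share exceeds 80% answers the "how many" question for whatever data is supplied.

```r
# Toy season: made-up values, hypothetical column names.
season <- data.frame(
  W  = c(95, 88, 81, 75, 70, 77),
  R  = c(750, 720, 690, 650, 600, 640),
  RA = c(600, 640, 700, 690, 720, 680)
)

# PCA on the raw (unscaled) variables.
pca <- prcomp(season[, c("W", "R", "RA")])
print(summary(pca))

# Cumulative share of variance explained by PC1, PC1+PC2, ...
cum_var <- summary(pca)$importance["Cumulative Proportion", ]

# Smallest number of components whose cumulative share exceeds 80%.
n_needed <- which(cum_var > 0.80)[1]
print(n_needed)
```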


The results show that PC1 and PC2 together explain about 99.82% of the variance. PC1 indicates a positive relationship between Runs and Runs Against, with Wins contributing slightly in the opposite direction. PC2 reflects the positive relationship between Wins and Runs that was also seen in Questions C and D.

If the variables are scaled before running the Principal Components Analysis:

The results show that the first two principal components account for 97.87% of the variance. PC1 is easier to interpret in this case: it indicates that Wins and Runs are positively related, and that Runs Against is negatively related to both. PC2 indicates that Runs and Runs Against are positively related.
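Scaling matters here because Runs and Runs Against vary over a much wider numeric range than Wins, so they dominate an unscaled PCA. The sketch below shows the scaled variant via the `scale.` argument of `prcomp()`, again with assumed column names and made-up values, so its loadings will not match the ones discussed above.

```r
# Toy season: made-up values, hypothetical column names.
season <- data.frame(
  W  = c(95, 88, 81, 75, 70, 77),
  R  = c(750, 720, 690, 650, 600, 640),
  RA = c(600, 640, 700, 690, 720, 680)
)

# Standardize each variable to unit variance before the PCA.
pca_scaled <- prcomp(season[, c("W", "R", "RA")], scale. = TRUE)
print(summary(pca_scaled))

# Loadings: how each original variable contributes to each component.
print(pca_scaled$rotation)

# After scaling, the component variances add up to the number of variables.
total_var <- sum(pca_scaled$sdev^2)
print(total_var)
```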
