Math stats
Correlation and Linear Regression Statistical Study
Introduction
As a devoted basketball fan, I’ve watched NBA games on TV, online,
and even once in an arena. Since I started watching basketball, I’ve
never had a favorite team, but I do have several favorite players on
different teams. Here are a few of them: Kyrie Irving of the Brooklyn
Nets, Giannis Antetokounmpo of the Milwaukee Bucks, James Harden of the
Houston Rockets, and my personal favorite dynamic duo, the Splash
Brothers of the Golden State Warriors, Stephen Curry and Klay
Thompson. Unlike the short NFL season, which is only sixteen games,
and the long MLB season, which has 162 games, the NBA season feels to
me like it has just the right number of games: each team plays 82
regular-season games. As a longtime basketball fan, I have noticed that
players drafted into the league as rookies get some playing time, and
that they get more as they gain experience and develop. But does that
mean their average points per game also increase? In this paper I will
use common basketball statistics to explore the connection between
the average minutes per game and the average points per game of fifty
individual players.
In this correlation and linear regression study, we are trying to
determine the relationship between two variables: the independent
variable, x, and the dependent (or response) variable, y. We want to
determine how values of the independent variable correlate with values
of the response variable.
The Variables:
X: Average minutes played per game
Y: Average points scored by a player per game
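For reference, the quantities used in the analysis can be written in terms of these two variables. The formulas below are the standard textbook definitions of the sample correlation coefficient and the least-squares regression line; that they match what the spreadsheet computed is an assumption, and n = 50 is the number of players.

```latex
\[
r = \frac{1}{n-1}\sum_{i=1}^{n}
    \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right),
\qquad
\hat{y} = a + b\,x,
\quad
b = r\,\frac{s_y}{s_x},
\quad
a = \bar{y} - b\,\bar{x}
\]
\[
\text{residual}_i = y_i - \hat{y}_i
\]
```

One consequence of the last identity for a is that the regression line always passes through the point of means, (x-mean, y-mean).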
Data Collection:
To collect the data for this study, I used the official website of the
NBA:
https://stats.nba.com/leaders/?Season=2018-19&SeasonType=Regular%20Season
My hypothesis is that the response variable will have a positively
skewed distribution and that there will be a strong positive
correlation between the two variables. I hypothesize this because a
player who receives more playing time has more chances to score points.
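Both predictions can be stated as checks on two columns of numbers. The following is a minimal sketch, assuming the two spreadsheet columns have been copied into Python lists. The five values in each list are made up for illustration and are not the study's data, and the function names are my own. The "inclusive" quartile method is assumed to correspond to the interpolated quartiles a spreadsheet QUARTILE function reports.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using interpolated quartiles."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


def sample_skewness(values):
    """Moment-based skewness: positive means a longer right tail."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)


# Hypothetical toy data (NOT the fifty players from the study).
minutes = [28.0, 30.0, 32.0, 34.0, 36.0]
points = [17.0, 18.0, 19.0, 21.0, 30.0]

print("five-number summary of Y:", five_number_summary(points))
print("skewness of Y:", round(sample_skewness(points), 3))
print("correlation r:", round(pearson_r(minutes, points), 3))
```

A positive skewness value would support the first prediction, and a value of r close to 1 would support the second.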
Analysis
During data collection I gathered everything from NBA.com, since all
the statistics I needed were displayed on the website. I collected the
average minutes played per game and the average points scored per game
for fifty players, then organized the data into two columns in Excel.
As mentioned previously, the X variable is the average minutes played
per game and the Y variable is the average points scored by a player
per game.

For the first part of the analysis I found the five-number summary of
the dependent variable (Y). The minimum was 16.6, the first quartile
was 18.175, the median was 21.05, the third quartile was 24.425, and
the maximum was 36.1. The first quartile is the middle value between
the smallest value and the median of the data set, and the third
quartile is the middle value of the part of the data that is greater
than the median.

Next I calculated the x-mean, which is the average of all fifty x
values, and the y-mean. The x-mean is 32.95 and the y-mean is 21.57, so
across all fifty players the average minutes played per game is 32.95
and the average points scored per game is 21.57. I then calculated s_x
and s_y, the standard deviations of the x and y values: s_x is 2.29 and
s_y is 4.03. The correlation coefficient, r, is 0.57.

To construct the scatterplot I used the values of X and Y and found the
regression line, which has a slope of 1.01 and a y-intercept of -11.7.
The intercept is negative because a least-squares line passes through
the point of means: 21.57 - 1.01 x 32.95 is approximately -11.7. Using
the y-intercept, the slope, and the X values, I calculated hat(y) for
each player. Each residual is then a simple subtraction: the Y value
minus the hat(y) value.

Now, to discuss the data further, consider the
skewness of the histograms. In statistics, skewness is a measure of the
asymmetry of the probability distribution of a real-valued random
variable about its mean. The skewness value can be positive, zero,
negative, or undefined. In this study the histograms display a right
skew (a longer tail on the high side), which means the distribution is
positively skewed. For Y this is consistent with the summary values:
the mean (21.57) is above the median (21.05), and the maximum (36.1)
lies much farther from the third quartile than the minimum does from
the first quartile.

As for the scatterplot, I would identify it as a weak-to-moderate
positive correlation (r = 0.57). I predicted a positively skewed
histogram and a strong positive correlation. The data agree with the
first prediction but only partly with the second: the correlation is
positive, but it is weaker than I expected.
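As a cross-check, the regression line can be recomputed from the summary statistics reported in this section, without the individual player data. This is a sketch that assumes the line is an ordinary least-squares fit, so that slope = r * s_y / s_x and intercept = y-mean - slope * x-mean. The variable names and the predict helper are my own.

```python
# Summary statistics as reported in the Analysis section.
x_mean, y_mean = 32.95, 21.57          # average minutes, average points
s_x, s_y = 2.29, 4.03                  # standard deviations
r = 0.57                               # correlation coefficient
y_min, q1, median, q3, y_max = 16.6, 18.175, 21.05, 24.425, 36.1

# Least-squares identities (assumed fitting method).
slope = r * s_y / s_x
intercept = y_mean - slope * x_mean
r_squared = r ** 2                     # share of variation in Y explained by X


def predict(minutes):
    """Predicted points per game, hat(y), for a given minutes per game."""
    return intercept + slope * minutes


# Rough skew indicators for Y: a longer upper tail suggests positive skew.
upper_tail = y_max - q3
lower_tail = q1 - y_min

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"r squared = {r_squared:.2f}")
print(f"upper tail = {upper_tail:.3f}, lower tail = {lower_tail:.3f}")
print(f"mean above median: {y_mean > median}")
```

With the rounded inputs above, the slope comes out close to 1.00 and the intercept close to -11.5, and r squared is about 0.32. Small differences from a line fitted on the full data are expected, because the inputs here are already rounded.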
Conclusion
To conclude, I found a weak-to-moderate positive correlation (r = 0.57)
between the average minutes and the average points per game. A positive
correlation of this size indicates that while both variables tend to
rise together, the relationship is not very strong. In real life this
means that the average minutes a player plays per game does correlate
with the average points he scores, but minutes alone account for only
about a third of the variation in scoring (r squared is about 0.32).
For a rookie, this suggests that he shouldn’t worry too much about his
minutes; he just needs to focus on performing to the best of his
ability.