Module 6 Discussion-Stats

Devon Conway 

Conway M6

Picture this: my favorite food is peanut butter, and I just started my own business making organic, vegan peanut butter. To grow the business and acquire more customers, I pay a vegan wellness influencer on Instagram to post about my product. A simple real-life application of a regression model addresses the question: what is the relationship between the money spent on this influencer and the revenue generated? The independent variable is the money paid to the influencer to promote my peanut butter. The dependent variable is the revenue attributable to the influencer's promotion. The model is: revenue = β0 + β1(ad spending) + ε, where β0 is the expected revenue with no ad spending, β1 is the expected change in revenue for each additional dollar spent, and ε is the error term.
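To make the formula concrete, here is a minimal sketch of estimating β0 and β1 by ordinary least squares. The spending and revenue figures are made up for illustration and were chosen to lie exactly on a line so the result is easy to check by hand; the function name `fit_simple_regression` is also hypothetical.

```python
# Hypothetical data: influencer spend and revenue per month, in dollars.
# The numbers are invented and lie exactly on revenue = 50 + 2 * spend.
ad_spending = [100.0, 200.0, 300.0, 400.0, 500.0]
revenue = [250.0, 450.0, 650.0, 850.0, 1050.0]


def fit_simple_regression(x, y):
    """Return (beta0, beta1) for the least-squares line y = beta0 + beta1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: how x and y vary together, divided by the spread of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta1 = sxy / sxx
    # Intercept: forces the line through the point (mean_x, mean_y).
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


beta0, beta1 = fit_simple_regression(ad_spending, revenue)
print(f"revenue = {beta0:.2f} + {beta1:.2f} * ad_spending")
```

With real sales data the points would scatter around the line rather than sit on it, and β1 would be read as the average extra revenue per extra dollar of influencer spending.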

The R^2 statistic is useful for judging how well the regression model works because it measures the proportion of the variability in the dependent variable that the model explains using the independent variable. In most cases, the higher R^2 is, the better the model fits the data. As noted on the Minitab Blog, R^2 ranges from 0% to 100%. Zero means the model explains none of the variability of the response data around its mean, and 100% means the model explains all of it.
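The two endpoints can be checked with a short sketch that computes R^2 as 1 minus the ratio of unexplained to total variation. The helper name `r_squared` and the numbers are assumptions for illustration only.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SS_residual / SS_total, as a fraction between 0 and 1 for a fitted line."""
    mean_obs = sum(observed) / len(observed)
    # Variation the predictions fail to explain.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total variation of the observations around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


observed = [2.0, 4.0, 6.0, 8.0]          # hypothetical response values
mean_only = [5.0, 5.0, 5.0, 5.0]         # always predict the mean
close_fit = [3.0, 4.0, 6.0, 7.0]         # predictions near the data

r2_mean_only = r_squared(observed, mean_only)   # explains nothing: 0%
r2_close_fit = r_squared(observed, close_fit)   # explains most of it
r2_perfect = r_squared(observed, observed)      # explains everything: 100%

for label, value in [("mean only", r2_mean_only),
                     ("close fit", r2_close_fit),
                     ("perfect", r2_perfect)]:
    print(f"{label}: {value:.0%}")
```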

References:

Minitab Blog Editor. Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit? Retrieved from https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit

Shannon Song 

Module 6 Discussion

I have used regression models in the past to examine whether there was any correlation between certain variables and the likelihood of purchasing insurance. Examples of these variables include age, income, job title, and other types of insurance already owned. For simplicity, I will discuss age, which would be our independent variable, and its connection to the likelihood of obtaining insurance, our dependent variable. I would expect that as age increases, the likelihood of obtaining insurance also increases. Typically, an |r| greater than 0.5 and less than 1 is considered strong. However, since I was testing multiple variables, I only wanted to use those with |r| between 0.75 and 1, as they were the most strongly correlated. I would expect age to fall in the range 0.5 < |r| < 1, but this depends on the specific data provided. If the correlation is above 0.75, I would say that age is a strong predictor of the likelihood of obtaining insurance. I would also expect the relationship to be positive: as age increases, so does the chance of purchasing insurance. However, it could also be that the relationship is positive only up to a certain age, after which purchasing becomes less likely the older one gets.

Another independent variable that might be closely associated is income level: we might expect a positive correlation between income and the chance of purchasing insurance. One thing we need to check for when using multiple independent variables to predict a single dependent variable is multicollinearity. This occurs when the independent variables are correlated with each other, which can make the estimated relationship between each of them and the dependent variable unreliable. One way to check for this is to plot the correlation matrix of all the independent variables as a heat map, which is an easy way to see which variables might be too highly correlated with each other. Typically, I would drop any variable whose |r| with another independent variable is above 0.5.
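The screening step described above can be sketched as follows. This prints the pairwise correlations as text rather than drawing a heat map, and the predictor values, the `pearson_r` helper, and the variable names are all hypothetical; only the 0.5 cutoff comes from the post.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical independent variables for six customers.
predictors = {
    "age": [25, 32, 41, 48, 55, 63],
    "income": [38, 52, 61, 75, 80, 94],      # in thousands; rises with age
    "other_policies": [2, 0, 1, 1, 0, 2],    # count of policies already owned
}

THRESHOLD = 0.5  # cutoff for "too correlated", as stated in the post

flagged = []
for name_a, name_b in combinations(predictors, 2):
    r = pearson_r(predictors[name_a], predictors[name_b])
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
    if abs(r) > THRESHOLD:
        flagged.append((name_a, name_b, r))

print("Pairs above the cutoff:", [(a, b) for a, b, _ in flagged])
```

In this made-up data only age and income are flagged, so one of the two would be a candidate to drop before fitting the model.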

An r^2 value measures how close the individual data points fall to the regression line. For a linear regression model specifically, this measurement is also called the 'goodness of fit'. The r^2 ranges from 0% to 100%, and a value closer to 100% is typically considered to indicate a better fit of the regression line to the data points. It can be calculated by dividing the variance explained by the model by the total variance, so it is interpreted as the percentage of the variation in the response variable that the model explains. An important thing to note is that obtaining an r^2 of 100%, while theoretically possible, is almost impossible with real data. There will always be some unaccounted variability that cannot be explained by the regression line. A low r^2 is not always indicative of a poor model choice: some data have more variability than others and so naturally produce a lower r^2. Conversely, if an r^2 is very high, it may mean the model has been overfitted. This means the model is excellent at predicting only the specific data set used to create it and would likely be very poor at predicting other data. It is hard to say when a model has gone from a good fit to overfitted; however, in the past I have been told that values of 95% and over are considered overfitted.
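A small sketch can illustrate the overfitting point by comparing r^2 on the data used to build a model with r^2 on new data. Everything here is an assumption for illustration: the numbers are hand-picked, and the "memorizer" (which simply returns the outcome of the nearest training point) is a stand-in for an overfitted model, not a method described above.

```python
def r_squared(observed, predicted):
    """r^2 = 1 - unexplained variation / total variation."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    b1 = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
          / sum((a - mean_x) ** 2 for a in x))
    return mean_y - b1 * mean_x, b1


# Hypothetical data: roughly y = 2x with noise of +1 or -1.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [3.0, 3.0, 7.0, 7.0, 11.0, 11.0]
test_x = [1.6, 2.6, 3.6, 4.6, 5.6]        # new observations
test_y = [4.2, 4.2, 8.2, 8.2, 12.2]

b0, b1 = fit_line(train_x, train_y)


def line_predict(x):
    return b0 + b1 * x


def memorizer_predict(x):
    # Overfitted stand-in: copy the outcome of the closest training point.
    nearest = min(zip(train_x, train_y), key=lambda point: abs(point[0] - x))
    return nearest[1]


line_train = r_squared(train_y, [line_predict(x) for x in train_x])
line_test = r_squared(test_y, [line_predict(x) for x in test_x])
memo_train = r_squared(train_y, [memorizer_predict(x) for x in train_x])
memo_test = r_squared(test_y, [memorizer_predict(x) for x in test_x])

print(f"straight line: train r^2 = {line_train:.2f}, test r^2 = {line_test:.2f}")
print(f"memorizer:     train r^2 = {memo_train:.2f}, test r^2 = {memo_test:.2f}")
```

On these numbers the memorizer reaches a perfect r^2 on the data it was built from but does worse than the straight line on the new observations, which is the pattern the paragraph describes. Checking r^2 on held-out data is a more direct test for overfitting than any fixed cutoff on the training r^2.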

References:  

Frost, J. (2020, July 16). How To Interpret R-squared in Regression Analysis. Retrieved August 14, 2020, from https://statisticsbyjim.com/regression/interpret-r-squared-regression/

Frost, J. (2020, July 16). Multicollinearity in Regression Analysis: Problems, Detection, and Solutions. Retrieved August 14, 2020, from https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/

Wilson, L. (2009, May 02). Statistical Correlation. Retrieved August 14, 2020, from https://explorable.com/statistical-correlation
