R Studio Statistical Analysis Project
SECTION I
To gather our data, we sent a survey to 40 randomly selected BHSEC students, including both high school and college students. We asked 11 quantitative questions and three qualitative ones, although the latter allowed only binary (yes/no) responses. The questions were selected by the students in the statistics class and refined by Dr. Noyes. We emailed each selected student asking them to fill out the survey, and sent reminder emails to those who had not filled it out within the first week or so. Of the 40 students surveyed, 30 responded, a 75% response rate. Among those responses, seven data points (answers to a single quantitative question, not entire response sets) did not seem to make sense. For instance, one person claimed 120 hours of extracurricular activity a week, while another said their commute to BHSEC was one minute. We suspected that these respondents had confused hours with minutes or vice versa, meaning 120 minutes of extracurricular activity rather than 120 hours, and a one-hour commute rather than a one-minute one. Dr. Noyes was able to reach every respondent who had entered data like this, and our hypothesis proved correct: each corrected their answer to the intended unit.
In the future, it might help to use a single unit for all quantitative questions of the same kind. For example, every question about time could ask for responses in minutes only, instead of asking for minutes in some questions and hours in others. Additionally, handing out a physical survey might improve the response rate.
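The plausibility check described above could be sketched in R roughly as follows. This is only an illustration: the helper name `flag_implausible`, the bounds, and every value other than 120 and 1 are our own placeholders, not the survey data or the procedure actually used.

```r
# Hypothetical helper: TRUE where an answer falls outside a plausible range.
flag_implausible <- function(x, lower, upper) {
  x < lower | x > upper
}

# Placeholder answers for illustration only. The 120 (hours) and the 1 (minute)
# are the two examples mentioned in the text; the rest are made up.
extracurricular_hours <- c(5, 120, 10)
commute_minutes <- c(45, 1, 60)

# The bounds are assumptions chosen for this sketch, not values from the project.
flag_hours <- flag_implausible(extracurricular_hours, lower = 0, upper = 60)
flag_commute <- flag_implausible(commute_minutes, lower = 5, upper = 180)

print(flag_hours)
print(flag_commute)
```

A check like this only marks answers for follow-up. It does not correct them, which matches the approach of asking each respondent what they meant.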
Some unrepresentative data could have resulted from the phrasing of certain questions. For example, the question asking respondents for their lowest grade on a major assignment at BHSEC is titled “Worst Performance”; the harsh adjective “worst,” as opposed to the more neutral “lowest,” may have led some respondents to inflate their lowest grade out of self-consciousness. Additionally, some bias might be present because students who respond to an optional email survey may, on average, be more diligent about going above and beyond requirements, or may share some other trait that is not representative of the BHSEC population as a whole.
SECTION II: SINGLE VARIABLE ANALYSIS
The histogram of nightly social media usage shows that the distribution is not symmetrical; rather, it is skewed to the left. In this case, the median and quartiles represent the data more accurately than the mean. The data peak in the middle range of social media usage (100-150 minutes), are slightly lower at the low end (0-100 minutes), and drop off drastically at the high end (150-200 minutes). Two outliers exist in the data: two responses alleging 240 minutes of social media use a night. These tied outliers can be seen below, represented as the single dot at the top of the boxplot. Additionally, it is important to note that the distribution is unimodal.
The boxplot, like the histogram, leans slightly toward the lower end of social media usage; however, the median appears almost squarely in the middle, showing that it is quite representative of the data as a whole. The one thing offsetting the plot is the pair of tied outliers at 240 minutes, drawn as a single empty dot significantly above the third quartile.
Mean: 101.8333
Median: 110
Standard Deviation: 59.15327
Quartiles (minutes):
0%   25%   50%   75%   100%
0    60    110   120   240
Outliers: two tied responses of 240 minutes
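As a check on why 240 minutes shows up as an outlier, the usual 1.5 × IQR rule can be applied to the reported quartiles. This sketch uses only the summary numbers above, not the raw responses, and it assumes the boxplot follows that rule. R's boxplot() computes its hinges slightly differently from quantile(), so the exact cutoffs on the plot may differ a little.

```r
# Quartiles reported above for nightly social media use (minutes).
q1 <- 60
q3 <- 120
iqr <- q3 - q1                   # interquartile range: 60

# Fences under the 1.5 * IQR rule.
lower_fence <- q1 - 1.5 * iqr    # -30
upper_fence <- q3 + 1.5 * iqr    # 210

# The largest reported value is 240 minutes.
reported_max <- 240
is_outlier <- reported_max > upper_fence

print(c(lower = lower_fence, upper = upper_fence))
print(is_outlier)
```

Since 240 is above the upper fence of 210, it falls outside the whisker and is drawn as a separate point. No response can fall below the lower fence, because that fence is negative.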
The histogram of nightly study time is more symmetrical than the one for social media usage, but it is still skewed slightly to the left. As a result, the median and quartiles represent the data more accurately than the mean. Additionally, the distribution is unimodal.
Unlike the histogram, the boxplot appears almost symmetrical, and thus is likely a better visualization of the data. For a symmetrical distribution, the mean would be the most accurate measure of center; however, the difference between the mean and the median is negligible (0.05 hours), so arguably both represent the data well. The boxplot shows that most students spend around two hours per night studying, with similar numbers of students studying more (above the bold black line) and less (below the bold black line).
Mean: 2.05
Median: 2
Standard Deviation: 0.8842491
Quartiles (hours):
0%    25%   50%   75%   100%
0.5   1.5   2.0   2.5   4.0
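The study-time summary can be checked from the reported numbers alone. The sketch below is an illustration, not part of the original analysis. It tests whether the minimum or maximum falls outside the 1.5 × IQR fences and expresses the gap between mean and median in standard deviations, as a rough sense of how small it is. The variable names are our own.

```r
# Summary statistics reported above for nightly study time (hours).
min_hours <- 0.5
q1 <- 1.5
median_hours <- 2.0
q3 <- 2.5
max_hours <- 4.0
mean_hours <- 2.05
sd_hours <- 0.8842491

# Fences under the 1.5 * IQR rule.
iqr <- q3 - q1                   # 1
lower_fence <- q1 - 1.5 * iqr    # 0
upper_fence <- q3 + 1.5 * iqr    # 4

# A value counts as an outlier only if it lies strictly beyond a fence.
any_outliers <- min_hours < lower_fence || max_hours > upper_fence

# Gap between mean and median, in units of the standard deviation.
gap_in_sd <- (mean_hours - median_hours) / sd_hours

print(any_outliers)
print(round(gap_in_sd, 3))
```

Under these assumptions the maximum of 4.0 hours sits exactly on the upper fence, so no study-time response is an outlier, and the mean and median differ by well under a tenth of a standard deviation.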
Data Analysis
From the histograms and boxplots, it appears that most respondents use social media for around 100 to 150 minutes every night, while most study for around one and a half to two hours. However, a sizable portion of respondents study for only zero to half an hour per night. The data for social media usage are fascinating: the majority of respondents use social media for between 0 and 150 minutes per night, with a comparatively tiny minority using it for between 150 and 240 minutes per night.
SECTION III: TWO-VARIABLE ANALYSIS
Scatter Plot
Correlation: 0.07729679
Linear Regression Model
Residuals
Two-variable Data Analysis
Unfortunately, the two-variable analysis suggests that there is essentially no linear correlation between nightly studying and nightly social media usage in our sample: the correlation coefficient is less than 0.1, which is negligible. One might assume that studying has an inverse relationship with social media usage, but our survey data do not support this. However, this result might not be representative of the BHSEC population as a whole, perhaps partially due to the biases discussed in Section I. In the linear regression model, the line of best fit has a slightly positive slope, reflecting the minuscule positive correlation calculated. Additionally, one can see the two outlying data points corresponding to 240 minutes of social media use per night; both lie very far from the line of best fit, and could arguably be removed from the set without significantly affecting the quality of the data.
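To put a number on how weak this correlation is, one option is the standard t-test for a Pearson correlation coefficient. The sketch below is not part of the original analysis. It assumes that all 30 respondents answered both questions, so n = 30, which the report does not state explicitly.

```r
# Correlation reported above between study time and social media use.
r <- 0.07729679

# ASSUMPTION: all 30 respondents answered both questions.
n <- 30

# Test statistic for H0: true correlation = 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Share of the variance in one variable explained by a linear fit on the other.
r_squared <- r^2

cat(sprintf("t = %.3f, p = %.3f, r^2 = %.4f\n", t_stat, p_value, r_squared))
```

Under that assumption the test statistic comes out near 0.41, far too small to reject the hypothesis of no linear relationship at the usual 5% level, and the regression line explains less than 1% of the variance. With the raw survey columns available, cor.test() would run the same test directly.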