R Studio Statistical Analysis Project

Adam Fallis

Dr. Noyes

Statistical Modeling Final Project

5/26/18

DATA GATHERING

___________________________________________________________________________________

Students in the Statistical Modeling class proposed quantitative and qualitative questions to ask the BHSEC community. From these proposals, the class settled on 14 questions, which Dr. Noyes sent out as an online survey to 40 randomly selected BHSEC students. Of the 40 students, 30 responded: 20 high school students (freshmen and sophomores) and 10 college students (juniors and seniors). When the surveys were returned, some responses looked unreasonable while the rest fell within the realm of possibility. Instead of discarding those responses or changing them according to what we thought the respondents meant to write, Dr. Noyes emailed those students and asked them to revise their answers. The students replied with new responses, after which all of the data looked reasonable. This correction step may itself have changed the dataset, because a respondent’s state of mind when revising an answer differs from what it was when the survey was first completed.

Although unintentional, the data we gathered is unbalanced: high school students are represented more heavily than college students. However, we would not want equal numbers of high school and college respondents, because high school students make up a larger share of BHSEC than college students do. The mix of respondents should match, or at least come close to, the actual ratio of high school to college students at BHSEC. To represent BHSEC better, other factors (race, gender, location, family income) should also be taken into account, because respondents with very similar circumstances will not be representative of the population as a whole. At least half of the BHSEC student population should be sent the survey at random, because the survey is voluntary and not everyone replies; even with nonresponse, there would still be a good pool of data to analyze. Larger simple random samples give more precise estimates, but only up to a point, since the dataset must remain a manageable size to work with.

Because the data was gathered through an online survey, it may not be the most accurate, since some BHSEC students do not check their email. If such students were selected, they probably never opened the survey. Many other factors can affect the data. Students who fill out the survey at midnight while half asleep may not read a question carefully and may write a response that doesn’t fit the units of the question. This raises the question: what alternative methods of data gathering or sampling could yield more accurate data from respondents? One option besides an online survey is a paper-based survey that students are handed and fill in on the spot during advisory. Another is a random interview-based survey in which Statistical Modeling students ask the questions and record the responses. Offering all three methods (online survey, paper-based survey, and interview) would let respondents choose how to respond. This caters to the sample population better and expands the survey’s reach, which can translate to more accurate data: students who are given an option they feel comfortable with will be more willing to respond. However, a larger sample generally contains more outliers, which can affect its summary values. If certain outliers deviate far from the rest of the data, they might be discarded on the basis of personal judgment. Discarding an outlier because it contradicts the rest of the data is cherry picking, a type of selection bias. This fallacy keeps only what is accepted and favored, so discarding outliers confirms preexisting modes of thought, since we usually stick by our beliefs. By neglecting data points that say otherwise, we fail to capture the true, full nature of the dataset, which is the purpose of the analysis. In a larger sample, outliers sway the summary values less because they tend to balance out, which makes the case for discarding them weaker.

DESCRIBING DATA

___________________________________________________________________________________

Histogram #1: Students were asked to rate their experience at BHSEC on a scale of 0-10, where 0 was the lowest and 10 was the highest. The histogram displays the experience ratings of 30 randomly selected BHSEC students. The distribution is fairly symmetric and uniform: the data is spread evenly over its range and the number of respondents per rating is balanced. The bins are laid out at roughly even heights. The bins with the fewest respondents (6-7 and 8-9) are only two respondents behind the bins with the most respondents (5-6 and 9-10), which further supports the description of the distribution as fairly symmetric and uniform. The mean rating is 7.423 and the standard deviation is 1.811588. The standard deviation indicates that the ratings are spread out, but it is neither very large nor very small. If the data had a very large standard deviation, the ratings would be much more widely spread, with more extreme values influencing the overall picture. If the data had a very small standard deviation, the ratings would be concentrated closely around the mean and would vary little. Neither extreme describes this histogram, although it leans toward the smaller end because the distribution is symmetric.

Histogram #2: Students were asked to rate their sense of community at BHSEC on a scale of 0-10, where 0 is the lowest and 10 is the highest. This histogram shows the community ratings of 30 randomly selected BHSEC students. The distribution is roughly symmetric and unimodal: the highest peak, in bin 6-7, stands out from the rest, and the data is otherwise spread evenly over its range, with the exception of a discrepancy between bins 5-6 and 7-8. The histogram resembles a downward-opening parabola. If it were cut in the middle of bin 6-7, the two halves would be very similar; hence, the histogram is close to symmetric. Like a downward-opening parabola, the histogram has a single maximum with the other bins falling away on either side, and it is therefore unimodal. The mean community rating is 6.85 and the standard deviation is 1.737765. The mean is easy to interpret because the data is very symmetric: it has to fall close to the middle, as there are no real outliers and everything appears balanced. The mean of 6.85 sits toward the right side of the peak bin, in part because of the discrepancy between bins 5-6 and 7-8. The standard deviation is similar to that of histogram #1. Comparing the two histograms also highlights that a uniform distribution is symmetric, whereas a symmetric distribution need not be uniform. The standard deviation shows that the values are spread across the range, but not to an extreme.

Individual Variable Analysis: BHSEC Experience

Min	1st Qu.	Median	Mean	3rd Qu.	Max	St. Dev.
4.000	6.000	7.400	7.423	9.000	10.000	1.811588

Box plot #1: This is a box-and-whisker plot of the BHSEC experience ratings on a scale of 0-10. There are no outliers; if there were, they would appear as dots beyond the ends of the whiskers. 25% of BHSEC students rated their experience between 4.000 and 6.000, 25% between 6.000 and 7.400, 25% between 7.400 and 9.000, and 25% between 9.000 and 10.000. The lower whisker is much longer than the upper whisker, so the lowest quarter of the ratings covers a greater range of values than the highest quarter. Note that no rating falls below 4.000, yet the maximum reaches the highest possible value of 10.000. The top quarter of respondents (from the third quartile to the maximum) is tightly clustered: a full ¼ of the data set lies between 9.000 and 10.000. The bottom quarter (from the minimum to the first quartile) is more varied, spanning twice the range of the top quarter. In other words, the ratings below the 25th percentile are much more mixed than the ratings above the 75th percentile.

Individual Variable Analysis: Community

Min	1st Qu.	Median	Mean	3rd Qu.	Max	St. Dev.
3.00	5.25	7.00	6.85	8.00	10.00	1.737765

Box plot #2: This is a box-and-whisker plot of the community ratings on a scale of 0-10. There are no outliers; if there were, they would appear as dots beyond the ends of the whiskers. 25% of BHSEC students rated their sense of community between 3.00 and 5.25, 25% between 5.25 and 7.00, 25% between 7.00 and 8.00, and 25% between 8.00 and 10.00. The lower whisker is about the same length as the upper whisker, so the range of values in the lowest quarter is nearly the same as in the highest quarter. Within the box, the span from the 25th percentile to the median covers a wider range of values than the span from the median to the 75th percentile. This is shown by the position of the median line, which does not sit in the middle of the box but closer to its top. In addition, no rating falls below 3.00, yet the maximum reaches the highest possible value of 10.00, which is similar to box plot #1.
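The "no outliers" reading of both box plots can be cross-checked against the summary tables. The sketch below assumes the common 1.5 × IQR fence rule for flagging box plot outliers; the report does not say which rule its software applied, and the function names are illustrative only.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def has_outliers(minimum, q1, q3, maximum, k=1.5):
    """True if the smallest or largest value falls outside the fences."""
    lower, upper = tukey_fences(q1, q3, k)
    return minimum < lower or maximum > upper


# Summary values copied from the two tables in this section.
experience = {"minimum": 4.0, "q1": 6.0, "q3": 9.0, "maximum": 10.0}
community = {"minimum": 3.0, "q1": 5.25, "q3": 8.0, "maximum": 10.0}

for name, summary in (("Experience", experience), ("Community", community)):
    fences = tukey_fences(summary["q1"], summary["q3"])
    print(name, "fences:", fences, "outliers:", has_outliers(**summary))
```

Under this assumed rule, the minimum and maximum of both variables fall inside the fences, which agrees with the absence of dots in the plots.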

*By examining the visual and numerical characteristics of the two box plots, one can see that they closely resemble each other, which will be useful to know when examining their correlation.

LINEAR REGRESSION

____________________________________________________________________________________

Scatter Plot without line of best fit: This is a scatter plot of the BHSEC Experience and Community ratings. There appears to be a positive correlation between the x-axis (explanatory variable, BHSEC Experience) and the y-axis (response variable, Community), as the points trend upward. No outliers or other values stand out as unusual, since the explanatory and response values track each other. The x-axis goes from 4 to 10 and the y-axis goes from 3 to 10. There are no repeated points; every point is distinct. The lowest point is (4,3) and the highest is (10,10), which makes sense given the two variables. The points are fairly dispersed but roughly follow the upward trajectory, and the upper left and lower right corners of the plot are empty.

Scatter Plot with the line of best fit: This is a scatter plot of the BHSEC Experience and Community ratings. The line of best fit has a slope of 0.766 and a y-intercept of 1.164, so its equation is y = 0.766x + 1.164. Why is the y-intercept 1.164 instead of 4? One might expect the y-intercept to be 4, but the plotted axes do not start at the origin: the x-axis starts at 4 and the y-axis at 3. The y-intercept is the predicted value at x = 0, which lies outside the plotted range. The slope is positive and the correlation is 0.7985624. Correlation measures the strength of the linear relationship between the x and y values, and in this instance it is fairly strong. The correlation r always lies between -1 and 1. The closer r is to -1, the stronger the negative linear relationship; the closer it is to 1, the stronger the positive linear relationship. Since our correlation of 0.7985624 is close to 1, it indicates a positive linear relationship between the variables. The correlation is also not close to 0, so this value does not point to a weak or nonlinear relationship.
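The reported slope and intercept can be recovered from the summary statistics alone. This sketch uses the standard least-squares identities (slope = r × s_y / s_x, and the line passes through the point of means); the variable names are illustrative, and the inputs are the values quoted in this report.

```python
# Values quoted in the report.
r = 0.7985624                      # correlation between Experience and Community
mean_x, sd_x = 7.423, 1.811588     # BHSEC Experience (explanatory)
mean_y, sd_y = 6.85, 1.737765      # Community (response)

# Least-squares identities for simple linear regression.
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x


def predict(x):
    """Predicted Community rating for a given Experience rating."""
    return intercept + slope * x


print(round(slope, 3), round(intercept, 3))  # approximately 0.766 and 1.164
print(predict(0))  # the intercept: the prediction at x = 0, outside the data range
```

The intercept is the value of `predict(0)`, which is why it does not match where the plotted axes begin.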

Residuals: This is a residual plot for the BHSEC Experience and Community ratings. The horizontal red line represents the regression line. Each residual is the difference between the actual y value (the observed value of the dependent variable) and the value predicted by the line of best fit. The vertical range, which goes from -2 to 2, shows how many standard deviations each point lies from the regression line. No point appears to be an outlier, as the residuals are fairly dispersed and not concentrated. Only one value lies more than 2 standard deviations from the regression line; it is the uppermost point of the residual plot. Points above the line of best fit in the scatter plot lie above the horizontal line here, and points below it lie below. The sum, and therefore the mean, of the residuals is expected to equal zero, because the negative and positive residuals counterbalance each other.
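The claim that residuals sum to zero can be illustrated with a small least-squares fit. The ratings below are hypothetical values made up for this sketch, not the survey data, and the helper names are illustrative.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Observed y minus predicted y for every point."""
    slope, intercept = least_squares(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical ratings for illustration only (NOT the BHSEC survey data).
xs = [4, 5, 6, 7, 8, 9, 10]
ys = [3, 5, 4, 7, 6, 9, 8]

res = residuals(xs, ys)
print(sum(res))  # zero up to floating-point rounding
```

Some residuals are positive and some are negative, and they cancel exactly (up to rounding) whenever the fitted line includes an intercept.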

Final Note* Residual plots help in deciding what type of model to use for the data. A line of best fit (a linear model) is appropriate when the residuals show a random pattern, and it can then be used to predict values. This is the case for our data: the residual plot displays a random pattern, which means that a linear regression model makes the most sense. This links back to why we analyzed the data with a linear model rather than a nonlinear one. The residual plot checks whether a linear model is appropriate, because the residuals of a suitable linear model look random. Fitting a model that leaves only random residuals gives a clearer basis for examining the data. Hence, regression models were employed for our variables so that predictions could be made and the strength of the relationship could be identified. In the end, a correlation between BHSEC Experience and Community was found, and the analysis went smoothly because an appropriate model was used.

----------------------------------------------------------------------------------------------------------------------------

**[CAPTION FOR FIGURES 1-6 BELOW]

There are no outliers in either the explanatory or the response variable, so no data has to be omitted; extreme outliers can sometimes produce a mean that does not represent the data. Since there are no outliers, there is no need to repeat the regression analysis with those values removed. (1) and (2) show the scatter plots of BHSEC high school and college students without the line of best fit. Both show a positive correlation, because the points in both graphs follow an upward trend. (3) and (4) are the scatter plots of BHSEC high school and college students with the line of best fit. For BHSEC high school students, the equation of the line of best fit is y = 0.7172x + 1.7063 with a correlation of 0.7350716. For BHSEC college students, the equation is y = 0.7616x + 0.8599 with a correlation of 0.8715303. Since correlation measures the strength of the relationship between x and y values, the correlations for both groups suggest that the explanatory variable and the response variable are closely related. Graphs (5) and (6) show the residuals for high school students and college students. The points (6) are spread out more than the points for (5) because there is more data for high school students, but the residuals are nicely distributed, as the points look balanced on both sides of the red line, which represents the regression line. (5) seems to have the farthest point from the line, as it has a point two standard deviations away, yet that point is not classified as an outlier based on the histograms presented.

(1) (2) (3) (4)

(5) (6)

CONFIDENCE INTERVALS

___________________________________________________________________________________

p1* = 8

The explanatory variable is the rating BHSEC students would give their experience at BHSEC on a scale of 0-10. I think the numerical value that best represents the “average” of my explanatory variable is 8. I am predicting this value because, although BHSEC is challenging, earning 60 credits and an Associate’s Degree is in the end worth the amount of work put in. The 60 credits, which can be applied toward college, are a major incentive for students to push themselves. However, other students will feel that the workload is too much and will rate their experience lower because of poor academic performance in rigorous and taxing courses. The workload can also get in the way of free time and after-school hangouts with friends, which can turn students off. Still, students would likely appreciate the uniqueness and quality of the curriculum at BHSEC, since it is a rare opportunity to take accelerated college-level courses from the third year of high school onward. In addition, all primary subject faculty have a PhD in their respective fields, which means that the instruction is at a higher level.

p2* = 7

The response variable is the rating BHSEC students would give their sense of community at BHSEC on a scale of 0-10. I think that a rating of 7 best represents the “average” value of my response variable. I am using this value because, speaking from personal experience, the community at BHSEC is inclusive and welcoming. However, not everyone is able to bond after school because students are busy with programs and homework. There is also an air of competitiveness within BHSEC, which can make students feel like they are above or below someone else. The SAC (Student Activity Center) is open to students during their free periods, and it is where people can connect and hang out over board games and conversation while relaxing on a sofa or reclining chair. There is also a wide variety of after-school clubs and teams, such as the Science Olympiad, the STEP team, and the soccer team. Through these clubs, students can find others who are interested in similar topics, which is a great way to interact and make friends. Taking all these factors into account, the average community rating is set at 7.

**Experience and community are very closely related because experience relies heavily on the way people interact with each other (community), which is why their predicted values are fairly neck and neck.

----------------------------------------------------------------------------------------------------------------------------

p̂1 = 14/30=0.467 → 14 out of 30 students rated BHSEC Experience 8 or greater.

The quantitative question of “How would you rate the BHSEC Experience on a scale of 0-10” is now converted to the TRUE/FALSE question of “Do you rate BHSEC Experience 8 or more?” There are 14 out of 30 students who have rated BHSEC Experience 8 or higher.

My sample does satisfy the conditions needed to use a Normal sampling distribution model. The two conditions that must be met are the 10% condition (the sample size must be less than 10% of the total population) and the success/failure condition (the sample must be large enough that we expect at least 10 successes and at least 10 failures): np̂ ≥ 10 and nq̂ ≥ 10, where n is the sample size, p̂ is the sample proportion of successes, and q̂ = 1 - p̂ is the sample proportion of failures. The 10% condition is met because the sample of 30 students is less than 10% of the BHSEC population of around 550 students (10% of 550 is 55). The success/failure condition is also met because (30)(0.467) = 14.01 ≥ 10 and (30)(1 - 0.467) = 15.99 ≥ 10. We expect at least 10 successes and at least 10 failures, so the sample is large enough.
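The two conditions can be written as a small check. This is a sketch only: the function name and dictionary keys are illustrative, the population of 550 is the approximate figure given above, and the check uses the raw counts (14 successes out of 30) rather than the rounded proportion.

```python
def check_normal_conditions(successes, n, population, minimum=10):
    """Check the 10% condition and the success/failure condition."""
    failures = n - successes
    return {
        # Sample must be less than 10% of the population.
        "ten_percent": n < 0.10 * population,
        # At least `minimum` successes and at least `minimum` failures.
        "success_failure": successes >= minimum and failures >= minimum,
    }


# 14 of 30 students rated BHSEC Experience 8 or greater; population is about 550.
experience_check = check_normal_conditions(successes=14, n=30, population=550)
print(experience_check)
```

Both entries come out true for the BHSEC Experience proportion, matching the hand calculation.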

The mean (μ) of the sampling model is taken to be 0.467 and the standard error (SE) is 0.091. The standard error measures how accurately a sample statistic represents the population. Standard error is closely related to standard deviation; both are measures of spread. The difference is that the standard deviation describes the spread of individual data values, whereas the standard error estimates, from sample data, how much a sample statistic (here p̂) varies from one sample to the next. The proportion in a sample generally differs from the actual proportion in the population, and the standard error describes the typical size of that difference. The smaller the standard error, the closer the sample proportion is likely to be to the actual population proportion. The formula to find standard error is:

$SE(\hat{p}_1) = \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n}} = \sqrt{\frac{(0.467)(0.533)}{30}} = 0.091$

The margin of error(ME) is 0.178. Margin of error is the amount of random sampling error that occurs in the results of a survey. The margin of error is found by z* SE, where z* is the critical value for the desired level of confidence. Since the desired level of confidence is 95%, the z* value that corresponds with a 95% confidence level is 1.96.

$ME = z^* \cdot SE(\hat{p}_1) = (z^*)\left(\sqrt{\frac{\hat{p}_1 \hat{q}_1}{n}}\right) = (1.96)(0.091) = 0.178$

To find the confidence interval, compute p̂ ± ME, which gives the range of plausible values for the true proportion at the 95% confidence level. In this case, p̂ = 0.467 and ME = 0.178. Thus, the interval is 0.467 ± 0.178 = (0.289, 0.645), or 28.9% to 64.5%. We are 95% confident that the true proportion of BHSEC students who rate their BHSEC Experience 8 or higher is between 0.289 and 0.645. This means that if we repeatedly drew random samples of this size and built an interval from each, about 95% of those intervals would contain the true population proportion. The data suggests that the sample estimate will not differ from the true population value by more than 0.178, 95% of the time. The 0.178 represents the sampling error of the survey results. However, about 5% of such intervals would miss the true value.
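The standard error, margin of error, and interval can be computed in one step. In this sketch the function name is illustrative, and because the code does not round the standard error before multiplying by z*, its results can differ from the hand calculation in the third decimal place.

```python
import math


def proportion_ci(p_hat, n, z_star=1.96):
    """Return (standard error, margin of error, (low, high)) for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    me = z_star * se
    return se, me, (p_hat - me, p_hat + me)


# BHSEC Experience: p-hat = 0.467 from 30 respondents, 95% confidence (z* = 1.96).
se, me, (low, high) = proportion_ci(0.467, 30)
print(se, me, low, high)
```

The same function applies to any other sample proportion by changing the first argument.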

----------------------------------------------------------------------------------------------------------------------------

p̂2 = 20/30 = 0.667 → 20 out of 30 students rated the sense of community at BHSEC 7 or greater.

The quantitative question of “How would you rate the sense of community at BHSEC on a scale of 0-10” is now converted to the TRUE/FALSE question of “Do you rate the sense of community at BHSEC 7 or more?” There are 20 out of 30 students who have rated their sense of community 7 or higher.

My sample satisfies only one of the two conditions needed to use a Normal sampling distribution model. The two conditions that must be met are the 10% condition, which states that the sample size must be less than 10% of the total population, and the success/failure condition, which checks whether the sample is large enough for a Normal approximation by requiring at least 10 successes and at least 10 failures: np̂ ≥ 10 and nq̂ ≥ 10 (where n is the sample size, p̂ is the sample proportion of successes, and q̂ = 1 - p̂ is the sample proportion of failures). The 10% condition is met because the sample of 30 students is less than 10% of the BHSEC population of around 550 students. However, using the rounded p̂2 = 0.667, the success/failure condition is not quite met: (30)(0.667) = 20.01 ≥ 10, but (30)(1 - 0.667) = 9.99 falls just short of 10. Even though the conditions are not fully satisfied, I will still apply a Normal model because this statistical analysis is being conducted for educational purposes.

The mean (μ) of the sampling model is taken to be 0.667 and the standard error (SE) is 0.086. The standard error measures how accurately a sample statistic represents the population. Standard error is closely related to standard deviation; both are measures of spread. The difference is that the standard deviation describes the spread of individual data values, whereas the standard error estimates, from sample data, how much a sample statistic (here p̂) varies from one sample to the next. The proportion in a sample generally differs from the actual proportion in the population, and the standard error describes the typical size of that difference. The smaller the standard error, the closer the sample proportion is likely to be to the actual population proportion. The formula to find standard error is:

$SE(\hat{p}_2) = \sqrt{\frac{\hat{p}_2 \hat{q}_2}{n}} = \sqrt{\frac{(0.667)(0.333)}{30}} = 0.086$

The margin of error (ME) is 0.169. Margin of error is the amount of random sampling error in the results of a survey. The margin of error is found by z* × SE, where z* is the critical value for the desired level of confidence. Since the desired level of confidence is 95%, the corresponding z* value is 1.96. The data suggests that the sample estimate will not differ from the true population value by more than 0.169, 95 percent of the time. The 0.169 represents the sampling error of the survey results.

$ME = z^* \cdot SE(\hat{p}_2) = (z^*)\left(\sqrt{\frac{\hat{p}_2 \hat{q}_2}{n}}\right) = (1.96)(0.086) = 0.169$

To find the confidence interval, compute p̂ ± ME, which gives the range of plausible values for the true proportion at the 95% confidence level. In this case, p̂ = 0.667 and ME = 0.169. Thus, the interval is 0.667 ± 0.169 = (0.498, 0.836), or 49.8% to 83.6%. We are 95% confident that the true proportion of BHSEC students who rate their sense of community 7 or higher is between 0.498 and 0.836. This means that if we repeatedly drew random samples of this size and built an interval from each, about 95% of those intervals would contain the true population proportion. However, about 5% of such intervals would miss the true value.

HYPOTHESIS TESTING

___________________________________________________________________________________

H0: p = 0.5

HA: p < 0.5

Let the null hypothesis be that the “population proportion is equal to my guess of what an average student would say,” or H0: p = 0.5, for the explanatory variable, the BHSEC Experience rating. I am going to use a one-sided alternative for HA because the p̂1 value is 0.467, which is less than the null value of 0.5. A one-sided alternative tests one specific direction and does not take the other direction into account. We cannot say that an alternative hypothesis is true; we can only reject the null hypothesis when the data is inconsistent with it, or fail to reject it when the data is consistent with it. Rejecting the null hypothesis is as close as one can get to proving the alternative hypothesis.

Alpha level, also called “significance level,” is the probability of rejecting the null hypothesis when the null hypothesis is true. The alpha level I am going to use is 0.05; this is the standard level that is commonly used. By increasing the alpha, the probability of incorrectly rejecting the null hypothesis increases and the confidence level decreases.

The conditions are satisfied because the sample size is less than 10% of the population and the success/failure condition holds: np0 = (30)(0.5) = 15 ≥ 10 and nq0 = (30)(0.5) = 15 ≥ 10.

The mean (μ) is 0.5. The standard deviation (σ) is 0.091 which is found by using the formula:

$SD(\hat{p}) = \sigma = \sqrt{\frac{p_0 q_0}{n}} = \sqrt{\frac{(0.5)(0.5)}{30}} = 0.091$

To find the z-score value:

$z = \frac{\hat{p}_1 - p_0}{SD(\hat{p})} = \frac{0.467 - 0.5}{0.091} = -0.36$

z = -0.36 → area to the left = 0.3594

To find the p-value, we use a Z-table to look up the area that corresponds to the z-score of -0.36, which is 0.3594. The z-score states that the observed value (0.467) is 0.36 standard deviations away from the mean value (0.500). The z-score is negative, which reveals that the observed value is below the mean. Since the z-score is close to 0, the observed value is very close to the mean value, which provides simple yet important context for their relationship.

P-value → 0.3594

A p-value that is ≥ 0.05 means that one cannot reject the null hypothesis, because the data is consistent with it. Since the p-value of 0.3594 is greater than the alpha level of 0.05, my result is not statistically significant. I fail to reject the null hypothesis: the difference between the observed proportion and 0.5 could plausibly be due to chance. The higher the p-value, the more consistent the data is with the null hypothesis. This agrees with the confidence interval, which contains the null value of 0.5; because 0.5 lies inside the interval, the null hypothesis cannot be rejected.
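The z-score and p-value can also be computed without a Z-table. This sketch builds the standard Normal CDF from `math.erf`; the function names and the `alternative` argument are illustrative choices. Because nothing is rounded along the way, the p-value differs slightly from the table lookup of 0.3594.

```python
import math


def normal_cdf(z):
    """Area under the standard Normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def one_proportion_z_test(p_hat, p0, n, alternative):
    """Return (z, p_value) for a one-sided one-proportion z-test."""
    sd = math.sqrt(p0 * (1 - p0) / n)   # spread under the null hypothesis
    z = (p_hat - p0) / sd
    if alternative == "less":           # HA: p < p0, left-tail area
        p_value = normal_cdf(z)
    elif alternative == "greater":      # HA: p > p0, right-tail area
        p_value = 1 - normal_cdf(z)
    else:
        raise ValueError("alternative must be 'less' or 'greater'")
    return z, p_value


# BHSEC Experience: H0: p = 0.5 against HA: p < 0.5.
z, p_value = one_proportion_z_test(0.467, 0.5, 30, "less")
print(z, p_value, p_value > 0.05)
```

The p-value stays well above the alpha level of 0.05, so the decision to fail to reject the null hypothesis does not depend on the rounding.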

Type I error is when the null hypothesis (H0) is true, but we mistakenly reject it. In this context, if we concluded that the population proportion was not equal to our BHSEC experience prediction (rating of 8) when it actually was equal, we would be making a type I error. If the prediction about the BHSEC experience rating is correct but we reject it, then we may conclude that the student experience in school was not satisfactory when indeed it was.

Type II error is when the null hypothesis (H0) is false, but we fail to reject it (or, in other words, accept it). In this context, if we concluded that the population proportion was equal to our BHSEC experience prediction (rating of 8) when it actually was not equal, we would be making a type II error. If we believed the prediction about the BHSEC experience rating (which is ultimately wrong) and failed to reject it, we would be satisfied with the rating even though the actual rating could be lower.
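The link between the alpha level and the Type I error can be made concrete. The sketch below computes the exact probability that this left-tailed test rejects H0 when H0: p = 0.5 is actually true, for n = 30. It assumes the rejection rule "reject when z < -1.645" (the one-sided critical value for alpha = 0.05), which is equivalent to the p-value comparison used in this report; the function names are illustrative.

```python
import math


def rejection_cutoff(n, p0, z_crit=-1.645):
    """Largest sample proportion boundary below which H0 is rejected."""
    sd = math.sqrt(p0 * (1 - p0) / n)
    return p0 + z_crit * sd


def type_i_error_rate(n, p0, z_crit=-1.645):
    """Exact binomial probability of rejecting H0 when p really equals p0."""
    cutoff = rejection_cutoff(n, p0, z_crit)
    total = 0.0
    for k in range(n + 1):
        if k / n < cutoff:  # this count of successes would lead to rejection
            total += math.comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
    return total


rate = type_i_error_rate(n=30, p0=0.5)
print(rate)  # close to, but not exactly, the nominal 0.05
```

The exact rate is slightly below 0.05 because the number of successes can only take whole-number values, so the Normal model's cutoff cannot be hit exactly.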

----------------------------------------------------------------------------------------------------------------------------

H0: p = 0.5

HA: p > 0.5

Let the null hypothesis be that the “population proportion is equal to my guess of what an average student would say,” or H0: p = 0.5, for the response variable, the sense of community rating. I am going to use a one-sided alternative for HA because the p̂2 value is 0.667, which is greater than the null value of 0.5. A one-sided alternative tests one specific direction and does not take the other direction into account. We cannot say that an alternative hypothesis is true. However, rejecting the null hypothesis is as close as we can get to proving the alternative hypothesis.

Alpha level, also called “significance level,” is the probability of rejecting the null hypothesis when the null hypothesis is true. The alpha level I am going to use is 0.05; this is the default level commonly used when testing a null hypothesis. By increasing the alpha, the probability of incorrectly rejecting the null hypothesis increases and the confidence level decreases.

The conditions are satisfied because the sample size is less than 10% of the population and the success/failure condition holds: np0 = (30)(0.5) = 15 ≥ 10 and nq0 = (30)(0.5) = 15 ≥ 10.

The mean (μ) is 0.5. The standard deviation (σ) is 0.091 which is found by using the formula:

$SD(\hat{p}) = \sigma = \sqrt{\frac{p_0 q_0}{n}} = \sqrt{\frac{(0.5)(0.5)}{30}} = 0.091$

To find the z-score value:

$z = \frac{\hat{p}_2 - p_0}{SD(\hat{p})} = \frac{0.667 - 0.5}{0.091} = 1.84$

z = 1.84 → area to the left = 0.9671

To find the p-value, we use a Z-table to look up the area that corresponds to the z-score of 1.84, which is 0.9671. The z-score states that the observed value (0.667) is 1.84 standard deviations away from the mean value (0.500). The z-score is positive, which reveals that the observed value is above the mean.

P-value → 1-0.9671 =0.0329

A small p-value (≤ 0.05) means that there is strong evidence against the null hypothesis, which allows us to reject it. Since the p-value of 0.0329 is less than the alpha level of 0.05, the null hypothesis is rejected. The p-value measures the significance of the results, and in this case the result is statistically significant. A lower p-value suggests that the sample provides enough evidence to reject the null hypothesis about the population.

A Type I error occurs when the null hypothesis (H0) is true but we mistakenly reject it. In this context, if we concluded that the population proportion was not equal to our community prediction (rating of 7) but it actually was equal, that would be a Type I error. If the sense of community at BHSEC really is 7 but we reject that, we may conclude that there were not enough community-building activities when there were enough.

A Type II error occurs when the null hypothesis (H0) is false but we fail to reject it. In this context, if we concluded that the population proportion was equal to our community prediction (rating of 7) but it actually was not equal, that would be a Type II error. If we thought that the sense of community at BHSEC was equal to 7 (which is ultimately wrong) and failed to reject that, we might be satisfied with the rating even though the true rating could be lower.
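To make the two error types concrete, here is a rough sketch that computes how often the one-sided z-test with n = 30 and alpha = 0.05 would commit each error, using exact binomial probabilities. The true proportion of 0.7 used for the Type II calculation is a made-up value for illustration only, not something estimated from the survey, and the variable names are my own.

```r
# Error rates of the one-sided z-test for H0: p = 0.5 with n = 30
n     <- 30
p0    <- 0.5
alpha <- 0.05

sd_p0  <- sqrt(p0 * (1 - p0) / n)
z_crit <- qnorm(1 - alpha)   # cutoff for rejecting H0

# Decide, for every possible number of "yes" answers, whether H0 is rejected
counts <- 0:n
reject <- (counts / n - p0) / sd_p0 > z_crit

# Type I error rate: chance of rejecting H0 when p really is 0.5
type1 <- sum(dbinom(counts[reject], n, p0))

# Type II error rate: chance of NOT rejecting H0 when p is really p_true
p_true <- 0.7   # hypothetical value, for illustration only
type2  <- sum(dbinom(counts[!reject], n, p_true))

print(min(counts[reject]))   # smallest count that leads to rejection
print(round(type1, 4))
print(round(type2, 4))
```

The Type I rate stays close to the chosen alpha, while the Type II rate depends entirely on which true proportion is assumed.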

REFLECTION

___________________________________________________________________________________

The underlying purpose of this statistical analysis was to test our explanatory and response variables with different methods, such as confidence intervals and hypothesis tests, and to examine how they relate to our initial predictions. The 30 responses from the survey were analyzed based on two variables: the rating of community and the rating of BHSEC experience, each on a scale of 0-10. These variables were analyzed separately so that they could be compared without interfering with each other. We then made a prediction of the average rating for each variable and marked each response as true or false against that prediction. Counting the trues and creating proportions, we calculated the standard error, standard deviation, margin of error, and ultimately confidence intervals to better understand how the values would look numerically on a bell curve.

We checked whether the data fit both conditions for the normal sampling model but continued the analysis even where they were not met. By checking these conditions and inserting values, one can derive the likelihood of a certain value from the survey in accordance with the chosen confidence level. After these measurements were collected for both the explanatory and response variables, a null and an alternative hypothesis were created for each so that the null could be rejected or not rejected. The null hypothesis was that the population proportion was equal to the prediction, or H0: p = 0.5. For the sense of community, the standardized difference was converted into a z-score, and the p-value was found by subtracting the table value from 1 because we wanted the area in the right tail. For the BHSEC experience, the table value was not subtracted from 1 because the area we wanted was already in the left tail. For the second (community) variable, the test showed that the null hypothesis was to be rejected, which supports the alternative hypothesis. The alternative hypothesis stated what our predictions suggested: the population proportion is not 0.5, as it would be purely by chance, but reflects a non-random influence. For the first (BHSEC experience) variable, we failed to reject the null hypothesis because its p-value was not below the alpha level. To determine whether a null hypothesis is to be rejected or not, the p-value is compared with the level of significance.

The population of BHSEC students was not ideally sampled and represented, as the conditions did not fully support what was done during the first phase of our study. Although the sample made up less than 10% of the BHSEC population and so satisfied the 10% condition, the success/failure condition was not met for community. The success/failure condition is essential for using the normal distribution and is a useful check when dealing with sample data. We found that the sample was not large enough to justify the normal approximation because it did not satisfy this condition. From the get-go, the conditions were telling us that the number of survey responses was not large enough for normal distribution calculations, because there need to be at least 10 successes and 10 failures. In their regression and residual plots, community and BHSEC experience showed a positive correlation (an upward trend), and the residuals were well distributed on both sides of the regression line. The two variables seemed to go hand in hand, as the figures show a relationship between them. Respondents had a strong tendency to give similar ratings for community and experience.

There were also more high school students surveyed than college students. The 95% confidence interval for the BHSEC experience variable was (0.289, 0.645), or 28.9% to 64.5%. For community, the 95% confidence interval was (0.498, 0.836), or 49.8% to 83.6%. The community interval sits higher, but the two intervals have nearly the same width (0.338 for community versus 0.356 for BHSEC experience), which means that the two variables have very similar margins of error and similar sampling errors. For community, there is strong evidence against the null hypothesis, suggesting that the BHSEC population proportion is not equal to 0.5 and is not driven by randomness. For BHSEC experience, we failed to reject the null hypothesis. This does not prove that the prediction and the population proportion are equal; it only means that we did not find enough evidence to reject the null hypothesis.
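The two confidence intervals can be approximately reproduced with a small R helper. This is a sketch under assumptions: the function name ci_prop is my own, and the counts of 14 and 20 "true" responses out of 30 are inferred from the midpoints of the reported intervals rather than read from the raw survey. Results may differ from the values above in the third decimal place because of rounding in the hand calculations.

```r
# 95% confidence interval for a proportion using the normal model
ci_prop <- function(successes, n, conf = 0.95) {
  p_hat  <- successes / n
  se     <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
  z_star <- qnorm(1 - (1 - conf) / 2)       # critical value, about 1.96
  me     <- z_star * se                     # margin of error
  c(lower = p_hat - me, upper = p_hat + me)
}

ci_experience <- ci_prop(14, 30)   # assumed count for BHSEC experience
ci_community  <- ci_prop(20, 30)   # assumed count for community

print(round(ci_experience, 3))
print(round(ci_community, 3))

# Width of each interval (twice the margin of error)
print(round(unname(diff(ci_experience)), 3))
print(round(unname(diff(ci_community)), 3))
```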

For educational purposes, the conditions were overlooked when not met. I was still surprised that the conditions (for using a normal model for the response variable) were not met and that our sampling was flawed from the start. I wonder if this affected whether the null hypothesis was rejected, as these conditions indicate whether our sample size fits the criteria for normal modeling. Doesn’t this make our analysis of the data unreliable? Since the prediction was set at 7 or 8 or higher for both variables, there was very little chance of an equal distribution of yeses and nos, which could throw off the conditions set in place. Another result that surprised me was that the null hypothesis was rejected for one variable but not for the other, because the linear regression and residuals showed that they have very similar trends and correlations. With such similar data, I expected the conclusion of the hypothesis test to be the same for each prediction. What I have gotten out of the project is that hypothesis testing relies on many factors and systems of data. The histograms, residuals, regression graphs, and confidence intervals are essential to understanding the predicted values in relation to the population proportion. From this information, one can ultimately state null and alternative hypotheses and compute the p-value that decides between them. In addition, knowing how to state a null and alternative hypothesis, or when something was a Type I or Type II error, was difficult because it depends on the circumstances. Hypothesis testing requires a sharp understanding of the situation at hand to pick which types of data gathering and calculations should be used to get the most accurate results.

R-STUDIO HISTORY

___________________________________________________________________________________

OLD HISTORY

View(midterm)

attach(midterm)

summary(BHSEC.Experience)

hist(BHSEC.Experience)

hist(BHSEC.Experience, main="BHSEC Experience", xlab="BHSEC Experience Rate(0-10)", ylab="Number of Respondents", col="green")

hist(Community)

hist(BHSEC.Experience, main="BHSEC Experience", xlab="BHSEC Experience Rate on a Scale of 0-10", ylab="Number of Respondents", col="royalblue1")

hist(Community, main="Community", xlab="Rate of Community on a Scale of 0-10", ylab="Number of Respondents", col="royalblue1")

summary(Community)

sd(BHSEC.Experience)

sd(Community)

boxplot(BHSEC.Experience)

boxplot(Community, main="Community", ylab="Community Rate on a Scale of 0-10", col="yellow")

plot(Community~BHSEC.Experience)

plot(Community~BHSEC.Experience, main="BHSEC Community vs Experience", xlab="BHSEC Experience", col="blue")

cor(BHSEC.Experience, Community)

mathfit=lm(Community~BHSEC.Experience)

mathfit

abline(mathfit, col="red")

mathfit.res=resid(mathfit)

plot(BHSEC.Experience, mathfit.res)

plot(BHSEC.Experience, mathfit.res, col="blue")

plot(BHSEC.Experience, mathfit.res, ylab="Community")

plot(BHSEC.Experience, mathfit.res, ylab="Community", xlab="BHSEC Experience", main="Residual of Community Against BHSEC Experience")

abline(0,0, col="red")

plot(Community~BHSEC.Experience)

plot(Community~BHSEC.Experience, col="blue")

plot(Community~BHSEC.Experience, col="blue", xlab="BHSEC Experience", main="BHSEC Community vs Experience")

___________________________________________________________________________________

NEW HISTORY

View(survey)

attach(survey)

survey.hs=subset(survey,Grade=="High School")

survey.c=subset(survey, Grade=="College")

plot(survey.hs$BHSEC.Experience, survey.hs$Community, main="BHSEC High School Community vs Experience", ylab="Community", xlab="BHSEC Experience", col="blue")

cor(survey.hs$BHSEC.Experience, survey.hs$Community)

line.hs=lm(survey.hs$Community~survey.hs$BHSEC.Experience)

abline(line.hs, col="red")

line.hs.res=resid(line.hs)

plot(survey.hs$BHSEC.Experience, line.hs.res, ylab="Community", xlab="BHSEC Experience", main="Residual of High School Community Against BHSEC Experience")

abline(0,0, col="red")

plot(survey.c$BHSEC.Experience, survey.c$Community, main="BHSEC College Community vs Experience", ylab="Community", xlab="BHSEC Experience", col="blue")

cor(survey.c$BHSEC.Experience, survey.c$Community)

line.c=lm(survey.c$Community~survey.c$BHSEC.Experience)

abline(line.c, col="red")

line.c.res=resid(line.c)

plot(survey.c$BHSEC.Experience, line.c.res, ylab="Community", xlab="BHSEC Experience", main="Residual of College Community Against BHSEC Experience")

abline(0,0, col="red")
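The per-group steps in the history above (subset, correlation, regression line, residuals) could also be wrapped in one small function so that the high school and college analyses cannot drift apart through copy-and-paste mistakes. The sketch below is only an illustration: the function name fit_group is my own, and the data frame holds made-up placeholder values, not the actual survey responses. Only the column names Grade, BHSEC.Experience, and Community are taken from the history.

```r
# Placeholder data with the same column names as the survey.
# These values are MADE UP for illustration and are NOT the survey responses.
survey <- data.frame(
  Grade            = c("High School", "High School", "High School", "High School",
                       "College", "College", "College", "College"),
  BHSEC.Experience = c(6, 7, 8, 9, 5, 7, 8, 10),
  Community        = c(5, 7, 7, 9, 4, 6, 8, 9)
)

# Correlation, regression line, and residuals for one grade group
fit_group <- function(data, grade) {
  group <- subset(data, Grade == grade)
  fit   <- lm(Community ~ BHSEC.Experience, data = group)
  list(
    correlation = cor(group$BHSEC.Experience, group$Community),
    fit         = fit,
    residuals   = resid(fit)
  )
}

hs      <- fit_group(survey, "High School")
college <- fit_group(survey, "College")

print(coef(hs$fit))        # intercept and slope for high school
print(coef(college$fit))   # intercept and slope for college
```

Passing data = group to lm() avoids the need for attach(), and each fitted line can still be drawn with abline(hs$fit, col="red") after the matching scatterplot.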