Statistics Project
Walkthrough recorded Spring 2025:
https://arizona.box.com/s/ii2mhua71jusvwdwlnuhr63kzhuk09c8
Final Project
For this project, we are going to pretend that you are doing an experiment on a topic of your choice. For demonstration purposes, I'm going to do an analysis on the topic of how much 30 people agree with me about COVID-19 vaccinations and how two different approaches (Approch A and Approch B) affect that agreement. I want you to think of a topic that you are interested and this final project will include an analysis of the data in light of these randomly selected results.
Step 1:
Create your randomized data in Excel
You can create random numbers that are in the range of data points you would expect from your measurements in your experiments. For example, if you had a Likert scale measure then you would create numbers between 0 and 10 or 1 and 5 etc.
In a Word Document (or equivalent) Describe what your Experiment is about just like I did.
in Excel use =rand() then create 3 column and 30 rows
Insert a category row above this with labels that correspond to what your data is showing. EX My three columns are Agreement, Approach A and, Approach B.
Step 2:
Convert these columns into binomial (dichotomous) data. In my experiment, I'm interested in agreement so I will convert the percentage of the agreement into a dichotomous Agree or Disagree category.
In excel, you can use this function
=if(A2 > 0.5, "Agree","Disagree")
And right corner drag across three columns and then down to the 30 participants.
a.) Look at the pattern of Agreement and Disagreement and find the longest string of dichotomous variables you have discovered and calculate how likely that would be... Don't forget to multiply by the number of potential occurrences.
EX:
0.5^(largest string) so a string of 7 heads in a row is .5^7*90 .... ~0.78125 or ~78% should we be surprised to see this string of Heads in a row? Write a sentence or two about this result.
b.) What is the likelihood of getting at least this many Agreements in this random sample? (I created three new columns and counted the agreements by modifying our if statement to read out 1 if agreement and 0 if disagreement and then count the number of agreements.
EX = If(A2>0.5,1,0))
BINOMDIST(num_successes, num_trials, prob_success, cumulative)
EX:
=binomdist(42,90,.5,True)
For the probability of at least that many occurrences (Culmilitve percentage).
=binomdist(42,90,.5,False)
Probablity of getting exact percentage.
Is the result you got probable or not? Write a sentence or two about the likelihood that you would get this result.
If you were interviewing people for your study and you got this number of disagreements (or agreements) in a row should you be surprised?
Step 3:
Let's say that Approach A increases agreement by 10% and Approach B Decreases agreement 20%
Note you can choose these effects for your own project but having an increase and decrease should be a good start.
Change the data to reflect these calculations.
in Excel
=B2*1.10
in new column and drag down
=C2*0.8
in another new column and drag down
Step 4:
Create a new column with the average of each of the three rows.
In Excel you can use =average(highlight first row) in a new column then apply to the column by dragging the formula (bottom right corner of box when it turns into a black + and drag down)
Now let's look at histograms for the averages vs. the random numbers
a.) What should we expect to see? Which one is closer to a normal distribution? Write a paragraph explaining your findings.
a. Insert Graph Histogram in Sheets for each column
EX
Please paste your histograms in the word file to demonstrate your work.
What do we expect to see in a distribution?
Most data will fall into a normal curve because this kind of data is usually measuring something that is a result of many different independent factors.
Even though each factor by itself might have a non-normal distribution, i.e. for a single die the distribution is completely flat, the combination of multiple independent factors will more often than not create a normal distribution (bell curve).
Does it have Skew?
=skew(Highlight)
in Excel
Are there Floor or Ceiling effects?
NOTE: For statistical analysis to work correctly the distribution has to be normal so translating the numbers into a different scale i.e. Log, Exponential etc. can turn a skew into an appropriate scale that shows a normal distribution.
Step 5:
Translation of scale to create a normal distribution
It may seem strange but changing the scale doesn't affect the statistics but for the stats to work the data have to be on a normalized distribution.
if we apply a logarithmic translation to a normal graph we should get non-normal distribution as a result but if you saw a logarithmic histogram (terrorist damage, earthquake energy etc.) you would be able to do this to reverse the process so we could do statistical analysis on the results.
We can test this by creating another column that takes the natural log of our average column and then taking a look at the histogram to evaluate it's normality.
=ln(highlight first-row point then drag down)
and insert chart histogram of this column
in new column in Sheets/Excel
a.) Should we apply this transformation to our data before moving on? Why or why not? Take a couple of sentences to explain.
b.) What is the Skew of each kind of data?
=skew(highlight column)
in Sheets/Excel
Step 6:
Find the Mean, Median and Mode of your data. What do these measurements tell us about your data?
=Average(highlight data) is the function for Mean
Graph these means Please (Highlight means and insert bar chart)
EXTRA CREDIT: Add SEM error bars.
=Median(highlight data) is the function for Median
=Mode(highlight data) is the function of Mode
Note: for our data there likely isn't a mode but you can check your function on the dichotomous variables to see if it works.
a.) When should you use medians vs. means? Which is best for Skewed distributions? Write a couple of sentences explaining why.
b.) Getting modes for this type of data are inappropriate but we saw these in the histograms. If you were getting people to vote would you use plurality votes or rank-order voting write a couple of sentences as to why?
Step 7:
Do a T-test between your data and another measurement of your choice and explain this difference/non-difference.
=ttest(highlight data of first group, highlight data of second group, 2,1) is a function for a 2 tail paired t-test.
or
t.test(Data$ApprochA,Data$Agreement,paired=TRUE)
t.test(Data$ApprochB,Data$Agreement,paired=TRUE)
is there a significant difference between the People who Agree or Disagree?
t.test(Data$Agreement~Data$DoAgree)
a.) Does that result make sense? Explain in a couple of sentences.
Would it be more appropriate to do an ANOVA?
In R:
one.way <- aov(Score ~ Group, data = Data)
summary(one.way)
In Sheets using a spreadsheet:
Converting F to P-value. https://www.socscistatistics.com/pvalues/fdistribution.aspx
b.) Explain why ANOVA is used vs. T-tests. What is an F ratio?
Step 8:
Correlate the three first columns with each other
In Excel
=Correl(Highlight first group, highlight the second group)
a.) Which columns have the highest/lowest correlations? Is this result surprising (expected by random chance)?
Extra Credit: Provide a Scatterplot with a trendline showing this relationship.
Step 9:
What do you do if your data has non-linear relationships?
First, we have to create non-linear data.
In a new column us the first column (EX Agreement) to apply a sin wave pattern.
In excel:
=sin(row()/10)+A2
In a new column using the equation above to add a non-linear relationship to your data create Then drag down
In R:
Data$Sin <- sin(seq(1,30,1)/10)
Data$SinAndAgg <- Data$Sin+Data$Agreement
Check out the added pattern
plot(Data$SinAndAgg)
lines(Data$Sin)
Now let's take that out again by taking the residuals
Do a regression analysis to find out if other measurements could be used to predict your own. Explain the results.
Regression
As mentioned above correlations can be used to make predictions based on many different forms of data. If you are trying to predict something using correlated data it's called regression. For example, let's see if there is a relationship between Agreement and Agreement and Sin.
AggResid <- lm(SinAndAgg ~ Sin, data=Data)
summary(AggResid)
a.) Was there a relationship between our combined factor Sin and Aggression and the Sin component?
Data$AggResid <- resid(AggResid)
plot(Data$AggResid)
abline(h=0)
b.) Let's look at what variance is leftover.
How does this relate to the original Agreement data?
plot(Data$AggResid~Data$Agreement)
c.) Now let's see if there is a relationship between our combined factor Sin and Aggression and the Aggression component?
SinResid <- lm(SinAndAgg ~ Agreement, data=Data)
summary(SinResid)
d.) Now, what is leftover?
Data$SinResid <- resid(SinResid)
plot(Data$SinResid)
abline(h=0)
e.) How does this relate to the original Sin data?
plot(Data$SinResid~Data$Sin)
Step 10:
Do an Exploratory Factor Analysis to see how our measures relate to each other.
It's not always clear what we should include in our models of what's going on. For example, we might ask the questions are both math and reading scores required in an analysis of our country data? Statisticians can help researchers narrow down what they think is important by doing an Exploratory Factor Analysis to see if we can combine/eliminate variables because they are redundant. We can often find that variables that were distinct factors are really related and can be combined into a single Latent variable (i.e. component 1, 2 of IQ etc.).
or if you want to look at inflation data this has my data I've gathered. https://docs.google.com/spreadsheets/d/1_zX6yhNj5SbrqvepUmMk06Xiu21j9WVoCXi0bka-1rk/edit?usp=sharing
Upload data to R
pc <-read.csv(file.choose())
princomp(pc, cor=TRUE)
AgreementsApprochRelated <-princomp(pc, cor=TRUE)
summary(AgreementsApprochRelated)
plot(AgreementsApprochRelated)
this is a kind of scree plot as describes here https://en.wikipedia.org/wiki/Scree_plot#:~:text=In%20multivariate%20statistics%2C%20a%20scree,principal%20component%20analysis%20(PCA).
biplot(AgreementsApprochRelated)
are there any clusters/groups?
loadings(AgreementsApprochRelated)
gives a numerical table that shows you component loadings of each factor.
This link shows how to understand factor analysis with graphs.
a.) Paste a bi-plot/loadings of your factor analysis and Tell us if you could combine any of the factors we are given? How many unique factors are in our dataset? What would you call them? i.e. how many groups do you see now and what did those measures have in common?
3 months ago
1
- NEED HOMEWORK DONE
- Looking for an expert to handle my Acct. assignment
- Intercultural Comm Paper
- Essay about the movie "Lincoln"
- Unit 2 Training and Development
- Annotated Bibliography
- AACT FEDARL TAXATION
- Foundations of Human Development in the Social Environment Paper
- Assignment 2: External and Internal Environments
- 6066