R language assignment in R Studio

profilechenxintong
regression_analysis_542.R

# Description: a script for R tutorial, regression analysis in R # Author: Oleg Firsin # Last modified: 4/30/2020 # What we do in this tutorial: first, we load data on covid-19 cases and deaths by country/territory and look at the role of latitude using simple and multiple linear regression analysis. Second, we combine U.S. covid-19 data with county characteristics and explore the effect of county characteristics--focusing on adult smoking rate--on death rate per capita using multiple linear regression analysis. rm(list = ls()) # first, clear any objects already present # Load packages to be used library(tidyverse) library(esquisse) library(scales) library(plotly) ### PART 1 ### # load data worldcases <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv") worlddeaths <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv") # look at the data head(worldcases) head(worlddeaths) # Let's run the same regressions we ran in Exam 2 # First, let's merge cases and deaths and keep only variables of interest world_covid = merge(worldcases,worlddeaths,by = c("Country.Region","Lat","Long","Province.State")) world_covid = world_covid %>% mutate(abslat = abs(Lat), abslong = abs(Long),cases=X4.1.20.x , deaths=X4.1.20.y) %>% select(Country.Region,Lat,Long,Province.State,abslat,abslong,cases,deaths) # regress cases on latitude fit_cases_lat = lm(cases ~ Lat , data=world_covid) summary(fit_cases_lat) # these coefficients are slightly different from exam because the source data has more territories now -- so, different data -> different result (but not very different) # What can we extract and save from this output? names(summary(fit_cases_lat)) # these are the various items we can extract; let's focus on coefficients summary(fit_cases_lat)$coefficients # t-value = (estimate - 0)/std.error; p-value from t-distribution, given t-value and degrees of freedom # What is the interpretation of the coefficient on latitude? When latitude increase by 1, predicted number of cases increases by the coefficient on latitude: summary(fit_cases_lat)$coefficients[2,1] # If latitude increases by 10, predicted number of cases increases by summary(fit_cases_lat)$coefficients[2,1]*10 # What is the interpretation of the intercept? The results tell us that the best fit line is: print(paste0("cases = ", summary(fit_cases_lat)$coefficients[1,1]," + ",summary(fit_cases_lat)$coefficients[2,1]," x Latitude")) # So, the intercept is the predicted number of cases when latitude = 0 # Let's graph the prediction line and the actual values # esquisser(world_covid) # let's plot with straight line ggplot(world_covid) + aes(x = Lat, y = cases) + geom_point(size = 1L, colour = "#0c4c8a") + geom_smooth(span = 0.75, method = lm) + theme_minimal() # Basically, the vast majority of countries/territories have very few cases, but those that have a lot of cases tend to be in high-latitude areas, making it so that the best fit line has an upward slope # How well does the line/model fit the data? How much of the variation in the number of cases is explained by latitude? That's the r-squared, which is equal to summary(fit_cases_lat)$r.squared # or percent(summary(fit_cases_lat)$r.squared) # How well is this linear model predicting the number of cases in the u.s.? The actual number of cases is world_covid[world_covid$Country.Region=="US","cases"] # while the predicted number is intercept + (coefficient on latitude) x (latitude for the U.S.) summary(fit_cases_lat)$coefficients[1,1] + summary(fit_cases_lat)$coefficients[2,1]* world_covid[world_covid$Country.Region=="US","Lat"] # So the difference, also called the residual = actual value - predicted value, is world_covid[world_covid$Country.Region=="US","cases"] - ( summary(fit_cases_lat)$coefficients[1,1] + summary(fit_cases_lat)$coefficients[2,1]* world_covid[world_covid$Country.Region=="US","Lat"] ) # Leaving up to you? Which are the 5 countries/territories for which we have the largest residuals? The 1st one is the U.S., but what are the other 4? # Let's run all the other models we ran for the exam -- it's fairly quick to do in R summary(lm(cases ~ abslat , data=world_covid))$coefficients summary(lm(cases ~ Long , data=world_covid))$coefficients # The effect of longitude is not statistically different from 0 summary(lm(cases ~ abslong , data=world_covid))$coefficients summary(lm(cases ~ abslat + abslong , data=world_covid))$coefficients # Absolute latitude has a statistically significant effect on the number of case when controlling for longitude summary(lm(deaths ~ cases + abslat , data=world_covid))$coefficients # Absolute latitude does not have a statistically significant effect on the number of deaths when controlling for the number of cases # What is the interpretation of the coeffiicent on cases? When the number of cases increases by 100, GIVEN NO CHANGE IN LATITUDE (in other words, controlling for latitude), the predicted number of deaths increases by summary(lm(deaths ~ cases + abslat , data=world_covid))$coefficients[2,1]*100 # Let's plot the relationship between the 3 variables: look at the relationship between z and y for the same x plot_ly(data=world_covid,x=world_covid$abslat, y=world_covid$cases, z=world_covid$deaths, type="scatter3d", mode="markers", color=world_covid$abslat) ### PART 2 ### # Let's now combine U.S. covid-19 data with county characteristics and explore the effect of county characteristics--focusing on adult smoking rate--on death rate per capita using multiple linear regression analysis rm(list = ls()) # clear environment # The next several lines, all through obtained "combined_data" data frame, are mostly a repeat of lecture on 4 27, so you can rewatch it if you need more clarity on this block setwd("C:/Users/of43/Dropbox/Econ 320/Spring 2020/r/rawdata") # change to folder on your laptop countydata <- read.csv("county_health_and_other_data2018.csv") # download dataset with county-level data (we should have previously put it in the working directory folder) countydata2 <- countydata[, colSums(is.na(countydata)) != nrow(countydata)] # keep columns where not everything is na--columns with at least some data countydata2 <- countydata2[ , !(names(countydata2) %in% grep("(FIPS)", names(countydata2), value=TRUE))] # remove more unecessary variables -- those that contain "FIPS" # keep the only the specific variables with are interested in countydata2 <- countydata2[, c("Name","State.Abbreviation","Population.raw.value","Poor.or.fair.health.raw.value","Low.birthweight.raw.value","Adult.smoking.raw.value","Adult.obesity.raw.value","Physical.inactivity.raw.value","Preventable.hospital.stays.raw.value", "High.school.graduation.raw.value" ,"Some.college.raw.value", "X..Non.Hispanic.white.raw.value", "Residential.segregation...black.white.raw.value","Median.household.income.raw.value","Income.inequality.raw.value" ,"Primary.care.physicians.raw.value","Air.pollution...particulate.matter.raw.value","Adult.smoking.raw.value","X..Rural.raw.value","X..Hispanic.raw.value" ,"X..Non.Hispanic.African.American.raw.value" , "Uninsured.adults.raw.value" ,"X..65.and.older.raw.value" )] head(countydata2) # download covid data casesus <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv") # cases deathsus <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv") # deaths # For compatibility, add "County" to county name in usdata_b and add state abbreviation usdata_b <- data.frame(casesus[,c("Admin2","Province_State","Combined_Key","X4.20.20")],deathsus[,c("X4.20.20")]) # combine data.frames and select the variables of interest head(usdata_b) names(usdata_b) <- c("County","State","Combined_Key","cases","deaths") # rename columns usdata_b$County = paste(usdata_b$County, "County") # concatenate by adding "County" usdata_b$County = str_trim(as.character(usdata_b$County)) # trim usdata_b$stab <- state.abb[match(usdata_b$State,state.name)] # add state abbreviation usdata_b$stab[usdata_b$State=="District of Columbia"] = "DC" usdata_b$deathrate=round(100*usdata_b$deaths/usdata_b$cases,digits=1) # add deathrate usdata_b$deathrate[usdata_b$deaths==0] =0 # set death rate to 0 if no deaths and no cases head(usdata_b) # Merge the two data frames combined_data = merge(usdata_b, countydata2, by.x=c("County","stab"), by.y = c("Name","State.Abbreviation")) # Let's add deaths per 100000 people and rescale median household income to be in 1000s combined_data$deathpercap = 100000*combined_data$deaths/combined_data$Population.raw.value combined_data$Median.household.income.raw.value = combined_data$Median.household.income.raw.value/1000 head(combined_data[,1:20],n=10) # Let's see all the variable names names(combined_data) names(combined_data) = gsub(".raw.value","",names(combined_data) ) # remove unnecessary part names(combined_data) # Let's try to focus on one variable and see how the estimated regression coefficient differs depending on specification summary(lm(deathpercap ~ Adult.smoking , data=combined_data)) # doesn't seem that smoking has any effect on deaths per capita # Maybe we omitted some intervening/confounding variables that are correlated both with smoking and deaths per capita. When that happens, our coefficients will be biased -- lower or higher than the true values. # What is the sign of the bias? If omitted variable is correlated positively with the outcome -- deaths per capita -- and negatively with the independent variable -- smoking -- then the bias will be negative -- the coefficient will be lower than true value. If omitted variable is correlated negatively with the outcome -- deaths per capita -- and postively with the independent variable -- smoking -- then the bias will again be negative -- the coefficient will be lower than true value. # If the omitted variable is correlated positvely with both the outcome variable and the independent variable, the bias will be positive -- the coefficient will be higher than true value. The same is true if omitted variable is negatively correlated with both. #Let's control for some other county characteristics summary(lm(deathpercap ~ Adult.smoking + X..Rural + X..65.and.older + X..Non.Hispanic.African.American + X..Hispanic, data=combined_data)) # Share rural and african american seem to be important but not smoking, share hispanic or share older than 65 # Mayve we have still omitted some confounders, let's include additional controls summary(lm(deathpercap ~ Adult.smoking + X..Rural + X..65.and.older + X..Non.Hispanic.African.American + X..Hispanic + Income.inequality + High.school.graduation +Air.pollution...particulate.matter, data=combined_data)) # The more people smoke the lower the death rate per capita? Counterintuitive, but could be # Let's also control for other health variables summary(lm(deathpercap ~ Adult.smoking + X..Rural + X..65.and.older + X..Non.Hispanic.African.American + X..Hispanic + Income.inequality + Median.household.income+ High.school.graduation +Air.pollution...particulate.matter+ Adult.obesity + Poor.or.fair.health, data=combined_data)) # now smoking has a positive effect? Why did it seem to have a negative effect? # Possible explanation 1: smoking is positively correlated with obesity, which is negatively correlated with deathspercap cor(combined_data$Adult.smoking,combined_data$Adult.obesity) # high positive correlation between obesity and smoking. Why might obesity negatively impact deaths per capita? Maybe obese people are less likely to go out as much as get infected # Possible explanation 2: smoking is negatively correlated with median income, which is positively correlated with deathspercap. Why more cases in areas with higher average income? Maybe those areas are closer to town centers and, therefore, denser. cor(combined_data$Adult.smoking,combined_data$Median.household.income,use="complete.obs") # We know that a large chunk of cases is concentraited in NY. Maybe NY has an outsized effect on the results. Let's see without NY summary(lm(deathpercap ~ Adult.smoking + X..Rural + X..65.and.older + X..Non.Hispanic.African.American + X..Hispanic + Income.inequality + Median.household.income+ High.school.graduation +Air.pollution...particulate.matter+ Adult.obesity + Poor.or.fair.health +Median.household.income, data=combined_data[!(combined_data$stab=="NY"),])) # There is still a positive effect of smoking # What is the interpretation of the coeffiicent on smoking? When the share of adults smoking increases by 10 percentage points, the predicted number of deaths per 100 thousand people increases by 36.6/10= 3.7 # How confident are we in the estimated effect of smoking? Well, it makes sense and there is some evidence for it, but we cannot be sure there aren't other confounding variables we haven't accounted for. There are some more rigorous techniques to think about this beyond the scope of this class. # So, using regression analysis we can present some evidence consistent with a given hypothesis, but without more work, we can not be confident in causal interpreation.