

Introduction to R and RStudio

Drew Barker

August 30, 2018

Welcome to the wonderful world of R! First things first:

What is R?

R is a (pirate’s favorite) free programming language and software environment for statistical analysis and graphics. R is very versatile, able to handle just about any statistical analysis you need. And, as you have already discovered, R is free! Unfortunately, with the pros come the cons. The most commonly cited drawback of R is that it has a steeper learning curve than some other statistical software such as SAS or Stata. What we do in this class will be manageable.

What is RStudio?

RStudio is an integrated development environment (IDE) for R: a program that runs R for you and makes it a bit more user friendly. This is what we will be using in this class.

OK I have RStudio open. What is all this stuff?!

When you open RStudio you should see 4 windows in total.

The window in the top left is the R code editor. Here you can create R scripts that hold multiple lines of code in one file, which you can run all at once or one line at a time. For projects, homework, or your own research, this is where you should write your code so you can easily reproduce your results or make a change.

The bottom left window is the console. In the console you can run single lines of code one at a time. This is helpful for quick lines of code that you have no intention of reproducing, because it keeps them from cluttering up your scripts.

The top right window has 3 tabs, but the one you are most concerned with is Environment. It shows a list of all datasets you have imported as well as any objects you have created in lines of code. You can visit the History tab to see the most recent lines of code you have run.

Finally, the bottom right window has 4 different tabs. Any time you create a graph or plot in RStudio it will display in the Plots tab. The Packages tab displays a list of the packages that are currently installed. Packages are add-ons that allow you to do more things in R, and there is a button in this window to install new ones. Help is a very helpful tab: it holds full documentation for each package and for almost all of the commands you will need to use.

Important comment: #hashtag

Many of you have probably used this symbol extensively and found it comes in handy for your social media. Well, you’ll be delighted to know it comes in just as handy in R. The # symbol can be used to add comments to your R script. Comments are important in R scripts so you can keep track of what each line or section of your code is intended to do. Let’s see what happens if we try to put the following comment directly into our script with no symbol in front: “This is a very original comment.” Didn’t work so well, did it? That is because we aren’t speaking R’s language, so R doesn’t understand. Anything not marked with a # is considered runnable code and must abide by R’s syntax to actually work. Try the following instead:

#This is a very original comment.

That’s better. The # symbol simply tells R to ignore everything that follows it on the same line. Again, as you go forward using R or any other programming language, comments are very important. Trust me, if you don’t put comments in your scripts you’re gonna have a bad time later on.

Getting started: importing a dataset

OK, we are all eager to start doing some data analysis in R. First we must actually get some data in here. For this lesson I have created an imaginary dataset in an Excel file called “introtoR”, which is on our Carmen course page. Importing it is one of the simpler things to do, thanks to RStudio. First make sure you download the file onto your computer. Then click the button in the Environment tab of the upper right window that says Import Dataset, click From Excel, find the file and open it, and finally click Import. Once you import the dataset it will show in the Environment tab and open up a sheet in the code editing window with the full dataset.

Note the structure of a dataset in R: each row is a single observation (in our case an individual person), and each column is a variable, labeled with its variable name. It is important to understand what each variable in your dataset represents and what each of its values means. In the real world you should have access to documentation from data sources defining and describing each variable.

In our imaginary dataset we have 5 variables in total: health is a health index between 1 and 10 that measures how healthy the person is (1 is completely unhealthy, 10 is perfect health), healthins is a dummy variable equal to 1 if the person has health insurance, female is a dummy variable equal to 1 if the person is female, age is numerical age (big surprise there), and famheart is a dummy variable equal to 1 if the person has a family history of heart disease. In this lesson we are interested in estimating the average causal effect of health insurance on health using a simple comparison of group means.
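To make the row-and-column structure described above concrete, here is a minimal sketch that types in three observations by hand instead of importing the Excel file. The values are the first three rows of the dataset as printed later in this handout. The object name first3 is made up for illustration, and it is a base R data frame rather than the object that Import Dataset creates.

```r
# Hypothetical hand-typed copy of the first three observations.
# Each argument is one column (variable); each position is one row (person).
first3 <- data.frame(
  health    = c(10, 7, 5),   # health index, 1 (worst) to 10 (best)
  healthins = c(1, 1, 1),    # 1 = has health insurance
  female    = c(1, 0, 0),    # 1 = female
  age       = c(21, 26, 30), # age in years
  famheart  = c(1, 0, 0)     # 1 = family history of heart disease
)

nrow(first3)   # number of observations (rows)
ncol(first3)   # number of variables (columns)
names(first3)  # the variable names
str(first3)    # type and first few values of every variable
```

Running nrow, ncol, and names on any dataset you import is a quick way to confirm that it arrived with the observations and variables you expected.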

Arithmetic and assigning values

An important aspect of R’s language involves basic arithmetic. We can perform basic arithmetic in R as follows:

9+3 #Addition

## [1] 12

54-32 #Subtraction

## [1] 22

25*2 #Multiplication

## [1] 50

625/4 #Division

## [1] 156.25

4^2 #Exponents

## [1] 16

Another important aspect involves creating new objects in R. We can assign values to new objects using <-. For example, suppose we wanted to save one of the answers to the arithmetic we did above into a new object called ans.

ans<-625/4 #This creates a new object in our environment equal to 625/4

This can come in handy if we need to reference a value again, for example to do further arithmetic. We can perform operations using the values assigned to objects in R by calling the objects directly, for example:

ans1<-(50-5)^5 #saves the value to a new object
ans2<-(59-20)/5 #ditto
ans1/ans2 #divides the values of the two answers by calling their object names directly in the arithmetic

## [1] 23657452

Functions

Yet another important aspect of the language of R is the function. Functions in R make our computing much simpler. For example, we are interested in looking at the average value of health for the treatment and control groups in our dataset. We could separate out each observation of the health variable (more on how we do that in a bit…), then add them all up and divide by the sample size. However, our life is made much easier with functions. In particular, the mean function will allow us to calculate the arithmetic average of a variable easily. In order to use a function we use the following format: function_name(input1, input2,…). We can illustrate this syntax using one of the most fundamental functions in R, the summary function. The summary function generally takes one input, an object such as a dataset, and… summarizes it. The actual nature of the summary depends on the type of object we input into the function. Let’s see what we get for our dataset.

library(readxl) #IGNORE THIS LINE

## Warning: package 'readxl' was built under R version 3.4.2

introtoR <- read_excel("Teaching/Elementary metrics 18/introtoR.xlsx") #IGNORE THIS LINE
summary(introtoR) #This provides a summary of our dataset. Note the syntax here, where summary is the function name and the lone input is our dataset.

##      health         healthins       female         age       
##  Min.   : 1.000   Min.   :0.0   Min.   :0.0   Min.   :19.00  
##  1st Qu.: 3.250   1st Qu.:0.0   1st Qu.:0.0   1st Qu.:24.25  
##  Median : 5.500   Median :0.5   Median :0.5   Median :29.50  
##  Mean   : 5.667   Mean   :0.5   Mean   :0.5   Mean   :32.30  
##  3rd Qu.: 8.000   3rd Qu.:1.0   3rd Qu.:1.0   3rd Qu.:38.75  
##  Max.   :10.000   Max.   :1.0   Max.   :1.0   Max.   :59.00  
##     famheart     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4667  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

We see that for a dataset the summary function provides some key summary statistics for all of the variables in our dataset.
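The function_name(input1, input2,…) format can also be seen with the mean function. The sketch below is only an illustration: the vector x and its values are made up and are not part of the course dataset. It uses base R functions only (c, mean, sum, is.na).

```r
# Made-up values, for illustration only (not from the course dataset).
# c() combines values into a vector; NA marks a missing value.
x <- c(4, 8, 6, NA)

# One input: mean() gives NA because one of the values is missing.
mean(x)

# Two inputs: the second input, na.rm = TRUE, tells mean() to drop
# missing values before averaging.
mean(x, na.rm = TRUE)

# The same average by hand: add up the non-missing values and divide
# by how many non-missing values there are.
sum(x, na.rm = TRUE) / sum(!is.na(x))
```

The last two lines give the same number, which is the point of the function: it does the adding and dividing for us.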

Extracting rows and columns from a dataset

In order to properly use the mean function in R we need to extract a variable from the dataset that we are interested in seeing the mean of. In particular, we are interested in looking at the average value of the health index for all individuals. There are a couple of ways to do this.

introtoR$health #The '$' is the most efficient way to extract a variable from a dataset. We simply put a '$' followed by the name of the variable we are interested in.

##  [1] 10  7  5  8  6 10  9  5 10  7  9  6  8  7 10  1  5  4  6  9  2  4  1
## [24]  3  2  5  2  1  5  3

introtoR[,1] #The brackets are a more general way of extracting rows and columns from a dataset. When we input "dataset[r,c]" into R we are telling it that we want information from the dataset from row number r and column number c. If we want to include all rows in our extraction we simply leave the row entry blank, as we did here (same thing for columns). Note in our case that we are extracting only the first column which corresponds to the health variable.

## # A tibble: 30 x 1
##    health
##     <dbl>
##  1     10
##  2      7
##  3      5
##  4      8
##  5      6
##  6     10
##  7      9
##  8      5
##  9     10
## 10      7
## # ... with 20 more rows

mean(introtoR$health) #this provides the average value of the health index for all individuals

## [1] 5.666667

#If we wanted to select multiple consecutive rows (or columns) we could do so with a ":". For example:
introtoR[1:3,] #This extracts rows (observations) 1 through 3 with all of the columns included.

## # A tibble: 3 x 5
##   health healthins female   age famheart
##    <dbl>     <dbl>  <dbl> <dbl>    <dbl>
## 1     10         1      1    21        1
## 2      7         1      0    26        0
## 3      5         1      0    30        0
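If you want to practice these extractions without the Excel file, the sketch below rebuilds two of the columns from values printed in this handout. The name demo is hypothetical, and demo is a base R data frame rather than the imported dataset, so its printed output will look slightly different from the output shown above.

```r
# Rebuild two columns from the values printed in this handout.
# 'demo' is a hypothetical stand-in for introtoR (a base R data frame).
demo <- data.frame(
  health = c(10, 7, 5, 8, 6, 10, 9, 5, 10, 7, 9, 6, 8, 7, 10,
             1, 5, 4, 6, 9, 2, 4, 1, 3, 2, 5, 2, 1, 5, 3),
  healthins = rep(c(1, 0), each = 15)  # first 15 people insured, last 15 not
)

demo$health          # extract a variable by name with '$'
demo[["health"]]     # the same values, using double brackets and a quoted name
demo[1:3, ]          # rows 1 through 3, all columns
demo[1:3, "health"]  # rows 1 through 3 of the health column only
demo[c(1, 30), ]     # non-consecutive rows: list the row numbers inside c()

mean(demo$health)    # average health index across all 30 people
```

Referring to a column by name, as in the `$` and quoted-name forms, keeps working even if the column order changes, which is one reason to prefer it over column numbers.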

True/false statements

Now, we are interested in separating the sample into a treatment and control group. In order to do so we have to figure out which individuals have health insurance, or in other words for which individuals is it true that healthins=1. We can test for equality or inequality in R as follows:

2==3 # '==' tests for exact equality. Note that we must use two equal signs here.

## [1] FALSE

2>3 # Big surprise, this tests for "greater than"

## [1] FALSE

2<3 #Tests for "less than"

## [1] TRUE

2>=3 #Greater or equal to

## [1] FALSE

2<=3 #less or equal to

## [1] TRUE

2!=3 #tests for inequality. Note the syntax: we use "!=" for "not equal to".

## [1] TRUE

Note that each of these comparisons returns either TRUE or FALSE. Let’s see what happens when we apply the same thing to all observations of a variable.

introtoR$healthins==1 #this tests to see if each of the values of the variable for healthins is equal to 1

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
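A vector of TRUE and FALSE values can also be counted and averaged, because R treats TRUE as 1 and FALSE as 0 in arithmetic. The sketch below uses a standalone vector named healthins that copies the pattern printed above (the first 15 people insured, the last 15 not); it stands in for introtoR$healthins so that it runs without the Excel file.

```r
# Standalone copy of the insurance indicator, matching the output above:
# the first 15 people are insured and the last 15 are not.
healthins <- rep(c(1, 0), each = 15)

insured <- healthins == 1  # TRUE/FALSE, one value per person

sum(insured)    # number of insured people: 15
mean(insured)   # share of the sample that is insured: 0.5
which(insured)  # row numbers of the insured people: 1 through 15
```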

Subsetting

Alright, we are almost ready to calculate the average difference. Now we know how we can figure out who does and does not have health insurance, but how do we separate them out? One way to go about this would be to examine the output of the last line of code to see which rows have healthins==1, then extract those rows as we have done before. That’s actually not too difficult with this small dataset, but imagine trying that with a dataset containing thousands of rows! There is an easier way to do this using the subset function. I’ll demonstrate the subset function directly with the code we need. Then we are all ready to take our average difference in the health index.

treat<-subset(introtoR, healthins==1) #We are subsetting the dataset to only include individuals with health insurance (treatment group) here. Note that here we put two inputs into the function: the dataset we are subsetting from and the conditional statement that specifies the rows we want.
control<-subset(introtoR, healthins==0) #Control group. Could have also used the condition "healthins!=1".
mean(treat$health)-mean(control$health) #calculate the average health index for treatment and control groups and take the difference.

## [1] 4.266667

summary(treat) #look at the sample averages of treatment and control groups to check for balance.

##      health       healthins     female            age       
##  Min.   : 5.0   Min.   :1   Min.   :0.0000   Min.   :19.00  
##  1st Qu.: 6.5   1st Qu.:1   1st Qu.:0.0000   1st Qu.:21.00  
##  Median : 8.0   Median :1   Median :1.0000   Median :24.00  
##  Mean   : 7.8   Mean   :1   Mean   :0.6667   Mean   :24.67  
##  3rd Qu.: 9.5   3rd Qu.:1   3rd Qu.:1.0000   3rd Qu.:27.50  
##  Max.   :10.0   Max.   :1   Max.   :1.0000   Max.   :35.00  
##     famheart     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3333  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

summary(control) #look at the sample averages of treatment and control groups to check for balance.

##      health        healthins     female            age       
##  Min.   :1.000   Min.   :0   Min.   :0.0000   Min.   :29.00  
##  1st Qu.:2.000   1st Qu.:0   1st Qu.:0.0000   1st Qu.:32.00  
##  Median :3.000   Median :0   Median :0.0000   Median :39.00  
##  Mean   :3.533   Mean   :0   Mean   :0.3333   Mean   :39.93  
##  3rd Qu.:5.000   3rd Qu.:0   3rd Qu.:1.0000   3rd Qu.:45.50  
##  Max.   :9.000   Max.   :0   Max.   :1.0000   Max.   :59.00  
##     famheart  
##  Min.   :0.0  
##  1st Qu.:0.0  
##  Median :1.0  
##  Mean   :0.6  
##  3rd Qu.:1.0  
##  Max.   :1.0

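Note the large average difference we find: about 4.27 points on the health index. I took the liberty of using the summary function on the two subsets to check for balance. It is pretty easy to see that all else is not truly equal here. The insured group is younger on average (24.67 versus 39.93 years), has a larger share of women (0.67 versus 0.33), and is less likely to have a family history of heart disease (0.33 versus 0.60). Because the groups differ in these other ways, the simple difference in means cannot be read as the causal effect of health insurance.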
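The subset function is not the only route to the difference in group means. As a sketch, the code below gets the same number two other ways: by putting a true/false condition inside brackets, and with the base R function tapply. The vectors health and healthins are standalone copies of the values printed in this handout, so the example runs without the Excel file; the names mean_treat, mean_control, and group_means are made up for illustration.

```r
# Standalone copies of two variables, using the values printed in this handout.
health <- c(10, 7, 5, 8, 6, 10, 9, 5, 10, 7, 9, 6, 8, 7, 10,
            1, 5, 4, 6, 9, 2, 4, 1, 3, 2, 5, 2, 1, 5, 3)
healthins <- rep(c(1, 0), each = 15)  # first 15 insured, last 15 not

# Way 1: a condition inside brackets keeps only the values where it is TRUE.
mean_treat   <- mean(health[healthins == 1])  # average health, insured
mean_control <- mean(health[healthins == 0])  # average health, uninsured
mean_treat - mean_control                     # about 4.27, as before

# Way 2: tapply() applies a function (here mean) within each group.
# The result has one entry per group, named "0" and "1".
group_means <- tapply(health, healthins, mean)
group_means["1"] - group_means["0"]
```

Both approaches only reproduce the raw difference in means; they do nothing to fix the imbalance between the two groups.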