3.2 Free-Format Input

Free-format data are text files containing numbers or character strings separated by spaces. Optionally the file may have a header containing variable names. Here's an example of a data file containing information on three variables for 20 countries in Latin America:

                   setting  effort  change
   Bolivia              46       0       1
   Brazil               74       0      10
   Chile                89      16      29
   Colombia             77      16      25
   CostaRica            84      21      29
   Cuba                 89      15      40
   DominicanRep         68      14      21
   Ecuador              70       6       0
   ElSalvador           60      13      13
   Guatemala            55       9       4
   Haiti                35       3       0
   Honduras             51       7       7
   Jamaica              87      23      21
   Mexico               83       4       9
   Nicaragua            68       0       7
   Panama               84      19      22
   Paraguay             74       3       6
   Peru                 73       0       2
   TrinidadTobago       84      15      29
   Venezuela            91       7      11

This small dataset includes an index of social setting, an index of family planning effort, and the percent decline in the crude birth rate between 1965 and 1975. The data are available at http://data.princeton.edu/wws509/datasets/ in a file called effort.dat which includes a header with the variable names.

R can read the data directly from the web:

> fpe <- read.table("http://data.princeton.edu/wws509/datasets/effort.dat")

The function used to read data frames is called read.table. The argument is a character string giving the name of the file containing the data, but here we have given it a fully qualified URL (uniform resource locator), and that's all it takes.

Alternatively, you could download the data and save them in a local file, or just cut and paste the data from the browser to an editor such as Notepad, and then save them. Make sure the file ends up in R's working directory, which you can find out by typing getwd(). If that is not the case you can use a fully qualified path name or change R's working directory by calling setwd with a string argument. Remember to double up your backward slashes (or use forward slashes instead) when specifying paths.
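The working-directory advice above can be sketched as follows. This is only an illustration under stated assumptions: the scratch folder from tempdir() stands in for wherever you saved the file, only two rows of the data are written out, and the Windows paths in the comments are hypothetical.

```r
# Stand-in for a local copy: write two rows of the data to a scratch folder.
scratch <- tempdir()                       # assumption: plays the role of your data folder
path <- file.path(scratch, "effort.dat")   # file.path joins the pieces with "/"
writeLines(c("          setting effort change",
             "Bolivia        46      0      1",
             "Brazil         74      0     10"), path)

old <- getwd()                             # remember the current working directory
setwd(scratch)                             # make the scratch folder the working directory
fpe_local <- read.table("effort.dat")      # a bare file name is now found
setwd(old)                                 # restore the original working directory

# A fully qualified path works whatever the working directory is.
fpe_full <- read.table(path)

# On Windows either spelling is accepted (hypothetical paths):
#   read.table("C:\\data\\effort.dat")
#   read.table("C:/data/effort.dat")
```

Restoring the old working directory afterwards is a courtesy to the rest of your session rather than a requirement.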

The special symbol <- is R's assignment operator, which we have encountered already. Here we assigned the data to an object named fpe. To print the data, simply type the name of the object.

> fpe

               setting effort change
Bolivia             46      0      1
Brazil              74      0     10
    ... output edited ...
Venezuela           91      7     11

In this example R detected correctly that the first line in our file was a header with the variable names. It also inferred correctly that the first column had the observation names. (Well, it did so with a little help; I made sure the row names did not have embedded spaces, hence CostaRica. Alternatively, I could have used "Costa Rica" in quotes as a row name.)
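To see the quoting alternative in action without creating a file, here is a small sketch that feeds two rows of the data to read.table through its text argument. The object names lines_q and fpe_q are made up for the illustration.

```r
# Two rows of the data, one with a quoted row name containing a space.
lines_q <- c('             setting effort change',
             '"Costa Rica"      84     21     29',
             'Cuba              89     15     40')

# The header has one fewer entry than the data rows, so R treats the
# first line as a header and the first column as row names.
fpe_q <- read.table(text = lines_q)

row.names(fpe_q)                 # the quoted name is kept as a single entry
fpe_q["Costa Rica", "effort"]    # rows can be indexed by name
```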

You can always tell R explicitly whether or not you have a header by specifying the optional argument header=TRUE or header=FALSE to the read.table function. This is important if you have a header but lack row names, because R's guess is based on the fact that the header line has one fewer entry than the next row, as it did in our example.
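The case where the guess fails is easy to reproduce. The sketch below uses a header but no row names, again passing the lines through the text argument; the names lines_h, guessed and explicit are just for illustration.

```r
# A header line and two data rows, with no observation names.
lines_h <- c("setting effort change",
             "46 0 1",
             "74 0 10")

# Every line has three entries, so R does not detect a header:
# the header line is read as data and the columns are named V1, V2, V3.
guessed <- read.table(text = lines_h)
names(guessed)
nrow(guessed)

# Saying so explicitly gives the intended result.
explicit <- read.table(text = lines_h, header = TRUE)
names(explicit)
nrow(explicit)
```

In the first call the variable names end up inside the data, so the columns are no longer numeric, which is usually the first symptom you notice.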

If your file does not have a header line, R will use the default variable names V1, V2, and so on. To override this default use read.table's optional argument col.names to assign variable names. This argument takes a vector of names. So, if our file did not have a header we could have used the command

> fpe = read.table("noheader.dat",
+   col.names=c("setting","effort","change"))

Incidentally, this is the first time that our command did not fit on one line. R code can be continued automatically on a new line simply by making it obvious that we are not done, for example by ending the line with a comma or leaving a left parenthesis unclosed. R responds by prompting for more with the continuation symbol + instead of the usual prompt >.

If your file does not have observation names, R will simply number the observations from 1 to n. You can specify row names using read.table's optional argument row.names, which takes a vector of names just like col.names, or alternatively the number or name of the column that holds the names; type ?read.table or ?data.frame for more information.
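Here is a rough sketch of the row.names argument, using two rows of the data without a header. The object names are invented for the example, and "country" is a made-up label for the first column.

```r
# No header; the first entry on each line is the country name.
lines_r <- c("Bolivia 46 0 1",
             "Brazil 74 0 10")

# By default the names are just another column and rows are numbered 1 to n.
plain <- read.table(text = lines_r)
row.names(plain)

# row.names = 1 says the first column holds the observation names.
by_col <- read.table(text = lines_r, row.names = 1,
                     col.names = c("country", "setting", "effort", "change"))

# Alternatively, supply the names as a vector when the file has none.
lines_n <- c("46 0 1",
             "74 0 10")
by_vec <- read.table(text = lines_n,
                     row.names = c("Bolivia", "Brazil"),
                     col.names = c("setting", "effort", "change"))
```

Both by_col and by_vec end up with the three numeric variables and the country names as row names.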

There are two closely related functions that can be used to get or set variable and observation names at a later time. These are called names (for the variable names), and row.names (for the observation names). Thus, if our file did not have a header we could have read the data and then changed the default variable names using the names function:

> fpe = read.table("noheader.dat")

> names(fpe) = c("setting","effort","change")

Technical Note: If you have a background in other programming languages you may be surprised to see a function call on the left-hand side of an assignment. These are special 'replacement' functions in R. They extract an element of an object and then replace its value.
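For the curious, the following sketch shows what a replacement call amounts to. The small data frame fpe_demo is typed in by hand to stand in for a file read without a header, and the new name "program" is arbitrary.

```r
# A two-row stand-in for data read without a header.
fpe_demo <- data.frame(V1 = c(46, 74), V2 = c(0, 0), V3 = c(1, 10))
new_names <- c("setting", "effort", "change")

# The usual form: a function call on the left of the assignment.
a <- fpe_demo
names(a) <- new_names

# The same thing spelled out: call the function named `names<-`
# and assign its result.
b <- `names<-`(fpe_demo, new_names)
identical(a, b)

# Replacement also works on a single element of the names,
# and row.names has a replacement form too.
names(a)[2] <- "program"
row.names(a) <- c("Bolivia", "Brazil")
```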

In our example all three variables were numeric. R will handle string variables with no problem. If one of our variables were sex, coded M for males and F for females, R could store it as a factor, which is basically a categorical variable that takes one of a finite set of values called levels. (Before R 4.0.0 read.table turned strings into factors by default; in later versions strings are read as character vectors unless you specify stringsAsFactors=TRUE.) In Section 5 we will use a data frame with categorical variables to illustrate logistic regression. Another way to generate factors is by grouping a numeric covariate. An example appears in Section 4 below.
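A minimal sketch of reading a string variable as a factor follows. The three records are invented purely for illustration, and stringsAsFactors=TRUE is given explicitly so the result does not depend on the version of R you are running.

```r
# Made-up records with a string variable coded M/F.
lines_f <- c("sex age",
             "M 23",
             "F 31",
             "F 27")

people <- read.table(text = lines_f, header = TRUE,
                     stringsAsFactors = TRUE)

is.factor(people$sex)   # the string column was stored as a factor
levels(people$sex)      # the distinct values, in sorted order
table(people$sex)       # counts per level
```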

Exercise: Use a text editor to create a small file with the following three lines:

a b c

1 2 3

4 5 6

Read this file into R so the variable names are a, b and c. Now delete the first row and read the file again so the variable names are still a, b and c.