
Data Frames, Lists, Matrices, and the Apply Family of Functions

2012 Summer Olympics

2012 Olympic Athletes

Craig Bloodworth at the Information Lab made this data explorer for the Guardian. It includes data on all athletes competing in the Olympics. We download the data as a csv file and read it into R with read.csv.

http://www.guardian.co.uk/sport/datablog/2012/jul/27/london-olympic-athletes-full-list

Reading data into R

•  Many data sets are stored in text files.
•  The easiest way to read these into R is to use either the read.table or the read.csv function, both of which return a data frame.
•  There are quite a few options that can be changed. Some of the important ones are:
   –  file: name or URL
   –  header: are column names at the top of the file?
   –  sep: what divides elements of the table
   –  na.strings: symbol for missing values, like 9999
   –  skip: number of lines at the top of the file to ignore
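To see how these options interact, here is a minimal sketch that reads a small made-up table from a string rather than from a file or URL. The text argument and the contents of raw are illustrative assumptions, not part of the Olympics data.

```r
# Hypothetical file contents: one note line, a header, then three records.
raw <- "station readings (made-up values)
id,temp
a,21.5
b,9999
c,19.0"

readings <- read.csv(text = raw,
                     header = TRUE,        # first line read holds the column names
                     sep = ",",            # fields are separated by commas
                     na.strings = "9999",  # treat 9999 as a missing value
                     skip = 1)             # ignore the note line at the top

readings
```

With a real file, text = raw would be replaced by the file name or URL as the first argument.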


Country-level data

http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/olympic2012withgdp/versions/1.txt

What’s the relationship between GDP, population, and number of Olympic medals?

> ctry = read.csv("http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/olympic2012withgdp/versions/1.txt",
                  skip = 1, sep = "\t", header = FALSE,
                  colClasses = c("character", rep("numeric", 5), rep("character", 3)))

> head(ctry)
   V1 V2 V3 V4 V5 V6                 V7         V8         V9
1 ABW  0  0  0  0  0   2,456,000,000.00    108,000 22740.7407
2 AFG  0  0  1  1  1  20,343,461,030.00 34,385,000   591.6377
3 AGO  0  0  0  0  0 100,990,000,000.00 19,082,000  5292.4222
4 ALB  0  0  0  0  0  12,959,563,902.00  3,205,000  4043.5457
5 AND  0  0  0  0  0   3,491,000,000.00     84,864 41136.4065
6 ARE  0  0  0  0  0 360,245,000,000.00  7,512,000 47955.9372

Need to:
•  Clean up the GDP and POP columns by removing "," and converting to numeric
•  Add latitude and longitude for each country
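One possible way to do the clean-up step is sketched below with gsub and as.numeric. It works on two short character vectors copied from the printed rows. That V7 holds GDP and V8 holds population is an assumption based on the values shown, since the columns are unlabeled.

```r
# Values copied from the first two printed rows; column meaning is assumed.
gdp_raw <- c("2,456,000,000.00", "20,343,461,030.00")   # assumed to be V7
pop_raw <- c("108,000", "34,385,000")                   # assumed to be V8

# Remove every comma, then convert the remaining text to numbers.
gdp <- as.numeric(gsub(",", "", gdp_raw, fixed = TRUE))
pop <- as.numeric(gsub(",", "", pop_raw, fixed = TRUE))

gdp / pop   # GDP per person, now that both are numeric
```

On the full data frame the same two calls could be applied to the whole column, for example to ctry$V7.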

Review Data Structures

Review: Vector

•  Ordered container of literals
•  Elements must be the same type

[Figure: the vector Weight, an ordered container of Numeric values including 175, 125, 105, and 170]
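A short sketch of what "same type" means in practice: when values of different types are combined, R converts them all to one common type. The values in weight are taken from the figure; the vector mixed is invented for illustration.

```r
weight <- c(175, 125, 105, 170)
weight_class <- class(weight)      # "numeric"

# Mixing a number with text forces every element to become character.
mixed <- c(175, "heavy")
mixed_class <- class(mixed)        # "character"

second <- weight[2]                # order is kept, so this is 125
```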

Review: Data Frame

•  Ordered container of vectors
•  Vectors must all be the same length
•  Vectors can be different types
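A small illustration of these rules, as a sketch only. The name family appears later in these notes, but the columns and values used here are assumptions made up for the example.

```r
# Hypothetical data: names and values are invented for illustration.
family <- data.frame(
  name   = c("Tom", "May", "Joe"),
  height = c(77, 64, 72),
  weight = c(175, 125, 170),
  stringsAsFactors = FALSE   # keep name as character in older R versions
)

col_types   <- sapply(family, class)    # one type per column
col_lengths <- sapply(family, length)   # every column has 3 elements
```

Here col_types shows a character column next to two numeric columns, while col_lengths is 3 for each of them.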


Wireless Data

•  There are 5 wireless access points in a building.
•  A laptop emits a signal and the strength of the signal is recorded at each access point.
•  Also recorded is the location of the laptop in the building.
•  Measurements are taken at 254 locations.

w = read.table("http://www.stanford.edu/~vcs/StatData/wireless.txt", header = TRUE)

Helpful functions for finding out information about the data frame

> class(w)
[1] "data.frame"

> names(w)
[1] "x"  "y"  "S1" "S2" "S3" "S4" "S5"

> dim(w)
[1] 259   7

> head(w)
      x   y  S1  S2  S3  S4  S5
1 225.0 144 -92 -78 -49 -92 -92
2   0.0 144 -75 -92 -87 -47 -92
3 111.0 132 -92 -92 -80 -70 -65
4 110.7 132 -87 -92 -67 -67 -62
5 105.0 132 -92 -92 -79 -66 -61
6  99.0 132 -92 -92 -79 -70 -61

> summary(w$S1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -92.00  -90.00  -80.00  -77.41  -71.00  -33.00

> w[w$x > 200, c("S2")]
 [1] -78 -74 -35 -43 -46 -43 -41 -48 -68 -71 -67 -70 -58 -61
[15] -64 -73 -71 -67 -68 -61 -60 -57 -58 -56 -48 -51 -68 -54
[29] -51 -48 -46 -35 -67

We subset rows and columns of data frames.
We subset by position, exclusion, logical, name, and all.
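The five ways of subsetting can be sketched on a tiny stand-in for the wireless data. The data frame w_small and its values are invented so that the example runs without downloading anything.

```r
# Hypothetical miniature of the wireless data frame (values made up).
w_small <- data.frame(x  = c(225, 0, 111, 105),
                      y  = c(144, 144, 132, 132),
                      S1 = c(-92, -75, -92, -92),
                      S2 = c(-78, -92, -92, -92))

by_position  <- w_small[1:2, ]               # position: the first two rows
by_exclusion <- w_small[, -1]                # exclusion: everything but column 1
by_logical   <- w_small[w_small$x > 100, ]   # logical: rows where x exceeds 100
by_name      <- w_small[, c("S1", "S2")]     # name: the two signal columns
everything   <- w_small[, ]                  # all: empty indices keep every row and column
```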

Reformatting the data frame

Revise the Structure

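[Figure: the data frame w (columns x, y, S1, ..., S5) is restructured into the data frame newW (columns x, y, SS, AP, Dist). The signal columns S1 to S5 are stacked into SS, the access point numbers 1 to 5 go into AP, and the distances D1 to D5 are stacked into Dist.]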

The data frame ap has five rows and 2 columns (x, y) with the locations of the five access points.

X = rep(w$x, 5)
Y = rep(w$y, 5)
AP = rep(1:5, each = nrow(w))
SS = c(w$S1, w$S2, w$S3, w$S4, w$S5)

D1 = sqrt((w$x - ap[1, "x"])^2 + (w$y - ap[1, "y"])^2)
D2 = sqrt((w$x - ap[2, "x"])^2 + (w$y - ap[2, "y"])^2)
D3 = sqrt((w$x - ap[3, "x"])^2 + (w$y - ap[3, "y"])^2)
D4 = sqrt((w$x - ap[4, "x"])^2 + (w$y - ap[4, "y"])^2)
D5 = sqrt((w$x - ap[5, "x"])^2 + (w$y - ap[5, "y"])^2)

Dist = c(D1, D2, D3, D4, D5)
newW = data.frame(x = X, y = Y, AP, SS, Dist)
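The five nearly identical distance lines can also be produced in one step. The sketch below is an alternative, not the approach used on the slide, and the coordinates in w and ap are invented stand-ins so that it runs on its own.

```r
# Hypothetical stand-ins: 3 laptop locations and 5 access point locations.
w  <- data.frame(x = c(0, 3, 6), y = c(0, 4, 8))
ap <- data.frame(x = c(0, 10, 0, 10, 5), y = c(0, 0, 10, 10, 5))

# One column of distances per access point (a 3 x 5 matrix here).
D <- sapply(seq_len(nrow(ap)), function(i) {
  sqrt((w$x - ap[i, "x"])^2 + (w$y - ap[i, "y"])^2)
})

# Stacking the columns gives the same order as c(D1, D2, D3, D4, D5).
Dist <- as.vector(D)
AP   <- rep(seq_len(nrow(ap)), each = nrow(w))
```

Because as.vector reads a matrix column by column, Dist lines up with AP in the same way as in the hand-written version.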

Lists

Review

•  Data frames are actually a special kind of list.
•  Unlike a data frame, each element in a list can have a different length.
•  Actually, each element can be a list, data frame, vector, matrix, ...

[Figure: a list holding elements of different kinds: a matrix, a data frame, and a vector]
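A minimal sketch of these three points, using an invented list named stuff. It holds one element of each kind and confirms that a data frame passes the test for being a list.

```r
# Invented example: a list whose elements differ in kind and length.
stuff <- list(
  v = c(2.5, 3.1, 4.7),                       # numeric vector, length 3
  m = matrix(1:4, nrow = 2),                  # matrix with 4 cells
  d = data.frame(a = 1:2, b = c("u", "w")),   # data frame with 2 columns
  l = list("x", 2)                            # nested list of length 2
)

element_lengths <- lengths(stuff)   # length of each element
df_is_list      <- is.list(stuff$d) # a data frame is a list of columns
```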

Rainfall

•  Daily rainfall collected at 5 weather stations
•  rain is a list of length 5
   –  One element for each station
   –  Each element is a numeric vector of rain measurements
   –  Stations were not in operation for the same length of time

load(url("http://www.stanford.edu/~vcs/StatData/rainfallCO.rda"))

> class(rain)
[1] "list"
> length(rain)
[1] 5
> names(rain)
[1] "st050183" "st050263" "st050712" "st050843" "st050945"

Indexing lists

•  Lists can be indexed by name, using $.

> class(rain$st050183)
[1] "numeric"

> length(rain$st050183)
[1] 9878

> head(rain$st050183)
[1]  0 10 11  1  0  0

•  Or by [[ ]] with position or name

> class(rain[["st050945"]])
[1] "numeric"

> length(rain[["st050945"]])
[1] 3692

> head(rain[[5]])
[1]  0  0  1  0 26  0

Indexing lists

•  Lists can also be indexed like vectors, using [ ]. The result will be another list.

> class(rain["st050183"])
[1] "list"

> length(rain["st050183"])
[1] 1

Indexing lists

•  To extract individual elements of a list, enclose the index in [[ ]]. The result will be an object of the same type as the element of the list. You can only use one value in [[ ]].

> class(rain[[1]])
[1] "numeric"

> head(rain[[1]])
[1]  0 10 11  1  0  0
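The difference between [ ], [[ ]], and $ can be checked side by side. The sketch uses a made-up two-station list, rain_demo, in place of the real rain data.

```r
# Made-up stand-in for the rain list: two stations with different lengths.
rain_demo <- list(stA = c(0, 10, 11), stB = c(0, 0, 1, 0, 26))

one_bracket <- rain_demo["stA"]     # still a list, of length 1
two_bracket <- rain_demo[["stA"]]   # the numeric vector itself
dollar      <- rain_demo$stA        # same as [["stA"]]
first_two   <- rain_demo[1:2]       # [ ] accepts several indices

class(one_bracket)   # "list"
class(two_bracket)   # "numeric"
```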

Matrices and Arrays

Matrices and Arrays

•  Rectangular collection of elements
•  Dimensions are two, three, or more
•  Homogeneous primitive elements (e.g., all numeric or all character)

Arrays: matrices in higher dimensions

> x = array(1:30, c(4, 3, 2))
> x
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24

•  The integers 1, 2, ..., 24 are arranged in a 3-dimensional array. The array holds 4 x 3 x 2 = 24 elements, so the values 25 to 30 of 1:30 are not used.
•  The array has:
   –  4 rows
   –  3 columns
   –  2 panels

> x[1:2, 3, 2]
[1] 21 22

> x[ , 2, 1]
[1] 5 6 7 8

> x[3:4, c(3, 1), 1]
     [,1] [,2]
[1,]   11    3
[2,]   12    4

Summary of Data Structures

Types of structures

•  To summarize, the data structures we have encountered so far are:
   –  vector
   –  data frame
   –  list
   –  matrix

Vector, Data frame, List

•  Vector: ordered collection of primitive types
•  Data frame: ordered collection of vectors, all the same length
•  List: ordered collection of objects

Indexing data structures

•  Vectors: [index]
> x[1:10]
> x[-3]
> x[x > 3]

•  Data frames: [rowindex, colindex] and $name
> family$weight
> family[, 3:4]
> family[family$height > 70, 2]

•  Note: $ can index only one element.
•  Indexing with [ , ] returns a vector when possible, unless you use drop = FALSE.

Indexing data structures

•  Lists: $name, [index], [[index]]
> rain$stationname
> rain[1:2]
> rain[[1]]

•  Matrices: [rowindex, colindex]
> m[1, 2]
> m[1:2, ]
> m[ , "a", drop = TRUE]

•  Note: [[ ]] can index only one element. Also, we can index a matrix as if it were a vector.
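The matrix cases, including the effect of drop and indexing a matrix as a vector, can be tried on a small example. The matrix m and its column names are assumptions chosen to match the calls listed above.

```r
# Assumed example matrix: 2 rows, 3 columns named a, b, c.
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))

cell    <- m[1, 2]                  # row 1, column 2
rows    <- m[1:2, ]                 # both rows, all columns
col_vec <- m[, "a"]                 # default drop = TRUE gives a plain vector
col_mat <- m[, "a", drop = FALSE]   # drop = FALSE keeps a 2 x 1 matrix

# A single index treats the matrix as one long vector, filled column by column.
fifth <- m[5]
```

Since the matrix is stored column by column, m[5] is the first row of the third column.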

Apply Functions

The "Apply" Function

•  Sometimes we want an operation to be applied to each element of a list, to each vector in a data frame, or to individual dimensions of a matrix.
•  R provides the apply mechanism to do this.
•  There are several apply functions:
   –  sapply() and lapply() for lists and data frames
   –  apply() for matrices
   –  tapply() for "tables", i.e. ragged arrays as vectors
•  With these functions we can avoid looping, and instead write code that is meaningful in a statistical setting.
•  For example, with our list of rainfall data, each element represents the measurements taken at a particular weather station. When we think about studying the average rainfall at each station, we don’t think in terms of loops.
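To make the contrast concrete, the sketch below computes station averages twice, once with an explicit loop and once with sapply. The list rain_demo and its values are invented for the example.

```r
# Made-up stand-in for the rain list.
rain_demo <- list(stA = c(0, 10, 11, 1), stB = c(0, 0, 1, 0, 26))

# Loop version: set up storage, then fill it one station at a time.
avg_loop <- numeric(length(rain_demo))
names(avg_loop) <- names(rain_demo)
for (i in seq_along(rain_demo)) {
  avg_loop[i] <- mean(rain_demo[[i]])
}

# Apply version: "the mean of each station" in a single line.
avg_apply <- sapply(rain_demo, mean)
```

Both versions give the same named vector. The apply version reads closer to the statistical question being asked.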


Rainfall

•  Apply the mean function to each element of rain
•  Finds the average precipitation of the first station, the second station, the third station, etc.

lapply() and sapply()

•  lapply and sapply both apply a specified function to each element of a list.
•  The former returns a list object and the latter a vector when possible.

Mean rainfall at each station

> lapply(rain, mean)
$st050183
[1] 6.631707

$st050263
[1] 3.798993

$st050712
[1] 5.102299

$st050843
[1] 6.084607

$st050945
[1] 4.549296

> sapply(rain, mean)
st050183 st050263 st050712
6.631707 3.798993 5.102299
st050843 st050945
6.084607 4.549296

Additional arguments

> lapply(rain, mean, na.rm = TRUE, trim = 0.1)
$st050183
[1] 2.393978

$st050263
[1] 0.9875949

$st050712
[1] 0.7895235

$st050843
[1] 1.238481

$st050945
[1] 0.7366283

> args(lapply)
function (X, FUN, ...)

•  X takes the list object
•  FUN is the function to apply to each element in X
•  ... allows any number of arguments to be passed to FUN
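A small sketch of how arguments travel through ... to FUN. The list obs is invented and includes a missing value, so the effect of na.rm and trim can be followed by hand.

```r
# Invented data: stA has a missing value and one very large reading.
obs <- list(stA = c(0, 0, 1, 2, NA, 50), stB = c(3, 4, 5))

plain   <- sapply(obs, mean)                  # NA for stA, because of the missing value
cleaned <- sapply(obs, mean, na.rm = TRUE)    # na.rm is handed on to mean()
trimmed <- sapply(obs, mean, na.rm = TRUE,    # both extra arguments are handed on
                  trim = 0.2)
```

With trim = 0.2, mean() drops the smallest and the largest of the five remaining stA values before averaging. The large reading then no longer dominates the result.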

tapply()

•  This function is useful for applying a function to subsets of a vector.

> x
 [1]  1  2  3  4  5  6  7  8  9 10
> v
 [1] 1 1 1 0 0 0 1 1 1 0
> tapply(x, v, mean)
   0    1
6.25 5.00
> tapply(x, v, median)
  0   1
5.5 5.0
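One way to see what tapply does is to carry out the grouping by hand. The sketch below reuses x and v from the slide and compares tapply with an explicit split followed by sapply. The split-based version is shown only for comparison.

```r
x <- 1:10
v <- c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0)

by_tapply <- tapply(x, v, mean)   # group means in one call

groups   <- split(x, v)           # list with one vector of x values per level of v
by_split <- sapply(groups, mean)  # mean of each group
```

Here groups holds 4, 5, 6, 10 under "0" and 1, 2, 3, 7, 8, 9 under "1", which is where the means 6.25 and 5 come from.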

Now it’s your turn to try it out

Rainfall data

•  Maximum rainfall at each station

sapply(rain, max)

•  99th percentile of rainfall at each station

sapply(rain, quantile, probs = 0.99)

Rainfall data

•  day is a list with the same structure as rain
•  Check that the number of recordings at each station matches the number of days recorded at the corresponding station

all(sapply(rain, length) == sapply(day, length))

Rainfall data

•  Number of years each station is in operation

Year = lapply(day, floor)
Uyear = lapply(Year, unique)
OpYear = sapply(Uyear, length)

For any one station:

length(unique(floor(day[[1]])))

To do this for every station in one call, what goes in place of the "?"

sapply(day, length(unique(floor(?))))

sapply needs a function rather than a value, so we supply an anonymous function:

sapply(day, function(x) length(unique(floor(x))))

•  Proportion of days it rained at each station

sapply(rain, function(x) sum(x > 0)/length(x))

Matrices and Arrays

apply()

•  apply(x, 1, sum): for the matrix x, the sum function is applied across the columns so that the row dimension (i.e. dim 1) is preserved.

> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> apply(x, 1, sum)
[1]  9 12

apply()

•  apply(x, 2, min): for the matrix x, the min function is applied down the rows so that the column dimension (i.e. dim 2) is preserved.

> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> apply(x, 2, min)
[1] 1 3 5
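Both margins can be tried on the same matrix. The sketch rebuilds x as shown above and adds a comparison with rowSums, which is not mentioned in the slides and is included only as a cross-check.

```r
x <- matrix(1:6, nrow = 2)       # columns are filled first: 1 2 | 3 4 | 5 6

row_sums <- apply(x, 1, sum)     # one value per row
col_mins <- apply(x, 2, min)     # one value per column

# Cross-check with the built-in shortcut for row sums.
same_as_rowSums <- all(row_sums == rowSums(x))
```

The second argument names the dimension that survives: 1 leaves one result per row, 2 leaves one result per column.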