R project
XML$
eXtensible$Markup$Language$
XML$package$in$R$
• Handy$func8ons$for$parsing$XML$ – $readHTMLTable:$reads$an$HTML$table$into$R$ – $xmlParse:$read$an$XML$file$into$R$ – xmlValue:$retrieve$text$content$of$a$node$$ – xmlSize:$return$the$number$of$child$nodes$of$a$ node$
– xmlSApply:$applies$the$func8on$to$each$child$node$ of$a$node$
• To$read$an$XML$file$into$R,$use$xmlParse!
> doc = xmlParse(�plant.xml�)!
and$extract$the$root$node$using$xmlRoot.$
> catalog = xmlRoot(doc)! > class(catalog)! [1] "XMLNode"!
To$illustrate$how$we$manipulate$an$XML$object$in$R,$ we’ll$take$this$data$and$reformat$it$into$a$data$frame$ with$one$row$for$each$plant.$
xmlParse implements$what$is$called$the$DOM$(Document$ Object$Model)$parser.$$It$reads$the$en8re$file$into$memory.$$$
We$don�t$have$8me$to$cover$it,$but$you$should$be$aware$of$ another$parsing$model$called$SAX$(Simple$API$for$XML).$$It$ reads$the$document$incrementally$and$is$more$memory$ efficient,$but$it$is$trickier$to$use.$
The$tree$structure$is$represented$in$R$as$a$list$of$lists.$
We$can$access$an$element$within$a$node$(i.e.,$a$child),$using$ the$usual$[[ ]] indexing$for$lists.$
> ## Look at the first plant node > oneplant = catalog[[1]] > class(oneplant) [1] "XMLNode" > print(oneplant) <PLANT> <COMMON>Bloodroot</COMMON> <BOTANICAL>Sanguinaria canadensis</BOTANICAL> <ZONE>4</ZONE> <LIGHT>Mostly Shady</LIGHT> <PRICE>$2.44</PRICE> <AVAILABILITY>031599</AVAILABILITY> </PLANT> Here$is$a$leaf$nodes$
It$contains$the$content.$
• We$can$drill$down$further$into$the$list:$ > oneplant[['COMMON']]! <COMMON>Bloodroot</COMMON> !
$
• Note$that$this$doesn’t$remove$the$markup.$$To$ do$this,$use$the$func8on$xmlValue!
> xmlValue(oneplant[['COMMON']])! [1] "Bloodroot"! > xmlValue(oneplant[['BOTANICAL']])! [1] "Sanguinaria canadensis"!
• There$are$special$XML$versions$of$lapply$and$ sapply,$named$xmlApply xmlSApply.$$ Each$takes$an$XMLNode$object$as$its$primary$ argument.$$$They$iterate$over$the$node’s$ children$nodes,$invoking$the$given$func8on.$
• Like$lapply, xmlApply returns$a$list.$$Like$ sapply,$xmlSApply returns$a$simpler$ data$structure$if$possible.$$
To$create$the$data$frame:$
$
1) What$do$you$want$to$operate$on$(iterate$ over)?$
$
2)$What$do$you$want$to$produce?$
In$our$plants$example,$we$can$use$xmlSApply$to$ extract$the$common$names$of$all$the$plants.$
> common = xmlSApply(catalog, function(x){ + xmlValue(x[['COMMON']])}) > head(common) PLANT PLANT PLANT "Bloodroot" "Columbine" "Marsh Marigold" PLANT PLANT PLANT "Cowslip" "Dutchman's-Breeches" "Ginger, Wild"
The$elements$of$the$root$node$are$all$ plant$nodes,$$ like$oneplant.$
• Using$the$same$strategy,$we$can$create$a$full$ data$frame$with$all$variables.$
> getvar = function(x, var) xmlValue(x[[var]]) > varNames = names(catalog[[1]]) > res = lapply(varNames, function(var){ + xmlSApply(catalog,getvar,var)}) > plants = data.frame(res)
What$is$this$command$doing?$
XPath$
What$is$XPath$
• Extrac8on$tool$designed$for$loca8ng$content$ in$an$XML$file$
• Uses$the$hierarchy$of$a$wellXformed$XML$ document$to$specify$the$desired$chunks$to$ extract$
• Syntax$is$similar$to$but$more$powerful$than$ the$way$files$are$located$in$a$hierarchy$of$ directories$in$a$computer$file$system$
CATALOG
PLANT
COMMO N
LIGHTZONE AVAILABI
LITY BOTANI
CAL PRICE
Cardin al
Flowe r
Labeli a..
2 Shade $3.02 02229
9
PLANT
COMMO N
LIGHTZONE AVAILABI
LITY BOTANI
CAL PRICE
Bloodr oot
Sangu ...
4 Mostly Shady
$2.44 03159
9
<PLANT> <COMMON>Bloodroot </COMMON> <BOTANICAL>Sanguinaria canadensis </BOTANICAL> <ZONE>4</ZONE> <LIGHT>Mostly Shady </LIGHT> <PRICE>$2.44</PRICE> <AVAILABILITY>031599 </AVAILABILITY> </PLANT>
/CATALOG/PLANT/COMMON$
CATALOG
PLANT
COMMO N
LIGHTZONE AVAILABI
LITY BOTANI
CAL PRICE
Cardin al
Flowe r
Labeli a..
2 Shade $3.02 02229
9
PLANT
COMMO N
LIGHTZONE AVAILABI
LITY BOTANI
CAL PRICE
Bloodr oot
Sangu ...
4 Mostly Shady
$2.44 03159
9
Example$
> common = ! xpathSApply(catalog, ! "/CATALOG/PLANT/COMMON", xmlValue)! ! > head(common)! [1] "Bloodroot” "Columbine” "Marsh Marigold" ! [4] "Cowslip" "Dutchman's-Breeches” "Ginger, Wild" !
Example$
> xpX = ! paste("/CATALOG/PLANT/", varNames, sep="")! > xpx[1]! [1] "/CATALOG/PLANT/COMMON"! > res = ! sapply(xpx,function(var) ! xpathSApply(catalog, var, xmlValue))! > res[1:2, 1:3]! [1,] "Bloodroot" "Sanguinaria canadensis" "4" ! [2,] "Columbine" "Aquilegia canadensis" "3" !
<Envelope>$$ <subject>Reference$rates</subject>$$ $<Sender>$$ $$<name>European$Central$Bank</name>$$ $</Sender>$$ $<Cube>$$ $$<Cube$8me="2008X04X21">$$ $$$<Cube$currency="USD"$rate="1.5898"/>$$ $$$<Cube$currency="JPY"$rate="164.43"/>$$ $$$<Cube$currency="BGN"$rate="1.9558"/>$$ $$$<Cube$currency="CZK"$rate="25.091"/>$$ $$</Cube>$$ $$<Cube$8me="2008X04X17">$$ $$$<Cube$currency="USD"$rate="1.5872"/>$$ $$$<Cube$currency="JPY"$rate="162.74"/>$$ $$$<Cube$currency="BGN"$rate="1.9558"/>$$ $$$<Cube$currency="CZK"$rate="24.975"/>$$ $$</Cube>$$ $</Cube>$$ </Envelope>$
DRAW$THE$TREE$FOR$ THIS$DOCUMENT$ $
/Envelope/Sender/name$ $ $ $ $ $ $ 1.$The$first$loca8on$step$iden8fies$the$root$node,$ <Envelope>.$$ 2.$The$next$loca8on$step$finds$the$<Sender>$child$of$ <Envelope>.$$ 3.$The$third$loca8on$step$iden8fies$<Sender>'s$child$ called$<name>.$
Envelope
subject Sender Cube
Cube
Cube
Cube
CubeCubeCube
name
Europea...
Referenc...
... ...
1
2
3
/Envelope
/Sender
/name
/Envelope/Cube/Cube$ Envelope
subject Sender Cube
CubeCube name
Europea...
Referenc...
... ...Cube Cube CubeCube
1
2
3
/Envelope
/Cube
/Cube
This$XPath$expression$ locates$two$sibling$ <Cube>$nodes.$$
$//Cube[@currency$=$"JPY"]$
Envelope
subject Sender Cube
Cube
Cube currency=
"USD"
Cube name
Europea...
Referenc...
Cube currency
= "JPY"
Cube currency
= "BGN"
Cube currency
= "CZK"
Cube currency
= "USD"
Cube currency
= "JPY"
Cube currency
= "BGN"
Cube currency
= "CZK"
The$expression:$//Cube$ matches$all$Cube$nodes$ found$anywhere$in$the$ document$
The$predicate$$ [@currency="JPY"]$ filters$the$matches$ to$those$elements$ with$a$currency$ arribute$value$of$ “JPY”$
Your$Turn$
ISO$Country$Abbrevia8ons$
<ISO_3166X1_List_en$$xml:lang="en">$ $$$<ISO_3166X1_Entry>$ $$$$$$<ISO_3166X1_Country_name>AFGHANISTAN$ $$$$$$</ISO_3166X1_Country_name>$ $$$$$$<ISO_3166X1_AlphaX2_Code_element>AF$ $$$$$$</ISO_3166X1_AlphaX2_Code_element>$ $$$</ISO_3166X1_Entry>$ <ISO_3166X1_Entry>$ $$$$$$<ISO_3166X1_Country_name>ALBANIA$ $$$$$$</ISO_3166X1_Country_name>$ $$$$$$<ISO_3166X1_AlphaX2_Code_element>AL$ $$$$$$</ISO_3166X1_AlphaX2_Code_element>$ $$$</ISO_3166X1_Entry>$
DRAW9the9TREE9 9 Write9an9Xpath9 expressions9to9 locate9name9and9 abbrevia?ons9
World$Health$Factbook$
<factbook>$ ….$ <field$dollars="false”$unit="(deaths/1,000$live$births)"$$$$rankorder="1”$ $$$$$$$$$$$name="Infant$mortality$rate"$id="f2091">$ <descrip8on>$ This$entry$…$number$of$deaths$of$infants$under$one$year$old$in$a$given$year$ per$1,000$live$births..$ </descrip8on>$ <rank$number="121.63"$dateEs8mated="true"$$$dateLatest="2012X12X31”$ $$$$$$$$$$$dateEarliest="2012X01X01"$dateText="2012$est."$country="af"/>$ <rank$number="109.98"$dateEs8mated="true"$dateLatest="2012X12X31”$ $$$$$$$$$$$dateEarliest="2012X01X01"$dateText="2012$est."$country="ng"/>$ …$ </field>$ Write9Xpath9expression9
to9locate9country9code9 and9death9rate9
XPath$syntax$
• XPath$is$a$language$for$loca8ng$content$in$ XML$documents.$$
• An$XPath$expression$is$a$loca%on'path'that$is$ made$up$of$loca%on'steps'separated$by$/$$
• Each$step$has$three$parts:$the$axis,$nodetest,$ and$predicate$
• Syntax:$$ axis::nodetest[predicate] !
XPath$syntax$
axis::nodetest[predicate] ! !
• The$axis$is$the$direc8on$to$look$$ • The$nodetest$is$typically$a$node$name$that$you$ wish$to$locate$
• The$op8onal$predicate$filters$the$qualifying$ nodes.$
XPath$syntax$
• We$will$use$only$very$simple$Xpath$expressions$ • The$axis$will$be$either$ – $“child”,$which$is$the$default$and$can$be$dropped,$or$$ – “descendantXorXself”,$which$says$look$anywhere$down$the$ tree$and$is$abbreviated$by$“//”$
– “self”$which$is$abbreviated$with$a$.$ • The$nodetest$will$always$be$a$node$name$$ • The$predicate$is$op8onal$and$will$either$be$a$number,$ [2]$which$asks$for,$e.g.,$the$second$node,$or$an$ arribute$filter,$e.g.,$$
[@currency = “JPY” or @currency=“USD”] !
Func8ons$that$take$XPath$expressions$
• getNodeSet(xmlTree, xpathExpression) returns$a$list$of$XML$nodes$from$xmlTree that$sa8sfy$the$XPath$expression.$
• xpathSApply(xmlTree, xpath, function) the$func8on$is$applied$to$those$ nodes$in$the$XML$tree$that$sa8sfy$the$Xpath$ expression.$$The$return$value$is$a$vector$when$ possible.$xpathApply$returns$a$list.$
CIA$World$ Factbook$$ Infant$ mortality$ Rates$
Plain$text$file$of$country$loca8ons$ This$page$contains$the$average$la8tude$and$longitude$for$ countries$around$the$world.$ Source:$CIA$World$Factbook$ "iso$3166$country","la8tude","longitude"$ AD,42.5000,1.5000$ AE,24.0000,54.0000$ AF,33.0000,65.0000$ AG,17.0500,X61.8000$ AI,18.2500,X63.1667$ AL,41.0000,20.0000$ AM,40.0000,45.0000$ AN,12.2500,X68.7500$ AO,X12.5000,18.5000$ AP,35.0000,105.0000$ AQ,X90.0000,0.0000$ AR,X34.0000,X64.0000$ AS,X14.3333,X170.0000$ AT,47.3333,13.3333$ AU,X27.0000,133.0000$ $
No8ce$how$the$ country$ISO$ codes$are$used$ instead$of$ country$names$
ISO$country$name$–$country$code$ match$
• Although$we$can$edit$XML$documents$by$hand,$ being$able$to$do$it$programma8cally$is$much$ more$powerful.$
• We$will$create$a$KML$document$that$can$be$read$ in$Google$Earth.$$KML$is$a$file$format$for$ geographic$data,$based$on$XML$standards.$$See$ hrp://code.google.com/apis/kml/ documenta8on/.$
Sources$
• Country$popula8on:$CIA$World$Factbook$ • Country$infant$mortality:$CIA$World$Factbook$ • Country$loca8on:$plain$text$file$ • Country$name$abbrevia8on$map:$ISO$map$ • Merge$into$one$data$frame$ • Make$a$plot$on$Google$Earth$where$circle$size$ is$propor8onal$to$country$popula8on,$color$is$ categorized$by$infant$mortality,$centers$ correspond$to$country$la8tude$and$longitude$$$$
Genera8ng$XML$
XML$package$in$R$
Handy$func8ons$for$crea8ng$XML$ – newXMLDoc:$create$a$new$XML$document$ – newXMLNode:$create$a$new$XML$Node$ – saveXML:$save$the$XML$tree$in$a$text$file$ – parseXMLAndAdd:$parses$character$string$of$ XML$into$an$XML$node$and$adds$it$to$ XMLInternalDocument$(i.e.$a$tree)$$
Simple$Example$
doc = newXMLDoc() root = newXMLNode("toplevel", doc = doc) child1 = newXMLNode("level1", parent = root) newXMLNode("level2", "This is the content", parent = child1) > doc <?xml version="1.0"?> <toplevel> <level1> <level2>This is the content</level2> </level1> </toplevel>
Simple$Example$
doc = newXMLDoc() root = newXMLNode("toplevel", doc = doc) child1 = newXMLNode("level1", parent = root) newXMLNode("level2", "This is the content", parent = child1) saveXML(doc, file = "simple.xml") We$only$need$to$assign$to$a$variable$nodes$that$we$want$to$ refer$to$as$parents.$$$ The$names$of$the$nodes$in$R$(e.g.$root, child1)$are$not$part$of$ the$resul8ng$XML$document.$ We$can$save$the$tree$to$a$file$
Exercise:$First$diagram$the$tree$structure$of$this$ document,$then$create$it$from$within$R. <?xml version="1.0"?> <kml> <Document> <name>Earthquakes</name> <description>6+ Earthquakes, 1968-2008</description> <Folder> <name>Quakes</name> <Placemark> <Point> <coordinates>-124.95,41.04,0 </coordinates> </Point> </Placemark> <Placemark> <Point> <coordinates>-118.83,37.59,0 </coordinates> </Point> </Placemark> ... more Placemark nodes </Folder> </Document> </kml>
Programming)Languages))
we)have)seen)so)far)
• R)–)uses)control)flow)to)describe)a) computa;on)
• shell)commands)–)command)line)interface)to) the)opera;ng)system)
• regular)expressions)–)describes)a)pa>ern)but) not)how)to)find)it))
• HTML)–)describes)what)should)appear)on)a) Web)page)but)not)how)to)render)it))
• Xpath)–)describes)the)loca;on)of)content))
Two)types)of)programming:)
• Declara've)programming:)describes)a)result;) expresses)what)the)program)should)do)
without)specifying)how)to)do)it)
• Impera've)programming:)describe)the) algorithm/steps)of)a)computa;on;)uses)
control)flow,)e.g.,)looping,)assignment,)
condi;onal)statements.)
Which)are)impera;ve/declara;ve?)
Impera;ve)
• R) • shell)
Declara;ve)
• Regular)expressions) • HTML) • XPath) • SQL)
Rela;onal)Databases)and)SQL)
A)database)is)a)collec;on)of)data)with)informa;on)about)how)the)data) are)organized)(metaRdata).))A)database'server)is)like)a)web)server,)but) responds)to)requests)for)data)rather)than)web)pages.)
We’ll)talk)about)rela+onal'database'management'systems)(RDBMS)) and)how)to)communicate)with)them)using)the)structured'query' language)(SQL).) )
Why)use)a)database?)
• )Coordinate)synchronized)access)to)data) • )Change)con;nually;)give)immediate)access)to)live)data) • )Centralize)data)for)backups) • )Control)access)to)the)data)
A)RDBMS)had)three)main)parts)
• Data)defini;on) • Data)access) • Privilege)management)
We’ll)concentrate)on)data)access,)assuming)the)database)is) already)available)and)we)have)the)needed)privileges.)
Topics:)
• )using)SQL)to)extract)info)from)RDBMSs) • )rela;ng)these)back)to)similar)tasks)in)R) • )using)SQL)from)within)R)
There)are)tradeoffs)in)terms)of)what)we)choose)to)do)using) SQL)and)what)we)do)in)R.)
• A)database)is)made)up)of)one)or)more)two)dimensional) tables,)usually)stored)as)files)on)the)server.)
• An)important)concept)in)the)design)of)a)database)is) normaliza+on.))The)idea)is)to)remove)as)much)redundancy) as)possible)when)crea;ng)the)tables.))This)is)done)by) breaking)the)full)dataset)into)separate)tables.)
• The)�rela;onal�)in)RDBMS)comes)from)the)fact)that)we) then)need)to)link)the)tables)together.)
)
• For)now)let’s)talk)about)a)single)table....)
• A)table)is)a)rectangular)arrangement)of)values,) where)a)row)represents)a)case,)and)a)column)
represents)a)variable)(just)like)a)data)frame)in)
R).)))
• Another)term)for)a)table)is)a)rela+on.))The) rows)are)referred)to)as)tuples)and)the) columns)as)a7ributes.)
Missing)value)
• An)en+ty)is)the)general)object)of)interest.))For) example,)a)lab)test.))Each)row)(tuple))is)a)par;cular) occurrence)of)the)en;ty.))This)means)that)rows)in)the) table)are)unique.)
• To)iden;fy)each)row,)we)use)a)key.))A)key)is)just)an) a>ribute)or)a)combina;on)of)a>ributes)that)uniquely)
iden;fies)the)cases.)
• In)the)lab)test)example,)we)need)a)composite'key)of) both)pa;ent)ID)and)date,)since)neither)is)necessarily) unique.)
• In)R,)the)row)names)of)a)data)frame)play)a)similar)role.)
• SQL)allows)us)to)interac;vely)query)the)database) to)reduce)the)data)by)subse^ng,)grouping,)or) aggrega;on.)
• Each)database)program)tends)to)have)its)own) version)of)SQL,)but)they)all)support)the)same) basic)SQL)statements.))(We)say)statements)rather) than)commands)because)SQL)is)referred)to)as)a) declara;ve)rather)than)an)impera;ve)language.))
)
• The)SQL)statement)for)retrieving)data)is)the) SELECT)statement.))This)operates)on)one)or)more) tables.))The)result)will)always)be)another)table.)
We)have)a)table)called)chips,)with)data)about)the) CPU)development)of)PCs)over);me)
The)simplest)possible)query)gives)back)everything:)
SELECT * FROM chips;! processor | date | transistors | microns | clockspeed | width | mips ------------+------+-------------+---------+------------+-------+------ 8080 | 1974 | 6000 | 6 | 2 | 8 | 0.64 8088 | 1979 | 29000 | 3 | 5 | 16 | 0.33 80286 | 1982 | 134000 | 1.5 | 6 | 16 | 1 80386 | 1985 | 275000 | 1.5 | 16 | 32 | 5 80486 | 1989 | 1200000 | 1 | 25 | 32 | 20 Pentium | 1993 | 3100000 | 0.8 | 60 | 32 | 100 PentiumII | 1997 | 7500000 | 0.35 | 233 | 32 | 300 PentiumIII | 1999 | 9500000 | 0.25 | 450 | 32 | 510 Pentium4 | 2000 | 42000000 | 0.18 | 1500 | 32 | 1700
• By)conven;on,)we)display)SQL)statements)in)upper) case.))Statements)are)ended)by)a)semicolon.)
A>ributes)/)Variables)
• Recall)that)in)R,)we)can)select)par;cular) variables)(columns))by)name.)
chips[ , c(�mips�, �microns�)])
• The)order)of)the)variable)names)determines) the)order)in)which)they�ll)be)returned)in)the) resul;ng)data)frame.)
• The)corresponding)SQL)query)is) SELECT mips, microns FROM chips;
SQL)Syntax)
• Similar)to)a)sentence)in)English,)except)that) there’s)less)flexibility)in)the)order)of)the)
words.)
• Sentence)ends)with)a);) • Use)blanks)and)“,”)and)“=“)and)())as)delimiters) • We)will)only)look)at)SELECT)statements,)which) begin)with)the)term)SELECT))
Examples)of)SELECT)statements)
SELECT * FROM chips;! SELECT mips, microns FROM chips;! SELECT * FROM chips ! WHERE processor = �Pentium��OR! processor = �PentiumII�;! )
Selec;ng)Tuple/Row)
In)R)we)can)select)rows)that)match)a)condi;on:)
chips[chips$processor == �Pentium� |! chips$processor == �PentiumII�,]! )
The)corresponding)SQL)statement)is)
SELECT * FROM chips ! WHERE processor = �Pentium��OR! processor = �PentiumII�;! Note:))Whitespace)can)be)used)freely)in)SQL)statements.))We) oeen)separate)lines)for)clarity.))The)statement)isn’t)evaluated) un;l)the)semicolon)is)entered.)
• Aside:)In)R,)it)usually)doesn’t)ma>er)whether)you)use) single)or)double)quotes)to)surround)character)strings.)) In)SQL,)the)standard)is)to)use)single)quotes,)so)we)will) do)this)throughout)the)notes)for)both)R)and)SQL.)
• The)WHERE)clause)can)also)be)used)with)other)Boolean) operators.))The)keyword)NOT)negates)a)condi;on,)and) parentheses)can)be)used)to)clarify)order)of)evalua;on.)
SELECT * FROM chips WHERE date > 1990;!
SELECT * FROM chips WHERE NOT width = 8;!
SELECT * FROM chips WHERE NOT (width = 8 OR width = 16);!
• In)both)R)and)SQL,)we)can)do)both)types)of) subse^ng)at)once.)
• R:)
chips[chips$processor == �Pentium� |! ! !chips$processor == �PentiumII�, ! c(�mips�, �microns�)]!
• SQL:) SELECT mips, microns FROM chips! WHERE processor = �Pentium� OR ! processor = �PentiumII�;!
General)Syntax)
)
SELECT)a>ribute(s))FROM)table(s)))
)[WHERE)constraints];) )
How)would)we)pull)the)years)of)all)32Rbit)processors)that)execute)fewer)than) 250)million)instruc;ons)per)second)(mips),)
1))in)R,))
chips[ chips$mips < 250 & chips$width == 32, ! “date”]!
2))in)SQL?)
SELECT date FROM chips ! WHERE mips < 250 AND width = 32;!
[op;onal])
SQL)offers)limited)features)for)summarizing)data)RR)some) aggregate)func;ons)that)operate)over)the)rows)of)a)table,)and) some)mathema;cal)func;ons)that)operate)on)individual) values)in)a)row.)
The)aggregate)func;ons)are)
• )COUNT)R)number)of)rows) • )SUM)R)total)of)all)values)for)an)a>ribute) • )AVG)R)average)value)for)an)a>ribute) • )MIN)R)minimum)value)for)an)a>ribute) • )MAX)R)maximum)value)for)an)a>ribute)
SELECT)a>ribute(s))FROM)table(s))[WHERE)constraints];)
)))))))))))))))))))))
can)also)be)func+ons)of) a>ributes)
Exercise:) 1) How)many)rows)are)in)the)chips)table?)
SELECT COUNT(date) FROM chips;! )
2))How)many)chips)have)a>ribute)width)equal)to)32?)
SELECT COUNT(date) FROM chips WHERE width = 32;!
3))What)is)the)average)clock)speed)for)the)chips)in)ques;on)2?)
)
SELECT AVG(clockspeed) FROM chips WHERE width = 32;!
Now)answer)the)same)ques;ons)assuming)chips)is)a)data)frame)in)R.)
)nrow(chips); !! sum(chips$width == 32); ! mean(chips$clockspeed[chips$width == 32)!
Addi;onal)clauses:)GROUP BY! )
• The)GROUP BY clause)makes)the)aggregate)func;ons)in)SQL)more) useful.))It)enables)the)aggregates)to)be)applied)to)subsets)of)the) rows)in)a)table.)
SELECT width, MAX(mips) FROM chips! GROUP BY width;! width | max -------+------ 8 | 0.64 16 | 1 32 | 1700 (3 rows)
• More)than)one)a>ribute)can)be)included)in)the)GROUP)BY)clause.)
Addi;onal)clauses:)HAVING! • The)WHERE)clause)can’t)contain)an)aggregate)func;on,)but)
the)HAVING)clause)can)be)used)to)refer)to)the)groups)to) be)selected.)
SELECT width, MAX(mips) FROM chips! GROUP BY width HAVING MAX(mips) >= 1;! width | max -------+------ 16 | 1 32 | 1700 (2 rows)
• First)the)chips)table)is)separated)into)sets)of)rows)by)width.)) For)each)set,)MAX(mips) is)calculated,)and)the)set)is) discarded)if)MAX(mips) < 1.))Finally,)width)and) MAX(mips) are)returned)for)each)set.)
A)few)other)predicates)and)clauses)
)
• DISTINCT)R)forces)values)of)an)a>ribute)in)the)results)table) to)have)unique)values)
• NOT)R)negates)condi;ons)in)WHERE)or)HAVING)clause)
• LIMIT)R)limits)the)number)of)rows)returned)
SELECT * FROM chips LIMIT 3;! SELECT DISTINCT width FROM chips;!
Order)of)Execu;on)
The)order)of)execu;on)of)the)clauses)in)a)SELECT)statement)is)as) follows:)
1.)FROM:)The)working)table)is)constructed.)
2.)WHERE:)The)WHERE)clause)is)applied)to)each)row)of)the)table,)and) only)the)rows)that)test)TRUE)are)retained.)
3.)GROUP BY: The)results)are)broken)into)groups)of)rows)all)with) the)same)value)of)the)GROUP BY clause.)
4.)HAVING:)The)HAVING)clause)is)applied)to)each)group)and)only) those)that)test)TRUE)are)retained.)
5.)SELECT:)The)a>ributes)not)in)the)list)are)dropped,)aggregates)are) calculated,)and)op;ons)DISTINCT, ORDER BY and LIMIT are) applied.)
Using)SQL)with)R)
The)DBI)and)RSQLite)libraries)allow)you)to) connect)to)an)SQL)database,)submit)a)query,) and)receive)the)results)as)a)data)frame.)) library(RSQLite)!
# set up an interface to SQLite! M = dbDriver("SQLite")!
# connect to particular DBMS (no security SQLite)! con = dbConnect(m, dbname="tfile")!
# submit an SQL statement; returns a data frame! dbGetQuery(con, "SELECT * FROM chipsSQLite LIMIT 5;")!
If)the)results)are)large,)we)don’t)have)to)pull) them)all)over)into)R)at)once.) # submit an SQL statement; but keep results in SQL! rs = dbSendQuery(con, "SELECT * FROM chipsSQLite;")! ! # Retrieve the first 5 rows in the results! fetch(rs, n = 5)! ! # Retrieve the next 3 rows! fetch(rs, n = 3)! ! # Retrieve the remaining rows! fetch(rs, n = -1)! ! # Close the query! dbClearResult(rs)!
We)can)of)course)import)the)whole)table,)
then)extrac;ng)what)we)want)using)R)
commands.)
)
But,)it)may)be)more)efficient)to)use)SELECT)
to)extract)and)import)only)the)results)with)
which)we)want)to)work)
Mul;ple)Tables)
)
• Where)do)you)see)redundancy?) • What)might)an)en;ty)be?)(e.g.)lab)test))
)
• Where)do)you)see)redundancy?) • What)might)an)en;ty)be?)(e.g.)lab)test))
BANK)BRANCH)
)
• Where)do)you)see)redundancy?) • What)might)an)en;ty)be?)(e.g.)lab)test))
BANK)BRANCH))
)
)
CUSTOMER))
)
• Where)do)you)see)redundancy?) • What)might)an)en;ty)be?)(e.g.)lab)test))
BANK))
)
)
CUSTOMER))
)
)
ACCOUNT))
En;;es)
• Customer:)) – Customers)can)have)more)than)one)account)
• Account:) )) – Accounts)can)be)owned)by)more)than)one) customer;))
– An)account)is)in)one)branch)) • Branch:))) – Branches)can)have)many)accounts)
• Here)are)tables)for)each)of)the)three)en;;es) we)iden;fied.)
We)have)) informa'on) about:) Customers)) Accounts)) Branch)
Is) anything) missing?) ) )
Missing)Informa;on)
• Rela;ons:) – Customers)have)accounts) – Accounts)are)located)in)Branches)
)
• How)do)we)bring)this)informa;on)into)the) database?)
– Add)branch)informa;on)to)the)account)table) – Create)a)new)table)that)holds)account,)customer) pairs)
• Customers)and)their)bank)accounts)are)linked) via)the)registra;on)table)–)this)is)a)rela;on:)
No;ce:)
Account)201)
appears)twice)
Customer)3)
appears)three)
;mes)
• We)could)put)this)informa;on)in)one)big)table,) but)this)would)introduce)redundancy.)
)
• There)can)be)speed)and)memory)advantages) to)working)with)the)smaller)tables,)extrac;ng) just)the)informa;on)we)need.)
Examples)of)extrac;ons)
Warmup)
• The)names)and)addresses)of)all)customers.)) SELECT Name,Addr FROM Customers; ! Name Address! 1 Smith, J 101 Elm! 2 Smith, D 101 Elm! 3 Brown, D 17 Spruce!
Warmup)
• All)of)the)a>ributes)in)the)Registra;on)table) for)accounts)belonging)to)Customer)3.))
! SELECT * FROM Registration WHERE CID = 3; ! row_names CID AcctNo! 1 4 ! !3 203! 2 5 ! !3 301! 3 6 ! !3 302!
Warmup)
• Account)numbers)for)accounts)that)don’t) meet)minimum)$100)balance)in)City)branch.))
SELECT AcctNo FROM Accounts ! WHERE Branch = 'City' ! AND Balance < 100;! AcctNo! 1 201!
• The)total)balance)for)all)accounts)belonging)to) a)branch.))
SELECT Branch, SUM(Balance) ! FROM Accounts! GROUP BY Branch;! Branch SUM(Balance)! 1 City 1129! 2 Suburb 180!
• The)total)overdrawn)amount)at)a)branch,)for) those)branches)which)have)an)account)with)at)
least)$100)overdrawn.))
SELECT Branch, SUM(Balance) FROM Accounts! WHERE Balance < 0 ! GROUP BY Branch ! HAVING Min(Balance) < -100;! [1] Branch SUM(Balance)! <0 rows> (or 0-length row.names)!
• Suppose)we)want)the)name(s))of)the)owner(s)) for)every)account))
• The)customer)name)is)in)Customers)) • The)account)number)is)in)Registra;on)) • For)each)account)in)Registra;on,)we)want)the) associated)customer)names.)
• We)need)to)joint)two)tables)to)get)this) informa;on)
)
Joining)tables) SELECT Name, AcctNo ! FROM Customers,Registration ! WHERE Customers.CustNo = Registration.CID;! ))))) !Name AcctNo! 1 Smith, J 201! 2 Smith, D 201! 3 Smith, D 202! 4 Brown, D 203! 5 Brown, D 301! 6 Brown, D 302! • Combining)two)tables)in)this)way)is)called)an)inner'join.' • Note)how)we)refer)to)columns)within)a)given)table)using)the)
form)tableName.column)(�dot)nota;on�).))This)isn’t)strictly) necessary)when)there’s)only)one)possible)match,)but)it) improves)clarity.)
)
• We)can)use)AS)to)rename)tables)or) columns.))This)can)save)a)lot)of)typing.)
SELECT Name, AcctNo AS Account ! FROM Customers AS C, Registration AS R ! WHERE C.CustNo = R.CID;! ! Name Account! 1 Smith, J 201! 2 Smith, D 201! 3 Smith, D 202! 4 Brown, D 203! 5 Brown, D 301! 6 Brown, D 302!
• The)customer)name,)account)number,)and)balance)of) the)account)for)all)accounts.))
SELECT C.Name, A.AcctNo, A.Balance ! FROM Customers C, Registration R, ! Accounts A ! WHERE C.CustNo = R.CID AND ! A.AcctNo = R.AcctNo;! ! Name AcctNo Balance! 1 Smith, J 201 12! 2 Smith, D 201 12! 3 Smith, D 202 1000! 4 Brown, D 203 117! 5 Brown, D 301 10! 6 Brown, D 302 170!
• Give)the)total)balance)for)each)customer)with) mul;ple)accounts.))
SELECT CID, SUM(Balance) ! FROM Registration R, Accounts A ! WHERE A.AcctNo = R.AcctNo ! GROUP BY R.CID HAVING COUNT(*) > 1; ! CID SUM(Balance)! 1 2 1012! 2 3 297!
Other)types)of)Databases)
• NoSQL)) – data)bases)that)don’t)use)SQL) – Not)rela;onal) – JSON)–)style)data)structures))
• NoSQL)RDBMS) – Data)base)that)doesn’t)use)SQL,)but)is)rela;onal) – Shell)based)
• Example:)MongoDB)(from)"humongous"))is)a) scalable,)highRperformance)