R project

10_RegularExpressions.pdf

Text Data

Election Study

•  Geographic data – longitude and latitude of the county center

•  Population data from the census for each county

•  Election results from 2008 for each county (scraped from a Web site)

Want to match/merge the information from these three different sources

What issues arise in matching?

What problems need resolving to match counties across sources?

•  Capitalization: qui vs Qui
•  County/Parish missing
•  St. vs St
•  DeWitt vs De Witt

Text mining
State of Union Addresses

•  How long are the speeches?
•  How do the distributions of certain words change over time?
•  Which presidents have given "similar" speeches?

***
State of the Union Address
George Washington
December 8, 1790

Fellow-Citizens of the Senate and House of Representatives:
In meeting you again I feel much satisfaction in being able to repeat my
congratulations on the favorable prospects which continue to distinguish
our public affairs. The abundant fruits of another year have blessed our
country with plenty and with the means of a flourishing commerce.

Text mining State of Union Addresses

•  All speeches in one large plain text file
•  Each speech starts with "***" on a line followed by 3 lines of information about who gave the speech and when

•  To mine the speeches, we want to create a word vector for each speech, which tracks the counts of how many times a particular word was said in each speech.

•  Words such as nation, national, nations should collapse to the same "word"
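A minimal sketch of the word-count idea, assuming the file has already been read into a character vector with one element per line. The miniature `lines` vector below is made up for illustration (the second speaker and both speech bodies are hypothetical), and the helper name `countWords` is not from the slides. It does not attempt to collapse nation/national/nations into one "word".

```r
# Hypothetical miniature of the speeches file: each speech starts with "***"
# followed by three lines of information (title, speaker, date).
lines <- c("***", "State of the Union Address", "George Washington",
           "December 8, 1790", "", "The nation is strong.",
           "***", "State of the Union Address", "Speaker B",
           "Date B", "", "Our nations trade; the nation grows.")

starts <- grep("^\\*\\*\\*", lines)         # lines that open a speech
ends   <- c(starts[-1] - 1, length(lines))  # last line of each speech

countWords <- function(first, last) {
  # skip the "***" marker and the 3 information lines
  body  <- paste(lines[(first + 4):last], collapse = " ")
  words <- unlist(strsplit(tolower(body), "[^a-z]+"))
  words <- words[words != ""]               # drop empty pieces
  table(words)
}

wordCounts <- mapply(countWords, starts, ends, SIMPLIFY = FALSE)
names(wordCounts) <- lines[starts + 2]      # label by speaker
wordCounts
```

Each element of `wordCounts` is a table of counts for one speech. Lining the tables up on a common vocabulary would be the next step toward comparing speeches.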

Web behavior

•  Every time you visit a Web site, information is recorded about the visit:
   – the page visited
   – date and time of visit
   – browser used
   – operating system
   – IP address

Two lines of the Web log

169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]
  "GET /stat141/Winter04 HTTP/1.1" 301 328
  "http://anson.ucdavis.edu/courses/"
  "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"

169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]
  "GET /stat141/Winter04/ HTTP/1.1" 200 2585
  "http://anson.ucdavis.edu/courses/"
  "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"

•  The information in the log has a lot of structure; for example, the date always appears in square brackets.

•  However, the information is not consistently separated by the same character, as it would be in a csv file,

•  nor is it placed consistently in the same columns in the file.

Spam filtering:
Anatomy of email message

•  Three parts:
   – header,
   – body,
   – attachments (optional).

•  Like regular mail, the header is the envelope and the body is the letter.

•  Plain text

Header:

•  date, sender, and subject
•  message id,
•  who are the carbon-copy recipients,
•  return path.

•  SYNTAX – KEY:VALUE

Example header

Date: Mon, 2 Feb 2004 22:16:19 -0800 (PST)
From: [email protected]
X-X-Sender: [email protected]
To: Txxxx Uxxx <[email protected]>
Subject: Re: prof: did you receive my hw?
In-Reply-To: <[email protected]>
Message-ID: <[email protected]>
References: <[email protected]>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: O
X-Status:
X-Keywords:
X-UID: 9079
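A small sketch of how the KEY:VALUE syntax might be pulled apart in R, using a few of the header lines above. The variable names are illustrative only, and the sketch assumes every line holds one complete KEY: VALUE pair (real headers can wrap a value onto a following line).

```r
header <- c("Date: Mon, 2 Feb 2004 22:16:19 -0800 (PST)",
            "Subject: Re: prof: did you receive my hw?",
            "MIME-Version: 1.0",
            "Content-Type: TEXT/PLAIN; charset=US-ASCII",
            "X-Status: ")

# The key is everything before the first colon.
keys <- sub(":.*$", "", header)

# The value is what remains after the first colon (and one optional blank).
values <- sub("^[^:]*: ?", "", header)

headerValues <- setNames(values, keys)
headerValues[["Subject"]]   # "Re: prof: did you receive my hw?"
```

Note that the Subject value itself contains colons, which is why the split is made only at the first colon rather than with strsplit(header, ":").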

Email

•  Body is separated from the header by a single blank line.
•  Attachment is included in the body of the message.
•  To figure out what part of the body is the message and what part is an attachment, mail programs use an Internet standard called MIME, Multipurpose Internet Mail Extensions.

•  Content-Type: has value multipart when attachments are present

   Content-Type: MULTIPART/Mixed;
     BOUNDARY="_===669732====calmail-me.berkeley.edu===_"

•  Boundary key provides a special character string to mark the beginning and end of the message parts.
•  The last component of the message is followed by a line containing the boundary string with two hyphens at the front and end of the string:

--_===669732====calmail-me.berkeley.edu===_--
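As a rough sketch, the boundary string can be recovered from the Content-Type value with functions introduced later in these slides. The variable names are illustrative, and the code assumes the boundary is the first quoted piece of the header value.

```r
contentType <- 'Content-Type: MULTIPART/Mixed; BOUNDARY="_===669732====calmail-me.berkeley.edu===_"'

# Does the header say the message has several parts?
hasAttachment <- grepl("multipart", contentType, ignore.case = TRUE)

# The boundary is the piece between the double quotes.
boundary <- strsplit(contentType, '"')[[1]][2]

# The line that closes the last part: two hyphens at the front and at the end.
closingLine <- paste("--", boundary, "--", sep = "")
closingLine
```

Comparing each line of the body against `closingLine` (and against the boundary with only the leading hyphens) is one way to locate where the parts begin and end.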


What characteristics can you derive from the email?

•  Sent in the early morning
   Numeric: hour received, 00 – 24
•  Has an Re: in the subject line
   Logical: TRUE if Re: in subject line
•  Funny words like v!@gra
   Logical: TRUE if punctuation in the middle of word
•  Lots of YELLING IN THE EMAIL
   Numeric: proportion of characters that are capitals
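One possible way to compute these four variables, shown on made-up values (the subject, date, and body strings below are hypothetical, not taken from a real message). It relies on regular-expression functions that the later slides introduce.

```r
subject  <- "Re: CHEAP v!@gra NOW"             # hypothetical subject line
received <- "Mon, 2 Feb 2004 03:16:19 -0800"   # hypothetical Date value
body     <- "BUY NOW!! best deal"              # hypothetical message body

# Hour received: the first two digits that are followed by a colon.
m    <- regexpr("[0-9]{2}:", received)
hour <- as.numeric(substring(received, m, m + 1))

# TRUE if "Re:" appears in the subject line.
isRe <- grepl("Re:", subject, fixed = TRUE)

# TRUE if punctuation sits in the middle of a word.
funnyWord <- grepl("[[:alpha:]][[:punct:]]+[[:alpha:]]", subject)

# Proportion of characters in the body that are capital letters.
chars    <- unlist(strsplit(body, ""))
propCaps <- mean(chars %in% LETTERS)

c(hour = hour, isRe = isRe, funnyWord = funnyWord, propCaps = propCaps)
```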

One example

A small problem

•  County names in census file have no "." after St, e.g. "St John"

•  County names in geographic file do have the ".", e.g. "St. John"

•  Let's find a way to update the county names in the census file to add the period after "St"

String manipulation functions

•  substring(text, first, last) – extract a portion of a character string from text, beginning at first, ending at last

•  nchar(text) – return the number of characters in a string

•  strsplit(x, split) – split the string into pieces using split to divide it
   strsplit(x, "") – splits into one-character pieces

•  paste(x, y, z, …, sep = " ", collapse = NULL) – paste together character strings separated by one blank

•  tolower(x), toupper(x) – convert upper-case characters to lower-case, or vice versa. Non-alphabetic characters are left unchanged

Test Data

> cNames
[1] "Dewitt County"
[2] "Lac qui Parle County"
[3] "St John the Baptist Parish"
[4] "Stone County"
> test = cNames[3]

One possible solution

> substring(test, 1, 2)
[1] "St"
> substring(test, 1, 2) == "St"
[1] TRUE
> newName =
+   paste("St.",
+         substring(test, 3, nchar(test)),
+         sep = "")

Do you see any problems with this?

Second possible solution

> substring(test, 1, 3)
[1] "St "
> substring(test, 1, 3) == "St "
[1] TRUE
> newName =
+   paste("St. ",
+         substring(test, 4, nchar(test)),
+         sep = "")

Do you see any problems with this?
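One way to explore the question is to apply both ideas to every name in the test data rather than to `test` alone. The sketch below wraps each idea in ifelse(), which is an addition for illustration and not part of the slides. The names `fix1` and `fix2` are likewise made up.

```r
cNames <- c("Dewitt County", "Lac qui Parle County",
            "St John the Baptist Parish", "Stone County")

# First idea: treat any name whose first two characters are "St" as a saint.
fix1 <- ifelse(substring(cNames, 1, 2) == "St",
               paste("St.", substring(cNames, 3, nchar(cNames)), sep = ""),
               cNames)

# Second idea: require the blank as well, i.e. the first three characters "St ".
fix2 <- ifelse(substring(cNames, 1, 3) == "St ",
               paste("St. ", substring(cNames, 4, nchar(cNames)), sep = ""),
               cNames)

fix1   # the 4th name becomes "St.one County"
fix2   # the 4th name stays "Stone County"
```

Both versions only inspect the start of the string, so an "St " that appears later in a name would be left alone.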

Practice with
paste(), substring(), nchar(), strsplit()

The Web log

169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]
  "GET /stat141/Winter04 HTTP/1.1" 301 328
  "http://anson.ucdavis.edu/courses/"
  "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"

•  How to extract the day of month, month, and year from the log entry?

What features of the entry are useful?

•  Date is between [ ]
•  Day, month, year are separated by /
•  Year is separated from time by :
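Those three features suggest one possible approach with strsplit() alone, sketched below on a shortened copy of the log entry. The argument fixed = TRUE asks strsplit() to treat the separator as a plain string rather than as a regular expression. The intermediate variable names are illustrative.

```r
entry <- '169.237.46.168 - - [26/Jan/2004:10:47:58 -0800] "GET /stat141/Winter04 HTTP/1.1" 301 328'

# Date is between [ ]: keep what follows "[" and precedes "]".
afterBracket <- strsplit(entry, "[", fixed = TRUE)[[1]][2]
dateTime     <- strsplit(afterBracket, "]", fixed = TRUE)[[1]][1]

# Year is separated from time by ":", so the date is the first piece.
datePart <- strsplit(dateTime, ":", fixed = TRUE)[[1]][1]

# Day, month, year are separated by "/".
pieces <- strsplit(datePart, "/", fixed = TRUE)[[1]]
day    <- as.integer(pieces[1])
month  <- pieces[2]
year   <- as.integer(pieces[3])

c(day = day, month = month, year = year)
```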

Return to St vs St.

Another idea

> string = "The Slippery St Frances"
> chars = unlist(strsplit(string, ""))
> chars
 [1] "T" "h" "e" " " "S" "l" "i" "p" "p" "e" "r"
[12] "y" " " "S" "t" " " "F" "r" "a" "n" "c" "e" "s"
> possible = which(chars == "S")
> possible
[1]  5 14
> substring(string, possible, possible + 2)
[1] "Sli" "St "
> substring(string, possible, possible + 2) == "St "
[1] FALSE  TRUE

What are we doing here?

•  Look at each character
•  Check to see if it is "S"
•  If it is, then look at the next character(s)
•  This is the idea behind regular expressions

The regular expression "St " is made up of three literal characters. The regular expression matching engine does something very similar to what we just did.

The Slippery St Frances

Found S (character 5)
Followed by t? No
Is it S? No ... Keep looking for S
Found S (character 14)
Followed by t? Yes
Followed by a blank? Yes - A match!

Luckily, we don't actually need to write our own functions for replacement. The R functions gsub() and sub() look for a pattern and replace it within a string with some other text.

The "g" in gsub() refers to global. It changes all the matches, whereas sub() only replaces the first match (in each element – both gsub() and sub() are vectorized).

> gsub("St ", "St. ", cNames)
[1] "Dewitt County"
[2] "Lac qui Parle County"
[3] "St. John the Baptist Parish"
[4] "Stone County"

> strings = c("a test", "and one and one is two",
+             "one two three")
> gsub("one", "1", strings)
[1] "a test"  "and 1 and 1 is two"  "1 two three"
> sub("one", "1", strings)
[1] "a test"  "and 1 and one is two"  "1 two three"

Regular Expressions to the Rescue

Regular Expressions

•  Regular expressions give us a powerful way of matching patterns in text data

•  Most importantly, we do this all programmatically rather than by hand, so that we can easily reproduce our work if needed.

With regular expressions, we can

•  extract pieces of text – e.g., find all links in an HTML document
•  create variables from information found in text
•  clean and transform text into a uniform format, resolving inconsistencies in format between files
•  mine text by treating documents directly as data
•  "scrape" the web for data

•  A regular expression (aka regex or regexp) is a pattern that describes a set of strings.
•  This set may be finite or infinite, depending on the particular regexp. We say the regexp "matches" each element of that set.
•  For example, the regexp

   grey|gray

   matches both grey and gray, whereas

   ^A.*

   matches any string starting with capital A.
•  The idea is similar to wildcards in UNIX, but with many more possibilities.
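The two example patterns can be tried out with grepl(), which returns TRUE or FALSE for each element (a close relative of the grep() function described later). The test vector `x` is made up for illustration.

```r
x <- c("grey", "gray", "green", "A gray day", "a grey day")

# Alternation: either spelling, anywhere in the string.
grepl("grey|gray", x)   # TRUE TRUE FALSE TRUE TRUE

# Anchored at the start: the string must begin with a capital A.
grepl("^A.*", x)        # FALSE FALSE FALSE TRUE FALSE
```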

Syntax:

•  Literal characters are matched only by the character itself.

•  A character class is matched by any single member of the specified class. For example,

   [A-Z]

   is matched by any capital letter.

•  Modifiers operate on literal characters, character classes, or combinations of the two. For example, ^ is an anchor that indicates the literal must appear at the beginning of the string

Warning

•  The syntax for regexps is extremely concise
•  It can be overwhelming if you try to read it like you would regular text.

•  Always break it down into these three components: literals, character classes, modifiers

How to find fake words?
rep1!c@ted

•  What makes this different from a regular word?

•  Numbers and punctuation surrounded by letters

•  Concepts of "numbers", "punctuation", and "regular letters" get at the idea of equivalent characters or character classes.

Equivalent Characters

•  We can enumerate any collection of characters within [ ]. Example: [Tt]his

•  The character "-" when used within the character class pattern identifies a range. Examples: [0-9], [A-Za-z]

•  If we put a caret (^) as the first character, this indicates that the equivalent characters are the complement of the enumerated characters. Example: [^0-9]

•  If we want to include the character "-" in the set of characters to match, put it at the beginning of the character set to avoid confusion. Example: [-+][0-9]

•  Note that here we have created a pattern from a sequence of two sub-patterns.
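A short illustration of the example patterns above, run with grepl() on a made-up vector `x`:

```r
x <- c("This", "this", "-5", "+7", "42", "x9")

# Enumerated class: "T" or "t", then the literal characters "his".
grepl("[Tt]his", x)     # TRUE TRUE FALSE FALSE FALSE FALSE

# Two sub-patterns in sequence: a sign, then a digit.
grepl("[-+][0-9]", x)   # FALSE FALSE TRUE TRUE FALSE FALSE

# Complement: at least one character that is not a digit.
grepl("[^0-9]", x)      # TRUE TRUE TRUE TRUE FALSE TRUE
```

Only "42" fails the last pattern, because every one of its characters is a digit.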

Named Equivalence Classes

[:alpha:]   All alphabetic
[:digit:]   Digits 0123456789
[:alnum:]   All alphabetic and numeric
[:lower:]   Lower case alphabetic
[:upper:]   Upper case alphabetic
[:punct:]   Punctuation characters
[:blank:]   Blank characters, i.e. space or tab

These can be used in conjunction with other characters, for example [[:digit:]_]
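A brief sketch of the named classes in use, on a made-up vector `x`. Note that a named class is written inside a second pair of square brackets, as in [[:digit:]].

```r
x <- c("user_1", "hello world", "3.14", "ABC")

# A digit or an underscore.
grepl("[[:digit:]_]", x)   # TRUE FALSE TRUE FALSE

# A space or tab.
grepl("[[:blank:]]", x)    # FALSE TRUE FALSE FALSE

# An upper-case letter.
grepl("[[:upper:]]", x)    # FALSE FALSE FALSE TRUE

# A punctuation character (the underscore and the period both count).
grepl("[[:punct:]]", x)    # TRUE FALSE TRUE FALSE
```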

Return to rep1!c@ted

•  What will this match?

   [[:alpha:]][[:digit:][:punct:]][[:alpha:]]

•  Can you foresee any problems with it?

Functions that use Regular Expressions

•  grep(pattern, x) looks for the regular expression pattern in the character string(s) in x. It returns the indices of the elements for which there was a match.
•  gsub(pattern, replacement, x) looks for the regular expression pattern in x and replaces the matching characters with replacement (all occurrences). sub() works the same way but only replaces the first occurrence.
•  regexpr(pattern, text) returns an integer vector giving the starting position of the first match, or -1 if there is none. The return value has an attribute "match.length" that gives the length of the matched text (or -1 for no match).
•  gregexpr(pattern, text) returns the locations of all occurrences of the pattern in each element of text. The return value is a list.
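To compare the return values side by side, here is a small sketch on a made-up vector `text` with the literal pattern "fish":

```r
text <- c("one fish two fish", "no match here", "fish")

# grep(): indices of the elements that contain a match.
grep("fish", text)                 # 1 3

# regexpr(): start of the first match in each element, -1 if none.
first <- regexpr("fish", text)
as.integer(first)                  # 5 -1 1
attr(first, "match.length")        # 4 -1 4

# gregexpr(): a list with the starts of all matches in each element.
all <- gregexpr("fish", text)
as.integer(all[[1]])               # 5 14
as.integer(all[[2]])               # -1
```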

> subjectLines
[1] "Re: 90 days"  "Fancy rep1!c@ted watches"  "It's me"
> grep("[[:alpha:]][[:digit:][:punct:]][[:alpha:]]", subjectLines)
[1] 2 3

We can either remove the apostrophe first:

> newString = gsub("'", "", subjectLines)
> grep("[[:alpha:]][[:digit:][:punct:]][[:alpha:]]", newString)
[1] 2

Or we can specify the particular punctuation marks we're looking for:

> grep("[[:alpha:]][[:digit:]!@#$%^&*():;?,.][[:alpha:]]", subjectLines)
[1] 2

gregexpr() shows exactly where the pattern was found:

> newString
[1] "Re: 90 days"  "Fancy rep1!c@ted watches"  "Its me"
> gregexpr("[[:alpha:]][[:digit:][:punct:]][[:alpha:]]", newString)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 12
attr(,"match.length")
[1] 3
[[3]]
[1] -1
attr(,"match.length")
[1] -1

Element 1: no match
Element 2: starting at 12, match of length 3
Element 3: no match

Did we miss anything?

We didn't find p1!c because it consists of four characters: a letter, a digit, a punctuation mark, and another letter.

To search for the more general pattern of any number of digits or punctuation marks between letters, we use

[[:alpha:]][[:digit:][:punct:]]+[[:alpha:]]

The plus sign indicates that members from the second character class (digits and punctuation) may appear one or more times.

The plus sign is an example of a meta character.
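To see what the more general pattern picks up, the matched text can be pulled out with regmatches(), a base R helper that is not covered in these slides. A minimal sketch on the subject line from the earlier example:

```r
subject <- "Fancy rep1!c@ted watches"
pattern <- "[[:alpha:]][[:digit:][:punct:]]+[[:alpha:]]"

m <- gregexpr(pattern, subject)
as.integer(m[[1]])           # 9: the match now starts at the "p"
regmatches(subject, m)[[1]]  # "p1!c"
```

The earlier match "c@t" no longer appears, because the "c" has already been used as the closing letter of "p1!c" and matches do not overlap.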

Meta characters

^   As the first character in the pattern, anchor for the beginning of the string/line
    e.g. ^[lg]ame matches "lame" and "game" but not the last four characters in "flame"

^   As the first character in [], exclude these
    e.g. [^[:alnum:]] matches any single character that's not a letter or number

$   End of string/line anchor
    e.g. ^[^[:lower:]]+$ (What does it match?)

Meta characters that control how many times something is repeated

?   Preceding element zero or one time
    e.g. ba? matches "b" or "ba"

+   Preceding element one or more times
    e.g. ba+ matches "ba", "baa", "baaa", and so on, but not "b"

*   Preceding element zero or more times
    e.g. ba* matches "b", "ba", "baa", and so on. Note the difference compared to the UNIX wildcard!

.   Any single character
    e.g. .* matches any character, any number of times (like * as a UNIX wildcard)

[ ]   Character class
    e.g. [a-cx-z] matches "a", "b", "c", "x", "y", or "z"

-   Range within a character class

|   Alternation, i.e. one subpattern or another
    e.g. abc|xyz matches "abc" and "xyz"

()   Identify a subpattern
    e.g. ab(c|x)yz matches "abcyz" and "abxyz"

\<   Beginning of a word

\>   End of a word

{n}   Preceding item n times

{n,}   Preceding item n or more times

{n,m}   Preceding item between n and m times (inclusive)
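A few of these meta characters in action, as a sketch on made-up strings. The patterns are anchored with ^ and $ so that the whole string has to fit, which makes the difference between the repetition operators visible. The backslashes of \< and \> are doubled because the pattern is written as an R character string.

```r
x <- c("b", "ba", "baa", "baaa")

grepl("^ba?$", x)       # TRUE TRUE FALSE FALSE   (zero or one "a")
grepl("^ba+$", x)       # FALSE TRUE TRUE TRUE    (one or more)
grepl("^ba*$", x)       # TRUE TRUE TRUE TRUE     (zero or more)
grepl("^ba{2,3}$", x)   # FALSE FALSE TRUE TRUE   (two or three)

# Subpattern with alternation.
grepl("^ab(c|x)yz$", c("abcyz", "abxyz", "abyz"))   # TRUE TRUE FALSE

# Word boundaries: "cat" only as a whole word.
grepl("\\<cat\\>", c("cat", "concatenate", "a cat sat"))   # TRUE FALSE TRUE
```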

The position of a character in a pattern determines whether it is treated as a meta character.

Examples: [-+*/], [1-9]*

When you want to refer to one of these symbols literally, you need to precede it with a backslash (\). However, this already has a special meaning in R's character strings: it's used to indicate control characters like newline (\n).

So, to refer to these symbols in R's regular expressions, you need to precede them with two backslashes.

The characters for which you need to do this are:
. ^ $ * + ? ( ) [ ] { } | \
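A quick sketch of the double-backslash rule on a made-up vector `x`. The last line uses the fixed = TRUE argument, which tells the function to treat the pattern as a plain string with no meta characters at all.

```r
x <- c("a.b", "axb", "$5", "1+1")

# Unescaped, "." is a meta character: any single character.
grepl("a.b", x)                 # TRUE TRUE FALSE FALSE

# Escaped with two backslashes, it is a literal period.
grepl("a\\.b", x)               # TRUE FALSE FALSE FALSE

# A literal dollar sign and a literal plus sign.
grepl("\\$", x)                 # FALSE FALSE TRUE FALSE
gsub("\\+", " plus ", x)        # "a.b" "axb" "$5" "1 plus 1"

# Alternative: switch off regular-expression matching entirely.
grepl("a.b", x, fixed = TRUE)   # TRUE FALSE FALSE FALSE
```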

Practice: Indicate which strings contain a match to the pattern

Strings:
"hi mabc"   "abc"   " abcd"   "abccd"   "abcabcdx"   "cab"   "abd"   "cad"

Patterns:
abc
^abc
abc.d
abc+d
abc?d
abc$
abc.*d
abc?
a[b?d]

More Practice: Write a regular expression that matches

1. only the words "cat", "at", and "t"

2. the words "cat", "caat", "caaat", and so on

3. "dog", "Dog", "dOg", "doG", "DOg", etc. (i.e., the word dog in any combination of upper and lower case) anywhere in the string

4. any number, with or without a decimal point

Greedy Matching

•  Be careful with patterns matching too much.
•  The matching is greedy in that it matches as much as possible

•  For example: when trying to remove HTML tags from a document, the regular expression <.*> will match too much but the regular expression <[^>]*> will be just right. Why?
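Running both patterns on a small made-up HTML fragment shows the difference:

```r
html <- "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" stretches from the first "<" to the last ">" in the string.
gsub("<.*>", "", html)      # ""

# "[^>]*" cannot run past a ">", so each tag is matched on its own.
gsub("<[^>]*>", "", html)   # "bold and italic"
```

With the first pattern the whole string counts as a single match, so nothing is left once it is removed.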