
University of North Texas


The sed Stream Editor


• sed is a non-interactive, line-oriented stream editor that processes one line at a time

– Useful for text processing, especially for performing in-place substitution

– sed can make global substitutions of matched regex patterns with specific text

• Example

– How can we change all occurrences of the word "the" or "The" to uppercase "THE" in a file called file1?

sed -r "s/(The|the)/THE/g" file1

– Note: this pattern also matches "the" inside longer words such as "other"


• Usage: sed -r "s/REGEX/TEXT/g" filename

– Substitutes (replaces) occurrence(s) of REGEX with the given TEXT

– If filename is omitted, reads from standard input

– sed has other uses, but most can be emulated with substitutions

– Resulting output goes to the terminal (standard output)

• If permanent changes are wanted, redirect the output to a new file or make the changes in place using the -i option

• Example

– Replaces all occurrences of 143 with 390 in file2.txt

sed -r "s/143/390/g" file2.txt
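A minimal sketch of the two ways to keep the result, assuming GNU sed (for -r and -i). The scratch directory and the sample text are made up for illustration.

```shell
# Assumption: GNU sed; the scratch directory and sample data are hypothetical.
dir=$(mktemp -d)
printf 'room 143\ncall 143 or 1430\n' > "$dir/file2.txt"

# Option 1: redirect the output to a new file; the input file stays unchanged.
sed -r "s/143/390/g" "$dir/file2.txt" > "$dir/file2.new"
before=$(cat "$dir/file2.txt")
new=$(cat "$dir/file2.new")

# Option 2: edit the file in place with -i.
sed -r -i "s/143/390/g" "$dir/file2.txt"
after=$(cat "$dir/file2.txt")

printf '%s\n' "$after"
```

Note that the pattern 143 also matches inside 1430, so a plain global substitution rewrites every occurrence of those three characters.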


• sed is line-oriented; it processes input a line at a time

• -r option enables extended regular expressions

– ( ) , | , + , ? work as operators without backslashes; [ ] and * behave the same with or without -r

• s for substitute

• g flag after last / asks for a global match (replace all)

• Special characters must be escaped to match them literally

sed -r "s/http:\/\//https:\/\//g" urls.txt

• sed can use delimiters besides / to make commands more readable

sed -r "s#http://#https://#g" urls.txt

• Example

sed -r "s/([A-Za-z]+), ([A-Za-z]+)/\2 \1/g" names.txt
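As a rough illustration of the last two commands, the sketch below feeds them made-up input through a pipe instead of the files names.txt and urls.txt (whose contents the slides do not show). It assumes GNU sed for -r.

```shell
# Assumption: GNU sed; the sample names and URL are hypothetical.
# \1 and \2 refer to the first and second parenthesized groups.
names=$(printf 'Turing, Alan\nHopper, Grace\n' |
  sed -r "s/([A-Za-z]+), ([A-Za-z]+)/\2 \1/g")

# Using # as the delimiter avoids escaping the slashes in the URL.
urls=$(printf 'http://example.com/a\n' | sed -r "s#http://#https://#g")

printf '%s\n%s\n' "$names" "$urls"
```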


sed Usage

• Edit files too large for interactive editing

• Edit any size files where editing sequence is too complicated to type in interactive mode

• Perform “multiple global” editing functions efficiently in one pass through the input

• Edit multiple files automatically

• Good tool for writing conversion programs


sed Syntax

sed [-n] [-e] ['command'] [file…]

sed [-n] [-f script] [file…]

• Options

-n only print lines specified with print command (or 'p' flag of substitute ('s') command)

-f script next argument is filename containing editing commands

If first line of script is "#n", acts as if -n had been specified

-e command next argument is an editing command rather than filename, useful if multiple commands are specified


How Does sed Work?

• sed reads line of input

– Line of input is copied into a temporary buffer called pattern space

– Editing commands are applied

• Subsequent commands are applied to line in the pattern space, not the original input line

• Once finished, line is sent to output (unless -n option was used)

– Line is removed from pattern space

• sed reads next line of input, until end of file

• Note that input file is unchanged!
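The cycle described above can be observed with a small sketch. The input words are made up; the point is that the second command sees the result of the first, and that -n suppresses the automatic output.

```shell
# Each -e command operates on the pattern space as left by the previous one,
# so "cat" becomes "dog" and then "bird" within a single cycle.
chained=$(printf 'cat\n' | sed -e 's/cat/dog/' -e 's/dog/bird/')

# With -n nothing is printed unless a command asks for it (here the p flag),
# so the unmatched line "dog" produces no output.
quiet=$(printf 'cat\ndog\n' | sed -n -e 's/cat/bird/p')

printf '%s\n%s\n' "$chained" "$quiet"
```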


sed Scripts

• A script is nothing more than a file of commands

• Each command consists of an address and an action, where the address can be a regular expression or line number

[Figure: a script drawn as a column of commands, each consisting of an address and an action]


• As each line of the input file is read, sed reads the first command of the script and checks the address against the current input line:

– If there is a match, command executed

– If there is no match, command ignored

– sed then repeats this action for every command in script file

• When it has reached the end of the script, sed outputs the current line (pattern space) unless the -n option has been set
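A small sketch of a script file run with -f. The file name demo.sed and the input lines are hypothetical; each script line is an address followed by an action, except the last, which has no address and so applies to every line.

```shell
# Assumption: demo.sed and the sample input are made up for this sketch.
dir=$(mktemp -d)

# Script: delete line 1, delete lines starting with #, then substitute
# foo with bar on whatever lines are still being processed.
cat > "$dir/demo.sed" <<'EOF'
1d
/^#/d
s/foo/bar/
EOF

out=$(printf 'header\n# comment\nfoo 1\nbaz 2\n' | sed -f "$dir/demo.sed")
printf '%s\n' "$out"
```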


Flow of Control

• sed then reads the next line in the input file and restarts from the beginning of the script file

• All commands in the script file are compared to, and potentially act on, all lines in the input file

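[Figure: flow of control. Each input line passes through cmd 1, cmd 2, ..., cmd n of the script; a print command writes to the output directly, while the pattern space reaches the output at the end of the script only without -n]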


sed Commands

• sed commands have the general form

[address[, address]][!]command [arguments]

• sed copies each input line into a pattern space

– If address of the command matches line in pattern space, command is applied to that line

– If command has no address, it is applied to each line as it enters pattern space

– If a command changes the line in pattern space, subsequent commands operate on the modified line

• When all commands have been read, the line in pattern space is written to standard output and a new line is read into pattern space


Addressing

• Address determines which lines in the input file are to be processed by the command(s)

– Either a line number or a pattern, enclosed in slashes / … /

– If no address is specified, then command is applied to each input line

• Most commands will accept two addresses

– If only one address is given, command operates only on the line(s) matching that address

– If two comma separated addresses are given, then command operates on a range of lines between the first and second address, inclusively

• The ! operator can be used to negate an address

– Command applied to all lines that do NOT match address
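The address forms above can be tried with the print command and -n, so that only addressed lines appear. This is a sketch on made-up input; the helper function lines is hypothetical.

```shell
# Hypothetical helper that emits five sample lines.
lines() { printf 'one\ntwo\nthree\nfour\nfive\n'; }

single=$(lines | sed -n '2p')        # line number address
pattern=$(lines | sed -n '/^t/p')    # regular expression address
range=$(lines | sed -n '2,/^f/p')    # line 2 through the next line starting with f
negated=$(lines | sed -n '2,4!p')    # every line outside the range 2-4

printf '%s\n' "$single" "$pattern" "$range" "$negated"
```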


Commands

• Command is a single letter

• Example:

– Deletion: d

[address1][,address2]d

• Delete the addressed line(s) from the pattern space

– Line(s) not passed to standard output

• A new line of input is read and editing resumes with the first command of the script


Delete Address-Command Examples

d deletes all lines

6d deletes line 6

/^$/d deletes all blank lines

1,10d deletes lines 1 through 10

1,/^$/d deletes from line 1 through the first blank line

/^$/,$d deletes from first blank line through last line of file

/^$/,10d deletes from first blank line through line 10

/^ya*y/,/[0-9]$/d deletes from first line that begins with yy, yay, yaay, yaaay, etc. through first line that ends with a digit


Delete Command (d) Example

• Remove Part-time data from “tuition.data” file

Input data:

cat tuition.data
Part-time 1003.99
Two-thirds-time 1506.49
Full-time 2012.29

Output after applying delete command:

sed -e '/^Part-time/d' tuition.data
Two-thirds-time 1506.49
Full-time 2012.29


Multiple Commands

• Braces { } used to apply multiple commands to an address

[address][,address]{
command1
command2
command3
}

• The opening brace { must be the last character on a line

• The closing brace } must be on a line by itself

– No spaces following the braces

• Alternatively, use “;” after each command:

[address][,address]{command1; command2; command3; }

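• Or, without braces: '[address][,address]command1; command2; command3'

– Without braces, the address applies only to command1; command2 and command3 run on every line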
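A short sketch of the difference between grouping commands with braces and listing them with semicolons only, assuming GNU sed (which accepts the one-line brace form). The two-line input is made up.

```shell
# Assumption: GNU sed; sample input is hypothetical.
# With braces, both substitutions are limited to lines matching /^a/.
grouped=$(printf 'a1\nb1\n' | sed '/^a/{s/a/A/; s/1/2/; }')

# Without braces, the address covers only the first command;
# the second substitution runs on every line.
ungrouped=$(printf 'a1\nb1\n' | sed '/^a/s/a/A/; s/1/2/')

printf '%s\n%s\n' "$grouped" "$ungrouped"
```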


sed Commands

• sed contains many editing commands, though only a few are mentioned here

s substitute
a append
i insert
c change
d delete
p print
r read
w write
y transform
= display line number
N append next line to current one
q quit


Print

• Print command (p) used to force pattern space to be output, useful if -n option has been specified

• Syntax: [address1[,address2]]p

– Note: if -n or #n option has not been specified, p will cause the line to be output twice!

• Examples:

1,5p will display lines 1 through 5

/^$/,$p will display lines from first blank line through last line of file


Substitute

• Syntax: [address(es)]s/pattern/replacement/[flags]

– pattern : search pattern

– replacement : replacement string for pattern

– flags : optionally any of the following

n a number from 1 to 512 indicating which occurrence of pattern should be replaced

g global, replace all occurrences of pattern in pattern space

p print contents of pattern space


Substitute Examples

s/Puff Daddy/P. Diddy/

– Substitute P. Diddy for the first occurrence of Puff Daddy in pattern space

s/Four/Five/2

– Substitutes Five for the second occurrence of Four in the pattern space (i.e., each line)

s/paper/plastic/p

– Substitutes plastic for the first occurrence of paper and outputs (prints) pattern space
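The flags can be compared side by side on one line of input. This is a sketch with made-up text, not output taken from the slides.

```shell
# No flag: only the first occurrence on the line is replaced.
first=$(echo 'Four Four Four' | sed 's/Four/Five/')

# Numeric flag: only the second occurrence is replaced.
second=$(echo 'Four Four Four' | sed 's/Four/Five/2')

# g flag: every occurrence is replaced.
all=$(echo 'Four Four Four' | sed 's/Four/Five/g')

# p flag with -n: print only the lines where a substitution happened.
printed=$(printf 'paper\nglass\n' | sed -n 's/paper/plastic/p')

printf '%s\n' "$first" "$second" "$all" "$printed"
```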


Replacement Patterns

• Substitute can use several special characters in the replacement string

& replaced by entire string matched in regular expression for pattern

\n replaced by nth substring (or sub-expression) previously specified using "\(" and "\)"

\ used to escape the ampersand (&) and the backslash (\)


Replacement Pattern Examples

"the UNIX operating system …"

s/.NI./wonderful &/

--> "the wonderful UNIX operating system …"

"unix is fun"

sed 's/\([[:alpha:]]\)\([^ \n]*\)/\2\1ay/g'

--> "nixuay siay unfay"

cat file3
first:second
one:two

sed 's/\(.*\):\(.*\)/\2:\1/' file3
second:first
two:one
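The special replacement characters can be checked with a few one-liners. The "Tom Jerry" input is made up to show the escaped ampersand; the other two inputs reuse the examples above.

```shell
# & stands for the whole matched text.
whole=$(echo 'the UNIX operating system' | sed 's/.NI./wonderful &/')

# \& inserts a literal ampersand instead of the matched text.
literal=$(echo 'Tom Jerry' | sed 's/ / \& /')

# \1 and \2 refer to the groups captured with \( and \).
swapped=$(echo 'first:second' | sed 's/\(.*\):\(.*\)/\2:\1/')

printf '%s\n' "$whole" "$literal" "$swapped"
```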


Append, Insert, and Change

• Syntax for these commands is a little strange because they must be specified on multiple lines

• Append

[address]a\
text

• Insert

[address]i\
text

• Change

[address(es)]c\
text

• Append (a) and Insert (i) take a single address only, not a range


Append Command (a) Example

sed script to append dashed line after each input line:

cat tuition.append.sed
a \
--------------------------

Input data:

cat tuition.data
Part-time 1003.99
Two-thirds-time 1506.49
Full-time 2012.29

Output after applying the append command:

sed -f tuition.append.sed tuition.data
Part-time 1003.99
--------------------------
Two-thirds-time 1506.49
--------------------------
Full-time 2012.29
--------------------------


Insert Command (i) Example

sed script to insert “Tuition List” as report title before line 1:

cat tuition.insert.sed
1 i\
Tuition List\

Input data:

cat tuition.data
Part-time 1003.99
Two-thirds-time 1506.49
Full-time 2012.29

Output after applying the insert command:

sed -f tuition.insert.sed tuition.data
Tuition List
Part-time 1003.99
Two-thirds-time 1506.49
Full-time 2012.29


Change Command (c) Example

sed script to change tuition cost from 1003.99 to 1100.00:

cat tuition.change.sed
1 c\
Part-time 1100.00

Input data:

cat tuition.data
Part-time 1003.99
Two-thirds-time 1506.49
Full-time 2012.29

Output after applying the change command:

sed -f tuition.change.sed tuition.data
Part-time 1100.00
Two-thirds-time 1506.49
Full-time 2012.29


Complement (!) Operator

• If an address is followed by an exclamation point (!), the associated command is applied to all lines that don’t match the address or address range

• Examples:

/black/!s/cow/horse/

– Substitutes horse for cow on all lines except those that contain black

"The brown cow" --> "The brown horse"

"The black cow" --> "The black cow"

1,5!d

– Deletes all lines except 1 through 5

• Print lines that do not contain “obsolete” (-n is needed, otherwise every line is printed and the matching ones twice)

sed -n -e '/obsolete/!p' input-file


Read and Write File Commands

• Syntax: r filename

– Queue contents of filename to be read and inserted into output stream at end of current cycle, or when next input line is read

• If filename cannot be read, it is treated as if it were an empty file, without any error indication

• Syntax: w filename

– Write the pattern space to filename

– The filename will be created (or truncated) before the first input line is read

– All w commands which refer to the same filename are output through the same FILE stream


cat tmp
one two three
one three five
two four six

sed 'r tmp'
My first line of input   ---> no read until the first line is taken from the input
My first line of input
one two three
one three five
two four six
My next line
My next line
^D

sed 'w tmp1'
hello 1
hello 1
hello 2
hello 2
hello 3
hello 3
^D

cat tmp1
hello 1
hello 2
hello 3
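A non-interactive sketch of the same two commands, assuming GNU sed. The scratch directory, file names, and sample lines are made up; an address is added to each command so the effect is easier to see.

```shell
# Assumption: GNU sed; file names and contents are hypothetical.
dir=$(mktemp -d)
printf 'one two three\n' > "$dir/tmp"

# r: the file contents are queued and emitted after line 1 has been output.
read_out=$(printf 'first\nsecond\n' | sed "1r $dir/tmp")

# w: lines starting with "keep" are written to a file; -n keeps stdout empty.
printf 'keep 1\ndrop\nkeep 2\n' | sed -n "/^keep/w $dir/kept"
kept=$(cat "$dir/kept")

printf '%s\n%s\n' "$read_out" "$kept"
```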


Line Number

• Line number command (=) writes the current line number before each matched/output line

• Examples:

sed -e '/Two-thirds-time/=' tuition.data

sed -e '/^[0-9][0-9]/=' inventory

sed '=' tmp1
1
hello1
2
hello2
3
hello3

sed -n '=' tmp1
1
2
3


Transform

• Transform command (y) operates like tr, doing a one-to-one or character-to-character replacement

– Accepts zero, one or two addresses

[address[,address]]y/abc/xyz/

– Every a within the specified address(es) is transformed to an x, b to y and c to z

• Examples

y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/

– Changes all lower case characters on addressed line to upper case

sed -e '1,10y/abcd/wxyz/' datafile

– Both character lists must have the same number of characters


Quit

• Syntax: [addr]q

– Quit (exit sed) when addr is encountered

• It takes at most a single line address

– Once a line matching the address is reached, script will be terminated

– Can be used to save time when you only want to process some portion of the beginning of a file

• Example

– To print the first 100 lines of a file (like head)

sed '100q' filename

– sed will, by default, send the first 100 lines of filename to standard output and then quit processing


The gawk Programming Language


• A scripting language used for manipulating data and generating reports

– Geared towards working with delimited fields on a line-by-line basis

• Summary of gawk operations

– Scans a file line-by-line

– Splits each input line into fields

– Compares each input line and fields to the specified pattern

– Performs the requested action(s) on lines matching the specified pattern


Structure of a gawk Program

• A gawk program consists of:

– An optional BEGIN segment

• For processing to execute prior to reading input

– Pattern – Action pairs

• Processing for input data

• For each pattern matched, the corresponding action is taken

– An optional END segment

• Processing after end of input data

BEGIN {action}

pattern {action}

pattern {action}

.

.

.

pattern { action}

END {action}


Running a gawk Program

• There are several ways to run a gawk program

gawk 'program' input_file(s)

– Program and input files are provided as command-line arguments

gawk 'program'

– Program is a command-line argument

– Input is taken from standard input

gawk -f program_file input_files

– Program is read from a file
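The three invocation styles can be sketched as follows. The data file, the program file prog.awk, and the AWK variable are hypothetical; the fallback to plain awk when gawk is not installed is an assumption of this sketch, which uses only features common to both.

```shell
# Assumption: use gawk if present, otherwise whatever awk is installed.
AWK=$(command -v gawk || command -v awk)

dir=$(mktemp -d)
printf 'a 1\nb 2\n' > "$dir/data"
printf '{ print $2 }\n' > "$dir/prog.awk"

# 1) Program and input file both given on the command line.
from_file=$("$AWK" '{ print $2 }' "$dir/data")

# 2) Program on the command line, input from standard input.
from_stdin=$("$AWK" '{ print $2 }' < "$dir/data")

# 3) Program read from a file with -f.
from_script=$("$AWK" -f "$dir/prog.awk" "$dir/data")

printf '%s\n' "$from_file" "$from_stdin" "$from_script"
```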


Patterns and Actions

• Search a set of files for patterns

• Perform specified actions upon lines or fields that contain instances of patterns

• Does not alter input files

• Process one input line at a time

• This is similar to sed


Pattern-Action Structure

• Every program statement has to have a pattern or an action or both

• Default pattern is to match all lines

• Default action is to print current record

• Patterns are simply listed; actions are enclosed in { }

– Some actions can be similar to C code

• gawk scans a sequence of input lines, or records, one by one, searching for lines that match the pattern

– Meaning of match depends on the pattern


Patterns

• Selector that determines whether action is to be executed

• Pattern can be:

– Special token BEGIN or END

– Regular expression (enclosed with / /)

– Relational or string match expression

– ! negates the match

– Arbitrary combination of the above using && and/or ||

• /UNT/ matches if the string “UNT” is in the record

• x > 0 matches if the condition is true

• /UNT/ && (name == "UNIX Tools")
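A sketch of the pattern kinds listed above, each used without an action so that matching records are simply printed. The three-record input and the helper function data are made up; the AWK fallback is an assumption of this sketch.

```shell
# Assumption: use gawk if present, otherwise whatever awk is installed.
AWK=$(command -v gawk || command -v awk)

# Hypothetical helper that emits three sample records.
data() { printf 'UNT 5\nUT 12\nUNT 20\n'; }

regex=$(data | "$AWK" '/UNT/')                # regular expression
relational=$(data | "$AWK" '$2 > 10')         # relational expression
negated=$(data | "$AWK" '!/UNT/')             # negated match
combined=$(data | "$AWK" '/UNT/ && $2 > 10')  # combination with &&

printf '%s\n' "$regex" "$relational" "$negated" "$combined"
```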


BEGIN and END Patterns

• BEGIN and END provide a way to gain control before and after processing, for initialization, and wrap-up

BEGIN

– Actions are performed before first input line is read

END

– Actions are done after the last input line has been processed.


Actions

• Action may include a list of one or more C like statements, as well as arithmetic and string expressions and assignments and multiple output streams

• Action performed on every line that matches pattern

– If pattern not provided, action performed on every input line

– If action not provided, all matching lines sent to standard output

• Since patterns and actions are optional, actions must be enclosed in braces to distinguish them from pattern


Introductory Example

ls | gawk '

BEGIN { print "List of html files:" }

/\.html$/ { print }

END { print "There you go!" }

'

List of html files:

index.html

as1.html

as2.html

There you go!


Variables

• gawk scripts can define and use variables

BEGIN { sum = 0 }

{ sum++ }

END { print sum }

• Some variables are predefined


Basic gawk Terminology

• gawk supports two types of buffers

– Field

• A unit of data in a line separated from other fields by the field separator

– Record

• A collection of fields in a line (file made up of records)

• Default field separator is whitespace

• Namespace for fields in current record: $1, $2, etc.

– The $0 variable contains the entire record (i.e., line)

• Example

– Given line of input: "This class is fun!"

– $1 = "This", $2 = "class", etc.


Records

• Default record separator is newline

– By default, gawk processes its input one line at a time

• Could be any other regular expression

• RS: record separator

– Can be changed in BEGIN action

• NR is the variable whose value is the number of the current record.


Fields

• Each input line is split into fields

– FS: field separator

• Default is whitespace (1 or more spaces or tabs)

– gawk -Fc option sets FS to the character c

• Can also be changed in BEGIN

– $0 is the entire line

– $1 is the first field, $2 is the second field, ….

• Only fields begin with $, variables do not


Some gawk System Variables

• gawk supports a number of system variables

– FS Field separator (default = space)

– RS Record separator (default = \n)

– NF Number of fields in current record

– NR Number of the current record

– OFS Output field separator (default = space)

– ORS Output record separator (default = \n)

– FILENAME Current filename

– ARGC/ARGV Get arguments from command line
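A few of these variables in action, as a sketch on made-up input. The AWK fallback to plain awk is an assumption; single-character separators are used so the commands behave the same in either implementation.

```shell
# Assumption: use gawk if present, otherwise whatever awk is installed.
AWK=$(command -v gawk || command -v awk)

# FS set with -F on the command line, or inside BEGIN.
by_option=$(echo 'root:x:0' | "$AWK" -F: '{ print $1, $3 }')
by_begin=$(echo 'root:x:0' | "$AWK" 'BEGIN { FS = ":" } { print $1, $3 }')

# RS changes what counts as a record; NR numbers the records.
records=$(printf 'a;b;c' | "$AWK" 'BEGIN { RS = ";" } { print NR ": " $0 }')

# OFS is placed between comma-separated print arguments; NF counts fields.
joined=$(printf 'a b c\nd e\n' | "$AWK" 'BEGIN { OFS = "-" } { print NR, NF, $1 }')

printf '%s\n' "$by_option" "$by_begin" "$records" "$joined"
```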


Simple Output from gawk

• Printing every line

– If an action has no pattern, the action is performed on all input lines

• { print } prints all input lines to standard out

• { print $0 } will do the same thing

• Printing certain fields

– Multiple items can be printed on the same output line with a single print statement

– { print $1, $3 }

– Expressions separated by a comma are, by default, separated by a single space when printed (OFS)


More Output from gawk

• NF, the Number of Fields

– Any valid expression can be used after a $ to indicate the contents of a particular field

– One built-in expression is NF, or Number of Fields

– { print NF, $1, $NF } will print number of fields, first field, and last field in the current record

– { print $(NF-2) } prints the third to last field

• Computing and printing

– You can also do computations on the field values and include the results in your output

– { print $1, $2 * $3 }


• Printing line numbers

– The built-in variable NR can be used to print line numbers

– { print NR, $0 } prints each line prefixed with its line number

• Putting text in the output

– You can also add other text to the output besides what is in the current record

– { print "total pay for", $1, "is", $2 * $3 }

– Note that the inserted text needs to be surrounded by double quotes


Formatted Output from gawk

• Lining up fields

– Like C, gawk has a printf function for producing formatted output

– printf has the form printf( format, val1, val2, val3, … )

{ printf("total pay for %s is $%.2f\n", $1, $2 * $3) }

– When using printf, formatting is under your control so no automatic spaces or newlines are provided by gawk

– You have to insert them yourself.

{ printf("%-8s %6.2f\n", $1, $2 * $3 ) }
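Both printf statements can be run on sample records of the form name, rate, hours. The records are made up for this sketch, and the AWK fallback is an assumption.

```shell
# Assumption: use gawk if present, otherwise whatever awk is installed.
AWK=$(command -v gawk || command -v awk)

# %s inserts a string, %.2f a number with two decimals.
sentence=$(echo 'Beth 4.00 10' |
  "$AWK" '{ printf("total pay for %s is $%.2f\n", $1, $2 * $3) }')

# %-8s left-justifies the name in 8 columns;
# %6.2f right-justifies the amount in 6 columns.
aligned=$(printf 'Beth 4.00 10\nDan 3.75 0\n' |
  "$AWK" '{ printf("%-8s %6.2f\n", $1, $2 * $3) }')

printf '%s\n%s\n' "$sentence" "$aligned"
```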


Selection

• gawk patterns are good for selecting specific lines from the input for further processing

– Selection by comparison

$2 >= 5 { print }

– Selection by computation

$2 * $3 > 50 { printf("%6.2f for %s\n", $2 * $3, $1) }

– Selection by text content

$1 == "UNT"

$2 ~ /UNT/

– Combinations of patterns

$2 >= 4 || $3 >= 20

– Selection by line number

NR >= 10 && NR <= 20


Arithmetic and Variables

• gawk variables take on numeric (floating point) or string values according to context

• User-defined variables are unadorned (no $ prefix) and do not need to be declared

• By default, user-defined variables are initialized to the null string which has numerical value 0


Computing with gawk

• Counting is easy to do with gawk

$3 > 15 { emp = emp + 1 }
END { print emp, "employees worked more than 15 hrs" }

• Computing sums and averages is also simple

{ pay = pay + $2 * $3 }
END { print NR, "employees"
      print "total pay is", pay
      print "average pay is", pay/NR
}
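A sketch that runs both programs on three made-up records (name, hourly rate, hours worked). The helper function emp and the AWK fallback are assumptions of this sketch.

```shell
# Assumption: use gawk if present, otherwise whatever awk is installed.
AWK=$(command -v gawk || command -v awk)

# Hypothetical records: name, rate, hours.
emp() { printf 'Ann 10 20\nBob 10 10\nCy 8 0\n'; }

# Count the records whose third field exceeds 15.
count=$(emp | "$AWK" '$3 > 15 { emp = emp + 1 }
  END { print emp, "employees worked more than 15 hrs" }')

# Accumulate total pay, then report the total and the average.
summary=$(emp | "$AWK" '{ pay = pay + $2 * $3 }
  END { print NR, "employees"
        print "total pay is", pay
        print "average pay is", pay/NR
  }')

printf '%s\n%s\n' "$count" "$summary"
```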


Handling Text

• One major advantage of gawk is its ability to handle strings as easily as many languages handle numbers

• gawk variables can hold strings of characters as well as numbers, and gawk conveniently translates back and forth as needed

• This program finds the employee who is paid the most per hour:

# Fields: employee, payrate

$2 > maxrate { maxrate = $2; maxemp = $1 }

END { print "highest hourly rate:",

maxrate, "for", maxemp }


String Manipulation

• String Concatenation

– New strings can be created by combining old ones

{ names = names $1 " " }

END { print names }

• Printing the Last Input Line

– Although NR retains its value in the END action, some older awk implementations do not preserve $0 there (gawk does), so saving the last line in a variable is the portable approach

{ last = $0 }

END { print last }


Built-In Functions

• gawk contains a number of built-in functions: length is one of them

• Counting lines, words, and characters using length (similar to wc)

{ nc = nc + length($0) + 1
  nw = nw + NF
}
END { print NR, "lines,", nw, "words,", nc, "characters" }

• substr(s, m, n) produces substring of s that begins at position m and is at most n characters long


Control Flow Statements

• gawk provides several control flow statements for making decisions and writing loops

• If-Then-Else

$2 > 6 { n = n + 1; pay = pay + $2 * $3 }

END { if (n > 0)

print n, "employees, total pay is",

pay, "average pay is", pay/n

else

print "no employees are paid more than $6/hour"

}


Loop Control

• While

# interest1 - compute compound interest

# input: amount, rate, years

# output: compound value at end of each year

{ i = 1

while (i <= $3) {

printf("\t%.2f\n", $1 * (1 + $2) ^ i)

i = i + 1

}

}

• Do-While

do { statement1 } while (expression)


for Statements

• For

# interest2 - compute compound interest

# input: amount, rate, years

# output: compound value at end of each year

{ for (i = 1; i <= $3; i = i + 1)

printf("\t%.2f\n", $1 * (1 + $2) ^ i)

}
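The for-loop version can be run on one made-up input record: 1000 at a rate of .06 for 3 years. The AWK fallback is an assumption of this sketch.

```shell
# Assumption: use gawk if present, otherwise whatever awk is installed.
AWK=$(command -v gawk || command -v awk)

# Input fields: amount, rate, years. One output line per year.
interest=$(echo '1000 .06 3' | "$AWK" '
{ for (i = 1; i <= $3; i = i + 1)
      printf("\t%.2f\n", $1 * (1 + $2) ^ i)
}')

printf '%s\n' "$interest"
```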


Arrays

• Array elements are not declared

• Array subscripts can have any value:

– Numbers

– Strings (associative arrays)

• Examples

arr[3]="value"

grade["Smith"]=40.3


Array Example

# reverse - print input in reverse order by line

{ line[NR] = $0 } # remember each line

END {

for (i=NR; (i > 0); i=i-1) {

print line[i]

}

}

• Use for loop to read associative array

for (v in array) { … }

– Assigns to v each subscript of array (unordered)

– Element is array[v]
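A sketch of an associative array used as a word counter, with for (v in array) to walk it afterwards. The input is made up, the array name count is hypothetical, and the result is piped through sort because the traversal order is unspecified. The AWK fallback is an assumption.

```shell
# Assumption: use gawk if present, otherwise whatever awk is installed.
AWK=$(command -v gawk || command -v awk)

# Use each word as a subscript and count how often it appears.
counts=$(printf 'a b a\nc a b\n' | "$AWK" '
{ for (i = 1; i <= NF; i++) count[$i]++ }
END { for (w in count) print w, count[w] }' | sort)

printf '%s\n' "$counts"
```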


Operators

= assignment operator

– Sets a variable equal to a value or string

== equality operator

– Returns TRUE if both sides are equal

!= inequality operator

&& logical AND

|| logical OR

! logical NOT

<, >, <=, >= relational operators

+, -, /, *, %, ^ arithmetic operators


Built-In Functions

• Arithmetic

– sin, cos, atan2, exp, int, log, rand, sqrt

• String

– length, substr, split

• Output

– print, printf

• Special

– system - executes a Unix/Linux command

• system("clear") to clear the screen

• Note double quotes around the Unix command

– exit - stop reading input and go immediately to the END pattern-action pair if it exists, otherwise exit the script
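Two of these built-ins, split and exit, sketched on made-up input. The array name d and the AWK fallback are assumptions of this sketch.

```shell
# Assumption: use gawk if present, otherwise whatever awk is installed.
AWK=$(command -v gawk || command -v awk)

# split breaks a string into array elements and returns how many it made.
parts=$(echo '2024-01-15' | "$AWK" '{ n = split($0, d, "-"); print n, d[1], d[3] }')

# exit stops reading input; the END action still runs.
early=$(printf '1\n2\n3\n' | "$AWK" 'NR == 2 { exit } { print } END { print "done" }')

printf '%s\n%s\n' "$parts" "$early"
```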


gawk Examples

• Records and fields

gawk '{print NR, $0}' emp1

• Space as field separator

gawk '{print NR, $1, $2, $5}' emp1

• Colon as field separator

gawk -F: '/Jones/{print $1, $2}' emp2

• Match input record

gawk -F: '/00$/' emp2

• Explicit match

gawk '$5 ~ /\.[7-9]+/' emp3

• Matching with regexes

gawk '$2 !~ /E/{print $1, $2}' emp3

gawk '/^[ns]/{print $1}' emp3