Linux filter using awk, sed, ...

Jack Jonshen
question.txt

Webpage Scraping: Transforming the text of a webpage into a format amenable to entry into a database system is a common task ideally suited to a filter script. In this project, you will creatively compose and combine (through pipes) the tools and utilities covered in this chapter to write a shell filter script that converts the semi-structured data on a webpage to a colon-separated values (CSV) file, a format easily imported into a database system. The output must be written to standard output without writing or producing any intermediate files.

Start by finding a webpage with some semi-structured, tabular data, such as the status of the United States Congress at https://votesmart.org/officials/NA/C/national-congressional. Then define an output format such as <state>:<branch>:<party>:<district/seat>:<name>:<URL>, an instance of which is AK:House:Republican:At-Large:Don Young:http://votesmart.org/candidate/26717/don-young. Then write your filter script to convert one into the other.

You may rely on the presence of the file pvsurls.txt, available at http://perugini.cps.udayton.edu/teaching/books/SPUC/www/files/pvsurls.txt, in the current working directory. Correct standard output is available at http://perugini.cps.udayton.edu/teaching/books/SPUC/www/files/pvsstdoutstream.txt.

To avoid parsing HTML code, use the following command line in your script, which uses the lynx text-based web browser to write the human-readable contents of a webpage to standard output: lynx -dump -width=200 <url>. The lynx browser can be used to browse the web from non-graphical interfaces such as an ssh terminal. While not necessary, you may want to explore the iconv utility to deal with accents in names. You must make use of five different Linux filters in your script.
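As a starting point, here is a minimal sketch of how five filters (tr, grep, sed, awk, sort) might be chained through pipes with no intermediate files. It is not a solution to the exercise: the column layout in sample() is a hypothetical stand-in, since the real output of lynx -dump -width=200 <url> has to be inspected before any pattern can be written against it. The function names to_csv and sample, the placeholder rows for state ZZ, and the column order are all assumptions made for illustration. The <URL> field is left out because the layout of pvsurls.txt is not described here.

```shell
#!/bin/sh
# Sketch only: the input layout is hypothetical, not the real lynx output.

# to_csv: read aligned text columns on stdin, write colon-separated rows.
# Assumed input columns: name, party, branch, state, district/seat,
# separated by runs of two or more spaces.
to_csv() {
    tr -d '\r' |                                  # 1. drop carriage returns
    grep -E 'House|Senate' |                      # 2. keep only rows naming a branch
    sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//; s/[[:space:]]{2,}/|/g' |  # 3. trim, mark column gaps with |
    awk -F'|' '{ print $4 ":" $3 ":" $2 ":" $5 ":" $1 }' |  # 4. reorder to state:branch:party:seat:name
    LC_ALL=C sort -t: -k1,1 -k2,2                 # 5. order by state, then branch
}

# sample: hypothetical rows standing in for `lynx -dump -width=200 <url>`.
# Only the Don Young row reflects data quoted in the exercise text.
sample() {
    printf '%s\n' \
        'Name            Party         Branch   State   District/Seat' \
        '   Jane Roe     Democratic    Senate   ZZ      Senior Seat' \
        '   Don Young    Republican    House    AK      At-Large' \
        '   John Doe     Independent   House    ZZ      1'
}

sample | to_csv
```

In a real script, sample would be replaced by the lynx command line, and the grep and sed patterns would be adjusted to whatever the dump actually looks like. With the sample rows above, the first line printed is AK:House:Republican:At-Large:Don Young, which matches the example format apart from the missing URL field.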