Develop a Web Crawler on Python

arc

Scenario.docx


Scenario: Given a domain (e.g., apple dot com) and the name of a person (e.g., Tim Cook) who works for that organization, the crawler starts crawling the domain’s Web pages, that is, the pages that belong to the domain apple dot com and can be reached through Web links (URLs).

Upon visiting each page in the domain (e.g., apple dot com), the crawler searches the content of the page for the person’s name (e.g., Tim Cook).

Upon identifying a Web page that contains the search term, the content of that page will be parsed using a parser such as Beautiful Soup [crummy dot com/ software/ BeautifulSoup/ bs4/ doc/ ]. I think I cannot post links here, so I edited it that way. I hope you can understand it.
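
The two checks described in the scenario (does a URL belong to the domain, and does a page mention the person's name) could be sketched roughly as below. This is only a minimal sketch: it uses Python's standard-library html.parser as a stand-in for Beautiful Soup, and the names TextExtractor, in_domain and page_mentions are made up for illustration, as is the example.com placeholder used in the note after the block.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TextExtractor(HTMLParser):
    """Collects text content, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def in_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def page_mentions(html, name):
    """Case-insensitive search for the name in the page's text content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so a name split across lines still matches.
    text = " ".join(" ".join(parser.parts).split())
    return name.lower() in text.lower()
```

For example, in_domain("https://www.example.com/team", "example.com") is true, and page_mentions(html, "Tim Cook") is true for any page whose text contains that name. A plain substring match is an assumption here; it will miss variants such as "Timothy Cook".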

Designing an Automated Crawler to Collect Information: Once a crawler is deployed, it builds a knowledge base and returns the constructed data structures for further consideration. A typical Web crawler looks at the content of online Web pages (e.g., apple dot com) and searches for the types of data that are requested (e.g., Apple Watch).

From an implementation perspective, a typical crawler accepts a URL and an integer value giving the level, or depth, of crawling. When a Web crawler is deployed and instructed to visit a Website, it searches each page and also every page that can be reached from the URL links found in those pages, up to the given depth. Once the crawler has collected the requested data, it stores the raw data in a structured or unstructured form, such as JSON or XML files.
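
The URL-plus-depth interface described above might look something like the following breadth-first sketch. It is written under several assumptions: the function names (crawl, fetch_html, to_json) and the record layout are hypothetical, the fetcher is passed in as a parameter so the crawl logic can be tried without network access, and restricting the crawl to one domain is left out to keep the block short.

```python
import json
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher (assumption: pages decode as UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth, fetch=fetch_html):
    """Breadth-first crawl of pages at most max_depth links from start_url."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    records = []
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links, drop #fragments, keep only http(s) URLs.
        links = [urldefrag(urljoin(url, href)).url for href in parser.links]
        links = [u for u in links if urlparse(u).scheme in ("http", "https")]
        records.append({"url": url, "depth": depth, "links": links})
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return records


def to_json(records):
    """Serialize the crawl records as a JSON string."""
    return json.dumps(records, indent=2)
```

Calling crawl(start_url, 2) would visit the start page, the pages it links to, and the pages those link to. A real crawler should also honor robots.txt and rate-limit its requests, which this sketch does not do.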

Most Web crawlers are capable of parsing HTML Web pages and can therefore extract the requested information, such as the price of products. Each Web crawler framework is usually integrated with a local knowledge base called “Crawler-Base”, where the extracted data are stored for future analysis.
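
The text does not say how such a local knowledge base is built, so the following is only one possible sketch, assuming a single SQLite table accessed through Python's standard sqlite3 module. The table name findings, its columns, and the helper functions are all hypothetical.

```python
import sqlite3


def open_crawler_base(path=":memory:"):
    """Open (or create) the store; ':memory:' keeps it in RAM for experiments."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS findings ("
        "url TEXT NOT NULL, field TEXT NOT NULL, value TEXT NOT NULL, "
        "UNIQUE (url, field, value))"
    )
    return conn


def store_finding(conn, url, field, value):
    """Record one extracted value; exact duplicates are silently ignored."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO findings (url, field, value) VALUES (?, ?, ?)",
            (url, field, value),
        )


def findings_for(conn, field):
    """Return (url, value) pairs for one kind of extracted data."""
    rows = conn.execute(
        "SELECT url, value FROM findings WHERE field = ? ORDER BY url",
        (field,),
    )
    return rows.fetchall()
```

A crawler could call store_finding(conn, page_url, "person", "Tim Cook") each time it finds the search term, and later query everything it found with findings_for(conn, "person"). Passing a file path instead of ":memory:" would keep the data between runs.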

There are several open-source software packages that can be used for implementing and deploying a Web crawler, such as Apache Nutch, StormCrawler, Scrapy, BeautifulSoup, and lxml. Many of these tools can work on top of each other. For instance, BeautifulSoup can be used for parsing responses in Scrapy callbacks. BeautifulSoup, in turn, can use lxml as its underlying parser.

Scrapy is well suited to medium-sized scraping jobs, which makes it suitable for this project. Crawlers built and deployed with the Scrapy framework are quick, simple, and scalable. Other Web crawlers, such as Apache Nutch and StormCrawler, can also be good tools for this project. Scrapy provides a good number of features for parsing and extracting data from HTML and XML pages.

These features use CSS selectors, which come from the language for applying styles to HTML files, and XPath expressions, which come from a language for selecting nodes in XML files, to search for and retrieve information. Scrapy calls this mechanism “selectors” because they select certain parts of HTML or XML files.
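
To give a feel for what selecting nodes with an XPath expression means, here is a small sketch. It does not use Scrapy: it uses Python's standard xml.etree.ElementTree, which supports only a limited subset of XPath and no CSS selectors. The CATALOG document, its values, and the select_text helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; the product names and prices are placeholders.
CATALOG = """<catalog>
  <product category="watch"><name>Watch A</name><price>199</price></product>
  <product category="phone"><name>Phone B</name><price>499</price></product>
</catalog>"""


def select_text(xml_text, path):
    """Return the text of every node matched by the XPath-style expression."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(path)]


# Select the <price> of every <product> whose category attribute is "watch".
watch_prices = select_text(CATALOG, ".//product[@category='watch']/price")
all_names = select_text(CATALOG, ".//name")
```

In Scrapy itself, the comparable calls are made on the response object, response.xpath(...) for XPath expressions and response.css(...) for CSS selectors, and they accept full XPath rather than the subset used here.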