Develop a Web Crawler in Python
Scenario: Given a domain (e.g., apple dot com) and the name of a person who
works for that organization (e.g., Tim Cook), the crawler starts crawling the
domain's web pages, that is, the pages that belong to apple dot com and can be
reached by following URL links.
Upon visiting each page in the domain (e.g., apple dot com), the crawler searches the content of the page for the person's name (e.g., Tim Cook).
Upon identifying a web page that contains the search term, the content of that
web page will be parsed using a parser such as Beautiful Soup [crummy dot com/ software/ BeautifulSoup/ bs4/ doc/ ]. I think I cannot post links here, so I wrote them this way. I hope it is still understandable.
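To make the scenario a bit more concrete, here is a minimal sketch of the two checks it describes: whether a URL belongs to the target domain, and whether a page's visible text mentions the person's name. It uses only the standard library's `html.parser` as a stand-in for Beautiful Soup so that it runs without extra installs. The names `TextExtractor`, `in_domain` and `page_mentions` are made up for illustration, and `example.com` is a placeholder domain.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def in_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def page_mentions(html, name):
    """True if the name appears in the page's visible text (case-insensitive)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so a name split across tags or lines still matches.
    text = " ".join(" ".join(parser.chunks).split())
    return name.casefold() in text.casefold()
```

A crawler would call `in_domain` before following a link and `page_mentions` after downloading a page. With Beautiful Soup installed, the text extraction part could probably be replaced by something like `BeautifulSoup(html, "html.parser").get_text()`.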
Designing an Automated Crawler to Collect Information: Once a crawler is deployed,
it builds a knowledge base and returns the constructed data structures for
further consideration. A typical Web crawler looks at the content of online
Web pages (e.g., apple dot com) and searches for the types of data that are
requested (e.g., Apple Watch).
From an implementation perspective, a typical crawler accepts a URL and an integer
value giving the level or depth of crawling. When a Web crawler is deployed and
instructed to visit a Website, it searches the starting page and every page that
can be reached from it by following the URL links found in the pages, up to the
given depth. Once a crawler collects the requested data, it stores the raw data
in an unstructured or structured form, such as JSON or XML files.
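The "URL plus depth" interface described above could look roughly like the sketch below. It is a breadth-first crawl that stops following links once the depth limit is reached and then dumps what it collected as JSON. To keep it runnable offline, the page download is passed in as a `fetch` function and a tiny in-memory site (`SITE`, a made-up dictionary with placeholder URLs) plays the role of the Web. In a real crawler `fetch` would do an HTTP request, and a domain check would be added before queuing a link.

```python
import json
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_depth, fetch):
    """Breadth-first crawl; fetch(url) returns an HTML string or None."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    records = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        extractor = LinkExtractor()
        extractor.feed(html)
        # Resolve relative links and drop #fragments so pages are not revisited.
        links = [urldefrag(urljoin(url, href)).url for href in extractor.links]
        records.append({"url": url, "depth": depth, "links": links})
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return records


# Hypothetical in-memory site used instead of real HTTP requests.
SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/c">C</a>',
    "http://example.com/b": '<a href="/">home</a>',
    "http://example.com/c": "leaf page",
}

records = crawl("http://example.com/", 1, SITE.get)
print(json.dumps(records, indent=2))
```

With a depth of 1 the crawl visits the start page and the two pages it links to, but not `/c`, which is two links away. The `seen` set is what keeps the link from `/b` back to the home page from causing a loop.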
Most Web crawlers are capable of parsing HTML Webpages and can therefore
extract the requested information, such as the price of products. Each
Web crawler framework is usually integrated with a local knowledge base called
“Crawler-Base” where the extracted data are stored for future analysis.
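The text does not say how the “Crawler-Base” is stored, so the following is only one possible sketch, assuming a local SQLite file (via the standard `sqlite3` module) with a single table. The table name `findings`, its columns, and the helper functions are all made up for illustration.

```python
import sqlite3


def open_crawler_base(path=":memory:"):
    """Open (or create) the local store; ':memory:' keeps it in RAM for testing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS findings ("
        " url TEXT NOT NULL,"
        " field TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " PRIMARY KEY (url, field))"
    )
    return conn


def save_finding(conn, url, field, value):
    """Store one extracted value; re-crawling a page overwrites the old value."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO findings (url, field, value) VALUES (?, ?, ?)",
            (url, field, value),
        )


def findings_for(conn, field):
    """Return (url, value) pairs for one kind of extracted data, ordered by URL."""
    rows = conn.execute(
        "SELECT url, value FROM findings WHERE field = ? ORDER BY url", (field,)
    )
    return rows.fetchall()
```

For example, a crawler extracting product prices might call `save_finding(conn, page_url, "price", "399")` for each page and later read everything back with `findings_for(conn, "price")`.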
There are several open-source software packages that can be used for
implementing and deploying a Web crawler, such as Apache Nutch, StormCrawler,
Scrapy, BeautifulSoup, and lxml. Some of these tools can work on top of each
other. For instance, BeautifulSoup can be used for parsing responses in Scrapy
callbacks. BeautifulSoup, in turn, can use lxml as its underlying parser.
Scrapy is well suited to medium-sized scraping jobs, which makes it suitable for
this project. The crawlers generated and deployed with the Scrapy framework are
quick, simple, and scalable. Other Web crawlers such as Apache Nutch and
StormCrawler can also be good tools for the purpose of this project. Scrapy
provides a good number of features for parsing and extracting data from HTML and
XML pages.
These features use CSS selectors, the pattern syntax that CSS uses to apply
styles to HTML elements, and XPath expressions, a language for selecting nodes
in XML files, to search for and retrieve information. Scrapy calls this
mechanism “selectors”, because they select certain parts of HTML or XML files.
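To illustrate what "selecting nodes" means, here is a small sketch that runs XPath-style queries against a well-formed page. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and is not what Scrapy uses internally; it is a stand-in so the example runs without installing anything. The `PAGE` markup and its class names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing (ElementTree needs valid XML).
PAGE = """
<html>
  <body>
    <div class="product">
      <span class="name">Watch</span>
      <span class="price">399</span>
    </div>
    <div class="product">
      <span class="name">Band</span>
      <span class="price">49</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Select every <span class="price"> anywhere below the root.
prices = [el.text for el in root.findall(".//span[@class='price']")]

# Select <span class="name"> elements that are direct children of a product <div>.
names = [
    el.text
    for el in root.findall(".//div[@class='product']/span[@class='name']")
]

print(dict(zip(names, prices)))
```

Inside a Scrapy callback the same selection would typically be written as `response.xpath("//span[@class='price']/text()").getall()` or, with a CSS selector, `response.css("span.price::text").getall()`. Scrapy's selectors also cope with real-world HTML that is not well-formed, which `ElementTree` does not.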