computer science
1/5
Data Structures and Algorithms, CS146, Spring 2019
Programming Assignment 1:
Google Search Engine Simulation
Due at 11:59pm, on March 24 (Sunday), 2019 (No late submission! 0 for late submission)
Warning:
This programming assignment is for individual work only. You are not allowed to use someone's
code or copy from internet, except for Web Crawler code provided. Code plagiarism checker
tools (Jplag, MOSS, etc.) will be used to check the similarity of codes. You may be asked to
demo and explain your code to the instructor or to the class. Cheating will not be tolerated and
will be reported to University. Please check on the University Policies in the syllabus.
Read this instruction carefully before your design and implementation of your work.
Problem Statements
Google's search engine is a powerful tool. Without search engines like Google, it would be
practically impossible to find the information you need when you browse the Web. Like all
search engines, Google uses sorting algorithm, indexing, searching, and priority queueing
techniques to generate search results. Google has a large index of keywords where these words
can be found and uses automated programs called spiders or crawlers. What sets Google apart is
how it ranks search results, which in turn determines the priority order Google displays results on
its search engine results page (SERP). Google uses a trademarked algorithm called PageRank,
which assigns each Web page a relevancy score.
A Web page's PageRank depends on a few factors (we only list four for this assignment):
1. The frequency and location of keywords within the web page: If the keyword only
appears once within the body of a page, it will receive a low score for that keyword.
2. How long the web page has existed: People create new web pages every day, and not all
of them stick around for long. Google places more value on pages with an established
history.
3. The number of other web pages that link to the page in question: Google looks at
how many Web pages link to a particular site to determine its relevance.
4. How much the webpage owner has paid to Google for advertisement purpose:
Website’s owners pay a lump sum of money to Google to increase the priority of
PageRank for advertisement of their services/products
Functional Requirements
For this programming assignment, your job is to design and implement a micro version of
Google Search Engine Simulator using Heapsort algorithm and Max-Heap Priority Queue.
The Google Search Engine Simulator MUST contain the following subroutines specified in
textbook:
The following functions must be implemented.
2/5
1) Max-Heapify()
2) Build-Max-Heap()
3) Heapsort()
4) Max-Heap-Insert()
5) Heap-Extract-Max
6) Heap-Increase-Key
7) Heap-Maximum
Two major features of Google Search Engine must be implemented:
1. Using Heapsort algorithm to sort and display a list of web url links based on the PageRank calculation:
a) Allow users to enter a keyword (-20 points if hardcoded or not functioning.)
-- example: San Jose
b) Web Crawler searches for the keyword on internet to generate a list of web url links
c) Allow users to display the search result containing at least 30 web url links (-100 points if hardcoded or -50 points if not functioning.)
-- Example: 1. www.sanjoseca.gov
2. https://en.wikipedia.org/wiki/San_Jose_California
3. https://www.sanjose.org
4. …
d) Assign 4 scores (integer numbers between 1 and 100) for the 4 PageRank factors to each web url link. (allow users to enter 4 scores or use a random number generator)
(Provide users a user interface to view the 4 factor scores of each web url link. -20
points if not functioning.)
e) Calculate total PageRank score (sum up the 4 scores) for each web url link.
f) Store the PageRank score to each web url link
g) Use Heapsort algorithm to sort the 30 PageRank scores.
h) Allow users to print out the sorting result in the sequence order based on the PageRank score (the highest score is printed first. -50 points if not functioning.)
1. www.abcde.gov, PageRank score: 150
2. www.mikewu.com, PageRank score: 136
3/5
3. www.anyweb.com, PageRank score: 127
4……
2. Using Heap Priority Queue to store web url links based on the PageRank.
a) Implement Heap priority queue to store the first sorted 20 out of 30 web url links into Heap (you may use Array, ArrayList or Vector in Java) (-50 points if not
functioning.)
b) Allow users to insert a new web url link into Heap based on the PageRank score by implementing Max-Heap-Insert() (-20 points if not functioning.)
c) Allow users to view the first ranked web url link by implementing Heap-Extract- Max(). (-20 points if not functioning.)
d) Allow users to choose any one of the web url links stored in the heap Priority Queue and increase its PageRank score and its priority in the queue to be listed by
implementing Heap-Increase-Key(). (-20 points if not functioning.)
(Webpage owners can pay more money to Google to increase webpage’s exposure for
advertisement purpose.)
Programming Requirements
1. The Google Search Engine Simulator MUST be written in Java and is required to use
Heapsort and Max-Heap Priority Queue approach specified in lecture ppt slides and in
textbook. (You MUST use or revise the pseudo codes provided in textbook to write your Java
codes. Any other codes will be treated as failing requirements and will receive 0 point.)
2. You can either download a Web Crawler software tool from Internet (popular Web Crawler
links are provided) or develop your own simple Web Crawler in Java to collect research
results (See the Useful Reference Links below.)
3. Web Crawler that you selected must be integrated into your code (such as main() or in the
Java class.)
3. Each java file/class/subroutine/function call MUST contain a header comment at the
beginning of each and make sure you have enough comments at the end of lines in your code.
(-20 points if your code did not provide enough comments.)
4. You MUST write a professional/formal report in MS Word/PDF format with the following
items included in the report: (You will receive 0 point for this assignment if a report is not
provided.)
a) Cover Page – HW1 Title and your name
4/5
b) Design and Implementation:
Briefly explain/illustrate how you design and implement your Google Search Engine
Simulator such as diagram, graph, data structure, (Array, ArrayList, Linked List,
Vector, etc.).
c) A list of classes/subroutines/function calls:
Explain purpose for each.
d) Self-testing screen shots:
Provide enough screen shots for each function including inputs and outputs for each
of function listed in “Functional Requirements” section. This is to verify your codes
and to check to see if your codes match the functional requirements and run correctly
by yourself. (You will receive 0 point if no self-testing screenshots provided. Points
will be taken off if not enough self-testing screen shots provided.)
e) The procedure (step by step) of how to unzip your files, install your application, and
run/test your codes. (-20 points if not provided)
f) Problems encountered during the implementation: Describe all problems
found/encountered during your implementation and how you resolved them. (-15
points if not provided or too simple.)
g) Lessons Learned: Describe the concepts and skills you have learned from this
programming assignment and any inspiration or motivation you got from this
programming assignment. (-10 pints if not provided or too simple.)
Submission Requirements:
1. Zip all your files including your formal report, all *.java files, and a Java executable file (a
jar file, so that I can run your code) into one zipped file with file name: PA1-Secion 8-
Last_Name.zip, Any file with incorrect file name will not be accepted and 0 point will be
given. [No jar file or your code is not runnable (test it before submission: java -jar
filename.jar): -100 points]
2. The deadline to submit/upload your zip file to Canvas is before 11:59pm on Sunday,
3/24/2019. After deadline, the submission link will be closed.
Warning: DO NOT wait until the last minute to upload your files to Canvas. It has been
happening all the time that it took forever to upload files to Canvas during the last
hour/minute. No excuse for late submission.
Grading Criteria
5/5
Grading is based on the creative design, full functioning, completeness and clarity of codes and
the professional report. Each grading policy is specifically listed after each functional item
above. You should follow the functional requirements to avoid losing points.
Useful Reference Links:
1. How to make a simple web crawler in Java
2. How to make a Web crawler using Java?
3. Top 20 web crawler tools to scrape the websites
4. Open Source Web Crawler for Java
5. Make Java executable Jar file
6. https://trends.google.com/trends/?geo=US
Note: Eclipse with JDK 8 under can export project into Runnable Jar, but not JDK 9.
- 1. How to make a simple web crawler in Java
- 2. How to make a Web crawler using Java?
- 3. Top 20 web crawler tools to scrape the websites
- 4. Open Source Web Crawler for Java
- 5. Make Java executable Jar file
- 6. https://trends.google.com/trends/?geo=US