computer science

profileguobinllj
PA-1-S8.pdf

1/5

Data Structures and Algorithms, CS146, Spring 2019

Programming Assignment 1:

Google Search Engine Simulation

Due at 11:59pm, on March 24 (Sunday), 2019 (No late submission! 0 for late submission)

Warning:

This programming assignment is for individual work only. You are not allowed to use someone's

code or copy from internet, except for Web Crawler code provided. Code plagiarism checker

tools (Jplag, MOSS, etc.) will be used to check the similarity of codes. You may be asked to

demo and explain your code to the instructor or to the class. Cheating will not be tolerated and

will be reported to University. Please check on the University Policies in the syllabus.

Read this instruction carefully before your design and implementation of your work.

Problem Statements

Google's search engine is a powerful tool. Without search engines like Google, it would be

practically impossible to find the information you need when you browse the Web. Like all

search engines, Google uses sorting algorithm, indexing, searching, and priority queueing

techniques to generate search results. Google has a large index of keywords where these words

can be found and uses automated programs called spiders or crawlers. What sets Google apart is

how it ranks search results, which in turn determines the priority order Google displays results on

its search engine results page (SERP). Google uses a trademarked algorithm called PageRank,

which assigns each Web page a relevancy score.

A Web page's PageRank depends on a few factors (we only list four for this assignment):

1. The frequency and location of keywords within the web page: If the keyword only

appears once within the body of a page, it will receive a low score for that keyword.

2. How long the web page has existed: People create new web pages every day, and not all

of them stick around for long. Google places more value on pages with an established

history.

3. The number of other web pages that link to the page in question: Google looks at

how many Web pages link to a particular site to determine its relevance.

4. How much the webpage owner has paid to Google for advertisement purpose:

Website’s owners pay a lump sum of money to Google to increase the priority of

PageRank for advertisement of their services/products

Functional Requirements

For this programming assignment, your job is to design and implement a micro version of

Google Search Engine Simulator using Heapsort algorithm and Max-Heap Priority Queue.

The Google Search Engine Simulator MUST contain the following subroutines specified in

textbook:

The following functions must be implemented.

2/5

1) Max-Heapify()

2) Build-Max-Heap()

3) Heapsort()

4) Max-Heap-Insert()

5) Heap-Extract-Max

6) Heap-Increase-Key

7) Heap-Maximum

Two major features of Google Search Engine must be implemented:

1. Using Heapsort algorithm to sort and display a list of web url links based on the PageRank calculation:

a) Allow users to enter a keyword (-20 points if hardcoded or not functioning.)

-- example: San Jose

b) Web Crawler searches for the keyword on internet to generate a list of web url links

c) Allow users to display the search result containing at least 30 web url links (-100 points if hardcoded or -50 points if not functioning.)

-- Example: 1. www.sanjoseca.gov

2. https://en.wikipedia.org/wiki/San_Jose_California

3. https://www.sanjose.org

4. …

d) Assign 4 scores (integer numbers between 1 and 100) for the 4 PageRank factors to each web url link. (allow users to enter 4 scores or use a random number generator)

(Provide users a user interface to view the 4 factor scores of each web url link. -20

points if not functioning.)

e) Calculate total PageRank score (sum up the 4 scores) for each web url link.

f) Store the PageRank score to each web url link

g) Use Heapsort algorithm to sort the 30 PageRank scores.

h) Allow users to print out the sorting result in the sequence order based on the PageRank score (the highest score is printed first. -50 points if not functioning.)

1. www.abcde.gov, PageRank score: 150

2. www.mikewu.com, PageRank score: 136

3/5

3. www.anyweb.com, PageRank score: 127

4……

2. Using Heap Priority Queue to store web url links based on the PageRank.

a) Implement Heap priority queue to store the first sorted 20 out of 30 web url links into Heap (you may use Array, ArrayList or Vector in Java) (-50 points if not

functioning.)

b) Allow users to insert a new web url link into Heap based on the PageRank score by implementing Max-Heap-Insert() (-20 points if not functioning.)

c) Allow users to view the first ranked web url link by implementing Heap-Extract- Max(). (-20 points if not functioning.)

d) Allow users to choose any one of the web url links stored in the heap Priority Queue and increase its PageRank score and its priority in the queue to be listed by

implementing Heap-Increase-Key(). (-20 points if not functioning.)

(Webpage owners can pay more money to Google to increase webpage’s exposure for

advertisement purpose.)

Programming Requirements

1. The Google Search Engine Simulator MUST be written in Java and is required to use

Heapsort and Max-Heap Priority Queue approach specified in lecture ppt slides and in

textbook. (You MUST use or revise the pseudo codes provided in textbook to write your Java

codes. Any other codes will be treated as failing requirements and will receive 0 point.)

2. You can either download a Web Crawler software tool from Internet (popular Web Crawler

links are provided) or develop your own simple Web Crawler in Java to collect research

results (See the Useful Reference Links below.)

3. Web Crawler that you selected must be integrated into your code (such as main() or in the

Java class.)

3. Each java file/class/subroutine/function call MUST contain a header comment at the

beginning of each and make sure you have enough comments at the end of lines in your code.

(-20 points if your code did not provide enough comments.)

4. You MUST write a professional/formal report in MS Word/PDF format with the following

items included in the report: (You will receive 0 point for this assignment if a report is not

provided.)

a) Cover Page – HW1 Title and your name

4/5

b) Design and Implementation:

Briefly explain/illustrate how you design and implement your Google Search Engine

Simulator such as diagram, graph, data structure, (Array, ArrayList, Linked List,

Vector, etc.).

c) A list of classes/subroutines/function calls:

Explain purpose for each.

d) Self-testing screen shots:

Provide enough screen shots for each function including inputs and outputs for each

of function listed in “Functional Requirements” section. This is to verify your codes

and to check to see if your codes match the functional requirements and run correctly

by yourself. (You will receive 0 point if no self-testing screenshots provided. Points

will be taken off if not enough self-testing screen shots provided.)

e) The procedure (step by step) of how to unzip your files, install your application, and

run/test your codes. (-20 points if not provided)

f) Problems encountered during the implementation: Describe all problems

found/encountered during your implementation and how you resolved them. (-15

points if not provided or too simple.)

g) Lessons Learned: Describe the concepts and skills you have learned from this

programming assignment and any inspiration or motivation you got from this

programming assignment. (-10 pints if not provided or too simple.)

Submission Requirements:

1. Zip all your files including your formal report, all *.java files, and a Java executable file (a

jar file, so that I can run your code) into one zipped file with file name: PA1-Secion 8-

Last_Name.zip, Any file with incorrect file name will not be accepted and 0 point will be

given. [No jar file or your code is not runnable (test it before submission: java -jar

filename.jar): -100 points]

2. The deadline to submit/upload your zip file to Canvas is before 11:59pm on Sunday,

3/24/2019. After deadline, the submission link will be closed.

Warning: DO NOT wait until the last minute to upload your files to Canvas. It has been

happening all the time that it took forever to upload files to Canvas during the last

hour/minute. No excuse for late submission.

Grading Criteria

5/5

Grading is based on the creative design, full functioning, completeness and clarity of codes and

the professional report. Each grading policy is specifically listed after each functional item

above. You should follow the functional requirements to avoid losing points.

Useful Reference Links:

1. How to make a simple web crawler in Java

2. How to make a Web crawler using Java?

3. Top 20 web crawler tools to scrape the websites

4. Open Source Web Crawler for Java

5. Make Java executable Jar file

6. https://trends.google.com/trends/?geo=US

Note: Eclipse with JDK 8 under can export project into Runnable Jar, but not JDK 9.

  • 1. How to make a simple web crawler in Java
  • 2. How to make a Web crawler using Java?
  • 3. Top 20 web crawler tools to scrape the websites
  • 4. Open Source Web Crawler for Java
  • 5. Make Java executable Jar file
  • 6. https://trends.google.com/trends/?geo=US