In this project, you are challenged to create a website that mimics
Google's search engine.
A description of search engine's architecture is covered in the
lecture note. A reference
of Google's original page rank algorithm can be found at
here.
Section 2 is the relevant section.
You are provided a reference implementation of the Crawler
crawler.py,
which can recursively visit the URLs, preprocess it so that a word
list of each URLs are returned. In addition, the link relationship
between the URLs are also discovered. You can build the backend by
modifying file so that a persistent database, including both the
reverse index and rank scores, is built and used by the frontend.
You are also provided a reference implementation of the Page rank
algorithm
pagerank.py.
Depending on preference, one dedicated member should be assigned the
role of unit testing and packaging, and one dedicated person should be
assigned the role of documenting the final design and test result.
The project is to be completed in four lab stages.
Implementation refers to using appropriate data structures, having
reasonably efficient code, and organizing your code in a sane and
readable manner.
A bonus of up to 10% will be awarded for going beyond the basic
specifications. This includes all manner of extensions that you might
think of. This bonus will not bring your mark above 100%.
Here are some common mistakes to avoid:
Description
Requirement
Frontend
Your frontend should present a simple query
interface to the web user, which asks for a single keyword. In
response to a user query through a web browser, the frontend should
respond by searching the keyword against the indexed database of URLs
built by the backend. In addition, the returned URL should be displayed
in sorted order according to a ranked score computed in advance by the
backend.
Backend
Your backend should take an arbitrary file named
urllist.txt, which contains one URL per line, and build an inverted
index database which maps keyward to URLs. In addition, each URL needs
to be assigned a page rank using page rank algorithm.
Resource
Frontend
You should use the Bottle web framework, which
was covered in the tutorial.
Backend
Your backend should take an arbitrary file named
urllist.txt, which contains one URL per line, and build an inverted
index database which maps keywords to URLs. In addition, each URL
needs to be assigned a page rank using PageRank algorithm.
Batteries
The external packages that these facilities
depend on, are pre-installed for you on the ugsparc systems. You are
welcome to use other packages of your preference, but you are on your
own for installation.
Logistics
Deliverables
Lab 1
Lab 1 Instruction
Lab 2
Lab 2 Instruction
Lab 3
Lab 3 Instruction
Lab 4
Lab 4 Instruction
Grading Rubric
Things To Avoid