Description

In this project, you are challenged to create a website that mimics Google's search engine.

A description of search engine's architecture is covered in the lecture note. A reference of Google's original page rank algorithm can be found at here. Section 2 is the relevant section.

 

Requirement

 
  Frontend Your frontend should present a simple query interface to the web user, which asks for a single keyword. In response to a user query through a web browser, the frontend should respond by searching the keyword against the indexed database of URLs built by the backend. In addition, the returned URL should be displayed in sorted order according to a ranked score computed in advance by the backend.

 
  Backend Your backend should take an arbitrary file named urllist.txt, which contains one URL per line, and build an inverted index database which maps keyward to URLs. In addition, each URL needs to be assigned a page rank using page rank algorithm.

 

Resource

 
  Frontend You should use the Bottle web framework, which was covered in the tutorial.

 
  Backend Your backend should take an arbitrary file named urllist.txt, which contains one URL per line, and build an inverted index database which maps keywords to URLs. In addition, each URL needs to be assigned a page rank using PageRank algorithm.

You are provided a reference implementation of the Crawler crawler.py, which can recursively visit the URLs, preprocess it so that a word list of each URLs are returned. In addition, the link relationship between the URLs are also discovered. You can build the backend by modifying file so that a persistent database, including both the reverse index and rank scores, is built and used by the frontend.

You are also provided a reference implementation of the Page rank algorithm pagerank.py.

 
  Batteries The external packages that these facilities depend on, are pre-installed for you on the ugsparc systems. You are welcome to use other packages of your preference, but you are on your own for installation.

 

Logistics

  • One member will be responsible for the web frontend.

  • One member will be responsible for the web backend.

Depending on preference, one member should be assigned the role of unit testing and packaging, and the other should be assigned the role of documenting the final design and test result.

 

Deliverables

Before the deadline, you should submit:

  • A creative name of your website and service;

  • All source code (well commented - Make sure you have explicit instructions on how to run your project. If your project is not simple to run it will cost you.);

  • Unit test suite;

  • A report describing the objective, design, test results, roles played by each team member, and a discussion of lessons learned. How did the material from the course help you with the project? Did any parts take you longer than you thought? Why? Would you do the project differently if you had to do it again? What would you do if you had more time? What was your testing strategy during development? It is important that the names and student numbers of all member are specified in the beginning of the report.

 

Grading Rubric

  • 30% Frontend 20% implementation 10% usability/aesthetics

  • 40% Backend 25% implementation 15% PageRank correctness

  • 30% Documentation / Testing 15% unit testing 15% written report

Implementation refers to using appropriate data structures, having reasonably efficient code, and organizing your code in a sane and readable manner.

A bonus of up to 10% will be awarded for going beyond the basic specifications. This includes all manner of extensions that you might think of. This bonus will not bring your mark above 100%.

 

Things To Avoid

Here are some common mistakes to avoid:

  • Not commenting code clearly; it is especially important that you explain what each function is intended to accomplish;

  • Not including very explicit installation instructions;

  • Returning the same result multiple times;

  • Having overly verbose, unreadable search results (remember that a human is reading your results);

  • Having overly laconic, uninformative search results;

  • Not proofreading your report.