this project is to construct a search engine for a static corpus, namly the Enron e-mail collection. Due to disk space limitations on our small Amazon machines, however, I will give you a subset of the complete e-mail dump.
You should create index in memory, but store the e-mail files on disk. Set up my Web server configuration so that the files are available visible with the URI structure mirroring the disk structure. For example, I want URI /enron/motley-m/inbox/16 for file
enron_mail_20110402/maildir/motley-m/inbox/16.
Note the removal of the '.' on the end of the filename in the URI. The easiest way to do this is simply remove that from each filename on disk.
We are reusing the functionality from my previous project that loaded the mysql database with text files. In this case, we are using e-mail files. You are free to use any of the Python code we built on page TFIDF.
Assuming a free text search, not a Boolean AND or OR condition in the terms. In other words, it's possible that documents may come up that do not include all of the search terms.
We will, in principle, be using a cosine similarity between query and documents except that it will look like a simple scoring by "sum of weights" because our queries are short: we will use 0 or 1 as the weight for query terms. So for query "burn California burn" the resulting vector weights are 1 for terms burn and California. There would be an implicit zero in all other positions.
You can use either a binary tree (b-tree) or hash table to create the index and I recommend variable length arrays for the postings lists. You will also need an array that maps document ID to filename (or URI).
You should make search queries as efficient as possible and return the most relevant e-mails to the top of the list. That means using a heap as a priority queue to get the top K search results. Do not simply sort the search results by score. Create a heap from the results and then get the top results in the proper order. See Python module heapq.
Persist your index to disk with cpickle and reload upon server start up. Have a separate python program create the index before launch your server.
If you use heuristics to increase the speed of your queries, you must document this on a separate sheet so I don't miss it as a comment in your code. Heuristics can alter the order or set of documents that appear in the results in exchange for increased speed.
Example
Here is email file "16." within Motley's directory:
~/USF/CS680/data/enron_mail_20110402/maildir/motley-m/inbox $ cat 16.
Message-ID: <[login to view URL]@thyme>
Date: Fri, 15 Mar 2002 12:50:27 -0800 (PST)
From:
To:
Subject: phone number for Malowney
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Thompson, Virginia </O=ENRON/OU=NA/CN=RECIPIENTS/CN=VTHOMPSO>
X-To: Motley, Matt </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mmotley>
X-cc:
X-bcc:
X-Folder: \ExMerge - Motley, Matt\Inbox
X-Origin: MOTLEY-M
X-FileName: matt motley [login to view URL]
Matt-
I talked to John Malowney today (Friday, March 15) and he asked if I would
send you an e-mail and ask you to call him on Tuesday (March 19) (503)
833-4526.
Thanks,
Virginia
A search results for this should look like:
phone number for Malowney
From:
To:
Matt- I talked to John Malowney today (Friday, March 15) and he asked if I would send you...
Strip out newlines, combined spaces, and display the first, say, 80 characters of the e-mail message.
Deliverables
You must provide a URI that return search results:
/search?q=searchterms
Show the top 20 results in the browser and show a "next page" link that will move to the next 20 results etc...
The results in the browser should look something like google results where the link has the Subject line extracted from the e-mail, the from address, the to address, and