The cs3245hw2 from kaiserahmed

This is the README file for A0099314Y's and A0099332Y's submission

== General notes about this assignment ==

To index a collection of documents, we do the following:

First, have a sorted array of all filenames ( as all filenames are 
numbers ).

For each file, tokenize and stem its words into tokens. Store the 
tokens and sorted array of filenames into a master dictionary like 
following:

	dict {
		'dog' : [1, 2, 3, 5, 6],
		'banana' : [2, 4, 7]
	}

This implementation makes sure that there is no repetition of tokens
stored in the dictionary. It also allows us to sort the filename array
into sorted posting array.

In order to have skips in the postings list, we build a SkipList class.
A SkipList instance consists of pointers to a head Node, a tail Node
and its length. 

A Node, on the other hand, represents each posting in the postings list.
A Node instance stores pointers to next Node and skip Node (if has one)
as well as data (in this case, filename).

Once a SkipList is build from an postings array, by 
SkipList.build_skips(), we can get a SkipList instance with skip 
pointers. (follow the sqrt rule)

After getting a dictionary of tokens and postings arrays. Building 
skiplist based on positing array. Write the skiplist into postings.txt 
using pickle. While writing skiplists into posting.txt, record the 
starting and ending position of the skiplist in the file. These
information is used as pointers in dictionary file.

The first line of posting.txt is reserved for a json of filename for 
search optimization.


Examples:

postings.txt
==============
[1,2,3,4,5,6]
*strings generated by pickle*

dictionary.txt
( token, freq, start_position, end_position )
==============
cat 1 100 123

When it comes to searching, we do the following:

First, read queries line by line and tokenize it.

Then, we optimise the order of the terms AND chains to prioritise
terms that results in shorter postings list. We do this by first
locating all subqueries in parenthesis, identify the AND chains, and
sort the operands based on estimated postings size. We then run the
same algorithm on the entire query, ignoring subqueries in
parenthesis.

After the order of operands has been optimised, we convert the
expression to reverse polish notation using shunting-yard algorithm.
The benefit is that during evaluation, the order of operation is
extremely simple as all parenthesis has been removed and we don't need
to consider operator precedence. Another benefit is that using reverse
polish notation, we can guarentee that we are keeping at most 2
postings list in memory at the same time.

We then evaluate the RPN expressing, look up the postings list if we
see a token, and process them using either intersect, union, or
complement operation for AND, OR and NOT operator.

Notice that we have done some testing for the essay questions. There
are some switches in config.py file, changing them will allow you to 
test for different behaviours of the programme. Please refer to the 
inline comments for details.

== Files included with this submission ==
.
|-- ESSAY.txt        answer to essay question
|-- MyList.py        wrapper class for skiplist and native list
|-- README.txt       this file
|-- SkipList.py      implements a skip list for postings file storage
|-- config.py        configuration switches for testing
|-- dictionary.txt   indexed dictionary  
|-- index.py         python script for indexing 
|-- postings.txt     indexed postings file
`-- search.py        python script for searching


== Statement of individual work ==

Please initial one of the following statements.

[x] I, A0099314Y, certify that I have followed the CS 3245 Information
Retrieval class guidelines for homework assignments.  In particular, I
expressly vow that I have followed the Facebook rule in discussing
with others in doing the assignment and did not take notes (digital or
printed) from the discussions.  

[x] I, A0099332Y, certify that I have followed the CS 3245 Information
Retrieval class guidelines for homework assignments.  In particular, I
expressly vow that I have followed the Facebook rule in discussing
with others in doing the assignment and did not take notes (digital or
printed) from the discussions.  

== References ==

Appreciation towards the classmates on IVLE forum that helped each other
to check the correctness of the output.

Shunting-yard algorithm
 - https://igor.io/2013/12/03/stack-machines-shunting-yard.html
python: most elegant way to intersperse a list with an element 
 - http://stackoverflow.com/a/5656097/1903464
kaiserahmed / cs3245hw2 Goto Github PK

cs3245hw2's Introduction

cs3245hw2's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent