Giter Club home page Giter Club logo

cs3245hw2's Introduction

This is the README file for A0099314Y's and A0099332Y's submission

== General notes about this assignment ==

To index a collection of documents, we do the following:

First, have a sorted array of all filenames ( as all filenames are 
numbers ).

For each file, tokenize and stem its words into tokens. Store the 
tokens and sorted array of filenames into a master dictionary like 
following:

	dict {
		'dog' : [1, 2, 3, 5, 6],
		'banana' : [2, 4, 7]
	}

This implementation makes sure that there is no repetition of tokens
stored in the dictionary. It also allows us to sort the filename array
into sorted posting array.

In order to have skips in the postings list, we build a SkipList class.
A SkipList instance consists of pointers to a head Node, a tail Node
and its length. 

A Node, on the other hand, represents each posting in the postings list.
A Node instance stores pointers to next Node and skip Node (if has one)
as well as data (in this case, filename).

Once a SkipList is build from an postings array, by 
SkipList.build_skips(), we can get a SkipList instance with skip 
pointers. (follow the sqrt rule)

After getting a dictionary of tokens and postings arrays. Building 
skiplist based on positing array. Write the skiplist into postings.txt 
using pickle. While writing skiplists into posting.txt, record the 
starting and ending position of the skiplist in the file. These
information is used as pointers in dictionary file.

The first line of posting.txt is reserved for a json of filename for 
search optimization.


Examples:

postings.txt
==============
[1,2,3,4,5,6]
*strings generated by pickle*

dictionary.txt
( token, freq, start_position, end_position )
==============
cat 1 100 123

When it comes to searching, we do the following:

First, read queries line by line and tokenize it.

Then, we optimise the order of the terms AND chains to prioritise
terms that results in shorter postings list. We do this by first
locating all subqueries in parenthesis, identify the AND chains, and
sort the operands based on estimated postings size. We then run the
same algorithm on the entire query, ignoring subqueries in
parenthesis.

After the order of operands has been optimised, we convert the
expression to reverse polish notation using shunting-yard algorithm.
The benefit is that during evaluation, the order of operation is
extremely simple as all parenthesis has been removed and we don't need
to consider operator precedence. Another benefit is that using reverse
polish notation, we can guarentee that we are keeping at most 2
postings list in memory at the same time.

We then evaluate the RPN expressing, look up the postings list if we
see a token, and process them using either intersect, union, or
complement operation for AND, OR and NOT operator.

Notice that we have done some testing for the essay questions. There
are some switches in config.py file, changing them will allow you to 
test for different behaviours of the programme. Please refer to the 
inline comments for details.

== Files included with this submission ==
.
|-- ESSAY.txt        answer to essay question
|-- MyList.py        wrapper class for skiplist and native list
|-- README.txt       this file
|-- SkipList.py      implements a skip list for postings file storage
|-- config.py        configuration switches for testing
|-- dictionary.txt   indexed dictionary  
|-- index.py         python script for indexing 
|-- postings.txt     indexed postings file
`-- search.py        python script for searching


== Statement of individual work ==

Please initial one of the following statements.

[x] I, A0099314Y, certify that I have followed the CS 3245 Information
Retrieval class guidelines for homework assignments.  In particular, I
expressly vow that I have followed the Facebook rule in discussing
with others in doing the assignment and did not take notes (digital or
printed) from the discussions.  

[x] I, A0099332Y, certify that I have followed the CS 3245 Information
Retrieval class guidelines for homework assignments.  In particular, I
expressly vow that I have followed the Facebook rule in discussing
with others in doing the assignment and did not take notes (digital or
printed) from the discussions.  

== References ==

Appreciation towards the classmates on IVLE forum that helped each other
to check the correctness of the output.

Shunting-yard algorithm
 - https://igor.io/2013/12/03/stack-machines-shunting-yard.html
python: most elegant way to intersperse a list with an element 
 - http://stackoverflow.com/a/5656097/1903464

cs3245hw2's People

Contributors

benmq avatar tiotao avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.