kaiserahmed / cs3245hw2 Goto Github PK
View Code? Open in Web Editor NEWThis project forked from tiotao/cs3245hw2
This project forked from tiotao/cs3245hw2
This is the README file for A0099314Y's and A0099332Y's submission == General notes about this assignment == To index a collection of documents, we do the following: First, have a sorted array of all filenames ( as all filenames are numbers ). For each file, tokenize and stem its words into tokens. Store the tokens and sorted array of filenames into a master dictionary like following: dict { 'dog' : [1, 2, 3, 5, 6], 'banana' : [2, 4, 7] } This implementation makes sure that there is no repetition of tokens stored in the dictionary. It also allows us to sort the filename array into sorted posting array. In order to have skips in the postings list, we build a SkipList class. A SkipList instance consists of pointers to a head Node, a tail Node and its length. A Node, on the other hand, represents each posting in the postings list. A Node instance stores pointers to next Node and skip Node (if has one) as well as data (in this case, filename). Once a SkipList is build from an postings array, by SkipList.build_skips(), we can get a SkipList instance with skip pointers. (follow the sqrt rule) After getting a dictionary of tokens and postings arrays. Building skiplist based on positing array. Write the skiplist into postings.txt using pickle. While writing skiplists into posting.txt, record the starting and ending position of the skiplist in the file. These information is used as pointers in dictionary file. The first line of posting.txt is reserved for a json of filename for search optimization. Examples: postings.txt ============== [1,2,3,4,5,6] *strings generated by pickle* dictionary.txt ( token, freq, start_position, end_position ) ============== cat 1 100 123 When it comes to searching, we do the following: First, read queries line by line and tokenize it. Then, we optimise the order of the terms AND chains to prioritise terms that results in shorter postings list. We do this by first locating all subqueries in parenthesis, identify the AND chains, and sort the operands based on estimated postings size. We then run the same algorithm on the entire query, ignoring subqueries in parenthesis. After the order of operands has been optimised, we convert the expression to reverse polish notation using shunting-yard algorithm. The benefit is that during evaluation, the order of operation is extremely simple as all parenthesis has been removed and we don't need to consider operator precedence. Another benefit is that using reverse polish notation, we can guarentee that we are keeping at most 2 postings list in memory at the same time. We then evaluate the RPN expressing, look up the postings list if we see a token, and process them using either intersect, union, or complement operation for AND, OR and NOT operator. Notice that we have done some testing for the essay questions. There are some switches in config.py file, changing them will allow you to test for different behaviours of the programme. Please refer to the inline comments for details. == Files included with this submission == . |-- ESSAY.txt answer to essay question |-- MyList.py wrapper class for skiplist and native list |-- README.txt this file |-- SkipList.py implements a skip list for postings file storage |-- config.py configuration switches for testing |-- dictionary.txt indexed dictionary |-- index.py python script for indexing |-- postings.txt indexed postings file `-- search.py python script for searching == Statement of individual work == Please initial one of the following statements. [x] I, A0099314Y, certify that I have followed the CS 3245 Information Retrieval class guidelines for homework assignments. In particular, I expressly vow that I have followed the Facebook rule in discussing with others in doing the assignment and did not take notes (digital or printed) from the discussions. [x] I, A0099332Y, certify that I have followed the CS 3245 Information Retrieval class guidelines for homework assignments. In particular, I expressly vow that I have followed the Facebook rule in discussing with others in doing the assignment and did not take notes (digital or printed) from the discussions. == References == Appreciation towards the classmates on IVLE forum that helped each other to check the correctness of the output. Shunting-yard algorithm - https://igor.io/2013/12/03/stack-machines-shunting-yard.html python: most elegant way to intersperse a list with an element - http://stackoverflow.com/a/5656097/1903464
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.