cmput291f18mp2 / mini-project-2

This project works with data in the physical layer to implement a kijiji ads database.

Home Page: https://mini-project-2.readthedocs.io/en/latest/?badge=latest

License: BSD 3-Clause "New" or "Revised" License

Python 99.60% Perl 0.40%

university project cmput291 python3 python database berkeley-db

mini-project-2's Introduction

mini-project-2

Requirements

Python 3.5+
libdb4.8-dev
libdb4.8++-dev
db-util

Overview

mini-project-2 is a Python command-line application that interfaces with Berkeley DB through the Python 3 package bsddb3. Users enter queries written in the query language whose grammar is defined here: https://github.com/CMPUT291F18MP2/Mini-Project-2/blob/master/mini_project_2/input_parser.py. The program processes each query, retrieves the matching data, and presents it to the user.

Installation

To install the Berkeley DB dependencies for mini-project-2 on Ubuntu, run the following commands:

sudo add-apt-repository ppa:bitcoin/bitcoin
sudo apt-get update
sudo apt-get install libdb4.8-dev libdb4.8++-dev
sudo apt-get install db-util -y

mini-project-2 can then be installed from source by running the following command from the directory that contains mini-project-2's setup.py file:

pip install .

Usage

After installation, mini-project-2's shell can be started with the following console command:

mini-project-2 --phase [1-3]

To get additional usage help on starting mini-project-2, run the following console command:

mini-project-2 --help

mini-project-2's Issues

pdates.txt Creation

pdates.txt: This file includes one line for each ad in the form of d:a,c,l where d is a non-empty date at which the ad is posted and a, c, and l are respectively the ad id, category and location of the ad.
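The d:a,c,l layout described above can be read back with a few lines of Python. This is only a minimal sketch: the names PdateEntry and parse_pdates_line are hypothetical, the sample date format is invented, and it assumes that the ad id and category contain no commas.

```python
from typing import NamedTuple


class PdateEntry(NamedTuple):
    """One line of pdates.txt (hypothetical container, not part of the project)."""
    date: str
    ad_id: str
    category: str
    location: str


def parse_pdates_line(line: str) -> PdateEntry:
    """Parse a 'd:a,c,l' line into its four fields."""
    # Everything before the first colon is the date, the rest is the data.
    date, _, data = line.rstrip("\n").partition(":")
    if not date:
        raise ValueError("pdates.txt requires a non-empty date")
    # Assumption: ad id and category contain no commas, so two splits suffice.
    ad_id, category, location = data.split(",", 2)
    return PdateEntry(date, ad_id, category, location)


# Invented sample values, for illustration only.
entry = parse_pdates_line("2018/11/07:1001,camera,Edmonton\n")
print(entry.date, entry.ad_id, entry.category, entry.location)
```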

Report.pdf Creation

Your report must be type-written, saved as PDF and be included in your submission. Your report cannot exceed 3 pages.
The report should include:
(a) a general overview of your system with a small user guide,
(b) a description of your algorithm for efficiently evaluating queries, in particular queries with multiple conditions, wild cards, and range searches, together with an analysis of the efficiency of your algorithm,
(c) your testing strategy, and
(d) your group work break-down strategy.

Brief and full outputs

By default, the output of each query is the ad id and the title of all matching ads. The user should be able to change the output format to full record by typing "output=full" and back to id and title only using "output=brief".
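One way the output switch could be handled is sketched below. The function names update_output_mode and format_ad are hypothetical, and two details are assumptions rather than requirements from the text: the command is matched case-insensitively with optional spaces around "=", and the brief format separates id and title with a single space.

```python
import re

# Assumption: "output=full" / "output=brief" may be typed in any case,
# with optional whitespace around the equals sign.
_OUTPUT_CMD = re.compile(r"^\s*output\s*=\s*(full|brief)\s*$", re.IGNORECASE)


def update_output_mode(current: str, user_input: str) -> str:
    """Return the new output mode, or the current one if the input is not a mode command."""
    match = _OUTPUT_CMD.match(user_input)
    return match.group(1).lower() if match else current


def format_ad(mode: str, ad_id: str, title: str, record: str) -> str:
    """Format one matching ad as the full record or as id and title only."""
    if mode == "full":
        return record
    # Assumption: a single space separates the ad id from the title.
    return f"{ad_id} {title}"


mode = "brief"  # the default output format
mode = update_output_mode(mode, "output=full")
print(format_ad(mode, "1001", "Nikon camera", "<ad>full record here</ad>"))
```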

README.txt Creation

The file README.txt is a text file that lists the names and ccids of all group members. This file must also include the names of anyone you collaborated with (as much as it is allowed within the course policy) or a line saying that you did not collaborate with anyone else. This is also the place to acknowledge the use of any source of information besides the course textbook and/or class notes.

Parse User Inputs

ads.txt

ads.txt: This file includes one line for each ad in the form of a:rec where a is the ad id and rec is the full ad record in xml.

Index file creation

Given the sorted files terms.txt, pdates.txt, prices.txt and ads.txt, create the following four indexes: (1) a hash index on ads.txt with ad id as key and the full ad record as data, (2) a B+-tree index on terms.txt with term as key and ad id as data, (3) a B+-tree index on pdates.txt with date as key and ad id, category and location as data, (4) a B+-tree index on prices.txt with price as key and ad id, category and location as data. Note that in all four cases the key is the character string before the colon ':' and the data is everything that comes after the colon.

Phase 2 produces these four indexes, which should be named ad.idx, te.idx, da.idx, and pr.idx, corresponding to indexes 1, 2, 3, and 4 respectively.
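The key and data rule can be illustrated with a short helper that turns lines of any of the four files into (key, data) pairs, which could then be loaded into a Berkeley DB hash or B+-tree database. This is a sketch, not the project's implementation: split_key_data is a hypothetical name, the sample lines are invented, and it assumes that the key ends at the first colon so that colons inside an ad record stay in the data.

```python
from typing import Iterable, Iterator, Tuple


def split_key_data(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (key, data) pairs from lines of the form 'key:data'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumption: the key ends at the FIRST colon; later colons belong to the data.
        key, sep, data = line.partition(":")
        if not sep:
            raise ValueError(f"line has no ':' separator: {line!r}")
        yield key, data


# Invented sample lines in the style of ads.txt and terms.txt.
sample = ["1001:<ad>Sale: camera</ad>\n", "camera:1001\n"]
for key, data in split_key_data(sample):
    print(key, "->", data)
```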

prices.txt

prices.txt: This file includes one line for each ad that has a non-empty price field, in the form of p:a,c,l, where p is a number indicating the price and a, c, and l are respectively the ad id, category and location of the ad.
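Writing a prices.txt line, including the rule that ads without a price are skipped, might look like the following sketch. The function name format_prices_line and the sample values are hypothetical, and treating a whitespace-only price as empty is an assumption.

```python
from typing import Optional


def format_prices_line(price: str, ad_id: str, category: str, location: str) -> Optional[str]:
    """Return a 'p:a,c,l' line, or None when the ad has no price."""
    # Assumption: a price made only of whitespace counts as empty.
    price = price.strip()
    if not price:
        return None
    return f"{price}:{ad_id},{category},{location}"


# Invented sample ads: (price, ad id, category, location).
ads = [("250", "1001", "camera", "Edmonton"), ("", "1002", "art", "Calgary")]
lines = [line for line in (format_prices_line(*ad) for ad in ads) if line is not None]
print("\n".join(lines))
```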

terms.txt creation

terms.txt: This file includes terms extracted from ad titles and descriptions; for our purpose, suppose a term is a consecutive sequence of alphanumeric, underscore '_' and dash '-' characters, i.e. [0-9a-zA-Z_-]. The format of the file is as follows: for every term T in the title or the description of an ad with id a, there is a row in this file of the form t:a where t is the lowercase form of T. Ignore all special characters coded as &#number; such as &#29987; which represents 産 as well as &apos;, &quot; and &amp; which respectively represent ', " and &. Also ignore terms of length 2 or less. Convert the terms to all lowercase before writing them out. Here are the respective files for our input files with 10 records and 1000 records.
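The extraction rules above can be sketched with two regular expressions. The names extract_terms and terms_lines are hypothetical, the sample ad is invented, and one detail is an assumption: each ignored character reference is replaced by a space, so it acts as a separator between terms rather than joining its neighbours.

```python
import re
from typing import List

# Numeric character references (&#number;) and the three named entities to ignore.
_ENTITY = re.compile(r"&(?:#\d+|apos|quot|amp);")
# A term is a run of alphanumeric, underscore or dash characters.
_TERM = re.compile(r"[0-9a-zA-Z_-]+")


def extract_terms(text: str) -> List[str]:
    """Return lowercase terms longer than 2 characters, in order of appearance."""
    # Assumption: an ignored entity separates terms instead of joining them.
    cleaned = _ENTITY.sub(" ", text)
    return [term.lower() for term in _TERM.findall(cleaned) if len(term) > 2]


def terms_lines(ad_id: str, title: str, description: str) -> List[str]:
    """Build the 't:a' rows for one ad from its title and description."""
    terms = extract_terms(title) + extract_terms(description)
    return [f"{term}:{ad_id}" for term in terms]


# Invented sample ad.
for row in terms_lines("1001", "Nikon D-SLR &amp; 2 lenses", "It&apos;s new!"):
    print(row)
```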

Search Indexed Files

test failures

Travis fails on these:
test/unit/test_phase2.py::test_sort_all FAILED [ 96%]
test/unit/test_phase2.py::test_format_all FAILED [ 98%]
Do they work for you @lionkingsimba ?