anjanatiha / web-retrieval-search-engine-implementation-for-university-web-domain

Vector Space based End-to-End Web Retrieval/ Search Engine with self-contained units

License: MIT License

Repository languages : Python 21.95%, TeX 0.51%, Jupyter Notebook 44.17%, JavaScript 32.55%, HTML 0.82%
Topics : web-retrieval, search-engine, crawler, preprocessor, web-retrieval-engine, query-processor, retrieval-engine-evaluator, vector-space-model, web-crawler

Web Retrieval Engine Implementation for University Domain

Domain : Information Retrieval, Web Retrieval, Web Search
Sub-Domain : Search Engine, Natural Language Processing
Techniques : Vector Space Modeling, Query Operations, Relevance Ranking, Evaluation.

Description:

  1. Developed a vector space model based web retrieval engine for the University of Memphis domain (memphis.edu).
  2. Crawled and preprocessed 10,000 web pages and documents (text, pdf, docx and pptx) from the University of Memphis domain.
  3. Built modules: web crawler (incremental), text preprocessor (removes markup, metadata, digits, punctuation, extra whitespace and stop words, then lowercases, tokenizes and stems raw HTML/docs), indexer (doc-url, doc-term, term-doc), TF-IDF vector generator, web page relevance ranker and performance evaluator (F1, precision, recall).
  4. Used the TF-IDF vector space model for web page matching and the cosine similarity function for web page ranking.
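
As a rough illustration of the TF-IDF matching, cosine ranking and F1/precision/recall evaluation described above, here is a minimal standard-library sketch. It is not the repository's code. The function names, the sample document ids and the tf * log(N / df) weighting are assumptions, since the description does not say which TF-IDF variant is used.

```python
import math
from collections import Counter


def build_tfidf(docs):
    """docs maps doc id -> list of preprocessed tokens.

    Returns (vectors, idf). The weighting tf * log(N / df) is an assumed variant.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, vectors, idf):
    """Return (doc id, score) pairs with a positive score, best match first."""
    q_vec = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    scored = [(doc_id, cosine(q_vec, vec)) for doc_id, vec in vectors.items()]
    scored = [(doc_id, score) for doc_id, score in scored if score > 0.0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def evaluate(retrieved, relevant):
    """Set-based precision, recall and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up, already stemmed documents used only for illustration.
    sample_docs = {
        "admissions.html": ["graduat", "admiss", "deadlin"],
        "cs.html": ["comput", "scienc", "graduat", "program"],
        "parking.html": ["campus", "park", "permit"],
    }
    vectors, idf = build_tfidf(sample_docs)
    results = rank(["graduat"], vectors, idf)
    print(results)
    print(evaluate([doc_id for doc_id, _ in results], ["cs.html"]))
```

Storing vectors as sparse dicts keeps the sketch short. For a collection of 10,000 pages, a matrix library would be a more practical choice.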

Web Retrieval Engine Graphical Interface:

Sample Search:

Instructions for Web Crawling and Vector Space Model Building:

  1. Go to search_engine/search_engine_website.
  2. Run the inverse_document_indexer_final function in the "search_engine.py" file to collect documents (html/php/txt/doc/docx/ppt/pptx) using the web crawler.
  3. This builds the vector space model, with the inverse document indexer (inverted index) and TF-IDF vectors for all collected documents.
  4. To crawl a different website, change the url value in "search_engine.py".
  5. Enter a query term to retrieve or search within the collected web documents.
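
To make the preprocessing and indexing step more concrete, the sketch below shows one way raw HTML could be turned into stemmed tokens and a term-doc inverted index. It is an illustration under stated assumptions, not the contents of search_engine.py. The names preprocess, naive_stem and build_inverted_index, the tiny stop word list and the example.org URLs are all hypothetical, and the suffix stripper is only a stand-in for a real stemmer such as Porter's.

```python
import re
import string
from collections import defaultdict
from html.parser import HTMLParser

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on", "at"}


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def naive_stem(token):
    """Placeholder suffix stripper (assumption); not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(raw_html):
    """Strip markup, lowercase, drop digits, punctuation and stop words, then stem."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.parts).lower()
    text = re.sub(r"\d+", " ", text)
    punct_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(punct_to_space)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


def build_inverted_index(pages):
    """pages maps url -> raw HTML. Returns term -> {url: term frequency}."""
    index = defaultdict(dict)
    for url, html in pages.items():
        for term in preprocess(html):
            index[term][url] = index[term].get(url, 0) + 1
    return dict(index)


if __name__ == "__main__":
    # Hypothetical pages standing in for crawled documents.
    sample_pages = {
        "http://example.org/a": "<p>Parking permits</p>",
        "http://example.org/b": "<p>Parking parking map</p>",
    }
    print(build_inverted_index(sample_pages))
```

The term-doc mapping gives the document frequency of each term directly (the number of URLs stored under it), which is what a TF-IDF generator needs.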

Instructions for Running a Query on the Web Interface:

  1. To run the Django server, go to "search_engine/search_engine_website".
  2. Open a command prompt in the directory that contains manage.py and run manage.py, preceded by the location of python.exe and python, in the following manner:
  3. C:\Users\Anjana\Anaconda3\python manage.py runserver (format: location of python.exe + python + manage.py)
  4. To view the web interface for the search engine, go to http://127.0.0.1:8000/

Django Installation Steps:

  1. Install pip for Python.
  2. Move to the Python directory or the Scripts directory in Anaconda.
  3. Run "pip install Django" to install Django.

Languages : Python
Tools/IDE : Anaconda, Django
Libraries : Django
Duration : August - December 2017

Current Version : v1.0.0.0

Last Update : 12.01.2017
