Nap_Zap: A Search Engine Using Python

This project builds a web search engine in Python. It meets the following criteria:

  1. Collect HTML pages up to a maximum size (determined by the given crawl depth).
  2. Pre-process these pages (eliminate stop words).
  3. Index the crawled data.
  4. Accept queries and return results.
  5. Display the results in order of relevance.
  6. Provide a web interface.
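
Criteria 2 to 5 can be illustrated with a minimal sketch. This is not the project's actual code: the stop-word list, the function names (`preprocess`, `build_inverted_index`, `search`), and the simple term-count scoring are all assumptions chosen for illustration.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; the project's list may differ).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "in"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to a Counter of {doc_id: occurrences of the term}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term][doc_id] += 1
    return index


def search(index, query):
    """Return doc ids ordered by the summed counts of the query terms."""
    scores = Counter()
    for term in preprocess(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken by doc id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked]


# Hypothetical documents standing in for crawled pages.
docs = {
    0: "Python is a programming language",
    1: "The Python search engine indexes Python pages",
    2: "A web crawler collects HTML pages",
}
index = build_inverted_index(docs)
print(search(index, "the python pages"))  # [1, 0, 2]
```

Raw term counts keep the example short; a real ranking would more likely use a weighting scheme such as TF-IDF, which the README does not specify.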

Requirements

See requirements.txt

Structure

  1. indexes : A directory of web-page indexes built by indexer.py

    • forward_index : Forward index
    • inverted_index : Inverted index
    • url_to_id : Mapping for indexed URLs
  2. links : A directory of web pages crawled by Crawler.py

    Each crawled web page (URL) is saved offline in the links directory under a base64-encoded file name, so that even very long URLs get distinct names.

  3. static : A directory of static files such as pictures (e.g. the logo).

  4. templates : A directory of views (front-end files).

  5. Crawler.py : Main Crawler module

    To run Crawler.py:

           $ python Crawler.py --start_url "url" --max_depth depth_value
    

    url : Address of the website you want to crawl.
    depth_value : Maximum crawl depth (integer).

    • If the crawl completes successfully, it runs indexer.py automatically; if not, run the indexer.py module manually as described next.

    • By default, all crawled data is stored in the 'links/' directory.

  6. indexer.py : Main Indexer Module

    To run indexer.py:

           $ python indexer.py --stored_docs_dir links/ --index_dir indexes
    
    • This module requires two arguments:
    1. the directory of stored links (from which the index is generated)
    2. the index directory (where the generated index is stored)
    • If indexing completes successfully, it runs web_ui.py automatically.
  7. web_ui.py : Frontend (view)

    To run web_ui.py:

           $ python web_ui.py
    
  8. lang_proc.py : Language Processing Module

  9. util.py : HTML parser Module
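
The base64 file naming used for the links directory and the url_to_id mapping can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`url_to_filename`, `filename_to_url`, `build_url_to_id`) are hypothetical, and the URL-safe base64 alphabet is chosen here because standard base64 can emit `/`, which is not valid inside a file name. Crawler.py and indexer.py may do this differently.

```python
import base64


def url_to_filename(url):
    """Encode a URL as URL-safe base64 so it can serve as a file name."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")


def filename_to_url(name):
    """Recover the original URL from its base64 file name."""
    return base64.urlsafe_b64decode(name.encode("ascii")).decode("utf-8")


def build_url_to_id(urls):
    """Assign consecutive integer ids to URLs in first-seen order."""
    url_to_id = {}
    for url in urls:
        # setdefault leaves the id of an already-seen URL unchanged.
        url_to_id.setdefault(url, len(url_to_id))
    return url_to_id


# Hypothetical crawl results, including one duplicate URL.
urls = [
    "https://www.python.org/",
    "https://www.wikipedia.org/",
    "https://www.python.org/",
]
name = url_to_filename(urls[0])
print(name)
print(filename_to_url(name) == urls[0])
print(build_url_to_id(urls))
```

Because the encoding is reversible, the original URL can be recovered from the file name alone. Base64 output is about a third longer than its input, so very long URLs may still exceed a file system's file-name length limit.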

Currently, 2 websites have been crawled:
1. wikipedia.org (crawl depth 1)
2. Python.org (crawl depth 3)

Contributors

prithviz
