
web-crawling's Introduction

CS419 - Information Retrieval

Web crawling and model assignments

Nguyen Quoc Viet - 1651069

Requirements

  • numpy
  • python 3
pip3 install numpy

Scraper

File: scraper.py

Usage:

  • Run scraper.py directly, or import the module and call scrape_web:
import scraper
url = "https://vnexpress.net/"
# level 1: follow links one level deep from the start page
links_count, documents = scraper.scrape_web(url, 1)

Parameters:

  • url: URL of the page to scrape
  • level: maximum recursion depth to scrape (0 means scrape only the given page)

Return:

  • links_count: number of links scraped
  • documents: a dictionary of document contents
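
To illustrate what a level-limited recursive scrape can look like, here is a minimal sketch using only the Python standard library. It is not the code in scraper.py: the names crawl, fetch_html and LinkAndTextParser are hypothetical, links_count is assumed to mean the number of pages visited, and documents is assumed to be keyed by URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects href targets and text fragments from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def fetch_html(url):
    """Download a page (assumption: the page is UTF-8 encoded)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(url, level, fetch=fetch_html, documents=None):
    """Scrape url, then follow its links while level is above 0."""
    if documents is None:
        documents = {}
    if url in documents:  # already visited, avoids loops between pages
        return len(documents), documents
    parser = LinkAndTextParser()
    parser.feed(fetch(url))
    parser.close()
    documents[url] = " ".join(parser.text_parts)
    if level > 0:
        for href in parser.links:
            # Resolve relative links against the current page
            crawl(urljoin(url, href), level - 1, fetch, documents)
    return len(documents), documents
```

The fetch parameter is there so the crawl logic can be tried on an in-memory dictionary of pages instead of the network. The sketch does no error handling and does not filter out script or style text, both of which a real scraper would likely need.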

Boolean Model

File: boolean_model.py

Usage:

  • Run boolean_model.py directly, or initialize BooleanModel with the documents returned by the web scraper and call retrieve to get the results for a query
  • Supported operators: and, or, not
import scraper
from boolean_model import BooleanModel
url = "https://vnexpress.net/"
links_count, documents = scraper.scrape_web(url, 1)
# Build the model from the scraped documents
model = BooleanModel(documents)
query = "việt and nam not mỹ"
res = model.retrieve(query)
  • Example query: "việt and nam not mỹ"
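
As a rough illustration of how such a query can be evaluated, the sketch below builds an inverted index and processes the query left to right with set operations. It is not the code in boolean_model.py: build_index and boolean_retrieve are hypothetical names, terms are assumed to be whitespace-separated, and "not" is assumed to mean "and not" when it follows a term, which is one plausible reading of the example query.

```python
OPERATORS = {"and", "or", "not"}


def build_index(documents):
    """Map each lowercase term to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_retrieve(documents, query):
    """Evaluate the query left to right (assumption: no operator precedence)."""
    index = build_index(documents)
    all_ids = set(documents)
    result = None
    operator = None
    for token in query.lower().split():
        if token in OPERATORS:
            operator = token
            continue
        postings = index.get(token, set())
        if result is None:
            # First term; a leading "not" means every document without the term
            result = all_ids - postings if operator == "not" else set(postings)
        elif operator == "or":
            result = result | postings
        elif operator == "not":
            result = result - postings
        else:
            # "and", or two adjacent terms with no operator between them
            result = result & postings
        operator = None
    return result if result is not None else set()
```

With three toy documents "việt nam", "việt nam mỹ" and "mỹ", the example query keeps only the first one, because the second is removed by "not mỹ" and the third does not contain "việt".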

Vector Model

File: vector_model.py

Usage:

  • Run vector_model.py directly, or initialize VectorModel with the documents returned by the web scraper and call retrieve to get a ranked list of results for a query
  • Input: a list of keywords separated by spaces
import scraper
from vector_model import VectorModel
url = "https://vnexpress.net/"
links_count, documents = scraper.scrape_web(url, 1)
# Build the model from the scraped documents
model = VectorModel(documents)
query = "việt nam"
res = model.retrieve(query)
  • Example query: "việt nam"
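
For readers unfamiliar with vector-space ranking, here is a minimal sketch that scores documents against a keyword query using TF-IDF weights and cosine similarity. It is not the code in vector_model.py: the weighting scheme (raw term count times log(N/df) + 1) and the names tfidf_vectors, cosine and rank are assumptions, and it uses plain dictionaries from the standard library rather than numpy to stay self-contained.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (idf, vectors); vectors maps doc id to {term: tf-idf weight}."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(documents)
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        doc_id: {term: count * idf[term] for term, count in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(documents, query):
    """Return (doc id, score) pairs with a positive score, best match first."""
    idf, vectors = tfidf_vectors(documents)
    # Query terms that occur in no document are ignored
    query_counts = Counter(t for t in query.lower().split() if t in idf)
    query_vec = {term: count * idf[term] for term, count in query_counts.items()}
    scored = ((doc_id, cosine(query_vec, vec)) for doc_id, vec in vectors.items())
    return sorted((pair for pair in scored if pair[1] > 0),
                  key=lambda pair: pair[1], reverse=True)
```

A document that contains exactly the query terms scores highest, and documents that share no term with the query are left out of the ranking.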

All together

  • Run main.py to run everything together. By default, it scrapes https://vnexpress.net/ with recursive level 1 and, for each page, stores all content found in that page's most frequently used HTML tag as one document. The user can then enter queries, and the results from both the Boolean Model and the Vector Model are shown.
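
The "most frequently used HTML tag" step can be sketched as follows, using the standard library's html.parser. This is only one interpretation, not the code in main.py or scraper.py: it assumes "most frequent" means the tag that opens most often on the page, and that only text sitting directly inside that tag is kept. TagTextCollector and main_content are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not go on the open-tag stack
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagTextCollector(HTMLParser):
    """Counts tags and records the text found directly inside each tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []          # stack of currently open tags
        self.tag_counts = Counter()  # how many times each tag opens
        self.text_by_tag = {}        # tag -> list of text fragments

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            self.text_by_tag.setdefault(self.open_tags[-1], []).append(text)


def main_content(html):
    """Join the text of the page's most frequent tag into one document."""
    parser = TagTextCollector()
    parser.feed(html)
    parser.close()
    if not parser.tag_counts:
        return ""
    tag, _ = parser.tag_counts.most_common(1)[0]
    return " ".join(parser.text_by_tag.get(tag, []))
```

On a news page where most of the body is made of paragraph tags, this heuristic tends to keep the article text and drop navigation and headings, although the result depends on how the page is marked up.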
