URL categorization using machine learning

The Internet is an important source of information for machine learning algorithms. Web pages store diverse information across many domains, and a key problem is how to categorize that information.

Website classification is performed with NLP techniques that generate word frequencies for each category. By calculating category weights from these frequencies, it is possible to predict the category of a website.
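The idea above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the function names, the tokenizer, the toy training data, and the rule "weight = sum of stored frequencies for the page's tokens" are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_frequency_model(labelled_texts):
    """Count how often each word occurs in the texts of each category."""
    model = {}
    for category, text in labelled_texts:
        model.setdefault(category, Counter()).update(tokenize(text))
    return model


def category_weights(model, tokens):
    """Assumed weighting: sum a category's stored frequency for every page token."""
    return {
        category: sum(frequencies[token] for token in tokens)
        for category, frequencies in model.items()
    }


def predict(model, text):
    """Return the (category, weight) pair with the highest weight."""
    weights = category_weights(model, tokenize(text))
    return max(weights.items(), key=lambda item: item[1])


# Toy training data, invented purely for illustration.
training = [
    ("News", "breaking news headlines world news report"),
    ("Sports", "football match score league football team"),
]
model = build_frequency_model(training)
prediction = predict(model, "latest football score and league table")
print(prediction)  # ('Sports', 4)
```

Words that never appeared in a category contribute zero to its weight, so the category whose vocabulary overlaps most with the page wins.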

The main dataset for this project can be found here: URL categorization dataset file

URL category prediction usage

For URL predictions, you can use the pre-generated word frequency model that was created on 2020-12-26: frequency_mode/word_frequency_2021.pickle

Otherwise, you can create your own word frequency model by running construct_features.py.
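If you want to inspect a word frequency model before using it, a pickle file can be opened with the standard library. The sketch below only assumes that the file is a regular Python pickle; the helper name `load_frequency_model` is hypothetical, and the structure of the stored object is not documented here, so the code makes no assumptions about it.

```python
import pickle
from pathlib import Path


def load_frequency_model(path):
    """Unpickle and return whatever object the model file contains."""
    # pickle can execute arbitrary code while loading, so only open trusted files.
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

For example, `type(load_frequency_model("frequency_mode/word_frequency_2021.pickle"))` shows which kind of object the pre-generated model stores.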

For managing Python versions, I highly recommend the pyenv tool.

  1. Start a Poetry shell (see the Poetry installation guide):
foo@bar:~$ poetry shell
  2. Start the local FastAPI server:
foo@bar:~$ uvicorn url_predictions.api_main:app --reload
  3. There are two ways to get website category predictions:
  • Use a curl command to predict the category of a URL:
curl -X 'POST' \
  'http://localhost:8000/predict/?url=bbc.com' \
  -H 'accept: application/json' \
  -d ''
  • Use the FastAPI UI:
    1. Go to http://localhost:8000/docs
    2. Expand the /predict/ POST endpoint
    3. Enter a URL and press Execute
    4. You should get a JSON response with the results
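The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it mirrors the curl command above (POST, empty body, `url` passed as a query parameter). The helper names `build_predict_url` and `predict_category` and the 30-second timeout are assumptions made for this example, not part of the project.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # local uvicorn server started in the steps above


def build_predict_url(site, base_url=BASE_URL):
    """Build the /predict/ request URL with the site passed as a query parameter."""
    return f"{base_url}/predict/?{urlencode({'url': site})}"


def predict_category(site, base_url=BASE_URL, timeout=30):
    """POST to /predict/ with an empty body and return the decoded JSON response."""
    request = Request(
        build_predict_url(site, base_url),
        data=b"",
        headers={"accept": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `predict_category("bbc.com")` should return the JSON response as a Python dictionary.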

Prediction results structure

Request URL: http://localhost:8000/predict/?url=bbc.com

Response:

{
  "main_category": "News_and_Media",
  "category_weight": 7123532,
  "sub_category": "Reference",
  "sub_weight": 7038726,
  "response": "All HTML content",
  "tokens":  [
      "bbc",
      "homepage",
      "homepageaccessibility",
      "linksskip",
      "contentaccessibility",
      .
      .
      .
      ]
}
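A response of this shape can be read with the standard `json` module. The sketch below uses the sample values shown above, with `tokens` limited to the five entries that are visible; the `summarize` helper is hypothetical and only demonstrates how the documented fields can be accessed.

```python
import json

# Sample copied from the response above; "tokens" holds only the five entries shown.
sample = """
{
  "main_category": "News_and_Media",
  "category_weight": 7123532,
  "sub_category": "Reference",
  "sub_weight": 7038726,
  "response": "All HTML content",
  "tokens": ["bbc", "homepage", "homepageaccessibility", "linksskip", "contentaccessibility"]
}
"""


def summarize(prediction):
    """Return a one-line summary of the main and sub category with their weights."""
    return (
        f"{prediction['main_category']} ({prediction['category_weight']}), "
        f"then {prediction['sub_category']} ({prediction['sub_weight']})"
    )


result = json.loads(sample)
summary = summarize(result)
print(summary)  # News_and_Media (7123532), then Reference (7038726)
```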

Documentation

Website Classification Using Machine Learning Approaches.pdf

This project is my Bachelor's thesis, so it also includes documentation written in LaTeX. The documentation, together with the original LaTeX files, is located in the Documentation folder.

Please note that some concepts described in the documentation no longer exist, and some newer features are not covered, because changes were made after the documentation was written.
