
Comments (3)

domantasm96 commented on August 19, 2024

Hey!

Since the project code had become a bit messy and dense, I decided to split it into three parts:

  1. construct_features.py - Downloads each URL's HTML and tokenizes its words. These tokens are later used to generate the most frequent word list for each available category.

  2. construct_models.py - After the HTML from all websites is parsed, the next step is text normalization. This script normalizes the word tokens extracted from each website's text by removing stop words and translating non-English words with Google Translate.
    The main goal of this step is to make the word tokens English-friendly, because the most frequent word list for each category should consist only of English words (other languages would work too, but for this project I decided to use English).

  3. train_models.py - Trains and tests the ML models.
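The first two steps can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the names `TextExtractor`, `tokenize_html`, `normalize`, `most_frequent_words`, and the tiny `STOP_WORDS` set are all assumptions made for the example, and the translation step is left out.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller one).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


class TextExtractor(HTMLParser):
    """Collects text from HTML, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def tokenize_html(html):
    """Step 1: turn a page's HTML into lowercase word tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return re.findall(r"[a-z]+", text)


def normalize(tokens):
    """Step 2 (partial): drop stop words and single letters; no translation here."""
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


def most_frequent_words(token_lists, top_n=3):
    """Count tokens across all pages of one category and return the top words."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(top_n)]
```

Calling `most_frequent_words` on the normalized tokens of every page in a category gives that category's frequent-word list, which is the kind of feature the training step would then consume.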

I know this could be a bit confusing since the README file is no longer up to date. I'm going to update it soon.

There are some tasks that I'm going to implement in this project when I have more free time:

  1. Update the README file with up-to-date information.
  2. Write a script that predicts the category of a website passed manually as an argument. (That functionality was already implemented in earlier commits: https://github.com/domantasm96/URL-categorization-using-machine-learning/blob/bc2a61daeab69458a6d4158120100692a0c272e1/Scripts/predict_url.py)
  3. Improve the ML models by using more advanced machine learning frameworks, which should also improve prediction accuracy.
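For the second task, a command-line entry point might look something like the sketch below. It is not the contents of the linked `predict_url.py`: the `CATEGORY_WORDS` lists, `predict_category`, and `build_parser` are hypothetical, and a simple word-overlap score stands in for the trained model.

```python
import argparse
from collections import Counter

# Hypothetical per-category frequent-word lists; real ones would come from
# the feature-construction step.
CATEGORY_WORDS = {
    "sports": {"football", "team", "match", "league"},
    "news": {"breaking", "report", "politics", "world"},
}


def predict_category(tokens, category_words=CATEGORY_WORDS):
    """Return the category whose word list overlaps most with tokens, or None."""
    if not category_words:
        return None
    counts = Counter(tokens)
    # Score each category by how often its frequent words occur in the page.
    scores = {
        category: sum(counts[word] for word in words)
        for category, words in category_words.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def build_parser():
    """Command-line interface: a single positional URL argument."""
    parser = argparse.ArgumentParser(description="Predict a website's category.")
    parser.add_argument("url", help="URL of the website to categorize")
    return parser
```

A full script would parse the arguments with `build_parser().parse_args()`, download and tokenize the page at `args.url`, and pass the tokens to `predict_category` (or to the trained model in its place).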

Thanks for asking, and if you have any more questions, feel free to ask! :)

from url-categorization-using-machine-learning.

thecoderxman commented on August 19, 2024

Thanks for the explanation. If possible, I will try to improve the accuracy and update you.


domantasm96 commented on August 19, 2024

That would be great!
