url-categorization-using-machine-learning's Introduction

URL categorization using machine learning

The Internet is an important source of information for machine learning algorithms. Web pages store diverse information across many domains, and one critical problem is how to categorize it.

Website classification is performed with NLP techniques that generate word frequencies for each category. By calculating category weights from these frequencies, it is possible to predict the categories of a website.
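The idea above can be illustrated with a minimal sketch. It assumes that the model maps each category to a word-to-frequency dictionary and that a category's weight is the sum of the stored frequencies of the page tokens; the toy model, the token list and the function names are hypothetical and are not taken from the project's code.

```python
from collections import Counter

# Hypothetical toy model: category -> {word: frequency observed for that category}.
WORD_FREQUENCIES = {
    "News_and_Media": {"news": 120, "world": 80, "video": 40},
    "Sports": {"football": 150, "score": 90, "news": 30},
}


def category_weights(tokens, model):
    """For each category, sum the stored frequency of every token on the page."""
    counts = Counter(tokens)
    return {
        category: sum(frequencies.get(word, 0) * n for word, n in counts.items())
        for category, frequencies in model.items()
    }


def predict(tokens, model):
    """Return the two highest-weighted (category, weight) pairs."""
    ranked = sorted(
        category_weights(tokens, model).items(),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[0], ranked[1]


# Assumed example tokens extracted from a page.
tokens = ["world", "news", "video", "news"]
main, sub = predict(tokens, WORD_FREQUENCIES)
print(main, sub)
```

In this sketch the highest weight plays the role of the main category and the second highest the role of the sub category.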

The main dataset for this project is the URL categorization dataset file.

URL category prediction usage

To predict URL categories, you can use the pre-generated word frequency model that was created on 2020-12-26: frequency_mode/word_frequency_2021.pickle

Otherwise, you can create your own word frequency model by executing construct_features.py.

To manage Python versions, I highly recommend the pyenv tool.

  1. Execute poetry shell (see the Poetry installation guide).
foo@bar:~$ poetry shell
  2. Start the FastAPI local server.
foo@bar:~$ uvicorn url_predictions.api_main:app --reload
  3. There are two ways to get website category predictions:
  • Use a curl command to request a prediction for a URL:
curl -X 'POST' \
  'http://localhost:8000/predict/?url=bbc.com' \
  -H 'accept: application/json' \
  -d ''
  • Use FastAPI UI:
    1. Go to http://localhost:8000/docs
    2. Expand the /predict/ POST endpoint
    3. Enter a URL and press Execute
    4. You should get a JSON response with the results
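The curl request above could also be issued from Python. The following is a minimal sketch using only the standard library; it assumes the local server from step 2 is running on port 8000, and the function names build_predict_request and predict are hypothetical helpers, not part of the project.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example; assumes the local uvicorn server is running.
BASE_URL = "http://localhost:8000/predict/"


def build_predict_request(site, base_url=BASE_URL):
    """Build the same POST request as the curl command: empty body, url as query parameter."""
    query = urllib.parse.urlencode({"url": site})
    return urllib.request.Request(
        f"{base_url}?{query}",
        data=b"",
        headers={"accept": "application/json"},
        method="POST",
    )


def predict(site):
    """Send the request and decode the JSON response (requires the running server)."""
    with urllib.request.urlopen(build_predict_request(site)) as response:
        return json.load(response)


# Building the request does not contact the server.
request = build_predict_request("bbc.com")
print(request.get_method(), request.full_url)
```

With the server running, predict("bbc.com") should return the decoded JSON response as a dictionary.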

Prediction results structure

Request URL: http://localhost:8000/predict/?url=bbc.com

Response:

{
  "main_category": "News_and_Media",
  "category_weight": 7123532,
  "sub_category": "Reference",
  "sub_weight": 7038726,
  "response": "All HTML content",
  "tokens":  [
      "bbc",
      "homepage",
      "homepageaccessibility",
      "linksskip",
      "contentaccessibility",
      .
      .
      .
      ]
}
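A response of this shape could be read as in the sketch below. The field names come from the example response; the Prediction class, the margin property and the shortened token list are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

# Sample body based on the example response, with the token list shortened.
RESPONSE_BODY = """
{
  "main_category": "News_and_Media",
  "category_weight": 7123532,
  "sub_category": "Reference",
  "sub_weight": 7038726,
  "response": "All HTML content",
  "tokens": ["bbc", "homepage"]
}
"""


@dataclass
class Prediction:
    main_category: str
    category_weight: int
    sub_category: str
    sub_weight: int

    @property
    def margin(self) -> int:
        """Difference between the main and sub category weights."""
        return self.category_weight - self.sub_weight


def parse_prediction(body: str) -> Prediction:
    """Pick the category fields out of the JSON body."""
    data = json.loads(body)
    return Prediction(
        main_category=data["main_category"],
        category_weight=data["category_weight"],
        sub_category=data["sub_category"],
        sub_weight=data["sub_weight"],
    )


prediction = parse_prediction(RESPONSE_BODY)
print(prediction.main_category, prediction.sub_category, prediction.margin)
```

A small margin relative to the weights suggests that the two top categories scored almost equally for the page.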

Documentation

Website Classification Using Machine Learning Approaches.pdf

This project is my Bachelor's thesis, so it also has a documentation part written in LaTeX. The documentation, with the original LaTeX files, is located in the Documentation folder.

Please note that some concepts described in the documentation no longer exist and some newer features are not covered, because changes were made after the documentation was written.

url-categorization-using-machine-learning's People

Contributors

domantasm96

url-categorization-using-machine-learning's Issues

No main.py file

Could you please upload the main Python file? It would be very helpful.

RequestsDependencyWarning

Maybe a newbie question, but this is what I'm getting when running 01_construct_features.py
Warning (from warnings module):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/requests/__init__.py", line 109
warnings.warn(
RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!
Traceback (most recent call last):
File "/Users/erikchavez/Downloads/URL-categorization-using-machine-learning-master/01_construct_features.py", line 9, in <module>
df = pd.read_csv(config.MAIN_DATASET_PATH)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
self._engine = self._make_engine(f, self.engine)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1218, in _make_engine
self.handles = get_handle( # type: ignore[call-overload]
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/common.py", line 786, in get_handle
handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/url_categorization_dfe.csv'

Incorrect category for CNN

Hi

When I try bbc it works.
But when I try this with cnn.com, it shows "Business_and_Industry" and "Law_and_Government". Why does it not predict News for CNN?

Similarly, I tried https://www.firstpost.com/ and it predicted the following:
Predicted main category: Law_and_Government
Predicted submain category: People_and_Society

Additional Model Training

Thank you so much for this! Any suggestions on how one would improve the current model? I do see the dataset is quite outdated. I'd like to train it further and add more categories.

I do have a list of millions of websites and their basic category that I assume this could use to scan for categorization keywords.

Thanks!

Warm Regards,
Reece

Other ML learning algorithm source code

Hello! Nice project!

As I see in your work, you used LSVC and LR algorithms and compared their efficiency. Maybe you could share the source code for them too?

I would like to play with this ML and source code would help a lot!

Thanks in advance @domantasm96

RuntimeError

Hello,

I'm currently experiencing a RuntimeError that says "An attempt has been made to start a new process before the current process has finished its bootstrapping phase."

File "C:\Users\sbae\PycharmProjects\URL-categorization-using-machine-learning\01_construct_features.py", line 29, in <module>
res = ex.map(parse_request, [(i, elem) for i, elem in enumerate(results)])
File "C:\Program Files\Python310\lib\concurrent\futures\process.py", line 761, in map
results = super().map(partial(_process_chunk, fn),
File "C:\Program Files\Python310\lib\concurrent\futures\_base.py", line 610, in map
fs = [self.submit(fn, *args) for args in zip(*iterables)]
File "C:\Program Files\Python310\lib\concurrent\futures\_base.py", line 610, in <listcomp>
fs = [self.submit(fn, *args) for args in zip(*iterables)]
File "C:\Program Files\Python310\lib\concurrent\futures\process.py", line 732, in submit
self._adjust_process_count()
File "C:\Program Files\Python310\lib\concurrent\futures\process.py", line 692, in _adjust_process_count
self._spawn_process()
File "C:\Program Files\Python310\lib\concurrent\futures\process.py", line 709, in _spawn_process
p.start()
File "C:\Program Files\Python310\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Program Files\Python310\lib\multiprocessing\context.py", line 336, in _Popen
return Popen(process_obj)
File "C:\Program Files\Python310\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
prep_data = spawn.get_preparation_data(process_obj._name)
File "C:\Program Files\Python310\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "C:\Program Files\Python310\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
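The idiom named in the error message can be sketched as follows. This is a minimal illustration, not the project's actual script: parse_request reuses the name from the traceback, but its body, the sample input and the main function are hypothetical. The point is that the executor is created only under the main guard, so child processes started with spawn (the default on Windows) can import the module without starting new processes themselves.

```python
from concurrent.futures import ProcessPoolExecutor


def parse_request(item):
    """Hypothetical stand-in for the worker: receives an (index, element) pair."""
    index, element = item
    return index, str(element).upper()


def main():
    results = ["bbc.com", "cnn.com"]  # placeholder input, not the real dataset
    with ProcessPoolExecutor(max_workers=2) as ex:
        # Same call shape as in the traceback, now reachable only from the main guard.
        return list(ex.map(parse_request, [(i, elem) for i, elem in enumerate(results)]))


if __name__ == "__main__":
    # Child processes import this module with a different __name__, so they skip this block.
    print(main())
```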

Url-Categorization

After running 01_construct_features.py, I get the error "['main_category_confidence'] not in index".

Running construct_data results in error message

After running 01_construct_features.py this result is returned and it does not appear to create a new dataset.

"None of [Index(['url', 'main_category', 'main_category:confidence'], dtype='object')] are in the [columns]"

Error: Executing construct_features.py

Hi,

Thanks for the interesting project. When I run ./construct_data.sh, I get the following errors:

Executing construct_features.py
Traceback (most recent call last):
File "01_construct_features.py", line 46, in <module>
df = pd.read_csv(input_path)[['url', 'main_category', 'main_category:confidence']]
File "/home/alex/dev/URL-categorization-using-machine-learning/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 2806, in __getitem__
indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
File "/home/alex/dev/URL-categorization-using-machine-learning/venv/lib/python3.8/site-packages/pandas/core/indexing.py", line 1552, in _get_listlike_indexer
self._validate_read_indexer(
File "/home/alex/dev/URL-categorization-using-machine-learning/venv/lib/python3.8/site-packages/pandas/core/indexing.py", line 1640, in _validate_read_indexer
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['url', 'main_category', 'main_category:confidence'], dtype='object')] are in the [columns]"
Executing construct_models.py
2020-07-15 21:41:03.084514
Executing train_models.py

I think the content in Datasets/URL-categorization-DFE.csv file is incorrect:
version https://git-lfs.github.com/spec/v1
oid sha256:8b20976058afc5b6a15a7277cf9060c3f9eb4b51e0ac0c854dd55460febaa743
size 5738345

Could you help fix this issue? Thanks!
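The three lines quoted above are the content of a Git LFS pointer file rather than CSV data. A minimal sketch of a check that tells the two apart is shown below; the function name is_lfs_pointer and the temporary demo files are hypothetical, and the sample CSV row is made up for the demonstration.

```python
import tempfile
from pathlib import Path

# First line of every Git LFS pointer file, as quoted in the report above.
LFS_SIGNATURE = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer signature."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_SIGNATURE)) == LFS_SIGNATURE


# Demo on two temporary files: a pointer file and a file with real CSV content.
with tempfile.TemporaryDirectory() as tmp:
    pointer = Path(tmp) / "pointer.csv"
    pointer.write_bytes(
        LFS_SIGNATURE
        + b"\noid sha256:8b20976058afc5b6a15a7277cf9060c3f9eb4b51e0ac0c854dd55460febaa743"
        + b"\nsize 5738345\n"
    )
    real_csv = Path(tmp) / "real.csv"
    # Made-up row, only for the demonstration.
    real_csv.write_text("url,main_category,main_category:confidence\nexample.com,News_and_Media,1.0\n")

    pointer_result = is_lfs_pointer(pointer)
    csv_result = is_lfs_pointer(real_csv)

print(pointer_result, csv_result)
```

If the check returns True for the dataset path, the repository was probably cloned without Git LFS; installing Git LFS and running git lfs pull normally replaces the pointer with the real file.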

I couldn't run it!

Hi there, Thanks for your great work.

Can you please create a Dockerfile for this project?

I want to run it but till now, no success!
