License: MIT License

URL categorization using machine learning

The Internet is an important source of information for machine learning algorithms. Web pages store diverse information about multiple domains. One critical problem is how to categorize this information.

Website classification is performed using NLP techniques that generate word frequencies for each category. By calculating category weights from these frequencies, it is possible to predict the categories of a website.
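The idea of word frequencies and category weights can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names, the toy training data, and the scoring rule (summing per-category frequencies of the page's tokens) are assumptions made only for illustration.

```python
from collections import Counter


def build_frequency_model(labelled_tokens):
    """Map each category to a Counter of word frequencies (hypothetical helper)."""
    model = {}
    for category, tokens in labelled_tokens:
        model.setdefault(category, Counter()).update(tokens)
    return model


def predict_categories(model, tokens):
    """Return (category, weight) pairs sorted by descending weight.

    Assumed scoring rule: a category's weight is the sum of the
    frequencies of the given tokens within that category.
    """
    weights = {
        category: sum(freqs[token] for token in tokens)
        for category, freqs in model.items()
    }
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Toy training data, invented for this example.
training = [
    ("News_and_Media", ["news", "world", "headlines", "news"]),
    ("Sports", ["football", "score", "league", "news"]),
]
model = build_frequency_model(training)
ranking = predict_categories(model, ["news", "headlines", "today"])
main_category, category_weight = ranking[0]
print(main_category, category_weight)
```

The highest-weighted entry plays the role of the main category, and the second entry the role of the sub category.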

The main dataset for this project can be found here: URL categorization dataset file

URL category prediction usage

For URL predictions, you can use the pre-generated word frequency model that was created on 2020-12-26: frequency_mode/word_frequency_2021.pickle

Otherwise, you can create your own word frequency model by executing construct_features.py.
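A pickled model is typically stored and restored with Python's pickle module. The sketch below shows the round trip on a toy dictionary. The internal structure of the project's pickle file is not documented here, so the dictionary layout, the file name, and the helper names are assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize a model object to disk (hypothetical helper)."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Restore a model object from disk (hypothetical helper)."""
    # Only unpickle files you trust: pickle can execute arbitrary code on load.
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Toy stand-in for a word frequency model; the real layout may differ.
toy_model = {"News_and_Media": {"news": 2, "headlines": 1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "word_frequency_example.pickle"
    save_model(toy_model, model_path)
    restored = load_model(model_path)

print(restored == toy_model)
```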

To manage Python versions, I highly recommend the pyenv tool.

Execute poetry shell (see the Poetry installation guide):

foo@bar:~$ poetry shell

Start the local FastAPI server:

foo@bar:~$ uvicorn url_predictions.api_main:app --reload

There are two ways to get website category predictions:

Use a curl command to predict the category of a URL:

curl -X 'POST' \
  'http://localhost:8000/predict/?url=bbc.com' \
  -H 'accept: application/json' \
  -d ''

Use the FastAPI UI:
1. Go to http://localhost:8000/docs
2. Expand the /predict/ POST endpoint
3. Enter a URL and press Execute
4. You should get a JSON response with the results
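The curl request above can also be issued from Python using only the standard library. This is a sketch that assumes the server is running locally on port 8000 as shown; build_predict_request and predict are hypothetical helper names, not part of the project.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the local uvicorn server from above


def build_predict_request(url, base_url=BASE_URL):
    """Build the same POST request as the curl command (hypothetical helper)."""
    query = urllib.parse.urlencode({"url": url})
    return urllib.request.Request(
        f"{base_url}/predict/?{query}",
        data=b"",  # empty body, like curl's -d ''
        headers={"accept": "application/json"},
        method="POST",
    )


def predict(url, base_url=BASE_URL):
    """Send the request and decode the JSON response.

    Requires the FastAPI server to be running.
    """
    with urllib.request.urlopen(build_predict_request(url, base_url)) as response:
        return json.load(response)


request = build_predict_request("bbc.com")
print(request.get_method(), request.full_url)
```

With the server running, predict("bbc.com") would return the decoded JSON response as a dictionary.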

Prediction results structure

Request URL: http://localhost:8000/predict/?url=bbc.com

Response:

{
  "main_category": "News_and_Media",
  "category_weight": 7123532,
  "sub_category": "Reference",
  "sub_weight": 7038726,
  "response": "All HTML content",
  "tokens":  [
      "bbc",
      "homepage",
      "homepageaccessibility",
      "linksskip",
      "contentaccessibility",
      .
      .
      .
      ]
}
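One way to consume this response in client code is to map it onto a small typed structure. The sketch below assumes the response contains exactly the fields shown above; the Prediction class and parse_prediction function are hypothetical, and the sample tokens list is abbreviated to the entries shown.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    """Typed view of the prediction response (hypothetical class)."""
    main_category: str
    category_weight: int
    sub_category: str
    sub_weight: int
    response: str
    tokens: List[str]


def parse_prediction(payload):
    """Decode a JSON string into a Prediction.

    Raises TypeError if the payload has missing or unexpected fields.
    """
    return Prediction(**json.loads(payload))


# Sample payload copied from the response above, with tokens abbreviated.
sample = """
{
  "main_category": "News_and_Media",
  "category_weight": 7123532,
  "sub_category": "Reference",
  "sub_weight": 7038726,
  "response": "All HTML content",
  "tokens": ["bbc", "homepage", "homepageaccessibility", "linksskip", "contentaccessibility"]
}
"""
prediction = parse_prediction(sample)
print(prediction.main_category, prediction.sub_category)
```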

Documentation

Website Classification Using Machine Learning Approaches.pdf

This project is my Bachelor's thesis, so it also includes documentation written in LaTeX. The documentation, together with the original LaTeX files, is located in the Documentation folder.

Please note that some concepts described in the documentation no longer exist, and some newer features are not covered, because changes were made after the documentation was written.

url-categorization-using-machine-learning's Issues

No main.py file

Could you please upload the main Python file? It would be very helpful.

Dataset is too old

The dataset is old and lacks detailed subcategories.

RequestsDependencyWarning

Maybe a newbie question, but this is what I'm getting when running 01_construct_features.py
Warning (from warnings module):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/requests/__init__.py", line 109
    warnings.warn(
RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!
Traceback (most recent call last):
  File "/Users/erikchavez/Downloads/URL-categorization-using-machine-learning-master/01_construct_features.py", line 9, in <module>
    df = pd.read_csv(config.MAIN_DATASET_PATH)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1218, in _make_engine
    self.handles = get_handle( # type: ignore[call-overload]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/common.py", line 786, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/url_categorization_dfe.csv'

Incorrect category for CNN

When I try bbc it works.
But when I try cnn.com, it shows "Business_and_Industry" and "Law_and_Government". Why does it not predict News for CNN?

Similarly, I tried https://www.firstpost.com/ and it predicted the following:
Predicted main category: Law_and_Government
Predicted submain category: People_and_Society

Additional Model Training

Thank you so much for this! Any suggestions on how one would improve the current model? I do see the dataset is quite outdated. I'd like to train it further and add more categories.

I do have a list of millions of websites and their basic category that I assume this could use to scan for categorization keywords.

Thanks!

Warm Regards,
Reece

Other ML learning algorithm source code

Hello! Nice project!

As I see in your work, you used LSVC and LR algorithms and compared their efficiency. Maybe you could share the source code for them too?

I would like to play with this ML and source code would help a lot!

Thanks in advance @domantasm96

Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.

I get this error even after running 500 / 1000 steps, and no URLs are parsed.

Could you please explain why I am encountering this error?

Thanks in advance.

RuntimeError

Hello,

I'm currently experiencing a RuntimeError that says "An attempt has been made to start a new process before the current process has finished its bootstrapping phase."

File "C:\Users\sbae\PycharmProjects\URL-categorization-using-machine-learning\01_construct_features.py", line 29, in <module>
res = ex.map(parse_request, [(i, elem) for i, elem in enumerate(results)])
File "C:\Program Files\Python310\lib\concurrent\futures\process.py", line 761, in map
results = super().map(partial(_process_chunk, fn),
File "C:\Program Files\Python310\lib\concurrent\futures\_base.py", line 610, in map
fs = [self.submit(fn, *args) for args in zip(*iterables)]
File "C:\Program Files\Python310\lib\concurrent\futures\_base.py", line 610, in <listcomp>
fs = [self.submit(fn, *args) for args in zip(*iterables)]
File "C:\Program Files\Python310\lib\concurrent\futures\process.py", line 732, in submit
self._adjust_process_count()
File "C:\Program Files\Python310\lib\concurrent\futures\process.py", line 692, in _adjust_process_count
self._spawn_process()
File "C:\Program Files\Python310\lib\concurrent\futures\process.py", line 709, in _spawn_process
p.start()
File "C:\Program Files\Python310\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Program Files\Python310\lib\multiprocessing\context.py", line 336, in _Popen
return Popen(process_obj)
File "C:\Program Files\Python310\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
prep_data = spawn.get_preparation_data(process_obj._name)
File "C:\Program Files\Python310\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "C:\Program Files\Python310\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
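The idiom named in the error message can be sketched as follows. On Windows, child processes are started with spawn, which re-imports the main module, so any code that creates processes has to sit behind the main guard. This is a minimal illustration, not the project's script: parse_request here is a stand-in that only upper-cases its input, and the URL list is invented.

```python
from concurrent.futures import ProcessPoolExecutor


def parse_request(item):
    """Stand-in worker (hypothetical): the real script fetches and parses a URL."""
    index, url = item
    return index, url.upper()


def main():
    urls = ["bbc.com", "cnn.com"]  # example input, invented for this sketch
    # The executor is created only inside main(), never at import time.
    with ProcessPoolExecutor(max_workers=2) as executor:
        return list(executor.map(parse_request, enumerate(urls)))


if __name__ == "__main__":
    # Child processes re-import this module under spawn; the guard keeps
    # them from starting their own pools while bootstrapping.
    print(main())
```

Saved as a script and run directly, this starts the worker pool only in the parent process.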

Url-Categorization

After running 01_construct_features.py, I get the error "['main_category_confidence'] not in index".

Running construct_data results in error message

After running 01_construct_features.py this result is returned and it does not appear to create a new dataset.

"None of [Index(['url', 'main_category', 'main_category:confidence'], dtype='object')] are in the [columns]"

Error: Executing construct_features.py

Hi,

Thanks for the interesting project. When I run ./construct_data.sh, I get the following errors:

Executing construct_features.py
Traceback (most recent call last):
File "01_construct_features.py", line 46, in <module>
df = pd.read_csv(input_path)[['url', 'main_category', 'main_category:confidence']]
File "/home/alex/dev/URL-categorization-using-machine-learning/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 2806, in __getitem__
indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
File "/home/alex/dev/URL-categorization-using-machine-learning/venv/lib/python3.8/site-packages/pandas/core/indexing.py", line 1552, in _get_listlike_indexer
self._validate_read_indexer(
File "/home/alex/dev/URL-categorization-using-machine-learning/venv/lib/python3.8/site-packages/pandas/core/indexing.py", line 1640, in _validate_read_indexer
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['url', 'main_category', 'main_category:confidence'], dtype='object')] are in the [columns]"
Executing construct_models.py
2020-07-15 21:41:03.084514
Executing train_models.py

I think the content of the Datasets/URL-categorization-DFE.csv file is incorrect:
version https://git-lfs.github.com/spec/v1
oid sha256:8b20976058afc5b6a15a7277cf9060c3f9eb4b51e0ac0c854dd55460febaa743
size 5738345

Could you help fix this issue? Thanks!
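The three lines quoted above are a Git LFS pointer file, not CSV data, which would explain why none of the expected columns are found. A script could check for this before loading the dataset. The sketch below is an assumption-based illustration: is_lfs_pointer is a hypothetical helper, and the temporary files only imitate the two cases.

```python
import tempfile
from pathlib import Path

# Git LFS pointer files begin with this version line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header (hypothetical helper)."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


with tempfile.TemporaryDirectory() as tmp_dir:
    # Imitation of the pointer file quoted in the issue.
    pointer = Path(tmp_dir) / "pointer.csv"
    pointer.write_bytes(
        b"version https://git-lfs.github.com/spec/v1\n"
        b"oid sha256:8b20976058afc5b6a15a7277cf9060c3f9eb4b51e0ac0c854dd55460febaa743\n"
        b"size 5738345\n"
    )
    # Imitation of a real CSV file with the expected header row.
    real = Path(tmp_dir) / "real.csv"
    real.write_bytes(b"url,main_category,main_category:confidence\n")

    pointer_detected = is_lfs_pointer(pointer)
    real_detected = is_lfs_pointer(real)

print(pointer_detected, real_detected)
```

If the check returns True for the dataset, installing Git LFS and running git lfs pull in the repository normally replaces the pointer with the real file content.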

I couldn't run it!

Hi there, Thanks for your great work.

Can you please create a Dockerfile for this project?

I want to run it, but so far I have had no success!

domantasm96 / url-categorization-using-machine-learning