
url-categorization-using-machine-learning's Issues

I couldn't run it!

Hi there, Thanks for your great work.

Could you please add a Dockerfile for this project?

I have been trying to run it, but so far without success.

Running construct_data results in error message

After running 01_construct_features.py, the following error is returned, and no new dataset appears to be created.

"None of [Index(['url', 'main_category', 'main_category:confidence'], dtype='object')] are in the [columns]"

Incorrect category for CNN

Hi

When I try bbc, it works.
But when I try cnn.com, it shows "Business_and_Industry" and "Law_and_Government". Why does it not predict News for CNN?

Similarly, I tried https://www.firstpost.com/ and it predicted the following:
Predicted main category: Law_and_Government
Predicted submain category: People_and_Society

RequestsDependencyWarning

Maybe a newbie question, but this is what I'm getting when running 01_construct_features.py:
Warning (from warnings module):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/requests/__init__.py", line 109
    warnings.warn(
RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!
Traceback (most recent call last):
  File "/Users/erikchavez/Downloads/URL-categorization-using-machine-learning-master/01_construct_features.py", line 9, in <module>
    df = pd.read_csv(config.MAIN_DATASET_PATH)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1218, in _make_engine
    self.handles = get_handle( # type: ignore[call-overload]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/common.py", line 786, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/url_categorization_dfe.csv'
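
The path in this error is relative, so it is resolved against whatever directory the script was launched from. A minimal sketch of one way to rule that out is shown here; `resolve_dataset_path` and its parameters are hypothetical names for illustration and are not part of the project. Another report on this page refers to the file as Datasets/URL-categorization-DFE.csv, so the filename itself may also differ from the one in the config, which matters on case-sensitive filesystems.

```python
from pathlib import Path


def resolve_dataset_path(relative_path: str, base_dir: Path) -> Path:
    """Anchor a relative dataset path at base_dir and fail with a clear message.

    Hypothetical helper: the project itself only passes
    config.MAIN_DATASET_PATH straight to pd.read_csv.
    """
    candidate = (base_dir / relative_path).resolve()
    if not candidate.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {candidate}. "
            f"Current working directory: {Path.cwd()}"
        )
    return candidate
```

Inside a script, `base_dir` could be `Path(__file__).resolve().parent`, which would make the lookup independent of the directory the command is run from.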

Other ML learning algorithm source code

Hello! Nice project!

I see that you used the LSVC and LR algorithms and compared their efficiency. Could you share the source code for them as well?

I would like to experiment with these models, and the source code would help a lot!

Thanks in advance @domantasm96

Error: Executing construct_features.py

Hi,

Thanks for the interesting project. When I run ./construct_data.sh, it reports the following errors:

Executing construct_features.py
Traceback (most recent call last):
  File "01_construct_features.py", line 46, in <module>
    df = pd.read_csv(input_path)[['url', 'main_category', 'main_category:confidence']]
  File "/home/alex/dev/URL-categorization-using-machine-learning/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 2806, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/home/alex/dev/URL-categorization-using-machine-learning/venv/lib/python3.8/site-packages/pandas/core/indexing.py", line 1552, in _get_listlike_indexer
    self._validate_read_indexer(
  File "/home/alex/dev/URL-categorization-using-machine-learning/venv/lib/python3.8/site-packages/pandas/core/indexing.py", line 1640, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['url', 'main_category', 'main_category:confidence'], dtype='object')] are in the [columns]"
Executing construct_models.py
2020-07-15 21:41:03.084514
Executing train_models.py

I think the content of the Datasets/URL-categorization-DFE.csv file is incorrect. It contains only the following:
version https://git-lfs.github.com/spec/v1
oid sha256:8b20976058afc5b6a15a7277cf9060c3f9eb4b51e0ac0c854dd55460febaa743
size 5738345

Could you help fix this issue? Thanks!
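
The three lines quoted above match the format of a Git LFS pointer file, which is what a clone contains when the real LFS object has not been downloaded. A minimal sketch of a check that could run before loading the CSV is shown here; `is_git_lfs_pointer` and `LFS_POINTER_PREFIX` are hypothetical names and are not part of the project.

```python
from pathlib import Path

# First line of every Git LFS pointer file, as quoted in the report above.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_git_lfs_pointer(path: Path) -> bool:
    """Return True if the file starts with the Git LFS pointer header."""
    with path.open("rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

If the check returns True, installing Git LFS and running `git lfs pull` in the repository would normally replace the pointer with the actual dataset, assuming the LFS object is still available from the remote.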

RuntimeError

Hello,

I'm currently experiencing a RuntimeError that says "An attempt has been made to start a new process before the current process has finished its bootstrapping phase."

  File "C:\Users\sbae\PycharmProjects\URL-categorization-using-machine-learning\01_construct_features.py", line 29, in <module>
    res = ex.map(parse_request, [(i, elem) for i, elem in enumerate(results)])
  File "C:\Program Files\Python310\lib\concurrent\futures\process.py", line 761, in map
    results = super().map(partial(_process_chunk, fn),
  File "C:\Program Files\Python310\lib\concurrent\futures\_base.py", line 610, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "C:\Program Files\Python310\lib\concurrent\futures\_base.py", line 610, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "C:\Program Files\Python310\lib\concurrent\futures\process.py", line 732, in submit
    self._adjust_process_count()
  File "C:\Program Files\Python310\lib\concurrent\futures\process.py", line 692, in _adjust_process_count
    self._spawn_process()
  File "C:\Program Files\Python310\lib\concurrent\futures\process.py", line 709, in _spawn_process
    p.start()
  File "C:\Program Files\Python310\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Program Files\Python310\lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
  File "C:\Program Files\Python310\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Program Files\Python310\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Program Files\Python310\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
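
The idiom the error message describes can be sketched as follows. On Windows, child processes are started with spawn, which re-imports the main module, so any pool creation at module level runs again in each child. This is only an illustration: `parse_request` here is a stand-in with the same (index, element) argument shape as the call in the traceback, `main` is a hypothetical wrapper, and the input list is placeholder data rather than the project's.

```python
from concurrent.futures import ProcessPoolExecutor


def parse_request(item):
    """Stand-in worker: takes an (index, element) tuple, as in the traceback."""
    index, elem = item
    return index, str(elem).upper()


def main():
    results = ["a", "b", "c"]  # placeholder input, not the project's data
    # The pool is created only inside main(), never at import time.
    with ProcessPoolExecutor(max_workers=2) as ex:
        parsed = list(ex.map(parse_request, list(enumerate(results))))
    print(parsed)


if __name__ == "__main__":
    # Child processes import this module under a different name,
    # so they skip this branch and do not start pools of their own.
    main()
```

Moving the body of 01_construct_features.py that creates the executor under such a guard would likely avoid this RuntimeError, assuming the worker function stays defined at module level so it can be pickled.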


No main.py file

Could you please upload the main Python file? It would be very helpful.

Url-Categorization

After running 01_construct_features.py, I get the error "['main_category_confidence'] not in index".
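
This error names main_category_confidence with an underscore, while other reports on this page name main_category:confidence with a colon, which suggests the script and the CSV header may disagree on the column name. A minimal sketch of a header check using only the standard library is shown here; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the column list is copied from the tracebacks on this page rather than from the project's current code.

```python
import csv
from pathlib import Path

# Column names as they appear in the KeyError reports on this page.
REQUIRED_COLUMNS = ["url", "main_category", "main_category:confidence"]


def missing_columns(csv_path: Path, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from the CSV header row."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]
```

Running this against the dataset before the pandas selection would show exactly which names are missing, instead of a KeyError raised deep inside the indexing code.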


Additional Model Training

Thank you so much for this! Any suggestions on how one would improve the current model? I do see the dataset is quite outdated. I'd like to train it further and add more categories.

I have a list of millions of websites with their basic categories, which I assume could be used to scan for categorization keywords.

Thanks!

Warm Regards,
Reece
