

Zeerti commented on August 19, 2024

I found a resolution to this issue: update the 01_construct_features.py file to the code below. @dbae1145

import pandas as pd
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import config
import nltk
import pickle
from functions import timeit, scrape, parse_request


if __name__ == '__main__':
    df = pd.read_csv(config.MAIN_DATASET_PATH)
    df = df.rename(
        columns={'main_category:confidence': 'main_category_confidence'})
    df = df[['url', 'main_category', 'main_category_confidence']]

    df = df[(df['main_category'] != 'Not_working') &
            (df['main_category_confidence'] >= 0.5)]
    df['url'] = df['url'].apply(lambda x: 'http://' + x)
    df['tld'] = df.url.apply(lambda x: x.split('.')[-1])
    df = df[df.tld.isin(config.TOP_LEVEL_DOMAIN_WHITELIST)
            ].reset_index(drop=True)
    df['tokens'] = ''

    print("Scraping begins. Start: ", datetime.now())
    with ThreadPoolExecutor(config.THREADING_WORKERS) as executor:
        start = datetime.now()
        results = executor.map(scrape, [(i, elem)
                               for i, elem in enumerate(df['url'])])
    exec_1 = timeit(start)
    print('Scraping finished. Execution time: ', exec_1)

    print("Analyzing responses. Start: ", datetime.now())
    with ProcessPoolExecutor(config.MULTIPROCESSING_WORKERS) as ex:
        start = datetime.now()
        res = ex.map(parse_request, [(i, elem)
                     for i, elem in enumerate(results)])

    for props in res:
        i = props[0]
        tokens = props[1]
        df.at[i, 'tokens'] = tokens
    exec_2 = timeit(start)
    print('Analyzing responses. Execution time: ', exec_2)

    df.to_csv(config.TOKENS_PATH, index=False)

    print('Generating words frequency for each category: ', datetime.now())
    start = datetime.now()
    words_frequency = {}
    for category in df.main_category.unique():
        print(category)
        all_words = []
        df_temp = df[df.main_category == category]
        for word in df_temp.tokens:
            all_words.extend(word)
        most_common = [word[0] for word in nltk.FreqDist(
            all_words).most_common(config.FREQUENCY_TOP_WORDS)]
        words_frequency[category] = most_common

    # Save words_frequency model; the with-block closes the file even on error
    with open(config.WORDS_FREQUENCY_PATH, "wb") as pickle_out:
        pickle.dump(words_frequency, pickle_out)

    exec_3 = timeit(start)
    print('Generating words frequency for each category Finished. Execution time: ', exec_3)

    print('Script finished.\nTimes log:\nPart 1: ', exec_1,
          '\nPart 2: ', exec_2, '\nPart 3: ', exec_3)
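
A note on why this layout probably helps. The original issue text isn't shown here, so this is an assumption: the part that matters is most likely the `if __name__ == '__main__':` guard. On platforms that start worker processes by spawning a fresh interpreter (Windows, for example), each ProcessPoolExecutor worker re-imports the main script, so any unguarded top-level code would run again inside every worker. Below is a minimal sketch of the same two-stage layout. `fake_scrape`, `fake_parse` and `run` are hypothetical stand-ins for illustration only, not the real `scrape` / `parse_request` from functions.py.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_scrape(task):
    """Hypothetical stand-in for scrape: takes (index, url), returns page text."""
    i, url = task
    return f"page text for {url}"


def fake_parse(task):
    """Hypothetical stand-in for parse_request: takes (index, text), returns (index, tokens)."""
    i, text = task
    return i, text.split()


def run(urls, thread_workers=2, process_workers=2):
    # Stage 1: I/O-bound work goes to a thread pool.
    with ThreadPoolExecutor(thread_workers) as executor:
        responses = list(executor.map(fake_scrape, enumerate(urls)))
    # Stage 2: CPU-bound work goes to a process pool.
    with ProcessPoolExecutor(process_workers) as executor:
        parsed = list(executor.map(fake_parse, enumerate(responses)))
    # Map each row index to its token list.
    return dict(parsed)


if __name__ == "__main__":
    # Only the process that was launched as a script runs the pipeline;
    # a spawned worker that re-imports this module skips this block.
    print(run(["http://example.com", "http://example.org"]))
```

The worker functions are defined at module level so the process pool can pickle references to them. Each task carries its row index, which is how the results are written back to the right row in the script above.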

from url-categorization-using-machine-learning.
