LightGBM + Optuna: Auto train LightGBM directly from CSV files, Auto tune them using Optuna, Auto serve best model using FastAPI. Inspired by Abhishek Thakur's AutoXGB.

Home Page: https://pypi.org/project/autolgbm/

License: Apache License 2.0


autolgbm's Introduction

AutoLGBM

LightGBM + Optuna: no brainer

  • auto train lightgbm directly from CSV files
  • auto tune lightgbm using optuna
  • auto serve best lightgbm model using fastapi

NOTE: PRs are currently accepted. If there are issues/problems, please solve with a PR or create an issue.

Inspired by Abhishek Thakur's AutoXGB.

Installation

Install using pip

pip install autolgbm

Usage

Training a model using AutoLGBM is a piece of cake. All you need is some tabular data.
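No dataset handy? Any CSV with a target column works. Here is a small sketch that fabricates a tiny binary-classification file to experiment with; all column names are made up for illustration (the `income` target just matches the example configuration below):

```python
import csv
import random

# Fabricate a toy binary-classification CSV. Column names are illustrative
# only; "income" matches the `targets = ["income"]` example in this README.
random.seed(42)
with open("binary_classification.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["age", "hours_per_week", "education", "income"])
    for _ in range(100):
        writer.writerow([
            random.randint(18, 70),                          # numeric feature
            random.randint(10, 60),                          # numeric feature
            random.choice(["hs", "bachelors", "masters"]),   # categorical feature
            random.randint(0, 1),                            # binary target
        ])
print("wrote binary_classification.csv")
```

Point `train_filename` at this file and AutoLGBM takes care of the rest.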

Parameters

###############################################################################
### required parameters
###############################################################################

# path to training data
train_filename = "data_samples/binary_classification.csv"

# path to output folder to store artifacts
output = "output"

###############################################################################
### optional parameters
###############################################################################

# path to test data. if specified, the model will be evaluated on the test data
# and test_predictions.csv will be saved to the output folder
# if not specified, only OOF predictions will be saved
# test_filename = "test.csv"
test_filename = None

# task: classification or regression
# if not specified, the task will be inferred automatically
# task = "classification"
# task = "regression"
task = None

# an id column
# if not specified, the id column will be generated automatically with the name `id`
# idx = "id"
idx = None

# targets is a list of strings
# if not specified, the target column will be assumed to be named `target`
# and the problem will be treated as one of: binary classification, multiclass classification,
# or single column regression
# targets = ["target"]
# targets = ["target1", "target2"]
targets = ["income"]

# features is a list of strings
# if not specified, all columns except `id`, `targets` & `kfold` columns will be used
# features = ["col1", "col2"]
features = None

# categorical_features is a list of strings
# if not specified, categorical columns will be inferred automatically
# categorical_features = ["col1", "col2"]
categorical_features = None

# use_gpu is a boolean
# if not specified, GPU is not used
# use_gpu = True
# use_gpu = False
use_gpu = True

# number of folds to use for cross-validation
# default is 5
num_folds = 5

# random seed for reproducibility
# default is 42
seed = 42

# number of optuna trials to run
# default is 1000
# num_trials = 1000
num_trials = 100

# time_limit for optuna trials in seconds
# if not specified, timeout is not set and all trials are run
# time_limit = None
time_limit = 360

# if fast is set to True, the hyperparameter tuning will use only one fold
# however, the model will be trained on all folds in the end
# to generate OOF predictions and test predictions
# default is False
# fast = False
fast = False

Python API

To train a new model, you can run:

from autolgbm import AutoLGBM


# required parameters:
train_filename = "data_samples/binary_classification.csv"
output = "output"

# optional parameters
test_filename = None
task = None
idx = None
targets = ["income"]
features = None
categorical_features = None
use_gpu = True
num_folds = 5
seed = 42
num_trials = 100
time_limit = 360
fast = False

# Now it's time to train the model!
algbm = AutoLGBM(
    train_filename=train_filename,
    output=output,
    test_filename=test_filename,
    task=task,
    idx=idx,
    targets=targets,
    features=features,
    categorical_features=categorical_features,
    use_gpu=use_gpu,
    num_folds=num_folds,
    seed=seed,
    num_trials=num_trials,
    time_limit=time_limit,
    fast=fast,
)
algbm.train()
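When `train()` finishes, the output folder holds the tuned models plus the prediction files mentioned above (`test_predictions.csv` when a test file was given). A quick sketch of inspecting that file; the snippet fabricates a stand-in copy so it runs on its own, and the column names are illustrative:

```python
import csv
from pathlib import Path

out = Path("output")
out.mkdir(exist_ok=True)

# Stand-in for the file a real AutoLGBM run (with test_filename set) writes
# itself; "income" here is just the targets example from this README.
(out / "test_predictions.csv").write_text("id,income\n0,0.91\n1,0.12\n2,0.78\n")

with (out / "test_predictions.csv").open(newline="") as f:
    rows = list(csv.DictReader(f))

print(len(rows))      # number of test predictions
print(list(rows[0]))  # prediction columns: id plus one column per target
```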

CLI

Train the model using the autolgbm train command. The parameters are the same as above.

autolgbm train \
 --train_filename datasets/30train.csv \
 --output outputs/30days \
 --test_filename datasets/30test.csv \
 --use_gpu

You can also serve the trained model using the autolgbm serve command.

autolgbm serve --model_path outputs/mll --host 0.0.0.0 --debug

To learn more about a command, run:

`autolgbm <command> --help` 
autolgbm train --help


usage: autolgbm <command> [<args>] train [-h] --train_filename TRAIN_FILENAME [--test_filename TEST_FILENAME] --output
                                        OUTPUT [--task {classification,regression}] [--idx IDX] [--targets TARGETS]
                                        [--num_folds NUM_FOLDS] [--features FEATURES] [--use_gpu] [--fast]
                                        [--seed SEED] [--time_limit TIME_LIMIT]

optional arguments:
  -h, --help            show this help message and exit
  --train_filename TRAIN_FILENAME
                        Path to training file
  --test_filename TEST_FILENAME
                        Path to test file
  --output OUTPUT       Path to output directory
  --task {classification,regression}
                        User defined task type
  --idx IDX             ID column
  --targets TARGETS     Target column(s). If there are multiple targets, separate by ';'
  --num_folds NUM_FOLDS
                        Number of folds to use
  --features FEATURES   Features to use, separated by ';'
  --use_gpu             Whether to use GPU for training
  --fast                Whether to use fast mode for tuning params. Only one fold will be used if fast mode is set
  --seed SEED           Random seed
  --time_limit TIME_LIMIT
                        Time limit for optimization

autolgbm's People

Contributors

rishiraj

autolgbm's Issues

predict error

Hi, good job @rishiraj :-)
Trying to predict on a test file, I got an error:
KeyError: 'early_stopping_rounds'

It looks like lightgbm has moved this key to the early_stopping() callback.
Did you encounter this?
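For anyone hitting this before a fix lands: LightGBM 4.x removed the `early_stopping_rounds` keyword from `lgb.train()` in favor of the `lgb.early_stopping()` callback. A minimal, pure-Python sketch of migrating a legacy params dict (the callback name is LightGBM's; the helper itself is hypothetical):

```python
def migrate_early_stopping(params):
    """Split the legacy 'early_stopping_rounds' key out of a LightGBM params dict.

    Returns (params_without_key, stopping_rounds) so the caller can pass
    callbacks=[lgb.early_stopping(stopping_rounds)] to lgb.train() instead
    of the removed early_stopping_rounds keyword argument.
    """
    params = dict(params)  # don't mutate the caller's dict
    rounds = params.pop("early_stopping_rounds", None)
    return params, rounds


old = {"objective": "binary", "learning_rate": 0.1, "early_stopping_rounds": 25}
new, rounds = migrate_early_stopping(old)
print(new)     # {'objective': 'binary', 'learning_rate': 0.1}
print(rounds)  # 25
```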

OperationalError: database is locked

import os
import pandas as pd
from pandas import read_csv
from autolgbm import AutoLGBM

os.chdir("//wsl.localhost/Ubuntu/home/mhrachov/algorithm testing/Side quest/Random seed/CropGBM/qtlmas")

traingeno = 'preprocessed/qtlmas_filter.geno'
trainphe = 'preprocessed/qtlmas_pheno_empty_na.phe'

traingeno_data = read_csv(traingeno, header=0, index_col=0)
trainphe_data = read_csv(trainphe, header=0, index_col=0).dropna(axis=0)
traingeno_data = traingeno_data.loc[trainphe_data.index.values, :]

combined_data = pd.concat([trainphe_data, traingeno_data], axis=1)
combined_data.to_csv('combined_data.csv')
# required parameters:
train_filename = 'combined_data.csv'
output = "output1"

# optional parameters
test_filename = None
task = None
idx = None
targets = ["phe"]
features = None
categorical_features = None
use_gpu = False
num_folds = 5
seed = 42
num_trials = 100
time_limit = 1200
fast = False

# Now it's time to train the model!
algbm = AutoLGBM(
    train_filename=train_filename,
    output=output,
    test_filename=test_filename,
    task=task,
    idx=idx,
    targets=targets,
    features=features,
    categorical_features=categorical_features,
    use_gpu=use_gpu,
    num_folds=num_folds,
    seed=seed,
    num_trials=num_trials,
    time_limit=time_limit,
    fast=fast,
)
algbm.train()
....
some info messages
...
2023-08-09 14:51:32.758 | INFO     | autolgbm.autolgbm:_process_data:237 - Saving model config
2023-08-09 14:51:32.774 | INFO     | autolgbm.autolgbm:_process_data:241 - Saving encoders
Traceback (most recent call last):

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:1965 in _exec_single_context
    self.dialect.do_execute(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\default.py:921 in do_execute
    cursor.execute(statement, parameters)

OperationalError: database is locked


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  Cell In[53], line 17
    algbm.train()

  File ~\miniconda3\envs\GBM_project\lib\site-packages\autolgbm\autolgbm.py:247 in train
    best_params = train_model(self.model_config)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\autolgbm\utils.py:206 in train_model
    study = optuna.create_study(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\optuna\study\study.py:1136 in create_study
    storage = storages.get_storage(storage)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\optuna\storages\__init__.py:31 in get_storage
    return _CachedStorage(RDBStorage(storage))

  File ~\miniconda3\envs\GBM_project\lib\site-packages\optuna\storages\_rdb\storage.py:183 in __init__
    models.BaseModel.metadata.create_all(self.engine)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\sql\schema.py:5792 in create_all
    bind._run_ddl_visitor(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:3239 in _run_ddl_visitor
    conn._run_ddl_visitor(visitorcallable, element, **kwargs)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:2443 in _run_ddl_visitor
    visitorcallable(self.dialect, self, **kwargs).traverse_single(element)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\sql\visitors.py:670 in traverse_single
    return meth(obj, **kw)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\sql\ddl.py:901 in visit_metadata
    [t for t in tables if self._can_create_table(t)]

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\sql\ddl.py:901 in <listcomp>
    [t for t in tables if self._can_create_table(t)]

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\sql\ddl.py:866 in _can_create_table
    return not self.checkfirst or not self.dialect.has_table(

  File <string>:2 in has_table

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\reflection.py:88 in cache
    return fn(self, con, *args, **kw)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\dialects\sqlite\base.py:2146 in has_table
    info = self._get_table_pragma(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\dialects\sqlite\base.py:2761 in _get_table_pragma
    cursor = connection.exec_driver_sql(statement)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:1774 in exec_driver_sql
    ret = self._execute_context(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:1844 in _execute_context
    return self._exec_single_context(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:1984 in _exec_single_context
    self._handle_dbapi_exception(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:2339 in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:1965 in _exec_single_context
    self.dialect.do_execute(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\default.py:921 in do_execute
    cursor.execute(statement, parameters)

OperationalError: (sqlite3.OperationalError) database is locked
[SQL: PRAGMA main.table_info("studies")]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

The installation was in a clean environment. I don't know the cause or how to solve it.
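One plausible cause (not confirmed): the script changes into a `\\wsl.localhost` network path, and the traceback shows Optuna using its SQLite-backed `RDBStorage`. SQLite relies on file locking, which is unreliable on network shares and commonly produces "database is locked". A hedged workaround sketch is to keep the working directory, and hence the study database, on a local drive:

```python
import os
import tempfile

# Hypothetical workaround (unconfirmed): keep the study database off the
# \\wsl.localhost network share, since SQLite file locking is unreliable
# on network filesystems.
local_dir = tempfile.mkdtemp(prefix="autolgbm_")  # any local path works
os.chdir(local_dir)
output = "output"  # AutoLGBM writes its artifacts (and the Optuna study) here

print(os.getcwd())
```

Alternatively, running the whole workflow from inside the WSL distribution (rather than from Windows through the share) keeps all file access local.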
