LightGBM + Optuna: Auto train LightGBM directly from CSV files, Auto tune them using Optuna, Auto serve best model using FastAPI. Inspired by Abhishek Thakur's AutoXGB.

Home Page: https://pypi.org/project/autolgbm/

License: Apache License 2.0


autolgbm's Introduction

AutoLGBM

LightGBM + Optuna: no brainer

  • auto train lightgbm directly from CSV files
  • auto tune lightgbm using optuna
  • auto serve best lightgbm model using fastapi

NOTE: PRs are currently accepted. If there are issues/problems, please solve with a PR or create an issue.

Inspired by Abhishek Thakur's AutoXGB.

Installation

Install using pip

pip install autolgbm

Usage

Training a model using AutoLGBM is a piece of cake. All you need is some tabular data.
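No dataset handy? Any CSV with a target column works. Here is a small sketch that fabricates a tiny binary-classification file to experiment with; all column names are made up for illustration (the `income` target just matches the example configuration below):

```python
import csv
import random

# Fabricate a toy binary-classification CSV. Column names are illustrative
# only; "income" matches the `targets = ["income"]` example in this README.
random.seed(42)
with open("binary_classification.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["age", "hours_per_week", "education", "income"])
    for _ in range(100):
        writer.writerow([
            random.randint(18, 70),                          # numeric feature
            random.randint(10, 60),                          # numeric feature
            random.choice(["hs", "bachelors", "masters"]),   # categorical feature
            random.randint(0, 1),                            # binary target
        ])
print("wrote binary_classification.csv")
```

Point `train_filename` at this file and AutoLGBM takes care of the rest.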

Parameters

###############################################################################
### required parameters
###############################################################################

# path to training data
train_filename = "data_samples/binary_classification.csv"

# path to output folder to store artifacts
output = "output"

###############################################################################
### optional parameters
###############################################################################

# path to test data. if specified, the model will be evaluated on the test data
# and test_predictions.csv will be saved to the output folder
# if not specified, only OOF predictions will be saved
# test_filename = "test.csv"
test_filename = None

# task: classification or regression
# if not specified, the task will be inferred automatically
# task = "classification"
# task = "regression"
task = None

# an id column
# if not specified, the id column will be generated automatically with the name `id`
# idx = "id"
idx = None

# targets is a list of strings
# if not specified, the target column will be assumed to be named `target`
# and the problem will be treated as one of: binary classification, multiclass classification,
# or single column regression
# targets = ["target"]
# targets = ["target1", "target2"]
targets = ["income"]

# features is a list of strings
# if not specified, all columns except `id`, `targets` & `kfold` columns will be used
# features = ["col1", "col2"]
features = None

# categorical_features is a list of strings
# if not specified, categorical columns will be inferred automatically
# categorical_features = ["col1", "col2"]
categorical_features = None

# use_gpu is a boolean
# if not specified, GPU is not used
# use_gpu = True
# use_gpu = False
use_gpu = True

# number of folds to use for cross-validation
# default is 5
num_folds = 5

# random seed for reproducibility
# default is 42
seed = 42

# number of optuna trials to run
# default is 1000
# num_trials = 1000
num_trials = 100

# time_limit for optuna trials in seconds
# if not specified, timeout is not set and all trials are run
# time_limit = None
time_limit = 360

# if fast is set to True, the hyperparameter tuning will use only one fold
# however, the model will be trained on all folds in the end
# to generate OOF predictions and test predictions
# default is False
# fast = False
fast = False

Python API

To train a new model, you can run:

from autolgbm import AutoLGBM


# required parameters:
train_filename = "data_samples/binary_classification.csv"
output = "output"

# optional parameters
test_filename = None
task = None
idx = None
targets = ["income"]
features = None
categorical_features = None
use_gpu = True
num_folds = 5
seed = 42
num_trials = 100
time_limit = 360
fast = False

# Now it's time to train the model!
algbm = AutoLGBM(
    train_filename=train_filename,
    output=output,
    test_filename=test_filename,
    task=task,
    idx=idx,
    targets=targets,
    features=features,
    categorical_features=categorical_features,
    use_gpu=use_gpu,
    num_folds=num_folds,
    seed=seed,
    num_trials=num_trials,
    time_limit=time_limit,
    fast=fast,
)
algbm.train()
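When `train()` finishes, the output folder holds the tuned models plus the prediction files mentioned above (`test_predictions.csv` when a test file was given). A quick sketch of inspecting that file; the snippet fabricates a stand-in copy so it runs on its own, and the column names are illustrative:

```python
import csv
from pathlib import Path

out = Path("output")
out.mkdir(exist_ok=True)

# Stand-in for the file a real AutoLGBM run (with test_filename set) writes
# itself; "income" here is just the targets example from this README.
(out / "test_predictions.csv").write_text("id,income\n0,0.91\n1,0.12\n2,0.78\n")

with (out / "test_predictions.csv").open(newline="") as f:
    rows = list(csv.DictReader(f))

print(len(rows))      # number of test predictions
print(list(rows[0]))  # prediction columns: id plus one column per target
```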

CLI

Train the model using the autolgbm train command. The parameters are the same as above.

autolgbm train \
 --train_filename datasets/30train.csv \
 --output outputs/30days \
 --test_filename datasets/30test.csv \
 --use_gpu

You can also serve the trained model using the autolgbm serve command.

autolgbm serve --model_path outputs/mll --host 0.0.0.0 --debug

To learn more about a command, run:

`autolgbm <command> --help` 
autolgbm train --help


usage: autolgbm <command> [<args>] train [-h] --train_filename TRAIN_FILENAME [--test_filename TEST_FILENAME] --output
                                        OUTPUT [--task {classification,regression}] [--idx IDX] [--targets TARGETS]
                                        [--num_folds NUM_FOLDS] [--features FEATURES] [--use_gpu] [--fast]
                                        [--seed SEED] [--time_limit TIME_LIMIT]

optional arguments:
  -h, --help            show this help message and exit
  --train_filename TRAIN_FILENAME
                        Path to training file
  --test_filename TEST_FILENAME
                        Path to test file
  --output OUTPUT       Path to output directory
  --task {classification,regression}
                        User defined task type
  --idx IDX             ID column
  --targets TARGETS     Target column(s). If there are multiple targets, separate by ';'
  --num_folds NUM_FOLDS
                        Number of folds to use
  --features FEATURES   Features to use, separated by ';'
  --use_gpu             Whether to use GPU for training
  --fast                Whether to use fast mode for tuning params. Only one fold will be used if fast mode is set
  --seed SEED           Random seed
  --time_limit TIME_LIMIT
                        Time limit for optimization

autolgbm's People

Contributors

rishiraj

autolgbm's Issues

predict error

Hi, good job @rishiraj :-)
Trying to predict on a test file, I got an error:
KeyError: 'early_stopping_rounds'

It looks like lightgbm has moved this key to the early_stopping() callback.
Did you encounter this?
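For anyone hitting this before a fix lands: LightGBM 4.x removed the `early_stopping_rounds` keyword from `lgb.train()` in favor of the `lgb.early_stopping()` callback. A minimal, pure-Python sketch of migrating a legacy params dict (the callback name is LightGBM's; the helper itself is hypothetical):

```python
def migrate_early_stopping(params):
    """Split the legacy 'early_stopping_rounds' key out of a LightGBM params dict.

    Returns (params_without_key, stopping_rounds) so the caller can pass
    callbacks=[lgb.early_stopping(stopping_rounds)] to lgb.train() instead
    of the removed early_stopping_rounds keyword argument.
    """
    params = dict(params)  # don't mutate the caller's dict
    rounds = params.pop("early_stopping_rounds", None)
    return params, rounds


old = {"objective": "binary", "learning_rate": 0.1, "early_stopping_rounds": 25}
new, rounds = migrate_early_stopping(old)
print(new)     # {'objective': 'binary', 'learning_rate': 0.1}
print(rounds)  # 25
```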

OperationalError: database is locked

import os
import pandas as pd
from pandas import read_csv
from autolgbm import AutoLGBM

os.chdir("//wsl.localhost/Ubuntu/home/mhrachov/algorithm testing/Side quest/Random seed/CropGBM/qtlmas")

traingeno = 'preprocessed/qtlmas_filter.geno'
trainphe = 'preprocessed/qtlmas_pheno_empty_na.phe'

traingeno_data = read_csv(traingeno, header=0, index_col=0)
trainphe_data = read_csv(trainphe, header=0, index_col=0).dropna(axis=0)
traingeno_data = traingeno_data.loc[trainphe_data.index.values, :]

combined_data = pd.concat([trainphe_data, traingeno_data], axis=1)
combined_data.to_csv('combined_data.csv')
# required parameters:
train_filename = 'combined_data.csv'
output = "output1"

# optional parameters
test_filename = None
task = None
idx = None
targets = ["phe"]
features = None
categorical_features = None
use_gpu = False
num_folds = 5
seed = 42
num_trials = 100
time_limit = 1200
fast = False

# Now it's time to train the model!
algbm = AutoLGBM(
    train_filename=train_filename,
    output=output,
    test_filename=test_filename,
    task=task,
    idx=idx,
    targets=targets,
    features=features,
    categorical_features=categorical_features,
    use_gpu=use_gpu,
    num_folds=num_folds,
    seed=seed,
    num_trials=num_trials,
    time_limit=time_limit,
    fast=fast,
)
algbm.train()
....
some info messages
...
2023-08-09 14:51:32.758 | INFO     | autolgbm.autolgbm:_process_data:237 - Saving model config
2023-08-09 14:51:32.774 | INFO     | autolgbm.autolgbm:_process_data:241 - Saving encoders
Traceback (most recent call last):

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:1965 in _exec_single_context
    self.dialect.do_execute(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\default.py:921 in do_execute
    cursor.execute(statement, parameters)

OperationalError: database is locked


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  Cell In[53], line 17
    algbm.train()

  File ~\miniconda3\envs\GBM_project\lib\site-packages\autolgbm\autolgbm.py:247 in train
    best_params = train_model(self.model_config)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\autolgbm\utils.py:206 in train_model
    study = optuna.create_study(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\optuna\study\study.py:1136 in create_study
    storage = storages.get_storage(storage)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\optuna\storages\__init__.py:31 in get_storage
    return _CachedStorage(RDBStorage(storage))

  File ~\miniconda3\envs\GBM_project\lib\site-packages\optuna\storages\_rdb\storage.py:183 in __init__
    models.BaseModel.metadata.create_all(self.engine)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\sql\schema.py:5792 in create_all
    bind._run_ddl_visitor(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:3239 in _run_ddl_visitor
    conn._run_ddl_visitor(visitorcallable, element, **kwargs)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:2443 in _run_ddl_visitor
    visitorcallable(self.dialect, self, **kwargs).traverse_single(element)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\sql\visitors.py:670 in traverse_single
    return meth(obj, **kw)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\sql\ddl.py:901 in visit_metadata
    [t for t in tables if self._can_create_table(t)]

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\sql\ddl.py:901 in <listcomp>
    [t for t in tables if self._can_create_table(t)]

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\sql\ddl.py:866 in _can_create_table
    return not self.checkfirst or not self.dialect.has_table(

  File <string>:2 in has_table

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\reflection.py:88 in cache
    return fn(self, con, *args, **kw)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\dialects\sqlite\base.py:2146 in has_table
    info = self._get_table_pragma(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\dialects\sqlite\base.py:2761 in _get_table_pragma
    cursor = connection.exec_driver_sql(statement)

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:1774 in exec_driver_sql
    ret = self._execute_context(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:1844 in _execute_context
    return self._exec_single_context(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:1984 in _exec_single_context
    self._handle_dbapi_exception(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:2339 in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\base.py:1965 in _exec_single_context
    self.dialect.do_execute(

  File ~\miniconda3\envs\GBM_project\lib\site-packages\sqlalchemy\engine\default.py:921 in do_execute
    cursor.execute(statement, parameters)

OperationalError: (sqlite3.OperationalError) database is locked
[SQL: PRAGMA main.table_info("studies")]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

The installation was in a clean environment. I don't know the cause or how to solve it.
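One plausible cause (not confirmed): the script changes into a `\\wsl.localhost` network path, and the traceback shows Optuna using its SQLite-backed `RDBStorage`. SQLite relies on file locking, which is unreliable on network shares and commonly produces "database is locked". A hedged workaround sketch is to keep the working directory, and hence the study database, on a local drive:

```python
import os
import tempfile

# Hypothetical workaround (unconfirmed): keep the study database off the
# \\wsl.localhost network share, since SQLite file locking is unreliable
# on network filesystems.
local_dir = tempfile.mkdtemp(prefix="autolgbm_")  # any local path works
os.chdir(local_dir)
output = "output"  # AutoLGBM writes its artifacts (and the Optuna study) here

print(os.getcwd())
```

Alternatively, running the whole workflow from inside the WSL distribution (rather than from Windows through the share) keeps all file access local.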
