epeters3 / skplumber

An AutoML tool and lightweight ML framework for Scikit-Learn.
Home Page: https://epeters3.github.io/skplumber/
Use the ray package to support sampling using as many processors as are available.
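A sketch of the fan-out/fan-in shape this would take, using only the standard library here; with ray, `fit_and_score` would be an `@ray.remote` task and the results gathered via `ray.get`:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def fit_and_score(seed):
    # Stand-in for sampling, fitting, and scoring one candidate pipeline.
    return random.Random(seed).random()

# With ray this would be a @ray.remote task gathered via ray.get; the
# standard library shows the same fan-out/fan-in shape.
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(fit_and_score, range(8)))
best_score = max(scores)
```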
There is a risk that, with the random train/test splits the package currently uses, the best model returned is influenced in part by the luck of the split. In other words, the best model might just have gotten a lucky train/test split that was easy to learn. Adding k-fold cross validation will smooth out that variance by collecting multiple sample points from the distribution of the problem being learned.
Once the best model is identified via cross validation, it can be refit on the full training data set before being returned by the package. Then it will have the benefit of learning from as much data as it can before being deployed into the wild.
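A minimal sketch of the idea with plain scikit-learn (the candidates and dataset here are placeholders, not what skplumber samples):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Score each candidate with k-fold CV instead of a single random split.
candidates = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
]
cv_scores = [cross_val_score(m, X, y, cv=5).mean() for m in candidates]

# Then refit the winner on the full training set before returning it.
best = candidates[int(np.argmax(cv_scores))]
best.fit(X, y)
```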
Add logic for skplumber to optimize intelligently given a time budget. Specifically, in the pipeline sampling phase, use extreme value theory and a running average of pipeline fit+score times to estimate the time remaining, always leaving enough time for the flexga package to complete at least one generation of hyperparameter sampling. The duration of that generation can be estimated as the fit+score time of the best pipeline sampled so far, multiplied by the number of hyperparameters to tune in that pipeline, multiplied by 10 (since that's flexga's population size).
This will make the SKPlumber.crank method even a little higher level, with fewer knobs to tune, which is okay because the lower-level components of the package (e.g. sampling, tuning, pipeline) are still available to the user.
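The budgeting arithmetic could look roughly like this (function names are illustrative, not skplumber's API):

```python
def estimated_generation_seconds(best_fit_score_time, n_hyperparams, population_size=10):
    # One GA generation ~ fit+score time of the best pipeline so far,
    # times the number of tunable hyperparameters, times the population size.
    return best_fit_score_time * n_hyperparams * population_size

def should_keep_sampling(elapsed, budget, avg_sample_time, generation_time):
    # Keep sampling only if taking one more sample still leaves room
    # for at least one generation of hyperparameter tuning.
    remaining = budget - elapsed
    return remaining - avg_sample_time > generation_time
```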
All primitives should implement the appropriate sklearn base class which includes appropriate hyperparameter getting and setting methods.
In addition, all primitives should programmatically document datatypes of each hyperparameter. In the case of numeric hyperparameters, a way to compute the bounds for a given dataset should be provided. Bounds can be dependent on features of the dataset being trained on (e.g. number of instances, number of features, etc). In the case of categorical hyperparameters, all possible values should be enumerated.
Finally, the Pipeline class should implement the BaseEstimator API.
Doing this will enable SKPlumber to support hyperparameter search, whether it be through sampling or optimization.
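A sketch of what the programmatic hyperparameter metadata could look like (descriptor names are illustrative, not an actual skplumber API):

```python
# Numeric hyperparameters carry dataset-dependent bounds; categorical
# ones enumerate all possible values.
class IntHyperparam:
    def __init__(self, name, bounds_fn):
        self.name = name
        self.bounds_fn = bounds_fn  # (X) -> (low, high)

    def bounds(self, X):
        return self.bounds_fn(X)

class CategoricalHyperparam:
    def __init__(self, name, choices):
        self.name = name
        self.choices = choices

# e.g. for a k-NN primitive: k is bounded by the number of instances.
n_neighbors = IntHyperparam("n_neighbors", lambda X: (1, len(X)))
weights = CategoricalHyperparam("weights", ["uniform", "distance"])
```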
The same instantiated primitive should be able to be fit on one dataset and then another. All the sklearn primitives should already support this, but the custom primitives do not. E.g. when the one-hot encoder is fit, it keeps track of all the categorical columns, but when fit on a new dataset it does not clear out the old columns it was tracking, so the columns it tracks for the new dataset end up being the union of the old dataset's columns and the new dataset's columns.
Add a test case for this by fitting on one dataset, then another, to make sure no errors occur.
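A toy stand-in showing the intended refit behavior (illustrative only, not skplumber's actual encoder):

```python
class ToyOneHotEncoder:
    """Minimal stand-in for skplumber's custom one-hot encoder (illustrative)."""

    def fit(self, values):
        # The fix: reassign rather than accumulate, so refitting on a new
        # dataset clears the categories tracked from the previous one.
        self.categories_ = sorted(set(values))
        return self

    def produce(self, values):
        return [[1 if v == c else 0 for c in self.categories_] for v in values]

enc = ToyOneHotEncoder()
enc.fit(["red", "blue"])
enc.fit(["green"])  # refit on a second dataset
assert enc.categories_ == ["green"]  # no leftover "red"/"blue"
```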
In the SKPlumber._sampler_callback function, whether the system should exit sampling early is not checked until the SKPlumber.progress object is able to report. Sometimes the whole time budget is used up even before the progress object is able to report.
Add basic Travis build which runs tests
Currently SKPlumber.fit uses the best hyperparameter configuration found during hyperparameter tuning. If tuning didn't find anything better than the defaults used during sampling, the defaults should be used.
Currently, something internal to flexga raises an error when this happens, and it's very cryptic.
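The fallback logic is simple; a sketch (names are illustrative):

```python
def choose_params(default_params, tuned_params, default_score, tuned_score):
    # Fall back to the defaults found during sampling when tuning did
    # not beat them, instead of hitting the cryptic flexga error path.
    if tuned_params is None or tuned_score <= default_score:
        return default_params
    return tuned_params
```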
Add a strategy that randomly samples a layer of primitives that all take the input data, then adds a model to ensemble all the data output by the previous layer of primitives.
In the logistic regression solver, this hyperparameter combo is not supported, so make sure it cannot be tried in skplumber:
File "/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py", line 445, in _check_solver
"got %s penalty." % (solver, penalty))
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.
Here is another one for SVM:
File "/usr/local/lib/python3.6/dist-packages/sklearn/svm/_base.py", line 793, in _get_liblinear_solver_type
% (error_string, penalty, loss, dual))
ValueError: Unsupported set of arguments: The combination of penalty='l1' and loss='squared_hinge' are not supported when
dual=True, Parameters: penalty='l1', loss='squared_hinge', dual=True
Several of the sklearn primitives are erroring out when used.
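One way to keep such combinations out of the sampler is a blocklist checked before a config is tried; a sketch (the blocklist contents here are just the two cases above, with keys mirroring sklearn's LogisticRegression / LinearSVC parameter names):

```python
# Known-bad hyperparameter combinations that sklearn rejects at fit time.
INVALID_COMBOS = [
    {"solver": "lbfgs", "penalty": "elasticnet"},
    {"penalty": "l1", "loss": "squared_hinge", "dual": True},
]

def is_valid(config):
    """True if the config matches no known-bad combination."""
    return not any(
        all(config.get(k) == v for k, v in rule.items())
        for rule in INVALID_COMBOS
    )

assert not is_valid({"solver": "lbfgs", "penalty": "elasticnet"})
assert is_valid({"solver": "saga", "penalty": "elasticnet"})
```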
Currently, only a basic example for SKPlumber.crank()
is provided in the readme. The Pipeline
class API is slightly lower level than SKPlumber
and has some nice features. Also, since SKPlumber.crank()
returns a Pipeline
instance, its important to know how to use that pipeline for downstream use.
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/users/grads/epeter92/code/big-data-course/project/mldb/__main__.py", line 31, in <module>
main()
File "/users/grads/epeter92/code/big-data-course/project/mldb/__main__.py", line 27, in main
results = ray.get(result_id)
File "/users/grads/epeter92/.local/lib/python3.6/site-packages/ray/worker.py", line 1504, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::mldb.model_running.do_run() (pid=3548, ip=192.168.36.150)
File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
File "/users/grads/epeter92/code/big-data-course/project/mldb/model_running.py", line 31, in do_run
pipe.fit(X_train, y_train)
File "/users/grads/epeter92/.local/lib/python3.6/site-packages/skplumber/pipeline.py", line 96, in fit
self._run(X, y, fit=True)
File "/users/grads/epeter92/.local/lib/python3.6/site-packages/skplumber/pipeline.py", line 69, in _run
step_outputs = step.primitive.produce(step_inputs)
File "/users/grads/epeter92/.local/lib/python3.6/site-packages/skplumber/primitives/custom_primitives/preprocessing.py", line 93, in produce
np.random.choice(known_vals.index, p=known_vals, size=len(result.index))
File "mtrand.pyx", line 902, in numpy.random.mtrand.RandomState.choice
ValueError: 'a' cannot be empty unless no samples are taken
I think this happens when training on datasets that don't have any known values in a column.
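A sketch of the guard that would avoid the empty-choice crash (a standalone toy imputer, not skplumber's actual code; the "missing" fill value is an arbitrary choice):

```python
import random

def impute_categorical(values, fill="missing", rng=random):
    # Fill None entries by sampling from the column's known values.
    known = [v for v in values if v is not None]
    if not known:
        # Guard for all-missing columns: np.random.choice raises
        # "'a' cannot be empty unless no samples are taken" when the
        # distribution it is given has nothing in it.
        return [fill] * len(values)
    return [v if v is not None else rng.choice(known) for v in values]
```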
Fixing the random seed will allow all estimators to see the same splits of data, for fair comparison.
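For example, with scikit-learn's KFold, fixing random_state makes the folds reproducible, so every estimator being compared sees identical splits:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)

def folds(seed):
    # A fixed random_state yields the same shuffled folds every time.
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test.tolist() for _, test in cv.split(X)]

# Same seed -> identical splits for a fair comparison across estimators.
assert folds(42) == folds(42)
```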
The readme file currently can't be parsed by twine, so it can't be uploaded to PyPI and used as the long description there.
The encoder should have fit and produce methods, so the output columns are always the same. Also, if too many unique values are found in a column, it should cap the feature expansion and only encode the most common values, putting all others in an "Other" column.
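A toy sketch of the capping behavior (illustrative only; the max_categories name and the "Other" label are assumptions):

```python
from collections import Counter

class CappedOneHot:
    """Encode only the most common values; fold the rest into "Other"."""

    def __init__(self, max_categories=10):
        self.max_categories = max_categories

    def fit(self, values):
        # Keep the max_categories most frequent values seen during fit.
        common = Counter(values).most_common(self.max_categories)
        self.categories_ = [v for v, _ in common]
        return self

    def produce(self, values):
        # Fixed output width: known categories plus one "Other" column.
        cols = self.categories_ + ["Other"]
        rows = []
        for v in values:
            key = v if v in self.categories_ else "Other"
            rows.append([1 if c == key else 0 for c in cols])
        return rows

enc = CappedOneHot(max_categories=2).fit(["a", "a", "b", "b", "c"])
assert enc.produce(["c"]) == [[0, 0, 1]]  # rare value lands in "Other"
```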
Currently the performance of all candidate pipelines can only be evaluated via k-fold cross validation. That is a great method, but for large datasets especially, k-fold is impractical and sometimes unnecessary. SKPlumber.crank should expose an API for passing in a custom evaluation strategy. Also, it would be good to use a sensible default and to provide both basic train/test set evaluation and k-fold cross validation evaluation utilities for the user.
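A sketch of what the two built-in evaluation strategies could look like as plain callables (the (model, X, y) -> score signature is an assumption, not skplumber's actual API):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

def train_test_evaluate(model, X, y, test_size=0.25, random_state=0):
    # Cheap single-split evaluation for large datasets.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)

def kfold_evaluate(model, X, y, k=5):
    # More thorough k-fold evaluation as the sensible default.
    return cross_val_score(model, X, y, cv=k).mean()

X, y = make_classification(n_samples=120, random_state=0)
fast_score = train_test_evaluate(LogisticRegression(max_iter=1000), X, y)
thorough_score = kfold_evaluate(LogisticRegression(max_iter=1000), X, y)
```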
Currently, all primitives in the package must be accessed by key through primitive dictionaries. It would be better to have them be accessible as objects directly e.g. instead of:
from skplumber.primitives import classifiers
prim = classifiers["RandomForestClassifierPrimitive"]
It would be more natural to say:
from skplumber.primitives.classifiers import RandomForestClassifierPrimitive
Furthermore, it would be good to eliminate the Primitive suffix from all the primitive names; it's a little redundant.
Use the flexga package to add genetic-algorithm-based hyperparameter optimization of arbitrary pipelines.
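flexga's actual API differs, but the shape of GA-based tuning is roughly this (a deliberately tiny, self-contained GA over real-valued hyperparameters):

```python
import random

def ga_maximize(objective, bounds, pop_size=10, generations=20, seed=0):
    # Evolve a population of real-valued hyperparameter vectors toward
    # higher objective scores: keep the top half each generation, then
    # breed children by averaging two parents plus Gaussian mutation.
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=objective, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [
                min(max((x + y) / 2 + rng.gauss(0, 0.1 * (hi - lo)), lo), hi)
                for x, y, (lo, hi) in zip(a, b, bounds)
            ]
            children.append(child)
        pop = parents + children
    return max(pop, key=objective)

# Toy objective with its maximum at x = 3.
best = ga_maximize(lambda p: -(p[0] - 3.0) ** 2, bounds=[(0.0, 10.0)])
```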
Currently, the name of a search strategy is what's passed to the plumber. It would be better to pass an instance of a search strategy, so the user can configure the search strategy directly rather than passing all of its parameters through the plumber API.
It would be useful to have an option to limit the max amount of time spent fitting a pipeline when searching for good solutions to a problem. Adding a timeout option to the Pipeline.fit method would do the trick.
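A sketch of the idea (thread-based, so it can report a timeout but cannot actually interrupt a CPU-bound sklearn fit; a process-based worker would be needed for that):

```python
from concurrent.futures import ThreadPoolExecutor

def fit_with_timeout(pipeline, X, y, timeout):
    # Run fit in a worker; result() raises concurrent.futures.TimeoutError
    # if the budget is exceeded. A runaway fit keeps running in the
    # background thread, which is why a real version would use a process.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(pipeline.fit, X, y).result(timeout=timeout)

class FastModel:
    # Trivial stand-in for a Pipeline, just to exercise the wrapper.
    def fit(self, X, y):
        self.fitted_ = True
        return self

model = fit_with_timeout(FastModel(), None, None, timeout=5.0)
```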
Many optimization strategies do best when the features are all normalized. Rather than requiring normalization to be discovered as a preprocessor during sampling, always insert a normalization preprocessing step, with an option to opt out (e.g. have normalize=True be a default in the crank API).
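A sketch with scikit-learn's own pipeline utilities (the normalize flag and its default are the proposal, not an existing API):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_pipeline(estimator, normalize=True):
    # Always prepend a scaling step unless the caller opts out.
    steps = ([StandardScaler()] if normalize else []) + [estimator]
    return make_pipeline(*steps)

pipe = build_pipeline(LogisticRegression())
```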