deil87 / automl-genetic

Applying genetic programming to AutoML

Scala 90.87% HTML 4.86% JavaScript 4.11% CSS 0.16%

automl-genetic's Issues

Perceptron needs to work with labels (classes) starting from 0.

We need to make sure that our One-Hot-Encoding works properly, and either do something like this

.withColumnReplace("label", $"label" - 1.0 )

for the input data, or handle it automatically inside the algorithm.
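
A minimal sketch of the "more automatic" option, written against plain Scala collections rather than DataFrames so it stays self-contained. `LabelIndexing`, `buildIndex` and `toZeroBased` are hypothetical names, not part of the project; the idea is only to remap whatever labels arrive onto 0, 1, 2, ... instead of assuming they start at 1.

```scala
object LabelIndexing {
  /** Maps each distinct label to a zero-based index, ordered by label value. */
  def buildIndex(labels: Seq[Double]): Map[Double, Double] =
    labels.distinct.sorted.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap

  /** Rewrites labels so that the smallest becomes 0.0, the next 1.0, and so on. */
  def toZeroBased(labels: Seq[Double]): Seq[Double] = {
    val index = buildIndex(labels)
    labels.map(index)
  }
}
```

Unlike subtracting 1.0, this also copes with gaps in the label values, at the cost of having to keep the index around to translate predictions back.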

Make Stacking algorithm lazy

Currently, every time we add a model to the stage, we calculate its predictions immediately. This may not be the optimal approach.
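
One way this could look, as a sketch only: defer each member's predictions behind a `lazy val` so nothing is computed until the stage is actually evaluated. `Stage` and `Member` are hypothetical stand-ins here, not the project's stacking classes, and predictions are modelled as a plain `Seq[Double]`.

```scala
object LazyStacking {
  /** A stage member whose predictions are computed on first access only. */
  final class Member(val name: String, compute: () => Seq[Double]) {
    lazy val predictions: Seq[Double] = compute()
  }

  final class Stage {
    private var members = Vector.empty[Member]

    /** Registers a model; the by-name block is not evaluated here. */
    def add(name: String)(compute: => Seq[Double]): Unit =
      members :+= new Member(name, () => compute)

    /** Forces evaluation of every member, in insertion order. */
    def allPredictions: Seq[(String, Seq[Double])] =
      members.map(m => m.name -> m.predictions)
  }
}
```

Because `lazy val` memoises its result, calling `allPredictions` repeatedly evaluates each member only once.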

Add custom timeBoxes strategy.

Add current data size metric for Kamon

Try to improve performance with off-heap enabled.

https://stackoverflow.com/questions/43330902/spark-off-heap-memory-config-and-tungsten

Stacking ensembling does not seem to work for the 3rd level of the tree. Collision of column names in DataFrames.

Check what we should leave in second parameter of FitnessResult(..., here)

Estimate cardinality of the dependent variable automatically so that we can use appropriate models

As a first version we can use api of some models for that.
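
For comparison, a simple heuristic that does not rely on any model's API: count the distinct target values directly. This is a different approach from the one proposed above, and every name in it (`TargetCardinality`, `infer`, the `maxClasses` threshold of 20) is an assumption made for illustration.

```scala
object TargetCardinality {
  sealed trait ProblemType
  case object BinaryClassification extends ProblemType
  case object MulticlassClassification extends ProblemType
  case object Regression extends ProblemType

  /** Guesses the problem type from the number of distinct target values. */
  def infer(target: Seq[Double], maxClasses: Int = 20): ProblemType = {
    val distinct = target.distinct
    // Class labels are expected to be whole numbers; fractional values hint at regression.
    val allIntegral = distinct.forall(v => v == math.rint(v))
    if (distinct.size <= 2) BinaryClassification
    else if (allIntegral && distinct.size <= maxClasses) MulticlassClassification
    else Regression
  }
}
```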

Add implementation for com.automl.template.simple.KNearestNeighbours proxy classifier

Improve caching. In the case of bagging we can't cache results because each child of bagging is based on a different subset of the data.

How can we improve the situation here?

Add comparison of test error vs training error for AutoMLMainSuite test.

Our new populations contain too many identical members.

We should probably sample without replacement or tune the probabilities of being sampled.
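
A sketch of weighted sampling without replacement, assuming each individual comes with a positive selection weight. `Sampling.weightedWithoutReplacement` is a hypothetical helper, not existing project code; it draws one item at a time in proportion to its weight and removes it from the pool, so the same individual cannot be selected twice.

```scala
import scala.util.Random

object Sampling {
  /** Draws up to n distinct items; selection chance is proportional to weight. */
  def weightedWithoutReplacement[A](items: Seq[(A, Double)], n: Int, rnd: Random): Seq[A] = {
    @annotation.tailrec
    def loop(pool: Vector[(A, Double)], acc: Vector[A]): Vector[A] =
      if (acc.size >= n || pool.isEmpty) acc
      else {
        val total = pool.map(_._2).sum
        val target = rnd.nextDouble() * total
        // Find the first item whose cumulative weight exceeds the target.
        val cumulative = pool.scanLeft(0.0)(_ + _._2).tail
        val hit = cumulative.indexWhere(_ > target)
        val idx = if (hit >= 0) hit else pool.size - 1
        // Remove the chosen item from the pool before the next draw.
        loop(pool.patch(idx, Nil, 1), acc :+ pool(idx)._1)
      }
    loop(items.toVector, Vector.empty)
  }
}
```

Tuning the probabilities then reduces to changing the weights passed in, for example flattening them to give weaker individuals a better chance.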

Get rid of println and introduce smart Logging

Added the ability to log per individual into separate log files instead of writing everything to the console.

Add fitnessErrors to individuals template tree.

Store template inside IndividualAlgorithm class with mutable fitnessError property. Print it while traversing.

Add simple UI to be able to run tasks from there instead of from tests.

Stop generation if we have lost diversity within population.

start from scratch?
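
A possible stopping criterion, sketched under the assumption that individuals compare structurally (for example, they are case classes): measure the share of distinct members and stop once it drops below a threshold. `Diversity`, `distinctRatio` and `hasLostDiversity` are hypothetical names, and the threshold value is left to the caller.

```scala
object Diversity {
  /** Share of distinct members in the population, in the range (0, 1]. */
  def distinctRatio[A](population: Seq[A]): Double =
    if (population.isEmpty) 1.0
    else population.distinct.size.toDouble / population.size

  /** True when fewer than `threshold` of the members are unique. */
  def hasLostDiversity[A](population: Seq[A], threshold: Double): Boolean =
    distinctRatio(population) < threshold
}
```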

Add configuration parameter to set number of top individuals to store into sorted heap.

This will give us at least something to choose from when we abort our evolutions due to timeboxes. Or maybe define a strategy: we can store every individual at the beginning and then slowly reduce the number.
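
A bounded "top N" container could be sketched with `scala.collection.mutable.PriorityQueue` as below. `TopN` and its `capacity` parameter are hypothetical, and the sketch assumes a score where higher is better; if the project ranks by fitness error instead, the ordering would need to be flipped.

```scala
import scala.collection.mutable

object BestIndividuals {
  /** Keeps the `capacity` entries with the highest score seen so far. */
  final class TopN[A](capacity: Int) {
    // Min-heap on score: the weakest retained entry sits at the head.
    private val heap =
      mutable.PriorityQueue.empty[(A, Double)](Ordering.by[(A, Double), Double](_._2).reverse)

    def offer(individual: A, score: Double): Unit = {
      heap.enqueue(individual -> score)
      if (heap.size > capacity) heap.dequeue() // evict the current weakest
    }

    /** Retained entries, best first. */
    def best: Seq[(A, Double)] = heap.toSeq.sortBy(-_._2)
  }
}
```

The "store everything, then slowly reduce" strategy would amount to making the capacity a function of the generation number rather than a constant.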

Add final evaluation on the test split. Make the AutoML class return the best template.

Changes in AutoMLMainSuite needed

NeuralNetwork(layers: Array[Int]) extends SimpleModelMember. Add automatic detection of number of features as well as number of classes.

Implement a general Boosting algorithm for ensembling classifiers.

Implement base algorithms Sigmoid, SigmoidNorm, Sine, Polynomial, Gaussian, Exponential and Linear.

As far as I can see, Spark's perceptron supports only sigmoid activation functions. From the source code:

"Each layer has sigmoid activation function, output layer has softmax. Number of inputs has to be equal to the size of feature vectors. Number of outputs has to be equal to the total number of labels."

In the paper you tried: "Training regression models (as components of probabilistic classifiers) is fast and straightforward. We use several activation functions in simple perceptrons, namely Sigmoid, SigmoidNorm, Sine, Polynomial, Gaussian, Exponential and Linear." So I believe we need to use Deeplearning4J implementations for these base classifiers.

Or we can fork Spark.
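
For orientation, the activation functions named above could be written in plain Scala roughly as follows. These are common textbook forms and may not match the exact definitions used in the paper. SigmoidNorm is left out because its definition is not given here, and `polynomial(degree)` as a simple power is an assumption. `Activations` and `perceptron` are hypothetical names.

```scala
object Activations {
  val sigmoid: Double => Double = x => 1.0 / (1.0 + math.exp(-x))
  val sine: Double => Double = x => math.sin(x)
  val gaussian: Double => Double = x => math.exp(-x * x)
  val exponential: Double => Double = x => math.exp(x)
  val linear: Double => Double = x => x

  /** Assumed form: x raised to the given degree. */
  def polynomial(degree: Int): Double => Double = x => math.pow(x, degree)

  /** Output of a single-neuron perceptron: activation(w . x + b). */
  def perceptron(weights: Seq[Double], bias: Double, activation: Double => Double)(x: Seq[Double]): Double =
    activation(weights.zip(x).map { case (w, xi) => w * xi }.sum + bias)
}
```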

evolutionNumber keeps resetting to 0. Maybe due to timeouts of timeboxes?

Extend Readme description with Getting started information

Add implementation for com.automl.template.simple.DeepNeuralNetwork proxy classifier.

Possible implementation could be MultiLayerNetwork class from deeplearning4j.com framework

Employ recommendation systems approaches while choosing most similar template from metaDB.

Fix version evictions in sbt

libraryDependencies ++= Seq(
"org.slf4j" % "slf4j-api" % "1.7.7",
"org.slf4j" % "jcl-over-slf4j" % "1.7.7"
).map(_.force())

or update to latest versions.

Fix tests for TravisCI

Implement first version of selection template based on similarity between datasets #metalearning

Improve performance by caching intermediate calculation of fitness functions

At least for duplicates of individuals within one population we can cache and reuse the results.
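
A minimal memoisation sketch, assuming individuals have structural equality so that duplicates hash to the same key. `FitnessCache.Cached` is a hypothetical wrapper, not a class from the project, and the fitness value is modelled as a plain `Double`.

```scala
import scala.collection.mutable

object FitnessCache {
  /** Wraps an expensive fitness function so equal individuals are evaluated once. */
  final class Cached[A](evaluate: A => Double) {
    private val cache = mutable.Map.empty[A, Double]

    /** Returns the cached value, or evaluates and stores it on first use. */
    def apply(individual: A): Double =
      cache.getOrElseUpdate(individual, evaluate(individual))

    def size: Int = cache.size
  }
}
```

Creating one `Cached` instance per population keeps the cache scoped to "duplicates within one population"; this sketch is not thread-safe.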

Implement the multiclass case for Linear Perceptron.

Still need to decide how to make final predictions based on what we have from each Perceptron separately. How to measure confidence?
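
One candidate answer, offered as a sketch and not a decision: treat the per-class perceptrons as a one-vs-rest ensemble, pick the class with the highest raw score, and use a softmax over the scores as a rough confidence. `OneVsRest.predict` is a hypothetical helper, and softmax values are not calibrated probabilities.

```scala
object OneVsRest {
  /** scores(c) is the raw output of the perceptron trained for class c. */
  def predict(scores: Seq[Double]): (Int, Double) = {
    // Softmax turns raw scores into values that sum to 1 (subtract max for stability).
    val max = scores.max
    val exps = scores.map(s => math.exp(s - max))
    val total = exps.sum
    val probs = exps.map(_ / total)
    // Index of the largest softmax value; ties resolve to the first such class.
    val best = probs.indices.maxBy(probs)
    (best, probs(best))
  }
}
```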

Replace XGBoost with spark's gradient boosting GBTRegressor.

Add visualisation tool similar to matplotlib.

Create co-evolution process of hyperparameter search

We should somehow share (though perhaps not physically) hyperparameter instances across all classifiers within one ensemble individual/generation/evolution. How do we find similarities between classifiers in terms of optimal hyperparameter settings?

Create mappings between classifiers classes and possible subsets of hyperparameters for those classes.

Think of default/initialization values for them

Prepare a Wiki page on how to set up the environment for metrics monitoring

How can we increase quality of solution through test cases?

We can choose test examples that were difficult in terms of classification.

Techniques that have been proposed to ameliorate this difficulty include shared sampling, in which test cases are chosen so as to be unsolvable by as many of the strategies in the population as possible.
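
A simplified reading of that idea, as a sketch: given a matrix recording which individual classified which test case correctly, rank the cases by how many individuals failed them and keep the hardest ones. This is a plain ranking, not necessarily the exact shared sampling procedure, and `HardCases.hardest` is a hypothetical name.

```scala
object HardCases {
  /** solved(i)(j) is true when individual i classifies test case j correctly. */
  def hardest(solved: Seq[Seq[Boolean]], k: Int): Seq[Int] = {
    val numCases = solved.headOption.map(_.size).getOrElse(0)
    // For every test case, count the individuals that failed it.
    val failures = (0 until numCases).map(j => j -> solved.count(row => !row(j)))
    // Most-failed cases first; ties keep the original case order (sortBy is stable).
    failures.sortBy { case (_, f) => -f }.take(k).map(_._1)
  }
}
```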

Figure out why the Bayesian model performs so badly on the airplane dataset

Fix evictions in sbt.

libraryDependencies ++= Seq(
).map(_.force())

or update to the latest versions.

Add cross-validation for evaluating individuals' performance
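
The index bookkeeping for k-fold cross-validation could be sketched as follows. `CrossValidation`, `kFold` and `meanError` are hypothetical names, rows are assigned to folds by index modulo k with no shuffling, and the caller supplies the function that trains on one index set and returns the error on the other.

```scala
object CrossValidation {
  /** Splits indices 0 until n into k folds; returns (train, validation) pairs. */
  def kFold(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    val indices = 0 until n
    (0 until k).map { fold =>
      // Every k-th index, offset by the fold number, goes to validation.
      val (validation, train) = indices.partition(_ % k == fold)
      (train, validation)
    }
  }

  /** Averages a per-fold error over all folds (k must be positive). */
  def meanError(n: Int, k: Int)(errorOn: (Seq[Int], Seq[Int]) => Double): Double = {
    val errors = kFold(n, k).map { case (train, validation) => errorOn(train, validation) }
    errors.sum / errors.size
  }
}
```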

Add an ensemble learning algorithm similar to RandomForest with the random subspace sampling method.

For now we use only Bagging, but we can subsample not only training examples but also the feature space.
Random Forest is not general enough and works only with trees, but we need to apply its core idea to any ensemble of classifiers.
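
The feature-sampling half of that idea is independent of the base classifier and could be sketched as below. `RandomSubspace`, `featureSubsets` and `project` are hypothetical names; each ensemble member gets its own random subset of feature indices and sees only those columns.

```scala
import scala.util.Random

object RandomSubspace {
  /** For each ensemble member, picks `subspaceSize` distinct feature indices at random. */
  def featureSubsets(numFeatures: Int, subspaceSize: Int, numMembers: Int, rnd: Random): Seq[Seq[Int]] =
    Seq.fill(numMembers)(rnd.shuffle((0 until numFeatures).toList).take(subspaceSize).sorted)

  /** Projects a feature vector onto the chosen indices. */
  def project(features: IndexedSeq[Double], subset: Seq[Int]): IndexedSeq[Double] =
    subset.map(features).toIndexedSeq
}
```

Combining this with the existing Bagging row sampling would give each member both a different subset of examples and a different subset of features.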

Add Stacking into our pool of possible ensembling algorithms.

Try to use Breeze for matrix multiplication and elementwise addition.

For now we are getting an issue with serialisation of Function2 because there is no Encoder for that type:

No Encoder found for (breeze.linalg.DenseVector[Double], breeze.linalg.DenseVector[Double]) => breeze.linalg.DenseVector[Double]

field (class: "scala.Function2", name: "_1")

root class: "scala.Tuple2"
java.lang.UnsupportedOperationException: No Encoder found for (breeze.linalg.DenseVector[Double], breeze.linalg.DenseVector[Double]) => breeze.linalg.DenseVector[Double]

field (class: "scala.Function2", name: "_1")

root class: "scala.Tuple2"

Introduce concept of evolution dimensions.

We can define Set[EvolutionDimension] and create pipeline where we configure all those dimensions with strategies and weights.

Find hyperparameters that are optimal for our base classifiers, so that we can see how ensembling and other heuristics are working.

Later we should search for hyperparameters with evolutions as well.