
lale's Introduction

Lale


README in other languages: 中文, deutsch, français, or contribute your own.

Lale is a Python library for semi-automated data science. Lale makes it easy to automatically select algorithms and tune hyperparameters of pipelines that are compatible with scikit-learn, in a type-safe fashion. If you are a data scientist who wants to experiment with automated machine learning, this library is for you! Lale adds value beyond scikit-learn along three dimensions: automation, correctness checks, and interoperability. For automation, Lale provides a consistent high-level interface to existing pipeline search tools including Hyperopt, GridSearchCV, and SMAC. For correctness checks, Lale uses JSON Schema to catch mistakes when there is a mismatch between hyperparameters and their type, or between data and operators. And for interoperability, Lale has a growing library of transformers and estimators from popular libraries such as scikit-learn, XGBoost, and PyTorch. Lale can be installed just like any other Python package and can be edited with off-the-shelf Python tools such as Jupyter notebooks.

The name Lale, pronounced laleh, comes from the Persian word for tulip. Like popular machine-learning libraries such as scikit-learn, Lale is just a Python library, not a new stand-alone programming language. It does not require users to install new tools or learn new syntax.

Lale is distributed under the terms of the Apache 2.0 License, see LICENSE.txt. It is currently in an Alpha release, without warranties of any kind.

lale's People

Contributors

boosong, chiragsahni, daniel-karl, danielryszkaibm, gbdrt, haodeqi, hirzel, ingkarat, jsntsay, kiran-kate, kirankatenexplore, ksrinivs64, lnxpy, mandel, mateuszszymkowskiibm, mfeffer, obitorasu, phadjido, rafalll-maciasz, rithram, shinnar, stevemar, szymonkucharczyk, tdoublep, vaisaxena, vickyvishal


lale's Issues

ThresholdClassifier

It would be nice to have a higher-order operator that calls predict_proba on its estimator argument and implements predict by comparing the resulting probability against its threshold argument. This would give the user more control over the trade-off between precision and recall. For example:

trainable = ThresholdClassifier(estimator=LogisticRegression(), threshold=0.3)
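
A minimal sketch of what such an operator could look like, written here as a plain scikit-learn-style wrapper (the class below is hypothetical, not part of Lale):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class ThresholdClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical higher-order operator: thresholds predict_proba of a binary estimator."""

    def __init__(self, estimator, threshold=0.5):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y):
        self.estimator_ = self.estimator.fit(X, y)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        # Probability of the second (positive) class, compared against the threshold.
        proba = self.estimator_.predict_proba(X)[:, 1]
        return np.where(proba >= self.threshold, self.classes_[1], self.classes_[0])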

Elide operator file name from documentation

For example, the operator documentation should not say lale.lib.sklearn.quadratic_discriminant_analysis.QuadraticDiscriminantAnalysis, but rather use the simpler path lale.lib.sklearn.QuadraticDiscriminantAnalysis.

  • this makes the "class" line at the top of the page shorter, in some cases preventing it from spilling into the margin
  • also, this reduces confusion about where to import from
  • furthermore, we would like to omit the submodules list from the package-level documentation

Hopefully, this can be solved by moving the set_docstring call from the operator.py file to the __init__.py file, and then renaming the file to add a leading underscore, as in _operator.py.

Documentation hyperlinks from output of visualize to open in a new tab.

The visualize method on Lale operators adds a hyperlink to each operator node, pointing to that operator's documentation. Currently, clicking the link opens the documentation in the current page; opening it in a new tab instead would be easier for users.

Links in visualize() should open a new window

Operator nodes in visualize() are links to the corresponding documentation, but clicking them navigates away from the current page, which is a problem in notebook settings. They should open a new window/tab instead.
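
A possible approach, assuming the visualization is emitted via graphviz: graphviz nodes accept a target attribute alongside href, so the generated SVG links can request a new tab. A hypothetical sketch (the URL is illustrative):

import graphviz

dot = graphviz.Digraph()
# target="_blank" makes the SVG hyperlink open in a new tab/window.
dot.node(
    "pca",
    "PCA",
    href="https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.pca.html",
    target="_blank",
)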

Make parameter names in schemas.py consistent with JSON Schema parameter names.

lale/schemas.py has classes that capture data types and options to be used as arguments to customize_schema. Some of the parameters of these classes use short forms of their JSON Schema counterparts: for example, the JSON schemas use minimumForOptimizer whereas schemas.py uses minForOptimizer. Renaming the schemas.py parameters to match their JSON Schema counterparts will make them easier to use. Here is the desired mapping (a usage sketch follows the list):

  1. minimumForOptimizer instead of minForOptimizer
  2. maximumForOptimizer instead of maxForOptimizer
  3. minimum instead of min
  4. maximum instead of max
  5. exclusiveMinimum instead of exclusiveMin
  6. exclusiveMaximum instead of exclusiveMax
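
A hypothetical usage sketch after the rename (the operator and bounds below are chosen for illustration only):

import lale.schemas as schemas
from lale.lib.sklearn import PCA

# With the renamed parameters, customize_schema reads like JSON Schema itself.
CustomPCA = PCA.customize_schema(
    n_components=schemas.Float(
        minimum=0.0,
        exclusiveMinimum=True,
        maximum=1.0,
        minimumForOptimizer=0.1,
        maximumForOptimizer=0.9,
    )
)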

Name reflection for multi-line make_operator

For example, given this call:

QuadraticDiscriminantAnalysis = make_operator(
    QuadraticDiscriminantAnalysisImpl, _combined_schemas
)

The new operator should know its name is QuadraticDiscriminantAnalysis.
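
One way to implement this is to parse the caller's source and look for the assignment enclosing the call site. A hedged sketch (the helper is hypothetical and assumes Python 3.8+ for end_lineno):

import ast
import inspect

def _name_from_assignment(default=None):
    # Frame 0 is this helper, frame 1 is make_operator, frame 2 is its caller.
    frame = inspect.stack()[2]
    try:
        source_lines, _ = inspect.findsource(frame.frame)
        tree = ast.parse("".join(source_lines))
    except (OSError, SyntaxError):
        return default
    # Find an assignment whose span covers the call site, even across lines.
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and node.lineno <= frame.lineno <= node.end_lineno:
            target = node.targets[0]
            if isinstance(target, ast.Name):
                return target.id
    return default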

Print url to schemas instead of the actual schema in error messages

Currently we print the violated schema, which is hard to understand and scary. Instead, we should print a link to the schema (on Read the Docs) that can be opened for more information.

Doing this well requires that our Read the Docs pages have anchors for each argument and constraint, so that we can link directly to them.

Update instructions for contributors

  • update the text to recommend installing a forked and cloned copy of the package in editable mode (pip install -e .[dev])
  • update the visual guide for editable-mode install too
  • update visual guide to include the pre-commit instructions

Hyperopt Algorithm used

Not actually an issue, just a question.

Hyperopt supports Random Search, Tree of Parzen Estimators (TPE), and Adaptive TPE.

When using the Hyperopt optimizer in Lale, which search algorithm is used behind the scenes?

Thank you in advance for your time and contribution.

StackingClassifier and StackingRegressor

Scikit-learn 0.22 introduced StackingClassifier and StackingRegressor. These operators train the final estimator using "cross-validated predictions of the base estimators", which should reduce over-fitting and make the ensemble more effective. We should wrap StackingClassifier and StackingRegressor as higher-order operators in Lale.
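
For reference, the sklearn 0.22 API that the Lale wrappers would mirror:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=10))],
    # The final estimator is trained on cross-validated predictions of the base estimators.
    final_estimator=LogisticRegression(),
)
clf.fit(X, y)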

Explore alternatives to readthedocs for documentation.

Read the Docs has the following issues:

  1. It shows advertisements in the bottom-left pane.
  2. The build fails randomly and we don't get notified. Sometimes only a few pages render with issues, and we don't discover it unless we happen to open those pages.

It would be helpful to find an alternative if possible. GitHub Pages is one option; we are not sure of others.

Operator parameter constraint link

For each operator, if a parameter has a constraint (such as penalty and loss in LinearSVC), the parameter description should have some indicator that the constraint exists and/or a link to the list of constraints.

Lale default install issues with Python 3.8

OS: Mac OS 11.3.1 (Big Sur)
Python version: 3.8

Running pip install lale[full] with Python 3.8 causes issues at runtime. Specifically, after installing dependencies beyond pip (such as swig), running cells (or at least the first cell) in the demo_aif360 notebook from the examples produced blocking errors.

The first was module 'black' has no attribute 'Mode'. I made this go away by running pip install --upgrade black to install the latest version of the package (v21), but based on this commit, that may not actually be the correct fix.

Regardless, I soon ran into another issue involving interactions between numpy and tensorflow. After making the change to black, re-running the cell produced module compiled against API version 0xe but this version of numpy is 0xd. Installing the latest version of numpy (v1.20) via pip resulted in a warning that tensorflow 2.5.0 (the version installed by default for my Python 3.8) would not be compatible (it only works with versions close to numpy 1.19.2). Downgrading numpy back to an earlier version brought back the 0xe error, so it was impossible to proceed from here.

A workaround is to just use Python 3.7 (I've gotten everything to work virtually out of the box with this version), but I figured the installation issues and dependency incompatibilities should be noted somewhere, especially as the docs state that any version newer than 3.6 should work.

Suggest hyperparameter changes during errors

For cases where hyperparameter constraints are violated, it may be possible to suggest changes for the user in the error message. A few study participants mentioned that this would be a nice feature to have.

Improving interpretability of schemas in error messages

Almost all study participants were unable to interpret the schema returned in the error task. We should add some level of assistance or prose to help with interpretation.

Example error message:

ValidationError: Invalid configuration for LinearSVC(penalty='l1', loss='hinge') due to constraint the combination of penalty=`l1` and loss=`hinge` is not supported.
Schema of constraint 1: {
    "description": "The combination of penalty=`l1` and loss=`hinge` is not supported",
    "anyOf": [
        {"type": "object", "properties": {"penalty": {"enum": ["l2"]}}},
        {
            "type": "object",
            "properties": {"loss": {"enum": ["squared_hinge"]}},
        },
    ],
}
Value: {'penalty': 'l1', 'loss': 'hinge', 'dual': True, 'tol': 0.0001, 'C': 1.0, 'multi_class': 'ovr', 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'verbose': 0, 'random_state': None, 'max_iter': 1000}

AutoPipeline predict_proba and roc_auc score.

  • Implement the predict_proba method for the AutoPipeline operator.
  • Add a test for AutoPipeline with scoring="roc_auc", which has been reported to raise "AttributeError: The underlying operator impl does not define predict_proba", and get that to work as well (a sketch follows this list).
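
A hedged sketch of the requested test (scoring="roc_auc" comes from this issue; the prediction_type and max_evals hyperparameter names are assumptions about the AutoPipeline API):

import sklearn.datasets
from lale.lib.lale import AutoPipeline

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
auto = AutoPipeline(prediction_type="classification", scoring="roc_auc", max_evals=5)
trained = auto.fit(X, y)
probabilities = trained.predict_proba(X)  # the method this issue asks to implement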

dependencies

Dear All,
After installing Lale and then updating packages, I got some pip errors:

ERROR: lale 0.3.3 has requirement hyperopt==0.1.2, but you'll have hyperopt 0.2.3 which is incompatible.
ERROR: lale 0.3.3 has requirement jsonschema==2.6.0, but you'll have jsonschema 3.2.0 which is incompatible.
ERROR: lale 0.3.3 has requirement pandas<=0.25.3, but you'll have pandas 1.0.1 which is incompatible.
ERROR: lale 0.3.3 has requirement scikit-learn==0.20.3, but you'll have scikit-learn 0.22.2.post1 which is incompatible.

Should I downgrade conflicting packages?

validate_schema_or_subschema fails when the input data is a dictionary

The input to Batching can be a Python dictionary, and when we enable data schema validation by setting lale->settings->disable_data_schema_validation to False, we get the following error:

SubschemaError: Expected sub to be a subschema of super.
sub = {"dataset": "<<NumpyTorchDataset>>", "collate_fn": "<<function>>"}
super = {
    "description": "Features; the outer array is over samples.",
    "type": "object",
}

validate_schema_or_subschema calls is_schema(lhs), which returns true for a dictionary input even though it is data rather than a schema. Perhaps we need a way to tell whether lhs is data or a schema?
To reproduce: the notebook lale->examples->demo_batching has a case 3 at the end, which fails once schema validation is enabled.
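
One possible direction, sketched with an illustrative helper: only treat a dictionary as a schema when it actually uses JSON Schema keywords, so a data dictionary like the one above would be classified as data.

# Hypothetical heuristic; the keyword list is illustrative, not exhaustive.
SCHEMA_KEYWORDS = {"type", "enum", "anyOf", "allOf", "oneOf", "not",
                   "properties", "items", "$schema", "description"}

def looks_like_schema(value) -> bool:
    return isinstance(value, dict) and bool(SCHEMA_KEYWORDS & value.keys())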

module resolution issue in pretty_print()

Hi All,

I have implemented a custom imputer, based on scikit-learn's SimpleImputer, as an example.
My code lives in albert_imputer.py.
Everything is fine until the final result is printed.
This is what I see in the debugger:

> /Users/albert/miniconda3/envs/lale/lib/python3.7/site-packages/lale/pretty_print.py(160)_get_module_name()
-> op = find_op(mod_name_short, op_name)
(Pdb) l
155  	    mod_name_long = class_name[: class_name.rfind(".")]
156  	    mod_name_short = mod_name_long[: mod_name_long.rfind(".")]
157  	    unqualified = class_name[class_name.rfind(".") + 1 :]
158  	    if class_name.startswith("lale.") and unqualified.endswith("Impl"):
159  	        unqualified = unqualified[: -len("Impl")]
160  ->	    op = find_op(mod_name_short, op_name)
161  	    if op is not None:
162  	        mod = mod_name_short
163  	    else:
164  	        op = find_op(mod_name_long, op_name)
165  	        if op is not None:
(Pdb) p mod_name_long, mod_name_short, unqualified,
('albert_imputer', 'albert_impute', 'AlbertImputerImpl')

In "mod_name_short" the last "r" is missed. For this reason, importlib cannot load the module in find_op().
As a temporary workaround, I created a symbolic link "albert_impute.py" to "albert_imputer.py" file and it works.
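
The transcript points at the likely root cause: for a top-level module there is no dot in mod_name_long, str.rfind returns -1, and the slice silently drops the last character:

mod_name_long = "albert_imputer"  # top-level module, no package prefix
mod_name_short = mod_name_long[: mod_name_long.rfind(".")]  # rfind(".") == -1, so this is [:-1]
assert mod_name_short == "albert_impute"  # the missing "r"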

Segmentation fault: 11 - SMAC - MacOS catalina

Hi all and thank you for your contribution.

I am trying to use the SMAC optimizer in my tests on a Mac with the Catalina OS. In my virtual Python environment, I have:

  • installed swig
  • run: CFLAGS=-stdlib=libc++ pip install smac
  • run: pip install lale[full]

But when I run my code with the SMAC optimizer and my custom operator, I get:
Segmentation fault: 11
Any ideas what I have done wrong?

(When using Hyperopt, I get results.)

Thank you in advance for your time.

ImportError: cannot import name '_UnstableArchMixin'

IBM Watson Studio: Version 1.1.0-151 (1.1.0-151) on macOS Catalina 10.15.4

from sklearn.preprocessing import Normalizer
from sklearn.tree import DecisionTreeRegressor as Tree
from lale.lib.lale import Hyperopt


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-28-2eee442a0b4d> in <module>
----> 1 from lale.lib.lale import Hyperopt

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/lale/lib/lale/__init__.py in <module>
     61 from .baseline_classifier import BaselineClassifier
     62 from .baseline_regressor import BaselineRegressor
---> 63 from .grid_search_cv import GridSearchCV
     64 from .hyperopt import Hyperopt
     65 from .topk_voting_classifier import TopKVotingClassifier

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/lale/lib/lale/grid_search_cv.py in <module>
     15 from typing import Any, Dict
     16 
---> 17 import lale.lib.sklearn
     18 import lale.search.lale_grid_search_cv
     19 import lale.operators

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/lale/lib/sklearn/__init__.py in <module>
    130 from .extra_trees_classifier import ExtraTreesClassifier
    131 from .extra_trees_regressor import ExtraTreesRegressor
--> 132 from .feature_agglomeration import FeatureAgglomeration
    133 from .function_transformer import FunctionTransformer
    134 from .gaussian_nb import GaussianNB

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/lale/lib/sklearn/feature_agglomeration.py in <module>
     13 # limitations under the License.
     14 
---> 15 import sklearn.cluster.hierarchical
     16 import lale.docstrings
     17 import lale.operators

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/sklearn/cluster/__init__.py in <module>
      4 """
      5 
----> 6 from .spectral import spectral_clustering, SpectralClustering
      7 from .mean_shift_ import (mean_shift, MeanShift,
      8                           estimate_bandwidth, get_bin_seeds)

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/sklearn/cluster/spectral.py in <module>
     15 from ..metrics.pairwise import pairwise_kernels
     16 from ..neighbors import kneighbors_graph
---> 17 from ..manifold import spectral_embedding
     18 from .k_means_ import k_means
     19 

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/sklearn/manifold/__init__.py in <module>
      3 """
      4 
----> 5 from .locally_linear import locally_linear_embedding, LocallyLinearEmbedding
      6 from .isomap import Isomap
      7 from .mds import MDS, smacof

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/sklearn/manifold/locally_linear.py in <module>
     10 from scipy.sparse.linalg import eigsh
     11 
---> 12 from ..base import BaseEstimator, TransformerMixin, _UnstableArchMixin
     13 from ..utils import check_random_state, check_array
     14 from ..utils.extmath import stable_cumsum

ImportError: cannot import name '_UnstableArchMixin'

installation problem

Hi All,
The simple installation step does not work for me on macOS Catalina inside a virtual environment, with Python 3.7.7 installed via Homebrew:

(pylale) albert@...> pip3 install lale[full]
zsh: no matches found: lale[full]

Similarly:

(pylale) albert@...> pip3 install lale[full] --isolated
zsh: no matches found: lale[full]

Same with cloned git repository:

(pylale) albert@... lale> pip3 install .[full,test]
zsh: no matches found: .[full,test]
(pylale) albert@... lale> pip3 install .[full,test] --isolated
zsh: no matches found: .[full,test]

It works without [full,test], but then there are no tests ...
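
For what it's worth, zsh treats square brackets as glob patterns, so quoting the requirement usually avoids the "no matches found" error:

(pylale) albert@...> pip3 install 'lale[full]'
(pylale) albert@...> pip3 install '.[full,test]'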

Add a test case to test_autoai_output_consumption.py to do fairness mitigation

Add a test case to test_autoai_output_consumption.py covering the following scenario:

  1. Read an output AutoAI pipeline.
  2. Use DisparateImpactRemover on the preprocessing prefix and perform refinement with a choice of classifiers.
  3. Use Hyperopt to choose the best model with the pre-estimator mitigation of step 2.

Here is some code for using the pipeline generated for the German credit dataset:

fairness_info = {
    "protected_attributes": [
        {"feature": "Sex", "reference_group": ["male"], "monitored_group": ["female"]},
        {"feature": "Age", "reference_group": [[20, 40], [60, 90]], "monitored_group": [[41, 59]]},
    ],
    "favorable_labels": ["No Risk"],
    "unfavorable_labels": ["Risk"],
}

prefix = best_pipeline.remove_last().freeze_trainable()

from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier as RF
from lale.operator_wrapper import wrap_imported_operators
from lale.lib.aif360 import DisparateImpactRemover
wrap_imported_operators()

di_remover = DisparateImpactRemover(**fairness_info, preparation=prefix, redact=True)
planned_fairer = di_remover >> (LR | RF)

from lale.lib.aif360 import accuracy_and_disparate_impact
from lale.lib.aif360 import FairStratifiedKFold

combined_scorer = accuracy_and_disparate_impact(**fairness_info)
fair_cv = FairStratifiedKFold(**fairness_info, n_splits=3)

from lale.lib.lale import Hyperopt

import pandas as pd
df = pd.read_csv("german_credit_data_biased_training.csv")
y = df.iloc[:, -1]
X = df.drop(columns=['Risk'])

trained_fairer = planned_fairer.auto_configure(
    X, y, optimizer=Hyperopt, cv=fair_cv, verbose=True,
    max_evals=1, scoring=combined_scorer, best_score=1.0)

reuse common schemas in lale.lib.sklearn

Many operators in lale.lib.sklearn have similar schemas for fit, predict, transform, predict_proba, etc. Similarly, there are some hyperparameters that occur in many of the operators in lale.lib.sklearn, such as random_state, n_jobs, etc. We could make our code base smaller and easier to maintain by defining such schemas only once in a file imported by each of the operators, and then plugging them in at the right place.

We already do something like this for lale.lib.aif360.
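
A hedged sketch of what a shared module could look like (the file and variable names are illustrative):

# _common_schemas.py (hypothetical): defined once, imported by each operator.
schema_X_numbers = {
    "description": "Features; the outer array is over samples.",
    "type": "array",
    "items": {"type": "array", "items": {"type": "number"}},
}

schema_random_state = {
    "description": "Seed of pseudo-random number generator.",
    "anyOf": [{"type": "integer"}, {"enum": [None]}],
    "default": None,
}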

Clean-up github actions setup

The build.yml in .github/workflows needs to be reviewed and cleaned up. We noticed the following issues:

  • The pip cache mechanism doesn't seem to work as expected, i.e., it doesn't refresh when setup.py specifies a different version of a dependency.
  • The usage of matrix and if statements for exceptions looks inconsistent.

schema validation for bare sparseness constraint

Currently, schema checking raises an internal assertion for constraints of the form X/isSparse that are "bare", i.e., not nested inside an anyOf.

Test to reproduce:

import jsonschema
import scipy.sparse
import sklearn.datasets
import sklearn.decomposition
import unittest

import lale.lib.sklearn

# Note: EnableSchemaValidation is assumed to be the context manager from
# Lale's test helpers that temporarily turns data-schema validation on.

class TestBareSparsenessConstraint(unittest.TestCase):
    def setUp(self):
        X, y = sklearn.datasets.load_iris(return_X_y=True)
        self.sparse_X = scipy.sparse.csr_matrix(X)
        self.y = y

    def test_bare_sparseness_constraint(self):
        # without Lale
        trainable = sklearn.decomposition.PCA()
        with self.assertRaisesRegex(TypeError, "PCA does not support sparse"):
            trained = trainable.fit(self.sparse_X, self.y)
        # with Lale and schema validation
        with EnableSchemaValidation():
            trainable = lale.lib.sklearn.PCA()
            with self.assertRaises(jsonschema.ValidationError):
                trained = trainable.fit(self.sparse_X, self.y)

Output:

  File "lale/lale/operators.py", line 1984, in _validate_hyperparams
    assert e.schema_path[2] == "anyOf"
AssertionError

Installation issue windows

A colleague of mine is having issues getting started with Lale on Windows. Installation seems to complete successfully using the pip install lale[full] instructions.

The issue is with importing Lale libraries. Specifically, the command
from lale.lib.lale import NoOp, Hyperopt generates an error (screenshot attached).


This is using Windows 10 & Python 3.8.3.

Any suggestions?

Blank AssertionError when using `r2_and_disparate_impact` scorer

The r2_and_disparate_impact scorer raises an AssertionError for some estimators.

Here is a simple example to reproduce:

import numpy as np
import pandas as pd

filename = 'Boston_Housing.csv'
X = pd.read_csv(filename)
y = X.pop('MEDV')

black_median = np.median(X['B'])
label_median = np.median(y)

loc = X.columns.get_loc('B')

X = X.values
y = y.values.reshape(-1)

fairness_info = {
    "favorable_labels": [[-10000.0, label_median]],
    "protected_attributes": [
        {"feature": loc, "privileged_groups": [[0.0, black_median]]},
    ],
}

from lale.lib.aif360 import r2_and_disparate_impact
scorer = r2_and_disparate_impact(**fairness_info)

from lale.lib.lightgbm import LGBMRegressor
from lale.lib.xgboost import XGBRegressor
from lale.lib.sklearn import DecisionTreeRegressor, RandomForestRegressor, Ridge, LinearRegression, ExtraTreesRegressor
from lale.lib.snapml import SnapDecisionTreeRegressor, SnapRandomForestRegressor, SnapBoostingMachineRegressor

for Model in [XGBRegressor, LGBMRegressor, RandomForestRegressor,
              DecisionTreeRegressor, Ridge, LinearRegression, ExtraTreesRegressor,
              SnapDecisionTreeRegressor, SnapRandomForestRegressor, SnapBoostingMachineRegressor]:

    m = Model().fit(X, y)
    try:
        out = scorer(m, X, y)
        print(Model, out)
    except Exception as e:
        print(Model, type(e), e)

produces:

XGBRegressor 0.9812037552088237
LGBMRegressor 0.9904847287204723
RandomForestRegressor 0.9634764266689257
DecisionTreeRegressor <class 'AssertionError'> 
Ridge 0.2590069531590341
LinearRegression 0.29422558503748475
ExtraTreesRegressor <class 'AssertionError'> 
SnapDecisionTreeRegressor 0.9999999999999943
SnapRandomForestRegressor 0.9487078573387389
SnapBoostingMachineRegressor 0.9475739036491716

short-circuit "Impl" on readthedocs

The readthedocs page for an individual operator lists the Impl class. For example, for LogisticRegression, it starts with:

class lale.lib.sklearn.logistic_regression.LogisticRegressionImpl(**hyperparams)

But since people do not use this Impl class directly, it would be better for the documentation to refer to the operator object instead, something like this:

PlannedIndividualOp lale.lib.sklearn.LogisticRegression(**hyperparams)

Then, we can also get rid of the disclaimer that we currently inject in the docstring:

Instead of using lale.lib.sklearn.logistic_regression.LogisticRegressionImpl directly, use its wrapper, lale.lib.sklearn.LogisticRegression.

Improve error message for missing ConcatFeatures

The Lale & combinator is most commonly used with ConcatFeatures. If a pipeline accidentally pipes the output of & directly into an operator that does not expect it, Lale should report a helpful error message for how to fix that mistake. For example, consider the following code:

X, y = load_iris(return_X_y=True)
trainable = (PCA() & NoOp) >> LogisticRegression()
trained = trainable.fit(X, y)

This produces the following error from sklearn:

ValueError: Found array with dim 3. Estimator expected <= 2.

Or, when used with lale.settings.set_disable_data_schema_validation(False), it produces the following error from Lale:

ValueError: LogisticRegression.fit() invalid X, the schema of the actual data is not a subschema of the expected schema of the argument.

It would be nice if the error message provided a clue to the solution, which would be something like this:

from lale.lib.lale import ConcatFeatures
(PCA() & NoOp) >> ConcatFeatures >> LogisticRegression()

add support for fit params to pipelines

As per sklearn, fit parameters should be prefixed with the name of the step they are for:
"Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p."
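
A hypothetical sketch of the requested API (the step prefix below assumes Lale would derive step names the way sklearn's Pipeline does):

import numpy as np
from lale.lib.sklearn import PCA, SGDClassifier

X = np.random.rand(20, 5)
y = np.random.randint(0, 2, size=20)

trainable = PCA() >> SGDClassifier()
# Hypothetical: sample_weight would be routed to the SGDClassifier step.
trained = trainable.fit(X, y, sgd_classifier__sample_weight=np.ones(20))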

Update to newest Hyperopt

Hyperopt 0.2.6 was released on November 15: https://pypi.org/project/hyperopt/0.2.6/

Unfortunately, it breaks many Lale tests: https://github.com/IBM/lale/actions/runs/1467468837

For example, the failures include some very basic tests such as:

  • test.test_core_transformers.TestFeaturePreprocessing.test_MinMaxScaler
  • test.test_core_transformers.TestFeaturePreprocessing.test_PCA
  • test.test_core_transformers.TestConcatFeatures.test_concat_with_hyperopt

So for now, we limit the Hyperopt version to <=0.2.5: 24db058

We should try to update to the latest version (in fact, if we are lucky, Hyperopt 0.2.7 fixes the problem).

Move to tensorflow 2.

lale/lib/tensorflow depends on tensorflow>=1.13.1; upgrade the operators to use TensorFlow 2 instead.
