
fairgbm's Introduction

FairGBM

Note FairGBM has been accepted at ICLR 2023. Link to paper here.

FairGBM is an easy-to-use and lightweight fairness-aware ML algorithm with state-of-the-art performance on tabular datasets.

FairGBM builds upon the popular LightGBM algorithm and adds customizable constraints for group-wise fairness (e.g., equal opportunity, predictive equality, equalized odds) and other global goals (e.g., specific Recall or FPR prediction targets).

Install

FairGBM can be installed from PyPI:

pip install fairgbm

Or directly from GitHub:

git clone --recurse-submodules https://github.com/feedzai/fairgbm.git
pip install fairgbm/python-package/

Note Compatibility is only maintained with Linux OS.

If you don't have access to a Linux machine we advise using the free Google Colab service (example Colab notebook here).

We also provide a Docker image that can be useful on non-Linux platforms. Run docker run -p 8888:8888 ndrcrz/fairgbm-miniconda to get a Jupyter notebook environment with fairgbm installed.

Note Follow this link for more details on the Python package installation instructions.

Docker image

We provide a Docker image with python and miniconda installed, ready to run the example fairgbm jupyter notebooks.

You can get a jupyter notebook with fairgbm up and running on your local machine with:

docker run -p 8888:8888 ndrcrz/fairgbm-miniconda

Although we recommend using the Python package directly on a local x86-64 (non-ARM) Linux machine, this Docker image is an option for users on other platforms (it was tested on an M1 Mac).

The Dockerfile is available here.

Getting started

Recommended Python notebook example here (Google Colab link here).

You can get FairGBM up and running in just a few lines of Python code:

from fairgbm import FairGBMClassifier

# Instantiate
fairgbm_clf = FairGBMClassifier(
    constraint_type="FNR",    # constraint on equal group-wise TPR (equal opportunity)
    n_estimators=200,         # core parameters from vanilla LightGBM
    random_state=42,          # ...
)

# Train using features (X), labels (Y), and sensitive attributes (S)
fairgbm_clf.fit(X, Y, constraint_group=S)
# NOTE: labels (Y) and sensitive attributes (S) must be in numeric format

# Predict
Y_test_pred = fairgbm_clf.predict_proba(X_test)[:, -1]  # Compute continuous class probabilities (recommended)
# Y_test_pred = fairgbm_clf.predict(X_test)             # Or compute discrete class predictions

For Python examples see the notebooks folder.

A more in-depth explanation and other usage examples (with python package or compiled binary) can be found in the examples folder.

Note FairGBM is a research project, so its default hyperparameters (keyword arguments) are not expected to be as robust as the defaults in sklearn or lightgbm classifiers. We strongly recommend running hyperparameter tuning on the multiplier_learning_rate hyperparameter as well as the remaining GBM hyperparameters (example here), as sketched below.
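A minimal random-search sketch for that tuning, reusing the X, Y, and S arrays from the getting-started example; the search ranges and the 25-trial budget are illustrative assumptions, and fairness metrics should be checked on the validation split in addition to AUC:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from fairgbm import FairGBMClassifier

# Hold out a validation split for tuning (illustrative 80/20 split).
X_tr, X_val, Y_tr, Y_val, S_tr, S_val = train_test_split(
    X, Y, S, test_size=0.2, random_state=42, stratify=Y)

rng = np.random.default_rng(42)
best_auc, best_params = -np.inf, None
for _ in range(25):
    # Illustrative search ranges; adjust them to your dataset.
    params = {
        "multiplier_learning_rate": 10 ** rng.uniform(-2, 2),
        "learning_rate": 10 ** rng.uniform(-2, -0.5),
        "n_estimators": int(rng.integers(100, 500)),
        "num_leaves": int(rng.integers(15, 128)),
    }
    clf = FairGBMClassifier(constraint_type="FNR", random_state=42, **params)
    clf.fit(X_tr, Y_tr, constraint_group=S_tr)
    auc = roc_auc_score(Y_val, clf.predict_proba(X_val)[:, -1])
    if auc > best_auc:    # also check group-wise TPR/FPR before picking a model
        best_auc, best_params = auc, params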

Parameter list

The following parameters can be used as key-word arguments for the FairGBMClassifier Python class.

| Name | Description | Default |
| --- | --- | --- |
| constraint_type | The type of fairness (group-wise equality) constraint to use (if any). | FPR,FNR |
| global_constraint_type | The type of global equality constraint to use (if any). | None |
| multiplier_learning_rate | The learning rate for the gradient ascent step (w.r.t. Lagrange multipliers). | 0.1 |
| constraint_fpr_tolerance | The slack when fulfilling group-wise FPR constraints. | 0.01 |
| constraint_fnr_tolerance | The slack when fulfilling group-wise FNR constraints. | 0.01 |
| global_target_fpr | Target rate for the global FPR (inequality) constraint. | None |
| global_target_fnr | Target rate for the global FNR (inequality) constraint. | None |
| constraint_stepwise_proxy | Differentiable proxy for the step-wise function in group-wise constraints. | cross_entropy |
| objective_stepwise_proxy | Differentiable proxy for the step-wise function in global constraints. | cross_entropy |
| stepwise_proxy_margin | Intercept value for the proxy function: value at f(logodds=0.0). | 1.0 |
| score_threshold | Score threshold used when assessing group-wise FPR or FNR in training. | 0.5 |
| global_score_threshold | Score threshold used when assessing global FPR or FNR in training. | 0.5 |
| init_multipliers | The initial value of the Lagrange multipliers. | 0 for each constraint |
| ... | Any core LGBMClassifier parameter can be used with FairGBM as well. | |

Please consult this list for a detailed view of all vanilla LightGBM parameters (e.g., n_estimators, n_jobs, ...).

Note The objective is the only core LightGBM parameter that cannot be changed when using FairGBM, as you must use the constrained loss function objective="constrained_cross_entropy". Using a standard non-constrained objective will fall back to standard LightGBM.

fit(X, Y, constraint_group=S)

In addition to the usual fit arguments, features X and labels Y, FairGBM takes in the sensitive attributes S column for training.

Regarding the sensitive attributes column S:

  • It should be in numeric format, and have each different protected group take a different integer value, starting at 0.
  • It is not restricted to binary sensitive attributes: you can use two or more different groups encoded in the same column;
  • It is only required for training and not for computing predictions;

Here is an example pre-processing for the sensitive attributes on the UCI Adult dataset:

# Given X, Y, S
X, Y, S = load_dataset()

# The sensitive attributes S must be in numeric format
S = np.array([1 if val == "Female" else 0 for val in S])

# The labels Y must be binary and in numeric format: {0, 1}
Y = np.array([1 if val == ">50K" else 0 for val in Y])

# And the features X may be numeric or categorical, but make sure categorical columns are in the correct format
X: Union[pd.DataFrame, np.ndarray]      # any array-like can be used

# Train FairGBM
fairgbm_clf.fit(X, Y, constraint_group=S)

Features

FairGBM enables you to train a GBM model to minimize a loss function (e.g., cross-entropy) subject to fairness constraints (e.g., equal opportunity).

Namely, you can target equality of performance metrics (FPR, FNR, or both) across instances from two or more different protected groups (see fairness constraints section). Optionally, you can simultaneously add global constraints on specific metrics (see global constraints section).

Fairness constraints

You can use FairGBM to equalize the following metrics across two or more protected groups:

  • FPR (false positive rate), also known as predictive equality;
  • FNR (false negative rate), equivalent to equalizing TPR, also known as equal opportunity;
  • Both FPR and FNR simultaneously, also known as equalized odds.

Example for equality of opportunity in college admissions: your likelihood of getting admitted to a certain college (predicted positive) given that you're a qualified candidate (label positive) should be the same regardless of your race (sensitive attribute).
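As a quick post-training check (not part of the FairGBM API itself), group-wise TPR can be compared on held-out data with a few lines of numpy; X_test, Y_test, and S_test are assumed to be numeric arrays in the format described above:

import numpy as np

scores = fairgbm_clf.predict_proba(X_test)[:, -1]   # continuous scores
preds = (scores >= 0.5).astype(int)                  # threshold at 0.5
for g in np.unique(S_test):
    mask = (S_test == g) & (Y_test == 1)             # label-positive rows of group g
    print(f"group {g}: TPR = {preds[mask].mean():.3f}")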

Global constraints

You can also target specific FNR or FPR goals. For example, in cases where high accuracy is trivially achieved (e.g., problems with high class imbalance), you may want to maximize TPR with a constraint on FPR (e.g., "maximize TPR with at most 5% FPR"). You can set a constraint on global FPR ≤ 0.05 by using global_target_fpr=0.05 and global_constraint_type="FPR".

You can simultaneously set constraints on group-wise metrics (fairness constraints) and constraints on global metrics.
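For example, a model constrained to roughly equal group-wise TPR and a global FPR of at most 5% could be instantiated as follows (a sketch; the tolerance value is illustrative):

from fairgbm import FairGBMClassifier

fairgbm_clf = FairGBMClassifier(
    constraint_type="FNR",              # fairness constraint: equal group-wise TPR
    constraint_fnr_tolerance=0.01,      # allowed slack between groups (illustrative)
    global_constraint_type="FPR",       # global constraint: FPR <= target
    global_target_fpr=0.05,             # "at most 5% FPR"
    n_estimators=200,
    random_state=42,
)
fairgbm_clf.fit(X, Y, constraint_group=S)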

Technical Details

FairGBM is a framework that enables constrained optimization of Gradient Boosting Machines (GBMs). This way, we can train a GBM model to minimize some loss function (usually the binary cross-entropy) subject to a set of constraints that should be met in the training dataset (e.g., equality of opportunity).

FairGBM applies the method of Lagrange multipliers, and uses iterative and interleaved steps of gradient descent (on the function space, by adding new trees to the GBM model) and gradient ascent (on the space of Lagrange multipliers, Λ).

The main obstacle with enforcing fairness constraints in training is that these constraints are often non-differentiable. To side-step this issue, we use a differentiable proxy of the step-wise function. The following plot shows an example of hinge-based and cross-entropy-based proxies for the false positive value of a label negative instance.

[Figure: hinge-based and cross-entropy-based proxies for the FP value of a label-negative instance]
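A rough sketch of what such proxies can look like (the exact formulas used internally may differ; here the hinge proxy is max(0, f + 1) and the cross-entropy proxy is log2(1 + e^f), both chosen so that their value at f = 0 equals the default stepwise_proxy_margin of 1.0):

import numpy as np
import matplotlib.pyplot as plt

# Step-wise FP indicator of a label-negative instance vs. two smooth proxies.
f = np.linspace(-4, 4, 400)                 # raw model output (log-odds)
step_fp = (f >= 0.0).astype(float)          # step-wise false positive indicator
hinge_fp = np.maximum(0.0, f + 1.0)         # hinge-based proxy (margin = 1.0)
xent_fp = np.log2(1.0 + np.exp(f))          # cross-entropy-based proxy

plt.plot(f, step_fp, label="step-wise FP")
plt.plot(f, hinge_fp, label="hinge proxy")
plt.plot(f, xent_fp, label="cross-entropy proxy")
plt.xlabel("model score (log-odds)")
plt.ylabel("proxy FP value")
plt.legend()
plt.show()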

For a more in-depth explanation of FairGBM please consult the paper.

Contact

For commercial uses of FairGBM please contact [email protected].

How to cite FairGBM

@inproceedings{cruz2023fairgbm,
  author = {Cruz, Andr{\'{e}} F. and Bel{\'{e}}m, Catarina and Jesus, S{\'{e}}rgio and Bravo, Jo{\~{a}}o and Saleiro, Pedro and Bizarro, Pedro},
  title={Fair{GBM}: Gradient Boosting with Fairness Constraints},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=x-mXzBgCX3a}
}

The paper is publicly available at this arXiv link.

fairgbm's Issues

rename `constraint_group` argument to `sensitive_attributes`

The term constraint_group alludes to constrained optimization, but the main use-case for FairGBM is enhancing fairness, so a better keyword-argument name should probably be chosen.

Suggestions:

  • sensitive_attributes
  • protected_attributes
  • constraint_groups -> note the plural

NOTE
this is a breaking change and will need a corresponding PR in the feedzai-openml-java repository etc.

Implement _randomized classifier_ predictions

Summary

Using a FairGBM model as a randomized classifier is described in detail in the FairGBM paper.
However, this library only allows the use of the last FairGBM iterate --- this should achieve similar performance with faster predictions, but it would be interesting to still be able to use the randomized classifier predictions for comparison and future research.

Description

These randomized classifier predictions are simply generated by matching each input row with a random boosting iterate, and using that iterate to generate the row's predictions (selected at random with replacement).

This could even be done only on the Python package side, by adding a new method to the FairGBMClassifier class, named predict_randomized or predict_proba_randomized, or by adding a new flag randomized=True to the existing predict and predict_proba methods.
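A minimal Python-side sketch of such predictions, built on the existing num_iteration argument of predict_proba (the function name, the use of n_estimators as the number of iterates, and the uniform sampling scheme are assumptions, not an implemented API):

import numpy as np

def predict_proba_randomized(clf, X, random_state=None):
    # Sketch: assign each row a boosting iterate sampled uniformly at random
    # (with replacement) and score it with the model truncated at that iterate.
    rng = np.random.default_rng(random_state)
    n_rows = len(X)
    n_iters = clf.n_estimators               # assumes all boosting rounds were run
    sampled_iters = rng.integers(1, n_iters + 1, size=n_rows)
    out = np.empty(n_rows, dtype=float)
    for t in np.unique(sampled_iters):        # group rows sharing the same iterate
        rows = np.flatnonzero(sampled_iters == t)
        X_rows = X.iloc[rows] if hasattr(X, "iloc") else X[rows]
        out[rows] = clf.predict_proba(X_rows, num_iteration=int(t))[:, -1]
    return out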

References

https://arxiv.org/pdf/2209.07850v2.pdf

Handle missing values for sensitive attributes

Summary

LightGBM already handles missing values in the features (e.g., represented as NaN).
We should also be able to handle missing values in the constraint group column (sensitive attribute).

For instance, we could do imputation with the majority group for all rows with unknown sensitive attribute.
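A user-side sketch of that majority-group imputation, assuming S is a float numpy array with NaN marking unknown groups:

import numpy as np

# Impute missing sensitive-attribute values with the majority group
# (a user-side workaround; FairGBM itself does not do this yet).
known = S[~np.isnan(S)]
majority_group = np.bincount(known.astype(int)).argmax()
S_imputed = np.where(np.isnan(S), majority_group, S).astype(int)
fairgbm_clf.fit(X, Y, constraint_group=S_imputed)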

Lagrange multipliers need to be manually scaled with the size of the dataset

Description

Currently we need to use larger multipliers for larger datasets, and smaller multipliers for smaller datasets.

We could potentially just multiply the gradient flowing to each sample from the constraint loss by the number of data points (or simply not divide this loss by the number of data points).

This isn't exactly theoretically sound AFAIK, as the true gradient from FPR or FNR constraints does depend on the number of data points... We could just implement it and see if performance is affected.
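In the meantime, a user-side workaround is to scale the multiplier step with dataset size; the linear rule and the reference size below are hypothetical, not an official recommendation:

from fairgbm import FairGBMClassifier

# Hypothetical heuristic: scale the multiplier learning rate linearly with the
# number of training rows, relative to a reference size where the default worked well.
REFERENCE_N = 100_000                          # assumed reference dataset size
scaled_mlr = 0.1 * (len(X_train) / REFERENCE_N)

fairgbm_clf = FairGBMClassifier(
    constraint_type="FNR",
    multiplier_learning_rate=scaled_mlr,
    random_state=42,
)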

Issue with running fairgbm_clf.fit()

Description

We are trying to implement FairGBM in order to classify a certain target, severity_score_class, while using districts as the constraint group. After trying to train using X, Y and S with fairgbm_clf.fit(X_train, Y_train, constraint_group=S), the following error arises: LightGBMError: Input data type error or field not found. After many attempts to fix this issue, it still persists.

Reproducible example

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import lightgbm
from fairgbm import FairGBMClassifier

data = pd.read_csv('total_df_final_for_models.csv').drop(columns=['Column1', 'Column2'])
TARGET_COL = "severity_score_class"
SENSITIVE_COL = "district"

def retrieve_X(data):
    ignored_cols = [TARGET_COL, SENSITIVE_COL, "severity_score"]
    feature_cols = [col for col in data.columns if col not in ignored_cols]
    X = data[feature_cols]
    return X

def retrieve_Y(data):
    Y = data[TARGET_COL]
    return Y

def retrieve_S(data):
    data["district"] = data["district"].astype('category')
    data["district_encoding"] = data["district"].cat.codes
    S = data["district_encoding"]
    return S

X = retrieve_X(data)
Y = retrieve_Y(data)
S = retrieve_S(data)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=16)

fairgbm_clf = FairGBMClassifier(
    constraint_type="FNR",    # constraint on equal group-wise TPR (equal opportunity)
    n_estimators=200,         # core parameters from vanilla LightGBM
    random_state=16,
)

fairgbm_clf.fit(X_train, Y_train, constraint_group=S)

Additional Comments

The Y variable is multiclass, whereas FairGBM makes binary predictions. Y consists of three levels, which might be a problem if multiclass classification is not possible with FairGBM. The constraint group S consists of 69 districts. Maybe these are the reasons for the LightGBM error. Every line of code works until the fairgbm_clf.fit() call.
Data used: total_df_final_for_models.csv
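Since FairGBM currently targets binary classification, one possible workaround is to binarize the target before fitting (a sketch; the mapping of the three severity levels to a binary label is hypothetical, and note that S should also be aligned with the training rows):

import numpy as np

# Hypothetical binarization: treat the highest severity level as the positive class;
# adjust the grouping of the three levels to whatever makes sense for the use case.
Y_train_binary = (Y_train == Y_train.max()).astype(int)
S_train = S.loc[X_train.index]             # align S with the training rows
fairgbm_clf.fit(X_train, Y_train_binary, constraint_group=S_train)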

Add constrained_cross_entropy metric

When using constrained_cross_entropy as an objective, FairGBM tries to use it as a metric as well (to report training progress). The issue here is that, currently, no such metric exists.

Everything works fine since a metric is not necessary for instantiating or training a FairGBM object, but it would be nice to have a proper metric.

Implement constraints on predicted prevalence (demographic parity)

Summary

We should enable creating group-wise constraints on the percentage of positive predictions (a.k.a., predicted prevalence).

This enables the popular Demographic Parity fairness metric.
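Until such a constraint is implemented, per-group predicted prevalence can already be monitored after training with a few lines of numpy (a measurement sketch, not the proposed constraint; X_test and S_test are assumed to exist):

import numpy as np

# Per-group predicted prevalence (share of positive predictions) at a 0.5 threshold.
preds = (fairgbm_clf.predict_proba(X_test)[:, -1] >= 0.5).astype(int)
for g in np.unique(S_test):
    print(f"group {g}: predicted prevalence = {preds[S_test == g].mean():.3f}")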

TODO

  • Add constraint_type=PP to the available group-wise constraint options;
  • Implement gradient descent on proxy predicted prevalence metrics;
  • Add the same PP constraint type for global constraints
    • This would be a constraint on maximum alert rate -> quite useful for our in-house use cases

EDIT

Minor notes for the future PP constraint implementation:
83de2426-ea6e-4308-9f0d-5ddc3b001580.pdf

Implement Multi-threading in FairGBM logic

Summary

Several CPU-bound FairGBM functions are currently single threaded.

TODO:

  • Test if everything works correctly when using the following directive on (most) loops over the data: #pragma omp parallel for schedule(static)
    • The loop at ConstrainedObjectiveFunction::GetConstraintGradientsWRTModelOutput should be the focus, as it is where most CPU time is spent;
    • AFAIK the loop seems to be parallelizable: no variables seem to be changed in the same location (arrays are changed but always at the position of the given row index, therefore no race conditions should occur);
    • If not, adapt the loop such that each thread does not alter variables in common (or use locks when it's impossible to separate the logic);

Potential issue when loading constrained_cross_entropy objective from file

We need to assess the side-effects of not having a proper ConstrainedCrossEntropy::ToString method.

From a quick run through the code it seems the ToString method is used to pass information/configs to the Objective class. Currently we have this commented out for all ConstrainedObjectiveFunction sub-classes (see screenshot in the original issue).

If we uncomment it, the bug shown in the original issue's screenshot is thrown.

This is also related to the fact that we cannot resume training from a previously trained FairGBM model.

Related to #10

Review the FairGBM license with legal

Summary

The current LICENSE is the same as the one used on the TimeSHAP open-source repo.

TODO:

  • check with legal if this is the license to use for FairGBM.
  • check how we should list the previous LightGBM Microsoft license? append each to the same file?

Model serialization changed from v3.0.0; check if this is expected behavior

Description

A test that compares the persisted txt file with the txt file of a model created with a previous lightgbm version is currently failing.

diff: (screenshot in the original issue)

Files for version v3.0.0:
4f.txt
42f.txt

Files for version v3.2.1-fairgbm:
4f.txt
42f.txt

Regarding the addition of is_linear

Regarding the change in tree sizes

Create a class hierarchy for proxy-loss types: Quadratic/XEntropy/Hinge

Summary

Possible incompatibility with macOS

Description

When importing fairgbm on a mac I immediately get the following error message:

OSError: dlopen(<env-dir>/python3.9/site-packages/fairgbm/lib_lightgbm.so, 0x0006): tried: '<env-dir>/python3.9/site-packages/fairgbm/lib_lightgbm.so' (not a mach-o file)


I have tried this with Python 3.8, 3.9, and 3.10 (I haven't tried it with earlier Python versions due to incompatibility with arm64 CPUs, but I have no reason to believe the bug would be exclusive to this architecture).

Reproducible example

  1. Install fairgbm on a mac
  2. Run import fairgbm
  3. See error message.

Environment info

  • fairgbm==0.9.13
  • macOS
  • arm64 (M1 Max)

Additional Comments

It also does not work when using a linux python3.9 environment via docker containers (on macOS).
With this setup, the following (different but related to the same file) error is shown:

OSError: /usr/local/lib/python3.9/site-packages/fairgbm/lib_lightgbm.so: cannot open shared object file: No such file or directory

Enable exporting objective function settings to a file

Summary

  • Currently if we load a model from a file it will look for the "constrained_cross_entropy" objective function and will not find it.
  • We need to export the objective function settings to a file (serialize the objective settings) in order for the model to load up all necessary definitions.
  • This makes it so currently we can't resume training of a model loaded from a file (but we can get predictions from that model).

Related to #36

Implement data-point weighting in FairGBM training

Summary

Currently, the function ConstrainedObjectiveFunction::GetLagrangianGradientsWRTMultipliers ignores the this->weights_ member.

TODO: implement sample weighting -- although this may interfere with the constrained optimization process.

FairGBM versioning

We need to rethink and implement a versioning system for FairGBM.

Currently we're still using the version from MSFT LightGBM as of the day the code-bases diverged (3.2.1.99).

Proposed solutions:

  • Start numbering from zero (or 1.0.0);
    • Clearly state in the release notes for each new release/tag the corresponding MSFT LightGBM version;
  • Accompany each new version (see VERSION.txt file) with a git tag and a GitHub release;
    • Plus, a corresponding release to PyPI;

Change python package name to "fairgbm"

Summary

We need to change the Python package's name from lightgbm to fairgbm (see setup.py, line 335).

TODO: check what are the implications of this change.

Create FairGBM aliases for client-facing Python classes (e.g., `FairGBMClassifier`)

Currently the lightgbm scikit-learn API has a LGBMClassifier.

We would like to have an equivalent FairGBMClassifier that enforces the use of fairness (constraints).

This would simply be a subclass of LGBMClassifier that hard-codes the objective function to the constrained_cross_entropy (and perhaps some other kwargs should be enforced as well).

Also, the lightgbm package provides another, non-sklearn API; can we create a FairGBM alias for that API as well?
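A rough sketch of what such an sklearn-facing alias could look like (a sketch only, assuming the fork still exposes LGBMClassifier under the lightgbm name as discussed in the package-renaming issue above; the actual released FairGBMClassifier may be implemented differently):

from lightgbm import LGBMClassifier

class FairGBMClassifier(LGBMClassifier):
    """Sketch: sklearn-style alias that hard-codes the constrained objective."""

    def __init__(self, constraint_type="FPR,FNR", multiplier_learning_rate=0.1, **kwargs):
        # Drop any user-provided objective and force the constrained loss;
        # all other LightGBM kwargs pass through unchanged.
        kwargs.pop("objective", None)
        super().__init__(objective="constrained_cross_entropy", **kwargs)
        self.constraint_type = constraint_type
        self.multiplier_learning_rate = multiplier_learning_rate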

Keep only the latest Lagrange multipliers

Summary

  • We currently keep all Lagrange multipliers from all iterations;
  • I don't really recall why this was done in the first place, but we can probably keep only the latest iteration now 🙂

[refactor] Cannot have code changes in `config_auto.cpp` - will be overridden

I just found the file helpers/parameter_generator.py, which states at the beginning:

This script generates LightGBM/src/io/config_auto.cpp file
with list of all parameters, aliases table and other routines
along with parameters description in LightGBM/docs/Parameters.rst file
from the information in LightGBM/include/LightGBM/config.h file.

meaning that all the changes currently made to config_auto.cpp in fairgbm need to be moved out of there and into the original LightGBM/include/LightGBM/config.h file.

Improve FairGBM multi-threading

According to our perf and valgrind benchmarks, a large percentage of CPU time is spent on synchronization of separate threads during training.

The net outcome of multi-threading is still positive; however, when using OMP_NUM_THREADS=4 our code consistently uses only 2 threads, seemingly unable to fully parallelize.

Major refactor of objective_function.h to split FairGBM and LightGBM logic

EDIT: unlike #9, this issue is focused on splitting the ObjectiveFunction and ConstrainedObjectiveFunction logic, so as to separate all of our FairGBM-induced changes to the codebase.

Issue #9 is more focused on optimization while this one is more focused on code maintainability.


The file include/LightGBM/objective_function.h must be heavily refactored.

First, the file should be split, so as to have a constrained_objective_function.h + .cpp file, and have a new objective_functions.h file that includes the original objective_function.h, as well as the new one. After that is done, the whole codebase should include objective_functions.h instead of the original one.

After that refactoring, there are plenty of optimization opportunities in the ConstrainedObjectiveFunction class, from making runtime code indirection cheaper to re-implementing the code with cache-friendlier data structures, so as to reduce the +80% train time compared to vanilla LightGBM training.

Constrained Recall Objective has weird interaction with LightGBM early stopping criteria

Description

  • When optimizing for Recall (minimizing FNR), only label positive samples are considered for computing the loss or its gradient;
  • However, passing a gradient of zero for all label negatives leads to weird behavior in the GBM::Train function;
  • So, for now, we're scaling down the gradient of all label negatives by multiplying them with a tiny positive number: see the label_negative_weight in ConstrainedRecallObjective::GetGradients;
    • This shouldn't be needed, but seems to temporarily fix the issue with no unintended consequences (as the gradient flowing is very small);

Reproducible example

  1. Omit the else clause in ConstrainedRecallObjective::GetGradients, which deals with label negative samples, and in theory should not be needed for optimizing for recall;
  2. Compile and run, and observe weird "-inf split" messages, which can lead to training stopping too early;
