sb-ai-lab / hypex


Fast and customizable framework for automatic and quick Causal Inference in Python

Home Page: https://developers.sber.ru/portal/products/lightautoml

License: Apache License 2.0

Python 100.00%

ab-testing causal-inference causalinference data-science faiss kaggle matching python statistics

hypex's Introduction

HypEx: Advanced Causal Inference and AB Testing Toolkit


Introduction

HypEx (Hypotheses and Experiments) is a comprehensive library crafted to streamline the causal inference and AB testing processes in data analytics. Developed for efficiency and effectiveness, HypEx employs Rubin's Causal Model (RCM) for matching closely related pairs, ensuring equitable group comparisons when estimating treatment effects.

Boasting a fully automated pipeline, HypEx adeptly calculates the Average Treatment Effect (ATE), Average Treatment Effect on the Treated (ATT), and Average Treatment Effect on the Control (ATC). It offers a standardized interface for executing these estimations, providing insights into the impact of interventions across various population subgroups.

Beyond causal inference, HypEx is equipped with robust AB testing tools, including Difference-in-Differences (Diff-in-Diff) and CUPED methods, to rigorously test hypotheses and validate experimental results.

Features

Faiss KNN Matching: Utilizes Faiss for efficient and precise nearest neighbor searches, aligning with RCM for optimal pair matching.
Data Filters: Built-in outlier and Spearman filters ensure data quality for matching.
Result Validation: Offers multiple validation methods, including random treatment, feature, and subset validations.
Data Tests: Incorporates SMD, KS, PSI, and Repeats tests to affirm the robustness of effect estimations.
Feature Selection: Employs LGBM and Catboost feature selection to pinpoint the most impactful features for causal analysis.
AB Testing Suite: Features a suite of AB testing tools for comprehensive hypothesis evaluation.
Stratification support: Stratify groups for nuanced analysis.
Weights support: Assign custom weights to features to tune the matching precision to your specific research needs.

Warnings

Some functions in HypEx can facilitate solving specific auxiliary tasks but cannot automate decisions on experiment design. Below, we will discuss features that are implemented in HypEx but do not automate the design of experiments.

Feature Selection

Feature selection models the significance of features for the accuracy of target approximation. However, it does not rule out the possibility of overlooked features, the complex impact of features on target description, or the significance of features from a business logic perspective. The algorithm will not function correctly if there are data leaks.

Points to consider when selecting features:

Data leaks - these should not be present.
Influence on treatment distribution - features should not affect the treatment distribution.
The target should be describable by features.
All features significantly affecting the target should be included.
The business rationale of features.
The feature selection function can help with these tasks, but it does not solve them, does not justify the resulting selection, and does not absolve the user of responsibility for it.

Link to ReadTheDocs

Random Treatment

The Random Treatment algorithm randomly shuffles the actual treatment. After shuffling, the estimated effect of the treatment on the target is expected to be close to 0.

This method is not a sufficiently accurate marker of a successful experiment.

Link to ReadTheDocs
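
The idea behind the random treatment check can be illustrated with a minimal sketch. It does not use the HypEx API: the helper names (mean_difference, random_treatment_effects) and the synthetic data are assumptions made for illustration, and a naive difference in means stands in for the matching-based estimate.

```python
import random
import statistics


def mean_difference(outcomes, treatment):
    """Naive effect estimate: mean outcome of treated minus mean of control."""
    treated = [y for y, t in zip(outcomes, treatment) if t == 1]
    control = [y for y, t in zip(outcomes, treatment) if t == 0]
    return statistics.mean(treated) - statistics.mean(control)


def random_treatment_effects(outcomes, treatment, n_iter=200, seed=0):
    """Re-estimate the effect after shuffling the treatment labels n_iter times."""
    rng = random.Random(seed)
    shuffled = list(treatment)
    effects = []
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        effects.append(mean_difference(outcomes, shuffled))
    return effects


# Synthetic data (assumption): treated units get a true uplift of 5.0
rng = random.Random(42)
treatment = [i % 2 for i in range(1000)]
outcomes = [rng.gauss(100, 10) + 5.0 * t for t in treatment]

observed = mean_difference(outcomes, treatment)
placebo = random_treatment_effects(outcomes, treatment)
placebo_mean = statistics.mean(placebo)
```

With shuffled labels the average placebo effect should sit near zero while the observed effect stays near the true uplift. As noted above, passing this check alone does not show that the experiment succeeded.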

Installation

pip install -U hypex

Quick start

Explore usage examples and tutorials here.

Matching example

from hypex import Matcher
from hypex.utils.tutorial_data_creation import create_test_data

# Define your data and parameters
df = create_test_data(rs=42, na_step=45, nan_cols=['age', 'gender'])

info_col = ['user_id']
outcome = 'post_spends'
treatment = 'treat'
model = Matcher(input_data=df, outcome=outcome, treatment=treatment, info_col=info_col)
results, quality_results, df_matched = model.estimate()

AA-test example

from hypex import AATest
from hypex.utils.tutorial_data_creation import create_test_data

data = create_test_data(rs=52, na_step=10, nan_cols=['age', 'gender'])

info_cols = ['user_id', 'signup_month']
target = ['post_spends', 'pre_spends']

experiment = AATest(info_cols=info_cols, target_fields=target)
results = experiment.process(data, iterations=1000)
results.keys()

AB-test example

from hypex import ABTest
from hypex.utils.tutorial_data_creation import create_test_data

data = create_test_data(rs=52, na_step=10, nan_cols=['age', 'gender'])

model = ABTest()
results = model.execute(
    data=data,
    target_field='post_spends',
    target_field_before='pre_spends',
    group_field='group'
)

model.show_beautiful_result()

Documentation

For more detailed information about the library and its features, visit our documentation on ReadTheDocs.

You'll find comprehensive guides and tutorials that will help you get started with HypEx, as well as detailed API documentation for advanced use cases.

Contributions

Join our vibrant community! For guidelines on contributing, reporting issues, or seeking support, please refer to our Contributing Guidelines.

More Information & Resources

Habr (ru) - discover how HypEx is revolutionizing causal inference in various fields.
A/B testing seminar - Seminar in NoML about matching and A/B testing
Matching with HypEx: Simple Guide - Simple matching guide with explanation
Matching with HypEx: Grouping - Matching with grouping guide
HypEx vs Causal Inference and DoWhy - discover why HypEx is the best solution for causal inference
HypEx vs Causal Inference and DoWhy: part 2 - discover why HypEx is the best solution for causal inference

Testing different libraries for the speed of matching

Visit this notebook on Kaggle and evaluate the results yourself.

Group size	32 768	65 536	131 072	262 144	524 288	1 048 576	2 097 152	4 194 304
Causal Inference	46s	169s	None	None	None	None	None	None
DoWhy	9s	19s	40s	77s	159s	312s	615s	1 235s
HypEx with grouping	2s	6s	16s	42s	167s	509s	1 932s	7 248s
HypEx without grouping	2s	7s	21s	101s	273s	982s	3 750s	14 720s

Join Our Community

Have questions or want to discuss HypEx? Join our Telegram chat and connect with the community and the developers.

Conclusion

HypEx stands as an indispensable resource for data analysts and researchers delving into causal inference and AB testing. With its automated capabilities, sophisticated matching techniques, and thorough validation procedures, HypEx is poised to unravel causal relationships in complex datasets with unprecedented speed and precision.


hypex's Issues

[BUG] TQDM Import Error

🐛 Bug Description

Encountered an issue with importing tqdm from tqdm.auto in environments with certain versions of tqdm, leading to an ImportError. This bug affects the usability of the HypEx library in environments where the specific tqdm submodule is not available.

Steps To Reproduce

Set up an environment with a version of tqdm that lacks the tqdm.auto submodule.
Attempt to execute any functionality within the HypEx library that utilizes tqdm for progress bars.
The ImportError is raised, halting execution.

Expected Behavior

The HypEx library should gracefully fall back to a compatible version of tqdm if tqdm.auto is not available, ensuring compatibility across different environments and tqdm versions.

Screenshots

Not applicable

Environment

HypEx Version: 0.0.4
Python Version: 3.10
Operating System: Varies across different OS where the issue was observed (Windows, Linux, macOS).

Additional Context

The issue stems from the diverse environment setups and the variance in tqdm versions, which may or may not include the tqdm.auto submodule. This variability can lead to import errors in certain scenarios, affecting the user experience.

Possible Solution

Implement a try-except block around the import statement for tqdm, first attempting to import from tqdm.auto and falling back to the standard tqdm if the first import fails. This approach provides a more robust solution to handle different versions of tqdm seamlessly.

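try:
    # Prefer tqdm.auto, which picks a notebook or console progress bar
    from tqdm.auto import tqdm
except ImportError:
    try:
        # Fall back to the plain console progress bar
        from tqdm import tqdm
    except ImportError as err:
        raise ImportError("Cannot import tqdm") from err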

Checklist

I have described the bug in detail
I have provided steps to reproduce
I have provided the expected behavior
I have provided screenshots (if applicable)
I have provided my environment details
I have suggested a possible solution (if applicable)

[BUG] Tutorials description in docs

🐛 Bug Description

Problem with tutorials in HypEx ReadTheDocs

Steps To Reproduce

Go to https://hypex.readthedocs.io/en/latest/pages/Tutorials.html

Expected Behavior

I would like to see a list of tutorials and experiments.

Screenshots

Possible Solution

Check the docs folder and fix it

Checklist

I have described the bug in detail
I have provided steps to reproduce
I have provided the expected behavior
I have provided screenshots (if applicable)
I have provided my environment details
I have suggested a possible solution (if applicable)

[FEATURE] Loc, Iloc and adding data

🚀 Feature Proposal

This proposal outlines enhancements to the Dataset class within the HypEx library, aimed at improving usability and functionality.

Motivation

The motivation behind this proposal is to streamline the user experience and functionality of the Dataset class, making it more intuitive and efficient for users performing data operations.

Feature Description

Locker and ILocker: Added the Locker and ILocker classes to Dataset so that loc and iloc can be used as in pandas DataFrames.
loc and iloc: Added the loc and iloc functions to PandasBackend to perform the corresponding pandas operations.
add_column: Added the add_column function for adding a column with a role to the data.

Potential Impacts

add_column makes it possible to add data together with a specific role rather than adding bare data. This helps identify the column in the pipeline.

Checklist

I have clearly described the feature.
I have outlined the motivation for the proposal.
I have provided a detailed description of the feature.
I have discussed potential impacts and alternatives.
I have added any additional context or screenshots.

[BUG] Performance and Hypothesis Selection Issue in limit_distribution Function

🐛 Bug Description

There appears to be a significant speed issue and a problem with selecting hypotheses in the newly introduced limit_distribution function in HypEx version 0.1.0. This function is critical for our limit distribution-based methodologies, and its current performance bottleneck and hypothesis selection inaccuracies could hinder our analysis efficiency and reliability.

Steps To Reproduce

Execute the limit_distribution function with a standard set of parameters.
Observe the time taken for execution and compare it with expected performance metrics.
Note any discrepancies in the selection of hypotheses that do not align with the expected outcomes.

Expected Behavior

The limit_distribution function is expected to perform efficiently, adhering to the projected time complexities. Furthermore, it should accurately select hypotheses based on predefined criteria without any inconsistencies.

Environment

HypEx Version: 0.1.0
Python Version: 3.10
Operating System: macOS

Possible Solution

While the specific solution will depend on a thorough investigation, initial steps could include optimizing the function's algorithm for better performance and reviewing the hypothesis selection logic to ensure it aligns with theoretical expectations.

Checklist

I have described the bug in detail, including the expected versus actual behavior.
I have outlined steps to reproduce the issue.
I have suggested a possible solution for preliminary consideration.

[FEATURE] Expand Python Version Support to Include 3.6, 3.7, and 3.11 in HypEx

🚀 Feature Proposal

Motivation

Currently, our project HypEx supports Python versions 3.8, 3.9, and 3.10. However, we've identified a need to extend our support to include Python versions 3.6, 3.7, and the newly released 3.11. This expansion is essential to enhance the accessibility of HypEx to a broader range of users and environments, some of which still utilize these versions. This initiative is partly driven by community requests and our goal to maximize the usability of our tool.

Feature Description

The proposal is to update our project to be compatible with Python versions 3.6, 3.7, and 3.11. This would involve:

Adjusting the python dependency in pyproject.toml to reflect the broader range (">=3.6, <3.12").
Reviewing and modifying existing codebase and dependencies to ensure compatibility across all these versions.
Implementing thorough testing for each version to guarantee functionality and stability.

Potential Impacts

Performance Considerations: We need to ensure that changes made for compatibility do not adversely affect performance across all supported versions.
Compatibility Issues: Some existing dependencies may not support older or the newest versions of Python, which might require finding alternatives or updating those dependencies.
Dependencies on Other Features or Components: This change may impact other features that rely on version-specific behavior of Python.

Alternatives

One alternative is to only add support for Python 3.6 and 3.7, and wait for more stable releases and wider adoption of Python 3.11. This approach reduces the immediate workload and potential compatibility challenges with the newest Python version.

Additional Context

Feedback from our user community indicates a significant number of potential users on Python 3.6 and 3.7. Additionally, early adopters are beginning to transition to Python 3.11, and supporting it could position HypEx as a forward-compatible tool.

Checklist

I have clearly described the feature.
I have outlined the motivation for the proposal.
I have provided a detailed description of the feature.
I have discussed potential impacts and alternatives.
I have added any additional context or screenshots.

[BUG] Fix for LinAlgError: Matrix is not positive definite

🐛 Bug Description

During the matching process, when dealing with datasets containing a significant number of zeroes, the Cholesky decomposition fails with LinAlgError: Matrix is not positive definite. This happens when the data make a Cholesky decomposition impossible, particularly when some groups produce a covariance matrix that is not positive definite.

Steps To Reproduce

Attempt to perform matching on a dataset with significant zero values or with groups that result in a non-positive definite covariance matrix.
Observe the LinAlgError: Matrix is not positive definite during the Cholesky decomposition step.

Expected Behavior

The system should gracefully handle cases where the Cholesky decomposition cannot be performed due to non-positive definite matrices. The proposed solution includes preprocessing steps to identify and address groups causing the issue, ensuring the decomposition can proceed without errors.

Proposed Solution

Implement a preprocessing step that identifies groups for which a Cholesky decomposition is not feasible. For such groups, apply necessary adjustments, such as removing categories that lead to non-positive definite matrices or adding a small value (epsilon) to the diagonal of the covariance matrix to ensure positivity. This approach aims to maintain the integrity of the matching process while accommodating datasets with challenging characteristics.
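
The epsilon adjustment mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the fix applied in HypEx: the function names, the starting epsilon of 1e-8 and the tenfold growth are all hypothetical choices. The decomposition is written in pure Python so the example stays self-contained; in practice numpy.linalg.cholesky would play that role.

```python
import math


def cholesky(matrix):
    """Return lower-triangular L with L @ L.T == matrix, or raise ValueError."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= 0.0:
                    raise ValueError("Matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def cholesky_with_jitter(matrix, eps=1e-8, max_tries=10):
    """Retry the decomposition, adding a growing epsilon to the diagonal."""
    jitter = 0.0
    for _ in range(max_tries):
        shifted = [
            [value + (jitter if i == j else 0.0) for j, value in enumerate(row)]
            for i, row in enumerate(matrix)
        ]
        try:
            return cholesky(shifted), jitter
        except ValueError:
            # Start at eps (assumed default), then grow tenfold per retry
            jitter = eps if jitter == 0.0 else jitter * 10
    raise ValueError("Matrix is not positive definite even after regularization")


# Covariance of two features where one is constant (all zeros): singular matrix
singular_cov = [[2.0, 0.0], [0.0, 0.0]]
factor, used_jitter = cholesky_with_jitter(singular_cov)
```

Returning the jitter that was actually applied lets the caller log how much the covariance matrix had to be perturbed, which is useful when judging whether the matching result for that group can still be trusted.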

Additional Context

This issue was identified during the matching process, a critical step in analyzing datasets for treatment effects. The proposed fix is crucial for ensuring the robustness and reliability of the matching methodology, especially when dealing with diverse datasets.

Checklist

I have described the bug in detail.
I have provided steps to reproduce.
I have provided the expected behavior.
I have provided screenshots (if applicable).
I have provided my environment details.
I have suggested a possible solution.

[BUG] Compatibility Issue with Python 3.11

🐛 Bug Description

HypEx does not work correctly when running under Python 3.11. Certain functionalities fail to execute as expected, leading to errors or unexpected behavior during runtime.

Steps To Reproduce

Install HypEx on a system running Python 3.11.5.

Expected Behavior

HypEx should run smoothly and all functionalities should work as expected without any errors or issues, similar to its operation under previous versions of Python.

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

HypEx Version: [e.g. 0.1.0]
Python Version: [3.11.5]
Operating System: Mac OS

Additional Context

This issue was identified after updating the Python environment to version 3.11. Previous versions of Python (e.g., 3.8, 3.9) did not exhibit this problem.

Possible Solution

A preliminary investigation suggests that the issue may be related to changes or deprecations in Python 3.11 that affect HypEx's dependencies or its internal operations. Further investigation and compatibility adjustments are likely required.

Checklist

I have described the bug in detail.
I have provided steps to reproduce.
I have provided the expected behavior.
I have provided screenshots (if applicable).
I have provided my environment details.
I have suggested a possible solution (if applicable).

[BUG] Created random data is half the given size

🐛 Bug Description

The create_test_data function accepts the size of the dataset to create, but the dataset it actually produces is half the requested size.

Steps To Reproduce

Go to examples/tutorials
Click on Tuturial_1_Matching
Scroll down to third code cell
See error

Expected Behavior

It is expected that the size of the created dataset will be the same as that specified in the argument.

Screenshots

Not needed.

Environment

HypEx Version: latest
Python Version: 3.8 and latest
Operating System: iOS, Windows, Linux

Possible Solution

Delete row with filters after row i = 3

Checklist

I have described the bug in detail
I have provided steps to reproduce
I have provided the expected behavior
I have provided screenshots (if applicable)
I have provided my environment details
I have suggested a possible solution (if applicable)

[FEATURE] Add Warnings for Alpha Version Functionalities

🚀 Feature Proposal

Motivation

With the introduction of new functionalities in HypEx, including multitarget matching, feature selection in matching, and validation in matching, there's a need to inform users about the alpha version status of these features. Users should be aware of the potential limitations and considerations when using these functionalities.

Feature Description

Implement warnings in the HypEx documentation and possibly within the code execution process to alert users about the experimental nature of certain features. These warnings should cover:

Feature Selection: Highlight the importance of considering leaks, the influence on treatment distribution, the need for features to describe the target adequately, the inclusion of all significantly influencing features, and the business logic behind feature selection.
Multitarget Matching: Clarify that multitarget analysis should ensure the independence of targets from each other and recommend conducting separate experiments for each target with its own set of features.
Random Treatment and Random Feature: Explain that these methods, while useful, may not provide definitive evidence of a successful experiment and should be interpreted with caution.
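
One way the runtime part of this proposal could look is a decorator built on Python's standard warnings module. The sketch below is hypothetical: alpha_feature and random_treatment_check are illustrative names, not existing HypEx functions, and the message wording is an assumption.

```python
import functools
import warnings


def alpha_feature(note):
    """Decorator that warns on every call that the wrapped function is experimental."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is an alpha feature: {note}",
                category=UserWarning,
                stacklevel=2,  # point the warning at the caller, not the wrapper
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@alpha_feature("shuffled-treatment checks are not proof of a successful experiment")
def random_treatment_check(effect):
    # Placeholder body for illustration; a real check would re-run the estimation.
    return abs(effect) < 0.05


# Capture the warning to show that it is emitted on each call
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    passed = random_treatment_check(0.01)
```

Because the message goes through the warnings machinery, users who have read the documentation can silence it with a standard warnings filter instead of a library-specific flag.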

Potential Impacts

Performance considerations: Minimal impact, primarily affecting documentation and possibly adding runtime warnings.
Compatibility issues: None expected, as this proposal involves documentation and warning messages.
Dependencies on other features or components: None beyond the existing functionalities of HypEx.

Alternatives

An alternative could be extensive documentation without runtime warnings, relying on users to thoroughly review the documentation before using these features. However, direct warnings may more effectively convey the experimental status and considerations.

Additional Context

Include specific examples and scenarios where users should be particularly cautious or consider additional validation steps. This context can help users better understand the rationale behind the warnings and how to apply the features effectively.

Checklist

I have clearly described the feature.
I have outlined the motivation for the proposal.
I have provided a detailed description of the feature.
I have discussed potential impacts and alternatives.
I have added any additional context or screenshots.

[FEATURE] Add Imbalanced Sample Size Calculation

🚀 Feature Request

Motivation

In experimental design, especially in cases where the control and test groups are expected to be of different sizes, it's crucial to accurately calculate the necessary sample size for each group to detect a specified effect size with desired power and significance level. This becomes even more critical in studies with imbalanced group sizes, where traditional sample size calculations might not be directly applicable.

Feature Description

Introduce a new function, calc_imbalanced_sample_size, that calculates the necessary sample size for control and test groups based on the expected mean outcome of the test group, the proportion of the control group in the total population, and the desired power of the test. The function should:

Take into account the current and expected conversion rates to calculate the Cohen's h effect size for binary outcome data or use the mean difference for continuous outcomes.
Use the calculated effect size along with the specified proportion and power to determine the necessary sample sizes for both control and test groups.
Return a tuple containing the calculated sample sizes for the control and test groups, ensuring that the study is adequately powered to detect the expected effect.

Expected Behavior

Users will input the observed data from the control group, the expected mean outcome in the test group, the proportion of the control group, and the desired power.
The function will return the minimum required sample sizes for both control and test groups to achieve the desired power, taking into account the imbalanced nature of the groups.
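
For the binary-outcome case, the calculation could be sketched as below. It assumes a two-sided z-test on Cohen's h and uses only the standard library. The name imbalanced_sample_size, its parameters and the default alpha of 0.05 are assumptions for illustration; they do not describe the signature of the proposed calc_imbalanced_sample_size, which takes observed control data rather than rates.

```python
import math
from statistics import NormalDist


def cohens_h(p_control, p_test):
    """Cohen's h effect size for two proportions (arcsine transformation)."""
    return 2 * math.asin(math.sqrt(p_test)) - 2 * math.asin(math.sqrt(p_control))


def imbalanced_sample_size(p_control, p_test, control_share, power=0.8, alpha=0.05):
    """Return (n_control, n_test) for a two-sided z-test on Cohen's h.

    control_share is the fraction of all units assigned to control.
    """
    h = cohens_h(p_control, p_test)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Var(h_hat) ~ 1/n_control + 1/n_test = 1 / (N * share * (1 - share))
    total = (z_alpha + z_power) ** 2 / (h ** 2 * control_share * (1 - control_share))
    n_control = math.ceil(total * control_share)
    n_test = math.ceil(total * (1 - control_share))
    return n_control, n_test


# Conversion rates of 10% and 12% are made-up example inputs
balanced = imbalanced_sample_size(0.10, 0.12, control_share=0.5)
skewed = imbalanced_sample_size(0.10, 0.12, control_share=0.2)
```

The term share * (1 - share) peaks at an even split, so any imbalance raises the total number of units required for the same power.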

Additional Context

This feature is particularly useful for researchers and practitioners in fields such as marketing, clinical trials, and social sciences, where experiments often involve groups of different sizes and the detection of small but meaningful effects is critical.

Checklist

I have clearly described the feature.
I have provided a detailed description of the feature.
I have discussed the motivation for the feature request.
I have added any additional context or screenshots.

[FEATURE] Add Bias Contribution Assessment to ATT Calculation

🚀 Feature Request

Motivation

While analyzing the effectiveness of treatments in observational studies, it's essential to understand not just the Average Treatment Effect on the Treated (ATT) but also how much bias contributes to this estimate. This understanding can significantly enhance the reliability of the conclusions drawn from the data analysis.

Feature Description

Introduce a new feature that calculates and logs the contribution of bias to the ATT. This feature will involve:

Adding a new attribute, delta_t, to store the bias contribution as a percentage of the ATT.
Implementing a method to calculate delta_t by comparing the mean bias against the mean outcome of the treated group.
Enhancing the logging functionality to include delta_t information, providing insights into the influence of bias on the ATT results.

Expected Behavior

Upon completion of the matching process, the system should calculate the bias contribution to ATT and store this information in delta_t.
The log should include a statement like "The entry of bias into the ATT is X%", where X is the calculated value of delta_t.

Additional Context

Understanding the bias contribution to ATT is crucial for evaluating the quality of matching in observational studies. This feature will provide users with a metric to assess the potential impact of bias on their study outcomes, aiding in more informed decision-making.

Checklist

I have clearly described the feature.
I have provided a detailed description of the feature.
I have discussed the motivation for the feature request.
I have added any additional context or screenshots.

[FEATURE] Feature Request for AATester Enhancements

🚀 Feature Proposal

This proposal outlines enhancements to the AATester class within the HypEx library, aimed at improving usability and functionality.

Motivation

The motivation behind this proposal is to streamline the user experience and functionality of the AATester class, making it more intuitive and efficient for users performing A/B testing analysis. These changes are proposed in response to feedback and to accommodate common user workflows more naturally.

Feature Description

Integration of Calculation Functions: Functions calc_mde, calc_sample_size, and calc_power are now methods of the AATester class. This change brings a more cohesive structure, reducing the need for external imports and making the class more self-contained.
Alternative Input for Calculation Functions: The functions calc_mde and calc_sample_size have been updated to allow alternative inputs through a pre-split DataFrame and a target field identifier. This update provides a more intuitive input method following group segmentation.
Automated Testing: An automated test has been added to verify the functionality of calc_mde and calc_sample_size functions, ensuring their reliability and correctness.
Enhanced Tutorial: The tutorial for the AA test has been expanded to include a demonstration section for calc_mde and calc_sample_size functions, providing users with practical examples of how to utilize these enhancements.
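
To illustrate the kind of calculation that takes pre-split groups as input, here is a generic sketch of a minimal detectable effect (MDE) for a two-sided z-test on the difference of means. It is not the AATester implementation: the function name, its signature and the toy values are assumptions.

```python
import math
import statistics
from statistics import NormalDist


def minimal_detectable_effect(control, test, power=0.8, alpha=0.05):
    """Absolute MDE for a two-sided z-test on the difference of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    standard_error = math.sqrt(
        statistics.variance(control) / len(control)
        + statistics.variance(test) / len(test)
    )
    return (z_alpha + z_power) * standard_error


# Pre-split groups, e.g. the target column after an A/A split (toy values)
control = [10.0, 12.0, 9.0, 11.0, 13.0, 10.0, 12.0, 11.0]
test = [11.0, 10.0, 12.0, 9.0, 13.0, 12.0, 10.0, 11.0]

mde_small = minimal_detectable_effect(control, test)
mde_large = minimal_detectable_effect(control * 4, test * 4)
```

Quadrupling the group sizes shrinks the standard error, so mde_large comes out smaller than mde_small.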

Potential Impacts

Performance Considerations: The integration of calculation functions into the AATester class might slightly alter performance metrics, which needs to be evaluated.
Compatibility Issues: Existing scripts using older versions of the AATester class may require adjustments to accommodate these changes.
Dependencies: These enhancements do not introduce new dependencies but may require updates to documentation and tutorials.

Alternatives

An alternative considered was to keep the calculation functions separate from the AATester class. However, integrating these functions directly into the class was deemed more beneficial for user experience and code organization.

Additional Context

The proposed enhancements aim to align the AATester class with user expectations and common workflows, making the HypEx library more accessible and user-friendly for A/B testing analysis.

Checklist

I have clearly described the feature.
I have outlined the motivation for the proposal.
I have provided a detailed description of the feature.
I have discussed potential impacts and alternatives.
I have added any additional context or screenshots.

[FEATURE] Add New Jupyter Notebook Example on Predicting Model Effect

🚀 Feature Proposal

I propose adding a new Jupyter notebook example to the HypEx documentation and example section. This notebook would demonstrate how to predict the financial effect from a model, providing a practical guide for users to understand the impact of their models on financial metrics.

Motivation

Our users often need to quantify the financial impact of their models to justify the implementation and further investment into model development. Providing a clear, step-by-step example of predicting model effects can significantly enhance user understanding and application of HypEx in real-world scenarios.

Feature Description

The new notebook will include:

An introduction to the concept of model effect prediction in the context of financial metrics.
A step-by-step guide on creating a synthetic dataset with a known effect size.
Instructions on fitting and estimating a random model using HypEx.
Demonstrations of preprocessing data and conducting AA and AB tests to validate the predicted effects.
Code snippets, explanations, and visualizations to aid understanding.

This example will utilize the AATest and ABTest classes from HypEx, showcasing their application in a practical experiment. The dataset creation process, model fitting, prediction, and testing phases will be covered comprehensively.

Potential Impacts

This addition is expected to:

Enhance the documentation and examples provided with HypEx, making the library more accessible to new users.
Serve as an educational tool for understanding the application of HypEx in financial effect prediction.
Encourage the adoption of HypEx by demonstrating its practical utility in model effect quantification.

Alternatives

While users can independently research and implement model effect prediction, having a dedicated example within HypEx significantly lowers the barrier to entry and ensures consistent application of best practices.

Additional Context

The proposed example will address common challenges and questions related to model effect prediction, providing a valuable resource for both new and experienced users of HypEx.

Checklist

I have clearly described the feature.
I have outlined the motivation for the proposal.
I have provided a detailed description of the feature.
I have discussed potential impacts and alternatives.
I have added any additional context or screenshots.

[FEATURE] Enhanced Data Validation for Emissions in HypEx

🚀 Feature Proposal: Enhanced Data Validation for Emissions in HypEx

Motivation

In the current HypEx framework, handling data emissions—extreme outliers that can significantly skew analysis results—is manual and prone to inconsistencies. This can lead to biased insights, especially in cases where the treatment effect is subtle. Integrating an automated, robust mechanism for identifying and managing emissions within HypEx will streamline analyses, improve accuracy, and offer users more control over data preprocessing steps.

Feature Description

This proposal suggests the implementation of an "Emissions Handling" feature within HypEx. The feature will automatically identify and manage data emissions based on customizable thresholds, such as the 1st and 99th percentiles. It will offer options to either exclude these emissions from analyses or adjust them based on predefined rules, enhancing the framework's flexibility and the reliability of its outputs.
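
A percentile-based version of this idea might look like the sketch below. The function names and the "exclude" and "clip" modes are hypothetical; the proposal does not specify the interface or the adjustment rule, so clipping to the bounds is an assumed stand-in for "adjust".

```python
import statistics


def percentile_bounds(values, lower_pct=1, upper_pct=99):
    """Return the (lower, upper) percentile cut points of values."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def handle_emissions(values, mode="exclude", lower_pct=1, upper_pct=99):
    """Drop ("exclude") or clip ("clip") values outside the percentile bounds."""
    low, high = percentile_bounds(values, lower_pct, upper_pct)
    if mode == "exclude":
        return [v for v in values if low <= v <= high]
    if mode == "clip":
        return [min(max(v, low), high) for v in values]
    raise ValueError(f"Unknown mode: {mode!r}")


spends = [float(v) for v in range(1, 200)] + [10_000.0]  # one extreme value
kept = handle_emissions(spends, mode="exclude")
clipped = handle_emissions(spends, mode="clip")
```

Excluding changes the sample size, while clipping keeps every row and only caps the extreme values. Which is preferable depends on whether downstream steps need the groups to keep their original sizes.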

Potential Impacts

Performance Considerations: Implementing this feature may introduce additional preprocessing steps, potentially impacting performance. Optimizations and parallel processing techniques should be considered.
Compatibility Issues: The feature should be designed to be compatible with existing HypEx data structures and workflows, ensuring that it enhances rather than disrupts users' current processes.
Dependencies: This feature may rely on statistical libraries for calculating percentiles and identifying outliers. Dependencies should be carefully managed to avoid conflicts.

Alternatives

Manual Emissions Management: Users continue handling emissions manually outside of HypEx. This approach maintains current workflows but misses the opportunity for improvement.
Third-party Tools Integration: Instead of building this feature internally, HypEx could offer integrations with external tools specialized in data cleaning. However, this could complicate the user experience and increase dependency on external software.

Additional Context

Adding this feature will directly address user feedback regarding the challenges of managing emissions in large datasets. By automating this process, HypEx can ensure more consistent, reliable analysis outcomes, particularly in sensitive applications such as financial forecasting and medical research, where outliers can have a disproportionate impact.

Checklist

I have clearly described the feature.
I have outlined the motivation for the proposal.
I have provided a detailed description of the feature.
I have discussed potential impacts and alternatives.
I have added any additional context or screenshots.

[FEATURE] Emissions Option for Matching Result Validation

🚀 Feature Proposal: Emissions Option for Matching Result Validation

Motivation

In matching analyses, it is important to understand how extreme values (outliers) affect the result. HypEx currently has no direct way to compare results before and after outlier removal. The proposed "emissions" option fills this gap by letting analysts measure how strongly outliers influence the matching results, which supports more robust and reliable conclusions.

Feature Description

The "emissions" feature would be a new option in the Matching Result Validation process in HypEx. It provides a comparison of matching results before and after the removal of outliers. The core functionality includes:

Calculation of matching results with all data points, including outliers.
Recalculation of matching results after removing outliers.
Generation of a comparative report or metric that highlights the differences in results due to outliers.

This feature would be particularly useful in scenarios where data integrity and accuracy are paramount, and outliers may significantly skew the results.
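The three steps listed above can be sketched in a few lines. In this illustration a plain difference in group means stands in for the matching estimate, and the names naive_effect and outlier_sensitivity, the column names, and the percentile cut-offs are all assumptions made for the example rather than part of HypEx.

```python
import numpy as np
import pandas as pd


def naive_effect(df, treatment="treat", outcome="outcome"):
    """Difference in mean outcome between treated and control rows.

    A stand-in for a real matching estimate, used only for illustration.
    """
    treated = df.loc[df[treatment] == 1, outcome].mean()
    control = df.loc[df[treatment] == 0, outcome].mean()
    return treated - control


def outlier_sensitivity(df, outcome="outcome", lower_pct=1.0, upper_pct=99.0):
    """Estimate the effect with and without outliers and report the change."""
    low, high = np.percentile(df[outcome], [lower_pct, upper_pct])
    filtered = df[df[outcome].between(low, high)]
    before = naive_effect(df, outcome=outcome)
    after = naive_effect(filtered, outcome=outcome)
    return {
        "with_outliers": before,
        "without_outliers": after,
        "difference": before - after,
        "rows_removed": len(df) - len(filtered),
    }


# Toy data: one extreme outcome in the treated group.
sample = pd.DataFrame({
    "treat": [1] * 50 + [0] * 50,
    "outcome": [12.0] * 49 + [1000.0] + [10.0] * 50,
})
report = outlier_sensitivity(sample)
```

Here a single extreme treated row changes the estimate considerably, which is the kind of gap the comparative report is meant to expose.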

Potential Impacts

Performance Considerations: The additional calculations may slightly increase the processing time, especially for large datasets.
Compatibility Issues: Should be backward compatible; however, it must be ensured that it integrates seamlessly with existing matching algorithms and validation processes.
Dependencies: Relies on the existing outlier detection and removal mechanisms within HypEx.

Alternatives

An alternative would be to provide enhanced reporting and visualization tools that let users inspect the impact of outliers manually. However, this would be more time-consuming than an automated "emissions" feature.

Additional Context

This feature is in response to the need for more nuanced data analysis tools within HypEx, especially in situations where outliers can significantly alter the outcome of data matching processes.

Checklist

I have clearly described the feature.
I have outlined the motivation for the proposal.
I have provided a detailed description of the feature.
I have discussed potential impacts and alternatives.
I have added any additional context or screenshots.

[FEATURE] Extend 'group_col' in Matcher to Support Multi-Dimensional Stratification

🚀 Feature Proposal

Motivation

Currently, in the Matcher class of HypEx, the group_col parameter accepts only a single string, limiting the stratification to one dimension. This constraint can be restrictive when multi-dimensional stratification is needed, such as combining gender and location for more nuanced analysis. This feature request is motivated by the need to enhance the functionality of Matcher to accept a list of columns for multi-dimensional stratification.

Feature Description

The proposed feature involves extending the group_col parameter of the Matcher class to accept a list of strings, enabling multi-dimensional stratification. This enhancement will allow users to specify multiple columns for strict stratification, facilitating more complex matching scenarios like "Boy from Moscow", "Girl from New York", etc. The feature will internally handle the concatenation of specified features, thus streamlining the process and eliminating the need for manual preprocessing.
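The internal concatenation described above could look roughly like this. It is a sketch in pandas under stated assumptions: the helper name build_group_key and the "_" separator are illustrative and are not taken from the Matcher implementation.

```python
import pandas as pd


def build_group_key(df, group_col, sep="_"):
    """Hypothetical helper: collapse one or more columns into one stratum label."""
    # Accept both the existing single-string form and the proposed list form.
    columns = [group_col] if isinstance(group_col, str) else list(group_col)
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    # Cast to string so numeric and categorical columns concatenate safely.
    return df[columns].astype(str).agg(sep.join, axis=1)


people = pd.DataFrame({
    "gender": ["boy", "girl", "boy"],
    "city": ["Moscow", "New York", "Moscow"],
})
key = build_group_key(people, ["gender", "city"])
single = build_group_key(people, "gender")
```

Rows that share the same composite label, such as the two "boy_Moscow" rows here, fall into the same stratum, so existing single-column logic can be reused on the derived key.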

Potential Impacts

Performance: Handling multiple columns for stratification might slightly increase the computational complexity.
Compatibility: This feature should be backward compatible, allowing both single strings and lists.
Dependencies: The implementation will rely on existing data processing components within HypEx, ensuring seamless integration.

Alternatives

Currently, the alternative is manually concatenating features to create a composite column for stratification. While this works, it adds an extra preprocessing step for the user.

Additional Context

This feature will significantly enhance the usability of HypEx for complex stratification needs in causal inference and A/B testing scenarios.

Checklist

I have clearly described the feature.
I have outlined the motivation for the proposal.
I have provided a detailed description of the feature.
I have discussed potential impacts and alternatives.
I have added any additional context or screenshots.

[FEATURE] Incorporate Kaggle Notebooks into HypEx Documentation

🚀 Feature Proposal

Motivation

To enhance HypEx's documentation and provide users with practical examples and comprehensive comparisons, it is proposed to incorporate Kaggle notebooks into the HypEx documentation and examples section. These notebooks will showcase HypEx's capabilities in real-world scenarios and compare its performance with similar libraries.

Feature Description

Kaggle Notebooks Integration: Include links to Kaggle notebooks that demonstrate the use of HypEx in various data analysis tasks.
Comparison Tables: Create tables that compare HypEx with other libraries based on performance metrics, such as execution time and memory usage, across different datasets.
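One way to collect the execution time and memory figures for such a table is sketched below using only the Python standard library. The benchmark helper and the placeholder workload are assumptions made for illustration; a real comparison would call the libraries under test instead.

```python
import time
import tracemalloc


def benchmark(func, *args, repeats=3, **kwargs):
    """Return the best wall-clock time (s) and peak traced memory (bytes)."""
    best_time = float("inf")
    peak_memory = 0
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        best_time = min(best_time, elapsed)
        peak_memory = max(peak_memory, peak)
    return {"seconds": best_time, "peak_bytes": peak_memory}


def placeholder_workload(n):
    # Stand-in for a library call being compared.
    return sorted(range(n, 0, -1))


row = benchmark(placeholder_workload, 100_000)
print(f"placeholder | {row['seconds']:.4f} s | {row['peak_bytes'] / 1e6:.2f} MB")
```

Note that tracemalloc only tracks allocations made through Python's memory allocator and adds overhead to the timing, so memory used by native extensions may be under-reported and time and memory are better measured in separate runs for published figures.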

Potential Impacts

User Engagement: By providing interactive examples, we can increase user engagement and ease the learning curve for new users.
Performance Transparency: Comparison tables will offer transparency regarding HypEx's performance, helping users make informed decisions when choosing libraries for their projects.

Alternatives

If Kaggle is not suitable for all target audiences, similar notebooks and comparisons could be hosted on an alternative platform, such as GitHub or a project-specific website.

Additional Context

These notebooks can also serve as a valuable benchmarking tool for future development, ensuring that HypEx continues to improve and remains competitive.

Checklist

Add Kaggle notebooks links to the documentation and examples section.
Create comparison tables showcasing HypEx against other libraries.

[BUG] group_col does not support list input as expected

🐛 Bug Description

HypEx version 0.1.0 was intended to add support for passing a list of strings to the group_col parameter, enabling stratification across multiple columns. This does not work as expected: when group_col is given a list, the operation fails with errors instead of performing multi-column stratification.

Steps To Reproduce

Update HypEx to version 0.1.0.
Attempt to use the group_col parameter with a list of column names for stratification.
Observe that the operation fails, indicating that the feature does not work as intended.

Expected Behavior

The group_col parameter should accept a list of strings without any errors, allowing users to stratify data based on multiple columns seamlessly.

Environment

HypEx Version: 0.1.0
Python Version: 3.10
Operating System: macOS/Linux

Additional Context

This issue undermines the usability of the newly introduced feature for multi-column stratification, which is crucial for conducting nuanced analyses that depend on grouping data across multiple dimensions.

Possible Solution

A thorough review of the changes made to support list inputs for group_col is needed to identify and resolve the underlying cause of this issue. A potential starting point could be to ensure that all internal functions that interact with group_col are updated to handle list inputs correctly.
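One common way to make every internal function handle both input forms is to normalize group_col once at the entry point, so downstream code only ever sees a list. The sketch below illustrates that idea; normalize_group_col is a hypothetical helper, not an existing HypEx function, and the actual cause of the bug is not known from this report.

```python
from typing import List, Optional, Sequence, Union


def normalize_group_col(
    group_col: Union[str, Sequence[str], None],
) -> Optional[List[str]]:
    """Hypothetical helper: return group_col as a list of column names."""
    if group_col is None:
        return None
    if isinstance(group_col, str):
        # A bare string is one column name, not a sequence of characters.
        return [group_col]
    columns = list(group_col)
    if not columns or not all(isinstance(c, str) for c in columns):
        raise TypeError("group_col must be a string or a non-empty list of strings")
    return columns


print(normalize_group_col("gender"))
print(normalize_group_col(["gender", "city"]))
```

Checking for str before treating the value as a sequence matters because a string is itself iterable and would otherwise be split into single characters.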

Checklist

I have described the bug in detail.
I have provided steps to reproduce.
I have provided the expected behavior.
I have provided screenshots (if applicable).
I have provided my environment details.
I have suggested a possible solution (if applicable).

[FEATURE] Add Feature Importance Tool for Matching in HypEx

🚀 Feature Proposal: Implement Feature Importance for Matching Task in HypEx

Motivation

In earlier versions, when HypEx was an add-on for LAMA (LightAutoML), users had access to LAMA's feature importance tools for matching tasks. Since becoming a standalone tool, HypEx has lacked this capability. Reintroducing feature importance would let users see which features are most influential in their matching tasks, which matters for data analysis, interpretation, and model optimization.

Feature Description

The proposal involves developing and integrating a feature importance mechanism specifically tailored for matching tasks within HypEx. This feature would ideally:

Provide users with insights into which features contribute most to the success of a matching task.
Offer compatibility with the current matching algorithms in HypEx.
Include clear documentation and examples to guide users in utilizing this feature.

Potential Impacts

Enhanced User Experience: This feature would significantly boost the utility and user-friendliness of HypEx.
Performance Considerations: The implementation needs to be optimized to ensure that it does not adversely impact the performance of the matching tasks.
Compatibility: Ensure that the feature importance tool is compatible with existing and future matching algorithms in HypEx.

Alternatives

One alternative could be to provide a plug-in or interface within HypEx that allows integration with external tools offering feature importance analysis. This would, however, require users to rely on an additional tool.

Additional Context

Users previously benefited from the feature importance analysis provided by LAMA. Bringing this capability into HypEx itself would recover that functionality and increase the standalone value of HypEx.

Checklist

I have clearly described the feature.
I have outlined the motivation for the proposal.
I have provided a detailed description of the feature.
I have discussed potential impacts and alternatives.
I have added any additional context or screenshots.