sb-ai-lab / hypex
Fast and customizable framework for automatic and quick Causal Inference in Python
Home Page: https://developers.sber.ru/portal/products/lightautoml
License: Apache License 2.0
While analyzing the effectiveness of treatments in observational studies, it's essential to understand not just the Average Treatment Effect on the Treated (ATT) but also how much bias contributes to this estimate. This understanding can significantly enhance the reliability of the conclusions drawn from the data analysis.
Introduce a new feature that calculates and logs the contribution of bias to the ATT. This feature will involve:
… delta_t, to store the bias contribution as a percentage of the ATT.
… delta_t by comparing the mean bias against the mean outcome of the treated group.
… delta_t information, providing insights into the influence of bias on the ATT results.
… delta_t.
… delta_t.
Understanding the bias contribution to ATT is crucial for evaluating the quality of matching in observational studies. This feature will provide users with a metric to assess the potential impact of bias on their study outcomes, aiding in more informed decision-making.
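The calculation described above can be sketched as follows. This is only an illustration under assumptions: the function name is hypothetical, and the issue does not say exactly how delta_t is normalized, so the sketch takes the ratio of the mean bias to the mean outcome of the treated group and expresses it in percent.

```python
from statistics import fmean


def bias_share_of_treated_outcome(bias, treated_outcome):
    """Mean bias as a percentage of the mean treated outcome.

    Hypothetical helper: the name and the exact normalization are
    assumptions, not the HypEx implementation of delta_t.
    """
    mean_outcome = fmean(treated_outcome)
    if mean_outcome == 0:
        raise ValueError("mean treated outcome is zero; the ratio is undefined")
    return fmean(bias) / mean_outcome * 100.0


# Example: a mean bias of 2 against a mean treated outcome of 20.
delta_t = bias_share_of_treated_outcome([1, 2, 3], [10, 20, 30])
```

A large value would suggest that a substantial part of the estimate may come from imperfect matching rather than from the treatment itself.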
In experimental design, especially in cases where the control and test groups are expected to be of different sizes, it's crucial to accurately calculate the necessary sample size for each group to detect a specified effect size with desired power and significance level. This becomes even more critical in studies with imbalanced group sizes, where traditional sample size calculations might not be directly applicable.
Introduce a new function, calc_imbalanced_sample_size, that calculates the necessary sample size for control and test groups based on the expected mean outcome of the test group, the proportion of the control group in the total population, and the desired power of the test. The function should:
This feature is particularly useful for researchers and practitioners in fields such as marketing, clinical trials, and social sciences, where experiments often involve groups of different sizes and the detection of small but meaningful effects is critical.
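One possible way to compute group sizes for an unequal split is the textbook two-sided z-test formula for a difference in means. The sketch below is an assumption about the approach, not the signature or the logic of calc_imbalanced_sample_size: the function name, the use of an absolute effect and a common standard deviation, and the rounding rule are all illustrative.

```python
from math import ceil
from statistics import NormalDist


def imbalanced_sample_size_sketch(effect, std, control_share, alpha=0.05, power=0.8):
    """Group sizes for a two-sided z-test on a difference in means.

    effect        -- absolute difference in means to detect (assumed input)
    std           -- common standard deviation of the outcome (assumed input)
    control_share -- proportion of the control group in the total sample
    Returns (n_control, n_test), each rounded up.
    """
    if not 0 < control_share < 1:
        raise ValueError("control_share must be strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Total size grows as the split moves away from 50/50.
    n_total = (z_alpha + z_power) ** 2 * std**2 / (
        effect**2 * control_share * (1 - control_share)
    )
    return ceil(control_share * n_total), ceil((1 - control_share) * n_total)


balanced = imbalanced_sample_size_sketch(effect=1.0, std=1.0, control_share=0.5)
imbalanced = imbalanced_sample_size_sketch(effect=1.0, std=1.0, control_share=0.2)
```

With the same effect and variance, the 20/80 split needs more observations in total than the 50/50 split, which is why a dedicated calculation is useful for imbalanced designs.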
There appears to be a significant speed issue and a problem with selecting hypotheses in the newly introduced limit_distribution function in HypEx version 0.1.0. This function is critical for our limit distribution-based methodologies, and its current performance bottleneck and hypothesis selection inaccuracies could hinder our analysis efficiency and reliability.
… limit_distribution function with a standard set of parameters.
The limit_distribution function is expected to perform efficiently, adhering to the projected time complexities. Furthermore, it should accurately select hypotheses based on predefined criteria without any inconsistencies.
While the specific solution will depend on a thorough investigation, initial steps could include optimizing the function's algorithm for better performance and reviewing the hypothesis selection logic to ensure it aligns with theoretical expectations.
After the latest update to HypEx version 0.1.0, which was intended to add support for using a list of strings as input for the group_col parameter to facilitate stratification across multiple columns, it was observed that this functionality does not work as expected. Instead of allowing for multi-column stratification, the feature fails and causes errors when group_col is provided with a list.
… group_col parameter with a list of column names for stratification.
The group_col parameter should accept a list of strings without any errors, allowing users to stratify data based on multiple columns seamlessly.
This issue undermines the usability of the newly introduced feature for multi-column stratification, which is crucial for conducting nuanced analyses that depend on grouping data across multiple dimensions.
A thorough review of the changes made to support list inputs for group_col is needed to identify and resolve the underlying cause of this issue. A potential starting point could be to ensure that all internal functions that interact with group_col are updated to handle list inputs correctly.
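One way to make internal functions handle both input forms is to normalize group_col once, at the boundary, so that the rest of the code only ever sees a list. The helper below is a hypothetical sketch, not part of HypEx.

```python
def as_column_list(group_col):
    """Normalize a group_col-style argument to a list of column names.

    Hypothetical helper: accepts None, a single string, or an iterable
    of strings, and always returns a list.
    """
    if group_col is None:
        return []
    if isinstance(group_col, str):
        # A bare string is one column, not a sequence of characters.
        return [group_col]
    columns = list(group_col)
    if not all(isinstance(column, str) for column in columns):
        raise TypeError("group_col must be a string or an iterable of strings")
    return columns


single = as_column_list("gender")
several = as_column_list(["gender", "city"])
```

Checking for str before iterating matters here, because a string is itself iterable and would otherwise be split into characters.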
This proposal outlines enhancements to the AATester class within the HypEx library, aimed at improving usability and functionality.
The motivation behind this proposal is to streamline the user experience and functionality of the AATester class, making it more intuitive and efficient for users performing A/B testing analysis. These changes are proposed in response to feedback and to accommodate common user workflows more naturally.
Integration of Calculation Functions: Functions calc_mde, calc_sample_size, and calc_power are now methods of the AATester class. This change brings a more cohesive structure, reducing the need for external imports and making the class more self-contained.
Alternative Input for Calculation Functions: The functions calc_mde and calc_sample_size have been updated to allow alternative inputs through a pre-split DataFrame and a target field identifier. This update provides a more intuitive input method following group segmentation.
Automated Testing: An automated test has been added to verify the functionality of calc_mde and calc_sample_size functions, ensuring their reliability and correctness.
Enhanced Tutorial: The tutorial for the AA test has been expanded to include a demonstration section for calc_mde and calc_sample_size functions, providing users with practical examples of how to utilize these enhancements.
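To illustrate what a minimum detectable effect computed from already split groups might look like, here is a small sketch based on the standard two-sample z-test formula. It is not the calc_mde method itself: the function name and arguments are hypothetical, and the two sequences stand in for the target field of each group in a pre-split DataFrame.

```python
from math import sqrt
from statistics import NormalDist, variance


def mde_from_split(control, test, alpha=0.05, power=0.8):
    """Minimum detectable difference in means for two given groups.

    Hypothetical helper: control and test are sequences holding the
    target values of each group after the split.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference in means with unequal group sizes.
    standard_error = sqrt(
        variance(control) / len(control) + variance(test) / len(test)
    )
    return (z_alpha + z_power) * standard_error


small = mde_from_split([0, 1] * 50, [0, 1] * 50)
large = mde_from_split([0, 1] * 100, [0, 1] * 100)
```

Doubling both groups lowers the detectable effect, which is the relationship a sample size calculation inverts.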
… AATester class might slightly alter performance metrics, which needs to be evaluated.
… AATester class may require adjustments to accommodate these changes.
An alternative considered was to keep the calculation functions separate from the AATester class. However, integrating these functions directly into the class was deemed more beneficial for user experience and code organization.
The proposed enhancements aim to align the AATester class with user expectations and common workflows, making the HypEx library more accessible and user-friendly for A/B testing analysis.
Please check all tutorials for new features
We have added many new functions and fixed some bugs. Please add them to the tutorials.
Doesn't matter
HypEx does not work correctly when running under Python 3.11. Certain functionalities fail to execute as expected, leading to errors or unexpected behavior during runtime.
HypEx should run smoothly and all functionalities should work as expected without any errors or issues, similar to its operation under previous versions of Python.
This issue was identified after updating the Python environment to version 3.11. Previous versions of Python (e.g., 3.8, 3.9) did not exhibit this problem.
A preliminary investigation suggests that the issue may be related to changes or deprecations in Python 3.11 that affect HypEx's dependencies or its internal operations. Further investigation and compatibility adjustments are likely required.
Encountered an issue with importing tqdm from tqdm.auto in environments with certain versions of tqdm, leading to an ImportError. This bug affects the usability of the HypEx library in environments where the specific tqdm submodule is not available.
… tqdm that lacks the tqdm.auto submodule.
… tqdm for progress bars.
The HypEx library should gracefully fall back to a compatible version of tqdm if tqdm.auto is not available, ensuring compatibility across different environments and tqdm versions.
Not applicable
The issue stems from the diverse environment setups and the variance in tqdm versions, which may or may not include the tqdm.auto submodule. This variability can lead to import errors in certain scenarios, affecting the user experience.
Implement a try-except block around the import statement for tqdm, first attempting to import from tqdm.auto and falling back to the standard tqdm if the first import fails. This approach provides a more robust solution to handle different versions of tqdm seamlessly.
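# Prefer tqdm.auto, which chooses a notebook or console progress bar.
try:
    from tqdm.auto import tqdm
except ImportError:
    # Some tqdm versions lack tqdm.auto; fall back to the standard tqdm.
    try:
        from tqdm import tqdm
    except ImportError as exc:
        raise ImportError("Cannot import tqdm") from exc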
Currently, in the Matcher class of HypEx, the group_col parameter accepts only a single string, limiting the stratification to one dimension. This constraint can be restrictive when multi-dimensional stratification is needed, such as combining gender and location for more nuanced analysis. This feature request is motivated by the need to enhance the functionality of Matcher to accept a list of columns for multi-dimensional stratification.
The proposed feature involves extending the group_col parameter of the Matcher class to accept a list of strings, enabling multi-dimensional stratification. This enhancement will allow users to specify multiple columns for strict stratification, facilitating more complex matching scenarios like "Boy from Moscow", "Girl from New York", etc. The feature will internally handle the concatenation of specified features, thus streamlining the process and eliminating the need for manual preprocessing.
Currently, the alternative is manually concatenating features to create a composite column for stratification. While this works, it adds an extra preprocessing step for the user.
This feature will significantly enhance the usability of HypEx for complex stratification needs in causal inference and A/B testing scenarios.
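The concatenation step, whether done manually today or internally in the proposed feature, can be sketched in a few lines. This is an illustration only: the function name, the separator, and the name of the composite column are assumptions, and plain dictionaries stand in for DataFrame rows.

```python
def add_composite_group(rows, group_cols, sep="_", key="composite_group"):
    """Return copies of rows with one extra field joining several columns.

    Hypothetical helper: rows is a list of dicts, group_cols is the list
    of column names to combine into a single stratification key.
    """
    return [
        {**row, key: sep.join(str(row[column]) for column in group_cols)}
        for row in rows
    ]


people = [
    {"gender": "Boy", "city": "Moscow"},
    {"gender": "Girl", "city": "New York"},
]
stratified = add_composite_group(people, ["gender", "city"])
```

The separator should not occur inside the values themselves, otherwise two different combinations could collapse into the same key.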
The create_test_data function accepts the size of the dataset to create, but the actual size of the created dataset is half of the requested size.
It is expected that the size of the created dataset will be the same as that specified in the argument.
Not needed.
Delete row with filters after row i = 3
To enhance HypEx's documentation and provide users with practical examples and comprehensive comparisons, it is proposed to incorporate Kaggle notebooks into the HypEx documentation and examples section. These notebooks will showcase HypEx's capabilities in real-world scenarios and compare its performance with similar libraries.
Bug Description
During the matching process, when dealing with datasets containing a significant number of zeroes, the Cholesky decomposition fails with a LinAlgError: Matrix is not positive definite. This issue arises when the Cholesky decomposition cannot be performed because of the nature of the data, particularly when the covariance matrix of some groups is not positive definite.
… LinAlgError: Matrix is not positive definite during the Cholesky decomposition step.
The system should gracefully handle cases where the Cholesky decomposition cannot be performed due to non-positive definite matrices. The proposed solution includes preprocessing steps to identify and address groups causing the issue, ensuring the decomposition can proceed without errors.
Implement a preprocessing step that identifies groups for which a Cholesky decomposition is not feasible. For such groups, apply necessary adjustments, such as removing categories that lead to non-positive definite matrices or adding a small value (epsilon) to the diagonal of the covariance matrix to ensure positive definiteness. This approach aims to maintain the integrity of the matching process while accommodating datasets with challenging characteristics.
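The epsilon adjustment mentioned above can be sketched as a retry loop that adds a growing value to the diagonal until the factorization succeeds. The code is a dependency-free illustration with hypothetical function names and default values; in practice the small cholesky helper would likely be replaced by a library routine, and nothing here reflects the actual HypEx fix.

```python
import math


def cholesky(matrix):
    """Lower-triangular Cholesky factor; ValueError if not positive definite."""
    size = len(matrix)
    lower = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError("Matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def cholesky_with_jitter(matrix, epsilon=1e-8, max_tries=5):
    """Retry the factorization, adding epsilon, 10*epsilon, ... to the diagonal.

    Returns the factor and the jitter that was finally applied (0.0 if none).
    """
    jitter = 0.0
    for _ in range(max_tries + 1):
        shifted = [
            [value + (jitter if r == c else 0.0) for c, value in enumerate(row)]
            for r, row in enumerate(matrix)
        ]
        try:
            return cholesky(shifted), jitter
        except ValueError:
            jitter = epsilon if jitter == 0.0 else jitter * 10
    raise ValueError("Matrix is not positive definite even after adding jitter")


# A covariance matrix built from all-zero columns is singular.
factor, used_jitter = cholesky_with_jitter([[0.0, 0.0], [0.0, 0.0]])
```

The returned jitter makes it possible to log which groups needed an adjustment and how large it was.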
This issue was identified during the matching process, a critical step in analyzing datasets for treatment effects. The proposed fix is crucial for ensuring the robustness and reliability of the matching methodology, especially when dealing with diverse datasets.
This proposal outlines enhancements to the Dataset class within the HypEx library, aimed at improving usability and functionality.
The motivation behind this proposal is to streamline the user experience and functionality of the Dataset class, making it more intuitive and efficient for users performing data operations.
add_column could add data together with a specific role instead of adding the data alone. This would help identify the column in the pipeline.
In previous versions of HypEx, when it functioned as an add-on for LAMA, we had access to LAMA's tools for feature importance in matching tasks. However, since becoming a standalone tool, HypEx currently lacks this capability. The reintroduction of feature importance would greatly enhance the analytical power of HypEx by allowing users to understand which features are most influential in their matching tasks. This feature is crucial for data analysis, interpretation, and model optimization.
The proposal involves developing and integrating a feature importance mechanism specifically tailored for matching tasks within HypEx. This feature would ideally:
One alternative could be to provide a plug-in or interface within HypEx that allows integration with external tools offering feature importance analysis. This would, however, require users to rely on an additional tool.
In the past, users have benefited greatly from the feature importance analysis provided by LAMA. Restoring this capability within HypEx itself would not only restore previous functionality but also enhance the standalone value of HypEx.
In the current HypEx framework, handling data emissions (extreme outliers that can significantly skew analysis results) is manual and prone to inconsistencies. This can lead to biased insights, especially in cases where the treatment effect is subtle. Integrating an automated, robust mechanism for identifying and managing emissions within HypEx will streamline analyses, improve accuracy, and offer users more control over data preprocessing steps.
This proposal suggests the implementation of an "Emissions Handling" feature within HypEx. The feature will automatically identify and manage data emissions based on customizable thresholds, such as the 1st and 99th percentiles. It will offer options to either exclude these emissions from analyses or adjust them based on predefined rules, enhancing the framework's flexibility and the reliability of its outputs.
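A percentile-based rule with the two options named above (exclude or adjust) might look like the sketch below. The function name, the parameter names, and the choice of clipping as the adjustment rule are assumptions made for illustration, not the proposed HypEx interface.

```python
from statistics import quantiles


def handle_emissions(values, lower_pct=1, upper_pct=99, mode="exclude"):
    """Drop or clip values outside the given integer percentiles.

    Hypothetical helper: mode="exclude" removes the outliers,
    mode="clip" replaces them with the nearest threshold.
    """
    if not 1 <= lower_pct < upper_pct <= 99:
        raise ValueError("percentiles must satisfy 1 <= lower_pct < upper_pct <= 99")
    # n=100 yields 99 cut points; index p - 1 holds the p-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    lower, upper = cuts[lower_pct - 1], cuts[upper_pct - 1]
    if mode == "exclude":
        return [value for value in values if lower <= value <= upper]
    if mode == "clip":
        return [min(max(value, lower), upper) for value in values]
    raise ValueError(f"unknown mode: {mode!r}")


data = list(range(1, 101)) + [10_000]
excluded = handle_emissions(data, mode="exclude")
clipped = handle_emissions(data, mode="clip")
```

Excluding shrinks the sample, while clipping keeps every observation but caps its influence, so the choice depends on whether group sizes must be preserved.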
Adding this feature will directly address user feedback regarding the challenges of managing emissions in large datasets. By automating this process, HypEx can ensure more consistent, reliable analysis outcomes, particularly in sensitive applications such as financial forecasting and medical research, where outliers can have a disproportionate impact.
Currently, our project HypEx supports Python versions 3.8, 3.9, and 3.10. However, we've identified a need to extend our support to include Python versions 3.6, 3.7, and the newly released 3.11. This expansion is essential to enhance the accessibility of HypEx to a broader range of users and environments, some of which still utilize these versions. This initiative is partly driven by community requests and our goal to maximize the usability of our tool.
The proposal is to update our project to be compatible with Python versions 3.6, 3.7, and 3.11. This would involve:
… python dependency in pyproject.toml to reflect the broader range (">=3.6, <3.12").
One alternative is to only add support for Python 3.6 and 3.7, and wait for more stable releases and wider adoption of Python 3.11. This approach reduces the immediate workload and potential compatibility challenges with the newest Python version.
Feedback from our user community indicates a significant number of potential users on Python 3.6 and 3.7. Additionally, early adopters are beginning to transition to Python 3.11, and supporting it could position HypEx as a forward-compatible tool.
With the introduction of new functionalities in HypEx, including multitarget matching, feature selection in matching, and validation in matching, there's a need to inform users about the alpha version status of these features. Users should be aware of the potential limitations and considerations when using these functionalities.
Implement warnings in the HypEx documentation and possibly within the code execution process to alert users about the experimental nature of certain features. These warnings should cover:
Feature Selection: Highlight the importance of considering leaks, the influence on treatment distribution, the need for features to describe the target adequately, the inclusion of all significantly influencing features, and the business logic behind feature selection.
Multitarget Matching: Clarify that multitarget analysis should ensure the independence of targets from each other and recommend conducting separate experiments for each target with its own set of features.
Random Treatment and Random Feature: Explain that these methods, while useful, may not provide definitive evidence of a successful experiment and should be interpreted with caution.
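A runtime warning of the kind proposed here could be issued with the standard warnings module. The warning class, the helper name, and the message text below are hypothetical and only show the mechanism.

```python
import warnings


class ExperimentalFeatureWarning(UserWarning):
    """Hypothetical category for features that are still in alpha."""


def warn_experimental(feature, advice):
    """Emit a warning that names the feature and the caveat to keep in mind."""
    warnings.warn(
        f"{feature} is in alpha: {advice}",
        ExperimentalFeatureWarning,
        stacklevel=2,  # point at the caller, not at this helper
    )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_experimental(
        "Multitarget matching",
        "targets should be independent of each other.",
    )
```

A dedicated category lets users silence or escalate these messages with the usual warning filters without affecting other warnings.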
An alternative could be extensive documentation without runtime warnings, relying on users to thoroughly review the documentation before using these features. However, direct warnings may more effectively convey the experimental status and considerations.
Include specific examples and scenarios where users should be particularly cautious or consider additional validation steps. This context can help users better understand the rationale behind the warnings and how to apply the features effectively.
In the field of data analysis, particularly in matching scenarios, understanding the impact of extreme values (outliers) on the overall result is crucial. The current setup in HypEx lacks a direct way to evaluate how the results vary before and after the removal of outliers. The introduction of the "emissions" option aims to fill this gap. This feature will allow analysts to assess the extent to which outliers influence the matching results, ensuring more robust and reliable data analysis.
The "emissions" feature is a new option added to the Matching Result Validation process in HypEx. This feature provides a comparative analysis between the results of matching before and after the removal of outliers. The core functionality includes:
This feature would be particularly useful in scenarios where data integrity and accuracy are paramount, and outliers may significantly skew the results.
An alternative approach could be to provide enhanced reporting and visualization tools that allow users to manually inspect the impact of outliers. However, this would be less efficient and more time-consuming compared to an automated "emissions" feature.
This feature is in response to the need for more nuanced data analysis tools within HypEx, especially in situations where outliers can significantly alter the outcome of data matching processes.
I propose adding a new Jupyter notebook example to the HypEx documentation and example section. This notebook would demonstrate how to predict the financial effect from a model, providing a practical guide for users to understand the impact of their models on financial metrics.
Our users often need to quantify the financial impact of their models to justify the implementation and further investment into model development. Providing a clear, step-by-step example of predicting model effects can significantly enhance user understanding and application of HypEx in real-world scenarios.
The new notebook will include:
This example will utilize the AATest and ABTest classes from HypEx, showcasing their application in a practical experiment. The dataset creation process, model fitting, prediction, and testing phases will be covered comprehensively.
This addition is expected to:
While users can independently research and implement model effect prediction, having a dedicated example within HypEx significantly lowers the barrier to entry and ensures consistent application of best practices.
The proposed example will address common challenges and questions related to model effect prediction, providing a valuable resource for both new and experienced users of HypEx.
Problem with tutorials in HypEx ReadTheDocs
I would like to see a list of tutorials and experiments.
Check the docs folder and fix it