
cyclops's People

Contributors

a-kore, adibvafa, amrit110, dependabot[bot], elhamdolatabadi, fcogidi, hoxell, kadenmc, karanswatch, mahshidaln, pre-commit-ci[bot], rjavadi


cyclops's Issues

Dynamically offer every method and attribute of the PyTorch model in the PTModel wrapper

The scikit-learn model wrapper (SKModel) currently supports this behaviour using the decorator pattern: if an attribute or method is not found on the wrapper, the wrapped model is checked.

This should be implemented for the PTModel class as well, so that an instance of the wrapper appears to be like an instance of torch.nn.Module with additional functionalities.
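For reference, a minimal sketch of the delegation behaviour described above, assuming a wrapper that stores the wrapped module on an attribute (names here are illustrative, not the actual cyclops implementation):

import torch.nn as nn

class PTModel:
    """Sketch of a wrapper that forwards unknown attributes to the wrapped model."""

    def __init__(self, model: nn.Module):
        self.model_ = model

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup on the wrapper fails;
        # fall back to the wrapped torch.nn.Module.
        return getattr(self.model_, name)

With this, wrapper.parameters(), wrapper.state_dict(), etc. resolve on the wrapped module, while wrapper-specific methods still take precedence.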

Documentation build fails for evaluate package

I'm currently attempting to build API docs using sphinx for the evaluate package and running into some issues.

To reproduce the error,

git checkout add_api_docs_evaluate
cd docs
make html

The autosummary plugin generates two warnings which indicate failure to generate API docs. The issue seems to be an ImportError which is tied to the medical_imagefolder module.

Support user-defined model directory in PTModel

During training, models are currently being saved in current_working_directory/output/ModelClassName.

TODO

  • Support user-defined directory for model saving.
  • Maybe support creating versions of model directories if the same directory name already exists or if changes in configuration are detected, for example path/to/model_directory/1 and path/to/model_directory/2 (see the sketch below).
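A minimal sketch of the versioning idea, assuming the version is simply the next unused integer under a user-supplied base directory (the helper name is hypothetical):

from pathlib import Path

def next_versioned_dir(base_dir: str) -> Path:
    """Return base_dir/<n>, where <n> is one greater than the largest existing version."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    existing = [int(p.name) for p in base.iterdir() if p.is_dir() and p.name.isdigit()]
    version = max(existing, default=0) + 1
    path = base / str(version)
    path.mkdir()
    return path

# e.g. next_versioned_dir("path/to/model_directory") -> path/to/model_directory/1, then /2, ...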

Importing Cyclops in Kaggle Notebooks fails

Describe the bug

Even though installing Cyclops in Kaggle Notebooks succeeds, importing it fails. This is probably due to a change in the PyArrow API that causes an incompatibility between the required version and the version installed alongside Cyclops.

To Reproduce

After installing Cyclops, try the following in a Kaggle Notebook:
from cyclops.query import MIMICIIIQuerier

Observed error

Importing cyclops.query raises an error (traceback screenshots are attached to the original issue).

Remove baseline_models dependency

Currently, some of the drift_detection use cases import models from the https://github.com/VectorInstitute/cyclops/tree/main/cyclops/monitor/baseline_models directory. These imports should be removed and replaced with the model implementations in https://github.com/VectorInstitute/cyclops/tree/main/cyclops/models.

  • Remove use of models from the above mentioned baseline_models directory and replace with those in models.
  • Remove baseline_models from the codebase
  • Test if the migration works as intended.

Logging model parameters to report only captures string values

Describe the bug

When logging model parameters using report.log_model_parameters(), it seems that only string values are captured in the report.

To Reproduce

Running the tutorial at https://vectorinstitute.github.io/cyclops/api/tutorials/mimiciii/mortality_prediction.html exposes this bug.

Expected behavior

I think we should be able to record numeric and boolean values as well.
For example, all the hyperparameters in the parameter dict (shown in the screenshot attached to the original issue) should be captured when logging.

[Optional]
Some of these tend to be model hyperparameters, for example learning_rate, so I wonder if we should make a distinction.
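As a possible direction, here is a sketch only (not the cyclops report API) of keeping parameter values as JSON-serializable types instead of coercing everything to strings:

def serialize_params(params: dict) -> dict:
    """Keep str, bool, int and float values as-is; stringify anything else."""
    out = {}
    for key, value in params.items():
        if value is None or isinstance(value, (str, bool, int, float)):
            out[key] = value
        else:
            out[key] = str(value)
    return out

# serialize_params({"learning_rate": 0.01, "use_bias": True, "solver": "adam"})
# -> {'learning_rate': 0.01, 'use_bias': True, 'solver': 'adam'}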


Version

  • As of v0.1.31 (latest)

Runway Plots

Is your feature request related to a problem? Please describe.

Most hard metrics for classifier performance (F1, precision, recall, sensitivity, specificity, NPV) require a pre-specified classification threshold (usually 0.5). For some use cases, however, it is useful to see how the model performs across thresholds and what tradeoffs can be made, for example prioritizing true positives or minimizing false positives.

Describe the solution you'd like

One option is a runway plot, like the one shown below:

(Example image: runway plot of diagnostic accuracy measures of the final model, attached to the original issue.)

Describe alternatives you've considered

We can currently do this in the model transparency report by developing a plotly object and logging it to the report with log_plotly_figure().

Additional context

Below is a working function that will create a runway plot in plotly as required:

import plotly.graph_objects as go
import plotly.subplots as sp
import numpy as np
from typing import List
from sklearn.metrics import confusion_matrix

def runway_plot(true_labels: List[int], pred_probs: List[float]) -> go.Figure:
    """
    Plot diagnostic performance metrics with an additional histogram of predicted probabilities.
    The plot uses Plotly with a clean aesthetic. Gridlines are kept, but background color is removed.
    Y-axis ticks and labels are shown. The legend is added at the bottom.
    Tooltips show values with 3 decimal places. X-axis labels are only shown on the bottom subplot.
    The histogram's bin size is reduced and it has no borders.

    Args:
    - true_labels (List[int]): True binary class labels (0 or 1).
    - pred_probs (List[float]): Predicted probabilities for the positive class (1).

    Returns:
    - A Plotly figure containing the diagnostic performance plots and histogram.

    Example:
    ```
    # Generate synthetic data for demonstration
    true_labels = np.random.binomial(1, 0.5, 1000)
    pred_probs = np.random.uniform(0, 1, 1000)

    # Generate and show the modified faceted plot
    faceted_fig = runway_plot(true_labels, pred_probs)
    faceted_fig.show()
    ```
    """
    
    # Ensure array inputs so thresholding works on plain lists as well
    true_labels = np.asarray(true_labels)
    pred_probs = np.asarray(pred_probs)

    # Evaluate all four metrics on a common grid of thresholds
    thresholds = np.linspace(0, 1, 100)
    sensitivity = np.zeros_like(thresholds)
    specificity = np.zeros_like(thresholds)
    ppv = np.zeros_like(thresholds)
    npv = np.zeros_like(thresholds)

    for i, threshold in enumerate(thresholds):
        # Binarize predictions based on threshold
        binarized_predictions = pred_probs >= threshold
        tn, fp, fn, tp = confusion_matrix(
            true_labels, binarized_predictions, labels=[0, 1]
        ).ravel()
        sensitivity[i] = tp / (tp + fn) if (tp + fn) != 0 else 0
        specificity[i] = tn / (tn + fp) if (tn + fp) != 0 else 0
        ppv[i] = tp / (tp + fp) if (tp + fp) != 0 else 0
        npv[i] = tn / (tn + fn) if (tn + fn) != 0 else 0

    # Define hover template to show three decimal places
    hover_template = 'Threshold: %{x:.3f}<br>Metric Value: %{y:.3f}<extra></extra>'

    # Create a subplot for each metric
    fig = sp.make_subplots(rows=5, cols=1, shared_xaxes=True, vertical_spacing=0.02)

    # Sensitivity plot (true positive rate at each threshold)
    fig.add_trace(go.Scatter(x=thresholds, y=sensitivity, mode='lines', name='Sensitivity', hovertemplate=hover_template), row=1, col=1)

    # Specificity plot (true negative rate at each threshold)
    fig.add_trace(go.Scatter(x=thresholds, y=specificity, mode='lines', name='Specificity', hovertemplate=hover_template), row=2, col=1)

    # PPV plot (positive predictive value)
    fig.add_trace(go.Scatter(x=thresholds, y=ppv, mode='lines', name='PPV', hovertemplate=hover_template), row=3, col=1)

    # NPV plot (negative predictive value)
    fig.add_trace(go.Scatter(x=thresholds, y=npv, mode='lines', name='NPV', hovertemplate=hover_template), row=4, col=1)

    # Add histogram of predicted probabilities
    fig.add_trace(go.Histogram(x=pred_probs, nbinsx=80, name='Predicted Probabilities'), row=5, col=1)

    # Update layout
    fig.update_layout(
        height=1000, 
        width=700, 
        title_text="Diagnostic Performance Metrics by Thresholds",
        legend=dict(orientation="h", yanchor="bottom", y=-0.2, xanchor="center", x=0.5)
    )

    # Remove subplot titles
    for i in fig['layout']['annotations']:
        i['text'] = ''

    # Remove the plot background color, keep gridlines, show y-axis ticks and labels
    fig.update_xaxes(showgrid=True)
    fig.update_yaxes(showgrid=True, showticklabels=True)

    # Only show the x-axis line and labels on the bottommost plot
    fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
    fig.update_xaxes(showticklabels=True, row=4, col=1)
    fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
    
    fig.update_xaxes(showline=False, row=5, col=1, showticklabels=False)
    fig.update_yaxes(showline=False, row=5, col=1)

    # Set the background to white
    fig.update_layout(plot_bgcolor='white')

    return fig

# Generate synthetic data for demonstration
true_labels = np.random.binomial(1, 0.5, 1000)
pred_probs = np.random.uniform(0, 1, 1000)

# Generate and show the modified faceted plot
faceted_fig_clean = runway_plot(true_labels, pred_probs)
faceted_fig_clean.show()

Running this generates the faceted runway plot shown in the image attached to the original issue.

Code style check

This issue has 2 parts:

  • The codebase needs a review to see if it adheres to the Google Python style guide (https://google.github.io/styleguide/pyguide.html).

  • For docstrings, the parameter types are documented using type hints, and adding them to the docstring is redundant. These can be removed to keep docstrings clean and reduce redundancy.

Loss reweighting?

In the PyTorch model wrapper class (PTModel), there is a function that was originally intended for re-weighting the loss for unbalanced datasets. The function is currently not implemented and can either be removed, because some PyTorch loss functions already support this, or refactored into a more general utility for handling imbalanced datasets.
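For context, a short example of the built-in PyTorch support for class weighting that makes the unimplemented helper arguably redundant (the weights below are illustrative):

import torch
import torch.nn as nn

# Multi-class: per-class weights, e.g. derived from inverse class frequency
class_weights = torch.tensor([0.2, 0.8])
ce_loss = nn.CrossEntropyLoss(weight=class_weights)

# Binary with logits: up-weight the positive class
bce_loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([4.0]))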

Add support for model hyperparameter tuning

The model wrappers have a method that is intended for tuning the model hyperparameters and returning the best model.
The method has the following signature:

    find_best(
        self,
        X: ArrayLike,
        y: ArrayLike,
        parameters: Union[Dict, List[Dict]],
        metric: Union[str, Callable, Sequence, Dict] = None,
        method: Literal["grid", "random"] = "grid",
        **kwargs,
    )

Currently, only the scikit-learn model wrapper SKModel implements this method, and that implementation would benefit from the following improvements:

  • Support using metrics from cyclops.evaluate.metrics in the hyperparameter search, potentially using sklearn.metrics.make_scorer (see the sketch below).
  • Handle data splits e.g. predefined split, split by percentage, cross-validation split etc.
  • Support passing group and fit_params arguments when calling clf.fit.

The PyTorch model wrapper (PTModel) should implement this method as well, with the same behaviour as the sklearn model wrapper.
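A minimal sketch of the first improvement listed above, adapting a plain metric callable with sklearn.metrics.make_scorer for a grid search; whether a cyclops.evaluate.metrics metric exposes a metric(y_true, y_pred) callable in exactly this form is an assumption:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import GridSearchCV

# Any callable of the form metric(y_true, y_pred) can be adapted this way.
scorer = make_scorer(f1_score)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring=scorer,
    cv=5,
)
# search.fit(X, y); search.best_estimator_ would then back find_best's return value.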

Unknown type for targets and preds

Describe the bug

When evaluate or evaluate_fairness is called with slice_spec set, and the number of examples matching a specific slice is one, type_target and type_preds become unknown, and a ValueError is raised by _binary_stat_scores_format.

Calibration Plots

Is your feature request related to a problem? Please describe.

For particular problems such as prevalence estimation or risk score prediction, well-calibrated models are necessary. It would be great for data scientist end users to be able to generate calibration plots in their transparency reports to examine this aspect of model performance.

Describe the solution you'd like

One option is a calibration (reliability) plot like the example image attached to the original issue.

Describe alternatives you've considered

We can currently do this in the model transparency report by developing a plotly object and logging it to the report with log_plotly_figure().

Additional context

Below is a working function that will create a calibration plot in plotly as required:

import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.calibration import calibration_curve

def generate_calibration_plot(df, y_true_col, y_prob_col, grouping_var=None):
    """
    Generates a calibration plot with a histogram of the predicted probabilities below it.
    
    Parameters:
    df (DataFrame): The dataframe containing the true labels and predicted probabilities.
    y_true_col (str): The name of the column with true labels (0 or 1).
    y_prob_col (str): The name of the column with predicted probabilities.
    grouping_var (str, optional): The name of the column to group data by. If provided, the plot will include 
                                  multiple curves on the plot, one for each level of the grouping_var.
    """
    # Create subplots: 1 plot for calibration curve, 1 plot for histogram
    fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.02, row_heights=[0.8, 0.2])
    
    if grouping_var:
        # Plot a calibration curve for each level of the grouping variable
        unique_groups = df[grouping_var].unique()
        for group in unique_groups:
            group_df = df[df[grouping_var] == group]
            prob_true, prob_pred = calibration_curve(group_df[y_true_col], group_df[y_prob_col], n_bins=10)
            fig.add_trace(go.Scatter(x=prob_pred, y=prob_true, mode='markers+lines', name=f'{group}'), row=1, col=1)
    else:
        # Plot a single calibration curve
        prob_true, prob_pred = calibration_curve(df[y_true_col], df[y_prob_col], n_bins=10)
        fig.add_trace(go.Scatter(x=prob_pred, y=prob_true, mode='markers+lines', name='Model'), row=1, col=1)

    # Add perfectly calibrated line to the calibration curve
    fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines', name='Perfectly calibrated', line=dict(dash='dot')), row=1, col=1)

    # Plot histogram of predicted probabilities below the calibration curve
    fig.add_trace(go.Histogram(x=df[y_prob_col], nbinsx=100, name='Probabilities', showlegend=False), row=2, col=1)

    # Update layout
    legend_title = grouping_var if grouping_var else None
    fig.update_layout(title='Calibration Plot', yaxis_title='Fraction of Positives', legend_title=legend_title)
    fig.update_xaxes(title_text='Mean Predicted Probability', row=2, col=1)
    fig.update_yaxes(title_text='Count', row=2, col=1)

    fig.show()

Run on sample data, such as this:

import pandas as pd
import numpy as np

np.random.seed(42)

n_samples = 100
data = {
    'y_true': np.random.binomial(1, 0.5, n_samples),  # Random true labels (0s and 1s)
    'y_prob': np.random.rand(n_samples),               # Random probabilities between 0 and 1
    'group': np.random.choice(['A', 'B'], n_samples)   # Random group labels ('A' or 'B')
}

sample_df = pd.DataFrame(data)
generate_calibration_plot(sample_df, 'y_true', 'y_prob')

This generates a calibration plot with a histogram of predicted probabilities (image attached to the original issue).

Or calibration by group:

generate_calibration_plot(sample_df, 'y_true', 'y_prob', 'group')

This generates one calibration curve per group (image attached to the original issue).

Refactor: Use `namedtuple` instead of `tuple` in `ClassificationPlotter`

There are two methods in ClassificationPlotter, roc_curve() and precision_recall_curve(), that take a tuple of ndarrays whose order is important, which may introduce bugs. Also, the last element of the tuple is never used inside these methods.

Refactor the arguments so that it is clear which element is which. Using a namedtuple is preferred.
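A sketch of the proposed refactor using typing.NamedTuple (field names are illustrative):

from typing import NamedTuple
import numpy as np

class ROCCurve(NamedTuple):
    fpr: np.ndarray
    tpr: np.ndarray
    thresholds: np.ndarray  # currently unused by the plotting method, kept for completeness

Callers would then pass ROCCurve(fpr=fpr, tpr=tpr, thresholds=thresholds), and the plotting code reads curve.fpr / curve.tpr instead of positional tuple indices.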

query dataset APIs design review

The query dataset API design has some problems and could be improved. This issue is to hold a design review and put together a proposal for improving the API.

Cyclops installation fails in Windows

Describe the bug

Installing Cyclops in a new conda environment on Windows fails because pip cannot find a version that satisfies all requirements.

To Reproduce

Upon creation of a new conda environment, enter:
py -m pip install 'pycyclops[query,models]'

Observed error

pip reports that it cannot find a version that satisfies all requirements (error screenshots are attached to the original issue).

Additional Context

The bug does not occur when installing Cyclops on Kaggle Notebooks.

Add SQL OR, LIKE and IN to Query API

Is your feature request related to a problem? Please describe.

I have the following SQL code which I'd like to port to my CyclOps pipeline:

SELECT * FROM my_table WHERE my_field LIKE '%my_pattern%' OR my_field IN (my_values)

Unfortunately with the current query operations, this does not seem possible.

Describe the solution you'd like

One potential design would be:

query_ops = qo.Sequential([
    qo.Or([
        qo.Like(my_pattern, my_field),
        qo.ConditionIn(my_values, my_field)
    ])
])

Although there may be better designs.

Describe alternatives you've considered

This may be possible by joining many tables with a different qo.ConditionEquals each, but the performance would suffer greatly. It also doesn't seem to be possible to perform a LIKE with any of the existing features.
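For reference, the requested condition is straightforward to express in plain SQLAlchemy (1.4+ style), which is roughly what hypothetical qo.Or / qo.Like / qo.ConditionIn ops would need to generate; the connection string, table, and values below are placeholders:

from sqlalchemy import MetaData, Table, create_engine, or_, select

engine = create_engine("postgresql://user:pass@host/db")  # placeholder
my_table = Table("my_table", MetaData(), autoload_with=engine)

stmt = select(my_table).where(
    or_(
        my_table.c.my_field.like("%my_pattern%"),
        my_table.c.my_field.in_(["value_1", "value_2"]),
    )
)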

Add support for using the query op Apply with multiple column inputs

Currently the Apply op in cyclops.query.ops only supports applying a function to individual columns (one or more of them, each independently). A desirable addition would be support for applying a function that takes multiple column inputs and produces a single output column, i.e. F(col1, col2, ...).

Add Support for CXR Images in Drift Detection Toolkit

Overview

Using the public NIH chest X-ray dataset as a use case, support for Chest X-Ray (CXR) images will be implemented, along with dimensionality reduction and two-sample tests on large CXR image datasets.

This will be a placeholder for a task list of features (other issues) that will be incrementally implemented. I'll create branches specifically to work on these issues to track development.

Features

Development Roadmap

I just came upon your framework and am very interested in its development.

I'm working on a research project in which we (currently) use MIMIC-IV.
The cyclops project looks quite interesting to use, especially for the ETL task.
However, since it's clearly still in an alpha stage, we are not sure whether it makes sense to base our work on this library, as it's prone to change in the future.

It would be very helpful to have an outlook on what the plans for this project are in the short and long term future.
Maybe we could also contribute to your plans if they line up with our goals.

Therefore, I would highly appreciate a development roadmap :)

Query batching using dask

The current batching while querying is a bit finicky and the implementation is untested. While it works and has been used for the mimic use case, replacing it with dask is a better choice.

TODO:

  • Use dask's read_sql_query function and use the query interface to switch between pandas and dask dataframes (see the sketch below).
  • Update saving and loading of dataframes in parquet and csv formats.
  • Update the mimic use case and try out batching using dask.
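A minimal sketch of the dask approach, assuming a SQLAlchemy selectable and an orderable index column for partitioning (the connection string, table, and column names are placeholders):

import dask.dataframe as dd
from sqlalchemy import MetaData, Table, create_engine, select

uri = "postgresql://user:pass@host/db"  # placeholder
events = Table("events", MetaData(), autoload_with=create_engine(uri))

ddf = dd.read_sql_query(
    select(events),
    con=uri,
    index_col="encounter_id",  # used to split the query into partitions
    npartitions=16,
)
n_rows = len(ddf)  # triggers batched execution across partitions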

Add default model params, allow override by specifying config

Currently, when instantiating an SKModel wrapper (https://github.com/VectorInstitute/cyclops/blob/main/cyclops/models/wrappers/sk_model.py#L56), model params need to be specified as a dict. This is done by loading default configs from https://github.com/VectorInstitute/cyclops/tree/main/cyclops/models/configs and then passing them to the create_model factory method.

This places additional work on the user, so default configs could instead be loaded in the wrapper directly, with an option for the user to override them. This pattern is used in the DatasetQuerier class (https://github.com/VectorInstitute/cyclops/blob/main/cyclops/query/base.py#L90), which uses hydra to load yaml config.
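A sketch of the default-plus-override pattern using OmegaConf (which hydra builds on); the config path and keys below are illustrative, not necessarily the actual files in cyclops/models/configs:

from omegaconf import OmegaConf

# Defaults shipped with the package
defaults = OmegaConf.load("cyclops/models/configs/sgd_classifier.yaml")

# User-supplied overrides take precedence over defaults
overrides = OmegaConf.create({"alpha": 0.001, "loss": "log_loss"})
params = OmegaConf.merge(defaults, overrides)

# model = create_model("sgd_classifier", **params)  # existing factory, per the issue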

Process batching using dask

Many of the process API functions that use pandas (e.g. groupby for aggregation, imputation, and normalization) rely on operations that also have dask support. A good enhancement would therefore be the ability to load data as dask dataframes and call the process API functions, which would then call dask's compute method to process in batches.

TODO:

  • Create a design doc outlining the current process API funcs, what they do and where pandas funcs are used.
  • Create a checklist for those funcs to see if they have dask support.
  • Implement checks to see whether the input is a dask DataFrame, and if so apply lazy compute for those funcs (see the sketch below).
  • Add tests.
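A minimal sketch of the dispatch idea from the third item above (the function is illustrative, not an existing process API func):

import dask.dataframe as dd
import pandas as pd

def normalize(df, column, compute=False):
    """Z-score normalize `column`; accepts pandas or dask dataframes."""
    col = df[column]
    df = df.assign(**{column: (col - col.mean()) / col.std()})
    if isinstance(df, dd.DataFrame) and compute:
        # Trigger batched execution only when explicitly requested
        return df.compute()
    return df

# pandas input returns an eager result; dask input stays lazy unless compute=True
pdf = pd.DataFrame({"hr": [60.0, 72.0, 88.0, 90.0]})
print(normalize(pdf, "hr"))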

Add support to GroupbyAggregate over same column, using multiple aggfuncs

Currently, aggregating over the same column using multiple aggregation funcs is not supported (see https://github.com/VectorInstitute/cyclops/blob/main/cyclops/query/ops.py#L1751).

We wish to support something like:

GroupByAggregate("person_id", 
    {
        "lab_name": [
            ("string_agg", "lab_name_agg"),
            ("median", "lab_name_median")
    }, 
    {"lab_name": ", "}
)(table)

Considerations:

  • The string_agg separator only applies to "string_agg", so it may be better to supply it as a third item in the tuple and use it only when the aggfunc is "string_agg", e.g. ("string_agg", "lab_name_agg", ", ").
  • Add checks to make sure a column can be aggregated in the requested way. For example, taking the mean of a string column shouldn't be allowed and should raise a meaningful error.

Increase mypy strictness and fix issues

The mypy configuration is currently fairly lenient and does not complain if type annotations are missing from functions.

  • Add stricter mypy config
  • Fix issues caught by mypy and add more type annotations
  • Check if mypy config is being read correctly by pre-commit
