vectorinstitute / cyclops
Toolkit for evaluating and monitoring AI models in clinical settings
Home Page: https://vectorinstitute.github.io/cyclops/
License: Apache License 2.0
The scikit-learn model wrapper (SKModel) currently supports this behaviour using the decorator pattern: if an attribute or method is not found on the wrapper, the wrapped model is checked. This should be implemented for the PTModel class as well, so that an instance of the wrapper behaves like an instance of torch.nn.Module with additional functionality.
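A minimal sketch of that delegation, assuming a hypothetical wrapper that stores the module in self.model (not the actual PTModel internals):

import torch.nn as nn

class PTModel:
    """Hypothetical wrapper sketch: delegate unknown attributes to the wrapped module."""

    def __init__(self, model: nn.Module):
        self.model = model

    def __getattr__(self, name):
        # Only called when normal lookup fails on the wrapper,
        # so wrapper-specific attributes still take precedence.
        return getattr(self.model, name)

wrapper = PTModel(nn.Linear(4, 2))
print(wrapper.weight.shape)  # resolved on the wrapped nn.Module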
I'm currently attempting to build API docs using Sphinx for the evaluate package and running into some issues.
To reproduce the error:
git checkout add_api_docs_evaluate
cd docs
make html
The autosummary plugin generates two warnings indicating that API doc generation failed. The issue appears to be an ImportError tied to the medical_imagefolder module.
Since fairness metrics are a parity/ratio, it might be better to visualize them as a scatter plot instead of a bar plot. I think we did this before in https://github.com/VectorInstitute/cyclops/blob/main/nbs/explore_huggingface_datasets.ipynb.
To Do:
Update the tutorial to visualize fairness metrics with scatter plots, as sketched below.
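A minimal sketch of the idea, with made-up slice names and parity values (not actual cyclops output):

import plotly.graph_objects as go

# Hypothetical parity metrics per data slice, shown as a scatter plot
# with a reference line at 1.0 (perfect parity).
slices = ["age<40", "age>=40", "sex=F", "sex=M"]
parity = [0.92, 1.05, 0.88, 1.10]
fig = go.Figure(go.Scatter(x=slices, y=parity, mode="markers", name="parity"))
fig.add_hline(y=1.0, line_dash="dot")  # reference: perfect parity
fig.update_yaxes(title_text="Metric parity (ratio)")
fig.show()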
During training, models are currently being saved in current_working_directory/output/ModelClassName. It would be better to save them under versioned directories, such as path/to/model_directory/1 and path/to/model_directory/2.
Even though installing Cyclops in Kaggle Notebooks succeeds, importing it fails. This is probably due to a change in the PyArrow API that makes the version installed in the Kaggle environment incompatible with the version Cyclops requires.
After installing Cyclops, try the following in a Kaggle Notebook:
from cyclops.query import MIMICIIIQuerier
Currently, some of the drift_detection use cases import models from the https://github.com/VectorInstitute/cyclops/tree/main/cyclops/monitor/baseline_models directory. These imports should be removed and replaced with the model implementations from https://github.com/VectorInstitute/cyclops/tree/main/cyclops/models.
The data sub-package has important utility functions that need unit tests.
Task: Write good unit tests to increase coverage for the module to 100%.
(https://app.codecov.io/gh/VectorInstitute/cyclops/blob/main/cyclops/data/utils.py)
When logging model parameters using report.log_model_parameters(), it seems that only string values are captured in the report.
Running https://vectorinstitute.github.io/cyclops/api/tutorials/mimiciii/mortality_prediction.html exposes this bug.
I think we should be able to record numeric and boolean values as well.
So, for example, all the hyperparameters in the dict seen below should be captured when logging.
[Optional]
Some of these tend to be model hyperparameters, for example learning_rate, so I wonder if we should make a distinction.
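A minimal sketch of the idea (generic Python, not the actual report logging code): keep JSON-native types when serializing parameters and stringify only the rest.

import json

def serialize_params(params: dict) -> dict:
    """Hypothetical helper: preserve numeric/boolean values when logging."""
    out = {}
    for name, value in params.items():
        if isinstance(value, (bool, int, float, str)) or value is None:
            out[name] = value  # keep native JSON-compatible types
        else:
            out[name] = str(value)  # fall back to string for everything else
    return out

print(json.dumps(serialize_params({"learning_rate": 0.1, "verbose": True, "n_jobs": None})))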
Most hard metrics for classifier performance (F1, precision, recall, sensitivity, specificity, NPV) require a pre-specified classification threshold (usually 0.5). For some use cases, however, it is useful to see how the model performs across thresholds and what tradeoffs are available, for example to prioritize true positives or to minimize false positives.
One option is a runway plot, like the one shown below:
We can currently do this in the model transparency report by building a plotly object and logging it to the report with log_plotly_figure().
Below is a working function that will create the required runway plot in plotly:
import numpy as np
import plotly.graph_objects as go
import plotly.subplots as sp
from typing import List
from sklearn.metrics import confusion_matrix


def runway_plot(true_labels: List[int], pred_probs: List[float]) -> go.Figure:
    """
    Plot threshold-dependent performance metrics with a histogram of predicted probabilities.

    The plot uses Plotly with a clean aesthetic. Gridlines are kept, but the background
    color is removed. Y-axis ticks and labels are shown. The legend is placed at the bottom.
    Tooltips show values with 3 decimal places. X-axis labels are only shown on the
    bottommost metric subplot. The histogram uses narrow bins and has no borders.

    Args:
    - true_labels (List[int]): True binary class labels (0 or 1).
    - pred_probs (List[float]): Predicted probabilities for the positive class (1).

    Returns:
    - A Plotly figure containing the diagnostic performance plots and histogram.

    Example:
    ```
    # Generate synthetic data for demonstration
    true_labels = np.random.binomial(1, 0.5, 1000)
    pred_probs = np.random.uniform(0, 1, 1000)
    # Generate and show the runway plot
    faceted_fig = runway_plot(true_labels, pred_probs)
    faceted_fig.show()
    ```
    """
    true_labels = np.asarray(true_labels)
    pred_probs = np.asarray(pred_probs)
    # Evaluate all four metrics on a common grid of thresholds
    thresholds = np.linspace(0, 1, 100)
    sensitivity = np.zeros_like(thresholds)
    specificity = np.zeros_like(thresholds)
    ppv = np.zeros_like(thresholds)
    npv = np.zeros_like(thresholds)
    for i, threshold in enumerate(thresholds):
        # Binarize predictions based on threshold
        binarized_predictions = pred_probs >= threshold
        # labels=[0, 1] keeps the matrix 2x2 even when predictions are all one class
        tn, fp, fn, tp = confusion_matrix(true_labels, binarized_predictions, labels=[0, 1]).ravel()
        # Calculate sensitivity, specificity, PPV and NPV, guarding against empty denominators
        sensitivity[i] = tp / (tp + fn) if (tp + fn) != 0 else 0
        specificity[i] = tn / (tn + fp) if (tn + fp) != 0 else 0
        ppv[i] = tp / (tp + fp) if (tp + fp) != 0 else 0
        npv[i] = tn / (tn + fn) if (tn + fn) != 0 else 0
    # Define hover template to show three decimal places
    hover_template = 'Threshold: %{x:.3f}<br>Metric Value: %{y:.3f}<extra></extra>'
    # Create a subplot for each metric, plus one for the histogram
    fig = sp.make_subplots(rows=5, cols=1, shared_xaxes=True, vertical_spacing=0.02)
    # Sensitivity plot (true positive rate)
    fig.add_trace(go.Scatter(x=thresholds, y=sensitivity, mode='lines', name='Sensitivity', hovertemplate=hover_template), row=1, col=1)
    # Specificity plot (true negative rate)
    fig.add_trace(go.Scatter(x=thresholds, y=specificity, mode='lines', name='Specificity', hovertemplate=hover_template), row=2, col=1)
    # PPV plot (positive predictive value)
    fig.add_trace(go.Scatter(x=thresholds, y=ppv, mode='lines', name='PPV', hovertemplate=hover_template), row=3, col=1)
    # NPV plot (negative predictive value)
    fig.add_trace(go.Scatter(x=thresholds, y=npv, mode='lines', name='NPV', hovertemplate=hover_template), row=4, col=1)
    # Add histogram of predicted probabilities
    fig.add_trace(go.Histogram(x=pred_probs, nbinsx=80, name='Predicted Probabilities'), row=5, col=1)
    # Update layout
    fig.update_layout(
        height=1000,
        width=700,
        title_text="Diagnostic Performance Metrics by Threshold",
        legend=dict(orientation="h", yanchor="bottom", y=-0.2, xanchor="center", x=0.5),
    )
    # Remove the plot background color, keep gridlines, show y-axis ticks and labels
    fig.update_xaxes(showgrid=True)
    fig.update_yaxes(showgrid=True, showticklabels=True)
    # Only show the x-axis line and tick labels on the bottommost metric plot
    fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
    fig.update_xaxes(showticklabels=True, row=4, col=1)
    fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True)
    fig.update_xaxes(showline=False, row=5, col=1, showticklabels=False)
    fig.update_yaxes(showline=False, row=5, col=1)
    # Set the background to white
    fig.update_layout(plot_bgcolor='white')
    return fig


# Generate synthetic data for demonstration
true_labels = np.random.binomial(1, 0.5, 1000)
pred_probs = np.random.uniform(0, 1, 1000)
# Generate and show the runway plot
faceted_fig_clean = runway_plot(true_labels, pred_probs)
faceted_fig_clean.show()
This will generate a plot like so:
This issue has 2 parts:
1. The codebase needs a review to see if it adheres to the Google Python style guide (https://google.github.io/styleguide/pyguide.html).
2. For docstrings, the parameter types are already documented using type hints, so repeating them in the docstring is redundant. These can be removed to keep docstrings clean and reduce redundancy.
In the PyTorch model wrapper class (PTModel) there is a function originally intended to re-weight the loss for unbalanced datasets. This function is currently not implemented and can either be removed - because some PyTorch loss functions already support this - or refactored into a more general mechanism for handling imbalanced datasets. A minimal sketch of the built-in re-weighting follows.
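A short sketch of the existing PyTorch support, using inverse class-frequency weights on toy labels (the weighting scheme is just one common choice):

import torch
import torch.nn as nn

labels = torch.tensor([0, 0, 0, 1])              # imbalanced toy labels
counts = torch.bincount(labels).float()
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency class weights
# CrossEntropyLoss re-weights per-class loss terms via its `weight` argument
criterion = nn.CrossEntropyLoss(weight=weights)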
The model wrappers have a method that is intended for tuning the model hyperparameters and returning the best model.
The method has the following signature:
find_best(
    self,
    X: ArrayLike,
    y: ArrayLike,
    parameters: Union[Dict, List[Dict]],
    metric: Union[str, Callable, Sequence, Dict] = None,
    method: Literal["grid", "random"] = "grid",
    **kwargs,
)
Currently, only the scikit-learn model wrapper SKModel implements this method, and that implementation would benefit from the following improvements:
- Support metrics from cyclops.evaluate.metrics in the hyperparameter search, potentially using the sklearn.metrics.make_scorer method (see the sketch after this list).
- Support passing the group and fit_params arguments when calling clf.fit.
The PyTorch model wrapper (PTModel) should implement this method as well, with the same behaviour as the sklearn model wrapper.
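A minimal sketch of the make_scorer route; toy_metric is a placeholder standing in for a metric from cyclops.evaluate.metrics:

import numpy as np
from sklearn.metrics import make_scorer

def toy_metric(y_true, y_pred):
    # Placeholder metric: plain accuracy
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# make_scorer turns a (y_true, y_pred) callable into a scorer that
# GridSearchCV / RandomizedSearchCV can consume.
scorer = make_scorer(toy_metric, greater_is_better=True)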
When evaluate/evaluate_fairness is called with slice_spec set, and the number of examples matching a specific slice is one, type_target and type_preds become unknown and a ValueError is raised by _binary_stat_scores_format.
For particular problems such as prevalence estimation or risk score prediction, well-calibrated models are necessary. It would be great for data scientist end users to be able to generate calibration plots in their transparency reports to examine this aspect of model performance.
One option for calibration plots looks like this:
We can currently do this in the model transparency report by building a plotly object and logging it to the report with log_plotly_figure().
Below is a working function that will create the required calibration plot in plotly:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.calibration import calibration_curve


def generate_calibration_plot(df, y_true_col, y_prob_col, grouping_var=None):
    """
    Generate a calibration plot with a histogram of the predicted probabilities below it.

    Parameters:
        df (DataFrame): The dataframe containing the true labels and predicted probabilities.
        y_true_col (str): The name of the column with true labels (0 or 1).
        y_prob_col (str): The name of the column with predicted probabilities.
        grouping_var (str, optional): The name of the column to group data by. If provided,
            the plot will include one calibration curve per level of grouping_var.
    """
    # Create subplots: one for the calibration curve, one for the histogram
    fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.02, row_heights=[0.8, 0.2])
    if grouping_var:
        # Plot a calibration curve for each level of the grouping variable
        unique_groups = df[grouping_var].unique()
        for group in unique_groups:
            group_df = df[df[grouping_var] == group]
            prob_true, prob_pred = calibration_curve(group_df[y_true_col], group_df[y_prob_col], n_bins=10)
            fig.add_trace(go.Scatter(x=prob_pred, y=prob_true, mode='markers+lines', name=f'{group}'), row=1, col=1)
    else:
        # Plot a single calibration curve
        prob_true, prob_pred = calibration_curve(df[y_true_col], df[y_prob_col], n_bins=10)
        fig.add_trace(go.Scatter(x=prob_pred, y=prob_true, mode='markers+lines', name='Model'), row=1, col=1)
    # Add the perfectly calibrated reference line
    fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines', name='Perfectly calibrated', line=dict(dash='dot')), row=1, col=1)
    # Histogram of predicted probabilities
    fig.add_trace(go.Histogram(x=df[y_prob_col], nbinsx=100, name='Probabilities', showlegend=False), row=2, col=1)
    # Update layout
    legend_title = grouping_var if grouping_var else None
    fig.update_layout(title='Calibration Plot', yaxis_title='Fraction of Positives', legend_title=legend_title)
    fig.update_xaxes(title_text='Mean Predicted Probability', row=2, col=1)
    fig.update_yaxes(title_text='Count', row=2, col=1)
    fig.show()
Run on sample data, such as this:
import pandas as pd
import numpy as np

np.random.seed(42)
n_samples = 100
data = {
    'y_true': np.random.binomial(1, 0.5, n_samples),   # Random true labels (0s and 1s)
    'y_prob': np.random.rand(n_samples),               # Random probabilities between 0 and 1
    'group': np.random.choice(['A', 'B'], n_samples),  # Random group labels ('A' or 'B')
}
sample_df = pd.DataFrame(data)
generate_calibration_plot(sample_df, 'y_true', 'y_prob')
Generating the following plot:
Or calibration by group:
generate_calibration_plot(sample_df, 'y_true', 'y_prob', 'group')
Generating the following plot:
There are two methods in ClassificationPlotter - roc_curve() and precision_recall_curve() - that take a tuple of ndarrays whose order is important, which may introduce bugs. Also, the last element of the tuple is never used inside the functions.
Refactor the arguments so that it is clear which element is which. Using namedtuple is preferred; a sketch follows.
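A minimal sketch of the idea, with hypothetical field names rather than the actual ClassificationPlotter signature:

from typing import NamedTuple
import numpy.typing as npt

class ROCCurve(NamedTuple):
    # Named fields make the ordering explicit, and the unused
    # thresholds element can simply be dropped from the type.
    fpr: npt.NDArray
    tpr: npt.NDArray

def roc_curve_plot(curve: ROCCurve):
    # Fields are accessed by name, so swapped arguments cannot go unnoticed.
    return curve.fpr, curve.tpr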
The current API documentation (https://vectorinstitute.github.io/cyclops/api/index.html#) needs improvement, with clearer descriptions of the sub-packages. It also needs sections giving a high-level overview of the modules.
The API docs also need tutorials that demonstrate the tasks and evaluate functionalities.
The query dataset API design has some problems and could be improved. This issue is to hold a design review and suggest a proposal for improving the API. One proposed direction:
querier.table(join=(join_table, keys, join_type), filter=(col, val), apply=(col, func), ...)
where join, filter and apply are more generic and flexible methods. We would, of course, have to allow conditions using timestamps and a few other cases as well, but the above three functions should cover a lot of ground.

Installing Cyclops in a new conda environment on Windows fails because no Python version can be found that satisfies all requirements.
Upon creation of a new conda environment, enter:
py -m pip install 'pycyclops[query,models]'
The bug does not occur when installing Cyclops on Kaggle Notebooks.
It is a rather long and complicated method; it would be nice if we could refactor it.
To improve the documentation, I would suggest adding a link to "Contributing to cyclops" under the Getting Started "Developing" subsection. This way it's clear that this isn't the only development documentation, and where to find the further info.
I have the following SQL code which I'd like to port to my CyclOps pipeline:
SELECT * FROM my_table WHERE my_field LIKE '%my_pattern%' OR my_field IN (my_values)
Unfortunately with the current query operations, this does not seem possible.
One potential design would be:
query_ops = qo.Sequential([
    qo.Or([
        qo.Like(my_pattern, my_field),
        qo.ConditionIn(my_values, my_field),
    ])
])
There may be better designs, though.
This may be possible by joining many tables, each with a different qo.ConditionEquals, but the performance would suffer greatly. It also doesn't seem possible to perform a LIKE with any of the existing features.
Currently, the Apply function in cyclops.query.ops supports only the application of a function to one or more columns, column-wise. A desirable addition would be the ability to apply a function that takes multiple column inputs and gives a single column output, i.e. F(col1, col2, ...). A hypothetical usage sketch follows.
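A minimal sketch of what such a call could look like; the argument names, the output-column parameter, and the multi-column form are all hypothetical, not the existing cyclops.query.ops.Apply API:

import cyclops.query.ops as qo

# Hypothetical extension: combine two input columns into one output column.
op = qo.Apply(
    cols=["systolic_bp", "diastolic_bp"],       # multiple input columns (hypothetical)
    func=lambda sys, dia: (sys + 2 * dia) / 3,  # F(col1, col2) -> single output
    new_col="mean_arterial_pressure",           # hypothetical output-column argument
)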
Using the public NIH dataset as a use case, support will be implemented for Chest X-Ray (CXR) images, and for dimensionality reduction and two-sample tests on large CXR image datasets.
This will be a placeholder for a task list of features (other issues) that will be incrementally implemented. I'll create branches specifically to work on these issues to track development.
cyclops/cyclops/process/feature/split.py
Line 21 in f5b2046
I was thinking the function could probably be both simplified and made easier to use if the fractions were normalized within the function, instead of forcing the user to ensure normalization before calling it. A minimal sketch follows.
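A short sketch of the idea as a hypothetical helper (not the actual split.py code):

import numpy as np

def normalize_fractions(fractions):
    """Scale the given split fractions so they sum to 1 (hypothetical helper)."""
    fractions = np.asarray(fractions, dtype=float)
    if np.any(fractions < 0) or fractions.sum() == 0:
        raise ValueError("Fractions must be non-negative and not all zero.")
    return fractions / fractions.sum()

print(normalize_fractions([8, 1, 1]))  # -> [0.8, 0.1, 0.1]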
I just came upon your framework and am very interested in its development.
I'm working on a research project in which we (currently) use MIMIC-IV.
The cyclops project looks quite interesting to use, especially for the ETL task.
However, since it's clearly still in the alpha stage, we are not sure if it makes sense to base our work on this library, since it's prone to change in the future.
It would be very helpful to have an outlook on the short- and long-term plans for this project.
Maybe we could also contribute, if your plans line up with our goals.
Therefore, I would highly appreciate a development roadmap :)
The current batching while querying is a bit finicky and the implementation is untested. While it works and has been used for the MIMIC use case, replacing it with dask would be a better choice; a sketch follows.
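A minimal sketch of the dask route, with a hypothetical connection URI and table name; dask handles the batching by partitioning the table on an indexed column:

import dask.dataframe as dd

ddf = dd.read_sql_table(
    "admissions",                         # hypothetical table name
    "postgresql://user:pass@host/mimic",  # hypothetical connection URI
    index_col="row_id",                   # indexed numeric column to partition on
    npartitions=16,
)
result = ddf.compute()  # each partition is read as a separate query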
TODO:
Either create a validation split inside the _train_loop method, or accept a predefined split when the fit method is called.

Currently, when instantiating an SKModel wrapper (https://github.com/VectorInstitute/cyclops/blob/main/cyclops/models/wrappers/sk_model.py#L56), model params in the form of a dict need to be specified. This is done by loading default configs from https://github.com/VectorInstitute/cyclops/tree/main/cyclops/models/configs and then passing them to the create_model factory method.
This places additional work on the user; default configs could instead be loaded in the wrapper directly, with an option for the user to override them. This pattern is used in the DatasetQuerier class (https://github.com/VectorInstitute/cyclops/blob/main/cyclops/query/base.py#L90), which uses hydra to load YAML configs. A sketch of that pattern follows.
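A minimal sketch of the hydra compose pattern; the config directory, config name, and override key are hypothetical:

from hydra import compose, initialize

# Load a packaged default config and let the caller override individual fields.
with initialize(version_base=None, config_path="configs"):
    cfg = compose(config_name="sgd_classifier", overrides=["params.alpha=0.01"])
print(cfg)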
Add an op using sqlalchemy to apply a regex and filter rows that match (including a NOT condition). A sketch of the underlying sqlalchemy call follows.
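A minimal sketch as a hypothetical helper, not the proposed cyclops op itself; regexp_match is available on SQLAlchemy column elements (1.4+), though backend regex support varies:

from sqlalchemy import select

def filter_regex(table, col, pattern, negate=False):
    """Hypothetical helper: keep rows where `col` matches `pattern`."""
    cond = table.c[col].regexp_match(pattern)
    if negate:
        cond = ~cond  # the NOT condition
    return select(table).where(cond)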
Many of the pandas operations used by the process API functions, such as groupby for aggregation, imputation, and normalization, are supported by dask. So a good feature enhancement would be the ability to load data as dask dataframes and call the process API functions, which would then call dask's compute method to process in batches.
Currently, aggregating over the same column using multiple aggregation functions is not supported. (See https://github.com/VectorInstitute/cyclops/blob/main/cyclops/query/ops.py#L1751)
We wish to support something like:
GroupByAggregate("person_id",
{
"lab_name": [
("string_agg", "lab_name_agg"),
("median", "lab_name_median")
},
{"lab_name": ", "}
)(table)
Considerations:
The separator could instead be included in the aggregation tuple, e.g. ("string_agg", "lab_name_agg", ", ").
The mypy configuration is currently a little lenient and doesn't complain when type annotations are missing from functions. A sketch of a stricter configuration follows.
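A minimal sketch of stricter settings, assuming the configuration lives in pyproject.toml (the project may keep it elsewhere):

[tool.mypy]
# Require type annotations on all function definitions
disallow_untyped_defs = true
# Flag functions whose annotations are only partially specified
disallow_incomplete_defs = true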