torchmetrics's People

Contributors

akihironitta, ananyahjha93, ashutoshml, awaelchli, borda, bryant1410, ddrevicky, deepsource-autofix[bot], dependabot[bot], edenlightning, ethanwharris, justusschock, karthikrangasai, lucadiliello, mahinlma, matsumotosan, maximsch2, pre-commit-ci[bot], quancs, reaganjlee, rohitgr7, skaftenicki, stancld, tadejsv, tchaton, teddykoker, tkupek, twsl, valerianrey, williamfalcon


torchmetrics's Issues

Some metrics don't work on CPU using float16

πŸ› Bug

It looks like some metrics such as Precision-Recall curve don't work on CPUs when using float16, perhaps due to a missing feature in pytorch?

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1xDv043rRi5WBshP4m5aoxTt2ChlfxjIk?usp=sharing
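For reference, a minimal repro outside the notebook might look something like the following sketch (assuming the functional precision_recall_curve API; untested here):

import torch
from torchmetrics.functional import precision_recall_curve  # assuming the functional API is available

preds = torch.rand(10).half()                 # half-precision predictions on CPU
target = torch.randint(0, 2, (10,))
# this is expected to raise on CPU, presumably because some of the ops involved
# (e.g. sorting/cumsum) lack float16 CPU kernels in PyTorch
precision, recall, thresholds = precision_recall_curve(preds, target)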

Expected behavior

The metrics should work in half precision on CPUs as well.

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.4
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.1.2
    • tqdm: 4.41.1
  • System:

Add a property to the Metric class from which it can be determined whether it can be passed to self.log (scalar or not)

🚀 Feature

Add a property to Metric which can be checked to see whether it can be logged or not.
Or better, what the computed shape will be.

Motivation

So far all Metrics in PL v1.0.x compute a scalar. The recommended way therefore is to call:

metric(predictions, targets)
self.log("some_name", metric)

which has worked up until now.
However, with upcoming metrics like ConfusionMatrix, the computed value returned is not necessarily a scalar, which will result in a ValueError when trying to log it.

If you have multiple metrics, the code-efficient approach would be to loop over them, e.g.:

for m in self.metrics:
    self.log("metric_name", m)

Adding a Metric that does not return a scalar will break this code.

Pitch

These are some ideas (a rough sketch follows the list), but probably there is something better.

  • Add a property which can be checked against (e.g. scalar: True/False, loggable: True/False)
  • Add a computed_shape property, so we can check if the computed value is either (1, ) or 1
  • Add some new logic to self.log() to deal with non-scalar Metrics.
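As a rough illustration only (not the existing torchmetrics API), such a property could look something like this on the base class, and the logging loop above could then skip metrics for which it returns False:

import torch

class Metric:  # sketch only, standing in for the real torchmetrics base class
    ...

    @property
    def is_loggable(self) -> bool:
        """Proposed property: True if compute() returns a scalar that self.log() can handle."""
        value = self.compute()
        return isinstance(value, torch.Tensor) and value.numel() == 1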

Alternatives

This is a solution that would likely work in most cases, except if on-step compute is turned off:

val = metric(predictions, targets)
if val.numel() == 1:  # only scalars
    self.log("some_name", metric)

Additional context

Related discussion on the PyTorch Lightning forums:
https://forums.pytorchlightning.ai/t/logging-a-tensor/320

@SkafteNicki

Retrieval metrics problem with PyTorch Lightning integration in compute()

🐛 Bug

I use the commit f06488f to calculate RetrievalMAP and RetrievalPrecision in a pytorch-lightning module. The validation_step and validation_step_end functions work, but running with fast_dev_run=True gives an error in the compute() step.

However, when I run self.log("val_MAP", self.metric.compute()) instead of self.log("val_MAP", self.metric) in validation_step_end I do not get errors. But computing the whole metric becomes very slow if it is done every validation_step.

To Reproduce

Steps to reproduce the behavior:

I run the following code with the mentioned commit.

Code sample

from typing import Optional
import os
import torch
from torch import nn
import torch.nn.functional as F
from torchvision import transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
import pytorch_lightning as pl

from torchmetrics import (
    RetrievalMAP,
    RetrievalPrecision,
    MeanAbsoluteError,
)


class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, batch_size=32):
        super().__init__()
        self.batch_size = batch_size

    def prepare_data(self):
        MNIST(os.getcwd(), train=True, download=True)
        MNIST(os.getcwd(), train=False, download=True)

    def setup(self, stage: Optional[str] = None):
        transform = transforms.Compose(
            [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
        )
        if stage == "fit":
            mnist_train = MNIST(os.getcwd(), train=True, transform=transform)
            self.mnist_train, self.mnist_val = random_split(
                mnist_train, [55000, 5000]
            )
        if stage == "test":
            self.mnist_test = MNIST(
                os.getcwd(), train=False, transform=transform
            )

    def train_dataloader(self):
        mnist_train = DataLoader(self.mnist_train, batch_size=self.batch_size)
        return mnist_train

    def val_dataloader(self):
        mnist_val = DataLoader(self.mnist_val, batch_size=self.batch_size)
        return mnist_val

    def test_dataloader(self):
        mnist_test = DataLoader(self.mnist_test, batch_size=self.batch_size)
        return mnist_test


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus()
        )
        self.decoder = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 28 * 28)
        )
        # self.metric = RetrievalMAP()
        self.metric = RetrievalPrecision()
        # self.metric = MeanAbsoluteError()

    def forward(self, x):
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        preds = self.encoder(x).squeeze()

        indexes = torch.randint(100, size=preds.size())
        targets = torch.randint(2, size=preds.size()).to(bool)

        return {"indexes": indexes, "preds": preds, "targets": targets}

    def validation_step_end(self, outputs):
        self.metric(outputs["indexes"], outputs["preds"], outputs["targets"])
        # self.metric(outputs["preds"], outputs["preds"] ** 2)
        self.log("val_MAP", self.metric)
        # self.log("val_MAP", self.metric.compute())

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


if __name__ == "__main__":
    datamodule = MNISTDataModule()
    module = LitAutoEncoder()

    trainer = pl.Trainer(gpus=1, fast_dev_run=True)

    trainer.fit(module, datamodule=datamodule)
    trainer.test(module, datamodule=datamodule)

StackTrace

Traceback (most recent call last):
  File "reproduce_retrieval_error.py", line 107, in <module>
    trainer.fit(module, datamodule=datamodule)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 577, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 754, in run_evaluation
    eval_loop_results = self.evaluation_loop.log_epoch_metrics_on_evaluation_end()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 200, in log_epoch_metrics_on_evaluation_end
    eval_loop_results = self.trainer.logger_connector.get_evaluate_epoch_results()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 286, in get_evaluate_epoch_results
    metrics_to_log = self.cached_results.get_epoch_log_metrics()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 405, in get_epoch_log_metrics
    return self.run_epoch_by_func_name("get_epoch_log_metrics")
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 398, in run_epoch_by_func_name
    results = [func() for func in results]
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 398, in <listcomp>
    results = [func() for func in results]
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 128, in get_epoch_log_metrics
    return self.get_epoch_from_func_name("get_epoch_log_metrics")
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 121, in get_epoch_from_func_name
    self.run_epoch_func(results, opt_metrics, func_name, *args, **kwargs)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 110, in run_epoch_func
    metrics_to_log = func(*args, add_dataloader_idx=self.has_several_dataloaders, **kwargs)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/core/step_result.py", line 327, in get_epoch_log_metrics
    result[dl_key] = self[k].compute().detach()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/torchmetrics/metric.py", line 228, in wrapped_func
    self._computed = compute(*args, **kwargs)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/torchmetrics/retrieval/retrieval_metric.py", line 110, in compute
    idx = torch.cat(self.idx, dim=0)
RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat.  This usually means that this function requires a non-empty list of Tensors.  Available functions are [CPU, CUDA, QuantizedCPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

CPU: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:5925 [kernel]
CUDA: registered at /pytorch/build/aten/src/ATen/RegisterCUDA.cpp:7100 [kernel]
QuantizedCPU: registered at /pytorch/build/aten/src/ATen/RegisterQuantizedCPU.cpp:641 [kernel]
BackendSelect: fallthrough registered at /pytorch/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at /pytorch/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
AutogradOther: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradCPU: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradCUDA: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradXLA: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradNestedTensor: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
UNKNOWN_TENSOR_TYPE_ID: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse1: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse2: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse3: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
Tracer: registered at /pytorch/torch/csrc/autograd/generated/TraceType_2.cpp:10525 [kernel]
Autocast: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:254 [kernel]
Batched: registered at /pytorch/aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at /pytorch/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]

Expected behavior

This error should not show up. I expect the metric to be computed correctly. When I use MeanAbsoluteError as the metric, the code works. Therefore, there must be a bug in the compute step of the retrieval metrics in combination with pytorch-lightning's API, as a call to compute() within validation_step_end does not create errors.

Environment

  • PyTorch Version (e.g., 1.0): 1.8.1+cu102
  • OS (e.g., Linux): Ubuntu on WSL2
  • How you installed PyTorch (conda, pip, source): pip / poetry
  • Build command you used (if compiling from source):
  • Python version: 3.8.8
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: 1 GPU
  • Any other relevant information:

MinMaxMetric for wrapping other metrics

🚀 Feature

Motivation

  • MinMaxMetric is a metric that simply wraps another metric (e.g. val_acc) and creates a new metric that tracks the min, max, or both values of val_acc.

Pitch

  • I personally use it to quickly see the max_val_acc of a complete experiment in TensorBoard (instead of going through the graph manually to find the max value), but I can see other use cases as well.
  • It was discussed in the PL Slack here and clearly resonated with other users.

Additional context

  • Happy to submit a PR for this feature, as I already have an (incomplete) MaxMetric implementation here (a rough sketch of the wrapper idea follows).
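As a rough sketch of the idea (assuming the torchmetrics Metric base class, that dist_reduce_fx accepts "min"/"max" reductions as in recent versions, and that the wrapped metric computes a scalar tensor):

import torch
from torchmetrics import Metric

class MinMaxMetric(Metric):
    """Wraps another metric and additionally tracks the min/max of its computed value."""

    def __init__(self, base_metric: Metric):
        super().__init__()
        self.base_metric = base_metric
        self.add_state("min_val", default=torch.tensor(float("inf")), dist_reduce_fx="min")
        self.add_state("max_val", default=torch.tensor(float("-inf")), dist_reduce_fx="max")

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        self.base_metric.update(preds, target)

    def compute(self) -> dict:
        val = self.base_metric.compute()
        self.min_val = torch.minimum(self.min_val, val)
        self.max_val = torch.maximum(self.max_val, val)
        return {"raw": val, "min": self.min_val, "max": self.max_val}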

MetricLists for updating multiple metrics at once

🚀 Feature

Motivation

I have been using my own version of MetricList in my personal workflow for some time now, and it has proven very helpful in keeping code clean.

A MetricList wraps multiple metrics together and puts them on the proper devices (much like a ModuleList). What makes it different is that it also lets you update all of them with one call, compute() all of them at once, and log all of them using one log() call.

Pitch

My dynamic inference model needs its val_acc tested in 32 different setups. Manually creating all the different Accuracy() metrics is ridiculous. ModuleList() helps to create them in batch, but I still need to write helper functions to log() or compute() all of them separately (a rough sketch of the idea follows).
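A minimal sketch of the idea, assuming all wrapped metrics accept the same update arguments:

from torch import nn

class MetricList(nn.ModuleList):
    """Updates and computes all contained metrics with a single call."""

    def update(self, *args, **kwargs) -> None:
        for metric in self:
            metric.update(*args, **kwargs)

    def compute(self) -> list:
        return [metric.compute() for metric in self]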

Alternatives

See pitch.

Smart update of Collection of CompositionalMetrics

🚀 Feature

When updating metrics that are composed of other metrics, there are two ways of dealing with updating too many times:

I don't think there is a clean way of only updating the necessary metrics in the general case (when you're just updating all the metrics yourself), but I think that when you combine your metrics in a collection, it could be useful to only update the "base" metric, instead of all metrics.

Motivation

I often want to use a base metric multiple times, and then I have to be careful not to update too many of them. A somewhat convoluted example (because the f1 score is already implemented):

prec = Precision()
recall = Recall()
f1 = 2 * (prec * recall) / (prec + recall)
prec.update(pred, gt)
recall.update(pred, gt)
f1.update(pred, gt) # Shouldn't do this, because it updates prec and recall twice. 

Pitch

Continuing the last example:

collection = MetricCollection([prec, recall, f1])
collection.update(pred, gt)

This should only update prec and recall once.

Alternatives

The alternative is to always define metrics from scratch, but this causes duplication of computation during the update phase.

Multi-label ROCs

🚀 Feature

Similarly to issue #100, it would be nice to make roc work with multi-label inputs.

Motivation & Pitch

auc and hence _auroc_compute do work with multi-label inputs and return an AUROC value for each label/class by iterating over range(num_classes) when passing average=None.
_roc_compute and hence roc differentiate only between binary and multi-class (by checking if num_classes == 1).

I would expect _roc_update to similarly determine the mode using _input_format_classification(preds, target), and roc to return a list of [fpr, tpr, threshold] of length num_classes.
The easiest would be the format [[fpr, tpr, thres]]*5

Metrics support mask

🚀 Feature

It would be nice if current metrics like Accuracy/Recall supported a mask.

Motivation

For example, when I deal with a sequence labeling task and pad sequences to the max length, I do not want to calculate metrics at the padding locations.

Pitch

I guess a simple manipulation would work for accuracy (here is the original one):

from typing import Any, Optional

import torch
from pytorch_lightning.metrics.functional.classification import (
    accuracy,
)
from pytorch_lightning.metrics.metric import TensorMetric


class MaskedAccuracy(TensorMetric):
    """
    Computes the accuracy classification score
    Example:
        >>> pred = torch.tensor([0, 1, 2, 3])
        >>> target = torch.tensor([0, 1, 2, 2])
        >>> mask = torch.tensor([1, 1, 1, 0])
        >>> metric = MaskedAccuracy(num_classes=4)
        >>> metric(pred, target, mask)
        tensor(1.)
    """

    def __init__(
        self,
        num_classes: Optional[int] = None,
        reduction: str = 'elementwise_mean',
        reduce_group: Any = None,
        reduce_op: Any = None,
    ):
        """
        Args:
            num_classes: number of classes
            reduction: a method for reducing accuracies over labels (default: takes the mean)
                Available reduction methods:
                - elementwise_mean: takes the mean
                - none: pass array
                - sum: add elements
            reduce_group: the process group to reduce metric results from DDP
            reduce_op: the operation to perform for ddp reduction
        """
        super().__init__(name='accuracy',
                         reduce_group=reduce_group,
                         reduce_op=reduce_op)
        self.num_classes = num_classes
        self.reduction = reduction

    def forward(self, pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """
        Actual metric computation
        Args:
            pred: predicted labels
            target: ground truth labels
            mask: only calculate metrics where mask==1
        Return:
            A Tensor with the classification score.
        """
        mask_fill = (1-mask).bool()
        pred = pred.masked_fill_(mask=mask_fill, value=-1)
        target = target.masked_fill_(mask=mask_fill, value=-1)

        return accuracy(pred=pred, target=target,
                        num_classes=self.num_classes, reduction=self.reduction)

Alternatives

Additional context

Implement __getitem__ as "metric arithmetic"

🚀 Feature

Allow a user to define a new metric that takes an item out of another metric.

Basically:

iou = IoU(num_classes=2, reduction="none")
fg_iou = iou[0]
bg_iou = iou[1]

Motivation

There are multiple metrics (like IoU and confusion matrix) that would benefit from the use of such a feature, and it is close to the mechanism of metric arithmetic.

Pitch

This would only need to define

class Metric:
    ...
    def __getitem__(self, idx):
        return CompositionalMetric(lambda x: x[idx], self, None)

Alternatives

The straightforward alternative is to use CompositionalMetric directly.

Unable to call metric from any step in Lightning module

πŸ› Bug

I implemented my own Metric class whose compute returns a data class with some aggregated metrics -- precision, recall, and f1-score. But when I try to call the metric inside *_step I get an error from PyTorch internals.

The error happens in this line. If I call the validation metric (initialized with compute_on_step=False) during validation_step I get:

TypeError: 'NoneType' object is not subscriptable

In the case of training metric during training_step:

TypeError: 'ClassificationMetrics' object is not subscriptable

ClassificationMetrics is the name of my data class.

I also tried to return a float from compute, but it causes the same error. I assume that PyTorch expects to receive a tensor and is therefore trying to index into the returned value. An obvious solution is to return a tensor from compute, but that doesn't fix calling the validation metric, which doesn't return anything on step.
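For reference, the pattern looks roughly like this (a hypothetical sketch; the names and the exact update logic are illustrative, the key point being that compute() returns a data class rather than a tensor):

from dataclasses import dataclass

import torch
from torchmetrics import Metric


@dataclass
class ClassificationMetrics:  # aggregated results, as described above
    precision: float
    recall: float
    f1: float


class AggregatedClassification(Metric):  # hypothetical metric
    def __init__(self, compute_on_step: bool = True):
        super().__init__(compute_on_step=compute_on_step)
        self.add_state("tp", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("fp", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("fn", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        preds = (preds > 0.5).int()
        self.tp += ((preds == 1) & (target == 1)).sum()
        self.fp += ((preds == 1) & (target == 0)).sum()
        self.fn += ((preds == 0) & (target == 1)).sum()

    def compute(self) -> ClassificationMetrics:
        precision = self.tp / (self.tp + self.fp)
        recall = self.tp / (self.tp + self.fn)
        f1 = 2 * precision * recall / (precision + recall)
        # returning a non-tensor here is what self.log() later chokes on
        return ClassificationMetrics(precision.item(), recall.item(), f1.item())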

Environment

  • PyTorch Version (e.g., 1.0): 1.8.0
  • OS (e.g., Linux): MacOS BigSur
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.9.2
  • CUDA/cuDNN version: -
  • GPU models and configuration: -
  • Any other relevant information: pytorch-lightning (1.1.7) / torchmetrics (0.2.0)

Functional Confusion Matrix with Multi-Label

πŸ› Bug

I am trying to analyze a model that makes multi-label predictions. When creating a confusion matrix with the functional confusion_matrix method, I get a very different result than expected. I may be misunderstanding how this is supposed to work, so any help would be appreciated!

To Reproduce

Steps to reproduce the behavior:

  1. Predict multi-label data that has had torch.sigmoid applied to the output (N,C) and have a matching shape truth data.
  2. Use the functional confusion_matrix method on the data

Code sample

>>> from torchmetrics.functional import confusion_matrix
>>> import torch
>>> x = torch.tensor([[.4,.5,.6,.7],[.3,.4,.7,.1]])
>>> y = torch.tensor([[0,0,0,1],[0,1,0,0]], dtype=torch.int32)
>>> confusion_matrix(x, y, num_classes=4, normalize='none')
tensor([[3., 3., 0., 0.],
        [1., 1., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

Expected behavior

I would expect the confusion matrix to count the classes that were predicted for each true class, but I may be wrong:

tensor([[0, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [0, 1, 1, 1]])

Environment

  • PyTorch Version (e.g., 1.0): 1.7
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): conda
  • Python version: 3.8.8
  • CUDA/cuDNN version: 11.03
  • GPU models and configuration: Nvidia Tesla V100

Thanks for the great project and help!!

Cohen Kappa Score and Matthews Correlation Coefficient Metrics

🚀 Feature

I would like to request the (re-) implementation of the Cohen Kappa score and the new implementation of the Matthews Correlation Coefficient (MCC) in PyTorch Lightning's metrics.

Motivation

The Cohen Kappa and MCC are often used metrics in classification tasks, especially in a medical setting to determine such things as inter-grader reliability. The Kappa score was originally implemented in PyTorch Lightning 0.9 but has disappeared for some reason. The MCC is often seen as the best metric to use in highly imbalanced datasets. The addition of these two metrics would make it more convenient to use PyTorch Lightning for medical tasks and other tasks that involve ground truth uncertainty and imbalanced data.

Pitch

Implementation of the Cohen Kappa and MCC as metrics in PyTorch Lightning. Both metrics are already available in scikit-learn.

Alternatives

Cannot think of any.

Additional context

None.

Include AverageMeter?

One common pattern I've seen copy-pasted across many different projects is a generic AverageMeter, which tracks the running average of a quantity.

This isn't strictly a "metric", but I'm wondering whether you'd be open to having an implementation in this metrics repository -- it's quite common and having it in a centralized place could be helpful. If you are open to it, I'd be happy to contribute an implementation.

import numpy as np
import torch


class AverageMeter(object):
    """Computes and stores the average and current value"""

    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0.
        self.avg = 0.
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        # convert tensors/arrays to plain floats before accumulating
        self.val = float(val.item()) if isinstance(
            val, (np.ndarray, torch.Tensor)) else val
        self.sum += self.val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)

Bootstrap wrapper for metrics?

🚀 Feature

We should provide the ability to compute bootstrapped confidence intervals for metrics.

Motivation

Confidence intervals are important and we should make it easy for people to increase the rigor of their research and model evaluations.

Pitch

I'm thinking we can have something like this (very high level):

from copy import deepcopy
from torch import nn
from torchmetrics import Metric

class Bootstrapper(Metric):
    def __init__(self, num_samples, metric):
        super().__init__()
        self.num_samples = num_samples
        self.metrics = nn.ModuleList([deepcopy(metric) for _ in range(num_samples)])

    def update(self, preds, targets):
        for idx in range(self.num_samples):
            # sample_for_bootstrap is a placeholder for resampling with replacement
            preds_sampled, targets_sampled = sample_for_bootstrap(preds, targets)
            self.metrics[idx].update(preds_sampled, targets_sampled)

which will let people wrap any metric and have a set of copies of the metric internally updated with different samples of the data, giving us the ability to get a distribution of metric values.

Alternatives

We can skip it on the class-based metrics side and assume anyone doing bootstrap will load everything in memory and do bootstrap using functional metrics.

Add Deviance scores

Non-Softmaxed Classification

🚀 Feature

Motivation

Support other input types for classification metrics, i.e. non-softmaxed network outputs.

For outputs of categorical classification, it does not matter if the output is softmaxed or not. The argmax of these tensors is the same. Can we support those by simply taking an argmax even if the values are out of the 0-1 range?
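A tiny example of the invariance being relied on here:

import torch

logits = torch.tensor([[2.0, -1.0, 0.5], [0.1, 3.0, -2.0]])  # raw, unnormalized scores
probs = logits.softmax(dim=-1)
# the predicted class is the same either way, so argmax-based metrics would not change
assert torch.equal(logits.argmax(dim=-1), probs.argmax(dim=-1))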

cc @SkafteNicki on whether we want to support this.

Change order of updates in metric forward to increase efficiency

🚀 Feature

Refactor Metric.forward() to call update only once.

Motivation

The update() method of Metric is called twice in forward() when compute_on_step is True.
This means repeated computation, which can slow down execution. For example, I have a custom SmoothL1Metric whose update function calculates the element-wise L1 distance (see below). The problem arises when the tensors on which the metric is computed have many dimensions and the computation itself is slow.

class SmoothL1Metric(Metric):
    def __init__(self, mask_dim, dist_sync_on_step: bool = False, compute_on_step: bool = True):
        super().__init__(dist_sync_on_step=dist_sync_on_step, compute_on_step=compute_on_step)
        self.loss = torch.nn.SmoothL1Loss(reduction="sum")
        self.mask_dim = mask_dim

        self.add_state("sum", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("numel", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, input, target, lens):
        mask = get_mask(input, lens, self.mask_dim).type(input.dtype)
        # this is a heavy computation that should not be executed twice
        self.sum += self.loss(input * mask, target * mask)
        self.numel += mask.sum()

    def compute(self):
        return self.sum / self.numel

Suggestion

How about something like:

def forward(self, *args, **kwargs):

    if self.compute_on_step:
        self._to_sync = self.dist_sync_on_step

        # save context before switch
        cache = {attr: getattr(self, attr) for attr in self._defaults.keys()}

        # call reset, update, compute, on single batch
        self.reset()
        self.update(*args, **kwargs)
        self._forward_cache = self.compute()

        # merge new and old context without recomputing update
        for attr, val in cache.items():
            setattr(self, attr, self._reductions[attr](val, getattr(self, attr)))
    else:
        with torch.no_grad():
            self.update(*args, **kwargs)
        self._forward_cache = None

    return self._forward_cache

The code probably does not work now, but the idea should be clear. What do you think?

Metrics support for sweeping

🚀 Feature

We would like to have tighter integration of metrics and sweeping. This requires a few features:

  1. Knowing if higher_is_better (e.g. are we trying to minimize or maximize the metric in a sweep)
  2. Knowing what value to optimize for. E.g. if a recall@precision metric returns both the recall value and the corresponding threshold, we want to optimize by maximizing recall and ignoring the threshold (a rough sketch of how a sweeper could use this follows the list).
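For illustration only, a sweeper could consume such a flag roughly like this (higher_is_better is the proposed attribute and does not exist on Metric today):

from torchmetrics import Metric

def is_improvement(metric: Metric, new_value: float, best_value: float) -> bool:
    # fall back to "higher is better" if the proposed attribute is missing
    if getattr(metric, "higher_is_better", True):
        return new_value > best_value
    return new_value < best_value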

Alternatives

An alternative implementation would be for each metric to have is_better(left: TMetricResult, right: TMetricResult), where TMetricResult is whatever compute returns.

If we don't have it, people will have to write wrappers around the metrics to support this functionality in sweepers.

Formalize task type?

🚀 Feature

Let's have a formal system of task types. Things like BinaryClassificationTask, MultiClassClassificationTask, MultilabelClassificationTask, etc.

Motivation

  1. We are seeing slowdowns from format checking: Lightning-AI/pytorch-lightning#6605
  2. We would like to be able to do more sanity checking that metrics specified by a LightningModule are a correct fit for the task.

Pitch

Add a type hierarchy of possible task types. Each task is defined by the type signature of the (predictions, labels) tuple and the semantics inside it (e.g. multiclass and multilabel have the same shape, but different semantics).

Then, each metric takes a task_type and can assume that predictions/labels conform to it. If we want to add checking at run time, each type can provide a class method (e.g. BinaryClassificationTask.validate_input) that can be enabled for checking on an opt-in basis.
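A very rough sketch of what such task types might look like (all names hypothetical):

import torch

class BinaryClassificationTask:
    """preds and target are both shape (N,); preds are probabilities of the positive class."""

    @classmethod
    def validate_input(cls, preds: torch.Tensor, target: torch.Tensor) -> None:
        # opt-in runtime checking, as proposed above
        if preds.ndim != 1 or preds.shape != target.shape:
            raise ValueError("binary task expects 1d preds and target of equal length")
        if not torch.all((preds >= 0) & (preds <= 1)):
            raise ValueError("binary task expects probabilities in [0, 1]")

class MultilabelClassificationTask:
    ...  # same (N, C) shape as multiclass, different semantics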

Alternatives

  1. People building reusable frameworks implement task types on their own as wrappers around TorchMetrics.
  2. TorchMetrics continue to have format checking in each metric.

Allow Accuracy to return metric per class

🚀 Feature

Implement the average argument like in Precision and Recall, such that the Accuracy metric can return the metric per class label.

Motivation

Sometimes it may be beneficial to look at the accuracy per label, especially when working with very unbalanced datasets.

Pitch

Alternatives

Additional context

Add Specificity

🚀 Feature

In addition to Precision and Recall it would be nice to have a Specificity metric.

For the implementation I think it would be enough to make a copy of Recall (class and function) and adapt the numerator and denominator in _precision_compute.

Alternatives

For binary classification, Specificity is the same as Recall with 0 treated as the positive label.
For multiclass classification it is not as straightforward, though.

Add gpu/multi-gpu testing

🚀 Feature

Currently we only test the metrics on single CPU and distributed CPU. While we have had no explicit issues that link back to the metrics not being tested on GPU, we should do it anyway.

Motivation

Pitch

Alternatives

Additional context

register conda forge

🚀 Feature

Publish package also to Conda distribution

Motivation

Allow user to install from any source

Additional context

You can check the documentation at https://conda-forge.org/docs/maintainer/adding_pkgs.html. It's actually very easy. In short, you must submit a PR to https://github.com/conda-forge/staged-recipes. Once the CI is green you can ping the conda-forge folks and they will review it. Once done, the feedstock will be created and your package built and uploaded to conda-forge.

Constant-memory implementation of precision-recall related metrics

🚀 Feature

Metrics that depend on the precision-recall curve are currently implemented in a way that requires storing all of the predictions and labels in memory, making their use impractical for large datasets or problems with large label spaces. We should support a binning-based metrics implementation to solve this. A prototype is here: https://gist.github.com/maximsch2/2b55bab6deba629a5686258cb8152e53
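The binning idea, very roughly (a sketch of the approach only; the linked gist is the actual prototype):

import torch

def binned_pr_counts(preds: torch.Tensor, target: torch.Tensor, num_thresholds: int = 100):
    """Accumulate TP/FP/FN counts per threshold bin instead of storing every prediction."""
    thresholds = torch.linspace(0, 1, num_thresholds)
    above = preds.unsqueeze(0) >= thresholds.unsqueeze(1)  # (num_thresholds, N)
    is_pos = target.bool().unsqueeze(0)
    tp = (above & is_pos).sum(dim=1)
    fp = (above & ~is_pos).sum(dim=1)
    fn = (~above & is_pos).sum(dim=1)
    # a metric would add these counts into fixed-size states, so memory stays
    # O(num_thresholds) regardless of dataset size
    return tp, fp, fn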

Alternatives

Don't do anything and be restricted in the scalability of metrics.

Another option for scaling is making it easier to keep metrics off-GPU.

A possible question is whether we want to have both raw and binned implementations of the metrics.

Additional context

Keras provides binning-based implementation by default: https://keras.io/api/metrics/classification_metrics/#auc-class

Test for differentiability

🚀 Feature

Add a property from which the user can determine whether a metric is differentiable or not

@property
def is_differentiable(self):
    return True/False

and add appropriate tests. We can take inspiration from what kornia is doing:
https://github.com/kornia/kornia/blob/master/test/color/test_gray.py#L69

Motivation

Some metrics support differentiability, some do not. It would be great if we were more explicit about it and actually had tests for it.

Pitch

Alternatives

Additional context

In DDP training, running ROC.compute() drives the GPUs to 100% usage and hangs the training process

🐛 Bug

To Reproduce

Following the sample code at https://github.com/PyTorchLightning/metrics, we use metric = torchmetrics.ROC() and attach it as model.roc_metric = metric.

In the test epoch we call metric.update(output, target), and after the test epoch we run metric.compute().

This hangs the training process and puts both GPUs at 100% usage.

Btw, using the metric code shipped inside pytorch_lightning shows the same issue as the standalone package.

Environment

  • PyTorch Version: 1.7.0+cu101
  • OS: Linux (Ubuntu 18.04)
  • How you installed PyTorch: pip
  • torchmetrics Version: 0.2.0
  • Python version: 3.6
  • CUDA/cuDNN version: 10.1 / 7.6.5.32-1+cuda10.1
  • GPU models and configuration: two 2080 Ti

Allow MetricCollection to combine calculations

🚀 Feature

Allow MetricCollection to combine metrics internally to reduce redundant computations.

Motivation

Many metrics currently share the same redundant computations underneath. Take Recall and Precision for example: they will both calculate tp, fp, tn, fn during their update step and then use them differently during the compute step. We have chosen to do it this way to make the API simple.

However, we could implement it so that if two metrics with the same update states are collected in a MetricCollection, only one metric is updated and its state is broadcast to the other metrics.

Keeping track of which metrics can be combined could probably be done with some kind of registry (a rough sketch follows below):

@metric_group(Recall, Precision, F1, FBeta)
@metric_group(MeanSquaredError, PSNR)
...
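A rough sketch of what such a registry could be (names hypothetical; registration shown as a plain call rather than a decorator):

_METRIC_GROUPS: list = []

def metric_group(*metric_classes) -> None:
    """Record metric classes that share the same update states."""
    _METRIC_GROUPS.append(set(metric_classes))

# e.g. metric_group(Recall, Precision, F1, FBeta); MetricCollection could then update
# only one representative per group and copy its states to the rest.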

Pitch

Alternatives

Additional context

Update class metrics interface of Precision/Recall/Fbeta to allow calculate them for each individual class

🚀 Feature

I'd like to propose to update class metrics interface of Precision/Recall/Fbeta to have the average argument include none and weighted as in the corresponding functional metrics interface.

Motivation

The current interface restricts the average argument to macro and micro, and because of that one cannot use the class metrics interface to calculate precision/recall/fbeta for an individual class. For example, in binary classification one is typically interested in getting metric results for the positive class (class 1), and this cannot be done with the current class interface. Therefore one has to go back to the functional metrics, which could defeat the purpose of having class metrics (to take care of DDP sync).

By contrast, sklearn defaults to calculating precision/recall/fbeta for the individual class (class 1) while giving the option to calculate the micro/macro/weighted average of these scores.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

Pitch

Update class metrics interface of Precision/Recall/Fbeta to have the average argument include none and weighted as in the corresponding functional metrics interface.

Alternatives

One can always fall back to the functional metric but I assume this is not what we would like.

Additional context

Really like the new class interface to work with DDP and appreciate all your work!

Add Negative predictive value

🚀 Feature

In addition to Precision and Recall it would be nice to have a Negative Predictive Value metric.

For the implementation I think it would be enough to make a copy of Precision (class and function) and adapt the numerator and denominator in _precision_compute.

Alternatives

For binary classification, the Negative Predictive Value is the same as Precision with 0 treated as the positive label.
For multiclass classification it is not as straightforward, though.

Allow unnormalized class scores for Accuracy

🚀 Feature

Presently, when using Accuracy metric on multi-class with scores (N,C entry in input types), the scores are required to be probabilities in [0, 1].

However, un-thresholded accuracy can be computed without normalized probabilities as inputs, as relative ordering of scores is all that is needed.

Given that some uses of Accuracy do require normalized probabilities, we could implement this as a flag that would disable the input check.

Motivation

It is common to work with unnormalized class scores during training, especially during classification tasks, as they are used in the more-stable nn.CrossEntropyLoss. Rather than having to additionally compute a softmax just for the accuracy metric, it would be reasonable to allow usage of arbitrarily scaled input data.

I specify Accuracy because it is the use case that I ran into, but it's possible other Metrics have the same property.

Pitch

Add a flag to Accuracy (and any other applicable metrics) that disables the input range check for preds.

Alternatives

The present workaround is to apply a softmax before feeding data to your Accuracy metric.
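For example (assuming the class-based Accuracy of this era):

import torch
from torchmetrics import Accuracy

logits = torch.tensor([[2.0, -1.0], [0.3, 0.9]])  # unnormalized class scores, shape (N, C)
target = torch.tensor([0, 1])
accuracy = Accuracy()
acc = accuracy(logits.softmax(dim=-1), target)  # normalize before updating the metric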

Additional context

https://github.com/PyTorchLightning/pytorch-lightning/blob/0456b4598f5f7eaebf626bca45d563562a15887b/pytorch_lightning/metrics/functional/accuracy.py#L25

Add contribution guidelines

📚 Documentation

All built-in metrics follow a very fixed structure:

  • implement core logic in new_metric.py file and place that in the functional folder
    • should contain a _new_metric_update, _new_metric_compute and new_metric function
  • implement the corresponding new_metric.py file in the appropriate class based folder
    • should inherit from the Metric class
    • should call the functional counterpart
  • implement test
    • should test directly against a trusted library
    • should use the MetricTester class object for testing
    • should test different input and different arguments (if any)

This should be clear from the already implemented metrics, but it could be made very clear in the contribution guidelines.

Improve test utilities to accept metrics with more input arguments

🚀 Feature

Improve test utilities to accept metrics with a variable number of arguments (at the moment only 2 args are allowed).

Motivation

At the moment the test utilities accept only two input arguments: preds and target. Some metrics, like RetrievalMAP and RetrievalMRR, require a different number of arguments.

Offer a dedicated sync() interface on the base Metric class

🚀 Feature

Offer a dedicated sync() interface on the base Metric class. This would consolidate state across a provided process group using a given dist_sync_fn and would let us deprecate the dist_sync_on_step flag on the metric constructor.

Motivation

The reason we'd like this is to decouple metric computation and global syncing. As a result, we'd be able to inspect the local metric state separately from the synced state.
Example scenario:

  • We're training on a large number of nodes
  • We wish to create a metric to track the local state during training steps, as syncing each step will be incredibly expensive
  • At the end of the epoch, we want to sync the state once and log this value.

This interface also enables the training framework to offer higher-level APIs that could automatically call sync() for a particular Metric at relevant spots in the training loop (e.g. on_step, or on_epoch in Lightning).

cc @maximsch2

Pitch

We should be able to re-use most of _sync_dist() already.
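A toy sketch of the proposed interface (not the real torchmetrics API; a real implementation would presumably reuse the _sync_dist() machinery rather than calling torch.distributed directly):

import torch
import torch.distributed as dist
from torchmetrics import Metric

class SyncableSum(Metric):
    """Toy metric sketching an explicit, on-demand sync()."""

    def __init__(self):
        super().__init__()
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, value: torch.Tensor) -> None:
        self.total += value.sum()

    def compute(self) -> torch.Tensor:
        return self.total

    def sync(self, process_group=None) -> None:
        # consolidate local state across processes once, on demand,
        # instead of implicitly on every step via dist_sync_on_step
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(self.total, op=dist.ReduceOp.SUM, group=process_group)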

Alternatives

Keep as is

F1 and Precision/Recall values are not consistent.

🐛 Bug

The values returned by f1 and precision are wrong.

To Reproduce

    from pytorch_lightning.metrics.functional import *
    y_pred = torch.Tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    y_true = torch.Tensor([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
    tp, fp, tn, fn, _ = stat_scores(y_pred, y_true, 1)  # tp, fp, tn, fn = [8, 8, 0, 0] if 0 is positive
    p = precision(y_pred, y_true, 2)  # returns 0.5; tp/(tp+fp) = 0.5, but if 1 is taken as positive, precision should be 0
    r = recall(y_pred, y_true, 2)  # returns 0.5, but tp/(tp+fn) should be 1
    f1_score = f1(y_pred, y_true, 2)  # returns 0, which is not right either

As mentioned above, if we take 0 as the positive class, then tp, fp, tn, fn = [8, 8, 0, 0], precision will be 0.5, and recall should be 1. But the precision() method gets a 0.5 output.

Expected behavior

The values should be consistent. And a parameter that allows making any class the positive one (like sklearn) would be easier to use.

Environment

  • python = 3.8.5
  • pytorch-lightning = 1.1.6
  • pytorch = 1.7

Add testing against each feature PT version

🚀 Feature

Add a conda setup for testing against all PyTorch feature releases such as 1.4, 1.5, 1.6, ...

Motivation

Have better validation in case some functions are not supported in old PT versions.

Pitch

Alternatives

Use a CI action with a conda setup; there is probably no need to pull a large Docker image.

Additional context

Take inspiration from the past conda matrix in PL.
