torchmetrics's People

Contributors

akihironitta, ananyahjha93, ashutoshml, awaelchli, borda, bryant1410, ddrevicky, deepsource-autofix[bot], dependabot[bot], edenlightning, ethanwharris, justusschock, karthikrangasai, lucadiliello, mahinlma, matsumotosan, maximsch2, pre-commit-ci[bot], quancs, reaganjlee, rohitgr7, skaftenicki, stancld, tadejsv, tchaton, teddykoker, tkupek, twsl, valerianrey, williamfalcon


torchmetrics's Issues

Some metrics don't work on CPU using float16

πŸ› Bug

It looks like some metrics such as Precision-Recall curve don't work on CPUs when using float16, perhaps due to a missing feature in pytorch?

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1xDv043rRi5WBshP4m5aoxTt2ChlfxjIk?usp=sharing
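For reference, a minimal repro outside the notebook might look something like the following sketch (assuming the functional precision_recall_curve API; untested here):

import torch
from torchmetrics.functional import precision_recall_curve  # assuming the functional API is available

preds = torch.rand(10).half()                 # half-precision predictions on CPU
target = torch.randint(0, 2, (10,))
# this is expected to raise on CPU, presumably because some of the ops involved
# (e.g. sorting/cumsum) lack float16 CPU kernels in PyTorch
precision, recall, thresholds = precision_recall_curve(preds, target)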

Expected behavior

The metrics should work in half precision on CPUs as well.

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.4
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.1.2
    • tqdm: 4.41.1
  • System:

Add a property to the Metric class from which it can be determined whether it can be passed to self.log (scalar or not)

🚀 Feature

Add a property to Metric which can be checked to see whether it can be logged or not.
Or better, what the computed shape will be.

Motivation

So far all Metrics in PL v1.0.x compute a scalar. The recommended way therefore is to call:

metric(predictions, targets)
self.log("some_name", metric)

which has worked up until now.
However, with upcoming metrics like ConfusionMatrix, the computed value returned is not necessarily a scalar, which will result in a ValueError when trying to log it.

If you have multiple metrics, the code-efficient approach would be to loop over them, e.g.:

for m in self.metrics:
    self.log("metric_name", m)

Adding a Metric that does not return a scalar will break this code.

Pitch

These are some ideas (a rough sketch follows the list), but probably there is something better.

  • Add a property which can be checked against (e.g. scalar: True/False, loggable: True/False)
  • Add a computed_shape property, so we can check if the computed value is either (1, ) or 1
  • Add some new logic to self.log() to deal with non-scalar Metrics.
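As a rough illustration only (not the existing torchmetrics API), such a property could look something like this on the base class, and the logging loop above could then skip metrics for which it returns False:

import torch

class Metric:  # sketch only, standing in for the real torchmetrics base class
    ...

    @property
    def is_loggable(self) -> bool:
        """Proposed property: True if compute() returns a scalar that self.log() can handle."""
        value = self.compute()
        return isinstance(value, torch.Tensor) and value.numel() == 1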

Alternatives

This is a solution that would likely work in most cases, except if on-step compute is turned off:

val = metric(predictions, targets)
if val.numel() == 1:  # only scalars
    self.log("some_name", metric)

Additional context

Related discussion on the PyTorch Lightning forums:
https://forums.pytorchlightning.ai/t/logging-a-tensor/320

@SkafteNicki

Retrieval metrics problem with PyTorch Lightning integration in compute()

🐛 Bug

I use the commit f06488f to calculate RetrievalMAP and RetrievalPrecision in a pytorch-lightning module. The validation_step and validation_step_end functions work, but running with fast_dev_run=True gives an error in the compute() step.

However, when I run self.log("val_MAP", self.metric.compute()) instead of self.log("val_MAP", self.metric) in validation_step_end I do not get errors. But computing the whole metric becomes very slow if it is done every validation_step.

To Reproduce

Steps to reproduce the behavior:

I run the following code with the mentioned commit.

Code sample

from typing import Optional
import os
import torch
from torch import nn
import torch.nn.functional as F
from torchvision import transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
import pytorch_lightning as pl

from torchmetrics import (
    RetrievalMAP,
    RetrievalPrecision,
    MeanAbsoluteError,
)


class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, batch_size=32):
        super().__init__()
        self.batch_size = batch_size

    def prepare_data(self):
        MNIST(os.getcwd(), train=True, download=True)
        MNIST(os.getcwd(), train=False, download=True)

    def setup(self, stage: Optional[str] = None):
        transform = transforms.Compose(
            [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
        )
        if stage == "fit":
            mnist_train = MNIST(os.getcwd(), train=True, transform=transform)
            self.mnist_train, self.mnist_val = random_split(
                mnist_train, [55000, 5000]
            )
        if stage == "test":
            self.mnist_test = MNIST(
                os.getcwd(), train=False, transform=transform
            )

    def train_dataloader(self):
        mnist_train = DataLoader(self.mnist_train, batch_size=self.batch_size)
        return mnist_train

    def val_dataloader(self):
        mnist_val = DataLoader(self.mnist_val, batch_size=self.batch_size)
        return mnist_val

    def test_dataloader(self):
        mnist_test = DataLoader(self.mnist_test, batch_size=self.batch_size)
        return mnist_test


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus()
        )
        self.decoder = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 28 * 28)
        )
        # self.metric = RetrievalMAP()
        self.metric = RetrievalPrecision()
        # self.metric = MeanAbsoluteError()

    def forward(self, x):
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        preds = self.encoder(x).squeeze()

        indexes = torch.randint(100, size=preds.size())
        targets = torch.randint(2, size=preds.size()).to(bool)

        return {"indexes": indexes, "preds": preds, "targets": targets}

    def validation_step_end(self, outputs):
        self.metric(outputs["indexes"], outputs["preds"], outputs["targets"])
        # self.metric(outputs["preds"], outputs["preds"] ** 2)
        self.log("val_MAP", self.metric)
        # self.log("val_MAP", self.metric.compute())

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


if __name__ == "__main__":
    datamodule = MNISTDataModule()
    module = LitAutoEncoder()

    trainer = pl.Trainer(gpus=1, fast_dev_run=True)

    trainer.fit(module, datamodule=datamodule)
    trainer.test(module, datamodule=datamodule)

StackTrace

Traceback (most recent call last):
  File "reproduce_retrieval_error.py", line 107, in <module>
    trainer.fit(module, datamodule=datamodule)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 577, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 754, in run_evaluation
    eval_loop_results = self.evaluation_loop.log_epoch_metrics_on_evaluation_end()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 200, in log_epoch_metrics_on_evaluation_end
    eval_loop_results = self.trainer.logger_connector.get_evaluate_epoch_results()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 286, in get_evaluate_epoch_results
    metrics_to_log = self.cached_results.get_epoch_log_metrics()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 405, in get_epoch_log_metrics
    return self.run_epoch_by_func_name("get_epoch_log_metrics")
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 398, in run_epoch_by_func_name
    results = [func() for func in results]
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 398, in <listcomp>
    results = [func() for func in results]
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 128, in get_epoch_log_metrics
    return self.get_epoch_from_func_name("get_epoch_log_metrics")
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 121, in get_epoch_from_func_name
    self.run_epoch_func(results, opt_metrics, func_name, *args, **kwargs)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 110, in run_epoch_func
    metrics_to_log = func(*args, add_dataloader_idx=self.has_several_dataloaders, **kwargs)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/core/step_result.py", line 327, in get_epoch_log_metrics
    result[dl_key] = self[k].compute().detach()
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/torchmetrics/metric.py", line 228, in wrapped_func
    self._computed = compute(*args, **kwargs)
  File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/torchmetrics/retrieval/retrieval_metric.py", line 110, in compute
    idx = torch.cat(self.idx, dim=0)
RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat.  This usually means that this function requires a non-empty list of Tensors.  Available functions are [CPU, CUDA, QuantizedCPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

CPU: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:5925 [kernel]
CUDA: registered at /pytorch/build/aten/src/ATen/RegisterCUDA.cpp:7100 [kernel]
QuantizedCPU: registered at /pytorch/build/aten/src/ATen/RegisterQuantizedCPU.cpp:641 [kernel]
BackendSelect: fallthrough registered at /pytorch/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at /pytorch/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
AutogradOther: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradCPU: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradCUDA: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradXLA: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradNestedTensor: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
UNKNOWN_TENSOR_TYPE_ID: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse1: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse2: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse3: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
Tracer: registered at /pytorch/torch/csrc/autograd/generated/TraceType_2.cpp:10525 [kernel]
Autocast: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:254 [kernel]
Batched: registered at /pytorch/aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at /pytorch/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]

Expected behavior

This error should not show up. I expect the metric to be computed correctly. When I use MeanAbsoluteError as the metric, the code works. Therefore, there must be a bug in the compute step of the retrieval metrics in combination with pytorch-lightning's API, as a call to compute() within validation_step_end does not create errors.

Environment

  • PyTorch Version (e.g., 1.0): 1.8.1+cu102
  • OS (e.g., Linux): Ubuntu on WSL2
  • How you installed PyTorch (conda, pip, source): pip / poetry
  • Build command you used (if compiling from source):
  • Python version: 3.8.8
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: 1 GPU
  • Any other relevant information:

MinMaxMetric for wrapping other metrics

🚀 Feature

Motivation

  • MinMaxMetric is a metric that simply wraps another metric (e.g. val_acc) and creates a new metric that tracks the min, max, or both values of val_acc.

Pitch

  • I personally use it to quickly see the max_val_acc of a complete experiment in TensorBoard (instead of going through the graph manually to find the max value), but I can see other use cases as well.
  • It was discussed in the PL Slack here and clearly resonated with other users.

Additional context

  • Happy to submit a PR for this feature, as I already have an (incomplete) MaxMetric implementation here (a rough sketch of the wrapper idea follows).
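As a rough sketch of the idea (assuming the torchmetrics Metric base class, that dist_reduce_fx accepts "min"/"max" reductions as in recent versions, and that the wrapped metric computes a scalar tensor):

import torch
from torchmetrics import Metric

class MinMaxMetric(Metric):
    """Wraps another metric and additionally tracks the min/max of its computed value."""

    def __init__(self, base_metric: Metric):
        super().__init__()
        self.base_metric = base_metric
        self.add_state("min_val", default=torch.tensor(float("inf")), dist_reduce_fx="min")
        self.add_state("max_val", default=torch.tensor(float("-inf")), dist_reduce_fx="max")

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        self.base_metric.update(preds, target)

    def compute(self) -> dict:
        val = self.base_metric.compute()
        self.min_val = torch.minimum(self.min_val, val)
        self.max_val = torch.maximum(self.max_val, val)
        return {"raw": val, "min": self.min_val, "max": self.max_val}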

MetricLists for updating multiple metrics at once

🚀 Feature

Motivation

I have been using my own version of MetricList in my personal workflow for some time now, and it has proven very helpful in keeping code clean.

A MetricList wraps multiple metrics together and puts them on the proper devices (much like a ModuleList). What makes it different is that it also lets you update all of them with one call, compute() all of them at once, and log all of them using one log() call.

Pitch

My dynamic inference model needs its val_acc tested in 32 different setups. Manually creating all the different Accuracy() metrics is ridiculous. ModuleList() helps to create them in batch, but I still need to write helper functions to log() or compute() all of them separately (a rough sketch of the idea follows).
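A minimal sketch of the idea, assuming all wrapped metrics accept the same update arguments:

from torch import nn

class MetricList(nn.ModuleList):
    """Updates and computes all contained metrics with a single call."""

    def update(self, *args, **kwargs) -> None:
        for metric in self:
            metric.update(*args, **kwargs)

    def compute(self) -> list:
        return [metric.compute() for metric in self]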

Alternatives

See pitch.

Smart update of Collection of CompositionalMetrics

🚀 Feature

When updating metrics that are composed of other metrics, there are two ways of dealing with updating too many times:

I don't think there is a clean way of only updating the necessary metrics in the general case (when you're just updating all the metrics yourself), but I think that when you combine your metrics in a collection, it could be useful to only update the "base" metric, instead of all metrics.

Motivation

I often want to use a base metric multiple times, and then I have to be careful not to update too many of them. A somewhat convoluted example (because the f1 score is already implemented):

prec = Precision()
recall = Recall()
f1 = 2 * (prec * recall) / (prec + recall)
prec.update(pred, gt)
recall.update(pred, gt)
f1.update(pred, gt) # Shouldn't do this, because it updates prec and recall twice. 

Pitch

Continuing the last example:

collection = MetricCollection([prec, recall, f1])
collection.update(pred, gt)

This should only update prec and recall once.

Alternatives

The alternative is to always define metrics from scratch, but this causes duplication of computation during the update phase.

Multi-label ROCs

🚀 Feature

Similarly to issue #100, it would be nice to make roc work with multi-label inputs.

Motivation & Pitch

auc and hence _auroc_compute do work with multi-label inputs and return an AUROC value for each label/class by iterating over range(num_classes) when passing average=None.
_roc_compute and hence roc differentiate only between binary and multi-class (by checking if num_classes == 1).

I would expect _roc_update to similarly determine the mode using _input_format_classification(preds, target), and roc to return a list of [fpr, tpr, threshold] of length num_classes.
The easiest would be the format [[fpr, tpr, thres]]*5

Metrics support mask

🚀 Feature

It would be nice if current metrics like Accuracy/Recall supported a mask.

Motivation

For example, when I deal with a sequence labeling task and pad sequences to the max length, I do not want to calculate metrics at the padding locations.

Pitch

I guess a simple manipulation would work for accuracy (here is the original one):

from typing import Any, Optional

import torch
from pytorch_lightning.metrics.functional.classification import (
    accuracy,
)
from pytorch_lightning.metrics.metric import TensorMetric


class MaskedAccuracy(TensorMetric):
    """
    Computes the accuracy classification score
    Example:
        >>> pred = torch.tensor([0, 1, 2, 3])
        >>> target = torch.tensor([0, 1, 2, 2])
        >>> mask = torch.tensor([1, 1, 1, 0])
        >>> metric = MaskedAccuracy(num_classes=4)
        >>> metric(pred, target, mask)
        tensor(1.)
    """

    def __init__(
        self,
        num_classes: Optional[int] = None,
        reduction: str = 'elementwise_mean',
        reduce_group: Any = None,
        reduce_op: Any = None,
    ):
        """
        Args:
            num_classes: number of classes
            reduction: a method for reducing accuracies over labels (default: takes the mean)
                Available reduction methods:
                - elementwise_mean: takes the mean
                - none: pass array
                - sum: add elements
            reduce_group: the process group to reduce metric results from DDP
            reduce_op: the operation to perform for ddp reduction
        """
        super().__init__(name='accuracy',
                         reduce_group=reduce_group,
                         reduce_op=reduce_op)
        self.num_classes = num_classes
        self.reduction = reduction

    def forward(self, pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """
        Actual metric computation
        Args:
            pred: predicted labels
            target: ground truth labels
            mask: only calculate metrics where mask==1
        Return:
            A Tensor with the classification score.
        """
        mask_fill = (1-mask).bool()
        pred = pred.masked_fill_(mask=mask_fill, value=-1)
        target = target.masked_fill_(mask=mask_fill, value=-1)

        return accuracy(pred=pred, target=target,
                        num_classes=self.num_classes, reduction=self.reduction)

Alternatives

Additional context

Implement __getitem__ as "metric arithmetic"

🚀 Feature

Allow a user to define a new metric that takes an item out of another metric.

Basically:

iou = IoU(num_classes=2, reduction="none")
fg_iou = iou[0]
bg_iou = iou[1]

Motivation

There are multiple metrics (like IoU and confusion matrix) that would benefit from the use of such a feature, and it is close to the mechanism of metric arithmetic.

Pitch

This would only need to define

class Metric:
    ...
    def __getitem__(self, idx):
        return CompositionalMetric(lambda x: x[idx], self, None)

Alternatives

The straightforward alternative is to use CompositionalMetric directly.

Unable to call metric from any step in Lightning module

πŸ› Bug

I implemented my own Metric class whose compute returns a data class with some aggregated metrics -- precision, recall, and f1-score. But when I try to call the metric inside *_step I get an error from PyTorch internals.

The error happens in this line. If I call the validation metric (initialized with compute_on_step=False) during validation_step I get:

TypeError: 'NoneType' object is not subscriptable

In the case of training metric during training_step:

TypeError: 'ClassificationMetrics' object is not subscriptable

ClassificationMetrics is the name of my data class.

I also tried to return a float from compute, but it causes the same error. I assume that PyTorch expects to receive a tensor and is therefore trying to index into the returned value. An obvious solution is to return a tensor from compute, but that doesn't fix calling the validation metric, which doesn't return anything on step.
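For reference, the pattern looks roughly like this (a hypothetical sketch; the names and the exact update logic are illustrative, the key point being that compute() returns a data class rather than a tensor):

from dataclasses import dataclass

import torch
from torchmetrics import Metric


@dataclass
class ClassificationMetrics:  # aggregated results, as described above
    precision: float
    recall: float
    f1: float


class AggregatedClassification(Metric):  # hypothetical metric
    def __init__(self, compute_on_step: bool = True):
        super().__init__(compute_on_step=compute_on_step)
        self.add_state("tp", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("fp", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("fn", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        preds = (preds > 0.5).int()
        self.tp += ((preds == 1) & (target == 1)).sum()
        self.fp += ((preds == 1) & (target == 0)).sum()
        self.fn += ((preds == 0) & (target == 1)).sum()

    def compute(self) -> ClassificationMetrics:
        precision = self.tp / (self.tp + self.fp)
        recall = self.tp / (self.tp + self.fn)
        f1 = 2 * precision * recall / (precision + recall)
        # returning a non-tensor here is what self.log() later chokes on
        return ClassificationMetrics(precision.item(), recall.item(), f1.item())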

Environment

  • PyTorch Version (e.g., 1.0): 1.8.0
  • OS (e.g., Linux): MacOS BigSur
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.9.2
  • CUDA/cuDNN version: -
  • GPU models and configuration: -
  • Any other relevant information: pytorch-lightning (1.1.7) / torchmetrics (0.2.0)

Functional Confusion Matrix with Multi-Label

πŸ› Bug

I am trying to analyze a model that makes multi-label predictions. When creating a confusion matrix with the functional confusion_matrix method, I get a very different result than expected. I may be misunderstanding how this is supposed to work, so any help would be appreciated!

To Reproduce

Steps to reproduce the behavior:

  1. Predict multi-label data that has had torch.sigmoid applied to the output (N,C) and have a matching shape truth data.
  2. Use the functional confusion_matrix method on the data

Code sample

>>> from torchmetrics.functional import confusion_matrix
>>> import torch
>>> x = torch.tensor([[.4,.5,.6,.7],[.3,.4,.7,.1]])
>>> y = torch.tensor([[0,0,0,1],[0,1,0,0]], dtype=torch.int32)
>>> confusion_matrix(x, y, num_classes=4, normalize='none')
tensor([[3., 3., 0., 0.],
        [1., 1., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

Expected behavior

I would expect the confusion matrix to count the classes that were predicted for each true class, but I may be wrong:

tensor([[0, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [0, 1, 1, 1]])

Environment

  • PyTorch Version (e.g., 1.0): 1.7
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): conda
  • Python version: 3.8.8
  • CUDA/cuDNN version: 11.03
  • GPU models and configuration: Nvidia Tesla V100

Thanks for the great project and help!!

Cohen Kappa Score and Matthews Correlation Coefficient Metrics

🚀 Feature

I would like to request the (re-) implementation of the Cohen Kappa score and the new implementation of the Matthews Correlation Coefficient (MCC) in PyTorch Lightning's metrics.

Motivation

The Cohen Kappa and MCC are often used metrics in classification tasks, especially in a medical setting to determine such things as inter-grader reliability. The Kappa score was originally implemented in PyTorch Lightning 0.9 but has disappeared for some reason. The MCC is often seen as the best metric to use in highly imbalanced datasets. The addition of these two metrics would make it more convenient to use PyTorch Lightning for medical tasks and other tasks that involve ground truth uncertainty and imbalanced data.

Pitch

Implementation of the Cohen Kappa and MCC as metrics in PyTorch Lightning. Both metrics are already available in scikit-learn.

Alternatives

Cannot think of any.

Additional context

None.

Include AverageMeter?

One common pattern I've seen copy-pasted across many different projects is a generic AverageMeter, which tracks the running average of a quantity.

This isn't strictly a "metric", but I'm wondering whether you'd be open to having an implementation in this metrics repository -- it's quite common and having it in a centralized place could be helpful. If you are open to it, I'd be happy to contribute an implementation.

import numpy as np
import torch


class AverageMeter(object):
    """Computes and stores the average and current value"""

    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0.
        self.avg = 0.
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        # convert tensors/arrays to plain floats before accumulating
        self.val = float(val.item()) if isinstance(
            val, (np.ndarray, torch.Tensor)) else val
        self.sum += self.val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)

Bootstrap wrapper for metrics?

🚀 Feature

We should provide the ability to compute bootstrapped confidence intervals for metrics.

Motivation

Confidence intervals are important and we should make it easy for people to increase the rigor of their research and model evaluations.

Pitch

I'm thinking we can have something like this (very high level):

from copy import deepcopy
from torch import nn
from torchmetrics import Metric

class Bootstrapper(Metric):
    def __init__(self, num_samples, metric):
        super().__init__()
        self.num_samples = num_samples
        self.metrics = nn.ModuleList([deepcopy(metric) for _ in range(num_samples)])

    def update(self, preds, targets):
        for idx in range(self.num_samples):
            # sample_for_bootstrap is a placeholder for resampling with replacement
            preds_sampled, targets_sampled = sample_for_bootstrap(preds, targets)
            self.metrics[idx].update(preds_sampled, targets_sampled)

which will let people wrap any metric and have a set of copies of the metric internally updated with different samples of the data, giving us the ability to get a distribution of metric values.

Alternatives

We can skip it on the class-based metrics side and assume anyone doing bootstrap will load everything in memory and do bootstrap using functional metrics.

Add Deviance scores

Non-Softmaxed Classification

🚀 Feature

Motivation

Support other input types for classification metrics, i.e. non-softmaxed network outputs.

For outputs of categorical classification, it does not matter if the output is softmaxed or not. The argmax of these tensors is the same. Can we support those by simply taking an argmax even if the values are out of the 0-1 range?
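A tiny example of the invariance being relied on here:

import torch

logits = torch.tensor([[2.0, -1.0, 0.5], [0.1, 3.0, -2.0]])  # raw, unnormalized scores
probs = logits.softmax(dim=-1)
# the predicted class is the same either way, so argmax-based metrics would not change
assert torch.equal(logits.argmax(dim=-1), probs.argmax(dim=-1))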

cc @SkafteNicki on whether we want to support this.

Change order of updates in metric forward to increase efficiency

🚀 Feature

Refactor Metric.forward() to call update only once.

Motivation

The update() method of Metric is called twice in forward() when compute_on_step is True.
This means repeated computation, which can slow down execution. For example, I have a custom SmoothL1Metric whose update function calculates the element-wise L1 distance (see below). The problem arises when the tensors on which the metric is computed have many dimensions and the computation itself is slow.

class SmoothL1Metric(Metric):
    def __init__(self, mask_dim, dist_sync_on_step: bool = False, compute_on_step: bool = True):
        super().__init__(dist_sync_on_step=dist_sync_on_step, compute_on_step=compute_on_step)
        self.loss = torch.nn.SmoothL1Loss(reduction="sum")
        self.mask_dim = mask_dim

        self.add_state("sum", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("numel", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, input, target, lens):
        mask = get_mask(input, lens, self.mask_dim).type(input.dtype)
        # this is a heavy computation that should not be executed twice
        self.sum += self.loss(input * mask, target * mask)
        self.numel += mask.sum()

    def compute(self):
        return self.sum / self.numel

Suggestion

How about something like:

def forward(self, *args, **kwargs):

    if self.compute_on_step:
        self._to_sync = self.dist_sync_on_step

        # save context before switch
        cache = {attr: getattr(self, attr) for attr in self._defaults.keys()}

        # call reset, update, compute, on single batch
        self.reset()
        self.update(*args, **kwargs)
        self._forward_cache = self.compute()

        # merge new and old context without recomputing update
        for attr, val in cache.items():
            setattr(self, attr, self._reductions[attr](val, getattr(self, attr)))
    else:
        with torch.no_grad():
            self.update(*args, **kwargs)
        self._forward_cache = None

    return self._forward_cache

The code probably does not work now, but the idea should be clear. What do you think?

Metrics support for sweeping

🚀 Feature

We would like to have tighter integration of metrics and sweeping. This requires a few features:

  1. Knowing if higher_is_better (e.g. are we trying to minimize or maximize the metric in a sweep)
  2. Knowing what value to optimize for. E.g. if a recall@precision metric returns both the recall value and the corresponding threshold, we want to optimize by maximizing recall and ignoring the threshold (a rough sketch of how a sweeper could use this follows the list).
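For illustration only, a sweeper could consume such a flag roughly like this (higher_is_better is the proposed attribute and does not exist on Metric today):

from torchmetrics import Metric

def is_improvement(metric: Metric, new_value: float, best_value: float) -> bool:
    # fall back to "higher is better" if the proposed attribute is missing
    if getattr(metric, "higher_is_better", True):
        return new_value > best_value
    return new_value < best_value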

Alternatives

An alternative implementation would be for each metric to have is_better(left: TMetricResult, right: TMetricResult), where TMetricResult is whatever compute returns.

If we don't have it, people will have to write wrappers around the metrics to support this functionality in sweepers.

Formalize task type?

🚀 Feature

Let's have a formal system of task types. Things like BinaryClassificationTask, MultiClassClassificationTask, MultilabelClassificationTask, etc.

Motivation

  1. We are seeing slowdowns from format checking: Lightning-AI/pytorch-lightning#6605
  2. We would like to be able to do more sanity checking that metrics specified by a LightningModule are a correct fit for the task.

Pitch

Add a type hierarchy of possible task types. Each task is defined by the type signature of the (predictions, labels) tuple and the semantics inside it (e.g. multiclass and multilabel have the same shape, but different semantics).

Then, each metric takes a task_type and can assume that predictions/labels conform to it. If we want to add checking at run time, each type can provide a class method (e.g. BinaryClassificationTask.validate_input) that can be enabled for checking on an opt-in basis.
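A very rough sketch of what such task types might look like (all names hypothetical):

import torch

class BinaryClassificationTask:
    """preds and target are both shape (N,); preds are probabilities of the positive class."""

    @classmethod
    def validate_input(cls, preds: torch.Tensor, target: torch.Tensor) -> None:
        # opt-in runtime checking, as proposed above
        if preds.ndim != 1 or preds.shape != target.shape:
            raise ValueError("binary task expects 1d preds and target of equal length")
        if not torch.all((preds >= 0) & (preds <= 1)):
            raise ValueError("binary task expects probabilities in [0, 1]")

class MultilabelClassificationTask:
    ...  # same (N, C) shape as multiclass, different semantics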

Alternatives

  1. People building reusable frameworks implement task types on their own as wrappers around TorchMetrics.
  2. TorchMetrics continue to have format checking in each metric.

Allow Accuracy to return metric per class

🚀 Feature

Implement the average argument like in Precision and Recall, such that the Accuracy metric can return the metric per class label.

Motivation

Sometimes it may be beneficial to look at the accuracy per label, especially when working with very unbalanced datasets.

Pitch

Alternatives

Additional context

Add Specificity

🚀 Feature

In addition to Precision and Recall it would be nice to have a Specificity metric.

For the implementation I think it would be enough to make a copy of Recall (class and function) and adapt the numerator and denominator in _precision_compute.

Alternatives

For binary classification, Specificity is the same as Recall with 0 treated as the positive label.
For multiclass classification it is not as straightforward, though.

Add gpu/multi-gpu testing

🚀 Feature

Currently we only test the metrics on single CPU and distributed CPU. While we have had no explicit issues that link back to the metrics not being tested on GPU, we should do it anyway.

Motivation

Pitch

Alternatives

Additional context

register conda forge

🚀 Feature

Publish package also to Conda distribution

Motivation

Allow user to install from any source

Additional context

You can check the documentation at https://conda-forge.org/docs/maintainer/adding_pkgs.html. It's actually very easy. In short, you must submit a PR to https://github.com/conda-forge/staged-recipes. Once the CI is green you can ping the conda-forge folks and they will review it. Once done, the feedstock will be created and your package built and uploaded to conda-forge.

Constant-memory implementation of precision-recall related metrics

🚀 Feature

Metrics that depend on the precision-recall curve are currently implemented in a way that requires storing all of the predictions and labels in memory, making their use impractical for large datasets or problems with large label spaces. We should support a binning-based metrics implementation to solve this. A prototype is here: https://gist.github.com/maximsch2/2b55bab6deba629a5686258cb8152e53
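The binning idea, very roughly (a sketch of the approach only; the linked gist is the actual prototype):

import torch

def binned_pr_counts(preds: torch.Tensor, target: torch.Tensor, num_thresholds: int = 100):
    """Accumulate TP/FP/FN counts per threshold bin instead of storing every prediction."""
    thresholds = torch.linspace(0, 1, num_thresholds)
    above = preds.unsqueeze(0) >= thresholds.unsqueeze(1)  # (num_thresholds, N)
    is_pos = target.bool().unsqueeze(0)
    tp = (above & is_pos).sum(dim=1)
    fp = (above & ~is_pos).sum(dim=1)
    fn = (~above & is_pos).sum(dim=1)
    # a metric would add these counts into fixed-size states, so memory stays
    # O(num_thresholds) regardless of dataset size
    return tp, fp, fn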

Alternatives

Don't do anything and be restricted in the scalability of metrics.

Another option for scaling is making it easier to keep metrics off-GPU.

A possible question is whether we want to have both raw and binned implementations of the metrics.

Additional context

Keras provides binning-based implementation by default: https://keras.io/api/metrics/classification_metrics/#auc-class

Test for differentiability

🚀 Feature

Add a property from which the user can determine whether a metric is differentiable or not

@property
def is_differentiable(self):
    return True/False

and add appropriate tests. We can take inspiration from what kornia is doing:
https://github.com/kornia/kornia/blob/master/test/color/test_gray.py#L69

Motivation

Some metrics support differentiability, some do not. It would be great if we were more explicit about it and actually had tests for it.

Pitch

Alternatives

Additional context

In DDP training, running ROC.compute() drives the GPUs to 100% usage and hangs the training process

🐛 Bug

To Reproduce

Following the sample code at https://github.com/PyTorchLightning/metrics, we use metric = torchmetrics.ROC() and attach it as model.roc_metric = metric.

In the test epoch we call metric.update(output, target), and after the test epoch we run metric.compute().

This hangs the training process and puts both GPUs at 100% usage.

Btw, using the metric code shipped inside pytorch_lightning shows the same issue as the standalone package.

Environment

  • PyTorch Version: 1.7.0+cu101
  • OS: Linux (Ubuntu 18.04)
  • How you installed PyTorch: pip
  • torchmetrics Version: 0.2.0
  • Python version: 3.6
  • CUDA/cuDNN version: 10.1 / 7.6.5.32-1+cuda10.1
  • GPU models and configuration: two 2080 Ti

Allow MetricCollection to combine calculations

🚀 Feature

Allow MetricCollection to combine metrics internally to reduce redundant computations.

Motivation

Many metrics currently share the same redundant computations underneath. Take Recall and Precision for example: they will both calculate tp, fp, tn, fn during their update step and then use them differently during the compute step. We have chosen to do it this way to make the API simple.

However, we could implement it so that if two metrics with the same update states are collected in a MetricCollection, only one metric is updated and its state is broadcast to the other metrics.

Keeping track of which metrics can be combined could probably be done with some kind of registry (a rough sketch follows below):

@metric_group(Recall, Precision, F1, FBeta)
@metric_group(MeanSquaredError, PSNR)
...
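A rough sketch of what such a registry could be (names hypothetical; registration shown as a plain call rather than a decorator):

_METRIC_GROUPS: list = []

def metric_group(*metric_classes) -> None:
    """Record metric classes that share the same update states."""
    _METRIC_GROUPS.append(set(metric_classes))

# e.g. metric_group(Recall, Precision, F1, FBeta); MetricCollection could then update
# only one representative per group and copy its states to the rest.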

Pitch

Alternatives

Additional context

Update class metrics interface of Precision/Recall/Fbeta to allow calculate them for each individual class

🚀 Feature

I'd like to propose to update class metrics interface of Precision/Recall/Fbeta to have the average argument include none and weighted as in the corresponding functional metrics interface.

Motivation

The current interface restricts the average argument to macro and micro, and because of that one cannot use the class metrics interface to calculate precision/recall/fbeta for an individual class. For example, in binary classification one is typically interested in getting metric results for the positive class (class 1), and this cannot be done with the current class interface. Therefore one has to go back to the functional metrics, which could defeat the purpose of having class metrics (to take care of DDP sync).

By contrast, sklearn defaults to calculating precision/recall/fbeta for the individual class (class 1) while giving the option to calculate the micro/macro/weighted average of these scores.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

Pitch

Update class metrics interface of Precision/Recall/Fbeta to have the average argument include none and weighted as in the corresponding functional metrics interface.

Alternatives

One can always fall back to the functional metric but I assume this is not what we would like.

Additional context

Really like the new class interface to work with DDP and appreciate all your work!

Add Negative predictive value

🚀 Feature

In addition to Precision and Recall it would be nice to have a Negative Predictive Value metric.

For the implementation I think it would be enough to make a copy of Precision (class and function) and adapt the numerator and denominator in _precision_compute.

Alternatives

For binary classification, the Negative Predictive Value is the same as Precision with 0 treated as the positive label.
For multiclass classification it is not as straightforward, though.

Allow unnormalized class scores for Accuracy

🚀 Feature

Presently, when using Accuracy metric on multi-class with scores (N,C entry in input types), the scores are required to be probabilities in [0, 1].

However, un-thresholded accuracy can be computed without normalized probabilities as inputs, as relative ordering of scores is all that is needed.

Given that some uses of Accuracy do require normalized probabilities, we could implement this as a flag that would disable the input check.

Motivation

It is common to work with unnormalized class scores during training, especially during classification tasks, as they are used in the more-stable nn.CrossEntropyLoss. Rather than having to additionally compute a softmax just for the accuracy metric, it would be reasonable to allow usage of arbitrarily scaled input data.

I specify Accuracy because it is the use case that I ran into, but it's possible other Metrics have the same property.

Pitch

Add a flag to Accuracy (and any other applicable metrics) that disables the input range check for preds.

Alternatives

The present workaround is to apply a softmax before feeding data to your Accuracy metric.
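For example (assuming the class-based Accuracy of this era):

import torch
from torchmetrics import Accuracy

logits = torch.tensor([[2.0, -1.0], [0.3, 0.9]])  # unnormalized class scores, shape (N, C)
target = torch.tensor([0, 1])
accuracy = Accuracy()
acc = accuracy(logits.softmax(dim=-1), target)  # normalize before updating the metric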

Additional context

https://github.com/PyTorchLightning/pytorch-lightning/blob/0456b4598f5f7eaebf626bca45d563562a15887b/pytorch_lightning/metrics/functional/accuracy.py#L25

Add contribution guidelines

📚 Documentation

All built-in metrics follow a very fixed structure:

  • implement core logic in new_metric.py file and place that in the functional folder
    • should contain a _new_metric_update, _new_metric_compute and new_metric function
  • implement the corresponding new_metric.py file in the appropriate class based folder
    • should inherit from the Metric class
    • should call the functional counterpart
  • implement test
    • should test directly against a trusted library
    • should use the MetricTester class object for testing
    • should test different input and different arguments (if any)

This should be clear from the already implemented metrics, but it could be made very clear in the contribution guidelines.

Improve test utilities to accept metrics with more input arguments

🚀 Feature

Improve test utilities to accept metrics with a variable number of arguments (at the moment only 2 args are allowed).

Motivation

At the moment the test utilities accept only two input arguments: preds and target. Some metrics, like RetrievalMAP and RetrievalMRR, require a different number of arguments.

Offer a dedicated sync() interface on the base Metric class

🚀 Feature

Offer a dedicated sync() interface on the base Metric class. This would consolidate state across a provided process group using a given dist_sync_fn and would let us deprecate the dist_sync_on_step flag on the metric constructor.

Motivation

The reason we'd like this is to decouple metric computation and global syncing. As a result, we'd be able to inspect the local metric state separately from the synced state.
Example scenario:

  • We're training on a large number of nodes
  • We wish to create a metric to track the local state during training steps, as syncing each step will be incredibly expensive
  • At the end of the epoch, we want to sync the state once and log this value.

This interface also enables the training framework to offer higher-level APIs that could automatically call sync() for a particular Metric at relevant spots in the training loop (e.g. on_step, or on_epoch in Lightning).

cc @maximsch2

Pitch

We should be able to re-use most of _sync_dist() already.
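A toy sketch of the proposed interface (not the real torchmetrics API; a real implementation would presumably reuse the _sync_dist() machinery rather than calling torch.distributed directly):

import torch
import torch.distributed as dist
from torchmetrics import Metric

class SyncableSum(Metric):
    """Toy metric sketching an explicit, on-demand sync()."""

    def __init__(self):
        super().__init__()
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, value: torch.Tensor) -> None:
        self.total += value.sum()

    def compute(self) -> torch.Tensor:
        return self.total

    def sync(self, process_group=None) -> None:
        # consolidate local state across processes once, on demand,
        # instead of implicitly on every step via dist_sync_on_step
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(self.total, op=dist.ReduceOp.SUM, group=process_group)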

Alternatives

Keep as is

F1 and Precision/Recall values are not consistent.

🐛 Bug

The values returned by f1 and precision are wrong.

To Reproduce

    from pytorch_lightning.metrics.functional import *
    y_pred = torch.Tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    y_true = torch.Tensor([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
    tp, fp, tn, fn, _ = stat_scores(y_pred, y_true, 1)  # tp, fp, tn, fn = [8, 8, 0, 0] if 0 is positive
    p = precision(y_pred, y_true, 2)  # returns 0.5; tp/(tp+fp) = 0.5, but if 1 is taken as positive, precision should be 0
    r = recall(y_pred, y_true, 2)  # returns 0.5, but tp/(tp+fn) should be 1
    f1_score = f1(y_pred, y_true, 2)  # returns 0, which is not right either

As mentioned above, if we take 0 as the positive class, then tp, fp, tn, fn = [8, 8, 0, 0], precision will be 0.5, and recall should be 1. But the precision() method gets a 0.5 output.

Expected behavior

The values should be consistent. And a parameter that allows making any class the positive one (like sklearn) would be easier to use.

Environment

  • python = 3.8.5
  • pytorch-lightning = 1.1.6
  • pytorch = 1.7

Add testing against each feature PT version

🚀 Feature

Add a conda setup for testing against all PyTorch feature releases such as 1.4, 1.5, 1.6, ...

Motivation

Have better validation in case some functions are not supported in old PT versions.

Pitch

Alternatives

Use a CI action with a conda setup; there is probably no need to pull a large Docker image.

Additional context

Take inspiration from the past conda matrix in PL.
