lighthouse.jl's People

Contributors

a-cakir, alexmchan, christopher-dg, ericphanson, femtomc, hannahilea, josephsdavid, kimlaberinto, kleinschmidt, palday, simondanisch

lighthouse.jl's Issues

compute cohen's kappa from confusion matrix

Currently, we compute many metrics directly from the confusion matrix, but not Cohen's kappa:

"""
cohens_kappa(class_count, hard_label_pairs)
Return `(κ, p₀)` where `κ` is Cohen's kappa and `p₀` percent agreement given
`class_count` and `hard_label_pairs` (these arguments take the same form as
their equivalents in [`confusion_matrix`](@ref)).
"""
function cohens_kappa(class_count, hard_label_pairs)
all(issubset(pair, 1:class_count) for pair in hard_label_pairs) ||
throw(ArgumentError("Unexpected class in `hard_label_pairs`."))
p₀ = accuracy(confusion_matrix(class_count, hard_label_pairs))
pₑ = _probability_of_chance_agreement(class_count, hard_label_pairs)
return _cohens_kappa(p₀, pₑ), p₀
end
_cohens_kappa(p₀, pₑ) = (p₀ - pₑ) / (1 - ifelse(pₑ == 1, zero(pₑ), pₑ))
function _probability_of_chance_agreement(class_count, hard_label_pairs)
labels_1 = (pair[1] for pair in hard_label_pairs)
labels_2 = (pair[2] for pair in hard_label_pairs)
x = sum(k -> count(==(k), labels_1) * count(==(k), labels_2), 1:class_count)
return x / length(hard_label_pairs)^2
end

Computing it directly from the confusion matrix could clean up some of the data flow. Here is a possible implementation.

using LinearAlgebra: dot  # needed for `dot` below

function cohens_kappa_from_confusion_matrix(conf)
    p₀ = accuracy(conf)
    pₑ = probability_of_chance_agreement_from_confusion_matrix(conf)
    return (p₀ - pₑ) / (1 - ifelse(pₑ == 1, zero(pₑ), pₑ))
end

function probability_of_chance_agreement_from_confusion_matrix(conf)
    counts_1 = dropdims(sum(conf; dims=1); dims=1)  # per-class totals for one rater
    counts_2 = dropdims(sum(conf; dims=2); dims=2)  # per-class totals for the other
    n = sum(counts_1)
    @check n == sum(counts_2)  # `@check` is an internal assertion macro (à la `@assert`)
    return dot(counts_1, counts_2) / n^2
end
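
As a sanity check, both code paths should agree on the same data. A quick hypothetical example (assuming Lighthouse's `confusion_matrix`/`accuracy` and the helpers above):

# Toy data: 5 hard-label pairs over 2 classes.
hard_label_pairs = [(1, 1), (1, 2), (2, 2), (2, 2), (1, 1)]
class_count = 2
κ_pairs, p₀ = cohens_kappa(class_count, hard_label_pairs)
conf = confusion_matrix(class_count, hard_label_pairs)
κ_conf = cohens_kappa_from_confusion_matrix(conf)
κ_pairs ≈ κ_conf  # expected to hold (≈ 0.615 for this toy data)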

Checkpointing?

I think it would make sense for Lighthouse to support per-epoch checkpointing and automatic resumption from checkpoints. I imagine the interface could work as follows (a rough sketch of these pieces follows the lists below):

MyClassifier <: AbstractClassifier may optionally provide methods for:

  • load_checkpoint(::Type{MyClassifier}, uri) -> MyClassifier
  • save_checkpoint(uri, ::MyClassifier) -> Nothing
  • It can also optionally override the default fallbacks has_checkpoint(uri, ::AbstractClassifier) = isfile(uri) and checkpoint_extension(::Type{<:AbstractClassifier}) = ".checkpoint".

Then learn! optionally takes a checkpoint_dir URI as a keyword argument.

  • If no URI is passed, no checkpointing is done.
  • If one is passed, the classifier must support the load and save methods, and
  • at the start of each epoch, uri = joinpath(checkpoint_dir, string(epoch), "classifier_checkpoint$(ext)") (with ext = checkpoint_extension(MyClassifier)) is checked for the existence of a checkpoint (by has_checkpoint).
  • If a checkpoint exists, we immediately proceed to the next epoch (without calling callbacks-- presumably those already ran the first time).
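
Here is a rough sketch of those pieces; nothing below exists in Lighthouse today, and `checkpoint_uri` is a hypothetical helper used only to illustrate the per-epoch path:

using Lighthouse: AbstractClassifier

# Proposed default fallbacks (overridable per classifier type):
has_checkpoint(uri, ::AbstractClassifier) = isfile(uri)
checkpoint_extension(::Type{<:AbstractClassifier}) = ".checkpoint"

# Hypothetical helper for the per-epoch checkpoint path `learn!` would check:
function checkpoint_uri(classifier::AbstractClassifier, checkpoint_dir, epoch)
    ext = checkpoint_extension(typeof(classifier))
    return joinpath(checkpoint_dir, string(epoch), "classifier_checkpoint$(ext)")
end

# Sketch of the epoch loop inside `learn!`:
#   uri = checkpoint_uri(classifier, checkpoint_dir, epoch)
#   has_checkpoint(uri, classifier) && continue  # epoch already done; skip it
#   ...train the epoch, run callbacks...
#   save_checkpoint(uri, classifier)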

Not sure what to do about logs though-- should we also load logs from a checkpoint? I'd like any model-managed state to be part of the AbstractClassifier and have its checkpointing handle that state, but I'm not sure about any Lighthouse-managed state, and especially what guarantees callbacks should have about the availability of that state if we resumed from a checkpoint.

`plot_confusion_matrix` cannot change `annotation_text_size`

I was trying to increase the text size of my confusion matrix plot, but when I passed the keyword arg annotation_text_size=34 it threw an error. The code works fine without the keyword args (Lighthouse.plot_confusion_matrix(rand(2, 2), ["1", "2"], :Row)) and produces a confusion matrix plot, but when I try to change the default text size it gives me this error:

Lighthouse.plot_confusion_matrix(rand(2, 2), ["1", "2"], :Row, annotation_text_size=34)
MethodError: no method matching plot_confusion_matrix(::Matrix{Float64}, ::Vector{String}, ::Symbol; annotation_text_size=34)

Closest candidates are:

plot_confusion_matrix(::AbstractMatrix, ::Any, ::Any) at ~/.julia/packages/Lighthouse/K3PBy/src/learn.jl:333 got unsupported keyword argument "annotation_text_size"

here are my environment/system info, if it's of any help:

Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i3-4030U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, haswell)

Environment:
  JULIA_REVISE_WORKER_ONLY = 1
      Status `~/usr/project/dir/Project.toml`
  [ac2c24cd] Lighthouse v0.11.3

RFC: Remove training harness wrappers (`learn!`, `evaluate!`) from Lighthouse.jl?

I took a recent time-boxed* attempt at deprecating EvaluationRow (#71), and it was Not A Fun Time™. I'm not sure what the correct thing to do here is, and I think that it is very possible that we should remove some (all?) of the training harness aspects of Lighthouse.jl altogether---or at least, we should figure out the minimal set of them that users actually depend on.

We've spent a certain amount of time maintaining the existing harness behavior, but afaik our internal usage of the harness itself is basically non-existent, as teams have combined the Lighthouse components in their own ways to support their specific use-cases. (Specifics to be left to a Beacon slack thread!) Externally, I would be surprised if anyone was depending on them (although if someone is, please say the word!!). Additionally, https://github.com/beacon-biosignals/LighthouseFlux.jl/blob/main/src/LighthouseFlux.jl does not use learn! or evaluate!.

*technically it was also [hurtling-through-]space-boxed as well, huzzah for transcontinental flight....


The particular hole I fell down was trying to cleanly swap out the use of evaluation_metrics_row (and its corresponding log_evaluation_row!) inside evaluate!.

metrics = evaluation_metrics_row(predicted_hard_labels, predicted_soft_labels,

Why? Well, the existing usage conflates tradeoff metrics and hardened metrics, and we shouldn't be computing (and logging) both there without the caller specifying things like how the per-class binarization should happen (to create the hardened metrics), whether they want to additionally compute multirater label metrics (i.e., if votes exist), and whether they want to compute additional metrics that (until now) have only been computed for binary classification problems.

At the point that we expose separate interfaces for the various combinations of inputs that a user might want (multiclass vs binary, single rater vs multirater, binarization choice, pre-computed predicted_hard_labels for multiclass input), we will have created a lot more code to maintain for a lot less benefit. At the end of the day, asking the caller to implement their own evaluate!(model::AbstractClassifier, ...) would be easier.

...which seems like a viable option. Except what inputs would a template evaluate! function take? What combination of the existing

predicted_hard_labels::AbstractVector,
              predicted_soft_labels::AbstractMatrix,
              elected_hard_labels::AbstractVector,
              classes, logger;
              logger_prefix, logger_suffix,
              votes::Union{Nothing,AbstractMatrix}=nothing,
              thresholds=0.0:0.01:1.0, optimal_threshold_class::Union{Nothing,Integer}=nothing

? Because the choice depends on the type of metrics being computed in the first place.

If various projects were depending on this evaluate! function, it seems like choosing some minimal required arguments (

predicted_soft_labels::AbstractMatrix,
              elected_hard_labels::AbstractVector,
              classes, logger;
              logger_prefix, logger_suffix

?) and passing the others as kwargs from the learn! function might do it. Buuuut I'm not sure how many folks use this function in the first place!


Which brings me to: what is the correct thing to do here? And does anyone use the learn! and evaluate! functions, or could/should we remove them except as examples (in the docs) for how to implement a custom loop for one's own model?

To quote @ericphanson, when talking about this issue,

I think a harness like that is just too rigid. We just need a lot of useful primitives like metrics, logging, plots, etc., and then we can combine them in different ways as makes sense for the project

I agree---and so I think losing the evaluate! and learn! calls as they currently exist would allow us to focus on those primitives and sink less time into maintaining code that isn't used. And if more flexible (or smaller scoped) harnesses make their way back into Lighthouse.jl, I think that would be cool also, but we should let that happen from the ground up once we create training loops that actively work for us and could be shared.

Path to `EvaluationRow` deprecation

Once #69 lands (via #70), the next piece is to refactor out the individual plotting components in relation to their given *MetricsRow---and then we can get rid of EvaluationRow, which combines multiple unrelated metrics types in a confusing (and easy to misinterpret) way.

(Ref brief discussion in #70 (comment))

Tasks:

  • Replace evaluation_metrics_plot with per-schema plots that take in a table of the relevant metrics row
  • Remove EvaluationRow and remove evaluation_metrics_row

Package is unsupported on Apple Silicon (M1/M2 chips)

A dependency of this package, FFMPEG, cannot be built due to the following exception:

ERROR: Error building `FFMPEG`:
┌ Warning: Platform `arm64-apple-darwin22.4.0` is not an officially supported platform
└ @ BinaryProvider ~/.julia/packages/BinaryProvider/U2dKK/src/PlatformNames.jl:450
ERROR: LoadError: KeyError: key "unknown" not found

Since this dependency cannot be built, the Lighthouse package itself won't build. BinaryProvider.jl is more or less a deprecated package that hasn't been updated since 2020, but it appears that FFMPEG still relies on it. I will file a related issue there as well.

Compute AUC without needing specific thresholds

Currently, Lighthouse asks for a set of input thresholds (e.g. 0.0:0.01:1.0), then computes binary stats for each threshold, then forms corresponding fpr/tpr curves, then computes the area underneath those curves w/ the trapezoid rule.

I came across An Improved Model Selection Heuristic for AUC today which, among other things, pointed out that you can compute the AUC without reference to specific thresholds by sorting the soft labels of all instances and then doing a particular kind of count:

[Screenshots from the paper showing the rank/counting-based formulation of AUC]

I think this is a nice algorithm we should consider using, so that the accuracy of the AUC estimate is not influenced by the granularity of the thresholds chosen.
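
For concreteness, here is a sketch of that counting formulation (the paper sorts once to get O(n log n); this quadratic version just spells out the definition and is not Lighthouse API):

# AUC as the probability that a randomly chosen positive instance is scored
# above a randomly chosen negative instance, counting ties as 1/2.
function auc_by_counting(soft_labels::AbstractVector, is_positive::AbstractVector{Bool})
    n_pos = count(is_positive)
    n_neg = length(is_positive) - n_pos
    (n_pos == 0 || n_neg == 0) && return NaN
    total = 0.0
    for (sᵢ, yᵢ) in zip(soft_labels, is_positive)
        yᵢ || continue  # positives in the outer loop
        for (sⱼ, yⱼ) in zip(soft_labels, is_positive)
            yⱼ && continue  # negatives in the inner loop
            total += sᵢ > sⱼ ? 1.0 : (sᵢ == sⱼ ? 0.5 : 0.0)
        end
    end
    return total / (n_pos * n_neg)
end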

Remove LearnLogger?

After #60 we've defined an interface that folks can plug into for custom logging.

We could drop the TensorBoardLogger dependency by removing LearnLogger. We could make a no-op logger that doesn't actually log anything, or maybe a Dict-backed logger that logs things to an in-memory dict as the current one does, which could be good for testing (a sketch of that option is below). Or we could keep the current LearnLogger.
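
For reference, the Dict-backed option is small. A sketch, assuming the post-#60 interface functions are the ones implemented by LearnLogger (quoted further down this page) and can be extended by importing them from Lighthouse:

import Lighthouse: log_value!, log_plot!, log_event!
using Dates: now

struct DictLogger
    logged::Dict{String,Vector{Any}}
end
DictLogger() = DictLogger(Dict{String,Vector{Any}}())

function log_value!(logger::DictLogger, field::AbstractString, value)
    push!(get!(() -> Any[], logger.logged, field), value)
    return value
end

function log_plot!(logger::DictLogger, field::AbstractString, plot, plot_data)
    push!(get!(() -> Any[], logger.logged, field), plot_data)
    return plot
end

# Mirrors LearnLogger's return value; could also `push!` into `logger.logged["events"]`.
log_event!(logger::DictLogger, value) = string(now(), " | ", value)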

Update `log_evaluation_row!` to not special case Spearman correlation coefficient

Okay, I'm not "allowed" to comment below this point, but I think we should specialize the below implementation of
log_evaluation_row! on LearnLogger so that special-casing the Spearman correlation doesn't have to happen by default---that is a very TensorBoard-specific logged item. E.g.,

function log_evaluation_row!(logger, field::AbstractString, metrics)
    metrics_plot = evaluation_metrics_plot(metrics)
    metrics_dict = _evaluation_row_dict(metrics)
    log_plot!(logger, field, metrics_plot, metrics_dict)
    return metrics_plot
end

or, better:

function log_evaluation_row!(logger, field::AbstractString, metrics)
    metrics_plot = evaluation_metrics_plot(metrics)
    log_plot!(logger, field, metrics_plot) # if we make the plot_data field optional
    # optionally, could also then log each field of `metrics_plot` with `log_values!`
    return metrics_plot
end

and

function log_evaluation_row!(logger::LearnLogger, field::AbstractString, metrics)
    metrics_plot = evaluation_metrics_plot(metrics)
    metrics_dict = _evaluation_row_dict(metrics)
    log_plot!(logger, field, metrics_plot, metrics_dict)
    if haskey(metrics_dict, "spearman_correlation")
        sp_field = replace(field, "metrics" => "spearman_correlation")
        log_value!(logger, sp_field, metrics_dict["spearman_correlation"].ρ)
    end
    return metrics_plot
end

Originally posted by @hannahilea in #60 (comment)

Add "abstract" logging interface

We should abstract our logging interface so there is no default logger, and folks can provide methods for a well-documented interface, likely based on this code:

struct LearnLogger
    path::String
    tensorboard_logger::TensorBoardLogger.TBLogger
    logged::Dict{String,Vector{Any}}
end

function LearnLogger(path, run_name; kwargs...)
    tensorboard_logger = TBLogger(joinpath(path, run_name); kwargs...)
    return LearnLogger(path, tensorboard_logger, Dict{String,Any}())
end

function log_event!(logger::LearnLogger, value)
    logged = string(now(), " | ", value)
    TensorBoardLogger.log_text(logger.tensorboard_logger, "events", logged)
    return logged
end

function log_plot!(logger::LearnLogger, field::AbstractString, plot, plot_data)
    values = get!(() -> Any[], logger.logged, field)
    push!(values, plot_data)
    TensorBoardLogger.log_image(logger.tensorboard_logger, field, plot; step=length(values))
    return plot
end

function log_value!(logger::LearnLogger, field::AbstractString, value)
    values = get!(() -> Any[], logger.logged, field)
    push!(values, value)
    TensorBoardLogger.log_value(logger.tensorboard_logger, field, value;
                                step=length(values))
    return value
end

function log_resource_info!(logger, section::AbstractString, info::ResourceInfo;
                            suffix::AbstractString="")
    log_value!(logger, section * "/time_in_seconds" * suffix, info.time_in_seconds)
    log_value!(logger, section * "/gc_time_in_seconds" * suffix, info.gc_time_in_seconds)
    log_value!(logger, section * "/allocations" * suffix, info.allocations)
    log_value!(logger, section * "/memory_in_mb" * suffix, info.memory_in_mb)
    return info
end
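
One possible shape for the abstracted version: bare generic functions with documented contracts and no default logger, so any logger type only needs to add methods. The names below come from the LearnLogger methods above; everything else is a sketch, not settled API:

"""
    log_value!(logger, field::AbstractString, value)

Log a single scalar `value` under `field` and return `value`.
"""
function log_value! end

"""
    log_plot!(logger, field::AbstractString, plot, plot_data)

Log `plot` (and, optionally, its underlying `plot_data`) under `field`.
"""
function log_plot! end

"""
    log_event!(logger, value)

Log a timestamped, free-text event.
"""
function log_event! end

# `log_resource_info!` can keep a generic fallback written purely in terms of
# `log_value!`, exactly as in the code above, so loggers get it for free.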

Add plot of class distribution to default metrics plots

"Is there class imbalance" and "how many observations are in each class?" are common questions people ask when viewing the output Lighthouse metrics plots. Let's add a class distribution bar plot (or similar) to the default plots to answer these questions!

Update doc examples to use meaningful test data

Right now, the examples in the docs are generated from test data that shows very poor (basically random!) model performance. Update the test data to approximate values from a "decently" performing model, so that the metrics plots are meaningful/interpretable and demonstrate an accurate implementation (i.e., give someone looking at the ROC curve demo a reason to trust that it is a correctly implemented ROC curve!).

Follow-up: can use the results to generate #17.

`binary_statistics` reports sensitivities as 1 instead of NaN when no actual positives

true_positive_rate = (true_positives == 0 && actual_positives == 0) ?
                     (one(true_positives) / one(actual_positives)) :
                     (true_positives / actual_positives)
true_negative_rate = (true_negatives == 0 && actual_negatives == 0) ?
                     (one(true_negatives) / one(actual_negatives)) :
                     (true_negatives / actual_negatives)
false_positive_rate = (false_positives == 0 && actual_negatives == 0) ?
                      (zero(false_positives) / one(actual_negatives)) :
                      (false_positives / actual_negatives)
false_negative_rate = (false_negatives == 0 && actual_positives == 0) ?
                      (zero(false_negatives) / one(actual_positives)) :
                      (false_negatives / actual_positives)

IMO 1.0 is misleading here and NaN would be better. I think this is a breaking change.
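
A sketch of what the NaN-returning version could look like (`safe_rate` is a hypothetical helper, not existing Lighthouse code; as noted above, shipping this would be breaking):

# Report NaN when there are no actual positives/negatives, rather than
# coercing the rate to 1.0 or 0.0.
safe_rate(numerator, denominator) = denominator == 0 ? NaN : numerator / denominator

true_positive_rate  = safe_rate(true_positives, actual_positives)
true_negative_rate  = safe_rate(true_negatives, actual_negatives)
false_positive_rate = safe_rate(false_positives, actual_negatives)
false_negative_rate = safe_rate(false_negatives, actual_positives)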

Feature request: `binary_comparison_metrics_plot`

The idea is to have something like evaluation_metrics_plot except for when you have more than 1 model, and for the simpler case of binary classifiers instead of multi-class. Ideally, this plot would include:

  • ROC curves (note that in the binary case, we only really care about 1 ROC curve per model, since the other class's curve is mirrored from it)
  • Detection calibration curves
  • interrater reliability (possibly with something more compact than a bar chart)
  • intermodel reliability-- like interrater reliability, except comparing the models' predictions
  • and maybe, though less importantly IMO, prediction reliability calibration curves and PR curves

Ideally the plots could scale from 2 to say 20 models, though if 20 gets cramped we could split it up and only do smaller subsets.

Essentially, it would look like evaluation_metrics_plot except that, for most of the plots, it would be 1 curve per model instead of 1 curve per class.
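
The basic building block is just overlaying one curve per model on a shared axis; a sketch with Makie (not Lighthouse API; the per-model `fprs`/`tprs` inputs are assumptions):

using CairoMakie

function plot_roc_comparison(model_names, fprs, tprs)
    fig = Figure()
    ax = Axis(fig[1, 1]; title="ROC comparison", xlabel="False positive rate",
              ylabel="True positive rate")
    for (name, fpr, tpr) in zip(model_names, fprs, tprs)
        lines!(ax, fpr, tpr; label=name)
    end
    axislegend(ax; position=:rb)
    return fig
end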

Lighthouse should consume tables

This is the counterpart to #45 which is about producing rows. It would be nice if the inputs were tables.

  • Create a Legolas schema with 1 row per segment, with columns for what we need in order to evaluate (just votes, predicted_soft_labels, elected_hard_labels?)
  • Write an evaluation_metrics method that consumes tables with this schema, along with keyword arguments (classes, thresholds, probably?); a rough sketch follows this list
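
A rough sketch of what that input schema could look like; the schema name, field names, and types are all placeholders, and the declaration uses Legolas 0.5-style macros (which Lighthouse may not be on yet; see the Legolas 0.5 upgrade issue below):

using Legolas

Legolas.@schema "lighthouse.evaluation-input" EvaluationInput

Legolas.@version EvaluationInputV1 begin
    predicted_soft_labels::Vector{Float64}   # one probability per class
    elected_hard_labels::Int64               # elected class index for this segment
    votes::Union{Missing,Vector{Int64}}      # optional per-rater hard labels
end

# An `evaluation_metrics` method could then accept any Tables.jl-compatible
# table of these rows plus keyword arguments like `classes` and `thresholds`.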

Exception thrown in area_under_curve

This is a bit of a weird issue - just popped up during training:

[Screenshot of the exception's stack trace]

Seems like it could be:

  1. Types are too loose, letting something in which should throw earlier.
  2. This is a bug and we should define one(::Type{Union{}}) = 1 for example.

`evaluation_metrics_plot` fails when IRA fields are in the input `EvaluationRow` for multiclass result

Currently, if votes is not provided to Lighthouse.evaluation_metrics_row, it outputs a row with per_class_IRA_kappas and multiclass_IRA_kappas fields defined as missing. If this is a multiclass result, then passing in such a row to evaluation_metrics_plot fails because it doesn't expect these fields to exist when not dealing with multi-rater metrics.

For example, for a 5 class EvaluationRow, it throws:

ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 12 and 8")

`true_negatives` in `binary_statistics()` is incorrect

The computation of true negatives based on the confusion matrix looks like it's summing only the diagonal of the submatrix instead of the full submatrix of all classes except the class of interest. Since this is supposed to be a binary metric, it should be summing over all classes that are not the class of interest and that are also not predicted to be the class of interest.

true_negatives = sum(diag(confusion)) - true_positives
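
A sketch of the proposed fix: count every entry that is in neither the class-of-interest's row nor its column (`class_index` and the layout of `confusion` are assumed to match `binary_statistics`):

true_negatives = sum(confusion) -
                 sum(confusion[class_index, :]) -
                 sum(confusion[:, class_index]) +
                 confusion[class_index, class_index]  # added back since it was subtracted twice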

The great `evaluation_metrics_row` refactor!

Right now, evaluation_metrics_row is kludgy and tries to do way too much stuff, which makes it (a) hard to know exactly what your outputs relate to and (b) hard to customize any internal subfunctions without rewriting the entire loop (or threading some new param everywhere). It also combines a bunch of different types of output metrics into a single hard-to-parse schema (EvaluationRow). We'd like to refactor this!

Plan (as devised w/ @ericphanson): Split current evaluation_metrics_row function into three separate (types of) functions:

  1. a function that covers everything not dependent on a specific threshold OR on predicted_hard_labels (e.g., roc curves, pr curves, some calibration curves maybe)
  2. function(s) that choose a threshold, given the output of step 1 - this would also allow choice of specific desired threshold, e.g., thresh for given specificity OR thresh based on calibration curves (default now) OR thresh based on ROC curve minimization (other default currently!)
  3. functions that calculate derived stats based on hardened predictions (e.g., confusion matrix, expert-agreement kappas, etc.) (multiclass AND per-class rows); two methods:
    • (convenience option) takes a hardening option + threshold + observations, computes predicted_hard_labels internally
    • takes predicted_hard_labels, goes from there

What is the output if you call all three steps??

  • Currently, we return a single EvaluationRow, which contains threshold-dependent AND threshold-independent metrics, and where each field contains results for all classes AND (when valid) a multiclass value (i.e., some fields are per-class, some are multiclass, some contain both; some fields are threshold-dependent, some are threshold-independent).
  • After refactor, those should all be separate things:
    • Should have separate schemas for threshold-dependent (from step 3) and threshold-independent statistics (from step 1)
    • Each class should have its own output row(s) - #row
    • Multiclass stats should be its own row - confusion matrix & multiclass expert-agreement kappa (if valid)

Other important points:

  • This doesn't necessarily account for specialized hardening (a la #68), but would make it clearer to implement. What needs a hardening function? Step 1 and the convenience option of step 3.
  • As much as possible, we should reorganize the loop structure so that instead of doing each computation over all classes before moving on to the next metric, we do all computations for a single class and then iterate over the classes in the outer loop.
  • make sure docs are clear about what values do and don't depend on predictions (i.e., ira does not!)
  • Can make a breaking release with all of these changes; depending on how much time it takes, can add a deprecation path to construct the current EvaluationRow from new row types (might be useful for testing that change adds no functional change, otherwise file it for later)
  • How does this interact with plotting? Plotting is its own special hell that we'll likely want to refactor also, but for now,

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

upgrade to Legolas 0.5

We define these schemas:

  • "lighthouse.evaluation@1"
  • "lighthouse.observation@1"
  • "lighthouse.class@1"
  • "lighthouse.label-metrics@1" > "lighthouse.class@1"
  • "lighthouse.hardened-metrics@1" > "lighthouse.class@1"
  • "lighthouse.tradeoff-metrics@1" > "lighthouse.class@1"

We use Legolas.validate in code and tests. A rough sketch of what the 0.5-style schema declarations might look like is below.
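
For reference, a sketch of the shape of a 0.5-style declaration, using "lighthouse.class" and one of its extensions as examples; the field lists are illustrative only and the macro syntax is recalled from Legolas 0.5 rather than copied from it:

using Legolas

Legolas.@schema "lighthouse.class" Class

Legolas.@version ClassV1 begin
    class_index::Union{Int64,Symbol}  # illustrative field, not the real declaration
end

# The "child > parent" relationships above become version extensions, e.g.:
Legolas.@schema "lighthouse.label-metrics" LabelMetrics

Legolas.@version LabelMetricsV1 > ClassV1 begin
    ira_kappa::Union{Missing,Float64}  # illustrative field
end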

TODOs

  • upgrade evaluation schema
  • upgrade observation schema
  • upgrade class schema
  • upgrade label-metrics schema
  • upgrade hardened-metrics schema
  • upgrade tradeoff-metrics schema
