beacon-biosignals / Lighthouse.jl
Performance evaluation tools for multiclass, multirater classification models
License: MIT License
I don't think this is the right thing -- but for some reason Pkg grabs OldLighthouse.
currently, we compute many metrics directly from the confusion matrix, but not Cohen's kappa:
Lines 105 to 127 in 2feb223
It could clean up some of the data flow to do so. Here is an implementation.
using LinearAlgebra: dot

function cohens_kappa_from_confusion_matrix(conf)
    p₀ = accuracy(conf)  # observed agreement
    pₑ = probability_of_chance_agreement_from_confusion_matrix(conf)
    return (p₀ - pₑ) / (1 - ifelse(pₑ == 1, zero(pₑ), pₑ))
end

function probability_of_chance_agreement_from_confusion_matrix(conf)
    counts_1 = dropdims(sum(conf; dims=1); dims=1)  # per-class totals along one axis
    counts_2 = dropdims(sum(conf; dims=2); dims=2)  # per-class totals along the other axis
    n = sum(counts_1)
    @assert n == sum(counts_2)
    return dot(counts_1, counts_2) / n^2
end
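A quick sanity check on a toy 2×2 confusion matrix (made-up counts; this assumes accuracy(conf) here is the usual trace of the confusion matrix divided by its total):

conf = [40 10; 5 45]                       # toy confusion matrix
cohens_kappa_from_confusion_matrix(conf)   # p₀ = 0.85, pₑ = 0.5, so κ = 0.7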
I think it would make sense for Lighthouse to support per-epoch checkpointing and automatic resumption from checkpoints. I imagine the interface could work like this: MyClassifier <: AbstractClassifier may optionally provide methods for

load_checkpoint(::Type{MyClassifier}, uri) -> MyClassifier
save_checkpoint(uri, ::MyClassifier) -> Nothing
has_checkpoint(uri, ::AbstractClassifier) = isfile(uri)
checkpoint_extension(::Type{<:AbstractClassifier}) = ".checkpoint"

Then learn! optionally takes a checkpoint_dir URI as a keyword argument. At each epoch, uri = joinpath(checkpoint_dir, string(epoch), "classifier_checkpoint$(ext)") (with ext = checkpoint_extension(MyClassifier)) is checked for the existence of a checkpoint (by has_checkpoint); a rough sketch of this flow is below.

Not sure what to do about logs though -- should we also load logs from a checkpoint? I'd like any model-managed state to be part of the AbstractClassifier and have its checkpointing handle any of that state, but I'm not sure about any Lighthouse-managed state, and especially what guarantees callbacks should have about the availability of that state if we resumed from a checkpoint.
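Here is a hypothetical sketch of how that per-epoch check could look inside a learn!-style loop. None of these functions, nor the checkpoint_dir keyword, exist in Lighthouse today; they just follow the proposal above, and train_one_epoch! is a stand-in for the existing per-epoch training step.

# Hypothetical sketch of the per-epoch checkpoint/resume flow described above.
function train_with_checkpoints!(model::AbstractClassifier, n_epochs; checkpoint_dir=nothing)
    ext = checkpoint_extension(typeof(model))
    for epoch in 1:n_epochs
        uri = checkpoint_dir === nothing ? nothing :
              joinpath(checkpoint_dir, string(epoch), "classifier_checkpoint$(ext)")
        if uri !== nothing && has_checkpoint(uri, model)
            model = load_checkpoint(typeof(model), uri)  # resume from the saved state
            continue
        end
        train_one_epoch!(model)  # stand-in for the existing per-epoch training step
        uri === nothing || (mkpath(dirname(uri)); save_checkpoint(uri, model))
    end
    return model
end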
I was trying to increase the text size of my confusion matrix plot, but when I tried to pass in the keyword arg annotation_text_size=34, it gave me an error. The code works fine without the keyword arg (Lighthouse.plot_confusion_matrix(rand(2, 2), ["1", "2"], :Row)) and produces a confusion matrix plot, but when I try to change the default text size I get this error:
Lighthouse.plot_confusion_matrix(rand(2, 2), ["1", "2"], :Row, annotation_text_size=34)
MethodError: no method matching plot_confusion_matrix(::Matrix{Float64}, ::Vector{String}, ::Symbol; annotation_text_size=34)
Closest candidates are:
plot_confusion_matrix(::AbstractMatrix, ::Any, ::Any) at ~/.julia/packages/Lighthouse/K3PBy/src/learn.jl:333 got unsupported keyword argument "annotation_text_size"
Here is my environment/system info, if it's of any help:
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i3-4030U CPU @ 1.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, haswell)
Environment:
JULIA_REVISE_WORKER_ONLY = 1
Status `~/usr/project/dir/Project.toml`
[ac2c24cd] Lighthouse v0.11.3
I recently made a time-boxed* attempt at deprecating EvaluationRow
(#71), and it was Not A Fun Time™. I'm not sure what the correct thing to do here is, and I think that it is very possible that we should remove some (all?) of the training harness aspects of Lighthouse.jl altogether---or at least, we should figure out the minimal set of them that users actually depend on.
We've spent a certain amount of time maintaining the existing harness behavior, but afaik our internal usage of the harness itself is basically non-existent, as teams have combined the Lighthouse components in their own ways to support their specific use-cases. (Specifics to be left to a Beacon slack thread!) Externally, I would be surprised if anyone was depending on them (although if someone is, please say the word!!). Additionally, https://github.com/beacon-biosignals/LighthouseFlux.jl/blob/main/src/LighthouseFlux.jl does not use learn!
or evaluate!
.
*technically it was also [hurtling-through-]space-boxed as well, huzzah for transcontinental flight....
The particular hole I fell down was trying to cleanly swap out the use of evaluation_metrics_row
(and its corresponding log_evaluation_row!
) inside evaluate!
.
Line 233 in 0540cdd
Why? Well, the existing usage conflates tradeoff metrics and hardened metrics, and we shouldn't be computing (and logging) both there without the caller specifying things like how the per-class binarization should happen (to create the hardened metrics), whether they want to additionally compute multirater label metrics (i.e., if votes exist), and whether they want to compute additional metrics that (until now) have only been computed for binary classification problems.
At the point that we expose separate interfaces for the various combinations of inputs that a user might want (multiclass v binary, single rater v multirater, binarization choice, pre-computed predicted_hard_labels for multiclass input), we will have created a lot more code to maintain for a lot less benefit. At the end of the day, asking the caller to implement their own evaluate!(model::AbstractClassifier, ...) would be easier.
...which seems like a viable option. Except what inputs would a template evaluate!
function take? What combination of the existing
predicted_hard_labels::AbstractVector,
predicted_soft_labels::AbstractMatrix,
elected_hard_labels::AbstractVector,
classes, logger;
logger_prefix, logger_suffix,
votes::Union{Nothing,AbstractMatrix}=nothing,
thresholds=0.0:0.01:1.0, optimal_threshold_class::Union{Nothing,Integer}=nothing
? Because the choice depends on the type of metrics being computed in the first place.
If various projects were depending on this evaluate!
function, it seems like choosing some minimal required arguments (
predicted_soft_labels::AbstractMatrix,
elected_hard_labels::AbstractVector,
classes, logger;
logger_prefix, logger_suffix
?) and passing the others as kwargs from the learn! function might do it (see the sketch below). Buuuut I'm not sure how many folks use this function in the first place!
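For concreteness, here is what that split could look like as a signature -- purely a hypothetical sketch of the option described above, not an agreed-upon design:

# Hypothetical minimal evaluate! signature; the threshold- and multirater-related
# arguments would be forwarded as keyword arguments from learn!.
function evaluate!(predicted_soft_labels::AbstractMatrix,
                   elected_hard_labels::AbstractVector,
                   classes, logger;
                   logger_prefix, logger_suffix,
                   predicted_hard_labels::Union{Nothing,AbstractVector}=nothing,
                   votes::Union{Nothing,AbstractMatrix}=nothing,
                   thresholds=0.0:0.01:1.0,
                   optimal_threshold_class::Union{Nothing,Integer}=nothing)
    # ...compute whichever metrics the supplied keyword arguments call for,
    # then log them via `logger`...
end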
Which brings me to: what is the correct thing to do here? And does anyone use the learn!
and evaluate!
functions, or could/should we remove them except as examples (in the docs) for how to implement a custom loop for one's own model?
To quote @ericphanson, when talking about this issue,
I think a harness like that is just too rigid. We just need a lot of useful primitives like metrics, logging, plots, etc, and then we can combine them in different ways as makes sense for the project
I agree---and so I think losing the evaluate!
and learn!
calls as they currently exist would allow us to focus on those primitives and sink less time into maintaining code that isn't used. And if more flexible (or smaller scoped) harnesses make their way back into Lighthouse.jl, I think that would be cool also, but we should let that happen from the ground up once we create training loops that actively work for us and could be shared.
I think this would be nice to have. Could just put it in a log_resource_info! call, I think.
Once #69 lands (via #70), the next piece is to refactor out the individual plotting components in relation to their given *MetricsRow
---and then we can get rid of EvaluationRow
, which combines multiple unrelated metrics types in a confusing (and easy to misinterpret) way.
(Ref brief discussion in #70 (comment))
Tasks:
- Replace evaluation_metrics_plot with per-schema plots that take in a table of the relevant metrics row
- Deprecate EvaluationRow and remove evaluation_metrics_row
A dependency of this package, FFMPEG
, cannot be built due to the following exception:
ERROR: Error building `FFMPEG`:
┌ Warning: Platform `arm64-apple-darwin22.4.0` is not an officially supported platform
└ @ BinaryProvider ~/.julia/packages/BinaryProvider/U2dKK/src/PlatformNames.jl:450
ERROR: LoadError: KeyError: key "unknown" not found
Since this dependency cannot be built, the Lighthouse package itself won't build. BinaryProvider.jl
is more or less a deprecated package that hasn't been updated since 2020, but it appears that FFMPEG
still relies on it. I will file a related issue there as well.
(transferred from https://github.com/beacon-biosignals/OldLighthouse.jl/issues/134)
The consensus is that AbstractPlotting is easier to work with and gives you more control, and that the conversion shouldn't be too much work -- @kolia
Currently, Lighthouse asks for a set of input thresholds (e.g. 0.0:0.01:1.0
), then computes binary stats for each threshold, then forms corresponding fpr/tpr curves, then computes the area underneath those curves w/ the trapezoid rule.
I came across An Improved Model Selection Heuristic for AUC today, which among other things pointed out that you can compute the AUC without reference to specific thresholds by sorting the soft-labels of all instances and then doing a particular kind of count:
I think this is a nice algorithm we should consider using, so that the accuracy of the AUC estimate is not influenced by the granularity of the thresholds chosen.
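For reference, a small sketch of the threshold-free computation (this is the standard rank-statistic formulation of AUC, not code from the paper or from Lighthouse; the function and argument names are made up):

# AUC as a rank statistic: the probability that a randomly chosen positive
# instance is assigned a higher soft label than a randomly chosen negative one.
# (Ties are broken arbitrarily here; a mid-rank correction could be added.)
function auc_from_soft_labels(soft_labels::AbstractVector, is_positive::AbstractVector{Bool})
    ranks = invperm(sortperm(soft_labels))   # rank of each instance, 1 = lowest score
    n_pos = count(is_positive)
    n_neg = length(is_positive) - n_pos
    rank_sum = sum(ranks[is_positive])       # sum of the positive instances' ranks
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
end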
After #60 we've defined an interface that folks can plug into for custom logging.
We could drop the Tensorboard dep by removing the LearnLogger
. We could make a no-op logger that doesn't actually log anything, or maybe a Dict
logger that logs things to a logged
dict as the current one does, which could be good for testing. Or we could keep the current LearnLogger.
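As a rough illustration of the Dict-logger idea, something like the following could work (a minimal sketch; it assumes log_value!(logger, field, value) is one of the interface hooks, as used in the snippets elsewhere in these issues, and the DictLogger name and layout are made up):

using Lighthouse

# Minimal sketch of a Dict-backed logger, mirroring how LearnLogger's `logged`
# dict behaves today: each field name maps to a vector of logged values.
struct DictLogger
    logged::Dict{String,Vector{Any}}
end
DictLogger() = DictLogger(Dict{String,Vector{Any}}())

function Lighthouse.log_value!(logger::DictLogger, field::AbstractString, value)
    push!(get!(Vector{Any}, logger.logged, field), value)
    return value
end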
(transferred from beacon-biosignals/OldLighthouse.jl#126)
Regarding the bootstrapping stuff ... I actually have a Julia package for jackknife resampling and estimation https://github.com/ararslan/Jackknife.jl -- @ararslan
As introduced in #45, and blocked by apache/arrow-julia#125. Fix will be to this line:
Line 18 in 9e9de96
Okay, I'm not "allowed" to comment below this point, but I think we should specialize the below implementation of log_evaluation_row! on LearnLogger so that special-casing the Spearman correlation doesn't have to happen by default---that is a very TensorBoard-specific logged item. E.g.,
function log_evaluation_row!(logger, field::AbstractString, metrics)
    metrics_plot = evaluation_metrics_plot(metrics)
    metrics_dict = _evaluation_row_dict(metrics)
    log_plot!(logger, field, metrics_plot, metrics_dict)
    return metrics_plot
end
or, better:
function log_evaluation_row!(logger, field::AbstractString, metrics)
    metrics_plot = evaluation_metrics_plot(metrics)
    log_plot!(logger, field, metrics_plot) # if we make the plot_data field optional
    # optionally, could also then log each field of `metrics_plot` with `log_values!`
    return metrics_plot
end
and
function log_evaluation_row!(logger::LearnLogger, field::AbstractString, metrics)
    metrics_plot = evaluation_metrics_plot(metrics)
    metrics_dict = _evaluation_row_dict(metrics)
    log_plot!(logger, field, metrics_plot, metrics_dict)
    # TensorBoard-specific: additionally log the Spearman correlation as a scalar
    if haskey(metrics_dict, "spearman_correlation")
        sp_field = replace(field, "metrics" => "spearman_correlation")
        log_value!(logger, sp_field, metrics_dict["spearman_correlation"].ρ)
    end
    return metrics_plot
end
Originally posted by @hannahilea in #60 (comment)
We should abstract our logging interface so there is no default logger, and folks can provide methods for a well-documented interface, likely based on this code:
Lines 16 to 55 in 344672c
just realized this would be a nice illustration of what the package currently provides :)
Should probably be optional (as it will increase computation time!); should at least include CIs for inter-rater agreement, possibly also ROC curves etc.
(transferred from https://github.com/beacon-biosignals/OldLighthouse.jl/issues/82)
"Is there class imbalance" and "how many observations are in each class?" are common questions people ask when viewing the output Lighthouse metrics plots. Let's add a class distribution bar plot (or similar) to the default plots to answer these questions!
Right now, the examples in the docs are generated from test data that shows very poor (basically random!) model performance. Update the test data to approximate values from a "decently" performing model, so that the metrics plots are meaningful/interpretable and demonstrate accurate implementation (i.e., give someone looking at the ROC curve demo a reason to trust that it is a correctly implemented ROC curve!).
Follow-up: can use the results to generate #17.
Lines 76 to 87 in e829617
IMO 1.0 is misleading here and NaN would be better. I think this is a breaking change.
The idea is to have something like evaluation_metrics_plot
except for when you have more than 1 model, and for the simpler case of binary classifiers instead of multi-class. Ideally, this plot would include:
Ideally the plots could scale from 2 to say 20 models, though if 20 gets cramped we could split it up and only do smaller subsets.
Essentially, it would look like evaluation_metrics_plot
except for most of the plots, it would be 1 curve per model instead of 1 curve per class.
Krippendorff's alpha is a measure of inter-rater reliability, and there is a Julia package to compute it: https://discourse.julialang.org/t/ann-krippendorff-jl/54261
This way we can call it from log_evaluation_row!
. Perhaps as the fallback we can call log_value!
on each entry.
(transferred from beacon-biosignals/OldLighthouse.jl#115)
The numbers should be white when the color in a cell is below a certain threshold, so that we can still read them.
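A tiny sketch of the kind of rule this could use (the helper name and the 0.5 cutoff are made up, and it assumes low cell values map to dark colors):

# Use white text on dark (low-valued) cells, black otherwise.
annotation_color(cell_value, cutoff=0.5) = cell_value < cutoff ? :white : :black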
This is the counterpart to #45 which is about producing rows. It would be nice if the inputs were tables.
- evaluation_metrics method that consumes tables with this schema, along with keyword arguments (classes, thresholds, probably?)

Currently, if votes
is not provided to Lighthouse.evaluation_metrics_row
, it outputs a row with per_class_IRA_kappas
and multiclass_IRA_kappas
fields defined as missing
. If this is a multiclass result, then passing in such a row to evaluation_metrics_plot
fails because it doesn't expect these fields to exist when not dealing with multi-rater metrics.
For example, for a 5 class EvaluationRow
, it throws:
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 12 and 8")
The computation of true negatives based on the confusion matrix looks like it's summing only the diagonal of the submatrix instead of the full submatrix of all classes except the class of interest. Since this is supposed to be a binary metric, it should sum over all entries whose true class is not the class of interest and whose predicted class is also not the class of interest.
Line 73 in e829617
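In other words, for class i the true negatives should count every cell that lies outside both row i and column i. A one-line sketch (not the actual Lighthouse code, and assuming the same row/column orientation as the snippet above):

# True negatives for class `i`: total count minus row `i` and column `i`,
# adding back the diagonal cell that was subtracted twice.
true_negatives(conf, i) = sum(conf) - sum(conf[i, :]) - sum(conf[:, i]) + conf[i, i]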
Right now, evaluation_metrics_row
is kludgy and tries to do way too much stuff, which makes it (a) hard to know exactly what your outputs relate to and (b) hard to customize any internal subfunctions without rewriting the entire loop (or threading some new param everywhere). It also combines a bunch of different types of output metrics into a single hard-to-parse schema (EvaluationRow
). We'd like to refactor this!
Plan (as devised w/ @ericphanson): Split current evaluation_metrics_row
function into three separate (types of) functions:
What is the output if you call all three steps?
EvaluationRow, which contains threshold-dependent AND threshold-independent metrics, and where each field contains results for all classes AND (when valid) a multiclass value (i.e., some fields are per class, some are multiclass, some contain both; some fields are threshold dependent, some are threshold independent).
Other important points:
(transferred from beacon-biosignals/OldLighthouse.jl#130)
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix
on this issue.
I'll open a PR within a few hours, please be patient!
Okay yup, using WGLMakie as a backend did the trick. But is it mentioned in the docs that we have to activate one of the Makie backends?
Originally posted by @abhinav3398 in #82 (comment)
We define the following schemas:
- "lighthouse.evaluation@1"
- "lighthouse.observation@1"
- "lighthouse.class@1"
- "lighthouse.label-metrics@1" > "lighthouse.class@1"
- "lighthouse.hardened-metrics@1" > "lighthouse.class@1"
- "lighthouse.tradeoff-metrics@1" > "lighthouse.class@1"
We use Legolas.validate in code and tests.
(transferred from https://github.com/beacon-biosignals/OldLighthouse.jl/issues/110)
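For illustration, the declarations might look roughly like this (a sketch assuming the Legolas.@row style of schema declaration from that era of Legolas.jl; the field names are placeholders, not the actual Lighthouse schema fields):

using Legolas

# Placeholder fields; "lighthouse.label-metrics@1" extends "lighthouse.class@1".
const ClassRow = Legolas.@row("lighthouse.class@1",
                              class_index::Union{Int64,Symbol},
                              class_labels::Union{Missing,Vector{String}})

const LabelMetricsRow = Legolas.@row("lighthouse.label-metrics@1" > "lighthouse.class@1",
                                     ira_kappa::Union{Missing,Float64})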