Hey, I wanted to run against a production dataset of small-mid size:

report not showing when using a bit more data (closed)

evidentlyai commented on May 21, 2024

report not showing when using a bit more data

from evidently.

Comments (4)

emeli-dral commented on May 21, 2024

Hey @rmminusrslash,
thanks for more details!

We thought about adding an error message based on data size. But the limit depends on the user's infrastructure, especially when the tool runs locally, so it would be hard to set a universal threshold for when sampling should be applied. As a priority, we are also working right now to speed up the UI, which should solve some of the cases where reports are too large to display. Hopefully, it will help a lot 🤞

We are thinking about adding a flag later that the user can set on their own ("large dataset"), which would then generate a variation of the report that is best suited for larger datasets. It will include not only sampling but also different aggregated views for some parts of the report.

Agreed on your comment about making the limitation for large datasets and the sampling option even clearer for Jupyter notebooks: we have already added this to the Quick-start part of the docs.

emeli-dral commented on May 21, 2024

Hey @rmminusrslash,
Thanks for reporting! Unfortunately, this is a current limitation of the tool.

The report is large because the tool stores all the data necessary to generate interactive plots directly inside the HTML. We plan to fix it when we create a service version of the tool (where we decouple the data storage and the browser-based web service).

For now there are two workarounds:

Use some sampling strategy for your dataset, for instance random sampling. In a Jupyter notebook, that can be done directly with pandas. For the command line interface, we have a configuration option: you can choose random sampling or pick every n-th row.
Use a JSON profile. This way, Evidently calculates the metrics and statistical tests, but they can be logged or displayed elsewhere. We have an example for MLflow, https://docs.evidentlyai.com/step-by-step-guides/integrations/evidently-+-mlflow, and I am working now on one for Grafana.
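
The first workaround can be sketched with pandas alone. This is a minimal sketch, not Evidently's own sampling code: the DataFrame contents, the sample size of 10,000 rows, and the step of 10 are illustrative assumptions, while `DataFrame.sample` and `iloc` slicing are standard pandas.

```python
import pandas as pd

# Hypothetical stand-in for a production dataset; replace with your own data.
current = pd.DataFrame(
    {
        "feature": range(100_000),
        "target": [i % 2 for i in range(100_000)],
    }
)

# Random sampling: keep at most 10,000 rows (illustrative size), reproducibly.
sample_size = min(10_000, len(current))
random_sample = current.sample(n=sample_size, random_state=42)

# Every n-th row: keep rows 0, 10, 20, ... (a step of 10 is an illustrative choice).
nth_sample = current.iloc[::10]

print(len(random_sample), len(nth_sample))
```

Either sample can then be used in place of the full dataset. When two datasets are compared, it probably makes sense to sample both in the same way.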

We understand this limits how you can use the tool now, and we are working hard to get to a more full-featured version!

rmminusrslash commented on May 21, 2024

Hey @emeli-dral,

Ah, I probably should have been clearer about what I was asking. I tried sampling once I figured out the root cause; up to 10K data points worked.

Would it make sense to:

add sampling as the default if the dataset exceeds current limits (and display a message that sampling happened)
or, if you decide against that, at least raise an unsupported exception that mentions the sampling option, and mention the limitation in the docs?

The current behavior of failing silently might not be ideal until you release the full version (unless you expect people to try the tool mostly with toy data)
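
The two options proposed above could look roughly like the sketch below. It is not part of Evidently: the function name `prepare_for_report`, the `MAX_ROWS` limit, the `auto_sample` flag, and the choice of `ValueError` are all hypothetical, and the 10,000-row threshold simply mirrors the size that worked in this thread.

```python
import warnings

import pandas as pd

# Hypothetical limit; this thread reports that up to 10K data points worked.
MAX_ROWS = 10_000


def prepare_for_report(
    df: pd.DataFrame,
    max_rows: int = MAX_ROWS,
    auto_sample: bool = True,
    seed: int = 0,
) -> pd.DataFrame:
    """Return data small enough to render, or fail loudly instead of silently."""
    if len(df) <= max_rows:
        return df
    if auto_sample:
        # Option 1: sample by default and tell the user that it happened.
        warnings.warn(
            f"Dataset has {len(df)} rows; randomly sampled down to {max_rows}.",
            UserWarning,
            stacklevel=2,
        )
        return df.sample(n=max_rows, random_state=seed)
    # Option 2: refuse to continue and point to the sampling option.
    raise ValueError(
        f"Dataset has {len(df)} rows, above the limit of {max_rows}. "
        "Sample the data first, for example with DataFrame.sample."
    )


# Usage with a hypothetical 25,000-row dataset: emits a warning and samples.
big = pd.DataFrame({"x": range(25_000)})
small = prepare_for_report(big)
```

Either branch replaces a silent failure with something the user can act on. As noted elsewhere in the thread, a universal threshold is hard to pick, so the limit is left as a parameter here.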

emeli-dral commented on May 21, 2024

Reports now use no raw-data plots by default, which reduces report size significantly.
