Comments (4)
Hey @rmminusrslash,
thanks for the additional details!
We thought about adding an error message based on data size. But the limit would depend on the user's infrastructure, especially when the tool is used locally, so it would be hard to set a universal threshold for when sampling should be applied. As a priority, we are also working right now on speeding up the UI, which should solve some of the cases where reports are too large to display. Hopefully, it will help a lot 🤞
We are thinking about adding a flag later that users can set on their own ("large dataset"), which would then generate a variation of the report that is best suited for larger datasets. It would include not only sampling but also different aggregated views for some parts of the report.
We agree with your comment about making the limitation for large datasets and the sampling option even clearer for Jupyter notebooks: we have now added this to the Quick-start part of the docs.
from evidently.
Hey @rmminusrslash,
Thanks for reporting! Unfortunately, this is a current limitation of the tool.
The report is large because the tool stores all the data necessary to generate interactive plots directly inside the HTML. We plan to fix it when we create a service version of the tool (where we decouple the data storage and the browser-based web service).
For now there are two workarounds:
- Use a sampling strategy for your dataset, for instance random sampling. In a Jupyter notebook, that can be done directly with pandas. For the command line interface, we have a configuration option: you can choose random sampling or pick every n-th row.
- Use the JSON profile. This way, Evidently calculates the metrics and statistical tests, but they can be logged or displayed elsewhere. We have an example for MLflow https://docs.evidentlyai.com/step-by-step-guides/integrations/evidently-+-mlflow and I am now working on one for Grafana.
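As a rough illustration of the first workaround in a notebook, the sketch below downsamples a DataFrame with plain pandas before it is passed to the report. The helper names `sample_rows` and `every_nth_row` and the 10,000-row default are hypothetical and chosen only for illustration; they are not part of Evidently.

```python
import pandas as pd


def sample_rows(df: pd.DataFrame, n: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Return at most n randomly chosen rows (hypothetical helper, not an Evidently API)."""
    if len(df) <= n:
        # Small enough already: keep the data untouched.
        return df
    # random_state makes the sample reproducible between runs.
    return df.sample(n=n, random_state=seed)


def every_nth_row(df: pd.DataFrame, step: int) -> pd.DataFrame:
    """Keep every step-th row, starting from the first one."""
    return df.iloc[::step]
```

The same sampling should typically be applied to both the reference and the current dataset, so that the two are compared on a similar footing.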
We understand this limits how you can use the tool now, and we are working hard to get to a more full-featured version!
Hey @emeli-dral,
Ah, I probably should have been clearer about what I was asking. I tried sampling once I figured out the root cause; up to 10K datapoints worked.
Would it make sense to:
- add sampling as the default if the dataset exceeds current limits (and display a message that sampling happened)
- if you decide against it, at least raise an "unsupported" exception that mentions the sampling option, and mention the limitation in the docs
The current behavior of failing silently might not be ideal until you release the full version (unless you expect people to try the tool mostly with toy data).
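To make the two options concrete, here is a minimal sketch of what such a guard could look like on the user side. Everything in it is an assumption for illustration: `prepare_for_report`, `MAX_ROWS`, and the `auto_sample` switch are hypothetical names, and the 10,000-row limit simply mirrors the size that worked in my case, not a documented Evidently threshold.

```python
import warnings

import pandas as pd

# Hypothetical limit; the real threshold depends on the machine and browser.
MAX_ROWS = 10_000


def prepare_for_report(
    df: pd.DataFrame,
    max_rows: int = MAX_ROWS,
    auto_sample: bool = True,
    seed: int = 0,
) -> pd.DataFrame:
    """Sample oversized data with a visible warning, or fail loudly instead."""
    if len(df) <= max_rows:
        return df
    if not auto_sample:
        # Option 2: refuse to continue and point the user at sampling.
        raise ValueError(
            f"Dataset has {len(df)} rows, above the supported {max_rows}. "
            "Consider sampling the data before building the report."
        )
    # Option 1: sample by default, but tell the user that it happened.
    warnings.warn(
        f"Dataset has {len(df)} rows; randomly sampled down to {max_rows}."
    )
    return df.sample(n=max_rows, random_state=seed)
```

Either branch avoids the silent failure: the user sees a warning or an exception that names sampling as the way out.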
By default, reports no longer include any raw data plots, which reduces report size significantly.