Comments (5)
In the examples directory we have a simple download script that will download the example data sets for the blog posts we created on yellowbrick (and power the Jupyter notebook that is also in that directory).
However, we also need to make sure that this script can be used to download or unpack data for testing (e.g. to a tmp directory). So we have to do one of the following:
- Add compressed datasets and write a script in tests to load and decompress them in memory (or write them to a temporary directory).
- Upload the datasets to S3 (they're on Dropbox now, I think, which isn't a great host) and then write a more robust download script.
Whoever takes this item, we'd be happy to discuss a path for either of the above.
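As a rough illustration of the first option, here is a minimal sketch that either unpacks a zipped dataset into a temporary directory or reads a single member straight into memory. The helper names `unpack_dataset` and `read_member` are hypothetical and not part of Yellowbrick; only the standard library is used.

```python
import tempfile
import zipfile
from pathlib import Path


def unpack_dataset(archive, dest=None):
    """Extract a zipped dataset and return the directory it was extracted to.

    If dest is None, a fresh temporary directory is created; the caller is
    responsible for removing it afterwards.
    """
    if dest is None:
        dest = tempfile.mkdtemp(prefix="yb-data-")
    dest = Path(dest)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest


def read_member(archive, member):
    """Read one file from the archive into memory without writing to disk."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(member)
```

Both helpers accept either a path or an open binary file object, so a test could keep the compressed data next to the test module and unpack it per test run.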
from yellowbrick.
@rebeccabilbro moving forward on this for my afternoon Yellowbrick session; just to take on something fairly easy so that I can get back to my proposal.
@ojedatony1616 I'm going to add the datasets to the DDL Data Bucket on S3; hope that's ok, let me know if you have other thoughts.
Tasks I'm going to do here:
- Create dataset bundles as CSV with a header, metadata, and a README from the three datasets we currently have.
- Zip the dataset bundles into a single file with an MD5 hash.
- Create a downloader that checks the hash against the file and stores it locally.
- Update the examples.ipynb and examples.py with downloaded data.
- Create a test fixture that downloads the data to a temporary file.
- Write a test or two that uses the temporary fixture.
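The "downloader that checks the hash" task might look something like the sketch below. This is only an illustration under assumptions: the function names `file_digest` and `download` are hypothetical, and the URL and expected digest would come from wherever the bundles end up being hosted. The algorithm defaults to MD5 to match the task list, but any name accepted by `hashlib.new` (for example `"sha256"`) can be passed instead.

```python
import hashlib
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def file_digest(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download(url, expected, dest_dir, algorithm="md5"):
    """Download url into dest_dir and verify the file against expected.

    Returns the path of the stored file. If the digest does not match,
    the file is removed and a ValueError is raised.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Name the local file after the last component of the URL path.
    target = dest_dir / Path(urlparse(url).path).name

    with urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)

    actual = file_digest(target, algorithm)
    if actual != expected:
        target.unlink()
        raise ValueError(
            f"{algorithm} mismatch for {target.name}: "
            f"expected {expected}, got {actual}"
        )
    return target
```

A test fixture could call `download` with a directory created by `tempfile.mkdtemp()` and remove that directory during teardown.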
@rebeccabilbro ok, so what I've done (particularly as it relates to #60) for each dataset is as follows:
- Created a README.md from the UCI ML Repository information with correct citations (including bibtex) and other information (similar to the bundle methodology).
- Created a meta.json with the feature names and target names if a classifier.
- Created a single CSV dataset (e.g. combining test, training, etc.) with a header row that works with `pd.read_csv`.
- Ensured that the feature names in the CSV file were easily understandable.
- Ensured that the target is the last column in the dataset.
- Packaged the dataset into a directory with zip as follows:
  `$ zip -r name.zip name/`
  where name is the single identifier of the directory (e.g. "occupancy").
- Took the sha256 hash of the file and stored it in `download.py` to ensure that the latest version of the dataset is being downloaded and that it hasn't been corrupted.
- Uploaded the .zip file to S3, namely to DDL's data-lake bucket, and made it publicly available for download.
- Modified `download.py` to ensure that the new file is downloaded in `download_all`.
If possible, I'd like to ensure that all of our data sets that we produce for Yellowbrick are treated in a similar fashion.
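For anyone consuming a bundle laid out this way, a loader could be sketched roughly as follows. The assumptions here are mine: that the CSV inside the directory is named after the directory (e.g. `occupancy/occupancy.csv`), and the helper name `load_bundle` is hypothetical. The contents of meta.json are returned as-is rather than guessing at its keys, and the standard `csv` module is used so the sketch has no dependencies, although `pd.read_csv` would work on the same file.

```python
import csv
import json
from pathlib import Path


def load_bundle(root, name):
    """Load a dataset bundle stored as root/name/{name.csv, meta.json}.

    Returns (header, X, y, meta). The target y is taken from the last
    column of every row, following the convention described above.
    Values are returned as strings; no type conversion is attempted.
    """
    bundle = Path(root) / name

    with open(bundle / "meta.json") as f:
        meta = json.load(f)

    # The file name "<name>.csv" is an assumption about the bundle layout.
    with open(bundle / f"{name}.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    X = [row[:-1] for row in rows]
    y = [row[-1] for row in rows]
    return header, X, y, meta
```

Keeping the target in the last column means the same split works for every dataset without consulting the metadata first.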
Also note that this means the `examples.ipynb` will break if you haven't redownloaded the data files!
And now we cross our fingers that Travis passes ...