Comments (6)
Hi, we have recently introduced initial support for training on arbitrary out-of-memory datasets using ivis by formalizing the interface that input data must conform to.
Ivis will accept `ivis.data.sequence.IndexableDataset` instances in its `fit`, `transform` and `fit_transform` methods. An `IndexableDataset` inherits from `collections.abc.Sequence` and defines one new method, `shape`, that takes no arguments and returns the expected shape of the dataset (for example, [rows, columns]).
The `collections.abc.Sequence` class requires `__len__` (returns number of rows) and `__getitem__` (returns data at row index) to be implemented. When implementing the `__getitem__` method we can customize how the data is retrieved to behave in any way desired.
As an example, we have provided an `ivis.data.sequence.ImageDataset` class for reference, which reads image files from disk into memory when indexed.
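To make the interface described above concrete, here is a minimal sketch of a disk-backed dataset that implements `__len__`, `__getitem__` and `shape`. The class name `BinaryFileDataset` and its file layout are hypothetical, and the sketch subclasses `collections.abc.Sequence` directly so that it runs without ivis installed; with ivis available, the base class would presumably be `ivis.data.sequence.IndexableDataset`, and rows would more likely be returned as NumPy arrays than as lists.

```python
import struct
import tempfile
from collections.abc import Sequence


class BinaryFileDataset(Sequence):
    """Hypothetical example: float32 rows stored back to back in a flat file.

    Only one row is read into memory per lookup, so the file can be
    larger than available RAM.
    """

    def __init__(self, path, n_cols):
        self.path = path
        self.n_cols = n_cols
        self._row_bytes = 4 * n_cols  # one float32 is 4 bytes
        with open(path, "rb") as f:
            f.seek(0, 2)  # jump to the end of the file to measure its size
            self._n_rows = f.tell() // self._row_bytes

    def __len__(self):
        # Number of rows in the dataset.
        return self._n_rows

    def __getitem__(self, idx):
        # Read a single row from disk at the requested row index.
        if idx < 0:
            idx += self._n_rows
        if not 0 <= idx < self._n_rows:
            raise IndexError(idx)
        with open(self.path, "rb") as f:
            f.seek(idx * self._row_bytes)
            raw = f.read(self._row_bytes)
        return list(struct.unpack(f"<{self.n_cols}f", raw))

    def shape(self):
        # Takes no arguments and returns [rows, columns].
        return [self._n_rows, self.n_cols]


# Write a tiny 3 x 2 example file, then index into it.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    for row in [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)]:
        tmp.write(struct.pack("<2f", *row))

dataset = BinaryFileDataset(tmp.name, n_cols=2)
second_row = dataset[1]
```

Opening the file on every lookup keeps the sketch short; a real implementation would probably keep a handle or a memory map open instead.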
This is still quite a new feature and we may enhance it based on the feedback we get, so any thoughts on your experience with this would be valued if you end up trying it. We also want to, in time, expand the classes we provide to cover some common use-cases.
from ivis.
Hi thanks for raising this, we agree that this would be a useful feature to have. We are looking into whether we could support the use of generators with ivis.
Just as an update to this, using a generator to train ivis is difficult since the triplet sampling algorithm may need to retrieve KNNs or negative points that are not in the batch - we normally index the dataset to retrieve these efficiently, but a generator can only be iterated over.
We are exploring other potential ways we could make training on out-of-memory data easier, so will leave this issue open as we look into it.
@Szubie
Thank you very much for the update.
As a side note, in https://bering-ivis.readthedocs.io/en/latest/oom_datasets.html I read:
When training on a h5 dataset, we recommend to use the shuffle_mode='batch' option in the fit method. This will speed up the training process by pulling a batch of data from disk and shuffling that batch, rather than shuffling across the whole dataset.
I don't know if this is a custom training strategy, but if you use the Keras fit() method, my understanding is that "batch shuffle" doesn't shuffle rows inside batches, but shuffles the order of the batches (please correct me if I'm wrong).
> I don't know if this is a custom training strategy, but if you use the Keras fit() method, my understanding is that "batch shuffle" doesn't shuffle rows inside batches, but shuffles the order of the batches (please correct me if I'm wrong).
That's right.
Each triplet is made up of three data points: 1) the anchor, 2) the positive example (one of the k-nearest neighbors), and 3) a negative example. The Keras `fit` method only shuffles the anchors - when using the 'batch' shuffle mode, anchors are shuffled within a batch.
But each anchor data point then needs to be combined with a positive and negative example in order to create a triplet. And these points may be in a completely different part of the data, outside of the current batch of 'anchors'.
For each anchor, we can retrieve the index of a positive example using the AnnoyIndex, but to actually retrieve the data at that index we need an indexable data structure (at least at the moment).
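The need for random access can be illustrated with a small sketch. This is not the ivis implementation: the function `make_triplets` is hypothetical, a plain dictionary stands in for the AnnoyIndex lookup, and the negative is simply drawn at random from the rows that are not neighbors of the anchor, which is an assumption about the sampling rule.

```python
import random


def make_triplets(data, neighbours, anchor_batch, rng):
    """Build (anchor, positive, negative) triplets for a batch of anchor indices.

    `data` only needs to support len() and integer indexing. The positive
    and negative rows may sit anywhere in the dataset, not just in the batch.
    """
    triplets = []
    for a in anchor_batch:
        # Positive: one of the anchor's precomputed nearest neighbours.
        p = rng.choice(neighbours[a])
        # Negative: any row that is neither the anchor nor one of its neighbours.
        excluded = set(neighbours[a]) | {a}
        candidates = [i for i in range(len(data)) if i not in excluded]
        n = rng.choice(candidates)
        # All three lookups index into the dataset by row number.
        triplets.append((data[a], data[p], data[n]))
    return triplets


# Six toy points forming two clusters; the neighbour table stands in for
# an approximate nearest neighbour index built beforehand.
data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1], [5.0, 5.1]]
neighbours = {0: [1, 4], 1: [0, 4], 2: [3, 5], 3: [2, 5], 4: [0, 1], 5: [2, 3]}

rng = random.Random(0)
anchor_batch = [0, 1, 2]   # the current batch holds rows 0..2 only
rng.shuffle(anchor_batch)  # anchors are shuffled within the batch
triplets = make_triplets(data, neighbours, anchor_batch, rng)
```

Row 4 is a neighbor of rows 0 and 1 but lies outside the batch, so a generator that yields only rows 0 to 2 could not supply it, whereas `data[4]` works on any indexable structure.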
As an update to this issue, support for data stored outside of memory has been improved with the new `get_batch` method, which will be called in preference to `__getitem__` if possible. Fetching full batches of data at once can greatly improve performance when running ivis on an out-of-memory data store.
For an example of using ivis on data stored in a sqlite database we've added the following jupyter notebook:
https://github.com/beringresearch/ivis/blob/master/notebooks/using_ivis_with_sqlite.ipynb
By using `get_batch`, the SqliteDB class is able to get all the data required for an ivis training step in a single SQL query. The same techniques used in the notebook can be used to adapt any out-of-memory dataset with minimal code.
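As a rough illustration of the idea (not the notebook's code), the sketch below fetches a whole batch of rows with one SQL query. The class name `SqliteRows`, the table layout, and the `get_batch(idx_seq)` signature are assumptions made for this example. It subclasses `collections.abc.Sequence` so that it runs without ivis installed, and it assumes rowids are contiguous and start at 1.

```python
import sqlite3
from collections.abc import Sequence


class SqliteRows(Sequence):
    """Hypothetical dataset backed by a SQLite table with contiguous rowids."""

    def __init__(self, conn, table, columns):
        self.conn = conn
        self.table = table
        self.columns = list(columns)

    def __len__(self):
        query = f"SELECT COUNT(*) FROM {self.table}"
        return self.conn.execute(query).fetchone()[0]

    def __getitem__(self, idx):
        # Single-row access reuses the batch path.
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        return self.get_batch([idx])[0]

    def get_batch(self, idx_seq):
        # One query for the whole batch instead of one query per row.
        rowids = [i + 1 for i in idx_seq]  # rowids are assumed to start at 1
        placeholders = ",".join("?" * len(rowids))
        cols = ", ".join(self.columns)
        query = (
            f"SELECT rowid, {cols} FROM {self.table} "
            f"WHERE rowid IN ({placeholders})"
        )
        rows = self.conn.execute(query, rowids).fetchall()
        # IN (...) returns rows in table order and without duplicates,
        # so map the results back to the requested order.
        by_id = {r[0]: list(r[1:]) for r in rows}
        return [by_id[r] for r in rowids]

    def shape(self):
        return [len(self), len(self.columns)]


# Build a small in-memory table for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0), (6.0, 7.0)],
)

dataset = SqliteRows(conn, "points", ["x", "y"])
batch = dataset.get_batch([3, 0, 3])
```

SQLite limits the number of bound parameters per statement, so very large batches would need to be split across several queries.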
Closing this issue as it has now been addressed.