Comments (6)
Hi! Thanks for your contribution! Great first issue!
from litdata.
Hey @dnnspark,
Yes, this is quite simple to add. We simply need to add the downloader.
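To illustrate what "adding the downloader" could look like, here is a minimal sketch of a registry that picks a downloader by URL scheme. This is not litdata's actual code: `Downloader`, `GCSDownloader`, `register_downloader`, and `get_downloader` are hypothetical names, and the GCS transfer is delegated to an injected `fetch` callable so the sketch runs without cloud credentials. A real version might back `fetch` with the google-cloud-storage client.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict
from urllib.parse import urlparse


class Downloader:
    """Hypothetical base class: copies one remote file to a local path."""

    def download_file(self, remote_path: str, local_path: str) -> None:
        raise NotImplementedError


class LocalDownloader(Downloader):
    """Handles plain filesystem paths by copying the file."""

    def download_file(self, remote_path: str, local_path: str) -> None:
        Path(local_path).parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(remote_path, local_path)


class GCSDownloader(Downloader):
    """Hypothetical gs:// downloader.

    `fetch(bucket, blob_name) -> bytes` is an assumption of this sketch;
    it stands in for whatever client call actually talks to GCS.
    """

    def __init__(self, fetch: Callable[[str, str], bytes]) -> None:
        self._fetch = fetch

    def download_file(self, remote_path: str, local_path: str) -> None:
        parsed = urlparse(remote_path)  # gs://bucket/key -> netloc, path
        data = self._fetch(parsed.netloc, parsed.path.lstrip("/"))
        Path(local_path).parent.mkdir(parents=True, exist_ok=True)
        Path(local_path).write_bytes(data)


# Maps a URL scheme ("" for local paths, "gs" for GCS) to a downloader.
_DOWNLOADERS: Dict[str, Downloader] = {"": LocalDownloader()}


def register_downloader(scheme: str, downloader: Downloader) -> None:
    _DOWNLOADERS[scheme] = downloader


def get_downloader(path: str) -> Downloader:
    scheme = urlparse(path).scheme
    if scheme not in _DOWNLOADERS:
        raise ValueError(f"No downloader registered for scheme {scheme!r}")
    return _DOWNLOADERS[scheme]
```

With this shape, supporting a new backend only means registering one more scheme; the calling code keeps using `get_downloader(path).download_file(...)`.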
Thanks @tchaton, do you have an idea when it's going to land (even a very rough estimate)?
Hey @dnnspark,
If you are willing to give it a try, I can look into it this week.
Sorry for the late reply, @tchaton.
I'm willing to try! It's not blocking at the moment, though, so I'll stay tuned for the GCS support (it would be very helpful if you could ping this thread once it's ready).
One thing I noticed is that the optimize() function assumes the data is stored on local disk (at least in the example). In my case, the raw data is on GCS (because it's too large). Is there a way to transform data that is stored in the cloud and save the transformed data back to the cloud without having to download the entire dataset?
from litdata.
Yes, that's exactly why this library was built :) But I would need to add GCS support for it first ;) I will try to prioritize it.
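For readers wondering what "transform in the cloud without downloading everything" means in practice, here is a rough sketch of the general pattern: list the objects, then read, transform, and write them one at a time so only a single item is held locally at any moment. It does not use litdata or a real GCS client; `InMemoryBucket` and `transform_bucket` are hypothetical stand-ins made up for this example.

```python
from typing import Callable, Dict, Iterator, Optional


class InMemoryBucket:
    """Hypothetical stand-in for a cloud bucket: object name -> bytes."""

    def __init__(self, objects: Optional[Dict[str, bytes]] = None) -> None:
        self._objects: Dict[str, bytes] = dict(objects or {})

    def list(self, prefix: str = "") -> Iterator[str]:
        # Yield matching object names in a stable order.
        for name in sorted(self._objects):
            if name.startswith(prefix):
                yield name

    def read(self, name: str) -> bytes:
        return self._objects[name]

    def write(self, name: str, data: bytes) -> None:
        self._objects[name] = data


def transform_bucket(
    src: InMemoryBucket,
    dst: InMemoryBucket,
    fn: Callable[[bytes], bytes],
    prefix: str = "",
) -> int:
    """Apply `fn` to each object under `prefix`, one object at a time.

    Only the object currently being processed is held in memory, so the
    full dataset never has to be downloaded up front.
    """
    count = 0
    for name in src.list(prefix):
        dst.write(name, fn(src.read(name)))
        count += 1
    return count


raw = InMemoryBucket({"raw/a.txt": b"hello", "raw/b.txt": b"world", "other/c.txt": b"skip"})
out = InMemoryBucket()
processed = transform_bucket(raw, out, lambda data: data.upper(), prefix="raw/")
```

Swapping `InMemoryBucket` for a thin wrapper around a real storage client would keep `transform_bucket` unchanged, which is the point of keeping the storage interface this small.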