Comments (6)
Hey @ehartford. I have already prepared a version of SlimPajama. It is ready to use on the platform.
from litdata.
Here is the code:
import os

from litdata import StreamingDataset, CombinedStreamingDataset
from litdata.streaming.item_loader import TokensLoader
from torch.utils.data import DataLoader
from tqdm import tqdm

train_datasets = [
    StreamingDataset(
        input_dir="s3://tinyllama-template/slimpajama/train/",
        item_loader=TokensLoader(block_size=2048 + 1),  # Optimized loader for tokens used by LLMs
        shuffle=True,
        drop_last=True,
    ),
    StreamingDataset(
        input_dir="s3://tinyllama-template/starcoder/",
        item_loader=TokensLoader(block_size=2048 + 1),  # Optimized loader for tokens used by LLMs
        shuffle=True,
        drop_last=True,
    ),
]

# Mix SlimPajama data and Starcoder data with these proportions:
weights = (0.693584, 0.306416)
combined_dataset = CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=weights)

train_dataloader = DataLoader(combined_dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())

# Iterate over the combined datasets
for batch in tqdm(train_dataloader):
    pass
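To make the role of `weights` and `seed` more concrete, here is a minimal pure-Python sketch of weighted mixing between two streams. It is not litdata's implementation: the `mix` helper and the counter-based stand-in streams are hypothetical, and the stop-on-first-exhausted behavior is an assumption made for the illustration.

```python
import random
from itertools import count


def mix(streams, weights, seed=42):
    """Yield (source_index, item) pairs, picking the source at random in proportion to weights."""
    rng = random.Random(seed)  # a fixed seed makes the interleaving reproducible
    iterators = [iter(stream) for stream in streams]
    indices = range(len(iterators))
    while True:
        idx = rng.choices(indices, weights=weights, k=1)[0]
        try:
            yield idx, next(iterators[idx])
        except StopIteration:
            return  # assumption: stop as soon as any source runs out


# Hypothetical stand-ins for the two datasets: endless counters.
slimpajama_like = count()
starcoder_like = count()
weights = (0.693584, 0.306416)

sampler = mix([slimpajama_like, starcoder_like], weights)
picks = [next(sampler)[0] for _ in range(10_000)]
share = picks.count(0) / len(picks)
print(f"share drawn from the first stream: {share:.3f}")
```

Over many draws, the share taken from the first stream should approach the first weight.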
Hi! Thanks for your contribution! Great first issue!
OK, but wouldn't it be better to support Hugging Face instead of having to copy the dataset to S3? AWS charges for ingress and egress.
Ok but, is it better to support hugging face instead of having to copy the dataset to s3?
We have had issues with the stability and reachability of HF models and datasets in the past, so I would say that S3 is the more reliable alternative.
Hey @ehartford. In order to stream datasets, we need to optimize the dataset first. We could have an auto-optimize version for the HF datasets, but it would still require downloading the dataset and converting it.
HF supports some streaming with the webdataset backend, but I gave up on it as it was too unreliable for anything serious. The pipe breaks, it doesn't support multi-node, etc.
If you are interested in using any particular dataset, I recommend trying out the Lightning AI platform.
Here is an example where I prepare Wikipedia Swedish: https://lightning.ai/lightning-ai/studios/tokenize-2m-swedish-wikipedia-articles
And another one where I prepared SlimPajama & StarCoder: https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset.
Don't hesitate to ask any other questions :)
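For readers wondering what preparing a token dataset for a block-based loader involves, here is a rough pure-Python sketch of packing a flat token stream into fixed-size blocks. It does not reproduce litdata's on-disk format or API; `pack_into_blocks` and `split_inputs_targets` are hypothetical helpers. Reading `block_size=2048 + 1` as "context length plus one token for the shifted next-token target" is an assumption, not something stated in this thread.

```python
def pack_into_blocks(token_stream, block_size):
    """Group a flat stream of token ids into lists of exactly block_size ids."""
    block = []
    for token in token_stream:
        block.append(token)
        if len(block) == block_size:
            yield block
            block = []
    # A trailing partial block is dropped in this sketch.


def split_inputs_targets(block):
    """Assumed next-token setup: targets are the inputs shifted by one position."""
    return block[:-1], block[1:]


context_length = 4   # small stand-in for 2048
tokens = range(10)   # hypothetical token ids from an already tokenized corpus

blocks = list(pack_into_blocks(tokens, block_size=context_length + 1))
inputs, targets = split_inputs_targets(blocks[0])
print(blocks)
print(inputs, targets)
```

Under this assumption, each block of `context_length + 1` tokens gives one training sample whose inputs and targets both have length `context_length`.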