Comments (3)
Oh, my bad. I didn't consider that internal fragmentation might still occur even though each pth file is larger than 4KB. I checked that each pth is around 7.7KB, which could explain why a 32MB data shard ended up taking slightly more space. Yet 8KB - 7.7KB = 0.3KB ~= 4%, so it is unlikely to explain the 39MB - 32MB = 7MB ~= 20% gap in the shard size.
from webdataset.
`ShardWriter` estimates shard size based on the total number of content bytes, not the actual file size of the shard on disk; the overhead from the tar format and container is not accounted for. If you write small files, this overhead might become significant. We'll try to make the estimates more accurate in the future.
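To get a feel for the overhead described above, here is a minimal sketch that uses only Python's standard `tarfile` module, not webdataset itself. The helper names, the file count, and the 7,885-byte payload (roughly 7.7KB) are illustrative assumptions, not values taken from the actual shard. It relies on the tar layout of one 512-byte header per member plus data padded up to a multiple of 512 bytes.

```python
import io
import tarfile

BLOCK = 512  # tar block size in bytes


def tar_member_size(content_bytes: int) -> int:
    """Bytes one regular file occupies in a tar archive: header + padded data."""
    padded = -(-content_bytes // BLOCK) * BLOCK  # round up to a full block
    return BLOCK + padded


def build_tar(num_files: int, content_bytes: int) -> bytes:
    """Write num_files dummy members of content_bytes each into an in-memory tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for i in range(num_files):
            info = tarfile.TarInfo(name=f"{i:06d}.pth")
            info.size = content_bytes
            tar.addfile(info, io.BytesIO(b"\0" * content_bytes))
    return buf.getvalue()


# Hypothetical numbers: 100 files of about 7.7KB each.
num_files, content_bytes = 100, 7885
archive = build_tar(num_files, content_bytes)

content_total = num_files * content_bytes  # what a content-byte estimate counts
members_total = num_files * tar_member_size(content_bytes)

print("content bytes:  ", content_total)
print("tar member bytes:", members_total)
print("archive bytes:  ", len(archive))
print("overhead ratio: ", len(archive) / content_total)
```

Under these assumptions the archive comes out noticeably larger than the sum of content bytes, and the ratio grows as files get smaller. A sample that stores several small files per key pays the header and padding cost once per file, so the real gap depends on how each sample is laid out.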
Ok, fair enough.