Comments (6)
A question: how constrained can Chest's API be in terms of types?
I ask because I think it would be interesting/fun/useful to implement either Thrift or Protobuf (or maybe we could try both and benchmark them) as a backend for Chest and be able to avoid pickle, because it's pickle.
Another thought: if we only need a mapping between keys and arrays, we might be able to create a custom serialization format around bcolz, and then we get all the benefits that entails.
Thoughts? This is a problem I'm interested in hacking on.
from dask.
One could parametrize chest with strict key and value types and do some type checking. Not sure if this should live outside or inside the core `Chest` class.
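As a rough sketch of the "outside the core class" option, a thin wrapper could enforce key and value types around any dict-like store. `TypedMapping` and its parameters are hypothetical names used only for illustration; nothing here is part of Chest, and a plain `dict` stands in for the backing store.

```python
from collections.abc import MutableMapping


class TypedMapping(MutableMapping):
    """Dict-like wrapper that rejects keys or values of the wrong type.

    Hypothetical helper for illustration; not part of Chest.
    """

    def __init__(self, store, key_type, value_type):
        self._store = store            # any dict-like backing store
        self._key_type = key_type
        self._value_type = value_type

    def __setitem__(self, key, value):
        # Check types once, at write time, before touching the store.
        if not isinstance(key, self._key_type):
            raise TypeError(f"key must be {self._key_type.__name__}")
        if not isinstance(value, self._value_type):
            raise TypeError(f"value must be {self._value_type.__name__}")
        self._store[key] = value

    def __getitem__(self, key):
        return self._store[key]

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


typed = TypedMapping({}, key_type=str, value_type=bytes)
typed["x"] = b"\x00\x01"   # accepted
# typed[1] = b"" would raise TypeError because the key is not a str
```

Keeping the check in a wrapper leaves the store itself untyped, so the same store can still be used without validation where that is preferable.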
Chest currently can be parametrized by `dump`/`load` functions, so swapping in other protocols can be made to work that way.
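To make the `dump`/`load` idea concrete, here is a minimal sketch of a store whose serialization is injected by the caller. `DiskStore` is a hypothetical toy class, not Chest's implementation; the only assumption taken from the thread is that the two callables follow the `pickle.dump(obj, file)` / `pickle.load(file)` calling convention.

```python
import json
import os
import pickle
import tempfile


class DiskStore:
    """Toy key-value store that writes each value to its own file.

    Hypothetical class for illustration; Chest's real code differs.
    """

    def __init__(self, path, dump=pickle.dump, load=pickle.load, mode="b"):
        self.path = path
        self.dump = dump    # callable(obj, file)
        self.load = load    # callable(file) -> obj
        self.mode = mode    # "b" for binary formats, "t" for text formats
        os.makedirs(path, exist_ok=True)

    def _filename(self, key):
        return os.path.join(self.path, str(key))

    def __setitem__(self, key, value):
        with open(self._filename(key), "w" + self.mode) as f:
            self.dump(value, f)

    def __getitem__(self, key):
        try:
            with open(self._filename(key), "r" + self.mode) as f:
                return self.load(f)
        except FileNotFoundError:
            raise KeyError(key) from None


with tempfile.TemporaryDirectory() as tmp:
    # Default protocol: pickle.
    pickled = DiskStore(os.path.join(tmp, "p"))
    pickled["a"] = (1, 2, 3)

    # Swapped-in protocol: JSON, which needs text-mode files.
    as_json = DiskStore(os.path.join(tmp, "j"),
                        dump=json.dump, load=json.load, mode="t")
    as_json["a"] = [1, 2, 3]

    results = (pickled["a"], as_json["a"])
```

Any format with a matching pair of functions (a Thrift or Protobuf wrapper, for example) could be plugged in the same way, at the cost of writing that pair.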
One reason that I actually like pickle is that it handles numpy and pandas very nicely (as long as you use `pickle.dump(..., protocol=2)`). I'm pretty confident that this does almost no manipulation of the actual data bytes (though I'm sure that the metadata is entirely mangled).
Using bcolz (or compression generally) also makes sense. One thing to beware of is that bcolz often trades CPU for memory bandwidth. In the context of dask we may not have excess CPU to spare, so minimal compression may be a better choice here than it would be in a normal, single-core case.
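One way to experiment with that trade-off is a `dump`/`load` pair with a tunable compression level. The sketch below uses the standard library's `zlib` purely as a stand-in for bcolz, and the function names `dump_compressed` and `load_compressed` are made up for this example; the sizes and timings for real array data would need to be measured.

```python
import io
import pickle
import zlib
from functools import partial


def dump_compressed(obj, file, level=1):
    """Pickle obj, compress it with zlib at the given level, write to file."""
    file.write(zlib.compress(pickle.dumps(obj, protocol=2), level))


def load_compressed(file):
    """Inverse of dump_compressed."""
    return pickle.loads(zlib.decompress(file.read()))


# Bind a cheap compression level once, giving a plain dump(obj, file) callable.
dump_fast = partial(dump_compressed, level=1)

data = list(range(1000)) * 10   # repetitive payload, compresses well

sizes = {}
for level in (0, 1, 9):         # 0 = stored uncompressed, 9 = most CPU
    buf = io.BytesIO()
    dump_compressed(data, buf, level=level)
    sizes[level] = buf.tell()   # bytes written at this level
    buf.seek(0)
    assert load_compressed(buf) == data
```

Comparing `sizes` against the time spent in each call gives a first estimate of how much CPU a given level costs for the space it saves.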
@wrobstory if you're at all interested in any of this I'd be thrilled.
I think that efficient spill-to-disk data structures are a potentially very useful and relatively untapped development space. I'd love to see someone revamp `chest` or even reinvent something newer and better (`chest` was a quick-and-dirty evening project).
I've added a naive LRU solution for chest in mrocklin/chest@24210a5
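For readers unfamiliar with the idea, a naive least-recently-used policy can be sketched with an `OrderedDict`. This is not the code from the commit above; `LRUCache` and its `on_evict` callback are hypothetical names, and a plain `dict` stands in for the on-disk side.

```python
from collections import OrderedDict


class LRUCache:
    """Keep at most `capacity` items in memory; hand the rest to on_evict.

    Hypothetical sketch of a naive LRU policy, not Chest's implementation.
    """

    def __init__(self, capacity, on_evict):
        self.capacity = capacity
        self.on_evict = on_evict     # callable(key, value), e.g. write to disk
        self._data = OrderedDict()   # ordered oldest -> most recently used

    def __setitem__(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)  # writing counts as a use
        while len(self._data) > self.capacity:
            old_key, old_value = self._data.popitem(last=False)
            self.on_evict(old_key, old_value)

    def __getitem__(self, key):
        value = self._data[key]
        self._data.move_to_end(key)  # reading counts as a use
        return value

    def __contains__(self, key):
        return key in self._data


spilled = {}                         # stand-in for the on-disk store
cache = LRUCache(2, on_evict=spilled.__setitem__)
cache["a"] = 1
cache["b"] = 2
cache["a"]                           # touch "a" so "b" becomes the oldest
cache["c"] = 3                       # over capacity: "b" is evicted
```

After this sequence `spilled` holds only `"b"`, while `"a"` and `"c"` stay in memory.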
And significantly reduced write costs by being less dumb in mrocklin/chest@ba15e5e
And it's now threadsafe (maybe). I'm going to close this for now. Chest meets immediate requirements for use with numpy+dask things. Blogpost forthcoming.