mrocklin / chest
Simple spill-to-disk dictionary
License: Other
Much of the cost of using a chest on-disk is writing excess data. If we don't mind possibly spilling over the available_memory limit a bit then this operation could happen asynchronously. Upon inserting a new element in memory we launch a separate thread to call the Chest.shrink function and then release control immediately back to the caller.
Probably the IO here could overlap nicely with computationally intense work.
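The idea above can be sketched with a toy class (plain dicts standing in for chest's in-memory store and on-disk files; the names `inmem`, `ondisk`, and the locking scheme are illustrative assumptions, not chest's actual internals):

```python
import threading

class AsyncSpillDict(object):
    """Toy sketch of asynchronous spilling: __setitem__ returns immediately
    and a background thread moves excess data "to disk". Not chest's API."""

    def __init__(self, available_memory=2):
        self.available_memory = available_memory  # max number of in-memory items
        self.inmem = {}
        self.ondisk = {}                          # stand-in for files on disk
        self._lock = threading.Lock()
        self._threads = []                        # kept only so callers can join

    def __setitem__(self, key, value):
        with self._lock:
            self.inmem[key] = value
        # Fire-and-forget: the (potentially slow) write overlaps caller work.
        t = threading.Thread(target=self._shrink)
        t.daemon = True
        t.start()
        self._threads.append(t)

    def _shrink(self):
        with self._lock:                          # one shrink at a time
            while len(self.inmem) > self.available_memory:
                key, value = self.inmem.popitem()
                self.ondisk[key] = value          # imagine a pickle.dump here
```

Because shrinking is deferred, the dict can temporarily hold more than `available_memory` items, which is exactly the trade-off described above.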
Maybe I'm not using Chest right
>>> from chest import Chest
>>> c = Chest({'k': 6})
>>> assert 'k' in c
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError
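A workaround sketch until membership is supported directly (this is generic mapping code, not chest's own implementation): define containment in terms of `__getitem__`, which loads spilled values from disk as needed.

```python
def contains(mapping, key):
    """Membership test for any mapping whose __getitem__ raises KeyError
    for missing keys. For a Chest this also checks keys spilled to disk."""
    try:
        mapping[key]
        return True
    except KeyError:
        return False
```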
Invalid filename stops flush(), .keys file does not get written.
❯❯❯ python
Python 2.7.8 (default, Aug 24 2014, 21:26:19)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from chest.core import Chest
>>> c = Chest(path='path')
>>> c[1] = 'one'
>>> c['two'] = 2
>>> c['1/2'] = 2
>>> print c.path
path
>>> c.flush()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<venv>/lib/python2.7/site-packages/chest/core.py", line 180, in flush
self.move_to_disk(key)
File "<venv>/lib/python2.7/site-packages/chest/core.py", line 90, in move_to_disk
with open(fn, mode='wb') as f:
IOError: [Errno 20] Not a directory: 'path/1/2'
>>> exit()
❯❯❯ ls -la path
path:
total 8,0K
drwxr-xr-x 4 grimnight staff 136 Mar 30 13:27 .
drwxr-xr-x 68 grimnight staff 2,3K Mar 30 13:26 ..
-rw-r--r-- 1 grimnight staff 10 Mar 30 13:27 1
-rw-r--r-- 1 grimnight staff 5 Mar 30 13:27 two
When the in-memory dictionary is full and we add something new (either a new k/v pair or something pulled from disk) we currently find the largest element in the dict and send it to disk.
This is bad in a few ways. It requires a max(values, key=nbytes) call, which might be expensive if we have many keys, particularly if we do it at every insertion. A decent cheap solution would be the following.
Have a dictionary, inmem, as we do now. Have a new set, seen-recently. Whenever we touch a piece of data we add its key to seen-recently and update our total-bytes-in-memory counter. When this counter goes above available-memory we flush all values corresponding to keys not found in seen-recently. We then clear seen-recently.
That is to say that the policy becomes that you can stay in memory if you've been used since the last clearing out.
This is amortized cheap, happens less frequently, is cheap to maintain, and easy to build.
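A minimal sketch of this policy (plain dicts standing in for chest's in-memory store and on-disk files; `nbytes` is a parameter, defaulting to `sys.getsizeof` as a crude stand-in for a real byte-size estimate):

```python
import sys

class SeenRecentlyDict(object):
    """Sketch of the amortized seen-recently eviction policy described
    above. `ondisk` is a plain dict standing in for files on disk."""

    def __init__(self, available_memory, nbytes=sys.getsizeof):
        self.available_memory = available_memory
        self.nbytes = nbytes
        self.inmem = {}
        self.ondisk = {}
        self.seen_recently = set()
        self.total_bytes = 0

    def __setitem__(self, key, value):
        if key in self.inmem:
            self.total_bytes -= self.nbytes(self.inmem[key])
        self.inmem[key] = value
        self.total_bytes += self.nbytes(value)
        self._touch(key)

    def __getitem__(self, key):
        if key not in self.inmem:            # pull back from "disk"
            value = self.ondisk.pop(key)     # KeyError if truly absent
            self.inmem[key] = value
            self.total_bytes += self.nbytes(value)
        self._touch(key)
        return self.inmem[key]

    def _touch(self, key):
        # Mark as used, then evict the unused if we're over budget.
        self.seen_recently.add(key)
        if self.total_bytes > self.available_memory:
            self._flush_unseen()

    def _flush_unseen(self):
        # You stay in memory if you've been used since the last clearing out.
        for k in list(self.inmem):
            if k not in self.seen_recently:
                v = self.inmem.pop(k)
                self.total_bytes -= self.nbytes(v)
                self.ondisk[k] = v           # imagine a file write here
        self.seen_recently.clear()
```

Note that the flush scans all in-memory keys, but only when the budget is exceeded, so the per-insertion cost stays amortized constant rather than paying for a max() on every insert.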
Using the original key's hash value as a key will create a new key, but only until flush(). After flush() the key will still exist in the .keys file, but not in the filesystem, and it will return the value of the original key.
❯❯❯ python
Python 2.7.8 (default, Aug 24 2014, 21:26:19)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from chest.core import Chest
>>> import json
>>> c = Chest(path='path', dump=json.dump, load=json.load)
>>> c['1/2'] = '1/2'
>>> c['1/2']
'1/2'
>>> str(abs(hash('1/2')))
'163512108431620421'
>>> c[163512108431620421] = 163512108431620421
>>> c[163512108431620421]
163512108431620421
>>> c['1/2']
'1/2'
>>> c.keys()
[163512108431620421, '1/2']
>>> c.flush()
>>> c = Chest(path='path', dump=json.dump, load=json.load)
>>> c.keys()
[u'1/2', 163512108431620421]
>>> c['1/2']
u'1/2'
>>> c[163512108431620421]
u'1/2'
>>> exit()
❯❯❯ ls -la path
total 16
drwxr-xr-x 6 grimnight staff 204 Jou 1 09:46 .
drwxr-xr-x 67 grimnight staff 2278 Jou 1 09:43 ..
-rw-r--r-- 1 grimnight staff 37 Jou 1 09:46 .keys
-rw-r--r-- 1 grimnight staff 5 Jou 1 09:46 163512108431620421
❯❯❯ cat path/.keys
[163512108431620421, "1/2"]
As an alternative to str(abs(hash(''))) consider quote_plus or something similar.
After seeing #19 and multi_key_dict I thought of c[('some', 1.0, 'key')] or c['some', 1.0, 'key'] becoming path/some/1.0/key, assuming that is possible or even a good idea.
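The quote_plus idea can be sketched as follows (Python 3 shown, while the sessions above are Python 2; the helper names are hypothetical). Unlike hashing, this mapping is reversible and collision-free, which would avoid both the IOError and the hash-collision issue above:

```python
from urllib.parse import quote_plus, unquote_plus

def key_to_filename(key):
    # '/' becomes '%2F', ' ' becomes '+', so a key like '1/2'
    # can no longer escape the chest's directory.
    return quote_plus(key)

def filename_to_key(filename):
    # Exact inverse of key_to_filename, so no .keys lookup collisions.
    return unquote_plus(filename)
```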
Hi,
I have some questions regarding Chest.
Is the repository fully integrated into Dask, and is that why it hasn't been updated for a while?
At what size does chest start to fall over? I noticed there was a comment that it handles moderate-sized arrays.
Is there a way to break up computation in dask or other libraries with chest? For example, I have a large image array, and when I try to work out the histogram, the dask graph has a massive bottleneck at the point of bringing the data together. Is there a way to break up the work with chest or dask to store intermediate results in chunks and then bring them back together? I think the native dask implementation may be causing issues, and breaking the steps up would help significantly with really large tasks.
Would appreciate your thoughts and help.