mrocklin / chest
Simple spill-to-disk dictionary
License: Other
Much of the cost of using a chest on-disk is writing excess data. If we don't mind possibly spilling over the available_memory limit a bit then this operation could happen asynchronously. Upon inserting a new element in memory we launch a separate thread to call the Chest.shrink function and then release control immediately back to the caller.
Probably the IO here could overlap nicely with computationally intense work.
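The idea above can be sketched with a toy class (plain dicts standing in for chest's in-memory store and on-disk files; the names `inmem`, `ondisk`, and the locking scheme are illustrative assumptions, not chest's actual internals):

```python
import threading

class AsyncSpillDict(object):
    """Toy sketch of asynchronous spilling: __setitem__ returns immediately
    and a background thread moves excess data "to disk". Not chest's API."""

    def __init__(self, available_memory=2):
        self.available_memory = available_memory  # max number of in-memory items
        self.inmem = {}
        self.ondisk = {}                          # stand-in for files on disk
        self._lock = threading.Lock()
        self._threads = []                        # kept only so callers can join

    def __setitem__(self, key, value):
        with self._lock:
            self.inmem[key] = value
        # Fire-and-forget: the (potentially slow) write overlaps caller work.
        t = threading.Thread(target=self._shrink)
        t.daemon = True
        t.start()
        self._threads.append(t)

    def _shrink(self):
        with self._lock:                          # one shrink at a time
            while len(self.inmem) > self.available_memory:
                key, value = self.inmem.popitem()
                self.ondisk[key] = value          # imagine a pickle.dump here
```

Because shrinking is deferred, the dict can temporarily hold more than `available_memory` items, which is exactly the trade-off described above.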
Maybe I'm not using Chest right
>>> from chest import Chest
>>> c = Chest({'k': 6})
>>> assert 'k' in c
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError
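A workaround sketch until membership is supported directly (this is generic mapping code, not chest's own implementation): define containment in terms of `__getitem__`, which loads spilled values from disk as needed.

```python
def contains(mapping, key):
    """Membership test for any mapping whose __getitem__ raises KeyError
    for missing keys. For a Chest this also checks keys spilled to disk."""
    try:
        mapping[key]
        return True
    except KeyError:
        return False
```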
Invalid filename stops flush(), .keys file does not get written.
❯❯❯ python
Python 2.7.8 (default, Aug 24 2014, 21:26:19)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from chest.core import Chest
>>> c = Chest(path='path')
>>> c[1] = 'one'
>>> c['two'] = 2
>>> c['1/2'] = 2
>>> print c.path
path
>>> c.flush()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<venv>/lib/python2.7/site-packages/chest/core.py", line 180, in flush
self.move_to_disk(key)
File "<venv>/lib/python2.7/site-packages/chest/core.py", line 90, in move_to_disk
with open(fn, mode='wb') as f:
IOError: [Errno 20] Not a directory: 'path/1/2'
>>> exit()
❯❯❯ ls -la path
path:
total 8,0K
drwxr-xr-x 4 grimnight staff 136 Mar 30 13:27 .
drwxr-xr-x 68 grimnight staff 2,3K Mar 30 13:26 ..
-rw-r--r-- 1 grimnight staff 10 Mar 30 13:27 1
-rw-r--r-- 1 grimnight staff 5 Mar 30 13:27 two
When the in-memory dictionary is full and we add something new (either a new k/v pair or something pulled from disk) we currently find the largest element in the dict and send it to disk.
This is bad in a few ways. It requires a max(values, key=nbytes) call, which might be expensive if we have many keys, particularly if we do it at every insertion. A decent cheap solution would be the following.
Have a dictionary, inmem, as we do now. Have a new set, seen-recently. Whenever we touch a piece of data we add its key to seen-recently and update our total-bytes-in-memory counter. When this counter goes above available-memory we flush all values corresponding to keys not found in seen-recently. We then clear seen-recently.
That is to say that the policy becomes that you can stay in memory if you've been used since the last clearing out.
This is amortized cheap, happens less frequently, is cheap to maintain, and easy to build.
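A minimal sketch of this policy (plain dicts standing in for chest's in-memory store and on-disk files; `nbytes` is a parameter, defaulting to `sys.getsizeof` as a crude stand-in for a real byte-size estimate):

```python
import sys

class SeenRecentlyDict(object):
    """Sketch of the amortized seen-recently eviction policy described
    above. `ondisk` is a plain dict standing in for files on disk."""

    def __init__(self, available_memory, nbytes=sys.getsizeof):
        self.available_memory = available_memory
        self.nbytes = nbytes
        self.inmem = {}
        self.ondisk = {}
        self.seen_recently = set()
        self.total_bytes = 0

    def __setitem__(self, key, value):
        if key in self.inmem:
            self.total_bytes -= self.nbytes(self.inmem[key])
        self.inmem[key] = value
        self.total_bytes += self.nbytes(value)
        self._touch(key)

    def __getitem__(self, key):
        if key not in self.inmem:            # pull back from "disk"
            value = self.ondisk.pop(key)     # KeyError if truly absent
            self.inmem[key] = value
            self.total_bytes += self.nbytes(value)
        self._touch(key)
        return self.inmem[key]

    def _touch(self, key):
        # Mark as used, then evict the unused if we're over budget.
        self.seen_recently.add(key)
        if self.total_bytes > self.available_memory:
            self._flush_unseen()

    def _flush_unseen(self):
        # You stay in memory if you've been used since the last clearing out.
        for k in list(self.inmem):
            if k not in self.seen_recently:
                v = self.inmem.pop(k)
                self.total_bytes -= self.nbytes(v)
                self.ondisk[k] = v           # imagine a file write here
        self.seen_recently.clear()
```

Note that the flush scans all in-memory keys, but only when the budget is exceeded, so the per-insertion cost stays amortized constant rather than paying for a max() on every insert.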
Using the original key's hash value as a key will create a new key, but only until flush(). After flush() the key will still exist in the .keys file, but not in the filesystem, and it will return the value of the original key.
❯❯❯ python
Python 2.7.8 (default, Aug 24 2014, 21:26:19)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from chest.core import Chest
>>> import json
>>> c = Chest(path='path', dump=json.dump, load=json.load)
>>> c['1/2'] = '1/2'
>>> c['1/2']
'1/2'
>>> str(abs(hash('1/2')))
'163512108431620421'
>>> c[163512108431620421] = 163512108431620421
>>> c[163512108431620421]
163512108431620421
>>> c['1/2']
'1/2'
>>> c.keys()
[163512108431620421, '1/2']
>>> c.flush()
>>> c = Chest(path='path', dump=json.dump, load=json.load)
>>> c.keys()
[u'1/2', 163512108431620421]
>>> c['1/2']
u'1/2'
>>> c[163512108431620421]
u'1/2'
>>> exit()
❯❯❯ ls -la path
total 16
drwxr-xr-x 6 grimnight staff 204 Jou 1 09:46 .
drwxr-xr-x 67 grimnight staff 2278 Jou 1 09:43 ..
-rw-r--r-- 1 grimnight staff 37 Jou 1 09:46 .keys
-rw-r--r-- 1 grimnight staff 5 Jou 1 09:46 163512108431620421
❯❯❯ cat path/.keys
[163512108431620421, "1/2"]
As an alternative to str(abs(hash(''))) consider quote_plus or something similar.
After seeing #19 and multi_key_dict I thought of c[('some', 1.0, 'key')] or c['some', 1.0, 'key'] becoming path/some/1.0/key, assuming that is possible or even a good idea.
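The quote_plus idea can be sketched as follows (Python 3 shown, while the sessions above are Python 2; the helper names are hypothetical). Unlike hashing, this mapping is reversible and collision-free, which would avoid both the IOError and the hash-collision issue above:

```python
from urllib.parse import quote_plus, unquote_plus

def key_to_filename(key):
    # '/' becomes '%2F', ' ' becomes '+', so a key like '1/2'
    # can no longer escape the chest's directory.
    return quote_plus(key)

def filename_to_key(filename):
    # Exact inverse of key_to_filename, so no .keys lookup collisions.
    return unquote_plus(filename)
```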
Hi,
I have some questions regarding Chest.
Is the repository fully integrated into Dask, and is that why it hasn't been updated for a while?
At what size does chest start to fall over? I noticed there was a comment that it handles moderate-sized arrays.
Is there a way to break up computation in dask or other libraries with chest? For example, I have a large image array, and when I try to work out the histogram, the dask graph has a massive bottleneck at the point of bringing the data together. Is there a way to break up the work with chest or dask to store intermediate results in chunks and then bring them back together? I think the native dask implementation may be causing issues, and breaking the steps up would help significantly with really large tasks.
Would appreciate your thoughts and help.