Giter Club home page Giter Club logo

chest's People

Contributors

cowlicks avatar maxhutch avatar mrocklin avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

chest's Issues

launch shrink or move_to_disk in separate thread

Much of the cost of using a chest on-disk is writing excess data. If we don't mind possibly spilling over the available_memory limit a bit then this operation could happen asynchronously. Upon inserting a new element in memory we launch a separate thread to call the Chest.shrink function and then release control immediately back to the caller.

Probably the IO here could overlap nicely with computationally intense work.

Possible bug instantiating chest with data

Maybe I'm not using Chest right

>>> from chest import Chest
>>> c = Chest({'k': 6})
>>> assert 'k' in c
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

Invalid filename stops flush

Invalid filename stops flush(), .keys file does not get written.

❯❯❯ python
Python 2.7.8 (default, Aug 24 2014, 21:26:19)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from chest.core import Chest
>>> c = Chest(path='path')
>>> c[1] = 'one'
>>> c['two'] = 2
>>> c['1/2'] = 2
>>> print c.path
path
>>> c.flush()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<venv>/lib/python2.7/site-packages/chest/core.py", line 180, in flush
    self.move_to_disk(key)
  File "<venv>/lib/python2.7/site-packages/chest/core.py", line 90, in move_to_disk
    with open(fn, mode='wb') as f:
IOError: [Errno 20] Not a directory: 'path/1/2'
>>> exit()
❯❯❯ ls -la path
path:
total 8,0K
drwxr-xr-x  4 grimnight staff  136 Mar 30 13:27 .
drwxr-xr-x 68 grimnight staff 2,3K Mar 30 13:26 ..
-rw-r--r--  1 grimnight staff   10 Mar 30 13:27 1
-rw-r--r--  1 grimnight staff    5 Mar 30 13:27 two

Possible strategy for LRU-light

When the in-memory dictionary is full and we add something new (either new kv pair or something pulled from disk) we currently find the largest element in the dict and send it to disk.

This is bad in a few ways

  1. We do this all the time once we fill up our in-memory cache
  2. We find the largest element, this may not be the right thing to push down. We may want the least recently used
  3. We currently do this through a global max(values, key=nbytes) call which might be expensive if we have many keys, particularly we if we do it at every insertion.

A decent cheap solution would be the following.

Have a dictionary, inmem as we do now. Have a new set seen-recently. Whenever we touch a piece of data we add its key to seen-recently and update our total-bytes-in-memory counter. When this counter goes above available-memory we flush all values corresponding to keys not found in seen-recently. We then clear seen-recently.

That is to say that the policy becomes that you can stay in memory if you've been used since the last clearing out.

This is amortized cheap, happens less frequently, is cheap to maintain, and easy to build.

Using hash value as a key will create a temporary key that does not keep it's value

Using original keys hash value as a key will create a new key, but only until flush().
After flush() the key will still exist in the .keys file, but not in the filesystem and it will return the value of the original key.

❯❯❯ python
Python 2.7.8 (default, Aug 24 2014, 21:26:19)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from chest.core import Chest
>>> import json
>>> c = Chest(path='path', dump=json.dump, load=json.load)
>>> c['1/2'] = '1/2'
>>> c['1/2']
'1/2'
>>> str(abs(hash('1/2')))
'163512108431620421'
>>> c[163512108431620421] = 163512108431620421
>>> c[163512108431620421]
163512108431620421
>>> c['1/2']
'1/2'
>>> c.keys()
[163512108431620421, '1/2']
>>> c.flush()
>>> c = Chest(path='path', dump=json.dump, load=json.load)
>>> c.keys()
[u'1/2', 163512108431620421]
>>> c['1/2']
u'1/2'
>>> c[163512108431620421]
u'1/2'
>>> exit()
❯❯❯ ls -la path
total 16
drwxr-xr-x  6 grimnight staff  204 Jou  1 09:46 .
drwxr-xr-x 67 grimnight staff 2278 Jou  1 09:43 ..
-rw-r--r--  1 grimnight staff   37 Jou  1 09:46 .keys
-rw-r--r--  1 grimnight staff    5 Jou  1 09:46 163512108431620421
❯❯❯ cat path/.keys
[163512108431620421, "1/2"]

As an alternative to str(abs(hash(''))) consider quote_plus or something similar.
After seeing 19 and multi_key_dict I thought of c[('some', 1.0, 'key')] or c['some', 1.0, 'key'] becoming path/some/1.0/key, assuming that is possible or even a good idea.

Chest questions

Hi,

I had some questions regarding Chest?

Is the repository, fully integrated into Dask and is that why it hasn't been updated for a while?

What size does chest start to fall over at? I noticed there was a comment it handles moderate sized arrays?

Is there a way to break up computation in dask or other libraries with chest? For example, I have a large image array that when I try to work out the histogram, the dask graph has a massive bottle neck at the point of being the data together. Is there a way to break up the work with chest or dask to store intermediate results in chunks and then bring it back together? I think the native dask implementation may be causing issues and break the steps up would help significantly with really large tasks.

Would appreciate your thoughts and help.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.