
bloscpack's Introduction

Blosc: A blocking, shuffling and lossless compression library

Author:  Blosc Development Team
Contact: [email protected]
URL:     https://www.blosc.org

What is it?

Note: There is a more modern version of this package called C-Blosc2 which supports many more features and is more actively maintained. Visit it at: https://github.com/Blosc/c-blosc2

Blosc is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() library call. Blosc is the first compressor (that I'm aware of) that is meant not only to reduce the size of large datasets on disk or in memory, but also to accelerate memory-bound computations.

It uses the blocking technique to reduce activity on the memory bus as much as possible. In short, this technique divides datasets into blocks that are small enough to fit in the caches of modern processors, and performs compression/decompression there. It also leverages SIMD instructions (SSE2, AVX2) and the multi-threading capabilities of CPUs, when available, to accelerate the compression/decompression process as much as possible.

See some benchmarks about Blosc performance.

Blosc is distributed using the BSD license, see LICENSE.txt for details.

Meta-compression and other differences over existing compressors

C-Blosc is not like other compressors: it is better described as a meta-compressor, because it can use different compressors and filters (routines that generally improve the compression ratio). That said, it can also be called a compressor, because it already comes with several compressors and filters, so it can work like a regular codec out of the box.

Currently, C-Blosc comes with support for BloscLZ, a compressor heavily based on FastLZ (https://ariya.github.io/FastLZ/), LZ4 and LZ4HC (https://lz4.org/), Snappy (https://google.github.io/snappy/), Zlib (https://zlib.net/) and Zstandard (https://facebook.github.io/zstd/).

C-Blosc also comes with highly optimized shuffle and bitshuffle filters (they can use SSE2 or AVX2 instructions, if available; for info on how and why shuffling works, see here). Additional compressors or filters may be added in the future.

Blosc is in charge of coordinating the different compressors and filters so that they leverage the blocking technique, as well as multi-threaded execution (if several cores are available), automatically. As a result, every codec and filter works at very high speed, even if it was not originally designed for blocking or multi-threading.

Finally, C-Blosc is especially suited to dealing with binary data because it can take advantage of the type-size meta-information for an improved compression ratio, by using the integrated shuffle and bitshuffle filters.

When taken together, all these features set Blosc apart from other compression libraries.
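
To make the codec, filter and typesize parameters concrete, here is a minimal sketch using the Python wrapper mentioned later in this document (python-blosc); the chosen codec, compression level and array contents are illustrative assumptions only.

# Minimal sketch with python-blosc (assumes `pip install blosc numpy`).
# The typesize argument is what lets the shuffle filter exploit the
# type-size meta-information described above.
import numpy as np
import blosc

arr = np.arange(1_000_000, dtype=np.int64)
raw = arr.tobytes()

packed = blosc.compress(raw, typesize=arr.itemsize,
                        cname='lz4', shuffle=blosc.SHUFFLE, clevel=5)
restored = np.frombuffer(blosc.decompress(packed), dtype=arr.dtype)

print(len(raw), '->', len(packed))   # compressed size is typically much smaller
assert np.array_equal(arr, restored)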

Compiling the Blosc library

Blosc can be built, tested and installed using CMake. The following procedure describes the "out of source" build.

  $ cd c-blosc
  $ mkdir build
  $ cd build

Now run CMake configuration and optionally specify the installation directory (e.g. '/usr' or '/usr/local'):

  $ cmake -DCMAKE_INSTALL_PREFIX=your_install_prefix_directory ..

CMake lets you configure Blosc in many different ways, such as preferring internal or external sources for the compressors, or enabling/disabling them. Please note that configuration can also be performed using the UI tools provided by CMake (ccmake or cmake-gui):

  $ ccmake ..      # run a curses-based interface
  $ cmake-gui ..   # run a graphical interface

Build, test and install Blosc:

  $ cmake --build .
  $ ctest
  $ cmake --build . --target install

The static and dynamic versions of the Blosc library, together with the header files, will be installed into the specified CMAKE_INSTALL_PREFIX.

Codec support with CMake

C-Blosc comes with full sources for LZ4, LZ4HC, Snappy, Zlib and Zstd. In general, you should not worry about not having (or CMake not finding) these libraries on your system, because by default the included sources are compiled and linked into the C-Blosc library automatically. This effectively means that you can count on complete support for all the codecs in every Blosc deployment (unless you explicitly exclude support for some of them).

If you want to force Blosc to use external codec libraries instead of the included sources, you can do so:

  $ cmake -DPREFER_EXTERNAL_ZSTD=ON ..

You can also disable support for some compression libraries:

  $ cmake -DDEACTIVATE_SNAPPY=ON ..  # in case you don't have a C++ compiler

Examples

In the examples/ directory you can find hints on how to use Blosc inside your app.

Supported platforms

Blosc is meant to support all platforms where a C89-compliant C compiler can be found. The most thoroughly tested are Intel (Linux, Mac OSX and Windows) and ARM (Linux), but more exotic ones, such as the IBM Blue Gene/Q embedded "A2" processor, are reported to work too.

Mac OSX troubleshooting

If you run into compilation troubles when using Mac OSX, please make sure that you have installed the command line developer tools. You can always install them with:

  $ xcode-select --install

Wrapper for Python

Blosc has an official wrapper for Python. See:

https://github.com/Blosc/python-blosc
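
The wrapper also includes convenience helpers that carry NumPy metadata along with the compressed payload; a short sketch (helper names as exposed by python-blosc):

# Round-trip a NumPy array with python-blosc's pack/unpack helpers.
import numpy as np
import blosc

a = np.linspace(0, 100, 1_000_000)
packed = blosc.pack_array(a)      # compressed bytes plus dtype/shape information
b = blosc.unpack_array(packed)
assert np.array_equal(a, b)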

Command line interface and serialization format for Blosc

Blosc can be used from the command line by means of Bloscpack. See:

https://github.com/Blosc/bloscpack
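
Bloscpack also exposes a Python API for its .blp serialization format; a minimal sketch using the function names that appear in the issues below (newer releases may expose differently named helpers):

# Write and read back a NumPy array in bloscpack's .blp format.
import numpy as np
import bloscpack as bp

arr = np.arange(10.0)
bp.pack_ndarray_file(arr, 'arr.blp')
restored = bp.unpack_ndarray_file('arr.blp')
assert np.array_equal(arr, restored)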

Filter for HDF5

For those who want to use Blosc as a filter in the HDF5 library, there is a sample implementation in the hdf5-blosc project in:

https://github.com/Blosc/hdf5-blosc

Mailing list

There is an official mailing list for Blosc at:

[email protected] https://groups.google.com/g/blosc

Acknowledgments

See THANKS.rst.


Enjoy data!

bloscpack's People

Contributors

bnavigator, cpcloud, esc, francescalted, francescelies, mindw, mmohrhard, ogrisel, oogali, sachk, toddrme2178, waffle-iron


bloscpack's Issues

Failing Travis tests

@esc So, it looks like we have some test failures in Travis, but they're not what I expected?

The errors seem to fall into two categories.

  1. It looks like the expected compression ratio has been cut in half.

  2. The newer version of blosc introduces the zstd compression library, which the tests don't expect in the output.

The latter is easily fixable by updating the tests. The former... not so much.

BLOSC_MAX_BUFFERSIZE disappeared

It appears that blosc.BLOSC_MAX_BUFFERSIZE has turned into blosc.MAX_BUFFERSIZE, which I believe causes issues within bloscpack.

CPU core use

It seems it's not detecting/using the available CPU cores.

Testing on Ubuntu Server 19.04 on a quad AMD Opteron 6282 SE machine, for a total of 64 cores on the system.
I'm using incompressible random data for this example, but the behaviour is the same with real data.

$ sudo apt install bloscpack
$ blpk --version
bloscpack: '0.15.0' python-blosc: '1.7.0' blosc: '1.15.1'
$ dd if=/dev/urandom of=test bs=1M count=10k
$ blpk -v compress test                                                                                              
blpk: using 64 threads
blpk: getting ready for compression
blpk: input file is: 'test'
blpk: output file is: 'test.blp'
blpk: input file size: 10.0G (10737418240B)
blpk: nchunks: 10240
blpk: chunk_size: 1.0M (1048576B)
blpk: last_chunk_size: 1.0M (1048576B)
blpk: output file size: 10.0G (10738524192B)
blpk: compression ratio: 0.999897
blpk: done

Activity during compression shows that 4 cores out of 64 are used:

Screenshot from 2019-09-19 23-47-10

Specifying 64 threads does not change the behaviour:

$ blpk -v -n 64 compress test                                                                                      
blpk: using 64 threads
blpk: getting ready for compression
blpk: input file is: 'test'
blpk: output file is: 'test.blp'
blpk: input file size: 10.0G (10737418240B)
blpk: nchunks: 10240
blpk: chunk_size: 1.0M (1048576B)
blpk: last_chunk_size: 1.0M (1048576B)
blpk: output file size: 10.0G (10738524192B)
blpk: compression ratio: 0.999897
blpk: done

Activity during compression shows that 4 cores out of 64 are used:

Screenshot from 2019-09-19 23-54-14

bug in determining n_chunks and chunk_size

The function PlainNumpySource.__iter__ (numpy_io.py), and the way it is used when calling compress_func (line 160, abstract_io.py), actually assume that chunk_size is divisible by ndarray.itemsize. Here is an example that shows why this matters:

In [2]: arr = numpy.arange(10.0)

In [3]: bloscpack.unpack_ndarray_str(bloscpack.pack_ndarray_str(arr, chunk_size=16))
Out[3]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

In [4]: bloscpack.unpack_ndarray_str(bloscpack.pack_ndarray_str(arr, chunk_size=17))
array([  0.00000000e+000,   1.00000000e+000,   1.78005909e-307,
         1.78353576e-307,   3.48007310e-310,   3.48092190e-310,
         1.36006668e-312,   1.36039824e-312,   5.31535078e-315,
         0.00000000e+000])

This behavior happens because in __iter__ the array is split into chunks in the middle of an item (typesize) and the broken item is completely skipped:

    def __iter__(self):
        self.nitems = int(self.chunk_size / self.ndarray.itemsize)
        offset = self.ptr
        for i in xrange(self.nchunks - 1):
            yield offset, self.nitems
            offset += self.chunk_size # bug: self.nitems and offset are not computed consistently
                                      # it  should have been
                                      # offset += self.nitems * self.ndarray.itemsize
                                      # but in this case self.nchunks might be wrong...
        yield offset, int(self.last_chunk / self.ndarray.itemsize)

Thus, I propose to:

  1. add assert self.chunk_size % self.ndarray.itemsize == 0.
  2. modify args.calculate_nchunks such that it makes sure that chunk_size is divisible by itemsize.

Thanks,
Dmitry.
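
A minimal sketch of the second proposal above, using a hypothetical standalone helper rather than bloscpack's actual calculate_nchunks: round the requested chunk_size down to the nearest multiple of itemsize so that no array element is ever split across chunk boundaries.

# Hypothetical helper (illustration only, not bloscpack's code).
def aligned_chunk_size(requested_chunk_size, itemsize):
    if requested_chunk_size < itemsize:
        return itemsize
    return requested_chunk_size - (requested_chunk_size % itemsize)

assert aligned_chunk_size(17, 8) == 16   # the failing chunk_size=17 example above
assert aligned_chunk_size(16, 8) == 16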

Release the GIL

I would like to use single-threaded bloscpack in many threads in parallel. My understanding is that this is possible in the C layer by creating many contexts but not currently possible in the Python layer.

compress-from-stdin (or pipe) and compress-to-stdout?

It would be nice if blpk could compress from stdin or decompress to stdout, so it could be a drop-in alternative to gzip/bzip2/etc.

I tried the workaround of supplying a process-substitution pipe, as in..

  blpk c <(zcat bigfile.gz) bigfile.blp

...but that simply created a 140-byte file that decompressed to a 0-length file. (So it'd be nice if that worked, too.)

Or does blosc[pack] require random access to entire files to operate?

Can we save dictionary including numpy arrays ?

Hi,

I tried to save a dictionary consisting of two NumPy arrays.

it says

dict object has no attribute 'dtype'

The arrays are:

_names_np = np.empty(shape=[number_of_item, ], dtype='<U254')
_floats_np = np.empty(shape=[number_of_item, 512], dtype=np.float32)
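
The ndarray helpers operate on a single NumPy array at a time, not on a dict, which is why the 'dtype' attribute lookup fails. A hedged workaround sketch (illustrative names and data only), packing each array in the dict separately with the pack_ndarray_file / unpack_ndarray_file functions used elsewhere in these issues:

# Illustrative workaround: one .blp file per array in the dict.
import numpy as np
import bloscpack as bp

data = {
    'floats': np.random.rand(100, 512).astype(np.float32),
    'ints': np.arange(100),
}
for key, arr in data.items():
    bp.pack_ndarray_file(arr, '{}.blp'.format(key))

restored = {key: bp.unpack_ndarray_file('{}.blp'.format(key)) for key in data}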

import issue

When I import bloscpack on Ubuntu 16.04 I got

Traceback (most recent call last):
  File "deneme.py", line 1, in <module>
    import bloscpack
  File "/home/egolge/miniconda3/lib/python3.6/site-packages/bloscpack/__init__.py", line 9, in <module>
    from .args import (BloscArgs,
  File "/home/egolge/miniconda3/lib/python3.6/site-packages/bloscpack/args.py", line 6, in <module>
    import blosc
  File "/home/egolge/miniconda3/lib/python3.6/site-packages/blosc/__init__.py", line 13, in <module>
    from blosc.blosc_extension import (
ImportError: /home/egolge/miniconda3/lib/python3.6/site-packages/blosc/blosc_extension.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE9_M_appendEPKcm


compat_util.py file is not added

In Python 3:

  File "anaconda3/lib/python3.4/site-packages/bloscpack/headers.py", line 17, in <module>
    from .compat_util import (OrderedDict,
ImportError: No module named 'bloscpack.compat_util'

Add info about the codec used for compression

Currently, the info subcommand does not offer info on the codec used for compressing a file:

$ blpk i p.dat.blp
blpk: bloscpack header: 
blpk:     format_version=3,
blpk:     offsets=True,
blpk:     metadata=False,
blpk:     checksum='adler32',
blpk:     typesize=8,
blpk:     chunk_size=1.0M (1048576B),
blpk:     last_chunk=962.0K (985088B),
blpk:     nchunks=763,
blpk:     max_app_chunks=7630
blpk: 'offsets':
blpk: [67176,257788,400131,536937,653836,...]

I think the recently added get_clib(cbuffer) function in python-blosc 1.2.1 should help here.
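
For reference, a tiny self-contained sketch of what get_clib() reports; the exact return strings may vary between python-blosc versions.

import blosc

# Compress a small buffer, then ask which codec library produced it.
cbuffer = blosc.compress(b'\x00' * 4096, typesize=1, cname='lz4')
print(blosc.get_clib(cbuffer))   # e.g. 'LZ4'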

test_append.test_append_mix_shuffle error

I am trying to package bloscpack for openSUSE and I am getting the following error:

======================================================================
FAIL: test_append.test_append_mix_shuffle
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/abuild/rpmbuild/BUILD/bloscpack-0.13.0/test/test_append.py", line 366, in test_append_mix_shuffle
    nt.assert_equal(blosc_header_last['flags'], 0)
AssertionError: 2 != 0

----------------------------------------------------------------------

This is with:

  • python 2.7
  • blosc 1.13.5
  • gcc8 8.1.1

Any idea what might be going wrong?

bug in storing array of 0 size

The current version of bloscpack has an unnecessary complication when storing an array of size 0: it raises an exception. Here is an example:

arr = numpy.array([], 'f8')
bloscpack.pack_ndarray_file(arr, 'arr.blp')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[...]

/nfs/home/dmitryb/soft/bloscpack/bloscpack/args.py in calculate_nchunks(in_file_size, chunk_size)
    178         raise ValueError("'in_file_size' must be strictly positive, not %d"
--> 179                          % in_file_size)
    180     # convert a human readable description to an int
    181     if isinstance(chunk_size, basestring):

ValueError: 'in_file_size' must be strictly positive, not 0

Actually, the code that does the compression understands 0 values for chunk_size. Thus, one can safely replace the raise ValueError on lines 178-179 with:

    if in_file_size <= 0:
        return (1, 0, 0)

Here is how it works:

In [3]: bloscpack.unpack_ndarray_str(bloscpack.pack_ndarray_str(arr))
Out[3]: array([], dtype=float64)

Could you please make this simple change in the main branch?

Thanks,
Dmitry.

numpy savez_compressed much smaller filesizes for small arrays

I have a few million images to save to disk and have been trying a few options out. I thought blosc/bloscpack would be well suited, but I'm getting far larger file sizes than with the standard numpy savez_compressed.

My images are size (3,200,200) and dtype=float32. Typical file sizes I'm getting are:

  • np.savez ~ 470k
  • np.savez_compressed ~ 53k
  • blosc.pack_array ~ 200k
  • blosc.compress_ptr ~ 200k
  • bloscpack.pack_ndarray_to_file ~ 200-400k

For a sample of 370 images this gives:

67M      ./blosc_packarray
67M      ./blosc_pointer
121M     ./bp
19M      ./npz
172M     ./uncompressed

For the blosc_* methods I'm writing the packed bytes like:

with open(dest, 'wb') as f:
    f.write(packed)

Is there anything I'm missing or is numpy's compression just as good as it gets for small images like these?

Broken on Mac OSX

hz:> pip3 install bloscpack
hz:> blpk c utf.csv

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/bin/blpk", line 7, in <module>
    from bloscpack.cli import main
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bloscpack/__init__.py", line 9, in <module>
    from .args import (BloscArgs,
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bloscpack/args.py", line 6, in <module>
    import blosc
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/blosc/__init__.py", line 12, in <module>
    from blosc.toplevel import (
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/blosc/toplevel.py", line 16, in <module>
    from blosc import blosc_extension as _ext
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/blosc/blosc_extension.cpython-35m-darwin.so, 2): Symbol not found: _aligned_alloc
  Referenced from: /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/blosc/blosc_extension.cpython-35m-darwin.so
  Expected in: flat namespace
 in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/blosc/blosc_extension.cpython-35m-darwin.so

pack_unpack_hard does not work

Here is what I am seeing using 0.4.0 (master):

faltet@linux-je9a:~/software/bloscpack> nosetests test_bloscpack.py:pack_unpack_hard
E
======================================================================
ERROR: Test on somewhat larger arrays, but be nice to memory.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/faltet/anaconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/faltet/software/bloscpack/test_bloscpack.py", line 1101, in pack_unpack_hard
    pack_unpack(100, nchunks=1536, progress=True)
TypeError: pack_unpack() got an unexpected keyword argument 'nchunks'

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

Python 3.10 Runtime Error

From baf0e764143cd6369835d60013d3d9c3eaa771fd Mon Sep 17 00:00:00 2001
From: Chris Piekarski [email protected]
Date: Mon, 12 Jun 2023 23:40:33 -0600
Subject: [PATCH] change import to work with python 3.10


bloscpack/abstract_objects.py | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/bloscpack/abstract_objects.py b/bloscpack/abstract_objects.py
index f365c37..fd75fc8 100644
--- a/bloscpack/abstract_objects.py
+++ b/bloscpack/abstract_objects.py
@@ -4,7 +4,7 @@

import abc
-import collections
+from collections.abc import MutableMapping
import copy
import pprint

@@ -13,7 +13,7 @@ from .pretty import (double_pretty_size,
)

-class MutableMappingObject(collections.abc.MutableMapping):
+class MutableMappingObject(MutableMapping):

 __metaclass__ = abc.ABCMeta

--
2.34.1

Propagate unpin commit to Conda recipe

I unpinned the version for the blosc dependency in requirements.txt, but I didn't make that change to the Conda recipe.

(I'll get to that after I fix the failing unit tests)

License file

I am trying to package bloscpack for openSUSE. However, there is no license file that I can find, either on GitHub or in the tarball. In order for people to know how they can use your code, it is really important to have a license file. Would it be possible to add one? Thank you.

two test failures on i386

With 0.15.0, we have two test suite failures on i386 going like this:

FAIL: test_numpy_io.test_itemsize_chunk_size_mismatch(<class 'bloscpack.exceptions.ChunkSizeTypeSizeMismatch'>, <function pack_ndarray_str at 0xf4d4da4c>, array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
AssertionError: ChunkSizeTypeSizeMismatch not raised by pack_ndarray_str

======================================================================
FAIL: test_numpy_io.test_itemsize_chunk_size_mismatch(<class 'bloscpack.exceptions.ChunkSizeTypeSizeMismatch'>, <function pack_ndarray_str at 0xf4d4da4c>, array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
AssertionError: ChunkSizeTypeSizeMismatch not raised by pack_ndarray_str

----------------------------------------------------------------------
Ran 582 tests in 7.203s

FAILED (failures=2)

Strange call of np.linspace

array_ = np.linspace(i, i+1, 2e6)

The third parameter of np.linspace is:

num : int, optional

    Number of samples to generate. Default is 50. Must be non-negative.

Are you certain you want 2,000,000 samples? Besides, my numpy (version 1.19.1) complains with TypeError: 'float' object cannot be interpreted as an integer for this parameter.

Port testing from nose to pytest

I am a maintainer of Python packages in openSUSE, and I am on my crusade of eliminating nose1 from our distribution. When I look at its repository on https://github.com/nose-devs/nose, the last release 1.3.7 was on 2 Jun 2015, and even the last commit on the master branch was on 4 Mar 2016. Also, nose won't be supported starting with Python 3.9/3.10.

This patch eliminates the dependency on nose. The resulting test suite depends on pytest, which is actively developed and maintained.

Unfortunately, I wasn’t able to test it properly, because the rest of the test suite seems to me to be in pretty bad shape (#98 is I am afraid just one example of many), so I have to leave this patch as is in whatever shape it is.

A flag for specifying the output

Maybe it would be nice to add a flag for specifying the output filename? -o is taken though. Or is there a way to specify that already?

Value error when compressing file

Traceback (most recent call last):
  File "/bin/blpk", line 11, in <module>
    load_entry_point('bloscpack==0.14.0', 'console_scripts', 'blpk')()
  File "/usr/lib/python3.7/site-packages/bloscpack/cli.py", line 460, in main
    metadata_args=MetadataArgs())
  File "/usr/lib/python3.7/site-packages/bloscpack/file_io.py", line 464, in pack_file
    metadata_args=metadata_args)
  File "/usr/lib/python3.7/site-packages/bloscpack/abstract_io.py", line 153, in pack
    max_app_chunks=max_app_chunks
  File "/usr/lib/python3.7/site-packages/bloscpack/headers.py", line 295, in __init__
    % (last_chunk, chunk_size))
ValueError: 'last_chunk' (854860) is larger than 'chunk_size' (854856)

When trying to compress the following file: test.zip

blosc 1.14.4
python-blosc 1.6.1
bloscpack 0.14.0

blosc/bloscpack dependency

Hello,

I was trying to install bloscpack and noticed that setup.py has the following lines...

install_requires = [
    'blosc==1.2.7',
    'numpy',
    'six',
]

However, blosc appears to currently be at version 1.2.9dev0. Using pip/conda I wasn't able to install it either.

Is it safe to just edit the requirements to be blosc>=1.2.7?

bloscpack locked in multiprocessing mode

Here I share a reproduction of my problem, where the code freezes in multiprocessing mode but works just fine with a single process. The code below creates a dummy array simulating my image files and tries to save it with multiprocessing. If you reduce the number of processes to one, it works fine, but with more than one process it freezes.

I use Ubuntu 16.04 , Python 3.6 with Conda.


import os
import sys
import tempfile

import numpy as np
import bloscpack as bp

from tqdm import tqdm
from concurrent.futures import ProcessPoolExecutor, as_completed


def parallel_process(array, function, n_jobs=16, use_kwargs=False, front_num=3):
    """
        A parallel version of the map function with a progress bar. 

        Args:
            array (array-like): An array to iterate over.
            function (function): A python function to apply to the elements of array
            n_jobs (int, default=16): The number of cores to use
            use_kwargs (boolean, default=False): Whether to consider the elements of array as dictionaries of 
                keyword arguments to function 
            front_num (int, default=3): The number of iterations to run serially before kicking off the parallel job. 
                Useful for catching bugs
        Returns:
            [function(array[0]), function(array[1]), ...]
    """
    #We run the first few iterations serially to catch bugs
    if front_num > 0:
        front = [function(**a) if use_kwargs else function(a) for a in array[:front_num]]
    #If we set n_jobs to 1, just run a list comprehension. This is useful for benchmarking and debugging.
    if n_jobs==1:
        return front + [function(**a) if use_kwargs else function(a) for a in tqdm(array[front_num:])]
    #Assemble the workers
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        #Pass the elements of array into function
        if use_kwargs:
            futures = [pool.submit(function, **a) for a in array[front_num:]]
        else:
            futures = [pool.submit(function, a) for a in array[front_num:]]
        kwargs = {
            'total': len(futures),
            'unit': 'it',
            'unit_scale': True,
            'leave': True
        }
        #Print out the progress as tasks complete
        for f in tqdm(as_completed(futures), **kwargs):
            pass
    out = []
    #Get the results from the futures. 
    for i, future in tqdm(enumerate(futures)):
        try:
            out.append(future.result())
        except Exception as e:
            out.append(e)
    return front + out


def dump_blosc(data, filename):
    with open(filename, 'wb') as f:
        f.write(bp.pack_ndarray_str(data))


def write_data(inputs):
    dummy = np.random.rand(16,3,224,224).astype('uint8')
    tf = tempfile.NamedTemporaryFile()
    dump_blosc(dummy, tf.name)


if __name__ == '__main__':
    parallel_process(range(100), write_data, n_jobs=2)
    # for dir in dirs:
    #     print(dir)
    #     write_data(dir)

Tests currently broken

Traceback (most recent call last):
  File "/home/esc/git/bloscpack/test-venv/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/esc/git/bloscpack/test/test_headers.py", line 168, in test_decode_blosc_header
    nt.assert_equal(expected, header)
AssertionError: {'flags': 3, 'blocksize': 88, 'typesize':[58 chars] 108} != OrderedDict([('version', 2), ('versionlz'[86 chars]08)])

On master.

Plain nosetests seems not to be able to run the test suite

With the latest nose 1.3.0, I am seeing this:

nosetests test_bloscpack.py
E
======================================================================
ERROR: Failure: ImportError (No module named nose_parameterized)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/faltet/anaconda/lib/python2.7/site-packages/nose/loader.py", line 413, in loadTestsFromName
    addr.filename, addr.module)
  File "/home/faltet/anaconda/lib/python2.7/site-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/home/faltet/anaconda/lib/python2.7/site-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/tmp/bloscpack-0.5.0/test_bloscpack.py", line 16, in <module>
    from nose_parameterized import parameterized
ImportError: No module named nose_parameterized

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

Looks like some dependency is missing in requirements.txt?

Bug in loading structured array

Bloscpack version 0.7.1 incorrectly stores or loads metadata for a structured Numpy array. Could you please correct it?

Here is an example:

import numpy
import bloscpack

arr = numpy.array([('a', 1), ('b', 2)], dtype=[('a', 'S1'), ('b', 'f8')])

bloscpack.pack_ndarray_file(arr, 'arr.blp')
bloscpack.unpack_ndarray_file('arr.blp')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[I removed most of the traceback except the last call]

/usr/bin/anaconda/lib/python2.7/site-packages/bloscpack/numpy_io.py in __init__(self, metadata)
     82             raise NotANumpyArray
     83         self.ndarray = numpy.empty(metadata['shape'],
---> 84                                    dtype=numpy.dtype(metadata['dtype']),
     85                                    order=metadata['order'])
     86         self.ptr = self.ndarray.__array_interface__['data'][0]

TypeError: data type not understood

In the debugger we clearly see the problem --- numpy does NOT understand the format of metadata['dtype']:

ipdb> metadata['dtype']
[[u'a', u'|S1'], [u'b', u'<f8']]

Note that:

  1. the inner lists must be tuples;
  2. the strings must be plain (byte) strings rather than unicode ones.

To make it work, I corrected the code on lines 83-86 in numpy_io.py as follows, but there is probably a better solution:

/usr/bin/anaconda/lib/python2.7/site-packages/bloscpack/numpy_io.py in __init__(self, metadata)

        def _conv(descr):
            """
           Converts nested list of lists into list of tuples. Examples::
             [[u'a', u'f8']] -> [('a', 'f8')]
             [[u'a', u'f8', 2]] -> [('a', 'f8', 2)]
             [[u'a', [[u'b', 'f8']]]] -> [('a', [('b', 'f8')])]
            """
            if isinstance(descr, list):
                if isinstance(descr[0], list):
                    descr = map(_conv, descr)
                else:
                    descr = tuple(map(_conv, descr))
            elif isinstance(descr, unicode):
                descr = str(descr)
            else:
                # keep descr as is
                pass
            return descr

        self.ndarray = numpy.empty(metadata['shape'],
                                   dtype=numpy.dtype(_conv(metadata['dtype'])),
                                   order=metadata['order'])

Thanks,
Dmitry.

Checksum should be disabled when compressing into memory

Checksum is a nice thing to have when you are writing to disk, but its use is more questionable for the in-memory API (e.g. pack_ndarray_str). I propose to disable the checksum whenever bloscpack is using memory as a backend to store results.

Allow a better calculation for chunksize that is actually divisible by typesize

For example, when trying to compress an image file with 24bit depth:

$ ll 24bit.bpt 
-rwx------ 1 faltet faltet 1377618 jul 26 14:15 24bit.bpt*

$ blpk -f c -t 3 24bit.bpt 24bit-shuffle.blp
Traceback (most recent call last):
  File "/home/faltet/miniconda/bin/blpk", line 11, in <module>
    sys.exit(main())
  File "/home/faltet/miniconda/lib/python2.7/site-packages/bloscpack/cli.py", line 457, in main
    metadata_args=MetadataArgs())
  File "/home/faltet/miniconda/lib/python2.7/site-packages/bloscpack/file_io.py", line 465, in pack_file
    metadata_args=metadata_args)
  File "/home/faltet/miniconda/lib/python2.7/site-packages/bloscpack/abstract_io.py", line 127, in pack
    (double_pretty_size(chunk_size), blosc_args.typesize)
bloscpack.exceptions.ChunkSizeTypeSizeMismatch: chunk_size: '1.0M (1048576B)' is not divisible by typesize: '3'

whereas if we help bloscpack passing a chunksize (via -z):

$ blpk -f c -t 3 -z 1377618 24bit.bpt 24bit-shuffle.blp

$ ll 24bit-shuffle.blp
-rw-rw-r-- 1 faltet faltet 40976 jul 27 17:10 24bit-shuffle.blp

Adding a more adaptive chunksize calculation would save the user from having to pass the chunksize manually.

This example is based on: https://github.com/Cyan4973/zstd/issues/256
