
hangar-py's People

Contributors

alessiamarcolini, boat-builder, dxenonb, lantiga, rlizzo


hangar-py's Issues

Environment Singleton Needed?

I think we should at least give the option of cleaning up the singletons (either a specific one or all of them). In general I'm not 100% sure why we need a singleton (or something like it) design-wise, can you explain in a one-liner?

Originally posted by @lantiga in #20 (comment)

[BUG REPORT] Test Cases Needed for New Merge/Diff Logic and API

Describe the bug
A clear and concise description of what the bug is.

Test cases are needed for the changes introduced in #32.

Severity

Select an option:

  • Data Corruption / Loss of Any Kind
  • Unexpected Behavior, Exceptions or Error Thrown
  • Performance Bottleneck

N/A

To Reproduce
Steps to reproduce the behavior, minimal example code preferred:

Run the tests

Expected behavior
A clear and concise description of what you expected to happen.

More thorough testing.

Screenshots
If applicable, add screenshots to help explain your problem.

N/A

Desktop (please complete the following information):

  • OS: any
  • Python: any
  • Hangar: 0.0.0 -> 0.1 rc

Additional context
Add any other context about the problem here.

See #32 for changes introduced

[FEATURE REQUEST] More options through command line interface

Is your feature request related to a problem? Please describe.
We need more options in our CLI to make a few workflows seamless. Two examples I can think of (a rough sketch of both commands follows the list):

  1. create dataset: We have repo init and data import through the CLI, but without the option to create a dataset through the CLI, the user will always have to go through the Python APIs
  2. commit: Now, if users import data using hangar import, they will need to spin up a Python program and use the Python APIs to commit the staged data
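A rough sketch of the two requested commands, wrapping the Python API used elsewhere in these issues through click (which the existing CLI already uses); the command names, options, and checkout wiring are assumptions, not the real hangar CLI:

import click
import numpy as np
from hangar import Repository

# Hypothetical commands; names, options, and the Repository/checkout calls
# simply wrap the Python API shown elsewhere in these issues.

@click.command(name='dataset-create')
@click.option('--name', required=True)
@click.option('--shape', required=True, help='comma separated, e.g. 28,28')
@click.option('--dtype', default='uint8')
def dataset_create(name, shape, dtype):
    """Create a new dataset in the staging area (left staged, commit separately)."""
    repo = Repository(path='.')
    co = repo.checkout(write=True)
    co.datasets.init_dataset(name,
                             shape=tuple(int(s) for s in shape.split(',')),
                             dtype=getattr(np, dtype))
    co.close()

@click.command(name='commit')
@click.option('-m', '--message', required=True)
def commit(message):
    """Commit whatever is currently staged, e.g. after a hangar import."""
    repo = Repository(path='.')
    co = repo.checkout(write=True)
    co.commit(message)
    co.close()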

[FEATURE REQUEST] Ability to delete data from the history

Is your feature request related to a problem? Please describe.
Currently, we can't delete data from history once it is committed. In cases where an end user triggers a data deletion request, organizations that use hangar for data storage and versioning will have to delete it from everywhere, including history.

AttributeError on data removal

While trying to remove data with .remove(key), hangar raises AttributeError: 'DatasetDataWriter' object has no attribute 'TxnRegister'.
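A minimal repro sketch (the repository setup, dataset name, and key are placeholders):

co = repo.checkout(write=True)
dset = co.datasets['dset']
dset.remove('1')   # AttributeError: 'DatasetDataWriter' object has no attribute 'TxnRegister'
co.close()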

Not catching merge conflicts

Expected Behavior

Hangar is currently not catching merge conflicts properly. This issue keeps track of all the test cases that expect a conflict error; they could be fixed along with #32.

Current Behavior

[FEATURE REQUEST] Model Storage

Is your feature request related to a problem? Please describe.
Allow binary model storage for trained models

Describe the solution you'd like
A clear and concise description of what you want to happen:

  • a new class in the checkout objects which allows for binary storage of model blobs

[FEATURE REQUEST] Multi-tensor samples from complex data sources

Is your feature request related to a problem? Please describe.
I want to know how to accomplish the following. It doesn't need to require zero effort on the user's part, but there needs to be a clear best-hangar-practices path to a workable setup.

Take a source data format that is complex (e.g. DICOM, JPEG) and infeasible to reconstitute bit-exact from the tensor+metadata form. Each instance of the raw data produces a sample that consists of 2 (or more) tensors; an image tensor and a 1D tensor that encodes things like lat/long or age/sex/etc. (to be concatenated with the output of the convolutional layers prior to the fully connected layers). To be clear, this is intended to be an illustrative example, not a concrete use case.

Per my reading of the docs, right now these two tensors wouldn't qualify as being in the same hangar dataset (it's not clear if that's problematic or not).

Let's express the above conversion as:

f_v1(raw) -> (t1, t2)

Users will need to:

  • Update the conversion function to f_v2 and repopulate t1 and t2.
  • Update the conversion function to f_v3 which outputs (t1, t2, t3).
  • Update the raw data for a sample and repopulate t1 and t2.
  • Be handed the pair of t1 and t2 for training/validation (including when training is randomized).
  • Retrieve the raw data given IDs/tags/metadata included with the training sample (for use in an external viewer, manual investigation, etc.).

Describe the solution you'd like
I think that changing the definition of a sample to be a tuple of binary blobs plus a tuple of tensors plus metadata would work, but I haven't considered the potential impacts from that kind of change. Seems potentially large.

Describe alternatives you've considered
Another option would be to have separate datasets for t1 and t2 and combine them manually, plus manage the binary blobs separately. That seems like a lot of infrastructure work, and it risks drift between the samples themselves and between the samples and the blobs.
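A rough sketch of that alternative, assuming two hangar datasets with made-up names ('images' and 'aux') that share sample keys:

def store_sample(co, sample_id, raw, convert):
    """convert is the user's conversion function, e.g. f_v1(raw) -> (t1, t2)."""
    t1, t2 = convert(raw)
    co.datasets['images'][sample_id] = t1
    co.datasets['aux'][sample_id] = t2

def load_sample(co, sample_id):
    # shared sample keys let us re-assemble the (t1, t2) pair at read time
    return (co.datasets['images'][sample_id],
            co.datasets['aux'][sample_id])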

Additional context
I suspect that I want/expect Hangar to solve a larger slice of the problem than it's intended to, but it's not clear at first glance what the intended approach would be for more complicated setups like the above.

Singletons live indefinitely

With the bugfixes to #5 #4 in PR #20, environment handle singletons can now potentially live indefinitely, as there is no cleanup method.

Is there a way to GC the objects without introducing an explicit .close() method on the repository?

Tag @lantiga for discussion here.

[BUG REPORT] Traversing through dataset data records is buggy

Describe the bug
The function RecordQuery._traverse_dataset_data_records(self, dataset_name) has a critical bug related to the condition check we do to make sure we are looking only at the provided dataset (the same applies to the _traverse_dataset_schema_records function, since the logic is similar there). If the repository has multiple datasets where one dataset's name is a prefix of another's, the condition check dataRecKey.startswith(startDatasetRecordCountRangeKey) returns wrong data.
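The key layout below is purely illustrative, not hangar's real record-key format, but it shows why a bare startswith() prefix check bleeds across datasets and how terminating the prefix with the key separator would fix it:

# hypothetical record keys for two datasets whose names share a prefix
mnist_record = 'dataRecord:MNIST:0'
mnistlabel_record = 'dataRecord:MNISTLabel:0'

prefix = 'dataRecord:MNIST'                           # built only from the dataset name
print(mnistlabel_record.startswith(prefix))           # True  -> MNISTLabel leaks into MNIST's query
print(mnistlabel_record.startswith(prefix + ':'))     # False -> separator-terminated prefix is safe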

Severity

Select an option:

  • Data Corruption / Loss of Any Kind
  • Unexpected Behavior, Exceptions or Error Thrown
  • Performance Bottleneck

To Reproduce
First, a code snippet to save two datasets named MNIST and MNISTLabel in the same repository:

import os
from hangar import Repository
import torchvision
import numpy as np

data = torchvision.datasets.MNIST('.', download=True)
path = os.path.dirname(os.path.abspath(__file__))
repo = Repository(path=path)
repo.init(user_name='hhsecond', user_email='[email protected]', remove_old=True)
co = repo.checkout(write=True)
co.datasets.init_dataset('MNIST', shape=(28, 28), dtype=np.uint8)
co.datasets.init_dataset('MNISTLabel', shape=(1,), dtype=np.int64)
mnist_dset = co.datasets['MNIST']
label_dset = co.datasets['MNISTLabel']

with mnist_dset, label_dset:
    for i, (image, label) in enumerate(data):
        print(i)
        mnist_dset[str(i)] = np.array(image)
        label_dset[str(i)] = np.array([label])
co.commit('0.1')
co.close()

The second part, where we try to check the length of the MNIST dataset:

co = repo.checkout()
mnist = co.datasets['MNIST']
mnistlabel = co.datasets['MNISTLabel']
print(len(mnist))

Calling len() on the dataset triggers the functions above, which causes it to fail.

Expected behavior

  • In the above case, the user should get the actual length
  • Generally, _traverse_dataset_data_records should return a dictionary of data records with name as key and hash as value

Dockerizing the server

Expected Behavior

We need a Docker image for the server instance. I'll try to work on it over the weekend.

[BUG REPORT]

Describe the bug
A clear and concise description of what the bug is.

The current pip install is missing the google requirement; import hangar fails because google is missing.

Severity

Select an option:

  • Data Corruption / Loss of Any Kind
  • Unexpected Behavior, Exceptions or Error Thrown
  • Performance Bottleneck

To Reproduce
Steps to reproduce the behavior, minimal example code preferred:

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS:
  • Python:
  • Hangar:

Additional context
Add any other context about the problem here.

Branch removal

Expected Behavior

repo.remove_branch(branch_name)

Current Behavior

NotImplemented

Steps to Reproduce (for bugs)

#25 (comment)

Exception handling

The way we are handling exceptions is not user-friendly and has to be changed. This issue is to discuss it further. After thinking about it for some time, I think it's better to raise an exception rather than returning the exception object, as Rick suggested. Here are the cases I have found so far where we are not raising the exception but instead pushing it to stderr/stdout (a short sketch of the expected behavior follows the list). I'll add more as I find them.

  • Trying to init the already initialized dataset
  • Trying to initiate a dataset with more than 31 dimensions
  • Accessing dset['key_doesnt_exist']
  • dset[1] = somearray (should raise the exception about integer being used as a key)
  • wrong argument order to dset.add()
  • If name is not passed to dset.add() for samples_are_named
  • adding different dtypes or shapes than those used at init
  • committing without any changes should return something if not raising an exception
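A short sketch of the expected behavior for a few of the cases above (array5by7 is the same placeholder array used in other issues; repo is an already-initialized repository):

co = repo.checkout(write=True)
dset = co.datasets.init_dataset('dset', prototype=array5by7)

dset['key_doesnt_exist']                                # should raise KeyError instead of printing to stderr
dset[1] = array5by7                                     # should raise about an integer being used as a key
co.datasets.init_dataset('dset', prototype=array5by7)   # re-initializing an existing dataset should raise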

Not able to import hangar

import hangar
threw the error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/varun/hangar-py/src/hangar/__init__.py", line 5, in <module>
    from .remote.server import serve
  File "/home/varun/hangar-py/src/hangar/remote/server.py", line 18, in <module>
    from . import chunks
  File "/home/varun/hangar-py/src/hangar/remote/chunks.py", line 6, in <module>
    from . import hangar_service_pb2
  File "/home/varun/hangar-py/src/hangar/remote/hangar_service_pb2.py", line 23, in <module>
TypeError: __new__() got an unexpected keyword argument

after running sudo python3 setup.py develop

Python Version - 3.6.7

Environment
varun@pop-os
OS: Ubuntu 18.04 bionic (pop-os)
Kernel: x86_64 Linux 4.18.0-16-generic
Uptime: 1h 12m
Packages: 3599
Shell: bash 4.4.19
Resolution: 1920x1080
DE: KDE 5.44.0 / Plasma 5.12.7
WM: KWin
WM Theme: Breeze
GTK Theme: Breeze [GTK2/3]
Icon Theme: breeze
Font: Noto Sans Regular
CPU: Intel Core i7-8750H @ 12x 4.1GHz [57.0°C]
GPU: intel
RAM: 2968MiB / 7817MiB

Also not able to run hangar init as it resulted in
raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'lmdb<=0.96,>=0.94' distribution was not found and is required by hangar

After installing lmdb 0.95, it still threw the same error.

[BUG REPORT] References are rewritten on commit

Describe the bug
With the new weakref logic, hangar's user-facing objects such as dataset/metadata are ideally weak references to the actual objects, and they get removed on checkout.close(). But since we are rewriting self._datasets & self._metadata here

self._metadata = MetadataWriter(dataenv=self._stageenv, labelenv=self._labelenv)
self._datasets = Datasets._from_staging_area(
    repo_pth=self._repo_path,
    hashenv=self._hashenv,
    stageenv=self._stageenv,
    stagehashenv=self._stagehashenv)

inside self.__setup(), which gets called on commit as well (not only on close), the references are gone after the commit. In a nutshell, the issue is: the user can't make changes after the commit even if the checkout is not closed yet.

Severity

Select an option:

  • Data Corruption / Loss of Any Kind
  • Unexpected Behavior, Exceptions or Error Thrown
  • Performance Bottleneck

To Reproduce
Steps to reproduce the behavior, minimal example code preferred:

co = repo.checkout(write=True)
dset = co.datasets.init_dataset('dset', prototype=array5by7)
dset.add(array5by7, '1')
co.commit()
dset.add(array5by7, '2')  # this raises ReferenceError since the reference to dset is gone
co.close()

Expected behavior
The user should be able to add data after the commit. References should be released only after the close. Perhaps we should split the __setup() function into two functions with very specific responsibilities.

[BUG REPORT] Hangar can't handle dataset with empty dimension

Describe the bug
If the shape of the dataset is empty, i.e. shape=() or prototype=np.array(1), hangar does not handle it properly. Instead, it throws an error while committing: ValueError: invalid literal for int() with base 10: ''

Severity
Select an option:

  • Data Corruption / Loss of Any Kind
  • Unexpected Behavior, Exceptions or Error Thrown
  • Performance Bottleneck

To Reproduce

co = repo.checkout(write=True)
arr = np.array(1)
dset = co.datasets.init_dataset('dset2', prototype=arr)
dset['1'] = arr
co.commit()
co.close()

Expected behavior
It should work like any other array with a non-empty dimension.

Dataloaders for PyTorch

@rlizzo I was thinking of an API like hangar.dataloaders.pytorch (let me know if you have another structure in mind). Basically, the idea is to load data in batches synchronously or asynchronously and enable the features of PyTorch's DataLoader. My plan is to have a Dataset class and a DataLoader class, which is essentially the way PyTorch's data loading works.
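A minimal sketch of the Dataset half, assuming the hangar dataset exposes its sample keys and returns numpy arrays (the keys() call and the class name are assumptions):

import torch
from torch.utils.data import Dataset, DataLoader

class HangarTorchDataset(Dataset):
    """Wrap a hangar dataset so PyTorch's DataLoader can batch it."""

    def __init__(self, dset):
        self.dset = dset
        self.keys = list(dset.keys())   # assumes the hangar dataset exposes its sample keys

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        # assumes samples come back as numpy arrays
        return torch.from_numpy(self.dset[self.keys[idx]])

# usage sketch:
# loader = DataLoader(HangarTorchDataset(co.datasets['mnist_data']), batch_size=32)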

[BUG REPORT] Commit inside context manager throws RuntimeError

Describe the bug
If we try to commit inside the context manager (before __exit__()), hangar throws a RuntimeError saying No changes made in the staging area. Cannot commit. IMO we should allow the user to commit inside the context manager, but probably with a warning about the performance hit.

Severity

Select an option:

  • Data Corruption / Loss of Any Kind
  • Unexpected Behavior, Exceptions or Error Thrown
  • Performance Bottleneck

To Reproduce

import numpy as np

from hangar import Repository
repo = Repository(path='myhangarrepo')
repo.init(user_name='Sherin Thomas', user_email='[email protected]', remove_old=True)

# generate data
data = []
for i in range(1000):
    data.append(np.random.rand(28, 28))
data = np.array(data)

co = repo.checkout(write=True)
data_dset = co.datasets.init_dataset('mnist_data', prototype=data[0])
co.commit('datasets init')
co.close()
co = repo.checkout(write=True)
data_dset = co.datasets['mnist_data']

with data_dset:
    for i in range(len(data)):
        sample_name = str(i)
        data_dset[sample_name] = data[i]
        co.commit('dataset curation: stage 1')  # this throws error
co.close()

Expected behavior
It should not break the program; instead it should raise a warning about the performance hit.

[BUG REPORT] Setup.py installation failing

The python setup.py develop command failed with the error message below.

Best match: blosc 1.8.1
Processing blosc-1.8.1.tar.gz
Writing /var/folders/yl/9l3dn2s53wdd4__5v1p1cwbm0000gn/T/easy_install-b7znieg5/blosc-1.8.1/setup.cfg
Running blosc-1.8.1/setup.py -q bdist_egg --dist-dir /var/folders/yl/9l3dn2s53wdd4__5v1p1cwbm0000gn/T/easy_install-b7znieg5/blosc-1.8.1/egg-dist-tmp-6xli48d_
SSE2 detected
warning: no files found matching '*.hpp' under directory 'c-blosc'
c-blosc/internal-complibs/zlib-1.2.8/gzread.c:30:15: warning: implicit declaration of function 'read'
is invalid in C99 [-Wimplicit-function-declaration]
ret = read(state->fd, buf + *have, len - *have);
^
c-blosc/internal-complibs/zlib-1.2.8/gzread.c:30:15: warning: this function declaration is not a
prototype [-Wstrict-prototypes]
c-blosc/internal-complibs/zlib-1.2.8/gzread.c:591:11: warning: implicit declaration of function
'close' is invalid in C99 [-Wimplicit-function-declaration]
ret = close(state->fd);
^
c-blosc/internal-complibs/zlib-1.2.8/gzread.c:591:11: warning: this function declaration is not a
prototype [-Wstrict-prototypes]
4 warnings generated.
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:256:24: warning: implicit declaration of function
'lseek' is invalid in C99 [-Wimplicit-function-declaration]
state->start = LSEEK(state->fd, 0, SEEK_CUR);
^
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:14:17: note: expanded from macro 'LSEEK'

#define LSEEK lseek

            ^

c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:256:24: warning: this function declaration is not a
prototype [-Wstrict-prototypes]
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:14:17: note: expanded from macro 'LSEEK'

#define LSEEK lseek

            ^

c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:355:9: warning: implicit declaration of function 'lseek'
is invalid in C99 [-Wimplicit-function-declaration]
if (LSEEK(state->fd, state->start, SEEK_SET) == -1)
^
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:14:17: note: expanded from macro 'LSEEK'

#define LSEEK lseek

            ^

c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:396:15: warning: implicit declaration of function
'lseek' is invalid in C99 [-Wimplicit-function-declaration]
ret = LSEEK(state->fd, offset - state->x.have, SEEK_CUR);
^
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:14:17: note: expanded from macro 'LSEEK'

#define LSEEK lseek

            ^

c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:492:14: warning: implicit declaration of function
'lseek' is invalid in C99 [-Wimplicit-function-declaration]
offset = LSEEK(state->fd, 0, SEEK_CUR);
^
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:14:17: note: expanded from macro 'LSEEK'

#define LSEEK lseek

            ^

5 warnings generated.
c-blosc/internal-complibs/zlib-1.2.8/gzwrite.c:84:15: warning: implicit declaration of function
'write' is invalid in C99 [-Wimplicit-function-declaration]
got = write(state->fd, strm->next_in, strm->avail_in);
^
c-blosc/internal-complibs/zlib-1.2.8/gzwrite.c:84:15: warning: this function declaration is not a
prototype [-Wstrict-prototypes]
c-blosc/internal-complibs/zlib-1.2.8/gzwrite.c:101:33: warning: implicit declaration of function
'write' is invalid in C99 [-Wimplicit-function-declaration]
if (have && ((got = write(state->fd, state->x.next, have)) < 0 ||
^
c-blosc/internal-complibs/zlib-1.2.8/gzwrite.c:573:9: warning: implicit declaration of function
'close' is invalid in C99 [-Wimplicit-function-declaration]
if (close(state->fd) == -1)
^
c-blosc/internal-complibs/zlib-1.2.8/gzwrite.c:573:9: warning: this function declaration is not a
prototype [-Wstrict-prototypes]
5 warnings generated.
In file included from c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:27:
c-blosc/internal-complibs/zstd-1.3.8/compress/zstd_compress_internal.h:190:5: error: unknown type
name 'ZSTD_dictAttachPref_e'
ZSTD_dictAttachPref_e attachDictPref;
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:675:85: error: use of undeclared
identifier 'ZSTD_c_forceMaxWindow'; did you mean 'ZSTD_p_forceMaxWindow'?
...const forceWindowError = ZSTD_CCtxParam_setParameter(&jobParams, ZSTD_c_forceMaxWindow, !job-...
^~~~~~~~~~~~~~~~~~~~~
ZSTD_p_forceMaxWindow
//anaconda3/include/zstd.h:1110:5: note: 'ZSTD_p_forceMaxWindow' declared here
ZSTD_p_forceMaxWindow=1100, /* Force back-reference distances to remain < windowSize,
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:999:21: error: use of undeclared
identifier 'ZSTD_OVERLAPLOG_MIN'
if (value < ZSTD_OVERLAPLOG_MIN) value = ZSTD_OVERLAPLOG_MIN;
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:999:50: error: use of undeclared
identifier 'ZSTD_OVERLAPLOG_MIN'
if (value < ZSTD_OVERLAPLOG_MIN) value = ZSTD_OVERLAPLOG_MIN;
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:1000:21: error: use of undeclared
identifier 'ZSTD_OVERLAPLOG_MAX'
if (value > ZSTD_OVERLAPLOG_MAX) value = ZSTD_OVERLAPLOG_MAX;
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:1000:50: error: use of undeclared
identifier 'ZSTD_OVERLAPLOG_MAX'
if (value > ZSTD_OVERLAPLOG_MAX) value = ZSTD_OVERLAPLOG_MAX;
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:1171:14: error: use of undeclared
identifier 'ZSTD_btultra2'; did you mean 'ZSTD_btultra'?
case ZSTD_btultra2:
^~~~~~~~~~~~~
ZSTD_btultra
//anaconda3/include/zstd.h:450:42: note: 'ZSTD_btultra' declared here
ZSTD_btlazy2, ZSTD_btopt, ZSTD_btultra } ZSTD_strategy; /* from faster to st...
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:1173:14: error: duplicate case value
'ZSTD_btultra'
case ZSTD_btultra:
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:1171:14: note: previous case defined
here
case ZSTD_btultra2:
^
8 errors generated.
error: Setup script exited with error: command 'gcc' failed with exit status 1

Desktop (please complete the following information):

  • OS: MacOs Mojave, 10.14.6
  • Python: 3.7.3, Anaconda
  • Hangar: master branch

[FEATURE REQUEST] Command line options to import and export typical data file types to a dataset/checkout

Note: The feature as detailed below was initially proposed by @lantiga in an offline phone call.

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

In order to improve the command line experience, Hangar should provide built in tools to import common data file types stored in a directory into the staging area. In addition, in order to ease the user transition from domain-specific file formats to array based storage, a utility should be included which exports samples in a dataset into common file types.

Describe the solution you'd like
A clear and concise description of what you want to happen.

Add commands to the CLI which mimic the following workflow (exact syntax and names are not super important here, just the idea of the workflow)

Importing

$ hangar branch --create foo
$ hangar checkout foo               # write enabled checkout which can modify the staging area
$ hangar dataset create --name images --dtype uint8 --shape 500 500 3 --variable-shape ...
$ hangar dataset import --name images --directory ~/data/images/   # directory contains image files. 
$ hangar commit -m "added data"

Exporting

$ hangar dataset export --commit foo_hash --name images --directory ~/data/images/   --format jpg

Notes:

  • It's important that the implementation of these utilities is not directly integrated or included in the hangar core (either in backend data IO processing, record reference parsing/storage/interaction, or the user-facing programmatic API). Instead these utilities should live in a special module (isolated from the hangar core) where any relevant IO or data-file processing / transformation methods are defined. This module would transform the files as necessary, and then make the standard calls into the user-facing API to interact with Hangar.

  • A few file types we could natively support would be:

    • jpg, png, tiff, ... (other standard image files)
    • csv
    • npy / npz
    • others? (open to suggestions here)

Additional context
Add any other context or screenshots about the feature request here.

While the data import feature would be a very useful tool with many uses, use of the export function as a crutch (instead of just reading data directly as arrays via the API) should not be encouraged; that type of use nullifies some key benefits of hangar from the start.

The export tool does, however, have really interesting potential to provide human-understandable diffs of the data changed between two branches or commits. I'm not quite sure how yet, as the diff interface is still evolving, but there are interesting possibilities here.

Tagging @rgalbo for reference in relation to the issue opened in #71

[BUG REPORT] Multiprocess pytorch dataloaders

Describe the bug
A clear and concise description of what the bug is.

The pytorch dataloader cannot currently be run with multiprocess workers:

>>> torch_dset = make_torch_dataset(aset, index_range=slice(1, 100))
>>> loader = DataLoader(torch_dset, batch_size=16, num_workers=2)
>>> for batch in loader:
...     train_model(batch)
Exception: Cannot pickle `hangar.dataloaders.TorchDataset.BatchTuple`

This is because the BatchTuple wrappers passed to hangar.dataloaders.TorchDataset are dynamically defined namedtuple classes, whose definition is not appropriately scoped for a forked subprocess to introspect its name/contents upon pickling.

if field_names:
    if not isinstance(field_names, (list, tuple, set)):
        raise TypeError(
            f'type(field_names): {type(field_names)} != (list, tuple, set)')
    if len(field_names) != len(arraysets):
        m = f'len(field_names): {len(field_names)} != len(arraysets): {len(arraysets)}'
        raise ValueError(m)
    wrapper = namedtuple('BatchTuple', field_names=field_names)
else:
    wrapper = namedtuple('BatchTuple', field_names=gasets.arrayset_names, rename=True)
return TorchDataset(gasets.arrayset_array, gasets.sample_names, wrapper)

As the structure needs to be dynamically defined based on user arguments, we cannot just place the BatchTuple definition in the main body of the module.

Two possible solutions:

@lantiga and @hhsecond, let me know what you prefer, or any other solutions you might have.

1) keep the current return type exactly the same, but add the definition of BatchTuple to globals() before it is passed to TorchDataset

        wrapper = namedtuple('BatchTuple', field_names=field_names)
    else:
        wrapper = namedtuple('BatchTuple', field_names=gasets.arrayset_names, rename=True)

    globals()['BatchTuple'] = wrapper
    return TorchDataset(gasets.arrayset_array, gasets.sample_names, wrapper)

This works, but it is generally bad practice to manually modify the global scope.

2) return a dict of field_names and tensors instead of a namedtuple

        wrapper = tuple(field_names)
    else:
        wrapper = tuple(gasets.arrayset_names)

    return TorchDataset(gasets.arrayset_array, gasets.sample_names, wrapper)

And in TorchDataset replace:

return self.wrapper(*out)

with

    return dict(zip(self.wrapper, out))

This still works and does not modify globals(), but it changes the output of the function to something "not quite as nice" as a namedtuple.

Severity

Select an option:

  • Data Corruption / Loss of Any Kind
  • Unexpected Behavior, Exceptions or Error Thrown
  • Performance Bottleneck

[FEATURE REQUEST] Do not close checkout on staging area `hard reset`

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

  • In #54 we introduced a fix to the checkout.commit() and checkout.reset_staging_area() methods which allows correct behavior on previously checked-out objects after close. The fix for checkout.reset_staging_area() defaults to close()-ing all objects permanently when it is called. This behavior is undesirable, but convenient for the fix that was presented.

Describe the solution you'd like
A clear and concise description of what you want to happen.

  • When checkout.reset_staging_area() is called, the reset should be performed, and any dataset objects referencing datasets which existed in the CLEAN staging area should continue to work, while any which were created in the DIRTY state should be invalidated.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

  • Current behavior: close() the checkout entirely.

Additional context
Add any other context or screenshots about the feature request here.

Dataset object is still operable even after closing the checkout

Expected Behavior

After closing the checkout, the dataset object should not be operable. We need to decide what should and should not be possible on the dataset object, though. So far, we know:

  1. User should not be able to add data.
  2. User should not be able to fetch data

Current Behavior

co = repo.checkout(write=True)
dset = co.datasets.init_dataset('dset', prototype=array5by7)
co.commit()
co.close()
co = repo.checkout(write=True)  # without opening the checkout again, user won't be able to do the addition even now
dset.add(array5by7, '1')  # user can add data to dset object
co = repo.checkout(write=True)
dset = co.datasets.init_dataset('dset', prototype=array5by7)
dset.add(array5by7, '1')
co.commit()
co.close()
print(dset['1'])  # user can fetch data from dset object

These issues are part of #25

[FEATURE REQUEST] IterableDataLoaders from PyTorch

Is your feature request related to a problem? Please describe.
PyTorch has an IterableDataset in the upcoming release which we can plug into Hangar. Keeping this issue to track the PyTorch release and integrate it when it is time

[BUG REPORT] New repo creation is unfriendly

Describe the bug

The message HANGAR RUNTIME WARNING: no repository exists at /some/path/__hangar, please use init_repo function makes me think my script is doing something wrong, even though the next thing that I do is call repo.init().

Additionally, if the path does not exist, there should be an option to have it be created.

A single exists=True default parameter could handle both cases: when set to False, it would create the directory and suppress the init warning.
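A sketch of the proposed call (the exists flag does not exist yet; the path is the same placeholder as above):

# proposed: with exists=False, create the directory if needed and skip the init warning
repo = Repository(path='/some/path/__hangar', exists=False)
repo.init(user_name='name', user_email='email')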

Severity

Select an option:

  • Data Corruption / Loss of Any Kind
  • Unexpected Behavior, Exceptions or Error Thrown
  • Performance Bottleneck

Checkout as a context manager

Expected Behavior

Making checkout a context manager lets users avoid writing an explicit close(), and we can make sure the lock is released. Keeping the current functionality along with this new ability should be the way to go.

with repo.checkout(write=True) as co:
    co.datasets['dset'] = array
    co.commit()

We could either raise a warning if the context manager exits without a commit, or even commit implicitly (by having another argument autocommit=True in the checkout() call).

Current Behavior

Currently, we need to explicitly call close() for all the write-enabled checkouts.

I could take it up if this doesn't raise any other concerns.
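A minimal sketch of the context-manager protocol on the checkout object (the class name is an assumption; close() stands in for the real lock-releasing logic):

class WriterCheckout:
    """Sketch only: the commit/close logic of the real checkout object is assumed."""

    def close(self):
        # stands in for releasing the writer lock and invalidating handles
        print('releasing writer lock')

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # always release the writer lock, even if the with-block raised
        self.close()
        return False  # do not swallow exceptions raised inside the block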

[FEATURE REQUEST] Reporting utility for time consuming details

Is your feature request related to a problem? Please describe.
Currently, we don't have time consumption reports generated for different operations such as data addition (and import through the plugin), etc.

Describe the solution you'd like
Figure out which operations need to emit this info, and print it to the UI mentioning the time taken by different modules.

Suggestion from @lantiga in the weekly
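A minimal sketch of what such a reporting helper could look like (the name and where it hooks into hangar are open questions):

import time
from contextlib import contextmanager

@contextmanager
def report_time(operation_name):
    """Print how long a block of work took once it finishes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f'{operation_name} took {elapsed:.3f}s')

# usage sketch:
# with report_time('data addition'):
#     dset[sample_name] = data[i]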

[BUG REPORT] Exceptions from non-initialised repository is not informative

Describe the bug
Most (if not all) operations on a non-initialized repository throw exceptions like AttributeError. They should either return a proper error message or an empty object, at least for operations like .log() / .summary(), etc.

Severity
Select an option:

  • Data Corruption / Loss of Any Kind
  • Unexpected Behavior, Exceptions or Error Thrown
  • Performance Bottleneck

To Reproduce

repo = Repository(path=path_to_non_initialized_repository)
repo.status()  # AttributeError: 'NoneType' object has no attribute 'begin'

Expected behavior
Either a proper exception or a sensible warning

Fetched data is not same as saved data

In the same Python session, if we initialize a dataset twice and then save data with the same shape and type as the first instance, we won't get the data that we saved in the second instance. Instead, we'll get the data from the first instance. Here is the script to reproduce.

import numpy as np
from hangar.repository import Repository

test_hangar_path1 = '/home/hhsecond/mypro/test_hangar_dir_1'
arr1 = np.random.random((5, 7))
repo1 = Repository(path=test_hangar_path1)
repo1.init(user_name='tester1', user_email='[email protected]', remove_old=True)
co = repo1.checkout(write=True)
dset1 = co.datasets.init_dataset('dset1', prototype=arr1)
dset1['arr1'] = arr1
co.commit()
co.close()


test_hangar_path2 = '/home/hhsecond/mypro/test_hangar_dir_2'
arr2 = np.random.random((5, 7))
repo2 = Repository(path=test_hangar_path2)
repo2.init(user_name='tester2', user_email='[email protected]', remove_old=True)
co = repo2.checkout(write=True)
dset2 = co.datasets.init_dataset('dset2', prototype=arr2)
dset2['arr2'] = arr2
co.commit()
co.close()


co = repo2.checkout()
print(arr2[1], co.datasets['dset2']['arr2'][1])
print((arr2 == co.datasets['dset2']['arr2']).all())

[BUG REPORT] Ability to Add Data with Fewer Dimensions than Schema in Variable Sized Dataset

Describe the bug
A clear and concise description of what the bug is.

Ability to add arrays with fewer dimensions (and not just smaller dimension sizes) to an arrayset with a variable_size=True schema.

The current implementation does not support this; it won't allow records to be added in this fashion, instead forcing a ValueError to be raised. Basically, it's a bug, but one we have addressed by stopping execution which would violate the limitations of the slice generator.

Severity

Select an option:

  • Data Corruption / Loss of Any Kind
  • Unexpected Behavior, Exceptions or Error Thrown
  • Performance Bottleneck

To Reproduce
Steps to reproduce the behavior, minimal example code preferred:

>>> co = repo.checkout(write=True)
>>> aset = co.arraysets.init_arrayset('foo', shape=(10, 10, 10), dtype=np.uint8, variable=True)
>>> arr = np.random.randn(10, 10).astype(np.uint8)
>>> aset[1] = arr
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-dc4fc55d87c3> in <module>
----> 1 aset[1] = arr

~/projects/tensorwerk/hangar/hangar-py/src/hangar/arrayset.py in __setitem__(self, key, value)
    484             sample name of the stored data (assuming operation was successful)
    485         """
--> 486         self.add(value, key)
    487         return key
    488 

~/projects/tensorwerk/hangar/hangar-py/src/hangar/arrayset.py in add(self, data, name, **kwargs)
    588 
    589         except ValueError as e:
--> 590             raise e from None
    591 
    592         # --------------------- add data to storage backend -------------------

~/projects/tensorwerk/hangar/hangar-py/src/hangar/arrayset.py in add(self, data, name, **kwargs)
    576                 if data.ndim != len(self._schema_max_shape):
    577                     raise ValueError(
--> 578                         f'`data` rank: {data.ndim} != aset rank: {len(self._schema_max_shape)}')
    579                 for dDimSize, schDimSize in zip(data.shape, self._schema_max_shape):
    580                     if dDimSize > schDimSize:

ValueError: `data` rank: 2 != aset rank: 3

Expected behavior
A clear and concise description of what you expected to happen.

>>> aset[1] = arr
>>> out = aset[1]
>>> out.shape
(10, 10)

Symbolic Link Privileges

Remove the need for symlink privileges. This would require some rework of the internal data storage directory organization, but would be nice to have, as illustrated in #138.

[BUG REPORT] license fixes

Describe the bug
A clear and concise description of what the bug is.

Move code which contains other license files - for config (from dask) and Git log graph - to individual modules

Severity

Select an option:

  • Data Corruption / Loss of Any Kind
  • Unexpected Behavior, Exceptions or Error Thrown
  • Performance Bottleneck
  • other

Tagging @elistevens as fyi

[BUG REPORT] TypeError: 'rdcc_nbytes' is an invalid keyword argument for this function

Describe the bug
" TypeError: 'rdcc_nbytes' is an invalid keyword argument for this function" on saving an image to Hangar repository

Severity

Select an option:

  • Data Corruption / Loss of Any Kind
  • Unexpected Behavior, Exceptions or Error Thrown
  • Performance Bottleneck

To Reproduce
import hangar
repo=hangar.Repository('.')
repo.init(user_name='chandnika', user_email='[email protected]')
co=repo.checkout(write=True)
import numpy as np
co.arraysets.init_arrayset('images' ,shape=(10,10,3) , dtype=np.uint8)
co.commit('arrayset init')
images=co.arraysets['images']
img=np.random.random((10,10,3)).astype(np.uint8)
images[0]=img

Expected behavior
TypeError Traceback (most recent call last)
in ()
----> 1 images[0]=img

5 frames
/content/hangar-py/src/hangar/arrayset.py in __setitem__(self, key, value)
500 sample name of the stored data (assuming operation was successful)
501 """
--> 502 self.add(value, key)
503 return key
504

/content/hangar-py/src/hangar/arrayset.py in add(self, data, name, **kwargs)
646 existingHashVal = self._hashTxn.get(hashKey, default=False)
647 if existingHashVal is False:
--> 648 hashVal = self._fs[self._dflt_backend].write_data(data)
649 self._hashTxn.put(hashKey, hashVal)
650 self._stageHashTxn.put(hashKey, hashVal)

/content/hangar-py/src/hangar/backends/hdf5_00.py in write_data(self, array, remote_operation)
675 self._create_schema(remote_operation=remote_operation)
676 else:
--> 677 self._create_schema(remote_operation=remote_operation)
678
679 srcSlc = None

/content/hangar-py/src/hangar/backends/hdf5_00.py in _create_schema(self, remote_operation)
540 rdcc_nbytes=rdcc_nbytes_val,
541 rdcc_w0=CHUNK_RDCC_W0,
--> 542 rdcc_nslots=rdcc_nslots_prime_val)
543 self.w_uid = uid
544 self.hNextPath = 0

/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, **kwds)
309 name = filename_encode(name)
310 with phil:
--> 311 fapl = make_fapl(driver, libver, **kwds)
312 fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
313

/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py in make_fapl(driver, libver, **kwds)
107 msg = "'{key}' is an invalid keyword argument for this function"
108 .format(key=next(iter(kwds)))
--> 109 raise TypeError(msg)
110 return plist
111

TypeError: 'rdcc_nbytes' is an invalid keyword argument for this function

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: Google Colab Debian based Linux
  • Python: Python 3.6.8
  • Hangar: 0.3.0

Additional context
Add any other context about the problem here.

Can't initiate two repository with different paths in the same python session

I found an issue with initializing the repository again and again in the same Python session, even if the repository path is different. Because of the singleton nature of the Environment class, regardless of the path we pass for a repository, all subsequent repositories will have the path of the first repository (is this something we want to have?).

The script below reproduces the issue. The last print statement should print True.

import numpy as np
from hangar.repository import Repository

test_hangar_path1 = '/home/hhsecond/mypro/test_hangar_dir_1'
arr1 = np.random.random((5, 7))
repo1 = Repository(path=test_hangar_path1)
repo1.init(user_name='tester1', user_email='[email protected]', remove_old=True)
co = repo1.checkout(write=True)
dset1 = co.datasets.init_dataset('dset1', prototype=arr1)
dset1['arr1'] = arr1
co.commit()
co.close()


test_hangar_path2 = '/home/hhsecond/mypro/test_hangar_dir_2'
arr2 = np.random.random((5, 7))
repo2 = Repository(path=test_hangar_path2)
repo2.init(user_name='tester2', user_email='[email protected]', remove_old=True)
co = repo2.checkout(write=True)
dset2 = co.datasets.init_dataset('dset2', prototype=arr2)
dset2['arr2'] = arr2
co.commit()
co.close()


print(test_hangar_path2, repo2._repo_path)
print(test_hangar_path2 == repo2._repo_path)

Not able to init hangar

hangar init threw the error

Traceback (most recent call last):
  File "/usr/local/bin/hangar", line 11, in <module>
    load_entry_point('hangar', 'console_scripts', 'hangar')()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2693, in load_entry_point
    return ep.load()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2324, in load
    return self.resolve()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2330, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/ameen4455/hangar-py/src/hangar/cli/__init__.py", line 1, in <module>
    from .cli import main
  File "/home/ameen4455/hangar-py/src/hangar/cli/cli.py", line 509, in <module>
    @click.option('--limit', default=30, help='limit the amount of records displayed before truncation')
  File "/usr/lib/python3/dist-packages/click/core.py", line 1163, in decorator
    cmd = command(*args, **kwargs)(f)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 115, in decorator
    cmd = _make_command(f, name, attrs, cls)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 89, in _make_command
    callback=f, params=params, **attrs)
TypeError: __init__() got an unexpected keyword argument 'hidden'

from python terminal
repo.init('myrepo', user_name='username', user_email='[email protected]') threw the error

Hangar Repo initialized at: ./.hangar
Traceback (most recent call last):
  File "/home/ameen4455/hangar-py/src/hangar/records/heads.py", line 220, in create_branch
    success = branchtxn.put(branchHeadKey, branchHeadVal, overwrite=False)
lmdb.CorruptedError: mdb_put: MDB_CORRUPTED: Located page was wrong type

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ameen4455/hangar-py/src/hangar/repository.py", line 321, in init
    user_name=user_name, user_email=user_email, remove_old=remove_old)
  File "/home/ameen4455/hangar-py/src/hangar/context.py", line 286, in _init_repo
    heads.create_branch(self.branchenv, 'master', '')
  File "/home/ameen4455/hangar-py/src/hangar/records/heads.py", line 226, in create_branch
    TxnRegister().commit_writer_txn(branchenv)
  File "/home/ameen4455/hangar-py/src/hangar/context.py", line 118, in commit_writer_txn
    self.WriterTxn[lmdbenv].commit()
lmdb.BadTxnError: mdb_txn_commit: MDB_BAD_TXN: Transaction must abort, has a child, or is invalid

[FEATURE REQUEST] Importers that consume tf or pytorch Dataset, and can produce identical Datasets after

Is your feature request related to a problem? Please describe.
Existing DL projects are already going to have a data pipeline. Often, these are going to result in tf or pytorch Datasets. Having to replace all of the existing mechanisms by which those datasets are created and managed with hangar-specific code is a barrier to adoption.

Describe the solution you'd like
I think that import routines should be able to consume a third-party Dataset instance, inspect it, and store the relevant data in hangar. The intent would be that hangar could then produce a functionally identical Dataset for use in training, but without having to go back to the raw data.

This would change hangar adoption best practice from "replace your data pipeline" to "just insert this mostly-transparent step in the middle." Once the project is fully committed to using hangar, then the architecture can be revisited, if needed.
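A rough sketch of such an importer for a map-style PyTorch-like Dataset that yields single arrays; the function name, the dataset naming, and the single-array assumption are all placeholders:

import numpy as np

def import_torch_dataset(co, torch_dset, dset_name='imported'):
    """Walk a torch-style Dataset once and stage every sample into a hangar dataset."""
    first = np.asarray(torch_dset[0])
    dset = co.datasets.init_dataset(dset_name, prototype=first)
    with dset:
        for i in range(len(torch_dset)):
            dset[str(i)] = np.asarray(torch_dset[i])
    return dset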

It would be nice if it could also vacuum up a dir tree of .tfrecord files, but that's a little less well-defined.

Describe alternatives you've considered
N/A

Additional context
N/A

[FEATURE REQUEST] Simple Command Line Interface ~~ hang ~~

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when

I want to use hangar (hang) at the terminal!

Describe the solution you'd like

I would like

~> hang commit -m "update the training set"
~> hang merge <branches-of-interest>

Describe alternatives you've considered

Additional context

There are many simple CLI libraries for Python; even the base CLI tooling should be sufficient and not require any dependencies. I have no issue completing this myself in a new branch; most likely I could do so this afternoon with mild coaching. Is there a Slack or Discord I should be aware of?

[BUG REPORT] OSError: symbolic link privilege not held

Describe the bug
The following error is thrown after the developer setup of Hangar on a Windows machine, while trying to assign a dummy image to a repo arrayset: "OSError: symbolic link privilege not held"

Unexpected Behavior, Exceptions or Error Thrown

To Reproduce
Steps to reproduce the behavior, minimal example code:

import hangar
import numpy as np
repo = hangar.Repository('.')
co = repo.checkout(write=True)
co.arraysets.init_arrayset('images', shape=(10,10,3),dtype=np.uint8)
co.commit('arrayset init')
images = co.arraysets['images']
img = np.random.random((10,10,3)).astype(np.uint8)
images[0] = img

images[0] = img
Traceback (most recent call last):
File "", line 1, in
File "c:\users\naren\documents\github\hangar-py\src\hangar\arrayset.py", line 502, in setitem
self.add(value, key)
File "c:\users\naren\documents\github\hangar-py\src\hangar\arrayset.py", line 624, in add
raise e from None
File "c:\users\naren\documents\github\hangar-py\src\hangar\arrayset.py", line 621, in add
raise ValueError(isCompat.reason)
ValueError: data shape: (10, 3, 3) != fixed aset shape: (10, 10, 3)
img = np.random.random((10,10,3)).astype(np.uint8)
images[0] = img
Traceback (most recent call last):
File "", line 1, in
File "c:\users\naren\documents\github\hangar-py\src\hangar\arrayset.py", line 502, in setitem
self.add(value, key)
File "c:\users\naren\documents\github\hangar-py\src\hangar\arrayset.py", line 648, in add
hashVal = self._fs[self._dflt_backend].write_data(data)
File "c:\users\naren\documents\github\hangar-py\src\hangar\backends\hdf5_00.py", line 677, in write_data
self._create_schema(remote_operation=remote_operation)
File "c:\users\naren\documents\github\hangar-py\src\hangar\backends\hdf5_00.py", line 553, in _create_schema
symlink_rel(file_path, symlink_file_path)
File "c:\users\naren\documents\github\hangar-py\src\hangar\utils.py", line 114, in symlink_rel
os.symlink(rel_path_src, dst, target_is_directory=is_dir)
OSError: symbolic link privilege not held

Expected behavior
No errors are expected; the data should be assigned to the arrayset.

Desktop (please complete the following information):

  • OS: Windows 10
  • Python: 3.7.5
  • Hangar: latest code
