tensorwerk / hangar-py
Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.
License: Apache License 2.0
I think we should at least give the option of cleaning up the singletons (either a specific one or all of them). In general I'm not 100% sure why we need a singleton (or something like it) design-wise, can you explain in a one-liner?
Originally posted by @lantiga in #20 (comment)
Describe the bug
A clear and concise description of what the bug is.
Test cases are needed for the changes introduced in #32.
Severity
Select an option:
N/A
To Reproduce
Steps to reproduce the behavior, minimal example code preferred:
Run the tests
Expected behavior
A clear and concise description of what you expected to happen.
More thorough testing.
Screenshots
If applicable, add screenshots to help explain your problem.
N/A
Desktop (please complete the following information):
Additional context
Add any other context about the problem here.
See #32 for changes introduced
Is your feature request related to a problem? Please describe.
We need more options in our CLI to make a few workflows seamless. Two examples come to mind; one is `hangar import`: without CLI support, users will need to spin up a Python program and use the Python APIs to commit the staged data.
Is your feature request related to a problem? Please describe.
Currently, we can't delete data from history once it is committed. In cases where the end user triggers a data deletion request, organizations that use hangar for data storage and versioning will have to delete it from everywhere, including history.
While trying to remove the data with `.remove(key)`, hangar raises `AttributeError: 'DatasetDataWriter' object has no attribute 'TxnRegister'`.
Is your feature request related to a problem? Please describe.
Allow binary model storage for trained models
Describe the solution you'd like
A clear and concise description of what you want to happen:
Is your feature request related to a problem? Please describe.
I want to know how to accomplish the following. It doesn't need to require zero effort on the user's part, but there needs to be a clear best-hangar-practices path to a workable setup.
Take a source data format that is complex (e.g. DICOM, JPEG) and infeasible to reconstitute bit-exact from the tensor+metadata form. Each instance of the raw data produces a sample that consists of 2 (or more) tensors; an image tensor and a 1D tensor that encodes things like lat/long or age/sex/etc. (to be concatenated with the output of the convolutional layers prior to the fully connected layers). To be clear, this is intended to be an illustrative example, not a concrete use case.
Per my reading of the docs, right now these two tensors wouldn't qualify as being in the same hangar dataset (it's not clear if that's problematic or not).
Let's express the above conversion as: `f_v1(raw) -> (t1, t2)`.
Users will need to:
- … `f_v2` and repopulate `t1` and `t2`.
- … `f_v3`, which outputs `(t1, t2, t3)`.
- … `t1` and `t2`.
- … `t1` and `t2` for training/validation (including when training is randomized).
Describe the solution you'd like
I think that changing the definition of a sample to be a tuple of binary blobs plus a tuple of tensors plus metadata would work, but I haven't considered the potential impacts from that kind of change. Seems potentially large.
Describe alternatives you've considered
Another option would be to have separate datasets for `t1` and `t2` and combine them manually, plus manage the binary blobs separately. That seems like a lot of infra work, and might be at risk of having drift between the samples themselves, and with the blobs.
Additional context
I suspect that I want/expect Hangar to solve a larger slice of the problem than it's intended to, but it's not clear at first glance what the intended approach would be for more complicated setups like the above.
Describe the bug
The function `RecordQuery._traverse_dataset_data_records(self, dataset_name)` has a critical bug in the condition check that verifies we are looking at the provided dataset (the same applies to the `_traverse_dataset_schema_records` function, since the logic there is similar). If the repository has multiple datasets where one dataset's name is a prefix of another's, the condition check `dataRecKey.startswith(startDatasetRecordCountRangeKey)` returns the wrong data.
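The collision can be avoided by including the record-key separator in the tested prefix. A minimal sketch of the problem and fix (the `:` separator and `belongs_to` helper are assumptions for illustration; the real key layout lives in the record-parsing code):

```python
# 'MNISTLabel...' keys also start with the bare prefix 'MNIST', so a
# naive startswith() check matches records from both datasets.
SEP = ':'  # hypothetical separator between dataset name and sample name

def belongs_to(data_rec_key, dataset_name):
    # Appending the separator means 'MNIST' is no longer a prefix of
    # 'MNISTLabel:...' keys, only of 'MNIST:...' keys.
    return data_rec_key.startswith(dataset_name + SEP)

print('MNISTLabel:0'.startswith('MNIST'))   # True  (the bug)
print(belongs_to('MNISTLabel:0', 'MNIST'))  # False (the fix)
```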
To Reproduce
First, a code snippet to save two datasets named `MNIST` and `MNISTLabel` in the same repository:
import os
from hangar import Repository
import torchvision
import numpy as np
data = torchvision.datasets.MNIST('.', download=True)
path = os.path.dirname(os.path.abspath(__file__))
repo = Repository(path=path)
repo.init(user_name='hhsecond', user_email='[email protected]', remove_old=True)
co = repo.checkout(write=True)
co.datasets.init_dataset('MNIST', shape=(28, 28), dtype=np.uint8)
co.datasets.init_dataset('MNISTLabel', shape=(1,), dtype=np.int64)
mnist_dset = co.datasets['MNIST']
label_dset = co.datasets['MNISTLabel']
with mnist_dset, label_dset:
    for i, (image, label) in enumerate(data):
        print(i)
        mnist_dset[str(i)] = np.array(image)
        label_dset[str(i)] = np.array([label])
co.commit('0.1')
co.close()
The second part, where we try to read the `MNIST` dataset:
co = repo.checkout()
mnist = co.datasets['MNIST']
mnistlabel = co.datasets['MNISTLabel']
print(len(mnist))
Calling `len` on a dataset triggers the functions above, which causes the failure.
Expected behavior
`_traverse_dataset_data_records` should return a dictionary of data records with name as key and hash as value.
We need to have a docker image for the server instance. I'll try to work on it over the weekend.
Both the cases raise `AttributeError` now instead of `ValueError`.
@hhsecond, Tensorflow 2.0 has been released. https://www.tensorflow.org/install
Are there any changes which will prevent us from supporting both 1.14 as well as 2.0?
Describe the bug
The current pip install is missing the `google` requirement. `import hangar` fails because `google` is missing.
Add a docstring to the bulk `add` function at:
hangar-py/src/hangar/dataset.py, line 743 in 4e5ec90
`repo.remove_branch(branch_name)` returns `NotImplemented`.
The way we are handling the exception is not user-friendly and has to be changed. This issue is to discuss it further. After thinking about it for some time, I think it's better to raise an exception rather than returning the exception object, as Rick suggested. Here are the cases I have found so far where we are not raising the exception but instead pushing it to stderr/stdout. I'll add more as I find them.
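The difference between the two patterns can be sketched in isolation (the `BranchError` class and both function names below are hypothetical, for illustration only):

```python
class BranchError(Exception):
    pass

# Current pattern: return the exception object. Callers can ignore
# the failure without ever noticing it happened.
def remove_branch_returning(name, branches):
    if name not in branches:
        return BranchError(f'branch does not exist: {name}')
    branches.remove(name)

# Proposed pattern: raise. The failure cannot pass silently; the
# caller must handle it or the program stops.
def remove_branch_raising(name, branches):
    if name not in branches:
        raise BranchError(f'branch does not exist: {name}')
    branches.remove(name)

err = remove_branch_returning('foo', ['master'])
print(isinstance(err, BranchError))  # True, but nothing forced us to check
```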
`import hangar` threw the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/varun/hangar-py/src/hangar/__init__.py", line 5, in <module>
    from .remote.server import serve
  File "/home/varun/hangar-py/src/hangar/remote/server.py", line 18, in <module>
    from . import chunks
  File "/home/varun/hangar-py/src/hangar/remote/chunks.py", line 6, in <module>
    from . import hangar_service_pb2
  File "/home/varun/hangar-py/src/hangar/remote/hangar_service_pb2.py", line 23, in <module>
TypeError: __new__() got an unexpected keyword argument
after running sudo python3 setup.py develop
Python Version - 3.6.7
Environment
varun@pop-os
OS: Ubuntu 18.04 bionic (pop-os)
Kernel: x86_64 Linux 4.18.0-16-generic
Uptime: 1h 12m
Packages: 3599
Shell: bash 4.4.19
Resolution: 1920x1080
DE: KDE 5.44.0 / Plasma 5.12.7
WM: KWin
WM Theme: Breeze
GTK Theme: Breeze [GTK2/3]
Icon Theme: breeze
Font: Noto Sans Regular
CPU: Intel Core i7-8750H @ 12x 4.1GHz [57.0°C]
GPU: intel
RAM: 2968MiB / 7817MiB
Also not able to run `hangar init`, as it resulted in:
raise DistributionNotFound(req, requirers) pkg_resources.DistributionNotFound: The 'lmdb<=0.96,>=0.94' distribution was not found and is required by hangar
After installing lmdb 0.95, it still threw the same error.
Describe the bug
With the new weakref logic, hangar's user-facing objects such as dataset/metadata are ideally weak references to the actual object, which gets removed on `checkout.close()`. But since we are rewriting `self._datasets` & `self._metadata` here:
hangar-py/src/hangar/checkout.py, lines 336 to 341 in deeeafc
the references are invalidated on `commit` as well (not only on `close`), since `self.__setup()` gets called on `commit` too:
hangar-py/src/hangar/checkout.py, line 387 in deeeafc
To Reproduce
co = repo.checkout(write=True)
dset = co.datasets.init_dataset('dset', prototype=array5by7)
dset.add(array5by7, '1')
co.commit()
dset.add(array5by7, '2') # this raises ReferenceError since the reference to dset is gone
co.close()
Expected behavior
User should be able to add data after the commit. References should be released only after the close. Perhaps we should split the `__setup()` function into two, each with a very specific responsibility.
Describe the bug
If the shape of the dataset is empty, i.e. `shape=()` or `prototype=np.array(1)`, hangar is not designed to handle it properly. Instead, it throws an error while committing: `ValueError: invalid literal for int() with base 10: ''`
To Reproduce
co = repo.checkout(write=True)
arr = np.array(1)
dset = co.datasets.init_dataset('dset2', prototype=arr)
dset['1'] = arr
co.commit()
co.close()
Expected behavior
It should work like any other array with non-empty dimensions.
@rlizzo I was thinking of an API like `hangar.dataloaders.pytorch` (let me know if you have another structure in mind). Basically, the idea is to load data in batches synchronously or asynchronously and enable the features of the PyTorch DataLoader. My plan is to have a Dataset class and a DataLoader class, which is essentially the way PyTorch's data loaders work.
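A rough sketch of the shape such a wrapper could take. Torch's map-style `Dataset` protocol only requires `__len__` and `__getitem__`, so the class below is written without importing torch; the class name and constructor arguments are assumptions, not the final API:

```python
class HangarTorchDataset:
    """Map-style dataset over a hangar dataset.

    Any object exposing __len__ and __getitem__ can be fed to
    torch.utils.data.DataLoader, which then handles batching and
    synchronous or asynchronous loading via num_workers.
    """

    def __init__(self, hangar_dset, keys=None):
        # hangar_dset: any mapping-like dataset object; keys lets the
        # caller restrict or reorder the samples used for loading
        self._dset = hangar_dset
        self._keys = list(keys) if keys is not None else list(hangar_dset.keys())

    def __len__(self):
        return len(self._keys)

    def __getitem__(self, index):
        return self._dset[self._keys[index]]
```

With torch installed, `DataLoader(HangarTorchDataset(dset), batch_size=16, num_workers=2)` would then drive the batching.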
Describe the bug
If we try to commit inside the context manager (before `__exit__()`), hangar throws a `RuntimeError` saying `No changes made in the staging area. Cannot commit.`. We should allow the user to commit inside the context manager IMO, but probably with a warning about the performance hit.
To Reproduce
import numpy as np
from hangar import Repository
repo = Repository(path='myhangarrepo')
repo.init(user_name='Sherin Thomas', user_email='[email protected]', remove_old=True)
# generate data
data = []
for i in range(1000):
data.append(np.random.rand(28, 28))
data = np.array(data)
co = repo.checkout(write=True)
data_dset = co.datasets.init_dataset('mnist_data', prototype=data[0])
co.commit('datasets init')
co.close()
co = repo.checkout(write=True)
data_dset = co.datasets['mnist_data']
with data_dset:
    for i in range(len(data)):
        sample_name = str(i)
        data_dset[sample_name] = data[i]
    co.commit('dataset curation: stage 1')  # this throws the error
co.close()
Expected behavior
It should not break the program; instead, it should raise a warning about the performance hit.
The `python setup.py develop` command failed with the error message below.
Best match: blosc 1.8.1
Processing blosc-1.8.1.tar.gz
Writing /var/folders/yl/9l3dn2s53wdd4__5v1p1cwbm0000gn/T/easy_install-b7znieg5/blosc-1.8.1/setup.cfg
Running blosc-1.8.1/setup.py -q bdist_egg --dist-dir /var/folders/yl/9l3dn2s53wdd4__5v1p1cwbm0000gn/T/easy_install-b7znieg5/blosc-1.8.1/egg-dist-tmp-6xli48d_
SSE2 detected
warning: no files found matching '*.hpp' under directory 'c-blosc'
c-blosc/internal-complibs/zlib-1.2.8/gzread.c:30:15: warning: implicit declaration of function 'read'
is invalid in C99 [-Wimplicit-function-declaration]
ret = read(state->fd, buf + *have, len - *have);
^
c-blosc/internal-complibs/zlib-1.2.8/gzread.c:30:15: warning: this function declaration is not a
prototype [-Wstrict-prototypes]
c-blosc/internal-complibs/zlib-1.2.8/gzread.c:591:11: warning: implicit declaration of function
'close' is invalid in C99 [-Wimplicit-function-declaration]
ret = close(state->fd);
^
c-blosc/internal-complibs/zlib-1.2.8/gzread.c:591:11: warning: this function declaration is not a
prototype [-Wstrict-prototypes]
4 warnings generated.
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:256:24: warning: implicit declaration of function
'lseek' is invalid in C99 [-Wimplicit-function-declaration]
state->start = LSEEK(state->fd, 0, SEEK_CUR);
^
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:14:17: note: expanded from macro 'LSEEK'
#define LSEEK lseek
^
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:256:24: warning: this function declaration is not a
prototype [-Wstrict-prototypes]
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:14:17: note: expanded from macro 'LSEEK'
#define LSEEK lseek
^
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:355:9: warning: implicit declaration of function 'lseek'
is invalid in C99 [-Wimplicit-function-declaration]
if (LSEEK(state->fd, state->start, SEEK_SET) == -1)
^
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:14:17: note: expanded from macro 'LSEEK'
#define LSEEK lseek
^
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:396:15: warning: implicit declaration of function
'lseek' is invalid in C99 [-Wimplicit-function-declaration]
ret = LSEEK(state->fd, offset - state->x.have, SEEK_CUR);
^
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:14:17: note: expanded from macro 'LSEEK'
#define LSEEK lseek
^
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:492:14: warning: implicit declaration of function
'lseek' is invalid in C99 [-Wimplicit-function-declaration]
offset = LSEEK(state->fd, 0, SEEK_CUR);
^
c-blosc/internal-complibs/zlib-1.2.8/gzlib.c:14:17: note: expanded from macro 'LSEEK'
#define LSEEK lseek
^
5 warnings generated.
c-blosc/internal-complibs/zlib-1.2.8/gzwrite.c:84:15: warning: implicit declaration of function
'write' is invalid in C99 [-Wimplicit-function-declaration]
got = write(state->fd, strm->next_in, strm->avail_in);
^
c-blosc/internal-complibs/zlib-1.2.8/gzwrite.c:84:15: warning: this function declaration is not a
prototype [-Wstrict-prototypes]
c-blosc/internal-complibs/zlib-1.2.8/gzwrite.c:101:33: warning: implicit declaration of function
'write' is invalid in C99 [-Wimplicit-function-declaration]
if (have && ((got = write(state->fd, state->x.next, have)) < 0 ||
^
c-blosc/internal-complibs/zlib-1.2.8/gzwrite.c:573:9: warning: implicit declaration of function
'close' is invalid in C99 [-Wimplicit-function-declaration]
if (close(state->fd) == -1)
^
c-blosc/internal-complibs/zlib-1.2.8/gzwrite.c:573:9: warning: this function declaration is not a
prototype [-Wstrict-prototypes]
5 warnings generated.
In file included from c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:27:
c-blosc/internal-complibs/zstd-1.3.8/compress/zstd_compress_internal.h:190:5: error: unknown type
name 'ZSTD_dictAttachPref_e'
ZSTD_dictAttachPref_e attachDictPref;
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:675:85: error: use of undeclared
identifier 'ZSTD_c_forceMaxWindow'; did you mean 'ZSTD_p_forceMaxWindow'?
...const forceWindowError = ZSTD_CCtxParam_setParameter(&jobParams, ZSTD_c_forceMaxWindow, !job-...
^~~~~~~~~~~~~~~~~~~~~
ZSTD_p_forceMaxWindow
//anaconda3/include/zstd.h:1110:5: note: 'ZSTD_p_forceMaxWindow' declared here
ZSTD_p_forceMaxWindow=1100, /* Force back-reference distances to remain < windowSize,
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:999:21: error: use of undeclared
identifier 'ZSTD_OVERLAPLOG_MIN'
if (value < ZSTD_OVERLAPLOG_MIN) value = ZSTD_OVERLAPLOG_MIN;
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:999:50: error: use of undeclared
identifier 'ZSTD_OVERLAPLOG_MIN'
if (value < ZSTD_OVERLAPLOG_MIN) value = ZSTD_OVERLAPLOG_MIN;
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:1000:21: error: use of undeclared
identifier 'ZSTD_OVERLAPLOG_MAX'
if (value > ZSTD_OVERLAPLOG_MAX) value = ZSTD_OVERLAPLOG_MAX;
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:1000:50: error: use of undeclared
identifier 'ZSTD_OVERLAPLOG_MAX'
if (value > ZSTD_OVERLAPLOG_MAX) value = ZSTD_OVERLAPLOG_MAX;
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:1171:14: error: use of undeclared
identifier 'ZSTD_btultra2'; did you mean 'ZSTD_btultra'?
case ZSTD_btultra2:
^~~~~~~~~~~~~
ZSTD_btultra
//anaconda3/include/zstd.h:450:42: note: 'ZSTD_btultra' declared here
ZSTD_btlazy2, ZSTD_btopt, ZSTD_btultra } ZSTD_strategy; /* from faster to st...
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:1173:14: error: duplicate case value
'ZSTD_btultra'
case ZSTD_btultra:
^
c-blosc/internal-complibs/zstd-1.3.8/compress/zstdmt_compress.c:1171:14: note: previous case defined
here
case ZSTD_btultra2:
^
8 errors generated.
error: Setup script exited with error: command 'gcc' failed with exit status 1
High level demo for users.
Note: The feature as detailed below was initially proposed by @lantiga in an offline phone call.
Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
In order to improve the command line experience, Hangar should provide built in tools to import common data file types stored in a directory into the staging area. In addition, in order to ease the user transition from domain-specific file formats to array based storage, a utility should be included which exports samples in a dataset into common file types.
Describe the solution you'd like
Add commands to the CLI which mimic the following workflow (exact syntax and names are not super important here, just the idea of the workflow)
Importing
$ hangar branch --create foo
$ hangar checkout foo # write enabled checkout which can modify the staging area
$ hangar dataset create --name images --dtype uint8 --shape 500 500 3 --variable-shape ...
$ hangar dataset import --name images --directory ~/data/images/ # directory contains image files.
$ hangar commit -m "added data"
Exporting
$ hangar dataset export --commit foo_hash --name images --directory ~/data/images/ --format jpg
Notes:
It's important that the implementation of these utilities is not directly integrated or included in the hangar core (either in backend data IO processing, record reference parsing/storage/interaction, or the user-facing programmatic API). Instead, these utilities should live in a special module (isolated from the hangar core) where any relevant IO or data-file processing / transformation methods are defined. This module would transform the files as necessary, and then make the standard calls into the user-facing API to interact with Hangar.
A few file types we could natively support would be:
- `jpg`, `png`, `tiff`, ... (other standard image files)
- `csv`
- `npy` / `npz`
Additional context
Add any other context or screenshots about the feature request here.
While the data import feature would be a very useful tool with many uses, use of the export function as a crutch (instead of just reading data directly as arrays via the API) should not be encouraged; that type of use nullifies some key benefits of hangar from the start.
The export tool does, however, have really interesting potential to provide human-understandable diffs of the data changed between two branches or commits. I'm not quite sure how yet, as the diff interface is still really evolving, but there are interesting possibilities here.
Tagging @rgalbo for reference in relation to the issue opened in #71
Is your feature request related to a problem? Please describe.
Sample additions become more intuitive for a beginner if we enable sample names to be integers.
shape = (0, 1, 2)
co.datasets.init_dataset('dset', shape=shape, dtype=np.int)
This should raise a sensible error; instead, the above raises a `ZeroDivisionError`.
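A sensible error could be produced by validating the schema shape up front. A sketch of such a check (`validate_schema_shape` is a hypothetical helper, not part of the hangar API):

```python
import numbers

def validate_schema_shape(shape):
    # Hypothetical up-front validation for init_dataset: every
    # dimension must be a positive integer, so a shape like (0, 1, 2)
    # fails with a clear message instead of a ZeroDivisionError
    # surfacing from deeper inside the library.
    for dim in shape:
        if not isinstance(dim, numbers.Integral) or dim <= 0:
            raise ValueError(f'invalid dimension {dim!r} in shape {shape!r}')
    return tuple(shape)
```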
Describe the bug
The pytorch dataloader cannot currently be run with multiprocess workers:
>>> torch_dset = make_torch_dataset(aset, index_range=slice(1, 100))
>>> loader = DataLoader(torch_dset, batch_size=16, num_workers=2)
>>> for batch in loader:
... train_model(batch)
Exception: Cannot pickle `hangar.dataloaders.TorchDataset.BatchTuple`
This is because the `BatchTuple` wrappers passed to `hangar.dataloaders.TorchDataset` are dynamically defined namedtuple classes, whose definition is not appropriately scoped for a forked subprocess to introspect its name/contents upon pickling.
hangar-py/src/hangar/dataloaders/torchloader.py
Lines 63 to 74 in e2c7a89
As the structure needs to be dynamically defined based on user arguments, we cannot just place the BatchTuple
definition in the main body of the module.
@lantiga and @hhsecond, let me know what you prefer, or any other solutions you might have.
1) Keep the current return type exactly the same, but add the definition of `BatchTuple` to `globals()` before it is passed to `TorchLoader`:
if field_names:
    wrapper = namedtuple('BatchTuple', field_names=field_names)
else:
    wrapper = namedtuple('BatchTuple', field_names=gasets.arrayset_names, rename=True)
globals()['BatchTuple'] = wrapper
return TorchDataset(gasets.arrayset_array, gasets.sample_names, wrapper)
This works, but it is generally bad practice to manually modify the global scope.
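Option 1 can be demonstrated in isolation. The standalone sketch below (not hangar code) shows why a function-local namedtuple fails to pickle, and how binding it at module scope repairs the lookup:

```python
import pickle
from collections import namedtuple

def make_wrapper(field_names):
    # Dynamically defined class: pickle serializes classes by
    # reference, looking up a module-level attribute named
    # 'BatchTuple', which does not exist yet.
    return namedtuple('BatchTuple', field_names)

Wrapper = make_wrapper(['image', 'label'])
sample = Wrapper(image=1, label=2)

try:
    pickle.dumps(sample)
    print('pickled fine')
except (pickle.PicklingError, AttributeError) as exc:
    print('pickling failed:', exc)

# The workaround from option 1: bind the class at module scope so
# forked DataLoader workers can resolve it by name.
globals()['BatchTuple'] = Wrapper
roundtrip = pickle.loads(pickle.dumps(sample))
print(roundtrip)  # BatchTuple(image=1, label=2)
```

Note that this only helps forked workers because they inherit the parent's module state; a spawned worker would re-import the module and the dynamic registration would be gone.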
2) Return a `dict` of `field_names` and tensors instead of a `namedtuple`:
if field_names:
    wrapper = tuple(field_names)
else:
    wrapper = tuple(gasets.arrayset_names)
return TorchDataset(gasets.arrayset_array, gasets.sample_names, wrapper)
And in `TorchDataset`, replace the return statement with:
return dict(zip(self.wrapper, out))
This still works and does not modify `globals()`, but it changes the output of the function to something "not quite as nice" as a `namedtuple`.
Is your feature request related to a problem? Please describe.
`checkout.commit()` and `checkout.reset_staging_area()` methods allow correct behavior on previously checked out objects after close. The fix for `checkout.reset_staging_area()` defaults to permanently `close()` all objects when it is called. This behavior is undesirable, but convenient for the fix that was presented.
Describe the solution you'd like
When `checkout.reset_staging_area()` is called, the reset should be performed, and any dataset objects referencing datasets which existed in the CLEAN staging area should continue to work, while any which were created in the DIRTY state should be invalidated.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
`close()` the checkout entirely.
Additional context
It would be good to have an option to set a repository description on `repo.init`.
After closing the checkout, the dataset object should not be operable. We need to decide which operations should remain possible on the dataset object, though. So far, we know:
co = repo.checkout(write=True)
dset = co.datasets.init_dataset('dset', prototype=array5by7)
co.commit()
co.close()
co = repo.checkout(write=True) # without opening the checkout again, user won't be able to do the addition even now
dset.add(array5by7, '1') # user can add data to dset object
co = repo.checkout(write=True)
dset = co.datasets.init_dataset('dset', prototype=array5by7)
dset.add(array5by7, '1')
co.commit()
co.close()
print(dset['1']) # user can fetch data from dset object
These issues are part of #25
With the latest release, create a conda forge package for hangar
Is your feature request related to a problem? Please describe.
PyTorch has an IterableDataset in the upcoming release which we can plug into Hangar. Keeping this issue to track the PyTorch release and integrate it when it is time
The lock-held check at diff.py (line 237 in 044937c) should verify whether the lock is held or not. Execution never reaches there.
Describe the bug
The message `HANGAR RUNTIME WARNING: no repository exists at /some/path/__hangar, please use init_repo function` makes me think my script is doing something wrong, even though the next thing that I do is call `repo.init()`.
Additionally, if the path does not exist, there should be an option to have it be created.
A single param defaulting to `exists=True` could handle both cases, creating the directory and suppressing the init warning when set to `False`.
Making checkout a context manager lets users avoid writing an explicit `close()`, and we can make sure the lock is released. Keeping the current functionality along with this new ability should be the way to go.
with repo.checkout(write=True) as co:
    co.datasets['dset'] = array
    co.commit()
We could either raise a warning if the context manager exits without a commit, or we could even commit implicitly (by having another argument `autocommit=True` in the `checkout()` call).
Currently, we need to explicitly call `close()` for all the write-enabled checkouts.
I could take it up if this doesn't raise any other concerns.
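The basic mechanics can be sketched with `contextlib` (a standalone sketch; `repo` here is any object whose `checkout()` returns something with a `close()` method, not the hangar implementation):

```python
from contextlib import contextmanager

@contextmanager
def checkout(repo, write=False):
    # Open the checkout, hand it to the caller, and guarantee close()
    # (and therefore lock release) even if the block raises.
    co = repo.checkout(write=write)
    try:
        yield co
    finally:
        co.close()
```

Usage would then be `with checkout(repo, write=True) as co: ...`, with `close()` happening automatically on block exit.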
Is your feature request related to a problem? Please describe.
Currently, we don't have time consumption reports generated for different operations such as data addition (and import through the plugin), etc.
Describe the solution you'd like
Figure out which operations need to emit this info, and print it to the UI showing the time taken by the different modules.
Suggestion from @lantiga in the weekly
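One lightweight way to collect such timings is a decorator around the operations of interest (a generic sketch, not tied to hangar internals; `add_samples` is a stand-in function):

```python
import functools
import time

def report_time(fn):
    """Print how long the wrapped operation took."""
    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f'{fn.__qualname__} took {elapsed:.4f}s')
        return result
    return wrapped

@report_time
def add_samples(n):
    # stand-in for a data-addition operation
    return sum(range(n))
```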
Describe the bug
Most (if not all) operations on a non-initialized repository throw exceptions like `AttributeError`. They should either return a proper error message or an empty object, at least for operations like `.log()` / `.summary()`, etc.
To Reproduce
repo = Repository(path=path_to_non_initialized_repository)
repo.status() # AttributeError: 'NoneType' object has no attribute 'begin'
Expected behavior
Either a proper exception or a sensible warning
In the same Python session, if we initialize a dataset twice and then save data with the same shape and type as the first instance, we won't get the data that we saved in the second instance. Instead, we'll get the data from the first instance. Here is the script to reproduce.
import numpy as np
from hangar.repository import Repository
test_hangar_path1 = '/home/hhsecond/mypro/test_hangar_dir_1'
arr1 = np.random.random((5, 7))
repo1 = Repository(path=test_hangar_path1)
repo1.init(user_name='tester1', user_email='[email protected]', remove_old=True)
co = repo1.checkout(write=True)
dset1 = co.datasets.init_dataset('dset1', prototype=arr1)
dset1['arr1'] = arr1
co.commit()
co.close()
test_hangar_path2 = '/home/hhsecond/mypro/test_hangar_dir_2'
arr2 = np.random.random((5, 7))
repo2 = Repository(path=test_hangar_path2)
repo2.init(user_name='tester2', user_email='[email protected]', remove_old=True)
co = repo2.checkout(write=True)
dset2 = co.datasets.init_dataset('dset2', prototype=arr2)
dset2['arr2'] = arr2
co.commit()
co.close()
co = repo2.checkout()
print(arr2[1], co.datasets['dset2']['arr2'][1])
print((arr2 == co.datasets['dset2']['arr2']).all())
Describe the bug
Ability to add arrays with fewer dimensions (and not just smaller dimension sizes) to an arrayset with a `variable_size=True` schema.
Though the current implementation does not support this, it won't allow records to be added in this fashion, instead forcing a `ValueError` to be raised. Basically, it's a bug, but one we have addressed by stopping execution which would violate the limitations of the slice generator.
To Reproduce
>>> co = repo.checkout(write=True)
>>> aset = co.arraysets.init_arrayset('foo', shape=(10, 10, 10), dtype=np.uint8, variable=True)
>>> arr = np.random.randn(10, 10).astype(np.uint8)
>>> aset[1] = arr
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-dc4fc55d87c3> in <module>
----> 1 aset[1] = arr
~/projects/tensorwerk/hangar/hangar-py/src/hangar/arrayset.py in __setitem__(self, key, value)
484 sample name of the stored data (assuming operation was successful)
485 """
--> 486 self.add(value, key)
487 return key
488
~/projects/tensorwerk/hangar/hangar-py/src/hangar/arrayset.py in add(self, data, name, **kwargs)
588
589 except ValueError as e:
--> 590 raise e from None
591
592 # --------------------- add data to storage backend -------------------
~/projects/tensorwerk/hangar/hangar-py/src/hangar/arrayset.py in add(self, data, name, **kwargs)
576 if data.ndim != len(self._schema_max_shape):
577 raise ValueError(
--> 578 f'`data` rank: {data.ndim} != aset rank: {len(self._schema_max_shape)}')
579 for dDimSize, schDimSize in zip(data.shape, self._schema_max_shape):
580 if dDimSize > schDimSize:
ValueError: `data` rank: 2 != aset rank: 3
Expected behavior
>>> aset[1] = arr
>>> out = aset[1]
>>> out.shape
(10, 10)
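A relaxed check along the lines of the expected behavior might compare the array's dimensions against the trailing schema dimensions only (`fits_variable_schema` is a hypothetical helper; the real fix would also need to handle record storage and retrieval):

```python
def fits_variable_schema(shape, max_shape):
    # Allow rank lower than the schema rank: an array of shape
    # (10, 10) fits a variable-size schema of (10, 10, 10) by
    # matching against the trailing schema dimensions.
    if len(shape) > len(max_shape):
        return False
    if not shape:
        return True  # a scalar always fits
    return all(d <= m for d, m in zip(shape, max_shape[-len(shape):]))
```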
Remove the need for symlink privileges. This would require some rework of the internal data storage directory organization, but it would be nice to have, as illustrated in #138.
Describe the bug
Python 3.7 test cases are not running; the YAML file only triggers the docs env.
Describe the bug
A clear and concise description of what the bug is.
Move code that carries other licenses (the config handling from dask, and the Git log graph) into individual modules.
Tagging @elistevens as fyi
Describe the bug
`TypeError: 'rdcc_nbytes' is an invalid keyword argument for this function` on saving an image to a Hangar repository.
To Reproduce
import hangar
import numpy as np

repo = hangar.Repository('.')
repo.init(user_name='chandnika', user_email='[email protected]')
co = repo.checkout(write=True)
co.arraysets.init_arrayset('images', shape=(10, 10, 3), dtype=np.uint8)
co.commit('arrayset init')
images = co.arraysets['images']
img = np.random.random((10, 10, 3)).astype(np.uint8)
images[0] = img
Expected behavior
The image should be stored without error. Instead, the following traceback is raised:
TypeError Traceback (most recent call last)
in ()
----> 1 images[0]=img
5 frames
/content/hangar-py/src/hangar/arrayset.py in __setitem__(self, key, value)
500 sample name of the stored data (assuming operation was successful)
501 """
--> 502 self.add(value, key)
503 return key
504
/content/hangar-py/src/hangar/arrayset.py in add(self, data, name, **kwargs)
646 existingHashVal = self._hashTxn.get(hashKey, default=False)
647 if existingHashVal is False:
--> 648 hashVal = self._fs[self._dflt_backend].write_data(data)
649 self._hashTxn.put(hashKey, hashVal)
650 self._stageHashTxn.put(hashKey, hashVal)
/content/hangar-py/src/hangar/backends/hdf5_00.py in write_data(self, array, remote_operation)
675 self._create_schema(remote_operation=remote_operation)
676 else:
--> 677 self._create_schema(remote_operation=remote_operation)
678
679 srcSlc = None
/content/hangar-py/src/hangar/backends/hdf5_00.py in _create_schema(self, remote_operation)
540 rdcc_nbytes=rdcc_nbytes_val,
541 rdcc_w0=CHUNK_RDCC_W0,
--> 542 rdcc_nslots=rdcc_nslots_prime_val)
543 self.w_uid = uid
544 self.hNextPath = 0
/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, **kwds)
309 name = filename_encode(name)
310 with phil:
--> 311 fapl = make_fapl(driver, libver, **kwds)
312 fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
313
/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py in make_fapl(driver, libver, **kwds)
107 msg = "'{key}' is an invalid keyword argument for this function"
108 .format(key=next(iter(kwds)))
--> 109 raise TypeError(msg)
110 return plist
111
TypeError: 'rdcc_nbytes' is an invalid keyword argument for this function
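The `rdcc_nbytes`, `rdcc_w0`, and `rdcc_nslots` keyword arguments appear to have been added to h5py.File only in h5py 2.9.0 (an assumption consistent with the TypeError above), so an older installed h5py raises exactly this error. Below is a minimal, purely illustrative helper showing how a library could drop the unsupported chunk-cache kwargs on old h5py versions; it is not hangar's actual code path:

```python
def supported_cache_kwds(h5py_version, cache_kwds):
    """Filter HDF5 chunk-cache kwargs down to what the installed h5py
    accepts. Assumption: h5py.File grew 'rdcc_nbytes', 'rdcc_w0' and
    'rdcc_nslots' in version 2.9.0; older versions raise TypeError.
    Illustrative sketch only, not hangar's implementation."""
    major, minor = (int(x) for x in h5py_version.split('.')[:2])
    if (major, minor) >= (2, 9):
        return dict(cache_kwds)
    # Older h5py: drop the unsupported cache-tuning kwargs instead of crashing
    return {k: v for k, v in cache_kwds.items()
            if k not in ('rdcc_nbytes', 'rdcc_w0', 'rdcc_nslots')}
```

In practice the simpler fix for a user hitting this is to upgrade h5py; the guard above only shows how the library could degrade gracefully instead of raising.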
Screenshots
N/A
Desktop (please complete the following information):
N/A
Additional context
N/A
I found an issue with initializing the repository again and again in the same Python session, even when the repository paths differ. Because of the singleton nature of the Environment class, regardless of the path we pass for a repository, every subsequent repository ends up with the path of the first one (is this something we want?).
The script below reproduces the issue; the last print statement should print True:
import numpy as np
from hangar.repository import Repository
test_hangar_path1 = '/home/hhsecond/mypro/test_hangar_dir_1'
arr1 = np.random.random((5, 7))
repo1 = Repository(path=test_hangar_path1)
repo1.init(user_name='tester1', user_email='[email protected]', remove_old=True)
co = repo1.checkout(write=True)
dset1 = co.datasets.init_dataset('dset1', prototype=arr1)
dset1['arr1'] = arr1
co.commit()
co.close()
test_hangar_path2 = '/home/hhsecond/mypro/test_hangar_dir_2'
arr2 = np.random.random((5, 7))
repo2 = Repository(path=test_hangar_path2)
repo2.init(user_name='tester2', user_email='[email protected]', remove_old=True)
co = repo2.checkout(write=True)
dset2 = co.datasets.init_dataset('dset2', prototype=arr2)
dset2['arr2'] = arr2
co.commit()
co.close()
print(test_hangar_path2, repo2._repo_path)
print(test_hangar_path2 == repo2._repo_path)
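One way to keep the caching benefit of a singleton while fixing the cross-repository leakage shown above is to key the cached instance on the repository path, so each path gets its own environment object. A hypothetical sketch (class and attribute names are assumptions, not hangar's actual implementation), which also provides the explicit cleanup option requested in the discussion quoted earlier:

```python
class Environments:
    """One cached instance per repository path instead of one global
    singleton. Hypothetical sketch of a possible fix, not hangar's code."""
    _instances = {}

    def __new__(cls, repo_path):
        # Reuse the instance for a known path; create one for a new path
        if repo_path not in cls._instances:
            inst = super().__new__(cls)
            inst.repo_path = repo_path
            cls._instances[repo_path] = inst
        return cls._instances[repo_path]

    @classmethod
    def _clear_instance(cls, repo_path=None):
        """Drop one cached instance, or all of them when no path is given."""
        if repo_path is None:
            cls._instances.clear()
        else:
            cls._instances.pop(repo_path, None)
```

With this shape, two Repository objects opened at different paths would each see their own environment, while repeated opens of the same path still share one instance.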
hangar init
threw the error:
Traceback (most recent call last):
File "/usr/local/bin/hangar", line 11, in <module>
load_entry_point('hangar', 'console_scripts', 'hangar')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2693, in load_entry_point
return ep.load()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2324, in load
return self.resolve()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2330, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/home/ameen4455/hangar-py/src/hangar/cli/__init__.py", line 1, in <module>
from .cli import main
File "/home/ameen4455/hangar-py/src/hangar/cli/cli.py", line 509, in <module>
@click.option('--limit', default=30, help='limit the amount of records displayed before truncation')
File "/usr/lib/python3/dist-packages/click/core.py", line 1163, in decorator
cmd = command(*args, **kwargs)(f)
File "/usr/lib/python3/dist-packages/click/decorators.py", line 115, in decorator
cmd = _make_command(f, name, attrs, cls)
File "/usr/lib/python3/dist-packages/click/decorators.py", line 89, in _make_command
callback=f, params=params, **attrs)
TypeError: __init__() got an unexpected keyword argument 'hidden'
From the Python terminal,
repo.init('myrepo', user_name='username', user_email='[email protected]')
threw the error:
Hangar Repo initialized at: ./.hangar
Traceback (most recent call last):
File "/home/ameen4455/hangar-py/src/hangar/records/heads.py", line 220, in create_branch
success = branchtxn.put(branchHeadKey, branchHeadVal, overwrite=False)
lmdb.CorruptedError: mdb_put: MDB_CORRUPTED: Located page was wrong type
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ameen4455/hangar-py/src/hangar/repository.py", line 321, in init
user_name=user_name, user_email=user_email, remove_old=remove_old)
File "/home/ameen4455/hangar-py/src/hangar/context.py", line 286, in _init_repo
heads.create_branch(self.branchenv, 'master', '')
File "/home/ameen4455/hangar-py/src/hangar/records/heads.py", line 226, in create_branch
TxnRegister().commit_writer_txn(branchenv)
File "/home/ameen4455/hangar-py/src/hangar/context.py", line 118, in commit_writer_txn
self.WriterTxn[lmdbenv].commit()
lmdb.BadTxnError: mdb_txn_commit: MDB_BAD_TXN: Transaction must abort, has a child, or is invalid
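The second traceback (MDB_BAD_TXN) follows from the first: after put() raised MDB_CORRUPTED, the code still attempted to commit the now-invalid writer transaction. The usual defensive pattern is to abort on any exception and commit only on success. Below is a sketch as a context manager; the register method names mirror those in the traceback, but whether hangar exposes an abort_writer_txn with this exact signature is an assumption:

```python
from contextlib import contextmanager

@contextmanager
def writer_txn(register, env):
    """Commit the writer transaction on success, abort it on any error,
    so a failed put() is never followed by commit() (which is what
    produces the MDB_BAD_TXN above). Sketch only; API names are assumed."""
    txn = register.begin_writer_txn(env)
    try:
        yield txn
    except Exception:
        # Tear the transaction down before re-raising the original error
        register.abort_writer_txn(env)
        raise
    else:
        register.commit_writer_txn(env)
```

A create_branch written as `with writer_txn(TxnRegister(), branchenv) as txn: txn.put(...)` would then surface only the original CorruptedError, instead of chaining a BadTxnError on top of it.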
Is your feature request related to a problem? Please describe.
Existing DL projects are already going to have a data pipeline. Often, these are going to result in tf or pytorch Datasets. Having to replace all of the existing mechanisms by which those datasets are created and managed with hangar-specific code is a barrier to adoption.
Describe the solution you'd like
I think that import routines should be able to consume a third-party Dataset instance, inspect it, and store the relevant data in hangar. The intent would be that hangar could then produce a functionally identical Dataset for use in training, but without having to go back to the raw data.
This would change hangar adoption best practice from "replace your data pipeline" to "just insert this mostly-transparent step in the middle." Once the project is fully committed to using hangar, then the architecture can be revisited, if needed.
It would be nice if it could also vacuum up a dir tree of .tfrecord files, but that's a little less well-defined.
Describe alternatives you've considered
N/A
Additional context
N/A
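The import routine described above can be sketched for any map-style dataset, i.e. anything supporting len() and integer indexing, which covers torch.utils.data.Dataset. The checkout/arrayset API below follows the usage shown elsewhere in this issue list; the helper function itself is hypothetical:

```python
import numpy as np

def import_map_style_dataset(co, aset_name, dataset):
    """Walk a map-style dataset and store each sample in a hangar-style
    arrayset on a write-enabled checkout `co`. Hypothetical helper; the
    init_arrayset(prototype=...) call mirrors usage seen in this thread."""
    # Use the first sample as the shape/dtype prototype for the arrayset
    first = np.asarray(dataset[0])
    aset = co.arraysets.init_arrayset(aset_name, prototype=first)
    for idx in range(len(dataset)):
        aset[idx] = np.asarray(dataset[idx])
    return aset
```

This is the "mostly-transparent step in the middle": the existing pipeline still builds the Dataset, and a single call stages its contents for commit.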
Is your feature request related to a problem? Please describe.
I want to use hangar (as the shorter hang) at the terminal!
Describe the solution you'd like
I would like
~> hang commit -m "update the training set"
~> hang merge <branches-of-interest>
Describe alternatives you've considered
Additional context
There are many simple CLI toolkits for Python; even the standard library should be sufficient, so no new dependencies are required. I have no issue completing this myself in a new branch; most likely I could do so this afternoon with mild coaching. Is there a Slack or Discord I should be aware of?
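A dependency-free version of the requested commands can indeed be built on the standard library alone. Hangar's real CLI is click-based, so the argparse sketch below is only an illustration of what the `hang` interface might look like:

```python
import argparse

def build_hang_parser():
    """Sketch of a stdlib-only `hang` CLI exposing the two subcommands
    requested above. Illustrative only; not hangar's actual click CLI."""
    parser = argparse.ArgumentParser(prog='hang',
                                     description='short alias for hangar')
    sub = parser.add_subparsers(dest='command', required=True)

    commit = sub.add_parser('commit', help='commit the staged data')
    commit.add_argument('-m', '--message', required=True,
                        help='commit message')

    merge = sub.add_parser('merge', help='merge a branch into the current one')
    merge.add_argument('branch', help='name of the branch to merge')
    return parser
```

Wired to an entry point named hang, this would support exactly the two invocations shown above.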
Describe the bug
The following error is thrown after a developer setup of Hangar on a Windows machine, while trying to assign a dummy image to a repository arrayset: "OSError: symbolic link privilege not held".
Unexpected Behavior, Exceptions or Error Thrown
To Reproduce
Steps to reproduce the behavior, minimal example code:
import hangar
import numpy as np

repo = hangar.Repository('.')
co = repo.checkout(write=True)
co.arraysets.init_arrayset('images', shape=(10, 10, 3), dtype=np.uint8)
co.commit('arrayset init')
images = co.arraysets['images']
img = np.random.random((10, 10, 3)).astype(np.uint8)
images[0] = img
images[0] = img
Traceback (most recent call last):
File "", line 1, in
File "c:\users\naren\documents\github\hangar-py\src\hangar\arrayset.py", line 502, in setitem
self.add(value, key)
File "c:\users\naren\documents\github\hangar-py\src\hangar\arrayset.py", line 624, in add
raise e from None
File "c:\users\naren\documents\github\hangar-py\src\hangar\arrayset.py", line 621, in add
raise ValueError(isCompat.reason)
ValueError: data shape: (10, 3, 3) != fixed aset shape: (10, 10, 3)
img = np.random.random((10,10,3)).astype(np.uint8)
images[0] = img
Traceback (most recent call last):
File "", line 1, in
File "c:\users\naren\documents\github\hangar-py\src\hangar\arrayset.py", line 502, in setitem
self.add(value, key)
File "c:\users\naren\documents\github\hangar-py\src\hangar\arrayset.py", line 648, in add
hashVal = self._fs[self._dflt_backend].write_data(data)
File "c:\users\naren\documents\github\hangar-py\src\hangar\backends\hdf5_00.py", line 677, in write_data
self._create_schema(remote_operation=remote_operation)
File "c:\users\naren\documents\github\hangar-py\src\hangar\backends\hdf5_00.py", line 553, in _create_schema
symlink_rel(file_path, symlink_file_path)
File "c:\users\naren\documents\github\hangar-py\src\hangar\utils.py", line 114, in symlink_rel
os.symlink(rel_path_src, dst, target_is_directory=is_dir)
OSError: symbolic link privilege not held
Expected behavior
The data should be assigned to the arrayset without any error.
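On Windows, creating symlinks requires either administrator rights or Developer Mode being enabled. One workaround the library could adopt, sketched below purely as an illustration (not hangar's actual fix), is to fall back to copying when the privilege is missing:

```python
import os
import shutil

def symlink_or_copy(src, dst, target_is_directory=False):
    """Try a symlink first; if the OS refuses (e.g. the Windows
    'symbolic link privilege not held' error above), copy instead.
    Sketch of one possible workaround, not hangar's implementation."""
    try:
        os.symlink(src, dst, target_is_directory=target_is_directory)
    except (OSError, NotImplementedError):
        # No symlink privilege on this platform/account: duplicate the data
        if target_is_directory:
            shutil.copytree(src, dst)
        else:
            shutil.copy2(src, dst)
```

Copying costs disk space, so on the user's side the cleaner remedies are enabling Windows Developer Mode or running with elevated privileges, which let os.symlink succeed directly.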
Desktop (please complete the following information):
OS: Windows
Describe the bug
repo.log() prints branch information to the terminal, while repo.log(return_contents=True) doesn't return branch information.
Severity
Select an option: