dataset-registry's Introduction

DVC Dataset Registry

This DVC Data Registry is a centralized place to manage raw data files for use in other example DVC projects, such as https://github.com/iterative/example-get-started.

Installation

Start by cloning the project:

$ git clone https://github.com/iterative/dataset-registry
$ cd dataset-registry

This DVC project comes with a preconfigured DVC remote storage to hold all of the datasets. This is a read-only HTTP remote.

$ dvc remote list
storage https://remote.dvc.org/dataset-registry

Important: To be able to push to the default remote, overwrite it with:

$ dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry

This requires having configured corresponding S3 credentials locally.

Testing data synchronization locally

If you'd like to test commands such as dvc push that require write access to the remote storage, the easiest way is to set up a "local remote". This kind of remote is located on your local file system, but outside the DVC project:

$ mkdir -p /tmp/dvc-storage
$ dvc remote add local /tmp/dvc-storage

You should now be able to run:

$ dvc push -r local

Datasets

The folder structure of this project groups datasets by the external project they belong to. After cloning the repository and running dvc pull to download the data under DVC control, the workspace should look like this:

$ tree
.
├── README.md
├── get-started
│   └── data.xml.dvc  # Dataset used in iterative/example-get-started
├── mnist
│   └── raw.dvc       # Dataset used in iterative/dvc-get-started
└── fashion-mnist
    └── raw.dvc       # Dataset used in iterative/dvc-get-started

dataset-registry's People

Contributors

casperdcl, dberenbaum, efiop, flippedcoder, iesahin, jorgeorpinel, omesser, shcheklein, soygema, tibor-mach, ykasimov

dataset-registry's Issues

prepare.py takes directory as argument, returns error

I get this error:

Traceback (most recent call last):
  File "prepare.py", line 60, in <module>
    with io.open(input, encoding='utf8') as fd_in:
IsADirectoryError: [Errno 21] Is a directory: 'data'

This happens when I run
python3 prepare.py data
as the script suggests when no argument is given.

This also relates to issue #5, since data.xml does not exist.

Add CI/CD to check that data exists when PR is published

There should be a CI check that would run dvc status -c to ensure that data is uploaded.

Question: should we run it per commit or per PR? This is not clear; it depends to some extent on the intent of the PR and on how it is going to be merged (squash vs. rebase).

Use an OIDC connector, as in some of our other projects; ask @0x2b3bfa0 for the details.
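
As a rough sketch of what such a check could look like, the helper below decides whether a status report means "everything is uploaded". The function name is_in_sync is hypothetical and not part of this repository, and the sketch assumes that dvc status -c --json prints an empty JSON object ({}) when the cache and the remote are in sync; verify that against the DVC version used in CI before relying on it.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository).
# Returns 0 only when the text passed as $1 is an empty JSON object.
# Assumption: `dvc status -c --json` prints {} when nothing is missing.
is_in_sync() {
    # Drop all whitespace so "{}", "{ }" and "{}\n" compare equal.
    compact=$(printf '%s' "$1" | tr -d '[:space:]')
    [ "$compact" = "{}" ]
}

# Intended use in a CI step (assumes dvc is installed and can read the remote):
#   is_in_sync "$(dvc status -c --json)" || { echo "data not pushed" >&2; exit 1; }
```

An empty report (for example, when the command itself fails and prints nothing) is deliberately treated as "not in sync", so the check fails closed.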

Update readme

The readme states:

This is an auto-generated repository for use in https://dvc.org/doc/. Please report any issues in its source project, example-repos-dev.

I don't see a corresponding project in example-repos-dev to generate this repo. Are these instructions outdated?

Failed SSL certificate on tutorial dvc get

Hi 😄

(Apologies if this is a stupid user-error or the wrong place to post - do tell me if so)

I am a new user and trying to do the tutorial exercise.

When running the command dvc get https://github.com/iterative/dataset-registry tutorials/versioning/data.zip, I get the following error:

❯ dvc --version
3.50.0
❯ dvc get https://github.com/iterative/dataset-registry tutorials/versioning/data.zip
ERROR: failed to get 'tutorials/versioning/data.zip' - SCM error: Failed to clone repo 'https://github.com/iterative/dataset-registry' to '/var/folders/sr/wjtfqr9s6x3bw1s647t649x80000gn/T/tmp9dut6q7idvc-clone': HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /iterative/dataset-registry/info/refs?service=git-upload-pack (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)'))): HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /iterative/dataset-registry/info/refs?service=git-upload-pack (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)'))): [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)

I know nothing about certificates, and tried utilising LLMs to guide my troubleshooting.

I found one post talking about certificates, but I don't seem to have access to the certificate update command they reference. Though I do have an up to date certifi module in my virtual environment which the LLM says should be good enough to ensure up-to-date certificates. (Sorry I know little about them).

[Technical details]
Device: MacBook Pro 14" (2021)
OS: MacOS Version 14.4.1 (23E224) (Up-to-date)
Processor: M1 Pro (2021)(ARM)
Shell: zsh
Python: virtualenv environment 3.9 & conda environment 3.11
DVC: 3.50.0

[Steps attempted]

  • Update Operating System
  • Update pip
  • Try conda
  • Try virtualenv
  • Create a fresh directory
  • Use sudo
  • Restart my computer
  • Use a different network (Wifi & wired)
  • Check system date and time are accurate.
  • Attempt to update dvc (already updated)
  • Attempt to update certificates
  • Try the other dvc get command from another tutorial: dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
  • Ran the command (successfully) on Linux (python 3.10) (x86) - hence it's an issue with my MacBook.

I know this is likely an issue with my computer, but I am posting just in case as I am out of ideas aside from a full OS-reinstall (which I will do if needed).

I searched online about this and came across a couple posts, however they don't seem to have certificate issues. Issue #42 might be relevant?

Thank you for your time 🙏

tutorials/versioning/data.zip

Trying to run

dvc get https://github.com/iterative/dataset-registry tutorials/versioning/data.zip

However, I get this error.

ERROR: unexpected error - : Unable to find DVC file with output '../../../../private/var/folders/pq/57gd7mhx08g_66qwkzjmf6xh0000gn/T/tmp6n27kbdldvc-clone/tutorials/versioning/data.zip'

Am I doing something wrong?

Add image directory version for MNIST dataset

It's better to have a browsable directory with images.

It can be a directory in the form of

images
├── test
│   ├── 0
│   ├── 1
│   ├── 2
│   ├── 3
│   ├── 4
│   ├── 5
│   ├── 6
│   ├── 7
│   ├── 8
│   └── 9
└── train
    ├── 0
    ├── 1
    ├── 2
    ├── 3
    ├── 4
    ├── 5
    ├── 6
    ├── 7
    ├── 8
    └── 9

Each directory will contain a set of images for the corresponding label.

A total of 70000 28x28 grayscale images will be stored. We can also use a .zip file that stores all the images to reduce the download overhead.
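
The proposed layout could be created with a few lines of shell. This is only a sketch of the empty directory skeleton: the function name make_mnist_layout and the default root directory name images are illustrative choices, and the step that converts the raw MNIST files into individual images is not shown.

```shell
#!/bin/sh
# Hypothetical helper: create <root>/{train,test}/{0..9}.
# $1 is the root directory; it defaults to ./images.
make_mnist_layout() {
    layout_root=${1:-images}
    for split in train test; do
        for label in 0 1 2 3 4 5 6 7 8 9; do
            # -p creates missing parents and is a no-op if the directory exists.
            mkdir -p "$layout_root/$split/$label"
        done
    done
}
```

Calling make_mnist_layout images produces the 20 label directories shown above; each would then receive the images for its digit.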

DVC get from get started DOCS not working after last commit

I was running a demo this morning (29/04, 00:51 PM UTC), and the command:
dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
was working perfectly fine.

I got this command from the docs:

https://dvc.org/doc/start

After commit cdc0b6b, I get the following error, which I could not resolve:

ERROR: unexpected error - [Errno 2] No storage files available: 'get-started/data.xml

I am currently running DVC with the following configuration:

DVC version: 3.50.1 (pip)

Platform: Python 3.10.12 on Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.2
Supports:
        gs (gcsfs = 2024.3.1),
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3)

What is the issue? Is the dataset no longer available after the last commit?

Docs mention a file that is missing (in tutorial)

https://dvc.org/doc/tutorials/get-started/add-files mentions

dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
(which has some extra newlines, and could not be copy pasted correctly)

However, in https://github.com/iterative/dataset-registry/tree/master/get-started there is no data.xml, so following the tutorial returns an error message:

ERROR: failed to download 'https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355' to 'data/.CE4hKPGKTam3UAWsZ7pUaG/a3/04afb96060aad90176268345e10355' - HTTPSConnectionPool(host='s3-us-east-2.amazonaws.com', port=443): Read timed out.
ERROR: failed to get 'get-started/data.xml' from 'https://github.com/iterative/dataset-registry' - 1 files failed to download

Editing the command to data.xml.dvc works.

use-cases: create dataset for new use case

From iterative/dvc.org#674:

The datasets (actually 2 parts of the same dataset) in that tutorial are currently 2 ZIP files, data.zip and new-labels.zip. We are extracting the first dataset into a directory as a first version... Then we'll extract the 2nd one on top and update the dataset version.
