Get Started Tutorial (sources)

Contains source code and shell scripts to generate and deploy the example DVC repositories used in the Get Started and other sections of the DVC docs.

Requirements

Please make sure you have these available on the environment where these scripts will run:

Naming Convention for Example Repositories

In order to have a consistent naming scheme across all example repositories, new repositories should be named:

example-PROD-FEATURE

where PROD is one of the products (dvc, cml, studio, or dvclive) and FEATURE is the feature the repository focuses on, such as experiments or pipelines. You can also append additional keywords as a suffix to differentiate it from other repositories.

⚠️ Please create all new repositories with the prefix example-.

Scripts

Each example DVC project lives in its own root-level directory (below). cd into that directory before running the desired script, for example:

$ cd example-get-started
$ ./deploy.sh

example-get-started

There are 2 GitHub Actions set up to test and deploy the project:

These automatically test and deploy the project. If you need to run the project locally or manually, you only need to run generate.sh directly; deploy.sh is a helper script invoked from within generate.sh (see the sketch after the list below).

  • generate.sh: Generates the example-get-started DVC project from scratch.

    By default, the source code archive is derived from the local workspace for development purposes.

    For deployment, use generate.sh prod to upload/download a source code archive from S3 the same way as in Connect Code and Data.

  • deploy.sh: Creates and deploys the code archive from example-get-started/code for use by generate.sh.

    By default, it creates a local code archive at example-get-started/code.zip.

    For deployment, use deploy.sh prod to upload to S3.

    Requires AWS CLI and write access to s3://dvc-public/code/get-started/.
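
For reference, a hedged sketch of the two modes described above (the prod variant assumes AWS credentials with write access to s3://dvc-public/code/get-started/):

# Development: the code archive is built from the local workspace
$ cd example-get-started
$ ./generate.sh

# Deployment: the code archive is uploaded to / downloaded from S3
$ ./generate.sh prod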

example-get-started-experiments

There are 2 GitHub Actions set up to test and deploy the project:

These will automatically test and deploy the project. If you need to run the project locally/manually, run generate.sh.

Even after automatic deployment, you still need to follow the instructions to:

  • Update Studio to create a PR from the best generated experiment.
  • Push to GitLab if you want to update the repo there.
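
A hedged sketch of the GitLab step (the remote name and URL below are assumptions; adjust them to the actual GitLab project):

$ git remote add gitlab git@gitlab.com:iterative/example-get-started-experiments.git
$ git push gitlab --all
$ git push gitlab --tags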


example-repos-dev's Issues

Remove dataset-registry?

I think the point of generating https://github.com/iterative/example-get-started from a script in this repo was that there are different steps that loosely follow the actual Get Started section of the dvc.org docs. Each step corresponds to a Git commit and tag.

After merging PR #7 I realized that https://github.com/iterative/dataset-registry doesn't really need steps. All the datasets could just be added in a single commit, so that repo could be maintained manually instead of having to edit the script here and overwrite it every time. What do you think @shcheklein?

There may be other repos that are a better fit to be generated from this project, though, e.g. as described in iterative/dvc.org#487 (comment).

dvc-get-started: Rename `main` branch to `pipeline`

The main branch looks as if it represents the primary use case. Although this might have been true in the past for example-get-started, we now have more than one use case that needs attention (checkpoints, experiments, and maybe Studio).

Instead of selecting one of these as main, the naming should emphasize the different use cases.

Related #34

get-started-experiments: old `pip` installed with README instructions

$ python -m pip install -r requirements.txt
Collecting tensorflow<2.6,>=2.5 (from -r requirements.txt (line 1))
  Could not find a version that satisfies the requirement tensorflow<2.6,>=2.5 (from -r requirements.txt (line 1)) (from versions: 0.12.1, 1.0.0, 1.0.1, 1.1.0rc0, 1.1.0rc1, 1.1.0rc2, 1.1.0, 1.2.0rc0, 1.2.0rc1, 1.2.0rc2, 1.2.0, 1.2.1, 1.3.0rc0, 1.3.0rc1, 1.3.0rc2, 1.3.0, 1.4.0rc0, 1.4.0rc1, 1.4.0, 1.4.1, 1.5.0rc0, 1.5.0rc1, 1.5.0, 1.5.1, 1.6.0rc0, 1.6.0rc1, 1.6.0, 1.7.0rc0, 1.7.0rc1, 1.7.0, 1.7.1, 1.8.0rc0, 1.8.0rc1, 1.8.0, 1.9.0rc0, 1.9.0rc1, 1.9.0rc2, 1.9.0, 1.10.0rc0, 1.10.0rc1, 1.10.0, 1.10.1, 1.11.0rc0, 1.11.0rc1, 1.11.0rc2, 1.11.0, 1.12.0rc0, 1.12.0rc1, 1.12.0rc2, 1.12.0, 1.12.2, 1.12.3, 1.13.0rc0, 1.13.0rc1, 1.13.0rc2, 1.13.1, 1.13.2, 1.14.0rc0, 1.14.0rc1, 1.14.0, 2.0.0a0, 2.0.0b0, 2.0.0b1)
No matching distribution found for tensorflow<2.6,>=2.5 (from -r requirements.txt (line 1))

$ pip -V
pip 9.0.1 from /home/.../get-started-experiments/.venv/lib/python3.6/site-packages (python 3.6)

I'm on Ubuntu 20.04 LTS on WSL 2

example-get-started: long names in params file

The param names can be shortened to improve the command output of dvc params diff and especially dvc exp show.

Suggested names:

  1. featurize group --> fr
  2. max_features --> max_fr
  3. n_estimators --> n_est

Default branch for example-get-started (and others)

My git config sets main as my default branch, which breaks the instructions to publish example-get-started (since I have no master branch to push by default). Ideally, we would update the example-get-started default branch, but we may need to check that we don't break anything in the tutorial. At minimum, we should explicitly checkout a specific branch before publishing.
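
A minimal sketch of the "explicitly checkout a specific branch" option, assuming the publish instructions still expect a master branch:

$ git checkout -B master
$ git push --force origin master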

We should also keep this in mind for other repos in development.

dvc-get-started: Add `experiments` branch to collect various experiments

Currently the experiments are implicit in the main branch.

A separate branch with the Fashion-MNIST dataset may be better to reflect the changes in various metrics.

Tags

  • experiments-base: Based on the evaluate tag in the pipelines/main branch (see the sketch after this list)

  • experiments-cnn: Use CNN model instead of MLP

  • experiments-fashion-mnist: It may use Fashion-MNIST dataset
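
A hedged sketch of how the branch and its base tag could be created (branch and tag names taken from the list above):

# start the experiments line from the evaluate tag
$ git checkout -b experiments evaluate
$ git tag experiments-base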

Related #34

Error dvc get

I followed the hands-on tutorial a few days ago, and now I get an error when trying to download data.xml from the dataset-registry.

Could it be an error with the data.xml.dvc file?

dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

ERROR: failed to get 'get-started/data.xml' from 'https://github.com/iterative/dataset-registry' - '../../../../../tmp/tmp0dar7q9ddvc-clone/get-started/data.xml.dvc' format error: extra keys not allowed @ data['outs'][0]['size']
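
The size field in .dvc outputs was introduced in newer DVC releases, so an older installed DVC can fail to parse the registry's current .dvc files. Upgrading is usually enough (a sketch, assuming a pip-based install):

$ pip install -U dvc
$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml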

example-get-started: generalize CML workflow

Right now for some reason (simplicity?) we compare everything with the bigrams experiment.

  • we depend on a specific tag name
  • when we open a PR to that repo, the report name is misleading and not very meaningful

Getting started example fails at pulling data from s3://dvc-storage/

Following the example at https://github.com/iterative/example-get-started, it fails at this point:

$ dvc remote list
s3	s3://dvc-storage

$ dvc pull -r s3
Preparing to download data from 's3://dvc-storage'
Preparing to collect status from s3://dvc-storage
[#########                     ] 30% Collecting information
Error: failed to pull data from the cloud - Unable to locate credentials

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

Since this is a "Getting Started" example that is supposed to work, this step should download the prepared data from a remote cache somewhere.
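
For context, the published example-get-started repo configures a public, read-only HTTP remote as the default (see the config excerpt further down this page), so pulling without forcing the s3 remote should not require credentials (a sketch):

$ dvc pull    # uses the default 'storage' remote over HTTPS; no AWS credentials needed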

`example-dvc-experiments`: Include CML configuration

@casperdcl what do we need to make this repo useful for the CML happy path?

@iesahin @casperdcl as we discussed, can we make it example-experiments, covering basic scenarios with a predefined language (Python) and a predefined framework (let's say TensorFlow for now)? It would have a CML action from the first (?) commit that could be run when needed (and maybe even runs automatically).

Then it would be a good repo that we could even meaningfully present in Studio.

What do we need to make it substantially useful for CML?

Originally posted by @shcheklein in #79 (comment)

get-started-X: Write `generate` scripts for each project

TODO

  • Add an experiments branch for various experiments (#38) (Done as a separate repository)
  • (CANCEL) Add a studio branch to be used as a studio showcase
  • Move current main to pipeline branch (#39) (Done as a separate repository)
  • Add main to contain only README (#40) (Cancelled)
  • Create Docker containers for each tag
  • Update Katacoda GS:Experiments scenario and get rid of dvc-get-started-mnist
  • Remove model metrics from params
  • (MOVE) Keep only the relevant parameters for each repository
  • Create two fictional people for each branch, spreading the commits across time
  • Add additional plots like confusion matrix
  • Review dvc.org docs examples (besides Get Started) that could use the resulting example repo. Mainly the Experiments docs

Older Proposal Below.

We have a new Get Started project aimed towards experimentation features in DVC 2.0.

A step-by-step approach similar to the current example-get-started project is necessary for exposition. Generating the project from development sources, without manual intervention, also provides a clear history and easier maintenance.

Tags

  • 0-git-init: Empty Git repository initialized.

  • 1-dvc-init: DVC has been initialized. .dvc/ with the cache directory created.

  • 2-import-data: Raw data files in data/raw/ downloaded and tracked with DVC using dvc import-url from the dataset-registry. The first .dvc file is created.

  • 3-config-remote: Remote HTTP storage initialized. It's a shared, read-only storage that contains all data artifacts produced during the next steps.

  • 4-source-code: Source code copied from the generation repository.

  • 5-prepare-stage: Create dvc.yaml and the first pipeline stage with dvc stage add. It converts MNIST data from IDX format to NumPy format and stores it in data/prepared/ (see the sketch after this list).

  • 6-preprocess-stage: Create a new stage named preprocess to convert the data files produced in the previous stage into a format ready to supply to Keras.

  • 7-train-stage: Create a new stage that depends on the data files produced in preprocess and on the model.py file. The model file builds the model according to the parameters in params.yaml. This stage also produces a train.log.csv file to track the training performance metrics.

  • 8-evaluation: Evaluation stage. Runs the whole pipeline to produce the metrics.json file.

  • 9-cnn-model: Up until this stage, the model we use is an MLP with a single hidden layer. This stage demonstrates the use of dvc exp run to create more fine-grained experiments and implicit checkpoints at the end of dvc exp run. This corresponds to the basic branch of dvc-checkpoints-mnist.

  • 10-make-checkpoint: Updates evaluate.py with calls to make_checkpoint to show the Python API usage. This corresponds to the make_checkpoint branch of dvc-checkpoints-mnist.

  • 11-dvclive: Adds DVClive features to train.py to demonstrate their usage. This corresponds to the full_pipeline branch of dvc-checkpoints-mnist.

  • 12-signal-file: Uses signal-file features in evaluate.py to show the language-agnostic checkpoint features. This corresponds to the signal_file branch of dvc-checkpoints-mnist.
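
As a concrete illustration of the 5-prepare-stage step above, a hedged sketch of the dvc stage add call (the script and directory names are assumptions based on this plan):

$ dvc stage add -n prepare \
    -d src/prepare.py -d data/raw \
    -o data/prepared \
    python src/prepare.py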

Discussion Points

  • I'll defer the checkpoints implementation until we decide whether to merge the checkpoints project or keep it separate.

  • For tags 9-12, we can have renamed files within the same repository to make comparisons, e.g., in 10-make-checkpoint, we can have an evaluate-make-checkpoint.py file that's similar to evaluate.py except for the checkpoint usage.

  • We can put tags 0-8 on the main branch and convert the other tags to branches, as there is no clear progression between them.

  • Unlike example-get-started, there is no dvc add step in this setup. It's possible to use Fashion-MNIST for this: it can be downloaded manually and put inside data/ to update the pipeline. We could parameterize prepare.py with the path of the dataset, or demonstrate how to replace the dataset by overwriting it. Or we can do the reverse: dvc add MNIST in step 2 and use dvc import-url later, in step 9, to update the dataset with Fashion-MNIST.

  • The script will use dvc stage add instead of dvc run for all stages. dvc exp run will also be favored over dvc repro.

  • The source code and related files will be copied to this repository instead of being downloaded from S3. I can't think of a use case for putting the files on S3 instead of a GitHub repository. Why was it done that way?

  • The new project repository will be https://github.com/iterative/dvc-get-started.

get-started-experiments: delete all remote experiments in the push script

The current push script built by generate.bash deletes the remote tags before pushing new tags, to prevent older tags from lurking around. It doesn't do this for experiments, so all older experiments remain in the repository after a new get-started-experiments repository is created.

We should delete all the experiments in the remote before this line in the push script:

dvc exp list --all --names-only | xargs -n 1 dvc exp push origin
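
A hedged sketch of that deletion step (the exact dvc exp list / dvc exp remove flags should be double-checked against the installed DVC version):

$ dvc exp list origin --all --names-only | xargs -n 1 dvc exp remove -g origin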

`example-get-started` : use tar instead of zip

I think we're using ZIP so Windows users can use the example-get-started repo:

pushd $PACKAGE_DIR
zip -r $PACKAGE src/*
popd

But a) I can't run the above command on Windows (i.e., generate the repo), because zip isn't available in Windows cmd or any other built-in terminal; and, most importantly, b) unzip is also not a command shipped with Windows. You need to install it in some other console such as Git Bash, using Chocolatey, etc.

wget https://code.dvc.org/get-started/code.zip
unzip code.zip
rm -f code.zip

Why not just use something more Linux-friendly that we can also get on Windows anyway, namely tar? (See the sketch below.)
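
A hedged sketch of what the tar-based packaging could look like (the archive name and URL are assumptions; the current deployment serves code.zip):

# deploy side
pushd $PACKAGE_DIR
tar -czf $PACKAGE src/
popd

# consumer side (hypothetical URL)
wget https://code.dvc.org/get-started/code.tar.gz
tar -xzf code.tar.gz
rm -f code.tar.gz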

Add MNIST dataset to the dataset-registry for the new example-get-started

I started to write the experiments tutorial using the original MNIST from deepai.org:

Currently I'm using a Google Storage bucket as a remote:

https://console.cloud.google.com/storage/browser/dvc-example-data

After prepare and preprocess, the data directory looks like this:

[I] ➜ tree data
data
├── prepared
│   ├── mnist-test.npz
│   └── mnist-train.npz
├── preprocessed
│   ├── mnist-test.npz
│   └── mnist-train.npz
├── raw
│   ├── t10k-images-idx3-ubyte.gz
│   ├── t10k-labels-idx1-ubyte.gz
│   ├── train-images-idx3-ubyte.gz
│   └── train-labels-idx1-ubyte.gz
└── raw.dvc
  • There are 4 raw data files. These are better served with dvc import-url or dvc get than with a common remote, I think. Also, instead of adding them one by one, I decided to add the raw/ directory to DVC (see the sketch after this list).

  • The prepared/ directory contains mixed and shuffled data; preprocessed/ contains normalized and (optionally) noise-added data. These are all steps in the pipeline. I plan to track directories instead of individual files here as well.

  • There may be multiple models (one MLP, one CNN, another deep CNN ...) that can be selected in params.yaml by train.py. I think it overcomplicates the pipeline. (Which files will train.py depend on?) However, it can show the experimentation features more clearly. "We select this model with this amount of salt and pepper noise and it gives us this result."

  • I plan to write individual model files for different models. Some of them may not run on Katacoda, but overall there will be more parameters for users to try.
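
A sketch of the two options mentioned in the first item above (the registry path is hypothetical):

# track the whole raw/ directory as a single DVC artifact
$ dvc add data/raw

# or fetch it ad hoc from a registry
$ dvc get https://github.com/iterative/dataset-registry mnist/raw -o data/raw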

In summary, I need some way to put the data files in a public place. I can use the above URL, or you may want to keep the data at the current one. Which way would you recommend?

@shcheklein @dberenbaum

This is related to iterative/dvc.org#1400, iterative/katacoda-scenarios#60, and probably many others :)

Downgrade pickling protocol

@dmpetrov reported an error with the get started repo for older Python versions:

$ dvc exp run --set-param train.n_est=120
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Running stage 'train':
> python src/train.py data/features model.pkl
Traceback (most recent call last):
  File "src/train.py", line 22, in <module>
    matrix = pickle.load(fd)
ValueError: unsupported pickle protocol: 5
ERROR: failed to reproduce 'dvc.yaml': failed to run: python src/train.py data/features model.pkl, exited with 1

The most recent repo was built using Python 3.8, which makes the highest pickle protocol (5) incompatible with earlier Python versions. See https://docs.python.org/3/library/pickle.html#pickle-protocols. Unless there was a specific reason for this, it might be better to use the default protocol or to hard-code a protocol.
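
A quick way to confirm what the build environment writes by default (a sketch; pickle.DEFAULT_PROTOCOL and pickle.HIGHEST_PROTOCOL are standard-library constants):

$ python -c "import pickle; print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)"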

repos: don't use `dataset-registry` to store models, intermediate artifacts, etc

In some example repos I see:

[core]
    remote = storage
['remote "storage"']
    url = https://remote.dvc.org/dataset-registry

We should not be using dataset-registry as a remote for these repos. For every repo we use a separate remote now; e.g., here is the one for example-get-started (a one-command sketch follows the excerpt below):

https://github.com/iterative/example-get-started/blob/master/.dvc/config#L1-L4

[core]
    remote = storage
['remote "storage"']
    url = https://remote.dvc.org/get-started
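
For a new example repo, the per-repo remote can be set with a single command (a sketch; the URL follows the example-get-started pattern above and would differ per repo):

$ dvc remote add -d storage https://remote.dvc.org/get-started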

Should there be .dvc files, or a .sh script for setup?

I see this repository has .dvc files already in place. It might be more instructive to the user to instead have a shell script that runs the setup, so that the .dvc files are generated as a side effect of that script.

experiments: use pytorch or other lightweight lib

One of the common problems with TF is that it often lags behind in supporting new Python versions or hardware, and it takes time before it is upgraded.

Does it make sense to use PyTorch? Is it simpler in that sense?

One of the most recent examples: I can't run experiments-get-started on an M1.

We need the get started repos to be as accessible as possible.

example-get-started: make plot files smaller

Yeah, technically plotting precision-recall gets a bit weird with fewer points, but as long as we keep a reasonably large number of points, it shouldn’t be a noticeable issue. We can do the same with roc.json

Agree that we need a good number of data points. Cutting the size 2x should be enough for Studio, and it should not noticeably affect the curve quality (I hope 🙂).

roc.json is very small (56 KB). No need to optimize.

Oh weird, it looks like sklearn automatically cuts it down to an optimal size for roc but not prc.

@ivan Let me know if you want me to regenerate it. I don’t know how the additional branch was added, but I can at least take care of master if needed.

cc @dberenbaum @dmpetrov

get-started-experiments: Ignore `data/images`

After doing dvc pull and extracting data/images.tar.gz to data/images, the resulting directory doesn't end up in data/.gitignore, and this breaks dvc exp apply in the repository.
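
A minimal workaround sketch until the generation script handles this (the ignore entry is an assumption based on the paths above):

$ echo "/images" >> data/.gitignore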

Write a README

This repository should have a README.md file showing how to use the code here and how to set up and execute the pipeline.

Syntax highlighting for dvc.lock and .dvc files

We should add syntax highlighting for dvc.lock and .dvc files as YAML. We can do so by using .gitattributes to force Linguist to highlight them as such on GitHub.

The following text should be kept in .gitattributes:

*.dvc linguist-language=YAML
dvc.lock linguist-language=YAML

See https://github.com/skshetry/example-get-started/blob/e67562ec1983bbcee0e5282227d0d47564c54eb2/dvc.lock for an example.

This is a workaround until dvc.lock and .dvc files are added to Linguist so that they are highlighted automatically.

Include the naming convention for new repositories in the Readme

The naming convention for the new repositories should be documented. The best place for this seems to be the Readme file of this repository.

Naming convention for generated example repos

  • example-PROD-FEAT
    • e.g. example-dvc-checkpoints-live-torch
  • PROD-example-FEAT
    • e.g. dvc-example-checkpoints-file-tensorflow
