Get Started Tutorial (sources)

Contains source code and shell scripts to generate and deploy the example DVC repositories used in the Get Started and other sections of the DVC docs.

Requirements

Please make sure you have these available on the environment where these scripts will run:

Naming Convention for Example Repositories

In order to have a consistent naming scheme across all example repositories, new repositories should be named:

example-PROD-FEATURE

where PROD is one of the products (dvc, cml, studio, or dvclive) and FEATURE is the feature the repository focuses on, such as experiments or pipelines. You can also append additional keywords as a suffix to differentiate it from other repositories.

⚠️ Please create all new repositories with the prefix example-.

Scripts

Each example DVC project lives in its own root-level directory (below). cd into that directory before running the desired script, for example:

$ cd example-get-started
$ ./deploy.sh

example-get-started

There are 2 GitHub Actions set up to test and deploy the project:

These automatically test and deploy the project. If you need to run the project locally or manually, you only need to run generate.sh directly; deploy.sh is a helper script invoked from within generate.sh (see the sketch after the list below).

  • generate.sh: Generates the example-get-started DVC project from scratch.

    By default, the source code archive is derived from the local workspace for development purposes.

    For deployment, use generate.sh prod to upload/download a source code archive from S3 the same way as in Connect Code and Data.

  • deploy.sh: Creates and deploys the code archive from example-get-started/code for use by generate.sh.

    By default, it creates a local code archive at example-get-started/code.zip.

    For deployment, use deploy.sh prod to upload to S3.

    Requires AWS CLI and write access to s3://dvc-public/code/get-started/.
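
For reference, a hedged sketch of the two modes described above (the prod variant assumes AWS credentials with write access to s3://dvc-public/code/get-started/):

# Development: the code archive is built from the local workspace
$ cd example-get-started
$ ./generate.sh

# Deployment: the code archive is uploaded to / downloaded from S3
$ ./generate.sh prod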

example-get-started-experiments

There are 2 GitHub Actions set up to test and deploy the project:

These will automatically test and deploy the project. If you need to run the project locally/manually, run generate.sh.

Even after automatic deployment, you still need to follow the instructions to:

  • Update Studio to create a PR from the best generated experiment.
  • Push to GitLab if you want to update the repo there.
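
A hedged sketch of the GitLab step (the remote name and URL below are assumptions; adjust them to the actual GitLab project):

$ git remote add gitlab git@gitlab.com:iterative/example-get-started-experiments.git
$ git push gitlab --all
$ git push gitlab --tags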


example-repos-dev's Issues

Remove dataset-registry?

I think the point of generating https://github.com/iterative/example-get-started from a script in this repo was that there are different steps that loosely follow the actual Get Started section of the dvc.org docs. Each step corresponds to a Git commit and tag.

After merging PR #7 I realized that https://github.com/iterative/dataset-registry doesn't really need steps. All the datasets could just be added in a single commit, so that repo could be maintained manually instead of having to edit the script here and overwrite it every time. What do you think @shcheklein?

There may be other repos that are a better fit to be generated from this project, though, e.g. as described in iterative/dvc.org#487 (comment).

dvc-get-started: Rename `main` branch to `pipeline`

The main branch looks as if it represents the primary use case. Although this might have been true in the past for example-get-started, we now have more than one use case that needs attention (checkpoints, experiments, and maybe Studio).

Instead of selecting one of these as main, the naming should emphasize the different use cases.

Related #34

get-started-experiments: old `pip` installed with README instructions

$ python -m pip install -r requirements.txt
Collecting tensorflow<2.6,>=2.5 (from -r requirements.txt (line 1))
  Could not find a version that satisfies the requirement tensorflow<2.6,>=2.5 (from -r requirements.txt (line 1)) (from versions: 0.12.1, 1.0.0, 1.0.1, 1.1.0rc0, 1.1.0rc1, 1.1.0rc2, 1.1.0, 1.2.0rc0, 1.2.0rc1, 1.2.0rc2, 1.2.0, 1.2.1, 1.3.0rc0, 1.3.0rc1, 1.3.0rc2, 1.3.0, 1.4.0rc0, 1.4.0rc1, 1.4.0, 1.4.1, 1.5.0rc0, 1.5.0rc1, 1.5.0, 1.5.1, 1.6.0rc0, 1.6.0rc1, 1.6.0, 1.7.0rc0, 1.7.0rc1, 1.7.0, 1.7.1, 1.8.0rc0, 1.8.0rc1, 1.8.0, 1.9.0rc0, 1.9.0rc1, 1.9.0rc2, 1.9.0, 1.10.0rc0, 1.10.0rc1, 1.10.0, 1.10.1, 1.11.0rc0, 1.11.0rc1, 1.11.0rc2, 1.11.0, 1.12.0rc0, 1.12.0rc1, 1.12.0rc2, 1.12.0, 1.12.2, 1.12.3, 1.13.0rc0, 1.13.0rc1, 1.13.0rc2, 1.13.1, 1.13.2, 1.14.0rc0, 1.14.0rc1, 1.14.0, 2.0.0a0, 2.0.0b0, 2.0.0b1)
No matching distribution found for tensorflow<2.6,>=2.5 (from -r requirements.txt (line 1))

$ pip -V
pip 9.0.1 from /home/.../get-started-experiments/.venv/lib/python3.6/site-packages (python 3.6)

I'm on Ubuntu 20.04 LTS on WSL 2

example-get-started: long names in params file

The param names can be shortened to improve the command output of dvc params diff and especially dvc exp show.

Suggested names:

  1. featurize group --> fr
  2. max_features --> max_fr
  3. n_estimators --> n_est

Default branch for example-get-started (and others)

My git config sets main as my default branch, which breaks the instructions to publish example-get-started (since I have no master branch to push by default). Ideally, we would update the example-get-started default branch, but we may need to check that we don't break anything in the tutorial. At minimum, we should explicitly checkout a specific branch before publishing.
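
A minimal sketch of the "explicitly checkout a specific branch" option, assuming the publish instructions still expect a master branch:

$ git checkout -B master
$ git push --force origin master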

We should also keep this in mind for other repos in development.

dvc-get-started: Add `experiments` branch to collect various experiments

Currently the experiments are implicit in the main branch.

A separate branch with the Fashion-MNIST dataset may be better to reflect the changes in various metrics.

Tags

  • experiments-base: Based on the evaluate tag in the pipelines/main branch (see the sketch after this list)

  • experiments-cnn: Use CNN model instead of MLP

  • experiments-fashion-mnist: It may use Fashion-MNIST dataset
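
A hedged sketch of how the branch and its base tag could be created (branch and tag names taken from the list above):

# start the experiments line from the evaluate tag
$ git checkout -b experiments evaluate
$ git tag experiments-base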

Related #34

Error dvc get

I followed the hands-on tutorial a few days ago, and now I get an error when trying to download data.xml from the dataset-registry.

Could it be an error with the data.xml.dvc file?

dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

ERROR: failed to get 'get-started/data.xml' from 'https://github.com/iterative/dataset-registry' - '../../../../../tmp/tmp0dar7q9ddvc-clone/get-started/data.xml.dvc' format error: extra keys not allowed @ data['outs'][0]['size']
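
The size field in .dvc outputs was introduced in newer DVC releases, so an older installed DVC can fail to parse the registry's current .dvc files. Upgrading is usually enough (a sketch, assuming a pip-based install):

$ pip install -U dvc
$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml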

example-get-started: generalize CML workflow

Right now for some reason (simplicity?) we compare everything with the bigrams experiment.

  • we depend on a specific tag name
  • when we open a PR to that repo, the report name is misleading and not very meaningful

Getting started example fails at pulling data from s3://dvc-storage/

Following the example at https://github.com/iterative/example-get-started, it fails at this point:

$ dvc remote list
s3	s3://dvc-storage

$ dvc pull -r s3
Preparing to download data from 's3://dvc-storage'
Preparing to collect status from s3://dvc-storage
[#########                     ] 30% Collecting information
Error: failed to pull data from the cloud - Unable to locate credentials

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

Since this is a "Getting Started" example that is supposed to work, this step should download the prepared data from a remote cache somewhere.
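
For context, the published example-get-started repo configures a public, read-only HTTP remote as the default (see the config excerpt further down this page), so pulling without forcing the s3 remote should not require credentials (a sketch):

$ dvc pull    # uses the default 'storage' remote over HTTPS; no AWS credentials needed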

`example-dvc-experiments`: Include CML configuration

@casperdcl what do we need to make this repo useful for the CML happy path?

@iesahin @casperdcl as we discussed, can we make it example-experiments, covering basic scenarios with a predefined language (Python) and a predefined framework (let's say TensorFlow for now)? It would have a CML action from the first (?) commit that could be run when needed (and maybe even runs automatically).

Then it would be a good repo that we could even meaningfully present in Studio.

What do we need to make it substantially useful for CML?

Originally posted by @shcheklein in #79 (comment)

get-started-X: Write `generate` scripts for each project

TODO

  • Add an experiments branch for various experiments (#38) (Done as a separate repository)
  • (CANCEL) Add a studio branch to be used as a studio showcase
  • Move current main to pipeline branch (#39) (Done as a separate repository)
  • Add main to contain only README (#40) (Cancelled)
  • Create Docker containers for each tag
  • Update Katacoda GS:Experiments scenario and get rid of dvc-get-started-mnist
  • Remove model metrics from params
  • (MOVE) Keep only the relevant parameters for each repository
  • Create two fictional people for each branch, spreading the commits across time
  • Add additional plots like confusion matrix
  • Review dvc.org docs examples (besides Get Started) that could use the resulting example repo. Mainly the Experiments docs

Older Proposal Below.

We have a new Get Started project aimed towards experimentation features in DVC 2.0.

A step-by-step approach similar to the current example-get-started project is necessary for exposition. Generating the project from development sources, without manual intervention, also provides a clear history and easier maintenance.

Tags

  • 0-git-init: Empty Git repository initialized.

  • 1-dvc-init: DVC has been initialized. .dvc/ with the cache directory created.

  • 2-import-data: Raw data files in data/raw/ downloaded and tracked with DVC using dvc import-url from the dataset-registry. The first .dvc file is created.

  • 3-config-remote: Remote HTTP storage initialized. It's a shared, read-only storage that contains all data artifacts produced during the next steps.

  • 4-source-code: Source code copied from the generation repository.

  • 5-prepare-stage: Create dvc.yaml and the first pipeline stage with dvc stage add. It converts MNIST data from IDX format to NumPy format and stores it in data/prepared/ (see the sketch after this list).

  • 6-preprocess-stage: Create a new stage named preprocess to convert the data files produced in the previous stage into a format ready to supply to Keras.

  • 7-train-stage: Create a new stage that depends on the data files produced in preprocess and on the model.py file. The model file builds the model according to the parameters in params.yaml. This stage also produces a train.log.csv file to track the training performance metrics.

  • 8-evaluation: Evaluation stage. Runs the whole pipeline to produce the metrics.json file.

  • 9-cnn-model: Up until this stage, the model we use is an MLP with a single hidden layer. This stage demonstrates the use of dvc exp run to create more fine-grained experiments and implicit checkpoints at the end of dvc exp run. This corresponds to the basic branch of dvc-checkpoints-mnist.

  • 10-make-checkpoint: Updates evaluate.py with calls to make_checkpoint to show the Python API usage. This corresponds to the make_checkpoint branch of dvc-checkpoints-mnist.

  • 11-dvclive: Adds DVClive features to train.py to demonstrate their usage. This corresponds to the full_pipeline branch of dvc-checkpoints-mnist.

  • 12-signal-file: Uses signal-file features in evaluate.py to show the language-agnostic checkpoint features. This corresponds to the signal_file branch of dvc-checkpoints-mnist.
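
As a concrete illustration of the 5-prepare-stage step above, a hedged sketch of the dvc stage add call (the script and directory names are assumptions based on this plan):

$ dvc stage add -n prepare \
    -d src/prepare.py -d data/raw \
    -o data/prepared \
    python src/prepare.py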

Discussion Points

  • I'll defer the checkpoints implementation until we decide whether to merge the checkpoints project or keep it separate.

  • For tags 9-12, we can have renamed files within the same repository to make comparisons, e.g., in 10-make-checkpoint, we can have an evaluate-make-checkpoint.py file that's similar to evaluate.py except for the checkpoint usage.

  • We can put tags 0-8 on the main branch and convert the other tags to branches, as there is no clear progression between them.

  • Unlike example-get-started, there is no dvc add step in this setup. It's possible to use Fashion-MNIST for this: it can be downloaded manually and put inside data/ to update the pipeline. We could parameterize prepare.py with the path of the dataset, or demonstrate how to replace the dataset by overwriting it. Or we can do the reverse: dvc add MNIST in step 2 and use dvc import-url later, in step 9, to update the dataset with Fashion-MNIST.

  • The script will use dvc stage add instead of dvc run for all stages. dvc exp run will also be favored over dvc repro.

  • The source code and related files will be copied to this repository instead of being downloaded from S3. I can't think of a use case for putting the files on S3 instead of a GitHub repository. Why was it done that way?

  • The new project repository will be https://github.com/iterative/dvc-get-started.

get-started-experiments: delete all remote experiments in the push script

The current push script built by generate.bash deletes the remote tags before pushing new tags, to prevent older tags from lurking around. It doesn't do this for experiments, so all older experiments remain in the repository after a new get-started-experiments repository is created.

We should delete all the experiments in the remote before this line in the push script:

dvc exp list --all --names-only | xargs -n 1 dvc exp push origin
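
A hedged sketch of that deletion step (the exact dvc exp list / dvc exp remove flags should be double-checked against the installed DVC version):

$ dvc exp list origin --all --names-only | xargs -n 1 dvc exp remove -g origin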

`example-get-started` : use tar instead of zip

I think we're using ZIP so Windows users can use the example-get-started repo:

pushd $PACKAGE_DIR
zip -r $PACKAGE src/*
popd

But a) I can't run the above command on Windows (i.e., generate the repo), because zip isn't available in Windows cmd or any other built-in terminal; and, most importantly, b) unzip is also not a command shipped with Windows. You need to install it in some other console such as Git Bash, using Chocolatey, etc.

wget https://code.dvc.org/get-started/code.zip
unzip code.zip
rm -f code.zip

Why not just use something more Linux-friendly that we can also get on Windows anyway, namely tar? (See the sketch below.)
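
A hedged sketch of what the tar-based packaging could look like (the archive name and URL are assumptions; the current deployment serves code.zip):

# deploy side
pushd $PACKAGE_DIR
tar -czf $PACKAGE src/
popd

# consumer side (hypothetical URL)
wget https://code.dvc.org/get-started/code.tar.gz
tar -xzf code.tar.gz
rm -f code.tar.gz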

Add MNIST dataset to the dataset-registry for the new example-get-started

I started to write the experiments tutorial using the original MNIST from deepai.org:

Currently I'm using a Google Storage bucket as a remote:

https://console.cloud.google.com/storage/browser/dvc-example-data

After prepare and preprocess, the data directory looks like this:

[I] ➜ tree data
data
├── prepared
│   ├── mnist-test.npz
│   └── mnist-train.npz
├── preprocessed
│   ├── mnist-test.npz
│   └── mnist-train.npz
├── raw
│   ├── t10k-images-idx3-ubyte.gz
│   ├── t10k-labels-idx1-ubyte.gz
│   ├── train-images-idx3-ubyte.gz
│   └── train-labels-idx1-ubyte.gz
└── raw.dvc
  • There are 4 raw data files. These are better served with dvc import-url or dvc get than with a common remote, I think. Also, instead of adding them one by one, I decided to add the raw/ directory to DVC (see the sketch after this list).

  • The prepared/ directory contains mixed and shuffled data; preprocessed/ contains normalized and (optionally) noise-added data. These are all steps in the pipeline. I plan to track directories instead of individual files here as well.

  • There may be multiple models (one MLP, one CNN, another deep CNN ...) that can be selected in params.yaml by train.py. I think it overcomplicates the pipeline. (Which files will train.py depend on?) However, it can show the experimentation features more clearly. "We select this model with this amount of salt and pepper noise and it gives us this result."

  • I plan to write individual model files for different models. Some of them may not run on Katacoda, but overall there will be more parameters for users to try.
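
A sketch of the two options mentioned in the first item above (the registry path is hypothetical):

# track the whole raw/ directory as a single DVC artifact
$ dvc add data/raw

# or fetch it ad hoc from a registry
$ dvc get https://github.com/iterative/dataset-registry mnist/raw -o data/raw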

In summary, I need some way to put the data files in a public place. I can use the above URL, or you may want to keep the data at the current one. Which way would you recommend?

@shcheklein @dberenbaum

This is related to iterative/dvc.org#1400, iterative/katacoda-scenarios#60, and probably many others :)

Downgrade pickling protocol

@dmpetrov reported an error with the get started repo for older Python versions:

$ dvc exp run --set-param train.n_est=120
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Running stage 'train':
> python src/train.py data/features model.pkl
Traceback (most recent call last):
  File "src/train.py", line 22, in <module>
    matrix = pickle.load(fd)
ValueError: unsupported pickle protocol: 5
ERROR: failed to reproduce 'dvc.yaml': failed to run: python src/train.py data/features model.pkl, exited with 1

The most recent repo was built using Python 3.8, which makes the highest pickle protocol (5) incompatible with earlier Python versions. See https://docs.python.org/3/library/pickle.html#pickle-protocols. Unless there was a specific reason for this, it might be better to use the default protocol or to hard-code a protocol.
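
A quick way to confirm what the build environment writes by default (a sketch; pickle.DEFAULT_PROTOCOL and pickle.HIGHEST_PROTOCOL are standard-library constants):

$ python -c "import pickle; print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)"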

repos: don't use `dataset-registry` to store models, intermediate artifacts, etc

In some example repos I see:

[core]
    remote = storage
['remote "storage"']
    url = https://remote.dvc.org/dataset-registry

We should not be using dataset-registry as a remote for these repos. For every repo we use a separate remote now; e.g., here is the one for example-get-started (a one-command sketch follows the excerpt below):

https://github.com/iterative/example-get-started/blob/master/.dvc/config#L1-L4

[core]
    remote = storage
['remote "storage"']
    url = https://remote.dvc.org/get-started
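
For a new example repo, the per-repo remote can be set with a single command (a sketch; the URL follows the example-get-started pattern above and would differ per repo):

$ dvc remote add -d storage https://remote.dvc.org/get-started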

Should there be .dvc files, or a .sh script for setup?

I see this repository has .dvc files already in place. It might be more instructive to the user to instead have a shell script that runs the setup, so that the .dvc files are generated as a side effect of that script.

experiments: use pytorch or other lightweight lib

One of the common problems with TF is that it often lags behind in supporting new Python versions or hardware, and it takes time before it is upgraded.

Does it make sense to use PyTorch? Is it simpler in that sense?

One of the most recent examples: I can't run experiments-get-started on an M1.

We need the get started repos to be as accessible as possible.

example-get-started: make plot files smaller

Yeah, technically plotting precision-recall gets a bit weird with fewer points, but as long as we keep a reasonably large number of points, it shouldn’t be a noticeable issue. We can do the same with roc.json

Agree that we need a good number of data points. Cutting the size 2x should be enough for Studio, and it should not noticeably affect the curve quality (I hope 🙂).

roc.json is very small (56 KB). No need to optimize.

Oh weird, it looks like sklearn automatically cuts it down to an optimal size for roc but not prc.

@ivan Let me know if you want me to regenerate it. I don’t know how the additional branch was added, but I can at least take care of master if needed.

cc @dberenbaum @dmpetrov

get-started-experiments: Ignore `data/images`

After doing dvc pull and extracting data/images.tar.gz to data/images, the resulting directory doesn't end up in data/.gitignore, and this breaks dvc exp apply in the repository.
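
A minimal workaround sketch until the generation script handles this (the ignore entry is an assumption based on the paths above):

$ echo "/images" >> data/.gitignore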

Write a README

This repository should have a README.md file showing how to use the code here and how to set up and execute the pipeline.

Syntax highlighting for dvc.lock and .dvc files

We should add syntax highlighting for dvc.lock and .dvc files as YAML. We can do so by using .gitattributes to force Linguist to highlight them as such on GitHub.

The following text should be kept in .gitattributes:

*.dvc linguist-language=YAML
dvc.lock linguist-language=YAML

See https://github.com/skshetry/example-get-started/blob/e67562ec1983bbcee0e5282227d0d47564c54eb2/dvc.lock for an example.

This is a workaround until dvc.lock and .dvc files are added to Linguist so that they are highlighted automatically.

Include the naming convention for new repositories in the Readme

The naming convention for the new repositories should be documented. The best place for this seems to be the Readme file of this repository.

Naming convention for generated example repos

  • example-PROD-FEAT
    • e.g. example-dvc-checkpoints-live-torch
  • PROD-example-FEAT
    • e.g. dvc-example-checkpoints-file-tensorflow
