cml_dvc_case's Introduction

CML with DVC use case

This repository contains a sample project using CML with DVC to push/pull data from cloud storage and track model metrics. When a pull request is made in this repository, the following will occur:

  • GitHub will deploy a runner machine with a specified CML Docker environment
  • DVC will pull data from cloud storage
  • The runner will execute a workflow to train an ML model (python train.py)
  • A visual CML report about the model performance with DVC metrics will be returned as a comment in the pull request

The key file enabling these actions is .github/workflows/cml.yaml.
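
As a rough illustration of how the four steps above map onto a workflow file, here is a minimal sketch. It is not a copy of the file in this repository: the workflow name, trigger, container image tag, step name, base branch (master), and exact commands are assumptions chosen for illustration, and it assumes train.py writes its metrics to a file that DVC tracks.

```yaml
name: train-my-model          # hypothetical workflow name
on: [push]                    # assumption: every push, including pushes to a pull request branch

jobs:
  run:
    runs-on: [ubuntu-latest]
    # Assumed CML image with DVC preinstalled; choose the tag that fits your project
    container: docker://dvcorg/cml:0-dvc2-base1

    steps:
      - uses: actions/checkout@v3

      - name: cml_run
        env:
          # Token CML uses to post the report as a comment
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # Storage credentials read by DVC; add AWS_SESSION_TOKEN only if you need it
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          pip install -r requirements.txt

          # Pull the dataset from cloud storage
          dvc pull data

          # Train the model
          python train.py

          # Compare metrics against the base branch (assumed to be master)
          git fetch --prune --unshallow
          dvc metrics diff master >> report.md

          # Post the report as a comment
          cml comment create report.md
```

The container keeps the environment reproducible between runs. If you prefer not to use a container, the setup-cml and setup-dvc actions install the same tools on a plain runner.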

Secrets and environment variables

In this example, .github/workflows/cml.yaml uses the following environment variables, which are stored as repository secrets.

Secret                  Description
GITHUB_TOKEN            This is set by default in every GitHub repository. It does not need to be manually added.
AWS_ACCESS_KEY_ID       AWS credential for accessing S3 storage
AWS_SECRET_ACCESS_KEY   AWS credential for accessing S3 storage
AWS_SESSION_TOKEN       Optional AWS credential for accessing S3 storage (if MFA is enabled)

DVC works with many kinds of remote storage. To configure this example for a different cloud storage provider, see our documentation on the CML repository.
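
For instance, DVC can read Google Drive credentials from the GDRIVE_CREDENTIALS_DATA environment variable, which avoids the interactive browser login on a CI runner. The step below is a hedged sketch rather than a tested configuration: it assumes the DVC remote is already set up for Google Drive (typically with a service account) and that the JSON credentials are stored in a repository secret with the same name. The step name and secret name are illustrative.

```yaml
- name: cml_run
  env:
    REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    # Assumed secret holding the JSON credentials for the Google Drive remote
    GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
  run: |
    # With the variable set, DVC authenticates without prompting
    dvc pull data
```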

Cloning this project

Note that if you clone this project, you will have to configure your own DVC storage and credentials for the example. We suggest the following procedure:

  1. Fork the repository and clone to your local workstation.
  2. Run python get_data.py to generate your own copy of the dataset. After initializing DVC in the project directory and configuring your remote storage, run dvc add data and dvc push to push your dataset to remote storage.
  3. Run git add, git commit, and git push to push your DVC configuration to GitHub.
  4. Add your storage credentials as repository secrets.
  5. Copy the workflow file .github/workflows/cml.yaml from this repository to your fork. By default, workflow files are not copied in forks. When you commit this file to your repository, the first workflow should be initiated.

cml_dvc_case's People

Contributors

davidgortega, elleobrien, erfard, tasdomas

cml_dvc_case's Issues

Is there a possibility to use DVC in CML with GDrive?

I use GDrive as my remote storage.

When I pull the data to my personal computer, it requires manual authentication by clicking the provided link.

When I try to use a similar action to yours, I get an error because the remote machine on which GitHub runs the action can't authenticate this way.

Is there a way to authenticate with GDrive similar to what you do with the S3 bucket, with some kind of token?

Train a model on a large dataset with gitlab-ci.yaml

Hi,

First of all, thank you for the nice tools you have developed. I am trying to create an ML training workflow with GitLab, CML, and DVC, using MinIO storage as my remote backup, where my training dataset is stored. My .gitlab-ci.yaml looks like this:

stages:
  - cml_run
cml:
  stage: cml_run
  image: dvcorg/cml:0-dvc2-base1-gpu

  script:
  - echo 'Hi from CML' >> report.md
  - apt-get update && apt-get install -y python3-opencv
  - pip3 install -r requirements.txt
  - dvc remote add -d minio_data s3://bucket/dataset/
  - dvc remote modify minio_data endpointurl http://<MINIOSERVER_IP_ADDRESS>:9000
  - dvc remote modify minio_data use_ssl False
  - export AWS_ACCESS_KEY_ID="xxxxxxx"
  - export AWS_SECRET_ACCESS_KEY="xxxxxxx"
  - dvc pull -r minio_data
  - python main.py
  - cml-send-comment report.md --repo=https://<my_gitlab_repo_url>

My setup is configured as follows:

  • A self-hosted GitLab runner listening for jobs (works: Ubuntu 20.04, 2 x RTX 3070 GPUs).
  • An S3 MinIO storage server configured as the DVC remote for local backup (works with my credentials).
  • A training script (works).

My workflow is working, and I am able to train my model on the runner and queue jobs, but I have the following issues with it (maybe there is a better way to do this, hence I am here asking for directions):

  1. For each training job, the entire dataset is pulled from the remote before the model is trained. This is really slow. I need to keep using DVC for data versioning, but is there a way to skip the dataset pull (dvc pull -r minio_data) every time and reuse the same data across training jobs? (Maybe by mounting volumes into the Docker container?)
  2. For MinIO authentication, I do not want to put my credentials, such as AWS_SECRET_ACCESS_KEY, in the .gitlab-ci.yaml file, in case more than one person wants to use this workflow to queue their training jobs in a collaborative environment. What other options do I have?
  3. Is there a way to configure a local container registry cache for the runner (and this workflow), where I can put all the necessary Docker images and let Docker handle them, instead of installing dependencies in the workflow as I am doing now?

Any feedback or suggestions would be appreciated. Thank you.

Cannot publish plot images

Hello,

I am trying to reproduce the test case, but I am unable to include plot images in my report. It throws this error:

"""
{"code":"ERR_INVALID_URL","input":"\r\n<title>400 Bad Request</title>\r\n\r\n400 Bad Requestcloudflare\r\n\r\n\r\n","level":"error","message":"Invalid URL","stack":"TypeError [ERR_INVALID_URL]: Invalid URL\n at new NodeError (node:internal/errors:399:5)\n at new URL (node:internal/url:560:13)\n at uriParam (/usr/local/lib/node_modules/@dvcorg/cml/src/utils.js:149:15)\n at watermarkUri (/usr/local/lib/node_modules/@dvcorg/cml/src/utils.js:140:10)\n at CML.publish (/usr/local/lib/node_modules/@dvcorg/cml/src/cml.js:349:13)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async visitor (/usr/local/lib/node_modules/@dvcorg/cml/src/cml.js:247:26)\n at async Promise.all (index 0)\n at async publishLocalFiles (/usr/local/lib/node_modules/@dvcorg/cml/src/cml.js:261:7)"}
"""

Without images, the report looks good. Here is my cml.yaml file:

name: train-my-model
on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]

    steps:
      - uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.sha }}

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - uses: actions/setup-python@v2
        with:
          python-version: '3.x'

      - name: cml
        env:
          repo_token: ${{ secrets.REPO_SECRET }}
        run: |
          pip install -r requirements.txt

          # Pull dataset with DVC
          #dvc pull data

          # Reproduce pipeline if any changes detected in dependencies
          #dvc repro

          # Use DVC metrics diff to compare metrics to master
          git fetch --prune --unshallow
          dvc metrics diff master >> report.md

          # Add figure to report
          dvc plots diff --target loss.csv --show-vega master > vega.json
          vl2png vega.json -s 1.3 > vega.png
          echo '![](./vega.png)' >> report.md
          cml comment create --pr --publish report.md
