Comments (4)

DavidGOrtega commented on August 17, 2024

👋 @niqbal996

> For each training job, the entire dataset is pulled from the remote and then the model is trained. This is really slow. I am required to keep using DVC for data versioning, but is there a way to skip the dataset pull (dvc pull -r minio_data) every time and reuse the same data across training jobs? (Maybe by mounting volumes into the Docker container?)

Why not download the dataset once into a folder on the machine that the local runners can access?
If you start the runners with Docker (which I recommend, as you are doing), you need to mount that folder as a volume.

> For MinIO authentication, I do not want to put my credentials, such as AWS_SECRET_ACCESS_KEY, in the .gitlab-ci.yaml

You need to set them up as secrets in GitLab, where they are called CI/CD variables. You can find them under Settings -> CI/CD.
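
As a minimal sketch of how a job script could rely on those CI/CD variables instead of values written into .gitlab-ci.yaml, the helper below fails early when a variable is missing. The name require_vars is hypothetical and not part of CML or DVC. It also assumes that minio_data is an S3-type DVC remote, which picks up AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment.

```shell
# require_vars: hypothetical helper, not part of CML or DVC.
# Returns non-zero and names the first environment variable that is unset or empty.
require_vars() {
    for name in "$@"; do
        # printenv prints nothing for variables that are not exported
        if [ -z "$(printenv "$name")" ]; then
            echo "missing CI variable: $name" >&2
            return 1
        fi
    done
}
```

In the job script this could run just before the pull, for example: require_vars AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY && dvc pull -r minio_data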

> Is there a way to configure a local container registry cache for the runner (and this workflow) where I can put all the necessary Docker images and use them, instead of adding dependencies to the workflow as I am doing now, and let Docker handle it?

Of course. You could:

  • extend our docker image installing your stack
  • create a docker image with your stack, installing dvc and cml

Once you build it, the image is available locally. You can also publish it to Docker Hub to preserve it.

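# Dockerfile: extend the CML image with your own stack
FROM iterativeai/cml:0-dvc2-base1-gpu
RUN pip install 'your-libraries'

# Build the image (tags must be lowercase; "." is the build context that contains the Dockerfile)
docker build -t mycml .
# Start the runner with the dataset folder mounted as a volume
docker run --name myrunner -d --gpus all \
    -e RUNNER_IDLE_TIMEOUT=300 \
    -e RUNNER_LABELS=cml \
    -e RUNNER_REPO="$my_repo_url" \
    -e repo_token="$my_repo_token" \
    -v /path/myhugedataset:/myhugedataset \
    mycml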
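
One way a job inside that container might reuse data between runs is to point the DVC cache at the mounted volume, so objects fetched by one job are still there for the next. This is only a sketch: use_shared_cache is a hypothetical helper, and the cache path under /myhugedataset is an assumption.

```shell
# use_shared_cache: hypothetical helper, not part of CML or DVC.
# $1: cache directory on the mounted volume (assumed path, e.g. /myhugedataset/dvc-cache).
use_shared_cache() {
    cache_dir="$1"
    # Create the directory on the volume if this is the first job to use it.
    mkdir -p "$cache_dir" || return 1
    # --local stores the setting in .dvc/config.local, which is not committed to Git.
    dvc cache dir --local "$cache_dir"
}
```

Called from the repository root before pulling, for example: use_shared_cache /myhugedataset/dvc-cache && dvc pull -r minio_data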

from cml_dvc_case.

lenaherrmann-dfki commented on August 17, 2024

Hey folks,

If the data is pre-downloaded locally, might there be a way to check if there have been changes to the data?

If I understood CML correctly, it is supposed to automate the ML pipeline. So downloading the data only when needed might be a way to save some time. Could this be achieved with some git-like commands in DVC?

DavidGOrtega commented on August 17, 2024

> If I understood CML correctly, it is supposed to automate the ML pipeline. So downloading the data only when needed might be a way to save some time. Could this be achieved with some git-like commands in DVC?

👋 @lenaherrmann-dfki the thing is that CML helps you launch GPU runners on many vendors (AWS, Azure and GCP). Those runners can be ephemeral, launched only when you need to train (to save costs), and they need to access that data. We are designing volumes to make this lightweight, but a big part of this responsibility resides in DVC.

casperdcl commented on August 17, 2024

> If the data is pre-downloaded locally, might there be a way to check if there have been changes to the data?

@lenaherrmann-dfki you can track the data with DVC and a "local remote"

# setup
cd myrepo
# assuming `git init && dvc init` already done
cp -R /predownloaded/data .
dvc add ./data
git add data.dvc .gitignore
# commit the pointer file so that `git push` has something to publish
git commit -m "Track data with DVC"
dvc remote add --local localcache /predownloaded/cache
dvc push -r localcache
git push

Now, in the future, you can:

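cd myrepo
# only needed in a fresh clone: --local settings are not committed to Git
dvc remote add --local localcache /predownloaded/cache
dvc pull -r localcache
echo "new stuff" >> ./data/new
dvc add ./data
git add data.dvc
# commit the updated pointer file before pushing
git commit -m "Update data"
dvc push -r localcache
git push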
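
To check for changes before transferring anything, a job could ask DVC to compare the local cache with the remote and then pull. The sketch below assumes the localcache remote from above; sync_data is a hypothetical wrapper, not a DVC command. It uses dvc status --cloud, which reports objects that differ between the local cache and the remote, followed by dvc pull, which only transfers objects missing from the cache.

```shell
# sync_data: hypothetical wrapper, not part of DVC.
# $1: name of the DVC remote to compare against and pull from.
sync_data() {
    remote="$1"
    # List objects that differ between the local cache and the remote.
    dvc status --cloud --remote "$remote" || return 1
    # Fetch and check out only what is missing locally.
    dvc pull --remote "$remote"
}
```

For example: sync_data localcache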
