
xvc's Introduction

xvc

codecov build crates.io docs.rs unsafe forbidden

A fast and robust MLOps tool to manage data and pipelines

⌛ When to use xvc?

  • When you have a photo, audio, media, or document collection to back up/version with Git, but don't want to copy that huge data to all Git clones.
  • When you manage a large amount of unstructured data, like images, documents, and audio files.
  • When you want to version data files, and want to track versions across datasets.
  • When you want to store this data in local, SSH-accessible, or S3-compatible cloud storage.
  • When you create data pipelines on top of this data and want to run these pipelines when the data, code, or other dependencies change.
  • When you want to track which subset of the data you're working with, and how your operations change it.
  • When you have binary artifacts that you use as dependencies and would like to have a make alternative that considers content changes rather than timestamps.

โœณ๏ธ What is xvc for?

  • (for x = files) Track large files on Git, store them in the cloud, create view-only subsets, retrieve them only when necessary.
  • (for x = pipelines) Define and run data -> model pipelines whose dependencies may be files, hyperparameters, regex searches, arbitrary URLs, and more.

🔽 Installation

You can get the binary files for Linux, macOS, and Windows from the releases page. Extract and copy the file to your $PATH.
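For example, on Linux (the archive name here is hypothetical; check the releases page for the actual one):

$ curl -sLO https://github.com/iesahin/xvc/releases/latest/download/xvc-linux-musl-x86_64.tar.gz
$ tar -xzf xvc-linux-musl-x86_64.tar.gz
$ cp xvc ~/.local/bin/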

Alternatively, if you have Rust installed, you can build xvc:

$ cargo install xvc

๐Ÿƒ๐Ÿพ Quicktart

Xvc tracks your files and directories on top of Git. To start, run the following command in the repository:

$ git init # if you're not already in a Git repository
Initialized empty Git repository in [CWD]/.git/

$ xvc init

This command initializes the .xvc/ directory and adds a .xvcignore file for specifying paths you want Xvc to ignore.
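An illustrative .xvcignore, assuming the usual .gitignore-style syntax:

# Keep scratch files out of Xvc's view
*.tmp
scratch/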

Include your data files and directories for tracking:

$ xvc file track my-data/ --as symlink

This command calculates content hashes for the data (using BLAKE-3 by default) and records them. The changes are committed to Git, and the files are copied to content-addressed directories within .xvc/b3. Read-only symbolic links to the cached files are then created in the workspace.

You can specify different recheck (checkout) methods for files and directories, depending on your use case. If you need to track model files that change frequently, you can set the recheck method with --as copy (the default):

$ xvc file track my-models/ --as copy
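You can check what is tracked, and how, with xvc file list:

$ xvc file list my-models/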

Configure a cloud storage service to share the files you added:

$ xvc storage new s3 --name my-remote --region us-east-1 --bucket-name my-xvc-remote

You can send the files to this storage.

$ xvc file send --to my-remote

When you (or someone else) want to access these files later, you can clone the Git repository and get the files from the storage.

$ git clone https://example.com/my-machine-learning-project
Cloning into 'my-machine-learning-project'...

$ cd my-machine-learning-project
$ xvc file bring my-data/ --from my-remote

This way, the files are available from the shared storage whenever you need them.

You don't have to reconfigure the storage after cloning, but you need to have valid credentials as environment variables to access the storage. Xvc never stores any credentials.
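For an S3-type storage, for example, the standard AWS environment variables are enough (illustrative values):

$ export AWS_ACCESS_KEY_ID='AKIA...'
$ export AWS_SECRET_ACCESS_KEY='...'
$ xvc file bring my-data/ --from my-remote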

If you have commands that depend on data or code elements, you can configure a pipeline.

For this example, we'll use a Python script to generate a data set of random names with random IQ scores.
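The script could look something like this minimal sketch (the actual generate_data.py may differ; requirements.txt is assumed to list faker):

import csv
import random

from faker import Faker

fake = Faker()

with open("random_names_iq_scores.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for _ in range(1000):
        # fake.name() occasionally produces prefixed names like "Dr. ..."
        writer.writerow([fake.name(), random.randint(60, 150)])

print("CSV file generated successfully.")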

The script uses the Faker library, which must be available wherever you run the pipeline. To make the pipeline repeatable, we start it with a step that installs dependencies:

$ xvc pipeline step new --step-name install-deps --command 'python3 -m pip install --quiet --user -r requirements.txt'

We'll make this step depend on the requirements.txt file, so that when the file changes, the step will run again:

$ xvc pipeline step dependency --step-name install-deps --file requirements.txt

Xvc allows you to create dependencies between pipeline steps. Dependent steps wait for their dependencies to finish successfully.

Now we create a step to run the script and make the install-deps step one of its dependencies:

$ xvc pipeline step new --step-name generate-data --command 'python3 generate_data.py'
$ xvc pipeline step dependency --step-name generate-data --step install-deps

After you define the pipeline, you can run it with:

$ xvc pipeline run
[DONE] install-deps (python3 -m pip install --quiet --user -r requirements.txt)
[OUT] [generate-data] CSV file generated successfully.
 
[DONE] generate-data (python3 generate_data.py)

Xvc allows many kinds of dependencies, like files, groups of files and directories defined by globs, regular expression searches in files, line ranges in files, hyper-parameters defined in YAML, JSON, or TOML files, HTTP URLs, shell command outputs, and other steps.

Suppose you're only interested in the IQ scores of people with Dr. in front of their names, and how they differ from the rest of the dataset we created. Let's create a regex search dependency on the data file that will capture all doctors' IQ scores:

$ xvc pipeline step new --step-name dr-iq --command 'echo "${XVC_REGEX_ADDED_ITEMS}" >> dr-iq-scores.csv '
$ xvc pipeline step dependency --step-name dr-iq --regex-items 'random_names_iq_scores.csv:/^Dr\..*'

The first line specifies a command that, when run, writes the ${XVC_REGEX_ADDED_ITEMS} environment variable to the dr-iq-scores.csv file. The second line specifies the dependency that populates the ${XVC_REGEX_ADDED_ITEMS} environment variable for that command.

Some dependency types, like regex items, line items, and glob items, inject environment variables into the commands of the steps that depend on them. For example, if you have two million files specified with a glob but want to run a script only on the files added since the last run, you can use these environment variables, as in the sketch below.
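For example (the resize-images script is hypothetical, and XVC_GLOB_ADDED_ITEMS is assumed to be the glob-items counterpart of the regex variable above):

$ xvc pipeline step new --step-name thumbnails --command 'resize-images ${XVC_GLOB_ADDED_ITEMS}'
$ xvc pipeline step dependency --step-name thumbnails --glob-items 'images/*.png'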

When you run the pipeline again, a file named dr-iq-scores.csv will be created. Note that, as requirements.txt didn't change, neither the install-deps step nor its dependent generate-data step ran.

$ xvc pipeline run
[DONE] dr-iq (echo "${XVC_REGEX_ADDED_ITEMS}" >> dr-iq-scores.csv )

$ cat dr-iq-scores.csv
Dr. Brian Shaffer,122
Dr. Brittany Chang,82
Dr. Mallory Payne MD,70
Dr. Sherry Leonard,93
Dr. Susan Swanson,81

We are using this feature to get the lines starting with Dr. from the file and write them to another file. When the file changes, e.g., when another record matching the dependency regex is added to random_names_iq_scores.csv, the new record will also be appended to dr-iq-scores.csv.

$ zsh -cl 'echo "Dr. Albert Einstein,144" >> random_names_iq_scores.csv'

$ xvc pipeline run
[DONE] dr-iq (echo "${XVC_REGEX_ADDED_ITEMS}" >> dr-iq-scores.csv )

$ cat dr-iq-scores.csv
Dr. Brian Shaffer,122
Dr. Brittany Chang,82
Dr. Mallory Payne MD,70
Dr. Sherry Leonard,93
Dr. Susan Swanson,81
Dr. Albert Einstein,144

Now we want to add another command that draws a fancy histogram from dr-iq-scores.csv. As this new step must wait for dr-iq-scores.csv to be ready, we define dr-iq-scores.csv as an output of the dr-iq step and set the file as a dependency of the new visualize step.

$ xvc pipeline step output --step-name dr-iq --output-file dr-iq-scores.csv
$ xvc pipeline step new --step-name visualize --command 'python3 visualize.py'
$ xvc pipeline step dependency --step-name visualize --file dr-iq-scores.csv
$ xvc pipeline run
[ERROR] Step visualize finished UNSUCCESSFULLY with command python3 visualize.py
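The run fails in this captured session; for reference, a working visualize.py could be a minimal sketch like the following (assuming matplotlib is installed; the actual script may differ):

import csv

import matplotlib.pyplot as plt

# Read the scores produced by the dr-iq step.
with open("dr-iq-scores.csv") as f:
    scores = [int(row[1]) for row in csv.reader(f) if row]

plt.hist(scores, bins=10)
plt.title("Dr. IQ scores")
plt.xlabel("IQ score")
plt.ylabel("Count")
plt.savefig("dr-iq-histogram.png")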

You can get the pipeline in Graphviz DOT format and convert it to an image:

$ zsh -cl 'xvc pipeline dag | dot -opipeline.png'

You can also export the pipeline to JSON to edit it in your editor:

$ xvc pipeline export --file my-pipeline.json

$ cat my-pipeline.json
{
  "name": "default",
  "steps": [
    {
      "command": "python3 -m pip install --quiet --user -r requirements.txt",
      "dependencies": [
        {
          "File": {
            "content_digest": {
              "algorithm": "Blake3",
              "digest": [
                43,
                86,
                244,
                111,
                13,
                243,
                28,
                110,
                140,
                213,
                105,
                20,
                239,
                62,
                73,
                75,
                13,
                146,
                82,
                17,
                148,
                152,
                66,
                86,
                154,
                230,
                154,
                246,
                213,
                214,
                40,
                119
              ]
            },
            "path": "requirements.txt",
            "xvc_metadata": {
              "file_type": "File",
              "modified": {
                "nanos_since_epoch": [..],
                "secs_since_epoch": [..]
              },
              "size": 14
            }
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "install-deps",
      "outputs": []
    },
    {
      "command": "python3 generate_data.py",
      "dependencies": [
        {
          "Step": {
            "name": "install-deps"
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "generate-data",
      "outputs": []
    },
    {
      "command": "echo /"${XVC_REGEX_ADDED_ITEMS}/" >> dr-iq-scores.csv ",
      "dependencies": [
        {
          "RegexItems": {
            "lines": [
              "Dr. Brian Shaffer,122",
              "Dr. Susan Swanson,81",
              "Dr. Brittany Chang,82",
              "Dr. Mallory Payne MD,70",
              "Dr. Sherry Leonard,93",
              "Dr. Albert Einstein,144"
            ],
            "path": "random_names_iq_scores.csv",
            "regex": "^Dr//..*",
            "xvc_metadata": {
              "file_type": "File",
              "modified": {
                "nanos_since_epoch": [..],
                "secs_since_epoch": [..]
              },
              "size": 19021
            }
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "dr-iq",
      "outputs": [
        {
          "File": {
            "path": "dr-iq-scores.csv"
          }
        }
      ]
    },
    {
      "command": "python3 visualize.py",
      "dependencies": [
        {
          "File": {
            "content_digest": null,
            "path": "dr-iq-scores.csv",
            "xvc_metadata": null
          }
        }
      ],
      "invalidate": "ByDependencies",
      "name": "visualize",
      "outputs": []
    }
  ],
  "version": 1,
  "workdir": ""
}

You can edit the file to change commands, add new dependencies, etc., and import it back into Xvc:

$ xvc pipeline import --file my-pipeline.json --overwrite

Lastly, if you find the commands too long to type, there is an xvc aliases command that prints a set of aliases for them. You can source the output in your .zshrc or .bashrc and use the short forms instead, e.g., xvc pipeline run becomes pvc run.

$ xvc aliases

alias xls='xvc file list'
alias pvc='xvc pipeline'
alias fvc='xvc file'
alias xvcf='xvc file'
alias xvcft='xvc file track'
alias xvcfl='xvc file list'
alias xvcfs='xvc file send'
alias xvcfb='xvc file bring'
alias xvcfh='xvc file hash'
alias xvcfco='xvc file checkout'
alias xvcfr='xvc file recheck'
alias xvcp='xvc pipeline'
alias xvcpr='xvc pipeline run'
alias xvcps='xvc pipeline step'
alias xvcpsn='xvc pipeline step new'
alias xvcpsd='xvc pipeline step dependency'
alias xvcpso='xvc pipeline step output'
alias xvcpi='xvc pipeline import'
alias xvcpe='xvc pipeline export'
alias xvcpl='xvc pipeline list'
alias xvcpn='xvc pipeline new'
alias xvcpu='xvc pipeline update'
alias xvcpd='xvc pipeline dag'
alias xvcs='xvc storage'
alias xvcsn='xvc storage new'
alias xvcsl='xvc storage list'
alias xvcsr='xvc storage remove'
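In zsh or bash, for example, you can source the output directly and use the short forms:

$ source <(xvc aliases)
$ pvc run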

Please create an issue or discussion for any other kinds of dependencies that you'd like to be included.

I'm planning to add data label and annotation tracking, experiment tracking, model tracking, an encrypted cache, a server to control all commands from a web interface, and more as my time permits.

Please check docs.xvc.dev for documentation.

🤟 Big Thanks

xvc stands on the following (giant) crates:

  • trycmd is used to run all example commands in this file, the reference, and the how-to documentation at every PR. It makes sure that the documentation is always up to date and that the commands shown work as described. Thanks to trycmd, we start development by writing documentation and then implementing it.

  • serde allows all data structures to be stored in text files. Special thanks from xvc-ecs for serializing components in an ECS with a single line of code.

  • Xvc processes files in parallel with pipelines and parallel iterators thanks to crossbeam and rayon.

  • Thanks to strum, Xvc uses enums extensively and converts almost everything to typed values from strings.

  • Xvc has a deep CLI with subcommands of subcommands (e.g., xvc storage new s3), and all of these work with minimal bugs thanks to clap.

  • Xvc uses rust-s3 to connect to S3 and compatible storage services. It employs excellent tokio for fast async Rust. These cloud storage features can be turned off thanks to Rust conditional compilation.

  • Without the implementations of BLAKE3, BLAKE2, SHA-2, and SHA-3 from the Rust crypto crates, Xvc couldn't detect file changes so fast.

  • Many thanks to the small and well-built crates reflink, relative-path, path-absolutize, and glob for file system and glob handling.

  • Thanks to sad_machine for providing the state machine implementation used in xvc pipeline run. A DAG composed of state machines makes it possible to run pipeline steps in parallel with a clean separation of process states.

  • Thanks to thiserror and anyhow for making error handling a breeze. These two crates make me feel I'm doing something good for humanity when handling errors.

  • Xvc is split into many crates and owes this organization to cargo workspaces.

And, biggest thanks to the Rust designers, developers, and contributors. Although I can't claim the expertise to appreciate it all, it's a fabulous language and environment to work with.

๐Ÿš Support

  • You can use Discussions to ask questions. I'll answer as much as possible. Thank you.
  • I don't follow any other sites regularly. You can also reach me at [email protected]

๐Ÿ‘ Contributing

  • Star this repo. I feel very happy for every star and send my best wishes to you. Spending two seconds on a star is a certain win for me. Thanks.
  • Use xvc. Tell me how it works for you, read the documentation, report bugs, discuss features.
  • Please note that I don't accept large code PRs. Please open an issue to discuss your idea and write/modify a reference page before sending a PR. I'm happy to discuss your idea and help you implement it. Also, it may require a copyright transfer to me, as there may be cases where I provide the code under other licenses.

📜 License

Xvc is licensed under the GNU GPL 3.0 License. If you want to use the code in your project with other licenses, please contact me.

๐ŸŒฆ๏ธ Future and Maintenance

I'm using Xvc daily and I'm happy with it. Tracking all my files with Git via arbitrary servers and cloud providers is something I always need. I'm happy to improve and maintain it as long as I use it.

Given that I've been working on this for the last two years for pure technical bliss, you can expect me to work on it more.

โš ๏ธ Disclaimer

This software is fresh and ambitious. Although I use it and test it close to real-world conditions, it hasn't yet stood the test of time. Xvc can eat your files and spit them into the eternal void! Please take backups.


xvc's Issues

Add storage tests to GitHub Actions

Remote tests use feature flags to prevent them from running every time the test suite runs.

These tests could be run manually, either per remote or all together.

  • Enter all relevant credentials as secrets to GitHub Actions
  • Create a new CI file that has separate testing for each of the remotes, via matrix

Add a benchmark script to compare Xvc with other tools

The script should have 3 scenarios.

  • Many small files. Generate 100,000 files of 10 KB each and measure git annex add, git lfs track '*.ext' ; git add, and dvc add.
  • Many larger files. Generate 1,000 files of 1 MB each and measure the same commands.
  • A small number of very large files. Generate 10 files of 1 GB each and measure the same commands.

The script should output to a file as well.

Add hierarchical credentials for S3 based storages

What I mean is something like:

For example, for S3:

  • If there is an environment variable XVC_STORAGE_ACCESS_KEY_ID_remote1, it takes priority over other credentials.
  • Q: Should we continue to support XVC_STORAGE_ACCESS_KEY_ID (without the remote1 per-storage suffix)? I think for multiple storage types this can cause difficulties. If I want to access GCS with its credentials and S3 with its credentials, having a single environment variable makes everything complex. OTOH, if there is a single remote storage, that might work. But for a single storage, the above scheme also works, so there is no need for a common credentials scheme for all storage types. If the user wants to set credentials for remote1 specifically, they can do so using the above environment variable, as sketched below.
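A sketch of the proposed resolution order, with illustrative values:

# Lookup order for a storage named remote1 (a sketch of this proposal):
#   1. XVC_STORAGE_ACCESS_KEY_ID_remote1   (per-storage, highest priority)
#   2. XVC_STORAGE_ACCESS_KEY_ID           (generic, if it stays supported)
#   3. Provider defaults, e.g., AWS_ACCESS_KEY_ID for S3
$ export XVC_STORAGE_ACCESS_KEY_ID_remote1='AKIA...'
$ xvc file send --to remote1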

Clean up xvc crate dependencies

There is no point in making it depend on all the others.

The current dependency graph, where lower-level modules are used directly, is this:

graph TD 

xvc --> xvc-file
xvc --> xvc-pipeline
xvc --> xvc-storage

xvc-file --> xvc-config
xvc-file --> xvc-core
xvc-file --> xvc-ecs
xvc-file --> xvc-logging
xvc-file --> xvc-walker
xvc-file --> xvc-storage

xvc-storage --> xvc-config
xvc-storage --> xvc-core
xvc-storage --> xvc-ecs
xvc-storage --> xvc-logging
xvc-storage --> xvc-walker


xvc-pipeline --> xvc-config 
xvc-pipeline --> xvc-core
xvc-pipeline --> xvc-ecs
xvc-pipeline --> xvc-logging
xvc-pipeline --> xvc-walker

xvc-config --> xvc-walker
xvc-config --> xvc-logging

xvc-ecs --> xvc-logging

xvc-core --> xvc-config
xvc-core --> xvc-logging
xvc-core --> xvc-walker
xvc-core --> xvc-ecs

xvc-walker --> xvc-logging

After the interfaces are stabilized, all lower-level interfaces will be used through xvc-core.
In this case, the graph will be simplified like this:

graph TD 

xvc --> xvc-file
xvc --> xvc-pipeline
xvc --> xvc-storage

xvc-file --> xvc-core
xvc-file --> xvc-storage

xvc-pipeline --> xvc-core

xvc-storage --> xvc-core

xvc-config --> xvc-walker
xvc-config --> xvc-logging

xvc-ecs --> xvc-logging

xvc-core --> xvc-config
xvc-core --> xvc-logging
xvc-core --> xvc-walker
xvc-core --> xvc-ecs

xvc-walker --> xvc-logging

All functionality of the base packages should be made available through an internal API in xvc-core.

Update Readme.md

There are various leftovers from #2

`xvc storage new gcs`

  • get credentials for testing
  • test credentials for s3cmd
  • create a test for xvc storage new gcs in workflow tests
  • fix command handler

Skip init in `xvc storage new` commands when `.xvc-guid` file is present

  • #123
  • Skip init when storage dir contains .xvc-guid file for xvc storage new s3
  • Skip init when storage dir contains .xvc-guid file for xvc storage new minio
  • Skip init when storage dir contains .xvc-guid file for xvc storage new wasabi
  • Skip init when storage dir contains .xvc-guid file for xvc storage new r2
  • Skip init when storage dir contains .xvc-guid file for xvc storage new digital-ocean
  • Skip init when storage dir contains .xvc-guid file for xvc storage new gcs
  • Skip init when storage dir contains .xvc-guid file for xvc storage new rsync

Create a website for Xvc

We need a website for Xvc

  • Select and register a domain
  • Select a theme
  • Write a bit of an introductory content
  • Use netlify to publish the site

Domain

There are several options for this:

  • use a subdomain like xvc.emresult.com. We don't have content in emresult.com yet.
  • buy xvc dot dev. It's expensive for a free tool, but it may pay off for the rest of the tool's life.
  • buy something cheaper like getxvc or trustxvc dot com. This is a secondary option.

I think I can buy the dev domain; it will look more professional.

Website

I'm using Zola for other sites and for the time being, I can just make a
landing page with it.

Some theme options:

I think making this a proper site from the beginning with adidoks might be
better. Until the publish phase, Juice may work too. It will only link to the
book.

Add `--generic-command` as a dependency type to `xvc pipeline dependency`

When a step has a --generic-command as a dependency, the command is run and its output is hashed. This digest is used to decide whether the dependency is invalidated.

The user can add any kind of command, e.g., ls -R or s3cmd ls s3://bucket-name, as a dependency. When the output of that command changes, the step is considered invalidated.
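For example, with the flag proposed above (hypothetical until implemented):

$ xvc pipeline step new --step-name list-bucket --command 'echo "bucket contents changed"'
$ xvc pipeline step dependency --step-name list-bucket --generic-command 's3cmd ls s3://bucket-name'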

`xvc aliases` for creating aliases for common shells

We have commands at least two levels deep, and they get long pretty quickly. A set of aliases can be written to a shell script, like:

alias xvcf='xvc file'
alias xvcft='xvc file track'
alias xvcfl='xvc file list'
alias xvcfs='xvc file send'
alias xvcfb='xvc file bring'
alias xls='xvc file list'
alias pvc='xvc pipeline'
alias xvcp='xvc pipeline'
alias xvcpr='xvc pipeline run'
alias xvcps='xvc pipeline step'
alias xvcpsn='xvc pipeline step new'
alias xvcpsd='xvc pipeline step dependency'
alias xvcpso='xvc pipeline step output'
alias xvcs='xvc storage'
alias xvcsn='xvc storage new'
alias xvcsl='xvc storage list'
alias xvcsr='xvc storage remove'

....

These can be written to ~/.xvc_aliases and can be sourced from .zshrc or .bashrc.

Add profile support to s3 credentials

When all credentials in s3::credentials::Credentials::new are left out, it means the default profile. I think we can check this at the CLI level.

  • If no AWS_... environment keys are present and no profile is set in the options, we use the default profile.
  • If no AWS_... environment keys are present but a profile is set in the options, we use that profile.
  • If AWS_... environment keys are present and a profile is also set, we use the profile.
  • If no profile is set and AWS_... keys are present, we use those keys.

This should be similar to the aws CLI's behavior.

Add auto-commit functionality to stores and ec after operations

This should be on by default, meaning all operations that affect the .xvc folder should be committed automatically.

These can be configured with the config options,

core.auto-stage = True
core.auto-commit = True

These can be handled at the CLI level. After a command is run, if there are changes in .xvc/, they can be staged and committed to Git.

If there are other staged changes, they should be stashed and reapplied. There is git stash push --staged (available since Git 2.35) for this.

So the workflow is:

if auto-commit:
   git stash push --staged
   git add .xvc
   git commit -m "Xvc changes after ..."
   git stash pop
if auto-stage:
   # we don't need to stash, simply add the changes
   git add .xvc 

These can be handled at the CLI parsing level. We can add the command line options to the commit message that way.

Rename `xvc remote` to `xvc storage`

  • Rename XvcRemote to XvcStorage
  • Any struct name that contains XvcRemote will be renamed
  • Move directory to storage
  • Rename the crate
  • Update reference help text script
  • Update references
