
turnkeyml's Introduction


Open Neural Network Exchange (ONNX) is an open ecosystem that empowers AI developers to choose the right tools as their project evolves. ONNX provides an open source format for AI models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types. Currently we focus on the capabilities needed for inferencing (scoring).

ONNX is widely supported and can be found in many frameworks, tools, and hardware. Enabling interoperability between different frameworks and streamlining the path from research to production helps increase the speed of innovation in the AI community. We invite the community to join us and further evolve ONNX.

Use ONNX

Learn about the ONNX spec

Programming utilities for working with ONNX Graphs

Contribute

ONNX is a community project and the open governance model is described here. We encourage you to join the effort and contribute feedback, ideas, and code. You can participate in the Special Interest Groups and Working Groups to shape the future of ONNX.

Check out our contribution guide to get started.

If you think some operator should be added to ONNX specification, please read this document.

Community meetings

The schedules of the regular meetings of the Steering Committee, the working groups and the SIGs can be found here

Community Meetups are held at least once a year. Content from previous community meetups is available online.

Discuss

We encourage you to open Issues, or use Slack for more real-time discussion (if you have not joined yet, use the invitation link to join the group).

Follow Us

Stay up to date with the latest ONNX news. [Facebook] [Twitter]

Roadmap

A roadmap process takes place every year. More details can be found here

Installation

Official Python packages

ONNX release packages are published on PyPI.

pip install onnx  # or pip install onnx[reference] for optional reference implementation dependencies

ONNX weekly packages are published in PyPI to enable experimentation and early testing.
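
For example, assuming the weekly package keeps its historical name on PyPI:

pip install onnx-weekly  # hypothetical example; substitute the current weekly package name if it differs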

vcpkg packages

onnx is on the maintenance list of vcpkg, so you can easily use vcpkg to build and install it.

git clone https://github.com/microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.bat # For powershell
./bootstrap-vcpkg.sh # For bash
./vcpkg install onnx

Conda packages

A binary build of ONNX is available from Conda, in conda-forge:

conda install -c conda-forge onnx

Build ONNX from Source

Before building from source, uninstall any existing versions of ONNX: pip uninstall onnx.

A C++17-capable (or newer) compiler is required to build ONNX from source. Users can still specify their own CMAKE_CXX_STANDARD version for building ONNX.
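
For example, a sketch of overriding the standard through the CMAKE_ARGS environment variable, passed the same way as the other -D options in this guide:

export CMAKE_ARGS="-DCMAKE_CXX_STANDARD=17"  # Linux/macOS; use `set` on Windows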

If you don't have Protobuf installed, ONNX will internally download and build Protobuf as part of the ONNX build.

Alternatively, you can manually install the Protobuf C/C++ libraries and tools at a specific version before proceeding. Then, depending on how you installed Protobuf, set the CMAKE_ARGS environment variable to "-DONNX_USE_PROTOBUF_SHARED_LIBS=ON" or "-DONNX_USE_PROTOBUF_SHARED_LIBS=OFF". For example, you may need to run the following command:

Linux:

export CMAKE_ARGS="-DONNX_USE_PROTOBUF_SHARED_LIBS=ON"

Windows:

set CMAKE_ARGS="-DONNX_USE_PROTOBUF_SHARED_LIBS=ON"

Whether to use ON or OFF depends on the kind of Protobuf library you have and how it was built: shared libraries are files ending in *.dll/*.so/*.dylib, while static libraries are files ending in *.a/*.lib. The option defaults to OFF, so you don't need to run the commands above if you prefer to use a static Protobuf library.

Windows

If you are building ONNX from source, it is recommended that you also build Protobuf locally as a static library. The version distributed with conda-forge is a DLL, but ONNX expects it to be a static library. Building protobuf locally also lets you control the version of protobuf. The tested and recommended version is 3.21.12.

The instructions in this README assume you are using Visual Studio. It is recommended that you run all the commands from a shell started from "x64 Native Tools Command Prompt for VS 2019" and keep the build system generator for cmake (e.g., cmake -G "Visual Studio 16 2019") consistent while building protobuf as well as ONNX.

You can get protobuf by running the following commands:

git clone https://github.com/protocolbuffers/protobuf.git
cd protobuf
git checkout v21.12
cd cmake
cmake -G "Visual Studio 16 2019" -A x64 -DCMAKE_INSTALL_PREFIX=<protobuf_install_dir> -Dprotobuf_MSVC_STATIC_RUNTIME=OFF -Dprotobuf_BUILD_SHARED_LIBS=OFF -Dprotobuf_BUILD_TESTS=OFF -Dprotobuf_BUILD_EXAMPLES=OFF .
msbuild protobuf.sln /m /p:Configuration=Release
msbuild INSTALL.vcxproj /p:Configuration=Release

Protobuf will then be built as a static library and installed to <protobuf_install_dir>. Please add the bin directory (which contains protoc.exe) to your PATH.

set CMAKE_PREFIX_PATH=<protobuf_install_dir>;%CMAKE_PREFIX_PATH%

Please note: if your protobuf_install_dir contains spaces, do not add quotation marks around it.

Alternative: if you don't want to change your PATH, you can set ONNX_PROTOC_EXECUTABLE instead.

set CMAKE_ARGS=-DONNX_PROTOC_EXECUTABLE=<full_path_to_protoc.exe>

Then you can build ONNX as:

git clone https://github.com/onnx/onnx.git
cd onnx
git submodule update --init --recursive
# prefer lite proto
set CMAKE_ARGS=-DONNX_USE_LITE_PROTO=ON
pip install -e .

Linux

First, you need to install protobuf. The minimum Protobuf compiler (protoc) version required by ONNX is 3.6.1. Please note that old protoc versions might not work with CMAKE_ARGS=-DONNX_USE_LITE_PROTO=ON.

Ubuntu 20.04 (and newer) users may choose to install protobuf via

apt-get install python3-pip python3-dev libprotobuf-dev protobuf-compiler

In this case, it is required to add -DONNX_USE_PROTOBUF_SHARED_LIBS=ON to CMAKE_ARGS in the ONNX build step.

A more general way is to build and install it from source. See the instructions below for more details.

Installing Protobuf from source

Debian/Ubuntu:

  git clone https://github.com/protocolbuffers/protobuf.git
  cd protobuf
  git checkout v21.12
  git submodule update --init --recursive
  mkdir build_source && cd build_source
  cmake ../cmake -Dprotobuf_BUILD_SHARED_LIBS=OFF -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_SYSCONFDIR=/etc -DCMAKE_POSITION_INDEPENDENT_CODE=ON -Dprotobuf_BUILD_TESTS=OFF -DCMAKE_BUILD_TYPE=Release
  make -j$(nproc)
  make install

CentOS/RHEL/Fedora:

  git clone https://github.com/protocolbuffers/protobuf.git
  cd protobuf
  git checkout v21.12
  git submodule update --init --recursive
  mkdir build_source && cd build_source
  cmake ../cmake  -DCMAKE_INSTALL_LIBDIR=lib64 -Dprotobuf_BUILD_SHARED_LIBS=OFF -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_SYSCONFDIR=/etc -DCMAKE_POSITION_INDEPENDENT_CODE=ON -Dprotobuf_BUILD_TESTS=OFF -DCMAKE_BUILD_TYPE=Release
  make -j$(nproc)
  make install

Here "-DCMAKE_POSITION_INDEPENDENT_CODE=ON" is crucial. By default static libraries are built without "-fPIC" flag, they are not position independent code. But shared libraries must be position independent code. Python C/C++ extensions(like ONNX) are shared libraries. So if a static library was not built with "-fPIC", it can't be linked to such a shared library.

Once the build succeeds, update your PATH to include the Protobuf paths.
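
For example, you can sanity-check that the freshly installed protoc is the one found on your PATH before building ONNX:

which protoc      # should point at the Protobuf you just installed (e.g., /usr/bin/protoc)
protoc --version  # should print libprotoc 3.21.12 for the v21.12 checkout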

Then you can build ONNX as:

git clone https://github.com/onnx/onnx.git
cd onnx
git submodule update --init --recursive
# Optional: prefer lite proto
export CMAKE_ARGS=-DONNX_USE_LITE_PROTO=ON
pip install -e .

Mac

export NUM_CORES=`sysctl -n hw.ncpu`
brew update
brew install autoconf && brew install automake
wget https://github.com/protocolbuffers/protobuf/releases/download/v21.12/protobuf-cpp-3.21.12.tar.gz
tar -xvf protobuf-cpp-3.21.12.tar.gz
cd protobuf-3.21.12
mkdir build_source && cd build_source
cmake ../cmake -Dprotobuf_BUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON -Dprotobuf_BUILD_TESTS=OFF -DCMAKE_BUILD_TYPE=Release
make -j${NUM_CORES}
make install

Once the build succeeds, update your PATH to include the Protobuf paths.
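
For example, assuming Protobuf was installed to the default /usr/local prefix (adjust the path if you set a different CMAKE_INSTALL_PREFIX):

export PATH="/usr/local/bin:$PATH"
protoc --version  # should print libprotoc 3.21.12 for the v21.12 tarball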

Then you can build ONNX as:

git clone --recursive https://github.com/onnx/onnx.git
cd onnx
# Optional: prefer lite proto
export CMAKE_ARGS=-DONNX_USE_LITE_PROTO=ON
pip install -e .

Verify Installation

After installation, run

python -c "import onnx"

to verify it works.
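
For a slightly stronger smoke test, you can also print the installed version:

python -c "import onnx; print(onnx.__version__)"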

Common Build Options

For the full list, refer to CMakeLists.txt.

Environment variables

  • USE_MSVC_STATIC_RUNTIME should be 1 or 0, not ON or OFF. When set to 1, ONNX links statically to the runtime library. Default: USE_MSVC_STATIC_RUNTIME=0

  • DEBUG should be 0 or 1. When set to 1, ONNX is built in debug mode. For debug versions of the dependencies, you need to open the CMakeLists file and append a letter d at the end of the package name lines. For example, NAMES protobuf-lite would become NAMES protobuf-lited. Default: DEBUG=0. A sketch of a debug build invocation follows this list.
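
A sketch of a debug build from a source checkout, assuming the environment variables above are picked up by the build as described:

set DEBUG=1      # Windows; use `export DEBUG=1` on Linux/macOS
pip install -e .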

CMake variables

  • ONNX_USE_PROTOBUF_SHARED_LIBS should be ON or OFF. Default: ONNX_USE_PROTOBUF_SHARED_LIBS=OFF, USE_MSVC_STATIC_RUNTIME=0. ONNX_USE_PROTOBUF_SHARED_LIBS determines how ONNX links to the Protobuf libraries.

    • When set to ON - onnx will dynamically link to protobuf shared libs, PROTOBUF_USE_DLLS will be defined as described here, Protobuf_USE_STATIC_LIBS will be set to OFF and USE_MSVC_STATIC_RUNTIME must be 0.
    • When set to OFF - onnx will link statically to protobuf, and Protobuf_USE_STATIC_LIBS will be set to ON (to force the use of the static libraries) and USE_MSVC_STATIC_RUNTIME can be 0 or 1.
  • ONNX_USE_LITE_PROTO should be ON or OFF. When set to ON onnx uses lite protobuf instead of full protobuf. Default: ONNX_USE_LITE_PROTO=OFF

  • ONNX_WERROR should be ON or OFF. When set to ON warnings are treated as errors. Default: ONNX_WERROR=OFF in local builds, ON in CI and release pipelines.

Common Errors

  • Note: the import onnx command does not work from the source checkout directory; in this case you'll see ModuleNotFoundError: No module named 'onnx.onnx_cpp2py_export'. Change into another directory to fix this error.

  • If you run into any issues while building Protobuf as a static library, please ensure that shared Protobuf libraries, like libprotobuf, are not installed on your device or in the conda environment. If these shared libraries exist, either remove them to build Protobuf from source as a static library, or skip the Protobuf build from source to use the shared version directly.

  • If you run into any issues while building ONNX from source, and your error message reads, Could not find pythonXX.lib, ensure that you have consistent Python versions for common commands, such as python and pip. Clean all existing build files and rebuild ONNX again.

Testing

ONNX uses pytest as its test driver. In order to run tests, you will first need to install pytest:

pip install pytest nbval

After installing pytest, use the following command to run tests.

pytest

Development

Check out the contributor guide for instructions.

License

Apache License v2.0

Code of Conduct

ONNX Open Source Code of Conduct

turnkeyml's People

Contributors

danielholanda, dependabot[bot], jeremyfowers, pcolange, ramkrishna2910


turnkeyml's Issues

Proposal: Benchmark a build directory

Today, if I want to benchmark a model, I do turnkey model.py. This requires analysis of model.py, which can take a lot of time relative to the actual benchmark.

If a model is pre-built / pre-compiled, there is no value from running analysis again, so that time is wasted.

We could save a large % of time by directly benchmarking the pre-built build directory. This could look like turnkey benchmark ~/.cache/turnkey/*.

Such a feature would shave many hours off of mass-benchmarking timm.

Goals

  • Benchmark an entire cache directory of prebuilt models in one shot
  • Don't spend any time on analysis
  • Preserve the compilation stats in turnkey_stats.yaml

Implementation Proposal 1

Enable turnkey benchmark ~/.cache/turnkey/*. This would make build directories a valid input to turnkey benchmark INPUT_FILES. At the models API level we have a switch that bypasses build_model() when the input is a build directory and goes straight into benchmarking

Implementation Proposal 2

Create a new command turnkey cache benchmark BUILD_DIRS [--all]

cc @danielholanda

Fix the transformers models

The labels spec for turnkey says that the # labels: LABELS line must be the first line of its model script. However, in the models/transformers corpus, every model script has an empty line first, and then the labels are on the second line.

This bug means that no labels are ever ingested for any of the transformers models.

  • delete the empty line from the top of all the files (don't forget the files in skip/).
  • set the torch seed in each model
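
For reference, a minimal sketch of the expected layout after the fix (the label values are placeholders, not the real spec contents):

# labels: LABELS      <- must be the very first line of the script, with no blank line above it
import torch          # rest of the model script follows
torch.manual_seed(0)  # set the torch seed as described above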

Analysis of model scripts finds two models

Example reproduction:

(tkml) jfowers@LAPTOP-5VK03G46:~/onnxmodelzoo/toolchain/models/torchvision/skip$ turnkey ssdlite320_mobilenet_v3_large.py --analyze-only

Models discovered during profiling:

ssdlite320_mobilenet_v3_large.py:
        model (executed 1x - 0.06s)
                Model Type:     Pytorch (torch.nn.Module)
                Class:          SSDLiteFeatureExtractorMobileNet (<class 'torchvision.models.detection.ssdlite.SSDLiteFeatureExtractorMobileNet'>)
                Location:       /home/jfowers/miniconda3/envs/tkml/lib/python3.8/site-packages/torchvision/models/detection/_utils.py, line 461
                Parameters:     3,546,992 (13.53 MB)
                Input Shape:    'Positional Arg 1': (1, 3, 320, 320)
                Hash:           7b1ed851
                Build dir:      /home/jfowers/.cache/turnkey/ssdlite320_mobilenet_v3_large_torchvision_40d8a795

        model (executed 1x - 0.09s)
                Model Type:     Pytorch (torch.nn.Module)
                Class:          SSD (<class 'torchvision.models.detection.ssd.SSD'>)
                Location:       /home/jfowers/miniconda3/envs/tkml/lib/python3.8/site-packages/torchvision/models/detection/ssdlite.py, line 331
                Parameters:     5,198,540 (19.83 MB)
                Input Shape:    'images': (1, 3, 224, 224)
                Hash:           40d8a795
                Build dir:      /home/jfowers/.cache/turnkey/ssdlite320_mobilenet_v3_large_torchvision_40d8a795


Woohoo! The 'benchmark' command is complete.

Desired behavior: Only one model should be discovered.

Full list of model scripts that have this behavior:

  • torchvision/ssdlite320_mobilenet_v3_large.py
  • rename toolchain/models/timm/{ => skip}/vit_base_r26_s32_224.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_base_r50_s16_224.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_base_r50_s16_224_in21k.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_base_r50_s16_384.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_base_resnet26d_224.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_base_resnet50_224_in21k.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_base_resnet50_384.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_base_resnet50d_224.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_large_r50_s32_224.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_large_r50_s32_224_in21k.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_large_r50_s32_384.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_small_r26_s32_224.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_small_r26_s32_224_in21k.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_small_r26_s32_384.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_small_resnet26d_224.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_small_resnet50d_s16_224.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_tiny_r_s16_p8_224.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_tiny_r_s16_p8_224_in21k.py (100%)
  • rename toolchain/models/timm/{ => skip}/vit_tiny_r_s16_p8_384.py (100%)
  • rename toolchain/models/torch_hub/{ => skip}/midas_v3_hybrid.py (100%)
  • rename toolchain/models/torchvision/{ => skip}/ssd300_vgg16.py (100%)
  • rename toolchain/models/transformers/{ => skip}/distil_wav2vec2_for_audio_classification.py (100%)
  • rename toolchain/models/transformers/{ => skip}/distilhubert_for_audio_classification.py (100%)
  • rename toolchain/models/transformers/{ => skip}/speech_encoder_decoder.py (100%)

cc @danielholanda

Graph convolution models don't handle --pretrained

None of the models/graph_convolutions handle --pretrained

There are two types of bugs here:

  • scripts that use argparse but don't support --pretrained, and crash when --pretrained is used
  • scripts that don't use argparse and therefore ignore --pretrained, so the result is undefined (does it have weights or not? we don't know)

The reason things are implemented this way is because they are layers, not models, so it is expected that they wouldn't have weights.

Actions needed:

  • Properly identify these as layers, not models (create a layers directory that is a sibling of models?)
  • For the scripts with argparse, capture and ignore --pretrained
  • Make sure all the scripts have argparse

cc @danielholanda @ramkrishna2910

Add `build` and `discover` commands

Problem

The turnkey benchmark command has --build-only and --analyze-only flags that provide an "early exit" from benchmarking to support standalone analysis and analysis+build.

However, this syntax is confusing: "use turnkey benchmark --build-only to export the ONNX model zoo".

It also leads to a lot of mutually exclusive arguments to turnkey benchmark. For example, --analyze-only is mutually exclusive with --sequence. This makes the help pages needlessly convoluted.

We also want to rename analyze to discover, which is a more accurate term for what is actually happening.

Proposal

Part 1: Discover

Rename analyze to discover throughout the docs and code.

Part 2: New Commands

Create two new commands:

  • turnkey build: like turnkey benchmark --build-only, except that all of the benchmarking-specific flags are removed.
  • turnkey discover: like turnkey benchmark --analyze-only, except that all of the build- and benchmarking-specific flags are removed.

Part 3: Help Page

Broken out into a separate issue #73

cc @danielholanda

Clarify naming for builds and evaluations

Problem statement

the API is named "add_build_stat()" - that should be renamed to something more general

I was thinking of evaluation as a superset of build+benchmark

Proposal

  • in the stats.yaml: builds -> evaluations
  • add_build_stat() -> save_model_evaluation_stat()
  • stat_id -> evaluation_id
  • save_stat() -> save_model_stat()

cc @danielholanda

Proposal: Release process for Version 1.0

Problem Statement

Updates to turnkeyml that have breaking, large, or disruptive changes should go through a special release process (i.e., not simply a PR into `main`). Such updates should be tested against any third-party plugins that are in development in other repos.

Proposal

  1. Breaking, large, and disruptive changes should be developed in the canary branch
  2. PRs into canary go through the regular PR process, however with the expectation that a subsequent PR is needed to merge from canary into main (and then into the PyPI package).
  3. Completed PRs into canary can trigger a package push to TestPyPI if they are tagged with RC*.
  4. A TestPyPI package must be distributed to registered third-party plugin developers for review whenever there is a PR from canary to main.

Actions

  1. The Version 1.0 development will use the proposed practices above
  2. If these practices work out well, they will be enshrined into the contribution guidelines

Models that require CUDA don't work by default

Models that expect CUDA to be installed do not work by default on tkml.

Reproduction

turnkey onnxmodelzoo\toolchain\models\torch_hub\mealv1_resnest50.py

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Workaround

Installing torch+cuda with pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 solves the issue.

Possible solutions (a sketch of options 1 and 2 follows the list):

  1. Add some logic to skip these models if CUDA is not installed, like if not torch.cuda.is_available(): return
  2. See if the models can be fixed with something like model.device("cpu")
  3. Remove models from the turnkey corpus that require CUDA
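
A minimal sketch of options 1 and 2 (the checkpoint path is illustrative, not from the corpus):

import torch

# Option 1: skip models that require CUDA when it is unavailable
if not torch.cuda.is_available():
    raise SystemExit("Skipping: this model requires CUDA")

# Option 2: force CPU deserialization so CPU-only machines can still load the weights
state_dict = torch.load("checkpoint.pth", map_location=torch.device("cpu"))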

cc @danielholanda @ramkrishna2910 @andyluo7

Multi-cache reporting bug

Bug: our caching code decides on the CSV-wide column headers one cache at a time. If one cache has a column header that isn't present in other caches, that key will be missing from the other caches' dicts, which will blow up when we write the CSV.

Solution: scan all caches for column headers, then apply those column headers across all caches (break the current loop into two loops)
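
A minimal sketch of the two-pass approach (the caches list here is a hypothetical stand-in for the per-cache build dicts read from turnkey_stats.yaml):

import csv

# Hypothetical stand-in: one list of build-stat dicts per cache
caches = [
    [{"model_name": "bert", "onnx_input_dimensions": "(1, 128)"}],
    [{"model_name": "resnet50"}],  # this cache never recorded onnx_input_dimensions
]

# Pass 1: collect the union of column headers across all caches
column_headers = []
for builds in caches:
    for build in builds:
        for key in build:
            if key not in column_headers:
                column_headers.append(key)

# Pass 2: write every row, filling in any column a given build does not have
with open("report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(column_headers)
    for builds in caches:
        for build in builds:
            writer.writerow([build.get(col, "-") for col in column_headers])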

Error message:

Traceback (most recent call last):
  File "/home/azureuser/miniconda3/envs/tkml/bin/turnkey", line 8, in <module>
    sys.exit(turnkeycli())
  File "/home/azureuser/onnxmodelzoo/toolchain/src/turnkeyml/cli/cli.py", line 500, in main
    args.func(args)
  File "/home/azureuser/onnxmodelzoo/toolchain/src/turnkeyml/cli/report.py", line 128, in summary_spreadsheets
    writer.writerow([build[col] for col in column_headers])
  File "/home/azureuser/onnxmodelzoo/toolchain/src/turnkeyml/cli/report.py", line 128, in <listcomp>
    writer.writerow([build[col] for col in column_headers])
KeyError: 'onnx_input_dimensions'

ERROR conda.cli.main_run:execute(49): `conda run turnkey cache report -d /home/azureuser/.cache/tkscale/omz-result/omz-0/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-1/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-10/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-11/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-12/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-13/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-14/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-15/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-2/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-3/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-4/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-5/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-6/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-7/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-8/.cache/turnkey /home/azureuser/.cache/tkscale/omz-result/omz-9/.cache/turnkey -r result` failed. (See above for error)
Traceback (most recent call last):
  File "omz.py", line 418, in <module>
    main()
  File "omz.py", line 244, in main
    create_and_retrieve_onnx(
  File "omz.py", line 145, in create_and_retrieve_onnx
    download_from_cluster()
  File "omz.py", line 103, in download_from_cluster
    execute(
  File "/home/azureuser/onnxmodelzoo/toolchain/utilities/scale_prty/src/tkscale/cluster.py", line 800, in execute
    handle.report(merge_caches, delete_files, reuse_tar)
  File "/home/azureuser/onnxmodelzoo/toolchain/utilities/scale_prty/src/tkscale/cluster.py", line 582, in report
    device.local_command(
  File "/home/azureuser/onnxmodelzoo/toolchain/utilities/scale_prty/src/tkscale/device.py", line 95, in local_command
    subprocess.run(command, check=True, shell=True, capture_output=quiet)
  File "/home/azureuser/miniconda3/envs/tkml/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'conda run -n tkml turnkey cache report -d /home/azureuser/.cache/tkscale/omz-result/omz*/.cache/turnkey -r result' returned non-zero exit status 1.

Move rwkv to skip

Move transformers/rwkv.py to transformers/skip

Skip reason: takes >3 hours to export from pytorch to ONNX

Reproduction: turnkey transformers/skip/rwkv.py --build-only --sequence onnx-fp32

And file a bug with the ONNX exporter team using the tkml reproduction case.

cc @ramkrishna2910

Proposal: Expose stages in the CLI

Issue

We are experiencing a combinatorial explosion of sequences as we add more stages. We also have no support in the CLI for combinations of stages that are not explicitly exposed as a sequence.

Example

The CLI could look like:

  • turnkey model.py --sequence export for just exporting the fp32 ONNX file
  • turnkey model.py --sequence export ort-opt oml-fp16 for export + ORT optimization + ONNX ML Tools fp16 conversion

This would probably also make the code cleaner and facilitate the understanding of those trying to add custom plugins to turnkey.

It's worth noting that Sequence is a child class of Stage, so users also have the flexibility to do something like: turnkey model.py --sequence sequence_A stage_Z where sequence_A includes both stage_X and stage_Y stages. Just an example, but the point is that we can mix named sequences and stages in any order and still get the intended result.

Supporting both Stage and Sequence(Stage) as inputs would make this a non-breaking change.

@danielholanda

Polish the readme

  • Link to the license file
  • Test all links
  • Add windows badge
  • Remove GPU test badge
  • Remove email reference

turnkey reporting hashes don't match analysis hashes

Issue

turnkey reporting hashes don't match hashes shown during analysis

Reproducing

turnkey any_model.py followed by turnkey cache report

Simply compare the "hashes" column of the report with the one shown during analysis.

Eliminate the `benchmark_model()` API

Problem Statement

Right now, evaluate_invocation() calls an api called benchmark_model(), which internally calls the build_model() API and then calls the BaseRT instance to perform benchmarking.

Problems:

  • Capturing good telemetry about failures is very hard when one critical tool is nested inside of another
  • benchmark_model() is technically a supported public API, but nobody has ever used it
    • The examples and user guide clog up our documentation
    • It has tests that take up time without providing more coverage
  • Important analysis functionality ends up happening post-benchmark instead of post-build simply because of the nesting

Proposal

  • Eliminate benchmark_model() and remove all of its tests and documentation
  • Refactor evaluate_invocation() such that build and benchmark are peer tools
  • Fix myriad analysis, telemetry, and order-of-operations issues along the way
  • Introduce a new analyze_onnx() function that can be the home of any future ONNX analysis tooling

Eliminate the cache/labels directory

Turnkey produces a labels directory in the build cache. This directory is a "write only memory" - we don't have any code that reads it anymore. Additionally, turnkey_stats.yaml is a newer, better way to store the same information.

Analysis does weird and incorrect things with Stable Diffusion

Stable diffusion is one of the flagship generative AI models

However, when I try to analyze stable diffusion 2.1 using turnkey I get unexpected outputs. I am expecting 4-5 models (clip, vae_decoder, unet, safety, vae_encoder) but I get this:

Models discovered during profiling:

stable_diffusion.py:
        model (executed 1x - 45.09s)
                Model Type:     Pytorch (torch.nn.Module)
                Class:          UNet2DConditionModel (<class 'diffusers.models.unet_2d_condition.UNet2DConditionModel'>)
                Location:       C:\\work\\miniconda3\\envs\\tkml\\lib\\site-packages\\diffusers\\configuration_utils.py, line 265
                Parameters:     865,910,724 (1651.6 MB)
                Input Shape:    'Positional Arg 1': (2, 4, 96, 96), 'encoder_hidden_states': (2, 77, 1024), 'return_dict': (1,)
                Hash:           5590988f
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70

        sub_module (executed 1x - 0.76s)
                Model Type:     Pytorch (torch.nn.Module)
                Class:          ResnetBlock2D (<class 'diffusers.models.resnet.ResnetBlock2D'>)
                Location:       C:\\work\\miniconda3\\envs\\tkml\\lib\\site-packages\\diffusers\\models\\modeling_utils.py, line 914
                Parameters:     1,181,184 (2.3 MB)
                Input Shape:    'Positional Arg 1': (1, 256, 384, 384)
                Hash:           5194825a
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70

        sub_module
                Model Type:     Pytorch (torch.nn.Module)
                Class:          ResnetBlock2D (<class 'diffusers.models.resnet.ResnetBlock2D'>)
                Location:       C:\\work\\miniconda3\\envs\\tkml\\lib\\site-packages\\diffusers\\models\\modeling_utils.py, line 914
                Parameters:     4,721,664 (9.0 MB)

                With input shape 1 (executed 2x - 0.39s)
                Input Shape:    'Positional Arg 1': (1, 512, 96, 96)
                Hash:           64325d9f
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70


                With input shape 2 (executed 1x - 0.66s)
                Input Shape:    'Positional Arg 1': (1, 512, 192, 192)
                Hash:           7761649c
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70

        sub_module (executed 1x - 1.20s)
                Model Type:     Pytorch (torch.nn.Module)
                Class:          Attention (<class 'diffusers.models.attention_processor.Attention'>)
                Location:       C:\\work\\miniconda3\\envs\\tkml\\lib\\site-packages\\diffusers\\models\\modeling_utils.py, line 914
                Parameters:     1,051,648 (2.0 MB)
                Input Shape:    'Positional Arg 1': (1, 512, 96, 96)
                Hash:           271b8138
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70

        sub_module
                Model Type:     Pytorch (torch.nn.Module)
                Class:          Upsample2D (<class 'diffusers.models.resnet.Upsample2D'>)
                Location:       C:\\work\\miniconda3\\envs\\tkml\\lib\\site-packages\\diffusers\\models\\modeling_utils.py, line 914
                Parameters:     2,359,808 (4.5 MB)

                With input shape 1 (executed 1x - 0.38s)
                Input Shape:    'Positional Arg 1': (1, 512, 96, 96)
                Hash:           a3534926
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70


                With input shape 2 (executed 1x - 1.11s)
                Input Shape:    'Positional Arg 1': (1, 512, 192, 192)
                Hash:           8cecddd7
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70

        sub_module (executed 1x - 0.14s)
                Model Type:     Pytorch (torch.nn.Module)
                Class:          LoRACompatibleConv (<class 'diffusers.models.lora.LoRACompatibleConv'>)
                Location:       C:\\work\\miniconda3\\envs\\tkml\\lib\\site-packages\\diffusers\\models\\modeling_utils.py, line 914
                Parameters:     131,328 (0.3 MB)
                Input Shape:    'Positional Arg 1': (1, 512, 384, 384)
                Hash:           5af9c176
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70

        sub_module (executed 1x - 1.30s)
                Model Type:     Pytorch (torch.nn.Module)
                Class:          Upsample2D (<class 'diffusers.models.resnet.Upsample2D'>)
                Location:       C:\\work\\miniconda3\\envs\\tkml\\lib\\site-packages\\diffusers\\models\\modeling_utils.py, line 914
                Parameters:     590,080 (1.1 MB)
                Input Shape:    'Positional Arg 1': (1, 256, 384, 384)
                Hash:           5fe428da
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70

        sub_module (executed 1x - 3.86s)
                Model Type:     Pytorch (torch.nn.Module)
                Class:          UpDecoderBlock2D (<class 'diffusers.models.unet_2d_blocks.UpDecoderBlock2D'>)
                Location:       C:\\work\\miniconda3\\envs\\tkml\\lib\\site-packages\\diffusers\\models\\modeling_utils.py, line 914
                Parameters:     1,067,648 (2.0 MB)
                Input Shape:    'Positional Arg 1': (1, 256, 768, 768)
                Hash:           6b88bd41
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70

        sub_module (executed 1x - 0.10s)
                Model Type:     Pytorch (torch.nn.Module)
                Class:          Conv2d (<class 'torch.nn.modules.conv.Conv2d'>)
                Location:       C:\\work\\miniconda3\\envs\\tkml\\lib\\site-packages\\diffusers\\models\\modeling_utils.py, line 914
                Parameters:     3,459 (<0.1 MB)
                Input Shape:    'Positional Arg 1': (1, 128, 768, 768)
                Hash:           511f6e70
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70

        sub_module (executed 1x - 0.00s)
                Model Type:     Pytorch (torch.nn.Module)
                Class:          Conv2d (<class 'torch.nn.modules.conv.Conv2d'>)
                Location:       C:\\work\\miniconda3\\envs\\tkml\\lib\\site-packages\\diffusers\\models\\modeling_utils.py, line 914
                Parameters:     20 (<0.1 MB)
                Input Shape:    'Positional Arg 1': (1, 4, 96, 96)
                Hash:           a57b4330
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70

        model (executed 2x - 0.73s)
                Model Type:     Pytorch (torch.nn.Module)
                Class:          CLIPTextModel (<class 'transformers.models.clip.modeling_clip.CLIPTextModel'>)
                Location:       C:\\work\\miniconda3\\envs\\tkml\\lib\\site-packages\\transformers\\modeling_utils.py, line 2766
                Parameters:     340,387,840 (649.2 MB)
                Input Shape:    'Positional Arg 1': (1, 77)
                Hash:           2e024212
                Build dir:      C:\work\tkdevcache\/stable_diffusion_511f6e70


Woohoo! The 'benchmark' command is complete.

Reproduction:

Model script:

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, num_inference_steps=1).images[0]

Command:

turnkey stable_diffusion.py --analyze-only

cc @danielholanda @vgodsoe

Design proposal: Simplify the `spawn` args

Current Status

Right now there are 3 turnkey arguments related to spawning processes:

  • --use-slurm
  • --process-isolation
  • --timeout TIMEOUT

This is bad UI design because the 3 flags all control the same thing, slurm/processes are mutually exclusive, and timeout only works if slurm/processes is used.

Design Proposal

Create a new argument, --runs-on, that replaces the 3 args above.

  • --runs-on slurm::TIMEOUT
  • --runs-on processes::TIMEOUT

@danielholanda to review

stages_completed is out of order in the report

Stages attempted has the correct order:

['export_pytorch', 'optimize_onnx', 'fp16_conversion', 'set_success']

Stages completed has the wrong order:

['export_pytorch', 'fp16_conversion', 'optimize_onnx', 'set_success']

This will cause some confusion when people look at the reports in Excel.

Suggested resolution: remove stages_completed from the report. We are already breaking out the member fields, and reporting success/fail, so including stages_completed is redundant and confusing.

cc @danielholanda

Proposal: Add --max-throughput to CLI and expose it to runtimes

Description

The goal is to enable users to artificially limit throughput to the desired number of invocations per second. Note: Turnkey does not use the batch size selected by the user to adjust the IPS as the batch size may not always be known.

--max-throughput should be an argument of Turnkey's CLI.

Discussion

The benchmarking help page will get very long if we add a lot of benchmarking-specific flags: this one, --iterations, --num-threads/processes, etc. Consider hiding these behind some kind of "advanced benchmarking options"?

Proposal: Rename the `benchmark` command to `evaluate`

Don't call it a benchmarking tool.

There are three reasons that benchmark is a bad name for the default command:

  • It does more than just benchmark. It also discovers and builds.
  • We're trying to get away from "it's a benchmarking tool" to "it's a tool for evaluating the landscape of models, software, and hardware". Evaluating.
  • We're going to introduce a dedicated benchmarking command with #19

I propose to rename turnkey benchmark to turnkey evaluate, and turnkeyml.benchmark_files() to turnkey.evaluate_files().

`benchmarking_status` stat is ambiguous

The "benchmarking status" stat is ambiguous because it is the status for the entire turnkey benchmark command execution.

Status should actually be broken out by tool phases:

  • analysis_status
  • build_status
  • benchmark_status

That way if the tool gets killed, we know what phase it got killed in.

Create unit tests for each public API

See discussion in #55

Create a test/public_api.py test file that includes 1 or more unit tests per public API. These tests should be as minimal as possible and not rely on any end-to-end actions such as calling the turnkey CLI.

This will help to ensure that we have clear indications whenever the public API is being broken: the public API test will fail, and any changes to the public API test will require a change to the major version number.

cc @danielholanda

Proposal: Set batch size as a first-class argument in the CLI

Batching is the most common parameter in AI benchmarking, and it applies to virtually every model.

We currently support batching via the --script-args argument, which allows a batching parameter to be sent to the input script and therefore the model. We also support --script-args="--batch_size N" as the "official" semantics for passing batch size to a turnkey model.

However, we have a major flaw: batch size is never reflected in our results. We report throughput as "invocations per second", not "inferences per second". The latter would be far more useful.

To truly report "inferences per second" we need to somehow parse the batch size and then pass it into the benchmarking software, so that we can report inferences_per_second = invocations_per_second * batch_size

There may not be any perfect way to solve this, but we should still do something. Some issues with potential solutions:

  1. Models/applications that were not created by the TurnkeyML maintainers may use other arg names for batching (e.g., --batch, --batching, etc.)
  2. Batch size is usually the outer dimension of the input tensor, but not always (e.g., in LSTM it is the second dimension)
  3. Batch size may be hardcoded in the application (not configurable as a script arg at all)

A bulletproof (if verbose) solution could be like this:

  1. batch_size is a reserved --script-arg name that indicates the batch size, and will be used in IPS computations
  2. A new CLI arg --batch-arg-name can override the reserved term batch_size to some other name such as batching in corner cases where the model/app developer has named their arg something else.
  3. a new CLI arg --batch-size=N can set the batch size in both the inputs (by setting --script-args="<batch_arg_name>=N) as well as in the IPS calculation. This is needed in the case where batch size is hardcoded in the application.
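
For illustration, the resulting calculation is simple once the batch size is known (the numbers are hypothetical):

invocations_per_second = 250.0  # measured by the benchmark
batch_size = 8                  # parsed from --script-args or the proposed --batch-size flag
inferences_per_second = invocations_per_second * batch_size  # 2000.0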

cc @danielholanda @viradhak-amd @ramkrishna2910

Help users who forgot to install the models requirements

Problem

Right now, if a user forgets to install models/requirements.txt, they will get a ModuleNotFoundError as soon as they try to run any model outside of selftest.

Proposal

In analysis or files_api: wrap the script invocation in a try/except for ModuleNotFoundError, check whether the module is in our models requirements, and provide a helpful error message.
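
A minimal sketch of that wrapping logic (run_model_script and MODELS_REQUIREMENTS are hypothetical stand-ins for the real analysis/files_api internals):

# Hypothetical set of packages listed in models/requirements.txt
MODELS_REQUIREMENTS = {"transformers", "timm", "diffusers"}

def run_model_script(script_path: str):
    try:
        with open(script_path) as f:
            exec(f.read(), {"__name__": "__main__"})
    except ModuleNotFoundError as e:
        if e.name in MODELS_REQUIREMENTS:
            raise RuntimeError(
                f"'{e.name}' is needed by the models corpus. "
                "Did you forget to `pip install -r models/requirements.txt`?"
            ) from e
        raise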

Public API contract

Introduction

This issue describes a public API contract that will be enshrined into code.

The public API is:

  • Any class, function, or data structure that is meant to be used by external code (e.g., wrapper utilities, plugins, etc.) is exposed via src/turnkeyml/__init__.py
  • Any breaking change to the interface of anything in src/turnkeyml/__init__.py constitutes a breaking change to TurnkeyML and a rev of the major version number
  • Changes that don't change an interface in src/turnkeyml/__init__.py are not considered breaking changes.
  • External code uses classes/functions/data NOT in the public API at its own risk. Those internal classes/functions/data can be changed without warning.

Contract

The following classes, functions, and data are already in the public API:

  • turnkeycli: the turnkey CLI
  • benchmark_files()
  • build_models()
  • load_state() API for loading build state
  • The package version number turnkeyml.version

We will be adding the following:

  • From the run module:
    • The BaseRT class
  • From the common.filesystem module:
    • get_available_builds()
    • make_cache_dir()
    • MODELS_DIR: the location of turnkey's corpus on disk
    • Stats
    • Keys: the common keys for the Stats dicts
  • From the common.printing module:
    • log_info()
    • log_warning()
    • log_error()
  • From the build.export module:
    • onnx_dir()
    • ExportPlaceholder(Stage)
    • OptimizeOnnxModel(Stage)
    • ConvertOnnxToFp16(Stage)
  • From the build.stage module:
    • Sequence
    • Stage
  • From the common.build module:
    • State
    • logged_subprocess() BTW this is only ever used in runtime plugins, so why is it in the build module?? Will move it...
  • From the common.exceptions module:
    • StageError
    • ModelRuntimeError
  • From run.plugin_helpers everything
    • CondaError
    • SubprocessError
    • get_python_path()
    • run_subprocess()
    • HardwareError

Not in contract

There are some helper functions used in external tests that I don't want to expose in the public API. They just don't seem like "official turnkey functions" because they are too generic. They are:

In the filesystem module:

  • get_all(): finds all the files with a given file extension
  • rmdir(): delete a directory from the filesystem
  • expand_inputs()

In the export module:

  • check_model(): just a wrapper for onnx.checker.check_model()
  • base_onnx_file(): location on disk of the fp32 onnx file (this shouldn't be public info; Stages should only get to know where the intermediate result is, not a specific past result)

My advice would be for external code to either copy the internal code or replace it with something more generic.

Release notes markdown file

Add a new doc that tracks the release notes for TKML.

This file will be initialized with the release notes for v1.0.0.

midas_v3_large.py in torch_hub export failed

Problem

The export Stage in the build tool fails on midas_v3_large.py with an unexpected error.

Reproduction

turnkey benchmark midas_v3_large.py --build-only --opset 17

Error message

opset 17 is used.

Models discovered during profiling:

midas_v3_large.py:
model (executed 1x)
Model Type: Pytorch (torch.nn.Module)
Class: DPTDepthModel (<class 'midas.dpt_depth.DPTDepthModel'>)
Location: C:\Users\ryzen/.cache\torch\hub\intel-isl_MiDaS_master\hubconf.py, line 239
Parameters: 344,055,465 (1.28 GB)
Input Shape: 'x': (1, 3, 224, 224)
Hash: 8fdf03df
Build dir: C:\Users\ryzen/.cache/turnkey/midas_v3_large_torch_hub_8fdf03df
Status: Unknown turnkey error: unflattened_size must be tuple of ints, but found element of type Tensor at pos 0
Traceback (most recent call last):
File "C:\Users\ryzen\onnxmodelzoo\toolchain\src\turnkeyml\analyze\script.py", line 240, in explore_invocation
perf = benchmark_model(
File "C:\Users\ryzen\onnxmodelzoo\toolchain\src\turnkeyml\model_api.py", line 88, in benchmark_model
build_model(
File "C:\Users\ryzen\onnxmodelzoo\toolchain\src\turnkeyml\build_api.py", line 130, in build_model
state = sequence_locked.launch(state)
File "C:\Users\ryzen\onnxmodelzoo\toolchain\src\turnkeyml\build\stage.py", line 290, in launch
state = stage.fire_helper(state)
File "C:\Users\ryzen\onnxmodelzoo\toolchain\src\turnkeyml\build\stage.py", line 127, in fire_helper
state = self.fire(state)
File "C:\Users\ryzen\onnxmodelzoo\toolchain\src\turnkeyml\build\export.py", line 291, in fire
torch.onnx.export(
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\onnx\utils.py", line 516, in export
_export(
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\onnx\utils.py", line 1596, in _export
graph, params_dict, torch_out = _model_to_graph(
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\onnx\utils.py", line 1135, in _model_to_graph
graph, params, torch_out, module = _create_jit_graph(model, args)
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\onnx\utils.py", line 1011, in _create_jit_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args)
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\onnx\utils.py", line 915, in _trace_and_get_graph_from_model
trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\jit_trace.py", line 1285, in _get_trace_graph
outs = ONNXTracedModule(
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\jit_trace.py", line 133, in forward
graph, out = torch._C._create_graph_by_tracing(
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\jit_trace.py", line 124, in wrapper
outs.append(self.inner(*trace_inputs))
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\nn\modules\module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "C:\Users\ryzen\onnxmodelzoo\toolchain\src\turnkeyml\analyze\script.py", line 560, in forward_spy
return old_forward(*args, **kwargs)
File "C:\Users\ryzen/.cache\torch\hub\intel-isl_MiDaS_master\midas\dpt_depth.py", line 166, in forward
return super().forward(x).squeeze(dim=1)
File "C:\Users\ryzen/.cache\torch\hub\intel-isl_MiDaS_master\midas\dpt_depth.py", line 114, in forward
layers = self.forward_transformer(self.pretrained, x)
File "C:\Users\ryzen/.cache\torch\hub\intel-isl_MiDaS_master\midas\backbones\vit.py", line 13, in forward_vit
return forward_adapted_unflatten(pretrained, x, "forward_flex")
File "C:\Users\ryzen/.cache\torch\hub\intel-isl_MiDaS_master\midas\backbones\utils.py", line 99, in forward_adapted_unflatten
nn.Unflatten(
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\nn\modules\flatten.py", line 109, in init
self._require_tuple_int(unflattened_size)
File "C:\ProgramData\anaconda3\envs\tkml-2311-rc2\lib\site-packages\torch\nn\modules\flatten.py", line 132, in _require_tuple_int
raise TypeError("unflattened_size must be tuple of ints, " +
TypeError: unflattened_size must be tuple of ints, but found element of type Tensor at pos 0

Support for PyTorch Lightning

Problem Statement

TurnkeyML currently supports only torch.nn.modules within the turnkey benchmark command. However, there is interest in using TurnkeyML with PyTorch Lightning modules as well.

Scope of Work

At least the following improvements would be needed to support PyTorch Lightning:

  1. In the analyze module, detect PyTorch Lightning modules and pass them to explore_invocation(). Currently, we only pass torch.nn.module to explore_invocation().
  2. Add a new build Stage to export the Lightning module to ONNX. This is hopefully as simple as calling lightning_module.to_onnx() (see the sketch after this list).
  3. Add this new build stage to the ExportPlaceholder Stage so that it can be automatically selected whenever a Lightning module is encountered.
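
A minimal sketch of step 2, assuming LightningModule.to_onnx() behaves as documented (the module and input sample are made up for illustration):

import torch
import pytorch_lightning as pl

class TinyLitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(16, 2)

    def forward(self, x):
        return self.layer(x)

model = TinyLitModel()
# to_onnx() wraps torch.onnx.export, so the output should slot into the existing ONNX stages
model.to_onnx("tiny_lit_model.onnx", input_sample=torch.randn(1, 16), export_params=True)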

Contribution

  • I am setting a priority of P2 until we have an e2e example available that we can use for testing.
  • I believe that this would make a good first issue if someone wanted to contribute it.

cc @andife with regards to #69

TurnkeyML shows an unhelpful error message when docker daemon is not running

Issue

The message below is shown when the docker daemon is not running

Models discovered during profiling:

resnet50.py:
        model (executed 1x)
                Model Type:     Pytorch (torch.nn.Module)
                Class:          ResNet (<class 'timm.models.resnet.ResNet'>)
                Location:       C:\\Users\\danie\\miniconda3\\envs\\tkml\\lib\\site-packages\\timm\\models\\_builder.py, line 390
                Parameters:     25,557,032 (97.49 MB)
                Input Shape:    'Positional Arg 1': (1, 3, 224, 224)
                Hash:           694a8fff
                Build dir:      C:\Users\danie/.cache/turnkey/resnet50_timm_694a8fff
                Status:         Unknown turnkey error: 'Total Latency'
                Traceback (most recent call last):
                  File "C:\Users\danie\onnxmodelzoo\toolchain\src\turnkeyml\analyze\script.py", line 193, in explore_invocation
                    perf = benchmark_model(
                  File "C:\Users\danie\onnxmodelzoo\toolchain\src\turnkeyml\model_api.py", line 109, in benchmark_model
                    perf = model_handle.benchmark()
                  File "C:\Users\danie\onnxmodelzoo\toolchain\src\turnkeyml\run\basert.py", line 177, in benchmark
                    mean_latency=self.mean_latency,
                  File "C:\Users\danie\onnxmodelzoo\toolchain\src\turnkeyml\run\tensorrt\runtime.py", line 54, in mean_latency
                    return float(self._get_stat("Total Latency")["mean "].split(" ")[1])
                  File "C:\Users\danie\onnxmodelzoo\toolchain\src\turnkeyml\run\basert.py", line 190, in _get_stat
                    return performance[stat]
                KeyError: 'Total Latency'

Suggested solution

The errors shown below are suppressed when TURNKEY_DEBUG is not set to True

docker: error during connect: this error may indicate that the docker daemon is not running: Post "http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.24/containers/create?name=tensorrt23.03-py3": open //./pipe/docker_engine: The system cannot find the file specified.
See 'docker run --help'.
error during connect: this error may indicate that the docker daemon is not running: Get "http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.24/containers/tensorrt23.03-py3/json": open //./pipe/docker_engine: The system cannot find the file specified.
error during connect: this error may indicate that the docker daemon is not running: Post "http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.24/containers/tensorrt23.03-py3/stop": open //./pipe/docker_engine: The system cannot find the file specified.

The ideal solution here is to show those errors as part of the Status

Reproducing

On a Windows system run:
turnkey models\timm\resnet50.py --device nvidia

Bug? / Wrong error message? "turnkey build bert.py"

Hi,
not sure if thats a bug?

Line 70 of the readme shows "turnkey build bert.py --sequence optimize-fp16"

I get the following:


(lightning_py310_231210) user1@hitssv565:/local_data/user1/models$ turnkey build rotational_power.py --sequence optimize-fp16

Error: Unexpected positional argument `turnkey build`. The first positional argument must either be an input file with the .py or .onnx file extension or one of the following commands: ['benchmark', 'cache', 'models', 'version'].

▄██████████████▄▐█▄▄▄▄█▌
██████▌▄▌▄▐▐▌███▌▀▀██▀▀
████▄█▌▄▌▄▐▐▌▀███▄▄█▌
▄▄▄▄▄██████████████


Traceback (most recent call last):
  File "/local_data/user1/miniforge3/envs/lightning_py310_231210/bin/turnkey", line 8, in <module>
    sys.exit(turnkeycli())
  File "/local_data/user1/miniforge3/envs/lightning_py310_231210/lib/python3.10/site-packages/turnkeyml/cli/cli.py", line 493, in main
    raise exceptions.ArgError(error_msg)
turnkeyml.common.exceptions.ArgError: Unexpected positional argument `turnkey build`. The first positional argument must either be an input file with the .py or .onnx file extension or one of the following commands: ['benchmark', 'cache', 'models', 'version'].

rotational_power.py contains a pytorch_lightning module? Does Turnkeyml work for lightning?

Build directories in the cache are ignored

filesystem.get_available_builds() looks for a .turnkey_build file marker to assess whether a directory is a build directory. There are a variety of situations where .turnkey_build is excluded, which then causes issues. For example, turnkey cache report doesn't care about .turnkey_build but turnkey cache delete --all does check for it.

The call to create_build_directory() was placed where it was, in analyze/script.py because we were under code freeze for the build tool at the time. As you can see in the comment, this was not the right long-term home for it.

The placement of that call led to a bug: if the process doing the build is killed for any reason (e.g., a timeout), then that finally block will never execute and the directory will not count as a build because it doesn't have a .turnkey_build marker in it. In turn, this impacts functions like filesystem.get_available_builds() that check for that marker (i.e., such a build shouldn't show up in turnkey cache list).

Solution: move that line of code to its right long-term home, the moment the build is actually created: build/ignition.py::_begin_fresh_build()

Handling what happens when `--process-isolation` crashes

Description

The following problems occur when --process-isolation turnkey subprocess times out or runs OOM, or the parent process crashes:

  • --lean-cache doesn't work (generating large cache folders)
  • No information is saved into the state to show that the process timed out or ran out of memory

We made a bunch of related design decisions to have a process-safe cache database; however, we have since eliminated that database. We can revisit those design decisions to make the code more robust for its main use cases, such as mass-benchmarking in process isolation mode.

Suggested fix: catch the subprocess exceptions when the isolated subprocess dies, and perform cleanup tasks.

`--script-args` argument is not recorded in stats

Problem

We currently do not log the contents of --script-args into turnkey_stats.yaml or anywhere else.

This is bad for both traceability and reproducibility. For example, there is no way to know after the fact if --pretrained was used for a model, or not.

Solution

Just log --script-args contents into turnkey_stats.yaml early in the files API.

Better solution

We actually don't log the full set of turnkey args. We should log the entire turnkey command, using something similar to the logic for spawning processes (since that logic also captures all args to the command).

Fix flakey test: `timeout.py`

Description

test_28_cli_timeout doesn't always produce the same results given the same inputs.


This test could benefit from adjustments to ensure it operates independently of the system it's on. Relying on approximate timeout allowances might not be the best approach for its reliability.

@jeremyfowers @ramkrishna2910

Add plugin API tutorials

Right now the plugin interface has basic documentation in the Tools User Guide and there are some sample plugins in examples/cli/plugins.

Things to add:

  • tutorial for how to create, install, and use plugins in examples/cli/plugins.md
  • tutorial for how to add custom statistics to the report and CLI status
  • expand the turnkey cache report command documentation to explain how it puts all key:value pairs from all stats.yaml files into a single csv file

New feature: Model acceptance test

Problem Statement

It is easy to add or modify a model script in the toolchain/models directory: just save a python file. But how do we know if the model meets all of the acceptance criteria that are specified in the contribution guide?

Proposal

Create a test that takes in models, and makes sure that the following all pass:

  • Model passes through the analysis tool correctly
  • All required labels are present
  • --script-args is supported
  • etc.

cc @ramkrishna2910 @danielholanda

Proposal: Enable setting all optional Turnkey CLI arguments using a config file

Description

Add a path-to-config-file option that enables setting all optional CLI arguments in a file instead of explicitly setting them in the CLI.

Example:
turnkey model.py --config my_config.yaml

We may also want a corresponding command like turnkey --lots-of-args --save-config my_config.yaml, where all the args get saved to my_config.yaml.

Discussion

This issue is on hold until we identify a driving use case.

Also, we probably want a corresponding command like turnkey --lots-of-args --save-config my_config.json, where all the args get saved to my_config.json.

Feature Request - report device utilization per model/benchmark

Adrian: I would like to start calculating device utilization. This requires an estimate of the TOPs/Inference per model. @pcolange has created a tool that can do this.

Implementation suggestion

The Analysis function in turnkey already does a bunch of model analysis, such as getting the parameter count and input shapes, and logging the performance data. That is the right place to add a TOPs/inference estimate and calculate device utilization.

First, @pcolange would need to create a function, tops_per_inference(model: Union[torch.nn.module, str]) -> int, that takes an ONNX file or PyTorch instance as input and returns the number of TOPs/inference.

Then, anyone (could be @pcolange as well) would need to:

  • Paste the function above into the Analysis codebase
  • Call the function during the model stats collection code block
  • Add a simple calculation: utilization = throughput * TOPs/inference / TOPS/device (illustrated below)
  • Call stats.add_build_stat("utilization", utilization) to save the information to the report
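
For illustration, the utilization math with hypothetical numbers (tops_per_inference() is the proposed function, not an existing API):

throughput = 500.0          # invocations per second, measured by the benchmark
tops_per_inference = 0.02   # TOPs required for one inference, from the proposed tops_per_inference()
device_peak_tops = 45.0     # peak TOPS of the device being benchmarked

utilization = throughput * tops_per_inference / device_peak_tops  # ~0.22, i.e. about 22% utilization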
