Master process crashes with "signal SIGSEGV: segmentation violation" when using multiple OPE instances of a simple "string python" model under high load
The SIGSEGV happens inside neuropod::tensor_from_id(std::__1::array<char, 24ul> const&), while deserializing inference results received back from an OPE worker (see the stack trace below).
It does not happen immediately; hitting it requires particular conditions: high volume/load and a particular number of concurrent callers.
"Master" Process with 4 OPE instances of the same "string_python" mode performs round-robin on incoming Inference requests. 4 concurrent callers perform Inference calls. Master crashes with SEGV not immediately but at "high volume". It is 100% reproducible if generate 20K calls. If I send per 2K calls, it can take many iterations to hit it.
Trace log from just before the crash:

07/14/20 21:02:41.269865: T ./neuropod/multiprocess/mq/ipc_message_queue_impl.hh:152] [thread 118825, process 118433] OPE: Sending IPC control message 2.
07/14/20 21:02:41.269898: T ./neuropod/multiprocess/mq/ipc_message_queue_impl.hh:152] [thread 118442, process 118433] OPE: Sending IPC control message 2.
07/14/20 21:02:41.269912: T ./neuropod/multiprocess/mq/ipc_message_queue_impl.hh:91] [thread 118735, process 118733] OPE: Read thread received IPC control message 2.
07/14/20 21:02:41.269922: T neuropod/multiprocess/mq/transferrables.cc:56] [thread 118735, process 118733] OPE: Clearing transferrables for msg with id 1032
fatal error: unexpected signal during runtime execution
07/14/20 21:02:41.269957: T ./neuropod/multiprocess/mq/ipc_message_queue_impl.hh:91] [thread 118597, process 118595] OPE: Read thread received IPC control message 2.
07/14/20 21:02:41.269975: T neuropod/multiprocess/mq/transferrables.cc:56] [thread 118597, process 118595] OPE: Clearing transferrables for msg with id 1033
[signal SIGSEGV: segmentation violation code=0x1 addr=0x7fd320c10480 pc=0x7fd320d43797]
==27092==Register values:
rax = 0x000060300025ab48 rbx = 0x000060700005aae0 rcx = 0x00000000000000c4 rdx = 0x00000000000000c5
rdi = 0x0000000011b87060 rsi = 0x0000000011b870e0 rbp = 0x000070000b7e29f0 rsp = 0x000070000b7e2920
r8 = 0x0000000000000098 r9 = 0x0000000011e41440 r10 = 0x0000000000000011 r11 = 0x0000000011e41458
r12 = 0x0000000000000000 r13 = 0x0000000011e414b8 r14 = 0x000060e000130ca0 r15 = 0x0000000000000000
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0xbac7d) in neuropod::tensor_from_id(std::__1::array<char, 24ul> const&)
(Interleaved debug output from the Go master, from a separate run, showing a sample 1x1 string tensor:)
2020-07-07T02:53:31.656-0700 DEBUG model/models.go:153 Tensor: &Tensor{Value:[],Dimensions:[1 1],Dtype:DTYPE_STRING,StringVector:[benchmark],}
#0 0x5f01c7d in neuropod::tensor_from_id(std::__1::array<char, 24ul> const&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0xbac7d)
#1 0x5f033db in void neuropod::ipc_deserialize<std::__1::shared_ptr<neuropod::NeuropodValue> >(std::__1::basic_istream<char, std::__1::char_traits<char> >&, std::__1::shared_ptr<neuropod::NeuropodValue>&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0xbc3db)
#2 0x5ef9266 in void neuropod::ipc_deserialize<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue> >(std::__1::basic_istream<char, std::__1::char_traits<char> >&, std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::shared_ptr<neuropod::NeuropodValue> > > >&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0xb2266)
#3 0x5ef900f in void neuropod::detail::deserialize_payload<std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::shared_ptr<neuropod::NeuropodValue> > > >, neuropod::MessageType>(neuropod::detail::WireFormat<neuropod::MessageType> const&, std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::shared_ptr<neuropod::NeuropodValue> > > >&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0xb200f)
#4 0x5ee67cb in neuropod::(anonymous namespace)::MultiprocessNeuropodBackend::infer_internal(std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::shared_ptr<neuropod::NeuropodValue> > > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0x9f7cb)
#5 0x5f3f08f in neuropod::NeuropodBackend::infer(std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::shared_ptr<neuropod::NeuropodValue> > > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0xf808f)
#6 0x5e4c340 in neuropod::Neuropod::infer(std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::shared_ptr<neuropod::NeuropodValue> > > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0x5340)
#7 0x494a31a in Infer (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/./ma-dl-inference:x86_64+0x494a31a)
#8 0x4948d07 in _cgo_f1624c697bd0_Cfunc_Infer (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/./ma-dl-inference:x86_64+0x4948d07)
#9 0x406918f in runtime.asmcgocall (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/./ma-dl-inference:x86_64+0x406918f)
The packaged "string_python" model itself is a trivial echo:

class Model:
    def __call__(self, str_input):
        return {'str_output': str_input}


def get_model(data_root):
    return Model()
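For completeness, this is roughly how such a model is packaged. A sketch assuming the create_python_neuropod packager from neuropod.packagers; the paths, package layout, and model name are hypothetical, and the argument names should be checked against the neuropod version in use.

import os
from neuropod.packagers import create_python_neuropod

# Hypothetical layout: the Model/get_model code above lives in
# string_python/model.py under this python root.
python_root = os.path.abspath("python_root")

create_python_neuropod(
    neuropod_path="string_python_neuropod",
    model_name="string_python",
    data_paths=[],
    code_path_spec=[
        {
            "python_root": python_root,
            "dirs_to_package": ["string_python"],
        }
    ],
    entrypoint_package="string_python.model",
    entrypoint="get_model",
    # A single string tensor in and out, matching the DTYPE_STRING
    # Dimensions:[1 1] tensor in the debug log above.
    input_spec=[{"name": "str_input", "dtype": "string", "shape": (None, None)}],
    output_spec=[{"name": "str_output", "dtype": "string", "shape": (None, None)}],
)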
It fails with 4 OPE instances and 4 concurrent callers.
It does not fail with 1, 2, or 8 concurrent callers.
When I send 20K messages in one run, it fails once it reaches ~3-4K; when I send in batches of 2K, it may not fail until around the 8K mark.