apache / tvm Goto Github PK

Open deep learning compiler stack for cpu, gpu and specialized accelerators

License: Apache License 2.0

Python 59.44% Makefile 0.12% C++ 36.41% C 0.69% Shell 0.67% CMake 0.52% Objective-C 0.04% Objective-C++ 0.16% JavaScript 0.04% HTML 0.01% Java 0.35% Rust 0.69% Go 0.19% Cuda 0.16% RenderScript 0.01% TypeScript 0.32% Batchfile 0.01% Cython 0.05% Groovy 0.03% Jinja 0.09%

compiler tensor deep-learning gpu opencl metal performance javascript rocm tvm

tvm's Introduction

Open Deep Learning Compiler Stack

Documentation | Contributors | Community | Release Notes

Apache TVM is a compiler stack for deep learning systems. It is designed to close the gap between the productivity-focused deep learning frameworks, and the performance- and efficiency-focused hardware backends. TVM works with deep learning frameworks to provide end to end compilation to different backends.

License

TVM is licensed under the Apache-2.0 license.

Getting Started

Check out the TVM Documentation site for installation instructions, tutorials, examples, and more. The Getting Started with TVM tutorial is a great place to start.

Contribute to TVM

TVM adopts apache committer model, we aim to create an open source project that is maintained and owned by the community. Check out the Contributor Guide.

Acknowledgement

We learned a lot from the following projects when building TVM.

Halide: Part of TVM's TIR and arithmetic simplification module originates from Halide. We also learned and adapted some part of lowering pipeline from Halide.
Loopy: use of integer set analysis and its loop transformation primitives.
Theano: the design inspiration of symbolic scan operator for recurrence.

tvm's People

Contributors

Stargazers

Watchers

Forkers

tqchen icemelon mkolod yajiedesign tornadomeet benjamesbabala liuguoyou vsooda zihengjiang laurawly cv9527 zhouyonglong nlpguyz zhoubaozhou zpmaths limin2021 stevenlol bin913 linpingchuan aromazyl eversimon linecode bygreencn abuccts entropy-xcy cybercoderbot songlang thlo7777 colinsongf unsky mjz9054 wtree codeaudit collector-m kekedan ysh329 yzhliu ringwraith zgsxwsdxg claudexin freefox2005 sn0wfree srsman yuchengwang lijiansong ml-lab soledad89 wikipedia2008 tiandiao123 jianyuzhan dreadlord1984 zj2089 benmfeng mengdiwang reed-lau bdyunmu hiranoaya doldre jeason-hu matrixplayer chowguohua yigo3000 alan15932 liangfu jfdw derisavi arthur-shi deepage longjohncoder nagyistge thatkidflo liuleicode doctaweeks caozhengquan jabher wweic lewuathe k9sret 6676401088 phisiart bmei1314 rimlesss abergeron code-goblin reminisce eqy yangyang233 zhuwenxiao haoxuu daimagou merrymercy happybaoliang lihaol westalgo hongyunnchen xty19 cnsuhao srkreddy1238 derisavi-huawei w-ding

tvm's Issues

get_started.py error: src/codegen/codegen.cc:27: Check failed: bf != nullptr Target llvm is not enabled

Is llvm necessary? I saw these lines in config.mk:

# whether build with LLVM support
# Requires LLVM version >= 4.0
# Set LLVM_CONFIG to your version, uncomment to build with llvm support
#
# LLVM_CONFIG = llvm-config

It seems llvm is not necessary and it's commented out defaultly. However, I'm confused because of this error: nullptr Target llvm is not enabled

Env.

I installed tvm on my laptop.
Ubuntu 16.04 64bit

clinfo

yuens@Spark:~$ clinfo                                                                                                                                                                                                                                                                                                   [16/16]
Number of platforms                               1
  Platform Name                                   Intel Gen OCL Driver
  Platform Vendor                                 Intel
  Platform Version                                OpenCL 1.2 beignet 1.1.1
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_spir cl_khr_icd
  Platform Extensions function suffix             Intel

  Platform Name                                   Intel Gen OCL Driver
Number of devices                                 1
  Device Name                                     Intel(R) HD Graphics IvyBridge M GT2
  Device Vendor                                   Intel
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 1.2 beignet 1.1.1
  Driver Version                                  1.1.1
  Device OpenCL C Version                         OpenCL C 1.2 beignet 1.1.1
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Max compute units                               16
  Max clock frequency                             1000MHz
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None, None, None
  Max work item dimensions                        3
  Max work item sizes                             512x512x512
  Max work group size                             512
  Preferred work group size multiple              16
  Preferred / native vector sizes                 
    char                                                16 / 8       
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 0 / 8        (n/a)
    float                                                4 / 4       
    double                                               0 / 2        (n/a)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    32, Little-Endian
  Global memory size                              2147483648 (2GiB)
  Error Correction support                        No
  Max memory allocation                           1073741824 (1024MiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        8192
  Global Memory cache line                        64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             8192x8192 pixels
    Max 3D image size                             8192x8192x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Global
  Local memory size                               65536 (64KiB)
  Max constant buffer size                        134217728 (128MiB)
  Max number of constant args                     8
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      80ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
    SPIR versions                                 <printDeviceInfo:138: get   SPIR versions size : error -30>
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                __cl_copy_region_align4;__cl_copy_region_align16;__cl_cpy_region_unalign_same_offset;__cl_copy_region_unalign_dst_offset;__cl_copy_region_unalign_src_offset;__cl_copy_buffer_rect;__cl_copy_image_1d_to_1d;__cl_copy_image_2d_to_2d;__cl_copy_image_3d_to_2d;__cl_copy_image
_2d_to_3d;__cl_copy_image_3d_to_3d;__cl_copy_image_2d_to_buffer;__cl_copy_image_3d_to_buffer;__cl_copy_buffer_to_image_2d;__cl_copy_buffer_to_image_3d;__cl_fill_region_unalign;__cl_fill_region_align2;__cl_fill_region_align4;__cl_fill_region_align8_2;__cl_fill_region_align8_4;__cl_fill_region_align8_8;__cl_fill_region_
align8_16;__cl_fill_region_align128;__cl_fill_image_1d;__cl_fill_image_1d_array;__cl_fill_image_2d;__cl_fill_image_2d_array;__cl_fill_image_3d;
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_spir cl_khr_icd

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Intel Gen OCL Driver
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [Intel]
  clCreateContext(NULL, ...) [default]            Success [Intel]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Intel Gen OCL Driver
    Device Name                                   Intel(R) HD Graphics IvyBridge M GT2
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Intel Gen OCL Driver
    Device Name                                   Intel(R) HD Graphics IvyBridge M GT2

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.8
  ICD loader Profile                              OpenCL 1.2
        NOTE:   your OpenCL library declares to support OpenCL 1.2,
                but it seems to support up to OpenCL 2.1 too.

config.mk

I copy ./make/config.mk to root directory and modified lines below:

DEBUG = 1
USE_OPENCL = 1 # I installed opencl previously
USE_BLAS = openblas

After successful build, I set the environment variable and executed cd python; python setup.py install --user according to tvm/install.md.

Error Log

yuens@Spark:~/Software/tvm/tutorials/python$ python get_started.py 
<class 'tvm.tensor.Tensor'>
[15:44:03] /home/yuens/Software/tvm/dmlc-core/include/dmlc/logging.h:308: [15:44:03] src/codegen/codegen.cc:27: Check failed: bf != nullptr Target llvm is not enabled

Stack trace returned 10 entries:
[bt] (0) /home/yuens/Software/tvm/lib/libtvm.so(_ZN3tvm7codegen5BuildERKNS_5ArrayINS_11LoweredFuncEvEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x10b1) [0x7feac33eea41]
[bt] (1) /home/yuens/Software/tvm/lib/libtvm.so(+0x302eb9) [0x7feac339aeb9]
[bt] (2) /home/yuens/Software/tvm/lib/libtvm.so(TVMFuncCall+0x5e) [0x7feac35977ce]
[bt] (3) /usr/local/lib/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(ffi_call_unix64+0x4c) [0x7feacbef357c]
[bt] (4) /usr/local/lib/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(ffi_call+0x1f5) [0x7feacbef2cd5]
[bt] (5) /usr/local/lib/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x3e6) [0x7feacbeea376]
[bt] (6) /usr/local/lib/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(+0x9db3) [0x7feacbee1db3]
[bt] (7) /usr/local/lib/anaconda2/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x53) [0x7feacd4cde93]
[bt] (8) /usr/local/lib/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x715d) [0x7feacd58080d]
[bt] (9) /usr/local/lib/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e) [0x7feacd582c3e]

Traceback (most recent call last):
  File "get_started.py", line 111, in <module>
    fadd_cuda = tvm.build(s, [A, B, C], "cuda", target_host="llvm", name="myadd")
  File "/home/yuens/Software/tvm/python/tvm/build_module.py", line 349, in build
    mhost = codegen.build_module(fhost, target_host)
  File "/home/yuens/Software/tvm/python/tvm/codegen.py", line 20, in build_module
    return _Build(lowered_func, target)
  File "/home/yuens/Software/tvm/python/tvm/_ffi/function.py", line 255, in my_api_func
    return flocal(*args)
  File "/home/yuens/Software/tvm/python/tvm/_ffi/_ctypes/function.py", line 183, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/home/yuens/Software/tvm/python/tvm/_ffi/base.py", line 62, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: [15:44:03] src/codegen/codegen.cc:27: Check failed: bf != nullptr Target llvm is not enabled

Stack trace returned 10 entries:
[bt] (0) /home/yuens/Software/tvm/lib/libtvm.so(_ZN3tvm7codegen5BuildERKNS_5ArrayINS_11LoweredFuncEvEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x10b1) [0x7feac33eea41]
[bt] (1) /home/yuens/Software/tvm/lib/libtvm.so(+0x302eb9) [0x7feac339aeb9]
[bt] (2) /home/yuens/Software/tvm/lib/libtvm.so(TVMFuncCall+0x5e) [0x7feac35977ce]
[bt] (3) /usr/local/lib/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(ffi_call_unix64+0x4c) [0x7feacbef357c]
[bt] (4) /usr/local/lib/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(ffi_call+0x1f5) [0x7feacbef2cd5]
[bt] (5) /usr/local/lib/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x3e6) [0x7feacbeea376]
[bt] (6) /usr/local/lib/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(+0x9db3) [0x7feacbee1db3]
[bt] (7) /usr/local/lib/anaconda2/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x53) [0x7feacd4cde93]
[bt] (8) /usr/local/lib/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x715d) [0x7feacd58080d]
[bt] (9) /usr/local/lib/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e) [0x7feacd582c3e]

Integrate DNN Libraries as Contrib

It is important to be able to integrate with dnn libraries, both for benchmarking and AOT codegen purposes. A practical compiled library is likely to contain both TVM generated wrapper calls into the dnn libraries as well as the generated dsl kernels.

Current we have example integration of blas

We would like to integrate more. This could also eventually benefit dlpack users, as these registered functions are dlpack compatible. List of possible libs

cuDNN
nnpack fc
nnpack conv
mpsdnn

OS X failure: Check failed: type_code_ == kTVMType (11 vs. 6) expected TVMType but get bytes

After compiling and installing TVM on Mac OS X (with CUDA disabled, of course), when I attempt to use the Python API I get the following error:

(tvm) MacBook-Pro-97:python ezyang$ python
Python 3.6.2 | packaged by conda-forge | (default, Jul 23 2017, 23:01:38) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tvm
>>> tvm.var('a')
[12:19:27] dmlc-core/include/dmlc/logging.h:304: [12:19:27] include/tvm/./runtime/./packed_func.h:273: Check failed: type_code_ == kTVMType (11 vs. 6)  expected TVMType but get bytes

Stack trace returned 6 entries:
[bt] (0) 0   libtvm.so                           0x00000001103612e8 _ZN4dmlc15LogMessageFatalD2Ev + 40
[bt] (1) 1   libtvm.so                           0x0000000110365fe9 _ZNK3tvm7runtime11TVMArgValuecv10DLDataTypeEv + 393
[bt] (2) 2   libtvm.so                           0x00000001103687fa _ZNSt3__110__function6__funcIN3tvm2ir3$_0ENS_9allocatorIS4_EEFvNS2_7runtime7TVMArgsEPNS7_11TVMRetValueEEEclEOS8_OSA_ + 122
[bt] (3) 3   libtvm.so                           0x000000011060b62b TVMFuncCall + 75
[bt] (4) 4   _ctypes.cpython-36m-darwin.so       0x000000010f47f2b7 ffi_call_unix64 + 79
[bt] (5) 5   ???                                 0x00007fff50f154f0 0x0 + 140734551381232

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ezyang/Dev/tvm/python/tvm/api.py", line 115, in var
    return _api_internal._Var(name, dtype)
  File "/Users/ezyang/Dev/tvm/python/tvm/_ffi/function.py", line 255, in my_api_func
    return flocal(*args)
  File "/Users/ezyang/Dev/tvm/python/tvm/_ffi/_ctypes/function.py", line 183, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/Users/ezyang/Dev/tvm/python/tvm/_ffi/base.py", line 62, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: [12:19:27] include/tvm/./runtime/./packed_func.h:273: Check failed: type_code_ == kTVMType (11 vs. 6)  expected TVMType but get bytes

Stack trace returned 6 entries:
[bt] (0) 0   libtvm.so                           0x00000001103612e8 _ZN4dmlc15LogMessageFatalD2Ev + 40
[bt] (1) 1   libtvm.so                           0x0000000110365fe9 _ZNK3tvm7runtime11TVMArgValuecv10DLDataTypeEv + 393
[bt] (2) 2   libtvm.so                           0x00000001103687fa _ZNSt3__110__function6__funcIN3tvm2ir3$_0ENS_9allocatorIS4_EEFvNS2_7runtime7TVMArgsEPNS7_11TVMRetValueEEEclEOS8_OSA_ + 122
[bt] (3) 3   libtvm.so                           0x000000011060b62b TVMFuncCall + 75
[bt] (4) 4   _ctypes.cpython-36m-darwin.so       0x000000010f47f2b7 ffi_call_unix64 + 79
[bt] (5) 5   ???                                 0x00007fff50f154f0 0x0 + 140734551381232

I'm willing to debug but posting here in case anyone has ideas.

[Help] test_device_module_dump

I was running the tests locally on my machine and I got the following failure:

======================================================================
ERROR: test_module_load.test_device_module_dump
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ezyang/local/anaconda2/conda-bld/tvm_1499269439375/_b_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.6/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/data/users/ezyang/anaconda2/conda-bld/tvm_1499269439375/work/tests/python/unittest/test_module_load.py", line 99, in test_device_module_dump
    check_device("cuda")
  File "/data/users/ezyang/anaconda2/conda-bld/tvm_1499269439375/work/tests/python/unittest/test_module_load.py", line 91, in check_device
    f.export_library(path_dso)
  File "/data/users/ezyang/anaconda2/conda-bld/tvm_1499269439375/work/python/tvm/module.py", line 75, in export_library
    raise ValueError("Module[%s]: Only llvm support export shared" % self.type_key)
ValueError: Module[stackvm]: Only llvm support export shared

What does this error message mean? Am I building with the wrong compiler or something?

[DOC] Fix comments in thread allreduce

https://github.com/tqchen/tvm/blob/master/include/tvm/ir.h#L287

Java/Scala Support

This issue is used to track related improvements in Java/Scala bindings. As a first step, we aim to bring a runtime so that we can deploy tvm compiled things into jvm languages

Support JNI TVM runtime that directly load compiled programs and invoke from java
- WIP in #176
Support a basic version of JNI RPC Server, bring some of the python API to C++ backend (if they are easy in c++) or reimplement a few functions in JNI
Test out auto packaging and RPC on android

Working on Var

C API Interface

cudnn error: `cannot find device code`

I have installed cudnn 7 and compile tvm with USE_CUDNN=1. When I run test_cudnn.py, it generates error:

/home/hyw/mxnet-tvm/tvm/python/tvm/build_module.py:331: UserWarning: Specified target cuda, but cannot find device code, did you do bind?
  "Specified target %s, but cannot find device code, did you do bind?" % target)

OS: ubuntu 16.04
CUDA: 8.0

What could it be the reason?

make html fails

I clone the latest version, make, then do make html in docs, but it fails with

reduce(combiner=comm_reducer(result=(x + y), args=[x, y], identity_element=0.000000f), B.rf(k.inner.v, ax0), axis=[iter_var(k.inner.v, Range(min=0, extent=16))])
Internal error at HalideIR/src/tvm/ir_functor.h:72
Condition failed: type_index < func_.size() && func_[type_index] != nullptr
IRFunctor calls un-registered function on type Allocate
make: *** [html] 已放弃

Reconsider dynamic library loading strategy

Here is the current state of affairs in TVM:

The cmake build dumps compiled libraries in the lib directory of the source project. This is achieved by explicitly setting CMAKE_LIBRARY_OUTPUT_DIRECTORY
The configure/make build builds libraries into the lib directory
python/tvm/_ffi/libinfo.py is responsible for figuring out where the dynamic library lives. It looks in the following locations, preferring the first that it finds:
- dirname(__file__) aka curr_path aka the directory which contains this source file. This could either be the actual _ffi source directory (in the case you are running inplace or with python setup.py develop), or the install directory of TVM
- $curr_path/../ aka root_path aka the root directory of TVM Python modules, e.g., python/tvm in the source checkout
- $curr_path/../../../lib. When libinfo.py lives in a source checkout, this points to the top-level lib directory of the source project, aka, where cmake/make dump their build products. When libinfo.py was installed, curr_path will typically look something like $prefix/lib/python3.6/site-packages/tvm/_ffi which means that this point will point to $prefix/lib/python3.6/lib which generally doesn't exist
- `$curr_path/../../../build/Release/'. Once again, in a source checkout, this will point to the Release directory of cmake (but NOTE that due to the manual configuration in cmake, this folder will generally never be populated)
- Some Windows-only paths for VS in build and windows
- If the OS is POSIX-based (linux or osx), all of the entries in LD_LIBRARY_PATH

Here are the problems with this strategy:

It's easy to accidentally get a stale copy of an so file, and be unable to tell why updates are not being reflected. This happened to me in #280, where a stale libtvm.so living in my lib directory overrode an installed copy, and the ABI didn't match up at all (leading to strange errors). I'm not altogether certain why later invocations of make did not update the dynamic library, but I know empirically that this was the case.
It means that TVM will behave differently re install lookup depending on whether or not it is installed with python setup.py install or python setup.py develop, because the source file location changes. Most notably, the search strategy doesn't work at all if you actually install the files.
In general, while this search strategy makes TVM very friendly for in-tree development (since it is able to find in tree build products), it works very poorly when TVM is actually installed somewhere, since there isn't any way to say, "please use exactly this library."

I can see two primary ways of solving the problem.

First, is to configure the location of the TVM library at python setup.py time. The usual way to do this is to define a C extension module, and then declare the TVM library as a dependency. Then the standard mechanisms to declare where libraries are to be found.

Second is to tweak the search path strategy a bit. Here are my recommendations:

The precedence of LD_LIBRARY_PATH should be over the hardcoded search paths is wrong. LD_LIBRARY_PATH should always OVERRIDE any of the built-in search paths, because it's the thing someone (like me) will use to try to force a particular dynamic library to be used.
cmake should NOT dump build products outside of the build directory, this runs counter to the cmake philosophy where all build products live in build. A concrete case where this is bad is it means you cannot run parallel debug and regular builds in the same directory. However, I understand there is a countervailing desire, which is for Python to be able to automatically pick up on an inplace built binary; but with this change, there may be two places the build product may live. With this I have two orthogonal suggestions:
- We should get rid of the Makefile based build system. There isn't any good reason to two build systems at the same time: it just means more opportunity for things to go wrong. I don't know if there are particular things we are relying on in the Makefile, but they should get ported to cmake if they are. This has an added bonus whereby there is now only one place an inplace build product may be stored.
- We should require a user to install their build products before they are picked up by Python. The default install location could be to somewhere in the source directory (it could even be lib, as it is today). This means that if you have multiple build trees, you can simply "install" one to make it be picked up by the Python library. In this case, we should also print a debug message notifying the user that they're working with an inplace library
I don't know of a known-working way to record a setup-time provided library path durably so that the search algorithm could look in the location. But we could still use the __file__ trick to guess where the library might be installed at install time.

So here is the new search algorithm:

Check LD_LIBRARY_PATH
Check the in-place install location (emitting a notice if it is)
Check the real install location

Do people agree with this? If so, I'll write a patch.

[TOPI] Implementation Guideline

There has been a few discussions on this and I am creating this issue to consolidate what we have so far. The specific question we have is that how should we create APIs for data flow declaration and Schedule interface in TOPI, here are a two key guidelines

Tensor in,Tensor out in dataflow declaration

The qoute(tensor in/tensor out) comes from google brain team. This is a general principle for compositional APIs design. Imagine we want to create a dataflow declaration for conv-relu, instead of create a declaration function for conv_relu, we want to create two functions (conv, and relu), and compose them

def conv(input, weight):
     out =  tvm.comput(...)
     return out

def relu(input):
     return tvm.compute(input.shape, lambda i*:tvm.max(0, input(*)))

input = tvm.placeholder((n, c, h, w))
weight = tvm.placeholder((c1, c2, hh, ww))
net= conv(input, weight)
net = relu(net)
# Now that net contains data flow declaration of conv-relu

Seperate data flow declaration from schedule

While it is usually convenient to put schedule logic together with the data flow declaration, it is also somewhat harmful. Imagine we have a schedule for conv-relu, what if we want to schedule conv-sigmoid? They contains the essentially the same pattern, but in the old style we need to create schedule for each of them. So ideally, we want to create a generic schedule function for a class of dataflows, without directly touching the dataflow part.

To get the tensors needed for schedule, we can recover them by traversing the dataflow DAG. Here is an possible skeleton to schedule schedule conv-ewise

# generic schedule convolution ewise
def schedule_conv_map(op):  
      # find the conv part
     s = create_schedule(op)
     conv_args = []
     def schedule_conv(data, filter, conv):
         # schedule conv here
         pass
     # visit, maybe need deduplicate
     def  visit(op):
        if is_ewise(op):
          if not is_output(op):  
              s[op].compute_inline()
           for t in op.input_tensors:
               visit(t.op)
        if is_conv(op):
           conv = op.output(0)
           data = op.input_tensors[0]
           filter = op.input_tensors[1]
           schedule_conv(conv,  data, filter)

Tuple Inputs&Outputs Support for ComputeOp

In order to implement some ops like argmax easily, we need to add multiple inputs support for ComputeOp, so that we can compare elements by their value with the index reserved.

Tuple inputs&outputs support for normal ComputeOp
Tuple inputs&outputs support for reduction ComputeOp

Predefine kernels for generic operations

Sharing the kernel generation is great. But it still request all framework to generate many kernel for many platwork/generation of platform, for many shapes/strides/axis.

It is possible to include inside tvm, all that handling and hide it in a top level fct like max(x, axis=...). Do you see that coming in tvm or not? This would stay optional to use.

NNVM Integration

nnvm graph to tvm (hooks in python)
External function callback with array
Global memory sharing plan pass in tvm
AOT flow for nn graph
Simple example

[CONTRIB] Metal Performance Shader Integration

It would be great to integrate metal performance shader into tvm.contrib as we did for nnpack and cudnn.

MPS FullyConnected
MPS Convolution

installation guide wrong git clone cmd / permission problems

Hi - I tried to follow the installation guide.
Issues:

https://github.com/dmlc/tvm/blob/master/docs/how_to/install.md
git clone --recursive ssh://[email protected]/dmlc/tvm did not work with my credentials, should be:
git clone --recursive https://github.com/dmlc/tvm.git
git clone recursively failed, see below. I was able to clone manually:
git clone https://github.com/dmlc/HalideIR.git

Failure mode:
git clone --recursive https://github.com/dmlc/tvm.git
Cloning into 'tvm'...
remote: Counting objects: 5714, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 5714 (delta 0), reused 0 (delta 0), pack-reused 5710
Receiving objects: 100% (5714/5714), 1.47 MiB | 860.00 KiB/s, done.
Resolving deltas: 100% (3847/3847), done.
Submodule 'HalideIR' (ssh://[email protected]/dmlc/HalideIR) registered for path 'HalideIR'
Submodule 'dlpack' (https://github.com/dmlc/dlpack) registered for path 'dlpack'
Submodule 'dmlc-core' (https://github.com/dmlc/dmlc-core) registered for path 'dmlc-core'
Cloning into '/Users/steroche/Documents/Projects/TVM/tvm/HalideIR'...
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'ssh://[email protected]/dmlc/HalideIR' into submodule path '/Users/steroche/Documents/Projects/TVM/tvm/HalideIR' failed
Failed to clone 'HalideIR'. Retry scheduled
Cloning into '/Users/steroche/Documents/Projects/TVM/tvm/dlpack'...
Cloning into '/Users/steroche/Documents/Projects/TVM/tvm/dmlc-core'...
Cloning into '/Users/steroche/Documents/Projects/TVM/tvm/HalideIR'...
Permission denied (publickey).
fatal: Could not read from remote repository.

error `CUDA_ERROR_NO_BINARY_FOR_GPU` when run `cuda_gemm_square.py`

it looks like the binary is not built successfully. i tried both run from a local ubuntu machine and the gpu docker in ci_test. here is the detailed log

[23:42:41] /home/muli/work/tvm/dmlc-core/include/dmlc/logging.h:308: [23:42:41] src/runtime/cuda/cuda_module.cc:93: CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU

Stack trace returned 10 entries:
[bt] (0) /workspace/lib/libtvm.so(_ZN4dmlc15LogMessageFatalD1Ev+0x39) [0x7f0efbbcc959]
[bt] (1) /workspace/lib/libtvm.so(_ZNK3tvm7runtime15CUDAWrappedFuncclENS0_7TVMArgsEPNS0_11TVMRetValueEPPv+0x428) [0x7f0efbeff068]
[bt] (2) /workspace/lib/libtvm.so(_ZNSt17_Function_handlerIFvN3tvm7runtime7TVMArgsEPNS1_11TVMRetValueEEZNS1_6detail17PackFuncVoidAddr_ILi4ENS1_15CUDAWrappedFuncEEENS1_10PackedFuncET0_RKSt6vectorINS6_14ArgConvertCodeESaISC_EEEUlS2_S4_E_E9_M_invokeERKSt9_Any_dataS2_S4_+0xc0) [0x7f0efbeff1b0]
[bt] (3) /workspace/lib/libtvm.so(TVMFuncCall+0x52) [0x7f0efbed7b12]
[bt] (4) [0x7f0efc2cb2b5]
[bt] (5) /workspace/lib/libtvm.so(_ZZN3tvm7codegen14LLVMModuleNode11GetFunctionERKSsRKSt10shared_ptrINS_7runtime10ModuleNodeEEENKUlNS5_7TVMArgsEPNS5_11TVMRetValueEE_clESA_SC_+0x3c) [0x7f0efbd5dd7c]
[bt] (6) /workspace/lib/libtvm.so(TVMFuncCall+0x52) [0x7f0efbed7b12]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f0f1223fadc]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7f0f1223f40c]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7f0f124565fe]

[23:42:41] /home/muli/work/tvm/dmlc-core/include/dmlc/logging.h:308: [23:42:41] src/codegen/llvm/llvm_module.cc:51: Check failed: ret == 0 (-1 vs. 0) [23:42:41] src/runtime/cuda/cuda_module.cc:93: CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU

Stack trace returned 10 entries:
[bt] (0) /workspace/lib/libtvm.so(_ZN4dmlc15LogMessageFatalD1Ev+0x39) [0x7f0efbbcc959]
[bt] (1) /workspace/lib/libtvm.so(_ZNK3tvm7runtime15CUDAWrappedFuncclENS0_7TVMArgsEPNS0_11TVMRetValueEPPv+0x428) [0x7f0efbeff068]
[bt] (2) /workspace/lib/libtvm.so(_ZNSt17_Function_handlerIFvN3tvm7runtime7TVMArgsEPNS1_11TVMRetValueEEZNS1_6detail17PackFuncVoidAddr_ILi4ENS1_15CUDAWrappedFuncEEENS1_10PackedFuncET0_RKSt6vectorINS6_14ArgConvertCodeESaISC_EEEUlS2_S4_E_E9_M_invokeERKSt9_Any_dataS2_S4_+0xc0) [0x7f0efbeff1b0]
[bt] (3) /workspace/lib/libtvm.so(TVMFuncCall+0x52) [0x7f0efbed7b12]
[bt] (4) [0x7f0efc2cb2b5]
[bt] (5) /workspace/lib/libtvm.so(_ZZN3tvm7codegen14LLVMModuleNode11GetFunctionERKSsRKSt10shared_ptrINS_7runtime10ModuleNodeEEENKUlNS5_7TVMArgsEPNS5_11TVMRetValueEE_clESA_SC_+0x3c) [0x7f0efbd5dd7c]
[bt] (6) /workspace/lib/libtvm.so(TVMFuncCall+0x52) [0x7f0efbed7b12]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f0f1223fadc]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7f0f1223f40c]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7f0f124565fe]


Stack trace returned 10 entries:
[bt] (0) /workspace/lib/libtvm.so(_ZN4dmlc15LogMessageFatalD1Ev+0x39) [0x7f0efbbcc959]
[bt] (1) /workspace/lib/libtvm.so(_ZZN3tvm7codegen14LLVMModuleNode11GetFunctionERKSsRKSt10shared_ptrINS_7runtime10ModuleNodeEEENKUlNS5_7TVMArgsEPNS5_11TVMRetValueEE_clESA_SC_+0x203) [0x7f0efbd5df43]
[bt] (2) /workspace/lib/libtvm.so(TVMFuncCall+0x52) [0x7f0efbed7b12]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f0f1223fadc]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7f0f1223f40c]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7f0f124565fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7f0f12457f9e]
[bt] (7) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (8) python() [0x568b3a]
[bt] (9) python() [0x4c2604]

Traceback (most recent call last):
  File "topi/recipe/gemm/cuda_gemm_square.py", line 124, in <module>
    test_gemm()
  File "topi/recipe/gemm/cuda_gemm_square.py", line 121, in test_gemm
    check_device("cuda")
  File "topi/recipe/gemm/cuda_gemm_square.py", line 114, in check_device
    f(a, b, c)
  File "/workspace/python/tvm/_ffi/function.py", line 128, in __call__
    return f(*args)
  File "/workspace/python/tvm/_ffi/_ctypes/function.py", line 183, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/workspace/python/tvm/_ffi/base.py", line 61, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: [23:42:41] src/codegen/llvm/llvm_module.cc:51: Check failed: ret == 0 (-1 vs. 0) [23:42:41] src/runtime/cuda/cuda_module.cc:93: CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_NO_BINARY_FOR_GPU

Stack trace returned 10 entries:
[bt] (0) /workspace/lib/libtvm.so(_ZN4dmlc15LogMessageFatalD1Ev+0x39) [0x7f0efbbcc959]
[bt] (1) /workspace/lib/libtvm.so(_ZNK3tvm7runtime15CUDAWrappedFuncclENS0_7TVMArgsEPNS0_11TVMRetValueEPPv+0x428) [0x7f0efbeff068]
[bt] (2) /workspace/lib/libtvm.so(_ZNSt17_Function_handlerIFvN3tvm7runtime7TVMArgsEPNS1_11TVMRetValueEEZNS1_6detail17PackFuncVoidAddr_ILi4ENS1_15CUDAWrappedFuncEEENS1_10PackedFuncET0_RKSt6vectorINS6_14ArgConvertCodeESaISC_EEEUlS2_S4_E_E9_M_invokeERKSt9_Any_dataS2_S4_+0xc0) [0x7f0efbeff1b0]
[bt] (3) /workspace/lib/libtvm.so(TVMFuncCall+0x52) [0x7f0efbed7b12]
[bt] (4) [0x7f0efc2cb2b5]
[bt] (5) /workspace/lib/libtvm.so(_ZZN3tvm7codegen14LLVMModuleNode11GetFunctionERKSsRKSt10shared_ptrINS_7runtime10ModuleNodeEEENKUlNS5_7TVMArgsEPNS5_11TVMRetValueEE_clESA_SC_+0x3c) [0x7f0efbd5dd7c]
[bt] (6) /workspace/lib/libtvm.so(TVMFuncCall+0x52) [0x7f0efbed7b12]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f0f1223fadc]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7f0f1223f40c]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7f0f124565fe]


Stack trace returned 10 entries:
[bt] (0) /workspace/lib/libtvm.so(_ZN4dmlc15LogMessageFatalD1Ev+0x39) [0x7f0efbbcc959]
[bt] (1) /workspace/lib/libtvm.so(_ZZN3tvm7codegen14LLVMModuleNode11GetFunctionERKSsRKSt10shared_ptrINS_7runtime10ModuleNodeEEENKUlNS5_7TVMArgsEPNS5_11TVMRetValueEE_clESA_SC_+0x203) [0x7f0efbd5df43]
[bt] (2) /workspace/lib/libtvm.so(TVMFuncCall+0x52) [0x7f0efbed7b12]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f0f1223fadc]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7f0f1223f40c]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7f0f124565fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7f0f12457f9e]
[bt] (7) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (8) python() [0x568b3a]
[bt] (9) python() [0x4c2604]

Intrinsics and extern expr function registry

Callback allows pass NodeRef.
Registry to support register extern expr and intrinsic
Tutorial on intrinsics

cuModuleUnload returns CUDA_ERROR_DEINITIALIZED

Env:

Tesla K80
Driver Version 352.39
Centos 7.3.1611
CUDA 7.5
cudnn 7.5 v5.1

Source code:

import tvm
import numpy as np

n = tvm.var("n")
m = 128
A = tvm.placeholder((n, m), name='A')
k = tvm.reduce_axis((0, m), "k")
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="B")

s = tvm.create_schedule(B.op)
s[B.op].bind(B.op.axis[0], tvm.thread_axis("blockIdx.x"))
s[B.op].bind(k, tvm.thread_axis("threadIdx.x"))
print(tvm.lower(s, [A, B], with_api_wrapper=False))

fcuda = tvm.build(s, [A, B], "cuda")
print(fcuda.imported_modules[0].get_source())

ctx  = tvm.gpu(0)
a = tvm.nd.array(np.random.uniform(size=(m, m)).astype(A.dtype), ctx)
b = tvm.nd.array(np.zeros(m, dtype=B.dtype), ctx)
fcuda(a, b)
np.testing.assert_allclose(
    b.asnumpy(),  np.sum(a.asnumpy(), axis=1), rtol=1e-4)

Output:

$ python reduce_axis.py 
produce B {
  // attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = n
  // attr [reduce_temp] storage_scope = "local"
  allocate reduce_temp[float32 * 1]
  // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 128
  // attr [comm_reducer(result=(x + y), args=[x, y], identity_element=0.000000f)] reduce_scope = 0.000000f
  reduce_temp[0] = tvm_thread_allreduce(A[((blockIdx.x*128) + threadIdx.x)], (uint1)1, threadIdx.x)
  if ((threadIdx.x == 0)) {
    B[blockIdx.x] = reduce_temp[0]
  }
}

extern "C" __global__ void default_function__kernel0( float* A,  float* B) {
  int blockIdx_x = blockIdx.x;
  __shared__ float red_buf[128];
  int threadIdx_x = threadIdx.x;
  ((volatile __shared__  float*)red_buf)[threadIdx_x] = A[((blockIdx_x * 128) + threadIdx_x)];
  __syncthreads();
  if ((threadIdx_x < 64)) {
    ((volatile __shared__  float*)red_buf)[threadIdx_x] = (((volatile __shared__  float*)red_buf)[threadIdx_x] + ((volatile __shared__  float*)red_buf)[(threadIdx_x + 64)]);
  }
  __syncthreads();
  if ((threadIdx_x < 32)) {
    ((volatile __shared__  float*)red_buf)[threadIdx_x] = (((volatile __shared__  float*)red_buf)[threadIdx_x] + ((volatile __shared__  float*)red_buf)[(threadIdx_x + 32)]);
  }
  __syncthreads();
  if ((threadIdx_x < 16)) {
    ((volatile __shared__  float*)red_buf)[threadIdx_x] = (((volatile __shared__  float*)red_buf)[threadIdx_x] + ((volatile __shared__  float*)red_buf)[(threadIdx_x + 16)]);
    ((volatile __shared__  float*)red_buf)[threadIdx_x] = (((volatile __shared__  float*)red_buf)[threadIdx_x] + ((volatile __shared__  float*)red_buf)[(threadIdx_x + 8)]);
    ((volatile __shared__  float*)red_buf)[threadIdx_x] = (((volatile __shared__  float*)red_buf)[threadIdx_x] + ((volatile __shared__  float*)red_buf)[(threadIdx_x + 4)]);
    ((volatile __shared__  float*)red_buf)[threadIdx_x] = (((volatile __shared__  float*)red_buf)[threadIdx_x] + ((volatile __shared__  float*)red_buf)[(threadIdx_x + 2)]);
    ((volatile __shared__  float*)red_buf)[threadIdx_x] = (((volatile __shared__  float*)red_buf)[threadIdx_x] + ((volatile __shared__  float*)red_buf)[(threadIdx_x + 1)]);
  }
  if ((threadIdx_x == 0)) {
    B[blockIdx_x] = ((volatile __shared__  float*)red_buf)[0];
  }
}

[22:15:05] dmlc-core/include/dmlc/logging.h:300: [22:15:05] src/runtime/cuda/cuda_module.cc:43: CUDAError: cuModuleUnload(module_[i]) failed with error: 
terminate called after throwing an instance of 'dmlc::Error'
  what():  [22:15:05] src/runtime/cuda/cuda_module.cc:43: CUDAError: cuModuleUnload(module_[i]) failed with error: 
已放弃

I debugged and found the return value is CUDA_ERROR_DEINITIALIZED.
The calculation result is correct (if I remove the CUDA_DRIVER_CALL check)

Any idea why this happened? @tqchen

TVM 0.1 Release Note

Thanks all for pushing hard toward the release. We plan to make the repo public on Thursday, it is two days behind our original plan but hopefully we get most pieces here. I will keep editing the issue to as we have more updates, the checklist will go into the release note.

Check List

Language runtime
- python
- javascript
- java
- c++
Backend
- arm, x86
- javascript, wasm
- CUDA
- opencl
- Metal
DNN Library integration
- nnpack fc
- nnpack conv
- CuDNN
- MPS
RPC runtime
- Normal RPC
- RPC Proxy
- Websocket RPC
- Android RPC
- iOS RPC
TOPI operator pipeline python
- Convolution
- Depthwise convolution
- GEMM
- Broadcast and reduction
TOPI operator pipeline in C++
- Convolution
- Depthwise
- Broadcast
- reduction
Rough perf of the TOPI GPU pipeline
- GPU GEMM
- GPU depthwise-conv
- GPU conv2d, batch=64
- GPU conv2d, batch=1
- GPU bcast and reduction
Rough pref of TOPI CPU pipeline
- CPU GEMM
- CPU conv2d
- CPU depthwise-conv
- CPU bcast and reduction
End to end graph executors
- Basic graph lowering framework
- NNVM Graph runtime
- Resnet pipeline CPU version
- Resnet pipeline raspi

Scaffolding Task

LLVM Support

https://github.com/tqchen/tvm/pull/49

Prefetch Optimization

Support prefetch primitive in schedule
Prefetch Injection pass
Prefetch lowering pass in storage flatten
- Fold multiple prefetch dimensions into one dimension as much as possible(when extent and strides are constant)
- We only need to support constant extent(can have non constant stride) for common cases
Add tvm_prefetch intrinsic
LLVM codegen
- Generate llvm.prefetch intrinsics

TVM and HLO/XLA

Any quick overview on the differences?

Ahead of Time Compilation and Module System

https://github.com/tqchen/tvm/pull/51 https://github.com/tqchen/tvm/pull/53

[DOC] Prerequisites in Installation Guide and set CUDA_PATH

Maybe add a Prerequisites or Dependencies section in Installation Guide? Since USE_CUDA is set by default, users probably should have CUDA installed before the build. Otherwise, the "make" would fail out of box.

Also, might be useful to include a note to set CUDA_PATH if not already set. Maybe we can change
https://github.com/dmlc/tvm/blob/master/make/config.mk#L53
to
CUDA_PATH ?= /usr/local/cuda

ThreadIndex unsigned to sign concise conversion

In same padding depthwise conv operator, errors occur when tvm.build_config(explicit_unroll=False).

Below is the generated CUDA code with explicit_unroll = True:

if (threadIdx.x < 6) {
  if (threadIdx.y < 6) {
    PaddedInput_shared[(((threadIdx.x * 18) + threadIdx.y) * 3)] = ((min(threadIdx.x, threadIdx.y) < 1) ? 0.000000e+00f : Input[((((threadIdx.x * 16) + threadIdx.y) * 3) + -17)]);
    PaddedInput_shared[((((threadIdx.x * 18) + threadIdx.y) * 3) + 1)] = (((1 <= threadIdx.x) && (0 <= threadIdx.y)) ? Input[((((threadIdx.x * 16) + threadIdx.y) * 3) + -16)] : 0.000000e+00f);
    PaddedInput_shared[((((threadIdx.x * 18) + threadIdx.y) * 3) + 2)] = ((((1 <= threadIdx.x) && (0 <= threadIdx.y)) && (threadIdx.y < 5)) ? Input[((((threadIdx.x * 16) + threadIdx.y) * 3) + -15)] : 0.000000e+00f);
  }
}
if (threadIdx.x < 6) {
  if (threadIdx.y < 6) {
    PaddedInput_shared[((((threadIdx.x * 18) + threadIdx.y) * 3) + 18)] = (((0 <= threadIdx.x) && (1 <= threadIdx.y)) ? Input[((((threadIdx.x * 16) + threadIdx.y) * 3) + -1)] : 0.000000e+00f);
    PaddedInput_shared[((((threadIdx.x * 18) + threadIdx.y) * 3) + 19)] = ((min(threadIdx.x, threadIdx.y) < 0) ? 0.000000e+00f : Input[(((threadIdx.x * 16) + threadIdx.y) * 3)]);
    PaddedInput_shared[((((threadIdx.x * 18) + threadIdx.y) * 3) + 20)] = (((0 <= min(threadIdx.x, threadIdx.y)) && (threadIdx.y < 5)) ? Input[((((threadIdx.x * 16) + threadIdx.y) * 3) + 1)] : 0.000000e+00f);
  }
}
if (threadIdx.x < 6) {
  if (threadIdx.y < 6) {
    PaddedInput_shared[((((threadIdx.x * 18) + threadIdx.y) * 3) + 36)] = ((((0 <= threadIdx.x) && (threadIdx.x < 5)) && (1 <= threadIdx.y)) ? Input[((((threadIdx.x * 16) + threadIdx.y) * 3) + 15)] : 0.000000e+00f);
    PaddedInput_shared[((((threadIdx.x * 18) + threadIdx.y) * 3) + 37)] = ((((0 <= threadIdx.x) && (threadIdx.x < 5)) && (0 <= threadIdx.y)) ? Input[((((threadIdx.x * 16) + threadIdx.y) * 3) + 16)] : 0.000000e+00f);
    PaddedInput_shared[((((threadIdx.x * 18) + threadIdx.y) * 3) + 38)] = (((((0 <= threadIdx.x) && (threadIdx.x < 5)) && (0 <= threadIdx.y)) && (threadIdx.y < 5)) ? Input[((((threadIdx.x * 16) + threadIdx.y) * 3) + 17)] : 0.000000e+00f);
  }
}

And the generated CUDA code with explicit_unroll = False:

#pragma unroll
for (int ax2_inner = 0; ax2_inner < 3; ++ax2_inner) {
  #pragma unroll
  for (int ax3_inner = 0; ax3_inner < 3; ++ax3_inner) {
    if ((threadIdx.x * 3) < (18 - ax2_inner)) {
      if ((threadIdx.y * 3) < (18 - ax3_inner)) {
        PaddedInput_shared[((((((threadIdx.x * 3) + ax2_inner) * 6) + threadIdx.y) * 3) + ax3_inner)] = ((((((1 - ax2_inner) <= (threadIdx.x * 3)) && ((threadIdx.x * 3) < (17 - ax2_inner))) && ((1 - ax3_inner) <= (threadIdx.y * 3))) && ((threadIdx.y * 3) < (17 - ax3_inner))) ? Input[((((((threadIdx.x * 3) + ax2_inner) * 16) + (threadIdx.y * 3)) + ax3_inner) + -17)] : 0.000000e+00f);
      }
    }
  }
}

The error is fixed after I manually modified the code:

((1 - ax2_inner) <= (threadIdx.x * 3)) -> (1 <= ax2_inner + threadIdx.x * 3)
((1 - ax3_inner) <= (threadIdx.y * 3)) -> (1 <= ax3_inner + threadIdx.y * 3)

I guess it's because threadIdx.x is unsigned. Maybe the codegen module corresponding to tvm.select should be modified to fix the bug.

Bulk Module

Being able to pack multiple generated schedules into a single module shared library file

Binary2c util for pack binaries
Meta data saving of device dependent modules in binary

External Function Call

Operator system refactor
External function call

When do you plan to make this public?

Hi,

I'll post a few questions about this repo as different issue.

When do you plan to have this public?

Simplify the if statement in backprop depthwise convolution

Hi everyone. I am working on the backward method for the depth wise convolution. The implementation I currently can think of has a lot of tvm.select. Is there any way we could simplify the code?

def trans(b, i, j, c):
    global Out_grad_cond
    Out_grad_cond = tvm.compute(
        (batch, in_h, in_w, out_c),
        lambda bo, io, jo, co: tvm.select(tvm.all(io >= tvm.select(0<(i - filter_h + pad_h + stride_h) / stride_h,(i-filter_h+pad_h+stride_h)/stride_h,tvm.const(0)),
                                                  io <  tvm.select(0<((i + pad_h) / stride_h)+1-out_h, tvm.const(out_h - 1), (i + pad_h) / stride_h),
                                                  jo >= tvm.select(0<(j - filter_w + pad_w + stride_w) / stride_w,(j-filter_w+pad_w+stride_w)/stride_w,tvm.const(0)),
                                                  jo <  tvm.select(0<((j + pad_w) / stride_w)+1-out_w, tvm.const(out_w - 1), (j + pad_w) / stride_w)),
                                                  Out_grad[b, i, j, c], tvm.const(0.0)))

    di = tvm.reduce_axis((0, out_h-1), name='di')
    dj = tvm.reduce_axis((0, out_w-1), name='dj')
    dc = tvm.reduce_axis((0, channel_multiplier), name='dc')

    return tvm.sum(Out_grad_cond[b, di, dj, c*channel_multiplier + dc] * Filter[i+pad_h-di*stride_h, j+pad_w-dj*stride_w, c, dc],axis=[di,dj,dc])

In_grad = tvm.compute(
         (batch, in_h, in_w, in_c),
         lambda b, i, j, c: trans(b,i,j,c),
         name='In_grad')

Thank you!

Recurrent Support

Scan
Fix point detection, to enable slice over scan's result
Persistent kernel
Spinlock for cross block sync

https://github.com/tqchen/tvm/pull/43

CXX API Todo

Tensor
- InferInputDomain @icemelon9
Schedule @tqchen
Split @tqchen
Buffer
Codegen

Broadcast Reduction Example

Implement broadcast
Implement reduction on certain axis

Metal Backend for iOS devices

Metal generator
Objective c++ based metal runtime

CodeGenC C++ API

Hello,

I am trying to create a minimal interesting example using the C++ API only.
My goal is to programmatically create a simple matmul and generate C code from it using tvm::codegen::CodeGenC
I've looked at what happens in python for the gemm_square.py example.

The following minimal C++ TVM example errors with:

tvm/dmlc-core/include/dmlc/logging.h:308: [15:13:48] src/codegen/codegen_c.cc:184: unknown field code

I am curious if I overlooked something simple or it something else is going on.
In particular, is the C code generator usable by itself?
I see all other code generators (CUDA, LLVM, OpenCL, Metal and Verilog) being used to generate code, except for the C code generator.

My full example below:

#include <iostream>
#include <string>
#include <vector>

#include "../tvm/src/codegen/codegen_c.h"

#include "tvm/tvm.h"
#include "tvm/ir_pass.h"
#include "tvm/schedule_pass.h"

using namespace std;

int main(void) {
  tvm::Var M("M");
  tvm::Var N("N");
  tvm::Var K("K");
  tvm::Tensor I = tvm::placeholder({M, K}, tvm::Float(32), "I");
  tvm::Tensor W = tvm::placeholder({K, N}, tvm::Float(32), "W");
  tvm::IterVar rv = tvm::reduce_axis(tvm::Range{0, K}, "kk");

  auto O = tvm::compute(
    {M, N},
    [&](tvm::Var i, tvm::Var j) {
      return tvm::sum(I[i][rv] * W[rv][j], {rv});
    },
    "O");

  tvm::Array<tvm::Operation> ops({O->op});
  auto schedule = tvm::Schedule(ops).normalize();
  auto bounds = tvm::schedule::InferBound(schedule);
  auto stmt = tvm::schedule::ScheduleOps(schedule, bounds);
  stmt = tvm::ir::Simplify(stmt);
  std::cout << stmt << std::endl;

  auto bufferI = tvm::BufferNode::make(
    std::string("I"),
    tvm::Var("pI", tvm::Handle()),
    tvm::Array<tvm::Expr>({M, K}),
    tvm::Array<tvm::Expr>({M, K}),
    tvm::Float(32),
    tvm::Expr(0),
    0
  );

  auto bufferW = tvm::BufferNode::make(
    std::string("W"),
    tvm::Var("pW", tvm::Handle()),
    tvm::Array<tvm::Expr>({K, N}),
    tvm::Array<tvm::Expr>({K, N}),
    tvm::Float(32),
    tvm::Expr(0),
    0
  );

  auto bufferO = tvm::BufferNode::make(
    std::string("O"),
    tvm::Var("pO", tvm::Handle()),
    tvm::Array<tvm::Expr>({M, N}),
    tvm::Array<tvm::Expr>({M, N}),
    tvm::Float(32),
    tvm::Expr(0),
    0
  );

  tvm::Array<tvm::NodeRef> tvmArgs({bufferI, bufferW, bufferO});
  auto api = tvm::ir::MakeAPI(stmt, "matmul", tvmArgs, 3);
  std::cout << api << std::endl;

  tvm::codegen::CodeGenC cg;
  cg.Init(false);
  cg.AddFunction(api);
  std::cout << cg.Finish() << std::endl;
}

iOS RPC Server App

We have a nice tutorial on how to use RPC to tune code on devboards such as raspi. We want to do the same practice on iOS devices. Here is a todo list of RPC server in iOS.

Implement a RPC server in objective c++,
- c.f. reference implementation in python
- Some logic of RPC server is in python is because we need some features such as create temp directory, invoke shells. These are usually platform dependent and c++ is bit clumsy to do these things and not really easy to do cross platform, so we choose to do these minimum logics in front-end, while keep most of runtime in C++
Build an iOS app that starts an RPC server, and optionally prints out the IP and port for the developer to connect to
- This requires dynamic loading of shared libraries on iOS app, which is not supported in appstore Apps, but seems to be fine for developer apps
Improve client side build script in contrib to be able to quickly build iOS shared libraries and ship them through RPC
- We might want to add contrib hook to call xcode
Support shipping of metal code.

Usage examples

Could you guys point us any example about how to start using this project?

Jenkins TestFlow

Currently most tests are run through travis, we also need a jenkins test flow for most GPU and examples, as well as backup for things travis cannot deal with

GEMM Perf

Vectorize
Shared memory allocation
Virtual Thread
GEMM Perf

compile errors on fresh clone

following instructions in install.md trying to build and install on MAC 10.11.6

git clone --recursive ...
is missing dmlc-core and dlpack
by default USE_CUDA=1, which requires cuda libraries. Lack of cuda libraries results in compile error
installation for user failed, system wide installation worked.
python setup.py install --user
['/Users/steroche/Documents/Projects/TVM/tvm/lib/libtvm.so']
Warning: Extension name 'tvm._ffi._cy2.core' does not match fully qualified name 'core' of 'tvm/_ffi/_cython/core.pyx'
Compiling tvm/_ffi/_cython/core.pyx because it changed.
[1/1] Cythonizing tvm/_ffi/cython/core.pyx
running install
error: can't combine user with prefix, exec_prefix/home, or install(plat)base
c4b301cf6ca3:python steroche$ ll /Users/steroche/Documents/Projects/TVM/tvm/lib/libtvm.so
-rwxr-xr-x 1 steroche ANT\Domain Users 7334320 Jul 9 12:15 /Users/steroche/Documents/Projects/TVM/tvm/lib/libtvm.so
c4b301cf6ca3:python steroche$ vim setup.py
c4b301cf6ca3:python steroche$ sudo python setup.py install

Python IRBuilder

IRBuilder from python
Update testcases using new IRBuilder

LoopPartition

Bound deducer https://github.com/tqchen/tvm/pull/40
LoopPartition

Start working on Tensor Infer

Cannot store type float32 for fusion of bn and relu

I am trying to add the fusion of bn and relu for the depthwise conv forward for layout NHWC. When I try to build the bn with the s2 scheduler, it does not work. f2 = tvm.build(s2, [Input, Filter, Scale, Shift, ScaleShift], device)
In the same test file, the depthwise conv forward test for both NHWC and NCHW passed. The bn and relu fusion for NCHW also passes. Am I missing something for NHWC layout with the fusion?

Traceback (most recent call last):
File "test_topi_depthwise_conv2d_map.py", line 215, in
test_depthwise_conv2d_map()
File "test_topi_depthwise_conv2d_map.py", line 204, in test_depthwise_conv2d_map
depthwise_conv2d_map_with_workload_nhwc(1, 728, 64, 1, 3, 1, "SAME")
File "test_topi_depthwise_conv2d_map.py", line 188, in depthwise_conv2d_map_with_workload_nhwc
check_device("cuda")
File "test_topi_depthwise_conv2d_map.py", line 158, in check_device
f2 = tvm.build(s2, [Input, Filter, Scale, Shift, ScaleShift], device)
File "/home/liuwt92/.local/lib/python2.7/site-packages/tvm-0.1.0-py2.7-linux-x86_64.egg/tvm/build_module.py", line 345, in build
mhost = codegen.build_module(fhost, target_host)
File "/home/liuwt92/.local/lib/python2.7/site-packages/tvm-0.1.0-py2.7-linux-x86_64.egg/tvm/codegen.py", line 20, in build_module
return _Build(lowered_func, target)
File "/home/liuwt92/.local/lib/python2.7/site-packages/tvm-0.1.0-py2.7-linux-x86_64.egg/tvm/_ffi/function.py", line 255, in my_api_func
return flocal(*args)
File "tvm/_ffi/_cython/function.pxi", line 260, in core.FunctionBase.call (tvm/_ffi/_cython/core.cpp:6818)
File "tvm/_ffi/_cython/function.pxi", line 209, in core.FuncCall (tvm/_ffi/_cython/core.cpp:6040)
File "tvm/_ffi/_cython/function.pxi", line 201, in core.FuncCall3 (tvm/_ffi/_cython/core.cpp:5943)
File "tvm/_ffi/_cython/base.pxi", line 130, in core.CALL (tvm/_ffi/_cython/core.cpp:1837)
tvm._ffi.base.TVMError: [10:31:52] src/codegen/stack_vm/././stack_vm.h:382: Cannot store type float32

The code I modify from is this: https://github.com/dmlc/tvm/blob/master/topi/tests/python/test_topi_depthwise_conv2d_map.py
Thank you!

Fail to load .so

I use the latest version:
save.py:

import tvm

n = tvm.var("n")
A = tvm.placeholder((n,), name='A')
B = tvm.placeholder((n,), name='B')
C = tvm.compute(A.shape, lambda i: A[i] + B[i], name="C")
s = tvm.create_schedule(C.op)
fadd = tvm.build(s, [A, B, C], "llvm", target_host="llvm", name="myadd")

from tvm.contrib import cc_compiler as cc
from tvm.contrib import util

fadd.save("myadd.o")
cc.create_shared("myadd.so", ["myadd.o"])

load.py:

import tvm
import numpy as np

fadd1 = tvm.module.load("myadd.so")
ctx = tvm.cpu(0)
n = 10

a = tvm.nd.array(np.random.uniform(size=n).astype(np.float32), ctx)
b = tvm.nd.array(np.random.uniform(size=n).astype(np.float32), ctx)
c = tvm.nd.array(np.zeros(n, dtype=np.float32), ctx)

print(type(fadd1))
fadd1(a, b, c)
print(c.asnumpy())

But I got:

$ python3 load.py 
TVM: Initializing cython mode...
[00:23:59] /home/liuyizhi/tvm/dmlc-core/include/dmlc/logging.h:308: [00:23:59] src/runtime/dso_module.cc:49: Check failed: lib_handle_ != nullptr Failed to load dynamic shared library myadd.so

Stack trace returned 10 entries:
[bt] (0) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/libtvm.so(_ZN4dmlc15LogMessageFatalD1Ev+0x29) [0x7ff6188d8879]
[bt] (1) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/libtvm.so(_ZN3tvm7runtime13DSOModuleNode4InitERKSs+0x1f90) [0x7ff618bd8410]
[bt] (2) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/libtvm.so(+0x83d0de) [0x7ff618bd40de]
[bt] (3) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/libtvm.so(_ZN3tvm7runtime6Module12LoadFromFileERKSsS3_+0x43c) [0x7ff618bde9dc]
[bt] (4) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/libtvm.so(+0x84a15d) [0x7ff618be115d]
[bt] (5) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/libtvm.so(TVMFuncCall+0x52) [0x7ff618be5122]
[bt] (6) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/_ffi/_cy3/core.cpython-35m-x86_64-linux-gnu.so(+0x15c7d) [0x7ff61566fc7d]
[bt] (7) /home/liuyizhi/anaconda3/bin/../lib/libpython3.5m.so.1.0(PyObject_Call+0x56) [0x7ff629749236]
[bt] (8) /home/liuyizhi/anaconda3/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x6614) [0x7ff629823234]
[bt] (9) /home/liuyizhi/anaconda3/bin/../lib/libpython3.5m.so.1.0(+0x144b49) [0x7ff629826b49]

Traceback (most recent call last):
  File "load.py", line 4, in <module>
    fadd1 = tvm.module.load("myadd.so")
  File "/home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/module.py", line 136, in load
    return _LoadFromFile(path, fmt)
  File "/home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/_ffi/function.py", line 255, in my_api_func
    return flocal(*args)
  File "tvm/_ffi/_cython/function.pxi", line 260, in core.FunctionBase.__call__ (tvm/_ffi/_cython/core.cpp:6339)
  File "tvm/_ffi/_cython/function.pxi", line 209, in core.FuncCall (tvm/_ffi/_cython/core.cpp:5583)
  File "tvm/_ffi/_cython/function.pxi", line 201, in core.FuncCall3 (tvm/_ffi/_cython/core.cpp:5486)
  File "tvm/_ffi/_cython/base.pxi", line 129, in core.CALL (tvm/_ffi/_cython/core.cpp:1654)
tvm._ffi.base.TVMError: b'[00:23:59] src/runtime/dso_module.cc:49: Check failed: lib_handle_ != nullptr Failed to load dynamic shared library myadd.so\n\nStack trace returned 10 entries:\n[bt] (0) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/libtvm.so(_ZN4dmlc15LogMessageFatalD1Ev+0x29) [0x7ff6188d8879]\n[bt] (1) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/libtvm.so(_ZN3tvm7runtime13DSOModuleNode4InitERKSs+0x1f90) [0x7ff618bd8410]\n[bt] (2) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/libtvm.so(+0x83d0de) [0x7ff618bd40de]\n[bt] (3) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/libtvm.so(_ZN3tvm7runtime6Module12LoadFromFileERKSsS3_+0x43c) [0x7ff618bde9dc]\n[bt] (4) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/libtvm.so(+0x84a15d) [0x7ff618be115d]\n[bt] (5) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/libtvm.so(TVMFuncCall+0x52) [0x7ff618be5122]\n[bt] (6) /home/liuyizhi/.local/lib/python3.5/site-packages/tvm-0.1.0-py3.5-linux-x86_64.egg/tvm/_ffi/_cy3/core.cpython-35m-x86_64-linux-gnu.so(+0x15c7d) [0x7ff61566fc7d]\n[bt] (7) /home/liuyizhi/anaconda3/bin/../lib/libpython3.5m.so.1.0(PyObject_Call+0x56) [0x7ff629749236]\n[bt] (8) /home/liuyizhi/anaconda3/bin/../lib/libpython3.5m.so.1.0(PyEval_EvalFrameEx+0x6614) [0x7ff629823234]\n[bt] (9) /home/liuyizhi/anaconda3/bin/../lib/libpython3.5m.so.1.0(+0x144b49) [0x7ff629826b49]\n'

Did I do something wrong? It used to work with some earlier version.

TVM Operator Recipes

As per discussion with a few folks, it would be nice to have a common operator recipe project, with all the sugars to construct the ops, as well as good schedules for them. So they can be reused by most of the frameworks using TVM, currently including caffe2, mxnet and pytorch.

I am thinking to put it directly under dmlc/tvm/recipe, with include of python and c++ structures for operator constructions as well as benchmarking. Any thoughts?

Executor Generation

The current function generation is great for API functions with few input, but not so great for executor of multiple stages, where some argument(weight) can be preset. A executor interface might be better for these cases

def Executor(object):
    def set_input(self, index, tensor):
          """Set tensor to i-th input"""
          pass
    def run(self):
          """Run computations"""
    def get_output(self, index):
          """Get output to index-th tensor"""

Associative Reduction

Reduction factoring https://github.com/tqchen/tvm/issues/63
Parallel reduction primitives in GPU https://github.com/tqchen/tvm/pull/79