a language for fast, portable data-parallel computation

License: Other

Halide

Halide is a programming language designed to make it easier to write high-performance image and array processing code on modern machines. Halide currently targets:

CPU architectures: X86, ARM, Hexagon, PowerPC, RISC-V
Operating systems: Linux, Windows, macOS, Android, iOS, Qualcomm QuRT
GPU Compute APIs: CUDA, OpenCL, Apple Metal, Microsoft Direct X 12, Vulkan

Rather than being a standalone programming language, Halide is embedded in C++. This means you write C++ code that builds an in-memory representation of a Halide pipeline using Halide's C++ API. You can then compile this representation to an object file, or JIT-compile it and run it in the same process. Halide also has Python bindings that fully support writing Halide embedded in Python without C++.

Halide requires C++17 (or later) to use.

For more detail about what Halide is, see http://halide-lang.org.

For API documentation see http://halide-lang.org/docs

To see some example code, look in the tutorials directory.

If you've acquired a full source distribution and want to build Halide, see the notes below.

Getting Halide

Binary tarballs

The latest version of Halide can always be found on GitHub at https://github.com/halide/Halide/releases

We provide binary releases for many popular platforms and architectures, including 32/64-bit x86 Windows, 64-bit macOS, and 32/64-bit x86/ARM Ubuntu Linux.

Vcpkg

If you use vcpkg to manage dependencies, you can install Halide via:

$ vcpkg install halide:x64-windows # or x64-linux/x64-osx

One caveat: vcpkg installs only the minimum Halide backends required to compile code for the active platform. If you want to include all the backends, you should install halide[target-all]:x64-windows instead. Note that since this will build LLVM, it will take a lot of disk space (up to 100GB).

Homebrew

Alternatively, if you use macOS, you can install Halide via Homebrew like so:

$ brew install halide

Other package managers

We are interested in bringing Halide to other popular package managers and Linux distribution repositories including, but not limited to, Conan, Debian, Ubuntu (or PPA), CentOS/Fedora, and Arch. If you have experience publishing packages we would be happy to work with you!

If you are a maintainer of any other package distribution platform, we would be excited to work with you, too.

Platform Support

There are two sets of platform requirements relevant to Halide: those required to run the compiler library in either JIT or AOT mode, and those required to run the binary outputs of the AOT compiler.

These are the tested host toolchain and platform combinations for building and running the Halide compiler library.

Compiler	Version	OS	Architectures
GCC	9.4	Ubuntu Linux 20.04 LTS	x86, x64, ARM32
GCC	9.4	Ubuntu Linux 18.04 LTS	ARM32, ARM64
MSVC	2019 (19.28)	Windows 10 (20H2)	x86, x64
AppleClang	14.0.3	macOS 13.4	x86_64
AppleClang	14.0.3	macOS 13.4	ARM64

Some users have successfully built Halide for Linux using Clang 9.0.0+, for Windows using ClangCL 11.0.0+, and for Windows ARM64 by cross-compiling with MSVC. We do not actively test these scenarios, however, so your mileage may vary.

Beyond these, we are willing to support (by accepting PRs for) platform and toolchain combinations that still receive active, first-party, public support from their original vendors. For instance, at time of writing, this excludes Windows 7 and includes Ubuntu 18.04 LTS.

Compiled AOT pipelines are expected to have much broader platform support. The binaries use the C ABI, and we expect any compliant C compiler to be able to use the generated headers correctly. The C++ bindings currently require C++17. If you discover a compatibility problem with a generated pipeline, please open an issue.

Building Halide with Make

TL;DR

Have llvm-16.0 (or greater) installed and run make in the root directory of the repository (where this README is).

Acquiring LLVM

At any point in time, building Halide requires either the latest stable version of LLVM, the previous stable version of LLVM, or trunk. At the time of writing, this means versions 18, 17, and 16 are supported, but 15 is not. The commands llvm-config and clang must be somewhere in the path.

If your OS does not have packages for LLVM, you can find binaries for it at http://llvm.org/releases/download.html. Download an appropriate package and then either install it, or at least put the bin subdirectory in your path. (This works well on OS X and Ubuntu.)

If you want to build it yourself, first check it out from GitHub:

% git clone --depth 1 --branch llvmorg-16.0.6 https://github.com/llvm/llvm-project.git

(If you want to build LLVM 17.x, use branch release/17.x; for current trunk, use main)

Then build it like so:

% cmake -DCMAKE_BUILD_TYPE=Release \
        -DLLVM_ENABLE_PROJECTS="clang;lld;clang-tools-extra" \
        -DLLVM_TARGETS_TO_BUILD="X86;ARM;NVPTX;AArch64;Hexagon;WebAssembly;RISCV" \
        -DLLVM_ENABLE_TERMINFO=OFF -DLLVM_ENABLE_ASSERTIONS=ON \
        -DLLVM_ENABLE_EH=ON -DLLVM_ENABLE_RTTI=ON -DLLVM_BUILD_32_BITS=OFF \
        -DLLVM_ENABLE_RUNTIMES="compiler-rt" \
        -S llvm-project/llvm -B llvm-build
% cmake --build llvm-build
% cmake --install llvm-build --prefix llvm-install

Running a serial build will be slow. To improve speed, try running a parallel build. That's done by default in Ninja; for make, use the option -j NNN, where NNN is the number of parallel jobs, e.g. the number of CPUs you have. Then, point Halide to it:

% export LLVM_ROOT=$PWD/llvm-install
% export LLVM_CONFIG=$LLVM_ROOT/bin/llvm-config

Note that you must add clang to LLVM_ENABLE_PROJECTS. Adding lld to LLVM_ENABLE_PROJECTS is only required when using WebAssembly, LLVM_ENABLE_RUNTIMES="compiler-rt" is only required if building the fuzz tests, and adding clang-tools-extra is only necessary if you plan to contribute code to Halide (so that you can run clang-tidy on your pull requests). We recommend enabling all of these in all cases to simplify builds. You can disable exception handling (EH) and RTTI if you don't want the Python bindings.

Building Halide with make

With LLVM_CONFIG set (or llvm-config in your path), you should be able to just run make in the root directory of the Halide source tree. make run_tests will run the JIT test suite, and make test_apps will make sure all the apps compile and run (but won't check their output).

There is no make install. If you want to make an install package, use CMake.

Building Halide out-of-tree with make

If you wish to build Halide in a separate directory, you can do that like so:

% cd ..
% mkdir halide_build
% cd halide_build
% make -f ../Halide/Makefile

Building Halide with CMake

MacOS and Linux

Follow the above instructions to build LLVM or acquire a suitable binary release. Then change directory to the Halide repository and run:

% cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_DIR=$LLVM_ROOT/lib/cmake/llvm -S . -B build
% cmake --build build

LLVM_DIR is the folder in the LLVM installation tree (do not use the build tree by mistake) that contains LLVMConfig.cmake. It is not required to set this variable if you have a suitable system-wide version installed. If you have multiple system-wide versions installed, you can specify the version with Halide_REQUIRE_LLVM_VERSION. Remove -G Ninja if you prefer to build with a different generator.

Windows

We suggest building with Visual Studio 2019. Your mileage may vary with earlier versions. Be sure to install the "C++ CMake tools for Windows" in the Visual Studio installer. For older versions of Visual Studio, do not install the CMake tools, but instead acquire CMake and Ninja from their respective project websites.

These instructions start from the D: drive. We assume this git repo is cloned to D:\Halide. We also assume that your shell environment is set up correctly. For a 64-bit build, run:

D:\> "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" x64

For a 32-bit build, run:

D:\> "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" x64_x86

Managing dependencies with vcpkg

The best way to get compatible dependencies on Windows is to use vcpkg. Install it like so:

D:\> git clone https://github.com/Microsoft/vcpkg.git
D:\> cd vcpkg
D:\> .\bootstrap-vcpkg.bat
D:\vcpkg> .\vcpkg integrate install
...
CMake projects should use: "-DCMAKE_TOOLCHAIN_FILE=D:/vcpkg/scripts/buildsystems/vcpkg.cmake"

Then install the libraries. For a 64-bit build, run:

D:\vcpkg> .\vcpkg install libpng:x64-windows libjpeg-turbo:x64-windows llvm[target-all,clang-tools-extra]:x64-windows

To support 32-bit builds, also run:

D:\vcpkg> .\vcpkg install libpng:x86-windows libjpeg-turbo:x86-windows llvm[target-all,clang-tools-extra]:x86-windows

Building Halide

Create a separate build tree and call CMake with vcpkg's toolchain. This will build in either 32-bit or 64-bit depending on the environment script (vcvars) that was run earlier.

D:\Halide> cmake -G Ninja ^
                 -DCMAKE_BUILD_TYPE=Release ^
                 -DCMAKE_TOOLCHAIN_FILE=D:/vcpkg/scripts/buildsystems/vcpkg.cmake ^
                 -S . -B build

Note: If building with Python bindings on 32-bit (enabled by default), be sure to point CMake to the installation path of a 32-bit Python 3. You can do this by specifying, for example: "-DPython3_ROOT_DIR=C:\Program Files (x86)\Python38-32".

Then run the build with:

D:\Halide> cmake --build build --config Release

To run all the tests:

D:\Halide> cd build
D:\Halide\build> ctest -C Release

Subsets of the tests can be selected with -L and include correctness, python, error, and the other directory names under /tests.

Building LLVM (optional)

Follow these steps if you want to build LLVM yourself. First, download LLVM's sources (these instructions use the latest 17.0 release):

D:\> git clone --depth 1 --branch release/17.x https://github.com/llvm/llvm-project.git

For a 64-bit build, run:

D:\> cmake -G Ninja ^
           -DCMAKE_BUILD_TYPE=Release ^
           -DLLVM_ENABLE_PROJECTS=clang;lld;clang-tools-extra ^
           -DLLVM_ENABLE_TERMINFO=OFF ^
           -DLLVM_TARGETS_TO_BUILD=X86;ARM;NVPTX;AArch64;Hexagon;RISCV ^
           -DLLVM_ENABLE_ASSERTIONS=ON ^
           -DLLVM_ENABLE_EH=ON ^
           -DLLVM_ENABLE_RTTI=ON ^
           -DLLVM_BUILD_32_BITS=OFF ^
           -S llvm-project\llvm -B llvm-build

For a 32-bit build, run:

D:\> cmake -G Ninja ^
           -DCMAKE_BUILD_TYPE=Release ^
           -DLLVM_ENABLE_PROJECTS=clang;lld;clang-tools-extra ^
           -DLLVM_ENABLE_TERMINFO=OFF ^
           -DLLVM_TARGETS_TO_BUILD=X86;ARM;NVPTX;AArch64;Hexagon;RISCV ^
           -DLLVM_ENABLE_ASSERTIONS=ON ^
           -DLLVM_ENABLE_EH=ON ^
           -DLLVM_ENABLE_RTTI=ON ^
           -DLLVM_BUILD_32_BITS=ON ^
           -S llvm-project\llvm -B llvm32-build

Finally, run:

D:\> cmake --build llvm-build --config Release
D:\> cmake --install llvm-build --prefix llvm-install

You can substitute Debug for Release in the above cmake commands if you want a debug build. Make sure to add -DLLVM_DIR=D:/llvm-install/lib/cmake/llvm to the Halide CMake command to override vcpkg's LLVM.

MSBuild: If you want to build LLVM with MSBuild instead of Ninja, use -G "Visual Studio 16 2019" -Thost=x64 -A x64 or -G "Visual Studio 16 2019" -Thost=x64 -A Win32 in place of -G Ninja.

If all else fails...

Do what the build-bots do: https://buildbot.halide-lang.org/master/#/builders

If the column that best matches your system is red, then maybe things aren't just broken for you. If it's green, then you can click the "stdio" links in the latest build to see what commands the build bots run, and what the output was.

Some useful environment variables

HL_TARGET=... will set Halide's AOT compilation target.

HL_JIT_TARGET=... will set Halide's JIT compilation target.

HL_DEBUG_CODEGEN=1 will print out pseudocode for what Halide is compiling. Higher numbers will print more detail.

HL_NUM_THREADS=... specifies the number of threads to create for the thread pool. When the async scheduling directive is used, more threads than this number may be required and thus allocated. A maximum of 256 threads is allowed. (By default, the number of cores on the host is used.)

HL_TRACE_FILE=... specifies a binary target file to dump tracing data into (ignored unless at least one trace_ feature is enabled in HL_TARGET or HL_JIT_TARGET). The output can be parsed programmatically by starting from the code in utils/HalideTraceViz.cpp.
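The HL_NUM_THREADS rule above (default to the host core count, cap at 256) can be sketched in a few lines. This is only an illustration, not Halide's runtime code: the function name `resolve_num_threads` is hypothetical, and falling back to the core count when the value is empty or unparseable is an assumption rather than documented behavior. The sketch also ignores the extra threads that the async directive may allocate.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (not part of Halide): choose the initial thread-pool
// size from the text of HL_NUM_THREADS and the number of host cores.
int resolve_num_threads(const char *env_value, int host_cores) {
    const long kMaxThreads = 256;  // maximum stated in the README
    long n = host_cores;           // default: number of cores on the host
    if (env_value != nullptr && *env_value != '\0') {
        char *end = nullptr;
        const long parsed = std::strtol(env_value, &end, 10);
        // Assumption: only a fully parsed, positive integer overrides the default.
        if (end != env_value && *end == '\0' && parsed > 0) {
            n = parsed;
        }
    }
    if (n < 1) n = 1;
    if (n > kMaxThreads) n = kMaxThreads;
    return static_cast<int>(n);
}
```

A caller would typically pass `std::getenv("HL_NUM_THREADS")` and `std::thread::hardware_concurrency()` as the two arguments.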

Using Halide on OSX

Precompiled Halide distributions are built using XCode's command-line tools with Apple clang 500.2.76. This means that we link against libc++ instead of libstdc++. You may need to adjust compiler options accordingly if you're using an older XCode which does not default to libc++.

Halide for Hexagon HVX

Halide supports offloading work to the Qualcomm Hexagon DSP on Qualcomm Snapdragon 845/710 devices or newer. The Hexagon DSP provides a set of 128-byte vector instruction extensions, the Hexagon Vector eXtensions (HVX). HVX is well suited for image processing, and Halide for Hexagon HVX will generate the appropriate HVX vector instructions from a program authored in Halide.

Halide can be used to compile Hexagon object files directly, by using a target such as hexagon-32-qurt-hvx.

Halide can also be used to offload parts of a pipeline to Hexagon using the hexagon scheduling directive. To enable the hexagon scheduling directive, include the hvx target feature in your target. The currently supported combination of targets is to use the HVX target features with an x86 linux host (to use the simulator) or with an ARM android target (to use Hexagon DSP hardware). For examples of using the hexagon scheduling directive on both the simulator and a Hexagon DSP, see the blur example app.

To build and run an example app using the Hexagon target:

Obtain and build trunk LLVM and Clang. (Earlier versions of LLVM may work but are not actively tested and thus not recommended.)
Download and install the Hexagon SDK and Hexagon Tools. Hexagon SDK 4.3.0 or later is needed. Hexagon Tools 8.4 or later is needed.
Build and run an example for Hexagon HVX

1. Obtain and build trunk LLVM and Clang

(Follow the instructions given previously, just be sure to check out the main branch.)

2. Download and install the Hexagon SDK and Hexagon Tools

Go to https://qpm.qualcomm.com/#/main/home

Go to Tools, and download Qualcomm Package Manager 3. Install the package manager on your machine.
Run the installed Qualcomm Package Manager and install the Qualcomm Hexagon SDK 5.x (or 4.x). The SDK can be selected from the Qualcomm Hexagon SDK Products.
Set an environment variable to point to the SDK installation location
```
export SDK_LOC=/location/of/SDK
```

3. Build and run an example for Hexagon HVX

In addition to running Hexagon code on device, Halide also supports running Hexagon code on the simulator from the Hexagon tools.

To build and run the blur example in Halide/apps/blur on the simulator:

cd apps/blur
export HL_HEXAGON_SIM_REMOTE=../../src/runtime/hexagon_remote/bin/v65/hexagon_sim_remote
export HL_HEXAGON_TOOLS=$SDK_LOC/Hexagon_Tools/8.x/Tools/
LD_LIBRARY_PATH=../../src/runtime/hexagon_remote/bin/host/:$HL_HEXAGON_TOOLS/lib/iss/:. HL_TARGET=host-hvx make test

To build and run the blur example in Halide/apps/blur on Android:

To build the example for Android, first ensure that you have Android NDK r19b or later installed, and the ANDROID_NDK_ROOT environment variable points to it. (Note that Qualcomm Hexagon SDK v4.3.0 includes Android NDK r19c, which is fine.)

Now build and run the blur example using the script to run it on device:

export HL_HEXAGON_TOOLS=$SDK_LOC/HEXAGON_Tools/8.4.11/Tools/
HL_TARGET=arm-64-android-hvx ./adb_run_on_device.sh

halide's Issues

PTX backend doesn't support most math lib functions

Until it moves to the new NVPTX LLVM backend, the legacy PTX backend is missing most math lib functions (transcendentals, pow, etc.) and some other expected standard library features.

Add task parallelism

Just add an unordered flag to Block and throw them into the task pool.

run_test should show test names as it runs

Desired format:

<test_name>: {compile|link|run}
....E..

Pyramid interpolation generates unexpectedly slow code

Jim's interpolation algorithm test runs slower (~2x) than gcc-4.6's optimized C result (on x86-64/Mac). We need to look into the generated code to sniff out why.

bootstrap should print errors correctly

Currently, bootstrap errors tend to be in the form of long log spew from a failed LLVM build or similar. In this case, the script tends to print the start of the log, rather than the end of it, where the actual information is.
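The fix requested here amounts to showing the tail of the captured log instead of its head. A minimal sketch of that idea in standard C++ follows; the function name `tail_lines` is hypothetical, and the actual bootstrap script is not written in C++, so this only illustrates the logic.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: return the last n lines of a captured build log,
// which is where a failed build usually reports the real error.
std::string tail_lines(const std::string &log, std::size_t n) {
    if (n == 0) return "";
    std::size_t pos = log.size();
    // Ignore one trailing newline so it does not count as an empty last line.
    if (pos > 0 && log[pos - 1] == '\n') --pos;
    std::size_t found = 0;
    while (pos > 0) {
        const std::size_t nl = log.rfind('\n', pos - 1);
        if (nl == std::string::npos) return log;      // fewer than n lines in total
        if (++found == n) return log.substr(nl + 1);  // start of the n-th line from the end
        pos = nl;
    }
    return log;
}
```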

RDoms don't like zero size

Hi,

If I try to write a reduction of the form:

RDom r(0, input);
f(x) += r;

and supply an input value of 0, the reduction fails to terminate instead of outputting the initial value.

Any thoughts?

Cheers
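For reference, the behavior the reporter expects can be written as an ordinary loop: a reduction domain with extent 0 runs no iterations, so the result is the initial value. This is a plain C++ sketch of the expected semantics, not Halide code, and `reduce_sum` is a hypothetical name.

```cpp
#include <cassert>

// Hypothetical model of "f(x) += r" over a reduction domain [min, min + extent).
// With extent == 0 the loop body never runs and the initial value is returned.
int reduce_sum(int initial, int min, int extent) {
    int acc = initial;
    for (int r = min; r < min + extent; ++r) {
        acc += r;
    }
    return acc;
}
```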

Conditionally remove C++11 features for older compilers

It seems all that's needed for the stock Lion compiler is to remove the initializer list support for images.

Generated code segfaults on iOS

404 - Not Found

The binary distribution of the Halide compiler returns 404 - Not Found.

src build should make initial modules if dirty

There is logic in src/myocamlbuild.ml to do this, but it seems to have stopped working correctly.

transpose syntax sucks

It's quite confusing. We need some better way to handle loop nesting order.

Update LLVM to 3.2svn with NVPTX

We are currently relying on a hacked branch of LLVM 3.1svn from around the SIGGRAPH deadline, which we had to patch to fix a variety of codegen bugs for ARM and add features to the PTX backend. This should be updated to 3.2svn. The limiting factor is that the PTX codegen needs to be updated to work with the new conventions of NVPTX instead of the older independent PTX target.

Make LL functional IR layer

The stateful nature of the LLVM OCaml bindings is nasty. It would be much nicer to have a thin ADT layer above this, much like the C types used in the C backend.

This could be started as part of the llvalue IR node, which would need to be constructable without a current "builder" context.

Add llvalue IR node

This would be valuable in a bunch of paths in the LLVM codegen, and potentially later for platform-specific optimization pre-passes.

Halide uses all memory

Hi!

I've been trying to implement the SURF descriptor algorithm in Halide. I've run into the problem that the piece of code below (one of the parts of SURF) takes a really long time to compile (compileJIT and compileToFile) and it exits when memory completely fills up (4GB). It's really hard to know the source of the problem since there are no error messages.

I tried to reduce code size (removing some calculations), and sometimes it compiles (though very slowly).

FUNC and VAR are just macros to set unique object names.

Here is the function:

Func getResponse(UniformImage resp, UniformImage src)
{
  Func FUNC(f);
  Var VAR(x),VAR(y);

  Expr scale = cast<float>(resp.width()) / cast<float>(src.width());
  Expr xx = cast<int>(scale * cast<float>(x));
  Expr yy = cast<int>(scale * cast<float>(y));
  f(x,y) = resp(clamp(xx, 0, resp.width()), clamp(yy, 0, resp.height()));
  return f;
}

Func nonMaxSuppression
(UniformImage t, Uniform<int> t_step, Uniform<int> t_filter,
 UniformImage m, Uniform<int> m_step, Uniform<int> m_filter,
 UniformImage b, Uniform<int> b_step, Uniform<int> b_filter,
 UniformImage laplacian,
 Uniform<float> threshold)
{
    Var VAR(x), VAR(y);

    /* clamp parameters */
    Func FUNC(clamped_t), FUNC(clamped_m), FUNC(clamped_b);
    clamped_t(x,y) = t(clamp(x, 0, t.width()), clamp(y, 0, t.height()));
    clamped_m(x,y) = m(clamp(x, 0, m.width()), clamp(y, 0, m.height()));
    clamped_b(x,y) = b(clamp(x, 0, b.width()), clamp(y, 0, b.height()));

    Func response_m_t = getResponse(m, t);
    Func response_b_t = getResponse(b, t);

    Expr layerBorder = (t_filter + 1) / (2 * t_step);
    Expr validBounds =  (   y > layerBorder 
                         && y < t.height() - layerBorder
                         && x > layerBorder
                         && x < t.width() - layerBorder);

    Expr candidate = response_m_t(x, y);
    Expr aboveThreshold = candidate >= threshold;

    RDom r(-1, 2, -1, 2);

    Expr max_t = maximum(clamped_t(r.x+x, r.y+y));
    Expr max_m = maximum(clamped_m(r.x+x, r.y+y));
    Expr max_b = maximum(clamped_b(r.x+x, r.y+y));

    Func FUNC(GreaterNeigh);
    GreaterNeigh(x,y) = max(max_t, max(max_m, max_b));

    Expr isExt = validBounds && aboveThreshold && (GreaterNeigh(x,y) <= candidate);

    // ---------------------------------------
    // Step 1: Calculate the 3D derivative
    // ---------------------------------------

    Func FUNC(dx), FUNC(dy), FUNC(ds);

    dx(x,y) = (response_m_t(x+1, y  ) - response_m_t(x-1, y  )) / 2.0f;
    dy(x,y) = (response_m_t(x,   y+1) - response_m_t(x,   y-1)) / 2.0f;
    ds(x,y) = (clamped_t(x, y) - clamped_b(x, y)) / 2.0f;

    // ---------------------------------------
    // Step 2: Calculate the inverse Hessian
    // ---------------------------------------

    Expr v;
    Func FUNC(dxx), FUNC(dyy), FUNC(dss), FUNC(dxy), FUNC(dxs), FUNC(dys);

    v = response_m_t(x, y);

    dxx(x,y) = response_m_t(x + 1, y) + m(x - 1, y) - 2 * v;
    dyy(x,y) = response_m_t(x, y + 1) + m(x, y - 1) - 2 * v;
    dss(x,y) = clamped_t(x, y) + response_b_t(x, y) - 2 * v;

    dxy(x,y) = ( response_m_t(x + 1, y + 1)
               - response_m_t(x - 1, y + 1)
               - response_m_t(x + 1, y - 1)
               + response_m_t(x - 1, y - 1) ) / 4.0;

    dxs(x,y) = ( clamped_t(x + 1, y)
               - clamped_t(x - 1, y)
               - response_b_t(x + 1, y)
               + response_b_t(x - 1, y) ) / 4.0;

    dys(x,y) = ( clamped_t(x, y + 1)
               - clamped_t(x, y - 1)
               - response_b_t(x, y + 1)
               + response_b_t(x, y - 1) ) / 4.0;

    Expr H[3][3] = {{dxx(x,y), dxy(x,y), dxs(x,y)},
                    {dxy(x,y), dyy(x,y), dys(x,y)},
                    {dxs(x,y), dys(x,y), dss(x,y)}};

    Func FUNC(invDet);
    invDet(x,y) = 1.0 /
         (H[0][0]*(H[1][1]*H[2][2]-H[2][1]*H[1][2]) -
          H[0][1]*(H[1][0]*H[2][2]-H[1][2]*H[2][0]) +
          H[0][2]*(H[1][0]*H[2][1]-H[2][2]*H[2][0]));

    Expr invH[3][3] =
      {{ (H[1][1]*H[2][2]-H[2][1]*H[1][2])*invDet(x,y), -(H[1][0]*H[2][2]-H[1][2]*H[2][0])*invDet(x,y),  (H[1][0]*H[2][1]-H[2][0]*H[1][1])*invDet(x,y)},
       {-(H[0][1]*H[2][2]-H[0][2]*H[2][1])*invDet(x,y),  (H[0][0]*H[2][2]-H[0][2]*H[2][0])*invDet(x,y), -(H[0][0]*H[2][1]-H[2][0]*H[0][1])*invDet(x,y)},
       { (H[0][1]*H[1][2]-H[0][2]*H[1][1])*invDet(x,y), -(H[0][0]*H[1][2]-H[1][0]*H[0][2])*invDet(x,y),  (H[0][0]*H[1][1]-H[1][0]*H[0][1])*invDet(x,y)}};

    // ---------------------------------------
    // Step 3: Multiply derivative and Hessian
    // ---------------------------------------

    Expr cx =  (invH[0][0] * dx(x,y) * -1.0) + (invH[0][1] * dy(x,y) * -1.0) + (invH[0][2] * ds(x,y) * -1.0);
    Expr cy =  (invH[1][0] * dx(x,y) * -1.0) + (invH[1][1] * dy(x,y) * -1.0) + (invH[1][2] * ds(x,y) * -1.0);
    Expr ci =  (invH[2][0] * dx(x,y) * -1.0) + (invH[2][1] * dy(x,y) * -1.0) + (invH[2][2] * ds(x,y) * -1.0);


    Expr isClose   = (abs(cx) < 0.5 && abs(cy) < 0.5 && abs(ci) < 0.5);
    Expr posx      = cast<float>((x + cx)*t_step);
    Expr posy      = cast<float>((y + cy)*t_step);
    Expr det_scale = cast<float>((0.1333)*(m_filter + (ci* (m_filter - b_filter))));

    Func FUNC(laplacianF);
    laplacianF = getResponse(laplacian, t);

    Var VAR(c);

    Func FUNC(out);
    out(x,y,c) = select(c==0, isExt && isClose,
                 select(c==1, posx,
                 select(c==2, posy,
                 select(c==3, det_scale,
                 select(c==4, laplacianF(x,y), 0.0)))));

    //schedule
    GreaterNeigh.root();
    invDet.root();

    return out;
}

int main()
{
  UniformImage t(Float(32), 2),
               m(Float(32), 2),
               b(Float(32), 2),
               laplacian(Float(32), 2);

  Uniform<int> t_step, t_filter,
               m_step, m_filter,
               b_step, b_filter;

  Uniform<float> threshold;

  Func nms = nonMaxSuppression(t, t_step, t_filter,
                               m, m_step, m_filter,
                               b, b_step, b_filter,
                               laplacian,
                               threshold);

  nms.compileJIT(); /* error */

  return 0;
}

Bounds checking on input images

We don't currently do any bounds checking on input images, which causes segfaults. One subtle way this triggers is if you vectorize something which accesses the input image but the input image is not a multiple of the vector width.

See test/cpp/input_image_bounds_check/test.cpp for code that triggers this bug

We should add asserts at the function preamble that check this (conservatively).
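One way to picture the conservative check: a vectorized loop touches whole vectors, so the region it reads is the output extent rounded up to a multiple of the vector width, and that region must fit inside the input image. The sketch below is an assumption about what such a preamble assert could test, written in plain C++; `vectorized_access_in_bounds` is a hypothetical name, and it models a one-dimensional access starting at index 0 only.

```cpp
#include <cassert>

// Hypothetical preamble check: does a loop over output_extent elements,
// vectorized by vector_width, stay inside an input of image_extent elements?
bool vectorized_access_in_bounds(int image_extent, int output_extent, int vector_width) {
    // Round the touched region up to a whole number of vectors.
    const int vectors = (output_extent + vector_width - 1) / vector_width;
    const int touched = vectors * vector_width;
    return touched <= image_extent;
}
```

With a 10-element input and a vector width of 8, the loop would touch 16 elements, so the check fails instead of reading out of bounds.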

ptx backend not working (cuCtxSynchronize could not be resolved)

Halide works fine with the CPU backend. However, as soon as I try to use the CUDA backend with
HL_TARGET=ptx ./executable

I get an error like this:

...
%f0.v0_nextvar = add i32 %f0.v0, 1
%55 = icmp ne i32 %f0.v0_nextvar, %38
br i1 %55, label %f0.v0_loop, label %f0.v0_afterloop

f0.v0_afterloop: ; preds = %f0.v0_loop
call void @__free_buffer(%struct.buffer_t* %f0.f5_buf)
call void @fast_free(i8* %f0.f5)
ret void
}
LLVM ERROR: Program used external function 'cuCtxSynchronize' which could not be resolved!

Reusing a variable name as the inner dimension of a split breaks

For example, the following code accesses the output image incorrectly.

f(x) = x
f.root().split(x, xo, x, 2)

f.realize(...)

A failing test has been checked in as split_reuse_inner_name_bug
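For comparison, the indexing a split by 2 is expected to produce when the inner variable has its own name is `x = xo * 2 + xi`, with every output index visited exactly once. The plain C++ sketch below models that loop nest. `split_visit_order` is a hypothetical name, and guarding the tail with an `if` is a simplification chosen for the sketch, not a statement about how Halide handles extents that are not a multiple of the factor.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of f.split(x, xo, xi, factor): list the x indices
// in the order the resulting two-level loop nest visits them.
std::vector<int> split_visit_order(int extent, int factor) {
    std::vector<int> visited;
    const int outer_extent = (extent + factor - 1) / factor;
    for (int xo = 0; xo < outer_extent; ++xo) {
        for (int xi = 0; xi < factor; ++xi) {
            const int x = xo * factor + xi;  // reconstruct the original index
            if (x < extent) visited.push_back(x);
        }
    }
    return visited;
}
```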

Create separate UserAssert for user input errors

Currently, quite a few user errors result in assertions. These have been improved to be reasonably informative, but they should still be pushed off to a separate path from the implementation error asserts, specific to input program warnings and errors.

Vectorizing by more than dimension size segfaults

During autotuner debugging we found the following schedule for blur triggers a segfault rather than an error:

blur_y.tile(x, y, xi, yi, 2, 2)
blur_y.vectorize(xi, 8)
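The desired outcome is a diagnostic rather than a crash. A minimal sketch of such a check, assuming the compiler knows the constant extent of the dimension being vectorized (here `xi` has extent 2 after the 2x2 tile), might look like the following. `check_vectorize` and the wording of the message are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical schedule validation: return an empty string when the schedule
// is acceptable, or an error message when the vector width exceeds the extent.
std::string check_vectorize(int dimension_extent, int vector_width) {
    if (vector_width > dimension_extent) {
        return "cannot vectorize by " + std::to_string(vector_width) +
               ": dimension extent is only " + std::to_string(dimension_extent);
    }
    return "";
}
```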

Unify Expr Images & UniformImages lists

These are almost entirely duplicated functionality. They should be refactored into a single list.

Microsoft Windows support

I know this is early, but I, and I'm sure others, would be interested in Windows support. I'm guessing that VC10 and maybe even VC11 don't implement enough C++11 features to compile Halide, so mingw is probably the way to go. Maybe someone more knowledgeable than I am can explain some of the challenges that will be faced in porting to Windows.

Add native struct types

The existing tuples support is awkward, limited, and confusing.

Add support for alternative parallel runtimes

TBB and Grand Central Dispatch would be valuable. This should be doable (almost?) entirely as alternative standard libraries.

Emit object code directly from compileToFile

It would be nice for users if we didn't require llc/opt to actually codegen and assemble a statically compiled pipeline.

This will require either plumbing llvm-c's LLVMTargetMachineEmitToFile(...) through to the OCaml standard interface, or making our own shim in src/cllutil.c.

Casting a FuncRef to an Expr crashes if the Func is not yet defined.

Bug in the C++ layer. There should be a check for this. Possibly non-trivial, because reductions may recursively reference themselves.

Add OpenGL ES backend

Would be similar to the ptx backend. Helpful for current-gen cell phones.

Build fails on ubuntu 11.10

Hi,

Running bootstrap fails at this point:

--------------------------
Test: building halide.cmxa
--------------------------
Traceback (most recent call last):
  File "util/bootstrap.py", line 80, in <module>
    print ocamlbuild('-use-ocamlfind', 'halide.cmxa')
  File "/home/hamstah/repos/Halide/util/pbs.py", line 352, in __call__
    return RunningCommand(command_ran, process, call_args, actual_stdin)
  File "/home/hamstah/repos/Halide/util/pbs.py", line 136, in __init__
    if rc != 0: raise get_rc_exc(rc)(self.command_ran, self._stdout, self._stderr)
pbs.ErrorReturnCode_10: 

Ran: '/usr/bin/ocamlbuild -use-ocamlfind halide.cmxa'

STDOUT:

  ocamlfind ocamlopt -I /usr/lib/ocaml/ocamlbuild unix.cmxa /usr/lib/ocaml/ocamlbuild/ocamlbuildlib.cmxa myocamlbuild.ml /usr/lib/ocaml/ocamlbuild/ocamlbuild.cmx -o myocamlbuild
+ ocamlfind ocamlopt -I ... (214 more, please see e.stdout)

STDERR:

Running the command manually gives:


$ /usr/bin/ocamlbuild -use-ocamlfind halide.cmxa
Solver failed:
  Ocamlbuild cannot find or build halide.ml.  A file with such a name would usually be a source file.  I suspect you have given a wrong target name to Ocamlbuild.
Backtrace:
  - Failed to build the target halide.cmxa
      - Building halide.cmxa:
          - Failed to build all of these:
              - Building halide.cmx:
                  - Failed to build all of these:
                      - Building halide.ml:
                          - Failed to build all of these:
                              - Building halide.mly
                              - Building halide.mll
                      - Building halide.mlpack
              - Building halide.mllib
Compilation unsuccessful after building 0 targets (0 cached) in 00:00:00.

the cmxa file seems to be missing:


$ find . -name "*.cmxa"
./llvm/Release+Asserts/lib/ocaml/llvm_bitwriter.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_analysis.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_bitreader.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_target.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_ipo.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_scalar_opts.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_executionengine.cmxa
./llvm/bindings/ocaml/target/Release+Asserts/llvm_target.cmxa
./llvm/bindings/ocaml/bitreader/Release+Asserts/llvm_bitreader.cmxa
./llvm/bindings/ocaml/llvm/Release+Asserts/llvm.cmxa
./llvm/bindings/ocaml/transforms/ipo/Release+Asserts/llvm_ipo.cmxa
./llvm/bindings/ocaml/transforms/scalar/Release+Asserts/llvm_scalar_opts.cmxa
./llvm/bindings/ocaml/executionengine/Release+Asserts/llvm_executionengine.cmxa
./llvm/bindings/ocaml/bitwriter/Release+Asserts/llvm_bitwriter.cmxa
./llvm/bindings/ocaml/analysis/Release+Asserts/llvm_analysis.cmxa

Bootstrap also got stuck on the git submodule update for some reason, possibly because of an older Python or older modules, but running the command manually and then running bootstrap again solved it.

Support GPU device selection

Automatic GPU device selection should be overridable via an HL_GPU_DEVICE environment variable.
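A minimal sketch of how such an override could work, assuming the runtime already has an automatically chosen device index and a device count. The helper name `select_gpu_device` and both of its parameters are hypothetical and are not part of Halide's runtime; only the `HL_GPU_DEVICE` variable name comes from this issue.

```cpp
#include <cstdlib>

// Hypothetical helper: returns the device index requested through the
// HL_GPU_DEVICE environment variable. Falls back to auto_choice when the
// variable is unset, empty, not a plain integer, or out of range.
int select_gpu_device(int auto_choice, int device_count) {
    const char *env = std::getenv("HL_GPU_DEVICE");
    if (env == nullptr || *env == '\0') {
        return auto_choice;
    }
    char *end = nullptr;
    long requested = std::strtol(env, &end, 10);
    if (*end != '\0' || requested < 0 || requested >= device_count) {
        return auto_choice;  // ignore malformed or out-of-range values
    }
    return static_cast<int>(requested);
}
```

Falling back silently is only one possible design; reporting an error for an invalid value may be preferable so that typos do not go unnoticed.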

C backend does not support vector types

They can be added trivially (with potentially poor performance) as simple loops. The longer-term plan is to rely on syrah.

Build should be able to bootstrap missing OCaml libraries

One of the things which makes the build difficult for first-timers is the need to install OCaml libraries on which we depend. On some platforms this is easy (modern Ubuntu tends to have many up-to-date OCaml packages in apt), while on others it involves a long chain of manual download/configure/make/installs which is an unnecessary distraction.

The build bootstrap process should have functionality which automatically fetches, builds, and installs the relevant libraries in a project-local path, and makes these discoverable to all subsequent build steps. odb is a straightforward option.

The challenge is to do this while also using the system packages when they exist and are sufficient, to avoid too much bloat.

OpenCL backend plans?

Is there any chance that Halide will generate OpenCL kernels for use on GPUs at some point in the future?

I want to use Halide in my desktop computer graphics (photo processing) app, but many users have AMD cards, not NVidia.

Improve buffer_t

Major issues:

size_t strides[MAX_DIMS] - allow inputs with padded scanlines
size_t offset[MAX_DIMS] - allow execution over sub-region of a host buffer
Number of used dimensions should be expressed more clearly. Obvious choices:
- int dims field
- unused dim size = 0 convention
Zero-initialization of the struct should create sane defaults

Also of note: pointer sizes vary between architectures, which means that the size and offsets of the structure itself vary. This probably remains the right answer, but we need to be careful on 32-bit architectures.
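The points above could be sketched roughly as follows. This is an illustration only, not the actual buffer_t layout: the struct name `buffer_sketch`, the `extent` and `elem_size` fields, the value of `MAX_DIMS`, and the `address_of` helper are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>

constexpr int MAX_DIMS = 4;  // assumed value, for illustration only

// Hypothetical layout. Default member initializers mean a value-initialized
// struct (buffer_sketch b{};) has a null host pointer and zero dimensions.
struct buffer_sketch {
    uint8_t *host = nullptr;        // start of the host allocation
    int dims = 0;                   // number of dimensions in use
    size_t extent[MAX_DIMS] = {};   // size of each dimension, in elements
    size_t strides[MAX_DIMS] = {};  // step between elements, allows padded scanlines
    size_t offset[MAX_DIMS] = {};   // origin of the sub-region within the host buffer
    size_t elem_size = 0;           // bytes per element
};

// Address of the element at idx, taking offset and strides into account.
inline uint8_t *address_of(const buffer_sketch &b, const size_t *idx) {
    size_t elems = 0;
    for (int d = 0; d < b.dims; ++d) {
        elems += (b.offset[d] + idx[d]) * b.strides[d];
    }
    return b.host + elems * b.elem_size;
}
```

Because `host` is a pointer and the arrays use `size_t`, `sizeof(buffer_sketch)` and the field offsets differ between 32-bit and 64-bit targets, which is the concern raised above.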

Update build instructions

Known issues:

GCC 4.7 recipe changed
Required packages are misleading
MacPorts, Ubuntu instructions are largely missing

build should use a configure step to set up desired options

Options:

--use-local-llvm/use-system-llvm
--use-local-clang/use-system-clang
--use-local-ocaml-libs

compute_remainder_modulus analysis fails for cast<int>(some float)

This was a bug listed on the trello board. It needs to be tested to see if it's still a bug and fixed if so.

Convolution test fails on GPU

This is a bug migrated from the trello list

Vectorize across multiple variables

It correctly fails, but does not issue any kind of useful error message. E.g.:

f.vectorize(x, 4).vectorize(y, 4);

Right now it just fails an assertion inside vectorize.ml

Generate efficient vector loads from clamped indices

A very common pattern in Halide code loads from an image using a clamped index:

Func clamped;
clamped(x, y) = input(clamp(x, 0, input.width() - 1), clamp(y, 0, input.height() - 1));

In the current backend, this generates unnecessarily conservative code when vectorized. A better strategy would be to generate a dynamic branch which detects if the index vector is near the edge of the clamp range, and if not, removes the clamp and generates a simple dense aligned vector load.
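The proposed strategy can be illustrated in plain C++, outside the Halide backend. The function name `load_clamped` and the lane count `kLanes` are hypothetical, and the sketch ignores alignment; it only shows the dynamic branch that skips the clamp when the whole index vector is in range.

```cpp
#include <algorithm>
#include <cstring>

constexpr int kLanes = 4;  // assumed vector width, for illustration only

// Loads kLanes values, out[i] = row[clamp(x + i, 0, width - 1)].
// Assumes width >= 1.
void load_clamped(const float *row, int width, int x, float *out) {
    if (x >= 0 && x + kLanes <= width) {
        // Fast path: every lane is inside [0, width - 1], so the clamp is a
        // no-op and the load is a single dense copy.
        std::memcpy(out, row + x, kLanes * sizeof(float));
    } else {
        // Slow path near the edges: clamp each lane individually.
        for (int i = 0; i < kLanes; ++i) {
            int idx = std::min(std::max(x + i, 0), width - 1);
            out[i] = row[idx];
        }
    }
}
```

For most of an image the fast path is taken, so the cost of the clamp is paid only in the border region.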

Compute Capability of Halide

I am working with CUDA and Halide. I have compiled and run a few examples. When I opened my working directory, I found a file "kernel.ptx". I opened it and found this:

.version 2.0
.target sm_20

Does Halide support only devices with compute capability 2.0?

GPU reductions potentially cause redundant buffer copies

Identified by Victor Oliveira on the halide-dev list:

RDom r (-5, 11);
Func box_x("box_x");
box_x(x,y,c) += (clamped(c, x + r, y));
Func box_y("box_y");
box_y(x,y,c) += (box_x(x, y + r, c));

box_x.root().update().reorder(r,c,x,y).cudaTile(x,y,16,16);
box_y.root().update().reorder(r,c,x,y).cudaTile(x,y,16,16);

Run Problem for Negative Values

I am trying to compute the cosine of a set of numbers. I used the range from 0 to 10. My code compiled, but I get this error when running it:

Halide::DynImage::Contents::Contents(const Halide::Type&, int): Assertion `a > 0 && "Images must have positive sizes\n"' failed.
Aborted (core dumped)

I understand the second part. How can I make this work for negative numbers?

segfault in vectorization

I'm using the latest Halide release to implement a dilation algorithm in RGBA float, but it segfaults. If I remove the vectorize schedule, it works.

Uniform<int> radius = 20;
RDom dom(-radius, 2*radius+1, -radius, 2*radius+1);
structEl(x,y) = select(x*x + y*y <= radius, 1, 0);
dilation(x, y, c) = select(c < 3,
    maximum( select(structEl(dom.x, dom.y) == 1, input(x + dom.x - radius, y + dom.y - radius, c), 0.0f) ),
    input(x, y, 3));

// schedule
structEl.root();
dilation.vectorize(x, 8);

Compile and Run problem for CUDA

I have run the code given in Getting Started and in the test folder. It ran fine with g++. But when I use the shell command for CUDA, i.e.,

"g++-4.6 -std=c++0x hello_halide.cpp -L /usr/local/cuda/lib64 -lcuda halide -lHalide -ldl -lpthread -o hello_halide"

It shows,

"/usr/bin/ld: cannot find halide: File format not recognized
/usr/bin/ld: cannot find -lHalide
collect2: ld returned 1 exit status"

What should I do?

Simple tiling fails in blur test

In py_bindings/test_blur.py, this schedule fails:

blur_y.root().tile(y,c,_c0,_c1,64,8)

Problem working on VS 2010

I am trying to build Halide on VS 2010 and need some help. The code compiles, but linking fails with:

error LNK2019: unresolved external symbol "public: class Halide::DynImage __thiscall Halide::Func::realize(int)" (?realize@Func@Halide@@QAE?AVDynImage@2@H@Z) referenced in function _main

Problem with gaussian pyramid

I'm using Halide (the precompiled libs) for an HDR fusion program and I need to create a gaussian pyramid. There is an example in the local_laplacian code, but it does not use JIT, which is what I want.

I have this code:

Image<int> subsample()
{
    Func downx, downy;
    Var x, y;

    downx(x, y) = ( (*this->image)(2*x-1, y) + 2 * (*this->image)(2*x, y) + (*this->image)(2*x+1 , y) ) / 4;
    downy(x, y) = (downx(x, 2*y-1) + 2 * downx(x, 2*y) + downx(x, 2*y+1)) / 4; 
    int width = this->image->width() / 2, height = this->image->height() / 2 ;
    Image<int> out = downy.realize(width-1, height-1);
    return out;
}

This code obviously fails because it starts at x=0, y=0, where 2*x-1 and 2*y-1 evaluate to -1, and indices must not be negative.

How can I set the limits for "realize" downy? Any help, please?

PS: I'm just starting with C++ because of Halide, you may think my code is horrible. :)