
NNPACK's Introduction


NNPACK

License: BSD 2-Clause

NNPACK is an acceleration package for neural network computations. NNPACK aims to provide high-performance implementations of convnet layers for multi-core CPUs.

NNPACK is not intended to be directly used by machine learning researchers; instead it provides low-level performance primitives leveraged in leading deep learning frameworks, such as PyTorch, Caffe2, MXNet, tiny-dnn, Caffe, Torch, and Darknet.

Platforms and requirements

Environment   Architecture   CPU requirements
Linux         x86-64         AVX2 and 3-level cache hierarchy
Linux         ARM            NEON
Linux         ARM64          —
macOS         x86-64         AVX2 and 3-level cache hierarchy
Android       ARM            NEON
Android       ARM64          —
Android       x86            —
Android       x86-64         —
iOS           ARM            —
iOS           ARM64          —
Emscripten    Asm.js         —
Emscripten    WebAssembly    —

Features

  • Multiple algorithms for convolutional layers:
    • Fast convolution based on Fourier transform (for kernels up to 16x16 without stride)
    • Fast convolution based on Winograd transform (for 3x3 kernels without stride)
    • Implicit matrix-matrix multiplication algorithm (no limitations)
    • Direct convolution algorithm (for 1x1 kernels without stride)
  • Multi-threaded SIMD-aware implementations of neural network layers
  • Implemented in C99 and Python without external dependencies
  • Extensive coverage with unit tests

Layers

  • Convolutional layer
    • Inference-optimized forward propagation (nnp_convolution_inference)
    • Training-optimized forward propagation (nnp_convolution_output)
    • Training-optimized backward input gradient update (nnp_convolution_input_gradient)
    • Training-optimized backward kernel gradient update (nnp_convolution_kernel_gradient)
  • Fully-connected layer
    • Inference-optimized forward propagation (nnp_fully_connected_inference; nnp_fully_connected_inference_f16f32 variant for FP16 weights)
    • Training-optimized forward propagation (nnp_fully_connected_output)
  • Max pooling layer
    • Forward propagation, both for training and inference (nnp_max_pooling_output)
  • ReLU layer (with parametrized negative slope)
    • Forward propagation, both for training and inference, optionally in-place (nnp_relu_output)
    • Backward input gradient update (nnp_relu_input_gradient)
  • Softmax layer
    • Forward propagation, both for training and inference, optionally in-place (nnp_softmax_output)
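
To show how these primitives fit together, here is a minimal sketch of initializing NNPACK and calling one of the layer functions from C. It follows the nnp_relu_output signature in include/nnpack.h (batch size, channels, input, output, negative slope, threadpool); if your checkout differs, adjust accordingly.

#include <stdio.h>
#include <stdlib.h>
#include <nnpack.h>

int main(void)
{
    /* NNPACK must be initialized before any layer call. */
    if (nnp_initialize() != nnp_status_success) {
        fprintf(stderr, "NNPACK is not supported on this CPU\n");
        return EXIT_FAILURE;
    }

    enum { batch = 2, channels = 8 };
    float input[batch * channels], output[batch * channels];
    for (int i = 0; i < batch * channels; i++)
        input[i] = (float) i - 8.0f; /* mix of negative and positive values */

    /* Leaky ReLU with negative slope 0.01; a NULL threadpool runs single-threaded. */
    enum nnp_status status = nnp_relu_output(batch, channels, input, output, 0.01f, NULL);
    if (status != nnp_status_success) {
        fprintf(stderr, "nnp_relu_output failed: %d\n", (int) status);
        return EXIT_FAILURE;
    }

    printf("output[0] = %f (input[0] = %f)\n", output[0], input[0]);
    nnp_deinitialize();
    return EXIT_SUCCESS;
}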

Building

For most users, the recommended way to build NNPACK is through CMake:

mkdir build
cd build
cmake -G Ninja ..
ninja

Note: if ninja is not available on your system, configure without -G Ninja, and use make instead of ninja.

Building NNPACK - Using vcpkg

You can download and install NNPACK using the vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install nnpack

The NNPACK port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Cross-compilation for Android

To cross-compile for Android, add extra configuration options for cmake: -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake (where $ANDROID_NDK is the path to the Android NDK directory, e.g. /opt/android-ndk-r15c) and the arguments from the table below.

ABI           Extra cmake args                                     Restrictions
armeabi       -DANDROID_ABI=armeabi -DANDROID_TOOLCHAIN=gcc        Requires CPU with ARM NEON
armeabi-v7a   -DANDROID_ABI=armeabi-v7a -DANDROID_TOOLCHAIN=gcc    Requires CPU with ARM NEON
arm64-v8a     -DANDROID_ABI=arm64-v8a -DANDROID_TOOLCHAIN=clang    Requires clang toolchain
x86           -DANDROID_ABI=x86                                    —
x86_64        -DANDROID_ABI=x86_64                                 —

Notes:

  • On armeabi and armeabi-v7a, nnp_initialize will fail with nnp_status_unsupported_hardware if the mobile CPU does not support ARM NEON. Don't set -DANDROID_ARM_NEON=1 when compiling NNPACK, as it can make nnp_initialize crash on CPUs without ARM NEON (see the runtime check sketched below).
  • NNPACK builds for armeabi and armeabi-v7a are up to 2x slower if you use the clang toolchain.
  • mips and mips64 are not supported, and we have no plans to add them (a pull request would be welcome, though).
  • The x86_64 build uses generic 128-bit (SSE2) micro-kernels rather than the AVX2 micro-kernels of a native build.
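
To make the first note concrete: rather than assuming NEON support, a caller can probe at runtime and fall back gracefully. A minimal sketch in C, using only nnp_initialize and the status codes already mentioned in this README:

#include <stdio.h>
#include <nnpack.h>

/* Returns 1 if NNPACK can be used on this device, 0 otherwise. */
static int nnpack_available(void)
{
    enum nnp_status status = nnp_initialize();
    if (status == nnp_status_success)
        return 1;
    if (status == nnp_status_unsupported_hardware)
        fprintf(stderr, "NNPACK: CPU lacks required features (e.g. ARM NEON)\n");
    else
        fprintf(stderr, "NNPACK: initialization failed with status %d\n", (int) status);
    return 0; /* caller should fall back to a non-NNPACK code path */
}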

Ecosystem

Deep Learning Frameworks

  • PyTorch supports NNPACK on mobile for inference in convolutional layers.
  • TVM supports NNPACK for inference in convolutional layers. See these instructions to enable NNPACK in TVM.
  • MXNet supports NNPACK for inference in convolutional, fully-connected, and max-pooling layers. See the MXNet wiki for configuration instructions and performance benchmarks.
  • Caffe2 supports NNPACK for inference in convolutional layers.
  • darknet-nnpack - fork of Darknet framework with NNPACK support.
  • tiny-dnn - header-only deep learning framework in C++11, which natively supports NNPACK.
  • Maratyszcza/caffe - up-to-date integration of NNPACK (convolutional, fully-connected, max-pooling, and ReLU layers) into Caffe based on nnpack-pr branch in ajtulloch/caffe.
  • Maratyszcza/caffe-nnpack - older and unmaintained integration of NNPACK (convolutional layers only) into Caffe.
  • szagoruyko/nnpack.torch - integration of NNPACK into Lua Torch via FFI
  • See also discussion in Issue #1

Languages and Environments

Users

  • Facebook uses NNPACK in production.
  • Prisma uses NNPACK in the mobile app.

Acknowledgements


The library is developed by Marat Dukhan of Georgia Tech, with extensive advice from Nicolas Vasilache and Soumith Chintala of Facebook Artificial Intelligence Research. Andrew Tulloch of Facebook Artificial Intelligence Research contributed the Caffe integration. We thank Andrew Lavin for fruitful discussions on Winograd transform-based implementations. NNPACK is a research project at Richard Vuduc's HPC Garage lab at the Georgia Institute of Technology, College of Computing, School of Computational Science and Engineering.

This material is based upon work supported by the U.S. National Science Foundation (NSF) Award Number 1339745. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of NSF.

NNPACK's People

Contributors

ajtulloch, danliran, dipidoo, jokeren, jonliu1993, jonmorton, lucastheis, malfet, maratyszcza, sby, t-vi, thyu, wm901115nwpu, yangqing, zenogantner

NNPACK's Issues

Calculate utilization

I would like to calculate utilization, but I can't find AVX2 frequency for i7-6700K. For example, if AVX2 frequency were 4.0 GHz (which it isn't) max FLOP/s would be:

32 FLOP/clock * 4.0 GHz * 4 cores = 512 GFLOP/s (which it isn't)

See discussion, starting here:
soumith/convnet-benchmarks#59 (comment)

pnacl build is broken after build procedure update

The default target build works fine, and a version from 1-2 weeks ago also builds fine.
If you follow the installation guide and run configure --target=pnacl, you get this error during the ninja build:

[52/232] CXX deps/googletest/googletest/src/gtest-all.cc
FAILED: /home/ubuntu/caffe-compilation/NNPACK/build/deps/googletest/googletest/src/gtest-all.cc.bc
/home/ubuntu/caffe-compilation/nacl_sdk/pepper_42/toolchain/linux_pnacl/bin/pnacl-clang++ -o /home/ubuntu/caffe-compilation/NNPACK/build/deps/googletest/googletest/src/gtest-all.cc.bc -c /home/ubuntu/caffe-compilation/NNPACK/deps/googletest/googletest/src/gtest-all.cc -MMD -MF /home/ubuntu/caffe-compilation/NNPACK/build/deps/googletest/googletest/src/gtest-all.cc.bc.d -O3 -std=gnu++11 -g -I/home/ubuntu/caffe-compilation/nacl_sdk/pepper_42/include -I/home/ubuntu/caffe-compilation/NNPACK/deps/googletest/googletest/include -I/home/ubuntu/caffe-compilation/NNPACK/deps/googletest/googletest
In file included from /home/ubuntu/caffe-compilation/NNPACK/deps/googletest/googletest/src/gtest-all.cc:42:
/home/ubuntu/caffe-compilation/NNPACK/deps/googletest/googletest/src/gtest.cc:330:43: error: use of undeclared identifier 'GetArgvs'
static bool GTestIsInitialized() { return GetArgvs().size() > 0; }
^
/home/ubuntu/caffe-compilation/NNPACK/deps/googletest/googletest/src/gtest.cc:2215:29: error: unknown type name 'GTEST_FLAG_SAVER_'
: gtest_flag_saver_(new GTEST_FLAG_SAVER_) {
^
/home/ubuntu/caffe-compilation/NNPACK/deps/googletest/googletest/src/gtest-all.cc:48:35: error: expected '{' or ','
#include "src/gtest-typed-test.cc"
^
/home/ubuntu/caffe-compilation/NNPACK/deps/googletest/googletest/src/gtest-all.cc:48:35: error: expected '}'
/home/ubuntu/caffe-compilation/NNPACK/deps/googletest/googletest/src/gtest.cc:149:19: note: to match this '{'
namespace testing {

NNPACK build error

After python configure.py, I got the following error with ninja

ninja: error: build.ninja:271: unknown pool name

Do you know why?

Winograd F(4x4, 5x5)

Here are transforms for Winograd F(4x4, 5x5). That means a 5x5 kernel with a 4x4 output tile. I imagine the code should be a relatively simple adaptation of F(6x6, 3x3), because both algorithms use 8x8 input tiles.

It uses 8x8/(4x4) = 4 multiplies per output, which gives it a max theoretical speedup of (5x5)/4 = 6.25.

AT =
⎡1  1  1   1  1   8  8   0⎤
⎢                         ⎥
⎢0  1  -1  2  -2  4  -4  0⎥
⎢                         ⎥
⎢0  1  1   4  4   2  2   0⎥
⎢                         ⎥
⎣0  1  -1  8  -8  1  -1  1⎦

G =
⎡ 1      0     0      0      0  ⎤
⎢                               ⎥
⎢-2/9  -2/9   -2/9  -2/9   -2/9 ⎥
⎢                               ⎥
⎢-2/9   2/9   -2/9   2/9   -2/9 ⎥
⎢                               ⎥
⎢1/90  1/45   2/45  4/45   8/45 ⎥
⎢                               ⎥
⎢1/90  -1/45  2/45  -4/45  8/45 ⎥
⎢                               ⎥
⎢4/45  2/45   1/45  1/90   1/180⎥
⎢                               ⎥
⎢4/45  -2/45  1/45  -1/90  1/180⎥
⎢                               ⎥
⎣ 0      0     0      0      1  ⎦

BT =
⎡1   0    -21/4    0    21/4     0    -1  0⎤
⎢                                          ⎥
⎢0   1      1    -17/4  -17/4    1    1   0⎥
⎢                                          ⎥
⎢0   -1     1    17/4   -17/4   -1    1   0⎥
⎢                                          ⎥
⎢0  1/2    1/4   -5/2   -5/4     2    1   0⎥
⎢                                          ⎥
⎢0  -1/2   1/4    5/2   -5/4    -2    1   0⎥
⎢                                          ⎥
⎢0   2      4    -5/2    -5     1/2   1   0⎥
⎢                                          ⎥
⎢0   -2     4     5/2    -5    -1/2   1   0⎥
⎢                                          ⎥
⎣0   -1     0    21/4     0    -21/4  0   1⎦
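
For reference, these matrices combine in the standard Winograd fashion (following Lavin and Gray's formulation): writing d for an 8x8 input tile and g for the 5x5 kernel, the 4x4 output tile is

Y = Aᵀ [ (G g Gᵀ) ⊙ (Bᵀ d B) ] A

where Aᵀ and Bᵀ are the AT and BT matrices above and ⊙ denotes element-wise (Hadamard) multiplication. The 8x8 element-wise product is where the 4 multiplies per output quoted above come from: 64 multiplies amortized over 16 outputs.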

nnp_fully_connected_output gives wrong results when batch size != 4*n

hello, @Maratyszcza
I found that nnp_fully_connected_output gives wrong results if batch size != 4*n.
My config is x86_64-fma, and the threadpool is null.

I also found that, in NNPACK-master/src/fully-connected-output.c at line 40, changing 'outer_subblock_max' to 'outer_subblock_size' fixes the problem.

I would like to know whether my amendment is reasonable, and whether it will affect efficiency.

OpenMP Support of NNPACK

Hi, @Maratyszcza

Currently NNPACK uses a self-implemented threadpool rather than OpenMP.
Why does NNPACK choose pthreads rather than OpenMP? Will OpenMP parallelism be supported in the future?
To some degree, I find OpenMP much easier to use.

Thanks.
Best Regards

Convolution-inference running slower on newer ARM chipsets

Hi, @Maratyszcza

I ran convolution-inference on some of the Android devices with Qualcomm Snapdragon chipsets and I noticed it's much faster on older chipsets than the newer ones.

NNPACK libraries were compiled with Android-23 for armeabi-v7a ABI.

Following are the input parameters to convolution-inference:
Batch size: 1
Input channels: 16
Output channels: 16
Input: 800x600 with implicit padding 2
Kernel: 5x5
Subsampling: 1x1
Algorithm: FT16x16
Threads: 4
Iterations: 1

Performance on Snapdragon 805 (older chipset):
Time: 227.654 ms
Input transform: 36.907 ms (16.2%) [2.3 GB/s]
Kernel transform: 0.362 ms (0.2%) [0.8 GB/s]
Output transform: 22.522 ms (9.9%) [3.8 GB/s]
Block multiplication: 60.708 ms (26.7%) [14.5 GFLOPS]
Overhead: 107.156 ms (47.1%)

Performance on Snapdragon 820 (latest chipset):
Time: 349.370 ms
Input transform: 27.081 ms (7.8%) [3.2 GB/s]
Kernel transform: 0.446 ms (0.1%) [0.6 GB/s]
Output transform: 16.603 ms (4.8%) [5.2 GB/s]
Block multiplication: 41.086 ms (11.8%) [21.4 GFLOPS]
Overhead: 264.154 ms (75.6%)

I'm mainly interested in convolution and the results on 805 look promising. I was hoping to get much better performance on 820.
Is there anything that I can try/tweak to get better performance on 820?

Best Regards.

Android NDK compile error: complex.h not found

Hi @Maratyszcza

Compiling NNPACK as-is fails with a compilation error: "fatal error: complex.h not found".

My solution: Increasing Android native API level (APP_PLATFORM under jni/Application.mk) to "android-21" resolves this error.

I guess it must be because "complex.h" was introduced in Android-L NDK.

Best Regards,
Vasu

How is caffe timed?

Could you tell us how the Caffe timing in your readme was obtained?
Specifically, is it MKL or OpenBLAS Caffe? What is the command? Batch size is 64, right?

I'm asking because I saw much worse timing on a Broadwell 6-core i7 CPU with Caffe+OpenBLAS.
Thanks.

nnp_convolution_algorithm_auto doesn't work with strides

With nnp_convolution_algorithm_auto the nnp_convolution_inference can, and normally will, choose fast convolution algorithm, which doesn't support strides. NNPACK should choose nnp_convolution_algorithm_implicit_gemm for convolution with non-unit strides.
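
Until this is fixed in the library, a caller-side workaround is to avoid automatic selection whenever the stride is non-unit. A minimal sketch in C, using only the enum values named above and NNPACK's nnp_size struct:

#include <nnpack.h>

/* Pick a convolution algorithm that is valid for the given stride
 * (output subsampling): the fast FFT/Winograd algorithms require
 * unit stride, so fall back to implicit GEMM otherwise. */
static enum nnp_convolution_algorithm
choose_algorithm(struct nnp_size output_subsampling)
{
    if (output_subsampling.width == 1 && output_subsampling.height == 1)
        return nnp_convolution_algorithm_auto;
    return nnp_convolution_algorithm_implicit_gemm;
}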

How to use the NNPACK convolution interface to replace im2col+sgemm?

Hello,
I am wondering if it is possible to use the NNPACK interface to replace im2col+sgemm at a lower level.
I tried nnp_convolution_output(), but it returns all zeros. Am I using the function incorrectly? Here is my code:

void Conv_Forward(convHandle handle, float* pSrc, float* pDst)
{
    int j;
    ConvLayer *Conv = (ConvLayer*)(handle);  /* fixed: the parameter previously shadowed this variable */

    enum nnp_convolution_algorithm algorithm = nnp_convolution_algorithm_auto;
    const size_t batch_size = 1;
    const size_t input_channels = Conv->nInputNum;
    const size_t output_channels = Conv->nOutputNum;
    const struct nnp_padding input_padding = { Conv->padding, Conv->padding, Conv->padding, Conv->padding };
    /* Note: NNPACK expects the unpadded input size here; the implicit padding
       is described separately by input_padding. Adding 2*padding to the size,
       as below, double-counts the padding and is a likely cause of wrong output. */
    const struct nnp_size input_size = { Conv->width + 2*Conv->padding, Conv->height + 2*Conv->padding };
    const struct nnp_size kernel_size = { Conv->kernelSize, Conv->kernelSize };
    const struct nnp_size output_size = { (input_padding.left + Conv->width + input_padding.right - Conv->kernelSize)/Conv->nStride + 1,
                                          (input_padding.top + Conv->height + input_padding.bottom - Conv->kernelSize)/Conv->nStride + 1 };

    nnp_convolution_output(algorithm,
                batch_size,
                input_channels,
                output_channels,
                input_size,
                input_padding,
                kernel_size,
                pSrc,
                Conv->pWeight,
                Conv->pBias,
                pDst,
                NULL,   /* threadpool: run on the calling thread */
                NULL);  /* profile: not collected */

    float* pDstData = pDst;
    for (j = 0; j < Conv->nOutputNum; j++)
    {
        if (j == 0)
        {
            printf("feature map %d: \n", j);
            for (int ii = 0; ii < output_size.width*output_size.height; ii++)
            {
                printf("%f ", pDstData[ii]);
            }
        }
        /* Note: pDstData is advanced before Activation is applied, so the
           activation runs on the next feature map, not the current one (and
           reads past the buffer on the last iteration). */
        pDstData += output_size.width*output_size.height;
        float xParam = Conv->relus[j];
        Activation(pDstData, output_size.width, output_size.height, xParam, Conv->nActiveType);
    }
}

Best Regards

iOS support

Hello! Do you plan to support iOS in the near future?
We are working on real-time NNs on mobile and hope that your framework can give us a boost.

Benchmarking Error in NNPACK

I have set up a new environment as follows:
Ubuntu 12.04 + Intel Core i7-3770

I entered the following command, as shown in #3, for benchmarking.

bin/convolution-benchmark --mode output --batch 128 --input-channels 1024 --output-channels 1024 --input-size 40 40 --kernel-size 5 5 --input-padding 2 --algorithm ft16x16

But I encountered an error as shown:

~/Programs/NNPACK$ ./benchmark.py -l fully-connected -m inference -n vgg-a
NNPACK initialization failed: error code 51
Traceback (most recent call last):
File "./benchmark.py", line 184, in
print("{name}\t{measurements}".format(name=name, measurements="\t".join(measurements)))
TypeError

May I know how to solve this error?

Thank you.

Error in configure with "pnacl-nacl-newlib" option

python ./configure.py --host=pnacl-nacl-newlib
Traceback (most recent call last):
File "./configure.py", line 865, in
sys.exit(main())
File "./configure.py", line 567, in main
fft_objects = reference_fft_objects + arch_fft_stub_objects
UnboundLocalError: local variable 'arch_fft_stub_objects' referenced before assignment

python --version
Python 2.7.6

build NNPACK error

I built NNPACK successfully following https://github.com/Maratyszcza/NNPACK#building; after ninja, everything looks OK.

But when building MXNet with NNPACK, it always outputs this error:
relocation R_X86_64_32S against `nnp_fft8x8_and_store__avx2' can not be used when making a shared object; recompile with -fPIC
What does this mean? And is it caused by MXNet or by NNPACK itself?

Thanks!

nnpack cross-compilation error for android

Hi, when I compile NNPACK for Android on Ubuntu 14.04 using android-ndk-r11b, I always get errors such as:

In file included from jni/../jni/../src/psimd/2d-fourier-8x8.c:3:
/usr/include/complex.h:35:19: error: token is not a valid binary operator in a preprocessor subexpression
#if __GNUC_PREREQ (2, 7) && !__GNUC_PREREQ (2, 97)

I compiled NNPACK directly, without any modifications to jni/Android.mk or Application.mk, using the command $NDK_ROOT/ndk-build in the NNPACK folder.
I've googled a lot but still can't figure out what the problem is. Can you give me a hint? Thanks in advance!

Question on Benchmarks

Thank you for this much-needed project.

The benchmark-related question I have is,

Do we have to sum up each column below to arrive at the input-to-output inference-path performance number?

What about the FC and Softmax layer perf numbers?

It looks like these perf-numbers are for individual convolutional layers only?

VGG-A:conv1     255 ms   303 ms   260 ms   404 ms
VGG-A:conv2     902 ms   369 ms   267 ms   372 ms
VGG-A:conv3.1   566 ms   308 ms   185 ms   279 ms
VGG-A:conv3.2  1091 ms   517 ms   309 ms   463 ms
VGG-A:conv4.1   432 ms   228 ms   149 ms   188 ms
VGG-A:conv4.2   842 ms   402 ms   264 ms   329 ms
VGG-A:conv5     292 ms   141 ms    83 ms   114 ms

enable-psimd

Wanted to try out --enable-psimd timings, just for fun. It almost worked.
Consider, for the x86_64-linux-gnu host in configure.py:

--- a/configure.py
+++ b/configure.py
@@ -65,6 +65,9 @@ class Configuration:
         self.writer.variable("ar", "ar")
         self.writer.variable("cc", "gcc")
         self.writer.variable("cxx", "g++")
+        if options.use_psimd:
+            self.writer.variable("cc", "clang")
+            self.writer.variable("cxx", "clang++")
         ldflags.append("-Wl,-fuse-ld=gold")
     elif self.host == "x86_64-windows-msvc":
         import _winreg

Maybe there's a better way to do this, but that worked for me.

The only other hitch was working around a missing ::max_align_t, because my default clang (3.4, in Ubuntu 14.04) was too old (possibly clang-3.5 is sufficient?):
sudo apt-get install clang-3.6
and installing the correct links so clang and clang++ run the newer version. Then it built with no warnings, and I'm happily running comparison timing tests. Of course, it would be great if someday we could have coexisting output libs (maybe libnnpack-psimd?), but maybe there's a better way.
Thanks.

Problems occur after setting distribution from -1.0 to 1.0

@Maratyszcza

To begin with, I checked out the latest commit without any modification. Then I changed the distribution in convolution.h from the default (0.0, 1.0) to (-1.0, 1.0). It turns out that none of the tests pass.

Because I would like to test the ReLU combination, the distribution should be set to contain negative values, just as you did in the ReLU tests.

Could we change the errorLimit or the evaluation function?

Thanks very much!

Building issue

After following everything as written, here is what I get from the ninja build command:

[1/25] CXX fp16/ieee-to-fp32-value.cc
FAILED: /NNPackBundle/NNPACK/build/test/fp16/ieee-to-fp32-value.cc.o 
g++ -o /NNPackBundle/NNPACK/build/test/fp16/ieee-to-fp32-value.cc.o -c /NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc -MMD -MF /NNPackBundle/NNPACK/build/test/fp16/ieee-to-fp32-value.cc.o.d -O3 -g -std=gnu++0x -pthread -I/NNPackBundle/NNPACK/include -I/NNPackBundle/NNPACK/third-party/pthreadpool/include -I/NNPackBundle/NNPACK/test -I/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include
In file included from /NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1929:0,
                 from /NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:1:
/NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc: In member function ‘virtual void IEEE_FP16_VALUE_positive_nan_Test::TestBody()’:
/NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:326:28: error: ‘signbit’ was not declared in this scope
   EXPECT_EQ(signbit(nan_f32), 0) <<
                            ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest_pred_impl.h:77:52: note: in definition of macro ‘GTEST_ASSERT_’
   if (const ::testing::AssertionResult gtest_ar = (expression)) \
                                                    ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest_pred_impl.h:162:3: note: in expansion of macro ‘GTEST_PRED_FORMAT2_’
   GTEST_PRED_FORMAT2_(pred_format, v1, v2, GTEST_NONFATAL_FAILURE_)
   ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1978:3: note: in expansion of macro ‘EXPECT_PRED_FORMAT2’
   EXPECT_PRED_FORMAT2(::testing::internal:: \
   ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1979:32: note: in expansion of macro ‘GTEST_IS_NULL_LITERAL_’
                       EqHelper<GTEST_IS_NULL_LITERAL_(expected)>::Compare, \
                                ^
/NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:326:3: note: in expansion of macro ‘EXPECT_EQ’
   EXPECT_EQ(signbit(nan_f32), 0) <<
   ^
/NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:326:28: note: suggested alternative:
   EXPECT_EQ(signbit(nan_f32), 0) <<
                            ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest_pred_impl.h:77:52: note: in definition of macro ‘GTEST_ASSERT_’
   if (const ::testing::AssertionResult gtest_ar = (expression)) \
                                                    ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest_pred_impl.h:162:3: note: in expansion of macro ‘GTEST_PRED_FORMAT2_’
   GTEST_PRED_FORMAT2_(pred_format, v1, v2, GTEST_NONFATAL_FAILURE_)
   ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1978:3: note: in expansion of macro ‘EXPECT_PRED_FORMAT2’
   EXPECT_PRED_FORMAT2(::testing::internal:: \
   ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1979:32: note: in expansion of macro ‘GTEST_IS_NULL_LITERAL_’
                       EqHelper<GTEST_IS_NULL_LITERAL_(expected)>::Compare, \
                                ^
/NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:326:3: note: in expansion of macro ‘EXPECT_EQ’
   EXPECT_EQ(signbit(nan_f32), 0) <<
   ^
In file included from /NNPackBundle/NNPACK/include/nnpack/fp16.h:5:0,
                 from /NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:6:
/usr/include/c++/4.8/cmath:668:5: note:   ‘std::signbit’
     signbit(_Tp __x)
     ^
In file included from /NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1929:0,
                 from /NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:1:
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1979:64: error: template argument 1 is invalid
                       EqHelper<GTEST_IS_NULL_LITERAL_(expected)>::Compare, \
                                                                ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest_pred_impl.h:77:52: note: in definition of macro ‘GTEST_ASSERT_’
   if (const ::testing::AssertionResult gtest_ar = (expression)) \
                                                    ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest_pred_impl.h:162:3: note: in expansion of macro ‘GTEST_PRED_FORMAT2_’
   GTEST_PRED_FORMAT2_(pred_format, v1, v2, GTEST_NONFATAL_FAILURE_)
   ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1978:3: note: in expansion of macro ‘EXPECT_PRED_FORMAT2’
   EXPECT_PRED_FORMAT2(::testing::internal:: \
   ^
/NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:326:3: note: in expansion of macro ‘EXPECT_EQ’
   EXPECT_EQ(signbit(nan_f32), 0) <<
   ^
/NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc: In member function ‘virtual void IEEE_FP16_VALUE_negative_nan_Test::TestBody()’:
/NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:347:28: error: ‘signbit’ was not declared in this scope
   EXPECT_EQ(signbit(nan_f32), 1) <<
                            ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest_pred_impl.h:77:52: note: in definition of macro ‘GTEST_ASSERT_’
   if (const ::testing::AssertionResult gtest_ar = (expression)) \
                                                    ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest_pred_impl.h:162:3: note: in expansion of macro ‘GTEST_PRED_FORMAT2_’
   GTEST_PRED_FORMAT2_(pred_format, v1, v2, GTEST_NONFATAL_FAILURE_)
   ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1978:3: note: in expansion of macro ‘EXPECT_PRED_FORMAT2’
   EXPECT_PRED_FORMAT2(::testing::internal:: \
   ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1979:32: note: in expansion of macro ‘GTEST_IS_NULL_LITERAL_’
                       EqHelper<GTEST_IS_NULL_LITERAL_(expected)>::Compare, \
                                ^
/NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:347:3: note: in expansion of macro ‘EXPECT_EQ’
   EXPECT_EQ(signbit(nan_f32), 1) <<
   ^
/NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:347:28: note: suggested alternative:
   EXPECT_EQ(signbit(nan_f32), 1) <<
                            ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest_pred_impl.h:77:52: note: in definition of macro ‘GTEST_ASSERT_’
   if (const ::testing::AssertionResult gtest_ar = (expression)) \
                                                    ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest_pred_impl.h:162:3: note: in expansion of macro ‘GTEST_PRED_FORMAT2_’
   GTEST_PRED_FORMAT2_(pred_format, v1, v2, GTEST_NONFATAL_FAILURE_)
   ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1978:3: note: in expansion of macro ‘EXPECT_PRED_FORMAT2’
   EXPECT_PRED_FORMAT2(::testing::internal:: \
   ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1979:32: note: in expansion of macro ‘GTEST_IS_NULL_LITERAL_’
                       EqHelper<GTEST_IS_NULL_LITERAL_(expected)>::Compare, \
                                ^
/NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:347:3: note: in expansion of macro ‘EXPECT_EQ’
   EXPECT_EQ(signbit(nan_f32), 1) <<
   ^
In file included from /NNPackBundle/NNPACK/include/nnpack/fp16.h:5:0,
                 from /NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:6:
/usr/include/c++/4.8/cmath:668:5: note:   ‘std::signbit’
     signbit(_Tp __x)
     ^
In file included from /NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1929:0,
                 from /NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:1:
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1979:64: error: template argument 1 is invalid
                       EqHelper<GTEST_IS_NULL_LITERAL_(expected)>::Compare, \
                                                                ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest_pred_impl.h:77:52: note: in definition of macro ‘GTEST_ASSERT_’
   if (const ::testing::AssertionResult gtest_ar = (expression)) \
                                                    ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest_pred_impl.h:162:3: note: in expansion of macro ‘GTEST_PRED_FORMAT2_’
   GTEST_PRED_FORMAT2_(pred_format, v1, v2, GTEST_NONFATAL_FAILURE_)
   ^
/NNPackBundle/NNPACK/third-party/gtest-1.7.0/include/gtest/gtest.h:1978:3: note: in expansion of macro ‘EXPECT_PRED_FORMAT2’
   EXPECT_PRED_FORMAT2(::testing::internal:: \
   ^
/NNPackBundle/NNPACK/test/fp16/ieee-to-fp32-value.cc:347:3: note: in expansion of macro ‘EXPECT_EQ’
   EXPECT_EQ(signbit(nan_f32), 1) <<
   ^
ninja: build stopped: subcommand failed.

Integration with deep learning frameworks

NNPACK needs to be integrated into deep learning frameworks to deliver performance benefits to end-users. If you would like to work on such integration, please comment here.

@ajtulloch contributed basic integration of NNPACK's convolutional layers into Caffe, which is now available at Maratyszcza/caffe-nnpack. This project is not complete: it only exposes NNPACK's training-optimized forward propagation in convolution layers to Caffe, but it is a good starting point.

Tagged release?

Would be nice to know where we can build from safely. A formal tagged release would be helpful in that effort. Though not sure how complete this is.

Better transforms for Winograd/Cook/Toom F(6x6,3x3)

Here are a set of more accurate transforms for the Winograd F(6x6,3x3) implementation. By my tests they are 100X more numerically stable than the old transforms (from the first draft of my paper), and within a factor of 2X of the accuracy of F(4x4, 3x3).

These transforms should use the same number of operations as the old transforms. Basically the common subexpressions are the same, just new values for the constants, so it should be easy to plug in these new constants to your existing implementation.

AT = 
⎡1  1  1   1    1    1      1    0⎤
⎢                                 ⎥
⎢0  1  -1  2   -2   1/2   -1/2   0⎥
⎢                                 ⎥
⎢0  1  1   4    4   1/4    1/4   0⎥
⎢                                 ⎥
⎢0  1  -1  8   -8   1/8   -1/8   0⎥
⎢                                 ⎥
⎢0  1  1   16  16   1/16  1/16   0⎥
⎢                                 ⎥
⎣0  1  -1  32  -32  1/32  -1/32  1⎦

G =
⎡ 1       0       0  ⎤
⎢                    ⎥
⎢-2/9   -2/9    -2/9 ⎥
⎢                    ⎥
⎢-2/9    2/9    -2/9 ⎥
⎢                    ⎥
⎢1/90    1/45    2/45⎥
⎢                    ⎥
⎢1/90   -1/45    2/45⎥
⎢                    ⎥
⎢32/45   16/45   8/45⎥
⎢                    ⎥
⎢32/45  -16/45   8/45⎥
⎢                    ⎥
⎣ 0       0       1  ⎦

BT =
⎡1   0    -21/4    0    21/4     0    -1  0⎤
⎢                                          ⎥
⎢0   1      1    -17/4  -17/4    1    1   0⎥
⎢                                          ⎥
⎢0   -1     1    17/4   -17/4   -1    1   0⎥
⎢                                          ⎥
⎢0  1/2    1/4   -5/2   -5/4     2    1   0⎥
⎢                                          ⎥
⎢0  -1/2   1/4    5/2   -5/4    -2    1   0⎥
⎢                                          ⎥
⎢0   2      4    -5/2    -5     1/2   1   0⎥
⎢                                          ⎥
⎢0   -2     4     5/2    -5    -1/2   1   0⎥
⎢                                          ⎥
⎣0   -1     0    21/4     0    -21/4  0   1⎦

Hope that helps.

Strides support for convolutional layers

Hello, and thank you for your work!

You wrote that only convolutional layers without stride are supported and that pooling size is restricted to 2x2. But I saw that you added custom pooling size support recently. Are you going to remove the stride restriction as well? I could help with this if it is possible.

getting a `nnp_status_unsupported_hardware` with `nnp_initialize()` on OSX 10.11.4

Hi,

I am getting an nnp_status_unsupported_hardware status when I call nnp_initialize(). I was wondering if you have any idea whether I am doing something stupid?

I am on OSX 10.11.4

$ clang --version            
Apple LLVM version 7.3.0 (clang-703.0.29)
Target: x86_64-apple-darwin15.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Thanks!

AVX Support

Hi, @Maratyszcza

I want to add NNPACK support for the Intel SNB and IVB platforms. I hope to reuse your current PeachPy kernels, using macros in the kernels as follows:
// code for avx
#elif defined(AVX2)
// code for avx2
#else
// common code for scalar instruction...
#endif

May I know whether PeachPy supports this kind of preprocessing?
Thanks

Best Regards

Understanding Benchmarks

I'm trying to understand the benchmarks on the main github page.

  1. Are these the timings for a single forward pass through the layer?
  2. Is this leveraging all cores available on the CPU or is this the single core timing?

I'm asking because I was considering porting NNPACK to Windows vs. just using my existing convolution code (I can't easily run NNPACK on Windows to estimate the timing myself). Based on the posted benchmarks, it doesn't seem that the speed is faster than an optimized spatial convolution.

Thanks for any insights. I appreciate it.

Performance not so good on armv7 cpu

@Maratyszcza, can you give me some hints about the cases in which NNPACK may perform better than im2col+sgemm with OpenBLAS/Eigen on an ARMv7 CPU? I got a result similar to @conansherry's (my net architecture is: 60x60 input, a stack of conv5x5, conv1x1, conv3x3, etc., stride == 1), and I'm wondering in detail why the fast algorithms in NNPACK seem to be inferior to OpenBLAS/Eigen in this case.

And how to understand your comment in issue #39

When the number of channels on the input to convolution is small, the operation is similar to outer product: it is intrinsically memory bound, and fast algorithms in NNPACK do not help with performance.

Why would the fast algorithms in NNPACK be memory bound when the number of input channels is small, and thus be inferior to OpenBLAS/Eigen? I think in this case im2col+sgemm with OpenBLAS/Eigen also needs to perform an sgemm similar to an outer product and is also memory bound, yet it is faster. What slows NNPACK down here?

I must have missed something and need to dig into NNPACK more thoroughly. Anyway, any advice would be of great help. Thanks.

NNPACK UNIFIED API PROBLEM

Hi,

I am trying to figure out NNPACK API usage.

  • For convolution with stride = 1, I should use compute_convolution_output.
  • For convolution with stride >= 2, I should use nnp_convolution_inference. In this case, patche2cols and sgemm will be used as the computation backend.

Is there any possibility of integrating these two APIs into one, providing users a single, unified, and transparent interface? That would make things much easier when using the library.

Best Regards.

Wrong arg in configure.py for *nacl* hosts

Trying to run configure.py for the x86_64-nacl-newlib host on OSX gives the following error:

$python ./configure.py --host=x86_64-nacl-newlib
Traceback (most recent call last):
  File "./configure.py", line 866, in <module>
    sys.exit(main())
  File "./configure.py", line 590, in main
    config.module(nnpack_objects + nacl_module_objects, "nnpack", lib_dirs=["$pepper_lib_dir"], libs=["ppapi"])
TypeError: module() got an unexpected keyword argument 'libs'

Upon looking at the configure.py, I realized that the line

nacl_module_binary = \
            config.module(nnpack_objects + nacl_module_objects, "nnpack", lib_dirs=["$pepper_lib_dir"], libs=["ppapi"])

has to change to

        nacl_module_binary = \
            config.module(nnpack_objects + nacl_module_objects, "nnpack", lib_dirs=["$pepper_lib_dir"],  extra_ldlibs=["ppapi"])

unsupported hardware

My CPU is an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (32 logical cores).
I use caffe-nnpack from https://github.com/Maratyszcza/caffe-nnpack. When I set the engine to NNPACK, I found that the status returned by nnp_initialize() is 51 (/** NNPACK does not implement this function for the host CPU */).

Does my CPU support NNPACK?

problems about optflags

One problem: when I set optflags to -O0 in configure.py and run ninja, I hit the following error when compiling bin/transform-bench. I asked the author, and he told me how to fix it; here is the answer he gave me. It works! Thanks for his fast reply.

Answer: I think the problem is that the functions enable_perf_counter, disable_perf_counter, and read_perf_counter in NNPACK/bench/perf_counter.h are defined without the "static" keyword. If you add "static" after "inline", the error will be gone.

problem about NACL

Dear Maratyszcza,
I am using the NaCl SDK. I reconfigured and built it; it succeeded and compiled the programs under the /web directory (screenshot omitted).

Now I wonder how to start the NaCl program. When I open the HTML file directly, it shows that Native Client is not available (screenshot omitted). Do you know what the problem is? Looking forward to your reply!

Build for Linux/ARMv7

Hi,
I am wondering whether it is possible to use NNPACK on an ARM device running Ubuntu.
From what I understand of the requirements, it cannot be built natively, but is cross-compilation an option here? Is only the Android/ARM combination supported?

Bug in nnp_convolution_kernel_gradient with non-square image/kernel

@szagoruyko found that nnp_convolution_kernel_gradient does not match Torch results when image and/or kernel is non-square (i.e. width != height). Probably it is also related to Caffe backprop mismatch noticed by @ajtulloch. @szagoruyko checked that the reference implementation nnp_convolution_kernel_gradient__reference from nnpack/reference.h does not have this bug. TODO:

  • Add test cases for non-square images and kernels (currently all NNPACK tests cases for convolution use square images and kernels).
  • Fix the bug

Support processors without L3 cache

Hi,
I am trying to use NNPACK on an Intel Atom Z530, and I get the unsupported-hardware status when initializing. After taking a look at init.c, I figured there is a requirement for an L3 cache. Is there any way to work around this requirement, or is it deeply necessary for NNPACK?

missing declaration

The s8gemm.py.h declarations should be available when include/blas.h is used.

pre-generated assembly

We are looking into Torch integration, but the PeachPy and ninja dependencies are a bit annoying to embed deep inside Torch (for example, as a git subtree). Can you pre-generate the assembly so that we can simply have a git subtree of NNPACK inside https://github.com/torch/nn and then use our CMake build system to build it on the fly?

nnp_fully_connected_output gives wrong results if batch-size != 2^n

hello, @Maratyszcza

I have added NNPACK to MXNet (apache/mxnet#4373), which supports batch-size >= 1 for conv, max-pooling, and fully-connected layers. But I found that nnp_fully_connected_output gives wrong results if batch-size != 2^n, so I added a workaround in https://github.com/tornadomeet/mxnet/blob/b7caa3bed94a08f2a285981b5051ba009422461a/src/operator/nnpack/nnpack_fully_connected-inl.h#L63-L65 .

Is this a known bug? Thanks.

BTW, I used a face detection model called MTCNN for testing, which is how I found this problem.
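
A common caller-side workaround for this class of problem (a sketch of the general idea, not the exact patch linked above, and assuming the nnp_fully_connected_output signature from include/nnpack.h) is to pad the batch up to the next multiple of the blocking factor, run the padded batch, and copy back only the real rows:

#include <stdlib.h>
#include <string.h>
#include <nnpack.h>

/* Round the batch up to a multiple of 4 (the subblock size implicated in the
 * reports above), zero-pad the input, and discard the extra output rows. */
static enum nnp_status fully_connected_padded(
    size_t batch_size, size_t input_channels, size_t output_channels,
    const float* input, const float* kernel, float* output)
{
    const size_t padded = (batch_size + 3) & ~(size_t) 3;
    float* in_buf  = calloc(padded * input_channels, sizeof(float));
    float* out_buf = malloc(padded * output_channels * sizeof(float));
    if (in_buf == NULL || out_buf == NULL) {
        free(in_buf);
        free(out_buf);
        return nnp_status_out_of_memory;
    }
    memcpy(in_buf, input, batch_size * input_channels * sizeof(float));

    enum nnp_status status = nnp_fully_connected_output(
        padded, input_channels, output_channels,
        in_buf, kernel, out_buf,
        NULL /* threadpool */, NULL /* profile */);

    if (status == nnp_status_success)
        memcpy(output, out_buf, batch_size * output_channels * sizeof(float));
    free(in_buf);
    free(out_buf);
    return status;
}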
