google / xnnpack
High-efficiency floating-point neural network inference operators for mobile, server, and Web
License: Other
Hi,
2D convolution is supported and optimized. Will 3D convolution be supported and optimized in the future?
Thanks.
I read the excellent paper 'Fast Sparse ConvNets' from CVPR 2020 and I'm very interested in it. However, when I run the SPMM benchmark implemented in XNNPACK, it seems that in some cases it is even slower at higher sparsity. My code is as follows:
pthreadpool_t threadpool = pthreadpool_create(1);
status = xnn_initialize(NULL);
fprintf(stderr, "mr = %d, nr = %d\n", xnn_params.f32.spmm.mr, xnn_params.f32.spmm.nr);
xnn_operator_t spmm_op;
status = xnn_create_convolution2d_nchw_f32(
// padding
0, 0, 0, 0,
// kernel size
1, 1,
// stride
1, 1,
// dilation
1, 1,
// groups
1,
// input/output channels per group
K, N,
// input/output channel stride
K, N,
// kernel, bias
weight, bias,
// min/max value of output
0, FLT_MAX,
// input tensor stored in NCHW order
0,
&spmm_op
);
status = xnn_setup_convolution2d_nchw_f32(
spmm_op,
1,
1, M,
input, output,
threadpool
);
According to the implementation of xnn_convolution2d_nchw_f32 in XNNPACK/src/operators/convolution-nchw.c, the convolution runs as SPMM when the kernel size, stride, and dilation are all 1 and the input/output tensors are stored in NCHW layout. I ran the operator with M = 49 (spatial dimension of the feature map), N = 512 (output channels), and K = 1024 (input channels) under different weight sparsities. The results are as follows:
sparsity | time (ms)
0.0 | 12.60
0.1 | 12.64
0.2 | 25.52
0.3 | 22.94
0.4 | 19.84
0.5 | 16.77
0.6 | 13.36
0.7 | 9.76
0.8 | 5.80
0.9 | 2.06
Hi, thanks for contributing such a good project. I'm trying to build XNNPACK on an NVIDIA Jetson TX2 using CMake, but it seems the download links for clog and cpuinfo are identical. Is this a mistake?
in cmake/DownloadCLog.cmake:
ExternalProject_Add(clog
URL https://github.com/pytorch/cpuinfo/archive/d5e37adf1406cf899d7d9ec1d317c47506ccb970.tar.gz
URL_HASH SHA256=3f2dc1970f397a0e59db72f9fca6ff144b216895c1d606f6c94a507c1e53a025
SOURCE_DIR "${CMAKE_BINARY_DIR}/clog-source"
BINARY_DIR "${CMAKE_BINARY_DIR}/clog"
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
TEST_COMMAND ""
)
in cmake/DownloadCpuinfo.cmake:
ExternalProject_Add(cpuinfo
URL https://github.com/pytorch/cpuinfo/archive/d5e37adf1406cf899d7d9ec1d317c47506ccb970.tar.gz
URL_HASH SHA256=3f2dc1970f397a0e59db72f9fca6ff144b216895c1d606f6c94a507c1e53a025
SOURCE_DIR "${CMAKE_BINARY_DIR}/cpuinfo-source"
BINARY_DIR "${CMAKE_BINARY_DIR}/cpuinfo"
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
INSTALL_COMMAND ""
TEST_COMMAND ""
)
It seems the F32 GEMM implementation quantizes the input and output? I got that from the usage here, where one has to pass min/max values of the output. I'm worried that the approximation will degrade accuracy significantly (I'm already quantising all the layers I can to int8), still have to test that, but just to confirm is there no SGEMM implementation that doesn't quantise the input and output?
The readme says
XNNPACK is a highly optimized library of floating-point neural network inference operators
However, in the code there seem to be implementations of GEMM with int8 weights, etc. I'm using QNNPACK for that at the moment. Would it make sense to switch to XNNPACK for int8 layers?
The thing with clang-cl.exe on Windows is that it doesn't predefine the __GNUC__ macro as clang does. Instead it defines _MSC_VER and __clang__. To build the XOP sources with clang-cl, should we add an or-condition for including x86intrin.h?
#include <assert.h>
$if SSE == 5:
-#ifdef __GNUC__
+#if defined(__GNUC__) || defined(__clang__)
#include <x86intrin.h>
#else
#include <immintrin.h>
Hi, I ran into this error when building for aarch64. The argument is expected to be f32 but f16 is given, and the two are incompatible (alignment):
../../src/f16-hswish/gen/hswish-neonfp16arith-x8.c:32:48: error: incompatible type for argument 1 of ‘vreinterpretq_s16_f32’
const int16x8_t vsix = vreinterpretq_s16_f32(vld1q_dup_f16(&params->six));
^~~~~~~~~~~~~
The latest commit causes the following build errors on intel machines. (Pasting only the first few lines here, can provide more if required):
[141/425] Building C object CMakeFiles/XNNPACK.dir/src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c.o
FAILED: CMakeFiles/XNNPACK.dir/src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c.o
/usr/bin/cc -DCPUINFO_SUPPORTED_PLATFORM=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DPTHREADPOOL_NO_DEPRECATED_API=1 -DXNN_ENABLE_ASSEMBLY=1 -DXNN_ENABLE_MEMOPT=1 -DXNN_ENABLE_SPARSE=1 -DXNN_LOG_LEVEL=0 -I../../include -I../../src -Iclog-source/deps/clog/include -Icpuinfo-source/include -Ipthreadpool-source/include -IFXdiv-source/include -IFP16-source/include -O3 -DNDEBUG -fPIC -Wno-psabi -pthread -std=gnu99 -msse4.1 -O2 -MD -MT CMakeFiles/XNNPACK.dir/src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c.o -MF CMakeFiles/XNNPACK.dir/src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c.o.d -o CMakeFiles/XNNPACK.dir/src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c.o -c ../../src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c
../../src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c: In function ‘xnn_qs8_dwconv_minmax_ukernel_up8x9__sse41_mul32’:
../../src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c:87:50: warning: implicit declaration of function ‘_mm_loadu_si32’; did you mean ‘_mm_loadu_si128’? [-Wimplicit-function-declaration]
const __m128i vi0x0123 = _mm_cvtepi8_epi32(_mm_loadu_si32(i0));
^~~~~~~~~~~~~~
_mm_loadu_si128
../../src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c:87:50: error: incompatible type for argument 1 of ‘_mm_cvtepi8_epi32’
In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:37:0,
from ../../src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c:12:
/usr/lib/gcc/x86_64-linux-gnu/7/include/smmintrin.h:482:1: note: expected ‘__m128i {aka __vector(2) long long int}’ but argument is of type ‘int’
_mm_cvtepi8_epi32 (__m128i __X)
^~~~~~~~~~~~~~~~~
../../src/qs8-dwconv/gen/up8x9-minmax-sse41-mul32.c:88:50: error: incompatible type for argument 1 of ‘_mm_cvtepi8_epi32’
I have verified that the previous commit works well.
Hi
I have built XNNPACK successfully for the target device (RPI2) and can run the benchmark tests. However, when I build libtensorflow-lite.a (or .so) with XNNPACK and run the prebuilt models, I run into a "ModifyGraphWithDelegate is disallowed" error:
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
ERROR: ModifyGraphWithDelegate is disallowed when graph is immutable.
The inference still ran successfully with correct result, but I guess it didn't benefit from the XNNPACK speedup because of the error.
In my testing, I used these two models from google prebuilt:
ssd_mobilenet_v3_small_coco_2020_01_14.tflite
coco_ssd_mobilenet_v1_1.0_quant_2018_06_29.tflite
Is this "ModifyGraphWithDelegate is disallowed" error expected when the model does not meet certain requirements? Can you give some pointers on how to get around this issue?
You can also find more details of my setup at this tensorflow issue I created.
Thanks!
Reproducible steps:
docker run -it --name xnnpack_test ubuntu:16.04
apt install -y cmake git build-essential
# cmake --version shows 3.5.1
# gcc --version shows 5.4.0
git clone https://github.com/google/XNNPACK.git
cd XNNPACK
mkdir build
cd build
cmake -DXNNPACK_BUILD_TESTS=OFF -DXNNPACK_BUILD_BENCHMARKS=OFF ..
make -j8
Logs:
[ 4%] Building C object CMakeFiles/XNNPACK.dir/src/operators/convolution-nhwc.c.o
cc: error: ../src/operators/average-pooling-nhwc.c../src/operators/average-pooling-nhwc.cNOT:../src/operators/average-pooling-nhwc.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/average-pooling-nhwc.c.o: No such file or directory
[ 4%] Building C object CMakeFiles/XNNPACK.dir/src/operators/convolution-nchw.c.o
CMakeFiles/XNNPACK.dir/build.make:86: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/average-pooling-nhwc.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/average-pooling-nhwc.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
cc: error: ../src/operators/binary-elementwise-nd.c../src/operators/binary-elementwise-nd.cNOT:../src/operators/binary-elementwise-nd.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/binary-elementwise-nd.c.o: No such file or directory
cc: error: ../src/operators/constant-pad-nd.c../src/operators/constant-pad-nd.cNOT:../src/operators/constant-pad-nd.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/constant-pad-nd.c.o: No such file or directory
CMakeFiles/XNNPACK.dir/build.make:158: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/constant-pad-nd.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/constant-pad-nd.c.o] Error 1
CMakeFiles/XNNPACK.dir/build.make:110: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/binary-elementwise-nd.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/binary-elementwise-nd.c.o] Error 1
cc: error: ../src/operators/argmax-pooling-nhwc.c../src/operators/argmax-pooling-nhwc.cNOT:../src/operators/argmax-pooling-nhwc.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/argmax-pooling-nhwc.c.o: No such file or directory
cc: error: ../src/operators/convolution-nhwc.c../src/operators/convolution-nhwc.cNOT:../src/operators/convolution-nhwc.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/convolution-nhwc.c.o: No such file or directory
cc: error: ../src/operators/convolution-nchw.c../src/operators/convolution-nchw.cNOT:../src/operators/convolution-nchw.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/convolution-nchw.c.o: No such file or directory
[ 4%] Building C object CMakeFiles/XNNPACK.dir/src/operators/channel-shuffle-nc.c.o
CMakeFiles/XNNPACK.dir/build.make:62: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/argmax-pooling-nhwc.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/argmax-pooling-nhwc.c.o] Error 1
CMakeFiles/XNNPACK.dir/build.make:206: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/convolution-nhwc.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/convolution-nhwc.c.o] Error 1
CMakeFiles/XNNPACK.dir/build.make:182: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/convolution-nchw.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/convolution-nchw.c.o] Error 1
cc: error: ../src/operators/channel-shuffle-nc.c../src/operators/channel-shuffle-nc.cNOT:../src/operators/channel-shuffle-nc.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/channel-shuffle-nc.c.o: No such file or directory
CMakeFiles/XNNPACK.dir/build.make:134: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/channel-shuffle-nc.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/channel-shuffle-nc.c.o] Error 1
[ 4%] Building C object CMakeFiles/XNNPACK.dir/src/operators/deconvolution-nhwc.c.o
cc: error: ../src/operators/deconvolution-nhwc.c../src/operators/deconvolution-nhwc.cNOT:../src/operators/deconvolution-nhwc.cCONFIG:Debug: No such file or directory
cc: error: CMakeFiles/XNNPACK.dir/src/operators/deconvolution-nhwc.c.o: No such file or directory
CMakeFiles/XNNPACK.dir/build.make:230: recipe for target 'CMakeFiles/XNNPACK.dir/src/operators/deconvolution-nhwc.c.o' failed
make[2]: *** [CMakeFiles/XNNPACK.dir/src/operators/deconvolution-nhwc.c.o] Error 1
CMakeFiles/Makefile2:69: recipe for target 'CMakeFiles/XNNPACK.dir/all' failed
make[1]: *** [CMakeFiles/XNNPACK.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
Since the author is at Google now, may I ask that clang-format be run over the code with the Google settings? This is trivial to do and it would significantly improve readability. I could send a PR if you'd like.
Thanks!
It seems none of the pooling operators support a 1×1 filter size, with code like the one here, presumably because 1×1 pooling is considered meaningless. But recently I have been trying to do inference with tfjs-backend-wasm on a model that includes max pooling with a 1×1 filter and 2×2 stride. This is effectively a downsampling step and looks meaningful. Do you think the pooling operators should support cases like that?
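To illustrate why this case is not meaningless, here is a minimal single-channel sketch, assuming no padding: with a 1×1 window each output is the maximum over exactly one input element, so the operator reduces to strided subsampling. The function name maxpool_1x1 is hypothetical and is not part of XNNPACK.

```c
#include <stddef.h>

/* Max pooling over one channel with a 1x1 window and no padding.
 * Each output element is the max over a single input element, so the
 * result is simply the input sampled every stride_h rows and every
 * stride_w columns. The caller provides an output buffer of size
 * ((in_h - 1) / stride_h + 1) * ((in_w - 1) / stride_w + 1). */
static void maxpool_1x1(const float* in, size_t in_h, size_t in_w,
                        size_t stride_h, size_t stride_w, float* out) {
  const size_t out_h = (in_h - 1) / stride_h + 1;
  const size_t out_w = (in_w - 1) / stride_w + 1;
  for (size_t oy = 0; oy < out_h; oy++) {
    for (size_t ox = 0; ox < out_w; ox++) {
      out[oy * out_w + ox] = in[(oy * stride_h) * in_w + ox * stride_w];
    }
  }
}
```

For a 4×4 input and a 2×2 stride this keeps the four elements at rows 0 and 2, columns 0 and 2.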
TF recently added support for HalfPixelCenters for resize_bilinear op: https://www.tensorflow.org/api_docs/cc/class/tensorflow/ops/resize-bilinear
Is there any plan to add this attr too in XNNPack?
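For context, here is a minimal sketch of what the attribute changes, as I understand the TensorFlow definition: only the mapping from an output index to a source coordinate differs. The function name source_coord is hypothetical, and scale is assumed to be input_size / output_size along the axis.

```c
/* Map an output index to a (fractional) source coordinate along one axis.
 * With half_pixel_centers enabled, pixel centers are treated as lying at
 * index + 0.5, so the output center is scaled and then shifted back.
 * Without it, the index itself is scaled. */
static float source_coord(int out_index, float scale, int half_pixel_centers) {
  if (half_pixel_centers) {
    return ((float) out_index + 0.5f) * scale - 0.5f;
  }
  return (float) out_index * scale;
}
```

The bilinear weights would then be derived from the fractional part of this coordinate, so the two modes can produce visibly shifted outputs for the same input.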
Hi, a recent change breaks the build for 32-bit ARMv7 (e.g. RPI2, or RPI3 in 32-bit mode). This is what I set for CMake:
set(CMAKE_SYSTEM_PROCESSOR armv7)
FAILED: ... -c ../../src/qs8-gemm/gen/8x8c4-minmax-neondot.c
arm-linux-gnueabihf-gcc: error: unrecognized argument in option '-march=armv8.2-a+dotprod'
I think XNNPACK_NEONDOT_MICROKERNEL_SRCS should be built only if the user specifies the dotprod modifier in CMAKE_SYSTEM_PROCESSOR, e.g.
set(CMAKE_SYSTEM_PROCESSOR armv8.2-a+dotprod)
I'm building some MediaPipe examples on Windows, and have noticed that they use AVX512 / AVX2 functions from XNNPACK (depending on the CPU capabilities).
Is there a good way to build XNNPACK without the AVX parts? Modifying BUILD.bazel in XNNPACK produces linker errors if I just comment out the AVX-related sections such as the following:
xnnpack_cc_library(
name = "avx2_ukernels",
hdrs = INTERNAL_HDRS,
gcc_copts = xnnpack_gcc_std_copts(),
gcc_x86_copts = [
"-mfma",
"-mavx2",
],
msvc_copts = xnnpack_msvc_std_copts(),
msvc_x86_32_copts = ["/arch:AVX2"],
msvc_x86_64_copts = ["/arch:AVX2"],
x86_srcs = AVX2_UKERNELS,
deps = [
":tables",
"@FP16",
"@pthreadpool",
],
)
So I wanted to ask whether there is another way.
I have been looking into the microkernel implementation for personal study.
Due to the lack of documentation, I am a little confused about the notation.
For example, the xnn_qu8_gemm_ukernel_function benchmark in
Lines 36 to 39 in 15c0036
takes mr, nr, kr as arguments.
My question is: which of these are the dimensions of the fully-connected operation's kernel (weights)? I think nr*kr is the size of the FC kernel and mr*kr is the size of the FC input. Please let me know if this is incorrect.
Thank you.
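One way to frame the question is as a tiled fully-connected computation. The sketch below assumes that mr and nr are the numbers of rows and columns of the output tile handled at a time (input rows and output channels respectively); that reading is my assumption and is not confirmed by the source. The function names and the MR/NR parameters are hypothetical, and kr is deliberately left out.

```c
#include <stddef.h>

/* Reference fully-connected layer:
 *   out[m][n] = sum over k of in[m][k] * w[n][k]
 * in is M x K, w is N x K, out is M x N, all row-major. */
static void fc_reference(const float* in, const float* w, float* out,
                         size_t M, size_t N, size_t K) {
  for (size_t m = 0; m < M; m++) {
    for (size_t n = 0; n < N; n++) {
      float acc = 0.0f;
      for (size_t k = 0; k < K; k++) acc += in[m * K + k] * w[n * K + k];
      out[m * N + n] = acc;
    }
  }
}

/* Same computation, visited in MR x NR output tiles. Each tile reads
 * up to MR rows of the input and up to NR rows of the weights. */
static void fc_tiled(const float* in, const float* w, float* out,
                     size_t M, size_t N, size_t K, size_t MR, size_t NR) {
  for (size_t m0 = 0; m0 < M; m0 += MR) {
    for (size_t n0 = 0; n0 < N; n0 += NR) {
      for (size_t m = m0; m < m0 + MR && m < M; m++) {
        for (size_t n = n0; n < n0 + NR && n < N; n++) {
          float acc = 0.0f;
          for (size_t k = 0; k < K; k++) acc += in[m * K + k] * w[n * K + k];
          out[m * N + n] = acc;
        }
      }
    }
  }
}
```

Under this reading, a tile touches MR × K input values and NR × K weight values, which is close to, but not the same as, the mr*kr and nr*kr sizes guessed in the question.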
Dear all,
we are evaluating the use of XNNPACK for our own development. I have seen that the input and output vectors are set in the *_setup_* method that constructs the operator.
I wonder if it is possible to extend the API to set the input and output addresses after *_setup_* has been called?
We are happy to develop the change ourselves, but wanted to be sure that this is possible at all.
Thanks,
Pablo.
@Maratyszcza I see some perf data with MobileNet in README. Is there any data about the speed of ARM Compute Library (or ARM NN) in those tasks available? Thanks.
[ 10%] Building C object _deps/xnnpack-build/CMakeFiles/XNNPACK.dir/src/qs8-igemm/gen/6x8c4-minmax-neondot.c.o
/home/chendongmin/project/tensorflow_lite_cmake/dtln_aec_android_build/xnnpack/src/qs8-gemm/gen/1x16c4-minmax-neondot.c:64:20: warning: implicit declaration of function
'vdotq_lane_s32' is invalid in C99 [-Wimplicit-function-declaration]
vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
^
/home/chendongmin/project/tensorflow_lite_cmake/dtln_aec_android_build/xnnpack/src/qs8-gemm/gen/1x16c4-minmax-neondot.c:64:18: error: assigning to 'int32x4_t'
(vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/chendongmin/project/tensorflow_lite_cmake/dtln_aec_android_build/xnnpack/src/qs8-gemm/gen/1x16c4-minmax-neondot.c:65:18: error: assigning to 'int32x4_t'
(vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x4567 = vdotq_lane_s32(vacc0x4567, vb0123x4567, va0x01234567, 0);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The CMakeLists doesn't support Emscripten. I am building TFLite with Emscripten and am able to build all the required static libraries of libtensorflowlite. I need a delegate to launch the interpreter, but the Emscripten build also has source code dependencies on xnnpack_delegate.h/cc.
So can you please guide me on how to build xnnpack and xnnpack_delegate.cc so that i could register the xnnpack as delegate using low level delegate api(or any other means) listed in https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/xnnpack
Hi,
I build a coordinate regression network based on MobileNetV2.
When I used the xnnpack delegate, I found that the results of inference had a big error, and found the error caused by the optimization here in xnnpack.
Line 548 in f124e88
Through more detailed research, it is found that when InputSize is (384, 384), MobileNetV2 will perform a ZeroPadding2D operation with Pad Size of (0,1)(0,1), similar to this:
I think it should be caused by this unconventional operation.
Thanks.
There is a hard sigmoid function in Mobilenet v3, how could we support this function in XNNPACK?
By default the MacOS filesystem is case-insensitive, which means a 'build' directory cannot be created in the root of the repository since a file with the same uppercase name is already present. This renders some of the scripts unusable.
What is the design policy of handling NaN for min/max instruction in XNNPACK?
vminq_f32/vmaxq_f32 : Not IEEE754-2008 aware(NaN propagates when either input is NaN) http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/CIHDEEBE.html
vminnmq_f32/vmaxnmq_f32 : IEEE754-2008 aware(= matches with SSE2's min/max) http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/CIHFCJCF.html
(available only in ARMv8(or AARCH64))
SSE2 _mm_min_ps/_mm_max_ps : IEEE754-2008 aware
https://www.felixcloutier.com/x86/maxps
Currently, XNNPACK uses VMIN/VMAX for min/max instruction, thus at least there is an inconsistency between ARM and x86 code paths when handling NaN value.
Related:
Implement NaN-propagating max/min on Vec256
pytorch/pytorch#13399
Hi,
I have a model that uses SeparableConv2D layers extensively. The results from the model with XNNPACK are very different from those without XNNPACK, i.e. I am unable to reproduce the original results with XNNPACK. Does XNNPACK support SeparableConv2D?
I'm running into an error when compiling the asm files in XNNPACK_AARCH64_ASM_MICROKERNEL_SRCS for arm64:
error: /Users/taox/Projects/XNNPACK/src/f32-dwconv/up4x9-aarch64-neonfma.S:21:23: error: error: unknown token in expression
error: brackets expression not supported on this target
brackets expression not supported on this target
LD2R {v30.4s, v31.4s}, [x8]
STP d10, d11, [sp, 16]
STP d10, d11, [sp, 16]
It seems the compiler doesn't recognize the syntax. I'm not an expert in assembly, but my guess is that the compiler flag -march=armv8.2-a+fp16 is not supported by Clang. However, I did find a link that discusses adding such support: https://reviews.llvm.org/D41792.
Hi,
Will XNNPACK support iOS in the future?
Currently, it seems XNNPACK cannot be compiled for iOS.
I'm trying to understand the indirect convolution algorithm used in xnnpack. It's a cool idea to implement convolution and thanks for contributing this project!
While reading the code, I came up with a few questions about the implementation of indirect convolution. They are listed below.
in XNNPACK/bench/f32-dwconv.cc, line 69
I think step_height represents how many pointers there are for a single row of the output. But I cannot understand why it is calculated as kernel_size + (output_width * step_width - 1) * kernel_height. If I understand it correctly, it should be kernel_size + ((output_width - 1) * step_width) * kernel_height. The first kernel_size is for one complete convolution window, and the remaining term counts how many new pointers are needed for each subsequent step. Please correct me if I am wrong.
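To see where the two expressions diverge, here is a small sketch that evaluates both, assuming kernel_size equals kernel_height * kernel_width. The function names are hypothetical; only the two formulas are taken from the question.

```c
#include <stddef.h>

/* Formula as quoted from the benchmark. */
static size_t step_height_quoted(size_t kernel_height, size_t kernel_width,
                                 size_t output_width, size_t step_width) {
  const size_t kernel_size = kernel_height * kernel_width;
  return kernel_size + (output_width * step_width - 1) * kernel_height;
}

/* Formula proposed in the question. */
static size_t step_height_proposed(size_t kernel_height, size_t kernel_width,
                                   size_t output_width, size_t step_width) {
  const size_t kernel_size = kernel_height * kernel_width;
  return kernel_size + ((output_width - 1) * step_width) * kernel_height;
}
```

The two agree whenever step_width is 1 and otherwise differ by (step_width - 1) * kernel_height, so the quoted version only reserves extra pointers when the step is larger than one column.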
in XNNPACK/src/indirection.c: xnn_indirection_init_dwconv2d
In this function, we compute the input spatial location (input_x, input_y). The code checks whether it is outside the input (input_x < input_width, input_y < input_height), but it does not compare input_x and input_y to zero. For example, when we compute the input spatial location for the output location (0, 0) and the padding size is not zero (input_padding_top > 0, input_padding_left > 0), input_x and input_y will be negative. Is the corresponding address used to set indirection_buffer correct in this situation?
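One possible explanation, assuming the coordinates are computed in an unsigned type such as size_t (I have not verified this against the source): subtracting the padding then wraps around to a very large value instead of going negative, so a single "less than size" comparison rejects positions on both sides of the input. A minimal sketch with a hypothetical helper name:

```c
#include <stddef.h>

/* Returns 1 if the input position addressed by this output position and
 * kernel tap lies inside [0, input_size), and 0 otherwise. All arithmetic
 * is unsigned: when the padding exceeds the rest of the sum, the
 * subtraction wraps to a huge value, which fails the same comparison
 * that rejects positions past the end of the input. */
static int is_inside(size_t output_pos, size_t stride, size_t kernel_pos,
                     size_t dilation, size_t padding, size_t input_size) {
  const size_t input_pos = output_pos * stride + kernel_pos * dilation - padding;
  return input_pos < input_size;
}
```

If the real code uses signed types instead, an explicit comparison against zero would be needed, so the type of input_x and input_y is the thing to check.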
Dear colleagues:
I tried building PyTorch on macOS 10.13.6 with Xcode 10.1. One showstopper in this process is that
XNNPACK fails with a missing symbol "__cvtu32_mask16".
I double-checked the clang version, which is 10, not 11, so when the object files are linked into the executable, "__cvtu32_mask16" cannot be found in clang.
See also the discussion at https://discuss.pytorch.org/t/pytorch-build-almost-succeeds-but-fails-undefined-symbols-for-architecture-x86-64-cvtu32-mask16/73000.
My question is:
If we want the build to work on macOS 10.13.6 with Xcode 10.1, is there a tweak to solve this issue? I mean telling the compiler to use the __cvtu32_mask16 defined within XNNPACK instead of the one from the clang library, where the function does not exist.
Any hints?
https://github.com/google/XNNPACK/blob/master/src/xnnpack/intrinsics-polyfill.h#L36
Have you compared quantized (unsigned 8-bit) inference with QNNPACK? Given that this library was forked off of QNNPACK, are there any optimizations in XNNPACK on top which could make it faster for quantized inference (maybe for some particular layers)?
The work on sparse computation for 1x1 convolution is great. It seems that only f32 SPMM is supported for now, in https://github.com/google/XNNPACK/blob/master/src/operators/convolution-nchw.c.
How can sparse fp16 be used? And do you plan to support sparse int8/uint8?
Thanks!
Hi,
I created a model using lstm operations.
I used performance tools and used xnnpack delegate, and found that the lstm operation is not supported by xnnpack delegate. Are there any plans to support this operation?
Thanks.
Hi y'all,
With the aim of compiling benchmark_model in TF (as this depends on XNNPACK) on commit c2db3a8fae0f6558e9dbdee79e67e74c1e95981c, I was trying to build end2end_bench using bazel 4.0.0 (ARM64).
The docs state macOS support for arm64; I assume this only holds true when using CMake.
So I added a macos_arm64 config by updating .bazelrc, build_defs.bzl, cpuinfo.BUILD, and BUILD.bazel, and then ran:
bazel build --config=macos_arm64 :end2end_bench
Compiling with ios_arm64 as the build config works fine, but macos_arm64 does not, even though macOS should be using the iOS kernels.
Could you give me a hint on how to build for the macos_arm64 platform with bazel? @Maratyszcza
Hi, currently it seems that only f32 delegation to XNNPACK is supported, even though qs8*/qu8* operators are available in XNNPACK.
Is it on the roadmap to add runtime support so that quantized TFLite models can also delegate to the XNNPACK qs8*/qu8* operators?
Thanks!
Hi, thank you for your great project.
I tried to build the XNNPACK end2end_bench with the ios_armv7 config using the command below:
$ bazel build -c opt --config ios_armv7 :end2end_bench
and the bazel build log says:
$ bazel build -c opt --config ios_armv7 :end2end_bench
INFO: Analyzed target //:end2end_bench (24 packages loaded, 1387 targets configured).
INFO: Found 1 target...
ERROR: /Users/jhyoo/workspace/src/XNNPACK/BUILD.bazel:3122:19: C++ compilation of rule '//:tables' failed (Exit 1): wrapped_clang failed: error executing command external/local_config_cc/wrapped_clang '-D_FORTIFY_SOURCE=1' -fstack-protector -fcolor-diagnostics -Wall -Wthread-safety -Wself-assign -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 37 argument(s) skipped)
Use --sandbox_debug to see verbose messages from the sandbox wrapped_clang failed: error executing command external/local_config_cc/wrapped_clang '-D_FORTIFY_SOURCE=1' -fstack-protector -fcolor-diagnostics -Wall -Wthread-safety -Wself-assign -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 37 argument(s) skipped)
Use --sandbox_debug to see verbose messages from the sandbox
clang: error: invalid iOS deployment version '--target=armv7-apple-ios', iOS 10 is the maximum deployment target for 32-bit targets [-Winvalid-ios-deployment-target]
Target //:end2end_bench failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 12.155s, Critical Path: 0.37s
INFO: 133 processes: 133 internal.
FAILED: Build did NOT complete successfully
The main reason seems to be "clang: error: invalid iOS deployment version '--target=armv7-apple-ios', iOS 10 is the maximum deployment target for 32-bit". My iOS SDK version is 13.7, so I guess local_config_cc puts -miphoneos-version-min=13.7 on the ios_armv7 build. To avoid this I tried:
$ bazel build --ios_minimum_os='10.0' -c opt --config ios_armv7 :end2end_bench
Then I hit a compile error because armv7 doesn't support the dot product SIMD instructions:
$ bazel build --ios_minimum_os='10.0' -c opt --config ios_armv7 :end2end_bench
INFO: Build option --ios_minimum_os has changed, discarding analysis cache.
INFO: Analyzed target //:end2end_bench (0 packages loaded, 1387 targets configured).
INFO: Found 1 target...
ERROR: /Users/jhyoo/workspace/src/XNNPACK/BUILD.bazel:3470:19: C++ compilation of rule '//:neondot_ukernels' failed (Exit 1): wrapped_clang failed: error executing command external/local_config_cc/wrapped_clang '-D_FORTIFY_SOURCE=1' -fstack-protector -fcolor-diagnostics -Wall -Wthread-safety -Wself-assign -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 65 argument(s) skipped)
Use --sandbox_debug to see verbose messages from the sandbox wrapped_clang failed: error executing command external/local_config_cc/wrapped_clang '-D_FORTIFY_SOURCE=1' -fstack-protector -fcolor-diagnostics -Wall -Wthread-safety -Wself-assign -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG ... (remaining 65 argument(s) skipped)
Use --sandbox_debug to see verbose messages from the sandbox
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:62:20: warning: implicit declaration of function 'vdotq_lane_s32' is invalid in C99 [-Wimplicit-function-declaration]
vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
^
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:62:18: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:63:18: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x4567 = vdotq_lane_s32(vacc0x4567, vb0123x4567, va0x01234567, 0);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:64:18: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb4567x0123, va0x01234567, 1);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:65:18: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x4567 = vdotq_lane_s32(vacc0x4567, vb4567x4567, va0x01234567, 1);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:79:20: warning: implicit declaration of function 'vdotq_lane_s32' is invalid in C99 [-Wimplicit-function-declaration]
vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
^
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:79:18: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:80:18: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x4567 = vdotq_lane_s32(vacc0x4567, vb0123x4567, va0x01234567, 0);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:88:20: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb4567x0123, va0x01234567, 1);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/qs8-gemm/gen/1x8c4-minmax-neondot.c:89:20: error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x4567 = vdotq_lane_s32(vacc0x4567, vb4567x4567, va0x01234567, 1);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 warnings and 8 errors generated.
Target //:end2end_bench failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 7.145s, Critical Path: 6.64s
INFO: 215 processes: 130 internal, 85 darwin-sandbox.
FAILED: Build did NOT complete successfully
I'm not sure whether iOS armv7 is still widely used, but it was necessary for my work, so I fixed this by adding 'apple_aarch32_copt' to the NEON dot product targets in BUILD.bazel and adding an iOS option to the ios_armv7 config.
If that's OK with you, may I make a PR for this?
Thank you. :)
Dear authors,
In the "Fast Sparse ConvNets" paper, it says: "Instead, we implement a dense convolutional kernel which takes as input the image in the standard HWC layout and outputs the CHW layout consumed by the sparse operations in the rest of the network." However, it seems to me that the layout of the feature maps is not changed after the Conv2d layer when I inspect the pre-trained sparse model with Netron. Could you please explain this to me?
Another issue is that I encountered some problems when I tried to run the benchmark using a bazel build with TensorFlow Lite on armv7 Linux (for which I have raised an issue in their repo). I am now trying to run the end-to-end benchmark in this repository. Do you have a C++ implementation of the pre-trained sparse models (like the ones in the models folder) so that I can run them directly with this repo? It is hard to extract the parameters used in the sparse model (subsampling size, ReLU, etc.).
Thank you very much!
Hey my fellow developers,
I was peeking around the build instructions, and upon inspecting the bash script download_dependencies.sh, nothing shows XNNPACK being downloaded from anywhere.
I wasn't sure whether this is a TensorFlow issue or whether I should post it directly to XNNPACK.
Thank you for any support I can get.
-Montana
Hi
I see that XNNPACK supports 2D bilinear resize. If I want to use nearest-neighbor resize, is that supported?
Hi
Could you tell me the detailed configuration of the Raspberry Pi 4 (BCM2711) in the performance data table?
For example, is the memory of the RPi 4 1 GB, 2 GB, 4 GB, or 8 GB? Is the OS Raspbian Buster 32-bit or 64-bit, and what is its release date?
I ran the benchmark command but did not get the same performance results,
and I'm not sure which number from the end2end-bench output you are reporting.
I'm running it on a 64-bit Ubuntu build.
Here is the result:
FP32MobileNetV1/T:1/real_time 312008 us 311970 us 22 Freq=1.5G
FP32MobileNetV1/T:2/real_time 186397 us 186380 us 37 Freq=1.5G
FP32MobileNetV1/T:3/real_time 147883 us 147872 us 48 Freq=1.5G
FP32MobileNetV1/T:4/real_time 142362 us 142349 us 49 Freq=1.5G
FP32MobileNetV2/T:1/real_time 193028 us 193004 us 36 Freq=1.5G
FP32MobileNetV2/T:2/real_time 106852 us 106843 us 65 Freq=1.5G
FP32MobileNetV2/T:3/real_time 81655 us 81648 us 85 Freq=1.5G
FP32MobileNetV2/T:4/real_time 72311 us 72304 us 97 Freq=1.5G
FP32MobileNetV3Large/T:1/real_time 156868 us 156850 us 45 Freq=1.5G
FP32MobileNetV3Large/T:2/real_time 91508 us 91499 us 76 Freq=1.5G
FP32MobileNetV3Large/T:3/real_time 71158 us 71150 us 98 Freq=1.5G
FP32MobileNetV3Large/T:4/real_time 65070 us 65061 us 107 Freq=1.5G
FP32MobileNetV3Small/T:1/real_time 48827 us 48821 us 143 Freq=1.5G
FP32MobileNetV3Small/T:2/real_time 31378 us 31375 us 223 Freq=1.5G
FP32MobileNetV3Small/T:3/real_time 24950 us 24947 us 280 Freq=1.5G
FP32MobileNetV3Small/T:4/real_time 22732 us 22729 us 309 Freq=1.5G
failed to create operation #0
FP16MobileNetV1/T:1/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV1/T:2/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV1/T:3/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV1/T:4/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:1/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:2/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:3/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:4/real_time ERROR OCCURRED: 'failed to create a mode
Hi,
I built a CNN model that has three convolution layers.
I use pthreadpool as follows (the platform is an ARM architecture):
auto num_cores = 1;
auto threads = pthreadpool_create(num_cores);
// init
xnn_initialize(nullptr /* allocator */);
// create
xnn_status status = xnn_create_convolution2d_nchw_f32(...);
// setup
xnn_setup_convolution2d_nchw_f32(..., threads /* thread pool */);
// inference
xnn_run_operator(..., threads /* thread pool */);
When num_cores is 1 (if I pass threads as nullptr, it uses 1 core by default), the result is correct and everything is fine.
However, if I set num_cores to a value larger than 1 (2, 6, or anything else), the result is wrong.
There are two things I want to highlight in this error:
res has dim: 1, 16, 240, 240. res[0][0][:][:] is correct, and the others are wrong (most of them are zeros).
According to my observations, I have three questions:
Thanks a lot!
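One way to narrow down a multi-thread discrepancy like the one described above is to keep the single-thread output as a reference and look for the first channel plane where the multi-thread output diverges. The following is a minimal sketch, assuming both outputs have been copied into flat NCHW buffers of equal size; first_mismatching_channel is a hypothetical helper written for illustration and is not part of XNNPACK or pthreadpool.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the index of the first channel whose values differ between `ref`
// and `out` by more than `tol`, or -1 if every channel matches.
// Both buffers hold one NCHW image: `channels` planes of `plane_size` floats.
// (Hypothetical helper for debugging; not an XNNPACK API.)
long first_mismatching_channel(const std::vector<float>& ref,
                               const std::vector<float>& out,
                               std::size_t channels,
                               std::size_t plane_size,
                               float tol) {
  assert(ref.size() == channels * plane_size);
  assert(out.size() == ref.size());
  for (std::size_t c = 0; c < channels; ++c) {
    for (std::size_t i = 0; i < plane_size; ++i) {
      const std::size_t idx = c * plane_size + i;
      if (std::fabs(ref[idx] - out[idx]) > tol) {
        return static_cast<long>(c);
      }
    }
  }
  return -1;
}
```

For the shape in the report (1, 16, 240, 240), the call would use channels = 16 and plane_size = 240 * 240. A result of 1 would match the observation that only res[0][0] is correct.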
I tried to switch the engine from NNPACK to XNNPACK, but found that XNNPACK is 3-5x slower than NNPACK on my nets on both arm64 and x86 devices. I took some layers from a net and ran the benchmarks on Ubuntu, and got an even worse result:
XNNPACK (bazel run //:convolution_bench):
Run on (6 X 4300 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 9216 KiB (x1)
Load Average: 3.69, 1.76, 1.46
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
xnnpack_convolution_f32/some_test/N:1/H:128/W:128/KH:3/KW:3/PH:1/PW:1/S:1/D:1/G:1/GCin:256/GCout:128/real_time 1345810175 ns 1345571878 ns 1 FLOPS=7.06881G/s Freq=4.19863G
xnnpack_convolution_f32/some_test/N:1/H:256/W:256/KH:3/KW:3/PH:1/PW:1/S:1/D:1/G:1/GCin:192/GCout:96/real_time 3060002089 ns 3059821942 ns 1 FLOPS=7.05024G/s Freq=1.17911G
And NNPACK:
./benchmark_conv -ic 256 -oc 128 -is 128 128 -ks 3 3 -m inference -ip 1 -t 1 -a wt8x8
Batch size: 1
Input channels: 256
Output channels: 128
Input: 128x128 with implicit padding 1
Kernel: 3x3
Subsampling: 1x1
Algorithm: WT8x8
Threads: 1
Iterations: 3
Time: 31.948 ms
Input transform: 7.692 ms (24.1%) [6.3 GB/s]
Kernel transform: 0.460 ms (1.4%) [20.8 GB/s]
Output transform: 1.587 ms (5.0%) [15.3 GB/s]
Block multiplication: 22.206 ms (69.5%) [91.4 GFLOPS]
Overhead: 0.002 ms (0.0%)
./benchmark_conv -ic 192 -oc 96 -is 256 256 -ks 3 3 -m inference -ip 1 -t 1 -a wt8x8
Batch size: 1
Input channels: 192
Output channels: 96
Input: 256x256 with implicit padding 1
Kernel: 3x3
Subsampling: 1x1
Algorithm: WT8x8
Threads: 1
Iterations: 3
Time: 76.170 ms
Input transform: 22.479 ms (29.5%) [6.3 GB/s]
Kernel transform: 0.258 ms (0.3%) [20.8 GB/s]
Output transform: 4.566 ms (6.0%) [15.5 GB/s]
Block multiplication: 48.866 ms (64.2%) [89.3 GFLOPS]
Overhead: 0.001 ms (0.0%)
Why might it be so much slower?
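To sanity-check logs like these, the FLOPS counter can be recomputed from the layer shape and the measured time. The sketch below is an illustration, not the benchmark's own code; conv_flops and gflops_per_second are hypothetical names. It counts each multiply-add as two operations and takes the output size as a parameter. The reported counters are reproduced when the output maps are taken as 127x127 and 255x255, which is inferred from the numbers rather than stated in the log.

```cpp
#include <cassert>
#include <cstdint>

// FLOPs of a direct convolution (batch 1, one group), counting each
// multiply-add as two operations. The output size is passed in explicitly
// because the log does not state it.
double conv_flops(std::int64_t out_h, std::int64_t out_w,
                  std::int64_t kh, std::int64_t kw,
                  std::int64_t cin, std::int64_t cout) {
  return 2.0 * static_cast<double>(out_h * out_w * kh * kw * cin * cout);
}

// Throughput in GFLOP/s for a run that took `nanoseconds`.
double gflops_per_second(double flops, double nanoseconds) {
  assert(nanoseconds > 0.0);
  return flops / nanoseconds;  // FLOP per ns equals GFLOP per s
}
```

With the figures from the first XNNPACK row, gflops_per_second(conv_flops(127, 127, 3, 3, 256, 128), 1345810175.0) evaluates to about 7.07, in line with the FLOPS=7.06881G/s counter.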
In CMakeLists.txt we have:
IF(CMAKE_SYSTEM_PROCESSOR MATCHES "^armv[5-8]" OR IOS_ARCH MATCHES "^armv7")
...
SET_PROPERTY(SOURCE ${XNNPACK_NEONV8_MICROKERNEL_SRCS} APPEND_STRING PROPERTY COMPILE_FLAGS " -march=armv8-a -mfpu=neon-fp-armv8 ")
SET_PROPERTY(SOURCE ${XNNPACK_NEONDOT_MICROKERNEL_SRCS} APPEND_STRING PROPERTY COMPILE_FLAGS " -march=armv8.2-a+dotprod -mfpu=neon-fp-armv8 ")
...
ENDIF()
In PyTorch we are not able to build this for iphoneos armv7, because arm_neon.h (clang 10.0.0) fails these checks:
#if __ARM_ARCH >= 8 && defined(__ARM_FEATURE_DIRECTED_ROUNDING)
...
#if defined(__ARM_FEATURE_DOTPROD)
Is it possible for us to not include neon_dot and neon_v8 for iphoneos armv7? Or can we have a macro to exclude those two features?
Hi,
I am running a TFLite model which has a final softmax layer whose input is a heatmap of dimension 1x64x64x3, where 3 is the number of channels. The output dimension is 1x64x64. TFLite is built on Mac and I use the C API with the XNNPACK delegate. When running this on an iPad, I get an EXC_BAD_ACCESS error and the program crashes.
I was able to narrow the error down to the 'raddstoreexpminusmax_ukernel' function inside 'xnn_compute_f32_three_pass_softmax'. Based on the device, the function 'xnn_f32_raddstoreexpminusmax_ukernel__neonfma_lut64_p2_x16' is called.
The crash happens during the last call to 'xnn_compute_f32_three_pass_softmax', on line 263 of function 'xnn_f32_raddstoreexpminusmax_ukernel__neonfma_lut64_p2_x16', which is:
const float32x4_t vi = vld1q_f32(input); input += 4;
The two things I want to highlight are that the crash occurs at the last call of 'xnn_compute_f32_three_pass_softmax' (i.e. batch_index = 4095, with 64x64 calls in total) and that the value of the variable 'elements' inside raddstoreexpminusmax_ukernel is 12, i.e. 3 (number of channels) * sizeof(float).
From my observation, the line above loads 4 inputs, but in my case the number of inputs is 3. Would this lead to an out-of-bounds read, especially in the last iteration of the function?
Thank you!
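The concern in this report can be illustrated with a scalar sketch: a loop that consumes 4 floats per step reads past the end of a 3-float row unless the tail is handled separately. The function below is a hypothetical illustration in portable C++, not XNNPACK's micro-kernel, and sum_without_overread is an assumed name. It shows one common way to keep every read inside the input, by copying a partial final group into a zero-filled local buffer first. Whether the library instead relies on padded buffers is not something the report establishes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Sums `count` floats while only reading memory inside [input, input + count).
// Full groups of 4 are read directly; a final partial group is first copied
// into a zero-filled local buffer, so the 4-wide step never touches bytes
// past the end of the input.
float sum_without_overread(const float* input, std::size_t count) {
  assert(input != nullptr || count == 0);
  float acc = 0.0f;
  std::size_t i = 0;
  for (; i + 4 <= count; i += 4) {
    acc += input[i] + input[i + 1] + input[i + 2] + input[i + 3];
  }
  if (i < count) {
    float tail[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::memcpy(tail, input + i, (count - i) * sizeof(float));
    acc += tail[0] + tail[1] + tail[2] + tail[3];
  }
  return acc;
}
```

For the case in the report, count would be 3 (elements = 12 bytes), so the direct 4-wide loop is skipped and only the padded tail path runs.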
Hi, I cross-compiled XNNPACK for RPi 2. When I ran end2end-bench, I got some failures on FP16MobileNetV1; the others passed.
$ ./end2end-bench
2020-08-13 01:15:36
Running ./end2end-bench
Run on (4 X 900 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
FP32MobileNetV1/T:1/real_time 1068462 us 1067627 us 1 Freq=900M
FP32MobileNetV1/T:2/real_time 556305 us 555739 us 1 Freq=900M
FP32MobileNetV1/T:3/real_time 399058 us 396603 us 2 Freq=900M
FP32MobileNetV1/T:4/real_time 322151 us 321932 us 2 Freq=900M
FP32MobileNetV2/T:1/real_time 661108 us 658790 us 1 Freq=900M
FP32MobileNetV2/T:2/real_time 354653 us 344778 us 2 Freq=900M
FP32MobileNetV2/T:3/real_time 251801 us 251222 us 3 Freq=900M
FP32MobileNetV2/T:4/real_time 220476 us 219576 us 3 Freq=900M
FP32MobileNetV3Large/T:1/real_time 514123 us 512953 us 1 Freq=900M
FP32MobileNetV3Large/T:2/real_time 287760 us 287761 us 2 Freq=900M
FP32MobileNetV3Large/T:3/real_time 216038 us 215373 us 3 Freq=900M
FP32MobileNetV3Large/T:4/real_time 191086 us 190798 us 4 Freq=900M
FP32MobileNetV3Small/T:1/real_time 156918 us 156769 us 4 Freq=900M
FP32MobileNetV3Small/T:2/real_time 89814 us 89439 us 8 Freq=900M
FP32MobileNetV3Small/T:3/real_time 67036 us 66751 us 10 Freq=900M
FP32MobileNetV3Small/T:4/real_time 56187 us 55934 us 12 Freq=900M
failed to create operation #0
FP16MobileNetV1/T:1/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV1/T:2/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV1/T:3/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV1/T:4/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:1/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:2/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:3/real_time ERROR OCCURRED: 'failed to create a model'
failed to create operation #0
FP16MobileNetV2/T:4/real_time ERROR OCCURRED: 'failed to create a model'
QS8MobileNetV1/T:1/real_time 747497 us 747360 us 1 Freq=900M
QS8MobileNetV1/T:2/real_time 379634 us 379594 us 2 Freq=900M
QS8MobileNetV1/T:3/real_time 256996 us 256934 us 3 Freq=900M
QS8MobileNetV1/T:4/real_time 196462 us 196338 us 4 Freq=900M
Can you shed some light on what could be causing the failure?
Hi, I cross-compiled TFLite (v2.4.1 and pre-release 2.5.0) with XNNPACK for Windows using Mingw-w64 and CMake. On a single thread, model inference works as expected. When choosing more than 1 thread (for example 2 or 4), the program quits unexpectedly during Invoke() (no errors printed).
I used the following call to set the number of threads: InterpreterBuilder (*model, resolver)(&interpreter, num_threads)
A direct compile for Linux works fine when num_threads is greater than 1. Inference, as expected, is faster on 2 threads than on 1.
When using the default TFLite kernels on Windows (cross-compiled as well), the model works fine for any number of threads (threads set via SetNumThreads(num_threads)).
Am I missing any configuration steps when trying to cross-compile? Any assistance is appreciated. Thank you.
I tried to build with tflite_with_xnnpack=true in the tensorflow folder using the following command line on aarch64:
bazel build --define tflite_with_xnnpack=true //tensorflow/tools/pip_package:build_pip_package --discard_analysis_cache --notrack_incremental_state --jobs=1
After quite a long compilation, I got the following error:
bazel-out/aarch64-opt-exec-50AE0418/bin/_solib_aarch64/_U_S_Stensorflow_Spython_Cgen_Ustate_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.2: error: undefined reference to 'aws_checksums_do_cpu_id'
bazel-out/aarch64-opt-exec-50AE0418/bin/_solib_aarch64/_U_S_Stensorflow_Spython_Cgen_Ustate_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.2: error: undefined reference to 'aws_checksums_crc32c_hw'
collect2: error: ld returned 1 exit status
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 8709.447s, Critical Path: 152.23s
INFO: 2105 processes: 2105 local.
FAILED: Build did NOT complete successfully
I wonder if this results from some flags that were not set up correctly.
Thanks in advance.
Hi,
Will XNNPACK support low-precision ops?
Another question: TF-Lite can delegate to XNNPACK, so will XNNPACK replace the ops implemented in TF-Lite for CPU?
Thanks
Hi, I tried to build XNNPACK on my devices, an NVIDIA Jetson TX2 and a MacBook Pro (2015), but encountered some problems. I used scripts/build-local.sh to build.
For the TX2, the detected arch is aarch64, which is set in CMAKE_SYSTEM_PROCESSOR. In this situation, the -march=armv8.2-a+fp16 flag is added, but the TX2 does not implement the ARMv8.2 instruction set. Similarly, on x86, the arch is x86_64 and AVX512 is used in compilation. Even though I commented out the related source files (XNNPACK_AVX512F_MICROKERNEL_SRCS) and the compilation flag (-mavx512f) in CMakeLists.txt, AVX512 code still exists in files like f32-rmax.cc in the benchmarks, which is activated if the macro XNN_ARCH_X86 or XNN_ARCH_X86_64 is defined.
It seems that ARMv8.2 and AVX512 support is required by default for aarch64 and x86, respectively. Will XNNPACK support older ARM archs like ARMv8, and x86 without AVX512?
Sorry to open a new issue, but I think describing the problem under a more specific title helps.
I'm trying to build neondot using the NDK, but it failed:
error: assigning to 'int32x4_t' (vector of 4 'int32_t' values) from incompatible type 'int'
vacc0x0123 = vdotq_lane_s32(vacc0x0123, vb0123x0123, va0x01234567, 0);
ANDROID_ABI="arm64-v8a", ndk 21.0.6113669
Hi all,
I just got this error when compiling on a Raspberry Pi 4 and on Amazon EC2 ARM64:
./XNNPACK/src/qs8-gemm/2x8c16-aarch64-neon-mlal-padal.S: Assembler messages:
./XNNPACK/src/qs8-gemm/2x8c16-aarch64-neon-mlal-padal.S:51: Error: operand mismatch -- `mov v17.4s,v16.4s'
Could you please let me know whether I am doing something wrong, or whether it is an actual compilation error in the repo?
Thanks,
Pablo.
Hi!
I am using an x86 desktop.
When I try to create the full convolution, which takes NHWC as input and outputs NCHW:
xnn_create_convolution2d_nchw_f32(
1 /* top padding */, 1 /* right padding */,
1 /* bottom padding */, 1 /* left padding */,
3 /* kernel height */, 3 /* kernel width */,
2 /* subsampling height */, 2 /* subsampling width */,
1 /* dilation_height */, 1 /* dilation_width */,
1 /* groups */,
3 /* input channels per group */,
24 /* output_channels_per_group */,
w0, w1,
0.0f /* output min */, 6.0f /* output max */,
XNN_FLAG_INPUT_NHWC /* flags */,
&op0);
an error occurred: failed to create Convolution operator: only selected Convolution parameters are supported
I found that this is because xnn_params.f32.hwc2spchw_dconv3x3c3s2.ukernel_with_symm_padding == NULL, which should be initialized for x86.
I tried to exclude 'XNN_NO_NCHW_OPERATORS' in BUILD.bazel for the "xnnpack_operators_nhwc_f32" library but received the same error.