
neural-engine's Introduction

The Neural Engine — what do we know about it?

Most new iPhones and iPads have a Neural Engine, a special processor that makes machine learning models really fast, but not much is publicly known about how this processor actually works.

The Apple Neural Engine (or ANE) is a type of NPU, which stands for Neural Processing Unit. It's like a GPU, but instead of accelerating graphics an NPU accelerates neural network operations such as convolutions and matrix multiplies.

The ANE isn't the only NPU out there — many companies besides Apple are developing their own AI accelerator chips. Besides the Neural Engine, the most famous NPU is Google's TPU (or Tensor Processing Unit).

Why this document?

When I was still providing ML consulting services for iOS, I would often get emails from people who were confused about why their model didn't appear to be running on the Neural Engine, or why it was so slow when the ANE is supposed to be much faster than the GPU.

It turns out that not every Core ML model can make full use of the ANE. The reasons can be complicated, so this document tries to answer the most common questions.

The ANE is great for making ML models run really fast on iPhones and iPads. A model that is optimized for the ANE will seriously outperform the CPU and GPU. But the ANE also has limitations. Unfortunately Apple isn't giving third-party developers any guidance on how to optimize their models to take advantage of the ANE. It's mostly a process of trial-and-error to figure out what works and what doesn't.

Note: Everything here was obtained by experimentation. I do not work at Apple and never have, so I am not privy to any implementation details of this chip. Some of this information is probably wrong. It's definitely incomplete. If you know something that isn't explained here, or if you find information that is wrong or missing, please file an issue or make a pull request. Thanks!

I was originally planning to make this a blog post but decided to put it on GitHub to make it a community resource and so that other people could contribute to it too. Please do!


neural-engine's People

Contributors: alexsmbaratti, dshakhbazyan, energysouffles, hollance, jetlye2804, nealmcb, paramaggarwal, prafulfillment, sladewatkins, umarovm, uwe


neural-engine's Issues

Do you have any workarounds about pooling layer?

Hello, Hollemans san.

Thank you for your useful repository!

I found that a pooling layer that follows a convolution prevents the convolution from running on the ANE.
Do you know any workarounds to avoid it?

The pooling layer is actually a global average pooling layer.
I tried turning globalPooling on/off and replacing it with reduce_mean, but neither helped.

Thanks.

[Additional Information]
I found that padding="VALID" causes the problem.
If you use padding="SAME", the model runs on the ANE, but that's not usable, since the layer then produces an output with a different shape.
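As a sanity check for any replacement: global average pooling is mathematically just a mean over the spatial dimensions, so a reduce_mean over H and W must give identical results. A minimal pure-Python sketch of that equivalence (no frameworks assumed):

```python
# Global average pooling over a [C, H, W] tensor is just the mean over H and W,
# so a reduce_mean replacement must produce the same [C] (or [C, 1, 1]) result.
def global_avg_pool(x):
    """x: nested list [C][H][W] -> list of C per-channel means."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in x]

def reduce_mean_hw(x):
    """Equivalent reduce over the H and W axes, written as an explicit loop."""
    out = []
    for channel in x:
        total, count = 0.0, 0
        for row in channel:
            for v in row:
                total += v
                count += 1
        out.append(total / count)
    return out

x = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0
     [[5.0, 6.0], [7.0, 8.0]]]   # channel 1
assert global_avg_pool(x) == reduce_mean_hw(x) == [2.5, 6.5]
```

So if a reduce_mean substitution changes the model's outputs, the problem is in how the converter wires it up, not in the math.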

About the description of allowLowPrecisionAccumulationOnGPU

Hi.

In https://github.com/hollance/neural-engine/blob/master/docs/16-bit.md, you wrote

"On the GPU it uses float16 for the weights and the intermediate tensors, but float32 for the calculations. You can turn this off with the option allowLowPrecisionAccumulationOnGPU from MLModelConfiguration, in which case the GPU also uses float16 for the calculations. This is a bit faster but you may lose precision."

Do you have any reference for this description?

In WWDC 19 (https://developer.apple.com/videos/play/wwdc2019/704/ (39:00)), they said,

"And the idea here is that if your model is learning on the GPU, instead of doing accumulation in float32, that happens in float16."

So I guess that this option may be effective only on macOS, and the change is from float32 to float16.

I tried the option on an iOS device without a Neural Engine, but there seemed to be no speed improvement.

Thanks.
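Whatever the exact platform behavior turns out to be, the precision trade-off itself is easy to demonstrate. The sketch below emulates a float16-style accumulator (11-bit significand) in pure Python; it illustrates why low-precision accumulation loses accuracy, and is not a claim about what Core ML does internally:

```python
import math

def round_fp16ish(x):
    """Round x to an 11-bit significand, roughly emulating IEEE float16
    (ignores float16's limited exponent range and subnormals)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e, with 0.5 <= |m| < 1
    return round(m * 2048) / 2048 * 2**e

values = [0.1] * 10000              # true sum is 1000.0

acc32 = 0.0                         # float64 here, standing in for float32
acc16 = 0.0                         # accumulator re-rounded after every add
for v in values:
    acc32 += v
    acc16 = round_fp16ish(acc16 + v)

print(acc32)  # very close to 1000.0
print(acc16)  # far below 1000: once the sum is large, each +0.1 rounds away
```

The effect only shows up when many small contributions are accumulated into a large total, which is exactly what happens inside a big convolution or matrix multiply.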

Issues discovered while running MobileNet V3

Hi, I'm working on integrating the ANE into TFLite. While testing MobileNet V3, I discovered the following messages in the os_log output.

Debug com.apple.espresso espresso
"Kernel validation warning PoolingLayerBuilder (AVERAGE)_41 (pool) @ 33: Unsupported: (dilated)kernel width = 28 > 13"

Debug com.apple.espresso espresso
"Kernel validation warning MulOpBuilder_49 (elementwise) @ 41: elementwise with channel broadcast supported only with constant vector or transplant input"

So average pooling has a hidden constraint that restricts the kernel size to at most 13, and elementwise multiplication does not support broadcasting a [C×H×W] tensor with a [C×1×1] tensor. (I don't know what "transplant input" means.)
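Warnings like these suggest a pre-flight check is possible before dispatching a graph to the ANE. A hypothetical validator in the same spirit; the 13-wide kernel limit comes from the log message above, but the layer-description format here is invented purely for illustration:

```python
# Hypothetical pre-flight check mirroring the Espresso warnings above.
# The kernel-width limit of 13 comes from the observed log message;
# the layer dicts are an invented format for illustration only.
MAX_POOL_KERNEL = 13

def ane_warnings(layers):
    warnings = []
    for layer in layers:
        if layer["type"] == "pool" and max(layer["kernel"]) > MAX_POOL_KERNEL:
            warnings.append(f"{layer['name']}: pooling kernel {layer['kernel']} "
                            f"exceeds {MAX_POOL_KERNEL}, will fall back")
        if (layer["type"] == "elementwise_mul"
                and layer.get("broadcast") == "channel"
                and not layer.get("constant_operand", False)):
            warnings.append(f"{layer['name']}: channel-broadcast multiply with a "
                            f"non-constant operand, will fall back")
    return warnings

layers = [
    {"type": "pool", "name": "avg_pool_41", "kernel": (28, 28)},
    {"type": "elementwise_mul", "name": "mul_49",
     "broadcast": "channel", "constant_operand": False},
    {"type": "pool", "name": "avg_pool_7", "kernel": (7, 7)},
]
for w in ane_warnings(layers):
    print(w)
```

The two flagged layers correspond to the two warnings quoted above; the 7×7 pooling passes, which matches MobileNet-style models whose final global pooling operates on small feature maps.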

Since the ANE accelerates neural network models, how could it benefit traditional CPU tasks? I mean in the Apple M1.

With the release of the Apple M1, a lot of people have started to compare the M1 with Intel CPUs, and even NVIDIA GPUs. But to my understanding, the ANE should not take over tasks normally handled by the GPU, such as rendering, nor tasks normally handled by the CPU. An NPU should focus on its own business, and it's even more confusing that the NPU is integrated into the chip (maybe it should be a separate component).

Sorry to ask question here and make GitHub a little bit like a forum...

Link to some examples / demos

This is a great resource - thanks!

Can you link to or provide some basic demos of the ANE in action?
Ideally for me that would be something I can run from the command line on an M1 from Python, but since it seems that even Apple's TensorFlow fork (and the subsequent tensorflow-metal PluggableDevice) doesn't leverage the ANE, something else I could run on an M1 with Xcode would be a great help.

[Question] Have you tried A17 Pro Neural Engine?

Hi, Hollance san.

Have you tried A17 Pro Neural Engine?

Apple claims that it has 35 TOPS, which is a big leap in performance.
But Geekbench ML shows only a small improvement:
iPhone16,1 (iPhone 15 Pro) = 3402
iPhone15,2 (iPhone 14 Pro) = 3349
https://browser.geekbench.com/ml/v0/inference

I wonder whether Apple implemented 4-bit INT ops for internal use, while legacy performance stays almost the same, just a clock bump.
What do you think about it?
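For reference, if the speculation about 4-bit integer ops is right, weight quantization to INT4 would look roughly like the symmetric scheme below. This is a pure-Python sketch of the general technique only; nothing about the A17's internals is confirmed:

```python
# Symmetric INT4 weight quantization: map floats onto the 16 integer
# levels [-8, 7] with a single per-tensor scale. A sketch of the general
# technique; how (or whether) the A17 uses INT4 is not known.
def quantize_int4(weights):
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.9, -0.31, 0.02, 0.63, -0.88]
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
# Reconstruction error is bounded by half a quantization step (scale / 2).
assert all(abs(a - b) <= scale / 2 for a, b in zip(w, w_hat))
```

Halving the bits per weight mainly doubles effective memory bandwidth; whether that shows up in a benchmark depends on whether the benchmark's models are bandwidth-bound, which could explain a large TOPS claim with a small Geekbench delta.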

CoreML model processing time is almost twice as much as when the app is active

Great article! I think it's a great idea to write this on GitHub instead of as a blog post. I am running a Core ML model on an iPhone 8, using cpuOnly because I want the model to run in the background. I have observed a strange thing: the model takes approximately 8 seconds to process one image when the app is active or in background mode, but if I open Chrome or Safari while my app is running in the background, the processing time suddenly drops to approximately 3 seconds per image. Why is that, and how can I get the faster processing time in all cases?

Thoughts on "Detecting" Supported Layers

Hey there! This repo is super amazing, thanks for putting all these findings together.

Was reading this sub-section (link below) on which layers are unsupported, and I had an idea on how to programmatically identify them.

https://github.com/hollance/neural-engine/blob/master/docs/unsupported-layers.md

This part here, "S → U → S → U → S → U", about alternating supported and unsupported layers made me wonder. In theory, suppose we take a layer X from a Core ML model (or just some Core ML op) that we do not yet know can run on the ANE, and we build a dummy model consisting solely of that layer, i.e. X → X → X ... If we then set the compute units to CPU and ANE only, Core ML is encouraged to run the whole model on one compute unit, since that would be the most efficient. So if you see that the layers run on the ANE, you will have identified that this op is ANE-compatible? 🧐 We could even record stats as well.

I'd like to test this theory but wanted to run it by you first. I'm thinking this could be a way to programmatically take a model apart layer by layer, build a simple repeated-layer model for each one, and produce a chart of whether each layer is CPU/GPU/ANE supported (maybe even with statistics). Maybe even a publicly available chart of ops and their supported compute units (since, to my knowledge, something like that does not exist today).

This would help with identifying areas where a layer could be swapped out or modified to encourage running the model more efficiently on the ANE.

Any thoughts would be appreciated! 😊
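The bookkeeping for such a probe could be as simple as tallying where each layer of the dummy model actually ran. A sketch of just that classification step; the placement reports are invented here, and in practice they would have to come from Instruments or os_log:

```python
# Classify an op as ANE-compatible from per-layer placement reports of a
# dummy model built by repeating that op (X -> X -> X). The report format
# is invented; real placements would come from Instruments or os_log.
def classify_op(op_name, placements):
    """placements: list of 'ANE' / 'CPU' / 'GPU' strings, one per layer."""
    ane = placements.count("ANE")
    if ane == len(placements):
        return (op_name, "ANE-compatible")
    if ane == 0:
        return (op_name, "not ANE-compatible")
    return (op_name, "partial (check graph partitioning)")

print(classify_op("conv2d", ["ANE", "ANE", "ANE"]))
print(classify_op("custom_op", ["CPU", "CPU", "CPU"]))
print(classify_op("reshape", ["ANE", "CPU", "ANE"]))
```

The "partial" case is worth keeping separate: mixed placements could mean the op is supported but Core ML chose to split the graph, which is a different signal than an op that never touches the ANE at all.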

PoolingLayerBuilder (MEAN)_57' is not set

Hello, I have a problem saving EfficientNetV2 as a TFLite model in order to then run it in Core ML. When I build the project I get an error:

Error compiling model compiler error: Error reading protobuf spec. validator error: Padding type for the pooling layer 'PoolingLayerBuilder (MEAN)_57' is not set.

Do you know what the problem might be?
