hollance / neural-engine
Everything we actually know about the Apple Neural Engine (ANE)
License: MIT License
Hey there! This repo is super amazing, thanks for putting all these findings together.
Was reading this sub-section (link below) on which layers are unsupported, and I had an idea on how to programmatically identify them.
https://github.com/hollance/neural-engine/blob/master/docs/unsupported-layers.md
This part here, "S → U → S → U → S → U", about alternating supported/unsupported layers made me wonder. In theory, if we take a layer X from a Core ML model (or just some Core ML op) that we do not yet know can run on the ANE, and we build a dummy model consisting solely of that layer, i.e. X -> X -> X ..., and we set the compute units to CPU and ANE only, that should encourage Core ML to run the whole model on a single compute unit, since that would be most efficient. So, in theory, if you set the compute units to CPU && ANE and you see that the layer runs on the ANE, you will have identified that this op is ANE-compatible? 🧐 You could even record stats as well.
I'd like to test this theory but wanted to run it by you first. This could be a way to, given a model, programmatically pull out individual layers, build a simple repeated-layer model for each, and produce a chart of whether each layer is supported on the CPU/GPU/ANE (maybe even with statistics). There could even be a publicly available chart of ops and their supported compute units (since, to my knowledge, nothing like that exists today).
This would help with identifying places where a layer could be swapped out or modified to encourage the model to run more efficiently on the ANE.
Any thoughts would be appreciated! 😊
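A rough sketch of that probe, assuming coremltools and its MIL builder: build a single-op model, convert it with CPU-and-ANE compute units, and then check where the op actually runs in Xcode's performance report or the Core ML Instruments template. The op (`relu` here), shapes, and repeat count are only placeholders.

```python
import coremltools as ct
from coremltools.converters.mil import Builder as mb

# Dummy model built from a single repeated op (X -> X -> X ...).
# Swap mb.relu for whichever op you want to probe.
@mb.program(input_specs=[mb.TensorSpec(shape=(1, 64, 56, 56))])
def probe(x):
    for _ in range(8):
        x = mb.relu(x=x)
    return x

mlmodel = ct.convert(
    probe,
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # CPU + Neural Engine only
)
mlmodel.save("probe_relu.mlpackage")
# Open the .mlpackage in Xcode (performance report) or profile with the
# Core ML Instruments template to see which compute unit each op runs on.
```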
The accuracy of cpuOnly, cpuAndGPU, and all is different, and only the cpuOnly result is accurate.
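For what it's worth, one way to quantify that difference (on a Mac, where coremltools can run predictions) is to load the same model under each compute-unit setting and compare the outputs against the CPU-only result, which is computed in float32. The file name and input/output names below are placeholders.

```python
import numpy as np
import coremltools as ct

units = {
    "cpuOnly": ct.ComputeUnit.CPU_ONLY,
    "cpuAndGPU": ct.ComputeUnit.CPU_AND_GPU,
    "all": ct.ComputeUnit.ALL,
}
sample = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}

outputs = {}
for name, cu in units.items():
    model = ct.models.MLModel("model.mlpackage", compute_units=cu)
    outputs[name] = model.predict(sample)["output"]

# cpuOnly runs in float32, so treat it as the reference.
for name in ("cpuAndGPU", "all"):
    diff = np.max(np.abs(outputs[name] - outputs["cpuOnly"]))
    print(f"max abs difference vs cpuOnly for {name}: {diff}")
```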
Great article! I think it's a great idea to write this on GitHub instead of in a blog post. I am running a Core ML model on an iPhone 8. I am using cpuOnly
because I want the model to keep processing in the background. I have observed a strange thing: the model takes approximately 8 seconds to process one image when the app is active or in background mode, but if I open Chrome or Safari while my app is running in the background, the processing time suddenly drops to approximately 3 seconds per image. Why is that, and how can I get the faster processing time in all cases?
Hello, I have a problem saving EfficientNetV2 as a TFLite model in order to then run it with Core ML. When I build the project I get an error:
Error compiling model compiler error: Error reading protobuf spec. validator error: Padding type for the pooling layer 'PoolingLayerBuilder (MEAN)_57' is not set.
Do you know what the problem might be?
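Hard to say without seeing the model, but since the validator complains that the padding type is simply not set, one hypothetical repair is to patch the converted spec with coremltools and set an explicit padding type on any pooling layer that lacks one. File names below are placeholders, the field names follow the public NeuralNetwork.proto, and you may need to adjust the path to the layers if your model is a pipeline or classifier.

```python
import coremltools as ct

spec = ct.utils.load_spec("EfficientNetV2.mlmodel")

# Adjust if your layers live under neuralNetworkClassifier or a pipeline.
for layer in spec.neuralNetwork.layers:
    if layer.WhichOneof("layer") == "pooling":
        if layer.pooling.WhichOneof("PoolingPaddingType") is None:
            # Global average pooling ignores padding, so "valid" is a safe default.
            layer.pooling.valid.SetInParent()

ct.utils.save_spec(spec, "EfficientNetV2_fixed.mlmodel")
```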
This is a great resource - thanks!
Can you link to or provide some basic demos of the ANE in action?
Ideally for me that would be something I can run from the command line on an M1 from Python, but since it seems that even Apple's TensorFlow fork (and the subsequent tensorflow-metal PluggableDevice) doesn't leverage the ANE, something else I could run on an M1 with Xcode would also be a great help.
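A hedged example of the kind of thing that can be run from Python on an M1: convert a small torchvision model with coremltools, request CPU-and-ANE execution, and run a prediction. The model choice and names are just examples, and confirming ANE use still requires something like `powermetrics` or Instruments.

```python
import torch
import torchvision
import coremltools as ct

# Trace a small off-the-shelf model and convert it to an ML Program.
model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=(1, 3, 224, 224))],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # no GPU, so heavy ops should go to the ANE
)
mlmodel.save("mobilenet_v2.mlpackage")

# Run a prediction from Python; watch the ANE power line in `sudo powermetrics`
# (or profile in Instruments) to confirm the Neural Engine is actually used.
out = mlmodel.predict({"image": example.numpy()})
print(list(out.keys()))
```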
Hi, Hollance san.
Have you tried the A17 Pro Neural Engine?
Apple claims it delivers 35 TOPS, which would be a big performance leap.
But Geekbench ML shows only a small improvement:
iPhone16,1(iPhone 15 Pro) = 3402
iPhone15,2(iPhone 14 Pro) = 3349
https://browser.geekbench.com/ml/v0/inference
I wonder whether Apple may have implemented 4-bit INT ops for internal use, while legacy performance stays almost the same, with just a clock bump.
What do you think about it?
May I ask what mle5 is in CoreML.framework?
Stupid question: how do you do a float16 square matrix-matrix multiply with their SDK? Can it handle sparse matrix multiplies?
I'm interested in hacking it to do string parsing - https://en.wikipedia.org/wiki/CYK_algorithm.
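Not an authoritative answer, but one way to get a float16 matrix-matrix multiply onto Apple's stack from Python is to express it as a Core ML ML Program with the coremltools MIL builder and convert with float16 compute precision. Sizes and names below are arbitrary, and as far as I know sparse matrix multiplication is not exposed this way.

```python
import numpy as np
import coremltools as ct
from coremltools.converters.mil import Builder as mb

N = 512  # square matrices

@mb.program(input_specs=[mb.TensorSpec(shape=(N, N)), mb.TensorSpec(shape=(N, N))])
def matmul_prog(a, b):
    return mb.matmul(x=a, y=b)

mlmodel = ct.convert(
    matmul_prog,
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,   # float16 weights and activations
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
out = mlmodel.predict({
    "a": np.random.rand(N, N).astype(np.float32),
    "b": np.random.rand(N, N).astype(np.float32),
})
```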
Hi.
In https://github.com/hollance/neural-engine/blob/master/docs/16-bit.md, you wrote
"On the GPU it uses float16 for the weights and the intermediate tensors, but float32 for the calculations. You can turn this off with the option allowLowPrecisionAccumulationOnGPU from MLModelConfiguration, in which case the GPU also uses float16 for the calculations. This is a bit faster but you may lose precision."
Do you have any reference for this description?
In WWDC 19 (https://developer.apple.com/videos/play/wwdc2019/704/ (39:00)), they said,
"And the idea here is that if your model is learning on the GPU, instead of doing accumulation in float32, that happens in float60."
So I guess that this option may be effective only for macOS and the change is from float60 to float32.
I tried the option on iOS device without Neural Engine, but it seemed no speed enhancement.
Thanks.
Hello, Hollemans san.
Thank you for your useful repository!
I found that a pooling layer which follows a convolution prevents the convolution from running on the ANE.
Do you know any workarounds to avoid it?
The pooling layer is actually a global average pooling layer.
I tried turning globalPooling on and off and replacing it with reduce_mean, but neither helped.
Thanks.
[Additional Information]
I found that padding="VALID" causes the problem.
If you use padding="SAME", the model runs on the ANE, but of course that is useless since the layer then produces an output with a different shape.
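For anyone who wants to poke at this, a small reproduction sketch using the coremltools MIL builder: build conv → global average pool once with "valid" and once with "same" convolution padding, then compare where each variant runs in Xcode's performance report or Instruments. The shapes and weights are arbitrary.

```python
import numpy as np
import coremltools as ct
from coremltools.converters.mil import Builder as mb

# Arbitrary 3x3 convolution weights: 32 output channels, 16 input channels.
W = np.random.rand(32, 16, 3, 3).astype(np.float32)

def make_program(pad_type):
    @mb.program(input_specs=[mb.TensorSpec(shape=(1, 16, 56, 56))])
    def prog(x):
        x = mb.conv(x=x, weight=W, strides=[1, 1], pad_type=pad_type)
        # Global average pooling expressed as a reduce_mean over H and W.
        x = mb.reduce_mean(x=x, axes=[2, 3], keep_dims=True)
        return x
    return prog

for pad_type in ("valid", "same"):
    mlmodel = ct.convert(make_program(pad_type), convert_to="mlprogram",
                         compute_units=ct.ComputeUnit.CPU_AND_NE)
    mlmodel.save(f"conv_pool_{pad_type}.mlpackage")
```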
Hi, I'm working on integrating the ANE into TFLite. While testing MobileNet V3, I discovered the following messages in the os_log output.
Debug com.apple.espresso espresso
"Kernel validation warning PoolingLayerBuilder (AVERAGE)_41 (pool) @ 33: Unsupported: (dilated)kernel width = 28 > 13"
Debug com.apple.espresso espresso
"Kernel validation warning MulOpBuilder_49 (elementwise) @ 41: elementwise with channel broadcast supported only with constant vector or transplant input"
So average pooling has a hidden constraint that limits the kernel size to 13, and elementwise multiplication does not support broadcast multiplication of [CxHxW] and [Cx1x1] tensors. (I don't know what "transplant input" means.)
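If the kernel-width-13 limit is the blocker, one hedged workaround is to split a large average pool into two chained pools whose kernels both stay under the limit. For a 28×28 global average pool, pooling 4×4 then 7×7 over equal-size blocks gives the same overall mean. The shapes below are illustrative (coremltools MIL builder).

```python
import coremltools as ct
from coremltools.converters.mil import Builder as mb

# 28x28 global average pool split into 4x4 then 7x7 pools; averaging
# equal-size block averages equals the overall average, so the math is identical.
@mb.program(input_specs=[mb.TensorSpec(shape=(1, 96, 28, 28))])
def prog(x):
    x = mb.avg_pool(x=x, kernel_sizes=[4, 4], strides=[4, 4], pad_type="valid")
    x = mb.avg_pool(x=x, kernel_sizes=[7, 7], strides=[7, 7], pad_type="valid")
    return x

mlmodel = ct.convert(prog, convert_to="mlprogram",
                     compute_units=ct.ComputeUnit.CPU_AND_NE)
mlmodel.save("split_pool.mlpackage")
```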
With the release of the Apple M1, a lot of people have started comparing the M1 with Intel CPUs and even NVIDIA GPUs. But to my understanding, the ANE should not take over normal tasks handled by the GPU, such as rendering, nor tasks handled by the CPU; an NPU should focus on its own workloads. And it's even more confusing to me that the NPU is integrated into the same chip (maybe it should have its own place).
Sorry to ask a question here and make GitHub a little bit like a forum...