hollance / neural-engine
Everything we actually know about the Apple Neural Engine (ANE)
License: MIT License
Hey there! This repo is super amazing, thanks for putting all these findings together.
Was reading this sub-section (link below) on which layers are unsupported, and I had an idea on how to programmatically identify them.
https://github.com/hollance/neural-engine/blob/master/docs/unsupported-layers.md
This part here, "S → U → S → U → S → U", about alternating supported/unsupported layers made me wonder. In theory, if we take a layer X from a Core ML model (or just some Core ML op) that we do not yet know can run on the ANE, and we build a dummy model consisting solely of that layer, i.e. X -> X -> X ..., and we set the compute units to CPU and ANE only, that should encourage Core ML to run the whole model on a single compute unit, since that would be most efficient. So, in theory, if you set the compute units to CPU && ANE and you see that the layer runs on the ANE, you will have identified that this op is ANE-compatible? 🧐 You could even record stats as well.
I'd like to test this theory but wanted to run it by you first. This could be a way to, given a model, programmatically pull out individual layers, build a simple repeated-layer model for each, and produce a chart of whether each layer is supported on the CPU/GPU/ANE (maybe even with statistics). There could even be a publicly available chart of ops and their supported compute units (since, to my knowledge, nothing like that exists today).
This would help with identifying places where a layer could be swapped out or modified to encourage the model to run more efficiently on the ANE.
Any thoughts would be appreciated! 😊
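A rough sketch of that probe, assuming coremltools and its MIL builder: build a single-op model, convert it with CPU-and-ANE compute units, and then check where the op actually runs in Xcode's performance report or the Core ML Instruments template. The op (`relu` here), shapes, and repeat count are only placeholders.

```python
import coremltools as ct
from coremltools.converters.mil import Builder as mb

# Dummy model built from a single repeated op (X -> X -> X ...).
# Swap mb.relu for whichever op you want to probe.
@mb.program(input_specs=[mb.TensorSpec(shape=(1, 64, 56, 56))])
def probe(x):
    for _ in range(8):
        x = mb.relu(x=x)
    return x

mlmodel = ct.convert(
    probe,
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # CPU + Neural Engine only
)
mlmodel.save("probe_relu.mlpackage")
# Open the .mlpackage in Xcode (performance report) or profile with the
# Core ML Instruments template to see which compute unit each op runs on.
```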
The accuracy of cpuOnly, cpuAndGPU, and all is different, and only the cpuOnly result is accurate.
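For what it's worth, one way to quantify that difference (on a Mac, where coremltools can run predictions) is to load the same model under each compute-unit setting and compare the outputs against the CPU-only result, which is computed in float32. The file name and input/output names below are placeholders.

```python
import numpy as np
import coremltools as ct

units = {
    "cpuOnly": ct.ComputeUnit.CPU_ONLY,
    "cpuAndGPU": ct.ComputeUnit.CPU_AND_GPU,
    "all": ct.ComputeUnit.ALL,
}
sample = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}

outputs = {}
for name, cu in units.items():
    model = ct.models.MLModel("model.mlpackage", compute_units=cu)
    outputs[name] = model.predict(sample)["output"]

# cpuOnly runs in float32, so treat it as the reference.
for name in ("cpuAndGPU", "all"):
    diff = np.max(np.abs(outputs[name] - outputs["cpuOnly"]))
    print(f"max abs difference vs cpuOnly for {name}: {diff}")
```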
Great article! I think it's a great idea to write this on GitHub instead of in a blog post. I am running a Core ML model on an iPhone 8. I am using cpuOnly
because I want the model to keep processing in the background. I have observed a strange thing: the model takes approximately 8 seconds to process one image when the app is active or in background mode, but if I open Chrome or Safari while my app is running in the background, the processing time suddenly drops to approximately 3 seconds per image. Why is that, and how can I get the faster processing time in all cases?
Hello, I have a problem saving EfficientNetV2 as a TFLite model in order to then run it with Core ML. When I build the project I get an error:
Error compiling model compiler error: Error reading protobuf spec. validator error: Padding type for the pooling layer 'PoolingLayerBuilder (MEAN)_57' is not set.
Do you know what the problem might be?
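Hard to say without seeing the model, but since the validator complains that the padding type is simply not set, one hypothetical repair is to patch the converted spec with coremltools and set an explicit padding type on any pooling layer that lacks one. File names below are placeholders, the field names follow the public NeuralNetwork.proto, and you may need to adjust the path to the layers if your model is a pipeline or classifier.

```python
import coremltools as ct

spec = ct.utils.load_spec("EfficientNetV2.mlmodel")

# Adjust if your layers live under neuralNetworkClassifier or a pipeline.
for layer in spec.neuralNetwork.layers:
    if layer.WhichOneof("layer") == "pooling":
        if layer.pooling.WhichOneof("PoolingPaddingType") is None:
            # Global average pooling ignores padding, so "valid" is a safe default.
            layer.pooling.valid.SetInParent()

ct.utils.save_spec(spec, "EfficientNetV2_fixed.mlmodel")
```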
This is a great resource - thanks!
Can you link to or provide some basic demos of the ANE in action?
Ideally for me that would be something I can run from the command line on an M1 from Python, but since it seems that even Apple's TensorFlow fork (and the subsequent tensorflow-metal PluggableDevice) doesn't leverage the ANE, something else I could run on an M1 with Xcode would also be a great help.
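A hedged example of the kind of thing that can be run from Python on an M1: convert a small torchvision model with coremltools, request CPU-and-ANE execution, and run a prediction. The model choice and names are just examples, and confirming ANE use still requires something like `powermetrics` or Instruments.

```python
import torch
import torchvision
import coremltools as ct

# Trace a small off-the-shelf model and convert it to an ML Program.
model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=(1, 3, 224, 224))],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # no GPU, so heavy ops should go to the ANE
)
mlmodel.save("mobilenet_v2.mlpackage")

# Run a prediction from Python; watch the ANE power line in `sudo powermetrics`
# (or profile in Instruments) to confirm the Neural Engine is actually used.
out = mlmodel.predict({"image": example.numpy()})
print(list(out.keys()))
```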
Hi, Hollance san.
Have you tried the A17 Pro Neural Engine?
Apple claims it delivers 35 TOPS, which would be a big performance leap.
But Geekbench ML shows only a small improvement:
iPhone16,1(iPhone 15 Pro) = 3402
iPhone15,2(iPhone 14 Pro) = 3349
https://browser.geekbench.com/ml/v0/inference
I wonder whether Apple may have implemented 4-bit INT ops for internal use, while legacy performance stays almost the same, with just a clock bump.
What do you think about it?
May I ask what mle5 is in CoreML.framework?
Stupid question: how do you do a float16 square matrix-matrix multiply with their SDK? Can it handle sparse matrix multiplies?
I'm interested in hacking it to do string parsing - https://en.wikipedia.org/wiki/CYK_algorithm.
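Not an authoritative answer, but one way to get a float16 matrix-matrix multiply onto Apple's stack from Python is to express it as a Core ML ML Program with the coremltools MIL builder and convert with float16 compute precision. Sizes and names below are arbitrary, and as far as I know sparse matrix multiplication is not exposed this way.

```python
import numpy as np
import coremltools as ct
from coremltools.converters.mil import Builder as mb

N = 512  # square matrices

@mb.program(input_specs=[mb.TensorSpec(shape=(N, N)), mb.TensorSpec(shape=(N, N))])
def matmul_prog(a, b):
    return mb.matmul(x=a, y=b)

mlmodel = ct.convert(
    matmul_prog,
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,   # float16 weights and activations
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
out = mlmodel.predict({
    "a": np.random.rand(N, N).astype(np.float32),
    "b": np.random.rand(N, N).astype(np.float32),
})
```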
Hi.
In https://github.com/hollance/neural-engine/blob/master/docs/16-bit.md, you wrote
"On the GPU it uses float16 for the weights and the intermediate tensors, but float32 for the calculations. You can turn this off with the option allowLowPrecisionAccumulationOnGPU from MLModelConfiguration, in which case the GPU also uses float16 for the calculations. This is a bit faster but you may lose precision."
Do you have any reference for this description?
In WWDC 19 (https://developer.apple.com/videos/play/wwdc2019/704/ (39:00)), they said,
"And the idea here is that if your model is learning on the GPU, instead of doing accumulation in float32, that happens in float60."
So I guess that this option may be effective only for macOS and the change is from float60 to float32.
I tried the option on iOS device without Neural Engine, but it seemed no speed enhancement.
Thanks.
Hello, Hollemans san.
Thank you for your useful repository!
I found that a pooling layer which follows a convolution prevents the convolution from running on the ANE.
Do you know any workarounds to avoid it?
The pooling layer is actually a global average pooling layer.
I tried turning globalPooling on and off and replacing it with reduce_mean, but neither helped.
Thanks.
[Additional Information]
I found that padding="VALID" causes the problem.
If you use padding="SAME", the model runs on the ANE, but of course that is useless since the layer then produces an output with a different shape.
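For anyone who wants to poke at this, a small reproduction sketch using the coremltools MIL builder: build conv → global average pool once with "valid" and once with "same" convolution padding, then compare where each variant runs in Xcode's performance report or Instruments. The shapes and weights are arbitrary.

```python
import numpy as np
import coremltools as ct
from coremltools.converters.mil import Builder as mb

# Arbitrary 3x3 convolution weights: 32 output channels, 16 input channels.
W = np.random.rand(32, 16, 3, 3).astype(np.float32)

def make_program(pad_type):
    @mb.program(input_specs=[mb.TensorSpec(shape=(1, 16, 56, 56))])
    def prog(x):
        x = mb.conv(x=x, weight=W, strides=[1, 1], pad_type=pad_type)
        # Global average pooling expressed as a reduce_mean over H and W.
        x = mb.reduce_mean(x=x, axes=[2, 3], keep_dims=True)
        return x
    return prog

for pad_type in ("valid", "same"):
    mlmodel = ct.convert(make_program(pad_type), convert_to="mlprogram",
                         compute_units=ct.ComputeUnit.CPU_AND_NE)
    mlmodel.save(f"conv_pool_{pad_type}.mlpackage")
```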
Hi, I'm working on integrating the ANE into TFLite. While testing MobileNet V3, I discovered the following messages in the os_log output.
Debug com.apple.espresso espresso
"Kernel validation warning PoolingLayerBuilder (AVERAGE)_41 (pool) @ 33: Unsupported: (dilated)kernel width = 28 > 13"
Debug com.apple.espresso espresso
"Kernel validation warning MulOpBuilder_49 (elementwise) @ 41: elementwise with channel broadcast supported only with constant vector or transplant input"
So average pooling has a hidden constraint that limits the kernel size to 13, and elementwise multiplication does not support broadcast multiplication of [CxHxW] and [Cx1x1] tensors. (I don't know what "transplant input" means.)
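If the kernel-width-13 limit is the blocker, one hedged workaround is to split a large average pool into two chained pools whose kernels both stay under the limit. For a 28×28 global average pool, pooling 4×4 then 7×7 over equal-size blocks gives the same overall mean. The shapes below are illustrative (coremltools MIL builder).

```python
import coremltools as ct
from coremltools.converters.mil import Builder as mb

# 28x28 global average pool split into 4x4 then 7x7 pools; averaging
# equal-size block averages equals the overall average, so the math is identical.
@mb.program(input_specs=[mb.TensorSpec(shape=(1, 96, 28, 28))])
def prog(x):
    x = mb.avg_pool(x=x, kernel_sizes=[4, 4], strides=[4, 4], pad_type="valid")
    x = mb.avg_pool(x=x, kernel_sizes=[7, 7], strides=[7, 7], pad_type="valid")
    return x

mlmodel = ct.convert(prog, convert_to="mlprogram",
                     compute_units=ct.ComputeUnit.CPU_AND_NE)
mlmodel.save("split_pool.mlpackage")
```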
With the release of the Apple M1, a lot of people have started comparing the M1 with Intel CPUs and even NVIDIA GPUs. But to my understanding, the ANE should not take over normal tasks handled by the GPU, such as rendering, nor tasks handled by the CPU; an NPU should focus on its own workloads. And it's even more confusing to me that the NPU is integrated into the same chip (maybe it should have its own place).
Sorry to ask a question here and make GitHub a little bit like a forum...