Comments (20)
Thanks for sharing this.
It looks like a considerable amount of time is spent creating new arrays.
This is due to the broadcasting mechanism: it trades memory for CPU.
(ONNX models rely heavily on broadcasting to reduce the size of the models.)
The broadcasting mechanism has already been optimized in Gorgonia, but it looks like there is still some work to do :).
from onnx-go.
I took the opportunity of my blog post to run a very simple test on amd64, to make sure I did not introduce any regression in the code lately. I ran the tiny YOLO v2 model twice and measured both runs with /usr/bin/time:
- The first execution uses the darknet binary (written in C).
- The second one uses gofaces.
Here are the results:
Command being timed: "./gofaces -img /tmp/MemeLoveTriangle_297886754.jpg"
User time (seconds): 2.61
System time (seconds): 0.20
Percent of CPU this job got: 213%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.32
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 347952
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 83580
Voluntary context switches: 2239
Involuntary context switches: 57
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Command being timed: "./darknet detector test ../azFace/net_cfg/azface.data ../azFace/net_cfg/tiny-yolo-azface-fddb.cfg ../azFace/weights/tiny-yolo-azface-fddb_82000.weights /tmp/MemeLoveTriangle_297886754.jpg"
User time (seconds): 2.32
System time (seconds): 0.06
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.39
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 198840
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 51155
Voluntary context switches: 1
Involuntary context switches: 7
Swaps: 0
File system inputs: 0
File system outputs: 384
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
darknet takes 2.32s to execute while gofaces takes 2.61s. gofaces uses considerably more memory (a maximum resident set size of 347952 kB versus 198840 kB for darknet), but that does not explain everything. I should take a closer look at the ARM implementation.
Before using WebGL, we can use https://github.com/go-gl/mathgl for some operations.
This is what TensorFlow Lite does (https://www.tensorflow.org/lite/performance/gpu_advanced).
@owulveryck should we add a backend that supports OpenGL?
Will try it tomorrow. ;)
@blackrez I can do the test this weekend. My RPI is running a 64-bit OS. The problematic thing is that, for a reason I don't know, /proc/cpuinfo shows fewer flags under the 64-bit OS; vfp is missing, for example.
That's normal: you don't have vfp because it was superseded by the NEON instructions in ARMv8.
It's more than that; it looks like ARMv8 doesn't report its features well in /proc/cpuinfo: https://community.arm.com/developer/tools-software/oss-platforms/b/android-blog/posts/runtime-detection-of-cpu-features-on-an-armv8-a-cpu
There are still some problems with the build on arm:
pi64 ~/source/onnx-go/examples/model_zoo_executor # MODELDIR=/tmp/onnx/tiny_yolov2 go test -tags=noasm -bench=.
# github.com/chewxy/math32
/root/go/pkg/mod/github.com/chewxy/[email protected]/exp.go:3:6: missing function body
/root/go/pkg/mod/github.com/chewxy/[email protected]/log.go:76:6: missing function body
/root/go/pkg/mod/github.com/chewxy/[email protected]/remainder.go:33:6: missing function body
/root/go/pkg/mod/github.com/chewxy/[email protected]/sqrt.go:3:6: missing function body
FAIL github.com/owulveryck/onnx-go/examples/model_zoo_executor [build failed]
pi64 ~/source/onnx-go/examples/model_zoo_executor # MODELDIR=/tmp/onnx/tiny_yolov2 go test -tags=noasm -bench=.
# gorgonia.org/gorgonia
/root/go/pkg/mod/gorgonia.org/[email protected]/bitmap.go:17:10: undefined: divmod
/root/go/pkg/mod/gorgonia.org/[email protected]/bitmap.go:35:16: undefined: divmod
/root/go/pkg/mod/gorgonia.org/[email protected]/bitmap.go:45:16: undefined: divmod
/root/go/pkg/mod/gorgonia.org/[email protected]/bitmap.go:55:16: undefined: divmod
/root/go/pkg/mod/gorgonia.org/[email protected]/operations.go:549:20: undefined: divmod
FAIL github.com/owulveryck/onnx-go/examples/model_zoo_executor [build failed]
Both were solved with these changes in the go.mod:
replace github.com/chewxy/math32 => github.com/chewxy/math32 v1.0.4
replace gorgonia.org/gorgonia => gorgonia.org/gorgonia v0.9.4-0.20190906113433-1b4cc0a64e58
go env:
pi64 ~/source/onnx-go/examples/model_zoo_executor # go env
GOARCH="arm64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/root/go"
GOPROXY=""
GORACE=""
GOROOT="/root/source/golang/go"
GOTMPDIR=""
GOTOOLDIR="/root/source/golang/go/pkg/tool/linux_arm64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/root/source/onnx-go/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build787647331=/tmp/go-build -gno-record-gcc-switches"
@owulveryck, the benchmark you asked for:
// old is without the change, new is with the lock commented out
pi64 ~ # benchcmp old.txt new.txt
benchmark old ns/op new ns/op delta
BenchmarkRun-4 21771941953 21652831591 -0.55%
It's a Raspberry Pi 3 Model B+ running Gentoo 64-bit.
Hi!
I see two possibilities:
- gonum's ARM implementation may be slow; I don't know whether their BLAS library is optimized for ARM. I know we can change the BLAS implementation at the Gorgonia level, but I don't know how to do it properly. Maybe @chewxy can help us.
- The small amount of memory on the Pi puts some pressure on the GC.
If you want to profile, you can use the model_zoo_executor and run
MODELDIR=/path/to/tiny-yolo go test -bench=. -benchmem -memprofile memprofile.out -cpuprofile profile.out -run=NONE
and read the files with go tool pprof on your mac.
The last time I ran a profile on the model, I saw that the bottleneck was the general-purpose matrix multiplication. There is no magic here; this operation is time-consuming.
I don't understand how other implementations are 100 times faster (and I know they are).
Profile files: profile.zip
The last time I ran a profile on the model, I saw that the bottleneck was the general-purpose matrix multiplication. There is no magic here; this operation is time-consuming.
I don't understand how other implementations are 100 times faster (and I know they are).
They use NEON instructions (https://en.wikipedia.org/wiki/ARM_architecture#Advanced_SIMD_(NEON)); I will open an issue in gorgonia to see how we can add these instructions.
Thanks @blackrez. If possible, once you create the issue, post the link here as well; I would like to help.
Really nice! I didn't know that we're so close to a C implementation.
I saw a Darknet presentation on TED (https://www.ted.com/talks/joseph_redmon_how_a_computer_learns_to_recognize_objects_instantly) and want to know how they can do real-time detection when your test takes 2s+ per frame.
I tried gofaces on my machine and got a time similar to yours, 1.86s. To do something like 60 fps, we need to detect each frame in 16.6 ms, so we just need to be 112x faster. Could this be achieved using a GPU?
@diegobernardes a GPU is mandatory to reach such performance. When you profile the application on an Intel CPU, you can see that all of the computation time is spent performing matrix multiplication. This operation is already coded in assembly within gonum's BLAS, so there is no easy way to optimize it further.
When I have time, I will try to build Gorgonia with CUDA. (Compiling is easy, but the last time I tried, it was suffering from a bug, and I had no time to investigate it.)
Another thing I'd like to evaluate is the possibility of using WebAssembly and WebGL for matrix multiplication.
It could be a solution to avoid the mess of CUDA; the principle is "easy", but I really lack the skills, as I know nothing about WebGL. And it would probably not help with the problem we are facing on the Raspberry Pi either :D
FYI: https://github.com/doe300/VC4CL, a project that implements an OpenCL layer for the Raspberry Pi.
Maybe when Gorgonia has OpenCL support we can get better performance from it. We would still miss the NEON instructions, but it's a start.
Also, an OpenCL layer benefits everyone who doesn't have CUDA.
In the video, the CEO of Idein shows the performance boost of using the GPU for deep learning applications. It is also nice to see the Pi Zero having almost the same performance as the Pi 3, because they have the same GPU. So we could deploy deep learning applications on the edge with a $10 machine.
A bit to the side of this issue, I'd like to bench the code on the Raspberry Pi by removing these lines from the vm_tape.go file in gorgonia:
func (m *tapeMachine) RunAll() (err error) {
- runtime.LockOSThread()
- defer runtime.UnlockOSThread()
defer m.DoWork()
workAvailable := m.ExternMetadata.WorkAvailable()
syncChan := m.ExternMetadata.Sync()
errChan := make(chan error)
doneChan := make(chan struct{})
In Gorgonia, the thread locking is mandatory for any call into cgo (CUDA or OpenBLAS, for example), but it also carries a performance penalty. On ARM we use a pure-Go implementation by default, so the locking is not useful there. On my Intel CPU, removing it increases performance by 2%, and I guess the gain will be bigger on ARM.
A benchcmp with and without the lock on the OS thread would be awesome.
My RPI is not usable for now; can anybody test this?
Note: if you download the tiny YOLO model from the zoo, you can easily run a bench with examples/model_zoo_executor:
onnx-go/examples/model_zoo_executor $ MODELDIR=path/to/the/extracted/model go test -bench=.
Here are the results:
With the locking
root@raspberrypi:/home/pi/go/src/github.com/owulveryck/onnx-go/examples/model_zoo_executor# MODELDIR=/tmp/onnx/tiny_yolov2 go test -tags=noasm -bench=.
goos: linux
goarch: arm
pkg: github.com/owulveryck/onnx-go/examples/model_zoo_executor
BenchmarkRun-4 1 27418660316 ns/op
PASS
ok github.com/owulveryck/onnx-go/examples/model_zoo_executor 55.683s
Without the locking
root@raspberrypi:/home/pi/go/src/github.com/owulveryck/onnx-go/examples/model_zoo_executor# MODELDIR=/tmp/onnx/tiny_yolov2 go test -tags=noasm -bench=.
goos: linux
goarch: arm
pkg: github.com/owulveryck/onnx-go/examples/model_zoo_executor
BenchmarkRun-4 1 27723200546 ns/op
PASS
ok github.com/owulveryck/onnx-go/examples/model_zoo_executor 55.976s
Bad news: the pure-Go implementation is very, very slow with or without the modification, because Go is very slow on ARMv6/7 and most of the math implementation is software-based and doesn't use hardware acceleration.
BUT the RPi can run ARMv8, where the math libraries do use the hardware.
I haven't tried it, but I think using an RPi with an ARMv8 Linux is one solution.
@owulveryck I know you have Terraform ready to spin up EC2; could you use A1 instances for testing?
There is a new experimental VM in Gorgonia (https://github.com/gorgonia/gorgonia/tree/master/x/vm).
Is it possible to run it on a Pi to see how this VM behaves?
You need to tweak the graph.go file in gorgonnx like this:
diff --git a/backend/x/gorgonnx/graph.go b/backend/x/gorgonnx/graph.go
index 90b403d..5978433 100644
--- a/backend/x/gorgonnx/graph.go
+++ b/backend/x/gorgonnx/graph.go
@@ -7,6 +7,8 @@ import (
"gonum.org/v1/gonum/graph"
"gonum.org/v1/gonum/graph/simple"
"gorgonia.org/gorgonia"
+
+ xvm "gorgonia.org/gorgonia/x/vm"
"gorgonia.org/tensor"
)
@@ -58,7 +60,8 @@ func (g *Graph) Run() error {
}
}
if g.m == nil {
- g.m = gorgonia.NewTapeMachine(g.exprgraph)
+ g.m = xvm.NewGoMachine(g.exprgraph)
} else {
g.m.Reset()
}
There are no more issues building on ARMv6 and ARM64. For the performance issues, I added this issue: #169