
owulveryck commented on July 22, 2024

Thanks for sharing this.

It looks like it spends a considerable amount of time creating new arrays.

This is due to the broadcasting mechanism, which trades memory for CPU.
(ONNX models rely heavily on broadcasting to reduce model size.)

The broadcasting mechanism has already been optimized in Gorgonia, but it looks like there is still some work to do :).

from onnx-go.

owulveryck commented on July 22, 2024

I took the opportunity of my blog post to run a very simple test on amd64, to be sure that I did not introduce any regression in the code lately. I ran two executions of tiny YOLO v2 and measured them with /usr/bin/time.

  • The first execution is with the darknet binary (written in C).
  • The second one is with gofaces.

Here are the results:

        Command being timed: "./gofaces -img /tmp/MemeLoveTriangle_297886754.jpg"
        User time (seconds): 2.61
        System time (seconds): 0.20
        Percent of CPU this job got: 213%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.32
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 347952
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 83580
        Voluntary context switches: 2239
        Involuntary context switches: 57
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

        Command being timed: "./darknet detector test ../azFace/net_cfg/azface.data ../azFace/net_cfg/tiny-yolo-azface-fddb.cfg ../azFace/weights/tiny-yolo-azface-fddb_82000.weights /tmp/MemeLoveTriangle_297886754.jpg"
        User time (seconds): 2.32
        System time (seconds): 0.06
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.39
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 198840
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 51155
        Voluntary context switches: 1
        Involuntary context switches: 7
        Swaps: 0
        File system inputs: 0
        File system outputs: 384
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

darknet takes 2.32s of user time to execute while gofaces takes 2.61s. The amount of memory used is much higher with gofaces (about 348 MB vs 199 MB maximum resident set size), but that does not explain everything. I should have a closer look at the ARM implementation.

blackrez commented on July 22, 2024

Before using WebGL, we could use https://github.com/go-gl/mathgl for some operations.
This is what TensorFlow Lite does (https://www.tensorflow.org/lite/performance/gpu_advanced).

@owulveryck, should we add a backend to support OpenGL?

blackrez commented on July 22, 2024

Will try it tomorrow. ;)

diegobernardes commented on July 22, 2024

@blackrez I can do the test this weekend. My RPi is running a 64-bit OS. The odd thing is that, for a reason I don't understand, /proc/cpuinfo shows fewer flags on the 64-bit OS; it's missing vfp, for example.

diegobernardes commented on July 22, 2024

That's normal if you don't have vfp because it was removed with armv8 and NEON instructions.

It's more than that; it looks like ARMv8 doesn't report its CPU features well through /proc/cpuinfo: https://community.arm.com/developer/tools-software/oss-platforms/b/android-blog/posts/runtime-detection-of-cpu-features-on-an-armv8-a-cpu


There are still some problems with the build on ARM:

pi64 ~/source/onnx-go/examples/model_zoo_executor # MODELDIR=/tmp/onnx/tiny_yolov2 go test -tags=noasm -bench=.
# github.com/chewxy/math32
/root/go/pkg/mod/github.com/chewxy/[email protected]/exp.go:3:6: missing function body
/root/go/pkg/mod/github.com/chewxy/[email protected]/log.go:76:6: missing function body
/root/go/pkg/mod/github.com/chewxy/[email protected]/remainder.go:33:6: missing function body
/root/go/pkg/mod/github.com/chewxy/[email protected]/sqrt.go:3:6: missing function body
FAIL	github.com/owulveryck/onnx-go/examples/model_zoo_executor [build failed]
pi64 ~/source/onnx-go/examples/model_zoo_executor # MODELDIR=/tmp/onnx/tiny_yolov2 go test -tags=noasm -bench=.
# gorgonia.org/gorgonia
/root/go/pkg/mod/gorgonia.org/[email protected]/bitmap.go:17:10: undefined: divmod
/root/go/pkg/mod/gorgonia.org/[email protected]/bitmap.go:35:16: undefined: divmod
/root/go/pkg/mod/gorgonia.org/[email protected]/bitmap.go:45:16: undefined: divmod
/root/go/pkg/mod/gorgonia.org/[email protected]/bitmap.go:55:16: undefined: divmod
/root/go/pkg/mod/gorgonia.org/[email protected]/operations.go:549:20: undefined: divmod
FAIL	github.com/owulveryck/onnx-go/examples/model_zoo_executor [build failed]

Both were solved with this change in the go.mod:

        replace github.com/chewxy/math32 => github.com/chewxy/math32 v1.0.4
        replace gorgonia.org/gorgonia => gorgonia.org/gorgonia v0.9.4-0.20190906113433-1b4cc0a64e58

goenv:

pi64 ~/source/onnx-go/examples/model_zoo_executor # go env
GOARCH="arm64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/root/go"
GOPROXY=""
GORACE=""
GOROOT="/root/source/golang/go"
GOTMPDIR=""
GOTOOLDIR="/root/source/golang/go/pkg/tool/linux_arm64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/root/source/onnx-go/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build787647331=/tmp/go-build -gno-record-gcc-switches"

@owulveryck, the benchmark you asked for:

// old is without change, new is commenting the lock
pi64 ~ # benchcmp old.txt new.txt
benchmark          old ns/op       new ns/op       delta
BenchmarkRun-4     21771941953     21652831591     -0.55%

It's a Raspberry Pi 3 Model B+ running Gentoo 64-bit.

owulveryck commented on July 22, 2024

Hi!
I see two possibilities:

  • gonum's ARM implementation may be slow; I don't know whether their BLAS library is optimized for ARM. I know we can change the BLAS implementation at the level of Gorgonia, but I don't know how to do it properly. Maybe @chewxy can help us.
  • The Pi's small amount of memory puts some pressure on the GC.

If you want to profile, you can use the model_zoo_executor, and run

MODELDIR=/path/to/tiny-yolo go test -bench=. -benchmem -memprofile memprofile.out -cpuprofile profile.out -run=NONE

and read the files with go tool pprof on your mac.

owulveryck commented on July 22, 2024

Last time I ran a profile on the model, I saw that the bottleneck was the general-purpose matrix multiplication. There is no magic here; this operation is time-consuming.
I don't understand how other implementations are 100 times faster (and I know they are).

diegobernardes commented on July 22, 2024

Profile files: profile.zip

Memory profile: (image: profile002)

CPU profile: (image: profile001)

blackrez commented on July 22, 2024

Last time I run a profile on the model, I saw that the bottleneck was the general-purpose matrix multiplication. There is no magic in here, this operation is time-consuming.
I don't understand how other implementations are 100 times faster (and I know they are).

They use NEON instructions (https://en.wikipedia.org/wiki/ARM_architecture#Advanced_SIMD_(NEON)); I will open an issue in Gorgonia to see how we can add these instructions.

diegobernardes commented on July 22, 2024

Thanks @blackrez. If possible, post the link here after you create the issue; I would like to help.

diegobernardes commented on July 22, 2024

Really nice! I didn't know that we were so close to a C implementation.

diegobernardes commented on July 22, 2024

I saw a Darknet presentation at TED (https://www.ted.com/talks/joseph_redmon_how_a_computer_learns_to_recognize_objects_instantly) and want to know how they can do real-time detection when your test took 2s+ per frame.

I tried gofaces on my machine and got a time similar to yours, 1.86s. To do something like 60fps, we need to detect each frame in 16.6ms, so we just need to be 112x faster 😅. Could this be achieved using a GPU?

owulveryck commented on July 22, 2024

@diegobernardes a GPU is mandatory to reach such performance. When you profile the application on an Intel CPU, you can see that nearly all of the computation time is spent performing matrix multiplication. This operation is already coded in ASM within gonum's BLAS, so there is no easy way to optimize it further.

When I have time, I will try to build Gorgonia with CUDA. (Compiling is easy, but last time I tried, it was suffering from a bug, and I had no time to investigate it.)

Another thing I'd like to evaluate is the possibility of using WebAssembly with WebGL for matrix multiplication.
It could be a way to avoid the mess of CUDA;
the principle is "easy", but I really lack the skills, as I know nothing about WebGL. And it would probably not help with the problem we are facing on the Raspberry either :D

diegobernardes commented on July 22, 2024

FYI: https://github.com/doe300/VC4CL implements an OpenCL layer for the Raspberry Pi.
Maybe when Gorgonia has OpenCL support we can get better performance from it. We would still miss the NEON instructions, but it's a start 😅. Also, an OpenCL layer benefits everyone who doesn't have CUDA.

In this video the CEO of Idein shows the performance boost from using the GPU for deep learning applications. It's also nice to see the Pi Zero having almost the same performance as the Pi 3, because they have the same GPU. So we could deploy deep learning applications on the edge with a $10 machine.

owulveryck commented on July 22, 2024

A bit on the side of this issue, I'd like to bench the code on the Raspberry Pi after removing these lines from the file vm_tape.go in Gorgonia:

func (m *tapeMachine) RunAll() (err error) {
-   runtime.LockOSThread()
-   defer runtime.UnlockOSThread()
   defer m.DoWork()

   workAvailable := m.ExternMetadata.WorkAvailable()
   syncChan := m.ExternMetadata.Sync()
   errChan := make(chan error)
   doneChan := make(chan struct{})

In Gorgonia, the thread locking is mandatory for any call into cgo (CUDA or OpenBLAS, for example), but it also carries a performance penalty. On ARM we use a pure-Go implementation by default, so the lock is not useful there. On my Intel CPU, removing it increases performance by 2%, but I guess the gain will be bigger on ARM.

A benchcmp with and without the lock on the OS thread would be awesome.

My RPi is not usable right now; can anybody test this?

Note: if you download the tiny YOLO model from the zoo, you can easily run a bench with the examples/model_zoo_executor and run

MODELDIR=path/to/the/extracted/model go test -bench=.

blackrez commented on July 22, 2024

Here are the results:

With the locking

root@raspberrypi:/home/pi/go/src/github.com/owulveryck/onnx-go/examples/model_zoo_executor# MODELDIR=/tmp/onnx/tiny_yolov2 go test -tags=noasm -bench=.                   
goos: linux
goarch: arm
pkg: github.com/owulveryck/onnx-go/examples/model_zoo_executor
BenchmarkRun-4                 1        27418660316 ns/op
PASS
ok      github.com/owulveryck/onnx-go/examples/model_zoo_executor       55.683s

Without the locking

root@raspberrypi:/home/pi/go/src/github.com/owulveryck/onnx-go/examples/model_zoo_executor# MODELDIR=/tmp/onnx/tiny_yolov2 go test -tags=noasm -bench=.
goos: linux
goarch: arm
pkg: github.com/owulveryck/onnx-go/examples/model_zoo_executor
BenchmarkRun-4                 1        27723200546 ns/op
PASS
ok      github.com/owulveryck/onnx-go/examples/model_zoo_executor       55.976s

Bad news: the pure-Go implementation is very, very slow with or without the modification, because Go is very slow on ARMv6/7; most of the math implementation there is software-based and doesn't use hardware acceleration.

BUT the RPi can run ARMv8, where the math libraries use the hardware.
I haven't tried it, but I think an RPi with an ARMv8 Linux is one possible solution.

@owulveryck, I know you have ready-to-go EC2 Terraform; can you use A1 instances for testing?

blackrez commented on July 22, 2024

That's normal; if you don't have vfp, it's because it was removed in ARMv8 in favour of NEON instructions.

owulveryck commented on July 22, 2024

There is a new experimental VM in Gorgonia (https://github.com/gorgonia/gorgonia/tree/master/x/vm).
Is it possible to run it on a Pi to see how this VM behaves?

You need to tweak the graph.go file in gorgonnx like this:

diff --git a/backend/x/gorgonnx/graph.go b/backend/x/gorgonnx/graph.go
index 90b403d..5978433 100644
--- a/backend/x/gorgonnx/graph.go
+++ b/backend/x/gorgonnx/graph.go
@@ -7,6 +7,8 @@ import (
        "gonum.org/v1/gonum/graph"
        "gonum.org/v1/gonum/graph/simple"
        "gorgonia.org/gorgonia"
+
+       xvm "gorgonia.org/gorgonia/x/vm"
        "gorgonia.org/tensor"
 )

@@ -58,7 +60,8 @@ func (g *Graph) Run() error {
                }
        }
        if g.m == nil {
-               g.m = gorgonia.NewTapeMachine(g.exprgraph)
+               g.m = xvm.NewGoMachine(g.exprgraph)
        } else {
                g.m.Reset()
        }

blackrez commented on July 22, 2024

There are no more issues building on ARMv6 and ARM64. For the performance issues, I added this issue: #169
