
Comments (16)

lynic commented on June 16, 2024

I extracted the inputs and outputs from the 2nd conv node and ran some tests on gorgonia's conv node; it did not give the correct result, so I suspect a bug in gorgonia's conv layer.
@owulveryck @chewxy
To reproduce the tests, follow this README.
https://github.com/lynic/gorgonnx/tree/test_conv/eltest

Submitted a related issue to gorgonia: gorgonia/gorgonia#268
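
If it helps, here is a rough Go sketch of the kind of check I mean. It is not the actual test from the repository above: the shapes are what I assume for the 2nd conv node (input 1x8x14x14, weights 16x8x5x5), and the data slices are placeholders to be filled with the extracted tensors.

package main

import (
    "fmt"
    "log"

    "gorgonia.org/gorgonia"
    "gorgonia.org/tensor"
)

func main() {
    // Placeholders: in the real test these slices hold the tensors extracted
    // from the model and the reference output from onnxruntime (shapes are assumptions).
    inputData := make([]float32, 1*8*14*14)  // Pooling66 output
    weightData := make([]float32, 16*8*5*5)  // Parameter87
    expected := make([]float32, 1*16*14*14)  // reference output

    g := gorgonia.NewGraph()
    in := gorgonia.NodeFromAny(g,
        tensor.New(tensor.WithShape(1, 8, 14, 14), tensor.WithBacking(inputData)),
        gorgonia.WithName("Pooling66_Output_0"))
    w := gorgonia.NodeFromAny(g,
        tensor.New(tensor.WithShape(16, 8, 5, 5), tensor.WithBacking(weightData)),
        gorgonia.WithName("Parameter87"))

    // Same attributes as the ONNX node: 5x5 kernel, padding 2, stride 1, dilation 1.
    out, err := gorgonia.Conv2d(in, w, tensor.Shape{5, 5}, []int{2, 2}, []int{1, 1}, []int{1, 1})
    if err != nil {
        log.Fatal(err)
    }

    vm := gorgonia.NewTapeMachine(g)
    if err := vm.RunAll(); err != nil {
        log.Fatal(err)
    }

    // Compare gorgonia's result with the reference output element by element.
    got := out.Value().Data().([]float32)
    for i := range expected {
        if d := got[i] - expected[i]; d > 1e-4 || d < -1e-4 {
            fmt.Printf("mismatch at %d: got %v, want %v\n", i, got[i], expected[i])
        }
    }
}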


owulveryck commented on June 16, 2024

First of all: many thanks for your help.

I have just tested it with a fresh and empty GOPATH; this seems to work:

GOPATH=/tmp go get -v github.com/owulveryck/onnx-go
cd /tmp/src/github.com/owulveryck/onnx-go
git checkout b7af8bd
cd example/gorgonia
GOPATH=/tmp go get -v ./...
GOPATH=/tmp go run mnist.go


owulveryck commented on June 16, 2024

This is weird: just by looking at the test file on line 463, it looks like the add operator is a no-op.

I have placed a couple of markers (log.Println-based debugging). The Add operation calls Add from the tensor package.
The inputs of the tensor.Add operation are:

[  55.45495   984.50616  -1191.5568  -652.15924  ...   -303.621   952.82043  -233.81728    -672.868]

R[-0.044856027   0.007791661    0.06810082    0.02999374  ... -0.055284902  -0.049383815    0.08432205  -0.054540414]

and the output value is:

[55.41009 984.514 -1191.4886 -652.1293 802.4857 497.57553 -303.6763 952.77106 -233.73296 -672.92255]

which is correct (55.45495 - 0.044856027 = 55.41009 ...).

In the graph, the value carried by the first input of the operator is:

[55.41009 984.514 -1191.4886 -652.1293 802.4857 497.57553 -303.6763 952.77106 -233.73296 -672.92255]

Maybe a problem with pointers somewhere....
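
For the record, here is a minimal Go sketch that reproduces the same elementwise check directly against the tensor package (using only the first four values printed above; the real operation of course runs on the full tensors):

package main

import (
    "fmt"
    "log"

    "gorgonia.org/tensor"
)

func main() {
    // First four values of each operand, as printed above.
    a := tensor.New(tensor.WithBacking([]float32{55.45495, 984.50616, -1191.5568, -652.15924}))
    b := tensor.New(tensor.WithBacking([]float32{-0.044856027, 0.007791661, 0.06810082, 0.02999374}))

    sum, err := tensor.Add(a, b)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(sum) // expected: [55.41009  984.514  -1191.4886  -652.1293]
}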


blackrez commented on June 16, 2024

Hello,

I wanted to help, but I encountered a bug while reproducing the steps to debug with delve (https://github.com/go-delve/delve):

➜  gorgonia git:(b7af8bd) go run mnist.go
# github.com/owulveryck/onnx-go/example/gorgonia/vendor/gorgonia.org/gorgonia/debugger/dot
vendor/gorgonia.org/gorgonia/debugger/dot/encode.go:15:25: not enough arguments in call to dot.Marshal
	have (graph.Graph, string, string, string)
	want (graph.Graph, string, string, string, bool)
# github.com/owulveryck/onnx-go/internal/pb-onnx
../../internal/pb-onnx/onnx.proto3.pb.go:22:11: undefined: proto.ProtoPackageIsVersion3


blackrez commented on June 16, 2024

I confirm, my GOPATH was the root cause of my issue.


chewxy commented on June 16, 2024

From what I could tell, the Add operation is performing correctly. It would appear that the numpy generation is not.

I re-added the VMOpt to gorgonia/onnx/machine.go (essentially a redirect to engine.VMOpt), then I augmented the machine in mnist.go with the correct VMOpts:

machine := gorgonnx.NewTapeMachine(graph,
    gorgonnx.WithLogger(log.New(os.Stderr, "", 0)), // log execution
    gorgonnx.WithWatchlist(),                       // watch all nodes
    gorgonnx.WithValueFmt("%#1.6f"))                // log values with this format

This is the result, which looks correct to me:

PC 49
Executing + false	[CPU32 CPU8]	CPU32	false	true	false. Node is: 14
	Inputs:
		R[   55.454948    984.506165  -1191.556763   -652.159241    802.612122    497.435303   -303.621002    952.820435   -233.817276   -672.867981]
		R[-0.044856   0.007792   0.068101   0.029994  -0.126410   0.140219  -0.055285  -0.049384   0.084322  -0.054540]
	Result:
		R[   55.410091    984.513977  -1191.488647   -652.129272    802.485718    497.575531   -303.676300    952.771057   -233.732956   -672.922546]
	Written To: CPU32
		R[   55.410091    984.513977  -1191.488647   -652.129272    802.485718    497.575531   -303.676300    952.771057   -233.732956   -672.922546]

Important to note is the instruction itself:

Executing + false	[CPU32 CPU8]	CPU32	false	true	false. Node is: 14

It reads from CPU32 and CPU8, and then overwrites CPU32.

I have no idea how the mnist test numbers are generated.


chewxy commented on June 16, 2024

I may be going down the wrong path, let me know.


owulveryck commented on June 16, 2024

I think that you are on the right path, but I am not :D
Thank you for your help.
The Add operation is actually OK, which is consistent: in the previous Gorgonnx implementation, I wrote functional tests for every operator and they all passed.

To generate the numpy files, what I do is (see the sketch after the list):

  • running the graph;
  • looping over all the nodes;
  • extracting the Value().Data().
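
Roughly, as a sketch on a toy graph (the real code of course runs on the MNIST graph):

package main

import (
    "fmt"
    "log"

    "gorgonia.org/gorgonia"
    "gorgonia.org/tensor"
)

func main() {
    g := gorgonia.NewGraph()
    a := gorgonia.NodeFromAny(g, tensor.New(tensor.WithBacking([]float32{1, 2, 3})), gorgonia.WithName("a"))
    b := gorgonia.NodeFromAny(g, tensor.New(tensor.WithBacking([]float32{10, 20, 30})), gorgonia.WithName("b"))
    if _, err := gorgonia.Add(a, b); err != nil {
        log.Fatal(err)
    }

    vm := gorgonia.NewTapeMachine(g)
    if err := vm.RunAll(); err != nil {
        log.Fatal(err)
    }

    // Extraction step: loop over every node and dump Value().Data(); this is
    // what ends up in the numpy test files. The catch is that, on the tape
    // machine, registers are reused, so a node's Value read after RunAll may
    // already have been overwritten by a later instruction.
    for _, n := range g.AllNodes() {
        if n.Value() == nil {
            continue
        }
        fmt.Printf("%s %v -> %v\n", n.Name(), n.Shape(), n.Value().Data())
    }
}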

From your analysis, I see that some registers are overwritten.
Question for you @chewxy: can this have an impact on the underlying backing value? Maybe one of the registers is wrongly overwritten at some point. Can we avoid this behavior for testing purposes (by playing with isPointer or anything else)?

I have two possibilities to continue the debugging process now:

  • to extend the debug process and use the VMOpts to investigate further (and maybe couple it with the old work on stepped execution);
  • to build a new MNIST graph directly in Gorgonia (based on the model I have extracted), set the weights' values with the data from the initializers, then play with the model (mostly with the shape of the tensors to avoid broadcasting) and see if I can get the correct result.


lynic commented on June 16, 2024

Hi, I would like to help, but I'm new to this project. My thought was to check the outputs node by node and see the difference between gorgonia and onnxruntime. Below is the script I used to check the output of the "Convolution28" node. But I still don't know how to check the output of the same node in onnx-go.

import onnx
import os
import glob
import onnxruntime as onnxrt
import onnxruntime.backend as backend 
# import onnx_tf.backend as backend
# import caffe2.python.onnx.backend as backend
# import cntk as C
import numpy as np

from onnx import numpy_helper, helper
from onnx import TensorProto

print("after import")
model = onnx.load('mnist/model.onnx')
onnx.checker.check_model(model)
test_data_dir = 'mnist/test_data_set_0'

# Load inputs
inputs = []
inputs_num = len(glob.glob(os.path.join(test_data_dir, 'input_*.pb')))
for i in range(inputs_num):
    input_file = os.path.join(test_data_dir, 'input_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(input_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    inputs.append(numpy_helper.to_array(tensor))

# Load reference outputs
ref_outputs = []
ref_outputs_num = len(glob.glob(os.path.join(test_data_dir, 'output_*.pb')))
for i in range(ref_outputs_num):
    output_file = os.path.join(test_data_dir, 'output_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(output_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    ref_outputs.append(numpy_helper.to_array(tensor))
# print(ref_outputs)


mg = model.graph
n1 = mg.node[1]
# n1.attribute[2].s = "NOTSET".encode()
# inp0 = mg.input[0]
# inp1 = mg.input[1]
# out0 = helper.make_tensor_value_info('Convolution28_Output_0', TensorProto.FLOAT, [1])
graph_def = helper.make_graph(
    [
        n1,
    ],
    "MLP",
    [
        helper.make_tensor_value_info('Input3', TensorProto.FLOAT, [1,1,28,28]),
        helper.make_tensor_value_info('Parameter5', TensorProto.FLOAT, [8,1,5,5]),
        # inp0,
        # inputs,
        # inp1,
    ],
    [
        helper.make_tensor_value_info('Convolution28_Output_0', TensorProto.FLOAT, [1]),
    ]
)
graph_def.initializer.extend([mg.initializer[2]])
import ipdb; ipdb.set_trace()
model_def = helper.make_model(graph_def, producer_name='onnx-example')
onnx.checker.check_model(model_def)
pm = backend.prepare(model_def)
outs = list(pm.run(inputs))
oo = np.asarray(outs[0])
print(oo[0][0])
# for ref_o, o in zip(ref_outputs, outs):
#     np.testing.assert_almost_equal(ref_o, o)

# ro = onnxrt.RunOptions()
# ro.run_log_verbosity_level = 1
# ro.run_tag = "testtag123"

# import ipdb; ipdb.set_trace()
# model.graph.node[1].attribute[2].s = "NOTSET".encode()
# Run the model on the backend
# prep_model = backend.prepare(model, session_log_verbosity_level=1)
# outputs = list(prep_model.run(inputs, run_options=ro))
prep_model = backend.prepare(model)
outputs = list(prep_model.run(inputs))
# outputs = list(backend.run(model, inputs, run_log_verbosity_level=1))
# print(outputs)
# import ipdb; ipdb.set_trace()
# Compare the results with reference outputs.
for ref_o, o in zip(ref_outputs, outputs):
    np.testing.assert_almost_equal(ref_o, o)

You can run this script in my pre-built docker image "elynn/onnxrt:latest".


owulveryck commented on June 16, 2024

Thanks @lynic for your help.

With the help of @chewxy, I have realized that my test files were not OK. I was extracting the values from the nodes after execution on the tape machine, but some of those values are in fact incorrect due to optimization (some nodes can have their values overwritten).

I have started a new branch to track this issue.
So far, I have inserted a channel inside the tapeMachine, so I can grab the instructions and the associated tensors at runtime.
The code here demonstrates how to get the values (forgive its ugliness; it's 30 minutes of work between two meetings).

With this code I am sure that I can get the exact values. I will try to analyze them manually, or take inspiration from your python code to generate a test file for every operation. I may then see whether one operator is behaving badly (I have some doubts about the Maxpool operator).
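
Side note: gorgonia.Read might be another way to grab values at runtime without patching the VM; it adds a node that extracts the watched node's value into a separate gorgonia.Value during execution. A sketch on a toy graph (it would still need checking whether Read copies or merely aliases the backing tensor, given the register reuse discussed above):

package main

import (
    "fmt"
    "log"

    "gorgonia.org/gorgonia"
    "gorgonia.org/tensor"
)

func main() {
    g := gorgonia.NewGraph()
    a := gorgonia.NodeFromAny(g, tensor.New(tensor.WithBacking([]float32{1, 2, 3})), gorgonia.WithName("a"))
    b := gorgonia.NodeFromAny(g, tensor.New(tensor.WithBacking([]float32{10, 20, 30})), gorgonia.WithName("b"))
    sum, err := gorgonia.Add(a, b)
    if err != nil {
        log.Fatal(err)
    }

    // Read must be called before the tape machine is built, since it adds a
    // node to the graph; sumVal is filled in at execution time.
    var sumVal gorgonia.Value
    gorgonia.Read(sum, &sumVal)

    vm := gorgonia.NewTapeMachine(g)
    if err := vm.RunAll(); err != nil {
        log.Fatal(err)
    }
    fmt.Println(sumVal) // [11 22 33]
}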


owulveryck commented on June 16, 2024

I have generated a sort of sequence graph of the register usage in the tapeMachine of gorgonia.
I was looking for something weird, such as a register that could have been wrongly overwritten,
but I didn't find anything strange in the graph.
I copy/paste the graph here only for the record; on each edge, the (number) is the execution order.
[register usage graph]


lynic commented on June 16, 2024

[screenshot: node-by-node output comparison]
I compared the outputs node by node and found that the output from the 2nd conv node is not correct. That's weird, since the conv node passes my tests; the "maxpool" node before the 2nd "conv" node actually gave the correct matrix.


import onnx
import os
import glob
import onnxruntime as onnxrt
import onnxruntime.backend as backend 
# import onnx_tf.backend as backend
# import caffe2.python.onnx.backend as backend
# import cntk as C
import numpy as np

from onnx import numpy_helper, helper
from onnx import TensorProto

print("after import")
model = onnx.load('mnist/model.onnx')
onnx.checker.check_model(model)
test_data_dir = 'mnist/test_data_set_1'

# Load inputs
inputs = []
inputs_num = len(glob.glob(os.path.join(test_data_dir, 'input_*.pb')))
for i in range(inputs_num):
    input_file = os.path.join(test_data_dir, 'input_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(input_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    inputs.append(numpy_helper.to_array(tensor))

# Load reference outputs
ref_outputs = []
ref_outputs_num = len(glob.glob(os.path.join(test_data_dir, 'output_*.pb')))
for i in range(ref_outputs_num):
    output_file = os.path.join(test_data_dir, 'output_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(output_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    ref_outputs.append(numpy_helper.to_array(tensor))
# print(ref_outputs)


mg = model.graph
n1 = mg.node[1]
# n1.attribute[2].s = "NOTSET".encode()
# inp0 = mg.input[0]
# inp1 = mg.input[1]
# out0 = helper.make_tensor_value_info('Convolution28_Output_0', TensorProto.FLOAT, [1])
graph_def = helper.make_graph(
    [
        # mg.node[1],  # Convolution28
        onnx.helper.make_node(
            "Conv",
            name='Convolution28',
            inputs=['Input3', 'Parameter5'],
            outputs=['Convolution28_Output_0'],
            kernel_shape=[5, 5],
            pads=[2, 2, 2, 2],
            # auto_pad="SAME_UPPER",
            strides=[1, 1],  # Default values for other attributes: dilations=[1, 1], groups=1
            group=1,
            dilations=[1,1],
        ), # Convolution28
        mg.node[2],  # Plus30
        mg.node[3], # ReLU32
        mg.node[4], # Pooling66
        # mg.node[5], # Convolution110
        onnx.helper.make_node(
            "Conv",
            name='Convolution110',
            inputs=['Pooling66_Output_0', 'Parameter87'],
            outputs=['Convolution110_Output_0'],
            kernel_shape=[5, 5],
            pads=[2, 2, 2, 2],
            # auto_pad="SAME_UPPER",
            strides=[1, 1],  # Default values for other attributes: dilations=[1, 1], groups=1
            group=1,
            dilations=[1,1],
        ), # Convolution110
    ],
    "test_mnist",
    [
        mg.input[0], # Input3
        mg.input[1], # Parameter5
        mg.input[2], # Parameter6
        mg.input[3], # Parameter87
    ],
    [
        # mg.value_info[1], # Convolution28_Output_0
        # mg.value_info[2], # Plus30_Output_0
        # mg.value_info[3], # ReLU32_Output_0
        # mg.value_info[4], # Pooling66_Output_0
        mg.value_info[5], # Convolution110_Output_0
    ],
    initializer = [
        mg.initializer[2], # Parameter5
        mg.initializer[3], # Parameter6
        mg.initializer[1], # Parameter87
    ],
    value_info= [
        mg.value_info[1], # Convolution28_Output_0
        mg.value_info[2], # Plus30_Output_0
        mg.value_info[3], # ReLU32_Output_0
        mg.value_info[4], # Pooling66_Output_0
    ]


)
import ipdb; ipdb.set_trace()
model_def = helper.make_model(graph_def, producer_name='onnx-example')
onnx.checker.check_model(model_def)
pm = backend.prepare(model_def)
outs = list(pm.run(inputs))
oo = np.asarray(outs[0])
print(oo[0][0])


chewxy commented on June 16, 2024

My aim for today is to fix the conv and maxpool - will be available on Slack in 4 hrs to discuss


lynic commented on June 16, 2024

My aim for today is to fix the conv and maxpool - will be available on Slack in 4 hrs to discuss

Hi @chewxy, if you need any help from me, I would like to join the discussion. Which Slack channel are you in?


chewxy commented on June 16, 2024


owulveryck commented on June 16, 2024

The graph-builder branch has been updated.
The vendored version of Gorgonia now uses the code from @lynic referenced in this PR.
The execution now gives a correct result:

$ go run mnist.go
2019/02/25 09:12:02 [5041.889 -3568.877 -187.82419 -1685.7964 -1183.323 -614.4293 892.66394 -373.65866 -290.2622 -111.176735]

Many thanks to all of you for your help.

