
Comments (16)

lynic commented on June 16, 2024

I extracted the inputs and outputs from the 2nd conv node and ran some tests on gorgonia's conv node; it did not give the correct result, so I suspect a bug in gorgonia's conv layer.
@owulveryck @chewxy
To reproduce the tests, follow this README.
https://github.com/lynic/gorgonnx/tree/test_conv/eltest

Submitted a related issue to gorgonia: gorgonia/gorgonia#268
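
If it helps, here is a rough Go sketch of the kind of check I mean. It is not the actual test from the repository above: the shapes are what I assume for the 2nd conv node (input 1x8x14x14, weights 16x8x5x5), and the data slices are placeholders to be filled with the extracted tensors.

package main

import (
    "fmt"
    "log"

    "gorgonia.org/gorgonia"
    "gorgonia.org/tensor"
)

func main() {
    // Placeholders: in the real test these slices hold the tensors extracted
    // from the model and the reference output from onnxruntime (shapes are assumptions).
    inputData := make([]float32, 1*8*14*14)  // Pooling66 output
    weightData := make([]float32, 16*8*5*5)  // Parameter87
    expected := make([]float32, 1*16*14*14)  // reference output

    g := gorgonia.NewGraph()
    in := gorgonia.NodeFromAny(g,
        tensor.New(tensor.WithShape(1, 8, 14, 14), tensor.WithBacking(inputData)),
        gorgonia.WithName("Pooling66_Output_0"))
    w := gorgonia.NodeFromAny(g,
        tensor.New(tensor.WithShape(16, 8, 5, 5), tensor.WithBacking(weightData)),
        gorgonia.WithName("Parameter87"))

    // Same attributes as the ONNX node: 5x5 kernel, padding 2, stride 1, dilation 1.
    out, err := gorgonia.Conv2d(in, w, tensor.Shape{5, 5}, []int{2, 2}, []int{1, 1}, []int{1, 1})
    if err != nil {
        log.Fatal(err)
    }

    vm := gorgonia.NewTapeMachine(g)
    if err := vm.RunAll(); err != nil {
        log.Fatal(err)
    }

    // Compare gorgonia's result with the reference output element by element.
    got := out.Value().Data().([]float32)
    for i := range expected {
        if d := got[i] - expected[i]; d > 1e-4 || d < -1e-4 {
            fmt.Printf("mismatch at %d: got %v, want %v\n", i, got[i], expected[i])
        }
    }
}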


owulveryck commented on June 16, 2024

First of all: many thanks for your help.

I have just tested it with a fresh and empty GOPATH; this seems to work:

GOPATH=/tmp go get -v github.com/owulveryck/onnx-go
cd /tmp/src/github.com/owulveryck/onnx-go
git checkout b7af8bd
cd example/gorgonia
GOPATH=/tmp go get -v ./...
GOPATH=/tmp go run mnist.go


owulveryck commented on June 16, 2024

This is weird: just by looking at the test file on line 463, it looks like the add operator is a no-op.

I have placed a couple of markers (log.Println-based debugging). The Add operation calls Add from the tensor package.
The inputs of the tensor.Add operation are:

[  55.45495   984.50616  -1191.5568  -652.15924  ...   -303.621   952.82043  -233.81728    -672.868]

R[-0.044856027   0.007791661    0.06810082    0.02999374  ... -0.055284902  -0.049383815    0.08432205  -0.054540414]

and the output value is:

[55.41009 984.514 -1191.4886 -652.1293 802.4857 497.57553 -303.6763 952.77106 -233.73296 -672.92255]

which is correct (55.45495 - 0.044856027 = 55.41009 ...).

In the graph, the value carried by the first input of the operator is:

[55.41009 984.514 -1191.4886 -652.1293 802.4857 497.57553 -303.6763 952.77106 -233.73296 -672.92255]

Maybe a problem with pointers somewhere....
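
For the record, here is a minimal Go sketch that reproduces the same elementwise check directly against the tensor package (using only the first four values printed above; the real operation of course runs on the full tensors):

package main

import (
    "fmt"
    "log"

    "gorgonia.org/tensor"
)

func main() {
    // First four values of each operand, as printed above.
    a := tensor.New(tensor.WithBacking([]float32{55.45495, 984.50616, -1191.5568, -652.15924}))
    b := tensor.New(tensor.WithBacking([]float32{-0.044856027, 0.007791661, 0.06810082, 0.02999374}))

    sum, err := tensor.Add(a, b)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(sum) // expected: [55.41009  984.514  -1191.4886  -652.1293]
}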


blackrez commented on June 16, 2024

Hello,

I wanted to help, but I encountered a bug while reproducing the steps to debug with delve (https://github.com/go-delve/delve):

➜  gorgonia git:(b7af8bd) go run mnist.go
# github.com/owulveryck/onnx-go/example/gorgonia/vendor/gorgonia.org/gorgonia/debugger/dot
vendor/gorgonia.org/gorgonia/debugger/dot/encode.go:15:25: not enough arguments in call to dot.Marshal
	have (graph.Graph, string, string, string)
	want (graph.Graph, string, string, string, bool)
# github.com/owulveryck/onnx-go/internal/pb-onnx
../../internal/pb-onnx/onnx.proto3.pb.go:22:11: undefined: proto.ProtoPackageIsVersion3


blackrez commented on June 16, 2024

I confirm, my GOPATH was the root cause of my issue.


chewxy commented on June 16, 2024

From what I could tell, the Add operation is performing correctly. It would appear that the numpy generation is not.

I re-added the VMOpt to gorgonia/onnx/machine.go (essentially a redirect to engine.VMOpt), then I augmented the machine in mnist.go with the correct VMOpts:

machine := gorgonnx.NewTapeMachine(graph,
    gorgonnx.WithLogger(log.New(os.Stderr, "", 0)), // log execution
    gorgonnx.WithWatchlist(),                       // watch all nodes
    gorgonnx.WithValueFmt("%#1.6f"))                // log values with this format

This is the result, which looks correct to me:

PC 49
Executing + false	[CPU32 CPU8]	CPU32	false	true	false. Node is: 14
	Inputs:
		R[   55.454948    984.506165  -1191.556763   -652.159241    802.612122    497.435303   -303.621002    952.820435   -233.817276   -672.867981]
		R[-0.044856   0.007792   0.068101   0.029994  -0.126410   0.140219  -0.055285  -0.049384   0.084322  -0.054540]
	Result:
		R[   55.410091    984.513977  -1191.488647   -652.129272    802.485718    497.575531   -303.676300    952.771057   -233.732956   -672.922546]
	Written To: CPU32
		R[   55.410091    984.513977  -1191.488647   -652.129272    802.485718    497.575531   -303.676300    952.771057   -233.732956   -672.922546]

Important to note is the instruction itself:

Executing + false	[CPU32 CPU8]	CPU32	false	true	false. Node is: 14

It reads from CPU32 and CPU8, and then overwrites CPU32.

I have no idea how the mnist test numbers are generated.


chewxy commented on June 16, 2024

I may be going down the wrong path, let me know.


owulveryck commented on June 16, 2024

I think that you are on the right path, but I am not :D
Thank you for your help.
The Add operation is actually OK, which is consistent: in the previous Gorgonnx implementation, I wrote functional tests for every operator and they all passed.

To generate the numpy files, what I do is (see the sketch after the list):

  • running the graph;
  • looping over all the nodes;
  • extracting the Value().Data().
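
Roughly, as a sketch on a toy graph (the real code of course runs on the MNIST graph):

package main

import (
    "fmt"
    "log"

    "gorgonia.org/gorgonia"
    "gorgonia.org/tensor"
)

func main() {
    g := gorgonia.NewGraph()
    a := gorgonia.NodeFromAny(g, tensor.New(tensor.WithBacking([]float32{1, 2, 3})), gorgonia.WithName("a"))
    b := gorgonia.NodeFromAny(g, tensor.New(tensor.WithBacking([]float32{10, 20, 30})), gorgonia.WithName("b"))
    if _, err := gorgonia.Add(a, b); err != nil {
        log.Fatal(err)
    }

    vm := gorgonia.NewTapeMachine(g)
    if err := vm.RunAll(); err != nil {
        log.Fatal(err)
    }

    // Extraction step: loop over every node and dump Value().Data(); this is
    // what ends up in the numpy test files. The catch is that, on the tape
    // machine, registers are reused, so a node's Value read after RunAll may
    // already have been overwritten by a later instruction.
    for _, n := range g.AllNodes() {
        if n.Value() == nil {
            continue
        }
        fmt.Printf("%s %v -> %v\n", n.Name(), n.Shape(), n.Value().Data())
    }
}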

From your analysis, I see that some registers are overwritten.
Question for you @chewxy: can this have an impact on the underlying backing value? Maybe one of the registers is wrongly overwritten at some point. Can we avoid this behavior for testing purposes (by playing with isPointer or anything else)?

I have two possibilities to continue the debugging process now:

  • to extend the debug process and use the VMOpts to investigate further (and maybe couple it with the old work on stepped execution);
  • to build a new MNIST graph directly in Gorgonia (based on the model I have extracted), set the weights' values with the data from the initializers, then play with the model (mostly with the shape of the tensors to avoid broadcasting) and see if I can get the correct result.


lynic commented on June 16, 2024

Hi, I would like to help, but I'm new to this project. My thought was to check the outputs node by node and see the difference between gorgonia and onnxruntime. Below is the script I used to check the output of the "Convolution28" node. But I still don't know how to check the output of the same node in onnx-go.

import onnx
import os
import glob
import onnxruntime as onnxrt
import onnxruntime.backend as backend 
# import onnx_tf.backend as backend
# import caffe2.python.onnx.backend as backend
# import cntk as C
import numpy as np

from onnx import numpy_helper, helper
from onnx import TensorProto

print("after import")
model = onnx.load('mnist/model.onnx')
onnx.checker.check_model(model)
test_data_dir = 'mnist/test_data_set_0'

# Load inputs
inputs = []
inputs_num = len(glob.glob(os.path.join(test_data_dir, 'input_*.pb')))
for i in range(inputs_num):
    input_file = os.path.join(test_data_dir, 'input_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(input_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    inputs.append(numpy_helper.to_array(tensor))

# Load reference outputs
ref_outputs = []
ref_outputs_num = len(glob.glob(os.path.join(test_data_dir, 'output_*.pb')))
for i in range(ref_outputs_num):
    output_file = os.path.join(test_data_dir, 'output_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(output_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    ref_outputs.append(numpy_helper.to_array(tensor))
# print(ref_outputs)


mg = model.graph
n1 = mg.node[1]
# n1.attribute[2].s = "NOTSET".encode()
# inp0 = mg.input[0]
# inp1 = mg.input[1]
# out0 = helper.make_tensor_value_info('Convolution28_Output_0', TensorProto.FLOAT, [1])
graph_def = helper.make_graph(
    [
        n1,
    ],
    "MLP",
    [
        helper.make_tensor_value_info('Input3', TensorProto.FLOAT, [1,1,28,28]),
        helper.make_tensor_value_info('Parameter5', TensorProto.FLOAT, [8,1,5,5]),
        # inp0,
        # inputs,
        # inp1,
    ],
    [
        helper.make_tensor_value_info('Convolution28_Output_0', TensorProto.FLOAT, [1]),
    ]
)
graph_def.initializer.extend([mg.initializer[2]])
import ipdb; ipdb.set_trace()
model_def = helper.make_model(graph_def, producer_name='onnx-example')
onnx.checker.check_model(model_def)
pm = backend.prepare(model_def)
outs = list(pm.run(inputs))
oo = np.asarray(outs[0])
print(oo[0][0])
# for ref_o, o in zip(ref_outputs, outs):
#     np.testing.assert_almost_equal(ref_o, o)

# ro = onnxrt.RunOptions()
# ro.run_log_verbosity_level = 1
# ro.run_tag = "testtag123"

# import ipdb; ipdb.set_trace()
# model.graph.node[1].attribute[2].s = "NOTSET".encode()
# Run the model on the backend
# prep_model = backend.prepare(model, session_log_verbosity_level=1)
# outputs = list(prep_model.run(inputs, run_options=ro))
prep_model = backend.prepare(model)
outputs = list(prep_model.run(inputs))
# outputs = list(backend.run(model, inputs, run_log_verbosity_level=1))
# print(outputs)
# import ipdb; ipdb.set_trace()
# Compare the results with reference outputs.
for ref_o, o in zip(ref_outputs, outputs):
    np.testing.assert_almost_equal(ref_o, o)

You can run this script in my pre-built docker image "elynn/onnxrt:latest".


owulveryck commented on June 16, 2024

Thanks @lynic for your help.

With the help of @chewxy, I have realized that my test files were not OK. I was extracting the values from the nodes after execution on the tape machine, but some of those values are in fact incorrect due to optimization (some nodes can have their values overwritten).

I have started a new branch to track this issue.
So far, I have inserted a channel inside the tapeMachine, so I can grab the instructions and the associated tensors at runtime.
The code here demonstrates how to get the values (forgive its ugliness; it's 30 minutes of work between two meetings).

With this code I am sure that I can get the exact values. I will try to analyze them manually, or take inspiration from your python code to generate a test file for every operation. I may then see whether one operator is behaving badly (I have some doubts about the Maxpool operator).
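
Side note: gorgonia.Read might be another way to grab values at runtime without patching the VM; it adds a node that extracts the watched node's value into a separate gorgonia.Value during execution. A sketch on a toy graph (it would still need checking whether Read copies or merely aliases the backing tensor, given the register reuse discussed above):

package main

import (
    "fmt"
    "log"

    "gorgonia.org/gorgonia"
    "gorgonia.org/tensor"
)

func main() {
    g := gorgonia.NewGraph()
    a := gorgonia.NodeFromAny(g, tensor.New(tensor.WithBacking([]float32{1, 2, 3})), gorgonia.WithName("a"))
    b := gorgonia.NodeFromAny(g, tensor.New(tensor.WithBacking([]float32{10, 20, 30})), gorgonia.WithName("b"))
    sum, err := gorgonia.Add(a, b)
    if err != nil {
        log.Fatal(err)
    }

    // Read must be called before the tape machine is built, since it adds a
    // node to the graph; sumVal is filled in at execution time.
    var sumVal gorgonia.Value
    gorgonia.Read(sum, &sumVal)

    vm := gorgonia.NewTapeMachine(g)
    if err := vm.RunAll(); err != nil {
        log.Fatal(err)
    }
    fmt.Println(sumVal) // [11 22 33]
}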


owulveryck commented on June 16, 2024

I have generated a sort of sequence graph of the register usage in the tapeMachine of gorgonia.
I was looking for something weird, such as a register that could have been wrongly overwritten,
but I didn't find anything strange in the graph.
I copy/paste the graph here only for the record; on each edge, the (number) is the execution order.
[register usage graph]


lynic commented on June 16, 2024

[screenshot: node-by-node output comparison]
I compared the outputs node by node and found that the output from the 2nd conv node is not correct. That's weird, since the conv node passes my tests; the "maxpool" node before the 2nd "conv" node actually gave the correct matrix.


import onnx
import os
import glob
import onnxruntime as onnxrt
import onnxruntime.backend as backend 
# import onnx_tf.backend as backend
# import caffe2.python.onnx.backend as backend
# import cntk as C
import numpy as np

from onnx import numpy_helper, helper
from onnx import TensorProto

print("after import")
model = onnx.load('mnist/model.onnx')
onnx.checker.check_model(model)
test_data_dir = 'mnist/test_data_set_1'

# Load inputs
inputs = []
inputs_num = len(glob.glob(os.path.join(test_data_dir, 'input_*.pb')))
for i in range(inputs_num):
    input_file = os.path.join(test_data_dir, 'input_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(input_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    inputs.append(numpy_helper.to_array(tensor))

# Load reference outputs
ref_outputs = []
ref_outputs_num = len(glob.glob(os.path.join(test_data_dir, 'output_*.pb')))
for i in range(ref_outputs_num):
    output_file = os.path.join(test_data_dir, 'output_{}.pb'.format(i))
    tensor = onnx.TensorProto()
    with open(output_file, 'rb') as f:
        tensor.ParseFromString(f.read())
    ref_outputs.append(numpy_helper.to_array(tensor))
# print(ref_outputs)


mg = model.graph
n1 = mg.node[1]
# n1.attribute[2].s = "NOTSET".encode()
# inp0 = mg.input[0]
# inp1 = mg.input[1]
# out0 = helper.make_tensor_value_info('Convolution28_Output_0', TensorProto.FLOAT, [1])
graph_def = helper.make_graph(
    [
        # mg.node[1],  # Convolution28
        onnx.helper.make_node(
            "Conv",
            name='Convolution28',
            inputs=['Input3', 'Parameter5'],
            outputs=['Convolution28_Output_0'],
            kernel_shape=[5, 5],
            pads=[2, 2, 2, 2],
            # auto_pad="SAME_UPPER",
            strides=[1, 1],  # Default values for other attributes: dilations=[1, 1], groups=1
            group=1,
            dilations=[1,1],
        ), # Convolution28
        mg.node[2],  # Plus30
        mg.node[3], # ReLU32
        mg.node[4], # Pooling66
        # mg.node[5], # Convolution110
        onnx.helper.make_node(
            "Conv",
            name='Convolution110',
            inputs=['Pooling66_Output_0', 'Parameter87'],
            outputs=['Convolution110_Output_0'],
            kernel_shape=[5, 5],
            pads=[2, 2, 2, 2],
            # auto_pad="SAME_UPPER",
            strides=[1, 1],  # Default values for other attributes: dilations=[1, 1], groups=1
            group=1,
            dilations=[1,1],
        ), # Convolution110
    ],
    "test_mnist",
    [
        mg.input[0], # Input3
        mg.input[1], # Parameter5
        mg.input[2], # Parameter6
        mg.input[3], # Parameter87
    ],
    [
        # mg.value_info[1], # Convolution28_Output_0
        # mg.value_info[2], # Plus30_Output_0
        # mg.value_info[3], # ReLU32_Output_0
        # mg.value_info[4], # Pooling66_Output_0
        mg.value_info[5], # Convolution110_Output_0
    ],
    initializer = [
        mg.initializer[2], # Parameter5
        mg.initializer[3], # Parameter6
        mg.initializer[1], # Parameter87
    ],
    value_info= [
        mg.value_info[1], # Convolution28_Output_0
        mg.value_info[2], # Plus30_Output_0
        mg.value_info[3], # ReLU32_Output_0
        mg.value_info[4], # Pooling66_Output_0
    ]


)
import ipdb; ipdb.set_trace()
model_def = helper.make_model(graph_def, producer_name='onnx-example')
onnx.checker.check_model(model_def)
pm = backend.prepare(model_def)
outs = list(pm.run(inputs))
oo = np.asarray(outs[0])
print(oo[0][0])


chewxy commented on June 16, 2024

My aim for today is to fix the conv and maxpool - will be available on Slack in 4 hrs to discuss


lynic commented on June 16, 2024

My aim for today is to fix the conv and maxpool - will be available on Slack in 4 hrs to discuss

Hi @chewxy, if you need any help from me, I would like to join the discussion. Which Slack channel are you in?


chewxy commented on June 16, 2024


owulveryck commented on June 16, 2024

The graph-builder branch has been updated.
The vendored version of Gorgonia now uses the code from @lynic referenced in this PR.
The execution now gives a correct result:

$ go run mnist.go
2019/02/25 09:12:02 [5041.889 -3568.877 -187.82419 -1685.7964 -1183.323 -614.4293 892.66394 -373.65866 -290.2622 -111.176735]

Many thanks to all of you for your help.

