Thanks for making Livebook! It's been great overall but sometimes the error reporting


Errors can fail to indicate which cell they came from

jyc commented on June 5, 2024

Errors can fail to indicate which cell they came from


Comments (8)

josevalim commented on June 5, 2024

It's true that it's not an entirely fair comparison because as you mention Erlang implements TCO. But as someone working on a production system I certainly prefer the stacktrace from Java, which tells me where in my code the error happened, to:

Good to know! The other difference here is that in a production system you will be running code from modules, not cell evaluation, so please check whether using modules changes the results, for a more apples-to-apples comparison, and let us know. I agree it is important that our stacktraces in concurrent production systems are meaningful. If you confirm that modules improve stacktraces, I will add that to our docs as an additional benefit of modules.

Meanwhile I am looking into whether we can improve the evaluation stacktrace, but it seems rather unlikely (without making evaluation more expensive).


josevalim commented on June 5, 2024

If you don't want to use _nolink everywhere, you can also call Process.flag(:trap_exit, true) in your notebook. That ensures crashes in linked processes do not crash the notebook runtime, but the side effect is that failures can go completely unnoticed, so you should tread carefully. :) I will push more docs soon.
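As a minimal sketch of what trapping exits changes (a standalone script rather than a notebook; the `:boom` reason and the one-second timeout are made up for illustration): with the flag set, a crash in a linked process arrives as an `{:EXIT, pid, reason}` message instead of taking the caller down. That is also why failures can go unnoticed unless something actually reads those messages.

```elixir
# With trap_exit enabled, exit signals from linked processes are converted
# into messages in this process's mailbox instead of terminating it.
Process.flag(:trap_exit, true)

# Hypothetical crashing worker, linked to the current process.
pid = spawn_link(fn -> exit(:boom) end)

outcome =
  receive do
    # The crash shows up as an ordinary message that we must handle ourselves.
    {:EXIT, ^pid, reason} -> {:linked_process_exited, reason}
  after
    1_000 -> :no_exit_message
  end

IO.inspect(outcome)
```

If nothing ever receives the `{:EXIT, ...}` message, the crash is silently dropped, which is the trade-off mentioned above.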


jyc commented on June 5, 2024

Was just able to repro this: scratch.livemd.zip

Scratch

Mix.install([
  {:kino, "~> 0.12.3"}
])

Section

Kino.start_child!({Task.Supervisor, name: MyApp.TaskSupervisor})

Task.Supervisor.async_stream_nolink(MyApp.TaskSupervisor, [nil], fn el ->
  el <> "string"
end)
|> Enum.to_list()


20:14:29.256 [error] Task #PID<0.224.0> started from #PID<0.152.0> terminating
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: not a bitstring

    :erlang.bit_size(nil)
    (stdlib 5.1.1) eval_bits.erl:114: :eval_bits.eval_exp_field1/9
Function: &:erlang.apply/2
    Args: [#Function<42.105768164/1 in :erl_eval.expr/6>, [nil]]

Note that the error does not reference any line numbers in my code, but that the error at least shows up below the relevant cell.

But if I change async_stream_nolink to async_stream, I now also have no way of knowing which cell the error occurred in! I only get a toast notification and all cells just say "Aborted."


josevalim commented on June 5, 2024

For the aborted case, do you see the error message or no error message at all? That’s something we could improve if you don’t see anything.

Other than that, this is expected. We don’t track under which cell a process was started. We could even attempt to, but I don’t think we would be able to retrieve this information when the process crashes.

Regarding the stacktrace, Erlang performs tail call optimization, so you can miss the source in cases like that.


jyc commented on June 5, 2024

Thanks for the reply!

For the aborted case, do you see the error message or no error message at all? That’s something we could improve if you don’t see anything.

There is an error message in a toast notification, but there is no error message in any cell's output :( This is critical because it means I now have no idea where the error happened: the toast is not associated with any cell. Unfortunately debugging is now harder than in a lower-level language like C++, where I can at least look at the stacktrace.

Other than that, this is expected. We don’t track under which cell a process was started. We could even attempt to, but I don’t think we would be able to retrieve this information when the process crashes.

I see, thanks. Honestly that is understandable but pretty unfortunate. I've worked with mostly JavaScript, Java, and C++, and the main thing that excited me about Elixir was actually better error handling. The error messages in Java suck, but the production instrumentation is great: for example, you can always run jstack and figure out what your program is actually doing and where an error came from. Elixir has much nicer error messages for things like match failures, though.

Regarding the stacktrace, Erlang performs tail call optimization, so you can miss the source in cases like that.

It makes sense that I wouldn't see the tail calls in the stack trace given TCO, but it's still surprising that the stacktrace is so short and doesn't involve any of my code at all. For instance:

:erlang.system_flag(:backtrace_depth, 1)

defmodule TCO do
  def tco(n) do
    if n > 10_000 do
      raise RuntimeError, message: "error!"
    end
    tco(n+1)
  end
end

defmodule Foo do
  def foo() do
    TCO.tco(1)
  end
end

Foo.foo()

outputs:

** (RuntimeError) error!
    #cell:zww6gl57kk56swz7:18: (file)

when :backtrace_depth is 1, and

** (RuntimeError) error!
    #cell:zww6gl57kk56swz7:6: TCO.tco/1
    #cell:zww6gl57kk56swz7:18: (file)

when :backtrace_depth is 5, which I'd expect due to TCO.

What surprises me about the Livebook I sent is that no #cell:...:line: (file) backtrace entry shows up, which unfortunately again makes debugging more difficult than in C++/Java.

Is there a way to figure out which cell caused the error in the Livebook I sent? Or would I just have to bisect my code?


josevalim commented on June 5, 2024

Oh, something that can help is that code in cells is evaluated, unless it is put into modules. If you define the task you are starting in a module, you may get better stacktraces and you will surely get better performance. Try this:

Task.async_stream(enumerable, &SomeMod.function(&1, additional, args))

For instance:

In your example above, Foo.foo() is not a tail call because Elixir automatically adds a call after it. :) Since it is a file, it is safe to do that, because you will either block or execute the whole file. But we can't do that for random processes starting code, as we would affect their semantics.

The whole issue is that once you start spawning processes, you have a concurrent system. You are comparing with C++/Java, but you should really compare with a concurrent C++/Java system running multiple threads (green or otherwise).
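To illustrate the tail-call point, here is a small sketch for a standalone script. The `Boom` and `Callers` modules and the `frames` helper are hypothetical names for this example only. A function that calls another in tail position should drop out of the stacktrace, while one that still has work to do after the call should stay in it.

```elixir
defmodule Boom do
  def fail!, do: raise("boom")
end

defmodule Callers do
  # Tail position: nothing is left to do after the call, so this frame
  # is not kept on the stack while Boom.fail!/0 runs.
  def tail, do: Boom.fail!()

  # Body position: the result is still needed afterwards, so this frame
  # stays on the stack and can appear in the stacktrace.
  def body do
    result = Boom.fail!()
    {:done, result}
  end
end

# Runs the given function and returns the {module, function, arity}
# entries of the stacktrace captured when it raises.
frames = fn fun ->
  try do
    fun.()
  rescue
    RuntimeError ->
      for {mod, name, arity, _location} <- __STACKTRACE__, do: {mod, name, arity}
  end
end

tail_frames = frames.(&Callers.tail/0)
body_frames = frames.(&Callers.body/0)

IO.inspect(tail_frames, label: "via tail call")
IO.inspect(body_frames, label: "via body call")
```

Comparing the two lists should show `Callers.body/0` in the second trace and no `Callers.tail/0` entry in the first, which is the kind of gap described in this thread.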


jyc commented on June 5, 2024

Oh, something that can help is that code in cells is evaluated, unless it is put into modules. If you define the task you are starting in a module, you may get better stacktraces and you will surely get better performance. Try this [...]

Interesting! Thanks for the tip, I will try this out. I also have changed my code to use _nolink just for the better error handling.

In your example above, Foo.foo() is not a tail call because Elixir automatically adds a call after it. :) Since it is a file, it is safe to do that, because you will either block or execute the whole file. But we can't do that for random processes starting code, as we would affect their semantics.

You are right, good point!

The whole issue is that once you start spawning processes, you have a concurrent system. You are comparing with C++/Java, but you should really compare with a concurrent C++/Java system running multiple threads (green or otherwise).

I agree that concurrent systems will always be harder to debug, but I still wish Elixir were a little nicer here! I am indeed comparing Elixir to a concurrent Java system. In particular, the job before my previous one was actually working on maps for a phone company where we used Spark. There is much that is unpleasant about Java, but for instance the stack trace here is great:

class Scratch {
  public static int foo(Object bar) {
    return 2 + (int) bar;
  }

  public static void main(String[] args) throws Exception {
    Thread thread = new Thread(() -> {
      System.out.println(Scratch.foo("not an int"));
    });
    thread.start();
    Thread.sleep(5000);
    thread.join();
  }
}

It tells me the exact line in my code where the exception was thrown:

$ java test.java
Exception in thread "Thread-0" java.lang.ClassCastException: class java.lang.String cannot be cast to class java.lang.Integer (java.lang.String and java.lang.Integer are in module java.base of loader 'bootstrap')
	at Scratch.foo(test.java:3)
	at Scratch.lambda$main$0(test.java:8)
	at java.base/java.lang.Thread.run(Thread.java:1583)

It's true that it's not an entirely fair comparison because as you mention Erlang implements TCO. But as someone working on a production system I certainly prefer the stacktrace from Java, which tells me where in my code the error happened, to:

20:14:29.256 [error] Task #PID<0.224.0> started from #PID<0.152.0> terminating
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: not a bitstring

    :erlang.bit_size(nil)
    (stdlib 5.1.1) eval_bits.erl:114: :eval_bits.eval_exp_field1/9
Function: &:erlang.apply/2
    Args: [#Function<42.105768164/1 in :erl_eval.expr/6>, [nil]]

... which gives zero information other than that some function somewhere is being called with something that's not a bitstring!

There are things that are much nicer about Elixir, e.g. match failures printing the value that failed to match. That avoids a whole class of annoying Java production errors, e.g.:

Exception in thread "Thread-0" java.lang.RuntimeException: some generic message about how an assertion failed
	at Scratch.lambda$main$0(test.java:8)
	at java.base/java.lang.Thread.run(Thread.java:1583)

... which is especially annoying when it happens in the middle of a long-running batch job. But between stack traces and error messages I do feel like stack traces are more important here.

Thanks again for the replies. Will try out modules to see if it helps. Don't want to be a downer, but it is a little sad that the Elixir stack traces here are less informative than in threaded Java code; I was excited after seeing the better match failure messages.


jyc commented on June 5, 2024

Thanks again for all the help! TL;DR: I think you're right that the stacktraces are much better with modules and in compiled code, which is awesome!

If you confirm that modules improve stacktraces, I will add that to our docs as an additional benefit of modules.

Modules do seem to improve the stacktrace!

defmodule Foo do
  def bar(x) do
    x <> "string"
  end
end

Task.Supervisor.async_stream_nolink(MyApp.TaskSupervisor, [nil], fn el ->
  Foo.bar(el)
end)
|> Enum.to_list()

yields:

01:58:06.212 [error] Task #PID<0.603.0> started from #PID<0.593.0> terminating
** (ArgumentError) construction of binary failed: segment 1 of type 'binary': expected a binary but got: nil
    #cell:3rotzbrltta2n3br:3: Foo.bar/1
    (elixir 1.15.7) src/elixir.erl:396: :elixir.eval_external_handler/3
    (elixir 1.15.7) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
Function: &:erlang.apply/2
    Args: [#Function<42.105768164/1 in :erl_eval.expr/6>, [nil]]

Which shows Foo.bar! If I run without _nolink, then the error message is still not associated with the cell, but at least it mentions Foo.bar now. So for now I will try to make sure I:

Use the _nolink versions of Task functions, so that errors show up in the cell output, as opposed to in a toast notification not associated with any cell
Move functions into modules whenever possible in notebooks
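The two practices above can be sketched together as follows. This is a standalone script under stated assumptions: `Demo`, `Demo.suffix/1`, and `Demo.TaskSupervisor` are hypothetical names, and the supervisor is started with `Task.Supervisor.start_link/1` here, whereas in a notebook it would be started with `Kino.start_child!` as earlier in the thread. With the `_nolink` variant, a crashed task should come back as an `{:exit, reason}` element in the stream rather than taking down the caller.

```elixir
defmodule Demo do
  # Named function in a module, so a crash has a module frame to report.
  def suffix(el), do: el <> "string"
end

{:ok, _sup} = Task.Supervisor.start_link(name: Demo.TaskSupervisor)

# Tasks are not linked to the caller, so a failing element does not
# abort the whole stream; it is reported in the results instead.
results =
  Demo.TaskSupervisor
  |> Task.Supervisor.async_stream_nolink(["a", nil], &Demo.suffix/1)
  |> Enum.to_list()

for result <- results do
  case result do
    {:ok, value} ->
      IO.inspect(value, label: "ok")

    # When a task raises, the exit reason carries the exception and stacktrace.
    {:exit, {exception, stacktrace}} ->
      IO.puts(Exception.format(:error, exception, stacktrace))

    {:exit, reason} ->
      IO.inspect(reason, label: "exit")
  end
end
```

Because the failure is data in the result list, it can be printed or inspected right where the stream is consumed, in addition to whatever the logger reports.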

It sounds like in compiled code I would get stacktraces even for anonymous functions, which is nice because then I can't forget to do it! I tested it by running the following using elixir from the command line:

Supervisor.start_link([{Task.Supervisor, name: MyApp.TaskSupervisor}],
  strategy: :one_for_one
)

Task.Supervisor.async_stream(MyApp.TaskSupervisor, [nil], fn el ->
  el <> "string"
end)
|> Enum.to_list()

... and the stacktrace does seem to include the relevant line:

$ elixir foo.ex
** (EXIT from #PID<0.98.0>) an exception was raised:
    ** (ArgumentError) construction of binary failed: segment 1 of type 'binary': expected a binary but got: nil
        foo.ex:6: anonymous fn/1 in :elixir_compiler_0.__FILE__/1
        (elixir 1.16.0) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
        (elixir 1.16.0) lib/task/supervised.ex:36: Task.Supervised.reply/4


02:07:08.039 [error] Task #PID<0.104.0> started from #PID<0.98.0> terminating
** (ArgumentError) construction of binary failed: segment 1 of type 'binary': expected a binary but got: nil
    foo.ex:6: anonymous fn/1 in :elixir_compiler_0.__FILE__/1
    (elixir 1.16.0) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
    (elixir 1.16.0) lib/task/supervised.ex:36: Task.Supervised.reply/4
Function: &:erlang.apply/2
    Args: [#Function<0.49987700 in file:foo.ex>, [nil]]

So all in all it sounds like maybe the stacktrace situation is special in Livebook, which is great to know! Thanks for the tips and for thinking it over.

Please feel free to close the issue if you'd like; of course similar stacktraces in Livebook would still be nice-to-have, but I understand if it would be a lot of effort for an edge case.

