Comments (8)
It's true that it's not an entirely fair comparison because as you mention Erlang implements TCO. But as someone working on a production system I certainly prefer the stacktrace from Java, which tells me where in my code the error happened, to:
Good to know! The other difference here is that in a production system you will be running code from modules, not cell evaluation, so please check whether using modules changes the results, for a more apples-to-apples comparison, and let us know. I agree it is important that our stacktraces in concurrent production systems are meaningful. If you confirm that modules improve stacktraces, I will add that to our docs as an additional benefit of modules.
Meanwhile, I am looking into whether we can improve the evaluation stacktrace, but it seems rather unlikely (without making evaluation more expensive).
from livebook.
If you don't want to use _nolink everywhere, you can also call Process.flag(:trap_exit, true) in your notebook. That would ensure crashes in linked processes do not crash the notebook runtime, but the side effect is that failures can go completely unnoticed, so you should tread carefully. :) I will push more docs soon.
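A minimal sketch of what trapping exits looks like in practice, assuming a plain script rather than a Livebook cell. The variable names are illustrative only. It shows why failures can go unnoticed: the crash arrives as an ordinary message that nothing forces you to read.

```elixir
# Trap exits in the current process, so a crashing linked process
# delivers an {:EXIT, pid, reason} message instead of killing us.
Process.flag(:trap_exit, true)

# Start a linked process that crashes immediately.
pid = spawn_link(fn -> raise "boom" end)

# The failure is only visible if we explicitly look for the message.
exit_reason =
  receive do
    {:EXIT, ^pid, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect(exit_reason, label: "linked process exited")
```

If the receive block were left out, the caller would keep running and the only trace of the crash would be the logger output.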
from livebook.
Was just able to repro this: scratch.livemd.zip
Scratch
Mix.install([
  {:kino, "~> 0.12.3"}
])
Section
Kino.start_child!({Task.Supervisor, name: MyApp.TaskSupervisor})
Task.Supervisor.async_stream_nolink(MyApp.TaskSupervisor, [nil], fn el ->
  el <> "string"
end)
|> Enum.to_list()
20:14:29.256 [error] Task #PID<0.224.0> started from #PID<0.152.0> terminating
** (ArgumentError) errors were found at the given arguments:
* 1st argument: not a bitstring
:erlang.bit_size(nil)
(stdlib 5.1.1) eval_bits.erl:114: :eval_bits.eval_exp_field1/9
Function: &:erlang.apply/2
Args: [#Function<42.105768164/1 in :erl_eval.expr/6>, [nil]]
Note that the error does not reference any line numbers in my code, but that the error at least shows up below the relevant cell.
But if I change async_stream_nolink to async_stream, I now also have no way of knowing which cell the error occurred in! I only get a toast notification and all cells just say "Aborted."
from livebook.
For the aborted case, do you see the error message or no error message at all? That’s something we could improve if you don’t see anything.
Other than that, this is expected. We don’t track under which cell a process was started. We could even attempt to, but I don’t think we would be able to retrieve this information when the process crashes.
Regarding the stacktrace, Erlang performs tail call optimization, so you can miss the source in cases like that.
from livebook.
Thanks for the reply!
For the aborted case, do you see the error message or no error message at all? That’s something we could improve if you don’t see anything.
There is an error message in a toast notification, but there is no error message in any cell's output :( This is critical because it means I now have no idea where the error happened: the toast is not associated with any cell. Unfortunately, debugging is now harder than in a lower-level language like C++, where I can at least look at the stacktrace.
Other than that, this is expected. We don’t track under which cell a process was started. We could even attempt to, but I don’t think we would be able to retrieve this information when the process crashes.
I see, thanks. Honestly, that is understandable but pretty unfortunate. I've worked mostly with JavaScript, Java, and C++, and the main thing that excited me about Elixir was actually better error handling. The error messages in Java suck, but the production instrumentation is great: you can always run, e.g., jstack and figure out what your program is actually doing and where an error came from. Elixir has much nicer error messages for e.g. match failures though.
Regarding the stacktrace, Erlang performs tail call optimization, so you can miss the source in cases like that.
It makes sense that I wouldn't see the tail calls in the stack trace given TCO, but it's still surprising that the stacktrace is so short and doesn't involve any of my code at all. For instance:
:erlang.system_flag(:backtrace_depth, 1)

defmodule TCO do
  def tco(n) do
    if n > 10_000 do
      raise RuntimeError, message: "error!"
    end
    tco(n+1)
  end
end

defmodule Foo do
  def foo() do
    TCO.tco(1)
  end
end

Foo.foo()
outputs:
** (RuntimeError) error!
    #cell:zww6gl57kk56swz7:18: (file)
when :backtrace_depth is 1, and
** (RuntimeError) error!
    #cell:zww6gl57kk56swz7:6: TCO.tco/1
    #cell:zww6gl57kk56swz7:18: (file)
when :backtrace_depth is 5, which I'd expect due to TCO.
What surprises me about the Livebook I sent is that no #cell:...:line: (file) backtrace entry shows up, which unfortunately again makes debugging more difficult than in C++/Java.
Is there a way to figure out which cell caused the error in the Livebook I sent? Or would I just have to bisect my code?
from livebook.
Oh, something that can help: code in cells is evaluated unless it is put into modules. If you define the function your task runs in a module, you may get better stacktraces and you will surely get better performance. Try this:
Task.async_stream(enumerable, &SomeMod.function(&1, additional, args))
In your example above, Foo.foo() is not a tail call because Elixir automatically adds a call after it. :) Since it is a file, it is safe to do that, because you will either block or execute the whole file. But we can't do that for random processes starting code, as we would affect their semantics.
The whole issue is that once you start spawning processes, you have a concurrent system. You are comparing with C++/Java, but you should really compare with a concurrent C++/Java system running multiple threads (green or otherwise).
from livebook.
Oh, something that can help: code in cells is evaluated unless it is put into modules. If you define the function your task runs in a module, you may get better stacktraces and you will surely get better performance. Try this [...]
Interesting! Thanks for the tip, I will try this out. I have also changed my code to use _nolink just for the better error handling.
In your example above, Foo.foo() is not a tail call because Elixir automatically adds a call after it. :) Since it is a file, it is safe to do that, because you will either block or execute the whole file. But we can't do that for random processes starting code, as we would affect their semantics.
You are right, good point!
The whole issue is that once you start spawning processes, you have a concurrent system. You are comparing with C++/Java, but you should really compare with a concurrent C++/Java system running multiple threads (green or otherwise).
I agree that concurrent systems will always be harder to debug, but I still wish Elixir were a little nicer here! I am indeed comparing Elixir to a concurrent Java system. In particular, the job before my previous one was actually working on maps for a phone company where we used Spark. There is much that is unpleasant about Java, but for instance the stack trace here is great:
class Scratch {
    public static int foo(Object bar) {
        return 2 + (int) bar;
    }

    public static void main(String[] args) throws Exception {
        Thread thread = new Thread(() -> {
            System.out.println(Scratch.foo("not an int"));
        });
        thread.start();
        Thread.sleep(5000);
        thread.join();
    }
}
It tells me the exact line in my code where the exception was thrown:
$ java test.java
Exception in thread "Thread-0" java.lang.ClassCastException: class java.lang.String cannot be cast to class java.lang.Integer (java.lang.String and java.lang.Integer are in module java.base of loader 'bootstrap')
at Scratch.foo(test.java:3)
at Scratch.lambda$main$0(test.java:8)
at java.base/java.lang.Thread.run(Thread.java:1583)
It's true that it's not an entirely fair comparison because as you mention Erlang implements TCO. But as someone working on a production system I certainly prefer the stacktrace from Java, which tells me where in my code the error happened, to:
20:14:29.256 [error] Task #PID<0.224.0> started from #PID<0.152.0> terminating
** (ArgumentError) errors were found at the given arguments:
* 1st argument: not a bitstring
:erlang.bit_size(nil)
(stdlib 5.1.1) eval_bits.erl:114: :eval_bits.eval_exp_field1/9
Function: &:erlang.apply/2
Args: [#Function<42.105768164/1 in :erl_eval.expr/6>, [nil]]
... which gives zero information other than that some function somewhere is being called with something that's not a bitstring!
There are things that are much nicer about Elixir, e.g. match failures printing the value that failed to match. That avoids a whole class of annoying Java production errors, e.g.:
Exception in thread "Thread-0" java.lang.RuntimeException: some generic message about how an assertion failed
at Scratch.lambda$main$0(test.java:8)
at java.base/java.lang.Thread.run(Thread.java:1583)
... which is especially annoying when it happens in the middle of a long-running batch job. But between stack traces and error messages I do feel like stack traces are more important here.
Thanks again for the replies. I will try out modules to see if that helps. I don't want to be a downer, but it is a little sad that the Elixir stack traces here are less informative than in threaded Java code. I was excited after seeing the better match failure messages.
from livebook.
Thanks again for all the help! TLDR I think you're right that the stacktraces are much better with modules and in compiled code, which is awesome!
If you confirm that modules improve stacktraces, I will add that to our docs as an additional benefit of modules.
Modules do seem to improve the stacktrace!
defmodule Foo do
  def bar(x) do
    x <> "string"
  end
end

Task.Supervisor.async_stream_nolink(MyApp.TaskSupervisor, [nil], fn el ->
  Foo.bar(el)
end)
|> Enum.to_list()
yields:
01:58:06.212 [error] Task #PID<0.603.0> started from #PID<0.593.0> terminating
** (ArgumentError) construction of binary failed: segment 1 of type 'binary': expected a binary but got: nil
#cell:3rotzbrltta2n3br:3: Foo.bar/1
(elixir 1.15.7) src/elixir.erl:396: :elixir.eval_external_handler/3
(elixir 1.15.7) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
Function: &:erlang.apply/2
Args: [#Function<42.105768164/1 in :erl_eval.expr/6>, [nil]]
Which shows Foo.bar! If I run without nolink, then the error message is still not associated with the cell, but at least it mentions Foo.bar now. So for now I will try to make sure I:
- Use the _nolink versions of Task functions, so that errors show up in the cell output, as opposed to in a toast notification not associated with any cell
- Move functions into modules whenever possible in notebooks
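The two practices above can be sketched together as follows. This is a minimal sketch, assuming a standalone script. Sketch.Worker and Sketch.TaskSupervisor are hypothetical names used only for illustration, and the supervisor is started with Task.Supervisor.start_link instead of Kino.start_child! so the snippet does not depend on Livebook.

```elixir
# Hypothetical module: a named function gives the stacktrace something
# concrete to point at (Sketch.Worker.append/1).
defmodule Sketch.Worker do
  def append(el), do: el <> "string"
end

# Hypothetical supervisor name, started directly for this sketch.
{:ok, _sup} = Task.Supervisor.start_link(name: Sketch.TaskSupervisor)

results =
  Sketch.TaskSupervisor
  |> Task.Supervisor.async_stream_nolink(["a", nil], &Sketch.Worker.append/1)
  |> Enum.to_list()

# With the _nolink variant, a crashed task shows up as an {:exit, reason}
# element in the results instead of taking the caller down.
Enum.each(results, fn
  {:ok, value} -> IO.inspect(value, label: "ok")
  {:exit, reason} -> IO.inspect(reason, label: "task crashed")
end)
```

The crashed task is still logged, and because the results keep the input order, the position of the {:exit, reason} element tells you which input caused the failure.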
It sounds like in compiled code I would get stacktraces even for anonymous functions, which is nice because then I can't forget to do it! I tested it by running the following using elixir from the command line:
Supervisor.start_link([{Task.Supervisor, name: MyApp.TaskSupervisor}],
  strategy: :one_for_one
)

Task.Supervisor.async_stream(MyApp.TaskSupervisor, [nil], fn el ->
  el <> "string"
end)
|> Enum.to_list()
... and the stacktrace does seem to include the relevant line:
$ elixir foo.ex
** (EXIT from #PID<0.98.0>) an exception was raised:
** (ArgumentError) construction of binary failed: segment 1 of type 'binary': expected a binary but got: nil
foo.ex:6: anonymous fn/1 in :elixir_compiler_0.__FILE__/1
(elixir 1.16.0) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
(elixir 1.16.0) lib/task/supervised.ex:36: Task.Supervised.reply/4
02:07:08.039 [error] Task #PID<0.104.0> started from #PID<0.98.0> terminating
** (ArgumentError) construction of binary failed: segment 1 of type 'binary': expected a binary but got: nil
foo.ex:6: anonymous fn/1 in :elixir_compiler_0.__FILE__/1
(elixir 1.16.0) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
(elixir 1.16.0) lib/task/supervised.ex:36: Task.Supervised.reply/4
Function: &:erlang.apply/2
Args: [#Function<0.49987700 in file:foo.ex>, [nil]]
So all in all it sounds like maybe the stacktrace situation is special in Livebook, which is great to know! Thanks for the tips and for thinking it over.
Please feel free to close the issue if you'd like; of course similar stacktraces in Livebook would still be nice-to-have, but I understand if it would be a lot of effort for an edge case.
from livebook.