
Comments (10)

jan-auer commented on May 30, 2024

It's been a while. Time to revive this issue, starting with symbol parsing.

After some benchmarking of our Breakpad parser, we concluded that our pest impl is too slow, at least with our grammar. In #365, we have started a nom-based implementation inspired by your breakpad-symbols crate and a snippet shared with us by @calixteman. The current state is in the parsing module.

Major differences to the breakpad-symbols implementation are:

  • No eager collecting of line records or CFI delta records into vectors. Instead, there's an iterator API that lazy-parses them
  • Wrappers doing simple line splitting and starts_with checks to locate records fast (example)
  • We didn't use macros but rather nom functions. We're still figuring out what we like more :)
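A minimal sketch of the lazy, line-splitting approach in the list above, assuming a `&[u8]` input. The names `RecordKind`, `RawRecord`, `classify` and `records` are hypothetical and are not the API of symbolic or breakpad-symbols; the sketch only locates records with `starts_with` checks and leaves their contents unparsed.

```rust
/// Hypothetical record kinds for this sketch (not the symbolic or breakpad-symbols types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecordKind {
    Module,
    Info,
    File,
    Func,
    Public,
    Stack,
    Line,
    Other,
}

/// A located but still unparsed record; `data` borrows from the input (zero-copy).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RawRecord<'a> {
    pub kind: RecordKind,
    pub data: &'a [u8],
}

/// Classifies a line with cheap prefix checks only; no fields are parsed here.
fn classify(line: &[u8]) -> RecordKind {
    // Keyword checks come first because "FILE" and "FUNC" start with a hex digit.
    if line.starts_with(b"MODULE ") {
        RecordKind::Module
    } else if line.starts_with(b"INFO ") {
        RecordKind::Info
    } else if line.starts_with(b"FILE ") {
        RecordKind::File
    } else if line.starts_with(b"FUNC ") {
        RecordKind::Func
    } else if line.starts_with(b"PUBLIC ") {
        RecordKind::Public
    } else if line.starts_with(b"STACK ") {
        RecordKind::Stack
    } else if line.first().map_or(false, |b| b.is_ascii_hexdigit()) {
        // Line records start with a hex address.
        RecordKind::Line
    } else {
        RecordKind::Other
    }
}

/// Lazily yields one `RawRecord` per non-empty line; nothing is collected up front.
pub fn records<'a>(input: &'a [u8]) -> impl Iterator<Item = RawRecord<'a>> + 'a {
    input
        .split(|&b| b == b'\n')
        // Tolerate CRLF line endings.
        .map(|line| if line.ends_with(b"\r") { &line[..line.len() - 1] } else { line })
        .filter(|line| !line.is_empty())
        .map(|line| RawRecord { kind: classify(line), data: line })
}
```

A caller interested only in CFI data could then write `records(data).filter(|r| r.kind == RecordKind::Stack)` and never touch the other records.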

For context, we also evaluated a streaming parser based on std::io::Read. Ultimately, the decision against it was driven by keeping the parser zero-copy. gimli solves this with a trait to abstract over slices and readers, but I believe that this is too much complexity in this case.

@luser would you be interested in getting these changes upstream into breakpad-symbols? It definitely makes sense to centralize code at this point. Remaining action items would then be:

  • Minimize footprint, e.g. making log optional etc.
  • Releasing a version to crates.io

cc @Gankra @gabrielesvelto since you're also interested in this topic

from symbolic.

jan-auer commented on May 30, 2024

Hi @luser, and thanks a lot for publishing your crate(s). Your parser is certainly nicer and way more complete than ours which was only hacked together to serve our immediate needs :) The part I'm excited about even more is the processor implemented in Rust. So we would love to switch over to your implementation and also contribute to it. Not sure when this fits in our roadmap, but I'll definitely get in touch with you after New Year's!

Just to mention what's on my mind right now:

  • I'd like the symbol parser to be as lazy as possible. That is, being able to consume a file byte by byte and then only store certain records of interest (e.g. only STACK or only PUBLIC, etc...)
  • We have our own file handling, as well as generate a special symbol file format in symbolic. Therefore it would be cool to export only the parser that, for instance, consumes a &[u8].
  • The processor seems awesome. We'd just have to make sure that it is not missing some corner cases and produces equal results to the Breakpad library.
  • Here again, symbolic implements its own symbolication. Ideally, we expose parts of the minidump processor as clear public interfaces. In symbolic, we'd then just use the stackwalker but not the symbolication part.


luser commented on May 30, 2024

These are all things I'm very interested in! I'm very open to changes in the existing design of things in the minidump / breakpad-symbols crates. As I said, I wrote them as a learning exercise, so they're not particularly great in terms of API design.

I'd like the symbol parser to be as lazy as possible. That is, being able to consume a file byte by byte and then only store certain records of interest (e.g. only STACK or only PUBLIC, etc...)

I actually thought about trying to do this while cleaning up the crates to publish them, but I realized it'd be a fair amount of work and that getting the existing code published would be useful. The Google folks actually moved away from the .sym file format a few years ago because they profiled their minidump processing code and found that it spent most of its time in parsing. (They invented some other arbitrary binary serialization, there's code for it in the Breakpad repo.) I think it would be a lot easier to do something useful here in Rust, for sure.

The processor seems awesome. We'd just have to make sure that it is not missing some corner cases and produces equal results to the Breakpad library.

It's definitely not 100%: the stack unwinding is pretty deficient right now. It can only unwind x86 stacks, and only using frame pointers. I'm interested in seeing whether we could use unwind-rs for unwinding, so that the code is useful for other projects.

Here again, symbolic implements its own symbolication. Ideally, we expose parts of the minidump processor as clear public interfaces. In symbolic, we'd then just use the stackwalker but not the symbolication part.

For a long time I've wanted a version of minidump_stackwalk that could take a minidump but use platform-native debug symbols, but adapting the Breakpad C++ code to do so seemed like too much work. On a lark I hacked support for DWARF into minidump-processor a few weeks ago by way of the addr2line crate, and it was very straightforward! I think we could take what I wrote there as a starting point and clean it up to make it easy for the symbolication step to be implemented separately.


jan-auer commented on May 30, 2024

Okay, this sounds very nice already. I think we share a common idea of what this could look like.

The Google folks actually moved away from the .sym file format a few years ago

Yes, although from what I've seen they do not give a guarantee that this binary format is actually stable. From what I remember, the FastModuleResolver (was it called like that?) just maps most of the symbols into memory pretty directly. However, they also chose to stick with line programs, which saves disk space but requires more computation to obtain an actual line number.

I think it would be a lot easier to do something useful here in Rust

Yes, and that is exactly what we do in symbolic-symcache. This is essentially an efficient layout for symbolicating an instruction address within a code module, optimized for mmap. Most notably, it does not use line programs, so looking up line numbers is blazingly fast. We have written conversions from DWARF (both Mach-O and ELF) as well as Breakpad symbols. In the end, I think having an internal format that facilitates fast symbolication while supporting various input formats is the right solution.
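To illustrate why skipping line programs makes lookups cheap, here is a rough sketch of an address lookup over a pre-expanded, sorted table. This is not the actual SymCache layout; `LineEntry` and `lookup` are hypothetical names, and the only assumption is that entries are sorted by start address and do not overlap.

```rust
/// Hypothetical flat table entry: one address range mapped directly to a line number.
/// This is an illustration only, not the real SymCache format.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LineEntry {
    pub start: u64, // inclusive
    pub end: u64,   // exclusive
    pub line: u32,
}

/// Finds the entry covering `addr` by binary search.
/// Assumes `table` is sorted by `start` and ranges do not overlap.
pub fn lookup(table: &[LineEntry], addr: u64) -> Option<&LineEntry> {
    // Index of the first entry whose start is greater than `addr`.
    let idx = table.partition_point(|e| e.start <= addr);
    // The candidate is the entry just before that index, if any.
    let entry = table.get(idx.checked_sub(1)?)?;
    if addr < entry.end {
        Some(entry)
    } else {
        None
    }
}
```

Because every range is already materialized, a lookup is a single binary search with no state machine to run, at the cost of a larger table than an encoded line program.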

Ideally, I would like to refactor the Breakpad to Symcache conversion to use your crate. In turn, this could be an interesting starting point to use in your Minidump symbolicator.

the stack unwinding is pretty deficient right now

That is good to know. I'm not sure when, but I would love to allocate some time and collaborate with you to make it more complete / stable.

I hacked support for DWARF into minidump-processor a few weeks ago by way of the addr2line crate

This already looks pretty neat. So I was thinking that maybe stackwalking and symbolicating should be two completely separate concerns, maybe even solved by two different crates. That is, have a component that reads the minidump as is and then basically an optimized version of addr2line, which symbolic-symcache is. At least this is how we view it right now at Sentry because of some other requirements we have in our application (for example mixed stack traces with JavaScript, etc...)


luser commented on May 30, 2024

Touching base here again because one of my colleagues mentioned symbolic to me just now. A few of us at Mozilla are planning to start work on replacing our existing Breakpad-based stackwalker that we use in Socorro (crash-stats.mozilla.com) with something based on my minidump crates. As part of that we're going to need to flesh out the implementation, so it would be great if we can use the best parts of my crates and your crates (and whatever else is in the crates.io ecosystem). FWIW, I'm not super attached to anything in my set of crates except that they exist and the bits that are there work. I'd be happy to take code out of my crates and merge it into yours, combine the best parts of both into new crates, or take bits of your crates into mine, whatever gives us the best result!

This already looks pretty neat. So I was thinking that maybe stackwalking and symbolicating should be two completely separate concerns, maybe even solved by two different crates.

I agree! The only caveat here is that stackwalking often requires access to the unwind info which is either in the debug symbols or the binaries, so it would be good to ensure that it's possible to have a straightforward way to get a symbolicated stacktrace all in one go. While that's what we do for getting stacks out of minidumps in Socorro, we also do things in two parts other times--we have code to generate non-symbolicated stacktraces from minidumps on the user's machine when Firefox crashes that we send to our telemetry systems, and then we have a web service that we use to symbolicate them (using the Breakpad-format symbols) after-the-fact, so we could definitely make use of separate implementations as well.

If you're interested I'd be happy to have a video chat to sync up and figure out a plan here.


luser commented on May 30, 2024

@luser would you be interested in getting these changes upstream into breakpad-symbols? It definitely makes sense to centralize code at this point. Remaining action items would then be:

My crate is pretty old and busted, and what you're describing sounds very sensible, so if you'd like to burn breakpad-symbols to the ground and start over by moving your new implementation there, be my guest.

  • No eager collecting of line records or CFI delta records into vectors. Instead, there's an iterator API that lazy-parses them

This makes a lot of sense. You might consider going one step further and only parsing the FUNC lines, leaving the line data for each function unparsed until actually requested. That would allow looking up a function by address without having to parse gobs of line data.
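A rough sketch of that idea, assuming `&str` input: only the FUNC header is parsed eagerly, and the line records that follow are kept as an unparsed slice until `lines()` is called. All names here (`FuncRecord`, `LineRecord`, `LineRecords`, `functions`) are hypothetical and do not reflect the actual symbolic implementation.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct LineRecord { pub address: u64, pub size: u64, pub line: u64, pub file_id: u64 }

/// Parsed FUNC header plus the raw, still unparsed text of its line records.
pub struct FuncRecord<'a> {
    pub address: u64,
    pub size: u64,
    pub name: &'a str,
    raw_lines: &'a str,
}

/// Lazy iterator that parses line records only when advanced.
pub struct LineRecords<'a> { inner: std::str::Lines<'a> }

impl Iterator for LineRecords<'_> {
    type Item = LineRecord;
    fn next(&mut self) -> Option<LineRecord> {
        // Malformed lines are skipped in this sketch.
        self.inner.find_map(parse_line)
    }
}

impl<'a> FuncRecord<'a> {
    pub fn lines(&self) -> LineRecords<'a> {
        LineRecords { inner: self.raw_lines.lines() }
    }
}

/// Parses "address size line file_id" (address and size in hex, the rest decimal).
fn parse_line(line: &str) -> Option<LineRecord> {
    let mut parts = line.split_whitespace();
    let address = u64::from_str_radix(parts.next()?, 16).ok()?;
    let size = u64::from_str_radix(parts.next()?, 16).ok()?;
    let line = parts.next()?.parse().ok()?;
    let file_id = parts.next()?.parse().ok()?;
    Some(LineRecord { address, size, line, file_id })
}

/// Parses "[m ]address size param_size name" (the part after "FUNC ").
fn parse_func_header(rest: &str) -> Option<(u64, u64, &str)> {
    let rest = rest.strip_prefix("m ").unwrap_or(rest);
    let mut parts = rest.splitn(4, ' ');
    let address = u64::from_str_radix(parts.next()?, 16).ok()?;
    let size = u64::from_str_radix(parts.next()?, 16).ok()?;
    let _param_size = parts.next()?;
    Some((address, size, parts.next()?))
}

/// Only the first token is inspected; the rest of a line record stays unparsed.
fn first_token_is_hex(line: &str) -> bool {
    line.split_whitespace()
        .next()
        .map_or(false, |t| u64::from_str_radix(t, 16).is_ok())
}

/// Indexes all FUNC records, remembering where each function's line data lives.
pub fn functions(input: &str) -> Vec<FuncRecord<'_>> {
    let mut funcs = Vec::new();
    // Currently open function: header fields plus the start offset of its body.
    let mut open: Option<(u64, u64, &str, usize)> = None;
    let mut offset = 0;
    for raw in input.split_inclusive('\n') {
        let start = offset;
        offset += raw.len();
        if first_token_is_hex(raw) {
            continue; // a line record: belongs to the open function's body
        }
        // Any other record closes the currently open function body.
        if let Some((address, size, name, body)) = open.take() {
            funcs.push(FuncRecord { address, size, name, raw_lines: &input[body..start] });
        }
        if let Some(rest) = raw.strip_prefix("FUNC ") {
            open = parse_func_header(rest.trim_end()).map(|(a, s, n)| (a, s, n, offset));
        }
    }
    if let Some((address, size, name, body)) = open {
        funcs.push(FuncRecord { address, size, name, raw_lines: &input[body..] });
    }
    funcs
}
```

With this shape, looking up a function by address only needs the `address` and `size` fields; the cost of parsing line data is paid per function, and only on request.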

  • Wrappers doing simple line splitting and starts_with checks to locate records fast (example)

This is very clever, nice work!

  • We didn't use macros but rather nom functions. We're still figuring out what we like more :)

breakpad-symbols uses a very old version of nom so it's not a great example in any event! I find the nom functions much easier to reason about, but also a bit more awkward to use in practice. Having the input buffer be magically handled for you certainly helps!


jan-auer commented on May 30, 2024

You might consider going one step further and only parsing the FUNC lines, leaving the line data for each function unparsed until actually requested.

We got you covered: accessor -> lazy iterator. It's probably more verbose than need be.


luser commented on May 30, 2024

Question on the mechanics here: do you want to take the code you've written and put it in the breakpad-symbols crate in this repo, or would you prefer to just create a new breakpad-symbols crate in your repo, and I can give you ownership of the existing crate on crates.io so you can publish what you have as a new release? Either way I suspect you're going to want to remove breakpad-symbols as a direct dependency from minidump-processor in favor of using symbolic, so maybe it makes more sense to just have that code live in the symbolic repo instead?


jan-auer commented on May 30, 2024

I haven't fully thought this through yet. Symbolic is a large and higher-level library for debugging and symbolicating. In particular, symbolic-debuginfo builds on other libraries for the actual implementation of various debug info formats. I was thinking it's actually nice to have a crate which does nothing but read Breakpad symbols. And since I also own breakpad, we could make use of that, with its descriptive name.

Having symbolic as a dependency of minidump-processor is a great idea and it would certainly simplify things. I guess users would primarily want to override the symbol supplier, but not necessarily symbol file parsing etc (in Breakpad mostly handled by SourceLineResolver and friends).

Getting symbolic would also come with the benefit of supporting native debug files directly without converting to sym, but at the expense of a larger dependency tree. So maybe that's an optional feature?


luser commented on May 30, 2024

I guess users would primarily want to override the symbol supplier, but not necessarily symbol file parsing etc

I don't think we have enough knowledge of use cases to really say here, but having built-in support for using .sym files in a specific directory (with a Breakpad-compatible symbol store-style hierarchy) and from a symbol server over HTTP would likely cover a lot of ground.

Getting symbolic would also come with the benefit of supporting native debug files directly without converting to sym, but at the expense of a larger dependency tree. So maybe that's an optional feature?

Making it an optional cargo feature sounds fine. I know from my experience with Firefox that being able to get stack traces from minidumps without having to generate .sym files would have been really useful for developers running tests and such on their local machines, since we didn't have dump_syms run as part of a normal developer build (because it takes a while). Mostly I just think y'all have built a great crate that does exactly what's necessary here and we might as well reuse it!
