We currently need exceptions to implement operator aborts, but these scale poorly in p

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

It seems like we're facing something similar to <a class="issue-link js-issue-link" da

Poor Scaling of Exceptions Fix,about intelligentsoftwaresystems/galois

Comments (23)

insertinterestingnamehere commented on July 30, 2024

I'm still not totally clear on what's involved for a fix here, but I'll add some of the info that's come up in discussions thus far.

from galois.

insertinterestingnamehere commented on July 30, 2024

@darthscsi is there somewhere we can see the patch you put together on top of gcc 4.8 back when you were working on this?

from galois.

insertinterestingnamehere commented on July 30, 2024

It seems like we're facing something similar to scylladb/seastar#73. There they ran up against two locks: one in libstdc++ and one in libc. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68297 (resolved) and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744 (still not resolved) respectively. They worked out a workaround in https://github.com/scylladb/seastar/blob/master/src/core/exception_hacks.cc for the second issue, but it's not clear to me whether we can reliably work around the second issue like they did.

from galois.

insertinterestingnamehere commented on July 30, 2024

Some other maybe plausible options: unwinding the stack ourselves. I discussed this today with @arthurp and it seems doable using __gxx_personality_v0 (https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html#base-personality), but we may end up needing to make API calls that still aquire internal locks. In other words, it's possible to do this in a relatively reliable way, but it may or may not actually fix the problem depending on the exact code paths involved. One TODO item is to look at performance with recent clang/gcc and see where we're actually seeing the performance hit and then trace through the different calls that happen during stack unwinding to see which pieces we need to switch out.

from galois.

insertinterestingnamehere commented on July 30, 2024

Related question: do libc++ and libcxxabi work around this in some more intelligent way or do we need to worry about them too. Performance numbers are needed for this as well.

from galois.

insertinterestingnamehere commented on July 30, 2024

Another useful side note: we could probably use libunwind if we do end up needing to unroll the stack ourselves.

from galois.

ddn0 commented on July 30, 2024

For reference: a reasonable summary of how to customize exception handling:

https://github.com/dropbox/pyston/blob/master/docs/UNWINDING.md

from galois.

darthscsi commented on July 30, 2024

The stack unwinding is not the bottleneck per-se. It's the fact that reading DWARF tables requires traversing ELF headers. headers which may appear or disappear if another thread does a dlopen/dlclose. I thought there was a patch in the repo for safe keeping.

…

On Wed, Feb 5, 2020 at 3:54 PM Donald Nguyen ***@***.***> wrote: For reference: a reasonable summary of how to customize exception handling: - https://github.com/dropbox/pyston/blob/master/docs/UNWINDING.md — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#71?email_source=notifications&email_token=ACY5KK47XDRV3STFUND73CTRBMYQLA5CNFSM4KP4BZQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEK5D6WI#issuecomment-582631257>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACY5KK3XTY5T7HAL6PVPDX3RBMYQLANCNFSM4KP4BZQA> .

from galois.

arthurp commented on July 30, 2024

If the only issue is dlopen/dlclose locks then maybe it would be possible to make a version of traversal that is just the same as usual, but uses a version of ELF header traversal which is scalable or aggressive thread local caching. These would provide some kind of "eventually consistent" dlopen/dlclose semantics, but that might well be acceptable. If it is not, it might be possible to use an extremely biased reader-writer lock, so that the fast path (where no dlopen/dlclose occured) is very fast, but the dlopen/dlclose case is very slow.

from galois.

insertinterestingnamehere commented on July 30, 2024

Okay, I just reread the discussion on the two GNU issues. It looks like the only remaining roadblock to scalable stack unwinding is the lock in dl_iterate_phdr discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744. They already patched libgcc so the main issue isn't present there anymore.

from galois.

insertinterestingnamehere commented on July 30, 2024

If anyone's curions, here's the backtrace showing how that function gets called...

Thread 0x7ffff In: dl_iterate_phdr                                                                                                                          Line: ??   PC: 0x7ffff6a12080
#1  0x00007ffff6cb57d1 in _Unwind_Find_FDE (pc=0x7ffff6cb361d <_Unwind_RaiseException+61>, bases=bases@entry=0x7fffffffbba8)
    at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind-dw2-fde-dip.c:469
#2  0x00007ffff6cb1fa8 in uw_frame_state_for (context=context@entry=0x7fffffffbb00, fs=fs@entry=0x7fffffffb950) at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind-dw2.c:1257
#3  0x00007ffff6cb3120 in uw_init_context_1 () at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind-dw2.c:1586
#4  0x00007ffff6cb361e in _Unwind_RaiseException (exc=exc@entry=0x65d5a0) at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind.inc:93
#5  0x00007ffff724ec36 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x42b878 <typeinfo for galois::runtime::ConflictFlag>, dest=0x0)
    at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libstdc++-v3/libsupc++/eh_throw.cc:90
#6  0x0000000000415e90 in galois::runtime::signalConflict (lockable=<optimized out>) at /h1/ian/Galois/libgalois/src/Context.cpp:49
#7  0x0000000000415f5f in galois::runtime::SimpleRuntimeContext::acquire (this=0x7ffff6cb4560 <_Unwind_IteratePhdrCallback>, lockable=0x7fffffffb860, m=galois::UNPROTECTED)
    at /h1/ian/Galois/libgalois/src/Context.cpp:109

from galois.

insertinterestingnamehere commented on July 30, 2024

Similar backtrace when using libc++

#0  0x00007ffff6885080 in dl_iterate_phdr () from /lib64/libc.so.6
#1  0x00007ffff6b287d1 in _Unwind_Find_FDE (pc=0x7ffff6b2661d <_Unwind_RaiseException+61>, bases=bases@entry=0x7fffffffbc28)
    at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind-dw2-fde-dip.c:469
#2  0x00007ffff6b24fa8 in uw_frame_state_for (context=context@entry=0x7fffffffbb80, fs=fs@entry=0x7fffffffb9d0) at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind-dw2.c:1257
#3  0x00007ffff6b26120 in uw_init_context_1 () at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind-dw2.c:1586
#4  0x00007ffff6b2661e in _Unwind_RaiseException (exc=0x664020) at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind.inc:93
#5  0x00007ffff7051adb in __cxa_throw () from /org/centers/cdgc/sw/llvm-9.0.1/lib/libc++abi.so.1
#6  0x0000000000419710 in galois::runtime::signalConflict (lockable=<optimized out>) at /h1/ian/Galois/libgalois/src/Context.cpp:49
#7  0x00000000004197df in galois::runtime::SimpleRuntimeContext::acquire (this=0x7ffff6b27560 <_Unwind_IteratePhdrCallback>, lockable=0x7fffffffb8e0, m=galois::WRITE)
    at /h1/ian/Galois/libgalois/src/Context.cpp:109
#8  0x0000000000413a0e in doAcquire (lockable=0x2aaaaac02890, m=galois::WRITE) at /h1/ian/Galois/libgalois/include/galois/runtime/Context.h:181

from galois.

insertinterestingnamehere commented on July 30, 2024

Including discussion from #72 here to consolidate:
Me in #72 (comment)

This is a companion issue to #71 focused on the question "what do we do now?" We need a short term solution since getting exceptions to scale in parallel contexts will likely require some long term work on upstream projects.

There are a few things that would be nice to achieve:

Not relying on a custom toolchain

Not killing performance of apps with moderate contention

Resolve the crashes in the gmetis and torus apps

Not breaking user code in conter-intuitive ways

Not poluting the user interface with additional options

Not forcing users to worry about another configuration setting

Here are a few plausible options and where they stand:

Sticking with longjump exclusively provides good performance, but relies on the currently undocumented assumption that operators that use conflict detection do not construct any objects that require nontrivial destruction before reaching their failsafe point. If we stick with this in the long run we should probably document this limitation. We already warn people not to modify data till after the failsafe point, so this isn't much more of a constraint than that. Switching off setjmp/longjmp based aborts also fixed our gmetis and torus apps. That said, it's possible that the runtime locks for exceptions are just hiding some other bug in those cases, so more investigation is needed to see what's going on there. This is what we'd been doing up until we realized the crashes in torus and gmetis were connected to this. If we commit to this option we should at least document the additional limitations on operators.

Switching to exceptions entirely has severe performance implications for some apps. It does avoid adding additional constraints on the style of code the user writes though. It also fixes the crashes in gmetis and torus. It also avoids any user interface issues.

Modifying the Galois API to no longer rely on exceptions for locking/control flow. It's not currently clear what this would look like, but maybe we could modify our API to make operator aborts something that doesn't have to involve unwinding the stack at all.

Use exceptions in the CI tests and nowhere else. This requires no user options and very little additional work from us right now, but it could result in unexpected memory leaks or crashes in user code that uses the current implementation of conflict detection. It also only hides the issue in gmetis and torus and introduces a disparity in our testing environment that could hide other bugs as well.

Allow switching between setjmp/longjmp based aborts and exception based aborts via a preprocessor flag. This avoids directly polluting the user interface, but still requires users to make a decision about which exception style is better suited to their application. If we get the exception scaling issues fully fixed upstream then the build option can be silently ignored later without any API changes. We can avoid major performance penalties by allowing demo applications that need setjmp/longjmp to opt into that abort mechanism in their build configs.

Allow switching between abort implementations via a template argument flag. This directly pollutes the user API with something that is arguable an implementation choice, but it makes sense if we actually intend setjmp/longjmp to be permanent distinct options for aborts.

Thus far I'm partial to option 5 (see #68), but my opinion isn't the only one that matters. We also need to measure to see exactly what kinds of performance tradeoffs are in play here. If exceptions kill performance in every case where we use termination detection then options 1 and 3 are probably better approaches.

@amberhassaan in #72 (comment)

Meta comment first: It appears to me (from far away) that this discussion is taking place in 2 or 3 places. Can we consolidate in one place please?

I favor option 5. Reasons:
-- Like I said, we had to revive setjump/longjump to fix poor scaling.
-- It's not clear if & when the issue will get fixed in abi_runtime. Asking people to install a custom piece can be asking too much. I'd find it annoying that after I checkout Galois, I need to install a custom tool-chain.
-- Until the issue is fixed in abi runtime, we can restrict the rules on what's allowed in the operator. If creating non-stack-resident objects are inevitable, we can allow the usercode to register undo operations as callbacks that get invoked on abort. That's something I did for ordered runtime. Although that does require adding to the user API.

from galois.

insertinterestingnamehere commented on July 30, 2024

With the latest clang the torus test (run as example-improved-torus 2 100) consistently fails in the setjmp/longjmp case but it doesn't crash. I investigated a bit yesterday and it looks like the fine-grained locks are failing to protect the += operations happening there. Given that, my current thinking is that clang's optimizer is showing a bug in our software transactional memory implementation (or vice versa) and that the lock acquired when throwing an exception on abort somehow shields us from that.

Put more simply, it seems like the issues with the torus app are not directly caused by setjmp/longjmp and are instead masked by the locking that happens when throwing exceptions.

from galois.

darthscsi commented on July 30, 2024

There was a second scalability problem which was there was a malloc for the exception object. I copied the last-chance reserve space idea in the code so most exceptions had reserved memory space to be constructed in rather than using dynamic allocation. The biasing of the lock isn't a big problem, the problem is simply bouncing the lock between cache lines to increment the read counter. Using a per-cpu reader lock solves this, requiring the writer (dlopen/dlcose) to take the lock on each cpu, where as the exception case just takes a local read lock. Andrew

…

On Thu, Feb 6, 2020 at 11:31 AM Ian Henriksen ***@***.***> wrote: With the latest clang the torus test (run as example-improved-torus 2 100) consistently fails in the setjmp/longjmp case but it doesn't crash. I investigated a bit yesterday and it looks like the fine-grained locks are failing to protect the += operations happening there. Given that, my current thinking is that clang's optimizer is showing a bug in our software transactional memory implementation (or vice versa) and that the lock acquired when throwing an exception on abort somehow shields us from that. Put more simply, it seems like the issues with the torus app are not directly caused by setjmp/longjmp and are instead masked by the locking that happens when throwing exceptions. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#71?email_source=notifications&email_token=ACY5KK3K2CDBZEBHU5VFJIDRBRCPNA5CNFSM4KP4BZQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELACTEY#issuecomment-583018899>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACY5KK23V55XPEKTR625P6TRBRCPNANCNFSM4KP4BZQA> .

from galois.

insertinterestingnamehere commented on July 30, 2024

I think @arthurp's suggestion of using libunwind may actually work to avoid the locks. Looking at https://www.nongnu.org/libunwind/man/unw_set_caching_policy(3).html it appears that they support thread-local caching of the needed metadata. They give us access to the needed personality routines for each stack frame, so hopefully we can call the needed cleanup code.

@ddn0 Thanks for that link about how Pyston does unwinding. This further motivates unwinding the stack ourselves. Apparently by default C++ exceptions traverse the stack twice: once to determine if there's a corresponding "catch" statement for the exception being thrown, and then again to actually call destructors/etc. The two passes are only necessary because they have to make sure that a backtrace is available if the exception isn't caught and the process terminates. Operator aborts should always be caught so we don't need to build up a backtrace to report if they aren't.

from galois.

insertinterestingnamehere commented on July 30, 2024

@darthscsi Thanks for the reminder about that allocation. For some reason I had mistakenly thought they had some thread local storage and they only allocated if it overflowed. It looks like that's not the case. See https://github.com/RTEMS/gnu-mirror-gcc/blob/86e1d9f75bae096922664755d037f2a9d85e2a24/libstdc%2B%2B-v3/libsupc%2B%2B/eh_alloc.cc#L284. This seems like another point in favor of unwinding the stack ourselves.

from galois.

insertinterestingnamehere commented on July 30, 2024

Okay! I've dug into this quite a lot over the last few days. Here a quick update on what I've figured out thus far. I need to spend some time on other projects for a bit now, so unless someone else picks this up, it'll have to sit for a bit till I can get back to it.

There isn't currently a way to do this just by using libunwind. libunwind can unwind the stack and give us access to each stack frame's personality routine, but the personality routine can, at it's discretion, set up a call to a "landing pad" routine that actually does the cleanup (clang and gcc usually do this). During normal stack unwinding, the exception landing pad does not return control to its caller. When it finishes it calls _Unwind_Resume which takes care of cleaning up the next stack frame, calling the appropriate personality function then landing pad which then calls _Unwind_Resume, etc. The point is that control never returns anywhere, _Unwind_Resume and the landing pad functions are mutually recursive. Because of this, we can't just perform the cleanup needed for each frame using the metadata present. See https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libgcc/unwind.inc#L229 for the actual code in libgcc. The best we could do with libunwind without additional magic is verify whether or not a personality function is present for each frame and warn of the performance consequences.
It's unlikely that there's a way to just get the needed piece of stack unwinding to happen using libgcc library routines either. The acquiring function is dl_iterate_phdr which is called by _Unwind_Find_FDE which is called in uw_install_context_1 which is called in the macro uw_install_context which is used by _Unwind_RaiseException_Phase2 which is called in _Unwind_Resume. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744 still applies even when only trying to do the minimal portion of exception handling that we need.
Due to the previous points, it seems highly unlikely that we'll be able to work around this without overriding some libgcc or glibc function within each process's address space. I'm not sure if this is something that downstream users would appreciate having our library do since it seems kind-of invasive, but it seems less likely to cause trouble than leaking memory during longjmps. I think this is possible to do entirely from within a header by just defining the function we're overriding with inline visibility. I have yet to confirm that though.
If we decide to try overriding something, we could take the approach used in seastar and override dl_iterate_phdr in Galois parallel sections. They already have the code for this at https://github.com/scylladb/seastar/blob/master/src/core/exception_hacks.cc. This will make exceptions just work at the cost of not supporting loading shared objects from within a Galois parallel loop. That limitation isn't necessarily a limitation of this approach though, just a limitation of their current implementation of it.
We could also try overriding _Unwind_Resume, __cxa_throw or other portions of the Itanium exception ABI would probably also work. This would allow us to avoid one of the two passes over the stack that are normally done when handling an exception and make it possible to have things behave more like "longjmp with exception handling" which is really what we want. This would likely perform better but it would probably also be harder to figure out and slightly more invasive. Some caching of shared object metadata would still be needed, so starting with overriding dl_iterate_phdr and going from there seems like a better approach.
A "real" fix for this will require resolving https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744. It would be even better if we could get some kind of standard "longjmp but calls destructors" kind of thing too since exceptions already do more than we need. As noted earlier in this thread, MSVC is the only compiler that provides something like that right now.
As mentioned earlier, avoiding the lock in dl_iterate_phdr still doesn't save us from the fact that libgcc also uses malloc to allocate space for exceptions. This strengthens the case for setjmp/longjmp.

from galois.

insertinterestingnamehere commented on July 30, 2024

Just to add, documenting the fact that we're living with the limitations of longjmp seems like a sane thing to do at least in the short term. (See https://stackoverflow.com/a/1376099/1935144. for a reference to the applicable part of the C++ standard.)

from galois.

insertinterestingnamehere commented on July 30, 2024

Okay, here's an idea that could maybe get us part of the way there: We could make our signal_abort function a part of a shared object that we control and then link that shared object (libgalois or some small dedicated shared library for this?) with -Bsymbolic and override _Unwind_Resume there. This could potentially work if it's possible to set things up to pretend that the first iteration over the stack has already been done so that subsequent calls to _Unwind_Resume do the right thing and somehow make sure that function landing pad lookups don't require taking locks. It's a stretch, but it may be possible depending on how _Unwind_Resume is currently implemented. I'll have to look at it again.

from galois.

insertinterestingnamehere commented on July 30, 2024

Okay, with that suggested technique I still don't see a way to avoid the call path: https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libgcc/unwind.inc#L241, https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libgcc/unwind.inc#L50, https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libgcc/unwind-dw2.c#L1257, https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libgcc/unwind-dw2-fde-dip.c#L469. Even though we can control the initial call to _cxa_throw and _Unwind_Resume, _Unwind_Resume is still called by the personality routines and that call will pick up the default implementation with anything less than a process-level override. That said, this approach has several fewer calls to dl_iterate_phdr than are there if we just throw the exception, so that's something.

Back to other things for now I guess.

from galois.

insertinterestingnamehere commented on July 30, 2024

WRT overriding stuff in glibc/libgcc, I'd like to avoid requiring users to have to do anything custom to enable that. We could instead take the approach of making it as easy as possible if no other solution is possible.

from galois.

insertinterestingnamehere commented on July 30, 2024

Also, it's not clear to me if MSVC's special longjmp actually fixes this. Reading dwarf info in parallel is the issue on Linux and I'm not sure what MSVC does with the PDB debug info in parallel situations. It'd be interesting to try, but we don't support Windows yet.

from galois.

Poor Scaling of Exceptions Fix about galois HOT 23 OPEN

Comments (23)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent