Comments (23)
I'm still not totally clear on what's involved for a fix here, but I'll add some of the info that's come up in discussions thus far.
from galois.
@darthscsi is there somewhere we can see the patch you put together on top of gcc 4.8 back when you were working on this?
from galois.
It seems like we're facing something similar to scylladb/seastar#73. There they ran up against two locks: one in libstdc++ and one in libc. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68297 (resolved) and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744 (still not resolved) respectively. They worked out a workaround in https://github.com/scylladb/seastar/blob/master/src/core/exception_hacks.cc for the second issue, but it's not clear to me whether we can reliably work around the second issue like they did.
from galois.
Some other maybe plausible options: unwinding the stack ourselves. I discussed this today with @arthurp and it seems doable using __gxx_personality_v0 (https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html#base-personality), but we may end up needing to make API calls that still aquire internal locks. In other words, it's possible to do this in a relatively reliable way, but it may or may not actually fix the problem depending on the exact code paths involved. One TODO item is to look at performance with recent clang/gcc and see where we're actually seeing the performance hit and then trace through the different calls that happen during stack unwinding to see which pieces we need to switch out.
from galois.
Related question: do libc++ and libcxxabi work around this in some more intelligent way or do we need to worry about them too. Performance numbers are needed for this as well.
from galois.
Another useful side note: we could probably use libunwind if we do end up needing to unroll the stack ourselves.
from galois.
For reference: a reasonable summary of how to customize exception handling:
from galois.
from galois.
If the only issue is dlopen/dlclose locks then maybe it would be possible to make a version of traversal that is just the same as usual, but uses a version of ELF header traversal which is scalable or aggressive thread local caching. These would provide some kind of "eventually consistent" dlopen/dlclose semantics, but that might well be acceptable. If it is not, it might be possible to use an extremely biased reader-writer lock, so that the fast path (where no dlopen/dlclose occured) is very fast, but the dlopen/dlclose case is very slow.
from galois.
Okay, I just reread the discussion on the two GNU issues. It looks like the only remaining roadblock to scalable stack unwinding is the lock in dl_iterate_phdr
discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744. They already patched libgcc so the main issue isn't present there anymore.
from galois.
If anyone's curions, here's the backtrace showing how that function gets called...
Thread 0x7ffff In: dl_iterate_phdr Line: ?? PC: 0x7ffff6a12080
#1 0x00007ffff6cb57d1 in _Unwind_Find_FDE (pc=0x7ffff6cb361d <_Unwind_RaiseException+61>, bases=bases@entry=0x7fffffffbba8)
at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind-dw2-fde-dip.c:469
#2 0x00007ffff6cb1fa8 in uw_frame_state_for (context=context@entry=0x7fffffffbb00, fs=fs@entry=0x7fffffffb950) at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind-dw2.c:1257
#3 0x00007ffff6cb3120 in uw_init_context_1 () at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind-dw2.c:1586
#4 0x00007ffff6cb361e in _Unwind_RaiseException (exc=exc@entry=0x65d5a0) at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind.inc:93
#5 0x00007ffff724ec36 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x42b878 <typeinfo for galois::runtime::ConflictFlag>, dest=0x0)
at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libstdc++-v3/libsupc++/eh_throw.cc:90
#6 0x0000000000415e90 in galois::runtime::signalConflict (lockable=<optimized out>) at /h1/ian/Galois/libgalois/src/Context.cpp:49
#7 0x0000000000415f5f in galois::runtime::SimpleRuntimeContext::acquire (this=0x7ffff6cb4560 <_Unwind_IteratePhdrCallback>, lockable=0x7fffffffb860, m=galois::UNPROTECTED)
at /h1/ian/Galois/libgalois/src/Context.cpp:109
from galois.
Similar backtrace when using libc++
#0 0x00007ffff6885080 in dl_iterate_phdr () from /lib64/libc.so.6
#1 0x00007ffff6b287d1 in _Unwind_Find_FDE (pc=0x7ffff6b2661d <_Unwind_RaiseException+61>, bases=bases@entry=0x7fffffffbc28)
at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind-dw2-fde-dip.c:469
#2 0x00007ffff6b24fa8 in uw_frame_state_for (context=context@entry=0x7fffffffbb80, fs=fs@entry=0x7fffffffb9d0) at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind-dw2.c:1257
#3 0x00007ffff6b26120 in uw_init_context_1 () at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind-dw2.c:1586
#4 0x00007ffff6b2661e in _Unwind_RaiseException (exc=0x664020) at /opt/apps/ossw/sourcesdir/gcc/gcc-8.1.0/libgcc/unwind.inc:93
#5 0x00007ffff7051adb in __cxa_throw () from /org/centers/cdgc/sw/llvm-9.0.1/lib/libc++abi.so.1
#6 0x0000000000419710 in galois::runtime::signalConflict (lockable=<optimized out>) at /h1/ian/Galois/libgalois/src/Context.cpp:49
#7 0x00000000004197df in galois::runtime::SimpleRuntimeContext::acquire (this=0x7ffff6b27560 <_Unwind_IteratePhdrCallback>, lockable=0x7fffffffb8e0, m=galois::WRITE)
at /h1/ian/Galois/libgalois/src/Context.cpp:109
#8 0x0000000000413a0e in doAcquire (lockable=0x2aaaaac02890, m=galois::WRITE) at /h1/ian/Galois/libgalois/include/galois/runtime/Context.h:181
from galois.
Including discussion from #72 here to consolidate:
Me in #72 (comment)
This is a companion issue to #71 focused on the question "what do we do now?" We need a short term solution since getting exceptions to scale in parallel contexts will likely require some long term work on upstream projects.
There are a few things that would be nice to achieve:
- Not relying on a custom toolchain
- Not killing performance of apps with moderate contention
- Resolve the crashes in the gmetis and torus apps
- Not breaking user code in conter-intuitive ways
- Not poluting the user interface with additional options
- Not forcing users to worry about another configuration setting
Here are a few plausible options and where they stand:
- Sticking with longjump exclusively provides good performance, but relies on the currently undocumented assumption that operators that use conflict detection do not construct any objects that require nontrivial destruction before reaching their failsafe point. If we stick with this in the long run we should probably document this limitation. We already warn people not to modify data till after the failsafe point, so this isn't much more of a constraint than that. Switching off setjmp/longjmp based aborts also fixed our gmetis and torus apps. That said, it's possible that the runtime locks for exceptions are just hiding some other bug in those cases, so more investigation is needed to see what's going on there. This is what we'd been doing up until we realized the crashes in torus and gmetis were connected to this. If we commit to this option we should at least document the additional limitations on operators.
- Switching to exceptions entirely has severe performance implications for some apps. It does avoid adding additional constraints on the style of code the user writes though. It also fixes the crashes in gmetis and torus. It also avoids any user interface issues.
- Modifying the Galois API to no longer rely on exceptions for locking/control flow. It's not currently clear what this would look like, but maybe we could modify our API to make operator aborts something that doesn't have to involve unwinding the stack at all.
- Use exceptions in the CI tests and nowhere else. This requires no user options and very little additional work from us right now, but it could result in unexpected memory leaks or crashes in user code that uses the current implementation of conflict detection. It also only hides the issue in gmetis and torus and introduces a disparity in our testing environment that could hide other bugs as well.
- Allow switching between setjmp/longjmp based aborts and exception based aborts via a preprocessor flag. This avoids directly polluting the user interface, but still requires users to make a decision about which exception style is better suited to their application. If we get the exception scaling issues fully fixed upstream then the build option can be silently ignored later without any API changes. We can avoid major performance penalties by allowing demo applications that need setjmp/longjmp to opt into that abort mechanism in their build configs.
- Allow switching between abort implementations via a template argument flag. This directly pollutes the user API with something that is arguable an implementation choice, but it makes sense if we actually intend setjmp/longjmp to be permanent distinct options for aborts.
Thus far I'm partial to option 5 (see #68), but my opinion isn't the only one that matters. We also need to measure to see exactly what kinds of performance tradeoffs are in play here. If exceptions kill performance in every case where we use termination detection then options 1 and 3 are probably better approaches.
@amberhassaan in #72 (comment)
Meta comment first: It appears to me (from far away) that this discussion is taking place in 2 or 3 places. Can we consolidate in one place please?
I favor option 5. Reasons:
-- Like I said, we had to revive setjump/longjump to fix poor scaling.
-- It's not clear if & when the issue will get fixed in abi_runtime. Asking people to install a custom piece can be asking too much. I'd find it annoying that after I checkout Galois, I need to install a custom tool-chain.
-- Until the issue is fixed in abi runtime, we can restrict the rules on what's allowed in the operator. If creating non-stack-resident objects are inevitable, we can allow the usercode to register undo operations as callbacks that get invoked on abort. That's something I did for ordered runtime. Although that does require adding to the user API.
from galois.
With the latest clang the torus test (run as example-improved-torus 2 100
) consistently fails in the setjmp/longjmp case but it doesn't crash. I investigated a bit yesterday and it looks like the fine-grained locks are failing to protect the += operations happening there. Given that, my current thinking is that clang's optimizer is showing a bug in our software transactional memory implementation (or vice versa) and that the lock acquired when throwing an exception on abort somehow shields us from that.
Put more simply, it seems like the issues with the torus app are not directly caused by setjmp/longjmp and are instead masked by the locking that happens when throwing exceptions.
from galois.
from galois.
I think @arthurp's suggestion of using libunwind may actually work to avoid the locks. Looking at https://www.nongnu.org/libunwind/man/unw_set_caching_policy(3).html it appears that they support thread-local caching of the needed metadata. They give us access to the needed personality routines for each stack frame, so hopefully we can call the needed cleanup code.
@ddn0 Thanks for that link about how Pyston does unwinding. This further motivates unwinding the stack ourselves. Apparently by default C++ exceptions traverse the stack twice: once to determine if there's a corresponding "catch" statement for the exception being thrown, and then again to actually call destructors/etc. The two passes are only necessary because they have to make sure that a backtrace is available if the exception isn't caught and the process terminates. Operator aborts should always be caught so we don't need to build up a backtrace to report if they aren't.
from galois.
@darthscsi Thanks for the reminder about that allocation. For some reason I had mistakenly thought they had some thread local storage and they only allocated if it overflowed. It looks like that's not the case. See https://github.com/RTEMS/gnu-mirror-gcc/blob/86e1d9f75bae096922664755d037f2a9d85e2a24/libstdc%2B%2B-v3/libsupc%2B%2B/eh_alloc.cc#L284. This seems like another point in favor of unwinding the stack ourselves.
from galois.
Okay! I've dug into this quite a lot over the last few days. Here a quick update on what I've figured out thus far. I need to spend some time on other projects for a bit now, so unless someone else picks this up, it'll have to sit for a bit till I can get back to it.
- There isn't currently a way to do this just by using libunwind. libunwind can unwind the stack and give us access to each stack frame's personality routine, but the personality routine can, at it's discretion, set up a call to a "landing pad" routine that actually does the cleanup (clang and gcc usually do this). During normal stack unwinding, the exception landing pad does not return control to its caller. When it finishes it calls
_Unwind_Resume
which takes care of cleaning up the next stack frame, calling the appropriate personality function then landing pad which then calls_Unwind_Resume
, etc. The point is that control never returns anywhere,_Unwind_Resume
and the landing pad functions are mutually recursive. Because of this, we can't just perform the cleanup needed for each frame using the metadata present. See https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libgcc/unwind.inc#L229 for the actual code in libgcc. The best we could do with libunwind without additional magic is verify whether or not a personality function is present for each frame and warn of the performance consequences. - It's unlikely that there's a way to just get the needed piece of stack unwinding to happen using libgcc library routines either. The acquiring function is
dl_iterate_phdr
which is called by_Unwind_Find_FDE
which is called inuw_install_context_1
which is called in the macrouw_install_context
which is used by_Unwind_RaiseException_Phase2
which is called in_Unwind_Resume
. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744 still applies even when only trying to do the minimal portion of exception handling that we need. - Due to the previous points, it seems highly unlikely that we'll be able to work around this without overriding some libgcc or glibc function within each process's address space. I'm not sure if this is something that downstream users would appreciate having our library do since it seems kind-of invasive, but it seems less likely to cause trouble than leaking memory during longjmps. I think this is possible to do entirely from within a header by just defining the function we're overriding with inline visibility. I have yet to confirm that though.
- If we decide to try overriding something, we could take the approach used in seastar and override
dl_iterate_phdr
in Galois parallel sections. They already have the code for this at https://github.com/scylladb/seastar/blob/master/src/core/exception_hacks.cc. This will make exceptions just work at the cost of not supporting loading shared objects from within a Galois parallel loop. That limitation isn't necessarily a limitation of this approach though, just a limitation of their current implementation of it. - We could also try overriding
_Unwind_Resume
,__cxa_throw
or other portions of the Itanium exception ABI would probably also work. This would allow us to avoid one of the two passes over the stack that are normally done when handling an exception and make it possible to have things behave more like "longjmp with exception handling" which is really what we want. This would likely perform better but it would probably also be harder to figure out and slightly more invasive. Some caching of shared object metadata would still be needed, so starting with overridingdl_iterate_phdr
and going from there seems like a better approach. - A "real" fix for this will require resolving https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744. It would be even better if we could get some kind of standard "
longjmp
but calls destructors" kind of thing too since exceptions already do more than we need. As noted earlier in this thread, MSVC is the only compiler that provides something like that right now. - As mentioned earlier, avoiding the lock in
dl_iterate_phdr
still doesn't save us from the fact that libgcc also uses malloc to allocate space for exceptions. This strengthens the case for setjmp/longjmp.
from galois.
Just to add, documenting the fact that we're living with the limitations of longjmp seems like a sane thing to do at least in the short term. (See https://stackoverflow.com/a/1376099/1935144. for a reference to the applicable part of the C++ standard.)
from galois.
Okay, here's an idea that could maybe get us part of the way there: We could make our signal_abort function a part of a shared object that we control and then link that shared object (libgalois or some small dedicated shared library for this?) with -Bsymbolic
and override _Unwind_Resume
there. This could potentially work if it's possible to set things up to pretend that the first iteration over the stack has already been done so that subsequent calls to _Unwind_Resume
do the right thing and somehow make sure that function landing pad lookups don't require taking locks. It's a stretch, but it may be possible depending on how _Unwind_Resume
is currently implemented. I'll have to look at it again.
from galois.
Okay, with that suggested technique I still don't see a way to avoid the call path: https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libgcc/unwind.inc#L241, https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libgcc/unwind.inc#L50, https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libgcc/unwind-dw2.c#L1257, https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libgcc/unwind-dw2-fde-dip.c#L469. Even though we can control the initial call to _cxa_throw
and _Unwind_Resume
, _Unwind_Resume
is still called by the personality routines and that call will pick up the default implementation with anything less than a process-level override. That said, this approach has several fewer calls to dl_iterate_phdr
than are there if we just throw the exception, so that's something.
Back to other things for now I guess.
from galois.
WRT overriding stuff in glibc/libgcc, I'd like to avoid requiring users to have to do anything custom to enable that. We could instead take the approach of making it as easy as possible if no other solution is possible.
from galois.
Also, it's not clear to me if MSVC's special longjmp actually fixes this. Reading dwarf info in parallel is the issue on Linux and I'm not sure what MSVC does with the PDB debug info in parallel situations. It'd be interesting to try, but we don't support Windows yet.
from galois.
Related Issues (20)
- Compilation Problems HOT 2
- Question about sssp-pull.cpp HOT 1
- What is the format of mastersFile? HOT 1
- How to add new code to Galois and modify the Makefile HOT 2
- Question of the MPI Network Layer (is the order guaranteed for the message?) HOT 6
- BFS hangs HOT 1
- Compiling Galois HOT 1
- Installation error HOT 1
- did you forget to ‘#include <optional>’? HOT 2
- Performance degradation for distributed PageRank HOT 8
- Is this project still alive? HOT 1
- papiGetTID error HOT 1
- Cmake error HOT 1
- "endian.h" file not found error while building BFS Application
- conversion undefined for void graphs HOT 1
- Cannot seem to get correct results for Connected Components
- Where is the directory of your HyperNetVec's implementation in corresponding paper?
- Does it support running on mac with m series chips
- Did this framework support SIMD instruction for parallelization?
- The server with the inputs requires a secured https connection
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from galois.