ipc's Issues

(FAKE ISSUE) Store uploaded items not in repo - like videos for docs.

This is apparently a common trick: GitHub will host files if one attaches them to an issue or PR. (The issue need not even be created, but it seems best to make it visible, for our own sanity.)

This should be used sparingly. As of this writing, I'm researching embedding video in a README.md, and apparently one can directly do so using a <video> tag; but unlike Markdown image links - which can refer to images in the repo, via relative path - the same does not work for the video tag. One has to, it appears, give an absolute URL. Ideally such videos-for-docs would be in the repo too, but we do what we must. AGAIN: Sparingly. (And keep the size down; let's not abuse GitHub.)

ipc: unit_test + TSAN: TSAN lock count limit reached in some tests: some run OK separately; 1 never runs OK; make the latter run OK; investigate further.

Filed by @ygoldfeld pre-open-source:

The current situation is as follows:

General description: Whether run locally with my clang-17 or in the GitHub pipeline with clang-15/16/17, some tests, in some situations, reliably hit a certain specific point within the test, at which point the console shows

ThreadSanitizer: CHECK failed: sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40) (tid=74526)

and the test hangs forever right there. To be clear, this is not a normal TSAN warning about a race or anything; rather it is TSAN instrumentation code hitting a problem and refusing to proceed further. Per the message text, some sort of limit of 64 "locks with contexts" is indeed reached, and TSAN blows up. (No further analysis done on that, but read on.)

1 test, even if run absolutely by itself, always hits this problem: Jemalloc_shm_pool_collection_test.Multiprocess. Hence it is explicitly skipped in the pipeline at the moment, using the gtest command-line feature (--gtest_filter with a negative pattern) that can exclude tests individually.

The other problematic tests -- meaning that if any of them is left in a run alongside all the many other tests, the problem occurs -- are:

LOCK_HEAVY_TESTS='Shm_session_test.External_process_array:\
                  Shm_session_test.External_process_vector_offset_ptr:\
                  Shm_session_test.External_process_string_offset_ptr:\
                  Shm_session_test.External_process_list_offset_ptr:\
                  Shm_session_test.Multisession_external_process:\
                  Shm_session_test.Disconnected_external_process:\
                  Borrower_shm_pool_collection_test.Multiprocess:\
                  Shm_pool_collection_test.Multiprocess'

Happily, though, they run just fine in a group -- but not if run as part of all the many other tests. Therefore, to avoid hitting the limitation, I have changed the pipeline to the following:

  • Run all tests minus the above LOCK_HEAVY_TESTS and the 1 never-works test. That's unit_test invocation 1.
  • Run all the LOCK_HEAVY_TESTS. That's unit_test invocation 2.

It is not ideal, but it does give good TSAN coverage, thus reducing the priority of this ticket. However, the priority rises somewhat because Jemalloc_shm_pool_collection_test.Multiprocess cannot complete even by itself.

As for what to do -- just ideas:

  • First, see if Jemalloc_shm_pool_collection_test.Multiprocess can be rejiggered somewhat just to avoid the problem/hang. This would get us to full coverage, not skipping anything.
  • Next, look into the problem itself.
    • Here it might be a good idea to go straight into LLVM code (where TSAN is maintained) -- I have done it for some other topics with reasonable success -- and see what this limit actually is. Also see the documentation (https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual and around there, unless there is a newer version in LLVM GitHub somewhere -- but either way not the very short clang summary docs https://clang.llvm.org/docs/ThreadSanitizer.html), in case something comes up.
    • Contacting the devs (which they encourage in the above wiki) may well be helpful, if the above does not help.
    • Be on the lookout for any pathologically evil stuff we might be doing in unit_test or the real code; maybe we do not clean up some locks or threads properly? (Honestly I doubt it. Do we start threads, yes, but we join them early and often too. Maybe ultimately it really is just a TSAN limitation, and there's nothing we can really do.)

It is worth looking into, but it is not a hair-on-fire problem. We can skip one test w/r/t TSAN and survive.

ipc_*, ipc, flow: Port them beyond Linux/x86-64 - macOS, ARM64; Windows. (No *BSD - due to no capnp)

As of this writing, in its original form, Flow-IPC (including Flow, repository flow, which is self-contained) builds in gcc/clang and targets Linux running in 64-bit mode (x86-64 a/k/a AMD64). Nothing about its design is Linux-y per se (and earlier versions of Flow ran in such environments as Android, iOS, and Windows); Linux was just the thing needed in practice.

This issue:

  • covers a possible road-map re. what to port in what order (of course this could always change depending on demand and on available help);
  • contains notes I've assembled about what it would, specifically, take to get it done.

I should note that some of the technical notes below could be wrong, if my knowledge was in error or I looked up something incorrectly. Still, they should be useful.

First the goal(s):

  • Random note: Having contemplated iOS and Android, I figure IPC in this form, in general, is not really that useful in the mobile setting. So for now at least it would appear we can skip this family of stuff. (Flow, being a general library, is different - so TODO: file Issue for that - but as part of Flow-IPC the not-that-useful calculus applies to it.)
  • The first target is two-fold: macOS + ARM64 (the latter being the current architecture used by modern Macs). Whether/how to test macOS + x86-64 (which is now a retired combo, I believe) is TBD; we should do it if convenient, otherwise forget it.
    • Why the first target: Why not? So many of us have Mac laptops. GitHub Actions apparently supplies Mac hardware (details TBD). Plus many of the solutions, such as cooking up a /dev/shm replacement, will probably keep working in all the other OS including Windows. So it's a (relatively) low-hanging fruit; and it has high value in the sense that many developers would be able to immediately build Flow-IPC directly on their laptops.
    • High-level notes in addition to the above:
      • macOS is at its core Darwin which is BSD-esque.
      • Local testing should be a relative cinch, while developing.
      • ARM64 versus x86-64 should not come up much at all, with one exception: The pointer tagging scheme used by our SHM-jemalloc module will need to be updated for ARM64 pointer conventions. (Spoiler alert: Apparently this adjustment will be quite easy, as basically they use the same system, except ARM64 is less restrictive - has no canonical form.)
        • It appears that, after covering x86-64, adding ARM64 -- where it would be an issue at all, which appears to be only the pointer-tagging code -- should be a minor change (see the pointer-tagging notes near the end of this write-up).
  • The second target is a bit vague: it is *BSD, meaning FreeBSD, OpenBSD, maybe NetBSD. They're not the same, and also I personally am not super familiar with them. That said my impression so far is that the stuff that would need to be ported has straightforward answers that would either carry-over from macOS, or be different but fairly simple.
    • Why the second target: The way I see it, we are Unixy, and if we handle macOS, then the major family that remains of the Unixes = *BSD. EDIT: WAIT, NO! See note just below: No capnp for BSD, so no Flow-IPC for BSD until then.
    • High-level notes in addition to the above:
      • macOS is at its core Darwin which is BSD-esque.
      • Local testing should be a relative cinch, while developing.
      • No direct GitHub Actions BSD runners, but apparently something called "VM Actions" allows running such VMs within macOS runners. TBD.
      • OOPS!!! capnp (Cap'n Proto) apparently does not build for BSD, at least not officially. With that, we are dead in the water - for now.
  • The third target is Windows (Visual C++).
    • Why the third target: We are not really Windows people, so it is farthest from our expertise... we being the original developers/maintainers. But... so? Windows is important.
    • High-level notes in addition to the above:
      • Based on my notes, all the need-to-port items have fine answers in Windows. Though, some of it could be quite different from Unix-y code; for instance there is no Unix domain socket that transmits FDs, so we have to use the quite-different Windows equivalent.
      • Local testing should be doable; just need a Windows machine and Visual Studio.
      • GitHub Actions has Windows runners. However, our CI pipeline (the stuff in .github/main.yml and flow/.github/main.yml) casually assumes Unix-y command line tools, etc. etc. So that stuff will be significant work (and learning, depending on who does it).
        • So in that sense it will be added work versus macOS/BSD.

So, basically, at the moment:

  • Start with macOS + ARM64. Get it done.
  • Repeat at some point for Windows; some of the things done for macOS should be reusable.

Now for the notes.

Firstly, should get Flow-IPC/flow#88 out of the way before tackling this. Then it'll be static_assert()s failing instead of sometimes those, sometimes direct #errors. Detail... but anyway.

Then: Be aware that everything that we consciously knew was not portable, anywhere in the code, was #ifdefed and should #error or static_assert(false) if one tries to build for the port-target OS/arch. (NOTE!!!! SHM-jemalloc - namely anything in ipc_shm_arena_lend/ under the paths src/ipc/shm/arena_lend/** and src/ipc/session/standalone/**, except src/ipc/shm/arena_lend/detail/shm_pool_offset_ptr_data.?pp - may not have held to this. Will need special attention. Other than that, though, yep.)

So, I went over all instances of such to-port code and made the following notes. Following these (where they're correct at least) may well be the bulk of the work.

Last thing before that though... I'd recommend, before doing any of that, setting up two things:

  • (If possible) a local dev/test/debug environment. E.g., Macbook Pro with Xcode and all that.
  • A CI environment in GitHub Actions. E.g., take our main.yml for flow/, write a simple portable hello-world program instead of existing compiled code, and have the main.yml build and run that in the target-OS/arch.

Once that's ready, one can start hacking away at the to-port code. Now those notes:


This code checks the location of the current executable, by itself: const auto bin_path = fs::read_symlink("/proc/self/exe"); // Can throw.

  • macOS: _NSGetExecutablePath() + realpath() (see the sketch after this list)
  • Windows: GetModuleFileName()
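A minimal sketch of the macOS approach above -- not the actual Flow-IPC code; executable_path_macos() is a name invented here:

#include <mach-o/dyld.h> // _NSGetExecutablePath()
#include <climits>       // PATH_MAX
#include <cstdint>
#include <cstdlib>       // realpath()
#include <string>
#include <vector>

std::string executable_path_macos()
{
  std::vector<char> buf(PATH_MAX);
  auto size = static_cast<uint32_t>(buf.size());
  if (_NSGetExecutablePath(buf.data(), &size) != 0)
  {
    buf.resize(size); // Buffer was too small; `size` now holds the required length.
    if (_NSGetExecutablePath(buf.data(), &size) != 0) { return {}; }
  }
  char resolved[PATH_MAX];
  if (!realpath(buf.data(), resolved)) { return {}; } // Resolve symlinks, much as /proc/self/exe effectively does.
  return resolved;
}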

/var/run is assumed to exist in Linux by default (though I have a mechanism for overriding this location, if only for test/debug at least). Is it similarly assumed to exist in macOS/iOS? And is there a Windows equivalent? In particular, it would be nice if it were something that would get cleared via reboot.

  • macOS: Should be fine; do ensure it has the same kernel-persistent properties (emptied on reboot); and if not, check whether that matters for anything we care about (I forget at the moment).
  • Windows: Possibly GetTempPath() or ProgramData or AppData/Local. Look into it.

A key one: We use /dev/shm/ listing for important SHM-pool and bipc-MQ management. (Also /dev/mqueue for POSIX MQs - however since no POSIX MQs in macOS, it is moot.)

  • macOS, Windows: It would appear, from my research so far, that there is simply no such feature in either OS.
    • Need to really triple-check this. Because if it were available, this would really be much easier than the next bullet point.
    • If so, though... then there are essentially a couple of approaches. Both could be significant work (as of this writing looks like the hardest one). Not absurd amount of work but definitely new stuff/algorithms.
      • Approach A: Essentially make our own equivalent of /dev/shm: For example, make a SHM-segment that itself would keep track of the SHM-segments that we create or unlink. Conceptually it is straightforward; but it needs to be:
        • VERY robust/simple: If we mess something up here due to being too fancy, everything else SHM-related is potentially a mess - at least the cleanup aspects anyway. So keep it very very very simple.
        • Don't run out of space... but don't use some huge RAM chunk either. Now... in Unix at least (Windows: TBD)... we can size a quite-large SHM pool; that RAM is not actually taken from other apps/the OS until something is actually written to a particular page. So we should pick some size -- which will, after all, be system-wide, not app/process/whatever-wide -- large enough to track all SHM-pool names currently in existence/not-unlinked. If we fill it up, then there's a problem. On the other hand maybe that's simply life; after all, Linux keeps track of all of them too, while having OS limits on how many pools can exist at a time. So we should reserve enough pool-space to handle that many. Then, hopefully, there won't be that many in existence, so not that much RAM will be used to track them.
      • Approach B: Use smart naming for each object type. E.g., for objects of a similar conceptual type, name them ...1, ...2, ...3; then when trying to "list" them, just go from ...1 to <first guy that does not exist>. Or something like that.
        • This honestly would suck, as it would be a constant source of complexity and bugs. And, some things are named based on App::m_name, so there would still need to be some kind of kernel-persistent directory of those names. Anyway, a difficult mess.
        • It would potentially be more space-efficient.
    • NOTE: Perf-wise, this is not very sensitive. We don't often do the list-/dev/shm/* op; it's only at cleanup points basically. Similarly, we don't create new SHM-segments that often. It should be quick but need not be optimized down to the microsecond.

We use boost::interprocess::shared_memory_object::remove(X) to remove the SHM-segment (SHM-pool) named X. Error handling, specifically, relies on the undocumented fact that on failure it sets errno; we then check that and throw the appropriate error. What about the other OSes? (A minimal sketch of this pattern follows the per-OS notes below.)

  • macOS: This should still work the same. They both just do unlink() really.
  • Windows: It depends on what boost::interprocess::shared_memory_object::remove(X) actually does. Find out by looking at Boost code. But potentially it'll be some WindowsThingLikeThis() and perhaps we would just check GetLastError().
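For reference, a minimal sketch of the errno-based pattern described above -- assuming (as noted, this is undocumented) that remove() reports failure by returning false with errno left set; remove_shm_pool_or_throw() is a name invented here:

#include <boost/interprocess/shared_memory_object.hpp>
#include <cerrno>
#include <system_error>

void remove_shm_pool_or_throw(const char* pool_name)
{
  if (!boost::interprocess::shared_memory_object::remove(pool_name))
  {
    const int err = errno; // Undocumented: relies on remove() leaving errno set on failure.
    throw std::system_error(err, std::generic_category(), "shared_memory_object::remove() failed");
  }
}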

Using kill(process_id, 0) to check whether PID process_id is running at the moment.

  • macOS: Should work (see the POSIX sketch after the Windows snippet below).
  • Windows: Something like this:
    #include <windows.h>
    
    bool process_running(DWORD process_id) {
        HANDLE processHandle = OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, FALSE, process_id);
        if (processHandle != NULL) {
            CloseHandle(processHandle);
            return true; // Process is running
        }
        if (GetLastError() == ERROR_INVALID_PARAMETER) {
            return false; // Process does not exist
        }
        // ...etc...whatever...
        return false; // Fallback so every path returns something; refine as needed.
    }
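And, for comparison, a minimal sketch of the kill(process_id, 0) check used on Linux/macOS (process_running_posix() is a name invented here):

#include <sys/types.h>
#include <signal.h> // kill()
#include <cerrno>

bool process_running_posix(pid_t process_id)
{
  if (kill(process_id, 0) == 0) { return true; } // Signal 0: existence/permission check only; nothing is delivered.
  return errno == EPERM; // EPERM: process exists but we lack permission; ESRCH (or other): treat as not running.
}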
    

PIDs being used as effectively unique across time:

  • macOS: Still yes. Double-check.
  • Windows: Apparently still yes. And apparently the PID type is wider (32 bits).

Process_credentials::process_invoked_as() checks how (via what command line, argv[0]-ish) this->m_process_id process was executed, assuming that guy is currently active (which it is). It reads /proc/...pid.../cmdline. This works, unlike some other things, even if ...pid... is running as some other user - which is an important thing in our use-case. We use this for a safety check when opening session: A certain location shall be hard-coded by user in ipc::session::App struct known to us; and we use process_invoked_as() to see what the OS says. The two must match, or else the app is misbehaving, and we reject session. So the goal here is not the specific behavior we use in Linux, but merely some check like it that we could execute.

  • macOS: proc_pidpath() is somewhat similar (it might be the actual location of the executable, not as specified on the command line... that would be fine, as long as it is specified and documented to mean that in ipc::session::App). However it might have security restrictions which would make it useless for us. If it can check it for a process run by another user or even another group -- then no problem. Otherwise, not useful. Allegedly only a privileged process can check it for other users; that would mean no-go for us; but double-check. (A sketch of the call appears after this list.)
    • Moral of the story: We may need to either
      • skip this check for the OS (and therefore omit it in ipc::session::App declaration); or
      • do something cheesy like have each process send-over-IPC its own argv[0] or proc_pidpath(...itself...) or equivalent. It's not being reported by OS then but still not-nothing as a safety feature.
  • Windows: <subsumed by Unix-domain-socket question, see below>
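A minimal sketch of the proc_pidpath() call mentioned in the macOS bullet above (invoked_as_macos() is a name invented here; whether it works across users, as our use-case needs, still has to be verified):

#include <libproc.h>    // proc_pidpath(), PROC_PIDPATHINFO_MAXSIZE
#include <sys/types.h>
#include <cstddef>
#include <string>

std::string invoked_as_macos(pid_t pid)
{
  char buf[PROC_PIDPATHINFO_MAXSIZE];
  const int len = proc_pidpath(pid, buf, sizeof(buf)); // Returns path length; <= 0 on error (e.g., permissions).
  return (len > 0) ? std::string(buf, static_cast<std::size_t>(len)) : std::string();
}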

In Linux we use getsockopt(AF_LOCAL/SO_PEERCRED) to obtain the opposing process's info: PID (important, not just for safety) and UID+GID (for safety check only). (A sketch of the Linux call appears after this list.)

  • macOS: There is LOCAL_PEEREPID sockopt. However, there is no mechanism for UID+GID apparently.
    • Can (again) either skip the check or have the app send its own geteuid()+getegid(). The fact we can get the PID still is good.
  • Windows: <subsumed by Unix-domain-socket question, see below>
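A minimal sketch of the Linux SO_PEERCRED call described above (Peer_info and peer_info_linux() are names invented here; glibc exposes struct ucred when _GNU_SOURCE is defined, which g++/clang++ normally do by default):

#include <sys/socket.h> // getsockopt(), SO_PEERCRED, struct ucred
#include <sys/types.h>

struct Peer_info { pid_t pid; uid_t uid; gid_t gid; };

bool peer_info_linux(int unix_stream_sock_fd, Peer_info* out)
{
  ucred cred {};
  socklen_t len = sizeof(cred);
  if (getsockopt(unix_stream_sock_fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) != 0) { return false; }
  *out = { cred.pid, cred.uid, cred.gid }; // PID + UID/GID of the opposing process.
  return true;
}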

HUGE IMPORTANT THING TO FILL-IN FOR Windows

We use, for Native_socket_stream, on which very much stuff is built, a Unix-domain stream socket. This is for two important purposes without which everything falls apart:

  • It is a fast/good way to transmit binary blobs. (bipc-MQ is also available in all OSes. POSIX MQ is also available but in Linux only, not in Windows or macOS; BSD does have it, I believe, but moot.)

  • It is the only way to transmit I/O handles, including from socketpair() so as to create more pre-connected socket streams!

  • It also provides the bootstrap process-to-process acceptor-connector pair of classes. That's how sessions start, so it is very much a key thing!

  • Related: The very concept of transmitting Native_handles is well-defined in Linux and macOS (and BSD and any Unix): it just wraps an int (FD). (EDIT: I didn't mean to imply it's a mere matter of sending an int; rather it's sendmsg() with SOL_SOCKET/SCM_RIGHTS ancillary data, which will create an FD in the receiver process pointing to the same file-description; see the sketch after this list.) But in Windows, what kind of thing would we even potentially want to transmit?

    • If there is something we can transmit to be able to set-up more socket-streams... then Native_handle would need to cover that at least.
    • If there is something typical users would want to transmit -- file handles? SOCKET network handles? What else? -- then we should support that.
  • macOS: It has all of this the same as Linux, portably. It is FD-based, with Unix-domain stream sockets being able to transmit FDs via sendmsg/recvmsg().... Whew!

  • Windows: Unclear! I have some notes, but I am not pasting them yet. We need a strategy for all of the above use-cases before starting to port to Windows. So filling this in is a MAJOR TODO and precludes all other Windows-port work.
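To make the sendmsg()/SCM_RIGHTS mechanism referred to above concrete, here is a minimal Linux/macOS sketch of sending a single FD over a Unix-domain stream socket (send_fd() is a name invented here; error handling is pared down):

#include <sys/socket.h> // sendmsg(), msghdr, cmsghdr, SCM_RIGHTS
#include <sys/uio.h>    // iovec
#include <cstring>      // memset(), memcpy()

bool send_fd(int unix_stream_sock_fd, int fd_to_send)
{
  char data = 'F'; // Must send at least one byte of regular data alongside the ancillary payload.
  iovec iov{&data, sizeof(data)};

  alignas(cmsghdr) char cmsg_buf[CMSG_SPACE(sizeof(int))];
  std::memset(cmsg_buf, 0, sizeof(cmsg_buf));

  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = cmsg_buf;
  msg.msg_controllen = sizeof(cmsg_buf);

  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS; // Ancillary data carrying file descriptors.
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  std::memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int)); // Receiver gets a new FD to the same file-description.

  return sendmsg(unix_stream_sock_fd, &msg, 0) == static_cast<ssize_t>(sizeof(data));
}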

Flow's Logger::this_thread_set_logged_nickname(util::String_view thread_nickname, Logger* logger_ptr, bool also_set_os_name) -- if that bool=true -- uses pthread_setname_np(pthread_self(), os_name.c_str()) to set system-visible thread nickname. This is useful in top, sanitizer output, debuggers... it's quite useful. However, Linux has a 15-char limit which we grapple with, not very well.

  • When handling this, consider Flow-IPC/flow#71 also in parallel.
  • macOS: pthread_setname_np(thread_name) should work. Different API signature but same result (see the sketch after the Windows snippet below). Allegedly the 15-char limit is more lenient here (consider the preceding bullet point).
  • Windows: This will only allegedly help in the debugger (VC++ specifically even?). Allegedly this will work:
const DWORD MS_VC_EXCEPTION = 0x406D1388;

#pragma pack(push, 8)
typedef struct tagTHREADNAME_INFO
{
    DWORD dwType; // Must be 0x1000.
    LPCSTR szName; // Pointer to name (in user addr space).
    DWORD dwThreadID; // Thread ID (-1=caller thread).
    DWORD dwFlags; // Reserved for future use, must be zero.
} THREADNAME_INFO;
#pragma pack(pop)

void SetThreadName(DWORD dwThreadID, const char* threadName) // -1 for cur thread
{
    THREADNAME_INFO info;
    info.dwType = 0x1000;
    info.szName = threadName;
    info.dwThreadID = dwThreadID;
    info.dwFlags = 0;

    __try
    {
        RaiseException(MS_VC_EXCEPTION, 0, sizeof(info)/sizeof(ULONG_PTR), (ULONG_PTR*)&info);
    }
    __except(EXCEPTION_EXECUTE_HANDLER)
    {
    }
}
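And, for comparison, a minimal sketch of the Linux-vs-macOS signature difference mentioned in the bullets above (set_current_thread_os_name() is a name invented here):

#include <pthread.h>

void set_current_thread_os_name(const char* name) // `name` assumed <= 15 chars where Linux's limit applies.
{
#if defined(__APPLE__)
  pthread_setname_np(name);                  // macOS: names the calling thread only.
#else
  pthread_setname_np(pthread_self(), name);  // Linux/glibc: names any given thread; 15 chars + NUL limit.
#endif
}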

cpu_idx() gets the current processor core index (0, 1, ..., N-1), where N = # of logical cores. In Linux can get it from ::sched_getcpu().

  • macOS: If Intel, can use a certain hack involving __cpuid_count(). But that'll be rare now anyway. With ARM64, I could not find a good answer. Check again - but so far it looks like there's nothing to use.
  • Windows: ...forgot to check....
  • Moral of the story: This is used only for logging (and TRACE/DATA logging at that). It might be best to just omit it for architectures where there's no good official technique.

optimize_pinning_in_thread_pool():

void optimize_pinning_in_thread_pool(flow::log::Logger* logger_ptr,
                                     const std::vector<util::Thread*>& threads_in_pool,
                                     [[maybe_unused]] bool est_hw_core_sharing_helps_algo,
                                     bool est_hw_core_pinning_helps_algo,
                                     bool hw_threads_is_grouping_collated)

where est_hw_core_pinning_helps_algo is to be set by the user to true if and only if the algorithm they're running over the given thread pool would be actively helped perf-wise if one were to pin each thread to a different CPU core. So if there are 32 threads and 32 logical cores, AND they set this to true, THEN it'll pin thread 1 to core 1, thread 2 to core 2, etc.; if there are 16 threads and 32 logical cores (but 16 physical cores), then it'll pin thread 1 to cores 1+17, thread 2 to 2+18, etc. (Unless hw_threads_is_grouping_collated, then t1 to 1+2, t2 to 3+4, etc.) I am not describing this perfectly, but I think you get the idea.
  • Linux: Use pthread_setaffinity_np(). (A minimal sketch appears after this list.)

  • macOS: Use thread_policy_set(). (The code is already there. However, it indicates a possible bug. Look into that, before considering the task done.)
  • Windows: Supposedly use SetThreadAffinityMask(GetCurrentThread(), affinityMask);.
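A minimal sketch of the Linux primitive named above -- just the pthread_setaffinity_np() call, not the core-selection logic of optimize_pinning_in_thread_pool() itself (pin_thread_to_cores_linux() is a name invented here):

#include <pthread.h>
#include <sched.h> // cpu_set_t, CPU_ZERO, CPU_SET (glibc: requires _GNU_SOURCE, normally on by default for g++/clang++)
#include <initializer_list>

bool pin_thread_to_cores_linux(pthread_t thread, std::initializer_list<unsigned> core_indexes)
{
  cpu_set_t cpu_set;
  CPU_ZERO(&cpu_set);
  for (unsigned core : core_indexes) { CPU_SET(core, &cpu_set); } // E.g., {1, 17} per the 16-thread example above.
  return pthread_setaffinity_np(thread, sizeof(cpu_set), &cpu_set) == 0;
}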

Last but not least, in SHM-jemalloc fancy-pointers -- the core living in Shm_pool_offset_ptr_data, which I wrote -- there is a pointer-tagging scheme in use, designed to keep the stored representation at 64 bits.

  • When storing a SHM-pointer, a pool ID and pool offset are encoded in 63 bits, but the MSB is set to 1 to indicate that, in fact, a SHM-pointer is there.
  • When storing a raw vaddr - such as to a stack location - the MSB is set to 0, while the other bits are simply copied from the original real vaddr. Then, the get-vaddr operation in this case:
    • Returns bits 62 through 0 as-is.
    • Returns the MSB as 0 if bit 62 is 0; else 1. This is because x86-64 pointers have to follow a canonical form, wherein the significant bits are the lowest 48-ish bits; while the remaining bits must equal to the most-significant significant (of those 48-ish). So it's either 00000..0...stuff...; or 11111..1...stuff.

This is a processor thing, not an OS thing. So this is where ARM64 could have different behavior. Turns out, apparently, its behavior is different but simpler: still only the low 48-ish bits are the significant ones; hence it is safe for us to keep using the MSB for our pointer-tagging scheme. But, it is simpler: there is no "canonical form": so we can just return the entire thing as-is in the get-vaddr op. There will be an #if but a simple one.
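A hypothetical sketch of just the raw-vaddr encode/decode described above (names invented here; the real logic lives in Shm_pool_offset_ptr_data):

#include <cstdint>

constexpr std::uint64_t TAG_MSB = std::uint64_t(1) << 63; // 1 = SHM-encoded (pool ID + offset); 0 = raw vaddr.

// Store a raw (non-SHM) vaddr: keep bits 62..0, force the MSB/tag to 0.
inline std::uint64_t encode_raw_vaddr(const void* p)
{
  return reinterpret_cast<std::uint64_t>(p) & ~TAG_MSB;
}

// x86-64 decode: restore the MSB by "sign-extending" from bit 62, per the canonical form described above.
inline void* decode_raw_vaddr_x86_64(std::uint64_t stored)
{
  const std::uint64_t msb = (stored & (std::uint64_t(1) << 62)) ? TAG_MSB : 0;
  return reinterpret_cast<void*>(stored | msb);
}

// ARM64 decode, per the note above: no canonical form to restore; the stored bits are returned as-is.
inline void* decode_raw_vaddr_arm64(std::uint64_t stored)
{
  return reinterpret_cast<void*>(stored);
}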

ipc_*: Normalize logging names/thread names.

Filed by @ygoldfeld pre-open-source:

~All types in Flow-IPC (outside of non-ipc::session::shm SHM-jemalloc-land, I think, which should probably be changed, but that's orthogonal) have operator<< for ostream output, used in ~all log call sites. Plus when creating background threads they're usually named using the same thing. Some types also have nickname() (given by user in ctor) in the mix in there.

The names tend to be reasonably helpfully descriptive, though a once-over on that count wouldn't hurt as part of this ticket. However the formatting is inconsistent; sometimes there's this address near the start, sometimes at the end (which can be particularly unnerving when an internally stored object is itself <<ed inside the <<). Sometimes they're too long to be used as Linux thread names helpfully. Sometimes nickname() is included, other times it IS nickname().

You get the idea. Someone should holistically look at this, come up with some conventions, and straighten it all out.

Additionally it would be nice to come up with some boiler-plate-reducing way of pre-pending the "<whatever object> [" << *this << "]:" thing automatically to log calls.

Anyway. Just make it nice. And probably it should share DNA with Flow-IPC/flow#86.


Addendum: Flow-IPC/flow#71 is related; or at least the musing in its Description "addendum" is. Basically that is about thread naming conventions, possibly having a separate API arg in flow.log to name a thread natively for the OS (top and such), which has a length limit in Linux at least. It's all part of one mess that would be good to straighten out. So this guy and that guy should probably be looked-at together.

compare with iceoryx

Original description by issue filer: your performance is higher than iceoryx?

Then I (@ygoldfeld) edited in the following:

It would be good to add a "Comparison with other approaches, products" section to the documentation (guided Manual); it would cover iceoryx potentially but also other things like local gRPC, capnp-RPC - and the fact that Flow-IPC can be used together with those to make them faster and possibly easier to use.

I shall also pre-answer the iceoryx question in a comment below.

ipc[_*]: Generated-documentation (Reference, guided Manual) + example-program gaps.

The state of documentation is good IMO. The Reference (as generated by Doxygen from source) covers everything (except see below); the guided Manual (as generated by Doxygen from additional not-really-source-but-source files) covers every feature (except see below); significant effort went into these. Plus there's a nice system making the generated docs available: Actions CI pipeline checks-in generated docs into main branch automatically; plus for each Release tag; plus links to these are auto-placed into the web site.

The present ticket covers gaps -- things lacking from the above. I labeled it as "bug," as they are IMO specific gaps we've identified -- not just niceties -- so "bug" would draw attention to it. That said, to reiterate, what is there is fairly comprehensive, meaning these gaps IMO don't constitute a major deficit. (Or not... it's subjective... examples-meant-to-be-examples really would be good to have, in particular.)

The gaps:

  • The parts of ipc_shm_arena_lend (a/k/a SHM-jemalloc) under src/ipc/shm and src/ipc/session/standalone -- which are (with the exception of shm_pool_offset_ptr_data.?pp, written by @ygoldfeld) the parts of Flow-IPC originally written by @echan-dev -- and excluding the test/ parts (which in general do not participate in doc generation) -- are in this state:
    • Everything has a doc header, as required. Great!
    • It has not been run through Doxygen -- doing so would definitely at first issue many Doxygen warnings, which our CI/CD pipeline will treat as errors. Therefore it is for now excluded from doc generation, manually in the root CMakeLists.txt (set(DOC_EXCLUDE_PATTERNS */ipc/session/standalone/shm/arena_lend/* */ipc/shm/arena_lend/*)).
    • So, TODO: Go over all Doxygen-facing source code; make any fixes according to Coding Guide/precedent; remove the exception CMake code mentioned above; fix any resulting errors; check output visually; fix any resulting problems.
    • Mitigating factors:
      • The docs are there in comments; just not in the nice generated HTML Reference that builds with Doxygen (including via GitHub Actions).
      • Typically a Flow-IPC user, we expect, will access SHM-jemalloc APIs -- if at all (e.g., just selecting it as the ipc::session alias = enough to enable end-to-end zero-copy for capnp-encoded messages) -- only through concepts that are well documented. Only when using SHM-jemalloc as a standalone SHM-provider, without ipc::session, would one miss any generated docs. This is fairly unlikely in practice.
        • (But that said, they are public APIs. Plus, the full-reference form of the generated docs Reference -- which includes innards, not just public API -- lacks this aspect of SHM-jemalloc, but by definition it shouldn't).
  • In the guided Manual, there are a couple of pages that are marked under construction and don't have any other text (except the intro paragraph). Search for "This page is under construction". As of this writing they are @page universes Multi-split Universes and @page transport_core Transport Core Layer: Transmitting Unstructured Data.
    • So, TODO: Write them.
    • Mitigating factors:
      • These are pretty advanced topics, not exactly the first thing someone is going to use, probably (multi-split thing especially).
      • "Transport core layer" page is pretty well covered in the whirlwind synopsis page @page api_overview API Overview / Synopses (the not-yet-written page links to it for now, near where it says under construction).
      • All aspects of these are well covered in the Reference.
  • In the guided Manual, there are lots of code examples in-line in the text. Many of them build up a program piece-by-piece. There's a pretty high chance this code is correct as well. But:
    • They are not formally tested as such; there could be errors (typos, oversights).
    • There is no actual test/demo program for any of them. (We do have tests and demos, but not these specific items.)
      • The one(s) corresponding to e-session-setup.dox.txt, f-chan-open.dox.txt, and especially g-session-app-org.dox.txt = particularly nice to have available; e.g., one can use them as templates (in the non-C++ sense) for user applications.
    • So, TODO: Add these as example programs (and refer to them as available, in the Manual text).
  • In general, while we have some test/demo/example programs, like the link_tests, transport_test x 2, and perf_demo, they are not written as examples of fine style and organization but rather for other purposes (respectively: basic hello-world link+functionality test, heavy functional testing, benchmarking).
    • So they're usable as examples, but they're not intended to primarily be examples -- and we should have such things too.
    • So, TODO: Add (more) example programs (in addition to the thingies corresponding to Manual snippets, from a preceding bullet point).

ipc_*: The (very few) blocking APIs' behavior on signal = inconsistent, some uninterruptible.

In most of the Flow-IPC API, calls are non-blocking. There are various key async APIs, but that's not blocking. There are, however, a handful of potentially-blocking APIs. Ideally these should behave in some reasonable way on signal, so that at least it would be possible to interrupt a program on SIGTERM while it is blocking inside such an API. In POSIX, generally, blocking I/O functions will emit EINTR in this situation. We should do something analogous; or failing that at least something predictable and consistent; or failing that present mitigations (but in that case not resolve this ticket).

Here are the existing blocking APIs as of this writing.

  • Concept: Persistent_mq_handle::[timed_]send(), [timed_]receive(), [timed_]wait_sendable(), [timed_]wait_receivable(). Impls:
    • Posix_mq_handle::*()
      • Internally this epoll_wait()s. Signal => EINTR; so we will also emit Error_code around EINTR.
      • Assessment: Pretty good, in that it is interruptible with minimum fuss.
    • Bipc_mq_handle::*()
      • Internally this (via some Boost code, long story) pthread_cond_[timed]wait()s. How this will act on signal is unclear; as of this writing I have not checked (or do not remember) what the Boost code does in wrapping the pthread_ call. But the call itself is documented to yield no EINTR ever in some man pages, and EINTR only from pthread_cond_timedwait() (not the non-timed_ one though) in some other man pages.
      • Assessment: Unclear, but not good. It will almost certainly not work in some cases and will just keep waiting.
  • struc::Channel::sync_request() (+ future likely similar API(s) including sync_expect_msg())
    • Internally this uses promise::get_future().wait[_for]() which almost certainly internally in turn uses pthread_cond_[timed]wait().
    • Assessment: Probably similar to Bipc_mq_handle situation but would need to be verified.
  • Nota bene: sync_connect() (in socket-stream and client-session) are today formally non-blocking, because it's non-networked IPC; but there are plans to make it networked-IPC and those guys in that case will become potentially-blocking. At that point we could leverage built-in EINTR probably; details TBD. Just keep it in mind.

Ideally all three -- and any future blocking impls -- should get interrupted if-and-only-if a built-in blocking call such as ::read() or ::poll() would be interrupted, on signal. How they're interrupted is less important: whether it's EINTR or another Error_code, as long as it's documented (that said perhaps it's best to standardize around an error::Code of ours we'd add for this purpose).

I'll forego explaining how to impl this. The Posix_mq_handle guy is already there; but the other two are a different story. (For Bipc_mq_handle one could somehow leverage the already-existing Persistent_mq_handle::interrupt_*() APIs. But, then another thread would be necessary. Food for thought at any rate.) (For Channel, not sure, but we should be able to swing something; maybe an intermediary thread; signal => satisfy the promise.)

Lastly, but very much not leastly (sorry), users should be aware of the following mitigating facts, until this ticket is resolved:

  • Persistent_mq_handle:
    • This concept is not really primarily meant for direct use. It can be -- it is nicer than using the POSIX MQ or bipc::message_queue APIs directly, and it adds features including interruptibility to the latter -- but that's a bonus really; these exist to implement Blob_stream_mq_sender/receiver higher-level APIs, and those work just fine (there is no signal-interrupt issue): they (internally) avoid blocking calls except .wait_able(), and the latter is .interrupt_()ed if needed (namely from dtor).
    • If you do use it:
      • interrupt_*() is available to you. So if it is necessary to interrupt a wait or blocking-send/receive, then you can use a separate thread to run it in and invoke interrupt_*() from a signal handler. This isn't no-fuss, but it does work if needed at least.
      • With Posix_mq_handle, you do get the nice normal EINTR behavior already.
      • timed_*() are also available: You could use those with a short, but not-too-short, timeout -- in which case a signal can affect them, but only after a bit of lag; could be fine. This is the usual approach... not perfect but doable and pretty simple. (A sketch of this pattern appears at the end of this section.)
  • struc::Channel::sync_request():
    • This guy is only blocking if locally there is no expect_msg[s]() pending on the opposing side (for future speculated sync_expect_msg() API: if there is no already-received-and-queued-inside in-message of the proper type, or one isn't forthcoming immediately).
    • But in many -- most -- cases protocols are set up in such a way as to in fact make sync_request() non-blocking: It is expected the opposing side has set-up their expect_msg[s]() by the time we issue the sync_request(). That's just good for responsive IPC; and it's nice not to need an async call like async_request(), where a non-blocking one would do much more simply.
      • For example we use sync_request() internally in other parts of Flow-IPC this way routinely.
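Finally, a hypothetical sketch of the timed_*()-with-short-timeout mitigation mentioned under Persistent_mq_handle above. Mq and timed_receive() here merely stand in for the real Persistent_mq_handle API (they are not its actual signatures); the point is only the pattern: a short timeout plus a signal-set flag.

#include <atomic>
#include <chrono>
#include <csignal>

std::atomic<bool> g_interrupted{false};
extern "C" void on_sigterm(int) { g_interrupted.store(true); } // Install via std::signal(SIGTERM, on_sigterm).

// `Mq`/`Blob` and timed_receive()-returning-bool are assumptions for illustration only.
template<typename Mq, typename Blob>
bool receive_or_interrupted(Mq& mq, Blob* blob)
{
  while (!g_interrupted.load())
  {
    if (mq.timed_receive(blob, std::chrono::milliseconds(250))) { return true; } // Short, not-too-short lag.
  }
  return false; // A signal arrived; we reacted within ~250 ms.
}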

ipc + likely flow: CI pipeline: Sporadic/awful TSAN+ASAN runtime install-fail due to "unexpected memory mapping 0x...-0x...".

Sometimes, but especially over the past couple of weeks for some reason, the fancy GitHub Actions CI build/test workflow fails in ASAN- and/or TSAN-enabled build configs. Almost always the failure occurs before any of our code even builds -- namely during the build-dependencies phase (conan install). Exactly once I saw it when running some random test binary.

Example: https://github.com/Flow-IPC/ipc/actions/runs/8336944527/job/22819113854

At first it looked like jemalloc failed during configure; or m4 failed during its configure. As usual default configure output does not help. But once one hacks together capturing config.log, the problem is clear:

configure:5928: ./conftest
FATAL: ThreadSanitizer: unexpected memory mapping 0x767804872000-0x767804d00000

From memory, the one time it was an actual test program failing (after deps built fine), it was just like that: Nothing at all executed, visibly at least, and instead an error like above was shown, program exiting.

This led to StackOverflow, namely

https://stackoverflow.com/questions/77850769/fatal-threadsanitizer-unexpected-memory-mapping-when-running-on-linux-kernels

which led to work-around

https://stackoverflow.com/questions/5194666/disable-randomization-of-memory-addresses

Essentially, something involving memory addresses being randomized "tickles" a clang-TSAN/ASAN problem/bug. According to the first link the fix would be in clang-18; but we cannot simply drop testing of clang-15 through 17. The work-around is not 100% ideal but seems essentially okay. So prepending

setarch $(uname -m) -R

to commands that'll invoke sanitizer-enabled executables (there are lots of those) will make it go away.

So far, at least, doing so for the problematic conan install command did indeed make it go away.

Since this is diagnosed, here are the tasks to resolve this:

  • Apply the work-around, only for ASAN and TSAN build configs in the pipeline, for ipc/.../main.yml and flow/.../main.yml.
    • See if we can instead reduce the randomization bits as suggested in the first link. It might involve a sudo sysctl which may not be available. Just try it; this would be less of a change from vanilla *SAN runs, which is probably better than turning off the randomization thing.
  • Consider opportunistically adding config.log capturing during build-deps step of ipc/ and flow/ main.yml pipelines. Then next time some horrible thing like this happens, where configure is mysteriously messing up, we will have a log to look at in the artifacts instead of having to haxor it.

This Issue is fairly high-priority, as it's been wreaking havoc lately in the pipeline results -- which were ever-so-clean up to that point.

ipc_*: Formal test coverage analysis.

Filed by @ygoldfeld pre-open-source:

We should formally analyze coverage of the source code. Consider doing so both before and after #83.

  • Before: So we know how to do it and have a good idea of how much we actually lack (the current text in #83 description is my subjective assessment).
  • After: So we know what might still be lacking; and possibly to set up automatic periodic coverage analysis (perhaps for each PR).

ipc_{core|transport_structured|session|shm}: Unit tests.

Filed by @ygoldfeld pre-open-source:

unit_test (and its helpers) extensively test ipc_shm_arena_lend (a/k/a IPC: SHM-jemalloc), which is the module dependent on all other modules. @echan-dev wrote this as a unit test (within GoogleTest framework), and in addition in my opinion several of the higher-level tests, such as the session-test, are effectively integration-testing all of the dependencies: ipc_core, ipc_transport_structured (it is used by tests for their purposes as well as inside ipc_shm_arena_lend impl internally), ipc_session (used to set-up interprocess communication); though ipc_shm not so much.

transport_test is an extensive integration test of all of the above. The scripted mode focuses on ipc_core (and is extensible, both with new commands and more scripts), while the exercise mode focuses on APIs in ipc_session, ipc_transport_structured, ipc_shm, and ipc_shm_arena_lend. Additional functional testing is in perf_demo as of this writing; surely there will be more.

In that sense IMO, subjectively speaking, the test coverage is pretty damned good, and my confidence in correctness is at a high level.

However, the ipc_* modules other than ipc_shm_arena_lend really should have direct representation in unit_test in the same way ipc_shm_arena_lend (a/k/a IPC: SHM-jemalloc) enjoys. I.e., more or less every class/function – at least public/protected APIs – should have a GoogleTest, similarly to what Mr. Chan did.

ipc: TSAN/esp clang-17 + some transport_test and ~1 unit_test: strange output/hang/pause ("PLEASE submit a bug report").

Filed by @ygoldfeld pre-open-source:

Environments:

  • my (@ygoldfeld) local clang-17 (LLVM libc++ replacing GNU libstdc++; probably irrelevant) env.
  • GitHub Flow-IPC pipeline.

Observed:

  • my env: never so far.
  • GitHub pipeline: yes; for me only with clang-17 (not clang-15, not clang-16: they have been fine):
  • transport_test exercise mode, SHM-jemalloc sub-mode (not SHM-classic, not heap; and not scripted mode);
  • the specific unit_test: Jemalloc_shm_pool_collection_test.Multithread_load (all others are fine, except Shm_session_data_test.Multithread_object_database was disabled due to test bug – it might trigger it too, based on @kinokrt comments; UPDATE - it is fixed and no longer disabled, might trigger this, don't know).

The problem itself presents as follows: console prints:

==76700==WARNING: Can't read from symbolizer at fd 8
==76700==WARNING: Can't write to symbolizer at fd 44
LLVM ERROR: Sections with relocations should have an address of 0
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /usr/bin/llvm-symbolizer-17 --demangle --inlines --default-arch=x86_64
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  libLLVM-17.so.1    0x00007f4efaacc406 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 54
1  libLLVM-17.so.1    0x00007f4efaaca5b0 llvm::sys::RunSignalHandlers() + 80
2  libLLVM-17.so.1    0x00007f4efaacca9b
3  libc.so.6          0x00007f4ef9642520
4  libc.so.6          0x00007f4ef96969fc pthread_kill + 300
5  libc.so.6          0x00007f4ef9642476 raise + 22
6  libc.so.6          0x00007f4ef96287f3 abort + 211
7  libLLVM-17.so.1    0x00007f4efaa2eb15 llvm::report_fatal_error(llvm::Twine const&, bool) + 437
8  libLLVM-17.so.1    0x00007f4efaa2e956
9  libLLVM-17.so.1    0x00007f4efc1b3002
10 libLLVM-17.so.1    0x00007f4efc3855a8 llvm::DWARFContext::create(llvm::object::ObjectFile const&, llvm::DWARFContext::ProcessDebugRelocations, llvm::LoadedObjectInfo const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::function<void (llvm::Error)>, std::function<void (llvm::Error)>) + 4328
11 libLLVM-17.so.1    0x00007f4efc517fcf llvm::symbolize::LLVMSymbolizer::getOrCreateModuleInfo(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) + 2479
12 libLLVM-17.so.1    0x00007f4efc5147aa llvm::Expected<llvm::DIGlobal> llvm::symbolize::LLVMSymbolizer::symbolizeDataCommon<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, llvm::object::SectionedAddress) + 58
13 libLLVM-17.so.1    0x00007f4efc514769 llvm::symbolize::LLVMSymbolizer::symbolizeData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, llvm::object::SectionedAddress) + 9
14 llvm-symbolizer-17 0x0000557435c89893
15 llvm-symbolizer-17 0x0000557435c884ef
16 llvm-symbolizer-17 0x0000557435c87860
17 libc.so.6          0x00007f4ef9629d90
18 libc.so.6          0x00007f4ef9629e40 __libc_start_main + 128
19 llvm-symbolizer-17 0x0000557435c85905
==76700==WARNING: Can't read from symbolizer at fd 8
==76700==WARNING: Can't write to symbolizer at fd 15
LLVM ERROR: Sections with relocations should have an address of 0
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /usr/bin/llvm-symbolizer-17 --demangle --inlines --default-arch=x86_64
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  libLLVM-17.so.1    0x00007f7cdf4cc406 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 54
1  libLLVM-17.so.1    0x00007f7cdf4ca5b0 llvm::sys::RunSignalHandlers() + 80
2  libLLVM-17.so.1    0x00007f7cdf4cca9b
3  libc.so.6          0x00007f7cde042520
4  libc.so.6          0x00007f7cde0969fc pthread_kill + 300
5  libc.so.6          0x00007f7cde042476 raise + 22
6  libc.so.6          0x00007f7cde0287f3 abort + 211
7  libLLVM-17.so.1    0x00007f7cdf42eb15 llvm::report_fatal_error(llvm::Twine const&, bool) + 437
8  libLLVM-17.so.1    0x00007f7cdf42e956
9  libLLVM-17.so.1    0x00007f7ce0bb3002
10 libLLVM-17.so.1    0x00007f7ce0d855a8 llvm::DWARFContext::create(llvm::object::ObjectFile const&, llvm::DWARFContext::ProcessDebugRelocations, llvm::LoadedObjectInfo const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::function<void (llvm::Error)>, std::function<void (llvm::Error)>) + 4328
11 libLLVM-17.so.1    0x00007f7ce0f17fcf llvm::symbolize::LLVMSymbolizer::getOrCreateModuleInfo(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) + 2479
12 libLLVM-17.so.1    0x00007f7ce0f147aa llvm::Expected<llvm::DIGlobal> llvm::symbolize::LLVMSymbolizer::symbolizeDataCommon<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, llvm::object::SectionedAddress) + 58
13 libLLVM-17.so.1    0x00007f7ce0f14769 llvm::symbolize::LLVMSymbolizer::symbolizeData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, llvm::object::SectionedAddress) + 9
14 llvm-symbolizer-17 0x00005586f49f8893
15 llvm-symbolizer-17 0x00005586f49f74ef
16 llvm-symbolizer-17 0x00005586f49f6860
17 libc.so.6          0x00007f7cde029d90
18 libc.so.6          0x00007f7cde029e40 __libc_start_main + 128
19 llvm-symbolizer-17 0x00005586f49f4905
==76700==WARNING: Can't read from symbolizer at fd 8
==76700==WARNING: Can't write to symbolizer at fd 15
LLVM ERROR: Sections with relocations should have an address of 0
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /usr/bin/llvm-symbolizer-17 --demangle --inlines --default-arch=x86_64
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  libLLVM-17.so.1    0x00007f12a6ccc406 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 54
1  libLLVM-17.so.1    0x00007f12a6cca5b0 llvm::sys::RunSignalHandlers() + 80
2  libLLVM-17.so.1    0x00007f12a6ccca9b
3  libc.so.6          0x00007f12a5842520
4  libc.so.6          0x00007f12a58969fc pthread_kill + 300
5  libc.so.6          0x00007f12a5842476 raise + 22
6  libc.so.6          0x00007f12a58287f3 abort + 211
7  libLLVM-17.so.1    0x00007f12a6c2eb15 llvm::report_fatal_error(llvm::Twine const&, bool) + 437
8  libLLVM-17.so.1    0x00007f12a6c2e956
9  libLLVM-17.so.1    0x00007f12a83b3002
10 libLLVM-17.so.1    0x00007f12a85855a8 llvm::DWARFContext::create(llvm::object::ObjectFile const&, llvm::DWARFContext::ProcessDebugRelocations, llvm::LoadedObjectInfo const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::function<void (llvm::Error)>, std::function<void (llvm::Error)>) + 4328
11 libLLVM-17.so.1    0x00007f12a8717fcf llvm::symbolize::LLVMSymbolizer::getOrCreateModuleInfo(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) + 2479
12 libLLVM-17.so.1    0x00007f12a87147aa llvm::Expected<llvm::DIGlobal> llvm::symbolize::LLVMSymbolizer::symbolizeDataCommon<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, llvm::object::SectionedAddress) + 58
13 libLLVM-17.so.1    0x00007f12a8714769 llvm::symbolize::LLVMSymbolizer::symbolizeData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, llvm::object::SectionedAddress) + 9
14 llvm-symbolizer-17 0x00005614bacd4893
15 llvm-symbolizer-17 0x00005614bacd34ef
16 llvm-symbolizer-17 0x00005614bacd2860
17 libc.so.6          0x00007f12a5829d90
18 libc.so.6          0x00007f12a5829e40 __libc_start_main + 128
19 llvm-symbolizer-17 0x00005614bacd0905
==76700==WARNING: Can't read from symbolizer at fd 8
2023-12-12 15:28:16.230671546 +0000 [info]: T7f36dc2ff640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(170): Destroying thread cache id 9
2023-12-12 15:28:16.249109513 +0000 [info]: T7f36dc2ff640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(179): Destroyed thread cache id 9
2023-12-12 15:28:16.284143943 +0000 [info]: T7f36cf7fa640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(170): Destroying thread cache id 2
2023-12-12 15:28:16.300969510 +0000 [info]: T7f36cf7fa640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(179): Destroyed thread cache id 2
2023-12-12 15:28:16.331467917 +0000 [info]: T7f36d07fb640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(170): Destroying thread cache id 3
2023-12-12 15:28:16.344776151 +0000 [info]: T7f36d07fb640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(179): Destroyed thread cache id 3
2023-12-12 15:28:16.631431376 +0000 [info]: T7f36d27fd640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(170): Destroying thread cache id 6
2023-12-12 15:28:16.632357574 +0000 [info]: T7f36d27fd640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(179): Destroyed thread cache id 6
2023-12-12 15:28:16.700983790 +0000 [info]: T7f36dedff640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(170): Destroying thread cache id 8
2023-12-12 15:28:16.702557775 +0000 [info]: T7f36dedff640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(179): Destroyed thread cache id 8
2023-12-12 15:28:16.722868851 +0000 [info]: T7f36d47ff640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(170): Destroying thread cache id 7
2023-12-12 15:28:16.724069064 +0000 [info]: T7f36d47ff640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(179): Destroyed thread cache id 7
2023-12-12 15:28:16.760789707 +0000 [info]: T7f36d37fe640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(170): Destroying thread cache id 4
2023-12-12 15:28:16.763157102 +0000 [info]: T7f36d37fe640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(179): Destroyed thread cache id 4
2023-12-12 15:28:16.772291545 +0000 [info]: T7f36d17fc640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(170): Destroying thread cache id 5
2023-12-12 15:28:16.774043434 +0000 [info]: T7f36d17fc640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(179): Destroyed thread cache id 5
2023-12-12 15:28:16.791601069 +0000 [info]: T7f36cd7f8640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(170): Destroying thread cache id 0
2023-12-12 15:28:16.793051823 +0000 [info]: T7f36cd7f8640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(179): Destroyed thread cache id 0
2023-12-12 15:28:16.815875854 +0000 [info]: T7f36ce7f9640: TEST-SHM: memory_manager.cpp:destroy_thread_cache(170): Destroying thread cache id 1
==76700==WARNING: Can't write to symbolizer at fd 15
==76700==WARNING: Failed to use and restart external symbolizer!
==================
WARNING: ThreadSanitizer: data race (pid=76700)
  Write of size 8 at 0x7f36e08e5140 by thread T289:
    #0 <null> <null> (libipc_unit_test.exec+0x76454d) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #1 <null> <null> (libipc_unit_test.exec+0x732a9b) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #2 <null> <null> (libipc_unit_test.exec+0x770bf4) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #3 <null> <null> (libipc_unit_test.exec+0x797533) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #4 <null> <null> (libipc_unit_test.exec+0x798fdd) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #5 <null> <null> (libipc_unit_test.exec+0x745dbb) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #6 <null> <null> (libipc_unit_test.exec+0x740872) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #7 <null> <null> (libipc_unit_test.exec+0x72cd58) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #8 <null> <null> (libipc_unit_test.exec+0x4fc012) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #9 <null> <null> (libipc_unit_test.exec+0x50ea6a) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #10 <null> <null> (libipc_unit_test.exec+0x50e7f3) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #11 <null> <null> (libipc_unit_test.exec+0x412ac9) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #12 <null> <null> (libipc_unit_test.exec+0x412f20) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #13 <null> <null> (libipc_unit_test.exec+0x4fe235) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #14 <null> <null> (libc.so.6+0x45d9e) (BuildId: a43bfc8428df6623cd498c9c0caeb91aec9be4f9)

  Previous read of size 8 at 0x7f36e08e5140 by thread T290 (mutexes: write M0, write M1):
    #0 <null> <null> (libipc_unit_test.exec+0x769d24) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #1 <null> <null> (libipc_unit_test.exec+0x76473e) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #2 <null> <null> (libipc_unit_test.exec+0x764564) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #3 <null> <null> (libipc_unit_test.exec+0x732a9b) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #4 <null> <null> (libipc_unit_test.exec+0x770bf4) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #5 <null> <null> (libipc_unit_test.exec+0x797275) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #6 <null> <null> (libipc_unit_test.exec+0x796720) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #7 <null> <null> (libipc_unit_test.exec+0x79a5b6) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #8 <null> <null> (libc.so.6+0x91690) (BuildId: a43bfc8428df6623cd498c9c0caeb91aec9be4f9)

  Mutex M0 (0x7f36e08032a8) created at:
    #0 <null> <null> (libipc_unit_test.exec+0xc19b0) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #1 <null> <null> (libipc_unit_test.exec+0x773e98) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #2 <null> <null> (libipc_unit_test.exec+0x762fdd) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #3 <null> <null> (libipc_unit_test.exec+0x738cdd) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #4 <null> <null> (libipc_unit_test.exec+0x720a43) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #5 <null> <null> (libipc_unit_test.exec+0x72d9b9) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #6 <null> <null> (libipc_unit_test.exec+0x73099a) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #7 <null> <null> (libipc_unit_test.exec+0x72d38b) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #8 <null> <null> (libc.so.6+0x29eba) (BuildId: a43bfc8428df6623cd498c9c0caeb91aec9be4f9)

  Mutex M1 (0x561bce9e3ca8) created at:
    #0 <null> <null> (libipc_unit_test.exec+0xc19b0) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #1 <null> <null> (libipc_unit_test.exec+0x773e98) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #2 <null> <null> (libipc_unit_test.exec+0x7740bd) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #3 <null> <null> (libipc_unit_test.exec+0x768c09) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #4 <null> <null> (libipc_unit_test.exec+0x72d90c) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #5 <null> <null> (libipc_unit_test.exec+0x73099a) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #6 <null> <null> (libipc_unit_test.exec+0x72d38b) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #7 <null> <null> (libc.so.6+0x29eba) (BuildId: a43bfc8428df6623cd498c9c0caeb91aec9be4f9)

  Thread T289 (tid=77197, running) created by main thread at:
    #0 <null> <null> (libipc_unit_test.exec+0xbffcb) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #1 <null> <null> (libstdc++.so.6+0xe6388) (BuildId: 2db998bd67acbfb235c464c0275d4070061695fb)
    #2 <null> <null> (libipc_unit_test.exec+0x644b76) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #3 <null> <null> (libipc_unit_test.exec+0x61c32d) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #4 <null> <null> (libipc_unit_test.exec+0x61dc86) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #5 <null> <null> (libipc_unit_test.exec+0x61ef64) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #6 <null> <null> (libipc_unit_test.exec+0x637499) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #7 <null> <null> (libipc_unit_test.exec+0x646017) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #8 <null> <null> (libipc_unit_test.exec+0x636c51) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #9 <null> <null> (libipc_unit_test.exec+0x48b154) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #10 <null> <null> (libc.so.6+0x29d8f) (BuildId: a43bfc8428df6623cd498c9c0caeb91aec9be4f9)

  Thread T290 (tid=77198, finished) created by main thread at:
    #0 <null> <null> (libipc_unit_test.exec+0xbffcb) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #1 <null> <null> (libstdc++.so.6+0xe6388) (BuildId: 2db998bd67acbfb235c464c0275d4070061695fb)
    #2 <null> <null> (libipc_unit_test.exec+0x644b76) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #3 <null> <null> (libipc_unit_test.exec+0x61c32d) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #4 <null> <null> (libipc_unit_test.exec+0x61dc86) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #5 <null> <null> (libipc_unit_test.exec+0x61ef64) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #6 <null> <null> (libipc_unit_test.exec+0x637499) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #7 <null> <null> (libipc_unit_test.exec+0x646017) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #8 <null> <null> (libipc_unit_test.exec+0x636c51) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #9 <null> <null> (libipc_unit_test.exec+0x48b154) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76)
    #10 <null> <null> (libc.so.6+0x29d8f) (BuildId: a43bfc8428df6623cd498c9c0caeb91aec9be4f9)

SUMMARY: ThreadSanitizer: data race (/home/runner/work/ipc/ipc/install/RelWithDebInfo/bin/libipc_unit_test.exec+0x76454d) (BuildId: 7992d1dbec863670d21d6bb09687761223088e76) 

What happens next depends. I've seen:

  • unit_test: It may or may not pause for a while, but it then prints many warnings in a row (around 20 of them, which is completely abnormal otherwise and makes no logical sense at that point; it is also not observed locally by me, nor by anyone I know of, on clang-15/16), all with nonsensical stack-trace lines as seen above. The test suite does eventually complete and even passes, including the triggering test. However, the warnings are still reported, the exit code is no longer 0, the warnings appear impossible to suppress (the frames are unsymbolized, so there is nothing for a suppressions file to match), and we can hardly trust TSAN's work past that point. Bottom line: there is no choice but to omit this test.
  • transport_test (with the above-listed options): I'll just paste (a possible run-time mitigation angle is sketched right after this excerpt):
# First the reason in detail: This run semi-reliably (50%+) fails at this point in the server binary:
#   2023-12-20 11:36:11.322479842 +0000 [info]: Tguy: ex_srv.hpp:send_req_b(1428): App_session [0x7b3800008180]:
#     Chan B[0]: Filling/send()ing payload (description = [reuse out-message + SHM-handle to modified (unless
#     SHM-jemalloc) existing STL data]; alt-payload? = [0]; reusing msg? = [1]; reusing SHM payload? = [1]).
#   LLVM ERROR: Sections with relocations should have an address of 0
#   PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
#   Stack dump:
#   0.  Program arguments: /usr/bin/llvm-symbolizer-17 --demangle --inlines --default-arch=x86_64
#   Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
#   ...
#   ==77990==WARNING: Can't read from symbolizer at fd 599
#   2023-12-20 11:36:31.592293322 +0000 [info]: Tguy: ex_srv.hpp:send_req_b(1547): App_session [0x7b3800008180]: Chan B[0]: Filling done.  Now to send.
# Sometimes the exact point is different, depending on timing; but in any case it is always the above
# TSAN/LLVM error, at which point the thread gets stuck for a long time (10+ seconds); but eventually gets
# unstuck; however transport_test happens to be testing a feature in a certain way so that a giant blocking
# operation in this thread delays certain processing, causes an internal timeout, and the test exits/fails.
# Sure, we could make some changes to the test for that to not happen, but that's beside the point: TSAN
# at run-time is trying to do something and fails terribly; I have no wish to try to work around that situation;
# literally it says "PLEASE submit a bug report [to clang devs]."
#
# TODO: Revisit; figure out how to not trigger this; re-enable.  For the record, I (ygoldfel) cannot reproduce
# in a local clang-17, albeit with libc++ (LLVM STL) instead of libstdc++ (GNU STL).  I've also tried to
# reduce optimization to -O1, as well as with and without LTO, and with and without -fno-omit-frame-pointer;
# same result.
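
Not a fix, but possibly a useful experiment while the above stands: both the stall and the noise happen inside TSAN's external-symbolizer machinery, and symbolization can be switched off entirely via the standard sanitizer run-time flags. Below is a minimal sketch, assuming we are willing to bake default options into the test binaries; the symbolize flag and the __tsan_default_options() hook are standard TSAN facilities, but whether disabling symbolization actually avoids the stall/timeout here is untested, so treat it as a hypothesis.

// Hypothetical sketch: baking TSAN run-time options into a test binary.
// __tsan_default_options() is the hook the TSAN run-time calls at startup if
// the binary defines it; TSAN_OPTIONS in the environment still overrides it.
extern "C" const char* __tsan_default_options()
{
  // symbolize=0 prevents TSAN from ever launching the external llvm-symbolizer,
  // which should sidestep the "Can't write to symbolizer" / "Sections with
  // relocations" errors, at the cost of unsymbolized (address-only) reports.
  return "symbolize=0";
}

Equivalently, the pipeline could export TSAN_OPTIONS=symbolize=0 for just the affected runs. Either way this only trades away diagnostics and does not explain the underlying LLVM error, so it would be a stopgap at best; hence the items below.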

We should:

  • Actually submit a bug report, ideally with a minimal repro case (which might also help us find a workaround), as the error message itself begs us to do; see the sketch right after this list for one candidate.
  • Look into it ourselves. Reading the LLVM source (where TSAN lives) might help; it has helped me with some other items.
  • Contact the LLVM/TSAN developers directly; they ask people to do that in various contexts.
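
For the first bullet, one candidate for a minimal repro, on the assumption that the crash is purely in llvm-symbolizer's handling of our binary layout and not in anything the tests actually do: any TSAN-instrumented binary built with the same flags that forces a symbolization should exercise the same path. __sanitizer_print_stack_trace() is the standard sanitizer hook for forcing one; the rest of the sketch (file name, exact build flags) is illustrative only and unverified against the affected toolchains.

// Hypothetical min-repro candidate for the "Sections with relocations should
// have an address of 0" error: a trivial TSAN-instrumented program that forces
// the run-time to invoke llvm-symbolizer.  Build it with the same knobs as the
// failing test binary (optimization level, LTO on/off, -fno-omit-frame-pointer)
// so the section/relocation layout is comparable.
// Build: clang++-17 -std=c++17 -O2 -g -fsanitize=thread tsan_symbolize_repro.cpp
#include <sanitizer/common_interface_defs.h>

int main()
{
  __sanitizer_print_stack_trace(); // triggers symbolization of the current stack
  return 0;
}

If llvm-symbolizer-17 chokes on that the same way, it is a tiny attachment for the LLVM issue; if it does not, the difference between that binary and our real test binaries is itself a lead. Failing that, feeding the real binary plus a code address straight to /usr/bin/llvm-symbolizer-17 --demangle --inlines --default-arch=x86_64 (the exact invocation in the stack dump above) might reproduce it with TSAN out of the picture entirely.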

Do note that I have tried a few things (see the last paragraph of the pasted excerpt above), but that is different from an investigation into the nitty-gritty of it. What is this relocation error actually about? We should find out.

The priority is medium. We still have TSAN coverage for the problematic tests (which are themselves only a subset of the suite), just not with that particular compiler, clang-17; and locally clang-17 works fine anyway.

Last point: TSAN is officially in beta, which is not said of ASAN, UBSAN, or even MSAN in their docs, so some level of problems is to be expected. That said, even though TSAN can be quite a pain in the butt with things like this, it is worth remembering that it has found tricky, real problems and is an extremely valuable tool. It is worth fighting for, so to speak.
