I remember <a class="user-mention notranslate" data-hovercard-type="user" data-hoverca

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

It is configured via network option, and in Java it would be <code class="notranslate"

Support for 3.x clusters to live migrate,about apple/foundationdb

Comments (106)

panghy commented on August 17, 2024 1

Confirmed to be working now (without my patch), thanks!

from foundationdb.

ajbeamon commented on August 17, 2024 1

I don't remember the details of our old Java bindings super well, but I don't think it will be necessary for you to dispose of the Transaction yourself, as the run loop should do it (and also, the watch can outlive the Transaction). Calling it inside the Transaction will interfere with the retry loop, as you point out.

from foundationdb.

ajbeamon commented on August 17, 2024

The change would actually be in the code you are upgrading to. Here is the line that prevents any external clients from being loaded that don't support API version 400:

foundationdb/fdbclient/MultiVersionTransaction.actor.cpp

Line 314 in 3cea71f

headerVersion = std::max(apiVersion, 400);

If it were changed to 300, then it should be able to load a 3.0 client library. I tried this in a simple test locally and it seemed to work, and I also didn't see anything obvious that should prevent it from working at that API version.

I don't remember offhand exactly why we don't support earlier versions, but my guess is that it first of all allowed us to avoid doing some validation of the safety of upgrading from older versions (and for even older versions, we don't have to deal with ancient changes to the C API).

Also, our reporting mechanisms don't have a way to distinguish client versions before 4.1, and the easiest way to skirt that problem was to only support clients >=4.0. Then we can assume clients with unknown versions are running 4.0, and that they are necessarily upgrading to a version that we can ascertain. This matters because we like to verify that all clients support the new cluster version prior to upgrading, and that's not possible if a client can only tell us that it supports unknown versions.

Based on the above, we haven't determined yet if it would make sense to make an official change to this code to support earlier versions in general, but it should be possible to build an alternate version of a released fdb_c client library that changes this restriction. Using it, you would then update your clients to use this new library and configure it to load the 3.x client as an external library. Then you can upgrade the cluster to whichever 5.x version you're using. After that, it should be safe to replace the custom client library with the official one.

from foundationdb.

panghy commented on August 17, 2024

Awesome, I'll give it a try

from foundationdb.

panghy commented on August 17, 2024

@ajbeamon It didn't work for me when I tried. What I did was to make the deb packages for server and client and plop them on a machine (which restarted the cluster). Then I put 3.0.10 on another machine (just the client). Both fdbcli and a simple java app fails.

from foundationdb.

panghy commented on August 17, 2024

Oh, or should it have been the other way around, that the client can talk to older clusters?

from foundationdb.

panghy commented on August 17, 2024

Nope, the other way around doesn't work either (old 3.0.10 cluster, 5.2.0 client with the apiVersion change).

from foundationdb.

ajbeamon commented on August 17, 2024

To use the multi-version client, you'll start with a cluster and client running an old version (say, 3.0.10). You first update your clients to use the new version of fdb_c (say, 5.2.0) and configure them through network options to load a copy of the 3.0.10 fdb_c client. At this point, your clients should be able to connect to the 3.0.10 cluster. It should also be possible in our JSON status to confirm that each of your clients supports both 3.0.10 and 5.2.0 by looking in the connected_clients section.

Next, you would upgrade all of your server processes at the same time (or as close as you can manage). At this point, your clients should all continue to work, as they will automatically switch from using the 3.0.10 fdb_c client to the 5.2.0 one. You can then remove the configuration from your clients to include the 3.0.10 fdb_c client, as it will not be used in the future.

See https://github.com/apple/foundationdb/blob/master/documentation/sphinx/source/api-general.rst#multi-version-client for our current documentation of this, including the relevant network options. The most common way that I've seen these deployed is to use our new capability to set network options via environment variables by setting FDB_NETWORK_OPTION_EXTERNAL_CLIENT_DIRECTORY=<dir_containing_old_fdb_c>. I don't see it currently documented, but I think it may be necessary for you to rename the old version of fdb_c (e.g. libfdb_c_3.0.10.so) for it to be properly loaded.

from foundationdb.

panghy commented on August 17, 2024

Ah, got it, I assume those are environmental variables (so will work regardless of whether we are running in java or not). I also assume that fdbcli will be able to understand those options too? Can I also configure this programmatically via the NetworkOptions object in java?

Thanks for the help!

from foundationdb.

ajbeamon commented on August 17, 2024

It is configured via network option, and in Java it would be fdb.options().setExternalClientDirectory("/path/to/dir"). I don't think I can currently link to the documentation for that function, as that code is autogenerated, but you can look at the option definition here:

foundationdb/fdbclient/vexillographer/fdb.options

Line 89 in d5e6e60

<Option name="external_client_directory" code="63"

The example I gave above is using an environment variable to set the network option with FDB_NETWORK_OPTION_EXTERNAL_CLIENT_DIRECTORY, so you could use it in your Java program. In general, you can set any network option through environment variables using the FDB_NETWORK_OPTION_ prefix. This of course has the benefit of not requiring a code change to set the option.

It turns out that fdbcli is a special sort of database client that does not support the multi-version API. It will be necessary to start a new instance at the updated version after you upgrade.

from foundationdb.

panghy commented on August 17, 2024

Got it!

from foundationdb.

panghy commented on August 17, 2024

Alright, I have tried a couple of things but a java client running 5.2.0 on a machine that has the patched client doesn't seem to connect, the trace file (on the client shows):

(10.254.0.45:4500 is the single fdb node)

<Event Severity="10" Time="1521878659.272863" Type="ConnectingTo" Machine="10.254.0.33:14477" ID="0000000000000000" PeerAddr="10.254.0.45:4500" SuppressedEventCount="1" logGroup="default"/>
<Event Severity="10" Time="1521878659.272863" Type="ConnectionExchangingConnectPacket" Machine="10.254.0.33:14477" ID="232557b301302f4f" PeerAddr="10.254.0.45:4500" SuppressedEventCount="1" logGroup="default"/>
<Event Severity="20" Time="1521878659.273319" Type="N2_ReadError" Machine="10.254.0.33:14477" ID="232557b301302f4f" Message="2" SuppressedEventCount="1" logGroup="default"/>
<Event Severity="10" Time="1521878659.273319" Type="ConnectionClosed" Machine="10.254.0.33:14477" ID="232557b301302f4f" PeerAddr="10.254.0.45:4500" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="1" logGroup="default"/>

I am sure that the library option is read (since if you put a wrong external library path, the init does fail). I have also tried making FDB.selectAPIVersion accept 300 (instead of 510-520 in the current java client).

Essentially the following were attempted:

3.0.10 java client running on a machine with 5.2.0 fdb_c (installed via client .deb package, patched with headerVersion = std::max(apiVersion, 300)). FDB.selectApiVersion(300) with environment param of FDB_NETWORK_OPTION_EXTERNAL_CLIENT_LIBRARY set to libfdb_c_3.0.10.so. => Times out on getting a transaction.
5.2.0 java client with selectApiVersion(300) (requires patching the java release) => also times out
5.2.0 java client with selectApiVersion(510) => also times out

In every case, the cluster is a single node cluster running 3.0.10 (both client and server .deb).

The error message seems to suggest to me that there's some kind of failure at the network handshake level.

from foundationdb.

ajbeamon commented on August 17, 2024

I forgot to mention that you won't be able to update your API version in the client code until after the upgrade, so the intent was that you go with option 1.

I tried testing this again, and it worked normally with the Python bindings. When I tried with the 3.0 Java bindings, though, I discovered another wrinkle that's specific to them. My result was actually pretty different that what you experienced, however, as my 3.0 Java bindings could connect to the 3.0 cluster regardless of what client I had on the library path (or even if there wasn't one at all).

The reason for this is that in the 3.0 Java bindings (or at least the ones I was using), the client library is included in the JAR file. To be able to use the multi-version client, then, I had to replace the client library in the archive with my custom 5.2 client library. Once I did that and set the external client environment variable, I was able to connect to the cluster without issue.

Your situation sounds a bit different, so let's try to figure out what's going on there. First, can you confirm that this setup all works when everything is running at version 3.0.10? In other words, did this test run prior to upgrading the client .deb package on the client machine?

Next, I'm curious whether your 3.0.10 JAR includes fdb_c in it at the path lib/linux/amd64/libfdb_c.so. If it does and is using it, then I wouldn't expect your client to be using the 5.2 lib at all even after installing the package.

If your Java bindings are able to use the 5.2 lib, the next thing to check is whether your client is logging two separate trace files, one for each version. The trace file snippet above looks like it's coming from the 5.2 client, so it's expected that you see connection errors in it when trying to connect to a 3.0 cluster. If both clients are loaded, though, there should be an additional trace file for the 3.0 client. In 3.0, the trace file naming for clients was a bit different, so I think it will show up as something like trace.000.000.000.000... rather than having an IP such as trace.127.000.000.001... Are you getting both of these trace files when running with a 3.0.10 Java client?

from foundationdb.

panghy commented on August 17, 2024

Thanks for the tip. I am on a flight but I can confirm that the jar files do have the .so and .dylib files in them (we compile with the fatjar option). I can do some surgery to hack in the 5.2 .so file into the 3.0.10 jar (actually we are internally at a higher version but we never touched the api version).

from foundationdb.

panghy commented on August 17, 2024

I believe there's a different issue with just swapping out the .so file from the jar:

Caused by: java.lang.UnsatisfiedLinkError: com.foundationdb.FDB.Select_API_version(I)V
	at com.foundationdb.FDB.Select_API_version(Native Method)
	at com.foundationdb.FDB.selectAPIVersion(FDB.java:132)

My guess is that the namespace has changed. Perhaps in your tests, your 3.x JAR's native bindings are already in the proper namespace?

from foundationdb.

panghy commented on August 17, 2024

I guess before I undertake a massive "rename our 3.x code dump from com.foundationdb to com.apple.foundationdb", that's what you have on your 3.x branch? :)

from foundationdb.

ajbeamon commented on August 17, 2024

I was running with fdb-java-3.0.8, which is the most recent version of our Java bindings that I could easily find. It was released before we switched to com.apple.foundationdb, so like you it should be using com.foundationdb.

Did you also swap out the libfdb_java.so file in the JAR? That file should be left at the original version.

from foundationdb.

panghy commented on August 17, 2024

Oh I see, we only have the libfdb_java.so in our 3.0.10 and 5.2.0 java libraries. I swapped the one from 5.2.0 into 3.0.10 (but sounds like you are not referring to the JNI bindings but libfdb_c.so, which we don't have inside the JAR).

from foundationdb.

panghy commented on August 17, 2024

And to answer your previous question, with the 3.0.10 client deb package and a 3.0.10 java library, the machine can talk to a 3.x fdb cluster (with 3.0.10 client/server deb packages installed).

from foundationdb.

ajbeamon commented on August 17, 2024

When you try your test with the 3.0 Java bindings, a 5.2 client lib, and an external 3.0 client lib, do you get two sets of trace files on your client? If so, are there any interesting events in the logs of the 3.0 client?

from foundationdb.

panghy commented on August 17, 2024

There isn't AFAICT although the way I enable trace logs is programmatically in Java.

from foundationdb.

ajbeamon commented on August 17, 2024

Ok, then my best guess is that it's either not successfully finding the external client or unable to use it.

The former case is usually the result of some sort of configuration issue (e.g. having FDB_NETWORK_OPTION_EXTERNAL_CLIENT_LIBRARY not pointing directly to the library, FDB_NETWORK_OPTION_EXTERNAL_CLIENT_DIRECTORY not pointing to a directory containing the library, or using the directory option with libraries that don't have the expected extension -- .so in Linux).

Some ways the latter problem could occur are if the line of code pointed out above in MultiVersionTransaction.actor.cpp wasn't changed (e.g. you are inadvertently using a different version of the library), a newer API version than 300 was selected in your client, the API exposed by the client was different than expected (this doesn't appear to be a problem in 3.0.8), etc.

Sadly we don't have great logging around the early setup process because our trace logs aren't able to capture events from before trace logging is enabled. However, you could try looking for any interesting events in the early part of your trace file to see if it mentions any problems loading the external client library.

It may also be useful to try setting up the multi-version client with some modern versions (perhaps 5.1 and 5.2) to confirm that you can get it to work under more normal circumstances.

from foundationdb.

panghy commented on August 17, 2024

(thanks for all the help)

I am pretty sure the configuration is right since if the CLIENT_LIBRARY environment variable is pointed to a non-existent file, it won't initialize properly (it throws an exception). I am going to try and locate 3.0.8 and see if I can use that version instead of what we have.

Let me recap:

Patch MultiVersionTransaction.actor.cpp and recompile the client package (then install as .deb on the client machine)
The client java code is technically unchanged (only libfdb_java.so is in the jar file)
Put 3.0.8 libfdb_c.so in a directory (say /opt/foundationdb/libfdb_c.3.0.8.so)
Add the environment variable FDB_NETWORK_OPTION_EXTERNAL_CLIENT_LIBRARY and have it point to /opt/foundationdb/libfdb_c.3.0.8.so
Start the java app and have it call FDB.selectApiVersion(300), the code should continue to work from there

from foundationdb.

ajbeamon commented on August 17, 2024

Ah yes, you're correct that it would report an error if you tried to specify a client library that didn't exist.

Those steps all sound correct to me. To aid in debugging this further, I started working on support for logging trace events before the network is started and the log file opened. It's still a work in progress, so it's on my fork under the enable-trace-logging-before-open branch. I've given you read access to my repo, so you should be able to grab it from there. I wouldn't run with it on all the time just yet, but for the purposes of this investigation it might be helpful.

If you want to try this, you'll need to build a copy of your 5.2 client library with this change incorporated. Then, you should see some new trace event messages show up in your client logs before the ClientStart event. Here is the sequence of events I see for a successful connection over the 3.0.8 client with the Python bindings, which should be pretty similar to what you see:

<?xml version="1.0"?>
<Trace>
<Event Severity="10" Time="0.000000" Type="AddingExternalClient" Machine="0.0.0.0:0" ID="0000000000000000" LibraryPath="libfdb_c_3.0.8.so" logGroup=""/>
<Event Severity="10" Time="0.000000" Type="Net2Starting" Machine="0.0.0.0:0" ID="0000000000000000" logGroup=""/>
<Event Severity="10" Time="1522251262.482962" Type="InitializingExternalClient" Machine="0.0.0.0:0" ID="0000000000000000" LibraryPath="/home/aj/external/libfdb_c_3.0.8.so" logGroup=""/>
<Event Severity="20" Time="1522251262.482962" Type="LoadFunctionFailed" Machine="0.0.0.0:0" ID="0000000000000000" Function="fdb_get_client_version" Error="/home/aj/external/libfdb_c_3.0.8.so: undefined symbol: fdb_get_client_version" logGroup=""/>
<Event Severity="20" Time="1522251262.482962" Type="LoadFunctionFailed" Machine="0.0.0.0:0" ID="0000000000000000" Function="fdb_transaction_get_versionstamp" Error="/home/aj/external/libfdb_c_3.0.8.so: undefined symbol: fdb_transaction_get_versionstamp" logGroup=""/>
<Event Severity="10" Time="1522251262.482962" Type="CreatingClusterOnExternalClient" Machine="0.0.0.0:0" ID="0000000000000000" LibraryPath="/home/aj/external/libfdb_c_3.0.8.so" failed="0" logGroup=""/>
<Event Severity="10" Time="1522251262.482962" Type="Net2Running" Machine="0.0.0.0:0" ID="0000000000000000" logGroup=""/>
<Event Severity="10" Time="1522251262.482962" Type="ProcessTimeOffset" Machine="0.0.0.0:0" ID="0000000000000000" ProcessTime="1522251262.486520" SystemTime="1522251262.486623" OffsetFromSystemTime="-0.000103" logGroup=""/>
<Event Severity="10" Time="1522251262.486695" Type="ClientStart" Machine="..." ID="0000000000000000" SourceVersion="62726efa82757f6b4e9ae64acc4d5e4de4a3d24d" Version="5.2.0-PRERELEASE" PackageName="5.2" ClusterFile="/etc/foundationdb/fdb.cluster" ConnectionString="..." ActualTime="1522251262" ApiVersion="300" ImageOffset="0x7fa76a758000" logGroup="default" TrackLatestType="Original"/>

from foundationdb.

panghy commented on August 17, 2024

I haven't had a chance to try your change yet (mainly because I have been traveling for work to Armenia, Bulgaria and India). Today starts my 2 week vacation though so I'll have time finally :)

from foundationdb.

panghy commented on August 17, 2024

@ajbeamon sorry for the radio silence, i am finally back from vacation and sick leave, will be looking into this this week.

from foundationdb.

panghy commented on August 17, 2024

@ajbeamon Finally got the issue, posting here first before I dig deeper.

<Event Severity="10" Time="1524705927.728745" Type="InitializingExternalClient" Machine="0.0.0.0:0" ID="0000000000000000" LibraryPath="/opt/wavefront/fdb/libfdb_c_3.0.20.so" logGroup=""/>
<Event Severity="20" Time="1524705927.728745" Type="LoadLibraryFailed" Machine="0.0.0.0:0" ID="0000000000000000" Library="/opt/wavefront/fdb/libfdb_c_3.0.20.so" Error="/opt/wavefront/fdb/libfdb_c_3.0.20.so: undefined symbol: _ZN14ThreadCallback11addCallbackEPS_" logGroup=""/>
<Event Severity="40" Time="1524705927.728745" Type="ErrorLoadingExternalClientLibrary" Machine="0.0.0.0:0" ID="0000000000000000" LibraryPath="/opt/wavefront/fdb/libfdb_c_3.0.20.so" logGroup="" Backtrace="addr2line -e libfdb_c.so-debug -p -C -f -i 0x3bf2cc 0x3be442 0xb3c35 0xb3f53 0xb0746 0xbf2a2 0xbfabf 0xcfc90 0x9e089 0x24b90d19 0x1497f50a1 0x1497f50a1 0x1497f4e00 0x1497f50a1 0x1497f50a1 0x1497f4e00 0x14a185664 0x1497f4e00 0x1497f4e00 0x1497f50a1 0x1497f50a1 0x1497f4e00 0x1497f4e00 0x1497f4e00 0x1497f4e00 0x1497f4ee0 0x1497f4ee0 0x1497f4e00 0x1497f4ee0 0x1497f4ee0 0x1497f4ee0 0x1497f4e00 0x1497f4e00 0x1497f4e00 0x1497f4e00 0x1497f4ee0 0x1497f4ee0 0x1497f4e00 0x1497f4ee0 0x1497f4ee0 0x1497f\..."/>
<Event Severity="30" Time="1524705927.728745" Type="ExternalClientFailure" Machine="0.0.0.0:0" ID="0000000000000000" LibPath="/opt/wavefront/fdb/libfdb_c_3.0.20.so" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" logGroup=""/>
<Event Severity="10" Time="1524705927.728745" Type="CreatingClusterOnExternalClient" Machine="0.0.0.0:0" ID="0000000000000000" LibraryPath="/opt/wavefront/fdb/libfdb_c_3.0.20.so" failed="1" logGroup=""/>

from foundationdb.

panghy commented on August 17, 2024

Okay, I think that was the wrong library, now I got it to load:

<Event Severity="10" Time="0.000000" Type="AddingExternalClientDirectory" Machine="0.0.0.0:0" ID="0000000000000000" Directory="/opt/wavefront/fdb/" logGroup=""/>
<Event Severity="10" Time="0.000000" Type="AddingExternalClient" Machine="0.0.0.0:0" ID="0000000000000000" LibraryPath="libfdb_c.3.0.10.so" logGroup=""/>
<Event Severity="10" Time="0.000000" Type="Net2Starting" Machine="0.0.0.0:0" ID="0000000000000000" logGroup=""/>
<Event Severity="10" Time="1524706572.146162" Type="InitializingExternalClient" Machine="0.0.0.0:0" ID="0000000000000000" LibraryPath="/opt/wavefront/fdb/libfdb_c.3.0.10.so" logGroup=""/>
<Event Severity="20" Time="1524706572.146162" Type="LoadFunctionFailed" Machine="0.0.0.0:0" ID="0000000000000000" Function="fdb_get_client_version" Error="/opt/wavefront/fdb/libfdb_c.3.0.10.so: undefined symbol: fdb_get_client_version" logGroup=""/>
<Event Severity="20" Time="1524706572.146162" Type="LoadFunctionFailed" Machine="0.0.0.0:0" ID="0000000000000000" Function="fdb_transaction_get_versionstamp" Error="/opt/wavefront/fdb/libfdb_c.3.0.10.so: undefined symbol: fdb_transaction_get_versionstamp" logGroup=""/>
<Event Severity="10" Time="1524706572.146162" Type="CreatingClusterOnExternalClient" Machine="0.0.0.0:0" ID="0000000000000000" LibraryPath="/opt/wavefront/fdb/libfdb_c.3.0.10.so" failed="0" logGroup=""/>

But not it throws an internal_error

<Event Severity="40" Time="1524706572.188166" Type="InternalError" Machine="0.0.0.0:0" ID="0000000000000000" Error="internal_error" ErrorCode="4100" File="fdbclient/ReadYourWrites.actor.cpp" Line="540" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x7f7a422445a7 0x7f7a42243a5d 0x7f7a421e5c29 0x7f7a421be46e 0x7f7a421bea5a 0x7f7a421bf38b 0x7f7a421bf8e0 0x7f7a421c351b 0x7f7a421c3a0f 0x7f7a421c3afa 0x7f7a421925e2 0x7f7a4217165f 0x7f7a42171818 0x7f7a42217468 0x7f7a4220ba42 0x7f7a4209472d 0x7f7a4208b8e9 0x7f7a429eb97b 0x7f7a429e9af0 0x7f7b99aa6184 0x7f7b993b903d"/>
<Event Severity="40" Time="1524706572.188166" Type="SystemError" Machine="0.0.0.0:0" ID="0000000000000000" Error="internal_error" ErrorCode="4100" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x7f7a422445a7 0x7f7a42243a5d 0x7f7a421e5a90 0x7f7a421e5c4b 0x7f7a421be46e 0x7f7a421bea5a 0x7f7a421bf38b 0x7f7a421bf8e0 0x7f7a421c351b 0x7f7a421c3a0f 0x7f7a421c3afa 0x7f7a421925e2 0x7f7a4217165f 0x7f7a42171818 0x7f7a42217468 0x7f7a4220ba42 0x7f7a4209472d 0x7f7a4208b8e9 0x7f7a429eb97b 0x7f7a429e9af0 0x7f7b99aa6184 0x7f7b993b903d"/>
<Event Severity="40" Time="1524706572.190873" Type="InternalError" Machine="0.0.0.0:0" ID="0000000000000000" Error="internal_error" ErrorCode="4100" File="fdbclient/ReadYourWrites.actor.cpp" Line="540" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x7f7a422445a7 0x7f7a42243a5d 0x7f7a421e5c29 0x7f7a421be46e 0x7f7a421bea5a 0x7f7a421bf38b 0x7f7a421bf8e0 0x7f7a421c351b 0x7f7a421c3a0f 0x7f7a421c3afa 0x7f7a421925e2 0x7f7a4217165f 0x7f7a42171818 0x7f7a42217468 0x7f7a4220ba42 0x7f7a4209472d 0x7f7a4208b8e9 0x7f7a429eb97b 0x7f7a429e9af0 0x7f7b99aa6184 0x7f7b993b903d"/>
<Event Severity="40" Time="1524706572.190873" Type="SystemError" Machine="0.0.0.0:0" ID="0000000000000000" Error="internal_error" ErrorCode="4100" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x7f7a422445a7 0x7f7a42243a5d 0x7f7a421e5a90 0x7f7a421e5c4b 0x7f7a421be46e 0x7f7a421bea5a 0x7f7a421bf38b 0x7f7a421bf8e0 0x7f7a421c351b 0x7f7a421c3a0f 0x7f7a421c3afa 0x7f7a421925e2 0x7f7a4217165f 0x7f7a42171818 0x7f7a42217468 0x7f7a4220ba42 0x7f7a4209472d 0x7f7a4208b8e9 0x7f7a429eb97b 0x7f7a429e9af0 0x7f7b99aa6184 0x7f7b993b903d"/>

from foundationdb.

panghy commented on August 17, 2024

The InternalError happens on a different trace file. Similar lines also appear on the "main" trace file (the one with the machine IP address).

from foundationdb.

panghy commented on August 17, 2024

I think it might be the assert that are failing:

ASSERT( requestLimit.hasRowLimit() || requestLimit.hasByteLimit() );

in ReadYourWrites.actor.cpp of the 3.x code

from foundationdb.

ajbeamon commented on August 17, 2024

If the 5.x client is also logging internal errors, can you paste one here? We can try to see what the backtrace is from that. It can probably also be done for the 3.x backtrace, but because of the way it gets logged it can take some effort to do.

from foundationdb.

panghy commented on August 17, 2024

It is, I'll get them to you asap. Thanks.

…

On Wed, Apr 25, 2018, 7:09 PM A.J. Beamon ***@***.***> wrote: If the 5.x client is also logging internal errors, can you paste one here? We can try to see what the backtrace is from that. It can probably also be done for the 3.x backtrace, but because of the way it gets logged it can take some effort to do. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#112 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAgKM8YAxfgEdw3iwQprFNfKYHtFqTt6ks5tsSxhgaJpZM4S2jh4> .

from foundationdb.

panghy commented on August 17, 2024

Pretty much the full log from the 5.x client.

<Event Severity="10" Time="0.000000" Type="AddingExternalClientDirectory" Machine="0.0.0.0:0" ID="0000000000000000" Directory="/opt/wavefront/fdb/" logGroup=""/>
<Event Severity="10" Time="0.000000" Type="AddingExternalClient" Machine="0.0.0.0:0" ID="0000000000000000" LibraryPath="libfdb_c.3.0.10.so" logGroup=""/>
<Event Severity="10" Time="0.000000" Type="Net2Starting" Machine="0.0.0.0:0" ID="0000000000000000" logGroup=""/>
<Event Severity="10" Time="1524721878.279425" Type="InitializingExternalClient" Machine="0.0.0.0:0" ID="0000000000000000" LibraryPath="/opt/wavefront/fdb/libfdb_c.3.0.10.so" logGroup=""/>
<Event Severity="20" Time="1524721878.279425" Type="LoadFunctionFailed" Machine="0.0.0.0:0" ID="0000000000000000" Function="fdb_get_client_version" Error="/opt/wavefront/fdb/libfdb_c.3.0.10.so: undefined symbol: fdb_get_client_version" logGroup=""/>
<Event Severity="20" Time="1524721878.279425" Type="LoadFunctionFailed" Machine="0.0.0.0:0" ID="0000000000000000" Function="fdb_transaction_get_versionstamp" Error="/opt/wavefront/fdb/libfdb_c.3.0.10.so: undefined symbol: fdb_transaction_get_versionstamp" logGroup=""/>
<Event Severity="10" Time="1524721878.279425" Type="CreatingClusterOnExternalClient" Machine="0.0.0.0:0" ID="0000000000000000" LibraryPath="/opt/wavefront/fdb/libfdb_c.3.0.10.so" failed="0" logGroup=""/>
<Event Severity="10" Time="1524721878.279425" Type="Net2Running" Machine="0.0.0.0:0" ID="0000000000000000" logGroup=""/>
<Event Severity="10" Time="1524721878.279425" Type="ProcessTimeOffset" Machine="0.0.0.0:0" ID="0000000000000000" ProcessTime="1524721878.284168" SystemTime="1524721878.284170" OffsetFromSystemTime="-0.000002" logGroup=""/>
<Event Severity="10" Time="1524721878.284212" Type="ClientStart" Machine="10.254.0.33:55061" ID="0000000000000000" SourceVersion="8429bed2c5c9ea9f92df5247f25a62aa9756ee2a" Version="5.2.0-PRERELEASE" PackageName="5.2" ClusterFile="/etc/foundationdb/fdb.cluster" ConnectionString="telemetry:[email protected]:4500" ActualTime="1524721878" ApiVersion="300" ImageOffset="0x7fe99234b000" logGroup="default" TrackLatestType="Original"/>
<Event Severity="10" Time="1524721878.284212" Type="MachineLoadDetail" Machine="10.254.0.33:55061" ID="0000000000000000" User="14215806" Nice="125374230" System="14509778" Idle="19306693215" IOWait="818799" IRQ="20420" SoftIRQ="4017881" Steal="6874212" Guest="0" logGroup="default"/>
<Event Severity="10" Time="1524721878.284212" Type="MonitorLeaderChange" Machine="10.254.0.33:55061" ID="0000000000000000" NewLeader="0000000000000001" logGroup="default"/>
<Event Severity="10" Time="1524721878.286101" Type="MonitorLeaderChange" Machine="10.254.0.33:55061" ID="0000000000000000" NewLeader="0000000000000001" logGroup="default"/>
<Event Severity="10" Time="1524721878.286649" Type="ConnectingTo" Machine="10.254.0.33:55061" ID="0000000000000000" PeerAddr="10.254.0.45:4500" SuppressedEventCount="0" logGroup="default"/>
<Event Severity="10" Time="1524721878.286649" Type="ConnectionExchangingConnectPacket" Machine="10.254.0.33:55061" ID="61708e5588b5babd" PeerAddr="10.254.0.45:4500" SuppressedEventCount="0" logGroup="default"/>
<Event Severity="20" Time="1524721878.287782" Type="N2_ReadError" Machine="10.254.0.33:55061" ID="61708e5588b5babd" Message="2" SuppressedEventCount="0" logGroup="default"/>
<Event Severity="10" Time="1524721878.287782" Type="ConnectionClosed" Machine="10.254.0.33:55061" ID="61708e5588b5babd" PeerAddr="10.254.0.45:4500" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" logGroup="default"/>
<Event Severity="40" Time="1524721878.323799" Type="SystemError" Machine="10.254.0.33:55061" ID="0000000000000000" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" logGroup="default" Backtrace="addr2line -e libfdb_c.so-debug -p -C -f -i 0x3bf2cc 0x3be442 0x37157d 0xe3bec 0xca019 0x180278 0x38f7b3 0xec728 0xc041c 0x9ebd9 0xffffffffffdfdd66 0x13ecbd0a1 0x13ecbd0a1 0x13ecbd0e6 0x13ecbc850 0x13ecbcac0 0x13ecbd0a1 0x13ecbd0a1 0x13ecb54e7 0x1560626b6 0x156062bc1 0x156063067 0x1560f9fc0 0x15644643f 0x15644656c 0x1562f7698 0x1571ac184"/>
<Event Severity="40" Time="1524721878.325478" Type="SystemError" Machine="10.254.0.33:55061" ID="0000000000000000" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" logGroup="default" Backtrace="addr2line -e libfdb_c.so-debug -p -C -f -i 0x3bf2cc 0x3be442 0x37157d 0xe3bec 0xca019 0x180278 0x38f7b3 0xec728 0xc041c 0x9ebd9 0xffffffffffdfdd66 0x13ecbd0a1 0x13ecbd0a1 0x13ecbd0e6 0x13ecbc850 0x13ecbcac0 0x13ecbd0a1 0x13ecbd0a1 0x13ecb54e7 0x1560626b6 0x156062bc1 0x156063067 0x1560f9fc0 0x15644643f 0x15644656c 0x1562f7698 0x1571ac184"/>
<Event Severity="40" Time="1524721878.326216" Type="SystemError" Machine="10.254.0.33:55061" ID="0000000000000000" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" logGroup="default" Backtrace="addr2line -e libfdb_c.so-debug -p -C -f -i 0x3bf2cc 0x3be442 0x37157d 0xe0d68 0xc8169 0x180278 0x38f7b3 0xec728 0xc041c 0x9ebd9 0xffffffffffdfdd66 0x13ecbd0a1 0x13ecbd0a1 0x13ecbd0e6 0x13ecbc850 0x13ecbcac0 0x13ecbd0a1 0x13ecbd0a1 0x13ecb54e7 0x1560626b6 0x156062bc1 0x156063067 0x1560f9fc0 0x15644643f 0x15644656c 0x1562f7698 0x1571ac184"/>

from foundationdb.

ajbeamon commented on August 17, 2024

Hmm, it's interesting that your 5.x trace log doesn't have any InternalError events. Can you try running the listed backtrace command against your copy of libfdb_c.so-debug? I'm not sure in which ways it's different from the 5.2.0 version in the repo, so it's probably best for you to do it. This is the command:

addr2line -e libfdb_c.so-debug -p -C -f -i 0x3bf2cc 0x3be442 0x37157d 0xe3bec 0xca019 0x180278 0x38f7b3 0xec728 0xc041c 0x9ebd9 0xffffffffffdfdd66 0x13ecbd0a1 0x13ecbd0a1 0x13ecbd0e6 0x13ecbc850 0x13ecbcac0 0x13ecbd0a1 0x13ecbd0a1 0x13ecb54e7 0x1560626b6 0x156062bc1 0x156063067 0x1560f9fc0 0x15644643f 0x15644656c 0x1562f7698 0x1571ac184

from foundationdb.

panghy commented on August 17, 2024

There you go, mine is a 5.2 branch from March 22 with your commit on it.

TraceEvent::backtrace(std::string) at /usr/include/c++/4.9/bits/basic_string.h:2424
 (inlined by) TraceEvent::backtrace(std::string) at /src/fdb-new/flow/Trace.cpp:891
std::string::_M_rep() const at /usr/include/c++/4.9/bits/basic_string.h:301
 (inlined by) ~basic_string at /usr/include/c++/4.9/bits/basic_string.h:547
 (inlined by) TraceEvent::~TraceEvent() at /src/fdb-new/flow/Trace.cpp:925
std::string::_M_data() const at /usr/include/c++/4.9/bits/basic_string.h:293 (discriminator 8)
 (inlined by) std::string::_M_rep() const at /usr/include/c++/4.9/bits/basic_string.h:301 (discriminator 8)
 (inlined by) ~basic_string at /usr/include/c++/4.9/bits/basic_string.h:547 (discriminator 8)
 (inlined by) Error::Error(int) at /src/fdb-new/flow/Error.cpp:68 (discriminator 8)
DLThreadSingleAssignmentVar<Standalone<RangeResultRef> >::apply() at /src/fdb-new/fdbclient/MultiVersionAssignmentVars.h:190
ActorCallback<(anonymous namespace)::DoOnMainThreadVoidActor<DLThreadSingleAssignmentVar<Standalone<RangeResultRef> >::futureCallback(future*, void*)::{lambda()#1}>, 0, Void>::fire((anonymous namespace)::DoOnMainThreadVoidActor<DLThreadSingleAssignmentVar<Standalone<RangeResultRef> >::futureCallback(future*, void*)::{lambda()#1}> const&) at /src/fdb-new/./flow/flow.h:636
 (inlined by) ~DoOnMainThreadVoidActorState at /src/fdb-new/flow/ThreadHelper.actor.h:569
 (inlined by) ~DoOnMainThreadVoidActor at /src/fdb-new/flow/ThreadHelper.actor.h:569
 (inlined by) a_body1cont6 at /src/fdb-new/flow/ThreadHelper.actor.h:569
 (inlined by) a_body1cont2 at /src/fdb-new/flow/ThreadHelper.actor.g.h:934
 (inlined by) a_body1cont5 at /src/fdb-new/flow/ThreadHelper.actor.g.h:962
 (inlined by) a_body1cont1 at /src/fdb-new/flow/ThreadHelper.actor.g.h:884
 (inlined by) a_body1when1 at /src/fdb-new/flow/ThreadHelper.actor.g.h:896
 (inlined by) a_callback_fire at /src/fdb-new/flow/ThreadHelper.actor.g.h:910
 (inlined by) fire at /src/fdb-new/./flow/flow.h:928
void SAV<Void>::send<Void>(Void&&) at /src/fdb-new/./flow/flow.h:381
N2::Net2::run() at /src/fdb-new/flow/flow.h:720
 (inlined by) ~PromiseTask at /src/fdb-new/flow/Net2.actor.cpp:467
 (inlined by) N2::PromiseTask::operator()() at /src/fdb-new/flow/Net2.actor.cpp:474
 (inlined by) N2::Net2::run() at /src/fdb-new/flow/Net2.actor.cpp:628
runNetwork() at /src/fdb-new/fdbclient/NativeAPI.actor.cpp:870
Reference<ClientInfo>::operator->() const at /src/fdb-new/./flow/FastRef.h:143
 (inlined by) MultiVersionApi::runNetwork() at /src/fdb-new/fdbclient/MultiVersionTransaction.actor.cpp:1203
fdb_run_network at /src/fdb-new/bindings/c/fdb_c.cpp:119
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0

from foundationdb.

ajbeamon commented on August 17, 2024

Ah, I see. The reason we don't see the InternalError events is because this is just forwarding the error from the underlying client. SystemError gets logged any time there is an error with an error code in a certain range, so the act of forwarding it causes it to get logged in the primary client library as well.

I'll have to think about why we might be triggering that ASSERT here. Is there any particular operation that you are executing when it hits?

from foundationdb.

panghy commented on August 17, 2024

Not particularly interesting, just a single line in java:

db.run(tx -> tx.getRange(Range.startsWith(aTuple)).asList().get())

We do set the following options on the tx (which again, seems innocuous):

TransactionOptions options = tx.options();
// 60s maximum timeout
options.setTimeout(60000);
// Seemingly endless retries
options.setRetryLimit(100);

We also do reference counting for all transaction objects with native counterparts and dispose of them when the transaction is disposed (which we also do immediately).

from foundationdb.

panghy commented on August 17, 2024

Also to note is that the code works on a machine with 5.2.x client debs but 3.0.10 java lib, talking to a 5.2.x client/server node. It just breaks with 5.2.x client debs (with the trace patch), 3.0.10 java lib and talking to a 3.0.10 client/server node with the environment variables in place to activate multi-version clients.

from foundationdb.

panghy commented on August 17, 2024

@ajbeamon I am trying to see what the addr2line would do from the 3.x build but it doesn't seem to work (and strangely it suggests doing it against fdbserver.debug but I tried that and also libfdb_c.so-debug). Basically I am trying to see the backtrace from the 3.x client's trace output.

from foundationdb.

ajbeamon commented on August 17, 2024

Yeah the 3.x backtrace is harder to use. Not only do you need to change it to use libfdb_c rather than fdbserver, but you also have to correct the offsets. How to actually correct them is a bit complicated, but if we need to go that route I can help when I get back to the office on Monday.

from foundationdb.

panghy commented on August 17, 2024

The ByteLimit and RowLimit is indeed both -1 from additional trace events:

<Event Severity="10" Time="1524861141.642535" Type="RYWSelectorsStartForward" Machine="0.0.0.0:0" ID="58695c7507658b9f" ByteLimit="-1" RowLimit="-1"/>
<Event Severity="10" Time="1524861141.642535" Type="RYWSelectors" Machine="0.0.0.0:0" ID="58695c7507658b9f" begin="0+firstGreaterOrEqual(\x02registry\x00\x02/services\x00\x02s\x00)" end="0+firstGreaterOrEqual(\x02registry\x00\x02/services\x00\x02s\x01)" reached="0" itemsPastEnd="0" endOffset="-1" itBegin="" itEnd="" unknown="1" requests="0"/>

from foundationdb.

panghy commented on August 17, 2024

Ah, I see, I am just going down the rabit hole of modifying the 3.x code to see why that assert is firing.

from foundationdb.

panghy commented on August 17, 2024

@ajbeamon I got it to work, but I had to patch setRequestLimits in ReadYourWrites.actor.cpp so that it would set a row limit if byte limit is <= 0. Not sure why that's needed since both 3.x and 5.x sets the row limit if byte limit is 0. It is, for some reason, -1 (both byte and row limit) when it hits the 3.x client.

from foundationdb.

panghy commented on August 17, 2024

For posterity, the final bit to fix is:

	static void setRequestLimits(GetRangeLimits &requestLimit, int64_t additionalRows, int offset, int requestCount) {
		requestLimit.minRows = (int)std::min(std::max(1 + additionalRows, (int64_t)offset), (int64_t)std::numeric_limits<int>::max());
		if(requestLimit.hasRowLimit()) {
			requestLimit.rows = (int)std::min(std::max(std::max(1,requestLimit.rows) + additionalRows, (int64_t)offset), (int64_t)std::numeric_limits<int>::max());
		}

		// Calculating request byte limit
		if(requestLimit.bytes<=0) {
			requestLimit.bytes = CLIENT_KNOBS->BYTE_LIMIT_UNLIMITED;
			if(!requestLimit.hasRowLimit()) {
				requestLimit.rows = (int)std::min(std::max(std::max(1,requestLimit.rows) + additionalRows, (int64_t)offset), (int64_t)std::numeric_limits<int>::max());
			}
		}
		else if(requestLimit.hasByteLimit()) {
			requestLimit.bytes = std::min(int64_t(requestLimit.bytes)<<std::min(requestCount, 20), (int64_t)CLIENT_KNOBS->REPLY_BYTE_LIMIT);
		}
	}

Specifically, if(requestLimit.bytes<=0) { is the fix instead of if(requestLimit.bytes==0) {

from foundationdb.

panghy commented on August 17, 2024

Oh @ajbeamon do you also plan on checking in those trace event changes too?

from foundationdb.

panghy commented on August 17, 2024

Strangely this doesn't work with the public downloads on foundationdb.org (just the client deb), it only works with the custom build that I made with your commit in it (based on 5.2.x).

from foundationdb.

ajbeamon commented on August 17, 2024

I'm working on a more substantive change for the trace logs, but it should go in when I'm done with it.

The public downloads won't work with pre-4.0 clients because they don't have the patch in MultiVersionTransaction.actor.cpp to change the minimum supported API version.

I don't remember what changed about the limits, but I'll have a look to see what I can find out on Monday. I'm glad that you were able to track that down.

from foundationdb.

panghy commented on August 17, 2024

Ah right, that was so long ago that I forgot already :). Is that something we can change? Now that we know it does work?

from foundationdb.

ajbeamon commented on August 17, 2024

Well, there are still the reporting downsides of doing so that had I listed above. Also, you've already found one problem with the 5.2/3.0 combo, so I'd be hesitant to make the change without more extensive testing.

My own gut is to not incorporate it into the main repo, given that it's probably not useful to most people and arguably is not generally correct (i.e. certain combinations of versions would have the problems I listed originally). However if it's important that you have an official build supporting it, it's something we could discuss.

from foundationdb.

panghy commented on August 17, 2024

Fair enough, we can use an internal build, it also works with 5.1/3.0 (both patched).

from foundationdb.

ajbeamon commented on August 17, 2024

I've found the bug that's affecting you here, which was fixed in version 4.1:

https://apple.github.io/foundationdb/old-release-notes/release-notes-410.html

The problem is that the streaming mode EXACT was ignoring the target_bytes parameter in the C bindings, and the multi-version client uses the EXACT streaming mode. In particular, the problem is with this line in fdb_c.cpp:

target_bytes = std::min(target_bytes, mode_bytes)

In the new version of fdb_c.cpp, this line is guarded by an extra if that verifies that the mode_bytes isn't unlimited:

foundationdb/bindings/c/fdb_c.cpp

Line 465 in b359e57

else if(mode_bytes != CLIENT_KNOBS->BYTE_LIMIT_UNLIMITED)

If you adopt the same check into your 3.0 bindings, then I think it should work correctly.

from foundationdb.

panghy commented on August 17, 2024

Ah, cool, I'll check that out

from foundationdb.

panghy commented on August 17, 2024

On a larger scale tests we seem to be seeing this issue w.r.t. watches from a 5.x client with multi version client talking to a 3.x cluster:

WARN  [2018-05-17 19:33:18,799] Flags: watch failed
! com.foundationdb.FDBException: The protocol version of the cluster has changed
! at com.foundationdb.NativeFuture.Future_getError(Native Method)
! at com.foundationdb.NativeFuture.getError(NativeFuture.java:69)
! at com.foundationdb.NativeFuture.getError(NativeFuture.java:10)
WARN  [2018-05-17 19:33:18,799] Flags: watch failed
! com.foundationdb.FDBException: Operation aborted because the transaction was cancelled
! at com.foundationdb.NativeFuture.Future_getError(Native Method)
! at com.foundationdb.NativeFuture.getError(NativeFuture.java:69)
! at com.foundationdb.NativeFuture.getError(NativeFuture.java:10)

Note that this is not a spurious issue, it happens continuously (and we are not upgrading the cluster when this is happening).

from foundationdb.

ajbeamon commented on August 17, 2024

Hmm, interesting, you shouldn't be getting protocol version errors except when starting a client or at upgrade time. I'd have to look to see if I could find a reason why you would be getting them at random times. The other error is usually fired when a watch is created but the transaction that it was created in is reset or destroyed without being committed, but if that's not your issue there I'd have to investigate that as well.

Let me see if I can get a reproduction.

from foundationdb.

panghy commented on August 17, 2024

The believe the cancelled issue is because on the error of the watch, the code disposes the transaction, which in this case, it probably is still in scope because the watch failed immediately and the callback disposed the transaction before the actual transaction code ran.

from foundationdb.

panghy commented on August 17, 2024

@ajbeamon also, this is happening when we start the service for some reason and it's very reproducible. It is happening consistently on start-up but it wasn't doing that consistently before, we are looking at what changed.

from foundationdb.

ajbeamon commented on August 17, 2024

Are you saying that a different watch is failing with the second error because the first watch caused the transaction to get canceled? I wouldn't expect the same watch to fire with more than one error.

I also tried a simple test with a watch against a 3.0 cluster using the multi-version client and a 5.2 primary library, and it worked as expected. How reproducible is your error? Is it happening every time?

The reason it may be happening to you on startup is that the cluster version changed error gets thrown on startup if you are connecting through an external client. It starts up assuming that the primary library is the valid one and throws the error to switch.

from foundationdb.

panghy commented on August 17, 2024

Hm, it is very early in the start-up of the process so it's entirely possible that if the first call is a watch that it fails that way (and it doesn't retry). Is that possible? We might just need to make sure we do a simple get() call before we do any watches.

from foundationdb.

ajbeamon commented on August 17, 2024

The cluster version changed error should be retried (at the transaction level) if you are using an onError based retry loop.

from foundationdb.

ajbeamon commented on August 17, 2024

The transaction canceled error won't be retried though.

from foundationdb.

panghy commented on August 17, 2024

Hm, what we are doing is the first thing in the transaction, we call watch and we listen to the future returning and it's isError() is returning true. I just added a dummy line:
db.run("prime", tx -> tx.get(EMPTY_BYTES).get());
before the watch transaction and it seems to work.

So I guess the issue is that if there's a watch and it's the first time a cluster is used, that watch will see a failure and it should be retried. This does not seem to be the case with other futures (say from a get() call, it is automatically retried underneath our code).

from foundationdb.

panghy commented on August 17, 2024

Let me paste a bit more context:

db.run(tx -> tx.get(EMPTY_BYTES).get());
  Map<String, Integer> currentFlags = db.run(tx -> {
    final Future<Void> watch = tx.watch(rootB);
    watch.onReady(() -> {
      if (watch.isError()) {
        log.log(Level.WARNING, "watch failed", watch.getError());
      }
      watch.dispose();
      tx.dispose();
      update = true;
    });
    // do stuff with tx but it fails here if the watch loop fails immediately (because of the dispose).
  });

from foundationdb.

ajbeamon commented on August 17, 2024

It's true that the watch will be set to the error and won't automatically reset itself, but it should be the case that when you try to commit the transaction (which is required to set the watch), the commit should also fail with the same error and cause the transaction to retry. So if you did something like the following, you should get back a watch that isn't an error:

db.run(tx -> tx.watch(rootB));

from foundationdb.

panghy commented on August 17, 2024

I see, perhaps the dispose without the onReady() shouldn't be called after all then since it might be interfering with the retry logic.

from foundationdb.

panghy commented on August 17, 2024

@ajbeamon Seems like we are running into seg faults with it running the 3.x client:

Register to memory mapping:

RAX=0x00007f5980042790 is an unknown value
RBX=0x00007f5980077550 is an unknown value
RCX=0x0000000000000001 is an unknown value
RDX=0x00007f5980042790 is an unknown value
RSP=0x00007f52c293a4f8 is pointing into the stack for thread: 0x00007f5c28034000
RBP=0x00007f52c293a530 is pointing into the stack for thread: 0x00007f5c28034000
RSI=0x0000000000000000 is an unknown value
RDI=0x00007f5980076840 is an unknown value
R8 =0x00007f5980076840 is an unknown value
R9 =0x0000000000000000 is an unknown value
R10=0x00007f5c373f5cf0: Java_com_foundationdb_NativeFuture_Future_1releaseMemory+0 in /tmp/fdbjni2093460040563041950.library at 0x00007f5c373f3000
R11=0x0000000000000001 is an unknown value
R12=0x00007f59800775d0 is an unknown value
R13=
[error occurred during error reporting (printing register info), id 0xb]

Stack: [0x00007f52c283b000,0x00007f52c293c000],  sp=0x00007f52c293a4f8,  free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  0x0000000000000065
C  [libfdb_c.so+0xa337d]
C  [libfdb_c.so+0xd15d4]
C  [libfdb_c.so+0xd1b95]
C  [libfdb_c.so+0x9eefd]  fdb_future_release_memory+0x3d
J 5470  com.foundationdb.NativeFuture.Future_releaseMemory(J)V (0 bytes) @ 0x00007f99117653b0 [0x00007f9911765300+0xb0]
J 14290 C2 wavefront.fdb.BatchReader$2.run()V (525 bytes) @ 0x00007f991157c7f8 [0x00007f991157c5a0+0x258]
J 10004 C2 java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec()Z (11 bytes) @ 0x00007f9912e08d50 [0x00007f9912e08d20+0x30]
J 12341% C2 java.util.concurrent.ForkJoinPool.runWorker(Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V (77 bytes) @ 0x00007f991342b0f8 [0x00007f991342b020+0xd8]
J 8818 C1 java.util.concurrent.ForkJoinWorkerThread.run()V (192 bytes) @ 0x00007f99125c9584 [0x00007f99125c93e0+0x1a4]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x695b96]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1056
V  [libjvm.so+0x6960a1]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
V  [libjvm.so+0x696537]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x47
V  [libjvm.so+0x71596e]  thread_entry(JavaThread*, Thread*)+0x7e
V  [libjvm.so+0xa7f243]  JavaThread::thread_main_inner()+0x103
V  [libjvm.so+0xa7f38c]  JavaThread::run()+0x11c
V  [libjvm.so+0x92e0f8]  java_start(Thread*)+0x108
C  [libpthread.so.0+0x76ba]  start_thread+0xca

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 5470  com.foundationdb.NativeFuture.Future_releaseMemory(J)V (0 bytes) @ 0x00007f9911765338 [0x00007f9911765300+0x38]
J 14290 C2 wavefront.fdb.BatchReader$2.run()V (525 bytes) @ 0x00007f991157c7f8 [0x00007f991157c5a0+0x258]
J 10004 C2 java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec()Z (11 bytes) @ 0x00007f9912e08d50 [0x00007f9912e08d20+0x30]
J 12341% C2 java.util.concurrent.ForkJoinPool.runWorker(Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V (77 bytes) @ 0x00007f991342b0f8 [0x00007f991342b020+0xd8]
J 8818 C1 java.util.concurrent.ForkJoinWorkerThread.run()V (192 bytes) @ 0x00007f99125c9584 [0x00007f99125c93e0+0x1a4]
v  ~StubRoutines::call_stub

Seems like it happens when we release memory on a NativeFuture at this point, trying multiple things.

from foundationdb.

panghy commented on August 17, 2024

Also:

Stack: [0x00007f0bb30b3000,0x00007f0bb31b4000],  sp=0x00007f0bb31af288,  free space=1008k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  0x0000000000000065
C  [libfdb_c.so+0xe2654]
C  [libfdb_c.so+0xe21f0]
C  [libfdb_c.so+0xe23ad]
C  [libfdb_c.so+0xc9af9]
C  [libfdb_c.so+0x180008]
C  [libfdb_c.so+0x38f663]
C  [libfdb_c.so+0xec628]
C  [libfdb_c.so+0xc031c]
C  [libfdb_c.so+0x9eb99]  fdb_run_network+0x19
C  [fdbjni6223930233250240845.library+0x4d66]  Java_com_foundationdb_FDB_Network_1run+0x26
j  com.foundationdb.FDB.access$300(Lcom/foundationdb/FDB;)V+1
j  com.foundationdb.FDB$3.run()V+103
j  java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec()Z+4
j  java.util.concurrent.ForkJoinTask.doExec()I+10
j  java.util.concurrent.ForkJoinPool$WorkQueue.runTask(Ljava/util/concurrent/ForkJoinTask;)V+21
j  java.util.concurrent.ForkJoinPool.runWorker(Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V+35
j  java.util.concurrent.ForkJoinWorkerThread.run()V+24
v  ~StubRoutines::call_stub
V  [libjvm.so+0x695b96]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1056
V  [libjvm.so+0x6960a1]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
V  [libjvm.so+0x696537]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x47
V  [libjvm.so+0x71596e]  thread_entry(JavaThread*, Thread*)+0x7e
V  [libjvm.so+0xa7f243]  JavaThread::thread_main_inner()+0x103
V  [libjvm.so+0xa7f38c]  JavaThread::run()+0x11c
V  [libjvm.so+0x92e0f8]  java_start(Thread*)+0x108
C  [libpthread.so.0+0x76ba]  start_thread+0xca

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.foundationdb.FDB.Network_run()V+0
j  com.foundationdb.FDB.access$300(Lcom/foundationdb/FDB;)V+1
j  com.foundationdb.FDB$3.run()V+103
j  java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec()Z+4
j  java.util.concurrent.ForkJoinTask.doExec()I+10
j  java.util.concurrent.ForkJoinPool$WorkQueue.runTask(Ljava/util/concurrent/ForkJoinTask;)V+21
j  java.util.concurrent.ForkJoinPool.runWorker(Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V+35
j  java.util.concurrent.ForkJoinWorkerThread.run()V+24
v  ~StubRoutines::call_stub

from foundationdb.

ajbeamon commented on August 17, 2024

Are you able to able to convert those call stacks into backtraces? I wasn't able to do it myself, possibly because I tried the wrong version, you have local changes, and/or you built it differently. For example, try something like the following and see if you get something that seems reasonable:

addr2line -e libfdb_c.so-debug -p -C -f -i 0xe2654 0xe21f0 0xe23ad 0xc9af9 0x180008 0x38f663 0xec628 0xc031c 0x9eb99

from foundationdb.

panghy commented on August 17, 2024

Checking, will have to check both the 5.x client as well as the 3.x client I presume?

from foundationdb.

panghy commented on August 17, 2024

Seems like it's this

ThreadFuture<Void>::clearCallback(ThreadCallback*) at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:468
 (inlined by) AbortableSingleAssignmentVar<Optional<Standalone<StringRef> > >::cancelCallbacks() at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:112
 (inlined by) AbortableSingleAssignmentVar<Optional<Standalone<StringRef> > >::fire(Void const&, int&) at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:73
ThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::send(Optional<Standalone<StringRef> > const&) at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:425
DLThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::apply() at /home/ubuntu/foundationdb/./flow/flow.h:136
 (inlined by) DLThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::apply() at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:195
ActorCallback<(anonymous namespace)::DoOnMainThreadVoidActor<DLThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::futureCallback(future*, void*)::{lambda()#1}>, 0, Void>::fire((anonymous namespace)::DoOnMainThreadVoidActor<DLThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::futureCallback(future*, void*)::{lambda()#1}> const&) at /home/ubuntu/foundationdb/./flow/flow.h:636
 (inlined by) ~DoOnMainThreadVoidActorState at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:569
 (inlined by) ~DoOnMainThreadVoidActor at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:569
 (inlined by) a_body1cont6 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:569
 (inlined by) a_body1cont2 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:934
 (inlined by) a_body1cont5 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:962
 (inlined by) a_body1cont1 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:884
 (inlined by) a_body1when1 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:896
 (inlined by) a_callback_fire at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:910
 (inlined by) fire at /home/ubuntu/foundationdb/./flow/flow.h:928
void SAV<Void>::send<Void>(Void&&) at /home/ubuntu/foundationdb/./flow/flow.h:381
N2::Net2::run() at /home/ubuntu/foundationdb/flow/flow.h:720
 (inlined by) ~PromiseTask at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:467
 (inlined by) N2::PromiseTask::operator()() at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:474
 (inlined by) N2::Net2::run() at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:628
runNetwork() at /home/ubuntu/foundationdb/fdbclient/NativeAPI.actor.cpp:865
Reference<ClientInfo>::operator->() const at /home/ubuntu/foundationdb/./flow/FastRef.h:143
 (inlined by) MultiVersionApi::runNetwork() at /home/ubuntu/foundationdb/fdbclient/MultiVersionTransaction.actor.cpp:1203
fdb_run_network at /home/ubuntu/foundationdb/bindings/c/fdb_c.cpp:119

from foundationdb.

ajbeamon commented on August 17, 2024

This happens when running against a 3.0 cluster? Are there any simultaneous upgrades taking place?

Nothing immediately jumps out at me from the stack trace, although I do notice the line numbers in ThreadHelper.actor.h don't seem right. Have you made changes there, or is it just giving the wrong numbers?

Are you willing to share the Java code that's calling release_memory? It looks from the call stack like your code is calling it directly, and I don't think this was something our bindings exposed. If that's the case, then maybe the usage pattern will tell us something.

from foundationdb.

panghy commented on August 17, 2024

Yeah, running 5.x client DEBs against a 3.x cluster. For what's worth, I pulled out all the dispose() calls in our java code but it still faults. I also tried moving locks around in the 5.x client library but that didn't help either. I didn't change anything in ThreadHelper.actor.h when I posted the backtrace.

MultiVersionAssignmentVars is the logic that's invoked only when there are multi-version clients right?

from foundationdb.

panghy commented on August 17, 2024

Just rebased onto release-5.1 branch again with just the api version change: https://github.com/panghy/foundationdb/commits/support-300

from foundationdb.

panghy commented on August 17, 2024

Rerunning our tests with that branch:

ThreadFuture<Void>::clearCallback(ThreadCallback*) at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:468
 (inlined by) AbortableSingleAssignmentVar<Optional<Standalone<StringRef> > >::cancelCallbacks() at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:112
 (inlined by) AbortableSingleAssignmentVar<Optional<Standalone<StringRef> > >::fire(Void const&, int&) at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:73
ThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::send(Optional<Standalone<StringRef> > const&) at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:425
DLThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::apply() at /home/ubuntu/foundationdb/./flow/flow.h:136
 (inlined by) DLThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::apply() at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:195
ActorCallback<(anonymous namespace)::DoOnMainThreadVoidActor<DLThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::futureCallback(future*, void*)::{lambda()#1}>, 0, Void>::fire((anonymous namespace)::DoOnMainThreadVoidActor<DLThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::futureCallback(future*, void*)::{lambda()#1}> const&) at /home/ubuntu/foundationdb/./flow/flow.h:636
 (inlined by) ~DoOnMainThreadVoidActorState at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:569
 (inlined by) ~DoOnMainThreadVoidActor at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:569
 (inlined by) a_body1cont6 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:569
 (inlined by) a_body1cont2 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:934
 (inlined by) a_body1cont5 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:962
 (inlined by) a_body1cont1 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:884
 (inlined by) a_body1when1 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:896
 (inlined by) a_callback_fire at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:910
 (inlined by) fire at /home/ubuntu/foundationdb/./flow/flow.h:928
void SAV<Void>::send<Void>(Void&&) at /home/ubuntu/foundationdb/./flow/flow.h:381
N2::Net2::run() at /home/ubuntu/foundationdb/flow/flow.h:720
 (inlined by) ~PromiseTask at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:467
 (inlined by) N2::PromiseTask::operator()() at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:474
 (inlined by) N2::Net2::run() at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:628
runNetwork() at /home/ubuntu/foundationdb/fdbclient/NativeAPI.actor.cpp:865
Reference<ClientInfo>::operator->() const at /home/ubuntu/foundationdb/./flow/FastRef.h:143
 (inlined by) MultiVersionApi::runNetwork() at /home/ubuntu/foundationdb/fdbclient/MultiVersionTransaction.actor.cpp:1203
fdb_run_network at /home/ubuntu/foundationdb/bindings/c/fdb_c.cpp:119

from foundationdb.

ajbeamon commented on August 17, 2024

It looks like the inaccurate line numbers is due to a bug in the actor compiler with marking the line numbers of multi-line comment blocks. Because our files start with a 19 line comment, we are getting an 18 line offset for parts of our actor files. I'll fix that up and submit a PR.

Are you doing any upgrades during this test, or is it just crashing while the cluster version is stable?

Can you share the code calling release_memory? If it can help me put together a reproduction, it will make tracking this down much easier.

from foundationdb.

mzeier commented on August 17, 2024

Are you doing any upgrades during this test, or is it just crashing while the cluster version is stable?

The fdb cluster is static, on v3.0.10.

from foundationdb.

ajbeamon commented on August 17, 2024

Line number fix, in case you're curious: #488

from foundationdb.

panghy commented on August 17, 2024

Cool, just cherry-picked and will be building and testing it again.

from foundationdb.

panghy commented on August 17, 2024

Also @ajbeamon, I took out all places that we call dispose() and it was still crashing.

from foundationdb.

panghy commented on August 17, 2024

ThreadFuture<Void>::clearCallback(ThreadCallback*) at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:450
 (inlined by) AbortableSingleAssignmentVar<long>::cancelCallbacks() at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:112
 (inlined by) AbortableSingleAssignmentVar<long>::fire(Void const&, int&) at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:73
ThreadSingleAssignmentVar<long>::send(long const&) at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:407
ThreadSafeReferenceCounted<ThreadSingleAssignmentVar<long> >::delref_no_destroy() const at /home/ubuntu/foundationdb/./flow/FastRef.h:43
 (inlined by) ThreadSafeReferenceCounted<ThreadSingleAssignmentVar<long> >::delref() const at /home/ubuntu/foundationdb/./flow/FastRef.h:55
 (inlined by) ThreadSingleAssignmentVar<long>::delref() at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:377
 (inlined by) DLThreadSingleAssignmentVar<long>::apply() at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:198
ActorCallback<(anonymous namespace)::DoOnMainThreadVoidActor<DLThreadSingleAssignmentVar<long>::futureCallback(future*, void*)::{lambda()#1}>, 0, Void>::fire((anonymous namespace)::DoOnMainThreadVoidActor<DLThreadSingleAssignmentVar<long>::futureCallback(future*, void*)::{lambda()#1}> const&) at /home/ubuntu/foundationdb/./flow/flow.h:636
 (inlined by) ~DoOnMainThreadVoidActorState at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:569
 (inlined by) ~DoOnMainThreadVoidActor at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:569
 (inlined by) a_body1cont6 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:569
 (inlined by) a_body1cont2 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:934
 (inlined by) a_body1cont5 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:962
 (inlined by) a_body1cont1 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:884
 (inlined by) a_body1when1 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:896
 (inlined by) a_callback_fire at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:910
 (inlined by) fire at /home/ubuntu/foundationdb/./flow/flow.h:928
void SAV<Void>::send<Void>(Void&&) at /home/ubuntu/foundationdb/./flow/flow.h:381
N2::Net2::run() at /home/ubuntu/foundationdb/flow/flow.h:720
 (inlined by) ~PromiseTask at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:467
 (inlined by) N2::PromiseTask::operator()() at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:474
 (inlined by) N2::Net2::run() at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:628
runNetwork() at /home/ubuntu/foundationdb/fdbclient/NativeAPI.actor.cpp:865
Reference<ClientInfo>::operator->() const at /home/ubuntu/foundationdb/./flow/FastRef.h:143
 (inlined by) MultiVersionApi::runNetwork() at /home/ubuntu/foundationdb/fdbclient/MultiVersionTransaction.actor.cpp:1185
fdb_run_network at /home/ubuntu/foundationdb/bindings/c/fdb_c.cpp:119

from foundationdb.

panghy commented on August 17, 2024

You can see how the JVM crashes (this is tracking uptime). It's kinda random (so there's a particular interaction that triggers this).

from foundationdb.

ajbeamon commented on August 17, 2024

What happens if you use dispose but remove releaseMemory? I've been poring over the code but haven't yet identified what specifically might be going wrong. I'm curious if the releaseMemory lead is relevant or not.

from foundationdb.

panghy commented on August 17, 2024

I think those are two separate issues, I removed the dispose in the watch and that seemed to have fixed that issue (that issue happens immediately on start-up when the watch fails due to the version change but dispose is called). This is another issue that I am seeing that occurs randomly during the life of the process. As a precaution, I removed all dispose calls just in case that's what's interfering.

from foundationdb.

panghy commented on August 17, 2024

Another slightly different instance:

ThreadFuture<Void>::clearCallback(ThreadCallback*) at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:450
 (inlined by) AbortableSingleAssignmentVar<Optional<Standalone<StringRef> > >::cancelCallbacks() at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:112
 (inlined by) AbortableSingleAssignmentVar<Optional<Standalone<StringRef> > >::fire(Void const&, int&) at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:73
ThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::send(Optional<Standalone<StringRef> > const&) at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:407
DLThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::apply() at /home/ubuntu/foundationdb/./flow/flow.h:136
 (inlined by) DLThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::apply() at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:195
ActorCallback<(anonymous namespace)::DoOnMainThreadVoidActor<DLThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::futureCallback(future*, void*)::{lambda()#1}>, 0, Void>::fire((anonymous namespace)::DoOnMainThreadVoidActor<DLThreadSingleAssignmentVar<Optional<Standalone<StringRef> > >::futureCallback(future*, void*)::{lambda()#1}> const&) at /home/ubuntu/foundationdb/./flow/flow.h:636
 (inlined by) ~DoOnMainThreadVoidActorState at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:569
 (inlined by) ~DoOnMainThreadVoidActor at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:569
 (inlined by) a_body1cont6 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:569
 (inlined by) a_body1cont2 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:934
 (inlined by) a_body1cont5 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:962
 (inlined by) a_body1cont1 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:884
 (inlined by) a_body1when1 at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:896
 (inlined by) a_callback_fire at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:910
 (inlined by) fire at /home/ubuntu/foundationdb/./flow/flow.h:928
void SAV<Void>::send<Void>(Void&&) at /home/ubuntu/foundationdb/./flow/flow.h:381
N2::Net2::run() at /home/ubuntu/foundationdb/flow/flow.h:720
 (inlined by) ~PromiseTask at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:467
 (inlined by) N2::PromiseTask::operator()() at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:474
 (inlined by) N2::Net2::run() at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:628
runNetwork() at /home/ubuntu/foundationdb/fdbclient/NativeAPI.actor.cpp:865
Reference<ClientInfo>::operator->() const at /home/ubuntu/foundationdb/./flow/FastRef.h:143
 (inlined by) MultiVersionApi::runNetwork() at /home/ubuntu/foundationdb/fdbclient/MultiVersionTransaction.actor.cpp:1185
fdb_run_network at /home/ubuntu/foundationdb/bindings/c/fdb_c.cpp:119

from foundationdb.

panghy commented on August 17, 2024

Curiously clearCallback is commented out in 3.x code's ThreadHelper and is only used in the new code in MultiVersion* classes.

from foundationdb.

panghy commented on August 17, 2024

Going to try and see:

panghy@3832b13

from foundationdb.

panghy commented on August 17, 2024

Did not work :(

from foundationdb.

ajbeamon commented on August 17, 2024

To give a little context here, DLThreadSingleAssignmentVar is essentially wrapping the futures returned by the external (3.0 here) client. It looks like these future are being completed normally in your call stacks.

This DLThreadSingleAssignmentVar is wrapped by an AbortableThreadSingleAssignmentVar which allows an external abort signal to also trigger the future. The external signal here is supposed to occur when another client library manages to connect to the database and replaces the current one (i.e. the version changes), thereby failing all outstanding futures. When the DLThreadSingleAssignmentVar completes, the AbortableThreadSingleAssignmentVar unregisters itself from the callback list of the abort signal, which is where we are when this crash occurs.

The abort signal shouldn't be firing at all while the cluster version is stable, so I would expect very little to be happening with it other than callbacks being registered and removed.

from foundationdb.

panghy commented on August 17, 2024

@ajbeamon thanks for the clarification. We are not attempting any upgrades of the FDB cluster at the current time and the JVMs die at random times which seemed odd since there's only the 3.0 (external) client and the built-in 5.1 client and the latter should never be able to connect.

Do you think this is something to do with the multi version logic or do you think this is 3.0+5.1 specific? I have checked the code and removed everything that cancel() or dispose() NativeFutures in the code and it doesn't seem to make a difference so far.

from foundationdb.

ajbeamon commented on August 17, 2024

I don't have any specific reason to suspect 3.0 is contributing to this problem, but it's hard to say. It's possible that we may be able to learn a little bit more from core dumps if you have them, but if that doesn't point us in the right direction it may be hard for me to determine the problem unless I have a reproduction I can toy around with.

One other thought -- are you able to reproduce this if you run a 5.1 client that doesn't have compiler optimizations turned on? You can comment out the CFLAGS += -O2 line and comment in the CFLAGS += -O0 line in Makefile to build the client without optimizations. If it does reproduce there, then maybe the backtrace will yield something useful that we aren't getting with the current backtraces.

from foundationdb.

panghy commented on August 17, 2024

It does happen without optimizations. Got the backtrace here, just started looking into it myself.

DLThreadSingleAssignmentVar<long>::cancel() at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:178
AbortableSingleAssignmentVar<long>::cancelCallbacks() at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:112
AbortableSingleAssignmentVar<long>::fire(Void const&, int&) at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:74
ThreadSingleAssignmentVar<long>::send(long const&) at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.h:391
DLThreadSingleAssignmentVar<long>::apply() at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:198
DLThreadSingleAssignmentVar<long>::futureCallback(future*, void*)::{lambda()#1}::operator()() const at /home/ubuntu/foundationdb/fdbclient/MultiVersionAssignmentVars.h:205
(anonymous namespace)::DoOnMainThreadVoidActorState<DLThreadSingleAssignmentVar<long>::futureCallback(future*, void*)::{lambda()#1}, (anonymous namespace)::DoOnMainThreadVoidActor<{lambda()#1}> >::a_body1cont1(Void const&, int) at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:884
(anonymous namespace)::DoOnMainThreadVoidActorState<DLThreadSingleAssignmentVar<long>::futureCallback(future*, void*)::{lambda()#1}, (anonymous namespace)::DoOnMainThreadVoidActor<{lambda()#1}> >::a_body1when1(Void const&, int) at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:896
(anonymous namespace)::DoOnMainThreadVoidActorState<DLThreadSingleAssignmentVar<long>::futureCallback(future*, void*)::{lambda()#1}, (anonymous namespace)::DoOnMainThreadVoidActor<{lambda()#1}> >::a_callback_fire(ActorCallback<(anonymous namespace)::DoOnMainThreadVoidActor, 0, Void>*, ActorCallback) at /home/ubuntu/foundationdb/flow/ThreadHelper.actor.g.h:918
ActorCallback<(anonymous namespace)::DoOnMainThreadVoidActor<DLThreadSingleAssignmentVar<long>::futureCallback(future*, void*)::{lambda()#1}>, 0, Void>::fire((anonymous namespace)::DoOnMainThreadVoidActor<DLThreadSingleAssignmentVar<long>::futureCallback(future*, void*)::{lambda()#1}> const&) at /home/ubuntu/foundationdb/./flow/flow.h:928
void SAV<Void>::send<Void>(Void&&) at /home/ubuntu/foundationdb/./flow/flow.h:381
void Promise<Void>::send<Void>(Void&&) const at /home/ubuntu/foundationdb/./flow/flow.h:709
N2::PromiseTask::operator()() at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:474
N2::Net2::run() at /home/ubuntu/foundationdb/flow/Net2.actor.cpp:635
runNetwork() at /home/ubuntu/foundationdb/fdbclient/NativeAPI.actor.cpp:865
ThreadSafeApi::runNetwork() at /home/ubuntu/foundationdb/fdbclient/ThreadSafeTransaction.actor.cpp:361
MultiVersionApi::runNetwork() at /home/ubuntu/foundationdb/fdbclient/MultiVersionTransaction.actor.cpp:1185
fdb_run_network at /home/ubuntu/foundationdb/bindings/c/fdb_c.cpp:119 (discriminator 1)

from foundationdb.

panghy commented on August 17, 2024

The line numbers look even more confusing without optimizations for some reason.

from foundationdb.

ajbeamon commented on August 17, 2024

Hmm, one possibility is that the the 2nd item in the stack should be at line 110 rather than 112, which then involves a call to DLThreadSingleAssignmentVar<long>::cancel() at the top of the stack. That seems a little more believable to me, given the lack of activity that's happening on the abort signal. On the other hand, it's not exactly clear from the top of the stack what might have gone wrong.

from foundationdb.

panghy commented on August 17, 2024

Is there anything we can do to fix the line numbers just in case they are still lying?

from foundationdb.

ajbeamon commented on August 17, 2024

My suspicion is that this is an artifact of how backtraces are being collected, in this case by the JVM. If you could catch this error in a debugger like gdb, among other things it would probably help you get a reliable call stack.

from foundationdb.

panghy commented on August 17, 2024

I see, I am not sure we have that option here. Anyways, turning on client tracing to see if there's more information there at the exact point of the crash.

from foundationdb.

panghy commented on August 17, 2024

Nothing pops out, I just don't see how those calls could seg fault. Core dump would be very big since this is a scale test that's failing. It's sporadically failing and doesn't seem like there's a way to trigger it. We are talking to at least 3 different FDB clusters (all 3.x clusters). Not sure if that matters.

from foundationdb.

panghy commented on August 17, 2024

This patch: panghy@5b6628f fixes the problem for us so far. There is no segfault in 48 hours which is a first. The change obviously removes calling cancel on the FdbCApi reference it seems but there's no observable issues on the cluster so far.

@ajbeamon any concerns? If this is a 3.x thing, it probably won't be an issue after we hot roll with this patch and bring clusters onto 5.x. We can then just go back to the standard binaries.

from foundationdb.

ajbeamon commented on August 17, 2024

That's interesting, does it also work with only line 174 commented out? The call to cancel on the external future should be doing very little in the case where the future is already complete (as it was in the call stacks you sent me). It should just be incrementing a reference counter on the future, checking that the future is already set, and then decrementing the same reference counter.

It should also be the case that destroy is being called on this future later, which is basically the same as cancel except that it doesn't add to the reference count.

The obvious effect of your change as you noted is that you can't cancel futures, but I think otherwise it should work ok. Based in part on the above, I'm not convinced yet that this is a problem specific to 3.x versions of the client, so if you want to go this route we'll want to do some testing and reevaluate when you're working with modern versions.

from foundationdb.

panghy commented on August 17, 2024

@ajbeamon I can test but I would assume the other two lines can't really fault. If my understanding is correct, api is, somehow, pointing to something that's not an FdbCApi.

from foundationdb.

ajbeamon commented on August 17, 2024

Hmm, I hadn't really considered the possibility of api being invalid. I was more focused on something being wrong with f or with the reference counting that's happening in the if block. Did you see something that suggested something was wrong with api specifically?

from foundationdb.

Support for 3.x clusters to live migrate about foundationdb HOT 106 CLOSED

Comments (106)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent