
Comments (8)

msalle commented on July 29, 2024

Hi @fscheiner, short answer: no it hasn't been merged in. Whether it's still needed I don't know.


fscheiner commented on July 29, 2024

@msalle

Whether it's still needed I don't know.

Well, if it wasn't changed, I assume it's still a problem, also because of the description in globus/globus-toolkit#63. If the deadlock behaviour can be easily reproduced, we can just compare the results of (1) using the unmodified source and (2) using a "patched" source, to determine if it's needed or not.


fscheiner commented on July 29, 2024

@msalle
I seem to remember observing globus-gridftp-server processes that keep running w/o producing any load after past transfers were finished. This could be related to the deadlocks described in globus/globus-toolkit#63. So I will try to reproduce this behaviour with the current GCT code locally and then compare to the behaviour with @bbockelm's patch included.


ellert commented on July 29, 2024

The lock is already held during fork().

The fork is done in globus_l_gfs_spawn_child():

static
globus_result_t
globus_l_gfs_spawn_child(
    globus_xio_handle_t                 handle)
{


And though this function does not lock the mutex, it is called from

static
void
globus_l_gfs_server_accept_cb(
    globus_xio_server_t                 server,
    globus_xio_handle_t                 handle,
    globus_result_t                     result,
    void *                              user_arg)
{
    GlobusGFSName(globus_l_gfs_server_accept_cb);
    GlobusGFSDebugEnter();
    globus_mutex_lock(&globus_l_gfs_mutex);

    result = globus_l_gfs_spawn_child(handle);

    globus_mutex_unlock(&globus_l_gfs_mutex);
    GlobusGFSDebugExit();
    return;

I.e. the lock is already held when globus_l_gfs_spawn_child() is called. There is no need to lock it again in this function. Locking the same mutex again from the same thread would be undefined behaviour.
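To make the last point concrete: with plain POSIX threads (which is what globus_mutex_t maps to on Unix platforms, as far as I know), relocking a default-initialized mutex from the same thread is exactly this kind of undefined behaviour. A minimal standalone sketch, not GCT code:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

int
main(void)
{
    pthread_mutex_lock(&mutex);
    printf("first lock acquired\n");
    /* Relocking a PTHREAD_MUTEX_DEFAULT mutex from the owning thread is
     * undefined behaviour; a PTHREAD_MUTEX_NORMAL mutex hangs here
     * forever, an ERRORCHECK mutex would return EDEADLK instead. */
    pthread_mutex_lock(&mutex);
    printf("never reached\n");
    return 0;
}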


ellert commented on July 29, 2024

Never mind - I now realise that the original PR was for a different mutex than this one. Please disregard my comment.


ellert commented on July 29, 2024

Hi again.

I looked at this as part of preparing my "fix some compiler warnings" PR #179.
I now think that my comment above was not so off the mark after all.
In the source file in question - gridftp/server/src/globus_gridftp_server.c - the mutex globus_l_gfs_xio_server->mutex is never locked or released directly. All manipulations of globus_l_gfs_xio_server (assignments, function calls, ...) are (or at least are supposed to be) protected by holding the global mutex globus_l_gfs_mutex. So no lock on globus_l_gfs_xio_server->mutex should be able to persist across a release of the global mutex, and hence my comment above is valid.
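In other words, the intended pattern throughout the file is roughly the following (a schematic sketch of the invariant, not a literal excerpt):

globus_mutex_lock(&globus_l_gfs_mutex);
{
    /* every read or write of globus_l_gfs_xio_server - including the
     * globus_xio_server_*() calls that internally take and release
     * globus_l_gfs_xio_server->mutex - happens inside this region */
    result = globus_xio_server_register_accept(
        globus_l_gfs_xio_server,
        globus_l_gfs_server_accept_cb,
        NULL);
}
globus_mutex_unlock(&globus_l_gfs_mutex);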

However, there is a flaw in this logic - which I think is a bug. For some reason, the lock is released here:

globus_mutex_unlock(&globus_l_gfs_mutex);

inside the globus_l_gfs_spawn_child() function, even though the logic demands that the mutex stay locked for the duration of this function: it must still be held after returning from the call to globus_l_gfs_spawn_child() and is only released by the call to globus_mutex_unlock() on line 1270:

        else
        {
            /* if we fail to actually open a connection with either method we
            do not fail, just log that the connection failed */
            if(globus_i_gfs_config_bool("daemon"))
            {
                result = globus_l_gfs_spawn_child(handle);
                if(result != GLOBUS_SUCCESS)
                {
                    globus_gfs_log_result(
                        GLOBUS_GFS_LOG_ERR, "Could not spawn a child", result);
                    result = GLOBUS_SUCCESS;
                    /* if fork fails (cygwin sometimes),
                     * attempt to run non-forked */
                    if(globus_i_gfs_config_bool("fork_fallback"))
                    {
                        result = globus_l_gfs_open_new_server(handle);
                        if(result != GLOBUS_SUCCESS)
                        {
                            globus_gfs_log_result(
                                GLOBUS_GFS_LOG_ERR,
                                _GSSL("Could not open new handle"),
                                result);
                            result = GLOBUS_SUCCESS;
                        }
                    }
                }
            }
            else
            {
                result = globus_l_gfs_open_new_server(handle);
                if(result != GLOBUS_SUCCESS)
                {
                    globus_gfs_log_result(
                        GLOBUS_GFS_LOG_ERR,
                        _GSSL("Could not open new handle"),
                        result);
                    result = GLOBUS_SUCCESS;
                }
            }
        }
        /* be sure to close handle on server proc and close server on
         * client proc (close on exec)
         *
         * need to handle proc exits and decrement the open server count
         * to keep the limit in effect.
         *
         * win32 will have to simulate fork/exec... should i just do that
         * for unix too?
         */
        if(globus_i_gfs_config_bool("single"))
        {
            result = globus_xio_server_register_close(
                globus_l_gfs_xio_server, globus_l_gfs_server_close_cb, NULL);
            if(result == GLOBUS_SUCCESS)
            {
                globus_l_gfs_outstanding++;
            }
            else
            {
                globus_l_gfs_xio_server = GLOBUS_NULL;
            }
        }
        else if(!globus_l_gfs_terminated)
        {
            result = globus_xio_server_register_accept(
                globus_l_gfs_xio_server,
                globus_l_gfs_server_accept_cb,
                NULL);
            if(result != GLOBUS_SUCCESS)
            {
                goto error_register_accept;
            }
            globus_l_gfs_outstanding++;
            globus_l_gfs_xio_server_accepting = GLOBUS_TRUE;
        }
    }
    globus_mutex_unlock(&globus_l_gfs_mutex);
    GlobusGFSDebugExit();
    return;
error_register_accept:
error_accept:
    globus_l_gfs_terminated = GLOBUS_TRUE;
    if(globus_gfs_config_get_int("open_connections_count") == 0)
    {
        globus_cond_signal(&globus_l_gfs_cond);
    }
    globus_mutex_unlock(&globus_l_gfs_mutex);
    GlobusGFSDebugExitWithError();
}

As can be seen above, there are some function calls after the return from globus_l_gfs_spawn_child() that use globus_l_gfs_xio_server and could potentially screw up the locking on mutex globus_l_gfs_xio_server->mutex without the global lock being held, due to the early release of the lock on line 538.

This unlock is also questionable for a different reason: every other call to globus_xio_register_close() is made while the mutex is held, whereas here the mutex is released just before the call.

Dropping the early release of the lock would fix this.

The direct call to globus_l_gfs_close_cb() that is made in case the callback registration fails must then be handled in some way, since the callback tries to acquire the lock. This could be:

        if(result != GLOBUS_SUCCESS)
        {
            globus_mutex_unlock(&globus_l_gfs_mutex);
            globus_l_gfs_close_cb(handle, result, NULL);
            globus_mutex_lock(&globus_l_gfs_mutex);
        }        

But the callback function is:

static
void
globus_l_gfs_close_cb(
    globus_xio_handle_t                 handle,
    globus_result_t                     result,
    void *                              user_arg)
{
    GlobusGFSName(globus_l_gfs_close_cb);
    GlobusGFSDebugEnter();
    globus_mutex_lock(&globus_l_gfs_mutex);
    {
        globus_i_gfs_connection_closed();
    }
    globus_mutex_unlock(&globus_l_gfs_mutex);
    GlobusGFSDebugExit();
}

So this would be equivalent to:

        if(result != GLOBUS_SUCCESS)
        {
            globus_mutex_unlock(&globus_l_gfs_mutex);
            globus_mutex_lock(&globus_l_gfs_mutex);
            globus_i_gfs_connection_closed();
            globus_mutex_unlock(&globus_l_gfs_mutex);
            globus_mutex_lock(&globus_l_gfs_mutex);
        }        

i.e., one could just write:

        if(result != GLOBUS_SUCCESS)
        {
            globus_i_gfs_connection_closed();
        }        

This would be consistent with the rest of the code in the file: in other places where a callback registration fails while the lock is held, the fallback is to execute the body of the callback inline, without the lock/unlock, e.g.:

    result = globus_xio_register_close(
        handle,
        NULL,
        globus_l_gfs_reject_close_cb,
        NULL);
    if(result != GLOBUS_SUCCESS)
    {
        if(globus_i_gfs_config_bool("inetd"))
        {
            globus_i_gfs_connection_closed();
        }
        else
        {
            globus_l_gfs_outstanding--;
        }
    }

static
void
globus_l_gfs_reject_close_cb(
    globus_xio_handle_t                 handle,
    globus_result_t                     result,
    void *                              user_arg)
{
    globus_mutex_lock(&globus_l_gfs_mutex);
    {
        if(globus_i_gfs_config_bool("inetd"))
        {
            globus_i_gfs_connection_closed();
        }
        else
        {
            globus_l_gfs_outstanding--;
        }
    }
    globus_mutex_unlock(&globus_l_gfs_mutex);
}

Coincidentally, this change is proposed in PR #179:
https://github.com/gridcf/gct/pull/179/files#diff-f76f3cae29cf9ff2977b4124b6177a87eb0b1500c72908a7ca6194a6c0e21bd8
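Putting the two pieces together, the end of globus_l_gfs_spawn_child() would then look roughly like this (a sketch based on the snippets above, not a verbatim copy of the PR):

    /* globus_l_gfs_mutex now stays locked for the whole function;
     * the early globus_mutex_unlock(&globus_l_gfs_mutex) is dropped */
    result = globus_xio_register_close(
        handle,
        NULL,
        globus_l_gfs_close_cb,
        NULL);
    if(result != GLOBUS_SUCCESS)
    {
        /* registration failed: execute the callback's body inline,
         * without its lock/unlock, since the lock is already held */
        globus_i_gfs_connection_closed();
    }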


fscheiner commented on July 29, 2024

globus_mutex_unlock(&globus_l_gfs_mutex);

This came in some 18 years ago. Prior to that, the child was spawned before globus_l_gfs_mutex was locked (see

        result = globus_l_gfs_spawn_child(handle);
        if(result != GLOBUS_SUCCESS)
        {
        }
    }
    else
    {
        result = globus_l_gfs_open_new_server(handle);
        if(result != GLOBUS_SUCCESS)
        {
            globus_i_gfs_log_result("Could not open new handle", result);
            /* we're not going to terminate the server just because we
             * couldnt open a single client
             */
            result = GLOBUS_SUCCESS;
        }
    }
    /* be sure to close handle on server proc and close server on
     * client proc (close on exec)
     *
     * need to handle proc exits and decrement the open server count
     * to keep the limit in effect.
     *
     * win32 will have to simulate fork/exec... should i just do that
     * for unix too?
     */
    globus_mutex_lock(&globus_l_gfs_mutex);

...for details). I'm not sure what the reasons were for changing that back then.


But is globus_l_gfs_mutex for the parent the same mutex as globus_l_gfs_xio_server->mutex for the spawned child? I ask because, if not, I don't see a chance for a deadlock in the child. But as @bbockelm detected one in globus/globus-toolkit#63, there must have been one in the code (at least back then).
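For reference, the classic way such a child-side deadlock arises is forking while some other thread holds a mutex: the child inherits the locked mutex, but not the thread that would unlock it. A minimal standalone illustration with plain POSIX threads (not GCT code; the program hangs to demonstrate the deadlock):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *
holder(void * arg)
{
    pthread_mutex_lock(&mutex);
    sleep(2);                      /* hold the lock across the fork() */
    pthread_mutex_unlock(&mutex);
    return NULL;
}

int
main(void)
{
    pthread_t                       thread;

    pthread_create(&thread, NULL, holder, NULL);
    sleep(1);                      /* make sure holder owns the lock */
    if(fork() == 0)
    {
        /* the child starts with the mutex locked, but the holder
         * thread does not exist here - nobody will ever unlock it */
        pthread_mutex_lock(&mutex);
        printf("never reached in the child\n");
        _exit(0);
    }
    pthread_join(thread, NULL);
    wait(NULL);                    /* hangs waiting for the stuck child */
    return 0;
}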

Won't there be a real chance for a deadlock for the parent in

    result = globus_xio_register_close(
        handle,
        NULL,
        globus_l_gfs_close_cb,
        NULL);

..., after removing the globus_mutex_unlock() in line 538? Inside the call in line 539 a globus_mutex_lock() is taken on handle->context->mutex (see
globus_mutex_lock(&handle->context->mutex);
), which - if this mutex were the same as globus_l_gfs_mutex - couldn't succeed until globus_l_gfs_mutex is unlocked in line 1270.

UPDATE: But maybe not, because if both were the same, the deadlock would - without any changes to line 538 above - already happen in the call to globus_xio_register_open() in

result = globus_xio_register_open(
(which itself calls globus_l_xio_register_open(), which in turn calls globus_mutex_lock(&handle->context->mutex) before the call to globus_l_gfs_spawn_child() happens).

I'll check if the changes from #179 show any differences in behaviour for the Globus GridFTP server compared to the current code state (@da14279). I still wonder how the mentioned deadlocks could happen.


fscheiner commented on July 29, 2024

I assume this is no longer relevant or needed with the inclusion of #179 (specifically https://github.com/gridcf/gct/pull/179/files#diff-f76f3cae29cf9ff2977b4124b6177a87eb0b1500c72908a7ca6194a6c0e21bd8).

