
Comments (8)

msalle commented on July 29, 2024

Hi @fscheiner, short answer: no it hasn't been merged in. Whether it's still needed I don't know.


fscheiner commented on July 29, 2024

@msalle

Whether it's still needed I don't know.

Well, if it wasn't changed, I assume it's still a problem, also because of the description in globus/globus-toolkit#63. If the deadlock behaviour can be easily reproduced, we can just compare the results of (1) using the unmodified source and (2) using a "patched" source, to determine if it's needed or not.


fscheiner commented on July 29, 2024

@msalle
I seem to remember observing globus-gridftp-server processes that keep running w/o producing any load after past transfers were finished. This could be related to the deadlocks described in globus/globus-toolkit#63. So I will try to reproduce this behaviour with the current GCT code locally and then compare to the behaviour with @bbockelm's patch included.


ellert commented on July 29, 2024

The lock is already held during fork().

The fork is done in globus_l_gfs_spawn_child():

static
globus_result_t
globus_l_gfs_spawn_child(
    globus_xio_handle_t                 handle)
{


And though this function does not lock the mutex, it is called from

static
void
globus_l_gfs_server_accept_cb(
    globus_xio_server_t                 server,
    globus_xio_handle_t                 handle,
    globus_result_t                     result,
    void *                              user_arg)
{
    GlobusGFSName(globus_l_gfs_server_accept_cb);
    GlobusGFSDebugEnter();
    globus_mutex_lock(&globus_l_gfs_mutex);

    result = globus_l_gfs_spawn_child(handle);

    globus_mutex_unlock(&globus_l_gfs_mutex);
    GlobusGFSDebugExit();
    return;

I.e. the lock is already held when globus_l_gfs_spawn_child() is called. There is no need to lock it again in this function. Locking the same mutex again from the same thread would be undefined behaviour.
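To make the last point concrete: with plain POSIX threads (which is what globus_mutex_t maps to on Unix platforms, as far as I know), relocking a default-initialized mutex from the same thread is exactly this kind of undefined behaviour. A minimal standalone sketch, not GCT code:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

int
main(void)
{
    pthread_mutex_lock(&mutex);
    printf("first lock acquired\n");
    /* Relocking a PTHREAD_MUTEX_DEFAULT mutex from the owning thread is
     * undefined behaviour; a PTHREAD_MUTEX_NORMAL mutex hangs here
     * forever, an ERRORCHECK mutex would return EDEADLK instead. */
    pthread_mutex_lock(&mutex);
    printf("never reached\n");
    return 0;
}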


ellert commented on July 29, 2024

Never mind - I now realise that the original PR was for a different mutex than this one. Please disregard my comment.


ellert commented on July 29, 2024

Hi again.

I looked at this as part of preparing my "fix some compiler warnings" PR #179.
I now think that my comment above was not so off the mark after all.
In the source file in question - gridftp/server/src/globus_gridftp_server.c - the mutex globus_l_gfs_xio_server->mutex is never locked or released directly. All manipulations of globus_l_gfs_xio_server (assignments, function calls, ...) are (or at least are supposed to be) protected by holding the global mutex globus_l_gfs_mutex. So no lock on globus_l_gfs_xio_server->mutex should be able to persist across a release of the global mutex, and hence my comment above is valid.
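In other words, the intended pattern throughout the file is roughly the following (a schematic sketch of the invariant, not a literal excerpt):

globus_mutex_lock(&globus_l_gfs_mutex);
{
    /* every read or write of globus_l_gfs_xio_server - including the
     * globus_xio_server_*() calls that internally take and release
     * globus_l_gfs_xio_server->mutex - happens inside this region */
    result = globus_xio_server_register_accept(
        globus_l_gfs_xio_server,
        globus_l_gfs_server_accept_cb,
        NULL);
}
globus_mutex_unlock(&globus_l_gfs_mutex);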

However, there is a flaw in this logic - which I think is a bug. For some reason, the lock is released here:

globus_mutex_unlock(&globus_l_gfs_mutex);

inside the globus_l_gfs_spawn_child() function, even though the logic demands that the mutex stay locked for the duration of this function: it must still be held after returning from the call to globus_l_gfs_spawn_child() and is only released by the call to globus_mutex_unlock() on line 1270:

        else
        {
            /* if we fail to actually open a connection with either method we
            do not fail, just log that the connection failed */
            if(globus_i_gfs_config_bool("daemon"))
            {
                result = globus_l_gfs_spawn_child(handle);
                if(result != GLOBUS_SUCCESS)
                {
                    globus_gfs_log_result(
                        GLOBUS_GFS_LOG_ERR, "Could not spawn a child", result);
                    result = GLOBUS_SUCCESS;
                    /* if fork fails (cygwin sometimes),
                     * attempt to run non-forked */
                    if(globus_i_gfs_config_bool("fork_fallback"))
                    {
                        result = globus_l_gfs_open_new_server(handle);
                        if(result != GLOBUS_SUCCESS)
                        {
                            globus_gfs_log_result(
                                GLOBUS_GFS_LOG_ERR,
                                _GSSL("Could not open new handle"),
                                result);
                            result = GLOBUS_SUCCESS;
                        }
                    }
                }
            }
            else
            {
                result = globus_l_gfs_open_new_server(handle);
                if(result != GLOBUS_SUCCESS)
                {
                    globus_gfs_log_result(
                        GLOBUS_GFS_LOG_ERR,
                        _GSSL("Could not open new handle"),
                        result);
                    result = GLOBUS_SUCCESS;
                }
            }
        }
        /* be sure to close handle on server proc and close server on
         * client proc (close on exec)
         *
         * need to handle proc exits and decrement the open server count
         * to keep the limit in effect.
         *
         * win32 will have to simulate fork/exec... should i just do that
         * for unix too?
         */
        if(globus_i_gfs_config_bool("single"))
        {
            result = globus_xio_server_register_close(
                globus_l_gfs_xio_server, globus_l_gfs_server_close_cb, NULL);
            if(result == GLOBUS_SUCCESS)
            {
                globus_l_gfs_outstanding++;
            }
            else
            {
                globus_l_gfs_xio_server = GLOBUS_NULL;
            }
        }
        else if(!globus_l_gfs_terminated)
        {
            result = globus_xio_server_register_accept(
                globus_l_gfs_xio_server,
                globus_l_gfs_server_accept_cb,
                NULL);
            if(result != GLOBUS_SUCCESS)
            {
                goto error_register_accept;
            }
            globus_l_gfs_outstanding++;
            globus_l_gfs_xio_server_accepting = GLOBUS_TRUE;
        }
    }
    globus_mutex_unlock(&globus_l_gfs_mutex);
    GlobusGFSDebugExit();
    return;
error_register_accept:
error_accept:
    globus_l_gfs_terminated = GLOBUS_TRUE;
    if(globus_gfs_config_get_int("open_connections_count") == 0)
    {
        globus_cond_signal(&globus_l_gfs_cond);
    }
    globus_mutex_unlock(&globus_l_gfs_mutex);
    GlobusGFSDebugExitWithError();
}

As can be seen above, there are some function calls after the return from globus_l_gfs_spawn_child() that use globus_l_gfs_xio_server and could potentially screw up the locking on mutex globus_l_gfs_xio_server->mutex without the global lock being held, due to the early release of the lock on line 538.

This unlock is also questionable for a different reason: every other call to globus_xio_register_close() is made while the mutex is held, whereas here the mutex is released just before the call.

Dropping the early release of the lock would fix this.

The direct call to globus_l_gfs_close_cb() that is made in case the callback registration fails must then be handled in some way, since the callback tries to acquire the lock. This could be:

        if(result != GLOBUS_SUCCESS)
        {
            globus_mutex_unlock(&globus_l_gfs_mutex);
            globus_l_gfs_close_cb(handle, result, NULL);
            globus_mutex_lock(&globus_l_gfs_mutex);
        }        

But the callback function is:

static
void
globus_l_gfs_close_cb(
    globus_xio_handle_t                 handle,
    globus_result_t                     result,
    void *                              user_arg)
{
    GlobusGFSName(globus_l_gfs_close_cb);
    GlobusGFSDebugEnter();
    globus_mutex_lock(&globus_l_gfs_mutex);
    {
        globus_i_gfs_connection_closed();
    }
    globus_mutex_unlock(&globus_l_gfs_mutex);
    GlobusGFSDebugExit();
}

So this would be equivalent to:

        if(result != GLOBUS_SUCCESS)
        {
            globus_mutex_unlock(&globus_l_gfs_mutex);
            globus_mutex_lock(&globus_l_gfs_mutex);
            globus_i_gfs_connection_closed();
            globus_mutex_unlock(&globus_l_gfs_mutex);
            globus_mutex_lock(&globus_l_gfs_mutex);
        }        

i.e., one could just write:

        if(result != GLOBUS_SUCCESS)
        {
            globus_i_gfs_connection_closed();
        }        

This would be consistent with the rest of the code in the file: in other places where a callback registration fails while the lock is held, the fallback is to execute the body of the callback inline, without the lock/unlock, e.g.:

    result = globus_xio_register_close(
        handle,
        NULL,
        globus_l_gfs_reject_close_cb,
        NULL);
    if(result != GLOBUS_SUCCESS)
    {
        if(globus_i_gfs_config_bool("inetd"))
        {
            globus_i_gfs_connection_closed();
        }
        else
        {
            globus_l_gfs_outstanding--;
        }
    }

static
void
globus_l_gfs_reject_close_cb(
    globus_xio_handle_t                 handle,
    globus_result_t                     result,
    void *                              user_arg)
{
    globus_mutex_lock(&globus_l_gfs_mutex);
    {
        if(globus_i_gfs_config_bool("inetd"))
        {
            globus_i_gfs_connection_closed();
        }
        else
        {
            globus_l_gfs_outstanding--;
        }
    }
    globus_mutex_unlock(&globus_l_gfs_mutex);
}

Coincidentally, this change is proposed in PR #179:
https://github.com/gridcf/gct/pull/179/files#diff-f76f3cae29cf9ff2977b4124b6177a87eb0b1500c72908a7ca6194a6c0e21bd8
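Putting the two pieces together, the end of globus_l_gfs_spawn_child() would then look roughly like this (a sketch based on the snippets above, not a verbatim copy of the PR):

    /* globus_l_gfs_mutex now stays locked for the whole function;
     * the early globus_mutex_unlock(&globus_l_gfs_mutex) is dropped */
    result = globus_xio_register_close(
        handle,
        NULL,
        globus_l_gfs_close_cb,
        NULL);
    if(result != GLOBUS_SUCCESS)
    {
        /* registration failed: execute the callback's body inline,
         * without its lock/unlock, since the lock is already held */
        globus_i_gfs_connection_closed();
    }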


fscheiner commented on July 29, 2024

globus_mutex_unlock(&globus_l_gfs_mutex);

This came in some 18 years ago. Prior to that, the child was spawned before globus_l_gfs_mutex was locked (see

        result = globus_l_gfs_spawn_child(handle);
        if(result != GLOBUS_SUCCESS)
        {
        }
    }
    else
    {
        result = globus_l_gfs_open_new_server(handle);
        if(result != GLOBUS_SUCCESS)
        {
            globus_i_gfs_log_result("Could not open new handle", result);
            /* we're not going to terminate the server just because we
             * couldnt open a single client
             */
            result = GLOBUS_SUCCESS;
        }
    }
    /* be sure to close handle on server proc and close server on
     * client proc (close on exec)
     *
     * need to handle proc exits and decrement the open server count
     * to keep the limit in effect.
     *
     * win32 will have to simulate fork/exec... should i just do that
     * for unix too?
     */
    globus_mutex_lock(&globus_l_gfs_mutex);

...for details). I'm not sure what the reasons were for changing that back then.


But is globus_l_gfs_mutex for the parent the same mutex as globus_l_gfs_xio_server->mutex for the spawned child? I ask because, if not, I don't see a chance for a deadlock in the child. But as @bbockelm detected one in globus/globus-toolkit#63, there must have been one in the code (at least back then).
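For reference, the classic way such a child-side deadlock arises is forking while some other thread holds a mutex: the child inherits the locked mutex, but not the thread that would unlock it. A minimal standalone illustration with plain POSIX threads (not GCT code; the program hangs to demonstrate the deadlock):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *
holder(void * arg)
{
    pthread_mutex_lock(&mutex);
    sleep(2);                      /* hold the lock across the fork() */
    pthread_mutex_unlock(&mutex);
    return NULL;
}

int
main(void)
{
    pthread_t                       thread;

    pthread_create(&thread, NULL, holder, NULL);
    sleep(1);                      /* make sure holder owns the lock */
    if(fork() == 0)
    {
        /* the child starts with the mutex locked, but the holder
         * thread does not exist here - nobody will ever unlock it */
        pthread_mutex_lock(&mutex);
        printf("never reached in the child\n");
        _exit(0);
    }
    pthread_join(thread, NULL);
    wait(NULL);                    /* hangs waiting for the stuck child */
    return 0;
}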

Won't there be a real chance for a deadlock for the parent in

    result = globus_xio_register_close(
        handle,
        NULL,
        globus_l_gfs_close_cb,
        NULL);

..., after removing the globus_mutex_unlock() in line 538? Inside the call in line 539 a globus_mutex_lock() is taken on handle->context->mutex (see
globus_mutex_lock(&handle->context->mutex);
), which - if this mutex were the same as globus_l_gfs_mutex - couldn't succeed until globus_l_gfs_mutex is unlocked in line 1270.

UPDATE: But maybe not, because if both were the same, the deadlock would - without any changes to line 538 above - already happen in the call to globus_xio_register_open() in

result = globus_xio_register_open(
(which itself calls globus_l_xio_register_open(), which in turn calls globus_mutex_lock(&handle->context->mutex) before the call to globus_l_gfs_spawn_child() happens).

I'll check if the changes from #179 show any differences in behaviour for the Globus GridFTP server compared to the current code state (@da14279). I still wonder how the mentioned deadlocks could happen.


fscheiner commented on July 29, 2024

I assume this is no longer relevant or needed with the inclusion of #179 (specifically https://github.com/gridcf/gct/pull/179/files#diff-f76f3cae29cf9ff2977b4124b6177a87eb0b1500c72908a7ca6194a6c0e21bd8).

