Hi @sflanagan -- Josh has succeeded in reproducing the issue with the current alpha on Ubuntu. We are now debugging the problem and will post additional updates here later this afternoon.
from mdsplus.
@mwinkel-dev
If the fixes of #2746 are needed to address this issue, or would be encountered during our testing, then please include them; however, we were hoping to have the tarball very soon, and wouldn't want to delay that for #2746, unless you think it is part of the overall fix that should be included.
I think you need to add `dispatch/close` to the script that is doing the `dispatch/build`.
TCL> help dispatch/close
Command: DISPATCH/CLOSE
Purpose: Instruct MDSplus actions server(s) to close all open trees.
Format: DISPATCH/CLOSE[/SERVER=(server1,server2...)]
Description:
The DISPATCH/CLOSE can be used to instruct an MDSplus action server to close all MDSplus
trees the server currently has open. If the /SERVER qualifier is not included to specify
any action servers then the current dispatch table (See: DISPATCH/BUILD and
DISPATCH/PHASE commands) is inspected to construct a list of servers that had actions
dispatched to them to send close instructions to. The DISPATCH/CLOSE command is typically
used at the end of a data acquisition cycle to instruct the action server to close
any open trees.
TCL>
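For illustration, a dispatch script incorporating that suggestion might end like this. This is only a sketch: the tree name, shot number, and phase names are placeholders, not taken from the site's actual script.

```
TCL> set tree mytree /shot=123
TCL> dispatch/build
TCL> dispatch/phase INIT
TCL> dispatch/phase STORE
TCL> dispatch/close
TCL> close
```

The `dispatch/close` after the last phase tells the action servers to close the trees they opened for this shot, instead of leaving them open for the next cycle.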
I am happy to meet this afternoon to discuss...
According to Mitchell's post on another issue (#2704, post of 13-Feb-2024), Atlas has been running alpha-7.139.59 since November 2023, and Atlas is on RHEL8.
There's likely more to this than open files, as enabling core dumps against the systemd services running the "dispatcher" (i.e. `mdsip -s` on atlas:8002) and the "mdsloader" (i.e. `mdsip -s` on atlas:8003) turns up some potential issues in mdstcpip/mdsipshr/ConnectToMds.c.
I haven't looked at this too closely, but for the sake of notes I'll just quickly post the `coredumpctl gdb <pid>` / `bt` of the last core dump produced (this one is from the port 8002 side):
[root@atlas ~]# coredumpctl gdb 865034
PID: 865034 (mdsip)
UID: 117 (mdsadmin)
GID: 511 (mdsdev)
Signal: 11 (SEGV)
Timestamp: Mon 2024-03-25 21:44:26 PDT (12h ago)
Command Line: /usr/local/mdsplus/bin/mdsip -s -p 8002 -h /etc/mdsip.hosts
Executable: /usr/local/mdsplus/bin64/mdsip
Control Group: /
Slice: -.slice
Boot ID: 7392c4d0130e43a9a07fbde2f35cd9c7
Machine ID: 1a91ee619a7f42d0b73da694e83b0be3
Hostname: atlas.gat.com
Storage: /var/lib/systemd/coredump/core.mdsip.117.7392c4d0130e43a9a07fbde2f35cd9c7.865034.1711428266000000.lz4
Message: Process 865034 (mdsip) of user 117 dumped core.
Stack trace of thread 866623:
#0 0x00001468186e2c61 do_login (libMdsIpShr.so)
#1 0x000014681530a01e server_connect (libMdsServerShr.so)
#2 0x000014681530ba91 ServerConnect (libMdsServerShr.so)
#3 0x000014681530bc57 ServerSendMessage (libMdsServerShr.so)
#4 0x000014681530cdd7 ServerDispatchAction (libMdsServerShr.so)
#5 0x0000146815307ec6 dispatch (libMdsServerShr.so)
#6 0x000014681530834c ServerDispatchPhase (libMdsServerShr.so)
#7 0x0000146814868521 TclDispatch_phase (libtcl_commands.so)
#8 0x00001468150f1a9a dispatchToHandler (libMdsdcl.so)
#9 0x00001468150f30b3 cmdExecute (libMdsdcl.so)
#10 0x00001468150f6f04 mdsdcl_do_command_extra_args (libMdsdcl.so)
#11 0x00001468150f2abb mdsdcl_do_command (libMdsdcl.so)
#12 0x0000146815304449 DoSrvCommand (libMdsServerShr.so)
#13 0x0000146815304b29 WorkerThread (libMdsServerShr.so)
#14 0x000014681808b1ca start_thread (libpthread.so.0)
#15 0x0000146817cf7e73 __clone (libc.so.6)
Stack trace of thread 865034:
#0 0x0000146817de4a3f __select (libc.so.6)
#1 0x0000146816d8e877 listen_loop (libMdsIpTCP.so)
#2 0x0000146816d8ecc6 io_listen (libMdsIpTCP.so)
#3 0x00000000004008e8 main (mdsip)
#4 0x0000146817cf8d85 __libc_start_main (libc.so.6)
#5 0x000000000040097e _start (mdsip)
Stack trace of thread 866830:
#0 0x000014681809145c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1 0x0000146815306f07 send_monitor_pop (libMdsServerShr.so)
#2 0x00001468153076ab send_monitor_thread (libMdsServerShr.so)
#3 0x000014681808b1ca start_thread (libpthread.so.0)
#4 0x0000146817cf7e73 __clone (libc.so.6)
Stack trace of thread 1046533:
#0 0x0000146817de4a3f __select (libc.so.6)
#1 0x0000146816d8e154 io_recv_to (libMdsIpTCP.so)
#2 0x00001468186d9304 get_bytes_to (libMdsIpShr.so)
#3 0x00001468186d93e2 get_bytes_to (libMdsIpShr.so)
#4 0x00001468186e2eba ConnectionDoMessage (libMdsIpShr.so)
#5 0x0000146816d8d492 client_thread (libMdsIpTCP.so)
#6 0x000014681808b1ca start_thread (libpthread.so.0)
#7 0x0000146817cf7e73 __clone (libc.so.6)
Stack trace of thread 866831:
#0 0x0000146817de4a3f __select (libc.so.6)
#1 0x000014681530b42e receiver_thread (libMdsServerShr.so)
#2 0x000014681808b1ca start_thread (libpthread.so.0)
#3 0x0000146817cf7e73 __clone (libc.so.6)
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-19.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/mdsplus/bin64/mdsip...done.
[New LWP 866623]
[New LWP 865034]
[New LWP 866830]
[New LWP 1046533]
[New LWP 866831]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `/usr/local/mdsplus/bin/mdsip -s -p 8002 -h /etc/mdsip.hosts'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 do_login (c=0x1) at /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c:78
78 /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c: No such file or directory.
[Current thread is 1 (Thread 0x146814c71700 (LWP 866623))]
Missing separate debuginfos, use: yum debuginfo-install mdsplus-alpha-kernel_bin-7.139-59.el8.x86_64
(gdb) bt
#0 do_login (c=0x1) at /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c:78
#1 ConnectToMds (hostin=hostin@entry=0x1468080fdea0 "atlas.gat.com:8003") at /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c:161
#2 0x000014681530a01e in server_connect (server=server@entry=0x1468080fdea0 "atlas.gat.com:8003", addr=1755710912, port=<optimized out>)
at /git/mdsplus/servershr/ServerSendMessage.c:529
#3 0x000014681530ba91 in ServerConnect (server_in=server_in@entry=0x146814c70530 "mdsloader") at /git/mdsplus/servershr/ServerSendMessage.c:557
#4 0x000014681530bc57 in ServerSendMessage (msgid=msgid@entry=0x0, server=server@entry=0x146814c70530 "mdsloader", op=op@entry=2,
retstatus=retstatus@entry=0x146808999338, lock=lock@entry=0x146808999358, conid_out=conid_out@entry=0x14680899933c,
callback_done=0x146815307b60 <action_done>, callback_param=0xa7, callback_before=0x146815306d50 <before>, numargs_in=3)
at /git/mdsplus/servershr/ServerSendMessage.c:339
#5 0x000014681530cdd7 in ServerDispatchAction (id=id@entry=0x0, server=0x146814c70530 "mdsloader", tree=0x146808992fb4 "D3D", shot=<optimized out>,
nid=<optimized out>, ast=ast@entry=0x146815307b60 <action_done>, astprm=0xa7, retstatus=0x146808999338, lock=0x146808999358,
socket=0x14680899933c, before_ast=0x146815306d50 <before>) at /git/mdsplus/servershr/ServerDispatchAction.c:67
#6 0x0000146815307ec6 in dispatch (i=i@entry=167) at /git/mdsplus/servershr/ServerDispatchPhase.c:201
#7 0x000014681530834c in ServerDispatchPhase (id=id@entry=0x0, vtable=<optimized out>, phasenam=phasenam@entry=0x1468080b36b0 "TEST",
noact_in=noact_in@entry=0 '\000', sync=sync@entry=1, output_rtn=output_rtn@entry=0x146814867740 <printIt>, monitor=0x0)
at /git/mdsplus/servershr/ServerDispatchPhase.c:536
#8 0x0000146814868521 in TclDispatch_phase (ctx=<optimized out>, error=0x146814c70c28, output=<optimized out>) at /git/mdsplus/tcl/tcl_dispatch.c:411
#9 0x00001468150f1a9a in dispatchToHandler (getlineinfo=0x0, getline=0x0, output=0x146814c70c20, error=0x146814c70c28, prompt=<optimized out>,
cmdDef=<optimized out>, cmd=0x1468080fdf60, image=<optimized out>) at /git/mdsplus/mdsdcl/cmdExecute.c:935
#10 processCommand (docList=docList@entry=0x146808011e50, verbNode_in=<optimized out>, cmd=cmd@entry=0x1468080065c0,
cmdDef=cmdDef@entry=0x146808006610, prompt=prompt@entry=0x146814c70c10, error=error@entry=0x146814c70c28, output=0x146814c70c20, getline=0x0,
getlineinfo=0x0) at /git/mdsplus/mdsdcl/cmdExecute.c:1165
#11 0x00001468150f30b3 in cmdExecute (cmd=0x1468080065c0, prompt_out=prompt_out@entry=0x0, error_out=error_out@entry=0x146814c70d70,
output_out=output_out@entry=0x146814c70d78, getline=getline@entry=0x0, getlineinfo=getlineinfo@entry=0x0) at /git/mdsplus/mdsdcl/cmdExecute.c:1453
#12 0x00001468150f6f04 in mdsdcl_do_command_extra_args (command=0x146810008db0 "DISPATCH/PHASE TEST", prompt=prompt@entry=0x0,
error=error@entry=0x146814c70d70, output=output@entry=0x146814c70d78, getline=getline@entry=0x0, getlineInfo=getlineInfo@entry=0x0)
at mdsdcl/yylex/cmdParse.y:213
#13 0x00001468150f2abb in mdsdcl_do_command (command=<optimized out>) at /git/mdsplus/mdsdcl/cmdExecute.c:1350
#14 0x0000146815304449 in DoSrvCommand (job_in=job_in@entry=0x1468100430b0) at /git/mdsplus/servershr/ServerQAction.c:642
#15 0x0000146815304b29 in WorkerThread (arg=<optimized out>) at /git/mdsplus/servershr/ServerQAction.c:786
#16 0x000014681808b1ca in start_thread () from /usr/lib64/libpthread.so.0
#17 0x0000146817cf7e73 in clone () from /usr/lib64/libc.so.6
(gdb)
EDIT:
Note that as of the time the following `lsof` was pulled, those PIDs are for:
- dispatcher: systemd service `mds-8002` on PID 3730021
- mdsloader: systemd service `mds-8003` on PID 3730036
[root@atlas system]# ps -fp 3730021
UID PID PPID C STIME TTY TIME CMD
mdsadmin 3730021 1 0 10:19 ? 00:00:01 /usr/local/mdsplus/bin/mdsip -s -p 8002 -h /etc/mdsip.hosts -c 0
[root@atlas system]# ps -fp 3730036
UID PID PPID C STIME TTY TIME CMD
mdsadmin 3730036 1 0 10:19 ? 00:00:01 /usr/local/mdsplus/bin/mdsip -s -p 8003 -h /etc/mdsip.hosts -c 0
A current `lsof -u mdsadmin | grep TCP`:
[root@atlas ~]# lsof -u mdsadmin | grep TCP
mdsip 3730021 mdsadmin mem REG 9,127 80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip 3730021 mdsadmin 3u IPv4 12579304 0t0 TCP *:8002 (LISTEN)
mdsip 3730021 mdsadmin 22u IPv4 12298036 0t0 TCP *:8801 (LISTEN)
mdsip 3730021 mdsadmin 28u IPv4 12428281 0t0 TCP atlas.gat.com:60656->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 102u IPv4 12298128 0t0 TCP atlas.gat.com:60622->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 103u IPv4 12298039 0t0 TCP atlas.gat.com:40034->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 104u IPv4 12410125 0t0 TCP atlas.gat.com:8801->atlas.gat.com:41562 (ESTABLISHED)
mdsip 3730021 mdsadmin 105u IPv4 12298131 0t0 TCP atlas.gat.com:60636->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 106u IPv4 12648855 0t0 TCP atlas.gat.com:51018->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 107u IPv4 12648858 0t0 TCP atlas.gat.com:51026->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 108u IPv4 12298136 0t0 TCP atlas.gat.com:60650->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 109u IPv4 12298160 0t0 TCP atlas.gat.com:52952->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 110u IPv4 12298181 0t0 TCP atlas.gat.com:49554->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 111u IPv4 12420686 0t0 TCP atlas.gat.com:54658->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 112u IPv4 12420672 0t0 TCP atlas.gat.com:49558->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 115u IPv4 12420676 0t0 TCP atlas.gat.com:57096->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 116u IPv4 12298188 0t0 TCP atlas.gat.com:57098->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 117u IPv4 12420679 0t0 TCP atlas.gat.com:57114->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 118u IPv4 12420682 0t0 TCP atlas.gat.com:57122->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 119u IPv4 12298192 0t0 TCP atlas.gat.com:54668->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 120u IPv4 12420690 0t0 TCP atlas.gat.com:38610->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 121u IPv4 12298196 0t0 TCP atlas.gat.com:38614->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 122u IPv4 12427862 0t0 TCP atlas.gat.com:43624->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 123u IPv4 12427868 0t0 TCP atlas.gat.com:43626->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 124u IPv4 12427871 0t0 TCP atlas.gat.com:43630->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 125u IPv4 12427877 0t0 TCP atlas.gat.com:43644->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 126u IPv4 12427884 0t0 TCP atlas.gat.com:43650->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 128u IPv4 12427961 0t0 TCP atlas.gat.com:47100->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 129u IPv4 12298215 0t0 TCP atlas.gat.com:47072->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 130u IPv4 12427957 0t0 TCP atlas.gat.com:47092->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 131u IPv4 12427954 0t0 TCP atlas.gat.com:47086->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 132u IPv4 12428079 0t0 TCP atlas.gat.com:47680->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 133u IPv4 12428055 0t0 TCP atlas.gat.com:33592->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 134u sock 0,9 0t0 12428052 protocol: TCP
mdsip 3730021 mdsadmin 135u IPv4 12428059 0t0 TCP atlas.gat.com:33608->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 136u IPv4 12428094 0t0 TCP atlas.gat.com:46216->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 137u IPv4 12428232 0t0 TCP atlas.gat.com:46218->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 138u IPv4 12428235 0t0 TCP atlas.gat.com:46228->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 139u IPv4 12428240 0t0 TCP atlas.gat.com:46234->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 140u IPv4 12648863 0t0 TCP atlas.gat.com:51030->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 142u IPv4 12428244 0t0 TCP atlas.gat.com:46250->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 143u IPv4 12428251 0t0 TCP atlas.gat.com:46266->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 144u IPv4 12428270 0t0 TCP atlas.gat.com:60642->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 145u IPv4 12728337 0t0 TCP atlas.gat.com:46278->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 146u IPv4 12728340 0t0 TCP atlas.gat.com:46282->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 147u IPv4 12728343 0t0 TCP atlas.gat.com:46292->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 148u IPv4 12428267 0t0 TCP atlas.gat.com:46306->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 149u IPv4 12428273 0t0 TCP atlas.gat.com:60654->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 153u IPv4 12740616 0t0 TCP atlas.gat.com:60674->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 154u IPv4 12428284 0t0 TCP atlas.gat.com:60666->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 155u IPv4 12428287 0t0 TCP atlas.gat.com:60670->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 157u IPv4 12728345 0t0 TCP atlas.gat.com:60684->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 158u IPv4 12715420 0t0 TCP atlas.gat.com:60694->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 159u IPv4 12715423 0t0 TCP atlas.gat.com:60710->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 160u IPv4 12715428 0t0 TCP atlas.gat.com:60720->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 161u IPv4 12715431 0t0 TCP atlas.gat.com:44726->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 162u IPv4 12715435 0t0 TCP atlas.gat.com:44736->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 163u IPv4 12401444 0t0 TCP atlas.gat.com:44744->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 164u IPv4 12401452 0t0 TCP atlas.gat.com:44762->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 165u IPv4 12401459 0t0 TCP atlas.gat.com:44772->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 166u IPv4 12401449 0t0 TCP atlas.gat.com:44760->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 167u IPv4 12401464 0t0 TCP atlas.gat.com:44788->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 168u IPv4 12401467 0t0 TCP atlas.gat.com:44804->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 169u IPv4 12715452 0t0 TCP atlas.gat.com:44806->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 170u IPv4 12715460 0t0 TCP atlas.gat.com:EtherNet/IP-2->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 171u IPv4 12715463 0t0 TCP atlas.gat.com:44832->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 172u IPv4 12401471 0t0 TCP atlas.gat.com:57696->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 173u IPv4 12401475 0t0 TCP atlas.gat.com:57708->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 174u IPv4 12401480 0t0 TCP atlas.gat.com:57712->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 175u IPv4 12401483 0t0 TCP atlas.gat.com:57720->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 177u IPv4 12401488 0t0 TCP atlas.gat.com:57734->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 178u IPv4 12401493 0t0 TCP atlas.gat.com:57750->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 179u IPv4 12401498 0t0 TCP atlas.gat.com:57752->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 180u IPv4 12401501 0t0 TCP atlas.gat.com:57760->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 181u IPv4 12401505 0t0 TCP atlas.gat.com:57768->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 182u IPv4 12401508 0t0 TCP atlas.gat.com:57782->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 183u IPv4 12715507 0t0 TCP atlas.gat.com:57784->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 184u IPv4 12715550 0t0 TCP atlas.gat.com:60928->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 185u IPv4 12715558 0t0 TCP atlas.gat.com:60946->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 186u IPv4 12715553 0t0 TCP atlas.gat.com:60932->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 188u IPv4 12715562 0t0 TCP atlas.gat.com:60948->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 189u IPv4 12715568 0t0 TCP atlas.gat.com:60964->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 190u IPv4 12715573 0t0 TCP atlas.gat.com:60966->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 191u IPv4 12401511 0t0 TCP atlas.gat.com:60980->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 192u IPv4 12401515 0t0 TCP atlas.gat.com:60994->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 193u IPv4 12401608 0t0 TCP atlas.gat.com:59376->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730021 mdsadmin 194u IPv4 12401605 0t0 TCP atlas.gat.com:59374->atlas.gat.com:8003 (ESTABLISHED)
mdsip 3730026 mdsadmin mem REG 9,127 80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip 3730026 mdsadmin 3u IPv4 12568210 0t0 TCP *:8012 (LISTEN)
mdsip 3730036 mdsadmin mem REG 9,127 80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip 3730036 mdsadmin 3u IPv4 12568211 0t0 TCP *:8003 (LISTEN)
mdsip 3730036 mdsadmin 4u IPv4 12609768 0t0 TCP atlas.gat.com:8003->atlas.gat.com:40034 (ESTABLISHED)
mdsip 3730036 mdsadmin 61u IPv4 12628995 0t0 TCP atlas.gat.com:53526->atlas.gat.com:sunwebadmin (CLOSE_WAIT)
mdsip 3730036 mdsadmin 62u IPv4 12628998 0t0 TCP atlas.gat.com:41562->atlas.gat.com:8801 (ESTABLISHED)
mdsip 3730036 mdsadmin 63u IPv4 12609794 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60622 (ESTABLISHED)
mdsip 3730036 mdsadmin 64u IPv4 12609800 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60650 (ESTABLISHED)
mdsip 3730036 mdsadmin 65u IPv4 12230182 0t0 TCP atlas.gat.com:8003->atlas.gat.com:52952 (ESTABLISHED)
mdsip 3730036 mdsadmin 66u IPv4 12230189 0t0 TCP atlas.gat.com:8003->atlas.gat.com:49554 (ESTABLISHED)
mdsip 3730036 mdsadmin 67u IPv4 12230192 0t0 TCP atlas.gat.com:8003->atlas.gat.com:49558 (ESTABLISHED)
mdsip 3730036 mdsadmin 68u IPv4 12230203 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57096 (ESTABLISHED)
mdsip 3730036 mdsadmin 69u IPv4 12230209 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57114 (ESTABLISHED)
mdsip 3730036 mdsadmin 70u IPv4 12230212 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57122 (ESTABLISHED)
mdsip 3730036 mdsadmin 71u IPv4 12230250 0t0 TCP atlas.gat.com:8003->atlas.gat.com:54658 (ESTABLISHED)
mdsip 3730036 mdsadmin 72u IPv4 12230260 0t0 TCP atlas.gat.com:8003->atlas.gat.com:54668 (ESTABLISHED)
mdsip 3730036 mdsadmin 73u IPv4 12230263 0t0 TCP atlas.gat.com:8003->atlas.gat.com:38610 (ESTABLISHED)
mdsip 3730036 mdsadmin 74u IPv4 12230266 0t0 TCP atlas.gat.com:8003->atlas.gat.com:38614 (ESTABLISHED)
mdsip 3730036 mdsadmin 75u IPv4 12715064 0t0 TCP atlas.gat.com:8003->atlas.gat.com:43624 (ESTABLISHED)
mdsip 3730036 mdsadmin 76u IPv4 12715069 0t0 TCP atlas.gat.com:8003->atlas.gat.com:43626 (ESTABLISHED)
mdsip 3730036 mdsadmin 77u IPv4 12715076 0t0 TCP atlas.gat.com:8003->atlas.gat.com:43644 (ESTABLISHED)
mdsip 3730036 mdsadmin 78u IPv4 12715079 0t0 TCP atlas.gat.com:8003->atlas.gat.com:43650 (ESTABLISHED)
mdsip 3730036 mdsadmin 79u IPv4 12230269 0t0 TCP atlas.gat.com:8003->atlas.gat.com:47072 (ESTABLISHED)
mdsip 3730036 mdsadmin 80u IPv4 12715164 0t0 TCP atlas.gat.com:8003->atlas.gat.com:47086 (ESTABLISHED)
mdsip 3730036 mdsadmin 81u IPv4 12715167 0t0 TCP atlas.gat.com:8003->atlas.gat.com:47092 (ESTABLISHED)
mdsip 3730036 mdsadmin 82u IPv4 12715170 0t0 TCP atlas.gat.com:8003->atlas.gat.com:47100 (ESTABLISHED)
mdsip 3730036 mdsadmin 83u IPv4 12582098 0t0 TCP atlas.gat.com:8003->atlas.gat.com:33592 (ESTABLISHED)
mdsip 3730036 mdsadmin 84u IPv4 12582101 0t0 TCP atlas.gat.com:8003->atlas.gat.com:33608 (ESTABLISHED)
mdsip 3730036 mdsadmin 85u IPv4 12582112 0t0 TCP atlas.gat.com:8003->atlas.gat.com:47680 (ESTABLISHED)
mdsip 3730036 mdsadmin 86u IPv4 12582115 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46216 (ESTABLISHED)
mdsip 3730036 mdsadmin 87u IPv4 12582118 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46218 (ESTABLISHED)
mdsip 3730036 mdsadmin 88u IPv4 12582124 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46234 (ESTABLISHED)
mdsip 3730036 mdsadmin 89u IPv4 12582127 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46250 (ESTABLISHED)
mdsip 3730036 mdsadmin 90u IPv4 12582130 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46266 (ESTABLISHED)
mdsip 3730036 mdsadmin 91u IPv4 12625488 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46278 (ESTABLISHED)
mdsip 3730036 mdsadmin 92u IPv4 12582136 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60642 (ESTABLISHED)
mdsip 3730036 mdsadmin 93u IPv4 12582139 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60654 (ESTABLISHED)
mdsip 3730036 mdsadmin 94u IPv4 12582142 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60656 (ESTABLISHED)
mdsip 3730036 mdsadmin 95u IPv4 12582151 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60674 (ESTABLISHED)
mdsip 3730036 mdsadmin 96u IPv4 12582154 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60694 (ESTABLISHED)
mdsip 3730036 mdsadmin 97u IPv4 12582159 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60710 (ESTABLISHED)
mdsip 3730036 mdsadmin 98u IPv4 12582162 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60720 (ESTABLISHED)
mdsip 3730036 mdsadmin 99u IPv4 12582165 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44726 (ESTABLISHED)
mdsip 3730036 mdsadmin 100u IPv4 12582168 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44736 (ESTABLISHED)
mdsip 3730036 mdsadmin 101u IPv4 12625500 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44744 (ESTABLISHED)
mdsip 3730036 mdsadmin 102u IPv4 12625504 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44760 (ESTABLISHED)
mdsip 3730036 mdsadmin 103u IPv4 12625510 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44772 (ESTABLISHED)
mdsip 3730036 mdsadmin 104u IPv4 12625513 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44788 (ESTABLISHED)
mdsip 3730036 mdsadmin 105u IPv4 12625516 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44804 (ESTABLISHED)
mdsip 3730036 mdsadmin 106u IPv4 12582173 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44806 (ESTABLISHED)
mdsip 3730036 mdsadmin 107u IPv4 12582177 0t0 TCP atlas.gat.com:8003->atlas.gat.com:EtherNet/IP-2 (ESTABLISHED)
mdsip 3730036 mdsadmin 108u IPv4 12625519 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57696 (ESTABLISHED)
mdsip 3730036 mdsadmin 109u IPv4 12625522 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57708 (ESTABLISHED)
mdsip 3730036 mdsadmin 110u IPv4 12625525 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57712 (ESTABLISHED)
mdsip 3730036 mdsadmin 111u IPv4 12625528 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57720 (ESTABLISHED)
mdsip 3730036 mdsadmin 112u IPv4 12625531 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57734 (ESTABLISHED)
mdsip 3730036 mdsadmin 113u IPv4 12625534 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57750 (ESTABLISHED)
mdsip 3730036 mdsadmin 114u IPv4 12625537 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57752 (ESTABLISHED)
mdsip 3730036 mdsadmin 115u IPv4 12625543 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57768 (ESTABLISHED)
mdsip 3730036 mdsadmin 116u IPv4 12625546 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57782 (ESTABLISHED)
mdsip 3730036 mdsadmin 117u IPv4 12514018 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57784 (ESTABLISHED)
mdsip 3730036 mdsadmin 118u IPv4 12514021 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60928 (ESTABLISHED)
mdsip 3730036 mdsadmin 119u IPv4 12514024 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60932 (ESTABLISHED)
mdsip 3730036 mdsadmin 120u IPv4 12514027 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60946 (ESTABLISHED)
mdsip 3730036 mdsadmin 122u IPv4 12514031 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60948 (ESTABLISHED)
mdsip 3730036 mdsadmin 123u IPv4 12609797 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60636 (ESTABLISHED)
mdsip 3730036 mdsadmin 124u IPv4 12514034 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60964 (ESTABLISHED)
mdsip 3730036 mdsadmin 125u IPv4 12514037 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60966 (ESTABLISHED)
mdsip 3730036 mdsadmin 126u IPv4 12625549 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60980 (ESTABLISHED)
mdsip 3730036 mdsadmin 127u IPv4 12625552 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60994 (ESTABLISHED)
mdsip 3730036 mdsadmin 128u IPv4 12625561 0t0 TCP atlas.gat.com:8003->atlas.gat.com:59374 (ESTABLISHED)
mdsip 3730036 mdsadmin 129u IPv4 12230206 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57098 (ESTABLISHED)
mdsip 3730036 mdsadmin 130u IPv4 12625564 0t0 TCP atlas.gat.com:8003->atlas.gat.com:59376 (ESTABLISHED)
mdsip 3730036 mdsadmin 131u IPv4 12401620 0t0 TCP atlas.gat.com:8003->atlas.gat.com:51018 (ESTABLISHED)
mdsip 3730036 mdsadmin 132u IPv4 12401626 0t0 TCP atlas.gat.com:8003->atlas.gat.com:51030 (ESTABLISHED)
mdsip 3730036 mdsadmin 138u IPv4 12715072 0t0 TCP atlas.gat.com:8003->atlas.gat.com:43630 (ESTABLISHED)
mdsip 3730036 mdsadmin 150u IPv4 12582121 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46228 (ESTABLISHED)
mdsip 3730036 mdsadmin 153u IPv4 12582145 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60666 (ESTABLISHED)
mdsip 3730036 mdsadmin 155u IPv4 12625491 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46282 (ESTABLISHED)
mdsip 3730036 mdsadmin 156u IPv4 12625494 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46292 (ESTABLISHED)
mdsip 3730036 mdsadmin 157u IPv4 12582133 0t0 TCP atlas.gat.com:8003->atlas.gat.com:46306 (ESTABLISHED)
mdsip 3730036 mdsadmin 160u IPv4 12582148 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60670 (ESTABLISHED)
mdsip 3730036 mdsadmin 164u IPv4 12625497 0t0 TCP atlas.gat.com:8003->atlas.gat.com:60684 (ESTABLISHED)
mdsip 3730036 mdsadmin 172u IPv4 12625507 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44762 (ESTABLISHED)
mdsip 3730036 mdsadmin 178u IPv4 12582180 0t0 TCP atlas.gat.com:8003->atlas.gat.com:44832 (ESTABLISHED)
mdsip 3730036 mdsadmin 186u IPv4 12625540 0t0 TCP atlas.gat.com:8003->atlas.gat.com:57760 (ESTABLISHED)
mdsip 3730036 mdsadmin 201u IPv4 12401623 0t0 TCP atlas.gat.com:8003->atlas.gat.com:51026 (ESTABLISHED)
mdsip 3767135 mdsadmin mem REG 9,127 80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip 3767135 mdsadmin 0u IPv4 12230301 0t0 TCP atlas.gat.com:mdsip->iris29.gat.com:59836 (ESTABLISHED)
mdsip 3771645 mdsadmin mem REG 9,127 80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip 3771645 mdsadmin 0u IPv4 12689154 0t0 TCP atlas.gat.com:mdsip->iris29.gat.com:60068 (ESTABLISHED)
Systemd unit contents:
mds-8002:
[Unit]
Description=MDSplus Dispatcher Service
After=network.target
[Service]
User=mdsadmin
Type=simple
LimitNOFILE=64000
LimitCORE=infinity
ExecStart=/usr/local/mdsplus/bin/mdsipd-dispatching 8002 -block
[Install]
WantedBy=multi-user.target
mds-8003:
[Unit]
Description=MDSplus Loader Service
After=network.target
[Service]
Type=simple
LimitNOFILE=64000
LimitCORE=infinity
ExecStart=/usr/local/mdsplus/bin/mdsipd-dispatching 8003 -block
[Install]
WantedBy=multi-user.target
OS (atlas.gat.com):
[root@atlas system]# cat /etc/*-release
NAME="Red Hat Enterprise Linux"
VERSION="8.8 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
Red Hat Enterprise Linux release 8.8 (Ootpa)
Red Hat Enterprise Linux release 8.8 (Ootpa)
MDSplus Version:
[root@atlas system]# rpm -qa | grep mds
mdsplus-alpha-mssql-7.139-59.el8.x86_64
mdsplus-alpha-idl_bin-7.139-59.el8.x86_64
mdsplus-alpha-kernel_bin-7.139-59.el8.x86_64
mdsplus-alpha-d3d-7.139-59.el8.noarch
mdsplus-alpha-kernel-7.139-59.el8.noarch
mdsplus-alpha-motif-7.139-59.el8.noarch
mdsplus-alpha-repo-7.139-40.el8.noarch
mdsplus-alpha-motif_bin-7.139-59.el8.x86_64
mdsplus-alpha-python-7.139-59.el8.noarch
mdsplus-alpha-idl-7.139-59.el8.noarch
To explain `mdsipd-dispatching`:
#!/bin/sh -f
#
# MDSplus server daemon procedure
#
# Parameters:
#
# $1 = service or portnumber
if [ $# -lt 1 ]
then
echo ""
echo "Correct usage is $0 portnumber [-block]"
echo example: $0 8001
echo ""
exit
fi
MDSPLUS_DIR=/usr/local/mdsplus
# Note that we source the setup-server.sh file because this is to be a server
# process, not a regular data server process.
. $MDSPLUS_DIR/setup-server.sh
logfile=/var/log/mdsplus/mdsip-server/$1.log
echo "" >> $logfile 2>> $logfile
echo "Telling any existing server on this port to shut down..." >> $logfile 2>> $logfile
/usr/local/mdsplus/bin/mdstcl stop server localhost:$1 >> $logfile 2>> $logfile
echo "" >> $logfile 2>> $logfile
echo "############# END OF OLD LOGFILE #############" >> $logfile 2>> $logfile
echo "" >> $logfile 2>> $logfile
echo "############ START OF NEW LOGFILE ############" >> $logfile 2>> $logfile
echo "Starting mdsip -s on port "$1" at "`date` >> $logfile 2>> $logfile
echo "--------------------" >> $logfile 2>> $logfile
echo "Showing ulimit -a" >> $logfile 2>> $logfile
ulimit -a >> $logfile 2>> $logfile
echo "" >> $logfile 2>> $logfile
echo "--------------------" >> $logfile 2>> $logfile
echo "" >> $logfile 2>> $logfile
# if -block is specified, then don't run in the background
if [ $# -gt 1 ]
then
if [ $2 = "-block" ]
then
exec $MDSPLUS_DIR/bin/mdsip -s -p $1 -h /etc/mdsip.hosts -c 0 >> $logfile 2>> $logfile
exit 0
fi
fi
exec $MDSPLUS_DIR/bin/mdsip -s -p $1 -h /etc/mdsip.hosts -c 0 >> $logfile 2>> $logfile &
Please note that the error mentioned occurs after ~1,000 action nodes have been triggered.
This happens whether it is 1k actions in a single shot, or across multiple shots.
from mdsplus.
Thanks for all the info.
A quick workaround:
- add `Restart=always` to your service files
- after the last `dispatch/phase`, add `exit`
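For reference, a minimal sketch of what such a unit override might look like. The unit name `mdsip-action-server` and the drop-in path are hypothetical placeholders; adapt them to your site's actual service:

```ini
# /etc/systemd/system/mdsip-action-server.service.d/override.conf
# Hypothetical drop-in; the unit name and path are placeholders.
[Service]
# Restart the action server whenever it exits, so that a TCL "exit"
# (or "stop server") after the last dispatch/phase leaves a fresh
# process with a clean slate of sockets for the next shot.
Restart=always
RestartSec=2
```

After adding the drop-in, run `systemctl daemon-reload` for the change to take effect.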
from mdsplus.
Took some time to configure my dev environment, but I am now also able to reproduce the bug. So will attach the debugger and figure out how the code is leaving dangling sockets.
from mdsplus.
Progress has been made on debugging the issue, but still more to do to locate the root cause of the dangling sockets. Have provided my initial findings to Josh and Stephen for review and suggestions.
from mdsplus.
@mwinkel-dev Glad you were able to reproduce. Did the workaround proposed by @joshStillerman work?
from mdsplus.
Hi @smithsp and @sflanagan -- I will test Josh's workaround right now. I'm sure it will work.
I have been manually killing the `mdsip` process between experiments, and that of course cleans up the dangling sockets. Nonetheless, you have a very good point that I should also automate that by using Josh's workaround (which kills the `mdsip` process at the end of the shot and relies on `systemd` to restart it).
Current investigation shows that the dangling sockets gradually accumulate until the process that launches the actions (such as `TCL`) exits. When the process exits, the sockets are cleaned up. After I test Josh's workaround, I will resume looking for a fix that closes the sockets when each action thread finishes, instead of waiting until the entire process exits.
from mdsplus.
Hi @smithsp and @sflanagan -- I have tested Josh's recommended workaround and it is A-OK.
Here are the details of the experiment . . .
- There are two action nodes in a dinky tree named `zzz`. Each action node just sleeps for 10 seconds.
- The `mdsip` access to the `zzz_<shot>` trees is on the usual port 8000.
- The `mdsip` process for actions (aka action server) is on port 8100 and configured as a `systemd` service.
- The `systemd` service file is similar to GA's, but also has `Restart=always` in the `[Service]` section.
- Then these `TCL` commands are executed (as per Josh's reproduce case):
```
TCL> set tree zzz
TCL> create pulse 1
TCL> set tree zzz /shot=1
TCL> dispatch /build
TCL> dispatch/phase/log init
```
- After ~20 seconds, control returns to the `TCL>` prompt.
- In another terminal window, run `lsof` and grep for the action server on port 8100. Note the dangling TCP sockets. (They do get cleaned up if you exit `TCL`. However, for this example we wish to remain in `TCL`.)

```
mdsip 3899986 mdsplus mem REG 8,5 102144 31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip 3899986 mdsplus 3u IPv4 30142804 0t0 TCP *:8100/tcp (LISTEN)
mdsip 3899986 mdsplus 6u IPv4 30155996 0t0 TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip 3899986 mdsplus 7u IPv4 30144205 0t0 TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip 3899986 mdsplus 9u IPv4 30156000 0t0 TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
mdsip 3899986 3900283 mdsip mdsplus mem REG 8,5 102144 31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip 3899986 3900283 mdsip mdsplus 3u IPv4 30142804 0t0 TCP *:8100/tcp (LISTEN)
mdsip 3899986 3900283 mdsip mdsplus 6u IPv4 30155996 0t0 TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip 3899986 3900283 mdsip mdsplus 7u IPv4 30144205 0t0 TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip 3899986 3900283 mdsip mdsplus 9u IPv4 30156000 0t0 TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
mdsip 3899986 3900284 mdsip mdsplus mem REG 8,5 102144 31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip 3899986 3900284 mdsip mdsplus 3u IPv4 30142804 0t0 TCP *:8100/tcp (LISTEN)
mdsip 3899986 3900284 mdsip mdsplus 6u IPv4 30155996 0t0 TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip 3899986 3900284 mdsip mdsplus 7u IPv4 30144205 0t0 TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip 3899986 3900284 mdsip mdsplus 9u IPv4 30156000 0t0 TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
mdsip 3899986 3900285 mdsip mdsplus mem REG 8,5 102144 31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip 3899986 3900285 mdsip mdsplus 3u IPv4 30142804 0t0 TCP *:8100/tcp (LISTEN)
mdsip 3899986 3900285 mdsip mdsplus 6u IPv4 30155996 0t0 TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip 3899986 3900285 mdsip mdsplus 7u IPv4 30144205 0t0 TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip 3899986 3900285 mdsip mdsplus 9u IPv4 30156000 0t0 TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
```
- Then at the `TCL>` prompt, type this: `TCL> stop server localhost:8100`
- That kills the current action server.
- Whereupon `systemd` automatically restarts the service. It remains on port 8100, but of course has a new process ID (i.e., compare the following process ID with the one above).

```
root@mfews-mwinkel:/home/mwinkel/x_markw/config# ps -ef | fgrep mdsip
mdsplus 3900338 1 0 18:57 ? 00:00:00 /usr/local/xmw_mds/mdsplus/bin/mdsip -s -p 8100 -h /tmp/xmw_ga_2731/mdsip.hosts -c 9
```

- And as expected, `lsof` shows that the new process has just a single TCP socket listening on port 8100.

```
root@mfews-mwinkel:/home/mwinkel/x_markw/config# lsof | fgrep 3900338 | fgrep TCP
mdsip 3900338 mdsplus mem REG 8,5 102144 31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip 3900338 mdsplus 3u IPv4 30160993 0t0 TCP *:8100/tcp (LISTEN)
```

- While still at the `TCL>` prompt, run a new shot (i.e., repeat all the steps listed above). And observe the same behavior.
In short, Josh's workaround works fine on my Ubuntu 20 development system.
Let us know if you have any problems with this workaround on GA's RHEL8 server.
from mdsplus.
Hi @smithsp and @sflanagan -- Root cause of the dangling sockets in `mdstcl` has been found. When "actions" complete, there is a coordination problem between two threads. Conceptually, one thread attempts to delete the associated socket, but can't because the other thread still has a reference to it. (That is a generalization; the details are a bit different.)
It will take us a few days to fix the issue (i.e., iterate on the design, manual testing, and peer review).
While investigating this issue, I have also been using the systemd workaround (so I have a clean slate for each experiment). The workaround is working fine on my development system.
Let me know if you have any problems with the systemd workaround on GA's server.
Addendum
This post (and many of the following posts) contains initial conjectures that turned out to be wrong. For the final summary of the investigation, refer to the post at this link.
#2731 (comment)
from mdsplus.
Hi @sflanagan -- How many "actions" are issued when running a single DIII-D shot? And how many "action servers" are involved? And how many shots per day?
Would like that information so that I can design a test case that is many times more load on the "action server" than the DIII-D load.
from mdsplus.
Assuming, as an upper bound, that everything is enabled and ops is having a long and/or productive run day...
How many "actions" are issued when running a single DIII-D shot?
~300
And how many "action servers" are involved?
1
And how many shots per day?
~50
from mdsplus.
Let me know if you have any problems with the systemd workaround on GA's server.
I redesigned my mds-8002 `systemd` file to add:

```
User=mdsadmin
Restart=always
ExecStartPost=/usr/local/mdsplus/local/mds-8002-post.sh
```

Where the `mds-8002-post.sh` just automatically sets up the dispatch table, shot number(s), opens the relevant shot tree(s), etc... and then triggers a "test" action node.
Seems "fine" as a temporary workaround, but it does leave me open to skipping/missing a shot if the dispatcher service is restarting while the DAQ system is trying to start a new shot cycle. Trying to reduce that risk is next up on my agenda (if I can).
from mdsplus.
Hi @sflanagan -- Thanks for the information about the number of actions in a DIII-D shot. And for the news regarding the `systemd` workaround.
from mdsplus.
Now have a prototype fix. Next step is to do some manual stress testing, followed by peer review. Will take a few more days before there is a final fix that will be merged to alpha and cherry-picked to the GA branch.
from mdsplus.
Hi @smithsp and @sflanagan,
Summary
This post is a summary of recent emails with @smithsp, plus provides additional details.
This problem was introduced with PR #2288 on 19-May-2021. The bug appears to only affect the `dispatch` command of `TCL`. Thus, it is likely that the associated fix is low risk. (Testing is presently underway to confirm that is indeed true.)
Details
When `mdstcl` (which is a client) dispatches an action, it also creates a "receiver" thread to receive the eventual status reply regarding that action. (The action is sent to an "action server" that then processes the action and sends a reply back to the `mdstcl` client.) For each action that is dispatched, `mdstcl` thus opens two sockets: one is the "to" socket used to send the action to the action server, and the other is the "from" socket used to receive the reply from the action server.
The issue is that the "connection ID table" in the `mdstcl` client should be accessible by all threads, but unfortunately isn't. Thus, when the receiver thread receives the `SrvJobFINISHED` reply, it can close its "from" socket, but is unable to access the main thread's "connection ID table" in order to close the "to" socket. Thus, for each action dispatched, the current code leaves a dangling "to" socket.
The two workarounds clean up the dangling "to" sockets. Adding an "exit" statement at the end of a `TCL` script that dispatches actions terminates the `TCL` process, so all of its sockets are closed. Alternatively, using the `stop server` command of TCL will cause the action server to restart (if `systemd` or `xinetd` is configured correctly), which also closes all dangling sockets.
The proposed fix does not alter any logic regarding sockets and connections. All that it does is allow the "receiver" thread to access the "connection ID table" so that the "to" socket can be closed. Preliminary testing confirms that the fix works and that `mdstcl` is no longer leaking a "to" socket for each dispatched action. Thus, the workarounds are no longer needed.
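Conceptually, the fix amounts to making the connection ID table a properly shared resource. The following is a minimal illustrative sketch, not the actual MDSplus code; the names (`conn_table`, `conn_add`, `conn_remove`) are invented. It shows a table guarded by one mutex, so that a receiver thread can release the "to" socket that the dispatch thread registered:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical sketch of a shared "connection ID table"; the real
 * MDSplus structures differ.  The point: one mutex guards the table,
 * so a receiver thread can look up and release the "to" socket that
 * the main (dispatch) thread opened. */
#define MAX_CONNS 64

typedef struct { int in_use; int to_sock; } conn_entry;

static conn_entry conn_table[MAX_CONNS];
static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Register a connection; returns its id, or -1 if the table is full. */
int conn_add(int to_sock) {
  int id = -1;
  pthread_mutex_lock(&conn_lock);
  for (int i = 0; i < MAX_CONNS; i++)
    if (!conn_table[i].in_use) {
      conn_table[i].in_use = 1;
      conn_table[i].to_sock = to_sock;
      id = i;
      break;
    }
  pthread_mutex_unlock(&conn_lock);
  return id;
}

/* Called by a receiver thread (e.g. on SrvJobFINISHED): release the
 * entry and hand back the "to" socket so the caller can close(2) it. */
int conn_remove(int id) {
  int sock = -1;
  pthread_mutex_lock(&conn_lock);
  if (id >= 0 && id < MAX_CONNS && conn_table[id].in_use) {
    sock = conn_table[id].to_sock;
    conn_table[id].in_use = 0;
  }
  pthread_mutex_unlock(&conn_lock);
  return sock; /* -1 if the id was already released */
}
```

Every access goes through the one lock, which is what prevents the "dangling to-socket" situation described above.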
Low Risk?
Whenever fiddling with multi-threaded code, there is always a risk of race conditions being created. Hence the focus on extensive testing of the fix prior to release to GA. The goal of the testing is to confirm that all critical sections are protected with locks. Initial testing is promising; no threading issues have been encountered.
And because the problem appears to be limited to the `dispatch` command of `mdstcl`, that also reduces the risk associated with the fix. It is unlikely to affect other aspects of the MDSplus suite of tools.
Note also that this bug has been present in the `alpha` branch for ~3 years, and that no other customers have reported problems. That tends to confirm that the problem doesn't affect the entire socket communication foundation of MDSplus. And likely also means that the fix will have lower risks than expected. (Regardless, extensive testing is needed to confirm that is so.)
Clarification
This issue affects all commands sent to an "action server". That includes the following activities (i.e., the list gives descriptions; the actual `TCL` statements are likely different).
- abort server
- close trees
- create pulse
- dispatch action
- dispatch command
- monitor check-in
- monitor
- logging
- stop server
from mdsplus.
Hi @sflanagan -- Questions about your site's single "action server" . . .
- What version of MDSplus does it run?
- What Operating System and version?
- Does the "action server" MDSplus/OS version match that of the "dispatch server" (apparently Atlas)?
- Is it a dedicated "action server" (one that only processes actions)? (Am curious if it is used for other MDSplus activities such as data analysis and/or to run non-MDSplus applications.)
Answers to the above questions will enable us to set up a cross-version testing environment, if needed.
from mdsplus.
A stress test of 1,200 actions (400 actions per shot, 3 shots) reveals that there is an unprotected critical section that can cause a segfault. The segfault also occurs in the `alpha` branch, thus was not introduced by the proposed fix.
Adding a delay of 0.01 seconds to each action eliminates the segfault, but causes the proposed fix to leave ~10 dangling sockets (i.e., it closes the other 1,190 "to" sockets). Additional investigation is required to determine if the dangling sockets are caused by a second unprotected critical section. And if so, whether it was introduced with the proposed fix.
Increasing the delay to 0.1 seconds allows the proposed fix to work fine -- no dangling sockets.
from mdsplus.
The proposed fix successfully handled 20,000 actions without leaking sockets. (However, each action had a 0.1 second delay.) That was 50 back-to-back shots with 400 actions per shot. The test was designed to generate a spike load on the "action server".
An even more extreme stress test does create a segfault in both the current `alpha` branch and the proposed fix. Preliminary investigation indicates that the segfault is probably a pre-existing issue and was not introduced by the proposed fix. The segfault investigation continues (and perhaps will be tracked in a new GitHub issue).
Summary is that the proposed fix is now working well enough that it is moving on to the peer review stage. However, it will still be a few days before the fix is ready for release to GA.
from mdsplus.
Thanks for the update @mwinkel-dev .
from mdsplus.
Took all of Saturday to figure out that there is a "race" condition associated with this fix.
When `mdstcl` dispatches an action to the "action server", the server sends back two messages: "handshake" and "done". And they are received by two different threads.
- The "handshake" message is when the server acknowledges that it has received an action.
- The "done" message is sent when the server has processed the action.
Note: The names, "handshake" and "done", are just for clarity in this post. Those names don't appear in the source code.
In practice, there is usually a fair amount of time between the two messages (e.g., it takes time for an action to configure a digitizer). And that time interval enables the code to process the "handshake" before the "done".
However, the spike load test has an action that can be done instantly (i.e., just evaluating the number `1`). And thus on a heavily loaded action server, the "done" message can be received and processed by `mdstcl` before the "handshake" is fully processed. Because processing the "done" message deletes the associated connection, the "handshake" code gets mighty confused when its connection id gets clobbered. Hence the errors and occasional segfaults.
Conceptually, the ideal fix to avoid the "race" condition is to use a lock on every single connection object. However, it will likely be a challenge to retrofit that into the existing code. (There is a risk of subtle bugs creeping in that could result in deadlock.)
The practical fix is probably just to add 100ms delay before processing the "done" message and closing the connection. Experiments prove that this simple approach almost always avoids the race condition. (It leaked 2 or 3 sockets out of 400 actions processed.)
However, it is important that the dispatch cycle roughly match the added delay, so that connections are created and destroyed at approximately the same rate. If the delay is too long, and there are hundreds of actions dispatched at a high rate, then all sockets (allowed by the operating system) can be consumed, thereby causing `mdstcl` and/or the action server to crash.
An alternative concept for matching connection creation / destruction rates is for `mdstcl` to keep a running tally of open connections. If the count hits a user-specified threshold (e.g., 500 active connections max), then it pauses dispatching actions. As actions are drained from the server's input queue, `mdstcl` will close connections accordingly. Thus, as each action finishes, `mdstcl` will be able to dispatch another action.
Combining both approaches would allow `mdstcl` to avoid both problems: the race condition and socket exhaustion.
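The running-tally idea above is essentially a counting semaphore around dispatch. A hedged sketch of the concept (invented names; not the MDSplus implementation):

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical throttle: the dispatch loop blocks once `active`
 * reaches `limit`; each processed "done" reply frees a slot.
 * Names are invented; this only illustrates the idea. */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t freed;
  int active;
  int limit; /* e.g. 500, well below the OS file-descriptor limit */
} throttle_t;

void throttle_init(throttle_t *t, int limit) {
  pthread_mutex_init(&t->lock, NULL);
  pthread_cond_init(&t->freed, NULL);
  t->active = 0;
  t->limit = limit;
}

/* The main thread calls this before dispatching each action;
 * it blocks while `limit` actions are already in flight. */
void throttle_acquire(throttle_t *t) {
  pthread_mutex_lock(&t->lock);
  while (t->active >= t->limit)
    pthread_cond_wait(&t->freed, &t->lock);
  t->active++;
  pthread_mutex_unlock(&t->lock);
}

/* A receiver thread calls this after processing a "done" message. */
void throttle_release(throttle_t *t) {
  pthread_mutex_lock(&t->lock);
  t->active--;
  pthread_cond_signal(&t->freed);
  pthread_mutex_unlock(&t->lock);
}
```

With such a cap, connections are created and destroyed at approximately the same rate regardless of how fast actions complete.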
from mdsplus.
There is a fatal flaw in the proposed fix (call it "Option A") described in the preceding posts: deadlock.
The existing architecture of the `mdstcl dispatch` feature consists of the following:
- a "connection id" table that contains information about each connection (sockets used, connection state, and so forth),
- a main thread that also dispatches actions, and
- many "receiver" threads, one per each action sent to the action server.
Option A treats the "connection id" table as a resource shared by all threads. And thus locks are used to protect access when adding or deleting connections. However, in order for that to work well, the main thread needs to be throttled so that the receiver threads have an opportunity to access the table too. Note though that if a receiver thread obtains the lock and then dies before releasing it, the main thread is permanently blocked from dispatching more actions. Thus `mdstcl` deadlocks. (There are likely other edge cases that would also create deadlock.)
Although the principles and architectures for high-performance, robust, multi-threaded, web servers and clients are now well known, the challenge is in retrofitting the existing action server architecture to accommodate those features.
Option B might be a more practical and less risky option.
It has these features . . .
- the main dispatch thread passes a connection's socket to the associated receiver thread
- thus the receiver threads can close sockets without having to access the "connection id" table
- that eliminates the need for locking on the table (because it is only accessed by the main thread)
- it only allows ~N active connections at a time, where N is chosen to be well below the number of sockets allowed by the operating system,
- N is user configurable (likely via an environment variable)
Advantages:
- doesn't immediately consume all available sockets when dispatching hundreds of actions
- doesn't leak sockets
- avoids deadlock
Disadvantages:
- when the action server is finished processing an action, its entry in the "connection id" table becomes obsolete
- thus the "connection id" table constantly grows
Analysis:
- suppose that 20,000 actions are dispatched in a day
- the memory required to store a "connection id" table of 20,000 entries is fairly small -- probably just a few MB
- the growth of the table is highly unlikely to force a reboot of the server (as is the case if run out of sockets)
- a cron job can be run each evening (when shots aren't being run) to restart the action server, thus each morning the action server starts with an empty "connection id" table
- the current shot's connections are added to the head of the table, thus table growth during the day won't have much impact on the speed of table lookups or adding new entries
- this option involves little change to the existing code, plus avoids deadlock, thus is less risky than Option A
Summary:
- The long-term solution is to write a new dispatcher using a more robust architecture.
- However, to meet the schedule of DIII-D's upcoming campaign, the faster and safest approach is to make minimal changes to the existing `mdstcl dispatch` feature.
- Option B thus shows more promise than Option A.
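Option B's core mechanic — the dispatch thread hands the socket to the receiver thread at creation time, so the receiver never needs the shared table — can be sketched as follows. This is a hypothetical illustration with invented names, not the actual MDSplus code:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical sketch of Option B: the dispatch thread passes the
 * "to" socket into the receiver thread's argument, so the receiver
 * can clean it up without touching the connection-id table (which
 * then needs no locking, being used by the main thread only). */
typedef struct {
  int to_sock; /* socket used to dispatch the action */
  int closed;  /* set once the receiver has cleaned up */
} receiver_arg;

static void *receiver_main(void *vp) {
  receiver_arg *arg = vp;
  /* ... wait for the "done" reply on the reply connection ... */
  /* The receiver owns to_sock, so it could close(2) it directly.
   * No real socket is opened in this sketch, so we just mark it. */
  arg->closed = 1;
  return NULL;
}

/* Dispatch one action: spawn a receiver that owns the socket.
 * (Joined immediately here only to keep the sketch deterministic.) */
int dispatch_action(receiver_arg *arg, int to_sock) {
  pthread_t tid;
  arg->to_sock = to_sock;
  arg->closed = 0;
  if (pthread_create(&tid, NULL, receiver_main, arg) != 0)
    return -1;
  return pthread_join(tid, NULL);
}
```

Because ownership of the socket moves with the thread, no lock on the table is needed and the deadlock mode of Option A disappears.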
from mdsplus.
Now have a rough prototype of Option B. It dispatched 3,600 actions at full speed and only leaked one socket. The Linux `top` command shows a negligible memory footprint (i.e., as expected the growth of the "connection id" table is insignificant).
On Monday, will clean up the code and do more extensive testing. And if the results look good, will move Option B to the peer review stage of the development process.
from mdsplus.
Option B prototype dispatched 21,600 actions with just 35 sockets leaked (a leak rate of ~0.2%). Actions were dispatched at the rate of 10 actions per second. Rough estimate is that the "connection ID" table grew to ~187 MB.
However, if the dispatch throttle of 10 actions/second is removed, then `mdstcl` can deadlock when dispatching actions. That is a latent bug that has surely existed in the code for years; it is unrelated to the fix for this socket issue. (Nonetheless, it will be investigated before this socket fix is released to GA.)
When `mdstcl` deadlocked during testing, I usually did not also examine the "action server". However, the one instance when I did check revealed that the "action server" remained active (i.e., I opened a new terminal window, started another `mdstcl` session and dispatched more actions). When additional `mdstcl` deadlocks are observed, more tests will be done to determine if the "action server" is always unaffected.
from mdsplus.
There is likely another mode that causes `mdstcl dispatch` to leak sockets. It is associated with the queue length on the "action server".
Here are the details . . .
Because `mdstcl` dispatches actions and processes the associated "done" messages, it can estimate the number of actions that are queued up on the "action server". (For each queued action, two sockets are used.)
I configured `mdstcl` so it would maintain ~20 actions in the server's queue, and then dumped 400 actions into the dispatch hopper. It quickly ramped up to ~20 dispatched actions, and thereafter maintained steady state. When an action finished, it would dispatch another action and maintain the tally at ~20 dispatched-and-queued actions. The logs for `mdstcl` and the "action server" both indicated all 400 actions were processed. So I was surprised to see that `lsof` reported ~20 leaked sockets.
I repeated the experiment with the limit set at ~100 actions, and observed that ~100 sockets leaked.
On Tuesday, I will dig into this mystery. (There is a chance that there is a flaw in the design / implementation of the experiment.)
from mdsplus.
> This problem was introduced with PR #2288 on 19-May-2021. The bug appears to only affect the `dispatch` command of `TCL`. Thus, it is likely that the associated fix is low risk. (Testing is presently underway to confirm that is indeed true.)
I slowly remember the struggle we had sorting out the existing problem at the time while keeping backward compatibility. The issue was that the old protocol required a reconnection on an offered port to send the result of the action.
The original goals that I remember were to support:
- cancellation of dispatched actions by event (e.g. user intervention, Ctrl+C)
- timeout
- parallel execution of actions
from mdsplus.
Hi @zack-vii -- Thanks for the history of the change. Much appreciated!
I will expand my experiments to see if my proposed fix for the leaking sockets breaks any of the features you listed above.
from mdsplus.
When `mdstcl` is configured to always dispatch `N` more actions than have finished, some actions will fail. If hundreds of actions are dispatched, eventually there will be `N` failed actions. The action server never sends a "done" message back to `mdstcl` for those failed actions, which therefore remain connected, thus causing `N` leaked sockets.
The actions are failing because their associated "job callback" functions are misbehaving. Now investigating if the "job callback" problem is a multi-threading issue.
from mdsplus.
Hi @smithsp and @sflanagan,
I am now compiling and testing the proposed fix (PR #2740) that was created by @zack-vii. Results of the test will be posted in that PR and summarized in this issue.
Note that @zack-vii was the author of PRs #2288 and #2289 that involved the `dispatch` feature of `mdstcl`. His contributions are thus welcomed and appreciated.
from mdsplus.
PR #2740 by @zack-vii is a server-side fix and thus more elegant than the client-side (i.e., `mdstcl`) fix that I created. We should use PR #2740 as the foundation for the fix of this issue.
However, there is an edge case in `mdstcl` associated with "failed actions" that leaks sockets even with PR #2740 present. The edge case likely will require a client-side fix. And will probably be a separate PR.
The complete fix of this issue will likely consist of two or three PRs.
from mdsplus.
@mwinkel-dev @zack-vii
Thanks for your continuing efforts and attention to testing in developing a fix.
from mdsplus.
Note that the workaround provided of letting the system restart is not robust. We look forward to a fix as soon as possible.
from mdsplus.
Hi @smithsp -- Thank you for the update. I will put that news on the agenda for Tuesday's meeting of the MDSplus software team.
from mdsplus.
Hi @smithsp and @sflanagan,
Although the temporary workarounds are not an adequate solution for GA, nonetheless it would be helpful for us to learn what aspect of the workaround is failing for GA. (That information might help us create additional test cases for the real fix.)
On 1-Apr-2024, @sflanagan posted that the temporary fix does mean that a shot could be missed (which of course is a serious problem). However, are there now additional problems with the temporary workarounds?
#2731 (comment)
Also, which of these workarounds has GA tried?
- adding an `exit` statement at the end of all TCL scripts, or
- adding a `stop server` command at the end of all TCL scripts, or
- running a shell script between shots that executes the following `systemd` commands (where "action_server" is whatever the name is of the `systemd` service that runs your "action server"):

```
systemctl stop action_server
systemctl start action_server
```

- or running a shell script between shots that executes `kill -9` on the "action server" process
from mdsplus.
As per the request submitted via PR #2740, that partial fix will be cherry-picked into the GA branch. After the associated RPM packages are distributed to GA, work will resume on fixing the edge cases associated with this Issue #2731.
from mdsplus.
Hi @smithsp and @sflanagan,
It is my understanding that the GA branch should be based on `alpha-7.139.59` (based on the following post from Mitchell). If that is incorrect, let us know.
#2704 (comment)
from mdsplus.
@mwinkel-dev The version is in #2731 (comment) (repeated here):
```
[root@atlas system]# rpm -qa | grep mds
mdsplus-alpha-mssql-7.139-59.el8.x86_64
mdsplus-alpha-idl_bin-7.139-59.el8.x86_64
mdsplus-alpha-kernel_bin-7.139-59.el8.x86_64
mdsplus-alpha-d3d-7.139-59.el8.noarch
mdsplus-alpha-kernel-7.139-59.el8.noarch
mdsplus-alpha-motif-7.139-59.el8.noarch
mdsplus-alpha-repo-7.139-40.el8.noarch
mdsplus-alpha-motif_bin-7.139-59.el8.x86_64
mdsplus-alpha-python-7.139-59.el8.noarch
mdsplus-alpha-idl-7.139-59.el8.noarch
```
from mdsplus.
Hi @smithsp,
Apparently, I need to drink more coffee and wake up. Which is to say that I'm amused to learn from your post that the version information was available all along; I just needed to scroll up to the top of this GitHub issue.
Regardless, thanks for confirming the version again. Reason I was double checking was to make sure we create the GA branch from the correct commit.
Today, we will be cherry-picking PR #2735 and PR #2740 into the branch. Here are the tasks:
- creating the GA branch
- cherry picking the two PRs
- manually building the branch (including running the automated tests)
- testing the RPMs on a RHEL8 system to ensure not leaking sockets
- notifying GA when the release is ready
- note: so as not to add additional delay to the release, we will not be adding the GA branch to the Jenkins build system today
If all goes according to plan, the above tasks will all be completed this afternoon or evening.
When the release is ready, GA will have two options:
- I can provide a *.tgz file of the RPMs as promised, and/or
- GA can build the branch and create its own RPMs.
from mdsplus.
Trial run of GA branch / build is underway. Now have RPMs on a Rocky8 system for testing on Friday. Some minor build issues still to be addressed (major-minor-release labelling scheme for the branch, and so forth).
from mdsplus.
Hi @smithsp and @sflanagan,
What "tagging" scheme would you like us to use for the GA branch (i.e., values for "branch-major-minor-release")?
GA release zero is identical to `alpha_7.139.59`. This commit does not presently have a GA "tag".
What should the next release of the GA branch be called? (This will be the release with the cherry-picks of PR #2735 and PR #2740.)
from mdsplus.
I propose the next release be labeled with `atlas`. It could be `atlas_1.0.0`. (Or this could be the source for the new `stable` set of labels, since we think it is the most recent stable release.)
from mdsplus.
Hi @smithsp and @sflanagan,
Thanks for the answer. We will start with `atlas_1.0.0`. However, that label can be changed later if GA so decides.
from mdsplus.
Just dispatched 2,000 actions on Rocky Linux 8.9 without leaking any sockets.
Details of the trial branch build and test . . .
- Manually ran the Docker build for RHEL8, which produced unsigned RPMs for the trial branch (= `alpha_7.139.59` plus cherry-picks of PR 2735 and PR 2740).
- Installed the RPMs on Rocky Linux 8.9 (with `gpgcheck=0` because the RPMs were unsigned).
- Dispatched the 2,000 actions to the "action service".
- Used `lsof` to confirm that no sockets leaked.
from mdsplus.
Many of the preceding posts by @mwinkel-dev were initial conjectures and thus incorrect. Now that the investigation has been completed, here is a summary of the root cause.
It is expensive to create connections, and thus the architecture of `mdstcl` and the "action service" is to reuse connections whenever possible. The following explanation uses the simplest configuration: `mdstcl`, one "action service", and an "mdsip service" (for tree access).
In normal usage, `mdstcl` has one connection open to the `mdsip` service so that it can obtain all action nodes from the tree. It then builds a table of actions to be dispatched. So, `mdstcl` also opens another connection to the "action service", and dispatches all actions over that connection. The main thread of `mdstcl` dispatches the actions. As actions execute, the "action server" sends replies back to `mdstcl` on a different connection. There are not multiple "receiver" threads; `mdstcl` just has a single thread listening for replies. After a connection is made, it is reused over and over. (Of course, if a connection dies because of a network glitch, then `mdstcl` will create a replacement connection.)
And this is how the software changed over the years . . .
Release `stable_7.96.9` is what GA was running until November 2023. It reused the connection to dispatch actions, so did not leak sockets.
PR #2288 refactored the software and inadvertently omitted reuse of the dispatch connection. It created a new connection for each action dispatched, thus leaked sockets.
The proposed fix created by @mwinkel-dev incorrectly assumed that the architecture was indeed supposed to create a new connection per action. And also incorrectly assumed that there was a receiver thread created for each action. (Maintenance developers rely on comments in the source code to guide them; when those comments are missing, it is easy to make bad assumptions.) This proposed fix did solve the leaking socket issue, but had other problems (such as deadlock).
PR #2740 corrects the omission of PR #2288 and reuses the dispatch connection. Thus eliminating the leaking sockets. Although some code cleanup was done on the "action service", the primary fix is actually in the client-side code.
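The repaired behavior can be pictured as a simple get-or-create lookup keyed by server address: check for an existing connection before opening a new one. This is a hedged sketch in the spirit of the fix, with invented names (`get_connection`, `cached_conn`); the real code lives in mdstcpip and differs in detail:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical connection cache illustrating "reuse the dispatch
 * connection": look up an existing connection to `server` before
 * opening a new one, so dispatching N actions to one server uses
 * one socket rather than N. */
#define MAX_SERVERS 16

typedef struct { char server[64]; int sock; } cached_conn;

static cached_conn cache[MAX_SERVERS];
static int n_cached = 0;
static int next_sock = 100; /* stand-in for a real connect(2) call */

int get_connection(const char *server) {
  /* Reuse path: an existing connection means no new socket. */
  for (int i = 0; i < n_cached; i++)
    if (strcmp(cache[i].server, server) == 0)
      return cache[i].sock;
  if (n_cached == MAX_SERVERS)
    return -1;
  /* Not cached: "connect" once and remember the connection. */
  strncpy(cache[n_cached].server, server, sizeof cache[n_cached].server - 1);
  cache[n_cached].server[sizeof cache[n_cached].server - 1] = '\0';
  cache[n_cached].sock = next_sock++;
  return cache[n_cached++].sock;
}
```

The omission in PR #2288 was, in effect, skipping the reuse path, so every dispatched action took the "connect once" path and left a socket behind.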
from mdsplus.
Hi @smithsp,
PR #2746 addresses some problems that were spotted while investigating this Issue #2731. Let us know if you also want PR #2746 cherry-picked to GA's branch.
from mdsplus.
Hi @smithsp,
Thanks for the decision. Therefore, we will include PR #2746 only if two conditions are met:
- it adds no delay to the release of the RPM tarball of the GA branch, and
- the peer reviewers consider it useful to have it in GA's branch.
If PR #2746 is not included in the forthcoming tarball (aka "ga_atlas_1.0.0"), my recommendation is that GA add it to the list of things to be included in an eventual subsequent release ("ga_atlas_1.0.1").
Regarding the tarball, I will post an update on Wednesday 24-Apr-2024 with an updated schedule. (I was out of the office today, so must check with my colleagues to find out the present status of the GA branch and tarball.)
from mdsplus.
Hi @smithsp,
With guidance from @WhoBrokeTheBuild, the official GA branch is now available. It is named ga_atlas, and it contains the cherry-picks of PR #2735 and PR #2740. (PR #2746 was not included because it is still in the review process.)
The history of the ga_atlas branch is as follows.
ga_atlas_1.0.0
- PR #2740 = 4cee3b1 = Fix: reduce open files due to dispatcher
- PR #2735 = 471a06c = When activate debug trace, now compiles without error
- alpha_7.139.59 = version GA installed in November 2023
- PR #2647 = 37652c2 = Fix: dir /full segfault
The ga_atlas_1.0.0 release was manually built (on my dev system, using Docker images) for three platforms: RHEL8, Ubuntu20, and Windows. All builds successfully ran the automated test suite. In addition, the Ubuntu20 build passed the manual test of 1,200 dispatched actions without leaking any sockets.
Stephen informed me that the Jenkins build server is not configured to build the ga_atlas branch. If GA has problems building the branch, let us know.
I am also creating a tarball of the RHEL8 RPMs so that I can install the ga_atlas_1.0.0 release on my Rocky Linux 8 virtual machine and repeat the dispatch / socket test. As a courtesy to GA, I can also make my RPM tarball available if needed.
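For reference, a socket-leak test like the one described above can be monitored by sampling a process's open-descriptor count while actions are dispatched. This is a Linux-only Python sketch; fd_count is a hypothetical helper, and in a real test the pid would be that of the mdsip server process.

```python
import os

def fd_count(pid: int) -> int:
    """Open file-descriptor count for a process (Linux; requires
    permission to read /proc/<pid>/fd)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

# Sample before and after a batch of dispatched actions; a leak shows
# up as a count that climbs with the number of actions dispatched.
pid = os.getpid()  # substitute the pid of the mdsip server under test
count = fd_count(pid)
print("open fds:", count)
```

A healthy dispatcher shows a roughly constant count across hundreds of actions; the pre-fix dispatcher showed a count growing by one per action.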
from mdsplus.
Timeline of fix (in calendar days, not work days) and lessons @mwinkel-dev learned . . .
- workaround proposed and tested = 1 day
- investigation and first two versions of fix = 12 days
- develop PR 2740 fix = 2 days
- review, merge and test PR 2740 = 9 days
- create PR 2746 to fix residual dispatcher issues = 4 days
- create ga_atlas branch and deliver tarball of RPMs = 2 days
Step 2 (investigation and first two versions of the fix) was the result of undocumented source code and wrong assumptions about the software architecture. To prevent this from occurring again, PR #2746 adds comments to help future maintenance programmers. Lesson learned = seek input from others sooner.
Six days elapsed between Steps 4 and 6. The fix was merged at Step 4, and thus available to the customer, but not in the form the customer requested. Lesson learned = better coordination saves time.
from mdsplus.
The ga_atlas branch is working in production.
from mdsplus.
Hi @smithsp -- Thanks for the update. Am glad to read that the ga_atlas branch is working OK at DIII-D.
from mdsplus.