Comments (54)

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024 1

Hi @sflanagan -- Josh has succeeded in reproducing the issue with the current alpha on Ubuntu. We are now debugging the problem and will post additional updates here later this afternoon.

smithsp avatar smithsp commented on June 20, 2024 1

@mwinkel-dev
If the fixes of #2746 are needed to address this issue, or would be encountered during our testing, then please include them. However, we were hoping to have the tarball very soon, and wouldn't want to delay it for #2746 unless you think #2746 is part of the overall fix that should be included.

joshStillerman avatar joshStillerman commented on June 20, 2024

I think you need to add dispatch/close to the script that is doing the dispatch/build.

TCL> help dispatch/close

  Command: DISPATCH/CLOSE
  Purpose: Instruct MDSplus actions server(s) to close all open trees.
  Format: DISPATCH/CLOSE[/SERVER=(server1,server2...)]

  Description:

  The DISPATCH/CLOSE can be used to instruct an MDSplus action server to close all MDSplus
  trees the server currently has open. If the /SERVER qualifier is not included to specify
  any action servers then the current dispatch table (See: DISPATCH/BUILD and
  DISPATCH/PHASE commands) is inspected to construct a list of servers that had actions
  dispatched to them to send close instructions to. The DISPATCH/CLOSE command is typically
  used at the end of a data acquisition cycle to instruct the action server to close
  any open trees.

TCL> 
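
For example, a dispatch script that currently ends with the last phase could close the trees on the action server(s) before exiting (a sketch; the tree name, shot number, and phase names are placeholders):

TCL> set tree mytree /shot=123
TCL> dispatch/build
TCL> dispatch/phase INIT
TCL> dispatch/phase STORE
TCL> dispatch/close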

I am happy to meet this afternoon to discuss...

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

According to Mitchell's post on another issue, Atlas has been running alpha-7.139.59 since November 2023, and Atlas is on RHEL8.
See issue #2704, post of 13-Feb-2024.

sflanagan avatar sflanagan commented on June 20, 2024

There's likely more to this than open files. Enabling core dumps for the systemd services running the "dispatcher" (i.e., mdsip -s on atlas:8002) and the "mdsloader" (i.e., mdsip -s on atlas:8003) turns up a potential issue in mdstcpip/mdsipshr/ConnectToMds.c.

I haven't looked at this too closely, but for the sake of notes I'll quickly post the output of coredumpctl gdb <pid> followed by bt for the last core dump produced (this one is from the port 8002 side):

[root@atlas ~]# coredumpctl gdb 865034
           PID: 865034 (mdsip)
           UID: 117 (mdsadmin)
           GID: 511 (mdsdev)
        Signal: 11 (SEGV)
     Timestamp: Mon 2024-03-25 21:44:26 PDT (12h ago)
  Command Line: /usr/local/mdsplus/bin/mdsip -s -p 8002 -h /etc/mdsip.hosts
    Executable: /usr/local/mdsplus/bin64/mdsip
 Control Group: /
         Slice: -.slice
       Boot ID: 7392c4d0130e43a9a07fbde2f35cd9c7
    Machine ID: 1a91ee619a7f42d0b73da694e83b0be3
      Hostname: atlas.gat.com
       Storage: /var/lib/systemd/coredump/core.mdsip.117.7392c4d0130e43a9a07fbde2f35cd9c7.865034.1711428266000000.lz4
       Message: Process 865034 (mdsip) of user 117 dumped core.

                Stack trace of thread 866623:
                #0  0x00001468186e2c61 do_login (libMdsIpShr.so)
                #1  0x000014681530a01e server_connect (libMdsServerShr.so)
                #2  0x000014681530ba91 ServerConnect (libMdsServerShr.so)
                #3  0x000014681530bc57 ServerSendMessage (libMdsServerShr.so)
                #4  0x000014681530cdd7 ServerDispatchAction (libMdsServerShr.so)
                #5  0x0000146815307ec6 dispatch (libMdsServerShr.so)
                #6  0x000014681530834c ServerDispatchPhase (libMdsServerShr.so)
                #7  0x0000146814868521 TclDispatch_phase (libtcl_commands.so)
                #8  0x00001468150f1a9a dispatchToHandler (libMdsdcl.so)
                #9  0x00001468150f30b3 cmdExecute (libMdsdcl.so)
                #10 0x00001468150f6f04 mdsdcl_do_command_extra_args (libMdsdcl.so)
                #11 0x00001468150f2abb mdsdcl_do_command (libMdsdcl.so)
                #12 0x0000146815304449 DoSrvCommand (libMdsServerShr.so)
                #13 0x0000146815304b29 WorkerThread (libMdsServerShr.so)
                #14 0x000014681808b1ca start_thread (libpthread.so.0)
                #15 0x0000146817cf7e73 __clone (libc.so.6)

                Stack trace of thread 865034:
                #0  0x0000146817de4a3f __select (libc.so.6)
                #1  0x0000146816d8e877 listen_loop (libMdsIpTCP.so)
                #2  0x0000146816d8ecc6 io_listen (libMdsIpTCP.so)
                #3  0x00000000004008e8 main (mdsip)
                #4  0x0000146817cf8d85 __libc_start_main (libc.so.6)
                #5  0x000000000040097e _start (mdsip)

                Stack trace of thread 866830:
                #0  0x000014681809145c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x0000146815306f07 send_monitor_pop (libMdsServerShr.so)
                #2  0x00001468153076ab send_monitor_thread (libMdsServerShr.so)
                #3  0x000014681808b1ca start_thread (libpthread.so.0)
                #4  0x0000146817cf7e73 __clone (libc.so.6)

                Stack trace of thread 1046533:
                #0  0x0000146817de4a3f __select (libc.so.6)
                #1  0x0000146816d8e154 io_recv_to (libMdsIpTCP.so)
                #2  0x00001468186d9304 get_bytes_to (libMdsIpShr.so)
                #3  0x00001468186d93e2 get_bytes_to (libMdsIpShr.so)
                #4  0x00001468186e2eba ConnectionDoMessage (libMdsIpShr.so)
                #5  0x0000146816d8d492 client_thread (libMdsIpTCP.so)
                #6  0x000014681808b1ca start_thread (libpthread.so.0)
                #7  0x0000146817cf7e73 __clone (libc.so.6)

                Stack trace of thread 866831:
                #0  0x0000146817de4a3f __select (libc.so.6)
                #1  0x000014681530b42e receiver_thread (libMdsServerShr.so)
                #2  0x000014681808b1ca start_thread (libpthread.so.0)
                #3  0x0000146817cf7e73 __clone (libc.so.6)

GNU gdb (GDB) Red Hat Enterprise Linux 8.2-19.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/mdsplus/bin64/mdsip...done.
[New LWP 866623]
[New LWP 865034]
[New LWP 866830]
[New LWP 1046533]
[New LWP 866831]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `/usr/local/mdsplus/bin/mdsip -s -p 8002 -h /etc/mdsip.hosts'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  do_login (c=0x1) at /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c:78
78      /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c: No such file or directory.
[Current thread is 1 (Thread 0x146814c71700 (LWP 866623))]
Missing separate debuginfos, use: yum debuginfo-install mdsplus-alpha-kernel_bin-7.139-59.el8.x86_64
(gdb) bt
#0  do_login (c=0x1) at /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c:78
#1  ConnectToMds (hostin=hostin@entry=0x1468080fdea0 "atlas.gat.com:8003") at /git/mdsplus/mdstcpip/mdsipshr/ConnectToMds.c:161
#2  0x000014681530a01e in server_connect (server=server@entry=0x1468080fdea0 "atlas.gat.com:8003", addr=1755710912, port=<optimized out>)
    at /git/mdsplus/servershr/ServerSendMessage.c:529
#3  0x000014681530ba91 in ServerConnect (server_in=server_in@entry=0x146814c70530 "mdsloader") at /git/mdsplus/servershr/ServerSendMessage.c:557
#4  0x000014681530bc57 in ServerSendMessage (msgid=msgid@entry=0x0, server=server@entry=0x146814c70530 "mdsloader", op=op@entry=2,
    retstatus=retstatus@entry=0x146808999338, lock=lock@entry=0x146808999358, conid_out=conid_out@entry=0x14680899933c,
    callback_done=0x146815307b60 <action_done>, callback_param=0xa7, callback_before=0x146815306d50 <before>, numargs_in=3)
    at /git/mdsplus/servershr/ServerSendMessage.c:339
#5  0x000014681530cdd7 in ServerDispatchAction (id=id@entry=0x0, server=0x146814c70530 "mdsloader", tree=0x146808992fb4 "D3D", shot=<optimized out>,
    nid=<optimized out>, ast=ast@entry=0x146815307b60 <action_done>, astprm=0xa7, retstatus=0x146808999338, lock=0x146808999358,
    socket=0x14680899933c, before_ast=0x146815306d50 <before>) at /git/mdsplus/servershr/ServerDispatchAction.c:67
#6  0x0000146815307ec6 in dispatch (i=i@entry=167) at /git/mdsplus/servershr/ServerDispatchPhase.c:201
#7  0x000014681530834c in ServerDispatchPhase (id=id@entry=0x0, vtable=<optimized out>, phasenam=phasenam@entry=0x1468080b36b0 "TEST",
    noact_in=noact_in@entry=0 '\000', sync=sync@entry=1, output_rtn=output_rtn@entry=0x146814867740 <printIt>, monitor=0x0)
    at /git/mdsplus/servershr/ServerDispatchPhase.c:536
#8  0x0000146814868521 in TclDispatch_phase (ctx=<optimized out>, error=0x146814c70c28, output=<optimized out>) at /git/mdsplus/tcl/tcl_dispatch.c:411
#9  0x00001468150f1a9a in dispatchToHandler (getlineinfo=0x0, getline=0x0, output=0x146814c70c20, error=0x146814c70c28, prompt=<optimized out>,
    cmdDef=<optimized out>, cmd=0x1468080fdf60, image=<optimized out>) at /git/mdsplus/mdsdcl/cmdExecute.c:935
#10 processCommand (docList=docList@entry=0x146808011e50, verbNode_in=<optimized out>, cmd=cmd@entry=0x1468080065c0,
    cmdDef=cmdDef@entry=0x146808006610, prompt=prompt@entry=0x146814c70c10, error=error@entry=0x146814c70c28, output=0x146814c70c20, getline=0x0,
    getlineinfo=0x0) at /git/mdsplus/mdsdcl/cmdExecute.c:1165
#11 0x00001468150f30b3 in cmdExecute (cmd=0x1468080065c0, prompt_out=prompt_out@entry=0x0, error_out=error_out@entry=0x146814c70d70,
    output_out=output_out@entry=0x146814c70d78, getline=getline@entry=0x0, getlineinfo=getlineinfo@entry=0x0) at /git/mdsplus/mdsdcl/cmdExecute.c:1453
#12 0x00001468150f6f04 in mdsdcl_do_command_extra_args (command=0x146810008db0 "DISPATCH/PHASE TEST", prompt=prompt@entry=0x0,
    error=error@entry=0x146814c70d70, output=output@entry=0x146814c70d78, getline=getline@entry=0x0, getlineInfo=getlineInfo@entry=0x0)
    at mdsdcl/yylex/cmdParse.y:213
#13 0x00001468150f2abb in mdsdcl_do_command (command=<optimized out>) at /git/mdsplus/mdsdcl/cmdExecute.c:1350
#14 0x0000146815304449 in DoSrvCommand (job_in=job_in@entry=0x1468100430b0) at /git/mdsplus/servershr/ServerQAction.c:642
#15 0x0000146815304b29 in WorkerThread (arg=<optimized out>) at /git/mdsplus/servershr/ServerQAction.c:786
#16 0x000014681808b1ca in start_thread () from /usr/lib64/libpthread.so.0
#17 0x0000146817cf7e73 in clone () from /usr/lib64/libc.so.6
(gdb)

sflanagan avatar sflanagan commented on June 20, 2024

EDIT:

Note that as of the time the following lsof output was captured, those PIDs were:

  • dispatcher: systemd service mds-8002 on pid 3730021
  • mdsloader: systemd service mds-8003 on pid 3730036
[root@atlas system]# ps -fp 3730021
UID          PID    PPID  C STIME TTY          TIME CMD
mdsadmin 3730021       1  0 10:19 ?        00:00:01 /usr/local/mdsplus/bin/mdsip -s -p 8002 -h /etc/mdsip.hosts -c 0

[root@atlas system]# ps -fp 3730036
UID          PID    PPID  C STIME TTY          TIME CMD
mdsadmin 3730036       1  0 10:19 ?        00:00:01 /usr/local/mdsplus/bin/mdsip -s -p 8003 -h /etc/mdsip.hosts -c 0


A current lsof -u mdsadmin | grep TCP:

[root@atlas ~]# lsof -u mdsadmin | grep TCP
mdsip   3730021 mdsadmin  mem    REG    9,127     80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip   3730021 mdsadmin    3u  IPv4 12579304       0t0       TCP *:8002 (LISTEN)
mdsip   3730021 mdsadmin   22u  IPv4 12298036       0t0       TCP *:8801 (LISTEN)
mdsip   3730021 mdsadmin   28u  IPv4 12428281       0t0       TCP atlas.gat.com:60656->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  102u  IPv4 12298128       0t0       TCP atlas.gat.com:60622->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  103u  IPv4 12298039       0t0       TCP atlas.gat.com:40034->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  104u  IPv4 12410125       0t0       TCP atlas.gat.com:8801->atlas.gat.com:41562 (ESTABLISHED)
mdsip   3730021 mdsadmin  105u  IPv4 12298131       0t0       TCP atlas.gat.com:60636->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  106u  IPv4 12648855       0t0       TCP atlas.gat.com:51018->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  107u  IPv4 12648858       0t0       TCP atlas.gat.com:51026->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  108u  IPv4 12298136       0t0       TCP atlas.gat.com:60650->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  109u  IPv4 12298160       0t0       TCP atlas.gat.com:52952->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  110u  IPv4 12298181       0t0       TCP atlas.gat.com:49554->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  111u  IPv4 12420686       0t0       TCP atlas.gat.com:54658->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  112u  IPv4 12420672       0t0       TCP atlas.gat.com:49558->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  115u  IPv4 12420676       0t0       TCP atlas.gat.com:57096->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  116u  IPv4 12298188       0t0       TCP atlas.gat.com:57098->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  117u  IPv4 12420679       0t0       TCP atlas.gat.com:57114->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  118u  IPv4 12420682       0t0       TCP atlas.gat.com:57122->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  119u  IPv4 12298192       0t0       TCP atlas.gat.com:54668->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  120u  IPv4 12420690       0t0       TCP atlas.gat.com:38610->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  121u  IPv4 12298196       0t0       TCP atlas.gat.com:38614->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  122u  IPv4 12427862       0t0       TCP atlas.gat.com:43624->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  123u  IPv4 12427868       0t0       TCP atlas.gat.com:43626->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  124u  IPv4 12427871       0t0       TCP atlas.gat.com:43630->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  125u  IPv4 12427877       0t0       TCP atlas.gat.com:43644->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  126u  IPv4 12427884       0t0       TCP atlas.gat.com:43650->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  128u  IPv4 12427961       0t0       TCP atlas.gat.com:47100->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  129u  IPv4 12298215       0t0       TCP atlas.gat.com:47072->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  130u  IPv4 12427957       0t0       TCP atlas.gat.com:47092->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  131u  IPv4 12427954       0t0       TCP atlas.gat.com:47086->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  132u  IPv4 12428079       0t0       TCP atlas.gat.com:47680->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  133u  IPv4 12428055       0t0       TCP atlas.gat.com:33592->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  134u  sock      0,9       0t0  12428052 protocol: TCP
mdsip   3730021 mdsadmin  135u  IPv4 12428059       0t0       TCP atlas.gat.com:33608->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  136u  IPv4 12428094       0t0       TCP atlas.gat.com:46216->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  137u  IPv4 12428232       0t0       TCP atlas.gat.com:46218->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  138u  IPv4 12428235       0t0       TCP atlas.gat.com:46228->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  139u  IPv4 12428240       0t0       TCP atlas.gat.com:46234->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  140u  IPv4 12648863       0t0       TCP atlas.gat.com:51030->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  142u  IPv4 12428244       0t0       TCP atlas.gat.com:46250->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  143u  IPv4 12428251       0t0       TCP atlas.gat.com:46266->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  144u  IPv4 12428270       0t0       TCP atlas.gat.com:60642->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  145u  IPv4 12728337       0t0       TCP atlas.gat.com:46278->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  146u  IPv4 12728340       0t0       TCP atlas.gat.com:46282->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  147u  IPv4 12728343       0t0       TCP atlas.gat.com:46292->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  148u  IPv4 12428267       0t0       TCP atlas.gat.com:46306->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  149u  IPv4 12428273       0t0       TCP atlas.gat.com:60654->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  153u  IPv4 12740616       0t0       TCP atlas.gat.com:60674->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  154u  IPv4 12428284       0t0       TCP atlas.gat.com:60666->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  155u  IPv4 12428287       0t0       TCP atlas.gat.com:60670->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  157u  IPv4 12728345       0t0       TCP atlas.gat.com:60684->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  158u  IPv4 12715420       0t0       TCP atlas.gat.com:60694->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  159u  IPv4 12715423       0t0       TCP atlas.gat.com:60710->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  160u  IPv4 12715428       0t0       TCP atlas.gat.com:60720->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  161u  IPv4 12715431       0t0       TCP atlas.gat.com:44726->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  162u  IPv4 12715435       0t0       TCP atlas.gat.com:44736->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  163u  IPv4 12401444       0t0       TCP atlas.gat.com:44744->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  164u  IPv4 12401452       0t0       TCP atlas.gat.com:44762->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  165u  IPv4 12401459       0t0       TCP atlas.gat.com:44772->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  166u  IPv4 12401449       0t0       TCP atlas.gat.com:44760->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  167u  IPv4 12401464       0t0       TCP atlas.gat.com:44788->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  168u  IPv4 12401467       0t0       TCP atlas.gat.com:44804->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  169u  IPv4 12715452       0t0       TCP atlas.gat.com:44806->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  170u  IPv4 12715460       0t0       TCP atlas.gat.com:EtherNet/IP-2->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  171u  IPv4 12715463       0t0       TCP atlas.gat.com:44832->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  172u  IPv4 12401471       0t0       TCP atlas.gat.com:57696->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  173u  IPv4 12401475       0t0       TCP atlas.gat.com:57708->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  174u  IPv4 12401480       0t0       TCP atlas.gat.com:57712->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  175u  IPv4 12401483       0t0       TCP atlas.gat.com:57720->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  177u  IPv4 12401488       0t0       TCP atlas.gat.com:57734->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  178u  IPv4 12401493       0t0       TCP atlas.gat.com:57750->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  179u  IPv4 12401498       0t0       TCP atlas.gat.com:57752->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  180u  IPv4 12401501       0t0       TCP atlas.gat.com:57760->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  181u  IPv4 12401505       0t0       TCP atlas.gat.com:57768->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  182u  IPv4 12401508       0t0       TCP atlas.gat.com:57782->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  183u  IPv4 12715507       0t0       TCP atlas.gat.com:57784->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  184u  IPv4 12715550       0t0       TCP atlas.gat.com:60928->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  185u  IPv4 12715558       0t0       TCP atlas.gat.com:60946->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  186u  IPv4 12715553       0t0       TCP atlas.gat.com:60932->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  188u  IPv4 12715562       0t0       TCP atlas.gat.com:60948->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  189u  IPv4 12715568       0t0       TCP atlas.gat.com:60964->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  190u  IPv4 12715573       0t0       TCP atlas.gat.com:60966->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  191u  IPv4 12401511       0t0       TCP atlas.gat.com:60980->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  192u  IPv4 12401515       0t0       TCP atlas.gat.com:60994->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  193u  IPv4 12401608       0t0       TCP atlas.gat.com:59376->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730021 mdsadmin  194u  IPv4 12401605       0t0       TCP atlas.gat.com:59374->atlas.gat.com:8003 (ESTABLISHED)
mdsip   3730026 mdsadmin  mem    REG    9,127     80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip   3730026 mdsadmin    3u  IPv4 12568210       0t0       TCP *:8012 (LISTEN)
mdsip   3730036 mdsadmin  mem    REG    9,127     80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip   3730036 mdsadmin    3u  IPv4 12568211       0t0       TCP *:8003 (LISTEN)
mdsip   3730036 mdsadmin    4u  IPv4 12609768       0t0       TCP atlas.gat.com:8003->atlas.gat.com:40034 (ESTABLISHED)
mdsip   3730036 mdsadmin   61u  IPv4 12628995       0t0       TCP atlas.gat.com:53526->atlas.gat.com:sunwebadmin (CLOSE_WAIT)
mdsip   3730036 mdsadmin   62u  IPv4 12628998       0t0       TCP atlas.gat.com:41562->atlas.gat.com:8801 (ESTABLISHED)
mdsip   3730036 mdsadmin   63u  IPv4 12609794       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60622 (ESTABLISHED)
mdsip   3730036 mdsadmin   64u  IPv4 12609800       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60650 (ESTABLISHED)
mdsip   3730036 mdsadmin   65u  IPv4 12230182       0t0       TCP atlas.gat.com:8003->atlas.gat.com:52952 (ESTABLISHED)
mdsip   3730036 mdsadmin   66u  IPv4 12230189       0t0       TCP atlas.gat.com:8003->atlas.gat.com:49554 (ESTABLISHED)
mdsip   3730036 mdsadmin   67u  IPv4 12230192       0t0       TCP atlas.gat.com:8003->atlas.gat.com:49558 (ESTABLISHED)
mdsip   3730036 mdsadmin   68u  IPv4 12230203       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57096 (ESTABLISHED)
mdsip   3730036 mdsadmin   69u  IPv4 12230209       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57114 (ESTABLISHED)
mdsip   3730036 mdsadmin   70u  IPv4 12230212       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57122 (ESTABLISHED)
mdsip   3730036 mdsadmin   71u  IPv4 12230250       0t0       TCP atlas.gat.com:8003->atlas.gat.com:54658 (ESTABLISHED)
mdsip   3730036 mdsadmin   72u  IPv4 12230260       0t0       TCP atlas.gat.com:8003->atlas.gat.com:54668 (ESTABLISHED)
mdsip   3730036 mdsadmin   73u  IPv4 12230263       0t0       TCP atlas.gat.com:8003->atlas.gat.com:38610 (ESTABLISHED)
mdsip   3730036 mdsadmin   74u  IPv4 12230266       0t0       TCP atlas.gat.com:8003->atlas.gat.com:38614 (ESTABLISHED)
mdsip   3730036 mdsadmin   75u  IPv4 12715064       0t0       TCP atlas.gat.com:8003->atlas.gat.com:43624 (ESTABLISHED)
mdsip   3730036 mdsadmin   76u  IPv4 12715069       0t0       TCP atlas.gat.com:8003->atlas.gat.com:43626 (ESTABLISHED)
mdsip   3730036 mdsadmin   77u  IPv4 12715076       0t0       TCP atlas.gat.com:8003->atlas.gat.com:43644 (ESTABLISHED)
mdsip   3730036 mdsadmin   78u  IPv4 12715079       0t0       TCP atlas.gat.com:8003->atlas.gat.com:43650 (ESTABLISHED)
mdsip   3730036 mdsadmin   79u  IPv4 12230269       0t0       TCP atlas.gat.com:8003->atlas.gat.com:47072 (ESTABLISHED)
mdsip   3730036 mdsadmin   80u  IPv4 12715164       0t0       TCP atlas.gat.com:8003->atlas.gat.com:47086 (ESTABLISHED)
mdsip   3730036 mdsadmin   81u  IPv4 12715167       0t0       TCP atlas.gat.com:8003->atlas.gat.com:47092 (ESTABLISHED)
mdsip   3730036 mdsadmin   82u  IPv4 12715170       0t0       TCP atlas.gat.com:8003->atlas.gat.com:47100 (ESTABLISHED)
mdsip   3730036 mdsadmin   83u  IPv4 12582098       0t0       TCP atlas.gat.com:8003->atlas.gat.com:33592 (ESTABLISHED)
mdsip   3730036 mdsadmin   84u  IPv4 12582101       0t0       TCP atlas.gat.com:8003->atlas.gat.com:33608 (ESTABLISHED)
mdsip   3730036 mdsadmin   85u  IPv4 12582112       0t0       TCP atlas.gat.com:8003->atlas.gat.com:47680 (ESTABLISHED)
mdsip   3730036 mdsadmin   86u  IPv4 12582115       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46216 (ESTABLISHED)
mdsip   3730036 mdsadmin   87u  IPv4 12582118       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46218 (ESTABLISHED)
mdsip   3730036 mdsadmin   88u  IPv4 12582124       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46234 (ESTABLISHED)
mdsip   3730036 mdsadmin   89u  IPv4 12582127       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46250 (ESTABLISHED)
mdsip   3730036 mdsadmin   90u  IPv4 12582130       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46266 (ESTABLISHED)
mdsip   3730036 mdsadmin   91u  IPv4 12625488       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46278 (ESTABLISHED)
mdsip   3730036 mdsadmin   92u  IPv4 12582136       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60642 (ESTABLISHED)
mdsip   3730036 mdsadmin   93u  IPv4 12582139       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60654 (ESTABLISHED)
mdsip   3730036 mdsadmin   94u  IPv4 12582142       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60656 (ESTABLISHED)
mdsip   3730036 mdsadmin   95u  IPv4 12582151       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60674 (ESTABLISHED)
mdsip   3730036 mdsadmin   96u  IPv4 12582154       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60694 (ESTABLISHED)
mdsip   3730036 mdsadmin   97u  IPv4 12582159       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60710 (ESTABLISHED)
mdsip   3730036 mdsadmin   98u  IPv4 12582162       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60720 (ESTABLISHED)
mdsip   3730036 mdsadmin   99u  IPv4 12582165       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44726 (ESTABLISHED)
mdsip   3730036 mdsadmin  100u  IPv4 12582168       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44736 (ESTABLISHED)
mdsip   3730036 mdsadmin  101u  IPv4 12625500       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44744 (ESTABLISHED)
mdsip   3730036 mdsadmin  102u  IPv4 12625504       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44760 (ESTABLISHED)
mdsip   3730036 mdsadmin  103u  IPv4 12625510       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44772 (ESTABLISHED)
mdsip   3730036 mdsadmin  104u  IPv4 12625513       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44788 (ESTABLISHED)
mdsip   3730036 mdsadmin  105u  IPv4 12625516       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44804 (ESTABLISHED)
mdsip   3730036 mdsadmin  106u  IPv4 12582173       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44806 (ESTABLISHED)
mdsip   3730036 mdsadmin  107u  IPv4 12582177       0t0       TCP atlas.gat.com:8003->atlas.gat.com:EtherNet/IP-2 (ESTABLISHED)
mdsip   3730036 mdsadmin  108u  IPv4 12625519       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57696 (ESTABLISHED)
mdsip   3730036 mdsadmin  109u  IPv4 12625522       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57708 (ESTABLISHED)
mdsip   3730036 mdsadmin  110u  IPv4 12625525       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57712 (ESTABLISHED)
mdsip   3730036 mdsadmin  111u  IPv4 12625528       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57720 (ESTABLISHED)
mdsip   3730036 mdsadmin  112u  IPv4 12625531       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57734 (ESTABLISHED)
mdsip   3730036 mdsadmin  113u  IPv4 12625534       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57750 (ESTABLISHED)
mdsip   3730036 mdsadmin  114u  IPv4 12625537       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57752 (ESTABLISHED)
mdsip   3730036 mdsadmin  115u  IPv4 12625543       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57768 (ESTABLISHED)
mdsip   3730036 mdsadmin  116u  IPv4 12625546       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57782 (ESTABLISHED)
mdsip   3730036 mdsadmin  117u  IPv4 12514018       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57784 (ESTABLISHED)
mdsip   3730036 mdsadmin  118u  IPv4 12514021       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60928 (ESTABLISHED)
mdsip   3730036 mdsadmin  119u  IPv4 12514024       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60932 (ESTABLISHED)
mdsip   3730036 mdsadmin  120u  IPv4 12514027       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60946 (ESTABLISHED)
mdsip   3730036 mdsadmin  122u  IPv4 12514031       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60948 (ESTABLISHED)
mdsip   3730036 mdsadmin  123u  IPv4 12609797       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60636 (ESTABLISHED)
mdsip   3730036 mdsadmin  124u  IPv4 12514034       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60964 (ESTABLISHED)
mdsip   3730036 mdsadmin  125u  IPv4 12514037       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60966 (ESTABLISHED)
mdsip   3730036 mdsadmin  126u  IPv4 12625549       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60980 (ESTABLISHED)
mdsip   3730036 mdsadmin  127u  IPv4 12625552       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60994 (ESTABLISHED)
mdsip   3730036 mdsadmin  128u  IPv4 12625561       0t0       TCP atlas.gat.com:8003->atlas.gat.com:59374 (ESTABLISHED)
mdsip   3730036 mdsadmin  129u  IPv4 12230206       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57098 (ESTABLISHED)
mdsip   3730036 mdsadmin  130u  IPv4 12625564       0t0       TCP atlas.gat.com:8003->atlas.gat.com:59376 (ESTABLISHED)
mdsip   3730036 mdsadmin  131u  IPv4 12401620       0t0       TCP atlas.gat.com:8003->atlas.gat.com:51018 (ESTABLISHED)
mdsip   3730036 mdsadmin  132u  IPv4 12401626       0t0       TCP atlas.gat.com:8003->atlas.gat.com:51030 (ESTABLISHED)
mdsip   3730036 mdsadmin  138u  IPv4 12715072       0t0       TCP atlas.gat.com:8003->atlas.gat.com:43630 (ESTABLISHED)
mdsip   3730036 mdsadmin  150u  IPv4 12582121       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46228 (ESTABLISHED)
mdsip   3730036 mdsadmin  153u  IPv4 12582145       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60666 (ESTABLISHED)
mdsip   3730036 mdsadmin  155u  IPv4 12625491       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46282 (ESTABLISHED)
mdsip   3730036 mdsadmin  156u  IPv4 12625494       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46292 (ESTABLISHED)
mdsip   3730036 mdsadmin  157u  IPv4 12582133       0t0       TCP atlas.gat.com:8003->atlas.gat.com:46306 (ESTABLISHED)
mdsip   3730036 mdsadmin  160u  IPv4 12582148       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60670 (ESTABLISHED)
mdsip   3730036 mdsadmin  164u  IPv4 12625497       0t0       TCP atlas.gat.com:8003->atlas.gat.com:60684 (ESTABLISHED)
mdsip   3730036 mdsadmin  172u  IPv4 12625507       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44762 (ESTABLISHED)
mdsip   3730036 mdsadmin  178u  IPv4 12582180       0t0       TCP atlas.gat.com:8003->atlas.gat.com:44832 (ESTABLISHED)
mdsip   3730036 mdsadmin  186u  IPv4 12625540       0t0       TCP atlas.gat.com:8003->atlas.gat.com:57760 (ESTABLISHED)
mdsip   3730036 mdsadmin  201u  IPv4 12401623       0t0       TCP atlas.gat.com:8003->atlas.gat.com:51026 (ESTABLISHED)
mdsip   3767135 mdsadmin  mem    REG    9,127     80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip   3767135 mdsadmin    0u  IPv4 12230301       0t0       TCP atlas.gat.com:mdsip->iris29.gat.com:59836 (ESTABLISHED)
mdsip   3771645 mdsadmin  mem    REG    9,127     80560 135299483 /usr/local/mdsplus/lib64/libMdsIpTCP.so
mdsip   3771645 mdsadmin    0u  IPv4 12689154       0t0       TCP atlas.gat.com:mdsip->iris29.gat.com:60068 (ESTABLISHED)

sflanagan avatar sflanagan commented on June 20, 2024

Systemd service file contents:

mds-8002

[Unit]
Description=MDSplus Dispatcher Service
After=network.target

[Service]
User=mdsadmin
Type=simple
LimitNOFILE=64000
LimitCORE=infinity
ExecStart=/usr/local/mdsplus/bin/mdsipd-dispatching 8002 -block

[Install]
WantedBy=multi-user.target

mds-8003:

[Unit]
Description=MDSplus Loader Service
After=network.target

[Service]
Type=simple
LimitNOFILE=64000
LimitCORE=infinity
ExecStart=/usr/local/mdsplus/bin/mdsipd-dispatching 8003 -block

[Install]
WantedBy=multi-user.target

OS (atlas.gat.com):

[root@atlas system]# cat /etc/*-release
NAME="Red Hat Enterprise Linux"
VERSION="8.8 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
Red Hat Enterprise Linux release 8.8 (Ootpa)
Red Hat Enterprise Linux release 8.8 (Ootpa)

MDSplus Version:

[root@atlas system]# rpm -qa | grep mds
mdsplus-alpha-mssql-7.139-59.el8.x86_64
mdsplus-alpha-idl_bin-7.139-59.el8.x86_64
mdsplus-alpha-kernel_bin-7.139-59.el8.x86_64
mdsplus-alpha-d3d-7.139-59.el8.noarch
mdsplus-alpha-kernel-7.139-59.el8.noarch
mdsplus-alpha-motif-7.139-59.el8.noarch
mdsplus-alpha-repo-7.139-40.el8.noarch
mdsplus-alpha-motif_bin-7.139-59.el8.x86_64
mdsplus-alpha-python-7.139-59.el8.noarch
mdsplus-alpha-idl-7.139-59.el8.noarch

To explain mdsipd-dispatching:

#!/bin/sh -f
#
# MDSplus server daemon procedure
#
# Parameters:
#
#       $1 = service or portnumber

if [ $# -lt 1  ]
then
  echo ""
  echo "Correct usage is $0 portnumber [-block]"
  echo example: $0 8001
  echo ""
  exit
fi

MDSPLUS_DIR=/usr/local/mdsplus

# Note that we source the setup-server.sh file because this is to be a server
# process, not a regular data server process.
. $MDSPLUS_DIR/setup-server.sh

logfile=/var/log/mdsplus/mdsip-server/$1.log
echo "" >> $logfile 2>> $logfile
echo "Telling any existing server on this port to shut down..." >> $logfile 2>> $logfile
/usr/local/mdsplus/bin/mdstcl stop server localhost:$1 >> $logfile 2>> $logfile
echo "" >> $logfile 2>> $logfile
echo "############# END OF OLD LOGFILE #############" >> $logfile 2>> $logfile
echo "" >> $logfile 2>> $logfile
echo "############ START OF NEW LOGFILE ############" >> $logfile 2>> $logfile
echo "Starting mdsip -s on port "$1" at "`date` >> $logfile  2>> $logfile
echo "--------------------" >> $logfile  2>> $logfile
echo "Showing ulimit -a" >> $logfile  2>> $logfile
ulimit -a >> $logfile  2>> $logfile
echo "" >> $logfile  2>> $logfile
echo "--------------------" >> $logfile  2>> $logfile
echo "" >> $logfile  2>> $logfile

# if -block is specified, then don't run in the background
if [ $# -gt 1 ]
then
  if [ $2 = "-block" ]
  then
    exec $MDSPLUS_DIR/bin/mdsip -s -p $1 -h /etc/mdsip.hosts -c 0 >> $logfile 2>> $logfile
    exit 0
  fi
fi
exec $MDSPLUS_DIR/bin/mdsip -s -p $1 -h /etc/mdsip.hosts -c 0 >> $logfile 2>> $logfile &

Please note that the error mentioned occurs after ~1,000 action nodes have been triggered.

This happens whether it is 1,000 actions in a single shot or spread across multiple shots.

joshStillerman avatar joshStillerman commented on June 20, 2024

Thanks for all the info.
A quick workaround:

  • add Restart=always to your service files
  • after the last dispatch/phase, add an exit
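
For example (a sketch: the [Service] lines mirror the mds-8002 unit posted above, with Restart=always added; the final TCL commands are placeholders):

[Service]
User=mdsadmin
Type=simple
LimitNOFILE=64000
LimitCORE=infinity
Restart=always
ExecStart=/usr/local/mdsplus/bin/mdsipd-dispatching 8002 -block

and at the end of the TCL script that does the dispatching:

TCL> dispatch/phase STORE
TCL> exit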

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

Took some time to configure my dev environment, but I am now also able to reproduce the bug. So I will attach the debugger and figure out how the code is leaving dangling sockets.

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

Progress has been made on debugging the issue, but there is still more to do to locate the root cause of the dangling sockets. I have provided my initial findings to Josh and Stephen for review and suggestions.

smithsp avatar smithsp commented on June 20, 2024

@mwinkel-dev Glad you were able to reproduce. Did the workaround proposed by @joshStillerman work?

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

Hi @smithsp and @sflanagan -- I will test Josh's workaround right now. I'm sure it will work.

I have been manually killing the mdsip process between experiments and that of course cleans up the dangling sockets. Nonetheless, you have a very good point that I should also automate that by using Josh's workaround (which kills the mdsip at the end of the shot and relies on systemd to restart the process).

Current investigation shows that the dangling sockets gradually accumulate until the process that launches the actions (such as TCL) exits. When the process exits, the sockets are cleaned up. After I test Josh's workaround, I will resume looking for a fix that closes the sockets when each action thread finishes, instead of waiting until the entire process exits.

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

Hi @smithsp and @sflanagan -- I have tested Josh's recommended workaround and it is A-OK.

Here are the details of the experiment . . .

  • There are two action nodes in a dinky tree named zzz. Each action node just sleeps for 10 seconds.
  • The mdsip access to the zzz_<shot> trees is on the usual port 8000.
  • The mdsip process for actions (aka action server) is on port 8100 and configured as a systemd service.
  • The systemd service file is similar to GA's, but also has Restart=always in the [Service] section.
  • Then these TCL commands are executed (as per Josh's reproduce case).
TCL> set tree zzz
TCL> create pulse 1
TCL> set tree zzz /shot=1
TCL> dispatch /build
TCL> dispatch/phase/log init
  • After ~20 seconds, control returns to the TCL> prompt.
  • In another terminal window, run lsof and grep for the action server on port 8100. Note the dangling TCP sockets. (They do get cleaned up if you exit TCL. However, for this example we wish to remain in TCL.)
mdsip     3899986                            mdsplus  mem       REG                8,5    102144   31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip     3899986                            mdsplus    3u     IPv4           30142804       0t0        TCP *:8100/tcp (LISTEN)
mdsip     3899986                            mdsplus    6u     IPv4           30155996       0t0        TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip     3899986                            mdsplus    7u     IPv4           30144205       0t0        TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip     3899986                            mdsplus    9u     IPv4           30156000       0t0        TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
mdsip     3899986 3900283 mdsip              mdsplus  mem       REG                8,5    102144   31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip     3899986 3900283 mdsip              mdsplus    3u     IPv4           30142804       0t0        TCP *:8100/tcp (LISTEN)
mdsip     3899986 3900283 mdsip              mdsplus    6u     IPv4           30155996       0t0        TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip     3899986 3900283 mdsip              mdsplus    7u     IPv4           30144205       0t0        TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip     3899986 3900283 mdsip              mdsplus    9u     IPv4           30156000       0t0        TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
mdsip     3899986 3900284 mdsip              mdsplus  mem       REG                8,5    102144   31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip     3899986 3900284 mdsip              mdsplus    3u     IPv4           30142804       0t0        TCP *:8100/tcp (LISTEN)
mdsip     3899986 3900284 mdsip              mdsplus    6u     IPv4           30155996       0t0        TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip     3899986 3900284 mdsip              mdsplus    7u     IPv4           30144205       0t0        TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip     3899986 3900284 mdsip              mdsplus    9u     IPv4           30156000       0t0        TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
mdsip     3899986 3900285 mdsip              mdsplus  mem       REG                8,5    102144   31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip     3899986 3900285 mdsip              mdsplus    3u     IPv4           30142804       0t0        TCP *:8100/tcp (LISTEN)
mdsip     3899986 3900285 mdsip              mdsplus    6u     IPv4           30155996       0t0        TCP localhost:8100/tcp->localhost:51822 (ESTABLISHED)
mdsip     3899986 3900285 mdsip              mdsplus    7u     IPv4           30144205       0t0        TCP localhost:57616->localhost:8800 (ESTABLISHED)
mdsip     3899986 3900285 mdsip              mdsplus    9u     IPv4           30156000       0t0        TCP localhost:8100/tcp->localhost:51826 (ESTABLISHED)
  • Then at the TCL> prompt, type this: TCL> stop server localhost:8100
  • That kills the current action server.
  • Whereupon systemd automatically restarts the service. It remains on port 8100, but of course has a new process ID (i.e., compare the following process ID with the one above).
root@mfews-mwinkel:/home/mwinkel/x_markw/config# ps -ef | fgrep mdsip
mdsplus  3900338       1  0 18:57 ?        00:00:00 /usr/local/xmw_mds/mdsplus/bin/mdsip -s -p 8100 -h /tmp/xmw_ga_2731/mdsip.hosts -c 9
  • And as expected, lsof shows that the new process has just a single TCP socket listening on port 8100.
root@mfews-mwinkel:/home/mwinkel/x_markw/config# lsof | fgrep 3900338 | fgrep TCP
mdsip     3900338                            mdsplus  mem       REG                8,5    102144   31209609 /usr/local/xmw_mds/ga_2731_v4/lib/libMdsIpTCP.so
mdsip     3900338                            mdsplus    3u     IPv4           30160993       0t0        TCP *:8100/tcp (LISTEN)
  • While still at the TCL> prompt, run a new shot (i.e., repeat all the steps listed above) and observe the same behavior.

In short, Josh's workaround works fine on my Ubuntu 20 development system.

Let us know if you have any problems with this workaround on GA's RHEL8 server.

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

Hi @smithsp and @sflanagan -- The root cause of the dangling sockets in mdstcl has been found. When "actions" complete, there is a coordination problem between two threads. Conceptually, one thread attempts to delete the associated socket, but can't because the other thread still has a reference to it. (That is a generalization; the details are a bit different.)

It will take us a few days to fix the issue (i.e., iterate on the design, manual testing, and peer review).

While investigating this issue, I have also been using the systemd workaround (so I have a clean slate for each experiment). The workaround is working fine on my development system.

Let me know if you have any problems with the systemd workaround on GA's server.

Addendum
This post (and many of the following posts) contains initial conjectures that turned out to be wrong. For the final summary of the investigation, refer to the post at this link.
#2731 (comment)

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

Hi @sflanagan -- How many "actions" are issued when running a single DIII-D shot? And how many "action servers" are involved? And how many shots per day?

I would like that information so that I can design a test case that puts many times more load on the "action server" than the DIII-D load does.

sflanagan avatar sflanagan commented on June 20, 2024

Assuming, as an upper bound, that everything is enabled and ops is having a long and/or productive run day...

How many "actions" are issued when running a single DIII-D shot?

~300

And how many "action servers" are involved?

1

And how many shots per day?

~50

sflanagan avatar sflanagan commented on June 20, 2024

Let me know if you have any problems with the systemd workaround on GA's server.

I redesigned my mds-8002 systemd file to add:

User=mdsadmin
Restart=always
ExecStartPost=/usr/local/mdsplus/local/mds-8002-post.sh

Where mds-8002-post.sh just automatically sets up the dispatch table and shot number(s), opens the relevant shot tree(s), etc., and then triggers a "test" action node.

Seems "fine" as a temporary workaround, but it does leave me open to skipping/missing a shot if the dispatcher service is restarting while the DAQ system is trying to start a new shot cycle. Trying to reduce that risk is next up on my agenda (if I can).

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

Hi @sflanagan -- Thanks for the information about the number of actions in a DIII-D shot. And for the news regarding the systemd workaround.

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

I now have a prototype fix. The next step is to do some manual stress testing, followed by peer review. It will take a few more days before there is a final fix that can be merged to alpha and cherry-picked to the GA branch.

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

Hi @smithsp and @sflanagan,

Summary
This post summarizes recent emails with @smithsp and provides additional details.

This problem was introduced with PR #2288 on 19-May-2021. The bug appears to only affect the dispatch command of TCL. Thus, it is likely that the associated fix is low risk. (Testing is presently underway to confirm that is indeed true.)

Details
When mdstcl (which is a client) dispatches an action, it also creates a "receiver" thread to receive the eventual status reply regarding that action. (The action is sent to an "action server" that then processes the action and sends a reply back to the mdstcl client.) For each action that is dispatched, mdstcl thus opens up two sockets: one is the "to" socket used to send the action to the action server, and the other is the "from" socket used to receive the reply from the action server.

The issue is that the "connection ID table" in the mdstcl client should be accessible by all threads, but unfortunately isn't. Thus, when the receiver thread receives the SrvJobFINISHED reply, it can close its "from" socket, but it is unable to access the main thread's "connection ID table" in order to close the "to" socket. As a result, for each action dispatched, the current code leaves a dangling "to" socket.

The two workarounds clean up the dangling "to" sockets. Adding an "exit" statement at the end of a TCL script that dispatches actions terminates the TCL process, and thus all of its sockets are closed. Alternatively, using the stop server command of TCL will cause the action server to restart (if systemd or xinetd is configured correctly), which also closes all dangling sockets.

The proposed fix does not alter any logic regarding sockets and connections. All that it does is allow the "receiver" thread to access the "connection ID table" so that the "to" socket can be closed. Preliminary testing confirms that the fix works and that mdstcl is no longer leaking a "to" socket for each dispatched action. Thus, the workarounds are no longer needed.
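
As a rough illustration of the shape of the fix described above, here is a minimal C sketch (all names are invented for clarity; the actual MDSplus code differs):

#include <pthread.h>
#include <unistd.h>

#define MAX_CONNS 1024

typedef struct
{
  int to_sock;   /* sends the action to the action server */
  int from_sock; /* receives the SrvJobFINISHED reply */
  int in_use;
} Connection;

/* One process-wide table, protected by a mutex, so that a receiver thread
 * can find and close the "to" socket that the dispatching thread opened.
 * The bug amounts to this table not being visible to the receiver threads,
 * so only from_sock was ever closed. */
static Connection conn_table[MAX_CONNS];
static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;

static void close_connection(int id)
{
  pthread_mutex_lock(&conn_lock);
  if (conn_table[id].in_use)
  {
    close(conn_table[id].from_sock); /* was already being closed */
    close(conn_table[id].to_sock);   /* this is the socket that leaked */
    conn_table[id].in_use = 0;
  }
  pthread_mutex_unlock(&conn_lock);
}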

Low Risk?
Whenever fiddling with multi-threaded code, there is always a risk of race conditions being created. Hence the focus on extensive testing of the fix prior to release to GA. The goal of the testing is to confirm that all critical sections are protected with locks. Initial testing is promising; no threading issues have been encountered.

And because the problem appears to be limited to the dispatch command of mdstcl, that also reduces the risk associated with the fix. It is unlikely to affect other aspects of the MDSplus suite of tools.

Note also that this bug has been present in the alpha branch for ~3 years, and that no other customers have reported problems. That tends to confirm that the problem doesn't affect the entire socket communication foundation of MDSplus. It likely also means that the fix will carry lower risk than expected. (Regardless, extensive testing is needed to confirm that is so.)

Clarification
This issue affects all commands sent to an "action server". That includes the following activities (i.e., the list gives descriptions; the actual TCL statements are likely different).

  • abort server
  • close trees
  • create pulse
  • dispatch action
  • dispatch command
  • monitor check-in
  • monitor
  • logging
  • stop server

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

Hi @sflanagan -- Questions about your site's single "action server" . . .

  • What version of MDSplus does it run?
  • What Operating System and version?
  • Does the "action server" MDSplus/OS version match that of the "dispatch server" (apparently Atlas)?
  • Is it a dedicated "action server" that only processes actions? (I am curious whether it is also used for other MDSplus activities such as data analysis, and/or to run non-MDSplus applications.)

Answers to the above questions will enable us to set up a cross-version testing environment, if needed.

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

A stress test of 1,200 actions (400 actions per shot, 3 shots) reveals that there is an unprotected critical section that can cause a segfault. The segfault also occurs in the alpha branch, and thus was not introduced by the proposed fix.

Adding a delay of 0.01 seconds to each action eliminates the segfault, but causes the proposed fix to leave ~10 dangling sockets (i.e., it closes the other ~1,190 "to" sockets). Additional investigation is required to determine whether the dangling sockets are caused by a second unprotected critical section and, if so, whether it was introduced with the proposed fix.

Increasing the delay to 0.1 seconds allows the proposed fix to work fine -- no dangling sockets.

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

The proposed fix successfully handled 20,000 actions without leaking sockets. (However, each action had a 0.1 second delay.) That was 50 back-to-back shots with 400 actions per shot. The test was designed to generate a spike load on the "action server".

An even more extreme stress test does create a segfault in both the current alpha and the proposed fix. Preliminary investigation indicates that the segfault is probably a pre-existing issue and was not introduced by the proposed fix. The segfault investigation continues (and perhaps will be tracked in a new GitHub issue).

In summary, the proposed fix is now working well enough that it is moving on to the peer review stage. However, it will still be a few days before the fix is ready for release to GA.

smithsp avatar smithsp commented on June 20, 2024

Thanks for the update @mwinkel-dev .

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

It took all of Saturday to figure out that there is a "race" condition associated with this fix.

When mdstcl dispatches an action to the "action server", the server sends back two messages, "handshake" and "done", and they are received by two different threads.

  • The "handshake" message is when the server acknowledges that it has received an action.
  • The "done" message is sent when the server has processed the action.

Note: The names, "handshake" and "done", are just for clarity in this post. Those names don't appear in the source code.

In practice, there is usually a fair amount of time between the two messages (e.g., it takes time for an action to configure a digitizer), and that time interval enables the code to process the "handshake" before the "done".

However, the spike load test has an action that can be done instantly (i.e., just evaluating the number 1). And thus on a heavily loaded action server, the "done" message can be received and processed by mdstcl before the "handshake" is fully processed. Because processing the "done" message deletes the associated connection, the "handshake" code gets mighty confused when its connection id gets clobbered. Hence the errors and occasional segfaults.

Conceptually, the ideal fix to avoid the "race" condition is to use a lock on every single connection object. However, it will likely be a challenge to retrofit that into the existing code. (There is a risk of subtle bugs creeping in that could result in deadlock.)

The practical fix is probably just to add a 100 ms delay before processing the "done" message and closing the connection. Experiments show that this simple approach almost always avoids the race condition. (It leaked 2 or 3 sockets out of 400 actions processed.)

However, it is important that the dispatch cycle roughly match the added delay, so that connections are created and destroyed at approximately the same rate. If the delay is too long, and there are hundreds of actions dispatched at a high rate, then all sockets (allowed by the operating system) can be consumed thereby causing mdstcl and/or the action server to crash.

An alternative concept for matching connection creation / destruction rates is for mdstcl to keep a running tally of open connections. If the count hits a user-specified threshold (e.g., 500 active connections max), mdstcl pauses dispatching actions. As actions are drained from the server's input queue, mdstcl closes connections accordingly, and thus as each action finishes, mdstcl can dispatch another action.

Combining both approaches would allow mdstcl to avoid both problems: race condition and no more sockets.
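
A minimal C sketch of that connection-count throttle (the names are invented; this is not the MDSplus code):

#include <semaphore.h>

#define MAX_INFLIGHT 500 /* user-specified cap on active connections */

static sem_t inflight;

/* Call once at startup. */
static void throttle_init(void)
{
  sem_init(&inflight, 0, MAX_INFLIGHT);
}

/* Main dispatch thread: blocks once MAX_INFLIGHT actions are outstanding. */
static void before_dispatching_action(void)
{
  sem_wait(&inflight);
  /* ... open the sockets and send the action ... */
}

/* Receiver thread, on the "done" message: frees a slot so the dispatcher
 * can send another action. */
static void after_done_message(void)
{
  /* ... close the sockets and record the status ... */
  sem_post(&inflight);
}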

mwinkel-dev avatar mwinkel-dev commented on June 20, 2024

There is a fatal flaw in the proposed fix (call it "Option A") described in the preceding posts: deadlock.

The existing architecture of the mdstcl dispatch feature consists of the following:

  • a "connection id" table that contains information about each connection (sockets used, connection state, and so forth),
  • a main thread that also dispatches actions, and
  • many "receiver" threads, one per each action sent to the action server.

Option A treats the "connection id" table as a resource shared by all threads, and thus locks are used to protect access when adding or deleting connections. However, for that to work well, the main thread needs to be throttled so that the receiver threads have an opportunity to access the table too. Note, though, that if a receiver thread obtains the lock and then dies before releasing it, the main thread is permanently blocked from dispatching more actions. Thus mdstcl deadlocks. (There are likely other edge cases that would also create deadlock.)

Although the principles and architectures for high-performance, robust, multi-threaded web servers and clients are now well known, the challenge is in retrofitting the existing action server architecture to accommodate those features.

Option B might be a more practical and less risky option.

It has these features . . .

  • the main dispatch thread passes a connection's socket to the associated receiver thread (see the sketch after this list)
  • thus the receiver threads can close sockets without having to access the "connection id" table
  • that eliminates the need for locking on the table (because it is only accessed by the main thread)
  • it only allows ~N active connections at a time, where N is chosen to be well below the number of sockets allowed by the operating system,
  • N is user configurable (likely via an environment variable)
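
A minimal sketch of that hand-off, assuming hypothetical names and a socketpair() standing in for a real mdsip connection: ownership of the socket moves to the receiver thread, which closes it without ever touching the "connection id" table.

#include <pthread.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

struct receiver_arg {
    int socket_fd;   /* owned by the receiver from this point on */
    int action_id;
};

/* The receiver reads the "done" message, then closes its own socket;
   no access to the "connection id" table, hence no lock needed. */
static void *receiver_thread(void *p)
{
    struct receiver_arg *arg = p;
    close(arg->socket_fd);
    free(arg);
    return NULL;
}

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 1;
    struct receiver_arg *arg = malloc(sizeof *arg);
    if (!arg)
        return 1;
    arg->socket_fd = sv[0];
    arg->action_id = 1;
    pthread_t tid;
    pthread_create(&tid, NULL, receiver_thread, arg);
    pthread_join(tid, NULL);
    close(sv[1]);
    return 0;
}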

Advantages:

  • doesn't immediately consume all available sockets when dispatching hundreds of actions
  • doesn't leak sockets
  • avoids deadlock

Disadvantages:

  • when the action server is finished processing an action, its entry in the "connection id" table becomes obsolete
  • thus the "connection id" table constantly grows

Analysis:

  • suppose that 20,000 actions are dispatched in a day
  • the memory required to store a "connection id" table of 20,000 entries is fairly small -- probably just a few MB
  • the growth of the table is highly unlikely to force a reboot of the server (as is the case if run out of sockets)
  • a cron job can be run each evening (when shots aren't being run) to restart the action server, thus each morning the action server starts with an empty "connection id" table
  • the current shot's connections are added to the head of the table, thus table growth during the day won't have much impact on the speed of table lookups or of adding new entries
  • this option involves little change to the existing code, plus avoids deadlock, thus is less risky than Option A

Summary:

  • The long-term solution is to write a new dispatcher using a more robust architecture.
  • However to meet the schedule of DIII-D's upcoming campaign, the faster and safest approach is to make minimal changes to the existing mdstcl dispatch feature.
  • Option B thus shows more promise than Option A.

mwinkel-dev commented on June 20, 2024

Now have a rough prototype of Option B. It dispatched 3,600 actions at full speed and only leaked one socket. The Linux top command shows a negligible memory footprint (i.e., as expected the growth of the "connection id" table is insignificant).

On Monday, will clean up the code and do more extensive testing. And if the results look good, will move Option B to the peer review stage of the development process.

mwinkel-dev commented on June 20, 2024

Option B prototype dispatched 21,600 actions with just 35 sockets leaked (a leak rate of ~0.2%). Actions were dispatched at the rate of 10 actions per second. Rough estimate is that the "connection ID" table grew to ~187 MB.

However, if the dispatch throttle of 10 actions/second is removed, then mdstcl can deadlock when dispatching actions. That is a latent bug that has surely existed in the code for years; it is unrelated to the fix for this socket issue. (Nonetheless, it will be investigated before this socket fix is released to GA.)

When mdstcl deadlocked during testing, the "action server" usually was not also examined. However, the one instance that was checked revealed that the "action server" remained active (i.e., it was possible to open a new terminal window, start another mdstcl session, and dispatch more actions). When additional mdstcl deadlocks are observed, more tests will be done to determine whether the "action server" is always unaffected.

mwinkel-dev commented on June 20, 2024

There is likely another mode that causes mdstcl dispatch to leak sockets. It is associated with the queue length on the "action server".

Here are the details . . .

Because mdstcl dispatches the actions and processes the associated "done" messages, it can estimate the number of actions that are queued up on the "action server". (For each queued action, two sockets are used.)

mdstcl was configured to maintain ~20 actions in the server's queue, and 400 actions were then dumped into the dispatch hopper. It quickly ramped up to ~20 dispatched actions and thereafter maintained a steady state: when an action finished, it would dispatch another, keeping the tally at ~20 dispatched-and-queued actions. The logs for mdstcl and the "action server" both indicated that all 400 actions were processed. It was therefore surprising to see lsof report ~20 leaked sockets.

Repeating the experiment with the limit set at ~100 actions leaked ~100 sockets.
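
The bookkeeping behind the experiment is essentially the following (illustrative names only), using the two-sockets-per-queued-action rule to predict what lsof should report:

#include <stdio.h>

static int actions_dispatched = 0;
static int actions_done = 0;

/* Actions believed to still be queued on the "action server". */
static int estimated_queued_actions(void)
{
    return actions_dispatched - actions_done;
}

/* Each queued action holds two sockets, so this is the expected socket count. */
static int expected_open_sockets(void)
{
    return 2 * estimated_queued_actions();
}

int main(void)
{
    actions_dispatched = 420;
    actions_done = 400;
    printf("expected open sockets: %d\n", expected_open_sockets());
    return 0;
}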

On Tuesday, this mystery will be dug into. (There is a chance that there is a flaw in the design or implementation of the experiment.)

zack-vii commented on June 20, 2024

This problem was introduced with PR #2288 on 19-May-2021. The bug appears to only affect the dispatch command of TCL. Thus, it is likely that the associated fix is low risk. (Testing is presently underway to confirm that is indeed true.)

I am slowly remembering the struggle we had at the time, sorting out the existing problem while keeping backward compatibility. The issue was that the old protocol required a reconnection on an offered port to send back the result of the action.

The original goals that I remember were to support:

  • cancelation of dispatched actions by event (e.g. user intervention, Ctrl+C)
  • timeout
  • parallel execution of actions

mwinkel-dev commented on June 20, 2024

Hi @zack-vii -- Thanks for the history of the change. Much appreciated!

I will expand my experiments to see if my proposed fix for the leaking sockets breaks any of the features you listed above.

mwinkel-dev commented on June 20, 2024

When mdstcl is configured to always have N more actions dispatched than have finished, some actions will fail. If hundreds of actions are dispatched, eventually there will be N failed actions. The action server never sends a "done" message back to mdstcl for those failed actions, which therefore remain connected, thus causing N leaked sockets.

The actions are failing because their associated "job callback" functions are misbehaving. It is now being investigated whether the "job callback" problem is a multi-threading issue.

mwinkel-dev commented on June 20, 2024

Hi @smithsp and @sflanagan,

I am now compiling and testing the proposed fix (PR #2740) that was created by @zack-vii. Results of the test will be posted in that PR and summarized in this issue.

Note that @zack-vii was the author of PRs #2288 and #2289 that involved the dispatch feature of mdstcl. His contributions are thus welcomed and appreciated.

mwinkel-dev commented on June 20, 2024

PR #2740 by @zack-vii is a server-side fix and thus more elegant than the client-side (i.e., mdstcl) fix that I created. We should use PR #2740 as the foundation for the fix of this issue.

However, there is an edge case in mdstcl associated with "failed actions" that leaks sockets even with PR #2740 present. The edge case likely will require a client-side fix. And will probably be a separate PR.

The complete fix of this issue will likely consist of two or three PRs.

smithsp commented on June 20, 2024

@mwinkel-dev @zack-vii
Thanks for your continuing efforts and attention to testing in developing a fix.

smithsp commented on June 20, 2024

Note that the provided workaround of letting the system restart is not robust. We look forward to a fix as soon as possible.

mwinkel-dev commented on June 20, 2024

Hi @smithsp -- Thank you for the update. I will put that news on the agenda for Tuesday's meeting of the MDSplus software team.

mwinkel-dev commented on June 20, 2024

Hi @smithsp and @sflanagan,

Although the temporary workarounds are not an adequate solution for GA, it would nonetheless be helpful for us to learn what aspect of the workaround is failing for GA. (That information might help us create additional test cases for the real fix.)

On 1-Apr-2024, @sflanagan posted that the temporary fix does mean that a shot could be missed (which of course is a serious problem). However, are there now additional problems with the temporary workarounds?
#2731 (comment)

Also, which of these workarounds has GA tried?

  • adding an exit statement at the end of all TCL scripts, or
  • adding a stop server command at the end of all TCL scripts, or
  • running a shell script between shots that executes the following systemd commands:
systemctl stop action_server
systemctl start action_server

where "action_server" is whatever the name is of the systemd service that runs your "action server"

  • or running a shell script between shots that executes kill -9 action_server

mwinkel-dev commented on June 20, 2024

As per the request submitted via PR #2740, that partial fix will be cherry-picked into the GA branch. After the associated RPM packages are distributed to GA, work will resume on fixing the edge cases associated with this Issue #2731.

mwinkel-dev commented on June 20, 2024

Hi @smithsp and @sflanagan,

It is my understanding that the GA branch should be based on alpha-7.139.59 (based on the following post from Mitchell). If that is incorrect, let us know.
#2704 (comment)

smithsp commented on June 20, 2024

@mwinkel-dev The version is in #2731 (comment) (repeated here):

[root@atlas system]# rpm -qa | grep mds
mdsplus-alpha-mssql-7.139-59.el8.x86_64
mdsplus-alpha-idl_bin-7.139-59.el8.x86_64
mdsplus-alpha-kernel_bin-7.139-59.el8.x86_64
mdsplus-alpha-d3d-7.139-59.el8.noarch
mdsplus-alpha-kernel-7.139-59.el8.noarch
mdsplus-alpha-motif-7.139-59.el8.noarch
mdsplus-alpha-repo-7.139-40.el8.noarch
mdsplus-alpha-motif_bin-7.139-59.el8.x86_64
mdsplus-alpha-python-7.139-59.el8.noarch
mdsplus-alpha-idl-7.139-59.el8.noarch

mwinkel-dev commented on June 20, 2024

Hi @smithsp,

Apparently, I need to drink more coffee and wake up. Which is to say, I'm amused to learn from your post that the version information was there all along; I only had to scroll up to the top of this GitHub issue.

Regardless, thanks for confirming the version again. Reason I was double checking was to make sure we create the GA branch from the correct commit.

Today, we will be cherry-picking PR #2735 and PR #2740 into the branch. Here are the tasks:

  • creating the GA branch
  • cherry picking the two PRs
  • manually building the branch (including running the automated tests)
  • testing the RPMs on a RHEL8 system to ensure not leaking sockets
  • notifying GA when the release is ready
  • note: so as not to add additional delay to the release, we will not be adding the GA branch to the Jenkins build system today

If all goes according to plan, the above tasks will all be completed this afternoon or evening.

When the release is ready, GA will have two options:

  • I can provide a *.tgz file of the RPMs as promised, and/or
  • GA can build the branch and create its own RPMs.

mwinkel-dev commented on June 20, 2024

Trial run of GA branch / build is underway. Now have RPMs on a Rocky8 system for testing on Friday. Some minor build issues still to be addressed (major-minor-release labelling scheme for the branch, and so forth).

mwinkel-dev commented on June 20, 2024

Hi @smithsp and @sflanagan,

What "tagging" scheme would you like us to use for the GA branch (i.e., values for "branch-major-minor-release")?

GA release zero is identical to alpha_7.139.59. This commit does not presently have a GA "tag".

What should the next release of the GA branch be called? (This will be the release with the cherry-picks of PR #2735 and PR #2740.)

smithsp commented on June 20, 2024

I propose the next release be labeled with atlas. It could be atlas_1.0.0. (Or this could be the source for the new stable set of labels, since we think it is the most recent stable release.)

mwinkel-dev commented on June 20, 2024

Hi @smithsp and @sflanagan,

Thanks for the answer. We will start with atlas_1.0.0. However, that label can be changed later if GA so decides.

mwinkel-dev commented on June 20, 2024

Just dispatched 2,000 actions on Rocky Linux 8.9 without leaking any sockets.

Details of the trial branch build and test . . .

  • Manually ran the Docker build for RHEL8 which produced unsigned RPMs for the trial branch (= alpha_7.139.59 plus cherry-picks of PR 2735 and PR 2740).
  • Installed the RPMs on Rocky Linux 8.9 (with gpgcheck=0 because RPMs were unsigned).
  • Dispatched the 2,000 actions to the "action service".
  • Used lsof to confirm that no sockets leaked.

mwinkel-dev commented on June 20, 2024

Many of the preceding posts by @mwinkel-dev were initial conjectures and thus incorrect. Now that the investigation has been completed, here is a summary of the root cause.

It is expensive to create connections, and thus the architecture of mdstcl and the "action service" is to reuse connections whenever possible. The following explanation uses the simplest configuration: mdstcl, one "action service", and an "mdsip service" (for tree access).

In normal usage, mdstcl keeps one connection open to the mdsip service so that it can obtain all action nodes from the tree, from which it builds a table of actions to be dispatched. mdstcl also opens a second connection to the "action service" and dispatches all actions over that connection; the main thread of mdstcl does the dispatching. As actions execute, the "action server" sends replies back to mdstcl on a different connection. There are not multiple "receiver" threads; mdstcl has a single thread listening for replies. Once a connection is made, it is reused over and over. (Of course, if a connection dies because of a network glitch, mdstcl creates a replacement connection.)
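
In other words, the intended pattern is to open the dispatch connection once and reuse it, roughly like the following sketch (hypothetical names, not the actual MDSplus calls):

#include <stdio.h>

/* Hypothetical stand-in for the real connect call. */
static int connect_to_server(const char *server)
{
    printf("connecting to %s\n", server);
    return 3;   /* pretend connection id */
}

static int dispatch_conn = -1;   /* cached across all dispatched actions */

/* Return the cached dispatch connection, creating it only on first use
   (or after a network glitch has invalidated it). */
static int get_dispatch_connection(const char *server)
{
    if (dispatch_conn < 0)
        dispatch_conn = connect_to_server(server);
    return dispatch_conn;
}

int main(void)
{
    for (int i = 0; i < 3; i++)
        get_dispatch_connection("atlas:8002");   /* connects only once */
    return 0;
}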

And this is how the software changed over the years . . .

Release stable_7.96.9 is what GA was running until November 2023. It reused the connection to dispatch actions so did not leak sockets.

PR #2288 refactored the software and inadvertently omitted the reuse of the dispatch connection. It created a new connection for each action dispatched, thus leaking sockets.

The proposed fix created by @mwinkel-dev incorrectly assumed that the architecture was indeed supposed to create a new connection per action, and also incorrectly assumed that a receiver thread was created for each action. (Maintenance developers rely on comments in the source code to guide them; when those comments are missing, it is easy to make bad assumptions.) This proposed fix did solve the leaking-socket issue, but had other problems (such as deadlock).

PR #2740 corrects the omission of PR #2288 and reuses the dispatch connection, thus eliminating the leaked sockets. Although some code cleanup was done on the "action service", the primary fix is actually in the client-side code.

mwinkel-dev commented on June 20, 2024

Hi @smithsp,

PR #2746 addresses some problems that were spotted while investigating this Issue #2731. Let us know if you also want PR #2746 cherry-picked to GA's branch.

mwinkel-dev commented on June 20, 2024

Hi @smithsp,

Thanks for the decision. Therefore, we will include PR #2746 only if two conditions are met:

  1. it adds no delay to the release of the RPM tarball of the GA branch, and
  2. the peer reviewers consider it useful to have it in GA's branch.

If PR #2746 is not included in the forthcoming tarball (aka "ga_atlas_1.0.0"), my recommendation is that GA add it to the list of things to be included in an eventual subsequent release ("ga_atlas_1.0.1").

Regarding the tarball, I will post an update on Wednesday 24-Apr-2024 with an updated schedule. (I was out of the office today, so must check with my colleagues to find out the present status of the GA branch and tarball.)

mwinkel-dev commented on June 20, 2024

Hi @smithsp,

With guidance from @WhoBrokeTheBuild, the official GA branch is now available. It is named ga_atlas.

And it contains the cherry-picks of PR #2735 and PR #2740. (PR #2746 was not included because it is still in the review process.)

The history of the ga_atlas branch is as follows.

ga_atlas_1.0.0
   PR #2740 = 4cee3b1 = Fix: reduce open files due to dispatcher
   PR #2735 = 471a06c = When activate debug trace, now compiles without error

alpha_7.139.59 = version GA installed in November 2023
   PR #2647 = 37652c2 = Fix: dir /full segfault

The ga_atlas_1.0.0 release was manually built (on my dev system, using Docker images) for three platforms: RHEL8, Ubuntu20, and Windows. All builds successfully ran the automated test suite. In addition, the Ubuntu20 build passed the manual test of 1,200 dispatched actions without leaking any sockets.

Stephen informed me that the Jenkins build server is not configured to build the ga_atlas branch. If GA has problems building the branch, let us know.

I am also creating a tarball of the RHEL8 RPMs so that I can install the ga_atlas_1.0.0 release on my Rocky Linux 8 virtual machine and repeat the dispatch / socket test. As a courtesy to GA, I can also make my RPM tarball available if needed.

mwinkel-dev commented on June 20, 2024

Timeline of fix (in calendar days, not work days) and lessons @mwinkel-dev learned . . .

  1. workaround proposed and tested = 1 day
  2. investigation and first two versions of fix = 12 days
  3. develop PR 2740 fix = 2 days
  4. review, merge and test PR 2740 = 9 days
  5. create PR 2746 to fix residual dispatcher issues = 4 days
  6. create ga_atlas branch and deliver tarball of RPMs = 2 days

Step 2 was the result of undocumented source code and wrong assumptions about the software architecture. To prevent this from occurring again, PR #2746 adds comments to help future maintenance programmers. Lesson learned = seek input from others sooner.

Six days elapsed between Steps 4 and 6. The fix was merged at Step 4, and thus available to the customer, but not in the form the customer requested. Lesson learned = better coordination saves time.

smithsp commented on June 20, 2024

The ga_atlas branch is working in production.

mwinkel-dev commented on June 20, 2024

Hi @smithsp -- Thanks for the update. Am glad to read that the ga_atlas branch is working OK at DIII-D.
