Comments (10)

jennifer-richards commented on September 5, 2024

I believe this was fixed by #68, but I am not entirely sure. This may be what happens when a deadlock or segfault does not occur immediately. I have not been able to reproduce it with that fix in place, though it was sometimes difficult to reproduce before. Reopen this if it recurs.

from trust-router.

jennifer-richards commented on September 5, 2024

Test

The best way I know to test this:

Steps

  1. Set up two or more trust router peers
  2. Start the trust routers
  3. Attempt repeated connections to one of the trust routers:
    a. using trmon
    b. using tidc
    c. by stopping and restarting the other trust router
  4. Repeat step 3 with valid and invalid credentials, with local and discovered realms, etc.
  5. Shut down the peer trust router (but not the one you're testing! leave that running!)

Expected results

When the trust router has been left alone for at least 30 seconds, running top, ps, or a system monitor should show one active (or sleeping) trust_router process and at most one other trust_router process that is in the zombie state. Running top or a system monitor should show that the trust router is using very low CPU.

A fail is:

  1. multiple zombie (or worse, active) processes
  2. high CPU load with no activity

meadmaker commented on September 5, 2024

I've run this with:

  • trmon - two windows which each run trmon ... show 100 times each
  • tidc
    • two windows which each run tidc 100 times each
    • one window which runs tidc 250 times, in groups of five parallel processes
    • one window which runs tidc in the background 100 times

This last one crashed my tidc machine - it caused it to hang. (I should have anticipated that...) However, when the tidc machine hung and had to be restarted, the ninety trust_router processes that served the tidc connections remained on the trust_router server. These processes aren't using any CPU; they're just hanging out in the process table.

So, the overall CPU usage is fine, but it does leave processes behind if the other side stops communicating with us.

jennifer-richards commented on September 5, 2024

There is a sweep every 10 seconds (plus after every completed TID request) that should flush any processes that exited, so this probably means something is blocking the processes from exiting.
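
For readers unfamiliar with this kind of sweep, here is a minimal sketch of non-blocking child reaping. It is an illustration under assumptions, not the trust router's actual code: the function name sweep_exited_children is hypothetical, and the real sweep may be structured differently. The point it shows is that waitpid() with WNOHANG only collects children that have already exited, so a child still blocked in a read is never collected.

```c
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper (not from the trust router source).
 * Reaps every child process that has already exited, without blocking.
 * Returns the number of children reaped. */
int sweep_exited_children(void)
{
    int reaped = 0;
    int status;

    /* waitpid() returns a pid > 0 for each exited child it collects,
     * 0 if children exist but none has exited yet, and -1 (ECHILD)
     * if there are no children at all. Either of the last two ends
     * the sweep. */
    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped++;

    return reaped;
}
```

A sweep like this removes zombies from the process table, but it cannot do anything about a child that is still running, which matches the symptom reported above.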

meadmaker commented on September 5, 2024

I'm not sure the (trust_router) processes exited. I think they are still waiting for the next communication from the no longer existing tidc processes.

jennifer-richards commented on September 5, 2024

Yes, that's what I expect. Normally the socket closes and whatever is waiting catches wind of that and the process exits uncleanly (but exits). If we have any processes still stuck, it'd be helpful to attach gdb to one and see where it is. Can you reproduce this with the debug symbols installed?

jennifer-richards commented on September 5, 2024

It's (unsurprisingly) in ReadBuffer in the gsscon code:

(gdb) bt
#0  0x00007fa338c277fd in read () from /lib64/libpthread.so.0
#1  0x0000000000423e71 in read (__nbytes=4, __buf=0x7ffdcfa9ad64, __fd=21) at /usr/include/bits/unistd.h:44
#2  ReadBuffer (inSocket=inSocket@entry=21, inBufferLength=inBufferLength@entry=4, ioBuffer=ioBuffer@entry=0x7ffdcfa9ad64 "")
    at gsscon_common.c:101
#3  0x0000000000423f2e in gsscon_read_token (inSocket=inSocket@entry=21, outTokenValue=outTokenValue@entry=0x7ffdcfa9ae50, 
    outTokenLength=outTokenLength@entry=0x7ffdcfa9ae58) at gsscon_common.c:168
#4  0x0000000000424744 in gsscon_passive_authenticate (inSocket=inSocket@entry=21, inNameBuffer=..., 
    outGSSContext=outGSSContext@entry=0x7ffdcfa9af00, clientCb=clientCb@entry=0x408800 <tr_gss_auth_cb>, 
    clientCbData=clientCbData@entry=0x1fb5860) at gsscon_passive.c:119
#5  0x00000000004089a5 in tr_gss_auth_connection (auth_cookie=0x1f115c0, auth_cb=0x405310 <tr_tids_gss_handler>, gssctx=0x7ffdcfa9af00, 
    acceptor_hostname=<optimized out>, acceptor_service=0x429333 "trustidentity", conn=21) at common/tr_gss.c:126
#6  tr_gss_handle_connection (conn=conn@entry=21, acceptor_service=acceptor_service@entry=0x429333 "trustidentity", 
    acceptor_hostname=<optimized out>, auth_cb=0x405310 <tr_tids_gss_handler>, auth_cookie=0x1f115c0, 
    req_cb=req_cb@entry=0x409e40 <tids_req_cb>, req_cookie=req_cookie@entry=0x1f0f870) at common/tr_gss.c:237
#7  0x000000000040a7c8 in tids_handle_proc (result_fd=105, conn_fd=21, tids=0x1f0f870) at tid/tids.c:424
#8  tids_accept (tids=0x1f0f870, listen=10) at tid/tids.c:487
#9  0x00007fa339dc9a14 in event_base_loop () from /lib64/libevent-2.0.so.5
#10 0x0000000000404643 in main (argc=<optimized out>, argv=<optimized out>) at tr/tr_main.c:324

jennifer-richards commented on September 5, 2024

Ok, I suspect what is happening is that an unclean shutdown of a client is failing to notify the server that it has disconnected.

I have added a poll / timeout to ReadBuffer(). It's a little tricky to set the timeout here because I think there are times when a trust router may legitimately wait for a significant period. E.g., I think it sits in this function while waiting for a TID request to go through. I've set it to 60 seconds for now, which I think should be long enough. We may eventually want to make this a parameter, but I'm hoping to get away without refactoring the various internal APIs that would be involved.

I committed this directly to the milestone branch because it's the easiest way to do testing on this right now. It's e42c080.
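
To make the idea concrete, here is a minimal sketch of a read guarded by poll() with a timeout. This is not the code in e42c080 and not the real ReadBuffer() signature: the function name read_with_timeout, its parameters, and the choice of return codes are all assumptions made for illustration.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not the actual ReadBuffer() from gsscon_common.c).
 * Reads exactly `len` bytes from `fd`, waiting at most `timeout_ms`
 * milliseconds for each chunk of data to arrive.
 * Returns 0 on success, ETIMEDOUT if no data arrived in time,
 * ECONNRESET if the peer closed the connection, or another errno value. */
int read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    char *p = buf;
    size_t got = 0;

    while (got < len) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc < 0) {
            if (errno == EINTR)
                continue;           /* interrupted: retry (timeout restarts) */
            return errno;
        }
        if (rc == 0)
            return ETIMEDOUT;       /* nothing to read within the timeout */

        ssize_t n = read(fd, p + got, len - got);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return errno;
        }
        if (n == 0)
            return ECONNRESET;      /* end of file: peer closed the socket */

        got += (size_t)n;
    }
    return 0;
}
```

With a 60-second timeout this would be called as read_with_timeout(fd, buf, 4, 60000). Note that in this sketch the timeout applies to each wait rather than to the whole read, so a peer that trickles data slowly is not cut off; whether that is the right behavior for the trust router is a separate decision.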

jennifer-richards commented on September 5, 2024

Kicking back to @meadmaker to test the fix.

meadmaker commented on September 5, 2024

Fix tested, passed.
