
Comments (9)

rosenbaumalex avatar rosenbaumalex commented on July 25, 2024

Hi Barak,
Accelio reconnect logic will be GA with the coming v1.5 release this week.
Alon just completed a set of fixes for it, which are part of the v1.5-rc1 we
published today.

Which accelio version did you test with?
What kind of issues do you see?
Can you describe your setup?

Regards,
Alex
On Sep 2, 2015 6:19 PM, "Barak Pinhas" [email protected] wrote:

Hi Eyal,

While trying the reconnect feature in accelio, I ran a simple test which
took the transport down and then back up during IOs - and encountered some
issues. I'd appreciate some words about how the reconnect feature is
designed - specifically with regards to how the nexus swapping and the
transport duplication take action during the reconnect (on the client and
server side).

Your help would be much appreciated.
Thank you,
Barak



from accelio.

barakp avatar barakp commented on July 25, 2024

Hi Alex,

Thanks for your prompt reply. My build is based on commit 313423c (from the for_next branch). I have a fairly simple setup: two endpoints connected via RDMA with a proxy in between. The proxy is disconnected and reconnected during IOs to trigger the reconnect logic.
A few issues I encountered:

  • Segmentation fault in xio_nexus_swap after the transport dup2 call: xio_nexus_close tried to access the transport through the nexus, although the transport had already been closed after the dup2 call.
  • Requests are failed synchronously with -ENOTCONN during reconnect instead of being reinserted into the queue for a later retry, so the failure (and hence the reconnect attempts) is not hidden from the callers.
  • In some cases, after a successful reconnect, the client and the server cannot find the connection of the reconnected session to re-establish IOs.

I have fixes for the first two, but not for the third one.
I'd appreciate some comments/details regarding these issues and the reconnect logic in general (maybe what I'm missing here is some sort of documentation regarding the feature).

Thanks,
Barak
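
For readers following the thread, the behavior asked for in the second issue (parking requests while a reconnect is in progress and replaying them afterwards, rather than failing them with -ENOTCONN) can be sketched roughly as below. This is only an illustration under stated assumptions: `struct connection`, `struct request`, `conn_send`, `conn_on_reconnected`, and `transport_send` are hypothetical names invented here and are not accelio's API or its actual fix.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical types for illustration; these are not accelio structures. */
enum conn_state { CONN_ONLINE, CONN_RECONNECTING, CONN_CLOSED };

struct request {
    int id;
    struct request *next;
};

struct connection {
    enum conn_state state;
    struct request *pending_head;  /* FIFO of requests parked during reconnect */
    struct request *pending_tail;
    int sent;                      /* number of requests handed to the transport */
};

/* Stand-in for the real transmit path (assumption: it always succeeds). */
static int transport_send(struct connection *c, struct request *r)
{
    (void)r;
    c->sent++;
    return 0;
}

/* Append to the tail so the original submission order is kept on replay. */
static void park(struct connection *c, struct request *r)
{
    assert(r != NULL);
    r->next = NULL;
    if (c->pending_tail)
        c->pending_tail->next = r;
    else
        c->pending_head = r;
    c->pending_tail = r;
}

/* Send now if online, park while reconnecting, fail only when closed. */
int conn_send(struct connection *c, struct request *r)
{
    switch (c->state) {
    case CONN_ONLINE:
        return transport_send(c, r);
    case CONN_RECONNECTING:
        park(c, r);
        return 0;
    default:
        return -ENOTCONN;
    }
}

/* Called once the transport is back: replay parked requests in order.
 * Returns the number of requests replayed. */
int conn_on_reconnected(struct connection *c)
{
    int replayed = 0;

    c->state = CONN_ONLINE;
    while (c->pending_head) {
        struct request *r = c->pending_head;

        c->pending_head = r->next;
        if (!c->pending_head)
            c->pending_tail = NULL;
        r->next = NULL;
        if (transport_send(c, r) == 0)
            replayed++;
    }
    return replayed;
}
```

With this shape, a caller that submits while the state is `CONN_RECONNECTING` gets 0 back and the request goes out when `conn_on_reconnected` runs; only a connection in `CONN_CLOSED` surfaces -ENOTCONN. A real implementation would also need a bound or timeout on the parked queue, which is omitted here.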


katyakats avatar katyakats commented on July 25, 2024

Hi Barak!
Last week we pushed some bug fixes for the reconnect feature. Please let me know if you still see the issues with the latest accelio.
Thanks,
Katya


katyakats avatar katyakats commented on July 25, 2024

P.S. we will add documentation of the reconnect next week


barakp avatar barakp commented on July 25, 2024

Hi Katya,

Unfortunately even with the latest master I still see issues with reconnect. The latest issue on the server side is a segmentation fault while trying to swap the nexus:

#0  0x00007f0ac8d338d2 in spin_lock (spinlock=<optimized out>)
    at ./linux/kernel.h:145
#1  xio_timers_list_lock (timers_list=<optimized out>)
    at xio/xio_timers_list.h:70
#2  xio_workqueue_add_delayed_work (work_queue=0xffff9045e8000000,
    msec_duration=60000, data=0x117c9cd0,
    function=0x7f0ac8d5e8f0 <xio_nexus_release_cb>, dwork=0x117c9d58)
    at xio/xio_workqueue.c:355
#3  0x00007f0ac8d3159c in xio_ctx_add_delayed_work (ctx=<optimized out>,
    msec_duration=<optimized out>, data=<optimized out>,
    timer_fn=<optimized out>, work=<optimized out>) at xio/xio_context.c:310
#4  0x00007f0ac8d5eaa6 in xio_nexus_delayed_close (kref=0x117c9d40)
    at ../common/xio_nexus.c:2142
#5  0x00007f0ac8d62fda in xio_nexus_swap (_new=<optimized out>,
    old=<optimized out>) at ../common/xio_nexus.c:502
#6  xio_nexus_on_recv_setup_req (task=<optimized out>,
    new_nexus=<optimized out>) at ../common/xio_nexus.c:558
#7  xio_nexus_on_new_message (event_data=<optimized out>,
    event_data=<optimized out>, nexus=<optimized out>)
    at ../common/xio_nexus.c:1434
#8  xio_nexus_on_transport_event (observer=0x117c9cd0, sender=0x1,
    event=293379280, event_data=0x7fff4cc5c360) at ../common/xio_nexus.c:1597
#9  0x00007f0ac8d5e656 in xio_observable_notify_all_observers (
    observable=0x117c93b0, event=5, event_data=0x7fff4cc5c360)
    at ../common/xio_observer.c:225
#10 0x00007f0ac8d44931 in xio_transport_notify_observer (
    event_data=<optimized out>, event=<optimized out>,
    trans_hndl=<optimized out>) at ../../src/common/xio_transport.h:303
#11 xio_rdma_on_setup_msg (task=<optimized out>, rdma_hndl=<optimized out>)
    at transport/rdma/xio_rdma_datapath.c:4339
#12 xio_rdma_rx_handler (task=<optimized out>, rdma_hndl=<optimized out>)
    at transport/rdma/xio_rdma_datapath.c:791
#13 xio_handle_wc (wc=<optimized out>, wc=<optimized out>,
    last_in_rxq=<optimized out>) at transport/rdma/xio_rdma_datapath.c:1109
#14 xio_poll_cq (tcq=0xffff9045e8000018, tcq@entry=0x1178abf0, timeout_us=1,
    max_wc=<optimized out>) at transport/rdma/xio_rdma_datapath.c:1215
#15 0x00007f0ac8d46a3c in xio_poll_cq_armable (tcq=0x1178abf0)
    at transport/rdma/xio_rdma_datapath.c:1262
#16 0x00007f0ac8d46b35 in xio_cq_event_handler (fd=<optimized out>,
    events=<optimized out>, data=0x1178abf0)
    at transport/rdma/xio_rdma_datapath.c:1362
#17 0x00007f0ac8d2f2b0 in xio_ev_loop_run_helper (loop_hndl=0x1d3a7d0,
    timeout=0) at xio/xio_ev_loop.c:448
#18 0x0000000000404a62 in xio_event_handler (
    loop=0x7f0ac86f1a40 <default_loop_struct>, xio_watcher=0x7f0ac9192808,
    revents=1) at portable/abstraction/linux_user/accelio/accelio.c:22

My setup is somewhat different than yours - there's a proxy in between the two systems. The proxy is being restarted during IOs. The transport is RDMA.

I'm trying to debug it now.


alon-grinshpoon avatar alon-grinshpoon commented on July 25, 2024

Hey Barak!

We are working on reproducing the scenario you described above and hopefully solving the problem. I will update you as soon as possible.

Thanks for being patient :)
Alon


barakp avatar barakp commented on July 25, 2024

Hi Alon, thanks.
I have a proposed fix for the segfault above. It is caused by the kref on the nexus dropping to zero and the kref release path registering a callback that accesses the nexus.
I'm still encountering end-to-end issues, now on the client side, and am still trying to figure out what's going on there.
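
As a rough illustration of this class of bug, the sketch below shows one common way to keep a reference-counted object alive until a deferred callback has run: the deferred work takes its own reference and drops it only after the callback returns. Everything here is hypothetical (`struct obj`, `obj_get`, `obj_put`, `schedule_deferred`, `run_deferred`, and the one-slot `struct delayed_work` are invented for the example); it is not accelio's kref or workqueue code, and it is not necessarily the fix that was applied.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object; not accelio's struct xio_nexus. */
struct obj {
    int refs;
    int closed;        /* set by the deferred close callback */
    int *freed_flag;   /* illustration hook: set to 1 when the object is freed */
};

struct obj *obj_new(int *freed_flag)
{
    struct obj *o = calloc(1, sizeof(*o));

    if (!o)
        return NULL;
    o->refs = 1;
    o->freed_flag = freed_flag;
    return o;
}

void obj_get(struct obj *o)
{
    assert(o->refs > 0);
    o->refs++;
}

/* Drop one reference; free the object when the last one goes away. */
void obj_put(struct obj *o)
{
    assert(o->refs > 0);
    if (--o->refs == 0) {
        if (o->freed_flag)
            *o->freed_flag = 1;
        free(o);
    }
}

/* A one-slot stand-in for a delayed-work queue. */
struct delayed_work {
    void (*fn)(struct obj *);
    struct obj *arg;
};

/* Schedule fn(o) for later. The queue takes its own reference, so the
 * object cannot be freed before the callback has run. */
void schedule_deferred(struct delayed_work *w, void (*fn)(struct obj *),
                       struct obj *o)
{
    obj_get(o);
    w->fn = fn;
    w->arg = o;
}

/* Run the pending callback, then drop the queue's reference. */
void run_deferred(struct delayed_work *w)
{
    void (*fn)(struct obj *) = w->fn;
    struct obj *o = w->arg;

    if (!fn)
        return;
    w->fn = NULL;
    w->arg = NULL;
    fn(o);
    obj_put(o);
}

/* Example deferred callback: it touches the object, so it must still exist. */
void deferred_close(struct obj *o)
{
    o->closed = 1;
}
```

In this arrangement the caller can drop its own reference right after scheduling; the object is freed only when `run_deferred` has invoked the callback and released the queue's reference.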


alon-grinshpoon avatar alon-grinshpoon commented on July 25, 2024

Hey Barak!

We were successful in recreating and fixing the bug above (in xio_observable_notify_all_observers).
Please pull the latest for_next branch (the latest version of accelio).
Your problem should be solved.

Have a great day :)
Alon


barakp avatar barakp commented on July 25, 2024

Hi Alon,
We've already fixed all reconnect related issues back in Sep.
I'm closing this ticket.
Thanks,
Barak

