Comments (9)
Hi Barak,
Accelio reconnect logic will be GA with the coming v1.5 release this week.
Alon just completed a set of fixes for this, which are part of the v1.5-rc1 we
published today.
Which accelio version did you test with?
What kind of issues do you see?
Can you describe your setup?
Regards,
Alex
On Sep 2, 2015 6:19 PM, "Barak Pinhas" wrote:
Hi Eyal,
While trying the reconnect feature in accelio, I ran a simple test which
took the transport down and then back up during IOs - and encountered some
issues. I'd appreciate a few words about how the reconnect feature is
designed, specifically with regard to how the nexus swapping and the
transport duplication take place during the reconnect (on the client and
server side). Your help would be much appreciated.
Thank you,
Barak
from accelio.
Hi Alex,
Thanks for your prompt reply. I am working from commit 313423c (on the for_next branch). I have a pretty simple setup: two endpoints connected via RDMA, with a proxy in between. The proxy is disconnected and reconnected during IOs to trigger the reconnect logic.
A few issues I encountered:
- A segmentation fault in xio_nexus_swap after the transport dup2 call: xio_nexus_close tried to access the transport through the nexus, although the transport had already been closed after the dup2 call.
- Requests fail synchronously with -ENOTCONN during reconnect instead of being reinserted into the queue for a later retry, so the failure (and hence the reconnect attempts) is not hidden from the callers.
- In some cases, after a successful reconnect, neither the client nor the server can find the connection of the reconnected session to re-establish IOs.
I have fixes for the first two, but not for the third one.
I'd appreciate some comments/details regarding these issues and the reconnect logic in general (maybe what I'm missing here is some sort of documentation regarding the feature).
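To make the second issue concrete, here is a minimal sketch of the behavior I would expect: park requests while the connection is reconnecting and flush them in order once the transport is back, returning -ENOTCONN only when the connection is really closed. Every name in it (`struct conn`, `conn_submit`, `conn_on_reconnected`, `transport_send`) is hypothetical and made up for illustration; none of it is accelio API or the actual fix.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not accelio code. */
enum conn_state { CONN_ONLINE, CONN_RECONNECTING, CONN_CLOSED };

struct request {
    int id;
    struct request *next;
};

struct conn {
    enum conn_state state;
    struct request *pending_head; /* requests parked while reconnecting */
    struct request *pending_tail;
    int sent;                     /* requests handed to the transport */
};

/* Stand-in for the real transport send; always succeeds here. */
int transport_send(struct conn *c, struct request *r)
{
    (void)r;
    c->sent++;
    return 0;
}

/* Submit a request. While reconnecting, park it instead of failing. */
int conn_submit(struct conn *c, struct request *r)
{
    switch (c->state) {
    case CONN_ONLINE:
        return transport_send(c, r);
    case CONN_RECONNECTING:
        r->next = NULL;
        if (c->pending_tail)
            c->pending_tail->next = r;
        else
            c->pending_head = r;
        c->pending_tail = r;
        return 0; /* the caller does not see the outage */
    default:
        return -ENOTCONN; /* only a closed connection fails */
    }
}

/* Called once the transport is back: flush parked requests in FIFO order.
 * Returns the number of requests flushed. */
int conn_on_reconnected(struct conn *c)
{
    struct request *r;
    int flushed = 0;

    c->state = CONN_ONLINE;
    while ((r = c->pending_head) != NULL) {
        if (transport_send(c, r) != 0)
            break; /* leave the rest parked for a later attempt */
        c->pending_head = r->next;
        r->next = NULL;
        flushed++;
    }
    if (c->pending_head == NULL)
        c->pending_tail = NULL;
    return flushed;
}
```

A real implementation would also need a bound on the parked queue and a timeout after which parked requests are failed, so that a reconnect that never completes does not hold requests forever.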
Thanks,
Barak
from accelio.
Hi Barak!
Last week we pushed some bug fixes for the reconnect feature. Please let me know if you still see the issues with the latest accelio.
Thanks,
Katya
from accelio.
P.S. we will add documentation of the reconnect next week
from accelio.
Hi Katya,
Unfortunately even with the latest master I still see issues with reconnect. The latest issue on the server side is a segmentation fault while trying to swap the nexus:
#0 0x00007f0ac8d338d2 in spin_lock (spinlock=<optimized out>)
at ./linux/kernel.h:145
#1 xio_timers_list_lock (timers_list=<optimized out>)
at xio/xio_timers_list.h:70
#2 xio_workqueue_add_delayed_work (work_queue=0xffff9045e8000000,
msec_duration=60000, data=0x117c9cd0,
function=0x7f0ac8d5e8f0 <xio_nexus_release_cb>, dwork=0x117c9d58)
at xio/xio_workqueue.c:355
#3 0x00007f0ac8d3159c in xio_ctx_add_delayed_work (ctx=<optimized out>,
msec_duration=<optimized out>, data=<optimized out>,
timer_fn=<optimized out>, work=<optimized out>) at xio/xio_context.c:310
#4 0x00007f0ac8d5eaa6 in xio_nexus_delayed_close (kref=0x117c9d40)
at ../common/xio_nexus.c:2142
#5 0x00007f0ac8d62fda in xio_nexus_swap (_new=<optimized out>,
old=<optimized out>) at ../common/xio_nexus.c:502
#6 xio_nexus_on_recv_setup_req (task=<optimized out>,
new_nexus=<optimized out>) at ../common/xio_nexus.c:558
#7 xio_nexus_on_new_message (event_data=<optimized out>,
event_data=<optimized out>, nexus=<optimized out>)
at ../common/xio_nexus.c:1434
#8 xio_nexus_on_transport_event (observer=0x117c9cd0, sender=0x1,
event=293379280, event_data=0x7fff4cc5c360) at ../common/xio_nexus.c:1597
#9 0x00007f0ac8d5e656 in xio_observable_notify_all_observers (
observable=0x117c93b0, event=5, event_data=0x7fff4cc5c360)
at ../common/xio_observer.c:225
#10 0x00007f0ac8d44931 in xio_transport_notify_observer (
event_data=<optimized out>, event=<optimized out>,
trans_hndl=<optimized out>) at ../../src/common/xio_transport.h:303
#11 xio_rdma_on_setup_msg (task=<optimized out>, rdma_hndl=<optimized out>)
at transport/rdma/xio_rdma_datapath.c:4339
#12 xio_rdma_rx_handler (task=<optimized out>, rdma_hndl=<optimized out>)
at transport/rdma/xio_rdma_datapath.c:791
#13 xio_handle_wc (wc=<optimized out>, wc=<optimized out>,
last_in_rxq=<optimized out>) at transport/rdma/xio_rdma_datapath.c:1109
#14 xio_poll_cq (tcq=0xffff9045e8000018, tcq@entry=0x1178abf0, timeout_us=1,
max_wc=<optimized out>) at transport/rdma/xio_rdma_datapath.c:1215
#15 0x00007f0ac8d46a3c in xio_poll_cq_armable (tcq=0x1178abf0)
at transport/rdma/xio_rdma_datapath.c:1262
#16 0x00007f0ac8d46b35 in xio_cq_event_handler (fd=<optimized out>,
events=<optimized out>, data=0x1178abf0)
at transport/rdma/xio_rdma_datapath.c:1362
#17 0x00007f0ac8d2f2b0 in xio_ev_loop_run_helper (loop_hndl=0x1d3a7d0,
timeout=0) at xio/xio_ev_loop.c:448
#18 0x0000000000404a62 in xio_event_handler (
loop=0x7f0ac86f1a40 <default_loop_struct>, xio_watcher=0x7f0ac9192808,
revents=1) at portable/abstraction/linux_user/accelio/accelio.c:22
My setup is somewhat different than yours - there's a proxy in between the two systems. The proxy is being restarted during IOs. The transport is RDMA.
I'm trying to debug it now.
from accelio.
Hey Barak!
We are working on recreating the scenario you described above and hopefully solving the problem. I will update you as soon as possible.
Thanks for being patient :)
Alon
from accelio.
Hi Alon, thanks.
I have a proposed fix for the segfault above. It is caused by the kref on the nexus dropping to zero, with the kref release registering a callback that accesses the nexus.
I'm still encountering issues end to end - now on the client side - still trying to figure out what's going on there.
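The general hazard behind that segfault (deferred work touching an object whose reference count has already reached zero) can be sketched as follows. This is only an illustration of one common guard, where the queued work holds its own reference until the callback has run. It is not the proposed fix itself, and all names (`struct tracked_obj`, `obj_schedule_close`, `run_pending_work`, and so on) are hypothetical rather than accelio code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a refcounted object such as a nexus. */
struct tracked_obj {
    int refs;          /* reference count */
    int destroyed;     /* set by the final put; a real object would be freed */
    int work_pending;  /* a deferred callback is queued */
    int callbacks_run; /* how many deferred callbacks have executed */
};

void obj_get(struct tracked_obj *o)
{
    o->refs++;
}

/* Drop one reference; the last put destroys the object. */
void obj_put(struct tracked_obj *o)
{
    if (--o->refs == 0)
        o->destroyed = 1;
}

/* Queue deferred work. The queue takes its own reference, so the
 * count cannot reach zero while the callback is still outstanding. */
void obj_schedule_close(struct tracked_obj *o)
{
    if (o->work_pending)
        return; /* already queued; do not take a second reference */
    obj_get(o);
    o->work_pending = 1;
}

/* Deferred callback: touches the object, then drops the queue's reference. */
void obj_close_cb(struct tracked_obj *o)
{
    o->callbacks_run++;
    obj_put(o);
}

/* Stand-in for the event loop firing the timer. */
void run_pending_work(struct tracked_obj *o)
{
    if (o->work_pending) {
        o->work_pending = 0;
        obj_close_cb(o);
    }
}
```

With this ordering, the owner can drop its reference at any time after scheduling the close, and the object is destroyed only when the deferred callback drops the last one.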
from accelio.
Hey Barak!
We were successful in recreating and fixing the bug above (in xio_observable_notify_all_observers).
Please pull the latest for_next branch (the latest version of accelio).
Your problem should be solved.
Have a great day :)
Alon
from accelio.
Hi Alon,
We've already fixed all reconnect related issues back in Sep.
I'm closing this ticket.
Thanks,
Barak
from accelio.