
Comments (26)

welcome avatar welcome commented on August 26, 2024

Thanks for opening this issue. A contributor should be by to give feedback soon. In the meantime, please check out the contributing guidelines and explore other ways you can get involved.

from cortx-s3server.

cortx-admin avatar cortx-admin commented on August 26, 2024

For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/EOS-28345. Note that community members will not be able to access that Jira server but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.


gregnsk avatar gregnsk commented on August 26, 2024

@h3czxp - thank you for reporting it. Could you share more details about the test scenario? How many objects were created, what are the parameters of your warp command?
Are you running the latest CORTX version in Kubernetes?

Could you supply s3auth server logs (/var/log/seagate/auth/server) when the issue happens?


h3czxp avatar h3czxp commented on August 26, 2024

@gregnsk - The cluster has three nodes. Each node has 96 cores and a 25 Gb/s network. The warp parameters are "./warp get --duration=5m --warp-client=172.16.27.51:6000,172.16.27.52:6000,172.16.27.53:6000 --host=172.16.27.51:80,172.16.27.52:80,172.16.27.53:80 --access-key=AKIAbYLPub8LSBWbZr8mtSDgZg --secret-key=z9QLkRjvg6uKyOjiaJvF8HJutUKvHyZQVtLuibTS --obj.size=1024KiB --objects=10000 --concurrent=100".

We are using the newest RPM packages provided by your company to deploy our cluster, not Kubernetes.

The s3auth server and s3server logs are below.

s3server.node51.invalid-user.log.ERROR.20220126-073648.87646.txt
app.log
app-2022-01-26-00.log.gz
app-2022-01-26-01.log.gz
app-2022-01-26-02.log.gz
app-2022-01-26-03.log.gz
app-2022-01-26-04.log.gz
app-2022-01-26-05.log.gz
app-2022-01-26-06.log.gz
app-2022-01-26-07.log.gz
app-2022-01-26-09.log.gz
app-2022-01-26-10.log.gz
app-2022-01-26-20.log.gz

s3server_INFO_WARNING_log.zip


t7ko-seagate avatar t7ko-seagate commented on August 26, 2024

1. When testing the cortx cluster for about an hour using the warp tool, the s3authserver is not working normally.

What are the symptoms? How exactly is it failing?


t7ko-seagate avatar t7ko-seagate commented on August 26, 2024

Initial logs analysis.

app.log files are clean. No failures or exceptions, all responses are 200 OK (auth successful).

ERROR log reports two kinds of errors:

  • error 404
  • motr error -110

E0126 07:36:48.065618 87646 s3_get_object_action.cc:1082] [send_response_to_s3_client] [ReqID: d104772c-2103-4290-b337-4986aa55a8b6] S3 Get request failed. HTTP status code = 404

E0126 20:21:42.594319 87968 s3_motr_rw_common.cc:174] [s3_motr_op_common] [ReqID: 4483f40f-11fd-4886-8404-8cfb48df75ec] Error code = -110 op_code = 11

Error -110 is causing failures in some write operations and some delete-object operations.

Error 110 is "connection timed out" -- since it comes from libmotr, I assume the reason is a connection loss between s3 and Motr (probably due to m0d going down?).

These errors are not present in the detailed INFO logs. The reason is log rotation: the ERROR log ends on Jan 26th, while the INFO log was captured on Jan 27th.

INFO logs include:

  • a lot of successfully completed operations
  • 7 failures due to API user supplying wrong signature/keys (error 403)
  • 2 internal failures (error 500) -- both happen at the same time (within half a second), due to inability to connect to the Auth server. Most probably caused by an auth server restart.

With these logs it is not possible to debug further.

Hi @h3czxp --

To help us troubleshoot, can you please:

  • Check /var/crash for any recent core files.
  • Clarify how exactly you observe that the auth server is failing.
  • Reproduce the 2nd failure (Gateway Timeout), and capture app.log and all s3server logs immediately -- so we have full detailed info (before logs are rotated).


t7ko-seagate avatar t7ko-seagate commented on August 26, 2024

One more note on error 404.

We observed this error in a first warp invocation after cluster restart.

As a work-around, while we're looking into the issue, you can use this: after cluster restart, trigger some quick warp workload (small objects, small number, short time period), wait until it completes and ignore the result. Then start your main workload.


cortx-admin avatar cortx-admin commented on August 26, 2024

Ivan Tishchenko commented in Jira Server:

GitHub issue has been re-opened. h3c still needs help.


cortx-admin avatar cortx-admin commented on August 26, 2024

Ivan Tishchenko commented in Jira Server:

Initial logs analysis.

The last app.log is clean. I have not yet checked the other app.log archives.

ERROR log reports two kinds of errors:

  • error 404
  • motr error -110

UPD -- posted detailed analysis in GitHub. Comments from JIRA do not seem to be mirrored back to GitHub.


h3czxp avatar h3czxp commented on August 26, 2024

@t7ko-seagate Thanks for your reply!
1. There are core files for m0d in /var/crash, but the files are too large to upload. The stack trace is below.

(gdb) bt
#0 0x00007ff9cfc58387 in raise () from /lib64/libc.so.6
#1 0x00007ff9cfc59a78 in abort () from /lib64/libc.so.6
#2 0x00007ff9d1c2c42d in m0_arch_panic (c=c@entry=0x7ff9d2037d20 <__pctx.8265>, ap=ap@entry=0x7ff702ffc298) at lib/user_space/uassert.c:131
#3 0x00007ff9d1c1b7c4 in m0_panic (ctx=ctx@entry=0x7ff9d2037d20 <__pctx.8265>) at lib/assert.c:52
#4 0x00007ff9d1b51090 in mem_alloc (zonemask=2, size=<optimized out>, tx=0x7ff4884dcdb0, btree=0x400000139f90) at be/btree.c:127
#5 btree_save (tree=tree@entry=0x400000139f90, tx=tx@entry=0x7ff4884dcdb0, op=op@entry=0x7ff4884dd360, key=key@entry=0x7ff4884dd8e0, val=val@entry=0x0, anchor=anchor@entry=0x7ff4884dd578, optype=optype@entry=BTREE_SAVE_OVERWRITE, zonemask=zonemask@entry=2) at be/btree.c:1453
#6 0x00007ff9d1b529d5 in m0_be_btree_save_inplace (tree=0x400000139f90, tx=0x7ff4884dcdb0, op=0x7ff4884dd360, key=key@entry=0x7ff4884dd8e0, anchor=0x7ff4884dd578, overwrite=<optimized out>, zonemask=2) at be/btree.c:2148
#7 0x00007ff9d1b8107a in ctg_op_exec (ctg_op=ctg_op@entry=0x7ff4884dd350, next_phase=48) at cas/ctg_store.c:1021
#8 0x00007ff9d1b81536 in ctg_exec (ctg_op=ctg_op@entry=0x7ff4884dd350, ctg=ctg@entry=0x400000139f70, key=key@entry=0x7ff702ffca20, next_phase=next_phase@entry=48) at cas/ctg_store.c:1231
#9 0x00007ff9d1b81c56 in m0_ctg_insert (ctg_op=ctg_op@entry=0x7ff4884dd350, ctg=ctg@entry=0x400000139f70, key=key@entry=0x7ff702ffca20, val=val@entry=0x7ff702ffca70, next_phase=next_phase@entry=48) at cas/ctg_store.c:1257
#10 0x00007ff9d1b7afd9 in cas_exec (next=48, rec_pos=0, ctg=0x400000139f70, ct=CT_BTREE, opc=CO_PUT, fom=0x7ff4884dcce0) at cas/service.c:2184
#11 cas_fom_tick (fom0=0x7ff4884dcce0) at cas/service.c:1473
#12 0x00007ff9d1bed0bb in fom_exec (fom=0x7ff4884dcce0) at fop/fom.c:791
#13 loc_handler_thread (th=0x1c83970) at fop/fom.c:931
#14 0x00007ff9d1c21f8e in m0_thread_trampoline (arg=arg@entry=0x1c83978) at lib/thread.c:117
#15 0x00007ff9d1c2d18d in uthread_trampoline (arg=0x1c83978) at lib/user_space/uthread.c:98
#16 0x00007ff9d139aea5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007ff9cfd2096d in clone () from /lib64/libc.so.6

static inline void *mem_alloc(const struct m0_be_btree *btree,
                              struct m0_be_tx *tx, m0_bcount_t size,
                              uint64_t zonemask)
{
        void *p;

        M0_BE_OP_SYNC(op,
                      m0_be_alloc_aligned(tree_allocator(btree),
                                          tx, &op, &p, size,
                                          BTREE_ALLOC_SHIFT,
                                          zonemask));
        M0_ASSERT(p != NULL);
        return p;
}
2. There are warning logs "Unable to connect to Auth server" and "Socket error: Connection reset by peer, errno: 104, set errtype: Reading Error" when the auth server is failing, in the warning file uploaded earlier.


madhavemuri avatar madhavemuri commented on August 26, 2024

@h3czxp : Can you update the following parameter to a higher value (in multiples of 4K), equal to the size of the metadata disk used?

/etc/sysconfig/motr

# Backend segment size (in bytes) for IO service. Default is 4 GiB.
#MOTR_M0D_IOS_BESEG_SIZE=4294967296

Make sure that mkfs is run again after this setting is applied on all the nodes.

cc: @huanghua78


h3czxp avatar h3czxp commented on August 26, 2024

@t7ko-seagate
I reproduced the 2nd failure (Gateway Timeout) using "./warp get --duration=2m --warp-client=172.16.27.51:6000,172.16.27.52:6000,172.16.27.53:6000 --host=172.16.27.51:80,172.16.27.52:80,172.16.27.53:80 --access-key=AKIAbYLPub8LSBWbZr8mtSDgZg --secret-key=z9QLkRjvg6uKyOjiaJvF8HJutUKvHyZQVtLuibTS --obj.size=64KiB --objects=1000000 --concurrent=100". The s3server logs are too large to upload.

app_log_node51.zip
app_log_node52.zip
app_log_node53.zip


h3czxp avatar h3czxp commented on August 26, 2024

@madhavemuri : Thank you very much for your advice. The parameter "MOTR_M0D_IOS_BESEG_SIZE=4294967296" in /etc/sysconfig/motr is disabled by default. We enabled the parameter and redeployed our cluster. We ran the test case for more than three hours and the s3 auth service did not fail. We are now testing further.


h3czxp avatar h3czxp commented on August 26, 2024

@madhavemuri : When we update the parameter "MOTR_M0D_IOS_BESEG_SIZE" to a value (200G) equal to the size of the metadata disk, errors occur while deploying the cluster.

[root@node51 kyk]# hctl bootstrap --mkfs CDF_2+1.yaml
2022-01-28 00:44:49: Generating cluster configuration... OK
2022-01-28 00:44:55: Starting Consul server on this node......... OK
2022-01-28 00:45:02: Importing configuration into the KV store... OK
2022-01-28 00:45:03: Starting Consul on other nodes...Consul ready on all nodes
2022-01-28 00:45:04: Updating Consul configuraton from the KV store... OK
2022-01-28 00:45:10: Waiting for the RC Leader to get elected...../opt/seagate/cortx/hare/bin/../libexec/hare-bootstrap: line 447: ((: == 1 : syntax error: operand expected (error token is "== 1 ")
OK
2022-01-28 00:45:13: Starting Motr (phase1, mkfs)... OK
2022-01-28 00:45:21: Starting Motr (phase1, m0d)... OK
2022-01-28 00:45:23: Starting Motr (phase2, mkfs)...Job for motr-mkfs@0x7200000000000001:0x165.service failed because the control process exited with error code. See "systemctl status motr-mkfs@0x7200000000000001:0x165.service" and "journalctl -xe" for details.
Error at node52 with command PATH=/opt/seagate/cortx/hare/bin/../bin:/opt/seagate/cortx/hare/bin/../libexec:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/puppetlabs/bin:/root/bin /opt/seagate/cortx/hare/libexec/bootstrap-node --mkfs-only --phase phase2 --xprt libfab
Job for motr-mkfs@0x7200000000000001:0x22e.service failed because the control process exited with error code. See "systemctl status motr-mkfs@0x7200000000000001:0x22e.service" and "journalctl -xe" for details.
Error at node53 with command PATH=/opt/seagate/cortx/hare/bin/../bin:/opt/seagate/cortx/hare/bin/../libexec:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/puppetlabs/bin:/root/bin /opt/seagate/cortx/hare/libexec/bootstrap-node --mkfs-only --phase phase2 --xprt libfab
Job for motr-mkfs@0x7200000000000001:0x9c.service failed because the control process exited with error code. See "systemctl status motr-mkfs@0x7200000000000001:0x9c.service" and "journalctl -xe" for details.

Reading symbols from /usr/sbin/m0mkfs...Reading symbols from /usr/lib/debug/usr/sbin/m0mkfs.debug...done.
done.
Missing separate debuginfo for
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/d2/b0d6d232d66c778f97273965ae044c718e7d2b
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/m0mkfs -e libfab:inet:tcp:172.16.27.52@3011 -A linuxstob:addb-stobs -'.
Program terminated with signal 6, Aborted.
#0 0x00007f9fa4ebd387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-317.el7.x86_64 isa-l-2.30.0-1.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libfabric-1.11.2-1.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libuuid-2.23.2-65.el7_9.1.x86_64 libyaml-0.1.4-11.el7_0.x86_64 openssl-libs-1.0.2k-24.el7_9.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0 0x00007f9fa4ebd387 in raise () from /lib64/libc.so.6
#1 0x00007f9fa4ebea78 in abort () from /lib64/libc.so.6
#2 0x00007f9fa6e9142d in m0_arch_panic (c=c@entry=0x7f9fa72a2760 <__pctx.10349>, ap=ap@entry=0x7f9c197f9648) at lib/user_space/uassert.c:131
#3 0x00007f9fa6e807c4 in m0_panic (ctx=ctx@entry=0x7f9fa72a2760 <__pctx.10349>) at lib/assert.c:52
#4 0x00007f9fa6dc22fb in be_io_cb (link=0x4fe8c50) at be/io.c:570
#5 0x00007f9fa6e81d47 in clink_signal (clink=clink@entry=0x4fe8c50) at lib/chan.c:135
#6 0x00007f9fa6e81d9a in chan_signal_nr (chan=chan@entry=0x4fe8b48, nr=0) at lib/chan.c:154
#7 0x00007f9fa6e81e1d in m0_chan_broadcast (chan=chan@entry=0x4fe8b48) at lib/chan.c:174
#8 0x00007f9fa6e81e39 in m0_chan_broadcast_lock (chan=chan@entry=0x4fe8b48) at lib/chan.c:181
#9 0x00007f9fa6f567e3 in ioq_complete (res2=<optimized out>, res=<optimized out>, qev=<optimized out>, ioq=0x3f74a10) at stob/ioq.c:587
#10 stob_ioq_thread (ioq=0x3f74a10) at stob/ioq.c:669
#11 0x00007f9fa6e86f8e in m0_thread_trampoline (arg=arg@entry=0x3f74d10) at lib/thread.c:117
#12 0x00007f9fa6e9218d in uthread_trampoline (arg=0x3f74d10) at lib/user_space/uthread.c:98
#13 0x00007f9fa65ffea5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f9fa4f8596d in clone () from /lib64/libc.so.6



h3czxp avatar h3czxp commented on August 26, 2024

@madhavemuri We executed the test case "./warp get --duration=2m --warp-client=172.16.27.51:6000,172.16.27.52:6000,172.16.27.53:6000 --host=172.16.27.51:80,172.16.27.52:80,172.16.27.53:80 --access-key=AKIAbYLPub8LSBWbZr8mtSDgZg --secret-key=z9QLkRjvg6uKyOjiaJvF8HJutUKvHyZQVtLuibTS --obj.size=64KiB --objects=1000000 --concurrent=100" again. Unluckily, the gateway time-out error occurs again.

Reading symbols from /usr/bin/m0d...Reading symbols from /usr/lib/debug/usr/bin/m0d.debug...done.
done.
Missing separate debuginfo for
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/d2/b0d6d232d66c778f97273965ae044c718e7d2b
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/m0d -e libfab:inet:tcp:172.16.27.52@3009 -A linuxstob:addb-stobs -f <0'.
Program terminated with signal 6, Aborted.
#0 0x00007f9c0e922387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-317.el7.x86_64 isa-l-2.30.0-1.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 libfabric-1.11.2-1.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libuuid-2.23.2-65.el7_9.1.x86_64 libyaml-0.1.4-11.el7_0.x86_64 openssl-libs-1.0.2k-24.el7_9.x86_64 systemd-libs-219-78.el7_9.5.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0 0x00007f9c0e922387 in raise () from /lib64/libc.so.6
#1 0x00007f9c0e923a78 in abort () from /lib64/libc.so.6
#2 0x00007f9c108f642d in m0_arch_panic (c=c@entry=0x7f9c10e21f00 <__pctx.7416>, ap=ap@entry=0x7f9bd97a14a8) at lib/user_space/uassert.c:131
#3 0x00007f9c108e57c4 in m0_panic (ctx=ctx@entry=0x7f9c10e21f00 <__pctx.7416>) at lib/assert.c:52
#4 0x00007f9c1098c05e in m0_sm_ast_post (grp=<optimized out>, ast=ast@entry=0x7f9c10ed1938 <fdmi_global_src_dock+1784>) at sm/sm.c:138
#5 0x00007f9c108afd2f in m0_fdmi__src_dock_fom_wakeup (sd_fom=sd_fom@entry=0x7f9c10ed12d0 <fdmi_global_src_dock+144>) at fdmi/source_dock_fom.c:344
#6 0x00007f9c108acfd8 in m0_fdmi__enqueue_locked (src_rec=src_rec@entry=0x7f96ccb685f0) at fdmi/source_dock.c:247
#7 0x00007f9c108ad03f in m0_fdmi__enqueue (src_rec=src_rec@entry=0x7f96ccb685f0) at fdmi/source_dock.c:257
#8 0x00007f9c108ad6d7 in m0_fdmi__record_post (src_rec=0x7f96ccb685f0) at fdmi/source_dock.c:280
#9 0x00007f9c108a9a82 in m0_fol_fdmi_post_record (fom=fom@entry=0x7f96ccb681a0) at fdmi/fol_fdmi_src.c:688
#10 0x00007f9c108b7d0a in m0_fom_fdmi_record_post (fom=fom@entry=0x7f96ccb681a0) at fop/fom.c:1745
#11 0x00007f9c108b8566 in m0_fom_tx_logged_wait (fom=0x7f96ccb681a0) at fop/fom_generic.c:472
#12 0x00007f9c108b8bb2 in m0_fom_tick_generic (fom=fom@entry=0x7f96ccb681a0) at fop/fom_generic.c:860
#13 0x00007f9c10845193 in cas_fom_tick (fom0=0x7f96ccb681a0) at cas/service.c:1218
#14 0x00007f9c108b70bb in fom_exec (fom=0x7f96ccb681a0) at fop/fom.c:791
#15 loc_handler_thread (th=0x1793030) at fop/fom.c:931
#16 0x00007f9c108ebf8e in m0_thread_trampoline (arg=arg@entry=0x1793038) at lib/thread.c:117
#17 0x00007f9c108f718d in uthread_trampoline (arg=0x1793038) at lib/user_space/uthread.c:98
#18 0x00007f9c10064ea5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f9c0e9ea96d in clone () from /lib64/libc.so.6


h3czxp avatar h3czxp commented on August 26, 2024

@t7ko-seagate We are executing our test case as you suggested, but the gateway time-out error occurs when we increase the object count.


stale avatar stale commented on August 26, 2024

This issue/pull request has been marked as needs attention as it has been left pending without new activity for 4 days. Tagging @nileshgovande @bkirunge7 @knrajnambiar76 @t7ko-seagate for appropriate assignment. Sorry for the delay & Thank you for contributing to CORTX. We will get back to you as soon as possible.


madhavemuri avatar madhavemuri commented on August 26, 2024

@madhavemuri We executed the test case "./warp get --duration=2m --warp-client=172.16.27.51:6000,172.16.27.52:6000,172.16.27.53:6000 --host=172.16.27.51:80,172.16.27.52:80,172.16.27.53:80 --access-key=AKIAbYLPub8LSBWbZr8mtSDgZg --secret-key=z9QLkRjvg6uKyOjiaJvF8HJutUKvHyZQVtLuibTS --obj.size=64KiB --objects=1000000 --concurrent=100" again. Unluckily, the gateway time-out error occurs again.

@h3czxp : We have also observed a similar issue occurring intermittently, which seems to be a regression. We are actively working on it and will post updates about the issue here.


huanghua78 avatar huanghua78 commented on August 26, 2024

This will be fixed in https://jts.seagate.com/browse/EOS-27717.
(This is not accessible for external users).


h3czxp avatar h3czxp commented on August 26, 2024

@madhavemuri Hi, when the gateway time-out occurs, the log (INFO) output contains the following error messages. What is the reason for the error?

I0208 21:06:20.112838 19696 s3server.cc:134] [on_client_request_error] [ReqID: -] S3 Client disconnected: Reading Error

I0208 21:06:20.112844 19696 request_object.h:293] [client_has_disconnected] [ReqID: 84a6a5120b98] S3 Client disconnected.

in file third_party/libevent/include/event2/bufferevent.h

#define BEV_EVENT_READING   0x01 /**< error encountered while reading */
#define BEV_EVENT_WRITING   0x02 /**< error encountered while writing */
#define BEV_EVENT_EOF       0x10 /**< eof file reached */
#define BEV_EVENT_ERROR     0x20 /**< unrecoverable error encountered */
#define BEV_EVENT_TIMEOUT   0x40 /**< user-specified timeout reached */
#define BEV_EVENT_CONNECTED 0x80 /**< connect operation finished. */


t7ko-seagate avatar t7ko-seagate commented on August 26, 2024

The exact meaning of this error is that cortx-s3server sees the API request socket being closed. At this level, this is the socket between the haproxy and s3server processes. In turn, this socket may be closed for the following reasons:

  • 1 - too short a timeout configured on the client
  • 2 - some internal operation takes too long, and the client times out
  • 3 - haproxy misconfiguration (too-short timeouts; or -- when using additional haproxy instances to distribute the load across CORTX nodes, on top of the haproxy that already runs on each node)

Most probably you ran into case no. 2, as I think you mentioned you see this error when you increase the load.


h3czxp avatar h3czxp commented on August 26, 2024

@t7ko-seagate
How much load do you use in your test cases? What object size and how many objects do you use?


gregnsk avatar gregnsk commented on August 26, 2024

@h3czxp - we're planning to release a new development version in ~2 weeks. In that release we should address the Motr issue and replace S3server with RGW. @madhavemuri, @osowski, @huanghua78 - FYI
We'll let you know when this version is ready for installation in your dev environment


h3czxp avatar h3czxp commented on August 26, 2024

@gregnsk - Wow, that sounds great! We are looking forward to the new version.


stale avatar stale commented on August 26, 2024

This issue/pull request has been marked as needs attention as it has been left pending without new activity for 4 days. Tagging @nileshgovande @bkirunge7 @knrajnambiar76 @t7ko-seagate for appropriate assignment. Sorry for the delay & Thank you for contributing to CORTX. We will get back to you as soon as possible.

