
Comments (66)

trapexit avatar trapexit commented on June 12, 2024 2

I've gotten possible confirmation that this is a kernel bug. The author of FUSE has said:

This is really weird. It's a kernel bug, no arguments, because kernel
should never send a forget against the root inode. But that
lookup(nodeid=1, name=..); already looks bogus.

I.e., what is happening is that the kernel is asking mergerfs for details on the parent of the root node, which doesn't make much sense as the root node is... the root :) Currently I detect that and return ENOENT; before, the code didn't check at all because, as far as I understood, the kernel should never ask that, and it led to a crash. But with that change it results in the EIO errors you are all seeing.

There is some debate if this is truly "bogus". We'll see.

If it is a kernel issue I'm not sure I'll be able to provide any workarounds. We'll have to see what the kernel devs find. If I can work around it I'll do so. If not... might just need to use something besides NFS or upgrade your kernel once available.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024 2

While there is a bug in the kernel, it wasn't exclusively the issue. It looks like the 5.14 kernel added a check for certain conditions that can happen outside of NFS use but are far, far more likely with NFS. While mergerfs' code had been the same for years in this regard, the fact that people are moving to kernels past 5.14 is leading to problems. I'll have a release out shortly with a fix. It seems to work regardless of the kernel issue.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024 1

https://github.com/trapexit/mergerfs/releases/tag/2.40.1

from mergerfs.

shocker2 avatar shocker2 commented on June 12, 2024 1

Just some feedback after 3 days of usage. During this time I had two transfers ongoing with a total write rate to mergerfs of ±300 MB/s, and it's working great. No stale handles, no crashes, no performance degradation. Thank you @trapexit for this awesome fix!

from mergerfs.

Janbong avatar Janbong commented on June 12, 2024 1

Same here. Problem is completely gone after updating and configuring the new setting. Thanks @trapexit for the quick handling of this issue.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

When was the strace of mergerfs taken? While in that broken state? If so... mergerfs looks fine.

Preferably you would start tracing just before issuing a request to the mergerfs mount, and also trace the app you use to generate that error. mergerfs is a proxy: the kernel handles requests between the client app and mergerfs, and there are lots of situations where the kernel short-circuits the communication and therefore never sends anything to mergerfs. Which I suspect is happening here (for whatever reason). mergerfs would only return EIO if the underlying filesystem did, and the kernel could return it for numerous reasons. And NFS and FUSE don't always play nice together.

Are you modifying the mergerfs pool on the host? Not through NFS?

from mergerfs.

gogo199432 avatar gogo199432 commented on June 12, 2024

I started this trace right before I knew the backup would start, in the hope that I would catch the issue. After checking it a minute or so later I saw that the mount had disappeared, and stopped the trace.

Not quite sure what you mean by modifying the pool on the host, but I didn't touch the system while the backup was running. In Proxmox you can add the backup target as either a Directory or an NFS mount; I tried both and it didn't seem to make a difference. During the trace it was set up as a Directory.

I'm pretty sure that I have tried killing NFS completely and even then the backup would randomly fail, but this was a while ago; I'm not sure anymore.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

NFS does not like out-of-band changes, particularly NFSv4. If you have an NFS export and then modify the filesystem directly on the host you export from... you can and will cause problems. It usually leads to stale errors but could depend on the situation.

Tracing as you did is fine, but since you aren't providing a matching trace of anything trying to interact with the filesystem on the host (or through NFS), I can't pinpoint who is responsible for the error, which is critical. Even tracing mergerfs after the failure starts, together with a trace of "ls" or "stat" accessing /Storage, would answer that question.

from mergerfs.

gogo199432 avatar gogo199432 commented on June 12, 2024

So it finally crashed again. I managed to get straces with both 'ls' and 'stat'; hopefully this helps.

ls.strace.txt
mergerfs-ls.trace.txt

mergerfs-stat.trace.txt
statfolder.strace.txt


On a related note, before I got my 2 big HDDs I was running the 4TB disks on ZFS in a mirror and used the "sharenfs" toggle of ZFS to expose the folders I needed. The caveat there is that at least the top-most folders were all separate volumes or datasets (whatever ZFS calls them), so they were technically separate filesystems. I wonder if it wasn't breaking because of that? That's the only thing I can think of, unless you can find something in the logs above.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

Hmm... Well, the "stat" clearly worked, though you didn't stat the file, you statfs'ed it. But the statfs clearly worked and you can see the request in mergerfs. The ls, however, failed with EIO when it tried to stat /Storage, and I don't see any evidence of mergerfs receiving the request. However, that trace shows a file being read: "Totally Spies". So something is able to interact with it.

Have you tried disabling all caching? If the kernel caches things it becomes more difficult to debug.

from mergerfs.

gogo199432 avatar gogo199432 commented on June 12, 2024

It also fails with cache.files=off, but I can set it to that and do another 'ls' trace if that helps. Is there any other caching that I'm not aware of that I can disable?

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

See the caching section of the docs. Turn off entry and attr too. And yes, could help to trace that setup.
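Concretely, the debugging configuration being suggested here amounts to something like the following mount options (option names as they appear later in this thread):

```
-o cache.files=off,cache.entry=0,cache.attr=0
```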

from mergerfs.

gogo199432 avatar gogo199432 commented on June 12, 2024

Got a new trace with the following mergerfs settings:
-o cache.files=off,cache.entry=0,cache.attr=0,moveonenospc=true,category.create=mfs,dropcacheonclose=true,posix_acl=true,noforget,inodecalc=path-hash,fsname=mergerfs

ls.strace.txt
mergerfs.trace.txt
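One of the options in the line above, inodecalc=path-hash, matters for NFS because exports want stable inode numbers: the inode is derived from the file's path rather than from the underlying branch's st_ino. A minimal sketch of the idea (not mergerfs's actual algorithm; a generic hash just to show the concept):

```shell
# Hypothetical illustration of a path-hash style inode: hash the path and
# truncate so the same path always yields the same inode across remounts.
path_inode() {
  # first 15 hex digits of sha1 -> fits in a positive 64-bit inode value
  printf '%s' "$1" | sha1sum | cut -c1-15
}
path_inode "/Storage/media/file.mkv"
```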

from mergerfs.

gogo199432 avatar gogo199432 commented on June 12, 2024

Is there anything more I can provide to help deduce the issue? It is still occurring, sadly. The frequency seems to depend on how many applications are trying to use it. Also, the issue occurred while using Jellyfin, so it also happens after a strictly read-only operation.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

If I knew I'd ask for it.

The kernel isn't forwarding the request.

44419 15:35:19.951799 statx(AT_FDCWD, "/Storage", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, STATX_MODE|STATX_NLINK|STATX_UID|STATX_GID|STATX_MTIME|STATX_SIZE, 0x7ffcf234e990) = -1 EIO (Input/output error) <0.000006>

vs

33344 15:35:19.359147 writev(4, [{iov_base="`\0\0\0\0\0\0\0\264K\373\0\0\0\0\0", iov_len=16}, {iov_base="0\242R\266\2\0\0\0\252\372\304~\2\0\0\0\26\205\327[\2\0\0\0\0\300}A\0\0\0\0\17\312{A\0\0\0\0\0\20\0\0\377\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=80}], 2) = 96 <0.000007>
33344 15:35:19.359172 read(4,  <unfinished ...>
33343 15:35:20.129934 <... read resumed>"8\0\0\0\3\0\0\0\266K\373\0\0\0\0\0\t\0\0\0\0\0\0\0\350\3\0\0\350\3\0\0b\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 1052672) = 56 <1.623557>

The fact that read succeeds means some messages are coming in from the kernel. In the very least statfs requesting info on the mount. But there is no log indicating why the kernel behaves that way.
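One quick way to triage trace pairs like the above is to pull out the client syscalls that failed with EIO, then check whether the mergerfs trace shows a matching request around the same timestamp. A minimal sketch (the sample line is adapted and abbreviated from the ls trace quoted above):

```shell
# Write a small sample trace, then count the syscalls that returned EIO.
cat > /tmp/ls.strace.sample <<'EOF'
44419 15:35:19.951799 statx(AT_FDCWD, "/Storage", ...) = -1 EIO (Input/output error) <0.000006>
44419 15:35:19.951900 close(3) = 0 <0.000003>
EOF
grep -c ' = -1 EIO ' /tmp/ls.strace.sample   # -> 1
```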

You have a rather complex setup. All I can suggest is making it less so. It's not feasible for me to recreate what you have. Create a second pool with 1 branch. Keep everything simple. See if you can break it via standard tooling. There are many variables here. We can't keep debugging against the full stack if nothing obvious presents itself.

from mergerfs.

Janbong avatar Janbong commented on June 12, 2024

Hi,

I'm experiencing a similar issue on the same platform as OP (as far as I can tell from his shared outputs): Proxmox VE.
PVE version: pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.11-7-pve)

I used to have a VM under Promox running MergerFS (with an HBA with PCIe passthrough to the VM).

Last week I changed this so my Proxmox host runs MergerFS with the disks attached directly to the motherboard instead of using the HBA. (Server is way less power hungry that way.)

After a day or 2 I noticed that my services, which rely on the MergerFS mount, stopped working.
ls of the mount resulted in an Input/Output error. Killing MergerFS and running mount -a again (I'm using a MergerFS line in my /etc/fstab) is the only solution.

Today the problem occurred again. I was using MergerFS version 2.33.5 from the Debian repos that are configured in Proxmox.
I just updated to version 2.39.0 (latest release) but don't expect this to be the solution.

In my previous setup (with the VM), the last version I was using was `2.21.0`.

Any steps I can take to help us troubleshoot the problem? I love MergerFS, have been using it for years (in the VM setup I explained above). I'd like to keep using it but losing the mount every 2 days is not an option of course.

dmesg reports no issues with the relevant disks.

Thanks in advance for the help!

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

Any steps I can take to help us troubleshoot the problem?

The same as I describe in this thread and the docs.

from mergerfs.

gogo199432 avatar gogo199432 commented on June 12, 2024

Quick update from my side. I tried to reduce the setup to the best of my abilities. First of all, I made a second mount only for Proxmox backups. This mount never had any issues, so we can cross that out. Second, I modified my existing MergerFS mount so it only contained a single disk. In addition, I removed all services except for Jellyfin.

So setup was:

Disk 1 MergerFS mount -> Shared over NFS -> Jellyfin VM, mounted folders using fstab

This seemed the most stable, but it still failed. When I told Jellyfin to scan my library from zero it managed to get all the way to 96% or something, and suddenly the mount vanished. This seems like the most "reliable" way to repro; I managed to trigger it twice in a row. Still no luck triggering it using any basic Linux command.

Having said all this, after 3-4 months of this issue persisting and with no light at the end of the tunnel, I decided to throw money at the problem: I bought another disk and moved back over to ZFS. I would love to use MergerFS in the setup I described in the original post, but having my main storage disappear from my production systems is not a state to be in long-term.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

@Janbong And are you using NFS too?

@gogo199432 Are your mergerfs and NFS settings the same as the original post? Have you tried modifying NFS settings? Do you have other FUSE filesystems mounted? Are you doing any modifying of the export outside NFS?

from mergerfs.

Janbong avatar Janbong commented on June 12, 2024

I'm also relying on NFS to let services running on VMs access the MergerFS mount. But that was also the case when MergerFS was running on one of the VMs, which worked without any issues for years.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

Yes, but was it the same kernel? Same NFS config? Same NFS version?

from mergerfs.

Janbong avatar Janbong commented on June 12, 2024
  • Different kernel. Now using pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.11-7-pve). Ubuntu 18.04 VM which worked for years was running kernel 4.15.0-213-generic
  • Exact same NFS config, copied it from the VM to the PVE host when I migrated.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

What were the versions? What are the settings?

from mergerfs.

gogo199432 avatar gogo199432 commented on June 12, 2024

I was using the same settings as previously, except for reducing to a single disk. Apart from what we discussed about caching, I have not been modifying the NFS settings. I tried removing the posix settings entirely, but that made no difference. If you mean another system apart from MergerFS that uses FUSE, then no, I do not have anything else. As far as I can tell I have been doing no modifications outside of NFS; as I said, the only out-of-band thing was the backup, but I moved that to a second MergerFS mount. So at the time of the issue with Jellyfin, there was nothing running on Proxmox that would access the mount.

from mergerfs.

Janbong avatar Janbong commented on June 12, 2024

What were the versions? What are the settings?

Looking for a way to check the version. I think both are enabled looking at output of nfsstat -s:

root@pve1:~# nfsstat -s
Server rpc stats:
calls      badcalls   badfmt     badauth    badclnt
2625399    0          0          0          0

Server nfs v3:
null             getattr          setattr          lookup           access
6         0%     844071   32%     204       0%     797       0%     150711    5%
readlink         read             write            create           mkdir
0         0%     1206289  45%     2833      0%     88        0%     4         0%
symlink          mknod            remove           rmdir            rename
0         0%     0         0%     0         0%     0         0%     0         0%
link             readdir          readdirplus      fsstat           fsinfo
0         0%     0         0%     4324      0%     414536   15%     8         0%
pathconf         commit
4         0%     1503      0%

Server nfs v4:
null             compound
4         6%     54       93%

Server nfs v4 operations:
op0-unused       op1-unused       op2-future       access           close
0         0%     0         0%     0         0%     2         1%     0         0%
commit           create           delegpurge       delegreturn      getattr
0         0%     0         0%     0         0%     0         0%     36       24%
getfh            link             lock             lockt            locku
4         2%     0         0%     0         0%     0         0%     0         0%
lookup           lookup_root      nverify          open             openattr
2         1%     0         0%     0         0%     0         0%     0         0%
open_conf        open_dgrd        putfh            putpubfh         putrootfh
0         0%     0         0%     34       23%     0         0%     8         5%
read             readdir          readlink         remove           rename
0         0%     0         0%     0         0%     0         0%     0         0%
renew            restorefh        savefh           secinfo          setattr
0         0%     0         0%     0         0%     0         0%     0         0%
setcltid         setcltidconf     verify           write            rellockowner
0         0%     0         0%     0         0%     0         0%     0         0%
bc_ctl           bind_conn        exchange_id      create_ses       destroy_ses
0         0%     0         0%     4         2%     2         1%     2         1%
free_stateid     getdirdeleg      getdevinfo       getdevlist       layoutcommit
0         0%     0         0%     0         0%     0         0%     0         0%
layoutget        layoutreturn     secinfononam     sequence         set_ssv
0         0%     0         0%     4         2%     44       30%     0         0%
test_stateid     want_deleg       destroy_clid     reclaim_comp     allocate
0         0%     0         0%     2         1%     2         1%     0         0%
copy             copy_notify      deallocate       ioadvise         layouterror
0         0%     0         0%     0         0%     0         0%     0         0%
layoutstats      offloadcancel    offloadstatus    readplus         seek
0         0%     0         0%     0         0%     0         0%     0         0%
write_same
0         0%

Settings:

root@pve1:~# cat /etc/exports
# /etc/exports: the access control list for filesystems which may be exported
#               to NFS clients.  See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes       hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
# /srv/nfs4        gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
# /srv/nfs4/homes  gss/krb5i(rw,sync,no_subtree_check)
#
/mnt/storage redacted_ip/24(rw,async,no_subtree_check,fsid=0) redacted_ip/8(rw,async,no_subtree_check,fsid=0) redacted_ip/24(rw,async,no_subtree_check,fsid=0)
/mnt/seed redacted_ip/24(rw,async,no_subtree_check,fsid=1) redacted_ip/8(rw,async,no_subtree_check,fsid=1) redacted_ip/24(rw,async,no_subtree_check,fsid=1)

from mergerfs.

Janbong avatar Janbong commented on June 12, 2024

Client side reporting that it's using NFSv3:

21:02:00 in ~ at k8s ➜ cat /proc/mounts | grep nfs
fs1.bongers.lan:/mnt/storage /mnt/storage nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,na
.4.90,mountvers=3,mountport=51761,mountproto=udp,local_lock=none,addr=redacted_ip
fs1.bongers.lan:/mnt/seed /mnt/seed nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=2
mountvers=3,mountport=51761,mountproto=udp,local_lock=none,addr=redacted_ip
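For reference, the NFS protocol version is negotiated at mount time on the client side; a client fstab line pinning a specific version would look something like this (an illustrative sketch only, reusing the hostname, paths, and rsize/wsize from the output above):

```
fs1.bongers.lan:/mnt/storage /mnt/storage nfs rw,vers=4.2,rsize=1048576,wsize=1048576 0 0
```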

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

The pattern I'm seeing is Proxmox + NFS. While I certainly won't rule out a mergerfs bug, the fact that everyone who experiences this seems to have that setup suggests it could be the Proxmox kernel. I'm going to reach out to the FUSE community to see if anyone has any ideas.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

Some notes:

  1. I set up a dedicated machine running the latest Armbian (x86), mounted an SSD formatted with ext4, put a mergerfs mount over it, exported it, and mounted that export from another machine. I used a stress tool I wrote (bbf) to hammer the NFS mount for several hours. No issues.
  2. I will set up Proxmox and attempt the same. If the stress test fails to reproduce the issue I'll try adding some media and installing Plex, I guess. For some reason Proxmox doesn't like the extra machine I have been testing with, so I'll try with a VM, I guess.
  3. I think what is happening is that NFS is getting into a bad state and causing mergerfs (or the kernel side of the relationship) to get into a bad state due to... something... maybe metadata changing under it in a way it doesn't like, for the root of the mount. And then that "sticks", so no new requests can be made because the root lookup is failing. I was under the impression that the kernel does not keep that error state forever, but maybe I'm wrong, or maybe it is triggering something different than normal. Either way, I will probably need to fabricate errors to see if it behaves similarly to what you all are seeing.

All that said: NFS and FUSE just do not play nicely together. There are fundamental issues with how the two interact. Even if those were fixed it would still be complicated to get it working flawlessly on my end. I've gotten some feedback from kernel devs on the topic and will write something up later once I test some things out, but I think after this I'm going to have to recommend people not use NFS, or at least say "you're on your own". I'll offer suggestions on setup to minimize issues, but at the end of the day 100% support is not likely.
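For readers without the author's bbf tool, the kind of stress run described above can be approximated with a few shell workers doing create/stat/unlink metadata churn against the mount. A rough sketch (MNT is an assumption; point it at the NFS client mount of the mergerfs export, it defaults to a scratch directory here):

```shell
# Several parallel workers hammering a mount point with metadata operations.
MNT=${MNT:-/tmp/mergerfs-stress}
mkdir -p "$MNT"
for w in 1 2 3 4; do
  (
    for i in $(seq 1 50); do
      f="$MNT/worker$w.$i"
      echo data > "$f"       # create + write
      stat "$f" > /dev/null  # lookup/getattr
      rm "$f"                # unlink (eventually a FORGET kernel-side)
    done
  ) &
done
wait
echo "stress pass complete"
```

Run it long enough, with enough workers, against the NFS client mount while watching the server for EIO.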

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

Looking over the kernel code I've narrowed down situations where EIO errors can be returned and are "sticky". I think I can add some debug code to detect the situation and then maybe we can figure out what is happening.

from mergerfs.

Janbong avatar Janbong commented on June 12, 2024

I appreciate you looking into this so thoroughly. Thanks!

In the meantime, the problem occurred two more times on my end.

  • Once, the "local" mount itself was working, but the NFS mounts on my VM returned "stale file handle" errors.
  • Just a couple of hours ago, the full error as described above: Input/Output error on the NFS server side as well as on the client VMs mounting the exported share.

I have disks that spin down automatically. I found in the logs that some time before the error occurred, all disks started spinning up, which leads me to believe some kind of scan was initiated: possibly a Plex scan, or a Sonarr/Radarr scan, something like that, which reads a lot of files in a short period of time. I'm not sure how long before the actual error this happened, as I am basing that on the timestamp of when I got a message from someone whose service suddenly stopped working.

To be completely transparent on how I set things up: my (3) data disks are XFS, my SnapRAID parity is ext4. Though the parity disk is not really relevant here, I guess, since it is not part of the MergerFS mount.

I get what you are saying about FUSE + NFS not working together nicely. Still I was running my previous (similar) setup for years without ANY issues at all. My bet is Proxmox doing something MergerFS can't handle actually. It's a little too coincidental that @gogo199432 is also running MergerFS on Proxmox and running into the same issue.

My previous setup was MergerFS on a Ubuntu VM, albeit with a HBA in between as opposed to directly connected drives to the motherboard using SATA in my current setup on Proxmox.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

What is your mergerfs config? And are you changing things out of band? Those two errors are pretty different things.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

I get what you are saying about FUSE + NFS not working together nicely. Still I was running my previous (similar) setup for years without ANY issues at all.

Yes, but mergerfs, the kernel, and the NFS code have all evolved. The kernel code has gotten more strict about certain security concerns in and after 5.14.

If the kernel marks a node bad (literally called fuse_make_bad(inode)) there is nothing I can do. Hence why I'm trying to understand how NFS is triggering the root to be marked as such. Because this is not a unique situation. NFS shouldn't cause an issue any more than normal use. If there is a bug in the kernel that is leading to this I likely can't do much about it till addressed by the kernel devs. Or Proxmox updates their kernel.

from mergerfs.

Janbong avatar Janbong commented on June 12, 2024

my /etc/fstab (MergerFS settings):

/mnt/disk* /mnt/storage fuse.mergerfs direct_io,defaults,allow_other,noforget,use_ino,minfreespace=50G,fsname=mergerfs 0 0

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

@Janbong As mentioned in the docs direct_io, allow_other, and use_ino are deprecated. And for NFS usage you should really be setting inodecalc=path-hash for consistent inodes.
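Applying that advice to the fstab line quoted above might look something like the following (an untested sketch: the deprecated options are dropped, with cache.files=off standing in for direct_io per the docs, and inodecalc=path-hash added; the remaining values are carried over unchanged):

```
/mnt/disk* /mnt/storage fuse.mergerfs cache.files=off,noforget,inodecalc=path-hash,minfreespace=50G,fsname=mergerfs 0 0
```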

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

Would it be possible to try building the nfsdebug branch? Or tell me what version of debian proxmox is based on and I can build packages.

I put in debugging info that will be printed to syslog (journalctl) in cases that the kernel would normally mark things as errored.

from mergerfs.

Go0oSer avatar Go0oSer commented on June 12, 2024

I presently don't have any trace logs to add, but I wanted to inform those here that I am also having the same issue. I am also using Proxmox with NFS. Specifically, I'm running mergerfs in an Ubuntu VM and passing the HBA PCIe device from the hypervisor through to the VM. My mergerfs mount has been EIO'ing more frequently lately. It was fairly stable for a while before sometime in December. I'm not reliably able to trigger the EIO.

I had also opened issue 1004 in the past for a similar issue, albeit at the time I did not report that I was using Proxmox and NFS. In that issue we somewhat determined that RAM ballooning and KVM were likely at fault.

I've gone ahead and disabled RAM ballooning for the time being and will report back. If you would like me to gather any other info, please let me know.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

#1004 was a different issue (AFAICT), though it's still uncertain why it happens. In that case the kernel was asking for something that isn't ever supposed to happen: the parent of the root. I "fixed" that by returning that it doesn't exist. Perhaps that is causing this EIO, but since I can't reproduce the situation I really can't say.

I think the best we can do right now is have you all run a special build that I've instrumented to log when these odd situations occurred.

I failed to install Proxmox 8.1 on a random PC I have lying around (it doesn't like the eMMC), so I'm going to try with an old laptop. Then I can build a deb for you all to try.

Specifically Im using mergerfs in a Ubuntu vm and passing in the HBA PCIE device from the hypervisor to the VM.

That's not what is being done by the others in this thread... unless we aren't clearly communicating. They are running mergerfs on Proxmox.

I think I need to build a tool to query a user's host and config because it is just too difficult to get everyone to explain their setup.

from mergerfs.

Go0oSer avatar Go0oSer commented on June 12, 2024

Apologies. I didn't mean to muddy the waters.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

mergerfs_2.39.0-2-g9123d05~debian-bookworm_amd64.deb.zip

Can you try this (those using Proxmox)? And when you run into the EIO issue give me the output from journalctl -xb -t mergerfs

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

I've been able to reproduce it. Not easily, but given enough clients and enough requests it happens after some time. Interestingly, I'm not seeing anything in the log from the debugging I put into place. I've got it set up with only 1 branch, path-hash inode calculation, and all caching disabled. I need to see if I can make the reproduction even simpler and add some more logging.

from mergerfs.

yeyeoke avatar yeyeoke commented on June 12, 2024

Thank god, someone else is having this issue as well. This has been driving me crazy for months now.

I've got 3 Proxmox Nodes running amongst others 1 Debian node, with drives combined using MergerFS with the following entry in /etc/fstab:

/mnt/disks/storage/* /mnt/storage fuse.mergerfs fsname=hdd-pool,func.getattr=newest,noforget,cache.files=off,dropcacheonclose=true,category.create=mfs,moveonenospc=true,minfreespace=25G 0 0

The Mergerfs pool is then shared to the following machines:

1 VM Running Proxmox Backup Server
1 VM Running a media-stack in docker, jellyfin plex etc.
1 VM Running Qbittorrent, sonarr and radarr etc.

Here is the weird part: the media VM keeps on working when I get notified that the NFS VM is experiencing input/output errors on /mnt/storage. The Qbittorrent VM also experiences these problems, but once again, Sonarr, Plex, and Jellyfin all still have access to the NFS share.

Specs for the PBS-VM:

System:
Distro......: Debian GNU/Linux 12 (bookworm)
Kernel......: Linux 6.5.11-8-pve

Specs for all the other machines:

System:
Distro......: Debian GNU/Linux 12 (bookworm)
Kernel......: Linux 6.1.0-18-amd64

Mergerfs-version: v2.39.0
NFS-version used by the clients: v4.2

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

@yeyeoke

If you are exporting mergerfs through NFS you need to follow the instructions in the docs, otherwise you absolutely will run into issues. Those, however, are different from what is being discussed here.

I'm writing up more detailed and explicit docs on the topic given the sudden increase in reports and possibly a workaround/fix. Might as well wait till I post those. Hopefully later today.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

https://github.com/trapexit/mergerfs/releases/tag/2.40.0

Try this with export-support=false
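For anyone applying this via /etc/fstab, the new option would slot into the options list something like the following (a sketch only; the option name comes from the release linked above, and the other options are carried over from configs earlier in this thread):

```
/mnt/disk* /mnt/storage fuse.mergerfs cache.files=off,inodecalc=path-hash,export-support=false,fsname=mergerfs 0 0
```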

from mergerfs.

Janbong avatar Janbong commented on June 12, 2024

Just got back from a holiday. Happy to see progress made in diagnosing the issue. Will try the workaround tomorrow, and will read the docs on how to set up MergerFS with NFS, because I totally missed that.

Thanks @trapexit. Greatly appreciated!

from mergerfs.

mitzsch avatar mitzsch commented on June 12, 2024

Just want to peek in. I have a somewhat similar issue; at least it shares some similarities with the one here.

On my end, after a heavy filesystem activity task (in my case a snapraid scrub or Plex doing its thing; not very easy to reproduce) the network connection completely freezes. To be precise, the i40e driver "crashes"; the igb driver/connection remains functional.

The server (Dell R730xd with an MD1200 disk shelf, two E5-2640 v4, 32 GB RAM, Ubuntu 22.04.4, kernel 6.5.0-18.18) then doesn't handle any commands over the network. Sometimes it's accessible, and sometimes the connection collapses. An iperf3 test shows one or two successful 10G transfers and then only 0 Gbit transfers until the SFP connector is pulled and re-inserted...
However, the disks and their filesystems on the server remain functional. I can copy stuff from the array to a different drive... If I understand it correctly, this is not the case with this issue? I also don't have an NFS export; however, the nfs utils are installed and loaded into the kernel...

Is it possible that I'm also hitting the same issue and, in my case, it's manifesting as a broken (i40e) network state?

The next time it happens, I will create an strace log... Maybe it's also throwing I/O errors...

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

Is it possible that I´m also hitting the same issue and in my case, its manifesting as a broken (i40e) network state?

Sounds like a different issue entirely. This issue has to do explicitly with NFS and FUSE.

from mergerfs.

mitzsch avatar mitzsch commented on June 12, 2024

It looks like 5.14 kernel added a check for certain conditions that can happen outside of using with NFS but are far far more likely with NFS.

Okay, so my issue might actually be from the same "bug family"... (high I/O causing FUSE problems?)

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

No, you aren't using NFS. You say your network is going out. That is not the issue this thread is about. Why do you believe the network going out is related to NFS or FUSE or mergerfs?

from mergerfs.

shocker2 avatar shocker2 commented on June 12, 2024

Small update regarding the first release with export-support=false: it's not usable, and it's worse than before. Random NFS mounts go stale and do not clear with an unmount/mount; it requires an OS reboot to clean up, and this happens every few hours (I'm not complaining, just sharing feedback). Waiting for the new release. Thank you @trapexit for all the hard work on this!

from mergerfs.

mitzsch avatar mitzsch commented on June 12, 2024

Why do you believe the network going out is related to NFS or FUSE or mergerfs

Hm, it's just a theory of mine. I mean, I'm also running a mergerfs pool/mount, and after high I/O activity through the array (= Plex) and/or directly to the disks (= snapraid) weird behavior occurs. In my case it renders network access unusable (my SMB shares become unresponsive, as well as any other network-related app). Also, there is no dmesg or syslog output when the issue arises.

It seems like the issue described here, with the only difference that NFS is not involved. But as you wrote, the kernel bug can also be triggered outside of NFS, even though that's not as likely.

Anyway, I will keep an eye on my issue; whenever it happens again, I will do an strace, do some further analysis, and report back (here, in a different bug report, or wherever it belongs...).

Sorry for the inconvenience!

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

It seems like the issue described here, with the only difference being that NFS is not involved. But as you wrote, the kernel bug can also be triggered outside of NFS, even if that is less likely.

But it doesn't. This is about mergerfs not becoming unresponsive but returning EIO errors. It isn't that the network dies; rather, NFS and mergerfs locally return EIO errors when accessed. That is not what you've described. And you also talk about snapraid, which is entirely unrelated to mergerfs, network filesystems, or even the network.

from mergerfs.

shocker2 avatar shocker2 commented on June 12, 2024

Thanks, already compiling it.

from mergerfs.

shocker2 avatar shocker2 commented on June 12, 2024

Seems that version is not shown now:

 # mergerfs -V
mergerfs vunknown

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

How are you building it?

from mergerfs.

shocker2 avatar shocker2 commented on June 12, 2024

make uninstall; make; sudo make install

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

I mean... what are you downloading? You shouldn't be using the automatic "Source code" download from GitHub. You need to pull from git or use https://github.com/trapexit/mergerfs/releases/download/2.40.1/mergerfs-2.40.1.tar.gz

from mergerfs.

shocker2 avatar shocker2 commented on June 12, 2024
wget https://github.com/trapexit/mergerfs/archive/refs/tags/2.40.1.zip
unzip 2.40.1.zip
cd mergerfs-2.40.1;
make uninstall
make
sudo make install

from mergerfs.

shocker2 avatar shocker2 commented on June 12, 2024
~/mergerfs-2.40.1 # cat VERSION 
unknown

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

As I said, don't use the auto-generated source zip from GitHub.

https://github.com/trapexit/mergerfs?tab=readme-ov-file#build

Use https://github.com/trapexit/mergerfs/releases/download/2.40.1/mergerfs-2.40.1.tar.gz

What OS are you on? If Red Hat or Debian/Ubuntu you can just use my packages.

from mergerfs.
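For reference, the build steps trapexit describes can be sketched as the following transcript. The tarball URL is the one given in the comment above; the rest follows the project's generic `make`-based build and would need a build host with network access, a compiler toolchain, and sudo rights:

```shell
# Fetch the official release tarball. Unlike GitHub's auto-generated
# "Source code" archives, it ships with a populated VERSION file, so
# "mergerfs -V" reports the real version instead of "vunknown".
wget https://github.com/trapexit/mergerfs/releases/download/2.40.1/mergerfs-2.40.1.tar.gz
tar xf mergerfs-2.40.1.tar.gz
cd mergerfs-2.40.1

# Build and install.
make
sudo make install
```

On distributions trapexit provides packages for (Red Hat, Debian/Ubuntu), installing the prebuilt package is the simpler route.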

shocker2 avatar shocker2 commented on June 12, 2024

Sorry, I missed that. I have installed it from your link and it's working fine now.
I'm running openSUSE 15.4.

from mergerfs.

trapexit avatar trapexit commented on June 12, 2024

What "new setting"? export-support should not be set to anything but true (the default) for usage with NFS. As the docs now say that option is purely for debugging.

from mergerfs.
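To illustrate where export-support lives, here is a hypothetical fstab-style mergerfs entry (the branch and mount paths are made up; per the comment above, export-support defaults to true, so in practice you would simply leave it unset):

```
# /etc/fstab — illustrative only; export-support=true is already the default
/mnt/disk* /mnt/pool fuse.mergerfs cache.files=off,export-support=true 0 0
```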
