Comments (34)

ppbrown commented on August 20, 2024

from zrep.

carsten-langer commented on August 20, 2024

To 2.: Without ZREP_R there is no error message. Of course, the functionality is then not what I would like for hierarchical filesystems.

To 1.: Is a truly separate remote system needed, or would a different destination pool, or an SSH connection to the same machine, be enough?

ppbrown commented on August 20, 2024

from zrep.

carsten-langer avatar carsten-langer commented on August 20, 2024

To update the original case (with recursion): when running
env ZREP_R=-R DEBUG=1 zrep sync backup/z0
more times than "savecount", I see that the oldest @zrep_?????? snapshots are removed recursively in the original filesystems backup/z0, z0/z1, and z0/z2 (as expected), but the @zrep_?????? snapshots are not removed in any destination filesystem, neither in backup/zcopy/z0 nor in its children z0/z1 and z0/z2.

ppbrown commented on August 20, 2024

carsten-langer commented on August 20, 2024

Now I found the time to set things up on 2 different machines. There it works without problems. So it looks like the problem only appears when running against localhost. If you need more logs or tests, tell me.

ppbrown commented on August 20, 2024

ppbrown commented on August 20, 2024

carsten-langer commented on August 20, 2024

Will do.
Here is the output of your latest version including debug, on the 2 separate systems where it works OK, just as a reference. I will later run the same on the single system where I do have the problems.

root@euca-172-31-15-10:~/zrep# env ZREP_R=-R DEBUG=1 zrep sync orig/z0
zrep_lock_fs: set lock on orig/z0
sending orig/z0@zrep_000009 to 10.181.47.175:back/zcopy/z0
Expiring zrep snaps on orig/z0
DEBUG: expiring orig/z0@zrep_000004
Also running expire on 10.181.47.175:back/zcopy/z0 now...
Expiring zrep snaps on back/zcopy/z0
zrep_unlock_fs: unset lock on orig/z0
root@euca-172-31-15-10:~/zrep# env ZREP_R=-R DEBUG=1 zrep sync orig/z0
zrep_lock_fs: set lock on orig/z0
sending orig/z0@zrep_00000a to 10.181.47.175:back/zcopy/z0
Expiring zrep snaps on orig/z0
DEBUG: expiring orig/z0@zrep_000005
Also running expire on 10.181.47.175:back/zcopy/z0 now...
Expiring zrep snaps on back/zcopy/z0
zrep_unlock_fs: unset lock on orig/z0

ppbrown commented on August 20, 2024

ppbrown commented on August 20, 2024

carsten-langer commented on August 20, 2024

And here is the output of a retest on the original single machine, incl. remote DEBUG:

root# zfs destroy -r backup/z0
root# zfs destroy -r backup/zcopy
root# zfs create backup/z0
root# zfs create backup/z0/z1
root# zfs create backup/z0/z2
root# touch /backup/z0/touch1
root# touch /backup/z0/z1/touch1
root# touch /backup/z0/z2/touch1
root# zfs create backup/zcopy
root#
root# env ZREP_R=-R DEBUG=1 zrep init backup/z0 localhost backup/zcopy/z0
Setting properties on backup/z0
Warning: zfs recv lacking -o readonly
Creating readonly destination filesystem as separate step
Creating snapshot backup/z0@zrep_000000
Sending initial replication stream to localhost:backup/zcopy/z0
Initialization copy of backup/z0 to localhost:backup/zcopy/z0 complete
Filesystem will not be mounted
root#
root# env ZREP_R=-R DEBUG=1 zrep sync backup/z0
zrep_lock_fs: set lock on backup/z0
sending backup/z0@zrep_000001 to localhost:backup/zcopy/z0
Expiring zrep snaps on backup/z0
Also running expire on localhost:backup/zcopy/z0 now...
Expiring zrep snaps on backup/zcopy/z0
Error: zrep_expire Internal Err caller did not hold fs lock on backup/zcopy/z0
REMOTE expire failed
zrep_unlock_fs: unset lock on backup/z0
root#
root# env ZREP_R=-R DEBUG=1 zrep sync backup/z0
zrep_lock_fs: set lock on backup/z0
sending backup/z0@zrep_000002 to localhost:backup/zcopy/z0
Expiring zrep snaps on backup/z0
Also running expire on localhost:backup/zcopy/z0 now...
Expiring zrep snaps on backup/zcopy/z0
Error: zrep_expire Internal Err caller did not hold fs lock on backup/zcopy/z0
REMOTE expire failed
zrep_unlock_fs: unset lock on backup/z0
root#

ppbrown commented on August 20, 2024

zrep_sync -> _expire -> $ZREP_PATH expire

zrep_expire DOES call zrep_lock_fs
So..
????
You SHOULD be seeing debug.

I'm thinking that you are perhaps calling first invocation of zrep, with
/full/path/to/zrep

However, that one then just calls "zrep", which is getting picked up from somewhere ELSE in your path, and it doesn't have the new DEBUG output in it?

ppbrown commented on August 20, 2024

aha. don't call it with

env DEBUG=1 zrep ...

JUST call it with

DEBUG=1 zrep ...

because "env" nukes your PATH variable. So if your PATH is /usr/local/bin:/usr/bin,
the first invocation will pick up /usr/local/bin/zrep
but then the second invocation will pick up /usr/bin/zrep
?
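The shadowing effect Phil suspects can be illustrated with a small self-contained sketch (hypothetical throwaway paths, no real zrep install involved): two scripts with the same name on PATH, where the shell silently runs whichever directory comes first.

```shell
#!/bin/sh
# Hypothetical demo of PATH shadowing (not a real zrep install):
# two scripts named "zrep" on PATH; the shell runs the first match.
tmp=$(mktemp -d)
mkdir -p "$tmp/first" "$tmp/second"
printf '#!/bin/sh\necho old-zrep\n' > "$tmp/first/zrep"
printf '#!/bin/sh\necho new-zrep\n' > "$tmp/second/zrep"
chmod +x "$tmp/first/zrep" "$tmp/second/zrep"
PATH="$tmp/first:$tmp/second:$PATH"
zrep              # prints "old-zrep": the earlier PATH entry wins
command -v zrep   # shows exactly which file was resolved
rm -rf "$tmp"
```

On the real machine, `type -a zrep` (or `command -v zrep`) would show whether a second copy is being picked up.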

carsten-langer commented on August 20, 2024

Hi,
I cannot quite follow you here.
I understand that debug is active when the dest/remote expire is called, as the debug messages appear in the printout. Is there a place where you would expect an additional debug message?

Regarding how I call zrep: you can see my statements in this ticket. I do not call zrep with a full path, but without a path. However, there is exactly 1 zrep on the machine. I moved it to /usr/local/sbin, which is part of my PATH variable. That is, for each new commit you make, I do:

clanger@nas:~$	cd
clanger@nas:~$	wget https://raw.githubusercontent.com/bolthole/zrep/master/zrep
clanger@nas:~$	diff zrep /usr/local/sbin/zrep
clanger@nas:~$	chmod +x zrep
clanger@nas:~$	sudo mv zrep /usr/local/sbin/zrep

Regarding using env or not using env: I think this makes no difference, as env does not (at least for me) change my PATH; see the environment with and without usage of env:

clanger@nas:~$ env | grep -E "DEBUG|PATH"
PATH=/home/clanger/bin:/home/clanger/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
clanger@nas:~$ env DEBUG=1 env | grep -E "DEBUG|PATH"
PATH=/home/clanger/bin:/home/clanger/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
DEBUG=1
clanger@nas:~$ DEBUG=1 env | grep -E "DEBUG|PATH"
DEBUG=1
PATH=/home/clanger/bin:/home/clanger/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

But just for testing, I re-executed the whole script, now without using "env". To me the result looks identical:

root@nas:/home/clanger# zfs destroy -r backup/zcopy
root@nas:/home/clanger# zfs create backup/z0
root@nas:/home/clanger# zfs create backup/z0/z1
root@nas:/home/clanger# zfs create backup/z0/z2
root@nas:/home/clanger# touch /backup/z0/touch1
root@nas:/home/clanger# touch /backup/z0/z1/touch1
root@nas:/home/clanger# touch /backup/z0/z2/touch1
root@nas:/home/clanger# zfs create backup/zcopy
root@nas:/home/clanger#
root@nas:/home/clanger# ZREP_R=-R DEBUG=1 zrep init backup/z0 localhost backup/zcopy/z0
Setting properties on backup/z0
Warning: zfs recv lacking -o readonly
Creating readonly destination filesystem as separate step
Creating snapshot backup/z0@zrep_000000
Sending initial replication stream to localhost:backup/zcopy/z0
Initialization copy of backup/z0 to localhost:backup/zcopy/z0 complete
Filesystem will not be mounted
root@nas:/home/clanger#
root@nas:/home/clanger# ZREP_R=-R DEBUG=1 zrep sync backup/z0
zrep_lock_fs: set lock on backup/z0
sending backup/z0@zrep_000001 to localhost:backup/zcopy/z0
Expiring zrep snaps on backup/z0
Also running expire on localhost:backup/zcopy/z0 now...
Expiring zrep snaps on backup/zcopy/z0
Error: zrep_expire Internal Err caller did not hold fs lock on backup/zcopy/z0
REMOTE expire failed
zrep_unlock_fs: unset lock on backup/z0
root@nas:/home/clanger#

I also ran yet another script, still on a single machine but with source and destination filesystems on different pools; the same error message appears when zrep tries to expire the snaps on the destination.

ppbrown commented on August 20, 2024

carsten-langer commented on August 20, 2024

Regarding env, I feel the discussion is drifting away from the core of the problem.
But to answer your question:
env without parameters prints the environment; in that sense it is similar to just using "set".
env with parameters changes the environment for the next command.
env DEBUG=1 env first changes the environment, then calls the second env without parameters, which prints that changed environment.

Back to the core problem.

Going through your script, I find that when the origin side is expired, _expire() is called from within the running script, i.e. with the PID that also did the previous sync and which is stored in the ZFS properties as the lock-pid of the destination, with source "received".

Then in line 1741 you find that you also want to expire the destination side.
For that, in line 1749 you call zrep again as a new process, "zrep expire dest...". This process runs in parallel and has a different PID, which I verified by adding some "ps auxww" statements during debugging. When that new process is inside _expire, the PID of the _expire process does not match the PID stored in the ZFS properties of the destination filesystem, so you throw the error that the process does not hold the lock. I verified this by adding "+x" to your script and checking the printouts.

That triggers these questions: Is it necessary to spawn a second process when the destination is on the same localhost machine as the origin?
Does it matter for expire on the destination that a lock-pid is set, given that it is a "received" property and thus only reflects the PID zrep had on the sending side when syncing?
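The PID mismatch described above can be reduced to a minimal sketch (simplified and hypothetical, not zrep's actual code): a lock check that compares a stored pid against the current shell's $$ necessarily fails in any freshly spawned process, even one started by the lock holder itself.

```shell
#!/bin/sh
# Hypothetical, simplified lock check: the stored zrep:lock-pid value
# (here passed as an argument instead of read via "zfs get") must
# equal the current process id.
has_fs_lock() {
    stored_pid=$1
    [ "$stored_pid" = "$$" ]
}

# In the process that set the lock, the check passes:
has_fs_lock "$$" && echo "same process: lock held"

# A newly spawned "zrep expire" has its own pid, so the same stored
# value no longer matches ($$ differs in the new shell):
sh -c '[ "$1" = "$$" ] || echo "new process: lock check fails"' _ "$$"
```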

carsten-langer commented on August 20, 2024

BTW, my refactored self-contained test script is now this:

zpool destroy orgpool
truncate --size 500M /tmp/vol1
zpool create orgpool /tmp/vol1

zpool destroy destpool
truncate --size 500M /tmp/vol2
zpool create destpool /tmp/vol2

zfs destroy -r orgpool/z0
zfs create orgpool/z0
zfs create orgpool/z0/z1
zfs create orgpool/z0/z2

zfs destroy -r destpool/zcopy
zfs create destpool/zcopy

ZREP_R=-R DEBUG=1 zrep init orgpool/z0 localhost destpool/zcopy/z0

ZREP_R=-R DEBUG=1 zrep sync orgpool/z0

ppbrown commented on August 20, 2024

carsten-langer commented on August 20, 2024

I can probably not prove to you that I run only 1 version of zrep; as origfs and destfs are on the same localhost, there is only 1 zrep script, so there is nothing like an old/new script. I guess you will have to trust me on this.

I was debugging your script. This is what I understand:
During "zrep init", no zrep:lock-pid is set, but @zrep_000000 is created on orgpool and copied over to the destination.
During "zrep sync", first a zrep:lock-pid property is set on the orgpool/z0 filesystem.
Then _snapandsync is called, which calls _makesnap, which creates the second snapshot orgpool/z0@zrep_000001.

Now we have 3 zrep:lock-pid properties:

  • as "local" in orgpool/z0
  • as "inherited" in orgpool/z0@zrep_000000
  • as "inherited" in orgpool/z0@zrep_000001
  • and more inherited in sub-fs z0/z1 etc. as I use "-R" for recursion.

Then inside _snapandsync next _sync is called, which zfs send/receive the snapshot @zrep_000001 to the destination filesystem.

Now we have 6 zrep:lock-pid properties:

  • as "local" in orgpool/z0
  • as "inherited" in orgpool/z0@zrep_000000
  • as "inherited" in orgpool/z0@zrep_000001
  • as "received" in destpool/zcopy/z0
  • as "inherited" in destpool/zcopy/z0@zrep_000000
  • as "inherited" in destpool/zcopy/z0@zrep_000001

That is, the top filesystem of the hierarchical backup on the destination, destpool/zcopy/z0, now also has a zrep:lock-pid, as it was copied over via the zfs send/receive.

Now, still from zrep_sync, _expire $srcfs is called.
This checks zrep_has_fs_lock orgpool/z0, which checks whether the zrep:lock-pid saved in orgpool/z0 is the current process id. That is true, as the zrep:lock-pid was just set during the still-ongoing call to zrep_sync, which runs at this PID. So expiration can continue and runs successfully.

After the expiration ran successfully on orgpool/z0, zrep_ssh localhost 'zrep expire destpool/zcopy/z0' is called next.
This runs "zrep expire destpool/zcopy/z0" in a new process with a new PID on the same localhost.

New process: zrep expire calls zrep_expire,
which calls
zrep_lock_fs destpool/zcopy/z0
zrep_fs_lock_pid destpool/zcopy/z0
zfs get -H -o value zrep:lock-pid destpool/zcopy/z0
So this fetches any zrep:lock-pid. It receives the id whose source is "received", not "local"; but since the "zfs get" does not specify that it only wants "local" properties, it gets the "received" one.

It then checks whether the "owning lock still exists", which is true, as the "received" lock-pid is the PID of the zrep sync process, and that process is still running, having just spawned the zrep expire process.

I guess this is a crucial point: the PID retrieved from the destination filesystem is the PID of the currently running "zrep sync" process on the origin side.

Here it makes a difference: on 2 different machines, that PID may happen to be in use by some arbitrary other process, but on a not-so-busy system chances are that there is no running process with that PID on the destination system. In that case you would (falsely) interpret the lock as stale and just assign a new lock, which then makes it appear to work.

But here the process is still running, so you return 1, i.e. zrep_lock_fs() fails.
Then back in zrep_expire you state "Note: we should continue if we hit problems with an individual filesystem.", so you continue, even though the current "zrep expire" process has a PID which is not identical to the PID in the zrep:lock-pid of the filesystem you want to expire.

From all this, I think the problem is that you probably did not expect to get the "received" property of "zrep:lock-pid"; your check is more inclined towards a "local" property.
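The local-vs-received distinction can be sketched with zfs stubbed out as a shell function purely for illustration (this is NOT the real zfs binary): the real fix would be to add -s local to the lookup, so properties whose source is "received" are ignored.

```shell
#!/bin/sh
# Stand-in mimicking the behaviour described above, NOT the real zfs:
# an unqualified get returns whatever pid is stored, while a get
# restricted to locally-set properties returns nothing.
zfs() {
    case "$*" in
        *"-s local"*) printf '\n' ;;   # no locally-set lock-pid
        *)            echo 31337 ;;    # received pid leaks through
    esac
}

any=$(zfs get -H -o value zrep:lock-pid destpool/zcopy/z0)
local_only=$(zfs get -H -o value -s local zrep:lock-pid destpool/zcopy/z0)
echo "unqualified lookup: '$any'"        # picks up the received pid
echo "local-only lookup:  '$local_only'" # empty: no local lock
```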

Hope this helps you find the root cause.

ppbrown commented on August 20, 2024

ppbrown commented on August 20, 2024

ppbrown commented on August 20, 2024

ppbrown commented on August 20, 2024

This is how it should look.

Note the double "zrep_lock_fs" output.

$# DEBUG=1 ZREP_R=-R zrep sync all
zrep_lock_fs: set lock on rpool/zrepsrc
sending rpool/zrepsrc@zrep_000009 to localhost:rpool/zrepdest
Expiring zrep snaps on rpool/zrepsrc
DEBUG: expiring rpool/zrepsrc@zrep_000004_unsent
Also running expire on localhost:rpool/zrepdest now...
zrep_lock_fs: set lock on rpool/zrepdest
Expiring zrep snaps on rpool/zrepdest
DEBUG: expiring rpool/zrepdest@zrep_000004_unsent
zrep_unlock_fs: unset lock on rpool/zrepdest
zrep_unlock_fs: unset lock on rpool/zrepsrc

carsten-langer commented on August 20, 2024

Hi Phil,

I added as the first line in zrep_lock_fs()
echo "Hi Carsten, entering zrep_lock_fs!" >/dev/fd/2
and hooray, it now also prints after "Also running expire on localhost:destpool/zcopy/z0 now..." :-)
I also tweaked the output strings for _debugprint() and _errprint(), just in case.

You write

you are wrong about this:

...(Here I wrote that the zrep:lock-pid property exists as "received" on the destination after _snapandsync())...

because on SYNCS zrep does not use (send properties in snapshot) flag. It does that only on init (see use of -p).

Well, that was not theory; I actually observed it on my system. I added an "exit 1" in zrep_sync() after _snapandsync and before _expire. This exited the script after "zfs send" and before any expiration. Then I listed all properties and found them as described.

You are right that the first time (from zrep init) zfs send runs with -p:
zfs send -R -p orgpool/z0@zrep_000000
And the second time (from zrep sync) runs without -p:
zfs send -R -I orgpool/z0@zrep_000000 orgpool/z0@zrep_000001

BUT: I also use -R for the recursive sync.

And ZFS's manpage states that this implies copying over all properties:

zfs send
... -R
Generate a replication stream package, which will replicate the specified filesystem, and all descendent file systems, up to the named snapshot. When received, all properties, snapshots, descendent file systems, and clones are preserved.

If the -i or -I flags are used in conjunction with the -R flag, an incremental replication stream is generated. The current values of properties, and current snapshot and file system names are set when the stream is received. If the -F flag is specified when this stream is received, snapshots and file systems that do not exist on the sending side are destroyed.

The man page continues ...

-p
Include the dataset's properties in the stream. This flag is implicit when -R is specified. The receiving system must also support this feature.

This explains why I see the zrep:lock-pid property with "source=received" on the destination filesystem.

My feeling is that you did not expect "zrep sync" to copy over any properties, and it would not if "-R" were not used. Without copying over the properties, the destination filesystem would not have the zrep:lock-pid property received from the source filesystem, and then your locking methodology would work.

But with "-R" it does copy over the properties, at least in my version of ZFS (zfs and os versions see bottom of #44 (comment)).

You then wrote:

... In that case you would (falsly) interpret it as stale and just assign a new lock which then makes it appear to work.

but if this were the case, then since you have DEBUG flag on, you should
see output from:
_errprint overiding stale lock on $1 from pid $check
but you dont.
Therefore it is not doing that.

It does indeed do that, i.e. it would if there were not the other bug.
In my comment #44 (comment) I stated it works on 2 machines. At that time you had not yet made commit 4a897b5, and because of that, DEBUG was not set on the remote side, so I did not get the printout about the stale filesystem lock.

Now that mentioned commit broke things for me. When I retested today with 2 machines, I got the following error: "ssh: Could not resolve hostname debug=1: Name or service not known", and the script breaks. Running with ksh -x I see:

+ ssh_cmd='ssh DEBUG=1 10.181.47.175'
+ shift
+ ssh DEBUG=1 10.181.47.175 zfs create -o readonly=on back/zcopy/z0
ssh: Could not resolve hostname debug=1: Name or service not known
+ zrep_errquit 'Cannot create 10.181.47.175:back/zcopy/z0'

So I guess the commit for passing DEBUG to the other side through ssh does not work.
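The failure is an argument-ordering issue: ssh takes its first non-option argument as the hostname, so in "ssh DEBUG=1 host ..." the assignment is parsed as the host. A hedged sketch of a fix (hypothetical host and command, mirroring the trace above) is to move the assignment into the remote command; the same assignment-before-command ordering can be verified locally with env:

```shell
#!/bin/sh
# Broken:  ssh DEBUG=1 10.181.47.175 zrep expire ...
#          -> ssh treats "DEBUG=1" as the hostname and fails.
# Possible fixes (hypothetical invocations, not the committed code):
# put the assignment inside the remote command, so the remote shell
# (or env on the remote side) applies it:
#   ssh 10.181.47.175 "DEBUG=1 zrep expire back/zcopy/z0"
#   ssh 10.181.47.175 env DEBUG=1 zrep expire back/zcopy/z0
#
# Locally, the ordering rule is easy to check:
out=$(env DEBUG=1 sh -c 'echo "DEBUG=$DEBUG"')
echo "$out"   # the assignment reached the child command
```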

I therefore removed the "DEBUG=${DEBUG}" from that line again and instead uncommented the prepared DEBUG=1 in line 33 on both machines. Now both zreps run in DEBUG mode, and voilà, I get the error message about the stale lock when running on 2 separate machines.

It looks complicated, but I think we now have 2 streams:
a) get DEBUG to the other side, as the current way breaks
b) adapt to the fact that a zrep -R does send all properties, incl. zrep:lock-pid, to the remote side.

I appreciate your active discussion of this topic. I hope you can fix the issues.

Best regards
Carsten

ppbrown commented on August 20, 2024

ah... thanks for your detailed investigation. I did the -R check, but only for sync, not for init.
Which apparently makes all the difference.
SIGHHHHhhh.

most places I already used zfs get -s local.
but not for the locking check.
I've now fixed that.
I've also juggled that ssh DEBUG thing, so I think it's in the correct place now :-}

Please try the latest git.

ppbrown commented on August 20, 2024

wait.. my comments don't make sense.
I STILL should have seen it, even doing "-R check, but only for sync" ???
Very Odd.

But new version should still fix your problems I think :)

ppbrown commented on August 20, 2024

or then again.. i may have broken it

.<

ppbrown commented on August 20, 2024

Fun fact..

when you use "zfs get", an unset value is returned as "-"
BUT,
when you use "zfs get -s local", an unset value is returned as ""
at least on MY system.

/smack zfs devs.

git updated
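Given the two conventions observed above, a defensive check (a minimal sketch under that assumption, not the code actually committed) would treat both the "-" from plain zfs get and the empty string from zfs get -s local as "no lock set":

```shell
#!/bin/sh
# Treat both representations of an unset property as "unset":
# plain "zfs get" prints "-", while "zfs get -s local" prints "".
lock_is_unset() {
    case $1 in
        ""|"-") return 0 ;;   # unset, under either convention
        *)      return 1 ;;   # a pid (or something else) is stored
    esac
}

lock_is_unset "-"   && echo "plain zfs get: unset"
lock_is_unset ""    && echo "zfs get -s local: unset"
lock_is_unset 4242  || echo "lock pid present"
```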

carsten-langer commented on August 20, 2024

Nearly there. There is still one error in commit 1247e4a, which I commented on there. With that proposed change, it runs for me on the 2 separate machines (both debug and sync). Will now try on the single machine.

carsten-langer commented on August 20, 2024

I can confirm it also works on the single machine now (with the manual patch that I commented above).

ppbrown commented on August 20, 2024

carsten-langer commented on August 20, 2024

Cool, finally all parts of the issue are fixed. I close this issue. Thanks again.

ppbrown commented on August 20, 2024

Thank you for your willingness and assistance in testing and debugging :)
