Comments (35)
from scorch.
You need to provide more detail. How are you running the script?
There is a difference between "CHANGED" and "FAILED". "CHANGED" means the file's metadata differs from what is recorded, based on the "--diff-fields" argument. "FAILED" means the hash is different but the size (and the other diff fields) are the same.
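To illustrate the distinction, here is a minimal Python sketch of the decision described above (illustrative only; the field names and the classify() helper are not scorch's actual code):

```python
# Hypothetical sketch of the CHANGED-vs-FAILED distinction: a difference in
# any configured diff field means CHANGED; identical metadata with a
# differing hash means FAILED (silent corruption).

def classify(old, new, diff_fields=("size", "mtime")):
    if any(old[f] != new[f] for f in diff_fields):
        return "CHANGED"  # file was modified through normal means
    if old["hash"] != new["hash"]:
        return "FAILED"   # same metadata, different content
    return "OK"

old = {"size": 100, "mtime": 1000, "hash": "aaa"}
print(classify(old, {"size": 120, "mtime": 2000, "hash": "bbb"}))  # CHANGED
print(classify(old, {"size": 100, "mtime": 1000, "hash": "bbb"}))  # FAILED
```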
from scorch.
If you don't want to find changed files (which could be risky) then set the "diff-fields" to an empty string.
from scorch.
Actually, it looks like it skips to the hash check if the diff check doesn't return anything, so those files will show as FAILED instead of CHANGED. I guess I could either change the verbosity setup so changes only print if it is set high enough, or add another argument to ignore changes altogether. Let me look.
from scorch.
As for regex... it's just regular python regex and should work with any common regex pattern. What are you trying to do?
from scorch.
Maybe store mtime along with the hash, and have an option to ignore changes
if the mtime on the file is newer than in the db?
mtime, inode, size, and mode are stored in the DB. That's how it determines a file "changed". The better way of not having it print out changes, if that is unnecessary data for you, would be changing the verbosity settings so it is optional. By default it only prints CHANGED and FAILED. I could change it to be just FAILED by default and move CHANGED to verbose, I guess. Though that would mean I'd need to move the current verbose up a level.
An alternative would be to run an "update" first to refresh the metadata on changed files, then run the "check". "update" is cheaper: it only does the file diff check, not the hash check (though it does rehash updated files).
from scorch.
That said... if you don't care about NFO files then just add -F -f '.*\.nfo$' and it'll only process files that don't match, i.e. those not ending in ".nfo".
from scorch.
Motivation
I was hoping to use scorch purely as a bitrot / corruption detection tool, with cron sending STDOUT to the root e-mail, and normal rsync backups to another server to recover in time.
My files change on a daily basis; I use rsync ("mtime / size") to detect changes, plus an external backup in combination with hardlinks to create external snapshots of the data.
Use case would be the following (all commands as root user where there is nothing else but my data):
1. Initialize checksums for /home
# scorch -d /var/lib/scorch.db add /home
2. Create cron script for periodic data scrubbing
# test -d /root/cron || mkdir -p /root/cron
# cd /root/cron
# cat << EOF > ./scorch.sh
#!/bin/sh
scorch -D 'size,mtime' -d /var/lib/scorch.db check+update /home
scorch -D 'size,mtime' -d /var/lib/scorch.db append /home
scorch -D 'size,mtime' -d /var/lib/scorch.db cleanup /home
EOF
# chmod +x ./scorch.sh
# chmod o-rwx ./scorch.sh
3. Install cron to run once per month
# crontab -e
# m h dom mon dow command
# run data integrity (bitrot) checks on 15th of each month at 03:00 and report to root e-mail
0 3 15 * * /root/cron/scorch.sh
If there is anything on the script's STDOUT, it will be sent to the root e-mail (my machine has /etc/aliases configured and e-mail working).
4. Configure external backups and keep data since last data scrub
In order to recover when bitrot is detected, we need to know which files are good and we need proper backups.
Here is a sample rsync script that I use:
#!/bin/bash
backup_date=`date +%Y-%m-%d`
# configure where to keep last backup and daily archives / snapshots
local_backup_dir="/mnt/backup/home/current"
local_archive_dir="/mnt/backup/home/archives"
# show commands and stop on first error
set -x
set -e
# default exit status (overwritten below if rsync fails)
return_status="0"
# show current date
date
# backup via ssh and rsync
ionice -c 3 rsync -ave ssh --numeric-ids --delete --rsync-path="ionice -c 3 rsync" \
root@SERVER_IP:/home/ "${local_backup_dir}/" || return_status="$?"
# ignore "vanished files" (rsync exit code 24)
if [ "$return_status" -eq 24 ]; then return_status="0"; fi
# show when backup finished copying data
date
backup_time=`date +%Y%m%d%H%M`
# create snapshot from current backup using hard links
cp -al "$local_backup_dir" "$local_archive_dir/$backup_date"
touch -m -t "$backup_time" "$local_archive_dir/$backup_date"
# purge archives / snapshots older than 90 days
find "$local_archive_dir" -mindepth 1 -maxdepth 1 -mtime +90 -type d | xargs -I {} rm -rf {}
# show when script has ended
date
# return correct exit code
exit "$return_status"
from scorch.
I have updated my original answer.
For me any of the following would be great:
- Have the ability to launch scorch in a mode where it only looks for and logs to STDOUT or STDERR when bitrot / data corruption is detected.
- Make scorch log FAILED events on STDERR.
If you have any other questions please don't hesitate to ask :-)
from scorch.
You can use --maxdata and/or --maxactions to do checks more regularly (with --sort=random). You'd catch things more quickly.
I prefer to talk in terms of general features rather than implementations. If I understand you right, you don't care about "changes"? You have lots of legitimate changes and all those changes make it hard to see the FAILED values?
BTW... check+update doesn't update on FAILED. Only on CHANGED.
from scorch.
re regex: ^.*\.(ext1|ext2|ext3)$
That will match those extensions. If you want to easily negate the regex just use -F.
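Since the filter is plain Python regex, the pattern above can be sanity-checked directly with the re module (the filenames are made up for illustration):

```python
import re

# The extension-matching pattern suggested above.
pattern = re.compile(r"^.*\.(ext1|ext2|ext3)$")

assert pattern.match("movie.ext1")      # a listed extension matches
assert pattern.match("some.file.ext3")  # only the final extension matters
assert not pattern.match("movie.nfo")   # other extensions don't match
```

With scorch itself, -F inverts the match, so there is no need to write a negated pattern by hand.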
from scorch.
re changes: if you want to index the files but don't care about changes (or want them emailed at a different time) you could use different databases for them and/or use the filter. Can also use the null hash if you don't even care about bitrot checks given they change so often.
from scorch.
What I'm getting at with these suggestions is that I'm not sure it's a good idea to ignore changes wholesale. It is better to have different checks for different usecases which I think is mostly available given the features available. Printing to stderr feels wrong because it's not an app failure. grep prints failed file opens to stderr because it's an error about the workflow. Not with the data. A FAILED hash check is not a workflow issue. That said I do have app errors printing to stdout which I should fix.
from scorch.
From the default options of scorch I suspect that the original intent of the program was to detect file corruption on immutable files and perhaps let you know which files were modified recently.
Do you see scorch supporting a scenario where files are modified on a daily basis (i.e. .doc or .odt documents) and just focusing on file corruption where it can be detected?
from scorch.
I'm not sure I understand. If the file is modified regularly then the whole "silent" part of "silent corruption detection" is no longer a thing because it will be found when used/updated. scorch is for files that aren't consumed or changed regularly (similar to SnapRAID).
I can make it so it doesn't report changed files but I'm not sure I understand why you would be adding them in the first place. Clearly if they change regularly then they aren't sitting around and at risk of bitrot in the same way an archived file or media generally is.
Can you explain to me what you're trying to accomplish? What is the workflow? Do you want it to index but ignore files that have changed? Would you want it to "update" them but not print it? What would the tool do? How would you run it?
from scorch.
Sorry! It is difficult to communicate effectively about such a deep subject in just a few words.
When I found your project I got the impression that scorch is a generic tool to detect file corruption on a best-effort basis, and the README.md was written in such a detailed and precise manner that I thought it would be a good candidate.
I did not realize you meant scorch as a checksum tool for immutable files, and this only came up when I started testing it.
My case is the following:
- I have a mix of immutable data (90%) and mutable data (10%).
- Mutable data changes just a few times a day.
- Primary data is served from mdadm RAID1, with an mdadm array sync check once per month.
- Data is regularly backed up to a different server.
- I want to spot file corruption problems, but I would need scorch to skip files that have changed since the last check because it cannot verify their consistency (when programs open files, the modification time changes).
Allowing this would make it a very universal file corruption detection tool, and I believe more people could use it.
You have found a good niche for that kind of tool because (IMHO):
- it would allow detecting file corruption problems on any filesystem on a best-effort basis (if I only wanted to checksum immutable files I would go for the md5deep/hashdeep project);
- you cannot use SnapRAID without a parity disk, which costs money, and on top of that you still need a normal backup;
- people who want realtime protection will go for ZFS;
- BTRFS's future on Linux is not really clear, and Linux filesystems are surprisingly lacking when it comes to data integrity.
To me the only problem now is that if I want to have it done in an automatic way I need to ignore the output of the "update" phase, and I cannot distinguish program errors from data errors:
scorch -D 'size,mtime' -d /var/lib/scorch.db update /home 1>/dev/null
scorch -D 'size,mtime' -d /var/lib/scorch.db append /home
scorch -D 'size,mtime' -d /var/lib/scorch.db cleanup /home
scorch -D 'size,mtime' -d /var/lib/scorch.db check /home
An option to silence "CHANGED" files would be a godsend in my case. What do you think?
from scorch.
When I found your project I got the impression that scorch is a generic tool to detect file corruption on a best-effort basis, and the README.md was written in such a detailed and precise manner that I thought it would be a good candidate.
This is what I don't quite understand. I don't know what you mean by "generic tool to detect file corruption". It is one. However, detecting data corruption on files that change regularly is not possible without a detailed understanding of the format itself. Hashes are for general detection. If the file changes by a single bit the hash will change. So if you have "live" files that are changed regularly, keeping hashes around has no utility. They will never be useful. Only after they stop being changed does storing the hash make sense.
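The "single bit" point is easy to demonstrate with Python's hashlib (md5 used purely for illustration; the example strings are made up):

```python
import hashlib

a = hashlib.md5(b"archived file contents").hexdigest()
b = hashlib.md5(b"Archived file contents").hexdigest()  # one byte differs

# A tiny legitimate edit produces a completely different digest, so the
# hash alone cannot distinguish an intended change from corruption.
assert a != b
```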
I can change the verbosity settings to make "changed" a different level, but I'm still unsure why you index files that knowingly, regularly change. Is it just in case they don't change in the future? Wouldn't that situation be better served with an mtime timeout? Only process files which haven't been modified for a certain period of time? I have the restrict feature for similar workflows. Some people set files they expect not to change to readonly, or set the sticky bit to indicate a file is special. So the restrict option lets people use that to filter out files that don't match. An mtime timeout could work similarly. But that is if I'm understanding why you are indexing regularly changing files. Your intent here is what eludes me. Why index the whole of your home directory including files you know regularly change (and therefore hashing is not useful in bitrot detection)?
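The mtime-timeout idea could look something like the following sketch; the quiescent() helper is hypothetical, not an existing scorch option:

```python
import os
import time

def quiescent(path, min_age_days=30):
    """True if the file hasn't been modified for min_age_days, i.e. it has
    settled down enough to be worth hashing for bitrot detection."""
    age_seconds = time.time() - os.stat(path).st_mtime
    return age_seconds >= min_age_days * 86400
```

Files failing the test would simply be skipped at index/check time, much like the existing restrict-style filtering on readonly files or the sticky bit.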
from scorch.
I'll change the verbosity settings. update should have been silent by default anyway. But what I'm really trying to understand is why you index files for which indexing isn't useful. Is it just because you have data stored scattershot across the paths and don't have the ability to define some metric to know which to index and which not to? That is a fine reason, but it'd help me a lot if the reasons could be articulated. Right now I feel like I'm guessing.
from scorch.
regarding hashdeep:
The problem with those tools is they don't fit the problem space of the typical data hoarder. There needs to be a way to distinguish a file that has changed from data that has silently corrupted, so I don't get false positives when I rip my BluRay, replace the DVD rip of some movie, and want to use the same filename. You need a workflow that enables ease of automation and general use, for example when you use a tool like mergerfs and you want the ability to index your data and easily see which files you are missing if a drive dies. There are a number of other features and usecases scorch provides.
from scorch.
Hit the wrong button...
from scorch.
Can you try the version from the verbosity branch?
from scorch.
bump
from scorch.
Hi there @trapexit. Thanks for the great software.
I've been trying to use it in a use case similar to the OP's in this thread. I have lots of legitimate changes on the filesystem I'm running it against, but I also want to be able to detect bitrot on files mixed in that don't change.
I'm scripting the use of Scorch and deciding what to do based on the exit code I receive back from Scorch.
Is there any way to distinguish between files that have legitimately changed on the file system (which I don't care about) and bitrot (which I do care about)? Both seem to result in an exit code of 4.
Would it be possible to add an additional exit code that is just for content changes without the metadata changing (i.e. bitrot)? This would be extremely useful for me.
My current workflow is (feel free to let me know if I'm using Scorch wrong):
scorch -D 'size,mtime' -d ./scorch.db check+update /data
scorch -D 'size,mtime' -d ./scorch.db append /data
scorch -D 'size,mtime' -d ./scorch.db cleanup /data
I have tried running scorch update then scorch check, but in the intervening period the files can change, so I still get errors.
If you didn't like the idea of a separate error code for file corruption (as distinct from file changes and corruption combined), would it be possible to add a new command, scorch update+check, which updates the metadata first and then does the hash check?
from scorch.
For anyone playing along at home Scorch is very well written software and easy to modify. I was able to add the functionality I described above (the additional exit code) extremely easily.
If you'd like to add the functionality yourself you can with the following diff (patch):
diff --git a/scorch b/scorch
index 1523811..fab284e 100755
--- a/scorch
+++ b/scorch
@@ -50,6 +50,8 @@ ERROR_DIGEST_MISMATCH = 4
 ERROR_FOUND = 8
 ERROR_NOT_FOUND = 16
 ERROR_INTERRUPTED = 32
+ERROR_FILE_CORRUPTION = 64
+
 class Options(object):
@@ -671,7 +673,7 @@ def inst_check(opts,path,db,dbremove,update=False):
         newfi.digest = hash_file(filepath,oldfi.digest)
         if newfi.digest != oldfi.digest:
-            err = err | ERROR_DIGEST_MISMATCH
+            err = err | ERROR_DIGEST_MISMATCH | ERROR_FILE_CORRUPTION
You'll now get an exit code of 64 or above (68 if you have no other errors) when you have a file whose contents have changed (i.e. a different hash value) but whose metadata has remained the same.
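Since the error values are bit flags, a calling script can decode a combined exit status like 68 bitwise. The constant values below are taken from the patch above; the decode() helper itself is illustrative:

```python
# scorch error flags (values from the patch above)
FLAGS = {
    4: "DIGEST_MISMATCH",
    8: "FOUND",
    16: "NOT_FOUND",
    32: "INTERRUPTED",
    64: "FILE_CORRUPTION",  # added by the patch
}

def decode(exit_code):
    """Return the set of error-flag names present in an exit code."""
    return {name for bit, name in FLAGS.items() if exit_code & bit}

print(sorted(decode(68)))  # ['DIGEST_MISMATCH', 'FILE_CORRUPTION']
```

A wrapper script can then alert only when FILE_CORRUPTION is set and silently ignore plain metadata changes.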
@trapexit it would be great if you could incorporate this change or something similar, based on exit codes into the mainline of Scorch - that way I don't have to keep the diff and others could easily benefit (assuming someone else finds it useful).
Thanks again for the great software!!
from scorch.
Please don't hijack threads / issues.
As for your request: it makes more sense to keep in line with the behavior of the software and have a change flag. digest mismatch is file corruption. The interpretation of that digest based on other data determines if it's "changed". I'll look at it.
from scorch.
Firstly, sorry for the thread hijack. It did seem to me that my issue / use case was very similar (almost identical) to the OP's. I'll open a new issue now, which can be found here
Thanks for looking at it! Very much appreciated and the idea of a flag to ignore metadata changes (when you are expecting them for example) would definitely work for me.
from scorch.
@azurefreecovid your use case is exactly the same as mine, but I think the author of scorch has a different vision for the program (or a reason for it to be this way).
from scorch.
This is what I don't quite understand. I don't know what you mean by "generic tool to detect file corruption".
...
Why index the whole of your home directory including files you know regularly change (and therefore hashing is not useful in bitrot detection)?
Because most files never change and it is useful to have an automatic (no human interaction) way to detect that there is something wrong with your data, on a best-effort basis (check what I can), on native Linux ext2/3/4 filesystems. Not everyone wants to use ZFS on Linux.
It is also useful not to have to separate the data into mutable/immutable collections and not to have to run manual data checks myself (which I would fail to do regularly and reliably).
from scorch.
@azurefreecovid your use case is exactly the same as mine, but I think the author of scorch has a different vision for the program (or a reason for it to be this way).
? I just hadn't considered the return code for changed vs failed. Obviously I focus on that distinction, given the high level of configuration around this feature. There are two separate things: one is a return code; the other is how verbose things are. Not wanting them conflated isn't some rebuttal of the request. Thread hijacking confuses the conversation and complicates tracking.
Because most files never change and it is useful to have an automatic (no human interaction) way to detect that there is something wrong with your data, on a best-effort basis (check what I can), on native Linux ext2/3/4 filesystems. Not everyone wants to use ZFS on Linux.
I know that. And that's why I wrote the "changes" logic. That's why I don't understand what you mean by "generic tool to detect file corruption." It is one. The concern over what gets printed is entirely separate. And I made changes 6 months ago and no one ever commented on them.
from scorch.
Well, from the discussion I also learned that you intended STDERR for "program errors", not data corruption errors, so I think the idea in the issue title may not fit how scorch is written (it's a different usage) and we possibly can't do anything.
Hm... so @trapexit, if we would like to have this issue resolved, what would you like us to do? Is there anything we can do?
from scorch.
Well, from the discussion I also learned that you intended STDERR for "program errors", not data corruption errors, so I think the idea in the issue title may not fit how scorch is written (it's a different usage) and we possibly can't do anything.
I don't understand what you're talking about. I made the changes to the branch as you requested.
Hm... so @trapexit, if we would like to have this issue resolved, what would you like us to do? Is there anything we can do?
You could test the changes I made for you back in January and tell me if it was sufficient.
I've made a bunch of changes since then so I'll need to port the changes over.
from scorch.
@trapexit Sorry!
I missed the message, assumed the changes did not fit Scorch's purpose, and just gave up.
I had better get to testing then :-)
from scorch.
@azurefreecovid your use case is exactly the same as mine, but I think the author of scorch has a different vision for the program (or a reason for it to be this way).
? I just hadn't considered the return code for changed vs failed. Obviously I focus on that distinction, given the high level of configuration around this feature. There are two separate things: one is a return code; the other is how verbose things are. Not wanting them conflated isn't some rebuttal of the request. Thread hijacking confuses the conversation and complicates tracking.
Completely understand. I can see you are very committed to Scorch and to helping meet users' requirements. Thanks again!
from scorch.