
scorch's Introduction

scorch (Silent CORruption CHecker)

scorch is a tool to catalog files and their hashes to help in discovering file corruption, missing files, duplicate files, etc.

Usage

usage: scorch [<options>] <instruction> [<directory>]

scorch (Silent CORruption CHecker) is a tool to catalog files, hash
digests, and other metadata to help in discovering file corruption,
missing files, duplicates, etc.

positional arguments:
  instruction:           * add: compute & store digests for found files
                         * append: compute & store digests for unhashed files
                         * backup: backs up selected database
                         * restore: restore backed up database
                         * list-backups: list database backups
                         * diff-backup: show diff between current & backup DB
                         * hashes: print available hash functions
                         * check: check stored info against files
                         * update: update metadata of changed files
                         * check+update: check and update if new
                         * cleanup: remove info of missing files
                         * delete: remove info for found files
                         * list: md5sum'ish compatible listing
                         * list-unhashed: list files not yet hashed
                         * list-missing: list files no longer on filesystem
                         * list-dups: list files w/ dup digests
                         * list-solo: list files w/ no dup digests
                         * list-failed: list files marked failed
                         * list-changed: list files marked changed
                         * in-db: show if files exist in DB
                         * found-in-db: print files found in DB
                         * notfound-in-db: print files not found in DB
  directory:             Directory or file to scan.

optional arguments:
  -d, --db=:             File to store digests and other metadata in. See
                         docs for info. (default: /var/tmp/scorch/scorch.db)
  -v, --verbose:         Make `instruction` more verbose. Actual behavior
                         depends on the instruction. Can be used multiple
                         times.
  -q, --quote:           Shell quote/escape filenames when printed.
  -r, --restrict=:       * sticky: restrict scan to files with sticky bit
                         * readonly: restrict scan to readonly files
  -f, --fnfilter=:       Restrict actions to files which match regex.
  -F, --negate-fnfilter  Negate the fnfilter regex match.
  -s, --sort=:           Sorting routine on input & output. (default: natural)
                         * random: shuffled / random
                         * natural: human-friendly sort, ascending
                         * natural-desc: human-friendly sort, descending
                         * radix: RADIX sort, ascending
                         * radix-desc: RADIX sort, descending
                         * mtime: sort by file mtime, ascending
                         * mtime-desc: sort by file mtime, descending
                         * checked: sort by last time checked, ascending
                         * checked-desc: sort by last time checked, descending
  -m, --maxactions=:     Max actions before exiting. (default: maxint)
  -M, --maxdata=:        Max bytes to process before exiting. (default: maxint)
                         Can use 'K', 'M', 'G', 'T' suffix.
  -T, --maxtime=:        Max time to process before exiting. (default: maxint)
                         Can use 's', 'm', 'h', 'd' suffix.
  -b, --break-on-error:  Any error or digest mismatch will cause an exit.
  -D, --diff-fields=:    Fields to use to indicate a file has 'changed' (vs.
                         bitrot / modified) and should be rehashed.
                         Combine with ','. (default: size)
                         * size
                         * inode
                         * mtime
                         * mode
  -H, --hash=:           Hash algo. Use 'scorch hashes' to get available algos.
                         (default: md5)
  -h, --help:            Print this message.

exit codes:
  *  0 : success, behavior executed, something found
  *  1 : processing error
  *  2 : error with command line arguments
  *  4 : hash mismatch
  *  8 : found
  * 16 : not found, nothing processed
  * 32 : interrupted
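
Because the non-zero codes are powers of two, a wrapper script can test the
result as individual flags (whether scorch actually combines codes in a single
run is an assumption here; the bit tests still match the single codes either
way). A minimal sketch, reusing the paths from the Example section below:

#!/bin/sh
# Sketch: act on scorch's documented exit codes from a wrapper script.
scorch -d /tmp/hash.db check /tmp/files
rv=$?
if [ $(( rv & 4 )) -ne 0 ]; then
    echo "hash mismatch detected" >&2
fi
if [ $(( rv & 32 )) -ne 0 ]; then
    echo "run was interrupted" >&2
fi
exit "$rv"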

Database

Format

The database file is simply a CSV file compressed with gzip.

$ # file, hash:digest, size, mode, mtime, inode, state, checked
$ zcat /var/tmp/scorch/scorch.db
/tmp/files/a,md5:d41d8cd98f00b204e9800998ecf8427e,0,33188,1546377833.3844686,123456,0,1588895022.6193066

The 'state' value can be 'U' for unknown, 'C' for changed, 'F' for failed, or 'O' for OK.

The 'mtime' and 'checked' values are floating point seconds since epoch.
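
Because the database is just gzip'd CSV, standard shell tools can query it
directly; for example, looking up the stored record for a single file (using
the sample entry above):

$ zgrep '^/tmp/files/a,' /var/tmp/scorch/scorch.db
/tmp/files/a,md5:d41d8cd98f00b204e9800998ecf8427e,0,33188,1546377833.3844686,123456,0,1588895022.6193066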

--db argument

The --db argument can take more than a path.

  • /tmp/test/myfiles.db : Full path. Used as is.
  • /tmp/test : If /tmp/test is a directory -> /tmp/test/scorch.db
  • /tmp/test/ : Force interpretation as directory -> /tmp/test/scorch.db
  • /tmp/test : If /tmp/test is not a directory -> /tmp/test.db
  • ./test : Any relative path containing a '/': the current working directory is prepended, then the rules above apply.
  • test : No forward slashes -> /var/tmp/scorch/test.db

If there is no extension then .db will be added.
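
For example, per the rules above, the following invocations should all end up
using /var/tmp/scorch/test.db (assuming /var/tmp/scorch/test is not itself a
directory):

$ scorch -d test                    add /tmp/files   # no slashes -> /var/tmp/scorch/test.db
$ scorch -d /var/tmp/scorch/test    add /tmp/files   # not a directory -> /var/tmp/scorch/test.db
$ scorch -d /var/tmp/scorch/test.db add /tmp/files   # full path, used as is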

Backup / Restore

To simplify backing up the scorch database there is a backup instruction. Without a directory argument the backup is stored in the same directory as the database itself. If a directory is given as an argument the backup will be stored there instead.

$ scorch -v backup
/var/tmp/scorch/scorch.db.backup_2019-07-29T02:35:46Z
$ scorch -v backup /tmp
/tmp/scorch.db.backup_2019-07-29T02:36:12Z
$ scorch list-backups
/var/tmp/scorch/scorch.db.backup_2019-07-29T02:35:46Z
$ scorch list-backups /tmp
/tmp/scorch.db.backup_2019-07-29T02:36:12Z
/tmp/scorch.db.backup_2019-07-29T02:13:34Z
$ scorch restore /tmp/scorch.db.backup_2019-07-29T02:36:12Z

Example

$ ls -lh /tmp/files
total 0
-rw-rw-r-- 1 nobody nogroup 0 May  3 16:30 a
-rw-rw-r-- 1 nobody nogroup 0 May  3 16:30 b
-rw-rw-r-- 1 nobody nogroup 0 May  3 16:30 c

$ scorch -v -d /tmp/hash.db add /tmp/files
1/3 /tmp/files/c: d41d8cd98f00b204e9800998ecf8427e
2/3 /tmp/files/a: d41d8cd98f00b204e9800998ecf8427e
3/3 /tmp/files/b: d41d8cd98f00b204e9800998ecf8427e

$ scorch -v -d /tmp/hash.db check /tmp/files
1/3 /tmp/files/a: OK
2/3 /tmp/files/b: OK
3/3 /tmp/files/c: OK

$ echo asdf > /tmp/files/d

$ scorch -v -d /tmp/hash.db list-unhashed /tmp/files
/tmp/files/d

$ scorch -v -d /tmp/hash.db append /tmp/files
1/1 /tmp/files/d: md5:2b00042f7481c7b056c4b410d28f33cf

$ scorch -d /tmp/hash.db list-dups /tmp/files
md5:d41d8cd98f00b204e9800998ecf8427e /tmp/files/a /tmp/files/b /tmp/files/c

$ scorch -v -d /tmp/hash.db list-dups /tmp/files
md5:d41d8cd98f00b204e9800998ecf8427e
 - /tmp/files/a
 - /tmp/files/b
 - /tmp/files/c

$ echo foo > /tmp/files/a
$ scorch -v -d /tmp/hash.db check+update /tmp/files
1/4 /tmp/files/b: OK
2/4 /tmp/files/c: OK
3/4 /tmp/files/a: FILE CHANGED
 - size: 0B -> 4B
 - mtime: Tue Jan  1 16:23:57 2019 -> Tue Jan  1 16:24:09 2019
 - hash: d41d8cd98f00b204e9800998ecf8427e -> d3b07384d113edec49eaa6238ad5ff00
4/4 /tmp/files/d: OK

$ scorch -v -d /tmp/hash.db list /tmp/files | cut -d: -f2- | md5sum -c
/tmp/files/c: OK
/tmp/files/d: OK
/tmp/files/a: OK
/tmp/files/b: OK

Automation

A typical setup would be initialized manually using add or append. Once the database has been created, a cron job can run check, update, append, and cleanup against it. By not placing scorch into verbose mode, only differences and failures will be printed, and cron will email the job's output to the user (if set up to do so).

#!/bin/sh

scorch -M 128G -T 2h check+update /tmp/files
scorch append /tmp/files
scorch cleanup /tmp/files
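
A possible crontab entry to run such a script nightly; the script path here is
hypothetical, and cron must be configured to mail job output for the report to
be delivered:

# m h dom mon dow  command
30 2 * * * /usr/local/bin/scorch-nightly.sh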

Support

Contact / Issue submission

Support development

This software is free to use and released under a very liberal license. That said, if you like this software and would like to support its development, donations are welcome.

  • PayPal: [email protected]
  • Patreon: https://www.patreon.com/trapexit
  • Bitcoin (BTC): 1DfoUd2m5WCxJAMvcFuvDpT4DR2gWX2PWb
  • Bitcoin Cash (BCH): qrf257j0l09yxty4kur8dk2uma8p5vntdcpks72l8z
  • Ethereum (ETH): 0xb486C0270fF75872Fc51d85879b9c15C380E66CA
  • Litecoin (LTC): LW1rvHRPWtm2NUEMhJpP4DjHZY1FaJ1WYs
  • Basic Attention Token (BAT): 0xE651d4900B4C305284Da43E2e182e9abE149A87A
  • Zcash (ZEC): t1ZwTgmbQF23DJrzqbAmw8kXWvU2xUkkhTt
  • Zcoin (XZC): a8L5Vz35KdCQe7Y7urK2pcCGau7JsqZ5Gw


scorch's Issues

Is it possible to ignore directories?

Hi
Is there a way to ignore certain directories? I'm trying to skip all .something dirs (and their contents), but the fnfilter switch seems to apply only to files (or I'm using it wrong). Even though, technically, dirs are also files... :P

UnicodeEncodeError when running integrity check on file with emoji

I get the following error when I try to run the integrity check on a file that includes an emoji in the name. This happens even after I renamed the file, so I suspect it's baked into the database now. Here is the error:


Traceback (most recent call last):
  File "/volume1/system/scorch/scorch.py", line 640, in inst_check
    newfi = get_fileinfo(filepath)
  File "/volume1/system/scorch/scorch.py", line 404, in get_fileinfo
    st = os.lstat(filepath)
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f917' in position 56: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/volume1/system/scorch/scorch.py", line 1647, in <module>
    main()
  File "/volume1/system/scorch/scorch.py", line 1626, in main
    rv = rv | func(opts,directory,db,dbremove)
  File "/volume1/system/scorch/scorch.py", line 707, in inst_check
    print_filepath(filepath,actions,total,opts.quote)
  File "/volume1/system/scorch/scorch.py", line 432, in print_filepath
    print(s,end=end)
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f917' in position 66: ordinal not in range(128)
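
This class of error usually comes from Python running under an ASCII locale
rather than from the database itself. A workaround worth trying (not a change
to scorch; it assumes a UTF-8 locale is installed, and the target path is a
placeholder) is forcing a UTF-8 environment for the run:

$ LANG=en_US.UTF-8 PYTHONIOENCODING=utf-8 python3 /volume1/system/scorch/scorch.py check /path/to/files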

Wrapper to send report per email

Hi @trapexit

I find your tool useful to ensure that photos/music don't get corrupted over time or while copying to other devices, specifically for setups without RAID.

I developed a small wrapper to send scorch's output as email (and made the configuration a bit more readable)
https://gitlab.com/daufinsyd/robespierre
I wanted to let you know :)

Thank you for your great software btw!

List files where a hash is present but the file is not accessible

How about a function for scorch where the hash database is checked for files that are missing - i.e. a hash is in the database but the corresponding file is not found in the expected path.

The use case for such a function could be if you decide to live without a RAID but want to know which files are gone for good after a disk fails faster than you can move the files off the platters.
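
For reference, the list-missing instruction documented above ("list files no
longer on filesystem") appears to cover this case, e.g.:

$ scorch -d /tmp/hash.db list-missing /tmp/files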

Bug on is_readonly

Hi
I did some testing and you have a bug:

def is_readonly(st):
    return not (fi.mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

It should probably be:

def is_readonly(fi):

Support portability of db

Hi,
It would be great if the tool would support portability of db.

In my case I have data that I back up to at least 2 drives and I want to verify that all drives contain the same data.
I have them mounted at /mnt/y and /mnt/v.
I want to check both drives using the same database but create hashes from only one of them.

So maybe just another argument that would replace the /mnt/y prefix with /mnt/v in memory would be sufficient.

I will try to look into the code and create a pull request if I find time.

Thanks,
Petr

Print only warnings when data integrity problem occurs (conjunction with crontab reports to root)

Hello!
I have found scorch while researching data integrity tools available for Linux and one of the users mentioned your script.

The script works beautifully and as expected. Thank you for creating it - great job!

Would it be possible to adjust it so it would only print information when there is an actual data integrity problem ("FAILED")?

When used with crontab this would allow setting up a cron job; cron by default sends e-mail to the user (i.e. root) when there is any output on STDOUT.

Right now the script also shows information about "CHANGED" hashes, which defeats the purpose without additional scripting.

What do you think?

Bug on humansize

If a file has 0 bytes (rare, but I have them...) you must check for it, because log of 0 is an error.

def humansize(nbytes):
    suffixes = ['B','KB','MB','GB','TB','PB','ZB']
    rank = 0
    if(nbytes > 0):
        rank = int(math.log(nbytes,1024))

Support for progressive write to disk

Hi,

When running a checksum that takes a few hours (ex: a dir with many large files), if the machine crashes or scorch exits before it completes, all results are lost.

It would be nice if scorch could periodically write to disk to prevent losing many hours of work. My suggestion would be to write every five minutes or every N entries (perhaps configurable?). Thanks!

scorch data stored in xmp files instead of db

Several free software photography tools like darktable store all the metadata, and the changes they make to the image, in an xmp file with the same name; the original file is never touched.
Would it be possible to give scorch a feature where the checksum information is stored in these same xmp files instead of the database file?
The main benefit of this is that other apps like darktable or digikam would also be able to access and use the scorch information from the xmp file. Scorch could even be integrated and launched from these apps directly.
Most photographers don't store checksums of their photos, but just like everybody else, they suffer the consequences when corruption happens. This way, as soon as data corruption occurs it can be detected and the photographer can delete the corrupt file and restore a backup of just that file.

Support Parallel Checking Under MergerFS

Sorry if this is the wrong spot for a feature request, but I use MergerFS also (awesome and flawless), and have been using scorch to check for corruption (also awesome). But because the four 8TB HDDs I have are 85% full, it takes days for scorch to complete. I was wondering if it would be possible to somehow integrate with mergerfs and parallel-check all the drives I have under /mnt. Or if there is a way to better do that already that doesn't miss out on potential corruption when I use mergerfs.balance. Thanks for the awesome software!
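
One workaround with the existing options is to run one scorch instance per
underlying branch, each with its own database, and let them run concurrently;
a rough sketch, with the branch mount points being hypothetical:

#!/bin/sh
# Hypothetical branch mounts; adjust to the actual drives merged under /mnt.
for branch in /mnt/disk1 /mnt/disk2 /mnt/disk3 /mnt/disk4; do
    scorch -d "/var/tmp/scorch/$(basename "$branch").db" check+update "$branch" &
done
wait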

Way to differentiate changed files (with metadata changes) from changed files (without metadata changes) based on exit code

Thanks for the great software. Scorch is fantastic!

I've been trying Scorch on a system where I have lots of legitimate changes on the filesystem that I'm running it on, but also want to be able to detect bitrot on files that are mixed in that don't change.

I'm scripting the use of Scorch and deciding what to do based on the exit code I receive back from Scorch.

Is there any way to distinguish between files that have legitimately changed on the file system (I'm happy to ignore these) vs bitrot that I do care about? At the moment both seem to result in an exit code of 4 from Scorch and it makes it difficult for me to distinguish between the two use cases.

I realise that the log messages do show the difference, but for follow on processing I'd really like the distinction to come from the exit codes.

Would it be possible to add an additional exit code that is just for file changes without the metadata changing?

My current workflow is (feel free to let me know if I'm using Scorch wrong):

scorch -D 'size,mtime' -d ./scorch.db check+update /data
scorch -D 'size,mtime' -d ./scorch.db append /data
scorch -D 'size,mtime' -d ./scorch.db cleanup /data

I have tried running scorch update then scorch check but in the intervening period the files can change so I still get errors.

If you didn't like the idea of a separate error code for file corruption, rather than file changes and corruption, would it be possible to add a new command, scorch update+check, which updates the metadata first and then does the hash check?

Scorch exits early on Raspberry Pi3 due to use of sys.maxsize

Hi @trapexit,

I've been using Scorch on a RaspberryPi 3 today and noticed that it exits after a couple of hours (exit code 0) without having processed all the files.

This is caused by the use of sys.maxsize for a number of the default parameter values in Scorch. On the RaspberryPi, due to its CPU architecture, Python3's sys.maxsize == 2147483647, which I believe is causing Scorch to exit.

I know that I can change these options on the command line, but I was also wondering if it might make sense to change the default values instead? Perhaps to 2**64/2, which is roughly the default of sys.maxsize on most x64 systems.

Have a great weekend!
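
Until the defaults change, one workaround is to pass limits explicitly rather
than relying on the sys.maxsize defaults, e.g. raising the data limit, which is
the most likely culprit since 2147483647 bytes is only about 2GB (the directory
here is a placeholder):

$ scorch -M 1024T check+update /path/to/files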
