rocketraman / sane-scan-pdf

Sane command-line scan-to-pdf script on Linux with OCR and deskew support

License: MIT License

SANE Command-Line Scan to PDF

Sane command-line scanning bash shell script on Linux with OCR and deskew support. The script automates common scan-to-pdf operations for scanners with an automatic document feeder, such as the awesome Fujitsu ScanSnap S1500, with output to PDF files.

Tested and run regularly on Fedora, but should work on other distributions with the requirements below.

Features

Join scanned pages into a single output file, or specify a name for each page
Deskew (if supported by scanner driver, or software-based via unpaper)
Crop (if supported by scanner driver)
Creates searchable PDFs (with tesseract)
Duplex (if scanner supports it)
Specify resolution
Truncate n pages explicitly from end of scan e.g. duplex scanning with last page truncated
Skip white-only pages automatically (with ImageMagick)
Specify page width and height for odd size pages, or common sizes (Letter, Legal, A4)
Performance: scanner run in parallel with page post-processing
Limit parallel processing for very fast scanners or constrained environments (if sem installed)
Post-scan open scan output(s) in viewer
Configuration via default and named option groups

Requirements

The following dependencies are requirements of the script. See also Dependencies Installation.

bash
pnmtops (netpbm-progs)
ps2pdf (ghostscript)
pdfunite (poppler-utils)
units (units)
ImageMagick (if --skip-empty-pages or --ocr is used)

Optional

unpaper (for software deskew)
flock (usually provided by util-linux) (for properly ordered verbose logs)
tesseract (to make searchable PDFs)
sem (via GNU parallel, to constrain resource usage during page processing -- install this if you have a fast scanner)
bc (for whitepage detection percentage calculations)
xdg-open (for opening scan after completion)

Getting Started

# scan --help
scan [OPTIONS]... [OUTPUT]

OPTIONS
 -v, --verbose
   Verbose output (this will slow down the scan due to the need to prevent interleaved output)
 -d, --duplex
   Duplex scanning
 -m, --mode
   Mode e.g. Lineart (default), Halftone, Gray, Color, etc. Use --mode-hw-default to not set any mode
 --mode-hw-default
   Do not set the mode explicitly, use the hardware default — ignored if --mode is set
 -r, --resolution
   Resolution e.g 300 (default)
 -a, --append
   Append output to existing scan
 -e, --max <pages>
   Max number of pages e.g. 2 (default is all pages)
 -t, --truncate <pages>
   Truncate number of pages from end e.g. 1 (default is none) -- truncation happens after --skip-empty-pages
 -s, --size
   Page Size as type e.g. Letter (default), Legal, A4, no effect if --crop is specified
 -ph, --page-height
   Custom Page Height in mm
 -pw, --page-width
   Custom Page Width in mm
 -x, --device
   Override scanner device name, defaulting to "fujitsu", pass an empty value for no device arg
 -xo, --driver-options
   Send additional options to the scanner driver e.g.
   -xo "--whatever bar --frobnitz baz"
 --no-default-size
   Disable default page size, useful if driver does not support page size/location arguments
 --crop
   Crop to contents (driver must support this)
 --deskew
   Run driver deskew (driver must support this)
 --unpaper
   Run post-processing deskew and black edge detection (requires unpaper)
 --ocr
   Run OCR to make the PDF searchable (requires tesseract)
 --language <lang>
   which language to use for OCR
 --skip-empty-pages
   remove empty pages from resulting PDF document (e.g. one sided doc in duplex mode)
 --white-threshold
   threshold to identify an empty page is a percentage value between 0 and 100. The default is 99.8
 --brightness-contrast-sw
   Alter brightness and contrast via post-processing - prefer specifying brightness and/or
   contrast via --driver-options if supported by your hardware.
 --open
   After scanning, open the scan via xdg-open
 -og, --option-group
   A named option group. Useful for saving collections of options under a name e.g. 'receipt' for easy reuse.
   Use this option in combination with '--help' to show the location and content of the file and edit it manually.

CONFIGURATION
<not shown, system-specific, run `--help` locally>

Configuration

Use --help locally to show the location of optional configuration and pre-scan hook scripts. These scripts may contain environment variables to pre-configure scan. For example the contents of the default file may be something like:

DEVICE=something
SEARCHABLE=1
MODE_HW_DEFAULT=1

Command line argument --option-group foo (or -og foo) will read the foo file in the standard XDG home config directory (use -og foo --help to see the exact location) for configuration.

For example, if one wishes to scan receipts always with crop, deskew, unpaper post-processing, and making them searchable via OCR, a receipt option group can be created by writing the following to a file named receipt in the config directory:

CROP=1
DESKEW=1
UNPAPER=1
SEARCHABLE=1

Command-line arguments override settings in the default and named configurations. All command-line flags support prefixing with no- in order to override configuration settings. For example, to scan receipts using the option group above, but without making them searchable, you would do:

--option-group receipt --no-searchable
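
As a minimal sketch, the receipt option group described above could be created like this. The helper name write_receipt_group is hypothetical and not part of the script, and the target directory is passed in explicitly because the exact location is system-specific (use -og receipt --help to find it).

```shell
# Hypothetical helper: write the "receipt" option group into the given
# config directory. The directory is an argument on purpose; its real
# location is reported by `scan -og receipt --help`.
write_receipt_group() {
  dir=$1
  mkdir -p "$dir" || return 1
  cat > "$dir/receipt" <<'EOF'
CROP=1
DESKEW=1
UNPAPER=1
SEARCHABLE=1
EOF
}
```

Afterwards, scan --option-group receipt picks up these settings, and --no-searchable still disables OCR for a single run.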

Tips

The default scanner device is set to fujitsu. If you have another scanner, you will need to use the -x/--device argument to specify your scanner, or save a DEVICE=something line to a local config file as shown above. See below for how to get the list of available devices.

If running via scanbd, scanning occurs via the net driver rather than the usual device driver. In this case, it may be necessary to specify the net driver device in the scanbd script. Alternatively, specify no device at all, which lets the script choose the best device both when running via scanbd and when running outside of it. To do this, pass an empty device, i.e. --device "".
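
A scanbd action script along these lines might look like the following sketch. The function name scanbd_scan, the SCAN_BIN variable, and the chosen flags (-d, -r 300) are illustrative assumptions; only --device "" comes from the tip above, and the output file is passed as the positional OUTPUT argument shown in the usage line.

```shell
# Hypothetical wrapper for a scanbd action script.
# SCAN_BIN is an assumed variable pointing at the scan script (default: scan).
scanbd_scan() {
  out_dir=$1
  # Timestamped output name, e.g. scan-2024-01-31-0930.pdf
  out="$out_dir/scan-$(date +%Y-%m-%d-%H%M).pdf"
  # An empty device value means no device argument is passed on.
  "${SCAN_BIN:-scan}" --device "" -d -r 300 "$out"
}
```

Calling scanbd_scan "$HOME/scans" from the scanbd script then behaves the same whether or not scanbd is managing the scanner.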

Scanners and scanner drivers vary in the features they support. This script passes several options to the underlying scanner driver by default, and your scanner or driver may not support all of them. If you receive an error about --page-width/--page-height being unrecognized options, try the --no-default-size option. If you receive an error about the --mode value being invalid, try --mode-hw-default and see below for how to retrieve the list of modes that your system understands.
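
One way to check in advance whether --no-default-size is likely to be needed is to look for the option in the driver's help output. The sketch below assumes the help text lists one option per line; has_driver_option is a hypothetical helper, not something the script provides.

```shell
# Hypothetical helper: succeeds if the help text on stdin lists option $1
# at the start of a line (after optional indentation).
has_driver_option() {
  grep -qE -- "^[[:space:]]*$1([[:space:]]|=|\[|\$)"
}

# Example (needs a scanner, so it is left commented out):
# scanadf -d "$device" --help | has_driver_option --page-width \
#   || echo "driver lacks --page-width, consider --no-default-size"
```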

Helpful Commands

List available scanner devices (for -x/--device argument):

scanadf -L

List available device-specific options, including acceptable values for -m/--mode and -r/--resolution:

scanadf [-d <device>] --help
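
To pull just the device names out of the device listing (for example to paste into a DEVICE= config line), something like the sketch below may help. It assumes each line has the form device `NAME' is a ..., as in the scanimage -L output quoted in one of the issues further down; list_device_names is a hypothetical helper.

```shell
# Hypothetical helper: print only the device names from a SANE device
# listing read on stdin. Lines that do not match the assumed format
# (device `NAME' is a ...) are ignored.
list_device_names() {
  sed -n "s/^device \`\([^']*\)' is .*/\1/p"
}

# Example (needs a scanner, so it is left commented out):
# scanadf -L | list_device_names
```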

Author(s)

Raman Gupta

With assistance from various other contributors! Thank you!

Blog Post Mentions

The following blog posts talk about scanner automation, and mention use of this script. If you create a blog post, please submit a PR and add your link here!

Other Useful Software

OCRmyPDF - forgot to use the --ocr option at scanning time? Use this.

sane-scan-pdf's Issues

OCR option creates large page sizes

As reported by @IBMPortablePc in #9 (comment), when using the OCR option, tesseract creates very large page sizes.

See tesseract-ocr/tesseract#150 for a discussion and possible solution.

multi page scanning

Hello! Is it possible to do multi-page scanning? I tried the -e option but it doesn't seem to do much (although I'm not sure what it does).

Binary name conflict

The binary name scan is a very generic one, already present in many different packages, and there are conflicts. Please choose a less generic global binary name than a simple scan. Users can still have a local alias scan if that's preferred.

units: command not found

Thanks for this script. I know people have had success on Debian / Raspberry Pi. When I run the script I encounter this.

scan: line 201: units: command not found
scan: line 207: units: command not found
Scanning...
scanadf: open of device fujitsu failed: Invalid argument

Any advice greatly appreciated
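
A quick way to spot this kind of problem is to check for the commands from the Requirements list before scanning. The sketch below uses the standard command -v builtin; missing_deps is a hypothetical helper and the package names to install depend on the distribution.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_deps() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Example: check the required tools named in the Requirements section.
# missing_deps scanadf pnmtops ps2pdf pdfunite units
```

Note that the second error in the output above (open of device fujitsu failed) is unrelated to units; it points at the default device name, which the -x/--device option overrides.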

Using AVStream.codec to pass codec parameters to muxers is deprecated...

Is there something that can be done about that message?...

Scanning...
scanadf: rounded value of page-height from 279.4 to 279.406
scanadf: rounded value of page-width from 215.9 to 215.893
scanadf: rounded value of br-x from 215.9 to 215.893
scanadf: rounded value of br-y from 279.4 to 279.406
Scanned document /tmp/scan.us5ex1g0IT/scan-0001
Scanned document /tmp/scan.us5ex1g0IT/scan-0002
Scanned document /tmp/scan.us5ex1g0IT/scan-0003
Scanned document /tmp/scan.us5ex1g0IT/scan-0004
[image2 @ 0x560e268f8c00] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x560e268f8c00] Encoder did not produce proper pts, making some up.
Scanned document /tmp/scan.us5ex1g0IT/scan-0005
[image2 @ 0x5569745a8c00] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x5569745a8c00] Encoder did not produce proper pts, making some up.
Scanned 5 pages
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Page 1
Page 1
[image2 @ 0x5639ad6e2c00] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x5639ad6e2c00] Encoder did not produce proper pts, making some up.
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Page 1
[image2 @ 0x561d9cb7bc00] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x561d9cb7bc00] Encoder did not produce proper pts, making some up.
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Page 1
[image2 @ 0x55d20e2c0c00] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55d20e2c0c00] Encoder did not produce proper pts, making some up.
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Page 1
Processing 5 pages
Concatenating pdfs...

--verbose flag breaks post-processing on Ubuntu/Debian

Debian and Ubuntu use a very old version of netpbm (from 2002) which does not support the -verbose flag. When using the -v flag for the scan script, the pnmtops command fails.

As a workaround I've commented out the PNMVERBOSE variable in line 91 of scan_perpage. Of course it would be nicer to have a version check to ignore it, but I'm not up for that task.

-m Color hangs

First of all thank you for this great script.
Currently using it with a ix1500 and a raspberry pi running ubuntu. All color modes work as expected except -m Color.
Script output:

Scanning...
scanadf: rounded value of page-height from 9999 to 3012.9
scanadf: rounded value of page-width from 9999 to 221.121
scanadf: rounded value of br-x from 9999 to 221.121
scanadf: rounded value of br-y from 9999 to 3012.9
Scanned document /tmp/scan.UFUoyVqpFj/scan-0001
Scanned document /tmp/scan.UFUoyVqpFj/scan-0002
Scanned 2 pages

After that nothing happens, the script hangs.
Verbose mode on:

Scanning...
scanadf: rounded value of page-height from 9999 to 3012.9
scanadf: rounded value of page-width from 9999 to 221.121
scanadf: rounded value of br-x from 9999 to 221.121
scanadf: rounded value of br-y from 9999 to 3012.9
Scanned document /tmp/scan.7yhnGAIvQF/scan-0001
scan_perpage:
scan_perpage: -------------------------------------------------------------------------------
scan_perpage: Post-processing scanned page /tmp/scan.7yhnGAIvQF/scan-0001, deskew=0, searchable=0...
scan_perpage: -------------------------------------------------------------------------------
Scanned document /tmp/scan.7yhnGAIvQF/scan-0002
Scanned 2 pages
scan_perpage: /tmp/scan.7yhnGAIvQF/scan-0001 has 18.072 % white
scan_perpage: Converting image data to pdf...
scan_perpage: Using page options: -equalpixels -dpi=300 -noturn

The script i used:

#!/bin/sh
now=`date +"%Y-%m-%d-%H%M"`
/home/ubuntu/sane-scan-pdf/scan -d -v -r 300 -m Color --crop --skip-empty-pages -o /home/ubuntu/scans/scan-$now.pdf

Does somebody have an idea how to solve this?
Thank you!

Cropping limited to letter size?

I'm working on setting up a basic document scanning station for my monthly bills. Some of the bills are odd sizes, but they closely resemble legal size, maybe 7x14 instead of 8.5x14. When I include the crop option, it crops the side, but also cuts off the bottom 3 inches. If I set the page size to legal, then it works, but that would then apply to every scan. Here's the code that I'm using with a Fujitsu ScanSnap S510.

now=`date +"%Y-%m-%d-%H%M"`
#/home/pi/sane-scan-pdf/scan -d -r 300 -v -m Gray --crop --deskew --ocr -o /home/pi/scans/scan-$now.pdf

Fujitsu driver option support

Thank you for a great script. GUI is well and good, however CLI is often the way to simply get things done. A case in point is the lack of a decent Mac frontend for Sane.

I had no trouble using this script on my Mac (OS X 10.14.3), once I installed the dependencies with Brew. In order to install scanadf I had to compile the Sane frontends; however, this went smoothly.

However, the Fujitsu driver (for my ScanSnap S510M) supports brightness and contrast (see https://fossies.org/dox/sane-backends-1.0.27/fujitsu_8c_source.html etc.) and I suspect that I can use those options by tweaking your script?

Also, I find the cropping to be a little too random i.e. sometimes it is too aggressive, however I guess that's simply how the driver works.

Lastly, the OCR option works perfectly except that the resulting PDF is extremely large i.e. A0 or A1 in size even if the original is A5. Any thoughts on this would be appreciated although I suspect it's an issue with how one of the dependencies works on Macs.

no decode delegate for this image format

Any idea why the script is failing on the Brother DCP-L2541 printer? I have imagemagick 7.1.0.25-1 installed on the Arch system.

$ ./scan -x "brother4:net1;dev0"  -r 300 -v --mode-hw-default  --no-default-size --ocr --unpaper  -a -o /tmp/a.pdf
Scanning...
scanadf: rounded value of br-x from 211.9 to 211.881
scanadf: rounded value of br-y from 355.6 to 355.567
Scanned document /tmp/scan.PgVN9BfQhp/scan-0001
scan_perpage: 
scan_perpage: -------------------------------------------------------------------------------
scan_perpage: Post-processing scanned page /tmp/scan.PgVN9BfQhp/scan-0001, deskew=1, searchable=1, skip-empty-pages=0, white-threshold=99.8, brightness-contrast-sw=...
scan_perpage: -------------------------------------------------------------------------------
scan_perpage: Applying unpaper post-processing to image data...
unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

-------------------------------------------------------------------------------
Processing sheet #1: /tmp/scan.PgVN9BfQhp/scan-0001 -> /tmp/scan.PgVN9BfQhp/unpaper-scan-0001
input-file for sheet 1: /tmp/scan.PgVN9BfQhp/scan-0001
output-file for sheet 1: /tmp/scan.PgVN9BfQhp/unpaper-scan-0001
sheet size: 2480x4175
...
noise-filter ... deleted 879 clusters.
blur-filter... deleted 903 pixels.
auto-masking (1240,2087): 875,0,1970,4174
gray-filter... deleted 44871600 pixels.
auto-masking (1240,2087): 875,0,1970,4174
detected rotation left: [875,0,1970,4174]: 0.003491
detected rotation right: [875,0,1970,4174]: 0.003491
rotation average: 0.003491  deviation: 0.000000  rotation-scan-deviation (maximum): 0.017453  [875,0,1970,4174]
rotate (1240,2087): 0.003491
auto-masking (1240,2087): 875,0,1965,4174
centering mask [875,0,1965,4174] (1240,2087): -180, 0
border detected: (0,85,1,1181) in [0,0,2479,4174]
aligning mask [0,85,2478,2993] (0,632): 0, 547
writing output.

real    0m4.642s
user    0m4.529s
sys     0m0.110s
scan_perpage: Converting image data to searchable pdf...
scan_perpage: ...Running convert
convert: no decode delegate for this image format `' @ error/constitute.c/ReadImage/737.
convert: no images defined `/tmp/scan.PgVN9BfQhp/unpaper-scan-0001.tiff' @ error/convert.c/ConvertImageCommand/3325.

real    0m0.025s
user    0m0.006s
sys     0m0.003s
scan_perpage: ...Running tesseract
Error, cannot read input file /tmp/scan.PgVN9BfQhp/unpaper-scan-0001.tiff: No such file or directory
Error during processing.

real    0m0.207s
user    0m0.122s
sys     0m0.033s
scan_perpage: 
scan_perpage: Scan page processing done, status = 1
Scanned document /tmp/scan.PgVN9BfQhp/scan-0002
Scanned 2 pages
scan_perpage: 
scan_perpage: -------------------------------------------------------------------------------
scan_perpage: Post-processing scanned page /tmp/scan.PgVN9BfQhp/scan-0002, deskew=1, searchable=1, skip-empty-pages=0, white-threshold=99.8, brightness-contrast-sw=...
scan_perpage: -------------------------------------------------------------------------------
scan_perpage: Applying unpaper post-processing to image data...
unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

-------------------------------------------------------------------------------
Processing sheet #1: /tmp/scan.PgVN9BfQhp/scan-0002 -> /tmp/scan.PgVN9BfQhp/unpaper-scan-0002
input-file for sheet 1: /tmp/scan.PgVN9BfQhp/scan-0002
output-file for sheet 1: /tmp/scan.PgVN9BfQhp/unpaper-scan-0002
sheet size: 2480x4175
...
noise-filter ... deleted 217 clusters.
blur-filter... deleted 10 pixels.
auto-masking (1240,2087): 285,0,1940,4174
gray-filter... deleted 45649100 pixels.
auto-masking (1240,2087): 285,0,1935,4174
detected rotation left: [285,0,1935,4174]: 0.010472
detected rotation right: [285,0,1935,4174]: -0.033161
rotation average: -0.011345  deviation: 0.030853  rotation-scan-deviation (maximum): 0.017453  [285,0,1935,4174]
out of deviation range - NO ROTATING
rotate (1240,2087): 0.000000
auto-masking (1240,2087): 285,0,1935,4174
centering mask [285,0,1935,4174] (1240,2087): 130, 0
border detected: (0,75,1,1181) in [0,0,2479,4174]
aligning mask [0,75,2478,2993] (0,627): 0, 552
writing output.

real    0m3.907s
user    0m3.846s
sys     0m0.050s
scan_perpage: Converting image data to searchable pdf...
scan_perpage: ...Running convert
convert: no decode delegate for this image format `' @ error/constitute.c/ReadImage/737.
convert: no images defined `/tmp/scan.PgVN9BfQhp/unpaper-scan-0002.tiff' @ error/convert.c/ConvertImageCommand/3325.

real    0m0.005s
user    0m0.005s
sys     0m0.000s
scan_perpage: ...Running tesseract
Error, cannot read input file /tmp/scan.PgVN9BfQhp/unpaper-scan-0002.tiff: No such file or directory
Error during processing.

real    0m0.127s
user    0m0.088s
sys     0m0.039s
scan_perpage: 
scan_perpage: Scan page processing done, status = 1
Processing 2 pages
Concatenating pdfs...
Syntax Error: Document stream is empty
Syntax Error: Could not merge damaged documents ('/tmp/scan.PgVN9BfQhp/scan-0001.pdf')

Done.

Fast scanning ADF with long post-processing steps will consume all resources

Since every page will spawn a new instance of the scan_perpage script (unless verbose logging is enabled) and if the scanner is scanning pages rapidly, it'll spawn too many processes and consume all resources as a result.

Perhaps the script should limit the number of concurrent scan_perpage processes to the number of CPU cores the host has.
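
A minimal sketch of that idea: derive a job limit from the CPU count and hand it to sem (from GNU parallel, listed under the optional dependencies). The function name job_limit, the fallback value of 2, and the semaphore id are assumptions for illustration and not how the script necessarily does it.

```shell
# Hypothetical helper: print the number of post-processing jobs to allow,
# using the online CPU count and falling back to 2 if it cannot be found.
job_limit() {
  n=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)
  case $n in
    ''|*[!0-9]*) n=2 ;;   # not a plain number: use the fallback
  esac
  echo "$n"
}

# Example with GNU parallel's sem (left commented out; needs sem installed):
# sem --id scan_perpage -j "$(job_limit)" ./scan_perpage "$page"
```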

When calling sane-scan-pdf from scanbd, it is run with euid root, causing permission errors

I'm still trying to get sane-scan-pdf to work with scanbd. After solving some configuration issues, I am now stuck.

In the debian bugtracker, there is an unresolved bug from 2016 that seems to describe my problem fairly well:

when configuring user = saned in the global section of scanbd.conf, scanbd itself is running as user saned, but the effective UID remains root. This leads to the situation that any scripts that are executed by scanbd, will act as user root. The expected result is, that the effective UID is set to saned as well.

This seems to be the root of the permissions issues I [1] and others [2] have been facing.

After working around the units issue by adding --no-default-size [3], I got very similar issues later in post processing with parallel: parallel: Error: Cannot change into non-executable dir /root/.parallel: Permission denied.

As described in the bug report quoted above, I configured scanbd to drop its privileges to user saned (114) and group scanner (110). On startup of scanbd, I do get the following output, which looks promising:

scanbd: drop privileges to gid: 114
scanbd: Running as effective gid 114
scanbd: drop privileges to uid: 110
scanbd: Running as effective uid 110

However, when pressing the scan button, no matter what I set scanbd.conf's user setting to, scanbd always outputs scanbd: setting env: USER=root and scanbd: setting env: HOME=/root when scanning, despite the debug output at startup. This seems to be in line with the quoted scanbd bug report (see debug log at the end).

In the shim script that calls sane-scan-pdf, I added a line HOME=/home/username which got me past the parallel: Error: Cannot change into non-executable dir /root/.parallel: Permission denied error, but then caused all sorts of other issues until I exited the shell, so I guess that's not really a good idea either.
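
If overriding HOME is attempted at all, one option is to scope it to the single scan command rather than the whole shell, so nothing else in the session sees the changed value. The sketch below uses env for that; with_home is a hypothetical helper, the paths are placeholders, and this only works around the HOME lookup. It does not change the effective UID that scanbd runs the script with.

```shell
# Hypothetical helper: run a command with HOME set to $1 for that command
# only. The calling shell's own HOME is left untouched.
with_home() {
  home=$1
  shift
  env HOME="$home" "$@"
}

# Example for a scanbd shim (left commented out; paths are placeholders):
# with_home /home/username /home/username/sane-scan-pdf/scan --skip-empty-pages
```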

Do you have any idea how to solve this? I wonder how @nyn44e got it to work?

scanbd: value trigger: numerical
scanbd: trigger action for scan for device epjitsu:libusb:001:005 with script epjitsu.script
scanbd: get_sane_option_value
scanbd: Value of mode as string (len 7, hash -621353420): Lineart
scanbd: setting env: SCANBD_FUNCTION_MODE=Lineart
scanbd: setting env: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
scanbd: No PWD, setting env: PWD=/home/username
scanbd: setting env: USER=root
scanbd: setting env: HOME=/root
scanbd: setting env: SCANBD_DEVICE=epjitsu:libusb:001:005
scanbd: setting env: SCANBD_ACTION=scan
scanbd: append string epjitsu:libusb:001:005 to signal scan_begin
scanbd: now sending signal scan_begin
scanbd: append string SCANBD_FUNCTION_MODE=Lineart to signal trigger
scanbd: append string PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin to signal trigger
scanbd: append string PWD=/home/user to signal trigger
scanbd: append string USER=root to signal trigger
scanbd: append string HOME=/root to signal trigger
scanbd: append string SCANBD_DEVICE=epjitsu:libusb:001:005 to signal trigger
scanbd: append string SCANBD_ACTION=scan to signal trigger
scanbd: now sending signal trigger
scanbd: now flushing the dbus
scanbd: unref the signal
scanbd: using relative script path: epjitsu.script, expanded to: /home/username/scanbd-scripts/epjitsu.script
scanbd: waiting for child: /home/username/scanbd-scripts/epjitsu.script
scanbd: setgid to gid=114
scanbd: setuid to uid=1000
scanbd: exec for /home/username/scanbd-scripts/epjitsu.script
scanbd: octal mode for /home/username/scanbd-scripts/epjitsu.script: 100755
scanbd: file uid: 1000, file gid: 1000
Scanning...
Scanned document /tmp/scan.H396FvHBB3/scan-0001
Scanned document /tmp/scan.H396FvHBB3/scan-0002
scan_perpage: 
scan_perpage: -------------------------------------------------------------------------------
scan_perpage: Post-processing scanned page /tmp/scan.H396FvHBB3/scan-0001, deskew=0, searchable=0, skip-empty-pages=1, white-threshold=99.8, brightness-contrast-sw=...
scan_perpage: -------------------------------------------------------------------------------
Scanned 2 pages
scan_perpage: /tmp/scan.H396FvHBB3/scan-0001 has 95.2695 % white
scan_perpage: Converting image data to pdf...
scan_perpage: ...Running pnmtops on /tmp/scan.H396FvHBB3/scan-0001 using page options: -equalpixels -dpi=300 -noturn
parallel: Error: Cannot change into non-executable dir /root/.parallel: Permission denied
scan_perpage: ...Running ps2pdf on /tmp/scan.H396FvHBB3/scan-0001.ps
parallel: Error: Cannot change into non-executable dir /root/.parallel: Permission denied

scanimage instead of scanadf

is it possible to use the script with scanimage instead of scanadf?
I'm using Manjaro and there is no scanadf.

Simulated duplex scanning with page re-ordering

I have a Brother printer (DCP-L2541DW) which has a flatbed scanner and an ADF. The scanner does not do duplex scanning though. At the moment I am using brscan to enable 'Scan to PC' on the printer. This way I can scan a document with the press of a button.

As I mentioned, the printer can only do single-sided scanning. After it has scanned the page, the printer LCD asks whether to scan another page or continue. Unfortunately, brscan (using scanimage) does not scan the second page at all. I am curious whether your code can be used with brscan for manual duplex scanning using the flatbed scanner?
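
The re-ordering part of simulated duplex can be sketched independently of the scanner. The example below assumes the fronts were scanned in order, the stack was then flipped and the backs scanned in reverse order, and that each page was saved as a separate PDF; the file names front-N.pdf and back-N.pdf and the function duplex_order are hypothetical.

```shell
# Hypothetical helper: print the merge order for $1 sheets, interleaving
# front-1..front-n with the backs, which were scanned last sheet first
# (so back-1.pdf is the back of sheet n).
duplex_order() {
  n=$1
  i=1
  while [ "$i" -le "$n" ]; do
    printf 'front-%d.pdf\nback-%d.pdf\n' "$i" "$((n - i + 1))"
    i=$((i + 1))
  done
}

# Example merge with pdfunite (left commented out; needs the page files):
# pdfunite $(duplex_order 3) combined.pdf
```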

Rotate

I have a ScanSnap S1500 and I can't figure out how to rotate all the images 180 degrees, since they all come out upside down... Can anyone point me in the right direction?

Scan quality Fujitsu Software vs Sane?

Are there any good defaults to achieve a similar scan quality like with the "original" fujitsu software?

If I switch from LineArt to Color --skip-empty-pages is no longer working
Setting a size does not work on the S1300i (only on the S1500); I need --no-default-size, otherwise I get "scanadf: invalid option -- 'y'"
When scanning without a default size I see the borders

I would appreciate some tips

Thanks for the great script

bc appears to be required in default configuration

The readme says bc is optional, but it appears to be required in the default configuration:

tinfoil%  ./scan -x 'brother4:bus6;dev1' --no-default-size --mode-hw-default 
Scanning...
scanadf: rounded value of br-x from 211.9 to 211.881
scanadf: rounded value of br-y from 355.6 to 355.567
Scanned document /tmp/scan.G03GKrmbGL/scan-0001
Scanned 1 pages
/home/scan/sane-scan-pdf/scan_perpage: line 81: bc: command not found
Found no scans.

In looking at its invocation in scan_perpage, I think it is always invoked if the scan_perpage script is invoked.
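
For illustration, the white-page threshold comparison could be done without bc by using awk, which is present on practically every system. This is only a sketch of the comparison step, not the code in scan_perpage; page_is_blank is a hypothetical name, and the default of 99.8 mirrors the documented --white-threshold default.

```shell
# Hypothetical helper: succeed if the white percentage $1 is at or above
# the threshold $2 (default 99.8). Uses awk for the floating-point compare.
page_is_blank() {
  awk -v pct="$1" -v thr="${2:-99.8}" 'BEGIN { exit !(pct + 0 >= thr + 0) }'
}

# Example:
# page_is_blank 95.2695 && echo "skip page" || echo "keep page"
```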

Page not aligning correctly

RaspberryPi 3+
Ubuntu Server 23.04
Epson DS-310

When scanning with just a scanadf command issued by hand, the scanner is able to scan the whole page. During post-processing something must go wrong, as the top of the page is cut off from the final converted PDF. It looks like the original image is moved out of the top of the resulting page. --> In this case the resulting page is A4-sized. Part of the scanned top is missing while whitespace below the scanned page is added.

Please tell me what I need to provide in order for you to fix the problem. Thanks in advance!

#EDIT: Not happening when using OCR option. --> In this case the resulting page is a custom size that is way longer than A4. The top is fully there but also whitespace is added below.

Setting SOURCE=ADF doesn't work on Brother MFC-L2700DW

Also tried
SOURCE="ADF"

It's not a big problem as I can use
DRIVER_OPTION="--source ADF"

But in case it points to a problem in the script, could you please help me understand what is going wrong?

Options specific to device `airscan:w1:Brother':
  Standard:
    --resolution 100|200|300dpi [300]
        Sets the resolution of the scanned image.
    --mode Color|Gray [Color]
        Selects the scan mode (e.g., lineart, monochrome, or color).
    --source Flatbed|ADF [Flatbed]
        Selects the scan source (such as a document-feeder).
  Geometry:
    -l 0..215.9mm [0]
        Top-left x position of scan area.
    -t 0..297.18mm [0]
        Top-left y position of scan area.
    -x 0..215.9mm [215.9]
        Width of scan-area.
    -y 0..297.18mm [297.18]
        Height of scan-area.
  Enhancement:
    --brightness -100..100% (in steps of 1) [0]
        Controls the brightness of the acquired image.
    --contrast -100..100% (in steps of 1) [0]
        Controls the contrast of the acquired image.
    --shadow 0..100% (in steps of 1) [0]
        Selects what radiance level should be considered "black".
    --highlight 0..100% (in steps of 1) [100]
        Selects what radiance level should be considered "white".
    --analog-gamma 0.0999908..4 [1]
        Analog gamma-correction
    --negative[=(yes|no)] [no]
        Swap black and white

Adjust brightness and optimise white page recognition

Thanks for your script, it makes the scanworld easier. :) I came from this page.

I have two questions:

1. Is it possible to implement a parameter to adjust the brightness? I scan letters on white paper, but the result is quite dark. The corresponding software for my Fujitsu iX500 (ScanSnap Home) returns much brighter results.
2. I use the duplex function and the option to skip empty pages. However, completely white pages are recognised as 0% white. Is it possible to optimise the calculation? This question is probably connected to the first one.

Thanks in advance.

usage with scanbd: invalid argument when script is executed directly

Hi,

first of all thanks for making this great script. It's a real time-saver.

I have a Fujitsu S1500 connected to a raspberrypi that also runs scanbd to enable me to scan stuff by pressing the button on the scanner.
Scanbd is configured to execute the script scan.sh upon signal from the button:

scan.sh
#!/bin/sh
now=`date +"%Y-%m-%d-%H%M"`
/home/pi/sane-scan-pdf/scan -d -r 300 -v -m Lineart --skip-empty-pages -o /home/pi/scans/scan-$now.pdf

This works great when I press the button. However, when I execute the script directly via ./scan.sh, I get the following error message from scanadf:
scanadf: open of device fujitsu failed: Invalid argument

If I just execute scanadf > tmp.jpg (or something) it scans fine.

If I change the DEVICE name in sane-scan-pdf/scan to the actual SANE-name of the scanner,

scanimage -L
device `net:localhost:fujitsu:ScanSnap S1500:300802' is a FUJITSU ScanSnap S1500 scanner

I can execute the script directly and I can scan with pressing the button, however there's a delay of almost 25s between press of the button and the start of the scan.

If I disable scanbd I can execute the script directly, no problems.

All of that makes me think that the script is somehow trying to call upon the scanner device directly or at least locally which it should not. It's supposed to go through scanbd.

Is there a way to just get rid of the device specification and let scanadf pick the device (which does work, see above)?
I cannot figure out how to achieve this and would appreciate any help.

Kind regards,
nyn

Improve OCR layer compatibility with MacOS Preview via hocr renderer

I do duplex scans only with my Fujitsu ScanSnap S1500M and edit the resulting searchable PDF files in macOS Preview to remove any pages I don't want. However, the PDF files combined by pdfunite, once edited, make the embedded OCR'd text into strings of question-mark-in-box characters. This stops the files from being searchable.

Changing the scan script as follows does NOT solve this problem, and neither did a number of other permutations of gs parameters for PDF output that I tried.

310c310
<           pdfunite "${pdffiles[@]}" "${OUTPUT[$index]}" && rm $TMP_DIR/scan-*(0)$scanno.pdf
---
>           gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -dDetectDuplicateImages -dCompressFonts=true -r150 -sOutputFile="${OUTPUT[$index]}" "${pdffiles[@]}" && rm $TMP_DIR/scan-*(0)$scanno.pdf
329c329
<     pdfunite "${pdffiles[@]}" "$OUTPUT" && rm $TMP_DIR/scan-[0-9]*.pdf
---
>     gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -dDetectDuplicateImages -dCompressFonts=true -r150 -sOutputFile="$OUTPUT" "${pdffiles[@]}" && rm $TMP_DIR/scan-[0-9]*.pdf

Verified that this problem occurs on a different computer running a different macOS version.

The commercial ScanSnap software and included OCR produces PDF files that can be edited and the OCR'd text remains usable.

scanadf: open of device fujitsu failed: Invalid argument

I meant to say the script fails on this. Thanks Tim

Scan on Brother DCP-L3550CDW from ADF fails with `unrecognized option '--page-height'`

Scan command I'm using:

./sane-scan-pdf/scan --device airscan:e0:Brother --mode Color --size A4

Error I'm getting:

3
Scanning...
scanadf: unrecognized option '--page-height'
Found no scans.

When using the --no-default-size parameter, the scan area is wrong: the beginning of the A4 page is cropped and the bottom of the scan is empty/white.

A workaround I used was to remove the --page-height and --page-width parameters from the script using this patch:

diff --git a/scan b/scan
index 27962c8..cb2b04a 100755
--- a/scan
+++ b/scan
@@ -333,13 +333,13 @@ fi
 
 if [[ $CROP != 1 && "$PGHEIGHT" != "" ]]; then
   PGHEIGHTIN=$(units --compact -1 "$PGHEIGHT mm" 'in')
-  PGHEIGHT="--page-height $PGHEIGHT -y $PGHEIGHT"
+  PGHEIGHT="-y $PGHEIGHT"
   PS2PDF_OPTS="-dEPSCrop"
 fi
 
 if [[ $CROP != 1 && "$PGWIDTH" != "" ]]; then
   PGWIDTHIN=$(units --compact -1 "$PGWIDTH mm" 'in')
-  PGWIDTH="--page-width $PGWIDTH -x $PGWIDTH"
+  PGWIDTH="-x $PGWIDTH"
   PS2PDF_OPTS="-dEPSCrop"
 fi

Options supported by my scanner (Brother DCP-L3550CDW):

$ scanadf --help
Usage: scanadf [OPTION]...

Start image acquisition on a scanner device and write image data to
output files.

   [ -d | --device-name <device> ]   use a given scanner device.
   [ -h | --help ]                   display this help message and exit.
   [ -L | --list-devices ]           show available scanner devices.
   [ -v | --verbose ]                give even more status messages.
   [ -V | --version ]                print version information.
   [ -N | --no-overwrite ]           don't overwrite existing files.

   [ -o | --output-file <name> ]     name of file to write image data
                                     (%d replacement in output file name).
   [ -S | --scan-script <name> ]     name of script to run after every scan.
   [ --script-wait ]                 wait for scripts to finish before exit
   [ -s | --start-count <num> ]      page count of first scanned image.
   [ -e | --end-count <num> ]        last page number to scan.
   [ -r | --raw ]                    write raw image data to file.
scanadf: rounded value of br-x from 211.9 to 211.881
scanadf: rounded value of br-y from 355.6 to 355.567

Options specific to device `brother4:net1;dev0':
  Mode:
    --mode Black & White|Gray[Error Diffusion]|True Gray|24bit Color[Fast] [24bit Color[Fast]]
        Select the scan mode
    --resolution 100|150|200|300|400|600|1200|2400|4800|9600dpi [200]
        Sets the resolution of the scanned image.
    --source FlatBed|Automatic Document Feeder(left aligned)|Automatic Document Feeder(centrally aligned) [Automatic Document Feeder(left aligned)]
        Selects the scan source (such as a document-feeder).
    --brightness -50..50% (in steps of 1) [inactive]
        Controls the brightness of the acquired image.
    --contrast -50..50% (in steps of 1) [inactive]
        Controls the contrast of the acquired image.
  Geometry:
    -l 0..211.9mm (in steps of 0.0999908) [0]
        Top-left x position of scan area.
    -t 0..355.6mm (in steps of 0.0999908) [0]
        Top-left y position of scan area.
    -x 0..211.9mm (in steps of 0.0999908) [211.881]
        Width of scan-area.
    -y 0..355.6mm (in steps of 0.0999908) [355.567]
        Height of scan-area.

Type ``scanadf --help -d DEVICE'' to get list of all options for DEVICE.

List of available devices:
    brother4:net1;dev0 v4l:/dev/video2 v4l:/dev/video0
    airscan:e0:Brother DCP-L3550CDW series
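
Since the device-specific help lists only -l, -t, -x and -y under Geometry, one way to avoid the failure would be to check the help text before passing an option. The following is a minimal sketch, not code from the script; the function name help_lists_option and the sample text are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Return success if the option named in $1 starts an (indented) option line
# in the scanadf help text read from stdin.
help_lists_option() {
  local opt=$1
  grep -Eq -- "^[[:space:]]*${opt}([[:space:]=]|\$)"
}

# Fragment of the device help output quoted in this issue.
sample_help='  Geometry:
    -l 0..211.9mm (in steps of 0.0999908) [0]
    -x 0..211.9mm (in steps of 0.0999908) [211.881]
    -y 0..355.6mm (in steps of 0.0999908) [355.567]'

if help_lists_option '--page-height' <<<"$sample_help"; then
  echo "page-height: supported"
else
  echo "page-height: not supported"
fi
```

Against a real scanner the help text would come from scanadf --help -d DEVICE, piped into the function in place of the sample string.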

Add example command for installing dependencies to README.md

Can you please add a line in the Requirements section on how to install the dependencies?

The package names for Ubuntu are:

sudo apt install netpbm ghostscript poppler-utils imagemagick unpaper util-linux tesseract-ocr parallel

If you prefer, I can add this in a pull request.
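
To confirm the install afterwards, a short check along these lines could report which required commands are still missing. It is only a sketch: the command names are taken from the Requirements list in the README, and the function name missing_commands is made up for this example.

```shell
#!/usr/bin/env bash
# Print the names of commands from the argument list that are not on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Required tools named in the README's Requirements section.
required=(pnmtops ps2pdf pdfunite units)

missing=$(missing_commands "${required[@]}")
if [[ -n "$missing" ]]; then
  # Unquoted on purpose so the names are printed on one line.
  echo "Missing commands:" $missing
else
  echo "All required commands found."
fi
```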

How to select the ADF as a source for scanning

How to select the ADF as a source for scanning?

I have a Brother device (Brother DCP-L3550CDW series) and would like to scan from the ADF. The script doesn't seem to allow me to select the ADF as a source. As a workaround, I modified the script and hardcoded the --source ADF parameter with this patch:

diff --git a/scan b/scan
index 27962c8..4c13430 100755
--- a/scan
+++ b/scan
@@ -301,6 +301,8 @@ fi
 
 if [[ $DUPLEX == 1 ]]; then
   SOURCE="--source \"ADF Duplex\""
+else
+  SOURCE="--source ADF"
 fi
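
Source names differ between backends (the brother4 help output quoted elsewhere in this tracker lists "Automatic Document Feeder(left aligned)" rather than "ADF"), so a hardcoded value may not carry over to other devices. As a rough sketch, the names a device accepts could be extracted from its help text as shown here; list_sources is a hypothetical helper, not part of the script.

```shell
#!/usr/bin/env bash
# Print each value offered by the "--source" option, one per line, given
# scanadf device help text on stdin. The trailing "[default]" is dropped.
list_sources() {
  sed -n 's/^[[:space:]]*--source[[:space:]]\{1,\}\(.*\)[[:space:]]\[[^]]*][[:space:]]*$/\1/p' |
    tr '|' '\n'
}

# The --source line from the brother4 help output quoted in this tracker.
sample_line='    --source FlatBed|Automatic Document Feeder(left aligned)|Automatic Document Feeder(centrally aligned) [Automatic Document Feeder(left aligned)]'

list_sources <<<"$sample_line"
```

With a scanner attached, the input would be the output of scanadf --help -d DEVICE instead of the sample line.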

Batch scan into single files doesn't work

See title. I tried a scan with 9 pages and --output_list option but I only get 3 output files with a single page.

I can confirm that all 9 pages were created successfully inside the tmp folder.

I replaced the while loop inside the "Naming pdfs based on output list..." section with the following code and this works:

# Compare numerically; [[ a < b ]] would compare the values as strings.
while (( index < numscans )); do
  let "scanno = $index + 1"
  mv $TMP_DIR/scan-*(0)$scanno.pdf .
  let "index = $index + 1"
done

my scan call:

scan -r 300 -v -m Lineart -l a b c d e f g h i

As I mentioned, with 9 single pages I only get the files a, b and c.
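
For illustration, the intended one-page-per-name mapping could look roughly like the sketch below. It is not the script's code: rename_pages is a made-up helper, and the scan-0001.pdf style file naming is an assumption based on the scan-%04d pattern that scanadf is given.

```shell
#!/usr/bin/env bash
# Move per-page PDFs named scan-0001.pdf, scan-0002.pdf, ... from directory $1
# to the output names given in the remaining arguments, one page per name.
# Prints the number of pages that were renamed.
rename_pages() {
  local dir=$1; shift
  local index=0 name src
  for name in "$@"; do
    printf -v src '%s/scan-%04d.pdf' "$dir" "$((index + 1))"
    [[ -f "$src" ]] || break   # no more scanned pages
    mv -- "$src" "$name"
    index=$((index + 1))
  done
  echo "$index"
}

# Demo with nine empty placeholder files standing in for scanned pages.
tmp=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8 9; do
  : > "$(printf '%s/scan-%04d.pdf' "$tmp" "$i")"
done
(cd "$tmp" && rename_pages "$tmp" a b c d e f g h i)
rm -rf "$tmp"
```

Counting with shell arithmetic rather than a string comparison keeps the loop correct once there are ten or more pages.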

My scanadf does not recognize --page-height

The default invocations of scan aren't working for me, and I would appreciate any assistance with figuring out what I am missing.

The minimum invocation with specifying the device only fails with scanadf not recognizing the --page-height option:

scan% bash -x  ./scan -x 'brother4:bus7;dev1'                          
...
+ eval scanadf -d ''\''brother4:bus7;dev1'\''' --page-height 279.4 -y 279.4 --page-width 215.9 -x 215.9 -S /home/bootstrap/sane-scan-pdf/scan_perpage --script-wait --resolution 300 Lineart 0 0 -o /tmp/scan.EvTCrlOk58/scan-%04d
++ scanadf -d 'brother4:bus7;dev1' --page-height 279.4 -y 279.4 --page-width 215.9 -x 215.9 -S /home/bootstrap/sane-scan-pdf/scan_perpage --script-wait --resolution 300 Lineart 0 0 -o /tmp/scan.EvTCrlOk58/scan-%04d
scanadf: unrecognized option '--page-height'

After I add --no-default-size, this works, but if I add --crop, scan puts the --page-height option back in, which once again fails:

scan% bash -x  ./scan -x 'brother4:bus7;dev1'  --no-default-size -m '' --unpaper --crop
+ eval scanadf -d ''\''brother4:bus7;dev1'\''' --page-height 9999 -y 9999 --page-width 9999 -x 9999 -S /home/bootstrap/sane-scan-pdf/scan_perpage --script-wait --resolution 300 0 --swcrop=yes --ald=yes -o /tmp/scan.u3O3Dx0Zi8/scan-%04d
++ scanadf -d 'brother4:bus7;dev1' --page-height 9999 -y 9999 --page-width 9999 -x 9999 -S /home/bootstrap/sane-scan-pdf/scan_perpage --script-wait --resolution 300 0 --swcrop=yes --ald=yes -o /tmp/scan.u3O3Dx0Zi8/scan-%04d
scanadf: unrecognized option '--page-height'

The scanadf manpage does not appear to mention --page-height. Is that option supposed to be recognized by scanadf or by some other program?

My software versions:

scan% dpkg -l|grep sane                                                                
ii  libsane:amd64                        1.0.31-4                           amd64        API library for scanners [transitional package]
ii  libsane-common                       1.0.31-4                           all          API library for scanners -- documentation and support files
ii  libsane1:amd64                       1.0.31-4                           amd64        API library for scanners
ii  sane                                 1.0.14-16                          amd64        scanner graphical frontends
ii  sane-utils                           1.0.31-4                           amd64        API library for scanners -- utilities

I saw that the current sane version is 1.0.32, so I retrieved the git source for sane-frontends, and I'm still not seeing the --page-height option in scanadf there.

Appreciate any insights as to how to get the --page-height working. Alternatively I can submit a PR that would make scan not pass it (via a new option perhaps?) if this is preferable.
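
As a rough sketch of what such an option might control, the geometry arguments could be built conditionally. Everything here is hypothetical (the geometry_args name and the use_page_opts switch are not in the script); it only mirrors the -y/-x values visible in the trace above.

```shell
#!/usr/bin/env bash
# Build scanadf geometry arguments for a page of HEIGHT x WIDTH millimetres.
# When use_page_opts is 1, --page-height/--page-width are included as in the
# trace above; otherwise only -y/-x are emitted.
geometry_args() {
  local height=$1 width=$2 use_page_opts=$3
  local -a args=()
  if [[ $use_page_opts == 1 ]]; then
    args+=(--page-height "$height" --page-width "$width")
  fi
  args+=(-y "$height" -x "$width")
  printf '%s\n' "${args[*]}"
}

geometry_args 279.4 215.9 0   # prints: -y 279.4 -x 215.9
```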

units: cannot open file '/root/.units': Permission denied

I'm trying to get scanbd to call sane-scan-pdf when pressing the scan button on my Fujitsu S1300i.

When I run scanbd with sudo SANE_CONFIG_DIR=/etc/scanbd scanbd -d7 -f and press the scan button, I get the following debug output. Note the two lines units: cannot open file '/root/.units': Permission denied.

I don't know how to fix this...

scanbd: value trigger: numerical
scanbd: trigger action for scan for device epjitsu:libusb:001:004 with script epjitsu.script
scanbd: get_sane_option_value
scanbd: Value of mode as string (len 7, hash -621353420): Lineart
scanbd: setting env: SCANBD_FUNCTION_MODE=Lineart
scanbd: setting env: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
scanbd: No PWD, setting env: PWD=/home/username
scanbd: setting env: USER=root
scanbd: setting env: HOME=/root
scanbd: setting env: SCANBD_DEVICE=epjitsu:libusb:001:004
scanbd: setting env: SCANBD_ACTION=scan
scanbd: append string epjitsu:libusb:001:004 to signal scan_begin
scanbd: now sending signal scan_begin
scanbd: Iteration on dbus call
scanbd: append string SCANBD_FUNCTION_MODE=Lineart to signal trigger
scanbd: append string PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin to signal trigger
scanbd: append string PWD=/home/username to signal trigger
scanbd: append string USER=root to signal trigger
scanbd: append string HOME=/root to signal trigger
scanbd: append string SCANBD_DEVICE=epjitsu:libusb:001:004 to signal trigger
scanbd: append string SCANBD_ACTION=scan to signal trigger
scanbd: now sending signal trigger
scanbd: now flushing the dbus
scanbd: unref the signal
scanbd: using relative script path: epjitsu.script, expanded to: /usr/share/scanbd/scripts/epjitsu.script
scanbd: waiting for child: /usr/share/scanbd/scripts/epjitsu.script
scanbd: setgid to gid=114
scanbd: setuid to uid=110
scanbd: exec for /usr/share/scanbd/scripts/epjitsu.script
scanbd: octal mode for /usr/share/scanbd/scripts/epjitsu.script: 100755
scanbd: file uid: 0, file gid: 0
units: cannot open file '/root/.units': Permission denied
units: cannot open file '/root/.units': Permission denied
Scanning...
scanadf: open of device fujitsu failed: Invalid argument
Found no scans.
scanbd: child /usr/share/scanbd/scripts/epjitsu.script exited with status: 0
scanbd: Iteration on dbus call
scanbd: append string epjitsu:libusb:001:004 to signal scan_end
scanbd: now sending signal scan_end
scanbd: reopen device epjitsu:libusb:001:004
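
The log suggests that scanbd exports HOME=/root and then drops to uid 110 before running the script, so units looks for /root/.units as a user that cannot read it. One possible workaround, sketched here under that assumption, is to reset HOME at the top of the scanbd action script. The fix_home name and the install path in the comment are hypothetical; SCANBD_DEVICE is the variable shown in the log.

```shell
#!/usr/bin/env bash
# If HOME is unset or not readable by the current user, point it at a fresh
# temporary directory so tools that look for dotfiles there do not fail.
fix_home() {
  if [[ -z "${HOME:-}" || ! -r "$HOME" || ! -x "$HOME" ]]; then
    HOME=$(mktemp -d)
    export HOME
  fi
}

fix_home

# Hypothetical call from the scanbd action script; the install path is an
# assumption. Passing the device explicitly may also avoid the
# "open of device fujitsu failed" error seen in the log.
# /usr/local/bin/scan --device "$SCANBD_DEVICE"
```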

Integration with Paperless-ng

Not a bug but just curious whether this script can be used with paperless-ng.