Paperless Office on a budget
sudo apt-add-repository ppa:rolfbensch/sane-git
sudo apt-get update
sudo apt-get install sane sane-utils imagemagick tesseract-ocr pdftk libtiff-tools libsane-extras exactimage unpaper wput
Install scanbuttond
Download the “Debian Experimental” package from http://pkgs.org/download/scanbuttond
sudo dpkg -i scanbuttond_0.2.3.cvs20090713-14_i386.deb
This manual install is only needed for Fujitsu ScanSnap support. For other scanners you can probably install scanbuttond from the Ubuntu repository.
Scanner config
vim 40-libsane.rules
#add this line
ATTRS{idVendor}=="04c5", ATTRS{idProduct}=="11a2", ENV{libsane_matched}="yes"
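If your scanner is a different model, the vendor and product IDs in that rule will differ. As a minimal sketch (the function name make_sane_rule is my own invention, not part of SANE or udev), the rule line can be generated from the vendor:product pair that lsusb prints after "ID":

```shell
#!/bin/sh
# Hypothetical helper: build a libsane udev rule from a "vendor:product" ID pair,
# which is the format lsusb shows after "ID".
make_sane_rule() {
    id="$1"
    vendor="${id%%:*}"    # text before the colon
    product="${id##*:}"   # text after the colon
    printf 'ATTRS{idVendor}=="%s", ATTRS{idProduct}=="%s", ENV{libsane_matched}="yes"\n' \
        "$vendor" "$product"
}

# Reproduces the rule used above for the ScanSnap
make_sane_rule 04c5:11a2
```

After editing the rules file, running sudo udevadm control --reload-rules and replugging the scanner should make udev pick up the change.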
Permissions
sudo adduser saned scanner
Useful command lines for troubleshooting
Since I had some trouble getting this scanner to work properly, I found the following commands very useful for locating the issue.
man sane-usb
sane-find-scanner
scanimage -L
dmesg
tail /var/log/udev
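If you want to use the result of scanimage -L in a script, it helps to pull out just the device name. A small sketch, assuming the usual output format of "device `NAME' is a DESCRIPTION" (the function name and the sample line are made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: print only the SANE device names from `scanimage -L` output on stdin.
sane_devices() {
    sed -n "s/^device \`\(.*\)' is a .*/\1/p"
}

# Illustrative sample line; on a real system use: scanimage -L | sane_devices
printf "device \`fujitsu:ScanSnap S1500:2264' is a FUJITSU ScanSnap S1500 scanner\n" | sane_devices
```

An empty result means SANE did not list any device, which points at the udev rule or the permissions rather than at the scan script.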
NOTE: If you are using a notebook, be careful: I spent quite a few hours troubleshooting an error when opening the device from saned. It turned out that the USB power management on the Toshiba notebook caused havoc with saned (http://askubuntu.com/questions/55140/error-during-device-i-o-when-using-usb-scanner). Switching to the desktop that now houses the scanner fixed the problem. Thank you VIRTUALBOX (I ended up setting up a dedicated VM for this task)!
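I can't say for certain which power setting was the culprit in my case, but one quick check is whether the kernel is allowed to autosuspend the scanner. A minimal sketch, assuming the standard /sys/bus/usb/devices layout in sysfs (the function name is mine):

```shell
#!/bin/sh
# Hypothetical helper: list USB devices of one vendor with their runtime power setting.
# $1 = vendor ID (e.g. 04c5), $2 = sysfs USB device directory (defaults to the real one)
usb_power_control() {
    vendor="$1"
    root="${2:-/sys/bus/usb/devices}"
    for dev in "$root"/*; do
        [ -r "$dev/idVendor" ] || continue
        [ "$(cat "$dev/idVendor")" = "$vendor" ] || continue
        # "auto" means the kernel may autosuspend the device; "on" keeps it powered
        printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/power/control" 2>/dev/null)"
    done
}

# Prints nothing if no device of that vendor is plugged in
usb_power_control 04c5
```

If the scanner shows "auto", writing "on" to that power/control file as root keeps the device powered until it is unplugged, which makes it easy to test whether autosuspend is the problem.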
Configure scanbuttond
sudo vim /etc/default/scanbuttond
#change this line from no to yes
RUN=yes
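The same edit can be scripted instead of done in vim. A sketch, assuming the file contains a line starting with RUN= (the function name is made up, and it needs root to touch the real file):

```shell
#!/bin/sh
# Hypothetical helper: set RUN=yes in a scanbuttond defaults file.
# $1 = path to the file (defaults to /etc/default/scanbuttond)
enable_scanbuttond() {
    conf="${1:-/etc/default/scanbuttond}"
    # Rewrite any existing RUN=... line, then write the result back in place
    new=$(sed 's/^RUN=.*/RUN=yes/' "$conf") && printf '%s\n' "$new" > "$conf"
}

# Demonstration on a throwaway copy rather than the real file
demo=$(mktemp)
printf '# scanbuttond defaults\nRUN=no\n' > "$demo"
enable_scanbuttond "$demo"
cat "$demo"
rm -f "$demo"
```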
cd /etc/scanbuttond
sudo cp initscanner.sh.example initscanner.sh
sudo vim initscanner.sh
Uncomment or copy any scanner init command(s).
sudo cp buttonpressed.sh.example buttonpressed.sh
sudo vim buttonpressed.sh
Copy the contents of the scan script below. The script is also hosted on GitHub (https://github.com/leogaggl/misc-scripts/blob/master/buttonpressed.sh)
Scan script
#!/bin/bash
OUT_DIR=/output/directory/name
TMP_DIR=`mktemp -d`
FILE_NAME=scan_`date +%Y%m%d-%H%M%S`
LANGUAGE="eng"
echo 'scanning...'
scanimage --resolution 300 \
    --batch="$TMP_DIR/scan_%03d.pnm" \
    --format=pnm \
    --mode Gray \
    --source 'ADF Duplex'
echo "Output saved in $TMP_DIR/scan*.pnm"
cd "$TMP_DIR"
# cut borders
echo 'cutting borders...'
for i in scan_*.pnm; do
mogrify -shave 50x5 "${i}"
done
# run unpaper on each page (it deskews and cleans the scan; it does not drop blank pages)
echo 'running unpaper...'
for f in ./*.pnm; do
unpaper --size "a4" --overwrite "$f" `echo "$f" | sed 's/scan/scan_unpaper/g'`
#need to rename and delete original since newer versions of unpaper can't use same file name
rm -f "$f"
done
# apply text cleaning and convert to tif
echo 'cleaning pages...'
for i in scan_*.pnm; do
echo "${i}"
convert "${i}" -contrast-stretch 1% -level 29%,76% "${i}.tif"
done
# Starting OCR
echo 'doing OCR...'
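for i in scan_*.pnm.tif; do
echo "${i}"
tesseract "$i" "$i" -l $LANGUAGE hocr
# hocr2pdf reads the hOCR from stdin; older Tesseract releases write "$i.html", newer ones "$i.hocr"
hocr2pdf -i "$i" -s -o "$i.pdf" < "$i.html"
done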
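The listing here stops after the per-page OCR. What remains is to join the page PDFs into one document and put it in OUT_DIR. The following is only a sketch of that last step under my own assumptions (the function name merge_pages is made up, and it relies on pdftk from the install step); the script on GitHub is the reference:

```shell
#!/bin/bash
# Hypothetical final step: merge the per-page PDFs into a single file and clean up.
# Usage: merge_pages TMP_DIR OUT_DIR FILE_NAME
merge_pages() {
    tmp_dir="$1"
    out_dir="$2"
    name="$3"
    mkdir -p "$out_dir" || return 1
    # The glob expands in sorted order, so the zero-padded page numbers keep pages in sequence
    pdftk "$tmp_dir"/scan_*.pnm.tif.pdf cat output "$out_dir/$name.pdf" || return 1
    # Remove the temporary scans only after the merge succeeded
    rm -rf "$tmp_dir"
}
```

At the end of the scan script this would be called as merge_pages "$TMP_DIR" "$OUT_DIR" "$FILE_NAME".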
Credits:
A big thank you and hat tip to the authors of the following pages:
- http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/
- http://www.robinclarke.net/archives/the-paperless-office-with-linux
- http://askubuntu.com/questions/271271/how-do-i-produce-a-multi-page-sandwich-pdf-with-hocr2pdf
EDIT (2013-09-16): I found this link describing how to remove empty pages: http://philipp.knechtges.com/?p=190 – might have to investigate this when I have some time.