Ubuntu - paperless office on a budget
By Leo Gaggl
Paper and I have never gotten on well, so I have always dreamed of a paperless office. A while ago I purchased a Fujitsu ScanSnap S1500 scanner for the office, after researching which multipage, duplex Automatic Document Feed (ADF) scanners were both affordable and supported on Linux. The setup described here does the following at the press of the scanner button:
- scan the document
- perform OCR to convert to text
- combine the text with PDF to create a searchable PDF
- OPTIONAL – send the resulting document into Alfresco Document Management Server via FTP
Install dependencies
NOTE: PPA is only required for support of Fujitsu ScanSnap S1500
sudo apt-add-repository ppa:rolfbensch/sane-git
sudo apt-get update
sudo apt-get install sane sane-utils imagemagick tesseract-ocr pdftk libtiff-tools libsane-extras exactimage unpaper wput
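Before moving on, it can help to confirm that the command-line tools the scan script calls actually ended up on the PATH. The following is only a sketch and not part of the original setup: the function name `missing_tools` is made up for illustration, and the tool list is my reading of which binaries the scan script and the optional upload step need.

```shell
#!/bin/bash
# Hypothetical helper: print every command from the argument list
# that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands used by the scan script, plus pdftk and wput from the install list.
missing_tools scanimage unpaper mogrify tesseract hocr2pdf pdftk wput
```

If nothing is printed, every tool was found. Any name that is printed points to a package that still needs installing.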
Install scanbuttond
Download the “Debian Experimental” package from http://pkgs.org/download/scanbuttond
sudo dpkg -i scanbuttond_0.2.3.cvs20090713-14_i386.deb
This step is only needed for Fujitsu ScanSnap support. For other scanners you can probably install scanbuttond from the Ubuntu repository.
Scanner config
vim 40-libsane.rules
#add this line
ATTRS{idVendor}=="04c5", ATTRS{idProduct}=="11a2", ENV{libsane_matched}="yes"
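For a different scanner the vendor and product IDs will differ; `lsusb` prints them as `ID vvvv:pppp`. As a small sketch (the helper name `udev_rule_for` is hypothetical, not something SANE ships), the rule line can be generated from such an ID pair:

```shell
#!/bin/bash
# Hypothetical helper: turn a "vendor:product" USB ID (as printed by lsusb)
# into a libsane udev rule line of the same shape as the one above.
udev_rule_for() {
    local vendor="${1%%:*}"    # text before the colon
    local product="${1##*:}"   # text after the colon
    printf 'ATTRS{idVendor}=="%s", ATTRS{idProduct}=="%s", ENV{libsane_matched}="yes"\n' \
        "$vendor" "$product"
}

# The ScanSnap S1500 IDs used in this post.
udev_rule_for 04c5:11a2
```

After editing the rules file, unplugging and replugging the scanner is usually enough for the new rule to be applied.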
Permissions
sudo adduser saned scanner
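To double-check that the group change took effect, something along these lines should work. It is a sketch only; `in_group` is a made-up helper name.

```shell
#!/bin/bash
# Hypothetical helper: succeed if user $1 belongs to group $2.
in_group() {
    # id -nG lists the user's group names separated by spaces;
    # put one per line and look for an exact, literal match.
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

if in_group saned scanner; then
    echo "saned is in the scanner group"
else
    echo "saned is NOT in the scanner group"
fi
```

Processes that are already running keep their old group list, so the saned service may need a restart before the new membership applies.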
Useful command lines for troubleshooting
Since I had some trouble getting this scanner to work properly, I found the following commands very useful for locating the issue.
man sane-usb
sane-find-scanner
scanimage -L
dmesg
tail /var/log/udev
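The device name that `scanimage -L` reports is what the scan script later passes to `--device-name`. `scanimage -L` prints lines of the form device `NAME' is a DESCRIPTION, so the name can be pulled out with a little sed. This is a sketch under that assumption, and `sane_device_names` is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: read `scanimage -L` output on stdin and print only
# the device names (the text between the backtick and the closing quote).
sane_device_names() {
    sed -n "s/^device \`\(.*\)' is a.*/\1/p"
}

# Typical use on a machine with the scanner attached:
#   scanimage -L | sane_device_names
```

Lines that do not describe a device are simply dropped, so empty output means no scanner was detected.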
NOTE: Be careful if you are using a notebook. I spent quite a few hours troubleshooting an error when opening the device from saned, and it turned out that USB power management on my Toshiba notebook was causing havoc with saned (http://askubuntu.com/questions/55140/error-during-device-i-o-when-using-usb-scanner). Switching to the desktop that now houses the scanner fixed the problem. Thank you, VirtualBox (I ended up setting up a dedicated VM for this task)!
Configure scanbuttond
vim /etc/default/scanbuttond
#change this line from no to yes
RUN=yes
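If you prefer to script this change instead of editing by hand, a sed one-liner can flip the setting. The sketch below assumes GNU sed (for `-i`) and uses a made-up helper name, `enable_run`. It works on a scratch copy here, since the real file needs root.

```shell
#!/bin/bash
# Hypothetical helper: set RUN=yes in the defaults file given as $1.
# For the real file: sudo sed -i 's/^RUN=.*/RUN=yes/' /etc/default/scanbuttond
enable_run() {
    sed -i 's/^RUN=.*/RUN=yes/' "$1"
}

# Demonstrate on a scratch copy rather than the live file.
demo=$(mktemp)
printf 'RUN=no\n' > "$demo"
enable_run "$demo"
cat "$demo"
rm -f "$demo"
```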
cd /etc/scanbuttond
sudo cp initscanner.sh.example initscanner.sh
sudo vim initscanner.sh
Uncomment or copy any scanner init command(s).
sudo cp buttonpressed.sh.example buttonpressed.sh
sudo vim buttonpressed.sh
Copy the contents of the scan script below. The script is also hosted on GitHub (https://github.com/leogaggl/misc-scripts/blob/master/buttonpressed.sh)
Scan script
#!/bin/bash
OUT_DIR=/output/directory/name
TMP_DIR=$(mktemp -d)
FILE_NAME=scan_$(date +%Y%m%d-%H%M%S)
cd "$TMP_DIR" || exit 1
echo "################## Scanning ###################"
scanimage --resolution 150 --batch=scan_%03d.pnm --format=pnm --mode Gray --device-name "fujitsu:ScanSnap S1500:67953" --source "ADF Duplex" --page-width 210 --page-height 297 --sleeptimer 1 -y 297 -x 210
echo "################## Cleaning ###################"
for f in ./*.pnm; do
    unpaper --size "a4" --overwrite "$f" "$f"
done
echo "############## Converting to TIF ##############"
mogrify -format tif *.pnm
echo "################ OCR ################"
for f in ./*.tif; do
    # Writes the hOCR next to the image ($f.html with Tesseract 3.02; newer releases use $f.hocr)
    tesseract "$f" "$f" -l eng hocr
    # hocr2pdf reads the hOCR from standard input
    hocr2pdf -i "$f" -s -o "$f.pdf" < "$f.html"
done
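The listing above ends with the OCR loop, while the workflow at the top of this post also calls for combining the per-page PDFs and optionally sending the result to Alfresco via FTP. The following is a minimal sketch of how those last steps could look. It is not the original script (that one is on GitHub); the function name `finish_scan` and the `FTP_URL` variable are made up for illustration.

```shell
#!/bin/bash
# Sketch of the remaining workflow steps; not the author's original code.
# Usage: finish_scan PAGE_DIR OUT_DIR FILE_NAME
#   Merges PAGE_DIR/*.pdf (in glob order) into OUT_DIR/FILE_NAME.pdf.
finish_scan() {
    local page_dir="$1" out_dir="$2" file_name="$3"
    local target="$out_dir/$file_name.pdf"

    # pdftk's "cat" operation concatenates the input PDFs in argument order;
    # the zero-padded scan_%03d names keep the pages sorted.
    pdftk "$page_dir"/*.pdf cat output "$target" || return 1

    # Optional upload. FTP_URL is a hypothetical variable, for example
    # ftp://user:password@alfresco.example.com/path/
    if [ -n "${FTP_URL:-}" ]; then
        wput "$target" "$FTP_URL"
    fi
}

# In the scan script this would run after the OCR loop, for example:
#   finish_scan "$TMP_DIR" "$OUT_DIR" "$FILE_NAME" && rm -rf "$TMP_DIR"
```

Leaving `FTP_URL` unset skips the upload, which matches the optional nature of that step.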
Credits:
A big thank you and hat tip to the authors of the following pages:
- http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/
- http://www.robinclarke.net/archives/the-paperless-office-with-linux
- http://askubuntu.com/questions/271271/how-do-i-produce-a-multi-page-sandwich-pdf-with-hocr2pdf
EDIT (2013-09-16): I found this link describing how to remove empty pages: http://philipp.knechtges.com/?p=190 – might have to investigate this when I have some time.