Make sure the following Ubuntu/Debian packages are installed:
sudo apt update ; sudo apt install imagemagick poppler-utils rename
-
Create two sibling directories, Monopagina and Multipagina.
-
Put the multipage PDFs in Multipagina and cd into it. The commands below use ../Monopagina, so they must be run from inside Multipagina.
-
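Create a directory in Monopagina for each PDF, named after the PDF without its .pdf extension (spaces are kept, because the following steps look the directory up by the original name)
ls *.pdf | sed 's/\.pdf$//' | awk '{print "mkdir -p \"../Monopagina/" $0 "\""}' | bash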
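The same step can also be written as a plain loop, which avoids generating command text and piping it into bash (a file name containing a quote or a `$` would break the generated commands). This is only a sketch: `make_page_dirs` and its two arguments are illustrative names, not part of the original recipe.

```shell
#!/usr/bin/env bash
# Sketch: create one output directory per PDF, named after the PDF
# without its .pdf extension. make_page_dirs is a hypothetical helper.
make_page_dirs() {
  local src=$1 dest=$2 f name
  for f in "$src"/*.pdf; do
    [ -e "$f" ] || continue        # no PDFs: the glob stays literal, skip it
    name=$(basename "$f" .pdf)     # "Tomo III.pdf" -> "Tomo III"
    mkdir -p -- "$dest/$name"
  done
}

# Usage, from inside Multipagina:
#   make_page_dirs . ../Monopagina
```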
-
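Convert (extract the images of each PDF into its directory; with -j, images stored as JPEG in the PDF are written as .jpg files)
ls *.pdf | sed 's/\.pdf$//' | awk '{print "pdfimages -j \"" $0 ".pdf\" \"../Monopagina/" $0 "/\""}' | bash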
-
Rename (prefix each image with the name of its PDF, so -000.jpg becomes NAME_pg-000.jpg)
ls *.pdf | sed 's/\.pdf$//' | awk '{print "cd \"../Monopagina/" $0 "\" ; for filename in *.jpg; do mv -- \"$filename\" \"" $0 "_pg$filename\"; done; cd -" }' | bash
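The Convert and Rename steps can be combined in one loop. The sketch below assumes the per-PDF directories already exist and that pdfimages, given an image root ending in "/", writes files named -000.jpg, -001.jpg and so on inside that directory. `extract_pages` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: extract the images of every PDF and prefix them with the PDF name.
# extract_pages is a hypothetical helper; it expects "$dest/<name>/" to exist.
extract_pages() {
  local src=$1 dest=$2 f name img
  for f in "$src"/*.pdf; do
    [ -e "$f" ] || continue
    name=$(basename "$f" .pdf)
    # Image root "$dest/$name/" makes pdfimages write -000.jpg, -001.jpg, ...
    pdfimages -j "$f" "$dest/$name/"
    # Only files still starting with "-" are renamed, so a second run
    # does not prefix already renamed images again.
    for img in "$dest/$name"/-*.jpg; do
      [ -e "$img" ] || continue
      mv -- "$img" "$dest/$name/${name}_pg${img##*/}"
    done
  done
}

# Usage, from inside Multipagina:
#   extract_pages . ../Monopagina
```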
5 Put into a single folder (optional). If you want all JPEGs directly in Monopagina, run:
ls *.pdf | sed 's/\.pdf$//' | awk '{print "cd \"../Monopagina/" $0 "\" ; for filename in *.jpg; do mv -- \"$filename\" ..; done; cd -" }' | bash
5.1 Delete empty folders (&& keeps find from deleting anything if the cd fails)
cd ../Monopagina && find . -type d -empty -delete
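Moving the images up and removing the emptied folders can also be done with find alone. This is a sketch under the assumption that the images were already renamed with their PDF prefix, so that names from different folders do not collide. `flatten_pages` is an illustrative name.

```shell
#!/usr/bin/env bash
# Sketch: move every .jpg found in a subdirectory of "$root" into "$root",
# then delete the subdirectories that ended up empty.
# flatten_pages is a hypothetical helper.
flatten_pages() {
  local root=$1
  # -mindepth 2: only files inside subdirectories, not those already in "$root"
  find "$root" -mindepth 2 -type f -name '*.jpg' -exec mv -- {} "$root"/ \;
  # -mindepth 1: never delete "$root" itself, even if it is empty
  find "$root" -mindepth 1 -type d -empty -delete
}

# Usage, from inside Multipagina:
#   flatten_pages ../Monopagina
```

Folders that still contain other files (for example non-JPEG images) are left in place.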
- PDF files containing many documents, with one bookmark per document
- one PDF file per page (extracted with Adobe Acrobat)
sudo snap install pdftk
pdftk "Tomo III.pdf" dump_data_utf8 > in_III.info
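The resulting .info file can be reduced to a list of bookmarks with their starting pages. The sketch below assumes the usual pdftk record layout, in which each bookmark has a "BookmarkTitle: ..." line followed later by a "BookmarkPageNumber: ..." line. `list_bookmarks` is a hypothetical helper, and using the list to find where each document starts is an assumption about the purpose of this step.

```shell
#!/usr/bin/env bash
# Sketch: print "page<TAB>title" for every bookmark in a pdftk dump_data file.
# list_bookmarks is a hypothetical helper.
list_bookmarks() {
  awk '
    # Remember the most recent title (everything after the "BookmarkTitle: " label)
    /^BookmarkTitle: /      { title = substr($0, length("BookmarkTitle: ") + 1) }
    # The page number line closes the record: emit page and title
    /^BookmarkPageNumber: / { print $2 "\t" title }
  ' "$1"
}

# Usage:
#   list_bookmarks in_III.info
```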
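Convert to TIFFs
- Create the script ($5 is the fifth whitespace-separated field of each file name; adjust it to your naming scheme)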
ls *pdf | awk '{print "convert -density 300 \"" $0 "\" TOMO_II_" $5 ".tiff"}' | sed 's/pdf\.tiff/tiff/g' > convert.sh
- Run the script
sh convert.sh
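The same conversion can be run without an intermediate convert.sh. The sketch below reproduces the naming of the generated script: it takes the fifth whitespace-separated field of the file name, drops its .pdf suffix and prepends TOMO_II_. The example file name in the comments is made up, and `tiff_name` and `convert_all` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Sketch: derive the TIFF name from the fifth whitespace-separated field
# of the PDF file name, e.g. "Tomo II - pagina 12.pdf" -> "TOMO_II_12.tiff"
# (the example file name is an assumption, not taken from the notes).
tiff_name() {
  printf '%s\n' "$1" | awk '{ sub(/\.pdf$/, "", $5); print "TOMO_II_" $5 ".tiff" }'
}

# Convert every PDF in the current directory at 300 dpi.
convert_all() {
  local f
  for f in *.pdf; do
    [ -e "$f" ] || continue
    convert -density 300 "$f" "$(tiff_name "$f")"
  done
}

# Usage, in the directory with the one-page PDFs:
#   convert_all
```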
- The tesseract binaries:
sudo apt-get install tesseract-ocr libtesseract-dev tesseract-ocr-spa
- The Python packages:
pip install -r requirements.txt
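With the binaries installed, the TIFFs could be run through the tesseract command line as a quick check. This is only a sketch: the notes do not show the actual OCR step, which may instead go through the Python packages. Spanish (spa, matching the tesseract-ocr-spa package) and plain-text output are assumptions, and `ocr_all` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: OCR every TIFF in the current directory with the Spanish model.
# ocr_all is a hypothetical helper.
ocr_all() {
  local f
  for f in *.tiff; do
    [ -e "$f" ] || continue
    # tesseract <image> <output base> -l <lang> writes <output base>.txt
    tesseract "$f" "${f%.tiff}" -l spa
  done
}

# Usage, in the directory with the TIFFs:
#   ocr_all
```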