Converts a PDF file into a text file while keeping the layout of the original PDF. Useful to extract the content from a table or a form in a PDF file. PDFLayoutTextStripper is a subclass of PDFTextStripper class (from the Apache PDFBox library).
Data extraction from a form in a PDF file
<dependency>
<groupId>io.github.jonathanlink</groupId>
<artifactId>PDFLayoutTextStripper</artifactId>
<version>2.2.3</version>
</dependency>
- Install apache pdfbox manually (to get the v2.0.6 click here ) and its two dependencies commons-logging.jar and fontbox
warning: only pdfbox versions from version 2.0.0 upwards are compatible with this version of PDFLayoutTextStripper.java
cd PDFLayoutTextStripper
javac -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar *.java
java -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar test
The same as for Linux (see above) but replace : with ;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class test {
public static void main(String[] args) {
String string = null;
try {
PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("./samples/bus.pdf"), "r"));
pdfParser.parse();
PDDocument pdDocument = new PDDocument(pdfParser.getDocument());
PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
string = pdfTextStripper.getText(pdDocument);
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
};
System.out.println(string);
}
}
Step 1.
Install py4j
Step 2.
Run Python Gateway
java -jar python-gateway.jar
or with custom port
java -jar python-gateway.jar 2222
Step 3.
Parse pdf file with python code
from py4j.java_gateway import JavaGateway
gw = JavaGateway()
result = gw.entry_point.strip('samples/bus.pdf')
#
# result is a dict of {
# 'success': 'true' or 'false',
# 'payload': pdf file content if 'success' is 'true'
# 'error': error message if 'success' is 'false'
# }
#
print result['payload']
Thanks to
- Dmytro Zelinskyy for reporting an issue with its correction (v2.2.3)
- Ho Ting Cheng for reporting an issue (v2.1)
- James Sullivan for having updated the code to make it work with the latest version of PDFBox (v2.0)