Comments (7)
Sorry, forgot to follow up on this one! I have a branch with this work mostly done. I should be able to finish it up and make a new release in the next couple of days.
from pdftotext.
Yeah, I can add this sometime. Would this be enough for you?
pdf = pdftotext.PDF(f, layout="raw")
Or would you need to change the layout page-by-page?
Thanks @jalan, per document is perfect.
I already started working on it locally, and it worked:
diff --git a/pdftotext.cpp b/pdftotext.cpp
index 3e1bfbb..9c53a26 100644
--- a/pdftotext.cpp
+++ b/pdftotext.cpp
@@ -14,12 +14,14 @@ static PyObject* PdftotextError;
typedef struct {
PyObject_HEAD
int page_count;
+ bool raw;
PyObject* data;
poppler::document* doc;
} PDF;
static void PDF_clear(PDF* self) {
self->page_count = 0;
+ self->raw = false;
delete self->doc;
self->doc = NULL;
Py_CLEAR(self->data);
@@ -63,11 +65,12 @@ static int PDF_unlock(PDF* self, char* password) {
static int PDF_init(PDF* self, PyObject* args, PyObject* kwds) {
PyObject* pdf_file;
char* password = (char*)"";
- static char* kwlist[] = {(char*)"pdf_file", (char*)"password", NULL};
+ int raw = 0;  // the "p" format unit stores its result in an int
+ static char* kwlist[] = {(char*)"pdf_file", (char*)"password", (char*)"raw", NULL};
PDF_clear(self);
- if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|s", kwlist, &pdf_file, &password)) {
+ if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|sp", kwlist, &pdf_file, &password, &raw)) {
goto error;
}
if (PDF_load_data(self, pdf_file) < 0) {
@@ -81,6 +84,7 @@ static int PDF_init(PDF* self, PyObject* args, PyObject* kwds) {
}
self->page_count = self->doc->pages();
+ self->raw = raw;
return 0;
error:
@@ -107,7 +111,12 @@ static PyObject* PDF_read_page(PDF* self, int page_number) {
const int min = std::min(rect.left(), rect.top());
const int max = std::max(rect.right(), rect.bottom());
- page_utf8 = page->text(poppler::rectf(min, min, max, max)).to_utf8();
+ poppler::page::text_layout_enum layout_mode = poppler::page::physical_layout;
+ if (self->raw) {
+ layout_mode = poppler::page::raw_order_layout;
+ }
+
+ page_utf8 = page->text(poppler::rectf(min, min, max, max), layout_mode).to_utf8();
delete page;
return PyUnicode_DecodeUTF8(page_utf8.data(), page_utf8.size(), NULL);
}
@@ -135,41 +144,41 @@ static PySequenceMethods PDF_sequence_methods = {
static PyTypeObject PDFType = {
PyVarObject_HEAD_INIT(NULL, 0)
- "pdftotext.PDF", // tp_name
- sizeof(PDF), // tp_basicsize
- 0, // tp_itemsize
- (destructor)PDF_dealloc, // tp_dealloc
- 0, // tp_print
- 0, // tp_getattr
- 0, // tp_setattr
- 0, // tp_reserved
- 0, // tp_repr
- 0, // tp_as_number
- &PDF_sequence_methods, // tp_as_sequence
- 0, // tp_as_mapping
- 0, // tp_hash
- 0, // tp_call
- 0, // tp_str
- 0, // tp_getattro
- 0, // tp_setattro
- 0, // tp_as_buffer
- Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE, // tp_flags
- "PDF(pdf_file, password="") -> new PDF document", // tp_doc
- 0, // tp_traverse
- 0, // tp_clear
- 0, // tp_richcompare
- 0, // tp_weaklistoffset
- 0, // tp_iter
- 0, // tp_iternext
- 0, // tp_methods
- 0, // tp_members
- 0, // tp_getset
- 0, // tp_base
- 0, // tp_dict
- 0, // tp_descr_get
- 0, // tp_descr_set
- 0, // tp_dictoffset
- (initproc)PDF_init, // tp_init
+ "pdftotext.PDF", // tp_name
+ sizeof(PDF), // tp_basicsize
+ 0, // tp_itemsize
+ (destructor)PDF_dealloc, // tp_dealloc
+ 0, // tp_print
+ 0, // tp_getattr
+ 0, // tp_setattr
+ 0, // tp_reserved
+ 0, // tp_repr
+ 0, // tp_as_number
+ &PDF_sequence_methods, // tp_as_sequence
+ 0, // tp_as_mapping
+ 0, // tp_hash
+ 0, // tp_call
+ 0, // tp_str
+ 0, // tp_getattro
+ 0, // tp_setattro
+ 0, // tp_as_buffer
+ Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE, // tp_flags
+ "PDF(pdf_file, password="", raw=False) -> new PDF document", // tp_doc
+ 0, // tp_traverse
+ 0, // tp_clear
+ 0, // tp_richcompare
+ 0, // tp_weaklistoffset
+ 0, // tp_iter
+ 0, // tp_iternext
+ 0, // tp_methods
+ 0, // tp_members
+ 0, // tp_getset
+ 0, // tp_base
+ 0, // tp_dict
+ 0, // tp_descr_get
+ 0, // tp_descr_set
+ 0, // tp_dictoffset
+ (initproc)PDF_init, // tp_init
};
#if POPPLER_CPP_AT_LEAST_0_30_0
But the result is not what I expected. Poppler may have diverged far enough from xpdf that its raw layout is different.
@uda, nice, looks good. I'll be sure to credit you if I add that.
Do you happen to have a PDF that raw mode helps with? As far as I can tell, poppler does not recommend using raw mode anymore.
@jalan, we mainly need this for Israeli law PDFs produced with Adobe InDesign.
The PDF: http://fs.knesset.gov.il//20/law/20_lsr_491466.pdf
Attached: physical and raw exports produced with the pdftotext command:
20_lsr_491466_physical.txt
20_lsr_491466_raw.txt
While the physical layout preserves the logical flow, it produces text that is difficult to process with a script. The raw layout yields out-of-order chunks, but we can handle those with scripts.
How can I use the raw parameter of pdftotext from Python, the way I use it in the console?
Done, new release on PyPI
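For anyone asking how to call this from Python, here is a minimal sketch. It assumes the released keyword matches the raw=False signature shown in the patch above, so check the README of the version you installed. The helper name extract_pages and its pdf_class argument are hypothetical, added only so the function can be exercised without poppler installed.

```python
def extract_pages(pdf_path, raw=False, pdf_class=None):
    """Return the text of every page in pdf_path as a list of strings.

    raw is forwarded to the PDF constructor (assumed keyword: raw=False,
    as in the patch in this thread). pdf_class is a hypothetical hook
    for substituting a stand-in class; by default pdftotext.PDF is used.
    """
    if pdf_class is None:
        # Imported lazily so the helper itself does not require poppler.
        import pdftotext
        pdf_class = pdftotext.PDF

    with open(pdf_path, "rb") as f:
        pdf = pdf_class(f, raw=raw)
        # A PDF object behaves like a sequence of page strings.
        return list(pdf)
```

Calling extract_pages("20_lsr_491466.pdf", raw=True) on the document linked above should then give one string per page in raw order, roughly what the -raw flag of the console tool produces.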