Please add the ability to pass the raw layout option to page->text.

Add raw layout on page->text about pdftotext HOT 7 CLOSED

jalan commented on September 25, 2024

Add raw layout on page->text

from pdftotext.

Comments (7)

jalan commented on September 25, 2024

Sorry, forgot to follow up on this one! I have a branch with this work mostly done. I should be able to finish it up and make a new release in the next couple of days.

jalan commented on September 25, 2024

Yeah, I can add this sometime. Would this be enough for you?

pdf = pdftotext.PDF(f, layout="raw")

Or would you need to change the layout page-by-page?

uda commented on September 25, 2024

Thanks @jalan, per document is perfect.

I already started working on it locally, and it worked:

diff --git a/pdftotext.cpp b/pdftotext.cpp
index 3e1bfbb..9c53a26 100644
--- a/pdftotext.cpp
+++ b/pdftotext.cpp
@@ -14,12 +14,14 @@ static PyObject* PdftotextError;
 typedef struct {
     PyObject_HEAD
     int page_count;
+    bool raw;
     PyObject* data;
     poppler::document* doc;
 } PDF;
 
 static void PDF_clear(PDF* self) {
     self->page_count = 0;
+    self->raw = false;
     delete self->doc;
     self->doc = NULL;
     Py_CLEAR(self->data);
@@ -63,11 +65,12 @@ static int PDF_unlock(PDF* self, char* password) {
 static int PDF_init(PDF* self, PyObject* args, PyObject* kwds) {
     PyObject* pdf_file;
     char* password = (char*)"";
-    static char* kwlist[] = {(char*)"pdf_file", (char*)"password", NULL};
+    int raw = 0;  // the "p" format unit stores its result in an int
+    static char* kwlist[] = {(char*)"pdf_file", (char*)"password", (char*)"raw", NULL};
 
     PDF_clear(self);
 
-    if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|s", kwlist, &pdf_file, &password)) {
+    if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|sp", kwlist, &pdf_file, &password, &raw)) {
         goto error;
     }
     if (PDF_load_data(self, pdf_file) < 0) {
@@ -81,6 +84,7 @@ static int PDF_init(PDF* self, PyObject* args, PyObject* kwds) {
     }
 
     self->page_count = self->doc->pages();
+    self->raw = raw;
     return 0;
 
 error:
@@ -107,7 +111,12 @@ static PyObject* PDF_read_page(PDF* self, int page_number) {
     const int min = std::min(rect.left(), rect.top());
     const int max = std::max(rect.right(), rect.bottom());
 
-    page_utf8 = page->text(poppler::rectf(min, min, max, max)).to_utf8();
+    poppler::page::text_layout_enum layout_mode = poppler::page::physical_layout;
+    if (self->raw) {
+        layout_mode = poppler::page::raw_order_layout;
+    }
+
+    page_utf8 = page->text(poppler::rectf(min, min, max, max), layout_mode).to_utf8();
     delete page;
     return PyUnicode_DecodeUTF8(page_utf8.data(), page_utf8.size(), NULL);
 }
@@ -135,41 +144,41 @@ static PySequenceMethods PDF_sequence_methods = {
 
 static PyTypeObject PDFType = {
     PyVarObject_HEAD_INIT(NULL, 0)
-    "pdftotext.PDF",                                   // tp_name
-    sizeof(PDF),                                       // tp_basicsize
-    0,                                                 // tp_itemsize
-    (destructor)PDF_dealloc,                           // tp_dealloc
-    0,                                                 // tp_print
-    0,                                                 // tp_getattr
-    0,                                                 // tp_setattr
-    0,                                                 // tp_reserved
-    0,                                                 // tp_repr
-    0,                                                 // tp_as_number
-    &PDF_sequence_methods,                             // tp_as_sequence
-    0,                                                 // tp_as_mapping
-    0,                                                 // tp_hash
-    0,                                                 // tp_call
-    0,                                                 // tp_str
-    0,                                                 // tp_getattro
-    0,                                                 // tp_setattro
-    0,                                                 // tp_as_buffer
-    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE,          // tp_flags
-    "PDF(pdf_file, password="") -> new PDF document",  // tp_doc
-    0,                                                 // tp_traverse
-    0,                                                 // tp_clear
-    0,                                                 // tp_richcompare
-    0,                                                 // tp_weaklistoffset
-    0,                                                 // tp_iter
-    0,                                                 // tp_iternext
-    0,                                                 // tp_methods
-    0,                                                 // tp_members
-    0,                                                 // tp_getset
-    0,                                                 // tp_base
-    0,                                                 // tp_dict
-    0,                                                 // tp_descr_get
-    0,                                                 // tp_descr_set
-    0,                                                 // tp_dictoffset
-    (initproc)PDF_init,                                // tp_init
+    "pdftotext.PDF",                                              // tp_name
+    sizeof(PDF),                                                  // tp_basicsize
+    0,                                                            // tp_itemsize
+    (destructor)PDF_dealloc,                                      // tp_dealloc
+    0,                                                            // tp_print
+    0,                                                            // tp_getattr
+    0,                                                            // tp_setattr
+    0,                                                            // tp_reserved
+    0,                                                            // tp_repr
+    0,                                                            // tp_as_number
+    &PDF_sequence_methods,                                        // tp_as_sequence
+    0,                                                            // tp_as_mapping
+    0,                                                            // tp_hash
+    0,                                                            // tp_call
+    0,                                                            // tp_str
+    0,                                                            // tp_getattro
+    0,                                                            // tp_setattro
+    0,                                                            // tp_as_buffer
+    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE,                     // tp_flags
+    "PDF(pdf_file, password="", raw=False) -> new PDF document",  // tp_doc
+    0,                                                            // tp_traverse
+    0,                                                            // tp_clear
+    0,                                                            // tp_richcompare
+    0,                                                            // tp_weaklistoffset
+    0,                                                            // tp_iter
+    0,                                                            // tp_iternext
+    0,                                                            // tp_methods
+    0,                                                            // tp_members
+    0,                                                            // tp_getset
+    0,                                                            // tp_base
+    0,                                                            // tp_dict
+    0,                                                            // tp_descr_get
+    0,                                                            // tp_descr_set
+    0,                                                            // tp_dictoffset
+    (initproc)PDF_init,                                           // tp_init
 };
 
 #if POPPLER_CPP_AT_LEAST_0_30_0

But the result is not what I expected. It might be that poppler has diverged so far from xpdf that its raw layout is different.

jalan commented on September 25, 2024

@uda, nice, looks good. I'll be sure to credit you if I add that.

Do you happen to have a PDF that raw mode helps with? As far as I can tell, poppler does not recommend using raw mode anymore.

uda commented on September 25, 2024

@jalan, we mainly need this for Israeli legislation PDFs produced with Adobe InDesign.

The PDF: http://fs.knesset.gov.il//20/law/20_lsr_491466.pdf
Attached: physical and raw exports using the pdftotext command
20_lsr_491466_physical.txt
20_lsr_491466_raw.txt

While the physical layout preserves the logical flow, it produces text that is difficult to process by script. The raw layout yields out-of-order chunks, but we can handle those with scripts.
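
For reference, a minimal sketch of how the two exports above could be reproduced from Python. It assumes the physical export was made with poppler's `-layout` flag and the raw export with `-raw`; `pdftotext_command` and `export` are hypothetical helper names used only for illustration.

```python
import subprocess


def pdftotext_command(src, dst, mode="physical"):
    """Build the argv for poppler's pdftotext CLI in the given layout mode."""
    # Assumption: "physical" maps to -layout and "raw" maps to -raw.
    flags = {"physical": "-layout", "raw": "-raw"}
    if mode not in flags:
        raise ValueError(f"unknown layout mode: {mode!r}")
    return ["pdftotext", flags[mode], src, dst]


def export(src, dst, mode="physical"):
    """Run pdftotext; raises CalledProcessError if the command fails."""
    subprocess.run(pdftotext_command(src, dst, mode), check=True)
```

Calling `export("20_lsr_491466.pdf", "20_lsr_491466_raw.txt", "raw")` would then write the raw export, provided the `pdftotext` binary is on the PATH.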

acuatoria commented on September 25, 2024

How can I use the raw parameter of pdftotext from Python, the same way I use it in the console?

jalan commented on September 25, 2024

Done, new release on PyPI
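
For anyone landing here later, a minimal sketch of using the option from Python. It assumes the released version accepts a `raw` keyword as in the patch above (`PDF(pdf_file, password="", raw=False)`); check the README of the version you have installed. `read_pages` is a hypothetical helper name, not part of pdftotext.

```python
def read_pages(path, raw=False):
    """Return the text of every page of the PDF at `path` as a list of str."""
    # Third-party package; imported here so this sketch loads without it.
    import pdftotext

    with open(path, "rb") as f:
        # Assumption: raw=True selects poppler's raw-order layout for the
        # whole document, and raw=False keeps the default layout.
        pdf = pdftotext.PDF(f, raw=raw)
        return list(pdf)
```

The layout is chosen once per document, which matches the "per document is perfect" request earlier in the thread.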
