Comments (7)

jalan commented on September 25, 2024

Sorry, forgot to follow up on this one! I have a branch with this work mostly done. I should be able to finish it up and make a new release in the next couple of days.

from pdftotext.

jalan commented on September 25, 2024

Yeah, I can add this sometime. Would this be enough for you?

pdf = pdftotext.PDF(f, layout="raw")

Or would you need to change the layout page-by-page?

uda commented on September 25, 2024

Thanks @jalan, per document is perfect.

I already started working on it locally, and it worked:

diff --git a/pdftotext.cpp b/pdftotext.cpp
index 3e1bfbb..9c53a26 100644
--- a/pdftotext.cpp
+++ b/pdftotext.cpp
@@ -14,12 +14,14 @@ static PyObject* PdftotextError;
 typedef struct {
     PyObject_HEAD
     int page_count;
+    bool raw;
     PyObject* data;
     poppler::document* doc;
 } PDF;
 
 static void PDF_clear(PDF* self) {
     self->page_count = 0;
+    self->raw = false;
     delete self->doc;
     self->doc = NULL;
     Py_CLEAR(self->data);
@@ -63,11 +65,12 @@ static int PDF_unlock(PDF* self, char* password) {
 static int PDF_init(PDF* self, PyObject* args, PyObject* kwds) {
     PyObject* pdf_file;
     char* password = (char*)"";
-    static char* kwlist[] = {(char*)"pdf_file", (char*)"password", NULL};
+    int raw = 0;  // the "p" format unit stores its result in an int
+    static char* kwlist[] = {(char*)"pdf_file", (char*)"password", (char*)"raw", NULL};
 
     PDF_clear(self);
 
-    if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|s", kwlist, &pdf_file, &password)) {
+    if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|sp", kwlist, &pdf_file, &password, &raw)) {
         goto error;
     }
     if (PDF_load_data(self, pdf_file) < 0) {
@@ -81,6 +84,7 @@ static int PDF_init(PDF* self, PyObject* args, PyObject* kwds) {
     }
 
     self->page_count = self->doc->pages();
+    self->raw = raw;
     return 0;
 
 error:
@@ -107,7 +111,12 @@ static PyObject* PDF_read_page(PDF* self, int page_number) {
     const int min = std::min(rect.left(), rect.top());
     const int max = std::max(rect.right(), rect.bottom());
 
-    page_utf8 = page->text(poppler::rectf(min, min, max, max)).to_utf8();
+    poppler::page::text_layout_enum layout_mode = poppler::page::physical_layout;
+    if (self->raw) {
+        layout_mode = poppler::page::raw_order_layout;
+    }
+
+    page_utf8 = page->text(poppler::rectf(min, min, max, max), layout_mode).to_utf8();
     delete page;
     return PyUnicode_DecodeUTF8(page_utf8.data(), page_utf8.size(), NULL);
 }
@@ -135,41 +144,41 @@ static PySequenceMethods PDF_sequence_methods = {
 
 static PyTypeObject PDFType = {
     PyVarObject_HEAD_INIT(NULL, 0)
-    "pdftotext.PDF",                                   // tp_name
-    sizeof(PDF),                                       // tp_basicsize
-    0,                                                 // tp_itemsize
-    (destructor)PDF_dealloc,                           // tp_dealloc
-    0,                                                 // tp_print
-    0,                                                 // tp_getattr
-    0,                                                 // tp_setattr
-    0,                                                 // tp_reserved
-    0,                                                 // tp_repr
-    0,                                                 // tp_as_number
-    &PDF_sequence_methods,                             // tp_as_sequence
-    0,                                                 // tp_as_mapping
-    0,                                                 // tp_hash
-    0,                                                 // tp_call
-    0,                                                 // tp_str
-    0,                                                 // tp_getattro
-    0,                                                 // tp_setattro
-    0,                                                 // tp_as_buffer
-    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE,          // tp_flags
-    "PDF(pdf_file, password="") -> new PDF document",  // tp_doc
-    0,                                                 // tp_traverse
-    0,                                                 // tp_clear
-    0,                                                 // tp_richcompare
-    0,                                                 // tp_weaklistoffset
-    0,                                                 // tp_iter
-    0,                                                 // tp_iternext
-    0,                                                 // tp_methods
-    0,                                                 // tp_members
-    0,                                                 // tp_getset
-    0,                                                 // tp_base
-    0,                                                 // tp_dict
-    0,                                                 // tp_descr_get
-    0,                                                 // tp_descr_set
-    0,                                                 // tp_dictoffset
-    (initproc)PDF_init,                                // tp_init
+    "pdftotext.PDF",                                              // tp_name
+    sizeof(PDF),                                                  // tp_basicsize
+    0,                                                            // tp_itemsize
+    (destructor)PDF_dealloc,                                      // tp_dealloc
+    0,                                                            // tp_print
+    0,                                                            // tp_getattr
+    0,                                                            // tp_setattr
+    0,                                                            // tp_reserved
+    0,                                                            // tp_repr
+    0,                                                            // tp_as_number
+    &PDF_sequence_methods,                                        // tp_as_sequence
+    0,                                                            // tp_as_mapping
+    0,                                                            // tp_hash
+    0,                                                            // tp_call
+    0,                                                            // tp_str
+    0,                                                            // tp_getattro
+    0,                                                            // tp_setattro
+    0,                                                            // tp_as_buffer
+    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE,                     // tp_flags
+    "PDF(pdf_file, password="", raw=False) -> new PDF document",  // tp_doc
+    0,                                                            // tp_traverse
+    0,                                                            // tp_clear
+    0,                                                            // tp_richcompare
+    0,                                                            // tp_weaklistoffset
+    0,                                                            // tp_iter
+    0,                                                            // tp_iternext
+    0,                                                            // tp_methods
+    0,                                                            // tp_members
+    0,                                                            // tp_getset
+    0,                                                            // tp_base
+    0,                                                            // tp_dict
+    0,                                                            // tp_descr_get
+    0,                                                            // tp_descr_set
+    0,                                                            // tp_dictoffset
+    (initproc)PDF_init,                                           // tp_init
 };
 
 #if POPPLER_CPP_AT_LEAST_0_30_0

But the result is not what I expected. It might be that poppler has diverged so far from xpdf that its raw layout is different.

jalan commented on September 25, 2024

@uda, nice, looks good. I'll be sure to credit you if I add that.

Do you happen to have a PDF that raw mode helps with? As far as I can tell, poppler does not recommend using raw mode anymore.

uda commented on September 25, 2024

@jalan, we mainly need this for Israeli law documents produced with Adobe InDesign.

The PDF: http://fs.knesset.gov.il//20/law/20_lsr_491466.pdf
Attached: physical and raw exports using the pdftotext command
20_lsr_491466_physical.txt
20_lsr_491466_raw.txt

While the physical layout preserves the logical flow, it produces text that is difficult to process by script. The raw layout yields out-of-order chunks, but we can handle those with scripts.

acuatoria commented on September 25, 2024

How can I use the raw parameter of pdftotext from Python, the way I use it in the console?

jalan commented on September 25, 2024

Done, new release on PyPI
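
For anyone arriving with the same question, here is a minimal sketch of calling the Python module in raw mode. It assumes the released keyword is the boolean `raw` from the patch earlier in this thread (not the `layout="raw"` spelling floated before it), so check the README of your installed version. `read_pages` is a hypothetical helper name, not part of pdftotext.

```python
from typing import List


def read_pages(path: str, raw: bool = False) -> List[str]:
    """Return the text of each page of the PDF at `path`.

    Assumption: pdftotext.PDF accepts a boolean `raw` keyword, as in the
    patch discussed in this thread.
    """
    # Imported here so the helper can be defined without pdftotext installed.
    import pdftotext

    # pdftotext.PDF expects a file object opened in binary mode.
    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f, raw=raw)
        # A PDF object behaves like a sequence of page strings.
        return list(pdf)
```

With the PDF linked above saved locally, `read_pages("20_lsr_491466.pdf", raw=True)` should give one string per page in raw order, roughly what `pdftotext -raw` produces on the command line.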