Comments (6)

leeper commented on May 27, 2024

I think this is an inherent limitation of the PDF format. As I understand it, white space is not represented as actual "space" characters but rather as horizontal offsets for the represented text. So, the underlying tabula library has no way of knowing how much space there is, because there's nothing there except the horizontal start position of the text. I could be wrong as I'm not a PDF expert, but my fear is that your workaround might be the only way to achieve this.

from tabulapdf.

alanpaulkwan commented on May 27, 2024

That makes sense, although what I'm suggesting would just be about
representing the offsets with whitespace. It sounds like what you're saying
is that, as far as you know, Tabula doesn't give options to do this. Since your
goal is to create an R binding, I suppose it's a feature request to be sent
over to the tabula guys?

RPoppler / pdftools seems to get close to what I want, but there
are some problems there too. Some of the text in adjacent lines gets mashed together.
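
To make "representing the offsets with whitespace" concrete, here is a minimal sketch in plain Java. It does not use tabula or PDFBox at all: the `Chunk` type, the `renderLine` helper, and the `charWidth` parameter are all hypothetical names made up for illustration, and it assumes each piece of text on a line is already available together with its horizontal start position. The idea is simply to turn each x offset into a target column and pad with spaces until that column is reached.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OffsetPadding {

    /** Hypothetical holder: a run of text plus the x position where it starts. */
    static final class Chunk {
        final double x;
        final String text;

        Chunk(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /**
     * Rebuilds one line of text, padding with spaces so that each chunk starts
     * at the column implied by its x offset. charWidth is an assumed average
     * glyph width, in the same units as x.
     */
    static String renderLine(List<Chunk> chunks, double charWidth) {
        // Work on a sorted copy so the caller's list is left untouched.
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.x));

        StringBuilder line = new StringBuilder();
        for (Chunk chunk : sorted) {
            int targetColumn = (int) Math.round(chunk.x / charWidth);
            // If earlier text already reached this column, insert a single
            // space so that adjacent chunks do not run together.
            if (line.length() > 0 && targetColumn <= line.length()) {
                line.append(' ');
            }
            // Otherwise pad with spaces up to the target column.
            while (line.length() < targetColumn) {
                line.append(' ');
            }
            line.append(chunk.text);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(
                new Chunk(0.0, "Name"),
                new Chunk(60.0, "Qty"),
                new Chunk(120.0, "Price"));
        // With charWidth = 6.0 the chunks start at columns 0, 10 and 20.
        System.out.println(renderLine(chunks, 6.0));
    }
}
```

A fixed `charWidth` is the weak point of this approach: with proportional fonts the columns are only approximate, which may be part of why the alignment is hard to get right in practice.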


leeper commented on May 27, 2024

Oh, actually extract_text() isn't a tabula feature. It just uses PDFBox. If it looks like it is possible directly with PDFBox, I can try to implement it, but I don't think it is possible.


alanpaulkwan commented on May 27, 2024

I can't figure out how to do it here, but I have a piece of Java code... can I send it to you?


leeper commented on May 27, 2024

Thanks. I will take a look as soon as I can.


alanpaulkwan commented on May 27, 2024

Awesome, thanks. Hoping it helps improve the package!

