Comments (6)
I think this is an inherent limitation of the PDF format. As I understand it, white space is not represented as actual "space" characters but rather as horizontal offsets for the represented text. So, the underlying tabula library has no way of knowing how much space there is because there's nothing there except the horizontal start position of the text. I could be wrong as I'm not a PDF expert, but my fear is your workaround might be the only way to achieve this.
from tabulapdf.
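To make the idea of horizontal offsets concrete, here is a minimal sketch of how text chunks that carry only an x position could be turned back into a padded line. It is not how tabula or PDFBox work internally; the `render_line` function, the `(x, text)` chunk format, and the fixed `char_width` are all assumptions made for illustration.

```python
from typing import List, Tuple


def render_line(chunks: List[Tuple[float, str]], char_width: float) -> str:
    """Place each chunk at the text column implied by its x offset.

    chunks: hypothetical (x offset in points, text) pairs for one line.
    char_width: assumed width of one character in points (fixed-width guess).
    """
    line = ""
    for x, text in sorted(chunks):
        column = round(x / char_width)
        # If the target column is already occupied, keep one separating space.
        if line and column <= len(line):
            column = len(line) + 1
        line += " " * (column - len(line)) + text
    return line


# Hypothetical chunks, as an extractor might report them for a table row.
chunks = [(0.0, "Name"), (120.0, "Qty"), (180.0, "Price")]
print(render_line(chunks, char_width=6.0))
```

The result depends heavily on the assumed character width, which is one reason reconstructed whitespace is only ever an approximation of the original layout.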
That makes sense, although what I'm suggesting would just be about
representing the offsets with whitespace. It sounds like what you're saying
is that, as far as you know, Tabula doesn't give options to do this. Since your
goal is to create an R binding, I suppose it's a feature request to be sent
over to the Tabula guys?
RPoppler / pdftools seems to get close to what I want, but there
are some problems there too. Some of the text in adjacent lines gets mashed together.
On Sat, Nov 12, 2016 at 6:12 AM, Thomas J. Leeper [email protected]
wrote:
I think this is an inherent limitation of PDF format. As I understand it,
white space is not represented as actual "space" characters but rather as
horizontal offsets for the represented text. So, the underlying tabula
library has no way of knowing how much space there is because there's
nothing there except the horizontal start position of the text. I could be
wrong as I'm not a PDF expert, but my fear is your workaround might be the
only way to achieve this.
Oh, actually extract_text() isn't a tabula feature. It just uses PDFBox. If it looks like it is possible directly with PDFBox, I can try to implement it, but I don't think it is possible.
I can't figure out how to do it here, but I have a piece of Java code... can I send it to you?
Thanks. I will take a look as soon as I can.
Awesome, thanks. Hoping it helps improve the package!
On Sat, Nov 12, 2016 at 12:21 PM, Thomas J. Leeper <[email protected]> wrote:
Thanks. I will take a look as soon as I can.