Comments (5)
So glad to hear it! And that's a great question. Currently, pdfplumber doesn't provide any special methods for handling column-spanning rows. But it's a really interesting problem to try to solve. For now, the recommended approach is just to write custom logic, as you suggest. If you run into any difficulties with that, I'd be happy to take a look at the code.
from pdfplumber.
Curious if this has changed in the two years since I first asked the question.
I still have a lot of PDFs where a header row might have text spanning multiple columns, which seems to defeat table extraction.
Here's an example, and the resulting tablefinder debug:
Example PDF
@Kirkman I believe the best approach is still to write your own logic. There are so many examples of weird tables that writing a general approach gets super complicated. To my mind it's useful to write your own parser, as you can build in logical tests that detect when it is going wrong, which is the sort of feedback that's harder to convey in an automated extraction but is often required when dealing with weird formats.
Another thought. In this specific case, is there a way I can adjust the table parser settings so that it identifies the column-spanning header rows as separate, 1-col tables? There is a full empty line of white space between those rows and the actual tables.
Hi @Kirkman, I'm catching up on a few old issues and came back across this one. I realize you probably have moved on from this particular challenge, but for the sake of responding:
I can't think of a simple, library-generalizable approach that would provide the functionality sought. I think the best approach for tables like these (which, in some respects, depend on human spatial reasoning skills rather than explicit delineations, and are more akin to custom-structured lists than proper tables) is to write a bit of custom code. For instance, you could use page.crop(...) to divide the page into each candidate's section, and then parse each individually.
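As a rough illustration of that crop-and-parse idea, here is a minimal sketch, not a tested recipe for the PDF in question. The helper names `section_bboxes` and `extract_sections`, the `is_header` callback, and the `min_gap` threshold are all hypothetical; only `pdfplumber.open`, `page.extract_words()`, `page.crop(...)`, `page.extract_tables()`, `page.width`, and `page.height` come from pdfplumber itself. It also assumes the page's bounding box starts at (0, 0), and each crop still includes its own header line.

```python
from typing import Callable, Iterable, List, Tuple

# pdfplumber bounding boxes are ordered (x0, top, x1, bottom)
BBox = Tuple[float, float, float, float]


def section_bboxes(
    header_tops: Iterable[float],
    page_width: float,
    page_height: float,
    min_gap: float = 1.0,  # hypothetical threshold for merging near-identical tops
) -> List[BBox]:
    """Turn the top coordinates of section headers into one bbox per section.

    Each section runs from its header's top to the next header's top
    (or to the bottom of the page for the last section).
    """
    tops: List[float] = []
    for top in sorted(header_tops):
        # Several words on the same header line share almost the same top;
        # keep only the first one so no zero-height boxes are produced.
        if not tops or top - tops[-1] >= min_gap:
            tops.append(top)
    bottoms = tops[1:] + [page_height]
    return [(0, top, page_width, bottom) for top, bottom in zip(tops, bottoms)]


def extract_sections(pdf_path: str, page_index: int, is_header: Callable[[dict], bool]):
    """Crop a page into sections and run table extraction on each crop.

    `is_header` is a caller-supplied (hypothetical) test that receives a word
    dict from page.extract_words() and returns True for header words.
    """
    import pdfplumber  # imported here so section_bboxes works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        header_tops = [w["top"] for w in page.extract_words() if is_header(w)]
        return [
            page.crop(bbox).extract_tables()
            for bbox in section_bboxes(header_tops, page.width, page.height)
        ]
```

How `is_header` recognizes a header (a keyword, a font size, a left margin) depends entirely on the document, which is where the custom logic comes in.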