Comments (9)
I found the cause of this error and fixed it in f1db4ef.
@alonsopg Could you upgrade your tabula-py?
from tabula-py.
@alonsopg Was your problem solved by the updated version? If so, I would like to close this issue.
b'Skipping line 28: expected 2 fields, saw 4\nSkipping line 29: expected 2 fields, saw 4\nSkipping line 30: expected 2 fields, saw 4\nSkipping line 31: expected 2 fields, saw 4\nSkipping line 32: expected 2 fields, saw 4\nSkipping line 33: expected 2 fields, saw 4\nSkipping line 34: expected 2 fields, saw 4\nSkipping line 35: expected 2 fields, saw 4\nSkipping line 36: expected 2 fields, saw 4\nSkipping line 37: expected 2 fields, saw 4\nSkipping line 38: expected 2 fields, saw 4\nSkipping line 39: expected 2 fields, saw 4\nSkipping line 40: expected 2 fields, saw 4\nSkipping line
I also got the warnings above. I have set pandas_options={'error_bad_lines': False}.
If there are multiple tables in a file, you should specify the page number with the pages
option. This might be related to #2.
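As a rough illustration of why a page with several tables produces the "expected 2 fields, saw 4" messages: when the extracted text is parsed as a single CSV, the expected column count comes from the first rows, and wider rows from a second table no longer fit. The sketch below uses only the Python standard library. The sample data and the helper name split_by_width are made up for illustration and are not part of tabula-py.

```python
import csv
import io
from itertools import groupby

# Made-up sample: a 2-column table followed by a 4-column table,
# similar to what a single CSV stream for a multi-table page could look like.
raw = "name,value\na,1\nb,2\nid,x,y,z\n1,0.1,0.2,0.3\n"


def split_by_width(text):
    """Group consecutive CSV rows that share the same number of fields."""
    rows = list(csv.reader(io.StringIO(text)))
    return [list(group) for _, group in groupby(rows, key=len)]


tables = split_by_width(raw)
for table in tables:
    print(len(table[0]), "columns:", table)
```

Splitting by row width is only a way to see the mismatch. It is not a reliable way to separate real tables.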
Thanks for the help @chezou, I tried this:
In:
from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="45")
pdf_table
Out:
dic 27 2016 12:32:04 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
0 ADVERTENCIA: java.lang.ArrayIndexOutOfBoundsEx... NaN
1 java.lang.ArrayIndexOutOfBoundsException: 5 NaN
2 \tat java.awt.geom.Path2DFloatFloatTxIterator.cur... NaN
3 \tat technology.tabula.ObjectExtractor.strokeO... NaN
4 \tat technology.tabula.ObjectExtractor.strokeP... NaN
5 \tat org.apache.pdfbox.util.operator.pagedrawe... NaN
6 \tat org.apache.pdfbox.util.PDFStreamEngine.pr... NaN
7 \tat org.apache.pdfbox.util.PDFStreamEngine.pr... NaN
8 \tat org.apache.pdfbox.util.operator.pagedrawe... NaN
9 \tat org.apache.pdfbox.util.PDFStreamEngine.pr... NaN
10 \tat org.apache.pdfbox.util.PDFStreamEngine.pr... NaN
11 \tat org.apache.pdfbox.util.PDFStreamEngine.pr... NaN
12 \tat org.apache.pdfbox.util.PDFStreamEngine.pr... NaN
13 \tat technology.tabula.ObjectExtractor.drawPag... NaN
14 \tat technology.tabula.ObjectExtractor.extract... NaN
15 \tat technology.tabula.PageIterator.next(PageI... NaN
16 \tat technology.tabula.CommandLineApp.extractT... NaN
17 \tat technology.tabula.CommandLineApp.main(Com... NaN
18 dic 27 2016 12:32:04 PM org.apache.pdfbox.util.PDFSt...
19 ADVERTENCIA: java.lang.ArrayIndexOutOfBoundsEx... NaN
20 java.lang.ArrayIndexOutOfBoundsException: 10 NaN
21 \tat java.awt.geom.Path2DFloatFloatTxIterator.cur... NaN
22 \tat technology.tabula.ObjectExtractor.strokeO... NaN
23 \tat technology.tabula.ObjectExtractor.strokeP... NaN
24 \tat org.apache.pdfbox.util.operator.pagedrawe... NaN
25 \tat org.apache.pdfbox.util.PDFStreamEngine.pr... NaN
26 \tat org.apache.pdfbox.util.PDFStreamEngine.pr... NaN
27 \tat org.apache.pdfbox.util.operator.pagedrawe... NaN
28 \tat org.apache.pdfbox.util.PDFStreamEngine.pr... NaN
29 \tat org.apache.pdfbox.util.PDFStreamEngine.pr... NaN
30 \tat org.apache.pdfbox.util.PDFStreamEngine.pr... NaN
31 \tat org.apache.pdfbox.util.PDFStreamEngine.pr... NaN
32 \tat technology.tabula.ObjectExtractor.drawPag... NaN
33 \tat technology.tabula.ObjectExtractor.extract... NaN
34 \tat technology.tabula.PageIterator.next(PageI... NaN
35 \tat technology.tabula.CommandLineApp.extractT... NaN
36 \tat technology.tabula.CommandLineApp.main(Com... NaN
37 ATOS DO PODER EXECUTIVO NaN
38 ADMINISTRAÇÃO DIRETA NaN
39 DECRETOS NaN
And:
In:
from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="5")
pdf_table
Out:
CParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 6
As you can see, I specified the pages parameter. Any idea how to proceed? Thanks!
Could you try with tabula-java?
https://github.com/tabulapdf/tabula-java
If page 45 of your PDF includes multiple tables or has merged cells, tabula-py will probably fail. If you use the area
option, you might be able to extract the table.
Anyway, I can't guess any further without your PDF.
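A minimal sketch of the area suggestion, assuming a tabula-py release that exposes tabula.read_pdf (the posts above use the name read_pdf_table) and that area is given as (top, left, bottom, right) in PDF points. The coordinates and the helper names area_options and read_region are placeholders for illustration, not values taken from the PDF in question.

```python
def area_options(top, left, bottom, right, page):
    """Build keyword arguments that restrict extraction to one region of one page."""
    if not (top < bottom and left < right):
        raise ValueError("area must satisfy top < bottom and left < right")
    # Assumption: area is ordered (top, left, bottom, right), measured in points.
    return {"pages": str(page), "area": [top, left, bottom, right]}


def read_region(path, **options):
    """Call tabula-py with the given options (needs Java and tabula-py installed)."""
    # Imported lazily so area_options can be used without tabula-py.
    import tabula
    return tabula.read_pdf(path, **options)


# Placeholder coordinates: measure the real ones from your own PDF.
opts = area_options(100, 30, 400, 560, page=45)
print(opts)
# df = read_region("../file.pdf", **opts)
```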
I have the same problem with pages="all". Could anyone help me?
@RAHAAMA Set multiple_tables=True.
https://github.com/chezou/tabula-py#i-faced-cparsererror-how-can-i-extract-multiple-tables
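A minimal sketch of that suggestion, assuming a tabula-py release that exposes tabula.read_pdf and accepts multiple_tables=True. The wrapper name read_all_tables and its reader parameter are hypothetical, added only so the call can be tried without Java installed.

```python
def read_all_tables(path, pages="all", reader=None):
    """Ask for one result per detected table instead of a single merged frame."""
    if reader is None:
        # Imported lazily: tabula-py needs Java at run time.
        import tabula
        reader = tabula.read_pdf
    # With multiple_tables=True, tabula-py returns a list of tables, so tables
    # with different column counts are not forced into one CSV.
    return reader(path, pages=pages, multiple_tables=True)


# Usage (requires Java, tabula-py and the PDF itself):
# for table in read_all_tables("../file.pdf"):
#     print(table.shape)
```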
@chezou Thank you. There is another problem with multiple tables: I have a PDF prepared in two languages, so it has two columns (English and French). When I try to extract the tables, it treats all of the text as a table. Is there any suggestion for this problem?