I am trying to extract the tables from a number of pdf documents: In

CParserError: Error tokenizing data. C error: Expected 2 fields in line 733, saw 3 about tabula-py HOT 9 CLOSED

alonsopg commented on July 26, 2024

CParserError: Error tokenizing data. C error: Expected 2 fields in line 733, saw 3

from tabula-py.

Comments (9)

chezou commented on July 26, 2024

I found the cause of this error and fixed it in f1db4ef.

@alonsopg Could you upgrade your tabula-py?

chezou commented on July 26, 2024

@alonsopg Did the updated version solve your problem? If so, I would like to close this issue.

RAHAAMA commented on July 26, 2024

b'Skipping line 28: expected 2 fields, saw 4\nSkipping line 29: expected 2 fields, saw 4\nSkipping line 30: expected 2 fields, saw 4\nSkipping line 31: expected 2 fields, saw 4\nSkipping line 32: expected 2 fields, saw 4\nSkipping line 33: expected 2 fields, saw 4\nSkipping line 34: expected 2 fields, saw 4\nSkipping line 35: expected 2 fields, saw 4\nSkipping line 36: expected 2 fields, saw 4\nSkipping line 37: expected 2 fields, saw 4\nSkipping line 38: expected 2 fields, saw 4\nSkipping line 39: expected 2 fields, saw 4\nSkipping line 40: expected 2 fields, saw 4\nSkipping line

I also got the warnings above. I have set pandas_options={'error_bad_lines': False}.

chezou commented on July 26, 2024

If there are multiple tables in a file, you should specify the page number with the pages option. This might be related to #2.
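A minimal sketch of why multiple tables can trigger this error, assuming the extracted page is handed to a CSV parser that takes the expected column count from the first row. It uses only the standard-library csv module and imitates the wording of the pandas message; it is not how pandas implements the check. The function name find_bad_lines and the sample data are made up for illustration.

```python
import csv
import io


def find_bad_lines(csv_text):
    """Return (line_number, expected, saw) for rows wider than the first row."""
    rows = csv.reader(io.StringIO(csv_text))
    expected = None
    bad = []
    for line_no, row in enumerate(rows, start=1):
        if expected is None:
            # The first row fixes the expected number of fields.
            expected = len(row)
            continue
        # Only rows with MORE fields than expected are reported as bad.
        if len(row) > expected:
            bad.append((line_no, expected, len(row)))
    return bad


# Hypothetical output for a page holding a 2-column table followed by
# a 4-column table, concatenated into one CSV stream.
sample = "name,value\na,1\nb,2\nyear,q1,q2,q3\n2016,5,6,7\n"

for line_no, expected, saw in find_bad_lines(sample):
    print(f"line {line_no}: expected {expected} fields, saw {saw}")
```

The second table's rows are wider than the first table's header, which is the same shape of mismatch the error message reports.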

alonsopg commented on July 26, 2024

Thanks for the help @chezou, I tried this:
In:

from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="45")
pdf_table

Out:


 	dic 27 	2016 12:32:04 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
0 	ADVERTENCIA: java.lang.ArrayIndexOutOfBoundsEx... 	NaN
1 	java.lang.ArrayIndexOutOfBoundsException: 5 	NaN
2 	\tat java.awt.geom.Path2DFloatFloatTxIterator.cur... 	NaN
3 	\tat technology.tabula.ObjectExtractor.strokeO... 	NaN
4 	\tat technology.tabula.ObjectExtractor.strokeP... 	NaN
5 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
6 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
7 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
8 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
9 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
10 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
11 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
12 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
13 	\tat technology.tabula.ObjectExtractor.drawPag... 	NaN
14 	\tat technology.tabula.ObjectExtractor.extract... 	NaN
15 	\tat technology.tabula.PageIterator.next(PageI... 	NaN
16 	\tat technology.tabula.CommandLineApp.extractT... 	NaN
17 	\tat technology.tabula.CommandLineApp.main(Com... 	NaN
18 	dic 27 	2016 12:32:04 PM org.apache.pdfbox.util.PDFSt...
19 	ADVERTENCIA: java.lang.ArrayIndexOutOfBoundsEx... 	NaN
20 	java.lang.ArrayIndexOutOfBoundsException: 10 	NaN
21 	\tat java.awt.geom.Path2DFloatFloatTxIterator.cur... 	NaN
22 	\tat technology.tabula.ObjectExtractor.strokeO... 	NaN
23 	\tat technology.tabula.ObjectExtractor.strokeP... 	NaN
24 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
25 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
26 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
27 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
28 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
29 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
30 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
31 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
32 	\tat technology.tabula.ObjectExtractor.drawPag... 	NaN
33 	\tat technology.tabula.ObjectExtractor.extract... 	NaN
34 	\tat technology.tabula.PageIterator.next(PageI... 	NaN
35 	\tat technology.tabula.CommandLineApp.extractT... 	NaN
36 	\tat technology.tabula.CommandLineApp.main(Com... 	NaN
37 	ATOS DO PODER EXECUTIVO 	NaN
38 	ADMINISTRAÇÃO DIRETA 	NaN
39 	DECRETOS 	NaN

And:

In:

from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="5")
pdf_table

Out:

CParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 6

As you can see, I specified the pages parameter. Any idea how to proceed? Thanks!

chezou commented on July 26, 2024

Could you try with tabula-java?
https://github.com/tabulapdf/tabula-java

If page 45 of your PDF includes multiple tables or has merged cells, tabula-py will likely fail. If you use the area option, you might be able to extract the table.

Anyway, I can't guess any further without your PDF.

RAHAAMA commented on July 26, 2024

I have the same problem with pages="all". Could anyone help me?

chezou commented on July 26, 2024

@RAHAAMA Set multiple_tables=True.
https://github.com/chezou/tabula-py#i-faced-cparsererror-how-can-i-extract-multiple-tables
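For reference, a rough sketch of how the options mentioned in this thread (pages, area, multiple_tables) could be combined. It assumes a tabula-py version that exposes read_pdf and has Java available. The helper names build_read_options and extract_tables, and the coordinates in the usage comment, are hypothetical and only for illustration.

```python
def build_read_options(pages="all", area=None):
    """Collect keyword arguments for tabula.read_pdf."""
    # multiple_tables=True asks tabula-py for one DataFrame per table
    # instead of parsing the whole page as a single CSV.
    options = {"pages": pages, "multiple_tables": True}
    if area is not None:
        # area is (top, left, bottom, right), measured in PDF points.
        top, left, bottom, right = area
        options["area"] = [top, left, bottom, right]
    return options


def extract_tables(pdf_path, pages="all", area=None):
    """Return a list of DataFrames, one per detected table."""
    # Imported lazily so the helper above works without tabula-py installed.
    import tabula

    return tabula.read_pdf(pdf_path, **build_read_options(pages, area))


# Example usage (placeholder coordinates, adjust to your PDF):
# tables = extract_tables("../file.pdf", pages="45", area=(100, 50, 400, 550))
# for table in tables:
#     print(table.shape)
```

Restricting area to the region around a single table may also help when a page mixes tables with ordinary text, though the right coordinates depend on the PDF.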

RAHAAMA commented on July 26, 2024

@chezou Thank you. There is another problem with multiple tables: I have a PDF prepared in two languages, meaning it has two columns (English and French). When I try to extract the tables, it treats all of the text as a table. Is there any suggestion for this problem?
