
Comments (9)

chezou commented on July 26, 2024

I found the cause of this error and fixed it in f1db4ef.

@alonsopg Could you upgrade your tabula-py?

from tabula-py.

chezou commented on July 26, 2024

@alonsopg Did the updated version solve your problem? If so, I would like to close this issue.


RAHAAMA commented on July 26, 2024

b'Skipping line 28: expected 2 fields, saw 4\nSkipping line 29: expected 2 fields, saw 4\nSkipping line 30: expected 2 fields, saw 4\nSkipping line 31: expected 2 fields, saw 4\nSkipping line 32: expected 2 fields, saw 4\nSkipping line 33: expected 2 fields, saw 4\nSkipping line 34: expected 2 fields, saw 4\nSkipping line 35: expected 2 fields, saw 4\nSkipping line 36: expected 2 fields, saw 4\nSkipping line 37: expected 2 fields, saw 4\nSkipping line 38: expected 2 fields, saw 4\nSkipping line 39: expected 2 fields, saw 4\nSkipping line 40: expected 2 fields, saw 4\nSkipping line

I also get the warnings above. I have set pandas_options={'error_bad_lines': False}.


chezou commented on July 26, 2024

If the file contains multiple tables, you should specify the page number with the pages option. This might be related to #2.
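
A minimal sketch of why several tables in one file can produce the "expected 2 fields, saw 4" messages. It assumes the extracted text reaches pandas as a single CSV stream, which is an inference from the warnings quoted above rather than something confirmed in this thread. The standard library's csv module stands in for the pandas tokenizer, and find_bad_lines and the sample data are hypothetical.

```python
import csv
import io


def find_bad_lines(csv_text):
    """Return (line_number, expected, seen) for rows wider than the first row."""
    rows = csv.reader(io.StringIO(csv_text))
    header = next(rows)
    expected = len(header)  # the first row fixes the expected field count
    bad = []
    for line_number, row in enumerate(rows, start=2):
        if len(row) > expected:
            bad.append((line_number, expected, len(row)))
    return bad


# Hypothetical data: a 2-column table followed directly by a 4-column table.
sample = "name,value\na,1\nb,2\nid,qty,price,total\n1,2,3,4\n"

for line_number, expected, seen in find_bad_lines(sample):
    print(f"Skipping line {line_number}: expected {expected} fields, saw {seen}")
```

Every row of the second table is wider than the first table's header, so each one is reported. Restricting extraction to a page that holds a single table avoids mixing widths in one stream.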


alonsopg commented on July 26, 2024

Thanks for the help @chezou, I tried this:
In:

from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="45")
pdf_table

Out:


 	dic 27 	2016 12:32:04 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
0 	ADVERTENCIA: java.lang.ArrayIndexOutOfBoundsEx... 	NaN
1 	java.lang.ArrayIndexOutOfBoundsException: 5 	NaN
2 	\tat java.awt.geom.Path2DFloatFloatTxIterator.cur... 	NaN
3 	\tat technology.tabula.ObjectExtractor.strokeO... 	NaN
4 	\tat technology.tabula.ObjectExtractor.strokeP... 	NaN
5 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
6 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
7 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
8 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
9 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
10 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
11 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
12 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
13 	\tat technology.tabula.ObjectExtractor.drawPag... 	NaN
14 	\tat technology.tabula.ObjectExtractor.extract... 	NaN
15 	\tat technology.tabula.PageIterator.next(PageI... 	NaN
16 	\tat technology.tabula.CommandLineApp.extractT... 	NaN
17 	\tat technology.tabula.CommandLineApp.main(Com... 	NaN
18 	dic 27 	2016 12:32:04 PM org.apache.pdfbox.util.PDFSt...
19 	ADVERTENCIA: java.lang.ArrayIndexOutOfBoundsEx... 	NaN
20 	java.lang.ArrayIndexOutOfBoundsException: 10 	NaN
21 	\tat java.awt.geom.Path2DFloatFloatTxIterator.cur... 	NaN
22 	\tat technology.tabula.ObjectExtractor.strokeO... 	NaN
23 	\tat technology.tabula.ObjectExtractor.strokeP... 	NaN
24 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
25 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
26 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
27 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
28 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
29 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
30 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
31 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
32 	\tat technology.tabula.ObjectExtractor.drawPag... 	NaN
33 	\tat technology.tabula.ObjectExtractor.extract... 	NaN
34 	\tat technology.tabula.PageIterator.next(PageI... 	NaN
35 	\tat technology.tabula.CommandLineApp.extractT... 	NaN
36 	\tat technology.tabula.CommandLineApp.main(Com... 	NaN
37 	ATOS DO PODER EXECUTIVO 	NaN
38 	ADMINISTRAÇÃO DIRETA 	NaN
39 	DECRETOS 	NaN

And:

In:

from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="5")
pdf_table

Out:

CParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 6

As you can see, I specified the pages parameter. Any idea how to proceed? Thanks!


chezou commented on July 26, 2024

Could you try with tabula-java?
https://github.com/tabulapdf/tabula-java

If page 45 of your PDF includes multiple tables or has merged cells, tabula-py will likely fail. If you use the area option, you might be able to extract the table.

Anyway, I can't guess any further without your PDF.
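
The area suggestion could look roughly like the sketch below. It assumes a recent tabula-py release, where the function is named read_pdf (this thread uses the older read_pdf_table), and that area takes (top, left, bottom, right) in PDF points. make_area, read_region, and the coordinates are hypothetical placeholders; the real values have to be measured on the page in question.

```python
def make_area(top, left, bottom, right):
    """Validate an area given in (top, left, bottom, right) order."""
    if top >= bottom or left >= right:
        raise ValueError("area must satisfy top < bottom and left < right")
    return [top, left, bottom, right]


def read_region(pdf_path, page, area):
    """Extract only the given region of one page."""
    # Deferred import: tabula-py and a Java runtime are needed only when called.
    import tabula

    return tabula.read_pdf(pdf_path, pages=page, area=area)


# Hypothetical coordinates in points (1/72 inch), not taken from the PDF in this thread.
area = make_area(100, 50, 400, 550)
# tables = read_region("../file.pdf", "45", area)
```

The call is left commented out because it needs the actual PDF and a Java runtime.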


RAHAAMA commented on July 26, 2024

I have the same problem with pages="all". Could anyone help me?


chezou commented on July 26, 2024

@RAHAAMA Set multiple_tables=True.
https://github.com/chezou/tabula-py#i-faced-cparsererror-how-can-i-extract-multiple-tables
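
For illustration, a sketch of how that option might be used, assuming a recent tabula-py release in which read_pdf returns a list of DataFrames when multiple_tables=True. read_all_tables and table_shapes are hypothetical helper names, not part of tabula-py.

```python
def read_all_tables(pdf_path, pages="all"):
    """Extract every table on the given pages as separate DataFrames."""
    # Deferred import: tabula-py and a Java runtime are needed only when called.
    import tabula

    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)


def table_shapes(tables):
    """Return the (rows, columns) shape of each extracted table."""
    return [table.shape for table in tables]


# Usage, left commented out because it needs a real PDF:
# tables = read_all_tables("../file.pdf")
# print(table_shapes(tables))
```

Checking the shapes first is a quick way to spot a "table" that is really a block of body text, since its shape tends to differ sharply from the real tables.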


RAHAAMA commented on July 26, 2024

@chezou Thank you. There is another problem with multiple tables. I have a PDF prepared in two languages, so each page has two columns (English and French). When I try to extract the tables, it treats all of the text as a table. Is there any suggestion for this problem?
