
Comments (8)

chezou commented on July 26, 2024

In short, you can extract the table with the area and spreadsheet options.

In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
  Unnamed: 0 Col2 Col3 Col4 Col5
0          A    B   12    R    G
1        NaN    R    T   23    H
2          B    B   33    R    A
3          C    T   99    E    M
4          D    I   12   34    M
5          E    I    I    W   90
6        NaN    1    2    W    h
7        NaN    4    3    E    H
8          F    E   E4    R    4

How to use area option

The tabula-java wiki explains how to specify the area:
https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want

Using macOS's Preview, I got the area information:

[image: area information from Preview]

The wiki gives this command:

java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

where the coordinates are derived as follows. Note the left, top, height, and width parameters and calculate:

y1 = top
x1 = left
y2 = top + height
x2 = left + width
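
As a small sketch of the arithmetic above, the conversion can be written as a Python helper. The function name `area_from_box` is hypothetical (it is not part of tabula-py or tabula-java), and the numbers in the example are made up. It only reorders the values into the top, left, bottom, right order used by the -a option and the area argument.

```python
def area_from_box(left, top, width, height):
    """Convert a left/top/width/height box into (y1, x1, y2, x2)."""
    y1 = top
    x1 = left
    y2 = top + height
    x2 = left + width
    return (y1, x1, y2, x2)


# Example with made-up numbers, in PDF points.
area = area_from_box(left=10.0, top=20.0, width=30.0, height=40.0)
print(area)  # (20.0, 10.0, 60.0, 40.0)
```

The resulting tuple can then be passed as the area argument, as in the read_pdf call at the top of this comment.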

I confirmed with tabula-java:

java -jar ./tabula/tabula-0.9.1-jar-with-dependencies.jar -g -r -a "337.29,226.49,472.85,384.91" table.pdf

Without the -r option (same as --spreadsheet), it does not work properly.

from tabula-py.

chezou commented on July 26, 2024

For that type of table, you should use the columns or area option.
This issue may help you: tabulapdf/tabula-java#84


sfinotti commented on July 26, 2024

@sfinotti Use columns instead of column. Note that the columns option doesn't work with lattice mode.

I just tried with "columns" and got the error: 'float' object is not iterable
Then, after changing "col_def" from =(186.681) to =(186.681,), it worked.
So even if you have only ONE column delimiter, it's necessary to add a "," at the end.
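
The trailing-comma point above is plain Python behavior and can be checked without tabula-py. In the sketch below, `as_column_list` is a hypothetical helper (not a tabula-py function) that accepts either a single number or a sequence, so a bare float no longer causes the "not iterable" error.

```python
from collections.abc import Iterable


def as_column_list(value):
    """Hypothetical helper: accept one number or several, return a list."""
    if isinstance(value, Iterable) and not isinstance(value, (str, bytes)):
        return [float(v) for v in value]
    return [float(value)]


not_a_tuple = (186.681)   # parentheses only group: this is a float
one_tuple = (186.681,)    # the trailing comma makes a 1-element tuple

print(type(not_a_tuple).__name__)  # float
print(type(one_tuple).__name__)    # tuple
print(as_column_list(not_a_tuple) == as_column_list(one_tuple))  # True
```

Writing the delimiter as a list, e.g. [186.681], avoids the trailing-comma pitfall altogether.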

Thanks a lot !!!


alonsopg commented on July 26, 2024

Indeed @chezou, I know that this is related to the area or column options. I looked through the docs, but unfortunately I did not understand how to use these parameters in my case. Could you provide an example of how to use the column or area parameters for this case?

For instance I tried this:
In:

df = read_pdf_table('file.pdf', area = (269.875, 12.75, 790.5, 561))

But it still didn't work.


alonsopg commented on July 26, 2024

@chezou thanks for the help! It would be worth adding this information to the docs!


jiteshm17 commented on July 26, 2024

The spreadsheet flag did the trick. Thanks a lot @chezou


sfinotti commented on July 26, 2024

I'm trying to use tabula-py to import some info from PDF files, but I'm having problems with the argument 'column'. In my case, I need to use area (so far so good) and also column, since my data is not very well defined.

The data I need is a 2-column 'table' positioned in a specific area of the PDF files. I used tabula to determine the positions and everything is working well, except for the column argument.

I'm using it this way:

import os

from tabula import read_pdf

def le_2(directory,tab_def,col_def):
    demonstrativos = []
    for filename in os.listdir(directory):
        demo_mes = read_pdf(f"{directory}/{filename}", area=tab_def, columns=col_def, pandas_options={'header':None}, spread=True, guess=False)
        demonstrativos.append(demo_mes)

    return demonstrativos

tab_def=(148.378,13.016,410.922,253.991)
col_def=(186.681)
demo1 = le_2("2019/t1", tab_def, col_def)

The problem is that the column argument seems to be ignored. I always get the same output (as if there were no column argument), no matter what number I use for 'column'.


chezou commented on July 26, 2024

@sfinotti Use columns instead of column. Note that the columns option doesn't work with lattice mode.
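
To illustrate the note above, here is a minimal sketch that collects the keyword arguments before calling read_pdf and refuses the columns-plus-lattice combination. The helper `build_read_pdf_kwargs` is hypothetical. It also assumes a tabula-py version that accepts the lattice and stream keywords, whereas the examples in this thread use the older spreadsheet name.

```python
def build_read_pdf_kwargs(area, columns, lattice=False):
    """Hypothetical helper: collect keyword arguments for tabula.read_pdf."""
    columns = list(columns)  # fails early if a bare float is passed
    if lattice and columns:
        raise ValueError("columns is not used in lattice mode; use stream mode")
    return {
        "area": list(area),
        "columns": columns,
        "lattice": lattice,
        "stream": not lattice,  # assumption: stream mode whenever lattice is off
        "guess": False,
    }


kwargs = build_read_pdf_kwargs(
    area=(148.378, 13.016, 410.922, 253.991),
    columns=(186.681,),
)
# With tabula-py and Java installed, this could then be used as:
#   import tabula
#   dfs = tabula.read_pdf("some.pdf", **kwargs)
print(kwargs["columns"])  # [186.681]
```

Converting columns with list() up front means a bare float such as (186.681) raises a TypeError in the helper itself.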
