eihli / image-table-ocr

Turn images of tables into CSV data. Detect tables from images and run OCR on the cells.

License: MIT License

Python 97.28% Shell 2.72%

image-table-ocr's Introduction

Overview
Requirements
Demo
Modules

Overview

This Python package contains modules to help with finding and extracting tabular data from a PDF or image into a CSV format.

Given an image that contains a table…

Extract the text into a CSV format…

PRIZE,ODDS 1 IN:,# OF WINNERS*
$3,9.09,"282,447"
$5,16.66,"154,097"
$7,40.01,"64,169"
$10,26.67,"96,283"
$20,100.00,"25,677"
$30,290.83,"8,829"
$50,239.66,"10,714"
$100,919.66,"2,792"
$500,"6,652.07",386
"$40,000","855,899.99",3
1,i223,
Toa,,
,,
,,"* Based upon 2,567,700"
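
Because several fields contain commas (for example "282,447"), output like this is best read with a CSV parser rather than by splitting on commas. A minimal sketch using Python's standard csv module on a few of the rows above; the variable names are illustrative only:

```python
import csv
import io

# A few rows of the sample output above, copied verbatim.
sample = '''PRIZE,ODDS 1 IN:,# OF WINNERS*
$3,9.09,"282,447"
$500,"6,652.07",386
'''

rows = list(csv.reader(io.StringIO(sample)))
header, data = rows[0], rows[1:]

# Quoted fields keep their embedded commas as part of a single value.
winners = [int(row[2].replace(",", "")) for row in data]
print(header)
print(winners)
```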

Requirements

Along with the python requirements that are listed in setup.py and that are automatically installed when installing this package through pip, there are a few external requirements for some of the modules.

I haven’t looked into the minimum required versions of these dependencies, but I’ll list the versions that I’m using.

pdfimages 20.09.0 of Poppler
tesseract 5.0.0 of Tesseract
mogrify 7.0.10 of ImageMagick

Demo

There is a demo module that will download an image given a URL and try to extract tables from the image and process the cells into a CSV. You can try it out with one of the images included in this repo.

pip3 install table_ocr
python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

That will run against the following image:

The following should be printed to your terminal after running the above commands.

Running `extract_tables.main([/tmp/demo_p9on6m8o/simple.png]).`
Extracted the following tables from the image:
[('/tmp/demo_p9on6m8o/simple.png', ['/tmp/demo_p9on6m8o/simple/table-000.png'])]
Processing tables for /tmp/demo_p9on6m8o/simple.png.
Processing table /tmp/demo_p9on6m8o/simple/table-000.png.
Extracted 18 cells from /tmp/demo_p9on6m8o/simple/table-000.png
Cells:
/tmp/demo_p9on6m8o/simple/cells/000-000.png: Cell
/tmp/demo_p9on6m8o/simple/cells/000-001.png: Format
/tmp/demo_p9on6m8o/simple/cells/000-002.png: Formula
...

Here is the entire CSV output:

Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4

Modules

The package is split into modules with narrow focuses.

pdf_to_images uses Poppler and ImageMagick to extract images from a PDF.
extract_tables finds and extracts table-looking things from an image.
extract_cells extracts and orders cells from a table.
ocr_image uses Tesseract to OCR the text from an image of a cell.
ocr_to_csv converts into a CSV the directory structure that ocr_image outputs.

The outputs of a previous module can be used by a subsequent module so that they can be chained together to create the entire workflow, as demonstrated by the following shell script.

#!/bin/sh

PDF=$1

python -m table_ocr.pdf_to_images $PDF | grep .png > /tmp/pdf-images.txt
cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {}  | grep table > /tmp/extracted-tables.txt
cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}

for image in $(cat /tmp/extracted-tables.txt); do
    dir=$(dirname $image)
    python -m table_ocr.ocr_to_csv $(find $dir/cells -name "*.txt")
done
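
The same chain can be sketched in Python by invoking each module with `python -m`, as the shell script does. The helper names (`keep_lines`, `run_module`, `pdf_to_csv`) are hypothetical and not part of the package. The sketch assumes each module prints one output path per line, which is what the script's grep filters rely on, and that ocr_to_csv prints the CSV to standard output.

```python
import subprocess
import sys
from pathlib import Path


def keep_lines(output, needle):
    """Return the lines of `output` containing `needle` (a fixed-string grep)."""
    return [line for line in output.splitlines() if needle in line]


def run_module(module, *args):
    """Run `python -m table_ocr.<module> <args>` and return its stdout."""
    cmd = [sys.executable, "-m", f"table_ocr.{module}", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def pdf_to_csv(pdf_path):
    """Mirror the shell script; return one CSV string per extracted table."""
    images = keep_lines(run_module("pdf_to_images", pdf_path), ".png")
    tables = [t for image in images
              for t in keep_lines(run_module("extract_tables", image), "table")]
    for table in tables:
        for cell in keep_lines(run_module("extract_cells", table), "cells"):
            run_module("ocr_image", cell)
    csvs = []
    for table in tables:
        # Same lookup as the script's `find $dir/cells -name "*.txt"`.
        txt_files = [str(p) for p in (Path(table).parent / "cells").rglob("*.txt")]
        csvs.append(run_module("ocr_to_csv", *txt_files))
    return csvs
```

Calling `pdf_to_csv("some.pdf")` requires the package and its external tools to be installed; `check=True` makes a failing step raise instead of passing empty output to the next one.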

The package was written in a literate programming style. The source code at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html is meant to act as the documentation and reference material.

image-table-ocr's Issues

Can I use script for local files?

Hello. Can I OCR local images and other files?

Tesseract error in preprocessing

Attempting to OCR a table and I keep getting an error.
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/table_ocr/pdf_to_images/__init__.py", line 69, in preprocess_img
rotate = get_rotate(filepath, tess_params)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/table_ocr/pdf_to_images/__init__.py", line 79, in get_rotate
subprocess.check_output(tess_command)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 336, in check_output
**kwargs).stdout
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['tesseract', '--psm', '0', '--oem', '0', '/Users/andrewmcfadden/Documents/GitHub/one2many.github.io/image-table-ocr/dance/ga-20190131-001.png', '-']' returned non-zero exit status 1.

The image is the logo at the top of the page (every page).
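
A non-zero exit status from `tesseract --psm 0` means orientation detection failed for that image. That may happen when an image contains too little text, which is an assumption about this logo rather than something the report confirms. One way to tolerate such images is to catch the error and fall back to a default. `run_or_default` is a hypothetical wrapper, not the package's `get_rotate`, and the commands below are stand-ins so the sketch runs without Tesseract:

```python
import subprocess
import sys


def run_or_default(command, default):
    """Return the command's stdout, or `default` if it exits with an error."""
    try:
        return subprocess.check_output(command, text=True)
    except subprocess.CalledProcessError:
        return default


# Stand-in commands: one succeeds, one exits with status 1.
ok = run_or_default([sys.executable, "-c", "print('Rotate: 90')"], "Rotate: 0")
failed = run_or_default([sys.executable, "-c", "raise SystemExit(1)"], "Rotate: 0")
```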

unable to run the code

Can you please share the setup instructions? I am getting the error below:

"pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:\Users\Ankur.Biswal\AppData\Local\Tesseract-OCR\tessdata/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')"

Extract a table that spans across pages

Hey, I'm not able to extract a table that continues from one page to another as a single table using the tools here:
Life_Cycle_Assessment_of_Cow_Tanned_Leather_Produc.pdf
can someone help?

Running issue with simple.png example under Win 10

Dear Eihli, Your program will help me in the future for personal purposes. I am running it on Win 10. I followed all the steps to extract data from images, but I can't figure out why it does not run to completion.

Here is the message after I run py -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

Running extract_tables.main([C:\Users\MAGICB~1\AppData\Local\Temp\demo_cp3ejb98\simple.png]).
Extracted the following tables from the image:
[('C:\Users\\AppData\Local\Temp\demo_cp3ejb98\simple.png', ['C:\Users\\AppData\Local\Temp\demo_cp3ejb98\simple\table-000.png'])]
Processing tables for C:\Users*\AppData\Local\Temp\demo_cp3ejb98\simple.png.
Processing table C:\Users*\AppData\Local\Temp\demo_cp3ejb98\simple\table-000.png.
Traceback (most recent call last):
File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 255, in run_tesseract
proc = subprocess.Popen(cmd_args, **subprocess_args())
File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 947, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users*****\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 1416, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users*****\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\demo\__main__.py", line 51, in <module>
csv_output = main(sys.argv[1])
File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\demo\__main__.py", line 32, in main
ocr = [
File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\demo\__main__.py", line 33, in <listcomp>
table_ocr.ocr_image.main(cell, None)
File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\ocr_image\__init__.py", line 31, in main
txt = ocr_image(cropped, " ".join(tess_args))
File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\ocr_image\__init__.py", line 83, in ocr_image
return pytesseract.image_to_string(
File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 409, in image_to_string
return {
File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 412, in <lambda>
Output.STRING: lambda: run_and_get_output(args),
File "C:\Users*\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 287, in run_and_get_output
run_tesseract(**kwargs)
File "C:\Users**\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 259, in run_tesseract
raise TesseractNotFoundError()
pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.

I have tesseract installed, so I don't understand the error:
PS C:\Users*\AppData\Local\Programs\Python\Python39> py -m pip install tesseract
Requirement already satisfied: tesseract in c:\users*\appdata\local\programs\python\python39\lib\site-packages (0.1.3)

Thanks for your help.

Eddy
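
The pip output above refers to a Python distribution named tesseract, which does not appear to provide the Tesseract executable that pytesseract launches; the TesseractNotFoundError says the program itself is missing from PATH. A quick standard-library check for the external programs named in the Requirements section (the helper name `missing_tools` is hypothetical):

```python
import shutil


def missing_tools(names=("tesseract", "pdfimages", "mogrify")):
    """Return the external programs from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("Missing from PATH:", ", ".join(missing) if missing else "none")
```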

Version of the external requirements

First, thanks for this package. It looks amazing.

Help: which versions of the following should I install?

pdfimages from Poppler
Tesseract
mogrify from ImageMagick

ModuleNotFoundError: No module named 'table_ocr' (windows/mac)

Hi - thank you for creating this - it really looks useful!
When I try pip3 install table_ocr followed by python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

I always get the same problem - No module named 'table_ocr'

The installation runs successfully, all the dependencies are installed.

Happens both on Windows and Mac. Am I missing something?

Error opening data file /usr/share/tessdata/table-ocr.traineddata

Hello, thanks for this repo!

It's a bit hard to understand how to get it working when you simply start with a PNG image and want to give it a try. So I'm trying with a sample file you're giving.

I run

python -m table_ocr.extract_tables resources/examples/example-page-table-000.png | grep table > /tmp/extracted-tables.txt
cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {} --psm 7 -l table-ocr

and I get

pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /usr/share/tessdata/table-ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'table-ocr\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

What I don't understand is how to get the table-ocr.traineddata file that tesseract seems to be looking for.

Thanks again

Traineddata path issue on Windows 10.

When i run

python -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

i get

pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:UsersGetyAppDataLocalProgramsPythonPython38libsite-packagestable_ocrtessdata/table-ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'table-ocr\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

(note that the file path has lost its backslashes)

File does exist

I tried setting env variable TESSDATA_PREFIX - same error.

I also tried specifying the path on the command line: python -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png --tessdata-dir C:\Users\Btycoon\AppData\Local\Programs\Python\Python38\Lib\site-packages\table_ocr\tessdata

I am on Windows 10.
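
The missing backslashes are consistent with the option string being tokenized by `shlex.split` in POSIX mode, where a backslash is an escape character and is dropped. pytesseract is understood to split its config string this way, though that is an assumption here rather than something the report confirms. A small standard-library sketch of the effect, with forward slashes as one possible workaround (the path is illustrative):

```python
import shlex
from pathlib import PureWindowsPath

# Illustrative path modeled on the one in the error message.
tessdata = r"C:\Users\Gety\table_ocr\tessdata"

# POSIX-mode splitting treats each backslash as an escape and removes it.
broken = shlex.split(f"--psm 6 --tessdata-dir {tessdata}")

# Forward slashes survive splitting, and Windows accepts them in paths.
fixed = shlex.split(f"--psm 6 --tessdata-dir {PureWindowsPath(tessdata).as_posix()}")

print(broken[-1])
print(fixed[-1])
```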

Tessdata access error under Windows

Hi,

I run the following demo command

python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

on Windows but got the following error:

raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:UsersjackylamAppDataLocalPackagesPythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0LocalCachelocal-packagesPython310site-packagestable_ocrtessdata/table-ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'table-ocr' Tesseract couldn't load any languages! Could not initialize tesseract.')

I set TESSDATA_PREFIX to point to a directory containing table-ocr.traineddata, but it made no difference.

However, the above problem doesn't happen on Linux. As my project needs to run on Windows, I hope someone can give me a hint on this issue.

Thanks,
Sing

Merged columns are not detected

Dear @eihli ,

Thank you very much for your project. It works great!
I have not fully understood your detection algorithm yet, but I think I have found an issue whose fix would improve the accuracy of your package. I noticed that when some columns are merged, the program still cuts the merged cell along the boundaries of the main columns. By contrast, the program works well when rows are merged.
Here is the example:

The extract_cell_images_from_table method's results:

I will take a deeper look into the code. Meanwhile, I think it's better to report this to you so that the library can be enhanced in the future.
Aside from this minor issue, your library is awesome.

Thanks again and best regards

End to End Instruction

Hi, glad that I found this. Kudos to the developers first of all. I was just wondering if you could provide descriptive end-to-end steps from input PDF to output CSV. It's not exactly clear from the shell script you gave. Thanks!

I can't run with any URL

Hello, I open this question because I need help. May you please help me?

I cloned the repository and, following your README, I managed to run your demo (Image 1 shows successful execution).

However, I have some issues.

I did not find the csv spreadsheet on my computer. I found the txt files in /var/tmp (Image 2), but I didn't find the csv spreadsheet.

I tried to execute the same command with a URL that I sent. So I put a png image in a public GitHub repository and sent the link and I got an error (Image 3). (I used this URL: https://github.com/ajandrey/OCR/blob/main/table.png)

I tried to run the same command again, but with a link from your page. I didn't use the exact URL from your README, but I used the same image and got the same error (Image 4). (For this, I used this URL: https://github.com/eihli/image-table-ocr/blob/master/resources/test_data/simple.png)

So I can't run for any link.
Questions:

Does the link need to have any specifications? Can't it be any link pointing to an image?
I already have the images of the tables, they are not in PDF, so I just need modules extract_cells, ocr_image, and ocr_to_csv. Can I use it to run in an image folder (of tables) for example?
(Note that the failing run did not use only these three modules; I have not yet tested them on their own.)

Thank you and I look forward to your reply.
Alessandra Jandrey
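
One likely cause: a `github.com/.../blob/...` address returns an HTML page that displays the file, not the image bytes, whereas the README's demo command uses a `raw.githubusercontent.com` address. A hedged sketch of converting one form to the other; `to_raw_url` is a hypothetical helper, not part of table_ocr:

```python
from urllib.parse import urlsplit


def to_raw_url(url):
    """Map a github.com 'blob' page URL to its raw.githubusercontent.com form.

    URLs that do not match the expected pattern are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<branch>/<path...>
    if parts.netloc == "github.com" and len(segments) > 4 and segments[2] == "blob":
        owner, repo, _, *rest = segments
        return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])
    return url
```

Applied to the second URL in the report, this yields the same address that the README's demo command uses.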

comment

It's hard to use.

No way to get hocr of the image with the table_ocr library

We use the config below to run OCR on the table, but there is no way to get hOCR output for the image. Can someone add this feature, please?

d = os.path.dirname(sys.modules["table_ocr"].__file__)
tessdata_dir = os.path.join(d, "tessdata")
tess_args = "--psm 6 -l table-ocr --tessdata-dir {0}".format(tessdata_dir)

UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 15: ordinal not in range(128)

Traceback (most recent call last):
File "/opt/anaconda3/envs/Hyper-Table-Recognition/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/anaconda3/envs/Hyper-Table-Recognition/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/demo/__main__.py", line 51, in <module>
csv_output = main(sys.argv[1])
File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/demo/__main__.py", line 34, in main
for cell in cells
File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/demo/__main__.py", line 34, in <listcomp>
for cell in cells
File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/ocr_image/__init__.py", line 33, in main
txt_file.write(txt)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 15: ordinal not in range(128)

In some cases we get this error. It can be fixed by adding the following line at line 32 of "/image-table-ocr/table_ocr/ocr_image/__init__.py":

txt = txt.encode('ascii', 'ignore').decode('ascii')
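
An alternative that keeps characters such as the right single quotation mark (U+2019) is to open the output file with an explicit UTF-8 encoding instead of relying on the platform default. This is a sketch with a hypothetical helper `write_text`; it is not the package's own code.

```python
import os
import tempfile


def write_text(path, txt):
    """Write OCR text as UTF-8 regardless of the locale's default encoding."""
    with open(path, "w", encoding="utf-8") as txt_file:
        txt_file.write(txt)


# Round-trip a string containing U+2019, the character from the traceback.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "cell.txt")
    write_text(out_path, "Player\u2019s odds")
    with open(out_path, encoding="utf-8") as f:
        round_trip = f.read()
```

Unlike the `encode('ascii', 'ignore')` workaround, this does not drop any recognized characters from the OCR text.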