
image-table-ocr's Introduction

Table of Contents

  1. Overview
  2. Requirements
  3. Demo
  4. Modules

Overview

This Python package contains modules that help find and extract tabular data from a PDF or image and write it out in CSV format.

Given an image that contains a table…

img

Extract the text into a CSV format…

PRIZE,ODDS 1 IN:,# OF WINNERS*
$3,9.09,"282,447"
$5,16.66,"154,097"
$7,40.01,"64,169"
$10,26.67,"96,283"
$20,100.00,"25,677"
$30,290.83,"8,829"
$50,239.66,"10,714"
$100,919.66,"2,792"
$500,"6,652.07",386
"$40,000","855,899.99",3
1,i223,
Toa,,
,,
,,"* Based upon 2,567,700"

Requirements

Along with the Python requirements listed in setup.py, which pip installs automatically with this package, some of the modules have a few external requirements.

I haven’t looked into the minimum required versions of these dependencies, but I’ll list the versions that I’m using.

Demo

There is a demo module that will download an image given a URL and try to extract tables from the image and process the cells into a CSV. You can try it out with one of the images included in this repo.

  1. pip3 install table_ocr
  2. python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

That will run against the following image:

img

The following should be printed to your terminal after running the above commands.

Running `extract_tables.main([/tmp/demo_p9on6m8o/simple.png]).`
Extracted the following tables from the image:
[('/tmp/demo_p9on6m8o/simple.png', ['/tmp/demo_p9on6m8o/simple/table-000.png'])]
Processing tables for /tmp/demo_p9on6m8o/simple.png.
Processing table /tmp/demo_p9on6m8o/simple/table-000.png.
Extracted 18 cells from /tmp/demo_p9on6m8o/simple/table-000.png
Cells:
/tmp/demo_p9on6m8o/simple/cells/000-000.png: Cell
/tmp/demo_p9on6m8o/simple/cells/000-001.png: Format
/tmp/demo_p9on6m8o/simple/cells/000-002.png: Formula
...

Here is the entire CSV output:

Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4
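
Note that the formula in the E4 row contains commas, so it is quoted in the output. A minimal sketch of reading the result back with Python's standard csv module follows; the `DEMO_CSV` string and the `parse_table` helper are illustrative names, not part of table_ocr.

```python
import csv
import io

# The demo's CSV output, copied from above.
DEMO_CSV = '''Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4
'''


def parse_table(csv_text):
    """Return the header names and a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return reader.fieldnames, rows


header, rows = parse_table(DEMO_CSV)
for row in rows:
    # The quoted formula comes back as a single field, commas included.
    print(row["Cell"], row["Format"], row["Formula"])
```

Splitting the lines on commas by hand would break the E4 row into five fields, which is why a real CSV parser is the safer choice here.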

Modules

The package is split into modules with narrow focuses.

  • pdf_to_images uses Poppler and ImageMagick to extract images from a PDF.
  • extract_tables finds and extracts table-looking things from an image.
  • extract_cells extracts and orders cells from a table.
  • ocr_image uses Tesseract to OCR the text from an image of a cell.
  • ocr_to_csv converts the directory structure that ocr_image outputs into a CSV.

Each module's output can be used as the next module's input, so the modules can be chained together into a complete workflow, as the following shell script demonstrates.

#!/bin/sh

PDF=$1

# Extract images from the PDF and record the paths of the PNG files.
python -m table_ocr.pdf_to_images "$PDF" | grep .png > /tmp/pdf-images.txt
# Find the tables in each image.
cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {}  | grep table > /tmp/extracted-tables.txt
# Split each table into cell images.
cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
# OCR each cell image.
cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}

# Convert the OCR text files next to each table into CSV.
for image in $(cat /tmp/extracted-tables.txt); do
    dir=$(dirname "$image")
    python -m table_ocr.ocr_to_csv $(find "$dir/cells" -name "*.txt")
done
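
The same chain could also be driven from Python. The following is only a sketch under assumptions: it calls the modules through `python -m`, exactly as the shell script does, and it assumes each module prints one output path per line and that ocr_to_csv prints the CSV to standard output. The helper names (`stage_command`, `keep_lines`, `run_stage`, `pdf_to_csv`) are hypothetical and not part of table_ocr.

```python
import subprocess
import sys
from pathlib import Path


def stage_command(module, *args):
    """Build the argv for one `python -m table_ocr.<module>` call."""
    return [sys.executable, "-m", f"table_ocr.{module}", *args]


def keep_lines(output, needle):
    """Rough stand-in for `grep needle`: keep lines containing the literal text."""
    return [line for line in output.splitlines() if needle in line]


def run_stage(module, path, needle):
    """Run one stage on one input and return the matching output lines."""
    result = subprocess.run(
        stage_command(module, path), capture_output=True, text=True, check=True
    )
    return keep_lines(result.stdout, needle)


def pdf_to_csv(pdf_path):
    """Chain the stages; returns one CSV string per extracted table."""
    images = run_stage("pdf_to_images", pdf_path, ".png")
    tables = [t for img in images for t in run_stage("extract_tables", img, "table")]
    cells = [c for tbl in tables for c in run_stage("extract_cells", tbl, "cells")]
    for cell in cells:
        subprocess.run(stage_command("ocr_image", cell), check=True)
    csvs = []
    for table in tables:
        # Same lookup as `find $dir/cells -name "*.txt"`, but sorted (an assumption).
        txt_files = sorted(str(p) for p in (Path(table).parent / "cells").rglob("*.txt"))
        result = subprocess.run(
            stage_command("ocr_to_csv", *txt_files),
            capture_output=True, text=True, check=True,
        )
        csvs.append(result.stdout)
    return csvs
```

Calling `pdf_to_csv("some.pdf")` needs table_ocr and its external tools installed; nothing runs on import.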

The package was written in a literate programming style. The source code at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html is meant to act as the documentation and reference material.

image-table-ocr's Issues

Tesseract error in preprocessing

I am attempting to OCR a table and keep getting an error:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/table_ocr/pdf_to_images/__init__.py", line 69, in preprocess_img
rotate = get_rotate(filepath, tess_params)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/table_ocr/pdf_to_images/__init__.py", line 79, in get_rotate
subprocess.check_output(tess_command)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 336, in check_output
**kwargs).stdout
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['tesseract', '--psm', '0', '--oem', '0', '/Users/andrewmcfadden/Documents/GitHub/one2many.github.io/image-table-ocr/dance/ga-20190131-001.png', '-']' returned non-zero exit status 1.

The image is the logo at the top of the page (every page).
ga-20190131-001
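
The CalledProcessError message above only reports the exit status, not what Tesseract itself complained about. One way to see the underlying message is to rerun the command and capture its standard error. The sketch below uses a stand-in command that fails on purpose; `run_and_report` is a hypothetical helper, and the real argv from the error message would be substituted for `failing`.

```python
import subprocess
import sys


def run_and_report(command):
    """Run command and return (exit_code, stdout, stderr) instead of raising."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Stand-in for the failing tesseract call: a child process that writes to
# stderr and exits with status 1. Replace with the argv from the traceback.
failing = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('example failure message'); sys.exit(1)",
]

code, out, err = run_and_report(failing)
print(f"exit status {code}: {err}")
```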

unable to run the code

Can you please share the setup instructions? I am getting the error below:

"pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:\Users\Ankur.Biswal\AppData\Local\Tesseract-OCR\tessdata/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')"
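
The path in this message contains "tessdata/tessdata", which suggests the prefix may already point at the tessdata directory while Tesseract appends another "tessdata" to it. A small check like the following can show which layout actually holds the file. It is a sketch only: `locate_traineddata` is a hypothetical helper, and which of the two layouts a given Tesseract version expects is something to verify locally.

```python
import os
from pathlib import Path


def locate_traineddata(prefix, lang="eng"):
    """Return the first existing <lang>.traineddata under prefix, else None.

    Two layouts are checked: prefix is the tessdata directory itself, or
    prefix is the parent directory that contains tessdata.
    """
    base = Path(prefix)
    candidates = [
        base / f"{lang}.traineddata",
        base / "tessdata" / f"{lang}.traineddata",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix is None:
        print("TESSDATA_PREFIX is not set")
    else:
        found = locate_traineddata(prefix)
        print(found if found else f"no eng.traineddata found under {prefix}")
```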

Running issue with simple.png example under Win 10

Dear Eihli, your program will help me in the future for personal purposes. I am running it on Win 10. I followed all the steps to simply extract data from images, but I can't find out why it does not run through.

Here is the message after I run py -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

Running extract_tables.main([C:\Users\MAGICB~1\AppData\Local\Temp\demo_cp3ejb98\simple.png]).
Extracted the following tables from the image:
[('C:\Users\*****\AppData\Local\Temp\demo_cp3ejb98\simple.png', ['C:\Users\*****\AppData\Local\Temp\demo_cp3ejb98\simple\table-000.png'])]
Processing tables for C:\Users\*****\AppData\Local\Temp\demo_cp3ejb98\simple.png.
Processing table C:\Users\*****\AppData\Local\Temp\demo_cp3ejb98\simple\table-000.png.
Traceback (most recent call last):
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 255, in run_tesseract
proc = subprocess.Popen(cmd_args, **subprocess_args())
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 947, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 1416, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\demo\__main__.py", line 51, in <module>
csv_output = main(sys.argv[1])
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\demo\__main__.py", line 32, in main
ocr = [
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\demo\__main__.py", line 33, in <listcomp>
table_ocr.ocr_image.main(cell, None)
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\ocr_image\__init__.py", line 31, in main
txt = ocr_image(cropped, " ".join(tess_args))
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\table_ocr\ocr_image\__init__.py", line 83, in ocr_image
return pytesseract.image_to_string(
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 409, in image_to_string
return {
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 412, in <lambda>
Output.STRING: lambda: run_and_get_output(*args),
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 287, in run_and_get_output
run_tesseract(**kwargs)
File "C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\pytesseract\pytesseract.py", line 259, in run_tesseract
raise TesseractNotFoundError()
pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.

I have tesseract installed, so I don't understand the error:
PS C:\Users\*****\AppData\Local\Programs\Python\Python39> py -m pip install tesseract
Requirement already satisfied: tesseract in c:\users\*****\appdata\local\programs\python\python39\lib\site-packages (0.1.3)

Thanks for your help.

Eddy
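
The traceback shows pytesseract failing to launch an external `tesseract` program, which is separate from any pip package. A quick way to check whether such an executable is reachable is to look it up on PATH. This is a minimal sketch; `find_tesseract` is a hypothetical helper, and it assumes the executable is named `tesseract` (or `tesseract.exe`, which `shutil.which` also resolves on Windows).

```python
import shutil


def find_tesseract(name="tesseract"):
    """Return the full path of the executable found on PATH, or None."""
    return shutil.which(name)


path = find_tesseract()
if path is None:
    # The Tesseract program itself is missing from PATH; installing a pip
    # package alone would not be expected to put it there.
    print("tesseract executable not found on PATH")
else:
    print(f"tesseract found at {path}")
```

If the program is installed but not on PATH, pytesseract's README describes setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.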

Version of the external requirements

First, thanks for this package; it looks amazing.

Which version should I install of:

  • pdfimages from Poppler
  • Tesseract
  • mogrify from ImageMagick

ModuleNotFoundError: No module named 'table_ocr' (windows/mac)

Hi - thank you for creating this - it really looks useful!
When I try pip3 install table_ocr followed by python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

I always get the same problem - No module named 'table_ocr'

The installation runs successfully, all the dependencies are installed.

Happens both on Windows and Mac. Am I missing something?

image

Error opening data file /usr/share/tessdata/table-ocr.traineddata

Hello, thanks for this repo!

It's a bit hard to understand how to get it working when you simply start with a PNG image and want to give it a try. So I'm trying with a sample file you provide.

I run

python -m table_ocr.extract_tables resources/examples/example-page-table-000.png | grep table > /tmp/extracted-tables.txt
cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {} --psm 7 -l table-ocr

and I get

pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /usr/share/tessdata/table-ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'table-ocr\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

What I don't understand is how to get the table-ocr.traineddata file that tesseract seems to be looking for.

Thanks again

Traineddata path issue on Windows 10.

When I run

python -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

I get

pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:UsersGetyAppDataLocalProgramsPythonPython38libsite-packagestable_ocrtessdata/table-ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'table-ocr\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

(note file path does not have '/')

File does exist

I tried setting env variable TESSDATA_PREFIX - same error.

I also tried specifying the path on the command line: python -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png --tessdata-dir C:\Users\Btycoon\AppData\Local\Programs\Python\Python38\Lib\site-packages\table_ocr\tessdata

I am on Windows 10.
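
One plausible explanation for the missing separators, offered as an assumption rather than a confirmed diagnosis: if the Tesseract arguments are assembled into a single string and later split with `shlex.split` (which pytesseract does for its `config` string), POSIX-style splitting treats each backslash as an escape character and drops it. The sketch below reproduces that effect with a hypothetical path and shows two possible workarounds.

```python
import shlex
from pathlib import PureWindowsPath

tessdata_dir = r"C:\Users\example\table_ocr\tessdata"  # hypothetical path

# Unquoted: each backslash escapes the next character and disappears.
naive = shlex.split("--psm 6 --tessdata-dir {0}".format(tessdata_dir))
print(naive[-1])

# Workaround 1: use forward slashes, which need no escaping.
posix_style = PureWindowsPath(tessdata_dir).as_posix()
with_slashes = shlex.split("--psm 6 --tessdata-dir {0}".format(posix_style))
print(with_slashes[-1])

# Workaround 2: quote the path so the backslashes are taken literally.
quoted = shlex.split("--psm 6 --tessdata-dir {0}".format(shlex.quote(tessdata_dir)))
print(quoted[-1])
```

Whether Tesseract on Windows accepts the forward-slash form for `--tessdata-dir` should be verified locally.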

Tessdata access error under Windows

Hi,

I ran the following demo command

python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png

on Windows but got the following error:

raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:UsersjackylamAppDataLocalPackagesPythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0LocalCachelocal-packagesPython310site-packagestable_ocrtessdata/table-ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'table-ocr' Tesseract couldn't load any languages! Could not initialize tesseract.')

I set TESSDATA_PREFIX to point to a directory containing table-ocr.traineddata, but it made no difference.

However, the problem doesn't happen on Linux. As my project needs to run on Windows, I hope someone can give me a hint on this issue.

Thanks,
Sing

Merged columns are not detected

Dear @eihli ,

Thank you very much for your project. It works great!
I have not fully understood your detection algorithms yet, but I think I found an issue whose fix would improve the accuracy of your package. I noticed that when some columns are merged, the program still cuts the merged cell along the boundaries of the major columns. By contrast, your program works well when rows are merged.
Here is the example:
table_to_cut_vertical

The extract_cell_images_from_table method's results:

table_type1_indexed10

I will take a deeper look into the code. Meanwhile, I think it's better to report this to you so that the library can be enhanced in the future.
Aside from this minor issue, your library is awesome.

Thanks again and best regards

End to End Instruction

Hi, glad that I found this. Kudos to the developers first of all. I was just wondering if you could provide descriptive end-to-end steps from input PDF to output CSV. It's not exactly clear from the shell script you gave. Thanks!

I can't run with any URL

Hello, I am opening this issue because I need help. Could you please help me?

I cloned the repository and, following your README, managed to run your demo (Image 1 shows the successful execution).

Image 1

However, I have some issues.

  1. I did not find the CSV spreadsheet on my computer. I found the txt files in /var/tmp (Image 2), but I didn't find the CSV spreadsheet.

Image 2

  2. I tried to execute the same command with my own URL. I put a PNG image in a public GitHub repository, passed its link, and got an error (Image 3). (I used this URL: https://github.com/ajandrey/OCR/blob/main/table.png)

Image 3

  3. I tried to run the same command again, but with a link from your page. It was not the same URL as in your README, but it points to the same image, and it returned the same error (Image 4). (For this, I used this URL: https://github.com/eihli/image-table-ocr/blob/master/resources/test_data/simple.png)

Image 4

So I can't run for any link.
Questions:

  1. Does the link need to meet any particular requirements? Can it be any link that points to an image?
  2. I already have the images of the tables; they are not in PDF, so I just need the modules extract_cells, ocr_image, and ocr_to_csv. Can I run them on a folder of table images, for example?
    (Note that the failing runs above did not use only these three modules; I have not yet performed this test.)

Thank you, and I look forward to your reply.
Alessandra Jandrey
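
One difference worth noting between the URLs above and the one in the README: the failing links have the form github.com/…/blob/…, which normally serves an HTML page, while the README's demo link uses raw.githubusercontent.com, which serves the image file itself. Assuming that is the cause, a link could be converted as sketched below. `to_raw_github_url` is a hypothetical helper, and it assumes the branch name contains no slash.

```python
from urllib.parse import urlsplit


def to_raw_github_url(url):
    """Convert github.com/<user>/<repo>/blob/<ref>/<path> to its raw form.

    URLs that do not match that shape are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url
    user, repo, _blob, ref, *path = segments
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, ref, *path])


print(to_raw_github_url(
    "https://github.com/eihli/image-table-ocr/blob/master/resources/test_data/simple.png"
))
```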

No way to get hocr of the image with the table_ocr library

We use the config below to get the table OCR, but there is no way to get the hOCR of the image. Can someone add this feature, please?
d = os.path.dirname(sys.modules["table_ocr"].__file__)
tessdata_dir = os.path.join(d, "tessdata")
tess_args = "--psm 6 -l table-ocr --tessdata-dir {0}".format(tessdata_dir)

UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 15: ordinal not in range(128)

Traceback (most recent call last):
File "/opt/anaconda3/envs/Hyper-Table-Recognition/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/anaconda3/envs/Hyper-Table-Recognition/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/demo/__main__.py", line 51, in <module>
csv_output = main(sys.argv[1])
File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/demo/__main__.py", line 34, in main
for cell in cells
File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/demo/__main__.py", line 34, in <listcomp>
for cell in cells
File "/Users/chouroukhelaoui/PycharmProjects/image-table-ocr/table_ocr/ocr_image/__init__.py", line 33, in main
txt_file.write(txt)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 15: ordinal not in range(128)

In some cases we get this issue. It can be fixed by adding this line of code at line 32 of "/image-table-ocr/table_ocr/ocr_image/__init__.py":

txt = txt.encode('ascii', 'ignore').decode('ascii')
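
The line above avoids the error by dropping every non-ASCII character from the OCR text. An alternative that keeps those characters is to open the output file with an explicit UTF-8 encoding. The sketch below illustrates the idea outside the library; `write_cell_text` and the sample string are hypothetical, and how the file is actually opened in table_ocr is not shown in this report.

```python
import tempfile
from pathlib import Path


def write_cell_text(path, text):
    """Write OCR text as UTF-8 regardless of the platform's default encoding."""
    with open(path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)


# Sample text containing U+2019, the character named in the error above.
sample = "Winner\u2019s prize"

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "000-000.txt"
    write_cell_text(out, sample)
    round_trip = out.read_text(encoding="utf-8")

print(round_trip == sample)
```

Any later step that reads these text files would need to read them as UTF-8 as well.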
