
wzbsocialsciencecenter / pdftabextract


A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

Home Page: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/

License: Apache License 2.0

Languages: Python 99.70%, Makefile 0.30%
Topics: data-mining, image-processing, ocr, pdf, python, tables

pdftabextract's People

Contributors

internaut, stweil, timgates42


pdftabextract's Issues

jpeg8.dll does not exist

Hi,
when I run pdftohtml, I get an error because jpeg8.dll is missing on my system (Windows 10, 64-bit).

How can I solve this problem?

Thank you

pdftohtml -c -hidden -xml input.pdf output.xml

This gives us an error about the parameters:

pdftohtml version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdftohtml [options]
-f : first page to convert
-l : last page to convert
-z : initial zoom level (1.0 means 72dpi)
-r : resolution, in DPI (default is 150)
-skipinvisible : do not draw invisible text
-allinvisible : treat all text as invisible
-opw : owner password (for encrypted files)
-upw : user password (for encrypted files)
-q : don't print any messages or errors
-cfg : configuration file to use in place of .xpdfrc
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information

Is there an update to the package? I installed it using:

pip install pdftabextract
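
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the usage text quoted above lists no -c, -hidden or -xml options and carries only a Glyph & Cog copyright, which suggests the Xpdf build of pdftohtml rather than the Poppler one. The sketch below classifies a banner string on that basis. The function name is hypothetical, and the rule that a Poppler build mentions "Poppler" in its banner is an assumption.

```python
def classify_pdftohtml_banner(banner: str) -> str:
    """Guess which project a pdftohtml build comes from, based on its banner text.

    Assumption: Poppler builds mention "Poppler" somewhere in the banner,
    while Xpdf builds only mention Glyph & Cog.
    """
    text = banner.lower()
    if "poppler" in text:
        return "poppler"
    if "glyph & cog" in text:
        return "xpdf"
    return "unknown"


# Banner lines copied from the usage output quoted in this issue.
xpdf_banner = "pdftohtml version 4.00\nCopyright 1996-2017 Glyph & Cog, LLC"
print(classify_pdftohtml_banner(xpdf_banner))  # prints "xpdf"
```

If the banner is classified as "xpdf", the installed pdftohtml is probably a different program from the one the tutorial expects, independent of which pdftabextract version pip installed.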

Logger file missing

It seems the logger file is missing. clustering.py contains this import:

from pdftabextract import logger

Not able to create vertical lines and recognize clusters

I ran catalog_30s.py on one of my PDFs, which has some text at the top and bottom and a two-column table in the center, as in the screenshot below.
image

I changed these parameters in the script:
N_COL_BORDERS = 3
MIN_COL_WIDTH = 687

The output was:

page 1: detecting lines in image file 'data/sample.pdf-1_1.png'...

found 38 lines
saving image with detected lines to 'generated_output/sample.pdf-1_1-lines-orig.png'
saving image with detected lines to 'generated_output/sample.pdf-1_1-lines.png'
WARNING:root:no vertical lines found
no page rotation / skew found
found 0 clusters
Traceback (most recent call last):
  File "sample.py", line 140, in <module>
    img_w_clusters = iproc_obj.draw_line_clusters(imgproc.DIRECTION_VERTICAL, vertical_clusters)
  File "build/bdist.macosx-10.12-intel/egg/pdftabextract/imgproc.py", line 395, in draw_line_clusters
ZeroDivisionError: integer division or modulo by zero

Why is the script unable to recognise the vertical lines? What could be the issue?
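
The crash itself (as opposed to the missing lines) can be avoided by not drawing when no clusters were found. The following is a minimal sketch of such a guard. The wrapper draw_clusters_if_any and the stand-in fake_draw are hypothetical and not part of pdftabextract; fake_draw only imitates a division by the number of clusters, which is an assumption about where the ZeroDivisionError comes from.

```python
import logging


def draw_clusters_if_any(draw_fn, direction, clusters):
    """Call draw_fn(direction, clusters) only when clusters is non-empty.

    draw_fn stands in for a method such as iproc_obj.draw_line_clusters.
    Returns None (and logs a warning) when there is nothing to draw.
    """
    if not clusters:
        logging.warning("no clusters for direction %r; skipping drawing", direction)
        return None
    return draw_fn(direction, clusters)


def fake_draw(direction, clusters):
    # Hypothetical stand-in: dividing by the cluster count fails on empty input.
    return 255 // len(clusters)


print(draw_clusters_if_any(fake_draw, "vertical", []))          # prints None
print(draw_clusters_if_any(fake_draw, "vertical", [[1], [2]]))  # prints 127
```

This only hides the symptom. The warning "no vertical lines found" in the log still means the line detection step returned nothing usable for this page.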

Data Sources

Hello, my graduation thesis is also related to document image recognition. Can you give me your data source?

`Poppler` installation on Windows

I've been trying to install Poppler so that I can run the first command (pdftohtml). I tried "pip install python-poppler-qt5" and a conda installation, but both failed. I also tried adding the source files to my anaconda/lib/site-packages directory, which failed as well. Could someone please tell me how to get Poppler up and running on Windows?

pdftohtml does not create any page images in PNG or JPG format

I am trying to extract table data from PDF files. The first step in the process is to generate the XML file and the page images.

Unfortunately, pdftohtml does not create any image files in the data/ directory when I follow the tutorial at: DataMining- WZBSocialScienceCenter

The command that fails to create the images:

pdftohtml -c -xml -hidden TradingIEX.pdf TradingIEX.pdf.xml

How can I create such page images for non-scanned PDF files?

Thanks.

No text boxes in the output

Hi,
when I run pdftohtml -c -hidden -xml a.pdf a.pdf.xml on this file, the output contains no text boxes, only the information below.

Is this normal? What is wrong with my command?

Thank you

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.41.0">
<page number="1" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-1_1.jpg"/>
</page>
<page number="2" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-2_1.jpg"/>
</page>
<page number="3" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-3_1.jpg"/>
</page>
<page number="4" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-4_1.jpg"/>
</page>
<page number="5" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-5_1.jpg"/>
</page>
</pdf2xml>
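
One way to check such a file quickly is to count the text and image elements on each page. The sketch below does this with Python's standard library; the function name is hypothetical, and the element name "text" for text boxes is an assumption about the pdf2xml format. A page with images but zero text elements most likely has no text layer, for example a scan that was never OCR-processed, though that is a guess about this particular file.

```python
import xml.etree.ElementTree as ET


def count_elements_per_page(xml_bytes: bytes) -> dict:
    """Return {page number: {"text": n, "image": m}} for a pdf2xml document."""
    root = ET.fromstring(xml_bytes)
    counts = {}
    for page in root.iter("page"):
        number = int(page.get("number"))
        counts[number] = {
            "text": len(page.findall("text")),
            "image": len(page.findall("image")),
        }
    return counts


# First page of the output quoted in this issue.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.41.0">
<page number="1" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-1_1.jpg"/>
</page>
</pdf2xml>"""

print(count_elements_per_page(sample))  # {1: {'text': 0, 'image': 1}}
```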

A question

Can this project run on Windows? And can it recognize numbers in images?

Output is not coming

I'm running schoollist_1.py on my own documents, but it produces no output. The code runs, yet the .csv and .xlsx files are empty when I open them.

The documents are scanned company invoices, and I want to process them as you describe: first create the .xml file, then put it into the data folder, then run the script. All the intermediate files (XML, JSON and PNG) appear in the output folder, but the .csv and .xlsx files contain nothing.

Please help.

pdftohtml not generating image tag in XML file

When generating an XML file with pdftohtml like this:
pdftohtml -c -hidden -xml input.pdf output.xml
there is no image tag in the XML file (the command also generates only the XML file; no PNGs are created).
I have followed all the steps in your blog post, but the code fails at
imgfilebasename = p['image'][:p['image'].rindex('.')]
because no image is found, so p['image'] has no attribute rindex.
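
Until the missing image tag is sorted out, the failing line can at least be made to report the problem cleanly. This is a minimal sketch, assuming p is a dict whose 'image' entry is None when the page has no image; the helper image_basename is hypothetical and not part of pdftabextract.

```python
def image_basename(page: dict):
    """Return the page's image file name without its extension, or None.

    None is returned when the page dict has no usable 'image' entry,
    instead of failing on a missing rindex attribute.
    """
    image = page.get("image")
    if not image:
        return None
    dot = image.rfind(".")
    return image[:dot] if dot != -1 else image


print(image_basename({"image": "input.pdf-1_1.png"}))  # prints input.pdf-1_1
print(image_basename({"image": None}))                 # prints None
```

A caller can then skip pages for which the result is None, or stop with a message saying that the XML contains no page image.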
