wzbsocialsciencecenter / pdftabextract Goto Github PK

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

Home Page: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/

License: Apache License 2.0

Python 99.70% Makefile 0.30%

data-mining image-processing ocr pdf python tables

pdftabextract's People

Contributors

Stargazers

Watchers

Forkers

mikeatm moandcompany kaiserdan neuroradiology priestd09 twocngdagz parksebastien yanlinaung yushu-liu jangkyung roryosiochain seanzhanjx indy9000 chagge ecoblockchain hsavran gondo jburke007 number0 wombatpm shilezi neo4reo sayiho benjamesbabala ddc-gc cash2one li363849131 robingong xwang0415 kamalkarki kangkot bmschmidt riccheng shenggaozhu lxj0276 jkamlah stweil hadokencode chapzq77 dh-heima puremath86 guptam lihka1 jashurc fashtimedotcom newusername321 hitflame mojgang mpvyard kmccabe1990 vibi007 ardentran gpsbird vidascontadas alon-zhong huynguyen7994 cosecant-csc luffydream cenxim liyanhang1989 theamazingjarvis zofuthan krzynio yuyuelovediandian hanxueming126 shanmugapriya2k15 iamlostcoast tybiot jeffli678 generalzh vault-the alwaysproblem cn3c3p shubhampachori12110095 laal65 eudaimonia kjeanclaude weeang763162 jenniferzhu tangcheng2014 dta0502 gts-gi asitang thesoulbender www3838438 xuezhizeng chenkaigithub ownermz breakend2010 justinxiong veilinghigh guoqing1988 fosity xtuyaowu uscdatascience ginking majian7654 romainberti ufukhurriyetoglu yc-maltazard

pdftabextract's Issues

jpeg8.dll does not exist

Hi,
when I run pdftohtml I have an error because I do not have on my system (Win 10 64bit) jpeg8.dll.

How to solve this problem?

Thank you

pdftohtml is not generating image files for the given pdf file.

pdftohtml -c -hidden -xml input.pdf output.xml

This gives us an error about the parameters:

pdftohtml version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdftohtml [options]
-f : first page to convert
-l : last page to convert
-z : initial zoom level (1.0 means 72dpi)
-r : resolution, in DPI (default is 150)
-skipinvisible : do not draw invisible text
-allinvisible : treat all text as invisible
-opw : owner password (for encrypted files)
-upw : user password (for encrypted files)
-q : don't print any messages or errors
-cfg : configuration file to use in place of .xpdfrc
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information

Is there an updating on the package? I installed it using:

pip install pdftabextract

Problem in running script

I am not getting how to use this scripts.
please tell me how to run this scipt

request feature for compatible with google vision api (document_text_detection)

need feature can manipulate with protobuff document scan object
https://cloud.google.com/vision/docs/detecting-fulltext#vision-document-text-detection-python

Logger file missing

It seems logger file is missing

from pdftabextract import logger

in clustering.py

Use pdftabextract convert pdf which is converted by a picture

Hi, I try to convert a pdf to excel, but it failed. It is a table in a picture. I convert the picture into pdf , then I use the code to convert. It failed. So could you tell me what kind of pdf can be converted?

Not able to create vertical lines and recognize clusters

I have run catalog_30s.py, on one of my pdfs which has some text on the top and bottom and a table with 2 columns at the center like below Screen.

I changed these parameters in the script
N_COL_BORDERS = 3
MIN_COL_WIDTH = 687

The output was

page 1: detecting lines in image file 'data/sample.pdf-1_1.png'...

found 38 lines
saving image with detected lines to 'generated_output/sample.pdf-1_1-lines-orig.png'
saving image with detected lines to 'generated_output/sample.pdf-1_1-lines.png'
WARNING:root:no vertical lines found
no page rotation / skew found
found 0 clusters
Traceback (most recent call last):
File "sample.py", line 140, in
img_w_clusters = iproc_obj.draw_line_clusters(imgproc.DIRECTION_VERTICAL, vertical_clusters)
File "build/bdist.macosx-10.12-intel/egg/pdftabextract/imgproc.py", line 395, in draw_line_clusters
ZeroDivisionError: integer division or modulo by zero

Why is the script not able to recognise the vertical lines ? What could be the issue.

poppler pdftohtml my pdf, the pictures of outputs are not right, my system is win7 and the pdf is Chinese

Data Sources

Hello, my graduation thesis is also related to document image recognition. Can you give me your data source?

pdftabextract does not label an text boxes

pdftabextract does not label an text boxes in the scanned PDF I have. What could be possible reason?

Code is Running. However not detecting any data.

Code is not running getting below error.

getting below:

Tagged releases and changelog

Could you tag your releases and make the changelog directly visible in your GitHub releases? This would increase transparency for potential users and distribution maintainers. Something like https://skywinder.github.io/github-changelog-generator/ could be used to generate the changelog based on merged PRs and closed issues since the last release tag.

`Poppler` installation on windows

I've been trying to install Poppler to execute the first line of code pdf2html, using "pip install python-poppler-qt5" also tried conda installation but failed. Tried to add the source files to my anaconda/lib/site-packages directory also failed. Please could someone tell me how to get poppler up and running on windows?

pdftohtml does not create any scanned page with formats png and jpg

I tried to extract table data from PDF files but the first step in the process is to generate xml and page images.

Unfortunately, using pdftohtml library for PDF pages I cannot create any image files in the data/ directory when following the tutorial at: DataMining- WZBSocialScienceCenter

The command that fails to create the images:

pdftohtml -c -xml -hidden TradingIEX.pdf TradingIEX.pdf.xml

How is it possible to create such page images for the non-scanned PDF files?

Thanks.

No text boxes in the output

Hi,
when I run pdftohtml -c -hidden -xml a.pdf a.pdf.xml in this file I have no text boxes in the output, but only the below infos.

Is it normal? What's wrong in my command?

Thank you

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.41.0">
<page number="1" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-1_1.jpg"/>
</page>
<page number="2" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-2_1.jpg"/>
</page>
<page number="3" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-3_1.jpg"/>
</page>
<page number="4" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-4_1.jpg"/>
</page>
<page number="5" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-5_1.jpg"/>
</page>
</pdf2xml>

Please help me.

I want this table to be recognized using pdftabextract.

I converted that to searchable pdf using tesseract and followed every step in this tutorial
https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/
And I got this result
output.xlsx
I want to improve result as much similar as image.
Please help me.

Question

Can this be used to extract tables from something like this through Juypter Notebook?

http://www.tradeweb.com/uploadedFiles/Tradweb/Content/About_Us/02.21.2018%20-%20Tradeweb%20Markets%20Quarterly%20Activity%20Report%20-%204Q17.pdf

Thanks

A question

Can this project run on windows？ And this project can recognize numbers from image？

Output is not coming

I'm using that schoollist_1.py file on my document but it is not showing anything in the output format. It is running the code but it is not showing anything on the output files. The .csv and .xlsx format files are empty when I'm opening them. So please help me out. And I'm using my code on the company invoice. They are basically a scanned documents and I want to using them as you have mentioned them. First by creating the .xml file and then by inserting them into the data folder and then using them. At the output folder all the files are coming. Everything is there, those xml, json, and png format documents. But the output is not being showed to me at my file in any of the .csv or on the .xlsx format
Please help

pdftohtml not generating image tag in XML file

When generating a XML file via pdftohtml like
pdftohtml -c -hidden -xml input.pdf output.xml
there is no image tag in the XML file (also this command is only generating a XML file, PNGs will not be generated).
I have followed all the steps mentioned on your blog post but the code does not execute properly because at
imgflebasename = p['image'][:p['image'].rindex('.')]
no images are found and therefore there is no attribute rindex.