wzbsocialsciencecenter / pdftabextract Goto Github PK
View Code? Open in Web Editor NEWA set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
License: Apache License 2.0
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
License: Apache License 2.0
Hi,
when I run pdftohtml I have an error because I do not have on my system (Win 10 64bit) jpeg8.dll.
How to solve this problem?
Thank you
This gives us an error about the parameters:
pdftohtml version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdftohtml [options]
-f : first page to convert
-l : last page to convert
-z : initial zoom level (1.0 means 72dpi)
-r : resolution, in DPI (default is 150)
-skipinvisible : do not draw invisible text
-allinvisible : treat all text as invisible
-opw : owner password (for encrypted files)
-upw : user password (for encrypted files)
-q : don't print any messages or errors
-cfg : configuration file to use in place of .xpdfrc
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information
Is there an updating on the package? I installed it using:
pip install pdftabextract
I am not getting how to use this scripts.
please tell me how to run this scipt
need feature can manipulate with protobuff document scan object
https://cloud.google.com/vision/docs/detecting-fulltext#vision-document-text-detection-python
It seems logger file is missing
from pdftabextract import logger
in clustering.py
Hi, I try to convert a pdf to excel, but it failed. It is a table in a picture. I convert the picture into pdf , then I use the code to convert. It failed. So could you tell me what kind of pdf can be converted?
I have run catalog_30s.py, on one of my pdfs which has some text on the top and bottom and a table with 2 columns at the center like below Screen.
I changed these parameters in the script
N_COL_BORDERS = 3
MIN_COL_WIDTH = 687
The output was
page 1: detecting lines in image file 'data/sample.pdf-1_1.png'...
found 38 lines
saving image with detected lines to 'generated_output/sample.pdf-1_1-lines-orig.png'
saving image with detected lines to 'generated_output/sample.pdf-1_1-lines.png'
WARNING:root:no vertical lines found
no page rotation / skew found
found 0 clusters
Traceback (most recent call last):
File "sample.py", line 140, in
img_w_clusters = iproc_obj.draw_line_clusters(imgproc.DIRECTION_VERTICAL, vertical_clusters)
File "build/bdist.macosx-10.12-intel/egg/pdftabextract/imgproc.py", line 395, in draw_line_clusters
ZeroDivisionError: integer division or modulo by zero
Hello, my graduation thesis is also related to document image recognition. Can you give me your data source?
pdftabextract does not label an text boxes in the scanned PDF I have. What could be possible reason?
Could you tag your releases and make the changelog directly visible in your GitHub releases? This would increase transparency for potential users and distribution maintainers. Something like https://skywinder.github.io/github-changelog-generator/ could be used to generate the changelog based on merged PRs and closed issues since the last release tag.
I've been trying to install Poppler to execute the first line of code pdf2html, using "pip install python-poppler-qt5" also tried conda installation but failed. Tried to add the source files to my anaconda/lib/site-packages directory also failed. Please could someone tell me how to get poppler up and running on windows?
I tried to extract table data from PDF files but the first step in the process is to generate xml and page images.
Unfortunately, using pdftohtml library for PDF pages I cannot create any image files in the data/ directory when following the tutorial at: DataMining- WZBSocialScienceCenter
The command that fails to create the images:
pdftohtml -c -xml -hidden TradingIEX.pdf TradingIEX.pdf.xml
How is it possible to create such page images for the non-scanned PDF files?
Thanks.
Hi,
when I run pdftohtml -c -hidden -xml a.pdf a.pdf.xml
in this file I have no text boxes in the output, but only the below infos.
Is it normal? What's wrong in my command?
Thank you
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.41.0">
<page number="1" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-1_1.jpg"/>
</page>
<page number="2" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-2_1.jpg"/>
</page>
<page number="3" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-3_1.jpg"/>
</page>
<page number="4" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-4_1.jpg"/>
</page>
<page number="5" position="absolute" top="0" left="0" height="892" width="1262">
<image top="0" left="0" width="1263" height="893" src="a.pdf-5_1.jpg"/>
</page>
</pdf2xml>
I want this table to be recognized using pdftabextract.
I converted that to searchable pdf using tesseract and followed every step in this tutorial
https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/
And I got this result
output.xlsx
I want to improve result as much similar as image.
Please help me.
Can this be used to extract tables from something like this through Juypter Notebook?
Thanks
Can this project run on windows? And this project can recognize numbers from image?
I'm using that schoollist_1.py file on my document but it is not showing anything in the output format. It is running the code but it is not showing anything on the output files. The .csv and .xlsx format files are empty when I'm opening them. So please help me out. And I'm using my code on the company invoice. They are basically a scanned documents and I want to using them as you have mentioned them. First by creating the .xml file and then by inserting them into the data folder and then using them. At the output folder all the files are coming. Everything is there, those xml, json, and png format documents. But the output is not being showed to me at my file in any of the .csv or on the .xlsx format
Please help
When generating a XML file via pdftohtml like
pdftohtml -c -hidden -xml input.pdf output.xml
there is no image tag in the XML file (also this command is only generating a XML file, PNGs will not be generated).
I have followed all the steps mentioned on your blog post but the code does not execute properly because at
imgflebasename = p['image'][:p['image'].rindex('.')]
no images are found and therefore there is no attribute rindex
.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.