johnlinp / pdf-to-markdown
Convert PDF files into markdown files
License: BSD 3-Clause "New" or "Revised" License
I really need to convert a lot of PDF files to markdown.
I tried many sites, but most of them have daily upload limits and similar restrictions.
I really need this Python package to help with my work.
I saw that there is a supportPython3 branch in @nella17's commits, but I don't know how to use it from there.
Please publish an updated Python package and a usage guide.
Traceback (most recent call last):
File "C:\Python36\lib\runpy.py", line 183, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "C:\Python36\lib\runpy.py", line 142, in _get_module_details
return _get_module_details(pkg_main_name, error)
File "C:\Python36\lib\runpy.py", line 109, in _get_module_details
__import__(pkg_name)
File "C:\Python36\lib\site-packages\pdf2md\__init__.py", line 1, in <module>
from parser import Parser
ImportError: cannot import name 'Parser'
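The traceback above is consistent with Python 3 having dropped implicit relative imports: inside a package, `from parser import Parser` is treated as an absolute import, so on Python 3.6 it most likely resolves to the standard-library `parser` module, which has no `Parser` name. Below is a minimal sketch of the explicit relative form. The package name `demo_pkg` and the stub `Parser` class are made up for illustration and are not the real pdf2md code.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create a tiny package whose __init__ uses an explicit relative import."""
    pkg = root / "demo_pkg"  # hypothetical package name
    pkg.mkdir()
    # The leading dot means "the parser module inside this package".
    (pkg / "__init__.py").write_text("from .parser import Parser\n")
    (pkg / "parser.py").write_text(
        "class Parser:\n"
        "    def __init__(self, filename):\n"
        "        self.filename = filename\n"
    )


def load_demo_parser():
    """Build the package in a temp dir, import it, and return its Parser class."""
    tmp = tempfile.mkdtemp()
    build_demo_package(Path(tmp))
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        module = importlib.import_module("demo_pkg")
    finally:
        sys.path.remove(tmp)
    return module.Parser
```

If this is indeed the cause, changing the import in `pdf2md/__init__.py` to `from .parser import Parser` would be the corresponding fix, though other Python 2 constructs in the package may still need porting.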
Hi,
I have just tried to get pdf2md running in a Conda Python 3.7.2 environment and got stuck on the imports in parser.py:
from pile import Pile
https://pypi.org/project/pile/
It seems this package is only available for Python 2.7. Additionally, the pdf2md script has issues with Python 3+.
Please consider adding pile to the pip dependencies file and stating the required runtime in README.md.
This may save some people from wasting their time with the wrong environment.
When processing a PDF on Fedora 20 Linux, I got
Parsing test.pdf
Traceback (most recent call last):
File "main.py", line 30, in <module>
main(sys.argv)
File "main.py", line 17, in main
piles = parser.parse()
File "/tmp/pdf/pdf-to-markdown-13jul15/pdf2md/parser.py", line 36, in parse
piles += self._parse_page(page)
File "/tmp/pdf/pdf-to-markdown-13jul15/pdf2md/parser.py", line 60, in _parse_page
pile.parse_layout(page)
File "/tmp/pdf/pdf-to-markdown-13jul15/pdf2md/pile.py", line 52, in parse_layout
assert False, "Unrecognized type: %s" % type(obj)
AssertionError: Unrecognized type: <class 'pdfminer.layout.LTLine'>
I got around it with the patch below.
--- pdf-to-markdown-13jul15/pdf2md/pile.py- 2015-07-13 11:31:43.000000000 -0400
+++ pdf-to-markdown-13jul15/pdf2md/pile.py 2015-07-13 12:17:47.143587827 -0400
@@ -49,7 +49,8 @@
elif type(obj) == LTCurve:
pass
else:
- assert False, "Unrecognized type: %s" % type(obj)
+ print "Unrecognized type: " + str(type(obj))
+ # assert False, "Unrecognized type: %s" % type(obj)
def split_piles(self):
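In pdfminer, `LTLine` is a subclass of `LTCurve`, so an exact-type comparison such as `type(obj) == LTCurve` does not match it, which would explain the assertion above. The sketch below uses stand-in classes (not imported from pdfminer) and a hypothetical `classify` helper to show an `isinstance` check that also covers subclasses, with a warning instead of an assertion for anything else. Whether lines should simply be skipped is an assumption; the converter may need them for table detection.

```python
import warnings


# Stand-ins that mirror the pdfminer.layout inheritance (LTLine derives from LTCurve).
class LTCurve:
    pass


class LTLine(LTCurve):
    pass


class LTUnknown:
    """Placeholder for a layout type the converter does not handle."""


def classify(obj) -> str:
    """Hypothetical helper: decide what to do with one layout object."""
    if isinstance(obj, LTCurve):  # also matches LTLine and other subclasses
        return "skip"
    # Warn and carry on instead of aborting the whole conversion.
    warnings.warn("Unrecognized type: %s" % type(obj).__name__)
    return "unknown"
```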
Python 2 is going to be deprecated; let's support Python 3.x.
It would be handy if it was in AUR!
See the following exceptions:
Traceback (most recent call last):
File "main.py", line 30, in <module>
main(sys.argv)
File "main.py", line 17, in main
piles = parser.parse()
File "/Users/bma/git/pdf-to-markdown/pdf2md/parser.py", line 36, in parse
piles += self._parse_page(page)
File "/Users/bma/git/pdf-to-markdown/pdf2md/parser.py", line 60, in _parse_page
pile.parse_layout(page)
File "/Users/bma/git/pdf-to-markdown/pdf2md/pile.py", line 55, in parse_layout
assert False, "Unrecognized type: %s" % type(obj)
AssertionError: Unrecognized type: <class 'pdfminer.layout.LTLine'>
Traceback (most recent call last):
File "main.py", line 30, in <module>
main(sys.argv)
File "main.py", line 17, in main
piles = parser.parse()
File "/Users/bma/git/pdf-to-markdown/pdf2md/parser.py", line 36, in parse
piles += self._parse_page(page)
File "/Users/bma/git/pdf-to-markdown/pdf2md/parser.py", line 60, in _parse_page
pile.parse_layout(page)
File "/Users/bma/git/pdf-to-markdown/pdf2md/pile.py", line 52, in parse_layout
assert False, "Unrecognized type: %s" % type(obj)
AssertionError: Unrecognized type: <class 'pdfminer.layout.LTChar'>
Getting IndexError: list index out of range (see below) when converting THIS PDF. Reopening of #15.
Parsing Anbinderis_2010.pdf
Traceback (most recent call last):
File "/usr/local/bin/pdf2md", line 32, in <module>
main(sys.argv)
File "/usr/local/bin/pdf2md", line 27, in main
writer.write(piles)
File "/usr/local/lib/python2.7/dist-packages/pdf2md/writer.py", line 27, in write
self._write_simple(piles)
File "/usr/local/lib/python2.7/dist-packages/pdf2md/writer.py", line 50, in _write_simple
markdown = pile.gen_markdown(self._syntax)
File "/usr/local/lib/python2.7/dist-packages/pdf2md/pile.py", line 76, in gen_markdown
return self._gen_table_markdown(syntax)
File "/usr/local/lib/python2.7/dist-packages/pdf2md/pile.py", line 290, in _gen_table_markdown
intermediate = self._gen_table_intermediate()
File "/usr/local/lib/python2.7/dist-packages/pdf2md/pile.py", line 319, in _gen_table_intermediate
bottom, rowspan = self._find_exist_coor(left, right, row_idx, horizontal_coor, 'horizontal')
File "/usr/local/lib/python2.7/dist-packages/pdf2md/pile.py", line 357, in _find_exist_coor
coor = line_coor[start_idx + span]
IndexError: list index out of range
Dear pdf2md team,
I am using Python 3.6. Whenever I try to convert a PDF to markdown with your library, nothing happens: no output is produced at all.
I went through the files one by one and found that Python 3.6 reports errors in some of them. For example, encoding: utf8 is flagged as an error, so I replaced it with encoding: utf-8 in the files that declare an encoding, but still nothing happens.
Could you please explain more precisely how this is supposed to work?
Thank you in advance
Have a good day
(C:\Program Files\Anaconda3\envs\pdftomd) C:\Users\Administrator\PycharmProjects\pdftomd>python main.py test.pdf
Parsing test.pdf
Traceback (most recent call last):
File "main.py", line 31, in <module>
main(sys.argv)
File "main.py", line 26, in main
writer.write(piles)
File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\writer.py", line 27, in write
self._write_simple(piles)
File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\writer.py", line 50, in _write_simple
markdown = pile.gen_markdown(self._syntax)
File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\pile.py", line 76, in gen_markdown
return self._gen_table_markdown(syntax)
File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\pile.py", line 290, in _gen_table_markdown
intermediate = self._gen_table_intermediate()
File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\pile.py", line 319, in _gen_table_intermediate
bottom, rowspan = self._find_exist_coor(left, right, row_idx, horizontal_coor, 'horizontal')
File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\pile.py", line 357, in _find_exist_coor
coor = line_coor[start_idx + span]
IndexError: list index out of range
It works for some PDF files and fails on others. I tried to figure out what is going on here but could not find the cause.
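Since the failure depends on the input file, one way to narrow it down is to let the conversion continue past a failing table and record which pile raised the error. This is only a sketch under assumptions: `render_piles` and its `render` argument are hypothetical names, standing in for whatever per-pile markdown step the tool runs, and this does not fix the underlying table logic.

```python
def render_piles(piles, render):
    """Render each pile, collecting failures instead of stopping at the first one.

    `render` is any callable that turns a pile into markdown text
    (a hypothetical stand-in for the per-pile markdown step).
    """
    chunks, failures = [], []
    for index, pile in enumerate(piles):
        try:
            chunks.append(render(pile))
        except IndexError as exc:
            # Record which pile failed so that region of the PDF can be inspected.
            failures.append((index, str(exc)))
    return chunks, failures
```

The list of failed indices can then be attached to a bug report together with the offending PDF.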
Remember to use Travis CI.
I need this for work and can help you with this repo if you like. :)
Parsing /Users/jodonnell/Desktop/MyFile.pdf
Traceback (most recent call last):
File "main.py", line 30, in <module>
main(sys.argv)
File "main.py", line 15, in main
parser = pdf2md.Parser(filename)
File "/Users/jodonnell/Code/3rdParty/new/pdf2md/parser.py", line 14, in __init__
self._document = self._read_file(filename)
File "/Users/jodonnell/Code/3rdParty/new/pdf2md/parser.py", line 45, in _read_file
document = PDFDocument(parser)
File "/Library/Python/2.7/site-packages/pdfminer/pdfdocument.py", line 575, in __init__
self._initialize_password(password)
File "/Library/Python/2.7/site-packages/pdfminer/pdfdocument.py", line 598, in _initialize_password
raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={u'CF': {u'StdCF': {u'Length': 16, u'CFM': /V2, u'AuthEvent': /DocOpen}}, u'O': '\x0cr\x00O\xda\x01#0\xf0?<\x17B\xac \xaa\xb7=\x14\xa2\x91\xf5\xc5>(\xdc\xdc\x9b\xd6t\xb3\xb1', u'Filter': /Standard, u'P': -1324, u'Length': 128, u'R': 4, u'U': '$\xd8\x8d`\x1802\xc4\xc4\x19:bM\xf5/4\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', u'V': 4, u'StmF': /StdCF, u'StrF': /StdCF}
I used this tool to convert a PDF that contains some hyperlinks to markdown, but it did not convert any of the links; they came out as plain text.
I used the pdfx package instead, which helped in extracting the links.
Thanks a lot!
But there is a problem if we install pdfminer with sudo apt-get install python-pdfminer; the error is:
ImportError: No module named pdfdocument
After updating pdfminer, the problem was solved:
sudo pip install --upgrade pdfminer
Maybe it is because of my old package source (Ubuntu 14.04)?
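To check up front whether the installed pdfminer provides the `pdfdocument` module named in the error, something like the following could be run before starting a conversion. This is a small sketch; `has_module` is a helper name introduced here, not part of pdf2md or pdfminer.

```python
import importlib.util


def has_module(dotted_name: str) -> bool:
    """Return True if `dotted_name` can be located.

    Parent packages are imported during the lookup, but the target
    submodule itself is not executed.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. `pdfminer` itself) is missing.
        return False


if __name__ == "__main__":
    # `pdfminer.pdfdocument` is the module named in the ImportError above.
    print("pdfminer.pdfdocument available:", has_module("pdfminer.pdfdocument"))
```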
I cannot open the images that pdf-to-markdown extracts. What is the format of those files?