johnlinp / pdf-to-markdown

281.0 8.0 70.0 6.34 MB

Convert PDF files into markdown files

License: BSD 3-Clause "New" or "Revised" License

Python 99.37% Makefile 0.63%

pdf-to-markdown's Issues

How do I use this? Please publish an updated Python package with usage guidelines

I need to convert a lot of PDF files to Markdown.
I have tried many sites, but most of them have daily upload limits and similar restrictions,
so I really need this Python package for my work.
I see there is a supportPython3 branch in @nella17's commits, but I don't know how to use it from there.

Please publish an updated Python package and a usage guide.

@johnlinp @nella17 please

ImportError: cannot import name 'Parser'

Traceback (most recent call last):
  File "C:\Python36\lib\runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "C:\Python36\lib\runpy.py", line 142, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "C:\Python36\lib\runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "C:\Python36\lib\site-packages\pdf2md\__init__.py", line 1, in <module>
    from parser import Parser
ImportError: cannot import name 'Parser'
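
This traceback is consistent with Python 3 dropping implicit relative imports: inside a package, `from parser import Parser` is treated as an absolute import, so on Python 3.6 it resolves to the standard library `parser` module, which has no `Parser` name. A minimal sketch of the likely fix, an explicit relative import, is shown below. It assumes `pdf2md/parser.py` defines a `Parser` class; the `pdf2md_demo` package built here is a throwaway stand-in, not the real project.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root):
    """Create a stand-in package laid out like pdf2md (hypothetical contents)."""
    pkg = Path(root) / "pdf2md_demo"
    pkg.mkdir()
    # Explicit relative import: the leading dot means "from this package".
    (pkg / "__init__.py").write_text("from .parser import Parser\n")
    (pkg / "parser.py").write_text(
        "class Parser:\n"
        "    def __init__(self, filename):\n"
        "        self.filename = filename\n"
    )
    return pkg


def import_demo_package():
    """Build the stand-in package in a temp directory and import it."""
    with tempfile.TemporaryDirectory() as root:
        build_demo_package(root)
        importlib.invalidate_caches()
        sys.path.insert(0, root)
        try:
            return importlib.import_module("pdf2md_demo")
        finally:
            sys.path.remove(root)


demo = import_demo_package()
parser = demo.Parser("test.pdf")
print(parser.filename)
```

With the leading dot, the import no longer depends on which top-level `parser` module happens to be found first.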

ModuleNotFoundError: Pile / Python 2.7 dependency

Hi,

I have just tried to get pdf2md running in a Conda Python 3.7.2 environment and got stuck on the imports in parser.py:

from pile import Pile
https://pypi.org/project/pile/

It seems this package is only available for Python 2.7. Additionally, the pdf2md script has issues with Python 3+.

Please consider adding pile to the pip dependencies file and stating the required runtime in README.md.

This may save some people from wasting their time with the wrong environment.

AssertionError: Unrecognized type: <class 'pdfminer.layout.LTLine'>

When processing a PDF on Fedora 20 Linux, I got

Parsing test.pdf
Traceback (most recent call last):
  File "main.py", line 30, in <module>
    main(sys.argv)
  File "main.py", line 17, in main
    piles = parser.parse()
  File "/tmp/pdf/pdf-to-markdown-13jul15/pdf2md/parser.py", line 36, in parse
    piles += self._parse_page(page)
  File "/tmp/pdf/pdf-to-markdown-13jul15/pdf2md/parser.py", line 60, in _parse_page
    pile.parse_layout(page)
  File "/tmp/pdf/pdf-to-markdown-13jul15/pdf2md/pile.py", line 52, in parse_layout
    assert False, "Unrecognized type: %s" % type(obj)
AssertionError: Unrecognized type: <class 'pdfminer.layout.LTLine'>

I got around it with the patch below.

--- pdf-to-markdown-13jul15/pdf2md/pile.py-     2015-07-13 11:31:43.000000000 -0400
+++ pdf-to-markdown-13jul15/pdf2md/pile.py      2015-07-13 12:17:47.143587827 -0400
@@ -49,7 +49,8 @@
                        elif type(obj) == LTCurve:
                                pass
                        else:
-                               assert False, "Unrecognized type: %s" % type(obj)
+                               print "Unrecognized type: " + str(type(obj))
+                               # assert False, "Unrecognized type: %s" % type(obj)


        def split_piles(self):
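
The patch above uses the Python 2 `print` statement. A Python 3 compatible variant of the same workaround might look like the sketch below. It uses stand-in classes instead of importing pdfminer, and it assumes, as in pdfminer's layout module, that `LTLine` derives from `LTCurve`. `handle_layout_object` is a hypothetical name for illustration, not a function from pdf2md.

```python
import warnings


# Stand-ins for pdfminer.layout classes so the sketch runs without pdfminer.
class LTCurve:
    pass


class LTLine(LTCurve):  # assumption: mirrors pdfminer, where LTLine subclasses LTCurve
    pass


class LTChar:
    pass


def handle_layout_object(obj):
    """Return "curve" for shapes to skip and "unknown" for unrecognized objects."""
    # isinstance also matches subclasses such as LTLine, which an exact
    # comparison like type(obj) == LTCurve does not.
    if isinstance(obj, LTCurve):
        return "curve"
    # Warn and carry on instead of aborting the whole conversion.
    warnings.warn("Unrecognized type: %s" % type(obj).__name__)
    return "unknown"


print(handle_layout_object(LTLine()))
```

Whether silently skipping such objects gives acceptable Markdown depends on the PDF; the sketch only shows how to avoid the hard failure.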

Support Python 3.x

Python 2 is going to be deprecated; let's support Python 3.x.

Please submit it to AUR!

It would be handy if it were in the AUR!

Exception on LTLine and LTChar

See the following exceptions:

Traceback (most recent call last):
  File "main.py", line 30, in <module>
    main(sys.argv)
  File "main.py", line 17, in main
    piles = parser.parse()
  File "/Users/bma/git/pdf-to-markdown/pdf2md/parser.py", line 36, in parse
    piles += self._parse_page(page)
  File "/Users/bma/git/pdf-to-markdown/pdf2md/parser.py", line 60, in _parse_page
    pile.parse_layout(page)
  File "/Users/bma/git/pdf-to-markdown/pdf2md/pile.py", line 55, in parse_layout
    assert False, "Unrecognized type: %s" % type(obj)
AssertionError: Unrecognized type: <class 'pdfminer.layout.LTLine'>

Traceback (most recent call last):
  File "main.py", line 30, in <module>
    main(sys.argv)
  File "main.py", line 17, in main
    piles = parser.parse()
  File "/Users/bma/git/pdf-to-markdown/pdf2md/parser.py", line 36, in parse
    piles += self._parse_page(page)
  File "/Users/bma/git/pdf-to-markdown/pdf2md/parser.py", line 60, in _parse_page
    pile.parse_layout(page)
  File "/Users/bma/git/pdf-to-markdown/pdf2md/pile.py", line 52, in parse_layout
    assert False, "Unrecognized type: %s" % type(obj)
AssertionError: Unrecognized type: <class 'pdfminer.layout.LTChar'>

IndexError: list index out of range (reopening of #15)

I get IndexError: list index out of range (see below) when converting this PDF. This is a reopening of #15.

Parsing Anbinderis_2010.pdf
Traceback (most recent call last):
  File "/usr/local/bin/pdf2md", line 32, in <module>
    main(sys.argv)
  File "/usr/local/bin/pdf2md", line 27, in main
    writer.write(piles)
  File "/usr/local/lib/python2.7/dist-packages/pdf2md/writer.py", line 27, in write
    self._write_simple(piles)
  File "/usr/local/lib/python2.7/dist-packages/pdf2md/writer.py", line 50, in _write_simple
    markdown = pile.gen_markdown(self._syntax)
  File "/usr/local/lib/python2.7/dist-packages/pdf2md/pile.py", line 76, in gen_markdown
    return self._gen_table_markdown(syntax)
  File "/usr/local/lib/python2.7/dist-packages/pdf2md/pile.py", line 290, in _gen_table_markdown
    intermediate = self._gen_table_intermediate()
  File "/usr/local/lib/python2.7/dist-packages/pdf2md/pile.py", line 319, in _gen_table_intermediate
    bottom, rowspan = self._find_exist_coor(left, right, row_idx, horizontal_coor, 'horizontal')
  File "/usr/local/lib/python2.7/dist-packages/pdf2md/pile.py", line 357, in _find_exist_coor
    coor = line_coor[start_idx + span]
IndexError: list index out of range

pdf2md doesn't work

Dear pdf2md team,

I am using Python 3.6. Whenever I try to convert a PDF to Markdown with your library, nothing happens: no output is produced and no error is shown.

I went through the files one by one and found that Python 3.6 reports errors in some of them. For example, the encoding: utf8 declaration is flagged as an error, so I replaced it with encoding: utf-8 in the files that need an encoding declaration, but still nothing happens.

Could you please explain more precisely how this is supposed to work?

Thank you in advance

Have a good day

IndexError: list index out of range

(C:\Program Files\Anaconda3\envs\pdftomd) C:\Users\Administrator\PycharmProjects\pdftomd>python main.py test.pdf
Parsing test.pdf
Traceback (most recent call last):
  File "main.py", line 31, in <module>
    main(sys.argv)
  File "main.py", line 26, in main
    writer.write(piles)
  File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\writer.py", line 27, in write
    self._write_simple(piles)
  File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\writer.py", line 50, in _write_simple
    markdown = pile.gen_markdown(self._syntax)
  File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\pile.py", line 76, in gen_markdown
    return self._gen_table_markdown(syntax)
  File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\pile.py", line 290, in _gen_table_markdown
    intermediate = self._gen_table_intermediate()
  File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\pile.py", line 319, in _gen_table_intermediate
    bottom, rowspan = self._find_exist_coor(left, right, row_idx, horizontal_coor, 'horizontal')
  File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\pile.py", line 357, in _find_exist_coor
    coor = line_coor[start_idx + span]
IndexError: list index out of range

It works for some PDF files and fails for others. I tried to figure out what is going on, but I could not come up with anything.
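
The traceback shows the failure at `coor = line_coor[start_idx + span]`, so `start_idx + span` runs past the end of the coordinate list for some tables. One possible way to avoid the crash is a bounds-checked lookup, sketched below. `coor_at` is a hypothetical helper, not part of pdf2md, and the coordinates are made up; how the caller should react to a missing coordinate (for example, falling back to plain paragraph output) is left open.

```python
def coor_at(line_coor, start_idx, span):
    """Return line_coor[start_idx + span], or None if that index is out of range."""
    idx = start_idx + span
    if 0 <= idx < len(line_coor):
        return line_coor[idx]
    return None


horizontal_coor = [10.0, 42.5, 80.0]  # made-up coordinates for illustration

print(coor_at(horizontal_coor, 1, 1))  # index 2 exists
print(coor_at(horizontal_coor, 2, 1))  # index 3 does not, so no IndexError is raised
```

This only hides the symptom; the underlying question of why the table lines do not match the expected grid would still need investigating.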

Publish it on PyPI

Add unit tests

Remember to use Travis CI.

Can't convert my pdf doc

I need this for work and can help you with this repo if you like. :)

Parsing /Users/jodonnell/Desktop/MyFile.pdf
Traceback (most recent call last):
  File "main.py", line 30, in <module>
    main(sys.argv)
  File "main.py", line 15, in main
    parser = pdf2md.Parser(filename)
  File "/Users/jodonnell/Code/3rdParty/new/pdf2md/parser.py", line 14, in __init__
    self._document = self._read_file(filename)
  File "/Users/jodonnell/Code/3rdParty/new/pdf2md/parser.py", line 45, in _read_file
    document = PDFDocument(parser)
  File "/Library/Python/2.7/site-packages/pdfminer/pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "/Library/Python/2.7/site-packages/pdfminer/pdfdocument.py", line 598, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={u'CF': {u'StdCF': {u'Length': 16, u'CFM': /V2, u'AuthEvent': /DocOpen}}, u'O': '\x0cr\x00O\xda\x01#0\xf0?<\x17B\xac \xaa\xb7=\x14\xa2\x91\xf5\xc5>(\xdc\xdc\x9b\xd6t\xb3\xb1', u'Filter': /Standard, u'P': -1324, u'Length': 128, u'R': 4, u'U': '$\xd8\x8d`\x1802\xc4\xc4\x19:bM\xf5/4\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', u'V': 4, u'StmF': /StdCF, u'StrF': /StdCF}
