pdf-to-markdown's Issues

ImportError: cannot import name 'Parser'

Traceback (most recent call last):
  File "C:\Python36\lib\runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "C:\Python36\lib\runpy.py", line 142, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "C:\Python36\lib\runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "C:\Python36\lib\site-packages\pdf2md\__init__.py", line 1, in <module>
    from parser import Parser
ImportError: cannot import name 'Parser'
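
A likely cause, assuming the package was written for Python 2: `from parser import Parser` relies on implicit relative imports, which Python 3 no longer performs. Under Python 3.6 the name `parser` resolves to a standard-library module of that name, which has no `Parser`, hence "cannot import name". The sketch below builds a throwaway package (the name `demo_pkg` and both helper functions are hypothetical, not part of pdf2md) to show the explicit relative form `from .parser import Parser` importing cleanly.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create a tiny package whose __init__ uses an explicit relative import."""
    pkg = root / "demo_pkg"  # hypothetical package name, for illustration only
    pkg.mkdir()
    (pkg / "parser.py").write_text("class Parser:\n    pass\n")
    # The leading dot makes Python 3 look inside the package, not on sys.path.
    (pkg / "__init__.py").write_text("from .parser import Parser\n")


def import_demo_parser():
    """Import the demo package from a temporary directory and return Parser."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("demo_pkg")
            return module.Parser
        finally:
            sys.path.remove(tmp)


if __name__ == "__main__":
    print(import_demo_parser().__name__)
```

If this diagnosis holds, the same change would presumably apply to the other intra-package imports in pdf2md.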

ModuleNotFoundError: Pile / Python 2.7 dependency

Hi,

I have just tried to get pdf2md running in a Conda Python 3.7.2 environment and got stuck on the imports in parser.py:

from pile import Pile
https://pypi.org/project/pile/

It seems this package is only available for Python 2.7. Additionally, the pdf2md script has issues with Python 3+.

Please consider adding pile to the pip dependencies file and stating the required runtime in README.md.

This may save some people from wasting their time with the wrong environment.

AssertionError: Unrecognized type: <class 'pdfminer.layout.LTLine'>

When processing a PDF on Fedora 20 Linux, I got

Parsing test.pdf
Traceback (most recent call last):
  File "main.py", line 30, in <module>
    main(sys.argv)
  File "main.py", line 17, in main
    piles = parser.parse()
  File "/tmp/pdf/pdf-to-markdown-13jul15/pdf2md/parser.py", line 36, in parse
    piles += self._parse_page(page)
  File "/tmp/pdf/pdf-to-markdown-13jul15/pdf2md/parser.py", line 60, in _parse_page
    pile.parse_layout(page)
  File "/tmp/pdf/pdf-to-markdown-13jul15/pdf2md/pile.py", line 52, in parse_layout
    assert False, "Unrecognized type: %s" % type(obj)
AssertionError: Unrecognized type: <class 'pdfminer.layout.LTLine'>

I got around it with the patch below.

--- pdf-to-markdown-13jul15/pdf2md/pile.py-     2015-07-13 11:31:43.000000000 -0400
+++ pdf-to-markdown-13jul15/pdf2md/pile.py      2015-07-13 12:17:47.143587827 -0400
@@ -49,7 +49,8 @@
                        elif type(obj) == LTCurve:
                                pass
                        else:
-                               assert False, "Unrecognized type: %s" % type(obj)
+                               print "Unrecognized type: " + str(type(obj))
+                               # assert False, "Unrecognized type: %s" % type(obj)


        def split_piles(self):
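
The patch above uses a Python 2 `print` statement. A Python 3 friendly variant of the same idea can be sketched as follows, assuming (as I believe is the case in pdfminer's layout module) that `LTLine` derives from `LTCurve`, so an `isinstance` check covers it where `type(obj) == LTCurve` does not. The classes here are stand-ins defined locally so the sketch runs without pdfminer, and `classify_layout_objects` is a hypothetical name, not a function in pdf2md.

```python
import warnings


# Stand-in classes: hypothetical placeholders for the pdfminer.layout types.
class LTTextBox:
    pass


class LTCurve:
    pass


class LTLine(LTCurve):  # assumption: mirrors pdfminer, where LTLine subclasses LTCurve
    pass


class LTChar:
    pass


def classify_layout_objects(objects):
    """Collect text boxes, ignore curves, and warn on anything unexpected."""
    texts, skipped = [], []
    for obj in objects:
        if isinstance(obj, LTTextBox):
            texts.append(obj)
        elif isinstance(obj, LTCurve):
            # isinstance also matches subclasses such as LTLine.
            pass
        else:
            # Warn and continue instead of aborting the whole conversion.
            warnings.warn("Unrecognized type: %s" % type(obj).__name__)
            skipped.append(obj)
    return texts, skipped


if __name__ == "__main__":
    texts, skipped = classify_layout_objects([LTTextBox(), LTLine(), LTChar()])
    print(len(texts), len(skipped))
```

Warning rather than asserting keeps the run going while still reporting which object types were dropped.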

Exception on LTLine and LTChar

See the following exceptions:

Traceback (most recent call last):
  File "main.py", line 30, in <module>
    main(sys.argv)
  File "main.py", line 17, in main
    piles = parser.parse()
  File "/Users/bma/git/pdf-to-markdown/pdf2md/parser.py", line 36, in parse
    piles += self._parse_page(page)
  File "/Users/bma/git/pdf-to-markdown/pdf2md/parser.py", line 60, in _parse_page
    pile.parse_layout(page)
  File "/Users/bma/git/pdf-to-markdown/pdf2md/pile.py", line 55, in parse_layout
    assert False, "Unrecognized type: %s" % type(obj)
AssertionError: Unrecognized type: <class 'pdfminer.layout.LTLine'>
Traceback (most recent call last):
  File "main.py", line 30, in <module>
    main(sys.argv)
  File "main.py", line 17, in main
    piles = parser.parse()
  File "/Users/bma/git/pdf-to-markdown/pdf2md/parser.py", line 36, in parse
    piles += self._parse_page(page)
  File "/Users/bma/git/pdf-to-markdown/pdf2md/parser.py", line 60, in _parse_page
    pile.parse_layout(page)
  File "/Users/bma/git/pdf-to-markdown/pdf2md/pile.py", line 52, in parse_layout
    assert False, "Unrecognized type: %s" % type(obj)
AssertionError: Unrecognized type: <class 'pdfminer.layout.LTChar'>

IndexError: list index out of range (reopening of #15)

Getting IndexError: list index out of range (see below) when converting THIS PDF. Reopening of #15.

Parsing Anbinderis_2010.pdf
Traceback (most recent call last):
  File "/usr/local/bin/pdf2md", line 32, in <module>
    main(sys.argv)
  File "/usr/local/bin/pdf2md", line 27, in main
    writer.write(piles)
  File "/usr/local/lib/python2.7/dist-packages/pdf2md/writer.py", line 27, in write
    self._write_simple(piles)
  File "/usr/local/lib/python2.7/dist-packages/pdf2md/writer.py", line 50, in _write_simple
    markdown = pile.gen_markdown(self._syntax)
  File "/usr/local/lib/python2.7/dist-packages/pdf2md/pile.py", line 76, in gen_markdown
    return self._gen_table_markdown(syntax)
  File "/usr/local/lib/python2.7/dist-packages/pdf2md/pile.py", line 290, in _gen_table_markdown
    intermediate = self._gen_table_intermediate()
  File "/usr/local/lib/python2.7/dist-packages/pdf2md/pile.py", line 319, in _gen_table_intermediate
    bottom, rowspan = self._find_exist_coor(left, right, row_idx, horizontal_coor, 'horizontal')
  File "/usr/local/lib/python2.7/dist-packages/pdf2md/pile.py", line 357, in _find_exist_coor
    coor = line_coor[start_idx + span]
IndexError: list index out of range

pdf2md doesn't work

Dear pdf2md team,

I am using Python 3.6. Whenever I try to convert a PDF to Markdown with your library, nothing happens: no output is produced at all.

I tried going through the files one by one, and Python 3.6 reports errors in some of them. For example, the encoding: utf8 declaration is flagged as an error, so I replaced it with encoding: utf-8 in the files that need an encoding declaration, but still nothing happens.

Could you please explain more precisely how this is supposed to work?

Thank you in advance

Have a good day

IndexError: list index out of range

(C:\Program Files\Anaconda3\envs\pdftomd) C:\Users\Administrator\PycharmProjects\pdftomd>python main.py test.pdf
Parsing test.pdf
Traceback (most recent call last):
  File "main.py", line 31, in <module>
    main(sys.argv)
  File "main.py", line 26, in main
    writer.write(piles)
  File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\writer.py", line 27, in write
    self._write_simple(piles)
  File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\writer.py", line 50, in _write_simple
    markdown = pile.gen_markdown(self._syntax)
  File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\pile.py", line 76, in gen_markdown
    return self._gen_table_markdown(syntax)
  File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\pile.py", line 290, in _gen_table_markdown
    intermediate = self._gen_table_intermediate()
  File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\pile.py", line 319, in _gen_table_intermediate
    bottom, rowspan = self._find_exist_coor(left, right, row_idx, horizontal_coor, 'horizontal')
  File "C:\Users\Administrator\PycharmProjects\pdftomd\pdf2md\pile.py", line 357, in _find_exist_coor
    coor = line_coor[start_idx + span]
IndexError: list index out of range

It works for some PDF files and fails on others. I tried to figure out what is going on here, but could not come up with anything.
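
The failing line, `coor = line_coor[start_idx + span]`, indexes the list without checking its length. A minimal sketch of a guarded lookup is shown below; `coor_at` is a hypothetical helper, and clamping to the last coordinate is an assumption that may or may not suit pdf2md's table logic.

```python
def coor_at(line_coor, start_idx, span):
    """Return line_coor[start_idx + span], clamped to the valid index range."""
    if not line_coor:
        raise ValueError("no line coordinates available")
    idx = start_idx + span
    # Clamp instead of raising IndexError when the span runs past either end.
    idx = min(max(idx, 0), len(line_coor) - 1)
    return line_coor[idx]


if __name__ == "__main__":
    print(coor_at([10, 20, 30], 1, 1))
    print(coor_at([10, 20, 30], 2, 5))
```

Clamping hides the symptom rather than explaining why the span overshoots, so it is best treated as a diagnostic aid while the table-detection step is investigated.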

Can't convert my pdf doc

I need this for work and can help you with this repo if you like. :)

Parsing /Users/jodonnell/Desktop/MyFile.pdf
Traceback (most recent call last):
  File "main.py", line 30, in <module>
    main(sys.argv)
  File "main.py", line 15, in main
    parser = pdf2md.Parser(filename)
  File "/Users/jodonnell/Code/3rdParty/new/pdf2md/parser.py", line 14, in __init__
    self._document = self._read_file(filename)
  File "/Users/jodonnell/Code/3rdParty/new/pdf2md/parser.py", line 45, in _read_file
    document = PDFDocument(parser)
  File "/Library/Python/2.7/site-packages/pdfminer/pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "/Library/Python/2.7/site-packages/pdfminer/pdfdocument.py", line 598, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={u'CF': {u'StdCF': {u'Length': 16, u'CFM': /V2, u'AuthEvent': /DocOpen}}, u'O': '\x0cr\x00O\xda\x01#0\xf0?<\x17B\xac \xaa\xb7=\x14\xa2\x91\xf5\xc5>(\xdc\xdc\x9b\xd6t\xb3\xb1', u'Filter': /Standard, u'P': -1324, u'Length': 128, u'R': 4, u'U': '$\xd8\x8d`\x1802\xc4\xc4\x19:bM\xf5/4\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', u'V': 4, u'StmF': /StdCF, u'StrF': /StdCF}

Hyper-links are not converted at all.

I used this tool to convert a PDF containing some hyperlinks to Markdown, but none of the links were converted; they came out as plain text.
The pdfx package helped me extract the links.

No module named pdfdocument

Thanks a lot!
But there is a problem when pdfminer is installed with sudo apt-get install python-pdfminer. The error is:
ImportError: No module named pdfdocument
After upgrading pdfminer, the problem was solved:
sudo pip install --upgrade pdfminer

Maybe it is because of my old package source (Ubuntu 14.04)?

Images do not work

I cannot open the images that pdf-to-markdown extracts. What format are those files in?
