hsoft / pdfmasher

Convert PDFs to HTML, MOBI and EPUB

License: GNU General Public License v3.0

Shell 0.05% Python 94.71% Objective-C 5.24%

pdfmasher's Introduction

PdfMasher

Current status: unmaintained

I'm the only maintainer of PdfMasher and I lost interest in ebooks a good while ago (back to good old paper). Therefore, this app is unmaintained.

If you're interested in assuming maintainership of this app, don't hesitate to fork it off. When you feel you're on a good enough track to assume maintainership, it will be a pleasure for me to point to your fork. Just tell me.

Contents of this folder

This package contains the source for PdfMasher. Its documentation is available online. Here's how this source tree is organised:

core: Contains the core logic code for PdfMasher. It's Python code.
cocoa: UI code for the Cocoa toolkit. It's Objective-C code.
qt: UI code for the Qt toolkit. It's written in Python and uses PyQt.
images: Images used by the different UI codebases.
debian: Skeleton files required to create a .deb package
help: Help document, written for Sphinx.

There are also other sub-folders that come from external repositories and are part of this repo as git subtrees:

hscommon: A collection of helpers used across HS applications.
cocoalib: A collection of helpers used across Cocoa UI codebases of HS applications.
qtlib: A collection of helpers used across Qt UI codebases of HS applications.

How to build PdfMasher from source

The very, very, very easy way

If you're on Linux or Mac, there's a bootstrap script that will make building very, very easy. There might be some things that you need to install manually on your system, but the bootstrap script will tell you what you need to install. You can run the bootstrap with:

./bootstrap.sh

and follow instructions from the script. You can then ignore the rest of the build documentation.

Prerequisites installation

If you build manually instead, first make sure that your system has its "non-pip-installable" prerequisites installed:

All systems: Python 3.3+.
Mac OS X: The last Xcode to have the 10.7 SDK included.
Windows: Visual Studio 2010, PyQt 4.8+, cx_Freeze and Advanced Installer (you only need the last two if you want to create an installer)

On Ubuntu, the apt-get command to install all pre-requisites is:

$ apt-get install python3-dev python3-pyqt4 pyqt4-dev-tools python3-setuptools

Setting up the virtual environment

Use Python's built-in pyvenv to create a virtual environment in which we're going to install our Python-related dependencies. pyvenv is built into Python but, unlike its virtualenv predecessor, it doesn't install setuptools and pip (unless you use Python 3.4+), so they have to be installed manually:

$ pyvenv --system-site-packages env
$ source env/bin/activate
$ python get-pip.py
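
On newer Python versions the pyvenv script is no longer shipped (it was deprecated in Python 3.6 and removed in 3.8); the standard-library venv module does the same job. Here is a minimal sketch, assuming Python 3.4+ so that pip can be bootstrapped automatically. The helper name make_env_builder is ours for illustration and is not part of PdfMasher:

```python
import venv


def make_env_builder():
    """Return a builder equivalent to `pyvenv --system-site-packages`, plus pip."""
    # system_site_packages=True mirrors the --system-site-packages flag used
    # above, so a system-wide PyQt stays importable inside the environment.
    # with_pip=True (Python 3.4+) replaces the manual get-pip.py step.
    return venv.EnvBuilder(system_site_packages=True, with_pip=True)


# Usage (creates ./env, like the pyvenv command above):
#     make_env_builder().create("env")
```

Activation with `source env/bin/activate` is unchanged.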

Then, you can install pip requirements in your virtualenv:

$ pip install -r requirements-[osx|win].txt

([osx|win] depends, of course, on your platform. On other platforms, just use requirements.txt).
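
If you script the setup, the file choice above can be derived from sys.platform. This is a small sketch, not something the repository provides; the function name requirements_file is hypothetical:

```python
import sys


def requirements_file(platform=None):
    """Pick the pip requirements file for a platform string such as sys.platform."""
    platform = sys.platform if platform is None else platform
    if platform == "darwin":          # Mac OS X
        return "requirements-osx.txt"
    if platform.startswith("win"):    # "win32" on Windows
        return "requirements-win.txt"
    return "requirements.txt"         # every other platform
```

The result can then be passed to `pip install -r`.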

Actual building and running

With your virtualenv activated, you can build and run PdfMasher with these commands:

$ python configure.py
$ python build.py
$ python run.py
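
The three steps have to succeed in that order. As an illustration only, here is a sketch of a wrapper that stops at the first failing step; the names build_commands and run_all are hypothetical and the repository does not ship such a script:

```python
import subprocess
import sys

# Script names taken from the commands above.
BUILD_STEPS = ["configure.py", "build.py", "run.py"]


def build_commands(python=sys.executable, steps=BUILD_STEPS):
    """Return one argv list per step, using the given interpreter."""
    return [[python, step] for step in steps]


def run_all(commands, runner=subprocess.check_call):
    """Run each command in order; check_call raises on a non-zero exit code."""
    for command in commands:
        runner(command)


# Usage, from the repository root with the virtualenv activated:
#     run_all(build_commands())
```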

You can also package PdfMasher into an installable package with:

$ python package.py

pdfmasher's Issues

"Hide Ignored" hides too much

From email:

The bug I see is that sometime the Table view hides data when it shouldn't

For example:

I open the PDF file
I sort by Y
I select page numbers etc and click "Ignore" Button
Then I check "Hide Ignored Elements" check box
But then, when I sort by page number, a large part of the document is not visible
To fix this I uncheck and then recheck "Hide Ignored Elements"

pictures please

I love the flexibility of this tool. But my first attempt with it was not that successful because pdfmasher simply ignores the pictures. It would be really nice if we could add pictures to the generated epub

how to use on cloud ubuntu

I have successfully installed it. please guide me how to use it on ubuntu google cloud

Option to break markdown into multiple markdowns

I have not seen the TOC option in my downloaded version of PDF Masher, but I have a relatively simple (to me :-) structured PDF document, and it is chunking elements of the TOC into a single markdown element. Instead of displaying each on a separate line (carriage return in markdown), PDF Masher generates all lines as one line in the resulting HTML (and, I presume, in the final ebook file).

If a markdown element could be split into separate elements, so that PDF Masher's logic puts each one on its own line, I could more easily build/fix the TOC in the enclosed pdf file.

Auto-generate page marker elements

A way to automatically generate a text element at the beginning (or the end) of each page. The contents of this element could be customizable with placeholders for page numbers and totals (for example "Page %p of %t").
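
As an illustration of the placeholder idea, here is a minimal sketch of how such a template could be expanded. It assumes only the two placeholders named above (%p and %t); the function names are hypothetical and nothing here reflects existing PdfMasher code:

```python
def page_marker(template, page, total):
    """Expand %p (page number) and %t (total pages) in a marker template."""
    return template.replace("%p", str(page)).replace("%t", str(total))


def page_markers(template, total):
    """Return one expanded marker per page, numbered from 1."""
    return [page_marker(template, page, total) for page in range(1, total + 1)]
```

For example, `page_marker("Page %p of %t", 3, 10)` returns `"Page 3 of 10"`.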

This feature would mainly be to allow for HTML splitting, a feature requested at http://forum.hardcoded.net/topic/456/ , but I'm guessing that it could be useful for many other purposes.

Enable "open with" ability

From email:

The ability to right click a PDF file, select Open With, choose PDFmasher and have that file open inside pdfmasher.

Right now you can easily set up pdfmasher as one of the "open with" options, but pdfmasher won't open that file when it starts.

PDFMasher opens to a blank page

When trying to open a file it seems to load but there is nothing to work on. The file is not displayed. There is no error message I can see.

Need to be able to resume work

Just tried PDF Masher, and it does as it says. However it does not seem to have an option for saving work in progress so that more than one session can be used for a single PDF. This makes it pretty unfriendly and not really usable for any serious work.
When you move from Alpha to Beta, this needs to be there.

SyntaxError: invalid syntax

Debian Wheezy 32. I did all the virtual env stuff, stuck at this step

python build.py
Traceback (most recent call last):
File "build.py", line 18, in <module>
from hscommon import sphinxgen
File "/Yedeksiz/_temp/pdfmasher/hscommon/sphinxgen.py", line 12, in <module>
from .build import print_and_do, read_changelog_file, filereplace
File "/Yedeksiz/_temp/pdfmasher/hscommon/build.py", line 292
def copy_resources(self, *resources, use_symlinks=False):
^
SyntaxError: invalid syntax

dupeGuru references in menu

The PDFMasher 'Help' menu contains 'Register dupeGuru' and 'About dupeGuru'.

Processing more text than in the visible window

If I try to process a pdf of a book I want to get rid of the footers and headers. To do this I use Briss (http://briss.sourceforge.net/) which manipulates the viewport of the documents.

When I load such a pdf into pdfmasher it is displayed correctly, but the generated Markdown includes text from parts of the document which are not visible.

Pdfmasher should ignore any element that is not visible when the pdf is opened in a reader.

Opening 500 hundred pdf file: Error Report

Application Name: PdfMasher
Version: 0.1.1

Traceback (most recent call last):
File "/usr/local/share/pdfmasher/qt/main_window.py", line 90, in openButtonClicked
self.app.model.open_file(destination)
File "/usr/local/share/pdfmasher/core/app.py", line 46, in open_file
self.elements = extract_text_elements_from_pdf(path)
File "/usr/local/share/pdfmasher/core/pdf.py", line 77, in extract_text_elements_from_pdf
interpreter.process_page(page)
File "/usr/local/share/pdfmasher/pdfminer/pdfinterp.py", line 754, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/usr/local/share/pdfmasher/pdfminer/pdfinterp.py", line 765, in render_contents
self.init_resources(resources)
File "/usr/local/share/pdfmasher/pdfminer/pdfinterp.py", line 336, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/usr/local/share/pdfmasher/pdfminer/pdfinterp.py", line 172, in get_font
font = PDFType1Font(self, spec)
File "/usr/local/share/pdfmasher/pdfminer/pdffont.py", line 556, in __init__
PDFSimpleFont.__init__(self, descriptor, widths, spec)
File "/usr/local/share/pdfmasher/pdfminer/pdffont.py", line 523, in __init__
CMapParser(self.unicode_map, io.BytesIO(strm.get_data())).run()
File "/usr/local/share/pdfmasher/pdfminer/cmapdb.py", line 294, in run
self.nextobject()
File "/usr/local/share/pdfmasher/pdfminer/psparser.py", line 608, in nextobject
self.do_keyword(pos, token)
File "/usr/local/share/pdfmasher/pdfminer/cmapdb.py", line 401, in do_keyword
self.cmap.add_cid2unichr(nunpack(cid), code)
File "/usr/local/share/pdfmasher/pdfminer/cmapdb.py", line 192, in add_cid2unichr
raise TypeError(code)
TypeError: '

two column hebrew text (read right to left) is ordered incorrectly

Hebrew is read from right to left, so the first column is on the right side.
Pdfmasher orders the text elements in the English direction, with the first column on the left side.
The result is that when reading e.g. the epub you read e.g. paragraphs 3 and 4 (from the column on the left side, which is really the second column but is processed as the first column) and then paragraphs 1 and 2.
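
To make the expected ordering concrete, here is a minimal sketch of direction-aware column ordering. It is not PdfMasher's actual sorting code. The TextElement record, the fixed column_split x-coordinate, and the assumption that y grows upward (as in PDF page coordinates) are all hypothetical simplifications:

```python
from collections import namedtuple

# Hypothetical element record; PdfMasher's real element class may differ.
TextElement = namedtuple("TextElement", "page x y text")


def reading_order(elements, column_split, rtl=False):
    """Sort two-column elements by page, then column, then top to bottom.

    column_split is the x-coordinate separating the two columns.
    With rtl=True the right-hand column is read first.
    """
    def key(element):
        in_right_column = element.x >= column_split
        if rtl:
            column_rank = 0 if in_right_column else 1
        else:
            column_rank = 1 if in_right_column else 0
        # Larger y is higher on the page, so negate it to read top-down.
        return (element.page, column_rank, -element.y)

    return sorted(elements, key=key)
```

With rtl=True, the paragraphs of the right-hand column come out before those of the left-hand column, which is the order the report asks for.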

Accent replacement problem

"(cid:22)" replaces, for example, the "é" (on 32-bit Linux). There is also (cid:21), and so on...

Add HTML style options

From email:

For a suggestion on the HTML formatting, if possible, maybe make the first line of each paragraph indented 3-5 spaces?

I have found that if I do that, a) it looks nicer visually and b) it's much easier to see if / where lines were split improperly, which makes it easier to join them back up again.

Even if at the first glance, HTML styling is useless because the end result is the mobi/epub, some types of styling can help to visually identify element types.

Automatic TOC creation

From GS

In PDFMasher, could there be an extra option, "TOC" or something like it, that would collect up all items flagged with it and create a quick Table of Contents in the finished htm?

errors just errors

Python 2.7.8 (default, Oct 20 2014, 15:05:19)
[GCC 4.9.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.

root@pdf-html:/home/surjit/public_html/pdfmasher# python build.py
Traceback (most recent call last):
File "build.py", line 18, in <module>
from hscommon import sphinxgen
File "/home/surjit/public_html/pdfmasher/hscommon/sphinxgen.py", line 14, in <module>
from .build import read_changelog_file, filereplace
File "/home/surjit/public_html/pdfmasher/hscommon/build.py", line 317
def copy_resources(self, *resources, use_symlinks=False):
^
SyntaxError: invalid syntax
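
The line flagged in this traceback declares a keyword-only argument after `*resources`, which is Python 3 syntax and a SyntaxError under Python 2. The interpreter banner above shows Python 2.7.8, while the prerequisites call for Python 3.3+, so the likely cause is running build.py with the wrong interpreter. Here is a minimal sketch of a version guard; the names are hypothetical and the build scripts are not known to contain such a check:

```python
import sys

# Minimum version taken from the prerequisites ("All systems: Python 3.3+").
MIN_VERSION = (3, 3)


def python_is_supported(version_info=sys.version_info):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_VERSION


# Usage, near the top of a build script:
#     if not python_is_supported():
#         sys.exit("PdfMasher needs Python 3.3 or later")
```

Such a guard only helps if it lives in a file that Python 2 can still parse, so it would have to run before any module using Python 3-only syntax is imported.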

Chapter based footnotes

From GS: http://getsatisfaction.com/hardcodedsoftware/topics/chapter_support

Hi I also love the concept for this application.
For longer documents, such as books, or multi-article periodicals, it would be really nice to be able to have "chapters" for the following reasons:

Footnotes could be collected at the end of the chapter rather than at the end of the document, which often makes more sense.
The chapter marks could be titled and a TOC/Index could then be generated.

(The TOC part is already covered by issue 4)

[deb] embedded cssutils module does not work under python 3.5

After unpacking the 0.7.4-1 deb package found on Launchpad, when I execute the "run.py" script, I get a sre_constants.error in pdfmasher_0.7.4-1~precise_all/usr/share/pdfmasher/cssutils/__init__.py.

After searching the web, I chose to install cssutils with my distribution's package manager (package python3-cssutils, version 1.0-4.1) and remove the one embedded in the deb, and now it works!

I suggest removing the cssutils embedded in the deb and adding it as a dependency in the control file, the same way as qt4 and lxml. Or maybe change the version in the requirements file.

Remove newlines inside paragraphs when generating HTML

Some epub (and maybe mobi) generators don't ignore newlines inside <p> tags when converting HTML. Because of this, newlines inside paragraphs should be stripped in the Markdown --> HTML conversion.
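
Here is a minimal sketch of the requested clean-up, applied as a post-processing step on the generated HTML. It is regex-based rather than a real HTML parser, so it assumes flat, non-nested paragraph elements; the function name is hypothetical and this is not PdfMasher's conversion code:

```python
import re

# Matches a paragraph element; \b keeps tags such as <pre> from matching.
_PARAGRAPH = re.compile(r"(<p\b[^>]*>)(.*?)(</p>)", re.DOTALL | re.IGNORECASE)


def strip_paragraph_newlines(html):
    """Replace newlines inside paragraph elements with single spaces."""
    def collapse(match):
        opening, body, closing = match.groups()
        # Each newline, with any whitespace around it, becomes one space.
        body = re.sub(r"\s*\n\s*", " ", body).strip()
        return opening + body + closing

    return _PARAGRAPH.sub(collapse, html)
```

Newlines between paragraphs are left alone, so the HTML source stays readable.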

Add new ID unit?

My PDF conversion resulted in some footnotes being combined into a single ID unit. It would be great if there was a way to create a new ID unit so that I could cut the extra footnote out of one ID unit and paste it into a new one. Does that make sense?

I've attached a copy of the PDF I'm testing so you can see what I mean. ID#4 contains 3 footnotes, so I need to break each one out into a separate unit.

Support tables

From GS: http://getsatisfaction.com/hardcodedsoftware/topics/pdfmasher_table_support

It would be nice if there were some heuristic for recognizing a table in a PDF and converting it to an HTML table.

For example, the table on the second page of http://disruptor.googlecode.com/files/Disruptor-1.0.pdf is comparatively simple. However, pdfmasher 0.2.1 writes out the first column, then the second column, which makes the result hard to read.

Obviously, some hand editing can be done in Markdown to work around this.

Clearer Build pane

From email:

In the "Build Pane" help page, perhaps a bit more explanation of the relationship between the markdown and HTML files would help.

After experimenting, I think this is how it works (of course you know for sure!):

"Generate Markdown" creates the .txt file.
"Edit Markdown" edits the .txt file.
"View HTML" creates a .htm file from the .txt file.
"Create e-book" creates the e-book from the .txt file, via a re-created .htm file.

Thus, and this is the important part, any manual edits to the HTML file will not only be ignored, they will be over-written and lost.

Perhaps a note about that in the "A few things to know" section.

Better page navigation

From email:

On the "Page" panel, it seems the only way to navigate to another page is using the << and >> buttons with the mouse (actually, once one of the buttons has been clicked, the space-bar will continue to change pages in that direction). It would be nice to be able to enter an actual page number to make it quicker to get to a page in a large document (eg, I'm working on one with 430 pages).

Maybe, when on the "Table" panel, a right-click|"go to page layout" option.

Unable to open any of 3 files

Application Name: PdfMasher
Version: 0.1.1

Traceback (most recent call last):
File "/usr/local/share/pdfmasher/qt/main_window.py", line 90, in openButtonClicked
self.app.model.open_file(destination)
File "/usr/local/share/pdfmasher/core/app.py", line 46, in open_file
self.elements = extract_text_elements_from_pdf(path)
File "/usr/local/share/pdfmasher/core/pdf.py", line 76, in extract_text_elements_from_pdf
for pageno, page in enumerate(doc.get_pages()):
File "/usr/local/share/pdfmasher/pdfminer/pdfparser.py", line 514, in get_pages
for (pageid,tree) in search(self.catalog['Pages'], self.catalog):
File "/usr/local/share/pdfmasher/pdfminer/pdfparser.py", line 499, in search
tree = dict_value(obj).copy()
File "/usr/local/share/pdfmasher/pdfminer/pdftypes.py", line 134, in dict_value
x = resolve1(x)
File "/usr/local/share/pdfmasher/pdfminer/pdftypes.py", line 62, in resolve1
x = x.resolve()
File "/usr/local/share/pdfmasher/pdfminer/pdftypes.py", line 51, in resolve
return self.doc.getobj(self.objid)
File "/usr/local/share/pdfmasher/pdfminer/pdfparser.py", line 438, in getobj
parser = PDFStreamParser(stream.get_data().decode('ascii'))
AttributeError: 'str' object has no attribute 'decode'

Fresh install from deb package

uname -a

Linux murr-desktop 2.6.39-0-generic #5~20110427-Ubuntu SMP Wed Apr 27 15:27:41 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Items marked as title aren't wrapped in h1

From GS

Titles ending with newlines aren't correctly wrapped in a h1 element. Example markdown:

Some title

===

Image support

The greatest pain with PDFs on Kindle is diagrams, which often can't be viewed entirely. Please add image support so (not only) I can finally study Go @ Kindle ;)

Filters for rows in the table of elements

Choosing similar elements from the table is difficult right now.
A 'search'-type filter for the table would be great ... searching for a string and selecting all elements with that string in it, or specifying an x and y range would make processing so much easier, especially for very large pdfs

I am sure there is some Qt way of doing this, e.g. http://doc.qt.nokia.com/4.7-snapshot/itemviews-customsortfiltermodel.html
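
Independently of the Qt side, the filtering logic itself is small. Here is a minimal sketch, assuming a hypothetical Element record with page, x, y and text fields; PdfMasher's real element class and table model may look different:

```python
from collections import namedtuple

# Hypothetical element record used only for this illustration.
Element = namedtuple("Element", "page x y text")


def filter_elements(elements, text=None, x_range=None, y_range=None):
    """Keep elements matching a substring and/or inclusive x and y ranges.

    Criteria left as None are ignored. Text matching is case-insensitive.
    """
    def keep(element):
        if text is not None and text.lower() not in element.text.lower():
            return False
        if x_range is not None and not (x_range[0] <= element.x <= x_range[1]):
            return False
        if y_range is not None and not (y_range[0] <= element.y <= y_range[1]):
            return False
        return True

    return [element for element in elements if keep(element)]
```

For example, page footers could be selected with a narrow y_range near the bottom of the page and then ignored in one step.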

Footnote confusion

From email:

It seems PdfMasher attempts to create links to footnotes. However, it is misreading scripture references and trying to convert them.

E.g., the PDF (from http://www.garynorth.com/PrioritiesAndDominion.pdf, 2MB, if that helps) has:

about God (Rom. 1:18– 25).1

(where the final "1" is actually superscript denoting a footnote).

The markdown is:

about God (Rom. [1]:18– 25).1

Another, in the PDF:

whom he will he hardeneth” (Rom. 9:17–18).

results in the markdown:

whom he will he hardeneth” (Rom. [5]:17–18).

That one is not actually referencing any footnote anyway. Also, it's actually changed the visible text from "Rom. 9:17–18" to "Rom. [5]:17–18". That means it's not even possible to post-process the markdown to just remove the link because information has been lost (the "9" has gone).

Maybe an option to not make links to footnotes (which of course has other disadvantages).

Is it possible to detect super/subscript formatting in the PDFs and add it as a column (similar to the font size column, which is very useful) on the "Table" panel?

I already have an answer for that last point about detecting superscript: I've tried, but I couldn't find how to detect them. Of course, there must be a way to somehow detect them, but there's no easy way.
