
pdf2epubEX

This Bash shell script uses the pdf2htmlEX tool to convert a PDF file to an ePub file.

The result is a fixed layout ePub version 3: the layout is perfectly retained and all the fonts are embedded.

The pdf2htmlEX tool converts a PDF file into HTML5 (with CSS, JS, fonts, and bitmap and/or vector images).

Once you have an ePub file, you can optionally edit it with one of the available tools (Sigil, Calibre, Kotobee, etc.) to add interactive content, reflowable text, etc.

If you want to publish an eBook on one of the available eBook stores (Google Play Books, Apple Books, Rakuten Kobo, etc.), you have to provide an ePub file and not a PDF file.

For Linux (Debian, Ubuntu, Mint)

Usage

To convert myfile.pdf to myfile.epub, open the terminal and run the following command in the directory where the PDF file is located:

pdf2epubEX myfile.pdf

Installation

  • Add the package repository (repository.dodeeric.be/apt):
sudo echo "deb [trusted=yes] https://repository.dodeeric.be/apt/ /" > /tmp/dodeeric.list
sudo mv /tmp/dodeeric.list /etc/apt/sources.list.d/
sudo apt update
  • Install the package (pdf2epubex):
sudo apt install pdf2epubex

All the dependencies will be automatically installed.
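If a conversion later fails with "command not found", a quick check like the one below can show which helper programs are missing. This is only a sketch: `missing_tools` is a hypothetical helper name, and the tool names in the usage comment are taken from error messages reported by users of the script, not from an official dependency list.

```shell
# Print the name of each given command that is not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example (tool names assumed from user-reported error messages):
#   missing_tools pdf2htmlEX pdftoppm pdffonts
```

An empty result means every listed command was found.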

For MacOS and Linux (all distributions)

Usage

To convert myfile.pdf to myfile.epub, open the terminal and run the following command in the directory where the PDF file is located:

docker run -ti --rm -v `pwd`:/temp dodeeric/pdf2epubex pdf2epubEX myfile.pdf

You can also replace `pwd` with the absolute path of the directory where the PDF file is located (e.g. /home/dodeeric/Documents if the file is /home/dodeeric/Documents/myfile.pdf):

docker run -ti --rm -v /home/dodeeric/Documents:/temp dodeeric/pdf2epubex pdf2epubEX myfile.pdf
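The two commands above can be wrapped in a small helper that accepts the path to a PDF, mounts its directory, and quotes the path so that directory names containing spaces survive. This is a minimal sketch: the function names `pdf_volume_spec` and `pdf2epub_docker` are hypothetical, while the image name and the /temp mount point are the ones used in the commands above.

```shell
# Print the volume specification (host directory mapped to /temp)
# for the directory that contains the given PDF file.
pdf_volume_spec() {
  pdf_dir=$(cd "$(dirname "$1")" && pwd) || return 1
  printf '%s:/temp\n' "$pdf_dir"
}

# Run the conversion in Docker for a PDF located at any path.
pdf2epub_docker() {
  spec=$(pdf_volume_spec "$1") || return 1
  docker run -ti --rm -v "$spec" dodeeric/pdf2epubex pdf2epubEX "$(basename "$1")"
}

# Example:
#   pdf2epub_docker "/home/dodeeric/My Documents/myfile.pdf"
```

Quoting "$spec" matters: an unquoted path with a space would be split into two arguments, and Docker would then try to interpret part of the path as the image name.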

Installation

You need to install Docker, which is available for Linux, Windows and MacOS. See the Docker documentation for installation instructions.

For Windows

Usage

To convert C:\Users\Eric\Documents\myfile.pdf to C:\Users\Eric\Documents\myfile.epub, open a terminal (PowerShell, Command Prompt or Windows Terminal) and run the following command:

docker run -ti --rm -v C:\Users\Eric\Documents:/temp dodeeric/pdf2epubex pdf2epubEX myfile.pdf

or

docker run -ti --rm -v /c/Users/Eric/Documents:/temp dodeeric/pdf2epubex pdf2epubEX myfile.pdf

Installation

You need to install Docker, which is available for Linux, Windows and MacOS. See the Docker documentation for installation instructions.

Remark: You will first have to install WSL1 (Windows Subsystem for Linux), then update to WSL2.

pdf2htmlEX

You can also use pdf2htmlEX with this same Docker image:

To convert myfile.pdf to myfile.html, open the terminal and run the following command in the directory where the PDF file is located:

docker run -ti --rm -v `pwd`:/temp dodeeric/pdf2epubex pdf2htmlEX myfile.pdf

Remark: use the dodeeric/pdf2epubex:original Docker image to use the original version of pdf2htmlEX (coolwanglu).

Parameters

Once you launch pdf2epubEX, some information will be displayed like the book/PDF width and height (in inches and cm), then some questions will be asked like:

  • Format of the images in the epub (png, jpg or svg) [default: jpg]
  • Resolution of the images in the epub in dpi (e.g.: 150 or 300) [default: 150]
  • Title, author, publisher, year, language, ISBN number, subject

If you want, you can hit ENTER to all the questions.
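For unattended runs, pressing ENTER at every prompt can be simulated by feeding blank lines to the script. The sketch below assumes the script reads its answers from standard input, which has not been verified here; `accept_defaults` is a hypothetical helper name. With the Docker image, the -t option would probably have to be dropped for piped input to work, which is also untested.

```shell
# Run a command while feeding it 50 blank lines on standard input,
# which is the same as pressing ENTER at each prompt.
# Assumption: the command reads its answers from standard input.
accept_defaults() {
  printf '%50s' '' | tr ' ' '\n' | "$@"
}

# Hypothetical usage with the natively installed script:
#   accept_defaults pdf2epubEX myfile.pdf
```

Fifty lines is an arbitrary upper bound chosen to exceed the number of questions listed above.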

Image formats:

  • if you choose png or jpg (bitmap formats), the vector images of the PDF will be converted to bitmap images (rasterized).
  • if you choose svg (vector and bitmap format), the vector images of the PDF will remain in vector format, but: a) you cannot choose the resolution of the bitmap images (it is the one from the PDF); b) the bitmap images will be embedded in the svg files (Base64 encoded); c) this format is not always rendered correctly by eBook readers; d) the generated epub file does not always pass the epub check.

A vector image can be as simple as a line, a rectangle, a table frame, a colored background, etc.

For eBooks with a lot of bitmap images, it is better to choose JPG (lossy compression) to keep the file size down. For eBooks with mainly vector images, it is better to choose PNG (lossless compression).

The ePub cover image will be made from the first page of the PDF file (png format).
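The page-size information shown when the script starts (width and height in inches and cm) can be approximated from the page size in PDF points, where 1 point is 1/72 inch. The sketch below is not the script's own code: `page_metrics` is a hypothetical name, and the fixed viewport width of 900 pixels is an assumption based on output posted by users.

```shell
# Print page dimensions, aspect ratio and a 900-pixel-wide viewport
# for a page given as width and height in PDF points (1 pt = 1/72 inch).
page_metrics() {
  awk -v w="$1" -v h="$2" 'BEGIN {
    printf "Width: %.2f inches / %.1f cm\n", w / 72, w / 72 * 2.54
    printf "Height: %.2f inches / %.1f cm\n", h / 72, h / 72 * 2.54
    printf "Ratio (Height / Width): %.2f\n", h / w
    printf "Viewport: 900 x %.0f pixels\n", 900 * h / w
  }'
}

# Example for an A4 page (595 x 842 points):
#   page_metrics 595 842
```

The page size in points can be read from the "Page size" line printed by the pdfinfo command from Poppler.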

Examples

In the examples below, the HTML version is one big file including everything (all the pages with HTML5, CSS, JS, fonts and images; fonts and images are encoded in Base64, which can make the file quite big). pdf2htmlEX can also put all that content in separate files (.html, .css, .js, .woff, .png, .jpg, .svg); that is in fact basically what the pdf2epubEX script does before wrapping all the files in one ePub container file (.epub). An ePub is sometimes referred to as a "website in a box".
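To illustrate the "website in a box" idea: an ePub is a ZIP archive whose first entry must be an uncompressed file named mimetype, followed by the META-INF directory and the content files. The sketch below shows one common way to build such a container with the zip command. It is not the script's own code; `pack_epub` is a hypothetical name, and the OEBPS directory name is an assumption based on paths seen in user-reported output.

```shell
# Pack a prepared book directory (containing mimetype, META-INF/ and
# OEBPS/) into an ePub container.
pack_epub() {
  root=$1
  out=$2
  # Make the output path absolute, because we change directory below.
  case $out in /*) ;; *) out=$PWD/$out ;; esac
  rm -f "$out"
  (
    cd "$root" || exit 1
    # mimetype goes first, stored uncompressed (-0), no extra fields (-X).
    zip -q -X0 "$out" mimetype &&
      # Everything else, compressed (-9), without directory entries (-D).
      zip -q -Xr9D "$out" META-INF OEBPS
  )
}

# Example:
#   pack_epub ./bookroot mybook.epub
```

Reading systems and validators rely on the position of the mimetype entry to recognise the file as an ePub, which is why it is added in a separate first step.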

Legend:

  • Number in parentheses: the size of the file in MB.
  • Hashtag in parentheses: the ePub file does not pass the epub check validation using version epub 3.2 rules (commands not allowed in some svg files). This does not mean the ePub will not be displayed properly in most ePub readers.
  • ePub written in bold: the recommended ePub version.

CEB 2015 - Solides et figures
(24 pages, only vector images in the PDF)

Format  150 DPI          300 DPI
PDF     PDF (0.3)
SVG     ePub (1.0)(#)
JPG     ePub (0.6)       ePub (1.1)
PNG     ePub (0.7)       ePub (1.5)
SVG     HTML (2.2)
JPG     HTML (1.8)       HTML (5.7)
PNG     HTML (1.1)       HTML (2.5)

Install your own OpenStack Cloud
(49 pages, bitmap and vector images in the PDF)

Format  150 DPI          300 DPI
PDF     PDF (1.0)
SVG     ePub (1.4)
JPG     ePub (1.5)       ePub (2.0)
PNG     ePub (1.6)       ePub (3.2)
SVG     HTML (2.9)
JPG     HTML (5.3)       HTML (14.0)
PNG     HTML (3.0)       HTML (6.4)

La dynastie belge en images
(248 pages, a lot of bitmap images in the PDF)

Format  150 DPI          300 DPI
PDF     PDF (56)         PDF (396)
SVG     ePub (78)(#)     ePub (504)(#)
JPG     ePub (48)        ePub (150)
PNG     ePub (209)       ePub (628)
SVG     HTML (142)       HTML (895)
JPG     HTML (69)        HTML (217)
PNG     HTML (296)       HTML (869)

Vector image quality in different formats (zoom of 500 %):

  1. SVG (vector format)
  2. PNG (bitmap format, lossless compression), at 150 DPI and 300 DPI
  3. JPG (bitmap format, compression with loss), at 150 DPI and 300 DPI

[Comparison images not included]

Additional information

Book

The script is based on the method described in my book published in 2014: Fixed Layout ePub: A Practical Guide to Publish eBooks from PDF Files. It is available on Amazon and on Google Play Books.

Fixed layout ePub

To read a fixed layout ePub, the best device is a tablet (Android or iOS/iPad). A smartphone is usually not suitable because its screen is too small.

Many ePub reader apps (for both reflowable text ePub and fixed layout ePub) are available on different platforms (Android, iOS, Windows, MacOS, or Linux): Google Play Books, BookShelf, PocketBook, Adobe Digital Editions, Apple Books (only on iOS; formerly known as Apple iBooks), etc.

Amazon Kindle does not support the standard ePub format (they have their own format which is based on the ePub format).

To use Google Play Books, you have to go to Settings, then set Enable uploading. The uploaded eBooks (PDF or ePub) will be available on all devices using the same Google account. You can also upload eBooks from the Google Play Books web interface (see the Upload files button on the top right corner). Please note that: a) the ePub file has to pass a pre-check to be able to be hosted in the Google cloud; b) if you upload a PDF, all pages (text + images) will be converted into images (the text and vector images are rasterized, and no hidden text layer will be added: it means no text search or copy/paste is possible). Please also note that there is a Web browser reader available.

More about fixed layout (FXL) ePub version 3 specifications (IDPF / W3C): Fixed Layouts (EPUB Content Documents 3.2) and Fixed-Layout Properties (EPUB Packages 3.2).

Reflowable text ePub

This script converts a PDF to a fixed layout ePub. pdf2htmlEX is designed to preserve the original layout; as a result, it is not the best tool for extracting the text and the images from a PDF.

However, the script will ask whether you want a reflowable text ePub or a fixed layout ePub if Calibre is installed (apt install calibre) or if you use the dodeeric/pdf2epubex:calibre image (which is much bigger than dodeeric/pdf2epubex).

The script will use the ebook-convert command from Calibre to convert the PDF to a reflowable text ePub.
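For reference, a direct call to Calibre's ebook-convert takes an input file and an output file, and picks the output format from the output file's extension. The sketch below uses no extra options; which options the script itself passes is not documented here, and the function names are hypothetical.

```shell
# Derive the output name: myfile.pdf -> myfile.epub
epub_name_for() {
  printf '%s\n' "${1%.[Pp][Dd][Ff]}.epub"
}

# Convert a PDF to a reflowable text ePub with Calibre's default settings.
pdf_to_reflowable() {
  ebook-convert "$1" "$(epub_name_for "$1")"
}

# Example:
#   pdf_to_reflowable myfile.pdf
```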

Caution: automatically converting a PDF file to a reflowable text ePub file cannot be perfect. We suggest editing the ePub file manually with Calibre or Sigil.

Other Git Repositories

Repositories for pdf2htmlEX: the original one (coolwanglu) and the new one (with updated .deb packages).

This script is based on the Bash scripts written by Robert Clayton (RNCTX) and available in his Git repository.

Contributors

dodeeric, otakashiyamauchi


pdf2epubex's Issues

Does this code still work? I've tried all available options to run the program but no success so far.

Hello Eric!

I came across your program while searching for a converter from pdf to epub. However, despite my several attempts to convert my file with all available options to run the code, I haven't managed to do this.

  1. First, I added the package repository and installed the package with the command

sudo apt install pdf2epubex

Unfortunately, when converting, I got the same error about something being missing.

  2. So I moved on to the next option.

I tried to install pdf2epubEX from the deb package, but it failed due to two missing dependencies, namely libfontforge1 and libpoppler58. These packages are absent in the Ubuntu repository. Installing Font Forge and Poppler from other sources didn't help either, apparently because of the difference in package names. I guess that was also the reason for my failures with the first option.

  3. Finally, I installed Docker and tried out the image. That's what I got:
~/Desktop $ sudo docker run -ti --rm -v `pwd`:/pdf dodeeric/pdf2epubex pdf2epubEX.sh test.pdf
Unable to find image 'dodeeric/pdf2epubex:latest' locally
latest: Pulling from dodeeric/pdf2epubex
7b1a6ab2e44d: Pull complete 
7f937bafbd1b: Pull complete 
62c28fc93ed9: Pull complete 
1f68659c9c03: Pull complete 
c742b8d6bbbc: Pull complete 
30d0de691ff0: Pull complete 
c29d2e948a49: Pull complete 
Digest: sha256:52054bb4f365c498f184673414e17db09145e5d28f2d30cb247e1877c1a6e73d
Status: Downloaded newer image for dodeeric/pdf2epubex:latest
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "pdf2epubEX.sh": executable file not found in $PATH: unknown.

Please assist me in this matter. I am very much looking forward to your help.

Here are my specs.

[Screenshot: 2023-11-26_21-44]

Feature Enhancement for Read-Aloud ePub file

Hi,

I think the publishing community would benefit from read-aloud fixed layout ePub files, where each word is highlighted as it is spoken in the audio file. In the SMIL file of a read-aloud ePub 3.0 file, we have to give the ID of each word to be highlighted.
So, as a feature enhancement, could your tool output the XHTML file in such a way that each word appears in a span with a unique ID? Currently your tool outputs the XHTML text per sentence/line.

Thanks

Request for Feature

I don't know whether this is already a feature, so I am making this request. Is there any setting so that the output EPUB file can have a Table of Contents based on the input PDF file? If not, are you planning to implement such a feature or setting in the future? Thanks for sharing this wonderful piece of code. It is literally a lifesaver. THANKS

Can't run app as described in README

Hi, after trying docker run -ti --rm -v pwd:/temp dodeeric/pdf2epubex pdf2epubEX myfile.pdf on mac M1 I got:

docker: invalid reference format: repository name must be lowercase.

Error message

Hi
I went through all the steps and got these error messages:
pdf2htmlEX version: 0.18.8.rc2
Error when parsing the arguments:
Invalid argument: --page-filename
cat: '*.page': No such file or directory
mv: cannot stat '*.css': No such file or directory
mv: cannot stat '*.woff': No such file or directory
mv: cannot stat '*.jpg': No such file or directory
I/O Error: Couldn't open file 'mybook.pdf': No such file or directory.
mv: cannot stat 'cover.png': No such file or directory
sed: can't read base.min.css: No such file or directory
mv: target 'filename-150dpi-jpg.epub' is not a directory
Done

I have correctly specified the path to my file and mapped it to the /temp directory inside the Docker container.
Kindly advise. Thanks

Didn't convert the PDF properly

Archive.zip

Please check the sample PDF and output epub file. It didn't convert properly.

$ sh pdf2epubEX.sh Explanation_of_a_Summary_of_Aqeedat_Hamawiyyah.pdf
(standard_in) 1: parse error
(standard_in) 1: parse error
(standard_in) 1: parse error
(standard_in) 1: parse error
(standard_in) 1: parse error
(standard_in) 1: parse error
-------------------------------------------------------------------------------------------------
Book/PDF Width: 0.00 inches / 0.0 cm
Book/PDF Height: 0.00 inches / 0.0 cm
Factor ratio (Height / Width): 0.00
ePub Viewport (Width x Height): 900 pixels x  pixels
-------------------------------------------------------------------------------------------------
Do you want to see more information on the PDF file? (y or n) [default: n]:
pdf2epubEX.sh: line 58: pdffonts: command not found
======================
Caution:
- if you chose png or jpg (bitmap formats), the vector images will be converted in bitmap images (rasterized).
- if you chose svg (vector and bitmap format), the vector images will remain in vector format, but: a) you cannot chose the resolution of the bitmap images (it is the one from the PDF); b) the bitmap images will be included in the svg files (Base64 coded); c) this format is not always correctly rendered by eBook readers; d) the generated epub file is not always passing the epub check.
======================
If you want, you can hit <ENTER> to all the next questions.
Format of the images in the epub (png, jpg or svg) [default: jpg]:
Resolution of the images in the epub in dpi (e.g.: 150 or 300) [default: 150]:
Title [Default: None]:
Author [Default: None]:
Publisher [Default: None]:
Year [Default: 1900]:
Language (e.g.: fr) [Default: en]:
ISBN number [Default: None]:
Subject (e.g.: history) [Default: None]:
Wait...
pdf2htmlEX: unrecognized option `--dpi'
cat: *.page: No such file or directory
mv: rename *.css to ./bookroot/OEBPS/*.css: No such file or directory
mv: rename *.woff to ./bookroot/OEBPS/*.woff: No such file or directory
mv: rename *.jpg to ./bookroot/OEBPS/*.jpg: No such file or directory
pdf2epubEX.sh: line 219: pdftoppm: command not found
mv: rename cover.png to ./bookroot/OEBPS/cover.png: No such file or directory
sed: 1: "base.min.css": undefined label 'ase.min.css'
Done

TOC and internal crossref links?

My PDF (produced in LaTeX) has a TOC and many internal crossref links. Is there any way to import/enable these in the converted epub? I know Kindle Create has such an option. It would be great if we could create epubs like Kindle ebooks.

Table of Contents (ToC)

Add the left side menu (ToC) from the pdf2htmlEX browser reader into the nav.xhtml file (ToC of the ePub book).
