
About

A collection of tools Standard Ebooks uses to produce its ebooks, including basic setup of ebooks, text processing, and build tools.

Installing this toolset using pipx makes the se command-line executable available. Its various commands are described below, or you can use se help to list them.

Installation

The toolset requires Python >= 3.8 and <= 3.12.

To install the toolset locally for development and debugging, see Installation for toolset developers.

Optionally, install Ace; if it is installed, the se build --check command will automatically run it as part of the checking process.

Ubuntu 20.04 (Focal) users

# Install some pre-flight dependencies.
sudo apt install -y calibre default-jre git python3-dev python3-pip python3-venv

# Install pipx.
python3 -m pip install --user pipx
python3 -m pipx ensurepath

# Install the toolset.
pipx install standardebooks

Optional: Install shell completions

# Install ZSH completions.
sudo ln -s $HOME/.local/pipx/venvs/standardebooks/lib/python3.*/site-packages/se/completions/zsh/_se /usr/share/zsh/vendor-completions/_se && hash -rf && compinit

# Install Bash completions.
sudo ln -s $HOME/.local/pipx/venvs/standardebooks/lib/python3.*/site-packages/se/completions/bash/se /usr/share/bash-completion/completions/se

# Install Fish completions.
ln -s $HOME/.local/pipx/venvs/standardebooks/lib/python3.*/site-packages/se/completions/fish/se $HOME/.config/fish/completions/se.fish

Fedora users

# Install some pre-flight dependencies.
sudo dnf install calibre git java-1.8.0-openjdk python3-devel

# Install pipx.
python3 -m pip install --user pipx
python3 -m pipx ensurepath

# Install the toolset.
pipx install standardebooks

Optional: Install shell completions

# Install ZSH completions.
sudo ln -s $HOME/.local/pipx/venvs/standardebooks/lib/python3.*/site-packages/se/completions/zsh/_se /usr/share/zsh/vendor-completions/_se && hash -rf && compinit

# Install Bash completions.
sudo ln -s $HOME/.local/pipx/venvs/standardebooks/lib/python3.*/site-packages/se/completions/bash/se /usr/share/bash-completion/completions/se

# Install Fish completions.
ln -s $HOME/.local/pipx/venvs/standardebooks/lib/python3.*/site-packages/se/completions/fish/se $HOME/.config/fish/completions/se.fish

macOS users

  1. Install the Homebrew package manager. Or, if you already have it installed, make sure it’s up to date:

    brew update
  2. Install dependencies:

    # Install some pre-flight dependencies.
brew install cairo calibre git openjdk pipx python@3.11
    pipx ensurepath
    sudo ln -sfn $(brew --prefix)/opt/openjdk/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk
    
    # Install the toolset.
    pipx install --python python3.11 standardebooks
    
    # Optional: Bash users who have set up bash-completion via brew can install tab completion.
    ln -s $HOME/.local/pipx/venvs/standardebooks/lib/python3.*/site-packages/se/completions/bash/se $(brew --prefix)/etc/bash_completion.d/se
    
    # Optional: Fish users can install tab completion.
    ln -s $HOME/.local/pipx/venvs/standardebooks/lib/python3.*/site-packages/se/completions/fish/se $HOME/.config/fish/completions/se.fish

OpenBSD 6.6 users

These instructions were tested on OpenBSD 6.6, but may also work on the 6.5 release.

  1. Create a text file to feed into pkg_add called ~/standard-ebooks-packages. It should contain the following:

    py3-pip--
    py3-virtualenv--
    py3-gitdb--
    jdk--%11
    calibre--
    git--
  2. Install dependencies using doas pkg_add -ivl ~/standard-ebooks-packages. Follow linking instructions provided by pkg_add to save keystrokes, unless you want to have multiple python versions and pip versions. In my case, I ran doas ln -sf /usr/local/bin/pip3.7 /usr/local/bin/pip.

  3. Add ~/.local/bin to your path.

  4. Run pip install --user pipx

  5. If you’re using ksh from base and have already added ~/.local/bin to your path, you can skip pipx ensurepath; that step is for bash users.

  6. The rest of the process is similar to that used on other platforms:

    # Install the toolset.
    pipx install standardebooks

Installation for toolset developers

If you want to work on the toolset source, it’s helpful to tell pipx to install the package in “editable” mode. This will allow you to edit the source of the package live and see changes immediately, without having to uninstall and re-install the package.

To do that, follow the general installation instructions above; but instead of doing pipx install standardebooks, do the following:

git clone https://github.com/standardebooks/tools.git
pipx install --editable ./tools

Now the se binary is in your path, and any edits you make to source files in the tools/ directory are immediately reflected when executing the binary.

Running commands on the entire corpus

As a developer, it’s often useful to run an se command like se lint or se build on the entire corpus for testing purposes. A regular invocation (like se lint /path/to/ebook/repos/*) can be very time-consuming, because each argument is processed sequentially. Instead, use GNU Parallel to start multiple invocations in parallel, with each one processing a single argument. For example:

# Slow, each argument is processed in sequence
se lint /path/to/ebook/repos/*

# Fast, multiple invocations each process a single argument in parallel
export COLUMNS; parallel --keep-order se lint ::: /path/to/ebook/repos/*

The toolset tries to detect when it’s being invoked from parallel, and it adjusts its output to accommodate.

We export COLUMNS because se lint needs to know the width of the terminal so that it can format its tabular output correctly. We pass the --keep-order flag to output results in the order we passed them in, which is useful if comparing the results of multiple runs.

Linting with pylint and mypy

Before we can use pylint or mypy on the toolset source, we have to inject them (and additional typings) into the venv pipx created for the standardebooks package:

pipx inject standardebooks pylint==3.2.2 mypy==1.10.0 types-requests==2.32.0.20240602 types-setuptools==70.0.0.20240524 types-Pillow==10.2.0.20240520

Then make sure to call the pylint and mypy binaries that pipx installed in the standardebooks venv, not any other globally-installed binaries:

cd /path/to/tools/repo
$HOME/.local/pipx/venvs/standardebooks/bin/pylint tests/*.py se

Testing with pytest

Instructions are found in the testing README.

Code style

  • In general we follow a relaxed version of PEP 8. In particular, we use tabs instead of spaces, and there is no line length limit.

  • Always use the regex module instead of the re module.
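As a quick illustration of why (a sketch, not project code): the third-party regex module accepts constructs that the standard library's re module rejects, such as Unicode property classes, which matter when processing the accented text common in ebooks.

```python
# Requires the third-party `regex` module (pip install regex).
# Unlike `re`, it supports Unicode property classes like \p{Lu}
# (any Unicode uppercase letter, including accented ones).
import regex

text = "Le Café — “Déjà Vu”"
uppercase = regex.findall(r"\p{Lu}", text)
print(uppercase)  # ['L', 'C', 'D', 'V']
```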

Help wanted

We need volunteers to take the lead on the following goals:

  • Add more test cases to the test framework.

  • Write installation instructions for Bash and ZSH completions on macOS.

  • Currently the toolset requires the whole Calibre package, which is very large but is only used to convert epub to azw3. Can we inline Calibre’s azw3 conversion code into our ./vendor/ directory, to avoid installing the entire Calibre package? If so, how do we keep it updated as Calibre evolves?

  • Over the years, ./se/se_epub_build.py has evolved to become very large and unwieldy. Is there a better, clearer way to organize this code?

Tool descriptions

  • se british2american

    Try to convert British quote style to American quote style in DIRECTORY/src/epub/text/.

    Quotes must already be typogrified using the se typogrify tool.

    This script isn’t perfect; proofreading is required, especially around closing quotes adjacent to em-dashes.

  • se build

    Build an ebook from a Standard Ebook source directory.

  • se build-ids

    Change ID attributes for non-sectioning content to their expected values across the entire ebook. IDs must be globally unique and correctly referenced, and the ebook spine must be complete.

  • se build-images

    Build ebook cover and titlepage images in a Standard Ebook source directory and place the output in DIRECTORY/src/epub/images/.

  • se build-manifest

    Generate the <manifest> element for the given Standard Ebooks source directory and write it to the ebook’s metadata file.

  • se build-spine

    Generate the <spine> element for the given Standard Ebooks source directory and write it to the ebook’s metadata file.

  • se build-title

    Generate the title of an XHTML file based on its headings and update the file’s <title> element.

  • se build-toc

    Generate the table of contents for the ebook’s source directory and update the ToC file.

  • se clean

    Prettify and canonicalize individual XHTML, SVG, or CSS files, or all XHTML, SVG, or CSS files in a source directory.

  • se compare-versions

    Use Firefox to render and compare XHTML files in an ebook repository. Run on a dirty repository to visually compare the repository’s dirty state with its clean state. If a file renders differently, place screenshots of the new, original, and diff (if available) renderings in the current working directory. A file called diff.html is created to allow for side-by-side comparisons of original and new files.

  • se create-draft

    Create a skeleton of a new Standard Ebook.

  • se css-select

    Print the results of a CSS selector evaluated against a set of XHTML files.

  • se dec2roman

    Convert a decimal number to a Roman numeral.
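This conversion can be sketched in a few lines of Python. The greedy, table-driven version below illustrates the idea; it is not the toolset's actual implementation.

```python
# Minimal decimal-to-Roman sketch (illustration only, not se's code).
def dec2roman(number: int) -> str:
    """Convert a positive integer to a Roman numeral string."""
    values = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    result = ""
    for value, numeral in values:
        # Greedily take the largest value that still fits.
        while number >= value:
            result += numeral
            number -= value
    return result

print(dec2roman(1864))  # MDCCCLXIV
```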

  • se extract-ebook

    Extract an .epub, .mobi, or .azw3 ebook into ./FILENAME.extracted/ or a target directory.

  • se find-mismatched-dashes

    Find words with mismatched dashes in a set of XHTML files. For example, extra-physical in one file and extraphysical in another.
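The core idea can be sketched by grouping words under their dash-stripped form and flagging any group with more than one spelling. This is an illustration of the technique, not the tool's actual code.

```python
# Sketch of mismatched-dash detection (illustration only, not se's code):
# words that collide after removing dashes are reported as variants.
import re
from collections import defaultdict

def find_mismatched_dashes(texts: list[str]) -> dict[str, set[str]]:
    variants = defaultdict(set)
    for text in texts:
        # Match hyphenated words first, then plain words.
        for word in re.findall(r"[a-zA-Z]+(?:-[a-zA-Z]+)+|[a-zA-Z]+", text):
            variants[word.replace("-", "").lower()].add(word.lower())
    return {key: forms for key, forms in variants.items() if len(forms) > 1}

result = find_mismatched_dashes(["an extra-physical event", "an extraphysical event"])
print({key: sorted(forms) for key, forms in result.items()})
# {'extraphysical': ['extra-physical', 'extraphysical']}
```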

  • se find-mismatched-diacritics

    Find words with mismatched diacritics in a set of XHTML files. For example, cafe in one file and café in another.

  • se find-unusual-characters

    Find characters outside a nominal expected range in a set of XHTML files. This can be useful to find transcription mistakes and mojibake.

  • se help

    List available SE commands.

  • se hyphenate

    Insert soft hyphens at syllable breaks in an XHTML file.

  • se interactive-replace

    Perform an interactive search and replace on a list of files using Python-flavored regex. The view is scrolled using the arrow keys, with alt to scroll by page in any direction. Basic Emacs (default) or Vim style navigation is available. The following actions are possible: (y) Accept replacement. (n) Reject replacement. (a) Accept all remaining replacements in this file. (r) Reject all remaining replacements in this file. (c) Center on match. (q) Save this file and quit.

  • se lint

    Check for various Standard Ebooks style errors.

  • se make-url-safe

    Make a string URL-safe.
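A rough stdlib-only sketch of URL-safe slugging, in the spirit of this tool; the real tool's rules may differ.

```python
# Sketch of URL-safe slugging (illustration only, not se's code).
import re
import unicodedata

def make_url_safe(text: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Lowercase and collapse runs of non-alphanumerics into single hyphens.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

print(make_url_safe("The Count of Monte Cristo!"))  # the-count-of-monte-cristo
```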

  • se modernize-spelling

    Modernize spelling of some archaic words, and replace words that may be archaically compounded with a dash to a more modern spelling. For example, replace ash-tray with ashtray.

  • se prepare-release

    Calculate work word count, insert release date if not yet set, and update modified date and revision number.

  • se recompose-epub

    Recompose a Standard Ebooks source directory into a single HTML5 file, and print to standard output.

  • se renumber-endnotes

    Renumber all endnotes and noterefs sequentially from the beginning.

  • se roman2dec

    Convert a Roman numeral to a decimal number.
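The inverse conversion is a short scan that handles subtractive notation (IV, IX, and so on). Again, this is a sketch, not the toolset's actual implementation.

```python
# Minimal Roman-to-decimal sketch (illustration only, not se's code).
def roman2dec(numeral: str) -> int:
    values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
    numeral = numeral.upper()
    total = 0
    for i, char in enumerate(numeral):
        value = values[char]
        # Subtractive notation: a smaller value before a larger one counts negative.
        if i + 1 < len(numeral) and values[numeral[i + 1]] > value:
            total -= value
        else:
            total += value
    return total

print(roman2dec("MDCCCLXIV"))  # 1864
```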

  • se semanticate

    Apply some scriptable semantics rules from the Standard Ebooks semantics manual to a Standard Ebook source directory.

  • se shift-endnotes

    Increment or decrement the specified endnote and all following endnotes by 1 or a specified amount.

  • se split-file

    Split an XHTML file into many files at all instances of <!--se:split-->, and include a header template for each file.
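The splitting step itself amounts to cutting the text at each marker comment; this sketch shows only that step (the marker comes from the description above; the real tool also applies a header template to each resulting file).

```python
# Sketch of marker-based splitting (illustration only, not se's code).
def split_on_marker(xhtml: str, marker: str = "<!--se:split-->") -> list[str]:
    """Return the chunks of the document between split markers."""
    return [chunk.strip() for chunk in xhtml.split(marker)]

parts = split_on_marker("<p>One</p><!--se:split--><p>Two</p>")
print(parts)  # ['<p>One</p>', '<p>Two</p>']
```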

  • se titlecase

    Convert a string to titlecase.

  • se typogrify

    Apply some scriptable typography rules from the Standard Ebooks typography manual to a Standard Ebook source directory.

  • se unicode-names

    Display Unicode code points, descriptions, and links to more details for each character in a string. Useful for differentiating between different flavors of spaces, dashes, and invisible characters like word joiners.
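The kind of report the tool produces can be approximated with the standard library's unicodedata module; this sketch shows the code point and Unicode name for each character, without the extra detail links the real tool provides.

```python
# Sketch of per-character Unicode reporting (illustration only, not se's code).
import unicodedata

def unicode_names(text: str) -> list[str]:
    """Return 'U+XXXX NAME' lines, one per character."""
    return [
        f"U+{ord(char):04X} {unicodedata.name(char, '<unnamed>')}"
        for char in text
    ]

for line in unicode_names("a\u00a0b"):
    print(line)
# U+0061 LATIN SMALL LETTER A
# U+00A0 NO-BREAK SPACE
# U+0062 LATIN SMALL LETTER B
```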

  • se word-count

    Count the number of words in an HTML file and optionally categorize by length.
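A crude version of this is: extract the text content from the markup, then count whitespace-separated tokens. The real tool is more careful; this stdlib sketch just illustrates the shape of the task.

```python
# Sketch of HTML word counting (illustration only, not se's code).
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def word_count(html: str) -> int:
    parser = TextExtractor()
    parser.feed(html)
    return len(re.findall(r"\S+", " ".join(parser.chunks)))

print(word_count("<p>It was a dark and stormy night.</p>"))  # 7
```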

  • se xpath

    Print the results of an XPath expression evaluated against a set of XHTML files. The default namespace is removed.
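As a rough stdlib-only illustration of XPath evaluation (not the tool itself), ElementTree's limited XPath subset can evaluate simple expressions. Note that the toy markup below omits the XHTML default namespace, which is exactly the thing se xpath removes for you; with a real XHTML file you would otherwise have to qualify every element name.

```python
# Sketch of XPath evaluation over XHTML (illustration only, not se's code).
import xml.etree.ElementTree as ET

xhtml = "<html><body><p>First</p><p>Second</p></body></html>"
root = ET.fromstring(xhtml)

# ElementTree supports a subset of XPath, enough for simple queries.
for p in root.findall(".//p"):
    print(p.text)
# First
# Second
```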

What a Standard Ebooks source directory looks like

Many of these tools act on Standard Ebooks source directories. Such directories have a consistent minimal structure:

.
|__ images/
|   |__ cover.jpg
|   |__ cover.source.jpg
|   |__ cover.svg
|   |__ titlepage.svg
|
|__ src/
|   |__ META-INF/
|   |   |__ container.xml
|   |
|   |__ epub/
|   |   |__ css/
|   |   |   |__ core.css
|   |   |   |__ local.css
|   |   |   |__ se.css
|   |   |
|   |   |__ images/
|   |   |   |__ cover.svg
|   |   |   |__ logo.svg
|   |   |   |__ titlepage.svg
|   |   |
|   |   |__ text/
|   |   |   |__ colophon.xhtml
|   |   |   |__ imprint.xhtml
|   |   |   |__ titlepage.xhtml
|   |   |   |__ uncopyright.xhtml
|   |   |
|   |   |__ content.opf
|   |   |__ onix.xml
|   |   |__ toc.xhtml
|   |
|   |__ mimetype
|
|__ LICENSE.md

./images/ contains source images for the cover and titlepages, as well as ebook-specific source images. Source images should be in their maximum available resolution, then compressed and placed in ./src/epub/images/ for distribution.

./src/epub/ contains the actual epub files.


Issues

Constructions like "her's" and "your's" (hers and yours)

I don't think these are sorted out by modernize-spelling as yet, but in Clarissa these kinds of archaic constructions appear throughout: your's = yours, her's = hers, etc.

I presume we'll want to modernize these. Do you want me to have a go at creating a pull request?

Hyphenation

Hi there,
I remember you once provided a script for hyphenation, but I cannot find it anymore. How is this done now?

Thanks a lot!

Original Publication Date in metadata?

A question/request (I hope this is the right place for it): Is there a metadata field for original publication date? If so it would be nice to have it filled in. For readers it's more relevant than the date of the ebook edition.

I tried to answer the question myself but could not find anything resembling a complete specification of the content.opf format. I did see that dc:date is specifically meant to contain the publication date of the respective ebook edition, so that's out.

lint reports errors with CSS copied directly from Typography Manual

I’m not sure if this repository is the right place to submit this kind of issue, but lint reports errors about CSS copied directly from the Standard Ebooks Typography Manual. For example, the CSS for abbreviated names is currently:

abbr.name{ white-space: nowrap; }


Using that CSS results in “CSS closing braces must be on their own line.” from lint.

It would seem that either lint’s requirements should be relaxed, or the CSS in the Typography Manual should be updated to conform to lint’s requirements.

tools install fails on hyphen dictionary download

I am installing tools dependencies inside a python3 virtualenv. The following command suggested in the README fails:

$ python3 -c "exec(\"from hyphen import dictools\\ndictools.install('en_GB')\\ndictools.install('en_US')\")"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 2, in <module>
  File "/home/data/regis/projets/code/drafts/georgeorwell/env/lib/python3.5/site-packages/hyphen/dictools.py", line 71, in install
    descr_file = urlopen(descr_url)
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 451, in open
    req = Request(fullurl, data)
  File "/usr/lib/python3.5/urllib/request.py", line 269, in __init__
    self.full_url = url
  File "/usr/lib/python3.5/urllib/request.py", line 295, in full_url
    self._parse()
  File "/usr/lib/python3.5/urllib/request.py", line 324, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: '$repoen_GB/dictionaries.xcu'

The error is caused by the fact that we are trying to download files from an incorrect url ($repoen_GB/dictionaries.xcu). This weird url is generated using the default_repository variable from the hyphen.config module. This variable should have been modified by the pyhyphen setup.py script; but the string substitution from setup.py only works when running python setup.py install on the original repo, not pip install pyhyphen. (see original source)

If I understand pyhyphen's setup.py file correctly, this hack was put in place to support debian packaging. This is confirmed by this conversation on bugs.debian.org.

I see three possible solutions to this problem:

  1. Patch the README to explain how to install pyhyphen from source. (requires Mercurial because the original repo is on BitBucket)
  2. Fork pyhyphen (the project seems unmaintained), patch the issue and point tools to this new repo
  3. Get rid of pyhyphen, as a dependency, from tools, and replace it by another library, such as Pyphen, hyphenator, or something else.

I'd suggest to go with solution 3. If you're OK with that, I can try to implement the changes myself.

lint treats .DS_Store files as errors, tries and fails to read them

.DS_Store files are created automatically by Finder on macOS. Even when these files don’t appear in an ebook package manifest, aren’t committed to the repository, and are perhaps explicitly ignored using a .gitignore file, lint reports them as illegal files.

It may be better for lint to treat the presence of .DS_Store files as a warning, with a suggestion to delete them using, for example:

find . -name \.DS_Store -delete

Issues installing PyHyphen on Mac

I'm having issues installing PyHyphen on my Mac. I recently updated to macOS 10.13, High Sierra, so that could be to blame.

When I run pip3 install -r ./tools/requirements.txt, this is the output:

Requirement already satisfied: beautifulsoup4==4.6.0 in /usr/local/lib/python3.6/site-packages (from -r ./requirements.txt (line 1))
Requirement already satisfied: cssselect==1.0.1 in /usr/local/lib/python3.6/site-packages (from -r ./requirements.txt (line 2))
Requirement already satisfied: gitpython==2.1.5 in /usr/local/lib/python3.6/site-packages (from -r ./requirements.txt (line 3))
Requirement already satisfied: lxml==3.8.0 in /usr/local/lib/python3.6/site-packages (from -r ./requirements.txt (line 4))
Collecting pyhyphen==2.0.8 (from -r ./requirements.txt (line 5))
  Downloading PyHyphen-2.0.8.tar.gz (95kB)
    100% |████████████████████████████████| 102kB 1.3MB/s
    Complete output from command python setup.py egg_info:
    RefactoringTool: Refactored hyphen/__init__.py
    RefactoringTool: Refactored hyphen/config.py
    RefactoringTool: Refactored hyphen/dictools.py
    RefactoringTool: Files that were modified:
    RefactoringTool: hyphen/__init__.py
    RefactoringTool: hyphen/config.py
    RefactoringTool: hyphen/dictools.py
    running egg_info
    creating pip-egg-info/PyHyphen.egg-info
    writing pip-egg-info/PyHyphen.egg-info/PKG-INFO
    writing dependency_links to pip-egg-info/PyHyphen.egg-info/dependency_links.txt
    writing top-level names to pip-egg-info/PyHyphen.egg-info/top_level.txt
    writing manifest file 'pip-egg-info/PyHyphen.egg-info/SOURCES.txt'
    reading manifest file 'pip-egg-info/PyHyphen.egg-info/SOURCES.txt'
    writing manifest file 'pip-egg-info/PyHyphen.egg-info/SOURCES.txt'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/tmp/pip-build-tmi1azho/pyhyphen/setup.py", line 140, in <module>
        install('en_US')
      File "/usr/local/lib/python3.6/site-packages/hyphen/dictools.py", line 71, in install
        descr_file = urlopen(descr_url)
      File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 511, in open
        req = Request(fullurl, data)
      File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 329, in __init__
        self.full_url = url
      File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 355, in full_url
        self._parse()
      File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 384, in _parse
        raise ValueError("unknown url type: %r" % self.full_url)
    ValueError: unknown url type: '$repoen_US/dictionaries.xcu'
    Installing dictionary en_US

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/tmp/pip-build-tmi1azho/pyhyphen/

I then tried running the next command, python3 -c "exec(\"from hyphen import dictools\\ndictools.install('en_GB')\\ndictools.install('en_US')\")", and got this for output:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 2, in <module>
  File "/usr/local/lib/python3.6/site-packages/hyphen/dictools.py", line 71, in install
    descr_file = urlopen(descr_url)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 511, in open
    req = Request(fullurl, data)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 329, in __init__
    self.full_url = url
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 355, in full_url
    self._parse()
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 384, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: '$repoen_GB/dictionaries.xcu'

Any advice or ideas?

I noticed the script is trying to install PyHyphen 2.0.8. I could try installing the newest version (3.0.1) if you think that might help.

Extra blank spaces between verse lines on Kobo

I’ve noticed that standard verse markup in SE books has extra blank lines between each line on my Kobo (1st gen Aura H2O, latest software). This isn’t present in Calibre when reading the same files on a desktop. I’ve been using the kepub editions. I’ve seen this in the couple of books I’m working on, but also in existing SE books like Lyrical Ballads, and in the verse in the uncopyright file.

An example (and apologies for my crappy webcam quality): [photo of the Lyrical Ballads verse rendering on the Kobo]

I haven’t yet checked to see if this is kepub specific, but will do that next. Just wanted to record this issue first in case someone else has already noticed it and has a fix, or was thinking about filing the same issue.

`clean` chokes on &lt; and &gt;

I'm using the Standard Ebooks toolset to create a non-Standard-Ebooks ebook. The source text I'm converting uses a somewhat nonstandard convention of replacing foreign words with their English equivalents, between less than and greater than signs. The clean tool doesn't like this. I can convince it to convert the text once, if I enclose the translated passages in a <![CDATA[...]]> block. However, xmllint canonicalization takes it back out of the block, so that trick only works for one run of clean.

I believe the issue is on this line (62) of clean. html.unescape() unilaterally unescapes any HTML character reference, including the ones that need to remain references, like ampersands and less than/greater than signs. It seems like someone already tried to fix this halfway by replacing ampersands with the character reference, but something tells me that technique won't work for this case :P

I'm a fairly competent Python developer, would a PR to fix this be welcome?
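A sketch of the selective unescaping described above: unescape named and numeric character references except the XML-significant ones (&amp;, &lt;, &gt;), so markup-critical escapes survive a round trip. This is a hypothetical fix for illustration, not the toolset's actual code.

```python
# Sketch of escaping-preserving unescape (hypothetical fix, not se's code).
import html
import re

# XML-significant references that must survive unescaping.
KEEP = {"&amp;", "&lt;", "&gt;"}

def selective_unescape(text: str) -> str:
    def replace(match: re.Match) -> str:
        reference = match.group(0)
        return reference if reference in KEEP else html.unescape(reference)
    # Match named (&eacute;) and numeric (&#233;) character references.
    return re.sub(r"&#?\w+;", replace, text)

print(selective_unescape("A &lt;foreign&gt; caf&eacute;"))  # A &lt;foreign&gt; café
```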

pip3 install -r ./tools/requirements.txt fails

Installing the tools on a Google Cloud instance of Ubuntu 16.04.04.

The step "pip3 install -r ./tools/requirements.txt" fails. Digging through it, it fails on step 7. If I comment out step 7 it continues working and installs everything without errors.

I tried to skip this step but the tools wouldn't work properly without it.

Typogrify incorrectly handles left-double-quote followed by straight-quote

In "Memoirs of Sherlock Holmes" there are long passages where someone is quoting reported speech, like this (from the Gutenberg body.xhtml):

<p>“'Anything else?' he asked, smiling.</p>
<p>“'You have boxed a good deal in your youth.'</p>

The outer quotes are curly-quotes, left-double-quote, followed by a straight quote. Certainly an unusual way to do it, but...

Typogrify turns this into:

<p>“ ’Anything else?’ he asked, smiling.</p>
<p>“ ’You have boxed a good deal in your youth.’</p>

That's left-double-quote, hair-space, then right-single-quote.

There was at least one instance of this in "The Railway Children", which I'll now go back to check.

python 'clean' on Mac deletes target file

On Mac OSX 10.11, I pulled the latest version of tools, installed all dependencies, and ran 'clean' on the raw Gutenberg .html file. The target body.xhtml file is deleted. Here is the output running in verbose mode:

$ clean -v .
Processing /Users/michael/Documents/StandardEbooks/thomas-love-peacock_nightmare-abbey ...-:1: parser error : Document is empty

^
-:1: parser error : Start tag expected, '<' not found

^
-:1: parser error : Document is empty

^
-:1: parser error : Start tag expected, '<' not found

^
-:16: parser error : Unescaped '<' not allowed in attributes values
<meta name="generator" content="Ebookmaker 0.4.0a5 by Marcello Perathoner <webma
                                                                          ^
-:16: parser error : attributes construct error
<meta name="generator" content="Ebookmaker 0.4.0a5 by Marcello Perathoner <webma
                                                                          ^
-:16: parser error : Couldn't find end of Start Tag meta line 16
<meta name="generator" content="Ebookmaker 0.4.0a5 by Marcello Perathoner <webma
                                                                          ^
-:16: parser error : error parsing attribute name
a name="generator" content="Ebookmaker 0.4.0a5 by Marcello Perathoner <webmaster
                                                                               ^
-:16: parser error : attributes construct error
a name="generator" content="Ebookmaker 0.4.0a5 by Marcello Perathoner <webmaster
                                                                               ^
-:16: parser error : Couldn't find end of Start Tag webmaster line 16
a name="generator" content="Ebookmaker 0.4.0a5 by Marcello Perathoner <webmaster
                                                                               ^
-:1: parser error : Document is empty

^
-:1: parser error : Start tag expected, '<' not found

^
-:1: parser error : Document is empty

^
-:1: parser error : Start tag expected, '<' not found

^
 OK

Make simplify-tags errors more informative

When simplify-tags dies due to an XML error, it dumps a stack trace, but there is no indication of which file it is processing, or of the string that caused the problem. If one of them were present, we could live without the other, but the absence of both makes it very difficult to track down.

I made some local changes to simplify-tags to do a couple of things. First, when -v is passed, it prints out each filename as it's processing it. Thus, when errors happen, it's clear which file caused them.
Second, I added a try/except to each "tree = etree.fromstring…", and in the except I print the current filename and then just re-raise the exception. That way, even if -v isn't passed, you still get the filename that contains the error. I also differentiated the error messages between the three calls to tree, so we could tell the difference. (I don't know if we need to tell, but I prefer not to have the same error generated in multiple places).

I'm attaching the output of git diff because I wasn't sure how you preferred to get changes, i.e. with a format-patch, or with a pull request for the specific tool, or what. If you would prefer one of those, just let me know and I'll take care of it.
simplify-tags.txt

build script chokes on macOS

A combination of long options for cp / rm and a reliance on -e / -q flags for xpath. The long options are easily replaced with their short forms (e.g. --recursive with -r), and xpath seems to support a fixed argument order (filename then expression, rather than using -e). I’m gradually fixing these up locally, but I’m not sure if macOS is even a desired platform for building.

Latest prepare-release doesn't write out word count

The latest prepare-release contains a bug that is causing it to not write out the updated word count. The cause is using "xhtml" where "processed_xhtml" should be used.

At line 52, processed_xhtml is set to xhtml. I'm guessing that from that point forward, everything should be using processed_xhtml instead of xhtml.
At line 62, when updating the word count, xhtml is used in the regex instead of processed_xhtml. Technically this isn't a problem, since they're both still the same. However…
At line 63, when updating the reading_ease, xhtml is again used in the regex, which causes the word count that was updated in 62 to be lost.
At line 73, the revision is determined by searching xhtml. This is again ambiguous; the script hasn't touched the revision, so searching xhtml is probably OK, but processed_xhtml might have been intended.

So, at least line 63 needs to be changed to use processed_xhtml, but my guess is the intent is for lines 62 and 73 to use it as well.

I apologize, I would have submitted a pull request, but I'm still reading up on how best to do that. My git-fu is still very weak.

Help needed: `clean` tool runs into a 'utf-8' unicode error.

Ran into this issue when trying to use the clean tool, while following along with the step-by-step guide.

Processing /Users/bryan/git/StandardEBooks/HoundOfBaskervilles/arthur-conan-doyle_the-hound-of-the-baskervilles ...Traceback (most recent call last):
  File "/Users/bryan/git/StandardEBooks/tools/clean", line 92, in <module>
    main()
  File "/Users/bryan/git/StandardEBooks/tools/clean", line 70, in main
    error = result.stderr.decode().strip()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 77: invalid start byte

Is this an issue with the particular ePub file I'm working with, an issue with clean itself, an issue with my machine's setup, or something else?

clean removes nbsp around ampersand in name

The typography manual says ampersands in names should be surrounded by no-break spaces. When I ran clean on text containing "Cambell & Hambell" (with no-break spaces around the ampersand), it replaced the no-break spaces with regular spaces.

Lint for mismatches between ToC entries and content headings

Alex suggested:

I'd suggest using Beautiful Soup to pull the plain text from each <h#> tag (after removing endnotes, which can occur in <h#>), then after processing all the xhtml files compare those against each <li> item in toc.xhtml. You can test against our corpus which should (hopefully) all pass.

I’ll try to find time to have a look at this in the next week, but filing to get it on record anyway.
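A rough sketch of what that comparison might look like (hypothetical function names, using Beautiful Soup as suggested; real headings and ToC entries will need more normalization than this):

```python
from bs4 import BeautifulSoup

def heading_texts(xhtml):
    """Plain text of each <h1>-<h6>, with noteref anchors removed first."""
    soup = BeautifulSoup(xhtml, "html.parser")
    texts = []
    for tag in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
        for noteref in tag.find_all("a", attrs={"epub:type": "noteref"}):
            noteref.decompose()  # endnote references can occur inside <h#>
        texts.append(tag.get_text().strip())
    return texts

def toc_texts(toc_xhtml):
    """Plain text of each <li> entry in toc.xhtml."""
    soup = BeautifulSoup(toc_xhtml, "html.parser")
    return [li.get_text().strip() for li in soup.find_all("li")]
```

After processing every content file, the lint check would flag any heading with no matching ToC entry (and vice versa), tested against the existing corpus.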

page-break-after: avoid is broken in iBooks

tools/build

Line 718 in 4c2ff31

processed_css = regex.sub(r"(page\-break\-(before|after|inside)\s*:\s*(.+))", "\\1\n\t-webkit-column-break-\\2: \\3 /* For Readium */", processed_css)

build adds a -webkit-column-break-after: avoid after the specified page-break-after: avoid. Removing this for iBooks fixes the lack of breaking (indicating that it’s overriding the page-break functionality but not actually applying, as it’s a page, not a column). Moving the -webkit-column-break-after declaration to before the page-break-after declaration fixes the issue, but breaks Readium. It looks like Readium hasn’t got support for page-break-* yet: readium/readium-shared-js#127

I’m not sure what the fix is here. Potentially we could use an iBooks filter, like we do for theming, to add in a specific fix, but it feels like this is more a Readium problem?

[create-draft] Illegal option for sed.

Running the actual example from the “Producing an ebook, step by step” guide just gives a whole bunch of sed errors on my Mac with OS X 10.11.1.

$ tools/create-draft --author="Robert Louis Stevenson" --title="The Strange Case of Dr. Jekyll and Mr. Hyde"
sed: illegal option -- -
usage: sed script [-Ealn] [-i extension] [file ...]
       sed [-Ealn] [-i extension] [-e script] ... [-f script_file] ... [file ...]
[The same two-line “illegal option” error repeats for each of the remaining sed invocations.]
./// already exists.  Overwrite? 

Cannot install tools from clean clone

First, I want to say thank you to the people who work on this project. I sideloaded a bunch of books onto my Kindle and I've been reading them, and they are fantastic! Now on to the issue...

This repository cannot be installed without manually installing dependencies, because setup.py needs to import se to do its work, and se.__init__ imports lots of dependencies. The only thing the setup script needs from se is the version. I'm a big fan of specifying data in only one place, so it's important to keep that. setuptools has a whole page devoted to multiple techniques for doing this.

I can submit a PR with a fix. My preferred approach would be to do the same thing as done by the trio library: https://github.com/python-trio/trio/blob/master/setup.py#L3. I adopted this approach for my projects too.

Basically, make a new file, se/_version.py, set a single variable, VERSION = "1.0.6", have setup.py exec that file, and import it in se/__init__.py with a line from ._version import VERSION. Nothing will change from a user perspective. The only thing that will change from a developer perspective is the location of the version information. Importantly, it's still specified in one place.

Steps for reproducing:

  1. Create a new virtual environment.
  2. Clone this repository.
  3. Attempt to install it with pip install -e . from inside the repo folder.
  4. Observe ImportError exceptions.

Titlecase linting should be aware of embedded titles

Linting Keats’s Poetry throws this warning:

Title "On Leigh Hunt’s Poem, The Story of Rimini" not correctly titlecased. Expected: On Leigh Hunt’s Poem, the Story of Rimini

‘The Story of Rimini’, though, is italicised as a work, so its initial letter should be capitalised. The linter should understand that works, whether italicised or quoted, are allowed to keep their first letter capitalised.

Potentially include a blank cover.jpg with create-draft templates

At the moment it’s not possible to build a book without having images/cover.jpg, but often I want to read (or reread) the book first to get better ideas of the sort of art I should look for. At that point I usually just generate a 1400x2100 white jpg and drop it in, but maybe this is something that should be in there from the start? Maybe watermark it with “SAMPLE COVER REPLACE ME” or equivalent.

.epub3 file extension

Great project, thank you! :)

I was just testing the Readium "cloud reader" app with the exploded / unzipped EPUBs available in your GitHub repositories (e.g. https://readium.firebaseapp.com/?epub=https%3A%2F%2Fcdn.rawgit.com%2Fstandardebooks%2Fmark-twain_a-connecticut-yankee-in-king-arthurs-court%2Fmaster%2Fsrc )

...and I later discovered your OPDS feed:
https://readium.firebaseapp.com/?epubs=https%3A%2F%2Fcrossorigin.me%2Fhttps%3A%2F%2Fstandardebooks.org%2Fopds%2Fall

First remark: as you can see, I am using http://crossorigin.me to proxy the requests, because of missing HTTP CORS headers (Access-Control-Allow-Origin etc.). Ideally, your entire collection would be served with CORS headers in HTTP responses, so that client-side (web browser) apps are allowed to fetch and process material from your library. A CORS proxy works for now, but it introduces an unnecessary level of indirection (added network latency, etc.).

Secondly, the ".epub3" filename extension is not standard practice, I believe. Strictly-speaking the specification only recommends the use of ".epub" (no strong conformance requirement), but I doubt all reading systems recognize ".epub3" as a valid extension.
http://www.idpf.org/epub/301/spec/epub-ocf.html

I am patching ReadiumJS now to support this slightly unorthodox convention.

Thoughts?

Lint should understand illustration alt text

Given a particular illustration, we should be able to lint the alt text to check for discrepancies between:

  1. The <title> of the SVG
  2. The text within the LoI entry
  3. The alt attribute of the <img>

I’ll pick this up this week.

EPUB 3 files not readable in Kobo Aura H2O

I've tried uploading about 10 EPUB 3 files from the site to my Kobo Aura H2O (with the latest update, version 4.2.8432, from 2017-01-24), including g-k-chesterton_the-napoleon-of-notting-hill.epub3. None of them show up in the device's library. I tried uploading the EPUB files for the same books, and they all show up. Kobo support insist that EPUB 3 is supported by this device. Are you aware of something quirky/non-standard in your files?

Parallelise more of SE file processing code

Using concurrent.futures helped drop reorder-endnotes’s run time considerably. This could be rolled out to most places in the SE codebase that process a bunch of distinct files at the same time: certainly typogrify, clean, british2american, etc., and possibly lint too (assuming I can marshal the results into a suitable order).

So I guess before I put any time into it:

  • Did anyone test concurrent.futures Linux / Windows to check that it works there?
  • Is this wanted work? Obviously with reorder-endnotes I’m running it over and over again at the moment so any increase in speed is personally beneficial.
  • Does anyone else want to work on this instead of me? I don’t want to hog the fun stuff :)
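In case it helps the discussion, the shape of the change is small. A sketch (hypothetical helper, not SE code) using concurrent.futures; executor.map answers the lint ordering concern, since it yields results in input order regardless of completion order:

```python
import concurrent.futures

def process_files(paths, process_one, max_workers=None):
    # Each file is handled in its own worker process; map() returns results
    # in the same order as `paths`, so output like lint's stays deterministic.
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_one, paths))
```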

british2american can't be run multiple times

I have a project whose quotations all happen to be in double quotes (the correct style for Standard Ebooks). Running the british2american script on this project converts the (correct) double quotes to (incorrect) single quotes. If I run it again, they switch back.

The standardebooks manual (https://standardebooks.org/contribute/producing-an-ebook-step-by-step), says:

"If your work is already in American style, you can skip this step."

I suggest changing 'can' to 'must' and adding a warning about the tool's current behaviour, or fixing the tool so that it can safely be run multiple times.

Licence should clarify /templates status

This repo is licensed under GPLv3. When create-draft creates a new book it copies the contents of /templates over, at which point they somehow become CC0. We should probably clarify in the root LICENCE.md that the repository is licensed under GPLv3 with the exception of /templates, which is licensed under CC0.

Covers not showing in e-reader (Sony PRS-T1)

Don't know if this is the right place to report it, but the issue happens with all epubs from Standard Ebooks. The covers are not shown on my e-reader, which is an older model (Sony PRS-T1).

I've tried resizing the cover image (of this ebook) to 600x800 px (the device resolution) to no avail.

All other ebooks that I have do show their cover images, just not the ones from standardebooks.org. Any ideas? I can test them on my device.

print-toc crashes

Behavior of clean, lint, and semanticize with XML comments

A few scripts, like clean and semanticize, appear to change text inside XML comments; lint appears to check some content in comments as if it weren’t in a comment.

A case where this can be problematic is when using comments to keep track of things while preparing an ebook. For example, if you have a comment like—

<!--
TODO: Review this later:
⟨A URL with something like “value=X&” in it⟩
-->

—then semanticize encloses X in a <span epub:type="z3998:roman"> element, and clean replaces the ampersand with &amp;; either way, the URL will no longer work.

On the other hand, lint reports issues with Google Books URLs in comments, for example. (XML comments in epub files are probably worth warning about, but non-canonical Google Books URLs in comments are likely innocuous.)

Build notifies about missing packages one at a time

I ran build on a fresh Ubuntu install. It informed me of a missing package so I installed the package. When I ran it again it informed me of another missing package so I installed the package. This went on for a few more iterations. If build could give a complete list of missing packages it would make the first run-through easier. (I'm talking about apt-get packages. I know it's not so easy with Python packages.)

Way to remove BOM from text

I had an issue on my first book where the text from Gutenberg had a BOM in it. That caused split-file to put the BOM in its own chapter, which of course threw off everything else.

The "best" thing might be to change split-file to ignore a BOM, but the easy thing is to provide an easy way to strip a BOM from a file and then just mention it in the docs.

Here is a python script to strip them (from https://www.ueber.net/who/mjl/projects/bomstrip/). (I had to add the .txt to get the website to take the attachment.)
bomstrip.py.txt

Can we remove the leading 0 id check from lint?

I was trying to fix it, as it’s failing on ids that look like entry-1662-01-01 (from Pepys), but actually I’m not sure why the restriction on zeros is there in the first place. It was added in a606a51, but there’s no info on that particular check in the commit message.

Github repo name expected by lint can be impossible

Lint tries to build an appropriate GitHub URL for the project from the author, title and any translators. The Ivan Bunin collection I’m currently producing has 5 different translators, so the repo name it expects is ivan-bunin_short-fiction_s-s-koteliansky_d-h-lawrence_leonard-woolf_bernard-guilbert-guerney_the-russian-review. Unfortunately, GitHub repo names are limited to 100 characters, and this is 111. So, two possible fixes:

  1. I change Bernard Guilbert Guerney to B. G. Guerney in content.opf and add a separate full-name field. That’d fix this particular repo, as ivan-bunin_short-fiction_s-s-koteliansky_d-h-lawrence_leonard-woolf_b-g-guerney_the-russian-review is 99 characters.
  2. We slice the generated GitHub repo name to 100 characters and lint against that instead. That’d fix future problems, and I find it hard to see how we’d end up with an id collision.
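Option 2 would be a one-line change wherever lint builds the expected name. A sketch (hypothetical names; the 100-character cap is GitHub's documented repo-name limit):

```python
GITHUB_REPO_NAME_MAX = 100  # GitHub's repository-name length limit

def expected_repo_name(identifier):
    # `identifier` is the full author_title_translators string lint already
    # builds; truncate it the same way a real GitHub repo name would have to be.
    return identifier[:GITHUB_REPO_NAME_MAX]
```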

Endnote referrer arrows render as emoji in iOS iBooks

As mentioned on the mailing list:

I noticed that endnotes currently render the referrer arrow "↩︎" as the emoji "↩️" in iBooks on iOS devices, and recalled John Gruber mentioning a way to avoid this in a post on Daring Fireball. He linked to the following article by Matias Singers with a solution: use U+FE0E, a.k.a. VARIATION SELECTOR-15. Appending this to the symbol ensures it renders as text and not emoji.

I was thinking we could add this as a step in our build tool to ensure proper rendering, probably in the same section where we convert the endnotes for Kindles?

tools/build

Lines 779 to 845 in 99301ad

# Convert endnotes to Kindle popup compatible notes
if os.path.isfile(os.path.join(work_epub_root_directory, "epub", "text", "endnotes.xhtml")):
    with open(os.path.join(work_epub_root_directory, "epub", "text", "endnotes.xhtml"), "r+", encoding="utf-8") as file:
        xhtml = file.read()

        # We have to remove the default namespace declaration from our document, otherwise
        # xpath won't find anything at all. See http://stackoverflow.com/questions/297239/why-doesnt-xpath-work-when-processing-an-xhtml-document-with-lxml-in-python
        tree = etree.fromstring(str.encode(xhtml.replace(" xmlns=\"http://www.w3.org/1999/xhtml\"", "")))
        notes = tree.xpath("//li[@epub:type=\"rearnote\" or @epub:type=\"footnote\"]", namespaces=se.XHTML_NAMESPACES)

        processed_endnotes = ""

        for note in notes:
            note_id = note.get("id")
            note_number = note_id.replace("note-", "")

            # First, fixup the reference link for this endnote
            try:
                ref_link = etree.tostring(note.xpath("p[last()]/a[last()]")[0], encoding="unicode", pretty_print=True, with_tail=False).replace(" xmlns:epub=\"http://www.idpf.org/2007/ops\"", "").strip()
            except Exception:
                se.print_error("Can't find ref link for #{}".format(note_id))
                exit(1)

            new_ref_link = regex.sub(r">.*?</a>", ">" + note_number + "</a>.", ref_link)

            # Now remove the wrapping li node from the note
            note_text = regex.sub(r"^<li[^>]*?>(.*)</li>$", r"\1", etree.tostring(note, encoding="unicode", pretty_print=True, with_tail=False), flags=regex.IGNORECASE | regex.DOTALL)

            # Insert our new ref link
            result = regex.subn(r"^\s*<p([^>]*?)>", "<p\\1 id=\"" + note_id + "\">" + new_ref_link + " ", note_text)

            # Sometimes there is no leading <p> tag (for example, if the endnote starts with a blockquote).
            # If that's the case, just insert one in front.
            note_text = result[0]
            if result[1] == 0:
                note_text = "<p id=\"" + note_id + "\">" + new_ref_link + "</p>" + note_text

            # Now remove the old ref_link
            note_text = note_text.replace(ref_link, "")

            # Trim trailing spaces left over after removing the ref link
            note_text = regex.sub(r"\s+</p>", "</p>", note_text).strip()

            # Sometimes ref links are in their own p tag--remove that too
            note_text = regex.sub(r"<p>\s*</p>", "", note_text)

            processed_endnotes += note_text + "\n"

        # All done with endnotes, so drop them back in
        xhtml = regex.sub(r"<ol>.*</ol>", processed_endnotes, xhtml, flags=regex.IGNORECASE | regex.DOTALL)

        file.seek(0)
        file.write(xhtml)
        file.truncate()

    # While Kindle now supports soft hyphens, popup endnotes break words but don't insert the hyphen characters.
    # So for now, remove soft hyphens from the endnotes file.
    with open(os.path.join(work_epub_root_directory, "epub", "text", "endnotes.xhtml"), "r+", encoding="utf-8") as file:
        xhtml = file.read()
        processed_xhtml = xhtml
        processed_xhtml = processed_xhtml.replace(se.SHY_HYPHEN, "")

        if processed_xhtml != xhtml:
            file.seek(0)
            file.write(processed_xhtml)
            file.truncate()

I’d be happy to open a PR, but I’m not sure how to properly tackle this. I think we’ll need to look at the actual Unicode code points; a simple text-search find-and-replace can’t distinguish between the bare arrow and the arrow plus variation selector (at least not when I tried with Sublime Text’s find-and-replace tools 🤷‍♂️).
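One way to sidestep the find-and-replace problem is a regex with a negative lookahead, so arrows that already carry the selector are left alone. A hedged sketch of what the build step might look like (using the stdlib re module here; build itself uses the third-party regex module, which accepts the same pattern):

```python
import re

ARROW = "\u21a9"   # ↩ LEFTWARDS ARROW WITH HOOK
VS15 = "\ufe0e"    # VARIATION SELECTOR-15: force text presentation

def force_text_presentation(xhtml):
    # Append VS15 to every referrer arrow that doesn't already have one,
    # so iOS iBooks renders it as text instead of the emoji form.
    return re.sub(ARROW + r"(?!\ufe0e)", ARROW + VS15, xhtml)
```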

Semanticate placed roman-numeral span within image tag

A minor issue, but I don't think semanticate ought to look at what is inside HTML tags. Admittedly this is an unusual case in that I've named an image (used as a character symbol) with a letter. But this is what semanticate did to the image tag:

<img alt="Dancing man with arms in the air." class="character-symbol" id="v-letter" src="../images/<span epub:type="z3998:roman">v</span>-letter.svg" epub:type="z3998:illustration se:image.color-depth.black-on-transparent"/>

Obviously, this is why we do a git diff and manual check after semanticate, but it would be nice if semanticate didn't do this in the first place.

lint throws wrong error when colon is in title

In T. S. Eliot's Poetry, lint says that the title "Burbank with a Baedeker: Bleistein with a Cigar" is not found, even though it is in the ToC. Printing out the toc_headings variable at se_epub.py line 1125 shows that the heading is saved without a colon (probably due to line 1122?). Thus the match fails, because it's comparing the title with a colon to the title without one. Can you take a look, @robinwhittleton?
