Interlinearizer

An interlinear book is an annotated version of a text, where each word of the original text is accompanied by a translation just underneath. interlinearize is a tool that converts EPUB files (or any other format supported by Calibre) into their interlinearized versions.

For example, see below the interlinearized version of the first passage of Voltaire's Candide:

Each word is translated individually and without context. This means that in most cases the translations will read awkwardly, and sometimes imprecisely as well. For the intended usage this is not much of an issue, as a competent reader of the source language only looks up the occasional low-frequency word, and should in most cases be able to contextualize it within the larger sentence.
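To make the word-by-word idea concrete, here is a minimal sketch. It is not the program's actual code: the helper interlinear_pairs and the toy dictionary toy_dict are hypothetical, and the real tool obtains its translations from Google Translate rather than from a hand-written table.

```python
import re

def interlinear_pairs(text, dictionary):
    """Pair each word of `text` with its context-free translation.

    Words missing from the dictionary are paired with an empty string.
    """
    words = re.findall(r"\w+", text)
    return [(w, dictionary.get(w.lower(), "")) for w in words]

# Toy dictionary, for illustration only.
toy_dict = {"il": "he", "y": "there", "avait": "had", "en": "in"}

pairs = interlinear_pairs("Il y avait en", toy_dict)
for word, gloss in pairs:
    # Print the original word with its gloss alongside.
    print(f"{word:10}{gloss}")
```

Because every word is looked up on its own, an idiom such as "il y avait" ("there was") comes out as "he there had", which is exactly the kind of awkwardness described above.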

The program uses Calibre's ebook-convert to convert between formats and googletrans to translate words. In theory it should work with any book format supported by Calibre and any language pair supported by Google Translate, but so far it has only been tested on books in the .epub format, with translations from French to English, and only on Linux.
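The conversion step can be sketched as a call to Calibre's ebook-convert command, which takes an input file and an output file and infers the formats from their extensions. The helper names build_convert_command and convert_book are hypothetical, and the actual script may pass additional options.

```python
import subprocess

def build_convert_command(input_path, output_path):
    """Return the argument list for Calibre's ebook-convert CLI."""
    return ["ebook-convert", str(input_path), str(output_path)]

def convert_book(input_path, output_path):
    """Run the conversion.

    Raises FileNotFoundError if ebook-convert is not on PATH, and
    subprocess.CalledProcessError if the conversion itself fails.
    """
    subprocess.run(build_convert_command(input_path, output_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, such as "Candide - Voltaire.epub".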

The repository comes with two examples, Candide by Voltaire and Madame Bovary by Gustave Flaubert, along with their interlinear versions. You can find them inside the compressed folder interlinearized_books.zip.

Installation

interlinearize does not need to be installed and can be run straight out of the repository. However, a few dependencies need to be satisfied: the interlinearizer depends on googletrans, BeautifulSoup and nltk. To install these, cd into the repository and run the command

pip install -r requirements.txt

You will also have to install Calibre, as interlinearize makes use of its ebook-convert. Furthermore, you'll need to have Python 3 installed.

If you do want to "install" interlinearize (i.e. make it accessible from anywhere within the terminal), you can copy interlinearize.py to /usr/local/bin:

chmod +x interlinearize.py
cp interlinearize.py /usr/local/bin/interlinearize

or to any other folder that is in your PATH environment variable. If you do this, interlinearize will automatically generate configuration files in ~/.interlinearize/ the first time it is run.

Usage

Use the interlinearizer in the following way:

python interlinearize.py src dest book.format1 output.format2

where src and dest are the source and destination languages respectively, book.format1 is the input book in its original format, and output.format2 is the output file in the desired format.

For example, to translate the copy of Candide in the repository, execute the following command:

python interlinearize.py fr en "Candide - Voltaire.epub" "Candide - Voltaire (interlinearized).epub"

If you want an HTML version of the interlinearized book, omit the file extension for the output:

python interlinearize.py fr en "Candide - Voltaire.epub" "Candide - Voltaire (interlinearized)"

Whenever you use interlinearize, the dictionaries it assembles are stored as text files in the dict folder (for more details see the Settings section below), so the more books you use it on, the less time it will take to translate.
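The time saving comes from only requesting translations for words that are not yet in the stored dictionary. A minimal sketch of that idea follows; the function translate_with_cache and the stand-in translator fake_translate are hypothetical, and the real program's data structures and file format may differ.

```python
def translate_with_cache(words, cache, translate_batch):
    """Translate `words`, calling `translate_batch` only for uncached ones.

    `cache` maps source words to translations and is updated in place.
    `translate_batch` takes a list of words and returns a list of
    translations in the same order.
    """
    new_words = sorted({w for w in words if w not in cache})
    if new_words:
        cache.update(zip(new_words, translate_batch(new_words)))
    return [cache[w] for w in words]

# Stand-in translator that records what it was asked to translate.
calls = []

def fake_translate(batch):
    calls.append(list(batch))
    return [w.upper() for w in batch]

cache = {"chat": "cat"}
result = translate_with_cache(["chat", "noir", "chat"], cache, fake_translate)
```

Here only "noir" is sent to the translator; "chat" is served from the existing dictionary, and a second run over the same words would make no requests at all.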

See this for a list of language codes.

Other commands

Open the config file:

python interlinearize.py -c config

Open the style file:

python interlinearize.py -c css

Open a dictionary:

python interlinearize.py -c dict src dest

Reset config, css, or all to default:

python interlinearize.py -c clear config|css|all

Reset a dictionary:

python interlinearize.py -c cleardict src dest

Settings

You can change the formatting of the interlinearized text by editing the interlinear.css style file. Other settings can be found in the interlinearize.config file.

You can also edit the dictionaries directly in the dict folder. Each dictionary is named in the format src_dest.txt, so a French to English dictionary would be in dict/fr_en.txt for example. See this for a list of language codes.

interlinearize always looks for the configuration and dictionary files within the execution folder first. If it cannot find the files there, it will then look in ~/.interlinearize. If that folder does not exist, it will then create it and place default config files there.
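The lookup order described above can be sketched as follows. This is an illustration under assumptions, not the script's own code: the function name find_settings_file and its parameters are hypothetical, and the default file content is a placeholder.

```python
from pathlib import Path

def find_settings_file(name, exec_dir, home_dir=None, default_text=""):
    """Locate a settings file, preferring the execution folder.

    Falls back to <home>/.interlinearize/<name>, creating the folder and
    a default file there if they do not exist yet.
    """
    local = Path(exec_dir) / name
    if local.is_file():
        return local
    fallback_dir = Path(home_dir or Path.home()) / ".interlinearize"
    fallback = fallback_dir / name
    if not fallback.is_file():
        fallback_dir.mkdir(parents=True, exist_ok=True)
        fallback.write_text(default_text, encoding="utf-8")
    return fallback
```

A copy of interlinearize.config kept next to a particular book would therefore take precedence over the one in the home directory.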

interlinearize's Issues

Script not working with included test epub.

It appears there is a problem with googletrans. Tested on Fedora. The traceback is below.

Converting book to HTMLZ
Finding translations of new words
Exception in thread Thread-18 (translate_words):
Traceback (most recent call last):
  File "/usr/lib64/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.10/threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amichaygiuili/Downloads/interlinearize/interlinearize.py", line 320, in translate_words
    ts = translator.translate(words_to_translate, src=src, dest=dest)
  File "/home/amichaygiuili/.local/lib/python3.10/site-packages/googletrans/client.py", line 127, in translate
    translated = self.translate(item, dest=dest, src=src)
  File "/home/amichaygiuili/.local/lib/python3.10/site-packages/googletrans/client.py", line 132, in translate
    data = self._translate(text, dest, src)
  File "/home/amichaygiuili/.local/lib/python3.10/site-packages/googletrans/client.py", line 57, in _translate
    token = self.token_acquirer.do(text)
  File "/home/amichaygiuili/.local/lib/python3.10/site-packages/googletrans/gtoken.py", line 180, in do
    self._update()
  File "/home/amichaygiuili/.local/lib/python3.10/site-packages/googletrans/gtoken.py", line 59, in _update
    code = unicode(self.RE_TKK.search(r.text).group(1)).replace('var ', '')
AttributeError: 'NoneType' object has no attribute 'group'
^CTraceback (most recent call last):
  File "/home/amichaygiuili/Downloads/interlinearize/interlinearize.py", line 568, in <module>
    construct_word_list_from_text(word_list, word_dict, src_lan, dest_lan, service_urls, words_per_request)
  File "/home/amichaygiuili/Downloads/interlinearize/interlinearize.py", line 343, in construct_word_list_from_text
    t_dict = que.get()
  File "/usr/lib64/python3.10/queue.py", line 171, in get
    self.not_empty.wait()
  File "/usr/lib64/python3.10/threading.py", line 320, in wait
    waiter.acquire()
KeyboardInterrupt

UnicodeDecodeError

Hello, I would really like to use this nice program, but I get this error with all EPUBs, including the EPUB that you used as an example:

PS C:\Users\chlad\Desktop\interlinearize-main\interlinearize-main> python interlinearize.py fr en "Candide - Voltaire.epub" "Candide - Voltaire (interlinearized).epub"
Traceback (most recent call last):
File "C:\Users\chlad\Desktop\interlinearize-main\interlinearize-main\interlinearize.py", line 533, in
config = get_config()
^^^^^^^^^^^^
File "C:\Users\chlad\Desktop\interlinearize-main\interlinearize-main\interlinearize.py", line 124, in get_config
config.read(str(config_path))
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\configparser.py", line 713, in read
self._read(fp, filename)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\configparser.py", line 1036, in _read
for lineno, line in enumerate(fp, start=1):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 209: character maps to

What to do?
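One plausible cause, offered as an assumption rather than a confirmed diagnosis: the traceback shows configparser decoding the config file with cp1252, the Windows default, and failing on byte 0x9d. That is consistent with a UTF-8 encoded config file containing a non-ASCII character such as a curly quote. If so, passing an explicit encoding to ConfigParser.read should avoid the error. The sketch below uses a hypothetical helper read_config_utf8 and a made-up demo section to show the idea.

```python
import configparser
import tempfile
from pathlib import Path

def read_config_utf8(config_path):
    """Read a config file as UTF-8 regardless of the platform default."""
    config = configparser.ConfigParser()
    config.read(str(config_path), encoding="utf-8")
    return config

# Demonstration with a throwaway file containing a curly quote (U+201D),
# whose UTF-8 encoding ends in the byte 0x9d.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.config"
    path.write_text("[demo]\nquote = \u201d\n", encoding="utf-8")
    demo = read_config_utf8(path)
    value = demo["demo"]["quote"]
```

In the script itself, the corresponding change would be to the config.read call in get_config shown in the traceback.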
