Zero-EPWING

Zero-EPWING is a tool built to export easy-to-process, JSON-formatted UTF-8 data from dictionaries in EPWING format. EPWING is a terrible format for many reasons, some of which are outlined below:

  • It is based on a closed and undocumented standard.
  • It is not well supported, as it isn't used anywhere else in the world.
  • The only library for parsing this format, libeb, is abandoned.
  • Data is stored in an inconsistent manner, with lots of duplication.
  • Text data is represented using the annoying EUC-JP encoding.
  • Characters which cannot be encoded are represented by image bitmaps.

Most applications that parse EPWING data traditionally use libeb to perform dictionary searches in place; dealing with quirks in the format and in libeb's output is just part of the process. Zero-EPWING takes a different approach: extract all the data and output it in a sane intermediate format, like JSON. As everyone knows how to parse JSON, it is trivial to take this intermediate data and store it in a reasonable, industry-standard representation.

Installation

Pre-built binaries are available for Linux, Mac OS X, and Windows:

Building

Prepare your development environment by making sure the following tools are set up:

Once your system is configured, follow the steps below to create builds:

  1. Clone the repository by executing
    git clone --recurse-submodules https://github.com/FooSoft/zero-epwing
    
  2. Prepare the project. From the project root directory, execute
    cmake . -Bbuild && cmake --build build --
    
  3. Find the executable in the build directory.

Usage

Zero-EPWING takes a single parameter, the directory of the EPWING dictionary to dump. It also supports the following optional flags:

  • --entries (-e): output dictionary entry data (most common option).
  • --fonts (-f): output font bitmap data (useful for OCR).
  • --markup (-m): mark up the output with as much metadata as possible.
  • --positions (-s): output page and offset data for each entry.
  • --pretty (-p): output pretty-printed JSON (useful for debugging).
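
For scripted use, the flags above can be assembled programmatically. The following is a minimal Python sketch, not part of Zero-EPWING itself: the executable name zero-epwing, the placement of flags before the directory, and the path /path/to/dictionary are all assumptions made for illustration.

```python
import shlex


def build_command(dictionary_dir, *, entries=True, fonts=False, markup=False,
                  positions=False, pretty=False, executable="zero-epwing"):
    """Return the argument list for one Zero-EPWING invocation.

    The flag names come from the option list above. The executable name
    and the flags-before-directory ordering are assumptions.
    """
    flags = {
        "--entries": entries,
        "--fonts": fonts,
        "--markup": markup,
        "--positions": positions,
        "--pretty": pretty,
    }
    argv = [executable]
    # Keep only the flags that were switched on, in the order listed above.
    argv += [flag for flag, enabled in flags.items() if enabled]
    argv.append(dictionary_dir)
    return argv


# Hypothetical dictionary path, shown as a shell-quoted command line.
command = build_command("/path/to/dictionary", pretty=True)
print(shlex.join(command))
```

Returning a list rather than a single string keeps directory names containing spaces intact when the command is later passed to a process launcher.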

Upon loading and processing the requested EPWING data, Zero-EPWING will output a UTF-8 encoded JSON file to stdout. Diagnostic information about errors will be printed to stderr. Serious errors will result in this application returning a non-zero exit code. A sample of the JSON dictionary entry data output is pretty-printed below for reference.

{
    "charCode": "jisx0208",
    "discCode": "epwing",
    "subbooks": [
        {
            "title": "大辞泉",
            "copyright": "CD-ROM版大辞泉 1997年4月10日 第1版発行\n\n監 修 松村 明\n発行者 鈴木俊彦\n発行所...",
            "entries": [
                {
                    "heading": "",
                    "text": "\n[音]ア\n[訓]つ‐ぐ\n[部首]二\n[総画数]7\n[コード]区点..."
                },
                {
                    "heading": "",
                    "text": "\n{{w_50275}}\n{{w_50035}}五十音図ア行の第一音。五母音の一。後舌の開母音..."
                }
            ]
        }
    ]
}
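
As a rough illustration of consuming this output, the Python sketch below runs a command line, treats a non-zero exit code as failure, and walks the subbooks and entries of the parsed JSON. The function names dump_dictionary and iter_entries are hypothetical helpers, and the sketch assumes only the keys visible in the sample above (subbooks, title, entries, heading, text).

```python
import json
import subprocess


def dump_dictionary(argv):
    """Run a Zero-EPWING command line (a list of arguments) and parse its output.

    JSON is read from stdout and diagnostics from stderr; a non-zero
    exit code is reported as an error.
    """
    result = subprocess.run(argv, capture_output=True)
    if result.returncode != 0:
        diagnostics = result.stderr.decode("utf-8", errors="replace")
        raise RuntimeError(
            f"zero-epwing exited with code {result.returncode}: {diagnostics}"
        )
    return json.loads(result.stdout.decode("utf-8"))


def iter_entries(dump):
    """Yield (subbook title, heading, text) for every entry in a dump."""
    for subbook in dump.get("subbooks", []):
        title = subbook.get("title", "")
        for entry in subbook.get("entries", []):
            yield title, entry.get("heading", ""), entry.get("text", "")


# A small stand-in shaped like the sample above, so the loop can be tried
# without a dictionary on disk.
sample = {
    "charCode": "jisx0208",
    "discCode": "epwing",
    "subbooks": [
        {
            "title": "大辞泉",
            "entries": [{"heading": "", "text": "\n[音]ア\n[訓]つ‐ぐ"}],
        }
    ],
}

for title, heading, text in iter_entries(sample):
    print(title, repr(heading), repr(text))
```

With a real dictionary, dump_dictionary would be called with an argument list such as the executable name, the desired flags, and the dictionary directory; the result can be passed to iter_entries in the same way as the stand-in.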

You may have noticed the unusual-looking double curly bracket markers such as {{w_50035}}. Remember what I mentioned about certain characters being represented by image bitmaps? There are two graphical font sets in each dictionary: narrow and wide. Both font sets are available in four sizes: 24, 30, 36, and 48 pixels. Whenever a character cannot be encoded as text, a glyph is used in its place. These glyph indices cannot be converted directly to characters, differ from one dictionary to another, and must be manually mapped to Unicode characters. Zero-EPWING has no facility to map these font glyphs to Unicode by itself; instead, it places inline markers of the form {{w_xxxx}} and {{n_xxxx}} in the output, specifying the referenced indices of the wide or narrow fonts respectively.

The bitmaps for these font glyphs can be dumped by executing this application with the --fonts command line argument.
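
One possible way to post-process these markers is sketched below in Python. It assumes the indices are decimal digits, as in the sample output, and it relies on a hand-built table for a specific dictionary; the single table entry shown is a placeholder and not a real mapping. The helper names find_glyphs and replace_glyphs are hypothetical.

```python
import re

# Matches {{w_NNNN}} (wide font) and {{n_NNNN}} (narrow font).
# Decimal indices are an assumption based on the sample output.
GLYPH_MARKER = re.compile(r"\{\{([wn])_(\d+)\}\}")


def find_glyphs(text):
    """Return the set of (font, index) pairs referenced in text."""
    return {(m.group(1), int(m.group(2))) for m in GLYPH_MARKER.finditer(text)}


def replace_glyphs(text, table):
    """Substitute markers that appear in table; leave unknown markers as-is."""
    def substitute(match):
        key = (match.group(1), int(match.group(2)))
        return table.get(key, match.group(0))
    return GLYPH_MARKER.sub(substitute, text)


text = "\n{{w_50275}}\n{{w_50035}}五十音図ア行の第一音。"

# Placeholder entry only: a real table must be built by hand for each
# dictionary, for example by inspecting the dumped font bitmaps.
table = {("w", 50035): "〓"}

print(sorted(find_glyphs(text)))
print(replace_glyphs(text, table))
```

Collecting the referenced indices first gives a list of the glyphs that actually need mapping, which is usually far smaller than the full font set. Leaving unmapped markers in place keeps the gaps visible.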
