burntsushi / ucd-generate Goto Github PK

A command line tool to generate Unicode tables as source code.

License: Apache License 2.0

Rust 99.91% Shell 0.09%

ucd-generate's Introduction

ucd-generate

A command line tool to generate Unicode tables in Rust source code. Tables can typically be generated in one of three formats: a sorted sequence of character ranges, a finite state transducer or a compressed trie. Full support for name canonicalization is also provided.

Installation

Since this is mostly intended as a developer tool for use while writing Rust programs, the principle method of installation is from crates.io:

$ cargo install ucd-generate
ucd-generate --help

Example

This somewhat arbitrary example shows the output of generating tables for three properties, and representing them as normal Rust character literal ranges.

To run the example, you need to download the Unicode Character Database (UCD):

$ mkdir /tmp/ucd-15.0.0
$ cd /tmp/ucd-15.0.0
$ curl -LO https://www.unicode.org/Public/zipped/15.0.0/UCD.zip
$ unzip UCD.zip

Note that prior to version 13.0.0, emoji/emoji-data.txt file was distributed separate from the UCD bundle. For these versions, you may need to download this file from https://unicode.org/Public/emoji in order to generate certain tables.

Now tell ucd-generate what you want and point it to the directory created above:

$ ucd-generate property-bool /tmp/ucd-15.0.0 --include Hyphen,Dash,Quotation_Mark --chars

And the output, which is valid Rust source code:

// DO NOT EDIT THIS FILE. IT WAS AUTOMATICALLY GENERATED BY:
//
//   ucd-generate property-bool /tmp/ucd-15.0.0 --include Hyphen,Dash,Quotation_Mark --chars
//
// Unicode version: 15.0.0.
//
// ucd-generate 0.2.10 is available on crates.io.

pub const BY_NAME: &'static [(&'static str, &'static [(char, char)])] = &[
  ("Dash", DASH), ("Hyphen", HYPHEN), ("Quotation_Mark", QUOTATION_MARK),
];

pub const DASH: &'static [(char, char)] = &[
  ('-', '-'), ('֊', '֊'), ('־', '־'), ('᐀', '᐀'), ('᠆', '᠆'),
  ('‐', '―'), ('⁓', '⁓'), ('⁻', '⁻'), ('₋', '₋'),
  ('−', '−'), ('⸗', '⸗'), ('⸚', '⸚'), ('⸺', '⸻'),
  ('⹀', '⹀'), ('\u{2e5d}', '\u{2e5d}'), ('〜', '〜'), ('〰', '〰'),
  ('゠', '゠'), ('︱', '︲'), ('﹘', '﹘'), ('﹣', '﹣'),
  ('－', '－'), ('𐺭', '𐺭'),
];

pub const HYPHEN: &'static [(char, char)] = &[
  ('-', '-'), ('\u{ad}', '\u{ad}'), ('֊', '֊'), ('᠆', '᠆'),
  ('‐', '‑'), ('⸗', '⸗'), ('・', '・'), ('﹣', '﹣'),
  ('－', '－'), ('･', '･'),
];

pub const QUOTATION_MARK: &'static [(char, char)] = &[
  ('"', '"'), ('\'', '\''), ('«', '«'), ('»', '»'), ('‘', '‟'),
  ('‹', '›'), ('⹂', '⹂'), ('「', '』'), ('〝', '〟'),
  ('﹁', '﹄'), ('＂', '＂'), ('＇', '＇'), ('｢', '｣'),
];

DFA serialization

Prior to ucd-generate 0.3.0, the sub-commands dfa and regex could be used to build fully compiled DFAs, serialize them to disk and generate Rust code for deserializing them. This functionality was removed in 0.3.0 and moved to regex-cli.

Contributing

The ucd-generate tool doesn't have any specific design goals, other than to collect Unicode table generation tasks. If you need ucd-generate to do something and it's reasonably straight-forward to add, then just submitting a PR would be great. Otherwise, file an issue and we can discuss.

Alternatives

The primary alternative is ICU4X. If you have sophisticated Unicode requirements, it is almost certainly what you should be using.

It's beyond the scope of this README to do a full comparison between ICU4X and ucd-generate, but I think the shortest way to describe it is that ucd-generate is simplistic, with all the associated positive and negative connotations that come with that word.

Future work

This tool is by no means is exhaustive. In fact, it's not even close to exhaustive, and it may never be. For the most part, the intent of this tool is to collect virtually any kind of Unicode generation task. In theory, this would ideally replace the hodge podge collection of Python programs that is responsible for this task today in various Unicode crates.

Here are some examples of future work that would be welcome:

More support for parsing things in the UCD.
More generation tasks based on things in the UCD.
More output formats, especially for reducing binary size.

Sub-crates

This repository is home to three sub-crates:

ucd-parse - A crate for parsing UCD files into structured data.
ucd-trie - Auxiliary type for handling the trie set table format emitted by ucd-generate. This crate has a no_std mode.
ucd-util - A purposely small crate for Unicode auxiliary functions. This includes things like symbol or character name canonicalization, ideograph name generation and helper functions for searching property name and value tables.

License

This project is licensed under either of

Apache License, Version 2.0, (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT) at your option.

ucd-generate's People

Contributors

Stargazers

Watchers

Forkers

zackpierce cuviper kpozin kornelski mesalock-linux thomcc yeslogic inquisitivecrystal ma124 pombredanne dtolnay-contrib djudd ahlinc erf choznerol addisoncrump raskad riomhchlaraitheoir

ucd-generate's Issues

Cyclic build dependency

Hi, can you please clarify why ucd-util contains files generated by ucd-generate, but then ucd-generate depends on ucd-util? This makes it hard for us to package things in Debian where the FTP masters are being very strict about where data (and "source" code) is ultimately coming from.

Presumably the data in ucd-util was generated using some particular version of Unicode, which locks ucd-generate to that particular version of Unicode?

Subcommand to generate custom tables containing a combination of properties / property values.

Essentially: a flag for compiling a table that is roughtly equivalent to a regex character class.

For example, currently the perl-word flag could be expressed as [\p{alpha}\p{joinc}\p{gc=digit}\p{gc=M}\p{gc=pc}], or something along those lines.

I also wanted this when I was experimenting with different options for getting a more accurate version of unicode-width (one that operates on strings and takes grapheme sequences into account, using heuristics to determine the actual rendered width on terminals).

When I've wanted this, I've resorted to just hacking it into a local branch of ucd-generate, or a separate project with a bunch of copypasted code. This also would make the work in #37 unnecessary, as it could be expressed as just a combination of \p{gc=...}s.

One option would be to accept regex char class syntax and use regex-syntax to parse it... The benefit here is that syntax is quite rich and supports the full spectrum of set operations (for example, the include/exclude flags we currently support can't express intersection or inversion, and doesn't nest).

But I believe it would be using the version of the UCD it was compiled with? Is this true? Does it matter? I'm on my lunch break now and haven't investigated this, so it's possible it's easy and does exactly what we want.

However we do it, we'd want to express something like:

enum ClassType {
    Range(u32, u32),
    Property(String),
    // .0 is == vs !=
    PropertyVal(bool, String, String),
    Expr(BinOp, Box<Class>, Box<Class>),
}
struct Class {
    negated: bool,
    ty: ClassType,
}
enum BinOp {
    Union,
    Intersection,
    Difference,
    SymmetricDifference,
}

Which then could be evaluated in a fairly easy manner, ignoring many concerns around performance, error handling, etc:

// This is horrifyingly inefficient, and just to prove the idea
type CharSet = BTreeSet<u32>;

fn evaluate(c: Class) -> CharSet {
    let set: CharSet = match (c.ty) {
        Range(lo, hi) => (lo..=hi).collect(),
        Property(s) => codepoints_with_property(&s, None),
        PropertyVal(true, p, v) =>
            codepoints_with_property(&p, Some(&v))?,
        PropertyVal(false, p, v) =>
            invert(codepoints_with_property(&p, Some(&v))),

        Expr(Union, lhs, rhs) =>
            evaluate(*lhs).union(evaluate(*rhs)),
        // ... etc
    };
    if c.negated {
        invert(set)
    } else {
        set
    }
}

fn invert(c: CharSet) -> CharSet {
    (0..0x110000).collect::<CharSet>().difference(&c)
}
// This assumes this function exists, which is probably not the signature
// it would actually have.
fn codepoints_with_property(name: &str, value: Option<&str>) -> CharSet;

Then, we'd use that to emit tables.

Unicode version 12.1

Please update.

Spaces and underscores are not completely ignored for U+1180 HANGUL JUNGSEONG O-E

UAX44-LM2 exceptionally treats the hyphen in “HANGUL JUNGSEONG O-E” as non-medial. character_name_normalize detects that by checking whether the remaining string after the hyphen is E or e. It should also ignore spaces and underscores. For example, "hangul jungseong o-e _" should be canonicalized to "hanguljungseongo-e", not "hanguljungseongoe".

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=0ac3aa310000440b22a7dfe05b56df8d

Add support for the compressed bitset format that rustc uses for some tables in libcore

Documented here: https://github.com/rust-lang/rust/blob/master/src/tools/unicode-table-generator/src/raw_emitter.rs, added in rust-lang/rust#68232

So I actually experimented a bit with this a bit shortly after I heard about it. Unfortunately, it actually requires a bit of tweaking to behave well for certain tables. You'll note that it says

//! The last byte of this top-level array is pulled out to a separate static
//! and trailing zeros are dropped; this is simply because grapheme_extend and
//! case_ignorable have a single entry in the 896th entry, so this shrinks them
//! down considerably.

Indicating the sort of tweaking you sometimes need to do. Additionally:

Some tables seem work better with a group length of 16, others 32, probably not limited to these sizes either.
A few work better with a bitset of u32 instead of u64.
etc.

And even without that, there are many tables that are obviously way larger than they need to be (anything that characters with both high and low codepoints) -- I ended up with several tables looking like:

static BITSET_CHUNKS_MAP: [u8; 544] = [
    5, 0, 0, 0, 6, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9,
    9, 9, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
    9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
    9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8,
];

which very clearly can be improved (for example, you could imagine solving it by some sort of wrap-around offset, and then trimming any trailing zeros after the wraparound). I also have very little doubt that if the generation code in rustc had a table like this, the generator would be tweaked so that it did better.

I'm half filing this just to document this investigation, and that I don't think it's as simple as just taking the generator from rustc and dropping it in.

I'm not sure how much it's ucd-generate's place to try to solve these issues, it would be great if it could -- right now there's no compelling reason to switch to ucd-generate for crates that already use a python generator, but support for a format which is both smaller and faster would be that (not to mention -- it would be great to reduce the size of these tables in common rust programs).

So yeah it would be great, and is something I'm interested in, but is subtle and might be more trouble than it's worth...

@Mark-Simulacrum might have some insight if they aren't too busy coming up with further clever methods of encoding data :p

Travis badges in crate Readmes

Travis CI support was removed in #19 but there are still Travis badges in the ucd-parse, ucd-trie, and ucd-util Readmes. Should these be changed or removed?

Feature request: add "block" subcommand

It would be useful to add "block" subcommand to generate ranges from Blocks.txt - unicode blocks.

Missing file emoji-data.txt

I ran into this following the instructions in the Example section on the crates.io page:

$ ucd-generate property-bool /tmp/ucd-10.0.0 --include Hyphen,Dash,Quotation_Mark --chars
No such file or directory (os error 2)

I had to dig into the code to figure out what was going on. The missing file is emoji-data.txt.

I reproduced this behavior with both version 0.2.1 (cargo install) and current master (b56d7f3).

I would have expected:

The instruction on crates.io to not run into file-not-found errors. Maybe a first example that doesn't require emoji-data.txt?
A more discoverable instruction for downloading emoji-data.txt. Maybe an adapted version of the note in emoji_properties.rs in the Example section, along with some hints for when it's needed and which version to pick?
Path-related error messages to spell out the troublesome path. Maybe relative to <ucd-dir>?

If you like I could make a PR to spell out the relevant path in error messages?

Fails to build on 32-bit mips/arm: the trait `regex_automata::StateID` is not implemented for `u64`

https://buildd.debian.org/status/package.php?p=rust-ucd-generate&suite=sid

https://buildd.debian.org/status/fetch.php?pkg=rust-ucd-generate&arch=mipsel&ver=0.2.1-1&stamp=1565069166&raw=0

   Compiling ucd-generate v0.2.1 (/<<PKGBUILDDIR>>)
     Running `CARGO_PKG_VERSION_PRE= CARGO_PKG_HOMEPAGE='https://github.com/BurntSushi/ucd-generate' CARGO_PKG_REPOSITORY='https://github.com/BurntSushi/ucd-generate' CARGO_PRIMARY_PACKAGE=1 CARGO_PKG_VERSION_PATCH=1 CARGO=/usr/bin/cargo CARGO_MANIFEST_DIR=/<<PKGBUILDDIR>> CARGO_PKG_VERSION_MAJOR=0 CARGO_PKG_NAME=ucd-generate CARGO_PKG_DESCRIPTION='A program for generating packed representations of the Unicode character
database that can be efficiently searched.
' CARGO_PKG_VERSION_MINOR=2 CARGO_PKG_AUTHORS='Andrew Gallant <[email protected]>' LD_LIBRARY_PATH='/<<PKGBUILDDIR>>/target/debug/deps:/usr/lib' CARGO_PKG_VERSION=0.2.1 rustc --crate-name ucd_generate src/main.rs --color never --crate-type bin --emit=dep-info,link -C debuginfo=2 -C metadata=9ab98c81f210928d -C extra-filename=-9ab98c81f210928d --out-dir /<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps --target mipsel-unknown-linux-gnu -C incremental=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/incremental -L dependency=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps -L dependency=/<<PKGBUILDDIR>>/target/debug/deps --extern byteorder=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libbyteorder-3e5d722a6d0ec6e8.rlib --extern clap=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libclap-5f7ca4d17efee90c.rlib --extern fst=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libfst-b10846890d7a7ac2.rlib --extern regex_automata=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libregex_automata-17bd458206005a4e.rlib --extern ucd_parse=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libucd_parse-9d99ee4ae391a19e.rlib --extern ucd_trie=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libucd_trie-8f97c9bd47582f20.rlib --extern ucd_util=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libucd_util-da5b2e49a7fb7143.rlib -C debuginfo=2 --cap-lints warn -C linker=mipsel-linux-gnu-gcc -C link-arg=-Wl,-z,relro --remap-path-prefix /<<PKGBUILDDIR>>=/usr/share/cargo/registry/ucd-generate-0.2.1`
error[E0599]: no method named `to_u64` found for type `regex_automata::DenseDFA<std::vec::Vec<usize>, usize>` in the current scope
  --> src/regex.rs:29:31
   |
29 |                 let dfa = dfa.to_u64()?.to_sparse()?;
   |                               ^^^^^^ help: there is a method with a similar name: `to_u16`

error[E0599]: no method named `to_u64` found for type `regex_automata::DenseDFA<std::vec::Vec<usize>, usize>` in the current scope
  --> src/regex.rs:49:31
   |
49 |                 let dfa = dfa.to_u64()?;
   |                               ^^^^^^ help: there is a method with a similar name: `to_u16`

error[E0277]: the trait bound `u64: regex_automata::StateID` is not satisfied
  --> src/regex.rs:81:34
   |
81 |                 let re = builder.build_with_size_sparse::<u64>(&pattern)?;
   |                                  ^^^^^^^^^^^^^^^^^^^^^^ the trait `regex_automata::StateID` is not implemented for `u64`

error[E0277]: the trait bound `u64: regex_automata::StateID` is not satisfied
  --> src/regex.rs:81:26
   |
81 |                 let re = builder.build_with_size_sparse::<u64>(&pattern)?;
   |                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `regex_automata::StateID` is not implemented for `u64`
   |
   = note: required by `regex_automata::SparseDFA`

error[E0277]: the trait bound `u64: regex_automata::StateID` is not satisfied
  --> src/regex.rs:82:21
   |
82 |                 wtr.sparse_regex(args.name(), &re)?;
   |                     ^^^^^^^^^^^^ the trait `regex_automata::StateID` is not implemented for `u64`

error[E0277]: the trait bound `u64: regex_automata::StateID` is not satisfied
   --> src/regex.rs:101:34
    |
101 |                 let re = builder.build_with_size::<u64>(&pattern)?;
    |                                  ^^^^^^^^^^^^^^^ the trait `regex_automata::StateID` is not implemented for `u64`

error[E0277]: the trait bound `u64: regex_automata::StateID` is not satisfied
   --> src/regex.rs:101:26
    |
101 |                 let re = builder.build_with_size::<u64>(&pattern)?;
    |                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `regex_automata::StateID` is not implemented for `u64`
    |
    = note: required by `regex_automata::DenseDFA`

error[E0277]: the trait bound `u64: regex_automata::StateID` is not satisfied
   --> src/regex.rs:102:21
    |
102 |                 wtr.dense_regex(args.name(), &re)?;
    |                     ^^^^^^^^^^^ the trait `regex_automata::StateID` is not implemented for `u64`

error: aborting due to 8 previous errors

Some errors have detailed explanations: E0277, E0599.
For more information about an error, try `rustc --explain E0277`.
error: Could not compile `ucd-generate`.

Caused by:
  process didn't exit successfully: `CARGO_PKG_VERSION_PRE= CARGO_PKG_HOMEPAGE='https://github.com/BurntSushi/ucd-generate' CARGO_PKG_REPOSITORY='https://github.com/BurntSushi/ucd-generate' CARGO_PRIMARY_PACKAGE=1 CARGO_PKG_VERSION_PATCH=1 CARGO=/usr/bin/cargo CARGO_MANIFEST_DIR=/<<PKGBUILDDIR>> CARGO_PKG_VERSION_MAJOR=0 CARGO_PKG_NAME=ucd-generate CARGO_PKG_DESCRIPTION='A program for generating packed representations of the Unicode character
database that can be efficiently searched.
' CARGO_PKG_VERSION_MINOR=2 CARGO_PKG_AUTHORS='Andrew Gallant <[email protected]>' LD_LIBRARY_PATH='/<<PKGBUILDDIR>>/target/debug/deps:/usr/lib' CARGO_PKG_VERSION=0.2.1 rustc --crate-name ucd_generate src/main.rs --color never --crate-type bin --emit=dep-info,link -C debuginfo=2 -C metadata=9ab98c81f210928d -C extra-filename=-9ab98c81f210928d --out-dir /<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps --target mipsel-unknown-linux-gnu -C incremental=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/incremental -L dependency=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps -L dependency=/<<PKGBUILDDIR>>/target/debug/deps --extern byteorder=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libbyteorder-3e5d722a6d0ec6e8.rlib --extern clap=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libclap-5f7ca4d17efee90c.rlib --extern fst=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libfst-b10846890d7a7ac2.rlib --extern regex_automata=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libregex_automata-17bd458206005a4e.rlib --extern ucd_parse=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libucd_parse-9d99ee4ae391a19e.rlib --extern ucd_trie=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libucd_trie-8f97c9bd47582f20.rlib --extern ucd_util=/<<PKGBUILDDIR>>/target/mipsel-unknown-linux-gnu/debug/deps/libucd_util-da5b2e49a7fb7143.rlib -C debuginfo=2 --cap-lints warn -C linker=mipsel-linux-gnu-gcc -C link-arg=-Wl,-z,relro --remap-path-prefix /<<PKGBUILDDIR>>=/usr/share/cargo/registry/ucd-generate-0.2.1` (exit code: 1)
dh_auto_test: /usr/share/cargo/bin/cargo build returned exit code 101
make: *** [debian/rules:3: build-arch] Error 255
dpkg-buildpackage: error: debian/rules build-arch subprocess returned exit status 2

Hyphens next to underscores are considered medial in character names

character_name_normalize only treats a hyphen as medial if it is not next to a space, but UAX44-LM2 defines medial as “immediately between two letters”. That means that, for example, "tibetan_letter_-a" should match "TIBETAN LETTER -A", not "TIBETAN LETTER A".

http://play.rust-lang.org/?gist=b509b7c2ea044330a0062afd20e72a15&version=stable&mode=debug&edition=2015

Provide access as a library?

I've just done work to add Unicode Property matchers to The Elegant Parser pest. This uses ucd-generate-generated tables and works great!

It would be very nice, however, if I could write a reusable cargo script to regenerate the tables. This would be enabled by being able to use ucd-generate as a library.

Would this be a use case you'd like to support? (Long-term, using UNIC as the tables would probably be better, but I've made sure it's encapsulated and at least until I get around to finishing the first pass of table optimization I feel more confident using ucd-trie than UNIC.)

ucd-generate should have tests

At a minimum we should verify that the things we output compile. This sounds doable-if-tedious. We do this now for some things, in the benchmarks. (It probably should be wider, although I'm not going to say we should try every flag combo or anything -- just ones likely to interact).

Ideally we could go one step further and verify that the data we output matches some reference version of the data. ucd-generate itself could even emit the tests.

`codepoint_to_codepoints` tables are needlessly wasteful

This output is used by the non---simple case mapping tables, and the --all-pairs case folding table.

I ended up going on a bit of a journey while looking into this one, mostly trying to figure out why it helped more than I had expected. The TLDR is that going from &[(char, &[char])] to &[(char, [char;3])] saves a lot of space -- more than you'd naively expect, and probably reduces startup time as well (on some platforms, given some set of compiler flags, etc).

Note: All sizes are given for 64 bit, but the overhead is still present on 32 bit, just slightly smaller. Additionally, do a mental s/char/u32/ where necessary if you're not outputting tables with char literals directly (e.g. --chars) -- it makes no difference.

I need to be convinced that it's bad.

The most obvious way this is bad is that these tables are output a slice of (char, &[char]). This occupies 24 bytes per entry (4 + 8 + 8 aligned up to 8), which is quite bad, but not even the whole story -- each entry also points to an out-of-line array, which occupies between 4 and 12 bytes. Being generous and rounding down to 4 (as most are 4 for most uses), this is still substantial, and means we have 28 bytes of table data per entry.

However, there's a more subtle way it's bad. When you store static arrays containing lot of pointers in this manner, and compile as a shared library, it can cause all the pointers to have to be relocated when the library is loaded. This be slow and contribute to further binary bloat (especially huge tables like these), and it can prevent the table from being placed in read-only memory (not that important probably, but cases like regex-capi might care).

For more on this, see https://akkadia.org/drepper/dsohowto.pdf but note that it's old, that many of its suggestions are either irrelevant or impossible for Rust, and that some of the things that he says don't cause issues in C, seem to still cause them for rust. readelf is the only real source of truth here (... well, for me it's been anyway).

Yes, yes, okay, what's your plan.

Instead of (char, &[char]), we should store entries as (char, [char; 3]). This would give us 16 bytes per entry, with no data stored out-of-line. We can append trailing NULs to unused portions of the [char; 3], which simultaneously solves the issues of "how do we recover entry.1.len()" and "what do we put in unused portions of entry.1".

This means you need slightly more complex logic to get the &[char] back, but in exchange you get reduced binary size and better cache locality (both because more entries fit into a single cache line, and because you no longer have to chase a pointer to get the data for a given entry).

There are other options slight variations here, but last night before I called it quits on writing this, I decided for some reason that this was where I should start deleting... Essentially: NUL is a little dodgy, Option would work but forces a perf hit without unsafe (example: entry.1[0] should not need a check), and 0! or 0x110000 would work for u32 output. That said, it seems very very unlikely that using NUL like this is a problem.

Explain how good of an idea it is, using numbers.

Doing this would reduce the various tables to about 56% of their former size. In theory it's between 57% and 44% depending on how many tables use longer &[char] slices, but in practice the non-singleton entries are very rare for all tables. Concretely, the case-folding-simple --all-chars table goes from 78,728 bytes to 44,768 bytes -- a reduction to 56.8% of the former size.

However, this is ignoring any size gained by removing overhead caused by the binary format and relocation information!

In practice, I've verified that at least on x86_64 linux when a library exposing a single function boiling down to CASE_FOLDING_TABLE.get(usize), 102,472 bytes are saved by the change in representation -- and the shared library file goes from 154,600 bytes to 52,128 (which is essentially optimal without material changes to the algorithm). It's hard to be certain, but that's somewhere around a reduction to 33% of the old size.

Note that the verification I did was based on making this change via regex-replace &[x] => [x, 0, 0] (and so on for 2 and 3) on the existing table table data, so it's not like I'm sitting on a patch for this.

Please, ramble for two more paragraphs, but say very little

... Honestly, 44k for this data still feels pretty terrible to me, but this seems the easiest possible change without coming up with a new algorithm for it. I would be interested in options for encoding this data with lower overhead still though.

More broadly, there are also other changes that could be made to reduce relocation overhead for some tables, but this is probably the biggest (maybe some property value things?). That said, it's probably worth considering for some of the ones used by regex-syntax...

Option to emit attribute to disable rustfmt

Depending on how rustfmt is configured, it may try to reformat ucd-generate output. Add an option to emit #![rustfmt::skip] at the top of files created by ucd-generate so that rustfmt knows to skip them.

See rust-lang/regex#634 for an example where files generated by ucd-generate don’t match rustfmt’s configuration, which cause CI failures.

Discussion: Emitting code to search tables with complex layouts

Right now, some tables are too complex for it to be reasonable to ask users to write the table search code themselves.

Currently, the model we follow for this is to have the table depend on an external crate, for example: ucd-trie, or fst. Lets ignore fst, because I have only surface level understanding of it (I really should get around to reading the blog post on it...).

Anyway the cases I care about are much closer in spirit to TrieSet:

Accessing the data in the trie requires a very specific set of undocumented operations. Even if it were documented, it's fiddly enough that we don't want users to have to write these lines themselves (even asking them to copy/paste them in would imply that they should understand it and are taking over maintenance of it).

We'd also like to think of the trie format as subject to change... (even though adding the statics essentially freezes it, as I'll get into) So we just ask them to take a dep, and ensure it can be as close to zero-cost as possible (#[no_std], functions inline, etc) so that nobody really has an excuse for a reason to not use it.

That said, this approach has downsides:

Problems with the separate-crate approach

The main problem is that it "freezes" the format of the trie -- any update that doesn't maintain 100% compatibility with existing tries must be a breaking change, or come in a TrieSet2 type or something. This might be acceptable in the case of the trie (IMO it's not -- it's not like there the trie is impossible to improve upon), but I don't think it's a good solution overall.
Users have to include ucd-trie as a dependency, and the dependency is bringing in a single 18 line function (well, and a type definition).

This is somewhat petty of me, but there are fairly few cases when I'm comfortable bringing in an external dependency for an 18 line function. Given that bstr has a line about "pushing back against the micro-crate ecosystem" in the readme, I suspect you can at least see my reasoning there.

That said, the downsides to this (aside from introducing dependency) are subtle, and it's not exactly clear what else to do -- using the language-provided method of code sharing for this is clean and obvious, if nothing else.

Anyway, the first point is what I care more about (or at least, I'd just suck up the second point if that were it).

In particular, I'd like to introduce new formats which are not frozen, and can change without worrying about compatibility with an external crate. Specifically (With most of these, I'd want the data structure exposed to be totally opaque, or potentially just expose a single function):

libcore-style bitset and skiplist, as discussed in #30.
re2-style range-based cyclic case-folding tables: https://github.com/google/re2/blob/master/re2/unicode_casefold.h.
Apply compression algorithms to some tables, possibly only rarely used ones (for example, the name table looks like it could compress well if massaged).
- In practice this would be somewhat customized, since most off the shelf lossless compressors aren't allowed to modify the input to compress better -- we could, so long as we can also emit code to reverse that modification if needed.
Better trie output
Fully flattened versions of some tables, representing any slice or str are represented as indices into a big shared string / shared array. (The property value tables in particular practically have a bulls-eye on their back for this).
...

Even some of the current formats are getting complex (--split-ranges --separated-values isn't that hard to write code for, but it's complex enough that you probably need documentation to do so).

But even for simple formats, like the slice of (u32, u32), I've gotten reading it it wrong more times than I want to admit![0] This may say more about me as a programmer than anything else, but I would have definitely used an option to spit out the table searching code.

Potential solutions

There's basically a few options here.

1. Stick with the external crate approach.

Pro: No duplicated code, Uses the language-provided method of sharing code.
Pro: Users don't have to look at the searching code, maintain it in any way, or worry about the scandalous way in which it uses integer arithmetic.
Con: Users need to add dependencies that may feel trivial to them.
Con: Discourages us from improving those tables.
Con: Causes people like me to file issues like this.

2. Emit code for searching in the same file.

Pro: No compatibility worries, we could even make the table structures explicitly opaque from outside the module.
Pro: Users can tweak ucd-generate options in various ways, without having to change their code, we could offer something like a --minimize flag that picks whichever format has the smallest output size.
Pro: Doesn't require any major changes to ucd-generate, nor does it get in the way of changes we may find desirable.
Con: Forces there to be multiple copies of the same code.
- Note: In practice these are probably inline anyway. Additionally, linkers can (sometimes) dedupe symbols with identical code, which is why sometimes why you see calls into the wrong (e.g.) Copy impl in profiler output and similar.
- Con: Users need to have code in their project that they don't maintain, and aren't expected to understand.
  - Worse, this code may not conform to their opinions around code style (No matter how clean the code we generate is, it's impossible to please everybody. I've seen someone argue that https://rust-lang.github.io/rust-clippy/master/#integer_arithmetic should be a more commonly used lint...).

3. Have a dedicated option for emitting the shared code in its own module...

3a: ...and have the user be responsible for passing the table and search term into the functions in that module. (This requires the table structure be defined as just tuples, slices, etc. And not require shared type definitions)
3b: ...and have the user pass the name/path of the shared code as an argument when generating the table (e.g. emitting to unicode_common.rs and then passing --module-path 'super::unicode_common' to everything else).

The pros/cons for these two are similar:

Pro: No duplicated code.
Con: All the stuff about users having "code in their project that they don't maintain" from above.
Con: It feels a little weird since we're explicitly not using the the language-provided method of sharing code.
- I imagine people who don't have issues with dependenies might think "why isn't this just in an external crate".
Con: Need some way of saying which table layout you're using, and thus need the code for. E.g. interacts poorly with how ad-hoc we are about ouput kinds is.
- Could complicate the CLI greatly unless some work to be less ad-hoc here was done first, maybe.
Con: For 3a, this doesn't fully addresses the compatibility issue.
- Because the table structure has to be "defined as just tuples, slices, etc", the user can (and likely will) depend on on the concrete layout here, meaning some types of updates could cause compile errors.
- Even if the layout remained constant, and just the interpretation of it changed, the user could forget to regenerate the shared code to match. This isn't a case I care as much about -- And it could be addressed in part by emitting exhaustive tests with the table data.
  - It would probably mean we would need to refrain from unsafe in the table access code, which isn't a thing we do now, but in some cases it could be worth it, potentially.
In 3b, neither of these issues exist -- we can provide structure definitions, and we could emit a const indicating a version number in the shared code, and a static assert that they match in the table module. This is possible because the table module has the path to the shared code, which is not true for the 3a. (Something like const _: [(); shared::BITSET_VERSION] = [(); 1];)

4. many-table output

Change ucd-generate so that it can output as many tables as are needed in a single run of the command. Then, it would generate the shared code for exactly the set of tables types which are needed as well as all the tables.

Pro: Opens the gate for cross-table optimizations: (shared rANS symbol frequencies as mentioned in #30 (comment))
Pro: No compat worries, since we control the world. Could emit code to detect version mismatch as in 3b above if desired.
Pro: Snags another benefit from the jaws of the python generators, which can easly do this sort of thing.
Con (and an enormous one): ucd-generate is not set up for this sort of thing, and it would be very hard to add. I don't know how the CLI params for this would look at all, and so we'd need a big benefit for it to be worth it, tbh.

This is of course not complete, there are also hybrids you could imagine... But it's getting long.

Opinions

In reverse order -- popping each of these off my mental stack:

Choice 4: multi-table output

I like the idea of this choice: Generate only what's needed, and have a full understanding of desired dataset, potentially opening further optimizations.

And really, I wish we lived in a world where we already had this information... But that's not the world we live in, and I don't even know what this would look like in the CLI. How much code would have to change? Honestly It doesn't seem worth it. I suspect cross-table optimizations would not be worth a major change like this anyway, and things that do prove useful can just be handled on a case-by-case basis with a new subcommand or flag or whatever.

Choice 3: New subcommand to emit code in separate module

This seems very fiddly. In particular I think describing the sets of all the things you want to search means it has similar problems to choice 4 (need to describe the set of tables you're generating).

If we made the code generation bits more consistent and used that to be more principaled about the CLI options it wouldn't be that bad to add, but I don't think it has enough benefits to outweigh the complexity and downsides at the moment.

Choice 2: Emit code in same module.

This doesn't have many major downsides, but looks a little goofy. I think the concerns about code style are fine so long as what we emit passes default clippy lints and such.

If we do this I suspect it's only a matter of time before someone ask for choice 3, but maybe things will be cleaner then and it will be easier to add. This wouldn't prevent that, or anything.

Choice 1: Status quo

If I liked this, wouldn't have filed this issue.

Wrapping up

So yeah, number two is my opinion. In practice the changes here would be:

Adding a flag to emit code that searches the tables for existing formats.
Support this flag for new formats, potentially exclusively if the format is unreasonable to access manually or still subject to change.
Add an option to emit --trie-sets which are standalone and don't depend on ucd-trie

Thoughts?

(Note that a large part of the reason behind this is because I feel like just submitting a PR with one of those weird data structures, and the code generated next it would at a minimum raise, eyebrows. I also feel like the current status is not ideal).