Giter Club home page Giter Club logo

cesu8str-rs's Introduction

CESU-8 encoder/decoder for Rust

Crates.io Docs.rs Build Status

Convert between ordinary UTF-8 and CESU-8 (Wikipedia, Unicode Consortium) encodings.

CESU-8 encodes characters outside the Basic Multilingual Plane as two UTF-16 surrogate chacaters, which are then further re-encoded as invalid, 3-byte UTF-8 characters. This means that 4-byte UTF-8 sequences become 6-byte CESU-8 sequences.

Note that CESU-8 is only intended for internal use within tightly-coupled systems, and not for data interchange.

This encoding is sometimes needed when working with Java, Oracle or MySQL, and when trying to store emoji, hieroglyphs, or other characters on the Supplementary Multilingual Plane or the Supplementary Ideographic Plane.

We also support Java's Modified UTF-8 encoding, which additionally encodes \0 using a multi-byte UTF-8 sequence.

Commandline Utilities

Installing: cargo install cesu8str --features=build-binary

cesu8str

encode/decode between MUTF8/CESU8 and UTF8

$ cesu8str --help
Converts files or standard IO streams between standard UTF8 and CESU8, or the JVM's modified UTF-8.
Note that the default Windows' console does not support non-UTF8 sequences - attempting to type/print them will result in 
an error.

This tool will immediately exit upon finding an invalid character sequence.

EXIT CODES:
0 - if completed normally
1 - if an IO error has occured
2 - if an encoding error has occured (invalid/incomplete character sequences/etc)


Usage: cesu8str [OPTIONS] [INPUT] [OUTPUT]

Arguments:
  [INPUT]   The input file. Defaults to stdin if '-' or not set
  [OUTPUT]  The output file. Defaults to stdout if '-' or not set

Options:
  -j, --java     Toggles the use of the JVM's modified UTF8. In effect, it encodes nul-bytes as 0xC0,0x80 while nul-bytes 
are left alone in normal mode
  -d, --decode   Decodes CESU8 text into standard UTF8. By default, this tool encodes UTF8 to CESU8
  -h, --help     Print help information
  -V, --version  Print version information

count_codepoints - count unicode code point lengths for UTF8/16 and friends

More of a utility tool for verifying functionality during development.

$ count_codepoints --help
USAGE: count_codepoints [--json] [file] [KB_chunk_size=4]

Reads a file and emits the number of occurences differing lengths of UTF-8 codepoints on stdout.

If `file` is not supplied, stdin will be read.

Exit Codes:
0       Success
1       Command-line error
2       IO Error

Example output:

$ count_codepoints ./test_files/random.txt      
len4: 1001
len3: 1013
len2: 966
ascii: 1095
nul: 0
total (chars/bytes): 4075/10070

$ count_codepoints ./test_files/random.cesu8.txt
len4: 0
len3: 3015
len2: 966
ascii: 1095
nul: 0
total (chars/bytes): 5076/1207

License

Some of this code is adapted from Rust's src/libcore/str.rs file. This code is covered by LICENSE-RUST.txt and copyright by The Rust Project Developers and individual Rust contributors, as described in that file.

Most code in this project was forked, and adapted from, Eric Kidd's cesu8 crate. That adapted code is distributed under the same terms as the original project at time of fork.

The new code in this project is distributed under the same terms.

cesu8str-rs's People

Contributors

chrismooredev avatar emk avatar gsingh93 avatar kennytm avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.