lifthrasiir / rust-encoding Goto Github PK

Character encoding support for Rust

License: MIT License

Makefile 0.08% Rust 91.62% HTML 6.34% Python 1.96%

rust-encoding's Introduction

Encoding 0.3.0-dev

Character encoding support for Rust. (also known as rust-encoding) It is based on WHATWG Encoding Standard, and also provides an advanced interface for error detection and recovery.

This documentation is for the development version (0.3). Please see the stable documentation for 0.2.x versions.

Complete Documentation (stable)

Usage

Put this in your Cargo.toml:

[dependencies]
encoding = "0.3"

Then put this in your crate root:

extern crate encoding;

Data Table

By default, Encoding comes with ~480 KB of data table ("indices"). This allows Encoding to encode and decode legacy encodings efficiently, but this might not be desirable for some applications.

Encoding provides the no-optimized-legacy-encoding Cargo feature to reduce the size of encoding tables (to ~185 KB) at the expense of encoding performance (typically 5x to 20x slower). The decoding performance remains identical. This feature is strongly intended for end users. Do not try to enable this feature from library crates, ever.

For finer-tuned optimization, see src/index/gen_index.py for custom table generation.

Overview

To encode a string:

use encoding::{Encoding, EncoderTrap};
use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.encode("caf\u{e9}", EncoderTrap::Strict),
           Ok(vec![99,97,102,233]));

To encode a string with unrepresentable characters:

use encoding::{Encoding, EncoderTrap};
use encoding::all::ISO_8859_2;

assert!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Strict).is_err());
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Replace),
           Ok(vec![65,99,109,101,63]));
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Ignore),
           Ok(vec![65,99,109,101]));
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::NcrEscape),
           Ok(vec![65,99,109,101,38,35,49,54,57,59]));

To decode a byte sequence:

use encoding::{Encoding, DecoderTrap};
use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.decode(&[99,97,102,233], DecoderTrap::Strict),
           Ok("caf\u{e9}".to_string()));

To decode a byte sequence with invalid sequences:

use encoding::{Encoding, DecoderTrap};
use encoding::all::ISO_8859_6;

assert!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Strict).is_err());
assert_eq!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Replace),
           Ok("Acme\u{fffd}".to_string()));
assert_eq!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Ignore),
           Ok("Acme".to_string()));

To encode or decode the input into the already allocated buffer:

use encoding::{Encoding, EncoderTrap, DecoderTrap};
use encoding::all::{ISO_8859_2, ISO_8859_6};

let mut bytes = Vec::new();
let mut chars = String::new();

assert!(ISO_8859_2.encode_to("Acme\u{a9}", EncoderTrap::Ignore, &mut bytes).is_ok());
assert!(ISO_8859_6.decode_to(&[65,99,109,101,169], DecoderTrap::Replace, &mut chars).is_ok());

assert_eq!(bytes, [65,99,109,101]);
assert_eq!(chars, "Acme\u{fffd}");

A practical example of custom encoder traps:

use encoding::{Encoding, ByteWriter, EncoderTrap, DecoderTrap};
use encoding::types::RawEncoder;
use encoding::all::ASCII;

// hexadecimal numeric character reference replacement
fn hex_ncr_escape(_encoder: &mut RawEncoder, input: &str, output: &mut ByteWriter) -> bool {
    let escapes: Vec<String> =
        input.chars().map(|ch| format!("&#x{:x};", ch as isize)).collect();
    let escapes = escapes.concat();
    output.write_bytes(escapes.as_bytes());
    true
}
static HEX_NCR_ESCAPE: EncoderTrap = EncoderTrap::Call(hex_ncr_escape);

let orig = "Hello, 世界!".to_string();
let encoded = ASCII.encode(&orig, HEX_NCR_ESCAPE).unwrap();
assert_eq!(ASCII.decode(&encoded, DecoderTrap::Strict),
           Ok("Hello, &#x4e16;&#x754c;!".to_string()));

Getting the encoding from the string label, as specified in WHATWG Encoding standard:

use encoding::{Encoding, DecoderTrap};
use encoding::label::encoding_from_whatwg_label;
use encoding::all::WINDOWS_949;

let euckr = encoding_from_whatwg_label("euc-kr").unwrap();
assert_eq!(euckr.name(), "windows-949");
assert_eq!(euckr.whatwg_name(), Some("euc-kr")); // for the sake of compatibility
let broken = &[0xbf, 0xec, 0xbf, 0xcd, 0xff, 0xbe, 0xd3];
assert_eq!(euckr.decode(broken, DecoderTrap::Replace),
           Ok("\u{c6b0}\u{c640}\u{fffd}\u{c559}".to_string()));

// corresponding Encoding native API:
assert_eq!(WINDOWS_949.decode(broken, DecoderTrap::Replace),
           Ok("\u{c6b0}\u{c640}\u{fffd}\u{c559}".to_string()));

Types and Stuffs

There are three main entry points to Encoding.

Encoding is a single character encoding. It contains encode and decode methods for converting String to Vec<u8> and vice versa. For the error handling, they receive traps (EncoderTrap and DecoderTrap respectively) which replace any error with some string (e.g. U+FFFD) or sequence (e.g. ?). You can also use EncoderTrap::Strict and DecoderTrap::Strict traps to stop on an error.

There are two ways to get Encoding:

encoding::all has static items for every supported encoding. You should use them when the encoding would not change or only handful of them are required. Combined with link-time optimization, any unused encoding would be discarded from the binary.
encoding::label has functions to dynamically get an encoding from given string ("label"). They will return a static reference to the encoding, which type is also known as EncodingRef. It is useful when a list of required encodings is not available in advance, but it will result in the larger binary and missed optimization opportunities.

RawEncoder is an experimental incremental encoder. At each step of raw_feed, it receives a slice of string and emits any encoded bytes to a generic ByteWriter (normally Vec<u8>). It will stop at the first error if any, and would return a CodecError struct in that case. The caller is responsible for calling raw_finish at the end of encoding process.

RawDecoder is an experimental incremental decoder. At each step of raw_feed, it receives a slice of byte sequence and emits any decoded characters to a generic StringWriter (normally String). Otherwise it is identical to RawEncoders.

One should prefer Encoding::{encode,decode} as a primary interface. RawEncoder and RawDecoder is experimental and can change substantially. See the additional documents on encoding::types module for more information on them.

Supported Encodings

Encoding covers all encodings specified by WHATWG Encoding Standard and some more:

7-bit strict ASCII (ascii)
UTF-8 (utf-8)
UTF-16 in little endian (utf-16 or utf-16le) and big endian (utf-16be)
All single byte encoding in WHATWG Encoding Standard:
- IBM code page 866
- ISO 8859-{2,3,4,5,6,7,8,10,13,14,15,16}
- KOI8-R, KOI8-U
- MacRoman (macintosh), Macintosh Cyrillic encoding (x-mac-cyrillic)
- Windows code pages 874, 1250, 1251, 1252 (instead of ISO 8859-1), 1253, 1254 (instead of ISO 8859-9), 1255, 1256, 1257, 1258
All multi byte encodings in WHATWG Encoding Standard:
- Windows code page 949 (euc-kr, since the strict EUC-KR is hardly used)
- EUC-JP and Windows code page 932 (shift_jis, since it's the most widespread extension to Shift_JIS)
- ISO-2022-JP with asymmetric JIS X 0212 support (Note: this is not yet up to date to the current standard)
- GBK
- GB 18030
- Big5-2003 with HKSCS-2008 extensions
Encodings that were originally specified by WHATWG Encoding Standard:
- HZ
ISO 8859-1 (distinct from Windows code page 1252)

Parenthesized names refer to the encoding's primary name assigned by WHATWG Encoding Standard.

Many legacy character encodings lack the proper specification, and even those that have a specification are highly dependent of the actual implementation. Consequently one should be careful when picking a desired character encoding. The only standards reliable in this regard are WHATWG Encoding Standard and vendor-provided mappings from the Unicode consortium. Whenever in doubt, look at the source code and specifications for detailed explanations.

rust-encoding's People

Contributors

Stargazers

Watchers

rust-encoding's Issues

"Replace" vs. WHATWG error handling

Hi,

Quoting from the README:

use encoding::whatwg;
let mut euckr = whatwg::TextDecoder::new(Some(~"euc-kr")).unwrap();
euckr.encoding(); // => ~"euc-kr"
let broken = &[0xbf, 0xec, 0xbf, 0xcd, 0xff, 0xbe, 0xd3];
euckr.decode_buffer(Some(broken)); // => Ok(~"\uc6b0\uc640\ufffd\uc559")

// this is different from rust-encoding's default behavior:
let decoded = all::WINDOWS_949.decode(broken, Replace); // => Ok(~"\uc6b0\uc640\ufffd\ufffd")

Is there a reason for this difference? Could the Replace built-in trap be align with the spec?

Use Cow?

&[u8] can be converted to &str (rather than String) in some cases, and vice-versa. std::borrow::Cow would be nice.

Attempt to bound type parameter with a nonexistent trait `Sync`

When initiating a build, the following error occurs:

pub type EncodingRef = &'static Encoding + Send + Sync;
// Line 249 of `src/encoding/types.rs`

By looking into the source code, there is absolutely no reference to Sync whatsoever.
Since this was non existent in earlier commits, I removed this reference and tried to build. Then I got the following error:

src/encoding/codec/utf_16.rs:153   box UTF16Decoder::<E> { leadbyte: 0xffff, leadsurrogate: 0xffff } as Box<Decoder>

to GBK and to UTF8 is not right work

i have a GBK string, GBK.decode(rst_raw, DecoderTrap::Strict).is_err() and UTF_8.decode(rst_raw, DecoderTrap::Strict).is_err() can not judge right result.i don't why, so i writed judge "utf8 str" code:
fn is_utf8(data: &[u8]) -> bool {
let mut i = 0;
while i < data.len() {
let num = preNUm(data[i]);
if data[i] & 0x80 == 0x00 {
i += 1;
continue;
} else if num > 2 {
i += 1;
let mut j = 0;
while j < num -1 {
if data[i] & 0xc0 != 0x80 {
return false;
}
j += 1;
i += 1;
}
} else {
return false;
}

}
return true;

}

fn preNUm(data: u8) -> i32 {
let rst = format!("{:b}", data);
let mut i = 0;
for j in rst.chars() {
if j != '1' {
break;
}
i += 1;
}
return i;
}

Remove the TextDecoder and TextEncoder APIs

Hi,

If #3 is resolved and the spec’s error handling behavior can be obtained with the "normal" API, I believe that the TextDecoder and TextEncoder APIs should be removed. The reasons are:

It doesn’t seem to do anything useful that the rest of the API doesn’t already do.
The spec says "Non-browser implementations are not required to implement this API."
It is designed for JavaScript (options dicts, TypeError, …) and doesn’t translate well into Rust.
I don’t personally find it very good anyway.

Instead, the get_encoding() function should be public. It is useful anyway when dealing with multiple encoding hints. (See for example the relevant CSS spec)

Charset request: ArmSCII-8

Would it be possible to add support for the ArmSCII-8 encoding?
Ref: https://manned.org/armscii-8 and https://en.wikipedia.org/wiki/ArmSCII

I had a quick look to see if I could add this myself, as it's just a single-byte encoding; But seeing how all current codecs are autogenerated from the whatwg specs, I'm a bit lost as to the best approach to implement a custom codec. I'd be happy to provide a PR if I have some guidance on the next steps to take.

There is no tests for rust-encoding.

Performance: Consider replacing lookup tables with match statements or binary search in single byte index

The current technique for building the single byte "forward" and "backward" function is to generate lookup tables using gen_index.py

Here's an example generated file: https://github.com/lifthrasiir/rust-encoding/blob/master/src/index/singlebyte/windows_1252.rs

There are some benchmarks that are generated, but they're micro-benchmarks with synthetic data, and I'm not sure they adequately capture how the library would be used in the wild.

So I wrote a few tiny benchmarks that exercise the encoder and decoder at the level they're typically used.

/// Some Latin-1 text to test
//
// the first few sentences of the article "An Ghaeilge" from Irish Wikipedia.
// https://ga.wikipedia.org/wiki/An_Ghaeilge
pub static IRISH_TEXT: &'static str =
    "Is ceann de na teangacha Ceilteacha í an Ghaeilge (nó Gaeilge na hÉireann mar a thugtar \
     uirthi corruair), agus ceann den dtrí cinn de theangacha Ceilteacha ar a dtugtar na \
     teangacha Gaelacha (.i. an Ghaeilge, Gaeilge na hAlban agus Gaeilge Mhanann) go háirithe. \
     Labhraítear in Éirinn go príomha í, ach tá cainteoirí Gaeilge ina gcónaí in áiteanna eile ar \
     fud an domhain. Is í an teanga náisiúnta nó dhúchais agus an phríomhtheanga oifigiúil i \
     bPoblacht na hÉireann í an Ghaeilge. Tá an Béarla luaite sa Bhunreacht mar theanga oifigiúil \
     eile. Tá aitheantas oifigiúil aici chomh maith i dTuaisceart Éireann, atá mar chuid den \
     Ríocht Aontaithe. Ar an 13 Meitheamh 2005 d'aontaigh airí gnóthaí eachtracha an Aontais \
     Eorpaigh glacadh leis an nGaeilge mar theanga oifigiúil oibre san AE";

pub static RUSSIAN_TEXT: &'static str =
    "Ру?сский язы?к Информация о файле слушать)[~ 3] один из восточнославянских языков, \
     национальный язык русского народа. Является одним из наиболее распространённых языков мира \
     шестым среди всех языков мира по общей численности говорящих и восьмым по численности \
     владеющих им как родным[9]. Русский является также самым распространённым славянским \
     языком[10] и самым распространённым языком в Европе ? географически и по числу носителей \
     языка как родного[7]. Русский язык ? государственный язык Российской Федерации, один из \
     двух государственных языков Белоруссии, один из официальных языков Казахстана, Киргизии и \
     некоторых других стран, основной язык международного общения в Центральной Евразии, в \
     Восточной Европе, в странах бывшего Советского Союза, один из шести рабочих языков ООН, \
     ЮНЕСКО и других международных организаций[11][12][13].";


#[bench]
fn bench_encode_irish(bencher: &mut test::Bencher) {
    bencher.bytes = IRISH_TEXT.len() as u64;
    bencher.iter(|| {
        test::black_box(
            WINDOWS_1252.encode(&ASCII_TEXT, EncoderTrap::Strict)
        )
    })
}

#[bench]
fn bench_decode_irish(bencher: &mut test::Bencher) {
    let bytes = WINDOWS_1252.encode(IRISH_TEXT, EncoderTrap::Strict).unwrap();
    
    bencher.bytes = bytes.len() as u64;
    bencher.iter(|| {
        test::black_box(
            WINDOWS_1252.decode(&bytes, DecoderTrap::Strict)
        )
    })
}

#[bench]
fn bench_encode_russian(bencher: &mut test::Bencher) {
    bencher.bytes = RUSSIAN_TEXT.len() as u64;
    bencher.iter(|| {
        test::black_box(
            ISO_8859_5.encode(&RUSSIAN_TEXT, EncoderTrap::Strict)
        )
    })
}

#[bench]
fn bench_decode_russian(bencher: &mut test::Bencher) {
    let bytes = ISO_8859_5.encode(RUSSIAN_TEXT, EncoderTrap::Strict).unwrap();
    
    bencher.bytes = bytes.len() as u64;
    bencher.iter(|| {
        test::black_box(
            ISO_8859_5.decode(&bytes, DecoderTrap::Strict)
        )
    })
}

I picked the windows-1252 encoding because it's similar to the old latin-1 standard and can encode the special characters in the Irish text I grabbed, and iso-8859-5 for similar reasons for the Russian test.

I rewrote gen_index.py to create match statements instead of building a lookup table. You get something like this:

// AUTOGENERATED FROM index-windows-1252.txt, ORIGINAL COMMENT FOLLOWS:
//
// For details on index index-windows-1252.txt see the Encoding Standard
// https://encoding.spec.whatwg.org/
//
// Identifier: e56d49d9176e9a412283cf29ac9bd613f5620462f2a080a84eceaf974cfa18b7
// Date: 2018-01-06
#[inline]
pub fn forward(code: u8) -> Option<u16> {
    match code {
        128 => Some(8364),
        129 => Some(129),
        130 => Some(8218),
        131 => Some(402),
        132 => Some(8222),
        133 => Some(8230),
        134 => Some(8224),
        135 => Some(8225),
        136 => Some(710),
        137 => Some(8240),
        //  a bunch more items
        250 => Some(250),
        251 => Some(251),
        252 => Some(252),
        253 => Some(253),
        254 => Some(254),
        255 => Some(255),
        _ => None
    }
}

#[inline]
pub fn backward(code: u32) -> Option<u8> {
    match code {
        8364 => Some(128),
        129 => Some(129),
        8218 => Some(130),
        402 => Some(131),
        8222 => Some(132),
        8230 => Some(133),
        8224 => Some(134),
        8225 => Some(135),
        710 => Some(136),
        8240 => Some(137),
        352 => Some(138),
        8249 => Some(139),
        338 => Some(140),
        141 => Some(141),
        381 => Some(142),
        //  a bunch more items
        251 => Some(251),
        252 => Some(252),
        253 => Some(253),
        254 => Some(254),
        255 => Some(255),
        _ => None
    }
}

Note that I changed the function signature to return an Option instead of a sentinel value. That wasn't strictly required, and didn't have a large effect on performance, but makes the code more idiomatic, I think.

I also generated a version that uses a binary search. It's pretty simple.

const BACKWARD_KEYS: &'static [u32] = &[
    128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
    147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166,
    167, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 187,
    188, 189, 190, 215, 247, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499,
    1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1513, 1514, 8206,
    8207, 8215
];

const BACKWARD_VALUES: &'static [u8] = &[
    128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
    147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166,
    167, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 187,
    188, 189, 190, 170, 186, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237,
    238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 253, 254, 223
];

#[inline]
pub fn backward(code: u32) -> u8 {
    if let Ok(index) = BACKWARD_KEYS.binary_search(&code) {
        BACKWARD_VALUES[index]
    } else {
        0
    }
}

Here's a table comparing the three techniques (scroll to see entire table):

test	master				match					binary search
codec::singlebyte::tests::bench_decode_irish	3246	ns/iter	240	MB/s	3171	ns/iter	245	MB/s	2.08%
codec::singlebyte::tests::bench_decode_russian	8508	ns/iter	98	MB/s	8890	ns/iter	94	MB/s	-4.08%
codec::singlebyte::tests::bench_encode_irish	2622	ns/iter	310	MB/s	1688	ns/iter	482	MB/s	55.48%	2243	ns/iter	363	MB/s	17.10%
codec::singlebyte::tests::bench_encode_russian	6692	ns/iter	228	MB/s	10578	ns/iter	144	MB/s	-36.84%	10019	ns/iter	152	MB/s	-33.33%

Obviously the Irish / Windows-1252 case is improved with both alternative techniques, but the Russian case is degraded.

It looks like the decode method isn't changed much, and that makes sense, because the match expressions are contiguous integers, I bet that LLVM is optimizing that down to a lookup table anyways.

I'll try running some more tests.

Extract interface into a separate crate

Some (e.g. me) might want to use encoding’s interfaces and build decoders and encoders for their own encodings (e.g. see #68). Pulling in the whole encoding with all its supported encodings and their tables is not quite ergonomic, though, and splitting interfaces/types/traits out from the main crate would keep encoding as simple to use as it is currently and allow people to build on it.

Use FM-index & json for DBCS decoding.

Stable reference to the encodings without a full library

Rust-encoding is (and will continue to be) big, and some library may want to provide a reference to the encoding but not the direct transcode facililty. emk/rust-uchardet#1 is one example, but there might be some more (for example, quoted-printable decoder may want not to transcode itself but may want to signal the encoding anyway).

A possible solution is to have a small crate (encoding-label would be fine) that has an enum representing all available encodings and a function to convert a label to that enum. encoding_from_whatwg_label will then depend on that function. Since Cargo does not allow multiple versions of the same crate linked together, this is fine; a function in the main crate that converts the enum to the actual EncodingRef cannot fail/panic.

I still want to explore some other possibilities and actual use cases. This is a good-to-have feature but not a blocker.

BOM-aware Unicode encodings

This issue was spotted during the removal of TextEncoder and TextDecoder (#4). TextDecoder has an ability to automatically strip the BOM (U+FFFD) from the input string if any. ~~We need to emulate this in a separate encoding, perhaps BOMAwareUTF8Encoding (which whatwg_name() is still utf-8)?~~ This use case itself can be handled better with decoders with a fallback encoding (#19), but we may need to require BOM-attached Unicode encodings from time to time: many applications of UTF-16 require BOM, for example.

Add Support For CP850

Hello,

Would it be possible to get support for CP850 (despite it being a near dead encoding)?

Implement missing encodings

This is a master list for important missing encodings. What is considered "important" is a delicate question, but for now I have the following list:

WHATWG multibyte encodings

Non-WHATWG multibyte encodings of the special interest

JIS X 0213 encodings: euc-jis-2004, iso-2022-jp-2004, shift-jis-2004
Other encodings based on Shift_JIS (WHATWG's shift_jis is actually windows-31j)
cesu-8, required for compatibility

Required for completeness

iso-8859-1 (compare with windows-1252)
euc-kr (compare with windows-949)
euc-jp with and without JIS X 0212 compatibility (WHATWG's euc-jp is asymmetric: it encodes without 0212 and decodes with 0212)
utf-32 and friends
utf-7?

hz-gb-2312 encoding and WHATWG compatibility

The WHATWG Encoding Spec lists hz-gb-2312 as mapping to the replacement encoding, which uses the UTF-8 encoder and throws a special replacement encoding error for its decoder. However, it looks like this crate implements the actual HZ encoding. For WHATWG compatibility, this would have to get folded in with the rest of the replacement encodings, but I don't know if that's acceptable considering other people may be using the current implementation.

Would you prefer to maintain strict WHATWG compatibility or keep the current implementation? If the current implementation is kept, this deviation needs to be well documented - it isn't too hard to work around, but is a bit annoying and could catch someone unaware because the rest of the crate is compatible.

Implement common traits for Encoding

It would be useful to have Debug, Clone, Eq/PartialEq implemented or dervied for Encoding/EncodingRef so that structs can contain an EncodingRef and have those traits be derived.

`encoding::Encoding` cannot be shared between threads safely

required because of the requirements on the impl of std::marker::Send for &encoding::Encoding

   scope.execute(move || {
              ^

How can I prevent this?

`all::encodings()` returns an errornous list (and should be sorted alphabetically).

The bug is in src/all.rs:

   const ENCODINGS: &'static [EncodingRef] = &[ ....

This is the way I collect the list (please confirm that it should be done this way):

   let list = all::encodings().iter().map(|&e|format!(" {}\n",e.name())).collect::<String>();

the following names do not work:

 error mac-roman mac-roman mac-cyrillic hz big5-2003 pua-mapped-binary encoder-only-utf-8

These two do work but are not listed:

  x-user-defined macintosh

PLEASE return them alphabetically sorted!

Add IANA names for label lookup

MIME Content-Type headers (among others) specify character set by IANA name[1] rather than by WHATWG or Windows Code page. It would be handy to have a way of getting an encoding by this label.

[1] - http://www.iana.org/assignments/character-sets/character-sets.xhtml

Issue with multi-codepoint graphemes

I just tried encoding "A\u{308}", which is another Unicode representation of "Ä" to ISO8859-1 with EncoderTrap::Replace but instead of "Ä" I get "A?". Is this intentional or an oversight?

Why I always got the error?

╭─mindcat@mindcat-linux-pc ~/mydev/rust/Rygos  
╰─➤  rustc -L . main.rs                                                                                          101 ↵
main.rs:88:32: 89:29 error: type `&'static encoding::codec::utf_16::UTF16LEEncoding` does not implement any method in scope named `decode`
main.rs:88                                 UTF_16LE.decode(reader.read_to_end().as_slice(),DecodeReplace)
main.rs:89                             },
error: aborting due to previous error
task 'rustc' failed at 'explicit failure', /home/mindcat/everyday/rust-git/src/rust/src/libsyntax/diagnostic.rs:102
task '<main>' failed at 'explicit failure', /home/mindcat/everyday/rust-git/src/rust/src/librustc/lib.rs:397
╭─mindcat@mindcat-linux-pc ~/mydev/rust/Rygos  
╰─➤  rustc -L . main.rs                                                                                          101 ↵
main.rs:88:32: 89:29 error: type `&'static encoding::codec::utf_16::UTF16LEEncoding` does not implement any method in scope named `decode`
main.rs:88                                 encoding::all::UTF_16LE.decode(reader.read_to_end().as_slice(),DecodeReplace)
main.rs:89                             },
error: aborting due to previous error
task 'rustc' failed at 'explicit failure', /home/mindcat/everyday/rust-git/src/rust/src/libsyntax/diagnostic.rs:102
task '<main>' failed at 'explicit failure', /home/mindcat/everyday/rust-git/src/rust/src/librustc/lib.rs:397
╭─mindcat@mindcat-linux-pc ~/mydev/rust/Rygos  
╰─➤  rustc -L . main.rs                                                                                          101 ↵
main.rs:88:32: 89:29 error: type `&'static encoding::codec::singlebyte::SingleByteEncoding` does not implement any method in scope named `decode`
main.rs:88                                 encoding::all::ISO_8859_1.decode(reader.read_to_end().as_slice(),DecodeReplace)
main.rs:89                             },
error: aborting due to previous error
task 'rustc' failed at 'explicit failure', /home/mindcat/everyday/rust-git/src/rust/src/libsyntax/diagnostic.rs:102
task '<main>' failed at 'explicit failure', /home/mindcat/everyday/rust-git/src/rust/src/librustc/lib.rs:397
╭─mindcat@mindcat-linux-pc ~/mydev/rust/Rygos  
╰─➤  rustc -L . main.rs                                                                                          101 ↵
main.rs:88:32: 89:29 error: type `&'static encoding::codec::singlebyte::SingleByteEncoding` does not implement any method in scope named `decode`
main.rs:88                                 encoding::all::ISO_8859_1.decode([0],DecodeReplace)
main.rs:89                             },
error: aborting due to previous error
task 'rustc' failed at 'explicit failure', /home/mindcat/everyday/rust-git/src/rust/src/libsyntax/diagnostic.rs:102
task '<main>' failed at 'explicit failure', /home/mindcat/everyday/rust-git/src/rust/src/librustc/lib.rs:397
╭─mindcat@mindcat-linux-pc ~/mydev/rust/Rygos  
╰─➤

I wrote "pub use encoding::all::UTF_16LE;"

Abandoned?

What is the status of the project? It seems to have seen no updates in the last 5 years, is it abandoned? And if so what is the "official" replacement?

If the project is to be considered abandoned, maybe that could be indicated in the readme and the project archived?

Add a "mid-level" API

Currently the API provides "high-level" and "low-level" methods. The former takes a single string/vector in memory with an error handling mechanism, and returns another string/vector:

fn decode(&'static Encoding, input: &[u8], trap: Trap) -> Result<~str,SendStr>;

The latter allows incremental processing of the input (eg. as it is downloaded from the network), but leaves error handling to the user:

fn decoder(&'static Encoding) -> ~Decoder;
fn raw_feed(&mut Decoder, input: &[u8], output: &mut StringWriter) -> (uint, Option<CodecError>);
fn raw_finish(&mut Decoder, output: &mut StringWriter) -> Option<CodecError>;

It would be useful to also have an intermediate that does error handling but allows incremental processing, eg:

fn decoder(&'static Encoding, trap: Trap) -> ~Decoder;
fn feed(&mut Decoder, input: &[u8], output: &mut StringWriter) -> Option<SendStr>;
fn finish(&mut Decoder, output: &mut StringWriter) -> Option<SendStr>;

(trap is part of the decoder, because there is no reason to change it mid-stream.)

Unicode decoder with a fallback encoding

Many use cases involving BOM-aware Unicode encodings (#17) actually require some sort of multi-encoding decoder, specifically having a Unicode encoding (used when there is BOM) plus a fallback encoding (used when there isn't).

Need the performance data for rust-encoding.

The performance data for iconv-lite is:
https://github.com/ashtuchkin/iconv-lite#encodingdecoding-speed

I am curious about how fast the rust-encoding is

A proposed redesign for Gecko's encoders and decoders.

@hsivonen is proposing to replace Gecko's encoding converters by a library written in Rust: https://docs.google.com/document/d/13GCbdvKi83a77ZcKOxaEteXp1SOGZ_9Fmztb9iX22v0/edit.

He identifies some arguments against using rust-encoding in its current form. Since I believe it would be great to have the wider Rust ecosystem as well as Servo and Gecko share the same library, I'd appreciate your thoughts on the proposal and whether it would be an option to adjust rust-encoding to the proposed design (keeping the current API in place as much as possible).

Share has been renamed to Sync in master

See here

Share has been renamed to Sync. It is still there as deprecated, but it isn't automatically imported by prelude anymore. So this either needs to be changed to Sync or Share has to be explicitly imported

cp437

I need support to encode/decode cp437 to interface with a dynamic library.

The documentation does not explain how to create your own charset.

Also, I would happily submit a PR with some guidance of how to implement it since I've seen other requests for it.

Encoding API prososal for libstd/libextra

Hi,

As you may have heard, there was some discussion last week about importing some of rust-encoding into Rust’s libstd or libextra. I think it is important to get the API right first (w.r.t. errors handling, incremental processing, etc.) The latest version of my proposal is on the rust-dev mailing list: (but see the rest of the thread.)

https://mail.mozilla.org/pipermail/rust-dev/2013-September/005556.html

I would appreciate your feedback on this, especially on using conditions for error handling, and on the necessity for open-ended error handling as opposed to a fixed set of modes as in recent versions of the spec.

How to Reset a RawDecoder

I'm current working on this issue:
servo/servo#13234
To support streaming, I need to use RawDecoder provided by rust-encoding. I add an instance of RawDecoder to TextDecoder, make decode method use interface by RawDecoder. However I found no way to 'reset' the RawDecoder, that is, clear raw decoder's current state and unprocessed buffer. Is that can be done?

rust-encoding failed to compile with rust-msvc_x64

libencoding-628a0580477da326.rlib(encoding-628a0580477da326.0.o) : error LNK2019: unresolved external symbol __imp__ZN6euc_kr20BACKWARD_TABLE_UPPER20h7dcfe3eed7745131q6oE referenced in function _ZN6euc_kr8backward20h382884ad758a08bazDpE
libencoding-628a0580477da326.rlib(encoding-628a0580477da326.0.o) : error LNK2019: unresolved external symbol __imp__ZN6euc_kr20BACKWARD_TABLE_LOWER20h7dcfe3eed7745131RlgE referenced in function _ZN6euc_kr8backward20h382884ad758a08bazDpE
libencoding-628a0580477da326.rlib(encoding-628a0580477da326.0.o) : error LNK2019: unresolved external symbol __imp__ZN6euc_kr13FORWARD_TABLE20h7dcfe3eed7745131gaaE referenced in function _ZN6euc_kr7forward20h893f5ab6f8728033rlgE
libencoding-628a0580477da326.rlib(encoding-628a0580477da326.0.o) : error LNK2019: unresolved external symbol __imp__ZN7jis020820BACKWARD_TABLE_UPPER20h8b1de7443deb80aa6MiE referenced in function _ZN7jis02088backward20h25fc7b114b367a7dstjE
libencoding-628a0580477da326.rlib(encoding-628a0580477da326.0.o) : error LNK2019: unresolved external symbol __imp__ZN7jis020820BACKWARD_TABLE_LOWER20h8b1de7443deb80aaT3cE referenced in function _ZN7jis02088backward20h25fc7b114b367a7dstjE
.........

many link errors.

C1 are part of ISO-8859-1 (as far as the IANA is concerned)

https://en.wikipedia.org/wiki/ISO/IEC_8859-1

"ISO-8859-1 is the IANA preferred name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429"

(Python didn't get the memo)

Incrementally parsed invalid sequences spanning multiple chunks write data

    #[test]
    fn test_invalid_multibyte_span() {
        use std::mem;
        let mut d = UTF8Encoding.decoder();
        // "ef bf be" is an invalid sequence.
        assert_feed_ok!(d, [], [0xef, 0xbf], "");
        let input: [u8, ..1] = [ 0xbe ];
        let (_, _, buf) = unsafe { d.test_feed(mem::transmute(input.as_slice())) };
        // Make sure no data was written to the buffer.
        assert_eq!(buf, String::new());
        // task 'codec::utf_8::tests::test_invalid_multibyte_span' failed at 'assertion failed: `(left == right) && (right == left)` (left: `�`, right: ``)', /Users/cgaebel/code/rust-encoding/src/codec/utf_8.rs:529
    }

This test successfully reports an error, but when it does it writes an invalid code sequence into the buffer.

(side note, github markup is eating the invalid UTF-8 char in left. Rest assured SOMETHING is in there.

Release on crates.io

The current version of rust-encoding on crates.io is out of date. Please release a new one!

warning: private trait in public interface (error E0445)

Compiling encoding v0.3.0-dev (file:///rust-encoding-master)
src/codec/utf_16.rs:86:9: 86:15 warning: private trait in public interface (error E0445), #[warn(private_in_public)] on by default
src/codec/utf_16.rs:86 impl<E: Endian> Encoding for UTF16Encoding<E> {
                               ^~~~~~
src/codec/utf_16.rs:86:9: 86:15 warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
src/codec/utf_16.rs:86:9: 86:15 note: for more information, see the explanation for E0446 (`--explain E0446`)
src/codec/utf_16.rs:106:9: 106:15 warning: private trait in public interface (error E0445), #[warn(private_in_public)] on by default
src/codec/utf_16.rs:106 impl<E: Endian> UTF16Encoder<E> {
                                ^~~~~~
src/codec/utf_16.rs:106:9: 106:15 warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
src/codec/utf_16.rs:106:9: 106:15 note: for more information, see the explanation for E0446 (`--explain E0446`)
src/codec/utf_16.rs:112:9: 112:15 warning: private trait in public interface (error E0445), #[warn(private_in_public)] on by default
src/codec/utf_16.rs:112 impl<E: Endian> RawEncoder for UTF16Encoder<E> {
                                ^~~~~~
src/codec/utf_16.rs:112:9: 112:15 warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
src/codec/utf_16.rs:112:9: 112:15 note: for more information, see the explanation for E0446 (`--explain E0446`)
src/codec/utf_16.rs:159:9: 159:15 warning: private trait in public interface (error E0445), #[warn(private_in_public)] on by default
src/codec/utf_16.rs:159 impl<E: Endian> UTF16Decoder<E> {
                                ^~~~~~
src/codec/utf_16.rs:159:9: 159:15 warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
src/codec/utf_16.rs:159:9: 159:15 note: for more information, see the explanation for E0446 (`--explain E0446`)
src/codec/utf_16.rs:166:9: 166:15 warning: private trait in public interface (error E0445), #[warn(private_in_public)] on by default
src/codec/utf_16.rs:166 impl<E: Endian> RawDecoder for UTF16Decoder<E> {
                                ^~~~~~
src/codec/utf_16.rs:166:9: 166:15 warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
src/codec/utf_16.rs:166:9: 166:15 note: for more information, see the explanation for E0446 (`--explain E0446`)
src/codec/simpchinese.rs:90:9: 90:15 warning: private trait in public interface (error E0445), #[warn(private_in_public)] on by default
src/codec/simpchinese.rs:90 impl<T: GBType> Encoding for GBEncoding<T> {
                                    ^~~~~~
src/codec/simpchinese.rs:90:9: 90:15 warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
src/codec/simpchinese.rs:90:9: 90:15 note: for more information, see the explanation for E0446 (`--explain E0446`)
src/codec/simpchinese.rs:110:9: 110:15 warning: private trait in public interface (error E0445), #[warn(private_in_public)] on by default
src/codec/simpchinese.rs:110 impl<T: GBType> GBEncoder<T> {
                                     ^~~~~~
src/codec/simpchinese.rs:110:9: 110:15 warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
src/codec/simpchinese.rs:110:9: 110:15 note: for more information, see the explanation for E0446 (`--explain E0446`)
src/codec/simpchinese.rs:116:9: 116:15 warning: private trait in public interface (error E0445), #[warn(private_in_public)] on by default
src/codec/simpchinese.rs:116 impl<T: GBType> RawEncoder for GBEncoder<T> {
                                     ^~~~~~
src/codec/simpchinese.rs:116:9: 116:15 warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
src/codec/simpchinese.rs:116:9: 116:15 note: for more information, see the explanation for E0446 (`--explain E0446`)

Warnings emited when building

On master, cargo build emits 237 warnings.

Here are the different kinds of warnings:

warning: trait objects without an explicit dyn are deprecated
... range patterns are deprecated
unreachable pattern (this one has to do with the fact that Rust wasn't able to tell when a match pattern was exhaustive if you used all of the scalar values for a type, but now it appears to handle that correctly)
use of deprecated item 'try': use the ? operator instead

You can run cargo build -v 2>&1 | grep warning | sort | uniq to get a summary.

error: type `&mut collections::string::String` does not implement any method in scope named `push`

Compiling encoding v0.1.0 (https://github.com/lifthrasiir/rust-encoding#1d264878)

src/encoding/types.rs:128:14: 128:21 error: type `&mut collections::string::String` does not implement any method in scope named `push`
src/encoding/types.rs:128         self.push(c);

Register with Travis CI

See http://hiho.io/rust-ci/

Steps: http://hiho.io/rust-ci/help/

request for no_std support

compile rust-encoding with master version(now) failed.

http://paste.ubuntu.com/6556081/

include LICENSE text to subcrates

This would help me in packaging for Fedora.

Consider implementing CESU-8 (external library available)

I saw that you had the evil encoding CESU-8 on one of your TODO lists. If it helps, I've taken a stab at a small, standalone encoding library which converts between UTF-8 and CESU-8:

https://crates.io/crates/cesu8

Is there any way to make a library like this compatible with rust-encoding without depending on anything more than a few core types? Thank you as always for your thoughts!

ISO-2022-JP is not up-to-date

Should follow whatwg/encoding@19b0ebf. Unfortunately, this is not an easy task which should update both the encoder, decoder and tests. Also it would be desirable to have a version of ISO-2022-JP encoding with JIS X 0212 and/or 0213 support as an extra bonus.

Relicense under dual MIT/Apache-2.0

This issue was automatically generated. Feel free to close without ceremony if
you do not agree with re-licensing or if it is not possible for other reasons.
Respond to @cmr with any questions or concerns, or pop over to
#rust-offtopic on IRC to discuss.

You're receiving this because someone (perhaps the project maintainer)
published a crates.io package with the license as "MIT" xor "Apache-2.0" and
the repository field pointing here.

TL;DR the Rust ecosystem is largely Apache-2.0. Being available under that
license is good for interoperation. The MIT license as an add-on can be nice
for GPLv2 projects to use your code.

Why?

The MIT license requires reproducing countless copies of the same copyright
header with different names in the copyright field, for every MIT library in
use. The Apache license does not have this drawback. However, this is not the
primary motivation for me creating these issues. The Apache license also has
protections from patent trolls and an explicit contribution licensing clause.
However, the Apache license is incompatible with GPLv2. This is why Rust is
dual-licensed as MIT/Apache (the "primary" license being Apache, MIT only for
GPLv2 compat), and doing so would be wise for this project. This also makes
this crate suitable for inclusion and unrestricted sharing in the Rust
standard distribution and other projects using dual MIT/Apache, such as my
personal ulterior motive, the Robigalia project.

Some ask, "Does this really apply to binary redistributions? Does MIT really
require reproducing the whole thing?" I'm not a lawyer, and I can't give legal
advice, but some Google Android apps include open source attributions using
this interpretation. Others also agree with
it.
But, again, the copyright notice redistribution is not the primary motivation
for the dual-licensing. It's stronger protections to licensees and better
interoperation with the wider Rust ecosystem.

How?

To do this, get explicit approval from each contributor of copyrightable work
(as not all contributions qualify for copyright) and then add the following to
your README:

## License

Licensed under either of
 * Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.

### Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
additional terms or conditions.

and in your license headers, use the following boilerplate (based on that used in Rust):

// Copyright (c) 2016 rust-encoding developers
//
// Licensed under the Apache License, Version 2.0
// <LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0> or the MIT
// license <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. All files in the project carrying such notice may not be copied,
// modified, or distributed except according to those terms.

Be sure to add the relevant LICENSE-{MIT,APACHE} files. You can copy these
from the Rust repo for a plain-text
version.

And don't forget to update the license metadata in your Cargo.toml to:

license = "MIT/Apache-2.0"

I'll be going through projects which agree to be relicensed and have approval
by the necessary contributors and doing this changes, so feel free to leave
the heavy lifting to me!

Contributor checkoff

To agree to relicensing, comment with :

I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to chose either at their option

Or, if you're a contributor, you can check the box in this repo next to your
name. My scripts will pick this exact phrase up and check your checkbox, but
I'll come through and manually review this issue later as well.

Community activeness

Can you please provide the community activity？

Encoding.name() vs. WHATWG encoding name

whatwg::encoding_from_label returns a tuple of an Encoding object and the encoding name as a string, while the object also has a .name() that returns the same as a string. This seems redundant.

I would like to remove the former and only keep the latter, which should use names from the spec. The requires changes are:

Rename shift-jis to shift_jis.
Add iso-8859-8-i, identical to iso-8859-8 but with a different name.
Rename windows-949 to euc-kr

1 and 2 are harmless, but 3 seems to have been deliberate. Is there a difference between windows-949 and euc-kr, or a reason to prefer the first name?

Make `CodecError.cause` an enum or similar

Since it is a ~str right now, it incurs a (small) allocation for each error. This can be an easy attack vector if we are not careful. Maybe an enum is sufficient, having the following variants (which are all messages currently in use):

"invalid (byte) sequence" (the byte sequence invalid in any circumstances)
"incomplete (byte) sequence" (the byte sequence terminated in the middle, valid when the appropriate suffix is given)
"unrepresentable character" (the character not in the encoding repotoire)

error: duplicate definition of type or module `RelativeSchemeData`

.cargo/git/checkouts/rust-url-1e22af4233079a1e/master/src/lib.rs:210:1: 246:2 error: duplicate definition of type or module `RelativeSchemeData`
.cargo/git/checkouts/rust-url-1e22af4233079a1e/master/src/lib.rs:210 pub struct RelativeSchemeData {
.cargo/git/checkouts/rust-url-1e22af4233079a1e/master/src/lib.rs:211     /// The username of the URL, as a possibly empty, pecent-encoded string.
.cargo/git/checkouts/rust-url-1e22af4233079a1e/master/src/lib.rs:212     ///
.cargo/git/checkouts/rust-url-1e22af4233079a1e/master/src/lib.rs:213     /// Percent encoded strings are within the ASCII range.
.cargo/git/checkouts/rust-url-1e22af4233079a1e/master/src/lib.rs:214     ///
.cargo/git/checkouts/rust-url-1e22af4233079a1e/master/src/lib.rs:215     /// See also the `lossy_percent_decode_username` method.
                                                                                 ...
.cargo/git/checkouts/rust-url-1e22af4233079a1e/master/src/lib.rs:198:5: 198:43 note: first definition of type or module `RelativeSchemeData` here
.cargo/git/checkouts/rust-url-1e22af4233079a1e/master/src/lib.rs:198     RelativeSchemeData(RelativeSchemeData),
                                                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
error: aborting due to previous error

Rename "trap" to "error handler"

Although I can infer from context, I don’t know what "trap" is supposed to mean in this API. If you’re ok with this, I can provide a PR to rename it to "error handler" or "error handling", which I find more meaningful.

Readers?

It would be convenient to have an object that implements Read, so one could for example easily and efficiently read from a file in an encoding other than utf-8.