
fast-text-encoding's Introduction

👋

fast-text-encoding's People

Contributors

dependabot[bot], jilvin, kevinushey, reeywhaar, samthor, vapier


fast-text-encoding's Issues

FastTextEncoder and FastTextDecoder being assigned to `scope` regardless of whether instances are already present?

Source:
https://github.com/samthor/fast-text-encoding/blob/master/src/polyfill.js#L8-L9

scope['TextEncoder'] = scope['TextEncoder'] || FastTextEncoder;
scope['TextDecoder'] = scope['TextDecoder'] || FastTextDecoder;

Output:
https://github.com/samthor/fast-text-encoding/blob/master/text.min.js

l.TextEncoder = m;
l.TextDecoder = k;

Should the above read this instead?

l.TextEncoder = l.TextEncoder || m;
l.TextDecoder = l.TextDecoder || k;

utfLabel argument does not support "utf8"

FastTextDecoder and FastTextEncoder should also support a utfLabel argument with the value "utf8". Currently the constructors throw an error if the argument is not exactly equal to "utf-8" (https://github.com/samthor/fast-text-encoding/blob/master/text.js#L36).

According to https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/TextDecoder, both "utf8" and "utf-8" are acceptable values for utfLabel.

How to reproduce

Some libraries, e.g. https://github.com/kripken/sql.js, pass the value "utf8" to the TextDecoder (or Encoder) constructor.

I found this issue by accident, since our project uses the following libraries (amongst others):

TypeORM attempts to require sql.js, which then fails with this error:

RangeError: Failed to construct 'TextDecoder': The encoding label provided ('utf8') is invalid.

Apparently, by some strange coincidence, this issue has not occurred earlier in our codebase. Perhaps it has something to do with the order in which some transitive dependencies are required.
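For reference, here is a minimal sketch of a constructor that accepts the common UTF-8 label aliases; the label list and the FastTextDecoder shape shown are assumptions for illustration, not the library's actual code:

// Sketch only: normalize and accept the usual UTF-8 label aliases before rejecting.
var VALID_UTF8_LABELS = ['utf-8', 'utf8', 'unicode-1-1-utf-8'];

function FastTextDecoder(utfLabel) {
  utfLabel = (utfLabel || 'utf-8').toLowerCase();
  if (VALID_UTF8_LABELS.indexOf(utfLabel) === -1) {
    throw new RangeError("Failed to construct 'TextDecoder': The encoding label provided ('" + utfLabel + "') is invalid.");
  }
  this.encoding = 'utf-8'; // report the canonical name regardless of the label passed in
  this.fatal = false;
  this.ignoreBOM = false;
}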

I'm Sorry

I'm sorry. I really am. You have clearly put a lot of time into this project. Nevertheless, there is now a polyfill here that appears to be better in every respect.

  • It is smaller
  • It is faster on every benchmark. In particular, when decoding a huge ASCII array (which is quite common), the linked polyfill is twice as fast.
  • It offers a TextEncoder.prototype.encodeInto polyfill
  • It offers a solo TextEncoder-only file
  • It offers a solo TextDecoder-only file

I realize that you have put a lot of work into this polyfill, and I respect the amount of work you have put into it. I hope you understand.

(Disclaimer: I created FastestSmallestTextEncoderDecoder)

out of stack space when string is too long

The call to String.fromCharCode.apply(null, out) passes each char code as an argument on the stack. This will cause IE11 to generate an "out of stack space" error if the string is several hundred thousand characters long. Other browsers have similar limits.

The Closure compiler works around this by processing the array a chunk at a time.
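For illustration, a sketch of that chunked workaround, assuming out is the array of char codes being flattened into a string:

// Sketch: build the string in chunks to stay under engine argument-count limits.
function charCodesToString(out) {
  var CHUNK = 0x8000; // comfortably below the apply() limits of IE11 and others
  var result = '';
  for (var i = 0; i < out.length; i += CHUNK) {
    result += String.fromCharCode.apply(null, out.slice(i, i + CHUNK));
  }
  return result;
}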

Null padded string length not preserved

When dealing with certain buffers from other systems, the string is padded with null. The native implementations preserve this and simply copy the nulls over to the string, ensuring that the length is correct.

This part of the decoder in fast-text-encoding does not allow that to happen:

fast-text-encoding/text.js

Lines 149 to 151 in 227d7d2

if (byte1 === 0) {
  break; // NULL
}
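For comparison, the native decoder keeps NUL bytes in the output rather than stopping at them:

// Native behavior: trailing NUL bytes are preserved, so the length matches the input.
var padded = new Uint8Array([104, 105, 0, 0]); // "hi" plus two NUL padding bytes
var s = new TextDecoder().decode(padded);
console.log(s.length);               // 4, not 2
console.log(s === 'hi\u0000\u0000'); // true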

Could you guys please support utf-16le

Some packages that depend on this package won't work on Node.js:

RangeError: Failed to construct 'TextDecoder': The encoding label provided ('utf-16le') is invalid.
    at new k (.../node_modules/fast-text-encoding/text.min.js:1:134)
    at Object.<anonymous> (.../node_modules/rustbn.js/lib/index.asm.js:1:17613)

The reason is that the fast-text-encoding package doesn't support utf-16le.
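For reference, UTF-16LE decoding itself is little code; a minimal sketch (not the library's implementation, and without fatal/BOM handling):

// Sketch: decode UTF-16LE from a Uint8Array; surrogate pairs pass through as
// code units, which is what JavaScript strings expect.
function decodeUtf16le(bytes) {
  var view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  var result = '';
  for (var i = 0; i + 1 < bytes.byteLength; i += 2) {
    result += String.fromCharCode(view.getUint16(i, /* littleEndian */ true));
  }
  return result;
}

console.log(decodeUtf16le(new Uint8Array([0x68, 0x00, 0x69, 0x00]))); // "hi"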

feat: why not use `codePointAt`?

Overview

This is a feature request / request for clarification.

var value = string.charCodeAt(pos++);
if (value >= 0xd800 && value <= 0xdbff) {
  // high surrogate
  if (pos < len) {
    var extra = string.charCodeAt(pos);

charCodeAt gives UTF-16 code units, so you need to deal with surrogate pairs yourself.

The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charCodeAt

For UTF-8 encoding (https://en.wikipedia.org/wiki/UTF-8) you just need the Unicode code point, which is what codePointAt provides.

The codePointAt() method returns a non-negative integer that is the Unicode code point value at the given position.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt

Am I missing something? I am not an expert on this topic.

Code

I wrote an implementation that uses codePointAt instead of charCodeAt.
https://jsfiddle.net/pzh6ofnj/2/

const f = (s) => new Uint8Array([...s].map(c => c.codePointAt(0)).flatMap(x => {
  if (x < 0x80) {
    // first 128 code points need 1 byte
    return x;
  }
  if (x < 0x800) {
    // next 1920 code points need 2 bytes
    return [((x >>> 6) & 0x1F) | 0xC0, (x & 0x3F) | 0x80];
  }
  if (x < 0x10000) {
    // next 63488 (really only 61440 are in use) code points need 3 bytes
    return [((x >>> 12) & 0x0F) | 0xE0, ((x >>> 6) & 0x3F) | 0x80, (x & 0x3F) | 0x80];
  }
  // rest need 4 bytes
  return [
    ((x >>> 18) & 0x07) | 0xF0,
    ((x >>> 12) & 0x3F) | 0x80,
    ((x >>> 6) & 0x3F) | 0x80,
    (x & 0x3F) | 0x80,
  ];
}));

function main() {
  // const s = 'abcd😊efgh\n012345689\t€\r🧑🏽‍🍳helloworld!';
  const s = '$£Иह€한𐍈';

  console.log(s);
  const arr = f(s);
  console.log(Array.from(arr).map(x => x.toString(16)));

  // test against TextEncoder
  {
    const expected = new TextEncoder().encode(s);
    console.log(Array.from(expected).map(x => x.toString(16)));
    const actual = Array.from(arr);
    if (Array.from(expected).some((x, i) => x !== actual[i])) {
      throw new Error('the encoded bytes do not match');
    }
  }
  // test against TextDecoder
  {
    const actual = new TextDecoder().decode(arr);
    console.log(actual);
    if (actual !== s) {
      throw new Error(`the decoded string does not match the original: ${actual}`);
    }

  }
}

main();

19 Reasons Why It Is Not "Fast"

The concept of a performance-focused npm package is a nice sentiment, but there are many issues with this library. Let me list a few.

  1. Function.prototype.apply breaks when passed an array that is too big. In the console of Chrome 74, this limit appears to be 125833: [].push.apply([], (new Array(125833)).fill(0))
  2. Array.prototype.push + String.fromCharCode = performance bottleneck. Use binary string concatenation if you must accumulate a string.
  3. Stop reassigning the typed array target in the FastTextEncoder.prototype.encode function. Even if this does not happen often, the mere presence of reassignment forces the JIT compiler to insert extra code in case of recompilation due to potential constructor changes of target. More code means that aligning to the page line is less likely. See this SO post.
  4. In FastTextDecoder.prototype.decode, the assertion that if (byte1 === 0) break; is completely wrong because you should use a null byte ("\x00") instead.
  5. In FastTextDecoder.prototype.decode, you should try to reuse the underlying buffer instead of copying everything into a new duplicate array in case if a typed array is passed instead of an ArrayBuffer.
  6. You should support NodeJS<3.0's native Buffer and use it as an alternative to typed arrays.
  7. Stop using rest parameters. Minifiers are horrible at bloating the code size on account of them. Use the || operator instead to ensure greater standards compliance and a smaller file size at the minimal overhead of performance.
  8. In the browser, you should use Array as a shim-like fallback for TypedArrays to provide limited backwards support into IE5.5-IE9.
  9. Bring in the new and the shiny "use strict", which your minified file (the only file that actually counts) does not do.
  10. Test TextEncoder and TextDecoder separately. There is a right way to do polyfilling and a wrong way. Except with vendor-specific features, you should never assume that just because one item does not exist, the other one does not exist either.
  11. Apply SMI integer optimizations to lines 55, 59 x2, 63 x2, 67, 69, 70, 79, 80, 81 x4, 90, 93, 95, 96, 98-100, 106, 144, 148 x2, 156, 159, 160, and 163-165 to maximize performance.
  12. Use != instead of !== when comparing typeof results. The browser is able to assume that it will be a string-only comparison, thus the extra byte of !== is a waste of space.
  13. If you must use Object.defineProperty, then use try/catch around Object.defineProperty with alternative code for IE5.5-IE8. The try/catch is because Object.defineProperty exists only for DOM in IE8.
  14. Stop using new in front of Error/RangeError. It is not needed and it wastes precious space for something that is not a performance priority in the slightest.
  15. Support AMD.
  16. Add in support for Service Worker usage. Neither window, global, nor this (in strict mode) are available in Service Workers. Use self instead of window to support the Service Worker environment instead of relying on your minifier to strip the "use strict" so that this can be used (see the sketch after this list).
  17. Write to module.exports if available for drop-in use in NodeJS.
  18. Use a different minifier. Your current minified code reassigns argument variables corresponding to what were originally rest parameters, thus greatly diminishing the stack overhead optimizations that the V8 JIT compiler is able to apply to the function and delaying entry into the function.
  19. Replace "undefined" with ""+void 0 in the minified code to save 2 bytes of space with each occurrence without performance penalty.
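A self-contained sketch of items 16 and 17; FakePolyfill is a stand-in for the real class, not the library's code:

// Resolve the global scope without relying on window/this, then export for Node.
function FakePolyfill() {} // stand-in for the real TextEncoder polyfill class

var scope = typeof self !== 'undefined' ? self :     // browsers and Service Workers
            typeof global !== 'undefined' ? global : // Node.js
            Function('return this')();               // last-resort fallback

scope.TextEncoder = scope.TextEncoder || FakePolyfill;

if (typeof module !== 'undefined' && module.exports) {
  module.exports = { TextEncoder: scope.TextEncoder };
}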

I came up with my own original solution that fixed many problems, greatly increased performance, and reduced file size.

FastestSmallestTextEncoderDecoder

Why `fatal` option is not supported?

Hi,

I am wondering why the fatal option is not supported. Will setting fatal=true cause any issues when using your library as a polyfill?
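For context, this is what fatal changes in the native implementation; a polyfill that ignores it would silently produce replacement characters where native code would throw:

var bad = new Uint8Array([0xff]); // not valid UTF-8
console.log(new TextDecoder('utf-8').decode(bad)); // "\uFFFD" (replacement character)
try {
  new TextDecoder('utf-8', { fatal: true }).decode(bad);
} catch (e) {
  console.log(e instanceof TypeError); // true: a fatal decoder throws on invalid input
}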

Many thanks

IE11 doesn't work

IE11 throws "Object doesn't support property or method 'slice'" when trying to encode something:
(new TextEncoder()).encode('sample string')
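The likely cause is that IE11's typed arrays lack .slice(), while .subarray() is available there. A guarded fallback along these lines would work (a sketch, assuming the encoder trims its output buffer with slice):

// Sketch: trim a Uint8Array to `at` bytes, falling back to subarray() on IE11.
function trimTo(target, at) {
  return target.slice ? target.slice(0, at)
                      : new Uint8Array(target.subarray(0, at));
}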
