On Windows, character input was traditionally done via UTF-16 code units. The result of operating this way is that any code point that requires a surrogate pair in UTF-16 will come as two user input events instead of one. There’s a newer technique that lets you take an entire code point at once for applications that support it, but it’s a mixed bag whether the IMEs that are sending the input in the first place will operate in that way. I have two ways of entering emoji, for example: Windows 10’s now-built-in emoji picker (Win+.), and my compose key, using WinCompose. The emoji picker is capable of sending both surrogates at once, while WinCompose sends the surrogates individually. (I’m not certain about other platforms and their IME techniques; I think they’re probably all safe from this gotcha.)
The result is this: you cannot trust the value of an <input> to be a legal UTF-16 string at any given point; it should (should) eventually be a legal Unicode string, but in the meantime it is, just like any DOMString, merely a sequence of UTF-16 code units.
This is the effect of this on the todomvc example: when I type 😕 into the #new-todo input via WinCompose, it actually arrives as two input events, one with the high surrogate 0xD83D, and then one with the low surrogate 0xDE15, which together make U+1F615. (With the emoji picker it comes through in one event, so the bug does not occur.)
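To make the two code units concrete, here is a minimal sketch using only the Rust standard library. The helper name `utf16_units` is mine, purely for illustration; it is not part of the todomvc example.

```rust
/// Split a Unicode scalar value into its UTF-16 code units:
/// one unit for a BMP character, a surrogate pair otherwise.
/// (Hypothetical helper, for illustration only.)
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

fn main() {
    // U+1F615 lies outside the BMP, so it needs a surrogate pair.
    let units = utf16_units('\u{1F615}');
    // With an IME like WinCompose, each unit can arrive as its own input event.
    for u in &units {
        println!("{:#06X}", u);
    }
}
```

Each printed unit corresponds to one of the input events described above.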
The first one triggers an input
event on the <input>
element. The code fetches the value, finds it to be "\ud83d"
(in JavaScript terms), then tries to turn it into a UTF-8 string for Rust, and encountering an unmatched surrogate replaces it with the replacement character, ๏ฟฝ
. Then, because the binding is two-way, it writes ๏ฟฝ
back to the text input.
Then the second logical keystroke is processed, and the low surrogate appended to the ๏ฟฝ
, and then through Rust again, and so "\ufffd\ude15"
becomes "\ufffd\ufffd"
.
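The sequence of events can be modelled outside the browser. This is only a sketch under assumptions: `String::from_utf16_lossy` stands in for whatever lossy conversion the binding layer actually performs, and `round_trip` and `type_units` are hypothetical names, not functions from the todomvc code.

```rust
/// Hypothetical model of one pass through the two-way binding:
/// read the DOM value (UTF-16), convert it lossily to a Rust String
/// (unpaired surrogates become U+FFFD), then write it back as UTF-16.
fn round_trip(dom_value: &[u16]) -> Vec<u16> {
    String::from_utf16_lossy(dom_value).encode_utf16().collect()
}

/// Simulate an IME delivering code units one at a time, with the
/// binding round-tripping the whole value after every input event.
fn type_units(units: &[u16]) -> String {
    let mut dom: Vec<u16> = Vec::new();
    for &u in units {
        dom.push(u); // the browser appends the new code unit
        dom = round_trip(&dom); // the binding rewrites the value
    }
    String::from_utf16_lossy(&dom)
}

fn main() {
    // High surrogate first, then low surrogate, as WinCompose sends them.
    let result = type_units(&[0xD83D, 0xDE15]);
    println!("{}", result.escape_unicode());
}
```

If both surrogates are present before the first round trip, the pair survives intact, which matches the emoji picker case.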
End result: ๐
became ๏ฟฝ๏ฟฝ
because of the combination of two-way bindings and the use of UTF-8 instead of WTF-8.
This particular case can be resolved by killing off the altogether unnecessary and inefficient two-way binding of the value (just read and reset the value at submission time; you have a handle to the DOM node) and leaving the browser to sort it out. But it’s indicative of a broader class of bug that will generally affect few people (not many people use a compose key on Windows), yet could be catastrophic for e.g. Chinese users, depending on the IME they’re using.
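As a rough sketch of why reading only at submission time sidesteps the problem, again modelled outside the browser: the `submit` function below is hypothetical, and the `Vec<u16>` merely stands in for the value the browser holds.

```rust
use std::string::FromUtf16Error;

/// Hypothetical submit handler: convert the DOM value (UTF-16 code units)
/// to a Rust String once, strictly, rather than after every input event.
fn submit(dom_value: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(dom_value)
}

fn main() {
    // The browser accumulates code units untouched; nothing is written back.
    let mut dom: Vec<u16> = Vec::new();

    dom.push(0xD83D); // first input event: a lone high surrogate
    let halfway_is_err = submit(&dom).is_err();

    dom.push(0xDE15); // second input event completes the pair
    let done = submit(&dom);

    println!("halfway is error: {}", halfway_is_err);
    println!("final is ok: {}", done.is_ok());
}
```

The transient unmatched surrogate never passes through a lossy conversion, so by the time the value is read it is a well-formed string.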
I think this is the first time I’ve ever come across a thing on the web that didn’t cope with transient unmatched surrogates. It’s not something that’s ever likely to trip you up in JavaScript, but it’s a problem for wasm stuff.