Comments (15)
I'd like it to be defined by referring to an existing external standard, instead of us reinventing something. Specifically, the definition should state which character set applies (Unicode or otherwise) and how escape sequences work.
from spec.
I agree, though I'm not sure Unicode is what we want; we're not really trying to define strings here, just raw bytes, which happen to map to ASCII characters. Perhaps just the ASCII isprint characters?
http://www.cplusplus.com/reference/cctype/isprint/
For the standard ASCII character set (used by the "C" locale), printing characters are all with an ASCII code greater than 0x1f (US), except 0x7f (DEL).
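As a rough illustration of the isprint rule quoted above, the check reduces to a single range test on the byte value. This is a minimal sketch: the name `is_ascii_print` is hypothetical, and treating bytes above 0x7f as non-printable is an assumption, since the quoted definition only covers the ASCII range.

```python
def is_ascii_print(byte: int) -> bool:
    """Return True for bytes in the printable ASCII range 0x20..0x7e.

    Mirrors the quoted rule: greater than 0x1f (US), except 0x7f (DEL).
    Bytes above 0x7f are outside ASCII and are rejected here (assumption).
    """
    return 0x20 <= byte <= 0x7E


# All byte values that would be allowed unescaped under this rule.
printable = bytes(b for b in range(256) if is_ascii_print(b))
```

Under this rule a space (0x20) is printable, while tab, newline, and DEL would each need an escape.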
Just for comparison, string literals in Go: https://golang.org/ref/spec#String_literals. Not too complex, but I'm not certain we need that functionality.
I think the data segment strings and import/export strings should be interpreted differently. A Unicode interpretation of a data segment is �. Even a printable ASCII interpretation of the data segments isn't very useful (M\00\00\00a\00\00\00r\00\00\00 etc.), and they might as well just be strings of hex digits.
But even though import/export names are also semantically just sequences of bytes, it would be helpful to define an interpretation for tools that need to display them (the text format being one of them). JavaScript and other languages support Unicode identifiers, so a Unicode interpretation of the import/export names would allow tools to correctly display import/export names derived from those identifiers.
In my prior discussions with @jfbastien we both thought that data segments should not be string literals, but hex literals, along the lines of [ff 00 3a 0a 0d ...] or something, to distinguish them from text like import/export names.
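To make the hex-literal idea concrete, here is a minimal sketch of how a tool might read such a literal. The exact syntax was left open in the comment ("or something"), so the bracketed, whitespace-separated two-digit form and the name `parse_hex_literal` are assumptions for illustration only.

```python
import string


def parse_hex_literal(text: str) -> bytes:
    """Parse a literal such as "[ff 00 3a]" into raw bytes.

    Assumed syntax: square brackets around whitespace-separated
    pairs of hex digits, one pair per byte.
    """
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected a bracketed list of hex byte pairs")
    out = bytearray()
    for pair in text[1:-1].split():
        if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
            raise ValueError(f"expected two hex digits, got {pair!r}")
        out.append(int(pair, 16))
    return bytes(out)
```

For example, `parse_hex_literal("[ff 00 3a 0a 0d]")` yields the five bytes 0xff, 0x00, 0x3a, 0x0a, 0x0d.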
@kg, I imagine that in tests or examples you will potentially want to use segments to store string constants, though, so you want a readable syntax for that.
As for referring to an international standard, that seems like overkill. Escapes are a trivial mechanism, and the S-expr format is not part of the proper spec anyway.
FWIW, the current implementation of characters is (from lexer.mll)
let escape = ['n''t''\\''\'''\"']
let character = [^'"''\\''\n'] | '\\'escape | '\\'hexdigit hexdigit
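For readers who don't use ocamllex, the two rules above can be sketched as a small decoder over the bytes between the quotes. This is an illustration, not the prototype's code: the name `decode_string_body` is hypothetical, and since `hexdigit` is not defined in the excerpt, it is assumed here to mean 0-9, a-f, and A-F.

```python
# Assumed definition of `hexdigit`; not shown in the quoted excerpt.
HEX = b"0123456789abcdefABCDEF"

# The `escape` class: n, t, backslash, single quote, double quote.
SIMPLE = {ord("n"): 0x0A, ord("t"): 0x09, ord("\\"): 0x5C,
          ord("'"): 0x27, ord('"'): 0x22}


def decode_string_body(body: bytes) -> bytes:
    """Decode the bytes between the quotes of a string literal."""
    out = bytearray()
    i = 0
    while i < len(body):
        b = body[i]
        if b in (0x22, 0x0A):  # raw '"' or newline is not a `character`
            raise ValueError(f"unescaped byte 0x{b:02x} at offset {i}")
        if b != 0x5C:  # any other byte passes through unchanged
            out.append(b)
            i += 1
            continue
        rest = body[i + 1:i + 3]  # up to two bytes after the backslash
        if rest[:1] and rest[0] in SIMPLE:
            out.append(SIMPLE[rest[0]])
            i += 2
        elif len(rest) == 2 and rest[0] in HEX and rest[1] in HEX:
            out.append(int(rest.decode("ascii"), 16))
            i += 3
        else:
            raise ValueError(f"bad escape at offset {i}")
    return bytes(out)
```

With this reading, the data-segment example from earlier in the thread, M\00\00\00a, decodes to the byte for M, three zero bytes, and the byte for a. Bytes of 0x80 and above are copied through untouched, with no interpretation imposed on them.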
> @kg, I imagine that in tests or examples you will potentially want to use segments to store string constants, though, so you want a readable syntax for that.
We'll also want non-string content, so it seems like we'd want to support both?
> As for referring to an international standard, that seems like overkill.
On the simple string format: does this support non-ASCII characters natively (without hex-escape)? I find it unfortunate to design a new language with effectively English-only support.
It may be overkill, but I know enough about character encodings to believe that what seems simple about them actually isn't.
On 16 September 2015 at 18:01, JF Bastien wrote:
> > @kg, I imagine that in tests or examples you will potentially want to use segments to store string constants, though, so you want a readable syntax for that.
> We'll also want non-string content, so it seems like we'd want to support both?
Well, you have hex escapes to express random bytes. They are as short as the alternatives.
> > As for referring to an international standard, that seems like overkill.
> On the simple string format: does this support non-ASCII characters natively (without hex-escape)? I find it unfortunate to design a new language with effectively English-only support.
Well, it allows arbitrary bytes in the input, without imposing any particular interpretation whatsoever. That is to say, it supports any 8-bit extension of ASCII that you want the bytes in the source to be interpreted as, e.g. Latin-1.
> It may be overkill, but I know enough about character encodings to believe that what seems simple about them actually isn't (http://eev.ee/blog/2015/09/12/dark-corners-of-unicode/).
Yes -- which is exactly the reason why I think we do not want to get into the business of designing anything more advanced for now. Wasn't the point of the simple S-expression format avoiding the design of a "real" source language? Extended character encodings seem to fall into that category.
S-expressions are our only textual format at the moment and are on a path to remain this way. I'm fine with adding Unicode support later so it's easier to get off the ground, but I strongly believe that we want to support more than ASCII, and not just with second-class hex escapes.
I see this as a basic courtesy to non-English folks, and an important way to make them feel welcome and their use case considered. We want strings because hex is hard to read as text, and the same applies to internationalization: they want Unicode because hex-escapes in strings are hard to read.
OK, so as mentioned by @AndrewScheidecker above, perhaps it is worth differentiating between import/export strings and data strings. It seems like export strings need to match JavaScript identifiers at the very least (perhaps more than that?), so maybe starting with this definition is best:
http://www.ecma-international.org/ecma-262/5.1/#sec-7.6
which references the Unicode Identifier Syntax: http://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers
For data strings we could use http://www.ecma-international.org/ecma-262/5.1/#sec-7.8.4 as a basis, which references http://www.ecma-international.org/ecma-262/5.1/#sec-6 as the definition for a source character. But this is defined in terms of UTF-16 encodings, which is definitely not what we want. The list of escape sequences seems reasonable though, and we could assume UTF-8 encoding for source instead.
We don't really need to match JavaScript identifiers for exports, we just need to use a subset. For example we could choose not to allow identifiers to have embedded nulls.
If we only allow non-ASCII characters between quotes, the lexer only needs to know enough about Unicode to find escape codes and quotes. I'm sure that's more complicated than it seems, but surely better than UAX-31...
On 16 September 2015 at 20:29, JF Bastien wrote:
> S-expressions are our only textual format at the moment and are on a path to remain this way. I'm fine with adding Unicode support later so it's easier to get off the ground, but I strongly believe that we want to support more than ASCII, and not just with second-class hex escapes.
> I see this as a basic courtesy to non-English folks, and an important way to make them feel welcome and their use case considered. We want strings because hex is hard to read as text, and the same applies to internationalization: they want Unicode because hex-escapes in strings are hard to read.
Being a native speaker of a language with fancy characters myself, and even carrying one in my very name ("Roßberg"), I don't particularly mind. :)
That being said, the prototype is completely agnostic to the interpretation of strings. Since it allows any byte, and the few characters it recognises specially are disjoint from any Unicode surrogate code, you can readily feed it source with UTF-8-encoded string literals and everything should be fine.
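Stated in UTF-8 terms, the property this relies on is that every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for an ASCII quote, backslash, or newline. A minimal sketch of that check follows; the names `SPECIAL` and `multibyte_bytes_are_high` are hypothetical, and the set of special bytes is an assumption based on the lexer excerpt quoted earlier in the thread.

```python
# Bytes the lexer treats specially inside a string (assumed from the
# quoted grammar): double quote, backslash, newline.
SPECIAL = {0x22, 0x5C, 0x0A}


def multibyte_bytes_are_high(text: str) -> bool:
    """Check that every byte of each multi-byte character is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


# "ß" is encoded as two bytes, neither of which is a special byte.
eszett = "ß".encode("utf-8")
no_collision = SPECIAL.isdisjoint(eszett)
```

A byte-oriented lexer therefore only ever sees quotes and backslashes that were really written as such in the source, whatever non-ASCII text surrounds them.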
> Being a native speaker of a language with fancy characters myself, and even carrying one in my very name ("Roßberg"), I don't particularly mind. :)
Amen, my name's ç thanks you!
Should be addressed by #253.