Comments (15)

jfbastien commented on August 12, 2024

I'd like it to be defined by referring to an existing external standard, instead of us reinventing something. Specifically, whether it be Unicode or what, and how escape sequences work.

from spec.

binji commented on August 12, 2024

I agree, though I'm not sure Unicode is what we want; we're not really trying to define strings here, just raw bytes, which happen to map to ASCII characters. Perhaps just the ASCII isprint characters?

http://www.cplusplus.com/reference/cctype/isprint/
For the standard ASCII character set (used by the "C" locale), printing characters are all with an ASCII code greater than 0x1f (US), except 0x7f (DEL).

Just for comparison, string literals in Go: https://golang.org/ref/spec#String_literals. Not too complex, but I'm not certain we need that functionality.
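
As a rough illustration of the isprint rule quoted above, the check is a single range test on the byte value. This is only a sketch; the function names `is_c_locale_print` and `unprintable_bytes` are made up for this example and do not come from any proposal in the thread.

```python
def is_c_locale_print(byte: int) -> bool:
    """Mimic C's isprint() in the "C" locale for a single byte value."""
    # Printable ASCII runs from 0x20 (space) through 0x7e (~).
    # Bytes below 0x20, 0x7f (DEL), and anything >= 0x80 are rejected.
    return 0x20 <= byte <= 0x7E


def unprintable_bytes(data: bytes) -> list:
    """Return the byte values in data that would need an escape (hypothetical helper)."""
    return [b for b in data if not is_c_locale_print(b)]


if __name__ == "__main__":
    print(unprintable_bytes(b"hello\n\x7f"))
```

Under such a rule every non-ASCII byte would have to be written as an escape.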

AndrewScheidecker commented on August 12, 2024

I think the data segment strings and import/export strings should be interpreted differently. A Unicode interpretation of a data segment is �. Even a printable ASCII interpretation of the data segments isn't very useful (M\00\00\00a\00\00\00r\00\00\00 etc), and they might as well just be strings of hex digits.

But even though import/export names are also semantically just sequences of bytes, it would be helpful to define an interpretation for tools that need to display them (the text format being one of them). JavaScript and other languages support Unicode identifiers, so a Unicode interpretation of the import/export names would allow tools to correctly display import/export names derived from those identifiers.
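
To make the data-segment example concrete, here is a small sketch of how bytes like M\00\00\00a\00\00\00r\00\00\00 could arise and how they look in the two notations being compared. It assumes, purely for illustration, that the segment holds a UTF-32 little-endian string; the helpers `as_escaped_string` and `as_hex` are hypothetical.

```python
def as_escaped_string(data: bytes) -> str:
    """Render bytes as printable ASCII, with two-digit hex escapes for the rest."""
    out = []
    for b in data:
        # Keep printable ASCII, except the quote and backslash themselves.
        if 0x20 <= b <= 0x7E and b not in (0x22, 0x5C):
            out.append(chr(b))
        else:
            out.append("\\%02x" % b)
    return "".join(out)


def as_hex(data: bytes) -> str:
    """Render bytes as space-separated hex digits."""
    return " ".join("%02x" % b for b in data)


if __name__ == "__main__":
    # Assumption: a UTF-32LE string, which stores 4 bytes per character.
    data = "Mar".encode("utf-32-le")
    print(as_escaped_string(data))
    print(as_hex(data))
```

Neither rendering reads as text, which is the point being made about data segments.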

kg commented on August 12, 2024

In my prior discussions with @jfbastien we both thought that data segments should not be string literals, but hex literals, along the lines of [ff 00 3a 0a 0d ...] or something, to distinguish them from text like import/export names.
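
A bracketed hex literal of that shape would be simple to parse. The sketch below assumes a syntax of whitespace-separated two-digit hex tokens inside square brackets; the thread never fixes an exact grammar, so both the syntax and the name `parse_hex_literal` are illustrative only.

```python
import string

_HEX_CHARS = set(string.hexdigits)


def parse_hex_literal(literal: str) -> bytes:
    """Parse a hypothetical '[ff 00 3a]' style literal into raw bytes."""
    body = literal.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a bracketed hex literal")
    out = bytearray()
    for token in body[1:-1].split():
        # Assumed rule: each token is exactly two hex digits.
        if len(token) != 2 or not set(token) <= _HEX_CHARS:
            raise ValueError("bad hex byte: %r" % token)
        out.append(int(token, 16))
    return bytes(out)


if __name__ == "__main__":
    print(parse_hex_literal("[ff 00 3a 0a 0d]"))
```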

rossberg commented on August 12, 2024

@kg, I imagine that in tests or examples you will potentially want to use segments to store string constants, though, so you want a readable syntax for that.

rossberg commented on August 12, 2024

As for referring to an international standard, that seems like overkill. Escapes are a trivial mechanism, and the S-expr format is not part of the proper spec anyway.

FWIW, the current implementation of characters is (from lexer.mll)

let escape = ['n''t''\\''\'''\"']
let character = [^'"''\\''\n'] | '\\'escape | '\\'hexdigit hexdigit
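
For readers who do not use ocamllex, the rule above can be approximated by a byte-level decoder. This is a sketch under assumptions, not the prototype's code: it assumes `hexdigit` means 0-9, a-f and A-F, and that `\n` and `\t` decode to newline and tab. The name `decode_string_body` is hypothetical.

```python
# Single-character escapes from the lexer rule: n, t, backslash, ' and ".
# Assumption: \n and \t decode to newline and tab respectively.
ESCAPES = {
    ord("n"): 0x0A,
    ord("t"): 0x09,
    ord("\\"): 0x5C,
    ord("'"): 0x27,
    ord('"'): 0x22,
}
HEXDIGITS = b"0123456789abcdefABCDEF"  # assumed definition of `hexdigit`


def decode_string_body(body: bytes) -> bytes:
    """Decode the text between the quotes of a string literal into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(body):
        b = body[i]
        if b in (0x22, 0x0A):
            # A raw quote or newline is excluded by the character class.
            raise ValueError("unescaped quote or newline at offset %d" % i)
        if b != 0x5C:
            out.append(b)  # any other byte is passed through unchanged
            i += 1
            continue
        rest = body[i + 1:i + 3]
        if rest and rest[0] in ESCAPES:
            out.append(ESCAPES[rest[0]])
            i += 2
        elif len(rest) == 2 and all(c in HEXDIGITS for c in rest):
            out.append(int(rest, 16))  # backslash plus two hex digits
            i += 3
        else:
            raise ValueError("bad escape at offset %d" % i)
    return bytes(out)


if __name__ == "__main__":
    print(decode_string_body(b'a\\n\\00\\ff\\"'))
```

Note that bytes at or above 0x80 fall into the pass-through branch, so the decoder imposes no encoding on them.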

jfbastien commented on August 12, 2024

> @kg, I imagine that in tests or examples you will potentially want to use segments to store string constants, though, so you want a readable syntax for that.

We'll also want non-string content, so it seems like we'd want to support both?

> As for referring to an international standard, that seems like overkill.

On the simple string format: does this support non-ASCII characters natively (without hex-escape)? I find it unfortunate to design a new language with effectively English-only support.

It may be overkill, but I know enough about character encodings to believe that what seems simple about them actually isn't.

rossberg commented on August 12, 2024

On 16 September 2015 at 18:01, JF Bastien wrote:

>> @kg, I imagine that in tests or examples you will
>> potentially want to use segments to store string constants, though, so you
>> want a readable syntax for that.
>
> We'll also want non-string content, so it seems like we'd want to support
> both?

Well, you have hex escapes to express random bytes. They are as short as
the alternatives.

>> As for referring to an international standard, that seems like overkill.
>
> On the simple string format: does this support non-ASCII characters
> natively (without hex-escape)? I find it unfortunate to design a new
> language with effectively English-only support.

Well, it allows arbitrary bytes in the input, without imposing any
particular interpretation whatsoever. That is to say, it supports any 8-bit
extension of ASCII that you want the bytes in the source to be interpreted
as, e.g. Latin-1.

> It may be overkill, but I know enough about character encodings to believe
> that what seems simple about them actually isn't
> (http://eev.ee/blog/2015/09/12/dark-corners-of-unicode/).

Yes -- which is exactly the reason why I think we do not want to get into
the business of designing anything more advanced for now. Wasn't the point
of the simple S-expression format avoiding the design of a "real" source
language? Extended character encodings seem to fall into that category.
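
The "no particular interpretation" point can be illustrated with a short sketch: the same bytes read differently depending on the encoding the reader assumes. The helper `interpret` is hypothetical and only wraps Python's standard `bytes.decode`.

```python
def interpret(raw: bytes, encoding: str) -> str:
    """Decode raw bytes under an assumed encoding, or report that they do not fit."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return "<not valid %s>" % encoding


if __name__ == "__main__":
    latin1_bytes = b"Ro\xdfberg"     # 0xdf is the sharp s in Latin-1
    utf8_bytes = b"Ro\xc3\x9fberg"   # the same name encoded as UTF-8
    for raw in (latin1_bytes, utf8_bytes):
        print(repr(interpret(raw, "latin-1")), repr(interpret(raw, "utf-8")))
```

The lexer accepts both byte sequences; which one displays correctly is decided entirely by the tool that later shows the string.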

jfbastien commented on August 12, 2024

S-expressions are our only textual format at the moment and are on a path to remain this way. I'm fine with adding Unicode support later so it's easier to get off the ground, but I strongly believe that we want to support more than ASCII, and not just with second-class hex escapes.

I see this as a basic courtesy to non-English folks, and an important way to make them feel welcome and their use case considered. We want strings because hex is hard to read as text, and the same applies to internationalization: they want Unicode because hex escapes in strings are hard to read.

binji commented on August 12, 2024

OK, so as mentioned by @AndrewScheidecker above, perhaps it is worth differentiating between import/export strings and data strings. It seems like export strings need to match JavaScript identifiers at the very least (perhaps more than that?), so maybe starting with this definition is best:
http://www.ecma-international.org/ecma-262/5.1/#sec-7.6
which references the Unicode Identifier Syntax: http://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers

For data strings we could use http://www.ecma-international.org/ecma-262/5.1/#sec-7.8.4 as a basis, which references http://www.ecma-international.org/ecma-262/5.1/#sec-6 as the definition for a source character. But this is defined in terms of UTF-16 encodings, which is definitely not what we want. The list of escape sequences seems reasonable though, and we could assume UTF-8 encoding for source instead.

MikeHolman commented on August 12, 2024

We don't really need to match JavaScript identifiers for exports, we just need to use a subset. For example we could choose not to allow identifiers to have embedded nulls.
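
A restriction of this kind is cheap to check. The sketch below rejects embedded nulls, as suggested above, and additionally assumes names must be valid UTF-8; that second rule and the function name `is_acceptable_export_name` are assumptions for illustration, not something the thread has agreed on.

```python
def is_acceptable_export_name(name: bytes) -> bool:
    """Check a hypothetical subset rule for export names."""
    if b"\x00" in name:
        return False  # no embedded nulls, per the suggestion above
    try:
        name.decode("utf-8")  # assumed extra rule: names are valid UTF-8
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for candidate in (b"memcpy", "gr\u00f6\u00dfe".encode("utf-8"), b"a\x00b", b"\xff"):
        print(candidate, is_acceptable_export_name(candidate))
```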

AndrewScheidecker commented on August 12, 2024

If we only allow non-ASCII characters between quotes, the lexer only needs to know enough about Unicode to find escape codes and quotes. I'm sure that's more complicated than it seems, but surely better than UAX-31...

rossberg commented on August 12, 2024

On 16 September 2015 at 20:29, JF Bastien wrote:

> S-expressions are our only textual format at the moment and are on a path
> to remain this way. I'm fine with adding Unicode support later so it's
> easier to get off the ground, but I strongly believe that we want to
> support more than ASCII, and not just with second-class hex escapes.
>
> I see this as a basic courtesy to non-English folks, and an important way
> to make them feel welcome and their use case considered. We want strings
> because hex is hard to read as text, and the same applies to
> internationalization: they want Unicode because hex escapes in strings are
> hard to read.

Being a native speaker of a language with fancy characters myself, and even
carrying one in my very name ("Roßberg"), I don't particularly mind. :)

That being said, the prototype is completely agnostic to the interpretation
of strings. Since it allows any byte, and the few characters it recognises
specially are disjoint from any Unicode surrogate code, you can readily
feed it source with UTF8-encoded string literals and everything should be
fine.
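
The property relied on here is presumably that every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so it can never be mistaken for the quote, backslash, or newline that the lexer treats specially. The sketch below checks that property and shows a byte-level scan for the closing quote; both function names are hypothetical and the scanner is a simplification, not the prototype's lexer.

```python
def multibyte_bytes_avoid_ascii(text: str) -> bool:
    """Check that no multi-byte UTF-8 sequence in text contains an ASCII-range byte."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


def find_closing_quote(source: bytes, start: int) -> int:
    """Return the offset of the unescaped quote at or after start, or -1."""
    i = start
    while i < len(source):
        b = source[i]
        if b == 0x5C:
            i += 2  # skip the backslash and the byte it escapes
        elif b == 0x22:
            return i
        else:
            i += 1  # ordinary byte, including any UTF-8 lead or continuation byte
    return -1


if __name__ == "__main__":
    src = '"Roßberg \\" ç" rest'.encode("utf-8")
    end = find_closing_quote(src, 1)
    print(src[1:end].decode("utf-8"))
```

The scan never decodes anything, which matches the observation that a lexer only needs to recognise escapes and quotes inside a string.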

jfbastien commented on August 12, 2024

> Being a native speaker of a language with fancy characters myself, and even carrying one in my very name ("Roßberg"), I don't particularly mind. :)

Amen, my name's ç thanks you!

rossberg commented on August 12, 2024

Should be addressed by #253.
