Comments (3)

tajmone commented on June 19, 2024

PureBasic's string functions use UCS-2 encoding in Unicode mode according to the official documentation. But PureBasic uses the API functions of the operating systems for displaying the strings and these all (Windows, Linux and macOS) interpret the PureBasic string as UTF-16, so programs written in PureBasic can display all Unicode characters.

I'm not quite sure about that, because internally PB will treat all strings as UCS-2 (i.e. 16-bit fixed-width encoding) so if you pass to PB a string in UTF-16 that encodes beyond the range supported by UCS-2 then PB is most likely going to corrupt it since it doesn't cater for variable-width encoding (like UTF-16 does). The problem would affect any string manipulation command, but possibly also under-the-hood string processing.

The whole reason why PB went for UCS-2 is to keep things simple, i.e. to have a fixed correspondence of one character = two bytes, an assumption that doesn't hold in UTF-16. This could lead to incorrect character counts when a UTF-16 string contains a character outside the Basic Multilingual Plane, which UTF-16 encodes as a surrogate pair (two 16-bit code units, i.e. four bytes).
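To make the counting concern concrete, here is a minimal sketch in Python (not PureBasic) comparing the number of Unicode code points in a string with the number of 16-bit code units in its UTF-16 encoding. The helper name `utf16_code_units` is made up for illustration; the assumption is that a fixed-width UCS-2 view would report the code-unit count.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of `text` when encoded as UTF-16 (little endian)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


# 'a' followed by U+1F600, a character outside the Basic Multilingual Plane
sample = "a\U0001F600"
units = utf16_code_units(sample)

print(len(sample))             # number of Unicode code points
print(len(units))              # number of 16-bit code units (the UCS-2 view)
print([hex(u) for u in units])
```

The two counts differ by one for every character that needs a surrogate pair, which is exactly the mismatch described above.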

So, although any UCS-2 string will be handled correctly by OS functions that accept UTF-16 strings, the reverse might not be true, i.e. the claim that "PureBasic can display all Unicode characters" might not hold.

If that's an important assumption, it might be better to test it out first.

from regex-engine.

SicroAtGit commented on June 19, 2024

I'm not quite sure about that, because internally PB will treat all strings as UCS-2 (i.e. 16-bit fixed-width encoding) so if you pass to PB a string in UTF-16 that encodes beyond the range supported by UCS-2 then PB is most likely going to corrupt it since it doesn't cater for variable-width encoding (like UTF-16 does). The problem would affect any string manipulation command, but possibly also under-the-hood string processing.

I am aware of that. By "PureBasic can display all Unicode characters" I meant that the string is not manipulated beforehand.

Yes, often you want to manipulate a string before displaying it, e.g. to display only part of it, so your objection is valid. But then the programmer still has the option to write his own string manipulation functions that handle UTF-16 surrogate pairs correctly. With these functions he could also bring partial UTF-16 support to the RegEx engine, but they would not help with the RegEx character classes (e.g. \w) or with case-insensitive mode. Therefore, I think it would be better if the RegEx engine supported UTF-16 natively, even if the PureBasic string manipulation functions have problems with it.
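As a rough illustration of what such a custom function might look like, here is a sketch in Python (not PureBasic) of a surrogate-aware "left" operation on a list of 16-bit code units. The names `safe_left`, `is_high_surrogate` and `is_low_surrogate` are hypothetical; only the surrogate ranges (0xD800 to 0xDBFF and 0xDC00 to 0xDFFF) come from the UTF-16 definition.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def safe_left(units, count):
    """Return the first `count` characters of a UTF-16 code-unit list,
    keeping every well-formed surrogate pair together."""
    result = []
    i = 0
    taken = 0
    while i < len(units) and taken < count:
        if (is_high_surrogate(units[i]) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            result.extend(units[i:i + 2])  # a pair counts as one character
            i += 2
        else:
            result.append(units[i])
            i += 1
        taken += 1
    return result


units = [0x61, 0xD83D, 0xDE00, 0x62]          # 'a', U+1F600, 'b'
print([hex(u) for u in units[:2]])            # naive slice cuts the pair in half
print([hex(u) for u in safe_left(units, 2)])  # pair-aware slice keeps it whole
```

The same pair-detection step would be needed in any other operation that indexes into the string, such as a mid or reverse function.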

If that's an important assumption, it might be better to test it out first.

There are already tests confirming that UTF-16 surrogate pairs are treated as two valid UCS-2 characters by PureBasic string functions and that PB displays the UTF-16 surrogate pair as a single Unicode character (tested on Windows, Linux, macOS):

tajmone commented on June 19, 2024

There are already tests confirming that UTF-16 surrogate pairs are treated as two valid UCS-2 characters by PureBasic string functions and that PB displays the UTF-16 surrogate pair as a single Unicode character (tested on Windows, Linux, macOS):

Thanks, that link was very helpful. I always wanted to carry out similar tests myself, to ascertain firsthand how the UCS-2 vs UTF-16 encoding actually works, but never got around to it.

I guess the flip side of the coin is that you never know for sure when and if strings are being manipulated behind the scenes in the final program, since many internal commands might be doing under-the-hood processing that could compromise a UTF-16 string.

But then, if Unicode characters beyond UCS-2 can be represented via escape sequences, it should be possible to handle them correctly via custom procedures, even in the RegEx engine, since PB would treat the escape as a plain sequence of ASCII characters.
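A sketch of that idea in Python (not PureBasic): the pattern stays pure ASCII, and a custom procedure turns each escape into a code point and then into a UTF-16 surrogate pair. The `\x{...}` syntax is an assumption borrowed from PCRE-style engines, since the thread doesn't say which escape syntax would be used, and the function names are hypothetical.

```python
import re

# Assumed escape syntax: \x{HEX}, one to six hex digits
ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]{1,6})\}")


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def escapes_to_code_points(pattern):
    """Return the code points named by the escapes found in an ASCII pattern."""
    return [int(match.group(1), 16) for match in ESCAPE.finditer(pattern)]


pattern = r"smile: \x{1F600}"
for cp in escapes_to_code_points(pattern):
    high, low = to_surrogate_pair(cp)
    print(hex(cp), hex(high), hex(low))
```

Because the escape itself contains only ASCII characters, fixed-width string functions can slice and copy the pattern without ever seeing a surrogate pair.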
