krakjoe / ustring
UnicodeString for PHP7
License: Other
What do you think about making strings immutable? I.e. methods would always return the result without modifying the original object.
It would make much more sense IMO.
For example:
if ($str->toUpper() === 'HELLO') {
// ...
}
Here $str would be modified, which is confusing.
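To make the proposal concrete, here is a minimal sketch of an immutable wrapper. The class name ImmutableText is hypothetical and is not part of the extension; it assumes the mbstring extension is available for case mapping. Every method returns a new object and leaves the receiver untouched.

```php
<?php
// Hypothetical illustration only; not the extension's UString class.
final class ImmutableText
{
    private $value;

    public function __construct(string $value)
    {
        $this->value = $value;
    }

    // Returns a new instance; $this is never modified.
    public function toUpper(): self
    {
        return new self(mb_strtoupper($this->value, 'UTF-8'));
    }

    public function __toString(): string
    {
        return $this->value;
    }
}

$str = new ImmutableText('hello');
$upper = $str->toUpper();
// $str still holds 'hello'; only $upper holds the upper-cased text.
```

With this design, a comparison on the result of toUpper() cannot change what $str holds as a side effect.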
Because this would be really useful.
Hi,
The indentation style of this project is starting to look a bit funky, so can we agree on a crude coding standard we should be following?
I'm used to php-src style (tabs in most places), but I'm willing to compromise on a sane alternative.
C:\php-sdk\php70dev>nmake php_ustring.dll
Microsoft (R) Program Maintenance Utility Version 11.00.61030.0
Copyright (C) Microsoft Corporation. All rights reserved.
...
ustring.obj : error LNK2001: unresolved external symbol _zend_new_interned_string
ustring.obj : error LNK2001: unresolved external symbol _std_object_handlers
ustring.obj : error LNK2001: unresolved external symbol _compiler_globals
C:\php-sdk\php70dev\Release\php_ustring.dll : fatal error LNK1120: 3 unresolved externals
NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\BIN\link.exe"' : return code '0x460'
Stop.
I can fix it here later, but just in case you are on it:
In php_ustring_cast, Z_STR_P(zwrite) = STR_ALLOC(length+1, 0); already sets the length correctly, and it should not be overridden afterwards. Overriding it leads to a crash (the wrong buffer size is given to ICU).
One important thing that needs doing is making sure indexOf, split, startsWith and so on work properly with empty strings, i.e. pretending there's one between each actual codepoint. Otherwise it's harder to write userland string-handling functions, because you need to work around incorrectly-implemented boundary cases :(
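As a rough userland sketch of those boundary cases, the helpers below treat the empty string as present at every codepoint boundary. The function names (cp_list, cp_index_of, cp_starts_with, cp_split) are made up for illustration and are not the extension's API; the sketch assumes valid UTF-8 input and non-negative offsets.

```php
<?php
// Hypothetical helpers; names and signatures are illustrative only.

// Split a valid UTF-8 string into a list of codepoints.
function cp_list(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

// Codepoint index of $needle at or after $offset, or false if absent.
// An empty needle is found at every boundary, so it matches at $offset.
function cp_index_of(string $haystack, string $needle, int $offset = 0)
{
    $h = cp_list($haystack);
    $n = cp_list($needle);
    $last = count($h) - count($n);
    for ($i = $offset; $i <= $last; $i++) {
        if (array_slice($h, $i, count($n)) === $n) {
            return $i;
        }
    }
    return false;
}

// Every string starts with the empty string.
function cp_starts_with(string $s, string $prefix): bool
{
    $p = cp_list($prefix);
    return array_slice(cp_list($s), 0, count($p)) === $p;
}

// Splitting on the empty string yields one piece per codepoint.
function cp_split(string $s, string $delimiter): array
{
    return $delimiter === '' ? cp_list($s) : explode($delimiter, $s);
}
```

Because the empty needle goes through the same comparison loop as any other needle, the boundary behaviour falls out of the general case rather than needing special-case branches.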
Hello :-),
The Hoa\String library provides a class, called Hoa\String\String, that helps to manipulate UTF-8 strings. Its API was designed over several months. It could be a source of inspiration for your work.
If you want some clarifications, ping me.
Since this doesn't operate on grapheme clusters, we should refer to dealing with codepoints, not characters.
In particular, documentation comments need changing, and charAt should be codepointAt.
ICU supports this in UnicodeString, but our API doesn't. We need to allow specifying the locale, because toUpper and toLower will behave differently for different locales, e.g. in Turkish, I becomes ı in lower case, not i.
C'mon, strings gotta have one. Either split or explode, I'd go with the former as it's a more common name in other languages, and it's the verb the manual uses to describe what explode does.
I'd suggest this signature (pseudo-code):
UString::split(UString $delimiter, int $limit = NULL): Array<UString>
The optional $limit parameter is a number. If specified, then the string is only split $limit times. This is a really useful feature.
Examples:
(new UString("1,2,3,4"))->split(new UString(","));
// => ["1", "2", "3", "4"]
(new UString("1,2,3,4"))->split(new UString(","), 1);
// => ["1", "2,3,4"]
(new UString("1,2,3,4"))->split(new UString(","), 2);
// => ["1", "2", "3,4"]
(new UString(",1,2,,3,4,"))->split(new UString(","));
// => ["", "1", "2", "", "3", "4", ""]
One way to implement it (in Game Maker Language, it's C-like but strings indexed from 1): https://github.com/Medo42/Gang-Garrison-2/blob/master/Source/gg2/Scripts/split.gml
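A minimal userland sketch of the proposed semantics, using plain PHP strings rather than UString objects. The function name split_limited is hypothetical. It builds on explode(), whose limit argument caps the number of pieces rather than the number of splits, hence the + 1. Byte-wise matching is safe here on the assumption that both arguments are valid UTF-8.

```php
<?php
// Hypothetical helper illustrating the proposed $limit semantics.
function split_limited(string $subject, string $delimiter, ?int $limit = null): array
{
    if ($delimiter === '') {
        throw new InvalidArgumentException('Delimiter must not be empty');
    }
    if ($limit === null) {
        // No limit: split at every occurrence of the delimiter.
        return explode($delimiter, $subject);
    }
    // explode() limits the number of pieces, so N splits means N + 1 pieces.
    // A non-negative $limit is assumed.
    return explode($delimiter, $subject, $limit + 1);
}
```

Under this sketch a limit of 0 performs no splits and returns the whole string as a single piece.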
I would like to see the API made more future-proof with respect to other encodings/charsets.
Rename [get|set][Default]Codepage to [get|set][Default]Encoding
Drop the U from UString (or use a different name)
$german = new UString("T\xD4st", 'ISO-8859-15');
Thoughts?
Currently, string reversal works on code points, it doesn't care what kind. So it won't reverse strings containing combining characters properly.
We could quite simply implement the Missy Elliot algorithm that @mathiasbynens came up with, and then it'd do what people expect.
Thoughts?
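For comparison, here is a sketch of a different technique from the one named above: instead of swapping combining marks before a codepoint reversal, it splits the string into extended grapheme clusters with PCRE's \X and reverses those. The function name reverse_graphemes is hypothetical, and valid UTF-8 input is assumed.

```php
<?php
// Hypothetical helper: reverse by grapheme cluster, not by codepoint.
function reverse_graphemes(string $s): string
{
    // \X matches one extended grapheme cluster (base character plus its combining marks).
    preg_match_all('/\X/u', $s, $matches);
    return implode('', array_reverse($matches[0]));
}

// "mañana" written with a combining tilde (U+0303) after the first "n".
$reversed = reverse_graphemes("man\u{0303}ana");
```

A plain codepoint reversal of the same input would move the tilde onto the neighbouring "a", whereas here it stays attached to the "n".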
stuff already exists in the class Normalizer but would be handy to have it in this class as well. could look like JS's normalize()
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
echo u($str)->normalize(UString::NFKC)->subString(0, 4);
if it's kept, it should pad to the "visible width", ergo should take account of combining glyphs
which characters will trim() trim?
whitespace is very subjective and different in every language, including unicode
i would use C#'s whitespace definition as default (https://msdn.microsoft.com/en-us/library/t809ektx(v=vs.110).aspx) plus the null byte as in php's trim()
but i would also add an optional parameter "$charset" so that the behavior would be completely customizable
UString::trim(UString $charset = null)
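A rough sketch of how such a trim could behave, written against plain PHP strings. The function name utrim is hypothetical. The default set only approximates the definition suggested above: Unicode separator categories (\p{Z}), U+0009 to U+000D, U+0085, plus the null byte. Passing a string of characters replaces the default set entirely.

```php
<?php
// Hypothetical helper; the default set is an approximation, not a spec.
function utrim(string $s, ?string $chars = null): string
{
    if ($chars === null) {
        // Separators (Zs, Zl, Zp), TAB..CR, NEL, and NUL.
        $class = '\p{Z}\x{0009}-\x{000D}\x{0085}\x{0000}';
    } elseif ($chars === '') {
        // Nothing to trim.
        return $s;
    } else {
        // Escape the caller's characters so they are safe inside [...].
        $class = preg_quote($chars, '/');
    }
    // Strip a leading run and a trailing run of characters from the set.
    return preg_replace('/^[' . $class . ']+|[' . $class . ']+\z/u', '', $s);
}
```

For example, utrim("xxhixx", "x") trims only the character x and leaves any whitespace in place.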
they are redundant with substring(). there are better ways to avoid unnecessary copies [stringbuilder is one, or interning], provided that we care about this kind of optimization