Comments (5)
Do I see it correctly that even utf8ncpy works with bytes (code units), too, instead of codepoints? That'd be a showstopper for me, unfortunately.
I think I'd rather add utf8ccmp (codepoint compare?) - since we've already got n as a notation for bytes as a relic from mimicking string.h. Thoughts?
FWIW, I'd vote for a breaking change, perhaps accompanied by a filename change (like utf8str.h or u8str.h etc.), and do away with the legacy byte semantics for n, in favor of codepoints by default. (Or always? Are there compelling use cases when you'd want to iterate a UTF-8 string by byte?)
from utf8.h.
I think your arguments are probably right, the more I've considered this. My hesitation has been that the n is generally used in strn* functions to say 'hey, I only have this many bytes in this buffer!'. We can always add a b suffix like utf8b to mean bytes.
I can't commit to a timescale for this, but I'll try and work out a plan to do it.
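To illustrate the byte-budget reading of n described above, here is a minimal sketch of how a byte limit can be honored without cutting a codepoint in half. The helper name sketch_prefix_within_bytes is hypothetical and is not part of utf8.h; the sketch assumes the input is valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of utf8.h.
 * Returns the length in bytes of the longest prefix of s that fits in
 * max_bytes and ends on a codepoint boundary.
 * Assumes s is valid, NUL-terminated UTF-8. */
static size_t sketch_prefix_within_bytes(const char *s, size_t max_bytes) {
    size_t len;
    assert(s != NULL);
    len = strlen(s);
    if (len <= max_bytes) {
        return len; /* the whole string fits */
    }
    len = max_bytes;
    /* s[len] is the first byte that would be cut off. While it is a
     * continuation byte (10xxxxxx), the cut falls inside a codepoint,
     * so back up to the start of that codepoint. */
    while (len > 0 && ((unsigned char)s[len] & 0xC0) == 0x80) {
        --len;
    }
    return len;
}
```

With the string "h\xC3\xA9llo" ("héllo", 6 bytes), a budget of 2 bytes yields a prefix of 1 byte, because the second byte is the first half of the two-byte "é".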
from utf8.h.
I think I'd rather add utf8ccmp (codepoint compare?) - since we've already got n as a notation for bytes as a relic from mimicking string.h. Thoughts?
from utf8.h.
Ahh, good point, what I suggested would break existing code that uses the header. Yes, I think your utf8ccmp is a good solution. I think many of the utf8n* functions could use a utf8c* version, though it may not make sense in all cases. Thanks for the speedy reply.
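As a rough idea of what a codepoint-limited compare could look like, here is a minimal sketch. The function name sketch_compare_codepoints is hypothetical; it is not the proposed utf8ccmp and not part of utf8.h. It assumes both inputs are valid, NUL-terminated UTF-8, and relies on the fact that byte-wise order of valid UTF-8 matches codepoint order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of utf8.h.
 * Compares at most the first n codepoints of a and b.
 * Returns <0, 0 or >0, in the style of strncmp.
 * Assumes both strings are valid, NUL-terminated UTF-8. */
static int sketch_compare_codepoints(const char *a, const char *b, size_t n) {
    size_t seen = 0; /* codepoints started so far */
    assert(a != NULL && b != NULL);
    for (;; ++a, ++b) {
        unsigned char ca = (unsigned char)*a;
        unsigned char cb = (unsigned char)*b;
        /* All earlier bytes were equal, so both strings reach codepoint
         * boundaries at the same offsets; checking ca is enough. */
        if ((ca & 0xC0) != 0x80) {
            if (seen == n) {
                return 0; /* n full codepoints compared equal */
            }
            ++seen;
        }
        if (ca != cb) {
            return ca < cb ? -1 : 1;
        }
        if (ca == '\0') {
            return 0; /* both strings ended */
        }
    }
}
```

For "h\xC3\xA9" ("hé") and "h\xC3\xA8" ("hè"), a byte-limited compare over 2 bytes sees only "h" and the shared lead byte 0xC3 and reports equality, whereas this sketch with n = 2 compares both full codepoints and reports a difference.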
from utf8.h.
Do I see it correctly that even utf8ncpy works with bytes (code units), too, instead of codepoints? That'd be a showstopper for me, unfortunately.
For me too.
I recently had to implement my own routine to copy up to N codepoints/characters, which worked fine (I think), but it would be really helpful if this already existed in a library such as utf8.h =).
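A routine of the kind described above might look roughly like the following sketch. It is not the commenter's code and not part of utf8.h; the name sketch_copy_codepoints and its signature are assumptions made for illustration, and the input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of utf8.h.
 * Copies at most max_cp codepoints from src into dst, never writing more
 * than dst_size bytes including the terminating NUL.
 * Assumes src is valid, NUL-terminated UTF-8.
 * Returns the number of codepoints copied. */
static size_t sketch_copy_codepoints(char *dst, size_t dst_size,
                                     const char *src, size_t max_cp) {
    size_t out = 0;    /* bytes written so far */
    size_t copied = 0; /* codepoints written so far */
    assert(dst != NULL && src != NULL && dst_size > 0);
    while (*src != '\0' && copied < max_cp) {
        /* Length of this codepoint: one lead byte plus any
         * continuation bytes (10xxxxxx) that follow it. */
        size_t len = 1;
        while (((unsigned char)src[len] & 0xC0) == 0x80) {
            ++len;
        }
        if (out + len >= dst_size) {
            break; /* the codepoint plus the NUL would not fit */
        }
        memcpy(dst + out, src, len);
        out += len;
        src += len;
        ++copied;
    }
    dst[out] = '\0';
    return copied;
}
```

The destination is always NUL-terminated and never ends in a partial codepoint: when the next codepoint does not fit, the copy stops before it rather than truncating it.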
and do away with the legacy byte semantics for n, in favor of codepoints by default. (Or always? Are there compelling use cases when you'd want to iterate a UTF-8 string by byte?)
This is something I agree with as well.
I believe the API is currently mixed, as evidenced by man strlen on my system:
RETURN VALUE
The strlen() function returns the number of bytes in the string pointed to by s.
whereas utf8len() returns the number of codepoints, not the number of bytes.
Since the utf8*() routines always deal with utf8, I believe the 'n' parameter should always refer to codepoints/characters, rather than bytes. The standard libc routines already handle any byte-level operations.
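The difference between the two counts can be shown with a small sketch. The helper sketch_count_codepoints below is a hypothetical stand-in written for this example; it is not how utf8len() is implemented in utf8.h, and it assumes valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of utf8.h.
 * Counts codepoints by counting every byte that is not a
 * continuation byte (10xxxxxx).
 * Assumes s is valid, NUL-terminated UTF-8. */
static size_t sketch_count_codepoints(const char *s) {
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

/* Byte length as reported by libc, wrapped only for symmetry. */
static size_t sketch_count_bytes(const char *s) {
    assert(s != NULL);
    return strlen(s);
}
```

For "h\xC3\xA9llo" ("héllo"), sketch_count_bytes returns 6 while sketch_count_codepoints returns 5, which is the mismatch an 'n' parameter has to pick a side on.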
from utf8.h.