
utf8's Introduction

UTF8 - Simple Library for Internationalization

While most of the (computing) world has standardized on using UTF-8 encoding, Win32 has remained stuck with wide character strings (also called UTF-16 encoding).

This library simplifies usage of UTF-8 encoded strings under Win32 using principles outlined in the UTF-8 Everywhere Manifesto.

Here is an example of a function call:

  utf8::mkdir ("ελληνικό");   //create a directory with a UTF8-encoded name

Here is another example, a C++ stream whose name and content contain non-ASCII characters:

  utf8::ofstream u8strm("😃😎😛");

  u8strm << "Some Cree ᓀᐦᐃᔭᐍᐏᐣ text" << std::endl;
  u8strm.close ();

A call to Windows API functions can be written as:

  HANDLE f = CreateFile (utf8::widen ("ελληνικό").c_str (), GENERIC_READ, 0,
    NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

Usage

Before using this library, please review the guidelines from the UTF-8 Everywhere Manifesto. In particular:

  • define UNICODE or _UNICODE in your program

  • for Visual C++ users, make sure the "Use Unicode Character Set" option is selected (under "Configuration Properties" > "General" > "Project Defaults").

  • for Visual C++ users, add the /utf-8 option under "C/C++" > "All Options" > "Additional Options".

  • use only std::string and char* variables. Assume they all contain UTF-8 encoded strings.

  • for Visual C++ users compiling under the C++20 language standard, add the /Zc:char8_t- option under "C/C++" > "All Options" > "Additional Options" (see the discussion below).

  • use UTF-16 strings only in arguments to Windows API calls.

All functions and classes in this library are included in the utf8 namespace. It is a good idea not to have a using directive for this namespace. That makes it more evident in the code where UTF8-aware functions are used.

Narrowing and Widening Functions

The basic conversion functions change the encoding between UTF-8, UTF-16 and UTF-32.

The narrow() function converts strings from UTF-16 or UTF-32 encoding to UTF-8:

std::string utf8::narrow (const wchar_t* s, size_t nch=0);
std::string utf8::narrow (const std::wstring& s);
std::string utf8::narrow (const char32_t* s, size_t nch=0);
std::string utf8::narrow (const std::u32string& s);

The widen() function converts from UTF-8 to UTF-16:

std::wstring utf8::widen (const char* s, size_t nch);
std::wstring utf8::widen (const std::string& s);

The runes() function converts from UTF-8 to UTF-32:

std::u32string utf8::runes (const char* s, size_t nch = 0);
std::u32string utf8::runes (const std::string& s);
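
To make these conversions concrete, here is a minimal sketch in standard C++ of what converting between UTF-32 and UTF-8 involves. It is not the library's implementation: the names append_utf8, to_utf8 and to_utf32 are hypothetical, the encoder assumes valid Unicode scalar values, and the decoder assumes well-formed input and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: appends one code point to a UTF-8 string.
// Assumes cp is a valid Unicode scalar value (at most 0x10FFFF, not a surrogate).
void append_utf8 (std::string& out, char32_t cp)
{
  if (cp < 0x80)
    out += static_cast<char> (cp);                         // 1 byte: 0xxxxxxx
  else if (cp < 0x800) {
    out += static_cast<char> (0xC0 | (cp >> 6));           // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char> (0x80 | (cp & 0x3F));
  }
  else if (cp < 0x10000) {
    out += static_cast<char> (0xE0 | (cp >> 12));          // 3 bytes
    out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char> (0x80 | (cp & 0x3F));
  }
  else {
    out += static_cast<char> (0xF0 | (cp >> 18));          // 4 bytes
    out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char> (0x80 | (cp & 0x3F));
  }
}

// UTF-32 to UTF-8 (hypothetical name)
std::string to_utf8 (const std::u32string& s)
{
  std::string out;
  for (char32_t cp : s)
    append_utf8 (out, cp);
  return out;
}

// UTF-8 to UTF-32 (hypothetical name); assumes well-formed input, no validation
std::u32string to_utf32 (const std::string& s)
{
  std::u32string out;
  std::size_t i = 0;
  while (i < s.size ()) {
    unsigned char lead = static_cast<unsigned char> (s[i]);
    // Number of continuation bytes announced by the lead byte
    std::size_t extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    // Payload bits carried by the lead byte
    char32_t cp = (extra == 0) ? lead : (lead & (0x3F >> extra));
    for (std::size_t k = 1; k <= extra && i + k < s.size (); ++k)
      cp = (cp << 6) | (static_cast<unsigned char> (s[i + k]) & 0x3F);
    out += cp;
    i += extra + 1;
  }
  return out;
}

void conversion_example ()
{
  // One character each of 1, 2, 3 and 4 bytes in UTF-8
  const std::u32string text = U"a\u03B5\u20AC\U0001F603";
  const std::string bytes = to_utf8 (text);
  assert (bytes.size () == 10);
  assert (to_utf32 (bytes) == text);
}
```

The round trip in conversion_example shows why nch-style length arguments count bytes on the UTF-8 side and code points on the UTF-32 side: four code points occupy ten bytes.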

There are also functions for:

  • character counting
  • string traversal
  • validity checking
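
As a rough illustration of what character counting and validity checking mean for UTF-8, the sketch below uses only standard C++. The names is_continuation, count_code_points and is_well_formed are hypothetical and are not the library's functions; the check covers structure, overlong forms, surrogates and the upper range limit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true for UTF-8 continuation bytes (10xxxxxx)
inline bool is_continuation (unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

// Counts code points in well-formed UTF-8 by counting non-continuation bytes
std::size_t count_code_points (const std::string& s)
{
  std::size_t n = 0;
  for (unsigned char b : s)
    if (!is_continuation (b))
      ++n;
  return n;
}

// Returns true if s is well-formed UTF-8
bool is_well_formed (const std::string& s)
{
  // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes
  static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

  std::size_t i = 0;
  while (i < s.size ()) {
    unsigned char lead = static_cast<unsigned char> (s[i]);
    std::size_t len;
    char32_t cp;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else
      return false;                 // stray continuation byte or invalid lead byte

    if (i + len > s.size ())
      return false;                 // sequence truncated at end of string

    for (std::size_t k = 1; k < len; ++k) {
      unsigned char b = static_cast<unsigned char> (s[i + k]);
      if (!is_continuation (b))
        return false;
      cp = (cp << 6) | (b & 0x3F);
    }

    // Reject overlong encodings, surrogates and values beyond U+10FFFF
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return false;
    i += len;
  }
  return true;
}

void checking_example ()
{
  const std::string word = "\xCE\xB5\xCE\xBB";   // two Greek letters, two bytes each
  assert (count_code_points (word) == 2);
  assert (is_well_formed (word));
  assert (!is_well_formed (word.substr (1)));    // starts in the middle of a character
}
```

The last assertion shows a common pitfall: byte-based operations such as substr() can cut a multi-byte character in half and leave an invalid string.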

Case Folding Functions

Case folding (conversion between upper case and lower case) in Unicode is more complicated than traditional ASCII case conversion. This library uses the standard tables published by the Unicode Consortium to convert between upper case and lower case and to perform case-insensitive string comparison.

  • case folding - toupper(), tolower(), make_upper(), make_lower()
  • case-insensitive string comparison - icompare()
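
The idea behind table-driven case folding can be sketched as follows. This is an illustration only, not the library's code: simple_lower and icompare_sketch are hypothetical names, and the mapping covers just ASCII and the basic Greek capitals, whereas the real Unicode tables are far larger and include characters that a one-to-one mapping like this cannot handle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, deliberately tiny lower-case mapping.
// Covers A..Z and the Greek capitals U+0391..U+03A9 only.
char32_t simple_lower (char32_t c)
{
  if (c >= U'A' && c <= U'Z')
    return c + 0x20;
  if (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2)   // U+03A2 is unassigned
    return c + 0x20;
  return c;
}

// Case-insensitive comparison of two UTF-32 strings.
// Returns a negative value, zero or a positive value, in the style of strcmp.
int icompare_sketch (const std::u32string& a, const std::u32string& b)
{
  const std::size_t n = std::min (a.size (), b.size ());
  for (std::size_t i = 0; i < n; ++i) {
    const char32_t la = simple_lower (a[i]);
    const char32_t lb = simple_lower (b[i]);
    if (la != lb)
      return la < lb ? -1 : 1;
  }
  if (a.size () == b.size ())
    return 0;
  return a.size () < b.size () ? -1 : 1;
}

void folding_example ()
{
  // Greek capital epsilon, lambda compared with their lower-case forms
  assert (icompare_sketch (U"\u0395\u039B", U"\u03B5\u03BB") == 0);
  assert (icompare_sketch (U"abc", U"ABD") < 0);
}
```

Working on decoded code points rather than on bytes is what makes the comparison meaningful: the upper-case and lower-case forms of a letter have different UTF-8 byte sequences.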

Common "C" Functions Wrappers

The library provides UTF-8 wrappers for the most frequently used C functions. Function names and arguments match those of their traditional C counterparts.

  • Common file access operations: utf8::fopen, utf8::access, utf8::remove, utf8::chmod, utf8::rename
  • Directory operations: utf8::mkdir, utf8::rmdir, utf8::chdir, utf8::getcwd
  • Environment functions: utf8::getenv, utf8::putenv
  • Program execution: utf8::system
  • Character classification functions: the is... family (isalnum, isdigit, etc.)

C++ File I/O Streams

C++ I/O streams (utf8::ifstream, utf8::ofstream, utf8::fstream) provide an easy way to create files with names that are encoded using UTF-8. Because UTF-8 strings are character strings, reading from and writing to these files can be done with the standard insertion and extraction operators.
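
The point about insertion and extraction operators can be tried without any file at all. In the sketch below a std::stringstream stands in for a utf8::fstream, and round_trip is a hypothetical helper, not a library function. It assumes the source file is saved and compiled as UTF-8 (the /utf-8 option with Visual C++) and that the stream uses the default "C" locale.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: writes the words to a stream separated by spaces,
// then reads them back with the extraction operator.
std::vector<std::string> round_trip (const std::vector<std::string>& words)
{
  std::stringstream strm;        // stands in for a utf8::fstream
  for (const auto& w : words)
    strm << w << ' ';            // plain insertion; the bytes pass through unchanged

  std::vector<std::string> result;
  std::string w;
  while (strm >> w)              // in the "C" locale only ASCII whitespace separates tokens
    result.push_back (w);
  return result;
}

void stream_example ()
{
  // Assumes this source file is encoded and compiled as UTF-8
  const std::vector<std::string> words {"Some", "Cree", "ᓀᐦᐃᔭᐍᐏᐣ", "text"};
  assert (round_trip (words) == words);
}
```

No conversion step is needed because every byte of a multi-byte UTF-8 sequence has its high bit set, so none of them can be mistaken for an ASCII separator.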

Windows-Specific Functions

  • path management: splitpath, makepath
  • conversion of command-line arguments: get_argv and free_argv
  • popular Windows API functions: MessageBox, LoadString, ShellExecute, CopyFile, etc.
  • Registry API (RegCreateKey, RegOpenKey, RegSetValue, RegGetValue, etc.)

The API for Windows profile files (also called INI files) has been replaced with a utf8::IniFile object.

Using the library under C++20 standard

The C++20 standard added a new type, char8_t, designed to hold UTF-8 encoded characters, and a matching string type, std::u8string. By making it a separate type from char and unsigned char, the committee also created a number of incompatibilities. For instance, the following fragment will produce an error:

std::string s {"English text"}; //this is ok
s = {u8"日本語テキスト"}; //"Japanese text" - error

You would have to change it to something like:

std::u8string s {u8"English text"}; 
s = {u8"日本語テキスト"}; 

Recently (June 2022) the committee seems to have changed its position and introduced a compatibility and portability fix, P2513R3, adopted as a defect report, which allows initialization of arrays of char or unsigned char with UTF-8 string literals. Until the defect report makes its way into the next edition of the standard, the solution for Visual C++ users who compile under C++20 rules is to use the /Zc:char8_t- compiler option.
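
Where changing the compiler option is not possible, one possible workaround is to copy the bytes of a u8 literal into a std::string. The sketch below is an assumption of mine and not part of this library; from_u8 is a hypothetical name. It compiles whether the literal's element type is char (C++17, or C++20 with /Zc:char8_t-) or char8_t (standard C++20).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copies the bytes of a u8"..." literal into a std::string.
// Ch is deduced as char or char8_t depending on the language mode.
template <typename Ch, std::size_t N>
std::string from_u8 (const Ch (&literal)[N])
{
  static_assert (sizeof (Ch) == 1, "from_u8 expects a u8 string literal");
  // Reading any object through const char* is permitted; N - 1 drops the terminator
  return std::string (reinterpret_cast<const char*> (literal), N - 1);
}

void literal_example ()
{
  std::string s {"English text"};
  s += from_u8 (u8" and more");
  assert (s == "English text and more");
}
```

The cost is an explicit call at every u8 literal, which is exactly the kind of friction the compiler option avoids.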

In my opinion, by introducing the char8_t type, the committee went against the very principles of UTF-8 encoding. The purpose of the encoding was to extend usage of the char type to additional Unicode code points. It has been so successful that it is now the de facto standard used all across the Internet. Even Windows, which has long been a bastion of UTF-16 encoding, is now slowly moving toward UTF-8.

In this context, using the char data type for anything other than holding encoded strings seems out of place. In particular, arithmetic computations with char or unsigned char entities are just a small fraction of the use cases. The standard should try to simplify usage in the most common cases, leaving the infrequent ones to bear the burden of complexity.

Following this principle, you would want to write:

std::string s {"English text"};
s += " and ";
s += "日本語テキスト";

with the implied assumption that all char strings are UTF-8 encoded character strings.

Documentation

Doxygen documentation can be found at https://neacsum.github.io/utf8/

Building

The UTF8 library doesn't have any dependencies. The test program, however, uses the UTPP library.

The preferred method is to use CPM (C/C++ Package Manager) to fetch all dependent packages and build them. Download the CPM program and, from the root of the development tree, issue the cpm command:

  cpm -u https://github.com/neacsum/utf8.git utf8

The Visual C++ projects are set to compile under C++17 rules and can also be compiled under C++20 rules. If you are using C++20 rules, you have to add the /Zc:char8_t- option as discussed above.

You can build the library using CMake. From the utf8 directory:

  cmake -S . -B build
  cmake --build build

Alternatively, the BUILD.bat script will build the libraries and test programs.

While the library has been designed for Windows, some of the functions may be useful in a Linux environment. Under Linux, the library can be built using CPM as explained before, or with CMake using the same commands shown above.

License

The MIT License

utf8's People

Contributors

emanspeaks, neacsum

utf8's Issues

gen_casetab not compiling with C++23

Using MSVC Build Tools 2022, with the C++ standard set to anything after C++17, line 67 of gen_casetab.cpp causes compilation to fail due to attempting to use out << tab[i].uc, where the latter is a char32_t in the codept struct. Even when compiling with the /Zc:char8_t- option, this fails due to "incompatible arguments" to the << operator. I don't immediately have a suggested fix other than just setting the standard to C++17 when building and then switching back to later standard for my larger project, but this is not ideal.

CMake support

This seems to be a very neat library!
Have you considered adding CMake support?
Otherwise it would be quite cumbersome to integrate into existing projects.

program crash?

Hi, I tried this code but the program crashes:
string y="đặng dủy khánh siêu đại prố khủng khiếp";
const auto& result = utf8::tolower(y.substr(1));
I also intend to add a function utf8::tolower(char).
Can you suggest how to do it directly?
I want to write a function similar to sentence case; however, the only thing that comes to mind is
utf8::tolower(string{char}), which does it indirectly and doesn't look very nice.

Thanks for your work.
