cpp-libucd's Introduction

Universal Character Set Detector C Library (libucd)

What is it?

This library provides a highly accurate set of heuristics that attempt to determine the character set used to encode some input text. This is extremely useful when your program has to handle an input file which is supplied without any encoding metadata.

The original code of libucd was written by Netscape Communications Corporation and is available at http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/. Unfortunately, the Firefox project removed most of the encoding detection functions in its latest version, while the multi-language detector is still widely used in other open source projects. A standalone version of the library that keeps most of the language detection therefore needs to be maintained. This project was set up for that purpose and has since been extended with more languages, utilities and packaging.

This project pulls together:

  • A command line interface to the library, which can also compare its results with those of alternative libraries, such as libicu.
  • The UCD library itself, from the Mozilla SeaMonkey source tree.
  • The extended language detection from the project https://bitbucket.org/medoc/uchardet-enhanced/

Why do you need this library?

  • Integrates patches and improvements contributed by users across the Internet;
  • Provides thread-safe APIs;
  • Supports building several package formats, including RPM, DEB, Pacman and Android;
  • Includes test case data and tools, so you can release new code with confidence after running the test cases whenever you change the code yourself;
  • Adds support for more languages and encodings;
  • Provides documentation for the APIs, including man pages.

Supported Encodings

  • Unicode: UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  • Traditional and Simplified Chinese: Big5, GB18030, EUC-TW, HZ-GB-2312, ISO-2022-CN
  • Japanese: EUC-JP, SHIFT_JIS, ISO-2022-JP
  • Korean: EUC-KR, ISO-2022-KR
  • Cyrillic: KOI8-R, KOI8-U, MacCyrillic, IBM855, IBM866, ISO-8859-5, WINDOWS-1251
  • Hungarian: ISO-8859-2, WINDOWS-1250
  • Bulgarian: ISO-8859-5, WINDOWS-1251
  • English: WINDOWS-1252
  • Greek: ISO-8859-7, WINDOWS-1253
  • Visual and Logical Hebrew: ISO-8859-8, WINDOWS-1255
  • Thai: TIS-620
  • Czech: ISO-8859-2
  • Finnish: WINDOWS-1252
  • French: WINDOWS-1252
  • German: WINDOWS-1252
  • Polish: ISO-8859-2
  • Spanish: WINDOWS-1252
  • Swedish: WINDOWS-1252
  • Turkish: ISO-8859-9
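
For the Unicode entries in the list above, the simplest signal a detector can use is a byte order mark (BOM) at the start of the input. The following is a minimal sketch of that idea, not the library's own code: `sniff_bom` is a hypothetical helper name, and it covers only the five standard BOMs (UTF-8, UTF-16 BE/LE, UTF-32 BE/LE), not the two unusual UTF-32 byte orders. Input without a BOM has to be classified by the statistical heuristics instead.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libucd: returns the name of the Unicode
 * encoding signalled by a byte order mark at the start of buf, or NULL when
 * no standard BOM is present. */
static const char *sniff_bom(const unsigned char *buf, size_t len)
{
    /* The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks,
     * because the UTF-32LE mark FF FE 00 00 begins with the UTF-16LE
     * mark FF FE. */
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
    return NULL; /* no BOM: fall back to statistical detection */
}
```

Calling `sniff_bom(buf, len)` on the first bytes of a file gives a cheap first answer. A NULL result only means that no BOM was found, not that the text is not Unicode.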

Building

The build system is based on autoconf/automake. To build the library, run:

./configure
make

Packages can also be built for Linux distributions such as RedHat/CentOS, Debian/Ubuntu and Arch Linux, and for Android:

  • RedHat/CentOS
./autogen.sh
make rpm
  • Debian/Ubuntu
./autogen.sh
debuild -c -uc -us
  • Pacman
cd pacman
makepkg -Asf
  • Android

Add a line to the Android.mk file in your jni folder, for example:

include jni/libucd/Android.mk

and then run ndk-build.

API

See libucd.h or the man pages for the API reference, and utils/sample.c for a usage example.

Directory contents:

debian/, rpm/, pacman/

  • The configuration files for the various package types.

doc/

  • Contains the document describing the general idea of automatic detection.

man/

  • Man pages for the library and the utilities.

include/

  • The C API header file

src/

  • The C API from the reference above, with the enhanced Mozilla code.

utils/

  • A command line universal character set detector that can process files either by file name or by reading from STDIN.

test/

  • Wikipedia index pages in the target languages, sometimes in multiple encodings. The pages were manually stripped of English and boilerplate content, in the hope that what remains is significant and typical text.

  • Used to check how the detection works.

langstats/

  • Data and code used to produce the bigram frequencies for a language/encoding pair, used for the "Two char Distribution Method" from the reference article (neither the article nor the Mozilla module publishes the scripts used to generate the tables or the reference data).
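
To illustrate what a bigram frequency is, here is a minimal sketch that counts adjacent byte pairs in a sample buffer and reports the relative frequency of one pair. It is not the code in langstats/: the names `bigram_table`, `bigram_reset`, `bigram_add` and `bigram_freq` are hypothetical, and the sketch assumes a single-byte encoding and counts raw bytes, whereas the real tables may be built from frequency-ordered character classes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type: counts[a][b] is the number of times byte a is
 * immediately followed by byte b in the sample text. */
typedef struct {
    unsigned int counts[256][256];
    unsigned long total; /* total number of adjacent pairs seen */
} bigram_table;

/* Clear all counters before feeding sample text. */
static void bigram_reset(bigram_table *t)
{
    memset(t, 0, sizeof *t);
}

/* Count every pair of adjacent bytes in buf. */
static void bigram_add(bigram_table *t, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        t->counts[buf[i]][buf[i + 1]]++;
        t->total++;
    }
}

/* Relative frequency of the pair (a, b); 0.0 for an empty table. */
static double bigram_freq(const bigram_table *t, unsigned char a, unsigned char b)
{
    return t->total ? (double)t->counts[a][b] / (double)t->total : 0.0;
}
```

The table is about 256 KB, so it is better declared `static` or allocated on the heap than placed on a small stack. Feeding sample texts in one language/encoding pair through `bigram_add` and keeping the most frequent pairs gives the kind of data a two-character distribution check can compare unknown input against.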

License

The library is subject to the GNU General Public License Version 2. Alternatively, the library may be used under the terms of the GNU Lesser General Public License 2.1.
