ude's Introduction

Ude is a C# port of Mozilla Universal Charset Detector.

The article "A composite approach to language/encoding detection" describes the charset detection algorithms implemented by the library.

Ude can recognize the following charsets:

  • UTF-8
  • UTF-16 (BE and LE)
  • UTF-32 (BE and LE)
  • windows-1252 (mostly equivalent to ISO-8859-1)
  • windows-1251 and ISO-8859-5 (Cyrillic)
  • windows-1253 and ISO-8859-7 (Greek)
  • windows-1255 (logical Hebrew; includes ISO-8859-8-I and most of x-mac-hebrew)
  • ISO-8859-8 (visual Hebrew)
  • Big-5
  • gb18030 (superset of gb2312)
  • HZ-GB-2312
  • Shift-JIS
  • EUC-KR, EUC-JP, EUC-TW
  • ISO-2022-JP, ISO-2022-KR, ISO-2022-CN
  • KOI8-R
  • x-mac-cyrillic
  • IBM855 and IBM866
  • X-ISO-10646-UCS-4-3412 and X-ISO-10646-UCS-4-2413 (unusual BOM)
  • ASCII

Platform

Windows and Linux (Mono)

Install

The release consists of the main library (Ude.dll) and a command-line client (udetect.exe) that can be used for one-shot tests.

On Windows, compile the Visual Studio 2005 solution ude.sln. On Linux you can build the library, the example and the NUnit tests with MonoDevelop and its solution ude.mds, or with make. To compile from the source tarball:

$ ./configure.sh --prefix=/usr/local --enable-tests=yes
$ make

To compile from svn:

$ ./autogen.sh --prefix=/usr/local --enable-tests=yes
$ make

You can pick the library (Ude.dll) from the toplevel build directory (./bin) or you can install it to $prefix/lib/ude by typing:

$ make install

This also installs a command-line example program ($prefix/bin/udetect), which tests the library on a given file:

$ udetect filename

To run the NUnit tests, type:

$ make test

Usage

Example

public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}", 
                 cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}    
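
Once a charset name has been reported, a common next step is to turn it into a System.Text.Encoding for decoding. The sketch below is not part of Ude: CharsetNames.Resolve is a hypothetical helper built only on Encoding.GetEncoding. It takes a fallback because detection can fail (null charset), and because which of the names listed above a given .NET runtime recognizes is an assumption that should be checked rather than relied on.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Ude: maps a detected charset name
// to a System.Text.Encoding, using a fallback when the name is unusable.
public static class CharsetNames
{
    public static Encoding Resolve(string charsetName, Encoding fallback)
    {
        if (string.IsNullOrEmpty(charsetName))
            return fallback;                 // detection failed
        try
        {
            return Encoding.GetEncoding(charsetName);
        }
        catch (ArgumentException)
        {
            return fallback;                 // name not known to this runtime
        }
    }

    public static void Main()
    {
        Console.WriteLine(Resolve("UTF-8", Encoding.ASCII).WebName);
        Console.WriteLine(Resolve(null, Encoding.ASCII).WebName);
    }
}
```

In the example above, the result of CharsetNames.Resolve(cdet.Charset, Encoding.UTF8) could then be passed to a StreamReader opened on the same file.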

Other ports

The original Mozilla Universal Charset Detector has been ported to a variety of languages. Among these are a Java port, from which I copied a few data structures, and a Python port.

License

The library is subject to the Mozilla Public License Version 1.1 (the "License"). Alternatively, it may be used under the terms of either the GNU General Public License Version 2 or later (the "GPL"), or the GNU Lesser General Public License Version 2.1 or later (the "LGPL").

Test data has been extracted from Wikipedia and Project Gutenberg books and is subject to their licenses.

ude's People

Contributors

errepi

ude's Issues

[Suggestion] Mute output?

Hi, would you provide a way to disable command-line output when using this library? It makes debugging this alongside other things a bit annoying.
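
Until the library offers such a switch, one caller-side workaround is to redirect the console while the detector runs. This is only a sketch: it assumes the unwanted output is written through Console.Out (the report does not say where it comes from), and QuietConsole is a hypothetical helper, not part of Ude.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of Ude.
public static class QuietConsole
{
    // Runs the action with Console.Out redirected to a null writer,
    // then restores the previous writer even if the action throws.
    public static void Run(Action action)
    {
        TextWriter original = Console.Out;
        Console.SetOut(TextWriter.Null);
        try
        {
            action();
        }
        finally
        {
            Console.SetOut(original);
        }
    }

    public static void Main()
    {
        Run(() => Console.WriteLine("suppressed"));
        Console.WriteLine("visible");   // only this line is printed
    }
}
```

Output written to Console.Error or through a debug listener would not be affected by this approach.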

Detection fails on particular, simple ANSI file

What steps will reproduce the problem?
1. Save an ANSI file containing the text "CONFIG: main 30000000"
2. Run the library and/or exe on it

What is the expected output? What do you see instead?

I expect ANSI to be detected.

What version of the product are you using? On what operating system?

The library shows null for charset, and the exe shows "detection failed".

Please provide any additional information below.

I don't know if this is how the library is intended to work, but I think it 
would be more useful to detect ANSI if all the characters fit into ANSI. Or at 
least support this behavior optionally.

Original issue reported on code.google.com by [email protected] on 14 Sep 2014 at 4:59
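
The optional behavior requested above can be approximated on the caller's side. The sketch below is an assumption about how one might do that, not something Ude provides: when the detector reports no charset, check whether every byte is 7-bit. If so, the data decodes identically under ASCII and under ASCII-compatible code pages such as windows-1252. AsciiFallback is a hypothetical name.

```csharp
using System;
using System.Text;

// Hypothetical caller-side check, not part of Ude.
public static class AsciiFallback
{
    // True when every byte lies in the 7-bit range 0x00..0x7F.
    public static bool IsSevenBit(byte[] data)
    {
        foreach (byte b in data)
        {
            if (b > 0x7F)
                return false;
        }
        return true;
    }

    public static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("CONFIG: main 30000000");
        Console.WriteLine(IsSevenBit(data));   // prints "True"
    }
}
```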

UTF16 LE is not detected correctly

UTF16_Test.csv

For the provided file the encoding is not detected as UTF16 LE even though it is.

detector.Encoding = ASCIIEncoding.ASCIIEncodingSealed
 BodyName = {string} "us-ascii"
 CodePage = {int} 20127
 DecoderFallback = DecoderReplacementFallback
 EncoderFallback = EncoderReplacementFallback
 EncodingName = {string} "US-ASCII"
 HeaderName = {string} "us-ascii"
 IsBrowserDisplay = {bool} false
 IsBrowserSave = {bool} false
 IsMailNewsDisplay = {bool} true
 IsMailNewsSave = {bool} true
 IsReadOnly = {bool} true
 IsSingleByte = {bool} true
 IsUTF8CodePage = {bool} false
 Preamble = {ReadOnlySpan<byte>} System.ReadOnlySpan<Byte>[0]
 WebName = {string} "us-ascii"
 WindowsCodePage = {int} 1252
 _codePage = {int} 20127
 _dataItem = CodePageDataItem
 _isReadOnly = {bool} true
 decoderFallback = DecoderReplacementFallback
 encoderFallback = EncoderReplacementFallback

BUG in SBCSGroupProber class in function Reset

http://ude.googlecode.com/svn/trunk/src/Library/Ude.Core/SBCSGroupProber.cs

existing code:

public override void Reset ()
{
    int activeNum = 0;
...

SHOULD be:

public override void Reset ()
{
    activeNum = 0;
...

In many cases this bug prevents the right charset from being detected: the class
member activeNum always stays 0, because Reset assigns to a local variable
instead. See this piece of code:
} else if (st == ProbingState.NotMe) {
   isActive[i] = false;
   activeNum--;
   if (activeNum <= 0) {
      state = ProbingState.NotMe;
      break;
   }
}

I fixed it locally, but I don't want other developers to spend time
debugging the same issue.

The attached file reproduces the bug (charset is KOI8-R).

Original issue reported on code.google.com by [email protected] on 25 May 2012 at 6:40
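
The effect described in this report can be reproduced in isolation. The class below is a simplified stand-in, not the actual SBCSGroupProber: the counting loop is an assumption, because the report elides the rest of Reset. It shows that a local declared with the same name as a field hides the field, so the field never receives the new count.

```csharp
using System;

// Simplified stand-in for a group prober; not the actual Ude class.
public class ProberSketch
{
    private int activeNum;                 // class member read by later logic

    public int ActiveNum { get { return activeNum; } }

    // Mirrors the reported bug: the local variable hides the member.
    public void ResetShadowed(int proberCount)
    {
        int activeNum = 0;
        for (int i = 0; i < proberCount; i++)
            activeNum++;                   // updates the local only
    }

    // Mirrors the proposed fix: the member itself is updated.
    public void ResetFixed(int proberCount)
    {
        activeNum = 0;
        for (int i = 0; i < proberCount; i++)
            activeNum++;
    }

    public static void Main()
    {
        var p = new ProberSketch();
        p.ResetShadowed(3);
        Console.WriteLine(p.ActiveNum);    // prints 0
        p.ResetFixed(3);
        Console.WriteLine(p.ActiveNum);    // prints 3
    }
}
```

With the member stuck at 0, the first decrement in the NotMe branch quoted above makes the count non-positive, which matches the premature NotMe state the report describes.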

UTF-16 without BOM not detected correctly

What steps will reproduce the problem?
1. Create a text file encoded as UTF-16 little endian.
2. Edit hex and remove the BOM from the file.  Yes, this is purposely modifying 
the file to cause a problem but I have been encountering many examples of 
UTF-16 encoded files lacking a BOM as provided to me from other applications.  
And not having a BOM does not invalidate the file.
3. Test Ude.Example by passing path to this BOM-less UTF-16LE file
4. When UniversalDetector is called the first check is to look for a BOM.
5. Not having a BOM, the evaluation passes to the deeper analysis which returns 
a result of encoding = ANSI 1252 which is wrong.

What is the expected output? 

Expected output is encoding = "UTF-16"

What do you see instead?

"Charset: ASCII, confidence: 1"


What version of the product are you using? On what operating system?

Ude C# port with all current code changes applied
Windows 7 Ultimate SP1 64-bit

Please provide any additional information below.

Larger files (1000 KB+) lacking the BOM tend to produce the result
"Charset: windows-1252, confidence: 0.5"

Original issue reported on code.google.com by [email protected] on 17 Sep 2012 at 10:52
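
For files like the one in this report, a caller could run a cheap pre-check before handing data to the detector. The following is a sketch of one common heuristic, not what Ude does: in UTF-16 text made mostly of ASCII-range characters, nearly every second byte is zero. The name Utf16Heuristic and the 70% and 10% thresholds are arbitrary assumptions, and text in scripts outside the Latin range will not be recognized this way.

```csharp
using System;
using System.Text;

// Hypothetical pre-check, not part of Ude.
public static class Utf16Heuristic
{
    // Returns "UTF-16LE", "UTF-16BE", or null when the sample does not
    // look like mostly-ASCII-range UTF-16 text.
    public static string Guess(byte[] data)
    {
        int pairs = data.Length / 2;
        if (pairs == 0)
            return null;

        int evenZeros = 0, oddZeros = 0;
        for (int i = 0; i + 1 < data.Length; i += 2)
        {
            if (data[i] == 0) evenZeros++;
            if (data[i + 1] == 0) oddZeros++;
        }

        // The zero high byte sits at odd offsets in LE and even offsets in BE.
        if (oddZeros > pairs * 0.7 && evenZeros < pairs * 0.1) return "UTF-16LE";
        if (evenZeros > pairs * 0.7 && oddZeros < pairs * 0.1) return "UTF-16BE";
        return null;
    }

    public static void Main()
    {
        // Encoding.Unicode is UTF-16LE; GetBytes does not emit a BOM.
        byte[] sample = Encoding.Unicode.GetBytes("plain text without a BOM");
        Console.WriteLine(Guess(sample));   // prints "UTF-16LE"
    }
}
```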

Building ude with automake 1.14 on Debian GNU/Linux jessie failed

I tried to build ude with automake 1.14 on Debian GNU/Linux 8.5. I'm working on the
current master branch of ude. First I tried:

git clone https://github.com/errepi/ude
cd ude
./configure --prefix=/some/prefix --enable-test=yes

and I got the following error message:

config.status: error: cannot find input file: Makefile.in

Next I tried:

git clone https://github.com/errepi/ude
cd ude
./autogen.sh

and I got the following error messages:

Makefile.include:11: error: 'pkglibdir' is not a legitimate directory for 'SCRIPTS'
src/Example/Makefile.am:23: 'Makefile.include' included from here
Makefile.include:11: error: 'pkglibdir' is not a legitimate directory for 'SCRIPTS'
src/Library/Makefile.am:63: 'Makefile.include' included from here
Makefile.include:11: error: 'pkglibdir' is not a legitimate directory for 'SCRIPTS'
src/Tests/Makefile.am:67: 'Makefile.include' included from here
[...]
[...]/ude/missing: Unknown `--is-lightweight' option
Try `[...]/ude/missing --help' for more information
configure: WARNING: 'missing' script is too old or missing

Can you please port ude to the newest version of automake, so that everybody
can use ude again? As far as I could find out, the 'missing' script and the handling of
the destination directory changed in automake version 1.11.2.

Thanks a lot,
Dorle :)

Cannot Find .sln for windows usage

What steps will reproduce the problem?
1. Download the tarball
2. Extract
3. Look for .sln

What is the expected output? What do you see instead?
It should be there somewhere... It's not.

What version of the product are you using? On what operating system?
0.1, Windows XP

Please provide any additional information below.
Is there a workaround? Should I just build my own solution from the
source?

Original issue reported on code.google.com by [email protected] on 13 Jul 2009 at 11:53

EUCTW: System.IndexOutOfRangeException

The problem is the CharDistributionAnalyser.HandleOneChar call for EUCTW detection.

The size of the charToFreqOrder array is 5376, but tableSize is defined as 8102, so
this check is wrong:
if (order < tableSize) <--
 { // order is valid
   if (512 > charToFreqOrder[order])
     freqChars++;
 }

I took a look at the Java code, where this part is changed to:

if (order < charToFreqOrder.Length)
{ // order is valid
  if (512 > charToFreqOrder[order])
    freqChars++;
}

We don't need tableSize any more, and this check can no longer throw the
exception.


Original issue reported on code.google.com by [email protected] on 16 Nov 2009 at 3:12
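
The difference between the two checks is easy to see on a small table. The class below is a simplified stand-in, not the actual CharDistributionAnalyser: it keeps the names charToFreqOrder and freqChars from the report, while the constructor, the HandleOrder method and the three-entry table are assumptions made for illustration.

```csharp
using System;

// Simplified stand-in for a distribution analyser; not the actual Ude class.
public class DistributionSketch
{
    private readonly int[] charToFreqOrder;
    private int freqChars;

    public DistributionSketch(int[] table)
    {
        charToFreqOrder = table;
    }

    public int FreqChars { get { return freqChars; } }

    public void HandleOrder(int order)
    {
        // Bound the index by the array itself, not by a separate constant
        // that can drift out of sync with the table.
        if (order >= 0 && order < charToFreqOrder.Length)
        {
            if (512 > charToFreqOrder[order])
                freqChars++;
        }
    }

    public static void Main()
    {
        var sketch = new DistributionSketch(new int[] { 3, 900, 17 });
        sketch.HandleOrder(0);      // in range and frequent
        sketch.HandleOrder(1);      // in range but not frequent
        sketch.HandleOrder(5000);   // out of range: ignored, no exception
        Console.WriteLine(sketch.FreqChars);   // prints 1
    }
}
```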

Returns UTF-8 for Cyrillic text

What steps will reproduce the problem?
1. Define Cyrillic text, "Это пример кириллического текста".
2. Feed the CharsetDetector with a stream of this text.
3. Result charset is "UTF-8" with Confidence 1.0

What is the expected output? 
Charset is koi-8

What do you see instead?
UTF-8

What version of the product are you using? 
Ude, C# port 

On what operating system?
Windows 7/8, x64

Original issue reported on code.google.com by [email protected] on 8 Dec 2012 at 8:19
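
One possible explanation, offered as an assumption because the report does not show how the stream was built: a .NET string carries no KOI8-R bytes of its own, so the detector sees whatever encoding was used to turn the string into bytes. If that was Encoding.UTF8, the stream really is UTF-8. The sketch below (CyrillicBytes is a hypothetical name) prints the UTF-8 bytes of the first word of the sample text, written with escapes so the source file's own encoding does not matter.

```csharp
using System;
using System.Text;

// Illustration only; not part of Ude.
public static class CyrillicBytes
{
    public static byte[] Utf8Bytes(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    public static void Main()
    {
        // "\u042D\u0442\u043E" is the word "Это".
        byte[] bytes = Utf8Bytes("\u042D\u0442\u043E");
        Console.WriteLine(BitConverter.ToString(bytes));   // prints D0-AD-D1-82-D0-BE
    }
}
```

Each Cyrillic letter becomes a two-byte UTF-8 sequence here, whereas KOI8-R uses a single byte per letter.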

pureascii detection issue

What steps will reproduce the problem?
1. Create a text file containing just the character "3".
2. Save it and run detection.
3. Notice that it reports that detection failed.

What is the expected output? What do you see instead?
I expected it to report the file as ASCII (this happens with any file that has the
digit 3 in it).

What version of the product are you using? On what operating system?
Latest version, on Windows XP

Please provide any additional information below.

I noticed that the code looking for EscAscii characters checks for 0x33
instead of 0x1b. 0x33 is the digit 3, not an escape character.
I am not sure whether the same issue exists anywhere else in the code.

Original issue reported on code.google.com by rbhatt%[email protected] on 2 Dec 2009 at 5:29
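
The two byte values are easy to confuse because the escape character is 033 in octal and 0x1B in hexadecimal, while 0x33 is the digit 3. The sketch below is not Ude's code: EscapeScan.Contains is a hypothetical helper that shows how a scan for the wrong value fires on ordinary digits and a scan for the right one does not.

```csharp
using System;
using System.Text;

// Illustration only; not part of Ude.
public static class EscapeScan
{
    public const byte Esc = 0x1B;          // ESC control character (octal 033)
    public const byte DigitThree = 0x33;   // the character '3'

    // True when the marker byte occurs anywhere in the data.
    public static bool Contains(byte[] data, byte marker)
    {
        foreach (byte b in data)
        {
            if (b == marker)
                return true;
        }
        return false;
    }

    public static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("3");
        Console.WriteLine(Contains(data, DigitThree));   // prints "True"
        Console.WriteLine(Contains(data, Esc));          // prints "False"
    }
}
```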
