Chardet: The Universal Character Encoding Detector

Detects:

  • ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  • Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
  • EUC-JP, SHIFT_JIS, CP932 (aka MS932), ISO-2022-JP (Japanese)
  • EUC-KR, ISO-2022-KR (Korean)
  • KOI8-R, x-mac-cyrillic (prev MacCyrillic), IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
  • ISO-8859-5, windows-1251 (Bulgarian)
  • ISO-8859-1, windows-1252 (Western European languages)
  • ISO-8859-7, windows-1253 (Greek)
  • ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
  • TIS-620 (Thai)

Note:
The ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily disabled until the models can be retrained.

About

This is a Go port of the excellent Python chardet library <https://github.com/chardet/chardet>. It is based on the Mozilla statistical encoding detector. v0.0.7 is based on chardet version 0.0.4 (Dec 20).

Usage

The simplest way to use chardet is the exported package-level Detect function:

package main

import (
	"fmt"
	"github.com/olaure/chardet"
)

func main() {
	data := []byte("नमस्कार")
	detected := chardet.Detect(data)
	fmt.Printf(
		"Detected character set: %v with confidence %v\n",
		detected.Encoding, detected.Confidence,
	)
}

Alternatively, the DetectShortestUTF8 function looks for the decoded string that contains the fewest characters in the Unicode categories C (control), S (symbol), and P (punctuation):

package main

import (
	"fmt"
	"github.com/olaure/chardet"
)

func main() {
	data := []byte("नमस्कार")
	detected := chardet.DetectShortestUTF8(data)
	fmt.Printf(
		"Detected character set: %v with confidence %v\n",
		detected.Encoding, detected.Confidence,
	)
}

As a result, this function does not necessarily return the encoding with the highest confidence, unless that confidence is the maximum.
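To illustrate the idea behind counting C, S, and P characters, here is a minimal sketch using only Go's standard unicode package. It is not the library's implementation: the names countCSP, candidate, and pickCleanest are hypothetical, and the sketch assumes the candidate decodings have already been produced elsewhere. A wrong decoding tends to introduce stray symbols or control characters, so the candidate with the lowest count is preferred.

```go
package main

import (
	"fmt"
	"unicode"
)

// countCSP returns how many runes in s belong to the Unicode general
// categories C (other/control), S (symbol), or P (punctuation).
// Hypothetical helper, not part of the chardet API.
func countCSP(s string) int {
	n := 0
	for _, r := range s {
		if unicode.In(r, unicode.C, unicode.S, unicode.P) {
			n++
		}
	}
	return n
}

// candidate pairs an encoding name with the text obtained by decoding
// the input with that encoding (hypothetical type for this sketch).
type candidate struct {
	Encoding string
	Decoded  string
}

// pickCleanest returns the candidate whose decoded text has the lowest
// C/S/P count. Ties keep the earliest candidate. The boolean is false
// when no candidates are given.
func pickCleanest(cands []candidate) (candidate, bool) {
	if len(cands) == 0 {
		return candidate{}, false
	}
	best := cands[0]
	bestScore := countCSP(best.Decoded)
	for _, c := range cands[1:] {
		if score := countCSP(c.Decoded); score < bestScore {
			best, bestScore = c, score
		}
	}
	return best, true
}

func main() {
	// The bytes of "café" in UTF-8, decoded two different ways.
	cands := []candidate{
		{Encoding: "windows-1252", Decoded: "cafÃ©"}, // mojibake: contains the symbol ©
		{Encoding: "UTF-8", Decoded: "café"},
	}
	if best, ok := pickCleanest(cands); ok {
		fmt.Printf("Cleanest decoding: %v (%q)\n", best.Encoding, best.Decoded)
	}
}
```

In this sketch the windows-1252 reading scores 1 because of the © symbol, while the UTF-8 reading scores 0, so the UTF-8 candidate is chosen regardless of any statistical confidence attached to either.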
