shengche / juniversalchardet
Automatically exported from code.google.com/p/juniversalchardet
What steps will reproduce the problem?
1. For each case, read the first line.
2. If it is an encoded word, decode it using either Base64 or quoted-printable.
3. Convert it to UTF-8.
4. Compare it with the second line of the case, which is the expected result.
What is the expected output? What do you see instead?
Please see the second line of each case in the attached file.
What version of the product are you using? On what operating system?
Red Hat Enterprise Linux 4
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 18 Jun 2010 at 6:14
Attachments:
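The reproduction steps in the report above (decode the encoded word, then convert to UTF-8) can be sketched roughly as follows. This is only an illustration using the JDK: the class and method names are made up here, it trusts the charset label inside the encoded word (the report presumably runs the detector on the decoded bytes instead), and it handles a single encoded word per line.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of juniversalchardet.
final class EncodedWordSketch {
    // Matches a single RFC 2047 encoded word: =?charset?B-or-Q?payload?=
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?([BbQq])\\?([^?]*)\\?=");

    /** Decodes one encoded word; returns the input unchanged if it is not one. */
    static String decode(String line) {
        Matcher m = ENCODED_WORD.matcher(line.trim());
        if (!m.matches()) {
            return line;
        }
        // Assumption: trust the charset label carried by the encoded word.
        Charset charset = Charset.forName(m.group(1));
        byte[] raw = "B".equalsIgnoreCase(m.group(2))
                ? Base64.getDecoder().decode(m.group(3))
                : decodeQ(m.group(3));
        return new String(raw, charset);
    }

    /** Step 3 of the report: the UTF-8 byte form of the decoded text. */
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // "Q" encoding: '_' stands for a space, "=XX" for the byte with hex value XX.
    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                out.write(' ');
            } else if (c == '=' && i + 2 < text.length()) {
                out.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(decode("=?ISO-8859-1?Q?M=FCnchen?="));
        System.out.println(decode("=?UTF-8?B?TcO8bmNoZW4=?="));
    }
}
```

Comparing the result of toUtf8 with the expected second line of each case would then be a plain byte-array comparison.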
What steps will reproduce the problem?
1. Try charset detection of a file, using the sample code from the homepage.
What is the expected output? What do you see instead?
Expected: the detected charset (WINDOWS-1252, UTF-8)
Instead: null
What version of the product are you using? On what operating system?
Using the Jul 23, 2008 binary of juniversalchardet 1.0.3, SHA1=591d72211acc0b909b79c840e0b3ed9a0982d807
Problem appeared on:
a. An x64 Windows Server 2008 R2 server with Java 1.6.0_43
b. An x64 Windows 7 workstation with Java 1.6.0_43
Problem did not appear (detection worked flawlessly) on:
c. Another x64 Windows 7 workstation with Java 1.6.0_43
Please provide any additional information below.
To understand the issue, I rebuilt the .jar with the debug=true <javac> option. As expected, that let me debug properly, but it also solved my problem: detection now worked on machines a and b! That seemed strange, so I rolled back my changes to build.xml, re-ran the compile and dist Ant tasks, and it still works.
--> On some system/JVM combinations, the binary built on Jul 23, 2008 does not seem to work and always returns null.
--> Being just a user of the library who barely understands the detection flow, I could not work out what went wrong or where, so I cannot be more precise. Feel free to ask for trace information.
--> Maybe publishing a recompiled version on the website would be a good idea? Mine is attached, compiled with Java 1.6.0_43 and Ant 1.9.0 on my machine 'a' (x64 Windows Server 2008 R2 server with Java 1.6.0_43).
Original issue reported on code.google.com by [email protected]
on 15 Mar 2013 at 6:28
Attachments:
What steps will reproduce the problem?
1. If I use a FileInputStream, detection works fine; if I use a FileItem, it always detects MacCyrillic.
2. The attached example file shows the problem.
Here is the piece of code:
BufferedWriter clsWriter = new BufferedWriter ( new OutputStreamWriter ( clsFile.getOutputStream () ) );
clsWriter.write ( "ÄÜÖßäöü,Name1ÄÜÖßäöü,Name2ÄÜÖßäöü,Name3ÄÜÖßäöü,StreetÄÜÖßäöü,MÄÜÖßäöü,DE,80080,München,ContactÄÜÖßäöü,+49(0)ÄÜÖßäöü,ÄÜÖßäöü@gls-itservices.com,CommentÄÜÖßäöü,+49,(0)98,765,432,BlÄÜÖßäöü" );
clsWriter.close ();
InputStream clsInput = clsFile.getInputStream ();
byte[] buffer = new byte[ 1024 ];
while ( true )
{
    int n = clsInput.read ( buffer );
    if ( n <= 0 )
    {
        break;
    }
    detector.handleData ( buffer, 0, n );
}
detector.dataEnd ();
clsInput.close ();
String strEncoding = detector.getDetectedCharset ();
System.out.println ( "encoding: " + strEncoding );
What is the expected output? What do you see instead?
I expect latin-1
What version of the product are you using? On what operating system?
juniversalchardet-1.0.3.jar on Windows XP
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 23 Jul 2014 at 3:31
Attachments:
When a file starts with a Byte Order Mark, there needs to be a way to discard
those bytes. The detected charset is not enough information, because the file
may include a BOM or not.
The easy way would be a method indicating the number of bytes to skip.
What steps will reproduce the problem?
1. Run the universal detector on a file with a BOM, such as UTF-16LE
2. Open a reader using the detected charset
3. Observe the spurious first character
Original issue reported on code.google.com by marcus.downing
on 29 Apr 2011 at 12:08
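A rough sketch of the "number of bytes to skip" idea from the report above, written against the JDK only. The class and method names are hypothetical and this is not an API of juniversalchardet; it simply checks the first bytes against the well-known Unicode BOM signatures.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of juniversalchardet.
final class BomSkipSketch {
    /** Returns how many leading bytes of data form a known BOM, or 0 if none. */
    static int bomLength(byte[] data) {
        // UTF-32 is checked before UTF-16 because FF FE 00 00 also starts with FF FE.
        // Caveat: a UTF-16LE file whose first character is U+0000 is ambiguous here.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return 4; // UTF-32BE
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return 4; // UTF-32LE
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return 3;       // UTF-8
        if (startsWith(data, 0xFE, 0xFF)) return 2;             // UTF-16BE
        if (startsWith(data, 0xFF, 0xFE)) return 2;             // UTF-16LE
        return 0;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        int skip = bomLength(utf16le);
        // Decode only the bytes after the BOM, so no spurious first character appears.
        String text = new String(utf16le, skip, utf16le.length - skip, StandardCharsets.UTF_16LE);
        System.out.println(text);
    }
}
```

A caller would feed the whole file to the detector as usual, then skip bomLength(...) bytes before opening a reader with the detected charset.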
Can't copy+paste and compile the example TestDetector class on the project
home page because:
1. uses single quotes instead of double quotes in Strings
2. main() not declared to throw java.io.IOException (or other apropos
try/catches)
Might also be nice if it took the file to test with as a command line arg.
Also would be nice to have a link to download the source file, all just for
convenience.
Modified source attached.
Original issue reported on code.google.com by [email protected]
on 5 Feb 2008 at 12:02
Attachments:
It cannot detect the EUC-TW charset.
Original issue reported on code.google.com by [email protected]
on 24 Jul 2008 at 1:01
What steps will reproduce the problem?
1. Pass UniversalDetector a WINDOWS-1252 byte buffer containing a series of degree symbols interleaved with letters and digits,
e.g. {91, -80, 52, -80, 48, -80, 84, -80, 67, -80, 67, -80, 48, -80, 67, -80, 84}
2. Call UniversalDetector#getDetectedCharset(). It should return WINDOWS-1252, but instead returns GB18030.
See attached unit test for minimal reproduction test case.
What is the expected output? What do you see instead?
Expected output from UniversalDetector#getDetectedCharset() is "WINDOWS-1252", but the actual output is "GB18030".
What version of the product are you using? On what operating system?
I'm using version 1.0.3 on 64-bit Ubuntu 11.04 (Natty) with default kernel 2.6.38-10-generic. The JDK I'm currently running is 1.6.0_23-x64.
Original issue reported on code.google.com by [email protected]
on 13 Jul 2011 at 4:34
Attachments:
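For readers without the attached unit test, here is a small JDK-only sketch showing what the byte buffer from the report above looks like when it is decoded as windows-1252. It does not call the detector; the class name is made up for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical illustration, not the unit test attached to the report.
final class DegreeSampleSketch {
    // The exact bytes quoted in the report; -80 is 0xB0, the degree sign in windows-1252.
    static final byte[] SAMPLE =
            {91, -80, 52, -80, 48, -80, 84, -80, 67, -80, 67, -80, 48, -80, 67, -80, 84};

    static String decodeAsWindows1252() {
        return new String(SAMPLE, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        System.out.println(decodeAsWindows1252());
    }
}
```

Every second byte is 0xB0, which is also a legal lead byte in GB18030, so the input is plausible in both encodings. That is presumably why the detector has a hard time here.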
Even though the distributed C sources contain a TIS620/Thai
detector, it is not included in the Java version.
Please add/change the attached Java files.
ThaiModel.java to src/org/mozilla/universalchardet/prober/sequence
change SBCSGroupProber as shown
change Constants as shown.
Regards,
Randolf
Original issue reported on code.google.com by [email protected]
on 12 Feb 2015 at 3:19
Attachments:
Because the length of ISO2022JPSMModel.iso2022jpCharLenTable is smaller
than the class factor, getCharLen() causes ArrayIndexOutOfBoundsException.
Original issue reported on code.google.com by [email protected]
on 14 May 2007 at 8:35
What steps will reproduce the problem?
1. Get Bangla characters encoded in utf-8
2. Try to detect the encoding charset
3. IBM866 is detected
What is the expected output?
Bangla text
What do you see instead?
Garbled text that looks like Russian
What version of the product are you using? On what operating system?
juniversalchardet-1.0.3.jar on Red Hat
Please provide any additional information below.
I sent an email composed of the text below, omitting the charset:
Ovi মেইল ওয়েব
à¦à¦¾à¦·à¦¾ সমরà§�থন
বিষয়ে
বিজ�ঞপ�তি
প�রিয় Bretajohnson,
আগামী কয়েক
সপ�তাহের
à¦à¦¿à¦¤à¦°à§‡ Nokia তাদের Ovi
মেইল
ওয়েবসাইটের
নত�ন দর�শন ও
অà¦à¦¿à¦œà§�ঞতা
সংযোজন করতে
চলেছেন, কারণ
সেটি Yahoo! পরিসেবা
দ�বারা প�ষ�ট Ovi
মেইল-�
পরিবর�তিত হতে
চলেছে। �ই
পরিবর�তনের
ফলে সংয�ক�ত
তাত�ক�ষণিক
বার�তা
(ইন�সট�যান�ট
মেসেজিং, IM) সহ
স�ট�রিমলাইন
করা ওয়েব
অà¦à¦¿à¦œà§�ঞতা ও
অতিরিক�ত
বৈশিষ�ট�য
আপনি আপনার Ovi
মেইল
অ�যাকাউন�টে
পেয়ে যাবেন।
আমরা যখন Ovi মেইল
পরিসেবার
সর�বমোট
উন�নতিসাধন
ঘটাব, আমরা তখন
সেই নত�ন ওয়েব
অà¦à¦¿à¦œà§�ঞতা শà§�রà§�
করার সময়ে Bengali
à¦à¦¾à¦·à¦¾à§Ÿ ওয়েব
সমর�থন করতে
পারব না। �ই
কার�য
চলাকালীন
আপনারা Ovi মেইল
ওয়েবে ইংরেজী
à¦à¦¾à¦·à¦¾à§Ÿ
অ�যাক�সেস
করতে পারবেন,
যাতে আপনারা Ovi
মেইল পরিসেবার
নত�ন ক�ষমতা
সমূহ ব�যবহার
করতে পারেন।
আমরা �ই
পরিস�থিতির
জন�য
আনà§�তরিকà¦à¦¾à¦¬à§‡
দ�ঃখিত, আর আমরা
আগামী কয়েক
মাসে Bengali à¦à¦¾à¦·à¦¾à¦°
সমর�থন
প�নর�চাল�
করার জনà§�à
¦¯ নিরনà§�তর কাজ করে চলেছি।
অন�গ�রহ করে
মনে রাখবেন যে
à¦�ই à¦à¦¾à¦·à¦¾
সমর�থনে
পরিবর�তন
কেবলমাত�র mail.ovi.com-�
ওয়েবকেই
পà§�রà¦à¦¾à¦¬à¦¿à¦¤ করবে,
আপনি যদি Nokia ফোন
থেকে Ovi মেইল
অ�যাক�সেস
করেন তবে সেই
পà§�রà¦à¦¾à¦¬ à¦�খানে
কার�যকারী হবে
না।
Ovi মেইল ব�যবহার
করার জন�য আমরা
আপনাদের
আন�তরিক
ধন�যবাদ জানাই!
ধন�যবাদান�তে,
Ovi by Nokia
আপনি আপনার
ইনবক�সে
আমাদের ই-মেইল
গ�রহণ করে
যাওয়া নিশ�চিত
করতে (জাঙ�ক বা
বাল�ক
ফোল�ডারে নয়)
আপনার
যোগাযোগের
তালিকা বা
নিরাপদ
প�রাপকের
তালিকায়
অন�গ�রহ করে
[email protected] য�ক�ত
কর�ন।
কপিরাইট 2011 Nokia. সব
স�বত�ব
সংরক�ষিত। Nokia Inc, 102
Corporate Park Drive, White Plains, NY 10604 Ovi
http://ct.nokia.com/?593603168&FGOI0 |
ব�যবহারের
শর�তাবলি
http://ct.nokia.com/?593603168&FGOI6 |
গোপনীয়তার
নীতি http://ct.nokia.com/?593603168&FGOI3
Original issue reported on code.google.com by [email protected]
on 3 Aug 2011 at 4:28
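To see why a wrong IBM866 guess produces "Russian-looking" output, the following JDK-only sketch decodes the UTF-8 bytes of a short Bangla word with the wrong charset. The class and method names are hypothetical, and the sample word is just an example, not taken from the detector's test data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of mis-decoding; it does not call the detector.
final class MisdecodeSketch {
    /** Encodes text as UTF-8, then decodes those bytes with another charset. */
    static String misdecode(String text, String assumedCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName(assumedCharset));
    }

    public static void main(String[] args) {
        // A four-character Bangla word, written with escapes to keep the source ASCII.
        String bangla = "\u09ad\u09be\u09b7\u09be";
        System.out.println(misdecode(bangla, "IBM866")); // each UTF-8 byte becomes one IBM866 character
        System.out.println(misdecode(bangla, "UTF-8"));  // round-trips to the original text
    }
}
```

IBM866 is a single-byte code page that places Cyrillic letters in the upper half, so every UTF-8 byte of the Bangla text turns into its own unrelated character.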
The distributed jar is built with Java 1.6 (I think). It would be convenient for a wider range of users (e.g. both 1.5 and 1.6 users) if the javac target in build.xml set source="1.5" and target="1.5", which would make the jar usable "out of the box" for more people.
As built, it causes a compile failure on JDK 1.5.0_12:
> javac -cp .:juniversalchardet-1.0.2.jar TestDetector.java
TestDetector.java:1: cannot access org.mozilla.universalchardet.UniversalDetector
bad class file: juniversalchardet-1.0.2.jar(org/mozilla/universalchardet/UniversalDetector.class)
class file has wrong version 50.0, should be 49.0
Please remove or make sure it appears in the correct subdirectory of the classpath.
import org.mozilla.universalchardet.UniversalDetector;
^
1 error
Patch attached.
Original issue reported on code.google.com by [email protected]
on 4 Feb 2008 at 11:55
Attachments:
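The "wrong version 50.0, should be 49.0" message above refers to the class-file major version (49 is Java 5, 50 is Java 6), which is stored in the first eight bytes of every .class file. A minimal sketch for checking it, with a made-up class name, could look like this:

```java
import java.nio.ByteBuffer;

// Hypothetical helper for inspecting the header of a .class file.
final class ClassVersionSketch {
    /** Returns the major version from the first 8 bytes of a class file. */
    static int majorVersion(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(header); // big-endian by default, like class files
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                 // minor version, ignored in this sketch
        return buf.getShort() & 0xFFFF; // major version: 49 = Java 5, 50 = Java 6
    }

    public static void main(String[] args) {
        // Header bytes of a class compiled for Java 6 (version 50.0).
        byte[] java6 = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 50};
        System.out.println("major version: " + majorVersion(java6));
    }
}
```

Reading the first eight bytes of UniversalDetector.class from the jar and passing them to majorVersion would confirm which JDK level the distributed build targets.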
In case you'd like to switch the repo to Git, I made a git-svn clone available
at:
https://github.com/thkoch2001/juniversalchardet
You can just clone it and for example upload it to google code.
Regards, Thomas Koch
Original issue reported on code.google.com by [email protected]
on 26 Aug 2012 at 6:11
IBM850/437 codepages are not detected for German (and possibly other
Western European languages). Greek - IBM737/851/869 and Arabic 864 would be
nice too.
Reason: no code.
Original issue reported on code.google.com by [email protected]
on 14 Sep 2007 at 11:09
Your licensing terms read "The library is subject to the Mozilla Public License
Version 1.1. Alternatively, the library may be used under the terms of either
the GNU General Public License Version 2 or later, or the GNU Lesser General
Public License 2.1 or later."
Please add the possible use of the library under the terms of the EPL
(http://www.eclipse.org/legal/epl-v10.html, for use of juniversalchardet in
Eclipse-based applications (RCP)), or state in some way that it is okay to
distribute juniversalchardet under the LGPL together with components under the
EPL, e.g., as a separate Plug-in.
Thanks a lot in advance!
Original issue reported on code.google.com by [email protected]
on 27 Feb 2013 at 4:30
What steps will reproduce the problem?
1. Save a file in UTF-8 without BOM
2. Try to detect Character Encoding.
What is the expected output? What do you see instead?
I expect to see UTF-8 from the #getDetectedCharset() method. Instead I get null.
What version of the product are you using? On what operating system?
I am using juniversalchardet-1.0.3.jar on a Windows 7 System.
Please provide any additional information below.
When I use UTF-8 with BOM I can detect the file just fine but Java does not
support BOM so I get characters at the beginning of the file which I do not
want. Therefore I have been using UTF-8 without BOM.
Perhaps I am not feeding the detector enough data with the file I am reading
in? Although I don't think that is the case because I have extended the amount
of data inside of the file up to 171390 characters with no difference.
Original issue reported on code.google.com by [email protected]
on 30 Sep 2011 at 10:20
Attachments:
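Until the null result is explained, one possible fallback (an assumption on my side, not something the library does) is to check with the JDK whether the bytes are well-formed UTF-8 whenever the detector returns null. The class name below is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical fallback check, independent of juniversalchardet.
final class Utf8CheckSketch {
    /** Returns true if data decodes as UTF-8 without any malformed sequences. */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "M\u00fcnchen".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "M\u00fcnchen".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

Note that pure ASCII input also passes this check, and that a buffer cut off in the middle of a multi-byte sequence is reported as invalid, so the whole file (or a cleanly terminated chunk) should be checked.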
It would be nice to have support for windows-1250 charset.
Original issue reported on code.google.com by [email protected]
on 19 Feb 2013 at 5:46
We store some metadata for the stream contents (like hashes), and we wanted to
determine the encoding with it as well. I have therefore wrapped the
UniversalDetector inside a stream to be able to do several actions in one step
using nested streams.
Maybe it is useful to others:
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.mozilla.universalchardet.UniversalDetector;

public class EncodingDetectorInputStream extends BufferedInputStream {
    private final UniversalDetector detector = new UniversalDetector(null);

    public EncodingDetectorInputStream(InputStream in) {
        super(in);
    }

    public String getDetectedCharset() {
        return detector.getDetectedCharset();
    }

    @Override
    public synchronized int read(byte[] b, int off, int len) throws IOException {
        final int nrOfBytesRead = super.read(b, off, len);
        if (!detector.isDone() && nrOfBytesRead > 0) {
            // The bytes just read start at b[off], not at b[0].
            detector.handleData(b, off, nrOfBytesRead);
        }
        if (nrOfBytesRead == -1) {
            // End of stream: let the detector finalize its guess.
            detector.dataEnd();
        }
        return nrOfBytesRead;
    }
}
Original issue reported on code.google.com by [email protected]
on 23 Jul 2013 at 3:32
Charset detection for GB2312, GB18030, and Big5 is wrong.
I looked at the Mozilla universalchardet code and found an error in BIG5Prober.handleData.
Please change "this.distributionAnalyzer.handleChar(buf, i - 1, charLen);"
to "this.distributionAnalyzer.handleOneChar(buf, i - 1, charLen);"
GB18030Prober.handleData has the same bug, and the same change fixes it.
Original issue reported on code.google.com by [email protected]
on 9 Jul 2008 at 3:24
Hi, it would be great if JUC could be uploaded to a fitting Maven repo.
Original issue reported on code.google.com by [email protected]
on 5 Jun 2009 at 10:38
What steps will reproduce the problem?
1. Create a file with the following line:
Wykamol,£588.95,0.18,0.12,testingSpecialised Products for DIY and Professionals£12
(Any text containing two pound signs followed by numbers will do, e.g. Wykamol,£588.95£12)
2. Save the file as ANSI.
What is the expected output? What do you see instead?
Expected something like Western European (Windows), but the result is GB18030.
What version of the product are you using? On what operating system?
1.0.3
Please provide any additional information below.
I am not sure how the API is supposed to be used. I also tried a simple file with a few ANSI characters, like "Find Encoding", and the API returned null as the encoding.
Original issue reported on code.google.com by [email protected]
on 12 Apr 2011 at 11:26
In C:
Source: CharDistribution.cpp
Method: float CharDistributionAnalysis::GetConfidence()
float r = mFreqChars / ((mTotalChars - mFreqChars) * mTypicalDistributionRatio);
In Java:
Source: CharDistributionAnalysis.java
Method: public float getConfidence()
float r = this.freqChars / (this.totalChars - this.freqChars) * this.typicalDistributionRatio;
The Java version is missing a pair of parentheses, so the division is evaluated before the multiplication. This may be a porting mistake.
Original issue reported on code.google.com by [email protected]
on 12 Sep 2012 at 10:21
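The effect of the missing parentheses can be shown with a tiny stand-alone sketch. The method names and the float parameter types are assumptions for illustration only; the real fields may be declared differently.

```java
// Hypothetical illustration of the two expressions quoted in the report.
final class ConfidenceSketch {
    /** Grouping used in the C source: divide by (difference * ratio). */
    static float asInC(float freqChars, float totalChars, float ratio) {
        return freqChars / ((totalChars - freqChars) * ratio);
    }

    /** Grouping in the Java port as quoted: divide first, then multiply by ratio. */
    static float asPorted(float freqChars, float totalChars, float ratio) {
        return freqChars / (totalChars - freqChars) * ratio;
    }

    public static void main(String[] args) {
        // 3 frequent characters out of 4, with an example ratio of 0.75.
        System.out.println(asInC(3f, 4f, 0.75f));    // 3 / (1 * 0.75)
        System.out.println(asPorted(3f, 4f, 0.75f)); // (3 / 1) * 0.75
    }
}
```

With these example inputs the two groupings give 4.0 and 2.25 respectively, so the ported expression is not equivalent to the C one whenever the ratio differs from 1.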