Comments (7)

sigmavirus24 commented on June 27, 2024

So first a question:

  1. Did you test this against master?

Second, chardet isn't meant to work on extraordinarily tiny samples. Perhaps we haven't documented this well, but we really don't aim to have perfect results for small samples.

from chardet.

DRMacIver commented on June 27, 2024

Oops. You're quite right, I should have tested this against master. Sorry! However, I have now, and it still exhibits the same problem.

The size of the string doesn't seem to be the source of the problem. Here's a longer string that does the same thing:

u'\x0000000000000\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0000000000000'

DRMacIver commented on June 27, 2024

It seems to specifically be the utf-16 encoding of the null control character that triggers the issue. There don't appear to be any examples which exhibit the same problem if it's absent.

dan-blanchard commented on June 27, 2024

I'm going to say this is just a limitation of detecting encodings automatically. It would be pretty much impossible for us to correctly detect BOM-prefixed UTF-16LE whose first character is a null, since the UTF-16LE BOM followed by an encoded null is exactly the BOM for UTF-32LE, and the first thing we do is check for BOMs.

>>> import codecs
>>> u'\x00'.encode('utf16') == codecs.BOM_UTF32
True

You can see the relevant code here.
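The comparison above relies on the machine's native byte order, since both 'utf16' and codecs.BOM_UTF32 resolve to whatever the platform uses. A byte-order-independent sketch of the same collision, assuming Python 3 and spelling out the little-endian variants explicitly, might look like this:

```python
import codecs

# UTF-16LE BOM (ff fe) followed by one NUL character encoded as UTF-16LE (00 00).
nul_utf16_le = codecs.BOM_UTF16_LE + "\x00".encode("utf-16-le")

# The resulting four bytes are identical to the UTF-32LE BOM (ff fe 00 00),
# so a BOM check that runs first cannot tell the two apart.
collides = nul_utf16_le == codecs.BOM_UTF32_LE

print(nul_utf16_le.hex(), collides)
```

The collision only involves the first four bytes; everything after them is untouched by the BOM check.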

DRMacIver commented on June 27, 2024

So it's certainly ambiguous in the general case, but it's not actually ambiguous here because the file isn't actually valid utf-32LE, which seems like a property that could be used to disambiguate. Not caring about that is a reasonable decision though.
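As a rough illustration of the validity argument (not something chardet does; the helper name decode_utf16_or_utf32 is made up for this sketch), one could try a strict UTF-32 decode first and only fall back to UTF-16 when that fails:

```python
import codecs


def decode_utf16_or_utf32(data):
    """Return (encoding, text) for the first of UTF-32 / UTF-16 that decodes cleanly.

    Hypothetical helper for illustration only; it is not part of chardet.
    """
    for encoding in ("utf-32", "utf-16"):
        try:
            # Strict decoding: invalid code points or truncated units raise.
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("data is neither valid UTF-32 nor valid UTF-16")


# BOM-prefixed UTF-16LE text whose first character is NUL.
sample = codecs.BOM_UTF16_LE + "\x00hello".encode("utf-16-le")
encoding, text = decode_utf16_or_utf32(sample)
```

Here the bytes after the apparent UTF-32 BOM do not form valid UTF-32 code points, so the UTF-32 attempt fails and the UTF-16 decode is used. This does not resolve inputs that happen to be valid under both encodings, such as a lone encoded NUL.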

DRMacIver commented on June 27, 2024

The context here is that I was looking at https://github.com/audreyr/binaryornot and one of the things it does is try to use the encoding returned by chardet to decode the file in order to see if it can reasonably be interpreted as text, so I thought I'd look further into chardet. I don't believe this is actually causing that project a problem any more (the logic there wasn't working correctly before), but it seemed worth investigating. I think this will manifest in some valid utf-16 being rejected as being binary, but I've no idea if that's a problem for people in practice.

dan-blanchard commented on June 27, 2024

> So it's certainly ambiguous in the general case, but it's not actually ambiguous here because the file isn't actually valid utf-32LE, which seems like a property that could be used to disambiguate.

Very true. A simple check we could use to distinguish the two cases would be to see if the byte string that starts with the UTF-32 BOM has a length that's divisible by 4. If it is, it's UTF-32, and if it isn't, but it is divisible by 2, then it's UTF-16. If it has an odd number of bytes, we could fall back on whichever prober was ranked highest.
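A minimal sketch of that length check, assuming the little-endian BOM and a made-up function name (guess_by_length is not a chardet function), could be:

```python
import codecs


def guess_by_length(data):
    """Guess UTF-32 vs UTF-16 from the total byte length.

    Hypothetical helper; returns None when the length check cannot decide
    and the caller should fall back on the highest-ranked prober.
    """
    if not data.startswith(codecs.BOM_UTF32_LE):
        return None  # no ambiguous prefix, nothing to disambiguate
    if len(data) % 4 == 0:
        return "UTF-32"
    if len(data) % 2 == 0:
        return "UTF-16"
    return None  # odd number of bytes


utf16_sample = codecs.BOM_UTF16_LE + "\x00hello".encode("utf-16-le")  # 14 bytes
utf32_sample = codecs.BOM_UTF32_LE + "hi".encode("utf-32-le")         # 12 bytes
```

Note that this is only a heuristic: UTF-16 input that starts with a null and happens to have a length divisible by 4 would still be reported as UTF-32.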

The only thing I don't like about this is that it makes UTF-32 and UTF-16 detection much slower, because the entire byte string would need to be read in to determine the length. Right now we feed in one character at a time and make decisions based on that. The goal is to examine as little of the string as possible.

We could make things reasonably fast if we didn't care about the case where the string has an odd number of bytes, and just added a new InputState called utf_16_or_32, and just had UniversalDetector.feed() do nothing until it had consumed the whole string if it was in that state.
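To make the idea concrete, here is a toy stand-in rather than chardet's actual UniversalDetector. The class name, attributes, and return values are all assumptions made for this sketch. It only remembers the first four bytes and a running length, and defers the decision until close():

```python
import codecs


class DeferredBomDetector:
    """Toy detector that postpones the UTF-16/UTF-32 decision to close().

    Hypothetical illustration only; chardet's real classes differ.
    """

    def __init__(self):
        self._head = b""   # first four bytes seen so far
        self._length = 0   # total number of bytes fed
        self.result = None

    def feed(self, chunk):
        # Keep only enough of the input to recognise the ambiguous prefix.
        if len(self._head) < 4:
            self._head += chunk[: 4 - len(self._head)]
        self._length += len(chunk)

    def close(self):
        # Decide from the total length once all input has been consumed.
        if self._head == codecs.BOM_UTF32_LE:
            if self._length % 4 == 0:
                self.result = "UTF-32"
            elif self._length % 2 == 0:
                self.result = "UTF-16"
        return self.result


data = codecs.BOM_UTF16_LE + "\x00hello".encode("utf-16-le")
detector = DeferredBomDetector()
for start in range(0, len(data), 3):   # feed in small chunks
    detector.feed(data[start:start + 3])
guess = detector.close()
```

Because only a byte count is kept, feed() stays cheap, though the caller still has to pass the whole input through before an answer is available.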

@sigmavirus24, what do you think?
