Giter Club home page Giter Club logo

Comments (8)

rebeccabilbro avatar rebeccabilbro commented on August 18, 2024

good (terrible) example: https://www.ssa.gov/foia/html/FY08CSV.csv

from cultivar.

rebeccabilbro avatar rebeccabilbro commented on August 18, 2024

another great one: http://www.planecrashinfo.com/1920/1920.htm

from cultivar.

rebeccabilbro avatar rebeccabilbro commented on August 18, 2024

another example: https://catalog.data.gov/dataset/veterans-health-administration-2008-hospital-report-card-patient-satisfaction

from cultivar.

bbengfort avatar bbengfort commented on August 18, 2024

Great examples - I definitely noticed all the error messages that came up as you were experimenting! The actual error seems to be a Unicode decoding error, which is actually potentially more serious. It leads to the question of whether or not these files are actually unicode encoded or if they have some other scheme (making things way more difficult).

from cultivar.

rebeccabilbro avatar rebeccabilbro commented on August 18, 2024

Hmm, sounds like it's potentially related to my #43 then?

from cultivar.

bbengfort avatar bbengfort commented on August 18, 2024

Potentially, though encoding detection is more of an annoying task that's tough to figure out. You could use the file command in your terminal to see if your computer knows the encoding. It's definitely something I'll take a look at.

from cultivar.

rebeccabilbro avatar rebeccabilbro commented on August 18, 2024

Found some more test data that might help with this issue, see: https://github.com/okfn/messytables/tree/7e4f12abef257a4d70a8020e0d024df6fbb02976/horror

from cultivar.

lauralorenz avatar lauralorenz commented on August 18, 2024

Ok, so specifically for the files that @rebeccabilbro first linked, in terms of unicode decode errors, this problem has been solved, assumedly from the python 3.x upgrade in which the default python encoding is utf-8 instead of ascii so these utf-8 (or utf-8 subset) encoded files no longer caused unicode errors. Specifically I today tested the CSVs from https://www.ssa.gov/foia/html/FY08CSV.csv and https://catalog.data.gov/dataset/veterans-health-administration-2008-hospital-report-card-patient-satisfaction, and the HTML from http://www.planecrashinfo.com/1920/1920.htm, with both file storage and S3 backend and none of them caused an error.

How we want to deal with files in encodings that are not utf-8 encoded is a much broader question. For example a utf-16le encoded file (i.e. https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/horror/utf-16le_encoded.csv) won't work right now since utf-16le isn't a subset of utf-8, but I'm not sure yet if we really care. IMHO, it was unreasonable to expect ascii encoding, but is not unreasonable to expect utf-8/utf-8-subset encoding.

So, given all of that I am going to close this issue in terms of the scope of the initial bug. However I will make a note on the roadmap issue for consideration more generally about how we want to deal with non-utf-8-subset encodings in this project for the future. cc @rebeccabilbro @ojedatony1616 @bbengfort @looselycoupled

from cultivar.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.