Giter Club home page Giter Club logo

Comments (7)

EricSoroos avatar EricSoroos commented on June 18, 2024 2

So having a look at what python's csv.sniff is doing, the _guess_quote_and_delimiter https://github.com/python/cpython/blob/main/Lib/csv.py#L273 is very similar, but covers all 4 possible quote patterns for a quoted field, not trying to find the possible cases for a delimiter between quotes.

It also includes a plain _guess_delimiter https://github.com/python/cpython/blob/main/Lib/csv.py#L349 that's essentially what I was suggesting in the last para as a fallback to when guessing the quote and delimiter at the same time doesn't work.

from qsv.

EricSoroos avatar EricSoroos commented on June 18, 2024 1

I'll take a look on it -- We've apparently hit this before (according to a co-worker) and we've got a build of dp+ that basically ignores non-standard delimiters. So immediate issue is patched around.

So, I'm not good with rust (you've seen basically all of what I've ever done), but this confuses me: https://github.com/jqnatividad/qsv-sniffer/blob/master/src/sniffer.rs#L506 . We're calling this with all of the possible quote characters in character, delim is None. This for the "most common" case (csv with "), this regex (somewhat simplified) is hopefully looking for "[ ]*?,[ ]*". If there are no quotes in the csv, then I'm unclear what it's ever going to match, but apparently it was coming up with E?

The delimiter count here is only going to ever pick up a valid delimiter if there's a quote delim quote pattern, which doesn't look like it's going to happen with this test file, and isn't necessarily going to be consistent unless it's a 'quote all' type of file. I suspect if the initial guess of the delimiter was good, then we'd probably get the Viterbi confirming the choice.

What if we look at a sample of likely delimiters. ,\t;|:, and ran counts per line. even in the ignorance of quotes, there should be one that has a roughly consistent number of occurrences per line, with the min value likely to be approximately the number of columns. It would at least rule out ones that don't appear, or have large variations in the number of occurrences per line. That would probably work better for then figuring out a quote, because quotes should be either [quote]{2}, or [quote][delim] or [delim][quote], or it's not a valid quoting for the file.

from qsv.

jqnatividad avatar jqnatividad commented on June 18, 2024

sniff uses the qsv-sniffer crate, which in turn, is a fork of the csv-sniffer crate.

I ended up creating qsv-sniffer since it seems csv-sniffer was no longer being actively maintained as shown by the numerous unanswered issues.

But TBH, the Viterbi algorithm it uses to sniff and infer CSV schemas is still something I don't fully grok, so it trips up on certain CSVs.

It works well enough for "typical" CSVs and I've tweaked it enough to remove the panics, but the whole Viterbi inferencing part needs to be refactored.

FYI, my intent with qsv sniff is to help power a next-gen harvester, that's why I've added all the other extra stuff (mime-type sniffing, range requests sniffing, etc.).

Perhaps, we can tag-team on qsv-sniffer to make its CSV schema inferencing more reliable?

from qsv.

jqnatividad avatar jqnatividad commented on June 18, 2024

Thanks for digging into this @EricSoroos !

Aligning qsv-sniffer's behavior with python's csv sniffer is the way to go!

Do you know if it uses the same Viterbi algorithm? If not, I'm inclined to just rewrite qsv-sniffer to just do a straight port of python's csv sniffer...

from qsv.

EricSoroos avatar EricSoroos commented on June 18, 2024

It doesn't look like it -- it just looks like a frequency based check. Viterbi looks like a general constrain satisfaction algorithm, so it's just one way to determine if a set of parameters is the most likely description of the data.

from qsv.

jqnatividad avatar jqnatividad commented on June 18, 2024

Awesome! I'll create a new branch on the qsv-sniffer crate then and start porting over the code.

Would be nice if we can co-maintain it as there's really nothing like python's csv sniffing in the Rust ecosystem beyond csv-sniffer and qsv-sniffer. Polars has a very fast multi-threaded, mem-mapped CSV reader, and it has some CSV schema sniffing smarts, but its not a general library that can be used separately without bringing in a lot of polars crates.

from qsv.

jqnatividad avatar jqnatividad commented on June 18, 2024

Here's the tracking issue for the EPIC (in more ways than one 😉 ) port of python's csv-sniff

jqnatividad/qsv-sniffer#14

from qsv.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.