Comments (7)
So, having a look at what Python's csv.Sniffer.sniff is doing: _guess_quote_and_delimiter (https://github.com/python/cpython/blob/main/Lib/csv.py#L273) is very similar, but covers all 4 possible quote patterns for a quoted field, rather than trying to find the possible cases for a delimiter between quotes.
It also includes a plain _guess_delimiter (https://github.com/python/cpython/blob/main/Lib/csv.py#L349) that's essentially what I was suggesting in the last para as a fallback for when guessing the quote and delimiter at the same time doesn't work.
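For reference, here is a minimal sketch of how those two paths show up through the public API. The sample strings are made up for illustration; the first is fully quoted, so the combined quote-and-delimiter guess has something to match, while the second has no quotes at all and should end up in the plain delimiter fallback.

```python
import csv

# Hypothetical samples, not taken from the file that triggered this issue.
quoted = '"a","b","c"\n"1","2","3"\n'      # every field quoted
unquoted = 'a;b;c\n1;2;3\n4;5;6\n'         # no quote characters anywhere

sniffer = csv.Sniffer()

# Quoted sample: the quote character and delimiter are guessed together.
quoted_dialect = sniffer.sniff(quoted)

# Unquoted sample: nothing for the quote patterns to match, so the
# frequency-based delimiter guess is what produces the answer.
unquoted_dialect = sniffer.sniff(unquoted)

print(repr(quoted_dialect.delimiter), repr(quoted_dialect.quotechar))
print(repr(unquoted_dialect.delimiter))
```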
from qsv.
I'll take a look at it -- we've apparently hit this before (according to a co-worker) and we've got a build of dp+ that basically ignores non-standard delimiters. So the immediate issue is patched around.
So, I'm not good with Rust (you've seen basically all of what I've ever done), but this confuses me: https://github.com/jqnatividad/qsv-sniffer/blob/master/src/sniffer.rs#L506 . We're calling this with all of the possible quote characters in character, and delim is None. For the "most common" case (CSV with "), this regex (somewhat simplified) is hopefully looking for "[ ]*?,[ ]*". If there are no quotes in the CSV, then I'm unclear what it's ever going to match, but apparently it was coming up with E?
The delimiter count here is only ever going to pick up a valid delimiter if there's a quote delim quote pattern, which doesn't look like it's going to happen with this test file, and isn't necessarily going to be consistent unless it's a 'quote all' type of file. I suspect that if the initial guess of the delimiter were good, we'd probably get the Viterbi confirming the choice.
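To make that concrete, here is a small sketch using the simplified regex quoted above. It is written in Python rather than Rust, it is not the actual pattern built in sniffer.rs, and the sample lines are invented.

```python
import re

# The simplified "quote, optional spaces, comma, optional spaces, quote"
# pattern from the comment above; an approximation, not the crate's regex.
SIMPLIFIED = re.compile(r'"[ ]*?,[ ]*"')

def delimiter_hits(sample):
    """Return every quote-delim-quote match found in the sample."""
    return SIMPLIFIED.findall(sample)

quote_all = '"a","b","c"\n'     # 'quote all' style: matches between fields
no_quotes = 'a,b,c\n1,2,3\n'    # nothing for the pattern to anchor on
mixed = 'a,"b, c",d\n'          # one quoted field, never quote-delim-quote

print(delimiter_hits(quote_all))
print(delimiter_hits(no_quotes))
print(delimiter_hits(mixed))
```

Only the 'quote all' line produces any hits, which is the inconsistency described above.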
What if we looked at a sample of likely delimiters, ,\t;|: , and ran counts per line? Even ignoring quotes, there should be one that has a roughly consistent number of occurrences per line, with the min value likely to be approximately the number of columns (minus one). It would at least rule out ones that don't appear, or that have large variations in the number of occurrences per line. That would probably work better for then figuring out a quote, because quotes should be either [quote]{2}, or [quote][delim], or [delim][quote], or it's not a valid quoting for the file.
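A rough sketch of that idea, in Python for brevity. The function names, the scoring rule (smallest spread first, then largest minimum count) and the sample data are all assumptions made for illustration, not anything in qsv-sniffer or Python's csv module. Treating a quote at the start or end of a line as valid is also an added assumption.

```python
CANDIDATES = ",\t;|:"

def score_delimiters(sample, candidates=CANDIDATES):
    """Map each candidate that occurs to (min count per line, spread)."""
    lines = [ln for ln in sample.splitlines() if ln]
    stats = {}
    for ch in candidates:
        counts = [ln.count(ch) for ln in lines]
        if max(counts, default=0) == 0:
            continue  # rule out candidates that never appear
        stats[ch] = (min(counts), max(counts) - min(counts))
    return stats

def guess_delimiter(sample, candidates=CANDIDATES):
    """Pick the candidate with the most consistent per-line count."""
    stats = score_delimiters(sample, candidates)
    if not stats:
        return None
    # Smallest spread wins; ties go to the larger minimum count.
    return min(stats, key=lambda ch: (stats[ch][1], -stats[ch][0]))

def quoting_is_plausible(sample, delim, quote):
    """Check that every quote touches a quote, a delimiter, or a line edge."""
    for line in sample.splitlines():
        for i, ch in enumerate(line):
            if ch != quote:
                continue
            prev = line[i - 1] if i > 0 else None
            nxt = line[i + 1] if i + 1 < len(line) else None
            neighbours = (prev, nxt)
            if None in neighbours or delim in neighbours or quote in neighbours:
                continue
            return False  # a quote stranded in the middle of a field
    return True

# Invented sample: ';' is the real delimiter, ',' and ':' are noise.
sample = "name;note;time\nann;a, b;10:30\nbob;c;11:45\n"
print(score_delimiters(sample))
print(repr(guess_delimiter(sample)))
print(quoting_is_plausible('"a ""b"" c","d"\n', ",", '"'))
print(quoting_is_plausible("it's,fine\n", ",", "'"))
```

Here ';' has a count of 2 on every line, while ',' and ':' each vary between 0 and 1, so they lose on spread. A real implementation would probably want a tolerance for delimiters embedded in quoted fields rather than a strict min/max spread.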
from qsv.
sniff uses the qsv-sniffer crate, which, in turn, is a fork of the csv-sniffer crate.
I ended up creating qsv-sniffer since it seems csv-sniffer was no longer being actively maintained as shown by the numerous unanswered issues.
But TBH, the Viterbi algorithm it uses to sniff and infer CSV schemas is still something I don't fully grok, so it trips up on certain CSVs.
It works well enough for "typical" CSVs and I've tweaked it enough to remove the panics, but the whole Viterbi inferencing part needs to be refactored.
FYI, my intent with qsv sniff is to help power a next-gen harvester; that's why I've added all the other extra stuff (mime-type sniffing, range requests sniffing, etc.).
Perhaps we can tag-team on qsv-sniffer to make its CSV schema inferencing more reliable?
from qsv.
Thanks for digging into this @EricSoroos !
Aligning qsv-sniffer's behavior with Python's csv sniffer is the way to go!
Do you know if it uses the same Viterbi algorithm? If not, I'm inclined to rewrite qsv-sniffer as a straight port of Python's csv sniffer...
from qsv.
It doesn't look like it -- it just looks like a frequency-based check. Viterbi looks like a general constraint satisfaction algorithm, so it's just one way to determine if a set of parameters is the most likely description of the data.
from qsv.
Awesome! I'll create a new branch on the qsv-sniffer crate then and start porting over the code.
Would be nice if we can co-maintain it, as there's really nothing like Python's csv sniffing in the Rust ecosystem beyond csv-sniffer and qsv-sniffer. Polars has a very fast multi-threaded, mem-mapped CSV reader, and it has some CSV schema sniffing smarts, but it's not a general library that can be used separately without bringing in a lot of polars crates.
from qsv.
Here's the tracking issue for the EPIC (in more ways than one 😉 ) port of python's csv-sniff
from qsv.