Comments (8)
The RFC-compliant CSV reader handles embedded separators within double quotes; readers for the other formats do not at all. This is a non-ideal situation, for sure.
This is a dup of #52 and all four of the current on-deck or active tasks are what I'm actively working on for v2.2.0.
from miller.
@johnkerl While proper quoting can solve this issue, I'm not sure it is the only solution and I'm not certain that it is preferable in this case. Having miller not make a best-effort attempt to parse unquoted-but-still-parsable data is a bug, IMO.
Given the log file that I am currently working with, for example: adding quotes to all the fields would make it a LOT harder to read (as a human reading plaintext, I mean).
Good feedback. I'll dig harder into your request this evening.
Thanks for looking into it further, @johnkerl -- I have not spent long looking into this, but I'm wondering if the problem isn't in /c/input/lrec_reader_mmap_dkvp.c within the lrec_parse_mmap_dkvp function: whenever it matches on ips, it changes the ips to a null and then sets the value to the byte after the ips. This seemingly would result in the behavior that is being seen (i.e. the key ends at the first ips, and the value begins at the last ips).
I wonder if this couldn't be fixed by tweaking that code to ensure that you only match on ips once per field, which should result in only the first ips being matched.
My apologies for the hasty read; I've got double-quotes on the brain & assigned too much weight to the double-quotes in your data. :^/
This is (was) definitely a bug; fixed in c2e11c0.
Thank you for the report!! :)
No problem at all, @johnkerl -- thanks very much for the quick fix!
Nice fix, @johnkerl -- have confirmed that with it in place, miller can now deal very nicely with things like a web-server access.log file in dkvp format (using = as the PS now works well, even though it is used in some field values as well).
Awesome!!