Comments (8)
Hi @osevill it's probably related to flatten/unflatten, but I do not have a solution for you.
@johnkerl will be able to help you.
from miller.
@osevill the 6.11.0 release (https://github.com/johnkerl/miller/releases/tag/v6.11.0) contains PR #1479 which addresses issue #1418.
Before this, Miller was in some cases producing non-compliant CSV output:
$ cat i.j
[
{ "a": 1 },
{ "b": 2, "c": 3 }
]
$ mlr-6.10.0 --j2c cat i.j
a
1
b,c
2,3
After this, Miller now produces compliant CSV output, or says that it can't:
$ mlr-6.11.0 --j2c cat i.j
a
1
mlr: CSV schema change: first keys "a"; current keys "b,c"
mlr: exiting due to data error.
If one row's list of column names is a strict subset of the others it can auto-unsparsify:
$ cat k.j
[
{ "a": 1, "b": 2 },
{ "a": 3 }
]
$ mlr-6.10.0 --j2c cat k.j
a,b
1,2
a
3
$ mlr-6.11.0 --j2c cat k.j
a,b
1,2
3,
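The rule described in this comment can be sketched in Python. This is only an illustration of the behavior as stated here (write under the first header, pad a record whose keys all appear in that header, otherwise stop), not Miller's actual writer code, which may differ in detail. The names `write_strict_csv` and `SchemaChangeError` are hypothetical.

```python
import csv
import io


class SchemaChangeError(Exception):
    """Raised when a record does not fit the header already written."""


def write_strict_csv(records):
    """Write dict records as CSV under one header, padding or raising."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for record in records:
        keys = list(record)
        if header is None:
            header = keys
            writer.writerow(header)
        elif not set(keys) <= set(header):
            # Assumption: any key outside the first header is a schema change.
            first = ",".join(header)
            current = ",".join(keys)
            raise SchemaChangeError(
                f'CSV schema change: first keys "{first}"; current keys "{current}"'
            )
        # Keys missing from this record are written as empty fields.
        writer.writerow([record.get(key, "") for key in header])
    return out.getvalue()


print(write_strict_csv([{"a": 1, "b": 2}, {"a": 3}]), end="")
```

With the k.j records this prints the same three lines as the 6.11.0 output above; with the i.j records it raises instead of writing a second header.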
The concern raised by issue #1418, and addressed by PR #1479, is that Miller should stop producing "CSV" with non-compliant blank lines in it. @aborruso was right to request this in #1418.
For the data files in this issue, the records are truly non-homogeneous and are truly not representable as compliant CSV.
Two options I can suggest:
- Use csvlite output format (https://miller.readthedocs.io/en/latest/file-formats/#csvtsvasvusvetc)
  - This doesn't claim to comply with RFC4180
  - It allows non-homogeneous records, separated with line breaks, which is a good match for the kind of output data you want to obtain
  mlr --ijson --ocsvlite group-like i.j
- Use unsparsify to obtain compliant CSV
  mlr --ijson --ocsv group-like then flatten then unsparsify i.j
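As a rough illustration of what the second option does to the records, here is a Python sketch of the unsparsify idea: collect the union of all keys in first-seen order, then fill the gaps in every record. The function name and the `fill_with` parameter are hypothetical, and the sketch is not taken from Miller's source.

```python
def unsparsify(records, fill_with=""):
    """Give every record the union of all keys, in first-seen order."""
    all_keys = []
    seen = set()
    for record in records:
        for key in record:
            if key not in seen:
                seen.add(key)
                all_keys.append(key)
    # Rebuild each record so that all records share one key list.
    return [
        {key: record.get(key, fill_with) for key in all_keys}
        for record in records
    ]


records = unsparsify([{"a": 1}, {"b": 2, "c": 3}])
```

After this step every record has the keys a, b, c, so a single CSV header covers all rows.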
mlr --ijson --ocsv group-like then flatten then unsparsify i.j
I was looking for this, but each time I am unable to reconstruct it. Thank you @johnkerl
@osevill the best option would be to restore the pre-#1479 behavior, default off, behind a new opt-in flag. I can do this, no problem.
Thanks!
So the proposed behavior would be to continue to auto-unsparsify all header fields of csv/tsv output, but with a new optional flag to not auto-unsparsify (which would make mlr --j2c group-like with the new flag work like it did in 6.10)?
Could you share a sample file and a sample command?
given this sample file:
[{"name":"Rixos The Palm Dubai","location_1":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri-La Hotel","location_1":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand Hyatt","location_1":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]
if I run mlr --j2c group-like sample.csv using ver 6.12 (or 6.11), I get this:
name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1
Shangri-La Hotel,25.2084,55.2719,
Grand Hyatt,25.2285,55.3273,1,2,3
even though each object element has slightly different fields, where I would expect a new csv header row each time.
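For readers wondering where column names such as location_1.1.lat come from: nested JSON is flattened into single-level records before CSV is written. The Python sketch below mimics the naming seen in the output above (a "." separator and array positions counted from 1). It is a simplified illustration with a hypothetical function name, and it ignores edge cases such as empty maps or arrays.

```python
def flatten(value, prefix="", sep="."):
    """Flatten nested dicts and lists into one dict with joined key names."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        # Array positions are numbered from 1, matching the output above.
        items = ((str(i), v) for i, v in enumerate(value, start=1))
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten(child, name, sep))
    return flat


record = {
    "name": "Shangri-La Hotel",
    "location_1": [{"lat": 25.2084}, {"long": 55.2719}],
}
flat_record = flatten(record)
```

Here `flat_record` has the keys name, location_1.1.lat and location_1.2.long, in that order.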
if I change the source file so that the nested array names differ from element to element of the outer array:
[{"name":"Rixos The Palm Dubai","location_1":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri-La Hotel","location_2":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand Hyatt","location_3":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]
I get this error:
mlr: CSV schema change: first keys "name,location_1.1.lat,location_1.2.long,field_1"; current keys "name,location_2.1.lat,location_2.2.long"
name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1
mlr: exiting due to data error.
in version 6.10, this works as expected:
name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1
name,location_1.1.lat,location_1.2.long
Shangri-La Hotel,25.2084,55.2719
name,location_1.1.lat,location_1.2.long,field_1,field_2,field_3
Grand Hyatt,25.2285,55.3273,1,2,3
and
name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1
name,location_2.1.lat,location_2.2.long
Shangri-La Hotel,25.2084,55.2719
name,location_3.1.lat,location_3.2.long,field_1,field_2,field_3
Grand Hyatt,25.2285,55.3273,1,2,3
However, if I use --j2p (instead of --j2c) in 6.12, it seems to work fine.
Thanks.
@johnkerl The feedback above is a great help, particularly ...group-like then flatten then unsparsify to have one unique header row with fields from all json array elements.
But I find it useful sometimes to see csv records in groups, by distinct header, which is what your first suggestion does. You're correct that the sample I provided has no commas in the data fields and so csvlite works, but I then realized that my actual data does sometimes have commas inside double-quoted values, which is why the --j2c option worked well for me prior to 6.11.
Since my json data does have commas in the values, here's what I came up with...
Given this sample file (this time with commas in the values):
[{"name":"Rixos,The,Palm,Dubai","location":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri,La,Hotel","location_2":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand,Hyatt","location_3":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]
...if I convert from json to tsvlite when doing the group-like, I avoid field breaks after each comma in the value. (I'm assuming my data will not have tabs or embedded newlines in the values, which has been the case so far.) Then I do a separate mlr cat just to convert from tab delimiters to a symbol delimiter of my choice (so that the delimiter is a printable character). At this point, I tell whatever software I'm using that the delimiter is my symbol, and I'm good:
mlr --ijson --otsvlite --from ./sample_json_array.json group-like | mlr --itsvlite --ocsvlite --ofs '•' cat > ./output_file.csv
Is there a simpler way to change the delimiter than calling mlr again and changing the output field separator? tsvlite doesn't seem to support changing the output field separator.
Don't know your thoughts on this but would it be worthwhile to have a new file format that is "in between" csvlite and csv, in the sense that it would be csvlite + support for commas or newlines embedded in double quotes, but because it wouldn't adhere to the RFC4180 spec, it would allow for blank rows in the output?
In this particular instance, it would assist me in accomplishing my json to row-based transform in just one mlr group-like.
Thanks again for the feedback.
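To make the proposed "in between" output concrete, here is a Python sketch of it: records are grouped by their key list (the group-like idea), each group is written as its own header block with standard CSV quoting, and the blocks are separated by a blank line. This is a workaround sketch outside Miller, not an existing Miller format. The function name is hypothetical, and the input is assumed to be records that are already flat.

```python
import csv
import io


def grouped_csv_blocks(records):
    """Group flat dict records by key list; write one quoted CSV block per group."""
    groups = {}
    for record in records:
        # Records with the same keys in the same order share a block.
        groups.setdefault(tuple(record), []).append(record)
    blocks = []
    for header, rows in groups.items():
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        for row in rows:
            writer.writerow([row[key] for key in header])
        blocks.append(out.getvalue())
    # Each block ends with a newline, so joining on "\n" leaves a blank line.
    return "\n".join(blocks)


sample = [
    {"name": "Rixos,The,Palm,Dubai", "field_1": 1},
    {"name": "Shangri,La,Hotel"},
    {"name": "Grand,Hyatt", "field_1": 1},
]
print(grouped_csv_blocks(sample), end="")
```

Values containing commas are double-quoted by the csv module, so no alternate delimiter is needed, and the blank lines between blocks are exactly the part that falls outside RFC4180.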