In previous versions mlr --j2c group-like worked fine

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

JSON to CSV Error about miller HOT 8 OPEN

osevill commented on June 8, 2024

JSON to CSV Error

from miller.

Comments (8)

aborruso commented on June 8, 2024 2

Hi @osevill it's probably related to flatten/unflatten, but I do not have a solution for you.

@johnkerl will be able to help you.

from miller.

johnkerl commented on June 8, 2024 2

@osevill the 6.11.0 release (https://github.com/johnkerl/miller/releases/tag/v6.11.0) contains PR #1479 which addresses issue #1418.

Before this, Miller was in some cases producing non-compliant CSV output:

$ cat i.j
[
  { "a": 1 },
  { "b": 2, "c": 3 }
]

$ mlr-6.10.0 --j2c cat i.j
a
1

b,c
2,3

After this, Miller now produces compliant CSV output, or says that it can't:

$ mlr-6.11.0 --j2c cat i.j
a
1
mlr: CSV schema change: first keys "a"; current keys "b,c"
mlr: exiting due to data error.

If one row's list of column names is a strict subset of the others it can auto-unsparsify:

$ cat k.j
[
  { "a": 1, "b": 2 },
  { "a": 3 }
]

$ mlr-6.10.0 --j2c cat k.j
a,b
1,2

a
3

$ mlr-6.11.0 --j2c cat k.j
a,b
1,2
3,

The concern raised by issue #1418, and addressed by PR #1479, is that Miller should stop producing "CSV" with non-compliant blank lines in it. @aborruso was right to request to #1418.

For the data files in this issue, the records are truly non-homogeneous and are truly not representable as compliant CSV.

Two options I can suggest:

Use csvlite output format (https://miller.readthedocs.io/en/latest/file-formats/#csvtsvasvusvetc)
- This doesn't claim to comply with RFC4180
- It allows non-homogeneous records, separated with line breaks, which is a good match for the kind of output data you want to obtain
- mlr --ijson --ocsvlite group-like i.j
Use unsparsify to obtain compliant CSV
- mlr --ijson --ocsv group-like then flatten then unsparsify i.j

from miller.

aborruso commented on June 8, 2024 1

mlr --ijson --ocsv group-like then flatten then unsparsify i.j

I was looking for this, but each time I am unable to reconstruct it. Thank you @johnkerl

from miller.

johnkerl commented on June 8, 2024 1

@osevill the best option would be to restore the behavior of #1479, default off, only behind a new opt-in flag -- I can do this, no problem.

from miller.

osevill commented on June 8, 2024 1

Thanks!
So the proposed behavior would be to continue to auto-unsparsify all header fields of csv/tsv output, but with a new optional flag to not auto-unsparsify (which would make mlr --j2c group-like with the new flag work like it did in 6.10)?

from miller.

aborruso commented on June 8, 2024

Could you share a sample file and a sample command?

from miller.

osevill commented on June 8, 2024

given this sample file:
[{"name":"Rixos The Palm Dubai","location_1":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri-La Hotel","location_1":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand Hyatt","location_1":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]

if I run mlr --j2c group-like sample.csv using ver 6.12 (or 6.11)
I get this:

name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1
Shangri-La Hotel,25.2084,55.2719,
Grand Hyatt,25.2285,55.3273,1,2,3

even though each object element has slightly different fields, where I would expect a new csv header row each time.

if I change the source file so that the nested array names differ from element to element of the outer array:
[{"name":"Rixos The Palm Dubai","location_1":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri-La Hotel","location_2":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand Hyatt","location_3":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]
I get this error:

mlr: CSV schema change: first keys "name,location_1.1.lat,location_1.2.long,field_1"; current keys "name,location_2.1.lat,location_2.2.long"
name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1
mlr: exiting due to data error.

in version 6.10, this works as expected:

name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1

name,location_1.1.lat,location_1.2.long
Shangri-La Hotel,25.2084,55.2719

name,location_1.1.lat,location_1.2.long,field_1,field_2,field_3
Grand Hyatt,25.2285,55.3273,1,2,3

and

name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1

name,location_2.1.lat,location_2.2.long
Shangri-La Hotel,25.2084,55.2719

name,location_3.1.lat,location_3.2.long,field_1,field_2,field_3
Grand Hyatt,25.2285,55.3273,1,2,3

Also, if I use --j2p (instead of --j2c) in 6.12, it seems to work fine however.

Thanks.

from miller.

osevill commented on June 8, 2024

@johnkerl The feedback above is great help, particularly ...group-like then flatten then unsparsify to have one unique header row with fields from all json array elements.
But I find it useful sometimes to see csv records in groups, by distinct header, which is what your first suggestion does. You're correct that the sample I provided has no commas in the data fields and so csvlite works, but I then realized that my actual data does sometimes have commas inside double-quoted values, which is why the --j2c option worked well for me prior to 6.11.

Since my json data does have commas in the values, here's what I came up with...

Given this sample file (this time with commas in the values):
[{"name":"Rixos,The,Palm,Dubai","location":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri,La,Hotel","location_2":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand,Hyatt","location_3":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]

...if I convert from json to tsvlite when doing the group-like, I avoid field breaks after each comma in the value. (I'm assuming my data will not have tabs in the values, which has been the case so far...or embedded newlines); then I do a separate mlr cat just to convert from tab-delimiters to a symbol delimiter of my choice (so that the delimiter is a printable character). At this point, I tell whatever software I'm using that the delimiter is my symbol, and I'm good:

mlr --ijson --otsvlite --from ./sample_json_array.json group-like | mlr --itsvlite --ocsvlite --ofs '•' cat > ./output_file.csv

Is there a simpler way to change the delimiter than calling mlr again and changing the output field separator? tsvlite doesn't seem to support changing the output field separator.

Don't know your thoughts on this but would it be worthwhile to have a new file format that is "in between" csvlite and csv, in the sense that it would be csvlite + support for commas or newlines embedded in double quotes, but because it wouldn't adhere to the RFC4180 spec, it would allow for blank rows in the output?

In this particular instance, it would assist me in accomplishing my json to row-based transform in just one mlr group-like.

Thanks again for the feedback.

from miller.

JSON to CSV Error about miller HOT 8 OPEN

Comments (8)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent