Giter Club home page Giter Club logo

Comments (8)

aborruso avatar aborruso commented on June 8, 2024 2

Hi @osevill it's probably related to flatten/unflatten, but I do not have a solution for you.

@johnkerl will be able to help you.

from miller.

johnkerl avatar johnkerl commented on June 8, 2024 2

@osevill the 6.11.0 release (https://github.com/johnkerl/miller/releases/tag/v6.11.0) contains PR #1479 which addresses issue #1418.

Before this, Miller was in some cases producing non-compliant CSV output:

$ cat i.j
[
  { "a": 1 },
  { "b": 2, "c": 3 }
]
$ mlr-6.10.0 --j2c cat i.j
a
1

b,c
2,3

After this, Miller now produces compliant CSV output, or says that it can't:

$ mlr-6.11.0 --j2c cat i.j
a
1
mlr: CSV schema change: first keys "a"; current keys "b,c"
mlr: exiting due to data error.

If one row's list of column names is a strict subset of the others it can auto-unsparsify:

$ cat k.j
[
  { "a": 1, "b": 2 },
  { "a": 3 }
]

$ mlr-6.10.0 --j2c cat k.j
a,b
1,2

a
3

$ mlr-6.11.0 --j2c cat k.j
a,b
1,2
3,

The concern raised by issue #1418, and addressed by PR #1479, is that Miller should stop producing "CSV" with non-compliant blank lines in it. @aborruso was right to request to #1418.

For the data files in this issue, the records are truly non-homogeneous and are truly not representable as compliant CSV.

Two options I can suggest:

  • Use csvlite output format (https://miller.readthedocs.io/en/latest/file-formats/#csvtsvasvusvetc)
    • This doesn't claim to comply with RFC4180
    • It allows non-homogeneous records, separated with line breaks, which is a good match for the kind of output data you want to obtain
    • mlr --ijson --ocsvlite group-like i.j
  • Use unsparsify to obtain compliant CSV
    • mlr --ijson --ocsv group-like then flatten then unsparsify i.j

from miller.

aborruso avatar aborruso commented on June 8, 2024 1

mlr --ijson --ocsv group-like then flatten then unsparsify i.j

I was looking for this, but each time I am unable to reconstruct it. Thank you @johnkerl

from miller.

johnkerl avatar johnkerl commented on June 8, 2024 1

@osevill the best option would be to restore the behavior of #1479, default off, only behind a new opt-in flag -- I can do this, no problem.

from miller.

osevill avatar osevill commented on June 8, 2024 1

Thanks!
So the proposed behavior would be to continue to auto-unsparsify all header fields of csv/tsv output, but with a new optional flag to not auto-unsparsify (which would make mlr --j2c group-like with the new flag work like it did in 6.10)?

from miller.

aborruso avatar aborruso commented on June 8, 2024

Could you share a sample file and a sample command?

from miller.

osevill avatar osevill commented on June 8, 2024

given this sample file:
[{"name":"Rixos The Palm Dubai","location_1":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri-La Hotel","location_1":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand Hyatt","location_1":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]

if I run mlr --j2c group-like sample.csv using ver 6.12 (or 6.11)
I get this:

name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1
Shangri-La Hotel,25.2084,55.2719,
Grand Hyatt,25.2285,55.3273,1,2,3

even though each object element has slightly different fields, where I would expect a new csv header row each time.

if I change the source file so that the nested array names differ from element to element of the outer array:
[{"name":"Rixos The Palm Dubai","location_1":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri-La Hotel","location_2":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand Hyatt","location_3":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]
I get this error:

mlr: CSV schema change: first keys "name,location_1.1.lat,location_1.2.long,field_1"; current keys "name,location_2.1.lat,location_2.2.long"
name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1
mlr: exiting due to data error.

in version 6.10, this works as expected:

name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1

name,location_1.1.lat,location_1.2.long
Shangri-La Hotel,25.2084,55.2719

name,location_1.1.lat,location_1.2.long,field_1,field_2,field_3
Grand Hyatt,25.2285,55.3273,1,2,3

and

name,location_1.1.lat,location_1.2.long,field_1
Rixos The Palm Dubai,25.1212,55.1535,1

name,location_2.1.lat,location_2.2.long
Shangri-La Hotel,25.2084,55.2719

name,location_3.1.lat,location_3.2.long,field_1,field_2,field_3
Grand Hyatt,25.2285,55.3273,1,2,3

Also, if I use --j2p (instead of --j2c) in 6.12, it seems to work fine however.

Thanks.

from miller.

osevill avatar osevill commented on June 8, 2024

@johnkerl The feedback above is great help, particularly ...group-like then flatten then unsparsify to have one unique header row with fields from all json array elements.
But I find it useful sometimes to see csv records in groups, by distinct header, which is what your first suggestion does. You're correct that the sample I provided has no commas in the data fields and so csvlite works, but I then realized that my actual data does sometimes have commas inside double-quoted values, which is why the --j2c option worked well for me prior to 6.11.

Since my json data does have commas in the values, here's what I came up with...

Given this sample file (this time with commas in the values):
[{"name":"Rixos,The,Palm,Dubai","location":[{"lat":25.1212},{"long":55.1535}],"field_1":1},{"name":"Shangri,La,Hotel","location_2":[{"lat":25.2084},{"long":55.2719}]},{"name":"Grand,Hyatt","location_3":[{"lat":25.2285},{"long":55.3273}],"field_1":1,"field_2":2,"field_3":3}]

...if I convert from json to tsvlite when doing the group-like, I avoid field breaks after each comma in the value. (I'm assuming my data will not have tabs in the values, which has been the case so far...or embedded newlines); then I do a separate mlr cat just to convert from tab-delimiters to a symbol delimiter of my choice (so that the delimiter is a printable character). At this point, I tell whatever software I'm using that the delimiter is my symbol, and I'm good:

mlr --ijson --otsvlite --from ./sample_json_array.json group-like | mlr --itsvlite --ocsvlite --ofs '•' cat > ./output_file.csv

Is there a simpler way to change the delimiter than calling mlr again and changing the output field separator? tsvlite doesn't seem to support changing the output field separator.

Don't know your thoughts on this but would it be worthwhile to have a new file format that is "in between" csvlite and csv, in the sense that it would be csvlite + support for commas or newlines embedded in double quotes, but because it wouldn't adhere to the RFC4180 spec, it would allow for blank rows in the output?

In this particular instance, it would assist me in accomplishing my json to row-based transform in just one mlr group-like.

Thanks again for the feedback.

from miller.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.