Hi, I think that aggregation (any type from stats1 ...

Rows aggregation about miller HOT 8 CLOSED

johnkerl commented on May 22, 2024

Rows aggregation

from miller.

Comments (8)

johnkerl commented on May 22, 2024

Is this your desired output?

$ cat x
a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1
a=0,b=0,key=klucz2,c=4
a=2,b=3,key=klucz2,c=3
a=1,b=2,key=klucz2,c=0
a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1

$ cat x | mlr uniq -g a,b,key,c
a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1
a=0,b=0,key=klucz2,c=4
a=2,b=3,key=klucz2,c=3
a=1,b=2,key=klucz2,c=0

johnkerl commented on May 22, 2024

If so this sounds like system uniq, or sort -u, or https://github.com/johnkerl/scripts/blob/master/fundam/uniqm
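The difference between these system tools is visible on the sample rows above. The following is a minimal sketch, not part of the original thread; the variable `rows` and the helper name `dedup_keep_order` are made up for illustration. Plain `uniq` only collapses adjacent duplicates, `sort -u` removes all duplicates but reorders the rows, and the awk one-liner keeps the first occurrence of each row in input order.

```shell
# Sample rows from the thread; the last three repeat the first three.
rows='a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1
a=0,b=0,key=klucz2,c=4
a=2,b=3,key=klucz2,c=3
a=1,b=2,key=klucz2,c=0
a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1'

# Plain uniq only collapses adjacent duplicates; no two neighbouring
# rows are equal here, so all 9 rows survive.
printf '%s\n' "$rows" | uniq | wc -l

# sort -u removes every duplicate but changes the row order.
printf '%s\n' "$rows" | sort -u

# Hypothetical helper: keep the first occurrence of each row, in input order.
dedup_keep_order() { awk '!seen[$0]++'; }
printf '%s\n' "$rows" | dedup_keep_order
```

For this input, `dedup_keep_order` yields the same six rows, in the same order, as the `mlr uniq -g a,b,key,c` output shown earlier.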

johnkerl commented on May 22, 2024

mlr uniq -g ... already does this if you type out all the column names. There could be a mlr uniq -a which does the uniqueness check on all column names without you needing to type them all out. For DKVP, no better than uniq. But for CSV, it would have added value since it would be header-aware.

Komosa commented on May 22, 2024

No, that is not the desired output. I repeated rows in the example only to show that the input doesn't have to be sorted.

The basic use case for aggr is to sum up all columns for each key. The simplest real-world scenario: given a list of financial transactions, we want to compute a summary for each month (or day, or year, ...).

Komosa commented on May 22, 2024

I also see one additional gotcha:
some fields may not be suitable for aggregation, especially strings. It may be worth using a better strategy than just dropping those fields.

I can propose the following solutions for this case:

use first/last value - simplest
use most common (top one)
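The two strategies proposed above can be sketched in awk. This is an illustration only, not Miller's implementation: the field name `label`, the sample rows, and the helper name `first_and_mode` are all hypothetical. For each key it reports the first value seen and the most common value, with ties going to the value that reached the top count first.

```shell
# Hypothetical DKVP rows with a string field that cannot be summed.
labels='key=klucz,label=x
key=klucz,label=y
key=klucz,label=y
key=klucz2,label=z'

# Hypothetical helper: per key, report the first and the most common label.
# Assumes values contain neither "," nor "=".
first_and_mode() {
  awk -F, '
    {
      split("", rec)                        # clear the per-row record
      for (i = 1; i <= NF; i++) {           # parse each k=v pair
        split($i, kv, "=")
        rec[kv[1]] = kv[2]
      }
      k = rec["key"]; v = rec["label"]
      if (!(k in first)) { first[k] = v; order[++n] = k }
      cnt[k, v]++
      # A value becomes the mode only when it strictly exceeds the best count.
      if (cnt[k, v] > best[k]) { best[k] = cnt[k, v]; mode[k] = v }
    }
    END {
      for (j = 1; j <= n; j++) {
        k = order[j]
        printf "key=%s,label_first=%s,label_mode=%s\n", k, first[k], mode[k]
      }
    }'
}

printf '%s\n' "$labels" | first_and_mode
```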

johnkerl commented on May 22, 2024

Sorry, I didn't read your example closely enough.

Why is this different from

$ cat x
a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1
a=0,b=0,key=klucz2,c=4
a=2,b=3,key=klucz2,c=3
a=1,b=2,key=klucz2,c=0
a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1

$ cat x | mlr stats1 -a sum -f a,b,c -g key
key=klucz,a_sum=16.000000,b_sum=20.000000,c_sum=14.000000
key=klucz2,a_sum=3.000000,b_sum=5.000000,c_sum=7.000000

In the general case, by what criteria do I keep the first batch of three key=klucz rows distinct from the second batch of three? All six of them have the common key=klucz so mlr stats1 -g key aggregates all six of them.
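To make the grouping behaviour concrete, here is a rough awk sketch of a per-key sum. It is not how Miller implements `stats1`; the helper name `sum_by_key` and the variable `rows` are made up for illustration. It shows that rows are bucketed purely by the value of `key`, so both batches of `key=klucz` rows land in the same bucket wherever they appear in the input.

```shell
# Sample rows from the thread; input order does not matter for the sums.
rows='a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1
a=0,b=0,key=klucz2,c=4
a=2,b=3,key=klucz2,c=3
a=1,b=2,key=klucz2,c=0
a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1'

# Hypothetical helper: sum fields a, b, c per value of "key",
# emitting groups in first-seen order. Assumes values contain no "," or "=".
sum_by_key() {
  awk -F, '
    {
      split("", rec)                        # clear the per-row record
      for (i = 1; i <= NF; i++) {           # parse each k=v pair
        split($i, kv, "=")
        rec[kv[1]] = kv[2]
      }
      k = rec["key"]
      if (!(k in seen)) { seen[k] = 1; order[++n] = k }
      a[k] += rec["a"]; b[k] += rec["b"]; c[k] += rec["c"]
    }
    END {
      for (j = 1; j <= n; j++) {
        k = order[j]
        printf "key=%s,a_sum=%d,b_sum=%d,c_sum=%d\n", k, a[k], b[k], c[k]
      }
    }'
}

printf '%s\n' "$rows" | sum_by_key
```

The sums match the `mlr stats1` output above (16, 20, 14 for klucz and 3, 5, 7 for klucz2), printed here as integers.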

Komosa commented on May 22, 2024

Oh, I totally missed the -g switch to stats1, sorry for the trouble.
So, in general, this feature proposal is already implemented ;)

Is it possible to use a different aggregation type for different fields (other than filtering the results afterwards)?
And am I correct that -a mode works with strings?

johnkerl commented on May 22, 2024

Correct on both: Yes, -a mode works for strings. No, at present you can't, say, only sum on one field and only max on another: if there are m aggregators and n fields then you get m*n aggregate outputs.
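For comparison, a per-field choice of aggregator can be sketched outside Miller in awk. This is a hypothetical illustration, not a Miller feature: the helper name `sum_a_max_b` and the variable `rows` are made up. It sums `a` and takes the maximum of `b` per key, producing two outputs per key rather than the m*n outputs (here 2*2 = 4) that two aggregators over two fields would give.

```shell
# Sample rows from the thread.
rows='a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1
a=0,b=0,key=klucz2,c=4
a=2,b=3,key=klucz2,c=3
a=1,b=2,key=klucz2,c=0
a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1'

# Hypothetical helper: per key, sum field a and take the max of field b.
# Assumes numeric a and b, and values without "," or "=".
sum_a_max_b() {
  awk -F, '
    {
      split("", rec)                        # clear the per-row record
      for (i = 1; i <= NF; i++) {           # parse each k=v pair
        split($i, kv, "=")
        rec[kv[1]] = kv[2]
      }
      k = rec["key"]
      if (!(k in seen)) {                   # first row of this key
        seen[k] = 1; order[++n] = k
        bmax[k] = rec["b"] + 0              # seed the max with the first value
      }
      asum[k] += rec["a"]
      if (rec["b"] + 0 > bmax[k]) bmax[k] = rec["b"] + 0
    }
    END {
      for (j = 1; j <= n; j++) {
        k = order[j]
        printf "key=%s,a_sum=%d,b_max=%d\n", k, asum[k], bmax[k]
      }
    }'
}

printf '%s\n' "$rows" | sum_a_max_b
```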
