Currently, file lists are a crude form of the map-reduce pattern. Ideally, the user should be able to specify a reduction/aggregation function to run on named lists.
Current grammar/rules
filelists=["name", ["name", "path"], "tableX"]
- name --> the list, containing file/dir names, is saved to name.txt
- name, path --> the list, containing file/dir names with their path changed to path, is saved to name.txt
- name with table pattern --> assume files are CSVs, concatenate all into 1 table, name.csv
The first two are there to collect files in an ordered, threadsafe manner, to generate lists of input/output pairs for SLURM batch scripts.
The third is a quick fix to enable fusing of CSVs, another common task.
This entire pattern is an instance of map-reduce, i.e. collect, transform, reduce, so let's formalize each entry as
- [ name transformer aggregator ]
Challenges
Don't make the user specify what they don't have to: in 90% of cases, transformer = identity. Positional decoding can be tricky, though (a lot of lookups).
We could try to have the type system find out whether a function is a transformer(x)->x or an aggregator(l, n)->nothing, but that requires full control of the symbols -> functions -> methods lookup, and we already expose the 'any symbol is a function' principle.
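To illustrate why signature-based detection is fragile, here is a minimal sketch that classifies a function by the methods it defines. The argument order (name, list) follows the aggregate example in this note, and `is_transformer`, `is_aggregator`, `to_list`, and `both` are hypothetical names used only for illustration.

```julia
# Assumption: transformers take one path, aggregators take (name, list).
is_transformer(f) = hasmethod(f, Tuple{String})
is_aggregator(f)  = hasmethod(f, Tuple{String, Vector{String}})

# Hypothetical aggregator with a single two-argument method.
to_list(name, list) = nothing

# Hypothetical function that defines both arities.
both(x) = x
both(name, list) = nothing

# Report both roles so an ambiguous function can be detected.
role(f) = (transformer = is_transformer(f), aggregator = is_aggregator(f))

role(identity)  # one-argument method only
role(to_list)   # two-argument method only
role(both)      # matches both roles, so the role cannot be inferred
```

A function like `both` cannot be classified from its methods alone, which is one argument for making the role explicit in the grammar.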
Extend current grammar to be:
file_lists=[entry+]
entry={name=name, transformer=identity, aggregator=aggregator}
with shorthands
entry=name --> name, identity, list_to_file
entry=name, concat --> name, identity, concat tables
entry=name, path --> name, change_path(path), list_to_file
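The shorthand expansion could be sketched as a normalization step that turns every entry into the same NamedTuple shape. This is a sketch under assumptions: `make_entry` and `normalize` are hypothetical names, the bodies of `change_path`, `list_to_file`, and `concat_tables` are placeholders, and the literal string "concat" is assumed to be the marker that selects table concatenation.

```julia
# Placeholder implementations of the functions named in the shorthands (assumptions).
change_path(path) = x -> joinpath(path, basename(x))  # keep file name, swap directory
list_to_file(name, list) = write(name * ".txt", join(list, "\n"))
concat_tables(name, list) = error("placeholder for CSV concatenation")

# Every entry is normalized to the same NamedTuple shape.
function make_entry(name, transformer, aggregator)
    return (name = String(name), transformer = transformer, aggregator = aggregator)
end

# entry = name
normalize(e::AbstractString) = make_entry(e, identity, list_to_file)

# entry = name, concat   or   entry = name, path
function normalize(e::AbstractVector)
    length(e) == 2 || throw(ArgumentError("expected [name, second], got $(e)"))
    name, second = e
    if second == "concat"
        return make_entry(name, identity, concat_tables)
    else
        return make_entry(name, change_path(second), list_to_file)
    end
end

# Long form: fill in defaults for any missing field.
function normalize(e::NamedTuple)
    return make_entry(e.name, get(e, :transformer, identity), get(e, :aggregator, list_to_file))
end

entries = normalize.(["infiles", ["outfiles", "/tmp/output"], ["table", "concat"]])
```

One consequence of this design is that a directory literally named "concat" cannot be expressed with the two-element shorthand; the long form would be needed for that case.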
Features
- Type safety --> switch from Any to NamedTuple, so Julia knows which method to resolve at compile time.
- Allow arbitrary aggregation
Transformer
function transform(x::AbstractString)
    # Default transformer: pass the file/dir name through unchanged.
    return identity(x)
end
Aggregator
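function aggregate(name::AbstractString, list; op=vcat)
    # `list` holds one sublist per collector; `op` is the reduction operator (X above).
    # Reduce each sublist first, then reduce the partial results.
    R = reduce(op, (reduce(op, sublist) for sublist in list))
    # `save` stands for the format-specific writer (text list, CSV, image, ...).
    save(name, R)
end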
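As a concrete instance of the aggregator, here is a minimal sketch of what list_to_file could look like. It assumes that the collection step produces one sublist per thread, that sorting is an acceptable way to get a deterministic order, and that returning the output path is useful; none of this is confirmed by the current implementation.

```julia
# Hypothetical aggregator: merge per-thread sublists and write one name per line.
function list_to_file(name::AbstractString, list)
    flat = reduce(vcat, list)   # merge the sublists into one new vector
    sort!(flat)                 # deterministic order regardless of thread scheduling
    path = name * ".txt"
    write(path, join(flat, "\n"))
    return path
end

# Usage with made-up file names, written to a temporary directory.
per_thread = [["b.tif", "a.tif"], ["c.tif"], ["e.tif", "d.tif"]]
outfile = list_to_file(joinpath(mktempdir(), "inputs"), per_thread)
```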
Use cases that should work
Files
- save files to list
- save files to list + change of path
Tables
- concat table
- concat table + select columns/arbitrary transform
- instead of concat, group_by specified, then save
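The plain concat case could be sketched without any table package, by checking headers and appending rows as text. This is a package-free illustration, not the existing implementation; a real version would more likely go through a CSV/table library, which is also where column selection and group_by would fit. `concat_csv` is a hypothetical name, and every input file is assumed to be non-empty.

```julia
# Hypothetical aggregator: concatenate CSV files that share the same header line.
function concat_csv(name::AbstractString, files)
    isempty(files) && throw(ArgumentError("no files to concatenate"))
    header = readline(first(files))
    out = name * ".csv"
    open(out, "w") do io
        println(io, header)
        for f in files
            lines = readlines(f)
            first(lines) == header || error("header mismatch in $f")
            foreach(l -> println(io, l), lines[2:end])  # rows only, header written once
        end
    end
    return out
end

# Usage with two small made-up tables in a temporary directory.
dir = mktempdir()
a = joinpath(dir, "a.csv"); write(a, "x,y\n1,2\n")
b = joinpath(dir, "b.csv"); write(b, "x,y\n3,4\n")
merged = concat_csv(joinpath(dir, "all"), [a, b])
```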
Images
- Stack k-D images into one (k+1)-D array
- mean/median/max/mask
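The image use cases can be sketched on plain arrays. This assumes the images are already loaded as equally sized numeric arrays; `stack_images` and `reduce_images` are hypothetical names, and only Base and the Statistics standard library are used.

```julia
using Statistics  # standard library: mean, median

# Made-up input: four 2x3 "images" with constant values 1.0 to 4.0.
imgs = [fill(Float64(i), 2, 3) for i in 1:4]

# Stack k-D images along a new (k+1)-th dimension.
stack_images(list) = cat(list...; dims = ndims(first(list)) + 1)

# Pixel-wise reduction along the stacking dimension, dropping it afterwards.
function reduce_images(f, list)
    s = stack_images(list)
    d = ndims(s)
    return dropdims(f(s; dims = d); dims = d)
end

stacked = stack_images(imgs)          # 2x3x4 array
avg = reduce_images(mean, imgs)
med = reduce_images(median, imgs)
mx  = reduce_images(maximum, imgs)
```

Stacking first keeps the sketch short but holds all images in memory at once; a streaming reduction would be preferable for large inputs.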
Transformers that alter content
- Rather than adding the original file to the list, create a tmp file with the transformed content and add that to the list
- Provide wrapper functions a la image_reduce_by, image_group
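A content-altering transformer could be built from a wrapper like the one sketched below. `via_tempfile` is a hypothetical name, and the sketch assumes text content and ignores the original file extension, which a real version would probably preserve.

```julia
# Hypothetical wrapper: lift a content-level function to a path-level transformer.
function via_tempfile(f)
    return function (path::AbstractString)
        out, io = mktemp()   # temporary file path and its open stream
        try
            write(io, f(read(path, String)))
        finally
            close(io)
        end
        return out           # this path, not the original, is added to the list
    end
end

# Usage: a transformer that upper-cases the file content.
to_upper = via_tempfile(uppercase)
src = tempname()
write(src, "hello")
transformed = to_upper(src)
```

The original file is left untouched, so the list ends up pointing at the temporary copy only.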