datacurator.jl's People

Contributors

bencardoen, hamarneh, haneneby, pitmonticone, slee04

Forkers

pitmonticone

datacurator.jl's Issues

Add support to Singularity image for PackageCompiler.jl

Because of the large set of dependencies, the initial startup latency is quite high. Using PackageCompiler.jl we can create a precompiled sysimage that also contains precompiled versions of the shortcode functions, so both startup and first-use latency go down.
Because this is isolated in Singularity, it should not require portability workarounds.

Add sorting capability to aggregators

When you're aggregating data, being able to control the order in which the final result is fused can be vital.

E.g. for images, you don't want to stack the following in the order encountered

a_1.tif, a_3.tif, a_2.tif

but rather in sorted order

a_1.tif, a_2.tif, a_3.tif

Needs

  • Add a sort argument to the aggregation functions, using lexicographical sorting
  • Add a 'group by' argument for stacking, so that files are fused per group
    • a_1, b_1, a_2 --> a.tif, b.tif
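
The two needs above can be sketched as follows. This is a minimal illustration, assuming file names follow a `<group>_<index>.tif` pattern; `group_key` and `group_sorted` are hypothetical helper names, not part of DataCurator. Note that plain lexicographical sorting places a_10.tif before a_2.tif, so indices would need zero padding (or a natural sort) once they exceed one digit.

```julia
# Hypothetical helper: the group is the part of the file name before the last underscore.
function group_key(path::AbstractString)
    stem = first(splitext(basename(path)))
    idx = findlast('_', stem)
    return idx === nothing ? stem : stem[1:prevind(stem, idx)]
end

# Sort lexicographically, then collect the files of each group in that order.
function group_sorted(paths)
    groups = Dict{String, Vector{String}}()
    for p in sort(collect(paths))
        push!(get!(groups, group_key(p), String[]), p)
    end
    return groups
end

groups = group_sorted(["a_1.tif", "b_1.tif", "a_3.tif", "a_2.tif"])
# groups["a"] holds a_1.tif, a_2.tif, a_3.tif in that order; groups["b"] holds b_1.tif
```

Each group's list could then be handed to the stacking aggregator to produce a.tif and b.tif.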

Extend dataframe filtering

The current workflow assumes that filters on dataframes are AND'ed. In addition, all columns are always extracted; there is no way to select a subset.

  • Allow specifying OR
  • Allow selecting of columns (in addition to filter)
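
A minimal sketch of the requested behavior, assuming conditions are plain predicates on a row. Rows are modelled as NamedTuples so the example needs no packages; `filter_select` and its keyword names are hypothetical and not the DataCurator API.

```julia
# Rows are modelled as NamedTuples so the sketch is self-contained.
rows = [(name="a", size=10,  kind="tif"),
        (name="b", size=200, kind="csv"),
        (name="c", size=5,   kind="csv")]

# Hypothetical helper: combine conditions with `all` (AND) or `any` (OR),
# then keep only the requested columns (all columns if none are given).
function filter_select(rows, conditions; combine=all, columns=nothing)
    kept = filter(r -> combine(c -> c(r), conditions), rows)
    return columns === nothing ? kept : [NamedTuple{columns}(r) for r in kept]
end

conditions = [r -> r.size > 100, r -> r.kind == "tif"]
and_rows = filter_select(rows, conditions)                               # AND, all columns
or_rows  = filter_select(rows, conditions; combine=any, columns=(:name,)) # OR, one column
```

Passing `all` or `any` as the combiner keeps the AND default unchanged while making OR a one-word switch.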

Do not halt on error for exceptions in calling code

Add behavior that will continue processing if exceptions or errors are thrown, e.g. catch at the visitor stage.

Make this optional.

Proposed change

  • Define crash secure visitors that catch any exception
  • Log but continue
  • Implement by a wrapper function that forwards but logs on exception
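
The wrapper proposed above could look roughly like this. It is a sketch under assumptions: `crash_safe`, `default` and `enabled` are hypothetical names, and the visitor is a stand-in function.

```julia
# Hypothetical wrapper: forwards to `f`, logs any exception, and returns `default`
# instead of propagating it. With `enabled=false` the original function is returned
# unchanged, which keeps the behavior optional.
function crash_safe(f; default=nothing, enabled=true)
    enabled || return f
    return function (args...; kwargs...)
        try
            return f(args...; kwargs...)
        catch err
            @warn "Visitor failed, continuing" exception=(err, catch_backtrace())
            return default
        end
    end
end

# A stand-in visitor that fails on one input.
visit(x) = x == "bad.tif" ? error("cannot read $x") : uppercase(x)

safe_visit = crash_safe(visit)
results = [safe_visit(x) for x in ["a.tif", "bad.tif", "b.tif"]]
```

Note that a bare `catch` also swallows `InterruptException`; a real implementation may want to rethrow it so that Ctrl-C still stops processing.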

Review documentation & test use cases

Documentation is now listed at:
docs/build/index.html

You can open this with a browser. It needs a thorough review to verify that it is clear and correct.

You can build the documentation after changes

cd docs
julia make.jl

Then open docs/build/index.html in a browser.

Second, the package needs more testing of the getting-started use cases (see example_recipes).

After completion, we'll remove most of the readme.

Add `remove pattern` for filenames

Add capability that

  • allows user to specify a pattern (begin, end)
  • end result should be file or directory without (begin, end)

Example

actions = [ [ "transform_copy",  ["remove_from_to", "CDE", ".txt"] ] ]

Given /dev/shm/CDEabc.txt, this would create /dev/shm/CDE.txt

Steps

  • add remove_from_to(x, from, to; first_inclusive=true, second_inclusive=false)
  • define variants as API calls with parameters
  • test with copy
  • test with move
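
A possible implementation of the proposed signature, as a sketch only: the body below is an assumption, not the author's code. It operates on the file name and leaves the directory part untouched. In this sketch the example result /dev/shm/CDE.txt corresponds to `first_inclusive=false`; with the defaults as listed, the `from` pattern itself is removed as well.

```julia
# Hypothetical implementation: remove the text between `from` and `to` in the file name.
# `first_inclusive` also removes `from`; `second_inclusive` also removes `to`.
function remove_from_to(x::AbstractString, from::AbstractString, to::AbstractString;
                        first_inclusive=true, second_inclusive=false)
    dir, name = splitdir(x)
    f = findfirst(from, name)
    f === nothing && return x                      # `from` not present: leave unchanged
    t = findnext(to, name, nextind(name, last(f)))
    t === nothing && return x                      # `to` not present after `from`
    head = first_inclusive ? name[1:prevind(name, first(f))] : name[1:last(f)]
    tail = second_inclusive ? name[nextind(name, last(t)):end] : name[first(t):end]
    return joinpath(dir, head * tail)
end

remove_from_to("/dev/shm/CDEabc.txt", "CDE", ".txt"; first_inclusive=false)
```

Defining the variants as separate API calls, as suggested in the steps, would then amount to fixing the two keywords per variant.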

Add support for databases

While DataCurator is designed to process and validate filesystem based datasets, it is entirely possible to encounter database files.
As a proof of concept, show how databases-as-files (SQLite) can be supported.

  • Add SQLite.jl
  • Add detection "is_sqlite" and load functions "load_sqlite"
  • Add saving from DataFrame to SQLite
  • Add option to save aggregated tables to SQLite
  • Add example recipe to detect, load, extract a table using SQL, and save to dataframe
  • Add SQL condition, e.g. table contains rows matching condition y
  • Add SQL action, e.g. modify table
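
For the detection step, one option that needs no extra dependency is to check the file header: SQLite database files begin with the 16-byte string "SQLite format 3" followed by a NUL byte. The sketch below assumes that approach; the function body is an illustration, and loading or querying would still go through SQLite.jl.

```julia
# The SQLite file format starts with this 16-byte header string.
const SQLITE_MAGIC = Vector{UInt8}("SQLite format 3\0")

# Hypothetical detector: true if `path` is a regular file that begins with the header.
function is_sqlite(path::AbstractString)
    isfile(path) || return false
    header = open(io -> read(io, length(SQLITE_MAGIC)), path)
    return header == SQLITE_MAGIC
end

# Usage on two temporary files: one with the header, one without.
db_like = tempname()
write(db_like, SQLITE_MAGIC)
other = tempname()
write(other, "a,b\n1,2\n")
```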

Make formal aggregate API

Currently file lists are a crude form of the end (map-reduce) pattern. Ideally, the user should be able to specify a reduction/aggregation function to be run on named lists.

Current grammar/rules

filelists=["name", ["name", "path"], "tableX"]

  • name --> the list, containing file/dir names, is saved to name.txt
  • name, path --> the list, containing file/dir names with their path changed to path, is saved to name.txt
  • name with table pattern --> assume files are CSVs, concatenate all into 1 table, name.csv

The first two are there to collect files in an ordered, threadsafe manner, to generate lists of input/output pairs for SLURM batch scripts.
The third is a quick fix to enable fusing of CSVs, another common task.

This entire pattern is an instance of map-reduce, e.g. collect, transform, reduce, so let's formalize to

  • [ name transformer aggregator ]

Challenges

Don't make the user specify what they don't have to, e.g. in 90% of cases transformer = identity, but positional decoding can be tricky (a lot of lookups).

We could try to have the type system find out if a function is transformer(x)->x or aggregator(l, n)->nothing, but that requires full control of the symbols -> functions -> methods lookup, and we already expose the 'any symbol is a function' principle.

Extend current grammar to be:

file_lists=[entry+]
entry={name=name, transformer=identity, aggregator=aggregator}

with shorthands

entry=name --> name, identity, list_to_file
entry=name , concat --> name, identity, concat tables
entry=name, path --> name, change_path(path), list_to_file
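
Decoding these shorthands into a single entry shape could be sketched as below. Everything here is an assumption for illustration: the bodies of `list_to_file`, `concat_tables` and `change_path` are stand-ins, and `make_entry` and `decode` are hypothetical names.

```julia
# Stand-in aggregators and transformer; names follow the shorthands above.
list_to_file(name, list) = "write $(length(list)) entries to $name.txt"
concat_tables(name, list) = "concatenate $(length(list)) tables into $name.csv"
# Assumed meaning of change_path: re-root the file name under `path`.
change_path(path) = x -> joinpath(path, basename(x))

# One NamedTuple shape for every entry, with the common defaults filled in.
make_entry(name; transformer=identity, aggregator=list_to_file) =
    (name=name, transformer=transformer, aggregator=aggregator)

# Decode the shorthands into full entries.
decode(e::AbstractString) = make_entry(e)
function decode(e::AbstractVector)
    name, arg = e
    arg == "concat" && return make_entry(name; aggregator=concat_tables)
    return make_entry(name; transformer=change_path(arg))
end

entries = decode.(["name", ["name", "concat"], ["name", "/tmp/out"]])
```

One design consequence of positional shorthands is visible here: a directory literally named "concat" could not be used as a path, which is an argument for also accepting the explicit keyed form.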

Features

  • Type safety --> switch from Any to NamedTuple, so Julia knows which to resolve at compile time.
  • Allow arbitrary aggregation

Transformer

function transform(x::AbstractString)
     return identity(x)
end

Aggregator

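function aggregate(name::AbstractString, list; op=vcat)
     # `op` is the reduction operator (vcat is an assumed default)
     # Reduce each sublist first, then reduce the partial results
     R = reduce(op, (reduce(op, sublist) for sublist in list))
     # `save` stands for whichever writer matches the result type
     save(name, R)
end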

Use cases that should work

Files

  • save files to list
  • save files to list + change of path

Tables

  • concat table
  • concat table + select columns/arbitrary transform
  • instead of concat, group_by specified, then save

Images

  • Stack kd to k+1d
  • mean/median/max/mask

Transformers that alter content

  • Rather than add list, create a tmp file with transformed content, add that to list
  • Provide wrapper functions a la image_reduce_by, image_group

Ensure installation instructions work as-is and as claimed

  • Ensure SmlmTools.jl dependency is listed to prevent it from stopping the installation in package mode
  • Ensure R_HOME is set correctly so RCall builds correctly regardless of the user environment
  • Ensure Singularity image runs as expected

Add a 'describe' function for images

Without arguments:

  • report kth moments of distribution of image
  • saves in table with filename as column
  • adds to list for aggregation to master table

describe, table
describe, table, axis

  • slice over axis X and report 1 row per slice

  • add_to_file_list --> transformer = describe, aggregator = concat_to_table

  • concat to table --> add column :sourcefile
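
A sketch of what `describe` could compute, using only the Statistics standard library. Which moments to report is not specified above, so the choice of the first four (mean, standard deviation, skewness, kurtosis) is an assumption, as are the column names `sourcefile` and `slice`.

```julia
using Statistics

# Hypothetical `describe`: the first four standardized moments of the intensities.
# A constant image has zero standard deviation, so skewness and kurtosis are NaN.
function describe(img::AbstractArray{<:Real})
    x = Float64.(vec(img))
    μ, σ = mean(x), std(x; corrected=false)
    z = (x .- μ) ./ σ
    return (mean=μ, std=σ, skewness=mean(z .^ 3), kurtosis=mean(z .^ 4))
end

# One row per slice along `axis`, tagged with the source file so rows from
# different files can later be concatenated into one table.
function describe(img::AbstractArray{<:Real}, axis::Integer; sourcefile="")
    return [merge((sourcefile=sourcefile, slice=i), describe(s))
            for (i, s) in enumerate(eachslice(img; dims=axis))]
end

img = reshape(Float64.(1:8), 2, 2, 2)
rows = describe(img, 3; sourcefile="a.tif")
```

Returning NamedTuples keeps the rows compatible with the Tables.jl row interface, so a concat-to-table aggregator could collect them directly.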

Add global support for output directory

For file aggregation, it's arduous to specify output path in the name of the table/list.

Add

[global]
outputdirectory="path"

Aggregation operations then prefix this path to the name of the table or list.
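
How the prefixing could work, as a minimal sketch: the recipe fragment is parsed with the TOML standard library, and `output_path` is a hypothetical helper. Leaving absolute names untouched is an assumption about the desired behavior.

```julia
using TOML

# Hypothetical helper: prefix the configured output directory, unless none is
# configured or the name is already an absolute path.
function output_path(name::AbstractString, config::AbstractDict)
    dir = get(get(config, "global", Dict()), "outputdirectory", "")
    return (isempty(dir) || isabspath(name)) ? name : joinpath(dir, name)
end

config = TOML.parse("""
[global]
outputdirectory="path"
""")

table_path = output_path("table.csv", config)
```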

SCP fails with spaces in names

For now, upload_to_scp will fail with spaces in names. A partial fix handles some remote paths, but files with spaces in their names will probably still fail to copy.

Add modification of file content

Transform in place & friends work on naming, not content.

Add a (limited) list of functions to transform the content, as well as the files

  • Images
    • 1-1 Images : mask, normalize, roi, slice
    • 1-N : split_channel, split_color
  • CSV
    • 1-1 : nan_to
    • normalize_linear
    • select_columns

Supporting:

  • add transform_file(x, f) to mark for the parser when this is applicable
    • error on directory
    • keep table of transform_f capable functions
    • store the result of each in a tmp file, returning the tmp file
      • x -> f(x) --> tmp
      • x -> f(g(h(x))) --> tmp
  • when parser sees transform_
    • if content transformers are seen, replace transform_inplace/copy(x, f(x)) with
      • transform_inplace [file actions] [content actions]
      • add a function with table lookup to splice out content actions
      • tmpfile = f(x)
      • transform_inplace(tmpfile, x) (with overwrite)
      • transform_copy(tmpfile, final)
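
The temporary-file flow above can be sketched as follows. This is an illustration under assumptions: content is treated as text so the example stays dependency-free, `nan_to_zero` is a stand-in for a content transformer such as nan_to, and the body of `transform_file` is hypothetical.

```julia
# Hypothetical `transform_file`: apply content transformer `f` to the content of `x`
# and store the result in a temporary file, leaving `x` itself untouched.
function transform_file(x::AbstractString, f)
    isdir(x) && throw(ArgumentError("content transformers do not apply to directories: $x"))
    tmp = tempname()
    write(tmp, f(read(x, String)))
    return tmp
end

# Stand-in content transformer.
nan_to_zero(s) = replace(s, "NaN" => "0")

src = tempname()
write(src, "a,b\n1,NaN\n")

# Chained transformers compose into one function: x -> f(g(x)) --> tmp
tmp = transform_file(src, uppercase ∘ nan_to_zero)

# In-place variant: move the temporary file over the original (with overwrite).
mv(tmp, src; force=true)
```

For the copy variant, `cp(tmp, final)` would take the place of the `mv` call, where `final` is the renamed target produced by the file actions.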
