
data-vault's Issues

Add more metadata?

  • hashsum of notebook at the time of the write
  • hashcode of the most recent commit (if in git)

Store in path equivalent to the current notebook

It is often useful to have the data storage structure reflect the structure of the notebooks. However, as notebooks get renamed and moved around, the paths need to be updated. I propose using a dot (.) to indicate that data should be saved in the path equivalent to the currently running notebook. Running:

%vault store data in .

in a notebook located at analyses/main_analysis.ipynb would save the data in the analyses/main_analysis path of the vault, i.e. it would be equivalent to running:

%vault store data in analyses/main_analysis

An alternative syntax would use double underscores, e.g.:

%vault store data in __here__

The dot syntax is closer to Python's relative import syntax (from . import x), and is thus slightly preferred.

The dot syntax could allow further path specification:

%vault store data_clean_1 in ./processed
%vault store data_clean_2 in ./processed
%vault store data_raw_1 in ./raw
%vault store data_raw_2 in ./raw

Support for .. could be considered too, but is outside the scope of this proposal.
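
The dot resolution proposed above could be sketched roughly as follows. This is only an illustration, not data-vault code: resolve_vault_path is a hypothetical helper, and how the path of the running notebook is obtained is left open (here it is simply passed in as an argument).

```python
from pathlib import PurePosixPath


def resolve_vault_path(target: str, notebook_path: str) -> str:
    """Expand a leading '.' in `target` to the notebook path without its extension.

    Hypothetical helper: `notebook_path` is assumed to be known to the caller.
    """
    # 'analyses/main_analysis.ipynb' becomes 'analyses/main_analysis'
    here = PurePosixPath(notebook_path).with_suffix("")
    if target == ".":
        return str(here)
    if target.startswith("./"):
        # './processed' is resolved below the notebook's own path
        return str(here / target[2:])
    # anything else is treated as an ordinary vault path and left untouched
    return target
```

With this sketch, "." in analyses/main_analysis.ipynb resolves to analyses/main_analysis and "./processed" resolves to analyses/main_analysis/processed, while a plain path such as "datasets" is returned unchanged.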

Allow for multi-line imports?

This may require cell (%%) magic, but would be useful when importing multiple things at once, e.g.

%%vault from x import (
    a,
    b,
    c,
    d
)

This makes sense if a, b, c, d are longer identifiers.
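
A rough sketch of how the parenthesised list could be parsed is given below. It assumes that the magic's first line and the cell body have already been joined into a single string; parse_import_names is a hypothetical name and not part of data-vault.

```python
import re
from typing import List, Tuple


def parse_import_names(text: str) -> Tuple[str, List[str]]:
    """Split 'from <path> import (a, b, ...)' into the path and the imported names.

    Hypothetical helper: `text` is the magic line plus the cell body.
    """
    match = re.fullmatch(
        r"\s*from\s+(\S+)\s+import\s*\(\s*(.*?)\s*\)\s*",
        text,
        flags=re.DOTALL,  # let the name list span several lines
    )
    if match is None:
        raise ValueError("expected: from <path> import (name, ...)")
    path, names = match.groups()
    # drop surrounding whitespace; a trailing comma is tolerated
    return path, [name.strip() for name in names.split(",") if name.strip()]
```

Input that does not follow the from ... import (...) shape raises a ValueError instead of being silently accepted.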

Example does not work in JupyterLab (Windows)?

The example generates errors on the line:
%vault store salaries in datasets

Not sure what 'datasets' is; it is not declared in the example.
I am trying to use this on a Windows computer. A Windows example would be appreciated.

Consider adding simple filtering

Simple filtering proposal - idea 1

To enable high-performance subsetting, a simple, grep-like pre-filtering will be provided:

Import only first five rows:

%vault from notebook import large_frame.rows[:5] as large_frame_head

When subsetting, the use of as would be required to prevent potential confusion of the original large_frame object with its subset.

To import only rows including text "SNP":

%vault from notebook import large_frame.grep("SNP") as large_frame_snps

By design, no advanced filtering is intended at this step.

However, if your file is too big to fit into memory and you need more advanced filtering,
you can provide your custom import function to the low-level load_storage_object magic:

def your_function(f):
    # f is iterated line by line; this keeps every other line
    return [
        line
        for i, line in enumerate(f)
        if i % 2 == 0   # replace with fancy filtering as needed
    ]

%vault import 'notebook_path/variable.tsv' as variable with your_function

The advanced filtering can already be achieved with existing code.

Simple filtering proposal - idea 2

Import the first 5 rows:

from data_vault import subset
%vault import 'notebook_path/variable.tsv' as variable with subset.rows[:5]

to be implemented with nrows

Import the first 5 columns:

%vault import 'notebook_path/variable.tsv' as variable with subset.columns[:5]

to be implemented with usecols

Import rows containing a string:

%vault import 'notebook_path/variable.tsv' as variable with subset.contains('text')

Import rows matching a regular expression:

%vault import 'notebook_path/variable.tsv' as variable with subset.matches('.*? text')

Both are to be implemented with a custom IO iterator that discards, on the fly, lines which do not match the criteria.
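
A minimal sketch of such an on-the-fly filter, using only the standard library, might look like this. The name matching_lines and the keep_header option are assumptions made for illustration; whether the header row should always be kept is a design decision the proposal does not settle.

```python
import io
import re
from typing import Iterable, Iterator


def matching_lines(lines: Iterable[str], pattern: str, keep_header: bool = True) -> Iterator[str]:
    """Lazily yield the lines that match `pattern`, one line at a time.

    Hypothetical helper: with keep_header=True the first line is always passed through.
    """
    regex = re.compile(pattern)
    for index, line in enumerate(lines):
        if (keep_header and index == 0) or regex.search(line):
            yield line


# Example input standing in for a large tab-separated file
raw = io.StringIO("id\tkind\n1\tSNP\n2\tindel\n3\tSNP\n")
# Only the surviving lines are collected, so the full file is never held in memory
filtered = io.StringIO("".join(matching_lines(raw, "SNP")))
```

A contains-style filter could reuse the same generator by passing re.escape(text) as the pattern, so that the text is matched literally rather than as a regular expression.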

Challenges:

  • how to support the variety of delimiters and options?
    • subset.using(sep='csv').rows[:5]?

Additional test cases

  • %vault from path in variable should raise,
  • %vault from variable import path should raise
