data-vault

IPython data-vault


IPython magic for simple, organized, compressed and encrypted storage & transfer of files between notebooks.

Background and demo

Right tool for a simple job

The %vault magic provides a reproducible caching mechanism for variable exchange between notebooks. The cache is compressed, persistent and safe.

Unlike the built-in %store magic, the variables are stored in plain sight, in a zipped archive, so that they can be easily accessed for manual inspection or for use by other tools.

Demonstration by usage:

Let's open the vault (it will be created if it does not exist yet):

%open_vault -p data/storage.zip

Generate some dummy dataset:

from pandas import DataFrame
from random import choice, randint
cities = ['London', 'Delhi', 'Tokyo', 'Lagos', 'Warsaw', 'Chongqing']
salaries = DataFrame([
    {'salary': randint(0, 100), 'city': choice(cities)}
    for i in range(10000)
])

Store variable in a module

And store it in the vault:

%vault store salaries in datasets

Stored salaries (None → 40CA7812) at Sunday, 08. Dec 2019 11:58

A short description (including a CRC32 hashsum and a timestamp) is printed out by default, but can be disabled by passing --timestamp False to the %open_vault magic. Additional information that enhances reproducibility is stored in the cell metadata.
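
For example, to keep caching but disable the printed description:

%open_vault -p data/storage.zip --timestamp False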

Import variable from a module

We can now load the stored DataFrame in another (or the same) notebook:

%vault import salaries from datasets

Imported salaries (40CA7812) at Sunday, 08. Dec 2019 12:02

Thanks to (optional) memory optimizations, we saved some RAM (87% compared to the unoptimized pd.read_csv() result). To track how many MB were saved, use the --report_memory_gain setting, which displays the memory optimization results below imports, for example:

Reduced memory usage by 87.28%, from 0.79 MB to 0.10 MB.
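
A minimal sketch of enabling the report, assuming that --report_memory_gain is passed to %open_vault like the other settings:

%open_vault -p data/storage.zip --report_memory_gain True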

Import variable as something else

If we already have the salaries variable, we can rename on import with as, just like in the Python import system.

%vault import salaries from datasets as salaries_dataset

Store or import with a custom function

from pandas import read_csv
to_csv = lambda df: df.to_csv()
%vault store salaries in datasets with to_csv as salaries_csv
%vault import salaries_csv from datasets with read_csv

Import an arbitrary file

from pandas import read_excel
%vault import 'cars.xlsx' as cars_dataset with read_excel

More examples are available in the Examples.ipynb notebook, which can be run interactively in the browser.

Goals

Syntax:

  • easy to understand in plain language (avoid abbreviations when possible),
  • while intuitive for Python developers,
  • ...but sufficiently different so that it would not be mistaken for Python constructs
    • for example, we could have %from x import y, but this looks very much like normal Python; having %vault from x import y makes it sufficiently easy to distinguish
  • star imports are better avoided, thus not supported
  • as imports may be confusing if there is more than one

Reproducibility:

  • promote good reproducible and traceable organization of files:
    • promote storage in plain text files and the use of DataFrame

      pickling is often an easy solution, but it can cause painful problems in the prototyping phase (which is what notebooks are often used for): if you pickle your objects, then change the class definition and attempt to load your data again, you are likely to fail severely (see the sketch after this list); this is why plain text files are the default option in this package (but pickling is supported too!).

    • print out a short hashsum and human-readable datetime (always in UTC),
    • while providing even more details in cell metadata
  • allow tracing of instances where the code was modified after execution
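
The pickling pitfall mentioned above is easy to demonstrate; a minimal sketch (the Record class is a hypothetical example):

import pickle

class Record:
    def __init__(self, value):
        self.value = value

blob = pickle.dumps(Record(42))

# ...later, after the class definition was renamed or removed:
del Record

try:
    pickle.loads(blob)   # fails: pickle cannot find the original class
except AttributeError as error:
    print(error)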

Security:

  • think of it as a tool to minimize the damage in case of accidental git add of data files (even if those should have been elsewhere and .gitignored in the first place),
  • or, as an additional layer of security for already anonymized data,
  • but this tool is not aimed at facilitating the storage of highly sensitive data
  • you have to set a password, or explicitly set --secure False, to silence the security warning

Features overview

Metadata for storage operations

Each operation will print out the timestamp and the CRC32 short checksum of the files involved. The timestamp of the operation is reported in the UTC timezone in a human-readable format.
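
For illustration, a short checksum of this kind can be computed with Python's standard zlib module (whether data-vault uses zlib internally is an assumption):

from zlib import crc32

# 8-character, upper-case hexadecimal CRC32 checksum of some bytes
print(format(crc32(b'salary,city\n42,London\n'), '08X'))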

Printing the timestamp and checksum can be disabled by setting -t False or --timestamp False; however, for the sake of reproducibility, it is encouraged to keep this information visible in the notebook.

More precise information, including the SHA256 checksum (with a lower probability of collisions) and a full timestamp (to detect potential race conditions in file write operations), is embedded in the metadata of the cell. You can disable this by setting --metadata False.

The exact command line is also stored in the metadata, so that if you accidentally modify the code cell without re-running the code, the change can be tracked down.
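
A minimal sketch of inspecting this metadata from outside Jupyter with the nbformat package (the notebook file name and the metadata key are hypothetical; check your own notebooks for the exact key):

import nbformat

notebook = nbformat.read('analysis.ipynb', as_version=4)
for cell in notebook.cells:
    if 'vault' in cell.metadata:   # hypothetical key name
        print(cell.metadata['vault'])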

Storage

In order to enforce interoperability, plain text files are used for pandas DataFrame and Series objects. Other variables are stored as pickle objects. The location of the storage archive on the disk defaults to storage.zip in the current directory, and can be changed using the %open_vault magic:

%open_vault -p custom_storage.zip

Encryption

The encryption is not intended as a high security mechanism, but only as an additional layer of protection for already anonymized data.

The password to encrypt the storage archive is retrieved from an environment variable, using the name provided in encryption_variable during the setup.

%open_vault -e ENV_STORAGE_KEY
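
A minimal sketch of a full setup in a notebook (in practice, export the variable in your shell profile instead, to keep the password out of the notebook):

import os
os.environ['ENV_STORAGE_KEY'] = 'my-secret-passphrase'   # better: set this outside the notebook

%open_vault -p data/storage.zip -e ENV_STORAGE_KEY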

Memory optimizations

Pandas DataFrames are memory-optimized by default, by converting string columns to (ordered) categorical columns (the pandas equivalent of R's factors/levels). Each string column is tested for the memory improvement, and the optimization is only applied if it does reduce memory usage.
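
A minimal sketch of the underlying technique, using pandas directly (this illustrates the idea, not the package's internal code):

from pandas import DataFrame
from random import choice

df = DataFrame({'city': [choice(['London', 'Delhi', 'Tokyo']) for _ in range(10000)]})
before = df.memory_usage(deep=True).sum()
df['city'] = df['city'].astype('category')   # convert strings to a categorical column
after = df.memory_usage(deep=True).sum()
print(f'Reduced from {before} to {after} bytes')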

Why ZIP and not HDF?

The storage archive is conceptually similar to Hierarchical Data Format (e.g. HDF5) object - it contains:

  • a hierarchy of files, and
  • metadata files

I believe that HDF may be the future, but this future is not here yet: numerous issues with the packages handling HDF files, as well as low performance and compression rates, prompted me to stay with the simple ZIP format for now.

ZIP is a popular file format with well-known features and limitations: files can be password-encrypted, while the file list is always accessible. This is acceptable given that the code of the project is assumed to be public and only the files in the storage area are encrypted, increasing the security in case of unauthorized access.
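
For illustration, the always-accessible file list can be inspected with Python's standard zipfile module:

from zipfile import ZipFile

with ZipFile('data/storage.zip') as archive:
    print(archive.namelist())   # file names are visible even without the password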

As the limitations of ZIP encryption are assumed to be common knowledge, I hope that managing expectations of the level of security offered by this package will be easier.

Installation and requirements

Pre-requirements:

  • Python 3.6+
  • 7zip (16.02+) (see below for Ubuntu and Mac commands)

Installation:

pip3 install data_vault

Installing 7-zip

You can use p7zip packages from the default repositories:

Ubuntu

sudo apt-get install -y p7zip-full

Mac

brew install p7zip

Windows

Installers for Windows can be downloaded from the 7-zip website.

Note, however, that Windows is not currently supported due to known issues.


data-vault's Issues

example does not work in jupyter lab (windows)?

The example generates errors on the line:

%vault store salaries in datasets

Not sure what 'datasets' is - it is not declared in the example. Trying to use this on a Windows computer; a Windows example would be appreciated.

Allow for multi-line imports?

This may require cell (%%) magic, but would be useful when importing multiple things at once, e.g.

%%vault from x import (
    a,
    b,
    c,
    d
)

Which makes sense if a, b, c, d are longer identifiers.

Consider adding simple filtering

Simple filtering proposal - idea 1

To enable high-performance subsetting, a simple, grep-like pre-filtering will be provided:

Import only first five rows:

%vault from notebook import large_frame.rows[:5] as large_frame_head

When subsetting, the use of as would be required to prevent potential confusion of the original large_frame object with its subset.

To import only rows including text "SNP":

%vault from notebook import large_frame.grep("SNP") as large_frame_snps

By design, no advanced filtering is intended at this step.

However, if your file is too big to fit into memory and you need more advanced filtering,
you can provide your custom import function to the low-level load_storage_object magic:

def your_function(f):
    return [
        line
        for i, line in enumerate(f)
        if i % 2 == 0   # replace with fancy filtering as needed
    ]
%vault import 'notebook_path/variable.tsv' as variable with your_function

Advanced filtering can already be achieved with existing code.

Simple filtering proposal - idea 2

Import the first 5 rows:

from data_vault import subset
%vault import 'notebook_path/variable.tsv' as variable with subset.rows[:5]

to be implemented with nrows

Import the first 5 columns:

%vault import 'notebook_path/variable.tsv' as variable with subset.columns[:5]

to be implemented with usecols

Import rows containing a string:

%vault import 'notebook_path/variable.tsv' as variable with subset.contains('text')

Import rows matching a regular expression:

%vault import 'notebook_path/variable.tsv' as variable with subset.matches('.*? text')

both to be implemented with a custom IO iterator which discards lines that do not match the criteria on the fly.
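
A minimal sketch of such an on-the-fly line filter, using only the standard library and pandas (the file name and pattern are hypothetical):

import re
from io import StringIO
from pandas import read_csv

def matching_lines(path, pattern):
    """Return a file-like object with the header and the matching lines only."""
    regex = re.compile(pattern)
    with open(path) as f:
        header = next(f)   # always keep the header line
        kept = [line for line in f if regex.search(line)]
    return StringIO(header + ''.join(kept))

variable = read_csv(matching_lines('variable.tsv', 'text'), sep='\t')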

Challenges:

  • how to support the variety of delimiters and options?
    • subset.using(sep='csv').rows[:5]?

Store in path equivalent to the current notebook

It is often useful to have the data storage structure reflect the structure of the notebooks. However, as notebooks get renamed and moved around, the paths need to be updated. I propose using a dot (.) to indicate that data should be saved in the path equivalent to the currently running notebook. Running:

%vault store data in .

in a notebook located in analyses/main_analysis.ipynb would save the data in the analyses/main_analysis path of the vault, i.e. be equivalent to running:

%vault store data in analyses/main_analysis

An alternative syntax would use double underscores, e.g.:

%vault store data in __here__

The dot syntax is more akin to the import syntax of Python (from . import x), and is thus slightly preferred.

The dot syntax could allow further path specification:

%vault store data_clean_1 in ./processed
%vault store data_clean_2 in ./processed
%vault store data_raw_1 in ./raw
%vault store data_raw_2 in ./raw

Support for .. could be considered too, but it is outside the scope of this proposal.

Add more metadata?

  • hashsum of the notebook at the time of the write
  • hash of the most recent commit (if in a git repository)

Additional test cases

  • %vault from path in variable should raise,
  • %vault from variable import path should raise
