
data-wrangler's Introduction

Overview


Datasets come in all shapes and sizes, and are often messy:

  • Observations come in different formats
  • There are missing values
  • Labels are missing and/or aren't consistent
  • Datasets need to be wrangled

The main goal of data-wrangler is to turn messy data into clean(er) data, defined as either a DataFrame or a list of DataFrame objects. The package provides code for easily wrangling data from a variety of formats into DataFrame objects, manipulating DataFrame objects in useful ways (that can be tricky to implement, but that apply to many analysis scenarios), and decorating Python functions to make them more flexible and/or easier to write.

The data-wrangler package supports a variety of datatypes, with a special emphasis on text: data-wrangler provides a simple API for interacting with natural language processing tools and datasets provided by scikit-learn, hugging-face, and flair. The package is designed to provide sensible defaults, but also implements convenient ways of deeply customizing how different datatypes are wrangled.

For more information, including a formal API and tutorials, check out https://data-wrangler.readthedocs.io

Quick start

Install datawrangler using:

$ pip install pydata-wrangler

Some quick natural language processing examples:

import datawrangler as dw

# load in sample text
text_url = 'https://raw.githubusercontent.com/ContextLab/data-wrangler/main/tests/resources/home_on_the_range.txt'
text = dw.io.load(text_url)

# embed text using scikit-learn's implementation of Latent Dirichlet Allocation, trained on a curated subset of
# Wikipedia, called the 'minipedia' corpus.  Return the fitted model so that it can be applied to new text.
lda = {'model': ['CountVectorizer', 'LatentDirichletAllocation'], 'args': [], 'kwargs': {}}
lda_embeddings, lda_fit = dw.wrangle(text, text_kwargs={'model': lda, 'corpus': 'minipedia'}, return_model=True)

# apply the minipedia-trained LDA model to new text
new_text = 'how much wood could a wood chuck chuck if a wood chuck could chuck wood?'
new_embeddings = dw.wrangle(new_text, text_kwargs={'model': lda_fit})

# embed text using hugging-face's pre-trained GPT2 model
gpt2 = {'model': 'TransformerDocumentEmbeddings', 'args': ['gpt2'], 'kwargs': {}}
gpt2_embeddings = dw.wrangle(text, text_kwargs={'model': gpt2})

The data-wrangler package also provides powerful decorators that can modify existing functions to support new datatypes. Just write your function as though its inputs are guaranteed to be Pandas DataFrames, and decorate it with datawrangler.decorate.funnel to enable support for other datatypes without any new code:

import numpy as np

image_url = 'https://raw.githubusercontent.com/ContextLab/data-wrangler/main/tests/resources/wrangler.jpg'
image = dw.io.load(image_url)

# define your function and decorate it with "funnel"
@dw.decorate.funnel
def binarize(x):
  return x > np.mean(x.values)

binarized_image = binarize(image)  # rgb channels will be horizontally concatenated to create a 2D DataFrame

Supported data formats

One package can't accommodate every foreseeable format or input source, but data-wrangler provides a framework for adding support for new datatypes in a straightforward way. Essentially, adding support for a new data type entails writing two functions:

  • An is_<datatype> function, which should return True if an object is compatible with the given datatype (or format), and False otherwise
  • A wrangle_<datatype> function, which should take in an object of the given type or format and return a pandas DataFrame with numerical entries
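As an illustrative sketch of this two-function pattern (the names below follow the is_/wrangle_ convention, but the dictionary datatype and any registration details are hypothetical, not part of the package):

```python
import pandas as pd

# Hypothetical example: adding support for plain Python dictionaries
# whose values are equal-length columns.  How these functions would be
# registered with data-wrangler is not shown here.

def is_dict(obj):
    # return True if obj looks like a dictionary of columns
    return isinstance(obj, dict) and all(hasattr(v, '__len__') for v in obj.values())

def wrangle_dict(obj, **kwargs):
    # convert a dictionary of equal-length columns into a DataFrame
    return pd.DataFrame(obj)
```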

Currently supported datatypes are limited to:

  • array-like objects (including images)
  • DataFrame-like or Series-like objects
  • text data (text is embedded using natural language processing models)

or lists of mixtures of the above.

Missing observations (e.g., nans, empty strings, etc.) may be filled in using imputation and/or interpolation.
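data-wrangler's own imputation and interpolation options are configurable; as a minimal illustration of the underlying pandas operations (not the package's API):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 2.0, 4.0]})

# interpolation fills gaps between (and, with limit_direction='both',
# before/after) observed values
interpolated = df.interpolate(limit_direction='both')

# imputation replaces missing entries with a statistic (here, column means)
imputed = df.fillna(df.mean())
```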

data-wrangler's People

Contributors

jeremymanning, paxtonfitzpatrick

data-wrangler's Issues

make tutorials

key things to highlight:

  • wrangle each supported datatype (include use of loaded-in object + filename of saved object):
    • dataframe
    • array
    • text-- use a few different scikit-learn models (train with different built-in and hugging-face corpora) and also show a few hugging-face models. also demo vectorization (scikit-learn) vs. embedding using an already-trained model
    • image
    • mixed list of several datatypes
  • interpolation and imputation
  • decorators:
    • list_generalizer, funnel
    • stacking and unstacking (show examples of how this is useful); also show stack_apply/unstack_apply
  • io:
    • loading data (local + remote)
    • saving data
  • util:
    • btwn
    • dataframe_like, array_like
    • load_dataframe (show demos with a few different filetypes-- e.g. csv, json, etc. to show how extensions are handled automatically)
  • core:
    • show how config.ini is set up and how it can be customized
    • show how get_default_options and apply_default_options work

CUDA-optimized embeddings

converting flair-derived text embeddings to numpy arrays doesn't work (as written) on CUDA systems. there should be a try/except statement that copies the array to CPU memory (and then re-tries converting to numpy) when this exception is raised:
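A sketch of the proposed fix, assuming the embeddings are torch-style tensors (where calling .numpy() on a CUDA-resident tensor raises an exception); the function name is hypothetical:

```python
def to_numpy(tensor):
    # try the direct conversion first; if the tensor lives on a CUDA
    # device, copy it to CPU memory and retry
    try:
        return tensor.numpy()
    except (TypeError, RuntimeError):
        return tensor.cpu().numpy()
```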

(screenshot of the exception traceback omitted)

JOSS paper

write up JOSS-style paper highlighting the key functionality. also ensure all requirements have been met.

khan academy and neurips datasets are not downloading correctly

example (downloading neurips dataset):

(screenshot of the failed neurips download omitted)

i'm thinking one of the following may have happened:

  • the download links are broken
  • the "load" function is broken for remote downloads of npy/npz files
  • something about my local cache is messed up (e.g., an interrupted download)

it's surprising that the tests using minipedia and sotus datasets would have passed on the CI system if load was truly broken, which makes me think it may be a local issue...but it needs to be debugged further.

bug when unstacking a 3+ level multi-index dataframe

unstacking a 3+ level multi-index dataframe does not properly set the indices of the lowest level dataframes. Example:

import datawrangler as dw
import numpy as np

xs = dw.wrangle([np.cumsum(np.random.randn(100, 5), axis=0) for _ in range(10)])
ys = dw.wrangle([np.cumsum(np.random.randn(100, 5), axis=0) for _ in range(10)])

stacked_xs = dw.stack(xs)
stacked_ys = dw.stack(ys)
stacked_xy = dw.stack([stacked_xs, stacked_ys])

dw.unstack(stacked_xs)[0]  # this recovers xs[0], as expected
dw.unstack(stacked_xy)[0]  # this recovers the values of stacked_xs (as expected) but the index is not a multi-index, and the index values are incorrect

Snapshot of current dw.unstack(stacked_xy)[0]: (screenshot omitted; it shows a flat index with incorrect values)

Expected result (index should match stacked_xs): (screenshot omitted; it shows the correct two-level MultiIndex)
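For reference, building the stacked objects directly with pandas (pd.concat with keys, which may differ from dw.stack's internals) shows the index structure that unstacking should recover:

```python
import numpy as np
import pandas as pd

xs = [pd.DataFrame(np.random.randn(3, 2)) for _ in range(2)]

# each stack operation adds one index level
stacked_xs = pd.concat(xs, keys=range(len(xs)))                  # 2-level index
stacked_xy = pd.concat([stacked_xs, stacked_xs], keys=['x', 'y'])  # 3-level index

# selecting the first outer key should recover stacked_xs, including
# its 2-level MultiIndex
recovered = stacked_xy.loc['x']
```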

notebook installs

when installing davos in notebooks, it requires restarting the runtime.

@paxtonfitzpatrick would it be possible to use your "davos" trick when installing in notebooks? i think it'd entail something like the following (on installation):

1.) determine we're installing from a notebook environment
2.) if so, automatically restart the runtime

i wonder if we could even include davos as a dependency of pydata-wrangler and just call the smuggle function directly...
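A sketch of step 1 (one common heuristic; the class-name check is an assumption, and davos may use a different mechanism):

```python
def in_notebook():
    # get_ipython is only defined when running under IPython; in Jupyter
    # notebooks the active shell class is ZMQInteractiveShell
    try:
        return get_ipython().__class__.__name__ == 'ZMQInteractiveShell'
    except NameError:
        return False
```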

unify "return_model" syntax

some, but not all, wrangling functions support the "return_model" keyword argument. proposed solution: add a decorator (defined in format.py to avoid conflicts) that checks for the return_model keyword argument; if it is set to True and the function only returns a DataFrame (rather than a tuple), simply return a "null" model-- something like

{'model': None, 'args': [], 'kwargs': {}}
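A sketch of the proposed decorator (the name is hypothetical, and the real implementation would live in format.py):

```python
import functools

def maybe_return_model(f):
    # guarantee that a (result, model) tuple comes back whenever
    # return_model=True, even for wrangling functions that only
    # produce a DataFrame
    @functools.wraps(f)
    def wrapped(*args, return_model=False, **kwargs):
        result = f(*args, **kwargs)
        if return_model and not isinstance(result, tuple):
            return result, {'model': None, 'args': [], 'kwargs': {}}
        return result
    return wrapped
```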

re-organize decorators to fix circular dependencies

text wrangling imports two decorators:

  • apply_defaults
  • list_generalizer

importing these decorators calls decorate/__init__.py, which imports the funnel decorator; the funnel decorator in turn imports the wrangle function from format.py...which in turn imports the wrangle_text function from text.py.

To fix, it seems like the simplest change would be to move apply_defaults and list_generalizer decorators outside of the decorate module. but the challenge is that funnel depends on wrangle AND list_generalizer... so with this change we'd have:

  • format depends on text
  • text depends on util + decorate
  • decorate depends on format (circular)

I could also move everything to be in one module...although that seems somewhat messy.

sphinx/readthedocs pages

fill in docstrings for all core functions, along with an .rst version of the README file to support a sphinx API webpage. potentially post on readthedocs along with links to the tutorials

datawrangler.__version__ reports incorrect version

Looks like the version string for the package metadata was updated for the v0.2.2 release

version='0.2.2',

but the one stored in the __version__ attribute wasn't

__version__ = '0.2.1'

If you want, you can make it so that the version string is stored in only one place (your setup file) by instead setting the __version__ attribute with:

import sys
if sys.version_info < (3, 8):
    import importlib_metadata as metadata
else:
    from importlib import metadata


__version__ = metadata.version('pydata-wrangler')
