
impyute's Introduction


Impyute

Impyute is a library of missing data imputation algorithms. It was designed to be lightweight. Here's a sneak peek at what impyute can do.

>>> import numpy as np
>>> n = 5
>>> arr = np.random.uniform(high=6, size=(n, n))
>>> for _ in range(3):
...     arr[np.random.randint(n), np.random.randint(n)] = np.nan
...
>>> print(arr)
array([[0.25288643, 1.8149261 , 4.79943748, 0.54464834, np.nan],
       [4.44798362, 0.93518716, 3.24430922, 2.50915032, 5.75956805],
       [0.79802036, np.nan, 0.51729349, 5.06533123, 3.70669172],
       [1.30848217, 2.08386584, 2.29894541, np.nan, 3.38661392],
       [2.70989501, 3.13116687, 0.25851597, 4.24064355, 1.99607231]])
>>> import impyute as impy
>>> print(impy.mean(arr))
array([[0.25288643, 1.8149261 , 4.79943748, 0.54464834, 3.7122365],
       [4.44798362, 0.93518716, 3.24430922, 2.50915032, 5.75956805],
       [0.79802036, 1.99128649, 0.51729349, 5.06533123, 3.70669172],
       [1.30848217, 2.08386584, 2.29894541, 3.08994336, 3.38661392],
       [2.70989501, 3.13116687, 0.25851597, 4.24064355, 1.99607231]])

Feature Support

  • Imputation of Cross Sectional Data
    • K-Nearest Neighbours
    • Multivariate Imputation by Chained Equations
    • Expectation Maximization
    • Mean Imputation
    • Mode Imputation
    • Median Imputation
    • Random Imputation
  • Imputation of Time Series Data
    • Last Observation Carried Forward
    • Moving Window
    • Autoregressive Integrated Moving Average (WIP)
  • Diagnostic Tools
    • Loggers
    • Distribution of Null Values
    • Comparison of imputations
    • Little's MCAR Test (WIP)

Versions

Currently tested on Python 2.7, 3.4, 3.5, 3.6 and 3.7.

Installation

To install impyute, run the following:

$ pip install impyute

Or to get the most current version:

$ git clone https://github.com/eltonlaw/impyute
$ cd impyute
$ python setup.py install

Documentation

Documentation is available here: http://impyute.readthedocs.io/

How to Contribute

Check out CONTRIBUTING

impyute's People

Contributors

a-ozbek, ahmedhshahin, dmitrypolo, edzwilmm, eltonlaw, erikpartridge, pavantejadokku, tahmidmehdi, xyz8983


impyute's Issues

add pypy to supported envs

todo

  1. Dockerfile.pybase builds the eltonlaw/pybase image which sets up the python environments. Add stuff to install pypy here.
  2. Dockerfile builds the impyute image which installs dev requirements and installs impyute. Add stuff to install impyute into the new pypy environment.
  3. Update README.rst to support pypy
  4. Add $ docker run impyute pypy -m pytest to the test target in the Makefile

verifying things work

  1. Rebuild pybase image: make rebuild-pybase
  2. Rebuild impyute image: docker build -t impyute .
  3. Check that $ docker run impyute pypy -m pytest runs. If tests are successful, it should be okay.

[DDFG] Add BadInputError for dtype handling

In the first case, an UnboundLocalError occurs when dtype matches neither branch, because data is never assigned under the current if/elif criteria. Add an else clause and raise BadInputError with a more informative error message.

if dtype == "int":
    data = np.random.randint(bound[0], bound[1], size=shape).astype(float)
elif dtype == "float":
    data = np.random.uniform(bound[0], bound[1], size=shape)

In the second case, no matter what value you pass through dtype, no error occurs. This is because, in this instance, data is assigned immediately. Follow the same logic as above.

data = np.random.normal(mean, sigma, size=shape)
if dtype == "int":
    data = np.round(data)
elif dtype == "float":
    pass
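One way the first case could be handled is sketched below. This is only an illustration: `BadInputError` is defined locally as a stand-in for the library's exception, and `uniform_data` with its defaults is a hypothetical name, not the library's function.

```python
import numpy as np


class BadInputError(Exception):
    """Local stand-in for impyute's BadInputError (assumption)."""


def uniform_data(bound=(0, 10), shape=(5, 5), dtype="int"):
    """Hypothetical helper: draw uniform data, rejecting unknown dtypes."""
    if dtype == "int":
        data = np.random.randint(bound[0], bound[1], size=shape).astype(float)
    elif dtype == "float":
        data = np.random.uniform(bound[0], bound[1], size=shape)
    else:
        # Without this branch, `data` is never bound and the function
        # fails later with UnboundLocalError.
        raise BadInputError(
            "dtype must be 'int' or 'float', got {!r}".format(dtype)
        )
    return data
```

The same else branch could be added to the second case, where an unknown dtype is currently ignored silently.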

Be sure to follow the 4 steps outlined in contributing.md

The below labels are for DDFG (Data Days for Good) participant reference:
Priority: Low
Difficulty: Low

fast_knn: the nearest neighbor gets the lowest weight

Hi Elton,

Thank you for implementing this library, it's so convenient!
I found your library from the link below.
https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779

When I was using fast_knn, I found that nearer neighbor got lower weight when getting the weighted average of k nearest neighbors.

In the example you provided,

fast_knn(data, k=2) # Weighted average of nearest 2 neighbours
array([[ 0. ,  1. , 10.08608891,  3. ,  4. ],
       [ 5. ,  6. ,  7. ,  8. ,  9. ],
       [10. , 11. , 12. , 13. , 14. ],
       [15. , 16. , 17. , 18. , 19. ],
       [20. , 21. , 22. , 23. , 24. ]])

In this example, 10.086 is imputed by the kNN algorithm.
We get the 2 nearest neighbors using Euclidean distance. Treating the first row as a "point", the nearest neighbor is the second "point" (second row), and the second nearest neighbor is the third "point" (third row).
The distance between the first point and second point (nearest neighbor) is 12.5, the distance between the first point and third point (the second nearest neighbor) is 20.156.
So this is how 10.086 is obtained:
10.086 = 7 * 12.5/(12.5 + 20.156) + 12 * 20.156/(12.5 + 20.156)
The weight for each point is proportional to its distance, so the nearer the point, the smaller the distance and the lower the weight. It should be the opposite.

In a nutshell, I believe the nearest neighbor should have the highest weight, in this example, the imputed value should be close to 7 instead of 12 (the average of 7 and 12 is 9.5 for reference).

Thanks.
Best,
Minjie
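The arithmetic in this report can be checked with a short sketch that compares distance-proportional weights (the reported behaviour) with inverse-distance weights, one common way to make nearer neighbours count more. The values and distances are taken from the example above; the variable names are illustrative, and this is not the library's code.

```python
import numpy as np

# Values and distances taken from the example in this issue.
neighbour_values = np.array([7.0, 12.0])
distances = np.array([12.5, 20.156])

# Weights proportional to distance (the behaviour reported above).
w_prop = distances / distances.sum()
reported = float(np.dot(w_prop, neighbour_values))

# Weights proportional to 1/distance: nearer neighbours count more.
# Assumes all distances are strictly positive.
w_inv = (1.0 / distances) / (1.0 / distances).sum()
inverse = float(np.dot(w_inv, neighbour_values))

print(round(reported, 3), round(inverse, 3))
```

With inverse-distance weights the imputed value comes out at about 8.91, below the plain average of 9.5 and closer to 7, which matches the expectation stated in the report.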

consider better handling of pandas dataframes

Side issue of #54

Needs more research. Maybe have a function to straightforwardly check/parse? Parse out non float columns? Return a pandas dataframe?

Would we need to add Pandas as a dependency?

Multivariate Imputation by Chained Equations is going to return only mean value of the Column.

Issue:
In the module "impyute.mice" we expected each column's missing values to be imputed from the linear equation fitted for that column. Instead we get only the column means.

Reason:
The loop is entered only when the condition below is satisfied. A bug in the condition causes the loop to be skipped, so the mean-imputed data set is returned straight away.

Condition Failing:
converged = [False] * len(null_xyv)
while all(converged):
    .....................
    ....................

Resolution:

converged = [False] * len(null_xyv)
while not(all(converged)):
    .....................
    ....................
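The difference between the two conditions can be seen in isolation with the small sketch below. `null_count` and the loop bodies are placeholders, not the library's code.

```python
null_count = 3  # hypothetical number of missing cells

# Original condition: all([False, False, False]) is False, so the body never runs.
converged = [False] * null_count
original_iterations = 0
while all(converged):
    original_iterations += 1
    break

# Proposed condition: loops until every cell is marked as converged.
converged = [False] * null_count
fixed_iterations = 0
while not all(converged):
    converged[fixed_iterations] = True  # stand-in for a real convergence check
    fixed_iterations += 1

print(original_iterations, fixed_iterations)  # 0 3
```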

lint repo

Slightly relax line lengths, and clean up some of the outstanding issues with linting.

Parsing requirements error from upgrade to pip 10

Running make test fails during the installation of impyute:

...
Step 8/10 : WORKDIR /impyute
Removing intermediate container 59c6df997518
 ---> 457da5768f03
Step 9/10 : RUN pip2.7 install -e . &&     pip3.4 install -e . &&     pip3.5 install -e . &&     pip3.6 install -e .
 ---> Running in 7dd589eb1d87
Obtaining file:///impyute
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/impyute/setup.py", line 4, in <module>
        from pip.req import parse_requirements
    ImportError: No module named req
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /impyute/
The command '/bin/sh -c pip2.7 install -e . &&     pip3.4 install -e . &&     pip3.5 install -e . &&     pip3.6 install -e .' returned a non-zero code: 1
make: *** [test] Error 1
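pip 10 moved `pip.req` into the private `pip._internal` package, so one possible workaround is for setup.py to stop importing pip and read the requirements file directly. The sketch below assumes a plain requirements file with one specifier per line; `read_requirements` is a hypothetical helper name.

```python
import os
import tempfile


def read_requirements(path):
    """Return requirement specifiers from a pip-style requirements file.

    Skips blank lines, comments and option lines such as ``-r other.txt``.
    """
    with open(path) as handle:
        lines = (line.strip() for line in handle)
        return [ln for ln in lines if ln and not ln.startswith(("#", "-"))]


# Usage with a throwaway file; the package names are only examples.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("# core deps\nnumpy\nscipy>=1.0\n\n-r dev.txt\n")
requirements = read_requirements(tmp.name)
os.remove(tmp.name)
print(requirements)  # ['numpy', 'scipy>=1.0']
```

The returned list could then be passed to `install_requires` in the `setup()` call.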

Imputations run on the reference of the value

>>> random_normal(shape=(100, 100))
array([[        nan,         nan,         nan, ...,  2.29665783,            
                nan, -0.71412759],    
       [ 0.49046767, -0.31572648,         nan, ...,  1.39596078,            
        -1.14295358, -1.30660838],    
       [        nan, -2.0708818 ,  0.7393569 , ..., -0.52974738,            
         0.24717896, -2.37030327],    
       ...,        
       [ 0.01155974,  0.88793848, -0.04410631, ..., -1.28196955,            
         0.75566477, -0.39039914],    
       [ 0.23240304,         nan,  1.59899984, ..., -1.06248365,            
                nan,  0.65453688],    
       [-0.13855768, -0.00358682,         nan, ...,  1.29588659,            
        -0.20579175,  0.59610582]])   
>>> data = random_normal(shape=(100, 100))                                  
>>> import impyute as impy            
>>> impy.em(data)  
array([[ 0.5599156 , -0.24410474, -0.99875721, ..., -0.74595691,            
         0.25954462,  0.3936289 ],    
       [-0.5491675 ,  0.39810825,  0.15029102, ..., -0.99765863,            
        -0.98604735,  1.24321062],    
       [ 0.36389712,  1.56754062,  1.38492368, ..., -0.04457599,            
        -0.12098783,  0.98864098],    
       ...,        
       [ 1.37199931, -0.45710982, -1.30196092, ..., -0.38020366,            
         0.31780175, -0.08301059],    
       [ 0.52415   , -1.02749075,  2.03909177, ...,  0.66138282,            
         1.31679312, -0.41575647],    
       [-0.36272847,  0.65262579, -0.11336795, ..., -0.1538307 ,            
        -1.24756562, -0.27470951]])   
>>> impy.em(data)  
Traceback (most recent call last):    
  File "<stdin>", line 1, in <module> 
  File "/Users/elton.law/sandbox/github/impyute/impyute/utils/checks.py", line 34, in wrapper
    raise BadInputError("No NaN's in given data")                           
impyute.utils.errors.BadInputError: No NaN's in given data                  
>>> 

Need to make a copy and run each algorithm on that instead. Copying can be very expensive for big datasets, so keep an inplace keyword, as in pandas, so that both behaviours are available.
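A minimal sketch of the copy-by-default behaviour with an opt-in `inplace` flag is shown below. `mean_impute` is a hypothetical function used only for illustration. Unlike pandas, it returns the array in both modes.

```python
import numpy as np


def mean_impute(data, inplace=False):
    """Hypothetical column-mean imputation with a pandas-style `inplace` flag."""
    # Copy unless the caller explicitly asks to overwrite the input.
    target = data if inplace else data.copy()
    col_means = np.nanmean(target, axis=0)
    rows, cols = np.where(np.isnan(target))
    target[rows, cols] = col_means[cols]
    return target


data = np.array([[1.0, np.nan], [3.0, 4.0]])
imputed = mean_impute(data)            # `data` keeps its NaN
assert np.isnan(data).sum() == 1
mean_impute(data, inplace=True)        # `data` is overwritten
assert np.isnan(data).sum() == 0
```

With the default, calling an imputation twice on the same array gives the same result both times instead of raising "No NaN's in given data" on the second call.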

[DDFG] Add randc function for random generation of categorical values

Create a function named randc to generate a dataset of categorical variables with missingness. Follow the general form of randu & randn, with the following arguments:

  • nlevels: Number of different categories
  • shape: Same as in randu & randn, including defaults
  • missingness: Same as in randu & randn
  • thr: Same as in randu & randn

The dtype argument is not necessary here. Create this function within impyute.dataset.base

Be sure that functions accept & return matrices.
Be sure to follow the 4 steps outlined in contributing.md

The below labels are for DDFG (Data Days for Good) participant reference:
Priority: Medium
Difficulty: Medium
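A rough sketch of what randc might look like follows. The default values, the integer encoding of categories, and the handling of `missingness` are all assumptions made for illustration. Only completely-at-random missingness is covered here.

```python
import numpy as np


def randc(nlevels=5, shape=(5, 5), missingness="mcar", thr=0.2):
    """Sketch of a categorical generator; defaults are assumptions."""
    if missingness != "mcar":
        raise NotImplementedError("only 'mcar' is sketched here")
    # Encode categories as integer codes 0..nlevels-1, stored as floats so
    # that NaN can mark a missing value.
    data = np.random.randint(0, nlevels, size=shape).astype(float)
    # MCAR: each cell is dropped independently with probability `thr`.
    mask = np.random.uniform(size=shape) < thr
    data[mask] = np.nan
    return data
```

Storing the category codes as floats keeps the output compatible with functions that expect a float matrix containing NaN.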

Which is the similarity function in KNN imputation?

Hi!

I know that there are many different functions to calculate similarity between incomplete vectors, but which one is implemented in the package? I have not found any specification or citation for KNN in the documentation.

fast_knn, moving_window and locf are returning data without imputation for univariate time series

Data looks as below

tsNH4_na.head()

index ds y
2010-11-30 16:10:00 2010-11-30 16:10:00 13.714667
2010-11-30 16:20:00 2010-11-30 16:20:00 NaN
2010-11-30 16:30:00 2010-11-30 16:30:00 14.630500
2010-11-30 16:40:00 2010-11-30 16:40:00 16.385333
2010-11-30 16:50:00 2010-11-30 16:50:00 15.992667

Including ds gives the error BadInputError: Data is not float, so I tried with the single variable y:

np.isnan(impy.imputation.ts.moving_window(np.array(tsNH4_na[["y"]]),func = np.mean,errors='raise',nindex=0,wsize=10)).sum()

833

imput = impy.fast_knn(tsNH4_na[['y']], k=2)
np.isnan(imput).sum()

833

Unimputed data also have 833 missing points.

Name change request: mice

Dear Elton,

Thanks for your effort to implement an algorithm for imputing multivariate data.

I’d like to request a name change of your impyute.imputation.cs.mice procedure. The documentation of this procedure says that it implements Multivariate Imputation by Chained Equations (MICE) from my JSS 2011 paper. However, this documentation is not accurate since your procedure does not implement the MICE algorithm. It differs in important respects from my method:

  • Your procedure provides a single imputation, whereas MICE is a procedure for generating multiple imputations;
  • Your procedure imputes the “best” (predicted) value, while the MICE algorithm always adds noise;
  • Your procedure uses linear regression, whereas the MICE algorithm is open to any type of imputation model;
  • Your procedure uses different convergence criteria.

These differences have profound methodological implications. Advertising your procedure as “MICE” will create confusion among analysts, who might be led to believe that they are doing MICE when in fact they are not.

Your procedure is an implementation of Buck’s method published in 1960 (described in more detail in Little & Rubin 2002), so I would suggest that you could perhaps rename to “buck”?

With regards,
Stef van Buuren

[DDFG] Complete MNAR missingness generation

Complete mnar method in the Corruptor class.

Simplified, MNAR (Missing Not at Random) is a type of missingness in which the probability of a value being missing is conditional (in whole or in part) on unobserved data. Missingness may be simultaneously conditional on observed data in addition to unobserved data.

Implementation: Generate a random selection of new features and base missingness on these features. The number of features to generate may be based on some fraction of the existing features, or a random number between 1 - n_features. These features could (should?) be a mix of continuous & categorical; this could be based on the fraction of each respective feature type in the existing features. Once generated, impose missingness based on these new features.

Be sure that functions accept & return matrices.
Be sure to follow the 4 steps outlined in contributing.md

The below labels are for DDFG (Data Days for Good) participant reference:
Priority: High
Difficulty: Medium

def mnar(self):
    """ Overwrites values with MNAR placed NaN's """
    pass
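The idea can be sketched as a standalone function, under strong simplifying assumptions: a single continuous latent feature drives the missingness and is then discarded, so the cause of the missingness is not observable in the returned data. The signature, the `thr` and `seed` parameters, and the rule of dropping one cell per affected row are all hypothetical. The mix of continuous and categorical generated features described above is not attempted.

```python
import numpy as np


def mnar(data, thr=0.2, seed=None):
    """Sketch: NaN placement driven by a latent feature that is never returned."""
    rng = np.random.RandomState(seed)
    out = np.array(data, dtype=float)  # work on a copy
    n_rows, n_cols = out.shape
    # One hidden continuous feature per row; it is not part of the output.
    latent = rng.normal(size=n_rows)
    # Rows whose latent value falls in the top `thr` fraction lose one
    # randomly chosen cell.
    cutoff = np.percentile(latent, 100.0 * (1.0 - thr))
    rows = np.where(latent > cutoff)[0]
    cols = rng.randint(0, n_cols, size=rows.size)
    out[rows, cols] = np.nan
    return out
```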

[DDFG] Complete MAR missingness generation

Complete mar method in the Corruptor class.

Simplified, MAR (Missing at Random) is a type of missingness in which the probability of a value being missing is conditional only on the observed data.

Implementation: Select a random subset of the features in the given dataset and base missingness on these features. This could be some fraction of the features, or a random number between 1 - n_features.

Be sure that functions accept & return matrices.
Be sure to follow the 4 steps outlined in contributing.md

The below labels are for DDFG (Data Days for Good) participant reference:
Priority: High
Difficulty: Low - Medium

def mar(self):
    """ Overwrites values with MAR placed NaN's """
    pass
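A simplified standalone sketch of the approach is given below. It assumes complete input data with at least two columns, and it uses a single observed "driver" column rather than a random subset of features. The signature and the `thr` and `seed` parameters are hypothetical.

```python
import numpy as np


def mar(data, thr=0.2, seed=None):
    """Sketch: NaNs in one column depend on the observed values of another."""
    rng = np.random.RandomState(seed)
    out = np.array(data, dtype=float)  # work on a copy
    n_cols = out.shape[1]
    # Pick two distinct columns: a fully observed driver and a target.
    driver, target = rng.choice(n_cols, size=2, replace=False)
    # Rows where the driver is in its top `thr` fraction lose the target value.
    cutoff = np.percentile(out[:, driver], 100.0 * (1.0 - thr))
    out[out[:, driver] > cutoff, target] = np.nan
    return out
```

Because the driver column is left untouched, the probability of a value being missing depends only on data that remains observed.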

[DDFG] Add OutofBounds error handling

If axis != 1, functionality of axis = 0 occurs regardless of value (or dtype) passed. Enforce [0, 1] values and throw BadInputError from impyute/utils/errors.py with an appropriate explanation.

Be sure to follow the 4 steps outlined in contributing.md

The below labels are for DDFG (Data Days for Good) participant reference:
Priority: Low
Difficulty: Low

if axis == 0:
    data = np.transpose(data)
elif axis == 1:
    pass
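The check could look something like the sketch below. `BadInputError` is a local stand-in for the library's exception and `orient` is a hypothetical helper that mirrors the snippet above.

```python
import numpy as np


class BadInputError(Exception):
    """Local stand-in for impyute's BadInputError (assumption)."""


def orient(data, axis=0):
    """Hypothetical helper mirroring the snippet above, with validation."""
    # bool is excluded explicitly because True == 1 and False == 0 in Python.
    if isinstance(axis, bool) or axis not in (0, 1):
        raise BadInputError("axis must be 0 or 1, got {!r}".format(axis))
    return np.transpose(data) if axis == 0 else data
```

For example, `orient(np.zeros((2, 3)), axis=0)` returns a 3 x 2 array, while `axis=2` or `axis="1"` raises BadInputError instead of silently falling through.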
