eltonlaw / impyute

Data imputations library to preprocess datasets with missing data

Home Page: http://impyute.readthedocs.io/

License: MIT License

Python 97.17% Makefile 1.89% Dockerfile 0.94%

imputation missing-data python scientific-computing

impyute's Introduction

Impyute

Impyute is a library of missing data imputation algorithms. It was designed to be super lightweight; here's a sneak peek at what impyute can do.

>>> import numpy as np
>>> n = 5
>>> arr = np.random.uniform(high=6, size=(n, n))
>>> for _ in range(3):
...     arr[np.random.randint(n), np.random.randint(n)] = np.nan
...
>>> print(arr)
array([[0.25288643, 1.8149261 , 4.79943748, 0.54464834, np.nan],
       [4.44798362, 0.93518716, 3.24430922, 2.50915032, 5.75956805],
       [0.79802036, np.nan, 0.51729349, 5.06533123, 3.70669172],
       [1.30848217, 2.08386584, 2.29894541, np.nan, 3.38661392],
       [2.70989501, 3.13116687, 0.25851597, 4.24064355, 1.99607231]])
>>> import impyute as impy
>>> print(impy.mean(arr))
array([[0.25288643, 1.8149261 , 4.79943748, 0.54464834, 3.7122365],
       [4.44798362, 0.93518716, 3.24430922, 2.50915032, 5.75956805],
       [0.79802036, 1.99128649, 0.51729349, 5.06533123, 3.70669172],
       [1.30848217, 2.08386584, 2.29894541, 3.08994336, 3.38661392],
       [2.70989501, 3.13116687, 0.25851597, 4.24064355, 1.99607231]])
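
For readers curious what column-wise mean imputation amounts to, here is a minimal NumPy sketch. The name mean_impute is hypothetical and this is not impyute's own code; it only assumes, consistent with the output above, that each NaN is replaced by the mean of the observed values in its column.

```python
import numpy as np

def mean_impute(arr):
    """Return a copy of `arr` with each NaN replaced by its column mean."""
    filled = arr.copy()                      # leave the caller's array untouched
    col_means = np.nanmean(filled, axis=0)   # per-column mean, ignoring NaN
    rows, cols = np.where(np.isnan(filled))  # positions of the missing cells
    filled[rows, cols] = col_means[cols]
    return filled

arr = np.array([[1.0, 2.0, np.nan],
                [3.0, np.nan, 6.0],
                [5.0, 8.0, 9.0]])
filled = mean_impute(arr)
```

A column that is entirely NaN has no observed mean, so this sketch would leave it as NaN (and NumPy would emit a warning).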

Feature Support

Imputation of Cross Sectional Data
- K-Nearest Neighbours
- Multivariate Imputation by Chained Equations
- Expectation Maximization
- Mean Imputation
- Mode Imputation
- Median Imputation
- Random Imputation
Imputation of Time Series Data
- Last Observation Carried Forward
- Moving Window
- Autoregressive Integrated Moving Average (WIP)
Diagnostic Tools
- Loggers
- Distribution of Null Values
- Comparison of imputations
- Little's MCAR Test (WIP)

Versions

Currently tested on Python 2.7, 3.4, 3.5, 3.6 and 3.7

Installation

To install impyute, run the following:

$ pip install impyute

Or to get the most current version:

$ git clone https://github.com/eltonlaw/impyute
$ cd impyute
$ python setup.py install

Documentation

Documentation is available here: http://impyute.readthedocs.io/

How to Contribute

Check out CONTRIBUTING

impyute's Issues

todo

Dockerfile.pybase builds the eltonlaw/pybase image which sets up the python environments. Add stuff to install pypy here.
Dockerfile builds the impyute image which installs dev requirements and installs impyute. Add stuff to install impyute into the new pypy environment.
Update README.rst to support pypy
Add $ docker run impyute pypy -m pytest to the test target in the Makefile

verifying things work

Rebuild pybase image: make rebuild-pybase
Rebuild impyute image: docker build -t impyute .
Check that $ docker run impyute pypy -m pytest runs. If tests are successful, it should be okay.

Minor mistake in example in documentation [here](https://impyute.readthedocs.io/en/master/)

In the example, random_imputation is not defined; it should be mean_imputation.

remove specific versions from requirements.txt

No real reason to have specific versions, just makes it a hassle especially if the requirements.txt hasn't been updated in a while.

[DDFG] Complete MNAR missingness generation

Complete mnar method in the Corruptor class.

Simplified, MNAR (Missing Not at Random) is a type of missingness in which the probability of a value being missing is conditional (in whole or in part) on unobserved data. Missingness may be simultaneously conditional on observed data in addition to unobserved data.

Implementation: Generate a random selection of new features and base missingness on these features. The number of features to generate may be based on some fraction of the existing features, or a random number between 1 - n_features. These features could (should?) be a mix of continuous & categorical; this could be based on the fraction of each respective feature type in the existing features. Once generated, impose missingness based on these new features.

Be sure that functions accept & return matrices.
Be sure to follow the 4 steps outlined in contributing.md

The below labels are for DDFG (Data Days for Good) participant reference:
Priority: High
Difficulty: Medium

impyute/impyute/dataset/corrupt.py

Lines 48 to 50 in 2c25368

def mnar(self):
    """ Overwrites values with MNAR placed NaN's """
    pass
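
The implementation idea described above could be sketched roughly as follows. Everything here is an assumption for illustration: the function name mnar_sketch, the thr and seed parameters, and the rule "mask the rows where a hidden feature falls in its top thr fraction". It generates only continuous hidden features, not the continuous/categorical mix the issue suggests, and it is a standalone function rather than the Corruptor method.

```python
import numpy as np

def mnar_sketch(data, thr=0.2, seed=None):
    """Return a float copy of `data` with NaNs driven by unobserved features."""
    rng = np.random.default_rng(seed)
    out = np.array(data, dtype=float)            # copy; the input is not modified
    n_rows, n_cols = out.shape
    # Hypothetical choice: a random number of hidden features in 1..n_cols.
    n_hidden = int(rng.integers(1, n_cols + 1))
    hidden = rng.normal(size=(n_rows, n_hidden)) # never returned to the caller
    # Each column's missingness depends on one randomly chosen hidden feature.
    drivers = rng.integers(0, n_hidden, size=n_cols)
    for col in range(n_cols):
        h = hidden[:, drivers[col]]
        cutoff = np.quantile(h, 1.0 - thr)
        out[h > cutoff, col] = np.nan            # top `thr` fraction goes missing
    return out

data = np.arange(500.0).reshape(100, 5)
corrupted = mnar_sketch(data, thr=0.2, seed=0)
```

Because the hidden features are discarded, an analyst who sees only the returned matrix cannot explain the missingness from the observed columns, which is the defining property of MNAR.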

1D time series imputations

For algorithms that can be straightforwardly applied (ex. moving_window)

Came up in #54

Unit tests for utils.compare

Add Little's MCAR Test

Part of push to better validate imputations during regular usage/testing, it would be nice to have a way to check for MCAR type data.

Paper: https://www.jstor.org/stable/2290157?seq=1#page_scan_tab_contents

Steps

Create new module: impyute/validation, add __init__.py and update the top level __init__.py as needed.
Implement method from paper in littles_mcar_test.py within folder

release/0.0.8

"wsize_left" is not defined

The parameter "wsize_left" on line 119 of moving_window.py is not defined.

add concurrency to imputations

Implement ARMA TS Imputation

`fast_knn` pass allowed parameters through to `kdtree.query` and `KDTree`

For KDTree currently using the default leafsize and for kdtree.query, the default eps, p and distance_upper_bound. Allow this to be modified.

fast_knn(data, k=3, eps=0, p=2, distance_upper_bound=np.inf, leafsize=10)

Replace instances of the general exception with more specific/custom exceptions.

There is no content in rules_of_thumb.rst

unit tests for `inplace` kwarg

unexpected keyword argument 'inplace' when using

Occurs because of removal of **kwargs from functions in #74

Name change request: mice

Dear Elton,

Thanks for your effort to implement an algorithm for imputing multivariate data.

I’d like to request a name change of your impyute.imputation.cs.mice procedure. The documentation of this procedure says that it implements Multivariate Imputation by Chained Equations (MICE) from my JSS 2011 paper. However, this documentation is not accurate since your procedure does not implement the MICE algorithm. It differs in important respects from my method:

Your procedure provides a single imputation, whereas MICE is a procedure for generating multiple imputations;
Your procedure imputes the “best” (predicted) value, while the MICE algorithm always adds noise;
Your procedure uses linear regression, whereas the MICE algorithm is open to any type of imputation model;
Your procedure uses different convergence criteria.

These differences have profound methodological implications. Advertising your procedure as “MICE” will create confusion among analysts, who might be led to believe that they are doing MICE when in fact they are not.

Your procedure is an implementation of Buck’s method published in 1960 (described in more detail in Little & Rubin 2002), so I would suggest that you could perhaps rename to “buck”?

With regards,
Stef van Buuren

Parsing requirements error from upgrade to pip 10

Running make test errors out during the installation of impyute:

...
Step 8/10 : WORKDIR /impyute
Removing intermediate container 59c6df997518
 ---> 457da5768f03
Step 9/10 : RUN pip2.7 install -e . &&     pip3.4 install -e . &&     pip3.5 install -e . &&     pip3.6 install -e .
 ---> Running in 7dd589eb1d87
Obtaining file:///impyute
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/impyute/setup.py", line 4, in <module>
        from pip.req import parse_requirements
    ImportError: No module named req
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /impyute/
The command '/bin/sh -c pip2.7 install -e . &&     pip3.4 install -e . &&     pip3.5 install -e . &&     pip3.6 install -e .' returned a non-zero code: 1
make: *** [test] Error 1

[DDFG] Add BadInputError for dtype handling

In the first case, an UnboundLocalError occurs because data is not assigned based on the current if/else criteria. Add an else clause and raise BadInputError accompanied by a more informative error handling message.

impyute/impyute/dataset/base.py

Lines 34 to 37 in 2c25368

if dtype == "int":
    data = np.random.randint(bound[0], bound[1], size=shape).astype(float)
elif dtype == "float":
    data = np.random.uniform(bound[0], bound[1], size=shape)

In the second case, no matter what value you pass through dtype, no error occurs. This is because, in this instance, data is assigned immediately. Follow the same logic as above.

impyute/impyute/dataset/base.py

Lines 66 to 70 in 2c25368

data = np.random.normal(mean, sigma, size=shape)
if dtype == "int":
    data = np.round(data)
elif dtype == "float":
    pass
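
One possible shape for the requested fix is sketched below. The function names randu_sketch and randn_sketch, their default arguments, and the local BadInputError class are stand-ins assumed for illustration; the real functions and the real exception live in the library and may differ.

```python
import numpy as np

class BadInputError(Exception):
    """Local stand-in for the library's BadInputError."""

def randu_sketch(bound=(0, 10), shape=(5, 5), dtype="int"):
    # First case: add the missing else branch so `data` is always bound.
    if dtype == "int":
        data = np.random.randint(bound[0], bound[1], size=shape).astype(float)
    elif dtype == "float":
        data = np.random.uniform(bound[0], bound[1], size=shape)
    else:
        raise BadInputError("dtype must be 'int' or 'float', got {!r}".format(dtype))
    return data

def randn_sketch(mean=0.0, sigma=1.0, shape=(5, 5), dtype="float"):
    # Second case: validate up front, since `data` is assigned unconditionally.
    if dtype not in ("int", "float"):
        raise BadInputError("dtype must be 'int' or 'float', got {!r}".format(dtype))
    data = np.random.normal(mean, sigma, size=shape)
    if dtype == "int":
        data = np.round(data)
    return data
```

With this in place, a call such as randu_sketch(dtype="str") fails with a message naming the bad value instead of an UnboundLocalError.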

Be sure to follow the 4 steps outlined in contributing.md

The below labels are for DDFG (Data Days for Good) participant reference:
Priority: Low
Difficulty: Low

better documentation of what imputations can be used where

Side issue of #54

Data properties (shape, data type, distribution etc.)

Always throws exception

impyute/impyute/imputations/ts/arima.py

Line 31 in 99fdb5f

data = isinstance(data, np.array)

This line will always throw an exception, because np.array is a function rather than a type.

It should be changed to:
isinstance(data, np.ndarray)

please give an example for fast_knn

there is no example for fast_knn.

cleanup how pybase is handled/rebuilt

way to build and use pybase locally, currently build Makefile target always runs a docker pull
remove apk command

consider better handling of pandas dataframes

Side issue of #54

Needs more research. Maybe have a function to straightforwardly check/parse? Parse out non float columns? Return a pandas dataframe?

Would we need to add Pandas as a dependency?

[DDFG] Complete MAR missingness generation

Complete mar method in the Corruptor class.

Simplified, MAR (Missing at Random) is a type of missingness in which the probability of a value being missing is conditional only on the observed data.

Implementation: Select a random subset of the features in the given dataset and base missingness on these features. This could be some fraction of the features, or a random number between 1 - n_features.

Be sure that functions accept & return matrices.
Be sure to follow the 4 steps outlined in contributing.md

The below labels are for DDFG (Data Days for Good) participant reference:
Priority: High
Difficulty: Low - Medium

impyute/impyute/dataset/corrupt.py

Lines 44 to 46 in 2c25368

def mar(self):
    """ Overwrites values with MAR placed NaN's """
    pass
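
A rough sketch of the idea above, under stated assumptions: the name mar_sketch, the thr and seed parameters, and the quantile-based masking rule are all hypothetical, and the input needs at least two columns. It picks between 1 and n_features - 1 driver columns (one fewer than the issue's upper bound, so that at least one column is left to corrupt) and keeps the drivers fully observed.

```python
import numpy as np

def mar_sketch(data, thr=0.2, seed=None):
    """Return a float copy of `data` with NaNs driven by observed columns."""
    rng = np.random.default_rng(seed)
    out = np.array(data, dtype=float)       # copy; the input is not modified
    n_rows, n_cols = out.shape
    # Random subset of features that missingness is conditioned on.
    n_drivers = int(rng.integers(1, n_cols))
    driver_cols = rng.choice(n_cols, size=n_drivers, replace=False)
    target_cols = np.setdiff1d(np.arange(n_cols), driver_cols)
    for col in target_cols:
        driver = out[:, rng.choice(driver_cols)]  # stays fully observed
        cutoff = np.quantile(driver, 1.0 - thr)
        out[driver > cutoff, col] = np.nan  # missing where the driver is large
    return out

data = np.random.default_rng(1).normal(size=(50, 4))
corrupted = mar_sketch(data, thr=0.2, seed=1)
```

Since the driver columns are returned intact, the missingness in the other columns can be explained entirely from observed data, which is what distinguishes MAR from MNAR.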

Implement Multiple Imputation CS

implement MICE

#64

Paper: https://www.jstatsoft.org/article/view/v045i03

Original implementation had been written after roughly reading through the steps outlined in: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/

add moving window imputation

readthedocs not displaying sphinx theme

https://impyute.readthedocs.io/en/latest/index.html

[DDFG] Add OutofBounds error handling

If axis != 1, functionality of axis = 0 occurs regardless of value (or dtype) passed. Enforce [0, 1] values and throw BadInputError from impyute.util.errors.py with an appropriate explanation.

Be sure to follow the 4 steps outlined in contributing.md

The below labels are for DDFG (Data Days for Good) participant reference:
Priority: Low
Difficulty: Low

impyute/impyute/imputation/ts/locf.py

Lines 33 to 36 in 2c25368

if axis == 0:
    data = np.transpose(data)
elif axis == 1:
    pass
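
The axis check requested above might look something like this. The helper name orient and the local BadInputError class are assumptions for illustration, not the library's code; the transpose-on-axis-0 behaviour mirrors the snippet above.

```python
import numpy as np

class BadInputError(Exception):
    """Local stand-in for the library's BadInputError."""

def orient(data, axis=0):
    """Validate `axis`, then transpose for axis 0 as the snippet above does."""
    # Reject bools, floats, strings and any integer other than 0 or 1.
    if isinstance(axis, bool) or not isinstance(axis, int) or axis not in (0, 1):
        raise BadInputError("axis must be 0 or 1, got {!r}".format(axis))
    return np.transpose(data) if axis == 0 else data
```

Checking the type as well as the value matters here because 1.0 == 1 in Python, so a value test alone would let a float through.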

fast_knn, moving_window and locf are returning data without imputation for univariate time series

Data looks as below

tsNH4_na.head()

index	ds	y
2010-11-30 16:10:00	2010-11-30 16:10:00	13.714667
2010-11-30 16:20:00	2010-11-30 16:20:00	NaN
2010-11-30 16:30:00	2010-11-30 16:30:00	14.630500
2010-11-30 16:40:00	2010-11-30 16:40:00	16.385333
2010-11-30 16:50:00	2010-11-30 16:50:00	15.992667

Including ds gives the error BadInputError: Data is not float, so I tried with just the single variable y.

np.isnan(impy.imputation.ts.moving_window(np.array(tsNH4_na[["y"]]),func = np.mean,errors='raise',nindex=0,wsize=10)).sum()

833

imput = impy.fast_knn(tsNH4_na[['y']], k=2)
np.isnan(imput).sum()

833

Unimputed data also have 833 missing points.

Which is the similarity function in KNN imputation?

Hi!

I know that there are many different functions for calculating similarity between incomplete vectors, but which one is implemented in the package? I have not found any specification or citation for KNN in the documentation.

rename files and functions to bring them more in line with standard open source practices

Remove s from directories.
Remove all instances of _imputation from functions and files in the imputations folder. Seems redundant.
Rename functions in datasets to follow the numpy naming convention. Ex. random_normal -> randn

Implement Kalman Filter TS Imputation

`make test` failure propagation through docker to travis

Add support for python3.7

swap unittest with pytest

mainly replace self.* methods with assert + some additional clean up

Multivariate Imputation by Chained Equations returns only the mean value of the column.

Issue:
In the module "impyute.mice" we expected each column's imputed values to come from the linear equation that converged for that column. Instead, we get only the column mean.

Reason:
The loop is entered only if the condition below is satisfied. The condition has a bug that causes the loop to be skipped, so the mean-imputed data set is returned straight away.

Condition Failing:
converged = [False] * len(null_xyv)
while all(converged):
.....................
....................

Resolution:

converged = [False] * len(null_xyv)
while not(all(converged)):
.....................
....................
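
The effect of the two conditions can be checked in isolation. This is a plain-Python sketch with a stand-in loop body (the real body runs the regressions); only the loop conditions are taken from the issue.

```python
# Buggy condition: all() of a list of False values is False,
# so the loop body never runs.
converged = [False] * 3
buggy_iterations = 0
while all(converged):
    buggy_iterations += 1
    break  # guard only; this line is never reached

# Fixed condition: loop until every entry has converged.
converged = [False] * 3
fixed_iterations = 0
while not all(converged):
    # Stand-in for one regression pass: mark one more entry as converged.
    converged[fixed_iterations] = True
    fixed_iterations += 1
```

Here buggy_iterations stays at 0 while fixed_iterations ends at 3, which matches the reported behaviour: with the original condition the function falls through with only the initial mean fill applied.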

fast_knn: the nearest neighbor gets the lowest weight

Hi Elton,

Thank you for implementing this library, it's so convenient!
I found your library from the link below.
https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779

When I was using fast_knn, I found that nearer neighbor got lower weight when getting the weighted average of k nearest neighbors.

In the example you provided,

fast_knn(data, k=2) # Weighted average of nearest 2 neighbours
array([[ 0. , 1. , 10.08608891, 3. , 4. ],
[ 5. , 6. , 7. , 8. , 9. ],
[10. , 11. , 12. , 13. , 14. ],
[15. , 16. , 17. , 18. , 19. ],
[20. , 21. , 22. , 23. , 24. ]])

In this example, 10.086 is imputed according to kNN algorithm.
We get the 2 nearest neighbors using Euclidean distance, for the first row as a "point", the nearest neighbor is the second "point" (second row), and the second nearest neighbor is the third "point" which is the third row.
The distance between the first point and second point (nearest neighbor) is 12.5, the distance between the first point and third point (the second nearest neighbor) is 20.156.
So this is how 10.086 comes:
10.086 = 7 * 12.5/(12.5 + 20.156) + 12 * 20.156/(12.5 + 20.156)
Each point's weight is proportional to its distance, so a nearer point (smaller distance) gets a lower weight. It should be the opposite.

In a nutshell, I believe the nearest neighbor should have the highest weight, in this example, the imputed value should be close to 7 instead of 12 (the average of 7 and 12 is 9.5 for reference).
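
To make the difference concrete, the two weightings can be compared on the numbers above. This is a plain-Python sketch; the helper weighted_average is hypothetical, and inverse-distance weighting is just one common choice that gives nearer neighbours more weight, not necessarily the fix the library would adopt.

```python
values = [7.0, 12.0]          # column value in the 1st and 2nd nearest rows
distances = [12.5, 20.156]    # their distances from the row being imputed

def weighted_average(values, weights):
    """Weighted mean with weights normalised to sum to 1."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# Reported behaviour: weight proportional to distance.
reported = weighted_average(values, distances)

# Alternative: weight proportional to 1 / distance.
inverse = weighted_average(values, [1.0 / d for d in distances])
```

The first expression reproduces the 10.086 from the example, while the inverse-distance version gives roughly 8.914, which sits on the nearer neighbour's side of the plain average 9.5.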

Thanks.
Best,
Minjie

Add ability to work on pandas data frame

Implement ARIMA TS Imputation

[DDFG] Add randc function for random generation of categorical values

Create a function named randc to generate a dataset of categorical variables with missingness. Follow the general form of randu & randn, with the following arguments:

nlevels: Number of different categories
shape: Same as in randu & randn, including defaults
missingness: " "
thr: " "

The dtype argument is not necessary here. Create this function within impyute.dataset.base

Be sure that functions accept & return matrices.
Be sure to follow the 4 steps outlined in contributing.md

The below labels are for DDFG (Data Days for Good) participant reference:
Priority: Medium
Difficulty: Medium

Imputations run on the reference of the value

>>> random_normal(shape=(100, 100))
array([[        nan,         nan,         nan, ...,  2.29665783,            
                nan, -0.71412759],    
       [ 0.49046767, -0.31572648,         nan, ...,  1.39596078,            
        -1.14295358, -1.30660838],    
       [        nan, -2.0708818 ,  0.7393569 , ..., -0.52974738,            
         0.24717896, -2.37030327],    
       ...,        
       [ 0.01155974,  0.88793848, -0.04410631, ..., -1.28196955,            
         0.75566477, -0.39039914],    
       [ 0.23240304,         nan,  1.59899984, ..., -1.06248365,            
                nan,  0.65453688],    
       [-0.13855768, -0.00358682,         nan, ...,  1.29588659,            
        -0.20579175,  0.59610582]])   
>>> data = random_normal(shape=(100, 100))                                  
>>> import impyute as impy            
>>> impy.em(data)  
array([[ 0.5599156 , -0.24410474, -0.99875721, ..., -0.74595691,            
         0.25954462,  0.3936289 ],    
       [-0.5491675 ,  0.39810825,  0.15029102, ..., -0.99765863,            
        -0.98604735,  1.24321062],    
       [ 0.36389712,  1.56754062,  1.38492368, ..., -0.04457599,            
        -0.12098783,  0.98864098],    
       ...,        
       [ 1.37199931, -0.45710982, -1.30196092, ..., -0.38020366,            
         0.31780175, -0.08301059],    
       [ 0.52415   , -1.02749075,  2.03909177, ...,  0.66138282,            
         1.31679312, -0.41575647],    
       [-0.36272847,  0.65262579, -0.11336795, ..., -0.1538307 ,            
        -1.24756562, -0.27470951]])   
>>> impy.em(data)  
Traceback (most recent call last):    
  File "<stdin>", line 1, in <module> 
  File "/Users/elton.law/sandbox/github/impyute/impyute/utils/checks.py", line 34, in wrapper
    raise BadInputError("No NaN's in given data")                           
impyute.utils.errors.BadInputError: No NaN's in given data                  
>>>

Need to make a copy and run each algorithm on that instead. This can be very expensive for big datasets. Keep an inplace keyword like in pandas so that we can use both behaviours
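
One way to get both behaviours is a small decorator, sketched here under assumptions: the names preserve_input and fill_zero are hypothetical, fill_zero is a toy stand-in for an imputation function, and the keyword-only inplace argument requires Python 3.

```python
import functools
import numpy as np

def preserve_input(fn):
    """Run `fn` on a copy of the data unless inplace=True is passed."""
    @functools.wraps(fn)
    def wrapper(data, *args, inplace=False, **kwargs):
        if not inplace:
            data = data.copy()   # pay for the copy only when asked to
        return fn(data, *args, **kwargs)
    return wrapper

@preserve_input
def fill_zero(data):
    # Toy imputation: overwrite NaNs in whatever array it is handed.
    data[np.isnan(data)] = 0.0
    return data

original = np.array([[1.0, np.nan], [np.nan, 4.0]])
copied = fill_zero(original)                  # original keeps its NaNs
same = fill_zero(original, inplace=True)      # original is overwritten
```

Defaulting to a copy keeps repeated calls on the same array safe, while inplace=True lets callers with large datasets skip the extra allocation.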
