wpca's Introduction

Weighted Principal Component Analysis in Python

Author: Jake VanderPlas

This repository contains several implementations of Weighted Principal Component Analysis, with an interface very similar to scikit-learn's sklearn.decomposition.PCA:

  • wpca.WPCA uses a direct decomposition of a weighted covariance matrix to compute principal vectors, and then a weighted least squares optimization to compute principal components. It is based on the algorithm presented in Delchambre (2014).

  • wpca.EMPCA uses an iterative expectation-maximization approach to solve simultaneously for the principal vectors and principal components of weighted data. It is based on the algorithm presented in Bailey (2012).

  • wpca.PCA is a standard non-weighted PCA implemented using the singular value decomposition. It is mainly included for the sake of testing.

Examples and Documentation

For an example application of a weighted PCA approach, see WPCA-Example.ipynb.
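
For a quick illustration, here is a minimal usage sketch (the data and weight values are hypothetical; weights are per-element and share X's shape, and the weights keyword follows the interface described above):

import numpy as np
from wpca import WPCA

# Hypothetical data: 100 samples of 5 features, with per-element weights
# (e.g. inverse measurement uncertainties) matching X's shape.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
weights = rng.uniform(0.5, 1.0, X.shape)

pca = WPCA(n_components=2)
pca.fit(X, weights=weights)   # weighted fit
Y = pca.transform(X)          # project onto the principal components
print(Y.shape)                # (100, 2)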

Installation & Dependencies

This package has the following requirements:

  • Python 2.7 or 3.4+
  • numpy (tested with version 1.10)
  • scipy (tested with version 0.16)
  • scikit-learn (tested with version 0.17)
  • nose (optional) to run unit tests.

With these requirements satisfied, you can install this package by running

$ pip install wpca

or to install from the source tree, run

$ python setup.py install

To run the suite of unit tests, make sure nose is installed and run

$ nosetests wpca


wpca's Issues

Support for 1d weights (e.g. sample-weighting)

PCA weighted by observations (samples) is described in this exchange on Cross Validated.

In principle, this can be done by taking a weight vector of shape (n_rows, 1) and copying it across multiple columns. What was originally, say,

0.25
0.25
0.5

is now

0.25    0.25    0.25
0.25    0.25    0.25
0.5     0.5     0.5

using the following function:

import numpy as np

def make_weights_matrix(weights, X):
    """Copy a column of per-sample weights across all columns of X."""
    weights = np.asarray(weights).reshape(-1, 1)  # accept 1d or (n, 1) input
    if weights.shape[0] != X.shape[0]:
        raise ValueError("weights {} and X {} must have the same "
                         "length.".format(weights.shape, X.shape))
    w_new = np.empty_like(X, dtype=float)
    w_new[:] = weights  # broadcast the column across every feature
    return w_new

w = make_weights_matrix(weights, X)

This could be avoided if the package supported broadcasting, rather than having utils perform numerous checks to ensure that weights and X have the same shape. The einsum logic would also need to be adapted, since broadcasting should then be built in.
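
For comparison, numpy can already express this expansion without an explicit copy; a minimal sketch using broadcast_to (which returns a read-only view):

import numpy as np

weights = np.array([0.25, 0.25, 0.5]).reshape(-1, 1)  # column of sample weights
X = np.arange(9.0).reshape(3, 3)

# broadcast_to yields a view with X's shape; no data is copied
w_full = np.broadcast_to(weights, X.shape)
print(w_full)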

comparison with FA?

hey @jakevdp

How does this compare with FactorAnalysis? FactorAnalysis is designed precisely to handle heteroscedastic noise on the features.

Inconsistency between WPCA and EMPCA

In EMPCA (based on Bailey 2012) weights are equivalent to the inverse variance of observations (see eq. 6)

In WPCA (based on Delchambre 2015) weights are equivalent to the inverse standard deviation (see eq. 7), i.e. the square root of the weights used by Bailey.
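
In other words, weights quoted as inverse variances would have to be square-rooted before being passed to WPCA. A small sketch of the relationship (the sigma values are hypothetical):

import numpy as np

sigma = np.array([0.5, 1.0, 2.0])  # hypothetical per-element uncertainties

empca_weights = 1.0 / sigma ** 2   # inverse variance (Bailey 2012, eq. 6)
wpca_weights = 1.0 / sigma         # inverse standard deviation (Delchambre, eq. 7)

assert np.allclose(wpca_weights, np.sqrt(empca_weights))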

The API here needs to be changed so that the units of the expected inputs are consistent.

Provide the scores [feature request]

I suspect that many users of PCA are looking to obtain a new, smaller set of variables Y from a set of variables X. But the example provided with this package does not appear to do that, and the methods do not offer to do it for you.

In fact, to the uninitiated, it's not obvious how to even get these. Moreover, the example that comes with this package does not say what to do if you have a set of non-commensurable variables X, i.e. ones that should be normalized, per standard practice, before running PCA.

I am not statistically sophisticated, and maybe there are reasons to make the above uses hard for people who would abuse PCA? Or maybe I have overlooked something. But I believe all that needs to be added is:

  • normalize the input X columns to make X~
  • compute the scores as X~.dot(pca.components_.T)
  • and make this available, e.g. as pca.scores_ (see the sketch below)
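
For reference, here is a sketch of those steps using the existing sklearn-style API (the column scaling and the scores_ name are suggestions, not part of the package; transform() is assumed to follow the usual fit/transform convention):

import numpy as np
from wpca import PCA

rng = np.random.RandomState(0)
X = rng.randn(50, 4) * [1.0, 10.0, 100.0, 1000.0]  # non-commensurable columns

# 1. normalize the input X columns to make X~
X_tilde = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. fit, then project: transform() returns the scores Y directly
pca = PCA(n_components=2).fit(X_tilde)
scores = pca.transform(X_tilde)  # the reduced variables Y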

Could/should this be added to either the code or docs?

Unit tests are failing

The status with recent versions of sklearn and numpy is:

(wpca) wpca% nosetests wpca
..E.................................FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF....................................................................................................................................................

There are two sources of failures:

  • sklearn.utils.estimator_checks expects fit() to set an attribute n_iter_ >= 1.
  • the explained variance computed by WPCA and EMPCA does not match sklearn PCA.

The first is easy to fix (set n_iter_ = 1 in fit()), but I'm not sure what is going on with the second. Did it ever pass? The values don't seem wildly different, but it's obviously more than round-off error, and even more than I would expect from SVD vs. eigh, e.g.

 x: array([ 3.007   ,  1.693608,  1.072242,  0.647339,  0.304193])
 y: array([ 3.6084  ,  2.032329,  1.28669 ,  0.776807,  0.365031])
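
As for the first failure, a hypothetical sketch of the one-line fix, written here as a subclass (in the package itself the attribute would simply be set inside fit()):

from wpca import WPCA

class PatchedWPCA(WPCA):
    def fit(self, X, weights=None):
        super(PatchedWPCA, self).fit(X, weights=weights)
        self.n_iter_ = 1  # satisfy sklearn.utils.estimator_checks
        return self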

Incompatibility with recent sklearn versions?

Hi there, with my current setup (sklearn==0.24.2), if I have X and weights with

X.shape = ( n_samples, n_features)
weights.shape = ( n_samples, )

I get the error

ValueError: Expected 2D array, got 1D array instead:
array=...
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

If I reshape weights to (n_samples, 1), I then get ValueError: Shape of X and weights should match.
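
A minimal sketch reproducing the report (shapes are hypothetical; the first error comes from sklearn's input validation, the second from the package's own shape check):

import numpy as np
from wpca import WPCA

rng = np.random.RandomState(0)
X = rng.randn(20, 3)            # (n_samples, n_features)
weights = np.ones(20)           # (n_samples,)

WPCA().fit(X, weights=weights)  # ValueError: Expected 2D array, got 1D array
# Reshaping weights to (n_samples, 1) instead raises:
#   ValueError: Shape of X and weights should match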

So there's something broken with current versions. Cheers
