patsy's Issues

Better contrast specification

Two possible improvements in how we work with contrasts:

  • (minor) Maybe we should have a way to specify a contrast symbolically, like linear_constraint but just the linear part, not the = constant part.
  • (major) There should be some way to specify contrasts and constraints in terms of predictions (which are invariant wrt coding), rather than predictors (betas). Like it should be possible to say "the difference between an item with a=1 and an item with a=2", and the model will spit out a matrix encoding this -- or, crucially, if the model has an interaction between a and b, then it will say "that's not estimable, you can't leave b unspecified". Possibly we also want to be able to say "tell me the derivative of the prediction wrt x1", since that's also one of the things that betas encode.

Syntax for this will be tricky. Possibly just an array of weights + a list of data dicts?

I need to brush up on the estimable contrast literature...
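For concreteness, here is a rough sketch of the prediction-based idea using only the existing API (build_design_matrices); the weights-plus-data-dicts interface itself is still hypothetical, and older patsy versions may want X.design_info.builder where this uses X.design_info:

import numpy as np
from patsy import dmatrix, build_design_matrices

data = {"a": ["a1", "a2", "a1", "a2"], "x": [1.0, 2.0, 3.0, 4.0]}
X = dmatrix("C(a) + x", data)

# "The difference between an item with a=a2 and an item with a=a1", holding x fixed:
# build design rows for the two hypothetical observations and subtract them.
rows = build_design_matrices([X.design_info], {"a": ["a2", "a1"], "x": [0.0, 0.0]})[0]
contrast = np.asarray(rows[0]) - np.asarray(rows[1])
# np.dot(contrast, betas) would then estimate that prediction difference.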

Categorical does not work with nan

I have a column whose unique values look like:

array([nan, 'CONFERENCE', 'ANALYST', 'FORUM', 'SEMINAR'], dtype=object)

I would expect that adding C(col_name) to the formula would create 4 dummy variables (5 levels - 1), but in fact it only adds 3.

When I try to explicitly set the reference level to nan, I get an exception:

C(col_name, Treatment(reference=nan)) 
PatsyError: specified level nan not found
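Not a fix, but a possible workaround under the current behaviour, assuming the column is a pandas Series: recode the missing values into an explicit level before handing the column to patsy.

import numpy as np
import pandas as pd
from patsy import dmatrix

col = pd.Series([np.nan, "CONFERENCE", "ANALYST", "FORUM", "SEMINAR"])
df = pd.DataFrame({"col_name": col.fillna("MISSING")})  # nan becomes its own level

# Treatment coding can now use the explicit "MISSING" level as the reference,
# and all five levels survive instead of the nan rows being dropped.
X = dmatrix("C(col_name, Treatment(reference='MISSING'))", df)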

clarify that 2to3 is needed for Python 3 support

From what I can tell, this project requires 2to3 to create a Python 3 build, but this is not stated in the README. Please clarify this, preferably including an example conversion like:
2to3 --output-dir=patsy3 -W -n patsy
Thanks!

Optional removal of redundant columns

Patsy automatically removes redundant (linearly dependent) columns so that the final matrix is not overdetermined. Is there an option to turn off this removal? I would like to use patsy formulas for regularized linear regression, and for that I need all the columns, even if they are redundant.
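For a single categorical term there is a partial workaround I'm aware of: suppressing the intercept makes patsy emit one column per level. Whether the across-term redundancy removal can be disabled wholesale is exactly the question above.

from patsy import dmatrix

data = {"a": ["a1", "a2", "a3", "a1"]}
# With the intercept removed, no column needs to be dropped for identifiability,
# so all three levels of `a` get their own dummy column.
X = dmatrix("0 + C(a)", data)
print(X.design_info.column_names)  # e.g. ['C(a)[a1]', 'C(a)[a2]', 'C(a)[a3]']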

Errors thrown with incr_dbuilders()

I'm running a logistic regression and having trouble using Patsy's API to prepare the data when it is bigger than a small sample.

Using the dmatrices function directly on a DataFrame, I am left with this abrupt error (please note: I spun up an EC2 instance with 300 GB of RAM after encountering this on my laptop and got the same error):

Traceback (most recent call last):
File "My_File.py", line 22, in <module>
   df, return_type="dataframe")
File "/root/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 297, in dmatrices
 NA_action, return_type)
File "/root/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 156, in do_highlevel_design
return_type=return_type)
File "/root/anaconda/lib/python2.7/site-packages/patsy/build.py", line 989, in build_design_matrices
results.append(builder._build(evaluator_to_values, dtype))
File "/root/anaconda/lib/python2.7/site-packages/patsy/build.py", line 821, in _build
m = DesignMatrix(np.empty((num_rows, self.total_columns), dtype=dtype),
MemoryError

So, I combed through Patsy's docs and found this gem:

patsy.incr_dbuilder(formula_like, data_iter_maker, eval_env=0)
    Construct a design matrix builder incrementally from a large data set.

However, the method is sparsely documented, and the source code is largely uncommented.

I have arrived at this code:

def iter_maker():
    with open("test.tsv", "r") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            yield(row)


y, dta = incr_dbuilders("s ~ C(x) + C(y):C(rgh) + \
C(z):C(f) + C(r):C(p) + C(q):C(w) + \
C(zr):C(rt) + C(ff):C(djjj) + C(hh):C(tt) + \
C(bb):lat + C(jj):lng + C(ee):C(bb) + C(qq):C(uu)",
        iter_maker)

df = dmatrix(dta, {}, 0, "drop", return_type="dataframe")

but I receive PatsyError: Error evaluating factor: NameError: name 'ff' is not defined

This is being thrown because _try_incr_builders (called from dmatrix) is returning None on line 151 of highlevel.py

What is the correct way to use these Patsy functions to prepare my data? Any examples or guidance you may have will be helpful.
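For reference, the pattern I believe incr_dbuilders expects is to feed the same data back through build_design_matrices chunk by chunk, rather than passing an empty data dict to dmatrix. A sketch under that assumption (the chunk size and the shortened formula are placeholders):

import pandas as pd
from patsy import incr_dbuilders, build_design_matrices

def chunk_iter():
    # Any callable returning a fresh iterator over the data chunks.
    return pd.read_csv("test.tsv", sep="\t", chunksize=100000)

# First pass(es): sniff categorical levels and fit any stateful transforms.
y_builder, x_builder = incr_dbuilders("s ~ C(x) + C(z):C(f)", chunk_iter)

# Second pass: build the actual matrices chunk by chunk with the fitted builders.
for chunk in chunk_iter():
    y, X = build_design_matrices([y_builder, x_builder], chunk,
                                 NA_action="drop", return_type="dataframe")
    # ...accumulate the chunks or fit the model incrementally here...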

Providing a unicode string to dmatrices results in a misleading error message

Doing this works fine:
foo = pd.DataFrame(dict(a=[1,2.0], b=[4,5.0]))
patsy.dmatrices('a ~ b', foo)

However, if the formula is given by a unicode string:
patsy.dmatrices(u'a ~ b', foo)

An exception is raised:
ValueError: design matrix must be real-valued floating point

The exception is very misleading about the nature of the problem. I thought an inf or nan had crept into the dataframe, but no bad values were there.

One possible fix is to send the first argument of dmatrices through the str() function:
patsy.dmatrices(str(u'a ~ b'), foo)

Another possible fix is to raise a different error message so that the user has a clearer idea of the cause of the problem.

I'm using patsy version 0.3.0

Thanks!

Listwise deletion

Listwise deletion is an extremely common operation. It needs to be applied after selecting variables (done by patsy when it interprets the formula), but before splitting exogenous and endogenous variables in separate matrices (also done by patsy).

R's model.matrix function applies listwise deletion by default, and I expect that statsmodels will also want that to be the default behavior. Otherwise, many of the estimation procedures will break, and we will need to remind users to handle their NAs.

Thanks!

Feature request: "target ~ ." syntax

R lets you regress a target against all the variables in the dataframe by simply typing "target ~ ."
This is particularly useful when we have lots of columns and don't want to specify each predictor. I don't think patsy supports this yet. Any chance it could be added? Thanks.
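In the meantime, a common workaround is to build the formula string programmatically from the dataframe's columns. A minimal sketch (names are placeholders):

from patsy import dmatrices

def all_predictors_formula(df, target):
    """Build 'target ~ c1 + c2 + ...' from every other column in df."""
    rhs = " + ".join(c for c in df.columns if c != target)
    return "%s ~ %s" % (target, rhs)

# y, X = dmatrices(all_predictors_formula(df, "target"), df)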

polynomials

It would be nice if this behaved as expected:

y ~ x**2
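In a formula, ** is currently interpreted as interaction expansion, which is why the above collapses to just x. The usual spelling that does work today is escaping to Python with I() (or calling a numpy function directly); a quick sketch:

import numpy as np
from patsy import dmatrix

data = {"x": np.array([1.0, 2.0, 3.0, 4.0])}

# I() escapes to plain Python, so the square is computed numerically instead of
# being parsed as a formula operator.
dmatrix("x + I(x**2)", data)

# An equivalent spelling via numpy:
dmatrix("x + np.power(x, 2)", data)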

BUILD: please rename COPYING to LICENSE.txt

When trying to build & package for Ubuntu, I get the following error:

Copying patsy.egg-info to /build/buildd/patsy-0.1.0~ppa1~revno/debian/python-patsy/usr/lib/python2.7/dist-packages/patsy-0.1.0_dev.egg-info
running install_scripts
   dh_install -O--buildsystem=python_distutils
   dh_installdocs -O--buildsystem=python_distutils
cp: cannot stat `LICENSE.txt': No such file or directory
dh_installdocs: cp -a LICENSE.txt debian/python-patsy/usr/share/doc/python-patsy returned exit code 1
make: *** [binary] Error 2

Would it be possible to rename
https://github.com/pydata/patsy/blob/master/COPYING
into
https://github.com/pydata/patsy/blob/master/LICENSE.txt
?

Using `=` instead of `~` in a formula

I was wondering whether using ~ instead of = in a formula string is meant to be compatible with the R formula syntax, or whether using the latter would break something. If it won't break anything, could we optionally use = in place of ~?

I know it is not a big deal, but I think a model statement with = would look more natural.

Idiom question for dataframes + test data

I have a workflow where I want to construct a design matrix from some training data that happens to be stored in a pandas dataframe. I then want to be able to take a new pandas dataframe and apply the same transformation, so that I can have a pure numpy representation of the test dataset with the same category indices. I suppose an error would be raised if my test dataset has categorical values that were missing from the original dmatrix.

I can't seem to figure out the most idiomatic way to achieve this.

My big picture workflow looks something like this:

  1. X = dmatrix("y ~ v0 * v1", train_data, return_type="dataframe")
  2. X_sparse = scipy.sparse.csr_matrix(X.values)
  3. fit a custom model (using pymc2, actually) with the sparse representation of the data
  4. X_test = X.transform_into_basis(test_data)
  5. X_test_sparse = scipy.sparse.csr_matrix(X_test.values)
  6. X_test = model.predict(X_test_sparse)

What I need is a clean way to transform / featurize my test dataset. I suppose I'd also love for the API to look something like the sklearn model.transform() api.

Does this exist? Can someone point me in the right direction?
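A sketch of the idiom I think applies here, hedged: newer patsy accepts design_info directly in build_design_matrices, while older releases want design_info.builder, and the tiny frames below stand in for the real train/test data:

import pandas as pd
from patsy import dmatrices, build_design_matrices

train_data = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0],
                           "v0": ["a", "b", "a", "b"],
                           "v1": [0.5, 1.5, 2.5, 3.5]})
test_data = pd.DataFrame({"v0": ["b", "a"], "v1": [4.5, 5.5]})

y, X = dmatrices("y ~ v0 * v1", train_data, return_type="dataframe")

# Re-apply the same encoding (categorical level set, centering means, spline knots,
# ...) to the test frame; categories unseen during training raise a PatsyError
# rather than being silently recoded.
(X_test,) = build_design_matrices([X.design_info], test_data,
                                  return_type="dataframe")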

Categorical names (again)

Do we really need, say, the reference level in the Treatment contrast? I'm not sure it adds enough information vs. the complexity it adds to the names to warrant inclusion. Thoughts? AFAICT, it only appears if you specify a reference level. If you specify one, then surely you know what you specified.

[~/]
[7]: dmatrix('~C(A, Treatment)', data=pd.DataFrame([['some really long name'], ['other name'], ['other name']], columns=['A']))
[7]: 
DesignMatrix with shape (3, 2)
Intercept  C(A, Treatment)[T.some really long name]
        1                                         1
        1                                         0
        1                                         0
Terms:
    'Intercept' (column 0)
    'C(A, Treatment)' (column 1)

[~/]
[8]: dmatrix("~C(A, Treatment('some really long name'))", data=pd.DataFrame([['some really long name'], ['other name'], ['other name']], columns=['A']))
[8]: 
DesignMatrix with shape (3, 2)
Intercept  C(A, Treatment('some really long name'))[T.other name]
        1                                                       0
        1                                                       1
        1                                                       1
Terms:
    'Intercept' (column 0)
    "C(A, Treatment('some really long name'))" (column 1)

Tests fail with latest Pandas (0.15.2)

With NumPy 1.9.1, Pandas 0.15.2, and Patsy 0.3.0:

$ nosetests
............./usr/local/lib/python2.7/dist-packages/pandas-0.15.2_182_gbb9c311-py2.7-linux-x86_64.egg/pandas/core/categorical.py:462: FutureWarning: Accessing 'levels' is deprecated, use 'categories'
  warn("Accessing 'levels' is deprecated, use 'categories'", FutureWarning)
/usr/local/lib/python2.7/dist-packages/pandas-0.15.2_182_gbb9c311-py2.7-linux-x86_64.egg/pandas/core/categorical.py:414: FutureWarning: 'labels' is deprecated. Use 'codes' instead
  warnings.warn("'labels' is deprecated. Use 'codes' instead", FutureWarning)
...FF.................
======================================================================
FAIL: patsy.test_highlevel.test_formula_likes
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/whelchel/patsy-0.3.0/patsy/test_highlevel.py", line 202, in test_formula_likes
    [[1], [2], [3]], ["x"])
  File "/home/whelchel/patsy-0.3.0/patsy/test_highlevel.py", line 104, in t
    expected_lhs_values, expected_lhs_names)
  File "/home/whelchel/patsy-0.3.0/patsy/test_highlevel.py", line 32, in check_result
    assert rhs.design_info.column_names == expected_rhs_names
AssertionError

======================================================================
FAIL: patsy.test_highlevel.test_return_pandas
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/whelchel/patsy-0.3.0/patsy/test_highlevel.py", line 348, in test_return_pandas
    assert np.array_equal(df4.columns, ["AA"])
AssertionError

----------------------------------------------------------------------
Ran 35 tests in 54.269s
FAILED (failures=2)

Feature Request: bivariate splines

I would like to use bivariate splines similarly to the way you have patsy.spline working already, e.g.
z ~ 1 + bispline(x, y, knots_x, knots_y)
Since scipy includes scipy.interpolate.bisplrep and scipy.interpolate.bisplev, you have already shown the way, and I can try to follow it and send a pull request if it goes well. Would this be of interest?

Is there an implementation of bivariate splines in R that you would like patsy to follow in notation?

Check for None in __doc__ before changing it.

Python code byte-compiled with -OO has doc-strings stripped out.

This creates problems when byte-compiling packages that modify docstrings by doing something like this:
doc += "additional text"
(when the docstring is None, this fails with a TypeError).

mgcv_cubic_splines.py does this and thus creates problems when trying to byte-compile it with the -OO option using cx_freeze/py2exe.

In order to fix this, I would propose changing all the lines that say:
doc += CubicRegressionSpline.common_doc
to
doc = doc + CubicRegressionSpline.common_doc if doc else ''

See also:
http://bugs.python.org/issue23189#
http://stackoverflow.com/questions/22299532/unsupported-operand-types-for-nonetype-and-str-winappdbg-error-after-c
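A self-contained illustration of the failure mode and the guarded concatenation (generic names, not patsy's actual code):

# Run as `python -OO this_file.py`: docstrings are stripped to None, so an
# unguarded `doc += extra` raises TypeError.

class CubicishSpline(object):
    """Base docstring."""

common_doc = "\n\nShared parameter documentation."

doc = CubicishSpline.__doc__
# Guarded version, mirroring the proposal above: skip the concatenation when
# docstrings have been stripped.
doc = doc + common_doc if doc else ''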

Simpler term names

First I want to thank the developer team for their excellent work.

Well, I feel that the "predictor" names when using, for instance, C(predictor, Treatment(5)) are too long and somewhat confusing. When you make interactions between predictors, you get things like:

C(trimcod, Treatment(4))[T.3]:C(flota, Treatment(11))[T.10]

It would be nice to be able to assign an alias or just forget all the stuff apart from the predictor name to get something like:

[trimcod][T.3]

or just [trimcod 3]

I've been playing with the MyTreat example but I can't get any positive results.

Thank you very much

Jorge Tornero

Preserve pandas index when creating design matrix

Hi -- me again!

In my project, I'm planning to use pandas to do initial manipulation of the data. Patsy works great with pandas DataFrames, but when I build the design matrices, they lose their index. It's definitely something I can work around -- but is there a reasonable way to preserve the index?

DesignMatrix should have to_dataframe() method

This would be useful, for example, when I really want to be able to use a design matrix as both a raw numpy array and a pandas dataframe.

I suppose I could specify return_type="dataframe" and then get the numpy array from df.values, and it's also not hard to build the dataframe from scratch, but this would be particularly handy for interactive use, where it would provide a useful shortcut (e.g., X.to_dataframe().plot() or X.to_dataframe().head()).

To do this right, the new method would be factored out of build_design_matrices. Roughly speaking, it would look like this:

def to_dataframe(self):
    if not have_pandas:
        raise PatsyError("pandas.DataFrame was requested, but "
                         "pandas is not installed")
    di = self.design_info
    df = pandas.DataFrame(self, columns=di.column_names,
                          index=di.pandas_index)
    df.design_info = di
    return df

The main design change would be that DesignInfo (or DesignMatrix) would need to gain a pandas_index attribute, which would keep track of any index from the original data.

If this seems reasonable, I could put together a pull request.

Bug parsing formulas with "[]" in feature names

I have features with [] in their names like "features[1]" and that confuses the parser. Is there any way to escape it?

In [164]: data = demo_data("a", "b", "x1", "x2", "x[5]")

In [165]: dmatrices("a ~ x[5]", data)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-165-1a6534f0f134> in <module>()
----> 1 dmatrices("a ~ x[5]", data)

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/highlevel.pyc in dmatrices(formula_like, data, eval_env, return_type)
    281     """
    282     eval_env = EvalEnvironment.capture(eval_env, reference=1)
--> 283     (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env, return_type)
    284     if lhs.shape[1] == 0:
    285         raise PatsyError("model is missing required outcome variables")

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/highlevel.pyc in _do_highlevel_design(formula_like, data, eval_env, return_type)
    145     def data_iter_maker():
    146         return iter([data])
--> 147     builders = _try_incr_builders(formula_like, data_iter_maker, eval_env)
    148     if builders is not None:
    149         return build_design_matrices(builders, data,

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/highlevel.pyc in _try_incr_builders(formula_like, data_iter_maker, eval_env)
     59         return design_matrix_builders([formula_like.lhs_termlist,
     60                                        formula_like.rhs_termlist],
---> 61                                       data_iter_maker)
     62     else:
     63         return None

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/build.pyc in design_matrix_builders(termlists, data_iter_maker)
    691      cat_postprocessors) = _examine_factor_types(all_factors,
    692                                                  factor_states,
--> 693                                                  data_iter_maker)
    694     # Now we need the factor evaluators, which encapsulate the knowledge of
    695     # how to turn any given factor into a chunk of data:

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/build.pyc in _examine_factor_types(factors, factor_states, data_iter_maker)
    441             break
    442         for factor in list(examine_needed):
--> 443             value = factor.eval(factor_states[factor], data)
    444             if isinstance(value, Categorical):
    445                 postprocessor = CategoricalTransform(levels=value.levels)

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/eval.pyc in eval(self, memorize_state, data)
    471     #    http://nedbatchelder.com/blog/200711/rethrowing_exceptions_in_python.html
    472     def eval(self, memorize_state, data):
--> 473         return self._eval(memorize_state["eval_code"], memorize_state, data)
    474
    475 def test_EvalFactor_basics():

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/eval.pyc in _eval(self, code, memorize_state, data)
    454     def _eval(self, code, memorize_state, data):
    455         inner_namespace = VarLookupDict([data, memorize_state["transforms"]])
--> 456         return self._eval_env.eval(code, inner_namespace=inner_namespace)
    457
    458     def memorize_chunk(self, state, which_pass, data):

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/eval.pyc in eval(self, expr, source_name, inner_namespace)
    119         code = compile(expr, source_name, "eval", self.flags, False)
    120         return eval(code, {}, VarLookupDict([inner_namespace]
--> 121                                             + self._namespaces))
    122
    123     @classmethod

<string> in <module>()

NameError: name 'x' is not defined
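For what it's worth, patsy's Q() quoting helper is the usual escape hatch for names the parser can't handle, since it looks the string up verbatim in the data; whether it fully resolves the reporter's case may depend on the patsy version, and the toy data below is made up:

import numpy as np
from patsy import dmatrix

data = {"x[5]": np.array([1.0, 2.0, 3.0]), "z": np.array([4.0, 5.0, 6.0])}

# Q("x[5]") is evaluated as a lookup of the literal key "x[5]", so the formula
# parser never tries to interpret the brackets as indexing.
X = dmatrix("z + Q('x[5]')", data)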

Trouble making categorical variable --- TypeError: 'ClassRegistry' object is not callable

Not sure what's going on here.

In [44]: patsy.dmatrix("C(a)", {'a':['m', 'n', 'o']})
Out[44]:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-44-24c8bf15ba3b> in <module>()
----> 1 dmatrix("C(a)", {'a':['m', 'n', 'o']})

/Users/mike/venv/sci/lib/python2.7/site-packages/patsy/highlevel.pyc in dmatrix(formula_like, data, eval_env, return_type)
    259     """
    260     (lhs, rhs) = _do_highlevel_design(formula_like, data, _get_env(eval_env),
--> 261                                       return_type)
    262     if lhs.shape[1] != 0:
    263         raise PatsyError("encountered outcome variables for a model "

/Users/mike/venv/sci/lib/python2.7/site-packages/patsy/highlevel.pyc in _do_highlevel_design(formula_like, data, eval_env, return_type)
    145     def data_iter_maker():
    146         return iter([data])
--> 147     builders = _try_incr_builders(formula_like, data_iter_maker, eval_env)
    148     if builders is not None:
    149         return build_design_matrices(builders, data,

/Users/mike/venv/sci/lib/python2.7/site-packages/patsy/highlevel.pyc in _try_incr_builders(formula_like, data_iter_maker, eval_env)
     59         return design_matrix_builders([formula_like.lhs_termlist,
     60                                        formula_like.rhs_termlist],
---> 61                                       data_iter_maker)
     62     else:
     63         return None

/Users/mike/venv/sci/lib/python2.7/site-packages/patsy/build.pyc in design_matrix_builders(termlists, data_iter_maker)
    691      cat_postprocessors) = _examine_factor_types(all_factors,
    692                                                  factor_states,
--> 693                                                  data_iter_maker)
    694     # Now we need the factor evaluators, which encapsulate the knowledge of
    695     # how to turn any given factor into a chunk of data:

/Users/mike/venv/sci/lib/python2.7/site-packages/patsy/build.pyc in _examine_factor_types(factors, factor_states, data_iter_maker)
    441             break
    442         for factor in list(examine_needed):
--> 443             value = factor.eval(factor_states[factor], data)
    444             if isinstance(value, Categorical):
    445                 postprocessor = CategoricalTransform(levels=value.levels)

/Users/mike/venv/sci/lib/python2.7/site-packages/patsy/eval.pyc in eval(self, memorize_state, data)
    429     #    http://nedbatchelder.com/blog/200711/rethrowing_exceptions_in_python.html
    430     def eval(self, memorize_state, data):
--> 431         return self._eval(memorize_state["eval_code"], memorize_state, data)
    432 
    433 def test_EvalFactor_basics():

/Users/mike/venv/sci/lib/python2.7/site-packages/patsy/eval.pyc in _eval(self, code, memorize_state, data)
    412     def _eval(self, code, memorize_state, data):
    413         inner_namespace = VarLookupDict([data, memorize_state["transforms"]])
--> 414         return self._eval_env.eval(code, inner_namespace=inner_namespace)
    415 
    416     def memorize_chunk(self, state, which_pass, data):

/Users/mike/venv/sci/lib/python2.7/site-packages/patsy/eval.pyc in eval(self, expr, source_name, inner_namespace)
    119         code = compile(expr, source_name, "eval", self.flags, False)
    120         return eval(code, {}, VarLookupDict([inner_namespace]
--> 121                                             + self._namespaces))
    122 
    123     @classmethod

<string> in <module>()

TypeError: 'ClassRegistry' object is not callable

Using virtualenv on a Mac. Python 2.7.3.
Pygments==1.5
cloud==2.6.9
distribute==0.6.31
ipython==0.13.1
matplotlib==1.2.0
nose==1.2.1
numpy==1.6.2
pandas==0.10.0
patsy==0.1.0
python-dateutil==2.1
pytz==2012h
pyzmq==2.2.0.1
readline==6.2.4.1
scipy==0.11.0
six==1.2.0
statsmodels==0.5.0
sympy==0.7.2
tornado==2.4.1
wsgiref==0.1.2

dmatrices name?

I mentioned this on the ML, but I thought I'd raise an issue. dmatrices is a bit of a misnomer since there is no such thing as a left-hand side design matrix. This is why I preferred the original make_matrices or make_model_matrices.

Better explain what "statistical models" patsy can handle in the README / overview?

Physicists / astronomers (and maybe other potential users like engineers, ...?) often use a different vocabulary than statisticians / economists. E.g. I'm an astronomer, and I would say that y(x) = a * x + b is a linear model with parameters a and b, and that y(x) = exp(-x / s) is a non-linear model with parameter s. So when I read "This is patsy, a Python library for describing statistical models and building design matrices," I was excited that I might be able to specify arbitrary models (being ignorant of what exactly "statistical models" are, and thinking that "design matrices" refers to the subset of linear models patsy can handle) and fit them to data. But if I understand correctly, patsy only handles linear models, right?

Maybe you could add a sentence or two to the README or to http://patsy.readthedocs.org/en/latest/overview.html to make it even clearer up-front what kinds of "statistical models" patsy can and can't handle?
(I had the same problem when I first saw statsmodels, it took me a while to figure out that it only can fit very specific models, not arbitrary nonlinear models.)

Two more things you could add to the docs:

Thanks for making patsy and writing great documentation!
I'll try to learn how statisticians do regression from the patsy and statsmodels docs.

[docs]: Library-developers.rst : Least Squares Regression example

Error Message in the RTD-rendered documentation http://patsy.readthedocs.org/en/v0.1.0/library-developers.html [1]:

NameError: name 'LM' is not defined

It looks like the example LM least squares regression model class in example_lm.py [2] (which is not for reuse, but may be relevant to #3) is not persisted between invocations of the IPython directive [3] in docs/library-developers.rst [1].

The IPython directive docs [3] say that "[functions] persist across sessions", but I guess execfile does not persist into locals().

[1] https://github.com/pydata/patsy/blob/master/doc/library-developers.rst
[2] https://github.com/pydata/patsy/blob/master/doc/_examples/example_lm.py
[3] http://ipython.org/ipython-doc/dev/development/ipython_directive.html

Provide more metadata about how coding actually occurred, at the factor level

In DesignInfo, we should provide:

  • For each factor, whether it is continuous or categorical
  • If categorical, what levels it has
  • If categorical, the ContrastMatrix we use for each (factor, term) pair.

This is pretty simple and obvious stuff, and useful to statsmodels (in particular, knowing what levels exist so you can generate all pairwise tests etc.).
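A hypothetical sketch of what the requested metadata could look like from the consumer's side, mocked up as plain dicts (the names and layout are invented here for illustration, not patsy's actual API):

# Hypothetical shape of the per-factor metadata being requested.
factor_metadata = {
    "C(a)": {"kind": "categorical", "levels": ("a1", "a2", "a3")},
    "x":    {"kind": "numerical"},
}
# One contrast matrix per (factor, term) pair, again hypothetical.
contrast_matrices = {
    ("C(a)", "C(a)"): "ContrastMatrix used to code C(a) in that term",
}

for factor, meta in factor_metadata.items():
    print(factor, meta)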

Patsy prints uninformative error message when user places "Intercept" in a formula

This is observed on patsy 0.1.0.

I saw that the design_info object of a design matrix uses "Intercept" as the encoding for the intercept term so I wondered what would happen if a programmer chose this as the name for a feature.

The ideal scenario is that patsy either:
a. does some name mangling, or
b. throws an error telling me exactly what I did wrong, if this is not going to be supported.

What happens in reality is that an uninformative assertion message is produced:

Traceback (most recent call last):
  File "failure.py", line 5, in <module>
    y,X = patsy.dmatrices("sl ~ Intercept",dataFrame)
  File "build/bdist.macosx-10.8-intel/egg/patsy/highlevel.py", line 283, in dmatrices
  File "build/bdist.macosx-10.8-intel/egg/patsy/highlevel.py", line 150, in _do_highlevel_design
  File "build/bdist.macosx-10.8-intel/egg/patsy/build.py", line 860, in build_design_matrices
  File "build/bdist.macosx-10.8-intel/egg/patsy/build.py", line 776, in _build
  File "build/bdist.macosx-10.8-intel/egg/patsy/build.py", line 757, in design_info
  File "build/bdist.macosx-10.8-intel/egg/patsy/design_info.py", line 78, in __init__
AssertionError

Here is the code that produces the error:

import pandas
import patsy

dataFrame = pandas.io.parsers.read_csv("salary2.txt") 
y,X = patsy.dmatrices("sl ~ Intercept",dataFrame) 

Patsy imports IPython

Is this necessary?

>>> import sys
>>> 'IPython' in sys.modules
False
>>> import patsy
>>> 'IPython' in sys.modules
True

cc @changhiskhan. This was causing downstream issues in pandas testing, where some tests were skipped when IPython was imported (to prevent the test suite from messing up an interactive session with plot windows, etc.).

Support for whitespace in column names

Maybe I'm doing something wrong, but these examples fail:

df = pd.DataFrame( np.arange(10), columns = ['test data'] )

patsy.dmatrix('test data', df )
#SyntaxError: unexpected EOF while parsing

patsy.dmatrix('"test data"', df )
#PatsyError: categorical data must be an iterable container

Need a way to limit what functions can be called

Hi,

Patsy is awesome, and I'm hoping to use it in a web application. However, the fact that it'll eval() arbitrary expressions makes it a big ol' security hole. For example, you can do this:

dmatrices("y ~ open('bad.txt', 'w')")

and while patsy raises an exception that files don't have a len(), it does indeed create the file. With a bit of creativity, I'm pretty sure there are a variety of paths to arbitrary code execution exploits there; see:

http://me.veekun.com/blog/2012/03/24/on-principle/

for a much more devious example.

What might work better would be to actually have a minimal parser and a set of functions it'll be willing to run on data (one could maybe pass in a mapping of function names to functions) instead -- this:

http://effbot.org/zone/simple-iterator-parser.htm

suggests it should be pretty easy.

Arbitrary python code isn't really the right thing here, IMO -- there's no reason to be defining classes or using comprehensions in a formula.

In any case: eval() in this context probably cannot reasonably be made safe, and will probably wind up biting people who are using patsy as part of anything that accepts untrusted strings.

Capture only the values of referenced variables in formula namespace

Right now when creating a formula, we capture the namespace itself.

This can pin large variables in memory, and presents an obstacle to serializing model designs (#26).

What we should do is to figure out which variables from the enclosing namespace are actually used, and then capture only those.

The klugey way to do this is to observe which variables are accessed when evaluating the formula the first time, and then save only those.

The more principled, reliable, and modular way is to use ast to parse the formula, and then extract all bare variable references. (Or maybe we should just re-use the token-based implementation of this.) Names that are found not in the data and not in the builtins, but in the environment, should get stashed for use in actual evaluation.

This isn't just an optimization, it does produce a user-visible effect: if some variable name referenced in the formula is rebound after the formula is created, then previously the new value would be used in future predictions, but after this change, the old value will continue to be used.

This is unaesthetic (I think PHP's so-called "closures" work this way?), but in our case it's actually for the best -- ideally we'd save a read-only snapshot of the environment, period, which is not tractable. But this moves us slightly in that direction, so, okay.
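A rough sketch of the ast-based name extraction described above (not patsy's actual implementation):

import ast

def bare_names(factor_code):
    """Collect every bare variable reference in a factor expression."""
    tree = ast.parse(factor_code, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

print(bare_names("np.log(x1) + center(x2)"))
# e.g. {'np', 'x1', 'center', 'x2'} -- anything not found in the data or the
# builtins would then be snapshotted from the enclosing environment.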

Sparse matrix from formula

Is it currently possible to build a sparse design matrix from a formula? This is desirable for formulas with a lot of categorical variables and their interactions, which are naturally represented as a sparse matrix of zeros and relatively few ones. R's sparse.model.matrix produces a sparse matrix but, when the model includes interactions, it falls back to the standard model.matrix, so a dense matrix is created as an intermediate step. I don't know if patsy includes any sparse matrix support at all; if it doesn't, consider this a request for enhancement.

setup.py develop doesn't work?

I'm a bit baffled by this. Is this a local problem?

|16 $ python setup.py develop
/usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'install_requires'
warnings.warn(msg)
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: setup.py --help [cmd1 cmd2 ...]
or: setup.py --help-commands
or: setup.py cmd --help

error: invalid command 'develop'

Setuptools version

>>> import setuptools
>>> print(setuptools.__version__)
3.4.4

On

|20 $ python --version
Python 2.7.5+

Improve error message for unexpected categorical level

Parking this here so I don't forget about it. The following error message is a little obscure. I think it could be improved (from the statsmodels perspective) if object -> observation and if it also gave the factor name that caused the problem. E.g.,

Observation with level 4 in C(X) does not match any of the expected levels.

from patsy import build_design_matrices, dmatrix
from patsy import EvalEnvironment

env = EvalEnvironment.capture()

data = {"X" : [0,1,2,3], "Y" : [1,2,3,4]}
formula = "C(X) + Y"
new_data = {"X" : [0,0,1,2,3,3,4], "Y" : [1,2,3,4,5,6,7]}
info = dmatrix(formula, data)
build_design_matrices([info.design_info.builder], new_data)

Make DesignMatrixBuilders pickleable and saveable

Use cases:

  • It should be possible to pickle a DesignMatrixBuilder (and/or DesignInfo, same issue)
  • Checking if two designs are the same: this comes up for rERPy -- it's only valid to form a grand average across multiple analyses if the underlying regressions were the same. In particular it would be good to be able to check for subtle gotchas like use of center(...) with different means across the different analyses.

The easy part of this is reviewing the inner structure of DesignMatrixBuilder (column builders and all that) to make sure it's sensible, and similarly for factor state dicts.

The more complicated part is capturing the evaluation environment in a reasonable way.

Precondition: #25

Rigorously handle 0d/1d interaction in factor values

Some things that would be nice:

build_design_matrices([builder], {"x": 1, "y": [1, 2, 3]})

should probably broadcast x against y -- very nice for prediction! (But if it were {"x": [1], "y": [1, 2, 3]} then that should be an error.)

build_design_matrices([builder], {"x": 1, "a": "a0"})

should not be an error. (Right now, scalar numerics get converted up to columns via atleast2d_column_default, but scalar categoricals are just an error.)

And also, that last one should perhaps return a 1d ndarray or Series, not a 2d ndarray or DataFrame. (And this also applies when the data passed in is a Series, e.g. a row from a DataFrame.) The motivation is that this would make

pred_x = build_design_matrices([builder], {"x": 1, "a": "a0"})
dot(pred_x, betas)

give you a scalar when betas.ndim == 1 or a 1d vector in the multivariate case where betas.ndim == 2. But

pred_x = build_design_matrices([builder], {"x": [1], "a": ["a0"]})
dot(pred_x, betas)

would give you a 1d vector when betas.ndim == 1 or a 2d vector in the multivariate case.

This would definitely simplify patsy's prediction code!

A concern is that just starting to return 1d design "matrices" (or Series with return_type="dataframe", oops) might break existing code.

Malfunctioning test in categorical.py

I have no idea what this snippet in test_categorical_to_int() is testing:
cat = pandas.Categorical([1, 0, -1], ("a", "b"))
conv = categorical_to_int(cat, ("a", "b"), NAAction())
assert np.all(conv == [1, 0, -1])
It needs to be corrected.

Less awkward API for using simple stateful transforms?

I found it awkward to use the syntax for using stateful transforms, as shown by the tutorial example:

>>> build_design_matrices([mat.design_info.builder], new_data)[0]

Two reasons:

  1. Understanding the full expression entails a deep dive into patsy's API.
  2. As a code reviewer, I also worry when I see things like [0] at the end of an expression because it looks like data might be being thrown away. (I suppose one solution to this is to do explicit assignment like new_mat, = ....)

So instead, I wrote a helper function:

def updated_design_matrix(design_matrix, data, NA_action='drop'):
    """Shortcut to ``build_design_matrices`` with the builder from
    ``design_matrix.design_info.builder``
    """
    if have_pandas and isinstance(design_matrix, pandas.DataFrame):
        return_type = 'dataframe'
    else:
        return_type = 'matrix'
    return build_design_matrices([design_matrix.design_info.builder], data,
                                 NA_action, return_type, design_matrix.dtype)[0]

This lets me write this instead:

>>> updated_design_matrix(mat, new_data)

...which looks much closer to the "high level" syntax.

Does something like this belong in core patsy?

Note: we will need a similarly named (but probably not the same) function to handle the "update" syntax of . (note #28).

P.S. In case it isn't obvious, it has been a pleasure for me to discover and use patsy over the past few weeks :).

Exceptions in examples on readthedocs

The pandas example gives an ImportError:
http://patsy.readthedocs.org/en/latest/expert-model-specification.html

I think you can install numpy and pandas via a pip requirements file on readthedocs (haven't tried), and then it should work:
http://read-the-docs.readthedocs.org/en/latest/faq.html

The least squares example (the LM class) gives a NameError:
http://patsy.readthedocs.org/en/latest/library-developers.html
Does LM stand for LinearModel? Maybe you can mention that in a class docstring?

Rude install behavior

Hi Nathaniel, can you please remove this line

install_requires=["numpy"],

from setup.py? It isn't a good idea to have this line here, because then if you use pip or easy_install to install patsy, it will try to upgrade numpy under some circumstances. This is rude and should never happen automatically. In my case, I had 1.8.0-dev on my system (in-place build) and pip tried to ignore it and install 1.7.0 into site-packages.

The same was done for statsmodels. There it was replaced with

 try:
     from numpy.version import short_version as npversion
 except ImportError:
     raise ImportError("statsmodels requires numpy")

Error in dmatrix with missing data in pandas Categoricals & return type=dataframe

Just exploring patsy seriously for the first time -- instant love affair! Thanks a lot!

Found one little bug (patsy 0.2.0, pandas 0.12), as described in the issue title. The minimal example is taken from the first line of the docs and slightly adjusted to reproduce the behaviour. It seems like the missing data should be dropped from some index but is not. Or am I missing something?

In [1]: from patsy import dmatrix, demo_data
In [2]: import pandas as pd
In [3]: import numpy as np

In [4]: data = pd.DataFrame(demo_data("a", "b", "x1", "x2", "y"))

In [5]: data['a'].iloc[0] = np.nan

In [6]: dmatrix('a', data, return_type='dataframe')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-3ab0bd40ae1e> in <module>()
----> 1 dmatrix('a', data, return_type='dataframe')

/home/xxx/anaconda/envs/py33/lib/python3.3/site-packages/patsy/highlevel.py in dmatrix(formula_like, data, eval_env, NA_action, return_type)
    276     eval_env = EvalEnvironment.capture(eval_env, reference=1)
    277     (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 278                                       NA_action, return_type)
    279     if lhs.shape[1] != 0:
    280         raise PatsyError("encountered outcome variables for a model "

/home/xxx/anaconda/envs/py33/lib/python3.3/site-packages/patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
    154         return build_design_matrices(builders, data,
    155                                      NA_action=NA_action,
--> 156                                      return_type=return_type)
    157     else:
    158         # No builders, but maybe we can still get matrices

/home/xxx/anaconda/envs/py33/lib/python3.3/site-packages/patsy/build.py in build_design_matrices(builders, data, NA_action, return_type, dtype)
    954             matrices[i] = pandas.DataFrame(matrix,
    955                                            columns=di.column_names,
--> 956                                            index=pandas_index)
    957             matrices[i].design_info = di
    958     return matrices

/home/xxx/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    413             else:
    414                 mgr = self._init_ndarray(data, index, columns, dtype=dtype,
--> 415                                          copy=copy)
    416         elif isinstance(data, list):
    417             if len(data) > 0:

/home/xxx/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py in _init_ndarray(self, values, index, columns, dtype, copy)
    559             columns = _ensure_index(columns)
    560 
--> 561         return create_block_manager_from_blocks([ values.T ], [ columns, index ])
    562 
    563     def _wrap_array(self, arr, axes, copy=False):

/home/xxx/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/internals.py in create_block_manager_from_blocks(blocks, axes)
   2233         blocks = [ getattr(b,'values',b) for b in blocks ]
   2234         tot_items = sum(b.shape[0] for b in blocks)
-> 2235         construction_error(tot_items,blocks[0].shape[1:],axes)
   2236 
   2237 def create_block_manager_from_arrays(arrays, names, axes):

/home/xxx/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/internals.py in construction_error(tot_items, block_shape, axes)
   2215     raise ValueError("Shape of passed values is %s, indices imply %s" % (
   2216             tuple(map(int, [tot_items] + list(block_shape))),
-> 2217             tuple(map(int, [len(ax) for ax in axes]))))
   2218 
   2219 

ValueError: Shape of passed values is (0, 8), indices imply (0, 7)

Maximum recursion depth error for formulas with more than 485 terms

I am working with a dataframe which has 7000 columns and it turns out that once you go beyond 485 terms, patsy throws a recursion error when going from a formula to a design matrix. Is there a better way of doing this?

Thanks!

In [282]: df = pd.DataFrame(dict(('a' + str(i), np.random.randn(5)) for i in xrange(500)))

In [283]: formula = " + ".join(df.columns)

In [284]: dmatrices(formula, df)

....

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in eval(self, tree, require_evalexpr)
    452                                 "'%s' operator" % (tree.type,),
    453                                 tree.token)
--> 454         result = self._evaluators[key](self, tree)
    455         if require_evalexpr and not isinstance(result, IntermediateExpr):
    456             if isinstance(result, ModelDesc):

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in _eval_binary_plus(evaluator, tree)
    283
    284 def _eval_binary_plus(evaluator, tree):
--> 285     left_expr = evaluator.eval(tree.args[0])
    286     if tree.args[1].type == "ZERO":
    287         return IntermediateExpr(False, None, True, left_expr.terms)

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in eval(self, tree, require_evalexpr)
    452                                 "'%s' operator" % (tree.type,),
    453                                 tree.token)
--> 454         result = self._evaluators[key](self, tree)
    455         if require_evalexpr and not isinstance(result, IntermediateExpr):
    456             if isinstance(result, ModelDesc):

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in _eval_binary_plus(evaluator, tree)
    283
    284 def _eval_binary_plus(evaluator, tree):
--> 285     left_expr = evaluator.eval(tree.args[0])
    286     if tree.args[1].type == "ZERO":
    287         return IntermediateExpr(False, None, True,
    left_expr.terms)

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in eval(self, tree, require_evalexpr)
    448         assert isinstance(tree, ParseNode)
    449         key = (tree.type, len(tree.args))
--> 450         if key not in self._evaluators:
    451             raise PatsyError("I don't know how to evaluate this "
    452                                 "'%s' operator" % (tree.type,),

RuntimeError: maximum recursion depth exceeded in cmp
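One workaround that I believe avoids the recursive formula parser entirely is to build the model description programmatically; whether this covers every use case is an assumption on my part:

import numpy as np
import pandas as pd
from patsy import ModelDesc, Term, LookupFactor, dmatrix

df = pd.DataFrame(dict(("a" + str(i), np.random.randn(5)) for i in range(500)))

# Term([]) is the intercept; each LookupFactor pulls a column straight from the
# data without going through the formula parser, so no recursion is involved.
rhs = [Term([])] + [Term([LookupFactor(name)]) for name in df.columns]
X = dmatrix(ModelDesc([], rhs), df)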

how to deal with unicode?

Dealing with just Python 2 for now, I understand that patsy expects strings. But the data containers might not have this design. So what's the recommended way of handling this? Should we be messing with the data keys under the hood, or should patsy? The only way I can think of to handle this (other than statsmodels doing it under the hood) is for patsy to accept unicode but also an encoding, so the formula and the data keys can both be encoded correctly. E.g., this fails:

import numpy as np
import pandas as pd
import patsy

data = pd.DataFrame({
    u'àèéòù' : np.random.randn(100),
    'x' : np.random.randn(100)})

formula = u"Q('àèéòù') ~ x".encode('utf-8')
dmatrices = patsy.dmatrices(formula, data=data)

But if we also encode the data keys, it's fine. So should dmatrices and whatever other entry points also take an encoding? Am I missing something?

Non standard python name error (unicode or invalid names)

I use patsy almost every day through the statsmodels library. There is a little catch that hits me here and there, and it's about pandas dataframes with column names that are written with non-ASCII characters or that are not valid Python names.

In the first case I obtain a UnicodeError saying that it can't convert the special character to a proper ASCII value.

Take for example this code:

bad_df = pd.DataFrame({u'xè':randn(10), u'x2': randn(10)})
dmatrices(u"xè ~ x2 ", bad_df)

This can be due to some internal conversion to a plain string instead of unicode. This looks trivial, but in Europe we use a lot of special characters, and this forces me to "reshape" the whole dataset beforehand, which is a very error-prone job. I know this is an issue with Python 2.7 and probably won't appear in Python 3.x, but I cannot make the switch yet.

The second case is a lot trickier and has no obvious solution. Take this dataframe:

bad_df2 = pd.DataFrame({'x 1':randn(10), 'x 2': randn(10)})

The names "x 1" and "x 2" cannot be used as it is, forcing again to make a lot of conversion for alla special kind of characters mangling the names of all columns. Would it be possible to implement a special case to pass the name of the columns as a string (inside the formula string) and let patsy evaluate it as a LookupFactor? With this kind of solution the model could look like this:

" 'x 1' ~ 'x 2'  "

It's not beautiful, but it shouldn't create any kind of ambiguity and would still let us use a more general set of meaningful names.

Thank you very much for your hard work!
