
dplython's People

Contributors

bleearmstrong, danrobinson, dbaneman, dodger487, goodtimeslim, nagasaki45, ngould


dplython's Issues

Typing issue with spread()

There is a typing issue with spread, in that pandas doesn't retype columns that are pivoted. tidyr::spread doesn't retype columns by default, but can if you select the option.

Consider the following dataset:

id var value
1 date 1/12/12
1 pop 100
1 country USA

If we spread on var/value, we get the following:

id date pop country
1 1/12/12 100 USA

Now, in the first dataframe, the column value will be a string column, and by usual pandas behavior, all the new columns will also be strings, even though date should be a date, and pop should be numeric. In tidyr, if we used the convert option, these columns will be converted to what they should be.

pandas used to have a convert_objects method that would work on the whole dataframe, doing conversions like this, but it's been deprecated and replaced with specific methods that each target a particular type (e.g. to_datetime). One sticky issue is that numbers can be converted to dates without errors, and dates (in some cases) can be converted to numbers without errors.

I see two directions to take:

  1. Ignore the typing issue. This is the default behavior in tidyr, and the default behavior of pandas (spread here is just a wrapper for some functions, where pandas.pivot is the main workhorse). If the user wants the correct typing, they can do it themselves after the data has been spread (while googling possible solutions to this problem, it came up a lot on Stack Overflow, and the advice was to just handle it yourself, although those were specific problems rather than development issues).
  2. Agree on some heuristic to convert data (e.g. if the value is a string that looks like [0-9]+, then make it a number, otherwise try to convert it to a date, and if that fails, leave it as a string). I haven't investigated too much into how tidyr does it.

Note that this is only an issue when there are mixed types in one column (in which case they'll almost always be strings). If the value column is, say, a date column, then the resulting spread columns will also be dates.
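For direction 2, the heuristic could be sketched in plain pandas along these lines (the helper name and the numeric-before-datetime ordering are my own assumptions, not tidyr's actual logic):

```python
import pandas as pd

def convert_spread_columns(df, columns):
    """Best-effort retyping of string columns produced by spread():
    try numeric first, then datetime, else leave the column alone."""
    df = df.copy()
    for col in columns:
        try:
            df[col] = pd.to_numeric(df[col])
            continue  # parsed cleanly as numbers; don't also try dates
        except (ValueError, TypeError):
            pass
        try:
            df[col] = pd.to_datetime(df[col])
        except (ValueError, TypeError):
            pass  # leave as strings
    return df

wide = pd.DataFrame({"id": [1], "date": ["1/12/12"], "pop": ["100"]})
converted = convert_spread_columns(wide, ["date", "pop"])
```

Trying numeric before datetime sidesteps the "numbers convert to dates without errors" problem mentioned above, at the cost of never producing dates from purely numeric strings.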

nrow only works once

Really excited to use this more! One thing I noticed playing around today is that nrow breaks after the first call.

# this works
(diamonds >>
    group_by(X.cut, X.color) >>
    summarize(count = nrow()))

# this breaks
(diamonds >>
    group_by(X.cut, X.clarity) >>
    summarize(num_rows = nrow()))

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-9c32d69f8d9e> in <module>()
      1 (diamonds >>
      2     group_by(X.cut, X.clarity) >>
----> 3     summarize(num_rows = nrow()))

//anaconda/lib/python2.7/site-packages/dplython/dplython.pyc in __rshift__(self, delayedFcn)
    300 
    301     if self._group_dict:
--> 302       outDf = self.apply_on_groups(delayedFcn, otherDf)
    303       return outDf
    304     else:

//anaconda/lib/python2.7/site-packages/dplython/dplython.pyc in apply_on_groups(self, delayedFcn, otherDf)
    284       if len(subsetDf) > 0:
    285         subsetDf._current_group = dict(zip(self._grouped_on, group_vals))
--> 286         groups.append(delayedFcn(subsetDf))
    287 
    288     outDf = DplyFrame(pandas.concat(groups))

//anaconda/lib/python2.7/site-packages/dplython/dplython.pyc in CreateSummarizedDf(df)
    393 def summarize(**kwargs):
    394   def CreateSummarizedDf(df):
--> 395     input_dict = {k: val.applyFcns(df) for k, val in six.iteritems(kwargs)}
    396     if len(input_dict) == 0:
    397       return DplyFrame({}, index=index)

//anaconda/lib/python2.7/site-packages/dplython/dplython.pyc in <dictcomp>((k, val))
    393 def summarize(**kwargs):
    394   def CreateSummarizedDf(df):
--> 395     input_dict = {k: val.applyFcns(df) for k, val in six.iteritems(kwargs)}
    396     if len(input_dict) == 0:
    397       return DplyFrame({}, index=index)

//anaconda/lib/python2.7/site-packages/dplython/dplython.pyc in applyFcns(self, df)
    183     stmt = df
    184     for func in self.todo:
--> 185       stmt = func(stmt)
    186     return stmt
    187 

//anaconda/lib/python2.7/site-packages/dplython/dplython.pyc in <lambda>(foo)
    191 
    192   def __call__(self, *args, **kwargs):
--> 193     self.todo.append(lambda foo: foo.__call__(*args, **kwargs))
    194     return self
    195 

AttributeError: 'int' object has no attribute '__call__'

Convert DplyFrame into DataFrame

How do I convert a DplyFrame into a DataFrame? My code is as follows.

newcalc=(dplython.DplyFrame(calc_order_emis) >>
arrange(X.mobile, X.emi_number, X.rank_f) >>
group_by(X.emi_number, X.mobile) >>
mutate(inter2=
X.inter1.cumsum()))

Now when I try to convert newcalc to a data frame, it says the constructor was not properly called, or something similar. The type of newcalc is tuple.

pd.DataFrame(newcalc)
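For what it's worth, DplyFrame appears to subclass pandas.DataFrame, so pd.DataFrame(newcalc) should normally work on an actual DplyFrame; a newcalc of type tuple suggests a stray trailing comma after the closing parenthesis rather than a dplython problem. A plain-pandas illustration of that pitfall:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

as_tuple = (df),   # trailing comma: this builds a 1-tuple containing the frame
as_frame = (df)    # parentheses alone: still a DataFrame

recovered = pd.DataFrame(as_tuple[0])  # unwrap first, then convert as usual
```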

Modifying dplython functions to take DataFrame arguments

I have idea on a change and want to get feedback before making it.

In dplyr, you can call each function on the dataframe itself. So:

# with piping
df %>% mutate(foo=bar)

# same as
mutate(df, foo=bar)

Currently in dplython, the dplython functions all return other functions which are then applied to the DataFrame. If you want to replicate the above, you would have to do something that looks like this:

# with piping
df >> mutate(foo=X.bar)

# same as
mutate(foo=X.bar)(df)

My proposal is to modify the dplython functions to check the type of the first argument. If the first argument is a DataFrame (or inherits from DataFrame), then instead of returning a function, the function is applied to the dataframe. I think this will be more readable.
So:

# old
mutate(foo=X.bar)(df)

# new, also works
mutate(df, foo=X.bar)

Note that this will not break the old way of doing things. I wanted to see what people think before making the change!
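A minimal sketch of the dispatch (the decorator name, the duck-typed check, and the toy verb are illustrative, not a proposed implementation):

```python
def accepts_dataframe(verb):
    """If the first positional argument looks like a DataFrame, apply the
    verb to it immediately; otherwise keep the old curried behavior."""
    def wrapper(*args, **kwargs):
        if args and hasattr(args[0], "columns"):  # duck-typed DataFrame check
            df, rest = args[0], args[1:]
            return verb(*rest, **kwargs)(df)      # new style: apply right away
        return verb(*args, **kwargs)              # old style: return a function
    return wrapper

# toy verb standing in for mutate: returns a function of the "frame"
@accepts_dataframe
def add_columns(**kwargs):
    def apply(df):
        merged = dict(df)
        merged.update(kwargs)
        return merged
    return apply

class FakeFrame(dict):
    columns = ()

direct = add_columns(FakeFrame(a=1), foo=2)   # new style
curried = add_columns(foo=2)(FakeFrame(a=1))  # old style still works
```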

Descending sort for arrange

Currently, arrange only sorts columns in ascending order. Functionality should be added so that arrange can sort both ascending and descending. Dplyr syntax uses an additional function, desc, to do this. Additionally, dplyr will sort the data frame based on the resulting values of the expression (the "Later"). For example,
df %>% arrange(desc(foo))
is essentially equivalent to
df %>% arrange(-foo)

Users can also do stuff like:
df %>% arrange(foo**2) or df %>% arrange(abs(foo))
and it will sort by the largest absolute value of foo.

It'd be nice to have this functionality implemented in dplython as well.
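A plain-pandas sketch of what expression-based sorting would compute (the helper and column name are illustrative; this isn't dplython's arrange):

```python
import pandas as pd

df = pd.DataFrame({"foo": [2, -3, 1]})

def arrange_by(df, keyfunc, ascending=True):
    """Sort df by the values of keyfunc(df), like dplyr's arrange(f(foo))."""
    order = keyfunc(df).sort_values(ascending=ascending).index
    return df.loc[order]

by_desc = arrange_by(df, lambda d: d["foo"], ascending=False)  # arrange(desc(foo))
by_abs = arrange_by(df, lambda d: d["foo"].abs())              # arrange(abs(foo))
```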

The TestMutates.test_multi test is "theoretically" broken

I'm working on adding python3 support (my fork, travis-ci). The only issue I find is the sporadic failures of this specific test.
The reason for the failure is that pd.DataFrame.equals compares, among other things, the columns of the data frames. On the other hand, your mutate function uses dict.iteritems internally to add columns. dict.iteritems, by definition, doesn't guarantee any order, so there is no way to predict the order of the mutated data frame's columns.
I'm not sure what the expected solution will be, but IMHO:

  1. Your API is very nice and elegant. I don't think the mutate function signature should be changed, so there is no way to get the columns in order.
  2. The equals function could be overridden to be less restrictive.
  3. Stop using equals in the tests (or at least in this specific test).
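One way to make the comparison order-insensitive (a variant of option 3, sketched in plain pandas):

```python
import pandas as pd

left = pd.DataFrame({"a": [1], "b": [2]})
right = pd.DataFrame({"b": [2], "a": [1]})  # same data, columns in another order

# Naive equals is sensitive to column order...
naive = left.equals(right)
# ...but sorting the column axis first compares contents only.
order_insensitive = left.sort_index(axis=1).equals(right.sort_index(axis=1))
```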

Having said that, I didn't notice any failure on python 2.7, but somehow it fails sporadically on python 3.3/4/5.

What do you think?

Allow string arguments to select, dfilter, arrange, group_by

For a lot of common use cases, column names could be passed as strings rather than properties of X:

  • selecting particular columns
  • filtering out missing or false values from single columns
  • arranging by particular columns
  • grouping by particular columns

This would slightly increase the interface complexity, but I think it would be easy (arguably easier) to read. It also is consistent with a clean (though verbose) style that uses mutate + group_by/arrange/filter rather than a "complex" group_by/arrange/filter that does an operation before grouping, arranging, or filtering.
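A sketch of the normalization this would need (Column and normalize are illustrative stand-ins, not dplython's Later machinery):

```python
class Column:
    """Toy accessor standing in for X.<name>."""
    def __init__(self, name):
        self.name = name
    def __call__(self, row):
        return row[self.name]

def normalize(arg):
    """Accept either a string column name or an existing accessor."""
    return Column(arg) if isinstance(arg, str) else arg

row = {"carat": 0.23, "price": 326}
values = [normalize("price")(row), normalize(Column("carat"))(row)]
```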

How should grouping be handled by various verbs?

In the current implementation of the join verbs, any grouping needs to be removed in order for the join to function correctly. In the existing code (not yet merged), this is done automatically, and the returned DplyFrame has no grouping. In dplyr, the operation x %>% join(y) returns a dataframe with any grouping on x preserved. This wouldn't require too much work to implement, but it would need an additional attribute to serve as a swap variable (ungroup, do the operation, then regroup) and an additional method. Also, while a mutating join returns a new dataframe, so we could arguably ignore grouping, filtering joins return the original dataframe, so maintaining the grouping is probably desirable.
There are other issues regarding grouping as well.

  1. Should summarise peel away a layer of grouping? In dplython, x >> group_by(a, b) >> summarise() returns a dataframe grouped on a and b. In dplyr, x %>% group_by(a, b) %>% summarise() returns a dataframe grouped on a. This wouldn't be that hard to implement.

  2. Should select statements have grouping variables implied? Currently, x >> group_by(a) >> select(b) throws an error in dplython, but returns x with columns a and b in Dplyr (still grouped on a).

  3. While implementing spread(), how should grouping be handled? For example, if we have x >> group_by(a, b) >> spread(c, d), then it's easy enough to make it so that the dataframe returned is still grouped on a and b, but for something like x >> group_by(a, b) >> summarise(y = f(z)) >> spread(b, y), should that return a dataframe grouped on a? Dplyr would return a dataframe grouped on a, but if we had ... >> spread(a, y), that would return an ungrouped dataframe.
    It gets a little more confusing with multiple variables.
    x %>% group_by(a, b, c) %>% summarise(y=f(z)) %>% spread(a, y) gives a df grouped on b
    x %>% group_by(a, b, c) %>% summarise(y=f(z)) %>% spread(b, y) gives a df grouped on a
    x %>% group_by(a, b, c) %>% summarise(y=f(z)) %>% spread(c, y) gives a df grouped on a, b
    Not sure what the pattern here is.
    At any rate, I don't think any of these would be too difficult to do. But feel free to comment.

NameError: 'DplyFrame' is not defined

I'm a recent Python convert from R, so this is great to see. I have what could be a simple file I/O error when I run your example, but not sure:
/anaconda/bin/python /Users/myusername/code/dplyr_example1.py
Traceback (most recent call last):
  File "/Users/myusername/code/dplyr_example1.py", line 1, in <module>
    from dplython import *
  File "/anaconda/lib/python3.4/site-packages/dplython/__init__.py", line 4, in <module>
    diamonds = DplyFrame(diamonds)
NameError: name 'DplyFrame' is not defined
Any ideas what's wrong?

Missing count

Count the number of unique combinations of columns with diamonds >> count(X.color, X.cut), creating a new column called n.

(Of the missing dplyr verbs, I think this is the single most important; it's one of the most frequent ones I use. It can be imitated with >> group_by(X.color, X.cut) >> summarize(n = X._.nrow()) (or whatever the nrow equivalent is), but this is especially awkward until we have an n()-like solution.)
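In plain pandas terms, the verb would be roughly this (function name and layout illustrative):

```python
import pandas as pd

def count(df, *cols):
    """Rows per unique combination of cols, reported in a new column named n."""
    return df.groupby(list(cols)).size().reset_index(name="n")

df = pd.DataFrame({"color": ["E", "E", "I"], "cut": ["Ideal", "Ideal", "Good"]})
counts = count(df, "color", "cut")
```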

Some wrong results, may just need an error

Hi Chris R.

Thank you for making such a useful library. I'm really looking forward to using this!

There were a couple of things I tried on a lark, kind of hoping that they might work, but which gave me "wrong" results. I kind of expected these wouldn't work, but I think getting an error might be better. Of course it would be even more amazing if they magically worked as expected :) although I'm guessing that would be very hard or even impossible to implement.

# I tried to pass an if-else expression as the value in mutate()
# The results only used the color column, but not clarity column
diamonds = diamonds >> mutate(conditional_results = X.color if X.cut == "Premium" else X.clarity)

# I tried including an "or" in dfilter, but this only returned premium diamonds.
premium_and_ideal = diamonds >> dfilter(X.cut == "Premium" or X.cut == "Ideal")
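Both snippets trip over Python semantics rather than a dplython bug: `or` and `if/else` collapse the expression to a single operand, so only one branch survives. The element-wise equivalents exist in pandas, sketched here (the Series stand in for the diamond columns):

```python
import pandas as pd

cut = pd.Series(["Premium", "Ideal", "Good"])
color = pd.Series(["E", "E", "J"])
clarity = pd.Series(["SI2", "SI1", "VVS2"])

# element-wise "or": | with each comparison parenthesized
mask = (cut == "Premium") | (cut == "Ideal")

# element-wise if/else: keep color where the condition holds, else clarity
conditional = color.where(cut == "Premium", clarity)
```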

Create DelayModule functionality

Currently, users can use functions with Later objects by decorating them / calling them with DelayFunction. It would be nice to have a module that calls DelayFunction on everything callable inside a module, so that other module's functions can be made to easily work with dplython.
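A sketch of the idea (the deferred-call wrapper below is a simplified stand-in for DelayFunction, and DelayModule is a hypothetical name):

```python
def delay(fn):
    """Stand-in for DelayFunction: defer the call until the result is invoked."""
    def delayed(*args, **kwargs):
        return lambda: fn(*args, **kwargs)
    return delayed

class DelayModule:
    """Wrap every callable attribute of a module in delay()."""
    def __init__(self, module):
        self._module = module
    def __getattr__(self, name):
        attr = getattr(self._module, name)
        return delay(attr) if callable(attr) else attr

import math

dmath = DelayModule(math)
pending = dmath.sqrt(9.0)  # nothing computed yet
result = pending()         # forces the computation
```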

Unsupported Operand Types

I get this error
----> 1 df >> select(X.Month) >> head(5)

TypeError: unsupported operand type(s) for >>: 'DataFrame' and 'function'
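The error means the left operand is a plain pandas DataFrame, which doesn't define >> for dplython's verbs; wrapping it first (DplyFrame(df)) is the usual fix. A minimal model of why the left-hand type matters:

```python
class PipeFrame:
    """Toy model of DplyFrame's piping: >> applies the right operand."""
    def __init__(self, data):
        self.data = data
    def __rshift__(self, fn):
        return PipeFrame(fn(self.data))

def take_first(rows):
    return rows[:1]

piped = PipeFrame([1, 2, 3]) >> take_first  # works: left side defines __rshift__
# [1, 2, 3] >> take_first would raise the same TypeError as in the report
```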

Slowness when grouping on a large number of keys

Hi, maybe you already know about this, but just something important to have on your radar. When grouping on large number of keys, things can get very slow. I had to switch back to regular pandas when an operation was taking > 10 minutes.

# Grouping variable with 5 values -> Get results immediately
diamonds >> group_by(X.cut) >> mutate(m = X.x.mean())

# Grouping variable with 273 values -> Get results after 10 seconds. 
# For larger data frames, can take more than 10 minutes
diamonds >> group_by(X.carat) >> mutate(m = X.x.mean())

# The same operation in standard pandas happens instantaneously
diamonds.groupby('carat').mean()
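Until this is fixed, the same grouped mutate can be expressed with pandas' transform, which vectorizes the per-group computation instead of looping over group subsets in Python (sketch, not dplython syntax):

```python
import pandas as pd

df = pd.DataFrame({"carat": [0.2, 0.2, 0.3], "x": [1.0, 3.0, 5.0]})
df["m"] = df.groupby("carat")["x"].transform("mean")  # per-group mean, vectorized
```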

Add documentation

This is a note mainly for me. Well, also for you to see, which will pressure me to do it quickly so I am not embarrassed by not having done it. So sort of for you, but mainly for me.

I need to add some good docs to pythonhosted or somewhere...

If you are interested in helping out with documentation, then this note can be for you even more! Just comment here or ping me on Twitter @dodger487

Index after summarize

Currently, summarize returns a dataframe initialized with index=[0]. This is fine when summarize is used on an ungrouped dataframe and returns a single row, but when it returns a dataframe with multiple rows, it initializes the index of every row to 0, which breaks a lot of behavior (including any subsequent arranges).

Also, if summarize is passed no arguments at all, it throws an error, because index is referenced on line 402 but not defined.

Later breaks in certain situations

Later doesn't work when a method is called on a column before other operations.

In : diamonds >> mutate(foo=X.x.mean() + X.x) >> head() >> select(X.x, X.foo)
"""
Out:
      x             foo
0  3.95  NotImplemented
1  3.89  NotImplemented
2  4.05  NotImplemented
3  4.20  NotImplemented
4  4.34  NotImplemented
"""

It works in many other cases though, notably,

In : diamonds >> mutate(foo=X.x + X.x.mean()) >> head() >> select(X.x, X.foo)
"""
Out:
      x        foo
0  3.95   9.681157
1  3.89   9.621157
2  4.05   9.781157
3  4.20   9.931157
4  4.34  10.071157
"""

In [84]: diamonds >> mutate(foo=X.x.mean()) >> head() >> select(X.x, X.foo)
"""
Out[84]:
      x       foo
0  3.95  5.731157
1  3.89  5.731157
2  4.05  5.731157
3  4.20  5.731157
4  4.34  5.731157
"""

In [85]: diamonds >> mutate(foo=X.x.mean() + 1) >> head() >> select(X.x, X.foo)
"""
Out[85]:
      x       foo
0  3.95  6.731157
1  3.89  6.731157
2  4.05  6.731157
3  4.20  6.731157
4  4.34  6.731157
"""

In : diamonds >> mutate(foo=X.x + X.x.mean() + X.x) >> head() >> select(X.x, X.foo)
"""
Out:
      x        foo
0  3.95  13.631157
1  3.89  13.511157
2  4.05  13.831157
3  4.20  14.131157
4  4.34  14.411157
"""

Maximum recursion depth for Later attributes

Some external functions might enter an infinite loop when called incorrectly with Later arguments. For example, calling zip with Later arguments appears to enter one of these loops. It would help usability with other libraries if we could detect when code is being improperly used with Laters and return an error.

Error on select with groups

In a DplyFrame x with columns a and b, the following throws an error:
x >> group_by(a) >> select(b)

The correct result should be a dataframe with two columns, a and b, still grouped on a.

Missing column selection verbs rename and transmute

select and mutate are present, but there are two others that are variations on these:

  • rename is like select, but leaves other columns as they are: diamonds >> rename(Price = price) will just rename the one column.
  • transmute is like select, but allows mutation in column definitions, e.g. diamonds >> transmute(carat, new_price = price / 1000) will have only two columns, with new_price manipulated

Add to PyPI

Let's get this up on PyPI so users can install via pip

SQLalchemy + pandas data frame and dplython throws error

Hi,

I'm pulling in some data from a postgres database using sqlalchemy and pandas:

sic_codes_filtered = pandas.read_sql_query(query, con=db)

This shows up as a DataFrame in my environment:

[screenshot: the query result displayed as a DataFrame]

However, when I try to execute dplython code, e.g. the following:

(sic_codes_filtered >> 
    sample_n(10))

I get the following error:

[screenshot: the error traceback]

I'm not sure what I'm doing wrong. Could you point me in the right direction?

Thanks in advance!

J.

Rename dfilter to filter

As I suggested in #22, I think we could safely "overload" the built-in filter function by testing whether it is called with a callable and an iterable in that order. Since neither DplyFrames nor DataFrames are callable, I think we'll be safe.

Factor out Laters into separate package

#48 factors out the Later object into its own class, and also mostly decouples it from DplyFrame-specific logic. Might it make sense to make "laters" its own package for handling arbitrary delayed-evaluation expressions in Python? (That will also help us weed out special-case inelegances and bugs.)

I'd be happy to work on that.

dplython cannot be imported into Python v3.6.2?

I attempted to import dplython:

import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction)

This error was displayed:

Traceback (most recent call last):
  File "C:\Users\Shane\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<input>", line 1, in <module>
    from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.2\helpers\pydev_pydev_bundle\pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
ModuleNotFoundError: No module named 'dplython'

The dplython module appears to be installed, as I can see it in the list of packages in PyCharm. However, I cannot see it in the list of packages in Anaconda, which seems suspicious.

To install, I checked out the git repository, then ran:

python setup.py install

Result:

E:\git\dplython>python setup.py install
running install
running bdist_egg
running egg_info
writing dplython.egg-info\PKG-INFO
writing dependency_links to dplython.egg-info\dependency_links.txt
writing requirements to dplython.egg-info\requires.txt
writing top-level names to dplython.egg-info\top_level.txt
reading manifest file 'dplython.egg-info\SOURCES.txt'
writing manifest file 'dplython.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
warning: install_lib: 'build\lib' does not exist -- no Python modules to install
creating build\bdist.win-amd64\egg
creating build\bdist.win-amd64\egg\EGG-INFO
copying dplython.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO
copying dplython.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying dplython.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying dplython.egg-info\requires.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying dplython.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating 'dist\dplython-0.0.7-py3.6.egg' and adding 'build\bdist.win-amd64\egg' to it
removing 'build\bdist.win-amd64\egg' (and everything under it)
Processing dplython-0.0.7-py3.6.egg
Removing c:\users\shane\anaconda3\lib\site-packages\dplython-0.0.7-py3.6.egg
Copying dplython-0.0.7-py3.6.egg to c:\users\shane\anaconda3\lib\site-packages
dplython 0.0.7 is already the active version in easy-install.pth
Installed c:\users\shane\anaconda3\lib\site-packages\dplython-0.0.7-py3.6.egg
Processing dependencies for dplython==0.0.7
Searching for six==1.10.0
Best match: six 1.10.0
Adding six 1.10.0 to easy-install.pth file
Using c:\users\shane\anaconda3\lib\site-packages
Searching for pandas==0.20.3
Best match: pandas 0.20.3
Adding pandas 0.20.3 to easy-install.pth file
Using c:\users\shane\anaconda3\lib\site-packages
Searching for numpy==1.12.1
Best match: numpy 1.12.1
Adding numpy 1.12.1 to easy-install.pth file
Using c:\users\shane\anaconda3\lib\site-packages
Searching for pytz==2017.2
Best match: pytz 2017.2
Adding pytz 2017.2 to easy-install.pth file
Using c:\users\shane\anaconda3\lib\site-packages
Searching for python-dateutil==2.6.1
Best match: python-dateutil 2.6.1
Adding python-dateutil 2.6.1 to easy-install.pth file
Using c:\users\shane\anaconda3\lib\site-packages
Finished processing dependencies for dplython==0.0.7
E:\git\dplython>

Allow group_by on expressions

Currently, group_by only works on existing columns, not expressions. For example, group_by(X.carat + X.price) seems to only group by X.carat.

Dplyr's group_by allows group_by to be used on expressions.

I think it would make sense if group_by could take named arguments, like group_by(carat_plus_price=X.carat + X.price). This could probably be implemented fairly easily under the hood in group_by by checking for **kwargs and passing them to a mutate() before grouping.

It would be nice if ultimately it could autogenerate a column name, such as "carat_plus_price", if the user does not supply one.
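The mutate-then-group idea, spelled out in plain pandas terms (column names illustrative):

```python
import pandas as pd

df = pd.DataFrame({"carat": [1, 1, 2], "price": [10, 10, 20]})

# materialize the expression as a named column, then group on it
df["carat_plus_price"] = df["carat"] + df["price"]
sizes = df.groupby("carat_plus_price").size()
```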

No ordering of mutate kwargs arguments

As uncovered in #5.

In Dplyr, the user is able to add a column in a mutate statement derived from a column that he or she just wrote. I want to make this feature available in dplython. So:
diamonds_dp = self.diamonds >> mutate(foo=X.x, bar=X.foo.mean())
should be valid. foo should be the first column added, followed by bar.

This is difficult to accomplish though, as Python throws away the order information of **kwargs. Currently, this example code would sometimes work and sometimes not, depending on the dictionary ordering of **kwargs. There's a PEP to address this (https://www.python.org/dev/peps/pep-0468/) but it doesn't look like this feature is currently supported.

In a bad case, this means the user doesn't know what order to expect columns to be in. In a worse case, it will inconsistently cause errors when a user tries to create a column derived from another one.

Some potential solutions, none of which seem great:

  • Restrict mutate statements to one additional column per mutate (yuck)
  • Sort the dictionary in some way, like alphabetical
  • Examine each added Later to see if it derives from a column that isn't yet present in the DataFrame, add these last.
  • Something else?

DelayFunction no longer works with ggplot

The example listed on the README no longer works for ggplot.

from ggplot import ggplot, aes, geom_point, facet_wrap
ggplot = DelayFunction(ggplot)  # Simple installation
(diamonds >> ggplot(aes(x="carat", y="price", color="cut"), data=X._) + 
  geom_point() + facet_wrap("color"))

My guess is this is due to the Later refactor.

Better str for Later object

There's no "str" for a Later object right now. This makes it hard to interactively see what's going on. It also makes it more difficult to debug or know what a particular Later is. If we have this feature, it could enable other nice features! For example, dplyr does a nice job with this. As mentioned in #23, if you do something like df %>% mutate(x + 7) the output will be a dataframe with a column named x + 7.

It would be great to have a str method on a Later that could show you the expression that was written. So:

foo = X.x + X.y.mean()**2
str(foo)
# returns "X.x + X.y.mean()**2"
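A toy recorder showing the mechanics (this is not dplython's Later; it only handles the handful of operations in the example above):

```python
class Expr:
    """Build up a printable expression string as operations are applied."""
    def __init__(self, text):
        self.text = text
    def __getattr__(self, name):
        return Expr("%s.%s" % (self.text, name))
    def __call__(self):
        return Expr("%s()" % self.text)
    def __add__(self, other):
        return Expr("%s + %s" % (self.text, other.text))
    def __pow__(self, n):
        return Expr("%s**%s" % (self.text, n))
    def __str__(self):
        return self.text

X = Expr("X")
foo = X.x + X.y.mean() ** 2
```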

Build broken on README.md update

I updated the README to include a video, and now the build is broken on a seemingly unrelated issue. My first guess is that something changed in pandas to make comparison between Series even more difficult, but I'm not sure yet. I will investigate but will hopefully resolve this soon. If anyone has issues with dplython breaking after the latest update please let me know.

(Quick note: after a bit of an absence due to taking on a very intense, temporary job for a few months, I'm back improving and updating dplython)

Benchmarks

Since dplython deals with large dataset, it's important to make sure performance is as high as possible. To that end, we should set up benchmarks to see where our speed is improving (or getting worse!) over time.

Add transmute functionality into select

transmute and select do very similar things: create a smaller dataframe that is just a few derived columns from the current one.

The difference between transmute and select is a small one. transmute creates new columns; it's basically a mutate and then a select. select only uses existing columns. Why not put all of this functionality inside select?

diamonds >> select(X.carat * 2, X.color, chair=X.table) >> head()
  # Out:
  # X["carat"] * 2  color  chair
  #  8.01              I1     61  
  #  8.01              I1     62  

One argument against this is that dplyr uses - to indicate "drop this". So diamonds %>% select(-carat) drops the carat column. This seems a little strange here, and separate from the SQL syntax, where a user might expect this to give you the negated version of carat. To keep that functionality, we could make a new drop verb which drops columns.

Problem with mutate and delay functions

Hi all,

I've noticed an issue with mutate when you define variables using delay functions and group_by. I think the problem is actually just with mutating not working properly with group_by but I haven't extensively tested. For example:

@DelayFunction
def lead(series, i=1):
    index = series.index
    shifted = series.shift(i)
    shifted.index = index
    return shifted

diamonds >> group_by(X.cut) >> mutate(price_lead = lead(X.price)) >> head(6)

   Unnamed: 0  carat        cut color clarity  depth  table  price     x  \
0           1   0.23      Ideal     E     SI2   61.5   55.0    326  3.95   
1           2   0.21    Premium     E     SI1   59.8   61.0    326  3.89   
2           3   0.23       Good     E     VS1   56.9   65.0    327  4.05   
3           4   0.29    Premium     I     VS2   62.4   58.0    334  4.20   
4           5   0.31       Good     J     SI2   63.3   58.0    335  4.34   
5           6   0.24  Very Good     J    VVS2   62.8   57.0    336  3.94   

      y     z  price_lead  
0  3.98  2.43         NaN  
1  3.84  2.31       326.0  
2  4.07  2.31       326.0  
3  4.23  2.63       327.0  
4  4.35  2.75       334.0  
5  3.96  2.48       335.0 

The lead delay function should operate independently on each group, but instead it is operating on the entire dataframe regardless of group.

I solved this in my own fork of dplython by removing mutate from the handled classes in the DplyFrame class. I assume however that you put it in handled classes for a reason, so I don't consider this a great fix (for example, arrange broke due to this and I had to change it to work again).

Curious to hear your opinion on this.

P.S. There are tons of changes and additions in that personal fork that I should make pull requests for, but a lot has changed including the formatting and so I've been lazy about it...

Can't import functions

import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction)

Yields:
ImportError: cannot import name 'DplyFrame'

If I remove DplyFrame, I get:
ImportError: cannot import name 'X'

Remove dataframe copying on each >> operation

Currently, dplython copies a new DataFrame whenever >> is used. The goal of this is to prevent dplython from inadvertently altering the contents of the original DataFrame when executing operations. See this pandas reference: http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy

It would be great if we could restrict this behavior, or push it to key verbs (such as mutate), as it's very inefficient on large data sets.

Drop support for python 2.6

Running your test suite after python setup.py install on a clean python 2.6 virtualenv fails. The failure occurs when importing pandas, as the latest pandas version (0.18.0) doesn't support python 2.6 anymore. Therefore, I think that it is legitimate to drop support for python 2.6 in this project as well.

Pandas SettingWithCopyWarning

I'm seeing this warning from Pandas:

.............../Users/nathangould/workspace/dplython/dplython/dplython.py:393: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df[key] = val.applyFcns(df)
........................................

Since this is happening inside mutate(), I think df[key] would be a new column, and therefore not a copy. So I don't yet understand what pandas is complaining about, but it could be worth looking into.

Rename mutate to define

We have already differed from dplyr by using sift instead of filter. mutate seems like a misnomer: we're not mutating an existing column, but rather adding a new one. We're changing the dataframe that is output, but we do that with all of the verbs.

To make things a little easier to understand for new users, we could rename mutate to define. We could also keep around mutate for the hardcore dplyr users.

Problem with mutate

[screenshot: error raised by mutate]

I am trying to generate a variable, based on conditions of another variable, but the mutate function is failing.

dplyr::rename?

The dplyr select() function is nice because you can simultaneously select and rename columns. Do you offer this functionality? Alternatively, implementing the function dplyr::rename() would be sufficient.

Or condition in sift()

I tried these two commands and they produce different output:

df >> sift(X.a == 1) 
df >> sift(X.a == 1 or X.b == 1) # this is equivalent to the line above

df >>sift(X.a == 1 | X.b == 1) # produces different results than the lines above and I don't know what the result represents

So is it possible to use the or condition inside sift at the moment?
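Yes, but with | and explicit parentheses around each comparison. The third line differs because | binds tighter than ==, so X.a == 1 | X.b == 1 is parsed as X.a == (1 | X.b) == 1 rather than as two separate tests. (Python's `or` can't be overloaded at all, which is why the second line silently reduces to the first expression.) A plain-pandas demonstration of the working form:

```python
import pandas as pd

a = pd.Series([1, 0, 2])
b = pd.Series([0, 1, 0])

mask = (a == 1) | (b == 1)  # parenthesize each comparison before combining
```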

Alternative possible names for X

Have you considered _? IMHO it feels more evocative (of member variables, for example) and invisible. The biggest downside is that it already has a meaning in the interactive interpreter (the last evaluated expression). If people want to use dplython in an interactive session and use the _ for that purpose, they could use from dplython import _ as X.

Are there any other names you considered? Assuming you want it to be a single character, I guess there are only 53 possibilities. I could see a case for x (lowercase is subtler and easier to type), d (for dataframe), s (for self), or c (for column).
