Comments (19)
`mutate` would be different from `transform`, because `mutate` would work on lazy series.
I would start with `filter_with` because I am not sure how we can do `filter` without an anonymous function. In DFs we refer to them by column name, but that's not an option here.
Please go ahead!
from explorer.
Well that is a SUPER elegant solution. What a great idea. I'm in full support.
Edit: to clarify I mean the idea of converting to a df, applying functions, and converting back to a series.
from explorer.
That would work but, at the same time, everyone is using pipelines to transform series today anyway, so requesting a pipeline inside the anonymous function is not that bad.
from explorer.
Yeah!
from explorer.
The issue is that doing it with a function is horribly expensive and should be generally avoided.
I agree! I was mostly focused on making the DataFrame and Series APIs similar.
Polars has a few `filter` functions which take either an expression or a boolean series (mask). They're a bit inconsistent with how they do that, though:
| Entity | Lang | Function | Expr | Mask |
|---|---|---|---|---|
| DataFrame | Python | `DataFrame.filter` | ✅ | ✅ |
| DataFrame | Rust | `DataFrame::filter` | ❌ | ✅ |
| LazyFrame | Python | `LazyFrame.filter` | ✅ | ✅ |
| LazyFrame | Rust | `LazyFrame::filter` | ✅ | ❌ |
| Series | Python | `Series.filter` | ❌ | ✅ |
| Series | Rust | `Series::filter` | ❌ | ✅ |
Explorer OTOH introduced the concept of a `mask` function. I think this was a good call. It helps hint what the input should be:
- `filter` functions accept expressions
- `mask` functions accept masks
So if Explorer is to have a `Series.filter`, I think the least surprising choice would be to have it accept an expression as well. Unfortunately for the goal of a `mask`/`filter` distinction, I couldn't find a way in Polars to filter a Series on an expression (other than I guess wrapping it in a DataFrame then converting back?).
Here are the options I see:
| # | Option | Pro | Con |
|---|---|---|---|
| 1 | Keep things as they are | Consistent meaning for `mask` | Lack of `filter` surprises newcomers |
| 2 | Rename `Series.mask` to `Series.filter` | Newcomers find `filter` quicker | `filter` is inconsistent across `Series` and `DataFrame` |
| 3 | Keep `Series.mask` and hack together a `Series.filter` that accepts an expression | Consistent and easy for newcomers to find | Hacks are bad |
My first choice is 3. Assuming the hack isn't terrible, it helps newcomers find the functionality and keeps the meanings consistent. We can also document that `Series.mask` should be preferred.
My second choice is 1. I think I favor consistency over newcomer surprise. We could possibly add an example to the docs to help newcomers find `Series.mask` if the question comes up often enough.
My last choice is 2. I think `Series.filter` working like `DataFrame.mask` while also having a `DataFrame.filter` would be confusing long term.
from explorer.
Honestly, an implementation of `Series.filter(fn x -> x end)` (or `filter_with`) where we put the series in a dataframe, filter it, and then read the series out, sounds very elegant to me and it would also be optimized quite cleanly and most likely the most efficient approach too. Polars may even have better APIs.
This would also allow us to introduce `map` (or `mutate`?) and `arrange` for individual series. `filter`/`filter_with` docs could point out to `mask` for when the lazy expression version is not enough. Thoughts?
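A minimal sketch of the wrap, filter, and unwrap idea described above. The module `SeriesFilterSketch` and its `filter_with/2` helper are hypothetical names made up for illustration, not Explorer's own implementation; only `DataFrame.new/1`, `DataFrame.filter_with/2`, `DataFrame.pull/2`, and the `Series` functions are existing Explorer calls. The dependency version is an assumption.

```elixir
# Assumes an Explorer release from around the time of this thread.
Mix.install([{:explorer, "~> 0.7.0"}])

defmodule SeriesFilterSketch do
  alias Explorer.DataFrame, as: DF
  alias Explorer.Series

  # Hypothetical helper: wraps the series in a one-column dataframe,
  # filters it with a lazy expression, then pulls the column back out.
  def filter_with(%Series{} = series, fun) when is_function(fun, 1) do
    DF.new(s: series)
    |> DF.filter_with(fn ldf -> fun.(ldf["s"]) end)
    |> DF.pull("s")
  end
end

s = Explorer.Series.from_list([1, 2, 3, 4])
filtered = SeriesFilterSketch.filter_with(s, &Explorer.Series.greater(&1, 2))

# Keeps only the values greater than 2: [3, 4]
IO.inspect(Explorer.Series.to_list(filtered))
```

The callback receives a lazy series rather than concrete values, which is why the predicate is written with `Series.greater/2` instead of a plain `>` on each element.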
from explorer.
Yeah I'd love to use `Series.{filter,filter_with}` if I could!
> This would also allow us to introduce `map` (or `mutate`?) and `arrange` for individual series. `filter`/`filter_with` docs could point out to `mask` for when the lazy expression version is not enough. Thoughts?
For `map`/`mutate`, I know there's already `Series.transform` so we'll want to keep that in mind. But yes I agree. Assuming the overhead with wrapping-then-unwrapping is acceptable, we should be able to bring a lot of the DataFrame specific functionality over to Series.
I'm happy to try for a `filter`/`filter_with` PR to see how the idea shakes out.
from explorer.
> In DFs we refer to them by column name but that's not an option here.

Oh of course. I'll think about it, but nothing clean comes immediately to mind.
from explorer.
> This would also allow us to introduce `map` (or `mutate`?) and `arrange` for individual series. `filter`/`filter_with` docs could point out to `mask` for when the lazy expression version is not enough. Thoughts?

What would `arrange` be for individual `Series`?
from explorer.
@cigrainger that was what I thought, but I guess it doesn't make much sense since we would only arrange ourselves?
from explorer.
Exactly, I think it would just be a sort. Though maybe you could provide a sorter? E.g. on strings you could sort by `ends_with` or similar? (speaking of which I need to add some more string ops). But that could just be `sort/2`.
Edit: So in this case, in the background you're basically doing: `DataFrame.new(s: s) |> DataFrame.mutate(new_s: some_func(s)) |> DataFrame.arrange(new_s) |> DataFrame.pull(s)`.
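The pipeline in the edit above could be sketched roughly as follows. `SeriesSortSketch.sort_with/2` is a hypothetical helper invented here, and the function-based `mutate_with`/`arrange_with` variants stand in for the macro forms in the comment. The version pin is an assumption, chosen so that `arrange_with` is still available under that name.

```elixir
# Assumes an Explorer release that still names this function arrange_with.
Mix.install([{:explorer, "~> 0.7.0"}])

defmodule SeriesSortSketch do
  alias Explorer.DataFrame, as: DF
  alias Explorer.Series

  # Hypothetical helper: derives a sort key from the series inside a
  # temporary dataframe, sorts by that key, and returns the original column.
  def sort_with(%Series{} = series, key_fun) when is_function(key_fun, 1) do
    DF.new(s: series)
    |> DF.mutate_with(fn ldf -> [key: key_fun.(ldf["s"])] end)
    |> DF.arrange_with(fn ldf -> [asc: ldf["key"]] end)
    |> DF.pull("s")
  end
end

s = Explorer.Series.from_list([-3, 1, -2])

# Sort by squared value, so the keys are 9, 1, 4.
sorted = SeriesSortSketch.sort_with(s, &Explorer.Series.multiply(&1, &1))

# Ascending by key: [1, -2, -3]
IO.inspect(Explorer.Series.to_list(sorted))
```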
from explorer.
Correct. So I guess we could have some use cases?
from explorer.
Definitely.
from explorer.
We will need to decide on the naming though. Today we have `Series.sort`. Should it be `sort_with` or `arrange_with`?
from explorer.
Good question. I'm struggling to disentangle the macro aspect from the motivation to introduce `_with` variants. I don't have a strong opinion about sort vs arrange here (largely because in R they're different anyway -- sorting a vector in R is done with base R's `sort`). Do we need a `_with` or could we just make it multi-arity?
from explorer.
That's a good discussion. We don't need `_with` in series. It depends on how consistent we want to be with DF and within ourselves. The question is: can we overload `sort`? Would the two options (direction and nils) apply to our function-based version?
from explorer.
I think they do apply. We have a direction selector in `DataFrame.arrange`. I think we could also use `nils` in `DataFrame.arrange`.
from explorer.
Thought: for the macro versions, what if we did like Ecto and had them provide the column as an argument? E.g. we make a `filter/3`:
```elixir
dates =
  [~D[2023-11-01], ~D[2023-11-02], ~D[2023-11-03]]
  |> S.from_list()
  |> S.filter(date, date > ^~D[2023-11-01])
```
from explorer.
I think #728 closed this issue. We discussed several other additions to `Series` in that PR. Shall I close this issue and open another to track those additions?
from explorer.