Comments (22)
FYI, the DF.describe/2 implementation in Polars is now written in Python: https://github.com/pola-rs/polars/blob/67cb6923b8546fc96bda2d28fce293bfe47561c6/py-polars/polars/dataframe/frame.py#L4363
We can use that as a reference :)
from explorer.
👍 for adding null type. And I think it is safest to skip mean, std, min, and max for strings and other dtypes.
from explorer.
@lkarthee it is fine to mirror the new outer join with polars. The reason we are getting an exception is that the current join assumes all columns from both arguments will be in the output, but it seems polars changed it to drop duplicate columns. You will have to mirror this logic in Explorer.DataFrame.join and fix the tests.
from explorer.
I was thinking of keeping the work outside the main branch - branching from, or working directly in my branch.
That's what I was thinking too. I just wanted to make sure :)
But we could implement "our version" of DF.describe/2 branching from main and merge it separately. WDYT?
I think it's fine either way. If I tackle any of the pieces I think I'll branch off yours. But others should feel free to do that one off main.
from explorer.
Yes!
from explorer.
@lkarthee please go ahead, I don't think any of them will reply soon due to timezone :)
from explorer.
@lkarthee Yeah go for it! I actually tried yesterday morning, but I spun my wheels trying to make it "elegant". More than happy to let you take over :)
EDIT: Based on what I tried yesterday, advice would be to just calculate what you need and move on. I kept trying to be clever with loops but I couldn't make it work.
from explorer.
This is what I tried that didn't work (I hadn't gotten to percentiles yet).
def describe(%DataFrame{} = df, _percentiles) do
  require Explorer.DataFrame, as: DF
  numeric_dtypes = Explorer.Shared.numeric_types()
  ordered_dtypes =
    List.flatten([
      # [:date, :string],
      [:date],
      numeric_dtypes,
      Explorer.Shared.datetime_types(),
      Explorer.Shared.duration_types()
    ])
  metrics = [
    count: %{dtypes: nil, fun: &Explorer.Series.n_distinct/1},
    nil_count: %{dtypes: nil, fun: &Explorer.Series.nil_count/1},
    mean: %{dtypes: numeric_dtypes, fun: &Explorer.Series.mean/1},
    std: %{dtypes: numeric_dtypes, fun: &Explorer.Series.standard_deviation/1},
    min: %{dtypes: ordered_dtypes, fun: &Explorer.Series.min/1},
    max: %{dtypes: ordered_dtypes, fun: &Explorer.Series.max/1}
  ]
  metric_dfs =
    for {_metric, %{dtypes: dtypes, fun: fun}} <- metrics do
      if dtypes == nil do
        DF.summarise(df, for(s <- across(), do: {s.name, ^fun.(s)}))
      else
        metric_df =
          DF.summarise(df, for(s <- across(), s.dtype in ^dtypes, do: {s.name, ^fun.(s)}))
        # Manually add `nil` to all non-computed columns.
        metric_df =
          Enum.reduce(df.names, metric_df, fn col, acc ->
            if col not in acc.names, do: DF.put(acc, col, [nil]), else: acc
          end)
        metric_df[df.names]
      end
    end
  metric_df =
    metric_dfs
    |> DF.concat_rows()
    |> DF.put(:describe, metrics |> Keyword.keys() |> Enum.map(&Atom.to_string/1))
  metric_df[["describe"] ++ df.names]
end
Which gives you (notice the string columns aren't handled right):
# test/explorer/data_frame_test.exs:3321
df = DF.new(a: ["d", nil, "f"], b: [1, 2, 3], c: ["a", "b", "c"])
df1 = DF.describe(df)
# +-------------------------------------------+
# | Explorer DataFrame: [rows: 6, columns: 4] |
# +-------------+---------+---------+---------+
# | describe | a | b | c |
# | <string> | <f64> | <f64> | <f64> |
# +=============+=========+=========+=========+
# | count | 3.0 | 3.0 | 3.0 |
# +-------------+---------+---------+---------+
# | nil_count | 1.0 | 0.0 | 0.0 |
# +-------------+---------+---------+---------+
# | mean | | 2.0 | |
# +-------------+---------+---------+---------+
# | std | | 1.0 | |
# +-------------+---------+---------+---------+
# | min | | 1.0 | |
# +-------------+---------+---------+---------+
# | max | | 3.0 | |
# +-------------+---------+---------+---------+
The issue was that our min/max functions don't work on dtype: :string (perhaps they should?). So I think we need to handle that case by hand.
One approach I thought of: use the Series.sort function to compute all order statistics (min, max, 25%, etc.). You'll need to compute nil_count first, since nils need to be excluded. But after that, arithmetic should give you the indices of each order statistic. Something like:
percentile_25_index = floor(0.25 * (length(s) - nil_count(s)))
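Spelled out on a plain Elixir list, the index arithmetic could look roughly like the sketch below. The module and function names are hypothetical and this does not use the Explorer API; it also assumes one particular convention (take the element at floor(p * n), clamped to the last index), which is not necessarily how Polars or Explorer define quantiles.

```elixir
defmodule OrderStatsSketch do
  # Hypothetical helper, not part of Explorer: operates on a plain list.
  # Returns the order statistic at fraction `p` of the non-nil values.
  def order_statistic(values, p) when p >= 0 and p <= 1 do
    # Drop nils first, then sort what remains.
    sorted = values |> Enum.reject(&is_nil/1) |> Enum.sort()

    case length(sorted) do
      0 ->
        nil

      n ->
        # floor(p * n) as in the formula above, clamped so p = 1.0 hits the last element.
        index = min(floor(p * n), n - 1)
        Enum.at(sorted, index)
    end
  end
end
```

With this convention, p = 0.0 gives the minimum and p = 1.0 gives the maximum, so strings would get min and max from the same code path as the percentiles.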
The sort-based approach was apparently attempted by someone on the Polars side, but they said they didn't get the performance improvements they expected:
Is there any way we can achieve lit(None) with existing api ?
I got stuck on that too! You can see my workaround in my attempt. I don't know if there's a way we can do it easily on our side.
from explorer.
One way to go is exclude non_stat columns from describe and revisit this after null type pr ? Only two metrics will be relevant for non_stat columns - count and nil_count.
Sounds good to me!
from explorer.
The work here is complete, so I'm closing.
Thank you all for the contributions! 💜
from explorer.
Two test cases relating to DataFrame.join/3 are failing. Is this due to the changes from #784?
1) test join/3 with a custom 'on' but with repeated column on left side - outer join (Explorer.DataFrame.LazyTest)
test/explorer/data_frame/lazy_test.exs:1224
** (RuntimeError) Polars Error: not found: d
code: assert DF.to_columns(df, atom_keys: true) == %{
stacktrace:
(explorer 0.8.0-dev) lib/explorer/polars_backend/shared.ex:35: Explorer.PolarsBackend.Shared.apply_dataframe/3
(explorer 0.8.0-dev) lib/explorer/data_frame.ex:1873: anonymous fn/3 in Explorer.DataFrame.to_columns/2
(elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
(elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
(explorer 0.8.0-dev) lib/explorer/data_frame.ex:1872: Explorer.DataFrame.to_columns/2
test/explorer/data_frame/lazy_test.exs:1233: (test)
7) test join/3 with a custom 'on' but with repeated column on left side (Explorer.DataFrameTest)
test/explorer/data_frame_test.exs:2070
** (RuntimeError) Polars Error: not found: d
code: assert DF.to_columns(df2, atom_keys: true) == %{
stacktrace:
(explorer 0.8.0-dev) lib/explorer/polars_backend/shared.ex:35: Explorer.PolarsBackend.Shared.apply_dataframe/3
(explorer 0.8.0-dev) lib/explorer/data_frame.ex:1873: anonymous fn/3 in Explorer.DataFrame.to_columns/2
(elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
(elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
(explorer 0.8.0-dev) lib/explorer/data_frame.ex:1872: Explorer.DataFrame.to_columns/2
test/explorer/data_frame_test.exs:2098: (test)
from explorer.
@philss Is your sense that we can fix these issues on main? Or should we branch off ps-bump-polars-to-v0.36? (I probably won't have time to look into specifics until tonight.)
from explorer.
@billylanchantin I was thinking of keeping the work outside the main branch - branching from, or working directly in my branch. But we could implement "our version" of DF.describe/2 branching from main and merge it separately. WDYT?
from explorer.
Logged a bug for outer_coalesce - pola-rs/polars#13450.
from explorer.
@josevalim One question I have about outer join: polars now returns new columns, should we forward them to Explorer?
The behavior has been changed to include the original join keys. Name clashes are solved by appending a suffix (_right by default) to the right join key name.
For example, L1_right in the case below?
>>> df1.join(df2, on="L1", how="outer")
shape: (4, 4)
┌──────┬──────┬──────────┬──────┐
│ L1 ┆ L2 ┆ L1_right ┆ R2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 │
╞══════╪══════╪══════════╪══════╡
│ a ┆ 1 ┆ a ┆ 7 │
│ b ┆ 2 ┆ null ┆ null │
│ c ┆ 3 ┆ c ┆ 8 │
│ null ┆ null ┆ d ┆ 9 │
└──────┴──────┴──────────┴──────┘
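To make the naming rule concrete, here is a minimal sketch that only models the output column names, not the join itself. The module and function are hypothetical (this is not Explorer or Polars code), and it assumes any right-hand name that clashes with a left-hand name gets the suffix.

```elixir
defmodule JoinNamesSketch do
  # Hypothetical helper: predicts output column names for an outer join
  # that keeps both join keys, as in the Polars output above.
  def outer_join_names(left, right, suffix \\ "_right") do
    renamed_right =
      Enum.map(right, fn name ->
        # Only names already present on the left side get the suffix.
        if name in left, do: name <> suffix, else: name
      end)

    left ++ renamed_right
  end
end

JoinNamesSketch.outer_join_names(["L1", "L2"], ["L1", "R2"])
# → ["L1", "L2", "L1_right", "R2"]
```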
from explorer.
I have fixed outer join in the latest PR.
Can I fix the describe function? @philss or @billylanchantin, are you working on it?
from explorer.
Thank you @billylanchantin .
I have a draft rust version (have to test more). I am exploring whether it can be implemented with DF.summarise() in elixir. That took me down the rabbit hole of trying to achieve lit(None) (a Null Series) in explorer from elixir. NullType is missing from dtypes - I logged a bug, #783, recently relating to this.
Is there any way we can achieve lit(None) with the existing api?
from explorer.
I have to expose this on Series.nil_() - figuring out the cogs in the wheel. It's not working yet.
#[rustler::nif]
pub fn expr_nil_() -> ExExpr {
    ExExpr::new(Expr::Literal(LiteralValue::Null))
}
Below works for df with numeric types - pivot is pending. Exprs work and currently data is in columns.
def describe(df, opts \\ []) do
  opts = Keyword.validate!(opts, percentiles: nil)
  if Enum.empty?(df.names) do
    raise ArgumentError, message: "cannot describe a DataFrame without any columns"
  end
  percentiles = process_percentiles(opts[:percentiles])
  numeric_dtypes = Shared.numeric_types()
  datetime_types = Shared.datetime_types()
  duration_types = Shared.duration_types()
  stat_cols = for {name, type} <- df.dtypes, type in numeric_dtypes, do: name
  min_max_cols =
    for {name, type} <- df.dtypes,
        type in numeric_dtypes or type in datetime_types or type in duration_types,
        do: name
  metrics = ["count", "null_count", "mean", "std", "min"]
  p_metrics = for p <- percentiles, do: "#{p * 100}%"
  # "max" goes last so the labels follow the order of the expressions below.
  metrics = metrics ++ p_metrics ++ ["max"]
  df_metrics =
    summarise_with(df, fn x ->
      counts_exprs = Enum.map(df.names, &{"count:#{&1}", Series.count(x[&1])})
      nil_counts_exprs = Enum.map(df.names, &{"nil_count:#{&1}", Series.nil_count(x[&1])})
      percentile_exprs =
        for p <- percentiles, c <- df.names do
          name = "#{p}:#{c}"
          if c in stat_cols do
            {name, Series.quantile(x[c], p)}
          else
            # Series.nil_() is the function I wrote in rust and exposed in
            # expression.ex; I have to expose it on Series, I guess.
            {name, Series.nil_()}
          end
        end
      # TODO: handle Series.nil_() for below
      mean_exprs = for c <- stat_cols, do: {"mean:#{c}", Series.mean(x[c])}
      std_exprs = for c <- stat_cols, do: {"std:#{c}", Series.standard_deviation(x[c])}
      min_exprs = for c <- min_max_cols, do: {"min:#{c}", Series.min(x[c])}
      max_exprs = for c <- min_max_cols, do: {"max:#{c}", Series.max(x[c])}
      counts_exprs ++
        nil_counts_exprs ++
        mean_exprs ++ std_exprs ++ min_exprs ++ percentile_exprs ++ max_exprs
    end)
  # Reshape wide result
  row = head(df_metrics)
  # TODO - pivot columns to rows
end
def process_percentiles(nil), do: [0.25, 0.50, 0.75]
def process_percentiles(percentiles) do
  Enum.each(percentiles, fn p ->
    if p < 0 or p > 1 do
      raise ArgumentError, message: "percentiles must all be in the range [0, 1]"
    end
  end)
  Enum.sort(percentiles)
end
df = Explorer.DataFrame.new(a: [5,6,7], b: [1, 2, 3])
df2 = DF.describe(df)
#Explorer.DataFrame<
Polars[1 x 18]
count:a s64 [3]
count:b s64 [3]
nil_count:a s64 [0]
nil_count:b s64 [0]
mean:a f64 [6.0]
mean:b f64 [2.0]
std:a f64 [1.0]
std:b f64 [1.0]
min:a s64 [5]
min:b s64 [1]
0.25:a f64 [6.0]
0.25:b f64 [2.0]
0.5:a f64 [6.0]
0.5:b f64 [2.0]
0.75:a f64 [7.0]
0.75:b f64 [3.0]
max:a s64 [7]
max:b s64 [3]
>
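The pending pivot step could be sketched as follows, using a plain map in place of the single-row data frame above. Everything here is hypothetical illustration rather than Explorer code: the module name, the `pivot/3` function, and the map representation are all assumptions. The keys are rebuilt from the known metric and column names instead of being split on ":", so a column name that itself contains ":" would not break the lookup.

```elixir
defmodule DescribePivotSketch do
  # Hypothetical helper: `wide` is a plain map such as %{"mean:a" => 6.0},
  # standing in for the one-row wide result. Returns one map per metric.
  def pivot(wide, metrics, columns) do
    for metric <- metrics do
      # Missing "metric:column" entries become nil, like the skipped statistics.
      values = Map.new(columns, fn col -> {col, Map.get(wide, "#{metric}:#{col}")} end)
      Map.put(values, "describe", metric)
    end
  end
end

wide = %{
  "count:a" => 3,
  "count:b" => 3,
  "mean:a" => 6.0,
  "mean:b" => 2.0,
  "min:a" => 5,
  "min:b" => 1
}

rows = DescribePivotSketch.pivot(wide, ["count", "mean", "min", "std"], ["a", "b"])
```

Each element of `rows` is a map with a "describe" label plus one entry per column, which is the long shape the final describe output uses.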
from explorer.
👍 to skipping order statistics on strings. While technically possible, I don't think people usually care.
from explorer.
@billylanchantin Thank you for the pointers, I have completed the percentiles part. I have tried to mirror python code very closely. Hopefully I will figure out more about adding Series.nil_() tomorrow.
@josevalim One way to go is exclude non_stat columns from describe and revisit this after null type pr ? Only two metrics will be relevant for non_stat columns - count and nil_count.
from explorer.
Went ahead with the rust func - only the percentiles and pivot logic is in elixir. It fails if there is a non-numeric column in the data frame. Please review the PR; I will fix the rest in the next PR.
from explorer.
Implemented describe for stat_cols in elixir and tests are passing
from explorer.