Comments (14)
I think we should be stricter, unless we have a valid use case. Returning nil
or wrapping are both undesired IMO.
from explorer.
This issue is a blocker for the property test PR. Should I fix it there or do it separately?
from explorer.
Lists in polars are tricky:
- If it is empty it does not matter how many levels of nested list. It takes the dtype of non empty list element
For [[] , [[]] , [[[[]]]], [[2]] ] => dtype will be list[list[s64]]
- if all elements are empty in list it results in a different dtype - dtype of nested list having highest levels.
[[[[]]]] => dtype will be list[list[list[list[null]]]]
What is the rationale behind this, any clue ?
from explorer.
@lkarthee Interesting I didn't realize. Is this the behavior of Rust Polars or Python Polars?
from explorer.
I have tested it in py polars - from your rust example it looks like rust is no different from py.
from explorer.
I ask because I'm not sure you could make it in Rust because the compiler might prevent you.
Also, what's resulting series from [[] , [[]] , [[[[]]]], [[2]]]
? I'd assume this:
[[] , [[]] , [[]], [[2]]]
^^^^
Inner nesting removed
Otherwise the dtypes wouldn't match, right?
I'll try and test these in Rust later today.
from explorer.
Yeah true.
s = pl.Series([[[[[[[]]]]]], [], [[]]])
shape: (3,)
Series: '' [list[list[list[list[list[list[null]]]]]]]
[
[[[[[[]]]]]]
[]
[[]]
]
s = pl.Series([[[[[[[]]]]]], [], [[]], [[]], [[0], [], []]])
shape: (5,)
Series: '' [list[list[i64]]]
[
[null]
[]
[[]]
[[]]
[[0], [], []]
]
Interesting to note that first element reduced to [null] not [[]].
from explorer.
Interesting to note that first element reduced to [null] not [[]].
Hm yeah I would've expected [[]]
or [[null]]
, not that.
from explorer.
Hm yeah I would've expected [[]] or [[null]], not that.
That can potentially be a polars bug?
from explorer.
Yeah it might be a bug. Although it's technically legal since every element can potentially be null
. Who knows, maybe it's actually the correct way to do it for some non-obvious reason 🤷
(But I think bug is most likely)
from explorer.
s = pl.Series([[], [[[3]]], [[2], [2]]])
shape: (3,)
Series: '' [list[list[list[i64]]]]
[
[]
[[[3]]]
[[[2]], [[2]]]
]
This makes me believe there is a bug - a list 2 level nesting [[2]] is converted to 3 level nesting [[[2]]]. Better to log an issue with polars before attempting a fix ? It only considers dtype of first non empty element.
s = pl.Series([[], [3], [[2], [2]]])
shape: (3,)
Series: '' [list[i64]]
[
[]
[3]
[null, null]
]
from explorer.
I started to make an issue, but now I think Polars is more correct than we initially thought. Consider this example:
pl.Series([["a"], [1]])
# shape: (2,)
# Series: '' [list[str]]
# [
# ["a"]
# ["1"]
# ]
pl.Series([[1], ["a"]])
# shape: (2,)
# Series: '' [list[i64]]
# [
# [1]
# [null]
# ]
The dtypes of these series are either list[str]
or list[i64]
. Polars needs a tie-breaker, so letting the first element decide seems like a sensible choice. I think that's what's happening with some of our examples.
However, I still can't account for some other stuff we're seeing. These cases still seem wrong even if we use the first-non-empty-element-wins rule:
# Wrapping elements in additional layers of nesting to make the dtypes compatible
pl.Series([[[2]], [1, 1]])
# shape: (2,)
# Series: '' [list[list[i64]]]
# [
# [[2]]
# [[1], [1]]
# ]
# An empty list as the errant element pushes the null up one layer of nesting
pl.Series([ [[2, 2]], [["b"]] ])
# shape: (2,)
# Series: '' [list[list[i64]]]
# [
# [[2, 2]]
# [[null]]
# ]
pl.Series([ [[2, 2]], [[[]]] ])
# shape: (2,)
# Series: '' [list[list[i64]]]
# [
# [[2, 2]]
# [null]
# ]
Then there's this:
pl.Series([ [[2, 2]], [[2, 2]] ])
# shape: (2,)
# Series: '' [list[list[i64]]]
# [
# [[2, 2]]
# [[2, 2]]
# ]
pl.Series([ [[2, 2]], [[2, []]] ])
# shape: (2,)
# Series: '' [o][object]
# [
# [[2, 2]]
# [[2, []]]
# ]
Not sure what to make of that. Does anyone know if the intended behavior is documented anywhere? I didn't see it in the Python docs, but I might've just missed it.
from explorer.
The dtypes of these series are either list[str] or list[i64]. Polars needs a tie-breaker, so letting the first element decide seems like a sensible choice. I think that's what's happening with some of our examples.
You are right, that makes sense. This means we should raise on these cases, including in several of the examples below:
pl.Series([[[2]], [1, 1]])
Should raise.
pl.Series([ [[2, 2]], [["b"]] ])
Should raise.
pl.Series([ [[2, 2]], [[[]]] ])
Should raise. You can think that [[[]]]
means it is at least a three element deep list, which violates the first value.
pl.Series([ [[2, 2]], [[2, []]] ])
Should raise.
The example you originally reported is still wrong and has to be fixed though. [[], [[]], [[1]]]
means that the 1+ level list, then 2+ level list, and officially confirmed with a {:list, {:list, :s64}}
list. I can try to work on a PR, since I wrote the current version of the code.
from explorer.
@josevalim I think Polars is making non-matching elements null
instead of raising. So for:
pl.Series([[[2]], [1, 1]])
Should raise.
I think it should yield [[[2]], [null, null]]
instead since each 1
it encounters ought to be a list. I have no idea why they decided to wrap each 1
to yield [[[2]], [[1], [1]]]
.
That said, we can also chose to differ from how Python Polars is resolving these situations by raising.
from explorer.
Related Issues (20)
- [Feature request] Add support for Decimal type HOT 12
- Should we always raise when a column is missing? HOT 3
- Split string column into multiple columns (feature request / use case) HOT 4
- Seeing `:nif_not_loaded` error for `Series.split/2` when mutating a dataframe HOT 1
- [Feature request] Add support for read_database in Polars backend. HOT 1
- Using `sort_by` with a grouped data frame doesn't respect `nils:` option HOT 1
- `{:datetime, :second}` dtype support HOT 2
- Add :streaming option to DataFrame.to_csv/3 HOT 1
- Exporting to CSV with a duration column returns an error
- Regression in `DataFrame.concat_rows/2` in v0.8.2 HOT 1
- Filter throwing undefined variable error HOT 1
- Error using is_finite and is_infinite within mutate HOT 1
- Explorer NIF broken on FreeBSD HOT 12
- Support Elixir built in Duration struct HOT 1
- Bug: Rounding Error in Tests HOT 1
- exposing the `fold` expressions from Polars HOT 7
- :nif_panicked "Chunk require all its arrays to have an equal number of rows" HOT 1
- Sorting an empty DataFrame results in a runtime Polars error HOT 1
- Performance of `DataFrame.new/2` on dataframes containing list columns HOT 7
- `Series.filter` should work inside `DataFrame.summarise` HOT 5
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from explorer.