Comments (17)
To me, a dataframe is a map of 1-dimensional tensors with the column names as keys.
In fact, Nx defines an Nx.Container protocol. My hope is that you can pass a whole dataframe into a defn function, and if you implement the protocol, all "columns" would automatically convert to tensors.
So for example, with your floats-ints dataframe, if we implement the Nx.Container protocol, you could do this:
defn add_floats_and_ints(df) do
  df["floats"] + df["ints"]
end
And it would return a tensor with the results of adding those.
Therefore, if we were to add some functionality, it would be like Df.to_tensors_map(df).
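To make the "map of 1-dimensional tensors" idea concrete, here is a minimal sketch of what such a helper could look like. The module name TensorsMapSketch and the function to_tensors_map/1 are hypothetical and not part of Explorer; the sketch only assumes that every column is numeric and relies on DataFrame.names/1, DataFrame.pull/2 and Series.to_tensor/1.

```elixir
Mix.install([:explorer, :nx])

defmodule TensorsMapSketch do
  alias Explorer.DataFrame, as: DF
  alias Explorer.Series

  # Hypothetical helper (not an Explorer function): builds a map from
  # column name to a 1-d tensor. Assumes every column is numeric.
  def to_tensors_map(df) do
    Map.new(DF.names(df), fn name ->
      {name, df |> DF.pull(name) |> Series.to_tensor()}
    end)
  end
end

df = Explorer.DataFrame.new(floats: [1.0, 2.0, 3.0], ints: [1, 2, 3])
tensors = TensorsMapSketch.to_tensors_map(df)

# Each value is an ordinary Nx tensor, so the usual Nx functions apply.
sum = Nx.add(tensors["floats"], tensors["ints"])
```

Because each column stays a separate tensor, the columns can keep different types, which is the flexibility a map gives over a single matrix.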
from explorer.
This is interesting, I had only thought of DataFrames as tables. It definitely makes sense to return a map, since it gives flexibility on the datatype of the individual 1-d tensors.
I will have a look at the Nx.Container and create a PR for discussion tomorrow.
Todo:
- Df.to_tensors_map(df)
- Nx integration through Nx.Container
from explorer.
I am not sure if we should do the to_tensors_map version. Today you can easily get a column as a series, right? This means you can get a tensor for it. :) But exploration into containers would be cool!
from explorer.
Hmmm... giving this some thought. @josevalim I totally agree. However, re: your point about a map of 1d tensors... while this is usually how I think about it too, sometimes a dataframe has portions which should be treated as two-dimensional tensors. For example, you might have generated dummy variables and it totally makes sense to treat that as a matrix. You may want to pass a dataframe of features to an ML algorithm -- even just linear regression would treat a dataframe as a 2d matrix.
Agreed that this should be achieved with the Nx.Container protocol, just calling it out for consideration and some colour. For example, in R you pass a dataframe to lm and identify the formula with regressand ~ regressor_1 + regressor_2 etc. So you might have a dataframe df with variables height, age and sex, and you might write model = lm(height ~ age + sex, data = df). In this case, R treats age and sex together as a 2d matrix. In Python you can do something very similar with sklearn and pandas.
At the simplest level, this definitely gets used pretty frequently in pandas.
R is a little more similar to what I think we'd do, which is very similar to your suggestion above @NduatiK. If you have a purely numerical dataframe, you can just call as.matrix on it.
from explorer.
The sklearn and pandas example is indeed what I was thinking of.
I had been worried about the datatypes on the final tensor, but it is already possible to isolate the numeric columns with a select and then convert the result to a preferred tensor type using Nx.as_type.
I will look into Nx.Container.
from explorer.
So what do you think about DF.to_tensor(df, ["height", "width"])? This ends up as a 2d tensor.
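As a rough illustration of that proposal, the sketch below stacks the chosen columns side by side into a {rows, columns} tensor. The module ToTensorSketch and its to_tensor/2 are hypothetical stand-ins for the suggested DF.to_tensor, and the sample data is made up; only DataFrame.pull/2, Series.to_tensor/1 and Nx.stack/2 are existing calls.

```elixir
Mix.install([:explorer, :nx])

defmodule ToTensorSketch do
  alias Explorer.DataFrame, as: DF
  alias Explorer.Series

  # Hypothetical helper (not an Explorer function): converts each named
  # column to a 1-d tensor and stacks them along a new second axis, so
  # the result has one row per dataframe row and one column per name.
  def to_tensor(df, names) do
    names
    |> Enum.map(fn name -> df |> DF.pull(name) |> Series.to_tensor() end)
    |> Nx.stack(axis: 1)
  end
end

df =
  Explorer.DataFrame.new(
    height: [1.5, 1.8, 1.7],
    width: [0.4, 0.5, 0.45],
    label: ["a", "b", "c"]
  )

# Only the numeric columns are requested; the string column is left out.
matrix = ToTensorSketch.to_tensor(df, ["height", "width"])
```

The columns of the resulting tensor follow the order of the list that was passed in, not the order in the dataframe.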
from explorer.
Not sure if you are asking me or @cigrainger, but that looks good to me. I am currently doing the same thing but just using a DF.select first.
Would this allow column names to be shared or is it a normal tensor?
from explorer.
I was asking both/everyone. :)
> Would this allow column names to be shared or is it a normal tensor?
What do you mean?
from explorer.
One thing is that tensor names are atoms and columns are strings. But perhaps we could introduce a convention: if the column name is a string, then it is not named in the tensor. If an atom, it preserves its name in the tensor?
from explorer.
> What do you mean?
I meant: given that dataframes do not have an obvious ordering of columns, would the returned tensor contain information about which series in the dataframe each column was created from? A DF with age and salary would lose its data semantics as soon as it turns into a tensor. One would have to get the DataFrame.names and keep that around.
> One thing is that tensor names are atoms and columns are strings
Whoops, just reread the docs. Turns out tensor names actually identify axes, not columns, so DataFrame column names do not map onto names on Nx.Tensor. If we were to do this, it would be using something else.
I am not sure it would make sense to add a field to the Nx.Tensor type, since the column names would have to be tracked and could easily be destroyed (e.g. when you run a dot product). Perhaps naming is something that should be tracked by the ML algorithms that consume DataFrames.
Of course, like you said, this issue goes away if we can just treat DataFrames as tensors.
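A small sketch of the point about names: in Nx, the names option labels whole axes, so there is no slot for per-column labels like age and salary. The axis names :rows and :columns and the sample values are made up for illustration.

```elixir
Mix.install([:nx])

# Names are given per axis: one for the row axis, one for the column axis.
t = Nx.tensor([[30, 50_000], [41, 62_000]], names: [:rows, :columns])

axis_names = Nx.names(t)

# An axis name can be used wherever an axis index is expected.
totals = Nx.sum(t, axes: [:rows])

# The individual column labels have to live outside the tensor.
column_labels = ["age", "salary"]
```

Any mapping from a column position to a label such as age therefore has to be carried alongside the tensor by the caller.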
from explorer.
> Whoops, just reread the docs. Turns out tensor names actually identify axes not columns, DataFrame column names do not map onto names on Nx.Tensor. If we were to do this, it would be using something else.
Ah, good call. The information is then lost, I am afraid.
> Of course, like you said, this issue goes away if we can just treat DataFrames as tensors.
Right, my understanding right now is that those are two separate problems. Nx.Container is about accessing each column individually. What you told me is that sometimes you may also want to get certain (or all?) columns from a DF as a matrix, and that would be a different operation (either to_matrix or to_tensor).
from explorer.
Yes, my original issue was about a convenience function on the DataFrame itself.
By default it would convert all columns, but a list of names can be passed in.
When the list of names is provided, the tensor is ordered by that list.
When the names are not provided, we might need to return the list of columns as well. DataFrame.names depends on the backend, and I am not sure all backends guarantee ordering.
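A minimal sketch of that behaviour, assuming all selected columns are numeric. The module OrderedTensorSketch and to_tensor_with_names/2 are hypothetical names, not Explorer functions. Returning the names next to the tensor is one way to keep the column order explicit when no list is passed.

```elixir
Mix.install([:explorer, :nx])

defmodule OrderedTensorSketch do
  alias Explorer.DataFrame, as: DF
  alias Explorer.Series

  # Hypothetical helper (not an Explorer function). With no names given,
  # it falls back to whatever order DataFrame.names/1 reports and returns
  # that order together with the tensor.
  def to_tensor_with_names(df, names \\ nil) do
    names = names || DF.names(df)

    tensor =
      names
      |> Enum.map(fn name -> df |> DF.pull(name) |> Series.to_tensor() end)
      |> Nx.stack(axis: 1)

    {tensor, names}
  end
end

df = Explorer.DataFrame.new(age: [30, 41], salary: [50_000, 62_000])

# Explicit list: the tensor columns follow the list, not the dataframe.
{tensor, names} = OrderedTensorSketch.to_tensor_with_names(df, ["salary", "age"])

# No list: all columns, plus the order that was actually used.
{all_columns, all_names} = OrderedTensorSketch.to_tensor_with_names(df)
```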
from explorer.
Hi @josevalim, what has prevented an Nx.Container implementation so far? I had assumed that the solution would be as simple as adapting the Map implementation.
Are there any special cases to be aware of given the different backends and the possible need to transfer values to the device?
from explorer.
I think there are no blockers, it is just that nobody implemented the solution so far. I think the implementation is similar to maps, yeah. I would even make it so the Dataframe becomes a map.
from explorer.
Got it, thanks
from explorer.
@josevalim should this be closed now that we have TensorFrame?
from explorer.
Thanks @josevalim and @cigrainger, I think your zero-copy work is much better than what I would have been able to do.
❤️💚💙
from explorer.