Comments (17)

josevalim commented on September 26, 2024

To me, a dataframe is a map of 1-dimensional tensors with the column names as keys.

In fact, Nx defines an Nx.Container protocol. My hope is that you can fully pass a dataframe into a defn function, and if you implement the protocol, all "columns" would automatically convert to tensors.

So for example, your floats-ints dataframe, if we implement the Nx.Container protocol, you could do this:

defn add_floats_and_ints(df) do
  df["floats"] + df["ints"]
end

And it would return a tensor with the results of adding those.
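
The same idea can be tried today with a plain map standing in for the dataframe, since Nx already implements Nx.Container for maps. This is only a minimal sketch: it assumes the nx package can be fetched with Mix.install and that key access on a map behaves inside defn the way the snippet above uses it. The module name ColumnMath and the sample values are made up for illustration.

```elixir
Mix.install([:nx])

defmodule ColumnMath do
  import Nx.Defn

  # `cols` is a plain map of 1-D tensors keyed by column name (an assumed
  # stand-in for a dataframe). Each value reaches defn as a tensor.
  defn add_floats_and_ints(cols) do
    cols["floats"] + cols["ints"]
  end
end

cols = %{
  "floats" => Nx.tensor([1.0, 2.5, 3.0]),
  "ints" => Nx.tensor([1, 2, 3])
}

# Element-wise sum of the two "columns", returned as a single tensor.
result = ColumnMath.add_floats_and_ints(cols)
IO.inspect(Nx.to_flat_list(result))
```

Because the map values can each have their own type, the float and integer columns stay separate until the addition promotes the result to a float tensor.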

Therefore, if we were to add some functionality, it would be like Df.to_tensors_map(df).

from explorer.

NduatiK commented on September 26, 2024

This is interesting, I had only thought of DataFrames as tables. It definitely makes sense to return a map since it gives flexibility on the datatype of the individual 1-d tensors.

I will have a look at the Nx.Container and create a PR for discussion tomorrow.

Todo:

  • Df.to_tensors_map(df)
  • Nx integration through Nx.Container

josevalim commented on September 26, 2024

I am not sure if we should do the to_tensors_map version. Today you can easily get a column as a series, right? This means you can get a tensor for it. :) But exploration into containers would be cool!

cigrainger commented on September 26, 2024

Hmmm... giving this some thought. @josevalim I totally agree. However, re: your point about a map of 1d tensors... while this is usually how I think about it too, sometimes a dataframe has portions which should be treated as two-dimensional tensors. For example, you might have generated dummy variables and it totally makes sense to treat that as a matrix. You may want to pass a dataframe of features to an ML algorithm -- even just linear regression would treat a dataframe as a 2d matrix.

Agreed that this should be achieved with the Nx.Container protocol, just calling it out for consideration and some colour. For example, in R you pass a dataframe to lm and identify the formula with regressand ~ regressor_1 + regressor_2 etc. So you might have a dataframe df with variables height, age and sex and you might write model = lm(height ~ age + sex, data = df). In this case, R treats age and sex together as a 2d matrix. In Python you can do something very similar with sklearn and pandas.

At the simplest level, this definitely gets used pretty frequently in pandas.

R is a little more similar to what I think we'd do, which is very similar to your suggestion above @NduatiK. If you have a purely numerical dataframe, you can just call as.matrix on it.

NduatiK commented on September 26, 2024

The sklearn and pandas example is indeed what I was thinking of.

I had been worried about the datatypes on the final tensor, but it is already possible to isolate the numeric columns with a select and then convert the result to a preferred tensor type using Nx.as_type.
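
The select-then-cast workflow described here could look roughly like the following. It is a sketch under assumptions: the explorer and nx packages are fetched with Mix.install, the sample data is invented, and the columns are cast one by one before stacking so the result has a single type.

```elixir
Mix.install([:explorer, :nx])

alias Explorer.DataFrame, as: DF
alias Explorer.Series

# Made-up sample data: two numeric columns and one string column.
df =
  DF.new(%{
    "height" => [1.80, 1.65, 1.72],
    "age" => [30, 41, 25],
    "sex" => ["m", "f", "f"]
  })

# Keep only the numeric columns; the string column "sex" is left out.
numeric = DF.select(df, ["height", "age"])

features =
  numeric
  |> DF.names()
  |> Enum.map(fn name ->
    # Pull the series, convert it to a 1-D tensor, and cast it to f32.
    numeric |> DF.pull(name) |> Series.to_tensor() |> Nx.as_type({:f, 32})
  end)
  # Each 1-D tensor becomes one column of the resulting matrix.
  |> Nx.stack(axis: 1)

IO.inspect(Nx.shape(features))
IO.inspect(Nx.type(features))
```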

I will look into Nx.Container

josevalim commented on September 26, 2024

So what do you think about DF.to_tensor(df, ["height", "width"])? This ends up a 2d tensor.

NduatiK commented on September 26, 2024

Not sure if you are asking me or @cigrainger, but that looks good to me. I am currently doing the same thing, just with a DF.select first.

Would this allow column names to be shared or is it a normal tensor?

josevalim commented on September 26, 2024

I was asking both/everyone. :)

> Would this allow column names to be shared or is it a normal tensor?

What do you mean?

josevalim commented on September 26, 2024

One thing is that tensor names are atoms and columns are strings. But perhaps we could introduce a convention: if the column name is a string, then it is not named in the tensor. If an atom, it preserves its name in the tensor?

NduatiK commented on September 26, 2024

> What do you mean?

I meant: given that dataframes do not have an obvious column ordering, would the returned tensor contain information about which series in the dataframe each column was created from? A DF with age and salary would lose its data semantics as soon as it turns into a tensor. One would have to get DataFrame.names and keep that around.

> One thing is that tensor names are atoms and columns are strings

Whoops, just reread the docs. Turns out tensor names actually identify axes not columns, DataFrame column names do not map onto names on Nx.Tensor. If we were to do this, it would be using something else.
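
A small sketch of what axis names do in Nx, to make the distinction concrete. It assumes the nx package is fetched with Mix.install; the axis names :rows and :columns and the age/salary values are chosen only for illustration.

```elixir
Mix.install([:nx])

# Two rows of [age, salary]. The names label the two axes of the tensor,
# not the individual columns "age" and "salary".
t = Nx.tensor([[30, 50_000], [41, 62_000]], names: [:rows, :columns])
IO.inspect(Nx.names(t))

# An axis name can be used wherever an axis index is expected, for
# example to sum over the rows and get one total per column.
col_sums = Nx.sum(t, axes: [:rows])
IO.inspect(Nx.to_flat_list(col_sums))
IO.inspect(Nx.names(col_sums))
```

Nothing in the tensor records that position 0 along :columns was "age", which is why the column labels have to be tracked outside the tensor.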

I am not sure it would make sense to add a field to the Nx.Tensor type, since the column names would have to be tracked and could easily be destroyed (e.g., when you run a dot product). Perhaps naming is something that should be tracked by the ML algorithms that consume DataFrames.

Of course, like you said, this issue goes away if we can just treat DataFrames as tensors.

josevalim commented on September 26, 2024

> Whoops, just reread the docs. Turns out tensor names actually identify axes not columns, DataFrame column names do not map onto names on Nx.Tensor. If we were to do this, it would be using something else.

Ah, good call. The information is then lost, I am afraid.

> Of course, like you said, this issue goes away if we can just treat DataFrames as tensors.

Right, my understanding right now is that those are two separate problems. Nx.Container is about accessing each column individually. What you told me is that sometimes you may also want to get certain (or all?) columns from the DF as a matrix. That would be a different operation (either to_matrix or to_tensor).

NduatiK commented on September 26, 2024

Yes, my original issue was about a convenience function on the DataFrame itself.

By default it would convert all columns, but a list of names can be passed in.
When the list of names is provided, the tensor is ordered by that list.

When the names are not provided, we might need to return the list of columns as well. The result of names depends on the backend, and I am not sure all backends guarantee ordering.
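
The behaviour described above might be sketched as follows. The helper FrameSketch.to_tensor/2 is hypothetical and not part of Explorer; it assumes the explorer and nx packages are fetched with Mix.install, and it always returns the column names next to the matrix so the ordering is never implicit.

```elixir
Mix.install([:explorer, :nx])

defmodule FrameSketch do
  alias Explorer.DataFrame, as: DF
  alias Explorer.Series

  # Hypothetical helper, not an Explorer function. Returns {names, matrix} so
  # the caller knows which column of the matrix came from which series.
  def to_tensor(df, names \\ nil) do
    # Default to every column, in whatever order the backend reports.
    names = names || DF.names(df)

    matrix =
      names
      |> Enum.map(fn name -> df |> DF.pull(name) |> Series.to_tensor() end)
      # Stack the 1-D tensors as matrix columns, following the order of `names`.
      |> Nx.stack(axis: 1)

    {names, matrix}
  end
end

df = Explorer.DataFrame.new(%{"age" => [30, 41], "salary" => [50_000, 62_000]})

# With an explicit list, the matrix columns follow that list.
{names, matrix} = FrameSketch.to_tensor(df, ["salary", "age"])
IO.inspect(names)
IO.inspect(Nx.shape(matrix))
IO.inspect(Nx.to_flat_list(matrix))
```

Returning the names in both cases keeps the call sites uniform, at the cost of a tuple where a bare tensor would otherwise do.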

NduatiK commented on September 26, 2024

Hi @josevalim, what has prevented an Nx.Container implementation so far? I had assumed that the solution would be as simple as adapting the Map implementation.

Are there any special cases to be aware of given the different backends and the possible need to transfer values to the device?

josevalim commented on September 26, 2024

I think there are no blockers, it is just that nobody implemented the solution so far. I think the implementation is similar to maps, yeah. I would even make it so the DataFrame becomes a map.

NduatiK commented on September 26, 2024

Got it, thanks

cigrainger commented on September 26, 2024

@josevalim should this be closed now that we have TensorFrame?

NduatiK commented on September 26, 2024

Thanks @josevalim and @cigrainger, I think your zero-copy work is much better than what I would have been able to do.
❤️💚💙
