Comments (34)
Yes. Once you have decided the natural choice of type for `X` in your `fit`, you ensure `coerce` maps the provided iterable table to your type. The existing type declarations are a confusing anachronism (predating `coerce`). My apologies.
Should we allow the user (at the machine level) to present matrix/abstract matrix inputs?
Pros:
(i) performance benefit (but not at an expected bottleneck, 99% of the time)
(ii) convenience - here's a use case: in a learning network for a stack, the `Vector` predictions of the base learners are concatenated into a `Matrix` to be fed into the meta-learner; it would be a pain if this matrix first had to be wrapped as an iterable table.
Cons:
(iii) the designers of both generic data interfaces decided against viewing matrices as tables, probably for good reasons (although I don't know what they are)
(iv) a non-trivial complication in the specification of the model interface, because `coerce` must deal with a variety of types, unless the matrices are wrapped as tables "at point of entry" (a different kind of complication, in which any performance benefit may be lost).
I vote for simplicity. No matrices. Provide the user with a function `astable(::AbstractMatrix)` to wrap any `AbstractMatrix` as a table (which is suggested to them if they try to pass a matrix). Explain the row vs column convention in its docstring.
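No such helper exists yet; a minimal sketch of what a hypothetical `astable` might look like (name, signature, and default column names are all assumptions here) is a wrapper returning one named tuple per pattern, following the rows-as-patterns convention:

```julia
# Hypothetical helper (not part of MLJ): wrap an AbstractMatrix as a
# row-iterable "table", one named tuple per pattern (row); column names
# default to x1, x2, ...
function astable(X::AbstractMatrix; names=Symbol.("x", 1:size(X, 2)))
    colnames = Tuple(names)
    [NamedTuple{colnames}(Tuple(row)) for row in eachrow(X)]
end

X = [1.0 2.0; 3.0 4.0]
astable(X)[1]          # (x1 = 1.0, x2 = 2.0)
astable(transpose(X))  # also works, since Transpose <: AbstractMatrix
```

Because the argument is typed `AbstractMatrix`, a lazy transpose is accepted without materializing a copy.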
What do others think?
from mlj.jl.
Regarding @ablaom's earlier suggestion on checking equal length of `X`, `y`, etc. - I'd recommend avoiding too many case distinctions, as they tend to make the interface more brittle: it's very easy to forget which different cases or coercions need to be covered, unless there's a dedicated interface point for input handling.
Disagreed - I'm against introducing case distinctions into the interface, since it suddenly forces all implemented methods to be able to cope with the respective case distinctions.
If this is a common problem, it's maybe a better idea to provide helper functionality for the interface locations, rather than performing ad-hoc surgery on the interface design?
(or maybe I misunderstood your suggestion?)
@ablaom's opinion is maybe the most important here - he seems like the person most likely to suffer the consequences (bad or good).
I'm not sure I understand the problem exactly. For supervised models, data arrives as any generic table (currently a Query iterable table) whose type is completely unknown. Incidentally, sometimes it is an `AbstractMatrix` and sometimes it is not. The only convention in our whole API is this: if we do Query row iteration, we expect this to correspond to iteration over patterns (and not features). The interface implementer provides a method `coerce` to convert this data into whatever form he likes for his fit/predict methods.
MLJBase presently provides a convenience function `MLJBase.matrix` to convert this to a `Matrix{Float64}` with rows corresponding to patterns. We could provide a second convenience function which returns the transpose but, from the way Query works, this would just take the transpose, with no performance benefit.
Perhaps the benefit you are looking for is one of the casualties of an agnostic data interface?
Edit: Actually, the eltype of `matrix(Xtable)` is not necessarily `Float64` - it is just the tightest type for elements of `Xtable`. And Query.jl has since been dropped in favour of Tables.jl.
Apologies for the lack of clarity, and maybe my suggestion doesn't make sense. For simplicity let's just assume the input is a matrix that the user has loaded somehow.
My question is: if the user feeds `X'` to the machinery, what happens? Does it work? And are we sure that `X'` is not effectively materialized?
@tlienart, and how on earth should any machinery know, or distinguish between:
(a) the user intends to feed `x` with rows as samples and columns as features, or
(b) the user really intends to feed `x'`, i.e., feeds `x` with columns as samples and rows as features?
That is, how should any machinery incapable of mind reading know?
Or maybe you still need to clarify your query further, since it is starting to confuse me.
@ablaom, I very much agree with the agnostic data interface. There's a standard way in which the interface expects data, and I only see problems arising from adding redundant type or formatting options.
No no, I think the issue I'm raising is much simpler than the data agnostic interface with which I wholeheartedly agree.
The key element is that in Julia, a transpose is a type but not a Matrix (it's a kind of "view" if you like). See below:
```julia
X = randn(50, 10);
typeof(X)             # Array{Float64,2}
typeof(transpose(X))  # LinearAlgebra.Transpose{Float64,Array{Float64,2}}
```
So if your functions only expect a `Matrix{Float64}` (which is the same as `Array{Float64,2}`) they will fail on the transpose.
```julia
foo(X::Matrix{Float64}) = maximum(X)
foo(X)            # will work
foo(transpose(X)) # will NOT work (MethodError)
```
To fix this you need to write:
```julia
foo(X::AbstractMatrix{Float64}) = maximum(X)
foo(X)            # will work as before
foo(transpose(X)) # will ALSO work
```
I hope this clarifies the confusion, sorry.
So all I'm saying is to make sure that no error is thrown if a transpose is fed by the user (who would know that MLJ expects `n x p` but has `p x n` for the moment), and to avoid forcing the user to explicitly allocate the transposed matrix by doing `Xt = collect(transpose(X))` and then passing `Xt` to MLJ.
Strictly speaking the user should not feed a matrix `X` into the machinery, because a matrix is not an iterable table. If she does insert a matrix - let's call it `Xtable` to be consistent with the docs - then its fate (at present) depends on the model's `coerce` method.
Consider the common cases `coerce(model, Xtable) = MLJBase.matrix(Xtable)` and `coerce(model, Xtable) = MLJBase.matrix(Xtable)'` (where `fit` expects `X = coerce(model, Xtable)` to follow the patterns-as-rows or patterns-as-columns convention, respectively). Then `Xtable` had better follow the patterns-as-rows convention or the results are unpredictable and unreliable.
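To make the two conventions concrete, here is a self-contained toy sketch (the names `toy_matrix`, `RowsModel`, and `ColsModel` are invented for illustration; `toy_matrix` merely stands in for `MLJBase.matrix`, with a named tuple of column vectors playing the role of the table):

```julia
# Toy stand-in for MLJBase.matrix: named tuple of columns -> Matrix
# with rows as patterns.
toy_matrix(Xtable) = hcat(values(Xtable)...)

struct RowsModel end  # its fit expects patterns-as-rows
struct ColsModel end  # its fit expects patterns-as-columns

coerce(::RowsModel, Xtable) = toy_matrix(Xtable)
coerce(::ColsModel, Xtable) = transpose(toy_matrix(Xtable))

Xtable = (x1 = [1.0, 2.0, 3.0], x2 = [4.0, 5.0, 6.0])
size(coerce(RowsModel(), Xtable))  # (3, 2): 3 patterns as rows
size(coerce(ColsModel(), Xtable))  # (2, 3): 3 patterns as columns
```

Both models are served correctly only because `Xtable` itself follows the patterns-as-rows convention; if the user sneaks in a transposed matrix, neither `coerce` can detect it.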
You raise an issue here, which is what we should do about the user who inserts a matrix. It would be nice to allow this (because it saves conversions most of the time), but there is the danger of abuse. Ah, but for supervised models there is no danger, because `X` and `y` must have compatible shapes. Also, if the user works through the (not yet implemented) task interface (à la MLR), there will be no danger at all. Perhaps for unsupervised models a warning is enough?
Ah, I understand. I'm slightly shocked by the fact that in Julia, transposes do not behave like matrices - or, at least, do not live in the union type or similar. But surely there are good reasons for this?
Anyway, regarding MLJ, this seems to be an issue of the transpose rather than of the MLJ interface.
Given the discussion I wonder, should there be common coercions that are dispatch inherited? I.e., for all matrices being passed, the dispatch order ensures they reach an ancestor method that coerces them the same way into a nice iterable and then calls the method with that and the model?
Ok, fair enough; however, I imagine that this (the user feeding a matrix) will happen a lot, don't you think? Especially anyone coming from scikit-learn (the Python one) might try to do that.
I would just suggest that the `fit` methods in interfaces such as the ones for DecisionTree.jl and GaussianProcesses.jl etc. take `X::AbstractMatrix{Float64}` instead of enforcing `Matrix{Float64}`.
To avoid issues we could then check in `fit` that:
```julia
@assert length(y) == size(X, 1) "The number of rows of X must match the length of the response vector y"
```
Note that the "matching shapes" for `X` and `y` you mention is not unambiguous, as far as I'm aware. Given that `y` is just a vector, there would be no ambiguity in the user's mind in having `X` be `p x n`, unless we explicitly check somewhere else that `y` matches the number of rows of `X`?
> Ah, I understand. I'm slightly shocked by the fact that in Julia, transposes do not behave like matrices - or, at least, do not live in the union type or similar. But surely there's good reasons for this?
> Anyway, regarding MLJ, this seems to be an issue of the transpose rather than of the MLJ interface.
> Given the discussion I wonder, should there be common coercions that are dispatch inherited? I.e., for all matrices being passed, the dispatch order ensures they reach an ancestor method that coerces them the same way into a nice iterable and then calls the method with that and the model?
I think the union type you're thinking of, @fkiraly, is precisely the `AbstractMatrix` I'm advocating we use.
About `AbstractMatrix`: it would solve the problem, but not the problem that there are still array and transpose types floating around. So now the user may only use one of three obvious types that could pop up; two of the three cases will still throw an error.
Pragmatically, if we burden the user with adhering to a type restriction anyway (since they may not use an array!), it should be the more sensible restriction of "use an iterable, please".
Previous comment made before reading preceding two, sorry.
I don't agree on dimension checks at the level of model definitions. Surely that role lives at a higher level of the interface.
There still seems to be some confusion about the responsibility of the interface implementer, and about what she has no control over. The implementation author writes the `coerce` method (according to the convention that `Xtable` is a patterns-as-rows iterable table - over this the implementer has no control) and thus takes complete control of the type of `X` used in his `fit` method. Indeed, mostly for this reason, a type declaration on the `X` in `fit` is generally superfluous. If you have a `fit` method that wants type `AbstractWeird`, and have designed a `coerce` method that turns every iterable table into an `AbstractWeird` object, then you have no need to check the type, no?
re @fkiraly:
> Regarding @ablaom 's earlier suggestion on checking equal length of X,y etc - I'd recommend to avoid too many case distinctions as they tend to make the interface more brittle - it's very easy to forget which different cases or coercions need to be covered, unless there's a dedicated interface point for input handling.
There is one single natural point for supervised models to check that `X` and `y` (or `ys`) are compatible: in the construction of a `machine`. I think it would be prudent to have the check in this one place but nowhere else. Currently I haven't put it in.
Right, so if I understand you correctly, you are suggesting to:
- remove any type declaration in the signature of `fit` (so in fact go from `X::Matrix{Float64}` to just `X`)
- leave any adaptation to be done by the `coerce` method?
I'm sorry if I'm being dense and not translating your suggestion properly. I just believe that the possibility of a user trying to give a transposed matrix is extremely likely, especially for non-technical users who would also be confused by the fact that `transpose(X)` is not the same type as `X`. Maybe a possibility would be to have a switch `dims=2` like in implementations of `cov`.
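A sketch of what such a switch might look like (entirely hypothetical - the name `fit_sketch` is invented and nothing like this exists in the interface; the keyword follows `cov`'s convention, where `dims` selects the dimension indexing the observations):

```julia
# Hypothetical sketch: dims = 1 means patterns-as-rows (the default),
# dims = 2 means patterns-as-columns, as in cov(X; dims=...).
function fit_sketch(X::AbstractMatrix, y::AbstractVector; dims::Integer=1)
    Xrows = dims == 1 ? X : transpose(X)   # lazy: no copy is made
    size(Xrows, 1) == length(y) ||
        error("number of patterns in X must match length(y)")
    return Xrows  # a real fit would train on Xrows here
end

fit_sketch(ones(5, 3), ones(5))           # patterns as rows, 5×3
fit_sketch(ones(3, 5), ones(5); dims=2)   # transposed view, again 5×3
```

Note the `dims=2` path costs nothing: `transpose` is a view, so no allocation of the transposed matrix is forced on the user.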
I'm good with your suggestion favoring simplicity as long as there's enough testing with various inputs for the interfaces.
Agreed, and also agreed with @ablaom in particular on the additional argument that the model is the wrong interface level for random type coercion.
As @ablaom implicitly points out, in fact we already do have a dedicated input-handling interface - the data container/provider framework iterables. Hence the clean way to deal with matrices is on the data side of this rather than on the model/MLJ side.
It is, however, external to MLJ, which is why this important argument is perhaps harder to see.
Which would mean that, in consequence, the clean way of dealing with the matrices is raising an issue with IterableTables.jl rather than here.
Please file an issue in Tables.jl if you think a method wrapping a matrix as a table can be useful. See also JuliaData/Tables.jl#56 and JuliaData/Tables.jl#58, which are a bit different but related.
Important design question
It would be nice to avoid unnecessary data "coercions" (usually table -> matrix) in meta-modelling. The two main cases are ensembling and resampling by cross-validation. The raison d'être for the `coerce` method is exactly this. But to finish the implementation I must introduce a convention which I think violates a principle lauded by @fkiraly above.
To state the convention, let's recall that the `coerce` method takes tabular input `Xtable`, provided by the user when constructing a machine, and returns new data `X` for passing to the `fit` method. The `coerce` and `fit` methods are defined by someone implementing a new algorithm, and the form of `X` is currently up to them. Very often `X` is a matrix, but it could have rows-as-features or rows-as-patterns, or maybe it's an `Adjoint` or sparse or some other abstract matrix.
The meta-algorithms I have in mind need to apply pattern selection to the data. In the simple design this is always row-access to `Xtable` - which we always know how to do. However, to avoid repeated coercions, the meta-algorithms must instead access `X` directly, and for this we need:
Convention: `X` must be an `AbstractMatrix` with rows-as-patterns, or a Tables.jl table (with rows as patterns).
Note that `X` could be a Julia `Adjoint` object (which presents as a matrix but internally stores data in row-major instead of column-major fashion), which might optimise the core algorithm's performance. The coercion will be more or less efficient depending on the table type, but is only performed once by the meta-algorithm.
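As a concrete aside (Base Julia only), a lazy transpose already satisfies such a convention with no copying at all:

```julia
Xcols = ones(10, 50)   # stored with features as rows, patterns as columns
X = transpose(Xcols)   # 50×10: patterns-as-rows, zero-copy view

X isa AbstractMatrix   # true, so the convention is met
parent(X) === Xcols    # true: no data was copied
```

So an algorithm whose natural layout is patterns-as-columns can still hand the meta-algorithm a conforming `X` for free.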
The alternative is not to impose a convention, in which case we may as well drop the `coerce` method altogether (`X` becomes the table input by the user, and coercions are done within `fit`). The resulting design is a lot simpler.
My main disappointment, if we go the simple route, is the case of homogeneous ensembles, which can be quite large. In most other use-cases I can imagine, the performance/resource bottlenecks are elsewhere.
What do others think?
Edit. An extra complication with the "sophisticated" option is that I will want to add a `schema` argument to `fit`, to allow passing of feature names or similar for reporting purposes (e.g., reporting feature rankings). In the "simple" design, `X` has this information built in.
I don't have a strong opinion regarding this, but it may be worth looking at how StatsModels handles this, since it does a lot of similar work (see JuliaStats/StatsModels.jl#71 for the upcoming implementation).
Question/request for clarification, @ablaom: why, or in which form, are coercions necessary in meta-modelling? I've been trying to think about what you mean but I'm not sure. Isn't indexing by row index set (as supported by tables) sufficient here; why does it have to be a matrix?
The external package algorithm being wrapped may require a matrix for its fit method (and a matrix may indeed be the most efficient form for it to work with). Remember, we have no control over the preferred input of the externally defined algorithm. Also, row selection of a table is generally slower (may be serial, as opposed to random-access).
Consider, for the sake of illustration, 2-fold cross-validation: I want to fit on `rows1`, predict on `rows2`, and then reverse the roles of 1 and 2 for a second prediction. Starting with user-supplied `Xtable` we have:
Simple design:
- select rows1 from Xtable, coerce to X1, use X1 to fit; select rows2 from Xtable, coerce to X2, apply predict using the preceding fit-result -> prediction1
- select rows2 from Xtable, coerce to X2, use X2 to fit; select rows1 from Xtable, coerce to X1, apply predict using the preceding fit-result -> prediction2
Sophisticated design:
- coerce Xtable to X
- select rows1 from X to get X1, use X1 to fit; select rows2 from X, apply predict using the preceding fit-result -> prediction1
- select rows2 from X to get X2, use X2 to fit; select rows1 from X, apply predict using the preceding fit-result -> prediction2
The simple design requires four coercions from table to matrix (say), and the selection is generally slower (serial, not random-access). The sophisticated design requires one coercion, and the selection is generally faster.
These coercions probably don't matter too much for CV but, for ensembling lots of simple models, the fit may be on the same order as coercion/selection.
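The coercion counts can be checked with a toy sketch (every name here is invented for illustration: a "table" is a named tuple of columns, `toy_coerce` stands in for table-to-matrix coercion, and `selectrows` is a toy table row selector, not MLJ's):

```julia
# Count how many table -> matrix coercions each design performs.
N_COERCIONS = Ref(0)
toy_coerce(Xtable) = (N_COERCIONS[] += 1; hcat(values(Xtable)...))

# Row selection on the table itself (serial in general; cheap here):
selectrows(Xtable, rows) = map(col -> col[rows], Xtable)

function cv_simple(Xtable, rows1, rows2)
    for (train, test) in ((rows1, rows2), (rows2, rows1))
        Xtrain = toy_coerce(selectrows(Xtable, train))
        Xtest = toy_coerce(selectrows(Xtable, test))
        # ... fit on Xtrain, predict on Xtest ...
    end
end

function cv_sophisticated(Xtable, rows1, rows2)
    X = toy_coerce(Xtable)  # the one and only coercion
    for (train, test) in ((rows1, rows2), (rows2, rows1))
        Xtrain, Xtest = X[train, :], X[test, :]
        # ... fit on Xtrain, predict on Xtest ...
    end
end

Xtable = (x1 = collect(1.0:4.0), x2 = collect(5.0:8.0))
N_COERCIONS[] = 0; cv_simple(Xtable, 1:2, 3:4); N_COERCIONS[]         # 4
N_COERCIONS[] = 0; cv_sophisticated(Xtable, 1:2, 3:4); N_COERCIONS[]  # 1
</imports>
```

With many folds, or many ensemble members, the gap between the two counts grows linearly.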
@ablaom, I see - I still personally think that the onus should be on the table to provide a uniform interface, with quick-access functionality optional.
E.g., if an "access_as_matrix" flag is set internally in `Xtable`, it would not need coercion, or coercion would be instantaneous, since the outer layer is just an access layer.
This would make a lot of sense also for other wrapped storage modalities such as sparse matrices. In this respect, I disagree with the consensus in JuliaData/Tables.jl#56.
(The point about sparse tables is i.m.o. a non-argument: you can internally store I, J, V columns = long format while accessing externally via wide format.)
In my opinion, it would make much sense to have functionality to wrap a matrix coercion-free with tables accessor syntax, even if there is no 1:1 mapping between matrices and tables.
Then, in the simple design, coercions would still be called, but would be of negligible computational impact, no? Since the table sees "I'm a matrix internally" and just loops the query through to random access internally.
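A minimal version of that coercion-free wrapper idea could look like the following (a stand-alone toy with an invented `MatrixTable` type, not an implementation of the Tables.jl interface): iteration yields named tuples, but the data stays in the wrapped matrix, so there is no coercion copy and random access to rows remains cheap.

```julia
# Toy sketch of a matrix masquerading as a row-iterable table.
struct MatrixTable{M<:AbstractMatrix}
    matrix::M
    names::Vector{Symbol}
end
MatrixTable(X::AbstractMatrix) = MatrixTable(X, Symbol.("x", 1:size(X, 2)))

Base.length(t::MatrixTable) = size(t.matrix, 1)
function Base.iterate(t::MatrixTable, i=1)
    i > length(t) && return nothing
    # Each row presents as a named tuple; `view` avoids copying the row.
    row = NamedTuple{Tuple(t.names)}(Tuple(view(t.matrix, i, :)))
    return (row, i + 1)
end

t = MatrixTable([1.0 2.0; 3.0 4.0])
first(t)    # (x1 = 1.0, x2 = 2.0)
collect(t)  # two named tuples; the underlying matrix is never duplicated
```

For the real thing, raising this on the Tables.jl side (where a matrix-wrapping constructor along these lines has been discussed) would be the clean route.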
@ablaom, on a higher level: I'm generally against "fixing lack of features on the wrong side of an inter-package interface".
Not only does this lead to interface proliferation inside (e.g., now you suddenly need to tell the ensembling wrapper when to coerce), but it may lead to the inter-package interface breaking when the other package goes through consolidation/refactoring, since it doesn't take into account the functionality you've sneakily tacked on in your package.
Hmm. I think I understand your objections, and do not at all disagree. However, I do not understand the proposed solution:
> @ablaom, I see - I still personally think that the onus should be on the table to provide a uniform interface with quick access functionality optional.
Agreed, but Tables.jl (and Query) are far from providing this kind of access. Tables.jl returns a row (or column) iterator. The production of this iterator may be fast or slow, depending on the table being accessed, but access is necessarily sequential from the outsider's point of view. At least this is my understanding.
> E.g., if a "access_as_matrix" flag is set internally in Xtable, it would not need coercion, or coercion would be instantaneous since the outer layer is just an access layer.
I'm sorry, I don't understand. How, for example, can I set "access_as_matrix" for a CSV file (a kind of data source), or even a DataFrame (a vector of vectors of differing eltypes)? There is no matrix to access!
Perhaps there is some misunderstanding about the sense in which Tables.jl is a "wrapper". Tables.jl does not wrap data in a new object. Rather, it provides methods for accessing all tabular data objects in the same way.
> Perhaps there is some misunderstanding about the sense in which Tables.jl is a "wrapper". Tables.jl does not wrap data in new object. Rather, it provides methods for accessing all tabular data objects in the same way.
Sorry for creating confusion, @ablaom. It is perhaps created by me referring ambiguously to "Tables.jl in its current state" and "abstract iterator interfaces in general".
A clean tabular data container interface usually provides a front-end, i.e., a user interface or access layer, and a back-end, i.e., a data interface or storage layer. As far as I understand, Tables.jl provides a unified set of access patterns for the front-end, and by proxy implements some back-ends, through "dispatch-on-back-end-storage" (in the integrated packages). Is this understanding of mine correct?
If so, given the nice front-end, I don't see what would stop you from writing methods dispatching on matrices, or "my matrix plus schema" objects. Similarly, you could implement a back-end data container where internally you can switch matrix representation vs table representation on and off (via the "flag"). But none of these exist.
So the wrapper I meant is not the accessor patterns in Tables.jl, but a purpose-built data structure which could be accessed via the patterns - does that make more sense?
I.e., DataFrames.jl or similar could provide such functionality; at least there needs to be some type-based case distinction between accessor and back-end, since a matrix itself doesn't store a schema.
It could be possible to add a trait to Tables.jl indicating whether a given table type can be efficiently indexed. If not, you could copy its contents to a type that fulfills this requirement (e.g. using `Tables.columnstable` or a plain matrix).
@nalimilan Is there not still the problem that Tables.jl does not provide fast random-access methods, even if the type supports it, because only iterators are returned? Or is this a planned feature enhancement?
That can probably be considered if it fills a need. Or maybe you can already just call `Tables.columnstable` on all tables, knowing that it's very cheap for `DataFrame` (since it makes no copy), and that you can rely on indexing it since it's defined to return a tuple of vectors. It will be costly for row-oriented tables (like CSV) since it will allocate new vectors, but that's the intended behavior AFAICT.
@nalimilan Thanks for that. I did not realise that `Tables.columnstable` essentially returns a view and not a copy where possible. This makes a big difference in those cases. So, if doing row selection, you would recommend first making a `Tables.columnstable` rather than a `Tables.rowstable` (my current naive implementation)? Perhaps you would be willing to review my current row/column selection add-on to Tables.jl? It is here; the methods are called `selectrows` and `selectcols`.
I must say, it would be easier for a Tables.jl user like me if Tables.jl provided a row/column/cartesian indexing option, so that I didn't have to think about such details. Is the reluctance to provide such functionality because the project has other priorities (perfectly understandable!), or is there some other objection to providing it? This functionality seems orthogonal to the existing functionality, as far as I can tell.
At this point it is difficult for me to say how much of a performance advantage I would get from Tables.jl providing the indexing feature. Mostly the benefit would be small, I'm sure. The cases I am thinking about are meta-algorithms like bagging and gradient boosting, which we would like to apply to arbitrary models (in external packages) and not just on a case-by-case basis. In these ensemble methods, one might have hundreds or thousands of "simple" learners which are extremely quick to train (e.g., a decision stump), and perhaps this training time is comparable to the time to access the data through "inefficient" indexing. (I put "inefficient" in quotes because I agree now this may or may not be so bad, depending on the supplied table type.) To avoid "inefficient" indexing, MLJ must cache the data or wrap tables in objects with a (possibly) duplicate storage layer, along the lines I think @fkiraly is suggesting above. Both options are complications to our design which, in my opinion, are out of place.
@fkiraly On reflection, I am in complete agreement with your objections to my "sophisticated" attempts to circumvent repetitive data coercions. All row/column access to data should be applied through the one interface (Tables.jl, or some functional or functional/object wrap). I will deprecate the `coerce` method: the `X` in a model `fit` method's signature is to be the generic table provided by the user, and any further data coercions (e.g., conversion to a matrix) are to be performed within `fit` itself.
> @nalimilan Thanks for that. I did not realise that Tables.columnstable essentially returns a view and not a copy where possible. This makes a big difference in those cases. So, if doing row selection, you would recommend first making a Tables.columnstable rather than a Tables.rowstable (my current naive implementation?)
AFAICT it makes sense to collect data to a `columnstable` if you need to access each row multiple times. Otherwise iterating over them will probably be better, at least for tables for which `columnstable` would make a copy. But this should be benchmarked to be certain.
> Perhaps you would be willing to review my current row/column selection add-on to Tables.jl? It is here; the methods are called `selectrows` and `selectcols`.
I've looked at it quickly, and it looks good to me. Some features sound like they could be moved to CategoricalArrays (like `transform`/`inverse_transform`) or to Tables.jl in one form or another.
> I must say, it would be easier for a Tables.jl user like me if Tables.jl provided row/column/cartesian indexing option, so that I didn't have to think about such details. Is the reluctance to provide such functionality because the project has other priorities (perfectly understandable!) - or is there some other objection to providing it? This functionality seems orthogonal to the existing functionality, as far as I can tell.
I can't speak for @quinnj (better to comment at JuliaData/Tables.jl#48 for possible improvements to Tables.jl), but I can see at least two reasons. First, starting with iteration over rows and columns suits most use cases, so that's a good start. Second, efficient indexing isn't possible for all sources without making a copy, contrary to iteration, which works for all sources.
But that doesn't mean we shouldn't investigate the most appropriate way of dealing with indexing. It would be too bad if packages had to duplicate features internally to make it work.
Anyway, I think you should really benchmark the different options first. Without reliable information on performance, design discussions cannot be very precise. If you know what's the most appropriate storage format you would need Tables.jl to produce (which could possibly vary depending on whether the source is column-oriented or row-oriented), that would help in discussing concrete options.