Comments (34)
Yes. Once you have decided the natural choice of type for `X` in your `fit`, you ensure `coerce` maps the provided iterable table to your type. The existing type declarations are a confusing anachronism (predating `coerce`). My apologies.
Should we allow the user (at the machine level) to present matrix/abstract matrix inputs?
Pros:
(i) performance benefit (but not at an expected bottleneck, 99% of the time)
(ii) convenience - here's a use case: in a learning network for a stack, the `Vector` predictions of the base learners are concatenated into a `Matrix` to be fed into the meta-learner; it would be a pain if this matrix first had to be wrapped as an iterable table.
Cons:
(iii) the designers of both generic data interfaces decided against viewing matrices as tables, probably for good reasons (although I don't know what they are)
(iv) a non-trivial complication in the specification of the model interface, because `coerce` must deal with a variety of types, unless the matrices are wrapped as tables "at point of entry" (a different kind of complication, in which any performance benefit may be lost).
I vote for simplicity. No matrices. Provide the user with a function `astable(::AbstractMatrix)` to wrap any `AbstractMatrix` as a table (which is suggested to them if they try to pass a matrix). Explain the row vs column convention in its docstring.
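No such helper exists yet; a minimal sketch of what a hypothetical `astable` might look like (name, signature, and default column names are all assumptions here) is a wrapper returning one named tuple per pattern, following the rows-as-patterns convention:

```julia
# Hypothetical helper (not part of MLJ): wrap an AbstractMatrix as a
# row-iterable "table", one named tuple per pattern (row); column names
# default to x1, x2, ...
function astable(X::AbstractMatrix; names=Symbol.("x", 1:size(X, 2)))
    colnames = Tuple(names)
    [NamedTuple{colnames}(Tuple(row)) for row in eachrow(X)]
end

X = [1.0 2.0; 3.0 4.0]
astable(X)[1]          # (x1 = 1.0, x2 = 2.0)
astable(transpose(X))  # also works, since Transpose <: AbstractMatrix
```

Because the argument is typed `AbstractMatrix`, a lazy transpose is accepted without materializing a copy.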
What do others think?
from mlj.jl.
Regarding @ablaom's earlier suggestion on checking equal length of `X`, `y`, etc. - I'd recommend avoiding too many case distinctions, as they tend to make the interface more brittle: it's very easy to forget which different cases or coercions need to be covered, unless there's a dedicated interface point for input handling.
Disagreed - I'm against introducing case distinctions into the interface, since it suddenly forces all implemented methods to be able to cope with the respective case distinctions.
If this is a common problem, it's maybe a better idea to provide helper functionality for the interface locations, rather than performing ad-hoc surgery on the interface design?
(or maybe I misunderstood your suggestion?)
@ablaom's opinion is maybe the most important here - he seems like the person most likely to suffer the consequences (bad or good).
I'm not sure I understand the problem exactly. For supervised models, data arrives as any generic table (currently a Query iterable table) whose type is completely unknown. Incidentally, sometimes it is an `AbstractMatrix` and sometimes it is not. The only convention in our whole API is this: if we do Query row iteration, we expect this to correspond to iteration over patterns (and not features). The interface implementer provides a method `coerce` to convert this data into whatever form he likes for his fit/predict methods.
MLJBase presently provides a convenience function `MLJBase.matrix` to convert this to a `Matrix{Float64}` with rows corresponding to patterns. We could provide a second convenience function which returns the transpose but, from the way Query works, this would just take the transpose, with no performance benefit.
Perhaps the benefit you are looking for is one of the casualties of an agnostic data interface?
Edit: Actually, the eltype of `matrix(Xtable)` is not necessarily `Float64` - it is just the tightest type for elements of `Xtable`. And Query.jl has since been dropped in favour of Tables.jl.
Apologies for the lack of clarity, and maybe my suggestion doesn't make sense. For simplicity let's just assume the input is a matrix that the user has loaded somehow.
My question is: if the user feeds `X'` to the machinery, what happens? Does it work? And are we sure that `X'` is not effectively materialized?
@tlienart, and how on earth should any machinery know, or distinguish between:
(a) the user intends to feed `x` with rows as samples and columns as features, or
(b) the user really intends to feed `x'`, i.e., feeds `x` with columns as samples and rows as features?
That is, how should any machinery incapable of mind reading know?
Or maybe you still need to clarify your query further, since it is starting to confuse me.
@ablaom, I very much agree with the agnostic data interface. There's a standard way in which the interface expects data, and I only see problems arising from adding redundant type or formatting options.
No no, I think the issue I'm raising is much simpler than the data agnostic interface with which I wholeheartedly agree.
The key element is that in Julia, a transpose is a type but not a Matrix (it's a kind of "view" if you like). See below:
```julia
X = randn(50, 10);
typeof(X)             # Array{Float64,2}
typeof(transpose(X))  # LinearAlgebra.Transpose{Float64,Array{Float64,2}}
```
So if your functions only expect a `Matrix{Float64}` (which is the same as `Array{Float64,2}`) they will fail on the transpose.
```julia
foo(X::Matrix{Float64}) = maximum(X)
foo(X)            # will work
foo(transpose(X)) # will NOT work (MethodError)
```
To fix this you need to write:
```julia
foo(X::AbstractMatrix{Float64}) = maximum(X)
foo(X)            # will work as before
foo(transpose(X)) # will ALSO work
```
I hope this clarifies the confusion, sorry.
So all I'm saying is to make sure that no error is thrown if a transpose is fed by the user (who would know that MLJ expects `n x p` but has `p x n` for the moment), and to avoid forcing the user to explicitly allocate the transposed matrix by doing `Xt = collect(transpose(X))` and then passing `Xt` to MLJ.
Strictly speaking the user should not feed a matrix `X` into the machinery, because a matrix is not an iterable table. If she does insert a matrix - let's call it `Xtable` to be consistent with the docs - then its fate (at present) depends on the model's `coerce` method.
Consider the common cases `coerce(model, Xtable) = MLJBase.matrix(Xtable)` and `coerce(model, Xtable) = MLJBase.matrix(Xtable)'` (where `fit` expects `X = coerce(model, Xtable)` to follow the patterns-as-rows or patterns-as-columns convention, respectively). Then `Xtable` had better follow the patterns-as-rows convention or the results are unpredictable and unreliable.
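To make the two conventions concrete, here is a self-contained toy sketch (the names `toy_matrix`, `RowsModel`, and `ColsModel` are invented for illustration; `toy_matrix` merely stands in for `MLJBase.matrix`, with a named tuple of column vectors playing the role of the table):

```julia
# Toy stand-in for MLJBase.matrix: named tuple of columns -> Matrix
# with rows as patterns.
toy_matrix(Xtable) = hcat(values(Xtable)...)

struct RowsModel end  # its fit expects patterns-as-rows
struct ColsModel end  # its fit expects patterns-as-columns

coerce(::RowsModel, Xtable) = toy_matrix(Xtable)
coerce(::ColsModel, Xtable) = transpose(toy_matrix(Xtable))

Xtable = (x1 = [1.0, 2.0, 3.0], x2 = [4.0, 5.0, 6.0])
size(coerce(RowsModel(), Xtable))  # (3, 2): 3 patterns as rows
size(coerce(ColsModel(), Xtable))  # (2, 3): 3 patterns as columns
```

Both models are served correctly only because `Xtable` itself follows the patterns-as-rows convention; if the user sneaks in a transposed matrix, neither `coerce` can detect it.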
You raise an issue here, which is what we should do about the user who inserts a matrix. It would be nice to allow this (because it saves conversions most of the time), but there is the danger of abuse. Ah, but for supervised models there is no danger, because `X` and `y` must have compatible shapes. Also, if the user works through the (not yet implemented) task interface (à la MLR), there will be no danger at all. Perhaps for unsupervised models a warning is enough?
Ah, I understand. I'm slightly shocked by the fact that in Julia, transposes do not behave like matrices - or, at least, do not live in the union type or similar. But surely there are good reasons for this?
Anyway, regarding MLJ, this seems to be an issue of the transpose rather than of the MLJ interface.
Given the discussion I wonder, should there be common coercions that are dispatch inherited? I.e., for all matrices being passed, the dispatch order ensures they reach an ancestor method that coerces them the same way into a nice iterable and then calls the method with that and the model?
Ok, fair enough; however, I imagine that this (the user feeding a matrix) will happen a lot, don't you think? Especially anyone coming from scikit-learn (the Python one) might try to do that.
I would just suggest that the `fit` methods in interfaces such as the ones for DecisionTree.jl and GaussianProcesses.jl etc. take `X::AbstractMatrix{Float64}` instead of enforcing `Matrix{Float64}`.
To avoid issues we could then check in `fit` that:
```julia
@assert length(y) == size(X, 1) "The number of rows of X must match the length of the response vector y"
```
Note that the "matching shapes" for `X` and `y` you mention is not unambiguous, as far as I'm aware. Given that `y` is just a vector, there would be no ambiguity in the user's mind in having `X` be `p x n`, unless we explicitly check somewhere else that `y` matches the number of rows of `X`?
> Ah, I understand. I'm slightly shocked by the fact that in Julia, transposes do not behave like matrices - or, at least, do not live in the union type or similar. But surely there's good reasons for this?
> Anyway, regarding MLJ, this seems to be an issue of the transpose rather than of the MLJ interface.
> Given the discussion I wonder, should there be common coercions that are dispatch inherited? I.e., for all matrices being passed, the dispatch order ensures they reach an ancestor method that coerces them the same way into a nice iterable and then calls the method with that and the model?
I think the union type you're thinking of, @fkiraly, is precisely the `AbstractMatrix` I'm advocating we use.
About `AbstractMatrix`: it would solve the problem, but not the problem that there are still array and transpose types floating around. So now the user may only use one of three obvious types that could pop up; two of the three cases will still throw an error.
Pragmatically, if we burden the user with adhering to a type restriction anyway (since they may not use an array!), it should be the more sensible restriction of "use an iterable, please".
Previous comment made before reading preceding two, sorry.
I don't agree on dimension checks at the level of model definitions. Surely that role lives at a higher level of the interface.
There still seems to be some confusion about the responsibility of the interface implementer, and about what she has no control over. The implementation author writes the `coerce` method (according to the convention that `Xtable` is a patterns-as-rows iterable table - over this the implementer has no control) and thus takes complete control of the type of `X` used in his `fit` method. Indeed, mostly for this reason, a type declaration on the `X` in `fit` is generally superfluous. If you have a `fit` method that wants type `AbstractWeird`, and have designed a `coerce` method that turns every iterable table into an `AbstractWeird` object, then you have no need to check the type, no?
re @fkiraly:
> Regarding @ablaom 's earlier suggestion on checking equal length of X,y etc - I'd recommend to avoid too many case distinctions as they tend to make the interface more brittle - it's very easy to forget which different cases or coercions need to be covered, unless there's a dedicated interface point for input handling.
There is one single natural point for supervised models to check that `X` and `y` (or `ys`) are compatible: in the construction of a `machine`. I think it would be prudent to have the check in this one place but nowhere else. Currently I haven't put it in.
Right, so if I understand you correctly, you are suggesting to:
- remove any type declaration in the signature of `fit` (so in fact go from `X::Matrix{Float64}` to just `X`)
- leave any adaptation to be done by the `coerce` method?
I'm sorry if I'm being dense and not translating your suggestion properly. I just believe that the possibility of a user trying to give a transposed matrix is extremely likely, especially for non-technical users who would also be confused by the fact that `transpose(X)` is not the same type as `X`. Maybe a possibility would be to have a switch `dims=2` like in implementations of `cov`.
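A sketch of what such a switch might look like (entirely hypothetical - the name `fit_sketch` is invented and nothing like this exists in the interface; the keyword follows `cov`'s convention, where `dims` selects the dimension indexing the observations):

```julia
# Hypothetical sketch: dims = 1 means patterns-as-rows (the default),
# dims = 2 means patterns-as-columns, as in cov(X; dims=...).
function fit_sketch(X::AbstractMatrix, y::AbstractVector; dims::Integer=1)
    Xrows = dims == 1 ? X : transpose(X)   # lazy: no copy is made
    size(Xrows, 1) == length(y) ||
        error("number of patterns in X must match length(y)")
    return Xrows  # a real fit would train on Xrows here
end

fit_sketch(ones(5, 3), ones(5))           # patterns as rows, 5×3
fit_sketch(ones(3, 5), ones(5); dims=2)   # transposed view, again 5×3
```

Note the `dims=2` path costs nothing: `transpose` is a view, so no allocation of the transposed matrix is forced on the user.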
I'm good with your suggestion favoring simplicity as long as there's enough testing with various inputs for the interfaces.
Agreed, and also agreed with @ablaom in particular on the additional argument that the model is the wrong interface level for random type coercion.
As @ablaom implicitly points out, in fact we already do have a dedicated input-handling interface - the data container/provider framework iterables. Hence the clean way to deal with matrices is on the data side of this rather than on the model/MLJ side.
It is, however, external to MLJ, which is why this important argument is perhaps harder to see.
Which would mean that, in consequence, the clean way of dealing with the matrices is raising an issue with IterableTables.jl rather than here.
Please file an issue in Tables.jl if you think a method wrapping a matrix as a table can be useful. See also JuliaData/Tables.jl#56 and JuliaData/Tables.jl#58, which are a bit different but related.
Important design question
It would be nice to avoid unnecessary data "coercions" (usually table -> matrix) in meta-modelling. The two main cases are ensembling and resampling by cross-validation. The raison d'être for the `coerce` method is exactly this. But to finish the implementation I must introduce a convention which I think violates a principle lauded by @fkiraly above.
To state the convention, let's recall that the `coerce` method takes tabular input `Xtable`, provided by the user when constructing a machine, and returns new data `X` for passing to the `fit` method. The `coerce` and `fit` methods are defined by someone implementing a new algorithm, and the form of `X` is currently up to them. Very often `X` is a matrix, but it could have rows-as-features or rows-as-patterns, or maybe it's an `Adjoint` or sparse or some other abstract matrix.
The meta-algorithms I have in mind need to apply pattern selection to the data. In the simple design this is always row-access to `Xtable` - which we always know how to do. However, to avoid repeated coercions, the meta-algorithms must instead access `X` directly, and for this we need:
Convention: `X` must be an `AbstractMatrix` with rows-as-patterns, or a Tables.jl table (with rows as patterns).
Note that `X` could be a Julia `Adjoint` object (which presents as a matrix but internally stores data in row-major instead of column-major fashion), which might optimise the core algorithm's performance. The coercion will be more or less efficient depending on the table type, but is only performed once by the meta-algorithm.
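As a concrete aside (Base Julia only), a lazy transpose already satisfies such a convention with no copying at all:

```julia
Xcols = ones(10, 50)   # stored with features as rows, patterns as columns
X = transpose(Xcols)   # 50×10: patterns-as-rows, zero-copy view

X isa AbstractMatrix   # true, so the convention is met
parent(X) === Xcols    # true: no data was copied
```

So an algorithm whose natural layout is patterns-as-columns can still hand the meta-algorithm a conforming `X` for free.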
The alternative is not to impose a convention, in which case we may as well drop the `coerce` method altogether (`X` becomes the table input by the user, and coercions are done within `fit`). The resulting design is a lot simpler.
My main disappointment, if we go the simple route, is the case of homogeneous ensembles, which can be quite large. In most other use-cases I can imagine, the performance/resource bottlenecks are elsewhere.
What do others think?
Edit. An extra complication with the "sophisticated" option is that I will want to add a `schema` argument to `fit`, to allow passing of feature names or similar for reporting purposes (e.g., reporting feature rankings). In the "simple" design, `X` has this information built in.
I don't have a strong opinion regarding this, but it may be worth looking at how StatsModels handles this, since it does a lot of similar work (see JuliaStats/StatsModels.jl#71 for the upcoming implementation).
Question/request for clarification, @ablaom: why, or in which form, are coercions necessary in meta-modelling? I've been trying to think about what you mean but I'm not sure. Isn't indexing by row index set (as supported by tables) sufficient here; why does it have to be a matrix?
The external package algorithm being wrapped may require a matrix for its fit method (and a matrix may indeed be the most efficient form for it to work with). Remember, we have no control over the preferred input of the externally defined algorithm. Also, row selection of a table is generally slower (may be serial, as opposed to random-access).
Consider, for the sake of illustration, 2-fold cross-validation: I want to fit on `rows1`, predict on `rows2`, and then reverse the roles of 1 and 2 for a second prediction. Starting with user-supplied `Xtable` we have:
Simple design:
- select rows1 from Xtable, coerce to X1, use X1 to fit; select rows2 from Xtable, coerce to X2, apply predict using the preceding fit-result -> prediction1
- select rows2 from Xtable, coerce to X2, use X2 to fit; select rows1 from Xtable, coerce to X1, apply predict using the preceding fit-result -> prediction2
Sophisticated design:
- coerce Xtable to X
- select rows1 from X to get X1, use X1 to fit; select rows2 from X, apply predict using the preceding fit-result -> prediction1
- select rows2 from X to get X2, use X2 to fit; select rows1 from X, apply predict using the preceding fit-result -> prediction2
The simple design requires four coercions from table to matrix (say), and the selection is generally slower (serial, not random-access). The sophisticated design requires one coercion, and the selection is generally faster.
These coercions probably don't matter too much for CV but, for ensembling lots of simple models, the fit may be on the same order as coercion/selection.
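The coercion counts can be checked with a toy sketch (every name here is invented for illustration: a "table" is a named tuple of columns, `toy_coerce` stands in for table-to-matrix coercion, and `selectrows` is a toy table row selector, not MLJ's):

```julia
# Count how many table -> matrix coercions each design performs.
N_COERCIONS = Ref(0)
toy_coerce(Xtable) = (N_COERCIONS[] += 1; hcat(values(Xtable)...))

# Row selection on the table itself (serial in general; cheap here):
selectrows(Xtable, rows) = map(col -> col[rows], Xtable)

function cv_simple(Xtable, rows1, rows2)
    for (train, test) in ((rows1, rows2), (rows2, rows1))
        Xtrain = toy_coerce(selectrows(Xtable, train))
        Xtest = toy_coerce(selectrows(Xtable, test))
        # ... fit on Xtrain, predict on Xtest ...
    end
end

function cv_sophisticated(Xtable, rows1, rows2)
    X = toy_coerce(Xtable)  # the one and only coercion
    for (train, test) in ((rows1, rows2), (rows2, rows1))
        Xtrain, Xtest = X[train, :], X[test, :]
        # ... fit on Xtrain, predict on Xtest ...
    end
end

Xtable = (x1 = collect(1.0:4.0), x2 = collect(5.0:8.0))
N_COERCIONS[] = 0; cv_simple(Xtable, 1:2, 3:4); N_COERCIONS[]         # 4
N_COERCIONS[] = 0; cv_sophisticated(Xtable, 1:2, 3:4); N_COERCIONS[]  # 1
</imports>
```

With many folds, or many ensemble members, the gap between the two counts grows linearly.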
@ablaom, I see - I still personally think that the onus should be on the table to provide a uniform interface, with quick-access functionality optional.
E.g., if an "access_as_matrix" flag is set internally in `Xtable`, it would not need coercion, or coercion would be instantaneous, since the outer layer is just an access layer.
This would make a lot of sense also for other wrapped storage modalities such as sparse matrices. In this respect, I disagree with the consensus in JuliaData/Tables.jl#56.
(The point about sparse tables is i.m.o. a non-argument: you can internally store I, J, V columns = long format while accessing externally via wide format.)
In my opinion, it would make much sense to have functionality to wrap a matrix coercion-free with tables accessor syntax, even if there is no 1:1 mapping between matrices and tables.
Then, in the simple design, coercions would still be called, but would be of negligible computational impact, no? Since the table sees "I'm a matrix internally" and just loops the query through to random access internally.
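A minimal version of that coercion-free wrapper idea could look like the following (a stand-alone toy with an invented `MatrixTable` type, not an implementation of the Tables.jl interface): iteration yields named tuples, but the data stays in the wrapped matrix, so there is no coercion copy and random access to rows remains cheap.

```julia
# Toy sketch of a matrix masquerading as a row-iterable table.
struct MatrixTable{M<:AbstractMatrix}
    matrix::M
    names::Vector{Symbol}
end
MatrixTable(X::AbstractMatrix) = MatrixTable(X, Symbol.("x", 1:size(X, 2)))

Base.length(t::MatrixTable) = size(t.matrix, 1)
function Base.iterate(t::MatrixTable, i=1)
    i > length(t) && return nothing
    # Each row presents as a named tuple; `view` avoids copying the row.
    row = NamedTuple{Tuple(t.names)}(Tuple(view(t.matrix, i, :)))
    return (row, i + 1)
end

t = MatrixTable([1.0 2.0; 3.0 4.0])
first(t)    # (x1 = 1.0, x2 = 2.0)
collect(t)  # two named tuples; the underlying matrix is never duplicated
```

For the real thing, raising this on the Tables.jl side (where a matrix-wrapping constructor along these lines has been discussed) would be the clean route.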
@ablaom, on a higher level: I'm generally against "fixing lack of features on the wrong side of an inter-package interface".
Not only does this lead to interface proliferation inside (e.g., now you suddenly need to tell the ensembling wrapper when to coerce), but it may lead to the inter-package interface breaking when the other package goes through consolidation/refactoring, since it doesn't take into account the functionality you've sneakily tacked on in your package.
Hmm. I think I understand your objections, and do not at all disagree. However, I do not understand the proposed solution:
> @ablaom, I see - I still personally think that the onus should be on the table to provide a uniform interface with quick access functionality optional.
Agreed, but Tables.jl (and Query) are far from providing this kind of access. Tables.jl returns a row (or column) iterator. The production of this iterator may be fast or slow, depending on the table being accessed, but access is necessarily sequential from the outsider's point of view. At least this is my understanding.
> E.g., if a "access_as_matrix" flag is set internally in Xtable, it would not need coercion, or coercion would be instantaneous since the outer layer is just an access layer.
I'm sorry, I don't understand. How, for example, can I set "access_as_matrix" for a CSV file (a kind of data source), or even a DataFrame (a vector of vectors of differing eltypes)? There is no matrix to access!
Perhaps there is some misunderstanding about the sense in which Tables.jl is a "wrapper". Tables.jl does not wrap data in a new object. Rather, it provides methods for accessing all tabular data objects in the same way.
> Perhaps there is some misunderstanding about the sense in which Tables.jl is a "wrapper". Tables.jl does not wrap data in new object. Rather, it provides methods for accessing all tabular data objects in the same way.
Sorry for creating confusion, @ablaom. It is perhaps created by me referring ambiguously to "Tables.jl in its current state" and "abstract iterator interfaces in general".
A clean tabular data container interface usually provides a front-end, i.e., a user interface or access layer, and a back-end, i.e., a data interface or storage layer. As far as I understand, Tables.jl provides a unified set of access patterns for the front-end, and by proxy implements some back-ends, through "dispatch-on-back-end-storage" (in the integrated packages). Is this understanding of mine correct?
If so, given the nice front-end, I don't see what would stop you from writing methods dispatching on matrices, or "my matrix plus schema" objects. Similarly, you could implement a back-end data container where internally you can switch matrix representation vs table representation on and off (via the "flag"). But none of these exist.
So the wrapper I meant is not the accessor patterns in Tables.jl, but a purpose-built data structure which could be accessed via the patterns - does that make more sense?
I.e., DataFrames.jl or similar could provide such functionality; at least there needs to be some type-based case distinction between accessor and back-end, since a matrix itself doesn't store a schema.
It could be possible to add a trait to Tables.jl indicating whether a given table type can be efficiently indexed. If not, you could copy its contents to a type that fulfills this requirement (e.g. using `Tables.columnstable` or a plain matrix).
@nalimilan Is there not still the problem that Tables.jl does not provide fast random-access methods, even if the type supports it, because only iterators are returned? Or is this a planned feature enhancement?
That can probably be considered if it fills a need. Or maybe you can already just call `Tables.columnstable` on all tables, knowing that it's very cheap for `DataFrame` (since it makes no copy), and that you can rely on indexing it since it's defined to return a tuple of vectors. It will be costly for row-oriented tables (like CSV) since it will allocate new vectors, but that's the intended behavior AFAICT.
@nalimilan Thanks for that. I did not realise that `Tables.columnstable` essentially returns a view and not a copy where possible. This makes a big difference in those cases. So, if doing row selection, you would recommend first making a `Tables.columnstable` rather than a `Tables.rowstable` (my current naive implementation)? Perhaps you would be willing to review my current row/column selection add-on to Tables.jl? It is here; the methods are called `selectrows` and `selectcols`.
I must say, it would be easier for a Tables.jl user like me if Tables.jl provided a row/column/cartesian indexing option, so that I didn't have to think about such details. Is the reluctance to provide such functionality because the project has other priorities (perfectly understandable!), or is there some other objection to providing it? This functionality seems orthogonal to the existing functionality, as far as I can tell.
At this point it is difficult for me to say how much of a performance advantage I would get from Tables.jl providing the indexing feature. Mostly the benefit would be small, I'm sure. The cases I am thinking about are meta-algorithms like bagging and gradient boosting, which we would like to apply to arbitrary models (in external packages) and not just on a case-by-case basis. In these ensemble methods, one might have hundreds or thousands of "simple" learners which are extremely quick to train (e.g., a decision stump), and perhaps this training time is comparable to the time to access the data through "inefficient" indexing. (I put "inefficient" in quotes because I agree now this may or may not be so bad, depending on the supplied table type.) To avoid "inefficient" indexing, MLJ must cache the data or wrap tables in objects with a (possibly) duplicate storage layer, along the lines I think @fkiraly is suggesting above. Both options are complications to our design which, in my opinion, are out of place.
@fkiraly On reflection, I am in complete agreement with your objections to my "sophisticated" attempts to circumvent repetitive data coercions. All row/column access to data should be applied through the one interface (Tables.jl, or some functional or functional/object wrap). I will deprecate the `coerce` method: the `X` in a model `fit` method's signature is to be the generic table provided by the user, and any further data coercions (e.g., conversion to a matrix) are to be performed within `fit` itself.
> @nalimilan Thanks for that. I did not realise that Tables.columnstable essentially returns a view and not a copy where possible. This makes a big difference in those cases. So, if doing row selection, you would recommend first making a Tables.columnstable rather than a Tables.rowstable (my current naive implementation?)
AFAICT it makes sense to collect data to a `columnstable` if you need to access each row multiple times. Otherwise iterating over them will probably be better, at least for tables for which `columnstable` would make a copy. But this should be benchmarked to be certain.
> Perhaps you would be willing to review my current row/column selection add-on to Tables.jl? It is here; the methods are called `selectrows` and `selectcols`.
I've looked at it quickly, and it looks good to me. Some features sound like they could be moved to CategoricalArrays (like `transform`/`inverse_transform`) or to Tables.jl in one form or another.
> I must say, it would be easier for a Tables.jl user like me if Tables.jl provided row/column/cartesian indexing option, so that I didn't have to think about such details. Is the reluctance to provide such functionality because the project has other priorities (perfectly understandable!) - or is there some other objection to providing it? This functionality seems orthogonal to the existing functionality, as far as I can tell.
I can't speak for @quinnj (better to comment at JuliaData/Tables.jl#48 for possible improvements to Tables.jl), but I can see at least two reasons. First, starting with iteration over rows and columns suits most use cases, so that's a good start. Second, efficient indexing isn't possible for all sources without making a copy, contrary to iteration, which works for all sources.
But that doesn't mean we shouldn't investigate the most appropriate way of dealing with indexing. It would be too bad if packages had to duplicate features internally to make it work.
Anyway, I think you should really benchmark the different options first. Without reliable information on performance, design discussions cannot be very precise. If you know what's the most appropriate storage format you would need Tables.jl to produce (which could possibly vary depending on whether the source is column-oriented or row-oriented), that would help in discussing concrete options.