
Comments (19)

ltorgo commented:

A few comments here. To me what one uploads from R is a source file,
even if it contains a simple script calling a standard function in some
R package. Uploading the package makes no sense, I agree. In my opinion
this is not different from the other tools. For instance, a knime node
for me is equivalent to an R language function. You do not upload these
functions as you do not upload knime nodes. What you upload are knime
workflows, which basically are sequences of knime nodes. So what you
upload in R is the same, a sequence of R language function calls, i.e. a
script. If a script is simply a single function call, then one may wonder whether it makes sense to share it at all; so I would say that "interesting" scripts that are worth sharing will typically do a bit more than call an out-of-the-box algorithm function that is already available in some R package.

Luis

On 22-08-2013 01:48, berndbischl wrote:

When thinking about uploading our first experiments, I noticed that
sometimes I do not want to upload either a source file or a binary file.

This mainly concerns applying "standard methods" from libraries. E.g.,
when I apply the libsvm implementation in the R package e1071, I only
need to know the package name and the version number. Uploading the
package itself (in binary or source form) makes no sense; it is already
hosted on the official CRAN R package server.

I /could/ upload a very short piece of code that uses this package and
produces the desired predictions. Actually, there are a few more subtle
questions involved here, and it might be easier to discuss them briefly
on Skype; I would like to hear your opinions on this.

The question basically is how much we want to enable users who
download implementations to rerun the experiments in a convenient fashion.



joaquinvanschoren commented:

I have been thinking about this as well. Here is what I propose:

  1. Currently, a dataset/implementation upload requires a file (the dataset itself, or the implementation source/binary). This does not always make sense. Sometimes the code or dataset is hosted elsewhere (CRAN, SourceForge,...), or the dataset is too big to store, or the dataset comes from a webservice, or the user only wants to host the implementation/dataset on her own server. In all those cases, it should be possible to just provide a url. The server should then check that it is a valid resource and compute a checksum so it can detect when the version changes.

  2. When your run uses a standard library method (e.g. libsvm from package X), you upload it as an implementation with name, version, description and a url. That url could, for instance, point to a file on CRAN.

  3. The more I think about it, the more the implementation name-version combos seem like a bad idea. Maybe implementations should just have a numerical id (1,2,3), like the datasets, plus a separate name and version. You then check whether an implementation exists by giving a name and version, or just by giving the id number. Storing implementations under a simple id also helps with fool-proofing. Sometimes a method version remains unchanged while the library changes, which may actually produce a different result; thus, a new implementation id should be created if the library version changes. Also, a user may have changed the code without changing the version number. We could catch that by computing a checksum on the source/binary: if the checksum changes, it should also produce a new implementation id.

  4. The call you do to start a procedure is not something we currently store, but I think this is valuable information for novice users who are less familiar with the libraries/tools used, or when we want to automatically rerun experiments (not a current requirement but good if we store the information for that). My proposal is to NOT upload a new implementation for that, but to add a new optional attribute to the run upload, say 'start_command', where you state how you start the procedure. This should probably not be required, as I can imagine situations where there is no call (e.g. tools without a CLI).

This implies the following changes:

  • Remove the requirement to upload a dataset file or implementation source/binary, and add the possibility to supply a url.
  • Extend the run upload with an optional 'start_command', next to the implementation id.
  • Replace the implementation ids with numerical ids. API calls should exist that return an implementation id based on name, version, and file/url. If the implementation exists (unchanged), the existing id will be returned; otherwise a new id will be generated (see the sketch below).
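
To make the checksum-based id handling from point 3 and the last bullet concrete, here is a minimal R sketch of the resolution rule the server could apply. The function name, the registry layout and the id rule are assumptions made for this discussion, not an existing OpenML API; only tools::md5sum() is a real R function.

    # Illustrative sketch only: how the server could map (name, version, checksum)
    # to a numeric implementation id. The function and the 'registry' data frame
    # are assumptions for this discussion, not an existing OpenML API.
    resolve_implementation_id <- function(name, version, source_file, registry) {
      checksum <- unname(tools::md5sum(source_file))  # checksum of the uploaded source/binary
      hit <- registry$id[registry$name == name &
                         registry$version == version &
                         registry$checksum == checksum]
      if (length(hit) == 1) {
        return(list(id = hit, registry = registry))   # unchanged upload: reuse the existing id
      }
      new_id <- if (nrow(registry) == 0) 1L else max(registry$id) + 1L
      registry <- rbind(registry,
                        data.frame(id = new_id, name = name,
                                   version = version, checksum = checksum,
                                   stringsAsFactors = FALSE))
      list(id = new_id, registry = registry)          # changed or new: a fresh id is generated
    }

The same rule covers both cases from point 3: a silently changed source file produces a new checksum, and therefore a new id, even if the user kept the version number.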

What do you think? Does this solve our current/future problems?

Cheers,
Joaquin


joaquinvanschoren commented:

Any comments? Do you think we should move in this direction?


berndbischl commented:

Hi,

Apologies for the delay regarding this important issue.

I mainly agree with what you write, especially about linking to publicly available stuff on servers.
As a side note: you already uploaded stuff on OpenML which is simply available as part of WEKA, right? Just asking, not a criticism.

But here is my main point, and I want to discuss this properly before we start changing things:

What is actually the formal idea of the thing we call an "implementation"?
(For now, think about our current classification/regression tasks.)

Here are three options:

a) Can be anything, as long as it gives other users an idea what happened in the experiment.

b) It is a machine learning algorithm. This is then an "object" that has a training and a prediction method.

c) It is a workflow that takes an openml task, resamples it and produces the openml output.
So this is then like a function: task.id, parameters --> resampled run result

The documentation seems to be torn between b) and c):
"An implementation can be a single algorithm or a composed workflow."


joaquinvanschoren commented:

As a side note: You already uploaded stuff on OpenML which is simply
available as a part of WEKA right? Just asking, not a criticism.

Yes, I just registered the algorithm used and gave the correct version of
the weka jar as the binary. That way the results are linked to the
algorithm name, and people can download that version of weka to repeat the
experiment.

What is actually the formal idea of the thing we call "implementation".

I see your point. I would say the implementation is the piece of software
that produces the output, but separate from the evaluation procedure. Does
that make sense? As such, it should not be the OpenML workflow/script
itself. It should be named after the procedure that actually produces the
output, e.g. 'R.libsvm', not 'openmlworkflow123'.

I should be able to run the implementation with the same evaluation
procedure and get the same result, but I should also be able to run the
same implementation in a different evaluation procedure.

If you believe we should also store the OpenML-wrapper, I would store that
separately?

That, at least, is how I have done it until now.
What do you think?

Cheers,
Joaquin


berndbischl commented:

Yes, I just registered the algorithm used and gave the correct version of
the weka jar as the binary. That way the results are linked to the
algorithm name, and people can download that version of weka to repeat the
experiment.

But later on we would encourage people to simply provide a URL to that WEKA release on the official WEKA page, right?

I see your point. I would say the implementation is the piece of software
that produces the output, but separate from the evaluation procedure.
Does that make sense? As such, it should not be the OpenML workflow/script
itself. It should be named after the procedure that actually produces the
output, e.g. 'R.libsvm', not 'openmlworkflow123'.

I should be able to run the implementation with the same evaluation
procedure and get the same result, but I should also be able to run the
same implementation in a different evaluation procedure.

The 2nd paragraph helped a bit, but I am still unsure what I should produce in R, especially w.r.t. your last sentence above. Could we discuss this briefly on Skype next week? I assume that would work a lot better and have less potential for misunderstandings.

Also: could you please send me some kind of screenshot of a workflow in WEKA that you would upload?

If you believe we should also store the OpenML-wrapper, I would store that
separately?

This is closely connected to the points above; let's postpone it for now.


joaquinvanschoren commented:

Sorry about the slow reply, many deadlines :/.

I guess a Skype call is a good idea. Do you want to do it this afternoon? Otherwise Wednesday or Thursday is also ok for me. I guess this is just between you, me and Jan? If others want to join, let me know.

Jan, what exactly are you doing in WEKA/RapidMiner? Do you upload the whole workflow (including OpenML operators) or only the workflow/algorithm that actually produces the predictions given train/test data?

Cheers,
Joaquin


joaquinvanschoren commented:

Sorry, I assume Luis would also be interested?

Cheers,
Joaquin


berndbischl commented:

Sorry, I hurt my back somewhat over the last few days.
I would prefer to Skype next week, preferably rather late in the afternoon / evening.
Monday till Wednesday are good, Thursday and Friday are not.


joaquinvanschoren commented:

Get well soon! Is Monday at 17h CET ok?

Cheers,
Joaquin


berndbischl commented:

Get well soon! Is Monday at 17h CET ok?

Yes. Noted.


joaquinvanschoren commented:

Bernd and I discussed this issue on Skype today, and I think it is important that we all at least think about this briefly.

We need to clearly state what we expect to be uploaded as an implementation. It must be easy for developers to upload what they have developed, but it should also be easy for people who discover a nice implementation on OpenML to download and use it. There should be as little second guessing as possible.

The basic signature of an implementation (= algorithm, script, workflow, ...) could be simply the following:

implementation(openmltask, parameters...) -> expected outputs

Here, 'openmltask' is a language-specific object that represents an OpenML task. We can provide helper functions (for R, Java, Python,...) or workflow operators that create such a task object given a task_id by talking to the OpenML API and downloading the necessary data. Thus something like: createOpenMLTask(task_id) -> openmltask.

How we build that openmltask and how we send the results back is thus NOT part of the uploaded implementation. For workflows, this means uploading the subworkflow between the 'import OpenML task' and 'export OpenML result' operators. For R this means uploading the function that consumes the openmltask and produces the required output. Does this seem feasible and practical?

Note that this allows you to create 'custom' openmltasks that do NOT belong to a task_id, by writing your own function/operator, e.g. createOpenMLTask(input1, input2,...) -> openmltask

This can be useful when you have proprietary data that you don't want to upload, but still you may want to run experiments in the same fashion (e.g. with the same cross-validation folds) as other OpenML experiments. Or maybe you want to experiment with new task types. You won't be able to upload the results of these experiments though, not until the dataset or task_type is added to OpenML and their task_ids are generated. Still, it allows you to experiment freely beyond the tasks that we provide.
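
As a minimal R sketch of such a 'custom' task: the exact structure of an openmltask object is not fixed anywhere in this thread, so the field names below (data, target, folds) are assumptions chosen purely for illustration.

    # Hypothetical shape of an 'openmltask' object; the field names are assumptions.
    createOpenMLTask <- function(data, target, folds) {
      stopifnot(target %in% names(data), length(folds) == nrow(data))
      list(data = data, target = target, folds = folds)
    }

    # A 'custom' task built from local (possibly proprietary) data, using the same
    # kind of cross-validation folds a downloaded OpenML task would provide:
    set.seed(1)
    custom_task <- createOpenMLTask(
      data   = iris,
      target = "Species",
      folds  = sample(rep(1:10, length.out = nrow(iris)))
    )

A helper of the form createOpenMLTask(task_id) would fill the same fields by downloading them from the server instead.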

You can also provide a list of parameters that belong to your implementation. Say you have a new SVM implementation; it might look like this:

mySVMWorkflow(openmltask, C, sigma)

As such, you can run the same implementation with many different parameter settings (as long as you externalise them). The parameter values must be sent along when you upload your run. If you decide to externalise/add a new parameter, it should be handled as a new implementation since the signature has changed.
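
Continuing the sketch above, here is what such a parameterised implementation could look like in R, using the e1071 SVM mentioned earlier in the thread. The task structure, the output format and the mapping of sigma onto e1071's gamma argument are all illustrative assumptions, not an OpenML convention.

    library(e1071)

    # Hypothetical implementation with externalised parameters C and sigma:
    # implementation(openmltask, parameters...) -> expected outputs.
    mySVMWorkflow <- function(openmltask, C = 1, sigma = 0.1) {
      form <- as.formula(paste(openmltask$target, "~ ."))
      per_fold <- lapply(sort(unique(openmltask$folds)), function(k) {
        test  <- openmltask$folds == k
        model <- svm(form, data = openmltask$data[!test, ],
                     cost = C, gamma = sigma, kernel = "radial")
        data.frame(fold       = k,
                   row_id     = which(test),
                   prediction = predict(model, openmltask$data[test, ]))
      })
      do.call(rbind, per_fold)  # the predictions to be uploaded as a run
    }

    # Same implementation, different parameter settings -- only the run changes:
    # run1 <- mySVMWorkflow(custom_task, C = 1,  sigma = 0.1)
    # run2 <- mySVMWorkflow(custom_task, C = 10, sigma = 0.05)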

When you want to repeat an OpenML experiment, you download the implementation and the helper functions/operators, and you start it again on the task_id (and parameter setting) in question. Or you can run it with other parameters or on your 'custom' task.

I think we should still provide a way to indicate when you are simply wrapping a library function (e.g. WEKA's J48). Also, when you upload a workflow it should be clear which environment you need to run it. Should we add a (list of) dependencies together with the uploaded implementation, with urls where you can download them?

Please let us hear what you think. We should decide upon this fairly quickly.

Thanks,
Joaquin

PS. For now, this is a recommendation (a best practice); we won't be blocking implementations that look different just yet.


berndbischl commented:

I think we should still provide a way to indicate when you are simply wrapping a library function (e.g. WEKA's J48).

Have a special attribute in the implementation xml that distinguishes between "custom workflows" and "library algorithms"?

Also, when you upload a workflow it should be clear which environment you need to run it.
Should we add a (list of) dependencies together with the uploaded implementation,
with urls where you can download them?

I thought we had this already? The "dependencies" element in the XSD for implementations?


dominikkirchhoff commented:

I have a question: what if I'm a lazy user and I want to run (the newest version of) a certain standard method? Do I have to 'upload' it just to check whether it's already there and get an id, or will there be a way to ask the server for all ids for a given name?

The question is: Can I get the results of the SQL query

'SELECT id FROM implementation WHERE name = "classif.rpart"'

without going to the website and typing it manually?


janvanrijn commented:

Yes, you can. There is a query API (mostly used by the frontend) that accepts any query and returns some sort of JSON answer. The parameter q is the query (preferably URL-encoded).

In your case this would be
http://www.openml.org/api_query/?q=SELECT%20id%20FROM%20implementation%20WHERE%20name%20=%20%22classif.rpart%22

Please let me know if you are interested in something more robust, integrated in the current API.
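
For what it's worth, this lookup could also be scripted from R against the same query endpoint. The sketch below assumes the jsonlite package is available; since the exact layout of the JSON answer is not specified here, it only fetches and inspects the response.

    library(jsonlite)  # assumption: any JSON parser would do

    # Build and URL-encode the query, as recommended above.
    query <- 'SELECT id FROM implementation WHERE name = "classif.rpart"'
    url   <- paste0("http://www.openml.org/api_query/?q=",
                    URLencode(query, reserved = TRUE))

    # Fetch the answer and look at its structure; the layout is not documented here.
    response <- fromJSON(url)
    str(response)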


berndbischl commented:

I think we should really start defining what we formally mean by an implementation quite soon. Currently I see these scenarios:

  • a library-provided algorithm that the user simply applied. Maybe he changed some parameters.
  • a combination of library-provided pieces that the user chose to put together. E.g. he used a feature filter, then an SVM.
  • a library-provided piece, extended with code from the user. E.g. he wrote his own preprocessing, then applied an SVM.
  • a completely self-written, custom algorithm

Note, these are just some things I came up with in a few seconds of brainstorming; it is not supposed to be a formal ontology.

Also, we need to define what we really mean by the version number of an algorithm. All of these definitions, and possibly other material relating to this issue, need to be documented in one place for both developers and uploaders.


joaquinvanschoren commented:

I prefer one simple definition, without many different scenarios. It needs
to be easily explained to users. I remember we went back and forth between
different options, but did we actually settle on a final solution? In any
case, I've given it some thought, and here is a proposal:

Uploading

The cleanest solution would be to just upload the code that actually
produces the results. For R, that means the script that reads the task,
does whatever it wants and then uploads the results. For RM/KNIME, it is
the workflow that includes operators/nodes for reading the task and
uploading the result. The WEKA Experimenter is a bit special; probably the
best solution is to upload a Java wrapper that starts the experiment (even
if originally run from the GUI).

In all cases, the task id is an input, next to other inputs (parameter
settings). The implementation description should include (I believe this is
covered already; a small sketch follows the list):

  • Name and version
  • Description of input parameters (including the task_id)
  • Textual description for the user
  • Longer textual description for indexing
  • A list of dependencies (necessary libraries and their versions). Also
    mention the workbench version you need.
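
Purely to make these fields concrete, an uploader could collect that metadata in R as below before wrapping it in the description file. The field names are a sketch for this discussion, not the actual implementation XSD; the version constraints are placeholder values.

    # Illustrative only: the proposed metadata as a plain R list.
    implementation_description <- list(
      name         = "mySVMWorkflow",
      version      = "1.0",
      parameters   = list(task_id = "integer; the OpenML task to run on",
                          C       = "numeric; SVM cost",
                          sigma   = "numeric; RBF kernel width"),
      description  = "SVM (e1071) applied to an OpenML task; returns per-fold predictions.",
      full_text    = "Longer description used for indexing and search.",
      dependencies = c("R (>= 3.0.0)", "e1071 (>= 1.6-1)")
    )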

A run is a file with results linked to the implementation_id and task_id
(both returned by the server) and any additional parameter settings that
are passed as inputs to the implementation.

Downloading

The user who downloads the implementation should be able to easily run it
on a new task. In the simplest case, she just inputs a new task_id and runs
it. If there are other parameters, it should be clear from the
implementation description how to set them.

Searching

Here is what I was most worried about during previous discussions: how can
I search for the experiments with, e.g., libsvm, and how do I compare libsvm
with other algorithms? I must also know which versions of libsvm are used.
The current way of doing it (tagging each implementation with a general
name for the learning algorithm) is probably untenable.

The simplest thing to do would be to just 'dump' the workflow/script to
text and index that. If I then search for 'libsvm', I will find all
implementations, i.e. scripts/workflows (I still struggle with the term
'implementations' :)), that somehow mention libsvm. It is then up to the
user to decide which implementations to select.

Additionally, I do see some benefit in using a more structured description
for implementations:

  • Include an optional tag that says 'this is a simple wrapper for library
    algorithm XXX'
  • A list of operators or function calls. This is easy enough for the
    workflows, a bit harder for R scripts I presume.

We can then build a more powerful structured search and better 'present'
the implementation to the user online.

Caveat: in theory, you could write an implementation that takes an
algorithm name as an input parameter. Not sure how to make that searchable.

Versioning

About the version numbers: what is the question exactly? I believe it is
best if the user can choose a name and version number during upload, but we
always keep a checksum on the server so that no two different
implementations are uploaded with the same name/version. On the server, we
store implementations based on a unique numeric id, just like datasets.
This id is referenced when you upload runs, and we offer an API call to get
an id given a name/version combo.

Does that sound like a clear description? Maybe I slightly differ from the
current specs.

Let me know.

Cheers,
Joaquin


berndbischl commented:

This thread has become increasingly complicated. Let's keep it open but discuss it very soon on Skype.
Very soon = some time before Christmas.


joaquinvanschoren commented:

I agree, this thread has gone off-topic, so I will close it and open a new one with the conclusion of our last Skype call.
