Comments (24)

joaquinvanschoren avatar joaquinvanschoren commented on June 14, 2024

Version strings differ depending on the system used, e.g. 1.2.3, r1234, or 234.

I do agree that we should define this more clearly though. Suggestions?

from openml.

berndbischl avatar berndbischl commented on June 14, 2024

Maybe we should simply test it with a very weird name first and then see what bad things happen?
But my gut feeling is that stuff like in my example above should not be allowed for IDs.
Maybe: (both for "name" and "version")
[a-z] [A-Z] [0-9] [_ , -, .] ???

janvanrijn avatar janvanrijn commented on June 14, 2024

I will implement a check for this in the next API update.

berndbischl avatar berndbischl commented on June 14, 2024

Considering the current problems we encountered here

#20

I think defining a conservative naming scheme and checking for it on the server side is reasonable....

joaquinvanschoren avatar joaquinvanschoren commented on June 14, 2024

Well, the problem with #20 related to the description field, not the name field, but yes, a conservative naming scheme sounds like a good idea. Any suggestions?

berndbischl avatar berndbischl commented on June 14, 2024

For names I suggested above:

[a-z] [A-Z] [0-9] [_ , -, .]

But I think we should actually make a list first of where problems like this might occur. Basically all "free" user-provided text. I suppose there are then two (?) categories: a) stuff that becomes something like an ID / file name, etc.
b) text-like descriptions.

For a) I would try to be as conservative as possible, like the suggestion above. For b) I don't really know what exactly causes problems for you. The problem is also that b) will often come out of files / data / is generated, and I don't know how freely we can "throw stuff away" from it. Certainly not many users want to edit these text blobs manually if the server rejects them.

janvanrijn avatar janvanrijn commented on June 14, 2024

I think this is important for all the XSDs that validate uploaded content. I added two datatypes, oml:system_string and oml:simple_string.

oml:system_string allows [a-z] [A-Z] [0-9] [_ , -, .] and is applied to all fields where we want a high restriction level, e.g., because they can occur in URLs. Examples are implementation:name, implementation:version, etc.
oml:simple_string allows [a-z] [A-Z] [0-9] [_ , -, .], commas and white space. This is used for textual input where we want to restrict the input but do not need it to be URL friendly. implementation:creator and implementation:contributor are examples of such fields. We can extend the list of allowed characters even further.
All other fields (which are likely to accept machine generated content) are still xs:string.
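
For illustration, here is a minimal Python sketch of the two character classes described above. The actual XSD patterns are not shown in this thread, so the regular expressions, the function names, the choice of `\s` for "white spaces", and the rule that empty strings are rejected are all assumptions.

```python
import re

# Assumed equivalents of the character classes described in the comment;
# the real oml:system_string / oml:simple_string XSD patterns may differ.
SYSTEM_STRING = re.compile(r"[A-Za-z0-9_.\-]+")
SIMPLE_STRING = re.compile(r"[A-Za-z0-9_.\-,\s]+")


def is_system_string(value: str) -> bool:
    """True if value uses only letters, digits, underscore, dash and dot."""
    return SYSTEM_STRING.fullmatch(value) is not None


def is_simple_string(value: str) -> bool:
    """True if value additionally uses only commas and whitespace."""
    return SIMPLE_STRING.fullmatch(value) is not None
```

Under these assumptions, `is_system_string("r1234")` holds while `is_system_string("my flow")` does not, because of the space.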

However, we can compile a list of characters which are allowed on OpenML. The workbenches are then responsible for checking for characters outside this list, and removing them (or replacing them) before uploading the XML. Of course, this happens without the user noticing. Any suggestions on this are welcome.
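
A rough sketch of what such a client-side cleanup could look like in a workbench, assuming the system_string character set and an underscore as the replacement character. Both the function name and the replacement policy are hypothetical; nothing in this thread fixes them.

```python
import re

# Assumption: every character outside the system_string set gets replaced.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_.\-]")


def sanitize_system_string(value: str, replacement: str = "_") -> str:
    """Replace each disallowed character with the replacement string."""
    return _DISALLOWED.sub(replacement, value)
```

Replacing rather than dropping characters keeps the result the same length as the input, but two different inputs can still map to the same cleaned name, so a workbench would probably need to handle such collisions.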

berndbischl avatar berndbischl commented on June 14, 2024

a) I think it is a good idea to compile such a list, as you are doing above. I hope for simplicity's sake we can keep the type definitions simple.

b) Why do we need to restrict a field like "creator"? Just a question out of curiosity. Also, many people in non-English-speaking countries have non-ASCII characters in their names.

c) For "free text fields" like "description" and such:
Have you already noticed that some input created problems on the server, even though it was a valid xs:string?

janvanrijn avatar janvanrijn commented on June 14, 2024

What I have done for now: I implemented the system_string datatype, which restricts a string to alphanumeric characters plus underscore, dash and dot. system_string is used for name, version and other very obvious fields (md5 hashes); all other fields are unchanged.

a) I agree on that, but I think that for the time being we are good.

b and c) From what I understand, you are hesitant to restrict the input of these fields, because it can give users (or the automated system) a hard time getting the format correct. I myself have not encountered any problems so far, but I can see that the more we restrict this, the more likely it is that these problems occur.

berndbischl avatar berndbischl commented on June 14, 2024

I asked about creator names because of stuff like German umlauts, French accents and so on. I am perfectly fine with that not being possible now - no umlauts in my name :).

And about problems with weird characters in text descriptions I simply asked out of curiosity to understand our current system better. I thought you already had to clean up some descriptions of UCI data sets because they caused problems? Or am I wrong? If not, what exactly was the problem?

joaquinvanschoren avatar joaquinvanschoren commented on June 14, 2024

We previously had problems because of HTML tags in the textual descriptions.

If we can process everything in utf-8 that would be preferable, but I agree
it's not that urgent.

Cheers!
Joaquin
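
The thread does not say how the HTML-tag problem was handled on the server. As one possible approach, a client could escape markup in a description before embedding it, while leaving UTF-8 text untouched. The helper name below is hypothetical; `html.escape` is from the Python standard library.

```python
from html import escape


def escape_description(text: str) -> str:
    """Escape &, <, > and quotes so tags are treated as plain text."""
    return escape(text, quote=True)


print(escape_description("<b>UCI</b> & more"))
# prints: &lt;b&gt;UCI&lt;/b&gt; &amp; more
```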

mfeurer avatar mfeurer commented on June 14, 2024

Is there a reason why brackets in flow names are forbidden? My use case is that I want to store a pipeline in scikit-learn, and want to add the components to the name of the flow to distinguish between similar pipelines, for example:

  • sklearn.grid_search.RandomizedSearchCV(sklearn.pipeline.Pipeline(sklearn.preprocessing.data.StandardScaler,sklearn.pipeline.FeatureUnion(sklearn.preprocessing.data.PolynomialFeatures,sklearn.decomposition.pca.PCA),sklearn.ensemble.weight_boosting.AdaBoostClassifier(sklearn.tree.tree.DecisionTreeClassifier)))
  • sklearn.grid_search.RandomizedSearchCV(sklearn.pipeline.Pipeline(sklearn.preprocessing.data.StandardScaler,sklearn.pipeline.FeatureUnion(sklearn.preprocessing.data.PolynomialFeatures,sklearn.decomposition.pca.PCA),sklearn.ensemble.weight_boosting.RandomForestClassifier))

I want to use brackets here instead of the underscores used for WEKA, because the flows contain nested components.
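
To make the naming scheme concrete, here is a small sketch of how such bracketed names could be generated from a nested component tree. The `Component` class and `flow_name` function are hypothetical helpers for illustration, not part of scikit-learn or the OpenML API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """Hypothetical stand-in for an estimator and its sub-components."""
    name: str
    children: List["Component"] = field(default_factory=list)


def flow_name(component: Component) -> str:
    """Render a component as name(child1,child2,...), recursively."""
    if not component.children:
        return component.name
    inner = ",".join(flow_name(child) for child in component.children)
    return f"{component.name}({inner})"


pipeline = Component("Pipeline", [
    Component("StandardScaler"),
    Component("AdaBoostClassifier", [Component("DecisionTreeClassifier")]),
])
print(flow_name(pipeline))
# prints: Pipeline(StandardScaler,AdaBoostClassifier(DecisionTreeClassifier))
```

Note that this format needs both brackets and commas in the name.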

janvanrijn avatar janvanrijn commented on June 14, 2024

I updated the XSD; ( and ) are now allowed.

mfeurer avatar mfeurer commented on June 14, 2024

Thanks for your fast reply. I just tried this on the test server, it still works with plain strings like dummy, but not with complicated strings as shown above.

joaquinvanschoren avatar joaquinvanschoren commented on June 14, 2024

Jan, did you perhaps only push the change to the production server?

mfeurer avatar mfeurer commented on June 14, 2024

@joaquinvanschoren @janvanrijn is there any news on this?

janvanrijn avatar janvanrijn commented on June 14, 2024

Yes, I pushed this to the production server.

Now also on test server.

mfeurer avatar mfeurer commented on June 14, 2024

It works now, thanks a lot

mfeurer avatar mfeurer commented on June 14, 2024

Sorry to open this again, but could you please also allow commas to be part of the name?

janvanrijn avatar janvanrijn commented on June 14, 2024

Not so sure actually. Flow names are defined to be URL safe, and according to the HTTP specification commas cannot occur in a URL.

Need to check many things internally before being able to change this
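
For reference, this is what a name with brackets and commas looks like once percent-encoded with Python's standard `urllib.parse.quote`. Whether the OpenML server would accept or decode such encoded names is not stated in this thread; the function name is a hypothetical helper.

```python
from urllib.parse import quote, unquote


def encode_flow_name(name: str) -> str:
    """Percent-encode every character except letters, digits and _.-~"""
    # safe="" means that no reserved character is left unencoded.
    return quote(name, safe="")


encoded = encode_flow_name("Pipeline(A,B)")
print(encoded)           # prints: Pipeline%28A%2CB%29
print(unquote(encoded))  # prints: Pipeline(A,B)
```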

mfeurer avatar mfeurer commented on June 14, 2024

I didn't know that. I'll think about a different solution for now. Is it possible to have this in the future or do you think this would be too much work?

janvanrijn avatar janvanrijn commented on June 14, 2024

The answer wasn't a "no"!

It was more a "definitely not now"

mfeurer avatar mfeurer commented on June 14, 2024

Sure, but I need a solution as soon as possible, and it didn't sound like 'not now' would be before the 10th of May ;)

mfeurer avatar mfeurer commented on June 14, 2024

Brackets are actually working now.
