Comments (24)
Version formats differ depending on the system used, e.g. 1.2.3, r1234, or 234.
I do agree that we should define this more clearly though. Suggestions?
from openml.
Maybe we should simply test it with a very weird name first and then see what bad things happen?
But my gut feeling is that stuff like in my example above should not be allowed for IDs.
Maybe: (both for "name" and "version")
[a-z] [A-Z] [0-9] [_ , -, .] ???
from openml.
I will implement a check for this in the next API update.
from openml.
Considering the current problems we encountered here
I think defining a conservative naming scheme and checking for it on the server side is reasonable....
from openml.
Well, the problem with #20 related to the description field, not the name field, but yes, a conservative naming scheme sounds like a good idea. Any suggestions?
from openml.
For names I suggested above:
[a-z] [A-Z] [0-9] [_ , -, .]
But I think we should first make a list of where problems like this might occur: basically all "free" user-provided text. I suppose there are then two (?) categories:
a) stuff that becomes something like an ID, a file name, etc.
b) text-like descriptions.
For a) I would try to be as conservative as possible, like the suggestion above. For b) I don't really know what exactly causes problems for you. Another problem is that b) will often come out of files or data, or be generated, and I don't know how freely we can "throw stuff away" from it. Certainly not many users want to edit these text blobs manually if they fail validation on the server.
from openml.
I think this is important for all the XSDs that validate uploaded content. I added some datatypes, oml:system_string and oml:simple_string.
oml:system_string allows users to insert [a-z] [A-Z] [0-9] [_ , -, .] and is applied to all fields where we want a high restriction level, e.g., because these can occur in URLs. Examples are implementation:name, implementation:version, etc.
oml:simple_string allows [a-z] [A-Z] [0-9] [_ , -, .], commas and white space. This is used for textual input where we want to restrict the input but do not need it to be URL friendly. implementation:creator and implementation:contributor are examples of such fields. We can extend the list of allowed characters even further.
All other fields (which are likely to accept machine generated content) are still xs:string.
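As a rough illustration of the two datatypes described above, here is a minimal Python sketch. The regular expressions are an assumption derived from the character lists in this comment; the actual XSD patterns are not shown in this thread, and the function names are hypothetical.

```python
import re

# Assumed patterns, reconstructed from the character lists in the comment above.
# The real XSD definitions of oml:system_string / oml:simple_string may differ.
SYSTEM_STRING = re.compile(r"[A-Za-z0-9_.\-]+")
SIMPLE_STRING = re.compile(r"[A-Za-z0-9_.\-,\s]+")


def is_system_string(value: str) -> bool:
    """True if value contains only letters, digits, underscore, dash and period."""
    return SYSTEM_STRING.fullmatch(value) is not None


def is_simple_string(value: str) -> bool:
    """Like is_system_string, but commas and white space are also accepted."""
    return SIMPLE_STRING.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["1.2.3", "r1234", "my flow", "Jane Doe, John Smith"]:
        print(candidate, is_system_string(candidate), is_simple_string(candidate))
```

Under these assumed patterns, the version strings mentioned at the top of the thread pass the stricter check, while anything containing a space or comma only passes the looser one.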
However, we can compile a list of characters which are allowed on OpenML. The workbenches are then responsible for checking for characters outside this list and removing (or replacing) them before uploading the XML. Of course, this happens without the user noticing. Any suggestions on this are welcome.
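The client-side cleanup described here could look roughly like the sketch below. It assumes the restrictive character set discussed in this thread (letters, digits, underscore, dash, period); the function name and the choice of "_" as the replacement are hypothetical, not something OpenML or any workbench is confirmed to use.

```python
import re

# Anything outside the assumed allowed set: letters, digits, "_", "-", ".".
_DISALLOWED = re.compile(r"[^A-Za-z0-9_.\-]")


def sanitize_system_string(value: str, replacement: str = "_") -> str:
    """Replace every disallowed character with `replacement` before upload."""
    return _DISALLOWED.sub(replacement, value)


if __name__ == "__main__":
    print(sanitize_system_string("my flow (v2)"))
    print(sanitize_system_string("1.2.3"))
```

Replacing rather than dropping characters keeps the length and rough shape of the name, but two different inputs can map to the same output, so a workbench relying on this would still need to handle collisions.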
from openml.
a) I think it is a good idea of compiling such a list like you are doing above. I hope for simplicity's sake we can keep the type definitions simple.
b) Why do we need to restrict a field like "creator"? Just a question out of curiosity. Also, many people in non-English-speaking countries have non-ASCII characters in their names.
c) For "free text fields" like "description" and such:
Have you already noticed that some input created problems on the server, even though it was a valid xs:string?
from openml.
What I have done for now: I implemented the system_string datatype, which restricts a string to alphanumeric characters plus underscore, dash and period. The system_string is used for fields like name and version and for other very obvious fields (md5_hashes); all other fields do not use it.
a) I agree on that, but I think that for the time being we are good.
b and c) From what I understand, you are hesitant to restrict the input of these fields because it can give users (or the automated system) a hard time getting the format correct. I myself have not encountered any problems so far, but I can see that the more we restrict this, the more likely it is that these problems occur.
from openml.
I asked about creator names because of stuff like German umlauts, French accents and so on. I am perfectly fine with that not being possible now - no umlauts in my name :).
And about problems with weird characters in text descriptions I simply asked out of curiosity to understand our current system better. I thought you already had to clean up some descriptions of UCI data sets because they caused problems? Or am I wrong? If not, what exactly was the problem?
from openml.
We previously had problems because of HTML tags in the textual descriptions.
If we can process everything in utf-8 that would be preferable, but I agree
it's not that urgent.
Cheers!
Joaquin
(In reply to berndbischl's comment above, written on Friday, 13 September 2013.)
from openml.
Is there a reason why brackets in flow names are forbidden? My use case is that I want to store a pipeline in scikit-learn, and want to add the components to the name of the flow to distinguish between similar pipelines, for example:
- sklearn.grid_search.RandomizedSearchCV(sklearn.pipeline.Pipeline(sklearn.preprocessing.data.StandardScaler,sklearn.pipeline.FeatureUnion(sklearn.preprocessing.data.PolynomialFeatures,sklearn.decomposition.pca.PCA),sklearn.ensemble.weight_boosting.AdaBoostClassifier(sklearn.tree.tree.DecisionTreeClassifier)))
- sklearn.grid_search.RandomizedSearchCV(sklearn.pipeline.Pipeline(sklearn.preprocessing.data.StandardScaler,sklearn.pipeline.FeatureUnion(sklearn.preprocessing.data.PolynomialFeatures,sklearn.decomposition.pca.PCA),sklearn.ensemble.weight_boosting.RandomForestClassifier))
I want to use brackets here instead of the underscores used for WEKA, because the flows contain nested components.
from openml.
I updated the XSD; ( and ) are now allowed.
from openml.
Thanks for your fast reply. I just tried this on the test server; it still works with plain strings like dummy, but not with complicated strings as shown above.
from openml.
Jan, did you perhaps only push the change to the production server?
(In reply to Matthias Feurer's comment above, written on Fri, Apr 29, 2016.)
from openml.
@joaquinvanschoren @janvanrijn is there any news on this?
from openml.
Yes, I pushed this to the production server.
Now also on test server.
from openml.
It works now, thanks a lot
from openml.
Sorry to open this again, but could you please also allow commas to be part of the name?
from openml.
Not so sure, actually. Flow names are defined to be URL safe, and according to the HTTP specification commas cannot occur in a URL.
I need to check many things internally before being able to change this.
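To make the state of the discussion concrete, the sketch below checks flow names against an assumed pattern: the restrictive character set from earlier in the thread plus "(" and ")". The pattern, the function name, and the shortened example names are assumptions for illustration only. The last line shows percent-encoding with Python's standard `urllib.parse.quote`, which is the general mechanism for carrying a character such as a comma inside a URL component; whether OpenML's internals would accept encoded names is not established in this thread.

```python
import re
from urllib.parse import quote

# Assumed pattern: letters, digits, "_", "-", "." plus "(" and ")".
# Commas are deliberately not included, matching the situation described above.
FLOW_NAME = re.compile(r"[A-Za-z0-9_.\-()]+")


def is_valid_flow_name(name: str) -> bool:
    """True if the whole name consists only of the assumed allowed characters."""
    return FLOW_NAME.fullmatch(name) is not None


# Shortened variants of the pipeline names posted earlier in the thread.
nested = "sklearn.pipeline.Pipeline(sklearn.tree.tree.DecisionTreeClassifier)"
with_comma = (
    "sklearn.pipeline.Pipeline("
    "sklearn.preprocessing.data.StandardScaler,"
    "sklearn.decomposition.pca.PCA)"
)

if __name__ == "__main__":
    print(is_valid_flow_name(nested))       # nested brackets only
    print(is_valid_flow_name(with_comma))   # contains a comma
    print(quote(with_comma, safe=""))       # percent-encoded form
```

With this pattern a name that only nests components in brackets is accepted, while a name listing several components separated by commas is rejected.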
from openml.
I didn't know that. I'll think about a different solution for now. Is it possible to have this in the future or do you think this would be too much work?
from openml.
The answer wasn't a "no"!
It was more a "definitely not now"
from openml.
Sure, but I need a solution as soon as possible, and it didn't sound like 'not now' would end before the 10th of May ;)
from openml.
Brackets are actually working now.
from openml.