redpony / creg
Fast regression modeling framework
License: Apache License 2.0
Currently, responses are always atomic. However, for some applications it would be nice to represent labels (categorical responses) with limited structure such that features are extracted over parts of the structure, as well as the full label. We are still talking about making a single prediction, not structured prediction; the proposal is simply to enable a richer space of features over labels.
For example, in building a part-of-speech classifier, one could have features that score the full, fine-grained POS tag, as well as features that group together related tags into coarser categories to share statistical strength.
Define a composite label as a categorical response that is made up of multiple categorical parts, or components. The components could be characters in a string (such as a bit string), or in an explicit structure (such as a JSON data structure).
In the model, there will be a feature conjoining each input characteristic (percept) with the full (simple or composite) label. In addition, when a percept is scored with a composite label, a feature will fire for every component, conjoining the percept with that component. So if every label is a POS tag consisting of two components, a coarse component and a fine component, three features will fire for every percept: one with the coarse component, one with the fine component, and one with the full label.
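The feature-firing scheme above can be sketched as follows. This is a minimal illustration, not creg's actual internals: the function name and the `percept|part` string encoding are assumptions made for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: for a percept scored against a composite label, one
// feature conjoins the percept with the full label, and one feature fires
// per component, conjoining the percept with that component.
std::vector<std::string> FireFeatures(const std::string& percept,
                                      const std::string& full_label,
                                      const std::vector<std::string>& components) {
  std::vector<std::string> feats;
  feats.push_back(percept + "|" + full_label);   // percept x full label
  for (const std::string& c : components)
    feats.push_back(percept + "|part:" + c);     // percept x component
  return feats;
}
```

For the two-component POS tag example, `FireFeatures("suffix=s", "PN", {"P", "N"})` yields exactly the three features described: one for the full label and one per component.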
We assume the output space of the classifier will not be affected by the use of composite labels—only full labels (simple or composite) seen during training will be candidates for prediction.
Information about label structure could be (a) inferred automatically from the name of the label, (b) specified in the response file, in place of a single string name for the label, or (c) specified in some other file as a mapping of label names to richer structures. The interface proposed here will allow (a) or (b).
Let the option `--composite-labels [json|string] [positional|bag]` enable this feature:

- If `json` (the default format) is specified, all responses are read as JSON objects. Three response types are allowed: JSON strings, lists of strings, and maps from strings to strings. A JSON string is interpreted as a simple label; in a list of strings, each string is a component; and in a map, the key-value pairs are the components.
- If `string` is specified, all responses are read as unquoted strings and treated as composite; the components are the individual characters.
- If `positional` (the default ordering) is specified, sequential composite labels (the label name in `string` mode, lists in `json` mode) are treated as ordered slot-fillers; i.e., each component is conjoined with its offset in the sequence.
- If `bag` is specified, sequential composite labels are interpreted as bags of components; within a label, any repetition of a component triggers an error. JSON maps are always treated as bags of key-value pairs.

If all labels are length-2 POS tags like `NN` = noun singular, `NS` = noun plural, `PN` = pronoun singular, `PS` = pronoun plural, etc., the following are equivalent ways to specify the response:

- `PN` with `--composite-labels string positional` (note that `bag` would conflate the two possible uses of `N`!)
- `["P", "N"]` with `--composite-labels json positional`
- `{"coarse": "P", "fine": "N"}` with `--composite-labels json`

If all labels are fixed-length bitstrings, the following are equivalent:

- `01011` with `--composite-labels string positional`
- `["0", "1", "0", "1", "1"]` with `--composite-labels json positional`
- `{"0": "0", "1": "1", "2": "0", "3": "1", "4": "1"}` with `--composite-labels json`

If the labels are clusters of morphosyntactic attributes, then with `--composite-labels json bag` the two labels `["noun", "singular", "accusative"]` and `["verb", "past", "singular", "causative"]` would share one component: features associated with the `"singular"` component would fire for both.
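The difference between the two orderings can be sketched as a small decomposition routine. This is illustrative only (not creg's code); the `offset=value` encoding and the error convention are assumptions for the example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: decompose a sequential composite label into its
// component features. Under "positional", each component is conjoined with
// its offset; under "bag", a repeated component is an error (signaled here
// by returning an empty vector).
std::vector<std::string> Decompose(const std::vector<std::string>& parts,
                                   bool positional) {
  std::vector<std::string> comps;
  std::map<std::string, int> seen;
  for (size_t i = 0; i < parts.size(); ++i) {
    if (positional) {
      comps.push_back(std::to_string(i) + "=" + parts[i]);
    } else {
      if (++seen[parts[i]] > 1) return {};  // bag mode: repetition is an error
      comps.push_back(parts[i]);
    }
  }
  return comps;
}
```

So `Decompose({"0", "1", "0", "1", "1"}, true)` yields distinct slot-filler components like `0=0` and `2=0`, while bag mode on the same bitstring label fails because `"0"` repeats.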
Right now the weight vector is always a `vector<double>`, and for discrete regression (ordinal regression and multiclass logistic regression) the vector subscript is computed inline.

Suggested interface for the discrete weight vector: an "enhanced" vector in which the `()` operator is overloaded to take the class (label index) and the feature id as separate arguments. (Can this be done by subclassing or wrapping `vector` and still play nicely with the optimization routines?) Then `weights(k, fid)` would access the weight for a given class-feature pair, and `weights(k, fid, w)` would assign the value `w` to that weight. This should make working with the weights vector more intuitive.

An additional benefit is that the weights instance can store extra information, e.g. whether or not one of the K classes should be treated as a background class (which affects indexing into the vector).
Chris -- can you add a license? Thanks!
For feature engineering, it would be nice not to have to train on a different feature file for each combination of features. One solution would be to allow multiple feature files to be loaded for the same training instances (the instance IDs should prevent any ambiguity). (This has the advantage that features can be extracted in parallel.) Another would be a command-line regex option for features to ablate.
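The multiple-feature-file idea could amount to a merge keyed on instance ID, sketched below. The type aliases and function are hypothetical, assuming each file parses into a map from instance ID to a feature-value map:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch: merge several feature files over the same training
// instances, keyed by instance ID so there is no ambiguity about alignment.
typedef std::map<std::string, double> FeatureMap;            // feature -> value
typedef std::map<std::string, FeatureMap> InstanceFeatures;  // id -> features

void MergeFeatureFile(const InstanceFeatures& file, InstanceFeatures* merged) {
  for (const auto& inst : file)
    for (const auto& fv : inst.second)
      (*merged)[inst.first][fv.first] = fv.second;  // later files win on clashes
}
```

Since each file is keyed independently by instance ID, the files could also be extracted in parallel and merged afterwards.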
During learning, computing the loss and its gradient with respect to the parameters (especially with large numbers of training instances or features) can be quite expensive. OpenMP (http://openmp.org/wp/), which g++ supports by default, could easily be used to parallelize this computation. Basically, all loops of the form `for (unsigned i = 0; i < training.size(); ++i)` are good candidates for parallelization. From reading about OpenMP, such "reductions" will have to be implemented by creating one gradient buffer per thread and then summing the buffers at the end (although this summing could also be parallelized).
If the training arguments are `-x /dev/null -x train.txt`, creg immediately segfaults. (It is happy as long as the first `-x` argument is a non-empty file.)
Perhaps with scipy data structures: http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.html#scipy.sparse.dok_matrix
As of 2014/Mar/3 the README is very much out of date. As it stands, it is not possible to build from the most recent release.