redpony / creg
Fast regression modeling framework
License: Apache License 2.0
Currently, responses are always atomic. However, for some applications it would be nice to represent labels (categorical responses) with limited structure such that features are extracted over parts of the structure, as well as the full label. We are still talking about making a single prediction, not structured prediction; the proposal is simply to enable a richer space of features over labels.
For example, in building a part-of-speech classifier, one could have features that score the full, fine-grained POS tag, as well as features that group together related tags into coarser categories to share statistical strength.
Define a composite label as a categorical response that is made up of multiple categorical parts, or components. The components could be characters in a string (such as a bit string), or in an explicit structure (such as a JSON data structure).
In the model, there will be a feature conjoining each input characteristic (percept) with the full (simple or composite) label. In addition, when a percept is scored with a composite label, a feature will fire for every component, conjoining the percept with that component. So if every label is a POS tag consisting of two components, a coarse component and a fine component, three features will fire for every percept: one with the coarse component, one with the fine component, and one with the full label.
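The feature-firing scheme above can be sketched as follows. This is a minimal illustration, not creg's actual internals: the function name and the `percept|part` string encoding are assumptions made for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: for a percept scored against a composite label, one
// feature conjoins the percept with the full label, and one feature fires
// per component, conjoining the percept with that component.
std::vector<std::string> FireFeatures(const std::string& percept,
                                      const std::string& full_label,
                                      const std::vector<std::string>& components) {
  std::vector<std::string> feats;
  feats.push_back(percept + "|" + full_label);   // percept x full label
  for (const std::string& c : components)
    feats.push_back(percept + "|part:" + c);     // percept x component
  return feats;
}
```

For the two-component POS tag example, `FireFeatures("suffix=s", "PN", {"P", "N"})` yields exactly the three features described: one for the full label and one per component.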
We assume the output space of the classifier will not be affected by the use of composite labels—only full labels (simple or composite) seen during training will be candidates for prediction.
Information about label structure could be (a) inferred automatically from the name of the label, (b) specified in the response file, in place of a single string name for the label, or (c) specified in some other file as a mapping of label names to richer structures. The interface proposed here will allow (a) or (b).
Let the option `--composite-labels [json|string] [positional|bag]` enable this feature:

- If `json` (the default format) is specified, all responses are read as JSON objects. Three response types are allowed: JSON strings, lists of strings, and maps from strings to strings. A JSON string is interpreted as a simple label; in a list of strings, each string is a component; and in a map, the key-value pairs are the components.
- If `string` is specified, all responses are read as unquoted strings and treated as composite; the components are the individual characters.
- If `positional` (the default ordering) is specified, sequential composite labels (the label name in `string` mode, lists in `json` mode) are treated as ordered slot-fillers; i.e., each component is conjoined with its offset in the sequence.
- If `bag` is specified, sequential composite labels are interpreted as bags of components; within a label, any repetition of a component triggers an error. JSON maps are always treated as bags of key-value pairs.

If all labels are length-2 POS tags like `NN` = noun singular, `NS` = noun plural, `PN` = pronoun singular, `PS` = pronoun plural, etc., the following are equivalent ways to specify the response:

- `PN` with `--composite-labels string positional` (note that `bag` would conflate the two possible uses of `N`!)
- `["P", "N"]` with `--composite-labels json positional`
- `{"coarse": "P", "fine": "N"}` with `--composite-labels json`

If all labels are fixed-length bitstrings, the following are equivalent:

- `01011` with `--composite-labels string positional`
- `["0", "1", "0", "1", "1"]` with `--composite-labels json positional`
- `{"0": "0", "1": "1", "2": "0", "3": "1", "4": "1"}` with `--composite-labels json`

If the labels are clusters of morphosyntactic attributes, then with `--composite-labels json bag` the two labels `["noun", "singular", "accusative"]` and `["verb", "past", "singular", "causative"]` would share one component: features associated with the `"singular"` component would fire for both.
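The difference between the two orderings can be sketched as a small decomposition routine. This is illustrative only (not creg's code); the `offset=value` encoding and the error convention are assumptions for the example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: decompose a sequential composite label into its
// component features. Under "positional", each component is conjoined with
// its offset; under "bag", a repeated component is an error (signaled here
// by returning an empty vector).
std::vector<std::string> Decompose(const std::vector<std::string>& parts,
                                   bool positional) {
  std::vector<std::string> comps;
  std::map<std::string, int> seen;
  for (size_t i = 0; i < parts.size(); ++i) {
    if (positional) {
      comps.push_back(std::to_string(i) + "=" + parts[i]);
    } else {
      if (++seen[parts[i]] > 1) return {};  // bag mode: repetition is an error
      comps.push_back(parts[i]);
    }
  }
  return comps;
}
```

So `Decompose({"0", "1", "0", "1", "1"}, true)` yields distinct slot-filler components like `0=0` and `2=0`, while bag mode on the same bitstring label fails because `"0"` repeats.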
Right now the weight vector is always a `vector<double>`, and for discrete regression (ordinal regression and multiclass logistic regression) the vector subscript is computed inline.

Suggested interface for the discrete weight vector: an "enhanced" vector in which the `()` operator is overloaded to take the class (label index) and the feature id as separate arguments. (Can this be done by subclassing or wrapping `vector` and still play nicely with the optimization routines?) Then `weights(k, fid)` would access the weight for a given class-feature pair, and `weights(k, fid, w)` would assign the value `w` to that weight. This should make working with the weights vector more intuitive.

An additional benefit is that the weights instance can store extra information, e.g. whether or not one of the K classes should be treated as a background class (which affects indexing into the vector).
Chris -- can you add a license? Thanks!
For feature engineering, it would be nice not to have to train on a different feature file for each combination of features. One solution would be to allow multiple feature files to be loaded for the same training instances (the instance IDs should prevent any ambiguity). (This has the advantage that features can be extracted in parallel.) Another would be a command-line regex option for features to ablate.
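The multiple-feature-file idea could amount to a merge keyed on instance ID, sketched below. The type aliases and function are hypothetical, assuming each file parses into a map from instance ID to a feature-value map:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch: merge several feature files over the same training
// instances, keyed by instance ID so there is no ambiguity about alignment.
typedef std::map<std::string, double> FeatureMap;            // feature -> value
typedef std::map<std::string, FeatureMap> InstanceFeatures;  // id -> features

void MergeFeatureFile(const InstanceFeatures& file, InstanceFeatures* merged) {
  for (const auto& inst : file)
    for (const auto& fv : inst.second)
      (*merged)[inst.first][fv.first] = fv.second;  // later files win on clashes
}
```

Since each file is keyed independently by instance ID, the files could also be extracted in parallel and merged afterwards.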
During learning, computing the loss and its gradient with respect to the parameters (especially with large numbers of training instances or features) can be quite expensive. OpenMP (http://openmp.org/wp/), which g++ supports by default, could easily be used to parallelize this computation. Basically, all loops of the form `for (unsigned i = 0; i < training.size(); ++i)` are good candidates for parallelization. From reading about OpenMP, such "reductions" will have to be implemented by creating one gradient buffer per thread and then summing the buffers at the end (although this summing could also be parallelized).
If the training arguments are `-x /dev/null -x train.txt`, creg immediately segfaults. (It is happy as long as the first `-x` argument is a non-empty file.)
Perhaps with scipy data structures: http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.html#scipy.sparse.dok_matrix
As of 2014/Mar/3 the README is very much out of date. As it stands, it is not possible to build from the most recent release.