
eth_ml's People

Contributors

aerial1543, go1dshtein, hanveiga, martinthenext


eth_ml's Issues

Implement a window bag of words

Currently, the bag-of-words feature takes the entire context of an ambiguous term as an argument. We need to implement a new feature that only accounts for the k words around the ambiguous term during vectorization.

Technical details:

  1. Refactor the code so that feature selection and classification routines are pluggable into class definitions, probably using mixins.
  2. Subclass CountVectorizer to implement a bag-of-words window (see the sketch below).
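
A minimal sketch of such a window, assuming each annotation exposes its context as plain text together with the ambiguous term itself; the class name and the annotation attributes used here are illustrative guesses, not the repo's API:

from sklearn.feature_extraction.text import CountVectorizer

class WindowBagOfWords(object):
  def __init__(self, window_size=3):
    self.window_size = window_size
    self.vectorizer = CountVectorizer()

  def _window(self, context, term):
    # keep only window_size tokens on each side of the ambiguous term
    tokens = context.split()
    if term not in tokens:
      return context
    i = tokens.index(term)
    k = self.window_size
    return ' '.join(tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k])

  def fit_transform(self, annotations):
    # a.text and a.term are assumed attribute names, not the repo's actual ones
    return self.vectorizer.fit_transform(
      self._window(a.text, a.term) for a in annotations)

  def transform(self, annotations):
    return self.vectorizer.transform(
      self._window(a.text, a.term) for a in annotations)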

Validate classifiers against the labeled ambiguous data

Implement a procedure to measure the agreement of a classifier with the labeled ambiguous data. For that purpose:

  1. The Annotation class should be modified to store ambiguous data as well.
  2. A function should be implemented in data.py to deserialize labeled ambiguous data (from MTurk or an expert) into a list of Annotation objects (a sketch follows this list).
  3. A module should be implemented to test classifiers against this data, analogous to cv.py.
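
A minimal sketch of the deserialization function, assuming the labeled data is a tab-separated file with one (context text, ambiguous term, chosen group) row per annotation; the column layout and the Annotation constructor signature are illustrative guesses:

import csv
from data import Annotation  # repo module; constructor signature assumed

def load_labeled_ambiguous_annotations(path):
  annotations = []
  labels = []
  with open(path) as f:
    # assumed row layout: context text, ambiguous term, chosen group
    for text, term, group in csv.reader(f, delimiter='\t'):
      annotations.append(Annotation(text, term))
      labels.append(group)
  return annotations, labels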

Make plotting learning curves possible and very easy

In this ticket you have to implement a very easy-to-use function for plotting learning curves. The image should be written to the specified file location, and the user should be able to call the function without knowing how it works internally. A good example of a call would be:

plot_curves('output.jpg', passive_learner = [0.2, 0.21, 0.22],
  active_learner = [0.2, 0.23, 0.55])

For every keyword argument (see the Python docs on kwargs) this would draw a line plot with the list index (starting at 1) on the X axis and the list values on the Y axis. For the supplied example it would plot the points (1, 0.2), (2, 0.21), (3, 0.22) in red and (1, 0.2), (2, 0.23), (3, 0.55) in blue, with a legend indicating that red means 'passive_learner' and blue means 'active_learner'. Optional control over graphical parameters could also be useful. Please describe how to use the function in a docstring.

If a plotting library motivates some other argument structure, that's fine; the main thing is that the function stays straightforward and easy to use.

It would be nice to use matplotlib, as it is already installed on the working server.
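
A minimal sketch with matplotlib, assuming each keyword argument is a list of scores as in the call above; colors are left to matplotlib's defaults:

import matplotlib
matplotlib.use('Agg')  # render to files without a display, e.g. on a server
import matplotlib.pyplot as plt

def plot_curves(filename, **curves):
  """Plot one learning curve per keyword argument and save it to filename.

  Each value must be a list of scores; it is plotted against its 1-based
  indices, and the argument name appears in the legend.
  """
  plt.figure()
  for name, values in sorted(curves.items()):
    plt.plot(range(1, len(values) + 1), values, label=name)
  plt.legend()
  plt.savefig(filename)
  plt.close()  # reset state so consecutive calls don't inherit old labels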

Implement separate one-vs-all classifier for semantic groups

Fit 10 separate classifiers, one for every semantic group. To classify an annotation instance:

  1. Observe which semantic group options are presented for the ambiguous term: typically 2 or 3.
  2. Run the corresponding group-specific classifiers and retrieve the probabilities of the conflicting groups.
  3. Assign the group with the highest probability.

The simplest classifier that outputs probabilities is logistic regression.

The resulting collection of classifiers should be wrapped into a single classifier class.
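
A minimal sketch of that wrapper, assuming instances are already vectorized and labels are relabeled to binary per group; the class and method names are illustrative:

from sklearn.linear_model import LogisticRegression

class OneVsAllGroupClassifier(object):
  def __init__(self, groups):
    self.classifiers = dict((g, LogisticRegression()) for g in groups)

  def fit(self, X, labels):
    # one binary problem per semantic group: "does the instance belong to g?"
    for group, classifier in self.classifiers.items():
      classifier.fit(X, [1 if label == group else 0 for label in labels])

  def predict(self, x, options):
    # x is a single-row feature matrix; score only the groups offered
    # for this ambiguous term (typically 2 or 3)
    probabilities = dict(
      (g, self.classifiers[g].predict_proba(x)[0, 1]) for g in options)
    return max(probabilities, key=probabilities.get)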

New feature graph

In the dimensionality reduction section of the result summary there is a graph where each point is a feature set with certain parameters, and its coordinates are the accuracy values on EMEA and Medline.

Under that graph, in the re-evaluation section, you can find similar data for the new dataset. The task is to produce a new graph from this data. The graph should look like the old one: the Pareto front should be highlighted and color-coded by feature.
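
A minimal sketch of extracting the Pareto front from the (EMEA accuracy, Medline accuracy) pairs, assuming higher is better on both axes; drawing and highlighting are left to matplotlib as elsewhere:

def pareto_front(points):
  # a point is on the front if no other point is at least as good on
  # both axes and strictly better on at least one
  front = []
  for p in points:
    dominated = any(
      q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
      for q in points)
    if not dominated:
      front.append(p)
  return front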

Fix learning curve labels

The plot_curves function, when called consecutively with different arguments (see active_vs_passive.py), does not clear the Y axis labels. For example:

(screenshot: plot_medline_39)

Labels should be cleared on every call of the function.

Further investigate Bag of Words

For the sake of dimensionality reduction, the following variants of ContextRestrictedBagOfWordsLeftRight should be implemented:

  1. One that omits words with low counts. For example, if a word occurs fewer than 3 times in the whole data set, exclude it from the bag-of-words features. Call it ContextRestrictedBagOfWordsLeftRightCutoff and make the cut-off frequency a parameter of its constructor, just like the window size (see the sketch after this list). Hint: the min_df parameter can be set to 3 in the CountVectorizer options.
  2. One that uses English stop words: ContextRestrictedBagOfWordsLeftRightStopWords. Hint: stop_words='english'.
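
A minimal sketch of the Cutoff variant, assuming ContextRestrictedBagOfWordsLeftRight keeps one CountVectorizer per context side under the attribute names used below; both assumptions are guesses about the repo's internals:

from sklearn.feature_extraction.text import CountVectorizer

class ContextRestrictedBagOfWordsLeftRightCutoff(ContextRestrictedBagOfWordsLeftRight):
  def __init__(self, window_size, cutoff=3):
    super(ContextRestrictedBagOfWordsLeftRightCutoff, self).__init__(window_size)
    # min_df drops words that occur in fewer than `cutoff` documents
    self.left_vectorizer = CountVectorizer(min_df=cutoff)
    self.right_vectorizer = CountVectorizer(min_df=cutoff)

The StopWords variant would be analogous, passing stop_words='english' instead of min_df.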

To compare the performance of new vectorizers, create a script prototypes/compare_vectorizers.py that does the following:

  1. Outputs the agreement of OptionAwareNaiveBayesLeftRight on the given data, just like mturk_classifier_agreement.py.
  2. Substitutes the vectorizer in this classifier with the variants described above (Cutoff and StopWords), trains the new classifiers, and outputs the resulting agreements for comparison. Ideally it would output a table with the names of the vectorizers, their parameters (like min_df), and the corresponding agreements.

The resulting script should have the same command-line arguments as mturk_classifier_agreement.py.

Implement a bigram annotation vectorizer

models.py currently contains a class called ContextRestrictedBagOfWords. It implements two functions, fit_transform and transform, to vectorize annotations.

The task is to make a new version of this class called ContextRestrictedBagOfBigrams that would use word bigrams instead of just words.

Please refer to the sklearn docs on CountVectorizer, specifically the ngram_range parameter.
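
A minimal sketch of the new class, assuming ContextRestrictedBagOfWords stores its CountVectorizer in an attribute named vectorizer; that attribute name is a guess:

from sklearn.feature_extraction.text import CountVectorizer

class ContextRestrictedBagOfBigrams(ContextRestrictedBagOfWords):
  def __init__(self, window_size):
    super(ContextRestrictedBagOfBigrams, self).__init__(window_size)
    # ngram_range=(2, 2) counts word bigrams instead of single words
    self.vectorizer = CountVectorizer(ngram_range=(2, 2))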

To test the new class you can just plug it into an existing classifier instead of ContextRestrictedBagOfWords, like this:

from sklearn.naive_bayes import MultinomialNB

class NaiveBayesContextRestricted(AnnotationClassifier):
  def __init__(self, **kwargs):
    self.classifier = MultinomialNB()
    window_size = kwargs.get('window_size', 3)
    # the bigram vectorizer is dropped in where ContextRestrictedBagOfWords was
    self.vectorizer = ContextRestrictedBagOfBigrams(window_size)

Differentiate between left and right contexts in vectorizer

This task is similar to the bigram one in the sense that it also requires modifying ContextRestrictedBagOfWords. Two feature vectors should be created for each annotation instead of one: a bag of words for the right context and a bag of words for the left context. The two vectors should then be joined into one.

To realize this idea one should work with the feature matrices directly, joining the outputs of two CountVectorizers.
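
A minimal sketch of the joining step, assuming the left and right context strings have already been extracted per annotation; the names are illustrative:

from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

class ContextRestrictedBagOfWordsLeftRight(object):
  def __init__(self, window_size):
    self.window_size = window_size
    self.left_vectorizer = CountVectorizer()
    self.right_vectorizer = CountVectorizer()

  def fit_transform(self, left_contexts, right_contexts):
    # vectorize each side separately, then stack the two feature matrices
    # column-wise so every annotation gets one combined vector
    left = self.left_vectorizer.fit_transform(left_contexts)
    right = self.right_vectorizer.fit_transform(right_contexts)
    return hstack([left, right])

  def transform(self, left_contexts, right_contexts):
    return hstack([self.left_vectorizer.transform(left_contexts),
                   self.right_vectorizer.transform(right_contexts)])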

Measure agreement between the best classifier and Expert

To evaluate the accuracy of the best classifier (OptionAwareNaiveBayesFullContextLeftRightCutoff trained on Medline) on the "Gold standard", we need to measure its agreement with expert annotations.

Expert annotations are stored in this file.

  1. Modify the load_ambiguous_annotations_labeled method from data.py so that it can also load data from this tsv file.
  2. Create a file expert_classifer_agreement.py where you use the function get_mturk_pickled_classifier_agreement from mturk_classifier_agreement.py to get the agreement between the pickled classifier (you need to load it with joblib.load) and the expert.

expert_classifer_agreement.py should take a pickled classifier and an expert annotation tsv file as parameters and output two numbers:

  1. Agreement with strict answer comparison.
  2. Agreement when only useful answers are counted (that is, if the expert says IDK or NONE, the annotation is excluded from consideration; see the sketch below).
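
A minimal sketch of the two agreement numbers, assuming parallel lists of expert labels and classifier predictions; the IDK and NONE markers are guesses at the tsv file's conventions:

def agreement(expert_labels, predictions, useful_only=False):
  pairs = list(zip(expert_labels, predictions))
  if useful_only:
    # drop annotations where the expert gave no usable answer
    pairs = [(e, p) for e, p in pairs if e not in ('IDK', 'NONE')]
  if not pairs:
    return 0.0
  return sum(1 for e, p in pairs if e == p) / float(len(pairs))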

Training classifiers on a data fraction doesn't work

It looks like train_and_serialize.py produces similar classifiers for any value of the dataset_fraction parameter.

Evidence 1. Pickled classifier files for different fractions have the same size.

Evidence 2. The passive vs. active plots are exactly the same for the active learner and only slightly different for the passive one, which indicates that the active learner is acting on the same data.

Full dataset:

(plot: weightedpartialfitpassivetransferclassifier2_emea_weight1000)

What's expected to be the 5% fraction:

(plot: weightedpartialfitpassivetransferclassifier2_emea_fraction0.05_weight1000)
