
eth_ml's People

Contributors

aerial1543, go1dshtein, hanveiga, martinthenext


eth_ml's Issues

Implement a window bag of words

Currently, the bag-of-words feature takes the entire context of an ambiguous term as an argument. We need to implement a new feature that only accounts for the k words around the ambiguous term during vectorization.

Technical details:

  1. Refactor the code so that feature selection and classification routines are pluggable into class definitions, probably using mixins.
  2. Subclass CountVectorizer to implement a bag-of-words window (see the sketch below).
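
A minimal sketch of such a window, assuming each annotation exposes its context as plain text together with the ambiguous term itself; the class name and the annotation attributes used here are illustrative guesses, not the repo's API:

from sklearn.feature_extraction.text import CountVectorizer

class WindowBagOfWords(object):
  def __init__(self, window_size=3):
    self.window_size = window_size
    self.vectorizer = CountVectorizer()

  def _window(self, context, term):
    # keep only window_size tokens on each side of the ambiguous term
    tokens = context.split()
    if term not in tokens:
      return context
    i = tokens.index(term)
    k = self.window_size
    return ' '.join(tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k])

  def fit_transform(self, annotations):
    # a.text and a.term are assumed attribute names, not the repo's actual ones
    return self.vectorizer.fit_transform(
      self._window(a.text, a.term) for a in annotations)

  def transform(self, annotations):
    return self.vectorizer.transform(
      self._window(a.text, a.term) for a in annotations)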

Validate classifiers against the labeled ambiguous data

Implement a procedure to measure the agreement of a classifier with the labeled ambiguous data. For that purpose:

  1. The Annotation class should be modified to store ambiguous data as well.
  2. A function should be implemented in data.py to deserialize labeled ambiguous data (from MTurk or an expert) into a list of Annotation objects (a sketch follows this list).
  3. A module should be implemented to test classifiers against this data, analogous to cv.py.
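
A minimal sketch of the deserialization function, assuming the labeled data is a tab-separated file with one (context text, ambiguous term, chosen group) row per annotation; the column layout and the Annotation constructor signature are illustrative guesses:

import csv
from data import Annotation  # repo module; constructor signature assumed

def load_labeled_ambiguous_annotations(path):
  annotations = []
  labels = []
  with open(path) as f:
    # assumed row layout: context text, ambiguous term, chosen group
    for text, term, group in csv.reader(f, delimiter='\t'):
      annotations.append(Annotation(text, term))
      labels.append(group)
  return annotations, labels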

Make plotting learning curves possible and very easy

In this ticket you have to implement a very easy-to-use function for plotting learning curves. The image should be written to the specified file location, and the user should be able to call the function without knowing how it works internally. A good example of a call would be:

plot_curves('output.jpg', passive_learner = [0.2, 0.21, 0.22],
  active_learner = [0.2, 0.23, 0.55])

For every keyword argument (see the Python docs on kwargs) this would draw a line plot with the list index (starting at 1) on the X axis and the list values on the Y axis. For the supplied example it would plot the points (1, 0.2), (2, 0.21), (3, 0.22) in red and (1, 0.2), (2, 0.23), (3, 0.55) in blue, with a legend indicating that red means 'passive_learner' and blue means 'active_learner'. Optional control over graphical parameters could also be useful. Please describe how to use the function in a docstring.

If a plotting library motivates some other argument structure, that's fine; the main thing is that the function stays straightforward and easy to use.

It would be nice to use matplotlib, as it is already installed on the working server.
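
A minimal sketch with matplotlib, assuming each keyword argument is a list of scores as in the call above; colors are left to matplotlib's defaults:

import matplotlib
matplotlib.use('Agg')  # render to files without a display, e.g. on a server
import matplotlib.pyplot as plt

def plot_curves(filename, **curves):
  """Plot one learning curve per keyword argument and save it to filename.

  Each value must be a list of scores; it is plotted against its 1-based
  indices, and the argument name appears in the legend.
  """
  plt.figure()
  for name, values in sorted(curves.items()):
    plt.plot(range(1, len(values) + 1), values, label=name)
  plt.legend()
  plt.savefig(filename)
  plt.close()  # reset state so consecutive calls don't inherit old labels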

Implement separate one-vs-all classifier for semantic groups

Fit 10 separate classifiers, one for every semantic group. To classify an annotation instance:

  1. Observe which semantic group options are presented for the ambiguous term: typically 2 or 3.
  2. Run the corresponding group-specific classifiers and retrieve the probabilities of the conflicting groups.
  3. Assign the group with the highest probability.

The simplest classifier that outputs probabilities is logistic regression.

The resulting collection of classifiers should be wrapped into a single classifier class.
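
A minimal sketch of that wrapper, assuming instances are already vectorized and labels are relabeled to binary per group; the class and method names are illustrative:

from sklearn.linear_model import LogisticRegression

class OneVsAllGroupClassifier(object):
  def __init__(self, groups):
    self.classifiers = dict((g, LogisticRegression()) for g in groups)

  def fit(self, X, labels):
    # one binary problem per semantic group: "does the instance belong to g?"
    for group, classifier in self.classifiers.items():
      classifier.fit(X, [1 if label == group else 0 for label in labels])

  def predict(self, x, options):
    # x is a single-row feature matrix; score only the groups offered
    # for this ambiguous term (typically 2 or 3)
    probabilities = dict(
      (g, self.classifiers[g].predict_proba(x)[0, 1]) for g in options)
    return max(probabilities, key=probabilities.get)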

New feature graph

In the dimensionality reduction section of the result summary there is a graph where each point is a feature set with certain parameters, and its coordinates are the accuracy values on EMEA and Medline.

Under that graph, in the re-evaluation section, you can find similar data for the new dataset. The task is to produce a new graph from this data. The graph should look like the old one: the Pareto front should be highlighted and color-coded by feature.
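
A minimal sketch of extracting the Pareto front from the (EMEA accuracy, Medline accuracy) pairs, assuming higher is better on both axes; drawing and highlighting are left to matplotlib as elsewhere:

def pareto_front(points):
  # a point is on the front if no other point is at least as good on
  # both axes and strictly better on at least one
  front = []
  for p in points:
    dominated = any(
      q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
      for q in points)
    if not dominated:
      front.append(p)
  return front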

Fix learning curve labels

The plot_curves function, when called consecutively with different arguments (see active_vs_passive.py), does not clear the Y axis labels. For example:

(screenshot: plot_medline_39)

Labels should be cleared on every call of the function.

Further investigate Bag of Words

For the sake of dimensionality reduction, the following variants of ContextRestrictedBagOfWordsLeftRight should be implemented:

  1. One that omits words with low counts. For example, if a word occurs fewer than 3 times in the whole data set, exclude it from the bag-of-words features. Call it ContextRestrictedBagOfWordsLeftRightCutoff and make the cut-off frequency a parameter of its constructor, just like the window size (see the sketch after this list). Hint: the min_df parameter can be set to 3 in the CountVectorizer options.
  2. One that uses English stop words: ContextRestrictedBagOfWordsLeftRightStopWords. Hint: stop_words='english'.
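
A minimal sketch of the Cutoff variant, assuming ContextRestrictedBagOfWordsLeftRight keeps one CountVectorizer per context side under the attribute names used below; both assumptions are guesses about the repo's internals:

from sklearn.feature_extraction.text import CountVectorizer

class ContextRestrictedBagOfWordsLeftRightCutoff(ContextRestrictedBagOfWordsLeftRight):
  def __init__(self, window_size, cutoff=3):
    super(ContextRestrictedBagOfWordsLeftRightCutoff, self).__init__(window_size)
    # min_df drops words that occur in fewer than `cutoff` documents
    self.left_vectorizer = CountVectorizer(min_df=cutoff)
    self.right_vectorizer = CountVectorizer(min_df=cutoff)

The StopWords variant would be analogous, passing stop_words='english' instead of min_df.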

To compare the performance of new vectorizers, create a script prototypes/compare_vectorizers.py that does the following:

  1. Outputs the agreement of OptionAwareNaiveBayesLeftRight on the given data, just like mturk_classifier_agreement.py.
  2. Substitutes the vectorizer in this classifier with the variants described above (Cutoff and StopWords), trains the new classifiers, and outputs the resulting agreements for comparison. Ideally it would output a table with the names of the vectorizers, their parameters (like min_df), and the corresponding agreements.

The resulting script should have the same command-line arguments as mturk_classifier_agreement.py.

Implement a bigram annotation vectorizer

models.py currently contains a class called ContextRestrictedBagOfWords. It implements two functions, fit_transform and transform, to vectorize annotations.

The task is to make a new version of this class called ContextRestrictedBagOfBigrams that would use word bigrams instead of just words.

Please refer to the sklearn docs on CountVectorizer, specifically the ngram_range parameter.
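
A minimal sketch of the new class, assuming ContextRestrictedBagOfWords stores its CountVectorizer in an attribute named vectorizer; that attribute name is a guess:

from sklearn.feature_extraction.text import CountVectorizer

class ContextRestrictedBagOfBigrams(ContextRestrictedBagOfWords):
  def __init__(self, window_size):
    super(ContextRestrictedBagOfBigrams, self).__init__(window_size)
    # ngram_range=(2, 2) counts word bigrams instead of single words
    self.vectorizer = CountVectorizer(ngram_range=(2, 2))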

To test the new class you can just plug it into an existing classifier instead of ContextRestrictedBagOfWords, like this:

from sklearn.naive_bayes import MultinomialNB

class NaiveBayesContextRestricted(AnnotationClassifier):
  def __init__(self, **kwargs):
    self.classifier = MultinomialNB()
    window_size = kwargs.get('window_size', 3)
    # the bigram vectorizer is dropped in where ContextRestrictedBagOfWords was
    self.vectorizer = ContextRestrictedBagOfBigrams(window_size)

Differentiate between left and right contexts in vectorizer

This task is similar to the bigram one in the sense that it also requires modifying ContextRestrictedBagOfWords. Two feature vectors should be created for each annotation instead of one: a bag of words for the right context and a bag of words for the left context. The two vectors should then be joined into one.

To realize this idea one should work with the feature matrices directly, joining the outputs of two CountVectorizers.
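
A minimal sketch of the joining step, assuming the left and right context strings have already been extracted per annotation; the names are illustrative:

from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

class ContextRestrictedBagOfWordsLeftRight(object):
  def __init__(self, window_size):
    self.window_size = window_size
    self.left_vectorizer = CountVectorizer()
    self.right_vectorizer = CountVectorizer()

  def fit_transform(self, left_contexts, right_contexts):
    # vectorize each side separately, then stack the two feature matrices
    # column-wise so every annotation gets one combined vector
    left = self.left_vectorizer.fit_transform(left_contexts)
    right = self.right_vectorizer.fit_transform(right_contexts)
    return hstack([left, right])

  def transform(self, left_contexts, right_contexts):
    return hstack([self.left_vectorizer.transform(left_contexts),
                   self.right_vectorizer.transform(right_contexts)])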

Measure agreement between the best classifier and Expert

To evaluate the accuracy of the best classifier (OptionAwareNaiveBayesFullContextLeftRightCutoff trained on Medline) on the "Gold standard", we need to measure its agreement with expert annotations.

Expert annotations are stored in this file.

  1. Modify the load_ambiguous_annotations_labeled method from data.py so that it can also load data from this tsv file.
  2. Create a file expert_classifer_agreement.py where you use the function get_mturk_pickled_classifier_agreement from mturk_classifier_agreement.py to get the agreement between the pickled classifier (you need to load it with joblib.load) and the expert.

expert_classifer_agreement.py should take a pickled classifier and an expert annotation tsv file as parameters and output two numbers:

  1. Agreement with strict answer comparison.
  2. Agreement when only useful answers are counted (that is, if the expert says IDK or NONE, the annotation is excluded from consideration; see the sketch below).
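
A minimal sketch of the two agreement numbers, assuming parallel lists of expert labels and classifier predictions; the IDK and NONE markers are guesses at the tsv file's conventions:

def agreement(expert_labels, predictions, useful_only=False):
  pairs = list(zip(expert_labels, predictions))
  if useful_only:
    # drop annotations where the expert gave no usable answer
    pairs = [(e, p) for e, p in pairs if e not in ('IDK', 'NONE')]
  if not pairs:
    return 0.0
  return sum(1 for e, p in pairs if e == p) / float(len(pairs))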

Training classifiers on a data fraction doesn't work

It looks like train_and_serialize.py produces similar classifiers for any value of the dataset_fraction parameter.

Evidence 1. Pickled classifier files for different fractions have the same size.

Evidence 2. The passive vs. active plots are exactly the same for the active learner and only slightly different for the passive one, which indicates that the active learner is acting on the same data.

Full dataset:

(plot: weightedpartialfitpassivetransferclassifier2_emea_weight1000)

What's expected to be the 5% fraction:

(plot: weightedpartialfitpassivetransferclassifier2_emea_fraction0.05_weight1000)
