activelearning's Introduction

Created on Jul 4, 2014, based on http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html

This program implements active learning (http://en.wikipedia.org/wiki/Active_learning_(machine_learning)) for text classification tasks with scikit-learn's LinearSVC classifier. Despite some differences, the approach can also be described as incremental training. Instead of Stochastic Gradient Descent, we retrain in batch mode, because the data is not very large and accuracy was more of a concern than efficiency.

The algorithm trains the model on a training dataset and evaluates it on a test dataset. After each evaluation, the algorithm selects 2*NUM_QUESTIONS samples from the unlabeled dataset to be labeled by a user/expert. Each labeled sample is then moved to the corresponding class directory in the train dataset, and the model is trained again on the enlarged training set.
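
A single train-and-evaluate round might look roughly like the sketch below. It assumes TF-IDF features (`TfidfVectorizer`) and in-memory lists of texts and labels. The function name `train_and_evaluate` and the toy data are illustrative and are not taken from the original script.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC


def train_and_evaluate(train_texts, train_labels, test_texts, test_labels):
    """Refit from scratch on the full training set (batch mode), then score on test."""
    vectorizer = TfidfVectorizer()  # assumption: TF-IDF features
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    clf = LinearSVC()
    clf.fit(X_train, train_labels)

    accuracy = accuracy_score(test_labels, clf.predict(X_test))
    return clf, vectorizer, accuracy


# Toy data for illustration only.
train_texts = ["great movie", "loved it", "terrible film", "hated it"]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts = ["great film", "terrible movie"]
test_labels = ["pos", "neg"]

clf, vectorizer, accuracy = train_and_evaluate(
    train_texts, train_labels, test_texts, test_labels
)
print(f"accuracy: {accuracy:.2f}")
```

Calling the function again after new files have been added to the training set corresponds to one more round of the loop.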

The selection of unlabeled samples is based on the decision_function of the SVM, which gives the signed distance of the samples X to the separating hyperplane (up to a scaling factor). The sign only indicates the side of the hyperplane, and the value is not restricted to [-1, 1], so we use absolute values as confidence levels. When there are more than two classes, the decision function returns a confidence value for each class and each sample, so we average the absolute values over all classes.
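
The confidence computation described here can be sketched in plain Python. The helper name `average_absolute_confidence` is hypothetical, and the inputs are hand-written lists standing in for the output of `decision_function` (one number per sample for two classes, one row of per-class numbers per sample otherwise).

```python
def average_absolute_confidence(scores):
    """Return one confidence value per sample.

    `scores` holds either one number per sample (two classes) or one
    sequence of per-class numbers per sample (more than two classes).
    """
    confidences = []
    for row in scores:
        if isinstance(row, (int, float)):
            # Binary case: confidence is the absolute score.
            confidences.append(abs(row))
        else:
            # Multiclass case: average the absolute per-class scores.
            confidences.append(sum(abs(v) for v in row) / len(row))
    return confidences


# Made-up scores for illustration only.
binary_scores = [0.8, -0.1, -1.5]
multiclass_scores = [[1.0, -0.5, -0.9], [0.1, -0.2, 0.3]]

binary_conf = average_absolute_confidence(binary_scores)
multiclass_conf = average_absolute_confidence(multiclass_scores)
print(binary_conf)
print(multiclass_conf)
```

With real scikit-learn output, converting the array with `.tolist()` first keeps the type check above simple.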

We select the NUM_QUESTIONS samples with the highest average absolute confidence and the NUM_QUESTIONS samples with the lowest average absolute confidence for expert labeling. This procedure can easily be changed by modifying the code in the benchmark function.
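
A minimal sketch of this selection step follows. The function name `select_for_labeling` and the value of `NUM_QUESTIONS` are assumptions made for illustration. Removing duplicates when the unlabeled pool is smaller than 2*NUM_QUESTIONS is also a choice made here, not something the original description specifies.

```python
NUM_QUESTIONS = 2  # illustrative value


def select_for_labeling(confidences, num_questions=NUM_QUESTIONS):
    """Return indices of the most and least confident samples, without duplicates."""
    # Indices sorted from lowest to highest confidence.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    least_confident = order[:num_questions]
    most_confident = order[::-1][:num_questions]

    selected = []
    for i in most_confident + least_confident:
        if i not in selected:  # avoid asking about the same sample twice
            selected.append(i)
    return selected


# Made-up confidence values for six unlabeled samples.
confidences = [0.9, 0.05, 0.4, 1.7, 0.2, 0.6]
selected = select_for_labeling(confidences)
print(selected)
```

The returned indices refer to positions in the unlabeled pool, so they can be mapped back to filenames for the expert to label.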

This program requires a directory structure similar to what is shown below:

mainDirectory
   train
       pos
           1.txt
           2.txt
       neg
           3.txt
           4.txt
   test
       pos
           5.txt
           6.txt
       neg
           7.txt
           8.txt
   unlabeled
       unlabeled
           9.txt
           10.txt
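
The layout above can be checked before a run with a short script such as the following sketch. The helper names `list_classes` and `check_layout` are hypothetical, and the demonstration builds a temporary copy of the layout rather than touching real data.

```python
import tempfile
from pathlib import Path


def list_classes(main_dir):
    """Return the sorted class names found under main_dir/train."""
    train_dir = Path(main_dir) / "train"
    return sorted(p.name for p in train_dir.iterdir() if p.is_dir())


def check_layout(main_dir):
    """Raise ValueError if a directory from the layout above is missing."""
    main_dir = Path(main_dir)
    for name in ("train", "test", "unlabeled"):
        if not (main_dir / name).is_dir():
            raise ValueError(f"missing directory: {main_dir / name}")
    if not (main_dir / "unlabeled" / "unlabeled").is_dir():
        raise ValueError("unlabeled files must sit in unlabeled/unlabeled")


# Build a throwaway copy of the layout to try the helpers.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("train/pos", "train/neg", "test/pos", "test/neg",
                "unlabeled/unlabeled"):
        (Path(tmp) / sub).mkdir(parents=True)
    check_layout(tmp)
    classes = list_classes(tmp)

print(classes)
```

scikit-learn's `load_files` can read this kind of one-directory-per-class layout, which may be why the unlabeled files sit in a nested `unlabeled/unlabeled` directory.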
           11.txt

Filenames in unlabeled must not duplicate filenames in the train directory, because every time a file is labeled it is moved into the corresponding class directory under train.
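
The move step, including a guard against the filename collision mentioned here, could be sketched as follows. The function name `move_to_train` is hypothetical, and raising `FileExistsError` on a collision is one possible policy, not necessarily what the original script does.

```python
import shutil
import tempfile
from pathlib import Path


def move_to_train(main_dir, filename, label):
    """Move unlabeled/unlabeled/<filename> to train/<label>/<filename>."""
    main_dir = Path(main_dir)
    source = main_dir / "unlabeled" / "unlabeled" / filename
    target_dir = main_dir / "train" / label

    # Refuse to proceed if the name already exists under any class directory.
    for class_dir in (main_dir / "train").iterdir():
        if class_dir.is_dir() and (class_dir / filename).exists():
            raise FileExistsError(f"{filename} already exists in {class_dir}")

    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target_dir / filename))
    return target_dir / filename


# Try both the normal case and a collision in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sub in ("train/pos", "train/neg", "unlabeled/unlabeled"):
        (root / sub).mkdir(parents=True)
    (root / "unlabeled" / "unlabeled" / "9.txt").write_text("some text")
    (root / "unlabeled" / "unlabeled" / "1.txt").write_text("other text")
    (root / "train" / "neg" / "1.txt").write_text("already labeled")

    target = move_to_train(root, "9.txt", "pos")
    moved_ok = target.exists() and not (
        root / "unlabeled" / "unlabeled" / "9.txt"
    ).exists()

    try:
        move_to_train(root, "1.txt", "pos")
        collision_detected = False
    except FileExistsError:
        collision_detected = True

print(moved_ok, collision_detected)
```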

The pos and neg categories are arbitrary; both the number of classes and their names can differ from what is shown here. The classifier can also be replaced with any other scikit-learn classifier, provided it exposes decision_function, on which the sample selection relies.

@author: afshin rahimi

activelearning's People

Contributors

afshinrahimi, alenrooni
