Pool-based active learning in Python

Home Page: http://libact.readthedocs.org/

License: BSD 2-Clause "Simplified" License

Python 75.12% C 8.93% C++ 14.02% Cython 1.93%

machine-learning-library active-learning machine-learning uncertainty-sampling

libact's Introduction

libact: Pool-based Active Learning in Python

authors: Yao-Yuan Yang, Shao-Chuan Lee, Yu-An Chung, Tung-En Wu, Si-An Chen, Hsuan-Tien Lin

Introduction

libact is a Python package designed to make active learning easier for real-world users. The package not only implements several popular active learning strategies, but also features the active-learning-by-learning meta-algorithm, which helps users automatically select the best strategy on the fly. Furthermore, the package provides a unified interface for implementing additional strategies, models, and application-specific labelers. The package is open source, has an issue tracker on GitHub, and can be installed from the Python Package Index (PyPI).

Documentation

The technical report associated with the package is on arXiv, and the documentation for the latest release is available on readthedocs. Comments and questions about the package are welcome at [email protected]. All contributions to the documentation are greatly appreciated!

Basic Dependencies

Python 2.7, 3.3, 3.4, 3.5, 3.6
Python dependencies

pip install -r requirements.txt

Debian (>= 7) / Ubuntu (>= 14.04)

sudo apt-get install build-essential gfortran libatlas-base-dev liblapacke-dev python3-dev

Arch

sudo pacman -S lapacke

macOS

brew install openblas

Installation

After resolving the dependencies, you may install the package via pip (for all users):

sudo pip install libact

or install into your home directory with pip:

pip install --user libact

or install the latest source from the GitHub repository with pip:

pip install git+https://github.com/ntucllab/libact.git

To build and install from source in your home directory:

python setup.py install --user

To build and install from source for all users on Unix/Linux:

python setup.py build
sudo python setup.py install

Installation Options

LIBACT_BUILD_HINTSVM: set this variable to 1 to build the HintSVM C extension. If set to 0, you will not be able to use the HintSVM query strategy. Default=1.
LIBACT_BUILD_VARIANCE_REDUCTION: set this variable to 1 to build the variance reduction C extension. If set to 0, you will not be able to use the VarianceReduction query strategy. Default=1.

Example:

LIBACT_BUILD_HINTSVM=1 pip install git+https://github.com/ntucllab/libact.git
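
How setup.py actually reads these variables is not shown here. As a rough sketch only, a build script could interpret such a switch like this (build_flag is a hypothetical helper, not part of libact):

```python
import os

def build_flag(name, default="1", environ=None):
    """Return True unless the variable is explicitly set to "0".

    `environ` lets callers pass a plain dict instead of os.environ.
    """
    env = os.environ if environ is None else environ
    return env.get(name, default) != "0"

# Unset variables fall back to the documented default of 1 (build the extension).
build_hintsvm = build_flag("LIBACT_BUILD_HINTSVM", environ={})
skip_variance_reduction = not build_flag(
    "LIBACT_BUILD_VARIANCE_REDUCTION",
    environ={"LIBACT_BUILD_VARIANCE_REDUCTION": "0"},
)
```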

Usage

The main usage of libact is as follows:

from libact.query_strategies import UncertaintySampling

# trn_ds (a libact Dataset) and lbr (a labeler instance) are assumed to exist already
qs = UncertaintySampling(trn_ds, method='lc') # query strategy instance

ask_id = qs.make_query() # let the query strategy suggest which sample to query
X, y = zip(*trn_ds.data)
lb = lbr.label(X[ask_id]) # ask the labeler for the label of the selected sample
trn_ds.update(ask_id, lb) # update the dataset with the newly queried label
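
To make the ask/label/update cycle above concrete without depending on libact itself, here is a minimal, self-contained sketch of pool-based active learning with least-confidence sampling. Every name in it (predict_proba, least_confident, fit_threshold, the toy 1-D pool and oracle) is a hypothetical stand-in chosen for illustration, not libact's API or implementation.

```python
import math

def predict_proba(x, threshold):
    """Toy 1-D model: class probabilities from a logistic curve centered at `threshold`."""
    p1 = 1.0 / (1.0 + math.exp(-(x - threshold)))
    return [1.0 - p1, p1]

def least_confident(pool, labels, threshold):
    """Return the index of the unlabeled point whose top class probability is lowest."""
    unlabeled = [i for i, lb in enumerate(labels) if lb is None]
    return min(unlabeled, key=lambda i: max(predict_proba(pool[i], threshold)))

def fit_threshold(pool, labels):
    """Place the decision threshold midway between the largest 0 and the smallest 1."""
    zeros = [x for x, lb in zip(pool, labels) if lb == 0]
    ones = [x for x, lb in zip(pool, labels) if lb == 1]
    return (max(zeros) + min(ones)) / 2.0

pool = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
oracle = [0, 0, 0, 0, 0, 0, 1, 1, 1]      # hidden ground truth, plays the labeler
labels = [0] + [None] * 7 + [1]           # only the two end points start labeled

queried = []
for _ in range(3):                        # quota of three queries
    threshold = fit_threshold(pool, labels)              # retrain the toy model
    ask_id = least_confident(pool, labels, threshold)    # role of qs.make_query()
    labels[ask_id] = oracle[ask_id]                      # role of lbr.label + trn_ds.update
    queried.append(ask_id)
```

Each query lands on the point closest to the current decision threshold, so three labels are enough to bracket the true boundary between 5.0 and 6.0 in this toy pool.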

Some examples are available under the examples directory. Before running, use examples/get_dataset.py to retrieve the dataset used by the examples.

Available examples:

plot: This example demonstrates basic usage of libact. It splits a fully labeled dataset and removes some of the labels to simulate the pool-based active learning scenario. Each query for an unlabeled example is then equivalent to revealing one label from the original dataset.
label_digits: This example shows how to use libact when you want a human to label the selected sample for your algorithm.
albl_plot: This example compares the performance of ALBL with other active learning algorithms.
multilabel_plot: This example compares the performance of algorithms in the multilabel setting.
alce_plot: This example compares the performance of algorithms in the cost-sensitive multi-class setting.

Running tests

To run the test suite:

python setup.py test

To run pylint, install pylint through pip install pylint and run the following command in root directory:

pylint libact

To measure the test code coverage, install coverage through pip install coverage and run the following commands in root directory:

coverage run --source libact --omit */tests/* setup.py test
coverage report

Citing

If you find this package useful, please cite the original works (see the Reference section of each strategy) as well as the following:

@techreport{YY2017,
  author = {Yao-Yuan Yang and Shao-Chuan Lee and Yu-An Chung and Tung-En Wu and Si-An Chen and Hsuan-Tien Lin},
  title = {libact: Pool-based Active Learning in Python},
  institution = {National Taiwan University},
  url = {https://github.com/ntucllab/libact},
  note = {available as arXiv preprint \url{https://arxiv.org/abs/1710.00379}},
  month = oct,
  year = 2017
}

Acknowledgments

The authors thank Chih-Wei Chang and other members of the Computational Learning Lab at National Taiwan University for valuable discussions and various contributions to making this package better.

libact's Issues

Interfaces documentation

make_query() in active_learning_by_learning is broken

Hello. The make_query() method fails at the following line, with q undefined:
ask_idx = np.random.choice(
    np.arange(len(self.unlabeled_invert_id_idx)), size=1, p=q
)[0]

Could you please fix it?

Thanks!
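
The failing call needs a probability vector q with one entry per unlabeled candidate. As an illustration only (this is not the ALBL code, and the weights are made up), the sketch below normalizes arbitrary non-negative weights into such a vector and draws one index from it using the standard library:

```python
import random

def sample_index(weights, rng):
    """Draw one index with probability proportional to `weights`.

    Returns the sampled index together with the normalized vector q.
    """
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    q = [w / total for w in weights]          # normalized probabilities
    idx = rng.choices(range(len(q)), weights=q, k=1)[0]
    return idx, q

rng = random.Random(0)                        # seeded for reproducibility
idx, q = sample_index([1.0, 3.0, 0.0, 4.0], rng)
```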

ModuleNotFoundError: No module named 'libact.query_strategies.multilabel'

Hi all,

I have installed the libact package on Ubuntu, but for some reason I can't run the alce_plot.py and multilabel_plot.py examples. I keep getting a ModuleNotFoundError for the module 'libact.query_strategies.multilabel'.

Please help!
Regards

SVM: use scikit-learn instead of LIBSVM

Separate changes out from quire branch.

Unit testing for active learning algorithms

More examples with sphinx-gallery

https://github.com/sphinx-gallery/sphinx-gallery

Incompatibility with plotly and cufflinks

Hello,
I have found that libact is not compatible with the Python packages plotly and cufflinks. I tested it on a fresh install of Ubuntu 16.04 with Anaconda installed.
Everything was fine until plotly and cufflinks were installed:


pip install plotly --upgrade
pip install cufflinks --upgrade

Then running python setup.py test ends with this:

======================================================================
ERROR: query_strategies (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: query_strategies
Traceback (most recent call last):
  File "/path/anaconda3/lib/python3.5/unittest/loader.py", line 153, in loadTestsFromName
    module = __import__(module_name)
  File "/path/libact/libact/query_strategies/__init__.py", line 20, in <module>
    from ._variance_reduction import estVar
ImportError: /usr/lib/liblapacke.so.3: undefined symbol: dpotrf2_

moving sklearn.cross_validation to sklearn.model_selection after v0.18.0

QS: check if unlabeled pool is empty upon update

Raise exception in Ideal_labeler when the given feature is not found

https://github.com/ntucllab/libact/blob/interface-documentation/libact/labelers/ideal_labeler.py#L30

Problems installing in Linux

Hello,

I am trying to install libact on the HPC facilities of my university. However, I get the following error every time I try to install it:

error: Command "gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/rmegret/irodriguez/anaconda3/envs/bee/lib/python3.6/site-packages/numpy/core/include -I/usr/include/lapacke -I/home/rmegret/irodriguez/anaconda3/envs/bee/include/python3.6m -c libact/query_strategies/src/variance_reduction/variance_reduction.c -o build/temp.linux-x86_64-3.6/libact/query_strategies/src/variance_reduction/variance_reduction.o -std=c11" failed with exit status 1

I have tried pip and cloning the repo and then using setup.py.

Just in case, here are the specifications of the HPC: https://www.hpcf.upr.edu/documentation/boqueron/#ffs-tabbed-15

Is there any tutorial/notebook notes?

Is there a Jupyter notebook for learning how to use this library?

Fix ReadTheDocs integration

Build fails on RTD server as dependencies (numpy) won't build. Should look for a workaround.

Clarify semantics of Model.predict_real

Currently Model.predict_real is connected to predict_proba in scikit-learn, which returns an array of n_classes floats representing the probabilities of the corresponding labels. But decision_function is another candidate, whose return shape varies from model to model. For example (in our case n_samples = 1):

LogisticRegression: (n_samples,) if n_classes == 2 else (n_samples, n_classes)
C-SVC: (n_samples, n_classes * (n_classes-1) / 2)

We need to decide what we want so that the interface is well defined. @hsuantien can you give us some advice on this?
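
One possible convention, sketched here purely as an assumption and not as the project's decision, is to always return scores shaped (n_samples, n_classes), expanding the 1-D binary case into two columns. The helper name as_per_class_scores is hypothetical, and the pairwise C-SVC output is deliberately not handled:

```python
import numpy as np

def as_per_class_scores(decision_values):
    """Return scores shaped (n_samples, n_classes).

    A 1-D array of binary decision values d becomes the columns (-d, d);
    a 2-D array is returned unchanged.
    """
    dv = np.asarray(decision_values, dtype=float)
    if dv.ndim == 1:
        return np.column_stack((-dv, dv))
    return dv

binary = as_per_class_scores([0.7, -1.2])         # binary case: two samples
multi = as_per_class_scores([[0.1, 0.5, -0.3]])   # already one column per class
```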

Developer guidelines

ALBL: ensure all query_models reference the same Dataset instance

ideal_labeler: label() should return the label instead of list

Current IdealLabeler seems to return a list of labels instead of the label.

self.y[np.where([...])[0]]

should be either

self.y[np.where([...])[0]][0]

or

self.y[np.where([...])[0][0]]
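
The difference is easy to see on a tiny made-up example (the arrays X, y and feature below are illustrative, not taken from libact):

```python
import numpy as np

X = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
y = np.array([10, 20, 30])
feature = np.array([2.0, 3.0])

# np.where on a boolean list returns a tuple; [0] is the array of matching row indices
matches = np.where([np.array_equal(x, feature) for x in X])[0]

as_array = y[matches]        # indexing with an index array still yields an array
as_scalar = y[matches[0]]    # indexing with the first match yields the label itself
```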

Identify whether the relabeling in sklearn will cause problem

sklearn internally relabels the given labels to 0-n_labels. If I understand correctly, it does so in the order in which the data is passed to the fit method.
So if updating an unlabeled sample changes the order in which the data is passed to fit, the values returned by our model's predict_real method might be in the wrong order.
One proposal for solving this problem is to manage the relabeling ourselves in the model classes.
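
A minimal sketch of that proposal, assuming labels are hashable and sortable; LabelCodec is a hypothetical helper, not an existing libact class. The mapping is fixed once, so it no longer depends on the order in which samples reach fit:

```python
class LabelCodec:
    """Map arbitrary labels to 0..n-1 in a fixed, sorted order."""

    def __init__(self, labels):
        self.classes = sorted(set(labels))
        self._index = {lb: i for i, lb in enumerate(self.classes)}

    def encode(self, labels):
        """Translate original labels into integer codes."""
        return [self._index[lb] for lb in labels]

    def decode(self, codes):
        """Translate integer codes back into the original labels."""
        return [self.classes[c] for c in codes]

codec = LabelCodec(["dog", "cat", "bird"])
first = codec.encode(["dog", "cat"])
second = codec.encode(["cat", "dog"])   # different arrival order, same code per label
```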

Error when installing libact

On Ubuntu
When installing libact using command " pip install git+https://github.com/ntucllab/libact.git"
Get error:
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-Q9a2LI-build/

Backport to Python 2

Can this be installed on Windows 10, or does it require Linux/macOS?

Expected Error Reduction

Perhaps write a faster implementation in C.

https://github.com/ntucllab/libact/tree/EER

Dataset loading utilities

I think we should move the current get_dataset.py to something like the following utility
http://scikit-learn.org/stable/datasets/

It would be easier to write example in sphinx-gallery that way.

What do you think?
@iamyuanchung @hsuantien

Enhancement for unit testing

For now, the unit tests for the active learning algorithms use results on real-world data with fixed random seeds. So if a future modification to these algorithms conflicts with the current tests, it should be handled carefully.

The rigorous way to test is to design artificial datasets. We'll leave that as a future development goal.

Use pkg-config for setup.py lapacke path

https://github.com/ntucllab/libact/blob/master/setup.py#L30
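
A rough sketch of what this could look like, assuming a lapacke pkg-config file is available on the system (this varies by distribution). Both helper names are hypothetical, and the fallback path is the one that appears in the build logs quoted in other issues:

```python
import shlex
import subprocess

def parse_cflags(output):
    """Extract include directories from `pkg-config --cflags` style output."""
    return [tok[2:] for tok in shlex.split(output) if tok.startswith("-I")]

def lapacke_include_dirs(default=("/usr/include/lapacke",)):
    """Ask pkg-config for lapacke include dirs; fall back to `default` on failure."""
    try:
        out = subprocess.check_output(
            ["pkg-config", "--cflags", "lapacke"], stderr=subprocess.DEVNULL
        )
    except (OSError, subprocess.CalledProcessError):
        return list(default)          # pkg-config missing or package unknown
    return parse_cflags(out.decode()) or list(default)

include_dirs = lapacke_include_dirs()
```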

Next stage

Implement more classical query strategies.
Add examples for using all query strategies.

Installation using pip fails for python 2

Tried to install libact using sudo pip install libact and got the following error message

libact/query_strategies/variance_reduction.c:26:15: error: variable ‘moduledef’ has initializer but incomplete type

You can see the full error message here.

I also tried to install using the setup.py script, which actually worked just fine; the Python 3 installation using pip on the same machine also worked.
I did some googling and the error looked similar to here, but I can't look into it because setup.py worked.
Just wanted to let you guys know.

scikit-learn model adapter

Since we use scikit-learn models a lot, we should define an adapter from scikit-learn models to libact models.

Allow make_query to return multiple items (or the entire scored set)

In certain applications, you might want to know what the top N unlabelled entities are so that a human can go through and do batch labeling offline. Right now I have a particularly hacky way of getting multiple results out, just assuming the majority class in the update, but it would be great to tweak the make_query function to return arbitrary numbers of ordered results for batch label processing.
for i in range(20):
    item_to_investigate = qs.make_query()
    libact_ds.update(item_to_investigate, 0)
    print(item_to_investigate)

Happy to contribute code to try to help this happen!
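
As a sketch of the requested behaviour (not an existing libact function), a batch query could rank per-entry selection scores and return the best N together with their scores. The scores dictionary below is made up, and top_n_queries is a hypothetical name:

```python
import heapq

def top_n_queries(scores, n):
    """Return the n (entry_id, score) pairs with the highest scores, best first.

    `scores` maps unlabeled entry ids to a selection score.
    """
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])

scores = {3: 0.10, 7: 0.48, 9: 0.31, 12: 0.45}
batch = top_n_queries(scores, 2)
```

Returning the scores alongside the ids also lets the caller apply its own threshold before handing the batch to a human labeler.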

add example usage into docstring

give example usage in each code's doc string

Creating libact.base.dataset.Dataset with numpy array may cause error?

If feature vectors are passed as an np.array, calling the format_sklearn() method returns a 3-dimensional array for the features.

IdealLabeler error using numpy 1.11.0b3

https://github.com/ntucllab/libact/blob/master/libact/labelers/ideal_labeler.py#L28

This line may have to be changed to
return self.y[np.where([np.array_equal(x, feature) for x in self.X])[0][0]]

when using numpy 1.11.0b3

Maybe caused by this?
numpy/numpy#6155

HintSVM mldataset - Buffer dtype mismatch error

Hi,

I am trying to use the HintSVM query strategy with the vehicle dataset from mldata.
However, I get the following error and don't understand why:

File "testing.py", line 60, in run
    ask_id = qs.make_query()
  File "/usr/local/lib/python3.5/site-packages/libact-0.1.2-py3.5-macosx-10.12-x86_64.egg/libact/query_strategies/hintsvm.py", line 151, in make_query
    np.array([x.tolist() for x in unlabeled_pool]), self.svm_params)
  File "libact/query_strategies/_hintsvm.pyx", line 16, in libact.query_strategies._hintsvm.hintsvm_query (libact/query_strategies/_hintsvm.c:1836)
ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'long'

I don't get this error when I use other strategies (UncertaintySampling, Quire).

def split_scale_train_test(name_dataset,test_size):
    # choose a dataset with unbalanced class instances
    #data = sklearn.datasets.fetch_mldata('segment')
    data = sklearn.datasets.fetch_mldata(name_dataset)

    X = StandardScaler().fit_transform(data['data'])
    target = np.unique(data['target'])
    # mapping the targets to 0 to n_classes-1
    y = np.array([np.where(target == i)[0][0] for i in data['target']])

    X_trn, X_tst, y_trn, y_tst = \
        train_test_split(X, y, test_size=test_size, stratify=y)

    # making sure each class appears once initially
    init_y_ind = np.array(
        [np.where(y_trn == i)[0][0] for i in range(len(target))])
    y_ind = np.array([i for i in range(len(X_trn)) if i not in init_y_ind])
    trn_ds = Dataset(
        np.vstack((X_trn[init_y_ind], X_trn[y_ind])),
        np.concatenate((y_trn[init_y_ind], [None] * (len(y_ind)))))

    tst_ds = Dataset(X_tst, y_tst)

    fully_labeled_trn_ds = Dataset(
        np.vstack((X_trn[init_y_ind], X_trn[y_ind])),
        np.concatenate((y_trn[init_y_ind], y_trn[y_ind])))

    cost_matrix = 2000. * np.random.rand(len(target), len(target))
    np.fill_diagonal(cost_matrix, 0)

    return trn_ds, tst_ds, y_trn,y_tst, fully_labeled_trn_ds, cost_matrix

def run(trn_ds, tst_ds, lbr, model, qs, quota):
    E_in, E_out = [], []
    score_train = []
    score_test = []

    for _ in range(quota):
        ask_id = qs.make_query()
        X, _ = zip(*trn_ds.data)
        lb = lbr.label(X[ask_id])
        trn_ds.update(ask_id, lb)

        model.train(trn_ds)
        E_in = np.append(E_in, 1 - model.score(trn_ds))
        E_out = np.append(E_out, 1 - model.score(tst_ds))
        score_train = np.append(score_train,model.score(trn_ds)*100)
        score_test = np.append(score_test,model.score(tst_ds)*100)

    return E_in, E_out,score_train,score_test

qs5 = HintSVM(trn_ds5, cl=1.0, ch=1.0, p=0.5)
model = SVM(kernel='rbf', C=n_C, gamma=n_gamma, decision_function_shape='ovr')
E_in_5, E_out_5, score_train_5, score_test_5 = run(trn_ds5, tst_ds, idealLabels, model, qs5, quota_to_query)
results_out.append(E_out_5.tolist())
results_score.append(score_test_5.tolist())

Do you have any insights about this error ?

thank you
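
The message says the compiled extension expected a float64 buffer but received an integer one. Which array arrives with an integer dtype cannot be determined from the report, so the following is only a sketch of the general remedy: cast explicitly before handing data to compiled code. The helper name as_float64_2d is hypothetical.

```python
import numpy as np

def as_float64_2d(rows):
    """Convert a sequence of feature vectors to a C-contiguous float64 matrix."""
    arr = np.ascontiguousarray(rows, dtype=np.float64)
    if arr.ndim != 2:
        raise ValueError("expected a 2-D feature matrix")
    return arr

int_features = np.array([[1, 2], [3, 4]])   # integer dtype, as in the error message
fixed = as_float64_2d(int_features)
```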

Dataset: specify numbers of labels at constructor

The labeled pool may contain only a subset of all possible labels.

self assigned kernel passing to QUIRE

https://github.com/ntucllab/libact/blob/master/libact/query_strategies/quire.py#L66

and the gamma parameter should be part of the kernel.
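
One way to read this suggestion, sketched under the assumption that a kernel is passed as a plain callable: bind gamma inside the kernel so the query strategy only ever sees k(x, z). make_rbf_kernel and gram_matrix are hypothetical names, not QUIRE's current interface.

```python
import math

def make_rbf_kernel(gamma):
    """Return k(x, z) = exp(-gamma * ||x - z||^2) with gamma bound inside the closure."""
    def kernel(x, z):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
        return math.exp(-gamma * sq_dist)
    return kernel

def gram_matrix(points, kernel):
    """Evaluate a user-supplied kernel on every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]

rbf = make_rbf_kernel(gamma=0.5)
K = gram_matrix([(0.0, 0.0), (1.0, 1.0)], rbf)
```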

Is a specific version of Python required when compiling? Compile error using "python setup.py install"

Hello, Thank you for providing this project

After I have installed the dependencies, I run
python setup.py install

But, I get some errors:

Platform Detection: Linux. Link to liblapacke...
running install
running build
running build_py
running build_ext
building 'libact.query_strategies._variance_reduction' extension
C compiler: x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC

compile options: '-I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/lapacke -I/usr/include/python2.7 -c'
extra options: '-std=c11'
x86_64-linux-gnu-gcc: libact/query_strategies/src/variance_reduction/variance_reduction.c
libact/query_strategies/src/variance_reduction/variance_reduction.c:26:15: error: variable ‘moduledef’ has initializer but incomplete type
static struct PyModuleDef moduledef = {
^
libact/query_strategies/src/variance_reduction/variance_reduction.c:27:5: error: ‘PyModuleDef_HEAD_INIT’ undeclared here (not in a function)
PyModuleDef_HEAD_INIT,
^
...

I wondered whether I need to specify the Python version, so I tried
python3 setup.py install
I still cannot install successfully, but the error changes:
File "setup.py", line 13, in <module>
from Cython.Build import cythonize
ImportError: No module named 'Cython'

However, I have already installed Cython using "pip install Cython"

It would be very kind of you to tell me the required versions of the dependencies,

or to tell me how to modify "-I/usr/include/lapacke -I/usr/include/python2.7" in the compile options.

Many Thanks

Documentation on implementing your own algorithm in this framework

Supporting multi-label active learning problems.

It seems the current interfaces could support multi-label problems without too many changes?

Possible algorithms to implement:

Supporting multiple queries at a time

#57 #84

Polishing documentation

The basic infrastructure for documentation generation has been established.
Please read the spec on how to write documentation in your code.

We are currently using numpydoc:
https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt

There are also many bugs in the Sphinx build waiting to be fixed:
https://readthedocs.org/projects/striatum/builds/4370706/

Supporting selection score when make_query

Some applications would need the selection score to do further things.

QS: Model type check at constructor

For QSs that rely on a user-given model, a type check should be performed, since different QSs require different capabilities (e.g. UncertaintySampling requires a ContinuousModel).
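
A minimal sketch of such a check. The classes below are local stand-ins defined only for this example (they mimic, but are not, libact's own interfaces), and NeedsContinuousModel is a hypothetical query strategy:

```python
class Model:
    """Stand-in base class for any trainable model."""

class ContinuousModel(Model):
    """Stand-in for models that can also output real-valued scores."""
    def predict_real(self, feature):
        raise NotImplementedError

class NeedsContinuousModel:
    """Hypothetical query strategy that validates its model at construction."""
    def __init__(self, model):
        if not isinstance(model, ContinuousModel):
            raise TypeError(
                "model must be a ContinuousModel, got %s" % type(model).__name__
            )
        self.model = model

try:
    NeedsContinuousModel(Model())          # plain Model lacks predict_real
    rejected = False
except TypeError:
    rejected = True

accepted = NeedsContinuousModel(ContinuousModel())
```

Failing in the constructor surfaces the mismatch immediately, rather than at the first make_query call.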

Is there a way to perform batch mode active learning ?

Hi,

Instead of having unlabeled data arrive as a stream, I would like to know whether libact can perform batch-mode active learning, meaning that users can select multiple images at once (positives and negatives)?

thank you in advance

Introduction on Pypi site

https://pypi.python.org/pypi/libact

installation instructions
related links to docs and github
introduction

Fix Travis Python 3.5 build

Python 3.5 seems to import everything before running unit tests, the _variance_reduction native extension is built and installed but import fails:

ImportError: Failed to import test module: libact.query_strategies
Traceback (most recent call last):
  File "/opt/python/3.5.0/lib/python3.5/unittest/loader.py", line 462, in _find_test_path
    package = self._get_module_from_name(name)
  File "/opt/python/3.5.0/lib/python3.5/unittest/loader.py", line 369, in _get_module_from_name
    __import__(name)
  File "/home/travis/build/ntucllab/libact/libact/query_strategies/__init__.py", line 16, in <module>
    from .variance_reduction import VarianceReduction
  File "/home/travis/build/ntucllab/libact/libact/query_strategies/variance_reduction.py", line 11, in <module>
    from libact.query_strategies import _variance_reduction
ImportError: cannot import name '_variance_reduction'

Build/install log of extension:

running build_ext
building 'libact.query_strategies._variance_reduction' extension
C compiler: gcc -pthread -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC
creating build/temp.linux-x86_64-3.5
creating build/temp.linux-x86_64-3.5/libact
creating build/temp.linux-x86_64-3.5/libact/query_strategies
compile options: '-I/home/travis/virtualenv/python3.5.0/lib/python3.5/site-packages/numpy/core/include -I/opt/python/3.5.0/include/python3.5m -c'
extra options: '-std=c11'
Warning: Can't read registry to find the necessary compiler setting
Make sure that Python modules winreg, win32api or win32con are installed.
gcc: libact/query_strategies/variance_reduction.c
gcc -pthread -shared -L/opt/python/3.5.0/lib -Wl,-rpath=/opt/python/3.5.0/lib build/temp.linux-x86_64-3.5/libact/query_strategies/variance_reduction.o -L/opt/python/3.5.0/lib -lpython3.5m -o build/lib.linux-x86_64-3.5/libact/query_strategies/_variance_reduction.cpython-35m-x86_64-linux-gnu.so -llapacke -llapack -lblas
running install_lib
creating /home/travis/virtualenv/python3.5.0/lib/python3.5/site-packages/libact
creating /home/travis/virtualenv/python3.5.0/lib/python3.5/site-packages/libact/query_strategies
copying build/lib.linux-x86_64-3.5/libact/query_strategies/_variance_reduction.cpython-35m-x86_64-linux-gnu.so -> /home/travis/virtualenv/python3.5.0/lib/python3.5/site-packages/libact/query_strategies