modal-python / modal

A modular active learning framework for Python

Home Page: https://modAL-python.github.io/

License: MIT License

modal's Issues

pytorch examples error

Thank you for your great work!

I ran the script examples/pytorch_integration.py but got the following error: TypeError: <class 'torch.Tensor'> datatype is not supported.

According to the traceback, the error is raised by the line learner.teach(X_pool[query_idx], y_pool[query_idx], only_new=True). I tried PyTorch 1.1.0 and 0.4.1, and the same error occurred with both.

I'm using Windows 10 and Python 3.6. Could you please give me some advice on how to fix it? Thanks in advance.

Is it possible to work with CNNs and images without flattening the data?

I see that ActiveLearner expects a 2D array for X_training. What if I want to train a Keras CNN (ResNet, Inception, etc.) on CIFAR-10 images of shape (32, 32, 3), for example? Is that possible?

Can modAL be used with Keras multi_gpu_model? How should I do it?

When I use modAL with Keras multi_gpu_model, training fails with the following error:

Query no. 1
Traceback (most recent call last):
  File "/home/es712/Documents/MingHan/pycode/test/ALtest.py", line 77, in <module>
    query_idx, query_instance = learner.query(X_pool, n_instances=100, verbose=0)
  File "/home/es712/pythonenvs/tensorflow1.13.1/tensorflow1.13.1/lib/python3.6/site-packages/modAL/models/base.py", line 203, in query
    query_result = self.query_strategy(self, *query_args, **query_kwargs)
  File "/home/es712/pythonenvs/tensorflow1.13.1/tensorflow1.13.1/lib/python3.6/site-packages/modAL/uncertainty.py", line 152, in uncertainty_sampling
    uncertainty = classifier_uncertainty(classifier, X, **uncertainty_measure_kwargs)
  File "/home/es712/pythonenvs/tensorflow1.13.1/tensorflow1.13.1/lib/python3.6/site-packages/modAL/uncertainty.py", line 77, in classifier_uncertainty
    classwise_uncertainty = classifier.predict_proba(X, **predict_proba_kwargs)
  File "/home/es712/pythonenvs/tensorflow1.13.1/tensorflow1.13.1/lib/python3.6/site-packages/modAL/models/base.py", line 186, in predict_proba
    return self.estimator.predict_proba(X, **predict_proba_kwargs)
  File "/home/es712/pythonenvs/tensorflow1.13.1/tensorflow1.13.1/lib/python3.6/site-packages/tensorflow/python/keras/wrappers/scikit_learn.py", line 265, in predict_proba
    probs = self.model.predict_proba(x, **kwargs)
AttributeError: 'Model' object has no attribute 'predict_proba'
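
The traceback ends where a wrapped model is asked for predict_proba and does not have one. A possible direction, sketched here with stand-in classes rather than real Keras objects, is an adapter that answers predict_proba by delegating to predict. PredictProbaAdapter and DummySoftmaxModel are hypothetical names, and the sketch assumes the network ends in a softmax so that predict already returns class probabilities. How such an adapter would be wired into the Keras scikit-learn wrapper is not shown and is untested.

```python
import numpy as np


class PredictProbaAdapter:
    """Hypothetical wrapper exposing predict_proba on top of a model's predict."""

    def __init__(self, model):
        self.model = model

    def predict(self, X, **kwargs):
        return self.model.predict(X, **kwargs)

    def predict_proba(self, X, **kwargs):
        # Assumption: the wrapped model ends in a softmax, so predict yields probabilities.
        return self.model.predict(X, **kwargs)


class DummySoftmaxModel:
    """Stand-in for a network whose predict returns class probabilities."""

    def predict(self, X):
        logits = np.asarray(X, dtype=float)
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)


adapter = PredictProbaAdapter(DummySoftmaxModel())
proba = adapter.predict_proba([[2.0, 1.0], [0.0, 3.0]])
print(proba.shape)
```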

Hi, I am new to programming. Can anyone tell me whether I can use this package for multi-dimensional problems as well, for example the multi-dimensional Rosenbrock function?
If so, could somebody tell me how to set the kernels for multiple dimensions?
Thank you.

Error in expected_error.py

Hello,

I noticed a bug in expected_error.py. On line 70, the loss variable, which is supposed to hold the value 'log' or 'binary', is overwritten by a local variable. As a result, after the first iteration the if statement no longer matches and the loss remains constant. Renaming the local variable to nloss or something similar should fix it.

Thank you,
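
The pattern described in this report can be reproduced in isolation. The sketch below is not modAL's code: the function names and the simplified loss formulas are made up for illustration. It only shows how reusing the name loss for the numeric result disables the mode check after the first iteration, and how a separate name such as nloss avoids that.

```python
import math


def per_instance_loss_buggy(probas, loss="binary"):
    """Reuses `loss` for the numeric result, clobbering the mode string."""
    results = []
    for p in probas:
        if loss == "binary":
            loss = 1.0 - max(p)  # bug: `loss` is now a float, not "binary"
        elif loss == "log":
            loss = -sum(q * math.log(q) for q in p)
        results.append(loss)
    return results


def per_instance_loss_fixed(probas, loss="binary"):
    """Keeps the mode in `loss` and stores the numeric value in `nloss`."""
    results = []
    for p in probas:
        if loss == "binary":
            nloss = 1.0 - max(p)
        elif loss == "log":
            nloss = -sum(q * math.log(q) for q in p)
        results.append(nloss)
    return results


probas = [[0.9, 0.1], [0.6, 0.4], [0.5, 0.5]]
buggy = per_instance_loss_buggy(probas)
fixed = per_instance_loss_fixed(probas)
print(buggy, fixed)
```

In the buggy version every entry after the first repeats the first value, which matches the "loss remains constant" symptom reported above.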

Input data with different lengths / filled with NAs

I'm trying to use modAL in combination with tslearn to classify timeseries of different lengths.
tslearn supports variable-length time series by filling the shorter time series up with NAs, but modAL calls

check_X_y(X, y, accept_sparse=True, ensure_2d=False, allow_nd=True, multi_output=True)

without setting force_all_finite = 'allow-nan'.
Is there a reason for not allowing NAs, or did this use case just not come up before?

Thanks a lot!

There is an error when I use this model on image classification.

import numpy as np
import os
import glob
from skimage import io, transform
import matplotlib.pyplot as plt
from copy import deepcopy
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner, Committee

# dataset path
path = 'E:/data/datasets/flower_photos/'
# model save path
model_path = 'E:/data/model/model.ckpt'

# resize all images to 100*100
w = 100
h = 100
c = 3

# read the images
def read_img(path):
    cate = [path + x for x in os.listdir(path) if os.path.isdir(path + x)]
    imgs = []
    labels = []
    for idx, folder in enumerate(cate):
        for im in glob.glob(folder + '/*.jpg'):
            print('reading the images:%s' % (im))
            img = io.imread(im)
            img = transform.resize(img, (w, h))
            imgs.append(img)
            labels.append(idx)
    return np.asarray(imgs, np.float32), np.asarray(labels, np.int32)

data, label = read_img(path)
shape = np.shape(data)

# build the pool
X_pool = deepcopy(data)
y_pool = deepcopy(label)

# initialize the committee
n_members = 3
learner_list = list()

for member_idx in range(n_members):
    # initialize the training set
    n_initial = 100
    train_idx = np.random.choice(range(X_pool.shape[0]), size=n_initial, replace=False)
    X_train = X_pool[train_idx]
    y_train = y_pool[train_idx]
    # remove the training instances from the pool
    X_pool = np.delete(X_pool, train_idx, axis=0)
    y_pool = np.delete(y_pool, train_idx)

    # initialize the learner
    learner = ActiveLearner(estimator=RandomForestClassifier(), X_training=X_train, y_training=y_train)
    learner_list.append(learner)

Traceback (most recent call last):
  File "E:/query_by_committee/query_by_committee.py", line 56, in <module>
    learner = ActiveLearner(estimator=RandomForestClassifier(),X_training=X_train,y_training=y_train)
  File "D:\Anaconda3\lib\site-packages\modAL\models.py", line 104, in __init__
    self.X_training = check_array(X_training)
  File "D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 451, in check_array
    % (array.ndim, estimator_name))
ValueError: Found array with dim 4. Estimator expected <= 2.

I am a novice, so the problem may be stupid, please excuse me.
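
The ValueError says the image array has 4 dimensions while at most 2 are accepted at that point. For an estimator such as a random forest, which works on flat feature vectors anyway, one common workaround is to flatten each image into a single row before passing it on. The sketch below uses random numbers in place of real photos, and the batch size of 8 is arbitrary.

```python
import numpy as np

# Stand-in for a batch of 8 RGB images resized to 100x100 (random data, not real photos).
images = np.random.rand(8, 100, 100, 3).astype(np.float32)

# Collapse every axis except the first so that each image becomes one feature row.
X_flat = images.reshape(len(images), -1)
print(images.ndim, X_flat.shape)

# The original layout can be restored whenever a model needs it.
restored = X_flat.reshape(-1, 100, 100, 3)
```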

Hierarchical sampling for active learning

Implement the hierarchical sampling algorithm from the Dasgupta and Hsu paper.

Refactoring documentation

cold start handling in ranked batch sampling

Hi!

The cold-start handling in ranked batch sampling seems to differ from what is described in Cardoso et al.'s "Ranked batch-mode active learning".

modAL/modAL/batch.py, lines 133 to 139 in 452898f:

    if classifier.X_training is None:
        labeled = select_cold_start_instance(X=unlabeled, metric=metric, n_jobs=n_jobs)
    elif classifier.X_training.shape[0] > 0:
        labeled = classifier.X_training[:]

    # Define our record container and the maximum number of records to sample.
    instance_index_ranking = []

In modAL's implementation, in the cold-start case, the instance selected by select_cold_start_instance is not added to the index list instance_index_ranking.
In "Ranked batch-mode active learning", however, the instance selected by select_cold_start_instance seems to be the first item in instance_index_ranking.

modAL/modAL/batch.py, line 46 in 452898f:

    return X[best_coldstart_instance_index].reshape(1, -1)

If my understanding of the algorithm proposed in the paper and of modAL's implementation is correct, we can change the return statement of select_cold_start_instance to

    return best_coldstart_instance_index, X[best_coldstart_instance_index].reshape(1, -1)

store best_coldstart_instance_index in instance_index_ranking, and revise ranked_batch accordingly.
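
The proposal above could look roughly like the following. This is a NumPy-only sketch and not a patch against modAL: the function name select_cold_start_instance_with_index is hypothetical, Euclidean distance is hard-coded, and picking the point with the smallest mean distance to all others is an assumption about the selection criterion.

```python
import numpy as np


def select_cold_start_instance_with_index(X):
    """Return (index, instance) of the point with the smallest mean distance to all others."""
    # Pairwise Euclidean distances via broadcasting; fine for small pools.
    diffs = X[:, None, :] - X[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    best_coldstart_instance_index = int(np.argmin(distances.mean(axis=0)))
    return best_coldstart_instance_index, X[best_coldstart_instance_index].reshape(1, -1)


unlabeled = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.4, 0.4], [5.0, 5.0]])

instance_index_ranking = []
idx, labeled = select_cold_start_instance_with_index(unlabeled)
instance_index_ranking.append(idx)  # the cold-start pick now leads the ranking
print(instance_index_ranking, labeled)
```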

modAL.disagreement.max_std_sampling

Hi,
I am experiencing an issue when using modAL.disagreement.max_std_sampling in a custom query strategy.

When n_instances equals the total number of instances in X, the function does not return the indices and samples sorted by score; it returns them in their initial order. It seems to work only when n_instances < X.shape[0].

sample_idx, sample_x = max_std_sampling(regressor, X, n_instances=X.shape[0])
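
If a full ranking is what is needed, one workaround that does not depend on the library's selection routine is to sort the scores directly. The sketch assumes the per-instance standard deviations are already available as a NumPy array; rank_by_std is a hypothetical helper, not a modAL function.

```python
import numpy as np


def rank_by_std(std, n_instances):
    """Indices of the n_instances largest standard deviations, largest first."""
    order = np.argsort(-std, kind="stable")
    return order[:n_instances]


std = np.array([0.2, 0.9, 0.1, 0.5])
full_ranking = rank_by_std(std, n_instances=len(std))
print(full_ranking)
```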

Replace np.sum(generator) with np.sum(np.fromiter(generator, dtype)) in modAL.utils.combination

np.sum(generator) throws a DeprecationWarning and should be replaced with np.sum(np.fromiter(generator, dtype)). Note that the NumPy function is spelled np.fromiter and requires a dtype argument.
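
A minimal illustration of the replacement, using made-up weights and values rather than anything from modAL.utils.combination:

```python
import numpy as np

weights = [0.5, 1.5, 2.0]
values = [2.0, 4.0, 1.0]

# Materialise the generator into an array first; np.fromiter requires a dtype.
generator = (w * v for w, v in zip(weights, values))
total = np.sum(np.fromiter(generator, dtype=float))
print(total)
```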

issues with query strategy

Hi Tivadar,

Thank you for this solid project.

I am running pool_based_sampling.py on my own dataset. The dataset consists of image features and their labels in a NumPy array.
However, I am getting an error caused by this line of code:

learner.teach(
    X=new_test[query_idx].reshape(1, -1),
    y=dummy_test[query_idx].reshape(1,)
)

Here is the error:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

File "c:\users\nainasaid\appdata\local\programs\python\python36\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
execfile(filename, namespace)

File "c:\users\nainasaid\appdata\local\programs\python\python36\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/NainaSaid/Downloads/Active Learning Codes/modAL-master/examples/pool-based_sampling.py", line 118, in
y=dummy_test[query_idx].reshape(1,)

File "c:\users\nainasaid\appdata\local\programs\python\python36\lib\site-packages\modAL\models\learners.py", line 95, in teach
self._add_training_data(X, y)

File "c:\users\nainasaid\appdata\local\programs\python\python36\lib\site-packages\modAL\models\base.py", line 81, in _add_training_data
self.X_training = data_vstack((self.X_training, X))

File "c:\users\nainasaid\appdata\local\programs\python\python36\lib\site-packages\modAL\utils\data.py", line 28, in data_vstack
raise TypeError('%s datatype is not supported' % type(blocks[0]))

TypeError: <class 'pandas.core.frame.DataFrame'> datatype is not supported

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I don't know why I am getting this error even though the "dummy_test" variable is a NumPy array and not a pandas DataFrame. Can somebody please help?

Question: Does ActiveLearner support a pre-trained RandomForestClassifier?

Hello,

I initialize my active learner with a saved, trained random forest classifier (loaded with pickle) together with its training samples, as you can see in the code below.
Does this affect the performance of the active learner?
The results I obtained are very bad, and I get better results with random selection using the same number of samples.

I would appreciate any feedback or advice !

Thank you in advance,

# old model
Model = pickle.load(open(OldModel, 'rb'))[0]

# training samples used by the old model
TrainDset0 = pd.read_csv(OldTrainFile, sep=",")
X_train0 = np.array(TrainDset0.loc[:, TrainDset0.loc[:, 'band_0':'band_129'].columns.tolist()])
y_train0 = np.array(TrainDset0.loc[:, str(ClassLabel)])

# new train samples
TrainDset2 = pd.read_csv(NewTrainFile, sep=",")
X_train2 = np.array(TrainDset2.loc[:, TrainDset2.loc[:, 'band_0':'band_129'].columns.tolist()])
y_train2 = np.array(TrainDset2.loc[:, str(ClassLabel)])

# validation samples
ValidationDset = pd.read_csv(NewValidationFile, sep=",")
X_validation = np.array(ValidationDset.loc[:, ValidationDset.loc[:, 'band_0':'band_129'].columns.tolist()])
y_validation = np.array(ValidationDset.loc[:, str(ClassLabel)])

# active learner
AdditionalSamples = 10
MaxScore = 0.9
estimator = deepcopy(Model)

Learner = ActiveLearner(estimator=estimator, query_strategy=entropy_sampling, X_training=X_train0, y_training=y_train0)
while Learner.score(X_validation, y_validation) < MaxScore:
    query_idx, query_inst = Learner.query(X_train2, n_instances=AdditionalSamples)
    Learner.teach(X=query_inst, y=y_train2[query_idx], only_new=False)
    X_train2 = np.delete(X_train2, query_idx, axis=0)
    y_train2 = np.delete(y_train2, query_idx)

some results (it is the case of many iterations and data)

with AL samples [0.13, 0.0, 0.7, 0.66, 0.60, 0.49, 0.56, 0.81,................... 0.56, 0.71]
with Random samples [0.13, 0.0, 0.60, 0.70, 0.72, 0.71, 0.85, 0.84,................... 0.87, 0.88]

_add_training_data and _fit_on_new do not work with custom data types

I have a sklearn pipeline that accepts a custom data type as input, but when I use that pipeline and teach the learner, I get the following error:
TypeError: float() argument must be a string or a number, not 'MyClass'

I traced the problem back to the check_X_y function used in BaseLearner. After I added dtype=None, so that it preserves the input type instead of trying to convert it to a numeric type, it no longer threw any errors and worked as expected.

I think that should be the default behaviour, rather than modAL trying to convert our data types for us.
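
The difference between forcing a numeric dtype and preserving the input type can be seen with plain NumPy. This sketch does not call scikit-learn's check_X_y, so it only illustrates the conversion behaviour behind the error message; MyClass here is a hypothetical stand-in for the reporter's custom type.

```python
import numpy as np


class MyClass:
    """Hypothetical stand-in for a custom input type consumed by a pipeline."""

    def __init__(self, text):
        self.text = text


samples = [MyClass("a"), MyClass("b")]

# Forcing a numeric dtype fails with a TypeError of the kind quoted above.
try:
    np.asarray(samples, dtype=float)
    conversion_failed = False
except TypeError:
    conversion_failed = True

# Keeping dtype=object leaves the instances untouched.
preserved = np.asarray(samples, dtype=object)
print(conversion_failed, preserved.dtype, preserved.shape)
```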

Change number of epochs in keras with activelearner class

I would like to use the ActiveLearner class in combination with a Keras model.
I followed the example code in the documentation and everything worked fine.
However, the model performance is really poor.
One major drawback I noticed is that I cannot change the number of epochs for each training step in the active learning loop.
Normally you would set the number of epochs in the model.fit call, but that does not seem possible in the current configuration.
I would be happy if you could give me a hint on how to accomplish this, or perhaps take this issue as inspiration for the next update.

missing 'inputs' positional argument with ActiveLearner function

All of my relevant code:

#!/usr/bin/env python3.5

from data_generator import data_generator as dg

# standard imports
from keras.models import load_model
from keras.utils import to_categorical
from keras.wrappers.scikit_learn import KerasClassifier
from os import listdir
import pandas as pd
import numpy as np
from modAL.models import ActiveLearner

######## NEW STUFF ########

# get filenames and folder names
data_location = './sensor_preprocessed_dataset/flow_rates_pressures/'
subfolders = ['true','false']

###########################

classifier = KerasClassifier(load_model('./0.7917.h5'))

(X_train, y_train), (X_test, y_test) = dg.load_data_for_model(data_location, subfolders)
WINDOW_SIZE = X_train[0].shape[0]
CHANNELS = X_train[0].shape[1]

# reshape and retype the data for the classifier
X_train = X_train.reshape(X_train.shape[0], WINDOW_SIZE, CHANNELS, 1)
X_test = X_test.reshape(X_test.shape[0], WINDOW_SIZE, CHANNELS, 1)

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# assemble initial data
n_initial = 30
initial_idx = np.random.choice(range(len(X_train)), size=n_initial, replace=False)
X_initial = X_train[initial_idx]
y_initial = y_train[initial_idx]

learner = ActiveLearner(
	estimator=classifier,
	X_training=X_train,
	y_training=y_train,
	verbose=1
)

X_pool = X_test
y_pool = y_test

n_queries = 10
for idx in range(n_queries):
    print('Query no. %d' % (idx + 1))
    query_idx, query_instance = learner.query(X_pool, n_instances=100, verbose=0)
    learner.teach(
        X=X_pool[query_idx], y=y_pool[query_idx], only_new=True,
        verbose=1
    )
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)

Messages, Warnings, and Errors:

Using TensorFlow backend.
WARNING:tensorflow:From /home/jazz/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-01-24 10:03:54.427147: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-24 10:03:54.447927: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2712000000 Hz
2020-01-24 10:03:54.448529: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4c01c00 executing computations on platform Host. Devices:
2020-01-24 10:03:54.448599: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
WARNING:tensorflow:From /home/jazz/.local/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/jazz/.local/lib/python3.5/site-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Traceback (most recent call last):
  File "./classifier.py", line 45, in <module>
    y_training=y_train
  File "/home/jazz/.local/lib/python3.5/site-packages/modAL/models/learners.py", line 79, in __init__
    X_training, y_training, bootstrap_init, **fit_kwargs)
  File "/home/jazz/.local/lib/python3.5/site-packages/modAL/models/base.py", line 63, in __init__
    self._fit_to_known(bootstrap=bootstrap_init, **fit_kwargs)
  File "/home/jazz/.local/lib/python3.5/site-packages/modAL/models/base.py", line 106, in _fit_to_known
    self.estimator.fit(self.X_training, self.y_training, **fit_kwargs)
  File "/home/jazz/.local/lib/python3.5/site-packages/keras/wrappers/scikit_learn.py", line 210, in fit
    return super(KerasClassifier, self).fit(x, y, **kwargs)
  File "/home/jazz/.local/lib/python3.5/site-packages/keras/wrappers/scikit_learn.py", line 139, in fit
    **self.filter_sk_params(self.build_fn.__call__))
TypeError: __call__() missing 1 required positional argument: 'inputs'

I honestly don't even know where to begin to solve this. My code is based on your example here: https://modal-python.readthedocs.io/en/latest/content/examples/Keras_integration.html

And I've read the docs here: https://modal-python.readthedocs.io/en/latest/content/apireference/models.html

Any input is appreciated.
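
The last frames of the traceback show the wrapper invoking self.build_fn.__call__ without positional arguments, and the object passed as build_fn here is an already loaded model, whose __call__ expects inputs. The sketch below reproduces that mismatch with stand-in classes only (FakeModel and call_build_fn are hypothetical and no Keras code is run) and shows that handing over a zero-argument factory avoids it. In the real script the analogous change would presumably be to wrap load_model('./0.7917.h5') in a function and pass that function instead, but that is untested here.

```python
class FakeModel:
    """Stand-in for a loaded model: calling it requires an `inputs` argument."""

    def __call__(self, inputs):
        return inputs


def call_build_fn(build_fn):
    """Mimics a wrapper that invokes build_fn with no positional arguments."""
    return build_fn()


loaded = FakeModel()

# Passing the model instance itself fails, as in the traceback above.
try:
    call_build_fn(loaded)
    instance_failed = False
except TypeError:
    instance_failed = True


# Passing a zero-argument factory that returns the model works.
def build_model():
    return loaded


model = call_build_fn(build_model)
print(instance_failed, model is loaded)
```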

Notebook-style documentation?

Hi Tivadar,

First: this is a really solid project — thank you for your contributions!

I noticed that the examples that accompany this repository are functionally sufficient, but difficult for someone to skim/follow along without running locally. Do you think it'd be worthwhile to add examples that are in a Jupyter notebook format? If so, would you mind if I took a crack at porting over one of the current scripts into a notebook (with inline plots, comments, etc.) this weekend? Thanks!

Danny

Issues with multidimensional input data

Hi there

I'm trying to "replicate" the example you have for Active regression with Gaussian processes for 2d input data.

This is the code so far (based on what you provided in the example):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, RBF
from modAL.models import ActiveLearner

# query strategy for regression
def GP_regression_std(regressor, X):
    _, std = regressor.predict(X, return_std=True)
    query_idx = np.argmax(std)
    return query_idx, X[query_idx]

# generating the data
num_dim, num_data = 2, 1000
data = np.random.rand(num_data, num_dim)
x_data = data[:,0]
y_data = data[:,1]
label = np.sin(np.sqrt(x_data ** 2 + y_data **2)) 

# assembling initial training set
n_initial = 50
initial_idx = np.random.choice(range(len(data)), size=n_initial, replace=False)
X_initial, y_initial = data[initial_idx], label[initial_idx]

# defining the kernel for the Gaussian process
kernel = RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e3)) \
        + WhiteKernel(noise_level=1, noise_level_bounds=(1e-10, 1e+1))

# initializing the active learner
regressor = ActiveLearner(
    estimator=GaussianProcessRegressor(kernel=kernel),
    query_strategy=GP_regression_std,
    X_training=X_initial, y_training=y_initial
)

# active learning
n_queries = 100
for idx in range(n_queries):
    query_idx, query_instance = regressor.query(data)
    regressor.teach(data[query_idx], label[query_idx].reshape(1, -1))

And this is the error I get:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
~/Documents/tmc/python/active_learning/gpr2d.py in <module>()
     42 for idx in range(n_queries):
     43     query_idx, query_instance = regressor.query(data)
---> 44     regressor.teach(data[query_idx], label[query_idx].reshape(1, -1))

/usr/local/lib/python3.7/site-packages/modAL/models.py in teach(self, X, y, bootstrap, only_new, **fit_kwargs)
    352             Keyword arguments to be passed to the fit method of the predictor.
    353         """
--> 354         self._add_training_data(X, y)
    355         if not only_new:
    356             self._fit_to_known(bootstrap=bootstrap, **fit_kwargs)

/usr/local/lib/python3.7/site-packages/modAL/models.py in _add_training_data(self, X, y)
     63         classifier has seen.
     64         """
---> 65         assert len(X) == len(y), 'the number of new data points and number of labels must match'
     66
     67         if type(self.X_training) != type(None):

AssertionError: the number of new data points and number of labels must match

I understand the issue, but am not sure how to pass multidimensional data. I suspect the solution will be simple.

Thanks in advance for your time and attention!

Cheers
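
The assertion compares len(X) with len(y). Because np.argmax returns a single integer, data[query_idx] is one row of shape (2,), so its length is 2, the number of features, while the reshaped label has length 1. A sketch of the shapes involved, with a fixed index standing in for the query result:

```python
import numpy as np

data = np.random.rand(1000, 2)
label = np.sin(np.sqrt(data[:, 0] ** 2 + data[:, 1] ** 2))

query_idx = 7  # np.argmax returns a single integer like this

# Indexing with a scalar drops the sample axis: shape (2,), so len(X) == 2.
X_bad = data[query_idx]
y_bad = label[query_idx].reshape(1, -1)

# Restoring the sample axis gives one row of two features and one label.
X_new = data[query_idx].reshape(1, -1)
y_new = label[query_idx].reshape(1,)
print(X_bad.shape, y_bad.shape, X_new.shape, y_new.shape)
```

Passing X_new and y_new to teach should satisfy the length check, since both then have length 1.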

using modAL with object detection algorithms

Hello,

first of all I would like to thank you for sharing this great code.
I was wondering if it is possible to integrate an object detection method like (SSD, YOLO, ..., etc) to label specific objects in images with modAL?

thanks a lot again.
Regards

Return confidence score for query samplers

Dear modAL team,
I am trying to use modAL's strategies outside of the ActiveLearner, and I would like to use the confidence scores, but the query sampling functions do not return them. For example, uncertainty_sampling returns the indices of the samples and the samples themselves, but not the score associated with each of them.
Do you think a keyword argument such as return_scores=False (similar to return_proba for the predictor in some estimators), which adds the scores to the returned tuple, would be a good idea?

Thanks for your feedback.
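
To make the request concrete, here is a standalone sketch of what such an option might look like. It is not modAL's uncertainty_sampling: the function name and the return_scores parameter are hypothetical, it takes a probability matrix directly instead of a classifier, and it uses one minus the top class probability as the uncertainty score.

```python
import numpy as np


def uncertainty_sampling_with_scores(proba, n_instances=1, return_scores=False):
    """Pick the n_instances rows of `proba` with the least confident top prediction."""
    uncertainty = 1.0 - proba.max(axis=1)
    query_idx = np.argsort(-uncertainty, kind="stable")[:n_instances]
    if return_scores:
        return query_idx, uncertainty[query_idx]
    return query_idx


proba = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]])
idx, scores = uncertainty_sampling_with_scores(proba, n_instances=2, return_scores=True)
print(idx, scores)
```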

support Pandas dataframe as training data

Thanks for sharing the great code!
Lightgbm is a popular package, which supports numpy, pd.df as train/test data.
It would be great for modAL to support pd.df as train/pool/test data.
Thanks!

.query on large DataFrame yields "None of [Int64Index([...] dtype='int64')] are in the [columns]"

This simple example:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from modAL.models import ActiveLearner

X = pd.DataFrame([[1],[2],[3]])
y = pd.Series([True, False, False])
my_learner = ActiveLearner(estimator=LogisticRegression(), X_training=X, y_training=y)
df = pd.concat([X]*2000)
query_idx, _ = my_learner.query(df, n_instances=100)

yields:

KeyError: "None of [Int64Index([1665, 1662, 5412, 3399, 1758, 4866, 1755, 3402, 1752, 5415, 3405,\n            1749, 1746, 3408, 1743, 5418, 4863, 1740, 3411, 1737, 3414, 1734,\n            5421, 1731, 3417, 1728, 4860, 3420, 1725, 5424, 1722, 3423, 1719,\n            3426, 1716, 5427, 1713, 4857, 3429, 1710, 3432, 1707, 5430, 1704,\n            3435, 1701, 4854, 1698, 5433, 3438, 1695, 3441, 1692, 1689, 5436,\n            3444, 1686, 4851, 1683, 3447, 1680, 5439, 3450, 1677, 1674, 3453,\n            1671, 5442, 4848, 1668, 3456, 1764, 3459, 5469, 1587, 3492, 1608,\n            5463, 3495, 1605, 1602, 3498, 1599, 5466, 4833, 1596, 3501, 1593,\n            3504, 1590, 4836, 1575, 3513, 3519, 4827, 1569, 5475, 1572, 3516,\n            1614],\n           dtype='int64')] are in the [columns]"

at:

/databricks/python/lib/python3.7/site-packages/modAL/uncertainty.py in uncertainty_sampling(classifier, X, n_instances, random_tie_break, **uncertainty_measure_kwargs)
    157         query_idx = shuffled_argmax(uncertainty, n_instances=n_instances)
    158 
--> 159     return query_idx, X[query_idx]

It works fine with a smaller input, like:

...
query_idx, _ = my_learner.query(X, n_instances=1)

It seems that query_idx is a plain array for smaller inputs but a different index representation, Int64Index, when the number of instances or the input is large, and that this cannot be used for indexing rows in X.

Is it possible that this needs to be X.iloc[query_idx]? I don't really know enough pandas to know for sure. Thanks!
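
The suspicion about .iloc can be checked with pandas alone. With a DataFrame, X[query_idx] treats the values as column labels, which is consistent with the KeyError mentioning "[columns]", whereas X.iloc[query_idx] selects rows by position. The small-input case may succeed only because the queried position happens to coincide with an existing column label. The sizes and indices below are made up for illustration.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame([[1], [2], [3]])
pool = pd.concat([X] * 4)          # index labels repeat: 0, 1, 2, 0, 1, 2, ...
query_idx = np.array([5, 7, 10])   # positions, as a query strategy would return

# pool[query_idx] would look these values up as column labels and raise KeyError.
selected = pool.iloc[query_idx]    # positional row selection
print(selected)
```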

Ranked batch mode sampling - pre-compute pairwise distance to reduce running time

The following code is called to compute the pairwise distances for every sample selected within a batch. This slows the program down significantly for larger batch sizes.

modAL/modAL/batch.py, lines 93 to 96 in 4029dfd:

    if n_jobs == 1 or n_jobs is None:
        _, distance_scores = pairwise_distances_argmin_min(X_pool[mask], X_training, metric=metric)
    else:
        distance_scores = pairwise_distances(X_pool[mask], X_training, metric=metric, n_jobs=n_jobs).min(axis=1)

We can compute the pairwise distances once per batch within ranked_batch (outside the for loop), pass only the minimum-distance array to select_instance, and assign it directly to distance_scores, replacing the computation at

modAL/modAL/batch.py, line 96 in 4029dfd:

    distance_scores = pairwise_distances(X_pool[mask], X_training, metric=metric, n_jobs=n_jobs).min(axis=1)

There is a significant reduction in running time with this change.

@cosmic-cortex - can I contribute this code change to this repo?
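
One way to realise the idea is to compute the minimum distances once and, after each pick, fold in only the distances to the newly selected instance. The sketch below is NumPy-only with Euclidean distance hard-coded; pairwise_euclidean is a hypothetical helper, and the incremental update step is an assumption that goes slightly beyond what the issue text states. The final array is a from-scratch recomputation used only as a reference.

```python
import numpy as np


def pairwise_euclidean(A, B):
    """Euclidean distance between every row of A and every row of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))


rng = np.random.default_rng(0)
X_training = rng.random((5, 3))
X_pool = rng.random((20, 3))

# Computed once per batch: distance from each pool point to its nearest labeled point.
min_dist = pairwise_euclidean(X_pool, X_training).min(axis=1)

# After an instance is picked, only its distances need to be folded in.
picked = int(np.argmax(min_dist))
min_dist_updated = np.minimum(min_dist, pairwise_euclidean(X_pool, X_pool[[picked]])[:, 0])

# Reference: recomputing from scratch against the enlarged labeled set.
recomputed = pairwise_euclidean(X_pool, np.vstack([X_training, X_pool[[picked]]])).min(axis=1)
print(np.allclose(min_dist_updated, recomputed))
```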

May I ask how to choose a query strategy when using Keras for multilabel classification?

bayesian DL

Hi,
I opened this issue to discuss the implementation of the acquisition functions that you said in #48 you would like to add as a feature. I am interested in contributing. Where should they be implemented? In uncertainty.py?

Clarification on similarity measures for information density

The current implementation of the information_density function uses 1/(1+d) to convert a distance d into a measure of similarity. However, at least for cosine, this is not how similarity and distance are related.
Is this because you're treating similarity as an ordinal value? i.e. As long as the ranking of instances doesn't change it doesn't matter how we convert distance to similarity and we will always get the same argmax (in Settles, eq. 5.1) when choosing which point to query?
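
A small numeric check of the ordinal argument, under the usual convention that cosine distance is one minus cosine similarity. The similarity values are illustrative. Both conversions are strictly decreasing in d, so they rank individual pairs identically. Whether the argmax of an average over many pairs is also preserved is not guaranteed by this alone, since the two conversions differ by a nonlinear transform.

```python
import numpy as np

cos_sim = np.array([0.10, 0.95, -0.40, 0.60])  # illustrative cosine similarities
d = 1.0 - cos_sim                               # cosine distance

converted = 1.0 / (1.0 + d)                     # the 1/(1+d) conversion

# Both are strictly decreasing in d, so they order individual pairs the same way.
same_order = np.array_equal(np.argsort(-converted), np.argsort(-cos_sim))
print(converted, same_order)
```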

about learner.teach

It seems that each time we run learner.teach, the model is fit on the initial data plus the new data from scratch, just like an untrained new model. Can the model instead learn only the new data, starting from the weights already trained on the initial data?

Cannot import "from abc import ABC"

My Python environment is 2.7.

ImportError                               Traceback (most recent call last)
in <module>()
----> 1 from modAL.models import ActiveLearner
      2 from sklearn.neighbors import KNeighborsClassifier
      3
      4 # initializing the active learner
      5 learner = ActiveLearner(

/usr/local/lib/python2.7/dist-packages/modAL/__init__.py in <module>()
----> 1 from .models import ActiveLearner, Committee, CommitteeRegressor
      2 from .uncertainty import classifier_uncertainty, classifier_margin, classifier_entropy,
      3     uncertainty_sampling, margin_sampling, entropy_sampling
      4 from .disagreement import vote_entropy, consensus_entropy, KL_max_disagreement,
      5     vote_entropy_sampling, consensus_entropy_sampling, max_disagreement_sampling, max_std_sampling

/usr/local/lib/python2.7/dist-packages/modAL/models.py in <module>()
      4
      5 import numpy as np
----> 6 from abc import ABC, abstractmethod
      7 from sklearn.utils import check_array
      8 from sklearn.base import BaseEstimator

ImportError: cannot import name ABC
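
The abc.ABC helper class was added in Python 3.4, which is why the import fails on 2.7. For code that has to run on both, a commonly used shim builds an equivalent base class from ABCMeta when abc.ABC is missing. The class names below are hypothetical, and this is a general compatibility pattern, not a statement about which Python versions modAL intends to support.

```python
import abc

# abc.ABC exists on Python 3.4+; fall back to building it from ABCMeta elsewhere.
ABC = getattr(abc, "ABC", None) or abc.ABCMeta("ABC", (object,), {})


class HypotheticalBase(ABC):
    @abc.abstractmethod
    def query(self, X):
        """Return the instances to label."""


class FirstItemLearner(HypotheticalBase):
    def query(self, X):
        return X[:1]


print(FirstItemLearner().query([1, 2, 3]))
```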

Any example for Regression with Multiple predictors?

There are a lot of examples of using active learning for classification, but for regression there is only one example, and it uses a single predictor variable. Can anyone guide me on working with multiple predictors?

Is the model re-created during learner.teach?

I called model.summary() in the create_keras_model() function and set n_queries=10. The model summary was printed at each iteration. Why did this happen?

1-D arrays for learner.teach()

Hello!
I am trying to use modAL with a sklearn pipeline described here.
So, the X_training shape is (n_samples,) rather than (n_samples, n_features).
Learner creation works well, but after successfully querying I could not pass query_inst to learner.teach(), because it internally calls np.vstack((X_seed, query_inst)).
Why not use np.concatenate((X_seed, query_inst)) here, in the same way as it is used for the labels?

Also, I expected that only_new=True would solve this, but it does not.
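
The difference between the two calls on 1-D inputs can be shown with NumPy alone. The arrays below are made-up stand-ins for raw text samples of shape (n_samples,).

```python
import numpy as np

X_seed = np.array(["first document", "second document", "third document"])
query_inst = np.array(["queried document"])

# np.vstack promotes both to 2-D rows of different widths, which cannot be stacked.
try:
    np.vstack((X_seed, query_inst))
    vstack_failed = False
except ValueError:
    vstack_failed = True

# np.concatenate along axis 0 simply appends the new sample.
X_extended = np.concatenate((X_seed, query_inst))
print(vstack_failed, X_extended.shape)
```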

Performance on MNIST doesn't seem great

When comparing to random sampling it does not seem to give significantly different results. I would have expected the curve to be much higher for active learning. Potentially the defaults aren't great?

learner = ActiveLearner(
    estimator=RandomForestClassifier(random_state = 1234),
    X_training=start_X, 
    y_training=start_y
)

Extend modAL to pytorch models

I am a research assistant and I have been working on deep Bayesian active learning for the past few weeks. I have been using PyTorch and custom active learning classes so far. I just found out about modAL and it seems very cool, so I was wondering whether it would be possible to extend it to PyTorch models. I would be glad to contribute.

More specifically, I am using dropout-based Bayesian neural networks and Monte Carlo sampling to compute the predictive variance. I am quite new to active learning, but I believe deep Bayesian active learning is very close to query by committee, in the sense that for every x in the unlabeled pool there are N feedforward passes of x through a committee of N networks sampled from the posterior distribution over the weights of the Bayesian network.

I also experimented with some query strategies for classifiers mentioned in the active learning survey by Burr Settles that I think are not implemented in modAL yet. I would be glad to contribute on this side too. I am thinking of the Gini index of the votes, the Gini index of the consensus, the least confident vote, and the least confident consensus. (In my experiments they perform as well as vote entropy and consensus entropy.)
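
For readers unfamiliar with these measures, a rough sketch of two of them applied to consensus probabilities follows. It assumes "Gini index" refers to the Gini impurity, one minus the sum of squared class probabilities, and "least confident" to one minus the largest class probability. The issue does not spell out the exact definitions, so both formulas and the function names are assumptions, and the numbers are illustrative.

```python
import numpy as np


def gini_index(p):
    """Gini impurity of each row of a probability matrix: 1 - sum_k p_k^2."""
    return 1.0 - np.sum(p ** 2, axis=1)


def least_confident(p):
    """One minus the largest probability in each row."""
    return 1.0 - np.max(p, axis=1)


# Consensus probabilities for three pool instances (illustrative numbers).
consensus = np.array([[1.0, 0.0], [0.5, 0.5], [0.8, 0.2]])
gini_scores = gini_index(consensus)
lc_scores = least_confident(consensus)
print(gini_scores, lc_scores)
```

In both cases the instance with the evenly split consensus receives the highest score and would be queried first.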

use different query strategies

I am using Keras/TensorFlow models with this framework and the ActiveLearner class.
As soon as I try to change the query strategy, various errors occur.

learner = ActiveLearner(
    estimator=classifier,
    query_strategy=expected_error_reduction,
    X_training=x_initial_training,
    y_training=y_initial_training,
)
prescore = learner.score(x_test, y_test)
n_queries = 50
postscore = np.zeros(shape=(n_queries, 1))
for idx in range(n_queries):
    print('Query no. %d' % (idx + 1))
    query_idx, query_instance = learner.query(x_pool)
    learner.teach(
        X=x_pool[query_idx],
        y=y_pool[query_idx],
        only_new=True,
        epochs=10,
        validation_data=(x_val, y_val),
    )
    # remove queried instances from pool
    x_pool = np.delete(x_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)
    postscore[idx, 0] = learner.score(x_test, y_test)

What do I have to change to use the different strategies? The training input has a 3D shape.
So far I have tried all the uncertainty methods, of which only the default one worked. Now I am trying the expected_error_reduction strategy, but errors occur there as well.

I am afraid the 3D shape of the training data breaks all the other algorithms, but an LSTM requires this kind of shape.

Keras Integration doesn't work with multiple inputs

This code breaks:

def build_model():
    grey = Input(shape=(34,34,1), name="input_grey")
    # ...

    red = Input(shape=(34,34,1), name="input_red")
    # ...

    merged = concatenate([maxpoolg2, maxpoolr2])
    # ...
    softmax1 = Activation('softmax')(dense2)

    model = Model(inputs=[grey, red], outputs=[softmax1])
    
    model.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])
    
    return model

Xg_train = np.array(Xg_train).astype("float32") / 255.0
Xr_train = np.array(Xr_train).astype("float32") / 255.0
Xg_test = np.array(Xg_test).astype("float32") / 255.0
Xr_test = np.array(Xr_test).astype("float32") / 255.0

Xg_train = np.reshape(Xg_train, (len(Xg_train), 34, 34, 1))
Xr_train = np.reshape(Xr_train, (len(Xr_train), 34, 34, 1))
Xg_test = np.reshape(Xg_test, (len(Xg_test), 34, 34, 1))
Xr_test = np.reshape(Xr_test, (len(Xr_test), 34, 34, 1))
Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)

classifier = KerasClassifier(build_model())

n_initial = 1000
initial_idx = np.random.choice(range(len(Xg_train)), size=n_initial, replace=False)
Xg_train = Xg_train[initial_idx]
Xr_train = Xr_train[initial_idx]
Y_train = Y_train[initial_idx]

Xg_pool = np.delete(Xg_train, initial_idx, axis=0)
Xr_pool = np.delete(Xr_train, initial_idx, axis=0)
Y_pool = np.delete(Y_train, initial_idx, axis=0)


from modAL.models import ActiveLearner

learner = ActiveLearner(
    estimator=classifier,
    X_training=[Xg_train, Xr_train],
    y_training=Y_train,
    verbose=1
)

n_queries = 10
for idx in range(n_queries):
    query_idx, query_instance = learner.query([Xg_pool, Xr_pool], n_instances=200, verbose=1)
    learner.teach(X=[Xg_pool[query_idx], Xr_pool[query_idx]], y=Y_pool[query_idx],
        verbose=1
    )

    Xg_pool = np.delete(Xg_pool, query_idx, axis=0)
    Xr_pool = np.delete(Xr_pool, query_idx, axis=0)
    Y_pool = np.delete(Y_pool, query_idx, axis=0)

It fails at the last line with this error:

File "/usr/local/lib/python3.5/dist-packages/modAL/models.py", line 43, in __init__
    self._fit_to_known(bootstrap=bootstrap_init, **fit_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/modAL/models.py", line 93, in _fit_to_known
    self.estimator.fit(self.X_training, self.y_training, **fit_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/wrappers/scikit_learn.py", line 209, in fit
    return super(KerasClassifier, self).fit(x, y, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/wrappers/scikit_learn.py", line 138, in fit
    **self.filter_sk_params(self.build_fn.__call__))
TypeError: __call__() missing 1 required positional argument: 'inputs'

I think this happens because of the multiple inputs but I'm not sure.
What else could be wrong?

Limitation of scoring function to only 'accuracy'

Excellent work on this library. Love it. 😄

https://github.com/modAL-python/modAL/blob/master/modAL/models/learners.py#L363 <-- we are only allowed to use accuracy in score(). sklearn offers a variety of scoring functions for model evaluation (e.g., F1) that are more solid than accuracy: https://scikit-learn.org/stable/modules/model_evaluation.html. Maybe this could be implemented in the next release of modAL?
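Until score() accepts other metrics, one possible workaround is to take the learner's predictions and compute the metric outside the learner. The sketch below computes a binary F1 score by hand so that it stays self-contained; the helper name `f1_binary` and the sample labels are made up for illustration. In practice `y_pred` would come from something like learner.predict(X_test), and sklearn.metrics.f1_score could replace the helper.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 score for one positive class, computed from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels; y_pred stands in for the learner's predictions.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print("F1:", f1_binary(y_true, y_pred))
```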

3D CNN problem

Does modAL support 3D CNNs?

Organisation name disambiguation - Can I use AL for this problem?

Hi Tivadar,

This library is very interesting and easy to use. I wanted to know whether there is any way to use active learning (especially modAL) to tag a set of different texts (organization names) as belonging to the same name. Think of it as clustering similar companies, or mapping companies to a master one, except that there is no supervision with final values. We have huge amounts of data, and I think active learning could help remove the manual tagging process.

Could you help me on this?

AttributeError: 'Committee' object has no attribute 'score'

AttributeError: 'Committee' object has no attribute 'score' in the query-by-committee example.

Committee class doesn't offer score() method

For convenience, it would be nice if the Committee class also offered a score() method. This would make it easier to compare the performance of a single learner with that of a committee of learners.

Committee class does not update to new labels correctly.

Hi,

I encountered an exception in the Committee class when trying to capture the score after teaching the committee with unseen data containing new classes.

The problem seems to be that the teach function does the following:

1. Adds the new data to the training set X and y
2. Updates the known classes based on the classes known to the estimators given by the learners
3. Refits the estimators (hence updating the known classes to include the new classes)

Step 2 should not depend on the estimators, or it should happen after Step 3.
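The ordering problem can be reproduced with a toy stand-in. `ToyEstimator` and both helper functions below are hypothetical and only mimic the sklearn convention that fit() sets `classes_`; this is not the actual Committee code.

```python
class ToyEstimator:
    """Minimal stand-in for an sklearn classifier: fit() records classes_."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        return self


def classes_from_estimators(estimators):
    """Known classes as seen by the (possibly stale) fitted estimators."""
    return sorted(set().union(*(e.classes_ for e in estimators)))


def classes_from_labels(y):
    """Known classes taken directly from the training labels."""
    return sorted(set(y))


X, y = [[0], [1]], ["a", "b"]
committee = [ToyEstimator().fit(X, y), ToyEstimator().fit(X, y)]

# Step 1: new data with an unseen class arrives.
X, y = X + [[2]], y + ["c"]

# Step 2 before step 3: the estimators have not seen "c" yet.
stale = classes_from_estimators(committee)

# Step 3: refit, after which the estimators know all classes.
for estimator in committee:
    estimator.fit(X, y)
fresh = classes_from_estimators(committee)

print(stale, fresh, classes_from_labels(y))
```

Either fix described above works in this toy setting: reading the classes from the labels is independent of the estimators, and reading them from the estimators is correct once the refit has happened.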

I am working on a PR to fix this.

Query synthesis

Active learning works not only in pool-based or stream-based settings; it can also generate examples to be queried for labels. This is called query synthesis. (See this paper for further details.) This should be implemented in modAL.

Support batch-mode queries?

Hi,

I've run into a bit of a use-case that I'm not sure is quite supported by modAL – nor the broader libraries for active learning – but would be relatively simple to implement. After reviewing modAL's internals a bit, I don't think it officially supports active learning with batch-mode queries.

The sampling strategies (for example, uncertainty sampling) do support the n_instances parameter, but from what I can tell, uncertainty sampling may return redundant/sub-optimal queries if we return more than one instance from the unlabeled set. This is a bit prohibitive in settings where we'd like to ask an active learner to return multiple (if not all) examples from the unlabeled set/pool, and the computational cost for re-training an active learning model goes without saying.

I found requests for batch-mode support in the popular libact library (issues #57 and #89) but, to the best of my knowledge, I'm not sure they were addressed in any of their PRs.

In that case, does it make sense to implement something like [Ranked batch-mode active learning] by Cardoso et al.? I took a crack at it this weekend for a better personal understanding, but if it's worth integrating and supporting in modAL I'm happy to polish it and talk it through in a PR.
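To illustrate the redundancy issue and the general idea of ranking a batch, here is a simplified sketch in plain Python. It is written in the spirit of ranked batch-mode sampling but is not the algorithm from the paper: the similarity function 1 / (1 + distance), the weighting `alpha`, and the function name `ranked_batch` are all assumptions made for this example, and it expects at least one labeled instance.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def ranked_batch(pool, uncertainty, labeled, n_instances):
    """Greedily pick instances that are both uncertain and unlike known data.

    After each pick the instance is treated as labeled, so near-duplicates
    of an already selected instance are penalized in later rounds.
    """
    labeled = list(labeled)
    remaining = list(range(len(pool)))
    selected = []
    for _ in range(min(n_instances, len(remaining))):
        # Assumed weighting: favor diversity while the pool dominates.
        alpha = len(remaining) / (len(remaining) + len(labeled))

        def score(i):
            nearest = min(euclidean(pool[i], x) for x in labeled)
            similarity = 1.0 / (1.0 + nearest)
            return alpha * (1.0 - similarity) + (1.0 - alpha) * uncertainty[i]

        best = max(remaining, key=score)
        selected.append(best)
        labeled.append(pool[best])
        remaining.remove(best)
    return selected


pool = [(0.0, 0.0), (0.0, 0.01), (5.0, 5.0)]
uncertainty = [0.9, 0.9, 0.8]
labeled = [(10.0, 10.0)]

top_n = sorted(range(len(pool)), key=lambda i: -uncertainty[i])[:2]
ranked = ranked_batch(pool, uncertainty, labeled, n_instances=2)
print("plain top-n:", top_n, "ranked batch:", ranked)
```

In this toy pool, plain top-n picks the two near-identical points, while the ranked version replaces the second one with the more distant, slightly less uncertain point.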

Thanks!

using fit_generator() with modAL

Hello,
Reading all my images at once uses too much memory, so I want to train with fit_generator() in Keras. Is it possible to use modAL with the fit_generator() method, or is there another way to yield batches for training?
Thank you!

Expected error and variance reduction

Implement expected error and variance reduction from the Roy and McCallum paper.

Entropy sampling query strategy unstable

I'm using the entropy sampling strategy to select samples for RandomForest classification with 7 classes.
However, when I run my query with entropy sampling (I also tried uncertainty sampling), I get a different result every time.
The selected samples are never the same, even though I have not changed my input data.

Thank you in advance for your help.

No module named 'modAL.models'; 'modAL' is not a package

After installing modAL with pip install modAL on Ubuntu 16.04 in a Python 3.5 virtualenv, I tried to import modAL but got the error message in the title. How can I solve this issue?

Question about the active learning strategy

Suppose the original dataset contains 100 labeled samples (pre-training data), and we want to train a model using 1000 unlabeled samples (pool data). Active learning picks 10 samples in each iteration.
Question: pre-training on the 100 samples gives us a model A. The AL strategy then selects 10 new samples. With the 100+10 samples, does modAL a) retrain model A on the 110 samples, or b) initialize a new model B and train it on the 110 samples?
Which is right?
In my opinion, a) is right; according to the code, that is what modAL does.
Could you please explain the differences between a) and b)? Which one is better?
Thanks!

Can we save a trained active learning model?

Does modAL implement a way to save and load the trained model?

if classifier.X_training is None:
    labeled = select_cold_start_instance(X=unlabeled, metric=metric, n_jobs=n_jobs)
elif classifier.X_training.shape[0] > 0:
    labeled = classifier.X_training[:]

# Define our record container and the maximum number of records to sample.
instance_index_ranking = []

if n_jobs == 1 or n_jobs is None:
    _, distance_scores = pairwise_distances_argmin_min(X_pool[mask], X_training, metric=metric)
else:
    distance_scores = pairwise_distances(X_pool[mask], X_training, metric=metric, n_jobs=n_jobs).min(axis=1)


old model

training samples used by the old model

new train samples

validation samples

Active learner

some results (it is the case of many iterations and data)
