keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.

License: Other

Python 100.00%

keras-preprocessing's Introduction

Keras Preprocessing

⚠️ This GitHub repository is now deprecated -- all Keras Preprocessing symbols have moved into the core Keras repository and the TensorFlow pip package. All code changes and discussion should move to the Keras repository.

For users looking for a place to start preprocessing data, consult the preprocessing layers guide and refer to the data loading utilities API.


keras-preprocessing's Issues

Use stable hash function in hashing_trick and one_hot

The fact that Python's hash function is used as the default hashing function in hashing_trick and one_hot is confusing (see issue #9500 in Keras) and has been discussed before (see issue #9635 in Keras). The hash function in Python 3 is randomized, which means that results obtained in different sessions are inconsistent, so using it in a data processing pipeline leads to inconsistent results.

While I understand that one_hot exists for historical reasons, this does not seem to justify preserving a function that gives inconsistent results. Even if this is backward compatible with previous versions of Keras, it is not backward compatible with itself, since different runs of the function give different results.

Proposed solution

The simple solution would be to use "md5" as the default hashing function in hashing_trick and make one_hot an alias for hashing_trick with hash_function='md5'.

Alternatively, since md5 is clearly slower than hash, a faster alternative can be used. Instead of md5, the xxHash function can be used. From what I know, xxHash is faster than md5 (but still slower than hash) while giving results of equal quality. The function is implemented in the xxhash package (ports for Python, R, C++ etc.).

I may provide a PR for this, but first I'd be grateful for comments, as I don't want to waste time on a PR that gets rejected.
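
To illustrate the proposal, here is a minimal sketch of a word-to-index mapping that stays the same across sessions because it is built on hashlib.md5. The helper name stable_index and the choice to reserve index 0 are assumptions made for illustration, not the library's implementation.

```python
import hashlib

def stable_index(word, n):
    """Map a word to an index in [1, n - 1] using md5 (index 0 is kept free)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

# The same word always maps to the same index, in this run and in any other.
indices = [stable_index(w, 50) for w in "the cat sat on the mat".split()]
```

Unlike the built-in hash, the md5 digest does not depend on the interpreter's per-process hash seed, so the mapping can be reused between training and inference runs.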

`zca_whitening` causes memory leak in `ImageDataGenerator.fit`

Keras 2.2.2
TensorFlow 1.10.1
Python 3.6.6
macOS 10.13.6

I'm trying to generate augmentations of my training data with zca_whitening and an ImageDataGenerator. But when I try to fit the generator (which is mandatory when using zca_whitening) the python process eats more and more memory (100Gb+) until it gets killed by the system.

This small example can cause the leak:

import numpy as np
from keras.preprocessing.image import ImageDataGenerator

def cause_leak():
    idg = ImageDataGenerator(zca_whitening = True)
    random_sample = np.random.random((1, 250, 250, 3))
    idg.fit(random_sample)

cause_leak()

The terminal output only consists of a warning saying that featurewise_center is overwritten when enabling zca_whitening. I don't think this is related to the problem but who knows.

Does anybody know a workaround?
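
A rough estimate suggests the memory growth may not be a leak at all. Assuming fit flattens each image into a feature vector and builds a dense features-by-features covariance matrix for ZCA (an assumption about the implementation, not something confirmed in this report), the matrix alone for 250x250x3 inputs is enormous:

```python
# Back-of-the-envelope size of a dense covariance matrix for ZCA whitening.
height, width, channels = 250, 250, 3
n_features = height * width * channels          # length of one flattened image

bytes_per_value = 4                             # assuming float32 storage
cov_bytes = n_features ** 2 * bytes_per_value   # features x features matrix
cov_gib = cov_bytes / 2 ** 30

print(n_features, round(cov_gib, 1))
```

Under that assumption, a possible workaround is to whiten smaller images or patches, since the matrix size grows with the fourth power of the image side length.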

Iterators are not Sequences anymore

Since the split, Iterators are no longer Sequence objects, which makes fit_generator treat them as plain generators.

Should we modify keras_preprocessing to use Sequence if possible or change the logic of *_generator to not check the type but just validate that the methods are there?
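
The second option, validating the methods instead of the type, could look roughly like the sketch below. The function looks_like_sequence and the class BatchIterator are hypothetical names used only for illustration.

```python
def looks_like_sequence(obj):
    """Return True if obj supports indexing and len(), as a Sequence would."""
    return hasattr(obj, "__getitem__") and hasattr(obj, "__len__")

class BatchIterator:
    """Hypothetical stand-in for a keras_preprocessing Iterator."""
    def __init__(self, n_batches):
        self.n_batches = n_batches

    def __len__(self):
        return self.n_batches

    def __getitem__(self, idx):
        if not 0 <= idx < self.n_batches:
            raise IndexError(idx)
        return idx

def plain_generator():
    yield 0

# The iterator passes the check; a plain generator does not.
checks = (looks_like_sequence(BatchIterator(3)),
          looks_like_sequence(plain_generator()))
```

The trade-off is that a duck-typed check accepts any object with these two methods, including ones that do not honour the rest of the Sequence contract.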

How does keras.preprocessing.text.Tokenizer handle oov_token and predefined special tokens?

I am trying to use Tokenizer to handle string input. The "oov_token" parameter is set to "" when the Tokenizer is initialized. However, the index assigned to oov_token is greater than num_words, so it cannot be used directly for an embedding lookup by token index.
Another question is how to use predefined words with Tokenizer, such as .

Replacing `if variable not in dict` construction

Hello, firstly I would like to thank you for this new library, which removes the burden of writing repetitive code, especially for text, and lets us focus on solving problems instead.

While reading text.py, I spotted the following construction in two different places:

if variable not in some_dict:
    some_dict[variable] = 1
else:
    some_dict[variable] += 1

If some_dict = defaultdict(int), then this code could be replaced by the one-liner some_dict[variable] += 1. Why not use it? According to the tests below it is even faster:

In [1]: from collections import defaultdict;
In [2]: simple_dict = dict()
In [3]: def fun(): z = defaultdict(int); z['shoe'] += 1;

# >> Inserting a new element
In [4]: %timeit if 'shoe' not in simple_dict: simple_dict['shoe'] = 1
# 56.6 ns ± 0.0336 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [5]: %timeit fun
# 36.1 ns ± 0.00178 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

# >> Fetching existing element
In [6]: %timeit simple_dict['shoe']
# 54.3 ns ± 0.012 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [7]: ddict = defaultdict(int); ddict['shoe'] += 1
In [8]: %timeit ddict['shoe']
# 58.3 ns ± 0.0122 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

# >> Updating existing element:
In [9]: def fun_2():
     ...:     if 'shoe' not in simple_dict: simple_dict['shoe'] = 1
     ...:     else: simple_dict['shoe'] += 1
     ...:   
In [10]: %timeit fun_2()
# 290 ns ± 0.171 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [11]: z = defaultdict(int)
In [12]: %timeit z['shoe'] += 1
# 122 ns ± 0.0166 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
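
As a small self-contained check that the two constructions count identically, the sketch below runs both over the same word list. The word list is made up for illustration.

```python
from collections import defaultdict

words = "the cat sat on the mat the end".split()

# Explicit membership test, as in the construction quoted from text.py.
counts_plain = {}
for w in words:
    if w not in counts_plain:
        counts_plain[w] = 1
    else:
        counts_plain[w] += 1

# defaultdict(int) supplies 0 for missing keys, so one line suffices.
counts_default = defaultdict(int)
for w in words:
    counts_default[w] += 1
```

One difference to keep in mind: merely reading a missing key from a defaultdict inserts it with value 0, so code that later tests membership may behave differently.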

Iterate on ImageGenerator loops indefinitely

Hello everybody,
I'm facing a weird issue with image generators while iterating over them (keras 2.2.0).

My validation set contains 7160 pictures. Then, I set my generator like this:

batch_size = 32

train_datagen = image.ImageDataGenerator()
train_generator = train_datagen.flow_from_directory("train/", target_size=(224, 224), batch_size=batch_size)

Up to here, everything looks normal: train_generator[0] returns a tuple of 32 image arrays and 32 label arrays, as expected.

The strange thing is that if I iterate with a for loop as follows

x_train = []
for x, y in train_generator:
    x_train.append(preprocess_input(x))

it simply iterates forever! As a consequence, x_train gets bigger and bigger!
I would instead expect exactly 224 iterations (7160 samples in batches of 32).

And indeed, if I ask train_generator[405] I get a reasonable ValueError: Asked to retrieve element 405, but the Sequence has length 224.

What's going on here?! Am I missing something about how ImageGenerators work?
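
The behaviour is consistent with an iterator that cycles over the data indefinitely for multi-epoch training, while indexing is bounded by len(). A minimal sketch of a single bounded pass, using a hypothetical stand-in class (CyclingBatches) rather than the real iterator:

```python
import math

class CyclingBatches:
    """Hypothetical stand-in: indexable and sized, but iterating never stops."""
    def __init__(self, n_samples, batch_size):
        self.n_samples = n_samples
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(self.n_samples / self.batch_size)

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise ValueError("Asked to retrieve element %d, but the Sequence "
                             "has length %d" % (idx, len(self)))
        start = idx * self.batch_size
        return list(range(start, min(start + self.batch_size, self.n_samples)))

    def __iter__(self):
        i = 0
        while True:                      # never raises StopIteration
            yield self[i % len(self)]
            i += 1

batches = CyclingBatches(n_samples=7160, batch_size=32)

# Index from 0 to len() - 1 to visit every sample exactly once.
one_epoch = [batches[i] for i in range(len(batches))]
```

Indexing up to len() gives one full epoch; the last batch is smaller than the others because 7160 is not a multiple of 32.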

image rescaling and featurewise normalization not coordinated

I believe the current implementation produces erroneous, unexpected behaviour when the rescale parameter is used (not None and different from 0) together with feature-wise normalization (featurewise_center, featurewise_std_normalization, ZCA whitening).

The fit() function computes the statistics from the original, un-rescaled inputs and these statistics are applied finally on the rescaled data. For instance, if the images are uint8 (in the range [0, 255]) the feature-wise mean may be, for instance, 128. Then, if rescale=1./255 the output images will be in the range [0, 1], but the original mean 128 will be subtracted.

import numpy as np
from keras.preprocessing.image import ImageDataGenerator

images = np.random.randint(low=0, high=255, size=(10, 32, 32, 3))

rescale = 1. / 255
imagedatagen = ImageDataGenerator(rescale=rescale,
                                  featurewise_center=True)
imagedatagen.fit(images)
batchgen = imagedatagen.flow(images, batch_size=10)
batch = batchgen.next()

images = images.astype(float)
images *= rescale
mean = np.mean(images, axis=(0, 1, 2))
images -= mean
print('Data range should be (approximately): [{}, {}]. \n'
      'Actual data range is: [{}, {}]'.format(np.min(images),
                                              np.max(images),
                                              np.min(batch),
                                              np.max(batch)))
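
The mismatch can also be reproduced with NumPy alone, without the generator. The sketch below is an illustration of the arithmetic described in this report, not the library's code; the variable names are made up.

```python
import numpy as np

rng = np.random.RandomState(0)
images = rng.randint(0, 256, size=(10, 32, 32, 3)).astype(float)
rescale = 1.0 / 255

# Per-channel mean computed on the un-rescaled data, as fit() is reported to do.
raw_mean = images.mean(axis=(0, 1, 2))

# Rescaled data with the un-rescaled mean subtracted: far outside [-1, 1].
mismatched = images * rescale - raw_mean

# Scaling the mean the same way as the data gives the expected range.
consistent = images * rescale - raw_mean * rescale
```

With uniformly random uint8 input, mismatched sits around -127, while consistent stays within [-1, 1] and is centred on zero.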

Chinese text support in text preprocessing

It looks like Tokenizer has no built-in support for parsing Chinese text. It could be built using the Jieba package; it just needs some coding work.
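
One way this could work without changing Tokenizer is to segment the text first and join the tokens with spaces, so that the existing whitespace splitting applies. The sketch below is only an illustration: presegment is a hypothetical helper, and the character-level fallback stands in for a real word segmenter such as jieba.cut, which is assumed here and not exercised.

```python
def presegment(texts, cut):
    """Join the tokens produced by `cut` with spaces for a whitespace tokenizer."""
    return [" ".join(tok for tok in cut(text) if tok.strip()) for text in texts]

# Character-level fallback used as a stand-in segmenter; a word segmenter
# could be passed as `cut` instead (assumption, not tested here).
segmented = presegment(["我爱北京"], cut=list)
```

The segmented strings can then be passed to fit_on_texts and texts_to_sequences like any space-delimited text.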

AttributeError when running tests

Multiple failures when running tests, although Tokenizer definitely has a sequences_to_texts attribute.
Keras version : 2.2.0

    def test_sequences_to_texts():
        texts = [
            'The cat sat on the mat.',
            'The dog sat on the log.',
            'Dogs and cats living together.'
        ]
        tokenizer = keras.preprocessing.text.Tokenizer(num_words=10,
                                                       oov_token='<unk>')
        tokenizer.fit_on_texts(texts)
        tokenized_text = tokenizer.texts_to_sequences(texts)
>       trans_text = tokenizer.sequences_to_texts(tokenized_text)

E       AttributeError: 'Tokenizer' object has no attribute 'sequences_to_texts'

flow_from_dataframe() - ValueError: has_ext is set to True but extension not found in x_col

I'm working with images organized across several folders. I have a dataframe of their file paths, and up until now I've been using that with a script to move them into the necessary categorical folders. It takes up a lot of time and space. So, needless to say I was ecstatic when I found the flow_from_dataframe() method.

I have my valid dataframe and filepaths. I initialize the generators like this:

from keras.preprocessing.image import ImageDataGenerator

main_dir = '/User/name/etc/etc/224px/'

# Initiate the train and test generators with data augmentation
train_datagen = ImageDataGenerator(preprocessing_function = preprocess_input,
                                   #rescale = 1./255,
                                   horizontal_flip = True,
                                   fill_mode = "nearest",
                                   zoom_range = 0.3,
                                   width_shift_range = 0.1,
                                   height_shift_range = 0.1,
                                   rotation_range = 30)

test_datagen = ImageDataGenerator(preprocessing_function = preprocess_input)

seed = 108201987 # Optional random seed for shuffling and transformations.

train_generator = train_datagen.flow_from_dataframe(dataframe=train,
                                                    directory=main_dir,
                                                    x_col='filepath',
                                                    y_col='label',
                                                    has_ext=True,
                                                    target_size = (img_height, img_width),
                                                    batch_size = batch_size, 
                                                    class_mode = "binary",
                                                   seed = seed)

validation_generator = test_datagen.flow_from_directory(dataframe=val,
                                                        directory=main_dir,
                                                        x_col='filepath',
                                                        y_col='label',
                                                        has_ext=True,
                                                        target_size = (img_height, img_width),
                                                        class_mode = "binary",
                                                        seed = seed)

Here's a sample file path: 'October 29 2018/Top view_XYZ_4-9/IMG_6854.JPG'

Both the training and validation generators have has_ext set to True, since my files have extensions. However, I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-1e162ab6ee70> in <module>()
     23                                                     batch_size = batch_size,
     24                                                     class_mode = "binary",
---> 25                                                    seed = seed)
     26 
     27 validation_generator = test_datagen.flow_from_directory(dataframe=val,

/usr/local/anaconda3/lib/python3.5/site-packages/keras_preprocessing/image.py in flow_from_dataframe(self, dataframe, directory, x_col, y_col, has_ext, target_size, color_mode, classes, class_mode, batch_size, shuffle, seed, save_to_dir, save_prefix, save_format, subset, interpolation)
   1105                                  save_format=save_format,
   1106                                  subset=subset,
-> 1107                                  interpolation=interpolation)
   1108 
   1109     def standardize(self, x):

/usr/local/anaconda3/lib/python3.5/site-packages/keras_preprocessing/image.py in __init__(self, dataframe, directory, image_data_generator, x_col, y_col, has_ext, target_size, color_mode, classes, class_mode, batch_size, shuffle, seed, data_format, save_to_dir, save_prefix, save_format, follow_links, subset, interpolation, dtype)
   2101                     break
   2102             if not ext_exist:
-> 2103                 raise ValueError('has_ext is set to True but'
   2104                                  ' extension not found in x_col')
   2105             temp_df = pd.DataFrame({x_col: filenames}, dtype=str)

ValueError: has_ext is set to True but extension not found in x_col

I was so excited about the possibility of never having to sort or mix-up my images again. Any ideas?
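
One possible cause, not confirmed in this report, is that the extension check is case-sensitive while the sample path ends in an upper-case .JPG. The sketch below shows how such a check would behave; the whitelist and the helper has_known_extension are assumptions for illustration, not the library's code.

```python
# Assumed lower-case whitelist for illustration; the library's list may differ.
WHITE_LIST_FORMATS = ("png", "jpg", "jpeg", "bmp", "ppm", "tif", "tiff")

def has_known_extension(filename, ignore_case):
    """Check whether filename ends with one of the whitelisted extensions."""
    name = filename.lower() if ignore_case else filename
    return name.endswith(tuple("." + ext for ext in WHITE_LIST_FORMATS))

path = "October 29 2018/Top view_XYZ_4-9/IMG_6854.JPG"
strict = has_known_extension(path, ignore_case=False)    # case-sensitive check
relaxed = has_known_extension(path, ignore_case=True)    # case-insensitive check
```

If this is the cause, lower-casing the extensions in the filepath column (and on disk) would be a way to test the hypothesis.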

Tokenizer should always initialize document_count with zero

In the Tokenizer class, currently self.document_count is initialized with the argument document_count, but it should always be initialized with 0. Allowing the user to initialize it with a non-zero value will result in incorrect results or errors in tf-idf mode. Furthermore, the document_count argument is not documented.
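
To see why a non-zero starting value distorts tf-idf, consider a smoothed inverse document frequency of the form log(1 + N / (1 + df)). This formula is an assumption chosen for illustration and may not match the library's exact expression.

```python
import math

def idf(document_count, docs_with_word):
    # Assumed smoothed idf form, used only to illustrate the effect.
    return math.log(1 + document_count / (1 + docs_with_word))

true_docs = 3          # documents actually seen by the tokenizer
docs_with_word = 2     # documents containing the word of interest

correct = idf(true_docs, docs_with_word)
inflated = idf(true_docs + 100, docs_with_word)   # counter started at 100, not 0
```

Any offset in document_count shifts every idf weight, even though the fitted texts are the same.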

TimeseriesGenerator object is not an iterator

hello,

I see an issue with TimeSeriesGenerator.

tensorflow 1.11.0
Keras 2.2.2
Keras-Applications 1.0.6
Keras-Preprocessing 1.0.5

I am using the following code to test the TimeseriesGenerator

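import numpy as np
from keras.layers import Input, LSTM, Dense
from keras.models import Model
from keras.preprocessing.sequence import TimeseriesGenerator

# WINDOW_LENGTH is defined elsewhere in the original script; its value is not given.
data = np.arange(0,100).reshape(-1,1)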
data_gen = TimeseriesGenerator(data, data, length=WINDOW_LENGTH,
                               sampling_rate=1, batch_size=1)


data_dim = 1
input1 = Input(shape=(WINDOW_LENGTH, data_dim))
lstm1 = LSTM(100)(input1)
hidden = Dense(20, activation='relu')(lstm1)
output = Dense(data_dim, activation='linear')(hidden)

model = Model(inputs=input1, outputs=output)
model.compile(loss='mse', optimizer='rmsprop', metrics=['accuracy'])

model.fit_generator(generator=data_gen,
                    steps_per_epoch=32,
                    epochs=10)

here is the stacktrace.

TypeError                                 Traceback (most recent call last)
<ipython-input-55-ad7e35e8fffd> in <module>()
     16 model.fit_generator(generator=data_gen,
     17                     steps_per_epoch=32,
---> 18                     epochs=10)

/usr/lib/python2.7/site-packages/keras/legacy/interfaces.pyc in wrapper(*args, **kwargs)

/usr/lib/python2.7/site-packages/keras/engine/training.pyc in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)

/usr/lib/python2.7/site-packages/keras/engine/training_generator.pyc in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)

/usr/lib/python2.7/site-packages/keras/utils/data_utils.pyc in get(self)

/usr/lib/python2.7/site-packages/keras/utils/data_utils.pyc in _data_generator_task(self)

TypeError: TimeseriesGenerator object is not an iterator

I tried to play around with package versions and I see that issue occurs only when using Keras-Preprocessing >= 1.0.3. I am able to run this code with 1.0.2.
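
Besides pinning version 1.0.2, one possible workaround (an assumption, not tested against the versions listed above) is to wrap the indexable object in a real Python generator before handing it to fit_generator. The helper as_generator and the stand-in list are hypothetical.

```python
def as_generator(sequence_like):
    """Yield batches forever from an object that supports len() and indexing."""
    while True:
        for i in range(len(sequence_like)):
            yield sequence_like[i]

# Stand-in for a TimeseriesGenerator: three (inputs, targets) batches.
batches = [("x0", "y0"), ("x1", "y1"), ("x2", "y2")]

gen = as_generator(batches)
first_four = [next(gen) for _ in range(4)]   # wraps around after the third batch
```

Because the wrapper never ends, steps_per_epoch still has to be given explicitly, as in the call shown in this report.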

flow_from_dataframe with directory=None not working

I'm trying to use flow_from_dataframe with directory=None to use absolute paths, as described here

Keras==2.2.4
Keras-preprocessing==1.0.5

but this is what I get:

datagen.flow_from_dataframe(data, directory=None, x_col='fname', y_col='cat',has_ext=True)
...

TypeError                                 Traceback (most recent call last)
<ipython-input-128-f0acbea298e4> in <module>()
----> 1 datagen.flow_from_dataframe(data, directory=None, batch_size=2, x_col='fname',  y_col='cat',has_ext=True)

/usr/local/anaconda3/lib/python3.6/site-packages/keras_preprocessing/image.py in flow_from_dataframe(self, dataframe, directory, x_col, y_col, has_ext, target_size, color_mode, classes, class_mode, batch_size, shuffle, seed, save_to_dir, save_prefix, save_format, subset, interpolation)
   1105                                  save_format=save_format,
   1106                                  subset=subset,
-> 1107                                  interpolation=interpolation)
   1108 
   1109     def standardize(self, x):

/usr/local/anaconda3/lib/python3.6/site-packages/keras_preprocessing/image.py in __init__(self, dataframe, directory, image_data_generator, x_col, y_col, has_ext, target_size, color_mode, classes, class_mode, batch_size, shuffle, seed, data_format, save_to_dir, save_prefix, save_format, follow_links, subset, interpolation, dtype)
   2093             class_indices=self.class_indices,
   2094             follow_links=follow_links,
-> 2095             df=True)
   2096         if has_ext:
   2097             ext_exist = False

/usr/local/anaconda3/lib/python3.6/site-packages/keras_preprocessing/image.py in _list_valid_filenames_in_directory(directory, white_list_formats, split, class_indices, follow_links, df)
   1762             `["file1.jpg", "file2.jpg", ...]`).
   1763     """
-> 1764     dirname = os.path.basename(directory)
   1765     if split:
   1766         num_files = len(list(

/usr/local/anaconda3/lib/python3.6/posixpath.py in basename(p)
    144 def basename(p):
    145     """Returns the final component of a pathname"""
--> 146     p = os.fspath(p)
    147     sep = _get_sep(p)
    148     i = p.rfind(sep) + 1

TypeError: expected str, bytes or os.PathLike object, not NoneType
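
The traceback shows os.path.basename being called with None. A guard along the following lines would avoid the TypeError; safe_dirname and resolve are hypothetical helpers written for illustration, not the fix adopted by the library.

```python
import os

def safe_dirname(directory):
    """Return the final path component, or '' when no directory is given."""
    return os.path.basename(directory) if directory is not None else ""

def resolve(directory, filename):
    """Join directory and filename; keep filename as-is when directory is None."""
    return filename if directory is None else os.path.join(directory, filename)

absolute = resolve(None, "/data/a.jpg")          # absolute path passes through
relative = resolve("/data/images", "a.jpg")      # relative name gets the prefix
```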

Wrong link in the docs

There is something wrong with this link in the Image Preprocessing docs:

keras-preprocessing/keras_preprocessing/image.py

Lines 886 to 888 in 45fc4a0

  See [this script]( 

  https://gist.github.com/fchollet/0830affa1f7f19fd47b06d4cf89ed44d) 

  for more details.

When I hover on it, I see at the bottom of my browser this URL:

https://gist.github.com/fchollet/        0830affa1f7f19fd47b06d4cf89ed44d

and when I click on it, it leads me to

https://gist.github.com/fchollet/%20%20%20%20%20%20%20%200830affa1f7f19fd47b06d4cf89ed44d

I can reproduce on Chrome and Firefox.

Is multiprocessing deleted in DirectoryGenerator?

I don't see this support in the latest keras-preprocessing source code.

Unable to train with RGBA images; RGB work fine

I have training data that are RGBA images stored as png files. When I read them in as RGB images using flow_from_directory everything runs smoothly.

But if I set the 'color_mode' argument of flow_from_directory to 'rgba', as in the documentation, I get the following error when trying to run fit_generator:

Epoch 1/120
Traceback (most recent call last):
  File "training_keras.py", line 326, in <module>
    train_model(MODEL_NAME,BASE_DIR,OUTPUT_DIR,GPUS,NUM_EPOCHS,BATCH_SIZE,WIDTH,HEIGHT,MODEL_TYPE,WORKERS,DATA_FRACTION,TRAIN_ALL,FIRST_LAYER,FCN_SIZE,VALIDATION_DIR)
  File "training_keras.py", line 314, in train_model
    model.fit_generator( train_generator,steps_per_epoch=(train_generator.n/(BATCH_SIZE)/DATA_FRACTION),epochs=NUM_EPOCHS,callbacks=cbks,workers=WORKERS, validation_data=validate_generator)
  File "/usr/local/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/site-packages/keras/engine/training_generator.py", line 181, in fit_generator
    generator_output = next(output_generator)
  File "/usr/local/lib/python3.6/site-packages/keras/utils/data_utils.py", line 601, in get
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/keras/utils/data_utils.py", line 595, in get
    inputs = self.queue.get(block=True).get()
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/site-packages/keras/utils/data_utils.py", line 401, in get_index
    return _SHARED_SEQUENCES[uid][i]
  File "/usr/local/lib/python3.6/site-packages/keras_preprocessing/image.py", line 1441, in __getitem__
    return self._get_batches_of_transformed_samples(index_array)
  File "/usr/local/lib/python3.6/site-packages/keras_preprocessing/image.py", line 1932, in _get_batches_of_transformed_samples
    batch_x[i] = x
ValueError: could not broadcast input array from shape (296,296,3) into shape (296,296,4)

Why does this happen?

Thanks!

P.S.
Keras is version 2.2.4
Keras-Preprocessing is version 1.0.6
Tensorflow-GPU is version 1.10.1

Installation issues with keras-preprocessing to get flow_from_dataframe

Could you please tell me the right installation procedure? Since the flow_from_dataframe feature is not available in the pip version, I tried to install it from the GitHub repo. This is what I tried, without success, in my virtual environment:

pip uninstall keras
pip uninstall keras-preprocessing
pip install git+https://github.com/keras-team/keras-preprocessing.git
pip install keras

What am I doing wrong here?

pass `order` to affine_transform

The method apply_affine_transform is a wrapper around scipy.ndimage.interpolation.affine_transform, which has a parameter order: the order of the spline interpolation. Not being able to pass this parameter causes a problem when generating random transformations of labeled images, since the result contains non-integer values.

For example, given the following labels (each integer is a class label):

[[2 2 0 2 2]
 [1 3 2 3 1]
 [2 1 0 1 2]
 [3 1 0 2 0]
 [3 1 3 2 1]]

it is transformed to

[[2.289865   1.7110896  1.8507836  2.172145   0.15832195]
 [3.         2.2037435  1.0774351  1.4194988  2.3764393 ]
 [3.         1.4194988  0.39426237 0.49894607 1.7452569 ]
 [1.8929222  1.7380519  1.7849773  2.         1.2062019 ]
 [1.2646285  2.8956027  2.2367117  1.3144331  0.23671168]]

These are clearly not valid class labels. I dug around and found that the reason is the hard-coded order in https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image.py line 323.

Being able to pass this parameter would enable correct data augmentation for image segmentation.

I have made a small patch to fix this issue, so I'll create a PR soon.
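
The effect of the interpolation order can be seen by calling scipy.ndimage.affine_transform directly on the label array from this report. The rotation matrix below is an arbitrary choice for illustration and is not the transform that produced the values quoted above.

```python
import numpy as np
from scipy import ndimage

labels = np.array([[2, 2, 0, 2, 2],
                   [1, 3, 2, 3, 1],
                   [2, 1, 0, 1, 2],
                   [3, 1, 0, 2, 0],
                   [3, 1, 3, 2, 1]], dtype=float)

# Small rotation about the array origin, chosen only for illustration.
theta = np.deg2rad(10)
matrix = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])

# order=1 blends neighbouring labels; order=0 picks the nearest label.
smooth = ndimage.affine_transform(labels, matrix, order=1, mode="nearest")
nearest = ndimage.affine_transform(labels, matrix, order=0, mode="nearest")
```

With order=0 every output value is one of the original class labels, which is what a segmentation mask needs; higher orders are better suited to the image itself.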

tx and ty in apply_transform

Hi, while implementing object detection in Keras with data augmentation, I have been checking the results of affine_transform and got strange results. For instance, for tx=0 and ty=24 I got a horizontal displacement to the left.

You can check the data augmentation code in

https://github.com/RParedesPalacios/GILA/blob/development/src/detect_generators.py

line 68

Has anybody else checked this??

Thanks.

Tokenizer.fit_on_texts splits a single string into chars when char_level=False

From: keras-team/keras#10768 by @hadaev8

Tokenizer will fit/transform the string into chars if a single string is passed to the fit_on_texts/texts_to_sequences methods, regardless of the char_level setting. This happens because the methods expect a list of strings; when given a single string, they iterate over its characters in this line for fitting:

keras-preprocessing/keras_preprocessing/text.py

Line 205 in e002ebd

for text in texts:

and this one for transforming:

keras-preprocessing/keras_preprocessing/text.py

Line 293 in e002ebd

for text in texts:

Reproducible code illustrating the problem with fit_on_texts:

from keras.preprocessing.text import Tokenizer
text='check check fail'
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text)
tokenizer.word_index

Output:

{'c': 1, 'h': 2, 'e': 3, 'k': 4, 'f': 5, 'a': 6, 'i': 7, 'l': 8}

Wrapping the text in a list solves the issue:

tokenizer.fit_on_texts([text])
tokenizer.word_index

{'check': 1, 'fail': 2}

I recommend checking whether texts is a list of strings and, if it is not, either emitting a warning and wrapping it in a list, or raising an error.
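
A minimal sketch of the warn-and-wrap variant of that check follows. The helper name ensure_list_of_texts is hypothetical, and the sketch is not the change that was made in the library.

```python
import warnings

def ensure_list_of_texts(texts):
    """Wrap a bare string in a list so it is treated as one document."""
    if isinstance(texts, str):
        warnings.warn("Expected a list of strings; "
                      "wrapping the single string in a list.")
        return [texts]
    return texts

wrapped = ensure_list_of_texts("check check fail")     # bare string gets wrapped
unchanged = ensure_list_of_texts(["check", "fail"])    # lists pass through
```

Calling such a helper at the top of fit_on_texts and texts_to_sequences would make a bare string behave like a one-document corpus instead of a sequence of characters.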

[pre-pull-request] opencv version (PIL alternative) for image.py

I found that I need to rewrite load_img in image.py with opencv for my corner case of 16-bit images. Also benchmarks show that python-opencv is faster than PIL. Will there be interest in incorporating opencv/cv2-based reading function as an alternative to PIL into this package?

Problems with multi-task learning with DataframeIterator / flow_from_dataframe

Hello!

I am training an image classification model with multiple outputs:

trained_model = tf.keras.applications.xception.Xception(
    include_top=False,
    weights='imagenet',
    input_shape=[300, 300, 3],
    pooling='max')

outputs = []
for i in range(8):
    outputs.append(tf.keras.layers.Dense(1, activation='softmax', kernel_initializer=kernel_initializer)(trained_model.output))

model = tf.keras.Model(inputs=trained_model.input, outputs=outputs)

The y returned by this model is a Python List, with 8 elements. Each element is a mini-batch of tensors.

However, flow_from_dataframe reads all my y columns from the dataframe as one numpy array, instead of a Python list.

Example

Suppose my dataframe is something like this:

image_path,field_1,field_2,field_3,field_4,field_5,field_6,field_7,field_8
1532672467738.jpeg,1,1,0,1,0,0,0,1
1532669990747.jpeg,0,0,0,1,0,1,1,0
...

Then I call flow_from_dataframe:

train_batches = generator.flow_from_dataframe(
  dataframe=dataframe,
  directory=path,
  x_col='image_path',
  y_col=['field_1', 'field_2', 'field_3', 'field_4', 'field_5', 'field_6', 'field_7', 'field_8'],
  class_mode='other',
  batch_size=16
)

When I call fit_generator with both the model and train_batches, I get this error:

ValueError: Error when checking model target: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 8 array(s), but instead got the following list of 1 arrays: [array([[0, 0, 0, 0, 1, 0, 1, 1],
       [1, 1, 0, 1, 1, 0, 0, 0],
       [0, 0, 1, 0, 1, 0, 0, 1],
       [0, 0, 1, 1, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 1, 0, 1],
       [1, 0, 0, 1, 0, 1, 0, 0],

So, as I wrote at the beginning: DataframeIterator sends a numpy array of shape (16, 8), and the model outputs a Python list of 8 numpy arrays of size (16).

I think the problem is in this excerpt from keras_preprocessing/image.py:

if self.class_mode == 'input':
    batch_y = batch_x.copy()
elif self.class_mode == 'sparse':
    batch_y = self.classes[index_array]
elif self.class_mode == 'binary':
    batch_y = self.classes[index_array].astype(self.dtype)
elif self.class_mode == 'categorical':
    batch_y = np.zeros(
        (len(batch_x), self.num_classes),
        dtype=self.dtype)
    for i, label in enumerate(self.classes[index_array]):
        batch_y[i, label] = 1.
elif self.class_mode == 'other':
    batch_y = self.data[index_array]
else:
    return batch_x
return batch_x, batch_y

The line batch_y = self.data[index_array] returns a Numpy array.
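
For a multi-output model, the (batch, n_outputs) array would need to be split into one array per output. A minimal sketch of that conversion, using made-up label values:

```python
import numpy as np

# Shape (batch, n_outputs): two samples, eight label columns.
batch_y = np.array([[0, 0, 0, 0, 1, 0, 1, 1],
                    [1, 1, 0, 1, 1, 0, 0, 0]])

# One array per model output, each of shape (batch,).
targets = [batch_y[:, i] for i in range(batch_y.shape[1])]
```

Until the iterator does this itself, the same split could be applied in a small wrapper generator around train_batches before the data reaches fit_generator.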

Tokenizer.texts_to_sequences is using at most Tokenizer.num_words - 1 words.

from keras.preprocessing.text import Tokenizer
texts = ['a b c']
tokenizer = Tokenizer(num_words=2)
tokenizer.fit_on_texts(texts)
tokenizer.word_index
{'a': 1, 'b': 2, 'c': 3}
print(tokenizer.texts_to_sequences(texts))

[[1,]]
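
The output is consistent with a filter that keeps only indices strictly below num_words. Since index 0 is never assigned to a word, that leaves num_words - 1 usable words. The sketch below reproduces this assumed rule; to_sequence is a hypothetical helper, not the library's code.

```python
word_index = {"a": 1, "b": 2, "c": 3}
num_words = 2

def to_sequence(words, word_index, num_words):
    # Assumed rule: indices start at 1 and any index >= num_words is dropped.
    return [word_index[w] for w in words
            if w in word_index and word_index[w] < num_words]

seq = to_sequence("a b c".split(), word_index, num_words)
```

Under this rule, keeping the two most frequent words requires num_words=3.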

Avoid Package Dependencies

From the perspective of someone who only wants to use sequence.pad_sequences and doesn't want to make a shallow copy of it, I believe a better approach for this project would be to avoid any dependency, or to depend only on numpy.

We could do what many packages do for matplotlib, for example: display a warning to the user that an optional dependency is required for that functionality.

Packages to remove from dependencies:

Keras
scipy

version in setup.py, 1.0.1, is incorrect for the 1.0.2 tag

The version recorded in the setup.py file for the 1.0.2 tag is 1.0.1, which is incorrect. The sdist on PyPI has the correct version in the setup.py file.

This can cause issues if a checkout of the git tag is used as the means to install keras-preprocessing, as the incorrect version will be recorded in the metadata.

I do not think anything should be done to fix this as changing a tag is a bad procedure. I think it is useful to have this in the issue tracker in case anyone else runs into the issue.

random_brightness shifts the image normalization

My images are trained with float representation, so that their maximum value is 1.0.
However, when I applied random_brightness, the image is between 0.0 and 255.0

I think this is not expected, or at least, it should be warned in the documentation, shouldn't it?

fill_mode on zoom and rotation problem

I'm trying to augment both images and masks. Images are working properly but masks fail.

Example:

It happens when mask is zoomed out and rotated. That black and white should be blue, like background.

I've tried to change fill_mode, but it doesn't work for constant and nearest. Wrap works, but it creates red areas where it shouldn't.

Code:

from keras.preprocessing.image import ImageDataGenerator


def augGenerator():
    # Same augmentation settings for images and masks.
    gen = ImageDataGenerator(
        rotation_range=20,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
    )
    return gen


def augmentImage(img, mask, img_size, aug_count):
    aug_images = [img]
    aug_masks = [mask]

    img = img.reshape(-1, img_size, img_size, 3)
    mask = mask.reshape(-1, img_size, img_size, 3)

    gen_img = augGenerator()
    gen_mask = augGenerator()

    # Shared seed so image and mask receive the same random transforms.
    seed = 1

    gen_img.fit(img, augment=True, seed=seed)
    gen_mask.fit(mask, augment=True, seed=seed)

    img_aug_iter = gen_img.flow(img, seed=seed)
    mask_aug_iter = gen_mask.flow(mask, seed=seed)

    aug_images += [next(img_aug_iter)[0] for _ in range(aug_count)]
    aug_masks += [next(mask_aug_iter)[0] for _ in range(aug_count)]

    return aug_images, aug_masks

flow_from_dataframe() found 0 images

Hello again! I'm still struggling with flow_from_dataframe() after the issues I had here.

In order to use the new fixes, I cloned the keras repo, and then replaced the contents of the preprocessing folder with the latest from the keras-preprocessing repo. I renamed the local repo keras2 to avoid importing the vanilla repo. The code finally runs, but it's not finding any images.

Here's my script:

import pandas as pd
import numpy as np
import sys
sys.path.append('/Users/lmcane/documents/tools/keras2/')
from keras2.preprocessing.image import ImageDataGenerator


train = pd.read_csv('short_dir_train.csv', index_col=0)
print(train.filepath[0] + '\n')
train.info()

Returns:

Using TensorFlow backend.

March 29 2018/Top view_1-2/IMG_6823.JPG

<class 'pandas.core.frame.DataFrame'>
Int64Index: 869 entries, 0 to 868
Data columns (total 2 columns):
filepath    869 non-null object
label       869 non-null object
dtypes: object(2)
memory usage: 60.4+ KB

Then the main body of the script:

main_dir = '/Users/lmcane/Documents/Datasets/Unsorted Extracted/224x224px'

img_width, img_height = 224, 224
nb_train_samples = 433
nb_validation_samples = 216
batch_size = 20
epochs = 10

train_datagen = ImageDataGenerator(horizontal_flip = True,
                                   fill_mode = "nearest",
                                   zoom_range = 0.3,
                                   width_shift_range = 0.1,
                                   height_shift_range = 0.1,
                                   rotation_range = 30)

train_generator = train_datagen.flow_from_dataframe(dataframe=train,
                                                    directory=main_dir,
                                                    x_col='filepath',
                                                    y_col='label',
                                                    has_ext=True,
                                                    target_size = (img_height, img_width),
                                                    batch_size = batch_size, 
                                                    class_mode = "binary")

Returns:

Found 0 images belonging to 2 classes.

It should find 433. I suspect I didn't import the repo correctly?
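One way to narrow this down, independent of how the repo was imported, is to check whether each entry in the filepath column resolves to an existing file under the given directory. The helper below is a hypothetical diagnostic (`find_missing` is not part of Keras) and uses only the standard library.

```python
import os


def find_missing(directory, relative_paths):
    """Return the entries of relative_paths that are not files under directory."""
    missing = []
    for rel in relative_paths:
        if not os.path.isfile(os.path.join(directory, rel)):
            missing.append(rel)
    return missing
```

Calling `find_missing(main_dir, train.filepath)` on the data above would list every path the generator could not possibly find; an empty result would point towards the import or the `has_ext` handling instead.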

Which one is expected by channel_shift?

Hello! While running keras-preprocessing(master)/image.py/random_channel_shift, I found that it behaves differently from the channel shift behavior I expected.

I think the old behavior is the expected one.

Original Images(cifar10)

from keras.datasets import cifar10
from keras.preprocessing.image import random_channel_shift
import numpy as np
import matplotlib.pyplot as plt


def plot_tiles(images, rows=5, columns=5):
    pos = 1
    for idx in range(rows*columns):
        plt.subplot(rows, columns, pos)
        img = images[idx]
        plt.imshow(img)
        plt.axis("off")
        pos += 1
    plt.show()


(x_train, y_train), (x_test, y_test) = cifar10.load_data()
sample_images = x_train[:9]/255
channel_shift_range = 0.3
plot_tiles(sample_images, rows=3, columns=3)

Latest Channel_Shift Images(cifar10)

channel_shift_images_latest = []
for _ in sample_images:
    channel_shift_images_latest.append(_random_channel_shift(_, channel_shift_range, 2))
channel_shift_images_latest = np.array(channel_shift_images_latest)
plot_tiles(channel_shift_images_latest, rows=3, columns=3)

keras-preprocessing(master)/image.py/random_channel_shift

Old Channel_Shift Images(cifar10)

channel_shift_images_old = []
for _ in sample_images:
    channel_shift_images_old.append(random_channel_shift(_, channel_shift_range, 2))
channel_shift_images_old = np.array(channel_shift_images_old)
plot_tiles(channel_shift_images_old, rows=3, columns=3)

keras-preprocessing(old)/image.py/random_channel_shift

flow_from_dataframe with multiple dtype entries

I am trying to use the flow_from_dataframe() function but run into a KeyError: nan error.

My csv file is as follows:
subDirectory_filePath, expression
img_1, 0
img_2, 3
...
img_n, 0

Therefore the first column is a string while the other one is an integer. I have tried to follow this tutorial:
Tutorial on Keras ImageDataGenerator with flow_from_dataframe

and thus my code is:

train_datagen = ImageDataGenerator(rescale=1. / 255,horizontal_flip=False)
df_train = pd.read_csv(data['csv_train_file'], dtype={'subDirectory_filePath': str, 'expression': int})
train_generator = train_datagen.flow_from_dataframe(
    dataframe=df_train,
    directory=data['img_dir'],
    x_col='subDirectory_filePath',
    y_col='expression',
    has_ext=True,
    class_mode="categorical",
    target_size=(model_params['img_height'], model_params['img_width']),
    batch_size=model_params['batch_size']
    #save_to_dir='test_train'
)

and get the following issues:

Found 427298 images belonging to 11 classes.

Traceback (most recent call last):
  File "train_model.py", line 250, in <module>
    train_model(model_name=model_name, dataset=dataset, mode=mode, weights=weights, computer=computer, run=run)
  File "train_model.py", line 198, in train_model
    train_generator, validation_generator = get_csv_generator(data, model_params, da, extended=False)
  File "train_model.py", line 121, in get_csv_generator
    batch_size=model_params['batch_size']
  File "/home/michael/.local/lib/python3.5/site-packages/keras_preprocessing/image.py", line 1108, in flow_from_dataframe
    interpolation=interpolation)
  File "/home/michael/.local/lib/python3.5/site-packages/keras_preprocessing/image.py", line 2168, in __init__
    self.classes = np.array([self.class_indices[cls] for cls in classes])
  File "/home/michael/.local/lib/python3.5/site-packages/keras_preprocessing/image.py", line 2168, in <listcomp>
    self.classes = np.array([self.class_indices[cls] for cls in classes])
KeyError: nan

Furthermore, while printing my csv file I can see that it is full of NaN such as:
subDirectory_filePath expression
002276d73d5822544f39d86b45098e67f84f78cd8edcba8... NaN
01807ce4c37cc4463bd06a966a4043edc14864a0075ff78... NaN

I am guessing that my issue has something to do with loading my csv file. I first tried without the dtype dictionary, but the same error occurs.

Any help much appreciated
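A possible cause, assuming the file really has a space after each comma as in the sample above, is that the header is parsed as " expression" (with a leading space) rather than "expression". The sketch below uses the standard library's csv module with skipinitialspace=True to strip that space and report rows whose label is missing; pandas.read_csv accepts a skipinitialspace flag as well. The helper name `rows_without_label` is made up for illustration.

```python
import csv
import io


def rows_without_label(csv_text, label_col):
    """Return 1-based line numbers of rows whose label is missing or empty.

    Line numbers assume the file has no blank lines (the header is line 1).
    """
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    bad = []
    for line_no, row in enumerate(reader, start=2):
        value = row.get(label_col)
        # Short rows get None for the missing column; "img," gives "".
        if value is None or value.strip() == "":
            bad.append(line_no)
    return bad
```

Rows reported by this check are the ones that would surface as NaN labels once the file is loaded into a DataFrame.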

TimeseriesGenerator introduces offset between target and test data

If you inspect the indices of the target and test data provided with each iteration of TimeseriesGenerator, you find that the target data comes from time step i, while the test data comes from time steps i-length to i-1, inclusive. There appears to be no way to adjust this offset.

The line

keras-preprocessing/keras_preprocessing/sequence.py

Line 379 in 2c7ef1d

targets[j] = self.targets[rows[j]]

should be changed to

targets[j] = self.targets[indices[-1]]

or something similar.
Here's some sample code that displays the issue.

from __future__ import print_function

from keras.preprocessing.sequence import TimeseriesGenerator

import numpy


target = numpy.zeros((100,4,4), dtype = numpy.float32)

for i in range(0,100):
    target[i,...] = i

test = 0 + target

sequence = TimeseriesGenerator(test, target, length = 5, sampling_rate = 1,
                               stride = 1, start_index = 0, end_index = None,
                               shuffle = False, batch_size = 32)

epochs = len(sequence)

print('Length of sequence is', epochs)

epoch = 1

for block in sequence:
    print('Epoch', epoch)
    print('    test data')
    print('        shape    =', block[0].shape)
    print('        elements =', block[0][:,:,2,2])
    print('    target data')
    print('        shape    =', block[1].shape)
    print('        elements =', block[1][:,2,2])

    epoch += 1
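The indexing described above can be reproduced without Keras. The sketch below is a simplified, pure-Python model of the windowing (stride and sampling rate fixed at 1); the function `windows` and its `target_offset` parameter are hypothetical and not part of the TimeseriesGenerator API. With target_offset=0 it pairs samples i-length to i-1 with the target at step i, as reported; with target_offset=-1 it pairs them with the target at the last sampled step, which is the change proposed above.

```python
def windows(data, targets, length, target_offset=0):
    """Return (sample_window, target) pairs for a sliding window of `length`."""
    pairs = []
    for i in range(length, len(data)):
        sample = data[i - length:i]           # steps i-length .. i-1
        pairs.append((sample, targets[i + target_offset]))
    return pairs


series = list(range(10))
print(windows(series, series, length=5)[0])
print(windows(series, series, length=5, target_offset=-1)[0])
```

Exposing the offset as a parameter, rather than hard-coding either choice, would let callers pick between next-step prediction and same-step labelling.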

Pip releases vs github releases

Hi, I'm a maintainer for https://aur.archlinux.org/pkgbase/python-keras-preprocessing/
The latest pip release is 1.0.3 whereas the latest github release is 1.0.2 (albeit with a wrong version number in the setup.py)
Are the pip releases the preferred official release, or should I stick to using the github releases?

Off-by-one error in text preprocessing (sequence_to_matrix)

Seems like the num_words property in text.py is not initialized with the correct length. I found this out because I'm using this value in order to calculate the number of input/output neurons which leads to issues when I'm training the model.

I think num_words should be initialized like this: num_words = len(self.word_index) if not set explicitly.
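Whether an extra column is an off-by-one depends on how indices are assigned. Assuming word indices start at 1 and index 0 is reserved (an assumption about the Tokenizer, not shown in this report), the largest index equals len(word_index), so a matrix addressed by word index needs len(word_index) + 1 columns. The sketch below is a pure-Python illustration; `build_word_index` and `to_binary_matrix` are stand-ins, not the library's functions.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an index starting at 1, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i
            for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_binary_matrix(texts, word_index):
    """One row per text; column j is 1 if the word with index j occurs."""
    num_columns = len(word_index) + 1   # column 0 stays unused
    matrix = []
    for text in texts:
        row = [0] * num_columns
        for word in text.lower().split():
            row[word_index[word]] = 1
        matrix.append(row)
    return matrix
```

Under that assumption, sizing the matrix with len(word_index) alone would make the highest index fall outside the row.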

Is x = np.copy(x) necessary in fit function of class ImageDataGenerator?

Hi! When I ran my code, a memory error occurred in the fit function. I changed the type of img1 to float32 so that x = np.asarray(x, dtype = backend.floatx()) does not copy x, as shown in the picture below. Although there are many ways to solve this problem, I am curious whether x = np.copy(x) is needed. It seems that an if...else... statement that decides whether to adjust the order of x could avoid unnecessary memory allocation, especially when x is a huge matrix.
Many thanks!

The following is the code from lines 1205 to 1232 in image.py.

        x = np.asarray(x, dtype=backend.floatx())
        if x.ndim != 4:
            raise ValueError('Input to `.fit()` should have rank 4. '
                             'Got array with shape: ' + str(x.shape))
        if x.shape[self.channel_axis] not in {1, 3, 4}:
            warnings.warn(
                'Expected input to be images (as Numpy array) '
                'following the data format convention "' +
                self.data_format + '" (channels on axis ' +
                str(self.channel_axis) + '), i.e. expected '
                'either 1, 3 or 4 channels on axis ' +
                str(self.channel_axis) + '. '
                'However, it was passed an array with shape ' +
                str(x.shape) + ' (' + str(x.shape[self.channel_axis]) +
                ' channels).')

        if seed is not None:
            np.random.seed(seed)

        x = np.copy(x)
        if augment:
            ax = np.zeros(
                tuple([rounds * x.shape[0]] + list(x.shape)[1:]),
                dtype=backend.floatx())
            for r in range(rounds):
                for i in range(x.shape[0]):
                    ax[i + r * x.shape[0]] = self.random_transform(x[i])
            x = ax
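For context on why the copy may be there: np.asarray returns its argument unchanged when the dtype already matches, so without a copy any later in-place operation would write into the caller's array. Whether fit goes on to modify x in place is not visible in the excerpt above, so the sketch below only illustrates the NumPy behavior and makes no claim about the library's intent.

```python
import numpy as np

x32 = np.ones((2, 4, 4, 3), dtype="float32")
x64 = np.ones((2, 4, 4, 3), dtype="float64")

# dtype already matches: asarray hands back the very same object
same = np.asarray(x32, dtype="float32")
# dtype differs: asarray has to allocate a converted array
converted = np.asarray(x64, dtype="float32")

print(same is x32)                        # True
print(np.shares_memory(converted, x64))   # False
```

So an if/else that copies only when np.asarray did not already allocate a new array would avoid the second allocation in the float64 case while still protecting float32 input.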

flow_from_dataframe generates all the images from directory regardless of x_col

When the image directory has more files than specified in the 'x_col' of the dataframe, the generator generates more images than expected. See the repro.

It might be that I don't understand how it works though :)

valAcc disparity between flow_from_directory() and flow_from_dataframe()

In order to test how my model training script performed on a benchmark dataset, I converted the stored MNIST to a set of png images. I have them organized in two ways:

Method 1. I have a "train" folder and a "test" folder where images are stored without further organization. I have a dataframe for the train and test set, with column 1 listing the absolute directory, and column 2 listing the label. I've carefully examined this csv- the labels and image listings appear accurate. I've captured and tested the output of flow_from_dataframe(), and it looks fine.

Sample of csv:

,test_samples,test_labels
0,/path/Data/test/6992.png,7
1,/path/Data/test/1380.png,9
2,/path/Data/test/5817.png,4
3,/path/Data/test/5295.png,5
4,/path/Data/test/5340.png,2

Method 2. I have a train and test folder, each with subdirectories for the different categories of images.

Other than how they're organized, these datasets are otherwise identical.

If I run my script and use flow_from_dataframe() with the assets from Method 1, the highest validation accuracy I can manage ranges from 0.01-0.05. If I run my script with flow_from_directory() using the assets in Method 2, my highest checkpoint is 0.93.

What could be the source of this disparity? Am I misusing flow_from_dataframe()? I'll share my scripts from each approach below. Thanks in advance for any insight.

Method 1: Garbage Validation Accuracy

import pandas as pd
import numpy as np
 
import keras
from keras_preprocessing.image import ImageDataGenerator
 
from keras import applications
from keras import optimizers
from keras.models import Model 
from keras.layers import Dropout, Flatten, Dense, GlobalAveragePooling2D
from keras import backend as k 
from keras.callbacks import ModelCheckpoint, CSVLogger
 
from keras.applications.vgg16 import VGG16, preprocess_input
 
# INITIALIZE MODEL
 
img_width, img_height = 32, 32
model = VGG16(weights = 'imagenet', include_top=False, input_shape = (img_width, img_height, 3))
 
# freeze all layers
for layer in model.layers:
    layer.trainable = False
 
# Adding custom Layers 
x = model.output
x = Flatten()(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
predictions = Dense(10, activation="softmax")(x)
 
# creating the final model 
model_final = Model(input = model.input, output = predictions)
 
# compile the model 
rms = optimizers.RMSprop(lr=1e-4)
 
model_final.compile(loss = "categorical_crossentropy", optimizer = rms, metrics=["accuracy"])
 
# LOAD AND DEFINE SOURCE DATA
#df.column_name = df.column_name.astype(str)
 
train = pd.read_csv('/path/Data/MNIST_train.csv', index_col=0)
train.train_labels = train.train_labels.astype(str)
 
val = pd.read_csv('/path/Data/MNIST_test.csv', index_col=0)
val.test_labels = val.test_labels.astype(str)
 
nb_train_samples = 60000
nb_validation_samples = 10000
batch_size = 60
epochs = 5
 
# Initiate the train and test generators
train_datagen = ImageDataGenerator()
test_datagen = ImageDataGenerator()
 
train_generator = train_datagen.flow_from_dataframe(dataframe=train,
                                                    directory=None,
                                                    x_col='train_samples',
                                                    y_col='train_labels',
                                                    has_ext=True,
                                                    target_size = (img_height,
                                                                   img_width),
                                                    batch_size = batch_size, 
                                                    class_mode = 'categorical',
                                                    color_mode = 'rgb')
 
validation_generator = test_datagen.flow_from_dataframe(dataframe=val,
                                                        directory=None,
                                                        x_col='test_samples',
                                                        y_col='test_labels',
                                                        has_ext=True,
                                                        target_size = (img_height, 
                                                                       img_width),
                                                        batch_size = batch_size, 
                                                        class_mode = 'categorical',
                                                        color_mode = 'rgb')

# DEFINE CALLBACKS
 
path = '/path/chk/epoch_{epoch:02d}-valLoss_{val_loss:.2f}-valAcc_{val_acc:.2f}.hdf5'
chk = ModelCheckpoint(path, monitor = 'val_acc', verbose = 1, save_best_only = True, mode = 'max')
logger = CSVLogger('/path/chk/training_log.csv', separator = ',', append=False)
 
nPlus = 1
samples_per_epoch = nb_train_samples * nPlus
 
# Train the model 
model_final.fit_generator(train_generator,
                          steps_per_epoch = int(samples_per_epoch/batch_size),
                          epochs = epochs,
                          validation_data = validation_generator,
                          validation_steps = int(nb_validation_samples/batch_size),
                          callbacks = [chk, logger])

METHOD 2: Stellar Validation Accuracy

from keras import applications
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Model 
from keras.layers import Dropout, Flatten, Dense, GlobalAveragePooling2D
from keras import backend as k 
from keras.callbacks import ModelCheckpoint, CSVLogger
 
img_width, img_height = 32, 32
train_data_dir = '/path/Data/categorical_subdirectories/test/'
validation_data_dir = '/path/Data/categorical_subdirectories/train/'
nb_train_samples = 60000
nb_validation_samples = 10000
batch_size = 60
epochs = 10
 
from keras.applications.vgg16 import VGG16, preprocess_input
 
model = VGG16(weights = "imagenet", include_top=False, input_shape = (img_width, img_height, 3))
 
# Freeze the layers which you don't want to train. Here all layers of the base model are frozen.
for layer in model.layers:
    layer.trainable = False
 
#Adding custom Layers 
x = model.output
x = Flatten()(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
predictions = Dense(10, activation="softmax")(x)
 
# creating the final model 
model_final = Model(input = model.input, output = predictions)
 
RMSprop = optimizers.RMSprop(lr=1e-4)
 
# compile the model 
model_final.compile(loss = "categorical_crossentropy", optimizer = RMSprop, metrics=["accuracy"])
 
model_final.summary()
 
# Initiate the train and test generators with data augmentation
train_datagen = ImageDataGenerator(preprocessing_function = preprocess_input)
test_datagen = ImageDataGenerator(preprocessing_function = preprocess_input)
 
train_generator = train_datagen.flow_from_directory(train_data_dir,
                                                    target_size = (img_height, img_width),
                                                    batch_size = batch_size, 
                                                    class_mode = "categorical")
 
validation_generator = test_datagen.flow_from_directory(validation_data_dir,
                                                        target_size = (img_height, img_width),
                                                        class_mode = "categorical")
 
# Save the model according to the conditions
path = '/path/chk/epoch_{epoch:02d}-valLoss_{val_loss:.2f}-valAcc_{val_acc:.2f}.hdf5'
chk = ModelCheckpoint(path, monitor = 'val_acc', verbose = 1, save_best_only = True, mode = 'max')
logger = CSVLogger('/path/chk/training log.csv', separator = ',', append=False)
 
nPlus = 1
samples_per_epoch = nb_train_samples * nPlus
 
# Train the model 
model_final.fit_generator(train_generator,
                          steps_per_epoch = int(samples_per_epoch/batch_size),
                          epochs = epochs,
                          validation_data = validation_generator,
                          validation_steps = int(nb_validation_samples/batch_size),
                          callbacks = [chk, logger])

ImageDataGenerator resizing and transforming images before preprocessing

I am trying to use the preprocessing function to take a network sized crop out of inconsistently sized input images instead of resizing to the network size. I have tried to do this using the preprocessing function but found that it is not easily possible. Using Keras 2.2.2

ImageDataGenerator does not accept None as a type for target_size which should cause load_img to not resize things.

C:\ProgramData\Anaconda3\envs\tensorflow\lib\site-packages\keras_preprocessing\image.py in __init__(self, directory, image_data_generator, target_size, color_mode, classes, class_mode, batch_size, shuffle, seed, data_format, save_to_dir, save_prefix, save_format, follow_links, subset, interpolation)
   1665         self.directory = directory
   1666         self.image_data_generator = image_data_generator
-> 1667         self.target_size = tuple(target_size)
   1668         if color_mode not in {'rgb', 'rgba', 'grayscale'}:
   1669             raise ValueError('Invalid color mode:', color_mode,

TypeError: 'NoneType' object is not iterable

To accomplish my goal, I modified the image data generator to pass None to load_img and then to resize afterwards if the size doesn't match the target. This hack works for my scenario:

    def _get_batches_of_transformed_samples(self, index_array):
        batch_x = np.zeros(
            (len(index_array),) + self.image_shape,
            dtype=backend.floatx())
        # build batch of image data
        for i, j in enumerate(index_array):
            fname = self.filenames[j]
            img = load_img(os.path.join(self.directory, fname),
                           color_mode=self.color_mode,
                           target_size=None)
            x = img_to_array(img, data_format=self.data_format)
            # Pillow images should be closed after `load_img`,
            # but not PIL images.
            if hasattr(img, 'close'):
                img.close()
            params = self.image_data_generator.get_random_transform(x.shape)
            x = self.image_data_generator.apply_transform(x, params)
            x = self.image_data_generator.standardize(x)
            width_height_tuple = (self.target_size[1], self.target_size[0])
            if (x.shape[1],x.shape[0]) != width_height_tuple:
              x=cv2.resize(x,width_height_tuple, interpolation=cv2.INTER_AREA)
            batch_x[i] = x

While looking into this I saw that the preprocessing function runs at the start of standardize, which is after the random transforms are applied. To me this makes "preprocessing" a misleading name, since it isn't actually happening first.
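The cropping step itself, independent of where it is hooked into the generator, can be sketched in a few lines of NumPy. This assumes channels-last arrays of shape (height, width, channels); `random_crop` and its `rng` parameter are illustrative names, not an existing Keras function.

```python
import numpy as np


def random_crop(x, crop_height, crop_width, rng=None):
    """Return a random crop_height x crop_width window of a channels-last image."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = x.shape[0], x.shape[1]
    if crop_height > height or crop_width > width:
        raise ValueError('Crop size exceeds image size: ' + str(x.shape))
    # integers() excludes the upper bound, hence the + 1
    top = int(rng.integers(0, height - crop_height + 1))
    left = int(rng.integers(0, width - crop_width + 1))
    return x[top:top + crop_height, left:left + crop_width]
```

Since the result is a view into the loaded image, it would have to be applied to the unresized array, which is why the report above needs load_img to skip resizing.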

ImageDataGenerator.flow_from_dataframe keeps loading when directory has subdirectories

I'm working on the MURA dataset by Stanford. I'm trying to load the dataset using Keras's ImageDataGenerator. The data is in the following hierarchy:

The study1_positive folder contains the images.

ImageDataGenerator.flow_from_directory cannot be used with this folder structure, therefore I tried using the flow_from_dataframe method.

However, when run, the code keeps on executing and doesn't stop.

Following is the format of the Pandas DataFrame that I'm passing to the flow_from_dataframe method:

I've also tried changing the labels to 'abnormal' and 'normal' in place of 1 and 0, respectively.

Below is the code:

train_imggen = ImageDataGenerator(rescale=1./255, rotation_range=30,
                              horizontal_flip=True)

train_loader = train_imggen.flow_from_dataframe(traindf, './', shuffle=True,
                                            x_col='path', y_col='label',
                                            color_mode='grayscale',
                                            target_size=(320,320), 
                                            class_mode='binary', 
                                            batch_size=8)

apply_transform changes shape

apply_transform changes the number of channels from input to output.

image_datagen_args = {
        'shear_range': 0.2,
        'zoom_range': 0.2,
        'width_shift_range': 0.2,
        'height_shift_range': 0.2,
        'rotation_range': 45,
        'horizontal_flip': True,
        'vertical_flip': True
}

image_datagen = ImageDataGenerator(**image_datagen_args)

x = np.zeros((32, 32, 1))
params = image_datagen.get_random_transform(x.shape)
x = image_datagen.apply_transform(x, params)
x.shape == (32, 32, 3)

Pillow dependency

In https://github.com/keras-team/keras-preprocessing/blob/master/setup.py

extras_require={
    'tests': ['pytest',
        'pytest-pep8',
        'pytest-xdist',
        'pytest-cov'],
    'image': ['scipy>=0.14'],
},

Shouldn't Pillow also be declared?

I just did a fresh virtual env install and after pip install -r requirements.txt (which doesn't contain Pillow because I assume you take care of that) I noticed that my code fails because Pillow isn't installed.
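If Pillow were to be declared, one option would be to list it under the existing image extra so that installing that extra pulls it in. The fragment below is a proposal, not the project's actual setup.py; no minimum version is given because the report does not state one.

```python
# Sketch of the extras_require argument with Pillow added (left unpinned on purpose).
extras_require = {
    'tests': ['pytest',
              'pytest-pep8',
              'pytest-xdist',
              'pytest-cov'],
    'image': ['scipy>=0.14',
              'Pillow'],
}
```

Keeping it in an extra rather than in the core requirements would leave text-only and sequence-only users without an imaging dependency.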

ImageDataGenerator inherits wrong class from TF 1.11, causing fit_generator to assume it isn't a Sequence

The full description can be found at:
keras-team/keras#11452

ImageDataGenerator returns empty/white images

Referencing this: keras-team/keras#10869 (comment)

I've just been able to reproduce this bug using the following script.

from keras.datasets import mnist

from keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np

(X, y), _ = mnist.load_data()
X = X.reshape(X.shape[0], 1, 28, 28)
X = X[:100]


datagen = ImageDataGenerator(width_shift_range=0.2)
datagen.fit(X)


imgs = []
batches = 0
for i in datagen.flow(X, batch_size=32):
    batches += 1
    for x in i:
        img = np.asarray(x).reshape((28, 28))
        imgs.append([plt.imshow(img, cmap='gray', animated=True)])
    print("Completed batch : %i" % batches)
    if batches >= len(X) / 32:
        break

fig = plt.figure()
ani = animation.ArtistAnimation(fig, imgs, interval=50, blit=True,
                                repeat_delay=1000)
ani.save("test.gif", writer="imagemagick")

Which produces the following (blank) gif:

(Apologies about the scripting, did it whilst stood up waiting for a meeting!)

Adding tests for Exception error messages

If any contributor is feeling up for it, it would be good to add unit tests that check that appropriate error messages are getting raised in various situations not yet covered. It seems that a few error messages were previously incorrectly formatted, because we don't have unit tests for some of these exceptions.

We have some such unit tests already, which look like this:

with pytest.raises(ValueError) as e_info:
      generator.flow((images, x_misc_err), np.arange(dsize), batch_size=3)
assert 'All of the arrays in' in str(e_info.value)
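A general pitfall when writing such tests: any statement placed inside the with block after the call that raises is never executed, so an assertion on the message has to come after the block. The sketch below demonstrates this with the standard library's unittest so that it runs without extra dependencies; pytest.raises behaves the same way in this respect, with the exception exposed as e_info.value instead of ctx.exception. `check_lengths` is a hypothetical stand-in for a library function.

```python
import unittest


def check_lengths(a, b):
    """Hypothetical stand-in for a function that validates its inputs."""
    if len(a) != len(b):
        raise ValueError('All of the arrays in `x` should have the same length.')


case = unittest.TestCase()
reached = []

with case.assertRaises(ValueError) as ctx:
    check_lengths([1, 2], [1])
    reached.append('inside')   # skipped: the exception has already left the block

# Assertions on the message belong after the block.
assert 'All of the arrays in' in str(ctx.exception)
print(reached)   # []
```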

making keras-preprocessing independent

Currently keras and keras-preprocessing depend on each other.
Since it's already a separate module with its own pip package, shouldn't we be able to use keras-preprocessing as an independent tool that does not depend on keras?
Having these two modules mutually depend on each other is causing conflicts at keras-mxnet (a keras fork) awslabs/keras-apache-mxnet#129

ImageDataGenerators ``standardize`` modifies input

I am not sure if this is a bug or intended functionality, but I realized that ImageDataGenerator's standardize() method actually modifies the input passed to it. This can be quite annoying in a Jupyter notebook, when you want to reuse the raw images. Consider the following example code:

import numpy as np
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=2)

images = np.ones((10, 128, 128, 3))
print(images.mean()) # 1.0

images_std = datagen.standardize(images)

print(images.mean(), images_std.mean()) # 2.0 2.0

I tracked this down to the fact that this function uses augmented assignment operators for simple arithmetic operations: https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image.py#L1124 The following two functions perform the same multiplication; however, the first one does not modify the input while the second does:

def multiply1(x):
    x = x * 2

def multiply2(x):
    x *= 2
    
    
images = np.ones((10, 128, 128, 3))
print(images.mean()) # 1.0

multiply1(images)
print(images.mean()) # 1.0

multiply2(images)
print(images.mean()) # 2.0

Note that the behavior also differs for different standardization steps, e.g. zca_whitening does not modify the input while rescale does.
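Until the behavior is changed, a caller can protect the raw data by passing a copy, and a rescaling step can be written out of place. The sketch below uses plain NumPy; `rescale_out_of_place` and `rescale_in_place` are hypothetical stand-ins for the rescale step, not the library's code.

```python
import numpy as np


def rescale_out_of_place(x, factor):
    return x * factor    # binary operator allocates a new array


def rescale_in_place(x, factor):
    x *= factor          # augmented assignment writes into the caller's array
    return x


images = np.ones((10, 8, 8, 3))

out = rescale_out_of_place(images, 2)
print(images.mean(), out.mean())         # 1.0 2.0

# Passing a copy keeps the original intact even with the in-place version.
protected = rescale_in_place(images.copy(), 2)
print(images.mean(), protected.mean())   # 1.0 2.0
```

The copy costs one extra allocation per call, which is the trade-off an in-place implementation avoids.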

How to get flow_from_directory accept absolute path of images as input

I learned from the manual page of flow_from_directory that its first argument is a directory. Sometimes it would also be convenient to pass the paths of individual images, for example when the images are placed in multiple directories. It would help if flow_from_directory could accept images in the following format:

/path1/img1.jpg cat
/path2/img2.jpg dog

The first column is the absolute path to the image, and the second column is the class names.
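As an illustration of how such a listing could be consumed, the sketch below splits each line into a path and a class name using only the standard library. `parse_listing` is a hypothetical helper; the two resulting lists could, for instance, become the two columns of a DataFrame for flow_from_dataframe, which takes x_col and y_col as seen in other reports on this page.

```python
def parse_listing(lines):
    """Split 'path label' lines into parallel lists of paths and labels."""
    paths, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace so paths may contain spaces.
        path, label = line.rsplit(None, 1)
        paths.append(path)
        labels.append(label)
    return paths, labels
```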

[API DESIGN REVIEW] Add random crop support to ImageDataGenerator

Modify the ImageDataGenerator class to receive an extra boolean target_size argument on its constructor and update its methods to produce random crops during training.

See Keras API Design Review at https://docs.google.com/document/d/1zdSsPCxbrCedQgOYqc-Ne6gWzYqIqIqDgExHyThxl1o/edit?usp=sharing

See Keras Issue keras-team/keras#11237 for more details.

issue with index_doc preprocessing/text

keras-preprocessing/keras_preprocessing/text.py

Line 504 in f5b8bef

index_docs = {int(k): v for k, v in index_docs.items()}

It seems the values, not the keys, should be converted to integers, as per the get_config method:

index_docs = {k: int(v) for k, v in index_docs.items()}
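One thing worth checking before changing this line: JSON object keys are always strings, while integer values survive a round trip unchanged. Assuming index_docs maps integer word indices to document counts and is serialized with json.dumps (both are assumptions, not confirmed by the excerpt above), it would be the keys that come back as strings. A minimal standard-library check:

```python
import json

index_docs = {1: 3, 2: 1}   # assumed shape: word index -> document count

restored = json.loads(json.dumps(index_docs))
print(restored)              # {'1': 3, '2': 1}

# Converting the keys restores the original mapping.
fixed = {int(k): v for k, v in restored.items()}
print(fixed == index_docs)   # True
```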

Is it possible the directoryiterator or flow_from_directory takes multiple directories as input ?

Dear Keras team,
I need to combine data from multiple directories with exactly same sub directory structure.
Would it be possible to add this feature into flow_from_directory class ?

With my restricted programming knowledge I would simply modify/add some lines before the line below.

keras-preprocessing/keras_preprocessing/image.py

Line 1888 in 2c7ef1d

for dirpath in (os.path.join(directory, subdir) for subdir in classes):
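The loop above could be generalized by iterating over several root directories that share the same class subfolders. The sketch below is a standalone illustration using only the standard library; `list_class_files` is a hypothetical helper and does not reproduce the extension filtering or link handling of the real DirectoryIterator.

```python
import os


def list_class_files(directories, classes):
    """Collect (path, class_name) pairs from several roots with identical class subfolders."""
    found = []
    for directory in directories:
        for subdir in classes:
            dirpath = os.path.join(directory, subdir)
            if not os.path.isdir(dirpath):
                continue   # a root may lack some classes
            for fname in sorted(os.listdir(dirpath)):
                full = os.path.join(dirpath, fname)
                if os.path.isfile(full):
                    found.append((full, subdir))
    return found
```

The resulting pairs could also be loaded into a DataFrame, which sidesteps the need to change flow_from_directory at all.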

Redesign

This package suffers from the same P0 issue as keras-team/keras-applications#28

There is the further complication that this package leverages imports from Keras as soon as the image.py and sequence.py files are executed for the first time (due to subclassing of the Sequence class). This does not prevent us from applying any of the two solutions proposed, though (for the second solution, we would have to use multiple inheritance to get it to work).

See this script (https://gist.github.com/fchollet/0830affa1f7f19fd47b06d4cf89ed44d) for more details.
