Comments (6)
I am looking at the process_data(self, batch_data) function under the DataProvider class. There it can be seen that, for each batch item (i.e. a labeled example), all augmentors are applied in order and then all transformers are applied in order:
# Then augment, transform and postprocess the batch data
for objects in [self._augmentors, self._transformers]:
    for object in objects:
        data, annotation = object(data, annotation)
Isn't the purpose of augmentors to add more examples and thereby increase the training set? That is, for each example, shouldn't it add to the training set both the original example and an "augmented" variation, preferably multiple augmented versions per example?
Am I misunderstanding?
Seems like process_data(self, batch_data) should look something like this:
def process_data(self, batch_data):
    """ Process data batch of data """
    if self._use_cache and batch_data[0] in self._cache:
        data, annotation = copy.deepcopy(self._cache[batch_data[0]])
    else:
        data, annotation = batch_data
        for preprocessor in self._data_preprocessors:
            data, annotation = preprocessor(data, annotation)

        if data is None or annotation is None:
            self.logger.warning("Data or annotation is None, marking for removal on epoch end.")
            self._on_epoch_end_remove.append(batch_data)
            return None, None

        if self._use_cache and batch_data[0] not in self._cache:
            self._cache[batch_data[0]] = (copy.deepcopy(data), copy.deepcopy(annotation))

    # Then transform, augment and postprocess the batch data
    for transformer in self._transformers:
        data, annotation = transformer(data, annotation)

    augmented_data_list = []
    if len(self._augmentors) > 0:
        for i in range(self._variation_count):
            # generate multiple variations using specified augmentors
            augmented_data = data
            for augmentor in self._augmentors:
                augmented_data, annotation = augmentor(augmented_data, annotation)
            augmented_data_list.append((augmented_data, annotation))

    all_data_list = []
    for data, annotation in [(data, annotation)] + augmented_data_list:
        # Convert to numpy array if not already
        if not isinstance(data, np.ndarray):
            data = data.numpy()

        # Convert to numpy array if not already
        # TODO: This is a hack, need to fix this
        if not isinstance(annotation, (np.ndarray, int, float, str, np.uint8, float)):
            annotation = annotation.numpy()

        all_data_list.append((data, annotation))

    return all_data_list
With __getitem__(self, index: int) looking something like this:
def __getitem__(self, index: int):
    """ Returns a batch of data by batch index"""
    dataset_batch = self.get_batch_annotations(index)

    # First read and preprocess the batch data
    batch_data, batch_annotations = [], []
    for index, batch in enumerate(dataset_batch):
        for data, annotation in self.process_data(batch):
            if data is None or annotation is None:
                self.logger.warning("Data or annotation is None, skipping.")
                continue
            batch_data.append(data)
            batch_annotations.append(annotation)

    return np.array(batch_data), np.array(batch_annotations)
Hey, thanks for the question, but the idea is different from what you think. When we are training our models, we don't want to change the number of data samples in our dataset. We do want to return both original examples and modified examples, but the choice is made randomly each training epoch. That's why the augmentors modify the data randomly: each augmentor has a randomness coefficient that controls how often it returns a modified example and how often the original one.
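For illustration, here is a minimal sketch of what such a probability-gated augmentor might look like. The class name, parameters, and brightness logic are illustrative assumptions, not mltu's actual implementation:

import numpy as np

class RandomBrightnessSketch:
    """Illustrative augmentor: modifies the example only with probability `random_chance`."""
    def __init__(self, random_chance: float = 0.5, delta: int = 50):
        self.random_chance = random_chance  # how often to return a modified example
        self.delta = delta                  # maximum brightness shift

    def __call__(self, data, annotation):
        # With probability (1 - random_chance), return the original example unchanged
        if np.random.rand() > self.random_chance:
            return data, annotation
        # Otherwise return a randomly brightened/darkened copy of the image
        factor = 1.0 + np.random.uniform(-1.0, 1.0) * self.delta / 255.0
        data = np.clip(data.astype(np.float32) * factor, 0, 255).astype(np.uint8)
        return data, annotation

# Example usage (hypothetical): ~50% of calls return the original image
# augmentor = RandomBrightnessSketch(random_chance=0.5)
# image, label = augmentor(image, label)

Because the gate is re-rolled every time the sample is read, the dataset size never changes, yet over many epochs the model sees both original and modified versions of each example.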
I get what is going on in the code, but why would we want to randomly change our original examples without even using the original examples? As far as I understand, the purpose of augmentation is to "artificially increase the training set by creating modified copies of a dataset using existing data" in order to:
- To prevent models from overfitting.
- The initial training set is too small.
- To improve the model accuracy.
- To reduce the operational cost of labeling and cleaning the raw dataset.
(Source for the above: https://www.datacamp.com/tutorial/complete-guide-data-augmentation)
Thoughts?
Who said we are not using original examples?
If we have 1000 images and the augmentor has a 50% chance, then on average 500 images per epoch will be original and 500 modified.
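A quick back-of-envelope check of that expectation (the numbers below are just the ones from this thread, assuming independent 50% draws each epoch):

# Illustrative expectation only: 1000 images, 50% augmentation chance, 50 epochs
images, chance, epochs = 1000, 0.5, 50
originals_per_epoch = images * (1 - chance)       # ~500 original images per epoch on average
original_views_total = epochs * (1 - chance)      # each image seen unmodified ~25 times over training
print(originals_per_epoch, original_views_total)  # 500.0 25.0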
Yes, you're correct, but why not use 100% of the original examples (that are in the training set, of course) and then "augment" that data set with additional examples? I gather this is the purpose of augmentation?
You are not training the model for 1 epoch; you are probably going to train it for at least 50 epochs. Because we augment randomly picked images, the original photos will still be used 50% of the time across all these epochs. So what is the problem? Want to use fewer augmented photos? Then set the augmentor's random chance to 30%, etc.
If, for example, you are using 10 different augmentors, then from 10k images you would get 100k images, and what happens if you store all of them in RAM? You may run out of RAM. So there is no reason to hold the original and augmented images in one place (in a single list, for example), because you are still going to use a batch size that fits your model. My solution is efficient, simple, and expandable. I hope you understand :)
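For a rough sense of the memory argument (a back-of-envelope sketch with assumed, not measured, image sizes):

# Pre-generating 10 augmented copies per image for 10k images and keeping them in RAM,
# assuming 256x256 RGB uint8 images (illustrative size only)
n_images = 10_000 * 10                    # ~100k images total
bytes_per_image = 256 * 256 * 3           # uint8 RGB
print(n_images * bytes_per_image / 1e9)   # ~19.7 GB held in RAM, versus only one batch
                                          # at a time when augmenting on the fly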
Yes, thanks a lot for the explanation! Closing this.