
Comments (6)

pythonlessons avatar pythonlessons commented on July 28, 2024

I am looking at the process_data(self, batch_data) function under the DataProvider class. There it can be seen that for each "batch data" item, i.e. a labeled example, all augmentors are applied in order and then all transformers are applied in order:

    # Then augment, transform and postprocess the batch data
    for objects in [self._augmentors, self._transformers]:
        for object in objects:
            data, annotation = object(data, annotation)

Isn't the purpose of augmentors to add more examples to increase the training set? I.e., for each example, it should add to the training set both the original example and an "augmented" variation, preferably multiple augmented versions per example.

Am I misunderstanding?

Seems like process_data(self, batch_data) should look something like this:

def process_data(self, batch_data):
    """ Process a single batch_data item """
    if self._use_cache and batch_data[0] in self._cache:
        data, annotation = copy.deepcopy(self._cache[batch_data[0]])
    else:
        data, annotation = batch_data
        for preprocessor in self._data_preprocessors:
            data, annotation = preprocessor(data, annotation)

        if data is None or annotation is None:
            self.logger.warning("Data or annotation is None, marking for removal on epoch end.")
            self._on_epoch_end_remove.append(batch_data)
            return []  # empty list so __getitem__ can iterate safely

        if self._use_cache and batch_data[0] not in self._cache:
            self._cache[batch_data[0]] = (copy.deepcopy(data), copy.deepcopy(annotation))

    # Then transform, augment and postprocess the batch data
    for transformer in self._transformers:
        data, annotation = transformer(data, annotation)

    augmented_data_list = []
    if len(self._augmentors) > 0:
        for _ in range(self._variation_count):  # generate multiple variations using the specified augmentors
            # Start each variation from the untouched original example
            augmented_data, augmented_annotation = data, annotation
            for augmentor in self._augmentors:
                augmented_data, augmented_annotation = augmentor(augmented_data, augmented_annotation)

            augmented_data_list.append((augmented_data, augmented_annotation))

    all_data_list = []
    for data, annotation in [(data, annotation)] + augmented_data_list:

        # Convert to numpy array if not already
        if not isinstance(data, np.ndarray):
            data = data.numpy()

        # Convert to numpy array if not already
        # TODO: This is a hack, need to fix this
        if not isinstance(annotation, (np.ndarray, int, float, str, np.uint8)):
            annotation = annotation.numpy()

        all_data_list.append((data, annotation))

    return all_data_list

With __getitem__(self, index: int) looking something like this:

def __getitem__(self, index: int):
    """ Returns a batch of data by batch index """
    dataset_batch = self.get_batch_annotations(index)

    # First read and preprocess the batch data
    batch_data, batch_annotations = [], []
    for batch in dataset_batch:
        for data, annotation in self.process_data(batch):
            if data is None or annotation is None:
                self.logger.warning("Data or annotation is None, skipping.")
                continue

            batch_data.append(data)
            batch_annotations.append(annotation)

    return np.array(batch_data), np.array(batch_annotations)

Hey, thanks for the question, but the idea is different from what you think. When training our models, we don't want to change the number of data samples in the dataset. We do want to return both original and modified examples, but we do this randomly each training epoch. This is why the augmentors modify data randomly: we choose a randomness coefficient that controls how often a modified example is returned and how often the original one.
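
For concreteness, here is a minimal sketch of that pattern (the class and parameter names here are illustrative, not necessarily mltu's exact API): each augmentor flips a biased coin per sample and either passes the original through or returns a modified copy.

import random

import numpy as np

class RandomBrightness:
    """Illustrative augmentor: modifies a sample with probability
    `random_chance`, otherwise returns it unchanged."""

    def __init__(self, random_chance: float = 0.5):
        self.random_chance = random_chance

    def __call__(self, data: np.ndarray, annotation):
        if random.random() > self.random_chance:
            return data, annotation  # keep the original example this epoch

        # Return a modified copy; the dataset itself never grows
        factor = random.uniform(0.7, 1.3)
        augmented = np.clip(data.astype(np.float32) * factor, 0, 255).astype(np.uint8)
        return augmented, annotation

Because the coin is re-flipped every epoch, the same image alternates between original and augmented views over a long training run.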


seidnerj avatar seidnerj commented on July 28, 2024

I get what is going on in the code, but why would we want to randomly change our original examples, without even using the original examples? As far as I understand, the purpose of augmentation is to "artificially increase the training set by creating modified copies of a dataset using existing data" in order to:

  1. To prevent models from overfitting.
  2. The initial training set is too small.
  3. To improve the model accuracy.
  4. To reduce the operational cost of labeling and cleaning the raw dataset.

(Source for the above: https://www.datacamp.com/tutorial/complete-guide-data-augmentation)

Thoughts?


pythonlessons avatar pythonlessons commented on July 28, 2024

Who said we are not using original examples?
If we have 1000 images and the augmentor has a 50% chance, then on average 500 images will be original and 500 modified in each epoch.


seidnerj avatar seidnerj commented on July 28, 2024

Yes, you're correct, but why not use 100% of the original examples (that are in the training set, of course) and then "augment" that data set with additional examples? I gather this is the purpose of augmentation?


pythonlessons avatar pythonlessons commented on July 28, 2024

You are not training the model for 1 epoch; you will probably train it for at least 50 epochs. Because we augment randomly picked images, the original photos will still be used 50% of the time across all these epochs. So what is the problem? Want fewer augmented photos? Then set the augmentor's random chance to 30%, etc.
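
To put rough numbers on that (a sketch, assuming a 50% augmentation chance applied independently each epoch):

p = 0.5   # augmentation chance per epoch (illustrative)
E = 50    # training epochs

expected_original_views = E * (1 - p)  # each image is seen unmodified ~25 times
never_seen_original = p ** E           # ~8.9e-16, effectively impossible
print(expected_original_views, never_seen_original)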

For example, if you use 10 different augmentors then, from 10k images, you could end up with 100k images; what happens if you store all of them in RAM? You may run out of memory. So there is no reason to hold the original and augmented images in one place (in a single list, for example), because you will still use some batch size that fits your model. My solution is efficient, simple, and extensible. I hope you understand :)
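
A back-of-envelope memory estimate for that 10k vs. 100k example (illustrative image size of 256x256 RGB, uint8):

images = 10_000
variations = 10                   # e.g. one variation per augmentor (illustrative)
bytes_per_image = 256 * 256 * 3   # uint8 RGB

original_gb = images * bytes_per_image / 1024**3    # ~1.8 GB
materialized_gb = original_gb * (1 + variations)    # ~20 GB if all variants are kept
print(f"{original_gb:.1f} GB vs {materialized_gb:.1f} GB")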


seidnerj avatar seidnerj commented on July 28, 2024

Yes, thanks a lot for the explanation! Closing this.

