Struggling to train with custom alphabet including spaces... about keras-ocr HOT 11 CLOSED


Comments (11)

faustomorales commented on July 20, 2024 1

Thank you so much for reporting this issue. Indeed, the problem was that you had labels containing runs of several consecutive spaces (for example, "foo" and "bar" separated by five spaces), which made it very difficult for the model to train since it is all but impossible to distinguish five spaces from four spaces. I was able to use your dataset to train successfully using the following snippet.

The key line is ' '.join(f.read().split()) which splits the strings using whitespace and then recombines them with only a single space.

import glob
import string

import keras_ocr

def load(label_filepath):
    # Collapse every run of whitespace (spaces, tabs, newlines) into a single
    # space; this also drops leading and trailing whitespace.
    with open(label_filepath) as f:
        label = ' '.join(f.read().split())
    image_filepath = label_filepath.replace('.txt', '.jpg')
    # (image path, bounding box, text); None means the whole image is the crop.
    return image_filepath, None, label

labels = list(map(load, glob.glob('sample_dataset/*.txt')))
alphabet = string.ascii_letters + string.digits + '* /.:,+-¥='
# Every character of every label must be part of the alphabet.
assert all(c in alphabet for _, _, text in labels for c in text), 'An illegal character was found.'

recognizer = keras_ocr.recognition.Recognizer(alphabet=alphabet)
recognizer.compile()

image_generator = keras_ocr.datasets.get_recognizer_image_generator(alphabet=alphabet, labels=labels, height=31, width=200)
batch_generator = recognizer.get_batch_generator(image_generator=image_generator)
recognizer.training_model.fit(
    x=batch_generator,
    steps_per_epoch=10
)

I've just pushed 900f873, which adds an assertion to check for this problem. Without this issue, we probably would not have found it. Again, thanks!
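For anyone who wants to check their own labels before training, here is a minimal sketch of a pre-flight check. It is not the assertion added in 900f873; `find_label_problems` is a hypothetical helper written for illustration, and the lowercase alphabet is only an example.

```python
import re

def find_label_problems(text, alphabet):
    """Return a list of problems found in a single label (hypothetical helper)."""
    problems = []
    # Leading/trailing whitespace cannot be told apart from image margin.
    if text != text.strip():
        problems.append('leading or trailing whitespace')
    # Runs of two or more spaces are the problem described in this thread.
    if re.search(r' {2,}', text):
        problems.append('consecutive spaces')
    # Any character missing from the alphabet (including \t and \n).
    illegal = sorted(set(text) - set(alphabet))
    if illegal:
        problems.append('characters outside alphabet: ' + repr(illegal))
    return problems

# Example alphabet, chosen only for this illustration.
alphabet = 'abcdefghijklmnopqrstuvwxyz '
for label in ['foo bar', 'foo     bar', ' foo', 'foo\tbar']:
    print(repr(label), find_label_problems(label, alphabet))
```

Running this over every label and printing the non-empty results should make it easy to spot which files need cleaning, rather than failing on the first bad one.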


csmcallister commented on July 20, 2024

Just chiming in as someone who has been able to train the recognizer using a custom dataset and a custom alphabet that includes more than just lowercase letters and digits.

Here's my alphabet:
alphabet = ' #()-./0123456789:ABCDEGHIKLMNRSTUVWabcdeghiklmnoprstuvwyz'

I then instantiate the recognizer like this:

recognizer = keras_ocr.recognition.Recognizer(alphabet=alphabet, weights=None)

Then a training script identical to the one here in the docs works, with the only change being the kwarg for keras_ocr.datasets.get_recognizer_image_generator being changed from alphabet=recognizer.alphabet to alphabet=alphabet.


faustomorales commented on July 20, 2024

Thanks @csmcallister! What you proposed is what I was planning to say.

I think the main problem here is that it seems you are expecting the recognizer to be able to pick up on leading / trailing spaces and newlines. The recognizer architecture, being a convolutional recurrent neural network, makes its predictions using full height vertical slices being passed sequentially to the RNN portion of the network. This architecture makes it all but impossible for the network to discern between what is actual whitespace (i.e., margin) and semantically meaningful whitespace (i.e., trailing space) when it occurs at the start or end of a sentence. Spaces embedded within a sentence can be picked up but not spaces at the start or end. This is why the .strip() is important and should not be removed.


cheperuiz commented on July 20, 2024

@csmcallister and @faustomorales Thank you both for your quick replies. That is essentially what I'm doing, except that I'm using actual crops from real images instead of artificially generated images (the dataset was annotated manually and has been used effectively to train other models).
The strip function is back in place. But the spaces embedded within sentences are not being picked up by the model; instead, the loss goes to inf... I can share some code snippets below:

DEFAULT_ALPHABET = '\t\n!"#$\'/()*+.,-0123456789:;=?<>@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz{}~¢¥µñÑΩ€'

recognizer = keras_ocr.recognition.Recognizer(alphabet=DEFAULT_ALPHABET, weights='kurapan')
To clean up the sentences and introduce the start/end characters (which are useful for my application and other RNNs later in the chain), I do:


def clean_text(samples):
    new_samples = []
    for _,text,img in samples:
        text = remove_accents(text)
        text = '\t'+text+'\n'
        text = ''.join([c for c in text if c in DEFAULT_ALPHABET])
        new_samples.append((_,text,img))
    return new_samples

Then, I call your batch generator as below:

batch_size = 8
training_gen, validation_gen = [
    recognizer.get_batch_generator(
        recognizer,
        image_generator=image_generator,
        batch_size=batch_size
    )
    for image_generator in [sample_generator(images_train, texts_train),sample_generator(images_val, texts_val) ]
]

And start the training process.
However, if the dictionary includes the space character, this is what I'm seeing in the training log: 100/100 [==============================] - 5s 47ms/step - loss: inf - val_loss: inf

and in the output console:
./tensorflow/core/util/ctc/ctc_loss_calculator.h:499] No valid path found

If the alphabet doesn't include the space character, it is removed by my cleanup function...


faustomorales commented on July 20, 2024

The example script that @csmcallister linked to does not use artificially generated images. They are crops from real images.

The recognizer architecture will not be able to detect whitespace characters like \t or \n. These must be removed. Spaces are okay, as long as they are between words (like the space between "the" and "fox" in the phrase "the fox").

To help diagnose the exploding gradient, I would suggest using a single image as a test case and see if the gradient continues to explode. If you share a sample image, I can try to take a look.


faustomorales commented on July 20, 2024

If you need the whitespace characters to wrap the predictions, that can happen as a post-processing step after the network output, rather than including it as part of the network output.
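As a rough sketch of what that post-processing could look like: the markers are stripped from the labels before training and added back to the predicted text afterwards. The function names are hypothetical, and the `'\t'` / `'\n'` markers are assumed from the `clean_text` snippet earlier in this thread.

```python
# Assumed start/end markers, taken from the clean_text snippet in this thread.
START, END = '\t', '\n'

def strip_markers(label, start=START, end=END):
    """Remove the start/end markers from a label before recognizer training."""
    if label.startswith(start):
        label = label[len(start):]
    if label.endswith(end):
        label = label[:-len(end)]
    return label

def wrap_prediction(text, start=START, end=END):
    """Add the start/end markers to a recognizer prediction after inference."""
    return start + text.strip() + end

training_label = strip_markers('\tthe fox\n')  # what the recognizer is trained on
downstream_input = wrap_prediction('the fox')  # what a later seq2seq model receives
print(repr(training_label), repr(downstream_input))
```

This keeps the recognizer's alphabet free of characters it cannot see in the image, while a downstream model still gets the markers it expects.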


cheperuiz commented on July 20, 2024

Thank you, that would be great. I can send you a few sample images privately (please let me know an email address, I can't post them here...).

About the extra characters, they are useful for a seq2seq model that comes later in the chain, but for the purposes of this discussion they have been removed (with the same result).

I don't think the problem is exploding gradients per se... the message in the console leads me to believe that the CTC error calculation is aborted for some reason, because the loss changes to inf immediately after that message appears.
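One general condition under which CTC has no valid path, which may or may not be the whole story here: the label needs at least one time step per character plus one extra blank step between every pair of identical adjacent characters. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the time-step count of 12 is an arbitrary illustrative number, not the recognizer's actual output length.

```python
def ctc_min_time_steps(label):
    """Minimum number of time steps CTC needs to emit `label`.

    One step per character, plus one blank step between each pair of
    identical adjacent characters (CTC merges repeats without a blank).
    """
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats

def is_feasible(label, time_steps):
    """True if a CTC alignment of `label` fits in `time_steps` steps."""
    return ctc_min_time_steps(label) <= time_steps

# 12 is an illustrative sequence length, not a keras-ocr value.
for label in ['foo bar', 'foo     bar']:
    print(repr(label), ctc_min_time_steps(label), is_feasible(label, 12))
```

Under this rule, every extra repeated space raises the required length by two steps (one for the space, one for the separating blank), so labels with long runs of spaces are the first to stop fitting.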


faustomorales commented on July 20, 2024

Sure, you can reach me at [email protected].


cheperuiz commented on July 20, 2024

Done. Thank you :)


cheperuiz commented on July 20, 2024

Hi! I just solved my issue. Turns out that some of my labels had duplicate spaces... I fixed that and now it's training flawlessly :D thanks again for your comments, they definitely pointed me in the right direction (our own data).
Cheers!


cheperuiz commented on July 20, 2024

Awesome! Thank you very much for this!

