There are 2 major issues with the Greek data. They tend to produce µ


Modern Greek data issues (tessdata, open)

chopinesque commented on July 18, 2024

Modern Greek data issues

from tessdata.

Comments (10)

stweil commented on July 18, 2024

https://github.com/tesseract-ocr/langdata_lstm/tree/main/ell contains training text and a word list with the same issues, so the model was trained to produce such results.


chopinesque commented on July 18, 2024

Right, so how can this be fixed? For example, I can see that https://github.com/tesseract-ocr/langdata_lstm/blob/main/ell/desired_characters and https://github.com/tesseract-ocr/langdata_lstm/blob/main/ell/ell.unicharset both contain polytonic characters, which should not be there.
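To see which characters in a file such as desired_characters are polytonic, a small script can help. The following is only a sketch: the function names are made up for illustration, and it assumes that "polytonic" means anything in the Unicode Greek Extended block (U+1F00 to U+1FFF) plus a short, non-exhaustive list of combining marks that monotonic Greek does not use.

```python
import unicodedata

# Greek Extended block: precomposed polytonic letters and standalone marks.
GREEK_EXTENDED = range(0x1F00, 0x2000)

# Combining marks assumed to be polytonic-only (not an exhaustive list):
# grave, psili, dasia, perispomeni, ypogegrammeni.
# U+0301 (acute) is left out because monotonic text uses it for the tonos.
POLYTONIC_MARKS = {0x0300, 0x0313, 0x0314, 0x0342, 0x0345}


def polytonic_chars(text):
    """Return the sorted list of distinct polytonic-looking characters in text."""
    found = set()
    for ch in text:
        cp = ord(ch)
        if cp in GREEK_EXTENDED or cp in POLYTONIC_MARKS:
            found.add(ch)
    return sorted(found)


def report(path):
    """Print code point and Unicode name for each polytonic character in a UTF-8 file."""
    with open(path, encoding="utf-8") as f:
        for ch in polytonic_chars(f.read()):
            print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}")
```

Calling `report("ell/desired_characters")` on a local checkout (the path is an assumption) would list the candidates to remove or replace. Since it is plain Python, it also runs on Windows.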


stweil commented on July 18, 2024

As a first step, you could send a pull request for langdata_lstm that fixes the files there. Ultimately, though, new training runs are required, perhaps based on the existing models for Greek.


chopinesque commented on July 18, 2024

OK, I may need some guidance, please. I created a fork. So do I simply have to remove the invalid characters from the above-mentioned files?

I also see

tessedit_load_sublangs grc
https://github.com/chopinesque/langdata_lstm_modern_greek/blob/main/ell/ell.config#L2

I am not sure whether this line should be there going forward.


stweil commented on July 18, 2024

> So do I simply have to remove non-valid characters from above mentioned files?

Remove or replace, whichever fits better.


chopinesque commented on July 18, 2024

Thanks. If I replace, I need to know about the structure, for example,

ὶ 3 0,255,0,255,0,0,0,0,0,0 Greek 124 0 124 ὶ # ὶ [1f76 ]a

How is the 124 0 124 derived?


stweil commented on July 18, 2024

You can keep the unicharset file unmodified. A replacement will be created when a new training is run.
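For reference on the "124 0 124" question: as far as the unicharset(5) documentation describes the format, each line holds the character, a hexadecimal properties bitmask, ten glyph metrics, the script name, then `other_case`, `direction`, `mirror`, and the normalized form. On that reading, the two 124s are ids of other entries in the same file and the 0 is a direction code, which would explain why the file is regenerated by training rather than edited by hand. The sketch below splits a line along those lines; the class and function names are illustrative only, and the field interpretation is an assumption to check against the documentation.

```python
from typing import List, NamedTuple


class UnicharEntry(NamedTuple):
    char: str
    properties: int            # bitmask, written in hex in the file
    glyph_metrics: List[int]   # ten comma-separated integers
    script: str
    other_case: int            # id of the case counterpart entry
    direction: int             # direction code
    mirror: int                # id of the mirrored entry
    normed_form: str


def parse_unicharset_line(line):
    """Parse the first eight whitespace-separated fields; ignore the trailing comment."""
    fields = line.split()
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")
    return UnicharEntry(
        char=fields[0],
        properties=int(fields[1], 16),
        glyph_metrics=[int(v) for v in fields[2].split(",")],
        script=fields[3],
        other_case=int(fields[4]),
        direction=int(fields[5]),
        mirror=int(fields[6]),
        normed_form=fields[7],
    )


entry = parse_unicharset_line(
    "\u1f76 3 0,255,0,255,0,0,0,0,0,0 Greek 124 0 124 \u1f76 # \u1f76 [1f76 ]a"
)
print(entry.other_case, entry.direction, entry.mirror)  # prints: 124 0 124
```

Because `other_case` and `mirror` point at positions in the file, removing or reordering entries manually would shift those ids, so leaving the file untouched until the next training run is the safer route.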


stweil commented on July 18, 2024

> tessedit_load_sublangs grc

That line tells Tesseract to always load grc in addition to ell. Therefore, wrong glyphs can also come from grc as long as that configuration is present.


chopinesque commented on July 18, 2024

> You can keep the unicharset file unmodified. A replacement will be created when a new training is run.

Not sure then which files I should change. I don't think I have the knowledge to do any training (I also use Windows).


chopinesque commented on July 18, 2024

> tessedit_load_sublangs grc
>
> That line tells Tesseract to always use grc in addition to ell. Therefore wrong glyphs can also come from grc as long as that configuration is there.

So this line should be removed.
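Removing the line from a local copy of ell.config can be scripted. This is a minimal sketch with an illustrative function name. It assumes the line consists of exactly the two tokens `tessedit_load_sublangs grc`, and that the edited file only takes effect once a new traineddata file is built from it.

```python
def drop_sublang_line(config_text, sublang="grc"):
    """Return config_text without lines that are exactly 'tessedit_load_sublangs <sublang>'."""
    kept = []
    for line in config_text.splitlines():
        # Compare whitespace-separated tokens so extra spaces or tabs do not matter.
        if line.split() == ["tessedit_load_sublangs", sublang]:
            continue
        kept.append(line)
    return "".join(line + "\n" for line in kept)


def rewrite_config(path):
    """Rewrite a UTF-8 config file in place without the grc sub-language line."""
    with open(path, encoding="utf-8") as f:
        cleaned = drop_sublang_line(f.read())
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(cleaned)
```

Lines that list other sub-languages, or any other parameter, are left untouched, so only the grc dependency is dropped.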

