
Comments (8)

amitdo commented on May 18, 2024

It is unclear who invented the name frk for Frankish. Maybe it should be renamed.

frk is the ISO 639-3 code for Frankish.

from tessdata.

amitdo commented on May 18, 2024

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

FYI, the source of the 'Language' column in the tables is the old Google Code download page. Ray uploaded the official traineddata files to that old page, and Zdenko added a few third-party files.


stweil commented on May 18, 2024

I should have explained my question better. Why is German Fraktur called Frankish? Neither the characters, nor the words, nor the fonts used are Frankish. Without hints from others, I would never have thought of using frk for German Fraktur.


amitdo commented on May 18, 2024

It seems frk is trained on a modern German corpus with a small number of fonts.


amitdo commented on May 18, 2024

@stweil, maybe you want to close this issue?


stweil commented on May 18, 2024

Do you think that frk is the right name? Or should it be renamed, maybe deu_old or deu_frak (as people are used to that name)? "Frankish" is definitely the wrong description for the current frk.


amitdo commented on May 18, 2024

Is 'frk' only for German Fraktur?


stweil commented on May 18, 2024

I expect that the frk LSTM model will also work quite well with Fraktur text in other languages. But the word list of frk is mainly based on German words (I estimate that more than 95 % of the 473228 words are German). The list also includes a few words from English, Spanish, French, Latin, Russian and other languages. Many of them would not be expected in Fraktur text (jQuery, motherboard, ...). The German words show many of the known problems, such as ß/B, ii/ü and other confusions, lowercase nouns (which should always be capitalized in German), capitalized adjectives (which should normally be lowercase), random words in all upper case, lots of web sites (also not typical for Fraktur) and so on.
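To make those problem categories concrete, here is a minimal sketch of how suspicious entries could be flagged. It is not the process used to build the frk word list, which is unknown here. The function name `flag_word`, the patterns, and the labels are hypothetical, and the checks are rough heuristics: for example, legitimate German words such as "variieren" contain "ii" and would be flagged as well.

```python
import re

# Hypothetical pattern for entries that look like web addresses.
URL_LIKE = re.compile(r"(^www\.|://|\.(com|org|net|de)$)", re.IGNORECASE)

# A lowercase letter directly followed by a capital B often indicates
# that an 'ß' was misread as 'B' (e.g. "StraBe" instead of "Straße").
ESZETT_AS_B = re.compile(r"[a-zäöü]B")


def flag_word(word):
    """Return a list of heuristic reasons why `word` looks suspicious."""
    reasons = []
    if URL_LIKE.search(word):
        reasons.append("web address")
    if len(word) > 1 and word.isalpha() and word.isupper():
        reasons.append("all upper case")
    if ESZETT_AS_B.search(word):
        reasons.append("possible ß/B confusion")
    if "ii" in word:
        # Rough check only: some correct words also contain "ii".
        reasons.append("possible ii/ü confusion")
    return reasons


if __name__ == "__main__":
    for entry in ["Haus", "StraBe", "fiir", "HAUS", "www.example.com"]:
        print(entry, flag_word(entry))
```

Checks for lowercase nouns or capitalized adjectives are left out on purpose, since they would need part-of-speech information that a plain word list does not carry.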

@theraysmith, it would be really interesting to know more about the process that produced this word list and the others. They look like extracts from random web sites. I don't think that good word lists for Fraktur can be produced like that.

