Background
giella-shared currently contains a mixture of data for many different languages:
giella-shared/
├── all_langs
│ └── src
│ ├── filters ⇒ obligatory, move to giella-core?
│ └── fst ⇒ url, punctuation, symbols
├── eng ⇒ names for languages in English majority countries
├── smi ⇒ names, cg functions and dependency graphs mainly for Sámi languages
└── urj-Cyrl ⇒ names for Uralic languages written in Cyrillic
Core idea
Ideally we would only have giella-core as a required dependency (thus needing to move the filters there), and everything else as separate repositories that can be subscribed to on an as-needed/wanted basis.
By generalising how resources are shared, it would also be straightforward to share content across language repositories, for example including sma and sme proper nouns in smj (with some filtering and restrictions). Technically there would be no difference between getting content from lang-sme and shared-smi.
Naming
- using a prefix shared-, parallel to lang-, keyboard- etc. It does not have to be what is suggested here; other suggestions are welcome.
- followed by a BCP 47-like locale tag, but also allowing language family tags such as smi and urj
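The naming scheme above could be checked mechanically. The following is only a minimal sketch: the set of allowed prefixes, the regular expression and the function name parse_repo_name are assumptions made for illustration, and it checks only the shape of the name, not whether the tag is a registered language or family code.

```python
import re

# Hypothetical naming rule: "<kind>-<tag>", where kind is one of the
# repository prefixes mentioned above and tag is a two- or three-letter
# language or language-family code with an optional four-letter script
# subtag (as in urj-Cyrl). This is an assumption, not an agreed convention.
REPO_NAME = re.compile(
    r"(?P<kind>shared|lang|keyboard)-"
    r"(?P<lang>[a-z]{2,3})"
    r"(?:-(?P<script>[A-Z][a-z]{3}))?"
)


def parse_repo_name(name):
    """Return (kind, lang, script) for a well-formed name, else None."""
    match = REPO_NAME.fullmatch(name)
    if match is None:
        return None
    return match.group("kind"), match.group("lang"), match.group("script")
```

With this sketch, shared-smi, shared-urj-Cyrl and lang-sme are accepted, while giella-shared is rejected because it does not start with one of the assumed prefixes.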
Concrete example
The present giella-shared would after a split become (with check marks for the actual split):
Another example:
- using lang-sme as a source for North Sámi names, such as place names, when used in another Sámi language. Non-Sámi names in lang-sme would be filtered out, and generic last elements could be (automatically) adapted to Lule Sámi spelling and inflection as needed. This is relevant both for text analysis and parsing in general, but especially for TTS, where there is a need to get the best possible transcription and pronunciation of whatever is thrown at the system. Place names from related neighbouring languages will certainly be a pain point for many minority languages in such a context.
By treating all repos the same as a potential source for lexical and other resources, we get a more flexible and powerful infrastructure.
Restrictions
Ideally the shared resources should never be required: without access to them, the result should only be a smaller analyser with worse coverage. This will make giella-core the only required external dependency.
As far as possible, the resources in each repo should be independently compilable and testable, kind of like independent code libraries.
Benefits
- more flexibility
- only use what is needed for a language, and start small and simple
- still access to all sorts of premade resources for various purposes
- easier version tagging of each shared resource
- with each repo containing a more clearly defined and limited set of data, it is easier to document, specify and reuse
Considerations
versioning
- should one always assume the latest code?
- or should it be possible to peg the inclusion to a specific version?
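Both options could coexist in one declaration format. As a rough sketch only: the "repo@version" syntax and the names Dependency and parse_dependency below are invented for illustration and are not part of any existing Giella tooling. A bare repository name means "latest", and a name followed by @ and a tag pegs the inclusion to that version.

```python
from typing import NamedTuple, Optional


class Dependency(NamedTuple):
    repo: str
    version: Optional[str]  # None means "use the latest code"


def parse_dependency(spec: str) -> Dependency:
    """Parse a hypothetical 'repo' or 'repo@version' declaration."""
    repo, sep, version = spec.partition("@")
    repo = repo.strip()
    version = version.strip()
    if not repo:
        raise ValueError(f"missing repository name in {spec!r}")
    if sep and not version:
        raise ValueError(f"missing version after '@' in {spec!r}")
    return Dependency(repo, version or None)
```

Here parse_dependency("shared-smi") yields a dependency on the latest code, while parse_dependency("lang-sme@v1.2.0") yields one pegged to the (made-up) tag v1.2.0.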
dependency management
We need a straightforward and simple system to declare dependencies on a list of other repositories, kind of like the dependency list in a Rust Cargo manifest. But as noted above, the system should be robust enough not to break if a resource is unavailable; it should only give a warning.
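The "warn, do not break" behaviour could look roughly like the sketch below. It assumes, purely for illustration, that dependencies are checked out as sibling directories under a common root; the function name resolve_dependencies and that directory layout are assumptions, not a description of the current build system.

```python
import warnings
from pathlib import Path


def resolve_dependencies(names, search_root):
    """Split declared dependencies into found paths and missing names.

    A missing resource only produces a warning, so the build can
    continue with a smaller analyser instead of failing.
    """
    root = Path(search_root)
    found = {}
    missing = []
    for name in names:
        candidate = root / name
        if candidate.is_dir():
            found[name] = candidate
        else:
            missing.append(name)
            warnings.warn(
                f"shared resource '{name}' not found under {root}; "
                "building without it"
            )
    return found, missing
```

A build script would then include only the resources listed in found, and could print the missing list at the end as a summary of what was left out.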
CI
Dependency management needs to be automatic, at least for CI systems. We need at least:
Cleanup
Comments welcome!
@flammie and I discussed this today; the notes above are based on that. We would very much like feedback on these ideas from anyone, but especially from @TinoDidriksen @bbqsrc @Eijebong @Trondtr @aarppe