giellalt / giella-core

Build tools and build support files as well as developer support tools for the GiellaLT repositories.

Home Page: https://giellalt.uit.no

License: GNU General Public License v3.0

Languages: Python 30.94%, Shell 27.70%, XSLT 20.92%, Perl 15.84%, Awk 1.66%, Makefile 1.21%, CSS 1.09%, M4 0.38%, R 0.08%, Dockerfile 0.08%, JavaScript 0.07%, HTML 0.03%

Topics: nlp, rule-based-nlp, autotools, minority-language, proofing-tools, maturity-prod

giella-core

This repo is required by all other GiellaLT repos, and will be cloned automatically when running ./autogen.sh in a lang repo for the first time.

Some further documentation can be found on the project site.

Requirements

  • uconv from ICU, used for Unicode conversion in builds
    • on Debian/Ubuntu: apt install icu-devtools
    • on macOS with Homebrew: brew install icu4c
  • HFST

Contributors

aarppe, albbas, biret365, carges, ciprian-no, flammie, gueriksson, leneantonsen, merisiga, mka055, phaqui, reynoldsnlp, rtxanson, rueter, snomos, tinodidriksen, trondtr, trondtynnol, unhammer


Issues

Documentation for new languages

Hello,

I am interested in building FSTs for a new language. Is there documentation that will help someone new get started?

Errors in AWK script converting in-source jspwiki to markdown

The first test docs can be found in lang-sma - scroll to Documentation and click In-source documentation. The following things should be fixed:

  • tables
  • definition lists
  • nested lists
  • numbered lists
  • boldface in all cases

Add more bugs to the list above as they are found, and we can check them off as they get fixed.
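To make the table bug concrete, here is a minimal sketch of what the conversion needs to do for tables, assuming standard jspwiki syntax where `||` marks header cells and `|` marks data cells. This is not the actual AWK script, just an illustration in Python:

```python
def jspwiki_table_to_markdown(lines):
    """Convert a block of jspwiki table lines to a Markdown table.

    jspwiki: '||' starts a header cell, '|' a data cell.
    """
    out = []
    for line in lines:
        if line.startswith('||'):  # header row
            cells = [c.strip() for c in line.strip('|').split('||')]
            out.append('| ' + ' | '.join(cells) + ' |')
            out.append('|' + ' --- |' * len(cells))
        elif line.startswith('|'):  # data row
            cells = [c.strip() for c in line.strip('|').split('|')]
            out.append('| ' + ' | '.join(cells) + ' |')
    return out

print('\n'.join(jspwiki_table_to_markdown(
    ['|| Case || Suffix', '| Nominative | -', '| Genitive | -n'])))
```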

yaml paradigm tests do not give summary

To repeat:

cd src/fst/test
make -d devtest

YAML test 219:  generator-gt-norm.hfstol + gt-norm-yamls/V-lex-der-kastid_gt-norm.gen.yaml - 44/0/44 PASS
SUMMARY for the generating gt-norm fst(s): PASSES:  / FAILS:  / TOTAL: 

The problem is that the SUMMARY line does not give the total numbers of passes, fails and tests. The error shows up on both Mac and Linux.
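For reference, a minimal sketch of the aggregation the summary should perform, assuming the per-test lines keep the passes/fails/total format shown above:

```python
import re
import sys

# Matches per-test result lines like:
#   YAML test 219: ... - 44/0/44 PASS
RESULT = re.compile(r'- (\d+)/(\d+)/(\d+) (PASS|FAIL)')

passes = fails = total = 0
for line in sys.stdin:
    m = RESULT.search(line)
    if m:
        passes += int(m.group(1))
        fails += int(m.group(2))
        total += int(m.group(3))

print(f'SUMMARY: PASSES: {passes} / FAILS: {fails} / TOTAL: {total}')
```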

Relax dependency on `kbdgen` for mobile spellers

Many or most CI/CD speller builds fail because the build tries to create the keyboard layout error model using kbdgen, and cannot find the tool. Even if it could, it would not work, since keyboard layout error model generation has not yet been reimplemented in the Rust version of kbdgen.

We thus need to ensure that mobile spellers are built even when the keyboard layout error model file is not present, basically making the speller identical to the desktop speller.

mob_spellercorpus.unitweight.txt missing

Pristine --enable-hfst-mobile-speller builds die with No rule to make target 'mob_spellercorpus.unitweight.txt', needed by '.generated/mob_unitweighted.hfst'.

Improve GitHub push event posts in Zulip

Background: the commit messages we currently get in Zulip are not very informative, and lack essential information for doing light-weight code reviews the way we used to when using svn. Back then, we received an email for each commit, containing a list of all changed files plus a diff of the first 1000 lines or so.

Now I stumbled across the following: https://github.com/zulip/github-actions-zulip#readme

Using it we can build an action that sends a message to Zulip that contains exactly what we want. The most wanted content is:

  • full commit message, not only title
  • list of all files in each commit
  • diffs (or at least parts of them) for each file? Needs to be evaluated; this can easily become too much

In addition, we want to include what is already now present:

  • the push hash and link
  • each of the commits, with hash and links
  • user name of user pushing
  • number of commits in push
  • name of branch being pushed to
  • stream as a function of repo name
  • topic derived from repo + branch

There is an absolute limit of 10 000 bytes in each post. And we don't want to spam our Zulip streams too much.
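A sketch of how such a message could be composed from the push event payload before handing it to the Zulip action (the payload fields follow GitHub's push webhook format; the truncation strategy is just one possibility):

```python
MAX_BYTES = 10_000  # Zulip's hard per-message limit

def compose_push_message(payload):
    """Build a Zulip message body from a GitHub push event payload."""
    branch = payload["ref"].rsplit("/", 1)[-1]
    parts = [f'**{payload["pusher"]["name"]}** pushed '
             f'{len(payload["commits"])} commit(s) to `{branch}`']
    for c in payload["commits"]:
        files = c.get("added", []) + c.get("modified", []) + c.get("removed", [])
        # Full commit message and the list of touched files, as wished above.
        parts.append(f'* [{c["id"][:8]}]({c["url"]}): {c["message"]}')
        parts.extend(f'    * `{f}`' for f in files)
    body = "\n".join(parts)
    # Truncate to the Zulip limit, leaving room for a marker.
    encoded = body.encode("utf-8")
    if len(encoded) > MAX_BYTES:
        body = encoded[:MAX_BYTES - 30].decode("utf-8", "ignore") + "\n(truncated)"
    return body
```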

make in docs fails for very many targets

To repeat: go to lang-nob/docs and run make

Result:

make[1]: *** No rule to make target `adverbs-stems.md'. Stop.
make[1]: *** No rule to make target `nynorsk-stems-stems.md'. Stop.
make[1]: *** No rule to make target `interjections-stems.md'. Stop.
make[1]: *** No rule to make target `conjunctions-stems.md'. Stop.
make[1]: *** No rule to make target `nouns-stems.md'. Stop.
make[1]: *** No rule to make target `adjectives-stems.md'. Stop.
make[1]: *** No rule to make target `nob-abbreviations-stems.md'. Stop.
make[1]: *** No rule to make target `verbs-stems.md'. Stop.
make[1]: *** No rule to make target `prepositions-stems.md'. Stop.
make[1]: *** No rule to make target `subjunctions-stems.md'. Stop.
make[1]: *** No rule to make target `pronouns-stems.md'. Stop.
make[1]: *** No rule to make target `numerals-stems.md'. Stop.
make[1]: *** No rule to make target `nob-propernouns-stems.md'. Stop.
make[1]: `root-morphology.md' is up to date.
make[1]: *** No rule to make target `symbols-affixes.md'. Stop.
make[1]: *** No rule to make target `nouns-affixes.md'. Stop.
make[1]: *** No rule to make target `adjectives-affixes.md'. Stop.
make[1]: *** No rule to make target `verbs-affixes.md'. Stop.
make[1]: *** No rule to make target `abbreviations-affixes.md'. Stop.
make[1]: *** No rule to make target `numerals-affixes.md'. Stop.
make[1]: *** No rule to make target `propernouns-affixes.md'. Stop.
make[1]: `phonology-morphology.md' is up to date.
make[1]: *** No rule to make target `compounding-morphology.md'. Stop.
make[1]: *** No rule to make target `smi-propernouns-generated.md'. Stop.
make[1]: *** No rule to make target `smi-acronyms-generated.md'. Stop.
make[1]: *** No rule to make target `punctuation-generated.md'. Stop.
make[1]: *** No rule to make target `symbols-generated.md'. Stop.
make[1]: *** No rule to make target `arabic_roman_digits-generated.md'. Stop.
make[1]: *** No rule to make target `smi-nob-abbreviations-generated.md'. Stop.
make[1]: *** No rule to make target `smi-nob-propernouns-generated.md'. Stop.
make[1]: *** No rule to make target `smi-abbreviations-generated.md'. Stop.
make[1]: *** No rule to make target `transcriptor-abbrevs2text.md'. Stop.

Digits wrongly processed by divvun-suggest output awk script

The script earlier found in lang-smj/tools/tts/convert-helper.awk has been moved to giella-core/scripts/convert-divvunsuggest-to-almostplain.awk; I believe it can be useful for more languages.

There are still some issues:

  • digits in the thousands
  • some compounds (followed by :?)
  • CLB tags in the output (final :?)
  • Actual commas disappear
  • Some long compounds get duplicated

Sample input to test the errors

Digits in the thousands (i.e. including a space):

– Nordlánda fylkkamánne le ájnas aktisasjbarggoguojmme gå galggap tjadádit dåjmajt sámegiela doajmmaplánan oarjjel- ja julevsámegielaj gáktuj. Dan diehti doarjju ráddidus fylkkamánne gielladåjmajt jagen 2010 1,7 millijåvnåjn kråvnåjn, javllá ådåstuhttem-, háldadus- ja girkkoministar Rigmor Aasrud. Duodden juolloduvvá 150 000 kråvnå prosjæktaj ”YouTube på lulesamisk”.

CLB tag in the output:

Sij gudi libjjáv oadtju li:

Compound:

Gållelibjes: Oddvar Hansen, Otto Kristian Løvik, Lill Hege Nilsen, Jørn Øverby ja Erik Martinsen Øvergaard.

Disappearing commas:

Silbbalibjes: Lise Berit Aronsen, Thomas Cordtsen, Kjersti Buer Dolve, Sjur Harald Dolve, Marcel Gleffe, Bjørn Kasper Ilaug, Allan Søndergaard Jensen, Bjørn Juvet, Terje Bergan Lien, Christer Lillestrøm, Sissel Martinsen, Ole Johan Simonsen, Julien Lucien Bernard Sué ja Helge Gjerløw Wettre.

Duplicated compounds:

Barggovuorddemrudá binneduvvi 3 165 millijåvnåjn kråvnåjn

ends up as:

BargovuorddemrudáBarggovuorddemrudá binneduvvi 3 165 millijåvnåjn kråvnåjn

Create list of MD pages automatically

Presently, all Markdown pages generated from doccomments are listed in the Makefile.am MD_PAGES variable. This is error prone, since the gawk script will error out when nothing is found.

I suggest we scan all dirs and potential source files for doccomments, and build the list of files to include dynamically. This is less error prone and more robust, and reduces the maintenance burden.

grep -rlE '!![=≈]? ' src/* tools/*

should be enough.

This will also make it easier for us when we reorganise the src/ dir, as we don't need to manually change all the lists of documents to be extracted from.
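A sketch of the scanning step in Python terms (the grep above is the real requirement; this just shows how the MD_PAGES list could be built dynamically, skipping binary files):

```python
import os
import re

# Same pattern as the grep above: doccomment markers !!, !!= or !!≈
DOCCOMMENT = re.compile(r'!![=≈]? ')

def find_doccomment_sources(roots=('src', 'tools')):
    """Return all source files under the given dirs that contain doccomments."""
    hits = []
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding='utf-8') as f:
                        if any(DOCCOMMENT.search(line) for line in f):
                            hits.append(path)
                except (UnicodeDecodeError, OSError):
                    continue  # skip binary or unreadable files
    return sorted(hits)
```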

Byggefeil i samband med "mt-sigma.txt" i lang-sma/-smj

Configure setup: ./configure --enable-all-tools

Error message:

Making all in apertium
Making all in filters
make[4]: Nothing to be done for `all'.
Making all in tagsets
  HRGX2FST modify-tags.hfst
make[4]: *** No rule to make target `.hfst', needed by `mt-sigma.txt'.  Stop.

✨ **Transfer Bugzilla issues** ✨

Things to consider:

  • Move all or just open issues? Move all.
  • split per language - this might be a hard one; most tools assume one target GH repo for all issues
  • clean the Bugzilla dump before the move (there are some noise entries from the past)
  • standardise labels across language repos
  • what to do about attachments
  • make an autoforward link from the old Bugzilla URLs to the new GH links, perhaps this

See also meeting notes here.

See this link for suggestions on how to do the actual transfer.

Other informative discussions:

pmhfst tokeniser inconsistently tokenises hyphen minus

lang-fin/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst

The "hyphen minus" is sometimes separate and other times retained in +Cmp/SplitR situations

Here are five separate instances:
(1a)
Ruotsin keski- ja eteläosien välille

Ruotsin
keski
-
ja
eteläosien
välille

(1b)
yleisesti Etelä- ja Keski-Suomen alueella

yleisesti
Etelä-
ja
Keski-Suomen
alueella

(2)
Instances of +Cmp/SplitL
Keski-Ruotsin ja -Norjan asumattomille metsäseuduille

Keski-Ruotsin
ja
 -
Norjan
asumattomille
metsäseuduille

In (2), one notices the indented and separate hyphen minus, in contrast to what is found in (1a).

In (3) and (4), it is disturbing to observe that a leading whitespace appears before the hyphen minus. The "niin kuin" token in (3) is also peculiar.
(3)
tiukoin ottein - niin kuin

tiukoin
ottein
 -
niin kuin

(4)
(n. 4200 - 2500 eaa.)

(
n.
4200
 -
2500
eaa.
)

In (5), an extra line has been inserted, but it may be associated with the « quote.
(5)
tarkoituksenmukaisuus - «muoto seuraa

tarkoituksenmukaisuus
 -

«
muoto
seuraa

The Improve Build Infra Project!

Planned tasks

Installing doesn't adjust generated paths

make install needs to adjust paths in generated files, or builds wind up with warnings like this:

/usr/share/giella-core/scripts/make-hfstspeller-version-easter-egg.sh: line 21: /build/giella-core-0.20.3+g4190~09cf9c39-1~sid1/scripts/iso639-to-name.sh: No such file or directory

Alternatively, use the DIR=$( cd $(dirname $0) ; pwd) trick so that scripts always call other scripts relative to themselves - but I see 7e679ea is exactly what was done before, and somehow that wasn't good enough?

scripts/vislcg-convert.py does not return full multiword phon string

Given the following input:

"<10:14>"
	"njealljenuppelohkái badjel logi" SETPARENT:1279 "njealljenuppelohkái badjel logi"phon
		"10:14" Num Sem/Time-clock Pl Nom <W:0.0> MAP:2604:hnounNom @HNOUN #3->2 SETPARENT:1279
	"njealljenuppelohkái badjel logi" #3->2 "njealljenuppelohkái badjel logi"phon
		"10:14" Num Sem/Time-clock Sg Nom <W:0.0> MAP:2604:hnounNom @HNOUN #3->2

it returns the following:

10:14	ogi	# "njealljenuppelohkái badjel logi" SETPARENT:1279 "njealljenuppelohkái badjel logi"phon

i.e. only the last three characters of the phon string.
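The three-character result suggests the script slices the line at a fixed offset instead of matching the quoted string. A sketch of a regex that captures the whole phon string (a hypothetical fix, not the current code):

```python
import re

# A phon annotation looks like: "njealljenuppelohkái badjel logi"phon
PHON = re.compile(r'"([^"]*)"phon')

line = ('\t"njealljenuppelohkái badjel logi" SETPARENT:1279 '
        '"njealljenuppelohkái badjel logi"phon')
m = PHON.search(line)
if m:
    print(m.group(1))  # -> njealljenuppelohkái badjel logi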

Build breaks when installing libdivvun - wrong hfst dep

/bin/bash -c wget -q https://apertium.projectjj.com/apt/install-nightly.sh -O install-nightly.sh && bash install-nightly.sh
Cleaning up old install, if any...
Determining Debian/Ubuntu codename...
install-nightly.sh: line 25: lsb_release: command not found
Found evidence of bionic...
Settling for bionic - enabling the Apertium nightly repo...
Installing Apertium GnuPG key to /etc/apt/trusted.gpg.d/apertium.gpg
2022-02-09 06:07:26 URL:https://apertium.projectjj.com/apt/apertium-packaging.public.gpg [3725/3725] -> "-" [1]
Installing package override to /etc/apt/preferences.d/apertium.pref
2022-02-09 06:07:27 URL:https://apertium.projectjj.com/apt/apertium.pref [65/65] -> "-" [1]
Creating /etc/apt/sources.list.d/apertium.list
Running apt-get update...
All done - enjoy the packages! If you just want all core tools, do: sudo apt-get install apertium-all-dev
/usr/bin/apt-get install -qfy foma hfst libhfst-dev cg3-dev divvun-gramcheck
Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 hfst : Depends: libhfst55 (= 3.16.0+g3859~3a99b739-1~bionic1) but it is not going to be installed
 libhfst-dev : Depends: libhfst55 (= 3.16.0+g3859~3a99b739-1~bionic1) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
Error: The process '/usr/bin/apt-get' failed with exit code 100
    at ExecState._setResult (/__w/_actions/divvun/actions/master/node_modules/@actions/exec/lib/toolrunner.js:574:25)
    at ExecState.CheckComplete (/__w/_actions/divvun/actions/master/node_modules/@actions/exec/lib/toolrunner.js:557:18)
    at ChildProcess.<anonymous> (/__w/_actions/divvun/actions/master/node_modules/@actions/exec/lib/toolrunner.js:451:27)
    at ChildProcess.emit (events.js:210:5)
    at maybeClose (internal/child_process.js:1021:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:283:5)
##[debug]Node Action run completed with exit code 1

Cf https://github.com/giellalt/lang-sma/runs/5120834717?check_suite_focus=true

@TinoDidriksen or @unhammer - is this related to the Python issue in libdivvun? In any case, this is high priority - it blocks every single build using GitHub Actions.

Speller error model built from typos list

Just an idea:

Given a large typos list, one could imagine building an error model from it, plus a simple Levenshtein edit distance 1 model on top of it.

Needs to be tested for:

  • speed
  • memory/disk size
  • correction performance

If we find that it works well given a typos list of X entries, we could build it automatically whenever the typos file has ≥ X entries.

Main benefit: since we already collect typos, it would be an easy way to build an error model that would correct most typos without us having to do any work.
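A sketch of the first step, reading a typos list into weighted correction pairs and deciding whether it is big enough to build from. The tab-separated two-column typos format and the threshold are assumptions; compiling the pairs into an actual error model with the HFST tools is left open:

```python
def typos_to_pairs(typos_path, weight=1.0):
    """Read a tab-separated typos file and yield (typo, correction, weight).

    Lines without a tab or starting with '#' are skipped.
    """
    with open(typos_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#') or '\t' not in line:
                continue
            typo, correction = line.split('\t', 1)
            yield typo.strip(), correction.strip(), weight

# If the list is big enough, an automatic build could be triggered:
MIN_ENTRIES = 1000  # the "X" above, to be determined by testing
pairs = list(typos_to_pairs('typos.txt'))
if len(pairs) >= MIN_ENTRIES:
    print(f'{len(pairs)} entries - large enough to build an error model')
```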

Split giella-shared in several repos

Background

giella-shared today contains a mixture of data for many different languages:

giella-shared/
├── all_langs
│   └── src
│       ├── filters ⇒ obligatory, move to giella-core?
│       └── fst ⇒ url, punctuation, symbols
├── eng ⇒ names for languages in English majority countries
├── smi ⇒ names, cg functions and dependency graphs mainly for Sámi languages
└── urj-Cyrl ⇒ names for Uralic languages written in Cyrillic

Core idea

Ideally we would only have giella-core as a required dependency (thus needing to move the filters there), and everything else as separate repositories that can be subscribed on an as-needed/wanted basis.

By generalising resource sharing, it would also be straightforward to share content across language repositories, like including sma and sme proper nouns in smj (with some filtering and restrictions). Technically there would be no difference between getting content from lang-sme and from shared-smi.

Naming

  • using a prefix shared-, parallel to lang-, keyboard- etc. It does not have to be what is suggested here; other suggestions are welcome.
  • followed by a BCP 47 like locale tag, but also allowing language family tags such as smi and urj

Concrete example

The present giella-shared would after a split become (with check marks for the actual split):

  • shared-smi: the present shared Sámi resources
  • shared-mul: the present shared symbols, url's and punctuation lexicons (mul = multiple languages)
  • shared-eng: present shared English resources (like names)
  • shared-urj-Cyrl: shared resources for Uralic languages written in Cyrillic
  • giella-core/fst-filters/: fst filters moved here, since they are a prerequisite for compiling fst's

Another example:

  • using lang-sme as a source for North Sámi names when used in another Sámi language, e.g. place names. Non-Sámi names in lang-sme would be filtered out, and generic last elements could be (automatically) adapted to Lule Sámi spelling and inflection as needed. This is relevant both for text analysis and parsing in general, but especially for TTS, where there is a need to get the best possible transcription and pronunciation of whatever is thrown at the system. Place names from related neighbouring languages will certainly be a pain point for many minority languages in such a context.

By treating all repos the same as a potential source for lexical and other resources, we get a more flexible and powerful infrastructure.

Restrictions

Ideally the shared resources should never be required — without access to them the result should only be a smaller analyser with worse coverage. This will make giella-core the only required external dependency.

As far as possible, the resources in each repo should be independently compilable and testable, kind of like independent code libraries.

Benefits

  • more flexibility
  • only use what is needed for a language, and start small and simple
  • still access to all sorts of premade resources for various purposes
  • easier version tagging of each shared resource
  • with each repo containing a more clearly defined and limited set of data, it is easier to document, specify and reuse

Considerations

versioning

  • should one always assume the latest code
  • or should it be possible to peg the inclusion to a specific version

dependency management

We need a straightforward and simple system for declaring dependencies on a list of other repositories, kind of like Rust's Cargo dependency lists. But as noted above, the system should be robust enough to not break if a resource is unavailable, and only give a warning.

CI

Dependency management needs to be automatic, at least for CI systems. We need at least:

  • simple dependency specification \
    Covered by what is specified in configure.ac, at least for now
  • add routines to make sure that the dependencies are available in the following cases:
    • when starting work (ie by running ./autogen.sh in a directory, using the same cloning scheme as the depending repo — svn, git-ssh or git-https)
    • during CI (Taskcluster, GH Actions, Tino's build machine)

Cleanup

  • Remove giella-shared when everything has stabilized (incl. ensuring the full history of the content of the new repos is retained, cf this task)

Comments welcome!

@flammie and I discussed this today, the notes above are based on that. We would very much like feedback on these ideas from anyone, but especially from @TinoDidriksen @bbqsrc @Eijebong @Trondtr @aarppe

Improve release procedure: generate changelog using script

The idea is to semi-automatically generate or update changelogs. See e.g.

We might consider having separate changelogs for analysers (default), spellers, grammar checkers, etc., based on the paths of the modified files. The different changelogs would correspond to different release tags. The present tag system only supports speller releases.
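A sketch of how the path-based split could work; the category mapping and the exact git invocation are invented for illustration:

```python
import subprocess
from collections import defaultdict

# Hypothetical mapping from path prefixes to changelog categories.
CATEGORIES = {
    'tools/spellcheckers/': 'spellers',
    'tools/grammarcheckers/': 'grammar checkers',
    'src/': 'analysers',
}

def changelog_since(tag):
    """Group commit subjects by category, based on the files they touch."""
    log = subprocess.run(
        ['git', 'log', f'{tag}..HEAD', '--name-only', '--pretty=format:@%s'],
        capture_output=True, text=True, check=True).stdout
    entries = defaultdict(set)
    subject = None
    for line in log.splitlines():
        if line.startswith('@'):
            subject = line[1:]      # commit subject line
        elif line:
            for prefix, cat in CATEGORIES.items():
                if line.startswith(prefix):
                    entries[cat].add(subject)
    return entries
```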

Empty sub-item in in-source documentation under tools / grammarcheckers

The generated markdown in Links.md looks like this:

* `tools/`
    * `grammarcheckers/`
        * [grammarchecker.cg3](tools-grammarcheckers-grammarchecker.cg3.html)
            * `/`
        * [grc-disambiguator.cg3](tools-grammarcheckers-grc-disambiguator.cg3.html)

Why is that? It is not directly harmful, but it looks ugly, takes up space, and breaks the list of relevant documentation.

[dicts] clean up DTDs, add/improve documentation

1: Make sure that the .dtd file exists only in giella-core, and that each dictionary refers to that one. An immediate observation is that the path to the DTD in the .xml files is written as a relative path. After the move to GitHub, users may have the dictionaries cloned to an arbitrary path. This could make running xmllint --valid on a dictionary challenging.

2: Document what the different attributes mean, how they are used, and what they are used for. There are two audiences for such documentation:

  • Dictionary authors, who need to know how to express various linguistic properties in the dictionary, for example how to mark transitive/intransitive verbs, noun cases, etc.
  • Programmers, who need to know how the various pieces are going to be used or presented, for example how to display the fact that a verb can only be used transitively.

Add support for lexc in GH Linguist

Cf https://github.com/github/linguist/blob/master/CONTRIBUTING.md

There is already one for VSCode here: https://github.com/eddieantonio/vscode-lexc

There are more than 200 repos with lexc files, according to this search:

https://github.com/search?p=1&q=extension%3Alexc+NOT+nothack&type=Code (4930 hits).

Benefits:

  • our actual code would count in the code statistics, not just Makefiles etc
  • syntax highlighting in Markdown files and other GH places
  • more visibility for our work 👍

WDYT @flammie @albbas @Trondtr @ftyers

Ampersand in the end of lexicon names breaks @CODE@ formatting

A lang-kal developer has reported that when a lexicon name ends in &, and the line is included in documentation using @CODE@, the documentation is not generated correctly. E.g. in line 952 of lang-kal/src/cg3/disambiguator.cg3 (documentation page):

SET IVTVSUBJ& = IV_SUBJ& | SUBJTRANSVERB& ;  #!!≈ - **@CODE@**

This should be generated as

  • SET IVTVSUBJ& = IV_SUBJ& | SUBJTRANSVERB& ;

but is instead generated as

  • **SET IVTVSUBJ@CODE@ = IV_SUBJ@CODE@ | SUBJTRANSVERB@CODE@ ;**

Fix speller filenames for CI requirements

All variants (mobile vs desktop, default vs alt-orth, alt-WS, etc.) need to have a -desktop or -mobile suffix in front of .zhfst for the CI and CD to automatically pick them up and bundle/deliver them.

Error building gramchecker in sme

./configure --enable-grammarchecker gives this error on my Linux box:

CP final_strings.all.default.hfst
  HRGX2FST errmodel.default.hfst
  HPROJECT acceptor.default.hfst
  ZIP      se.zhfst
  GEN      se_LO-voikko-5.0.oxt
cp: klarte ikke å opprette vanlig fil 'build/5.0/voikko/3/': Ikke en mappe
make[3]: *** [Makefile:2562: se_LO-voikko-5.0.oxt] Error 1
rm easteregg.default.desktop.errorth.hfst final_strings.txt.default.hfst strings.txt.default.hfst easteregg.default.desktop.suggtxt easteregg.default.desktop.temp.hfst final_strings.all.default.hfst strings.all.default.hfst easteregg.default.desktop.analyser.hfst
make[3]: Leaving directory '/home/boerre/gut/giellalt/lang-sme/tools/spellcheckers'
make[2]: *** [Makefile:1647: all-recursive] Error 1
make[2]: Leaving directory '/home/boerre/gut/giellalt/lang-sme/tools/spellcheckers'
make[1]: *** [Makefile:431: all-recursive] Error 1
make[1]: Leaving directory '/home/boerre/gut/giellalt/lang-sme/tools'
make: *** [Makefile:535: all-recursive] Error 1

The Norwegian cp message translates to "could not create regular file 'build/5.0/voikko/3/': Not a directory". To bypass the error, I have to add the option --disable-hfst-desktop-spellers to the configure line.

Headers in doccomments are not correctly converted when used with !!=/≈ + @CODE@

In the file scripts/doccomments2ghpages.awk, the extraction leaves a space in front of the header symbol #, which makes the Markdown parser treat the line as regular text instead of as a header:

Multichar_Symbols  !!≈ # Definitions for @CODE@

results in this:

 # Definitions for Multichar_Symbols

(note the space at the beginning of the line).
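The fix is presumably to strip leading whitespace from the extracted text before emitting it; as a sketch:

```python
import re

extracted = ' # Definitions for Multichar_Symbols'
# Headers must start in column 0 for the Markdown parser to see them:
fixed = re.sub(r'^\s+(?=#)', '', extracted)
print(fixed)  # -> '# Definitions for Multichar_Symbols'
```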

`lexc-giella-style.py` fails in several ways

./giella-core/devtools/lexc-giella-style.py -h                                   
Traceback (most recent call last):
  File "/Users/smo036/langtech/gut/giellalt/./giella-core/devtools/lexc-giella-style.py", line 1015, in <module>
    ARGS = parse_options()
           ^^^^^^^^^^^^^^^
  File "/Users/smo036/langtech/gut/giellalt/./giella-core/devtools/lexc-giella-style.py", line 1003, in parse_options
    arguments = parser.parse_args()
                ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1869, in parse_args
    args, argv = self.parse_known_args(args, namespace)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1902, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 2114, in _parse_known_args
    start_index = consume_optional(start_index)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 2054, in consume_optional
    take_action(action, args, option_string)
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1978, in take_action
    action(self, namespace, argument_values, option_string)
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1119, in __call__
    parser.print_help()
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 2601, in print_help
    self._print_message(self.format_help(), file)
  File "/opt/homebrew/Cellar/[email protected]/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 2607, in _print_message
    file.write(message)
  File "<frozen codecs>", line 378, in write
TypeError: write() argument must be str, not bytes

And:

./giella-core/devtools/lexc-giella-style.py --align lang-sma/src/fst/root.lexc   
Traceback (most recent call last):
  File "/Users/smo036/langtech/gut/giellalt/./giella-core/devtools/lexc-giella-style.py", line 1031, in <module>
    NEWLINES.extend(align_lexicon(READLINES))
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/smo036/langtech/gut/giellalt/./giella-core/devtools/lexc-giella-style.py", line 962, in align_lexicon
    lines.parse_lines(lexc_lines)
  File "/Users/smo036/langtech/gut/giellalt/./giella-core/devtools/lexc-giella-style.py", line 740, in parse_lines
    line_dict = parse_line(content)
                ^^^^^^^^^^^^^^^^^^^
  File "/Users/smo036/langtech/gut/giellalt/./giella-core/devtools/lexc-giella-style.py", line 917, in parse_line
    line_dict = defaultdict(unicode)
                            ^^^^^^^
NameError: name 'unicode' is not defined

Build crashes in tools/grammarcheckers/filters

./configure --enable-all-tools leads to a complete build stop with the message:

Making all in grammarcheckers
Making all in filters
make[3]: *** No rule to make target `.hfst', needed by `desktopspeller-sigma.txt'.  Stop.
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

The same problem occurs in lang-sme and lang-sma.

Build error related to `scripts/iso-639-3.readme.txt'

Running make in giella-core gives the following message:

Running make here will compile all necessary scripts and auxiliary files.
Making all in .
make[1]: *** No rule to make target `scripts/iso-639-3.readme.txt', needed by `all-am'.  Stop.
make: *** [all-recursive] Error 1

Cannot build "analyser-tts-gt-output.hfst" in lang-sme/-smn/-nob

Building in the three languages gives this error:
make[2]: *** No rule to make target `analyser-tts-gt-output.hfst', needed by `analyser-tts-gt-output.hfstol'. Stop.

lang-smj and lang-sma work.

In lang-smj, analyser-tts-gt-output.hfst is mentioned in src/Makefile.am, while in lang-sma the code for building analyser-tts-gt-output.hfst is pulled in from giella-core, since it is not mentioned in src/Makefile.am.

The missing grc easteregg

[Recreated here, as it fits better. Moving issues across orgs is not possible. The original, filed by @Trondtr, can be found here]

There is no easteregg of the "nuvviDspeller" type for our grammar checkers. One might think they are not needed, since the user normally accesses the server, and whatever version problems are present should be shared by server and user machines alike. But upon debugging the smn grc for the sentence

  • Mij eelijm tagarijn puáris kaavpugijn.

we run into the following problem: the sentence is correct, and is flagged as such (= no error flag) on my command line, in my MS Word, and in my Google Doc. For my colleague, the Google Doc version behaves as expected, whereas the MS Word version gives a false positive.

An obvious debugging tool would thus be an easteregg.

Parsing errors in lexc2markdown using new variables

The following lexc code:

LEXICON NounRoot
!! ## The lexicon @LEXNAME@
!! This lexicon is the start of all noun lemmas. It splits the nouns in three
!! classes as follows:

!! ```mermaid
!! stateDiagram-v2
!! direction LR
  FirstComponent ; !!≈ @LEXNAME@ --> @LEMMA@
  HyphNouns      ; !!≈ @LEXNAME@ --> @LEMMA@
  Noun           ; !!≈ @LEXNAME@ --> @CONTLEX@
!! ```

results in the following markdown fragment:

## The lexicon NounRoot
This lexicon is the start of all noun lemmas. It splits the nouns in three
classes as follows:

```mermaid
stateDiagram-v2
direction LR
FirstComponent ;   FirstComponent ;  NounRoot --> FirstComponent
HyphNouns ;   HyphNouns      ;  NounRoot --> HyphNouns
Noun ;   Noun           ;  NounRoot --> ;
```

Expected:

## The lexicon NounRoot
This lexicon is the start of all noun lemmas. It splits the nouns in three
classes as follows:

```mermaid
stateDiagram-v2
direction LR
NounRoot --> 
NounRoot --> 
NounRoot --> Noun
```

doccomments2ghpages.awk - improve CG doccomment extraction

The main culprit is that CG uses # as the comment character, whereas the other source files we extract from use !. This leads to # becoming part of the extracted text after !!= and !!≈. That is not desirable.

Example:

LIST Err/Orth =                  #!! - `Err/Orth`:
                Err/Orth         #!!≈     - `@CODE@`
                Err/Orth-a/á     #!!≈     - `@CODE@`
                Err/Orth-nom/gen #!!≈     - `@CODE@`
                Err/Orth-nom/acc #!!≈     - `@CODE@`
                Err/DerSub       #!!≈     - `@CODE@`
                Err/CmpSub       #!!≈     - `@CODE@`
                Err/UnspaceCmp   #!!≈     - `@CODE@`
                Err/HyphSub      #!!≈     - `@CODE@`
                Err/SpaceCmp     #!!≈     - `@CODE@`
                Err/Spellrelax   #!!≈     - `@CODE@`
                err_orth_mt      #!!≈     - `@CODE@`
                ;

comes out as:

- `Err/Orth`:
    - `Err/Orth #`
    - `Err/Orth-a/á #`
    - `Err/Orth-nom/gen #`
    - `Err/Orth-nom/acc #`
    - `Err/DerSub #`
    - `Err/CmpSub #`
    - `Err/UnspaceCmp #`
    - `Err/HyphSub #`
    - `Err/SpaceCmp #`
    - `Err/Spellrelax #`
    - `err_orth_mt #`

which looks like this in the final output:

  • Err/Orth:
    • Err/Orth #
    • Err/Orth-a/á #
    • Err/Orth-nom/gen #
    • Err/Orth-nom/acc #
    • Err/DerSub #
    • Err/CmpSub #
    • Err/UnspaceCmp #
    • Err/HyphSub #
    • Err/SpaceCmp #
    • Err/Spellrelax #
    • err_orth_mt #

We don't want the # there 😄
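A sketch of the cleanup needed when extracting from CG files: the code part that replaces @CODE@ must not include the # that starts the CG comment (illustration only, not the actual AWK code):

```python
import re

def code_part(line):
    """Return the code before the doccomment, for @CODE@ substitution.

    In CG files the comment marker is '#!!', so the '#' belongs to the
    comment and must not end up in the extracted code.
    """
    code = re.split(r'#?!!', line, maxsplit=1)[0]
    return code.strip()

print(code_part('                Err/Orth-a/á     #!!≈     - `@CODE@`'))
# -> Err/Orth-a/á
```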

`make clean` does not remove all generated files

Steps to reproduce:

cd lang-sma
make -j
make clean
find . -name '*.hfst'

Result: a long list of .hfst files

Expected result: no files found

I assume this is caused by outdated clean targets after the dir reorgs.

What makes this issue a blocker is that for people working with MT/Apertium, lang-sma requires weighted generators. For speed reasons, weighted FSTs are not the default; Foma-format FSTs are. When reconfiguring for MT using e.g. this configuration:

./configure --enable-apertium --with-backend-format=openfst-tropical
make clean

the unremoved FST files are typically in the Foma backend format (the default), and when building anew after the reconfiguration, the mismatching FST formats cause compilation errors:

libc++abi: libc++abi: libc++abi: terminating due to uncaught exception of type TransducerTypeMismatchExceptionterminating due to uncaught exception of type TransducerTypeMismatchExceptionterminating due to uncaught exception of type TransducerTypeMismatchException

Cf giellalt/lang-sma#23
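Until the clean targets are fixed, an interim workaround is to remove the stale FST files by hand before reconfiguring; a sketch, to be run from the language repo root (adjust the glob pattern as needed):

```python
from pathlib import Path

# Interim workaround: remove everything `make clean` should have removed,
# so a rebuild with a different backend format starts from scratch.
removed = 0
for hfst in Path('.').rglob('*.hfst'):
    hfst.unlink()
    removed += 1
print(f'removed {removed} stale .hfst files')
```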

Automatise handling of diacritics

This covers two distinct cases:

  • automatically make all non-ASCII characters with diacritics optionally available in Unicode NFD (the default is NFC) in descriptive analysers, so that we can analyse PDF files without worries; PDF stores all text in NFD
  • automatically split combining diacritic letters (i.e. those that do not exist as precomposed NFC characters in Unicode) into sequences of base letter + diacritics as individual states in the FST (as opposed to a single, multichar-symbol state), for use in tokenisers only, keeping single-state multichar symbols everywhere else. This makes character-level input tokenisation reliable in hfst-tokenise without resorting to problematic Unicode hacks. Today this tends to be broken, because people forget to write such filters themselves, causing hard-to-debug errors.

In the first case, the pseudocode could go something like this:

extract all letter symbols (ie non-multichars)
remove ASCII letters, digits, punctuation
uconv nfc nfd
paste nfc nfd
remove lines with identical columns # e.g. ø and ŋ can't be decomposed
make the result into a XFST regex file for optional change from nfc to nfd
apply the compiled regex to descriptive analysers on the __surface__ side

In the second case, the pseudocode could be something like the following:

extract all multichar symbols from the fst
get rid of everything that looks like tags, flag diacritics and internal symbols
make a regex to mandatorily turn a multichar base letter + (one or more) combining \
    diacritics into a sequence of single symbols
apply that regex to tokeniser FST's on the __surface__ side

With routines like the above integrated into the build system, no-one should ever have to worry about these issues anymore 🙂
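The first case maps directly onto Unicode normalisation as exposed by e.g. Python's unicodedata; a sketch of generating the NFC→NFD pairs (compiling them into an XFST regex is left to the existing tooling):

```python
import unicodedata

def nfc_nfd_pairs(symbols):
    """Yield (nfc, nfd) pairs for symbols that actually decompose.

    ASCII symbols and non-decomposable letters (e.g. ø, ŋ) are skipped,
    as in the pseudocode above.
    """
    for sym in symbols:
        if sym.isascii():
            continue
        nfd = unicodedata.normalize('NFD', sym)
        if nfd != sym:
            yield sym, nfd

for nfc, nfd in nfc_nfd_pairs(['á', 'č', 'ø', 'ŋ', 'a']):
    print(nfc, '->', ' + '.join(unicodedata.name(c) for c in nfd))
```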

speller Levenshtein manipulations are ignored by hfst-ospell

To repeat: compile the spellers in e.g. lang-mns. As shown below, the intended correction иӈ for ин has a weight of 2, as defined in editdist.default.txt. Unfortunately, mns.zhfst does not read this definition, and returns 10 (one Levenshtein operation).

uit-mac-443 lang-mns (main)$ e ин|hfst-ospell -S -n 10 tools/spellcheckers/mns.zhfst 
"ин" is NOT in the lexicon:
Corrections for "ин":
и    10.000000
и-    10.000000
ис    10.000000
иӈ    10.000000
щин    10.000000
шин    10.000000
итн    10.000000
ит    10.000000
исн    10.000000
и.    10.000000

uit-mac-443 lang-mns (main)$ grep ӈ tools/spellcheckers/editdist.default.txt 
ӈ
н	ӈ 	2
uit-mac-443 lang-mns (main)$ 

Also the other files (strings.default.txt, final.default.txt, etc.) are invisible to mns.zhfst.

The same goes for other languages, but not all: sme and sms work fine, as does the Cyrillic-based mhr.

Support Lemma/Stem/Contlex variables in lexc2markdown

Presently there is the DATA variable available when writing doccomments. When documenting lexc, it would be very useful to also have access to the individual parts of the lexc entry (see the sketch after the list), like:

  • Lemma
  • Stem
  • Continuation lexicon
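For reference, a sketch of how those parts could be picked out of an entry line, assuming a simplified entry grammar lemma:stem CONTLEX "gloss" ; with the colon and gloss optional (illustration only; real lexc has more syntax than this):

```python
import re

# Simplified lexc entry: upper:lower ContLex "gloss" ;
ENTRY = re.compile(r'''
    ^\s*(?P<lemma>[^\s:;!]+)        # upper side (lemma)
    (?::(?P<stem>[^\s;!]*))?        # optional lower side (stem)
    \s+(?P<contlex>[^\s;!"]+)       # continuation lexicon
''', re.VERBOSE)

def entry_parts(line):
    m = ENTRY.match(line)
    if not m:
        return None
    return {
        'Lemma': m.group('lemma'),
        'Stem': m.group('stem') or m.group('lemma'),
        'Contlex': m.group('contlex'),
    }

print(entry_parts('guolli:guol N_FISH "fish" ;'))
# -> {'Lemma': 'guolli', 'Stem': 'guol', 'Contlex': 'N_FISH'}
```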
