HunLinter

Foreword

Please be aware that this application requires Java 16 or later!

You can download and install it for free from this link.

Main features

affix file and dictionary linter
rules reducer
LibreOffice and Mozilla packager
Part-of-Speech and dictionary FSA extractor for LanguageTool
automatic selection of a font able to render a custom language
management of thesaurus, hyphenation, auto-correct, sentence exception, and word exception files
minimal pairs extraction
statistics
… and many more!

Motivation
What the application can do
How to enhance its capabilities
Recognized charsets
Recognized flags
How to
Screenshots
Changelog

Motivation

I created this project to help me build my Hunspell language files, namely the .aff and .dic files, along with the hyphenation and thesaurus files. It was written particularly for the Venetan language: you can find some tools here, and the language pack here (for the LibreOffice tools) and here (for the Mozilla tools).

What the application can do

This application performs many correctness checks on the structure and content of the files. It can tell you whether a rule is missing or redundant. You can test rules and compound rules. You can also test hyphenation and, if needed, add rules. It can also manage and build the thesaurus.

The application can also sort the dictionary, count words (unique and total), produce statistics, extract duplicates, wordlists, and minimal pairs, and create a package in order to build an .oxt or .xpi for deployment.

How to enhance its capabilities

You can customize the tests the application performs by simply adding another package alongside vec, named after the ISO 639-3 or ISO 639-2 code of the language, and extending the DictionaryCorrectnessChecker, Orthography, and DictionaryBaseData classes (the last one is used to drive the Bloom filter).

Along with these classes you can insert your rules.properties, a file that describes various constraints about the rules in the .dic file.

After that, you have to tell the application that those files exist by editing the BaseBuilder class and adding a LanguageData to the DATAS hashmap.

The application automatically recognizes which checker to use based on the code in the LANG option of the .aff file.
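The lookup described above can be pictured with a small, self-contained sketch. It does not use HunLinter's real classes: CheckerRegistry, its map, and the checker names are hypothetical, and only the LANG option, the vec code, and the fallback to a general linter come from this document.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these names come from HunLinter itself.
final class CheckerRegistry {
    // Maps a language code to the name of a language-specific checker (illustrative values).
    private static final Map<String, String> CHECKERS = Map.of("vec", "VenetanChecker");

    // Extracts the value of the LANG option from the text of an .aff file, if present.
    static Optional<String> readLang(String affContent) {
        return affContent.lines()
            .map(line -> line.strip().split("\\s+"))
            .filter(parts -> parts.length >= 2 && parts[0].equals("LANG"))
            .map(parts -> parts[1])
            .findFirst();
    }

    // Picks the checker for the language code, falling back to a general one.
    static String checkerFor(String affContent) {
        return readLang(affContent)
            .map(lang -> lang.split("[-_]")[0]) // "vec-IT" -> "vec"
            .map(code -> CHECKERS.getOrDefault(code, "GeneralChecker"))
            .orElse("GeneralChecker");
    }
}
```

In this sketch, registering a new language would mean adding one more entry to the map, which mirrors the idea of adding a LanguageData to DATAS without claiming to reproduce it.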

Recognized charsets

UTF-8
ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15
KOI8-R, KOI8-U
MICROSOFT-CP1251
ISCII-DEVANAGARI
TIS620-2533

Recognized flags

General

SET, FLAG, COMPLEXPREFIXES, LANG, AF, AM

Suggestions

REP

Compounding

COMPOUNDRULE, COMPOUNDMIN, COMPOUNDFLAG, ONLYINCOMPOUND, COMPOUNDPERMITFLAG, COMPOUNDFORBIDFLAG, COMPOUNDMORESUFFIXES, COMPOUNDWORDMAX, CHECKCOMPOUNDDUP, CHECKCOMPOUNDREP, CHECKCOMPOUNDCASE, CHECKCOMPOUNDTRIPLE, SIMPLIFIEDTRIPLE, FORCEUCASE

Affix creation

PFX, SFX

Others

CIRCUMFIX, FORBIDDENWORD, FULLSTRIP, KEEPCASE, ICONV, OCONV, NEEDAFFIX

How to

Open a project

Select File|Open Project. A dialog will appear, in which a blue folder (the colour marks a valid project) should be selected.

The META-INF folder, which contains a manifest.xml file, is loaded, and the location of each file is retrieved from it.

Upon loading, a font that can render the content of the project is chosen. If you want another font, just select File|Select font and choose another one.

The font is linked to the project, so the same font will be used when the project is opened again later.
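As a rough illustration of reading file locations from a manifest, the sketch below lists the paths declared in a manifest.xml text. It assumes the usual LibreOffice extension layout, with manifest:file-entry elements carrying a manifest:full-path attribute; ManifestReader is a hypothetical name, and this is not HunLinter's own loader.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists the files declared in a META-INF/manifest.xml.
final class ManifestReader {
    // Returns the value of every manifest:full-path attribute found in the manifest text.
    static List<String> listedPaths(String manifestXml) {
        try {
            // The default factory is not namespace-aware, so qualified names are matched as plain text.
            var builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            var document = builder.parse(
                new ByteArrayInputStream(manifestXml.getBytes(StandardCharsets.UTF_8)));
            NodeList entries = document.getElementsByTagName("manifest:file-entry");
            List<String> paths = new ArrayList<>();
            for (int i = 0; i < entries.getLength(); i++) {
                paths.add(((Element) entries.item(i)).getAttribute("manifest:full-path"));
            }
            return paths;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot read the manifest", e);
        }
    }
}
```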

Create an extension

In order to create an extension (e.g. for LibreOffice, or for Mozilla products) you have to use the option File|Create package. This will package the directory in which the .aff/.dic files reside into a zip file. All there is to do afterwards is to rename the extension to .oxt (LibreOffice) or .xpi (Mozilla).

Remember that the package will have the same name as the directory, but the directory itself is not included, just its content.
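The packaging rule (archive the content, not the directory, and name the archive after the directory) can be sketched with the standard java.util.zip classes. ExtensionPackager is a hypothetical name, and for brevity the sketch writes the final extension directly instead of creating a .zip and renaming it.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not HunLinter's packager.
final class ExtensionPackager {
    // Zips the content of projectDir (not the directory itself) into <dirName>.<extension>.
    static Path pack(Path projectDir, Path outputDir, String extension) {
        Path target = outputDir.resolve(projectDir.getFileName() + "." + extension);
        List<Path> files;
        try (Stream<Path> walk = Files.walk(projectDir)) {
            // Collect the files first, so the archive being written is never part of the scan.
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try (OutputStream out = Files.newOutputStream(target);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // Entry names are relative to the project directory and always use '/'.
                String name = projectDir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }
}
```

Calling pack(projectDir, outputDir, "oxt") or pack(projectDir, outputDir, "xpi") would then produce the two flavours mentioned above.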

Lint the dictionary

To lint a dictionary just select Dictionary tools|Correctness check or Dictionary tools|Correctness check using dictionary FSA.

Each line is then linted following the rules of a particular language (if the corresponding files are present in the project, e.g. for Venetan). If no such file is present, a general linter is applied.

Lint the thesaurus

To lint the thesaurus just select Thesaurus tools|Correctness check or Thesaurus tools|Correctness check using dictionary FSA.

Each thesaurus entry is linted by checking that each synonym is also present as a definition (with the same part-of-speech).

In case of error, it is suggested to copy all the synonyms of the indicated words (and everything returned by filtering on those two words), remove each of them, and insert them again.

Lint the hyphenation

To lint the hyphenation just select Hyphenation tools|Correctness check.

Each hyphenation code is then linted following certain rules (among them: a breakpoint should not be on the boundary, a code should have at least one breakpoint, etc.).

Sort dictionary

By selecting Dictionary tools|Sort dictionary you can sort specific parts of a dictionary file. The sortable sections are highlighted: each one spans from a comment or empty line to the next one, and you select the section you want to sort.

The sorting order is language-dependent.
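To show what a language-dependent order means in practice, here is a minimal sketch based on the standard java.text.Collator. Whether HunLinter uses Collator or its own per-language comparator is not stated here, so treat this as an assumption; LocaleSort is a hypothetical name.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: sorts words following the collation rules of a language.
final class LocaleSort {
    // Sorts a copy of the words with the collator of the given language tag (e.g. "it").
    static List<String> sort(List<String> words, String languageTag) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(languageTag));
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator);
        return sorted;
    }
}
```

With a plain String comparison an accented initial letter sorts after z, while a collator keeps it next to its unaccented counterpart.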

Reduce rules

Use Dictionary tools|Rules reducer to find the minimum set of rules needed by the current dictionary file.

E.g. if a dictionary file has the lines aa/b and bb/b, and the affix file contains the rules SFX b 0 A a, SFX b 0 B b, and SFX b 0 C c (where the last one is not used), then this tool returns the minimum set made of SFX b 0 A a and SFX b 0 B b.
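The example above can be reproduced with a deliberately simplified sketch that only drops the rules no dictionary word can trigger. It is not the reducer's actual algorithm: it assumes single-character flags, suffix rules only, dictionary lines without morphological fields, and a condition that can be read as a regular expression fragment anchored at the end of the word. UnusedRuleFilter and SuffixRule are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified sketch: keeps only the suffix rules used by some word.
final class UnusedRuleFilter {
    // One suffix rule line: "SFX <flag> <strip> <add> <condition>".
    record SuffixRule(String flag, String strip, String add, String condition) {
        static SuffixRule parse(String line) {
            String[] p = line.trim().split("\\s+");
            if (p.length < 5 || !p[0].equals("SFX")) {
                throw new IllegalArgumentException("Not a suffix rule: " + line);
            }
            return new SuffixRule(p[1], p[2], p[3], p[4]);
        }

        // True when the rule applies to a dictionary line such as "aa/b".
        boolean appliesTo(String dictionaryLine) {
            String[] parts = dictionaryLine.split("/", 2);
            // Assumption: single-character flags, no morphological fields after them.
            boolean hasFlag = parts.length == 2 && parts[1].contains(flag);
            // Assumption: the condition is a regex fragment matched at the end of the word.
            return hasFlag && parts[0].matches(".*" + condition);
        }
    }

    // Returns the rule lines that apply to at least one dictionary line, in their original order.
    static List<String> usedRules(List<String> ruleLines, List<String> dictionaryLines) {
        List<String> used = new ArrayList<>();
        for (String line : ruleLines) {
            SuffixRule rule = SuffixRule.parse(line);
            if (dictionaryLines.stream().anyMatch(rule::appliesTo)) {
                used.add(line);
            }
        }
        return used;
    }
}
```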

Word count

Use Dictionary tools|Word count to count all the words generated by the affix files, as well as the unique words (not considering part-of-speech).

Note: There is an uncertainty about the uniqueness count, but it should be small. Deal with it :p.
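A small uncertainty of this kind is typical of probabilistic sets such as the Bloom filter mentioned earlier. Whether the word count really works this way is an assumption; the sketch below only shows why such a counter can slightly undercount unique words and never overcount them. ApproximateWordCounter and its sizes are hypothetical.

```java
import java.util.BitSet;

// Hypothetical sketch of a Bloom-filter-based unique word counter.
final class ApproximateWordCounter {
    private static final int BITS = 1 << 20; // size of the bit array (illustrative)
    private static final int HASHES = 3;     // number of bit positions per word (illustrative)

    private final BitSet bits = new BitSet(BITS);
    private long total;
    private long unique;

    // Counts the word, and counts it as unique only if the filter has not seen it yet.
    void add(String word) {
        total++;
        int h1 = word.hashCode();
        int h2 = (h1 >>> 16) | 1; // second hash derived from the first (double hashing)
        boolean seen = true;
        for (int i = 0; i < HASHES; i++) {
            int index = Math.floorMod(h1 + i * h2, BITS);
            if (!bits.get(index)) {
                seen = false; // at least one bit was unset: the word is certainly new
                bits.set(index);
            }
        }
        // A new word whose bits were all set by other words is missed: this is the uncertainty.
        if (!seen) {
            unique++;
        }
    }

    long total() {
        return total;
    }

    long unique() {
        return unique;
    }
}
```

The trade-off is memory: the filter needs a fixed number of bits regardless of how many words are generated, at the price of a few missed unique words.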

Dictionary statistics

Use Dictionary tools|Statistics to produce some statistics (graphs and values are exportable with a right click!) about word and compound word count, mode of word length, mode of syllable count, most common syllables, and the longest words (by letters and by syllables).

If you want to include hyphenation statistics be sure to use Hyphenation tools|Statistics instead, but expect a 3.6× or so increase in running time.

Dictionary duplicates

To obtain a list of word duplicates (same word, same part-of-speech), the tool you want to use is under Dictionary tools|Extract duplicates.

Dictionary wordlist

To obtain a list of all the words generated by a dictionary and affix file, the menus Dictionary tools|Extract wordlist and Dictionary tools|Extract wordlist (plain words) should be used.

Create a Part-of-Speech FSA

In order to create an FSA for Part-of-Speech, suitable for use in LanguageTool, you have to use the option File|Extract PoS FSA and select the output folder. This will create an FSA using a provided <language>.info file (or an automatically generated one).

Remember that the FSA file will have the same name as specified in the LANG option in the .aff file, and extension .dict.

Minimal pairs

To obtain a list of minimal pairs use the menu Dictionary tools|Extract minimal pairs.
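For readers unfamiliar with the term, the sketch below treats a minimal pair as two words of the same length that differ in exactly one position. This character-level definition is an assumption made for illustration (the tool may compare sounds or language-specific graphemes instead), and MinimalPairs is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: character-level minimal pairs.
final class MinimalPairs {
    // True when the words have the same length and differ in exactly one position.
    static boolean isMinimalPair(String first, String second) {
        if (first.length() != second.length()) {
            return false;
        }
        int differences = 0;
        for (int i = 0; i < first.length(); i++) {
            if (first.charAt(i) != second.charAt(i)) {
                differences++;
            }
        }
        return differences == 1;
    }

    // Quadratic scan over all pairs: fine for a sketch, too slow for a full wordlist.
    static List<String> extract(List<String> words) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                if (isMinimalPair(words.get(i), words.get(j))) {
                    pairs.add(words.get(i) + " - " + words.get(j));
                }
            }
        }
        return pairs;
    }
}
```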

Rule flags aid

An external text file can be put into the directory aids (at the same level as the executable jar); its content will be displayed in the drop-down element in the Dictionary tab (blank lines are ignored).

This file could be used as a reminder of all the flags that can be added to a word and their meaning.

The filename has to be the language (as specified in the option LANG inside the .aff file), and the extension aid (e.g. for Venetan: vec-IT.aid).
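The naming convention can be summarized in a few lines of code. This is only a sketch of the convention described above: FlagAid and its methods are hypothetical names, and returning an empty list for a missing file is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: locates and reads the aid file of a language.
final class FlagAid {
    // Builds the expected location: <jarDirectory>/aids/<LANG>.aid
    static Path aidFile(Path jarDirectory, String lang) {
        return jarDirectory.resolve("aids").resolve(lang + ".aid");
    }

    // Reads the aid entries, skipping blank lines; a missing file yields an empty list.
    static List<String> readEntries(Path aidFile) {
        if (!Files.isRegularFile(aidFile)) {
            return List.of();
        }
        try {
            return Files.readAllLines(aidFile).stream()
                .filter(line -> !line.isBlank())
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```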

Ordering table columns

It is possible to sort certain columns of the tables: just click on the header of the column. The sort order will cycle between ascending, descending, and unsorted.

Copying text

It is possible to copy the content of tables and the words in the statistics section. Also, the graphs in the statistics section can be exported as images.

Use Ctrl+C after selecting the row, or use the right click of the mouse to access the popup menu.

Rule/dictionary insertion

This is NOT an editor tool¹! If you want to add affix rules, add words in the dictionary, or change them, you have plenty of tools around you. For Windows, I suggest Notepad++ (for example, you will see immediately while typing if a word is already present in the dictionary).

¹: Even if for the hyphenation file a new rule can actually be added…

Screenshots

Inflections

Entries can be a single word followed by a slash and all the flags that have to be applied, followed optionally by one or more morphological fields.
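The entry layout can be made concrete with a minimal parser. It is a sketch under simplifying assumptions: fields are separated by whitespace, the word contains neither spaces nor escaped slashes, and the flags are kept as one raw string because their encoding depends on the FLAG option. DictionaryEntryParser and Entry are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: splits "word/flags field1 field2 ..." into its parts.
final class DictionaryEntryParser {
    record Entry(String word, String flags, List<String> morphologicalFields) {}

    static Entry parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        // The first token is "word" or "word/flags".
        String[] wordAndFlags = tokens[0].split("/", 2);
        String flags = wordAndFlags.length == 2 ? wordAndFlags[1] : "";
        // Everything after the first token is treated as a morphological field.
        List<String> fields = Arrays.asList(tokens).subList(1, tokens.length);
        return new Entry(wordAndFlags[0], flags, List.copyOf(fields));
    }
}
```

For instance, parse("aa/bc po:noun") yields the word aa, the raw flags bc, and one morphological field, po:noun.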

Dictionary linter

Thesaurus

Entries can be inserted in two ways:

(pos)|word1|word2|word3
pos:word1,word2,word3

Once something is written, an automatic filtering is executed to find all the words (and part-of-speech, if given) that are already contained in the thesaurus.

It is possible to right-click on a row to bring up the popup menu and select whether to copy it, remove it (and all the other rows in which the selected definition appears), or merge with the current synonyms.
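The two input forms listed above carry the same information, as this small parsing sketch shows. It is illustrative only: ThesaurusInput and Parsed are hypothetical names, and details such as trimming and the handling of empty items are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: reads "(pos)|word1|word2" and "pos:word1,word2" into the same shape.
final class ThesaurusInput {
    record Parsed(String partOfSpeech, List<String> words) {}

    static Parsed parse(String input) {
        String text = input.trim();
        String partOfSpeech;
        String[] words;
        if (text.startsWith("(")) {
            // First form: "(pos)|word1|word2|word3"
            String[] parts = text.split("\\|");
            partOfSpeech = parts[0].replaceAll("[()]", "");
            words = Arrays.copyOfRange(parts, 1, parts.length);
        } else {
            // Second form: "pos:word1,word2,word3"
            int colon = text.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Unrecognized thesaurus input: " + input);
            }
            partOfSpeech = text.substring(0, colon);
            words = text.substring(colon + 1).split(",");
        }
        List<String> cleaned = Arrays.stream(words)
            .map(String::trim)
            .filter(word -> !word.isEmpty())
            .collect(Collectors.toList());
        return new Parsed(partOfSpeech.trim(), cleaned);
    }
}
```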

Hyphenation

Dictionary sorter

Rule reducer

Font selection

Statistics

Autocorrection

Sentence exceptions

Word exceptions

Part-of-Speech dictionary

Changelog

version 2.2.0 - 20230521

rules reducer fixes and enhancements
considered different formats for part-of-speech in thesaurus file
fixed early creation of thesaurus parser (language was not available yet)
fixed circumfix inflections
enhanced duplication worker capabilities
delayed creation of file chooser (faster startup)
eliminated double reloading of dictionary in sort dialog when something changes
understood how ICONV and OCONV work
supported ISO-8859-10, ISO-8859-14, and ISCII-DEVANAGARI charsets
added a check on declared charset and real charset of a file
adjusted scroll to the bottom of the log text area while changing font
decreased the loading time of sorting dialog
increased speed by 57% (for dictionary linter: from 2m 13s to 57s)
decreased start-up time
corrected some typos

version 2.1.0 - 20210807

fix bug on initial font size
automatically unzip .dat and .bau files (in autocorr and autotext folders)
startup time reduced

version 2.0.2 - 20210806

added linter for auto-correct
corrected the size of the font
corrected the executable

version 2.0.1 - 20210805

added warn for unused rules after dictionary linter
added the possibility to hide selected columns from dictionary table
(finally) added a Windows installer
some minor improvements on speed and linting capabilities

version 2.0.0 - 20200524

made update process stoppable
added a linter for thesaurus
added a menu to generate Dictionary FSA (used in LanguageTool, for example)
added a section to see the PoS FSA execution
fixed a bug on hyphenation: when the same rule is being added (with different breakpoints), the old one is being lost
substituted charting library
added undo/redo capabilities on input fields
completely revised thread management
fixed a nasty memory leak
now the sort dialog remains open after a sort
categorized the errors into (true) errors and warnings, now the warnings are no longer blocking
reduced compiled size by 52% (from 6 201 344 B to 3 002 671 B)
reduced memory footprint by 13% (for dictionary linter: from 728 MB to 630 MB)
increased speed by 53% (for dictionary linter: from 4m 44s to 2m 13s)
various minor bugfixes and code revisions

version 1.10.0 - 20200131

(finally) given a decent name to the project: HunLinter
fixed a bug while selecting the font once a project is loaded
fixed a bug while storing thesaurus information (only lowercase words are allowed)
added update capability (the new jar will be copied in the directory of the old jar and started)
added buttons to open relevant files
added management of SentenceExceptList.xml and WordExceptList.xml
added a menu to generate Part-of-Speech FSA (used in LanguageTool, for example)
made tables look more standard (copy and edit operations)
improved thesaurus merging

version 1.9.1 - 20191028

completely revised how the loading of a project works, now it is possible to load and manage all the languages in an extension (or package), all the relevant files are read from manifest.xml and linked .xcu files
the way a project is loaded in the application has changed, now the project folder (marked by a blue icon) has to be selected instead of an .aff file
added the possibility to change the options for hyphenation

version 1.9.0 - 20191027

added the parsing and management of auto-correct files (only DocumentList.xml can be edited for now, SentenceExceptList.xml and WordExceptList.xml are currently read only)
now all the relevant files are loaded by reading the META-INF\manifest.xml file, no assumptions are made
enhancement for hyphenation section: now it is possible also to insert custom hyphenations
bug fix on duplicate extraction
some simplifications were made in the main menu (removed thesaurus validation on request because it will be done anyway at loading)
improvements on thesaurus table filtering
prevented the insertion of a new thesaurus if it is already contained
revised the dictionary sort dialog from scratch to better handle sections between comments
minor GUI adjustments and corrections

version 1.8.1 - 20190930

added the link to the online help
corrected the font size on the dictionary sorter dialog
bugfix: scroll on dictionary sorter dialog

version 1.8.0 - 20190928

introduced the possibility to choose the font (you can select it whenever you have loaded an .aff file, and you will be given a list of all the fonts that can render the loaded language; once a font is selected, it will be used for all the .aff files in that language)
