
HunLinter

Java-16+ License: GPL v3

Project Status: Active – The project has reached a stable, usable state and is being actively developed.


Foreword

Please, be aware that this application requires Java 16+!

You can download and install it for free from this link.


Main features

  • affix file and dictionary linter
  • rules reducer
  • LibreOffice and Mozilla packager
  • Part-of-Speech and dictionary FSA extractor for LanguageTool
  • automatic choice of a font able to render a custom language
  • management of thesaurus, hyphenation, auto-correct, sentence exception, and word exception files
  • minimal pairs extraction
  • statistics
  • … and many more!

Table of Contents

  1. Motivation
  2. What the application can do
  3. How to enhance its capabilities
  4. Recognized charsets
  5. Recognized flags
    1. General
    2. Suggestions
    3. Compounding
    4. Affix creation
    5. Others
  6. How to
    1. Open a project
    2. Create an extension
    3. Linter dictionary
    4. Linter thesaurus
    5. Linter hyphenation
    6. Sort dictionary
    7. Reduce rules
    8. Word count
    9. Rule flags aid
    10. Dictionary statistics
    11. Dictionary duplicates
    12. Dictionary wordlist
    13. Create a Part-of-Speech FSA
    14. Minimal pairs
    15. Ordering table columns
    16. Copying text
    17. Rule/dictionary insertion
  7. Screenshots
    1. Inflections
    2. Dictionary linter
    3. Thesaurus
    4. Hyphenation
    5. Dictionary sorter
    6. Rule reducer
    7. Font selection
    8. Statistics
    9. Autocorrections
    10. Sentence exceptions
    11. Word exceptions
    12. Part-of-Speech dictionary
  8. Changelog
    1. version 2.2.0
    2. version 2.1.0
    3. version 2.0.2
    4. version 2.0.1
    5. version 2.0.0
    6. version 1.10.0
    7. version 1.9.1
    8. version 1.9.0
    9. version 1.8.1
    10. version 1.8.0

Motivation

I created this project to help me build my Hunspell language files, that is, the .aff and .dic files along with hyphenation and thesaurus. I did it particularly for the Venetan language: you can find some tools here, and the language pack here (for the LibreOffice tools) and here (for the Mozilla tools).


What the application can do

This application can run many correctness checks on the structure and content of the files. It can tell you if some rule is missing or redundant. You can test rules and compound rules. You can also test hyphenation and, if needed, add rules. It can also manage and build the thesaurus.

This application can also sort the dictionary, count words (unique and total), produce some statistics, extract duplicates, wordlists, and minimal pairs, and create a package in order to build an .oxt or .xpi for deployment.


How to enhance its capabilities

You can customize the tests the application runs by simply adding another package alongside vec, named after the ISO 639-3 or ISO 639-2 code of the language, and extending the DictionaryCorrectnessChecker, Orthography, and DictionaryBaseData classes (this last class is used to drive the Bloom filter).

Along with these classes you can add your rules.properties, a file that describes various constraints on the rules in the .dic file.

After that, you have to tell the application that those files exist by editing the BaseBuilder class and adding a LanguageData to the DATAS hashmap.

The application automatically recognizes which checker to use based on the code in the LANG option present in the .aff file.
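The selection by LANG code can be pictured with the minimal sketch below. It is only an illustration, not HunLinter's actual code: the class name, the CHECKERS map, and the returned labels are hypothetical stand-ins for BaseBuilder and its DATAS hashmap, and it assumes that the language code is the part of the LANG value before the first '-' or '_'.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class LanguageCheckerLookup {
    // Hypothetical registry: language code -> label of the checker to use.
    // The real registry is in BaseBuilder; this map is only an illustration.
    static final Map<String, String> CHECKERS = Map.of("vec", "vec-specific checker");
    static final String GENERAL_CHECKER = "general checker";

    // Returns the value of the LANG option (e.g. "vec-IT"), if the .aff content declares one.
    static Optional<String> readLang(List<String> affLines) {
        for (String line : affLines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length == 2 && parts[0].equals("LANG")) {
                return Optional.of(parts[1].trim());
            }
        }
        return Optional.empty();
    }

    // Picks the checker by the language part of the LANG value,
    // falling back to the general one when nothing specific is registered.
    static String chooseChecker(List<String> affLines) {
        return readLang(affLines)
            .map(lang -> lang.split("[-_]")[0].toLowerCase(Locale.ROOT))
            .map(code -> CHECKERS.getOrDefault(code, GENERAL_CHECKER))
            .orElse(GENERAL_CHECKER);
    }
}
```

With this sketch, an .aff file declaring LANG vec-IT maps to the language-specific entry, while any unregistered language falls back to the general one.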


Recognized charsets

  • UTF-8
  • ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15
  • KOI8-R, KOI8-U
  • MICROSOFT-CP1251
  • ISCII-DEVANAGARI
  • TIS620-2533

Recognized flags

General

SET, FLAG, COMPLEXPREFIXES, LANG, AF, AM

Suggestions

REP

Compounding

COMPOUNDRULE, COMPOUNDMIN, COMPOUNDFLAG, ONLYINCOMPOUND, COMPOUNDPERMITFLAG, COMPOUNDFORBIDFLAG, COMPOUNDMORESUFFIXES, COMPOUNDWORDMAX, CHECKCOMPOUNDDUP, CHECKCOMPOUNDREP, CHECKCOMPOUNDCASE, CHECKCOMPOUNDTRIPLE, SIMPLIFIEDTRIPLE, FORCEUCASE

Affix creation

PFX, SFX

Others

CIRCUMFIX, FORBIDDENWORD, FULLSTRIP, KEEPCASE, ICONV, OCONV, NEEDAFFIX


How to

Open a project

Select File|Open Project. A dialog will appear, in which a blue folder (blue marks a valid project) should be selected.

The META-INF folder containing a manifest.xml file is loaded, and the location of each file of the project is retrieved from it.

Upon loading, a font that can render the content of the project is chosen. If you want another font, just select File|Select font and choose another one.

The font will be linked to the project so that, when opening it again later, the same font will be used.

Create an extension

In order to create an extension (e.g. for LibreOffice, or for Mozilla products) you have to use the option File|Create package. This will package the directory in which the .aff/.dic files reside into a zip file. All there is to do afterwards is to rename the extension to .oxt (LibreOffice) or .xpi (Mozilla).

Remember that the package will have the same name as the directory, but the directory itself is not included, just its content.
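As a rough illustration of what such a package looks like, the sketch below zips the content of a directory, without the directory itself, into an archive named after it. It uses only the JDK's java.util.zip and is not HunLinter's own packaging code; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class PackageSketch {
    // Zips the content of projectDir (not the directory itself) into <dirName>.zip
    // placed inside outputDir, and returns the path of the created archive.
    static Path zipContent(Path projectDir, Path outputDir) throws IOException {
        Path zipFile = outputDir.resolve(projectDir.getFileName() + ".zip");
        // Collect the files first, so the archive being written is never picked up
        List<Path> files;
        try (Stream<Path> walk = Files.walk(projectDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // Entry names are relative to projectDir, with '/' as separator
                String name = projectDir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        }
        return zipFile;
    }
}
```

Because entry names are relative to the project directory, META-INF/manifest.xml ends up at the root of the archive. The resulting file can then be renamed to .oxt or .xpi as described above.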

Linter dictionary

To lint a dictionary just select Dictionary tools|Correctness check or Dictionary tools|Correctness check using dictionary FSA.

Each line is then linted following the rules of a particular language (if the corresponding files are present in the project, e.g. for Venetan). If no such files are present, a general linter is applied.

Linter thesaurus

To lint the thesaurus just select Thesaurus tools|Correctness check or Thesaurus tools|Correctness check using dictionary FSA.

Each thesaurus entry is linted by checking that each synonym is also present as a definition (with the same Part-of-Speech).

In case of error, it is suggested to copy all the synonyms for the indicated words (and everything that comes out of filtering by those two words), remove each of them, and insert them again.

Linter hyphenation

To lint the hyphenation just select Hyphenation tools|Correctness check.

Each hyphenation code is then linted following certain rules (among them: a breakpoint should not be on the boundary, a code should have at least one breakpoint, etc.).
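The two rules named above can be sketched as follows. This is one possible reading, not HunLinter's actual linter: it assumes Liang-style patterns in which digits are breakpoints and '.' marks a word boundary, and it interprets "on the boundary" as a digit directly next to such a '.'. The class and method names are hypothetical.

```java
class HyphenationRuleCheck {
    // Returns null when the pattern passes both checks, otherwise a short error message.
    static String check(String pattern) {
        // Rule: a code should have at least one breakpoint (a digit)
        if (!pattern.chars().anyMatch(Character::isDigit)) {
            return "no breakpoint";
        }
        // Rule (as interpreted here): a digit right after a leading '.' or right
        // before a trailing '.' would break at the very start or end of a word
        int last = pattern.length() - 1;
        boolean atStart = last > 0 && pattern.charAt(0) == '.'
            && Character.isDigit(pattern.charAt(1));
        boolean atEnd = last > 0 && pattern.charAt(last) == '.'
            && Character.isDigit(pattern.charAt(last - 1));
        return (atStart || atEnd) ? "breakpoint on the word boundary" : null;
    }
}
```

For instance, check("a1b") and check(".a1b") pass, check("ab") reports a missing breakpoint, and check(".1ab") reports a breakpoint on the boundary.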

Sort dictionary

By selecting Dictionary tools|Sort dictionary you can sort specific parts of a dictionary file: select one of the highlighted sections, each of which runs from a comment or empty line to the next one.

The sorting order is language-dependent.
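To see why the order depends on the language, here is a small sketch that sorts words with the JDK's java.text.Collator for a given locale. Whether HunLinter uses Collator or its own comparators is not stated here, so take it only as an illustration of locale-aware ordering; the class name is hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LocaleSortSketch {
    // Returns a copy of the words sorted with the collation rules of the given locale.
    static List<String> sort(List<String> words, Locale locale) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Collator.getInstance(locale));
        return sorted;
    }
}
```

Unlike the natural String order, which compares character codes and therefore puts "B" before "a", a collator orders letters alphabetically first and considers case only as a secondary difference.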

Reduce rules

Use Dictionary tools|Rules reducer to find the minimum set of rules that can be applied following the current dictionary file.

E.g. if a dictionary file has the lines aa/b and bb/b, and the affix file contains the rules SFX b 0 A a, SFX b 0 B b, and SFX b 0 C c (where the last is not used), then this tool returns the minimum set of SFX b 0 A a and SFX b 0 B b.
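The example above can be reproduced with the minimal sketch below. It covers only the "drop the rules that no word uses" part, not the real reducer, which also has to minimize the rules themselves. It assumes single-character flags and treats the condition (the last field of an SFX line) as a literal word ending, whereas real conditions can contain character classes. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class UsedRulesSketch {
    // Keeps only the suffix rules whose condition matches at least one dictionary
    // entry carrying the rule's flag.
    // Rules are "SFX flag strip add condition"; entries are "word/flags".
    static List<String> usedRules(List<String> entries, List<String> rules) {
        List<String> used = new ArrayList<>();
        for (String rule : rules) {
            String[] parts = rule.trim().split("\\s+");
            String flag = parts[1];
            String condition = parts[4];
            for (String entry : entries) {
                int slash = entry.indexOf('/');
                if (slash < 0) {
                    continue; // entry without flags: no rule applies
                }
                String word = entry.substring(0, slash);
                String flags = entry.substring(slash + 1);
                // Simplification: single-character flags, condition as a literal ending
                if (flags.contains(flag) && word.endsWith(condition)) {
                    used.add(rule);
                    break;
                }
            }
        }
        return used;
    }
}
```

Called with the entries aa/b and bb/b and the three rules of the example, it keeps SFX b 0 A a and SFX b 0 B b and drops SFX b 0 C c, because no word with flag b ends in c.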

Word count

Use Dictionary tools|Word count to count all the words generated by the affix files, as well as the unique words (not considering part-of-speech).

Note: There is an uncertainty about the uniqueness count, but it should be small. Deal with it :p.

Dictionary statistics

Use Dictionary tools|Statistics to produce some statistics (graphs and values are exportable with a right click!) about word and compound word count, mode of the words' length, mode of the words' syllable count, most common syllables, and the longest words (by letters and by syllables).

If you want to include hyphenation statistics be sure to use Hyphenation tools|Statistics instead, but expect a 3.6× or so increase in running time.

Dictionary duplicates

To obtain a list of word duplicates (same word, same part-of-speech), the tool you want to use is under Dictionary tools|Extract duplicates.

Dictionary wordlist

To obtain a list of all the words generated by a dictionary and affix file, the menus Dictionary tools|Extract wordlist and Dictionary tools|Extract wordlist (plain words) should be used.

Create a Part-of-Speech FSA

In order to create an FSA for Part-of-Speech, suitable for use in LanguageTool, you have to use the option File|Extract PoS FSA and select the output folder. This will create an FSA using a provided <language>.info file (or an automatically generated one).

Remember that the FSA file will have the same name as specified in the LANG option in the .aff file, and extension .dict.

Minimal pairs

To obtain a list of minimal pairs use the menu Dictionary tools|Extract minimal pairs.
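As a rough idea of what a minimal pair is, the sketch below lists the pairs of words that have the same length and differ in exactly one position. This is a simplifying assumption made for illustration (it compares written characters, not sounds) and not necessarily the criterion HunLinter applies; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class MinimalPairsSketch {
    // True when the two words have the same length and differ in exactly one position.
    static boolean isMinimalPair(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int differences = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                differences++;
            }
        }
        return differences == 1;
    }

    // Compares every word with every following one and collects the pairs found.
    static List<String> minimalPairs(List<String> words) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                if (isMinimalPair(words.get(i), words.get(j))) {
                    pairs.add(words.get(i) + " / " + words.get(j));
                }
            }
        }
        return pairs;
    }
}
```

The pairwise comparison is quadratic in the number of words, which is fine for a sketch but would need a smarter grouping on a full wordlist.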

Rule flags aid

An external text file can be put into the directory aids (on the same level of the executable jar) whose content will be displayed in the drop-down element in the Dictionary tab (blank lines are ignored).

This file can be used as a reminder of all the flags that can be added to a word and their meaning.

The filename has to be the language (as specified in the option LANG inside the .aff file), and the extension aid (e.g. for Venetan: vec-IT.aid).

Ordering table columns

It is possible to sort certain columns of the tables: just click on the header of the column. The sort order will cycle between ascending, descending, and unsorted.

Copying text

It is possible to copy the content of tables and the words in the statistics section. Also, the graphs in the statistics section can be exported as images.

Use Ctrl+C after selecting the row, or use the right click of the mouse to access the popup menu.

Rule/dictionary insertion

This is NOT an editor tool [1]! If you want to add affix rules, add words in the dictionary, or change them, you have plenty of tools around you. For Windows, I suggest Notepad++ (for example, you will see immediately while typing if a word is already present in the dictionary).

[1] Even if for the hyphenation file a new rule can actually be added…


Screenshots

Inflections

Entries can be a single word followed by a slash and all the flags that have to be applied, followed optionally by one or more morphological fields.
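The entry layout described above can be split into its parts with a few lines of Java. This is only a sketch with hypothetical names, not HunLinter's parser: it assumes that the word contains no spaces or escaped slashes, and it keeps the flags as one raw string because their encoding depends on the FLAG option.

```java
import java.util.Arrays;
import java.util.List;

class DictionaryEntrySketch {
    final String word;
    final String flags;
    final List<String> morphologicalFields;

    DictionaryEntrySketch(String word, String flags, List<String> morphologicalFields) {
        this.word = word;
        this.flags = flags;
        this.morphologicalFields = morphologicalFields;
    }

    // Splits "word/flags field1 field2 ..." into word, raw flags, and morphological fields.
    static DictionaryEntrySketch parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        String head = tokens[0];
        int slash = head.indexOf('/');
        String word = slash < 0 ? head : head.substring(0, slash);
        String flags = slash < 0 ? "" : head.substring(slash + 1);
        // Everything after the first token is treated as a morphological field
        List<String> fields = Arrays.asList(tokens).subList(1, tokens.length);
        return new DictionaryEntrySketch(word, flags, fields);
    }
}
```

For a made-up line such as kantar/A0 po:verb st:kantar, the word is kantar, the flags are A0, and the fields are po:verb and st:kantar.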

[screenshot]

Dictionary linter

[screenshot]

Thesaurus

Entries can be inserted in two ways:

  1. (pos)|word1|word2|word3
  2. pos:word1,word2,word3

Once something is written, an automatic filtering is executed to find all the words (and part-of-speech, if given) that are already contained in the thesaurus.

It is possible to right-click on a row to bring up the popup menu and select whether to copy it, remove it (and all the other rows in which the selected definition appears), or merge with the current synonyms.
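The two input formats listed above carry the same information, as the following sketch shows by parsing both into a list whose first element is the part-of-speech. The class and method names are hypothetical, and the sketch does no validation beyond what is needed to split the text.

```java
import java.util.ArrayList;
import java.util.List;

class ThesaurusInputSketch {
    // Parses "(pos)|word1|word2|word3" or "pos:word1,word2,word3"
    // into [pos, word1, word2, word3].
    static List<String> parse(String input) {
        String text = input.trim();
        List<String> result = new ArrayList<>();
        if (text.startsWith("(")) {
            // First format: part-of-speech in parentheses, pipe-separated words
            String[] parts = text.split("\\|");
            String pos = parts[0].trim();
            result.add(pos.substring(1, pos.length() - 1));
            for (int i = 1; i < parts.length; i++) {
                result.add(parts[i].trim());
            }
        } else {
            // Second format: part-of-speech before the colon, comma-separated words
            int colon = text.indexOf(':');
            result.add(text.substring(0, colon).trim());
            for (String word : text.substring(colon + 1).split(",")) {
                result.add(word.trim());
            }
        }
        return result;
    }
}
```

Both (noun)|house|home|dwelling and noun:house,home,dwelling give the same list: noun, house, home, dwelling.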

[screenshots]

Hyphenation

[screenshot]

Dictionary sorter

[screenshot]

Rule reducer

[screenshot]

Font selection

[screenshot]

Statistics

[screenshots]

Autocorrection

[screenshot]

Sentence exceptions

[screenshot]

Word exceptions

[screenshot]

Part-of-Speech dictionary

[screenshot]


Changelog

version 2.2.0 - 20230521

  • rules reducer fixes and enhancements
  • considered different formats for part-of-speech in thesaurus file
  • fixed early creation of thesaurus parser (language was not available yet)
  • fixed circumfix inflections
  • enhanced duplication worker capabilities
  • delayed creation of file chooser (faster startup)
  • eliminated double reloading of dictionary in sort dialog when something changes
  • understood how ICONV and OCONV work
  • supported ISO-8859-10, ISO-8859-14, and ISCII-DEVANAGARI charsets
  • added a check on declared charset and real charset of a file
  • adjusted scroll to the bottom of the log text area while changing font
  • decreased the loading time of sorting dialog
  • increased speed by 57% (for dictionary linter: from 2m 13s to 57s)
  • decreased start-up time
  • corrected some typos

version 2.1.0 - 20210807

  • fix bug on initial font size
  • automatically unzip .dat and .bau files (in autocorr and autotext folders)
  • startup time reduced

version 2.0.2 - 20210806

  • added linter for auto-correct
  • corrected the size of the font
  • corrected the executable

version 2.0.1 - 20210805

  • added a warning for unused rules after dictionary linter
  • added the possibility to hide selected columns from dictionary table
  • (finally) added a Windows installer
  • some minor improvements on speed and linting capabilities

version 2.0.0 - 20200524

  • made update process stoppable
  • added a linter for thesaurus
  • added a menu to generate Dictionary FSA (used in LanguageTool, for example)
  • added a section to see the PoS FSA execution
  • fixed a bug on hyphenation: when the same rule is being added (with different breakpoints), the old one is being lost
  • substituted charting library
  • added undo/redo capabilities on input fields
  • completely revised thread management
  • fixed a nasty memory leak
  • now the sort dialog remains open after a sort
  • categorized the errors into (true) errors and warnings, now the warnings are no longer blocking
  • reduced compiled size by 52% (from 6 201 344 B to 3 002 671 B)
  • reduced memory footprint by 13% (for dictionary linter: from 728 MB to 630 MB)
  • increased speed by 53% (for dictionary linter: from 4m 44s to 2m 13s)
  • various minor bugfixes and code revisions

version 1.10.0 - 20200131

  • (finally) given a decent name to the project: HunLinter
  • fixed a bug while selecting the font once a project is loaded
  • fixed a bug while storing thesaurus information (only lowercase words are allowed)
  • added update capability (the new jar will be copied in the directory of the old jar and started)
  • added buttons to open relevant files
  • added management of SentenceExceptList.xml and WordExceptList.xml
  • added a menu to generate Part-of-Speech FSA (used in LanguageTool, for example)
  • made tables look more standard (copy and edit operations)
  • improved thesaurus merging

version 1.9.1 - 20191028

  • completely revised how the loading of a project works: now it is possible to load and manage all the languages in an extension (or package), and all the relevant files are read from manifest.xml and the linked .xcu files
  • changed the way a project is loaded in the application: now the project folder (marked by a blue icon) has to be selected instead of an .aff file
  • added the possibility to change the options for hyphenation

version 1.9.0 - 20191027

  • added the parsing and management of auto-correct files (only DocumentList.xml can be edited for now, SentenceExceptList.xml and WordExceptList.xml are currently read only)
  • now all the relevant files are loaded by reading the META-INF\manifest.xml file, no assumptions are made
  • enhancement for hyphenation section: now it is possible also to insert custom hyphenations
  • bug fix on duplicate extraction
  • some simplifications were made in the main menu (removed thesaurus validation on request because it will be done anyway at loading)
  • improvements on thesaurus table filtering
  • prevented the insertion of a new thesaurus if it is already contained
  • revised the dictionary sort dialog from scratch to better handle sections between comments
  • minor GUI adjustments and corrections

version 1.8.1 - 20190930

  • added the link to the online help
  • corrected the font size on the dictionary sorter dialog
  • bugfix: scroll on dictionary sorter dialog

version 1.8.0 - 20190928

  • introduced the possibility to choose the font (you can select it whenever you've loaded an .aff file, and it will give you a list of all the fonts that can render the loaded language; once a font is selected, it will be used for all the .aff files in that language)


hunlinter's Issues

Error while executing the application

Target version: 2.0.1
Created with unlicensed compiler: The program was made with an Unlicensed compiler. Please buy the PRO version to distribute your EXE.

pos on thesaurus

Some languages have the part-of-speech in brackets, others in square brackets, still others have an empty field.

font size

Remove font size request, automatically adapt size to labels.

problems with dictionary sorting

Find out why the Marathi language takes forever to open the dialog that sorts the dictionary. It seems to depend on the language, even though the operation shouldn't involve any knowledge of the language.

DictionarySortDialog:174
**entriesList.ensureIndexIsVisible(firstVisibleItemIndex);**

...

BasicListUI:1442
**for(int index = 0; index < dataModelSize; index++) {**
    Object value = dataModel.getElementAt(index);
    Component c = renderer.getListCellRendererComponent(list, value, index, false, false);
    rendererPane.add(c);
    Dimension cellSize = c.getPreferredSize();

compound rule error

Fix compound rule extraction (now it returns an empty list because of the way the affixes are separated before composing the production)

open files

Add buttons to open relevant files (aff, dic, xml...)

No need to check if characters are common

This affix works correctly as expected:

SET UTF8

SFX P Y 1
SFX P णे लेल्या/Aacd णे

This affix file does not work because the 3rd, 4th and 5th columns are the same. This is also expected:

SET UTF8

SFX P Y 1
SFX P णे णे णे

Reading error: Characters in common between removed and added part: 'SFX P णे णे णे', line 4


But the following file does not work and that is a problem.

SET UTF8

SFX P Y 1
SFX P णे णार णे

Reading error: Characters in common between removed and added part: 'SFX P णे णार णे', line 4

You are checking only the first character in column 3 and 4. If it matches you throw the error. But you should compare the “entire” 3rd and 4th column, (Not just first character).

In fact there is no need of that check at all. Simply follow the rule!
