morfologik / morfologik-stemming

Tools for finite state automata construction and dictionary-based morphological dictionaries. Includes Polish stemming dictionary.

License: BSD 3-Clause "New" or "Revised" License

Java 100.00%

morfologik-stemming's Issues

Rename FSAFinalStatesIterator to ByteSequenceIterator

Speller.findReplacements() doesn't provide properly ordered suggestions

Calling Speller.findReplacements("schin") with a German dictionary gives this result:

China, Dschinn, Schanz, Schein, Scheine, Scheins, Schi, Schiene, Schier, Schieß, Schiff, Schiit, Schild, Schilf, Schily, Schira, Schirm, Schis, Schiss, Schiwa, Schon, Schund, Schwing, Schön, Sphinx, china, schanz, schein, scheine, scheins, scheint, schenk, schi, schick, schied, schief, schien, schiene, schient, schier, schieß, schiff, schiit, schild, schilf, schilt, schinde, schirm, schis, schiss, schon, schone, schont, schund, schwing, schön, schöne, schönt, sphinx

I would have expected that at least schon and schön come up earlier. Debugging in the CandidateData constructor shows that all words have a distance of 2. Is this expected?

This happens in LanguageTool 2.4 with morfologik-speller 1.8.2, not using frequency information yet. Also, the dictionary has not been re-encoded yet with 1.8.2, as you suggested.

Reuse and clean up inflected form encoders

simple regexp for replacement-pairs

I noticed the Danish (and other) hunspell dictionaries have REP statements like these in their .aff file:

REP ^hen hen_ #henover -> hen over
REP ^påny$ på_ny

Morfologik doesn't seem to support ^, $ and _ in its replacement-pairs feature. It would be nice if these could be added so more hunspell dictionaries could be ported to Morfologik without loss of quality in suggestions.
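As a rough illustration of what supporting these markers could involve, the sketch below turns a hunspell-style REP pair into a Java regular expression. It assumes `^` and `$` anchor the pattern at the start and end of the word and `_` in the replacement stands for a space. The class `RepRule` is hypothetical and not part of Morfologik, and applying only the first match is a simplification.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Morfologik: one hunspell-style REP pair.
class RepRule {
    private final Pattern pattern;
    private final String replacement;

    RepRule(String from, String to) {
        boolean atStart = from.startsWith("^");
        boolean atEnd = from.endsWith("$");
        // Strip the anchors and quote the rest so it is matched literally.
        String literal = from.substring(atStart ? 1 : 0, from.length() - (atEnd ? 1 : 0));
        this.pattern = Pattern.compile(
                (atStart ? "^" : "") + Pattern.quote(literal) + (atEnd ? "$" : ""));
        // Assumption: '_' in the replacement stands for a space.
        this.replacement = Matcher.quoteReplacement(to.replace('_', ' '));
    }

    // Simplification: only the first match is replaced.
    String apply(String word) {
        return pattern.matcher(word).replaceFirst(replacement);
    }

    public static void main(String[] args) {
        System.out.println(new RepRule("^hen", "hen_").apply("henover"));
    }
}
```

With the first rule from the .aff excerpt, `henover` becomes `hen over`, while a word that merely contains `hen` in the middle is left unchanged.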

runon-words doesn't help with "alot"

Speller.replaceRunOnWords() won't suggest "a lot" for "alot", as the first pair it tries is al + ot. The next one it tries is alo + t, so at the end it does allow a single character. The easiest solution seems to be to simply allow a single character at the beginning of the word, too.
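The proposed behaviour can be sketched as follows. This is not the code of Speller.replaceRunOnWords(); it is a minimal stand-alone version that uses a plain `Set<String>` as a stand-in dictionary and starts splitting after the first character, so a one-letter left part is allowed. The class name `RunOnSplitter` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the Morfologik implementation.
class RunOnSplitter {
    static List<String> split(String word, Set<String> dictionary) {
        List<String> candidates = new ArrayList<>();
        // i is the length of the left part; starting at 1 allows a
        // single-character prefix such as "a" in "alot".
        for (int i = 1; i < word.length(); i++) {
            String left = word.substring(0, i);
            String right = word.substring(i);
            if (dictionary.contains(left) && dictionary.contains(right)) {
                candidates.add(left + " " + right);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("a", "lot", "al"));
        System.out.println(split("alot", dictionary));
    }
}
```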

Some adverbs not marked with degree

The ones I've found are:

bardzo bardzo adv
najbardziej najbardziej adv
najpewniej najpewniej adv
posuwiściej posuwiściej adv:pos (shouldn't that be adv:com from posuwiście?)
prędzej prędzej adv:pos (adv:com from prędko?)

I'm not sure about:

najpierw najpierw adv
najpierwej najpierwej adv
pierwej pierwej adv
pierwiej pierwiej adv

OSGi plugin forks execution into insane loops

The osgi (BND) plugin forks the execution of targets into insane loops (multiple runs of tests, javadocs, etc.).

I will remove the OSGi plugin from the build. If somebody needs it and knows how to make it work cleanly with Maven, submit a patch.

Make Java 1.7 the minimum required version

Remove and fail on deprecated metadata

Fail if dictionaries with the following metadata are present:

fsa.dict.uses-prefixes=
fsa.dict.uses-infixes=
fsa.dict.uses-suffixes=
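A check along these lines could look like the sketch below. It assumes the metadata has already been loaded into a `java.util.Properties` object. The class name and the choice of exception are illustrative only, not the library's actual behaviour.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch: reject metadata that still uses the deprecated attributes.
class DeprecatedMetadataCheck {
    static final List<String> DEPRECATED = Arrays.asList(
            "fsa.dict.uses-prefixes",
            "fsa.dict.uses-infixes",
            "fsa.dict.uses-suffixes");

    static void check(Properties metadata) {
        for (String key : DEPRECATED) {
            if (metadata.containsKey(key)) {
                // The exception type is an assumption made for this example.
                throw new IllegalArgumentException("Deprecated metadata attribute: " + key);
            }
        }
    }
}
```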

Make it possible to look up dictionaries without relying on thread context class loader

This call is dodgy, we should avoid it.

link to Oflazer's paper in Speller.java's javadoc is broken

http://acl.ldc.upenn.edu/J/J96/J96-1003.pdf doesn't work for me. I guess this is the paper: http://dl.acm.org/citation.cfm?id=234293

Dictionary.read(URL) ends in NPE when reading from a JAR resource

This is a problem of the "nice" path conversion:

featureMapURL = new URL(dictURL, DictionaryMetadata.getExpectedMetadataFileName(dictURL.toURI().getPath()));

Unfortunately dictURL.toURI().getPath() returns null for JAR URLs. Back to old method then.
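The behaviour can be reproduced with the JDK alone: a `jar:` URI is opaque, so `URI.getPath()` returns null for it. The sketch below shows this and one possible string-based fallback for deriving the metadata location. The helper `metadataLocation` and the example paths are hypothetical, and the `.info` extension is taken from the dictionary file names mentioned elsewhere in this tracker.

```java
import java.net.URI;

class JarUriPath {
    // Returns the path component of a URI, or null for opaque URIs such as jar: URIs.
    static String pathOf(String uri) {
        return URI.create(uri).getPath();
    }

    // Hypothetical fallback: swap the extension on the textual form of the location
    // instead of going through URI.getPath().
    static String metadataLocation(String dictLocation) {
        int dot = dictLocation.lastIndexOf('.');
        int slash = dictLocation.lastIndexOf('/');
        String base = dot > slash ? dictLocation.substring(0, dot) : dictLocation;
        return base + ".info";
    }

    public static void main(String[] args) {
        System.out.println(pathOf("file:/tmp/pl.dict"));
        System.out.println(pathOf("jar:file:/tmp/dicts.jar!/pl.dict"));
        System.out.println(metadataLocation("jar:file:/tmp/dicts.jar!/pl.dict"));
    }
}
```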

Is it possible to create dictionaries for other languages

Hi,

I have noticed that Czech and Slovak have quite poor stemming support in Solr: only some basic heuristics and hunspell, which is very slow in Solr 4.x. Would it be possible to prepare dictionaries similar to the Polish one for those languages, based for example on the OpenOffice dictionaries?
If so, how can that be achieved?

howto

Gentlemen,

how do I use this tool? I want to reduce a set of inflected Polish word forms to their base forms. The description in the readme is rather sparse and I can't work out how to do it.

Could you expand it a little?

Regards,
robert

I can adapt the format of the data set as needed...

Move Dictionary.convertText utility to DictionaryLookup.applyReplacements and fix current reliance on map ordering

This method is also incorrect in that it replaces strings from an unordered map; the result may be different depending on the ordering.
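To see why the ordering matters, consider the pairs a -> b and b -> c applied to the text "a": one order yields "c", the other yields "b". The sketch below is an illustration, not the library's code. It applies the pairs in the iteration order of the map it is given, so passing a `LinkedHashMap` makes the result deterministic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: sequential replacements are order-dependent.
class OrderedReplacements {
    static String apply(String text, Map<String, String> replacements) {
        String result = text;
        // Pairs are applied in the iteration order of the map; a LinkedHashMap
        // keeps insertion order, a HashMap gives no such guarantee.
        for (Map.Entry<String, String> pair : replacements.entrySet()) {
            result = result.replace(pair.getKey(), pair.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> abFirst = new LinkedHashMap<>();
        abFirst.put("a", "b");
        abFirst.put("b", "c");
        Map<String, String> bcFirst = new LinkedHashMap<>();
        bcFirst.put("b", "c");
        bcFirst.put("a", "b");
        System.out.println(apply("a", abFirst)); // c
        System.out.println(apply("a", bcFirst)); // b
    }
}
```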

JavaDoc/ compilation fails on 1.8

Bug in building the dictionary

When building the dictionary, some lemmas are silently dropped.

The test case is attached. The form:

odgrywać|pact:pl:gen.loc:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:aff:refl.nonrefl

has no lemma in the fsa dump, even if it has a lemma in the test_file.txt. The file is here:

https://dl.dropboxusercontent.com/u/4350317/test_file.zip

Autocorrection for stemming

I used new (1.7.0) polish-stemmer package from maven and noticed that it doesn't fix diacritics, even though these options are true by default.

Here are simple unit tests I made
http://pastebin.com/jwHSVecU

Another question is why "ą" is not replaced by "a" by default like "Ł" and "L"?

Dictionary data format has changed between 1.5.5 and 1.6.0

In short: tags are now concatenated for every lemma; previously they were returned separately.

1.5.5:
liście+AAA+subst:sg:acc:n2
liście+AAA+subst:sg:nom:n2
liście+AAA+subst:sg:voc:n2
liście+AADć+subst:pl:acc:m3
liście+AADć+subst:pl:nom:m3
liście+AADć+subst:pl:voc:m3
liście+AAFst+subst:sg:loc:m3
liście+AAFst+subst:sg:voc:m3
liście+AAFsta+subst:sg:dat:f
liście+AAFsta+subst:sg:loc:f

1.6.0:
liście+AAA+subst:sg:acc:n2+subst:sg:nom:n2+subst:sg:voc:n2
liście+AADć+subst:pl:acc:m3+subst:pl:nom:m3+subst:pl:voc:m3
liście+AAFst+subst:sg:loc:m3+subst:sg:voc:m3
liście+AAFsta+subst:sg:dat:f+subst:sg:loc:f

Decide what to do -- is this a regression or should it be the new default?
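Whichever default is chosen, a consumer that needs the 1.5.5-style output could split a 1.6.0-style entry itself. The sketch below is a hypothetical helper, not a library API. It assumes the first two separator-delimited fields are the form and the encoded stem, every further field is one tag, and the separator never occurs inside the encoded stem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: turn "form+code+tag1+tag2" into one entry per tag.
class TagSplitter {
    static List<String> split(String entry, char separator) {
        String sep = String.valueOf(separator);
        String[] fields = entry.split(Pattern.quote(sep));
        List<String> result = new ArrayList<>();
        if (fields.length < 3) {
            // Nothing to split: no tag field present.
            result.add(entry);
            return result;
        }
        for (int i = 2; i < fields.length; i++) {
            result.add(fields[0] + sep + fields[1] + sep + fields[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : split("liście+AAFst+subst:sg:loc:m3+subst:sg:voc:m3", '+')) {
            System.out.println(line);
        }
    }
}
```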

strange definition of CamelCase

Speller.isCamelCase() considers words like Waschmaschinen-Test to be camel case. If that's on purpose, it should be documented. For German, these are common words and I wouldn't consider them camel case.
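One stricter definition, shown here only as a sketch and not as the current Speller.isCamelCase() logic, treats a word as camel case only when an upper-case letter directly follows a lower-case letter. Under that assumption, a capital after a hyphen does not count.

```java
// Hypothetical alternative definition, not the Morfologik implementation.
class CamelCaseCheck {
    static boolean isCamelCase(String word) {
        for (int i = 1; i < word.length(); i++) {
            // Only a lower-case letter immediately followed by an upper-case
            // letter counts; "-T" in "Waschmaschinen-Test" does not.
            if (Character.isUpperCase(word.charAt(i))
                    && Character.isLowerCase(word.charAt(i - 1))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isCamelCase("Waschmaschinen-Test")); // false
        System.out.println(isCamelCase("camelCase"));           // true
    }
}
```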

Morfologik dictionaries for Norwegian, Portuguese, Finnish and Dutch.

Hi, all!
I would like to use the morfologik library for multilanguage stemming, and I'm looking for the corresponding .dict and .info files for Norwegian, Portuguese, Finnish and Dutch. If you've seen any of these dictionaries somewhere, please give me a link.

Add forbidden API checker to the build

Vocative of village name "Bystra" should be "Bystro"

The lexicon contains a lower-case common adjective whose vocative is correctly "bystra", but also an upper-case word which can be the village name and should IMHO have the vocative "Bystro".

Dictionary thread-safe

This is more of a question: is Dictionary thread-safe or are there any plans to guarantee its thread-safety? Here's my use case: I create a Dictionary at runtime, which takes some time, so I'd like to do it only once and use it for all threads.

Analyzer finds tokens that haven't been mentioned in original string

For the issue details, please see: monterail/elasticsearch-analysis-morfologik#6

Rethink packaging so that customized proguard is not needed?

Metadata *.info files should be in UTF-8 to support text attributes that otherwise would require text2ascii conversion

This is a trivial change that requires Java 1.6: we will load the properties with Properties.load(Reader) and force the reader to use UTF-8. ASCII-escaped sequences will still be decoded, so this is backward compatible.
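A minimal sketch of that loading scheme, using only JDK classes. The class name `Utf8Properties` and the `example.*` keys are made up for the illustration, and `StandardCharsets` needs Java 7; on Java 1.6 the charset name "UTF-8" would be passed instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper: read a *.info-style properties stream as UTF-8.
class Utf8Properties {
    static Properties load(InputStream in) {
        Properties properties = new Properties();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            // Properties.load(Reader) decodes characters with the reader's charset
            // and still understands \\uXXXX escapes.
            properties.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    public static void main(String[] args) {
        String text = "example.raw=żółć\nexample.escaped=\\u017c\\u00f3\\u0142\\u0107\n";
        Properties p = load(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println(p.getProperty("example.raw").equals(p.getProperty("example.escaped")));
    }
}
```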

Remove SGJP/ duplicate resources, leave only PoliMorf

findReplacements() suboptimal results with replacement pairs

I'm not sure if the algorithm is guaranteed to return the best replacements, but this one looks like a bug. It ranks "Mitmuss" better than "Rhythmus" when the misspelled word is "Rytmus".

public static void main(String[] args) throws Exception {
  File infoFile = new File("/tmp/morfologik.info");
  FileWriter fw1 = new FileWriter(infoFile);
  fw1.write("fsa.dict.separator=+\n");
  fw1.write("fsa.dict.encoding=utf-8\n");
  // without this, suggestions improve:
  fw1.write("fsa.dict.speller.replacement-pairs=s ss\n");
  fw1.close();

  File inputFile = new File("/tmp/morfologik.txt");
  FileWriter fw2 = new FileWriter(inputFile);
  fw2.write("Mitmuss\n");
  fw2.write("Rhythmus\n");
  fw2.close();

  File outputFile = new File("/tmp/morfologik.dict");
  String[] buildToolOptions =
          {"-i", inputFile.getAbsolutePath(), "-o", outputFile.getAbsolutePath()};
  FSABuildTool.main(buildToolOptions);

  Dictionary dictionary = Dictionary.read(outputFile);
  Speller speller = new Speller(dictionary, 2);
  List<String> replacements = speller.findReplacements("Rytmus");
  // Prints "[Mitmuss, Rhythmus]".
  // Prints "[Rhythmus]" if there are no replacement pairs.
  // Debugging shows that in the CandidateData constructor, 'Mitmuss' gets created with a distance of 0.
  System.out.println("replacements: " + replacements);
}

IndexOutOfBoundsException for Tagalog lookup

This just happened when checking Tagalog text with the current LanguageTool snapshot (which uses morfologik 1.8.1):

Exception in thread "main" java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkBounds(Buffer.java:559)
    at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:181)
    at morfologik.stemming.DictionaryLookup.decodeBaseForm(DictionaryLookup.java:290)
    at morfologik.stemming.DictionaryLookup.lookup(DictionaryLookup.java:211)
    at org.languagetool.tagging.BaseTagger.tag(BaseTagger.java:83)
    at org.languagetool.JLanguageTool.getRawAnalyzedSentence(JLanguageTool.java:793)
    at org.languagetool.JLanguageTool.getAnalyzedSentence(JLanguageTool.java:778)
    at org.languagetool.JLanguageTool.analyzeSentences(JLanguageTool.java:596)
    at org.languagetool.JLanguageTool.check(JLanguageTool.java:569)

Does this help you in any way? If not, I could try to find the text that causes it and build a small test case.

Review library dependencies and bring them up to date

Prefix (and possibly infix) encoding is suboptimal.

0123123456789 123456789X tag

This currently returns:

0123123456789+BJ456789X+tag

when prefix encoding is used. This is suboptimal because the heuristic is greedy and stops after the first match is found. We should minimize the length of the output code (and thus maximize the length of the reused substring). The above sequence can be encoded as:

0123123456789+EAX+tag

I'll fix when rewriting the encoder.
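The idea of minimizing the code length can be sketched with an exhaustive search over all prefix cuts. This is not the planned encoder. It assumes the layout suggested by the two examples above: the first letter encodes how many characters to drop from the start of the inflected form, the second how many to drop from its end, and the remainder is the suffix to append. It works on characters rather than bytes and ignores any upper limit on the counts.

```java
// Hypothetical sketch of a length-minimizing prefix encoder.
class PrefixEncodingSketch {
    static String encode(String inflected, String base) {
        String best = null;
        for (int prefix = 0; prefix <= inflected.length(); prefix++) {
            String rest = inflected.substring(prefix);
            // Length of the common prefix of the remaining form and the base.
            int common = 0;
            while (common < rest.length() && common < base.length()
                    && rest.charAt(common) == base.charAt(common)) {
                common++;
            }
            int trim = rest.length() - common;
            String code = "" + (char) ('A' + prefix) + (char) ('A' + trim)
                    + base.substring(common);
            // Keep the shortest code; the earliest cut wins ties.
            if (best == null || code.length() < best.length()) {
                best = code;
            }
        }
        return best;
    }

    static String decode(String inflected, String code) {
        int prefix = code.charAt(0) - 'A';
        int trim = code.charAt(1) - 'A';
        String rest = inflected.substring(prefix);
        return rest.substring(0, rest.length() - trim) + code.substring(2);
    }

    public static void main(String[] args) {
        System.out.println(encode("0123123456789", "123456789X")); // EAX
    }
}
```

Both `EAX` and the greedy result `BJ456789X` decode to the same base form; the search simply picks the shorter of the valid codes.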

"taić" as a non-reflexive, imperfective form included twice

E.g., the line is:
taić taić verb:inf:imperf.perf:nonrefl+verb:inf:imperf:refl.nonrefl

As I understand, this expands to:
taić taić verb:inf:imperf:nonrefl
taić taić verb:inf:perf:nonrefl
taić taić verb:inf:imperf:refl
taić taić verb:inf:imperf:nonrefl

and the fact that the line "taić taić verb:inf:imperf:nonrefl" happens twice can confuse some programs.
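Programs that consume the dictionary can protect themselves by de-duplicating after expansion. The sketch below is a hypothetical helper. It assumes the notation described in this issue: `+` separates alternative tags, `:` separates fields and `.` separates alternative values within a field. It expands a tag string into a `LinkedHashSet`, which drops the repeated entry while keeping the order.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: expand compact tags and drop duplicates.
class TagExpander {
    static Set<String> expand(String tags) {
        Set<String> result = new LinkedHashSet<>();
        for (String tag : tags.split("\\+")) {
            List<String> partial = new ArrayList<>();
            partial.add("");
            for (String field : tag.split(":")) {
                List<String> next = new ArrayList<>();
                // Cartesian product of what was built so far with this field's values.
                for (String prefix : partial) {
                    for (String value : field.split("\\.")) {
                        next.add(prefix.isEmpty() ? value : prefix + ":" + value);
                    }
                }
                partial = next;
            }
            result.addAll(partial);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String tag : expand("verb:inf:imperf.perf:nonrefl+verb:inf:imperf:refl.nonrefl")) {
            System.out.println(tag);
        }
    }
}
```

For the line quoted above this yields three distinct tags instead of four.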

ArrayIndexOutOfBoundsException with replacement-pairs

This exception happens only with master, not with the latest release:

Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at morfologik.speller.HMatrix.get(HMatrix.java:81)
at morfologik.speller.Speller.findRepl(Speller.java:484)
at morfologik.speller.Speller.findRepl(Speller.java:525)
at morfologik.speller.Speller.findReplacements(Speller.java:434)
at org.languagetool.rules.spelling.morfologik.MorfologikSpeller.getSuggestions(MorfologikSpeller.java:90)
at org.languagetool.rules.spelling.morfologik.MorfologikSpellerRule.getRuleMatches(MorfologikSpellerRule.java:182)
at org.languagetool.rules.spelling.morfologik.MorfologikSpellerRule.match(MorfologikSpellerRule.java:119)
at org.languagetool.JLanguageTool.checkAnalyzedSentence(JLanguageTool.java:601)
at org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:937)

It happens if you get suggestions for the word you with the recently added Dutch dictionary of LanguageTool, which contains:

fsa.dict.speller.replacement-pairs=y ij

If you remove that line, it works fine. Let me know if you need more details to reproduce this.

Strange words found/others not found

Random things I've stumbled upon:

the lexicon includes the word "wieczor" that I don't know and can't find in so.pwn.pl or www.sjp.pl.
words like "czasami" or "wieczorem" as only nouns. Can't they be adverbs in sentences like "Pogoda była wieczorem brzydka a rano ładna"?

Names of some professions have only m1 gender version

The ones I've stumbled upon are:

I would expect there would be both m1 and f forms, with the f form not changing in declension (like it is for "doktor").

Dump tool does not work with frequency dictionaries in -x mode

The dump tool does not understand the frequency flag, and fails in -x (decode) mode.

Recompress the Polish dictionary (change value separator to ";").

Change metadata format for explicit decoder specification

Drop the following metadata:
fsa.dict.uses-prefixes=
fsa.dict.uses-infixes=
because this leaves an ambiguity between suffix-compressed and non-compressed dictionaries.

I think the new major release should use an explicit coder name:

fsa.dict.encoder=[suffix|prefix|infix]

this will also leave a way for any future encoders, if we come up with something.
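Reading such an attribute could look like the sketch below. The `EncoderType` enum and the error handling are assumptions made for illustration; only the attribute name and its three values come from the proposal above.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical enum for the proposed fsa.dict.encoder attribute.
enum EncoderType {
    SUFFIX, PREFIX, INFIX;

    static EncoderType fromMetadata(Properties metadata) {
        String value = metadata.getProperty("fsa.dict.encoder");
        if (value == null) {
            throw new IllegalArgumentException("Missing required attribute: fsa.dict.encoder");
        }
        // valueOf throws IllegalArgumentException for unknown encoder names.
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```

New encoders would then only need a new enum constant, which is the extensibility the proposal is after.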

WordData.clone should be public

Update dependencies

morfologik maven plugin

Hello,

I wrote a Maven plugin (https://github.com/pminos/morfologik-maven-plugin) for creating FSA dictionaries with morfologik-fsa. For now it can create morphological and synthesizer dictionaries.

I am planning to publish it to the central repository, but first I wanted to ask if you want to merge it to morfologik-stemming. If you are interested, I can continue the development and add support for more types of dictionaries (e.g. for spelling).

Dictionary format

Hi all,
I want to know the format of morfologik's dictionaries. I think it's not just like the lines below:
Abe+I
Abel+J
Abelard+F
Abelson+E
Aberconwy+E
I would like to know more details about it, and how to use it.
Thanks in advance.

BufferUtils.ensureCapacity now clears the input buffer

It'd be better to check for remaining space in the buffer and not require buffers to be at position() == 0
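A sketch of the suggested behaviour, using only `java.nio`. The method `ensureRemaining` is hypothetical and not the existing BufferUtils.ensureCapacity; it leaves the buffer alone when enough space remains and otherwise copies the bytes written so far into a larger buffer, preserving the position.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not the BufferUtils implementation.
class BufferSketch {
    static ByteBuffer ensureRemaining(ByteBuffer buffer, int needed) {
        if (buffer.remaining() >= needed) {
            return buffer; // enough room after the current position
        }
        ByteBuffer bigger = ByteBuffer.allocate(buffer.position() + needed);
        // Copy bytes [0, position) via a duplicate so the input buffer's
        // position and limit stay untouched.
        ByteBuffer written = buffer.duplicate();
        written.flip();
        bigger.put(written);
        return bigger;
    }

    public static void main(String[] args) {
        ByteBuffer small = ByteBuffer.allocate(4);
        small.put((byte) 1).put((byte) 2).put((byte) 3);
        ByteBuffer grown = ensureRemaining(small, 5);
        System.out.println(grown.position() + " " + grown.remaining());
    }
}
```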

FSADumpTool header should always be dumped in UTF-8

FSADump: -x not optional

The help output of FSADumpTool describes -x as optional, at least that's what I understand by "if available":

Decode prefix/ infix/ suffix forms (if available).

But when I run java -jar morfologik-tools-1.9.0.jar fsa_dump -x -d pl_PL.dict (the Polish dict from LT), I get this error:

java.lang.RuntimeException: Invalid dictionary entry format (missing separator).
 at morfologik.stemming.DictionaryIterator.next(Unknown Source)
 at morfologik.stemming.DictionaryIterator.next(Unknown Source)
 at morfologik.tools.FSADumpTool.dump(Unknown Source)
 at morfologik.tools.FSADumpTool.go(Unknown Source)
 at morfologik.tools.Tool.go(Unknown Source)
 at morfologik.tools.FSADumpTool.main(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at morfologik.tools.Launcher$ToolInfo.invoke(Unknown Source)
 at morfologik.tools.Launcher.main(Unknown Source)

Without -x, it works. Maybe -x should be optional, i.e. if there's nothing to decode, this option should be ignored.

Thank you in advance.
Robert
