morfologik-stemming's People

Contributors

aldenquimby, danielnaber, dweiss, jaumeortola, milekpl

morfologik-stemming's Issues

Speller.findReplacements() doesn't provide properly ordered suggestions

Calling Speller.findReplacements("schin") with a German dictionary gives this result:

China, Dschinn, Schanz, Schein, Scheine, Scheins, Schi, Schiene, Schier, Schieß, Schiff, Schiit, Schild, Schilf, Schily, Schira, Schirm, Schis, Schiss, Schiwa, Schon, Schund, Schwing, Schön, Sphinx, china, schanz, schein, scheine, scheins, scheint, schenk, schi, schick, schied, schief, schien, schiene, schient, schier, schieß, schiff, schiit, schild, schilf, schilt, schinde, schirm, schis, schiss, schon, schone, schont, schund, schwing, schön, schöne, schönt, sphinx

I would have expected that at least schon and schön come up earlier. Debugging in the CandidateData constructor shows that all words have a distance of 2. Is this expected?

This happens in LanguageTool 2.4 with morfologik-speller 1.8.2, not using frequency information yet. Also, the dictionary has not been re-encoded yet with 1.8.2, as you suggested.
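
For comparison, a plain Levenshtein distance puts schon one edit away from schin, while Schein is two edits away. The sketch below is only a reference point: it is a textbook implementation, the class name EditDistanceCheck is made up for illustration, and it is not claimed to be the metric the Speller uses internally.

```java
final class EditDistanceCheck {
    // Textbook dynamic-programming Levenshtein distance:
    // insertions, deletions and substitutions each cost 1, comparison is case-sensitive.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // "sch\u00f6n" is "schön", escaped so the source stays ASCII-only.
        for (String candidate : new String[] {"schon", "sch\u00f6n", "Schein", "China"}) {
            System.out.println(candidate + " -> " + levenshtein("schin", candidate));
        }
    }
}
```

If the Speller reports a distance of 2 for every candidate, its metric presumably differs from this one (for example in how the cut-off or case is handled), which would be worth checking.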

simple regexp for replacement-pairs

I noticed the Danish (and other) hunspell dictionaries have REP statements like these in their .aff file:

REP ^hen hen_ #henover -> hen over
REP ^påny$ på_ny

Morfologik doesn't seem to support ^, $ and _ in its replacement-pairs feature. It would be nice if these could be added so more hunspell dictionaries could be ported to Morfologik without loss of quality in suggestions.
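
To make the request concrete, here is a minimal sketch of how such anchored rules could be applied. It assumes the hunspell meaning of the markers (^ anchors at the start of the word, $ at the end, _ in the replacement stands for a space). The class and method names are hypothetical and not part of Morfologik.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchoredReplacementPairs {
    // One hunspell-style REP rule. The "from" side may start with ^ and/or end with $;
    // an underscore on the "to" side is turned into a space.
    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String from, String to) {
            boolean atStart = from.startsWith("^");
            boolean atEnd = from.endsWith("$");
            String literal = from.substring(atStart ? 1 : 0, from.length() - (atEnd ? 1 : 0));
            this.pattern = Pattern.compile(
                    (atStart ? "^" : "") + Pattern.quote(literal) + (atEnd ? "$" : ""));
            this.replacement = to.replace('_', ' ');
        }
    }

    // Produces one candidate per rule match; each candidate applies exactly one replacement.
    static List<String> candidates(String word, List<Rule> rules) {
        List<String> out = new ArrayList<>();
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(word);
            while (m.find()) {
                out.add(word.substring(0, m.start()) + rule.replacement + word.substring(m.end()));
            }
        }
        return out;
    }
}
```

With the two rules quoted above, henover yields the candidate "hen over". A real implementation would still have to check every part of a candidate against the dictionary before suggesting it.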

runon-words doesn't help with "alot"

Speller.replaceRunOnWords() won't suggest a lot for alot, because the first pair it tries is al + ot. The next one it tries is alo + t, so it does allow a single character at the end. The easiest solution seems to be to allow a single character at the beginning of the word, too.
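
A minimal sketch of the proposed behaviour, using a plain Set as a stand-in for the dictionary lookup. The class RunOnSplits and its method are hypothetical; this is not the code in Speller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class RunOnSplits {
    // Returns "left right" for every split point where both halves are known words.
    // The loop starts at index 1, so a single-character left half ("a" + "lot") is tried too.
    static List<String> candidates(String word, Set<String> knownWords) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i < word.length(); i++) {
            String left = word.substring(0, i);
            String right = word.substring(i);
            if (knownWords.contains(left) && knownWords.contains(right)) {
                out.add(left + " " + right);
            }
        }
        return out;
    }
}
```

Allowing one-letter halves will likely produce more false positives in some languages, so it may be worth making the minimum length configurable.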

Some adverbs not marked with degree

The ones I've found are:

  • bardzo bardzo adv
  • najbardziej najbardziej adv
  • najpewniej najpewniej adv
  • posuwiściej posuwiściej adv:pos (shouldn't that be adv:com from posuwiście?)
  • prędzej prędzej adv:pos (adv:com from prędko?)

I'm not sure about:

  • najpierw najpierw adv
  • najpierwej najpierwej adv
  • pierwej pierwej adv
  • pierwiej pierwiej adv

OSGi plugin forks execution into insane loops

The osgi (BND) plugin forks the execution of targets into insane loops (multiple runs of tests, javadocs, etc.).

I will remove the OSGi plugin from the build. If somebody needs it and knows how to make it work cleanly with Maven, submit a patch.

Is it possible to create dictionaries for other languages

Hi,

I have noticed that Czech and Slovak have quite poor stemming support in Solr: only some basic heuristics and hunspell, which is very slow in Solr 4.x. Would it be possible to prepare dictionaries similar to the Polish one for those languages, based for example on the OpenOffice dictionaries?
If so, how can that be achieved?

howto

Gentlemen,

how do I use this tool? I want to reduce a set of inflected Polish word forms to their base forms. The description in the README is rather sparse and I can't work out how to do it.

Could you expand it a little?

Regards,
robert

I can adapt the format of the input set as needed.

Autocorrection for stemming

I used the new (1.7.0) polish-stemmer package from Maven and noticed that it doesn't fix diacritics, even though these options are true by default.

Here are simple unit tests I made
http://pastebin.com/jwHSVecU

Another question: why is "ą" not replaced by "a" by default, the way "Ł" is replaced by "L"?

Dictionary data format has changed between 1.5.5 and 1.6.0

In short, tags are now concatenated for each lemma, whereas previously they were returned separately.

1.5.5:
liście+AAA+subst:sg:acc:n2
liście+AAA+subst:sg:nom:n2
liście+AAA+subst:sg:voc:n2
liście+AADć+subst:pl:acc:m3
liście+AADć+subst:pl:nom:m3
liście+AADć+subst:pl:voc:m3
liście+AAFst+subst:sg:loc:m3
liście+AAFst+subst:sg:voc:m3
liście+AAFsta+subst:sg:dat:f
liście+AAFsta+subst:sg:loc:f

1.6.0:
liście+AAA+subst:sg:acc:n2+subst:sg:nom:n2+subst:sg:voc:n2
liście+AADć+subst:pl:acc:m3+subst:pl:nom:m3+subst:pl:voc:m3
liście+AAFst+subst:sg:loc:m3+subst:sg:voc:m3
liście+AAFsta+subst:sg:dat:f+subst:sg:loc:f

Decide what to do -- is this a regression or should it be the new default?
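
Until that is decided, client code can recover the 1.5.5-style output from a 1.6.0-style line. The sketch below assumes the separator is + and that neither the form nor the encoded base contains a + itself. TagSplitter is a made-up name, not a Morfologik class.

```java
import java.util.ArrayList;
import java.util.List;

final class TagSplitter {
    // Turns "form+code+tag1+tag2" into ["form+code+tag1", "form+code+tag2"].
    // Lines with fewer than three fields are returned unchanged.
    static List<String> expand(String entry) {
        String[] parts = entry.split("\\+");
        List<String> out = new ArrayList<>();
        if (parts.length < 3) {
            out.add(entry);
            return out;
        }
        for (int i = 2; i < parts.length; i++) {
            out.add(parts[0] + "+" + parts[1] + "+" + parts[i]);
        }
        return out;
    }
}
```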

strange definition of CamelCase

Speller.isCamelCase() considers words like Waschmaschinen-Test to be camel case. If that's on purpose, it should be documented. For German, these are common words and I wouldn't consider them camel case.
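
One possible narrower definition, as a sketch only: treat a word as camel case when a lowercase letter is immediately followed by an uppercase one. I have not checked how Speller.isCamelCase() is actually implemented; CamelCaseCheck below is a hypothetical stand-in.

```java
final class CamelCaseCheck {
    // True only if some lowercase letter is directly followed by an uppercase letter.
    // A hyphen between the two (as in "Waschmaschinen-Test") breaks the pattern.
    static boolean isCamelCase(String word) {
        for (int i = 1; i < word.length(); i++) {
            if (Character.isLowerCase(word.charAt(i - 1)) && Character.isUpperCase(word.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```

Under this definition camelCase and LanguageTool count as camel case, while Waschmaschinen-Test and Test do not.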

Morfologik dictionaries for Norwegian, Portuguese, Finnish and Dutch.

Hi, all!
I would like to use the morfologik library for multilanguage stemming, and I'm now looking for the corresponding .dict and .info files for Norwegian, Portuguese, Finnish and Dutch. If you've seen one of these dictionaries somewhere, please give me a link to it.

Dictionary thread-safe

This is more of a question: is Dictionary thread-safe or are there any plans to guarantee its thread-safety? Here's my use case: I create a Dictionary at runtime, which takes some time, so I'd like to do it only once and use it for all threads.

findReplacements() suboptimal results with replacement pairs

I'm not sure if the algorithm is guaranteed to return the best replacements, but this one looks like a bug. It ranks "Mitmuss" better than "Rhythmus" when the misspelled word is "Rytmus".

public static void main(String[] args) throws Exception {
  File infoFile = new File("/tmp/morfologik.info");
  FileWriter fw1 = new FileWriter(infoFile);
  fw1.write("fsa.dict.separator=+\n");
  fw1.write("fsa.dict.encoding=utf-8\n");
  // without this, suggestions improve:
  fw1.write("fsa.dict.speller.replacement-pairs=s ss\n");
  fw1.close();

  File inputFile = new File("/tmp/morfologik.txt");
  FileWriter fw2 = new FileWriter(inputFile);
  fw2.write("Mitmuss\n");
  fw2.write("Rhythmus\n");
  fw2.close();

  File outputFile = new File("/tmp/morfologik.dict");
  String[] buildToolOptions =
          {"-i", inputFile.getAbsolutePath(), "-o", outputFile.getAbsolutePath()};
  FSABuildTool.main(buildToolOptions);

  Dictionary dictionary = Dictionary.read(outputFile);
  Speller speller = new Speller(dictionary, 2);
  List<String> replacements = speller.findReplacements("Rytmus");
  // - prints "[Mitmuss, Rhythmus]"
  // - prints "[Rhythmus]" if there are no replacement pairs
  // - debugging shows that in the CandidateData constructor, 'Mitmuss' gets created with a distance of 0
  System.out.println("replacements: " + replacements);
}

IndexOutOfBoundsException for Tagalog lookup

This just happened when checking Tagalog text with the current LanguageTool snapshot (which uses morfologik 1.8.1):

Exception in thread "main" java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkBounds(Buffer.java:559)
    at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:181)
    at morfologik.stemming.DictionaryLookup.decodeBaseForm(DictionaryLookup.java:290)
    at morfologik.stemming.DictionaryLookup.lookup(DictionaryLookup.java:211)
    at org.languagetool.tagging.BaseTagger.tag(BaseTagger.java:83)
    at org.languagetool.JLanguageTool.getRawAnalyzedSentence(JLanguageTool.java:793)
    at org.languagetool.JLanguageTool.getAnalyzedSentence(JLanguageTool.java:778)
    at org.languagetool.JLanguageTool.analyzeSentences(JLanguageTool.java:596)
    at org.languagetool.JLanguageTool.check(JLanguageTool.java:569)

Does this help you in any way? If not, I could try to find the text that causes it and build a small test case.

Prefix (and possibly infix) encoding is suboptimal.

0123123456789 123456789X tag

This currently returns:

0123123456789+BJ456789X+tag

when prefix encoding is used. This is suboptimal because the heuristic is greedy and stops after the first match is found. We should minimize the length of the output code (and thus maximize the length of the reused substring). The above sequence can be encoded as:

0123123456789+EAX+tag

I'll fix when rewriting the encoder.
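
A rough sketch of the exhaustive variant follows. The code layout is inferred from the two examples above: the first letter is the number of characters dropped from the start of the inflected form, the second is the number dropped from its end, both counted from 'A', and the rest is the suffix to append. PrefixCodeSketch is a made-up class, not the real encoder.

```java
final class PrefixCodeSketch {
    // Tries every possible prefix length and keeps the shortest resulting code,
    // instead of stopping at the first prefix that gives a match.
    static String encode(String inflected, String base) {
        String best = null;
        for (int p = 0; p <= inflected.length(); p++) {
            String rest = inflected.substring(p);
            int common = 0;
            while (common < rest.length() && common < base.length()
                    && rest.charAt(common) == base.charAt(common)) {
                common++;
            }
            int trim = rest.length() - common;
            // Counts are written as 'A' + n here; range limits of a real format are ignored.
            String code = "" + (char) ('A' + p) + (char) ('A' + trim) + base.substring(common);
            if (best == null || code.length() < best.length()) {
                best = code;
            }
        }
        return best;
    }

    // Inverse operation, useful for checking that a code round-trips.
    static String decode(String inflected, String code) {
        int p = code.charAt(0) - 'A';
        int trim = code.charAt(1) - 'A';
        return inflected.substring(p, inflected.length() - trim) + code.substring(2);
    }
}
```

For the pair above, encode("0123123456789", "123456789X") returns "EAX", and both "EAX" and "BJ456789X" decode back to 123456789X. The loop is quadratic in the word length in the worst case, which should be acceptable for dictionary building.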

"taić" as a non-reflexive, imperfective form included twice

E.g., the line is:
taić taić verb:inf:imperf.perf:nonrefl+verb:inf:imperf:refl.nonrefl

As I understand, this expands to:
taić taić verb:inf:imperf:nonrefl
taić taić verb:inf:perf:nonrefl
taić taić verb:inf:imperf:refl
taić taić verb:inf:imperf:nonrefl

and the fact that the line "taić taić verb:inf:imperf:nonrefl" happens twice can confuse some programs.
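
Consumers can guard against this by de-duplicating while expanding. A minimal sketch, assuming + separates alternative tags, : separates fields and . separates alternative values within a field; TagExpander is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TagExpander {
    // Expands every "+"-separated tag into the cartesian product of its "."-alternatives
    // and collects the results in insertion order without duplicates.
    static Set<String> expand(String tags) {
        Set<String> out = new LinkedHashSet<>();
        for (String tag : tags.split("\\+")) {
            List<String> partial = new ArrayList<>();
            partial.add("");
            for (String field : tag.split(":")) {
                List<String> next = new ArrayList<>();
                for (String prefix : partial) {
                    for (String value : field.split("\\.")) {
                        next.add(prefix.isEmpty() ? value : prefix + ":" + value);
                    }
                }
                partial = next;
            }
            out.addAll(partial);
        }
        return out;
    }
}
```

For the taić line this yields three distinct tags instead of four.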

ArrayIndexOutOfBoundsException with replacement-pairs

This exception happens only with master, not with the latest release:

Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at morfologik.speller.HMatrix.get(HMatrix.java:81)
at morfologik.speller.Speller.findRepl(Speller.java:484)
at morfologik.speller.Speller.findRepl(Speller.java:525)
at morfologik.speller.Speller.findReplacements(Speller.java:434)
at org.languagetool.rules.spelling.morfologik.MorfologikSpeller.getSuggestions(MorfologikSpeller.java:90)
at org.languagetool.rules.spelling.morfologik.MorfologikSpellerRule.getRuleMatches(MorfologikSpellerRule.java:182)
at org.languagetool.rules.spelling.morfologik.MorfologikSpellerRule.match(MorfologikSpellerRule.java:119)
at org.languagetool.JLanguageTool.checkAnalyzedSentence(JLanguageTool.java:601)
at org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:937)

It happens if you get suggestions for the word "you" with the recently added Dutch dictionary of LanguageTool, which contains:

fsa.dict.speller.replacement-pairs=y ij

If you remove that line, it works fine. Let me know if you need more details to reproduce this.

Strange words found/others not found

Random things I've stumbled upon:

  • the lexicon includes the word "wieczor" that I don't know and can't find in so.pwn.pl or www.sjp.pl.
  • words like "czasami" or "wieczorem" are tagged only as nouns. Can't they be adverbs in sentences like "Pogoda była wieczorem brzydka a rano ładna"?

Names of some professions have only m1 gender version

The ones I've stumbled upon are:

  • profesor
  • komisarz
  • nadkomisarz
  • mecenas
  • dyrektor
  • magister
  • architekt
  • redaktor
  • biskup
  • prezydent

I would expect there would be both m1 and f forms, with the f form not changing in declension (like it is for "doktor").

Change metadata format for explicit decoder specification

Drop the following metadata keys:
fsa.dict.uses-prefixes=
fsa.dict.uses-infixes=
because this leaves an ambiguity between suffix-compressed and non-compressed dictionaries.

I think the new major release should use an explicit coder name:

fsa.dict.encoder=[suffix|prefix|infix]

This also leaves room for any future encoders, if we come up with something.
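
A sketch of how a reader could resolve the encoder during a transition period. The fallback to the legacy flags, and in particular the default chosen when neither flag is set, is purely an assumption made for illustration. EncoderMetadata is not an existing class.

```java
import java.util.Locale;
import java.util.Properties;

final class EncoderMetadata {
    // Hypothetical resolution order: the explicit key wins, the legacy flags are a fallback.
    static String resolveEncoder(Properties p) {
        String explicit = p.getProperty("fsa.dict.encoder");
        if (explicit != null) {
            return explicit.trim().toLowerCase(Locale.ROOT);
        }
        if (Boolean.parseBoolean(p.getProperty("fsa.dict.uses-infixes", "false"))) {
            return "infix";
        }
        if (Boolean.parseBoolean(p.getProperty("fsa.dict.uses-prefixes", "false"))) {
            return "prefix";
        }
        // This is the ambiguous legacy case described above (suffix-compressed or not
        // compressed at all). The sketch arbitrarily assumes suffix encoding here.
        return "suffix";
    }
}
```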

morfologik maven plugin

Hello,

I wrote a Maven plugin (https://github.com/pminos/morfologik-maven-plugin) for creating FSA dictionaries with morfologik-fsa. For now it can create morphological and synthesizer dictionaries.

I am planning to publish it to the central repository, but first I wanted to ask if you want to merge it to morfologik-stemming. If you are interested, I can continue the development and add support for more types of dictionaries (e.g. for spelling).

Dictionary format

Hi all,
I want to know the format of morfologik's dictionaries. I assume there is more to it than lines like these:
Abe+I
Abel+J
Abelard+F
Abelson+E
Aberconwy+E
I would like to know more details about the format and how to use it.
Thanks in advance.

FSADump: -x not optional

The help output of FSADumpTool describes -x as optional, at least that's what I understand by "if available":

Decode prefix/ infix/ suffix forms (if available).

But when I run java -jar morfologik-tools-1.9.0.jar fsa_dump -x -d pl_PL.dict (the Polish dict from LT), I get this error:

java.lang.RuntimeException: Invalid dictionary entry format (missing separator).
 at morfologik.stemming.DictionaryIterator.next(Unknown Source)
 at morfologik.stemming.DictionaryIterator.next(Unknown Source)
 at morfologik.tools.FSADumpTool.dump(Unknown Source)
 at morfologik.tools.FSADumpTool.go(Unknown Source)
 at morfologik.tools.Tool.go(Unknown Source)
 at morfologik.tools.FSADumpTool.main(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at morfologik.tools.Launcher$ToolInfo.invoke(Unknown Source)
 at morfologik.tools.Launcher.main(Unknown Source)

Without -x, it works. Maybe -x should be optional, i.e. if there's nothing to decode, this option should be ignored.

Filter shouldn't stem words marked as keyword

I would like to mark "agd" as a keyword using solr.KeywordMarkerFilterFactory, so that I can add synonyms after solr.MorfologikFilterFactory:
agd => lodówka, zamrażarka, chłodziarka, piekarnik, etc.

This is not possible right now.

Rethink module dependencies

Make it lean and clean -- FSA reading/ traversals, FSA encoding (builders), then everything else downstream (input encoders, dictionaries, speller, etc.)

No .jar at github

Hi,

I want to start this tool based on the README, but I can't find lib/morfologik-tools-${version}-standalone.jar
Can you help?

Thank you in advance.
Robert
