kornai / 4lang

Concept dictionary

License: MIT License

Shell 1.88% Perl 0.37% Python 74.48% HTML 0.47% TeX 22.80%

4lang's People

recski bolevacz gaebor makrai eszti davidnemeskey filtered-chris dyeopensource evelinacs adaamko holloszaboakos hmrlke jcarlosneto

4lang's Issues

Debug have nodes and change to HAS

The get_machines function in pymachine's lexicon.py contains a hack that changes have nodes to HAS nodes.
Even so, a have node still appears in the fullgraph.

links of root elements in definitions should be moved to definiendum's node

If "dog: faithful animal" then dog -> animal -> faithful should be transformed into dog -> faithful
@gaebor FYI
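
A minimal sketch of the transformation, assuming a definition graph is stored as a set of (source, label, target) triples. This representation and the name `lift_root_links` are illustrative only and are not part of pymachine. The sketch keeps the dog -> animal edge and only re-attaches the root's own links.

```python
def lift_root_links(edges, headword, root):
    """Re-attach the outgoing edges of the definition's root to the headword.

    `edges` is a set of (source, label, target) triples, an assumed and
    simplified stand-in for a 4lang graph.
    """
    lifted = set()
    for src, label, tgt in edges:
        if src == root and tgt != headword:
            # animal -> faithful becomes dog -> faithful
            lifted.add((headword, label, tgt))
        else:
            lifted.add((src, label, tgt))
    return lifted


edges = {("dog", 0, "animal"), ("animal", 0, "faithful")}
print(sorted(lift_root_links(edges, "dog", "animal")))
```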

Appositives require special treatment: dcoref detects them but they are not like pronouns

some generated dot files cause dot errors

If you try to process all dot files for, e.g., longman, you'll see many issues.
Won't fix in the near future.

IN IN IN IN IN IN IN ...

http://people.mokk.bme.hu/~recski/4lang_graphs/eksz_firsts_160211/a/a1bra1zol.jpg

stanford parser should be made aware if the definition is expected to be an NP

This is possible via an API call and crucial for definitions like this one (with current Stanford parse):

wavelength: the size of a radio wave used to broadcast a radio signal

  (ROOT
    (S
      (NP
        (NP (DT the) (NN size))
        (PP (IN of)
          (NP (DT a) (NN radio) (NN wave))))
      (VP (VBD used)
        (S
          (VP (TO to)
            (VP (VB broadcast)
              (NP (DT a) (NN radio) (NN signal))))))))

@pajkossy FYI

need to run stanford parser as a service

This would save the initialization time (~5 s) and especially the time it takes to parse the first sentence after init (~8 s), both of which I currently pay every time I need to test something.

must disambiguate nsubj and csubj relations based on presence of copulars

Currently "The baby is cute" maps to cute ->1 baby
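
A rough sketch of the postprocessing step, assuming dependencies arrive as (relation, governor, dependent) triples in the Stanford convention and that the desired result is baby ->0 cute. The function name and the edge format are hypothetical and not taken from dep_to_4lang.

```python
def postprocess_copulars(deps):
    """Turn nsubj(pred, subj) into a 0-edge subj -> pred when pred has a copula.

    `deps` holds (relation, governor, dependent) triples; the returned
    (source, label, target) edges are an assumed, simplified graph format.
    """
    # predicates that govern a copula, e.g. "cute" in cop(cute, is)
    copular_preds = {gov for rel, gov, dep in deps if rel == "cop"}
    edges = []
    for rel, gov, dep in deps:
        if rel == "nsubj" and gov in copular_preds:
            edges.append((dep, 0, gov))
    return edges


deps = [("det", "baby", "The"), ("nsubj", "cute", "baby"), ("cop", "cute", "is")]
print(postprocess_copulars(deps))  # [('baby', 0, 'cute')]
```

An nsubj whose governor has no copula is left out of the result here, so ordinary verbal subjects would still go through the existing mapping.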

Wrong pickle files

Pickle files linked in the "Downloading pre-compiled graphs" section in the README.md have wrong format and are not compatible with the actual version.

need to explicitly handle NP-modifying relative clauses (rcmod)

add support for conversion between 4lang graphs and attribute value matrices (AVMs)

verbal complexes, esp. phrasal verbs should be detected and treated

upgrade dep_to_4lang to include new dependencies added in 2015 April release of the stanford parser

some words have multiple 4lang definitions for no good reason

I think these are the kinds of multiple entries that we really don't want:

tire fa1rad laboro zme1czyc1_sie1 712 U work, after[=PAT[exhausted]]
tire fa1raszt lasso me1czyc1 714 V =AGT CAUSE[=PAT[exhausted]]

total ege1sz totus cal1y 552 A whole
total teljes totus cal1kowity 2346 A whole
total ve1go2sszeg # # 2764 N sum

sound hang sonus odgl1os 993 u N wave, air[move] IN/2758, human HEAR
sound szo1l sono brzmiec1 2226 u U =AGT MAKE sound/993

suffixes stripped by lemmatization should have some effect, e.g. large-st

add support for wiktionary

By writing a simple parser and plugging it in (parser's nearly done)
Config might have new fields like "longman_input" and "wikt_input" etc. to avoid mixing up formats and parsers

expand can cause multiple 0-edges to nodes with the same name

For example: mouse
mouse HAS long tail
But the machine 'tail' also contains 'long' as its hypernym, so after expanding it, it has two 0 edges to two different 'long' nodes.
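
One way to think about the cleanup, sketched under the assumption that 0-edges are kept as a mapping from a node name to a list of (node_id, printname) targets. Both the mapping and the ids are made up for illustration; a real fix in the expand function would also have to redirect the other edges of the dropped node.

```python
def merge_duplicate_zero_edges(zero_edges):
    """Keep a single 0-edge target per printname.

    `zero_edges` maps a node name to a list of (node_id, printname)
    targets; this structure is an assumption, not the pymachine one.
    """
    merged = {}
    for source, targets in zero_edges.items():
        seen = {}
        for node_id, printname in targets:
            # the first node seen with a given printname wins
            seen.setdefault(printname, node_id)
        merged[source] = [(node_id, name) for name, node_id in seen.items()]
    return merged


zero_edges = {"tail": [(1, "long"), (2, "long")]}
print(merge_duplicate_zero_edges(zero_edges))  # {'tail': [(1, 'long')]}
```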

support universal dependencies

in Mr. Hug story:
34 case
30 nmod
21 compound
5 nummod
4 nmod:poss
4 acl:relcl
1 nmod:tmod
1 nmod:npmod
1 compound:prt
1 acl

@gaebor fyi
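
Counts like the ones above could be gathered with a small tally such as the following. This is only a sketch: the `handled` set is a hypothetical stand-in for whatever dep_to_4lang currently maps, and the input list is toy data, not the Mr. Hug story.

```python
from collections import Counter


def count_unhandled(relations, handled):
    """Count dependency labels that have no mapping yet."""
    return Counter(rel for rel in relations if rel not in handled)


# Toy input; `handled` is an assumed set of already supported labels.
relations = ["nsubj", "case", "nmod", "case", "compound", "nmod:poss"]
handled = {"nsubj", "dobj"}
for label, count in count_unhandled(relations, handled).most_common():
    print(count, label)
```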

add support for the Universal Dependencies format

which is the default output format of the Stanford Parser as of April 2015

some LDOCE words can't be mapped to 4lang words

Some would require aggressive lemmatization that we don't have, e.g. friendship -> friend, dangerous -> danger -- I'm sure we could get these with some simple last-resort stemming strategy, but do we want to? Some other words just haven't been defined yet, e.g. officer, spine
Meanwhile, we can use definitions for these words that we build from Longman.
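
For the sake of discussion, a last-resort strategy could look roughly like this: strip one known suffix and accept the result only if it is already a 4lang word. The suffix list and the function name are illustrative assumptions, not a vetted proposal.

```python
SUFFIXES = ("ship", "ous")  # illustrative only, not a vetted list


def last_resort_stem(word, vocabulary):
    """Return a vocabulary entry obtained by stripping one suffix, else None."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in vocabulary:
                return stem
    return None


vocabulary = {"friend", "danger"}  # toy stand-in for the 4lang word list
print(last_resort_stem("friendship", vocabulary))  # friend
print(last_resort_stem("dangerous", vocabulary))   # danger
print(last_resort_stem("officer", vocabulary))     # None
```

Checking the stem against the vocabulary keeps the stripping conservative: words such as officer still fall through to the Longman-built definitions.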

handle coordination of root elements in definitions

until we're ready to handle coordination everywhere

The very base of the vocabulary (44 items) has a duplication

=agt, =at, =dat, =for, =from, =obl, =pat, =poss, =rel, =to, all, also, an, angry, before, can/1246, cause, characteristic, color, country, er, female, food, for, has, human, identity, inherent, is_a, lack, male, monk, next_to, not, number, or, other, part_of, person, real, round, speak, target, want

The duplication is between =for and for, both of which appear in several places.

Also suspicious: human/person (they have the same definition, just like permit/allow,
but the current algorithm doesn't prune these well).
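
The =for / for clash can be found mechanically. The sketch below runs over the 44 items listed above and reports every plain name that also occurs with a leading "="; the helper name is made up for this example.

```python
BASE_TEXT = """
=agt, =at, =dat, =for, =from, =obl, =pat, =poss, =rel, =to,
all, also, an, angry, before, can/1246, cause, characteristic,
color, country, er, female, food, for, has, human, identity,
inherent, is_a, lack, male, monk, next_to, not, number, or,
other, part_of, person, real, round, speak, target, want
"""
BASE = [item.strip() for item in BASE_TEXT.split(",")]


def case_duplicates(items):
    """Return plain names that also appear as a deep case ("=" + name)."""
    names = set(items)
    return sorted(n for n in names if not n.startswith("=") and "=" + n in names)


print(len(BASE), case_duplicates(BASE))  # 44 ['for']
```

Pairs with identical definitions, like human/person, would need a comparison of the definitions themselves, which this check does not attempt.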

Negation

The same skewed relation we see with binaries is seen in `lack' stuff, now formulated as NOTHAS (44 instances). We have NOTIN (9) and NOTAT (26), but there are a few I don't understand, such as NOTCOP (11) and NOTPRED (4). What are these? Some seem to occur even fewer times, such as NOTINSTRUMENT (1 instance) and NOTPARTOF (2), and there is one that is probably a grep accident:

feature vona1s differentia cecha 3017 u N ' NOTICE, important, typical, CAUSE[recognize]

Machines with empty printnames appear in a few Longman entries, probably caused by IPA symbols

2015-12-02 13:58:53,283 : lexicon (82) - WARNING - empty pn in node: _428206672, word: plosive
2015-12-02 13:58:53,283 : machine (12) - WARNING - empty printname! replacing with "???"
2015-12-02 13:58:55,637 : lexicon (82) - WARNING - empty pn in node: _480693008, word: fricative
2015-12-02 13:58:55,638 : machine (12) - WARNING - empty printname! replacing with "???"
2015-12-02 13:59:01,018 : lexicon (82) - WARNING - empty pn in node: _664230480, word: affricate
2015-12-02 13:59:01,018 : machine (12) - WARNING - empty printname! replacing with "???"
2015-12-02 13:59:01,664 : lexicon (82) - WARNING - empty pn in node: _696589008, word: velar
2015-12-02 13:59:01,664 : machine (12) - WARNING - empty printname! replacing with "???"
2015-12-02 13:59:01,883 : lexicon (82) - WARNING - empty pn in node: _708018640, word: bilabial
2015-12-02 13:59:01,883 : machine (12) - WARNING - empty printname! replacing with "???"

eliminate relative paths from configfiles

...and elsewhere, if any. We will use a few environment variables to set the installation directories of 4lang, the Stanford parser, and jython. These variables will NOT have defaults; they must be set on any box before running 4lang.
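
A minimal sketch of the "no defaults" behaviour. The variable names below are hypothetical (the issue does not fix them), as is the helper `read_install_dirs`.

```python
import os

# Hypothetical variable names; the issue does not specify them.
REQUIRED_VARS = ("FOURLANG_DIR", "STANFORD_PARSER_DIR", "JYTHON_DIR")


def read_install_dirs(environ=os.environ):
    """Return the required directories, failing loudly if any is unset."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise EnvironmentError("unset variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


# Usage with an explicit mapping instead of the real environment:
example_env = {
    "FOURLANG_DIR": "/opt/4lang",
    "STANFORD_PARSER_DIR": "/opt/stanford-parser",
    "JYTHON_DIR": "/opt/jython",
}
print(read_install_dirs(example_env)["FOURLANG_DIR"])
```

Raising at startup, rather than falling back to a guessed path, matches the requirement that the variables be set on every box.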

host packaged hun* binaries

hunpos, ocamorph, hundisambig, and their respective models for English

Plurals in 4lang

Why are there plurals in 4lang? To make it more difficult to process automatically? :)

Examples: cars, lessons, shapes, sides, sleeves, weapons, winds, words.

WordSimilarity doesn't make any connection between many pairs of near-synonyms

E.g.:

crazy   insane  0       0       0       0.066666667     0       0.035714286     4.10155766      9.57    5.46844234
rare    scarce  0       0       0       0       0       0       3.787751129     9.17    5.382248871
inform  notify  0       0       0       0       0       0       3.870539425     9.25    5.379460575
bizarre strange 0       0       0       0       0       0       4.0345673       9.37    5.3354327
defend  protect 0       0       0       0       0       0       3.870539425     9.13    5.259460575

@Eszti FYI

"iobj" dependency not handled

found by @juditacs

postprocessing of coordinations doesn't work sometimes

minimal example: efface - to destroy or remove something

longman_parser: encoding prior to printing needed

in print_defs(), line 95

expansion: incoming edges in definitions don't get copied

because deepcopy doesn't copy them
a fix that didn't work (but may be the right approach and should be debugged) is now commented out in the expand function
@gaebor fyi

entry_preprocessor should merge multiple entries for same word form but keep POS-tags, etc. from each source

e.g. each sense should have its own POS-tag field inherited from the original Longman entry it appeared in

lemmatization of English words should all happen in one place

see the quick and dirty changes introduced in bd0b56e

definitions of abbreviations can and should be detected

Currently they behave in unexpected ways because the "fullform" field is sometimes part of the definition and somehow still outside of , so we obtain definitions like "abbreviation of" and "short for".
These are easy to detect and should be a first simple use case for pointers, once we have those (see #6)
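
Detection could be as simple as the following sketch, assuming the two prefixes quoted above are the only patterns that matter; the regular expression and the function name are illustrative, not existing code.

```python
import re

# Only the two patterns quoted in this issue; other phrasings are not covered.
ABBREV_RE = re.compile(r"^\s*(?:abbreviation of|short for)\b", re.IGNORECASE)


def is_abbreviation_def(definition):
    """True if the definition text starts like an abbreviation pointer."""
    return bool(ABBREV_RE.match(definition))


print(is_abbreviation_def("abbreviation of"))   # True
print(is_abbreviation_def("short for"))         # True
print(is_abbreviation_def("a faithful animal")) # False
```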

separate module for parsing conll-style dependency data

this is now part of magyarlanc_wrapper, but it will be used for directly processing sentences from dependency treebanks
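
A possible shape for such a module, sketched under the assumption of tab-separated CoNLL-X style columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL) with a blank line between sentences. The column layout that magyarlanc_wrapper actually reads may differ, and both function names are made up.

```python
def to_deps(rows):
    """Convert token rows into (relation, head form, dependent form) triples."""
    forms = {"0": "ROOT"}
    forms.update((row[0], row[1]) for row in rows)
    return [(row[7], forms[row[6]], row[1]) for row in rows]


def parse_conll(lines):
    """Yield one list of dependency triples per sentence.

    Assumes CoNLL-X style columns; HEAD is column 7, DEPREL is column 8.
    """
    rows = []
    for line in lines:
        if not line.strip():
            # a blank line closes the current sentence
            if rows:
                yield to_deps(rows)
                rows = []
            continue
        rows.append(line.rstrip("\n").split("\t"))
    if rows:
        yield to_deps(rows)


sample = [
    "1\tJohn\tJohn\tN\tNNP\t_\t3\tnsubj",
    "2\tis\tbe\tV\tVBZ\t_\t3\tcop",
    "3\tfamous\tfamous\tA\tJJ\t_\t0\troot",
    "",
]
print(list(parse_conll(sample)))
```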

old Dependency class should be eliminated

unfortunately, this requires a rewrite of the postprocessing functions for English, but there aren't that many

configured paths should work from any working directory

detect 4langpath?
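
One option, sketched here as an assumption rather than a decision: resolve every relative path in a config file against the directory of that config file instead of the current working directory. The helper name is hypothetical.

```python
import os


def resolve_path(config_file, value):
    """Resolve `value` against the config file's directory, not the CWD."""
    if os.path.isabs(value):
        return value
    base = os.path.dirname(os.path.abspath(config_file))
    return os.path.normpath(os.path.join(base, value))


# Example paths are invented for illustration.
print(resolve_path("/opt/4lang/conf/default.cfg", "../data/graphs"))
```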

need script for rebuilding 4lang graphs

the current way is non-obvious to say the least

4lang graph of "know" contains empty nodes

Wordsim, run with the longman machines on the simlex train_data, says that the worst-fitting word pair is "know-forget". Examining the graph of "know", I found that it has some empty nodes.

New features shouldn't be defined in three places

why are there all kinds of rare binaries in 4lang definitions

e.g. SMOKE, RESPECT, STRIKE, etc.
There are 143 different binaries, but only 18 occur at least ten times.
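
Frequencies like these could be recomputed with a tally along the following lines. The sketch assumes that binaries are exactly the all-uppercase tokens of a definition that are not preceded by "=" (which marks deep cases such as =AGT); that rule and the regular expression are a simplification made for this example.

```python
import re
from collections import Counter

# Uppercase tokens not preceded by "=" or another word character.
BINARY_RE = re.compile(r"(?<![=\w])[A-Z][A-Z_]*\b")


def count_binaries(definitions):
    """Count uppercase (binary) tokens over a list of definition strings."""
    counts = Counter()
    for definition in definitions:
        counts.update(BINARY_RE.findall(definition))
    return counts


definitions = [
    "=AGT CAUSE[=PAT[exhausted]]",
    "wave, air[move] IN/2758, human HEAR",
    "=AGT MAKE sound/993",
]
counts = count_binaries(definitions)
print(sorted(counts))  # ['CAUSE', 'HEAR', 'IN', 'MAKE']
```

Filtering `counts` by a threshold (e.g. at least ten occurrences) would then separate the common binaries from the rare ones.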

word similarity should recognize negation

@Eszti

plain copulars get nsubj, need postprocessing

"John is famous" gets parsed into nsubj(John, famous), but should be turned into John ->0 famous

should handle pointers in longman definitions

The most urgent use case is the simple one where a definition simply redirects us to another, but it'll come in handy later if we also keep track of all occurrences of headwords in definitions.

lemmatizer runs hundisambig incorrectly?

returns "taway" for "tail", which is an ocamorph guess, but not the one that hundisambig picks if tested on the command line

make pre-compiled definition graphs available

For the (semeval) user who doesn't need/want to rebuild them. 4lang, longman(??), wiktionary (pending #19)

need to add negation to graphs

currently we don't treat neg dependencies at all

Config files should be installed

word2lemma in DepTo4lang should be eliminated

It has now become an instance variable rather than a local variable of get_machines_from_deps_and_corefs. That function still overwrites the values for the current sentence, but keeping the mapping on the instance avoids a KeyError when coreference resolution has taken place and a word needs its lemma from an earlier pass.
This process is extremely bug-prone but can easily be eliminated: the Lemmatizer class can take care of everything now.