kornai / 4lang
Concept dictionary
License: MIT License
The get_machines function in pymachine's lexicon.py contains a hack that changes have nodes to HAS nodes.
Still, a have node appears in the fullgraph.
If "dog: faithful animal" then dog -> animal -> faithful should be transformed into dog -> faithful
@gaebor FYI
if you try to process all dot for e.g. longman, you'll see many issues
won't fix in the near future
This is possible via an API call and crucial for definitions like this one (with current Stanford parse):
wavelength: the size of a radio wave used to broadcast a radio signal
(ROOT
  (S
    (NP
      (NP (DT the) (NN size))
      (PP (IN of)
        (NP (DT a) (NN radio) (NN wave))))
    (VP (VBD used)
      (S
        (VP (TO to)
          (VP (VB broadcast)
            (NP (DT a) (NN radio) (NN signal))))))))
@pajkossy FYI
To save the time of initialization (~5s) and especially the time it takes to parse the first sentence after init (~8 secs) every time I need to test something.
Currently "The baby is cute" maps to cute ->1 baby
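The current mapping, and the 0-edge alternative that these notes request elsewhere for "John is famous", can be contrasted in a minimal sketch. The function name, the triple representation, and the use of the Penn Treebank tag prefix JJ to detect adjectival predicates are assumptions made for illustration, not the actual dep_to_4lang code.

```python
def nsubj_to_edge(governor, dependent, governor_pos):
    """Map one nsubj dependency to a (source, edge_type, target) triple.

    Hypothetical helper: the real 4lang mapping is not reproduced here.
    """
    if governor_pos.startswith("JJ"):
        # adjectival predicate: subject ->0 adjective,
        # e.g. "The baby is cute" gives baby ->0 cute
        return (dependent, 0, governor)
    # behaviour reported as current: predicate ->1 subject
    return (governor, 1, dependent)


print(nsubj_to_edge("cute", "baby", "JJ"))
print(nsubj_to_edge("sleeps", "baby", "VBZ"))
```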
Pickle files linked in the "Downloading pre-compiled graphs" section in the README.md have wrong format and are not compatible with the actual version.
I think these are the kinds of multiple entries that we really don't want:
tire fa1rad laboro zme1czyc1_sie1 712 U work, after[=PAT[exhausted]]
tire fa1raszt lasso me1czyc1 714 V =AGT CAUSE[=PAT[exhausted]]
total ege1sz totus cal1y 552 A whole
total teljes totus cal1kowity 2346 A whole
total ve1go2sszeg # # 2764 N sum
sound hang sonus odgl1os 993 u N wave, air[move] IN/2758, human HEAR
sound szo1l sono brzmiec1 2226 u U =AGT MAKE sound/993
By writing a simple parser and plugging it in (parser's nearly done)
Config might have new fields like "longman_input" and "wikt_input" etc. to avoid mixing up formats and parsers
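As a rough sketch of the idea, each input field could select its own parser. Only the field names longman_input and wikt_input come from the note above; the section name "dict", the parser names, and the file path are hypothetical, and the config is assumed to be INI-style.

```python
import configparser

# Hypothetical mapping from config field to parser name.
INPUT_FIELDS = {
    "longman_input": "longman",
    "wikt_input": "wiktionary",
}


def select_inputs(cfg_text, section="dict"):
    """Return (parser_name, path) pairs for every input field that is set."""
    cfg = configparser.ConfigParser()
    cfg.read_string(cfg_text)
    pairs = []
    for field, parser_name in INPUT_FIELDS.items():
        if cfg.has_option(section, field):
            pairs.append((parser_name, cfg.get(section, field)))
    return pairs


# Example config with a made-up path; only the Longman input is set.
EXAMPLE_CFG = """
[dict]
longman_input = data/longman.xml
"""

print(select_inputs(EXAMPLE_CFG))
```

Because each format has its own field, a Wiktionary dump can never be handed to the Longman parser by mistake.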
For example: mouse
mouse HAS long tail
But the machine 'tail' also contains 'long' as its hypernym, so after expanding it, it has two 0 edges to two different 'long' nodes.
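One possible way to look at the problem, sketched on a simplified edge list rather than on real pymachine Machine objects: edges that differ only in the identity of the target node are collapsed into one. The triple representation, the node ids, and the function name are assumptions, and whether merging is the right fix is left open here.

```python
def merge_duplicate_targets(edges):
    """Collapse edges that differ only in the identity of the target node.

    `edges` is an iterable of (source, edge_type, (node_id, printname)).
    """
    seen = {}
    for source, edge_type, (node_id, printname) in edges:
        key = (source, edge_type, printname)
        # keep the first node seen for each (source, edge type, printname)
        seen.setdefault(key, node_id)
    return [(s, e, (nid, name)) for (s, e, name), nid in seen.items()]


edges = [
    ("tail", 0, (1, "long")),  # from "mouse HAS long tail"
    ("tail", 0, (2, "long")),  # from expanding the definition of "tail"
]
print(merge_duplicate_targets(edges))
```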
in Mr. Hug story:
34 case
30 nmod
21 compound
5 nummod
4 nmod:poss
4 acl:relcl
1 nmod:tmod
1 nmod:npmod
1 compound:prt
1 acl
@gaebor fyi
which is the default output format of the Stanford Parser as of April 2015
Some would require aggressive lemmatization that we don't have, e.g. friendship -> friend, dangerous -> danger. I'm sure we could get these with some simple last-resort stemming strategy, but do we want to? Some other words just haven't been defined yet, e.g. officer, spine.
Meanwhile, we can use definitions for these words that we build from Longman.
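A last-resort stemming strategy of the kind mentioned above might look like the following sketch. The suffix list and the function name are hypothetical and not taken from 4lang; a stripped form is accepted only if it is already a known headword, so undefined words such as officer are left alone.

```python
# Hypothetical suffix list, chosen only to cover the two examples above.
SUFFIXES = ("ship", "ous")


def last_resort_stem(word, known_words):
    """Strip one suffix if the result is a known headword, else return word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in known_words:
                return candidate
    return word


known = {"friend", "danger"}
for w in ("friendship", "dangerous", "officer"):
    print(w, "->", last_resort_stem(w, known))
```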
until we're ready to handle coordination everywhere
=agt, =at, =dat, =for, =from, =obl, =pat, =poss, =rel, =to, all, also, an, angry, before, can/1246, cause, characteristic, color, country, er, female, food, for, has, human, identity, inherent, is_a, lack, male, monk, next_to, not, number, or, other, part_of, person, real, round, speak, target, want
The duplication is between =for and for, both of which appear in several places.
Also suspicious: human/person (they have the same definition, just like permit/allow,
but the current algorithm doesn't prune these well).
The same skewed relation we see with binaries is seen in `lack' stuff, now formulated as NOTHAS (44 instances). We have NOTIN (9) NOTAT (26) but we have a few I don't understand, such as NOTCOP (11) and NOTPRED (4), what are these? There seem to be some that occur even fewer times, such as NOTINSTRUMENT (1 instance), NOTPARTOF (2), and one that probably is a grep accident:
feature vona1s differentia cecha 3017 u N ' NOTICE, important, typical, CAUSE[recognize]
2015-12-02 13:58:53,283 : lexicon (82) - WARNING - empty pn in node: _428206672, word: plosive
2015-12-02 13:58:53,283 : machine (12) - WARNING - empty printname! replacing with "???"
2015-12-02 13:58:55,637 : lexicon (82) - WARNING - empty pn in node: _480693008, word: fricative
2015-12-02 13:58:55,638 : machine (12) - WARNING - empty printname! replacing with "???"
2015-12-02 13:59:01,018 : lexicon (82) - WARNING - empty pn in node: _664230480, word: affricate
2015-12-02 13:59:01,018 : machine (12) - WARNING - empty printname! replacing with "???"
2015-12-02 13:59:01,664 : lexicon (82) - WARNING - empty pn in node: _696589008, word: velar
2015-12-02 13:59:01,664 : machine (12) - WARNING - empty printname! replacing with "???"
2015-12-02 13:59:01,883 : lexicon (82) - WARNING - empty pn in node: _708018640, word: bilabial
2015-12-02 13:59:01,883 : machine (12) - WARNING - empty printname! replacing with "???"
...and elsewhere, if any. Will use a few environment variables, to set installation directories of 4lang, stanford parser, jython. These variables will NOT have defaults, they must be set on any box before running 4lang
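A minimal sketch of reading such variables without defaults follows. The variable names are hypothetical; the note above only says which installation directories will be configurable.

```python
import os

# Hypothetical variable names for the 4lang, Stanford parser and Jython
# installation directories.
REQUIRED_VARS = ("FOURLANG_DIR", "STANFORD_PARSER_DIR", "JYTHON_DIR")


def read_install_dirs(environ=None):
    """Return the required directories, failing loudly if any is unset."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        # no defaults on purpose: every box must set these explicitly
        raise EnvironmentError(
            "unset environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Passing a plain dict instead of os.environ makes the check easy to test without touching the real environment.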
hunpos, ocamorph, hundisambig, and their respective models for English
Why are there plurals in 4lang? To make it more difficult to process automatically? :)
Examples: cars, lessons, shapes, sides, sleeves, weapons, winds, words.
E.g.:
crazy insane 0 0 0 0.066666667 0 0.035714286 4.10155766 9.57 5.46844234
rare scarce 0 0 0 0 0 0 3.787751129 9.17 5.382248871
inform notify 0 0 0 0 0 0 3.870539425 9.25 5.379460575
bizarre strange 0 0 0 0 0 0 4.0345673 9.37 5.3354327
defend protect 0 0 0 0 0 0 3.870539425 9.13 5.259460575
@Eszti FYI
found by @juditacs
minimal example: efface - to destroy or remove something
in print_defs(), line 95
because deepcopy doesn't
a fix that didn't work (but is maybe the right way and should be debugged) is now commented out in the expand function
@gaebor fyi
e.g. each sense should have its own POS-tag field inherited from the original Longman entry it appeared in
see the quick and dirty changes introduced in bd0b56e
Currently they behave in unexpected ways because the "fullform" field is sometimes part of the definition and somehow still outside of , so we obtain definitions like "abbreviation of" and "short for".
These are easy to detect and should be a first simple use case for pointers, once we have those (see #6)
this is now part of magyarlanc_wrapper, but it will be used for directly processing sentences from dependency treebanks
unfortunately, this requires a rewrite of the postprocessing functions for English, but there aren't that many
detect 4langpath?
the current way is non-obvious to say the least
Wordsim, run with the longman machines on the simlex train_data, reports that the worst-fitting word pair is "know-forget". Examining the graph of "know", I found that it has some empty nodes.
e.g. SMOKE, RESPECT, STRIKE, etc.
There are 143 different binaries, but only 18 that occur at least ten times.
"John is famous" gets parsed into nsubj(John, famous), but should be turned into John ->0 famous
The most urgent use case is the simple one where a definition simply redirects us to another, but it'll come in handy later if we also keep track of all occurrences of headwords in definitions.
returns "taway" for "tail", which is an ocamorph guess, but not the one that hundisambig picks if tested on the command line
For the (semeval) user who doesn't need/want to rebuild them. 4lang, longman(??), wiktionary (pending #19)
currently we don't treat neg dependencies at all
It has now become an instance variable, as opposed to a local variable of get_machines_from_deps_and_corefs. The function still overwrites values for the current sentence, but this avoids a KeyError when there's been coreference resolution and the word needs its lemma from an earlier pass.
This process is extremely bug-prone, but can easily be eliminated: the Lemmatizer class can take care of everything now.
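The difference can be illustrated with a toy class. Only the idea of an instance-level cache comes from the note above; the class name, the attribute name word2lemma, and the method names are hypothetical.

```python
class DepProcessor:
    """Toy stand-in for the class that owns get_machines_from_deps_and_corefs."""

    def __init__(self):
        # instance-level cache: entries survive from one sentence to the next
        self.word2lemma = {}

    def process_sentence(self, tokens):
        """`tokens` is a list of (word, lemma) pairs for one sentence."""
        for word, lemma in tokens:
            # values for words seen again are overwritten
            self.word2lemma[word] = lemma

    def lemma_of(self, word):
        return self.word2lemma[word]


proc = DepProcessor()
proc.process_sentence([("dogs", "dog")])
proc.process_sentence([("barked", "bark")])
# a coreference link back to "dogs" can still find its lemma
print(proc.lemma_of("dogs"))
```

With a dictionary local to the per-sentence function, the last lookup would raise a KeyError, because "dogs" was only seen in the first sentence.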