
4lang's People

Contributors

adaamko, bolevacz, davidnemeskey, eszti, evelinacs, gaebor, holloszaboakos, juditacs, kornai, makrai, pajkossy, recski


4lang's Issues

Debug have nodes and change to HAS

The get_machines function in pymachine's lexicon.py contains a hack that changes have nodes to HAS nodes.
Nevertheless, a have node still appears in the fullgraph.

need to run stanford parser as a service

Running the parser as a service would save the initialization time (~5 s) and, especially, the time it takes to parse the first sentence after init (~8 s), both of which I currently pay every time I need to test something.

Wrong pickle files

The pickle files linked in the "Downloading pre-compiled graphs" section of README.md have the wrong format and are not compatible with the current version.

some words have multiple 4lang definitions for no good reason

I think these are the kinds of multiple entries that we really don't want:

tire fa1rad laboro zme1czyc1_sie1 712 U work, after[=PAT[exhausted]]
tire fa1raszt lasso me1czyc1 714 V =AGT CAUSE[=PAT[exhausted]]

total ege1sz totus cal1y 552 A whole
total teljes totus cal1kowity 2346 A whole
total ve1go2sszeg # # 2764 N sum

sound hang sonus odgl1os 993 u N wave, air[move] IN/2758, human HEAR
sound szo1l sono brzmiec1 2226 u U =AGT MAKE sound/993
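A minimal sketch of how such multiple entries could be listed automatically. It assumes that the first whitespace-separated field of each dictionary line is the English headword (as in the examples above); the function name is hypothetical and not part of 4lang.

```python
from collections import defaultdict


def find_multiple_entries(lines):
    """Group dictionary lines by their first field, assumed to be the English headword."""
    by_headword = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        by_headword[fields[0]].append(line.strip())
    # keep only headwords that have more than one entry
    return {word: entries for word, entries in by_headword.items() if len(entries) > 1}


# Lines copied from the examples in this issue
sample = [
    "tire fa1rad laboro zme1czyc1_sie1 712 U work, after[=PAT[exhausted]]",
    "tire fa1raszt lasso me1czyc1 714 V =AGT CAUSE[=PAT[exhausted]]",
    "total ve1go2sszeg # # 2764 N sum",
]
duplicates = find_multiple_entries(sample)
for word, entries in duplicates.items():
    print(word, len(entries))
```

Deciding which of the listed entries exist "for no good reason" would still need a manual pass; the sketch only narrows down the candidates.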

add support for wiktionary

By writing a simple parser and plugging it in (the parser is nearly done).
The config might get new fields such as "longman_input" and "wikt_input" to avoid mixing up formats and parsers.
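One way the per-source config fields could be wired to parsers, as a rough sketch. Only the field names "longman_input" and "wikt_input" come from this issue; the section name "data", the paths, and both parser functions are placeholders invented for illustration, and the sketch uses Python 3's configparser rather than whatever 4lang currently uses.

```python
import configparser

# Hypothetical config; only the two field names are taken from the issue text
CONFIG_TEXT = """
[data]
longman_input = /path/to/longman.xml
wikt_input = /path/to/wiktionary.txt
"""


def parse_longman(path):
    """Placeholder for the existing Longman parser."""
    return ("longman", path)


def parse_wiktionary(path):
    """Placeholder for the new Wiktionary parser."""
    return ("wiktionary", path)


# Each input field is tied to exactly one parser, so formats cannot be mixed up
PARSERS = {"longman_input": parse_longman, "wikt_input": parse_wiktionary}


def select_inputs(config, section="data"):
    """Return (parser, path) pairs for every known input field present in the section."""
    return [(PARSERS[key], config[section][key]) for key in PARSERS if key in config[section]]


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)
inputs = select_inputs(config)
for parser, path in inputs:
    print(parser(path))
```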

some LDOCE words can't be mapped to 4lang words

Some would require aggressive lemmatization that we don't have, e.g. friendship -> friend, dangerous -> danger -- I'm sure we could get these with some simple last-resort stemming strategy, but do we want to? Some other words just haven't been defined yet, e.g. officer, spine.
Meanwhile, we can use definitions for these words that we build from Longman.
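For reference, a last-resort stemming strategy of the kind mentioned above might look roughly like this. The suffix list is illustrative only (it is not a vetted set), the function name is hypothetical, and the vocabulary is a toy stand-in for the set of defined 4lang words.

```python
# Illustrative suffixes only; a real list would need to be chosen and tested with care
SUFFIXES = ("ship", "ous", "ness")


def last_resort_lemma(word, vocabulary):
    """Return a vocabulary item for word, stripping one known suffix if needed.

    Returns None when neither the word nor any stripped form is in the vocabulary.
    """
    if word in vocabulary:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in vocabulary:
                return stem
    return None


# Toy vocabulary standing in for the defined 4lang words
vocabulary = {"friend", "danger"}
print(last_resort_lemma("friendship", vocabulary))
print(last_resort_lemma("dangerous", vocabulary))
print(last_resort_lemma("officer", vocabulary))
```

Requiring the stripped form to be in the vocabulary keeps the strategy conservative: words such as officer, which simply have no definition yet, are left unmapped rather than forced onto an unrelated stem.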

The very base of the vocabulary (44 items) has a duplication

=agt, =at, =dat, =for, =from, =obl, =pat, =poss, =rel, =to, all, also, an, angry, before, can/1246, cause, characteristic, color, country, er, female, food, for, has, human, identity, inherent, is_a, lack, male, monk, next_to, not, number, or, other, part_of, person, real, round, speak, target, want

The duplication is between =for and for, both of which appear in several places.

Also suspicious: human/person (they have the same definition, just like permit/allow,
but the current algorithm doesn't prune these well).

Negation

The same skewed relation we see with binaries is seen in `lack' stuff, now formulated as NOTHAS (44 instances). We have NOTIN (9) and NOTAT (26), but there are a few I don't understand, such as NOTCOP (11) and NOTPRED (4). What are these? There seem to be some that occur even fewer times, such as NOTINSTRUMENT (1 instance) and NOTPARTOF (2), and one that is probably a grep accident:

feature vona1s differentia cecha 3017 u N ' NOTICE, important, typical, CAUSE[recognize]
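Counts like the ones above could be reproduced with a small script that also separates out accidental matches such as NOTICE. This is only a sketch: the set of relation names is simply read off the NOT* forms listed in this issue, and the second sample line is made up for illustration.

```python
import re
from collections import Counter

# Relation names taken from the NOT* forms mentioned in this issue
KNOWN_RELATIONS = {"HAS", "IN", "AT", "COP", "PRED", "INSTRUMENT", "PARTOF"}
NOT_TOKEN = re.compile(r"\bNOT([A-Z_]+)\b")


def count_negations(lines):
    """Count NOT-prefixed tokens, keeping unknown ones (e.g. NOTICE) in a separate counter."""
    negations, suspicious = Counter(), Counter()
    for line in lines:
        for match in NOT_TOKEN.finditer(line):
            target = negations if match.group(1) in KNOWN_RELATIONS else suspicious
            target[match.group(0)] += 1
    return negations, suspicious


sample = [
    # entry quoted in this issue
    "feature vona1s differentia cecha 3017 u N ' NOTICE, important, typical, CAUSE[recognize]",
    # made-up line, only here to show a genuine negation being counted
    "x NOTHAS y",
]
negations, suspicious = count_negations(sample)
print(dict(negations), dict(suspicious))
```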

Machines with empty printnames appear in a few Longman entries, probably caused by IPA symbols

2015-12-02 13:58:53,283 : lexicon (82) - WARNING - empty pn in node: _428206672, word: plosive
2015-12-02 13:58:53,283 : machine (12) - WARNING - empty printname! replacing with "???"
2015-12-02 13:58:55,637 : lexicon (82) - WARNING - empty pn in node: _480693008, word: fricative
2015-12-02 13:58:55,638 : machine (12) - WARNING - empty printname! replacing with "???"
2015-12-02 13:59:01,018 : lexicon (82) - WARNING - empty pn in node: _664230480, word: affricate
2015-12-02 13:59:01,018 : machine (12) - WARNING - empty printname! replacing with "???"
2015-12-02 13:59:01,664 : lexicon (82) - WARNING - empty pn in node: _696589008, word: velar
2015-12-02 13:59:01,664 : machine (12) - WARNING - empty printname! replacing with "???"
2015-12-02 13:59:01,883 : lexicon (82) - WARNING - empty pn in node: _708018640, word: bilabial
2015-12-02 13:59:01,883 : machine (12) - WARNING - empty printname! replacing with "???"
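To collect the affected headwords from a longer log, something like the following could be used. It is a sketch that assumes the warning format stays exactly as in the lines above; the function name is hypothetical.

```python
import re

# Matches the lexicon warning shown in the log above
EMPTY_PN = re.compile(r"empty pn in node: (?P<node>\S+), word: (?P<word>\S+)")


def words_with_empty_printnames(log_lines):
    """Return the headwords mentioned in 'empty pn' warnings, in order, without repeats."""
    words = []
    for line in log_lines:
        match = EMPTY_PN.search(line)
        if match and match.group("word") not in words:
            words.append(match.group("word"))
    return words


# Two lines copied from the log in this issue
log = [
    "2015-12-02 13:58:53,283 : lexicon (82) - WARNING - empty pn in node: _428206672, word: plosive",
    '2015-12-02 13:58:53,283 : machine (12) - WARNING - empty printname! replacing with "???"',
]
affected = words_with_empty_printnames(log)
print(affected)
```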

eliminate relative paths from configfiles

...and elsewhere, if any. We will use a few environment variables to set the installation directories of 4lang, the Stanford parser, and Jython. These variables will NOT have defaults; they must be set on any box before running 4lang.
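A sketch of how the "no defaults" rule could be enforced at startup. The variable names FOURLANG_DIR, STANFORD_PARSER_DIR and JYTHON_DIR are hypothetical; the issue does not fix any names.

```python
import os

# Hypothetical variable names; the issue only says there will be "a few"
REQUIRED_VARS = ("FOURLANG_DIR", "STANFORD_PARSER_DIR", "JYTHON_DIR")


def load_install_dirs(environ=os.environ):
    """Read the installation directories, failing loudly if any variable is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        # deliberately no fallback: every box must set these explicitly
        raise RuntimeError("unset environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


# Typical use at program start:
# install_dirs = load_install_dirs()
```

Passing the environment as a parameter keeps the check easy to test with a plain dictionary.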

Plurals in 4lang

Why are there plurals in 4lang? To make it more difficult to process automatically? :)

Examples: cars, lessons, shapes, sides, sleeves, weapons, winds, words.

definitions of abbreviations can and should be detected

Currently they behave in unexpected ways because the "fullform" field is sometimes part of the definition and somehow still outside of , so we obtain definitions like "abbreviation of" and "short for".
These are easy to detect and should be a first simple use case for pointers, once we have those (see #6).
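Detection could be as simple as the sketch below. The two phrases come from this issue; the function name and the example definitions other than "abbreviation of" are made up for illustration.

```python
import re

# The two phrases named in this issue, anchored at the start of the definition
ABBREVIATION = re.compile(r"^\s*(abbreviation of|short for)\b\s*(?P<fullform>.*)$", re.IGNORECASE)


def abbreviation_target(definition):
    """Return the text after the abbreviation phrase, or None for an ordinary definition.

    An empty string means the phrase was found but the full form is missing,
    which is the broken case described in this issue.
    """
    match = ABBREVIATION.match(definition)
    if match is None:
        return None
    return match.group("fullform").strip()


print(repr(abbreviation_target("abbreviation of")))
print(repr(abbreviation_target("short for television")))  # made-up example
print(repr(abbreviation_target("a large animal")))  # made-up example
```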

4lang graph of "know" contains empty nodes

Wordsim, run with the longman machine on the simlex train_data, says that the worst-fitting word pair is "know-forget". Examining the graph of "know", I found that the graph has some empty nodes.

should handle pointers in longman definitions

The most urgent use case is the simple one where a definition simply redirects us to another, but it'll come in handy later if we also keep track of all occurrences of headwords in definitions.

word2lemma in DepTo4lang should be eliminated

It has now become an instance variable, as opposed to a local variable of get_machines_from_deps_and_corefs. That function still overwrites values for the current sentence, but this avoids a KeyError when there has been coreference resolution and the word needs its lemma from an earlier pass.
This process is extremely bug-prone but can easily be eliminated: the Lemmatizer class can take care of everything now.
