yap's Introduction

yap - Yet Another Parser

This repository is no longer maintained.

For the latest and greatest see https://github.com/onlplab/yap

yap is yet another parser written in Go. It was implemented to test the hypothesis of my MSc thesis on Joint Morpho-Syntactic Processing of MRLs (morphologically rich languages) in a Transition Based Framework at IDC Herzliya with my advisor, Reut Tsarfaty. A paper on the morphological analysis and disambiguation aspect for Modern Hebrew and Universal Dependencies was accepted to COLING 2016.

yap is currently provided with a model for Modern Hebrew, trained on a heavily updated version of the SPMRL 2014 Hebrew treebank. We hope to publish the updated treebank soon as well.

yap contains an implementation of the framework and parser of zpar from Z&N 2011 (Transition-based Dependency Parsing with Rich Non-local Features by Zhang and Nivre, 2011) with flags for precise output parity (i.e. bug replication), trained on the morphologically disambiguated Modern Hebrew treebank.

yap is under active development and documentation.

DO NOT USE FOR PRODUCTION

Requirements

  • Go
  • bzip2
  • 4-16 CPU cores
  • ~4.5GB RAM for Morphological Disambiguation
  • ~2GB RAM for Dependency Parsing

Compilation

  • Download and install Go
  • Setup a Go environment:
    • Create a directory (usually one per workspace/project): mkdir yapproj; cd yapproj
    • Set the $GOPATH environment variable to your workspace: export GOPATH=path/to/yapproj
    • In the workspace directory, create three subdirectories: mkdir src pkg bin
    • Change into the src directory: cd src
  • Clone the repository in the src folder of the workspace, then:
cd yap
go get .
go build .
./yap
  • Bunzip the Hebrew MD model: bunzip2 data/hebmd.b32.bz2
  • Bunzip the Hebrew Dependency Parsing model: bunzip2 data/dep.b64.bz2

You may want to use a Go workspace manager or a shell script to set $GOPATH to <.../yapproj>.

Processing Modern Hebrew

Currently, only pipeline morphological analysis, disambiguation, and dependency parsing of pre-tokenized Hebrew text are supported. For Hebrew morphological analysis, the input must have one token per line, with an empty line separating sentences.

The lattice format as output by the analyzer can be used as-is for disambiguation.

For example:

עשרות
אנשים
מגיעים
מתאילנד
...

כך
אמר
ח"כ
...

Note: The input must be in UTF-8 encoding. yap will process ISO-8859-* encodings incorrectly.
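
The input layout above can be produced programmatically. The following is a minimal sketch, not part of yap: formatRaw is a hypothetical helper that writes one token per line, closes each sentence with an empty line, and rejects tokens that are not valid UTF-8. Ending the last sentence with an empty line as well is an assumption made here for uniformity; the text above only requires an empty line between sentences.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// formatRaw renders pre-tokenized sentences as one token per line, with an
// empty line after every sentence (including the last one, which is an
// assumption of this sketch). It returns an error for non-UTF-8 tokens.
func formatRaw(sentences [][]string) (string, error) {
	var b strings.Builder
	for i, sent := range sentences {
		for _, tok := range sent {
			if !utf8.ValidString(tok) {
				return "", fmt.Errorf("sentence %d: token %q is not valid UTF-8", i, tok)
			}
			b.WriteString(tok)
			b.WriteByte('\n')
		}
		b.WriteByte('\n') // empty line closes the sentence
	}
	return b.String(), nil
}

func main() {
	out, err := formatRaw([][]string{
		{"עשרות", "אנשים", "מגיעים", "מתאילנד"},
		{"כך", "אמר"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

The result can be written to a file such as input.raw and passed to the analyzer.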

Commands for morphological analysis and disambiguation:

./yap hebma -raw input.raw -out lattices.conll -stream
./yap md -in lattices.conll -om output.conll -stream

The output of the morphological disambiguator can be used as input for the dependency parser. Command for dependency parsing:

./yap dep -inl output.conll -oc dep_output.conll
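
The three commands form a chain: each stage reads the file written by the previous one. A minimal sketch of driving them from Go is given here, assuming a compiled ./yap binary in the current directory. The helper names pipelineArgs and runPipeline, and the "run" argument of this wrapper, are hypothetical; the yap subcommands and flags are exactly the ones shown above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// pipelineArgs returns the argument lists for the three yap stages in the
// order they must run. Each stage consumes the previous stage's output file.
func pipelineArgs(raw, lattices, disamb, dep string) [][]string {
	return [][]string{
		{"hebma", "-raw", raw, "-out", lattices, "-stream"},
		{"md", "-in", lattices, "-om", disamb, "-stream"},
		{"dep", "-inl", disamb, "-oc", dep},
	}
}

// runPipeline executes every stage with the given binary and stops at the
// first stage that fails.
func runPipeline(yapBin string, stages [][]string) error {
	for _, args := range stages {
		cmd := exec.Command(yapBin, args...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("yap %s: %v", args[0], err)
		}
	}
	return nil
}

func main() {
	stages := pipelineArgs("input.raw", "lattices.conll", "output.conll", "dep_output.conll")
	// Dry run by default: only print the commands. Pass "run" to execute them.
	if len(os.Args) > 1 && os.Args[1] == "run" {
		if err := runPipeline("./yap", stages); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		return
	}
	for _, args := range stages {
		fmt.Println("./yap", strings.Join(args, " "))
	}
}
```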

Citation

If you make use of this software for research, we would appreciate the following citation:

@InProceedings{moretsarfatycoling2016,
  author = {Amir More and Reut Tsarfaty},
  title = {Data-Driven Morphological Analysis and Disambiguation for Morphologically Rich Languages and Universal Dependencies},
  booktitle = {Proceedings of COLING 2016},
  year = {2016},
  month = {december},
  location = {Osaka}
}

HEBLEX, a Morphological Analyzer for Modern Hebrew in yap, relies on a slightly modified version of the BGU Lexicon. Please acknowledge and cite the work on the BGU Lexicon with this citation:

@inproceedings{adler06,
    Author = {Adler, Meni and Elhadad, Michael},
    Booktitle = {ACL},
    Crossref = {conf/acl/2006},
    Editor = {Calzolari, Nicoletta and Cardie, Claire and Isabelle, Pierre},
    Ee = {http://aclweb.org/anthology/P06-1084},
    Interhash = {6e302df82f4d7776cc487d5b8623d3db},
    Intrahash = {c7ac3ecfe40d039cd6c9ec855cb432db},
    Keywords = {dblp},
    Publisher = {The Association for Computer Linguistics},
    Timestamp = {2013-08-13T15:11:00.000+0200},
    Title = {An Unsupervised Morpheme-Based HMM for {H}ebrew Morphological
        Disambiguation},
    Url = {http://dblp.uni-trier.de/db/conf/acl/acl2006.html#AdlerE06},
    Year = 2006,
    Bdsk-Url-1 = {http://dblp.uni-trier.de/db/conf/acl/acl2006.html#AdlerE06}}

License

This software is released under the terms of the Apache License, Version 2.0.

The Apache license does not apply to the BGU Lexicon. Please contact Reut Tsarfaty regarding licensing of the lexicon.

Contact

You may contact me at mygithubuser at gmail, or Reut Tsarfaty at reutts at openu dot ac dot il.

yap's People

Contributors

habeanf, mikelibg

yap's Issues

Please publish the license for BGU Lexicon

Hello! This looks like a potentially very useful tool. However, it is unusable as long as its licensing terms are unclear. Please publish the license for BGU Lexicon files and/or at least document their format and how to create alternative data files.

Empty values on missing fields

I noticed that there is some inconsistency in the morphology file.
It contains several fields like word, lemma, pos, morphology etc. for every word in a sentence.

Following your README, if some field is not available - the value will be '_'.
But in practice, those unavailable fields are just missing, skipping to the next field.

Do you plan to fix that issue?

Another question is, on some morphological field, I get the following result:

התקשרו התקשר VB VB gen=F|gen=M|num=P|per=3|tense=PAST 8 relcomp _ _

What is the meaning of the double appearance of 'gen' ?
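
The thread does not say what the repeated 'gen' means. Whatever the intended reading, code that consumes this field should not assume one value per feature name. A minimal sketch of a tolerant parser follows; parseFeats is a hypothetical helper, not part of yap, and it assumes the '|'-separated name=value layout seen in the line above.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFeats splits a feature string such as "gen=F|gen=M|num=P" into a map
// from feature name to every value seen for it, in order of appearance.
// An empty string or "_" yields an empty map.
func parseFeats(s string) map[string][]string {
	feats := make(map[string][]string)
	if s == "" || s == "_" {
		return feats
	}
	for _, pair := range strings.Split(s, "|") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 {
			continue // skip malformed entries that lack '='
		}
		feats[kv[0]] = append(feats[kv[0]], kv[1])
	}
	return feats
}

func main() {
	feats := parseFeats("gen=F|gen=M|num=P|per=3|tense=PAST")
	fmt.Println("gen:", feats["gen"])
	fmt.Println("num:", feats["num"])
}
```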

Single sentence file may not process successfully

Somewhere between a question and an issue:
If an input file (input.raw) contains only a single sentence, e.g.:

עשרות
אנשים
מגיעים
מתאילנד

Then no tokens seem to be detected and the output file is empty, when running:

./yap hebma -raw input.raw -out lattices.conll

Further experimentation suggests that the first sentence may be ignored, or that some other irregularity causes this. Trying a single sentence is the classic first test, unless there is also an interactive mode that does not require file input.

Thanks in advance for your commenting!

changing api port

I am writing here since the new yap git doesn't have an "issues" tab:
I want to run the YAP API in multiple processes, so I want to assign each process a different port.
Where in the code is the port address (":8000")?
thanks!

Consider adding to the setup passage of the readme

Consider explicitly telling the reader to change into the yap subdirectory (or whatever name the repository was cloned under) before running the go get . command. This step seems to be required for building the project after a clone, yet a newcomer to Go won't necessarily guess it, since the overall setup can look unfamiliar at first glance.

Alternatively, a minor imports style refactor suggested on SO might later follow, in some future tour of the code.

Building with Docker

Hi
I am trying to use YAP. The easiest way was to build it with Docker.
Here is a working Dockerfile:

FROM golang:1.12-buster

RUN apt update && apt install -y bzip2 ca-certificates libgnutls30

RUN mkdir -p /yap/src
COPY . /yap/src/yap

ENV GOPATH=/yap
WORKDIR /yap/src/yap

RUN bunzip2 data/*.bz2
RUN go get .
RUN go build .

EXPOSE 8000

ENTRYPOINT ["/yap/src/yap/yap", "api"]

Could not compile v1.1.0

Trying to compile v1.1.0 yielded the error message:

go/src/yap/nlp/format/taggedsentence/sent.go:46: too few values in struct initializer
go/src/yap/nlp/format/taggedsentence/sent.go:47: too few values in struct initializer

Any idea what is happening? Should I be using another version?

restart api

Great work on the API! I just have trouble when it disconnects. Is there a way to make it restart automatically, instead of my having to type src/yap/./yap api?
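
One generic way to approach this, independent of yap itself, is to launch the server from a small wrapper that starts it again whenever it exits with an error. The following is a minimal sketch under that assumption; supervise and runCommand are hypothetical helpers, and the retry count and delay are arbitrary values chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"time"
)

var errNoCommand = errors.New("no command given")

// runCommand runs args[0] with the remaining arguments and waits for it to exit.
func runCommand(args []string) error {
	if len(args) == 0 {
		return errNoCommand
	}
	cmd := exec.Command(args[0], args[1:]...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

// supervise calls start up to maxRuns times, waiting delay between attempts.
// It stops early when start returns nil (a clean exit) and reports how many
// times start was called.
func supervise(maxRuns int, delay time.Duration, start func() error) int {
	runs := 0
	for runs < maxRuns {
		runs++
		err := start()
		if err == nil {
			return runs
		}
		fmt.Fprintln(os.Stderr, "process exited:", err)
		if runs < maxRuns {
			time.Sleep(delay)
		}
	}
	return runs
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: supervise <command> [args...]")
		return
	}
	// Restart the given command up to 10 times, pausing 2 seconds in between.
	supervise(10, 2*time.Second, func() error { return runCommand(os.Args[1:]) })
}
```

Saved under a name such as supervise.go (a placeholder), it could be started with: go run supervise.go src/yap/yap api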

Status of the SPMRL⥋UD clitics morphological disambiguation issue ― identified in the CoNLL-2017 Shared Task

Hi,

Might you comment on the status of the second "bug" described in the CoNLL-2017 submission article, the one relating to expecting clitics to be represented as per the SPMRL convention rather than the Universal Dependencies one? Actually I'd not necessarily call it a bug, but a standards support aspect, but as long as the repo implies very standard Universal Dependencies as a source format, I'll accede to the exacting "bug" status implied in the article.

Was the code ultimately modified to treat clitics as separate syntactic elements, as implied in the article, or did the temporary remedy described there persist? Or did you enable all of the modes as runtime options (the three options possibly being SPMRL, UD with a special out-of-standard indication, and something like strict UD)?

Thanks for commenting!!

go build error - flags

yapproj\src\yap>go build .

yap/app

app\dep.go:648:9: cannot use *flag.NewFlagSet("dep", flag.ExitOnError) (variable of type "github.com/gonuts/flag".FlagSet) as "flag".FlagSet value in struct literal
app\md.go:879:9: cannot use *flag.NewFlagSet("md", flag.ExitOnError) (variable of type "github.com/gonuts/flag".FlagSet) as "flag".FlagSet value in struct literal
app\jointparse.go:835:9: cannot use *flag.NewFlagSet("joint", flag.ExitOnError) (variable of type "github.com/gonuts/flag".FlagSet) as "flag".FlagSet value in struct literal
app\ma.go:190:9: cannot use *flag.NewFlagSet("ma", flag.ExitOnError) (variable of type "github.com/gonuts/flag".FlagSet) as "flag".FlagSet value in struct literal
app\hebma.go:228:9: cannot use *flag.NewFlagSet("ma", flag.ExitOnError) (variable of type "github.com/gonuts/flag".FlagSet) as "flag".FlagSet value in struct literal
app\all.go:45:16: cannot use *flag.NewFlagSet("app", flag.ExitOnError) (variable of type "github.com/gonuts/flag".FlagSet) as "flag".FlagSet value in struct literal
app\depeval.go:196:9: cannot use *flag.NewFlagSet("depeval", flag.ExitOnError) (variable of type "github.com/gonuts/flag".FlagSet) as "flag".FlagSet value in struct literal
app\genlemmas.go:171:9: cannot use *flag.NewFlagSet("lemmas", flag.ExitOnError) (variable of type "github.com/gonuts/flag".FlagSet) as "flag".FlagSet value in struct literal
app\genunamlemmas.go:134:9: cannot use *flag.NewFlagSet("unamblemmas", flag.ExitOnError) (variable of type "github.com/gonuts/flag".FlagSet) as "flag".FlagSet value in struct literal
app\goldseg.go:151:9: cannot use *flag.NewFlagSet("gseg", flag.ExitOnError) (variable of type "github.com/gonuts/flag".FlagSet) as "flag".FlagSet value in struct literal
app\goldseg.go:151:9: too many errors
