LT2OpenCorpora

Python script that converts the Ukrainian morphological dictionary from the LanguageTool project to the OpenCorpora format. The script runs well under PyPy and also collects statistics, insights, and anomalies found in the input dictionary. Use at your own risk.

It solves these tasks:

  • Parses LanguageTool raw dictionary format
  • Performs some basic sanity checks (and collects some stats about input dict)
  • Converts LanguageTool tags to OpenCorpora tags
  • Groups together wordforms and tries to determine a lemma for the group
  • Exports tagset, tagset restrictions and all the lemmas to OpenCorpora format.
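The tag conversion step can be pictured as a lookup over the colon-separated LanguageTool tag string. The sketch below is only an illustration: the `LT_TO_OPENCORPORA` table and the `convert_tags` helper are hypothetical names, and the handful of pairs listed are assumptions made for the example. The actual correspondence used by the script is the one described in mapping.csv.

```python
# Hypothetical excerpt of a LanguageTool -> OpenCorpora tag table.
# These pairs are illustrative assumptions; the real mapping lives in mapping.csv.
LT_TO_OPENCORPORA = {
    "noun": "NOUN",
    "verb": "VERB",
    "v_naz": "nomn",
    "perf": "perf",
    "imperf": "impf",
}


def convert_tags(lt_tag_string, mapping=LT_TO_OPENCORPORA):
    """Convert 'tag1:tag2:tag3' into a list of mapped tags.

    Returns (converted, unknown) so that unmapped tags can be reported
    as anomalies instead of being silently dropped.
    """
    converted, unknown = [], []
    for tag in lt_tag_string.split(":"):
        if tag in mapping:
            converted.append(mapping[tag])
        else:
            unknown.append(tag)
    return converted, unknown
```

Returning the unknown tags separately fits the sanity-check step listed above: anything the table does not cover can be counted and inspected rather than lost.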

It's all about grouping

Grouping wordforms under a particular lemma is cumbersome for various reasons, mostly because of homonymy and the internal format of the LanguageTool dictionary. In short:

  • An entry in the LanguageTool dictionary looks like this: wordform tag1:tag2:tag3 lemma, where lemma is just a string.
  • Because of homonymy, you cannot tell exactly which lemma an entry refers to.
  • So you can only apply a bunch of heuristics: the lemma should have the same POS as the wordform, and the lemma should carry particular tags. For example, for nouns every lemma should have the :v_naz tag.
  • Another problem with the heuristics is that many verb lemmas look the same for the :perf and :imperf tags. But those are two different lemmas, each with its own wordforms!
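The heuristics above can be sketched roughly as follows. This is a minimal illustration, not the logic of convert.py: the names `Entry`, `parse_entry`, `group_entries`, and `lemma_candidates` are hypothetical, the entry is assumed to be whitespace-separated, and the first tag is assumed to be the POS. Wordforms are grouped by lemma string, POS, and aspect, so that :perf and :imperf verbs with identical lemma strings end up in separate groups.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    wordform: str
    tags: tuple
    lemma: str


def parse_entry(line):
    """Parse 'wordform tag1:tag2:tag3 lemma' (assumed whitespace-separated)."""
    wordform, tag_string, lemma = line.split()
    return Entry(wordform, tuple(tag_string.split(":")), lemma)


def group_key(entry):
    # Assumption: the first tag is the part of speech.
    pos = entry.tags[0]
    # Keep perfective and imperfective verbs apart even if the lemma string matches.
    aspect = next((t for t in entry.tags if t in ("perf", "imperf")), None)
    return (entry.lemma, pos, aspect)


def group_entries(lines):
    """Group parsed entries by (lemma string, POS, aspect)."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():
            entry = parse_entry(line)
            groups[group_key(entry)].append(entry)
    return dict(groups)


def lemma_candidates(group):
    """Entries that could be the lemma of the group.

    The wordform must equal the lemma string; for nouns it must also
    carry the v_naz tag.
    """
    return [
        e for e in group
        if e.wordform == e.lemma and (e.tags[0] != "noun" or "v_naz" in e.tags)
    ]


# Made-up sample entries, for illustration only.
sample = [
    "кіт noun:m:v_naz кіт",
    "кота noun:m:v_rod кіт",
    "атакувати verb:imperf:inf атакувати",
    "атакувати verb:perf:inf атакувати",
]
groups = group_entries(sample)
```

With this sample, the noun forms share one group while the two verb entries land in two different groups, despite having the same lemma string.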

Prerequisites

pip install -r requirements.txt

Batteries included

  • mapping.csv with general information about the tagset used in the Ukrainian morphological dictionary. Exported from here.
  • An excerpt (first 1000 words) from the Ukrainian morphological dictionary.

Detailed visualisation of the mapping between tagsets

Mapping

  • Cream nodes are for tags found only in OpenCorpora
  • Blue nodes are for tags from LanguageTool only
  • Green nodes are for tags that can be found in both
  • The LT tag name is at the top
  • The OpenCorpora tag name is at the bottom
  • Blue links are for OpenCorpora
  • Orange links are for LT

Running

python convert.py 1000.txt out.xml --debug
