sanskrit_parser

Parsers for Sanskrit / संस्कृतम्

NOTE: This project is still under development. Both over-generation (invalid forms/splits) and under-generation (missing valid forms/splits) are quite likely. Please see the Sanskrit Parser Stack section below for detailed status. Report any issues here.

Please feel free to ping us if you would like to collaborate on this project.

Try it out!

A web interface is available at https://kmadathil.github.io/sanskrit_parser/ui/index.html

Installation

This project has been tested and developed using Python 2.7. A port to Python 3 has been completed, and everything should now work in both versions of Python.

pip install sanskrit_parser

Usage

Deploying REST API server

Run:

sudo mkdir /var/www/.sanskrit_parser
sudo chmod a+rwx /var/www/.sanskrit_parser

Contribution

  • Generate docs: cd docs; make html

Sanskrit Parser Stack

Stack of parsing tools

Level 0

Sandhi splitting subroutine

Input: Phoneme sequence and the phoneme number to split at

Action: Perform a sandhi split at the given phoneme number

Output: Left and right sequences (multiple options will be output). No semantic validation is performed at this level; that is left to the higher levels.
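A minimal sketch of such a subroutine, assuming an SLP1-style transliteration with one character per phoneme and a tiny hand-written rule table. TOY_RULES and split_at are hypothetical names used for illustration only; they are not the API of lexical_analyzer/sandhi.py, and emitting an unchanged cut before the given position is likewise an assumption.

```python
# Hypothetical toy rule table (not the project's rule files): maps a surface
# phoneme to the (left-final, right-initial) pairs that may have produced it.
TOY_RULES = {
    "A": [("a", "a"), ("a", "A"), ("A", "a"), ("A", "A")],
    "e": [("a", "i"), ("a", "I"), ("A", "i"), ("A", "I")],
    "o": [("a", "u"), ("a", "U"), ("A", "u"), ("A", "U")],
}


def split_at(phonemes, index):
    """Return candidate (left, right) pairs for a split at phonemes[index].

    No validation is done here: every candidate is returned, and checking
    whether the pieces are real words is left to the caller.
    """
    candidates = []
    # Option 1 (assumed): cut just before the phoneme, with no sandhi change.
    if 0 < index < len(phonemes):
        candidates.append((phonemes[:index], phonemes[index:]))
    # Option 2: undo a sandhi that merged two phonemes into phonemes[index].
    for left_end, right_start in TOY_RULES.get(phonemes[index], []):
        left = phonemes[:index] + left_end
        right = right_start + phonemes[index + 1:]
        candidates.append((left, right))
    return candidates
```

For example, split_at("ceti", 1) includes the pair ("ca", "iti") among its candidates, along with several pairs that a higher level would have to reject.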

Current Status

The module that performs sandhi split/join, with support for convenient rule definition, is at lexical_analyzer/sandhi.py.

Rule definitions (human readable!) are at lexical_analyzer/sandhi_rules/*.txt

Level 1

  • From dhAtu + lakAra + puruSha + vachana to pada and vice versa
  • From prAtipadika + vibhakti + vachana to pada and vice versa
  • Upasarga + dhAtu forms - forward and backwards
  • nAmadhAtu forms
  • Krt forms - forwards and backwards
  • Taddhita forms - forwards and backwards

Current Status

To be done.

However, we have a usable solution: inriaxmlwrapper combined with Prof. Gerard Huet's forms database acts as a queryable form database. That gives us the bare minimum we need from Level 1, so that Level 2 can work.
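To illustrate what a queryable form database provides, here is a minimal sketch of a two-way lookup between a pada and its tags. The table contents, the tag spellings, and the names FORMS, generate and analyze are assumptions made for this example; they do not reflect the interface of inriaxmlwrapper or the layout of the actual database.

```python
# Hypothetical toy forms table in SLP1-style transliteration.
# Key: (dhAtu, lakAra, puruSha, vachana); value: the resulting pada.
FORMS = {
    ("gam", "laT", "prathama", "eka"): "gacCati",
    ("gam", "laT", "prathama", "dvi"): "gacCataH",
    ("gam", "laT", "prathama", "bahu"): "gacCanti",
}

# Reverse index. A pada can have several analyses, so each maps to a list.
ANALYSES = {}
for tags, pada in FORMS.items():
    ANALYSES.setdefault(pada, []).append(tags)


def generate(dhatu, lakara, purusha, vachana):
    """Forward direction: tags to pada (None if the form is unknown)."""
    return FORMS.get((dhatu, lakara, purusha, vachana))


def analyze(pada):
    """Backward direction: pada to all known tag tuples (possibly empty)."""
    return ANALYSES.get(pada, [])
```

Level 2 only needs the backward direction: given a candidate left section, analyze tells it whether the section is a known pada and with which tags.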

Level 2

Input

Sanskrit Sentence

Action

  • Traverse the sentence, splitting it (or not) at each location to determine all possible valid splits

  • Traverse from left to right

  • Using dynamic programming, assemble the results of all choices

    To split or not to split at each phoneme

    If split, all possible left/right combinations of phonemes that can result

    Once split, check if the left section is a valid pada (use level 1 tools to pick pada type and tag morphologically)

    If left section is valid, proceed to split the right section

  • At the end of this step, we will have all possible syntactically valid splits with morphological tags

Output

All semantically valid sandhi split sequences
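The traversal described above can be sketched as a memoized recursion. This is a simplified illustration under stated assumptions, not the code in the lexical analyzer module: LEXICON stands in for the Level 1 lookup, RULES is a tiny hypothetical sandhi table in SLP1-style transliteration, and morphological tags are omitted.

```python
from functools import lru_cache

# Hypothetical stand-ins for the Level 1 lookup and the Level 0 rules.
LEXICON = {"tatra", "iti", "ca", "na", "asti", "nAsti"}
RULES = {
    "A": [("a", "a"), ("a", "A"), ("A", "a"), ("A", "A")],
    "e": [("a", "i"), ("a", "I"), ("A", "i"), ("A", "I")],
}


def candidates(text, i):
    """All (left, right) options for splitting text at position i."""
    yield text[:i], text[i:]  # plain cut, no sandhi change
    for left_end, right_start in RULES.get(text[i], []):
        yield text[:i] + left_end, right_start + text[i + 1:]


def all_splits(sentence):
    """Return every way to split sentence into padas found in LEXICON."""

    @lru_cache(maxsize=None)  # memoization: each suffix is solved only once
    def solve(rest):
        results = []
        if rest in LEXICON:  # choice: do not split any further
            results.append((rest,))
        for i in range(1, len(rest)):  # choice: split at position i
            for left, right in candidates(rest, i):
                if left in LEXICON:  # only a valid left pada lets us continue
                    results.extend((left,) + tail for tail in solve(right))
        return tuple(results)

    return [list(split) for split in solve(sentence)]
```

With this toy data, all_splits("tatreti") yields the single split tatra + iti, while all_splits("nAsti") yields both the unsplit word and na + asti. The right-hand remainder is always shorter than the input, so the recursion terminates.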

Current Status

The module that performs sentence splitting is at lexical_analyzer/SanskritLexicalAnalyzer.py

Level 3

Input

Semantically valid sequence of tagged padas (output of Level 2)

Action

  • Assemble graphs of morphological constraints

    viseShaNa - viseShya

    karaka/vibhakti

    vachana/puruSha constraints on tiGantas and subantas

  • Check validity of graphs

Output

  1. Is the input sequence a morphologically valid sentence?
  2. Enhanced sequence of tagged padas, with karakas tagged, and a dependency graph associated
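A minimal sketch of this kind of check, covering only one constraint family. It assumes an active-voice sentence in which a prathamA (first vibhakti) noun agreeing in vachana with the verb is the kartA and a dvitIyA (second vibhakti) noun is the karma. The Pada class, its fields, and both function names are hypothetical and are not taken from the morphological analyzer module; puruSha and viseShaNa-viseShya agreement are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pada:
    text: str
    kind: str                       # "subanta" or "tiGanta"
    vachana: str                    # "eka", "dvi" or "bahu"
    vibhakti: Optional[int] = None  # 1..7, subantas only


def build_edges(padas):
    """Return (verb, noun, label) edges allowed by the toy constraints."""
    edges = []
    verbs = [p for p in padas if p.kind == "tiGanta"]
    nouns = [p for p in padas if p.kind == "subanta"]
    for verb in verbs:
        for noun in nouns:
            # kartA: first vibhakti and the same vachana as the verb
            if noun.vibhakti == 1 and noun.vachana == verb.vachana:
                edges.append((verb.text, noun.text, "kartA"))
            # karma: second vibhakti (no agreement needed)
            elif noun.vibhakti == 2:
                edges.append((verb.text, noun.text, "karma"))
    return edges


def is_valid_sentence(padas):
    """Valid here means: exactly one verb, and every noun attaches to it."""
    verbs = [p for p in padas if p.kind == "tiGanta"]
    if len(verbs) != 1:
        return False
    attached = {noun for _, noun, _ in build_edges(padas)}
    return all(p.text in attached for p in padas if p.kind == "subanta")
```

For the tagged sequence rAmaH (vibhakti 1, eka), vanam (vibhakti 2, eka), gacCati (eka), the edges are a kartA edge to rAmaH and a karma edge to vanam, and the sentence is accepted. Pairing rAmaH with the plural gacCanti leaves rAmaH unattached, so the sequence is rejected.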

Current Status

An early experimental version (simple sentences only) is at morphological_analyzer/SanskritMorphologicalAnalyzer.py

Seq2Seq based Sanskrit Parser

See: Grammar as a Foreign Language, Vinyals, Kaiser et al. (Google), http://arxiv.org/abs/1412.7449

  • Method: Seq2Seq Neural Network (n? layers)
  • Input Embedding with word2vec (optional)

Input

Sanskrit sentence

Output

Sentence split into padas with tags

Train/Test data

DCS corpus, converted by Vishvas Vasuki

Current Status

Not begun
