Giter Club home page Giter Club logo

anuvaad's Introduction

Anuvaad

State of the art translation models for Indic languages.

Installation

# CPU pytorch will be installed if torch is not installed
pip install --upgrade anuvaad

Usage

As a Python module

from anuvaad import Anuvaad
anu = Anuvaad("english-telugu")

# Single sentence translation
# beam_size is optional and defaults to 4
anu.anuvaad("YS Jagan is the chief minister of Andhra Pradesh.")
# "వైఎస్ జగన్ ఆంధ్రప్రదేశ్ ముఖ్యమంత్రి."

# Batch translation
anu.anuvaad(["YS Jagan is the chief minister of Andhra Pradesh.",
            "Nara Lokesh suffered a humiliating defeat in Mangalagiri."])
# ['వైఎస్ జగన్ ఆంధ్రప్రదేశ్ ముఖ్యమంత్రి.', 'మంగళగిరిలో నారా లోకేష్కు అవమానకరమైన ఓటమి ఎదురైంది.']

As a service

# Starting the api service
docker run -it -e BATCH_SIZE=1 -p 8080:8080 notaitech/anuvaad:english-telugu

# Running a prediction
curl -d '{"data": ["YS Jagan is the chief minister of Andhra Pradesh."]}' -H "Content-Type: application/json" -X POST http://localhost:8080/sync
Available Models
english-telugu
english-tamil
english-malayalam
english-kannada
english-marathi
english-hindi

My thoughts on the evaluation/accuracy of the model(s):

  1. Unlike classification/ sequence labelling tasks, for open-domain translation or summarization systems it is very hard to quantify the accuracy through numbers.
  2. This is because, most accuracy measurements actually measure the overlap of character/word n-grams between the expected output and predicted output.
  3. These scores definitely help when evaluating/comparing multiple models on a particular dataset, but the number don't translate well for open-domain models.
  4. For example, Anuvaad translates the sentence An advance is placed with the Medical Superintendents of such hospitals who then provide assistance on a case to case basis. (taken from http://data.statmt.org/pmindia/v1/parallel corpus) to ऐसे अस्पतालों के चिकित्सा अधीक्षकों के साथ एडवांस रखा जाता है, जिसके बाद मामले के आधार पर सहायता प्रदान की जाती है। where as the expected translation of the sentence from the dataset is अग्रिम धन राशि इन अस्पतालों को चिकित्सा निरीक्षकों को दी जाएगी, जो हर मामले को देखते हुए सहायता प्रदान करेंगे।.
  5. In the above example, Although Anuvaad's translation is correct (in the sense that translation conveys the same thing as the original sentence), the BLEU score with n=3 will be 0.
  6. Similarly, a model trained on the pmindia dataset will have bad score on a different dataset which uses a different style of writing, even if the translation is semantically correct.
  7. Our aim in building Anuvaad is to build a general purpose, open-domain translation module that can flexibly translate text from various domains.
  8. https://docs.google.com/spreadsheets/d/1_TTtBEvVgemQfGbRBSZYkECMMt5r7L9-dt0FGVUbmOY/edit?usp=sharing is a sheet comparing translations from Anuvaad, ilmulti (https://github.com/jerinphilip/ilmulti) and Google Translate (=GOOGLETRANSLATE(text, "en", "language") function on google sheets) on 100 randomly selected English sentences from Tatoeba.

anuvaad's People

Contributors

bedapudi6788 avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.