gsoc2019-sphinx's Introduction

🚀 Google Summer Of Code 2019 Project - Creation of an online Greek mail dictation system

Welcome to the home repository of "Creation of an online Greek mail dictation system using Sphinx and personalized acoustic/language model training".

This project is implemented as a Google Summer of Code 2019 project, under the auspices of Open Technologies Alliance - GFOSS.

About the project

With over 2.6 billion active users and over 4.6 billion email accounts in operation, email is the most important and widely used communication medium on the Internet. Over the last fifteen years, the rise of social networks and chat applications has changed the role of email, which is now used mainly for business rather than casual conversation. For many people, the inbox is the first thing they check on arriving at work. Because email has become so central to business communication, convenience, speed and accuracy matter. Dictating emails instead of typing them can provide all three: users can compose messages while moving around, finish them faster than by typing, and make fewer spelling mistakes, since the computer is responsible for correct spelling.

In the era of Big Data, many dictation systems already reach high accuracy on standard metrics. However, each system targets a specific language, because spoken languages differ so widely. A Greek mail dictation system therefore has to be built from scratch around the Greek language and its particular characteristics. A basic obstacle is that training requires a large set of human-transcribed recordings, while only very small Greek speech datasets are available. The purpose of this project is therefore a personalized Greek mail dictation system that is trained on the speech of each user (speaker dependent). The system asks the user for some dictations at the start and trains on those recordings, which sidesteps the lack of data. This restriction does not pose a problem, since each email address corresponds to a single user. In addition, the system's performance will be enhanced by adapting the language model to the user's existing emails. Extra utilities, such as special dictation commands and email replay, will ease user interaction and make the whole procedure faster and more practical.

Demo

The project is hosted at https://snf-870149.vm.okeanos.grnet.gr.

Note: For now, both the webpage and the API use self-signed SSL certificates. Before using the webpage, the user must therefore accept both certificates by visiting https://snf-870149.vm.okeanos.grnet.gr and https://snf-870149.vm.okeanos.grnet.gr:5000 and, on each, clicking Advanced and then Proceed to url.

Timeline and Documentation

  • A detailed timeline, organized by GSoC period, can be found here.
  • Progress was tracked on a daily basis in Project.
  • More details can be found in Wiki and in the Final Report.

A block diagram of the whole system follows:

[Figure: Overview]

Technologies used

  • The project is written in Python 3.x, using the Python packages listed in the requirements file.
  • Speech recognition is done with the pocketsphinx library from CMUSphinx.
  • All language models are created using SRILM.
  • All required user data is stored in MongoDB.
  • The UI is built with Angular 8.

Project Deliverables

  1. Tool for extracting and cleaning sent emails of a Gmail user. Code Wiki
  2. Tool for creating adapted language models through email clustering. Code Wiki
  3. Tool for correcting ASR output. Code Wiki
  4. Various tools for preparing and evaluating a speech dataset. Code Wiki
  5. Simple tool for creating a speech dataset. Code
  6. API written in Flask. Code Wiki
  7. Online webpage using Angular 8. Code Wiki

People

  • Google Summer of Code 2019 Student: Panagiotis Antoniadis (PanosAntoniadis)
  • Mentor: Andreas Symeonidis (asymeon)
  • Mentor: Manos Tsardoulias (etsardou)

gsoc2019-sphinx's Issues

(2) Train the acoustic model on the paramythi dataset [2]

  • Split data into train and test set (70% - 30%)
  • Remove punctuation and convert all words to lowercase.
  • Extend the dictionary using the training set and Phonetisaurus.
  • Train the acoustic model on the train set.
  • Evaluate it on the test set.
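
The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names `normalize_transcript` and `split_dataset`, the fixed random seed, and the choice to replace punctuation with spaces are all assumptions made for the example.

```python
import random
import re


def normalize_transcript(text):
    """Lowercase a transcript and strip punctuation (hypothetical helper)."""
    text = text.lower()
    # \w is Unicode-aware in Python 3, so Greek letters are kept;
    # every other non-space character is replaced by a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return " ".join(text.split())


def split_dataset(items, train_ratio=0.7, seed=0):
    """Shuffle items reproducibly and split them into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = split_dataset(range(10))
    print(len(train), len(test))
    print(normalize_transcript("Καλημέρα, Κόσμε!"))
```

Shuffling with a fixed seed keeps the 70% - 30% split reproducible between runs, which matters when the dictionary is extended from the training set only.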

(1) Fix issues regarding email processing [1]

  • Encoding errors because some messages are in a different encoding.
  • Remove HTML without the BeautifulSoup library, because it changes spaces and newlines.
  • Keep only sent emails.
  • Keep only the original message and remove previous conversations.
  • Improve documentation.
  • Remove signatures.
  • Remove punctuation and convert to lowercase.
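
A few of the points above (encoding handling, HTML removal without BeautifulSoup, and dropping previous conversations) can be sketched with the Python standard library alone. This is an illustrative sketch, not the code in the repository: the helper names are hypothetical, and the rule for detecting quoted replies (lines starting with ">" or an English "On ... wrote:" header) is an assumed heuristic that would need adapting for Greek mail clients. Signature removal is not covered here.

```python
import re
from email import message_from_bytes
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and leaves their spaces and newlines untouched."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def decode_body(raw_bytes):
    """Decode a raw message using the charset it declares."""
    msg = message_from_bytes(raw_bytes)
    part = msg
    if msg.is_multipart():
        # Prefer the first plain-text part when there are several.
        for candidate in msg.walk():
            if candidate.get_content_type() == "text/plain":
                part = candidate
                break
    payload = part.get_payload(decode=True) or b""
    charset = part.get_content_charset() or "utf-8"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name: fall back to UTF-8 (an assumption).
        return payload.decode("utf-8", errors="replace")


def drop_quoted_reply(text):
    """Keep only the new message; assumed heuristic for quoted replies."""
    kept = []
    for line in text.splitlines():
        if re.match(r"^On .* wrote:$", line.strip()):
            break
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Reading the charset from each message, rather than assuming UTF-8 everywhere, is what avoids the encoding errors mentioned in the first bullet.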

(2) Train the acoustic model on the Journalism dataset [2]

  • Split data into train and test set (70% - 30%)
  • Remove punctuation and convert all words to lowercase.
  • Extend the dictionary using the training set and Phonetisaurus.
  • Train the acoustic model on the train set.
  • Evaluate it on the test set.
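
The issue does not say which metric is used for evaluation. Assuming word error rate (WER), a common choice for speech recognition, the evaluation step could look like the sketch below. The function name `word_error_rate` is hypothetical, and both inputs are assumed to be already normalized (lowercase, no punctuation).

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (b -> x) and one deletion (d): 2 errors / 4 words.
    print(word_error_rate("a b c d", "a x c"))
```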

(2) Organize the Journalism dataset into a standard form. [2]

  • Remove data with missing recordings or texts.
  • Organize data in pairs (recording - text).
  • Convert all recordings to wav format and 16 kHz sample rate.
  • Normalize the text: convert numbers to words and remove punctuation.
  • Expand abbreviations.
  • Convert the text to the uniform format that Sphinx requires.
  • Upload it to Dropbox.
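
As a rough sketch of the conversion and formatting steps above: the snippet builds an ffmpeg command for a 16 kHz wav file and produces the `<s> ... </s> (fileid)` transcription lines that SphinxTrain reads. The helper names are hypothetical, the use of ffmpeg is an assumption (the issue does not name a conversion tool), and so is the mono channel setting. Normalizing the text is assumed to have happened beforehand.

```python
from pathlib import Path


def ffmpeg_command(src, dst):
    """Build (not run) an ffmpeg call that resamples to 16 kHz mono wav."""
    return ["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)]


def sphinx_lines(pairs):
    """Turn (wav_path, text) pairs into fileids and transcription lines."""
    fileids, transcriptions = [], []
    for wav_path, text in pairs:
        # The file id is the recording name without directory or extension.
        fileid = Path(wav_path).stem
        fileids.append(fileid)
        transcriptions.append(f"<s> {text} </s> ({fileid})")
    return fileids, transcriptions


if __name__ == "__main__":
    ids, lines = sphinx_lines([("data/rec_001.wav", "καλημέρα κόσμε")])
    print(ids[0])
    print(lines[0])
    # Pass the list to subprocess.run(..., check=True) to actually convert.
    print(" ".join(ffmpeg_command("data/rec_001.mp3", "data/rec_001.wav")))
```

Keeping the recording and its text in one pair until the last step makes it easy to drop entries with a missing recording or transcript before any files are written.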