
peteprattis / automatic-speech-recognision-system-asr


A Python script that implements an automatic speech recognition system.

License: MIT License

Python 100.00%
python asr automatic-speech-recognition signal signal-processing fir-filter nyquist short-time-signal-analysis short-time-fourier-transform librosa mel-frequency-cepstral-coefficients mfcc dynamic-time-warping dtw student computer-science program

automatic-speech-recognision-system-asr's Introduction

A Python Program / Project

This is a Python project from my early days as a Computer Science student

This program was created for the eighth-semester class Voice and Audio Signal Processing and is the final project for the class.

Description of project

A Python script that implements an automatic speech recognition system.

Implementation of project

  1. Resample the input signal from its initial sampling rate of 22050 Hz down to 8000 Hz.
  2. Apply an FIR filter to eliminate the DC component and the 60 Hz hum. The passband is also limited to the maximum frequency Fn, defined as half the sampling rate Fs, in agreement with Nyquist's theorem Fs >= 2 * Fn.
  3. Define 2 important variables, L and R, which allow the signal to be examined closely so that its key characteristics can be extracted later. This is the most basic part and introduces Short Time Signal Analysis. In more detail, L is the length of one frame of the signal to be processed, and R is the sliding step between successive frames. These values are not arbitrary: according to the literature, an "effective" window length L is 10-40 ms, with a sliding step R of 10 ms.
  4. Compute the spectrogram of the signal, which proved useful, as follows:
  • Calculate the STFT (Short Time Fourier Transform) for each frame window
  • Convert the STFT coefficients to dB (decibels), i.e. a log10 scaling of the amplitude (amplitude_to_db command)
  5. Calculate the zero-crossing rate of the signal, using the frame length L with sliding step R, and multiply it by 100 to display it as a percentage. The zero-crossing rate counts the changes of algebraic sign between successive samples. The rate at which these changes occur is a simple measure of the frequency content of a signal.
  6. Observation shows that in the unvoiced parts of the signal the zero-crossing rate is quite high compared to the voiced parts.
  • However, this measurement alone is not sufficient to properly distinguish between the voiced and unvoiced parts of the signal, so the short-time energy of the signal was also calculated. It reflects the variation in amplitude: in a typical speech signal, the amplitude changes significantly over time. The short-time energy, defined as the sum of the squares of the samples in a frame, is calculated, like the zero-crossing rate, using the given frame length L and sliding step R. Indeed, the short-time energy shows high values for the voiced parts of the signal and significantly lower values for the unvoiced parts.
  7. The problem of separating the voiced and unvoiced parts of the signal was solved with the onset_detect function of the librosa library. In more detail, the method implemented around this function is:
  • Create a time-reversed copy of the signal
  • Run onset_detect on both the original and the reversed signal, specifying the sampling rate, frame length and sliding step, and enabling backtracking for best results
  • Convert the result from frames to times and samples
  • In the original signal, the onset points mark where each sound begins. In the reversed signal, these points essentially mark the end of each word. To use them for a definitive separation of the voiced and unvoiced parts of the signal, it is enough to subtract the onset times (in seconds) from the total duration of the signal
  • Finally, convert the results into samples to cut the signal into its coherent parts (digits)
  • Display the spectrogram with the resulting onset points, showing the separation between the voiced and unvoiced parts of the signal
  8. After the input signal has been processed, the speech recognition process begins. Since the system is meant to identify the digits from 0 to 9, we assembled a database of 3 speakers, each speaking the digits 0-9. This makes the system not very reliable because, according to the literature, 100 speakers each recording every digit 10 times would give a strong training set that yields accurate and reliable word (digit) recognition models. It should be mentioned that the training signals were also put through the whole process described above.
  9. The steps for each digit found were then the following:
  • To later convert the recognized speech signal to text, a labels table was created containing the name (digit) of each speech signal in the training set
  • Find 13 MFCCs (Mel-Frequency Cepstral Coefficients) for each separate input signal (digit)
  • Convert the MFCC coefficients of the input signal to a logarithmic (dB) scale (amplitude_to_db command)
  10. Within a loop that ends after all the speech signals from the training set have been examined:
  • Find 13 MFCCs for each (processed) signal from the training set
  • Convert the MFCC coefficients to a logarithmic (dB) scale (amplitude_to_db command)
  • Apply the DTW (Dynamic Time Warping) algorithm, using the MFCC coefficients of the input signal and of the training-set signal as the features of each pair. Euclidean distance was used to measure the distance between pairs, along with backtracking.
  • Finally, the minimum cost found for each signal pair (input and training) is added to a Dnew table, and the MFCC coefficients of each training-set signal are added to an mfccs table. The purpose of these two tables is to highlight at the end the digit recognized (via Dnew) and to plot the similarity line, i.e. the optimal path (with the help of mfccs)
  11. Finally, the results of the DTW algorithm are evaluated and a recognition percentage is extracted. After this result is output, the final recognized text for the input signal is displayed.
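
The framing step and the two short-time measurements described above (the rate of sign changes between successive samples, and the sum of squared samples per frame) can be sketched with NumPy alone. This is a minimal illustration, not the project's code: the helper names `frame_signal`, `zero_crossing_rate` and `short_time_energy` are hypothetical, L = 30 ms and R = 10 ms are picked from the ranges quoted above, and the test signals are synthetic stand-ins (a low-frequency tone and weak noise) rather than real speech.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into frames of frame_len samples, hop samples apart."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def zero_crossing_rate(frames):
    """Percentage of successive sample pairs in each frame whose sign differs."""
    signs = np.signbit(frames)
    crossings = signs[:, 1:] != signs[:, :-1]
    return 100.0 * crossings.mean(axis=1)

def short_time_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames.astype(float) ** 2, axis=1)

sr = 8000                  # sampling rate after resampling (Hz)
L = int(0.030 * sr)        # assumed frame length: 30 ms -> 240 samples
R = int(0.010 * sr)        # assumed sliding step: 10 ms -> 80 samples

t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 100 * t)               # stand-in for a voiced sound
rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(sr)            # stand-in for an unvoiced sound

tone_frames = frame_signal(tone, L, R)
noise_frames = frame_signal(noise, L, R)

zcr_tone = zero_crossing_rate(tone_frames)
zcr_noise = zero_crossing_rate(noise_frames)
energy_tone = short_time_energy(tone_frames)
energy_noise = short_time_energy(noise_frames)
```

On these synthetic signals the noise frames show the higher zero-crossing rate and the tone frames the higher energy, which is the same pattern the text reports for real speech.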
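
The trick of running onset detection on a reversed copy of the signal comes down to simple arithmetic: an onset found at time t in the reversed signal corresponds to time (duration - t) in the original, which is where a word ends. The sketch below shows only that mapping. The onset times are hard-coded, made-up values standing in for the output of librosa's onset_detect, and the helper `segment_bounds` and its rule for pairing each start with the next end are assumptions for illustration, not the project's method.

```python
import numpy as np

def segment_bounds(onsets_fwd_s, onsets_rev_s, duration_s, sr):
    """Pair word starts with word ends and return (start, end) in samples.

    onsets_fwd_s: onset times (s) detected on the original signal (word starts)
    onsets_rev_s: onset times (s) detected on the reversed signal
    """
    starts = np.sort(np.asarray(onsets_fwd_s, dtype=float))
    # Map reversed-signal onsets back onto the original time axis (word ends).
    ends = np.sort(duration_s - np.asarray(onsets_rev_s, dtype=float))
    bounds = []
    for s in starts:
        later = ends[ends > s]          # assumed rule: first end after this start
        if later.size:
            bounds.append((int(round(s * sr)), int(round(later[0] * sr))))
    return bounds

sr = 8000
duration = 3.0                           # hypothetical signal length in seconds
onsets_original = [0.5, 1.8]             # hypothetical word starts
onsets_reversed = [0.4, 1.9]             # hypothetical onsets on the reversed copy

bounds = segment_bounds(onsets_original, onsets_reversed, duration, sr)
# Each (start, end) pair can then slice the signal: digit = y[start:end]
```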
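
The DTW comparison with Euclidean distance and backtracking can be sketched as follows. This is a textbook formulation under stated assumptions, not the project's implementation: the function `dtw` is hypothetical, features are arranged as (frames, coefficients), which is the transpose of the (coefficients, frames) layout librosa returns for MFCCs, and the one-dimensional toy sequences stand in for real 13-coefficient MFCC matrices. The `Dnew` dictionary plays the role of the Dnew table of minimum costs described above.

```python
import numpy as np

def dtw(a, b):
    """Dynamic Time Warping between feature sequences a (n, d) and b (m, d).

    Returns the minimum accumulated cost and the optimal alignment path
    as a list of (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    # Pairwise Euclidean distances between every frame of a and every frame of b.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n, m], path

# Toy "training set": one feature sequence per label (made-up values).
templates = {
    "zero": np.array([[0.0], [1.0], [2.0], [1.0]]),
    "one": np.array([[2.0], [0.0], [2.0], [0.0]]),
}
# Toy input: a time-stretched version of the "zero" template.
query = np.array([[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]])

Dnew = {label: dtw(query, feats)[0] for label, feats in templates.items()}
recognized = min(Dnew, key=Dnew.get)
```

Because DTW allows frames to be repeated along the path, the stretched input still matches its template at zero cost, which is why the method tolerates digits spoken at different speeds.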

About this project

  • The comments that make the code understandable are inside the .py file
  • This project was written in Anaconda's Spyder IDE.
  • This program runs on Python version 3.7+
  • This repository was created to show the variety of the work I did and experience I gained as a student
