create-speech-to-text-pipeline / pipeline

A deployable tool that posts text and audio files to a data lake and receives them from it, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.

License: MIT License

Jupyter Notebook 99.61% HTML 0.04% CSS 0.01% JavaScript 0.12% Python 0.21% Shell 0.01% Dockerfile 0.01%

apache-airflow apache-kafka apache-spark kafka-js kafka-python pyspark reactjs amazon-msk amazon-s3-storage

pipeline's People

Contributors

Watchers

Forkers

akrobi prubayita kaydeejr yonamg mohammedesamaldin nahomhmichael haylemicheal create-speech-to-text-pipeline

pipeline's Issues

Code Review

Frontend

Using React

README and GitHub Actions

Airflow

Testing

Create a JavaScript tag

The tag will be used in front-end applications to communicate with your Kafka cluster: it presents a sentence for a user to read and sends the recorded audio and other necessary metadata back to your Kafka cluster.
Review the following to understand how an app or a browser captures audio and text events and sends them to your Kafka cluster:
Using the MediaStream Recording API - Web APIs | MDN (mozilla.org)
Handling Large Messages with Apache Kafka (CSV, XML, Image, Video, Audio, Files) - Kai Waehner (kai-waehner.de)
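
Recorded audio can easily exceed the size a Kafka broker accepts for a single message (about 1 MB by default). One common workaround is to split the recording into chunks and reassemble them on the consumer side. The following is a minimal sketch of that idea, not the project's implementation; the function names, the chunk fields, and the `rec-1` identifier are assumptions made for illustration.

```python
import hashlib


def split_audio(audio, recording_id, chunk_size=512 * 1024):
    """Split raw audio bytes into ordered chunks that each fit in one message."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Ceiling division; an empty recording still yields one (empty) chunk.
    total = max(1, -(-len(audio) // chunk_size))
    checksum = hashlib.sha256(audio).hexdigest()
    return [
        {
            "recording_id": recording_id,  # hypothetical field names
            "index": i,
            "total": total,
            "sha256": checksum,
            "data": audio[i * chunk_size:(i + 1) * chunk_size],
        }
        for i in range(total)
    ]


def join_audio(chunks):
    """Reassemble chunks (in any order) and verify the checksum."""
    ordered = sorted(chunks, key=lambda c: c["index"])
    if not ordered or len(ordered) != ordered[0]["total"]:
        raise ValueError("missing chunks")
    audio = b"".join(c["data"] for c in ordered)
    if hashlib.sha256(audio).hexdigest() != ordered[0]["sha256"]:
        raise ValueError("checksum mismatch")
    return audio


# Stand-in for a real recording: 2560 bytes split into 1000-byte chunks.
audio = bytes(range(256)) * 10
chunks = split_audio(audio, "rec-1", chunk_size=1000)
restored = join_audio(reversed(chunks))
```

Using the recording ID as the Kafka message key would keep all chunks of one recording in the same partition, which preserves their order. Storing the audio in S3 and sending only a reference through Kafka is an alternative worth weighing against chunking.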

Create a Kafka cluster

Following Installing a Kafka Cluster and Creating a Topic - Hands-on Labs | A Cloud Guru, set up a cluster on your assigned AWS machine.
Your cluster will feed a Delta Lake: a bucket in S3 that stores the Spark-transformed streaming data from users reading the texts you showed them. (Hint: you will write code that generates an ID for a randomly selected text and its audio equivalent, receives an ID from an API, and sends the ID plus the audio back to Kafka as JSON, like a URL.)
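
The hint above can be sketched as follows: pick a random text, attach a generated ID, and package the ID with the recorded audio as JSON. This is an illustration under assumptions, not the project's code: the corpus, the function names, and the `audio_b64` field are hypothetical, and base64 is used only because raw bytes cannot be placed in JSON directly.

```python
import base64
import json
import random
import uuid

# Hypothetical corpus; the real sentences would come from the project's text data.
CORPUS = [
    "The quick brown fox jumps over the lazy dog.",
    "Speech recognition needs many recorded voices.",
]


def pick_prompt(corpus, rng=random):
    """Pick a random sentence and attach a freshly generated ID to it."""
    return {"id": str(uuid.uuid4()), "text": rng.choice(corpus)}


def build_audio_message(prompt_id, audio_bytes):
    """Serialize the prompt ID and the recorded audio as a JSON string."""
    return json.dumps({
        "id": prompt_id,
        # Raw bytes are not valid JSON, so the audio is base64-encoded.
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
    })


def parse_audio_message(message):
    """Inverse of build_audio_message: recover the ID and the raw audio bytes."""
    payload = json.loads(message)
    return payload["id"], base64.b64decode(payload["audio_b64"])


prompt = pick_prompt(CORPUS)
# Placeholder bytes stand in for a real recording.
message = build_audio_message(prompt["id"], b"\x00\x01fake-audio")
```

The shared ID is what lets a downstream job pair each audio recording with the sentence that was read.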

Planning and design

Build or simulate a Kafka event source for the text corpus. You should read Breaking News: Everything Is An Event! (Streams, Kafka And You) (florimond.dev).
Develop an overview of your approach and document it. Explain why you chose this approach and these tools, and how the approach will provide a good data source for the clients' speech-to-text ML engine. Explain the purpose of each tool; you should be able to defend each choice if asked why you used it instead of simple Python code.
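
A simulated event source for the text corpus could look like the sketch below: each sentence becomes one event, and a pluggable `send` callable stands in for the Kafka producer. The event fields, the `text_published` event name, and the sample sentences are assumptions for illustration.

```python
import json
import time
from itertools import count


def text_events(corpus, clock=time.time):
    """Yield one event per sentence in the corpus."""
    for seq, sentence in zip(count(), corpus):
        yield {
            "event_type": "text_published",  # hypothetical event name
            "sequence": seq,
            "timestamp": clock(),
            "text": sentence,
        }


def publish(events, send):
    """Serialize each event to UTF-8 JSON bytes and pass it to `send`.

    Returns the number of events handed over.
    """
    sent = 0
    for event in events:
        send(json.dumps(event).encode("utf-8"))
        sent += 1
    return sent


# Simulation: collect the messages in a list instead of sending them to a broker.
outbox = []
n_sent = publish(text_events(["Hello world.", "Read this sentence aloud."]), outbox.append)
```

With kafka-python, `send` could wrap a real producer, for example `lambda value: producer.send("text-corpus", value=value)`, where `producer` is a `KafkaProducer` and the topic name `text-corpus` is a placeholder.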

EDA

A Jupyter notebook that illustrates your data exploration with professional plots: readable axis labels, titles, and legends, and a good choice of colors.

Backend

Prepare API endpoints for Kafka using Flask.

Logging

linked to #11

Use Spark to transform and load from your Kafka cluster

Using PySpark, write code that transforms and loads the data from the data lake.
Using Kafka as the input source for Spark Structured Streaming and Delta Lake as the storage layer, build a complete streaming data pipeline to consolidate the data. You should read From Kafka to Delta Lake using Apache Spark Structured Streaming (michelin.io).
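
A Kafka-to-Delta stream in PySpark might be wired up roughly as below. This is a sketch under assumptions rather than the project's pipeline: the function names, broker address, and topic are placeholders, and the `spark` session passed in is assumed to have the Kafka connector and Delta Lake packages available.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's Kafka source; reads the topic from the beginning."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def start_kafka_to_delta(spark, bootstrap_servers, topic, delta_path, checkpoint_path):
    """Start a streaming query that appends Kafka records to a Delta table.

    `spark` is an existing SparkSession; all other arguments are placeholders
    supplied by the caller.
    """
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers key and value as binary; cast them to strings for later parsing.
    decoded = raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )
    return (
        decoded.writeStream.format("delta")
        .outputMode("append")
        # The checkpoint lets the query resume from where it stopped after a restart.
        .option("checkpointLocation", checkpoint_path)
        .start(delta_path)
    )


# Placeholder broker address and topic name.
options = kafka_source_options("localhost:9092", "audio-events")
```

Calling `start_kafka_to_delta` with an S3 path for `delta_path` would write the table into the bucket; any JSON parsing of `value` would be added between the cast and the write.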
