A deployable tool that posts text and audio files to a data lake and receives them from it, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.
The tool shall be used in front-end applications to communicate with your Kafka cluster: it presents a sentence for a user to read aloud, then sends the recorded audio and any other necessary metadata back to the cluster.
Your cluster will be responsible for creating a Delta Lake - a bucket in S3 that stores the Spark-transformed streaming data produced by users reading the texts you showed them. (Hint: you will write code that generates an ID for a randomly selected text and its audio equivalent, receives an ID from an API, and sends the ID plus the audio back as JSON to Kafka, for example through a URL.)
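The ID-and-JSON step in the hint can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: the sample sentences, the field names (`id`, `text`, `audio_b64`, `metadata`), and the helper names are all hypothetical, and base64 is assumed here only because raw audio bytes cannot be placed directly in JSON.

```python
import base64
import json
import random
import uuid

# Hypothetical sample corpus; in practice the texts would come from the data lake.
SENTENCES = [
    "The quick brown fox jumps over the lazy dog.",
    "Speech recognition needs many different voices.",
    "Please read this sentence aloud.",
]


def pick_text(sentences, rng=random):
    """Select a random sentence and attach a freshly generated ID to it."""
    return {"id": str(uuid.uuid4()), "text": rng.choice(sentences)}


def build_audio_message(text_id, audio_bytes, metadata=None):
    """Serialize the ID, the recorded audio, and optional metadata as JSON bytes."""
    payload = {
        "id": text_id,  # same ID as the text, so the pair can be joined later
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
        "metadata": metadata or {},
    }
    return json.dumps(payload).encode("utf-8")


def parse_audio_message(raw):
    """Consumer-side inverse: decode the JSON and restore the raw audio bytes."""
    payload = json.loads(raw.decode("utf-8"))
    payload["audio"] = base64.b64decode(payload.pop("audio_b64"))
    return payload
```

The bytes returned by `build_audio_message` are what would be handed to a Kafka producer (for instance as the message value in a client library such as kafka-python); the topic name and the choice of client are left open here. Reusing the text's ID in the audio message is what lets the downstream Spark job match each recording to the sentence that was read.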
Develop an overview of your approach and document it. Explain why you chose this approach and these tools, and how the approach will provide a good data source for the clients’ speech-to-text ML engine. Explain the purpose of each tool: you should be able to defend it if asked why it is needed rather than simple Python code.