Cantonese (zh-HK) Text Transcription using Transformers

Lab assignment 2 of the ID2223 Scalable Machine Learning and Deep Learning course at KTH Royal Institute of Technology.

Fine-Tune a pre-trained transformer model and build a serverless UI for using that model

We used Whisper from OpenAI as the pre-trained transformer model and fine-tuned it on the Hong Kong Cantonese subset of common_voice to obtain a Cantonese Whisper. The training parameters are the same as in this notebook, and the training was performed on Google Colab.

A basic Gradio UI on a Hugging Face Space, with an inference pipeline for this model, can be found here
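
As a rough illustration of how such a UI can wrap an inference pipeline, here is a minimal sketch. It is not the code of the actual Space: the function names build_asr, transcribe and build_demo are hypothetical, and the model id is left as a parameter because this README does not spell it out. The transformers and gradio imports are done lazily so that the transcription logic can be exercised without either library installed.

```python
def build_asr(model_id):
    """Create a speech-recognition pipeline for the given model id (assumed to be a Whisper checkpoint)."""
    from transformers import pipeline  # imported lazily; requires `transformers`
    return pipeline("automatic-speech-recognition", model=model_id)


def transcribe(audio_path, asr):
    """Return the transcript for one audio file, or an empty string if no file was given."""
    if audio_path is None:
        return ""
    # The ASR pipeline returns a dict whose "text" entry holds the transcript.
    return asr(audio_path)["text"]


def build_demo(asr):
    """Wrap `transcribe` in a simple Gradio interface (audio file in, text out)."""
    import gradio as gr  # imported lazily; requires `gradio`
    return gr.Interface(
        fn=lambda path: transcribe(path, asr),
        inputs=gr.Audio(type="filepath"),
        outputs="text",
    )
```

With both libraries installed, calling build_demo(build_asr(model_id)).launch() would start the interface locally; passing the pipeline in as an argument keeps the transcription function easy to test with a stand-in.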

The Word Error Rate (WER) after 4000 steps of training was 61.22.
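
For readers unfamiliar with the metric, the sketch below shows how a WER-style score can be computed as the edit distance between reference and hypothesis tokens divided by the number of reference tokens. It is a plain-Python illustration, not the evaluation code used for the numbers in this README; the function names are made up for the example, and the result is scaled to a percentage on the assumption that the figures quoted here are percentages. Since written Cantonese has no spaces between words, the tokenizer is a parameter: whitespace splitting gives a word-level score, while passing list gives a character-level one.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, tokenize=str.split):
    """Edit distance over reference length, as a percentage."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

For example, error_rate("我哋去食飯", "我地去食飯", tokenize=list) counts one substituted character out of five.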

Use a data-centric approach to fine-tune the Cantonese Whisper model.

The new data comes from Magic Data. It was segmented and uploaded as a Hugging Face dataset under the CC BY-NC-ND 4.0 license, preprocessed on Modal by a refactored feature pipeline, Feature Pipeline Guangzhou.py (running feature extraction on CPU on Modal is super fast and cheap ;-), and we barely used our $30 credit), and stored again as the Hugging Face dataset Cantonese_processed_guangzhou.
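
The per-example step of such a feature pipeline can be sketched roughly as follows. This is an illustration rather than the contents of Feature Pipeline Guangzhou.py: the function name prepare_example is hypothetical, and the column names "audio" and "sentence" are assumptions borrowed from the Common Voice layout. The feature extractor and tokenizer are passed in as arguments; in a Whisper setup they would typically be a WhisperFeatureExtractor and a WhisperTokenizer from the transformers library.

```python
def prepare_example(example, feature_extractor, tokenizer,
                    audio_key="audio", text_key="sentence"):
    """Turn one raw example into model inputs (log-Mel features) and label ids."""
    audio = example[audio_key]
    # Compute input features from the raw waveform at its sampling rate.
    features = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    )
    return {
        "input_features": features.input_features[0],
        # Encode the transcript as token ids to use as training labels.
        "labels": tokenizer(example[text_key]).input_ids,
    }
```

Applied over a dataset, for instance with the map method of a Hugging Face dataset, this produces the processed examples that can then be stored for training.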

We encountered some problems searching for datasets for Cantonese Chinese, the second most used dialect of Chinese (the first being Mandarin Chinese, often referred to simply as Chinese). It turned out that there is only a limited number of data sources. The writing system of Cantonese Chinese is rather fragmented: Hong Kong and Macau use traditional characters and their own wording conventions, while Mainland China and Malaysia use simplified characters and their own wording conventions, which leads to inconsistent transcripts. To stay consistent with the dataset used in the first task, we solved this by forcibly converting all characters to traditional characters in the feature pipeline. Also, there is virtually no other Cantonese dataset on Hugging Face, so the dataset from Magic Data was the only Cantonese speech dataset with transcripts that was comprehensive and formal enough to improve our model.
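
The normalization to traditional characters can be pictured with the toy sketch below. It is only an illustration of the idea: the five-entry mapping table and the function name to_traditional are made up for the example, and this README does not say which tool the feature pipeline actually uses. A real pipeline would need a complete conversion table, for instance from a dedicated converter such as OpenCC, because some simplified characters correspond to more than one traditional character.

```python
# Tiny illustrative sample of simplified -> traditional pairs; NOT a complete mapping.
_S2T_SAMPLE = {"广": "廣", "东": "東", "话": "話", "说": "說", "们": "們"}
_S2T_TABLE = str.maketrans(_S2T_SAMPLE)


def to_traditional(text, table=_S2T_TABLE):
    """Replace every character found in the table; leave all others unchanged."""
    return text.translate(table)
```

Text that is already in traditional characters passes through unchanged, so the same step can be applied to every transcript regardless of its origin.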

Another problem we faced was data storage. At first we wanted to store the extracted features on Hopsworks, but we were unable to keep the datatypes consistent after extraction, and Hopsworks does not accept features with inconsistent datatypes. Therefore, we used a Hugging Face dataset to store our features.

For the second training, the parameters used can be found in Training Pipeline Guangzhou; they are essentially unchanged apart from a lower learning rate. The training was performed on Google Colab (we did not use Modal here because Modal GPU time is expensive :<). The training metrics are tracked with TensorBoard and can be found on the page of the fine-tuned model on Hugging Face.

This data-centric approach reduced the Word Error Rate (WER) to 60.46 using 1600 additional steps. (The first run failed to upload its metrics to TensorBoard for tracking, due to a connection error. The second run reduced WER to 60.96.)

Refactor the program into a feature engineering pipeline, training pipeline, and an inference program (Hugging Face Space), to improve efficiency and scalability.

The refactored scripts can be found below:

The model trained can be found below:

Here is the UI, with which you can generate zh-HK subtitles from an audio file, a video file, a YouTube URL, or your microphone.

Whisper app on huggingface

Repository: chenzhou98 / scalable-ml_lab2
