
scalable-ml_lab2's Introduction

Cantonese (zh-HK) Text Transcription using Transformers

Lab assignment 2 of the ID2223 Scalable Machine Learning and Deep Learning course at KTH Royal Institute of Technology.

Fine-Tune a pre-trained transformer model and build a serverless UI for using that model

We used Whisper from OpenAI as the pre-trained transformer model and fine-tuned it on the Hong Kong Cantonese (zh-HK) dataset in common_voice to obtain a Cantonese Whisper. The training parameters are the same as in this notebook, and training was performed on Google Colab.

A basic Gradio UI on a Hugging Face Space, with an inference pipeline to this model, can be found here.

The Word Error Rate (WER) after 4000 steps of training was 61.22.
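For readers unfamiliar with the metric, here is a minimal pure-Python sketch of how a corpus-level WER can be computed. It is not the evaluation code used for the figure above; the function names and the whitespace tokenization are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER in percent: total edits / total reference tokens.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    total_edits = 0
    total_tokens = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        hyp_tokens = hyp.split()
        total_edits += edit_distance(ref_tokens, hyp_tokens)
        total_tokens += len(ref_tokens)
    if total_tokens == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * total_edits / total_tokens


# One substitution and one deletion against a 4-token reference.
print(wer(["a b c d"], ["a x c"]))
```

Cantonese transcripts are usually written without spaces, so the choice of tokenization strongly affects the score; how the reported figures were tokenized is not stated here.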

Use a data-centric approach to fine-tune the Cantonese Whisper model.

The new data comes from Magic Data. It was segmented and uploaded as a Hugging Face dataset under the CC BY-NC-ND 4.0 license, preprocessed on Modal by a refactored feature pipeline, Feature Pipeline Guangzhou.py, and stored again as the Hugging Face dataset Cantonese_processed_guangzhou. (Running feature extraction on CPU on Modal is super fast and cheap ;-), and we barely used our $30 credit.)

We encountered some problems searching for Cantonese datasets. Cantonese is the second most used dialect of Chinese (the first being Mandarin, which is often referred to simply as Chinese), yet it turned out that only a limited number of data sources exist. The writing system of Cantonese is also rather fragmented: Hong Kong and Macau use traditional characters with their own wording conventions, while Mainland China and Malaysia use simplified characters with theirs, which makes transcripts inconsistent across sources. To stay consistent with the dataset used in the first task, we forced all characters to be converted to traditional characters in the feature pipeline. There is also virtually no other Cantonese dataset on Hugging Face, so the Magic Data dataset was the only transcribed Cantonese speech dataset we found that is comprehensive and formal enough to improve our model.
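The character normalization step can be sketched as follows. This is only an illustration of the idea, not the code in the feature pipeline: the mapping table is a tiny hand-written subset, the function names are hypothetical, and the "sentence" column name is an assumption about the transcript field. A real pipeline would need a complete conversion table, typically taken from a dedicated conversion library.

```python
# Illustrative subset only; a complete simplified-to-traditional table
# is required for real data.
SIMPLIFIED_TO_TRADITIONAL = {
    "广": "廣",
    "东": "東",
    "话": "話",
    "说": "說",
    "听": "聽",
}

# str.translate expects a table keyed by code points.
_TABLE = str.maketrans(SIMPLIFIED_TO_TRADITIONAL)


def to_traditional(text: str) -> str:
    """Replace every known simplified character; leave all others as-is."""
    return text.translate(_TABLE)


def normalize_batch(batch: dict) -> dict:
    """Return a copy of the record with its transcript normalized.

    Assumption: the transcript is stored under the key "sentence".
    """
    batch = dict(batch)
    batch["sentence"] = to_traditional(batch["sentence"])
    return batch


print(to_traditional("广东话"))
```

A per-character table like this cannot handle simplified characters that map to more than one traditional character depending on context (for example 发, which can be 發 or 髮), so this forced conversion trades some accuracy for consistency.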

Another problem we faced was data storage. We initially wanted to store the extracted features on Hopsworks, but we were unable to keep the data types consistent after extraction, and Hopsworks does not accept features with inconsistent data types. We therefore stored our features as a Hugging Face dataset.

For the second training, the parameters can be found in Training Pipeline Guangzhou; they are essentially unchanged apart from a lower learning rate. Training was performed on Google Colab (we did not use Modal here because Modal GPU time is expensive :<). The training metrics are tracked with TensorBoard and can be found on the Hugging Face page of the fine-tuned model.

This data-centric approach reduced the Word Error Rate (WER) to 60.46 using 1600 additional steps. (The first run failed to upload its metrics to TensorBoard for tracking due to a connection error. The second run reduced the WER to 60.96.)

Refactor the program into a feature engineering pipeline, a training pipeline, and an inference program (Hugging Face Space) to improve efficiency and scalability.

The refactored scripts can be found below:

The model trained can be found below:

Here is the UI, with which you can generate zh-HK subtitles from an audio file, a video file, a YouTube URL, or your microphone.
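The code of the Space is not reproduced here, but the subtitle step can be sketched as below, assuming the transcription is available as a list of timestamped chunks. The chunk layout (a "timestamp" pair in seconds plus a "text" string) and both function names are assumptions made for illustration; the output follows the SubRip (.srt) layout.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks) -> str:
    """Turn timestamped chunks into SRT text.

    Assumption: each chunk looks like
    {"timestamp": (start_seconds, end_seconds), "text": "..."}.
    """
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Guard for a missing end time (assumed possible for a last chunk).
            end = start
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle entries.
    return "\n".join(blocks)


example = [
    {"timestamp": (0.0, 2.5), "text": " 你好"},
    {"timestamp": (2.5, 4.0), "text": "早晨"},
]
print(chunks_to_srt(example))
```

Keeping the formatting separate from the model call means the same function can serve every input type (file, URL, or microphone) once the audio has been transcribed.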

Authors

Acknowledgements

scalable-ml_lab2's People

Contributors

chenzhou98, li-ziyou
