
twblg-translate's Introduction

👩🏻‍💻 Meet Ayaka: A Passionate Researcher & Open-Source Contributor

Hi there! I am Ayaka, a 24-year-old computer science, historical linguistics, and mathematics researcher.

I have made significant contributions to the open-source community. I have created numerous open-source projects on GitHub and have hosted several websites and web services at my own expense. My open-source contributions span various fields, including deep learning, natural language processing, language conservation, historical linguistics, and computational linguistics.

📚 Proficiency in Deep Learning

My expertise in deep learning is reflected in my familiarity with JAX and Google Cloud TPU. I actively submit bug reports, participate in feature discussions and answer questions in the JAX and Google Cloud TPU community. In addition, I created TPU Starter, a comprehensive guide that has helped many people to get started with JAX and Google Cloud TPU. The guide has been translated into Korean and Chinese. Moreover, to enhance the user experience of JAX, I developed jax-smi, a tool that enables the monitoring of real-time memory usage of JAX programs, providing a similar experience to that of nvidia-smi. My significant contributions led to the honour of receiving the 2023 Google Open Source Peer Bonus Award.

💬 Natural Language Processing Expertise

In natural language processing, I have contributed to the Hugging Face Transformers library and released several NLP models. I implemented bart-base-jax and llama-2-jax from scratch. These two projects provide high-quality open-source codebases to deep learning researchers and engineers and demonstrate how Transformer models can be implemented using JAX and trained on Google Cloud TPUs. In addition, I implemented the BERT model from scratch using NumPy and ran it in the browser with Pyodide, which led to TrAVis, a BERT attention visualiser that runs entirely within a browser. The visualiser offers an intuitive visualisation of BERT's attention mechanism for researchers.
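For readers unfamiliar with what an attention visualiser displays, the following is a minimal sketch of single-head scaled dot-product attention. It is not code from TrAVis or from the NumPy BERT implementation mentioned above; it is written in plain Python to stay dependency-free, and the function names and toy inputs are assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention (illustrative only).

    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j. A visualiser would plot `weights`.
    """
    scale = math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights.append(softmax(scores))
    width = len(values[0])
    outputs = [
        [sum(w * v[col] for w, v in zip(row, values)) for col in range(width)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical toy inputs: two tokens with two-dimensional vectors.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_keys = [[1.0, 0.0], [0.0, 1.0]]
toy_values = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = attention(toy_queries, toy_keys, toy_values)
```

Each row of `weights` sums to 1, so it can be read as a distribution over the tokens that a given token attends to.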

Moreover, I constantly keep up with the most advanced AI technologies. I am an early adopter of ChatGPT, the most advanced large language model today, and have been studying it since its release. I am the co-author of the open-source Better ChatGPT website. Utilising the ChatGPT API, this website offers many advanced features and greatly enhances the ChatGPT user experience. It has garnered over 6,000 stars on GitHub and is being used by millions of users worldwide.

🌏 Language Conservation Efforts

My expertise in NLP also extends to language conservation. I trained the BART model for Cantonese, a low-resource language, and released it on the Hugging Face Hub. Building upon this, I proposed TransCan, an English-to-Cantonese machine translation model that outperforms the state-of-the-art commercial machine translation system by 11.8 BLEU. The model has been released on GitHub, bringing benefits to both Cantonese and the wider low-resource NLP community.

In addition to language models, I have created several datasets. In the LIHKG Scraper project, I circumvented many layers of Cloudflare's restrictions to scrape LIHKG, one of the most popular Cantonese forums in Hong Kong, resulting in a corpus of 172,937,863 unique sentences. I have also created two English-Cantonese parallel corpora, Words.hk and ABC Cantonese.

Moreover, for the conservation of Hainanese and Hakka, I engineered web-scraping programs to regularly fetch the latest TV news of Wenchang and Xingning, which are broadcast in their local dialects.

🕰️ Pioneering Contributions in Historical Linguistics

I have also made considerable contributions to the field of historical linguistics. I founded the open-source organisation, nk2028, attracting a community of experts in historical linguistics. In nk2028, we have conducted pioneering research in the field of Middle Chinese phonology. We innovatively formalised the phonological positions of the Tshet-uinh phonological system as 6-tuples, which allowed us to accurately analyse the sound changes that have happened throughout the history of the Chinese language.
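As a rough illustration of why a tuple representation is convenient, here is a minimal Python sketch of a phonological position as a 6-tuple. The choice of the six components (initial, rounding, division, chongniu type, rhyme, tone), the English field names, and the helper function are assumptions made for this example and are not taken from nk2028's code.

```python
from typing import NamedTuple, Optional


class PhonologicalPosition(NamedTuple):
    """Hypothetical 6-tuple; field names are illustrative assumptions."""

    initial: str              # 母, e.g. "端"
    rounding: Optional[str]   # 呼: "開", "合", or None when not distinguished
    division: str             # 等, e.g. "一"
    chongniu: Optional[str]   # 重紐 type, or None when not applicable
    rhyme: str                # 韻, e.g. "東"
    tone: str                 # 聲: "平", "上", "去", or "入"


def has_entering_tone(position: PhonologicalPosition) -> bool:
    """Toy predicate of the kind a sound-change rule might test."""
    return position.tone == "入"


# Illustrative value for the character 東.
dong = PhonologicalPosition("端", None, "一", None, "東", "平")

# Tuples are hashable and comparable, so positions can key a lookup table.
examples = {dong: "東"}
```

Because the tuple is immutable and hashable, two positions with the same six values compare equal, which makes it straightforward to group characters by position or to write rules as plain functions over the fields.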

Moreover, in the process of putting this system into practice, we explored different methods of representing the laws of sound changes in computer programs. Initially, we designed a domain-specific language in PureScript and utilised SQLite as the database. In subsequent research, we simplified our approach by designing a novel JavaScript library, which greatly enhanced productivity.

Based on this, we released the Qieyun Autoderiver website, allowing community members to contribute laws of sound changes for various languages. This website has effectively invigorated the community and attracted many people to this field. To help beginners master the Tshet-uinh phonological system, we also published many tools, such as a tool to automate the process of puonq-tshet, a tool to generate Tshet-uinh Flashcards, and a tool to look up Tshet-uinh phonological positions.

💻 Innovations in Computational Linguistics

In nk2028, I have also made contributions to other aspects of linguistics. In the field of dialectology, we took over the discontinued MCPDict project and released the Chinese Dialect Pronunciation Atlas. Regarding classical Chinese, with the consent of the data provider, Sou-Yun website, we published ORCHESTRA, a comprehensive dataset of classical Chinese poetry. For phonetics, we created an IPA Online Practice System and a Putonghua IPA Converter.

I have also maintained the simplified-traditional Chinese conversion project OpenCC and its successor StarCC. These projects can accurately handle the problem of one-to-many mappings in simplified-traditional Chinese conversion. On top of this, leveraging my in-depth understanding of OpenType font features, I proposed a novel approach for simplified-to-traditional conversion fonts to handle the one-to-many mappings. Based on this approach, I produced two simplified-to-traditional conversion fonts, Fan Wun Ming and Fan Wun Hak. The approach I proposed has also been adopted by other font developers, enhancing the vibrancy of the typographic community.
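To show what the one-to-many problem looks like, the sketch below converts text with a greedy longest-match phrase lookup before falling back to single characters. It is a toy under stated assumptions: it is not the OpenCC or StarCC implementation, it does not reflect the OpenType-based font approach, and the two small dictionaries and the function name are hypothetical.

```python
# Hypothetical toy dictionaries. Simplified 发 maps to 發 or 髮 depending on
# the word it appears in, so phrases must be consulted before characters.
PHRASES = {"头发": "頭髮", "发现": "發現"}
CHARS = {"发": "發", "头": "頭", "现": "現"}


def convert(text, phrases=PHRASES, chars=CHARS):
    """Greedy longest-match simplified-to-traditional conversion (toy)."""
    longest = max(map(len, phrases), default=1)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase first, down to length 2.
        for size in range(min(longest, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in phrases:
                result.append(phrases[chunk])
                i += size
                break
        else:
            # No phrase matched: convert one character, or keep it as-is.
            ch = text[i]
            result.append(chars.get(ch, ch))
            i += 1
    return "".join(result)
```

With these dictionaries, `convert("头发")` yields `頭髮` while `convert("发现")` yields `發現`, even though both inputs contain the same simplified character.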

For Cantonese, I published cantoseg, an effective Cantonese segmentation tool. I have also created two tools, namely ToJyutping and Inject Jyutping, which aid Cantonese learners in mastering the pronunciation of Chinese characters.

I am an active contributor to the rime input method community. As a member of the CanCLID organisation, I maintain rime-cantonese, a rime input schema for Cantonese. I've also released input schemata for TUPA, Loengfan, Mandarin, and Nüshu. Moreover, I have curated awesome-rime, a comprehensive list of rime schemata.

🎲 Miscellaneous Endeavours and Contributions

My open-source contributions extend to my other areas of interest as well. With a deep understanding of the x64 instruction set and the Windows PE file format, I crafted the smallest 64-bit PE file on Windows 10 in assembly language. The file is a Windows executable of merely 268 bytes that can run normally and pop up a message box. Moreover, I proposed the Nya Calendar, a lunisolar-mercurial calendar that considers the synodic period of the Earth and Mercury and encompasses several unique properties.

In addition, I have contributed to the Arch Linux community by maintaining several AUR packages. I host several open-source websites and web services at my own expense, including the Online Nushu Dictionary website, a Graphviz server, a Telegram translation bot, and an instance of the Shieldy bot.

If you want to know more about me and explore my other passions and interests, feel free to visit my personal website!

twblg-translate's People

Contributors

ayaka14732
