Vietnamese Language Graph 🇻🇳

Xin chào các bạn! (Hello, everyone!)

I'm learning Vietnamese, and one thing I want to focus on is expanding my vocabulary. Vietnamese differs from English in quite a few ways, but an important one is its low morpheme-to-word ratio. Vietnamese forms more complicated words (or words in different tenses) by combining morphemes into compound words. Once you learn a single-morpheme word in Vietnamese, you can usually build on it to form more complex words around the same idea. This project uses Python and Obsidian to visualize these compound words and make vocabulary easier to build while learning the language.

Video

To see how this project was created, and get a glimpse of my problem-solving method, check out the video here:

My video summary of the project

Project Summary

This Python project takes a Vietnamese-to-English dictionary in the form of an XML file. It loads the XML data into a pandas DataFrame, then passes that DataFrame to a search function. There are two search functions in the project:

reverse_search_method: The reverse search method receives a DataFrame of words, splits each word into morphemes, and reverses the resulting list. It then iterates over the morphemes, builds a word whose morpheme length equals the current index, and reverses that word to restore the proper order. This sub-word is added to the connection list for the main word. This method is deprecated in favor of dict_match_search_method for two reasons. First, it creates false positives for some words by linking to sub-words that are not in the dictionary (i.e. morphemes that add meaning to a compound word but are not words in their own right). Second, it misses some connections to morpheme combinations formed from the center morphemes of a word with four or more morphemes.
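The description above leaves some details open, so the following is only a minimal sketch of one possible reading, not the project's actual code. It assumes that words are space-separated morphemes and that the sub-words are the trailing runs of the word; the function name trailing_sub_words is hypothetical.

```python
def trailing_sub_words(word):
    """One reading of the reverse search: collect trailing morpheme runs.

    Assumption: morphemes are separated by single spaces.
    """
    reversed_morphemes = word.split()[::-1]
    connections = []
    # Lengths 1 .. n-1, so the full word is never linked to itself.
    for length in range(1, len(reversed_morphemes)):
        # Take the first `length` reversed morphemes, then flip them
        # back into reading order.
        sub_word = " ".join(reversed_morphemes[:length][::-1])
        connections.append(sub_word)
    return connections


print(trailing_sub_words("sinh hóa học"))  # ['học', 'hóa học']
```

Under this reading, both limitations mentioned above are visible: nothing checks the sub-words against the dictionary, and for a four-morpheme word such as "a b c d" the center pair "b c" is never produced.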

dict_match_search_method: The dictionary match search method receives a DataFrame of words and iterates over it. For each word, it builds an array of every possible sub-word the word could contain. For instance, the word sinh hóa học (biochemistry) contains five possible sub-words:

  • One Morpheme
    • sinh
    • hóa
    • học
  • Two Morphemes
    • sinh hóa
    • hóa học

Each sub-word is compared against a filtered dictionary DataFrame containing only words with the same morpheme length as the sub-word. If the sub-word exists in the dictionary, a connection is made between the root word and the sub-word. This method is slightly slower than reverse_search_method, but every connection it makes points to a word that exists in the dictionary, which removes all dead links.
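The dictionary match idea can be sketched in plain Python. This is a simplified illustration rather than the project's implementation: it uses sets grouped by morpheme length in place of the filtered DataFrame, assumes morphemes are separated by spaces, and the function names and the tiny sample dictionary are hypothetical.

```python
from collections import defaultdict


def sub_words(word):
    """Return every contiguous run of morphemes shorter than the word."""
    morphemes = word.split()
    n = len(morphemes)
    return [
        " ".join(morphemes[start:start + length])
        for length in range(1, n)
        for start in range(n - length + 1)
    ]


def group_by_length(dictionary_words):
    """Group dictionary entries by morpheme count (stands in for the
    filtered DataFrame described above)."""
    by_length = defaultdict(set)
    for entry in dictionary_words:
        by_length[len(entry.split())].add(entry)
    return by_length


def dict_match(word, by_length):
    """Keep only the sub-words that are real dictionary entries."""
    matches = [
        s for s in sub_words(word)
        if s in by_length.get(len(s.split()), set())
    ]
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(matches))


# Hypothetical sample data, not taken from the real dictionary file.
sample = group_by_length({"sinh", "học", "hóa học", "sinh hóa học"})
print(sub_words("sinh hóa học"))
# ['sinh', 'hóa', 'học', 'sinh hóa', 'hóa học']
print(dict_match("sinh hóa học", sample))
# ['sinh', 'học', 'hóa học']
```

Because every candidate is looked up before a connection is made, a sub-word missing from the dictionary (here "hóa" and "sinh hóa" in the sample data) never becomes a link.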

After the script runs, it will have written 23,000 Markdown files, which can then be imported into an Obsidian vault. This results in a graph view like the following:

A screenshot of a large Obsidian graph view
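Obsidian builds its graph from links between notes, so each word's file needs links to its sub-words. The sketch below shows one way such files could be written using Obsidian's [[wikilink]] syntax. The file layout, the function name write_notes, and the shape of the connections mapping are assumptions for illustration, not the project's actual output format.

```python
from pathlib import Path


def write_notes(connections, vault_dir):
    """Write one Markdown note per word, linking to its sub-words.

    `connections` maps each word to a list of connected sub-words
    (hypothetical structure).
    """
    vault = Path(vault_dir)
    vault.mkdir(parents=True, exist_ok=True)
    written = []
    for word, linked in connections.items():
        lines = [f"# {word}", ""]
        # Each [[...]] becomes an edge in Obsidian's graph view.
        lines += [f"- [[{sub}]]" for sub in linked]
        path = vault / f"{word}.md"
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Calling write_notes({"sinh hóa học": ["sinh", "hóa học"]}, "ObsidianVault") would create a single note named after the word, with one list item per linked sub-word.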

How to use

You can use this project to visualize Vietnamese as well!

Requirements:

  • Obsidian
  • Python
  • source.txt

Instructions:

  1. Clone this repository to your computer
  2. Run python main.py
  3. Open Obsidian
  4. Click Open another vault (bottom of left navigation bar)
  5. Click Open folder as Vault
  6. Select the folder ObsidianVault in this repository

Obsidian will take some time to index and import this vault. Indexing finishes much faster if you leave the Graph View closed until it is complete.

Sources

Quang Hiển's Vietnamese Dictionary

Like my work?

Buy Me a Coffee at ko-fi.com
