ILRDC Downloader

A Python package for downloading data from the website of the Indigenous Languages Research and Development Center (ILRDC).

Documentation

1. Import the package.

from ilrdc import ILRDC

If you don't know which dialect or which part of the data you want to download, you can import two additional classes, ILRDCDialect and ILRDCPart:

from ilrdc import ILRDC, ILRDCDialect, ILRDCPart

Call the method .get_list_info() on ILRDCDialect or ILRDCPart to list the available values:

  • To list the dialects:

    ILRDCDialect.get_list_info()

    This returns:

    ['泰雅語', 
     '邵語', 
     '賽德克語', 
     '布農語', 
     '魯凱語', 
     '噶瑪蘭語', 
     '卑南語', 
     '雅美語', 
     '撒奇萊雅語', 
     '卡那卡那富語']
  • To list the parts:

    ILRDCPart.get_list_info()

    This returns:

    ['詞彙與構詞',
     '基本句型及詞序',
     '格謂標記與代名詞系統',
     '焦點與時貌語氣系統',
     '存在句所有句方位句結構',
     '祈使句結構',
     '使動結構',
     '否定句結構',
     '疑問句結構',
     '連動結構',
     '補語結構',
     '修飾結構',
     '並列結構',
     '其他結構',
     '標點符號',
     '基本詞彙',
     '長篇語料']

    In this list, the first fifteen values belong to the Grammar part, the second-to-last value (基本詞彙) to the Vocabulary part, and the last value (長篇語料) to the Story part.
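The positional grouping described above can be sketched in plain Python. The part names are copied from the list in this step; the helper names `group_parts` and `part_type_of` are hypothetical and are not part of the ilrdc package.

```python
# Part names copied from the list above, in the same order.
PARTS = [
    '詞彙與構詞', '基本句型及詞序', '格謂標記與代名詞系統', '焦點與時貌語氣系統',
    '存在句所有句方位句結構', '祈使句結構', '使動結構', '否定句結構',
    '疑問句結構', '連動結構', '補語結構', '修飾結構', '並列結構',
    '其他結構', '標點符號', '基本詞彙', '長篇語料',
]


def group_parts(parts):
    """Split the flat list into grammar, vocab and story groups by position."""
    return {
        'grammar': parts[:15],   # the first fifteen values
        'vocab': parts[15:16],   # the second-to-last value
        'story': parts[16:],     # the last value
    }


def part_type_of(part, parts=PARTS):
    """Return the part_type string that a given part name belongs to."""
    for part_type, names in group_parts(parts).items():
        if part in names:
            return part_type
    raise ValueError(f'unknown part: {part!r}')


print(part_type_of('否定句結構'))  # grammar
```

In real use, the hard-coded list would presumably be replaced by the result of ILRDCPart.get_list_info(), assuming the order shown above stays the same.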

2. Instantiate the ILRDC class:

Parameters:

  • dialect_ch: the Chinese name of the dialect
  • part_type: the type of data you want to download ('grammar', 'vocab', or 'story')
  • part: the specific part you want to download (optional)

Examples:

  • Select Grammar Part:

    Pass the string 'grammar' to the parameter part_type:

    ILRDC('泰雅語', part_type='grammar')

    Without the parameter part, the class includes all the grammar parts (i.e. from 詞彙與構詞 to 標點符號). To select a particular grammar part, pass its name as a string to the parameter part:

    ILRDC('泰雅語', part_type='grammar', part='否定句結構')
  • Select Vocabulary Part:

    Pass the string 'vocab' to the parameter part_type:

    ILRDC('泰雅語', part_type='vocab')

    Since Vocabulary has only one part, you don't need to pass the string '基本詞彙' to the parameter part, although you can still specify it explicitly:

    ILRDC('泰雅語', part_type='vocab', part='基本詞彙')
  • Select Story Part:

    Pass the string 'story' to the parameter part_type:

    ILRDC('泰雅語', part_type='story')

    Since Story has only one part, you don't need to pass the string '長篇語料' to the parameter part, although you can still specify it explicitly:

    ILRDC('泰雅語', part_type='story', part='長篇語料')
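Because part_type takes one of three fixed strings, a typo such as 'vocabulary' is easy to make. The sketch below checks the arguments before they are handed to ILRDC. The helper `build_ilrdc_args` is hypothetical and not part of the package, and how ILRDC itself reacts to an invalid value is not documented here.

```python
# The three part_type strings used in the examples above.
VALID_PART_TYPES = ('grammar', 'vocab', 'story')


def build_ilrdc_args(dialect_ch, part_type, part=None):
    """Validate the arguments and return (args, kwargs) ready for ILRDC."""
    if part_type not in VALID_PART_TYPES:
        raise ValueError(
            f'part_type must be one of {VALID_PART_TYPES}, got {part_type!r}'
        )
    kwargs = {'part_type': part_type}
    # part is optional, so only forward it when it was given.
    if part is not None:
        kwargs['part'] = part
    return (dialect_ch,), kwargs


args, kwargs = build_ilrdc_args('泰雅語', 'grammar', part='否定句結構')
print(args, kwargs)
# With the package installed, the call would then be: ILRDC(*args, **kwargs)
```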

3. Print out the data:

After instantiating the ILRDC class, call .download_data() to retrieve the data. For example:

ILRDC('泰雅語', part_type='grammar', part='基本句型及詞序').download_data()

This prints:

{
    '基本句型及詞序': [
        {
            'ID': '(4-1)a.',
            'dialect': 'maniq ngahi’ i Silan.',
            'chinese_translation': 'Silan 吃地瓜。',
            'sound_url': 'https://ilrdc.tw/grammar/sound/2/4-1-1.mp3'
        },
        ...
    ]
}
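The structure above maps each part name to a list of records. The following is a minimal sketch of walking it, for example to collect every audio URL. It assumes .download_data() returns a dictionary of this shape, which the output above suggests but does not confirm. The sample record is copied from that output, and `iter_records` is a hypothetical helper.

```python
# Sample copied from the output above; assumed to be the returned dictionary.
data = {
    '基本句型及詞序': [
        {
            'ID': '(4-1)a.',
            'dialect': 'maniq ngahi’ i Silan.',
            'chinese_translation': 'Silan 吃地瓜。',
            'sound_url': 'https://ilrdc.tw/grammar/sound/2/4-1-1.mp3',
        },
    ],
}


def iter_records(data):
    """Yield (part_name, record) pairs from the nested structure."""
    for part_name, records in data.items():
        for record in records:
            yield part_name, record


# Collect the audio URL of every record, regardless of which part it is in.
sound_urls = [record['sound_url'] for _, record in iter_records(data)]
print(sound_urls)
```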

4. Write object to a JSON file:

After instantiating the ILRDC class, you can use the method .to_json() to write all the data to a JSON file.

ILRDC('泰雅語', part_type='vocab').to_json()

5. Write object to a CSV file:

After instantiating the ILRDC class, you can use the method .to_csv() to write all the data to a comma-separated values (CSV) file.

ILRDC('泰雅語', part_type='vocab').to_csv()
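If you need a column layout of your own, the dictionary shown in step 3 can also be flattened by hand with the standard csv module. This is only a sketch: the column order and the extra `part` column are assumptions, it does not describe the file that .to_csv() itself writes, and `to_csv_text` is a hypothetical helper.

```python
import csv
import io

# Sample record copied from the output in step 3.
data = {
    '基本句型及詞序': [
        {
            'ID': '(4-1)a.',
            'dialect': 'maniq ngahi’ i Silan.',
            'chinese_translation': 'Silan 吃地瓜。',
            'sound_url': 'https://ilrdc.tw/grammar/sound/2/4-1-1.mp3',
        },
    ],
}


def to_csv_text(data):
    """Flatten {part: [record, ...]} into CSV text, one row per record."""
    fieldnames = ['part', 'ID', 'dialect', 'chinese_translation', 'sound_url']
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    for part_name, records in data.items():
        for record in records:
            # Prepend the part name so rows from different parts stay distinguishable.
            writer.writerow({'part': part_name, **record})
    return buffer.getvalue()


print(to_csv_text(data))
```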

Tidbit: Downloading the grammar, vocabulary, and story data of all the languages at the same time

1. In a .py file:

import asyncio
from ilrdc import ILRDC, ILRDCDialect


async def download(language, part_type):
    # Write the data of one dialect and one part type to a JSON file.
    return ILRDC(language, part_type=part_type).to_json()


async def main():
    # Create one task per dialect for each of the three part types.
    languages = ILRDCDialect.get_list_info()
    task_grammar = [download(language, part_type='grammar') for language in languages]
    task_vocab = [download(language, part_type='vocab') for language in languages]
    task_story = [download(language, part_type='story') for language in languages]
    return await asyncio.gather(*task_grammar, *task_vocab, *task_story)

asyncio.run(main())

2. In an .ipynb file:

import asyncio
import nest_asyncio
from ilrdc import ILRDC, ILRDCDialect


# Allow asyncio.run() inside the notebook's already-running event loop.
nest_asyncio.apply()


async def download(language, part_type):
    # Write the data of one dialect and one part type to a JSON file.
    return ILRDC(language, part_type=part_type).to_json()


async def main():
    # Create one task per dialect for each of the three part types.
    languages = ILRDCDialect.get_list_info()
    task_grammar = [download(language, part_type='grammar') for language in languages]
    task_vocab = [download(language, part_type='vocab') for language in languages]
    task_story = [download(language, part_type='story') for language in languages]
    return await asyncio.gather(*task_grammar, *task_vocab, *task_story)

asyncio.run(main())
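One caveat: a coroutine that never awaits anything runs to completion before the next one starts. If .to_json() is an ordinary blocking call (an assumption, since this document does not say how it is implemented), the tasks above would therefore run one after another. The sketch below shows one way to overlap blocking calls in a .py file with asyncio.to_thread (Python 3.9+). `blocking_download` is a hypothetical stand-in that only sleeps, so the example runs without the package or network access.

```python
import asyncio
import time


def blocking_download(language, part_type):
    # Hypothetical stand-in for ILRDC(language, part_type=part_type).to_json().
    time.sleep(0.1)
    return f'{language}-{part_type}'


async def download_all(languages, part_types):
    # asyncio.to_thread runs each blocking call in a worker thread,
    # so the calls can overlap instead of running sequentially.
    tasks = [
        asyncio.to_thread(blocking_download, language, part_type)
        for part_type in part_types
        for language in languages
    ]
    # gather returns the results in the same order as the tasks.
    return await asyncio.gather(*tasks)


results = asyncio.run(
    download_all(['泰雅語', '邵語'], ['grammar', 'vocab', 'story'])
)
print(results)
```

If the package already performs its requests concurrently, this wrapper adds nothing and the original version is sufficient.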

Contact Me

If you have any suggestions or questions, please do not hesitate to email me at [email protected]

Contributors

retr0327