doodlebears / split-lang

✨ Split text by languages (e.g. 你喜欢看アニメ吗 -> 你喜欢看 | アニメ | 吗) for NLP tasks (e.g. parse, TTS). Powered by fasttext and langua

Home Page: https://pypi.org/project/split-lang/

License: MIT License

Python 54.24% Jupyter Notebook 45.76%
nlp python sentence-segmenter split detectlang i18n language-identification languagedetector tts langdetect pip pythonpackage fasttext langua

split-lang's Introduction


split-lang

English | 中文简体 | 日本語

Splits text by language: the text is first over-split into small substrings, which are then concatenated based on their detected language. Powered by:

  • splitting: budoux and rule-based splitting
  • language detection: fast-langdetect and wordfreq



1. 💡How it works

Stage 1: rule-based split (separate character, punctuation and digit)

  • hello, how are you -> hello | , | how are you
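A minimal sketch of the idea behind Stage 1, assuming a simple three-way character classification (digit, punctuation, everything else). The helper names `char_kind` and `rule_based_split` are hypothetical and are not part of the split-lang API; the package's actual rules may differ, for example in how decimal numbers such as 9.15 are kept together.

```python
import unicodedata
from itertools import groupby


def char_kind(ch: str) -> str:
    """Classify a character as 'digit', 'punctuation', or 'text' (hypothetical helper)."""
    if ch.isdigit():
        return "digit"
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    return "text"  # letters, ideographs, and whitespace all count as text here


def rule_based_split(text: str) -> list[str]:
    """Group consecutive characters of the same kind and drop empty pieces."""
    pieces = []
    for _, group in groupby(text, key=char_kind):
        piece = "".join(group).strip()
        if piece:
            pieces.append(piece)
    return pieces


print(rule_based_split("hello, how are you"))
# ['hello', ',', 'how are you']
```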

Stage 2: over-split the text into substrings, using budoux for mixed Chinese and Japanese text and whitespace for languages that are not written in scripta continua

  • 你喜欢看アニメ吗 -> 你 | 喜欢 | 看 | アニメ | 吗
  • 昨天見た映画はとても感動的でした -> 昨天 | 見た | 映画 | は | とても | 感動 | 的 | で | した
  • how are you -> how | are | you

Stage 3: concatenate substrings based on their languages using fast-langdetect, wordfreq and regex (rule-based)

  • 你 | 喜欢 | 看 | アニメ | 吗 -> 你喜欢看 | アニメ | 吗
  • 昨天 | 見た | 映画 | は | とても | 感動 | 的 | で | した -> 昨天 | 見た映画はとても感動的でした
  • how | are | you -> how are you
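The concatenation step in Stage 3 can be sketched as grouping adjacent substrings that share a language label. In this sketch the labels are written by hand; in the package they come from fast-langdetect, wordfreq and regex rules. The function name `merge_by_lang` and its `sep` parameter are assumptions made for illustration, not the split-lang API.

```python
from itertools import groupby


def merge_by_lang(labeled, sep=""):
    """Concatenate adjacent (lang, text) pairs that share the same language label.

    `sep` is the joiner: "" suits scripta continua, " " suits space-delimited text.
    """
    merged = []
    for lang, group in groupby(labeled, key=lambda pair: pair[0]):
        merged.append((lang, sep.join(text for _, text in group)))
    return merged


# Hand-labeled substrings standing in for real detector output.
cjk = [("zh", "你"), ("zh", "喜欢"), ("zh", "看"), ("ja", "アニメ"), ("zh", "吗")]
print(merge_by_lang(cjk))
# [('zh', '你喜欢看'), ('ja', 'アニメ'), ('zh', '吗')]

english = [("en", "how"), ("en", "are"), ("en", "you")]
print(merge_by_lang(english, sep=" "))
# [('en', 'how are you')]
```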

2. 🪨Motivation

  • TTS (Text-To-Speech) models often fail on multi-language speech generation. There are two ways to address this:
    • Train a model that can pronounce multiple languages
    • (This package) Separate the sentence by language first, then use a different model for each language
  • Existing models in NLP toolkits (e.g. SpaCy, jieba) are usually built to handle text in ONE language each, which means multi-language text needs pre-processing. Examples of such text:
你喜欢看アニメ吗?
Vielen Dank merci beaucoup for your help.
你最近好吗、最近どうですか?요즘 어떻게 지내요?sky is clear and sunny。

3. 📕Usage

3.1. 🚀Installation

You can install the package using pip:

pip install split-lang

3.2. Basic

3.2.1. split_by_lang


from split_lang import LangSplitter
lang_splitter = LangSplitter()
text = "你喜欢看アニメ吗"

substr = lang_splitter.split_by_lang(
    text=text,
)
for index, item in enumerate(substr):
    print(f"{index}|{item.lang}:{item.text}")

Output:

0|zh:你喜欢看
1|ja:アニメ
2|zh:吗

The next example sets merge_across_punctuation=True and times the calls:

import time

from split_lang import LangSplitter

lang_splitter = LangSplitter(merge_across_punctuation=True)
texts = [
    "你喜欢看アニメ吗?我也喜欢看",
    "Please star this project on GitHub, Thanks you. I love you请加星这个项目,谢谢你。我爱你この項目をスターしてください、ありがとうございます!愛してる",
]
time1 = time.time()
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
    print("----------------------")
time2 = time.time()
print(time2 - time1)

Output:

0|zh:你喜欢看
1|ja:アニメ
2|zh:吗?我也喜欢看
----------------------
0|en:Please star this project on GitHub, Thanks you. I love you
1|zh:请加星这个项目,谢谢你。我爱你
2|ja:この項目をスターしてください、ありがとうございます!愛してる
----------------------
0.007998466491699219

3.2.2. merge_across_digit

lang_splitter.merge_across_digit = False
texts = [
    "衬衫的价格是9.15便士",
]
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")

Output:

0|zh:衬衫的价格是
1|digit:9.15
2|zh:便士

3.3. Advanced

3.3.1. Usage of lang_map and default_lang (for your languages)

Important

Add language codes for your use case if other languages are needed. See Support Language.

  • The default lang_map is shown below.
    • If langua-py, fasttext, or any other language detector detects a language that is NOT included in lang_map, the substring's language is set to default_lang.
    • If you set default_lang, or the value of a key:value pair in lang_map, to x, the substring is merged into a neighboring substring.
      • zh | x | ja -> zh | ja (x has been merged into one side)
      • In the example below, zh-tw is set to x because characters in zh and ja text are sometimes detected as Traditional Chinese.
  • The default default_lang is x.
DEFAULT_LANG_MAP = {
    "zh": "zh",
    "yue": "zh",  # 粤语
    "wuu": "zh",  # 吴语
    "zh-cn": "zh",
    "zh-tw": "x",
    "ko": "ko",
    "ja": "ja",
    "de": "de",
    "fr": "fr",
    "en": "en",
    "hr": "en",
}
DEFAULT_LANG = "x"
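
The mapping behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `normalize` and `absorb_unknown` are hypothetical helpers, the map is a shortened copy of the default, and merging an x substring into the preceding neighbor is an arbitrary choice here (the README only says it is merged into one side).

```python
# Shortened copy of the default map, for illustration only.
LANG_MAP = {"zh": "zh", "zh-cn": "zh", "zh-tw": "x", "ja": "ja", "en": "en"}
DEFAULT_LANG = "x"


def normalize(detected, lang_map=LANG_MAP, default_lang=DEFAULT_LANG):
    """Map a raw detector code to a target code; unmapped codes fall back to default_lang."""
    return lang_map.get(detected, default_lang)


def absorb_unknown(pieces, unknown="x"):
    """Merge pieces labeled `unknown` into the preceding piece.

    Unknown pieces at the very start are prepended to the first known piece instead.
    """
    result = []
    pending = ""  # unknown text seen before any known piece
    for lang, text in pieces:
        if lang == unknown:
            if result:
                prev_lang, prev_text = result[-1]
                result[-1] = (prev_lang, prev_text + text)
            else:
                pending += text
        else:
            result.append((lang, pending + text))
            pending = ""
    if pending:  # the whole input was unknown
        result.append((unknown, pending))
    return result


print(normalize("zh-cn"), normalize("zh-tw"), normalize("es"))
# zh x x
print(absorb_unknown([("zh", "你好"), ("x", "漢字"), ("ja", "です")]))
# [('zh', '你好漢字'), ('ja', 'です')]
```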

4. Acknowledgement


split-lang's Issues

Use `wordfreq` to detect Chinese and Japanese better

rspeer/wordfreq

  • Words such as 日本人, 今晚 and 国外移民 are used in both Chinese and Japanese

Example Sentence (from wiki)

日本語(にほんご、にっぽんご)は、日本国内や、かつての日本領だった国、そして国外移民や移住者を含む日本人同士の間で使用されている言語。日本は法令によって公用語を規定していないが、法令その他の公用文は全て日本語で記述され、各種法令において日本語を用いることが規定され、学校教育においては「国語」の教科として学習を行うなど、事実上日本国内において唯一の公用語となっている。

is currently split into

0|ja:日本語(にほんご、にっぽんご)は、日本国内や、かつての日本領だった国、そして
1|zh:国外移民
2|ja:や移住者を含む
3|zh:日本人
4|ja:同士の間で使用されている言語。日本は法令によって公用語を規定していないが、法令その他の公用文は全て日本語で記述され、各種法令において日本語を用いることが規定され、学校教育においては「国語」の教科として学習を行うなど、事実上日本国内において唯一の公用語となっている。

Add rule based merge

  • If the substrings are labeled ja | zh | ja and the wordfreq frequency of the middle (zh) substring is greater in Japanese than in Chinese, detect it as Japanese
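
A minimal sketch of the proposed rule, assuming the frequency lookup is passed in as a callable `freq(word, lang)`. The function `relabel_sandwiched` is hypothetical, and the numbers in `TOY_FREQ` are placeholders invented for this example, not values from wordfreq. With the real library, `wordfreq.word_frequency` has a compatible `(word, lang)` call shape and could be passed instead.

```python
def relabel_sandwiched(pieces, freq):
    """Relabel a zh piece as ja when it sits between two ja pieces and is
    more frequent in Japanese than in Chinese according to `freq`."""
    out = list(pieces)
    for i in range(1, len(out) - 1):
        lang, text = out[i]
        between_ja = out[i - 1][0] == "ja" and out[i + 1][0] == "ja"
        if lang == "zh" and between_ja and freq(text, "ja") > freq(text, "zh"):
            out[i] = ("ja", text)
    return out


# Placeholder frequencies for illustration only (NOT real wordfreq values).
TOY_FREQ = {
    ("国外移民", "ja"): 2e-6, ("国外移民", "zh"): 1e-6,
    ("日本人", "ja"): 2e-4, ("日本人", "zh"): 1e-4,
}


def toy_freq(word, lang):
    return TOY_FREQ.get((word, lang), 0.0)


pieces = [
    ("ja", "そして"), ("zh", "国外移民"), ("ja", "や移住者を含む"),
    ("zh", "日本人"), ("ja", "同士の間で"),
]
print(relabel_sandwiched(pieces, toy_freq))
```

After relabeling, adjacent pieces with the same label can be concatenated as in Stage 3.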

Revise possible_detection_list using `fast_langdetect.detect_multilingual`, improve accuracy in Chinese and Japanese

Summary

  • possible_detection_list returns the languages a substring could plausibly be detected as
  • It is used when the neighboring substrings are both in language A and the middle substring is detected as another language B, WHILE its possible languages include language A
    • e.g. 日语使用者应超过一亿三千万人 -> 日语使用者应超过一亿 | 三千万 | 人
      • zh: 日语使用者应超过一亿
      • ja: 三千万人 (could be ja or zh)
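
The rule summarized above can be sketched as follows. The candidate sets are written by hand; in the package they would come from the detector (the issue title names `fast_langdetect.detect_multilingual`, which is not called here). The function `relabel_with_candidates` and the three-field tuple layout are assumptions made for illustration.

```python
def relabel_with_candidates(pieces):
    """pieces: list of (lang, text, candidates), where candidates is the set of
    languages the detector considers possible for that substring.

    If both neighbors share language A, the middle piece was detected as some
    other language B, and A is among its candidates, relabel the middle as A.
    """
    out = [(lang, text) for lang, text, _ in pieces]
    for i in range(1, len(pieces) - 1):
        left, right = out[i - 1][0], out[i + 1][0]
        lang, text, candidates = pieces[i]
        if left == right and lang != left and left in candidates:
            out[i] = (left, text)
    return out


# Hand-written detections mirroring the example above.
detected = [
    ("zh", "日语使用者应超过一亿", {"zh"}),
    ("ja", "三千万", {"ja", "zh"}),
    ("zh", "人", {"zh"}),
]
print(relabel_with_candidates(detected))
# [('zh', '日语使用者应超过一亿'), ('zh', '三千万'), ('zh', '人')]
```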
