replace_irregular_chars's Introduction

irregular-chars

irregular-chars is a library for cleaning text, such as removing zero-width characters or converting full-width characters to half-width.

Installation

You can install the package via pip:

pip install irregular_chars

Usage

remove zero-width spaces

from irregular_chars import remove_zero_width_spaces

text = "Hello\u200BWorld"
clean_text = remove_zero_width_spaces(text)  # call the same name that was imported above
print(clean_text)  # Outputs: HelloWorld
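For readers curious what this kind of cleanup involves, the following is a minimal standard-library sketch. It is not the irregular_chars implementation: the function name `strip_zero_width_sketch` and the set of code points treated as zero-width are assumptions made for illustration.

```python
# Illustrative sketch only; not the irregular_chars implementation.
# The code point set below is an assumption about what counts as "zero-width".
ZERO_WIDTH_CODE_POINTS = (
    0x200B,  # ZERO WIDTH SPACE
    0x200C,  # ZERO WIDTH NON-JOINER
    0x200D,  # ZERO WIDTH JOINER
    0x2060,  # WORD JOINER
    0xFEFF,  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
)

# str.translate deletes every character whose code point maps to None.
_DELETE_TABLE = dict.fromkeys(ZERO_WIDTH_CODE_POINTS)


def strip_zero_width_sketch(text: str) -> str:
    """Return text with the code points listed above removed."""
    return text.translate(_DELETE_TABLE)


print(strip_zero_width_sketch("Hello\u200bWorld"))  # HelloWorld
```

Note that removing U+200D (zero width joiner) also splits emoji sequences that rely on it, so a real cleaner may want a narrower set.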

change width

  • convert full-width alphanumerics to half-width
from irregular_chars import full_to_small_width_alphanumerics
assert full_to_small_width_alphanumerics("０") == "0"  # full-width "０" (U+FF10) becomes ASCII "0"
  • convert half-width kana to full-width
from irregular_chars import half_to_full_width_kanas
assert half_to_full_width_kanas("ｱ") == "ア"  # half-width "ｱ" (U+FF71) becomes "ア" (U+30A2)
  • normalize the width of both kana and alphanumerics
from irregular_chars import normalize_width_all
assert normalize_width_all("ｱ０") == "ア0"  # kana widened, digit narrowed
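As a point of comparison, Python's standard library can perform a similar width normalization through Unicode NFKC normalization. The sketch below is not how irregular_chars works internally (that is not documented here); `normalize_width_sketch` is a hypothetical name, and NFKC is a swapped-in technique that also rewrites other compatibility characters beyond width variants.

```python
import unicodedata


def normalize_width_sketch(text: str) -> str:
    """Fold width variants with NFKC (an assumption, not the library's method).

    NFKC maps half-width katakana to full-width katakana and full-width
    ASCII letters and digits to their ordinary ASCII forms.
    """
    return unicodedata.normalize("NFKC", text)


# Half-width "ｱ" (U+FF71) followed by full-width "０" (U+FF10).
print(normalize_width_sketch("\uff71\uff10"))  # ア0
```

Because NFKC is broader than a width conversion (for example, it also turns "①" into "1"), a dedicated function is the safer choice when only width should change.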

combine sound symbols

from irregular_chars import combine_sound_symbols
assert combine_sound_symbols("ガギグゲゴ") == "ガギグゲゴ"  # True
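The idea of combining a kana with its sound mark can be illustrated with the standard library. This sketch assumes the input uses the combining marks U+3099 (voiced) and U+309A (semi-voiced); whether combine_sound_symbols handles exactly these code points is not stated in this README, and `compose_sound_marks_sketch` is a hypothetical name.

```python
import unicodedata


def compose_sound_marks_sketch(text: str) -> str:
    """Compose base kana + combining sound mark into one code point via NFC."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u30ab\u3099"  # "カ" followed by the combining voiced sound mark
composed = compose_sound_marks_sketch(decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u30ac")  # True ("ガ" as a single code point)
```

NFC only composes the combining marks; the spacing forms U+309B and U+309C are left as they are, so they would need separate handling.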

ivs (ideographic variation sequences)

  • detect IVS (Unicode IVS): the character's code point falls in the variation selector range (U+E0100 to U+E01EF). Variation selectors of this kind can simply be ignored.
from irregular_chars.ivs import is_unicode_ivs
assert is_unicode_ivs(0xE0100)  # True
  • detect IVS (CJK or supplementary IVS): the code point falls in the range covering CJK Unified Ideographs Extensions B-F and the Supplementary Ideographic Plane (U+20000 to U+2FA1F).

These characters are tightly bound to the preceding character, so they cannot be removed or replaced on their own.

from irregular_chars.ivs import is_cjk_or_supplementary_ivs
assert is_cjk_or_supplementary_ivs(0x20000)  # True
  • strip Unicode IVS, and raise an error if a CJK or supplementary IVS is found.
from irregular_chars.ivs import remove_ivs
assert remove_ivs("test\U000E0100") == "test"  # True
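The range checks described above can be sketched in a few lines. This follows only the ranges given in this README; the function names ending in `_sketch` are hypothetical, and raising `ValueError` is an assumption, since the exception type used by remove_ivs is not stated here.

```python
def is_unicode_ivs_sketch(code_point: int) -> bool:
    """True for the variation selector range U+E0100 to U+E01EF."""
    return 0xE0100 <= code_point <= 0xE01EF


def is_cjk_or_supplementary_sketch(code_point: int) -> bool:
    """True for the range U+20000 to U+2FA1F given in the README."""
    return 0x20000 <= code_point <= 0x2FA1F


def remove_ivs_sketch(text: str) -> str:
    """Drop variation selectors; refuse text containing U+20000 to U+2FA1F."""
    kept = []
    for ch in text:
        cp = ord(ch)
        if is_cjk_or_supplementary_sketch(cp):
            # Assumption: ValueError stands in for whatever the library raises.
            raise ValueError(f"cannot safely remove U+{cp:05X}")
        if not is_unicode_ivs_sketch(cp):
            kept.append(ch)
    return "".join(kept)


print(remove_ivs_sketch("test\U000E0100"))  # test
```

Raising instead of silently dropping the character matches the README's point that these code points cannot be removed or replaced on their own.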

Contributors

masatoemata
