garabik / unicode

display unicode character properties

License: Other

Python 89.21% Roff 10.79%

unicode's Introduction

This file is in UTF-8 encoding.

To use the unicode utility, you need:
 - python3
 - some python2 compatibility remains, but python2 is no longer supported (and will fail when UnicodeData.txt is compressed)
 - (recommended) the UnicodeData.txt file in /usr/share/unicode/, ~/.unicode/ or the current
   working directory.
    - apt-get install unicode-data  # Debian
    - dnf install unicode-ucd       # Fedora
    - unicode --download            # try to download the file
 - if you want to see Unicode block information, you also need the
   Blocks.txt file, which you should put into /usr/share/unicode/,
   ~/.unicode/ or the current working directory.
 - if you want to see UniHan properties, you also need the Unihan.txt file,
   which should be put into /usr/share/unicode/, ~/.unicode/ or the
   current working directory.
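
The lookup order described above can be sketched in Python as follows. This is only an illustration: the function name find_data_file and the exact search order are assumptions, not the utility's actual code.

```python
import os

# Directories named in the list above; the order is an assumption.
SEARCH_DIRS = ["/usr/share/unicode/", "~/.unicode/", "."]


def find_data_file(filename, search_dirs=SEARCH_DIRS):
    """Return the first existing path to `filename`, or None if it is absent."""
    for directory in search_dirs:
        # expanduser() turns "~/.unicode/" into the user's home directory.
        candidate = os.path.join(os.path.expanduser(directory), filename)
        if os.path.isfile(candidate):
            return candidate
    return None
```

Calling find_data_file("UnicodeData.txt") returns a path if the file is present in one of the three places, and None otherwise.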


Enter a regular expression, a hexadecimal number or some characters as an
argument. unicode will try to guess what you want to look up; see the
manpage if you want to force other behaviour (the manpage is also the
best documentation). In particular, -r forces a regular expression search
in the names of characters, and -s forces unicode to display
information about the characters given.
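
To give an idea of what a regular expression search over character names involves, here is a rough sketch built on Python's standard unicodedata module. It is not the utility's own implementation (which reads UnicodeData.txt); search_names and its default range are made up for this illustration.

```python
import re
import unicodedata


def search_names(pattern, start=0x0000, stop=0x3000):
    """Yield (codepoint, name) for characters whose name matches `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    for codepoint in range(start, stop):
        # The second argument is returned for characters that have no name.
        name = unicodedata.name(chr(codepoint), "")
        if name and regex.search(name):
            yield codepoint, name
```

For example, dict(search_names("euro")) includes the entries for U+20A0 and U+20AC. The character database used here is the one bundled with the Python interpreter, so very recent characters may be missing.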

Here are just some examples:

$ unicode.py euro
U+20A0 EURO-CURRENCY SIGN
UTF-8: e2 82 a0   UTF-16BE: 20a0   Decimal: &#8352;
₠
Category: Sc (Symbol, Currency)
Bidi: ET (European Number Terminator)

U+20AC EURO SIGN
UTF-8: e2 82 ac   UTF-16BE: 20ac   Decimal: &#8364;
€
Category: Sc (Symbol, Currency)
Bidi: ET (European Number Terminator)

$ unicode.py 00c0
U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
UTF-8: c3 80   UTF-16BE: 00c0   Decimal: &#192;
À (à)
Lowercase: U+00E0
Category: Lu (Letter, Uppercase)
Bidi: L (Left-to-Right)
Decomposition: 0041 0300



You can specify a range of characters as arguments; unicode will show
these characters in a tabular format, aligned to 256-character boundaries.
Use two dots ".." to indicate the range, e.g.

       unicode 0450..0520

will display the whole Cyrillic, Armenian and Hebrew blocks (characters from U+0400 to U+05FF).

       unicode 0400..

will display just the characters from U+0400 up to U+04FF.
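
The alignment rule can be illustrated with a little bit arithmetic. This is a sketch of the behaviour described above, not the utility's own code; parse_range is a hypothetical name.

```python
def parse_range(spec):
    """Parse 'LOW..HIGH' or 'LOW..' (hexadecimal) into an aligned (first, last) pair."""
    low_text, _, high_text = spec.partition("..")
    low = int(low_text, 16)
    # An open-ended range covers only the row that contains its start.
    high = int(high_text, 16) if high_text else low
    # Round the start down and the end up to a 256-character row.
    return low & ~0xFF, high | 0xFF
```

With this, parse_range("0450..0520") gives (0x0400, 0x05FF) and parse_range("0400..") gives (0x0400, 0x04FF), matching the two examples.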

Use --fromcp to query codepoints from other encodings:

$ unicode --fromcp cp1250 -d 200
U+010C LATIN CAPITAL LETTER C WITH CARON
UTF-8: c4 8c  UTF-16BE: 010c  Decimal: &#268;
Č (č)
Lowercase: U+010D
Category: Lu (Letter, Uppercase)
Bidi: L (Left-to-Right)
Decomposition: 0043 030C

Multibyte encodings are supported:

$ unicode --fromcp big5 -x aff3

and multi-char strings are supported, too:

$ unicode --fromcp utf-8 -x c599c3adc5a5
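
Conceptually, such a query turns the number into bytes and decodes them with the named codec. Here is a hedged sketch using Python's built-in codecs; from_codepage and its hexadecimal parameter are illustrative names, and the utility itself may proceed differently.

```python
def from_codepage(codepage, value, hexadecimal=True):
    """Decode a byte sequence, given as a number in `codepage`, into a string."""
    if hexadecimal:
        # "c599c3adc5a5" -> b"\xc5\x99\xc3\xad\xc5\xa5"
        raw = bytes.fromhex(value)
    else:
        # A decimal value such as "200" becomes the single byte 0xC8.
        number = int(value, 10)
        raw = number.to_bytes(max(1, (number.bit_length() + 7) // 8), "big")
    return raw.decode(codepage)
```

from_codepage("cp1250", "200", hexadecimal=False) returns the single character U+010C, and from_codepage("utf-8", "c599c3adc5a5") returns the three-character string U+0159 U+00ED U+0165.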


On format (--format='...'):

The format string tells unicode which information should be displayed.
Exactly one escape sequence is recognised: \n for a new line.

You can use standard python .format() syntax. The following variables are
recognized:

{black} {red} {green} {yellow}
{blue} {magenta} {cyan} {white}  -- ANSI colours (foreground)

{on_black} {on_red} ...          -- ANSI colours (background)

{no_colour} {default} {bold}
{underline} {blink} {reverse}
{concealed}                      -- self-explanatory ANSI escape codes

{ordc} -- unicode codepoint of the character (integer)
{name} -- unicode name of the character
{utf8} -- utf8 representation of the character (hexadecimal)
{utf16be} -- utf16 representation of the character (hexadecimal)
{decimal} -- decimal representation of the character
{opt_additional} -- optional representation in additional charset (-c); 
                    empty string if not specified
{pchar} -- the character itself
{opt_flipcase} -- upper- or lowercase opposite of the character, in parentheses;
                  empty if character is not cased
{opt_uppercase}{opt_lowercase} -- optional string describing uppercase
                                  or lowercase variant of the character;
                                  empty if character is not cased
{category} {category_desc} -- character category and its human readable description
{opt_numeric}{numeric_desc} -- the string `Numeric value:' and the numeric value
                               of the character; both empty if the character
                               has no numeric value
{opt_digit}{digit_desc} -- the string `Digit value:' and the digit value
                           of the character; both empty if the character
                           has no digit value
{opt_bidi}{bidi}{bidi_desc} -- the string `Bidi:', the bidi property and
                               a human readable description 
                               of the bidi property; empty if the character
                               has no bidi category
{mirrored_desc} -- the string 'Character is mirrored' if the character is mirrored,
                   empty otherwise
{opt_combining}{combining_desc} -- the string `Combining: ', combining class and a
                                   human readable description of the combining class;
                                   empty if the character is not combining
{opt_decomp}{decomp_desc} -- the string `Decomposition: ' and a hexadecimal sequence
                             of decomposition characters; empty if the character
                             has no decomposition
{opt_unicode_block}{opt_unicode_block_desc} -- the string `Unicode block:',
                                               range of the unicode block
                                               and description of said unicode
                                               block for the given character
{opt_eaw}{eaw_desc} -- the string `East Asian width:' and the human readable
                       value of East Asian width
{opt_derivedage}{opt_derivedage_age}{opt_derivedage_desc} -- the string `Age:',
	the Unicode version in which the character was introduced, and a human
	readable description of that version and the year it was introduced.
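
As a rough picture of how such a format string is applied, the sketch below fills a handful of the variables listed above and passes them to str.format(). It is an assumption-laden illustration: render is a made-up name, only a subset of the variables is provided, and the values come from Python's unicodedata module rather than from the utility.

```python
import unicodedata


def render(format_string, char):
    """Apply a --format style string to one character (subset of variables)."""
    fields = {
        "ordc": ord(char),  # integer, so format specs such as {ordc:04X} work
        "name": unicodedata.name(char, ""),
        "utf8": " ".join("%02x" % b for b in char.encode("utf-8")),
        "utf16be": char.encode("utf-16-be").hex(),
        "decimal": "&#%d;" % ord(char),
        "pchar": char,
        "category": unicodedata.category(char),
    }
    # The only escape recognised in the format string is a literal \n.
    return format_string.replace("\\n", "\n").format(**fields)
```

For instance, rendering the format string '{pchar} U+{ordc:04X} {name}\n' for the euro sign produces the character, "U+20AC EURO SIGN" and a trailing newline.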

unicode's People

davejagoda cben raylu dscorbett aib mdirik sireof sahwar microhexhq bogn83 bitforks gnprice mandel59 sersorrel matrixsociety pragma- julienpalard shakahl

unicode's Issues

Remove brexit-ascii

Does the brexit-ascii feature have any real purpose?
According to https://politics.stackexchange.com/questions/61178/why-does-the-eu-uk-trade-deal-have-the-7-bit-ascii-table-as-an-appendix it seems that the table is just regular ASCII with several JSON/PDF/... encoding errors.

On the other hand, the --br shortcut for --brief no longer works :(

get_unihan_properties_internal raises AttributeError

When a CJK character is looked up with the verbose option, an AttributeError is thrown:

% ./unicode -v 夢
U+5922 CJK UNIFIED IDEOGRAPH-5922
UTF-8: e5 a4 a2 UTF-16BE: 5922 Decimal: &#22818; Octal: \054442
夢
Category: Lo (Letter, Other); East Asian width: W (wide)
Unicode block: 4E00..9FFF; CJK Unified Ideographs
Bidi: L (Left-to-Right)


Traceback (most recent call last):
  File "./unicode", line 1014, in <module>
    main()
  File "./unicode", line 1011, in main
    print_characters(processed_args, options.maxcount, format_string, options.query_wikipedia, options.query_wiktionary)
  File "./unicode", line 746, in print_characters
    uhp = get_unihan_properties(c)
  File "./unicode", line 335, in get_unihan_properties_internal
    properties[key] = value.decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'

Env. info:
macOS Big Sur version 11.2.1, MacBook Air (M1, 2020), Apple M1
Python 3.9.2

grep is placed in /usr/bin
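
The traceback shows .decode() being called on a value that is already a str, which fails under Python 3. One possible guard, offered only as a sketch and not as the fix actually applied upstream, is to decode only when the value is still bytes (to_text is a hypothetical helper):

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding only when it is still bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

The failing line would then read properties[key] = to_text(value), which behaves the same whether the Unihan data arrives as bytes or as text.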

ERROR: No matching distribution found for unidecode

This error is shown when I try to install another package that depends on this one.

Building wheels for collected packages: unicode
Building wheel for unicode (setup.py) ... done
Stored in directory: C:\Users\XXXXXXXX
Successfully built unicode

I downloaded and installed it manually, but afterwards I see only one entry in the site-packages folder, unicode-2.7.dist-info, without the unicode folder itself. I'm not sure why this is happening; I'm using Windows 10 and Python 3.6.

Need help. Thanks

Release on PyPi?

Is this package ready to be hosted on PyPI? We can try it out on https://test.pypi.org/ (under a dummy name) and if it works can we release it there?

Also, does anyone know where to find a "unicode-data" package for MacOS?

unicode finds BEVERAGE but not HOT

Still there's a HOT BEVERAGE character.

"HOT" in fact matches the three letters "H", "O", "T", yet I expected to find "HOT BEVERAGE".

Speaking of ☕, I think I owe you one for this package!!! :)

Special case control chars

unicode 0x07 outputs a bell, for example

pchar is inserted directly into the format string but should probably get special handling. This would allow me to easily embed the output with pchar into other apps (an IRC bot that responds to "!unicode", in particular).
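
One way to special-case control characters, sketched here as a suggestion rather than as what the utility does, is to substitute the matching symbol from the Control Pictures block (U+2400..U+2421) before printing. printable_char is a hypothetical helper name.

```python
def printable_char(char):
    """Map C0 controls and DEL to Control Pictures; leave other characters alone."""
    codepoint = ord(char)
    if codepoint < 0x20:
        # U+2400..U+241F mirror the control codes U+0000..U+001F.
        return chr(0x2400 + codepoint)
    if codepoint == 0x7F:
        return "\u2421"  # SYMBOL FOR DELETE
    return char
```

With this, the bell character 0x07 would be shown as U+2407 SYMBOL FOR BELL instead of ringing the terminal. C1 controls (U+0080..U+009F) are not handled in this sketch.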

Request: suggest where to get UnicodeData.txt

The README says UnicodeData.txt is required and suggests the URL http://www.unicode.org/Public/. That URL presents me with 44 directories and no indication which of them might contain a file named UnicodeData.txt. I can understand that versions change and you might not want to link to a specific version, but at least could you give a hint as to which directory people should be looking at? There seems to be a file with that name in UNIDATA/ so I'll start with that, but since you didn't link directly there I'm wondering if there might be other versions in other directories (that file does seem to work).

This unicode module looks very useful, thanks.

would be nice to see version of unicode that introduced character

Hey there! I love your tool, and have been using and recommending it for years.

It would be nice to display the version of Unicode that introduced a character, e.g. "13.0" for say U+1FBF5 SEGMENTED DIGIT FIVE. I don't believe that this can be determined strictly from Block membership, but I'm not certain that it can't (i.e. I think non-full Blocks can have characters added over time). It seems reasonable to require an external data source, as the Block information does.

This is of course strictly a wishlist item. If you have no interest in doing it, but would accept a patch, feel free to assign it to me, and I'll kick one over to you when I get the time. Thanks!
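
A possible external data source is DerivedAge.txt from the Unicode Character Database, which lists code point ranges together with the version that introduced them. The sketch below parses lines in that format; the three sample lines are abbreviated illustrations of the file layout, and the function names are invented for this example.

```python
SAMPLE_DERIVED_AGE = [
    "# Sample lines in the DerivedAge.txt format",
    "0000..001F    ; 1.1 #  [32] <control-0000>..<control-001F>",
    "20AC          ; 2.1 #       EURO SIGN",
    "1FBF0..1FBF9  ; 13.0 #  [10] SEGMENTED DIGIT ZERO..SEGMENTED DIGIT NINE",
]


def parse_age_ranges(lines):
    """Turn DerivedAge.txt style lines into (first, last, age) tuples."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        span, age = (part.strip() for part in data.split(";"))
        first, _, last = span.partition("..")
        # A single code point has no "..", so it is both first and last.
        ranges.append((int(first, 16), int(last or first, 16), age))
    return ranges


def age_of(codepoint, ranges):
    """Return the version string for `codepoint`, or None if it is not listed."""
    for first, last, age in ranges:
        if first <= codepoint <= last:
            return age
    return None
```

With the sample data, age_of(0x1FBF5, parse_age_ranges(SAMPLE_DERIVED_AGE)) returns "13.0", the value asked for above.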

Feature request: Unicode blocks

It would be useful to include the Unicode block of characters (http://ftp.unicode.org/Public/UNIDATA/Blocks.txt), since the name of a character does not always indicate which block it belongs to. For example:

U+A673 SLAVONIC ASTERISK
UTF-8: ea 99 b3 UTF-16BE: a673 Decimal: &#42611; Octal: \0123163
꙳
Block: Cyrillic Extended-B (U+A640..U+A69F)
Category: Po (Punctuation, Other)
Bidi: ON (Other Neutrals)

U+F0021  - No such unicode character name in database
UTF-8: f3 b0 80 a1 UTF-16BE: db80dc21 Decimal: &#983073; Octal: \03600041
󰀡 (󰀡)
Uppercase: F0021
Block: Supplementary Private Use Area-A (U+F0000..U+FFFFD)
Category: Co (Other, Private Use)
Bidi: L (Left-to-Right)
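
Looking up the block for a code point amounts to a range search over Blocks.txt. The following is a small sketch of that idea using a binary search; the sample text only imitates the file's "START..END; Name" layout, and load_blocks and block_of are hypothetical names rather than the utility's functions.

```python
import bisect

SAMPLE_BLOCKS = """\
# Sample lines in the Blocks.txt format
0400..04FF; Cyrillic
A640..A69F; Cyrillic Extended-B
F0000..FFFFF; Supplementary Private Use Area-A
"""


def load_blocks(text):
    """Parse Blocks.txt style text into a sorted list of (first, last, name)."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, name = line.split(";", 1)
        first, last = span.strip().split("..")
        blocks.append((int(first, 16), int(last, 16), name.strip()))
    blocks.sort()
    return blocks


def block_of(codepoint, blocks):
    """Return the name of the block containing `codepoint`, or None."""
    # Find the last block whose start is not greater than the code point.
    index = bisect.bisect_right(blocks, (codepoint, 0x110000, "")) - 1
    if index >= 0 and blocks[index][1] >= codepoint:
        return blocks[index][2]
    return None
```

With the sample data, block_of(0xA673, load_blocks(SAMPLE_BLOCKS)) returns "Cyrillic Extended-B", as in the first example above.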
