hackerb9 / ugrep

find unicode characters based on their names

License: GNU General Public License v3.0

Python 96.93% Shell 3.07%

unicode unicode-data unicode-characters unicode-character-database grep emoji emoji-unicode emoji-searcher emoji-picker


☙ ugrep ❧

Find unicode characters based on their names

ugrep is essentially grep for the Unicode table. It prints out the resulting unicode characters literally, so you can easily cut-and-paste. Ugrep is useful for looking up Emojis 😤, finding obscure symbols ⚸⅗ℏ℞☧☭, or beautiful glyphs to decorate your text. 🙶❡✯🟔❢🙷

You can also use it for the reverse operation: looking up a single character (or a string of them) that you've pasted into the terminal.

As a bonus, it can list which fonts are installed that contain a particular unicode character and — through the magic of sixels — will show a rendering in each font.

Installation

It's just a Python 3 shell script. Download it to /usr/local/bin or ~/bin and make it executable.

cd /usr/local/bin
wget https://github.com/hackerb9/ugrep/raw/master/ugrep
chmod +x ugrep

Usage

Search by name: ugrep [-w] regex

Look up a character name where regex is a regular expression. If you don't know regular expressions, don't worry. Just use plain strings and you'll rarely be wrong.
```
  ugrep runic
```
If you find ugrep returning too many hits because the phrase you used is found in other terms (e.g., `thema` found in `mathematical`), use the `-w` option to limit the search to complete words.

Search by number: ugrep codepoint[..codepoint[..increment]]

Look up a character (or a range of them) using Unicode code points in hexadecimal. For example,
```
  ugrep 03c0
  ugrep 23b0..f
  ugrep 0..10ffff..1000
```
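
Judging from the examples in this README (`26b3..b` lists U+26B3 through U+26BB), a short end value seems to reuse the leading digits of the start value, and the increment is hexadecimal too. A minimal sketch of that reading follows; `parse_range` is a hypothetical helper written for illustration, not ugrep's actual code.

```python
def parse_range(spec):
    """Expand 'start[..end[..increment]]' (all hexadecimal) into a range."""
    parts = spec.split("..")
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"bad range: {spec!r}")
    start_hex = parts[0]
    end_hex = parts[1] if len(parts) > 1 else start_hex
    # Assumed rule: a short end value replaces the trailing digits of the
    # start value, so '23b0..f' means 23b0..23bf.
    if len(end_hex) < len(start_hex):
        end_hex = start_hex[: len(start_hex) - len(end_hex)] + end_hex
    step = int(parts[2], 16) if len(parts) == 3 else 1
    # The end of the range is inclusive, hence the + 1.
    return range(int(start_hex, 16), int(end_hex, 16) + 1, step)


for cp in parse_range("23b0..f"):
    print(f"U+{cp:04X}")
```

Returning a `range` keeps even `0..10ffff` cheap to construct; the cost only appears when each code point is looked up.
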
Search by character: ugrep [-c] character string

Look up each character in a string. Note that if the string is a single character, e.g., ugrep X, then -c is implied and need not be specified.
```
  ugrep -c "(ﾟ∀ﾟ)"
```
List fonts for a character: ugrep [-l] character

After showing the usual character information, list installed fonts that contain that character and show an example in each:
```
  ugrep -l mho
```
☝ When sshed to another machine, ugrep shows the fonts installed on the remote machine.

List fonts, scaled larger: ugrep [-L scale] character

Same as -l, but scale up the example rendering in each font to be easier to read:
```
  ugrep -L2 -w om
```
Useful scale values range from 2 to 8.

Examples

Note: output from all examples has been excerpted. (You'd be amazed how many heart emojis Unicode has. 😜)

Fun things to try:

To see some useful and lovely glyphs, try this:

ugrep face 
ugrep alchemical 
ugrep ornament
ugrep bullet
ugrep '(vine|bud)'
ugrep vai
ugrep heavy
ugrep drawing
ugrep combining

Plain text search is simple:

    $ ugrep heart
    ☙	U+2619	REVERSED ROTATED FLORAL HEART BULLET
    ❣	U+2763	HEAVY HEART EXCLAMATION MARK ORNAMENT
    ❤	U+2764	HEAVY BLACK HEART
    ⋮	[ ... truncated for brevity ... ]
    💞	U+1F49E	REVOLVING HEARTS
    💟	U+1F49F	HEART DECORATION
    😍	U+1F60D	SMILING FACE WITH HEART-SHAPED EYES
    😻	U+1F63B	SMILING CAT FACE WITH HEART-SHAPED EYES

Paste in a single character to look up its code point:

    $ ugrep ☺
    ☺       U+263A  WHITE SMILING FACE

Arguments on the command line have an implicit wildcard between them:

    $ ugrep right.*gle
    $ ugrep right gle       # Equivalent
    »	U+00BB	RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    ’	U+2019	RIGHT SINGLE QUOTATION MARK
    ∟	U+221F	RIGHT ANGLE
    ⊿	U+22BF	RIGHT TRIANGLE

You can use regular expressions for fancier searches:

    $ ugrep -w '(wo|hu)?m(a|e)ns?'
    ᛗ	U+16D7	RUNIC LETTER MANNAZ MAN M
    ⛀	U+26C0	WHITE DRAUGHTS MAN
    ⛂	U+26C2	BLACK DRAUGHTS MAN
    ⼈	U+2F08	KANGXI RADICAL MAN
    ⼥	U+2F25	KANGXI RADICAL WOMAN
    𝌂	U+1D302	DIGRAM FOR HUMAN EARTH
    𝌄	U+1D304	DIGRAM FOR EARTHLY HUMAN
    🕴	U+1F574	MAN IN BUSINESS SUIT LEVITATING
    🕺	U+1F57A	MAN DANCING
    🚹	U+1F6B9	MENS SYMBOL
    🚺	U+1F6BA	WOMENS SYMBOL
    🤰	U+1F930	PREGNANT WOMAN
    🤵	U+1F935	MAN IN TUXEDO
    
    $ ugrep ^x		    #  Regex anchors ^ and $ work
    ⊻	U+22BB	XOR
    ⌧	U+2327	X IN A RECTANGLE BOX (clear key)

Use the `-w` flag to search only for complete words:

    $ ugrep -w R	    # The letter R used as a word
    $ ugrep "\bR\b"	    # (regex equivalent)
    R	U+0052	LATIN CAPITAL LETTER R
    Ŗ	U+0156	LATIN CAPITAL LETTER R WITH CEDILLA
    ℛ	U+211B	SCRIPT CAPITAL R (Script r)
    ℜ	U+211C	BLACK-LETTER CAPITAL R (Black-letter r)
    ℝ	U+211D	DOUBLE-STRUCK CAPITAL R (Double-struck r)
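
In Python's `re` syntax the same effect comes from wrapping the pattern in `\b` word boundaries. The sketch below assumes `-w` does essentially that; `word_pattern` is a hypothetical name. The non-capturing group keeps alternations such as `(wo|hu)?m(a|e)ns?` inside the boundaries.

```python
import re


def word_pattern(regex):
    """Wrap a regex so that it only matches whole words."""
    return re.compile(rf"\b(?:{regex})\b", re.IGNORECASE)


names = [
    "MATHEMATICAL BOLD CAPITAL A",
    "LATIN CAPITAL LETTER R",
    "RIGHT ANGLE",
]
# Only the name containing R as a separate word is kept.
print([n for n in names if word_pattern("R").search(n)])
```
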

Use -c to display info for each character in a string.

    $ ugrep -c "ᕕ( ᐛ )ᕗ"
    ᕕ   U+1555  CANADIAN SYLLABICS FI
    (   U+0028  LEFT PARENTHESIS (opening parenthesis)
        U+0020  SPACE
    ᐛ   U+141B  CANADIAN SYLLABICS NASKAPI WAA
        U+0020  SPACE
    )   U+0029  RIGHT PARENTHESIS (closing parenthesis)
    ᕗ   U+1557  CANADIAN SYLLABICS FO
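
Conceptually, `-c` walks the string one code point at a time. The sketch below uses Python's `unicodedata` module purely as a stand-in for the name lookup; ugrep itself reads UnicodeData.txt (see the implementation notes), and `describe` is a hypothetical helper.

```python
import unicodedata


def describe(text):
    """Yield (character, 'U+XXXX', name) for every character in text."""
    for ch in text:
        # Stand-in lookup; ugrep consults UnicodeData.txt instead.
        name = unicodedata.name(ch, "<no name>")
        yield ch, f"U+{ord(ch):04X}", name


for ch, code, name in describe("( )"):
    print(f"{ch}\t{code}\t{name}")
```
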

Aliases (alternate names) are also searched:

    $ ugrep backslash
    \	U+005C	REVERSE SOLIDUS (backslash)

Use .. to browse through a range of Unicode characters:

    $ ugrep 26b3..b
    ⚳	U+26B3	CERES
    ⚴	U+26B4	PALLAS
    ⚵	U+26B5	JUNO
    ⚶	U+26B6	VESTA
    ⚷	U+26B7	CHIRON
    ⚸	U+26B8	BLACK MOON LILITH
    ⚹	U+26B9	SEXTILE
    ⚺	U+26BA	SEMISEXTILE
    ⚻	U+26BB	QUINCUNX

    $ ugrep 1f470..ff  |  less
    👰	U+1F470	BRIDE WITH VEIL
    👱	U+1F471	PERSON WITH BLOND HAIR
    👲	U+1F472	MAN WITH GUA PI MAO
    👳	U+1F473	MAN WITH TURBAN
    👴	U+1F474	OLDER MAN
    👵	U+1F475	OLDER WOMAN
    👶	U+1F476	BABY
    👷	U+1F477	CONSTRUCTION WORKER
    👸	U+1F478	PRINCESS
    👹	U+1F479	JAPANESE OGRE
    👺	U+1F47A	JAPANESE GOBLIN
    👻	U+1F47B	GHOST
    👼	U+1F47C	BABY ANGEL
    👽	U+1F47D	EXTRATERRESTRIAL ALIEN
    ⋮	[ ... truncated for brevity ... ]
    📼	U+1F4FC	VIDEOCASSETTE
    📽	U+1F4FD	FILM PROJECTOR
    📾	U+1F4FE	PORTABLE STEREO
    📿	U+1F4FF	PRAYER BEADS

Sometimes it's useful (or just fun) to page through the Unicode
table and see what characters are defined in a region. (`ugrep
2700..ff`) Ranges are convenient, but very slow. Use regular
expressions if you want speed. (`ugrep U+27..`)

Ranges can have an optional increment:

$ ugrep 0..ffff..1000
   �    U+0000  <control> (null)
   က    U+1000  MYANMAR LETTER KA
  [ ]   U+2000  EN QUAD
  [　]  U+3000  IDEOGRAPHIC SPACE
   䀀   U+4000  cups; small cups ( M: fàn, C: fan3 fan4 fan6 )
   倀   U+5000  bewildered; rash, wildly ( M: chāng, C: caang1 caang4 coeng1 zaang1, J: KURUU TAORERU, K: CHANG, V: trành )
   怀   U+6000  bosom, breast; carry in bosom ( M: huái, C: waai4 )
   瀀   U+7000  [CJK Unified Ideographs] ( M: yōu, J: ATSUI )
   耀   U+8000  shine, sparkle, dazzle; glory ( M: yào, C: jiu6, J: KAGAYAKU, K: YO )
   退   U+9000  step back, retreat, withdraw ( M: tuì, C: teoi3, J: SHIRIZOKU SHIRIZOKERU, K: THOY, V: thoái )
   ꀀ   U+A000  YI SYLLABLE IT
   뀀   U+B000  Block: [Hangul Syllables]
   쀀   U+C000  Block: [Hangul Syllables]
   퀀   U+D000  Block: [Hangul Syllables]
   �    U+E000  <Private Use, First>
       U+F000  Block: [Private Use Area]

Tip: pipe long output to less and search for a code point by pressing /U\+A60F.

Use -l to list which installed fonts contain a certain glyph:

  ugrep -l swash amp

Requires FontConfig. (Most GNU/Linux boxes should already be set).
The requested character may also be displayed in each of the listed typefaces, but only if your terminal supports sixel graphics (e.g., xterm -ti vt340) and you have ImageMagick installed.
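
For readers curious how a font query like this can be made, FontConfig's `fc-list` accepts a `:charset=` pattern. The sketch below assumes that approach; whether ugrep calls `fc-list` in exactly this way is not stated here, and both function names are hypothetical.

```python
import shutil
import subprocess


def fc_list_command(codepoint):
    """Build an fc-list invocation selecting fonts that cover codepoint."""
    return ["fc-list", f":charset={codepoint:x}", "family"]


def fonts_containing(codepoint):
    """Return sorted family names, or [] when fc-list is not installed."""
    if shutil.which("fc-list") is None:
        return []
    result = subprocess.run(
        fc_list_command(codepoint), capture_output=True, text=True, check=False
    )
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})


# U+2127 INVERTED OHM SIGN is the character found by `ugrep mho`.
print(fc_list_command(0x2127))
```
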

Use -L to scale up the font examples when listing fonts

ugrep -L4 fdfd
   ﷽    U+FDFD  ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM
                  Aldhabi
                  Trutypewriter PolyglOTT
                  Unifont

Note that increasing the glyph size also increases the text size. Not all terminals are capable of "double height" text. If yours shows two lines of the same text in the usual size, try using --never-double-text.
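
The "two lines of the same text" symptom fits the way DEC-compatible terminals draw double-height text: the line is sent twice, once marked as the top half (`ESC # 3`) and once as the bottom half (`ESC # 4`). A terminal that ignores the sequences just prints both copies. That ugrep emits exactly these sequences is an assumption; `double_height` is a hypothetical helper.

```python
ESC = "\x1b"


def double_height(text):
    """Return text as a DECDHL pair: top-half line, then bottom-half line."""
    return f"{ESC}#3{text}\n{ESC}#4{text}\n"


print(double_height("Unifont"), end="")
```
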

Copy whitespace from the terminal

    $ ugrep -w space
	  [ ]   U+0020  SPACE (SP)
	  [ ]   U+00A0  NO-BREAK SPACE (non-breaking space) (NBSP)
	  [ ]   U+1680  OGHAM SPACE MARK
	  [ ]   U+2002  EN SPACE
	  [ ]   U+2003  EM SPACE
	  [ ]   U+2004  THREE-PER-EM SPACE
	  [ ]   U+2005  FOUR-PER-EM SPACE
	  [ ]   U+2006  SIX-PER-EM SPACE
	  [ ]   U+2007  FIGURE SPACE
	  [ ]   U+2008  PUNCTUATION SPACE
	  [ ]   U+2009  THIN SPACE
	  [ ]   U+200A  HAIR SPACE

Whitespace characters are printed with square brackets around them to make it easy to highlight and copy them from the terminal. They will also be shown with a yellow background, if the terminal allows.

Determine if an alias is actually a correction

Ugrep shows the character name in all caps and aliases are usually lowercase in parentheses. Some aliases are treated differently. For aesthetic reasons, abbreviations are also shown in uppercase. For example:

� U+FEFF ZERO WIDTH NO-BREAK SPACE (byte order mark) (BOM) (ZWNBSP)

There are 31 characters in Unicode which have the wrong name in the UnicodeData.txt database. Unicode includes the correct name as an alias in NameAliases.txt. If that file exists on your system, then ugrep will show the correction in Title Case Letters and in red letters, if the terminal supports color text.

︘ U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (Presentation Form For Vertical Right White Lenticular Bracket)
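
NameAliases.txt stores one alias per line as `code;alias;type`. The sketch below shows one way to turn those types into the casing described above (corrections in title case, abbreviations left in capitals, everything else lowercase). Color is omitted, the three-way rule is my reading of this section, and `format_aliases` is a hypothetical helper.

```python
SAMPLE = """\
FE18;PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET;correction
FEFF;BYTE ORDER MARK;alternate
FEFF;BOM;abbreviation
FEFF;ZWNBSP;abbreviation
"""


def format_aliases(text, codepoint):
    """Return the parenthesized aliases for one code point."""
    out = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, kind = line.split(";")
        if int(code, 16) != codepoint:
            continue
        if kind == "correction":
            alias = alias.title()
        elif kind != "abbreviation":
            alias = alias.lower()
        out.append(f"({alias})")
    return " ".join(out)


print(format_aliases(SAMPLE, 0xFE18))
print(format_aliases(SAMPLE, 0xFEFF))
```
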

View CJK (Chinese-Japanese-Korean) characters

Unicode does not actually define most CJK characters, except indirectly via Unihan, which maps certain blocks of characters to other standards.

Ugrep allows one to specify the code point or paste in an example character to look up.

  $ ugrep 𰻞
     𰻞   U+30EDE biangbiang noodles ( M: biáng )

  $ ugrep 8000
  耀  U+8000  shine, sparkle, dazzle; glory ( M: yào, C: jiu6, J: KAGAYAKU, K: YO )

View all characters defined by Unicode:

    $ ugrep .?  |  less
    ⋮	[ ... over 30,000 glyphs elided for brevity ... ]

Want just Unicode glyphs without the description? Please use fonttable. It shows all defined Unicode characters by default.

Show all possible code points, even the ones not defined in Unicode:

	$ ugrep 0..10FFFF | less
    ⋮	[ ... over a million lines elided for brevity ... ]

☝ This is currently very slow due to the way ugrep is implemented. You likely want to use fonttable -u instead.

Prerequisite: UnicodeData.txt

Ugrep requires the Unicode data file UnicodeData.txt which can be installed on your system, in your home, or in the current directory.

Easiest: On Ubuntu and Debian GNU/Linux, simply apt install unicode-data.

Still easy: Or, you can download it by hand from unicode.org and place it in ~/.local/share/unicode/UnicodeData.txt

Not hard: Or, if you wish the file to be accessible to all users on your machine, place it in /usr/local/share/unicode/UnicodeData.txt.
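
A lookup across these locations might look like the sketch below. The search order is an assumption (the text above lists the places but not their priority), `/usr/share/unicode` is where I believe Debian's unicode-data package installs its files, and `find_data_file` is a hypothetical helper.

```python
from pathlib import Path

# Assumed search order, most specific location first.
SEARCH_DIRS = [
    Path.cwd(),
    Path.home() / ".local/share/unicode",
    Path("/usr/local/share/unicode"),
    Path("/usr/share/unicode"),  # assumed location of the packaged copy
]


def find_data_file(name="UnicodeData.txt", dirs=SEARCH_DIRS):
    """Return the first existing copy of name, or None if there is none."""
    for directory in dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


print(find_data_file() or "UnicodeData.txt not found")
```
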

Unihan CJK Support

If the file Unihan_Readings.txt exists, then ugrep will automatically use it to show an English gloss describing a character in the CJK (Chinese-Japanese-Korean) Ideographs region.

Your OS may make it easy to install (e.g., apt install unicode-data). On other systems, you can do this:

mkdir -p ~/.local/share/unicode
cd ~/.local/share/unicode
wget ftp://ftp.unicode.org/Public/UNIDATA/Unihan.zip
unzip Unihan.zip

CJK example

Example 1: Unicode code point

$ ugrep 8000
   耀   U+8000  shine, sparkle, dazzle; glory ( M: yào, C: jiu6, J: KAGAYAKU, K: YO )

The parenthesized text at the end shows the romanized pronunciation of the character in Mandarin (pinyin), Cantonese (jyutping), Japanese (Hepburn), and Korean (Yale).

Example 2: Using -c to see characters in a string

$ ugrep -c 「⿺辶⿳穴⿰月⿰⿲⿱幺長⿱言馬⿱幺長刂心」
   「   U+300C  LEFT CORNER BRACKET (opening corner bracket)
   ⿺   U+2FFA  IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER LEFT
   辶   U+8FB6  walk; walking; KangXi radical 162 ( M: chuò, J: SHINNYOU )
   ⿳   U+2FF3  IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO MIDDLE AND BELOW
   穴   U+7A74  cave, den, hole; KangXi radical 116 ( M: xué, C: jyut6, J: ANA, K: HYEL, V: huyệt )
   ⿰   U+2FF0  IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
   月   U+6708  moon; month; KangXi radical 74 ( M: yuè, C: jyut6, J: TSUKI, K: WEL, V: nguyệt )
   ⿰   U+2FF0  IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
   ⿲   U+2FF2  IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO MIDDLE AND RIGHT
   ⿱   U+2FF1  IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW
   幺   U+5E7A  one; tiny, small ( M: yāo, C: jiu1, J: CHIISAI, K: YO )
   長   U+9577  long; length; excel in; leader ( M: zhǎng, C: coeng4 zoeng2, J: NAGAI TAKERU OSA, K: CANG, V: trường )
   ⿱   U+2FF1  IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW
   言   U+8A00  words, speech; speak, say ( M: yán, C: jin4, J: KOTO IU KOTOBA, K: EN UN, V: ngôn )
   馬   U+99AC  horse; surname; KangXi radical 187 ( M: mǎ, C: maa5, J: UMA, K: MA, V: mã )
   ⿱   U+2FF1  IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW
   幺   U+5E7A  one; tiny, small ( M: yāo, C: jiu1, J: CHIISAI, K: YO )
   長   U+9577  long; length; excel in; leader ( M: zhǎng, C: coeng4 zoeng2, J: NAGAI TAKERU OSA, K: CANG, V: trường )
   刂   U+5202  knife; radical number 18 ( M: dāo, C: dou1, J: RITSUTOU, K: TO )
   心   U+5FC3  heart; mind, intelligence; soul ( M: xīn, C: sam1, J: KOKORO, K: SIM, V: tâm )
   」   U+300D  RIGHT CORNER BRACKET (closing corner bracket)

Note 1: A "definition" is not a translation

Unihan calls the English gloss the character's "definition", but that is meant in a very loose sense. CJK characters change meaning based upon the context they are used in. For example, most Chinese words are made of two characters, such as "蜂鳥", which means "hummingbird", but ugrep would show it as:

$ ugrep -c 蜂鳥
   蜂   U+8702  bee, wasp, hornet ( M: fēng, C: fung1, J: HACHI, K: PONG, V: ong )
   鳥   U+9CE5  bird; KangXi radical 196 ( M: niǎo, C: niu5, J: TORI, K: CO, V: điểu )

Note 2: Not all characters have readings

Unihan refers to this supplemental information — both the English gloss and the romanizations — as "readings". Readings are meant to be helpful, but are not normative and are only available for some characters.

| | Count | Percent |
|---|---:|---:|
| All CJK Characters | 93,858 | 100% |
| Have any reading | 47,429 | 51% |
| Mandarin Pinyin | 41,378 | 44% |
| Cantonese Jyutping | 23,112 | 25% |
| English definition | 21,076 | 23% |
| Japanese Hepburn | 11,293 | 12% |
| Korean Yale | 9,051 | 10% |
| Vietnamese | 8,301 | 9% |

Example of CJK with no Mandarin

$ ugrep 2bac3
   𫫃   U+2BAC3 (Cant.) sarcastic interrogative ( C: e1 )

Example of CJK with no pronunciation

$ ugrep 20015
   𠀕   U+20015 Variant of U+4E99 亙

Example of CJK with no English definition

$ ugrep 20016
   𠀖   U+20016 [CJK Unified Ideographs Extension B] ( V: khạng )

Example of CJK with no readings whatsoever

$ ugrep 2abcd
   𪯍   U+2ABCD [CJK Unified Ideographs Extension C]

Note that ugrep currently prints just the name of the block the character is in [within square brackets] if it has no better way to identify the character.
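
The block-name fallback amounts to an interval lookup in Blocks.txt, whose lines have the form `Start..End; Block Name`. A minimal sketch with three sample lines follows; `parse_blocks` and `block_of` are hypothetical helpers, and `No_Block` is the default that Blocks.txt itself names for unlisted code points.

```python
import bisect

SAMPLE_BLOCKS = """\
# Start Code..End Code; Block Name
4E00..9FFF; CJK Unified Ideographs
20000..2A6DF; CJK Unified Ideographs Extension B
2A700..2B73F; CJK Unified Ideographs Extension C
"""


def parse_blocks(text):
    """Return a sorted list of (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, name = line.split(";", 1)
        start, end = span.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    return sorted(blocks)


def block_of(codepoint, blocks):
    """Binary-search the block containing codepoint."""
    # 0x110000 exceeds every valid end value, so a block that starts
    # exactly at codepoint still sorts before the probe tuple.
    i = bisect.bisect_right(blocks, (codepoint, 0x110000, "")) - 1
    if i >= 0 and blocks[i][0] <= codepoint <= blocks[i][1]:
        return blocks[i][2]
    return "No_Block"


blocks = parse_blocks(SAMPLE_BLOCKS)
print(f"U+2ABCD [{block_of(0x2ABCD, blocks)}]")
```
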

Boring Implementation notes

This is a rewrite of b9's AWK ugrep into Python. While AWK makes more sense for what this program does (comparing fields based on regexps), a rewrite was necessary because GNU awk, while plenty powerful, uses \y for word edges instead of the standard \b. Gawk does this for backwards compatibility with historic AWK, but lacks a way to disable it for new scripts.

Switching to Python did have the benefit of allowing more powerful Perlesque regexes (not that anyone has requested that).

Why not use the unicodedata module?

I do not use Python's unicodedata module because it is woefully insufficient. It allows one to search by character name only by specifying it fully and exactly: unicodedata.lookup("ROTATED HEAVY BLACK HEART BULLET").
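
The limitation is easy to see in a few lines: the full name resolves, while a fragment of it raises `KeyError`. The wrapper `exact_lookup` is a hypothetical helper added only to make the failure visible.

```python
import unicodedata


def exact_lookup(name):
    """Return the character with this exact name, or None if there is none."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


print(exact_lookup("ROTATED HEAVY BLACK HEART BULLET"))
print(exact_lookup("HEART BULLET"))  # a partial name finds nothing
```
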

Future Work

Rename this project

Although I believe this ugrep existed first, there is now another ugrep which is quite widely known (with good reason, as it looks pretty nifty) and which has nothing to do with looking up Unicode characters. The 'U' appears to stand for Ultra-fast, as it is a very speedy grep with lots of bells and whistles.

What shall this project's new name be? ug is also taken by the other ugrep. How about ugre? It's an ugly, ogreish name, but it's probably a safe bet nobody is going to use that name for something else.

Maybe use Unihan_Readings.txt for grepping

Currently, if Unihan_Readings.txt is installed (which is the default if the user has done apt install unicode-data) and the user requests a character that is not in UnicodeData.txt, then the Readings data is used to show information about the character. However, Unihan_Readings could be used in the future for searching for characters to show.

Example data from Unihan_Readings for U+9B44 (魄):

U+9B44	kCantonese	bok3 paak3 tok3
U+9B44	kDefinition	vigor; body; dark part of moon
U+9B44	kHangul	백:0N
U+9B44	kHanyuPinlu	pò(11)
U+9B44	kHanyuPinyin	74431.090:pò,bó,tuò
U+9B44	kJapaneseKun	TAMASHII
U+9B44	kJapaneseOn	HAKU BAKU
U+9B44	kKorean	PAYK
U+9B44	kMandarin	pò
U+9B44	kTGHZ2013	287.140:pò
U+9B44	kTang	*pæk
U+9B44	kVietnamese	phách
U+9B44	kXHC1983	0084.110:bó 0887.020:pò 1175.020:tuò

See UAX #38: Unicode Han Database.
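
Records like these are tab-separated triples of code point, field, and value. The sketch below groups a subset of the U+9B44 records and builds a gloss in the style shown in the CJK examples. The mapping from Unihan fields to the one-letter tags is an assumption, and `parse_readings` and `gloss` are hypothetical helpers.

```python
SAMPLE = (
    "U+9B44\tkCantonese\tbok3 paak3 tok3\n"
    "U+9B44\tkDefinition\tvigor; body; dark part of moon\n"
    "U+9B44\tkJapaneseKun\tTAMASHII\n"
    "U+9B44\tkKorean\tPAYK\n"
    "U+9B44\tkMandarin\tpò\n"
    "U+9B44\tkVietnamese\tphách\n"
)

# Assumed mapping from Unihan fields to the tags seen in ugrep's output.
TAGS = [
    ("M", "kMandarin"),
    ("C", "kCantonese"),
    ("J", "kJapaneseKun"),
    ("K", "kKorean"),
    ("V", "kVietnamese"),
]


def parse_readings(text):
    """Group 'U+XXXX<TAB>field<TAB>value' lines by code point."""
    readings = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        code, field, value = line.split("\t", 2)
        readings.setdefault(int(code[2:], 16), {})[field] = value
    return readings


def gloss(fields):
    """Format the definition followed by the tagged pronunciations."""
    parts = [f"{tag}: {fields[key]}" for tag, key in TAGS if key in fields]
    text = fields.get("kDefinition", "")
    if parts:
        text = f"{text} ( {', '.join(parts)} )".strip()
    return text


print(gloss(parse_readings(SAMPLE)[0x9B44]))
```
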

Two levels of Unihan support:

1. Show kDefinition if block name is CJK Ideographs
2. Search Unihan_Readings when searching for a word. Possible example:

       $ ugrep mononoke
       魅	U+9B45	MONONOKE BAKEMONO SUDAMA (kind of forest demon, elf)

Number 1 is finished and working, but number 2 may require a command line switch or some other way of enabling/disabling it as searching through the Readings file may be slow or cause other problems.
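
The second level could be as simple as a linear scan over the readings file, which also shows why it may be slow: every value of every record is tested. The sketch below is a guess at the shape of such a search, not planned code; `search_readings` is a hypothetical helper and the sample is a subset of the U+9B44 records shown above.

```python
import re

SAMPLE = (
    "U+9B44\tkDefinition\tvigor; body; dark part of moon\n"
    "U+9B44\tkJapaneseKun\tTAMASHII\n"
    "U+9B44\tkMandarin\tpò\n"
)


def search_readings(pattern, text):
    """Return the code points with any reading that matches pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = set()
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        code, _field, value = line.split("\t", 2)
        if regex.search(value):
            hits.add(int(code[2:], 16))
    return sorted(hits)


for cp in search_readings("tamashii", SAMPLE):
    print(f"{chr(cp)}\tU+{cp:04X}")
```
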

Maybe use NamesList.txt

It looks like NamesList.txt might be useful to also parse as it allows multiple aliases for a character. For example (from grep -B1 [=%] NamesList.txt):

0023    NUMBER SIGN
        = pound sign, hash, crosshatch, octothorpe

002E    FULL STOP
        = period, dot, decimal point
--
002F    SOLIDUS
        = slash, virgule

1F70A   ALCHEMICAL SYMBOL FOR VINEGAR
        = crucible; acid; distill; atrament; vitriol; red
          sulfur; borax; wine; alkali salt; mercurius vivus,
          quick silver

I'm not sure how useful this will be (who is going to look up the number sign by searching on "octothorpe"), but it'd be nice to be able to at least show them as aliases.

Also, NamesList.txt has a fascinating "cross reference" feature:

0021    EXCLAMATION MARK
        = factorial
        = bang
        x (inverted exclamation mark - 00A1)
        x (latin letter retroflex click - 01C3)
        x (double exclamation mark - 203C)
        x (interrobang - 203D)
        x (heavy exclamation mark ornament - 2762)

How would one find the interrobang (‽) without such a cross reference?

Note that the NamesList.txt file actually starts with a warning not to parse it as it says it is generated mechanically from UnicodeData.txt plus "manually created annotations". However, those annotations are what is interesting about the file (the aliases and cross references) and there appears to be no other official source of that data.
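
If the file were parsed despite that warning, the alias (`=`) and cross-reference (`x`) lines in the samples above are regular enough for a small state machine. The sketch below handles only those two line types in the layout shown here; `parse_names_list` and `cross_referenced` are hypothetical helpers.

```python
import re

SAMPLE = """\
0021\tEXCLAMATION MARK
\t= factorial
\t= bang
\tx (inverted exclamation mark - 00A1)
\tx (interrobang - 203D)
"""

ENTRY = re.compile(r"^([0-9A-F]{4,6})\s+(.+)$")
XREF = re.compile(r"^x \((.+) - ([0-9A-F]{4,6})\)$")


def parse_names_list(text):
    """Collect aliases and cross references for each character entry."""
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        head = ENTRY.match(raw)  # entry lines start in the first column
        if head:
            current = entries.setdefault(
                int(head.group(1), 16),
                {"name": head.group(2), "aliases": [], "xrefs": []},
            )
        elif current is not None and line.startswith("= "):
            current["aliases"].append(line[2:])
        elif current is not None:
            ref = XREF.match(line)
            if ref:
                current["xrefs"].append((ref.group(1), int(ref.group(2), 16)))
    return entries


def cross_referenced(word, entries):
    """Return the code points whose cross-reference label contains word."""
    return sorted(
        cp for e in entries.values() for label, cp in e["xrefs"] if word in label
    )


entries = parse_names_list(SAMPLE)
print([f"U+{cp:04X}" for cp in cross_referenced("interrobang", entries)])
```
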

Bugs, Misfeatures, and Workarounds

ugrep 3400 shows the text defined in UnicodeData.txt, which states that it is "<CJK Ideograph Extension A, First>". Now that ugrep can show ideograph definitions using Unihan_Readings.txt, we should (probably) replace any string in angle brackets with more useful info.
Brace expansion is confusing because of needing to be quoted from the shell. It is supported for ranges (not sequences), but is not currently documented because usage is tricky and the functionality is not actually that helpful. For example, the following works:
```
ugrep {0..F}{0,4,8,C}00
```
but is easier to understand using range expansion:
```
ugrep 0..FFFF..400
```
Range expansion and a seemingly equivalent regular expression search will give different results.
```
ugrep 0..FFFF..400 | wc -l 
64
ugrep U+[0-9A-F][048C]00 | wc -l
22
```
This is because regexes currently only return valid code points from the UnicodeData.txt file, whereas range expansions can generate code points which are in regions not directly defined by Unicode. For example, the range from U+4E00 to U+9FEF is a block of CJK Ideographs. Both are useful: regexes are blazingly fast, while range expansions have more functionality.
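
The gap is visible in the data itself: UnicodeData.txt lists large blocks only by a `<..., First>` and `<..., Last>` pair, so a name regex never sees the code points in between. The sketch below separates explicit entries from such implicit ranges; the four sample lines follow the UnicodeData.txt field layout, and `load` and `describe` are hypothetical helpers rather than ugrep's code.

```python
SAMPLE = """\
4DFF;HEXAGRAM FOR BEFORE COMPLETION;So;0;ON;;;;;N;;;;;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
A000;YI SYLLABLE IT;Lo;0;L;;;;;N;;;;;
"""


def load(text):
    """Split entries into explicit names and implicit First/Last ranges."""
    explicit, implicit = {}, []
    first = None
    for line in text.splitlines():
        fields = line.split(";")
        cp, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            first = cp
        elif name.endswith(", Last>") and first is not None:
            # '<CJK Ideograph, Last>' becomes the label 'CJK Ideograph'.
            implicit.append((first, cp, name[1 : -len(", Last>")]))
            first = None
        else:
            explicit[cp] = name
    return explicit, implicit


def describe(cp, explicit, implicit):
    """Name a code point, falling back to the range that contains it."""
    if cp in explicit:
        return explicit[cp]
    for lo, hi, label in implicit:
        if lo <= cp <= hi:
            return f"[{label}]"
    return None


explicit, implicit = load(SAMPLE)
print(0x8000 in explicit, describe(0x8000, explicit, implicit))
```
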
[Note: The following is not a problem for people who are willing to use vector fonts (truetype, opentype, postscript) that may be antialiased. Xterm uses fontconfig just fine.]
For bitmap fonts, Xterm (as of version 369) seems to be able to only use one font at a time, which means a single font must have all the glyphs you want shown. (Yes, you can have a second bitmap font for "wide" CJK, but that's still not enough.)

The author (hackerb9) currently prefers using the Neep bitmap font like so in ~/.Xresources:
```
! Neep looks nice, has good unicode coverage. Requires xfonts-jmk.
xterm*vt100.font        :       *neep-medium-r-normal--20*10646*
! Neep lacks Asian characters
xterm*vt100.wideFont    :       *fixed-medium-r-normal-ja-18*10646*
```
Neep has two major downsides. 1. It is a bitmap font with only one size well implemented, so you can't zoom in or out. 2. It is limited to 65536 characters, which means it cannot show characters outside of Unicode's Basic Multilingual Plane, such as new emojis. Neep can be installed on Debian GNU/Linux systems with apt install xfonts-jmk.
Mlterm appears to have the same single font limitation as Xterm. Also, it right aligns text that has even a single character in a right-to-left alphabet, such as Arabic, so the output from ugrep will look a little funny.
Gnome-terminal uses font-config, so it has very nice Unicode support and can easily zoom in with Ctrl-+⃣ and Ctrl--⃣. Older versions had a bug where combining characters were combined with the following character instead of the previous, but this is now fixed.

Gnome-terminal does not support sixel graphics, so the -l option cannot show examples of the character in different fonts.

ugrep's Issues

`ugrep -L` uses double-size terminal characters without checking

When increasing the size of the glyph rendering when listing fonts, the size of the text also increases. This is good, but some terminals, such as gnome-terminal, cannot handle double-high text. Is there a way to detect if a terminal supports double-high letters? Or, do I need to blocklist gnome-terminal?

Unknown characters should print block info

Some characters do not exist in the UnicodeData.txt file but are still part of Unicode. For example, CJK characters. If a character is not found, ugrep should fall back to printing out the Unicode block info from Blocks.txt.

# Blocks-12.1.0.txt
# Date: 2019-03-08, 23:59:00 GMT [KW]
# © 2019 Unicode®, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
#
# Unicode Character Database
# For documentation, see http://www.unicode.org/reports/tr44/
#
# Format:
# Start Code..End Code; Block Name

# ================================================

# Note:   When comparing block names, casing, whitespace, hyphens,
#         and underbars are ignored.
#         For example, "Latin Extended-A" and "latin extended a" are equivalent.
#         For more information on the comparison of property values,
#            see UAX #44: http://www.unicode.org/reports/tr44/
#
#  All block ranges start with a value where (cp MOD 16) = 0,
#  and end with a value where (cp MOD 16) = 15. In other words,
#  the last hexadecimal digit of the start of range is ...0
#  and the last hexadecimal digit of the end of range is ...F.
#  This constraint on block ranges guarantees that allocations
#  are done in terms of whole columns, and that code chart display
#  never involves splitting columns in the charts.
#
#  All code points not explicitly listed for Block
#  have the value No_Block.

# Property:	Block
#
# @missing: 0000..10FFFF; No_Block

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek and Coptic
0400..04FF; Cyrillic
0500..052F; Cyrillic Supplement
0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0750..077F; Arabic Supplement
0780..07BF; Thaana
07C0..07FF; NKo
0800..083F; Samaritan
0840..085F; Mandaic
0860..086F; Syriac Supplement
08A0..08FF; Arabic Extended-A
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0C80..0CFF; Kannada
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1200..137F; Ethiopic
1380..139F; Ethiopic Supplement
13A0..13FF; Cherokee
1400..167F; Unified Canadian Aboriginal Syllabics
1680..169F; Ogham
16A0..16FF; Runic
1700..171F; Tagalog
1720..173F; Hanunoo
1740..175F; Buhid
1760..177F; Tagbanwa
1780..17FF; Khmer
1800..18AF; Mongolian
18B0..18FF; Unified Canadian Aboriginal Syllabics Extended
1900..194F; Limbu
1950..197F; Tai Le
1980..19DF; New Tai Lue
19E0..19FF; Khmer Symbols
1A00..1A1F; Buginese
1A20..1AAF; Tai Tham
1AB0..1AFF; Combining Diacritical Marks Extended
1B00..1B7F; Balinese
1B80..1BBF; Sundanese
1BC0..1BFF; Batak
1C00..1C4F; Lepcha
1C50..1C7F; Ol Chiki
1C80..1C8F; Cyrillic Extended-C
1C90..1CBF; Georgian Extended
1CC0..1CCF; Sundanese Supplement
1CD0..1CFF; Vedic Extensions
1D00..1D7F; Phonetic Extensions
1D80..1DBF; Phonetic Extensions Supplement
1DC0..1DFF; Combining Diacritical Marks Supplement
1E00..1EFF; Latin Extended Additional
1F00..1FFF; Greek Extended
2000..206F; General Punctuation
2070..209F; Superscripts and Subscripts
20A0..20CF; Currency Symbols
20D0..20FF; Combining Diacritical Marks for Symbols
2100..214F; Letterlike Symbols
2150..218F; Number Forms
2190..21FF; Arrows
2200..22FF; Mathematical Operators
2300..23FF; Miscellaneous Technical
2400..243F; Control Pictures
2440..245F; Optical Character Recognition
2460..24FF; Enclosed Alphanumerics
2500..257F; Box Drawing
2580..259F; Block Elements
25A0..25FF; Geometric Shapes
2600..26FF; Miscellaneous Symbols
2700..27BF; Dingbats
27C0..27EF; Miscellaneous Mathematical Symbols-A
27F0..27FF; Supplemental Arrows-A
2800..28FF; Braille Patterns
2900..297F; Supplemental Arrows-B
2980..29FF; Miscellaneous Mathematical Symbols-B
2A00..2AFF; Supplemental Mathematical Operators
2B00..2BFF; Miscellaneous Symbols and Arrows
2C00..2C5F; Glagolitic
2C60..2C7F; Latin Extended-C
2C80..2CFF; Coptic
2D00..2D2F; Georgian Supplement
2D30..2D7F; Tifinagh
2D80..2DDF; Ethiopic Extended
2DE0..2DFF; Cyrillic Extended-A
2E00..2E7F; Supplemental Punctuation
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
31C0..31EF; CJK Strokes
31F0..31FF; Katakana Phonetic Extensions
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DBF; CJK Unified Ideographs Extension A
4DC0..4DFF; Yijing Hexagram Symbols
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
A4D0..A4FF; Lisu
A500..A63F; Vai
A640..A69F; Cyrillic Extended-B
A6A0..A6FF; Bamum
A700..A71F; Modifier Tone Letters
A720..A7FF; Latin Extended-D
A800..A82F; Syloti Nagri
A830..A83F; Common Indic Number Forms
A840..A87F; Phags-pa
A880..A8DF; Saurashtra
A8E0..A8FF; Devanagari Extended
A900..A92F; Kayah Li
A930..A95F; Rejang
A960..A97F; Hangul Jamo Extended-A
A980..A9DF; Javanese
A9E0..A9FF; Myanmar Extended-B
AA00..AA5F; Cham
AA60..AA7F; Myanmar Extended-A
AA80..AADF; Tai Viet
AAE0..AAFF; Meetei Mayek Extensions
AB00..AB2F; Ethiopic Extended-A
AB30..AB6F; Latin Extended-E
AB70..ABBF; Cherokee Supplement
ABC0..ABFF; Meetei Mayek
AC00..D7AF; Hangul Syllables
D7B0..D7FF; Hangul Jamo Extended-B
D800..DB7F; High Surrogates
DB80..DBFF; High Private Use Surrogates
DC00..DFFF; Low Surrogates
E000..F8FF; Private Use Area
F900..FAFF; CJK Compatibility Ideographs
FB00..FB4F; Alphabetic Presentation Forms
FB50..FDFF; Arabic Presentation Forms-A
FE00..FE0F; Variation Selectors
FE10..FE1F; Vertical Forms
FE20..FE2F; Combining Half Marks
FE30..FE4F; CJK Compatibility Forms
FE50..FE6F; Small Form Variants
FE70..FEFF; Arabic Presentation Forms-B
FF00..FFEF; Halfwidth and Fullwidth Forms
FFF0..FFFF; Specials
10000..1007F; Linear B Syllabary
10080..100FF; Linear B Ideograms
10100..1013F; Aegean Numbers
10140..1018F; Ancient Greek Numbers
10190..101CF; Ancient Symbols
101D0..101FF; Phaistos Disc
10280..1029F; Lycian
102A0..102DF; Carian
102E0..102FF; Coptic Epact Numbers
10300..1032F; Old Italic
10330..1034F; Gothic
10350..1037F; Old Permic
10380..1039F; Ugaritic
103A0..103DF; Old Persian
10400..1044F; Deseret
10450..1047F; Shavian
10480..104AF; Osmanya
104B0..104FF; Osage
10500..1052F; Elbasan
10530..1056F; Caucasian Albanian
10600..1077F; Linear A
10800..1083F; Cypriot Syllabary
10840..1085F; Imperial Aramaic
10860..1087F; Palmyrene
10880..108AF; Nabataean
108E0..108FF; Hatran
10900..1091F; Phoenician
10920..1093F; Lydian
10980..1099F; Meroitic Hieroglyphs
109A0..109FF; Meroitic Cursive
10A00..10A5F; Kharoshthi
10A60..10A7F; Old South Arabian
10A80..10A9F; Old North Arabian
10AC0..10AFF; Manichaean
10B00..10B3F; Avestan
10B40..10B5F; Inscriptional Parthian
10B60..10B7F; Inscriptional Pahlavi
10B80..10BAF; Psalter Pahlavi
10C00..10C4F; Old Turkic
10C80..10CFF; Old Hungarian
10D00..10D3F; Hanifi Rohingya
10E60..10E7F; Rumi Numeral Symbols
10F00..10F2F; Old Sogdian
10F30..10F6F; Sogdian
10FE0..10FFF; Elymaic
11000..1107F; Brahmi
11080..110CF; Kaithi
110D0..110FF; Sora Sompeng
11100..1114F; Chakma
11150..1117F; Mahajani
11180..111DF; Sharada
111E0..111FF; Sinhala Archaic Numbers
11200..1124F; Khojki
11280..112AF; Multani
112B0..112FF; Khudawadi
11300..1137F; Grantha
11400..1147F; Newa
11480..114DF; Tirhuta
11580..115FF; Siddham
11600..1165F; Modi
11660..1167F; Mongolian Supplement
11680..116CF; Takri
11700..1173F; Ahom
11800..1184F; Dogra
118A0..118FF; Warang Citi
119A0..119FF; Nandinagari
11A00..11A4F; Zanabazar Square
11A50..11AAF; Soyombo
11AC0..11AFF; Pau Cin Hau
11C00..11C6F; Bhaiksuki
11C70..11CBF; Marchen
11D00..11D5F; Masaram Gondi
11D60..11DAF; Gunjala Gondi
11EE0..11EFF; Makasar
11FC0..11FFF; Tamil Supplement
12000..123FF; Cuneiform
12400..1247F; Cuneiform Numbers and Punctuation
12480..1254F; Early Dynastic Cuneiform
13000..1342F; Egyptian Hieroglyphs
13430..1343F; Egyptian Hieroglyph Format Controls
14400..1467F; Anatolian Hieroglyphs
16800..16A3F; Bamum Supplement
16A40..16A6F; Mro
16AD0..16AFF; Bassa Vah
16B00..16B8F; Pahawh Hmong
16E40..16E9F; Medefaidrin
16F00..16F9F; Miao
16FE0..16FFF; Ideographic Symbols and Punctuation
17000..187FF; Tangut
18800..18AFF; Tangut Components
1B000..1B0FF; Kana Supplement
1B100..1B12F; Kana Extended-A
1B130..1B16F; Small Kana Extension
1B170..1B2FF; Nushu
1BC00..1BC9F; Duployan
1BCA0..1BCAF; Shorthand Format Controls
1D000..1D0FF; Byzantine Musical Symbols
1D100..1D1FF; Musical Symbols
1D200..1D24F; Ancient Greek Musical Notation
1D2E0..1D2FF; Mayan Numerals
1D300..1D35F; Tai Xuan Jing Symbols
1D360..1D37F; Counting Rod Numerals
1D400..1D7FF; Mathematical Alphanumeric Symbols
1D800..1DAAF; Sutton SignWriting
1E000..1E02F; Glagolitic Supplement
1E100..1E14F; Nyiakeng Puachue Hmong
1E2C0..1E2FF; Wancho
1E800..1E8DF; Mende Kikakui
1E900..1E95F; Adlam
1EC70..1ECBF; Indic Siyaq Numbers
1ED00..1ED4F; Ottoman Siyaq Numbers
1EE00..1EEFF; Arabic Mathematical Alphabetic Symbols
1F000..1F02F; Mahjong Tiles
1F030..1F09F; Domino Tiles
1F0A0..1F0FF; Playing Cards
1F100..1F1FF; Enclosed Alphanumeric Supplement
1F200..1F2FF; Enclosed Ideographic Supplement
1F300..1F5FF; Miscellaneous Symbols and Pictographs
1F600..1F64F; Emoticons
1F650..1F67F; Ornamental Dingbats
1F680..1F6FF; Transport and Map Symbols
1F700..1F77F; Alchemical Symbols
1F780..1F7FF; Geometric Shapes Extended
1F800..1F8FF; Supplemental Arrows-C
1F900..1F9FF; Supplemental Symbols and Pictographs
1FA00..1FA6F; Chess Symbols
1FA70..1FAFF; Symbols and Pictographs Extended-A
20000..2A6DF; CJK Unified Ideographs Extension B
2A700..2B73F; CJK Unified Ideographs Extension C
2B740..2B81F; CJK Unified Ideographs Extension D
2B820..2CEAF; CJK Unified Ideographs Extension E
2CEB0..2EBEF; CJK Unified Ideographs Extension F
2F800..2FA1F; CJK Compatibility Ideographs Supplement
E0000..E007F; Tags
E0100..E01EF; Variation Selectors Supplement
F0000..FFFFF; Supplementary Private Use Area-A
100000..10FFFF; Supplementary Private Use Area-B

# EOF

codepoints unreadable as they're pressed to the terminal's left margin

A little spacing would certainly fix it (just a single space would even be enough).

On OS X, for example, the leftmost pixels are literally the border of your terminal (unless your terminal itself has an explicit margin). It would be nice if ugrep could, either by default or optionally, prepend a space to each line.

Detect terminal colors for rendering fonts

Currently ugrep -l hardcodes the foreground and background colors for the sixel output. It should use the same routines as lsix and detect the correct foreground and background colors.

^Head and tail$ regex characters broken again

ugrep ^x should show all characters whose description begins with "X". It does not. This was so much easier in awk... ☺

`ugrep 0..10FFFD` is very slow but `ugrep '.?'` is fast

They both show every single code point, so they shouldn't be that different in speed.

Installation guide in README.md?

I know how to install it (just drop it into /usr/bin), but some people may not. Please add a paragraph with an installation guide.

Python's str.isprintable may be out of date

Although we're getting the character info directly from the horse's mouth (UnicodeData.txt), we are using Python3's builtin str.isprintable() to determine whether the glyph is printable. This can be incorrect if the installed version of Python doesn't know about new symbols in the Unicode standard. We ought to use the value from the UnicodeData.txt file instead.
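
One way to do that is to apply the documented rule behind `str.isprintable()` to the General_Category field (the third field of each UnicodeData.txt line): categories in the "Other" (C*) and "Separator" (Z*) groups are non-printable, except for the ASCII space. The sketch below assumes that rule is the desired one; both function names are hypothetical.

```python
def is_printable(codepoint, category):
    """Apply the str.isprintable() rule to a General_Category value."""
    if codepoint == 0x20:  # the ASCII space is the single exception
        return True
    return category[0] not in "CZ"


def printable_from_line(line):
    """Decide printability from one UnicodeData.txt line."""
    fields = line.split(";")
    return is_printable(int(fields[0], 16), fields[2])


print(printable_from_line("263A;WHITE SMILING FACE;So;0;ON;;;;;N;;;;;"))
print(printable_from_line("0000;<control>;Cc;0;BN;;;;;N;NULL;;;;"))
```

Code points that do not appear in the file at all would need separate handling, since they have no category field to read.
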
