janlelis / unicode-emoji

Up-to-date Emoji Regex in Ruby 💥

Home Page: https://character.construction

License: MIT License

Ruby 100.00%

ruby unicode emoji unicode-data regex emoji-unicode sequence hacktoberfest

unicode-emoji's Introduction

Unicode::Emoji

Provides Unicode Emoji data and regexes, incorporating the latest Unicode and Emoji standards.

Also includes a categorized list of recommended Emoji.

Emoji version: 15.1 (September 2023)

CLDR version (used for sub-region flags): 43 (April 2023)

Supported Rubies: 3.2, 3.1, 3.0

No longer supported Rubies, but might still work: 2.7, 2.6, 2.5, 2.4, 2.3

If you are stuck on an older Ruby version, check out the latest 0.9 version of this gem.

Gemfile

gem "unicode-emoji"

Usage

Regex

The gem includes a bunch of Emoji regexes, which are compiled out of various Emoji Unicode data sources.

require "unicode/emoji"

string = "String which contains all kinds of emoji:

- Singleton Emoji: 😴
- Textual singleton Emoji with Emoji variation: ▶️
- Emoji with skin tone modifier: 🛌🏽
- Region flag: 🇵🇹
- Sub-Region flag: 🏴󠁧󠁢󠁳󠁣󠁴󠁿
- Keycap sequence: 2️⃣
- Sequence using ZWJ (zero width joiner): 🤾🏽‍♀️

"

string.scan(Unicode::Emoji::REGEX) # => ["😴", "▶️", "🛌🏽", "🇵🇹", "🏴󠁧󠁢󠁳󠁣󠁴󠁿", "2️⃣", "🤾🏽‍♀️"]

Main Regexes

Matches (non-textual) Emoji of all kinds:

Regex	Description	Example Matches	Example Non-Matches
`Unicode::Emoji::REGEX`	Use this if unsure! Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kind of recommended Emoji sequences	`😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`	`😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`
`Unicode::Emoji::REGEX_VALID`	Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kind of valid Emoji sequences	`😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`	`😴︎`, `▶`, `🏻`, `🇵🇵`
`Unicode::Emoji::REGEX_WELL_FORMED`	Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kind of well-formed Emoji sequences	`😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `🇵🇵`	`😴︎`, `▶`, `🏻`

Picking the Right Emoji Regex

Usually you just want REGEX (RGI set)
If you want broader matching (e.g. more sub-regions), choose REGEX_VALID
If you also want to match well-formed but invalid sequences, use REGEX_WELL_FORMED

Please see the standard for details.

Property	`REGEX` (RGI / Recommended)	`REGEX_VALID` (Valid)	`REGEX_WELL_FORMED` (Well-formed)
Region "🇵🇹"	Yes	Yes	Yes
Region "🇵🇵"	No	No	Yes
Tag Sequence "🏴󠁧󠁢󠁳󠁣󠁴󠁿"	Yes	Yes	Yes
Tag Sequence "🏴󠁧󠁢󠁡󠁧󠁢󠁿"	No	Yes	Yes
Tag Sequence "😴󠁧󠁢󠁡󠁡󠁡󠁿"	No	No	Yes
ZWJ Sequence "🤾🏽‍♀️"	Yes	Yes	Yes
ZWJ Sequence "🤠‍🤢"	No	Yes	Yes

More info about valid vs. recommended Emoji in this blog article on Emojipedia.

Singleton Regexes

Matches only simple one-codepoint (+ optional variation selector) Emoji:

Regex	Description	Example Matches	Example Non-Matches
`Unicode::Emoji::REGEX_BASIC`	Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all	`😴`, `▶️`	`😴︎`, `▶`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`
`Unicode::Emoji::REGEX_TEXT`	Matches only textual singleton Emoji (except for singleton components, like digit 1)	`😴︎`, `▶`	`😴`, `▶️`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`

Include Textual Emoji

By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes. However, if you wish to match for them too, you can include them in your regex by appending the _INCLUDE_TEXT suffix:

Regex	Description	Example Matches	Example Non-Matches
`Unicode::Emoji::REGEX_INCLUDE_TEXT`	`REGEX` + `REGEX_TEXT`	`😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `😴︎`, `▶`	`🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`
`Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT`	`REGEX_VALID` + `REGEX_TEXT`	`😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `😴︎`, `▶`	`🏻`, `🇵🇵`
`Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT`	`REGEX_WELL_FORMED` + `REGEX_TEXT`	`😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `🇵🇵`, `😴︎`, `▶`	`🏻`
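
To make the textual/non-textual distinction concrete, here is a minimal sketch that does not use this gem's constants at all. It relies on Ruby's built-in Unicode property classes (`\p{Emoji}` and `\p{Emoji_Presentation}`), and the constant and variable names are illustrative. It shows that ▶ (U+25B6) is an Emoji character with default text presentation, and that appending U+FE0F or U+FE0E requests emoji or text presentation.

```ruby
# Variation selectors defined by Unicode (names here are illustrative)
TEXT_SELECTOR  = "\uFE0E" # VS15: request text presentation
EMOJI_SELECTOR = "\uFE0F" # VS16: request emoji presentation

play   = "\u25B6"    # BLACK RIGHT-POINTING TRIANGLE
sleepy = "\u{1F634}" # SLEEPING FACE

# Both characters carry the Emoji property ...
both_emoji = [play, sleepy].all? { |char| char.match?(/\A\p{Emoji}\z/) }

# ... but only SLEEPING FACE has emoji presentation by default
play_default_emoji   = play.match?(/\A\p{Emoji_Presentation}\z/)
sleepy_default_emoji = sleepy.match?(/\A\p{Emoji_Presentation}\z/)

# Appending a selector changes the requested presentation
play_as_emoji  = play + EMOJI_SELECTOR  # non-textual form of the triangle
sleepy_as_text = sleepy + TEXT_SELECTOR # textual form of the sleeping face

p both_emoji           # true
p play_default_emoji   # false
p sleepy_default_emoji # true
p play_as_emoji.codepoints.map { |cp| format("U+%04X", cp) }  # ["U+25B6", "U+FE0F"]
p sleepy_as_text.codepoints.map { |cp| format("U+%04X", cp) } # ["U+1F634", "U+FE0E"]
```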

Extended Pictographic Regex

Unicode::Emoji::REGEX_PICTO matches single codepoints with the Extended_Pictographic property. For example, it will match ✀ BLACK SAFETY SCISSORS.

Unicode::Emoji::REGEX_PICTO_NO_EMOJI matches single codepoints with the Extended_Pictographic property, but excludes Emoji characters.

See character.construction/picto for a list of all non-Emoji pictographic characters.

Partial Regexes

Matches potential Emoji parts (often, this is not what you want):

Regex	Description	Example Matches	Example Non-Matches
`Unicode::Emoji::REGEX_ANY`	Matches any Emoji-related codepoint (but no variation selectors, tags, or zero-width joiners). Please note that this will match Emoji parts rather than complete Emoji, for example, single digits!	`😴`, `▶`, `🏻`, `🛌`, `🏽`, `🇵`, `🇹`, `2`, `🏴`, `🤾`, `♀`, `🤠`, `🤢`	-

List

Use Unicode::Emoji::LIST or the list method to get a grouped (and ordered) list of Emoji:

Unicode::Emoji.list.keys
# => ["Smileys & Emotion", "People & Body", "Component", "Animals & Nature", "Food & Drink", "Travel & Places", "Activities", "Objects", "Symbols", "Flags"]

Unicode::Emoji.list("Food & Drink").keys
# => ["food-fruit", "food-vegetable", "food-prepared", "food-asian", "food-marine", "food-sweet", "drink", "dishware"]

Unicode::Emoji.list("Food & Drink", "food-asian")
# => ["🍱", "🍘", "🍙", "🍚", "🍛", "🍜", "🍝", "🍠", "🍢", "🍣", "🍤", "🍥", "🥮", "🍡", "🥟", "🥠", "🥡"]

Please note that categories might change with future versions of the Emoji standard. This gem will issue warnings when attempting to retrieve old categories using the #list method.

A list of all Emoji can be found at character.construction.

Properties

Allows you to access the codepoint data from Unicode's emoji-data.txt file:

require "unicode/emoji"

Unicode::Emoji.properties "☝" # => ["Emoji", "Emoji_Modifier_Base"]

Also See

Unicode® Technical Standard #51
Emoji categories
Ruby gem which displays Emoji sequence names
Part of unicode-x

MIT

Copyright (C) 2017-2023 Jan Lelis https://janlelis.com. Released under the MIT license.
Unicode data: https://www.unicode.org/copyright.html#Exhibit1

unicode-emoji's Issues

Write specs for + document non-regex constants in README

While there are specs for the REGEX* constants of the gem, we should also document/specify what the other constants are for (e.g. EXTENDED_PICTOGRAPHIC_NO_EMOJI)

Unicode Emoji Regexp on numbers

Hi,

I found your gem and it seems great for spotting emojis, but I've run into some difficulties with it.

When I use Unicode::Emoji::REGEX_ANY, it matches numbers too (as described in the README).
To me, a number is not an emoji (but I'm probably wrong, as I'm the first one to raise this as an issue).

What would be the right regexp to match as many emojis as possible without numbers or "regular" characters?
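
The digit matches follow from the Unicode data itself: digits, `#`, and `*` carry the Emoji property because they are the base of keycap sequences. A minimal sketch, using Ruby's built-in `\p{Emoji}` property class instead of the gem (whether `REGEX_ANY` is built from exactly this property is an assumption), illustrates the effect. According to the README, `Unicode::Emoji::REGEX` is the constant intended for matching complete Emoji only.

```ruby
digit  = "2"
keycap = "2\uFE0F\u20E3" # keycap two = digit + VS16 + COMBINING ENCLOSING KEYCAP

# A plain digit has the Emoji property, but no default emoji presentation
digit_has_emoji_property  = digit.match?(/\p{Emoji}/)
digit_is_emoji_by_default = digit.match?(/\p{Emoji_Presentation}/)

# Matching codepoint by codepoint only picks up the bare digit of the keycap
parts = keycap.scan(/\p{Emoji}/)

p digit_has_emoji_property  # true
p digit_is_emoji_by_default # false
p parts                     # ["2"]
```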

Incorrect matching of several family emojis

Scanning family emojis (e.g.) using the "recommended" REGEX results in a smaller family and a separate kid rather than matching the whole emoji in 15 cases:

Unicode::Emoji.list("People & Body", "family").each do |emoji|
  if emoji.length > emoji.scan(Unicode::Emoji::REGEX)[0].length
    puts "\"#{emoji}\".scan(Unicode::Emoji::REGEX) = #{emoji.scan(Unicode::Emoji::REGEX)}"
  end
end

# "👨‍👩‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👩‍👧", "👦"]
# "👨‍👩‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👩‍👦", "👦"]                                                          
# "👨‍👩‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👨‍👩‍👧", "👧"]                                                          
# "👨‍👨‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👨‍👧", "👦"]                                                          
# "👨‍👨‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👨‍👦", "👦"]                                                          
# "👨‍👨‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👨‍👨‍👧", "👧"]                                                          
# "👩‍👩‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👩‍👧", "👦"]                                                          
# "👩‍👩‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👩‍👦", "👦"]                                                          
# "👩‍👩‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👩‍👩‍👧", "👧"]                                                          
# "👨‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👦", "👦"]                                                                
# "👨‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👧", "👦"]                                                                
# "👨‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👨‍👧", "👧"]                                                                
# "👩‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👦", "👦"]                                                                
# "👩‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👧", "👦"]
# "👩‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👩‍👧", "👧"]

I think this is because of the order of the generated REGEX. I suspect it finds the smaller match first and stops looking... After modifying the generated REGEX in lib/unicode/emoji/generated/regex.rb to REGEX = /(?:(?:👨‍❤️‍👨|👨‍❤️‍💋‍👨|👨‍👩‍👧‍👧|👨‍👦|👨‍👦‍👦|... (moving the 👨‍👩‍👧‍👧 earlier) that particular emoji started matching correctly.

I have no idea how to update the generation logic to move those emojis to the front of the queue though?
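
The ordering effect suspected above can be reproduced without the gem. Ruby's regex engine tries alternatives from left to right and takes the first one that matches, not the longest. The sketch below is not the gem's generation code: `longest_first_union` is a hypothetical helper, and the two family sequences are built by hand from their codepoints.

```ruby
# Hypothetical helper: build an alternation that tries longer sequences first
def longest_first_union(sequences)
  Regexp.union(sequences.sort_by { |sequence| -sequence.length })
end

zwj = "\u200D" # ZERO WIDTH JOINER
man, woman, girl, boy = "\u{1F468}", "\u{1F469}", "\u{1F467}", "\u{1F466}"

family_of_three = [man, woman, girl].join(zwj)
family_of_four  = [man, woman, girl, boy].join(zwj)

# Shorter alternative listed first: the engine stops at the family of three
naive = Regexp.union([family_of_three, family_of_four])
# Longer alternative listed first: the whole sequence is matched
fixed = longest_first_union([family_of_three, family_of_four])

naive_matches = family_of_four.scan(naive)
fixed_matches = family_of_four.scan(fixed)

p naive_matches == [family_of_three] # true
p fixed_matches == [family_of_four]  # true
```

Sorting by descending length before joining the alternatives is one common way to get longest-match behavior from a backtracking engine; whether it fits the gem's generator is for the maintainer to judge.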

`Unicode::Emoji::REGEX` allowing non-RGI sequence?

First of all, thanks so much for this gem! I try to minimise dependencies so I spent hours trying to figure out a sane way to check for valid emoji code points and sequences before finding this small, focused, well-maintained gem. Thank you so much!

The issue I'm encountering is that a non-RGI sequence is being allowed through by Unicode::Emoji::REGEX. For me it's matching the "GB AGB" flag that the README says is valid but not recommended and also the Northern Ireland flag which is valid but not recommended.

I had a look at the spec and discovered this is by design. The base flag is matched and the other characters are ignored.

Is there a way to more strictly compare? I'm trying to use this REGEX to validate (not extract) recommended emoji in an ActiveRecord model... For me I'd like it to not match at all if the whole sequence is valid but not recommended.

# Simplified model
class Person < ApplicationRecord
  validates :emoji, allow_blank: true, format: {with: Unicode::Emoji::REGEX, message: "can only contain recommended emoji"}
  validate :emoji_length

  private
  def emoji_length
    if emoji.present? && emoji.grapheme_clusters.length > 1
      errors.add(:emoji, "can only contain 1 emoji")
    end
  end
end

# Simplified test (ignoring setup)
class PersonTest < ActiveSupport::TestCase
  test "Allows no emoji (set implicitly)" do
    assert @person.valid?, "P should be valid with no emoji but isn't\n\n\t#{@person.errors.full_messages}"
  end

  test "Does not allow non-recommended (RGI) emoji" do
    @person.emoji = "🏴󠁧󠁢󠁮󠁩󠁲󠁿"

    assert_not @person.valid?, "P should not be valid with Northern Ireland flag sequence but is"
  end
end

I suppose one way around it could be to extract the matched portion of the emoji and just save that? 🤔
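
One possible approach to validating rather than extracting is to anchor the regex so that the whole string has to be a single match. The sketch below is an assumption-laden illustration, not the gem's API: `whole_string_match?` is a hypothetical helper, and `stand_in` is a tiny replacement for `Unicode::Emoji::REGEX` that, like the behavior described above, matches the black flag on its own, so the example runs without the gem.

```ruby
# Hypothetical helper: true only if the entire string is one match of emoji_regex
def whole_string_match?(string, emoji_regex)
  /\A#{emoji_regex}\z/.match?(string)
end

# Stand-in for Unicode::Emoji::REGEX (assumption): matches the base flag only
stand_in = /\u{1F3F4}/

black_flag = "\u{1F3F4}"
# Black flag + tag characters "gbnir" + cancel tag (Northern Ireland sequence)
northern_ireland = "\u{1F3F4}\u{E0067}\u{E0062}\u{E006E}\u{E0069}\u{E0072}\u{E007F}"

unanchored = northern_ireland.match?(stand_in)                    # base flag is found
anchored   = whole_string_match?(northern_ireland, stand_in)      # trailing tags rejected
plain_flag = whole_string_match?(black_flag, stand_in)

p unanchored # true
p anchored   # false
p plain_flag # true

# In the model above, the same idea would presumably look like:
#   format: { with: /\A#{Unicode::Emoji::REGEX}\z/, message: "..." }
```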

When removing emojis with .gsub, I'm getting error on compare with empty string.

Hello. I’m trying to use the gem to remove emojis from strings, but I’m getting an error when comparing the result with the expected string.

[33] pry(#<RSpec::Matchers::DSL::Matcher>)> regex_result = '🛤🎯📮📘↕⭕🇬🇶🇼🇸📪🛎👨‍🌾🏺🐚🤯'.gsub(Unicode::Emoji::REGEX_ANY, '')
=> "‍"
[34] pry(#<RSpec::Matchers::DSL::Matcher>)> regex_result == ''
=> false

[36] pry(#<RSpec::Matchers::DSL::Matcher>)> Marshal.dump(regex_result)
=> "\x04\bI\"\b\xE2\x80\x8D\x06:\x06ET"
[37] pry(#<RSpec::Matchers::DSL::Matcher>)> Marshal.dump('')
=> "\x04\bI\"\x00\x06:\x06ET"

What am I doing wrong here? 😅
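
The bytes `\xE2\x80\x8D` in the Marshal dump are the UTF-8 encoding of U+200D, the zero-width joiner, which the README says `REGEX_ANY` does not match. A minimal sketch of the effect, using a hand-written stand-in regex (`parts_only`, an assumption) instead of the gem so that it runs on its own:

```ruby
ZERO_WIDTH_JOINER = "\u200D"

# Farmer = MAN + ZWJ + SHEAF OF RICE
farmer = "\u{1F468}#{ZERO_WIDTH_JOINER}\u{1F33E}"

# Stand-in for a regex that matches Emoji parts but not the joiner
parts_only = /[\u{1F468}\u{1F33E}]/

result = farmer.gsub(parts_only, "")

p result == ""                                         # false
p result.codepoints.map { |cp| format("U+%04X", cp) }  # ["U+200D"]
p result.bytes                                         # [226, 128, 141]

# One way to clean up afterwards is to delete the joiner explicitly
cleaned = result.delete(ZERO_WIDTH_JOINER)
p cleaned.empty?                                       # true

# With the gem loaded, replacing whole Emoji (for example with
# Unicode::Emoji::REGEX_INCLUDE_TEXT) instead of parts is likely the
# cleaner fix; that variant is not verified here.
```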

Adding support for new Emoji

Hey there!

First off I absolutely love this library, it's so well organised and designed, it has been a total pleasure to use - thank you for your incredible work.

I'm just wondering, is there a straightforward way for me to update the marshal.gz file with newer emoji? I'm trying to build something that will match newer emoji and want to add them to one of the regexes.

I really love the project! Thank you for your hard work!

New architecture proposal to reduce memory usage

Hello.

I noticed that unicode-emoji gem takes more memory than I expected from such a library. Just requiring the gem takes 7-8MB. I know that for today's standards it isn't a huge amount, but if many such gems were used then it could add up to unnecessary memory usage.

Here is a method I used to measure memory usage (I use get_process_mem gem):

require 'get_process_mem'

def mem(&block)
  raise ArgumentError, 'missing block' unless block

  mem = GetProcessMem.new
  before = mem.mb
  block.call
  after = mem.mb
  return after - before
end

puts mem { require "unicode/emoji" }

Running this script multiple times gives me numbers between 7 and 9 MB (most of the time around 7.6 MB).

I also used memory_profiler:

require 'memory_profiler'
report = MemoryProfiler.report do
  require 'unicode/emoji'
end

report.pretty_print

It gives the information where memory is allocated but also how much of it is retained (which in most cases means it will never be freed).

I'm not sure how exactly people use this gem, but looking at its contents I suspect that most of them use one of the provided regex constants. That is the case in the application I work on: we literally use a single regex from this library (Unicode::Emoji::REGEX).

What can be done to lower memory usage? Here is the idea:

instead of generating all the Regex constants, they could be generated offline and included directly in a file
every constant could go to different file and autoload :FOO, File.expand_path('emoji/foo', __dir__) could be used to lazy load it when it is used
INDEX could be lazy loaded too. If all regexes were generated offline then it would be only used by methods like properties. No method calls? No constant loaded.
some regexes are quite big (require "objspace"; ObjectSpace.memsize_of(...)), for example REGEX_VALID_INCLUDE_TEXT is almost 0.5MB. I didn't look closely, but I think that some big unions like "||..." could be replaced by a range "[char1-charN]" (if the characters are consecutive, of course).
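
The lazy-loading idea from the list above can be sketched with `Module#autoload`. Everything named here is hypothetical: `LazyEmoji` stands in for the gem's namespace, and the regex file is written to a temporary directory only so that the example is self-contained. It also shows the `ObjectSpace.memsize_of` measurement mentioned above.

```ruby
require "fileutils"
require "objspace"
require "tmpdir"

# Hypothetical namespace standing in for the gem's module
module LazyEmoji; end

# Write a pre-generated regex into its own file (here: a simple range)
dir  = Dir.mktmpdir
path = File.join(dir, "regex_basic.rb")
File.write(path, <<~'RUBY')
  module LazyEmoji
    REGEX_BASIC = /[\u{1F600}-\u{1F64F}]/
  end
RUBY

# Register the constant; the file is not loaded yet
LazyEmoji.autoload(:REGEX_BASIC, path)
pending_before = LazyEmoji.autoload?(:REGEX_BASIC) # the path, still pending

# First access requires the file and defines the constant
regex = LazyEmoji::REGEX_BASIC
pending_after = LazyEmoji.autoload?(:REGEX_BASIC)  # nil once loaded

# Approximate size of the compiled regex in bytes (varies by Ruby version)
puts ObjectSpace.memsize_of(regex)

FileUtils.remove_entry(dir)
```

With this layout, a program that never references a constant never pays for loading it, which is the behavior the proposal is after.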

If done properly, then for a usage scenario like mine (a single constant), memory usage would be reduced from 7-8MB to the size of that constant (in our case 120kB).

Do you think it is worth looking into it?

New regex which includes textual emoji as well

Feature description

The REGEX and REGEX_VALID regexes will not match textual emoji like 😴︎ (because they are not considered emoji by the Consortium).

However, some software might want to detect these, too (see #4 for example)

Seems like 🫶 might not be supported?

We seemed to have an issue with the above emoji not being caught by the gem regex? I don't really know how to debug further, but I can say that this doesn't seem to be working correctly?

Happy to try and debug further in any way I can help? Let me know,

Many thanks.