unicode-emoji's Introduction

Unicode::Emoji

Provides Unicode Emoji data and regexes, incorporating the latest Unicode and Emoji standards.

Also includes a categorized list of recommended Emoji.

Emoji version: 15.1 (September 2023)

CLDR version (used for sub-region flags): 43 (April 2023)

Supported Rubies: 3.2, 3.1, 3.0

No longer supported Rubies, but might still work: 2.7, 2.6, 2.5, 2.4, 2.3

If you are stuck on an older Ruby version, check out the latest 0.9 version of this gem.

Gemfile

gem "unicode-emoji"

Usage

Regex

The gem includes a bunch of Emoji regexes, which are compiled out of various Emoji Unicode data sources.

require "unicode/emoji"

string = "String which contains all kinds of emoji:

- Singleton Emoji: 😴
- Textual singleton Emoji with Emoji variation: ▶️
- Emoji with skin tone modifier: 🛌🏽
- Region flag: 🇵🇹
- Sub-Region flag: 🏴󠁧󠁢󠁳󠁣󠁴󠁿
- Keycap sequence: 2️⃣
- Sequence using ZWJ (zero width joiner): 🤾🏽‍♀️

"

string.scan(Unicode::Emoji::REGEX) # => ["😴", "▶️", "🛌🏽", "🇵🇹", "🏴󠁧󠁢󠁳󠁣󠁴󠁿", "2️⃣", "🤾🏽‍♀️"]

Main Regexes

Matches (non-textual) Emoji of all kinds:

Regex | Description | Example Matches | Example Non-Matches
Unicode::Emoji::REGEX | Use this if unsure! Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of recommended Emoji sequences | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🤾🏽‍♀️ | 😴︎, ▶, 🏻, 🇵🇵, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤠‍🤢
Unicode::Emoji::REGEX_VALID | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of valid Emoji sequences | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢 | 😴︎, ▶, 🏻, 🇵🇵
Unicode::Emoji::REGEX_WELL_FORMED | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of well-formed Emoji sequences | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢, 🇵🇵 | 😴︎, ▶, 🏻

Picking the Right Emoji Regex
  • Usually you just want REGEX (RGI set)
  • If you want broader matching (e.g. more sub-regions), choose REGEX_VALID
  • If you also want to match invalid sequences, use REGEX_WELL_FORMED

Please see the standard for details.

Property | REGEX (RGI / Recommended) | REGEX_VALID (Valid) | REGEX_WELL_FORMED (Well-formed)
Region "🇵🇹" | Yes | Yes | Yes
Region "🇵🇵" | No | No | Yes
Tag Sequence "🏴󠁧󠁢󠁳󠁣󠁴󠁿" | Yes | Yes | Yes
Tag Sequence "🏴󠁧󠁢󠁡󠁧󠁢󠁿" | No | Yes | Yes
Tag Sequence "😴󠁧󠁢󠁡󠁡󠁡󠁿" | No | No | Yes
ZWJ Sequence "🤾🏽‍♀️" | Yes | Yes | Yes
ZWJ Sequence "🤠‍🤢" | No | Yes | Yes

More info about valid vs. recommended Emoji in this blog article on Emojipedia.

Singleton Regexes

Matches only simple one-codepoint (+ optional variation selector) Emoji:

Regex | Description | Example Matches | Example Non-Matches
Unicode::Emoji::REGEX_BASIC | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all | 😴, ▶️ | 😴︎, ▶, 🏻, 🛌🏽, 🇵🇹, 🇵🇵, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢
Unicode::Emoji::REGEX_TEXT | Matches only textual singleton Emoji (except for singleton components, like digit 1) | 😴︎, ▶ | 😴, ▶️, 🏻, 🛌🏽, 🇵🇹, 🇵🇵, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢

Include Textual Emoji

By default, textual Emoji (emoji characters followed by a text variation selector, or those that have a default text presentation) are not matched by the regexes above. If you want to match them too, use the regex variants with the _INCLUDE_TEXT suffix:

Regex | Description | Example Matches | Example Non-Matches
Unicode::Emoji::REGEX_INCLUDE_TEXT | REGEX + REGEX_TEXT | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🤾🏽‍♀️, 😴︎, ▶ | 🏻, 🇵🇵, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤠‍🤢
Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT | REGEX_VALID + REGEX_TEXT | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢, 😴︎, ▶ | 🏻, 🇵🇵
Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT | REGEX_WELL_FORMED + REGEX_TEXT | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢, 🇵🇵, 😴︎, ▶ | 🏻

Extended Pictographic Regex

Unicode::Emoji::REGEX_PICTO matches single codepoints with the Extended_Pictographic property. For example, it will match ✀ BLACK SAFETY SCISSORS.

Unicode::Emoji::REGEX_PICTO_NO_EMOJI matches single codepoints with the Extended_Pictographic property, but excludes Emoji characters.

See character.construction/picto for a list of all non-Emoji pictographic characters.

Partial Regexes

Matches potential Emoji parts (often, this is not what you want):

Regex | Description | Example Matches | Example Non-Matches
Unicode::Emoji::REGEX_ANY | Matches any Emoji-related codepoint (but no variation selectors, tags, or zero-width joiners). Please note that this will match Emoji parts rather than complete Emoji, for example, single digits! | 😴, ▶, 🏻, 🛌, 🏽, 🇵, 🇹, 2, 🏴, 🤾, ♀, 🤠, 🤢 | -

List

Use Unicode::Emoji::LIST or the list method to get a grouped (and ordered) list of Emoji:

Unicode::Emoji.list.keys
# => ["Smileys & Emotion", "People & Body", "Component", "Animals & Nature", "Food & Drink", "Travel & Places", "Activities", "Objects", "Symbols", "Flags"]

Unicode::Emoji.list("Food & Drink").keys
# => ["food-fruit", "food-vegetable", "food-prepared", "food-asian", "food-marine", "food-sweet", "drink", "dishware"]

Unicode::Emoji.list("Food & Drink", "food-asian")
# => ["🍱", "🍘", "🍙", "🍚", "🍛", "🍜", "🍝", "🍠", "🍢", "🍣", "🍤", "🍥", "🥮", "🍡", "🥟", "🥠", "🥡"]

Please note that categories might change with future versions of the Emoji standard. This gem will issue warnings when attempting to retrieve old categories using the #list method.

A list of all Emoji can be found at character.construction.

Properties

Allows you to access the codepoint data from Unicode's emoji-data.txt file:

require "unicode/emoji"

Unicode::Emoji.properties "☝" # => ["Emoji", "Emoji_Modifier_Base"]

Also See

MIT

unicode-emoji's People

Contributors

janlelis, jcstringer, kmy01, radarek, shushugah

unicode-emoji's Issues

Unicode Emoji Regexp on numbers

Hi,

I found your gem and it seems great for spotting emojis, but I've run into some difficulties with it.

When I use Unicode::Emoji::REGEX_ANY, it matches numbers too (as described in the README).
To me, a number is not an emoji (but I'm probably wrong, as I'm the first one to raise this as an issue).

What would be the right regexp to match as many emojis as possible without matching numbers or "regular" characters?
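
The behaviour described here follows from the Unicode data itself: digits carry the Emoji property because they can start keycap sequences. The following is a gem-free sketch using Ruby's built-in property classes. It only illustrates why digits show up; it is not a replacement for the gem's regexes. Per the README, Unicode::Emoji::REGEX is the constant intended for matching complete emoji.

```ruby
text = "Room 42 \u{1F634}" # digits plus 😴 SLEEPING FACE

# Every codepoint with the Emoji property, which includes the digits 0-9:
emoji_property = text.scan(/\p{Emoji}/)

# Only codepoints that default to emoji presentation; digits are excluded.
# Caveat: this also skips text-default characters such as ▶, even when an
# emoji variation selector follows them.
emoji_presentation = text.scan(/\p{Emoji_Presentation}/)

p emoji_property      # the two digits and the sleeping face
p emoji_presentation  # the sleeping face only
```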

Incorrect matching of several family emojis

Scanning family emojis (e.g.) using the "recommended" REGEX results in a smaller family and a separate kid rather than matching the whole emoji in 15 cases:

Unicode::Emoji.list("People & Body", "family").each do |emoji|
  if emoji.length > emoji.scan(Unicode::Emoji::REGEX)[0].length
    puts "\"#{emoji}\".scan(Unicode::Emoji::REGEX) = #{emoji.scan(Unicode::Emoji::REGEX)}"
  end
end

# "👨‍👩‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👩‍👧", "👦"]
# "👨‍👩‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👩‍👦", "👦"]
# "👨‍👩‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👨‍👩‍👧", "👧"]
# "👨‍👨‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👨‍👧", "👦"]
# "👨‍👨‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👨‍👦", "👦"]
# "👨‍👨‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👨‍👨‍👧", "👧"]
# "👩‍👩‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👩‍👧", "👦"]
# "👩‍👩‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👩‍👦", "👦"]
# "👩‍👩‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👩‍👩‍👧", "👧"]
# "👨‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👦", "👦"]
# "👨‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👨‍👧", "👦"]
# "👨‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👨‍👧", "👧"]
# "👩‍👦‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👦", "👦"]
# "👩‍👧‍👦".scan(Unicode::Emoji::REGEX) = ["👩‍👧", "👦"]
# "👩‍👧‍👧".scan(Unicode::Emoji::REGEX) = ["👩‍👧", "👧"]

I think this is because of the order of the generated REGEX. I suspect it finds the smaller match first and stops looking... After modifying the generated REGEX in lib/unicode/emoji/generated/regex.rb to REGEX = /(?:(?:👨‍❤️‍👨|👨‍❤️‍💋‍👨|👨‍👩‍👧‍👧|👨‍👦|👨‍👦‍👦|... (moving the 👨‍👩‍👧‍👧 earlier) that particular emoji started matching correctly.

I have no idea how to update the generation logic to move those emojis to the front of the queue though?
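
The hypothesis above matches how Ruby regex alternation works: alternatives are tried left to right, and the first one that matches wins. The sketch below reproduces the symptom with a tiny hand-built list and shows one possible remedy, sorting by descending length before building the union. The three-entry list and the variable names are invented for illustration; this is not the gem's generation logic.

```ruby
ZWJ = "\u200D"
MAN, WOMAN, GIRL, BOY = "\u{1F468}", "\u{1F469}", "\u{1F467}", "\u{1F466}"

family_of_three = [MAN, WOMAN, GIRL].join(ZWJ)       # 👨‍👩‍👧
family_of_four  = [MAN, WOMAN, GIRL, BOY].join(ZWJ)  # 👨‍👩‍👧‍👦

# Hypothetical sequence list in which the shorter family happens to come first.
sequences = [family_of_three, family_of_four, BOY]

# Alternatives are tried in order, so the shorter family matches first.
shortest_first = Regexp.union(sequences)

# Sorting by descending length puts the longest candidate first.
longest_first = Regexp.union(sequences.sort_by { |s| -s.length })

split_result = family_of_four.scan(shortest_first)  # family of three, then the boy
whole_result = family_of_four.scan(longest_first)   # the complete family of four

puts split_result.length  # 2
puts whole_result.length  # 1
```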

`Unicode::Emoji::REGEX` allowing non-RGI sequence?

First of all, thanks so much for this gem! I try to minimise dependencies so I spent hours trying to figure out a sane way to check for valid emoji code points and sequences before finding this small, focused, well-maintained gem. Thank you so much!

The issue I'm encountering is that a non-RGI sequence is being allowed through by Unicode::Emoji::REGEX. For me it's matching the "GB AGB" flag that the README says is valid but not recommended and also the Northern Ireland flag which is valid but not recommended.

I had a look at the spec and discovered this is by design. The base flag is matched and the other characters are ignored.

Is there a way to more strictly compare? I'm trying to use this REGEX to validate (not extract) recommended emoji in an ActiveRecord model... For me I'd like it to not match at all if the whole sequence is valid but not recommended.

# Simplified model
class Person < ApplicationRecord
  validates :emoji, allow_blank: true, format: {with: Unicode::Emoji::REGEX, message: "can only contain recommended emoji"}
  validate :emoji_length

  private
  def emoji_length
    if emoji.present? && emoji.grapheme_clusters.length > 1
      errors.add(:emoji, "can only contain 1 emoji")
    end
  end
end

# Simplified test (ignoring setup)
class PersonTest < ActiveSupport::TestCase
  test "Allows no emoji (set implicitly)" do
    assert @person.valid?, "P should be valid with no emoji but isn't\n\n\t#{@person.errors.full_messages}"
  end

  test "Does not allow non-recommended (RGI) emoji" do
    @person.emoji = "🏴󠁧󠁢󠁮󠁩󠁲󠁿"

    assert_not @person.valid?, "P should not be valid with Northern Ireland flag sequence but is"
  end
end

I suppose one way around it could be to extract the matched portion of the emoji and just save that? 🤔
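
One possible approach to validating instead of extracting is to anchor the pattern to the whole string with \A and \z, so that a partial match on the base flag no longer counts. The sketch below demonstrates the idea with a hand-built stand-in regex so that it runs without the gem. The stand-in pattern, the tag helper, and all variable names are assumptions made for this example. With the gem, the analogous pattern would presumably be /\A#{Unicode::Emoji::REGEX}\z/, which is untested here.

```ruby
BLACK_FLAG = "\u{1F3F4}"
CANCEL_TAG = "\u{E007F}"

# Builds a tag sequence: each ASCII letter maps to its tag character
# (U+E0000 + ASCII code), followed by the cancel tag.
tag = ->(letters) { letters.chars.map { |ch| (0xE0000 + ch.ord).chr(Encoding::UTF_8) }.join + CANCEL_TAG }

scotland         = BLACK_FLAG + tag.call("gbsct") # recommended (RGI)
northern_ireland = BLACK_FLAG + tag.call("gbnir") # valid, but not RGI

# Hypothetical stand-in for an RGI regex: it knows Scotland and the plain flag.
rgi_standin  = Regexp.union(scotland, BLACK_FLAG)
whole_string = /\A#{rgi_standin}\z/

unanchored_ni = northern_ireland.match?(rgi_standin)   # true: the base flag alone matches
anchored_ni   = northern_ireland.match?(whole_string)  # false: the trailing tags are left over
anchored_sct  = scotland.match?(whole_string)          # true: the full sequence is covered

puts [unanchored_ni, anchored_ni, anchored_sct].inspect
```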

When removing emojis with .gsub, I'm getting error on compare with empty string.

Hello. I'm trying to use the gem to remove emojis from strings, but I'm getting an error when comparing the result with the expected string.

[33] pry(#<RSpec::Matchers::DSL::Matcher>)> regex_result = '๐Ÿ›ค๐ŸŽฏ๐Ÿ“ฎ๐Ÿ“˜โ†•โญ•๐Ÿ‡ฌ๐Ÿ‡ถ๐Ÿ‡ผ๐Ÿ‡ธ๐Ÿ“ช๐Ÿ›Ž๐Ÿ‘จโ€๐ŸŒพ๐Ÿบ๐Ÿš๐Ÿคฏ'.gsub(Unicode::Emoji::REGEX_ANY, '')
=> "โ€"
[34] pry(#<RSpec::Matchers::DSL::Matcher>)> regex_result == ''
=> false

[36] pry(#<RSpec::Matchers::DSL::Matcher>)> Marshal.dump(regex_result)
=> "\x04\bI\"\b\xE2\x80\x8D\x06:\x06ET"
[37] pry(#<RSpec::Matchers::DSL::Matcher>)> Marshal.dump('')
=> "\x04\bI\"\x00\x06:\x06ET"

What am I doing wrong here? 😅
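
The Marshal dump above shows the three bytes E2 80 8D, which is U+200D, the zero-width joiner. The README states that REGEX_ANY matches emoji-related codepoints but no variation selectors, tags, or zero-width joiners, so the joiner inside the farmer sequence (man, ZWJ, ear of rice) is left behind. The sketch below reproduces the effect with Ruby's built-in \p{Emoji} class as a stand-in for REGEX_ANY, which is an assumption made so the example runs without the gem.

```ruby
farmer = "\u{1F468}\u200D\u{1F33E}" # 👨 + ZWJ + 🌾

# Removing only the emoji codepoints leaves the invisible joiner in place.
leftover = farmer.gsub(/\p{Emoji}/, "")
leftover_codepoints = leftover.codepoints.map { |cp| format("U+%04X", cp) }

# One option: also remove joiners and emoji variation selectors.
# Caveat: \p{Emoji} also covers digits, "#" and "*", so this strips those too.
stripped = farmer.gsub(/[\p{Emoji}\u200D\uFE0F]/, "")

p leftover_codepoints  # ["U+200D"]
p stripped.empty?      # true
```

Since the README describes Unicode::Emoji::REGEX as matching whole sequences, including ZWJ sequences, substituting with that constant may be the better fit for removing complete emoji.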

Adding support for new Emoji

Hey there!

First off I absolutely love this library, it's so well organised and designed, it has been a total pleasure to use - thank you for your incredible work.

I'm just wondering, is there a straightforward way for me to update the marshal.gz file with newer emoji? I'm trying to build something that will match newer emoji and want to add them to one of the regexes.

I really love the project! Thank you for your hard work!

New architecture proposal to reduce memory usage

Hello.

I noticed that the unicode-emoji gem takes more memory than I expected from such a library. Just requiring the gem takes 7-8MB. I know that by today's standards this isn't a huge amount, but if many such gems were used, it could add up to unnecessary memory usage.

Here is a method I used to measure memory usage (I use get_process_mem gem):

require 'get_process_mem'

def mem(&block)
  raise ArgumentError, 'missing block' unless block

  mem = GetProcessMem.new
  before = mem.mb
  block.call
  after = mem.mb
  return after - before
end

puts mem { require "unicode/emoji" }

Running this script multiple times gives me numbers between 7 and 9MB (most of the time around 7.6MB).

I also used memory_profiler:

require 'memory_profiler'
report = MemoryProfiler.report do
  require 'unicode/emoji'
end

report.pretty_print

It gives the information where memory is allocated but also how much of it is retained (which in most cases means it will never be freed).

I'm not sure how exactly people use this gem, but looking at its content I suspect that most of them use one of the provided regex constants. This is the case in the application I work on: we literally use a single regex from this library (Unicode::Emoji::REGEX).

What can be done to lower memory usage? Here is the idea:

  • instead of generating all the Regex constants at runtime, they could be generated offline and included directly in a file
  • every constant could go into its own file, and autoload :FOO, File.expand_path('emoji/foo', __dir__) could be used to lazy-load it on first use
  • INDEX could be lazy-loaded too. If all regexes were generated offline, it would only be used by methods like properties. No method calls? No constant loaded.
  • some regexes are quite big (require "objspace"; ObjectSpace.memsize_of(...)), for example REGEX_VALID_INCLUDE_TEXT is almost 0.5MB. I didn't look closely, but I think some big unions like "||..." could be replaced by a range "[char1-charN]" (if it is a run of consecutive characters, of course).

If done properly, then for a usage scenario like mine (a single constant), memory usage would be reduced from 7-8MB to the size of that constant (in our case 120kB).
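
The lazy-loading idea from the list above can be sketched with Module#autoload. The example below writes a stand-in "pre-generated" regex file to a temporary directory and shows that the file is loaded only when the constant is first referenced. The EmojiSketch namespace, the LAZY_REGEX constant, and the file name are all hypothetical; the sketch shows the mechanism only and makes no claim about how the gem is or should be organised.

```ruby
require "tmpdir"
require "fileutils"

dir  = Dir.mktmpdir
path = File.join(dir, "lazy_regex.rb")

# Stands in for a regex file generated offline (hypothetical content).
File.write(path, <<~'RUBY')
  EmojiSketch::LAZY_REGEX = /[\u{1F600}-\u{1F64F}]/
RUBY

EmojiSketch = Module.new                 # hypothetical namespace
EmojiSketch.autoload(:LAZY_REGEX, path)  # registers the file, loads nothing yet

pending_before = EmojiSketch.autoload?(:LAZY_REGEX)          # the path: not loaded yet
matched        = EmojiSketch::LAZY_REGEX.match?("\u{1F600}") # first reference loads the file
pending_after  = EmojiSketch.autoload?(:LAZY_REGEX)          # nil once the constant is loaded

puts pending_before == path  # true
puts matched                 # true
puts pending_after.inspect   # nil

FileUtils.remove_entry(dir)  # clean up the temporary directory
```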

Do you think it is worth looking into?

New regex which includes textual emoji as well

Feature description

The REGEX and REGEX_VALID regexes will not match textual emoji like 😴︎ (because they're not considered emoji by the Consortium).

However, some software might want to detect these, too (see #4 for example)

Seems like 🫶 might not be supported?

We seemed to have an issue with the above emoji not being caught by the gem regex? I don't really know how to debug further, but I can say that this doesn't seem to be working correctly?

Happy to try and debug further in any way I can help? Let me know,

Many thanks.
