explosion / spacymoji

💙 Emoji handling and meta data for spaCy with custom extension attributes

Home Page: https://spacy.io

License: MIT License

Python 97.30% Shell 2.70%

spacy nlp natural-language-processing spacy-pipeline emoji emojis emoji-unicode spacy-extension

spacymoji's Introduction

spacymoji: emoji for spaCy

spaCy extension and pipeline component for adding emoji meta data to Doc objects. Detects emoji consisting of one or more Unicode characters, and can optionally merge multi-char emoji (combined pictures, emoji with skin tone modifiers) into one token. Human-readable emoji descriptions are added as a custom attribute, and an optional lookup table can be provided for your own descriptions. The extension sets the custom Doc, Token and Span attributes ._.is_emoji, ._.emoji_desc, ._.has_emoji and ._.emoji. You can read more about custom pipeline components and extension attributes in the spaCy documentation.

Emoji are matched using spaCy's PhraseMatcher, and looked up in the data table provided by the emoji package.
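
To see why some emoji span more than one character, it can help to inspect their code points. The following is a minimal sketch using only Python's standard library, not spacymoji itself; the helper name `codepoints` is made up for illustration.

```python
import unicodedata


def codepoints(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# Thumbs up followed by a skin tone modifier: two code points, one picture.
thumbs_dark = "\U0001F44D\U0001F3FF"
# Man + zero width joiner + microphone: three code points, one picture.
singer = "\U0001F468\u200D\U0001F3A4"

for label, emoji in [("thumbs up, dark skin tone", thumbs_dark), ("singer", singer)]:
    print(label, len(emoji))
    for line in codepoints(emoji):
        print("  ", line)
```

Sequences like these are the "multi-char emoji" that the merge option is meant to collapse into a single token.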

⏳ Installation

spacymoji requires spaCy v3.0.0 or higher. For spaCy v2.x, install spacymoji==2.0.0.

pip install spacymoji

☝️ Usage

Import the component and add it anywhere in your pipeline using the string name of the "emoji" component factory:

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("emoji", first=True)
doc = nlp("This is a test 😻 👍🏿")
assert doc._.has_emoji is True
assert doc[2:5]._.has_emoji is True
assert doc[0]._.is_emoji is False
assert doc[4]._.is_emoji is True
assert doc[5]._.emoji_desc == "thumbs up dark skin tone"
assert len(doc._.emoji) == 2
assert doc._.emoji[1] == ("👍🏿", 5, "thumbs up dark skin tone")

spacymoji only cares about the token text, so you can use it on a blank Language instance (it should work for all available languages!) or in a loaded trained pipeline. If your pipeline includes a tagger, parser and entity recognizer, make sure to add the emoji component with first=True, so the spans are merged right after tokenization and before the document is parsed. If your text contains a lot of emoji, this might even give you a nice boost in parser accuracy.

Available attributes

The extension sets attributes on the Doc, Span and Token. You can change the attribute names (and other parameters of the Emoji component) by passing them via the config parameter in the nlp.add_pipe(...) method. For more details on custom components and attributes, see the processing pipelines documentation.

Attribute	Type	Description
`Token._.is_emoji`	bool	Whether the token is an emoji.
`Token._.emoji_desc`	str	A human-readable description of the emoji.
`Doc._.has_emoji`	bool	Whether the document contains emoji.
`Doc._.emoji`	List[Tuple[str, int, str]]	`(emoji, index, description)` tuples of the document's emoji.
`Span._.has_emoji`	bool	Whether the span contains emoji.
`Span._.emoji`	List[Tuple[str, int, str]]	`(emoji, index, description)` tuples of the span's emoji.

Settings

You can configure the emoji factory by setting any of the following parameters in the config dictionary:

Setting	Type	Description
`attrs`	Tuple[str, str, str, str]	Attributes to set on the `._` property. Defaults to `('has_emoji', 'is_emoji', 'emoji_desc', 'emoji')`.
`pattern_id`	str	ID of match pattern, defaults to `'EMOJI'`. Can be changed to avoid ID conflicts.
`merge_spans`	bool	Merge spans containing multi-character emoji, defaults to `True`. Will only merge combined emoji resulting in one icon, not sequences.
`lookup`	Dict[str, str]	Optional lookup table that maps emoji strings to custom descriptions, e.g. translations or other annotations.

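import spacy

nlp = spacy.blank("en")
# "lookup" is a regular dict key, and the factory is referenced by its string name
emoji_config = {"attrs": ("has_e", "is_e", "e_desc", "e"), "lookup": {"👨‍🎤": "David Bowie"}}
nlp.add_pipe("emoji", first=True, config=emoji_config)
doc = nlp("We can be 👨‍🎤 heroes")
assert doc[3]._.is_e
assert doc[3]._.e_desc == "David Bowie"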
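
The lookup setting can be pictured as a dictionary that is consulted before the default descriptions. The sketch below is plain Python and only illustrates that idea. It assumes custom entries take precedence and that other emoji fall back to the default table, which the README does not state explicitly. `describe` and `DEFAULT_DESCS` are hypothetical names, not part of spacymoji.

```python
from typing import Dict, Optional

# Hypothetical stand-in for the default description table.
DEFAULT_DESCS: Dict[str, str] = {
    "\U0001F44D\U0001F3FF": "thumbs up dark skin tone",
}


def describe(emoji_text: str, lookup: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return a custom description if one exists, else the default, else None."""
    if lookup and emoji_text in lookup:
        return lookup[emoji_text]
    return DEFAULT_DESCS.get(emoji_text)


singer = "\U0001F468\u200D\U0001F3A4"
custom = {singer: "David Bowie"}

print(describe(singer, custom))
print(describe("\U0001F44D\U0001F3FF", custom))
```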

If you're training a pipeline, you can define the component config in your config.cfg:

[nlp]
pipeline = ["emoji", "ner"]
# ...

[components.emoji]
factory = "emoji"
merge_spans = false


spacymoji's Issues

Slow to load

I’m currently using this in my project, but it takes about 10-20 seconds to instantiate this class. Is that normal, or might there be something wrong with my installation?

nlp = spacy.load('en')
emoji = Emoji(nlp)

^ Takes about 21 seconds
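
To narrow down where the time goes, one option is to time each step separately. This is a minimal sketch with the standard library; `timed` is a hypothetical helper, and the commented-out lines assume the same `spacy.load` and `Emoji(nlp)` calls as in the report.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs without spaCy installed.
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.3f}s")

# With spaCy and spacymoji installed, the calls from the report
# could be timed the same way:
# nlp, load_seconds = timed(spacy.load, "en")
# emoji, init_seconds = timed(Emoji, nlp)
```

Comparing the two timings would show whether the delay comes from loading the model or from constructing the component.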

Better Descriptors

The Emoji library seems to base the metadata on Unicode's database, however it may not be descriptive enough for some.

https://www.wikiwand.com/en/Dingbat#/Dingbats_Unicode_block
- Ticks
- Crosses
- Enclosed Text
- Stars
- Arrows
https://www.wikiwand.com/en/Emoticons_(Unicode_block)
- Emotion (variations, whether it is positive or negative)
https://www.wikiwand.com/en/Miscellaneous_Symbols
- Stars
- Religious Symbols
- Astrological Symbols
- Music Symbols
- Weather Symbols
https://www.wikiwand.com/en/Miscellaneous_Symbols_and_Pictographs
- Hand Gestures
- Weather Symbols
- Geographic Symbols
- Speech Symbols
- (Other Digital Pictographs)
https://www.wikiwand.com/en/Supplemental_Symbols_and_Pictographs
- Animals
- Emotion (variations, whether it is positive or negative)
- Food and Drink
- (Other Activity Pictographs)
https://www.wikiwand.com/en/Symbols_and_Pictographs_Extended-A
https://www.wikiwand.com/en/Transport_and_Map_Symbols
- Transport Symbols
https://www.wikiwand.com/en/Arrows_(Unicode_block) and https://www.wikiwand.com/en/Supplemental_Arrows-B
- Arrows
https://www.wikiwand.com/en/Enclosed_Alphanumerics and https://www.wikiwand.com/en/Enclosed_CJK_Letters_and_Months and https://www.wikiwand.com/en/Enclosed_Alphanumeric_Supplement
- Enclosed Text
https://www.wikiwand.com/en/Geometric_Shapes and https://www.wikiwand.com/en/Geometric_Shapes_Extended
- Squares
- Triangles
- Circles
- Darkness/Lightness
- Stars
- Colors
https://www.wikiwand.com/en/Miscellaneous_Symbols_and_Arrows
- Squares
- Circles
- Triangles
- Arrows
- Lightness/Darkness

Tokens not properly split when using emoji with modifiers

import spacy
from spacymoji import Emoji

def test():
    nlp = spacy.load('en_core_web_sm')
    emoji = Emoji(nlp, merge_spans=True)
    nlp.add_pipe(emoji, first=True)
    doc = nlp(
        'Word!👍🏿')
    for token in doc:
        print(token)
    doc = nlp(
        'Word! 👍🏿')
    for token in doc:
        print(token)
    doc = nlp(
        'Word!👍')
    for token in doc:
        print(token)
    return doc

The code above shows the problem: "Word!" is not correctly split into "Word" and "!" when the thumbs up has a dark skin tone modifier.

spaCy v3 support: PR

Hi, just to avoid any potential overlapping work... I'm about to finish an update of spacymoji to make it compatible with spacy v3.0, while also attempting to fix a couple of edge cases. I've got this PR on my own fork in case anyone wants to check or help: graphext#2

So far all tests are working. I'll still want to check if I can fix a couple of outstanding upstream issues before making a PR here.

Can't merge non-disjoint spans. '🇸' is already part of tokens to merge.

First off, thanks for this amazingly useful library! Really takes the sting out of working with social media texts.

I found something that looks like a bug occurring with multi-codepoint emoji.
Repro case:

import en_core_web_sm
from spacymoji import Emoji

nlp = en_core_web_sm.load()
em = Emoji(nlp)

em(nlp(u"🇺🇸🇦🇷"))

This raises the exception shown in the title.
Initializing Emoji with merge_spans=False circumvents this.

Let me know if I can help!
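
One plausible reading of this error is that flag emoji are pairs of regional indicator characters, so two adjacent flags can also match a third flag across their boundary. The sketch below is plain Python, not spacymoji or spaCy code. It assumes a tiny hypothetical flag table (`KNOWN_FLAGS`) and shows the overlapping candidates and one greedy way to keep only disjoint spans.

```python
REGIONAL_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

# Hypothetical, deliberately tiny table of valid flag pairs.
KNOWN_FLAGS = {"US", "SA", "AR"}


def to_letters(text):
    """Map regional indicator symbols to plain ASCII letters A-Z."""
    return "".join(chr(ord(ch) - REGIONAL_A + ord("A")) for ch in text)


def candidate_spans(text):
    """Return (start, end, code) for every adjacent pair that forms a known flag."""
    letters = to_letters(text)
    return [
        (i, i + 2, letters[i:i + 2])
        for i in range(len(letters) - 1)
        if letters[i:i + 2] in KNOWN_FLAGS
    ]


def drop_overlaps(spans):
    """Greedily keep spans from left to right, skipping any that overlap."""
    kept, last_end = [], 0
    for start, end, code in sorted(spans):
        if start >= last_end:
            kept.append((start, end, code))
            last_end = end
    return kept


flags = "\U0001F1FA\U0001F1F8\U0001F1E6\U0001F1F7"  # US flag followed by AR flag
print(candidate_spans(flags))
print(drop_overlaps(candidate_spans(flags)))
```

The middle candidate ("SA") shares a character with both neighbours, which is the kind of non-disjoint span a merge step cannot handle without filtering first.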

Slow loading when the nlp object used is a StanfordNLPLanguage instance

This is probably a rare case that only occurs when adding a spacymoji step to the pipeline of a StanfordNLPLanguage instance. The spacymoji constructor uses the StanfordNLPLanguage's tokenizer to convert emoji to Docs for the PhraseMatcher. For standard spaCy Language instances this is close to optimal, since the tokenizer does only the minimal work necessary. But the StanfordNLPLanguage's tokenizer executes the whole StanfordNLP pipeline at once, which makes this step very slow (more than 2 min vs. less than 1 s in the normal case on my laptop, simply to create a spacymoji instance).

Not sure how to fix this elegantly. Making the code conditional on the class of the nlp object could be one option. Another might be to always simply load the default English tokenizer and use this to process the Emoji, instead of the passed nlp object's tokenizer.

I'm happy to create a PR if we agree on the best solution.

spacymoji does not support Thai language

Following PyThaiNLP/pythainlp#465, I found that spacymoji does not support the Thai language. Can you advise me?

handle \u200d delimited emoji

Currently 👨🏽‍👩🏽‍👧🏽 is handled as multiple tokens. Note that this likely relates to carpedm20/emoji#204

Ideally it would be handled as a single token.
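
As a rough illustration of the desired behavior, the sketch below groups code points so that zero width joiners, skin tone modifiers and variation selectors stay attached to the preceding emoji. It is a simplification of the Unicode emoji sequence rules written in plain Python, not how spacymoji or the emoji package works; `group_sequences` is a hypothetical helper.

```python
ZWJ = "\u200d"   # zero width joiner
VS16 = "\ufe0f"  # variation selector-16 (emoji presentation)


def is_skin_tone_modifier(ch):
    """True for the five Fitzpatrick modifiers U+1F3FB..U+1F3FF."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


def group_sequences(text):
    """Split text into units, keeping joined emoji sequences together."""
    groups = []
    for ch in text:
        attach = bool(groups) and (
            ch in (ZWJ, VS16)
            or is_skin_tone_modifier(ch)
            or groups[-1].endswith(ZWJ)
        )
        if attach:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


# Man, woman and girl, each with a medium skin tone modifier, joined by ZWJ.
family = (
    "\U0001F468\U0001F3FD" + ZWJ
    + "\U0001F469\U0001F3FD" + ZWJ
    + "\U0001F467\U0001F3FD"
)
print(len(family), len(group_sequences(family)))
```

Under these simplified rules the eight code points of the family emoji come out as one unit, while two plain emoji written back to back stay separate.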

Can't install via pip

Hi, thanks for the awesome add-on to spacy!
Well, I can't install it via pip install spacymoji as described in the README, because I get the following error (I am using Python 3.6 and spaCy 2.0.1):

>  pip install spacymoji
Collecting spacymoji
  Using cached https://files.pythonhosted.org/packages/3d/be/0074c90f82a38dbb3b4238f7b08de59a5b782ea1b5d8e15b2ce61db8b5a5/spacymoji-1.0.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-o1doktg0/spacymoji/setup.py", line 42, in <module>
        setup_package()
      File "/tmp/pip-install-o1doktg0/spacymoji/setup.py", line 37, in setup_package
        zip_safe=False,
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/site-packages/setuptools/__init__.py", line 140, in setup
        return distutils.core.setup(**attrs)
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/distutils/dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/distutils/dist.py", line 971, in run_command
        log.info("running %s", command)
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/distutils/log.py", line 44, in info
        self._log(INFO, msg, args)
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/distutils/log.py", line 32, in _log
        encoding = stream.encoding
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/codecs.py", line 408, in __getattr__
        return getattr(self.stream, name)
    AttributeError: '_io.BufferedWriter' object has no attribute 'encoding'

Thanks!

update to support emoji > 2.0

relates to carpedm20/emoji#255

extension attributes problem

When calling:

token._.is_hashtag

I get this error :

[E047] Can't assign a value to unregistered extension attribute 'is_hashtag'. Did you forget to call the set_extension method

I use spaCy v2.1.6

Doesn't seem to split multiple emoji occurring in sequence without spaces between them.

This is a pretty common way for people to use emoji, so it's unfortunate that, for example, 😄😄 gets treated as one token instead of two.