
spacymoji's Introduction

spacymoji: emoji for spaCy

spaCy extension and pipeline component for adding emoji metadata to Doc objects. Detects emoji consisting of one or more Unicode characters, and can optionally merge multi-character emoji (combined pictures, emoji with skin tone modifiers) into one token. Human-readable emoji descriptions are added as a custom attribute, and an optional lookup table can be provided for your own descriptions. The extension sets the custom Doc, Token and Span attributes ._.is_emoji, ._.emoji_desc, ._.has_emoji and ._.emoji. You can read more about custom pipeline components and extension attributes in the spaCy documentation.

Emoji are matched using spaCy's PhraseMatcher, and looked up in the data table provided by the emoji package.
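As a rough illustration of the match-then-look-up idea, the sketch below scans a string for the longest emoji sequence found in a small hand-written table. It is a simplified stand-in, not spacymoji's implementation: `EMOJI_TABLE` and `find_emoji` are hypothetical names, the table is not the emoji package's data, and positions are character offsets rather than token indices.

```python
# Hand-written table standing in for the emoji package's data (assumption).
EMOJI_TABLE = {
    "\U0001F63B": "smiling cat with heart-eyes",
    "\U0001F44D": "thumbs up",
    "\U0001F44D\U0001F3FF": "thumbs up dark skin tone",
}


def find_emoji(text, table=EMOJI_TABLE):
    """Return (emoji, char_offset, description) tuples, preferring the longest match."""
    longest = max(len(key) for key in table)
    found = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so an emoji plus its skin tone
        # modifier is matched as one unit rather than as the base emoji alone.
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if candidate in table:
                found.append((candidate, i, table[candidate]))
                i += len(candidate)
                break
        else:
            # No table entry starts at this position; move on by one character.
            i += 1
    return found


matches = find_emoji("test \U0001F63B \U0001F44D\U0001F3FF")
for emoji_text, offset, description in matches:
    print(offset, description)
```

Trying the longest candidate first is what keeps a modified emoji together, which is the same concern that merging multi-character emoji into one token addresses.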


⏳ Installation

spacymoji requires spacy v3.0.0 or higher. For spaCy v2.x, install spacymoji==2.0.0.

pip install spacymoji

☝️ Usage

Import the component and add it anywhere in your pipeline using the string name of the "emoji" component factory:

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("emoji", first=True)
doc = nlp("This is a test 😻 👍🏿")
assert doc._.has_emoji is True
assert doc[2:5]._.has_emoji is True
assert doc[0]._.is_emoji is False
assert doc[4]._.is_emoji is True
assert doc[5]._.emoji_desc == "thumbs up dark skin tone"
assert len(doc._.emoji) == 2
assert doc._.emoji[1] == ("👍🏿", 5, "thumbs up dark skin tone")

spacymoji only cares about the token text, so you can use it on a blank Language instance (it should work for all available languages!) or in a trained pipeline. If your pipeline includes a tagger, parser and entity recognizer, make sure to add the emoji component with first=True, so the spans are merged right after tokenization and before the document is parsed. If your text contains a lot of emoji, this might even give you a nice boost in parser accuracy.

Available attributes

The extension sets attributes on the Doc, Span and Token. You can change the attribute names (and other parameters of the Emoji component) by passing them via the config parameter in the nlp.add_pipe(...) method. For more details on custom components and attributes, see the processing pipelines documentation.

Attribute Type Description
Token._.is_emoji bool Whether the token is an emoji.
Token._.emoji_desc str A human-readable description of the emoji.
Doc._.has_emoji bool Whether the document contains emoji.
Doc._.emoji List[Tuple[str, int, str]] (emoji, index, description) tuples of the document's emoji.
Span._.has_emoji bool Whether the span contains emoji.
Span._.emoji List[Tuple[str, int, str]] (emoji, index, description) tuples of the span's emoji.

Settings

You can configure the emoji factory by setting any of the following parameters in the config dictionary:

Setting Type Description
attrs Tuple[str, str, str, str] Attributes to set on the ._ property. Defaults to ('has_emoji', 'is_emoji', 'emoji_desc', 'emoji').
pattern_id str ID of match pattern, defaults to 'EMOJI'. Can be changed to avoid ID conflicts.
merge_spans bool Merge spans containing multi-character emoji, defaults to True. Will only merge combined emoji resulting in one icon, not sequences.
lookup Dict[str, str] Optional lookup table that maps emoji strings to custom descriptions, e.g. translations or other annotations.
emoji_config = {"attrs": ("has_e", "is_e", "e_desc", "e"), "lookup": {"👨‍🎤": "David Bowie"}}
nlp.add_pipe("emoji", first=True, config=emoji_config)
doc = nlp("We can be 👨‍🎤 heroes")
assert doc[3]._.is_e
assert doc[3]._.e_desc == "David Bowie"
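The precedence implied by the lookup setting, where a custom entry replaces the default description, can be sketched in plain Python. This only illustrates the idea: `describe`, `DEFAULT_DESCRIPTIONS` and the default texts are hypothetical and are not spacymoji's internals.

```python
# Hypothetical default descriptions, standing in for the emoji package's data.
DEFAULT_DESCRIPTIONS = {
    "\U0001F468\u200D\U0001F3A4": "man singer",
    "\U0001F63B": "smiling cat with heart-eyes",
}


def describe(emoji_text, lookup=None, defaults=DEFAULT_DESCRIPTIONS):
    """Return the custom description if one exists, else the default, else None."""
    if lookup and emoji_text in lookup:
        return lookup[emoji_text]
    return defaults.get(emoji_text)


singer = "\U0001F468\u200D\U0001F3A4"  # man + zero-width joiner + microphone
custom_lookup = {singer: "David Bowie"}

print(describe(singer, custom_lookup))
print(describe(singer))
```

Because the custom table is consulted first, it only needs entries for the emoji you want to relabel; everything else falls through to the defaults.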

If you're training a pipeline, you can define the component config in your config.cfg:

[nlp]
pipeline = ["emoji", "ner"]
# ...

[components.emoji]
factory = "emoji"
merge_spans = false
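For a quick sanity check of such a fragment outside of spaCy, something like the following may help. It is a sketch using only the standard library, not the way spaCy loads configs, and it assumes every value is JSON-compatible, as the ones in this snippet are. `read_emoji_settings` is a hypothetical helper name.

```python
import configparser
import json

CONFIG_TEXT = """
[nlp]
pipeline = ["emoji", "ner"]

[components.emoji]
factory = "emoji"
merge_spans = false
"""


def read_emoji_settings(text):
    """Parse the INI-style text and decode the emoji component's values as JSON."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    pipeline = json.loads(parser["nlp"]["pipeline"])
    section = parser["components.emoji"]
    settings = {key: json.loads(value) for key, value in section.items()}
    return pipeline, settings


pipeline, settings = read_emoji_settings(CONFIG_TEXT)
# The component should be listed in the pipeline and have its own section.
assert "emoji" in pipeline
print(settings)
```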

spacymoji's People

Contributors

adrianeboyd, bdura, buhrmann, ines

spacymoji's Issues

Slow to load

I’m currently using this in my project, but it takes about 10-20 seconds to instantiate this class. Is that normal, or might there be something wrong with my installation?

nlp = spacy.load('en')
emoji = Emoji(nlp)

^ Takes about 21 seconds

Better Descriptors

The emoji library seems to base its metadata on Unicode's database, which may not be descriptive enough for some uses.

Tokens not properly split when using emoji with modifiers

import spacy
from spacymoji import Emoji

def test():
    nlp = spacy.load('en_core_web_sm')
    emoji = Emoji(nlp, merge_spans=True)
    nlp.add_pipe(emoji, first=True)
    doc = nlp(
        'Word!👍🏿')
    for token in doc:
        print(token)
    doc = nlp(
        'Word! 👍🏿')
    for token in doc:
        print(token)
    doc = nlp(
        'Word!👍')
    for token in doc:
        print(token)
    return doc

This shows the problem: "Word!" is not correctly split into "Word" and "!" when the thumbs up has a dark skin tone modifier.

spaCy v3 support: PR

Hi, just to avoid any potential overlapping work... I'm about to finish an update of spacymoji to make it compatible with spacy v3.0, while also attempting to fix a couple of edge cases. I've got this PR on my own fork in case anyone wants to check or help: graphext#2

So far all tests are working. I'll still want to check if I can fix a couple of outstanding upstream issues before making a PR here.

Can't merge non-disjoint spans. '🇸' is already part of tokens to merge.

First off, thanks for this amazingly useful library! Really takes the sting out of working with social media texts.

I found something that looks like a bug occurring with multi-codepoint emoji.
Repro case:

import en_core_web_sm
from spacymoji import Emoji

nlp = en_core_web_sm.load()
em = Emoji(nlp)

em(nlp(u"🇺🇸🇦🇷"))

This raises the exception shown in the title.
Initializing Emoji with merge_spans=False circumvents this.

Let me know if I can help!
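The error message points at overlapping match spans. Assuming the matcher reports every adjacent pair of regional indicator symbols as a flag candidate (an assumption, not verified against the spacymoji source), one possible approach is to keep only non-overlapping spans before merging. The sketch below works on plain (start, end) index pairs; `filter_overlapping` is a hypothetical helper, and spaCy ships a comparable utility, spacy.util.filter_spans, for Span objects.

```python
def filter_overlapping(spans):
    """Keep the longest spans first (earliest on ties) and drop any that overlap a kept span."""
    # Sort by length descending, then by start ascending.
    ordered = sorted(spans, key=lambda span: (span[1] - span[0], -span[0]), reverse=True)
    kept = []
    taken = set()
    for start, end in ordered:
        if any(index in taken for index in range(start, end)):
            continue  # shares at least one position with a span already kept
        kept.append((start, end))
        taken.update(range(start, end))
    return sorted(kept)


# Four regional indicators in a row: three overlapping two-symbol candidates.
candidates = [(0, 2), (1, 3), (2, 4)]
print(filter_overlapping(candidates))
```

With the middle candidate dropped, the remaining spans are disjoint and could be merged without conflict.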

Slow loading when the nlp object used is a StanfordNLPLanguage instance

This is probably a rare case occurring only when adding a spacymoji step to the pipeline of a StanfordNLPLanguage instance. However, what happens is that the spacymoji constructor uses the StanfordNLPLanguage's tokenizer to convert emoji to Docs for the PhraseMatcher. For standard spaCy Language instances this is pretty optimal, since the tokenizer does the minimal work necessary here. But the StanfordNLPLanguage's tokenizer executes the whole StanfordNLP pipeline at once, making this step very slow (more than 2 minutes vs. less than 1 second in the normal case on my laptop, simply to create a spacymoji instance).

Not sure how to fix this elegantly. Making the code conditional on the class of the nlp object could be one option. Another might be to always simply load the default English tokenizer and use this to process the Emoji, instead of the passed nlp object's tokenizer.

I'm happy to create a PR if we agree on the best solution.

Can't install via pip

Hi, thanks for the awesome add-on to spacy!
Well, I can't install it via pip install spacymoji as described in the readme; it fails with the following error (I am using Python 3.6 and my current spacy version is 2.0.1):

>  pip install spacymoji
Collecting spacymoji
  Using cached https://files.pythonhosted.org/packages/3d/be/0074c90f82a38dbb3b4238f7b08de59a5b782ea1b5d8e15b2ce61db8b5a5/spacymoji-1.0.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-o1doktg0/spacymoji/setup.py", line 42, in <module>
        setup_package()
      File "/tmp/pip-install-o1doktg0/spacymoji/setup.py", line 37, in setup_package
        zip_safe=False,
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/site-packages/setuptools/__init__.py", line 140, in setup
        return distutils.core.setup(**attrs)
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/distutils/dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/distutils/dist.py", line 971, in run_command
        log.info("running %s", command)
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/distutils/log.py", line 44, in info
        self._log(INFO, msg, args)
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/distutils/log.py", line 32, in _log
        encoding = stream.encoding
      File "/home/lzfelix/bin/miniconda2/envs/dev/lib/python3.6/codecs.py", line 408, in __getattr__
        return getattr(self.stream, name)
    AttributeError: '_io.BufferedWriter' object has no attribute 'encoding'

Thanks!

extension attributes problem

When calling:

token._.is_hashtag

I get this error:

[E047] Can't assign a value to unregistered extension attribute 'is_hashtag'. Did you forget to call the set_extension method

I use spaCy v2.1.6
