davidberenstein1957 / concise-concepts

This repository contains an easy and intuitive approach to few-shot NER using most similar expansion over spaCy embeddings. Now with entity scoring.

License: MIT License

Python 100.00%

few-shot-classifcation ner spacy gensim natural-language-processing nlp machine-learning hacktoberfest

concise-concepts's Introduction

Hi there 👋

From failing to study medicine ➡️ BSc industrial engineer ➡️ MSc computer scientist.
Life can be strange, so better enjoy it.
I'm sure I do, by: 👨🏽‍🍳 Cooking, 👨🏽‍💻 Coding, 🏆 Committing.

Conference slides 📖

🧼 From GPU-poor to data-rich - data quality practices for LLM fine-tuning
Deeplearning.ai LLM workshop - get started with Argilla for human- and distilabel for AI feedback
NLP Healthcare Summit 2023 - Smart Shortcuts for Bootstrapping a Healthcare NER Project
Anyscale Ray Europe Meetup - Smart shortcuts for Bootstrapping a Text Classification project

employers 👨🏽‍💻

Argilla (2022-current) - data annotation and monitoring for enterprise NLP
Pandora Intelligence (2020-2022) - an independent intelligence company specialized in security risks

open source ⭐️

maintainer 🤓

concise-concepts - a word similarity approach to few-shot NER
fast-sentence-transformers - wrapper for ONNX speed enhanced sentence-transformers
classy-classification - a quick and dirty few-shot text classification solution
crosslingual-coreference - a multi-lingual CoRef resolver using cross-lingual training
adept-augmentations - a Python library aimed at dissecting and augmenting NER training data
spacy-setfit - a Python library aimed to facilitate easy SetFit usage in spaCy

contributions 🫱🏾‍🫲🏼

spaCy - several additions to the spacy-universe
- spanmarker - added .pipe() method to spaCy integration
- spacy-dbpedia-spotlight - added a batch processing functionality
- spacy-fishing - added a batch processing functionality + bug fixes
- spacy-opentapioca - added a batch processing functionality
streamlit-url-fragment - resolved Python versioning issues
allennlp-models - added a batch processing functionality
mutate - resolved Python versioning issues and added PyPI support
rebel - added a batch processing functionality
trl - updated RLHF documentation for PPOTrainer

volunteering 🌍

Bonfari - small- to medium-scale sustainable projects in Gambia 🇬🇲
510 red-cross - occasional projects to improve humanitarian aid with data

concise-concepts's People

Contributors

Stargazers

Watchers

Forkers

koaning manikant92 stjordanis asheeshiit riezebos joeyburzynski hyangchun danielerigo danielmlow techthiyanes entn-at tomaarsen iteimouri swelcker chopen82

concise-concepts's Issues

duplicate logging regarding missing entries in embedding model

2022-10-07 10:21:41.412 | WARNING  | concise_concepts.conceptualizer.Conceptualizer:verify_data:188 - word ´persianas eléctricas´ from key ´CARACTERISTICAS´ not present in vector model
2022-10-07 10:21:41.412 | WARNING  | concise_concepts.conceptualizer.Conceptualizer:verify_data:188 - word ´puente térmico´ from key ´CARACTERISTICAS´ not present in vector model
2022-10-07 10:21:41.412 | WARNING  | concise_concepts.conceptualizer.Conceptualizer:verify_data:188 - word ´por split´ from key ´CARACTERISTICAS´ not present in vector model
2022-10-07 10:21:41.413 | WARNING  | concise_concepts.conceptualizer.Conceptualizer:verify_data:188 - word ´tarima vinilica´ from key ´CARACTERISTICAS´ not present in vector model
2022-10-07 10:21:41.413 | WARNING  | concise_concepts.conceptualizer.Conceptualizer:verify_data:188 - word ´persianas electricas´ from key ´CARACTERISTICAS´ not present in vector model

Unable to load local custom gensim model

I am trying to use concise-concepts with a custom word2vec model I trained using gensim. Here is a link to my notebook: https://github.com/EarthNLP/ClimateScholar/blob/users/hevia/concise-concepts-ner/research/Concise_Concepts.ipynb

The error & code in question:

Looking at the code: https://github.com/Pandora-Intelligence/concise-concepts/blob/main/concise_concepts/conceptualizer/Conceptualizer.py#L147

I tried to load my model the same way it is written here, and it works just fine:

The documentation doesn't mention any specific restrictions on loading models (unless you can't load local models?): https://github.com/Pandora-Intelligence/concise-concepts#use-gensimword2vec-model-from-pre-trained-gensim-or-custom-model-path, but I did notice that the example uses a blank spaCy model: https://github.com/Pandora-Intelligence/concise-concepts/blob/main/concise_concepts/examples/example_gensim_custom.py, which I also tried, and it did not work either.

I am also running gensim 4.X+

Hopefully it's a minor issue on my end, so that you don't have to do any work 😅. The library is awesome, and I would love to try it out with some of our custom word embeddings.

Thanks for your time!

Loading a local NER model but has no embeddings

Hi! I have a locally trained model which only has NER in its pipeline, and as soon as I try to add the concise-concepts data it returns

Exception: Choose a model with internal embeddings i.e. md or lg.

How can I train my model to have the necessary embeddings to work with concise-concepts?

Example fails while using GPUs

I was using en_core_web_trf with GPU enabled. Since en_core_web_trf is not supported, I changed it to en_core_web_lg. However, that gave me the following error while the GPU was enabled. It took me a while to figure out what was wrong.

Fail while using GPU

import spacy
import concise_concepts

from spacy import displacy


spacy.require_gpu(0) # this triggered the numpy error

data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato"],
    "meat": ["beef", "pork", "fish", "lamb"]
}

text = """
    Heat the oil in a large pan and add the Onion, celery and carrots.
    Then, cook over a medium–low heat for 10 minutes, or until softened.
    Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
    Later, add some oranges and chickens."""

# use any model that has internal spacy embeddings
nlp = spacy.load('en_core_web_lg')
nlp.add_pipe("concise_concepts",
    config={"data": data}
)
doc = nlp(text)

options = {"colors": {"fruit": "darkorange", "vegetable": "limegreen", "meat": "salmon"},
           "ents": ["fruit", "vegetable", "meat"]}

displacy.render(doc, style="ent", options=options)

Error message

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

Latest Python package 0.6.2 failing. Error in Conceptualizer.py. Results Deterioration

Hi @davidberenstein1957, the latest version 0.6.2 cannot be installed via pip; it still installs 0.6.1.
I tried manually applying the latest changes to the Conceptualizer.py file.
Even then, testing gives the following error.
~\anaconda3\lib\site-packages\concise_concepts\conceptualizer\Conceptualizer.py in run(self)
97 self.resolve_overlapping_concepts()
98 self.infer_original_data()
---> 99 self.create_conceptual_patterns()
100
101 if not self.ent_score:

~\anaconda3\lib\site-packages\concise_concepts\conceptualizer\Conceptualizer.py in create_conceptual_patterns(self)
359 }
360 )
--> 361
362 add_patterns(self.data)
363 add_patterns(self.original_data)

~\anaconda3\lib\site-packages\concise_concepts\conceptualizer\Conceptualizer.py in add_patterns(input_dict)
327 if self.case_sensitive:
328 specific_copy[self.match_key] = "{op}".join(word_parts)
--> 329 else:
330 specific_copy[self.match_key] = {
331 "regex": r"(?i)"

NameError: name 'specific_copy' is not defined.

My guess is that it is caused by the following lines. I tried solving it on my end but wasn't successful.
All the previous problems regarding custom word2vec models still stand as they are.
Kindly look into this ASAP.

add support for entity scores based on similarity

multi token patterns

I might be working on a tutorial on this project, so I figured I'd double-check explicitly: are multi-token phrases supported? My impression is that they're not, and that's totally fine, but I just wanted to make sure.

This example:

import spacy
from spacy import displacy
import concise_concepts

data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato"],
    "meat": ["beef", "pork", "fish", "lamb"],
    "utensil": ["large oven", "warm stove", "big knife"]
}

text = """
    Heat the oil in a large pan and add the Onion, celery and carrots.
    Then, cook over a medium–low heat for 10 minutes, or until softened.
    Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
    Later, add some oranges and chickens. """

nlp = spacy.load("en_core_web_lg", disable=["ner"])

# ent_score for entity confidence scoring
nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True})
doc = nlp(text)

options = {"colors": {"fruit": "darkorange", "vegetable": "limegreen", "meat": "salmon", "utensil": "gray"},
           "ents": ["fruit", "vegetable", "meat", "utensil"]}

ents = doc.ents
for ent in ents:
    new_label = f"{ent.label_} ({float(ent._.ent_score):.0%})"
    options["colors"][new_label] = options["colors"].get(ent.label_.lower(), None)
    options["ents"].append(new_label)
    ent.label_ = new_label
doc.ents = ents

displacy.render(doc, style="ent", options=options)

Yields this error:

word ´large oven´ from key ´utensil´ not present in vector model
word ´warm stove´ from key ´utensil´ not present in vector model
word ´big knife´ from key ´utensil´ not present in vector model
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [4], in <cell line: 21>()
     18 nlp = spacy.load("en_core_web_lg", disable=["ner"])
     20 # ent_score for entity condifence scoring
---> 21 nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True})
     22 doc = nlp(text)
     24 options = {"colors": {"fruit": "darkorange", "vegetable": "limegreen", "meat": "salmon", "utensil": "gray"},
     25            "ents": ["fruit", "vegetable", "meat", "utensil"]}

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/language.py:795, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    787     if not self.has_factory(factory_name):
    788         err = Errors.E002.format(
    789             name=factory_name,
    790             opts=", ".join(self.factory_names),
   (...)
    793             lang_code=self.lang,
    794         )
--> 795     pipe_component = self.create_pipe(
    796         factory_name,
    797         name=name,
    798         config=config,
    799         raw_config=raw_config,
    800         validate=validate,
    801     )
    802 pipe_index = self._get_pipe_index(before, after, first, last)
    803 self._pipe_meta[name] = self.get_factory_meta(factory_name)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/language.py:674, in Language.create_pipe(self, factory_name, name, config, raw_config, validate)
    671 cfg = {factory_name: config}
    672 # We're calling the internal _fill here to avoid constructing the
    673 # registered functions twice
--> 674 resolved = registry.resolve(cfg, validate=validate)
    675 filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"]
    676 filled = Config(filled)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:747, in registry.resolve(cls, config, schema, overrides, validate)
    738 @classmethod
    739 def resolve(
    740     cls,
   (...)
    745     validate: bool = True,
    746 ) -> Dict[str, Any]:
--> 747     resolved, _ = cls._make(
    748         config, schema=schema, overrides=overrides, validate=validate, resolve=True
    749     )
    750     return resolved

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:796, in registry._make(cls, config, schema, overrides, resolve, validate)
    794 if not is_interpolated:
    795     config = Config(orig_config).interpolate()
--> 796 filled, _, resolved = cls._fill(
    797     config, schema, validate=validate, overrides=overrides, resolve=resolve
    798 )
    799 filled = Config(filled, section_order=section_order)
    800 # Check that overrides didn't include invalid properties not in config

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:868, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    865     getter = cls.get(reg_name, func_name)
    866     # We don't want to try/except this and raise our own error
    867     # here, because we want the traceback if the function fails.
--> 868     getter_result = getter(*args, **kwargs)
    869 else:
    870     # We're not resolving and calling the function, so replace
    871     # the getter_result with a Promise class
    872     getter_result = Promise(
    873         registry=reg_name, name=func_name, args=args, kwargs=kwargs
    874     )

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/__init__.py:47, in make_concise_concepts(nlp, name, data, topn, model_path, word_delimiter, ent_score, exclude_pos, exclude_dep, include_compound_words, case_sensitive)
      9 @Language.factory(
     10     "concise_concepts",
     11     default_config={
   (...)
     45     case_sensitive: bool,
     46 ):
---> 47     return Conceptualizer(
     48         nlp=nlp,
     49         name=name,
     50         data=data,
     51         topn=topn,
     52         model_path=model_path,
     53         word_delimiter=word_delimiter,
     54         ent_score=ent_score,
     55         exclude_pos=exclude_pos,
     56         exclude_dep=exclude_dep,
     57         include_compound_words=include_compound_words,
     58         case_sensitive=case_sensitive,
     59     )

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py:95, in Conceptualizer.__init__(self, nlp, name, data, topn, model_path, word_delimiter, ent_score, exclude_pos, exclude_dep, include_compound_words, case_sensitive)
     93 else:
     94     self.match_key = "LEMMA"
---> 95 self.run()
     96 self.data_upper = {k.upper(): v for k, v in data.items()}

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py:101, in Conceptualizer.run(self)
     99 self.determine_topn()
    100 self.set_gensim_model()
--> 101 self.verify_data()
    102 self.expand_concepts()
    103 self.verify_data(verbose=False)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py:193, in Conceptualizer.verify_data(self, verbose)
    188                 logger.warning(
    189                     f"word ´{word}´ from key ´{key}´ not present in vector"
    190                     " model"
    191                 )
    192     verified_data[key] = verified_values
--> 193     assert len(
    194         verified_values
    195     ), f"None of the entries for key {key} are present in the vector model"
    196 self.data = deepcopy(verified_data)
    197 self.original_data = deepcopy(self.data)

AssertionError: None of the entries for key utensil are present in the vector model

consider generative LLM prompt based word expansion

Name a comma-separated list of fruits:

banana, apple, orange,
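A minimal sketch of how the prompt above could be built from seed terms and how a completion could be parsed back into new candidate terms. The function names `build_expansion_prompt` and `parse_expansion` are hypothetical and not part of concise-concepts, and the completion string is a made-up stand-in for whatever a generative model would return.

```python
from typing import List


def build_expansion_prompt(concept: str, seeds: List[str]) -> str:
    """Build a list-continuation prompt that ends with a trailing comma."""
    return f"Name a comma-separated list of {concept}:\n\n" + ", ".join(seeds) + ","


def parse_expansion(completion: str, seeds: List[str]) -> List[str]:
    """Split a comma-separated completion into new, de-duplicated terms."""
    seen = {s.lower() for s in seeds}
    new_terms = []
    for raw in completion.split(","):
        term = raw.strip().lower()
        # Skip empty fragments and anything already known.
        if term and term not in seen:
            seen.add(term)
            new_terms.append(term)
    return new_terms


seeds = ["banana", "apple", "orange"]
prompt = build_expansion_prompt("fruits", seeds)

# Made-up completion standing in for real model output.
fake_completion = " pear, Mango, apple, kiwi"
expanded = parse_expansion(fake_completion, seeds)
print(prompt)
print(expanded)
```

The expanded terms would still need to be checked against the vector model or matched as plain text; that step is left out here.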

Avoid pipeline crashing if a word is not found in the embeddings table

Hi @davidberenstein1957 !

I'm testing this with Italian medical concepts and as some are pretty specific I ran into the following issue:

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/concise_concepts/conceptualizer/__init__.py in expand_concepts(self)
90 remaining_values = [self.data[rem_key] for rem_key in remaining_keys]
91 remaining_values = list(itertools.chain.from_iterable(remaining_values))
---> 92 similar = self.kv.most_similar(
93 positive=self.data[key] + [key],
94
KeyError: "Key 'Ultrasuonoterapia' not present"

After running:

data = {
    "PRESTAZIONE": list(df.PRESTAZIONE.unique()) # this is a list of Italian terms
}
nlp = spacy.load("it_core_news_lg", disable=["ner"])
nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True})

I'd likely need to clean my list before setting up the cc pipeline, but what I would expect is:

A warning telling me my term was not found in the embeddings, while maybe still letting me build the pipeline. This could perhaps be controlled with a parameter along the lines of "skip not found terms".
Or a more descriptive error, telling me to remove the term in question from my list.
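Until something like this is supported, the list can be cleaned before the pipeline is built. The sketch below is one way to do that, assuming the vocabulary of the embedding model is available as an iterable of strings. `split_known_terms` is a hypothetical helper, and the toy vocabulary stands in for the keys of a real embedding table.

```python
from typing import Dict, Iterable, List, Tuple


def split_known_terms(
    data: Dict[str, List[str]], vocabulary: Iterable[str]
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split each concept's terms into those present in the vocabulary and those missing."""
    vocab = set(vocabulary)
    known: Dict[str, List[str]] = {}
    missing: Dict[str, List[str]] = {}
    for key, terms in data.items():
        known_terms = [t for t in terms if t in vocab]
        missing_terms = [t for t in terms if t not in vocab]
        if known_terms:
            known[key] = known_terms
        if missing_terms:
            missing[key] = missing_terms
    return known, missing


# Toy vocabulary; in practice this would come from the embedding model.
toy_vocabulary = ["radiografia", "ecografia", "visita"]
data = {"PRESTAZIONE": ["radiografia", "Ultrasuonoterapia", "ecografia"]}

known, missing = split_known_terms(data, toy_vocabulary)
for key, terms in missing.items():
    print(f"skipping {terms} for key {key}: not in vocabulary")
```

Only `known` would then be passed as `data` in the pipeline config. A key whose terms are all missing is dropped entirely, so it is worth inspecting `missing` before continuing.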

Custom models showing different confidences even 0 in case of mixed casing text

The code used is given below. Even my own trained custom model shows the same behaviour.

import os

import spacy
from spacy import displacy

import concise_concepts

model_path = "glove-wiki-gigaword-300"

os.environ["KMP_DUPLICATE_LIB_OK"] = "True"
data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato"],
    "meat": ["beef", "pork", "fish", "lamb"],
}

# mixed-case input: the same words appear in lower, title and upper case
text = """
Heat the oil in a large pan and add the ONION, Celery and carrots.
onion is must.CELERY is optional
Then, cook over a medium–low heat for 10 minutes, or until softened.
Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
Garlic paste can also be used instead of Raw GARLIC
Later, add some oranges and chickens.
Vegetarain may add cottage cheese instead of CHICKENS"""

nlp = spacy.load("en_core_web_lg", disable=["ner"])

nlp.add_pipe(
    "concise_concepts",
    config={
        "data": data,
        "model_path": model_path,
        "ent_score": True,  # Entity Scoring section
        "verbose": True,
        "exclude_pos": ["VERB", "AUX"],
        "exclude_dep": ["DOBJ", "PCOMP"],
        "include_compound_words": False,
        "json_path": "./fruitful_patterns.json",
    },
)
doc = nlp(text)

options = {
    "colors": {"fruit": "darkorange", "vegetable": "limegreen", "meat": "salmon"},
    "ents": ["fruit", "vegetable", "meat"],
}

# append the confidence score to each entity label
ents = doc.ents
for ent in ents:
    new_label = f"{ent.label_} ({ent._.ent_score:.0%})"
    options["colors"][new_label] = options["colors"].get(ent.label_.lower(), None)
    options["ents"].append(new_label)
    ent.label_ = new_label
doc.ents = ents

error: missing ), unterminated subpattern at position x

When I run the example code snippet with my custom data, I run into the following error while adding the component to the spaCy pipeline:
error: missing ), unterminated subpattern at

To reproduce:

few_shot = {
    "soccer": ["ronaldo", "messi"]
}
nlp = spacy.load("en_core_web_lg", disable=["ner"])
nlp.add_pipe("concise_concepts", config={"data": few_shot})
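The issue does not state the cause. One plausible explanation, offered here as an assumption, is that an expanded term containing a regex metacharacter such as `(` ends up interpolated into a pattern like `r"(?i)" + word` without escaping. The sketch below reproduces that class of error with the standard `re` module and shows that `re.escape` avoids it. The term `ronaldo_(footballer` and the helper `build_case_insensitive_regex` are hypothetical.

```python
import re


def build_case_insensitive_regex(word: str, escape: bool = True) -> str:
    """Return a case-insensitive regex source that matches `word` literally when escaped."""
    body = re.escape(word) if escape else word
    return r"(?i)" + body


# Hypothetical expanded term with an unbalanced parenthesis.
term = "ronaldo_(footballer"

try:
    re.compile(build_case_insensitive_regex(term, escape=False))
    unescaped_error = None
except re.error as exc:
    unescaped_error = str(exc)

# With escaping, the parenthesis is treated as a literal character.
escaped = re.compile(build_case_insensitive_regex(term))
print(unescaped_error)
print(bool(escaped.fullmatch("Ronaldo_(Footballer")))
```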

Still unable to pass in a custom Gensim model

I raised an issue earlier regarding the same problem, and @davidberenstein1957 committed a fix and posted this code block as a solution:

import spacy
from spacy import displacy

import concise_concepts

data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato", "garlic", "onion", "beans"],
    "meat": ["beef", "pork", "fish", "lamb", "bacon", "ham", "meatball"],
    "dairy": ["milk", "butter", "eggs", "cheese", "cheddar", "yoghurt", "egg"],
    "herbs": ["rosemary", "salt", "sage", "basil", "cilantro"],
    "carbs": ["bread", "rice", "toast", "tortilla", "noodles", "bagel", "croissant"],
}

text = """
Heat the oil in a large pan and add the Onion, celery and carrots.
Then, cook over a medium–low heat for 10 minutes, or until softened.
Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
Later, add some oranges and chickens. """

# path to the custom gensim model
model_path = "word2vec.model"

nlp = spacy.load("en_core_web_md", disable=["ner"])
nlp.add_pipe(
    "concise_concepts",
    config={
        "data": data,
        "model_path": model_path,
        "ent_score": True,
    },
)
doc = nlp(text)

options = {
    "colors": {
        "fruit": "darkorange",
        "vegetable": "limegreen",
        "meat": "salmon",
        "dairy": "lightblue",
        "herbs": "darkgreen",
        "carbs": "lightbrown",
    },
    "ents": ["fruit", "vegetable", "meat", "dairy", "herbs", "carbs"],
}

# append the confidence score to each entity label
ents = doc.ents
for ent in ents:
    new_label = f"{ent.label_} ({float(ent._.ent_score):.0%})"
    options["colors"][new_label] = options["colors"].get(ent.label_.lower(), None)
    options["ents"].append(new_label)
    ent.label_ = new_label
doc.ents = ents

displacy.render(doc, style="ent", options=options)

However, I am still getting the 'Word2Vec object is not iterable' error.

Could you please look into it?

Key not present in word2vec model

key fruit not present in word2vec model
key vegetable not present in word2vec model
key meat not present in word2vec model

This is the error I get when I try to run the example.

json array too large

We are getting a 500MB array in the JSON output of concise-concepts. Is it possible to exclude these results from the returned JSON?

allow verbs

Looks like patterns currently prevent verbs?
individual_pattern = {
    "lemma": {"regex": r"(?i)" + word},
    "POS": {"NOT_IN": ["VERB"]},
    "DEP": {"NOT_IN": ["nsubjpass"]},
}
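A minimal sketch of how the hard-coded exclusions in the pattern quoted above could be made optional, so that verbs are allowed when no part-of-speech filter is requested. `build_token_pattern` is a hypothetical helper rather than the library's implementation. The `exclude_pos` and `exclude_dep` names are borrowed from the config keys that appear in other issues here, and the keys use the uppercase `LEMMA` / `REGEX` spelling from the spaCy matcher documentation.

```python
from typing import Any, Dict, Optional, Sequence


def build_token_pattern(
    word: str,
    exclude_pos: Optional[Sequence[str]] = ("VERB",),
    exclude_dep: Optional[Sequence[str]] = ("nsubjpass",),
) -> Dict[str, Any]:
    """Build a single-token matcher pattern; pass None or () to drop a filter."""
    pattern: Dict[str, Any] = {"LEMMA": {"REGEX": r"(?i)" + word}}
    if exclude_pos:
        pattern["POS"] = {"NOT_IN": list(exclude_pos)}
    if exclude_dep:
        pattern["DEP"] = {"NOT_IN": list(exclude_dep)}
    return pattern


# Default: same restrictions as the pattern quoted in the issue.
restricted = build_token_pattern("cook")

# Verbs allowed: no POS filter at all.
with_verbs = build_token_pattern("cook", exclude_pos=None)
print(restricted)
print(with_verbs)
```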

add spaczz fuzzymatcher option to concise-concepts

https://github.com/gandersen101/spaczz

OSError while adding concise_concepts to spaCy nlp pipeline

Hi Team,
I am getting the issue below while adding concise_concepts to the spaCy nlp pipeline in the latest version

nlp.add_pipe("concise_concepts", config={"data" : prints_data})

2022-10-11 16:25:18.181 IST extract_expertreports_prints du74r1dm4mrf ERROR:root:Traceback (most recent call last):
  File "/workspace/main.py", line 12, in wrapper
    response = wrapped_func(*args, **kwargs)
  File "/workspace/main.py", line 43, in extract_expertreports_prints
    output_result = extract_prints(year)
  File "/workspace/extractprints_frompdf.py", line 65, in extract_prints
    nlp.add_pipe("concise_concepts", config={"data" : prints_data})
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/spacy/language.py", line 792, in add_pipe
    pipe_component = self.create_pipe(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/spacy/language.py", line 674, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/thinc/config.py", line 746, in resolve
    resolved, _ = cls._make(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/thinc/config.py", line 795, in _make
    filled, _, resolved = cls._fill(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/thinc/config.py", line 867, in _fill
    getter_result = getter(*args, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/concise_concepts/__init__.py", line 51, in make_concise_concepts
    return Conceptualizer(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py", line 101, in __init__
    self.run()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py", line 117, in run
    self.create_conceptual_patterns()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py", line 409, in create_conceptual_patterns
    with open(self.json_path, "w") as f:
OSError: [Errno 30] Read-only file system: './matching_patterns.json'
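The traceback shows the component writing its patterns file to `self.json_path`, which resolves to the current directory here. On a read-only deployment, a possible workaround is to point that path at a writable location such as the system temp directory. The sketch below only builds and checks such a path. Passing it through a `json_path` config key is an assumption based on the configs shown in other issues on this page, and `writable_patterns_path` is a hypothetical helper.

```python
import json
import os
import tempfile


def writable_patterns_path(filename: str = "matching_patterns.json") -> str:
    """Return a path for the patterns file inside the system temp directory."""
    return os.path.join(tempfile.gettempdir(), filename)


json_path = writable_patterns_path()

# Quick check that the location is writable before handing it to the pipeline.
with open(json_path, "w") as f:
    json.dump([], f)

# Assumed usage: the config would then be passed to nlp.add_pipe("concise_concepts", ...).
config = {"data": {"fruit": ["apple", "pear"]}, "json_path": json_path}
print(config["json_path"])
```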

Code example did not work

I tried to give your library a test run and installed it via 'pip install classy-classification' and via git, but the code example did not work. I haven't taken the time to troubleshoot. I guess something has been renamed ('concise_concepts' vs 'classy_classification')?

Model Sensitivity

I noticed that the concise-concepts model isn't very sensitive. For instance, in the text block in the tutorial:

We are only picking up 1 or 2 of the fruits and vegetables, not all of them. Is there any way to adjust this?

Handling of Multiple Words

It seems that concise-concepts, while great, isn't able to assign a single entity to multiple words. For instance, it might pick up:
Mashed -> ENT
Potato -> ENT

But not "Mashed Potato" -> ENT

Is there any way that we can solve for this?

Unable to pass in custom gensim word2vec model

Getting this error, TypeError: argument of type 'Word2Vec' is not iterable, when I try to pass in a custom gensim model
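This message is what Python raises when `in` is applied to an object that defines neither `__contains__` nor iteration. In gensim 4 the word vectors of a `Word2Vec` model live on its `.wv` attribute (a `KeyedVectors` object), which does support `in`. Whether that is the exact cause in this issue is an assumption. The sketch below illustrates the mechanism with plain stand-in classes instead of gensim, so `ToyWord2Vec` and `ToyKeyedVectors` are hypothetical.

```python
class ToyKeyedVectors:
    """Stand-in for a word-vector table that supports `word in table`."""

    def __init__(self, vectors):
        self._vectors = dict(vectors)

    def __contains__(self, word):
        return word in self._vectors


class ToyWord2Vec:
    """Stand-in for a full model object: keeps the table in `.wv`, no `in` support."""

    def __init__(self, vectors):
        self.wv = ToyKeyedVectors(vectors)


model = ToyWord2Vec({"apple": [0.1, 0.2], "pear": [0.3, 0.4]})

try:
    "apple" in model  # the wrapper has no __contains__, __iter__ or __getitem__
    wrapper_error = None
except TypeError as exc:
    wrapper_error = exc

# Membership tests work on the vector table itself.
print(type(wrapper_error).__name__)
print("apple" in model.wv)
```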

Lemmatization need for LEMMA patterns

I believe a call to the lemmatization function is missing before the conceptual patterns are built. I saw that the function already exists, but it is never called.

In my local version I added it in run():
....
self.infer_original_data()
self.lemmatize_concepts()
....

This was obvious with German text and the de_ spaCy models.

matching_patterns.json

Is it possible to change the "matching_patterns.json" name? These pattern files can be instrumental in combination with Prodigy, but I might want to generate a few upfront.

It might also be a good idea to document this file more explicitly in the README; I only found out about it by accident when I was looking at the folder that was running my code.

Loading transformer based models and handling phrases

Two questions:

I am trying to load models that are custom-built or transformer-based, namely:

nlp = spacy.load("en_core_web_trf")  #or spacy.load("my_custom_model") 
nlp.add_pipe("concise_concepts", config={"data": data})
doc = nlp(text)

Running this, I get:

No lemmatizer found in spacy pipeline. Consider adding it for matching on LEMMA instead of exact text. 
AssertionError: Choose a spaCy model with internal embeddings, e.g. md or lg.

Does this mean that concise-concepts does not handle transformer-based or custom spaCy models?

When I change the sample data to entries of two or more words, such as:

data = {
    "change": ["change my card detail", "change my address"],
    "open": ["open an account", "new account opened"],
    "close": ["close account", "terminate account", "account closure"]
}

I get:
word ´change my card detail´ from key ´change´ not present in vector model

Does this mean the data can only contain lists of single words?

Question: How to use (external) transformer-based embeddings?

Hi,

Your idea of "concise concepts" sounds really intriguing! However, I would like to use transformer-based embeddings. As far as I can see from the source code, you rely on (word, vector) tuples in a large list, as in GloVe or Word2Vec models, right?

So, how could one implement this using HuggingFace models, maybe via spacy-transformers' tok2vec interface? Should I use the texts to be tagged to pretrain (i.e. fine-tune) a HF transformer model and then create this list by tokenizing all words from the texts (maybe after removing filler words or the like)? Afterwards I'd have the same setting as with the current models, I guess.

Or maybe I am completely off the right track :-)
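One way to read the proposal above is to collapse contextual vectors into a static (word, vector) table by averaging every occurrence of a word across the texts. The sketch below shows only that averaging step. `build_static_table` is a hypothetical helper, and `toy_embed` is a made-up stand-in for a transformer that returns one vector per token; neither is part of concise-concepts or spacy-transformers.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def build_static_table(
    sentences: Sequence[Sequence[str]],
    embed: Callable[[Sequence[str]], List[Vector]],
) -> Dict[str, Vector]:
    """Average each word's contextual vectors into one static vector per word."""
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = defaultdict(int)
    for tokens in sentences:
        vectors = embed(tokens)
        for token, vector in zip(tokens, vectors):
            word = token.lower()
            if word not in sums:
                sums[word] = [0.0] * len(vector)
            sums[word] = [s + v for s, v in zip(sums[word], vector)]
            counts[word] += 1
    return {w: [s / counts[w] for s in total] for w, total in sums.items()}


def toy_embed(tokens: Sequence[str]) -> List[Vector]:
    """Made-up 'contextual' embedding: the vector depends on the token's position."""
    return [[float(len(tok)), float(i)] for i, tok in enumerate(tokens)]


table = build_static_table([["add", "garlic"], ["garlic", "paste"]], toy_embed)
print(table["garlic"])
```

The resulting table has the same shape as a GloVe or Word2Vec vocabulary, so in principle it could be exported to a format that a vector-based pipeline can load. How well averaged contextual vectors work for similarity expansion is an open question that this sketch does not address.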