
concise-concepts's Introduction

Hi there 👋

From failing to study medicine ➡️ BSc industrial engineer ➡️ MSc computer scientist.
Life can be strange, so better enjoy it.
I'm sure I do by: 👨🏽‍🍳 Cooking, 👨🏽‍💻 Coding, 🏆 Committing.

Conference slides 📖

employers 👨🏽‍💻

  • Argilla (2022–current) - data annotation and monitoring for enterprise NLP
  • Pandora Intelligence (2020–2022) - an independent intelligence company specialized in security risks

open source ⭐️

maintainer 🤓

contributions 🫱🏾‍🫲🏼

volunteering 🌍

  • Bonfari - small to medium scale sustainable projects in Gambia 🇬🇲
  • 510 red-cross - occasional projects to improve humanitarian aid with data

Contacts

Gmail LinkedIn Twitter

concise-concepts's People

Contributors

davidberenstein1957, davidfrompandora, koaning, rdeheer2, tomaarsen


concise-concepts's Issues

duplicate logging regarding missing entries in embedding model

2022-10-07 10:21:41.412 | WARNING  | concise_concepts.conceptualizer.Conceptualizer:verify_data:188 - word ´persianas eléctricas´ from key ´CARACTERISTICAS´ not present in vector model
2022-10-07 10:21:41.412 | WARNING  | concise_concepts.conceptualizer.Conceptualizer:verify_data:188 - word ´puente térmico´ from key ´CARACTERISTICAS´ not present in vector model
2022-10-07 10:21:41.412 | WARNING  | concise_concepts.conceptualizer.Conceptualizer:verify_data:188 - word ´por split´ from key ´CARACTERISTICAS´ not present in vector model
2022-10-07 10:21:41.413 | WARNING  | concise_concepts.conceptualizer.Conceptualizer:verify_data:188 - word ´tarima vinilica´ from key ´CARACTERISTICAS´ not present in vector model
2022-10-07 10:21:41.413 | WARNING  | concise_concepts.conceptualizer.Conceptualizer:verify_data:188 - word ´persianas electricas´ from key ´CARACTERISTICAS´ not present in vector model
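A minimal user-side sketch to silence the repeats until this is addressed in the library, assuming concise-concepts logs through loguru (which the timestamp | LEVEL | module format suggests): install a filter that drops messages already seen once.

import sys
from loguru import logger

_seen_messages = set()

def _drop_duplicate_messages(record):
    # Suppress any log record whose message text has already been emitted once.
    message = record["message"]
    if message in _seen_messages:
        return False
    _seen_messages.add(message)
    return True

logger.remove()                                          # drop the default handler
logger.add(sys.stderr, filter=_drop_duplicate_messages)  # re-add it with the dedup filter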

Unable to load local custom gensim model

I am trying to use concise-concepts with a custom word2vec model I trained using gensim. Here is a link to my notebook: https://github.com/EarthNLP/ClimateScholar/blob/users/hevia/concise-concepts-ner/research/Concise_Concepts.ipynb

The error & code in question:
[screenshot of the error and code]

Looking at the code: https://github.com/Pandora-Intelligence/concise-concepts/blob/main/concise_concepts/conceptualizer/Conceptualizer.py#L147

I tried to load my model the same way it is written here, and it works just fine:
[screenshot: loading the model directly with gensim works fine]

The documentation doesn't mention any specific restrictions on loading models (unless you can't load local models?): https://github.com/Pandora-Intelligence/concise-concepts#use-gensimword2vec-model-from-pre-trained-gensim-or-custom-model-path, but I did notice that the example used a blank spaCy pipeline: https://github.com/Pandora-Intelligence/concise-concepts/blob/main/concise_concepts/examples/example_gensim_custom.py which I also tried, and it did not work either.

I am also running gensim 4.X+

Hopefully it's a minor issue on my end, so you don't have to do any work 😅. The library is awesome and I would love to try it out with some of our custom word embeddings.

Thanks for your time!

Loading a local NER model but has no embeddings

Hi! I have a locally trained model which only has NER in its pipeline, and as soon as I try to add the concise-concepts data it returns

Exception: Choose a model with internal embeddings i.e. md or lg.

How can I train my model to have the necessary embeddings to work with concise-concepts?
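A rough sketch of one thing to try, not a documented recipe: copy the static vectors of a packaged md/lg model into the custom pipeline's vocab before adding the component. Whether concise-concepts accepts vectors attached this way is an assumption worth testing; the more thorough route is to retrain the pipeline with vectors configured (for example via spacy init vectors).

import spacy
import concise_concepts

data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato"],
}

nlp = spacy.load("my_local_ner_model")            # hypothetical name of the NER-only pipeline
vectors_source = spacy.load("en_core_web_lg")
nlp.vocab.vectors = vectors_source.vocab.vectors  # attach the lg model's static vectors table

nlp.add_pipe("concise_concepts", config={"data": data})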

Example fail while using GPUs

I was using en_core_web_trf with GPU enabled. I changed it to en_core_web_lg since en_core_web_trf is not supported. However, this still gave me the following error with GPU enabled. It took me a while to figure out what was wrong.

Fail while using GPU

import spacy
import concise_concepts

from spacy import displacy


spacy.require_gpu(0) # this triggered the numpy error

data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato"],
    "meat": ["beef", "pork", "fish", "lamb"]
}

text = """
    Heat the oil in a large pan and add the Onion, celery and carrots.
    Then, cook over a medium–low heat for 10 minutes, or until softened.
    Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
    Later, add some oranges and chickens."""

# use any model that has internal spacy embeddings
nlp = spacy.load('en_core_web_lg')
nlp.add_pipe("concise_concepts",
    config={"data": data}
)
doc = nlp(text)

options = {"colors": {"fruit": "darkorange", "vegetable": "limegreen", "meat": "salmon"},
           "ents": ["fruit", "vegetable", "meat"]}

displacy.render(doc, style="ent", options=options)

Error message

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

Latest package 0.6.2 failing via pip. Error in Conceptualizer.py. Results deterioration

Hi @davidberenstein1957, the latest version 0.6.2 is not installing via pip; it still installs 0.6.1.
I tried manually applying the latest changes to the Conceptualizer.py file.
Even then, testing gives the following error.
~\anaconda3\lib\site-packages\concise_concepts\conceptualizer\Conceptualizer.py in run(self)
     97         self.resolve_overlapping_concepts()
     98         self.infer_original_data()
---> 99         self.create_conceptual_patterns()
    100
    101         if not self.ent_score:

~\anaconda3\lib\site-packages\concise_concepts\conceptualizer\Conceptualizer.py in create_conceptual_patterns(self)
    359                     }
    360                 )
--> 361
    362         add_patterns(self.data)
    363         add_patterns(self.original_data)

~\anaconda3\lib\site-packages\concise_concepts\conceptualizer\Conceptualizer.py in add_patterns(input_dict)
    327             if self.case_sensitive:
    328                 specific_copy[self.match_key] = "{op}".join(word_parts)
--> 329             else:
    330                 specific_copy[self.match_key] = {
    331                     "regex": r"(?i)"

NameError: name 'specific_copy' is not defined.

My guess is that it is caused by the lines above. I tried solving it on my end but wasn't successful.
All the previous problems regarding custom word2vec models remain as they are.
Kindly look into this ASAP.
[screenshot of the error]

multi token patterns

I might be working on a tutorial on this project, so I figured I'd double-check explicitly: are multi-token phrases supported? My impression is that they're not, and that's totally fine, but I just wanted to make sure.

This example:

import spacy
from spacy import displacy
import concise_concepts

data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato"],
    "meat": ["beef", "pork", "fish", "lamb"],
    "utensil": ["large oven", "warm stove", "big knife"]
}

text = """
    Heat the oil in a large pan and add the Onion, celery and carrots.
    Then, cook over a medium–low heat for 10 minutes, or until softened.
    Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
    Later, add some oranges and chickens. """

nlp = spacy.load("en_core_web_lg", disable=["ner"])

# ent_score for entity confidence scoring
nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True})
doc = nlp(text)

options = {"colors": {"fruit": "darkorange", "vegetable": "limegreen", "meat": "salmon", "utensil": "gray"},
           "ents": ["fruit", "vegetable", "meat", "utensil"]}

ents = doc.ents
for ent in ents:
    new_label = f"{ent.label_} ({float(ent._.ent_score):.0%})"
    options["colors"][new_label] = options["colors"].get(ent.label_.lower(), None)
    options["ents"].append(new_label)
    ent.label_ = new_label
doc.ents = ents

displacy.render(doc, style="ent", options=options)

Yields this error:

word ´large oven´ from key ´utensil´ not present in vector model
word ´warm stove´ from key ´utensil´ not present in vector model
word ´big knife´ from key ´utensil´ not present in vector model
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [4], in <cell line: 21>()
     18 nlp = spacy.load("en_core_web_lg", disable=["ner"])
     20 # ent_score for entity confidence scoring
---> 21 nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True})
     22 doc = nlp(text)
     24 options = {"colors": {"fruit": "darkorange", "vegetable": "limegreen", "meat": "salmon", "utensil": "gray"},
     25            "ents": ["fruit", "vegetable", "meat", "utensil"]}

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/language.py:795, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    787     if not self.has_factory(factory_name):
    788         err = Errors.E002.format(
    789             name=factory_name,
    790             opts=", ".join(self.factory_names),
   (...)
    793             lang_code=self.lang,
    794         )
--> 795     pipe_component = self.create_pipe(
    796         factory_name,
    797         name=name,
    798         config=config,
    799         raw_config=raw_config,
    800         validate=validate,
    801     )
    802 pipe_index = self._get_pipe_index(before, after, first, last)
    803 self._pipe_meta[name] = self.get_factory_meta(factory_name)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/language.py:674, in Language.create_pipe(self, factory_name, name, config, raw_config, validate)
    671 cfg = {factory_name: config}
    672 # We're calling the internal _fill here to avoid constructing the
    673 # registered functions twice
--> 674 resolved = registry.resolve(cfg, validate=validate)
    675 filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"]
    676 filled = Config(filled)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:747, in registry.resolve(cls, config, schema, overrides, validate)
    738 @classmethod
    739 def resolve(
    740     cls,
   (...)
    745     validate: bool = True,
    746 ) -> Dict[str, Any]:
--> 747     resolved, _ = cls._make(
    748         config, schema=schema, overrides=overrides, validate=validate, resolve=True
    749     )
    750     return resolved

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:796, in registry._make(cls, config, schema, overrides, resolve, validate)
    794 if not is_interpolated:
    795     config = Config(orig_config).interpolate()
--> 796 filled, _, resolved = cls._fill(
    797     config, schema, validate=validate, overrides=overrides, resolve=resolve
    798 )
    799 filled = Config(filled, section_order=section_order)
    800 # Check that overrides didn't include invalid properties not in config

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:868, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    865     getter = cls.get(reg_name, func_name)
    866     # We don't want to try/except this and raise our own error
    867     # here, because we want the traceback if the function fails.
--> 868     getter_result = getter(*args, **kwargs)
    869 else:
    870     # We're not resolving and calling the function, so replace
    871     # the getter_result with a Promise class
    872     getter_result = Promise(
    873         registry=reg_name, name=func_name, args=args, kwargs=kwargs
    874     )

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/__init__.py:47, in make_concise_concepts(nlp, name, data, topn, model_path, word_delimiter, ent_score, exclude_pos, exclude_dep, include_compound_words, case_sensitive)
      9 @Language.factory(
     10     "concise_concepts",
     11     default_config={
   (...)
     45     case_sensitive: bool,
     46 ):
---> 47     return Conceptualizer(
     48         nlp=nlp,
     49         name=name,
     50         data=data,
     51         topn=topn,
     52         model_path=model_path,
     53         word_delimiter=word_delimiter,
     54         ent_score=ent_score,
     55         exclude_pos=exclude_pos,
     56         exclude_dep=exclude_dep,
     57         include_compound_words=include_compound_words,
     58         case_sensitive=case_sensitive,
     59     )

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py:95, in Conceptualizer.__init__(self, nlp, name, data, topn, model_path, word_delimiter, ent_score, exclude_pos, exclude_dep, include_compound_words, case_sensitive)
     93 else:
     94     self.match_key = "LEMMA"
---> 95 self.run()
     96 self.data_upper = {k.upper(): v for k, v in data.items()}

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py:101, in Conceptualizer.run(self)
     99 self.determine_topn()
    100 self.set_gensim_model()
--> 101 self.verify_data()
    102 self.expand_concepts()
    103 self.verify_data(verbose=False)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py:193, in Conceptualizer.verify_data(self, verbose)
    188                 logger.warning(
    189                     f"word ´{word}´ from key ´{key}´ not present in vector"
    190                     " model"
    191                 )
    192     verified_data[key] = verified_values
--> 193     assert len(
    194         verified_values
    195     ), f"None of the entries for key {key} are present in the vector model"
    196 self.data = deepcopy(verified_data)
    197 self.original_data = deepcopy(self.data)

AssertionError: None of the entries for key utensil are present in the vector model

Avoid pipeline crashing if a word is not found in the embeddings table

Hi @davidberenstein1957 !

I'm testing this with Italian medical concepts and as some are pretty specific I ran into the following issue:

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/concise_concepts/conceptualizer/__init__.py in expand_concepts(self)
     90         remaining_values = [self.data[rem_key] for rem_key in remaining_keys]
     91         remaining_values = list(itertools.chain.from_iterable(remaining_values))
---> 92         similar = self.kv.most_similar(
     93             positive=self.data[key] + [key],
     94
KeyError: "Key 'Ultrasuonoterapia' not present"

After running:

data = {
    "PRESTAZIONE": list(df.PRESTAZIONE.unique()) # this is a list of Italian terms
}
nlp = spacy.load("it_core_news_lg", disable=["ner"])
nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True})

I'd likely need to clean my list before setting up the cc pipeline (a rough pre-filtering sketch follows below this list), but what I would expect is:

  • A warning telling me my term was not found in the embeddings, and maybe still letting me build the pipeline. Maybe this could be set with a param like "skip not found terms".
  • Or a more descriptive error, telling me to remove the term in question from my list.
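Until something like that lands, a rough user-side pre-filter, assuming the component uses the loaded spaCy model's own vectors (i.e. no model_path is set): drop the terms the model has no vector for before building the pipe, and print what was dropped.

import spacy
import concise_concepts

nlp = spacy.load("it_core_news_lg", disable=["ner"])

raw_data = {
    "PRESTAZIONE": list(df.PRESTAZIONE.unique()),  # df is the DataFrame from the snippet above
}

data = {}
for key, terms in raw_data.items():
    kept = [t for t in terms if nlp.vocab.has_vector(t)]
    dropped = [t for t in terms if not nlp.vocab.has_vector(t)]
    if dropped:
        print(f"{key}: dropping {len(dropped)} terms without a vector, e.g. {dropped[:5]}")
    data[key] = kept

nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True})

How concise-concepts builds its internal keyed vectors (casing, tokenization) may differ slightly from vocab.has_vector, so treat this as a first-pass filter rather than a guarantee.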

Custom models showing different confidences (even 0) for mixed-casing text

[screenshot showing the differing entity scores]

The code used is given below. Even my own custom-trained model shows the same behaviour.

import os

import spacy
from spacy import displacy

import concise_concepts

model_path = "glove-wiki-gigaword-300"

os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato"],
    "meat": ["beef", "pork", "fish", "lamb"],
}

text = """
Heat the oil in a large pan and add the ONION, Celery and carrots.
onion is must.CELERY is optional
Then, cook over a medium–low heat for 10 minutes, or until softened.
Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
Garlic paste can also be used instead of Raw GARLIC
Later, add some oranges and chickens.
Vegetarians may add cottage cheese instead of CHICKENS"""

nlp = spacy.load("en_core_web_lg", disable=["ner"])

nlp.add_pipe(
    "concise_concepts",
    config={
        "data": data,
        "model_path": model_path,
        "ent_score": True,  # Entity Scoring section
        "verbose": True,
        "exclude_pos": ["VERB", "AUX"],
        "exclude_dep": ["DOBJ", "PCOMP"],
        "include_compound_words": False,
        "json_path": "./fruitful_patterns.json",
    },
)
doc = nlp(text)

options = {
    "colors": {"fruit": "darkorange", "vegetable": "limegreen", "meat": "salmon"},
    "ents": ["fruit", "vegetable", "meat"],
}

ents = doc.ents
for ent in ents:
    new_label = f"{ent.label_} ({float(ent._.ent_score):.0%})"
    options["colors"][new_label] = options["colors"].get(ent.label_.lower(), None)
    options["ents"].append(new_label)
    ent.label_ = new_label
doc.ents = ents

error: missing ), unterminated subpattern at position x

When I run the example code snippet with my custom data, I ran into the following error during the step of adding the pipe to the spaCy pipeline:
error: missing ), unterminated subpattern at

To reproduce:

few_shot = {
    "soccer": ["ronaldo", "messi"]
}
nlp = spacy.load("en_core_web_lg", disable=["ner"])
nlp.add_pipe("concise_concepts", config={"data": few_shot})

Still unable to pass in a custom Gensim model

I raised an issue earlier regarding the same problem, and @davidberenstein1957 committed a fix and posted this code block as the solution:

import spacy
from spacy import displacy

import concise_concepts

data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato", "garlic", "onion", "beans"],
    "meat": ["beef", "pork", "fish", "lamb", "bacon", "ham", "meatball"],
    "dairy": ["milk", "butter", "eggs", "cheese", "cheddar", "yoghurt", "egg"],
    "herbs": ["rosemary", "salt", "sage", "basil", "cilantro"],
    "carbs": ["bread", "rice", "toast", "tortilla", "noodles", "bagel", "croissant"],
}

text = """
Heat the oil in a large pan and add the Onion, celery and carrots.
Then, cook over a medium–low heat for 10 minutes, or until softened.
Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
Later, add some oranges and chickens. """

model_path = "word2vec.model"

nlp = spacy.load("en_core_web_md", disable=["ner"])
nlp.add_pipe(
    "concise_concepts",
    config={
        "data": data,
        "model_path": model_path,
        "ent_score": True,
    },
)
doc = nlp(text)

options = {
    "colors": {
        "fruit": "darkorange",
        "vegetable": "limegreen",
        "meat": "salmon",
        "dairy": "lightblue",
        "herbs": "darkgreen",
        "carbs": "lightbrown",
    },
    "ents": ["fruit", "vegetable", "meat", "dairy", "herbs", "carbs"],
}

ents = doc.ents
for ent in ents:
    new_label = f"{ent.label_} ({float(ent._.ent_score):.0%})"
    options["colors"][new_label] = options["colors"].get(ent.label_.lower(), None)
    options["ents"].append(new_label)
    ent.label_ = new_label
doc.ents = ents

displacy.render(doc, style="ent", options=options)

However, I am still getting the 'Word2Vec object is not iterable' error.

Could you please look into it?
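A hedged guess about the 'Word2Vec object is not iterable' part: if word2vec.model was written with Word2Vec.save(), the file deserializes to a full training object rather than the plain KeyedVectors the component appears to expect. Re-saving just the vectors and pointing model_path at that file may be worth a try.

from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")  # assumption: the file holds a full Word2Vec model
model.wv.save("word2vec.kv")             # keep only the trained KeyedVectors

# then pass the new file in the config, e.g.:
# nlp.add_pipe("concise_concepts", config={"data": data, "model_path": "word2vec.kv", "ent_score": True})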

Key not present in word2vec model

key fruit not present in word2vec model
key vegetable not present in word2vec model
key meat not present in word2vec model

This is the error I am getting when I try to run the example.
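A quick sanity check, assuming the pre-trained gensim route: the expansion step appears to use the class name itself as an extra positive term (positive=self.data[key] + [key] in the traceback quoted earlier), so the keys "fruit", "vegetable" and "meat" also have to exist as tokens in the vector model you configured.

import gensim.downloader as api

# Assumption: substitute the model you actually pass as model_path.
kv = api.load("glove-wiki-gigaword-300")

for key in ["fruit", "vegetable", "meat"]:
    print(key, "in vocabulary:", key in kv.key_to_index)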

json array too large

We are getting a 500 MB array as the JSON output for concise concepts. Is it possible to exclude these results from being returned?

allow verbs

Looks like patterns currently prevent verbs?
individual_pattern = {
    "lemma": {"regex": r"(?i)" + word},
    "POS": {"NOT_IN": ["VERB"]},
    "DEP": {"NOT_IN": ["nsubjpass"]},
}

OSError on while adding concise_concepts to spacy nlp pipeline

Hi Team,
I am getting the issue below while adding concise_concepts to the spaCy nlp pipeline in the latest version.

nlp.add_pipe("concise_concepts", config={"data" : prints_data})

2022-10-11 16:25:18.181 IST extract_expertreports_printsdu74r1dm4mrf ERROR:root:Traceback (most recent call last):
  File "/workspace/main.py", line 12, in wrapper
    response = wrapped_func(*args, **kwargs)
  File "/workspace/main.py", line 43, in extract_expertreports_prints
    output_result = extract_prints(year)
  File "/workspace/extractprints_frompdf.py", line 65, in extract_prints
    nlp.add_pipe("concise_concepts", config={"data" : prints_data})
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/spacy/language.py", line 792, in add_pipe
    pipe_component = self.create_pipe(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/spacy/language.py", line 674, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/thinc/config.py", line 746, in resolve
    resolved, _ = cls._make(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/thinc/config.py", line 795, in _make
    filled, _, resolved = cls._fill(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/thinc/config.py", line 867, in _fill
    getter_result = getter(*args, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/concise_concepts/__init__.py", line 51, in make_concise_concepts
    return Conceptualizer(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py", line 101, in __init__
    self.run()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py", line 117, in run
    self.create_conceptual_patterns()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py", line 409, in create_conceptual_patterns
    with open(self.json_path, "w") as f:
OSError: [Errno 30] Read-only file system: './matching_patterns.json'
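A hedged workaround rather than a fix: the traceback shows the component writing ./matching_patterns.json at construction time, and other snippets in these issues pass a json_path entry in the config, so pointing it at a writable location (e.g. /tmp on a read-only app filesystem) may avoid the OSError.

nlp.add_pipe(
    "concise_concepts",
    config={
        "data": prints_data,
        # Assumption: json_path controls where the pattern file is written,
        # based on the config shown in other issues here.
        "json_path": "/tmp/matching_patterns.json",
    },
)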

Code example did not work

I tried to give your library a test run and installed it via 'pip install classy-classification' and git, but the code example did not work. I haven't taken the time to troubleshoot. I guess something has been renamed, 'concise_concepts' vs 'classy_classification'?
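For what it's worth, these are separate packages: the code example imports concise_concepts, which installs with 'pip install concise-concepts', while 'pip install classy-classification' installs the different classy_classification library.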

Model Sensitivity

I noticed that the concise-concepts model isn't too sensitive. For instance, in the text block in the tutorial:

text = """
Heat the oil in a large pan and add the Onion, celery and carrots.
Then, cook over a medium–low heat for 10 minutes, or until softened.
Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
Later, add some oranges and chickens. """

We are only picking up 1 or 2 of the fruits and vegetables, not all of them. Is there any way to adjust this?
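One knob that may help, hedged because the exact format should be checked against the README: the component's factory exposes a topn setting (visible in the factory signature quoted in a traceback above), which appears to control how many similar words each class is expanded with; raising it could catch more of the fruits and vegetables.

import spacy
import concise_concepts

data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato"],
    "meat": ["beef", "pork", "fish", "lamb"],
}

nlp = spacy.load("en_core_web_lg", disable=["ner"])
# Assumption: topn takes one integer per class, in the order of the keys in data.
nlp.add_pipe("concise_concepts", config={"data": data, "topn": [100, 100, 100]})
doc = nlp("Heat the oil in a large pan and add the Onion, celery and carrots.")
print([(ent.text, ent.label_) for ent in doc.ents])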

Handling of Multiple Words

It seems like concise concepts, while great, isn't able to assign entities to "multiple words". For instance, it might pick up:
Mashed -> ENT
Potato -> ENT

But not "Mashed Potato" -> ENT

Is there any way that we can solve for this?
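A hedged thing to try: the config in other snippets here includes an include_compound_words flag, which sounds like it is meant for exactly this. Whether it makes a span like "mashed potato" come out as one entity is an assumption to verify.

import spacy
import concise_concepts

# Single-word seeds that exist in the vectors; the compound span is what we hope to match.
data = {"food": ["potato", "rice", "pasta"]}

nlp = spacy.load("en_core_web_lg", disable=["ner"])
nlp.add_pipe("concise_concepts", config={"data": data, "include_compound_words": True})
doc = nlp("Serve the mashed potato with some fried rice.")
print([(ent.text, ent.label_) for ent in doc.ents])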

Lemmatization need for LEMMA patterns

I believe you are missing a call to the lemmatization function before you build the conceptual patterns. I saw that you already have the function for it, but it is not being called.

I added this in run() (my local version):

    ...
    self.infer_original_data()
    self.lemmatize_concepts()   # added
    ...

This was obvious with German text and the de_ spaCy models.

matching_patterns.json

Is it possible to change the "matching_patterns.json" name? These pattern files can be instrumental in combination with Prodigy, but I might want to generate a few upfront.

It might also be a good idea to document this file more explicitly in the README; I only found out about it by accident when I was looking at the folder my code was running in.

Loading transformer based models and handling phrases

Two questions:

  1. I am trying to load models that are custom built / transformer based, namely:
nlp = spacy.load("en_core_web_trf")  #or spacy.load("my_custom_model") 
nlp.add_pipe("concise_concepts", config={"data": data})
doc = nlp(text)

running this I get:

No lemmatizer found in spacy pipeline. Consider adding it for matching on LEMMA instead of exact text. 
AssertionError: Choose a spaCy model with internal embeddings, e.g. md or lg.

Does this mean that concise concepts does not handle transformer-based or custom spaCy models?

  2. When I change the sample data to, say, two or more words, such as:
data = {
    "change": ["change my card detail", "change my address"],
    "open": ["open an account", "new account opened"],
    "close": ["close account", "terminate account", "account closure"]
}

I get:
word ´change my card detail´ from key ´change´ not present in vector model.

Does this mean the data can only handle lists of single words?

Question: How to use (external) transformer-based embeddings?

Hi,

Your idea of "concise concepts" sounds really intriguing! However, I would like to use transformer-based embeddings. As far as I can see from the source code, you rely on (word, vector) tuples in a large list, like for instance in GloVe or Word2Vec models, right?

So, how could one implement this using HuggingFace models, maybe via spacy-transformers' tok2vec interface? Should I use the texts to be tagged for pretraining (i.e. "fine-tuning") a HF transformer model and then create this list by tokenizing all words from the texts (maybe getting rid of filler words or the like first)? Afterwards I'd have the same setting as with the current models, I guess.

Or maybe I am completely off the right track :-)
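That reading matches what the other issues here suggest: the component consumes a gensim-style keyed-vector table, either a spaCy model's static vectors or whatever model_path points at. So one hedged route, sketched below, is to compute one fixed vector per vocabulary word with whichever transformer you like (e.g. mean-pooled token embeddings computed offline), pack them into a gensim KeyedVectors file, and pass that file as model_path. Everything here is illustrative: the vector size, the placeholder vectors and the file name are assumptions.

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical: one precomputed vector per word, e.g. mean-pooled transformer embeddings.
embeddings = {
    "ultrasound": np.random.rand(768).astype(np.float32),
    "therapy": np.random.rand(768).astype(np.float32),
}

kv = KeyedVectors(vector_size=768)
kv.add_vectors(list(embeddings.keys()), np.stack(list(embeddings.values())))
kv.save("custom_transformer_vectors.kv")

# then (hypothetical usage):
# nlp.add_pipe("concise_concepts", config={"data": data, "model_path": "custom_transformer_vectors.kv"})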
