davidberenstein1957 / classy-classification Goto Github PK

This repository contains an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface.

License: MIT License

Python 100.00%

few-shot-classifcation spacy sentence-transformers nlp nlu text-classification machine-learning natural-language-processing hacktoberfest

classy-classification's Introduction

Hi there 👋

From failing to study medicine ➡️ BSc industrial engineer ➡️ MSc computer scientist.
Life can be strange, so better enjoy it.
I´m sure I do by: 👨🏽‍🍳 Cooking, 👨🏽‍💻 Coding, 🏆 Committing.

Conference slides 📖

🧼 From GPU-poor to data-rich - data quality practices for LLM fine-tuning
Deeplearning.ai LLM workshop - get started with Argilla for human- and distilabel for AI feedback
NLP Healthcare Summit 2023 - Smart Shortcuts for Bootstrapping a Healthcare NER Project
Anyscale Ray Europe Meetup - Smart shortcuts for Bootstrapping a Text Classification project

employers 👨🏽‍💻

Argilla(2022-current) - data annotation and monitoring for enterprise NLP
Pandora Intelligence(2020-2022) - an independent intelligence company, specialized in security risks

open source ⭐️

maintainer 🤓

concise-concepts - a word similarity approach to few-shot NER
fast-sentence-transformers - wrapper for ONNX speed enhanced sentence-transformers
classy-classification - a quick and dirty few-shot text classification solution
crosslingual-coreference - a multi-lingual CoRef resolver using cross-lingual training
adept-augmentations - a Python library aimed at dissecting and augmenting NER training data
spacy-setfit - a Python library aimed to facilitate easy SetFit usage in spaCy

contributions 🫱🏾‍🫲🏼

spaCy - several additions to the spacy-universe
- spanmarker - added .pipe() method to spaCy integration
- spacy-dbpedia-spotlight - added a batch processing functionality
- spacy-fishing - added a batch processing functionality + bug fixes
- spacy-opentapioca - added a batch processing functionality
streamlit-url-fragment - resolved Python versioning issues
allennlp-models - added a batch processing functionality
mutate - resolved Python versioning issues and added PyPI support
rebel - added a batch processing functionality
trl - updated RLHF documentation for PPOTrainer

volunteering 🌍

Bonfari - small to medium sustainable scale projects in Gambia 🇬🇲
510 red-cross - occasional projects to improve humanitarian aid with data

Contacts

classy-classification's People

Contributors

Stargazers

Watchers

Forkers

svlandeg adelevie pepijnboers israelschwarz jcarlosneto tomaarsen brunotech danielmlow moonisali aqhali azizullah2017 mattkallo robinrojowiec

classy-classification's Issues

Inconsistent Result while using a fix random seed

I have been using Spacy - Classy Classification to classify text messages. Python version 3.10

Below is the the training model and I get the Unknown category with the highest score for this specific message:

#Import training data
with open ('SID - Commercial.txt', "r", encoding="utf8") as a:
    Commercial = a.read().splitlines()

with open ('SID - Crypto.txt', "r", encoding="utf8") as b:
    Crypto = b.read().splitlines()
    
with open ('SID - Extortion.txt', "r", encoding="utf8") as c:
    Extortion = c.read().splitlines()
    
with open ('SID - Financial.txt', "r", encoding="utf8") as d:
    Financial = d.read().splitlines()
    
with open ('SID - Gambling.txt', "r", encoding="utf8") as e:
    Gambling = e.read().splitlines()
    
with open ('SID - Gift.txt', "r", encoding="utf8") as f:
    Gift = f.read().splitlines()

with open ('SID - Investment.txt', "r", encoding="utf8") as g:
    Investment = g.read().splitlines()    

with open ('SID - Invoice.txt', "r", encoding="utf8") as h:
    Invoice = h.read().splitlines()  
    
with open ('SID - Phishing.txt', "r", encoding="utf8") as i:
    Phishing = i.read().splitlines() 
    
with open ('SID - Romance.txt', "r", encoding="utf8") as j:
    Romance = j.read().splitlines() 

with open ('SID - Unknown.txt', "r", encoding="utf8") as k:
    Unknown = k.read().splitlines() 
    
data = {}
data["Commercial"] = Commercial
data["Crypto"] = Crypto
data["Extortion"] = Extortion
data["Financial"] = Financial
data["Gambling"] = Gambling
data["Gift"] = Gift
data["Investment"] = Investment
data["Invoice"] = Invoice
data["Phishing"] = Phishing
data["Romance"] = Romance
data["Unknown"] = Unknown

# NLP model
spacy.util.fix_random_seed(0)
nlp = spacy.load("en_core_web_md")
nlp.add_pipe("text_categorizer", 
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "cat_type": "multi-label",
        "device": "gpu"
    }
)

print(nlp("FW: ð�š�ð�™´: ð�šˆð�š˜ð�šž ð�š‘ð�šŠð�šŸð�šŽ ð�š˜ð�š—ð�šŽ (ð�Ÿ·) ð�š˜ð�š›ð�š�ð�šŽð�š› ð�š™ð�šŽð�š—ð�š�ð�š’ð�š—ð�š� ð�š�ð�šŽð�š•ð�š’ð�šŸð�šŽð�š›ð�š¢. #622460835")._.cats)

Result (which is correct as the Unknown as the highest score):
{'Commercial': 0.13948287736862833, 'Crypto': 0.015437351941468657, 'Extortion': 0.0860014895963152, 'Financial': 0.01987490991768424, 'Gambling': 0.029074990906618126, 'Gift': 0.06850244399154756, 'Investment': 0.012729882351053419, 'Invoice': 0.0718818617408037, 'Phishing': 0.046637490542787444, 'Romance': 0.05515818363916855, 'Unknown': 0.45521851800392493}

Importing test dataset:

# Import the dataset and assign scores
Messages = pd.read_csv('November SID2.csv', encoding='utf8')
    
Messages['Body'] = Messages['Body'].astype(str)
Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)

#Find the category based on highest score and join back to the original dataset

Scores = pd.json_normalize(Messages.NLP_Result)
Scores['Category'] = Scores.idxmax(axis=1)
Scores['Category'] = Scores['Category'].replace('_', ' ', regex=True)

Messages_Final = pd.concat([Messages, Scores], axis=1)
Messages_Final.to_csv('out.csv', index=False)

The result in csv file for that same statement:

Body	NLP_Result	Commercial	Crypto	Extortion	Financial	Gambling	Gift	Investment	Invoice	Phishing	Romance	Unknown	Category
FW: ð�š�ð�™´: ð�šˆð�š˜ð�šž ð�š‘ð�šŠð�šŸð�šŽ ð�š˜ð�š—ð�šŽ (ð�Ÿ·) ð�š˜ð�š›ð�š�ð�šŽð�š› ð�š™ð�šŽð�š—ð�š�ð�š’ð�š—ð�š� ð�š�ð�šŽð�š•ð�š’ð�šŸð�šŽð�š›ð�š¢. #622460835	{'Commercial': 0.03343028275903707, 'Crypto': 0.012076486026176284, 'Extortion': 0.08983918751534335, 'Financial': 0.07360790896376578, 'Gambling': 0.014564933067751274, 'Gift': 0.08460245841797985, 'Investment': 0.017324353297565327, 'Invoice': 0.1522007262418396, 'Phishing': 0.4507937431127887, 'Romance': 0.010566873139864728, 'Unknown': 0.060993047457888194}	0.03343	0.012076	0.089839	0.073608	0.014565	0.084602	0.017324	0.152201	0.450794	0.010567	0.060993	Phishing

Why are they inconsistent even when my training model has 'spacy.util.fix_random_seed(0)'?

Thank you

Standalone usage without spaCy setting embeddings post adding the data makes the classifications run twice

To initiate a ClassyClassification model without sPacy we have to first pass the data and then add in extra settings like embeddings and modifications in SVC config.
This makes the model run twice. Can we modify this and make the model train only once after the settings have been added.

from classy_classification import ClassyClassifier

data = {
"furniture": ["This text is about chairs.",
"Couches, benches and televisions.",
"I really need to get a new sofa."],
"kitchen": ["There also exist things like fridges.",
"I hope to be getting a new stove today.",
"Do you also have some ovens."]
}
classifier = ClassyClassifier(data= data)
print(classifier("I am looking for kitchen appliances."))
classifier.set_embedding_model(model="all-mpnet-base-v2")
classifier.set_classification_model(
config={
"C": [1, 2, 5, 10, 20, 100],
"kernel": ["sigmoid"],
"max_cross_validation_folds": 5
}
)
print(classifier("I am looking for kitchen appliances."))
classifier.set_training_data(data=data)
print(classifier("I am looking for kitchen appliances."))

we get three different classification scores.

{'furniture': 0.13484464066590968, 'kitchen': 0.8651553593340902}
{'furniture': 0.8069939934544372, 'kitchen': 0.19300600654556258}
{'furniture': 0.542059833290298, 'kitchen': 0.457940166709702}

Error when try to save model with pickle (in my local ubuntu linux and in google colab too)

With the example:
`data = {
"furniture": ["This text is about chairs.",
"Couches, benches and televisions.",
"I really need to get a new sofa."],
"kitchen": ["There also exist things like fridges.",
"I hope to be getting a new stove today.",
"Do you also have some ovens."]
}
classifier = classyClassifier(data=data)

with open("./classifier.pkl", "wb") as f:
pickle.dump(classifier, f)`

the error is:
TypeError: can't pickle onnxruntime.capi.onnxruntime_pybind11_state.InferenceSession objects

ValueError: Couldn't deep-copy config: maximum recursion depth exceeded while calling a Python object

Hi,

I am trying to reproduce your example on my data and using python 3.8.13 and spacy 3.4.1
This is my code:

import spacy
import classy_classification



nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "spacy"
    }
)

prompt="""
Either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to wonder what
was going to happen next. First, she tried to look down and make out
what she was coming to, but it was too dark to see anything; then she
looked at the sides of the well, and noticed that they were filled with
cupboards and book-shelves; here and there she saw maps and pictures
hung upon pegs. She took down a jar from one of the shelves as she
passed; it was labelled “ORANGE MARMALADE”, but to her great
disappointment it was empty: she did not like to drop the jar for fear
of killing somebody underneath, so managed to put it into one of the
cupboards as she fell past it.

“Well!” thought Alice to herself, “after such a fall as this, I shall
think nothing of tumbling down stairs! How brave they’ll all think me
at home! Why, I wouldn’t say anything about it, even if I fell off the
top of the house!” (Which was very likely true.)"""

print(nlp(prompt)._.cats)

but I get this error

ValueError: Couldn't deep-copy config: maximum recursion depth exceeded while calling a Python object

This is the full error output

ValueError                                Traceback (most recent call last)
Input In [2], in <cell line: 7>()
      2 import classy_classification
      6 nlp = spacy.load("en_core_web_sm")
----> 7 nlp.add_pipe(
      8     "text_categorizer",
      9     config={
     10         "data": data,
     11         "model": "spacy"
     12     }
     13 )
     15 prompt="""
     16 Either the well was very deep, or she fell very slowly, for she had
     17 plenty of time as she went down to look about her and to wonder what
   (...)
     30 at home! Why, I wouldn’t say anything about it, even if I fell off the
     31 top of the house!” (Which was very likely true.)"""
     33 print(nlp(prompt)._.cats)

File ~/miniforge3/envs/sklearn38/lib/python3.8/site-packages/spacy/language.py:795, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    787     if not self.has_factory(factory_name):
    788         err = Errors.E002.format(
    789             name=factory_name,
    790             opts=", ".join(self.factory_names),
   (...)
    793             lang_code=self.lang,
    794         )
--> 795     pipe_component = self.create_pipe(
    796         factory_name,
    797         name=name,
    798         config=config,
    799         raw_config=raw_config,
    800         validate=validate,
    801     )
    802 pipe_index = self._get_pipe_index(before, after, first, last)
    803 self._pipe_meta[name] = self.get_factory_meta(factory_name)

File ~/miniforge3/envs/sklearn38/lib/python3.8/site-packages/spacy/language.py:660, in Language.create_pipe(self, factory_name, name, config, raw_config, validate)
    657 # This is unideal, but the alternative would mean you always need to
    658 # specify the full config settings, which is not really viable.
    659 if pipe_meta.default_config:
--> 660     config = Config(pipe_meta.default_config).merge(config)
    661 internal_name = self.get_factory_name(factory_name)
    662 # If the language-specific factory doesn't exist, try again with the
    663 # not-specific name

File ~/miniforge3/envs/sklearn38/lib/python3.8/site-packages/thinc/config.py:327, in Config.merge(self, updates, remove_extra)
    325 """Deep merge the config with updates, using current as defaults."""
    326 defaults = self.copy()
--> 327 updates = Config(updates).copy()
    328 merged = deep_merge_configs(updates, defaults, remove_extra=remove_extra)
    329 return Config(
    330     merged,
    331     is_interpolated=defaults.is_interpolated and updates.is_interpolated,
    332     section_order=defaults.section_order,
    333 )

File ~/miniforge3/envs/sklearn38/lib/python3.8/site-packages/thinc/config.py:315, in Config.copy(self)
    313     config = copy.deepcopy(self)
    314 except Exception as e:
--> 315     raise ValueError(f"Couldn't deep-copy config: {e}") from e
    316 return Config(
    317     config,
    318     is_interpolated=self.is_interpolated,
    319     section_order=self.section_order,
    320 )

ValueError: Couldn't deep-copy config: maximum recursion depth exceeded while calling a Python object

Can you please help me solve this problem?
Thanks
David

classy_spacy.py raises "NotImplementedError: internal spacy embeddings need to be derived from md/lg spacy models not from sm/trf models."

Thanks for a very nice module! A problem has been introduced with the new version.
The error is raised when adding the pipe:
My code looks like this:

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("text_categorizer", config={"data": training_data,"model": "spacy"})

The console output in Spyder is:

File "c:\users\krist\documents\python scripts\doc genres\examples\classifiertest_dk.py", line 21, in <module>
    nlp.add_pipe("text_categorizer", config={"data": training_data,"model": "spacy", "include_sent": False})

  File "C:\Users\krist\anaconda3\envs\spyder-env\lib\site-packages\spacy\language.py", line 801, in add_pipe
    pipe_component = self.create_pipe(

  File "C:\Users\krist\anaconda3\envs\spyder-env\lib\site-packages\spacy\language.py", line 680, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)

  File "C:\Users\krist\anaconda3\envs\spyder-env\lib\site-packages\confection\__init__.py", line 728, in resolve
    resolved, _ = cls._make(

  File "C:\Users\krist\anaconda3\envs\spyder-env\lib\site-packages\confection\__init__.py", line 777, in _make
    filled, _, resolved = cls._fill(

  File "C:\Users\krist\anaconda3\envs\spyder-env\lib\site-packages\confection\__init__.py", line 849, in _fill
    getter_result = getter(*args, **kwargs)

  File "C:\Users\krist\anaconda3\envs\spyder-env\lib\site-packages\classy_classification\__init__.py", line 65, in make_text_categorizer
    return classySpacyInternalFewShot(

  File "C:\Users\krist\anaconda3\envs\spyder-env\lib\site-packages\classy_classification\classifiers\classy_spacy.py", line 104, in __init__
    classySkeletonFewShot.__init__(self, *args, **kwargs)

  File "C:\Users\krist\anaconda3\envs\spyder-env\lib\site-packages\classy_classification\classifiers\classy_skeleton.py", line 63, in __init__
    self.set_training_data()

  File "C:\Users\krist\anaconda3\envs\spyder-env\lib\site-packages\classy_classification\classifiers\classy_skeleton.py", line 187, in set_training_data
    self.X = self.get_embeddings(X)

  File "C:\Users\krist\anaconda3\envs\spyder-env\lib\site-packages\classy_classification\classifiers\classy_spacy.py", line 95, in get_embeddings
    raise NotImplementedError(

NotImplementedError: internal spacy embeddings need to be derived from md/lg spacy models not from sm/trf models.

Thanks in advance

How do i get acces to the score of each label?

Hello there,

i am currently working on a little Project and i really liked the simplicity of Classy-classification. Now to my issue: Is there a possibility to get acces to the bare prediction score of each label for example to round them up or to just show the top 3? I would be really grateful for some help.

Thanks for reading and i wish everyone a good day or a good tomorrow.

Would be great to also apply the classifier on arbitrary Spans

Would be nice if the classifiers can also be applied on spans, e.g. thoses suggested by span finder or span suggestors.

I have some code working, can open a PR.

between zero shot and few shot

I am a huge fan of this Github repo. One thing I noticed was that there is a pretty large jump in performance between zero shot tasks and few shot tasks, when there are fewer than N annotations. For instance, let's say we have 4 classes: cat, dog, mouse and fox, and sentences pertaining to each class in our dataset. Our zero shot model is able to predict with say ~50% accuracy the label of the corresponding sentence. Similarly, when we have ~8 labels for each class (8 cat, 8 dog, 8 mouse, and 8 fox), the few shot model begins to surpass the zero shot model, performing at say ~63% accuracy. However, there is this gap in between zero shot and few shot tasks, when say you just get your first 5 labels, and 3 of them are dog, one is cat, and one is mouse, where the model will perform relatively poorly, and have worse performance than the zero shot task. The reason for this is that the few shot models are very sensitive to user input.

Our team had a few ideas to solve this. One idea was to still take the class_category / label names as input to the few shot model, and have the few shot model be biased to the zero shot model when the number of annotation is small. The other way was to create random prompts / synthetic data for each category when there are just a few annotations, which may work okay but is not best practice. It would be great if there was a mechanism for classy_classification that handle this edge case where there are just a few annotations (where not each class has a specific annotation: say our distribution is 3 dog, 1 cat, 1 mouse, 0 fox).

Hugging face zero shot classifier not working

bug in text_categorizer

Allow for single class predictions.

You're not allowed to look for a single topic using this tool. Is there a reason why binary classification wouldn't work?

import spacy
import classy_classification

data = {
    "stategy": ["I really prefer strategic games.",
                "I like it when a boardgame makes you think."],
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "spacy"
    }
)

Got this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [12], in <cell line: 10>()
      4 data = {
      5     "stategy": ["I really prefer strategic games.",
      6                 "I like it when a boardgame makes you think."],
      7 }
      9 nlp = spacy.load("en_core_web_md")
---> 10 nlp.add_pipe(
     11     "text_categorizer",
     12     config={
     13         "data": data,
     14         "model": "spacy"
     15     }
     16 )

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/language.py:795, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    787     if not self.has_factory(factory_name):
    788         err = Errors.E002.format(
    789             name=factory_name,
    790             opts=", ".join(self.factory_names),
   (...)
    793             lang_code=self.lang,
    794         )
--> 795     pipe_component = self.create_pipe(
    796         factory_name,
    797         name=name,
    798         config=config,
    799         raw_config=raw_config,
    800         validate=validate,
    801     )
    802 pipe_index = self._get_pipe_index(before, after, first, last)
    803 self._pipe_meta[name] = self.get_factory_meta(factory_name)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/language.py:674, in Language.create_pipe(self, factory_name, name, config, raw_config, validate)
    671 cfg = {factory_name: config}
    672 # We're calling the internal _fill here to avoid constructing the
    673 # registered functions twice
--> 674 resolved = registry.resolve(cfg, validate=validate)
    675 filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"]
    676 filled = Config(filled)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:747, in registry.resolve(cls, config, schema, overrides, validate)
    738 @classmethod
    739 def resolve(
    740     cls,
   (...)
    745     validate: bool = True,
    746 ) -> Dict[str, Any]:
--> 747     resolved, _ = cls._make(
    748         config, schema=schema, overrides=overrides, validate=validate, resolve=True
    749     )
    750     return resolved

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:796, in registry._make(cls, config, schema, overrides, resolve, validate)
    794 if not is_interpolated:
    795     config = Config(orig_config).interpolate()
--> 796 filled, _, resolved = cls._fill(
    797     config, schema, validate=validate, overrides=overrides, resolve=resolve
    798 )
    799 filled = Config(filled, section_order=section_order)
    800 # Check that overrides didn't include invalid properties not in config

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:868, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    865     getter = cls.get(reg_name, func_name)
    866     # We don't want to try/except this and raise our own error
    867     # here, because we want the traceback if the function fails.
--> 868     getter_result = getter(*args, **kwargs)
    869 else:
    870     # We're not resolving and calling the function, so replace
    871     # the getter_result with a Promise class
    872     getter_result = Promise(
    873         registry=reg_name, name=func_name, args=args, kwargs=kwargs
    874     )

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/classy_classification/__init__.py:41, in make_text_categorizer(nlp, name, data, device, config, model, cat_type, include_doc, include_sent)
     39     if cat_type == "zero":
     40         raise NotImplementedError("cannot use spacy internal embeddings with zero-shot classification")
---> 41     return classySpacyInternal(
     42         nlp=nlp, name=name, data=data, config=config, include_doc=include_doc, include_sent=include_sent
     43     )
     44 else:
     45     if cat_type == "zero":

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/classy_classification/classifiers/spacy_internal.py:23, in classySpacyInternal.__init__(self, nlp, name, data, config, include_doc, include_sent)
     21 self.nlp = nlp
     22 self.set_training_data()
---> 23 self.set_svc()

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/classy_classification/classifiers/classy_skeleton.py:144, in classySkeleton.set_svc(self, config)
    135 cv_splits = max(2, min(folds, np.min(np.bincount(self.y)) // 5))
    136 self.clf = GridSearchCV(
    137     SVC(C=1, probability=True, class_weight="balanced"),
    138     param_grid=tuned_parameters,
   (...)
    142     verbose=0,
    143 )
--> 144 self.clf.fit(self.X, self.y)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/model_selection/_search.py:875, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
    869     results = self._format_results(
    870         all_candidate_params, n_splits, all_out, all_more_results
    871     )
    873     return results
--> 875 self._run_search(evaluate_candidates)
    877 # multimetric is determined here because in the case of a callable
    878 # self.scoring the return type is only known after calling
    879 first_test_score = all_out[0]["test_scores"]

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/model_selection/_search.py:1375, in GridSearchCV._run_search(self, evaluate_candidates)
   1373 def _run_search(self, evaluate_candidates):
   1374     """Search all candidates in param_grid"""
-> 1375     evaluate_candidates(ParameterGrid(self.param_grid))

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/model_selection/_search.py:852, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
    845 elif len(out) != n_candidates * n_splits:
    846     raise ValueError(
    847         "cv.split and cv.get_n_splits returned "
    848         "inconsistent results. Expected {} "
    849         "splits, got {}".format(n_splits, len(out) // n_candidates)
    850     )
--> 852 _warn_or_raise_about_fit_failures(out, self.error_score)
    854 # For callable self.scoring, the return type is only know after
    855 # calling. If the return type is a dictionary, the error scores
    856 # can now be inserted with the correct key. The type checking
    857 # of out will be done in `_insert_error_scores`.
    858 if callable(self.scoring):

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/model_selection/_validation.py:367, in _warn_or_raise_about_fit_failures(results, error_score)
    360 if num_failed_fits == num_fits:
    361     all_fits_failed_message = (
    362         f"\nAll the {num_fits} fits failed.\n"
    363         "It is very likely that your model is misconfigured.\n"
    364         "You can try to debug the error by setting error_score='raise'.\n\n"
    365         f"Below are more details about the failures:\n{fit_errors_summary}"
    366     )
--> 367     raise ValueError(all_fits_failed_message)
    369 else:
    370     some_fits_failed_message = (
    371         f"\n{num_failed_fits} fits failed out of a total of {num_fits}.\n"
    372         "The score on these train-test partitions for these parameters"
   (...)
    376         f"Below are more details about the failures:\n{fit_errors_summary}"
    377     )

ValueError: 
All the 12 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
12 fits failed with the following error:
Traceback (most recent call last):
  File "/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/svm/_base.py", line 182, in fit
    y = self._validate_targets(y)
  File "/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/svm/_base.py", line 739, in _validate_targets
    raise ValueError(
ValueError: The number of classes has to be greater than one; got 1 class

Saving and loading models

Hi, does the classifier saves the model everytime set_training_data() function is called?

Also is there any way to save the models like the huggingface format?

So loading the models could something like this:

from classy_classification import ClassyClassifier

classifier = ClassyClassifier(data=data, model="all-MiniLM-L6-v2", multi_label=False)

Spacy embeddings vs sentence transformer embeddings

Hi there,

I have been using the classy-classification library for a few multi-label text classification tasks lately. I believed that the use of sentence transformer embeddings would always generate better results compared to the spacy embeddings. However, during my recent project, resulted in the opposite.

Below are the code snips I used:
Space embeddings (this generated better results) -
nlp = spacy.load("en_core_web_lg") nlp.add_pipe("classy_classification", config={ "data": data, "model": "spacy", "multi_label": True, "config": {"seed": 11}, "device": "gpu" } )

Sentence transformer embeddings -
# NLP model nlp = spacy.load("en_core_web_lg") nlp.add_pipe("classy_classification", config={ "data": data, "model": "sentence-transformers/all-mpnet-base-v2", "multi_label": True, "config": {"seed": 11}, "device": "gpu" } )

Will you be able to provide some guidance on when to use sentence transformer embeddings and when to use Spacy embeddings please?

Thank you

Drastic performance drop

Hi @davidberenstein1957,

I was wondering if there were any changes in the past week. I had trained a model that was performing quite well with unseen data and when I loaded it this week it started acting up even with examples from the training data.

Thanks for your support.

Issues with Saving and Loading

Thanks for this awesome library! I'm still quite new at all this, so this is probably something simple. I'm just trying to simply train and load the categorizer. I'm using Python 3.9.

import spacy
import classy_classification
import json

data = json.load(open('./../aitrain_small.json', 'r'))
nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "text_categorizer", 
    config={
        "data": data, 
        "model": "spacy"
        }
) 

print(nlp("Ability to multi-task in a fast paced detail oriented environment.")._.cats)
nlp.to_disk('small_text_cat')

This seems to save successfully.

Trying to load this separately is when the error happens.

import spacy
import classy_classification

nlp = spacy.load("small_text_cat")
print(nlp("Ability to multi-task in a fast paced detail oriented environment.")._.cats)

Here's the error:

Traceback (most recent call last):
File "/Users/manicho/Mine/Projects/AIVizi/trainingData/usesaved.py", line 6, in
nlp = spacy.load("small_text_cat")
File "/opt/homebrew/lib/python3.9/site-packages/spacy/init.py", line 51, in load
return util.load_model(
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 422, in load_model
return load_model_from_path(Path(name), **kwargs) # type: ignore[arg-type]
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 488, in load_model_from_path
nlp = load_model_from_config(
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 528, in load_model_from_config
nlp = lang_cls.from_config(
File "/opt/homebrew/lib/python3.9/site-packages/spacy/language.py", line 1783, in from_config
nlp.add_pipe(
File "/opt/homebrew/lib/python3.9/site-packages/spacy/language.py", line 792, in add_pipe
pipe_component = self.create_pipe(
File "/opt/homebrew/lib/python3.9/site-packages/spacy/language.py", line 674, in create_pipe
resolved = registry.resolve(cfg, validate=validate)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/config.py", line 746, in resolve
resolved, _ = cls._make(
File "/opt/homebrew/lib/python3.9/site-packages/thinc/config.py", line 795, in _make
filled, _, resolved = cls._fill(
File "/opt/homebrew/lib/python3.9/site-packages/thinc/config.py", line 867, in _fill
getter_result = getter(args, **kwargs)
File "/opt/homebrew/lib/python3.9/site-packages/classy_classification/init.py", line 41, in make_text_categorizer
return classySpacyInternal(
File "/opt/homebrew/lib/python3.9/site-packages/classy_classification/classifiers/spacy_internal.py", line 22, in init
self.set_training_data()
File "/opt/homebrew/lib/python3.9/site-packages/classy_classification/classifiers/classy_skeleton.py", line 113, in set_training_data
self.X = self.get_embeddings(X)
File "/opt/homebrew/lib/python3.9/site-packages/classy_classification/classifiers/spacy_internal.py", line 36, in get_embeddings
embeddings = [self.get_embeddings_from_doc(doc) for doc in docs]
File "/opt/homebrew/lib/python3.9/site-packages/classy_classification/classifiers/spacy_internal.py", line 36, in
embeddings = [self.get_embeddings_from_doc(doc) for doc in docs]
File "/opt/homebrew/lib/python3.9/site-packages/spacy/language.py", line 1576, in pipe
for doc in docs:
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 1602, in _pipe
yield from proc.pipe(docs, **kwargs)
File "spacy/pipeline/transition_parser.pyx", line 230, in pipe
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 1551, in minibatch
batch = list(itertools.islice(items, int(batch_size)))
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 1602, in _pipe
yield from proc.pipe(docs, **kwargs)
File "spacy/pipeline/pipe.pyx", line 53, in pipe
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 1602, in _pipe
yield from proc.pipe(docs, **kwargs)
File "spacy/pipeline/pipe.pyx", line 53, in pipe
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 1602, in _pipe
yield from proc.pipe(docs, **kwargs)
File "spacy/pipeline/trainable_pipe.pyx", line 73, in pipe
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 1551, in minibatch
batch = list(itertools.islice(items, int(batch_size)))
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 1602, in _pipe
yield from proc.pipe(docs, **kwargs)
File "spacy/pipeline/transition_parser.pyx", line 230, in pipe
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 1551, in minibatch
batch = list(itertools.islice(items, int(batch_size)))
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 1602, in _pipe
yield from proc.pipe(docs, **kwargs)
File "spacy/pipeline/trainable_pipe.pyx", line 73, in pipe
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 1551, in minibatch
batch = list(itertools.islice(items, int(batch_size)))
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 1602, in _pipe
yield from proc.pipe(docs, **kwargs)
File "spacy/pipeline/trainable_pipe.pyx", line 79, in pipe
File "/opt/homebrew/lib/python3.9/site-packages/spacy/util.py", line 1621, in raise_error
raise e
File "spacy/pipeline/trainable_pipe.pyx", line 75, in spacy.pipeline.trainable_pipe.TrainablePipe.pipe
File "/opt/homebrew/lib/python3.9/site-packages/spacy/pipeline/tok2vec.py", line 125, in predict
tokvecs = self.model.predict(docs)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/model.py", line 315, in predict
return self._func(self, X, is_train=False)[0]
File "/opt/homebrew/lib/python3.9/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/model.py", line 291, in call
return self._func(self, X, is_train=is_train)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/model.py", line 291, in call
return self._func(self, X, is_train=is_train)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/layers/concatenate.py", line 44, in forward
Ys, callbacks = zip([layer(X, is_train=is_train) for layer in model.layers])
File "/opt/homebrew/lib/python3.9/site-packages/thinc/layers/concatenate.py", line 44, in
Ys, callbacks = zip([layer(X, is_train=is_train) for layer in model.layers])
File "/opt/homebrew/lib/python3.9/site-packages/thinc/model.py", line 291, in call
return self._func(self, X, is_train=is_train)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/model.py", line 291, in call
return self._func(self, X, is_train=is_train)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/layers/with_array.py", line 30, in forward
return _ragged_forward(
File "/opt/homebrew/lib/python3.9/site-packages/thinc/layers/with_array.py", line 90, in _ragged_forward
Y, get_dX = layer(Xr.dataXd, is_train)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/model.py", line 291, in call
return self._func(self, X, is_train=is_train)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/layers/concatenate.py", line 44, in forward
Ys, callbacks = zip([layer(X, is_train=is_train) for layer in model.layers])
File "/opt/homebrew/lib/python3.9/site-packages/thinc/layers/concatenate.py", line 44, in
Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
File "/opt/homebrew/lib/python3.9/site-packages/thinc/model.py", line 291, in call
return self._func(self, X, is_train=is_train)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/model.py", line 291, in call
return self._func(self, X, is_train=is_train)
File "/opt/homebrew/lib/python3.9/site-packages/thinc/layers/hashembed.py", line 61, in forward
vectors = cast(Floats2d, model.get_param("E"))
File "/opt/homebrew/lib/python3.9/site-packages/thinc/model.py", line 216, in get_param
raise KeyError(
KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."

Looking at the KeyError, it looks it's something with the Tok2Vec pipeline, but I'm unsure how to rectify this. I'm guessing I'm doing something wrong in general for saving and loading, and hoping you can point me in the right direction? Thanks in advance!

Installation Issue - error: can't find Rust compiler

Running pip v22.3.1 and pip install classy-classification errors out with the following for me:

Building wheels for collected packages: tokenizers
  Building wheel for tokenizers (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for tokenizers (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [51 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.macosx-10.9-universal2-cpython-310
      creating build/lib.macosx-10.9-universal2-cpython-310/tokenizers
      copying py_src/tokenizers/__init__.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers
      creating build/lib.macosx-10.9-universal2-cpython-310/tokenizers/models
      copying py_src/tokenizers/models/__init__.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/models
      creating build/lib.macosx-10.9-universal2-cpython-310/tokenizers/decoders
      copying py_src/tokenizers/decoders/__init__.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/decoders
      creating build/lib.macosx-10.9-universal2-cpython-310/tokenizers/normalizers
      copying py_src/tokenizers/normalizers/__init__.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/normalizers
      creating build/lib.macosx-10.9-universal2-cpython-310/tokenizers/pre_tokenizers
      copying py_src/tokenizers/pre_tokenizers/__init__.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/pre_tokenizers
      creating build/lib.macosx-10.9-universal2-cpython-310/tokenizers/processors
      copying py_src/tokenizers/processors/__init__.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/processors
      creating build/lib.macosx-10.9-universal2-cpython-310/tokenizers/trainers
      copying py_src/tokenizers/trainers/__init__.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/trainers
      creating build/lib.macosx-10.9-universal2-cpython-310/tokenizers/implementations
      copying py_src/tokenizers/implementations/byte_level_bpe.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/implementations
      copying py_src/tokenizers/implementations/sentencepiece_unigram.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/implementations
      copying py_src/tokenizers/implementations/sentencepiece_bpe.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/implementations
      copying py_src/tokenizers/implementations/base_tokenizer.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/implementations
      copying py_src/tokenizers/implementations/__init__.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/implementations
      copying py_src/tokenizers/implementations/char_level_bpe.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/implementations
      copying py_src/tokenizers/implementations/bert_wordpiece.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/implementations
      creating build/lib.macosx-10.9-universal2-cpython-310/tokenizers/tools
      copying py_src/tokenizers/tools/__init__.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/tools
      copying py_src/tokenizers/tools/visualizer.py -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/tools
      copying py_src/tokenizers/__init__.pyi -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers
      copying py_src/tokenizers/models/__init__.pyi -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/models
      copying py_src/tokenizers/decoders/__init__.pyi -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/decoders
      copying py_src/tokenizers/normalizers/__init__.pyi -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/normalizers
      copying py_src/tokenizers/pre_tokenizers/__init__.pyi -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/pre_tokenizers
      copying py_src/tokenizers/processors/__init__.pyi -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/processors
      copying py_src/tokenizers/trainers/__init__.pyi -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/trainers
      copying py_src/tokenizers/tools/visualizer-styles.css -> build/lib.macosx-10.9-universal2-cpython-310/tokenizers/tools
      running build_ext
      running build_rust
      error: can't find Rust compiler
      
      If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.
      
      To update pip, run:
      
          pip install --upgrade pip
      
      and then retry package installation.
      
      If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

add zero-shot `onnx`support

Running the first example we get a different score

{'furniture': 0.5352066861088035, 'kitchen': 0.4647933138911965}

ImportError: cannot import name 'cached_path' from 'transformers.file_utils' (/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py)

When trying to import the module after pip install classy-classifier in a new venv I get this import error consistently when trying 'import class_classification' after 'import spacy'

Full Error Text

ImportError Traceback (most recent call last)
/tmp/ipykernel_31548/4164985023.py in
----> 1 import classy_classification

/opt/conda/lib/python3.7/site-packages/classy_classification/init.py in
3 from spacy.language import Language
4
----> 5 from .classifiers.sentence_transformer import (
6 classySentenceTransformer as classyClassifier,
7 )

/opt/conda/lib/python3.7/site-packages/classy_classification/classifiers/sentence_transformer.py in
4 from onnxruntime import InferenceSession, SessionOptions
5 from transformers import AutoTokenizer
----> 6 from txtai.pipeline import HFOnnx
7
8 from .classy_skeleton import classySkeleton

/opt/conda/lib/python3.7/site-packages/txtai/pipeline/init.py in
11 from .image import *
12 from .nop import Nop
---> 13 from .text import *
14 from .tensors import Tensors
15 from .train import *

/opt/conda/lib/python3.7/site-packages/txtai/pipeline/text/init.py in
10 from .similarity import Similarity
11 from .summary import Summary
---> 12 from .translation import Translation

/opt/conda/lib/python3.7/site-packages/txtai/pipeline/text/translation.py in
13 from huggingface_hub.hf_api import HfApi
14 from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer, MarianMTModel, MarianTokenizer
---> 15 from transformers.file_utils import cached_path
16
17 from ..hfmodel import HFModel

Error when using spacy _trf models

Hi,
thanks for the updates on the classifier. I have been using your classifier based on en_core_web_lg and wanted to try out en_core_web_trf to see performance improvements.

import spacy
import os
import classy_classification # noqa: F401
from collections import OrderedDict
from operator import itemgetter
from data_en import training_data
nlp = spacy.load("en_core_web_trf")
print("english pretrained model has been loaded")
nlp.add_pipe("text_categorizer", config={"data": training_data,"model": "spacy"})

When I do this, I get the following console output:

english pretrained model has been loaded
Traceback (most recent call last):

File ~/anaconda3/envs/spyder-env/lib/python3.9/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
exec(code, globals, locals)

File ~/Documents/Python Scripts/doc genres/examples/classifiertest.py:22
nlp.add_pipe("text_categorizer", config={"data": training_data,"model": "spacy"})

File ~/anaconda3/envs/spyder-env/lib/python3.9/site-packages/spacy/language.py:795 in add_pipe
pipe_component = self.create_pipe(

File ~/anaconda3/envs/spyder-env/lib/python3.9/site-packages/spacy/language.py:674 in create_pipe
resolved = registry.resolve(cfg, validate=validate)

File ~/anaconda3/envs/spyder-env/lib/python3.9/site-packages/thinc/config.py:746 in resolve
resolved, _ = cls._make(

File ~/anaconda3/envs/spyder-env/lib/python3.9/site-packages/thinc/config.py:795 in _make
filled, _, resolved = cls._fill(

File ~/anaconda3/envs/spyder-env/lib/python3.9/site-packages/thinc/config.py:867 in _fill
getter_result = getter(*args, **kwargs)

File ~/anaconda3/envs/spyder-env/lib/python3.9/site-packages/classy_classification/init.py:53 in make_text_categorizer
return ClassySpacyInternalFewShot(

File ~/anaconda3/envs/spyder-env/lib/python3.9/site-packages/classy_classification/classifiers/classy_spacy.py:110 in init
ClassySkeletonFewShot.init(self, *args, **kwargs)

File ~/anaconda3/envs/spyder-env/lib/python3.9/site-packages/classy_classification/classifiers/classy_skeleton.py:65 in init
self.set_training_data()

File ~/anaconda3/envs/spyder-env/lib/python3.9/site-packages/classy_classification/classifiers/classy_skeleton.py:236 in set_training_data
self.X = self.get_embeddings(X)

File ~/anaconda3/envs/spyder-env/lib/python3.9/site-packages/classy_classification/classifiers/classy_spacy.py:101 in get_embeddings
embeddings.append(doc._.trf_data.model_output.pooler_output[0])

AttributeError: 'ModelOutput' object has no attribute 'pooler_output'

Thanks in advance for your help

Kristian

Not able to save and export spacy models

Not able to save and export spacy models
getting value error configparser.py

Token indices sequence length

Hi there, Thanks for this amazing library. I used the 0.5.x version for a project last year and used it again last week without any issue. But I upgraded the library today to 0.6.7 and I have been getting the warning:
"Token indices sequence length is longer than the specified maximum sequence length for this model (1505 > 512). Running this sequence through the model will result in indexing errors"

I'm using Pythin 3.10.

Code:
`import spacy
import classy_classification
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

NLP model

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("classy_classification",
config={
"data": data,
"model": "sentence-transformers/all-mpnet-base-v2",
"multi_label": True,
"device": "gpu"
}
)

print(nlp("""Random text that is 3375 characters long"""")`

The above works fine. However, when I import the full dataset, I get the warning mentioned above.

Importing the dataset -
`Subs = pd.read_csv('File.csv', encoding='cp1252')

Subs['Fixed_Text'] = Subs['Fixed_Text'].str.strip()

Subs['NLP_Result'] = Subs['Fixed_Text'].apply(lambda x: nlp(x)._.cats)
display(Subs)`

I just wanted to check-in whether you made any changes that could explain the warning I'm getting. Thanks again

Hugginng face models

@davidberenstein1957 Thanks for developing this interesting and great library.

I was able to test the working with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model.

Can you recommend any other smaller models that works faster as well as with a good accuracy?

Thanks in advance.

Different language models

Hi!

Thanks for this library, it really makes complex things simple.

I wonder if some extra effort is required to use other than English models for classification? (English works great out of the box)

Here is some very basic example of wrong classification. Is there a chance somehow to debug and make it work?

import spacy
import classy_classification

nlp = spacy.load("ru_core_news_lg")

data = {
    'right': 'Съешь ещё этих мягких французских булок да выпей чаю.',
    'wrong': 'Быстрая бурая лиса перепрыгивает через ленивую собаку.'
}

nlp.add_pipe(
    'text_categorizer',
    config={
        'data': data,
        'model': 'spacy'
    }
)

test = nlp('Съешь мягких булок')
print(test._.cats)

Result is:
{'right': 0.45482638521673985, 'wrong': 0.5451736147832601}

It should take the 'right' label, the same way it does for English.

Example code gives error

Hi,

When trying the first example I get an error. I have downloaded en_core_web_lg and other models and I can't find a fix for this.

Config validation error
classy_classification -> model: extra fields not permitted
{'nlp': <spacy.lang.en.English object at 0x0000019E780DE290>, 'name': 'classy_classification', 'cat_type': 'few', 'config': None, 'data': {'economy': ['The cost of living', 'Tax', 'Petrol prices', 'Inflation'], 'public trust': ['Breaking the rules', 'One rule for them', 'lied', "don't live in the real world"]}, 'device': 'cpu', 'include_doc': True, 'include_sent': False, 'model': None, 'model:': 'spacy', 'multi_label': False, 'verbose': False, '@factories': 'classy_classification'}
python-BaseException

This my code using python v3.11 and spaCy v3.7.2, classy-classification 0.6.7.

`
import spacy
import classy_classification

Dictionary containing categories (keys) and lists matched to each category (value).

data = {
"economy": [
"The cost of living",
"Tax",
"Petrol prices",
"Inflation"],
"public trust": [
"Breaking the rules",
"One rule for them",
"lied",
"don't live in the real world"
]

}

Load model used for NLP

nlp = spacy.load("en_core_web_lg")

Set up a pipeline to feed our data into a text categoriser

nlp.add_pipe(
"classy_classification",
config={
"data": data,
"model:": "spacy"
}
)

Feed a sentence into the pipeline to get a score for each category in our dictionary

print(nlp("I don't trust them")._.cats)`

Thanks.
Matt.

Onnx support on M1 Macs

Unfortunately Onnx is not supported on M1.

Maybe add an alternative when installing the package on Macs with M1 due to the error that occurs? Or even add in Readme to install with homebrew: brew install cmake and brew install protobuf (I don't know if it works for everyone)

See: onnx/onnx#3129

replace MLP with Multi-class SVM to unify Multi-class classification approach

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn-svm-svc

decision_function_shape=’ovo’

Multilabel returns scientific notation with big dataset

The dataset has over 3000 sentences with labels in each category.

This causes inconsistency when you break the text into sentences and perform calculations on the final score.

And I still have doubts if the scientific number is represented correctly. I have to do some additional tests

f = open("../dumps/classifier.few.pkl", "rb")
f = open("../dumps/classifier.multilabel.pkl", "rb")
classifier = pickle.load(f)
classifier = classifier("Em jogo decisivo na Colômbia, América-MG enfrenta Tolima e busca primeira vitória na fase de grupos da Libertado. Saiba onde assistir a Flamengo x Goiás pelo Brasileirão 2022")
print(classifier)

{'Arte e Entretenimento': 2.9263447e-06, 'Economia': 0.0006199917, 'Esporte': 0.99947697, 'Games': 8.631342e-06, 'Moda': 3.58305e-08, 'Politica': 0.0005317643, 'Pornografia': 1.0400859e-14, 'Saude': 7.639193e-09, 'Sexualidade': 0.00030366945, 'Tecnologia': 0.00012137198, 'Violencia e Crime': 1.5593606e-07}

retrain on saved pickle model?

Hi, is it possible to use the saved model and train it again using ClassyClassifier

It could look like this

f = open("./classifier.pkl", "rb")
model = pickle.load(f)

use this model and then pass it to ClassyClassifier

classifier = ClassyClassifier(data=data, model)

Is it possible to view training progress?

Hello! I had a question, I'm using the classifier on a large training data, and as it takes quite sometime even on a GPU it would be great to know if it's progressing and how long I need to wait. Thanks! my sample code below for reference

from classy_classification import classyClassifier
import json 
with open('training_data.json') as json_file:
   # Load the JSON data
   data = json.load(json_file)
classifier.set_embedding_model(model="paraphrase-MiniLM-L3-v2")

# overwrite SVC config
classifier.set_classification_model(
    config={
        "C": [1, 2, 5, 10, 20, 100],
        "kernels": ["linear"],
        "max_cross_validation_folds": 5
    }
)

classifier("New text ")

Unnecessary print() in classy_skeleton

Is this print() supposed to be here?

https://github.com/Pandora-Intelligence/classy-classification/blob/beb2eaa2aabf2b6114d1480df7565d37cc8656b6/classy_classification/classifiers/classy_skeleton.py#L160

Misaligned pairings of labels and scores?

Hi, thanks for your helpful package!

Trying your example code https://github.com/Pandora-Intelligence/classy-classification#spacy-embeddings, I noticed two things: First, the output format seems to be different; instead of the list of dictionaries in your example code I get a single dictionary (not a big deal).

Second, and perhaps more importantly, it seems that the labels do not match the scores in a reasonable way. For example, print(nlp("I am looking for fridge, stove and oven appliances.")._.cats) returns {'furniture': 0.7210517551092989, 'kitchen': 0.27894824489070125}. While the scores seem somewhat reasonable, they should rather favor the 'kitchen' class in this example, shouldn't they? I could reproduce this with other examples and data, and on a different machine with a fresh Python (3.9.15), spacy (3.3.1) and classy_classification (0.6.1) installation.

Thank you very much for taking at look at this in advance!

Can't install Classy Classification due to FastText dependency

I am trying to install but I keep getting this error. I have tried to install FastText in multiple ways but it seems its no longer supported for py3.9+. Please help. Thank you.

` compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> fasttext`

Omitting the report message

Hey, Thank you very much for your fantastic workmate.

Please, is there a way to prevent the algorithm from printing the report at the end?

"Fitting two folds for each of 6 candidates, totalling 12 fits."

Yours sincerely

Jay

Is there a better way to save the model to the disk?

Hello everyone. I am a novice in this field and I really would like to thank you all for this wonderful library that is making my work way easier than I expected!
For a project, I have to do a few-shot classification with a relatively large amount of samples (27 labels with roughly 500 entries each). This process takes about 15 minutes on my computer using the sentence-transformers embeddings. I wanted to save the resulting model to the disk in order to be used again in the future. I tried the instructions of the spacy documentation (https://spacy.io/usage/saving-loading), but I observe a strange behavior: if I use the "to_disk" and "from_disk" methods the model is saved on the disk, but when it's time to load it, it seems to start the loading from zero, taking 15 minutes again to load. I also tried with pickle, but I get the following error at loading time: "AttributeError: [E047] Can't assign a value to unregistered extension attribute 'cats'. Did you forget to call the set_extension method?".
Does anybody know a way to save the model (or the spacy object) to the disk and retrieve it rapidly afterwards?

Exception while saving model

I use the example and got this error:
TypeError: cannot pickle 'onnxruntime.capi.onnxruntime_pybind11_state.InferenceSession' object

Allow for exclusive classes

Is it possible to have non mutually exclusive classes here? For example, I might have text that's about "Italian food", but also "negative". In other words, can we have scores that do not sum to one?

add saving and loading support for `standalone` reproducability

add `https://onnx.ai/sklearn-onnx/` support

LOW prio

setfit in classy classification

Our team is a huge fan of the recent few shot learning work for text classification involving Setfit. However, running the setfit model via the github link: https://github.com/huggingface/setfit is not as simple and easy to use as classy_classification. Because of this, we were wondering if it would be possible to have the setfit model embedded as one of the classy_classification few shot models for text classification, that way it would be easier to use. We recognize that setfit can take minutes to train without GPU access, and we are all for distilled versions of setfit. Because setfit is one of the more robust few shot learning models today, we think that adding this functionality to classy_classification would be a major plus, that way the models would perform better.