iryna-kondr / scikit-llm


Seamlessly integrate LLMs into scikit-learn.

Home Page: https://beastbyte.ai/

License: MIT License

Python 100.00%

chatgpt deep-learning llm machine-learning scikit-learn transformers

scikit-llm's Introduction

Scikit-LLM: Scikit-Learn Meets Large Language Models

Seamlessly integrate powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks.

Installation 💾

pip install scikit-llm

Support us 🤝

You can support the project in the following ways:

⭐ Star Scikit-LLM on GitHub (click the star button in the top right corner)
💡 Provide your feedback or propose ideas in the issues section or Discord
📰 Post about Scikit-LLM on LinkedIn or other platforms
🔗 Check out our other projects: Dingo, Falcon

Quick Start & Documentation 📚

Quick start example of zero-shot text classification using GPT:

# Import the necessary modules
from skllm.datasets import get_classification_dataset
from skllm.config import SKLLMConfig
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

# Configure the credentials
SKLLMConfig.set_openai_key("<YOUR_KEY>")
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION_ID>")

# Load a demo dataset
X, y = get_classification_dataset() # labels: positive, negative, neutral

# Initialize the model and make the predictions
clf = ZeroShotGPTClassifier(model="gpt-4")
clf.fit(X,y)
clf.predict(X)

For more information please refer to the documentation.

Citation

You can cite Scikit-LLM using the following BibTeX:

@software{ScikitLLM,
  author = {Iryna Kondrashchenko and Oleh Kostromin},
  year = {2023},
  publisher = {beastbyte.ai},
  address = {Linz, Austria},
  title = {Scikit-LLM: Scikit-Learn Meets Large Language Models},
  url = {https://github.com/iryna-kondr/scikit-llm}
}

scikit-llm's Issues

ERROR: Failed building wheel for annoy

Building wheels for collected packages: annoy
Building wheel for annoy (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.10
creating build\lib.win-amd64-3.10\annoy
copying annoy\__init__.py -> build\lib.win-amd64-3.10\annoy
copying annoy\__init__.pyi -> build\lib.win-amd64-3.10\annoy
copying annoy\py.typed -> build\lib.win-amd64-3.10\annoy
running build_ext
building 'annoy.annoylib' extension
creating build\temp.win-amd64-3.10
creating build\temp.win-amd64-3.10\Release
creating build\temp.win-amd64-3.10\Release\src
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\sumeruinfra\AppData\Local\Programs\Python\Python310\include -IC:\Users\sumeruinfra\AppData\Local\Programs\Python\Python310\Include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include /EHsc /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-3.10\Release\src/annoymodule.obj -D_CRT_SECURE_NO_WARNINGS -fpermissive -DANNOYLIB_MULTITHREADED_BUILD
cl : Command line warning D9002 : ignoring unknown option '-fpermissive'
annoymodule.cc
C:\Users\sumeruinfra\AppData\Local\Temp\pip-install-kyqaegqf\annoy_bbac4dc388c84ce883a7b6302751b8ed\src\annoylib.h(19): fatal error C1083: Cannot open include file: 'stdio.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for annoy
Running setup.py clean for annoy
Failed to build annoy
ERROR: Could not build wheels for annoy, which is required to install pyproject.toml-based projects

The issue is related to building wheel files.
@iryna-kondr

Dynamic FewShot GPTClassifier: does it cache Embd locally?

I wonder:

whether DynamicFewShotGPTClassifier caches the OpenAI embeddings locally the first time it is called;
and whether we can access them, since the embeddings could be reused in other cases to save some budget.
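
One way to avoid paying for the same embedding twice is to keep a local cache keyed by the text. The sketch below is not part of Scikit-LLM: `CachedEmbedder` and `fake_embed` are hypothetical names, and `embed_fn` stands in for whatever function actually calls the embedding API.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


class CachedEmbedder:
    """Wraps an embedding function with a JSON file cache (hypothetical helper)."""

    def __init__(self, embed_fn: Callable[[str], List[float]], cache_path: str):
        self.embed_fn = embed_fn
        self.cache_path = Path(cache_path)
        self.cache: Dict[str, List[float]] = {}
        # Reload embeddings stored by an earlier run, if any.
        if self.cache_path.exists():
            self.cache = json.loads(self.cache_path.read_text())

    @staticmethod
    def _key(text: str) -> str:
        # Hash the text so that long inputs still give short, stable keys.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            key = self._key(text)
            if key not in self.cache:
                # Only texts that were never seen trigger a (paid) call.
                self.cache[key] = self.embed_fn(text)
            vectors.append(self.cache[key])
        self.cache_path.write_text(json.dumps(self.cache))
        return vectors


calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real embedding request; records every call.
    calls.append(text)
    return [float(len(text)), 0.0]


with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "embeddings.json")
    first = CachedEmbedder(fake_embed, path)
    first.embed(["good", "bad", "good"])
    # A second instance reads the file and needs no new calls for "good".
    second = CachedEmbedder(fake_embed, path)
    vectors = second.embed(["good"])
```

In this toy run the embedding function is called only for the two distinct texts; the repeated text and the second instance are served from the cache file.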

`GPTVectorizer().fit_transform(X)` always returns `RuntimeError`

Hi! First of all very nice work!

I was trying the embedding utility with something as simple as:

from skllm.datasets import get_classification_dataset
from skllm.preprocessing import GPTVectorizer

X, _ = get_classification_dataset()
model = GPTVectorizer()
vectors = model.fit_transform(X)

however, I always get:

RuntimeError: Could not obtain the embedding after retrying 3 times. 
Last captured error: `<empty message>

I also tried with a custom dataset and with some simple strings.

Am I doing something wrong?

APIConnectionError

I am running this code:

from skllm.config import SKLLMConfig
SKLLMConfig.set_openai_key("my key")
SKLLMConfig.set_azure_api_base("my url")

from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

X, y = get_classification_dataset()

clf = ZeroShotGPTClassifier(openai_model="azure::gpt-35-turbo")
clf.fit(X, y)
labels = clf.predict(X)

And I get this error:

1%|          | 1/109 [00:27<49:23, 27.44s/it]
Could not obtain the completion after 3 retries: `APIConnectionError :: Error communicating with OpenAI: HTTPSConnectionPool(host='gen-ai-sweden.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-05-15 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000018D602595D0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))`
None
Could not extract the label from the completion: 'NoneType' object is not subscriptable

I am using openai version 0.28.0
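
The `getaddrinfo failed` part of the message means the host name could not be resolved, so a quick DNS check outside the library can narrow the problem down. The helper below is a minimal sketch using only the standard library; `can_resolve` is a hypothetical name and is not part of Scikit-LLM or the openai package.

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if the host name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, port)) > 0
    except socket.gaierror:
        # Raised when name resolution fails, as in the report above.
        return False
```

Calling `can_resolve("YOUR_RESOURCE_NAME.openai.azure.com")` with the host part of the configured API base (no scheme, no path) shows whether the machine can resolve the endpoint at all.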

Supporting NER tasks

Could we add support for named entity recognition tasks to the library? I see that the user interface would not change much from what was applied in the multi-label classification method, with the difference that instead of the entire text input being classified as one or multiple labels, we would have distinct entities from the text being recognized and labeled (the user could also give a semantic list of possible entities to be recognized as input).

Feature request: setting seed parameter of OpenAI's chat completions API

Thank you for creating and maintaining this awesome project!

OpenAI recently introduced the seed parameter to make their models' text generation and chat completion behavior (more) reproducible (see https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter).

I think it would be great if you could enable users of your package to control this parameter when using OpenAI models as a backend (i.e., in the files here: https://github.com/iryna-kondr/scikit-llm/tree/main/skllm/models/gpt)

The seed parameter could be hard-coded

scikit-llm/skllm/llm/gpt/clients/openai/completion.py

Line 50 in 0bdea94

temperature=0.0, messages=messages, **model_dict

similar to setting temperature=0.0.

Alternatively, users could pass seed=<SEED> via **kwargs.
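
Both options can be sketched without touching the real client. The helper below only assembles the keyword arguments for a chat completion request; `build_completion_kwargs` is a hypothetical name, not the library's actual code, and it assumes the backend accepts a `seed` keyword as described in the linked cookbook.

```python
from typing import Any, Dict, List, Optional


def build_completion_kwargs(
    messages: List[Dict[str, str]],
    model: str,
    seed: Optional[int] = None,
    **extra: Any,
) -> Dict[str, Any]:
    """Assemble request arguments; the seed is added only when provided."""
    kwargs: Dict[str, Any] = {
        "model": model,
        "messages": messages,
        "temperature": 0.0,  # mirrors the hard-coded temperature quoted above
    }
    if seed is not None:
        kwargs["seed"] = seed
    # Anything else the caller passes is forwarded unchanged.
    kwargs.update(extra)
    return kwargs


request = build_completion_kwargs(
    [{"role": "user", "content": "Classify: great movie"}],
    model="gpt-4",
    seed=42,
)
```

Leaving `seed` optional keeps the current behavior for users who do not set it.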

Could we run those LLM models on CPU for inference?

Hi,

Many thanks for releasing this repo for using LLMs locally!

Could we know if it is possible to run these LLM models on CPU rather than GPU for inference?

Many thanks!

'GPT4All' object has no attribute 'chat_completion'

if GPT4All is None:
    raise ImportError(
        "gpt4all is not installed, try pip install scikit-llm[gpt4all]"
    )
if model not in _loaded_models.keys():
    _loaded_models[model] = GPT4All(model)

return _loaded_models[model].chat_completion(
    messages, verbose=False, streaming=False, temp=1e-10
)

Wrong output in GPTTranslator

Hi,

I have used GPTTranslator on a list of mixed English and Spanish texts, and the output that I get looks as follows (truncated).

I tried twice and the output remains the same:

'```Requisitos para entidades culturales.``` \n```Requirements for cultural entities.``` \n```Requisitos para entidades culturales.```',
'Requisito de identificación de alcohol.',
'Ampliar plazo matrimonio. \n\nTranslation: Ampliar el plazo de matrimonio.',
'Bono de invierno incrementado.',
'Mejora del seguro de GPS.',
'Retiro de fondos.',
'Reingreso ilegal penalizado.',
'Modificación de protección del agua.',
'Autorización para portar armas.',
'Adición de reembolso de aerolínea.',
'Licencia adulterada.',
'Renombrar el aeropuerto en honor a Margot Duhalde.',
'Alquiler regulado.',
'Modificaciones laborales.',
'Ley de igualdad de edad.',
'Reforma constitucional.',
'Arborización por nacimiento.',
'Sanciones de armas más duras.',
'```Comités de seguridad.``` \n(Translated to Spanish: Comités de seguridad.)',
'Modificaciones legales para migrantes.',
'Criminalizando la negación de los derechos humanos.',
'Reforma de empresas estatales.',
'Exención de multa.',
'Prioridad de educación sexual de los padres.',
'Secretos bancarios ampliados.',
'Modificación de la ley del VIH.',
'Cambios en la ley de votación.',
'```Subsistema de Inteligencia.``` \n(Spanish)'

The input (truncated) is:

['Requisitos para entidades culturales.',
'Alcohol ID requirement.',
'Ampliar plazo matrimonio.',
'Bono Invierno incrementado.',
'GPS insurance improvement.',
'Retiro de fondos.',
'Illegal re-entry penalized.',
'Water protection modification.',
'Porte de armas autorizado.',
'Airline refund addition.',
'Licencia adulterada.',
'Rename airport after Margot Duhalde.',
'Arriendo regulado.',
'Modificaciones laborales.',
'Age equality law.',
'Constitutional reform.',
'Arborización por nacimiento.',
'Tougher weapon penalties.',
'Comités de seguridad.',
'Legal modifications for migrants.',
'Criminalizing human rights denial.',
'State-owned companies reform.',
'Exención de multa.',
"Parents' sexual education priority.",
'Secretos bancarios ampliados.',
'HIV law modification.',
'Voting law changes.',
'Subsistema de Inteligencia.']

predict

ZeroShotGPTClassifier.predict must return an np.array instead of a list.

Integration of other LLMs

Is there anything in the pipeline to integrate other large language models such as LLaMA, GPT-J, GPT4All-J, Auto-GPT, and others?
Quick Reference: https://github.com/ajayarunachalam/pychatgpt_gui

Long Documents for Summarization -> Zero-Shot Multi-Label Classification

How do I summarize an exceptionally long, book-sized document?
I want to create a summary and then use the LLM classifier on it. The "summary of summaries" approach takes too much time.
When working with a zero-shot multi-label classifier, are the individual texts I want to classify sent as separate requests to the LLM API, or are multiple texts combined into one request? Specifically, when using a list of lists, would all the text inputs be included in a single prompt, or would they remain separate? How do we manage text limits if they are all aggregated?
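
Whatever the library does internally, a long document has to be cut into pieces that fit the model's context before it can be summarized or classified. The sketch below splits text into overlapping word windows; `chunk_text` is a hypothetical helper, and counting words is only a rough stand-in for counting tokens.

```python
from typing import List


def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into windows of at most max_words words that share `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Toy document of 450 "words" to show the window boundaries.
document = " ".join(f"w{i}" for i in range(450))
chunks = chunk_text(document, max_words=200, overlap=20)
```

Each chunk can then be sent as its own input, which keeps every request under a known size regardless of how long the original document is.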

Documentation

Hi @iryna-kondr

Re: Documentation: Improve the project's documentation, including code comments and README files.

I would love to help document this project including code comments and possibly adding some use case notebooks etc.

Let me know if this is something you're open to or if you have some pointers on where to start?

Thanks!

Consider updating package requirements

Consider remaining compatible with the openai package.

Specs: MacBook Pro 2019

Contribute.MD

We should have a Contribute.md file for users who want to contribute. I would also like to contribute to this project.

Error for DynamicFewShotGPTClassifier when using OpenAI Api

Hello,

My code for DynamicFewShotGPTClassifier worked well about a month ago, but for some reason I'm getting error messages today. I still have my pay-as-you-go OpenAI account. I tried running the code with a new API, and I still got errors.

Below is my code for DynamicFewShotGPTClassifier and the error:

from skllm import DynamicFewShotGPTClassifier
GPT_model2 = DynamicFewShotGPTClassifier("gpt-4", n_examples=3)
GPT_model2.fit(X,y)
GPT_model2_predict = GPT_model2.predict(X)

RuntimeError Traceback (most recent call last)
Cell In[45], line 3
1 from skllm import DynamicFewShotGPTClassifier
2 GPT_model2 = DynamicFewShotGPTClassifier(n_examples=3)
----> 3 GPT_model2.fit(X,y)
4 GPT_model2_predict = GPT_model2.predict(X)
6 few_shot = classification_report(y, GPT_model2_predict, output_dict=True)

File ~/anaconda3/lib/python3.10/site-packages/skllm/models/gpt/gpt_dyn_few_shot_clf.py:81, in DynamicFewShotGPTClassifier.fit(self, X, y)
79 partition = X[y == cls]
80 self.data_[cls]["partition"] = partition
---> 81 embeddings = self.embedding_model_.transform(partition)
82 index = AnnoyMemoryIndex(embeddings.shape[1])
83 for i, embedding in enumerate(embeddings):

File ~/anaconda3/lib/python3.10/site-packages/sklearn/utils/_set_output.py:142, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
140 @wraps(f)
141 def wrapped(self, X, *args, **kwargs):
--> 142 data_to_wrap = f(self, X, *args, **kwargs)
143 if isinstance(data_to_wrap, tuple):
144 # only wrap the first output for cross decomposition
145 return (
146 _wrap_data_with_container(method, data_to_wrap[0], X, self),
147 *data_to_wrap[1:],
148 )

File ~/anaconda3/lib/python3.10/site-packages/skllm/preprocessing/gpt_vectorizer.py:74, in GPTVectorizer.transform(self, X)
71 embeddings = []
72 for i in tqdm(range(len(X))):
73 embeddings.append(
---> 74 _get_embedding(X[i], self._get_openai_key(), self._get_openai_org())
75 )
76 embeddings = np.asarray(embeddings)
77 return embeddings

File ~/anaconda3/lib/python3.10/site-packages/skllm/openai/embeddings.py:48, in get_embedding(text, key, org, model, max_retries)
46 error_type = type(e).__name__
47 sleep(3)
---> 48 raise RuntimeError(
49 f"Could not obtain the embedding after {max_retries} retries: {error_type} :: {error_msg}"
50 )

RuntimeError: Could not obtain the embedding after 3 retries: InvalidRequestError :: Must provide an 'engine' or 'deployment_id' parameter to create a <class 'openai.api_resources.embedding.Embedding'>

can it output probabilities

Hi, I must say this is excellent work. While using it, is it possible to output probabilities?

Unable to use gpt4all

I followed all the steps mentioned in the README, but I couldn't use gpt4all in my Colab notebook. It keeps prompting me to install it (error: gpt4all is not installed, try 'pip install scikit-llm[gpt4all]') even though I added the installation code before running my code snippet.

It's happening only with gpt4all, as gpt-3.5-turbo at least loads and works.

InvalidRequestError POST /v1/openai/deployments/

Hello,
I get an InvalidRequestError when trying to use ZeroShotGPTClassifier, whereas I can call ChatCompletion with the same model. Here is my code:

import openai
from skllm.config import SKLLMConfig
SKLLMConfig.set_openai_key(openai.api_key)
SKLLMConfig.set_azure_api_base(openai.api_base)

from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

X = df[['cause','solution']].values
y = df['Causes_FinalNature_ENG'].values

# defining the model
clf = ZeroShotGPTClassifier(openai_model="azure::gpt-4-32k")

# fitting the data
clf.fit(X, y)

# predicting the data
labels = clf.predict(X)

This code crashes with this error:
Could not obtain the completion after 3 retries: InvalidRequestError :: Invalid URL (POST /v1/openai/deployments/gpt-4-32k/chat/completions)

But when I run this code it works well:
response = openai.ChatCompletion.create(
    engine='gpt-4-32k',
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"}],
    max_tokens=193,
    temperature=0,
)

Same stuff with gpt-35-turbo and gpt-3.5-turbo.

Do you know what's wrong with my code?

Thanks in advance

ZeroShotGPTClassifier Error

I am running the example code:

from skllm.config import SKLLMConfig
import os

SKLLMConfig.set_openai_key(os.getenv("OPENAI_API_KEY"))
SKLLMConfig.set_openai_org(os.getenv("OpenAI-org"))

from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

X, _ = get_classification_dataset()

clf = ZeroShotGPTClassifier()
clf.fit(None, ["positive", "negative", "neutral"])
labels = clf.predict(X)

I get the following error:

  3%|▎         | 1/30 [00:09<04:21,  9.00s/it]Could not obtain the completion after 3 retries: `AttributeError :: module 'openai' has no attribute 'ChatCompletion'`
None
Could not extract the label from the completion: 'NoneType' object is not subscriptable
  3%|▎         | 1/30 [00:15<07:15, 15.00s/it]

Is it possible to hybrid Bard and ChatGPT?

ChatGPT is good at being creative.
Bard is good at answering and finding good responses.

WDYT?

Is that possible not to raise error for few edge cases?

Hi, I am new to scikit-llm. Generally speaking, ZeroShotGPTClassifier works for my daily work as follows:
clf = ZeroShotGPTClassifier(model=model_name)
clf.fit(X, y)
preds = clf.predict(X)
However, sometimes the input X has a few rows that are too long. This breaks my job with a 'context_length_exceeded' error, so I cannot get preds. Sometimes I also fail to get predictions because a few rows trigger the OpenAI 'content_filter' error. (OpenAI's neural multi-class classification models believe my input text contains harmful content, but their predictions are false positives.)

I think the error comes from the retry function.

scikit-llm/skllm/utils.py

Line 92 in 0bdea94

raise RuntimeError(err_msg)

Is there a quick way to turn off this error in version 1.0.0?
I am OK with random classification predictions whenever the OpenAI API returns an error. Is that doable? My memory may not be accurate, but I remember the old manual of scikit-llm saying something like the classifier still works even with an error, but the prediction will be random in that case.

Thank you in advance.
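
Until the library offers such a switch, one workaround is to predict row by row and catch the exception outside the classifier. The sketch below is an assumption-laden illustration: `predict_safely` and `toy_predict` are hypothetical names, and `toy_predict` stands in for a real per-row call such as `clf.predict([text])[0]`.

```python
from typing import Any, Callable, List, Optional, Sequence


def predict_safely(
    predict_one: Callable[[str], Any],
    texts: Sequence[str],
    fallback: Optional[Any] = None,
) -> List[Any]:
    """Predict each text separately; a failing row gets `fallback` instead of aborting."""
    labels = []
    for text in texts:
        try:
            labels.append(predict_one(text))
        except Exception:
            # e.g. 'context_length_exceeded' or 'content_filter' surfaced as an error
            labels.append(fallback)
    return labels


def toy_predict(text: str) -> str:
    # Stand-in for a real classifier call; fails on long inputs.
    if len(text) > 20:
        raise RuntimeError("context_length_exceeded")
    return "positive"


preds = predict_safely(toy_predict, ["short text", "x" * 50], fallback="unknown")
```

A fixed fallback label keeps the failed rows easy to find afterwards, which a random label would not.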

Prompt JSON response for DynamicFewShotGPTClassifier is blank

Hi,

I just started using skllm and I have tried to build a simple DynamicFewShotGPTClassifier with the following code:

from skllm import DynamicFewShotGPTClassifier

X = [
    "I love reading science fiction novels, they transport me to other worlds.",
    "A good mystery novel keeps me guessing until the very end.",
    "Historical novels give me a sense of different times and places.",
    "I love watching science fiction movies, they transport me to other galaxies.",
    "A good mystery movie keeps me on the edge of my seat.",
    "Historical movies offer a glimpse into the past.",
]

y = ["books", "books", "books", "movies", "movies", "movies"]

query = "I have fallen deeply in love with this sci-fi book; its unique blend of science and fiction has me spellbound."

clf = DynamicFewShotGPTClassifier(n_examples=1).fit(X, y)

prompt = clf._get_prompt(query)
print(prompt)

Everything seems to work just fine, but the "Your JSON response" part is blank. Any ideas what could be happening? Thanks

Prompt output:

List of categories: ['books', 'movies']

Training data:

Sample input:
```I love reading science fiction novels, they transport me to other worlds.```

Sample target: books


Sample input:
```I love watching science fiction movies, they transport me to other galaxies.```

Sample target: movies


Text sample: ```I have fallen deeply in love with this sci-fi book; its unique blend of science and fiction has me spellbound.```

Your JSON response:

Closing as duplicate of #40

Originally posted by @OKUA1 in #39 (comment)
>pip install annoy
WARNING: Ignoring invalid distribution -rotobuf (c:\users\sumeruinfra\appdata\local\programs\python\python310\lib\site-packages)
WARNING: Ignoring invalid distribution -ymongo (c:\users\sumeruinfra\appdata\local\programs\python\python310\lib\site-packages)
Collecting annoy
Using cached annoy-1.17.3.tar.gz (647 kB)
Preparing metadata (setup.py) ... done
Building wheels for collected packages: annoy
Building wheel for annoy (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.10
creating build\lib.win-amd64-3.10\annoy
copying annoy\__init__.py -> build\lib.win-amd64-3.10\annoy
copying annoy\__init__.pyi -> build\lib.win-amd64-3.10\annoy
copying annoy\py.typed -> build\lib.win-amd64-3.10\annoy
running build_ext
building 'annoy.annoylib' extension
creating build\temp.win-amd64-3.10
creating build\temp.win-amd64-3.10\Release
creating build\temp.win-amd64-3.10\Release\src
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\sumeruinfra\AppData\Local\Programs\Python\Python310\include -IC:\Users\sumeruinfra\AppData\Local\Programs\Python\Python310\Include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include /EHsc /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-3.10\Release\src/annoymodule.obj -D_CRT_SECURE_NO_WARNINGS -fpermissive -DANNOYLIB_MULTITHREADED_BUILD
cl : Command line warning D9002 : ignoring unknown option '-fpermissive'
annoymodule.cc
C:\Users\sumeruinfra\AppData\Local\Temp\pip-install-_z1x63lg\annoy_7c08e48d9418499c876cb0133458a99b\src\annoylib.h(19): fatal error C1083: Cannot open include file: 'stdio.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

[notice] A new release of pip is available: 23.1.2 -> 23.2
[notice] To update, run: python.exe -m pip install --upgrade pip

C:\Users\sumeruinfra>

Dependency VertexAI breaks Python installation

First of all, it seems that vertexai is not supposed to be used and is merely a placeholder for google-cloud-aiplatform.
Secondly, the current wheel (at least on macOS for me) is broken, as it contains an __init__.py file with the following content:

# ...

raise ImportError(
    "To use the Vertex AI SDK, install the google-cloud-aiplatform package."
)

However, this is not placed into site-packages/vertexai/__init__.py, BUT INSTEAD AT THE ROOT OF SITE-PACKAGES. This means every other module can NOT be imported anymore. I think this is an issue with the original wheel file; however, since it is not needed, I'd appreciate an updated version with the dependency removed.
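
To confirm whether an environment is affected, it is enough to check for an `__init__.py` sitting directly in site-packages. The helper below is a small sketch; `has_stray_init` is a hypothetical name, and the default location is taken from `sysconfig`, which is an assumption about where the wheel was installed.

```python
import sysconfig
import tempfile
from pathlib import Path
from typing import Optional


def has_stray_init(site_packages: Optional[str] = None) -> bool:
    """Return True if an __init__.py sits directly in the site-packages root."""
    root = Path(site_packages or sysconfig.get_paths()["purelib"])
    return (root / "__init__.py").is_file()


# Demonstration on a temporary directory instead of a real environment.
with tempfile.TemporaryDirectory() as tmp:
    clean = has_stray_init(tmp)
    (Path(tmp) / "__init__.py").write_text("raise ImportError('demo')\n")
    broken = has_stray_init(tmp)
```

Running `has_stray_init()` without arguments inspects the current interpreter's own site-packages directory.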

Text Vectorization and Dynamic Few-Shot Text Classification not working

Hello,

I have the following code in my jupyter notebook:

!pip install scikit-llm
from skllm.config import SKLLMConfig
SKLLMConfig.set_openai_key("")
SKLLMConfig.set_azure_api_base("<my azure api base>")

from skllm.datasets import get_classification_dataset
X, y = get_classification_dataset()

from skllm import DynamicFewShotGPTClassifier
GPT_model3 = DynamicFewShotGPTClassifier(n_examples=3)
GPT_model3.fit(X, y)
GPT_labels3 = GPT_model3.predict(X)

from skllm.preprocessing import GPTVectorizer
model = GPTVectorizer()
vectors = model.fit_transform(X)

Both dynamic few shot classifier and GPTvectorizer are giving me the following issues:

RuntimeError Traceback (most recent call last)
Cell In[135], line 4
2 X, _ = get_classification_dataset()
3 model = GPTVectorizer()
----> 4 vectors = model.fit_transform(X)

File ~/anaconda3/lib/python3.10/site-packages/skllm/preprocessing/gpt_vectorizer.py:94, in GPTVectorizer.fit_transform(self, X, y, **fit_params)
79 def fit_transform(self, X: Optional[Union[np.ndarray, pd.Series, List[str]]], y=None, **fit_params) -> ndarray:
80 """
81 Fits and transforms a list of strings into a list of GPT embeddings.
82 This is modelled to function as the sklearn fit_transform method
(...)
92 embeddings : np.ndarray
93 """
---> 94 return self.fit(X, y).transform(X)

If you could help me figure out this issue, that would be great!

Azure OpenAI Embeddings

You have added support for Azure OpenAI GPT models. Please add support for Azure OpenAI embedding models too. Due to this problem, I can't use the GPTVectorizer as well as Dynamic Few Shot Classification.

Update Readme for Azure users

I was facing some issues configuring the API key and the API base for my deployed resource on Azure. I was able to solve them using the Azure documentation; I think adding these changes would help others configure it faster.

Below is the proposed change to the ReadMe.

Using Azure OpenAI

from skllm.config import SKLLMConfig

SKLLMConfig.set_openai_key("<YOUR_KEY>")  # use azure key instead
SKLLMConfig.set_azure_api_base("<API_BASE>") # your endpoint should look like the following https://YOUR_RESOURCE_NAME.openai.azure.com/

# start with "azure::" prefix when setting the model name
model_name = "azure::<model_name>"
# e.g. ZeroShotGPTClassifier(openai_model="azure::gpt-3.5-turbo")

Note:

Azure OpenAI is not supported by the preprocessors at the moment.
The openai_model should be the deployment name for the resource.
To find the API_KEY and the Azure Api Base : Azure OpenAi Documentation

prompt-engineering ?

Hey fellas, according to this: link
classification prompts can be improved by using techniques like Chain of Thought and Few Shot training.

Do you want this incorporated?

Safe openai version to work on?

Hi, I tried to use the few-shot classifier from the sample code. However, it seems that the openai package is restructuring its code: https://community.openai.com/t/attributeerror-module-openai-has-no-attribute-embedding/484499.

Here are the error codes:
Could not obtain the completion after 3 retries: `APIRemovedInV1 ::

You tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You can run openai migrate to automatically upgrade your codebase to use the 1.0.0 interface.
...
A detailed migration guide is available here: openai/openai-python#742
`
None
Could not extract the label from the completion: 'NoneType' object is not subscriptable

So, is there a version of the openai package that is safe to run?
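
The error text itself draws the line at openai 1.0.0, and another report in this list mentions running openai 0.28.0. A small check of the installed version can therefore tell which interface is present. The function names below are hypothetical, and the sketch only inspects version strings; it does not import or call the openai package.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Extract the leading integer of a dotted version string such as '0.28.0'."""
    return int(version.split(".")[0])


def uses_legacy_openai_interface(version: str) -> bool:
    """True for releases below 1.0.0, which still expose openai.ChatCompletion."""
    return major_version(version) < 1


def installed_openai_version() -> Optional[str]:
    """Return the installed openai version, or None if the package is missing."""
    try:
        return metadata.version("openai")
    except metadata.PackageNotFoundError:
        return None
```

If the installed version turns out to be 1.0.0 or later, pinning the dependency with `pip install "openai<1.0.0"` is one possible workaround, assuming the Scikit-LLM release in use still relies on the old interface.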

Evaluation

How can I evaluate the zero-shot classifier on a multilabel task? Can you make it compatible with scikit-learn's classification report?
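
Multilabel metrics generally expect a binary indicator matrix with one column per class rather than lists of label names. The sketch below does that conversion in plain Python; `to_indicator` is a hypothetical helper, the example labels are made up, and it assumes the classifier returns one list of labels per sample.

```python
from typing import List, Sequence


def to_indicator(samples: Sequence[Sequence[str]], classes: Sequence[str]) -> List[List[int]]:
    """Turn per-sample label lists into rows of 0/1 flags, one column per class."""
    return [[1 if c in set(labels) else 0 for c in classes] for labels in samples]


classes = ["price", "quality", "delivery"]
y_true = [["price"], ["quality", "delivery"], []]
y_pred = [["price", "quality"], ["delivery"], []]

true_matrix = to_indicator(y_true, classes)
pred_matrix = to_indicator(y_pred, classes)
```

Matrices in this shape should be usable with `sklearn.metrics.classification_report(true_matrix, pred_matrix, target_names=classes)`; scikit-learn's `MultiLabelBinarizer` performs the same conversion.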

Failed building wheel for annoy In scikit-llm[gpt4all]

Building wheels for collected packages: annoy
Building wheel for annoy (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.10
creating build\lib.win-amd64-3.10\annoy
copying annoy\__init__.py -> build\lib.win-amd64-3.10\annoy
copying annoy\__init__.pyi -> build\lib.win-amd64-3.10\annoy
copying annoy\py.typed -> build\lib.win-amd64-3.10\annoy
running build_ext
building 'annoy.annoylib' extension
creating build\temp.win-amd64-3.10
creating build\temp.win-amd64-3.10\Release
creating build\temp.win-amd64-3.10\Release\src
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\sumeruinfra\AppData\Local\Programs\Python\Python310\include -IC:\Users\sumeruinfra\AppData\Local\Programs\Python\Python310\Include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include /EHsc /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-3.10\Release\src/annoymodule.obj -D_CRT_SECURE_NO_WARNINGS -fpermissive -DANNOYLIB_MULTITHREADED_BUILD
cl : Command line warning D9002 : ignoring unknown option '-fpermissive'
annoymodule.cc
C:\Users\sumeruinfra\AppData\Local\Temp\pip-install-risjy5ab\annoy_0f7489caf9de4e62a7376c32e7609982\src\annoylib.h(19): fatal error C1083: Cannot open include file: 'stdio.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.35.32215\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

Is loading a local large model for text vectorization supported?

Add flag to control whether unknown labels are returned as None

Additionally, Scikit-LLM will ensure that the obtained response contains a valid label. If this is not the case, a label will be selected randomly (label probabilities are proportional to label occurrences in the training set).

In many pipelines it is better to return a None label for such examples instead of choosing one at random.
I would like a flag to control this behavior:
either set a specific label (like -1) in those cases, return None, or select a label at random (current behavior).
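
The three behaviors requested above can be expressed as a single policy argument. The sketch below is not the library's code: `resolve_label`, `on_unknown` and `default` are hypothetical names. The random branch weights labels by their frequency in the training set, following the documentation excerpt quoted in this issue.

```python
import random
from collections import Counter
from typing import Optional, Sequence


def resolve_label(
    response: str,
    training_labels: Sequence[str],
    on_unknown: str = "random",
    default: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return the response if it is a known label, else apply the chosen policy."""
    counts = Counter(training_labels)
    if response in counts:
        return response
    if on_unknown == "none":
        return None
    if on_unknown == "default":
        return default
    if on_unknown == "random":
        rng = rng or random.Random()
        labels = list(counts)
        # Sampling weights are proportional to label occurrences in training data.
        return rng.choices(labels, weights=[counts[l] for l in labels], k=1)[0]
    raise ValueError(f"unknown policy: {on_unknown}")


train = ["positive", "positive", "negative"]
kept = resolve_label("positive", train)
as_none = resolve_label("???", train, on_unknown="none")
as_default = resolve_label("???", train, on_unknown="default", default="-1")
```

Keeping "random" as the default value would preserve the existing behavior for current users.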

Web Interface like scikit-learn

Developing a web interface like scikit-learn's would help others understand the library better. The UI could contain:

Model Name
Model Specs
Model Description
Model snippet
Example

What is the `fit` method actually doing?

Hi,
Great work! I have 3 questions:

1. Referring to the example in your README: as part of the fit method in ZeroShotGPTClassifier with gpt-3.5-turbo as the model, are you basically freezing the ada-02 embeddings and then adding a layer on top for the classification task? I'm asking because OpenAI APIs support fine-tuning only up to GPT-3.
2. Or are you simply using it as a zero-shot classifier, with no real training happening? That is, does the fit method only map to some prompts that are relevant for a classification task?
3. How can I use scikit-llm for fine-tuning (on private data) for tasks such as summarization or question answering?

Thanks!

Sentiment

Would having a Sentiment.py in https://github.com/iryna-kondr/scikit-llm/tree/main/skllm/datasets be needed and essential? @iryna-kondr @Nadav-Barak

Predictions scores for the ZeroShotGPTClassifier

I am trying to use the ZeroShotGPTClassifier and evaluate the results based on the prediction score. Is there a way to get the prediction score?

The HuggingFace ZeroShotClassifier returns a dictionary with labels and scores as lists. I am looking for a way to score the labels. Otherwise, there is no way to evaluate the labels from the ZeroShotClassifier. Especially if it returns a random label when no label matches.

Missing support for using Azure OpenAI services.

Would love to see some support for using LLM in Azure OpenAI services. I couldn't find any such support but would be happy to build and collaborate with anyone interested.

The scope of this repo is far beyond than it can be imagined

This is an amazing repo, and I got so excited to see it, because I have been thinking about similar ideas lately. I took a quick look into the repo, and it seems the main aim (for now) is to support scikit-learn datasets, fuse them with GPT prompts, and have GPT models do the work. But I think there is more scope than just this.

Scikit-learn has been a huge ecosystem for ML, and it is (and will remain) what most organizations use. When it comes to tabular data (100M+ rows), I do not think Scikit-LLM can work. Yet there are still many real-world problems with this kind of tabular data and widely varying patterns, and those problems might not be solved with 'just' an LLM.

Also, scikit-learn models are much less of a black box than these models and hence easily interpretable, whereas these black boxes are far less interpretable, and we might not be able to prove locally why our model generates a given behavior for some data. These models are also not deterministic: change one small part of the input and the whole black box can go in a different direction.

However, think of it like this, in terms of a BFF (Backend for Frontend) design: instead of making GPT the back end of the computation, make it the front end. Provide the dataset link, a 'sample dataset', the problem statement, and the metadata; then, using LangChain-like tools and the existing scikit-learn ecosystem, we can tell these models to do the computation on bare scikit-learn and come up with the predictions. If, for example, something shows anomalous behaviour, we can use explainable AI tools like LIME/SHAP, put GPT on top of them, and perhaps generate interpretable reports with these local 'fit' curves and graphs. In this way we can automate a lot of the process while keeping the reliability factor in check.

This can then be used and 'deployed' in real-world systems, because the heavy lifting is still done by scikit-learn, with GPT as a front end. It all boils down to fitting the right information and the right instructions into the right place to get the results we want.

Some examples.

I provide a dataset link, the problem statement, and the metadata; it builds the ML model, stores it somewhere, and provides a training report for each stage of ML training.

Then I provide new data, for example a user (not present in the training data) with these 'unseen' features, and ask what the prediction for that user will be. On the back end it might run the prediction pipeline, and then we can ask a lot of follow-ups like:

Why did you predict this?
What if the input had been this, and how would the output have changed?

Even systems that use real-time ML could incorporate this, because real-time ML is highly dependent on interpretable and lightweight models, as speed and reliability are both independent. GPT as a query-engine interface on top could be used for enhanced telemetry or something else, like automatically generating the user behavior from data drift. All of it might be queried in natural language.

In that way we do not use a huge number of tokens, can provide less black-box results, and also have a use case for safe and less hallucinating AI. Please share your thoughts in general. I know this description got really big, but let me know; I am always up to discuss this more if my thinking is aligned with yours.

Thanks

Supporting local LLM api server and vLLM

Thanks for your great work!
Since https://github.com/lm-sys/FastChat can start a local server for llama2/vicuna whose API is quite similar to OpenAI's, would it be possible to support the FastChat API server so that we can run inference against a local API server?

Besides, is there any plan to support batch inference with https://github.com/vllm-project/vllm? The tabular data examples are similar, so batch inference with vLLM could speed up the whole process compared with gpt4all.
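
Because such servers expose an OpenAI-style HTTP API, a classification request can be sketched with the standard library alone. The base URL `http://localhost:8000/v1`, the model name `vicuna-7b-v1.5`, the prompt wording, and the helper names are assumptions for illustration. The sketch follows the OpenAI chat-completions request and response shape and has not been checked against a particular FastChat or vLLM version.

```python
import json
import urllib.request
from typing import Sequence


def build_chat_request(
    base_url: str, model: str, text: str, labels: Sequence[str]
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat-completions request."""
    prompt = (
        f"Classify the text into one of: {', '.join(labels)}.\n"
        f"Text: {text}\n"
        "Answer with the label only."
    )
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(base_url: str, model: str, text: str, labels: Sequence[str]) -> str:
    """Send the request; this requires a running OpenAI-compatible server."""
    req = build_chat_request(base_url, model, text, labels)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()
```

Only `base_url` changes between a hosted and a local endpoint in this sketch, which is why an OpenAI-compatible server is an attractive integration point.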

How we can use custom train model for predictions

Let us assume we have built a model on our own custom labeled data.
We can save the model as a pickle file; while testing, we can load that pickle file and make predictions. Is this functionality available in the current implementation? If yes, please share a sample notebook or code for it.

thanks
chandra
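
The save-and-reload workflow described above is the usual one for scikit-learn estimators. The sketch below uses scikit-learn's `DummyClassifier` as a stand-in; whether a particular scikit-llm estimator pickles cleanly is an assumption that would need to be checked, and the file name `model.pkl` is arbitrary.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.dummy import DummyClassifier

# Stand-in estimator trained on toy data; replace with your own fitted model.
X_train = [[0], [1], [2], [3]]
y_train = ["neg", "pos", "pos", "pos"]
clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"

    # Save the fitted estimator to disk.
    with path.open("wb") as f:
        pickle.dump(clf, f)

    # Later (for example in a test script): load it back and predict.
    with path.open("rb") as f:
        restored = pickle.load(f)

predictions = restored.predict([[10], [11]])
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.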