
chatgpt-retrieval-plugin's Issues

Tokenization Bug in services.chunks.py

Lines 81-89:
chunk_text = chunk_text.replace("\n", " ").strip()
if len(chunk_text) > MIN_CHUNK_LENGTH_TO_EMBED:
    chunks.append(chunk_text)
tokens = tokens[len(tokenizer.encode(chunk_text, disallowed_special=())):]

Replacing the \n characters in chunk_text with spaces changes the tokenization of the text. For example, ".\n\n" is encoded as the single token 382 by the tokenizer, while "." followed by spaces is encoded as [13, 256]. So when you encode chunk_text again, the length of the encoding changes, because ".\n\n" and ". " tokenize differently. Then, when you advance the token list via tokens = tokens[len(tokenizer.encode(chunk_text, disallowed_special=())) :], the index computed from len(tokenizer.encode(chunk_text, disallowed_special=())) is not the correct starting index of the first token after the chunk.
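A toy tokenizer makes the mismatch easy to see (a hypothetical 4-symbol vocabulary, not tiktoken, but the same greedy-merge behavior):

```python
# Greedy longest-match toy tokenizer: ".\n\n" is a single merged token,
# so replacing "\n" with spaces changes how many tokens the text encodes to.
VOCAB = {".\n\n": 0, ".": 1, " ": 2, "a": 3}

def encode(text: str) -> list[int]:
    tokens, i = [], 0
    while i < len(text):
        # Try the longest vocabulary piece first, like BPE merges do.
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                tokens.append(VOCAB[piece])
                i += len(piece)
                break
    return tokens

original = "a.\n\na"
consumed = encode(original)                       # 3 tokens: [3, 0, 3]
chunk_text = original.replace("\n", " ").strip()  # "a.  a"
reencoded = encode(chunk_text)                    # 5 tokens: [3, 1, 2, 2, 3]
assert len(consumed) != len(reencoded)
# Slicing `tokens` by len(reencoded) would therefore start the next chunk
# at the wrong position.
```

With tiktoken the same effect occurs whenever the replacement breaks up a merged punctuation-plus-newline token.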

Moreover, lines 73-77 contain a repeated decision condition:

if ( last_punctuation != -1 and last_punctuation > MIN_CHUNK_SIZE_CHARS and last_punctuation > MIN_CHUNK_SIZE_CHARS ):

The last_punctuation > MIN_CHUNK_SIZE_CHARS check appears twice.
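A minimal sketch of the deduplicated check (the MIN_CHUNK_SIZE_CHARS value and helper name below are assumed, for illustration only):

```python
MIN_CHUNK_SIZE_CHARS = 350  # assumed value, for illustration only

def should_cut_at(last_punctuation: int) -> bool:
    # One comparison instead of the duplicated pair on lines 73-77:
    return last_punctuation != -1 and last_punctuation > MIN_CHUNK_SIZE_CHARS

assert should_cut_at(400) is True
assert should_cut_at(100) is False
assert should_cut_at(-1) is False
```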

Plugins supported from OpenAI API

Hello there,

Great job you are doing and excited to see more!

I am thinking of providing a plugin, but I would like to consume it by querying the ChatGPT API instead of using the interface on the OpenAI website.
Would it be possible to query ChatGPT through the API and specify which plugin to use?

Something like this would be great:

import openai

openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
    ],
  plugins=["my_plugin"]
)

Why not use a vector search library instead of a service?

Hi, I found this a very interesting repo. It recommends some vector search services, and I wonder why not use vector search libraries instead of services. For personal use cases the retrieval candidate set is small, and all of these libraries support serialization, so the index can be stored on disk as an object. These libraries occupy less memory and are better suited to personal scenarios. In general, a library is more convenient in most scenarios, while a service is more scalable when the workload becomes heavy.

Libraries I recommend:

  • faiss: A library for efficient similarity search and clustering of dense vectors.
  • hora: 🚀 efficient approximate nearest neighbor search algorithm collections library written in Rust 🦀.
  • qdrant: Vector Search Engine and Database for the next generation of AI applications.

and for more libraries: awesome vector search
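For a sense of how little code the library route needs, here is a brute-force in-process nearest-neighbor search in plain numpy (a stand-in for what faiss or hora index far more efficiently; the vectors and query are made up for illustration):

```python
import numpy as np

# A tiny in-memory "index": rows are document embeddings (made-up 4-d vectors).
doc_vectors = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0, 0.0],
])

def top_k(query: np.ndarray, vectors: np.ndarray, k: int = 2) -> list[int]:
    """Return indices of the k rows most cosine-similar to `query`."""
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:k].tolist()

print(top_k(np.array([1.0, 0.1, 0.0, 0.0]), doc_vectors))  # -> [0, 2]
```

A real library replaces the exhaustive scan with an approximate index, but the calling pattern stays this simple.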

What's the difference between ChatGPTPluginRetriever and a VectorStore retriever?

I have a question about ChatGPTPluginRetriever and the LangChain VectorStore retriever. I want to use enterprise private data with ChatGPT. Several weeks ago, before the ChatGPT plugins existed, I intended to implement this with a LangChain vector store, but before I finished that work, ChatGPT plugins arrived. I have read the chatgpt-retrieval-plugin README.md, but I haven't had time to study the code in detail, so I'd like to know: what is the difference between ChatGPTPluginRetriever and a VectorStore retriever?

My guess is that ChatGPTPluginRetriever means that if you implement the "/query" interface, ChatGPT will call it with questions intelligently, much as ChatGPT asks questions in one of its reasoning chains, whereas a LangChain VectorStore works independently. Does anyone know if that is correct? Thanks very much.

P.S. English is not my mother tongue, so please excuse my poor English ^_^

404s when using the pinecone example

Hello,
I've gone through all the setup steps, and using the Pinecone example I was able to have the script set up my Pinecone index. I then set up the app on DigitalOcean (I also tried this locally using Poetry, with the same results). I download the dataset and prepare it, and then comes the issue:
When I try to POST to the API, using this code:

from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry

batch_size = 100
endpoint_url = "https://seashell-app-nhcyq.ondigitalocean.app/"
s = requests.Session()

# we set up a retry strategy to retry on 5xx errors
retries = Retry(
    total=5,  # number of retries before raising an error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
s.mount('http://', HTTPAdapter(max_retries=retries))

for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i + batch_size)
    # make a post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )

I'm seeing a long list of 404 messages like this: [chatgpt-retrieval-plugin] [2023-03-26 17:25:41] INFO: 10.244.15.94:41594 - "POST //upsert HTTP/1.1" 404....(lots more like this)

So I don't think my initial post request is working properly.

Then I run the Queries Code Blocks:

queries = data['question'].tolist()
# format into the structure needed by the /query endpoint
queries = [{'query': queries[i]} for i in range(len(queries))]
len(queries)

res = requests.post(
    "https://seashell-app-nhcyq.ondigitalocean.app/query",
    headers=headers,
    json={
        'queries': queries[:5]
    }
)
res

for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    print("-"*70+"\n"+query+"\n\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n"+"-"*70+"\n\n")

Output is:

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


When did the Scholastic Magazine of Notre dame begin publishing?


So although I see my queries, I don't get back any answers from the dataset. Any help much appreciated!

Redis doesn't return results when using query endpoint

I've added documents to Redis using the upsert API, and the documents were added successfully. Still, when I use the query API, the results returned are always empty. I have tried different file types (PDF, DOCX), but nothing is returned in the response.

can't install grpcio==1.47.5

Python version: 3.10 or 3.11
pip version: 23.0.1
System: macOS

pip install grpcio==1.47.5

Could not find <Python.h>. This could mean the following:
* You're on Ubuntu and haven't run apt-get install python3-dev.
* You're on RHEL/Fedora and haven't run yum install python3-devel or
dnf install python3-devel (make sure you also have redhat-rpm-config
installed)
* You're on Mac OS X and the usual Python framework was somehow corrupted
(check your environment variables or try re-installing?)
* You're on Windows and your Python installation was somehow corrupted
(check your environment variables or try re-installing?)

I tried many solutions but failed.
Should I switch to a previous Python version?

upsert endpoint failed with `AuthenticationError`

I've set up the project locally with the Redis datastore and started the application:

$ poetry run start
INFO:     Will watch for changes in these directories: ['E:\\code\\chatgpt-retrieval-plugin']
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [14784] using StatReload
INFO:     Started server process [18672]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

When I POST to /upsert I get a RetryError:

Error: RetryError[<Future at 0x1dfaf1e2140 state=finished raised AuthenticationError>]
INFO:     127.0.0.1:62070 - "POST /upsert HTTP/1.1" 500 Internal Server Error

My request body is:

{
    "documents": [
        {
            "text": "Redis is a real-time data platform that supports a variety of use cases for everyday applications as well as AI/ML workloads. Use Redis as a low-latency vector engine by creating a Redis database with the Redis Stack docker container. For a hosted/managed solution, try Redis Cloud. See more helpful examples of Redis as a vector database here.The database needs the RediSearch module (>=v2.6) and RedisJSON, which are included in the self-hosted docker compose above.Run the App with the Redis docker image: docker compose up -d in this dir.The app automatically creates a Redis vector search index on the first run. Optionally, create a custom index with a specific name and set it as an environment variable (see below).To enable more hybrid searching capabilities, adjust the document schema here."
        }
    ]
}

My env

DATASTORE=redis
OPENAI_API_KEY=<MY_API_KEY>

The OPENAI_API_KEY is valid for accessing the OpenAI API.
I don't have a BEARER_TOKEN because I'm using No Authentication Methods.

How can I get more information about the error?

query endpoint not returning data with Milvus

I've set up the project locally and inserted the example zip file using the Milvus datastore.

When I run a query command in Postman I'm not getting any results. Am I missing something?

Starting application

$ poetry run start
INFO:     Will watch for changes in these directories: ['/home/nathank/workspace/chatgpt-retrieval-plugin']
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [284968] using StatReload
INFO:     Started server process [285005]
INFO:     Waiting for application startup.
Attempting creation of Milvus default index
Creation of Milvus default index succesful
INFO:     Application startup complete.

process_zip

$ python process_zip.py --filepath '/home/XXX/Downloads/example.zip' 
Attempting creation of Milvus default index
Creation of Milvus default index succesful
Processed 0 documents
Error processing dump/__MACOSX/example/._document.pdf: EOF marker not found
Processed 0 documents
Error processing dump/__MACOSX/example/._Untitled presentation.pptx: File is not a zip file
Processed 0 documents
Error processing dump/__MACOSX/example/._document.txt: 'utf-8' codec can't decode byte 0xc8 in position 99: invalid continuation byte
Processed 0 documents
Error processing dump/__MACOSX/example/._document.docx: File is not a zip file
Processed 0 documents
extracted_text from dump/example/document.md
extracted_text from dump/example/document.docx
extracted_text from dump/example/Untitled presentation.pptx
extracted_text from dump/example/document.pdf
extracted_text from dump/example/document.txt
Upserting batch of 5 documents, batch 0
documents:  [Document(id='e238e070-7b0a-4db4-bf29-b06f0e4dff8b', text='This is an example markdown document\n', metadata=DocumentMetadata(source=<Source.file: 'file'>, source_id='document.md', url=None, created_at=None, author=None)), Document(id='396342ec-078d-435c-87a1-6abde4587326', text='This is a test word document', metadata=DocumentMetadata(source=<Source.file: 'file'>, source_id='document.docx', url=None, created_at=None, author=None)), Document(id='6c873341-dcc9-46a1-9412-5041f6dcf23f', text='This is an example Powerpoint \n', metadata=DocumentMetadata(source=<Source.file: 'file'>, source_id='Untitled presentation.pptx', url=None, created_at=None, author=None)), Document(id='f5bf06ad-8a86-470b-a7ac-23243c339efe', text='T h i s\ni s\na\nt e s t\nP D F\nd o c u m e n t', metadata=DocumentMetadata(source=<Source.file: 'file'>, source_id='document.pdf', url=None, created_at=None, author=None)), Document(id='e3b4677c-e1ca-4479-9122-8e8ec1ccf108', text='\ufeffThis is a test plaintext document', metadata=DocumentMetadata(source=<Source.file: 'file'>, source_id='document.txt', url=None, created_at=None, author=None))]
Upserting batch of size 5
Upserted batch successfully
Skipped 4 files due to errors or PII detection
dump/__MACOSX/example/._document.pdf
dump/__MACOSX/example/._Untitled presentation.pptx
dump/__MACOSX/example/._document.txt
dump/__MACOSX/example/._document.docx
(chatgpt-retrieval-plugin-py3.10) (base) 

Query

(screenshot of the query request returning no results)

hnswlib + sqlite integration

After building a bunch of different LLM apps, I found that most of them don't require much more than hnswlib + sqlite for retrieval. This combo scales up to millions of documents (think English-language-Wikipedia scale) and is a great way to get started with LLM applications without additional external service dependencies.

I just recently refactored a few of my projects into hnsqlite, which folks here might find useful.

I have a chatgpt-retrieval-plugin branch with this integrated as a datastore option; if folks are interested I can send a PR.
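As a rough sketch of the pattern (stdlib sqlite3 for document storage, with a brute-force numpy search standing in for the hnswlib index; the table layout and embeddings below are made up for illustration):

```python
import sqlite3
import numpy as np

# Documents live in sqlite; embeddings live in an in-memory matrix.
# (hnswlib would replace this matrix with a real ANN index.)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT)")
docs = ["redis is a datastore", "hnswlib is an ANN library", "sqlite stores rows"]
db.executemany("INSERT INTO docs (text) VALUES (?)", [(d,) for d in docs])

# Made-up 3-d embeddings, one row per document, same order as `docs`.
embeddings = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

def search(query_vec: np.ndarray, k: int = 1) -> list[str]:
    """Nearest docs by cosine similarity, then fetch the text from sqlite."""
    sims = embeddings @ query_vec / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec))
    ids = (np.argsort(-sims)[:k] + 1).tolist()  # sqlite rowids are 1-based
    rows = db.execute(
        f"SELECT text FROM docs WHERE id IN ({','.join('?' * len(ids))})", ids)
    return [r[0] for r in rows]

print(search(np.array([0.0, 0.9, 0.1])))  # -> ['hnswlib is an ANN library']
```

The appeal of the combo is exactly this: one file on disk, no network hop, and the ANN index is the only piece that needs a third-party library.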

Future Direction: User Interface

User Interface: Developing a user interface for managing documents and interacting with the plugin could improve the user experience.

I was thinking of including a standalone frontend web app that talks to the API (e.g. React app that runs on a separate port) but wanted to check with you if you had something different in your mind first.

Developer Access for testing e2e

While testing/running local plugins, I can as a developer fully automate tests and debugging, but I cannot do the same when the plugin interacts with OpenAI through the plugin store and hits the /query endpoint.

There is currently a waitlist. Is there an option for developers who are actively contributing to get early access for testing, validation, and feedback?

Published Recall numbers for `text-embedding-ada-002` on BEIR dataset

The workflow for using ChatGPT for generating answers from a restricted data set is a powerful one.

However, the generated answers aren't useful unless the relevant information is contained in the context.

I have searched high and low for Recall@100 over BEIR for text-embedding-ada-002, but cannot find it. You have published nDCG@10 over BEIR, but ChatGPT seems like it could almost be considered a "reranking" step, since it generates an answer given a larger context.

If this is not the right place to ask for such information, please direct me where I can find this information.

Plugins in other languages

Is it currently only possible to write ChatGPT plugins with chatgpt-retrieval-plugin in Python, or can I use any language supported by the OpenAI API?

OpenSearch provider

With OpenSearch being the managed AWS KNN solution, it would be great to develop an OpenSearch provider.

A provider like OpenSearch, which can serve as a single document store for both full-text search and vector-based semantic search, would be useful. As of v2.4, OpenSearch also supports KNN with metadata pre-filtering.

Please see the ElasticSearch issue for a similar PR. #52 (comment)

Google Cloud Run deployed failed

I followed the documentation and deployed this project on Google Cloud Run. Google Cloud Run required me to specify a port number, and I chose 8000. During deployment I encountered this error: "The user-provided container failed to start and listen on the port defined provided by the PORT=8000 environment variable." How can I solve this problem? Thanks
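Cloud Run injects the chosen port into the container through the PORT environment variable, and the container must bind to it. A minimal sketch of reading it before starting the server (the uvicorn call is commented out because the app's module path depends on your setup):

```python
import os

# Cloud Run sets PORT; fall back to the plugin's default of 8000 locally.
port = int(os.environ.get("PORT", "8000"))
print(f"binding to 0.0.0.0:{port}")

# import uvicorn
# uvicorn.run("server.main:app", host="0.0.0.0", port=port)
```

If the server hard-codes a port that differs from what Cloud Run assigned, the health check fails with exactly the error quoted above.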

Security: minio/minio:RELEASE.2022-03-17T06-34-49Z vulnerable to CVE-2023-28432

As part of the examples, the minio/minio:RELEASE.2022-03-17T06-34-49Z Docker image is used.

This version of MinIO is vulnerable to CVE-2023-28432:

In a cluster deployment, MinIO returns all environment variables, including MINIO_SECRET_KEY
and MINIO_ROOT_PASSWORD, resulting in information disclosure.

image: minio/minio:RELEASE.2022-03-17T06-34-49Z

API docs in addition to OpenAPI?

In this demo, OpenAPI is used for API documentation. I'm curious, would it be possible to use other documentation tools such as Swagger?

Retrieved Data - Still being trained? - API?

If we use this plugin, I assume the data being pulled could also be trained on for public use?

Can we use this plugin with the API, and does it cost more, given that it has to look through the data first?

upsert-file API does not store DocumentMetadata.source as 'file'

When I upload a file using the upsert-file API in the UI (http://localhost:8000/docs), DocumentMetadata.source is set to None.

As a result, the uploaded file is not retrieved, because source is None at query time.

For testing, I changed get_document_from_file in services/file.py as below:

async def get_document_from_file(file: UploadFile) -> Document:
    extracted_text = await extract_text_from_form_file(file)
    print(f"extracted_text: {extracted_text}")
    # get metadata
    metadata = DocumentMetadata()
    metadata.source = "file"  # added for testing
    doc = Document(text=extracted_text, metadata=metadata)
    print(doc)
    return doc

It works.
(screenshot of the query now returning the uploaded document)

FYI
OS: ubuntu 20.04 5.4.0-146-generic
DATASTORE: redis

Running in docker showing error that DATASTORE not found

I have added all the required variables for the DATASTORE.

I used 'pinecone' as the DATASTORE and added all the required variables for Pinecone.

This is how I run it in Docker:

  • docker build -t python-project-with-docker .
  • docker run -d --env-file ./.env -p 8000:8000 python-project-with-docker

But I get this error. I tried several other vector stores, but the same problem appears.

(screenshot of the DATASTORE-not-found error)

I am using Windows 10.

Running locally

How can I run it locally with Redis? I tried using the JSON example, but it does not start correctly.

[Question] What is the difference between id and document_id in Delete Request?

Hi,
I am trying to integrate an OpenSearch k-NN datastore (#78) into chatgpt-retrieval-plugin. I was looking at the DeleteRequest model here and got confused between id and document_id. Can someone explain the difference between them?

My understanding is that id is the string returned to the user after the upsert request, and document_id is the string the datastore uses to store the different chunks.

Bug in get_text_chunks() function in chunks.py

Lines 73-77 of services/chunks.py contain a repeated decision condition:

if ( last_punctuation != -1 and last_punctuation > MIN_CHUNK_SIZE_CHARS and last_punctuation > MIN_CHUNK_SIZE_CHARS ):

The last_punctuation > MIN_CHUNK_SIZE_CHARS check appears twice.

Milvus Integration Test Failed before starting test cases

Bug Report

ImportError while importing test module '/Users//matchbox/chatgpt-retrieval-plugin/tests/datastore/providers/milvus/test_milvus_datastore.py'.
Hint: make sure your test modules/packages have valid Python names.
E   ImportError: dlopen(/Users//Library/Caches/pypoetry/virtualenvs/chatgpt-retrieval-plugin-4n3DfVMU-py3.10/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-darwin.so, 0x0002): symbol not found in flat namespace '_CFRelease'

To Reproduce

I was able to get the Milvus docker-compose version up and running in Docker on my local machine. After that I ran the command pytest ./tests/datastore/providers/milvus/test_milvus_datastore.py, which produces the error above. I suspect this is a macOS issue; please verify. Thanks!

Your Environment

Python Version: 3.10.9
OS: MacOS, M1 ARM Chip
pytest 7.2.2

Query after upsert-file

Hello,

I was able to call upsert-file with a PDF document, and it successfully returned an id for the document. However, when I try to query, I get a 500 Internal Server Error.

In the console running the plugin, I can see the following error:
Error: None is not a valid Source

In my query request I only provided the document_id in the filter section, since the upsert-file operation does not allow providing any other metadata.

Could you please advise how to query when documents are added using upsert-file?

Thanks,
Marcelo

Query Redis backend fail with 'NewConnectionError'

1. Configure a Redis DATASTORE:

  • 9e6f17fc09d3 is the container id of chatgpt-retrieval-plugin
  • 192.168.2.159 is the IP address of my MacBook Pro
  • env list for Redis:
    root@9e6f17fc09d3:~# env | grep -i redis
    REDIS_HOST=192.168.2.159
    DATASTORE=redis
    REDIS_PORT=6379
  • telnet works from the chatgpt-retrieval-plugin container:
    root@9e6f17fc09d3:~# telnet 192.168.2.159 6379
    Trying 192.168.2.159...
    Connected to 192.168.2.159.
    Escape character is '^]'.
    ^]

2. Upsert an email like this:

{
  "documents": [
    {
      "id": "123456",
      "text": "my first email",
      "metadata": {
        "source": "email",
        "source_id": "111111",
        "url": "string",
        "created_at": "2023-3-26::10:00:00",
        "author": "tom"
      }
    }
  ]
}

3. Plugin log:
INFO: 172.17.0.1:63002 - "POST /upsert HTTP/1.1" 500 Internal Server Error
INFO: Started server process [8]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0xffff9e380c10>: Failed to establish a new connection: [Errno 111] Connection refused')': /v1/embeddings
............ 4 duplicated lines omitted ............
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0xffff9e382620>: Failed to establish a new connection: [Errno 111] Connection refused')': /v1/embeddings
Error: RetryError[<Future at 0xffff9e382980 state=finished raised APIConnectionError>]
INFO: 172.17.0.1:65224 - "POST /upsert HTTP/1.1" 500 Internal Server Error

Upsert example does not work for Pinecone

There seems to be a minimum length required for /upsert to work with the Pinecone vector DB.

If the content is too short, no IDs are returned and nothing can be queried (this is the example from the Deployment section):

curl -X POST http://0.0.0.0:8000/upsert \
>   -H "Authorization: Bearer AUTH_KEY" \
>   -H "Content-type: application/json" \
>   -d '{"documents": [{"id": "doc1", "text": "Hello world", "metadata": {"source_id": "12345", "source": "file"}}, {"text": "How are you?", "metadata": {"source_id": "23456"}}]}'
{"ids":[]}

When the text is long enough, IDs are returned and then queries work:

curl -X POST http://0.0.0.0:8000/upsert \
>   -H "Authorization: Bearer AUTH_KEY" \
>   -H "Content-type: application/json" \
>   -d '{"documents": [{"id": "doc1", "text": "This is the way This is the way This is the way This is the way This is the way This is the way This is the way This is the way This is the way This is the way This is the way This is the way This is the way", "metadata": {"source_id": "12345", "source": "file"}}, {"text": "How are you? How are you? How are you? How are you? How are you? How are you? How are you? How are you? How are you? How are you? How are you? How are you?", "metadata": {"source_id": "23456"}}]}'
{"ids":["doc1","1a709295-94fb-43b0-98e3-647ea5bbb028"]}
curl -X POST http://0.0.0.0:8000/query   -H "Authorization: Bearer AUTH_KEY"   -H "Content-type: application/json"   -d '{"queries": [{"query":"This is the way"}]}'
{"results":[{"query":"This is the way","results":[{"id":"doc1_0","text":"This is the way This is the way This is the way This is the way This is the way This is the way This is the way This is the way This is the way This is the way This is the way This is the way This is the way","metadata":{"source":"file","source_id":"12345","url":null,"created_at":null,"author":null,"document_id":"doc1"},"embedding":null,"score":0.926478267},{"id":"1a709295-94fb-43b0-98e3-647ea5bbb028_0","text":"How are you? How are you? How are you? How are you? How are you? How are you? How are you? How are you? How are you? How are you? How are you? How are you?","metadata":{"source":null,"source_id":"23456","url":null,"created_at":null,"author":null,"document_id":"1a709295-94fb-43b0-98e3-647ea5bbb028"},"embedding":null,"score":0.776206791}]}]}

Use with open-source models

What would it take to use this repo with, say, GPT-J, OPT, or other open-source models?

What customizations would have to be done?

Embeddings exist in the Redis database, but not in the query response

When posting a new document, the embeddings are created, as shown in the screenshot. However, when querying for documents, I always get this response, where embedding is null. I am using the Redis database.

{
    "results": [
        {
            "query": "how much a ride from hanoi to sapa will cost",
            "results": [
                {
                    "id": "liran-1",
                    "text": "hello world",
                    "metadata": {
                        "source": "email",
                        "source_id": "string",
                        "url": "string",
                        "created_at": "1679668478",
                        "author": "string",
                        "document_id": "liran-1"
                    },
                    "embedding": null,
                    "score": 0.307653447681
                },
                {
                    "id": "liran-2",
                    "text": "hello world",
                    "metadata": {
                        "source": "email",
                        "source_id": "string",
                        "url": "string",
                        "created_at": "1679669734",
                        "author": "string",
                        "document_id": "liran-2"
                    },
                    "embedding": null,
                    "score": 0.307653447681
                },
                {
                    "id": "a ride from sapa to hanoi cost 60 dollar",
                    "text": "hello world",
                    "metadata": {
                        "source": "email",
                        "source_id": "string",
                        "url": "string",
                        "created_at": "1679689220",
                        "author": "string",
                        "document_id": "a ride from sapa to hanoi cost 60 dollar"
                    },
                    "embedding": null,
                    "score": 0.307653447681
                }
            ]
        }
    ]
}

Example video of the plugin is showing factually incorrect outputs on climate change and UN

I work on climate change policy, and I'm afraid that in the example video (https://cdn.openai.com/chat-plugins/retrieval-gh-repo-readme/Retrieval-Final.mp4) the model shows multiple factually incorrect outputs:

  • 2018 is a weird year to select the 2015 Paris Agreement as being 'recognized as important' and to 'plan a summit for 2019': there is a UNFCCC summit each year (including one in 2018) and the communique from that always has standard language to recognize the Paris Agreement
  • IPCC and UNFCCC do not 'collaborate on a special report': it was written (as is standard) by IPCC only and then communicated to UNFCCC
  • the UN Convention on Biological Diversity has been running for over 15 years now (the last CBD meeting was COP15): 2020 was not the year of the first summit on biodiversity
  • the $100bn goal was a target for 2020 that was not met in 2021 (but is indeed the annual target for each year 2021-5)
  • COP26 was in 2021 not 2022
  • COP26 target was not a commitment to limit temperature to 1.5C but to make strong efforts towards that level (the main target remains 'well below 2C')
  • just transition is not only about transitioning to renewables

Overall, the table in the video is a pretty terrible description of how climate thinking has evolved, with factual errors and quite weird things highlighted. I'm generally super excited about this plugin and GPT in general, but perhaps this is a good reminder that these LLMs can give misleading results!

Privacy of data added

Just a question:

Say we use this plugin: will the data stay private to one's own use? Could that information 'leak' into the corpus for the next version of GPT?

405 method not allowed

Hello, has anyone encountered the error "405 Method Not Allowed" after launching the app?

After exporting all the required tokens and launching the app, I'm not allowed to use the services. It seems to be related to the bearer token preventing me from authenticating, but I'm not sure what's going wrong.

export DATASTORE=pinecone
export PINECONE_API_KEY=$PINECONE_API_KEY
export PINECONE_ENVIRONMENT=us-central1-gcp
export PINECONE_INDEX=openai-retrieval-app
export BEARER_TOKEN=$BEARER_TOKEN
export OPENAI_API_KEY=$OPENAI_API_KEY
poetry run start

I get these errors:

INFO:     Started server process [2939775]
INFO:     Waiting for application startup.
Connecting to existing index openai-retrieval-app
Connected to index openai-retrieval-app successfully
INFO:     Application startup complete.
INFO:     127.0.0.1:38326 - "GET / HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:38326 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:38332 - "GET /query HTTP/1.1" 405 Method Not Allowed

Any help please? thanks
