
searchgpt's Introduction

Michael Wan

Hello there, I'm Michael Wan, a data scientist and open-source developer interested in the latest AI and new technology.

[GitHub stats and top-languages cards]

searchgpt's People

Contributors

1m4nt0, karabc, michaelthwan, panglydia, peterhcyuen


searchgpt's Issues

Got a bug here

I noticed that when I search for "In-situ Flow Information Telemetry", I encounter the error below:

openai.error.InvalidRequestError: This model's maximum context length is 8192 tokens, however you requested 12301 tokens (12301 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

The same thing happens when I run it locally.
Please check the text length limit you define to make sure it does not exceed the model's token limit.

Properly count tokens for OpenAI

Hi! Thanks for the project, it works surprisingly nicely. I checked the code and it seems that for now you check how much context can fit with the character length, which is really limiting the capabilities. In OpenAI, especially with gpt-3.5-turbo, most English words are really 1 token, so with 3000 token context you can get ~3k English words (in reality a bit less, but still close), not characters. This means that searchGPT will be able to pass 2-3x more web page content to the request so the results will improve, albeit it'll cost more.

The best library to use for tokenization for OpenAI right now is https://github.com/openai/tiktoken. For older models (everything except gpt-3.5-turbo) you can use the gpt2 encoding, for Turbo you have to use cl100k_base.
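
A minimal sketch of token-based counting with tiktoken, which could replace the character-based check (the helper name and the 3000-token budget here are illustrative, not the project's actual config):

import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    # Look up the encoding for the model; fall back to cl100k_base (gpt-3.5-turbo's encoding).
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Budget prompt content by tokens instead of characters.
MAX_CONTEXT_TOKENS = 3000

def fits_in_context(text: str) -> bool:
    return count_tokens(text) <= MAX_CONTEXT_TOKENS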

Remove character limit

When I type short search terms such as "spacex" or "apple", searchGPT doesn't allow them. To me this is pretty limiting and forces users to make the search term much longer than it needs to be.

Project enhancements for integration with other projects

You should consider splitting the backend from the frontend to make the project easier to integrate with other projects.
The backend could be published as a pip package and be usable without cloning the frontend as well.
For example, I would like to integrate the backend into my Discord bot, but at the moment I have to clone the entire project.

Better still, exposing the backend through a REST API would make developing a frontend much faster and easier, and it would attract more frontend developers, who could then use whatever framework they prefer.
Flask is a good framework, but it is far less widely used for frontends than React or Angular. A minimal sketch of such an API endpoint follows below.
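
A minimal sketch of what such a REST layer could look like in the existing Flask app (the /api/search route and the run_search placeholder are assumptions for illustration, not the project's actual code):

from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder for the real pipeline; the split-out backend package would be called here.
def run_search(query: str) -> dict:
    return {"answer": f"(answer for: {query})", "references": []}

@app.route("/api/search", methods=["GET"])
def api_search():
    query = request.args.get("q", "").strip()
    if not query:
        return jsonify({"error": "missing query parameter 'q'"}), 400
    return jsonify(run_search(query))

if __name__ == "__main__":
    app.run(port=5000)

A Discord bot or any JavaScript frontend could then call GET /api/search?q=... without cloning the repository.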

Wrong time perception in answers by ChatGPT

I think ChatGPT writes as if it were still September 2021 because we do not specify the current time in the prompt. This could be solved with a Python script that pulls the current time from a pair of redundant NTP servers and refreshes it in the prompt daily. The easy solution would be to add the current date to the prompt like this: "The current date: (currentDate)".
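
A small sketch of both variants, assuming the ntplib package for the NTP lookup (the prompt wording is illustrative):

from datetime import datetime, timezone

import ntplib

def current_date(server: str = "pool.ntp.org") -> str:
    # Prefer the NTP server; fall back to the system clock if it is unreachable.
    try:
        response = ntplib.NTPClient().request(server, timeout=3)
        now = datetime.fromtimestamp(response.tx_time, tz=timezone.utc)
    except Exception:
        now = datetime.now(timezone.utc)
    return now.strftime("%Y-%m-%d")

# Prepend the date so the model stops assuming its training cutoff.
prompt = f"The current date: {current_date()}\n\n" + "Summarize the search results below ..."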

Local-GPU LLM model (e.g. from huggingface) support

We support some API calls (OpenAI / GooseAI), but it would be nice if we could select a model that is small yet good enough to do the summarization work on a local GPU. A rough sketch follows the checklist below.

  • Study models
  • Code the local GPU call LLM module
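
A rough sketch of the local-GPU path using the Hugging Face transformers summarization pipeline (the BART checkpoint is only one candidate; choosing the actual model is the study task above):

from transformers import pipeline

# device=0 selects the first CUDA GPU; use device=-1 to run on CPU instead.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=0)

def summarize(text: str, max_length: int = 200, min_length: int = 50) -> str:
    # Summarize one chunk of extracted web text locally, truncating overlong input.
    result = summarizer(text, max_length=max_length, min_length=min_length, truncation=True)
    return result[0]["summary_text"]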

Connect OpenAI API

Enable the OpenAI API and use a Davinci model as the LLM API.
- Don't push the API key to git!

  • Enable an OpenAI API subscription
  • Compare the different model grades: Ada / Babbage / Curie / Davinci
  • Use Python to connect to the API (see the sketch after this list)
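
A minimal connection sketch using the pre-1.0 openai Python SDK that was current at the time (the environment-variable name is an assumption):

import os

import openai

# Read the key from the environment so it never ends up in git.
openai.api_key = os.environ["OPENAI_API_KEY"]

def complete(prompt: str, max_tokens: int = 256) -> str:
    # Call the Davinci completion endpoint and return the generated text.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0.2,
    )
    return response["choices"][0]["text"].strip()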

BingResultExtractor

Study how to get web search results; be able to extract Bing results (a call sketch follows the checklist).

  • Azure Bing API subscription
  • Python code to call Bing API
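
A sketch of calling the Azure Bing Web Search v7 API with requests (the result shaping into name/url/snippet is illustrative):

import os

import requests

BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"

def bing_search(query: str, count: int = 10) -> list:
    # Return name, URL and snippet for the top Bing web results.
    response = requests.get(
        BING_ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": os.environ["BING_SUBSCRIPTION_KEY"]},
        params={"q": query, "count": count, "textDecorations": False},
        timeout=10,
    )
    response.raise_for_status()
    pages = response.json().get("webPages", {}).get("value", [])
    return [{"name": p["name"], "url": p["url"], "snippet": p["snippet"]} for p in pages]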

VectorDB

Are there any plans to utilize a vector DB like Pinecone?

Reference Generator tuning

Currently using TF (term frequency) with score >= 5 as the rule for filtering correct references.
See if tuning is needed (e.g. too many responses end up with no reference).

Related code, in FootnoteService:

# Keep only results whose score clears the threshold.
SCORE_THRESHOLD = 5
result_within_scope_df = result_df[result_df['score'] >= SCORE_THRESHOLD]

Keyword extractor

Take the new Bing as a reference: extract keywords from the user's query to improve the efficiency/accuracy of searching (a sketch is below).
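
A possible sketch of LLM-based keyword extraction, reusing the same pre-1.0 openai SDK (the prompt wording is illustrative):

import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def extract_keywords(user_query: str) -> list:
    # Ask the model to reduce a natural-language question to search keywords.
    prompt = (
        "Extract the most important search keywords from the question below "
        "as a comma-separated list.\n\n"
        f"Question: {user_query}\nKeywords:"
    )
    response = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=32, temperature=0.0
    )
    raw = response["choices"][0]["text"]
    return [kw.strip() for kw in raw.split(",") if kw.strip()]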

Document index (online resource)

It is possible to retrieve the document index by pre-calculation, especially for offline documents.

  • Index online website - get document location, sentences, reference source
  • Store DB Index

E.g. retrieve doc index from ChatGPT
/xx/xx/ChatGPT.pptx, , Slide3
xx.com/xxx/chatgpt, , xx.com

Test zero/few-shot learning and prompt engineering

  • Play with Davinci API for summarization power
  • Try different prompts for the best result
  • Try intrinsic reference[1] footnote approach [blocked]

[1] i.e. instead of generating footnotes with a separate similarity module, try to have the model provide them itself; a hedged prompt sketch is below.
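
A hedged sketch of what an intrinsic-footnote prompt might look like (the instruction wording and numbering convention are assumptions, not the project's actual prompt):

# Number each source so the model can cite it inline, instead of attaching
# footnotes afterwards with a separate similarity module.
sources = [
    "[1] ChatGPT is a chatbot developed by OpenAI ...",
    "[2] It was launched in November 2022 ...",
]
prompt = (
    "Answer the question using only the numbered sources below. "
    "After each claim, cite the source number in square brackets.\n\n"
    + "\n".join(sources)
    + "\n\nQuestion: What is ChatGPT?\nAnswer:"
)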

Document indexer (offline document)

It is possible to retrieve the document index by pre-calculation, especially for offline documents.

  • Index offline document - get document location, sentences, reference source
  • Store in DB
  • Use pyterrier (see the sketch after the example below)

E.g. retrieve doc index from ChatGPT
/xx/xx/ChatGPT.pptx, , Slide3
xx.com/xxx/chatgpt, , xx.com
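
A rough sketch of indexing pre-extracted sentences with PyTerrier's DataFrame indexer (paths, column names and sample rows are illustrative):

import pandas as pd
import pyterrier as pt

if not pt.started():
    pt.init()

# One row per extracted sentence: an id, the text, and where it came from.
df = pd.DataFrame({
    "docno": ["d1", "d2"],
    "text": ["ChatGPT is a chatbot developed by OpenAI.", "It was launched in November 2022."],
    "source": ["/xx/xx/ChatGPT.pptx Slide3", "xx.com/xxx/chatgpt"],
})

indexer = pt.DFIndexer("./doc_index", overwrite=True)
index_ref = indexer.index(df["text"], df["docno"])

# Later, retrieve candidate sentences for a query with BM25.
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")
results = bm25.search("what is chatgpt")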

Frontend - More options

  • Make config options selectable from the UI
  • Make it possible to prefill certain fields from the query string (see the sketch after this list)
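
A small sketch of prefilling the search box from the query string, assuming the Flask route renders an index template (route, template and variable names are illustrative):

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    # /?q=spacex prefills the search box with "spacex" in the template.
    return render_template("index.html", prefill_query=request.args.get("q", ""))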

High memory usage of demo server

I have no idea why the app uses more than 512 MB of memory.
As you can see in the screenshot, I restarted the app at ~12pm, and after some calls from users the usage grew very fast (beyond 100% and into swap memory).

[memory-usage screenshot]

Improve Bing result extraction

Currently using <p>-tag extraction, which leaves a lot of useless text (sentences) in the text_df.
Need to keep only the useful parts; see the trafilatura sketch after the checklist.
If the text is still too long, get_prompt will trim it using the config's prompt_length_limit.

Checklist

  • beautifulsoup
  • trafilatura
  • other method?
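
A minimal sketch of the trafilatura option, which keeps only the main readable text of a page and drops navigation and other boilerplate:

import trafilatura

def extract_main_text(url: str) -> str:
    # Download the page and return only its main article text.
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return ""
    text = trafilatura.extract(downloaded, include_comments=False, include_tables=False)
    return text or ""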

Prompt dataset for testing performance of response

We may have a gut feeling of how good the responses from the current pipeline are,
but we want a broader range of prompts with different contexts; a seed set is sketched after the examples.

For example

  • Normal question (What is ChatGPT?)
  • Strange question (What is the best way to eat metal ball?)
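
A possible seed for such a dataset, built from the examples above (the field names are illustrative; extra categories would be up to the project):

# Extend this list with more categories as the pipeline evolves.
TEST_PROMPTS = [
    {"category": "normal", "prompt": "What is ChatGPT?"},
    {"category": "strange", "prompt": "What is the best way to eat metal ball?"},
]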

How do I open IP access and change ports?

I built it on the server and it works fine

(searchChat) root@fusang:~/docker/searchGPT/src# python flask_app.py
 * Serving Flask app 'website'
 * Debug mode: on
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 790-193-721

How do I change the bind address from 127.0.0.1 and the port from 5000?
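
Flask's built-in server binds to localhost:5000 by default; passing host and port to app.run() (or using the equivalent flask run --host/--port flags) changes that. Note that 0.0.0.0 exposes the development server to the whole network, so a production WSGI server should sit in front of it for anything public:

# In flask_app.py
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)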
