jerryjliu / llama_index
LlamaIndex is a data framework for your LLM applications
Home Page: https://docs.llamaindex.ai
License: MIT License
I run into the following error when using gpt2 from Hugging Face:
ValueError: Error raised by inference API: Input is too long for this model, shorten your input or use 'parameters': {'truncation': 'only_first'} to run the model only on the first part.
Can the index not be built chunk by chunk? Or am I missing something?
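For what it's worth, the index constructors accept a chunk_size_limit kwarg (it is visible in the BaseGPTIndex.__init__ signature in the tracebacks further down this page); capping it may keep each inference call under the model's limit. A minimal sketch with an illustrative value:
from gpt_index import GPTListIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('data').load_data()
# cap chunk size so each call stays within gpt2's small context window (value illustrative)
index = GPTListIndex(documents, chunk_size_limit=512)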
I'm querying GPT with a table like:
Name:
Date:
Location:
SF:
I would love for the response to be just the answer alone, since I am writing the answer to an Excel file that already has the table, so my result ends up looking like:
Name: Name: Bob
Date: Date: 1/1/2023
Location: Location: New York
SF: SF: 2000 square feet
Months: Months: 20 months
Thank you!
While the SimpleDirectoryReader is very efficient and makes it simple to query a (set of) text file(s) with very few lines of code, it would be great if the data connector were able to ingest more file formats, such as PDF for example.
Hi! Hope all is well!
I'm writing and loading my index to AWS S3 using the following:
import json
from typing import Dict

from gpt_index import GPTListIndex, GPTSimpleVectorIndex
from gpt_index.docstore import DocumentStore  # import path may vary by version

def __serialize_index(self, gpt_idx: GPTListIndex):
    out_dict: Dict[str, dict] = {
        "index_struct": gpt_idx.index_struct.to_dict(),
        "docstore": gpt_idx.docstore.to_dict(),
    }
    return json.dumps(out_dict)

# note: the serializer is typed for GPTListIndex but deserialization builds a GPTSimpleVectorIndex
def __deserialize_index(self, serialized_idx: str):
    idx_dict = json.loads(serialized_idx)
    index_struct = GPTSimpleVectorIndex.index_struct_cls.from_dict(
        idx_dict["index_struct"])
    docstore = DocumentStore.from_dict(idx_dict["docstore"])
    print(index_struct, docstore)
    return GPTSimpleVectorIndex(docstore=docstore, index_struct=index_struct)
These two functions are used here:
def get(self, key):
    if key in self.cache:
        return self.cache[key]
    idx_json = self.s3_service.get_index_from_s3(key)
    if idx_json is None:
        return None
    print("json", idx_json)
    deserialized_idx = self.__deserialize_index(idx_json)
    self.__add_to_cache(key, deserialized_idx)
    return deserialized_idx

def flush(self, key, evict=True):
    prev_idx = self.cache[key]
    if evict:
        prev_idx = self.cache.pop(key)
    serialized_idx = self.__serialize_index(prev_idx)
    self.s3_service.write_index_to_s3(key, serialized_idx)
Unfortunately, when I restore my app, everything gets loaded and seems to work, but every time I try to make a query I get the following:
Empty Response
Reading into the logs, I found this:
> [query] Total LLM token usage: 0 tokens
> [query] Total embedding token usage: 3 tokens
which, weirdly, suggests the embedding is working but the LLM is not.
Hoping someone might have some insight!
Unless I'm doing something wrong, it doesn't seem that the SimpleDirectoryReader loads documents in subfolders. Would it be possible to add this?
Thank you!
I've been trying to generate a tree index, but I'm hitting OpenAI rate limits. The problem is that this forces me to start the index from scratch again, which is time-consuming and expensive.
If a rate limit gets hit, the index should retry, or at the very least save some sort of intermediate state so you can resume indexing later.
I am planning on integrating gpt_index into a project of mine. I would like to have semantic search over a document. Is this library as expensive as the pricing mentioned here -> https://gpttools.com/searchtokens ?
Or is it that the initial build costs a lot of tokens, and after the initial build it is much cheaper? Thanks.
I was wondering how I can use a custom prompt/template for the index.query() method.
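For reference, the pattern gpt_index documents is to pass a QuestionAnswerPrompt as the text_qa_template; a minimal sketch (the exact kwarg name and import path may differ across versions):
from gpt_index import GPTSimpleVectorIndex, QuestionAnswerPrompt

QA_PROMPT_TMPL = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)
qa_prompt = QuestionAnswerPrompt(QA_PROMPT_TMPL)

index = GPTSimpleVectorIndex.load_from_disk("index.json")
response = index.query("What did the author do growing up?", text_qa_template=qa_prompt)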
According to the documentation, GPTSimpleVectorIndex may be queried with "embedding" mode (last example of https://gpt-index.readthedocs.io/en/latest/how_to/embeddings.html). However, when I do so, I get an error.
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
index = GPTSimpleVectorIndex.load_from_disk(
    save_path="index.json",
    embed_model=embed_model,
)
response = index.query(
    query_string,
    mode="embedding",
)
error:
File "/usr/local/lib/python3.10/site-packages/gpt_index/indices/base.py", line 334, in query
return query_runner.query(query_str, self._index_struct)
File "/usr/local/lib/python3.10/site-packages/gpt_index/indices/query/query_runner.py", line 100, in query
query_cls = get_query_cls(index_struct_type, mode)
File "/usr/local/lib/python3.10/site-packages/gpt_index/indices/query/query_map.py", line 79, in get_query_cls
return MODE_TO_QUERY_MAP_SIMPLE[mode]
Is this an actual error, or is it not possible/recommended to use embeddings with GPTSimpleVectorIndex? If I use Tree or List indices, then when I save them to disk there are no actual embeddings, so I am confused about the correct set of classes to use for top-k embedding-based queries.
Small change, mostly to familiarize myself with the code.
In #57 we added some kwargs to some constructors. Below is a list of classes for which we can replace kwargs with llm_predictor:
Hey, I'm getting an error when building an index because gpt_index tries to go over the maximum context length.
openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 4169 tokens (3913 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.
To reproduce:
from gpt_index import GPTTreeIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader('data').load_data()
index = GPTTreeIndex(documents)
# save to disk
index.save_to_disk('index.json')
on this data: https://github.com/awsdocs/amazon-quicksight-user-guide
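One common workaround here is to configure a PromptHelper so chunks are sized to the model's window; a minimal sketch with illustrative numbers:
from gpt_index import GPTTreeIndex, PromptHelper, SimpleDirectoryReader

max_input_size = 4096    # model context window
num_output = 256         # tokens reserved for the completion
max_chunk_overlap = 20   # overlap between adjacent chunks

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

documents = SimpleDirectoryReader('data').load_data()
index = GPTTreeIndex(documents, prompt_helper=prompt_helper)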
Following the code in the gpt_index starter tutorial:
from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader
from IPython.display import Markdown, display
documents = SimpleDirectoryReader('data').load_data()
index = GPTSimpleVectorIndex(documents)
response = index.query("What did the author do growing up?")
print(response)
I am receiving "Process finished with exit code 0" as opposed to "The author wrote short stories and tried to program on an IBM 1401.", which I am supposed to receive according to the tutorial.
I don't think I've provided an API key yet. That sounds like something I would have needed to do to get a response from the LLM? However, I don't think the tutorial prompted me to put my key anywhere yet.
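For reference, the openai client reads the key from the OPENAI_API_KEY environment variable, so setting it before building the index should be enough:
import os

# must be set before the first OpenAI call; the key value is a placeholder
os.environ["OPENAI_API_KEY"] = "sk-..."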
I just ran into gpt_index, awesome project!
I took a quick look at the prompts docs and I see that it's geared towards summarization and QA. I have a project where I need to pass in a lot of context for text generation, so I was wondering if gpt_index could be used for that as well? I don't fully understand yet how gpt_index works, so excuse me if this is a dumb question ;)
Exciting project! I'm looking to build an AI assistant that can answer questions using hundreds of thousands of words of loosely organized notes as context. gpt-index seems like a promising route.
Attempting to load index_gatsby.json from disk yields the following KeyError.
In [12]: index = GPTTreeIndex.load_from_disk('examples/gatsby/index_gatsby.json')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-12-65a5668c7898> in <module>
----> 1 index = GPTTreeIndex.load_from_disk('examples/gatsby/index_gatsby.json')
~/code/github/jerryjliu/gpt_index/gpt_index/indices/base.py in load_from_disk(cls, save_path, **kwargs)
340 with open(save_path, "r") as f:
341 result_dict = json.load(f)
--> 342 index_struct = cls.index_struct_cls.from_dict(result_dict["index_struct"])
343 docstore = DocumentStore.from_dict(result_dict["docstore"])
344 return cls(index_struct=index_struct, docstore=docstore, **kwargs)
KeyError: 'index_struct'
This could be because index_gatsby.json is from an outdated version of the project. The Paul Graham essay example index loads just fine for me.
This was not installed with pip, but was required to use the package.
requirements.txt
gpt_index
numpy==1.23.5
pandas==1.5.2
torch
tensorflow
slack_sdk==3.19.5
generate_index.py
import gpt_index
reader = gpt_index.SlackReader(slack_token='XXX')
documents = reader.load_data(channel_ids=[
    'XXX',
    'XXX',
])
index = gpt_index.GPTTreeIndex(documents)
index.save_to_disk('gpt-index.json')
Running python3.10 ./generate_index.py returns this error:
Traceback (most recent call last):
File "/mnt/c/Users/Slach/Downloads/altinity.staff/src/github.com/altinity/slack-qa/./generate_index.py", line 1, in <module>
import gpt_index
File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/gpt_index/__init__.py", line 9, in <module>
from gpt_index.indices.keyword_table.base import GPTKeywordTableIndex
File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/gpt_index/indices/keyword_table/base.py", line 15, in <module>
from gpt_index.indices.base import (
File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/gpt_index/indices/base.py", line 17, in <module>
from gpt_index.indices.data_structs import IndexStruct
File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/gpt_index/indices/data_structs.py", line 9, in <module>
from dataclasses_json import DataClassJsonMixin
File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/dataclasses_json/__init__.py", line 2, in <module>
from dataclasses_json.api import (DataClassJsonMixin,
File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/dataclasses_json/api.py", line 6, in <module>
from dataclasses_json.cfg import config, LetterCase # noqa: F401
File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/dataclasses_json/cfg.py", line 5, in <module>
from marshmallow.fields import Field as MarshmallowField
File "/home/slach/venv/slack-qa/lib/python3.10/site-packages/marshmallow/__init__.py", line 3, in <module>
from packaging.version import Version
ImportError: cannot import name 'Version' from 'packaging.version' (/home/slach/venv/slack-qa/lib/python3.10/site-packages/packaging/version.py)
Could you suggest the proper library versions?
Now that we've added retries with exponential backoff in #215, it would be cool to add support for "picking up where you left off". From the example in #210:
>>> index = GPTTreeIndex(documents, prompt_helper=prompt_helper)
> Building index from nodes: 502 chunks
0/5029
10/5029
20/5029
30/5029
40/5029
50/5029
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
// stack trace and error
If we run index = GPTTreeIndex(documents, prompt_helper=prompt_helper) again, we'd have to start from the beginning. With 502 chunks above, that's a lot of computation we'd be redoing, not to mention token budget gone to waste!
It would be cool if this happened instead:
>>> index = GPTTreeIndex(documents, prompt_helper=prompt_helper)
> Building index from nodes: 502 chunks
> continuing from chunk 50:
50/5029
60/5029
...
// hopefully no errors this time
I can think of two ways this might be done:
1. gpt_index tracks some state that stores the results of the computation from the failed run. This might not be great, since tracking state like this is confusing.
2. Split the documents into smaller batches and build an index per batch with GPTTreeIndex(documents, prompt_helper=prompt_helper), then combine them. This might be possible today with index composability?
If we added support for this, I believe it would give developers more confidence to index larger sets of documents.
This would be a great way to explore large texts.
Something seems wrong with import paths. With a pip install of gpt_index, I see this error:
#+BEGIN_SRC jupyter-python
from gpt_index import GPTTreeIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader('../../../../bibliography/literature-summaries/').load_data()
documents
#+END_SRC
#+RESULTS:
:RESULTS:
# [goto error]
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
/var/folders/3q/ht_2mtk52hl7ydxrcr87z2gr0000gn/T/ipykernel_18523/2206965719.py in <cell line: 1>()
----> 1 from gpt_index import GPTTreeIndex, SimpleDirectoryReader
2 documents = SimpleDirectoryReader('../../../../bibliography/literature-summaries/').load_data()
3 documents
~/opt/anaconda3/lib/python3.8/site-packages/gpt_index/__init__.py in <module>
23 # readers
24 from gpt_index.readers.file import SimpleDirectoryReader
---> 25 from gpt_index.readers.google.gdocs import GoogleDocsReader
26 from gpt_index.readers.mongo import SimpleMongoReader
27 from gpt_index.readers.notion import NotionPageReader
ModuleNotFoundError: No module named 'gpt_index.readers.google'
:END:
Integrate FAISS embeddings with existing data structures (list, tree). For instance, we could have a FaissList where every document chunk is a vector; we accumulate the top document chunks by dot product, and then use the retrieved chunks to synthesize an answer.
We could even have a FaissTree where we use a similar summarization procedure as GPTTreeIndex, but we convert each piece of text to an additional embedding. Then traversal becomes dot product against embeddings rather than purely text-based reasoning.
The original design exercise of GPT index was to do text-based-only traversal, but now we can try focusing this on practical use.
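To make the retrieval step concrete, here is a minimal sketch of dot-product top-k lookup with FAISS (dimensions and data are illustrative):
import faiss
import numpy as np

d = 1536  # embedding dimension, e.g. OpenAI ada embeddings
chunk_vectors = np.random.rand(100, d).astype("float32")  # stand-in for real chunk embeddings

faiss_index = faiss.IndexFlatIP(d)  # inner product == dot-product similarity
faiss_index.add(chunk_vectors)

query_vector = np.random.rand(1, d).astype("float32")
scores, ids = faiss_index.search(query_vector, 5)  # top-5 chunks to hand to answer synthesis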
Follow up to this PR comment: https://github.com/jerryjliu/gpt_index/pull/85/files#r1045482090
When the PR above lands, the placement of the @llm_token_counter() decorator will be confusing, because it's done on methods found in both the base and implementation classes. Better to have inner (defined in subclass) and outer (defined in base class) methods. Pasting a chat from somewhere else:
we could define "outer" methods (build_index, query, insert) on the base class, and these outer methods call "inner" methods (_build_index, _query, _insert) that are abstract and implemented by subclasses. Then token counting could take place in the outer method since it's shared among subclasses.
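A minimal sketch of that outer/inner split (the decorator body and class names are illustrative, not the actual gpt_index implementation):
from abc import ABC, abstractmethod

def llm_token_counter():
    # illustrative stand-in for gpt_index's decorator factory
    def wrap(method):
        def wrapped(self, *args, **kwargs):
            # ... record token usage around the call ...
            return method(self, *args, **kwargs)
        return wrapped
    return wrap

class BaseGPTIndex(ABC):
    @llm_token_counter()
    def query(self, query_str: str):
        # outer method: shared token counting lives here
        return self._query(query_str)

    @abstractmethod
    def _query(self, query_str: str):
        """Inner method implemented by each subclass."""

class GPTListIndexExample(BaseGPTIndex):
    def _query(self, query_str: str):
        return f"answer to {query_str}"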
In the example notebook FaissIndexDemo.ipynb, when loading an index from disk and querying it, the following error is thrown:
76 if not query_obj._llm_predictor_set:
77 query_obj.set_llm_predictor(self._llm_predictor)
---> 79 return query_obj.query(query_str, verbose=self._verbose)
...
--> 180 raise ValueError("text_id not found in id_map")
181 int_id = self.id_map[text_id]
182 if int_id not in self.nodes_dict:
ValueError: text_id not found in id_map
Building a new index and querying it works.
I'm halfway there!
Sample code:
text = "...." # > 40k characters
documents = [Document(text)]
llm = OpenAI(temperature=0.7, model_name="text-curie-001")
llm_predictor = LLMPredictor(llm)
prompt_helper = PromptHelper.from_llm_predictor(llm_predictor)
index = GPTListIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
index.query(prompt, response_mode="tree_summarize")
Surprisingly, this seems to happen for all long texts. It doesn't happen when davinci is used, though, and it went unnoticed at first due to #182. Any way I can help in debugging? Which function/file should I look into?
File "/Users/tushar/PycharmProjects/transcription/transcription/gptindex.py", line 35, in _list_index_summarize
return str(index.query(prompt, response_mode="tree_summarize"))
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/base.py", line 322, in query
return query_runner.query(query_str, self._index_struct)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/query/query_runner.py", line 106, in query
return query_obj.query(query_str, verbose=self._verbose)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/utils.py", line 113, in wrapped_llm_predict
f_return_val = f(_self, *args, **kwargs)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/query/base.py", line 233, in query
response = self._query(query_str, verbose=verbose)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/query/base.py", line 222, in _query
response_str = self._give_response_for_nodes(
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/query/base.py", line 183, in _give_response_for_nodes
response = self.response_builder.get_response(
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/response/builder.py", line 239, in get_response
return self._get_response_tree_summarize(
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/response/builder.py", line 210, in _get_response_tree_summarize
root_nodes = index_builder.build_index_from_nodes(
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/indices/common/tree/base.py", line 103, in build_index_from_nodes
new_summary, _ = self._llm_predictor.predict(
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/langchain_helpers/chain_wrapper.py", line 96, in predict
llm_prediction = self._predict(prompt, **prompt_args)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/gpt_index/langchain_helpers/chain_wrapper.py", line 82, in _predict
llm_prediction = llm_chain.predict(**full_prompt_args)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/chains/llm.py", line 103, in predict
return self(kwargs)[self.output_key]
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/chains/base.py", line 146, in __call__
raise e
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/chains/base.py", line 142, in __call__
outputs = self._call(inputs)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/chains/llm.py", line 87, in _call
return self.apply([inputs])[0]
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/chains/llm.py", line 78, in apply
response = self.generate(input_list)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/chains/llm.py", line 73, in generate
response = self.llm.generate(prompts, stop=stop)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/llms/base.py", line 81, in generate
raise e
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/llms/base.py", line 77, in generate
output = self._generate(prompts, stop=stop)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/langchain/llms/openai.py", line 155, in _generate
response = self.client.create(prompt=_prompts, **params)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/openai/api_resources/completion.py", line 25, in create
return super().create(*args, **kwargs)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 115, in create
response, _, api_key = requestor.request(
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/openai/api_requestor.py", line 181, in request
resp, got_stream = self._interpret_response(result, stream)
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/openai/api_requestor.py", line 396, in _interpret_response
self._interpret_response_line(
File "/Users/tushar/PycharmProjects/transcription/venv/lib/python3.9/site-packages/openai/api_requestor.py", line 429, in _interpret_response_line
raise self.handle_error_response(
openai.error.InvalidRequestError: This model's maximum context length is 2049 tokens, however you requested 3159 tokens (2903 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.
Currently the table and tree indices all call GPT when building the index. This is not optimal because calls to GPT incur latency and cost. There are ways to build indices without invoking GPT, and to only invoke GPT at query time (when traversing the index).
We can start with the keyword table, for instance. We can develop a simple keyword extractor that extracts keywords without invoking GPT at all (both during index creation and query). We only need to invoke GPT when synthesizing an answer.
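A minimal sketch of such a GPT-free keyword extractor, using simple frequency counting (stopword list and tokenization are illustrative):
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for"}

def extract_keywords(text: str, max_keywords: int = 10) -> list[str]:
    # lowercase word tokens, drop stopwords and short tokens, keep the most frequent terms
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_keywords)]

print(extract_keywords("GPT Index builds a keyword table from document keywords."))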
Currently we do not return score/distance metrics for query results.
Can we add distance metrics for each SourceNode in the query response?
Define an interface where we can construct an index from mongo and save it to mongo. This will help us hook up data connectors to our core abstractions.
(Later on once we add insert/delete to indices we can also hook it up to this data store).
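A minimal sketch of what that interface could look like with pymongo, reusing the same index_struct/docstore serialization shown in the S3 example above (the collection layout and helper names are assumptions, not a gpt_index API):
from pymongo import MongoClient

from gpt_index import GPTSimpleVectorIndex
from gpt_index.docstore import DocumentStore  # import path may vary by version

client = MongoClient("mongodb://localhost:27017")
collection = client["gpt_index_db"]["indices"]

def save_index_to_mongo(key, index):
    # persist the index as its two constituent dicts under a stable key
    doc = {
        "_id": key,
        "index_struct": index.index_struct.to_dict(),
        "docstore": index.docstore.to_dict(),
    }
    collection.replace_one({"_id": key}, doc, upsert=True)

def load_index_from_mongo(key):
    doc = collection.find_one({"_id": key})
    if doc is None:
        return None
    index_struct = GPTSimpleVectorIndex.index_struct_cls.from_dict(doc["index_struct"])
    docstore = DocumentStore.from_dict(doc["docstore"])
    return GPTSimpleVectorIndex(index_struct=index_struct, docstore=docstore)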
Sample code:
llm = OpenAI(temperature=0.7, model_name="text-curie-001")
llm_predictor = LLMPredictor(llm)
prompt_helper = PromptHelper.from_llm_predictor(llm_predictor)
index = GPTListIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
index.query(prompt, response_mode="tree_summarize")
The breaking change seems to have happened in this commit: the response builder in query is created using the default davinci llm_predictor here, instead of the passed-in predictor. This is because query_runner doesn't pass in the llm_predictor while creating the query obj, and sets it later.
Follow-up to this PR comment: #85 (comment)
Most unit tests use similar mocks, so there's an opportunity to avoid repetition.
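One common way to factor that out is a shared pytest fixture around a single patch; a minimal sketch (the patch target follows the module path visible in the tracebacks above, but treat it as illustrative):
from unittest.mock import patch

import pytest

@pytest.fixture
def mock_llm_predict():
    # every test that requests this fixture sees the same canned LLM output
    with patch(
        "gpt_index.langchain_helpers.chain_wrapper.LLMPredictor.predict"
    ) as mock_predict:
        mock_predict.return_value = ("mock prediction", "formatted prompt")
        yield mock_predict

def test_build_index(mock_llm_predict):
    ...  # build an index here; every LLM call hits the mock above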
I work mostly with TypeScript and Node.js, and I think many others do. Any ideas on how to make this compatible? Is your API getting stable already? I guess it must be possible, then, to create a wrapper package in Node.js. I'd like to get in touch with anyone interested in this; I might make it.
It seems to be a library that contains the result of a massive amount of work over months. It is certainly the start of "AI search" that helps you answer questions appropriately against large knowledge bases. It is the thing I need.
No. We would need to create a TypeScript wrapper around their Python API if we want to use it within Node.js. This would be an enormous task, and we would also become dependent on an ever-changing library that might make choices I don't like. For such a crucial part, I think I would be better off creating my own Node.js implementation. Especially looking at the long term, this seems better.
I figured it would be a hassle to make this compatible with TypeScript, and since GPT indexation is at the core of my company, I decided I will at least try to make my own implementation in Node.js that uses similar principles.
I will try to replicate GPT Index as much as possible and needed in a typescript node.js package. Anyone that wants to help me: please get in touch; https://calendly.com/karsens
I'll regularly update my work in https://github.com/CodeFromAnywhere/gpt-index-js
I'm getting a RateLimitError when constructing a GPTSimpleVectorIndex, as given below:
from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex
documents = SimpleDirectoryReader('./data').load_data()
index = GPTSimpleVectorIndex(documents)
index.save_to_disk('./index.json')
I'm currently using OpenAI's free trial for testing and it has a 150,000 tokens / minute hard limit. Is there some way to add a delay between API calls?
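Until you're on a version with the built-in retries mentioned above (#215), one coarse workaround is to catch the error and back off before retrying the whole build; a minimal sketch (the helper name is mine, the error class is from openai<1.0, and note this redoes earlier work rather than resuming mid-build):
import time

from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader
from openai.error import RateLimitError

def build_with_backoff(documents, max_retries=5, base_delay=10.0):
    # retry the entire build with exponential backoff on rate-limit errors
    for attempt in range(max_retries):
        try:
            return GPTSimpleVectorIndex(documents)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

documents = SimpleDirectoryReader('./data').load_data()
index = build_with_backoff(documents)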
I ran the TokenPredictor.ipynb given in the examples and got the following error:
from gpt_index import GPTTreeIndex, MockLLMPredictor, SimpleDirectoryReader
documents = SimpleDirectoryReader('../paul_graham_essay/data').load_data()
llm_predictor = MockLLMPredictor(max_tokens=256)
index = GPTTreeIndex(documents, llm_predictor=llm_predictor)
KeyError Traceback (most recent call last)
d:\Capture\Python\OpenAI\GPT Index\examples\cost_analysis\TokenPredictor.ipynb Cell 7 in <cell line: 1>()
----> [1](vscode-notebook-cell:/d%3A/Capture/Python/OpenAI/GPT%20Index/examples/cost_analysis/TokenPredictor.ipynb#W6sZmlsZQ%3D%3D?line=0) index = GPTTreeIndex(documents, llm_predictor=llm_predictor)
File c:\Users\mmz\AppData\Local\Programs\Python\Python310\lib\site-packages\gpt_index\indices\tree\base.py:65, in GPTTreeIndex.__init__(self, documents, index_struct, summary_template, insert_prompt, num_children, llm_predictor, build_tree, **kwargs)
63 self.insert_prompt: TreeInsertPrompt = insert_prompt or DEFAULT_INSERT_PROMPT
64 self.build_tree = build_tree
---> 65 super().__init__(
66 documents=documents,
67 index_struct=index_struct,
68 llm_predictor=llm_predictor,
69 **kwargs,
70 )
File c:\Users\mmz\AppData\Local\Programs\Python\Python310\lib\site-packages\gpt_index\indices\base.py:86, in BaseGPTIndex.__init__(self, documents, index_struct, llm_predictor, docstore, prompt_helper, chunk_size_limit, verbose)
84 self._validate_documents(documents)
85 # TODO: introduce document store outside __init__ function
---> 86 self._index_struct = self.build_index_from_documents(
87 documents, verbose=verbose
88 )
File c:\Users\mmz\AppData\Local\Programs\Python\Python310\lib\site-packages\gpt_index\utils.py:113, in llm_token_counter.<locals>.wrap.<locals>.wrapped_llm_predict(_self, *args, **kwargs)
111 def wrapped_llm_predict(_self: Any, *args: Any, **kwargs: Any) -> Any:
112 start_token_ct = _self._llm_predictor.total_tokens_used
...
---> 18 num_text_tokens = len(globals_helper.tokenizer(prompt_args["text"]))
19 token_limit = min(num_text_tokens, max_tokens)
20 return " ".join(["summary"] * token_limit)
KeyError: 'text'
In trying to use this, you get an error if you haven't set up an OpenAI account with an API key. That should probably be noted in the README. I had to set up a paid account to use this. It would also be helpful to indicate what typical costs associated with this might be.
I did see there is some cost analysis in the index documentation (e.g. https://github.com/jerryjliu/gpt_index/blob/main/gpt_index/indices/keyword_table/README.md#faqadditional).
SQLIndexDemo is an interesting starting point; it would be helpful to have an example use case leveraging a pre-loaded database rather than creating a database from unstructured document parsing.
E.g. it is non-obvious how to correctly set documents/index_struct when the data is already loaded.
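For what it's worth, a minimal sketch of the pre-loaded case (the connection string and table name are placeholders, import paths may vary by version, and passing an empty documents list is my assumption for skipping document parsing):
from sqlalchemy import create_engine

from gpt_index import GPTSQLStructStoreIndex, SQLDatabase

engine = create_engine("sqlite:///existing.db")  # placeholder: any SQLAlchemy engine
sql_database = SQLDatabase(engine, include_tables=["city_stats"])

# the data already lives in the table, so no unstructured documents are parsed
index = GPTSQLStructStoreIndex(
    documents=[],
    sql_database=sql_database,
    table_name="city_stats",
)
response = index.query("Which city has the highest population?")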
Today, if I build a gpt index like this:
>>> index = GPTTreeIndex(documents, prompt_helper=prompt_helper)
> Building index from nodes: 502 chunks
0/5029
10/5029
...
This may take a while, and I'm blocked from doing anything else until then. (The same can be said for querying.)
If I'm building some app on top of GPT index, and have an endpoint to start the build, like below:
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/build', methods=['POST'])
def build():
    # get documents and prompt_helper
    index = GPTTreeIndex(documents, prompt_helper=prompt_helper)
    # do something with index
    return {
        "message": "build complete"
    }
then I have to wait for the build to complete before getting a response.
I'm looking for ideas on how to support running the build in the background and checking its status, something like below:
# same flask boilerplate

@app.route('/build', methods=['POST'])
def build():
    # stuff
    index = GPTTreeIndex(documents, prompt_helper=prompt_helper)
    # more stuff + produces an id to check on later
    return {
        "message": "started building the index",
        "task_id": id
    }

# stuff
def status():
    ...
    # get id
    # returns a message that says the build is complete, or is x% done
This should be possible with Python's threading library, or with a task queue like celery. However, it probably gets complicated depending on your application, e.g. if you have more than one process.
I'm currently thinking of ways to support this within gpt_index, whether it's adding extra functionality (without bloating the library) or adding some code samples somewhere so that no one's starting from scratch. If you have ideas, please feel free to add them here!
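For the single-process case, a minimal sketch with the threading library (the task registry, persistence path, and helper names are illustrative, not gpt_index APIs):
import threading
import uuid

from gpt_index import GPTTreeIndex

tasks = {}  # task_id -> "running" | "complete" | "failed"

def start_build(documents, prompt_helper):
    task_id = str(uuid.uuid4())
    tasks[task_id] = "running"

    def worker():
        try:
            index = GPTTreeIndex(documents, prompt_helper=prompt_helper)
            index.save_to_disk(f"{task_id}.json")  # persist so a status endpoint can hand it back later
            tasks[task_id] = "complete"
        except Exception:
            tasks[task_id] = "failed"

    threading.Thread(target=worker, daemon=True).start()
    return task_id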
documents = SimpleDirectoryReader('data').load_data()
File "/usr/local/lib/python3.10/site-packages/gpt_index/readers/file.py", line 33, in load_data
data = f.read()
File "/usr/local/Cellar/[email protected]/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
on this data: https://github.com/awsdocs/amazon-quicksight-user-guide/tree/main/doc_source
We need a proper suite of unit tests.
SQLAlchemy does not load view metadata in SQLDatabase:
def __init__(self, *args: Any, **kwargs: Any) -> None:
    """Init params."""
    super().__init__(*args, **kwargs)
    self.metadata_obj = MetaData(bind=self._engine)
    self.metadata_obj.reflect()
This causes the initialization of the index to fail with a KeyError in GPTSQLStructStoreIndex:
def __init__(...):
    ...
    table = self.sql_database.metadata_obj.tables[table_name]  # XXX no such table name
    ...
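If views are the missing piece, SQLAlchemy's reflect() accepts a views flag; a minimal sketch of a possible fix (whether this is the right default behavior for SQLDatabase is an assumption):
from sqlalchemy import MetaData, create_engine

engine = create_engine("sqlite:///existing.db")  # placeholder
metadata_obj = MetaData(bind=engine)
metadata_obj.reflect(views=True)  # views=True reflects database views in addition to tables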
More of a nice-to-have.
Since building the index involves calls to GPT-3, it might be nice to have a verbose output option that counts the number of tokens involved, which will help devs get a feel for the costs involved with using this package.
Here's a minimal example:
import faiss
from gpt_index import GPTFaissIndex, SimpleDirectoryReader

faiss_index = faiss.IndexFlatL2(1536)
documents = SimpleDirectoryReader("data").load_data()
index = GPTFaissIndex(documents, faiss_index=faiss_index)
index.save_to_disk("index_faiss.json", faiss_index_save_path="index_faiss_core.index")

new_index = GPTFaissIndex.load_from_disk(
    save_path="index_faiss.json", faiss_index_save_path="index_faiss_core.index"
)
print(new_index._faiss_index.ntotal)
Notice that 0 documents exist.
And just to verify that loading it regularly works, try this:
import faiss
faiss_index = faiss.read_index("index_faiss_core.index")
print(faiss_index.ntotal)
This was a weird one to track down, because it doesn't actually "fail". Instead, we always return Node 0 with a distance of infinity.
I poked around the code a bit but couldn't figure out where exactly the bug is. Separately, we should also include a quick sanity check to ensure that the faiss_index isn't empty if we're loading it from disk.
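The sanity check could be as small as this, right after reading the index back (placement inside load_from_disk is illustrative):
import faiss

faiss_index = faiss.read_index("index_faiss_core.index")
if faiss_index.ntotal == 0:
    raise ValueError("Loaded faiss index is empty; check faiss_index_save_path")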
pip install gpt-index leads to the error below:
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
Traceback (most recent call last):
File "F:\Python39\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 351, in <module>
main()
File "F:\Python39\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 333, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "F:\Python39\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "C:\Users\Mohammad\AppData\Local\Temp\pip-build-env-g92ox94r\overlay\Lib\site-packages\setuptools\build_meta.py", line 338, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "C:\Users\Mohammad\AppData\Local\Temp\pip-build-env-g92ox94r\overlay\Lib\site-packages\setuptools\build_meta.py", line 320, in _get_build_requires
self.run_setup()
File "C:\Users\Mohammad\AppData\Local\Temp\pip-build-env-g92ox94r\overlay\Lib\site-packages\setuptools\build_meta.py", line 484, in run_setup
super(_BuildMetaLegacyBackend,
File "C:\Users\Mohammad\AppData\Local\Temp\pip-build-env-g92ox94r\overlay\Lib\site-packages\setuptools\build_meta.py", line 335, in run_setup
exec(code, locals())
File "<string>", line 11, in <module>
File "F:\Python39\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 8: character maps to <undefined>
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.