
llmtime's Introduction

Large Language Models Are Zero Shot Time Series Forecasters

This repository contains the code for the paper Large Language Models Are Zero Shot Time Series Forecasters by Nate Gruver, Marc Finzi, Shikai Qiu and Andrew Gordon Wilson (NeurIPS 2023).


We propose LLMTime, a method for zero-shot time series forecasting with large language models (LLMs) by encoding numbers as text and sampling possible extrapolations as text completions. LLMTime can outperform many popular time series methods without any training on the target dataset (i.e., zero-shot). The performance of LLMTime also scales with the power of the underlying base model. However, models that undergo alignment (e.g., RLHF) do not follow the scaling trend. For example, GPT-4 demonstrates inferior performance to GPT-3.
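To make the idea concrete, here is a minimal sketch of encoding a series as text and decoding a sampled completion. This is our own illustration, not the repository's serialization code (which lives in data/serialize.py and handles scaling, signs, and separators more carefully); the sample_from_llm call is a placeholder.

# Minimal illustration of the LLMTime encoding idea (not the repo's implementation).
import numpy as np

def serialize(values, prec=2, time_sep=", "):
    # Render each value as a fixed-precision digit string joined by a separator.
    return time_sep.join(f"{v:.{prec}f}" for v in values)

def deserialize(text, time_sep=", "):
    # Parse an LLM completion back into numbers, skipping malformed tokens.
    vals = []
    for tok in text.split(time_sep):
        try:
            vals.append(float(tok))
        except ValueError:
            continue
    return np.array(vals)

history = np.sin(np.linspace(0, 4 * np.pi, 40)) + 1.0
prompt = serialize(history) + ", "
# completion = sample_from_llm(prompt)   # hypothetical call: sample a text continuation
# forecast = deserialize(completion)     # the continuation, decoded back to numbers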

🛠 Installation

Run the following command to install all dependencies in a conda environment named llmtime. Change the CUDA version for torch if you don't have CUDA 11.8.

source install.sh

After installation, activate the environment with

conda activate llmtime

If you prefer not to use conda, you can also install the dependencies listed in install.sh manually.

If you want to run OpenAI models through their API (no GPU required), add your OpenAI API key to ~/.bashrc with

echo "export OPENAI_API_KEY=<your key>" >> ~/.bashrc

Finally, if you use a different OpenAI API base, change it in your ~/.bashrc with

echo "export OPENAI_API_BASE=<your base url>" >> ~/.bashrc

🚀 Trying out LLMTime

Want a quick taste of the power of LLMTime? Run the quick demo in the demo.ipynb notebook. No GPUs required!

🤖 Plugging in other LLMs

We currently support GPT-3, GPT-3.5, GPT-4, Mistral, and LLaMA 2. It's easy to plug in other LLMs by specifying how to generate text completions from them in models/llms.py (see the sketch at the end of this section).

To run Mistral models, add your Mistral API key to ~/.bashrc with

echo "export MISTRAL_KEY=<your key>" >> ~/.bashrc

💡 Tips

Here are some tips for using LLMTime:

  • Performance is not too sensitive to the data scaling hyperparameters alpha, beta, basic. A good default is alpha=0.95, beta=0.3, basic=False. For data exhibiting symmetry around 0 (e.g. a sine wave), we recommend setting basic=True to avoid shifting the data.
  • The recently released gpt-3.5-turbo-instruct seems to require a lower temperature (e.g. 0.3) than other models, and tends to not outperform text-davinci-003 from our limited experiments.
  • Tuning hyperparameters based on validation likelihoods, as done by get_autotuned_predictions_data, will often yield better test likelihoods, but won't necessarily yield better samples (a usage sketch follows these tips).
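For reference, here is a usage sketch of the tuning flow from demo.ipynb. The hyperparameter values and the assumed grid_iter behavior are illustrative, not the paper's tuned settings.

# Sketch of hyperparameter search over validation likelihood, as in demo.ipynb.
# The candidate values are illustrative only.
from data.serialize import SerializerSettings
from models.utils import grid_iter
from models.llmtime import get_llmtime_predictions_data
from models.validation_likelihood_tuning import get_autotuned_predictions_data

gpt_hypers = dict(
    model='gpt-3.5-turbo-instruct',
    temp=0.3,                      # lower temperature for gpt-3.5-turbo-instruct (see tips)
    alpha=[0.9, 0.95, 0.99],       # list-valued entries are expanded into a grid (assumed behavior)
    beta=[0.3, 0.5],
    basic=[False],
    settings=[SerializerSettings(base=10, prec=3, signed=True)],
)
hypers = list(grid_iter(gpt_hypers))
# train and test are pandas Series, e.g. from data.small_context.get_datasets()
# pred_dict = get_autotuned_predictions_data(train, test, hypers, 10,
#                                            get_llmtime_predictions_data,
#                                            verbose=False, parallel=False)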

📊 Replicating experiments in paper

Run the following commands to replicate the experiments in the paper. The outputs will be saved in ./outputs/. You can use visualize.ipynb to visualize the results. We also provide precomputed outputs used in the paper in ./precomputed_outputs/.

Darts (Section 4)

python -m experiments.run_darts

Monash (Section 4)

You can download the preprocessed data from here or use the following command:

gdown 'https://drive.google.com/uc?id=1sKrpWbD3LvLQ_e5lWgX3wJqT50sTd1aZ'

Then extract the data (the extracted data will be in ./datasets/monash/)

tar -xzvf monash.tar.gz

Then run the experiment

python -m experiments.run_monash

Synthetic (Section 5)

python -m experiments.run_synthetic

Missing values (Section 6)

python -m experiments.run_missing

Memorization (Appendix B)

python -m experiments.run_memorization

Citation

Please cite our work as:

@inproceedings{gruver2023llmtime,
    title={{Large Language Models Are Zero Shot Time Series Forecasters}},
    author={Nate Gruver and Marc Finzi and Shikai Qiu and Andrew Gordon Wilson},
    booktitle={Advances in Neural Information Processing Systems},
    year={2023}
}

llmtime's People

Contributors

andrewgordonwilson, carmarpe, eltociear, kashif, ngruver, nkulkarni, shikaiqiu


llmtime's Issues

Question about the continuous likelihood.

Hello. I am a master's degree student at Korea University.

First of all, I really appreciate the inspiration I got from your interesting paper "Large Language Models Are Zero-Shot Time Series Forecasters", and a big congratulations on being published at NeurIPS 2023!

I have read the paper many times, but I can't understand the "continuous likelihood" part.

The first thing is the expression p(u_1, ..., u_n) = p(u_n | u_{n-1}, ..., u_0) * p(u_1 | u_0) * p(u_0). It is related to the hierarchical softmax, but I don't understand it 100%. If this part is meant to be the definition of a general language model, it should be p(u_1, ..., u_n) = p(u_n | u_{n-1}, ..., u_0) * ... * p(u_1 | u_0) * p(u_0).
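Written out fully, the standard autoregressive chain-rule factorization I have in mind is

$$p(u_0, u_1, \ldots, u_n) = p(u_0)\,\prod_{k=1}^{n} p\big(u_k \mid u_{k-1}, \ldots, u_0\big).$$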

The second thing is the definition of U_k(x). I think U_k(x) should just be composed of an indicator function; I can't understand the reason for the B^n term in the definition.

Thank you.

Size of the test set of the Informer datasets

Hi Nate!

I just scanned through your marvelous work. I found that the precomputed output of Autoformer on the Informer datasets is substantially smaller than the original test set. You mention in your paper that the test set was narrowed, but what is the actual size of the test set?

add suggestions for usage requirements for OpenAI

After running and debugging the demo notebook a bit I got the following error message

RateLimitError: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.

I only have a free account on OpenAI. Could you provide in the documentation files and/or the demo notebook some indication of how much usage might be needed to run the demo script once, or, say, 10 times? I will check my usage logs (although they don't appear to be updated in real time), but it would be helpful to have a sense of how much a run of one of these models churns through API limits, and how different model parameters might change that. Thanks!

Question on mistral.py code

In models/mistral.py, in the function mistral_completion_fn:

batch = {k: v.repeat(batch_size, 1) for k, v in batch.items()}

Why do you need to repeat the batch batch_size times?

I'm also a little confused by "batch" here: it seems the batches are generated here, so why doesn't it depend on "input_strs"?

import darts.models error

Hello, when I run demo.ipynb, I get the following error, and I am confused. Could you help me?

ImportError Traceback (most recent call last)
File /home/ssd2/mashichao/anaconda3/envs/llmtime_new/lib/python3.9/site-packages/sklearn/__check_build/__init__.py:45
44 try:
---> 45 from ._check_build import check_build # noqa
46 except ImportError as e:

ImportError: dlopen: cannot load any more object with static TLS

During handling of the above exception, another exception occurred:

ImportError Traceback (most recent call last)
/home/ssd2/mashichao/llmtime-main/demo.ipynb Cell 1 line 1
12 from models.utils import grid_iter
13 from models.promptcast import get_promptcast_predictions_data
---> 14 from models.darts import get_arima_predictions_data
15 from models.llmtime import get_llmtime_predictions_data
16 from data.small_context import get_datasets

File /home/ssd2/mashichao/llmtime-main/models/darts.py:3
1 import pandas as pd
2 from darts import TimeSeries
----> 3 import darts.models
4 import numpy as np
5 from darts.utils.likelihood_models import LaplaceLikelihood, GaussianLikelihood
...
to build the package before using it: run python setup.py install or
make in the source directory.

If you have used an installer, please check that it is suited for your
Python version, your operating system and your platform.

how to use local csv data to test?

Date,c1,c2,c3,c4,c5,c6,c7
2001/5/30,22,24,29,31,35,4,11
2001/6/2,15,22,31,34,35,5,12
2001/6/4,3,4,18,23,32,1,6
.......
My local CSV data looks like the above; how can I use your demo code with it?
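One rough way to adapt such a file for the demo pipeline, assuming you forecast one column at a time and that the demo's prediction functions accept pandas Series for train and test (the column name, file name, and split ratio below are arbitrary choices):

# Hypothetical adaptation of a local CSV to the train/test format used in demo.ipynb.
import pandas as pd

df = pd.read_csv('my_data.csv', parse_dates=['Date'], index_col='Date')
series = df['c1'].astype(float)            # pick one target column to forecast
split = int(len(series) * 0.8)             # 80/20 train/test split (arbitrary choice)
train, test = series.iloc[:split], series.iloc[split:]
# train and test can then be passed to e.g. get_llmtime_predictions_data(train, test, ...)
# in place of the series returned by data.small_context.get_datasets().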

Missing LLaMa from experiments

Hello,

Thanks for sharing the code for the exciting work.
It seems that LLaMa is not in the experiments you shared.
In Monash, llama is initialized with empty hyperparameters and is never called. Similarly, it is not initialized in other experiments.

Since it is an open-source model, it is easier to work with. Can you please share the code for it?

Thanks!

Question about the generate_predictions() function

When I run the demo.ipynb file without changing anything and try to get the autotuned predictions, GPT-3 works fine, but once I use GPT-4 or the PromptCast model, I get this error:


TypeError Traceback (most recent call last)
Cell In[9], line 6
4 hypers = list(grid_iter(model_hypers[model]))
5 num_samples = 2
----> 6 pred_dict = get_autotuned_predictions_data(train, test, hypers, num_samples, model_predict_fns[model], verbose=False, parallel=False)
7 out[model] = pred_dict
8 plot_preds(train, test, pred_dict, model, show_samples=True)

File /mnt/aamv_data/nimeesha_workspace/nimeesha_workspace/first_paper/AAMV/llmtime/models/validation_likelihood_tuning.py:119, in get_autotuned_predictions_data(train, test, hypers, num_samples, get_predictions_fn, verbose, parallel, n_train, n_val)
117 best_val_nll = float('inf')
118 print(f'Sampling with best hyper... {best_hyper} \n with NLL {best_val_nll:3f}')
--> 119 out = get_predictions_fn(train, test, **best_hyper, num_samples=num_samples, n_train=n_train, parallel=parallel)
120 out['best_hyper']=convert_to_dict(best_hyper)
121 return out

File /mnt/aamv_data/nimeesha_workspace/nimeesha_workspace/first_paper/AAMV/llmtime/models/promptcast.py:278, in get_promptcast_predictions_data(train, test, model, settings, num_samples, temp, dataset_name, **kwargs)
275 input_strs = None
276 if num_samples > 0:
277 # Generate predictions
--> 278 preds, completions_list, input_strs = generate_predictions(model, inputs, steps, settings, scalers,
279 num_samples=num_samples, temp=temp, prompts=prompts, post_prompts=post_prompts,
280 parallel=True, return_input_strs=True, constrain_tokens=False, strict_handling=True, **kwargs)
281 # skip bad samples
282 samples = [pd.DataFrame(np.array([p for p in preds[i] if p is not None]), columns=test[i].index) for i in range(len(preds))]

TypeError: models.promptcast.generate_predictions() got multiple values for keyword argument 'parallel'

Changing parallel=False to True, or removing the parameter from the function call altogether, doesn't work. What should I do?

Thank you!

Integrate Llama3

Context:

I am trying to integrate Llama 3 into the llmtime algorithm, but I am facing some issues that I assume are related to the tokenization (with Llama 2 and Mistral I don't have any problems). Below is the code of my llama3 file.

import torch
import numpy as np
from jax import grad, vmap
from tqdm import tqdm
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
from data.serialize import serialize_arr, SerializerSettings
import transformers

DEFAULT_EOS_TOKEN = "<|end_of_text|>"
DEFAULT_BOS_TOKEN = "<|begin_of_text|>"
DEFAULT_UNK_TOKEN = "<unk>"

loaded = {}


def llama3_model_string(model_size, chat):
    chat = "-chat" if chat else ""
    return f"meta-llama/Meta-Llama-3-8B"

def get_tokenizer(model):
    name_parts = model.split("-")
    model_size = name_parts[0]
    chat = len(name_parts) > 1
    assert model_size in ["8b", "70b"]
  
    tokenizer = AutoTokenizer.from_pretrained(llama3_model_string(model_size, chat), token="")
    transformers.logging.set_verbosity_error()

    special_tokens_dict = dict()
    if tokenizer.eos_token is None:
        special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
    if tokenizer.bos_token is None:
        special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN

    tokenizer.add_special_tokens(special_tokens_dict)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

    return tokenizer


def get_model_and_tokenizer(model_name, cache_model=False):
    if model_name in loaded:
        return loaded[model_name]
    name_parts = model_name.split("-")
    model_size = name_parts[0]
    chat = len(name_parts) > 1

    assert model_size in ["8b", "70b"]

    tokenizer = get_tokenizer(model_name)

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
   
    model = AutoModelForCausalLM.from_pretrained(
        llama3_model_string(model_size, chat), device_map="cuda:1", quantization_config=bnb_config, token=""
    )

    model.eval()
    if cache_model:
        loaded[model_name] = model, tokenizer
    return model, tokenizer


def tokenize_fn(str, model):
    tokenizer = get_tokenizer(model)
    return tokenizer(str)


def llama_nll_fn(
    model,
    input_arr,
    target_arr,
    settings: SerializerSettings,
    transform,
    count_seps=True,
    temp=1,
    cache_model=True,
):
    """Returns the NLL/dimension (log base e) of the target array (continuous) according to the LM
        conditioned on the input array. Applies relevant log determinant for transforms and
        converts from discrete NLL of the LLM to continuous by assuming uniform within the bins.
    inputs:
        input_arr: (n,) context array
        target_arr: (n,) ground truth array
        cache_model: whether to cache the model and tokenizer for faster repeated calls
    Returns: NLL/D
    """
    model, tokenizer = get_model_and_tokenizer(model, cache_model=cache_model)

    input_str = serialize_arr(vmap(transform)(input_arr), settings)
    target_str = serialize_arr(vmap(transform)(target_arr), settings)
    full_series = input_str + target_str

    batch = tokenizer([full_series], return_tensors="pt", add_special_tokens=True)
    batch = {k: v.cuda() for k, v in batch.items()}

    with torch.no_grad():
        out = model(**batch)

    good_tokens_str = list("0123456789" + settings.time_sep)
    good_tokens = [tokenizer.convert_tokens_to_ids(token) for token in good_tokens_str]
    bad_tokens = [i for i in range(len(tokenizer)) if i not in good_tokens]
    out["logits"][:, :, bad_tokens] = -100

    input_ids = batch["input_ids"][0][1:]
    logprobs = torch.nn.functional.log_softmax(out["logits"], dim=-1)[0][:-1]
    logprobs = logprobs[torch.arange(len(input_ids)), input_ids].cpu().numpy()

    tokens = tokenizer.batch_decode(
        input_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    input_len = len(
        tokenizer(
            [input_str],
            return_tensors="pt",
            add_special_tokens=True
        )[
            "input_ids"
        ][0]
    )
    input_len = input_len - 2  # remove the BOS token

    logprobs = logprobs[input_len:]
    tokens = tokens[input_len:]
    BPD = -logprobs.sum() / len(target_arr)

    # print("BPD unadjusted:", -logprobs.sum()/len(target_arr), "BPD adjusted:", BPD)
    # log p(x) = log p(token) - log bin_width = log p(token) + prec * log base
    transformed_nll = BPD - settings.prec * np.log(settings.base)
    avg_logdet_dydx = np.log(vmap(grad(transform))(target_arr)).mean()
    return transformed_nll - avg_logdet_dydx


def llama_completion_fn(
    model,
    input_str,
    steps,
    settings,
    batch_size=1,
    num_samples=20,
    temp=0.9,
    top_p=0.9,
    cache_model=True,
):
    avg_tokens_per_step = len(tokenize_fn(input_str, model)["input_ids"]) / len(
        input_str.split(settings.time_sep)
    )
    max_tokens = int(avg_tokens_per_step * steps)

    model, tokenizer = get_model_and_tokenizer(model, cache_model=cache_model)

    gen_strs = []
    for _ in tqdm(range(num_samples // batch_size)):
        batch = tokenizer(
            [input_str],
            return_tensors="pt",
            add_special_tokens=True
        )

        batch = {k: v.repeat(batch_size, 1) for k, v in batch.items()}
        batch = {k: v.cuda() for k, v in batch.items()}

        num_input_ids = batch["input_ids"].shape[1]

        good_tokens_str = list("0123456789" + settings.time_sep)
        good_tokens = [
            tokenizer.convert_tokens_to_ids(token) for token in good_tokens_str
        ]
        # good_tokens += [tokenizer.eos_token_id]
        bad_tokens = [i for i in range(len(tokenizer)) if i not in good_tokens]

        generate_ids = model.generate(
            **batch,
            do_sample=True,
            max_new_tokens=max_tokens,
            temperature=temp,
            top_p=top_p,
            bad_words_ids=[[t] for t in bad_tokens],
            renormalize_logits=True,
            pad_token_id=tokenizer.eos_token_id,
        )
        gen_strs += tokenizer.batch_decode(
            generate_ids[:, num_input_ids:],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False,
        )

    return gen_strs

I tested different hyperparameters similar to

llama3_hypers = dict(
    temp=1.0,
    alpha=0.99,
    beta=0.3,
    basic=True,
    settings=SerializerSettings(base=10, prec=3, time_sep=',', bit_sep='', plus_sign='', minus_sign='-', signed=True), 
)

experimenting with spaces etc.

Output

Usually, the generated string doesn't make any sense (zeros, empty strings, etc.).

Question

  1. Did you encounter any similar issues?
  2. Do you have any idea what might be the cause of the issue and how to resolve it?

Reproducibility of LLM-Time results on Informer datasets

Hi,
First, I want to thank you for your insightful paper and the valuable resources in your repository. I am currently attempting to replicate your results for the Informer datasets (ETTm2, exchange_rate, electricity, etc.). However, I was unable to find a run_informer.py file to facilitate this, as there is for Monash or Darts. Could you please guide me on how to reproduce these results using your code, especially with autoformer_dataset.py? Thank you in advance for your assistance and time.

How were the normalized scores aggregated?

Thank you for releasing the code! This is a very interesting piece of work. Congrats on the NeurIPS acceptance! 🎉

As per my understanding, you're aggregating normalized scores to report the final scaled score. It looks like you're using the arithmetic mean to aggregate the normalized scores. Please correct me if I am wrong.

Using the arithmetic mean may not be the best way of summarizing a normalized metric. This may lead to misleading conclusions. A better way to aggregate normalized scores is using the geometric mean. Please check this paper out for details:

Fleming, Philip J., and John J. Wallace. "How not to lie with statistics: the correct way to summarize benchmark results." Communications of the ACM 29.3 (1986): 218-221.

Based on the numbers in https://github.com/ngruver/llmtime/blob/main/precomputed_outputs/deterministic_csvs/monash.csv, here are the plots that I get using the arithmetic and geometric mean.

[Plots: Monash scores aggregated with the arithmetic mean vs. the geometric mean]
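For concreteness, a minimal sketch of the two aggregation choices over such a table (the exact column layout of monash.csv is an assumption here, and the geometric mean requires all scores to be positive):

# Sketch: arithmetic vs. geometric mean aggregation of normalized scores.
# Assumes one row per dataset and one column per method, all values positive.
import numpy as np
import pandas as pd

scores = pd.read_csv('precomputed_outputs/deterministic_csvs/monash.csv', index_col=0)
arithmetic = scores.mean(axis=0)
geometric = np.exp(np.log(scores).mean(axis=0))   # geometric mean, per method
print(pd.DataFrame({'arithmetic mean': arithmetic, 'geometric mean': geometric}))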

Prediction length for Monash benchmark

Hi, may I check how the baseline results for the Monash benchmark (Figure 4, e.g. Wavenet, Transform., DeepAR, etc.) were obtained? From my understanding of the codebase, it is using the huggingface monash_tsf dataset repository to obtain the Monash time series. The prediction length is based on this:

pred_len = len(val_example) - len(train_example)

My concern is that the prediction lengths from the huggingface dataset are different from the default prediction lengths in the Monash dataset. For example, solar 10 minutes from the hf dataset has a prediction length of 60, while the Monash baseline results use a prediction length of 1008. Please correct me if I am mistaken about anything here. Thank you!

Autoformer experiments

Hi, Thanks for the great repository. I could not find a run script for the datasets in the informer/autoformer papers. Are there plans to add them?

Reproducing the csv files used in figure-4

Hi @ngruver,

Thank you for making the source code available publicly!

I'm currently encountering issues when trying to replicate the MAE values shown in Figure-4 for the Darts dataset. Could you please clarify how the MAE values are calculated? These values are listed in "/precomputed_outputs/deterministic_csvs/darts_results_agg.csv" and seem to be derived from the pkl files located in "/precomputed_outputs/darts". When attempting to calculate the metrics, both using prediction samples and the median of predictions, the NMAE and NMSE metrics I obtain are significantly higher than those reported in "/precomputed_outputs/deterministic_csvs/darts_results_agg.csv". Below is the code snippet I've been using to compute these metrics with the pkl files for reference:

import pickle
from data.metrics import Evaluator

# Load '/precomputed_outputs/darts/AirPassengersDataset.pkl' into out_dict
with open('precomputed_outputs/darts/AirPassengersDataset.pkl', 'rb') as f:
    out_dict = pickle.load(f)
gp_results = out_dict['gp']
# `test` is the held-out series for the dataset (loaded separately)
# Computing metrics using the median prediction
median_results = Evaluator().evaluate(test.values.reshape(1, -1), gp_results['median'].reshape(1, 1, -1))
# Computing metrics using samples of predictions
sample_results = Evaluator().evaluate(test.values.reshape(1, -1), gp_results['samples'].reshape(1, 100, -1))

Best,
Srinath

text-davinci-003 has been deprecated && the results of demo are not good

Hi! Thank you for releasing the code! This is a very interesting piece of work. Congrats on the NeurIPS acceptance! 🎉

I ran into a problem when using your code. When I directly run demo.ipynb, I get this error:

Sampling with best hyper... defaultdict(<class 'dict'>, {'model': 'text-davinci-003', 'temp': 0.7, 'alpha': 0.95, 'beta': 0.3, 'basic': False, 'settings': SerializerSettings(base=10, prec=3, signed=True, fixed_length=False, max_val=10000000.0, time_sep=' ,', bit_sep=' ', plus_sign='', minus_sign=' -', half_bin_correction=True, decimal_point='', missing_str=' Nan'), 'dataset_name': 'AirPassengersDataset'}) 
 with NLL inf
  0%|          | 0/1 [00:00<?, ?it/s]
---------------------------------------------------------------------------
InvalidRequestError                       Traceback (most recent call last)
Cell In[3], line 11
      9 hypers = list(grid_iter(model_hypers[model]))
     10 num_samples = 10
---> 11 pred_dict = get_autotuned_predictions_data(train, test, hypers, num_samples, model_predict_fns[model], verbose=False, parallel=False)
     12 out[model] = pred_dict
     13 plot_preds(train, test, pred_dict, model, show_samples=True)

File [e:\Document\CodeSpace\OpenProject\llmtime-main\models\validation_likelihood_tuning.py:119](file:///E:/Document/CodeSpace/OpenProject/llmtime-main/models/validation_likelihood_tuning.py:119), in get_autotuned_predictions_data(train, test, hypers, num_samples, get_predictions_fn, verbose, parallel, n_train, n_val)
    117     best_val_nll = float('inf')
    118 print(f'Sampling with best hyper... {best_hyper} \n with NLL {best_val_nll:3f}')
--> 119 out = get_predictions_fn(train, test, **best_hyper, num_samples=num_samples, n_train=n_train, parallel=parallel)
    120 out['best_hyper']=convert_to_dict(best_hyper)
    121 return out

File [e:\Document\CodeSpace\OpenProject\llmtime-main\models\llmtime.py:228](file:///E:/Document/CodeSpace/OpenProject/llmtime-main/models/llmtime.py:228), in get_llmtime_predictions_data(train, test, model, settings, num_samples, temp, alpha, beta, basic, parallel, **kwargs)
    226 completions_list = None
    227 if num_samples > 0:
--> 228     preds, completions_list, input_strs = generate_predictions(completion_fn, input_strs, steps, settings, scalers,
    229                                                                 num_samples=num_samples, temp=temp, 
    230                                                                 parallel=parallel, **kwargs)
    231     samples = [pd.DataFrame(preds[i], columns=test[i].index) for i in range(len(preds))]
    232     medians = [sample.median(axis=0) for sample in samples]
...
    776         rbody, rcode, resp.data, rheaders, stream_error=stream_error
    777     )
    778 return resp

InvalidRequestError: The model `text-davinci-003` has been deprecated, learn more here: https://platform.openai.com/docs/deprecations

After checking OpenAI's deprecation page, I changed this code:

model_predict_fns = {
    'LLMTime GPT-3': get_llmtime_predictions_data,
    'LLMTime GPT-4': get_llmtime_predictions_data,
    'PromptCast GPT-3': get_promptcast_predictions_data,
    'ARIMA': get_arima_predictions_data,
}

to

model_predict_fns = {
    'LLMTime GPT-3.5': get_llmtime_predictions_data,
    # 'LLMTime GPT-4': get_llmtime_predictions_data,
    # 'PromptCast GPT-3': get_promptcast_predictions_data,
    'ARIMA': get_arima_predictions_data,
}

Here is the result I get. It doesn't seem better than ARIMA; the bold purple line is farther from the actual values. Is this because of gpt-3.5-turbo-instruct?
Could you please update the demo for the new API, or advise how to improve the performance? Or can the result plots in your paper only be obtained with text-davinci-003 or LLaMA 70B?
Sorry for taking up your time. Can you give me some help in your free time?

[Screenshots: LLMTime GPT-3.5 vs. ARIMA forecasts from the demo]

How to run Llama 70B?

Was there a specific command used to run the LLaMA 70B model, for example for model parallelism? What GPU configuration did the authors use?

Multiple independent variables

Can the zero-shot forecasting framework of LLMTime take more than one input variable along with the target variable when forecasting the target variable?

Description not found for p_extra

I couldn't find the reason in the Appendix for accounting for p_extra in the NLL/D calculation. Could you please comment on this step? If I missed something, can you please point me to the right place?

# adjust logprobs by removing extraneous and renormalizing (see appendix of paper)

I am also curious whether this function ensures a non-negative constraint on the return values.
Thanks in advance!
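For what it's worth, renormalizing over an allowed token set usually takes the following shape (an illustration of the general technique, not a verified excerpt from llmtime): the mass p_extra that the model assigns to disallowed tokens is discarded and the remaining distribution is rescaled, so each valid token's log-probability becomes log p(token) - log(1 - p_extra).

# Illustration of renormalizing a token log-probability after discarding the
# probability mass p_extra on disallowed tokens (general technique, not a
# verified excerpt from the repository).
import numpy as np

def renormalized_logprob(logprob_token, p_extra):
    # log( p(token) / (1 - p_extra) )
    return logprob_token - np.log(1.0 - p_extra)

# Example: a valid token with p = 0.2 while 50% of the mass is on disallowed tokens
print(renormalized_logprob(np.log(0.2), 0.5))   # == log(0.4)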
