
be_great's People

Contributors

jingbels, kathrinse, sebffischer, tleemann, unnir, zhao-zilong


be_great's Issues

Searching for Source Code to Replicate the Paper's Experiments and Evaluations

Hi team!

First, I need to say how excited I am about this project and the incredible research you are doing! The article is full of useful information, and it is clear that a lot of effort and dedication went into achieving these results. Congratulations on the excellent work!

I was browsing the repository and getting ready to experiment and play around with the code a bit. I was especially interested in trying to replicate the experiments that you guys did and see if I could recreate the metrics and graphs in the article.

However, I seem to have stumbled over a small snag: I have not been able to locate the corresponding source code. Maybe I missed it somewhere, or perhaps it's not here yet?

I know how big these projects can be, and sometimes it's easy to forget to add everything, but could you help me find this code? I believe that not only I, but also other researchers and project enthusiasts, would be extremely grateful!

Thanks again for the fantastic job you are doing! I look forward to hearing from you and to continuing to explore this amazing project.

Best regards,

Sildolfo Gomes.

Cannot be run on one 4090 GPU

Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set NCCL_P2P_DISABLE="1" and NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.
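For reference, the same workaround can be applied from Python before anything initializes CUDA/NCCL (a minimal sketch; exporting the variables in the shell, or using accelerate launch, works equally well):

import os

# Disable the P2P and InfiniBand NCCL transports, as the warning suggests
# for RTX 4000-series GPUs; this must run before CUDA/NCCL is initialized.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"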

An error has occurred: Breaking the generation loop!

Hi guys,

After training a model to create synthetic data, I get the following message:

An error has occurred: Breaking the generation loop!
To address this issue, consider fine-tuning the GReaT model for an longer period. This can be achieved by increasing the number of epochs.
Alternatively, you might consider increasing the max_length parameter within the sample function. For example: model.sample(n_samples=10, max_length=2000)
If the problem persists despite these adjustments, feel free to raise an issue on our GitHub page at: https://github.com/kathrinse/be_great/issues

I increased the max_length parameter and the number of epochs, but still get the issue.

Here is my code:

from be_great import GReaT
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df, train_size=0.01)
model = GReaT(llm='distilgpt2', batch_size=8, epochs=4)
model.fit(X_train)

I take only a part of my initial dataset to reduce training time (the full dataset has 800K rows).

Has anyone already had this issue and found out how to solve it?

Thanks

How many samples are used for the classification and regression?

Hi, thanks for the great work.
I'm interested in re-implementing the experiments from your paper.
Could you please provide information on the number of samples used for both regression and classification?
I've looked through the paper but couldn't find this detail.

Unable to sample

Hi, I trained the model on tabular data with high-cardinality categorical columns and text.

The system is failing when creating synthetic data.

Even after increasing max_length to 10K, the same error persists. Any ideas on how to resolve this?

Modifying the AutoModelForCausalLM

I just want to start by saying I love the work that has been done on this project. Here is the issue I'm having:

When the model is loaded from Hugging Face, it would be great to be able to select the parameters of AutoModelForCausalLM:
self.model = AutoModelForCausalLM.from_pretrained(self.llm)

It works great with small models like GPT2, but when we advance to larger models (e.g. mistralai/Mistral-7B-Instruct-v0.1) the GPU quickly runs out of memory. I can generally get around this by using BitsAndBytesConfig to minimize the memory required for the LLM, but that requires passing additional arguments to AutoModelForCausalLM, e.g.:

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
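Until such a parameter is exposed, one possible workaround is to overwrite the loaded backbone after construction. This is only a sketch, not an official API: it assumes the backbone is stored on the model attribute (as the snippet above suggests), and note that the constructor still loads the full-precision weights into CPU RAM first.

from be_great import GReaT
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True)  # illustrative quantization config

great = GReaT(llm="mistralai/Mistral-7B-Instruct-v0.1", batch_size=8, epochs=4)
# Swap the full-precision backbone that GReaT loaded for a quantized one
great.model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)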

AssertionError: Torch not compiled with CUDA enabled

Hello,

I'm preparing a blog post about CTGANs and I want to cover this library too. I'm trying to generate synthetic data on my MacBook M1 but I get this error message:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[1], line 10
      8 model = GReaT(llm='distilgpt2', batch_size=16, epochs=2, save_steps=4)
      9 model.fit(df)
---> 10 synthetic_data = model.sample(n_samples=100)

File ~/mambaforge/envs/ctgan/lib/python3.9/site-packages/be_great/great.py:144, in GReaT.sample(self, n_samples, start_col, start_col_dist, temperature, k, max_length, device)
    141 great_start = self._get_start_sampler(start_col, start_col_dist)
    143 # Move model to device
--> 144 self.model.to(device)
    146 # Init empty DataFrame for the generated samples
    147 df_gen = pd.DataFrame(columns=self.columns)

File ~/mambaforge/envs/ctgan/lib/python3.9/site-packages/transformers/modeling_utils.py:1682, in PreTrainedModel.to(self, *args, **kwargs)
   1677     raise ValueError(
   1678         "`.to` is not supported for `8-bit` models. Please use the model as it is, since the"
   1679         " model has already been set to the correct devices and casted to the correct `dtype`."
   1680     )
   1681 else:
-> 1682     return super().to(*args, **kwargs)

File ~/mambaforge/envs/ctgan/lib/python3.9/site-packages/torch/nn/modules/module.py:989, in Module.to(self, *args, **kwargs)
    985         return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
    986                     non_blocking, memory_format=convert_to_format)
    987     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
--> 989 return self._apply(convert)

File ~/mambaforge/envs/ctgan/lib/python3.9/site-packages/torch/nn/modules/module.py:641, in Module._apply(self, fn)
    639 def _apply(self, fn):
    640     for module in self.children():
--> 641         module._apply(fn)
    643     def compute_should_use_set_data(tensor, tensor_applied):
    644         if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    645             # If the new tensor has compatible tensor type as the existing tensor,
    646             # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    651             # global flag to let the user control whether they want the future
    652             # behavior of overwriting the existing tensor or not.

File ~/mambaforge/envs/ctgan/lib/python3.9/site-packages/torch/nn/modules/module.py:641, in Module._apply(self, fn)
    639 def _apply(self, fn):
    640     for module in self.children():
--> 641         module._apply(fn)
    643     def compute_should_use_set_data(tensor, tensor_applied):
    644         if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    645             # If the new tensor has compatible tensor type as the existing tensor,
    646             # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    651             # global flag to let the user control whether they want the future
    652             # behavior of overwriting the existing tensor or not.

File ~/mambaforge/envs/ctgan/lib/python3.9/site-packages/torch/nn/modules/module.py:664, in Module._apply(self, fn)
    660 # Tensors stored in modules are graph leaves, and we don't want to
    661 # track autograd history of `param_applied`, so we have to use
    662 # `with torch.no_grad():`
    663 with torch.no_grad():
--> 664     param_applied = fn(param)
    665 should_use_set_data = compute_should_use_set_data(param, param_applied)
    666 if should_use_set_data:

File ~/mambaforge/envs/ctgan/lib/python3.9/site-packages/torch/nn/modules/module.py:987, in Module.to.<locals>.convert(t)
    984 if convert_to_format is not None and t.dim() in (4, 5):
    985     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
    986                 non_blocking, memory_format=convert_to_format)
--> 987 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

File ~/mambaforge/envs/ctgan/lib/python3.9/site-packages/torch/cuda/__init__.py:221, in _lazy_init()
    217     raise RuntimeError(
    218         "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
    219         "multiprocessing, you must use the 'spawn' start method")
    220 if not hasattr(torch._C, '_cuda_getDeviceCount'):
--> 221     raise AssertionError("Torch not compiled with CUDA enabled")
    222 if _cudart is None:
    223     raise AssertionError(
    224         "libcudart functions unavailable. It looks like you have a broken build?")

AssertionError: Torch not compiled with CUDA enabled

How can I solve this problem?

Thank you so much!
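A possible workaround: sample() exposes a device argument (visible in its signature in the traceback above), so the model can be kept off CUDA explicitly. A minimal sketch:

# The default device is CUDA, which is unavailable on Apple Silicon;
# pass "cpu" instead ("mps" may also work if the PyTorch build supports it).
synthetic_data = model.sample(n_samples=100, device="cpu")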

Problem in running the model

Dear authors,

Hi! When I ran the notebook given in the README file, an error was reported on the line "model.fit(data)" that says:

ImportError                               Traceback (most recent call last)
in <cell line: 7>()
      5
      6 model = GReaT(llm='distilgpt2', batch_size=32, epochs=50)
----> 7 model.fit(data)
      8 synthetic_data = model.sample(n_samples=100)

5 frames
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py in _setup_devices(self)
   1670 if not is_sagemaker_mp_enabled():
   1671     if not is_accelerate_available(min_version="0.20.1"):
-> 1672         raise ImportError(
   1673             "Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U"
   1674         )

ImportError: Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U


NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

I tried installing the suggested packages, but it didn't help. I also tried locally, with no luck. Could you please let me know what may have gone wrong with my setup or packages?

Thanks a lot!

Joel

Sample failed because of list index out of range

Hi,

I'm training on a dataset with more than 91 columns, containing both categorical and numerical values. I trained for 10 epochs and then tried sampling, but sampling failed with this error: "list index out of range"


Error in sampling the data

Hey. I am facing an error when sampling data.
I get the error: "Breaking the generation loop!"
When I checked the code, I saw that in the following line df_gen becomes empty, obviously because the model generated only the placeholder and not the desired data:
df_gen = df_gen[~(df_gen == "placeholder").any(axis=1)]

I have been working on this for more than a week but couldn't resolve it.
Can you please help me? I really need this code to work.
My data is tabular text data, each column containing a number or a piece of text (sometimes long texts).

Thank you in advance.

How to randomly select a column for preconditioning for each epoch?

As explained in the code documentation, when training/fine-tuning via .fit(), if no column of the tabular data is specified, the last column is used for preconditioning.

We have used several metrics from the SDMetrics library to compare the real California housing dataset to several synthetically generated tabular datasets from GReaT. From there, we notice that this default preconditioning is not ideal: the synthetic version of that last column almost exactly matches the real one, while the rest of the columns show much more variability.

We would really like to mitigate this effect by selecting a column at random, where every column has equal probability of being selected, and repeating this selection every time the data is revisited, i.e., every epoch. This way, all columns would be used for preconditioning during the fitting process.

Is it possible, via any argument of GReaT() or .fit(), to specify this random selection of columns switching every epoch? Or does one have to rewrite the code rather than use the be_great package directly?

Many thanks beforehand!
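A crude workaround sketch until such an argument exists: call fit() repeatedly with one epoch per call, picking a fresh conditional_col each time (the argument is visible in GReaT.fit's signature). Note this is only an approximation, since each fit() call builds a fresh trainer and therefore restarts the learning-rate schedule.

import random

# df is your training DataFrame (placeholder for this sketch)
model = GReaT(llm='distilgpt2', batch_size=32, epochs=1)  # one epoch per fit() call
for _ in range(50):
    col = random.choice(list(df.columns))  # uniform choice over all columns
    model.fit(df, conditional_col=col)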

Long rows

Hi! I did a few first experiments with GReaT and like it already :)

I was wondering if you have thought about how to tackle the current token limits of LLMs? If I understand correctly, during training and generation the model processes one row at a time. Hence, the token limit effectively limits the length a row can have (in text form).

For now I have only the following ideas for fitting data with many features into that token limit:

  • "Compress" the feature names: Reducing the length of the column names to avoid token overhead by renaming / encoding the feature names to more token friendly strings.
  • The same for categorical values that are too long.

For example, if a column was originally named "Patient disease name" with a value like "Creutzfeldt–Jakob disease", it could be changed to the column name "Disease" with the value "CJ".
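In code, that idea might look like this (a minimal pandas sketch; rename_map and value_map are illustrative, and df stands in for the original DataFrame):

import pandas as pd

# Shorten column names and long categorical values before fitting,
# keeping the mappings so the synthetic output can be translated back.
rename_map = {"Patient disease name": "Disease"}
value_map = {"Creutzfeldt–Jakob disease": "CJ"}

df_short = df.rename(columns=rename_map)
df_short["Disease"] = df_short["Disease"].replace(value_map)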

Do you think this approach makes sense?

I am struggling especially to find a way for text features, ironically the ones seemingly best suited to this LLM approach. I have some columns containing free-form text. Unfortunately, those regularly exceed the token limit. Do you have any recommendations for dealing with this scenario?

Issue with running Great on breast cancer dataset

Hello,
I've encountered an issue while running Great on the breast cancer dataset from sklearn. While the training process proceeds smoothly without any issues, I have noticed that when attempting to generate samples using the trained model, the script has been running for more than 15 minutes without returning the samples, even for very small n_samples values (the progress bar also appears to be frozen). Same with the Iris dataset. I find this behavior peculiar since Great ran as expected with the California housing dataset.
I wanted to ask if you have come across this particular issue before or if you have any suggestions on how to handle it.

Here is my script:

!pip install be-great
!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate
!pip install transformers accelerate

from be_great import GReaT
import pandas as pd
from sklearn.datasets import load_breast_cancer

realData = load_breast_cancer(as_frame=True).frame
print(realData)
model = GReaT(llm='distilgpt2', batch_size=16, epochs=1, save_steps=400000)
model.fit(realData)
synthetic_data = model.sample(n_samples=10)
synthetic_data.head()

pip install doesn't work

I tried to install be-great via pip on both google colab and my Mac. However, it says that there is no Python package named be-great.


Then, I tried to download the .whl file from PyPI and install it. It also didn't work.


I don't know the reason; it looks weird.

Thank you so much!

Missing data

I see the code drops rows with missing data, but can this model work with missing data? I.e., can it be adapted to produce missing values in the same way as they are present in the data?

TypeError: '<' not supported between instances of 'list' and 'int'

Hi, I tried to use your library for my research. While testing its code, an error occurred. I just used the sample code you provided. Would you let me know what's the matter?

My python version is 3.9.16

from be_great import GReaT
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True).frame

model = GReaT(llm='distilgpt2', batch_size=32, epochs=50)
model.fit(data)
synthetic_data = model.sample(n_samples=100)

TypeError                                 Traceback (most recent call last)
Cell In[17], line 7
      4 data = fetch_california_housing(as_frame=True).frame
      6 model = GReaT(llm='distilgpt2', batch_size=32, epochs=50)
----> 7 model.fit(data)
      8 synthetic_data = model.sample(n_samples=100)

File /usr/local/lib/python3.9/dist-packages/be_great/great.py:114, in GReaT.fit(self, data, column_names, conditional_col, resume_from_checkpoint)
112 # Start training
113 logging.info("Start training...")
--> 114 great_trainer.train(resume_from_checkpoint=resume_from_checkpoint)
115 return great_trainer

File /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1543, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1538 self.model_wrapped = self.model
1540 inner_training_loop = find_executable_batch_size(
1541 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1542 )
-> 1543 return inner_training_loop(
1544 args=args,
1545 resume_from_checkpoint=resume_from_checkpoint,
1546 trial=trial,
1547 ignore_keys_for_eval=ignore_keys_for_eval,
1548 )

File /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1765, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1762 self._load_rng_state(resume_from_checkpoint)
1764 step = -1
-> 1765 for step, inputs in enumerate(epoch_iterator):
1766
1767 # Skip past any already trained steps if resuming training
1768 if steps_trained_in_current_epoch > 0:
1769 steps_trained_in_current_epoch -= 1

File /usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:628, in _BaseDataLoaderIter.__next__(self)
625 if self._sampler_iter is None:
626 # TODO(pytorch/pytorch#76750)
627 self._reset() # type: ignore[call-arg]
--> 628 data = self._next_data()
629 self._num_yielded += 1
630 if self._dataset_kind == _DatasetKind.Iterable and
631 self._IterableDataset_len_called is not None and
632 self._num_yielded > self._IterableDataset_len_called:

File /usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:671, in _SingleProcessDataLoaderIter._next_data(self)
669 def _next_data(self):
670 index = self._next_index() # may raise StopIteration
--> 671 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
672 if self._pin_memory:
673 data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

File /usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py:56, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
54 if self.auto_collation:
55 if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
---> 56 data = self.dataset.__getitems__(possibly_batched_index)
57 else:
58 data = [self.dataset[idx] for idx in possibly_batched_index]

File /usr/local/lib/python3.9/dist-packages/datasets/arrow_dataset.py:2662, in Dataset.__getitems__(self, keys)
2660 def __getitems__(self, keys: List) -> List:
2661 """Can be used to get a batch using a list of integers indices."""
-> 2662 batch = self.__getitem__(keys)
2663 n_examples = len(batch[next(iter(batch))])
2664 return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]

File /usr/local/lib/python3.9/dist-packages/datasets/arrow_dataset.py:2658, in Dataset.__getitem__(self, key)
2656 def __getitem__(self, key): # noqa: F811
2657 """Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)."""
-> 2658 return self._getitem(key)

File /usr/local/lib/python3.9/dist-packages/be_great/great_dataset.py:31, in GReaTDataset._getitem(self, key, decoded, **kwargs)
26 """ Get Item from Tabular Data
27
28 Get one instance of the tabular data, permuted, converted to text and tokenized.
29 """
30 # If int, what else?
---> 31 row = self._data.fast_slice(key, 1)
33 shuffle_idx = list(range(row.num_columns))
34 random.shuffle(shuffle_idx)

File /usr/local/lib/python3.9/dist-packages/datasets/table.py:135, in IndexedTableMixin.fast_slice(self, offset, length)
127 def fast_slice(self, offset=0, length=None) -> pa.Table:
128 """
129 Slice the Table using interpolation search.
130 The behavior is the same as pyarrow.Table.slice but it's significantly faster.
(...)
133 The batches to keep are then concatenated to form the sliced Table.
134 """
--> 135 if offset < 0:
136 raise IndexError("Offset must be non-negative")
137 elif offset >= self._offsets[-1] or (length is not None and length <= 0):

TypeError: '<' not supported between instances of 'list' and 'int'

The model for GReaT

Thanks for the great work! I am a newbie in NLP, so this might be a silly question. I am wondering how to use the GPT-2 model for GReaT. Is it

model = GReaT(llm='gpt2', batch_size=32, epochs=50)

or are there any examples that run GReaT instead of Distill-GReaT?

About how the GReaT model handles the generation of missing values (NaN)

Hello, authors! I wanted to express my gratitude for the excellent work you've done!

I do have a question regarding the generation of missing values.
I'm a bit puzzled about how the GReaT model handles the generation of missing values (NaN) in the current implementation.

When I input a DataFrame, GReaT automatically converts it into text like 'column1 is value1, column2 is value2, ...'.
If value1 happens to be a null value, the resulting text becomes 'column1 is None, column2 is value2, ...'.
Then, it goes into the LLM backbone (GPT2).

However, I've noticed that the 'column1 is None' part gets dropped by the code below (# Remove rows with flawed numerical values).
This happens because pd.to_numeric with the 'coerce' option cannot parse the string 'None', so instead of raising an error it converts the corresponding value to null (NaN).
Then, by selecting only non-null values using .notnull(), all rows with null values are dropped here.
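To illustrate the coercion step (a minimal pandas example, independent of be_great):

import pandas as pd

s = pd.Series(["1.5", "None"])
print(pd.to_numeric(s, errors="coerce"))
# 0    1.5
# 1    NaN   <- the string "None" cannot be parsed, becomes NaN, and the row is later dropped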

I recently experimented with the .fit() and .sample() functions to generate a DataFrame with several columns containing missing values (e.g., 'sick' dataset).

If I've made any mistakes in my understanding, please let me know.

Once again, thank you for your assistance!


        with tqdm(total=n_samples) as pbar:
            already_generated = 0
            _cnt = 0
            try:
                while n_samples > already_generated:
                    start_tokens = great_start.get_start_tokens(k)
                    start_tokens = torch.tensor(start_tokens).to(device)

                    # Generate tokens
                    tokens = self.model.generate(
                        input_ids=start_tokens,
                        max_length=max_length,
                        do_sample=True,
                        temperature=temperature,
                        pad_token_id=50256,
                    )

                    # Convert tokens back to tabular data
                    text_data = _convert_tokens_to_text(tokens, self.tokenizer)
                    df_gen = _convert_text_to_tabular_data(text_data, self.columns)

                    # Remove rows with flawed numerical values
                    for i_num_cols in self.num_cols:
                        df_gen = df_gen[
                            pd.to_numeric(df_gen[i_num_cols], errors="coerce").notnull()
                        ]

                    df_gen[self.num_cols] = df_gen[self.num_cols].astype(float)

                    # Remove rows with missing values
                    df_gen = df_gen.drop(df_gen[df_gen.isna().any(axis=1)].index)

                    dfs.append(df_gen)
                    already_generated += len(dfs[-1])

                    # Update process bar
                    pbar.update(len(dfs[-1]))

                    # Check if we actually generating synth samples and if not break everything
                    _cnt += 1
                    if _cnt > 13 and already_generated == 0:  # (:
                        raise Exception("Breaking the generation loop!")

Invalid sampling with Heloc and Sick datasets

I'm trying to rerun the experiments of the paper on the Heloc and Sick datasets. I'm using distilgpt2, fine-tuning for 200 epochs and using a temperature T=0.7 for sampling. For both datasets I am unable to get valid samples, as all sampled datapoints have missing features. I realised that Heloc has an all-missing column, which I removed from the input data before retraining the model. This still didn't work. Is there any further preprocessing or hyperparameter tuning required for these two datasets, beyond what's reported in the paper?

next GReaT version (fix for the datasets bug, "train it longer" message)

Hi @kathrinse,

I fixed the bug caused by the new datasets package version by adding a new method to the GReaTDataset class; I tested it and it works:

def __getitems__(self, keys: tp.Union[int, slice, str, list]):
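(For reference, a delegating body matching that signature could look like the following; this is a sketch of the idea, not necessarily the exact committed fix.)

def __getitems__(self, keys: tp.Union[int, slice, str, list]):
    # Newer `datasets` versions fetch batches via __getitems__ with a list
    # of indices; delegate each index to the existing _getitem.
    if isinstance(keys, list):
        return [self._getitem(key) for key in keys]
    return self._getitem(keys)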

We should then update the pip package, but maybe before that we should also include a "train it longer" message for cases where GReaT does not generate all the features.

Depending on your availability, please take care of it; I just created this issue so we won't forget :)

sampling error

When training is completed and sampling is performed, the following error occurs. How can I solve it?
Please reply.

be_great\great.py:162, in GReaT.sample(self, n_samples, start_col, start_col_dist, temperature, k, max_length, device)
160 # Convert tokens back to tabular data
161 text_data = _convert_tokens_to_text(tokens, self.tokenizer)
--> 162 df_gen = _convert_text_to_tabular_data(text_data, df_gen)
164 # Remove rows with flawed numerical values
165 for i_num_cols in self.num_cols:

be_great\great_utils.py:91, in _convert_text_to_tabular_data(text, df_gen)
89 values = f.strip().split(" is ")
90 if values[0] in columns and not td[values[0]]:
---> 91 td[values[0]] = [values[1]]
93 df_gen = pd.concat([df_gen, pd.DataFrame(td)], ignore_index=True, axis=0)
94 return df_gen

IndexError: list index out of range
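A possible defensive fix in _convert_text_to_tabular_data, sketched against the lines quoted above: when generation is malformed, splitting on " is " can yield fewer than two parts, so values[1] raises an IndexError. Guarding the length avoids the crash (the malformed feature is simply skipped):

values = f.strip().split(" is ")
# Skip malformed features that lack a value after " is "
if len(values) >= 2 and values[0] in columns and not td[values[0]]:
    td[values[0]] = [values[1]]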

Cannot load existing model

Hi,
I tried to load a trained model as follows:

model = GReaT(llm='distilgpt2', batch_size=32, epochs=1)
model.load_from_dir("trainer_great/checkpoint-76000")

I checked that I have the files there, but it gives me the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[27], line 1
----> 1 model.load_from_dir("trainer_great/checkpoint-76000")

Cell In[2], line 254, in GReaT.load_from_dir(cls, path)
    251     attributes = json.load(f)
    253 # Create new be_great model instance
--> 254 great = cls(attributes["llm"])
    256 # Set all attributes
    257 for k, v in attributes.items():

KeyError: 'llm'

Thanks for your attention

Support for LoRA?

Hi:

Thank you very much for open-sourcing this project!
I found that in your be_great/great.py, self.efficient_finetuning supports lora. I've come across a few bugs that I may need help with:

[1] GReaT.load_from_dir() leads to a state dict mismatch:

Missing key(s) in state_dict: "transformer.wte.weight" ...
Unexpected key(s) in state_dict: "base_model.model.transformer.wte.weight" ...

[2] net.sample(n_samples, k=50) returns

AttributeError: 'GPT2LMHeadModel' object has no attribute 'generation_config'

Thanks

No module named '_lzma' error on MacOS

---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[3], line 1
----> 1 from be_great import GReaT
2 import pandas as pd
4 data = pd.read_csv('/Users/romajain/Development/inspect-ml/worker_compensation/stp_modelling/input/full_train_data_stage_2.csv')

File ~/.pyenv/versions/3.10.0/envs/stp_modelling_anomalies/lib/python3.10/site-packages/be_great/__init__.py:1
----> 1 from .great import GReaT

File ~/.pyenv/versions/3.10.0/envs/stp_modelling_anomalies/lib/python3.10/site-packages/be_great/great.py:15
12 import torch
13 from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
---> 15 from be_great.great_dataset import GReaTDataset, GReaTDataCollator
16 from be_great.great_start import (
17 GReaTStart,
18 CategoricalStart,
(...)
21 _pad_tokens,
22 )
23 from be_great.great_trainer import GReaTTrainer

File ~/.pyenv/versions/3.10.0/envs/stp_modelling_anomalies/lib/python3.10/site-packages/be_great/great_dataset.py:4
1 import random
2 import typing as tp
...
---> 27 from _lzma import *
28 from _lzma import _encode_filter_properties, _decode_filter_properties
29 import _compression

ModuleNotFoundError: No module named '_lzma'

This is my error on Python 3.10 on macOS.

Breaking the generation loop

Is there any good way to handle this situation besides increasing the number of epochs? My computer's performance is limited, and increasing the epochs greatly prolongs the computation time.

Error in Example with California Housing Dataset

Hi,

I have been trying to reproduce your results, and got an error with the example in your repo.
The code I try to run is as below:

from be_great import GReaT
from sklearn import datasets

data = datasets.fetch_california_housing(as_frame=True).frame
great = GReaT("distilgpt2",                         # Name of the large language model used (see HuggingFace for more options)
              epochs=1,                             # Number of epochs to train (only one epoch for demonstration)
              save_steps=2000,                      # Save model weights every x steps
              logging_steps=50,                     # Log the loss and learning rate every x steps
              experiment_dir="trainer_california",  # Name of the directory where all intermediate steps are saved
              #lr_scheduler_type="constant",        # Specify the learning rate scheduler
              #learning_rate=5e-5                   # Set the initial learning rate
             )
trainer = great.fit(data)

And I get the error

File "/Users/saydore/Documents/projects/SDG_2022/SD_sergul_coding/main.py", line 15, in <module>
    trainer = great.fit(data)
  File "/usr/local/lib/python3.9/site-packages/be_great/great.py", line 114, in fit
    great_trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/usr/local/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.9/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2662, in __getitems__
    batch = self.__getitem__(keys)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2658, in __getitem__
    return self._getitem(key)
  File "/usr/local/lib/python3.9/site-packages/be_great/great_dataset.py", line 31, in _getitem
    row = self._data.fast_slice(key, 1)
  File "/usr/local/lib/python3.9/site-packages/datasets/table.py", line 135, in fast_slice
    if offset < 0:
TypeError: '<' not supported between instances of 'list' and 'int'
  0%|          | 0/2580 [00:00<?, ?it/s]

Could you please help me resolve this issue?

np.float is deprecated

First and foremost, thanks for making this code available!

However, when trying to sample from a trained GReaT model I get an error (included at the end of this issue).
I don't include any code because it is clear from the error message what went wrong.

I know I can get around it by installing a different numpy version, but I still think it would be valuable to address this issue.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fischers/miniconda2/envs/inference/lib/python3.10/site-packages/be_great/great.py", line 168, in sample
    df_gen[self.num_cols] = df_gen[self.num_cols].astype(np.float)
  File "/home/fischers/miniconda2/envs/inference/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations. Did you mean: 'cfloat'?

Not able to generate synthetic data after model fitting

I have tabular data with a shape of 11 rows and 25 columns. I trained two models with the following command:
model = GReaT(llm='distilgpt2', batch_size=32, epochs=25)

and tried to generate synthetic data for this table after fitting, but it fails with the error below:

An error has occurred: Breaking the generation loop!
To address this issue, consider fine-tuning the GReaT model for an longer period. This can be achieved by increasing the number of epochs.
Alternatively, you might consider increasing the max_length parameter within the sample function. For example: model.sample(n_samples=10, max_length=2000)

(I also tried model = GReaT(llm='distilgpt2', batch_size=25, epochs=100), but got the same error.)

Please let me know if there is a particular way the command must be given for successful generation.

How many samples are used for the classification and regression?

I am writing to seek further clarification related to a matter previously discussed in the GitHub issue "#46" regarding your manuscript.

In your manuscript, Table 6 is described as providing "A run time comparison of all generative models of our study. Selected models were trained/fine-tuned for 100 epochs and 1000 samples were generated." However, I seek clarification regarding the number of samples used for the model presented in Table 1.

In Section C, Reproducibility details, it is noted that "The GReaT baseline is fine-tuned for 110, 310, 400, 255, 150, 85, epochs for California Housing, Adult Income, Travel, Home Equity Line of Credit (HELOC), Sick (Dua & Graff, 2017), and Diabetes data sets, respectively."
Given the difference in the number of epochs, which suggests different experimental conditions from those described in Table 6, I am prompted to inquire about the number of samples generated for classification and regression performances.

Did you consistently use 1000 samples across all experiments?

Thank you for your clarification.

failing to use BART models - Breaking the generation loop!

Hi,

I'm trying to use, e.g., 'sshleifer/distilbart-cnn-6-6' and failing, with the following message:

An error has occurred: Breaking the generation loop! To address this issue, consider fine-tuning the GReaT model for an longer period. This can be achieved by increasing the number of epochs. Alternatively, you might consider increasing the max_length parameter within the sample function. For example: model.sample(n_samples=10, max_length=2000) If the problem persists despite these adjustments, feel free to raise an issue on our GitHub page at: https://github.com/kathrinse/be_great/issues

Aleksandar

Adding Native Distributed Data Parallels Support

Hi, I was wondering if there are any efforts toward great.py natively supporting Distributed Data Parallel? Currently I am doing a workaround by editing my own trainer file and saving it via torch.save.

Below is how I invoke it.

torchrun --nproc_per_node=8 ddptest.py

import os
import pandas as pd
from be_great import GReaT
import torch.distributed as dist
import torch
from collections import OrderedDict

def main():
    # Set CUDA devices for each process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dataFile = "/edit/for/your/own/repo.csv"
    data = pd.read_csv(dataFile)

    great = GReaT("gpt2-xl",         
                      batch_size=8,
                      epochs=50,                           
                      fp16=True
                     )

   # Move the model to the appropriate GPU
    great.model.to(local_rank)  

    # Wrap the model for distributed training
    great.model = torch.nn.parallel.DistributedDataParallel(
        great.model, device_ids=[local_rank], output_device=local_rank
    )

    trainer = great.fit(data, data.columns.to_list())

    
        # Save the model only from rank 0 process
    if dist.get_rank() == 0:
        # Create a new state dict with corrected key names
        state_dict = great.model.state_dict()
        new_state_dict = OrderedDict()
        for k, v in state_dict.items():
            name = k[7:]  # remove `module.`
            new_state_dict[name] = v

        # Save the model with the modified state dictl
        torch.save(new_state_dict, "/edit/for/your/own/model.pt")


if __name__ == "__main__":
    # Initialize the distributed process group
    dist.init_process_group(backend="nccl") 
    main()

Again thank you so much for this awesome framework.

Requirements don't allow numpy<1.24

I would love to include this method in our library synthcity.

However, currently we require numpy<1.24 (due to another third party library). Is there a reason that your requirements are for numpy>=1.24.2? I have a few examples working with numpy==1.23.5.

Would you consider allowing older versions of numpy, so that I can more easily add your synthetic data generation method to our library as a new plugin?

Many thanks!

np.float is deprecated

The np.float used in the impute function is deprecated (line 404 of great.py). You have to change it to float or np.float64:
Current:
df_curr[self.num_cols] = df_curr[self.num_cols].astype(np.float)
Change:
df_curr[self.num_cols] = df_curr[self.num_cols].astype(float)
or
df_curr[self.num_cols] = df_curr[self.num_cols].astype(np.float64)

Improving generation speed

Dear authors,

Let me start by thanking you for the open-source release of GReaT. I found an implementation detail that slows down the generation of samples, especially on larger datasets.

Problem Description

Looking at the GPU utilization, I found that the CPU workload (everything outside of sampling the model) takes increasingly longer (observed with nvtop; GPU utilization gets worse as sampling iterations increase).

Proposed Solution

Digging into the code, I found that the accumulator (df_gen) and generated (pd.DataFrame(td)) data frames are concatenated in each iteration.

https://github.com/kathrinse/be_great/blob/c568617763ba954fb39fc6b6e222e3abaef0886a/be_great/great.py#LL147C21-L147C21

df_gen = pd.concat([df_gen, pd.DataFrame(td)], ignore_index=True, axis=0)

This incurs O(N^2) overhead (each time memory is allocated for a new DataFrame that can contain all rows). This can be resolved by creating a list of data frames and concatenating them at the end of the generation process. For example:

for GReaT.sample this would require a minor change, similar to the following:

# Create an accumulation list for generated data
dfs = [] 
...
while n_samples > already_generated:
    ...
    df_gen = _convert_text_to_tabular_data(text_data, df_gen)
    ...
    dfs.append(df_gen)
    already_generated += len(dfs[-1])
    pbar.update(len(dfs[-1]))
    
df_gen = pd.concat(dfs)
df_gen = df_gen.reset_index(drop=True)    
...

The _convert_text_to_tabular_data can be improved similarly by making it return a DataFrame that is constructed from a list of dictionaries.

def _convert_text_to_tabular_data(text: tp.List[str], df_gen: pd.DataFrame) -> pd.DataFrame:
    ...
    generated = []
    ...
    for t in text:
        ...
        generated.append(td)
    return pd.DataFrame(generated)

This way for a dataset containing 20K+ samples, generation time went from 40+ minutes to about 3 minutes. Smaller datasets also seem to benefit, but this is less pronounced as the overhead grows linearly with the sampling iteration.

Example implementation

Looking at related work, it seems like the RealTabFormers implementation provides an example of this.

https://github.com/worldbank/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/rtf_sampler.py#L610-L674

Side note

Likely this could also (slightly) improve GReaT's performance in Appendix B.5 of your paper for inference/generation.

Few questions about the paper

(Since there is no discussions page in this repo, I'll leave these here on the issues page.)

  1. Currently only one conditional_col is allowed during fit/sample. Is it theoretically possible to have multiple columns for it?
  2. Training is a bit slow. Would it be possible to use a more efficient transformer in future work?
  3. In great_sample, only a single sample is returned for each prompt. Can it return multiple at the same time?

Thanks for the work!

Suggestion: Improve sampling speed

First of all, thanks for making this library available.
When sampling a large number of samples, the sample() code becomes increasingly slow.
I believe this is because of this line:

df_gen = pd.concat([df_gen, pd.DataFrame(td)], ignore_index=True, axis=0)

I think a speed improvement could be achieved by storing all the dataframes in a list and then concatenating the list of dataframes at the end.
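A minimal sketch of that accumulation pattern (the data here is illustrative; in the real loop, the per-iteration dicts would come from the generation step):

import pandas as pd

# Collect per-iteration frames in a list and concatenate once at the end,
# instead of concatenating inside the loop (which is quadratic in total rows).
batches = [{"a": [1], "b": [2]}, {"a": [3], "b": [4]}]
dfs = [pd.DataFrame(td) for td in batches]
df_gen = pd.concat(dfs, ignore_index=True)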

Reg. Fine tuning for Time Series data generation

With time series data, the challenge I found the model faces is understanding that a change in the (binary) label is an important point. For healthcare use cases such as disease diagnosis with timestamped data, the label change from 0->1 is irreversible (typically there are no records of a patient's vitals after the patient has tested positive).

One question I have: is there a way to make an LLM understand a time series / collection of records and then sample a time-series collection of records? I have tried to condition it with some fixed demographic values such as an identifier, age, and multiple timestamps; however, I am not convinced that I am getting a synthetic collection for those fixed variables at different timestamps (sampling via great_sample's starting_prompt).

Any ideas?

Token indices sequence length issue.

Hi, I am having the following problem when using the library on my dataset. The dataset has only 209 samples, but 107 features. The values in the set are floats and ints.

This is the call I am making:
model = GReaT(llm='gpt2', epochs=50, batch_size=32)
model.fit(df)

This is what I assume is the reason behind the error:
Token indices sequence length is longer than the specified maximum sequence length for this model (1614 > 1024). Running this sequence through the model will result in indexing errors

From what I can gather, it seems to be a Hugging Face issue.
Is there a way to pass something like seq = seq[:512] as a parameter?

Do you know a solution to this problem?

Any help would be much appreciated.
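For context on why the limit is exceeded, a back-of-the-envelope check (my own rough estimate, assuming each of the 107 features is rendered as "name is value, " and costs on the order of 15 tokens):

# 107 features at roughly 15 tokens each already exceeds GPT-2's 1024-token context
print(107 * 15)  # 1605, in line with the 1614 tokens reported in the warning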

Here is the full trace:

/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 209
  Num Epochs = 50
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 350
  Number of trainable parameters = 124439808
Token indices sequence length is longer than the specified maximum sequence length for this model (1614 > 1024). Running this sequence through the model will result in indexing errors
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [37,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [38,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [41,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [42,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [43,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [45,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [46,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [47,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [42,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [115,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [117,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [2,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[... the same assertion repeats for threads [119,0,0] through [127,0,0] of block [2,0,0] and for threads [64,0,0] through [95,0,0] of block [11,0,0] ...]
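
This device-side assertion typically fires when an embedding lookup receives a token ID that is greater than or equal to the size of the embedding table, for example after a tokenizer/model vocabulary mismatch. A minimal diagnostic sketch, assuming a serialized table row of the "column is value" form that GReaT feeds to GPT-2 (the sample_text below is illustrative, not taken from the reporter's data):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Size of the input embedding table; every token ID must stay strictly below it.
embedding_size = model.get_input_embeddings().weight.shape[0]

sample_text = "age is 39, income is 50000"  # hypothetical serialized row
token_ids = tokenizer(sample_text)["input_ids"]
assert max(token_ids) < embedding_size, (
    f"token id {max(token_ids)} is out of range for embedding size {embedding_size}"
)

If this assertion trips on your own rows, the fix is usually to resize the model's embeddings after adding tokens, or to make sure the tokenizer and the model checkpoint match.
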
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In [7], line 8
      5 #data
      6 #coul_sum_0_0v_sc
      7 model = GReaT(llm='gpt2', epochs=50, batch_size=32)
----> 8 model.fit(coul_sum_0_0v_sc)
      9 #synthetic_data = model.sample(n_samples=100)

File /usr/local/lib/python3.9/dist-packages/be_great/great.py:114, in GReaT.fit(self, data, column_names, conditional_col, resume_from_checkpoint)
    112 # Start training
    113 logging.info("Start training...")
--> 114 great_trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    115 return great_trainer

File /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1501, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1496     self.model_wrapped = self.model
   1498 inner_training_loop = find_executable_batch_size(
   1499     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1500 )
-> 1501 return inner_training_loop(
   1502     args=args,
   1503     resume_from_checkpoint=resume_from_checkpoint,
   1504     trial=trial,
   1505     ignore_keys_for_eval=ignore_keys_for_eval,
   1506 )

File /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1749, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1747         tr_loss_step = self.training_step(model, inputs)
   1748 else:
-> 1749     tr_loss_step = self.training_step(model, inputs)
   1751 if (
   1752     args.logging_nan_inf_filter
   1753     and not is_torch_tpu_available()
   1754     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1755 ):
   1756     # if loss is nan or inf simply add the average of previous logged losses
   1757     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:2508, in Trainer.training_step(self, model, inputs)
   2505     return loss_mb.reduce_mean().detach().to(self.args.device)
   2507 with self.compute_loss_context_manager():
-> 2508     loss = self.compute_loss(model, inputs)
   2510 if self.args.n_gpu > 1:
   2511     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:2540, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2538 else:
   2539     labels = None
-> 2540 outputs = model(**inputs)
   2541 # Save past state if it exists
   2542 # TODO: this needs to be fixed and made cleaner later.
   2543 if self.args.past_index >= 0:

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1190, in Module._call_impl(self, *input, **kwargs)
   1186 # If we don't have any hooks, we want to skip the rest of the logic in
   1187 # this function, and just call forward.
   1188 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1189         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1190     return forward_call(*input, **kwargs)
   1191 # Do not call functions when jit is used
   1192 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.9/dist-packages/transformers/models/gpt2/modeling_gpt2.py:1046, in GPT2LMHeadModel.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1038 r"""
   1039 labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
   1040     Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
   1041     `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
   1042     are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]`
   1043 """
   1044 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-> 1046 transformer_outputs = self.transformer(
   1047     input_ids,
   1048     past_key_values=past_key_values,
   1049     attention_mask=attention_mask,
   1050     token_type_ids=token_type_ids,
   1051     position_ids=position_ids,
   1052     head_mask=head_mask,
   1053     inputs_embeds=inputs_embeds,
   1054     encoder_hidden_states=encoder_hidden_states,
   1055     encoder_attention_mask=encoder_attention_mask,
   1056     use_cache=use_cache,
   1057     output_attentions=output_attentions,
   1058     output_hidden_states=output_hidden_states,
   1059     return_dict=return_dict,
   1060 )
   1061 hidden_states = transformer_outputs[0]
   1063 # Set device for model parallelism

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1190, in Module._call_impl(self, *input, **kwargs)
   1186 # If we don't have any hooks, we want to skip the rest of the logic in
   1187 # this function, and just call forward.
   1188 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1189         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1190     return forward_call(*input, **kwargs)
   1191 # Do not call functions when jit is used
   1192 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.9/dist-packages/transformers/models/gpt2/modeling_gpt2.py:889, in GPT2Model.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions, output_hidden_states, return_dict)
    879     outputs = torch.utils.checkpoint.checkpoint(
    880         create_custom_forward(block),
    881         hidden_states,
   (...)
    886         encoder_attention_mask,
    887     )
    888 else:
--> 889     outputs = block(
    890         hidden_states,
    891         layer_past=layer_past,
    892         attention_mask=attention_mask,
    893         head_mask=head_mask[i],
    894         encoder_hidden_states=encoder_hidden_states,
    895         encoder_attention_mask=encoder_attention_mask,
    896         use_cache=use_cache,
    897         output_attentions=output_attentions,
    898     )
    900 hidden_states = outputs[0]
    901 if use_cache is True:

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1190, in Module._call_impl(self, *input, **kwargs)
   1186 # If we don't have any hooks, we want to skip the rest of the logic in
   1187 # this function, and just call forward.
   1188 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1189         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1190     return forward_call(*input, **kwargs)
   1191 # Do not call functions when jit is used
   1192 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.9/dist-packages/transformers/models/gpt2/modeling_gpt2.py:389, in GPT2Block.forward(self, hidden_states, layer_past, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions)
    387 residual = hidden_states
    388 hidden_states = self.ln_1(hidden_states)
--> 389 attn_outputs = self.attn(
    390     hidden_states,
    391     layer_past=layer_past,
    392     attention_mask=attention_mask,
    393     head_mask=head_mask,
    394     use_cache=use_cache,
    395     output_attentions=output_attentions,
    396 )
    397 attn_output = attn_outputs[0]  # output_attn: a, present, (attentions)
    398 outputs = attn_outputs[1:]

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1190, in Module._call_impl(self, *input, **kwargs)
   1186 # If we don't have any hooks, we want to skip the rest of the logic in
   1187 # this function, and just call forward.
   1188 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1189         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1190     return forward_call(*input, **kwargs)
   1191 # Do not call functions when jit is used
   1192 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.9/dist-packages/transformers/models/gpt2/modeling_gpt2.py:311, in GPT2Attention.forward(self, hidden_states, layer_past, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions)
    309     attention_mask = encoder_attention_mask
    310 else:
--> 311     query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
    313 query = self._split_heads(query, self.num_heads, self.head_dim)
    314 key = self._split_heads(key, self.num_heads, self.head_dim)

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1190, in Module._call_impl(self, *input, **kwargs)
   1186 # If we don't have any hooks, we want to skip the rest of the logic in
   1187 # this function, and just call forward.
   1188 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1189         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1190     return forward_call(*input, **kwargs)
   1191 # Do not call functions when jit is used
   1192 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.9/dist-packages/transformers/pytorch_utils.py:112, in Conv1D.forward(self, x)
    110 def forward(self, x):
    111     size_out = x.size()[:-1] + (self.nf,)
--> 112     x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
    113     x = x.view(size_out)
    114     return x

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
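
Note that once a device-side assert like the one above has fired, the CUDA context is left in a broken state, so a later call can fail with an unrelated-looking error such as CUBLAS_STATUS_NOT_INITIALIZED. A debugging sketch, assuming you re-run the training in a fresh process (re-using the coul_sum_0_0v_sc DataFrame from the traceback above): setting CUDA_LAUNCH_BLOCKING=1 before CUDA is initialized makes kernels launch synchronously, so the Python traceback points at the operation that actually failed.

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before torch initializes CUDA

from be_great import GReaT

model = GReaT(llm="gpt2", epochs=50, batch_size=32)
# model.fit(coul_sum_0_0v_sc)  # re-run the failing call to get an accurate traceback
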
