generating_text_summary_with_gpt2's People

Contributors

skrohit

generating_text_summary_with_gpt2's Issues

max_articles_size : expected 2 got 1

Hey,
I tested your script to prepare the dataset for my abstractive summarizer, and I want to point out a bug around max_articles_size: when listing the files in the CNN dataset, zip(os.listdir(file_name)) does not work. Use enumerate(os.listdir(file_name)) instead, which gives both the index i and the file name.
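
A minimal sketch of the difference (file_name here is a hypothetical path to the tokenized articles, not a path from the repo):

import os

file_name = 'cnn_stories_tokenized'  # hypothetical directory of tokenized articles

# zip() over a single list yields 1-tuples, so `for i, file in zip(...)`
# fails with an 'expected 2, got 1' unpacking error, matching this issue's title.
# enumerate() yields (index, filename) pairs, which is what the loop expects:
for i, file in enumerate(os.listdir(file_name)):
    print(i, file)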

TypeError: generate_sample() missing 1 required positional argument: 'model'

I got this error while executing the training cell:
#training the model

start = time.time()
train(args, model, tokenizer, train_data, valid_data, ignore_idx)
print('total time: ', (time.time()-start)/60, " minutes", end='\n\n')
print('Saving trained model...')
model_file = os.path.join(args.model_dir, 'model_data{}trained_after{}_epochs_only_sum_loss_ignr_pad.bin'.format(len(train_data),args.num_train_epochs))
config_file = os.path.join(args.model_dir, 'config_data{}trained_after{}_epochs_only_sum_loss_ignr_pad.json'.format(len(train_data),args.num_train_epochs))
torch.save(model.state_dict(), model_file)
model.config.to_json_file(config_file)

TypeError                                 Traceback (most recent call last)
in <module>
      2
      3 start = time.time()
----> 4 train(args, model, tokenizer, train_data, valid_data, ignore_idx)
      5 print('total time: ', (time.time()-start)/60, " minutes", end='\n\n')
      6

in train(args, model, tokenizer, train_dataset, valid_dataset, ignore_index)
     53     writer.add_scalar('eval_{}'.format(key), value, global_step)
     54     print('After', global_step+1,'updates: ', end='\n\n')
---> 55     generate_sample(valid_dataset, tokenizer, num=2, eval_step=True,device=args.device)

TypeError: generate_sample() missing 1 required positional argument: 'model'
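
The call on line 55 of train() simply omits the model argument named in the error. A hedged fix, assuming generate_sample accepts model as a keyword (passing it by keyword avoids guessing its position in the signature):

# pass the model explicitly so generate_sample receives all required arguments
generate_sample(valid_dataset, tokenizer, model=model, num=2, eval_step=True, device=args.device)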

Where is model_file

Will we get the model file after training? It is taking very long to train on my laptop; can you provide these files so we can skip this part, or suggest any free platforms?
I am trying to do this on Google Colab, but cloning the repository takes too much time, I think because of the dataset. Do you know a way around this?

Arguments required for running train_gpt2_summarizer.py

I am trying to recreate what you have done and I am getting this error.
Can you provide an optimal set of parameters?

(tensorflow) C:\Users\XXXX\Desktop\Generating_Text_Summary_With_GPT2-master>python train_gpt2_summarizer.py --batch_size 1 --root_dir CNN\gpt2_1024_data
usage: train_gpt2_summarizer.py [-h] --lr LR [--seed SEED] [--n_gpu N_GPU] --gradient_accumulation_steps
GRADIENT_ACCUMULATION_STEPS --batch_size BATCH_SIZE [--num_workers NUM_WORKERS]
[--device DEVICE] --num_train_epochs NUM_TRAIN_EPOCHS --output_dir OUTPUT_DIR
--model_dir MODEL_DIR [--fp16 FP16] [--fp16_opt_level FP16_OPT_LEVEL]
[--max_grad_norm MAX_GRAD_NORM] [--root_dir ROOT_DIR] [--ids_file IDS_FILE]
train_gpt2_summarizer.py: error: the following arguments are required: --lr, --gradient_accumulation_steps, --num_train_epochs, --output_dir, --model_dir
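
A hedged example invocation that supplies every required flag (the values are illustrative; a full command of this shape appears in the IndexError issue below):

$ python train_gpt2_summarizer.py --lr 5e-5 --gradient_accumulation_steps 32 --batch_size 1 --num_train_epochs 1 --output_dir ./output --model_dir ./weights --root_dir CNN\gpt2_1024_data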

About batch_size

Hi, your notebook is an impressive tutorial on using GPT-2 as a seq2seq model.
However, it cannot run with batch_size greater than 1. I think this is because you added sum_idx to the training dataset: you wrote shift_labels = labels[..., batch['sum_idx']+1:].contiguous(), but that cannot be applied to a batch, since PyTorch does not support slicing one tensor by another like this.

I don't know why you chose batch_size=1; I can run a larger batch size on my 12 GB NVIDIA GPU. To do that, I think you should define a new feature for your dataset, say labels, and set some of its values to -100, and remove the sum_idx feature, since slicing a tensor that way is really hard. PyTorch's cross-entropy loss ignores positions assigned -100; check out ignore_index for more details.
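
A minimal sketch of the suggested change, assuming each example records where the summary starts (build_labels and sum_start are hypothetical names, not the repo's):

import torch
from torch.nn import CrossEntropyLoss

def build_labels(input_ids: torch.Tensor, sum_start: int) -> torch.Tensor:
    # copy the inputs, then mask everything before the summary with -100
    labels = input_ids.clone()
    labels[:sum_start] = -100  # CrossEntropyLoss skips these positions
    return labels

# -100 is the default ignore_index, so masked tokens contribute no loss
loss_fct = CrossEntropyLoss(ignore_index=-100)

This removes the per-example sum_idx slicing, so examples with different summary offsets can be batched together.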

Token indices sequence length is longer than the specified maximum sequence length for this model (4681 > 1024). Running this sequence through the model will result in indexing errors

In the blog you said you only chose files that had at most 512 or 1024 tokens after tokenizing; can you tell me which part of the code does this? After executing

$ python max_article_sizes.py path/to/cnn_or_dailymail/tokenized/articles

a pickle file covering all the files is created in the CNN or DM folder. After executing

$ python prepare_data.py [path/to/pickle_file/of/articles/sizes/created/using/above/command]

a folder named gpt2_1024_data is created in the CNN folder, containing around 59,2888 JSON files.
This doubt came to me because in the blog you said you chose around 1500 items, but far more JSON files are created here. Also, during execution, after some files were written it printed:

Token indices sequence length is longer than the specified maximum sequence length for this model (4681 > 1024). Running this sequence through the model will result in indexing errors

Could you please help me with this and tell me if I have made any mistake?
I am also getting some file-not-found errors while executing the second command (prepare_data.py); copying cnn_stories_tokenized into the CNN folder resolves them.
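
For reference, a hedged sketch of what such a token-length filter could look like (this is not the repo's code; GPT2Tokenizer from the transformers library is assumed):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
MAX_LEN = 1024  # GPT-2's context window

def fits_context(text: str, max_len: int = MAX_LEN) -> bool:
    # keep only articles whose tokenized length fits the model's context
    return len(tokenizer.encode(text)) <= max_len

The warning itself is harmless at tokenization time; it only becomes an indexing error if an over-length sequence is actually fed to the model.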

IndexError: Caught IndexError in DataLoader worker process 0

ml-lab@mllab-OptiPlex-7010:~/Downloads/Temp/Generating_Text_Summary_With_GPT2-master$ python3 train_gpt2_summarizer.py --batch_size 1 --root_dir ./CNN --lr 5e-5 --gradient_accumulation_steps 32 --num_train_epochs 1 --output_dir ./output --model_dir ./weights
Epoch: 0%| | 0/1 [00:00<?, ?it/s]
Training: 0%| | 0/3000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_gpt2_summarizer.py", line 157, in <module>
    main()
  File "train_gpt2_summarizer.py", line 146, in main
    train(args, model, tokenizer, train_data, valid_data, ignore_idx)
  File "train_gpt2_summarizer.py", line 38, in train
    for step, batch in enumerate(epoch_iterator):
  File "/home/ml-lab/.local/lib/python3.8/site-packages/tqdm/notebook.py", line 248, in __iter__
    for obj in super(tqdm_notebook, self).__iter__():
  File "/home/ml-lab/.local/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.

Original Traceback (most recent call last):
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ml-lab/Downloads/Temp/Generating_Text_Summary_With_GPT2-master/dataset.py", line 44, in __getitem__
    idx = self.idxs[-idx]
IndexError: list index out of range
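
The failing line, self.idxs[-idx], uses negative indexing, so it raises IndexError as soon as the DataLoader requests an index larger than len(self.idxs). A hedged sketch of the length/index contract (ArticleDataset and its fields are illustrative, not the repo's code; only the self.idxs[-idx] line is quoted from dataset.py):

from torch.utils.data import Dataset

class ArticleDataset(Dataset):
    def __init__(self, idxs):
        self.idxs = idxs  # list of article ids

    def __len__(self):
        # must not exceed len(self.idxs): the DataLoader draws indices from
        # range(len(dataset)), and self.idxs[-idx] raises IndexError for
        # any idx > len(self.idxs)
        return len(self.idxs)

    def __getitem__(self, idx):
        return self.idxs[-idx]  # negative indexing walks the list from the end

A likely cause is therefore a dataset length (such as the 3000 shown in the progress bar) larger than the number of ids actually loaded into self.idxs, e.g. because prepare_data.py wrote fewer files than expected.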


name 'ignore_index' is not defined

I am getting this error when running the code on Google Colab.
Where are you running your code? Please guide me and help me resolve this error.
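
A hedged guess at the missing definition (treating the pad-token id as the ignore index is an assumption, not confirmed from the repo; the batch_size issue above discusses the related -100 convention):

# assumption: make the loss skip padded positions
ignore_index = tokenizer.pad_token_id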

Preprocessing

How do I preprocess my own dataset and convert it into the format the model needs?
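
Based on the commands quoted in the token-length issue above, the preprocessing pipeline appears to be (paths are placeholders):

$ python max_article_sizes.py path/to/cnn_or_dailymail/tokenized/articles
$ python prepare_data.py [path/to/pickle_file/of/articles/sizes/created/using/above/command]

The first pass records article sizes in a pickle file; the second writes the per-article JSON files the training script reads.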

train_gpt2_summarizer.py does not work

File "train_gpt2_summarizer.py", line 32
writer = SummaryWriter('./logs')
^
IndentationError: unindent does not match any outer indentation level

I am running this on Google Colab.
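
An IndentationError at this point usually means the file mixes tabs and spaces, or that line 32 is indented to a depth that matches no enclosing block. A hedged sketch of the expected shape (the import and surrounding code are assumptions, not quoted from the repo; the repo may import SummaryWriter from tensorboardX instead):

# assumption: SummaryWriter from torch.utils.tensorboard
from torch.utils.tensorboard import SummaryWriter

def train(args, model, tokenizer, train_dataset, valid_dataset, ignore_index):
    # indent with spaces only, at the same depth as the neighbouring statements
    writer = SummaryWriter('./logs')

Re-saving the file in an editor configured to convert tabs to spaces typically fixes this.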
