generating_text_summary_with_gpt2's People

Contributors

skrohit

generating_text_summary_with_gpt2's Issues

max_articles_size : expected 2 got 1

Hey,
I tested your script to prepare the dataset for my abstractive summarizer, and I want to point out a bug around max_articles_size: when listing the files in the CNN dataset, zip(os.listdir(file_name)) does not work. Use enumerate(os.listdir(file_name)) instead, which gives both the index i and the file name.
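
A minimal sketch of the difference (file_name here is a hypothetical path to the tokenized articles, not a path from the repo):

import os

file_name = 'cnn_stories_tokenized'  # hypothetical directory of tokenized articles

# zip() over a single list yields 1-tuples, so `for i, file in zip(...)`
# fails with an 'expected 2, got 1' unpacking error, matching this issue's title.
# enumerate() yields (index, filename) pairs, which is what the loop expects:
for i, file in enumerate(os.listdir(file_name)):
    print(i, file)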

TypeError: generate_sample() missing 1 required positional argument: 'model'

I got this error while executing the training cell:
#training the model

start = time.time()
train(args, model, tokenizer, train_data, valid_data, ignore_idx)
print('total time: ', (time.time()-start)/60, " minutes", end='\n\n')
print('Saving trained model...')
model_file = os.path.join(args.model_dir, 'model_data{}trained_after{}_epochs_only_sum_loss_ignr_pad.bin'.format(len(train_data),args.num_train_epochs))
config_file = os.path.join(args.model_dir, 'config_data{}trained_after{}_epochs_only_sum_loss_ignr_pad.json'.format(len(train_data),args.num_train_epochs))
torch.save(model.state_dict(), model_file)
model.config.to_json_file(config_file)

TypeError                                 Traceback (most recent call last)
in <module>
      2
      3 start = time.time()
----> 4 train(args, model, tokenizer, train_data, valid_data, ignore_idx)
      5 print('total time: ', (time.time()-start)/60, " minutes", end='\n\n')
      6

in train(args, model, tokenizer, train_dataset, valid_dataset, ignore_index)
     53     writer.add_scalar('eval_{}'.format(key), value, global_step)
     54     print('After', global_step+1,'updates: ', end='\n\n')
---> 55     generate_sample(valid_dataset, tokenizer, num=2, eval_step=True,device=args.device)

TypeError: generate_sample() missing 1 required positional argument: 'model'
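
The call on line 55 of train() simply omits the model argument named in the error. A hedged fix, assuming generate_sample accepts model as a keyword (passing it by keyword avoids guessing its position in the signature):

# pass the model explicitly so generate_sample receives all required arguments
generate_sample(valid_dataset, tokenizer, model=model, num=2, eval_step=True, device=args.device)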

Where is model_file

Will we get the model file after training? It is taking very long to train on my laptop; can you provide these files so we can skip this part, or suggest any free platforms?
I am trying to do this on Google Colab, but cloning the repository takes too much time, I think because of the dataset. Do you know a way around this?

Arguments required for running train_gpt2_summarizer.py

I am trying to recreate what you have done and I am getting this error.
Can you provide an optimal set of parameters?

(tensorflow) C:\Users\XXXX\Desktop\Generating_Text_Summary_With_GPT2-master>python train_gpt2_summarizer.py --batch_size 1 --root_dir CNN\gpt2_1024_data
usage: train_gpt2_summarizer.py [-h] --lr LR [--seed SEED] [--n_gpu N_GPU] --gradient_accumulation_steps
GRADIENT_ACCUMULATION_STEPS --batch_size BATCH_SIZE [--num_workers NUM_WORKERS]
[--device DEVICE] --num_train_epochs NUM_TRAIN_EPOCHS --output_dir OUTPUT_DIR
--model_dir MODEL_DIR [--fp16 FP16] [--fp16_opt_level FP16_OPT_LEVEL]
[--max_grad_norm MAX_GRAD_NORM] [--root_dir ROOT_DIR] [--ids_file IDS_FILE]
train_gpt2_summarizer.py: error: the following arguments are required: --lr, --gradient_accumulation_steps, --num_train_epochs, --output_dir, --model_dir
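
A hedged example invocation that supplies every required flag (the values are illustrative; a full command of this shape appears in the IndexError issue below):

$ python train_gpt2_summarizer.py --lr 5e-5 --gradient_accumulation_steps 32 --batch_size 1 --num_train_epochs 1 --output_dir ./output --model_dir ./weights --root_dir CNN\gpt2_1024_data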

About batch_size

Hi, your notebook is an impressive tutorial on using GPT-2 as a seq2seq model.
However, it cannot run with batch_size greater than 1. I think this is because you added sum_idx to the training dataset: you wrote shift_labels = labels[..., batch['sum_idx']+1:].contiguous(), but that cannot be applied to a batch, since PyTorch does not support slicing one tensor by another like this.

I don't know why you chose batch_size=1; I can run a larger batch size on my 12 GB NVIDIA GPU. To do that, I think you should define a new feature for your dataset, say labels, and set some of its values to -100, and remove the sum_idx feature, since slicing a tensor that way is really hard. PyTorch's cross-entropy loss ignores positions assigned -100; check out ignore_index for more details.
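
A minimal sketch of the suggested change, assuming each example records where the summary starts (build_labels and sum_start are hypothetical names, not the repo's):

import torch
from torch.nn import CrossEntropyLoss

def build_labels(input_ids: torch.Tensor, sum_start: int) -> torch.Tensor:
    # copy the inputs, then mask everything before the summary with -100
    labels = input_ids.clone()
    labels[:sum_start] = -100  # CrossEntropyLoss skips these positions
    return labels

# -100 is the default ignore_index, so masked tokens contribute no loss
loss_fct = CrossEntropyLoss(ignore_index=-100)

This removes the per-example sum_idx slicing, so examples with different summary offsets can be batched together.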

Token indices sequence length is longer than the specified maximum sequence length for this model (4681 > 1024). Running this sequence through the model will result in indexing errors

In the blog you said you only chose files that had at most 512 or 1024 tokens after tokenizing; can you tell me which part of the code does this? After executing

$ python max_article_sizes.py path/to/cnn_or_dailymail/tokenized/articles

a pickle file covering all the files is created in the CNN or DM folder. After executing

$ python prepare_data.py [path/to/pickle_file/of/articles/sizes/created/using/above/command]

a folder named gpt2_1024_data is created in the CNN folder, containing around 59,2888 JSON files.
This doubt came to me because in the blog you said you chose around 1500 items, but far more JSON files are created here. Also, during execution, after some files were written it printed:

Token indices sequence length is longer than the specified maximum sequence length for this model (4681 > 1024). Running this sequence through the model will result in indexing errors

Could you please help me with this and tell me if I have made any mistake?
I am also getting some file-not-found errors while executing the second command (prepare_data.py); copying cnn_stories_tokenized into the CNN folder resolves them.
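
For reference, a hedged sketch of what such a token-length filter could look like (this is not the repo's code; GPT2Tokenizer from the transformers library is assumed):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
MAX_LEN = 1024  # GPT-2's context window

def fits_context(text: str, max_len: int = MAX_LEN) -> bool:
    # keep only articles whose tokenized length fits the model's context
    return len(tokenizer.encode(text)) <= max_len

The warning itself is harmless at tokenization time; it only becomes an indexing error if an over-length sequence is actually fed to the model.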

IndexError: Caught IndexError in DataLoader worker process 0

ml-lab@mllab-OptiPlex-7010:~/Downloads/Temp/Generating_Text_Summary_With_GPT2-master$ python3 train_gpt2_summarizer.py --batch_size 1 --root_dir ./CNN --lr 5e-5 --gradient_accumulation_steps 32 --num_train_epochs 1 --output_dir ./output --model_dir ./weights
Epoch: 0%| | 0/1 [00:00<?, ?it/s]
Training: 0%| | 0/3000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_gpt2_summarizer.py", line 157, in <module>
    main()
  File "train_gpt2_summarizer.py", line 146, in main
    train(args, model, tokenizer, train_data, valid_data, ignore_idx)
  File "train_gpt2_summarizer.py", line 38, in train
    for step, batch in enumerate(epoch_iterator):
  File "/home/ml-lab/.local/lib/python3.8/site-packages/tqdm/notebook.py", line 248, in __iter__
    for obj in super(tqdm_notebook, self).__iter__():
  File "/home/ml-lab/.local/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.

Original Traceback (most recent call last):
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ml-lab/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ml-lab/Downloads/Temp/Generating_Text_Summary_With_GPT2-master/dataset.py", line 44, in __getitem__
    idx = self.idxs[-idx]
IndexError: list index out of range
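
The failing line, self.idxs[-idx], uses negative indexing, so it raises IndexError as soon as the DataLoader requests an index larger than len(self.idxs). A hedged sketch of the length/index contract (ArticleDataset and its fields are illustrative, not the repo's code; only the self.idxs[-idx] line is quoted from dataset.py):

from torch.utils.data import Dataset

class ArticleDataset(Dataset):
    def __init__(self, idxs):
        self.idxs = idxs  # list of article ids

    def __len__(self):
        # must not exceed len(self.idxs): the DataLoader draws indices from
        # range(len(dataset)), and self.idxs[-idx] raises IndexError for
        # any idx > len(self.idxs)
        return len(self.idxs)

    def __getitem__(self, idx):
        return self.idxs[-idx]  # negative indexing walks the list from the end

A likely cause is therefore a dataset length (such as the 3000 shown in the progress bar) larger than the number of ids actually loaded into self.idxs, e.g. because prepare_data.py wrote fewer files than expected.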


name 'ignore_index' is not defined

I am getting this error when running the code on Google Colab.
Where are you running your code? Please guide me and help me resolve this error.
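
A hedged guess at the missing definition (treating the pad-token id as the ignore index is an assumption, not confirmed from the repo; the batch_size issue above discusses the related -100 convention):

# assumption: make the loss skip padded positions
ignore_index = tokenizer.pad_token_id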

Preprocessing

How do I preprocess my own dataset and convert it into the format the model needs?
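
Based on the commands quoted in the token-length issue above, the preprocessing pipeline appears to be (paths are placeholders):

$ python max_article_sizes.py path/to/cnn_or_dailymail/tokenized/articles
$ python prepare_data.py [path/to/pickle_file/of/articles/sizes/created/using/above/command]

The first pass records article sizes in a pickle file; the second writes the per-article JSON files the training script reads.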

train_gpt2_summarizer.py does not work

File "train_gpt2_summarizer.py", line 32
writer = SummaryWriter('./logs')
^
IndentationError: unindent does not match any outer indentation level

I am running this on Google Colab.
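
An IndentationError at this point usually means the file mixes tabs and spaces, or that line 32 is indented to a depth that matches no enclosing block. A hedged sketch of the expected shape (the import and surrounding code are assumptions, not quoted from the repo; the repo may import SummaryWriter from tensorboardX instead):

# assumption: SummaryWriter from torch.utils.tensorboard
from torch.utils.tensorboard import SummaryWriter

def train(args, model, tokenizer, train_dataset, valid_dataset, ignore_index):
    # indent with spaces only, at the same depth as the neighbouring statements
    writer = SummaryWriter('./logs')

Re-saving the file in an editor configured to convert tabs to spaces typically fixes this.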
