
Comments (18)

JingqingZ commented on July 18, 2024

Hi, you may modify the input_pattern to something like

big_patent/all-train-shard_100-take_200

for the corresponding dataset in https://github.com/google-research/pegasus/blob/master/pegasus/params/public_params.py.
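For readers unsure how these pattern strings break down, here is a rough sketch of the naming convention. This is an illustration only, not the actual pegasus parser: dataset name/config, split, then optional shard_N and take_N modifiers (take_N keeps the first N examples, as confirmed later in this thread).

```python
import re

def parse_input_pattern(pattern: str) -> dict:
    """Illustrative decomposition of an input_pattern string.

    NOT the actual pegasus parser -- just a sketch of the naming
    convention used in strings like
    'big_patent/all-train-shard_100-take_200'.
    """
    m = re.fullmatch(
        r"(?P<name>[\w./:]+)-(?P<split>train|validation|test)"
        r"(?:-shard_(?P<shard>\d+))?"
        r"(?:-take_(?P<take>\d+))?",
        pattern,
    )
    if m is None:
        raise ValueError(f"unrecognized pattern: {pattern}")
    parts = m.groupdict()
    parts["shard"] = int(parts["shard"]) if parts["shard"] else None
    parts["take"] = int(parts["take"]) if parts["take"] else None
    return parts

print(parse_input_pattern("big_patent/all-train-shard_100-take_200"))
# {'name': 'big_patent/all', 'split': 'train', 'shard': 100, 'take': 200}
```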

from pegasus.

rohithsiddhartha commented on July 18, 2024

Hey, I have a doubt here.
I have been following the code for text summarization on the wikihow dataset; you mentioned wikihow/all-train-take_1000 as the analogue of big_patent/all-train-shard_100-take_200.

I would suggest adding this to the README itself, along with the pattern for taking a partial dataset (and if the pattern differs between datasets, please mention that, since it varies by dataset).

rohithsiddhartha commented on July 18, 2024

Can I know the specs of the GPU/CPU setup you used to extract the TF records for the big_patent dataset? I tried it on AI Platform and was able to create TF records for subcategories d, e, and f, but for the rest I ran out of master memory on AI Platform (not RAM).
I tried the same on a VM instance, but it didn't finish processing even after 1 hour. If you could mention the configuration you used to extract the TF records for the big_patent dataset, it would be helpful. An even better solution would be to provide a Google Cloud Storage link to the TF records created during your testing phase; with the TF records themselves, we could just run the model by pointing it at them, which would save a lot of computation.

JingqingZ commented on July 18, 2024

Hi, I am sorry, I am not sure what you are asking for. The (sentence) extraction only happens in the pre-training stage; there is no extraction in the fine-tuning stage. The input-target pairs are already provided in tfds (for each downstream dataset) and can simply be used for supervised fine-tuning. The big_patent dataset is very large, so please be patient when you download it from tfds for the first time. By default, we use big_patent (all). As far as I remember, we used fewer than 32 CPUs to pre-fetch data in the fine-tuning stage.

rohithsiddhartha commented on July 18, 2024

Hey, I wasn't specific in the earlier comment. I change data_dir in the datasets.py file in the data folder. I'm running these jobs on AI Platform, where I use a DirectRunner pipeline to download the dataset and, after execution, produce the TF records (input-target pairs). That first-time download isn't completing due to low computation power. What I'm asking for is the dataset that gets downloaded from tfds the first time. What I will do is download that dataset, place it in a bucket, and then let the model access it from the bucket.

The advantage is that I skip the first-time download during execution: I'll change the directory path so that, instead of looking for the dataset in the tfds folder, it looks in a GCS bucket (given the path to the input-target pairs), and the model will then run, skipping the download even on the first execution.

All I'm asking for is the data that is downloaded from tfds the first time ("The big_patent dataset is very large so please be patient when you download it from tfds for the first time"). I hope you have it stored on your local disk. I request that you push the dataset to a bucket and share the path with users, just as you did with the checkpoints.
Please reply if you need further clarification.

JingqingZ commented on July 18, 2024

Hi, I am afraid we're not able to provide alternatives to the default tfds download. Sorry about this. If you would like to download it manually and upload it to your cloud, please refer to the big_patent website https://evasharma.github.io/bigpatent/.

JingqingZ commented on July 18, 2024

Feature: inputs (data type: string) is required but could not be found.

It seems the feature ("inputs") is missing.

rohithsiddhartha commented on July 18, 2024

Hey, sorry for deleting my comment; I found the fix and assumed you hadn't seen it yet, so I removed it since you have other things to take care of. Yes, it doesn't have inputs and targets; instead it has abstract and description. I had thought the TF records created by default when we pass tfds:big_patent (or any other tfds dataset) would have input-target pairs instead of abstract and description.
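To make the mismatch concrete: a raw big_patent example in tfds carries description (the full patent text) and abstract (the summary), which have to be renamed onto the generic inputs/targets pair the model consumes. A minimal standalone sketch of that mapping (the helper is hypothetical; pegasus presumably performs an equivalent renaming internally):

```python
def to_input_target(example: dict) -> dict:
    # Hypothetical helper (not from the pegasus codebase): rename the
    # big_patent-specific feature names to the generic ones the model
    # expects -- "description" becomes "inputs", "abstract" becomes
    # "targets".
    return {
        "inputs": example["description"],
        "targets": example["abstract"],
    }

print(to_input_target({"description": "full patent text",
                       "abstract": "short summary"}))
# {'inputs': 'full patent text', 'targets': 'short summary'}
```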

CBvanYperen commented on July 18, 2024

Hi,

I am also interested in the low-resource results that you showed in your paper. Before going on to using my own dataset I'd like to replicate the results of the paper. For example, I'd like to replicate the values that you got for the CNN/DailyMail dataset with 10 examples. That is, the values that are highlighted in yellow in the screenshot below.
[screenshot: low-resource results table from the paper, CNN/DailyMail 10-example values highlighted in yellow]

It would be great if you could show an example in the same way as in the README, like this:

python3 pegasus/bin/train.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc

and

python3 pegasus/bin/evaluate.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
--model_dir=ckpt/pegasus_ckpt/aeslc

Your help would be much appreciated!

JingqingZ commented on July 18, 2024

Hi, you may change aeslc to cnn_dailymail in the command and update the train_pattern defined here to tfds:cnn_dailymail/plain_text-train-take_10
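Concretely, the edit to public_params.py would look something along these lines. This is a sketch only, assuming the registered params name is cnn_dailymail_transformer; all values other than train_pattern should be copied from the existing cnn_dailymail registration in that file:

```python
# In pegasus/params/public_params.py -- only train_pattern changes,
# so fine-tuning sees just the first 10 training examples.
@registry.register("cnn_dailymail_transformer")
def cnn_dailymail_transformer(param_overrides):
  return transformer_params(
      {
          "train_pattern": "tfds:cnn_dailymail/plain_text-train-take_10",
          # ... keep dev_pattern, test_pattern, and the remaining
          # hyperparameters exactly as in the existing registration ...
      }, param_overrides)
```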

CBvanYperen commented on July 18, 2024

@JingqingZ Great, thanks! Another question, in your paper you write

We fine-tuned the models up to 2000 steps with batch size 256, learning rate 0.0005, and picked the checkpoint with best validation performance.

On my machine I cannot use a batch_size of 256, but I assume any batch size should give similar performance. The learning-rate hyperparameter can be set via the param_overrides flag, if I am not mistaken. However, I am not clear on the best checkpoint, since it clearly depends on how often you save a checkpoint. In the code in this folder, checkpoints are saved every 1000 steps. Is that also the case for these low-resource results? So do you choose the best checkpoint from steps 0, 1000, and 2000?

JingqingZ commented on July 18, 2024

The batch size may affect performance slightly even when all other settings are the same. Yes, the param_overrides flag is provided to override hyperparameters like the learning rate in your command.

The best checkpoint refers to the checkpoint with the best validation loss. The checkpoints are actually saved at relatively small intervals (in the low-resource setting), so the best checkpoint can be at step 100, 200, 1000, 1500, 1600, or so.
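The selection rule is simple enough to state as code; a toy sketch (checkpoint steps and loss values below are made up for illustration):

```python
def best_checkpoint(val_losses: dict) -> int:
    # Given {checkpoint_step: validation_loss} for every saved
    # checkpoint, return the step with the smallest loss.
    return min(val_losses, key=val_losses.get)

# Made-up numbers for illustration only.
losses = {100: 2.91, 200: 2.75, 1000: 2.60, 1500: 2.58, 1600: 2.63}
print(best_checkpoint(losses))  # 1500
```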

CBvanYperen commented on July 18, 2024

Hi, I'm still not really able to replicate the low-resource results.
Since the CNN/DailyMail dataset kept giving me OOM errors (something I'll look into later), I wanted to replicate the AESLC dataset results instead. Specifically, the one with 10 examples, as highlighted below:
[screenshot: low-resource results table from the paper, AESLC 10-example row highlighted]

What I have done is the following:

!python3 pegasus/bin/train.py \
--params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model, \
        batch_size=4, \
        train_pattern="tfds:aeslc-train-take_10", \
        learning_rate=0.0005 \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc \
--train_steps_overrides=2000 \
--save_checkpoints_steps=100 \
--keep_checkpoint_max=20 \

If I understood correctly, this fine-tunes PEGASUS-Large on 10 examples from the AESLC dataset for 2000 steps, saving the model every 100th step.

Then I ran the evaluation with the following code:

!python3 pegasus/bin/evaluate.py \
--params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model, \
        batch_size=4, \
--model_dir=ckpt/pegasus_ckpt/aeslc \

In the checkpoint file, the first line was model_checkpoint_path: "model.ckpt-2000", so it evaluates the model trained for 2000 steps. Then, since I wanted the model that gave the best performance, I ran the following:

!python3 pegasus/bin/evaluate.py \
--params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model, \
        batch_size=4, \
--model_dir=ckpt/pegasus_ckpt/aeslc \
--evaluate_test=True \
--best=True \
--text_metrics_pattern=text_metrics-*-.dev.txt

This gave me a text_metrics-2000-.best.test.txt file, which seems to be the one I need. The content of this file is as follows:
[screenshot: contents of text_metrics-2000-.best.test.txt]
I highlighted the values that I expected to match those in the paper. However, they clearly differ, so I must be making a mistake somewhere. I would very much appreciate your help in finding it!

JingqingZ commented on July 18, 2024
!python3 pegasus/bin/train.py \
--params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model, \
        batch_size=4, \
        train_pattern="tfds:aeslc-train-take_10", \
        learning_rate=0.0005 \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc \
--train_steps_overrides=2000 \
--save_checkpoints_steps=100 \
--keep_checkpoint_max=20 \

Could you double-check that all the param_overrides have been successfully passed to the params instance? They are not on my machine, which means you probably fine-tuned on the entire training set instead of only 10 examples.

CBvanYperen commented on July 18, 2024

Thanks, I think you were right. I added the following to public_params.py:

@registry.register("aeslc_transformer_low")
def aeslc_transformer(param_overrides):
  return transformer_params(
      {
          "train_pattern": "tfds:aeslc-train-take_10",
          "dev_pattern": "tfds:aeslc-validation",
          "test_pattern": "tfds:aeslc-test",
          "max_input_len": 512,
          "max_output_len": 32,
          "train_steps": 2000,
          "learning_rate": 0.0005,
          "batch_size": 4,
      }, param_overrides)

And I trained the model with those parameters. I kept the other steps the same and ran the whole process 3 times, which gave these average values:

rouge1-F: 0.0892
rouge2-F: 0.0406
rougeL-F: 0.0838

This is certainly a lot closer to the values reported in the paper, although there is still some deviation. Do you think this could be explained by the smaller batch size? Or could it be due to randomness in selecting the 10 examples used to train the model? Or did you perhaps use a different max_output_len?

JingqingZ commented on July 18, 2024

It seems batch size is the major difference in your settings. We actually selected the first 10 examples rather than sampling at random.

CBvanYperen commented on July 18, 2024

Alright, thanks! So does -take_10 automatically select the first 10, or do I need to make modifications so it selects the first 10 as well?

JingqingZ commented on July 18, 2024

I think take_10 will take the first 10.
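For intuition, take_N behaves like a deterministic prefix rather than a random sample, similar to tf.data's Dataset.take(n). A pure-Python analogy (not pegasus code):

```python
from itertools import islice

def take(dataset, n):
    # Deterministic prefix: the first n elements in order, like
    # tf.data's Dataset.take(n) -- no shuffling or sampling involved.
    return list(islice(dataset, n))

print(take(range(100), 10))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```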
