
Comments (43)

TheRockXu commented on July 18, 2024

I got it to make novel predictions by instantiating the estimator object and creating an input_fn.

import os

from pegasus.data import infeed
from pegasus.params import all_params  # pylint: disable=unused-import
from pegasus.params import estimator_utils
from pegasus.params import registry
import tensorflow as tf
from pegasus.eval import text_eval
from pegasus.ops import public_parsing_ops


tf.enable_eager_execution()

master = ""
model_dir = "./ckpt/pegasus_ckpt/cnn_dailymail"
use_tpu = False
iterations_per_loop = 1000
num_shards = 1
param_overrides = "vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6"


eval_dir = os.path.dirname(model_dir)
checkpoint_path = tf.train.latest_checkpoint(model_dir)
params = registry.get_params('cnn_dailymail_transformer')(param_overrides)
pattern = params.dev_pattern
# Built from the dev-set pattern; not used below -- the custom input_function
# defined later feeds the example text instead.
input_fn = infeed.get_input_fn(params.parser, pattern,
                               tf.estimator.ModeKeys.PREDICT)
parser, shapes = params.parser(mode=tf.estimator.ModeKeys.PREDICT)


estimator = estimator_utils.create_estimator(master, 
                                             model_dir,
                                             use_tpu,
                                             iterations_per_loop,
                                             num_shards, params)

_SPM_VOCAB = 'ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model'
encoder = public_parsing_ops.create_text_encoder("sentencepiece", 
                                                     _SPM_VOCAB)



input_text = "Eighteen sailors were injured after an explosion and fire on board a ship at the US Naval Base in San Diego, US Navy officials said. The sailors on the USS Bonhomme Richard had 'minor injuries' from the fire and were taken to a hospital, Lt. Cmdr. Patricia Kreuzberger told CNN."
target = "18 sailors injured after an explosion and fire on a naval ship in San Diego"


def input_function(params):
    # Build a tiny dataset from the example text; the parser maps raw
    # strings to the model's input features.
    dataset = tf.data.Dataset.from_tensor_slices(
        {"inputs": [input_text, input_text],
         "targets": [target, target]}).map(parser)
    dataset = dataset.unbatch()
    dataset = dataset.padded_batch(
        params["batch_size"],
        padded_shapes=shapes,
        drop_remainder=True)
    return dataset

predictions = estimator.predict(
          input_fn=input_function, checkpoint_path=checkpoint_path)

for i in predictions:
    print(text_eval.ids2str(encoder, i['outputs'], None))
    break

# Output - "The USS Bonhomme Richard had 'minor injuries' from the fire and were taken to a hospital ."

TheRockXu commented on July 18, 2024

Hi, y'all.

I just created a repo that contains a trained pegasus servable model and a script with which you can run the summarization end to end, like this:

python test_example.py --article example_article --model_dir model/

Suppose your article is this one

You will see this output - PREDICTION >> The hacking group known as NC29 is largely believed to operate as part of Russia's security services .<n>The three countries allege that it is carrying out a persistent and ongoing cyber campaign to steal intellectual property about a possible coronavirus vaccine .

JingqingZ commented on July 18, 2024

The local set up has been explained in README. GPU or TPU is not compulsory but highly highly highly recommended. Running on CPU can be very very very slow. Regarding colab, please refer to #16.

peterjliu commented on July 18, 2024

@sumeet-iitg two issues:

  • cnn/dm is an almost extractive dataset, so this model is more extractive. Try xsum or Reddit for something more abstractive.
  • the input you provided for that model is totally different from what it expects, which is article-length text. This is a problem called out-of-distribution input. Reddit might work better.

Patrick-Old commented on July 18, 2024

Hey, for those who are looking for the simplest way to run Pegasus for summarization, I highly recommend checking out Hugging Face (as @JingqingZ recommended above). Here is the link. The code is as simple as this, assuming you are trying to run the xsum version of Pegasus (I recommend checking out some of the others as well):

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = 'google/pegasus-xsum'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest').to(torch_device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
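Note: in newer versions of transformers, prepare_seq2seq_batch is deprecated in favor of calling the tokenizer directly; the following should be equivalent (an untested sketch):

batch = tokenizer(src_text, truncation=True, padding='longest', return_tensors='pt').to(torch_device)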

JingqingZ commented on July 18, 2024

Hi, if you would like to run the PEGASUS model on the 12 existing datasets, on which PEGASUS has already been fine-tuned, please follow the README and run the model. If you would like to run the model on your own customised textual data, you need to configure the data as a new dataset and then fine-tune from the pre-trained model checkpoints. These steps are already in the README.

So far, there is no simple way to feed a single text input and get a single summary output, but we're trying to develop this feature. Note that if you have a new dataset, fine-tuning is always necessary unless you would like to test zero-shot summarization (as we demonstrated in PEGASUS paper section 6.3).

The maximum length of the summary can be specified in decoding. You may control the length of the summary in beam search by using length normalization.
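For reference, beam_alpha is the usual length-normalization exponent in beam search: hypothesis log-probabilities are divided by a factor that grows with output length, roughly

score(Y) = log P(Y|X) / ((5 + |Y|) / 6)^beam_alpha

so a larger beam_alpha weakens the penalty on longer hypotheses and tends to yield longer summaries. (This is the common GNMT-style formulation; the exact constants in the PEGASUS code may differ.)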

TheRockXu commented on July 18, 2024

Yeah, I also uploaded a model on gigaword; you can get it from here.

It was trained overnight on an Nvidia Quadro P6000. It seems to do abstractive summarization pretty well.

TheRockXu commented on July 18, 2024

@nliu86 Oh, just to avoid memory error.

TheRockXu commented on July 18, 2024

@sumeet-iitg
First you need to train the cnn_dailymail model by running
python3 pegasus/bin/train.py --params=cnn_dailymail_transformer \
  --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
  --train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
  --model_dir=ckpt/pegasus_ckpt/cnn_dailymail

TheRockXu commented on July 18, 2024

@MarvinT Yes, I do. I just uploaded an iPython script here to convert the checkpoints into the pb model. I didn't use any datasets other than those provided.

chetanambi commented on July 18, 2024

I was going through the authors' paper, and per the paper the max limit is 32 for the Gigaword dataset. Thanks again for your wonderful work on creating the pegasus-demo code. That was really very helpful.

beni1864 commented on July 18, 2024

Hi JingqingZ,

The following error occurs when I press "compute" on each model in the huggingface repository:

⚠️ Can't load config for 'google/pegasus-wikihow'. Make sure that: - 'google/pegasus-wikihow' is a correct model identifier listed on 'https://huggingface.co/models' - or 'google/pegasus-wikihow' is the correct path to a directory containing a config.json file

purgenetik commented on July 18, 2024

++ It would be nice to add to the README how to build a summary from a given text using a pretrained model.

JingqingZ commented on July 18, 2024

Hope the solution in this issue #21 can help you create the tfrecord of your dataset and run PEGASUS.

nkathireshan commented on July 18, 2024

Hi, please guide me on how to use Colab to run this code, or how to do a local setup. I see it is mentioned to use a GPU instance, and I don't have GPU credits to test it. If it works well on my dataset, I am happy to buy credits and try it on a GPU. Thank you very much.

peterjliu commented on July 18, 2024

You can try GPU/TPU on https://colab.research.google.com for free.

Saurabh1602 commented on July 18, 2024

Any idea how I could go about generating summaries for a new dataset with zero-shot summarisation? That would be hugely useful if it is possible at this point using the current checkpoints and code.

peterjliu commented on July 18, 2024

@Saurabh1602 that's an interesting question and is an active line of research that was out-of-scope for this project. It'd be a cool project to figure out how to do it with the existing checkpoints.

JingqingZ commented on July 18, 2024

A practical solution may need the creation of TFDS or TFRecords for the new dataset, registering a new param (as described in the README), and then running the code on the pre-trained checkpoints. Hope this may help #21 (comment).
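For illustration, a minimal sketch of writing such TFRecords, assuming the "inputs"/"targets" text-feature layout the README describes for registering a new tfrecord dataset (the function name and file path here are just examples):

import tensorflow as tf

def write_tfrecords(pairs, path):
    # pairs: iterable of (article_text, summary_text) tuples.
    with tf.io.TFRecordWriter(path) as writer:
        for article, summary in pairs:
            example = tf.train.Example(features=tf.train.Features(feature={
                "inputs": tf.train.Feature(bytes_list=tf.train.BytesList(
                    value=[article.encode("utf-8")])),
                "targets": tf.train.Feature(bytes_list=tf.train.BytesList(
                    value=[summary.encode("utf-8")])),
            }))
            writer.write(example.SerializeToString())

write_tfrecords([("some article text ...", "its summary ...")],
                "my_dataset.train.tfrecords")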

sumeet-iitg commented on July 18, 2024

I used the code mentioned in #13 (comment)
Unfortunately I get the following garbage output:
leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked leaked crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop crop

sumeet-iitg commented on July 18, 2024

@TheRockXu, thank you for sharing your trained model.
Overall, I feel that the summarization is still more extractive than abstractive. Summarizing bigger texts hides this fact, but for smaller texts the problem is clear. Here are results from the above fine-tuned cnn_dailymail model on a small input text:

INPUT>> 'A person is standing in front of a lake. The person appears to be John. The person is smiling.'
PREDICTION >> A person is standing in front of a lake . The person appears to be John .
Whereas an abstractive summary should probably look something like: A person that appears to be John is standing in front of a lake. OR A person that appears to be John is smiling.

Perhaps @JingqingZ can also provide insights on overcoming such problems.

MarvinT commented on July 18, 2024

@TheRockXu, do you have a script for converting the checkpoints into the pb model files? Did you use your own fine-tuned models or the checkpoints provided in the Google Cloud link?

chetanambi commented on July 18, 2024

@TheRockXu Thanks for your Pegasus-demo code. I used this for generating abstractive summaries using gigaword and the results are really nice!! However, I can see that the prediction size is limited to 32 characters. Is it possible to increase the prediction size beyond 32 characters? I tried changing the parameters in test_example.py, but the prediction size is still limited to 32 characters. Please let me know your thoughts.

TheRockXu commented on July 18, 2024

@chetanambi I think it is due to the training data. If you want longer abstractive summaries, you can probably add other training data to it.

chetanambi commented on July 18, 2024

Yeah, I also uploaded a model on gigaword; you can get it from here.

It was trained overnight on an Nvidia Quadro P6000. It seems to do abstractive summarization pretty well.

@TheRockXu Could you please let me know the steps you followed to fine-tune gigaword? I would like to try summarization results on the Reddit dataset.

TheRockXu commented on July 18, 2024

@chetanambi All I did was decrease the batch size by half. I let it train for two days.

nliu86 commented on July 18, 2024

@TheRockXu I'm curious why you fine-tune gigaword. Why don't you just use the gigaword model finetuned by the author?

chetanambi commented on July 18, 2024

@TheRockXu Yes, I have the same question as @nliu86. Why didn't you just use the gigaword model finetuned by the author?

beni1864 commented on July 18, 2024

@TheRockXu I am trying to construct an instance of this model to use for drug-related clinical research articles (guessing Pubmed is the best choice?) I have little experience with Python and I would appreciate if you could help me get started, if you have the time. Thanks!

TheRockXu commented on July 18, 2024

@beni1864 email me at [email protected]

thangarani commented on July 18, 2024

(quoting @JingqingZ's reply above about running on custom data and controlling summary length)

@JingqingZ Can you please explain clearly and in detail how to change the length of the summary?

JingqingZ commented on July 18, 2024

Set a different max_output_len or choose a different beam_alpha to encourage longer/shorter summaries.

These params are defined in https://github.com/google-research/pegasus/blob/master/pegasus/params/public_params.py
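For example, overriding them at decoding time might look like this (a sketch based on the README's evaluate command; the dataset, paths, and values here are illustrative, so adjust them to your setup):

python3 pegasus/bin/evaluate.py --params=gigaword_transformer \
  --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.8,max_output_len=64 \
  --model_dir=ckpt/pegasus_ckpt/gigaword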

thangarani commented on July 18, 2024

Yeah, I also uploaded a model on gigaword; you can get it from here.

It was trained overnight on an Nvidia Quadro P6000. It seems to do abstractive summarization pretty well.

Yes, I implemented your pegasus demo on gigaword. It's working great. And the prediction: three countries allege that Cozy Bear is trying to steal vaccine research.

I need to compare it with BERT. One way to compare is to have the same number of words in the prediction. Is it possible to increase the length of the prediction for the gigaword model in pegasus-demo?

TheRockXu commented on July 18, 2024

@thangarani I doubt it. You'd have to pick a different training dataset to train a new model, like reddit.

JingqingZ commented on July 18, 2024

Pegasus is supported by HuggingFace now https://huggingface.co/models?filter=pegasus.

JingqingZ commented on July 18, 2024

(quoting the "Can't load config for 'google/pegasus-wikihow'" error reported above)

Hi, please report the error to the HuggingFace team if the problem persists.

Guru4741 commented on July 18, 2024

Can the Pegasus model be used to get a summary from a URL instead of text input? What processing needs to be done on the URL content to feed it to the Pegasus model? Can anybody help, please?

bui-thanh-lam commented on July 18, 2024

I was trying to fine-tune on Colab with batch_size = 2 and max_input_length = 512, and it almost ran out of VRAM. It also took a very long time (20,000 examples, 2 h/epoch, bs = 2). Why does it take so much time and memory? How do you all set the params?
Thanks all!

sulata2 commented on July 18, 2024

(quoting @TheRockXu's estimator prediction code from the first comment above)

Hi, I tried using your demo code, but I am getting the error below. Could you please help me with this?

!python3 pegasus/bin/train.py --params=aeslc_transformer \
  --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
  --train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
  --model_dir=ckpt/pegasus_ckpt/aeslc

Traceback (most recent call last):
  File "pegasus/bin/train.py", line 17, in <module>
    from pegasus.data import infeed
  File "/usr/local/lib/python3.7/dist-packages/pegasus/__init__.py", line 1, in <module>
    from pegasus.parser import *
  File "/usr/local/lib/python3.7/dist-packages/pegasus/parser.py", line 10, in <module>
    from pegasus.rules import _build_rule, ParseError, Lazy
  File "/usr/local/lib/python3.7/dist-packages/pegasus/rules.py", line 62
    print 'pegasus: {}\x1b[2;38;5;241menter {} -> {}\x1b[m'.format(depth, repr(char()), _name)
          ^
SyntaxError: invalid syntax

christophschuhmann commented on July 18, 2024

I tried to get it to run on Colab, but it hits a weird error:
https://colab.research.google.com/drive/1sMVvIhZExRYJqFBsPhO28InJ68502dJj?usp=sharing

Can anyone here fix it?

pumuckelo commented on July 18, 2024

This article is also great; it uses Hugging Face and you can try out different models: https://towardsdatascience.com/abstractive-summarization-using-pytorch-f5063e67510

prinky12 commented on July 18, 2024

@sumeet-iitg
It's too late, but I'm leaving a comment in the hope that it will help someone else.
You need to change "sentencepiece" to "sentencepiece_newline", e.g.:
encoder = public_parsing_ops.create_text_encoder("sentencepiece_newline", _SPM_VOCAB)

jayasridharmireddi commented on July 18, 2024

@sumeet-iitg First you need to train the cnn_dailymail model by running
python3 pegasus/bin/train.py --params=cnn_dailymail_transformer \
  --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
  --train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
  --model_dir=ckpt/pegasus_ckpt/cnn_dailymail

Hello, when I do this I am getting a checksum error. Can you please look into this:

raise NonMatchingChecksumError(resource.url, tmp_path)
tensorflow_datasets.core.download.download_manager.NonMatchingChecksumError: Artifact https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ, downloaded to /home/tbvl/tensorflow_datasets/downloads/ucexport_download_id_0BwmD_VLjROrfTHk4NFg2SndKG8BdJPpt2iRo6Dpzz23CByJuAePEilB-pxbcBCHaWDs.tmp.4214ed267e0b4cca80a05b4fd69eaa5c/download, has wrong checksum.
I0224 22:47:57.809536 140247509751552 download_manager.py:273] Skipping extraction for /home/tbvl/tensorflow_datasets/downloads/raw.gith.com_abis_cnn-dail_mast_url_list_apc7knzpshiwmzikwgjbSqZYlq2yGpDviLVIGsnkNgCk.txt (method=NO_EXTRACT).
