
bert-extractive-summarizer's Introduction

Bert Extractive Summarizer


This repo is a generalization of the lecture-summarizer repo. The tool uses the HuggingFace PyTorch transformers library to run extractive summarization: it first embeds the sentences, then runs a clustering algorithm, and finally selects the sentences closest to each cluster's centroid. The library can also apply coreference resolution, via the https://github.com/huggingface/neuralcoref library, to resolve words in summaries that need more context. The greedyness of neuralcoref can be tweaked in the CoreferenceHandler class.
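Conceptually, the pipeline looks roughly like the following minimal sketch (written with sentence-transformers and scikit-learn purely for illustration; this is not the library's actual implementation):

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def naive_extractive_summary(sentences, num_sentences=3):
    # 1. Embed each sentence.
    encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    embeddings = encoder.encode(sentences)

    # 2. Cluster the sentence embeddings.
    kmeans = KMeans(n_clusters=num_sentences, random_state=0).fit(embeddings)

    # 3. Pick the sentence closest to each cluster centroid, preserving document order.
    chosen = []
    for centroid in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - centroid, axis=1)
        chosen.append(int(np.argmin(distances)))
    return [sentences[i] for i in sorted(set(chosen))]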

As of the most recent version of bert-extractive-summarizer, CUDA is used by default if a GPU is available.
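If you want to confirm whether a GPU will be picked up before building a model, a quick check with PyTorch (which the library runs on) is enough:

import torch

print(torch.cuda.is_available())  # True means the summarizer will run on CUDA by default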

Paper: https://arxiv.org/abs/1906.04165

Try the Online Demo:

DistilBERT Summarization Demo

Table of Contents

  1. Install
  2. Examples
    1. Simple Example
    2. SBert
    3. Retrieve Embeddings
    4. Use Coreference
    5. Custom Model Example
    6. Large Example
  3. Calculating Elbow
  4. Running the Service

Install

pip install bert-extractive-summarizer

Examples

Simple Example

from summarizer import Summarizer

body = 'Text body that you want to summarize with BERT'
body2 = 'Something else you want to summarize with BERT'
model = Summarizer()
model(body)
model(body2)

Specifying number of sentences

Number of sentences can be supplied as a ratio or an integer. Examples are provided below.

from summarizer import Summarizer
body = 'Text body that you want to summarize with BERT'
model = Summarizer()
result = model(body, ratio=0.2)  # Specified with ratio
result = model(body, num_sentences=3)  # Will return 3 sentences 

Using multiple hidden layers as the embedding output

You can also concatenate the outputs of multiple hidden layers to form the embeddings used for clustering. A simple example is below.

from summarizer import Summarizer
body = 'Text body that you want to summarize with BERT'
model = Summarizer('distilbert-base-uncased', hidden=[-1,-2], hidden_concat=True)
result = model(body, num_sentences=3)

Use SBert

The newest version of bert-extractive-summarizer supports Sentence-BERT (SBERT). It is based on the paper here: https://arxiv.org/abs/1908.10084, and the library here: https://www.sbert.net/. To get started, first install SBERT:

pip install -U sentence-transformers

Then a simple example is the following:

from summarizer.sbert import SBertSummarizer

body = 'Text body that you want to summarize with BERT'
model = SBertSummarizer('paraphrase-MiniLM-L6-v2')
result = model(body, num_sentences=3)

It is worth noting that everything you can do with the main Summarizer class can also be done with SBertSummarizer.
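For example, the ratio and embedding-retrieval calls shown elsewhere in this README should work with SBertSummarizer as well (illustrative sketch):

from summarizer.sbert import SBertSummarizer

body = 'Text body that you want to summarize with BERT'
model = SBertSummarizer('paraphrase-MiniLM-L6-v2')
result = model(body, ratio=0.2)                            # ratio works the same way
embeddings = model.run_embeddings(body, num_sentences=3)   # embedding retrieval works the same way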

Retrieve Embeddings

You can also retrieve the embeddings of the summarization. Examples are below:

from summarizer import Summarizer
body = 'Text body that you want to summarize with BERT'
model = Summarizer()
result = model.run_embeddings(body, ratio=0.2)  # Specified with ratio. 
result = model.run_embeddings(body, num_sentences=3)  # Will return (3, N) embedding numpy matrix.
result = model.run_embeddings(body, num_sentences=3, aggregate='mean')  # Will return Mean aggregate over embeddings. 
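Since run_embeddings returns a plain numpy matrix, you can post-process it however you like. For instance, a quick pairwise cosine-similarity check between the selected sentences (illustrative only, not part of the library API):

import numpy as np
from summarizer import Summarizer

body = 'Text body that you want to summarize with BERT'
model = Summarizer()
embeddings = model.run_embeddings(body, num_sentences=3)  # (3, N) numpy matrix, or None if too few sentences

if embeddings is not None:
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    print(normalized @ normalized.T)  # pairwise cosine similarity between the summary sentences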

Use Coreference

First ensure you have installed neuralcoref and spacy. It is worth noting that neuralcoref targets spaCy 2.1.x and does not work with newer spaCy releases (3.x).

pip install spacy
pip install transformers # > 4.0.0
pip install neuralcoref

python -m spacy download en_core_web_md

Then, to use coreference, run the following:

from summarizer import Summarizer
from summarizer.text_processors.coreference_handler import CoreferenceHandler

handler = CoreferenceHandler(greedyness=.4)
# How coreference works:
# >>>handler.process('''My sister has a dog. She loves him.''', min_length=2)
# ['My sister has a dog.', 'My sister loves a dog.']

body = 'Text body that you want to summarize with BERT'
body2 = 'Something else you want to summarize with BERT'
model = Summarizer(sentence_handler=handler)
model(body)
model(body2)

Custom Model Example

from transformers import AutoConfig, AutoModel, AutoTokenizer

# Load model, model config and tokenizer via Transformers
custom_config = AutoConfig.from_pretrained('allenai/scibert_scivocab_uncased')
custom_config.output_hidden_states = True
custom_tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
custom_model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased', config=custom_config)

from summarizer import Summarizer

body = 'Text body that you want to summarize with BERT'
body2 = 'Something else you want to summarize with BERT'
model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)
model(body)
model(body2)

Large Example

from summarizer import Summarizer

body = '''
The Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of its previous sales price.
The deal, first reported by The Real Deal, was for $150 million, according to a source familiar with the deal.
Mubadala, an Abu Dhabi investment fund, purchased 90% of the building for $800 million in 2008.
Real estate firm Tishman Speyer had owned the other 10%.
The buyer is RFR Holding, a New York real estate company.
Officials with Tishman and RFR did not immediately respond to a request for comments.
It's unclear when the deal will close.
The building sold fairly quickly after being publicly placed on the market only two months ago.
The sale was handled by CBRE Group.
The incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building.
The rent is rising from $7.75 million last year to $32.5 million this year to $41 million in 2028.
Meantime, rents in the building itself are not rising nearly that fast.
While the building is an iconic landmark in the New York skyline, it is competing against newer office towers with large floor-to-ceiling windows and all the modern amenities.
Still the building is among the best known in the city, even to people who have never been to New York.
It is famous for its triangle-shaped, vaulted windows worked into the stylized crown, along with its distinctive eagle gargoyles near the top.
It has been featured prominently in many films, including Men in Black 3, Spider-Man, Armageddon, Two Weeks Notice and Independence Day.
The previous sale took place just before the 2008 financial meltdown led to a plunge in real estate prices.
Still there have been a number of high profile skyscrapers purchased for top dollar in recent years, including the Waldorf Astoria hotel, which Chinese firm Anbang Insurance purchased in 2016 for nearly $2 billion, and the Willis Tower in Chicago, which was formerly known as Sears Tower, once the world's tallest.
Blackstone Group (BX) bought it for $1.3 billion 2015.
The Chrysler Building was the headquarters of the American automaker until 1953, but it was named for and owned by Chrysler chief Walter Chrysler, not the company itself.
Walter Chrysler had set out to build the tallest building in the world, a competition at that time with another Manhattan skyscraper under construction at 40 Wall Street at the south end of Manhattan. He kept secret the plans for the spire that would grace the top of the building, building it inside the structure and out of view of the public until 40 Wall Street was complete.
Once the competitor could rise no higher, the spire of the Chrysler building was raised into view, giving it the title.
'''

model = Summarizer()
result = model(body, min_length=60)
full = ''.join(result)
print(full)
"""
The Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of its previous sales price. 
The building sold fairly quickly after being publicly placed on the market only two months ago.
The incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building.
Still the building is among the best known in the city, even to people who have never been to New York.
"""

Calculating Elbow

As of bert-extractive-summarizer version 0.7.1, you can also calculate the elbow to determine the optimal number of clusters. The example below shows how to retrieve the list of inertias.

from summarizer import Summarizer

body = 'Your Text here.'
model = Summarizer()
res = model.calculate_elbow(body, k_max=10)
print(res)
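The returned inertias can also be plotted to pick the elbow by eye. A small sketch with matplotlib (assuming it is installed, and assuming the list is ordered from the smallest k upward):

import matplotlib.pyplot as plt
from summarizer import Summarizer

body = 'Your Text here.'
model = Summarizer()
inertias = model.calculate_elbow(body, k_max=10)

plt.plot(range(1, len(inertias) + 1), inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('inertia')
plt.show()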

You can also find the optimal number of sentences directly, using the elbow heuristic, as follows.

from summarizer import Summarizer

body = 'Your Text here.'
model = Summarizer()
res = model.calculate_optimal_k(body, k_max=10)
print(res)

Summarizer Options

model = Summarizer(
    model: The HuggingFace transformers model name used to load the weights; you can also supply a custom trained model here.
    custom_model: If you have a pre-trained model instance, you can pass the model object here.
    custom_tokenizer: If you have a custom tokenizer, you can pass the tokenizer here.
    hidden: Which hidden layer(s) the embeddings come from; indices must be negative, counted from the last layer.
    reduce_option: Can be 'mean', 'median', or 'max'. This reduces the embedding layer for pooling.
    sentence_handler: The handler used to split the body into sentences. To use coreference, instantiate and pass a CoreferenceHandler instance.
)

model(
    body: str # The string body that you want to summarize
    ratio: float # The ratio of sentences to keep in the final summary
    min_length: int # Sentences shorter than this many characters are dropped (default 40)
    max_length: int # Sentences longer than this many characters are dropped (default 600)
    num_sentences: int # Number of sentences to return; overrides ratio if supplied
)
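Putting several of these options together, an illustrative combination using only parameters shown elsewhere in this README:

from summarizer import Summarizer

body = 'Text body that you want to summarize with BERT'
model = Summarizer(
    model='distilbert-base-uncased',  # any HuggingFace model name
    hidden=[-1, -2],                  # use (and concatenate) the last two hidden layers
    hidden_concat=True,
    reduce_option='mean',
)
result = model(body, num_sentences=4, min_length=40, max_length=500)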

Running the Service

There is a provided Flask service and corresponding Dockerfile. Running the service is simple, and can be done through the Makefile with the two commands:

make docker-service-build
make docker-service-run

This will use the bert-base-uncased model, which has a smaller footprint. The docker run command also accepts a variety of arguments for using custom or different models, for example:

docker build -t summary-service -f Dockerfile.service ./
docker run --rm -it -p 5000:5000 summary-service:latest -model bert-large-uncased

Other arguments can also be passed to the server. Below includes the list of available arguments.

  • -greediness: Float parameter that determines how greedy neuralcoref should be
  • -reduce: Determines the reduction statistic of the encoding layer (mean, median, max).
  • -hidden: Determines the hidden layer to use for embeddings (default is -2)
  • -port: Determines the port to use.
  • -host: Determines the host to use.

Once the service is running, you can make a summarization request to the http://localhost:5000/summarize endpoint. This endpoint accepts a text/plain input representing the text that you want to summarize. Parameters can also be passed as request arguments. The accepted arguments are:

  • ratio: Ratio of sentences to keep from the original body (defaults to 0.2)
  • min_length: The minimum length (in characters) to accept as a sentence (defaults to 25)
  • max_length: The maximum length (in characters) to accept as a sentence (defaults to 500)

An example of a request is the following:

POST http://localhost:5000/summarize?ratio=0.1

Content-type: text/plain

Body:
The Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of its previous sales price.
The deal, first reported by The Real Deal, was for $150 million, according to a source familiar with the deal.
Mubadala, an Abu Dhabi investment fund, purchased 90% of the building for $800 million in 2008.
Real estate firm Tishman Speyer had owned the other 10%.
The buyer is RFR Holding, a New York real estate company.
Officials with Tishman and RFR did not immediately respond to a request for comments.
It's unclear when the deal will close.
The building sold fairly quickly after being publicly placed on the market only two months ago.
The sale was handled by CBRE Group.
The incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building.
The rent is rising from $7.75 million last year to $32.5 million this year to $41 million in 2028.
Meantime, rents in the building itself are not rising nearly that fast.
While the building is an iconic landmark in the New York skyline, it is competing against newer office towers with large floor-to-ceiling windows and all the modern amenities.
Still the building is among the best known in the city, even to people who have never been to New York.
It is famous for its triangle-shaped, vaulted windows worked into the stylized crown, along with its distinctive eagle gargoyles near the top.
It has been featured prominently in many films, including Men in Black 3, Spider-Man, Armageddon, Two Weeks Notice and Independence Day.
The previous sale took place just before the 2008 financial meltdown led to a plunge in real estate prices.
Still there have been a number of high profile skyscrapers purchased for top dollar in recent years, including the Waldorf Astoria hotel, which Chinese firm Anbang Insurance purchased in 2016 for nearly $2 billion, and the Willis Tower in Chicago, which was formerly known as Sears Tower, once the world's tallest.
Blackstone Group (BX) bought it for $1.3 billion 2015.
The Chrysler Building was the headquarters of the American automaker until 1953, but it was named for and owned by Chrysler chief Walter Chrysler, not the company itself.
Walter Chrysler had set out to build the tallest building in the world, a competition at that time with another Manhattan skyscraper under construction at 40 Wall Street at the south end of Manhattan. He kept secret the plans for the spire that would grace the top of the building, building it inside the structure and out of view of the public until 40 Wall Street was complete.
Once the competitor could rise no higher, the spire of the Chrysler building was raised into view, giving it the title.

Response:

{
    "summary": "The Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of its previous sales price. The buyer is RFR Holding, a New York real estate company. The incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building."
}
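The same request can be sent from Python with the requests library (a minimal sketch; the endpoint and parameters are the ones documented above):

import requests

text = 'The Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of its previous sales price. ...'
response = requests.post(
    'http://localhost:5000/summarize',
    params={'ratio': 0.1},
    headers={'Content-Type': 'text/plain'},
    data=text.encode('utf-8'),
)
print(response.json()['summary'])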

bert-extractive-summarizer's People

Contributors

blakechi, coldteapot273k, dmmiller612, hdatteln, iknoorjobs, j40903272, tkomarlu, zafercavdar


bert-extractive-summarizer's Issues

Neuralcoref doesn't seem to actually be working

Hey authors,

Great repo so far. An issue: when I try to run the body from the example (on the Chrysler Building sale) through the neuralcoref code in the repo, it doesn't actually work...

For example, here is the result of running the body through neuralcoref and examining the clusters.

import neuralcoref
from spacy.lang.en import English

body = """
The Chrysler building was sold for ... [COPY AND PASTE EXACT HERE]
"""

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))

neuralcoref.add_to_pipe(nlp, greedyness=0.45)
# <spacy.lang.en.English at 0x7fce5d7d4110>

doc = nlp(body)
doc._.has_coref
# False

This is the code used at the moment in model_processors.py.

However, if we try this, instead, it works:

import spacy
import neuralcoref
# python -m spacy download en_core_web_sm  (run once to fetch the model)

body = """
The Chrysler building was sold for ... [COPY AND PASTE EXACT HERE]
"""

nlp = spacy.load('en_core_web_sm')
# Use the default dependency parser for sentence tokenization

neuralcoref.add_to_pipe(nlp, greedyness=0.45)

doc = nlp(body)
doc._.has_coref
# True
doc._.coref_resolved

"\nThe Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of \nThe Chrysler Building, the famous art deco New York skyscraper previous sales price.\nThe deal, first reported by The Real Deal, was for $150 million, according to a source familiar with the deal.\nMubadala, an Abu Dhabi investment fund, purchased 90% of the building for $800 million in 2008.\nReal estate firm Tishman Speyer had owned the other 10%.\nThe buyer is RFR Holding, a New York real estate company.\nOfficials with Tishman and RFR did not immediately respond to a request for comments.\nIt's unclear when the deal will close.\nthe building sold fairly quickly after being publicly placed on the market only two months ago.\nThe sale was handled by CBRE Group.\nThe incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building.\nThe rent is rising from $7.75 million last year to $32.5 million this year to $41 million in 2028.\nMeantime, rents in the building are not rising nearly that fast.\nWhile the building is an iconic landmark in the New York skyline, the building is competing against newer office towers with large floor-to-ceiling windows and all the modern amenities.\nStill the building is among the best known in the city, even to people who have never been to New York.\nIt is famous for its triangle-shaped, vaulted windows worked into the stylized crown, along with its distinctive eagle gargoyles near the top.\nIt has been featured prominently in many films, including Men in Black 3, Spider-Man, Armageddon, Two Weeks Notice and Independence Day.\nThe previous sale took place just before the 2008 financial meltdown led to a plunge in real estate prices.\nStill there have been a number of high profile skyscrapers purchased for top dollar in recent years, including the Waldorf Astoria hotel, which Chinese firm Anbang Insurance purchased in 2016 for nearly $2 billion, and the Willis Tower in Chicago, which was formerly known as Sears Tower, once the world's tallest.\nBlackstone Group (BX) bought Blackstone Group (BX) for $1.3 billion 2015.\nthe building was the headquarters of the American automaker until 1953, but the building was named for and owned by Chrysler chief Walter Chrysler, not the company itself.\nWalter Chrysler had set out to build the tallest building in the world, a competition at that time with another Manhattan skyscraper under construction at 40 Wall Street at the south end of Manhattan. Walter Chrysler kept secret the plans for the spire that would grace the top of the building, building it inside the structure and out of view of the public until 40 Wall Street was complete.\nOnce the competitor could rise no higher, the spire of the building was raised into view, giving the spire of the Chrysler building the title.\n"

Clearly, there are a lot of issues here (e.g. "Blackstone Group (BX) bought Blackstone Group (BX) for $1.3 billion 2015").

So it is almost better that this repo currently works without neuralcoref.

However, neuralcoref gets about 65 F1 on OntoNotes, whereas the state of the art has since progressed to BERT or SpanBERT coreference models (~80 F1). So maybe we should use those instead?

https://github.com/mandarjoshi90/coref

ERROR:pytorch_pretrained_bert.modeling:Model name 'bert-large-uncased' was not found in model name list

hmm.. any idea how to solve this? Full error here:

ERROR:pytorch_pretrained_bert.modeling:Model name 'bert-large-uncased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz' was a path or url but couldn't find any file associated to this path or url.

Suggestion to move over to sentence-bert for getting sentence embeddings

Hey Authors,

Since you are tokenizing each sentence separately, I suggest checking out this paper (Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks) and the corresponding repo (https://github.com/UKPLab/sentence-transformers) from the UKP lab in Germany.

They have shown that using the sum of Bert embeddings for each word to represent a sentence does very poorly on benchmarks (but at least better than using the CLS token).

I know you guys are using the second-to-last or third-to-last layer, but it is a trivial transition to move over to sentence-transformers.

In short, using the mean of BERT embeddings achieves a Spearman correlation of 0.45 on STS benchmarks, whereas Sentence-BERT achieves 0.84, a significant improvement.

The model is easy enough to use:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

cannot import name 'summarize'

After using pip install and going into my Python 3.8 shell, if I enter:

from summarizer import Summarizer

I get the error:

ImportError: cannot import name 'summarize' from partially initialized module 'summarizer' (most likely due to a circular import)

ValueError: module functions cannot set METH_CLASS or METH_STATIC

I am using it on windows.
Just by running the import line I get the following error:

from summarizer import SingleModel
Traceback (most recent call last):

File "", line 1, in
from summarizer import SingleModel

File "C:\Users\himansh\Anaconda3\lib\site-packages\summarizer\__init__.py", line 1, in
from summarizer.model_processors import SingleModel

File "C:\Users\himansh\Anaconda3\lib\site-packages\summarizer\model_processors.py", line 1, in
from summarizer.BertParent import BertParent

File "C:\Users\himansh\Anaconda3\lib\site-packages\summarizer\BertParent.py", line 1, in
from pytorch_pretrained_bert import BertTokenizer, BertModel, GPT2Model, GPT2Tokenizer

File "C:\Users\himansh\Anaconda3\lib\site-packages\pytorch_pretrained_bert\__init__.py", line 4, in
from .tokenization_transfo_xl import (TransfoXLTokenizer, TransfoXLCorpus)

File "C:\Users\himansh\Anaconda3\lib\site-packages\pytorch_pretrained_bert\tokenization_transfo_xl.py", line 30, in
import torch

File "C:\Users\himansh\Anaconda3\lib\site-packages\torch\__init__.py", line 79, in
from torch._C import *

ValueError: module functions cannot set METH_CLASS or METH_STATIC

error running "pip install bert-extractive-summarizer"

While trying the installation, I received the following:

Collecting torch>=0.4.1 (from pytorch-pretrained-bert->bert-extractive-summarizer)
ERROR: Could not find a version that satisfies the requirement torch>=0.4.1 (from pytorch-pretrained-bert->bert-extractive-summarizer) (from versions: 0.1.2, 0.1.2.post1)
ERROR: No matching distribution found for torch>=0.4.1 (from pytorch-pretrained-bert->bert-extractive-summarizer)

Is there a docker available?

Yep, underneath, this uses the hugging face transformers library. So you will have access to all of the pretrained models there.

from summarizer import Summarizer
from transformers import DistilBertModel, DistilBertTokenizer

d_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
d_model = DistilBertModel.from_pretrained('distilbert-base-multilingual-cased')

model = Summarizer(custom_model=d_model, custom_tokenizer=d_tokenizer)

Originally posted by @dmmiller612 in #54 (comment)

Limit the summarization length

Hi, first of all thanks for the great project.

Is there a way to limit the output length of the summarizer? We use it to display a summary in a table, which gets very ugly when text lengths differ a lot.

ValueError: Found array with 0 sample(s)

I'm trying to run the summarizer on a dataframe where each row contains a body of text: the model summarizes the text, saves the output, and moves on to the next paragraph.

However, I keep getting the following ValueError:
ValueError: Found array with 0 sample(s) (shape=(0, 1024)) while a minimum of 1 is required.

I'm passing a series of strings of text to the summarizer (one at a time), not arrays.

Also, I'm positive that every text being passed in is long enough.

Could there be a bug related to array size?

Traceback (most recent call last)
in
5 summaries.append(paragraph)
6 else:
----> 7 summary = model(paragraph, ratio=0.5)
8 summaries.append(summary)

~/.conda/envs/hp_summarizer/lib/python3.7/site-packages/summarizer/model_processors.py in __call__(self, body, ratio, min_length, max_length, use_first, algorithm)
40 def __call__(self, body: str, ratio: float=0.2, min_length: int=40, max_length: int=600,
41 use_first: bool=True, algorithm='kmeans') -> str:
---> 42 return self.run(body, ratio, min_length, max_length)
43
44

~/.conda/envs/hp_summarizer/lib/python3.7/site-packages/summarizer/model_processors.py in run(self, body, ratio, min_length, max_length, use_first, algorithm)
35 use_first: bool=True, algorithm='kmeans') -> str:
36 sentences = self.process_content_sentences(body, min_length, max_length)
---> 37 res = self.run_clusters(sentences, ratio, algorithm, use_first)
38 return ' '.join(res)
39

~/.conda/envs/hp_summarizer/lib/python3.7/site-packages/summarizer/model_processors.py in run_clusters(self, content, ratio, algorithm, use_first)
54 def run_clusters(self, content: List[str], ratio=0.2, algorithm='kmeans', use_first: bool= True) -> List[str]:
55 hidden = self.model(content, self.hidden, self.reduce_option)
---> 56 hidden_args = ClusterFeatures(hidden, algorithm).cluster(ratio)
57 if use_first:
58 if hidden_args[0] != 0:

~/.conda/envs/hp_summarizer/lib/python3.7/site-packages/summarizer/ClusterFeatures.py in cluster(self, ratio)
46 def cluster(self, ratio: float=0.1) -> List[int]:
47 k = 1 if ratio * len(self.features) < 1 else int(len(self.features) * ratio)
---> 48 model = self.__get_model(k).fit(self.features)
49 centroids = self.__get_centroids(model)
50 cluster_args = self.__find_closest_args(centroids)

~/.conda/envs/hp_summarizer/lib/python3.7/site-packages/sklearn/cluster/k_means_.py in fit(self, X, y, sample_weight)
967 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
968 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 969 return_n_iter=True)
970 return self
971

~/.conda/envs/hp_summarizer/lib/python3.7/site-packages/sklearn/cluster/k_means_.py in k_means(X, n_clusters, sample_weight, init, precompute_distances, n_init, max_iter, verbose, tol, random_state, copy_x, n_jobs, algorithm, return_n_iter)
307 order = "C" if copy_x else None
308 X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32],
--> 309 order=order, copy=copy_x)
310 # verify that the number of samples given is larger than k
311 if _num_samples(X) < n_clusters:

~/.conda/envs/hp_summarizer/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
548 " minimum of %d is required%s."
549 % (n_samples, array.shape, ensure_min_samples,
--> 550 context))
551
552 if ensure_min_features > 0 and array.ndim == 2:

empty output

Hi!

I tried this module (v. 0.4.2). It's impressive. One thing I would like to check with you, though. In your example:
result = model(text, min_length=60)  # this works
However, when I change the text to this:
result = model(df['text_body'][1], min_length=60)  # this does not work
it generates an empty string, like ' '. Why?

Here are a bit more details:

type(text) => str
type(df['text_body'][1])=>str

print(df['text_body'][1]) #Note: the text was cleaned

vaccination programs are one of the most effective means of controlling infectious diseases and with the development of oral vaccines and bait delivery systems the elimination of diseases circulating in wildlife populations has become a realistic possibility the largescale oral rabies vaccination orv campaigns that have eliminated foxmediated rabies from western europe and north america and substantially reduced disease incidence in central europe are preeminent examples for the success of such control programs
orv programs in foxes are aimed at increasing herd immunity in the target population using oral rabies vaccines distributed into the environment over the past four decades several oral rabies vaccines mainly live replicationcompetent attenuated rabies virus vaccines have been successfully used in orv campaigns in canada for example the erabhk21vaccine virus strain a derivative of the cell culture adapted vaccine virus strain street alabama dufferin sad was the only live attenuated vaccine deployed in fox orv campaigns in europe with the exception of a recombinant vaccinia virus expressing the rabies virus glycoprotein all constructs have been based on live attenuated rabies virus strains derived from the sad bern original sad bernorig vaccine virus strain a successor of the era strain while all these vaccines have been highly efficient in fox rabies control the first generation of sadderived vaccines demonstrated residual pathogenicity in nontarget species particularly in rodents although several cases of vaccine virusinduced rabies were observed even in species other than rodents over the course of vaccination campaigns in a number of countries including germany austria slovenia romania poland and canada such cases were without epidemiological relevance

ratio confusion

My understanding is that the ratio parameter is the ratio of sentences to get back from the summarizer relative to the original text. It does not seem to do this consistently: when I set ratio=1.0, I should get back all of the text I sent to summarize, but this is not always the case. What is the reason for this?

It takes long for the summary to be created

Hi,
When trying the out-of-the-box summarizer on an article with about 7000 words, it takes over 4 minutes to get a summary;
(I am running the example on my mac, no gpu)
Just want to check if that's normal?

Thanks,
Heidi

Input limit?

I know BERT has an input token limit of 512. However, this summarizer does not have an input limit because you are looping through the sentences and embedding each one with BERT separately (assuming the sentences are <512 tokens). Is this correct?
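If you want to verify that no single sentence exceeds BERT's 512-token window, a quick check with a HuggingFace tokenizer (illustrative, not part of this library) could look like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentences = ['First sentence of the document.', 'A second, much longer sentence ...']
for sentence in sentences:
    n_tokens = len(tokenizer.encode(sentence))
    if n_tokens > 512:
        print(f'This sentence has {n_tokens} tokens and would be truncated by BERT.')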

Cannot reproduce examples

from summarizer import SingleModel

body = 'Text body that you want to summarize with BERT'

model = SingleModel()
model(body)

This produces a segmentation fault within iPython on both Ubuntu 18.04 and MacOS Mojave, and crashes Google Colab.

Specifically, the crash occurs at model(body).

Error when using custom BERT models.

When trying to use a custom model specifying the path, the class throws an error at the line: /summarizer/BertParent.py, line 41. The error is AttributeError: 'NoneType' object has no attribute 'from_pretrained'.

I think that the line:

base_model, base_tokenizer = self.MODELS.get(model, (None, None))

should become something like:

base_model, base_tokenizer = self.MODELS.get(model, (model, model))

when model specifies the path to the custom BERT model.

what's the content of 'pooled'

The docstring of the 'extract_embeddings' function says it returns a numpy array. I'm wondering what the elements of the array are; my guess is that 'pooled' refers to the average over the (batch_size, sequence_length, hidden_size) torch tensor.
Please help me figure this out, it's very important to me.

Is there an offline way to download summarizer?

When I execute from summarizer import Summarizer, the model download is too slow. Could you provide a URL so I can download this content offline and save it somewhere?
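One possible workaround, using only the custom model hooks shown earlier in this README (an illustrative sketch; the directory name is arbitrary), is to download the weights once with transformers, save them to disk, and load them from the local directory afterwards:

from transformers import AutoConfig, AutoModel, AutoTokenizer
from summarizer import Summarizer

# Run once on a machine with internet access, then copy ./local-bert to the offline machine.
AutoTokenizer.from_pretrained('bert-base-uncased').save_pretrained('./local-bert')
AutoModel.from_pretrained('bert-base-uncased').save_pretrained('./local-bert')

# Offline: load everything from the local directory.
config = AutoConfig.from_pretrained('./local-bert')
config.output_hidden_states = True
custom_tokenizer = AutoTokenizer.from_pretrained('./local-bert')
custom_model = AutoModel.from_pretrained('./local-bert', config=config)
model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)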

server configuration?

What should the server configuration be for running the model on AWS? It is using local storage, and the process is getting killed, giving a 500 internal server error.
We have 12 GB of RAM on AWS and a 12 GB SSD.

Unable to install using Anaconda

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

bert-extractive-summarizer==0.1.4

Using a different SpaCy Model

Hi. Thanks for this amazing library. I just wanted to know how I can use a different spaCy model than the one used by default. I want to use the en_core_web_lg model instead of the en_core_web_sm model. So I downloaded the model and linked it to the en shortcut. However, the library still complains about the en_core_web_sm model not being installed. Could you clarify how we can use it?

RuntimeError: unexpected EOF, expected 10413410 more bytes. The file might be corrupted.

Hi,
I tried to run the code and got this error:

RuntimeError: unexpected EOF, expected 10413410 more bytes. The file might be corrupted.
terminate called after throwing an instance of 'c10::Error'
  what():  owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 INTERNAL ASSERT FAILED at /pytorch/c10/util/intrusive_ptr.h:348, please report a bug to PyTorch. intrusive_ptr: Can only intrusive_ptr::reclaim() owning pointers that were created using intrusive_ptr::release(). (reclaim at /pytorch/c10/util/intrusive_ptr.h:348)

Anyone experienced the same issue? Thanks!

Empty summary from example from cloud with python requests

Hi, great project, thanks a lot!

The server returns an empty summary when I try the example text using Python requests against a cloud machine. Not sure what I'm missing.


Works perfectly fine with curl on the cloud machine

curl -H "Content-Type: text/plain" -X POST --data "bla bla" http://localhost:5000/summarize?ratio=0.12



Setup:

bert-extractive-summarizer runs on a digitalocean docker image.

sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get install -y python3-pip
sudo apt-get install build-essential libssl-dev libffi-dev python-dev
git clone https://github.com/dmmiller612/bert-extractive-summarizer.git
cd bert-extractive-summarizer
python3 setup.py install
pip3 install torch
ufw allow 5000
make docker-service-build
make docker-service-run

Query code (replace <IP>):

import requests
import urllib.parse

def query_bert_summarizer(text, ratio=0.1, ip='<IP>'):
    headers = {
        'Content-Type': 'text/plain',
    }

    params = (
        ('ratio', ratio),
    )

    data = [
        (text, ''),
    ]

    try:
        response = requests.post(f'http://{ip}:5000/summarize', headers=headers, params=params, data=data)
        summary = response.json()["summary"]
        return response, summary, urllib.parse.unquote_plus(urllib.parse.unquote(summary))
    except Exception as e:
        print(e)
        return str(e)

Somewhat different result between online demo and python code.

I experimented with some Korean text in two scenarios but got different results.
One is the online demo site https://smrzr.io/
and the other is the Python code with 'distilbert-base-uncased' or 'multilingual-base-cased'.

body= "KT 위즈의 '수호신' 이대은이 두 경기 연속 무너졌다. 10일 잠실 두산 베어스전에서 연장 10회말 오재일에게 동점포를 맞은데 이어, 12일 창원 NC 다이노스전에서 나성범에게 동점 투런포를 내주면서 고개를 떨궜다. 시즌 첫 블론세이브를 기록한 뒤 마운드에 오른 12일 이대은의 부담감은 상당해 보였다. 팀이 6-4로 리드한 9회말 선두 타자 권희동에게 안타를 내주고, 폭투로 진루를 허용했다. 장타를 의식해 낮은 제구를 가져가려 했지만, 어깨에 지나치게 힘이 들어간 모습을 보였다. KT 이강철 감독이 직접 마운드에 올라 이대은을 달랬고, 이대은은 박민우, 이명기를 차례로 범타 처리하며 아웃카운트를 추가했다. 하지만 2B2S 유리한 카운트에서 뿌린 포크볼이 나성범의 방망이를 피하지 못한 채 악몽을 되풀이 했다. 지난해 KT 입단 후 선발로 출발했던 이대은은 시즌 중반 마무리로 전향했다. 제구 불안을 드러내기도 했지만, 후반기부터 안정감을 찾으면서 17세이브를 올려 팀의 5강 경쟁에 힘을 보탰다. 하지만 올 시즌 초반 또다시 제구 불안을 드러내면서 불안하게 출발하고 있다."

Online demo gave a summarized result correctly: summary = "KT 위즈의 '수호신 ' 이대은이 두 경기 연속 무너졌다 . 장타를 의식해 낮은 제구를 가져가려 했지만 , 어깨에 지나치게 힘이 들어간 모습을 보였다 ."
But the Python code returned the same text as the input.
What went wrong?

BERT-Extractive-Summarizer Multilingual

Hi there, what would be the procedure to use this library with either BERT-Base Multilingual or single-language models like German BERT, CamemBERT, etc.?

Congratulations on your great work!

Text length exceeds maximum of 1000000

Hi, I got an error while feeding the text into the summarizer as follows.

ValueError: [E088] Text of length 1519175 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).

I tried to add:
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 1519175
but it doesn't work.

So I was wondering is there any ways to address this issue? Thanks.
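One possible workaround (a hedged sketch, not a built-in option) is to split very long documents into chunks that stay under spaCy's character limit and summarize each chunk separately:

from summarizer import Summarizer

def summarize_long_text(text, chunk_chars=900_000, ratio=0.2):
    model = Summarizer()
    chunks, current = [], ''
    # Split on newlines so sentences are less likely to be cut in the middle of a chunk.
    for paragraph in text.split('\n'):
        if current and len(current) + len(paragraph) + 1 > chunk_chars:
            chunks.append(current)
            current = ''
        current += paragraph + '\n'
    if current.strip():
        chunks.append(current)
    # Summarize each chunk and join the partial summaries.
    return ' '.join(model(chunk, ratio=ratio) for chunk in chunks)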

Example code throwing error

The sample code from the README doesn't seem to work. Has something changed in the recent commits? Please suggest.
---> 29 result = model(body, min_length=60)
30 full = ''.join(result)
31 print(full)

TypeError: 'Summarizer' object is not callable

Bug in passing vector_size parameter

Hi, first of all thanks for sharing your work to the research community.

I was trying to use a custom model for the extractive summarizer, in this case I noticed a little bug in the code, specifically in the summarizer/model_processors.py file:

The line self.model = BertParent(model) should be self.model = BertParent(model, vector_size=vector_size) because otherwise the class throws an error even if the vector_size parameter is specified when instantiating the SingleModel(model=MODEL_PATH, vector_size=vs)

Hope it helps other people and can be resolved in the master repo.

'Summarizer' object is not callable

from summarizer import Summarizer
model = Summarizer()
No problem running the above two lines.
result = model(a, min_length=60)
When I run this it says:

'Summarizer' object is not callable

Unable to reproduce the examples

I'm running the code in Google Colab notebook.
I made sure I installed the spacy 2.1.3 version.

However, the error I'm getting says Summarizer is not callable:

Then I tried to pass the text as an argument to Summarizer, which resulted in this; I expected the returned value to be of string type, i.e. the summary text:

And when I tried to convert the result to a string, I got this:

What am I getting wrong here? If you could point out if I've misunderstood something, I'd really appreciate it. If you need any other information, let me know :)
