lexrank

LexRank algorithm for text summarization

Info

LexRank is an unsupervised approach to text summarization based on graph-based centrality scoring of sentences. The main idea is that sentences "recommend" other similar sentences to the reader. Thus, if one sentence is very similar to many others, it is likely to be important. Its importance also stems from the importance of the sentences "recommending" it. To be ranked highly and placed in a summary, a sentence must therefore be similar to many sentences that are in turn similar to many other sentences. This makes intuitive sense and allows the algorithm to be applied to arbitrary new text.
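
The "recommendation" idea can be illustrated with a small, self-contained sketch. It is not the package's implementation: the function name `lexrank_sketch`, the damping factor, and the toy similarity matrix are assumptions made for illustration. The sketch keeps an edge between two sentences when their similarity reaches a threshold, turns the graph into a random walk, and iterates until the scores stop changing.

```python
import numpy as np


def lexrank_sketch(similarity, threshold=0.1, damping=0.85, tol=1e-8, max_iter=1000):
    """Toy LexRank-style scoring (illustrative only, not the lexrank package code)."""
    sim = np.asarray(similarity, dtype=float)
    n = sim.shape[0]

    # Keep an edge only where the similarity reaches the threshold.
    adjacency = (sim >= threshold).astype(float)

    # Row-normalise so each row is a probability distribution;
    # rows without any edge fall back to a uniform distribution.
    row_sums = adjacency.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, adjacency / safe_sums, 1.0 / n)

    # Damped power iteration: a sentence scores highly when
    # highly scored sentences point ("recommend") to it.
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1.0 - damping) / n + damping * (transition.T @ scores)
        converged = np.abs(updated - scores).sum() < tol
        scores = updated
        if converged:
            break
    return scores


# Hypothetical similarities: sentence 0 resembles both others,
# while sentences 1 and 2 do not resemble each other.
toy_similarity = np.array([
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.0],
    [0.5, 0.0, 1.0],
])
toy_scores = lexrank_sketch(toy_similarity)
print(toy_scores)
```

In this toy graph sentence 0 receives the highest score because both other sentences are connected to it. The sketch returns a probability distribution (scores sum to 1), whereas the scores printed in the usage example below appear to be rescaled so that they average to 1.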

Installation

pip install lexrank

Usage

In the following example we use the BBC news dataset as a corpus of documents.

from lexrank import LexRank
from lexrank.mappings.stopwords import STOPWORDS
from path import Path

documents = []
documents_dir = Path('bbc/politics')

for file_path in documents_dir.files('*.txt'):
    with file_path.open(mode='rt', encoding='utf-8') as fp:
        documents.append(fp.readlines())

lxr = LexRank(documents, stopwords=STOPWORDS['en'])

sentences = [
    'One of David Cameron\'s closest friends and Conservative allies, '
    'George Osborne rose rapidly after becoming MP for Tatton in 2001.',

    'Michael Howard promoted him from shadow chief secretary to the '
    'Treasury to shadow chancellor in May 2005, at the age of 34.',

    'Mr Osborne took a key role in the election campaign and has been at '
    'the forefront of the debate on how to deal with the recession and '
    'the UK\'s spending deficit.',

    'Even before Mr Cameron became leader the two were being likened to '
    'Labour\'s Blair/Brown duo. The two have emulated them by becoming '
    'prime minister and chancellor, but will want to avoid the spats.',

    'Before entering Parliament, he was a special adviser in the '
    'agriculture department when the Tories were in government and later '
    'served as political secretary to William Hague.',

    'The BBC understands that as chancellor, Mr Osborne, along with the '
    'Treasury will retain responsibility for overseeing banks and '
    'financial regulation.',

    'Mr Osborne said the coalition government was planning to change the '
    'tax system \"to make it fairer for people on low and middle '
    'incomes\", and undertake \"long-term structural reform\" of the '
    'banking sector, education and the welfare state.',
]

# get summary with classical LexRank algorithm
summary = lxr.get_summary(sentences, summary_size=2, threshold=.1)
print(summary)

# ['Mr Osborne said the coalition government was planning to change the tax '
#  'system "to make it fairer for people on low and middle incomes", and '
#  'undertake "long-term structural reform" of the banking sector, education and '
#  'the welfare state.',
#  'The BBC understands that as chancellor, Mr Osborne, along with the Treasury '
#  'will retain responsibility for overseeing banks and financial regulation.']


# get summary with continuous LexRank
summary_cont = lxr.get_summary(sentences, threshold=None)
print(summary_cont)

# ['The BBC understands that as chancellor, Mr Osborne, along with the Treasury '
#  'will retain responsibility for overseeing banks and financial regulation.']

# get LexRank scores for sentences
# 'fast_power_method' speeds up the calculation, but requires more RAM
scores_cont = lxr.rank_sentences(
    sentences,
    threshold=None,
    fast_power_method=False,
)
print(scores_cont)

#  [1.0896493024505858,
#  0.9010711968859021,
#  1.1139166497016315,
#  0.8279523250808547,
#  0.8112028559566362,
#  1.185228912485382,
#  1.0709787574388283]
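
The scores and the summary are related in a simple way: the summary presumably consists of the highest-scoring sentences. The sketch below uses the scores printed above; the helper `top_k_indices` is a hypothetical name and not part of the package.

```python
import numpy as np

# Continuous LexRank scores copied from the output above.
scores = np.array([
    1.0896493024505858,
    0.9010711968859021,
    1.1139166497016315,
    0.8279523250808547,
    0.8112028559566362,
    1.185228912485382,
    1.0709787574388283,
])


def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    order = np.argsort(scores)[::-1]
    return order[:k].tolist()


print(top_k_indices(scores, 1))  # [5]
print(top_k_indices(scores, 2))  # [5, 2]
```

Index 5 is the sentence beginning "The BBC understands that as chancellor", which matches the one-sentence summary returned by continuous LexRank above.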

Stop words for 22 languages are included in the package. To define your own mapping of stop words, prepare UTF-8 encoded text files in which words are separated by newlines. Then use the command

lexrank_assemble_stopwords --source_dir directory_with_txt_files

which replaces the default mapping. Note that the names of the .txt files are used as keys in the STOPWORDS dictionary.
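
A minimal sketch of preparing such a directory follows. The helper `write_stopword_files`, the directory name, and the word lists are assumptions for illustration; whether the resulting key is the file name with or without the `.txt` extension should be checked against the package.

```python
import tempfile
from pathlib import Path


def write_stopword_files(source_dir, mapping):
    """Write one UTF-8 file per key, with one stop word per line."""
    source_dir = Path(source_dir)
    source_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for key, words in mapping.items():
        path = source_dir / f"{key}.txt"
        # Deduplicate and sort so the files are stable between runs.
        path.write_text("\n".join(sorted(set(words))) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Hypothetical word lists written to a temporary directory.
source_dir = Path(tempfile.mkdtemp()) / "my_stopwords"
paths = write_stopword_files(
    source_dir,
    {"en": ["the", "a", "an"], "custom": ["foo", "bar"]},
)
print(sorted(p.name for p in paths))
```

The directory can then be passed to the command above via `--source_dir`.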

Customization

The straightforward implementation of the LexRank algorithm described above may be insufficient for modern NLP tasks. It cannot be used with sentence embeddings or custom tf-idf vectors prepared with third-party software. Therefore, we provide a core function that works directly with a similarity matrix of sentences.

import numpy as np
from lexrank import degree_centrality_scores

similarity_matrix = np.array(
    [
        [1.00, 0.17, 0.02, 0.03, 0.00, 0.01, 0.00, 0.17, 0.03, 0.00, 0.00],
        [0.17, 1.00, 0.32, 0.19, 0.02, 0.03, 0.03, 0.04, 0.01, 0.02, 0.01],
        [0.02, 0.32, 1.00, 0.13, 0.02, 0.02, 0.05, 0.05, 0.01, 0.03, 0.02],
        [0.03, 0.19, 0.13, 1.00, 0.05, 0.05, 0.19, 0.06, 0.05, 0.06, 0.03],
        [0.00, 0.02, 0.02, 0.05, 1.00, 0.33, 0.09, 0.05, 0.03, 0.03, 0.06],
        [0.01, 0.03, 0.02, 0.05, 0.33, 1.00, 0.09, 0.04, 0.06, 0.08, 0.04],
        [0.00, 0.03, 0.05, 0.19, 0.09, 0.09, 1.00, 0.05, 0.01, 0.01, 0.01],
        [0.17, 0.04, 0.05, 0.06, 0.05, 0.04, 0.05, 1.00, 0.04, 0.05, 0.04],
        [0.03, 0.01, 0.01, 0.05, 0.03, 0.06, 0.01, 0.04, 1.00, 0.20, 0.24],
        [0.00, 0.02, 0.03, 0.06, 0.03, 0.08, 0.01, 0.05, 0.20, 1.00, 0.10],
        [0.00, 0.01, 0.02, 0.03, 0.06, 0.04, 0.01, 0.04, 0.24, 0.10, 1.00],
    ],
)

# scores calculated with classical LexRank algorithm
degree_centrality_scores(similarity_matrix, threshold=.1)

# array([0.66666667, 1.        , 1.11111111, 1.22222222, 1.11111111,
#        1.11111111, 0.77777778, 1.22222222, 0.88888889, 1.        ,
#        0.88888889])

# scores by continuous LexRank
degree_centrality_scores(similarity_matrix, threshold=None)

# array([0.86714443, 1.11576626, 1.01267916, 1.11576626, 1.01874311,
#        1.06119074, 0.9277839 , 0.96416759, 1.01874311, 0.95810364,
#        0.9399118 ])

The function degree_centrality_scores takes a similarity matrix as input, so it is not restricted to NLP. It can be used for any objects for which there is a proper way to measure similarity. When creating a custom similarity_matrix, ensure that all its values are in the range [0, 1].
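
One way to obtain such a matrix from sentence embeddings is cosine similarity with negative values clipped to zero. This is a sketch under stated assumptions: the helper `cosine_similarity_matrix` and the toy embeddings are hypothetical, and clipping is only one possible way to bring values into [0, 1].

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity, clipped to the range [0, 1]."""
    vectors = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Avoid division by zero for all-zero vectors.
    unit = vectors / np.where(norms > 0, norms, 1.0)
    similarity = unit @ unit.T
    # Cosine similarity lies in [-1, 1]; clip so the result stays in [0, 1].
    return np.clip(similarity, 0.0, 1.0)


# Hypothetical 2-dimensional "embeddings" for four sentences.
embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
similarity_matrix = cosine_similarity_matrix(embeddings)
print(similarity_matrix.round(2))
```

The resulting matrix is symmetric, has ones on the diagonal, and can be passed to degree_centrality_scores in the same way as the hand-written matrix above.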

Tests

Tests are not supplied with the package. To run them, clone the repository and install the additional dependencies.

# ensure virtualenv is activated
make install-dev

Run linter and tests

make lint
make test

References

Güneş Erkan and Dragomir R. Radev: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization.

Contributors

karambolishe, lshostenko, pcinkh

lexrank's Issues

Lexrank for Single Document

How can I use this implementation of Lexrank for summarizing a single document?
Attempting this produces the following error -
ValueError: documents are not informative

Thanks in advance
Alka Khurana

Power method does not converge

I have the problem that for a given transition matrix, I cannot reach convergence with the implemented _power_method. Instead, after only a few iterations, I am left with only NaN values in the eigenvector guess.

Given that there are existing functions in numpy etc. for computing eigenvectors, is there any particular reason to use the given power method implementation?

Logging

We need to allow logging in case of running code on servers.

custom Keyword inclusion

Problem description

I need the generated summary to contain specific keywords from the input text.

Steps/code/corpus to reproduce

I need the pipeline component to accept keywords as an input parameter.

summary = lxr.get_summary(sentences, summary_size=2, threshold=.1, custom_keywords=keywords)

For example,

from lexrank import LexRank
from lexrank.mappings.stopwords import STOPWORDS
from path import Path

documents = []
documents_dir = Path('bbc/politics')

for file_path in documents_dir.files('*.txt'):
    with file_path.open(mode='rt', encoding='utf-8') as fp:
        documents.append(fp.readlines())

lxr = LexRank(documents, stopwords=STOPWORDS['en'])

# example text
sentences = [
    'One of David Cameron\'s closest friends and Conservative allies, '
    'George Osborne rose rapidly after becoming MP for Tatton in 2001.',

    'Michael Howard promoted him from shadow chief secretary to the '
    'Treasury to shadow chancellor in May 2005, at the age of 34.',

    'Mr Osborne took a key role in the election campaign and has been at '
    'the forefront of the debate on how to deal with the recession and '
    'the UK\'s spending deficit.',

    'Even before Mr Cameron became leader the two were being likened to '
    'Labour\'s Blair/Brown duo. The two have emulated them by becoming '
    'prime minister and chancellor, but will want to avoid the spats.',

    'Before entering Parliament, he was a special adviser in the '
    'agriculture department when the Tories were in government and later '
    'served as political secretary to William Hague.',

    'The BBC understands that as chancellor, Mr Osborne, along with the '
    'Treasury will retain responsibility for overseeing banks and '
    'financial regulation.',

    'Mr Osborne said the coalition government was planning to change the '
    'tax system \"to make it fairer for people on low and middle '
    'incomes\", and undertake \"long-term structural reform\" of the '
    'banking sector, education and the welfare state.',
]

# keywords
keywords = ['Michael Howard', 'chief secretary', 'BBC', 'Mr Osborne', 'Treasury']

# get summary with classical LexRank algorithm
summary = lxr.get_summary(sentences, summary_size=2, threshold=.1, custom_keywords=keywords)
print(summary)

Output

[ 'Michael Howard promoted him from shadow chief secretary to the '
'Treasury to shadow chancellor in May 2005, at the age of 34.',
'The BBC understands that as chancellor, Mr Osborne, along with the '
'Treasury will retain responsibility for overseeing banks and '
'financial regulation.']

As in the example above, I need a parameter for custom keywords, and those keywords must be present in the summarized text
(i.e., the sentences containing the keywords should be the top-ranked sentences).

Is there a way to do this, or a function in the library that already does it?
