nlp-extractive-news-summarization-using-mmr's Introduction

Comparison of MMR and LexRank Automatic Text Summarization approaches

Automatic summarization techniques condense one or more documents into a shorter summary that conveys the most important information from the original text. Multi-document summarization in particular extracts a summary from multiple documents written about the same topic. Here we implement and compare two commonly used techniques for multi-document automatic text summarization: Maximal Marginal Relevance (MMR) and LexRank.
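As a rough illustration of the MMR idea (not the code in mmr_summarizer.py), the sketch below greedily picks sentences that are relevant to a query while penalizing similarity to sentences already chosen. The function names, the plain term-frequency cosine similarity, the use of a free-text query, and the `lam` trade-off parameter are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase word tokens; a deliberately simple tokenizer for illustration.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def mmr_select(sentences, query, k=2, lam=0.5):
    # Hypothetical helper: returns up to k sentences chosen by greedy MMR.
    # lam = 1.0 ranks purely by relevance; lower values favor diversity.
    vecs = [Counter(tokenize(s)) for s in sentences]
    qvec = Counter(tokenize(query))
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(vecs[i], qvec)
            # Highest similarity to any already selected sentence.
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

With a low `lam`, a near-duplicate of an already selected sentence tends to be skipped in favor of a less similar one; with `lam=1.0` the selection reduces to ranking by relevance alone.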

Implementation details

We have implemented both the MMR and LexRank algorithms in Python. For evaluation we used the DUC2004 data corpus, which contains two sets of documents.

  • Documents/clusters: 50 different topics, each containing on average 10 news articles.

  • Manual summaries: a manually created summary for each of the 50 topics.
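For readers unfamiliar with LexRank, the following is a minimal sketch of its core idea: build a sentence similarity graph and score sentences with a PageRank-style power iteration. It is not the code in LexRank.py. It assumes plain term-frequency cosine similarity (the published algorithm uses an idf-modified cosine), and the `threshold`, `damping`, and `iterations` values are illustrative choices.

```python
import math
import re
from collections import Counter


def tf_vector(sentence):
    # Term-frequency vector over lowercase word tokens.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    # Hypothetical helper: returns one centrality score per sentence.
    n = len(sentences)
    if n == 0:
        return []
    vecs = [tf_vector(s) for s in sentences]
    # Unweighted graph: an edge wherever similarity exceeds the threshold
    # (each non-empty sentence is also connected to itself).
    adj = [
        [1.0 if cosine(vecs[i], vecs[j]) > threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Row-normalize so each row is a probability distribution;
    # rows with no edges fall back to a uniform distribution.
    for row in adj:
        total = sum(row)
        for j in range(n):
            row[j] = row[j] / total if total else 1.0 / n
    # Power iteration with damping, as in PageRank.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(adj[j][i] * scores[j] for j in range(n))
            for i in range(n)
        ]
    return scores
```

A summary could then be formed by taking the highest-scoring sentences until a length limit is reached; how the project itself selects and orders sentences is not shown here.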

The generated summaries are evaluated against the human summaries using the ROUGE toolkit. The ROUGE scores make it possible to compare the quality of the individual summarization systems. We also analyzed how similar the summaries generated by the two systems are to each other, by computing the Jaccard coefficient for sentence-level and word-level overlap.
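The Jaccard coefficient of two sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch of the sentence-level and word-level overlap between two summaries might look like the following; the helper names and the naive sentence splitter are assumptions for illustration and may differ from what jaccardScore.py does.

```python
import re


def jaccard(a, b):
    # Jaccard coefficient of two collections, treated as sets.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_sentences(text):
    # Naive splitter on '.', '!' and '?'; sentences are lowercased
    # so that identical sentences compare equal.
    return [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]


def words(text):
    # Lowercase word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def summary_overlap(summary_a, summary_b):
    # Hypothetical helper: overlap of two summaries at both levels.
    return {
        "sentence": jaccard(split_sentences(summary_a), split_sentences(summary_b)),
        "word": jaccard(words(summary_a), words(summary_b)),
    }
```

A score of 1.0 means the two summaries contain exactly the same sentences (or words), and 0.0 means they share none.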

System/software requirements

We implemented both MMR and LexRank and ran the evaluations (both ROUGE and Jaccard) on Ubuntu 14.04. The following packages were installed on Ubuntu as part of this setup:

Files and Folders

File/Folder          Description
root                 root folder of the project, containing all required files and folders
Documents            news articles relating to the 50 topics (each topic containing about 10 articles)
Humman_Summaries     human summaries used to evaluate the quality of the system-generated summaries
Lexrank_results      folder that holds the system-generated summaries of LexRank
MMR_results          folder that holds the system-generated summaries of MMR
LexRank.py           LexRank summarizer implementation
mmr_summarizer.py    MMR summarizer implementation
sentence.py          sentence class for modelling sentences in the document cluster
jaccardScore.py      generates the Jaccard coefficient at word and sentence level
test_pyrouge.py      generates the ROUGE scores for the system summaries

How to run:

  • To generate the MMR system summaries, run mmr_summarizer.py. The results are written to the MMR_results folder.

  • To generate the LexRank system summaries, run LexRank.py. The results are written to the Lexrank_results folder.

  • To generate the ROUGE scores, run test_pyrouge.py. The results are displayed in the terminal.

  • To generate the Jaccard coefficient scores, run jaccardScore.py. Both word-level and sentence-level scores are displayed on the screen.

NOTE:

The documents from DUC2004 have not been added here. These documents can be obtained from here.

This work was done as part of the CAP6640: Natural Language Processing course at UCF in Spring 2016 along with Amar Nair and Syed Ahmed.
