DocSumm

DocSumm is an automated document-summary generator with Mihalcea and Tarau's (2004) TextRank at its core.

TextRank is a graph-based ranking model for text processing based on Google's PageRank algorithm (Brin and Page, 1998).
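As a rough illustration of the idea, the sketch below scores sentences with the weighted PageRank-style update and the word-overlap similarity described by Mihalcea and Tarau (2004). It is not DocSumm's own code: the function names, the regex tokenizer, and the fixed iteration count are assumptions made for this example, and it skips the stemming step that DocSumm applies.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalized by the log of each sentence's length."""
    if not a or not b:
        return 0.0
    denominator = math.log(len(a)) + math.log(len(b))
    if denominator == 0:
        return 0.0  # both sentences contain a single word
    return len(a & b) / denominator


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence; higher means more central to the text."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Edge weights of the sentence graph; a sentence is not linked to itself.
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives a share of its neighbours' scores,
        # proportional to the weight of the edge that connects them.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

Sentences that share vocabulary with many others end up with higher scores, while a sentence with no overlap at all stays at the baseline of 1 - damping.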

Dependencies:

Running DocSumm

DocSumm can summarize local text files or articles on the web. It takes three required arguments and one optional argument. The first is the path to the target document or the URL of the target web page. The second is the desired length of the summary as a fraction of the length of the original document (a value between 0 and 1; for example, 0.25 for a quarter of the original length). The third is a flag indicating whether the target is a local text file (-l) or a web page (-w). The fourth, which is optional, specifies which stemming algorithm to use when normalizing the text of the target document: NLTK's Porter Stemmer (-p), Lancaster Stemmer (-l), Snowball Stemmer (-s), Regex Stemmer (-r), or the WordNet Lemmatizer (-w). Note that -l and -w each have two meanings, depending on position: as the third argument they select the source type, and as the fourth they select the stemmer. If no stemmer is specified, the Porter stemming algorithm is applied by default.

For example:

python doc_summ.py ./Document_to_Summarize.txt 0.25 -l -w
python doc_summ.py 'www.nytimes.com/...' 0.25 -w -w
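Because -l and -w are reused, the arguments have to be read by position. The following is a minimal sketch of how such a command line could be interpreted; it is not taken from doc_summ.py. The helper name `parse_args`, the dictionary keys, and the choice to reject a ratio of 0 are all assumptions made for illustration.

```python
# Hypothetical lookup tables mirroring the flags described above.
SOURCE_FLAGS = {"-l": "local", "-w": "web"}
STEMMER_FLAGS = {
    "-p": "porter",
    "-l": "lancaster",
    "-s": "snowball",
    "-r": "regex",
    "-w": "wordnet",
}


def parse_args(argv):
    """Interpret TARGET RATIO SOURCE_FLAG [STEMMER_FLAG] by position."""
    if len(argv) not in (3, 4):
        raise ValueError("expected: TARGET RATIO SOURCE_FLAG [STEMMER_FLAG]")
    target, ratio_text, source_flag = argv[:3]
    ratio = float(ratio_text)
    if not 0 < ratio <= 1:  # assumed bounds; the README only says "between 0 and 1"
        raise ValueError("ratio must be greater than 0 and at most 1")
    if source_flag not in SOURCE_FLAGS:
        raise ValueError("third argument must be -l or -w")
    # Fall back to the Porter stemmer when no fourth argument is given.
    stemmer_flag = argv[3] if len(argv) == 4 else "-p"
    if stemmer_flag not in STEMMER_FLAGS:
        raise ValueError("unknown stemmer flag: " + stemmer_flag)
    return {
        "target": target,
        "ratio": ratio,
        "source": SOURCE_FLAGS[source_flag],
        "stemmer": STEMMER_FLAGS[stemmer_flag],
    }
```

Under these assumptions, the first command above would be read as a local file summarized to a quarter of its length using the WordNet Lemmatizer.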

DocSumm saves its automatically generated summaries in the same directory as the target document. Summaries are named Original_File_Name_Summary.txt.

For example:

Document_to_Summarize_Summary.txt
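The naming rule can be sketched in a few lines of Python. The function name `summary_path` is hypothetical, and this only covers local files; how DocSumm names the summary of a web page is not described here.

```python
import os


def summary_path(target_path):
    """Build the summary file path next to the target document."""
    directory, filename = os.path.split(target_path)
    stem, _extension = os.path.splitext(filename)
    # Same directory, original name plus the _Summary suffix.
    return os.path.join(directory, stem + "_Summary.txt")
```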

Optimizing Your Target Document

DocSumm works best on documents with relatively little thematic variation. If a text covers a wide range of subjects, isolate its thematically consistent portions and run DocSumm on each one individually; the resulting summaries will be more informative and closer in structure to ones a human might produce. These chunks of text don't need to be short (I've included some relatively long examples which do well); the point is that each ought to be relatively thematically consistent. I'm looking at ways of overcoming the problem of thematic variety within a document, but until then the best approach is probably to split your document up.
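One simple way to split a document is to mark the thematic boundaries by hand and cut on the marker. The sketch below assumes you have inserted a line containing only ==== between sections; the marker and the function name `split_into_sections` are arbitrary choices for this example, not something DocSumm recognizes.

```python
def split_into_sections(text, marker="===="):
    """Split text into sections at lines that consist solely of the marker."""
    sections = []
    current = []
    for line in text.splitlines():
        if line.strip() == marker:
            sections.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    sections.append("\n".join(current).strip())
    # Drop empty sections, e.g. from a leading or doubled marker.
    return [section for section in sections if section]
```

Each returned section can then be written to its own text file and passed to DocSumm separately with the -l flag.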
