golechhasaloni / summarization

This project forked from sandeepsripada/summarization

Performs multi-document summarization. Summaries are generated with a sentence importance score calculator, based on various semantic features, together with a semantic similarity score, so that the selected sentences are the most representative of the documents. The method uses the stack decoder algorithm as a template and builds on it to produce summaries that are closer to optimal.

Home Page: http://cs.stanford.edu/~ssandeep

Java 100.00%

summarization's Introduction

Aim:
The project takes in a set of documents and outputs an extraction-based summary, i.e. it selects sentences from the given set that satisfy the length constraint and maximize the overall score of the summary. (The importance score of a summary is the sum of the scores of its individual sentences.)
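
As a minimal sketch of the objective described above: the class and method names below (SummaryScoring, ScoredSentence, maxWords) are hypothetical and are not taken from the project's code, and measuring length in whitespace-separated words is an assumption.

```java
import java.util.List;

// Hypothetical illustration of the objective; not the project's own classes.
final class SummaryScoring {

    /** A sentence paired with a precomputed importance score. */
    static final class ScoredSentence {
        final String text;
        final double importance;

        ScoredSentence(String text, double importance) {
            this.text = text;
            this.importance = importance;
        }

        /** Length in whitespace-separated words (an assumed unit of length). */
        int lengthInWords() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    /** Importance score of a summary: the sum of its sentences' scores. */
    static double summaryScore(List<ScoredSentence> summary) {
        double total = 0.0;
        for (ScoredSentence s : summary) {
            total += s.importance;
        }
        return total;
    }

    /** Total length of a summary in words. */
    static int summaryLength(List<ScoredSentence> summary) {
        int total = 0;
        for (ScoredSentence s : summary) {
            total += s.lengthInWords();
        }
        return total;
    }

    /** True if the summary fits within the length budget. */
    static boolean satisfiesLengthConstraint(List<ScoredSentence> summary, int maxWords) {
        return summaryLength(summary) <= maxWords;
    }
}
```

A candidate summary is admissible only if satisfiesLengthConstraint holds. Among admissible candidates, the one with the highest summaryScore is preferred.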

Main class:
Expects one argument - path to the directory containing the data.
For the project, the sample data files are in the 'data' directory.

Method:
The project contains the implementation of the StackDecoder method for summary generation.
http://cs.stanford.edu/people/ssandeep/reports/PACLIC-09.pdf
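
For readers unfamiliar with the idea, here is a simplified sketch of a stack decoder for extractive summarization. It is not the project's implementation: the class name StackDecoderSketch, the parameters (maxLength, stackSize, simThreshold) and the precomputed similarity matrix are all assumptions made for illustration. One stack is kept per summary length. Each stack holds the best partial summaries of that length, and each partial summary is extended by one sentence at a time, skipping sentences that are too similar to those already chosen.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified, hypothetical sketch; the project's StackDecoder may differ.
final class StackDecoderSketch {

    /** A partial summary: chosen sentence indices, total length, total score. */
    static final class Hypothesis {
        final List<Integer> sentences;
        final int length;
        final double score;

        Hypothesis(List<Integer> sentences, int length, double score) {
            this.sentences = sentences;
            this.length = length;
            this.score = score;
        }
    }

    /** Returns the indices of the sentences in the best summary found. */
    static List<Integer> decode(int[] lengths, double[] scores, double[][] sim,
                                int maxLength, int stackSize, double simThreshold) {
        // stacks.get(k) holds partial summaries whose total length is exactly k.
        List<List<Hypothesis>> stacks = new ArrayList<>();
        for (int i = 0; i <= maxLength; i++) {
            stacks.add(new ArrayList<>());
        }
        Hypothesis best = new Hypothesis(new ArrayList<>(), 0, 0.0);
        stacks.get(0).add(best);

        for (int len = 0; len <= maxLength; len++) {
            List<Hypothesis> stack = stacks.get(len);
            // Prune: keep only the stackSize highest-scoring hypotheses.
            stack.sort(Comparator.comparingDouble((Hypothesis h) -> h.score).reversed());
            if (stack.size() > stackSize) {
                stack.subList(stackSize, stack.size()).clear();
            }
            for (Hypothesis h : stack) {
                if (h.score > best.score) {
                    best = h;
                }
                for (int s = 0; s < lengths.length; s++) {
                    int newLen = len + lengths[s];
                    // Skip empty sentences, overlong summaries, repeats and near-duplicates.
                    if (lengths[s] <= 0 || newLen > maxLength
                            || h.sentences.contains(s)
                            || tooSimilar(h, s, sim, simThreshold)) {
                        continue;
                    }
                    List<Integer> extended = new ArrayList<>(h.sentences);
                    extended.add(s);
                    stacks.get(newLen).add(new Hypothesis(extended, newLen, h.score + scores[s]));
                }
            }
        }
        return best.sentences;
    }

    /** True if sentence s is too similar to any sentence already in h. */
    private static boolean tooSimilar(Hypothesis h, int s, double[][] sim, double threshold) {
        for (int chosen : h.sentences) {
            if (sim[chosen][s] > threshold) {
                return true;
            }
        }
        return false;
    }
}
```

This sketch discards any extension that would exceed maxLength and does not merge hypotheses that contain the same sentences in a different order. Both are simplifications chosen to keep the example short.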

The similarity scorer currently in use is 'CosineSimilarityScorer'. It can be replaced with any other scorer that implements the SimilarityScorer interface.
The other scorer included is JaccardSimilarityScorer.
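
The sketch below illustrates how a pluggable scorer of this kind might look. The method signature of the interface is an assumption (the project's actual SimilarityScorer may take different arguments), and the two classes are deliberately named CosineSketch and JaccardSketch so they are not mistaken for the project's own CosineSimilarityScorer and JaccardSimilarityScorer. Sentences are assumed to be already tokenized.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Assumed shape of the interface; the real one may differ.
interface SimilarityScorer {
    /** Returns a similarity in [0, 1] between two tokenized sentences. */
    double similarity(List<String> a, List<String> b);
}

/** Cosine similarity over raw term-frequency vectors. */
final class CosineSketch implements SimilarityScorer {
    @Override
    public double similarity(List<String> a, List<String> b) {
        Map<String, Integer> ca = counts(a);
        Map<String, Integer> cb = counts(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            Integer other = cb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double norm = Math.sqrt(sumOfSquares(ca)) * Math.sqrt(sumOfSquares(cb));
        return norm == 0.0 ? 0.0 : dot / norm;
    }

    private static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : tokens) {
            m.merge(t, 1, Integer::sum);
        }
        return m;
    }

    private static double sumOfSquares(Map<String, Integer> m) {
        double sum = 0.0;
        for (int v : m.values()) {
            sum += (double) v * v;
        }
        return sum;
    }
}

/** Jaccard similarity: size of the intersection over size of the union of token sets. */
final class JaccardSketch implements SimilarityScorer {
    @Override
    public double similarity(List<String> a, List<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(new HashSet<>(b));
        return (double) intersection.size() / union.size();
    }
}
```

Because both classes implement the same interface, code that selects sentences can hold a SimilarityScorer reference and switch scorers without any other change.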

Final report of the project:
http://cs.stanford.edu/people/ssandeep/reports/CS224N.pdf

Output:
The output summaries are generated in the 'summaries' folder.

External sources: (files not mentioned below were all coded by me)
The following files were not written by me; they were provided as starter code for the 'Natural Language Processing' class at Stanford:
com.util package: 
	Counter, CounterMap, MapFactory, PorterStemmer, PQ
com.score.importance:
	All the files
com.decoding.stackdecoder:
	SpecialPQ
	
Contributions:
I wrote the framework code that loads the dataset and segregates it into topics, documents, summaries and sentences, and I designed its structure.
I also implemented the 'StackDecoder' algorithm on top of this framework.
The common platform also made it easy for others to plug in their code and implement other methods.

NOTE:
LSP is an external program that is not currently included. An open source alternative is the NLTK (http://nltk.org/) tokenizer.
