Giter Club home page Giter Club logo

summar.pl's Introduction

summar.pl

Summar is an automatic summarization system developed for the Polish language. The system implements trainable sentence-based extractive summarization technique, which consists in determining most important sentences in document due to their computed salience. Neural networks were used as a machine learning algorithm (Pybrain library, as well as FANN were used).

The system was implemented during work on Master Thesis's in computer science at Feculty of Mathematics and Computer Science, Adam Mickiewicz University.

summar features

  • Automatic Summarization Library written in Python
  • corpus-based training using PyBrain or FANN (Fast Artificial Neural Network Library)
  • three running modes: summarization, training, evaluation
  • uses NERf as a Named Entity Recognition too (see: http://zil.ipipan.waw.pl/Nerf)
  • uses Polish Summaries Corpus (see: http://zil.ipipan.waw.pl/PolishSummariesCorpus) for training and evaluation

Dependencies

Usage

To run summar 3 files need to be provided: config.ini, used features list and model (model file is not needed during training).

Configuration file

[General
stoplist = {PATH_TO_STOPLIST FILE}
modeltype = {NN, FANN, Random, First}
length = {SHORT, LONG, MEDIUM}

[Train
trainmethod = {NN, FANN}
dirprocessor = {Wide, Ratio}
dirprocessoroption = {float number}

[Eval
evalmethod = {Strict, Wide}

Features

Example of features file:

Feature.TfIdf
Feature.SentInHighestTitle
Feature.SentLength
Feature.SentLocPara

Features list:

  • TfIdf - sum of term frequency -- inverse document frequency value for every word in sentence
  • Centrality -arithmetic average of the sentence's similarity to all other sentences in the document
  • SentLocPara - position of a sentence in the paragraph: in the first, second or third of equal parts
  • ParaLocSection - position of the paragraph in the document: in the first, second or third of equal parts
  • SentSpecialSection - occurrence in a special section such as the beginning (introduction) or ending (conclusion) of document
  • Verb - existence of the final verb
  • Nouns - number of nouns in sentence
  • Pronouns - number of pronouns in sentence
  • SentInHighestPname --- number of Named Entities in the sentence as found by a naive method, recognizing Named Entity as a word starting with capital letter
  • NER - number of Named Entities in the sentence as found by NERf Named Entities Recognition tool
  • NERTf - sum of each Named Entity frequency in the whole document, occurring in given sentence
  • PersNameNE - number of recognized NE of the "person" type
  • OrgNameNE - number of recognized NE of the "organization" type
  • PlaceNameNE - number of recognized NE of the "place" type
  • DateNE - number of recognized NE of the "date" type
  • GeogNameNE - number of recognized NE of the "geography" type
  • TimeNE - number of recognized NE of the "time" type
  • SentInHighestTitle - number of words from heading or title in the sentence
  • ParaLength -paragraph length: short (up to 1 sentence), average (2--5 sentences) or long (more than 5 sentences)
  • SentLength - sentence length: short (up to 7 words), average (7--14 words) or long (more than 14 words)
  • AvWordLength - the average of words lengths in sentencse
  • SentType - type of the sentences, based on its last punctuation mark: declarative, interrogative or imperative
  • MetaInfo - sentences not referring to the document content, i.e.: an information about document's author or photo signatures

Running

usage: main.py [-h] -r {sum,train,eval} [-f FILE] [-a DIR] [-m MODEL] [-g FEATURES] [-t TRAIN_DIR] [-y TEST_DIR] [-z CONFIG]

Running in summary mode:

To run in summary mode the following parameters are needed:

  • -r sum
  • -f - input file
  • -m - model file
  • -g - features file (must be the same list of features, which was used to train given model file)
  • -z - config file (also compatible with model)

Running in training mode:

To run in summary mode the following parameters are needed:

Running in evaluation mode:

To run in summary mode the following parameters are needed:

  • -r eval
  • -m - model file
  • -g - features file (must be the same list of features, which was used to train given model file)
  • -z - config file (also compatible with model)
  • -y - test directory (Polish Summaries Corpus again)

summar.pl's People

Contributors

lpawluczuk avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.