
seda-ml-summarization's People

Contributors

hkuhn, pcannons


seda-ml-summarization's Issues

Make data processing for input more memory efficient

Currently, parsing data from 3,890 documents with 115,000 distinct words takes about 24.2 GB of RAM. We should find a way to parse the data using a more reasonable amount of memory.

Potential ideas include streaming the data during parsing with a module like pandas that supports chunked reads, or using numpy's memmap to avoid keeping all of the data in RAM.
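A rough sketch of the memmap idea, assuming a hypothetical counts.csv with doc_id/word_id/count columns and the corpus dimensions above:

import numpy as np
import pandas as pd

V = 115000       # distinct words in the vocabulary
N_DOCS = 3890    # documents in the corpus

# A float32 memmap keeps the term-frequency matrix on disk instead of in RAM.
tf = np.memmap("term_freq.dat", dtype=np.float32, mode="w+", shape=(N_DOCS, V))

# Stream the raw counts in chunks rather than loading the whole file at once.
for chunk in pd.read_csv("counts.csv", chunksize=100_000):
    tf[chunk["doc_id"].values, chunk["word_id"].values] = chunk["count"].values

tf.flush()  # persist to disk; reopen later with mode="r" for read-only access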

Remove requirement for validation/test set in DBN

Right now the DBN requires nonempty validation/test sets to run without error (because of how the Theano code is constructed). Since these sets are unnecessary in unsupervised learning, the requirement should be removed.
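A minimal sketch of what the fix could look like, assuming the dataset keeps the [[train], [valid], [test]] list layout described elsewhere in these issues; unpack_datasets is a hypothetical helper, not existing project code:

def unpack_datasets(datasets):
    train_set = datasets[0]
    # Validation/test become optional: missing or empty entries are treated
    # as None instead of crashing the Theano graph construction.
    valid_set = datasets[1] if len(datasets) > 1 and datasets[1] else None
    test_set = datasets[2] if len(datasets) > 2 and datasets[2] else None
    return train_set, valid_set, test_set

# Unsupervised pretraining: only a training split is supplied.
train, valid, test = unpack_datasets([[[0.1, 0.2], [0]]])
assert valid is None and test is None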

4) Solve the Optimization Problem

Summary Generation Sub-Tasks (in order)

Solve the optimization problem. This is the fourth and final step in summary generation.

Classical knapsack problem in dynamic programming

f_k(lambda_k) = max_{mu_k} { mu_k In_k + f_{k-1}(lambda_{k-1}) }
lambda_k = lambda_{k+1} - mu_{k+1} l_{k+1}
k = t, 1 <= t <= T
lambda_0 = 0
lambda_T = N_s
f_0(lambda_0) = 0

f_k(lambda_k) = maximum summary importance achievable at stage k
k = stage variable indexing the current sentence
lambda_k = remaining length budget before stage k starts
mu_k = decision variable for sentence k (1 if sentence k is included in the summary, 0 otherwise)
l_k = length of sentence k
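A minimal dynamic-programming sketch of this recurrence, assuming integer sentence lengths; importances plays the role of In_t and budget the role of N_s:

def select_sentences(importances, lengths, budget):
    T = len(importances)
    # f[k][lam] = best total importance using the first k sentences
    # with lam units of length budget available.
    f = [[0.0] * (budget + 1) for _ in range(T + 1)]
    for k in range(1, T + 1):
        for lam in range(budget + 1):
            f[k][lam] = f[k - 1][lam]                 # mu_k = 0: skip sentence k
            if lengths[k - 1] <= lam:                 # mu_k = 1: take sentence k
                take = importances[k - 1] + f[k - 1][lam - lengths[k - 1]]
                f[k][lam] = max(f[k][lam], take)
    # Backtrack to recover which sentences were selected.
    selected, lam = [], budget
    for k in range(T, 0, -1):
        if f[k][lam] != f[k - 1][lam]:
            selected.append(k - 1)
            lam -= lengths[k - 1]
    return sorted(selected)

# Example: within a 10-word budget, sentences 1 and 2 maximize importance.
print(select_sentences([3.0, 1.0, 4.0], [6, 3, 5], 10))  # [1, 2]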

Adapt Theano Restricted Boltzmann Machine to SEDA-ML

Basic building block for the hidden layer.

An RBM is a two-layer neural network in which stochastic binary inputs and outputs are connected by symmetrically weighted, undirected connections.

The initial parameters of the first RBM are also determined by the query words.
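For reference, a minimal numpy sketch of one CD-1 (contrastive divergence) update, the training step the Theano tutorial RBM implements; shapes and learning rate here are illustrative only:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.1):
    # Up pass: sample binary hidden units given the visible data v0.
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(v0.dtype)
    # Down-up pass: reconstruct visibles, then recompute hidden probabilities.
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # Gradient: positive statistics minus negative (reconstruction) statistics.
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
    b_vis += lr * (v0 - v1_prob).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)

# Toy usage: V=6 visible (word) units, H=4 hidden units, batch of 8.
V, H = 6, 4
W = 0.01 * rng.standard_normal((V, H))
b_vis, b_hid = np.zeros(V), np.zeros(H)
cd1_update((rng.random((8, V)) < 0.5).astype(float), W, b_vis, b_hid)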

2) Word Extraction from Importance Matrix AF

Summary Generation Sub-Tasks (in order)

Word Extraction from Importance Matrix AF. This is the second step in summary generation.

AF_i,n = importance of ith word in vocabulary to the nth node of the hidden layer H3 (extraction layer)

Extract the 10 words with the largest AF_i,n values in every nth node of the hidden layer. This set of words is defined as UN.
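A minimal sketch of this extraction step, assuming AF is a (V, N) numpy array (vocabulary size by hidden nodes) and vocab maps row index to word:

import numpy as np

def extract_top_words(AF, vocab, k=10):
    UN = set()
    for n in range(AF.shape[1]):
        top_rows = np.argsort(AF[:, n])[-k:]   # k largest AF_i,n for node n
        UN.update(vocab[i] for i in top_rows)
    return UN

# Toy example: 4 words, 3 nodes; every column's two largest rows are 2 and 3.
AF = np.arange(12.0).reshape(4, 3)
print(extract_top_words(AF, ["a", "b", "c", "d"], k=2))  # {'c', 'd'}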

Modify SOLR query to obtain term frequencies to feed into the RBM

In order to run the Deep Belief Network on the term frequency data, the dataset field in DBN.py needs to reflect the expected input.

The format appears to be

[ [train_x, train_y], [valid_x, valid_y], [test_x, test_y]]

where train_x, train_y, valid_x, valid_y, test_x, and test_y are numpy arrays holding the corresponding inputs and labels.

We need to take the term frequency data and wrap it up in these arrays and then run

cPickle.dump(your_own_dataset, open("file.pkl", "wb"), protocol=-1)
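A hedged end-to-end sketch, assuming the term-frequency matrix has already been saved as a hypothetical term_freq.npy (a docs-by-vocabulary array). Labels are unused in unsupervised pretraining, so zero arrays serve as dummies, and tiny slices stand in for the validation/test splits until the issue above removes that requirement:

import pickle   # stands in for cPickle, its Python 2 predecessor used above
import numpy as np

tf_matrix = np.load("term_freq.npy")                # assumed precomputed tf data
dummy_y = np.zeros(len(tf_matrix), dtype=np.int64)  # dummy labels

dataset = [[tf_matrix, dummy_y],
           [tf_matrix[:1], dummy_y[:1]],            # placeholder validation split
           [tf_matrix[:1], dummy_y[:1]]]            # placeholder test split

with open("file.pkl", "wb") as f:
    pickle.dump(dataset, f, protocol=-1)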

Generate the feature vectors input into the RBM

First, generate a vocabulary of length V from the words appearing in the document topic set D.

Then, calculate the feature vector F^D of the document set D and the feature vector F^d of each single document d in D.

F^D_i is the tf (term frequency) value of the i-th word in the vocabulary of D, calculated over all documents.

F^d_i is the tf value of the i-th word in the vocabulary of D, calculated over the single document d.
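A minimal sketch of this computation, assuming documents is a list of token lists (one per document in D) and taking tf as the count normalized by total tokens (the exact normalization is an assumption):

from collections import Counter

def tf_vectors(documents):
    vocab = sorted({w for doc in documents for w in doc})
    # Per-document tf vectors (F^d), one for each single document d.
    per_doc = []
    for doc in documents:
        counts = Counter(doc)
        per_doc.append([counts[w] / len(doc) for w in vocab])
    # Document-set tf vector (F^D), computed over all documents pooled.
    pooled = Counter(w for doc in documents for w in doc)
    total = sum(pooled.values())
    set_vec = [pooled[w] / total for w in vocab]
    return vocab, set_vec, per_doc

vocab, F_D, F_docs = tf_vectors([["deep", "learning"], ["deep", "summary"]])
print(vocab, F_D)  # ['deep', 'learning', 'summary'] [0.5, 0.25, 0.25]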

3) Calculate importance of every sentence In_t

Summary Generation Sub-Tasks (in order)

The importance of every sentence In_t must be calculated. This is the third step in summary generation.

In_t = \sum_i w_i

where

w_i = lambda if mu_i is contained in UN AND mu_i is contained in q
w_i = 1 if mu_i is contained in UN
w_i = 0 otherwise

lambda = query word importance factor
mu_i = the i-th word in sentence s_t
Importance of Summary: In = \sum_t In_t such that Le <= N_s
Le = l_1 + l_2 + ... + l_t + ... + l_T <= N_s
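A minimal sketch of this scoring rule; UN is the extracted word set, query the set of query words, and lam the importance factor lambda:

def sentence_importance(sentence_words, UN, query, lam):
    total = 0.0
    for w in sentence_words:            # w plays the role of mu_i
        if w in UN and w in query:
            total += lam                # w_i = lambda
        elif w in UN:
            total += 1.0                # w_i = 1
        # otherwise w_i = 0 and the word contributes nothing
    return total

# Example: with lambda = 5, the query hit dominates the score.
print(sentence_importance(["deep", "learning", "summary"],
                          {"deep", "summary"}, {"deep"}, 5.0))  # 6.0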

Extensive Testing

We need to run extensive tests on this component to ensure it is complete. Basic testing confirms that the class runs as expected, but it has not been exercised against large amounts of data, so there may be undiscovered bugs.

1) Calculate Importance Matrix AF

Summary Generation Sub-Tasks (in order)

Importance Matrix AF must be calculated. This is the first step in summary generation.

AF_i,n = importance of ith word in vocabulary to the nth node of the hidden layer H3 (extraction layer)
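The issue does not spell out how AF is obtained from the trained network. One plausible illustration, treating AF as the product of the learned weight matrices from the word (visible) layer up to H3, is sketched below; this is an assumption, not the confirmed formula:

import numpy as np

def importance_matrix(W1, W2, W3):
    # Hypothetical: chain the stacked RBM weights so that AF[i, n] scores
    # the influence of word i on node n of H3.
    return W1 @ W2 @ W3

# Toy dimensions: V=5 words, intermediate layers of 4 and 3, |H3|=2.
rng = np.random.default_rng(0)
AF = importance_matrix(rng.random((5, 4)), rng.random((4, 3)), rng.random((3, 2)))
print(AF.shape)  # (5, 2)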
