
seda-ml-summarization's People

Contributors

hkuhn, pcannons


seda-ml-summarization's Issues

Make data processing for input more memory efficient

Currently, parsing data from 3,890 documents with 115,000 distinct words takes about 24.2 GB of RAM. We should find a way to parse the data using a more reasonable amount of memory.

Potential ideas include streaming the data during parsing with a module like pandas that supports chunked reads, or using numpy's memmap to avoid keeping all of the data in RAM.
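A rough sketch of the memmap idea, assuming a hypothetical counts.csv with doc_id/word_id/count columns and the corpus dimensions above:

import numpy as np
import pandas as pd

V = 115000       # distinct words in the vocabulary
N_DOCS = 3890    # documents in the corpus

# A float32 memmap keeps the term-frequency matrix on disk instead of in RAM.
tf = np.memmap("term_freq.dat", dtype=np.float32, mode="w+", shape=(N_DOCS, V))

# Stream the raw counts in chunks rather than loading the whole file at once.
for chunk in pd.read_csv("counts.csv", chunksize=100_000):
    tf[chunk["doc_id"].values, chunk["word_id"].values] = chunk["count"].values

tf.flush()  # persist to disk; reopen later with mode="r" for read-only access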

Remove requirement for validation/test set in DBN

Right now the DBN requires nonempty validation/test sets to run without error (because of how the Theano code is constructed). Since these sets are unnecessary in unsupervised learning, the requirement should be removed.
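A minimal sketch of what the fix could look like, assuming the dataset keeps the [[train], [valid], [test]] list layout described elsewhere in these issues; unpack_datasets is a hypothetical helper, not existing project code:

def unpack_datasets(datasets):
    train_set = datasets[0]
    # Validation/test become optional: missing or empty entries are treated
    # as None instead of crashing the Theano graph construction.
    valid_set = datasets[1] if len(datasets) > 1 and datasets[1] else None
    test_set = datasets[2] if len(datasets) > 2 and datasets[2] else None
    return train_set, valid_set, test_set

# Unsupervised pretraining: only a training split is supplied.
train, valid, test = unpack_datasets([[[0.1, 0.2], [0]]])
assert valid is None and test is None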

4) Solve the Optimization Problem

Summary Generation Sub-Tasks (in order)

Solve the optimization problem. This is the fourth and final step in summary generation.

Classical knapsack problem in dynamic programming

f_k(lambda_k) = max_{mu_k} { mu_k In_k + f_{k-1}(lambda_{k-1}) }
lambda_k = lambda_{k+1} - mu_{k+1} l_{k+1}
k = t, 1 <= t <= T
lambda_0 = 0
lambda_T = N_s
f_0(lambda_0) = 0

f_k(lambda_k) = maximum summary importance achievable at stage k
k = stage variable indexing the current sentence
lambda_k = remaining length budget before stage k starts
mu_k = decision variable for sentence k (1 if sentence k is included in the summary, 0 otherwise)
l_k = length of sentence k
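A minimal dynamic-programming sketch of this recurrence, assuming integer sentence lengths; importances plays the role of In_t and budget the role of N_s:

def select_sentences(importances, lengths, budget):
    T = len(importances)
    # f[k][lam] = best total importance using the first k sentences
    # with lam units of length budget available.
    f = [[0.0] * (budget + 1) for _ in range(T + 1)]
    for k in range(1, T + 1):
        for lam in range(budget + 1):
            f[k][lam] = f[k - 1][lam]                 # mu_k = 0: skip sentence k
            if lengths[k - 1] <= lam:                 # mu_k = 1: take sentence k
                take = importances[k - 1] + f[k - 1][lam - lengths[k - 1]]
                f[k][lam] = max(f[k][lam], take)
    # Backtrack to recover which sentences were selected.
    selected, lam = [], budget
    for k in range(T, 0, -1):
        if f[k][lam] != f[k - 1][lam]:
            selected.append(k - 1)
            lam -= lengths[k - 1]
    return sorted(selected)

# Example: within a 10-word budget, sentences 1 and 2 maximize importance.
print(select_sentences([3.0, 1.0, 4.0], [6, 3, 5], 10))  # [1, 2]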

Adapt Theano Restricted Boltzmann Machine to SEDA-ML

Basic building block for the hidden layer.

An RBM is a two-layer neural network in which stochastic binary inputs and outputs are connected by symmetrically weighted, undirected connections.

The initial parameters of the first RBM are also determined by the query words.
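For reference, a minimal numpy sketch of one CD-1 (contrastive divergence) update, the training step the Theano tutorial RBM implements; shapes and learning rate here are illustrative only:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.1):
    # Up pass: sample binary hidden units given the visible data v0.
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(v0.dtype)
    # Down-up pass: reconstruct visibles, then recompute hidden probabilities.
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # Gradient: positive statistics minus negative (reconstruction) statistics.
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
    b_vis += lr * (v0 - v1_prob).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)

# Toy usage: V=6 visible (word) units, H=4 hidden units, batch of 8.
V, H = 6, 4
W = 0.01 * rng.standard_normal((V, H))
b_vis, b_hid = np.zeros(V), np.zeros(H)
cd1_update((rng.random((8, V)) < 0.5).astype(float), W, b_vis, b_hid)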

2) Word Extraction from Importance Matrix AF

Summary Generation Sub-Tasks (in order)

Word Extraction from Importance Matrix AF. This is the second step in summary generation.

AF_i,n = importance of ith word in vocabulary to the nth node of the hidden layer H3 (extraction layer)

Extract the 10 words with the largest AF_i,n values in every nth node of the hidden layer. This set of words is defined as UN.
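A minimal sketch of this extraction step, assuming AF is a (V, N) numpy array (vocabulary size by hidden nodes) and vocab maps row index to word:

import numpy as np

def extract_top_words(AF, vocab, k=10):
    UN = set()
    for n in range(AF.shape[1]):
        top_rows = np.argsort(AF[:, n])[-k:]   # k largest AF_i,n for node n
        UN.update(vocab[i] for i in top_rows)
    return UN

# Toy example: 4 words, 3 nodes; every column's two largest rows are 2 and 3.
AF = np.arange(12.0).reshape(4, 3)
print(extract_top_words(AF, ["a", "b", "c", "d"], k=2))  # {'c', 'd'}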

Modify SOLR query to obtain term frequencies to feed into the RBM

In order to run the Deep Belief Network on the term frequency data, the dataset field in DBN.py needs to reflect the expected input.

The format appears to be

[ [train_x, train_y], [valid_x, valid_y], [test_x, test_y]]

where train_x, train_y, valid_x, valid_y, test_x, and test_y are numpy arrays holding the corresponding inputs and labels.

We need to take the term frequency data and wrap it up in these arrays and then run

cPickle.dump(your_own_dataset, open("file.pkl", "wb"), protocol=-1)
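A hedged end-to-end sketch, assuming the term-frequency matrix has already been saved as a hypothetical term_freq.npy (a docs-by-vocabulary array). Labels are unused in unsupervised pretraining, so zero arrays serve as dummies, and tiny slices stand in for the validation/test splits until the issue above removes that requirement:

import pickle   # stands in for cPickle, its Python 2 predecessor used above
import numpy as np

tf_matrix = np.load("term_freq.npy")                # assumed precomputed tf data
dummy_y = np.zeros(len(tf_matrix), dtype=np.int64)  # dummy labels

dataset = [[tf_matrix, dummy_y],
           [tf_matrix[:1], dummy_y[:1]],            # placeholder validation split
           [tf_matrix[:1], dummy_y[:1]]]            # placeholder test split

with open("file.pkl", "wb") as f:
    pickle.dump(dataset, f, protocol=-1)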

Generate the feature vectors input into the RBM

First, generate a vocabulary of length V from the words appearing in the document topic set D.

Then, calculate the feature vector F^D of the document set D and the feature vector F^d of each single document d in D.

F^D_i is the tf (term frequency) value of the i-th word in the vocabulary of D, calculated over all documents.

F^d_i is the tf value of the i-th word in the vocabulary of D, calculated over the single document d.
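A minimal sketch of this computation, assuming documents is a list of token lists (one per document in D) and taking tf as the count normalized by total tokens (the exact normalization is an assumption):

from collections import Counter

def tf_vectors(documents):
    vocab = sorted({w for doc in documents for w in doc})
    # Per-document tf vectors (F^d), one for each single document d.
    per_doc = []
    for doc in documents:
        counts = Counter(doc)
        per_doc.append([counts[w] / len(doc) for w in vocab])
    # Document-set tf vector (F^D), computed over all documents pooled.
    pooled = Counter(w for doc in documents for w in doc)
    total = sum(pooled.values())
    set_vec = [pooled[w] / total for w in vocab]
    return vocab, set_vec, per_doc

vocab, F_D, F_docs = tf_vectors([["deep", "learning"], ["deep", "summary"]])
print(vocab, F_D)  # ['deep', 'learning', 'summary'] [0.5, 0.25, 0.25]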

3) Calculate importance of every sentence In_t

Summary Generation Sub-Tasks (in order)

The importance of every sentence In_t must be calculated. This is the third step in summary generation.

In_t = \sum_i w_i

where

w_i = lambda if mu_i is contained in UN AND mu_i is contained in q
w_i = 1 if mu_i is contained in UN
w_i = 0 otherwise

lambda = query word importance factor
mu_i = the i-th word in sentence s_t
Importance of Summary: In = \sum_t In_t such that Le <= N_s
Le = l_1 + l_2 + ... + l_t + ... + l_T <= N_s
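A minimal sketch of this scoring rule; UN is the extracted word set, query the set of query words, and lam the importance factor lambda:

def sentence_importance(sentence_words, UN, query, lam):
    total = 0.0
    for w in sentence_words:            # w plays the role of mu_i
        if w in UN and w in query:
            total += lam                # w_i = lambda
        elif w in UN:
            total += 1.0                # w_i = 1
        # otherwise w_i = 0 and the word contributes nothing
    return total

# Example: with lambda = 5, the query hit dominates the score.
print(sentence_importance(["deep", "learning", "summary"],
                          {"deep", "summary"}, {"deep"}, 5.0))  # 6.0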

Extensive Testing

We need to run extensive tests on this component to ensure it is complete. Basic testing confirms that the class runs as expected, but it has not been exercised against large amounts of data, so there may be undiscovered bugs.

1) Calculate Importance Matrix AF

Summary Generation Sub-Tasks (in order)

Importance Matrix AF must be calculated. This is the first step in summary generation.

AF_i,n = importance of ith word in vocabulary to the nth node of the hidden layer H3 (extraction layer)
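The issue does not spell out how AF is obtained from the trained network. One plausible illustration, treating AF as the product of the learned weight matrices from the word (visible) layer up to H3, is sketched below; this is an assumption, not the confirmed formula:

import numpy as np

def importance_matrix(W1, W2, W3):
    # Hypothetical: chain the stacked RBM weights so that AF[i, n] scores
    # the influence of word i on node n of H3.
    return W1 @ W2 @ W3

# Toy dimensions: V=5 words, intermediate layers of 4 and 3, |H3|=2.
rng = np.random.default_rng(0)
AF = importance_matrix(rng.random((5, 4)), rng.random((4, 3)), rng.random((3, 2)))
print(AF.shape)  # (5, 2)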
