Giter Club home page Giter Club logo

textproc's Introduction

textproc

David Andrzejewski ([email protected])
Department of Computer Sciences
University of Wisconsin-Madison, USA


OVERVIEW

This code does some very simple pre-processing of raw text in order to
prepare nice inputs to topic modeling code.  For example, First-Order
Logic LDA (https://github.com/davidandrzej/LogicLDA).

Before building with 'mvn package', do 'bash getmodels.sh' to download
required OpenNLP model files (used for tokenization and sentence
detection).


USAGE

java -jar textproc-0.0.1-SNAPSHOT-jar-with-dependencies.jar 
followed by COMMAND ARG1 ARG2 ...


COMMANDS

counts - how many unique vocab terms you *would* have at wordcount thresholds
 foo.doclist (input)  full paths to the raw text files
 50          (input)  max threshold to consider 
 foo.counts  (output) will contain vocabulary sizes at various thresholds     

makestop - make stoplist of rarely occurring words
 foo.doclist (input)  full paths to the raw text files
 50          (input)  occurrence threshold 
 foo.stop    (output) will contain terms occuring < thresh times

makevocab - construct vocabulary 
 foo.doclist (input)  full paths to the raw text files
 foo.stop    (input)  stoplist, one word per line
 foo.vocab   (output) foo.vocab will contain vocabulary, one word per line

makecorpus - construct actual corpus files        
 foo.doclist (input)  full paths to the raw text files
 foo.vocab   (input)  vocabulary, one word per line
 foo         (output) will create corpus files: *.words, *.docs, *.sent

docfilter - filter documents down to those containing ALL terms of interest
 foo.doclist  (input)  full paths to the raw text files
 foo.keywords (input)  filter keywords, one word per line
 foo.hitdocs  (output) will contain documents containing ALL keywords


RESOURCE DEPENDENCIES

getmodels.sh will fetch OpenNLP models from sourceforge, as well as an
English stoplist associated with a publication that appeared in the
Journal of Machine Learning Research (JMLR):

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A
New Benchmark Collection for Text Categorization
Research. J. Mach. Learn. Res. 5 (December 2004), 361-397.


LICENSE

This software is open-source, released under the terms of the GNU
General Public License version 3, or any later version of the GPL (see
COPYING).

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.