melvinzang / textproc
This project forked from davidandrzej/textproc
Text pre-processing for LDA code
License: GNU General Public License v3.0
textproc

David Andrzejewski ([email protected])
Department of Computer Sciences
University of Wisconsin-Madison, USA

OVERVIEW

This code does some very simple pre-processing of raw text in order to prepare inputs for topic modeling code such as First-Order Logic LDA (https://github.com/davidandrzej/LogicLDA).

Before building with 'mvn package', run 'bash getmodels.sh' to download the required OpenNLP model files (used for tokenization and sentence detection).

USAGE

java -jar textproc-0.0.1-SNAPSHOT-jar-with-dependencies.jar COMMAND ARG1 ARG2 ...

COMMANDS

counts - how many unique vocab terms you *would* have at wordcount thresholds
  foo.doclist (input) full paths to the raw text files
  50 (input) max threshold to consider
  foo.counts (output) will contain vocabulary sizes at various thresholds

makestop - make stoplist of rarely occurring words
  foo.doclist (input) full paths to the raw text files
  50 (input) occurrence threshold
  foo.stop (output) will contain terms occurring < thresh times

makevocab - construct vocabulary
  foo.doclist (input) full paths to the raw text files
  foo.stop (input) stoplist, one word per line
  foo.vocab (output) will contain vocabulary, one word per line

makecorpus - construct actual corpus files
  foo.doclist (input) full paths to the raw text files
  foo.vocab (input) vocabulary, one word per line
  foo (output) will create corpus files: *.words, *.docs, *.sent

docfilter - filter documents down to those containing ALL terms of interest
  foo.doclist (input) full paths to the raw text files
  foo.keywords (input) filter keywords, one word per line
  foo.hitdocs (output) will contain documents containing ALL keywords

RESOURCE DEPENDENCIES

getmodels.sh will fetch OpenNLP models from SourceForge, as well as an English stoplist associated with a publication that appeared in the Journal of Machine Learning Research (JMLR):

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A New Benchmark Collection for Text Categorization Research. J. Mach. Learn. Res. 5 (December 2004), 361-397.

LICENSE

This software is open-source, released under the terms of the GNU General Public License version 3, or any later version of the GPL (see COPYING).
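
To make the threshold and filter semantics of the counts, makestop, and docfilter commands concrete, here is a minimal Python sketch. It is not the project's Java implementation. All function names are hypothetical, documents are assumed to be already tokenized (the real tool uses OpenNLP), and counts are assumed to be total occurrences across the corpus rather than per-document frequencies.

```python
from collections import Counter


def term_counts(docs):
    """Count total occurrences of each term across tokenized documents.

    Assumption: corpus-wide occurrence counts, not document frequencies.
    """
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


def make_stoplist(counts, thresh):
    """Terms occurring fewer than `thresh` times (the rule described for makestop)."""
    return sorted(term for term, c in counts.items() if c < thresh)


def vocab_sizes(counts, max_thresh):
    """Vocabulary size that would remain at each threshold 1..max_thresh.

    Assumption: a term survives threshold t when its count is >= t,
    i.e. the complement of the makestop rule.
    """
    return {
        t: sum(1 for c in counts.values() if c >= t)
        for t in range(1, max_thresh + 1)
    }


def doc_filter(docs, keywords):
    """Indices of the documents that contain ALL of the keywords."""
    required = set(keywords)
    return [i for i, tokens in enumerate(docs) if required <= set(tokens)]


# Tiny illustrative corpus (hypothetical data).
example_docs = [
    ["topic", "model", "topic"],
    ["model", "lda"],
    ["topic", "lda", "rare"],
]
example_counts = term_counts(example_docs)
print(make_stoplist(example_counts, 2))
print(vocab_sizes(example_counts, 3))
print(doc_filter(example_docs, ["topic", "lda"]))
```

With a threshold of 2, only terms seen once land on the stoplist, which is why inspecting the counts output first can help in choosing a threshold for makestop.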