dfrtopics

This small R package provides bits and pieces to help make and explore topic models of text, especially word-count data like that available from JSTOR's Data for Research (DfR) service. It uses MALLET to run the models.

I wrote most of the bits and pieces here while working on my research (I am a literary scholar), so this is not meant to be a professional, sophisticated, multipurpose tool. Nonetheless, by now it seemed worth it to make some of what I'd done conceivably reusable by others who might also want to explore topic models even if they, like me, know very little about machine learning. The code skews to my amateurishness as a programmer. It is all very much in-progress, hacked together, catch-as-catch-can, I am not an expert, I am not a lawyer, etc., etc., etc. Use and share freely, at your own risk.

Every function has online help in R. For a fairly detailed introduction to what you can do with this package, see the introductory vignette: vignette("introduction", "dfrtopics") or online here. I'm always happy to hear from anyone who makes use of this.

Installation

This is too messy for CRAN. The easiest way to install it is to first install the devtools package, and then use that to install this package straight from GitHub:

library(devtools)
install_github("agoldst/dfrtopics")

(This should work even if you don't have git or a GitHub account.)

I have been profligate with dependencies. Note that if you use RStudio, getting rJava and mallet to load can be a messy business. See my blog post on rJava and RStudio on MacOS X.

Browsing the model interactively

Now in alpha release: another project of mine, dfr-browser, which turns topic models of DfR data into a JavaScript-based interactive browser. To export the results of a model as a browser, use the package function dfr_browser (see the function documentation for more detail).

A note on licensing

I have decided to apply the MIT License to this repository. That means you can pretty much do anything you want with it, provided you attribute stuff by me to me. And you can't hold me liable. I prefer the spirit of the GNU General Public License, but I would like academics who use this code to be able to do so without being obliged to release their source, since that is not always possible. I don't attempt to forbid commercial uses, but I don't welcome them.

Running the package tests

The tests are based on a sample set of data from DfR. I do not currently have permission to distribute that data, but you can recreate it if you wish to run the tests or regenerate the package vignette. Perform this search in DfR and make a Dataset Request for wordcounts and metadata in CSV format. Then unzip the archive to a directory test-data inside the package directory for dfrtopics.
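The unzip step above can be sketched in R roughly as follows. This is only an illustration, not part of the package: the helper name extract_test_data, the archive file name, and the checkout location are all hypothetical, and running the tests through devtools::test() assumes you have devtools installed and a local copy of the package source.

```r
# Hypothetical helper (not part of dfrtopics): unzip a downloaded DfR
# archive into the test-data directory inside a local package checkout.
extract_test_data <- function(zip_path, pkg_dir = ".") {
    zip_path <- path.expand(zip_path)
    if (!file.exists(zip_path)) {
        stop("Archive not found: ", zip_path)
    }
    data_dir <- file.path(pkg_dir, "test-data")
    dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)
    utils::unzip(zip_path, exdir = data_dir)
    invisible(data_dir)
}

# Example usage (both paths are assumptions; adjust to your machine):
# extract_test_data("~/Downloads/dfr-dataset.zip", "dfrtopics")
# devtools::test("dfrtopics")
```

The helper stops early if the archive is missing, so a mistyped path does not leave you with an empty test-data directory.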

Version history

v0.2.3 : 4/19/16. An adjusted dfr-browser export via dfr_browser() for one-line interactive browsing. export_browser_data is still available for more control. wordcounts_instances introduced to help express "no, MALLET, no more tokenizing!"

v0.2.2 : 2/10/16. New (beta) feature: functions for the mutual information of words and documents within topics, and for using this in a posterior predictive check of the model fit: imi_topic, mi_topic, imi_check, mi_check. Introduces a dependency on RcppEigen.

v0.2.1 : 9/23/15. Minor updates. read_wordcounts accepts a reader method for improved flexibility about data sources, and export_browser_data is more tolerant of variant metadata formats. Scaled topic coordinates now use JS_divergence written in C++ (introducing a direct Rcpp dependency for a questionable speed gain in a function no one uses). Various code- and documentation-cleaning tweaks.

v0.2 : New release, September 2015. An almost completely rewritten API, so don't expect backwards compatibility. This version should be more flexible and easier to use. At least it has more documentation.

v0.1 : Earliest public version(s), 2013–2015.

Andrew Goldstone
