Textplot

[Figure: the term network generated from War and Peace]

Textplot is a little program that converts a document into a network of terms, with the goal of teasing out information about the high-level topic structure of the text. For each term:

  1. Get the set of offsets in the document where the term appears.

  2. Using kernel density estimation, compute a probability density function (PDF) that represents the word's distribution across the document. Eg, from War and Peace:

[Figure: a term's estimated PDF across War and Peace]

  3. Compute the similarity between the term's PDF and the PDFs of all other terms in the document, as one minus the Bray-Curtis dissimilarity. This measures the extent to which two words appear in the same locations (see the sketch after this list).

  4. Sort this list in descending order to get a custom "topic" for the term. Skim off the top N words (usually 10-20) to get the strongest links. Here's "napoleon":

[('napoleon', 1.0),
('war', 0.65319871313854128),
('military', 0.64782349297012154),
('men', 0.63958189887106576),
('order', 0.63636730075877446),
('general', 0.62621616907584432),
('russia', 0.62233286026418089),
('king', 0.61854160459241103),
('single', 0.61630514751638699),
('killed', 0.61262010905310182),
('peace', 0.60775702746632576),
('contrary', 0.60750138486684579),
('number', 0.59936009740377516),
('accompanied', 0.59748552019874168),
('clear', 0.59661288775164523),
('force', 0.59657370362505935),
('army', 0.59584331507492383),
('authority', 0.59523854206807647),
('troops', 0.59293965397478188),
('russian', 0.59077308177196441)]
  5. Shovel all of these links into a network and export a GML file.
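
As a rough illustration of steps 2 and 3, here's a minimal sketch of how a term's PDF and a pairwise similarity could be computed with scikit-learn and scipy. The function names and defaults below are illustrative only, not Textplot's internal API; the bandwidth, samples, and kernel values just mirror the command-line defaults described later.

import numpy as np
from scipy.spatial.distance import braycurtis
from sklearn.neighbors import KernelDensity

def term_pdf(offsets, doc_length, bandwidth=2000, samples=1000, kernel='gaussian'):
    """Estimate a term's density across the document from its token offsets."""
    kde = KernelDensity(kernel=kernel, bandwidth=bandwidth)
    kde.fit(np.asarray(offsets, dtype=float).reshape(-1, 1))
    grid = np.linspace(0, doc_length, samples).reshape(-1, 1)
    return np.exp(kde.score_samples(grid))  # score_samples returns log densities

def term_similarity(pdf_a, pdf_b):
    """One minus the Bray-Curtis dissimilarity: 1.0 for identical distributions."""
    return 1 - braycurtis(pdf_a, pdf_b)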

Generating graphs

There are two ways to create graphs - you can use the textplot executable from the command line, or, if you want to tinker around with the underlying NetworkX graph instance, you can fire up a Python shell and use the build_graph() helper directly.

Either way, first install Textplot. With PyPI:

pip install textplot

Or, clone the repo and install the package manually:

python3 -m venv env
. env/bin/activate
pip install -r requirements.txt
python setup.py install

From the command line

Then, from the command line, generate graphs with:

textplot generate [IN_PATH] [OUT_PATH] [OPTIONS]

Where the input is a regular .txt file, and the output is a .gml file. So, if you're working with War and Peace:

textplot generate war-and-peace.txt war-and-peace.gml

The generate command takes these options:

  • --term_depth=1000 (int) - The number of terms to include in the network. For now, Textplot takes the top N most frequent terms, after stopwords are removed.

  • --skim_depth=10 (int) - The number of connections (edges) to skim off the top of the "topics" computed for each word.

  • --d_weights (flag) - By default, terms that appear in similar locations in the document are connected by edges with "heavy" weights, the semantics expected by force-directed layout algorithms like Force Atlas 2 in Gephi. If this flag is passed, the weights are inverted - use this if you want to do any kind of pathfinding analysis on the graph, where edge weights are generally assumed to represent distance or cost.

  • --bandwidth=2000 (int) - The bandwidth for the kernel density estimation. This controls the "smoothness" of the curves. 2000 is a sensible default for long novels, but bump it down if you're working with shorter texts.

  • --samples=1000 (int) - The number of equally-spaced points on the X-axis where the kernel density is sampled. 1000 is almost always enough, unless you're working with a huge document.

  • --kernel=gaussian (str) - The kernel function. The scikit-learn implementation also supports tophat, epanechnikov, exponential, linear, and cosine.
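
These options can be combined freely. For example, to build a smaller network with more connections per term (the values here are purely illustrative, not recommendations):

textplot generate war-and-peace.txt war-and-peace.gml --term_depth=500 --skim_depth=20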

From a Python shell

Or, fire up a Python shell and import build_graph() directly:

In [1]: from textplot.helpers import build_graph

In [2]: g = build_graph('war-and-peace.txt')

Tokenizing text...
Extracted 573064 tokens

Indexing terms:
[################################] 124750/124750 - 00:00:06

Generating graph:
[################################] 500/500 - 00:00:03

build_graph() returns an instance of textplot.graphs.Skimmer, which gives access to an instance of networkx.Graph. Eg, to get degree centralities:

In [3]: import networkx as nx
In [4]: nx.degree_centrality(g.graph)
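
Since g.graph is a plain networkx.Graph, anything in NetworkX can be applied to it. For example, a quick sketch of pulling the ten most central terms and writing the graph out by hand - only g.graph comes from Textplot, the rest is standard NetworkX:

In [5]: sorted(nx.degree_centrality(g.graph).items(), key=lambda kv: kv[1], reverse=True)[:10]

In [6]: nx.write_gml(g.graph, 'war-and-peace.gml')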

Textplot uses numpy, scipy, scikit-learn, matplotlib, networkx, and clint.
