Scripts for analyzing text

These scripts analyze text by emotion, sentiment, subjectivity, orientation, and color.

Lexicons Used

The lexicons were parsed and compiled by the script compile_lexicons.py into the file lexicons/lexicons_compiled.csv, using the following categories and counts:

Total word count: 14,852
Words with emotion: 4,463 (30.0%)
Words with sentiment: 10,916 (73.5%)
Words with subjectivity: 6,886 (46.4%)
Words with orientation: 2,192 (14.8%)
Words with color: 5,404 (36.4%)
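The percentages above are each category's word count divided by the total word count. The short sketch below recomputes them from the counts listed here. The variable names are illustrative and are not taken from compile_lexicons.py.

```python
# Counts copied from the list above.
TOTAL_WORDS = 14852
CATEGORY_COUNTS = {
    "emotion": 4463,
    "sentiment": 10916,
    "subjectivity": 6886,
    "orientation": 2192,
    "color": 5404,
}

# Share of the compiled lexicon that carries each category, as a percentage string.
shares = {name: f"{count / TOTAL_WORDS:.1%}" for name, count in CATEGORY_COUNTS.items()}

for name, share in shares.items():
    print(f"Words with {name}: {CATEGORY_COUNTS[name]:,} ({share})")
```

The counts add up to more than the total, so a single word can carry more than one category.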

How to analyze text

Download a text, e.g. texts/moby_dick.txt

Parse the text using gutenberg_text.py <text file> <output text json file> <output chapter json file>, e.g. gutenberg_text.py ../texts/moby_dick.txt ../output/moby_dick_normal.json ../output/moby_dick_chapters.json. This generates output that looks like this:

{
  "title": "Moby-Dick; or, The Whale",
  "author": "Herman Melville",
  "chapters": [
    {
      "title": "Loomings",
      "text": "discrete words in lowercase separated by spaces with punctuation removed"
    },
    {
      "title": "The Carpet-Bag",
      "text": "discrete words in lowercase separated by spaces with punctuation removed"
    },
    ...
  ]
}
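
The "text" fields above hold lowercase words separated by single spaces, with punctuation removed. A minimal sketch of that kind of normalization follows. The function name normalize_text is hypothetical, and the exact rules in gutenberg_text.py may differ; in particular, replacing punctuation with a space (so "don't" becomes "don t") and keeping only ASCII letters and digits are assumptions.

```python
import re

def normalize_text(raw: str) -> str:
    """Lowercase the text, drop punctuation, and collapse whitespace.

    Assumed rules, not necessarily those used by gutenberg_text.py.
    """
    lowered = raw.lower()
    # Replace every character that is not an ASCII letter, digit, or
    # whitespace with a space, so "ago--never" splits into two words.
    no_punct = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(no_punct.split())

print(normalize_text("Call me Ishmael. Some years ago--never mind how long!"))
```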

Run get_data.py <json file from previous step> <a path to output csv file>, e.g. get_data.py output/moby_dick_normal.json output/moby_dick_data.csv. This outputs a .csv file in the format:
```
emotion,color,orientation,sentiment,subjectivity,chapter
0,2,1,1,-1,0
...
```
Each row represents a word, and each column holds the index of that word's value within the corresponding category listed in data/categories.json.
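
As a rough illustration, the sketch below reads a row in this format and turns the indices back into labels. The CATEGORIES dictionary is a hypothetical stand-in for data/categories.json: it lists only the values named in this README, in the order they appear there, and the real file may be structured differently. Treating a negative index as "no value" is also an assumption.

```python
import csv
import io

# Hypothetical stand-in for data/categories.json. Only values named in this
# README are listed, so the real lists are almost certainly longer.
CATEGORIES = {
    "emotion": ["anger", "fear"],
    "color": ["white", "black"],
    "orientation": ["active", "passive"],
    "sentiment": ["positive", "negative"],
    "subjectivity": ["weak", "strong"],
}

# Header and sample row copied from the format shown above.
SAMPLE_CSV = """emotion,color,orientation,sentiment,subjectivity,chapter
0,2,1,1,-1,0
"""

def decode_row(row):
    """Turn one CSV row of category indices into readable labels."""
    decoded = {"chapter": int(row["chapter"])}
    for name, values in CATEGORIES.items():
        index = int(row[name])
        if index < 0:
            decoded[name] = None  # assumption: negative means "no value"
        elif index < len(values):
            decoded[name] = values[index]
        else:
            decoded[name] = f"#{index}"  # index not covered by the stand-in list
    return decoded

rows = [decode_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0])
```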

Run analyze_data.py <csv file from previous step> <a path to output json file> <word buffer> <word offset>, e.g. python analyze_data.py output/moby_dick_data.csv output/moby_dick_analysis.json 400 200. This outputs a .json file in the format:

[
 {
   "chapter": 0,
   "emotion": [
     0.500, // anger
     0.250, // fear
     ...
   ],
   "subjectivity": [
     0.600, // weak
     0.150 // strong
   ],
   "sentiment": [
     0.750, // positive
     0.050 // negative
   ],
   "orientation": [
     0.850, // active
     0.450 // passive
   ],
   "color": [
     0.950, // white
     0.001, // black
     ...
   ]
 },
 ...
]

Each item represents a group of words whose size is the word buffer passed to analyze_data.py. The numbers are fractions between 0 and 1 that represent the relative weight of each category value within that group.
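
One way to picture the word buffer and word offset is as a sliding window over the per-word indices of a single category. The sketch below assumes that the buffer is the window size, that the offset is the step between window starts, that negative indices mean "no value", and that a weight is simply the share of words in the window carrying that value. The function name window_weights is hypothetical, and analyze_data.py may compute or normalize its weights differently.

```python
from collections import Counter

def window_weights(indices, value_count, buffer_size, offset):
    """Return, for each window, the share of words carrying each category value.

    indices: per-word value index for one category (negative = no value, assumed)
    value_count: number of possible values in that category
    buffer_size: words per window; offset: words between window starts
    """
    if not indices:
        return []
    results = []
    last_start = max(len(indices) - buffer_size, 0)
    for start in range(0, last_start + 1, offset):
        window = indices[start:start + buffer_size]
        counts = Counter(i for i in window if i >= 0)
        results.append([round(counts[v] / len(window), 3) for v in range(value_count)])
    return results

# Six words of a two-value category, windows of 4 words advancing by 2.
weights = window_weights([0, 0, 1, -1, 1, 1], value_count=2, buffer_size=4, offset=2)
print(weights)
```

With a buffer of 400 and an offset of 200, as in the example command, consecutive windows under these assumptions would overlap by half.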

Optionally, run python report_data.py <analysis json file> <output dir> to write individual .csv files for each category, e.g. python report_data.py output/moby_dick_analysis.json output/moby_dick/
