text-analysis

Scripts for analyzing text based on emotion, sentiment, subjectivity, orientation, and color.

Lexicons Used

The lexicons were parsed and compiled by the script compile_lexicons.py into the file lexicons/lexicons_compiled.csv. The compiled file covers the following categories, with these word counts:

  • Total word count: 14,852
  • Words with emotion: 4,463 (30.0%)
  • Words with sentiment: 10,916 (73.5%)
  • Words with subjectivity: 6,886 (46.4%)
  • Words with orientation: 2,192 (14.8%)
  • Words with color: 5,404 (36.4%)
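
Each percentage above is the category's word count divided by the total word count. Because a single word can carry values in several categories, the percentages do not sum to 100. As a quick check, the figures can be reproduced with a few lines of Python; the variable and function names here are illustrative and are not taken from compile_lexicons.py:

```python
# Counts copied from the list above; all names are illustrative only.
TOTAL_WORDS = 14852
CATEGORY_COUNTS = {
    "emotion": 4463,
    "sentiment": 10916,
    "subjectivity": 6886,
    "orientation": 2192,
    "color": 5404,
}


def coverage(count, total=TOTAL_WORDS):
    """Return the share of lexicon words tagged with a category, in percent."""
    return round(100.0 * count / total, 1)


# Percentage of the compiled lexicon covered by each category.
COVERAGE = {name: coverage(count) for name, count in CATEGORY_COUNTS.items()}

if __name__ == "__main__":
    for name, percent in COVERAGE.items():
        print(f"{name}: {percent}%")
```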

How to analyze text

  1. Download text, e.g. texts/moby_dick.txt

  2. Parse text using gutenberg_text.py <text file> <output text json file> <output chapter json file>, e.g. gutenberg_text.py ../texts/moby_dick.txt ../output/moby_dick_normal.json ../output/moby_dick_chapters.json. This generates an output that looks like this:

    {
      "title": "Moby-Dick; or, The Whale",
      "author": "Herman Melville",
      "chapters": [
        {
          "title": "Loomings",
          "text": "discrete words in lowercase separated by spaces with punctuation removed"
        },
        {
          "title": "The Carpet-Bag",
          "text": "discrete words in lowercase separated by spaces with punctuation removed"
        },
        ...
      ]
    }
  3. Run get_data.py <json file from previous step> <a path to output csv file>, e.g. get_data.py output/moby_dick_normal.json output/moby_dick_data.csv. This outputs a .csv file in the format:

    emotion,color,orientation,sentiment,subjectivity,chapter
    0,2,1,1,-1,0
    ...
    

    Where each row represents a word, and each column gives the index of that word's value within the corresponding category listed in data/categories.json.

  4. Run analyze_data.py <csv file from previous step> <a path to output json file> <word buffer> <word offset>, e.g. python analyze_data.py output/moby_dick_data.csv output/moby_dick_analysis.json 400 200. This outputs a .json file in the format:

    [
     {
       "chapter": 0,
       "emotion": [
         0.500, // anger
         0.250, // fear
         ...
       ],
       "subjectivity": [
         0.600, // weak
         0.150 // strong
       ],
       "sentiment": [
         0.750, // positive
         0.050 // negative
       ],
       "orientation": [
         0.850, // active
         0.450 // passive
       ],
       "color": [
         0.950, // white
         0.001, // black
         ...
       ]
     },
     ...
    ]

    Where each item represents a group of words (of the word buffer size passed to analyze_data.py in this step). The numbers are fractions between 0 and 1 that represent the relative weight of each category value.

  5. Optionally, run python report_data.py <analysis json file> <output dir> to write an individual .csv file for each category, e.g. python report_data.py output/moby_dick_analysis.json output/moby_dick/
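
To make the steps above more concrete, here is a minimal sketch of three ideas they rely on: normalizing text into lowercase words without punctuation (step 2), grouping words into overlapping windows (step 4), and turning the category indices in a window into relative weights (step 4). It is not the code from gutenberg_text.py or analyze_data.py. The function names are hypothetical, and the sketch assumes that the word offset is the number of words the window advances each time, that an index of -1 means a word has no value in that category, and that a weight is the share of words in the window carrying a given value. The actual scripts may compute these differently.

```python
import string
from collections import Counter

# Map every ASCII punctuation character to a space so that words joined by
# punctuation (e.g. "Carpet-Bag") are split rather than fused together.
_PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)


def normalize(raw_text):
    """Lowercase the text, replace punctuation with spaces, collapse whitespace."""
    return " ".join(raw_text.lower().translate(_PUNCTUATION_TO_SPACE).split())


def windows(items, buffer_size, offset):
    """Yield overlapping slices of `buffer_size` items, advancing by `offset`.

    Assumption: trailing items that do not fit into a full window after the
    last step are dropped in this sketch.
    """
    if buffer_size <= 0 or offset <= 0:
        raise ValueError("buffer_size and offset must be positive")
    last_start = max(len(items) - buffer_size, 0)
    for start in range(0, last_start + 1, offset):
        yield items[start:start + buffer_size]


def relative_weights(indices, value_count):
    """Return, for each category value, the share of words carrying it.

    Assumption: an index of -1 marks a word with no value in the category,
    so it counts toward the window size but not toward any value.
    """
    counts = Counter(i for i in indices if i >= 0)
    total = len(indices)
    return [counts[v] / total if total else 0.0 for v in range(value_count)]


if __name__ == "__main__":
    words = normalize("Call me Ishmael. Some years ago!").split()
    for group in windows(words, buffer_size=4, offset=2):
        print(group)
    print(relative_weights([0, 0, 1, -1], value_count=2))
```

With the example arguments from step 4, the equivalent call would be windows(words, 400, 200), so under these assumptions each group shares half of its words with the previous one.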

text-analysis's People

Contributors

beefoo
