Giter Club home page Giter Club logo

so-cal's Introduction

SO-CAL

SO-CAL is the Semantic Orientation CALculator, a tool to extract sentiment from text. Sentiment is defined as positive or negative opinion.

SO-CAL has a long history of development, starting roughly in 2004. See below for improvements in this version. The best description of SO-CAL is in this paper:

Taboada, Maite, Julian Brooke, Milan Tofiloski, Kimberly Voll and Manfred Stede (2011) Lexicon-Based Methods for Sentiment Analysis. Computational Linguistics 37 (2): 267-307.

Other papers about SO-CAL development:

Taboada, M., J. Brooke and M. Stede (2009) Genre-Based Paragraph Classification for Sentiment Analysis. In Proceedings of 10th Annual SIGDIAL Conference on Discourse and Dialogue. London, UK. September 2009. pp. 62-70.

Brooke, J., M. Tofiloski and M. Taboada (2009) Cross-Linguistic Sentiment Analysis: From English to Spanish. In Proceedings of RANLP 2009, Recent Advances in Natural Language Processing. Borovets, Bulgaria. September 2009. pp. 50-54. -- Poster

Voll, K. and M. Taboada (2007) Not All Words are Created Equal: Extracting Semantic Orientation as a Function of Adjective Relevance. In Proceedings of the 20th Australian Joint Conference on Artificial Intelligence. Gold Coast, Australia. December 2007. pp. 337-346.

Taboada, M., C. Anthony and K. Voll (2006) Methods for Creating Semantic Orientation Dictionaries. Proceedings of 5th International Conference on Language Resources and Evaluation (LREC). Genoa, Italy. May 2006. pp. 427-432.

Taboada, M. and J. Grieve (2004) Analyzing Appraisal Automatically American Association for Artificial Intelligence Spring Symposium on Exploring Attitude and Affect in Text. Stanford. March 2004. AAAI Technical Report SS-04-07. (pp.158-161). Download poster (pdf).

New version of SO-CAL

The code is written in Python 3.5 so please make sure to work with this version of Python.

  • Part 1 - Install Stanford CoreNLP
  • Part 2 - Data Preprocessing
  • Part 3 - Sentiment Calculator

PART 1 - INSTALL STANFORD CORENLP

  1. Download Newest Stanford CoreNLP
  2. Unzip your downloaded .zip file. For example, unzip stanford-corenlp-full-2016-10-31.zip
  3. cd stanford-corenlp-full-2016-10-31
  4. java -mx5g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 10000, this will start the server. timeout is in milliseconds, here we set it to 10 sec above. You should increase it if you pass huge blobs to the server.
  5. pip3 install pycorenlp
  6. For an example code to test your setup

PART 2 - DATA PREPROCESSING

  • It reads a raw text file or folder and generate a folder with tagged file(s). Each token in a output file is tagged with POS tag
  • We are using Stanford CoreNLP for the tokenization, sentence spliting and POS tagging
  • The code file is preprocess.py under folder Source_Code/text_preprocessing
  • In order to run the code, you may need to change the settings in run_text_preprocessing.sh by defining the paths of raw text input, preprocessed data output and which stanford annotations you need. Then in your terminal, type cd source_code and type sh run_text_preprocessing.sh
  • Our sample input can be found in folder Sample/input/Raw_Text/BOOKS
    • If your raw text input is a folder, in run_text_preprocessing.sh, the command line should be python3.5 text_preprocessing/preprocess.py -i '../Sample/input/Raw_Text/BOOKS/' -o '../Sample/output/Preprocessed_Output/BOOKS/' -a 'tokenize,ssplit,pos'
    • If your raw text input is a file, in run_text_preprocessing.sh, the command line should be python3.5 text_preprocessing/preprocess.py -i '../Sample/input/Raw_Text/BOOKS/no1.txt' -o '../Sample/output/Preprocessed_Output/BOOKS/' -a 'tokenize,ssplit,pos'
    • NOTE: In order to make the output more organized, the output will be a folder no matter what your input is
  • Sample output can be found in folder Sample/output/Preprocessed_Output/BOOKS

PART 3 - SENTIMENT CALCULATOR

  • Major features of Sentiment Calculator

    • Read 1 single text file or a folder of text files, calculate sentiment for each file (positive, negative or neutral)
    • Generate detailed lists of word sentiment as well as average Sentiment Orientation (SO) score, total SO score for each file
      • The word types here are Noun, Verb, Adjectives and Adverb
    • If your input data has sentiment classification labels (positive or negative), we call the labels gold data here, it will generate sentiment prediction accuracy
  • Code Structure

    • All the source code for sentiment calculator is located under folder Source_Code/sentiment_calculator
    • SO_Calc.py
      • It process 1 file each time and does all the sentiment calculation
      • For each file, it adds the basic sentiment output, our sample is output.txt under folder Sample/output/SO_CAL_Output/BOOKS. For each file, there are file name and SO score
      • For each file, it also adds detailed sentiment output for each file, our sample is richout.txt under folder Sample/output/SO_CAL_Output/BOOKS. For each file, there are total text length; word sentiment & SO score for each Noun, Verb, Adjective and Adverb; Average SO score for Nouns, Verbs, Adjectives and Adverbs; and Total SO score for the file
    • SO_Run.py
      • It can read 1 single text file or a folder that contains text files. For each file, it calls SO_Calc.py
      • The input text file has to be preprocessed text. Check our sample preprocessed files under folder Sample/output/Preprocessed_Output/BOOKS. To preprocess your raw text files, check our PART 2 - DATA PREPROCESSING above
      • After SO_Calc.py has generated the output for all the files, SO_Run.py reads output.txt and richout.txt, in order to generate formatted file_sentiment.csv and rich_output.json
      • file_sentiment.csv is generated from output.txt, our sample is under folder Sample/output/SO_CAL_Output/BOOKS. For each file, it has file name, sentiment and SO score
      • rich_output.json is generated from richout.txt, our sample in under folder Sample/output/SO_CAL_Output/BOOKS. It contains the same data in the same order as richout.txt, but in JSON format which is easier to read and load data
      • If there is gold data, prediction_accuracy.txt generates the sentiment prediction accuracy, our sample can be found under folder Sample/output/SO_CAL_Output/BOOKS
      • There are 2 ways to create gold data:
        • Start your input text file name with 'yes' or 'no'. For example, yes7.txt, no7.txt. When the code is running, a gold file will be generated automatically under folder Resources/gold
        • Create a gold file with file name and sentiment label, check our sample in 'gold.txt' under folder Sample/gold. With a gold file, you don't need to worry about naming the text files, but the file names have to match each input text file
      • Without any gold data is also fine, you just won't generate prediction_accuracy.txt file, won't influence other output
  • How to Run the Code

    • In your terminal, under the folder of this project
      • Type cd Source_Code
      • Then type sh run_sentiment_calculator.sh
      • If you want to change the setteings of the command line input, go to run_sentiment_calculator.sh under folder Source_Code, and edit the command line
    • Command line arguments:
      • Use -i to indicate your input
      • Use -o to indicate your output folder. The output has to be a folder, so that all the output files can be generated there
      • Use -c to indicate your config files. Our config sample en_SO_Calc.ini for English, Spa_SO_Calc.ini for Spanish can be found under folder Resources/config_files
      • Use -cf to indicate your cutoff value
      • Use -g to indicate your gold file path. This argument is optional
      • -i, -o, -c, -cf are required, and we all have default values for them in this project
    • Sample Command line
      • Command line with default values: Python3.5 sentiment_calculator/SO_Run.py
      • Command line with full settings: Python3.5 sentiment_calculator/SO_Run.py -i "../Sample/output/Preprocessed_Output/BOOKS" -o "../Sample/output/SO_CAL_Output/BOOKS" -c "../Resources/config_files/en_SO_Calc.ini" -cf 0.0 -g "../Sample/gold/gold.txt"

so-cal's People

Contributors

kvarada avatar maitetaboada avatar waterliky avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

so-cal's Issues

Using french dictionnary

Hello,

Last time you said it may be possible to use french Dictionnary but I would like to know if I have to translate the English dictionnary to french and then use it for my test or Should I myself use a french lexicon and grade myself the positive value and negative value of words ?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.