sentilex's Introduction

Sentiment Lexicon Generation Suite

License: MIT

This project provides executable files and scripts for generating sentiment lexicons from GermaNet (a German equivalent of the English WordNet), raw text corpora, and neural word embeddings.

Building

To generate a sentiment lexicon from pre-trained word embeddings, you first need to compile the C++ code by running the following commands (please note that the build requires the Armadillo library to be installed):

cd build/
cmake ../
make

Afterwards, an executable called vec2dic will appear in the subdirectory bin. You can execute this file by invoking:

./bin/vec2dic [OPTIONS] --type=TYPE VECTOR_FILE SEED_FILE

where the TYPE argument (an integer from zero to three) determines the algorithm used to induce the sentiment lexicon, VECTOR_FILE denotes a path to a text file with pre-trained word2vec embeddings (note that the file should be in the raw text format with space-separated values), and SEED_FILE denotes a path to a file with the seed terms. We currently support the following types of algorithms:

  • 0 -- nearest centroids (default);
  • 1 -- KNN;
  • 2 -- PCA.
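As a rough illustration of the default option, the following Python sketch classifies words by their nearest seed centroid. It is not the vec2dic implementation: the function names, the use of cosine similarity, the optional header handling, and the seed labels are all assumptions made for this example.

```python
import math


def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines (raw text, space-separated values)."""
    vectors = {}
    for i, line in enumerate(lines):
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Assumption: an optional first line 'VOCAB_SIZE DIM' is a header.
        if i == 0 and len(fields) == 2 and all(f.isdigit() for f in fields):
            continue
        vectors[fields[0]] = [float(x) for x in fields[1:]]
    return vectors


def centroid(vecs):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_centroid_lexicon(vectors, seeds):
    """Assign each non-seed word the label of the most similar seed centroid.

    `seeds` maps seed words to labels (hypothetical format for this sketch).
    """
    by_label = {}
    for word, label in seeds.items():
        if word in vectors:
            by_label.setdefault(label, []).append(vectors[word])
    centroids = {label: centroid(vs) for label, vs in by_label.items()}
    if not centroids:
        return {}
    return {
        word: max(centroids, key=lambda lbl: cosine(vec, centroids[lbl]))
        for word, vec in vectors.items()
        if word not in seeds
    }


vectors = load_vectors([
    "gut 1.0 0.1",
    "schlecht -1.0 0.2",
    "toll 0.9 0.0",
    "mies -0.8 0.1",
])
lexicon = nearest_centroid_lexicon(
    vectors, {"gut": "positive", "schlecht": "negative"}
)
```

The executable may use a different distance measure and a different seed-file layout; the sketch only conveys the general idea of comparing each word vector with per-class seed centroids.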

Examples

In addition to the C++ executables, we also provide several reimplementations of popular alternative approaches which generate sentiment lexicons from lexical taxonomies (e.g., GermaNet) or raw unlabeled text corpora. Please note that in order to use dictionary-based methods, you need to download GermaNet, which is not included here by default due to license restrictions, and place its files in the directory data/GermaNet_v9.0/. For corpus-based algorithms, you need to provide a pre-lemmatized corpus in a format similar to the one used in data/snapshot_corpus_data/example.txt. Alternatively, for the method of Takamura et al. (2005), you need to provide a list of coordinately conjoined pairs similar to the one provided in data/corpus/cc_light.txt.
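Before running the scripts, it may help to check that the inputs mentioned above are in place. The following Python sketch is not part of the repository; the helper name and the grouping keys are hypothetical, and only the paths are taken from this README.

```python
from pathlib import Path

# Paths as named in this README; the grouping keys are hypothetical.
REQUIRED_INPUTS = {
    "dictionary": ["data/GermaNet_v9.0"],
    "corpus": ["data/snapshot_corpus_data/example.txt"],
}


def missing_inputs(method_family, root="."):
    """Return the required paths that do not exist under `root`."""
    return [
        path
        for path in REQUIRED_INPUTS[method_family]
        if not (Path(root) / path).exists()
    ]
```

For instance, `missing_inputs("dictionary")` returns `["data/GermaNet_v9.0"]` when it is called from a checkout into which GermaNet has not been copied yet.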

Below, you can find a short summary and command examples of the provided systems.

Hu and Liu (2004)

To generate a sentiment lexicon with the method of Hu and Liu (2004), invoke the following command:

./scripts/generate_lexicon.py hu-liu \
 --ext-syn-rels --seed-pos=adj \
 --form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
 data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0
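The general idea behind this family of dictionary-based methods is that synonyms of a seed word inherit its polarity, while antonyms receive the opposite one. The Python sketch below shows that idea on a toy graph. It is not the code in scripts/generate_lexicon.py: the function name, the dictionary-based graph format, and the +1/-1 encoding are assumptions made for this example.

```python
from collections import deque


def propagate_polarity(seeds, synonyms, antonyms):
    """Breadth-first expansion of seed polarities over a toy lexical graph.

    seeds:    dict word -> +1 or -1
    synonyms: dict word -> iterable of synonyms (same polarity)
    antonyms: dict word -> iterable of antonyms (opposite polarity)
    """
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for other in synonyms.get(word, ()):
            if other not in polarity:
                polarity[other] = polarity[word]
                queue.append(other)
        for other in antonyms.get(word, ()):
            if other not in polarity:
                polarity[other] = -polarity[word]
                queue.append(other)
    return polarity


toy_lexicon = propagate_polarity(
    seeds={"gut": 1},
    synonyms={"gut": ["toll"], "schlecht": ["mies"]},
    antonyms={"gut": ["schlecht"]},
)
```

In this sketch, the first label assigned to a word is kept, so conflicting evidence that arrives later is ignored.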

Blair-Goldensohn et al. (2008)

If you want to generate a sentiment lexicon using the method of Blair-Goldensohn et al. (2008), invoke the following command:

./scripts/generate_lexicon.py blair-goldensohn \
 --ext-syn-rels --seed-pos=adj \
 --form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
 data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0/

Kim and Hovy (2004)

To generate a sentiment lexicon with the method of Kim and Hovy (2004), use the following command:

./scripts/generate_lexicon.py kim-hovy \
 --ext-syn-rels --seed-pos=adj \
 --form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
 data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0/

Takamura et al. (2005)

To generate a sentiment lexicon with the method of Takamura et al. (2005), use the following command instead (note that the file data/corpus/cc.txt is not included in this repository due to its large size):

./scripts/generate_lexicon.py takamura \
    --form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
    data/seeds/turney_littman_2003.txt data/GermaNet_v9.0/ data/corpus/cc.txt -1

Esuli and Sebastiani (2006)

For generating a sentiment lexicon using the SentiWordNet method of Esuli and Sebastiani (2006), you should use the following command:

./scripts/generate_lexicon.py esuli --ext-syn-rels \
--seed-pos=adj --form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0

Rao and Ravichandran (2009)

To generate a sentiment lexicon with the min-cut approach of Rao and Ravichandran (2009), use the following command:

./scripts/generate_lexicon.py rao-min-cut --ext-syn-rels \
--seed-pos=adj --form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0

If you want to test the label propagation algorithm described by these authors, you should specify the following arguments:

./scripts/generate_lexicon.py rao-lbl-prop --ext-syn-rels \
--seed-pos=adj --form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0
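For intuition, label propagation can be pictured as repeatedly averaging polarity scores over a word graph while keeping the seed scores fixed. The Python sketch below illustrates this on a toy graph. It is not the repository's implementation: the function name, the edge-list format, the unweighted averaging, and the fixed number of iterations are assumptions made for this example.

```python
def label_propagation(edges, seeds, iterations=20):
    """Propagate seed scores over an undirected, unweighted graph.

    edges: iterable of (word_a, word_b) pairs
    seeds: dict word -> score (e.g., +1.0 or -1.0), clamped in every step
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Unlabeled nodes start at 0.0; seed nodes start at their seed score.
    scores = {node: float(seeds.get(node, 0.0)) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                updated[node] = float(seeds[node])  # keep seeds fixed
            else:
                updated[node] = sum(scores[n] for n in nbrs) / len(nbrs)
        scores = updated
    return scores


toy_scores = label_propagation(
    edges=[("gut", "toll"), ("toll", "prima"), ("schlecht", "mies")],
    seeds={"gut": 1.0, "schlecht": -1.0},
)
```

The sign of the final score can then be read as the predicted polarity of a word, and its magnitude as a rough confidence.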

Awdallah and Radev (2010)

To generate a sentiment lexicon using the method of Awdallah and Radev (2010), you should use the following command:

./scripts/generate_lexicon.py awdallah --ext-syn-rels \
--seed-pos=adj --form2lemma=data/GermaNet_v9.0/gn_form2lemma.txt \
data/seeds/hu_liu_seedset.txt data/GermaNet_v9.0/

Velikovich et al. (2010)

For generating a sentiment lexicon using the algorithm of Velikovich et al. (2010), you can use the following command:

./scripts/generate_lexicon.py velikovich \
data/seeds/hu_liu_seedset.txt -1 data/snapshot_corpus_data/example.txt

Kiritchenko et al. (2014)

In order to generate a sentiment lexicon using the system of Kiritchenko et al. (2014), you should use the following command:

./scripts/generate_lexicon.py kiritchenko \
data/seeds/hu_liu_seedset.txt -1 data/snapshot_corpus_data/example.txt
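Corpus-based approaches of this kind typically score a word by how much more strongly it is associated with positive than with negative contexts, for example as the difference of two pointwise mutual information (PMI) values. The Python sketch below shows such a score on toy data. It is not the repository's implementation: the function name, the way documents are labeled by the seeds they contain, the label strings, and the additive smoothing are assumptions made for this example.

```python
import math
from collections import Counter


def pmi_scores(documents, seeds, smoothing=1.0):
    """Score words as PMI(word, positive) - PMI(word, negative).

    documents: iterable of token lists (assumed to be lemmatized already)
    seeds:     dict word -> "positive" or "negative"
    """
    pos_counts, neg_counts = Counter(), Counter()
    for tokens in documents:
        labels = {seeds[t] for t in tokens if t in seeds}
        if len(labels) != 1:
            continue  # skip documents without seeds or with mixed seeds
        target = pos_counts if labels == {"positive"} else neg_counts
        target.update(t for t in tokens if t not in seeds)
    total_pos = sum(pos_counts.values()) + smoothing
    total_neg = sum(neg_counts.values()) + smoothing
    scores = {}
    for word in set(pos_counts) | set(neg_counts):
        pos = pos_counts[word] + smoothing  # smoothing avoids log(0)
        neg = neg_counts[word] + smoothing
        scores[word] = math.log2((pos * total_neg) / (neg * total_pos))
    return scores


toy_pmi = pmi_scores(
    documents=[
        ["gut", "film", "toll"],
        ["schlecht", "film", "mies"],
        ["gut", "toll"],
    ],
    seeds={"gut": "positive", "schlecht": "negative"},
)
```

Positive scores indicate words that occur mostly in positively labeled documents, and negative scores indicate the opposite.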

Severyn and Moschitti (2014)

For generating a sentiment lexicon using the approach of Severyn and Moschitti (2014), you should use the following command:

./scripts/generate_lexicon.py severyn \
data/seeds/hu_liu_seedset.txt -1 data/snapshot_corpus_data/example.txt

Evaluation

You can evaluate the resulting sentiment lexicon on the PotTS dataset by using the following command and providing a valid path to the downloaded corpus data:

./scripts/evaluate.py -l data/form2lemma.txt \
	data/results/esuli-sebastiani/esuli-sebastiani.ext-syn-rels.turney-littman-seedset.txt \
	${PATH_TO_PotTS}/corpus/basedata/ ${PATH_TO_PotTS}/corpus/annotator-2/markables/
