
coling2018's Introduction

The repository contains the source code to reproduce experiments from the COLING 2018 paper: Topic or Style? Exploring the Most Useful Features for Authorship Attribution.

Dependencies

  1. Python 2.7
  2. Scikit Learn 0.18
  3. Keras 1.1.1 (with Theano backend). By default, Keras uses TensorFlow as its tensor manipulation library. Please refer to the [Keras website](https://keras.io/) [1] to configure the Keras backend.
  4. Pandas
  5. NLTK 3.0.4
  6. Scipy 0.19.0
  7. Seaborn 0.7.1
  8. lda 1.0.4
  9. Matplotlib 1.3.1

You can install all of these by running:

pip install -r requirements.txt
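
Since the experiments expect the Theano backend, the backend setting may need changing after installation. The following is a minimal sketch, assuming the Keras configuration lives in a JSON file (commonly `~/.keras/keras.json`) with a `"backend"` key. The helper name `set_keras_backend` is hypothetical and is not part of this repository.

```python
import json
import os


def set_keras_backend(config_path, backend="theano"):
    """Set the "backend" key in a Keras JSON config file, keeping other keys.

    Hypothetical helper for illustration; config_path is assumed to point
    at the Keras configuration file (commonly ~/.keras/keras.json).
    """
    config = {}
    if os.path.exists(config_path):
        # Preserve whatever settings are already stored in the file.
        with open(config_path) as handle:
            config = json.load(handle)
    config["backend"] = backend
    config_dir = os.path.dirname(config_path)
    if config_dir and not os.path.isdir(config_dir):
        os.makedirs(config_dir)
    with open(config_path, "w") as handle:
        json.dump(config, handle, indent=4)
    return config
```

Calling `set_keras_backend(os.path.expanduser("~/.keras/keras.json"))` would switch the stored backend to Theano. Setting the `KERAS_BACKEND=theano` environment variable before launching Python is an alternative that leaves the file untouched.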

Cloning the repository

git clone https://github.com/yunitata/coling2018

Preparing Data

  1. All datasets need to be requested directly from their authors. For CCAT10 and CCAT50, please refer to this [paper](http://www.sciencedirect.com/science/article/pii/S0306457307001197) [2]; for Judgment and IMDb62, refer to this paper [3]. Please note that there are two versions of the IMDb62 dataset. In this experiment, we used the version which contains 62,000 movie reviews and 17,550 message board posts.
  2. The CCAT10, CCAT50 and IMDb62 datasets come as a list of files per author. To make things easier, we merge all the documents from every author (for each dataset) into one CSV file. This can be done with the following command:

python data_prep.py folder_path csv_path "data_code"

CCAT10 and CCAT50 each come with train and test folders, so each produces separate train and test CSV files. For example, to prepare the train and test data for CCAT10:

python data_prep.py "/home/C10train" "/home/C10_train.csv" "ccat"

python data_prep.py "/home/C10test" "/home/C10_test.csv" "ccat"

The IMDb62 dataset does not come with a separate train/test split. Lastly, the Judgment dataset already comes as a single .txt file, so no data preparation is needed.
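
To illustrate the merging step, here is a minimal sketch of collecting per-author files into one CSV. It is not the code in `data_prep.py`. The folder layout (`folder_path/<author>/<document>`), the column names `author` and `text`, and the function name `merge_author_folders` are all assumptions made for illustration. The sketch targets Python 3; under Python 2.7 the file-opening calls would need adjusting.

```python
import csv
import os


def merge_author_folders(folder_path, csv_path):
    """Write one CSV row (author, text) per document under folder_path.

    Assumes folder_path/<author>/<document file>; the layout and column
    names are illustrative and not taken from data_prep.py.
    """
    rows = []
    for author in sorted(os.listdir(folder_path)):
        author_dir = os.path.join(folder_path, author)
        if not os.path.isdir(author_dir):
            continue  # skip stray files at the top level
        for name in sorted(os.listdir(author_dir)):
            doc_path = os.path.join(author_dir, name)
            # Undecodable bytes are replaced rather than raising an error.
            with open(doc_path, encoding="utf-8", errors="replace") as handle:
                rows.append((author, handle.read().strip()))
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["author", "text"])
        writer.writerows(rows)
    return len(rows)
```

The function returns the number of documents written, which gives a quick sanity check against the expected dataset size.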

Dataset Analysis

To run the dataset analysis, use the following command:

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python data_analysis.py --data datasetname --data_path pathofdata --n_topics numberoftopics

datasetname is the code name of the dataset: ccat10, ccat50, judgment or imdb
pathofdata is the path to the data
numberoftopics is the number of topics used by LDA

The code produces the average JS (Jensen-Shannon) divergence score and a heatmap/confusion matrix that shows the topical dissimilarity between authors.
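
As a rough illustration of the score, the sketch below computes the Jensen-Shannon divergence between per-author topic distributions and averages it over all author pairs. It is not taken from `data_analysis.py`. It assumes each author is represented by a single probability vector over topics, that the logarithm is base 2, and that the average is over unordered pairs. The function names are hypothetical.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors."""
    # The mixture m is positive wherever p or q is, so KL stays finite.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def pairwise_js(author_topics):
    """Return (matrix, average) of JS divergences over all author pairs."""
    n = len(author_topics)
    matrix = [[0.0] * n for _ in range(n)]
    scores = []
    for i, j in combinations(range(n), 2):
        score = js_divergence(author_topics[i], author_topics[j])
        matrix[i][j] = matrix[j][i] = score  # the divergence is symmetric
        scores.append(score)
    average = sum(scores) / len(scores) if scores else 0.0
    return matrix, average
```

The symmetric matrix returned here is the kind of input a heatmap could be drawn from, and a higher average suggests the authors write about more distinct topics.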
