The systemprogram from georgesaussy

README


Introduction
This directory should contain all the files and programs necessary to quickly
replicate the results documented in the write up. All files should be
self-documented in addition to an outline of their use below. Each of the
computer experiments run will be documented separately. All binaries were
compiled and ran on Ubuntu 14. They may need to be recompiled for another OS.
The documentation for the data files is sparse.

The data
All of the data was taken from "spiegel.de". My understanding is that the site
is a German mirror of Project Gutenburg. The data was downloaded by python glue
scripts using the Linux terminal browser lynx with the flag "--dump". The
directories for each author are located in file in "rawdata" carrying the
respective authors names. The original files downloaded were file linking to
each of works by each of the respective authors. Those links were then
downloaded in the same way and were then parsed for the text of each work. These
parses are the "*corpus" and "*corpus1" files in the main directory. Both the
downloading and the parsing was imperfect and introduced errors in the
transcription. Steps were taken to clean up the data, but I suspect error remain
and steps were taken to correct for these. (The corpora could not be reviewed by
hand in time for the project's deadline.) This is discussed in the writeup and
some of the documentation.


Naive Bayes
The Naive Bayes is run by, in the naivebayes directory, running first
	$ python naivebayes_logwriter.py -d func
and then
	$ python naivebayse_classifier.py
As an example "logfreq_data" contains the printed information from
naivebayes_logwriter.py. The model prints the log-likelihood that each author
wrote the text. It also prints the confusion matrix given the model only trained
on 90% of the data and tested on the other 10%. The model uses +1 smoothing. See
the "naivebayes_logwriter.py" and "naivebayes_classifier.py" self-documentation
for further details.
In addition, if instead of with the "func" option, one writes
    $ python naivebayes_logwriter.py -d all
the program will use the authors' combined vocabulary to run the Naive Bayes
model instead of the just the functional words. See the writeup for more notes
on the distinction. Binning changes the results and is discussed in the writeup.
See "naivebayes/README" for further documentation. Both
"naivebayes_logwriter.py" and "naivebayes_classifier.py" are self documented as
well.


Perceptron
The perceptron is run by, in the "perceptron" sub-directory using the command
    $ python perceptron.py --eta [float] --itrs [int]
Where the "eta" parameter is the learning factor for the model and the "itrs"
parameter is the number of iterations. The program will print the strength with
which a perceptron trained to recognize each author believes the System Program
was written by that author. It was found that the model was extremely sensitive
to the learning parameter. This is consistent with the results from the SVM and
indicate that the data in the training corpus is not separable. This is further
discussed in the writeup. See "perceptron/README" for further documentation and
"perceptron.py" is self-documented as well.
georgesaussy / systemprogram Goto Github PK

systemprogram's Introduction

systemprogram's People

Contributors

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent