README


Introduction
This directory should contain all the files and programs necessary to quickly
replicate the results documented in the writeup. All files should be
self-documented, and their use is outlined below. Each of the computer
experiments is documented separately. All binaries were compiled and run on
Ubuntu 14; they may need to be recompiled for another OS. The documentation for
the data files is sparse.

The data
All of the data was taken from "spiegel.de". My understanding is that the site
is a German mirror of Project Gutenberg. The data was downloaded by Python glue
scripts using the Linux terminal browser lynx with the flag "--dump". The
directories for each author are located in "rawdata" and carry the respective
authors' names. The files originally downloaded were files linking to each of
the works by the respective author. Those links were then downloaded in the same
way and parsed for the text of each work. These parses are the "*corpus" and
"*corpus1" files in the main directory. Both the downloading and the parsing
were imperfect and introduced errors in the transcription. Steps were taken to
clean up the data, but I suspect errors remain, and further steps were taken to
correct for these. (The corpora could not be reviewed by hand in time for the
project's deadline.) This is discussed in the writeup and some of the
documentation.
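
As a rough illustration of the download step, the sketch below shells out to
lynx and captures the rendered text. It is not the glue script from this
repository: the function names are hypothetical, and the default flag is
spelled "-dump" as in the lynx man page (the text above writes "--dump"), so it
is left as a parameter. Calling fetch_text requires lynx to be installed.

```python
import subprocess


def lynx_dump_command(url, flag="-dump"):
    """Build the argument list for a text-only dump of `url` (hypothetical helper)."""
    return ["lynx", flag, url]


def fetch_text(url):
    """Run lynx and return the rendered page as a string.

    Raises CalledProcessError if lynx exits with a non-zero status.
    """
    result = subprocess.run(
        lynx_dump_command(url),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A caller would first fetch an author's index page, extract the links to the
individual works, and then call fetch_text once per link.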


Naive Bayes
Naive Bayes is run from the "naivebayes" directory by first running
	$ python naivebayes_logwriter.py -d func
and then
	$ python naivebayes_classifier.py
As an example, "logfreq_data" contains the printed output of
naivebayes_logwriter.py. The model prints the log-likelihood that each author
wrote the text. It also prints the confusion matrix obtained when the model is
trained on only 90% of the data and tested on the other 10%. The model uses +1
smoothing. See the "naivebayes_logwriter.py" and "naivebayes_classifier.py"
self-documentation for further details.
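
To make the log-likelihood and +1 smoothing concrete, here is a minimal sketch
of a word-count Naive Bayes scorer. It is not the code in
naivebayes_classifier.py: the function names, the toy documents, and the
implicit uniform prior over authors are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(docs_by_author, vocabulary):
    """Return per-author log-probabilities for each vocabulary word (+1 smoothing)."""
    model = {}
    for author, docs in docs_by_author.items():
        counts = Counter(w for doc in docs for w in doc if w in vocabulary)
        # Adding 1 to every count adds len(vocabulary) to the denominator.
        total = sum(counts.values()) + len(vocabulary)
        model[author] = {w: math.log((counts[w] + 1) / total) for w in vocabulary}
    return model


def log_likelihood(model, doc):
    """Score a tokenized document under each author; out-of-vocabulary words are skipped."""
    return {
        author: sum(logp[w] for w in doc if w in logp)
        for author, logp in model.items()
    }


# Toy data, invented for this example only.
docs = {
    "author_a": [["der", "und", "der"], ["und", "der"]],
    "author_b": [["die", "und", "die"], ["die", "die"]],
}
vocab = {"der", "die", "und"}
model = train(docs, vocab)
scores = log_likelihood(model, ["der", "der", "und"])
best = max(scores, key=scores.get)
print(best, scores)
```

Because of the smoothing, a word an author never used still gets a small
non-zero probability, so a single unseen word does not drive the score to
negative infinity.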
In addition, if one passes the "all" option instead of the "func" option,
    $ python naivebayes_logwriter.py -d all
the program will use the authors' combined vocabulary to run the Naive Bayes
model instead of just the functional words. See the writeup for more notes on
the distinction. Binning changes the results and is discussed in the writeup.
See "naivebayes/README" for further documentation.
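
The 90%/10% evaluation mentioned above can be sketched as follows. This is an
assumed, simplified version: how the real scripts split the data is not stated
here, and split_90_10 and confusion_matrix are hypothetical names.

```python
import random


def split_90_10(items, seed=0):
    """Shuffle and split items into a 90% training part and a 10% test part."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * 9 // 10
    return items[:cut], items[cut:]


def confusion_matrix(true_labels, predicted_labels, authors):
    """Rows are the true author, columns the predicted author."""
    index = {author: i for i, author in enumerate(authors)}
    matrix = [[0] * len(authors) for _ in authors]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


train_part, test_part = split_90_10(range(20))
matrix = confusion_matrix(["a", "a", "b"], ["a", "b", "b"], ["a", "b"])
print(matrix)
```

Counts on the diagonal are correct attributions; off-diagonal counts show which
authors the model confuses with one another.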


Perceptron
The perceptron is run from the "perceptron" sub-directory using the command
    $ python perceptron.py --eta [float] --itrs [int]
where the "eta" parameter is the learning rate for the model and the "itrs"
parameter is the number of iterations. The program will print the strength with
which a perceptron trained to recognize each author believes the System Program
was written by that author. The model was found to be extremely sensitive to the
learning rate. This is consistent with the results from the SVM and indicates
that the data in the training corpus is not separable. This is further discussed
in the writeup. See "perceptron/README" for further documentation;
"perceptron.py" is self-documented as well.

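For readers unfamiliar with the model, the following is a minimal sketch of a
single one-author-versus-the-rest perceptron with the same two parameters. It is
an assumption about the general technique, not a copy of perceptron.py: the
feature vectors are toy values, and every function name here is hypothetical.

```python
import argparse


def activation(weights, bias, x):
    """Signed strength with which the perceptron assigns x to its author."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_perceptron(samples, labels, eta, itrs):
    """Train a binary perceptron; labels are +1 (target author) or -1 (anyone else)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(itrs):
        for x, y in zip(samples, labels):
            # Update only on mistakes, moving the boundary by a step of size eta.
            if y * activation(weights, bias, x) <= 0:
                weights = [w + eta * y * xi for w, xi in zip(weights, x)]
                bias += eta * y
    return weights, bias


def parse_args(argv=None):
    """Parse the two flags described above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--eta", type=float, required=True)
    parser.add_argument("--itrs", type=int, required=True)
    return parser.parse_args(argv)


# Toy feature vectors, invented for this example only.
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, -1, -1]
args = parse_args(["--eta", "0.1", "--itrs", "10"])
weights, bias = train_perceptron(samples, labels, args.eta, args.itrs)
print(activation(weights, bias, [0.8, 0.2]))
```

On separable toy data like this the updates stop after a few passes. On
non-separable data the weights keep changing on every pass, which is one reason
the result can depend heavily on eta and on the number of iterations.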
Contributors
georgesaussy
