Giter Club home page Giter Club logo

perceptron-text-classification's Introduction

Perceptron Text Classification

Università degli Studi Firenze

Overview

Despite the fact that the standard Perceptron provides a good result in its simplicity, it presents some criticalities. Suppose, for example, that we want to classify the XOR function. It is immediately evident that it is impossible to draw a plane that divides the positive examples from the negative ones without committing any error. It follows that the algorithm, not being data linearly separable, will continue to generate each time a different plan and the final one will be randomly determined by the moment in which the stop occurs after a certain number of iterations.

xor function

Suppose now to train the Perceptron and obtain, after a few iterations, a satisfactory classifier that correctly predicts the next 5000 submitted data points. If the last datum is classified incorrectly, the plan must be updated despite its previous accuracy. To limit these situations, whenever a plan has to change, the number c of correct consecutive classifications will be saved. In this way, during testing, it will be possible to possible to determine the sign of an example by weighing the contribution of each plan, according to the formula:

The experiments revealed, as expected, dependence on the order in which the data were shown as input. This implies that, for the same problem, different seeds can generate very different performance for the standard version, while the voted one remains stable. More details, schematized as the table below, can be found in the final report.

result table

Prerequisites

  • Scikit-Learn to obtain the 20 Newsgroup dataset and various functionalities to transform the text into a numeric input.
  • Numpy to perform vectorized operations.
  • Memory Profiler useful to keep trace of memory occupation.
  • Pretty Table for a nice confusion matrix formatting.

Run

Experiments can be launched from the test.py file, containing three category couples as an example. In general, it is possible to choose them from the following list:

  • comp.os.ms-windows.misc
  • comp.sys.ibm.pc.hardware
  • comp.sys.mac.hardware
  • comp.windows.x
  • rec.autos
  • rec.motorcycles
  • rec.sport.baseball
  • rec.sport.hockey
  • sci.crypt
  • sci.electronics
  • sci.med
  • sci.space
  • misc.forsale
  • talk.politics.misc
  • talk.politics.guns
  • talk.politics.mideast
  • talk.religion.misc
  • alt.atheism
  • soc.religion.christian

In the two main functions it is possible to change the max_iter and seed parameters, in order to affect the number of cycles on the training data and to obtain different scenarios according to the shuffling.

perceptron.test_default(categories, max_iter=10, seed=8)
perceptron.test_voted(categories, max_iter=10, seed=8)

Although it is not recommended, within the util.py class it is possible to include additional elements of the original text such as headers, footers and quotes by removing the last attribute.

train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True,
                               random_state=seed, remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', categories=categories,
                              remove=('headers', 'footers', 'quotes'))

If you want to get a graphical detail of the memory usage you need to run

mprof run test.py
mprof plot

References

perceptron-text-classification's People

Contributors

samuelemeta avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.