corpus_analysis_assignments

Data assignment #1: Gender classification from text

Natural Language Processing and Psychology (Corpus Analysis)

Data

The data for this assignment is a subset of the British Academic Written English (BAWE) corpus: 570 essays, 285 written by female authors and 285 by male authors.

The full set of essays is in CORPUS_TXT. The essays to be used in this assignment are listed in CORPUS_TXT/BAWE_balanced_subset.csv. The “fname” column is the essay filename (CORPUS_TXT/[fname].csv) and the “gender” column is the author gender (0=male, 1=female). Disclosure: We selected a subset of essays that are “easier” to classify.
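As a rough illustration of the file layout described above, the subset listing could be read with the standard csv module. This is only a sketch: it assumes the file has a header row containing the "fname" and "gender" columns, and the helper name load_subset and the sample rows are made up for the example. The provided bawe_gender_classifier.py already does the real loading.

```python
import csv
import io


def load_subset(csv_file):
    """Return (fname, gender) pairs from an open CSV file with those columns."""
    reader = csv.DictReader(csv_file)
    return [(row["fname"], int(row["gender"])) for row in reader]


# Tiny in-memory stand-in for CORPUS_TXT/BAWE_balanced_subset.csv (made-up rows).
sample = io.StringIO("fname,gender\nessay_a,0\nessay_b,1\n")
rows = load_subset(sample)

# Gender labels in file order: 0 = male, 1 = female.
labels = [gender for _, gender in rows]
```

With a real file, the same helper would be called on an open file handle instead of the StringIO object.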

Task

You are provided with two scripts: bawe_gender_classifier.py and feature_extractor.py.

bawe_gender_classifier.py loads the essays specified in BAWE_balanced_subset.csv, calls extract_features from feature_extractor to extract a set of features, and runs a Naive Bayes classifier to predict the gender labels from the features. You do not need to modify this script.

feature_extractor.py is a collection of functions to extract various features from text data, which it expects to have the form of a list of strings, where each string is an instance to be classified. Here, you will write functions to extract a subset of the following features (based on the Argamon and Ottenbacher papers):

  • Function words: counts for each function word (from the nltk stopwords list). This first feature set is implemented for you.

  • POS: counts for the 500 most common ordered POS tag triples, the 100 most common ordered POS tag pairs, and all single POS tags.

  • Lexical: counts for the 500 most common trigrams, the 100 most common bigrams, and the 100 most common unigrams (stopwords removed for unigram counts).

  • Note: divide each count by the total number of words in the essay.

  • Topic: scores for the 20 highest-scoring topics from a topic model.

  • Complexity: average number of characters per word, ratio of unique words to total words, and average number of words per sentence.

  • Preprocessing: before extracting features, stem all words and replace any that appear only once with a placeholder token. This preprocessing step has been implemented for you.
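To make the count-based and complexity features above more concrete, here is a minimal sketch in plain Python. It is not the reference solution: the function names (tokenize, ngrams, top_ngrams, ngram_features, complexity_features) are hypothetical, the regular-expression tokenizer and sentence splitter are crude stand-ins for NLTK's, and stemming and stopword removal are left out. The ngrams helper works on any token list, so the same idea should carry over to POS tag sequences once the essays are tagged.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase and keep runs of letters; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return all ordered n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n, k):
    """Find the k most common n-grams across the whole list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return [gram for gram, _ in counts.most_common(k)]


def ngram_features(texts, n, k):
    """One row per text: count of each top-k n-gram divided by the text's word count."""
    vocab = top_ngrams(texts, n, k)
    rows = []
    for text in texts:
        tokens = tokenize(text)
        counts = Counter(ngrams(tokens, n))
        total = max(len(tokens), 1)  # avoid dividing by zero on empty texts
        rows.append([counts[gram] / total for gram in vocab])
    return rows


def complexity_features(text):
    """Average characters per word, unique/total word ratio, average words per sentence."""
    tokens = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(tokens), 1)
    return [
        sum(len(t) for t in tokens) / n_words,
        len(set(tokens)) / n_words,
        len(tokens) / max(len(sentences), 1),
    ]


# Two toy "essays" in the list-of-strings form the extractor expects.
texts = ["the cat sat. the cat ran.", "a dog sat."]
bigram_rows = ngram_features(texts, n=2, k=1)
complexity = complexity_features(texts[0])
```

The vocabulary of common n-grams is computed over all texts first, so every essay gets a feature vector of the same length and column order. Note that this sketch lets n-grams span sentence boundaries, which may or may not be what you want.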

Then, using the bawe_gender_classifier.py script, get the gender detection accuracy using each individual feature set, as well as all the features combined. Fill out the accuracy table and discussion question in the pull request template (see below).

See http://www.nyu.edu/projects/politicsdatalab/localdata/workshops/NLTK_presentation%20_code.py for an excellent, focused NLTK tutorial.

Github workflow for submission (beta)

We will use the fork and pull request workflow for submissions for this class. Follow the link for a guide to this workflow. The idea is that you will fork this repository, make the required changes in your own copy, and then submit a pull request and @mention me (@rivlev). The pull request and @mention together constitute your submission for this assignment. I will look at your feature_extractor.py and the body of the pull request.

The setup for this class is new, so please open an issue if you have any questions or run into bugs.

Contributors

rivlev, robert-d-schultz