The data for this assignment is a subset of the British Academic Written English (BAWE) corpus: 570 essays, 285 written by female authors and 285 by male authors.
The full set of essays is in CORPUS_TXT. The essays to be used in this assignment are listed in CORPUS_TXT/BAWE_balanced_subset.csv. The “fname” column is the essay filename (CORPUS_TXT/[fname].csv) and the “gender” column is the author gender (0=male, 1=female). Disclosure: We selected a subset of essays that are “easier” to classify.
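The subset file can be read with the standard `csv` module. The sketch below uses a small inline sample in place of the real CORPUS_TXT/BAWE_balanced_subset.csv, just to show the expected column layout; the variable names are illustrative, not taken from the provided scripts.

```python
import csv
import io

# Inline stand-in for CORPUS_TXT/BAWE_balanced_subset.csv
# (the real file has one row per essay; gender is 0=male, 1=female).
sample = """fname,gender
0001a,0
0002b,1
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Build (path, label) pairs following the CORPUS_TXT/[fname] convention.
essays = [("CORPUS_TXT/%s" % r["fname"], int(r["gender"])) for r in rows]
print(essays)
```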
You are provided with two scripts: bawe_gender_classifier.py and feature_extractor.py.
bawe_gender_classifier.py loads the essays specified in BAWE_balanced_subset.csv, calls extract_features from feature_extractor to extract a set of features, and runs a Naive Bayes classifier to predict the gender labels from the features. You do not need to modify this script.
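The overall data flow (feature dicts in, labels out) can be sketched with a toy, stdlib-only Naive Bayes standing in for the classifier the script uses; the function names and the made-up feature dicts below are assumptions for illustration, not the script's actual code.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled):
    """labeled: list of (feature_dict, label). Returns (log priors, log likelihoods)."""
    label_counts = Counter(lbl for _, lbl in labeled)
    feat_totals = defaultdict(Counter)
    for feats, lbl in labeled:
        feat_totals[lbl].update(feats)
    vocab = {f for feats, _ in labeled for f in feats}
    priors = {l: math.log(c / len(labeled)) for l, c in label_counts.items()}
    lik = {}
    for l, counts in feat_totals.items():
        total = sum(counts.values()) + len(vocab)  # Laplace (add-one) smoothing
        lik[l] = {f: math.log((counts[f] + 1) / total) for f in vocab}
    return priors, lik

def predict(model, feats):
    priors, lik = model
    def score(l):
        return priors[l] + sum(c * lik[l].get(f, 0.0) for f, c in feats.items())
    return max(priors, key=score)

# Toy feature dicts standing in for extract_features output (1=female, 0=male).
train = [({"the": 3, "she": 2}, 1), ({"the": 4, "he": 3}, 0)]
model = train_nb(train)
print(predict(model, {"she": 1}))  # predicts 1
```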
feature_extractor.py is a collection of functions that extract features from text data, which it expects in the form of a list of strings, where each string is one instance to be classified. Here, you will write functions to extract a subset of the following features (based on the Argamon and Ottenbacher papers):
- Function words: counts for each function word (from the NLTK stopwords list). This first feature set is implemented for you.
- POS: counts for the 500 most common ordered POS tag triples, the 100 most common ordered POS tag pairs, and all single POS tags.
- Lexical: counts for the 500 most common trigrams, the 100 most common bigrams, and the 100 most common unigrams (stopwords removed for unigram counts).
- Note: divide each count by the total number of words in the essay.
- Topic: scores for the 20 highest-scoring topics.
- Complexity: average number of characters per word, number of unique words divided by total number of words, and average number of words per sentence.
- Preprocessing: before extracting features, stem all words and replace any that appear only once with the token . This preprocessing step has been implemented for you.
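As one concrete illustration, here is a stdlib-only sketch of the lexical n-gram features (counts of the top-k n-grams, each divided by the essay's word count, per the note above) and the three complexity features. The whitespace tokenizer, the period-based sentence splitter, and the function names are simplifying assumptions, not the signatures feature_extractor.py actually uses.

```python
from collections import Counter

def ngrams(tokens, n):
    # Ordered n-grams as tuples, e.g. n=2 gives word pairs.
    return list(zip(*(tokens[i:] for i in range(n))))

def lexical_features(essays, n, k):
    """Counts of the k most common n-grams across the corpus, normalized per essay."""
    tokenized = [e.lower().split() for e in essays]   # naive whitespace tokenizer
    corpus_counts = Counter(g for toks in tokenized for g in ngrams(toks, n))
    top = [g for g, _ in corpus_counts.most_common(k)]
    feats = []
    for toks in tokenized:
        counts = Counter(ngrams(toks, n))
        # Divide each count by the essay's total word count.
        feats.append({g: counts[g] / len(toks) for g in top})
    return feats

def complexity_features(essay):
    words = essay.split()
    sentences = [s for s in essay.split(".") if s.strip()]  # crude sentence split
    return {
        "chars_per_word": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
        "words_per_sentence": len(words) / len(sentences),
    }

demo = ["the cat sat on the mat", "the dog sat on the rug"]
print(lexical_features(demo, 2, 3))
print(complexity_features("The cat sat. The dog ran."))
```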
Then, using the bawe_gender_classifier.py script, measure the gender detection accuracy with each individual feature set, as well as with all the features combined. Fill out the accuracy table and discussion question in the pull request template (see below).
See http://www.nyu.edu/projects/politicsdatalab/localdata/workshops/NLTK_presentation%20_code.py for an excellent, focused NLTK tutorial.
We will use the fork and pull request workflow for submissions in this class. Follow the link for a guide to this workflow. The idea is that you will fork this repository, make the required changes in your own copy, and then submit a pull request that @mentions me (@rivlev). The pull request and @mention are your submission for this assignment. I will look at your feature_extractor.py and the body of the pull request.
The setup for this class is new, so please open an issue if you have any questions or run into bugs.