Identify gender and language variety of English-language Twitter users. Specifically, we classify authors as male or female and by location: Australia, Canada, Great Britain, Ireland, New Zealand, and the United States.
The dataset is provided by PAN 2017.
- Get data from .xml files
- Preprocessing with NLTK
  - Lowercase the text and filter stopwords and punctuation
  - Tokenize
  - Stem words
- Visualization
  - Visualization with t-SNE (a tool to visualize high-dimensional data)
- Model training
  - Support Vector Machine with a variety of embeddings, including:
    - Bag of words from scratch
    - Tf-Idf Vectorizer
    - Word2Vec
    - Combination of Tf-Idf and Word2Vec
    - BERT
  - Neural Network with BERT and PyTorch
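
The preprocessing and bag-of-words steps listed above can be sketched in plain Python. This is a minimal illustration, not the project's code: the stopword set `STOPWORDS`, the helper names `preprocess`, `build_vocabulary`, and `vectorize`, and the sample tweets are all hypothetical, and stemming (done with NLTK in the project) is left out to keep the sketch dependency-free.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword set for illustration only;
# a real run would use a full list such as NLTK's English stopwords.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "i"}


def preprocess(text):
    """Lowercase, replace punctuation with spaces, split, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(vocab)}


def vectorize(tokens, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocabulary]


# Made-up sample tweets, not taken from the PAN 2017 data.
tweets = ["The footy is on tonight!", "Going to the hockey game, eh?"]
token_lists = [preprocess(t) for t in tweets]
vocabulary = build_vocabulary(token_lists)
vectors = [vectorize(tokens, vocabulary) for tokens in token_lists]
```

Each entry of `vectors` is a fixed-length count vector over the shared vocabulary, which is the kind of input a classifier such as scikit-learn's `SVC` expects.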
The Support Vector Machine with Tf-Idf features achieves the highest f1-score. The results are shown in the tables below:
Gender: female: 0, male: 1
. | precision | recall | f1-score | support |
---|---|---|---|---|
0 | 0.79 | 0.82 | 0.80 | 1200 |
1 | 0.81 | 0.78 | 0.79 | 1200 |
accuracy | _ | _ | 0.80 | 2400 |
macro avg | 0.80 | 0.80 | 0.80 | 2400 |
weighted avg | 0.80 | 0.80 | 0.80 | 2400 |
Language Variety: Australia: 0, Canada: 1, Great Britain: 2, Ireland: 3, New Zealand: 4, United States: 5
. | precision | recall | f1-score | support |
---|---|---|---|---|
0 | 0.85 | 0.83 | 0.84 | 400 |
1 | 0.78 | 0.85 | 0.81 | 400 |
2 | 0.85 | 0.81 | 0.83 | 400 |
3 | 0.87 | 0.85 | 0.86 | 400 |
4 | 0.91 | 0.90 | 0.90 | 400 |
5 | 0.82 | 0.82 | 0.82 | 400 |
accuracy | _ | _ | 0.84 | 2400 |
macro avg | 0.85 | 0.84 | 0.84 | 2400 |
weighted avg | 0.85 | 0.84 | 0.84 | 2400 |
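
As a quick sanity check on how the columns relate: each f1-score is the harmonic mean of precision and recall, and the macro average is the unweighted mean over classes. The sketch below recomputes these from the rounded values in the language variety table above. The helper name `f1_score` is hypothetical, and because the inputs are already rounded, the results can only be expected to agree to two decimal places.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the language variety table.
rows = {
    0: (0.85, 0.83),
    1: (0.78, 0.85),
    2: (0.85, 0.81),
    3: (0.87, 0.85),
    4: (0.91, 0.90),
    5: (0.82, 0.82),
}

# Per-class f1, rounded to two decimals like the table.
per_class_f1 = {label: round(f1_score(p, r), 2) for label, (p, r) in rows.items()}

# Macro average: plain mean of the per-class f1 values.
macro_f1 = round(sum(per_class_f1.values()) / len(per_class_f1), 2)
```

Since every class has the same support (400), the weighted average coincides with the macro average here.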