The ml-assignment2 from chickenbror

Assignment 2 from the course LT2222 in the University of Gothenburg's spring 2021 semester.

Name: Hsien-hao Liao

Part 1 preprocessing

In the preprocessing of the tokens, I originally skipped the punctuation, but it somehow removed some NEs as a result (fewer than 6922), so I only removed the non-NE tokens that are punctuation. I also decided to only skip commas, full stops, parentheses, etc, as I considered some tokens like '%' to be useful context. I originally used the default config of WordNetLemmatizer, which assumed everything was a noun, but words like 'has' became 'ha' as a result. So I used the POS tags to make it output the correct lemma forms.

Part 2 confusion matrix analysis

I took the recommendation from the Discord discussions and added 5 start/end tokens before and after each sentence so that the features length is always 10. It also makes the before/after context distinguishable, ie, the first 5 are 'before' and the last 5 are 'after'.

Part 3 vectors table

I used the tf-idf vectorizer to vectorize each 'features' list to account for each feature's weight, and then used TruncatedSVD to reduced the number of dimensions (choice of dims size noted below in Part 5).

Part 5 confusion matrix analysis

The chance of successful classification appears to be relative to the proportional size of each NE class in the data. For example, 'geo', 'org', 'gpe' and 'per' were by a big margin correctly classified. Also, the training data performed better than the test data, where NE classes that account for small proportions in the data got few to none successful classification. One thig worth noting is that the dims size in Part 3 seems to matter. Originally I reduced them to 300-D, then 1000-D, and then 3000-D. The latter took longer time to process but it gave eg 'art', 'eve' more correct classifications, in particular in the training ddata confusion matrix.

Bonus B

Since we are only considering the surrounding O-tag words as part of each NE's features, we don't need to include I-tag words, so I ignored them during the preprocessing. This leaves there only B-tag words vs O-tag words to consider, so I changed latter to a boolean False whereas B-tag will value to True.

When accounting for POS, I interleaved them with the words, ie, word1, POS1, word2, POS2... instead of concatenating word+POS because it would've significantly increased the size of unique vocab. I also made the dims reduction a parameter of the bonusb() function for easier comparison of different dimention sizes' outputs.

chickenbror / ml-assignment2 Goto Github PK

ml-assignment2's Introduction

ml-assignment2's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent