
ml-assignment2's Introduction

Assignment 2 from the course LT2222 in the University of Gothenburg's spring 2021 semester.

Name: Hsien-hao Liao

Part 1 preprocessing

When preprocessing the tokens, I originally skipped all punctuation, but this also removed some NEs (leaving fewer than 6922), so I only removed punctuation tokens that are not part of an NE. I also decided to skip only commas, full stops, parentheses, etc., as I considered some tokens like '%' to be useful context. I originally used the default configuration of WordNetLemmatizer, which assumes every word is a noun, so words like 'has' became 'ha'. I therefore passed in the POS tags so that it outputs the correct lemma forms.
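The two preprocessing decisions can be sketched as follows. This is a minimal illustration, not the assignment's actual code: the function names, the exact punctuation set, and the assumption that the data uses Penn Treebank style POS tags are all hypothetical.

```python
# Hypothetical set of punctuation to skip; '%' is deliberately not in it.
SKIPPED_PUNCTUATION = {",", ".", "(", ")", ";", ":", "?", "!"}


def keep_token(word, ne_tag):
    """Keep every NE token; drop non-NE tokens that are skipped punctuation."""
    if ne_tag != "O":
        return True
    return word not in SKIPPED_PUNCTUATION


def penn_to_wordnet(pos_tag):
    """Map a Penn Treebank style tag to a WordNet POS letter.

    Assumes Penn-style tags; anything that is not a verb, adjective or
    adverb falls back to noun, which is also the lemmatizer's default.
    """
    if pos_tag.startswith("V"):
        return "v"  # verb
    if pos_tag.startswith("J"):
        return "a"  # adjective
    if pos_tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun
```

The returned letter can be passed as the `pos` argument of NLTK's `WordNetLemmatizer().lemmatize(word, pos=...)`, so that a word such as 'has' is looked up as a verb rather than as a noun.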

Part 2 confusion matrix analysis

I took the recommendation from the Discord discussions and added 5 start tokens before and 5 end tokens after each sentence so that the length of the feature list is always 10. This also makes the before/after context distinguishable, i.e., the first 5 features are 'before' and the last 5 are 'after'.
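A minimal sketch of the padding idea, assuming the NE occupies `tokens[ne_start:ne_end]` in its sentence. The token strings `<s>` and `</s>` and the function name are hypothetical.

```python
START, END = "<s>", "</s>"  # hypothetical padding tokens
WINDOW = 5                  # context size on each side of the NE


def context_features(tokens, ne_start, ne_end):
    """Return WINDOW tokens before and WINDOW tokens after the NE span."""
    padded = [START] * WINDOW + list(tokens) + [END] * WINDOW
    # Shift the span indices to account for the padding at the front.
    start = ne_start + WINDOW
    end = ne_end + WINDOW
    before = padded[start - WINDOW:start]
    after = padded[end:end + WINDOW]
    return before + after


features = context_features(["london", "be", "big"], 0, 1)
```

Because of the padding, an NE at the very start or end of a sentence still yields exactly 10 features, with positions 0 to 4 always holding the 'before' context.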

Part 3 vectors table

I used the tf-idf vectorizer to vectorize each 'features' list to account for each feature's weight, and then used TruncatedSVD to reduce the number of dimensions (the choice of dimension size is noted below in Part 5).
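One way this pipeline could look with scikit-learn is sketched below. The function name `vectorize`, the toy feature lists, and the choice to pass pre-tokenized lists through a callable `analyzer` are assumptions for illustration, not necessarily how the assignment code does it.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def vectorize(feature_lists, dims):
    """Turn lists of feature tokens into dense vectors with `dims` columns."""
    # The features are already tokenized, so the analyzer returns them as-is.
    vectorizer = TfidfVectorizer(analyzer=lambda features: features)
    tfidf = vectorizer.fit_transform(feature_lists)  # sparse matrix
    svd = TruncatedSVD(n_components=dims, random_state=0)
    return svd.fit_transform(tfidf)


# Toy feature lists (made up for this example).
toy_features = [
    ["<s>", "visit", "in", "london"],
    ["<s>", "work", "for", "google"],
    ["</s>", "visit", "to", "paris"],
    ["</s>", "work", "at", "google"],
]
vectors = vectorize(toy_features, dims=2)
```

TruncatedSVD works directly on the sparse tf-idf matrix, which is why it is a common choice after a text vectorizer.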

Part 5 confusion matrix analysis

The chance of successful classification appears to be related to the proportion of each NE class in the data. For example, 'geo', 'org', 'gpe' and 'per' were correctly classified by a big margin. Also, the model performed better on the training data than on the test data, where NE classes that account for small proportions of the data got few or no successful classifications. One thing worth noting is that the dimension size in Part 3 seems to matter. Originally I reduced the vectors to 300-D, then 1000-D, and then 3000-D. The last took longer to process but gave e.g. 'art' and 'eve' more correct classifications, in particular in the training data confusion matrix.
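To check the relation between class size and classification success more directly, each class's share of the data can be put next to its recall. The sketch below assumes rows are true classes and columns are predicted classes. The helper name is hypothetical and the counts are made up for illustration, not results from this assignment.

```python
def per_class_summary(matrix, labels):
    """Map each label to (share of all instances, recall).

    Assumes matrix[i][j] counts instances whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in matrix)
    summary = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recall = matrix[i][i] / row_total if row_total else 0.0
        share = row_total / total if total else 0.0
        summary[label] = (share, recall)
    return summary


# Made-up counts: a large class and a small class.
toy_matrix = [
    [90, 10],  # true 'geo'
    [8, 2],    # true 'art'
]
summary = per_class_summary(toy_matrix, ["geo", "art"])
```

Running this for each dimension size would show whether the recall of small classes such as 'art' and 'eve' changes with it.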

Bonus B

Since we are only considering the surrounding O-tag words as part of each NE's features, we don't need to include I-tag words, so I ignored them during the preprocessing. This leaves only B-tag words vs O-tag words to consider, so I changed the latter to a boolean False, whereas B-tag words are given the value True.
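A minimal sketch of this relabelling, assuming tags of the form 'B-geo', 'I-geo' and 'O'. The function name is hypothetical.

```python
def binary_labels(tagged_tokens):
    """Drop I-tag tokens and map B-tags to True, O-tags to False."""
    result = []
    for word, tag in tagged_tokens:
        if tag.startswith("I-"):
            continue  # continuation of an NE: ignored entirely
        result.append((word, tag.startswith("B-")))
    return result


labelled = binary_labels([("new", "B-geo"), ("york", "I-geo"), ("is", "O")])
```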

When accounting for POS, I interleaved the tags with the words, i.e., word1, POS1, word2, POS2..., instead of concatenating word+POS, because concatenation would have significantly increased the size of the unique vocabulary. I also made the dimension reduction a parameter of the bonusb() function for easier comparison of the outputs for different dimension sizes.
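The vocabulary argument can be illustrated with a small sketch. The function name and the toy word/POS pairs are made up for this example.

```python
def interleave(words, pos_tags):
    """Return [word1, POS1, word2, POS2, ...]."""
    features = []
    for word, pos in zip(words, pos_tags):
        features.extend([word, pos])
    return features


# Toy data: three words that each occur with two different POS tags.
pairs = [("run", "NN"), ("run", "VB"), ("walk", "NN"),
         ("walk", "VB"), ("jump", "NN"), ("jump", "VB")]
words = [w for w, _ in pairs]
tags = [p for _, p in pairs]

interleaved_vocab = set(interleave(words, tags))    # words + tags
concatenated_vocab = {w + "_" + p for w, p in pairs}  # word/tag combinations
```

With interleaving, the vocabulary is at most the number of distinct words plus the number of distinct tags, whereas concatenation can produce one entry per distinct word/tag combination, which grows much faster on real data.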
