
Dataset

For the NER experiment, I used the CoNLL 2003 English dataset. The CoNLL 2003 data includes 1,393 English and 909 German news articles. Entities are annotated as LOC (location), ORG (organisation), PER (person) and MISC (miscellaneous). Below is an example sentence, where each line consists of [word] [POS tag] [chunk tag] [NER tag]:

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
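A minimal sketch of reading this format, assuming one whitespace-separated token per line. The `Token` type and the `parse_conll_line` helper are hypothetical names used for illustration, not taken from the repository.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    chunk: str
    ner: str


def parse_conll_line(line: str) -> Token:
    """Split one '[word] [POS tag] [chunk tag] [NER tag]' line into its fields."""
    word, pos, chunk, ner = line.split()
    return Token(word, pos, chunk, ner)


sentence = """U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC"""

tokens = [parse_conll_line(line) for line in sentence.splitlines()]

# Keep only the words that carry an entity tag ("O" means "outside any entity").
entities = [(t.word, t.ner) for t in tokens if t.ner != "O"]
print(entities)
```

Here three of the six tokens carry an entity tag; the rest are tagged O.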

Preprocessed Data Shapes:

  • X_Train: (900, 204566)
  • X_val: (900, 46665)
  • X_test: (900, 51577)
  • Y_train: (10, 204566)
  • Y_val: (10, 46665)
  • Y_test: (10, 51577)

Each word is mapped to a pre-trained feature vector of size 300, so the feature size of a window of 3 words is 300 × 3 = 900.
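The window construction can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the window is assumed to be centred on each word, and zero vectors are assumed for sentence padding and unknown words. `window_features` is a hypothetical helper name.

```python
EMBED_DIM = 300   # size of each pre-trained word vector (from the text)
WINDOW = 3        # words per window (from the text)


def window_features(words, embeddings, dim=EMBED_DIM, window=WINDOW):
    """Return one concatenated feature vector per word, centred on that word."""
    pad = [0.0] * dim                      # assumption: zeros for padding / unknown words
    half = window // 2
    vectors = [pad] * half + [embeddings.get(w, pad) for w in words] + [pad] * half
    features = []
    for i in range(len(words)):
        feat = []
        for vec in vectors[i:i + window]:  # left neighbour, centre word, right neighbour
            feat.extend(vec)
        features.append(feat)
    return features


# Toy embeddings (hypothetical values) just to check the shapes.
toy = {"Ekeus": [0.1] * EMBED_DIM, "heads": [0.2] * EMBED_DIM}
feats = window_features(["Ekeus", "heads", "for"], toy)
print(len(feats), len(feats[0]))
```

Each word yields one vector of length 900. Stacking these vectors as columns gives the (900, number of windows) layout listed in the data shapes.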

NETWORK DETAILS

I experimented with several architectures. The following parts are common to all of them:

The input to the network is the pre-trained features for each window. The input shape is (900, 204566), where 900 is the feature size of each window and 204566 is the total number of windows. The hidden layers vary between architectures and are described below. The final layer is of size 10, one unit per NER tag. The output of this final layer is treated as per-tag probabilities and scored with a cross-entropy loss. Other losses, such as log likelihood and max margin, were also tried.
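One common way to turn the 10 final-layer outputs into probabilities and score them is a softmax followed by cross entropy. The sketch below assumes that pairing for illustration; the text does not state how the outputs are normalised, and the score values are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log of the probability assigned to the true tag index."""
    return -math.log(probs[target] + 1e-12)  # epsilon guards against log(0)


scores = [0.0] * 10      # hypothetical final-layer outputs for one window
scores[4] = 2.0          # the network favours tag index 4
probs = softmax(scores)
loss = cross_entropy(probs, target=4)
print(round(sum(probs), 6), probs.index(max(probs)), loss)
```

The loss is small when the true tag receives most of the probability mass and grows as that probability shrinks.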

Changing the architecture is easy. It is specified as a list of layers:

    nn_architecture = [
        {"layer_size": 900, "activation": "none"},
        {"layer_size": 300, "activation": "relu"},
        {"layer_size": 100, "activation": "relu"},
        {"layer_size": 10, "activation": "sigmoid"},
    ]
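To show what such a specification implies, here is a small sketch that derives the weight and bias shapes for each pair of consecutive layers. It assumes weights of shape (outputs, inputs), so that they can multiply an input of shape (900, number of windows). `layer_shapes` is a hypothetical helper, not a function from the repository.

```python
nn_architecture = [
    {"layer_size": 900, "activation": "none"},
    {"layer_size": 300, "activation": "relu"},
    {"layer_size": 100, "activation": "relu"},
    {"layer_size": 10, "activation": "sigmoid"},
]


def layer_shapes(architecture):
    """Return (weight shape, bias shape, activation) for each pair of consecutive layers."""
    shapes = []
    for prev, curr in zip(architecture, architecture[1:]):
        n_in, n_out = prev["layer_size"], curr["layer_size"]
        shapes.append(((n_out, n_in), (n_out, 1), curr["activation"]))
    return shapes


shapes = layer_shapes(nn_architecture)
for w_shape, b_shape, activation in shapes:
    print(w_shape, b_shape, activation)

# Total trainable parameters: every weight entry plus one bias per output unit.
n_params = sum(w[0] * w[1] + b[0] for w, b, _ in shapes)
print(n_params)
```

Adding or removing a dictionary in the list changes the depth of the network, and editing `layer_size` changes its width, without touching the rest of the code.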

Different activations (sigmoid, tanh, ReLU and leaky ReLU) were tried for the hidden layers.
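For reference, scalar versions of these activations can be sketched as below. The leaky ReLU slope `alpha=0.01` is an assumed value, since the text does not give one, and the `ACTIVATIONS` lookup table is a hypothetical way of mapping the names in the architecture list to functions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):   # alpha is an assumed slope, not from the source
    return x if x > 0 else alpha * x


# Hypothetical lookup from activation name to function.
ACTIVATIONS = {"sigmoid": sigmoid, "tanh": tanh, "relu": relu, "leaky_relu": leaky_relu}

for name, fn in ACTIVATIONS.items():
    print(name, [round(fn(x), 3) for x in (-2.0, 0.0, 2.0)])
```

ReLU zeroes out negative inputs, while leaky ReLU keeps a small negative slope so that the gradient does not vanish entirely for them.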

The dataset has a class imbalance issue. There are various methods for tackling it:

  • Duplicating the infrequent classes: does not provide any new information to the model.
  • Downsampling the most frequent classes: discards a lot of data.
  • Focal loss: a good way of dealing with class imbalance. It down-weights easy, well-classified samples, so hard (often infrequent) samples contribute more to the loss and the model learns to focus on them too.
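As an illustration of the last option, focal loss for a single sample can be sketched as below. It uses the standard form -alpha * (1 - p)^gamma * log(p), where p is the probability assigned to the true class. The defaults `gamma=2.0` and `alpha=1.0` are assumptions for the example, not values from this project.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample, given the probability assigned to its true class.

    With gamma = 0 and alpha = 1 this reduces to ordinary cross entropy.
    """
    p = min(max(p_true, 1e-12), 1.0)         # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


easy = focal_loss(0.9)   # confidently correct sample
hard = focal_loss(0.1)   # badly misclassified sample
print(easy, hard, easy < hard)
```

The factor (1 - p)^gamma shrinks the loss of samples the model already classifies well, so the gradient is dominated by the hard ones.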

I used the Synthetic Minority Oversampling Technique (SMOTE) [1]. SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbours. It then chooses one of those k neighbours, b, at random and connects a and b to form a line segment in the feature space. The synthetic instance is generated as a convex combination of a and b, that is, a point on that segment.
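The interpolation step can be sketched in plain Python as follows. This is a simplified illustration of the idea, not the implementation used in the notebooks: `smote_sample` is a hypothetical helper, Euclidean distance is assumed, and the toy points are made up.

```python
import random


def smote_sample(minority, k=5, rng=random):
    """Create one synthetic point from a list of minority-class feature vectors."""
    a = rng.choice(minority)
    # Sort the other minority points by squared Euclidean distance to a.
    others = [x for x in minority if x is not a]
    others.sort(key=lambda x: sum((xi - ai) ** 2 for xi, ai in zip(x, a)))
    b = rng.choice(others[:k])               # one of the k nearest neighbours
    lam = rng.random()                       # position along the segment from a to b
    return [ai + lam * (bi - ai) for ai, bi in zip(a, b)]


rng = random.Random(0)                       # fixed seed so the run is repeatable
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Because each synthetic point lies between two existing minority points, it stays inside the region the minority class already occupies. The sketch fails if the class has only one sample, since there is then no neighbour to interpolate with.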

This process is highly memory intensive. It also requires a minimum number of samples of each class to be present for the interpolation to work. Hence I divided the data into batches of 10000 windows and applied SMOTE to each batch.
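The batching step can be sketched as below, together with a check for classes that have too few samples in a batch to supply k neighbours. Both helpers are hypothetical, and the toy data stores one window per row for simplicity, whereas the arrays described earlier store one window per column.

```python
from collections import Counter


def iter_batches(features, labels, batch_size=10000):
    """Yield consecutive (features, labels) slices of at most batch_size windows."""
    for start in range(0, len(labels), batch_size):
        yield features[start:start + batch_size], labels[start:start + batch_size]


def classes_too_small(labels, k=5):
    """Classes with at most k samples cannot supply k distinct neighbours."""
    counts = Counter(labels)
    return sorted(c for c, n in counts.items() if n <= k)


# Toy data: 25 windows; label 2 appears only once in each of the first two batches.
labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 2] * 2 + [1] * 5
features = [[float(i)] for i in range(len(labels))]

batches = list(iter_batches(features, labels, batch_size=10))
for X_batch, y_batch in batches:
    print(len(y_batch), classes_too_small(y_batch, k=1))
```

A batch that flags a class this way needs special handling, for example a smaller k or leaving that class unresampled in that batch; which option fits best depends on the data.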

Before applying SMOTE to a 10000 batch:

Counter({1: 843, 4: 41, 0: 23, 8: 22, 3: 19, 5: 18, 7: 16, 9: 12, 6: 5, 2: 1})

After applying SMOTE to a 10000 batch:

Counter({3: 7585, 1: 7585, 8: 7585, 0: 7585, 7: 7585, 4: 7585, 5: 7585, 9: 7585, 6: 7585, 2: 7585})

Notice that the classes with very few samples (6, 2, 9, etc.) are brought up to the same count as every other class after applying SMOTE.

Coding File Details:

  • CORNLL.ipynb: Preprocess the data and extract features
  • NER_NN.ipynb: Neural Network Implementation
  • NER_NN_balanced.ipynb: Neural Network Implementation with SMOTE
  • Other .py files: Supporting code

References:

[1] Chawla, Nitesh V., et al. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research 16 (2002): 321-357.
