CNN for Sentence Classification

A PyTorch implementation of Convolutional Neural Networks for Sentence Classification, using Bidirectional Encoder Representations from Transformers (BERT) instead of word2vec to embed words.

Yoon Kim, Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746-1751, 2014.
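
For orientation, below is a minimal sketch of a Kim-style CNN operating on the 768-dimensional BERT token embeddings this repository uses. The kernel sizes, filter count, class count, and the KimCNN name are illustrative assumptions, not the repository's actual settings.

import torch
import torch.nn as nn

class KimCNN(nn.Module):
    def __init__(self, embedding_dim=768, num_classes=103,
                 kernel_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        # One 1-D convolution per kernel size, as in Kim (2014).
        self.convs = nn.ModuleList(
            [nn.Conv1d(embedding_dim, num_filters, k) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):
        # x: (batch, seq_len, embedding_dim) token vectors from BERT
        x = x.transpose(1, 2)  # -> (batch, embedding_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2)[0]  # max-over-time pooling
                  for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))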

Requirements

  • Python: 3.8.0 or higher
  • PyTorch: 1.6.0 or higher
  • Optuna: 2.0.0 or higher
  • Transformers (Hugging Face): 3.0.2 or higher

If you have Anaconda installed, you can create a virtual environment from env.yml.

$ conda env create -f env.yml

Datasets

To use the sample dataset, download RCV1-ids and put the folder extracted from id.zip into data/.

This is the RCV1 dataset with raw texts converted to ids by the BERT tokenizer ("raw" meaning the texts were not normalized). The maximum input length of BERT-base is 512, so only the first 512 tokens of each text are used. The dataset is split into 23,149 training samples and 781,265 test samples, following RCV1: A New Benchmark Collection for Text Categorization Research.

If you want to use the original or another dataset, convert it to ids with the BERT tokenizer and use the following format (a conversion sketch follows the example below):

- one document per line
- each line must contain the tokenized ids and the label
- the label must be represented as a one-hot vector

Example

[id1, id2, id3, ...]<TAB>[000100...]
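
As a rough illustration, the snippet below converts one document to this format with the Hugging Face tokenizer. The output path, label index, and class count are hypothetical values for illustration, not settings taken from this repository.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def to_line(text, label_index, num_classes):
    # Tokenize and truncate to BERT's 512-token limit, as the sample dataset does.
    ids = tokenizer.encode(text, truncation=True, max_length=512)
    one_hot = ["0"] * num_classes
    one_hot[label_index] = "1"
    return str(ids) + "\t[" + "".join(one_hot) + "]"

# Hypothetical output path and label values, for illustration only.
with open("data/train.txt", "w") as f:
    f.write(to_line("An example document.", label_index=3, num_classes=103) + "\n")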

You should also adjust some parameters in run.py, such as classes and embedding_dim.

BERT

Hugging Face publishes several pre-trained models, including BERT. This program uses the base-uncased model, one of those pre-trained models, so each word is embedded into a 768-dimensional vector by BERT. Fine-tuning of BERT is turned off in this program.
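
A minimal sketch of how frozen BERT embeddings could be produced for the CNN is shown below; the variable names and the dummy input ids are assumptions, not the repository's actual code.

import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
for param in bert.parameters():
    param.requires_grad = False  # fine-tuning is turned off

input_ids = torch.tensor([[101, 2023, 2003, 2019, 2742, 102]])  # ids from the BERT tokenizer
with torch.no_grad():
    outputs = bert(input_ids)
embeddings = outputs[0]  # (batch, seq_len, 768) token vectors fed to the CNN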

Evaluation Metrics

This program uses Precision@k, Micro-F1, and Macro-F1; a small sketch of these metrics follows.
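
The snippet below is an illustrative sketch, not the repository's implementation: it computes Precision@k with NumPy and Micro-/Macro-F1 with scikit-learn, using made-up labels, scores, and a 0.5 threshold.

import numpy as np
from sklearn.metrics import f1_score

def precision_at_k(y_true, y_scores, k):
    # Fraction of the k highest-scored labels per sample that are truly positive.
    topk = np.argsort(y_scores, axis=1)[:, -k:]
    return np.take_along_axis(y_true, topk, axis=1).mean()

y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])                    # made-up multi-label targets
y_scores = np.array([[0.9, 0.2, 0.4, 0.1], [0.3, 0.7, 0.6, 0.8]])  # made-up model scores
y_pred = (y_scores > 0.5).astype(int)                              # illustrative threshold

print(precision_at_k(y_true, y_scores, k=1))
print(f1_score(y_true, y_pred, average="micro"))
print(f1_score(y_true, y_pred, average="macro"))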

How to run

Normal training and testing

$ python run.py

Parameter search

$ python run.py --tuning
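
As an illustration of what the Optuna-based search might look like, here is a self-contained sketch; the search space and the dummy objective are assumptions and do not reflect run.py's actual tuning code.

import optuna

def objective(trial):
    # Hypothetical search space; run.py's actual space may differ.
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    dropout = trial.suggest_float("dropout", 0.1, 0.6)
    num_filters = trial.suggest_categorical("num_filters", [50, 100, 200])
    # The real program would train the CNN and return a validation metric;
    # a dummy score stands in here so the sketch runs on its own.
    return lr * num_filters * (1.0 - dropout)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)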

Force CPU-only execution

$ python run.py --no_cuda

Acknowledgment

This program is based on the following repositories. Many thanks to their authors for their work.

Issues

bert fine-tuned

Hi,

Thanks for sharing your code.

Are the BERT embeddings you used fine-tuned or not?

Thank you
