
kcbert's Introduction

KcBERT: Korean comments BERT

Most publicly released Korean BERT models are trained on well-curated data such as Korean Wikipedia, news articles, and books. Comment-style datasets such as NSMC, by contrast, are uncurated: they are colloquial, full of newly coined words, and frequently contain typos and other expressions that do not appear in formal writing.

KcBERT is a pretrained BERT model built for datasets with these characteristics: comments and replies were collected from Naver News, and both the tokenizer and the BERT model were trained from scratch.

KcBERT can be loaded easily through Huggingface's Transformers library. (No separate file download is required.)

How to use

Requirements

  • pytorch ~= 1.5.1

  • transformers ~= 3.0.1

  • emoji ~= 0.6.0

  • soynlp ~= 0.0.493

from transformers import AutoTokenizer, AutoModelWithLMHead

# Load one of the two checkpoints below; running both keeps only the second.

# Base Model (108M)
tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")
model = AutoModelWithLMHead.from_pretrained("beomi/kcbert-base")

# Large Model (334M)
tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-large")
model = AutoModelWithLMHead.from_pretrained("beomi/kcbert-large")

Train Data & Preprocessing

Raw Data

The training data consists of all comments and replies collected from heavily commented news articles written between 2019.01.01 and 2020.06.15.

With only the text extracted, the data is about 15.4GB and contains more than 110 million sentences.

Preprocessing

The preprocessing applied for PLM training was as follows.

  1. Hangul, English, special characters, and even emoji (🥳)!

    Using a regular expression, Hangul, English, and special characters, as well as emoji, were kept as training targets.

    The Hangul range was specified as ㄱ-ㅎ가-힣, which excludes the Chinese characters (Hanja) that fall inside ㄱ-힣.

  2. Collapsing repeated strings within comments

    Repeated characters such as ㅋㅋㅋㅋㅋ were merged into a form like ㅋㅋ.

  3. Cased Model

    KcBERT is a cased model that preserves upper and lower case for English text.

  4. Removing texts of 10 characters or fewer

    Texts shorter than 10 characters often consist of a single word, so they were excluded.

  5. Deduplication

    To remove comments that were posted repeatedly, duplicate comments were merged into one.

The final training data produced this way is 12.5GB, about 89 million sentences.
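Steps 2, 4 and 5 above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the pipeline that produced the training data: `collapse_repeats` and `preprocess` are hypothetical helpers, the regex is a simple stand-in for a proper repeat normalizer, and the length threshold follows the "shorter than 10 characters" wording.

```python
import re

# Step 2 stand-in: any character repeated 3+ times is cut down to `keep` copies.
# Assumption: a plain regex approximation, not the normalizer used for KcBERT.
_REPEAT = re.compile(r'(.)\1{2,}')


def collapse_repeats(text: str, keep: int = 2) -> str:
    return _REPEAT.sub(lambda m: m.group(1) * keep, text)


def preprocess(comments, min_chars: int = 10):
    """Apply steps 2, 4 and 5 to an iterable of raw comment strings."""
    seen = set()
    result = []
    for raw in comments:
        text = collapse_repeats(raw.strip())
        if len(text) < min_chars:  # step 4: drop very short texts
            continue
        if text in seen:           # step 5: keep a single copy of duplicates
            continue
        seen.add(text)
        result.append(text)
    return result


sample = [
    "이 영화 진짜 재밌다ㅋㅋㅋㅋㅋ",
    "이 영화 진짜 재밌다ㅋㅋㅋ",          # same comment once repeats are collapsed
    "ㅋㅋㅋㅋ",                          # too short after collapsing
    "Great movie!!!!! Would watch again",
]
cleaned = preprocess(sample)
print(cleaned)
```

Collapsing repeats before deduplication makes the first two comments identical, so only one of them survives.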

Installing the packages with the pip command below and cleaning your text with the clean function below improves performance on downstream tasks (fewer [UNK] tokens).

pip install soynlp emoji

Apply the clean function below to your text data.

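import re
import emoji
from soynlp.normalizer import repeat_normalize

# All emoji characters known to the emoji package (flat table in emoji ~= 0.6.0).
emojis = ''.join(emoji.UNICODE_EMOJI.keys())
# Anything outside this whitelist (punctuation, ASCII, Hangul, emoji) becomes a space.
pattern = re.compile(f'[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-힣{emojis}]+')
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')

def clean(x):
    x = pattern.sub(' ', x)      # drop characters outside the whitelist
    x = url_pattern.sub('', x)   # strip URLs
    x = x.strip()
    x = repeat_normalize(x, num_repeats=2)  # collapse repeats, e.g. ㅋㅋㅋㅋㅋ -> ㅋㅋ
    return x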
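The character ranges are easy to misread, so here is a small check of what each one covers. As written above, the clean pattern appears to use the wide range ㄱ-힣, whereas the preprocessing notes describe the narrower ㄱ-ㅎ가-힣. The sketch below only compares the two ranges; the names `wide`, `narrow` and `matches` are hypothetical, and which range suits your data is a judgment call.

```python
import re

# U+3131 (ㄱ) .. U+D7A3 (힣): this span also contains the CJK ideograph block.
wide = re.compile('[ㄱ-힣]')
# Consonant jamo (ㄱ-ㅎ) plus precomposed Hangul syllables (가-힣) only.
narrow = re.compile('[ㄱ-ㅎ가-힣]')


def matches(pattern, ch):
    """Return True if the single character `ch` falls inside the range."""
    return pattern.fullmatch(ch) is not None


# A syllable, a consonant jamo, a vowel jamo, and a Chinese character.
for ch in ['한', 'ㅋ', 'ㅠ', '漢']:
    print(ch, hex(ord(ch)), 'wide:', matches(wide, ch), 'narrow:', matches(narrow, ch))
```

With the narrow range, Chinese characters are filtered out, but so are standalone vowel jamo such as ㅠ, which are common in comments.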

Tokenizer Train

The tokenizer was trained with Huggingface's Tokenizers library.

Specifically, BertWordPieceTokenizer was used, with a vocab size of 30000.

The tokenizer was trained on a 1/10 sample of the data; to sample more evenly, the sample was stratified by date.
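The date-stratified 1/10 sampling can be sketched as follows. This is a minimal illustration, not the script used for KcBERT: the `(date, text)` record layout, the `stratified_sample` helper, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(records, fraction=0.1, seed=42):
    """Sample roughly `fraction` of the texts from every date bucket.

    `records` is an iterable of (date, text) pairs (hypothetical layout).
    """
    rng = random.Random(seed)
    by_date = defaultdict(list)
    for date, text in records:
        by_date[date].append(text)

    sampled = []
    for date in sorted(by_date):
        texts = by_date[date]
        k = max(1, round(len(texts) * fraction))  # keep at least one text per day
        sampled.extend(rng.sample(texts, k))
    return sampled


# Toy corpus: three days with 50 comments each.
records = [(f"2020-06-{day:02d}", f"comment {day}-{i}")
           for day in (1, 2, 3) for i in range(50)]
subset = stratified_sample(records)
print(len(subset))  # 15: five comments from each of the three days
```

Sampling per day rather than over the whole corpus keeps quiet days represented instead of letting high-traffic days dominate the tokenizer's training text.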

BERT Model Pretrain

  • KcBERT Base config
{
    "max_position_embeddings": 300,
    "hidden_dropout_prob": 0.1,
    "pooler_size_per_head": 128,
    "hidden_act": "gelu",
    "initializer_range": 0.02,
    "num_hidden_layers": 12,
    "pooler_num_attention_heads": 12,
    "type_vocab_size": 2,
    "vocab_size": 30000,
    "hidden_size": 768,
    "attention_probs_dropout_prob": 0.1,
    "directionality": "bidi",
    "num_attention_heads": 12,
    "pooler_fc_size": 768,
    "pooler_type": "first_token_transform",
    "pooler_num_fc_layers": 3,
    "intermediate_size": 3072,
    "architectures": [
        "BertForMaskedLM"
    ],
    "model_type": "bert"
}
  • KcBERT Large config
{
    "type_vocab_size": 2,
    "initializer_range": 0.02,
    "max_position_embeddings": 300,
    "vocab_size": 30000,
    "hidden_size": 1024,
    "hidden_dropout_prob": 0.1,
    "model_type": "bert",
    "directionality": "bidi",
    "pooler_num_attention_heads": 12,
    "pooler_fc_size": 768,
    "pad_token_id": 0,
    "pooler_type": "first_token_transform",
    "layer_norm_eps": 1e-12,
    "hidden_act": "gelu",
    "num_hidden_layers": 24,
    "pooler_num_fc_layers": 3,
    "num_attention_heads": 16,
    "pooler_size_per_head": 128,
    "attention_probs_dropout_prob": 0.1,
    "intermediate_size": 4096,
    "architectures": [
        "BertForMaskedLM"
    ]
}

The BERT model configs use the default Base and Large settings as they are (MLM 15%, etc.).

Training used a TPU v3-8 for 3 days (Base) and N days (Large; training still in progress), and the checkpoints currently published on Huggingface were trained for 1M (one million) steps.

The training loss drops fastest during the first 200k steps and then decreases only gradually after 400k steps.
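The model sizes quoted in How to use (108M and 334M) can be roughly reproduced from the two configs. The sketch below assumes the standard BERT encoder layout (embeddings, Transformer layers, pooler) and leaves out the masked-LM head; `bert_param_count` is a hypothetical helper, not part of any library.

```python
def bert_param_count(cfg):
    """Count weights and biases of a standard BERT encoder plus pooler."""
    h = cfg["hidden_size"]
    i = cfg["intermediate_size"]
    layer_norm = 2 * h  # scale and bias

    # Token, position and segment embeddings, followed by a LayerNorm.
    embeddings = (cfg["vocab_size"]
                  + cfg["max_position_embeddings"]
                  + cfg["type_vocab_size"]) * h + layer_norm
    # Query, key, value and output projections, followed by a LayerNorm.
    attention = 4 * (h * h + h) + layer_norm
    # Two dense layers (h -> i -> h), followed by a LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    pooler = h * h + h

    return embeddings + cfg["num_hidden_layers"] * (attention + feed_forward) + pooler


base = {"vocab_size": 30000, "max_position_embeddings": 300, "type_vocab_size": 2,
        "hidden_size": 768, "intermediate_size": 3072, "num_hidden_layers": 12}
large = {"vocab_size": 30000, "max_position_embeddings": 300, "type_vocab_size": 2,
         "hidden_size": 1024, "intermediate_size": 4096, "num_hidden_layers": 24}

print(f"base:  {bert_param_count(base):,}")   # 108,918,528
print(f"large: {bert_param_count(large):,}")  # 334,390,272
```

Both totals land close to the quoted sizes. Note that the number of attention heads does not affect the count, since the projections are sized by `hidden_size`.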

  • Base Model Loss

KcBERT-Base Pretraining Loss

  • Large Model Loss

KcBERT-Large Pretraining Loss

Training ran on a GCP TPU v3-8. The Base model took about 2.5 days; the Large model was trained for about 5 days, after which the checkpoint with the lowest loss was selected.

Example

HuggingFace MASK LM

You can try the model on the HuggingFace kcbert-base model page, as shown below.

The weather today is "nice" (좋네요), KcBERT-Base

Of course, you can also test it on the kcbert-large model page.


NSMC Binary Classification

A quick performance test was run by fine-tuning on the Naver movie review corpus (NSMC) dataset.

  • The code for fine-tuning the Base model can be run directly from this Colab link.
  • The code for fine-tuning the Large model can be run directly from the GPU-version Colab link and the TPU-version Colab link (coming soon, work in progress).
    • On a single P100 GPU one epoch takes 2-3 hours; on a TPU one epoch takes under 1 hour.
      • On four RTX Titan GPUs it takes about 30 minutes per epoch.
    • The Large model example code was written with pytorch-lightning.
  • KcBERT-Base model result: Val acc .8905

    KcBERT Base finetune on NSMC

  • KcBERT-Large model result: Val acc .9089


๋” ๋‹ค์–‘ํ•œ Downstream Task์— ๋Œ€ํ•ด ํ…Œ์ŠคํŠธ๋ฅผ ์ง„ํ–‰ํ•˜๊ณ  ๊ณต๊ฐœํ•  ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.

Acknowledgement

The GCP/TPU environment used to train the KcBERT models was supported by the TFRC program.

Thanks to Monologg for the generous advice during model training :)
