Introduction
Using PhoBERT with transformers
Using PhoBERT with fairseq
Notes

PhoBERT: Pre-trained language models for Vietnamese

Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese (Pho, i.e. "Phở", is a popular food in Vietnam):

The two PhoBERT versions, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on RoBERTa, which optimizes the BERT pre-training procedure for more robust performance.
PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performance on four downstream Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference.

The general architecture and experimental results of PhoBERT can be found in our paper:

@inproceedings{phobert,
title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
year      = {2020},
pages     = {1037--1042}
}

Please CITE our paper when PhoBERT is used to help produce published results or is incorporated into other software.

Using PhoBERT with `transformers`

Installation

Install transformers with pip: pip install transformers, or install transformers from source.
Note that a slow tokenizer for PhoBERT has been merged into the main transformers branch. Merging a fast tokenizer for PhoBERT is still under discussion, as mentioned in this pull request. Users who would like to use the fast tokenizer can install transformers as follows:

git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .

Install tokenizers with pip: pip3 install tokenizers
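
A quick way to confirm the installation before running the examples is to check that the packages can be imported. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and `torch` is included on the assumption that the PyTorch example in this README is the one being run.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # transformers and tokenizers come from the installation steps;
    # torch is an assumption based on the PyTorch usage example.
    missing = missing_packages(["torch", "transformers", "tokenizers"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```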

Pre-trained models

Model	#params	Arch.	Max length	Pre-training data	License
`vinai/phobert-base`	135M	base	256	20GB of Wikipedia and News texts	MIT License
`vinai/phobert-large`	370M	large	256	20GB of Wikipedia and News texts	MIT License
`vinai/phobert-base-v2`	135M	base	256	20GB of Wikipedia and News texts + 120GB of texts from OSCAR-2301	GNU Affero GPL v3
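
The maximum length of 256 in the table means that longer inputs have to be truncated or split before they reach the model. The simplest option is to let the tokenizer truncate, e.g. `tokenizer(sentence, truncation=True, max_length=256)`. When no text should be dropped, one possible approach is to split the subword ids into chunks, as in the minimal sketch below. The helper `chunk_token_ids` is hypothetical, and the default ids `0` for `<s>` and `2` for `</s>` are assumptions; in practice they are better read from `tokenizer.bos_token_id` and `tokenizer.eos_token_id`.

```python
def chunk_token_ids(token_ids, max_length=256, bos_id=0, eos_id=2):
    """Split subword ids (without special tokens) into chunks of at most max_length.

    Each chunk is wrapped as [bos_id] + piece + [eos_id], so a piece holds
    at most max_length - 2 ids. bos_id and eos_id are assumed defaults.
    """
    if max_length < 3:
        raise ValueError("max_length must leave room for at least one token")
    step = max_length - 2
    return [
        [bos_id] + token_ids[start:start + step] + [eos_id]
        for start in range(0, len(token_ids), step)
    ]
```

The input would typically come from `tokenizer.encode(sentence, add_special_tokens=False)`. Note that this sketch cuts at fixed positions, so a chunk boundary may fall in the middle of a word; splitting at sentence boundaries first is usually preferable.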

Example usage

import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
sentence = 'Chúng_tôi là những nghiên_cứu_viên .'  

input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    features = phobert(input_ids)  # Model outputs are now tuples

## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# phobert = TFAutoModel.from_pretrained("vinai/phobert-base")
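
The example above encodes a single sentence. For several sentences of different lengths, the id lists need to be padded to a common length and accompanied by an attention mask. The tokenizer can do this itself when called with `padding=True`; the plain-Python sketch below only illustrates what that padding amounts to. The helper `pad_batch` is hypothetical, and the default pad id of `1` is an assumption that is better read from `tokenizer.pad_token_id`.

```python
def pad_batch(batch_ids, pad_id=1):
    """Right-pad lists of token ids to a common length and build attention masks.

    pad_id=1 is an assumed default, not a value confirmed by this README.
    """
    longest = max((len(ids) for ids in batch_ids), default=0)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids
    ]
    return input_ids, attention_mask
```

Both results can then be wrapped with `torch.tensor(...)` and passed to the model as `phobert(input_ids, attention_mask=attention_mask)`.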

Using PhoBERT with `fairseq`

Please see details HERE!

Notes

If the input texts are raw, i.e. not word-segmented, a word segmenter must be applied to produce word-segmented texts before feeding them to PhoBERT. Because PhoBERT used the RDRSegmenter from VnCoreNLP to pre-process its pre-training data (including Vietnamese tone normalization and word and sentence segmentation), it is recommended to use the same word segmenter on raw input texts in PhoBERT-based downstream applications.

Installation

pip install py_vncorenlp

Example usage

import py_vncorenlp

# Automatically download VnCoreNLP components from the original repository
# and save them in some local machine folder
py_vncorenlp.download_model(save_dir='/absolute/path/to/vncorenlp')

# Load the word and sentence segmentation component
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir='/absolute/path/to/vncorenlp')

text = "Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."

output = rdrsegmenter.word_segment(text)

print(output)
# ['Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .', 'Bà Lan , vợ ông Chúc , cũng làm_việc tại đây .']
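
Since `word_segment` returns one string per sentence, each element can be passed to the PhoBERT tokenizer separately. The sketch below wraps that step in a small helper that also drops empty results and normalizes whitespace. Both `segment_for_phobert` and `FakeSegmenter` are hypothetical names; the fake class only stands in for the VnCoreNLP object so that the sketch runs without Java or the downloaded models.

```python
def segment_for_phobert(raw_text, segmenter):
    """Word-segment raw text and return one cleaned string per sentence.

    `segmenter` is any object exposing word_segment(text) -> list of str,
    the call used in the example above.
    """
    sentences = segmenter.word_segment(raw_text)
    # Collapse repeated whitespace and skip empty sentences.
    return [" ".join(s.split()) for s in sentences if s.strip()]


class FakeSegmenter:
    """Hypothetical stand-in that returns a fixed, pre-segmented result."""

    def word_segment(self, text):
        return ["Chúng_tôi  là những nghiên_cứu_viên . ", ""]


if __name__ == "__main__":
    for sentence in segment_for_phobert("Chúng tôi là những nghiên cứu viên.", FakeSegmenter()):
        print(sentence)
```

With the real segmenter, `rdrsegmenter` from the example above would be passed in place of `FakeSegmenter()`, and each returned sentence could then go to `tokenizer.encode`.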
