
Introduction

This repository contains research code for the ACL 2021 paper "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models". Feel free to use this code to re-run our experiments or run new experiments on your own data.

Setup

General  
  1. Clone this repo
git clone git@github.com:Adapter-Hub/hgiyt.git
  2. Install PyTorch (we used v1.7.1 - code may not work as expected for older or newer versions) in a new Python (>=3.6) virtual environment
pip install torch===1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
  3. Initialize the submodules
git submodule update --init --recursive
  4. Install the adapter-transformers library and its dependencies
pip install lib/adapter-transformers
pip install -r requirements.txt
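
To confirm the environment is set up correctly, a quick check like the following can help (a minimal sketch of our own; it only assumes that adapter-transformers installs under the usual transformers package name):

# Minimal environment check (run inside the virtual environment).
import torch
import transformers  # adapter-transformers is installed under this package name

print("PyTorch:", torch.__version__)             # expected: 1.7.1+cu110
print("CUDA available:", torch.cuda.is_available())
print("Transformers:", transformers.__version__)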
Pretraining  
  1. Install Nvidia Apex for automatic mixed-precision (amp / fp16) training
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  2. Install wiki-bert-pipeline dependencies
pip install -r lib/wiki-bert-pipeline/requirements.txt
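
For reference, this is roughly how the Apex amp API is used for mixed-precision training (a minimal standalone sketch with a placeholder model, optimizer, and batch - not our pretraining script):

# Hypothetical fp16 training step with Apex amp; model, optimizer, and batch are placeholders.
import torch
from apex import amp

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Wrap model and optimizer for automatic mixed precision ("O1" = mixed precision).
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(8, 10).cuda()
loss = model(inputs).sum()

# Scale the loss so fp16 gradients do not underflow, then backpropagate.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()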
Language-specific prerequisites  

To use the Japanese monolingual model, install the morphological parser MeCab with the mecab-ipadic-20070801 dictionary:

  1. Install gdown for easy downloads from Google Drive
pip install gdown
  2. Download and install MeCab
gdown https://drive.google.com/uc?id=0B4y35FiV1wh7cENtOXlicTFaRUE
tar -xvzf mecab-0.996.tar.gz
cd mecab-0.996
./configure --with-charset=utf8
make
make check
sudo make install
  3. Download and install the mecab-ipadic-20070801 dictionary
gdown https://drive.google.com/uc?id=0B4y35FiV1wh7MWVlSDBCSXZMTXM
tar -xvzf mecab-ipadic-2.7.0-20070801.tar.gz
cd mecab-ipadic-2.7.0-20070801
./configure --with-charset=utf8
make
sudo make install
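
To verify the installation, you can feed a short Japanese string through the mecab binary (a small sanity check of our own; it only assumes mecab is on your PATH after make install):

# MeCab sanity check: prints one morpheme per line, terminated by "EOS".
import subprocess

result = subprocess.run(
    ["mecab"],
    input="吾輩は猫である。",          # any short Japanese sentence will do
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
print(result.stdout)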

Data

We unfortunately cannot host the datasets used in our paper in this repo. However, we provide download links (wherever possible) and instructions or scripts to preprocess the data for finetuning and for pretraining.

Experiments

Our scripts are largely borrowed from the transformers and adapter-transformers libraries. For pretrained models and adapters we rely on the ModelHub and AdapterHub. However, even if you haven't used them before, running our scripts should be pretty straightforward :).

We provide instructions on how to execute our finetuning scripts here and our pretraining script here.

Models

Our pretrained models are also available in the ModelHub: https://huggingface.co/hgiyt. Feel free to finetune them with our scripts or use them in your own code.
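
For example, one of the monolingual models can be loaded with the standard transformers API (a sketch; the model identifier below is a placeholder - substitute any model id listed under https://huggingface.co/hgiyt):

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "hgiyt/<model-id>"   # placeholder: pick a model from the ModelHub page above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

inputs = tokenizer("Example sentence in the model's language.", return_tensors="pt")
outputs = model(**inputs)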

Citation & Authors

@inproceedings{rust-etal-2021-good,
      title     = {How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models}, 
      author    = {Phillip Rust and Jonas Pfeiffer and Ivan Vuli{\'c} and Sebastian Ruder and Iryna Gurevych},
      year      = {2021},
      booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational
                  Linguistics, {ACL} 2021, Online, August 1-6, 2021},
      url       = {https://arxiv.org/abs/2012.15613},
      pages     = {3118--3135}
}

Contact Person: Phillip Rust, [email protected]

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.


hgiyt's Issues

Calculating the same fertility twice in plot_fertility()

Thanks for sharing your code!
I'm trying to test my tokenizers with your code.

In tokenizer_exploration_utils.py, it seems like you're calculating the same fertility twice in plot_fertility(). I can find the same pattern in plot_proportion_continuation() and plot_proportion_unks().

In plot_fertility(), lines 332-340 and 342-353 appear to do the same calculation: collecting language, fertility, and model into a dataframe.

Also, the function only returns the second dataframe. Is there any difference between the two parts?
Thanks :)

import numpy as np
import pandas as pd
import seaborn as sns


def plot_fertility(language_ud_dicts):
    sns.set(style="whitegrid")

    width = 512.14963
    sns.set(
        rc={
            "axes.spines.bottom": True,
            "axes.spines.left": True,
            "axes.spines.right": False,
            "axes.spines.top": False,
            "font.size": 12,
            "axes.labelsize": 12,
            "axes.grid": False,
            "legend.fontsize": 10,
            "ytick.left": True,
            "xtick.major.size": 8,
            "ytick.major.size": 8,
            "pgf.texsystem": "lualatex",
            "text.latex.preamble": r"\usepackage{xcolor}",
            "text.usetex": True,
        },
        style="whitegrid",
    )

    colors = ["indianred", "skyblue", "dodgerblue", "royalblue", "navy"]
    sns.set_palette(sns.color_palette(colors))
    sns.set_context("notebook")  # use notebook or talk

    titles = ["Mono", "mBERT"]
    for i, language_ud_dict in enumerate(language_ud_dicts):

        languages = []
        values = []
        for k, v in language_ud_dict.items():
            languages.append(r"\textsc{%s}" % k)
            values.append(np.mean(v["split_lengths"]))
        d = {"languages": languages, "fertility": values}
        df = pd.DataFrame(data=d).sort_values(ascending=True, by="fertility")

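    # Second loop: recomputes the same fertility values, but collects all models
    # into a single combined dataframe, which is what actually gets plotted and returned.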
    d = {"Language": [], "Fertility": [], "Model": []}
    for i, language_ud_dict in enumerate(language_ud_dicts):

        languages = []
        values = []
        for k, v in language_ud_dict.items():
            languages.append(r"\textsc{%s}" % k)
            values.append(np.mean(v["split_lengths"]))
        d["Language"] += languages
        d["Fertility"] += values
        d["Model"] += [titles[i] for _ in values]
    df = pd.DataFrame(data=d).sort_values(ascending=True, by="Language")

    ax2 = sns.catplot(
        kind="bar", x="Language", y="Fertility", hue="Model", data=df, legend=False, height=5, aspect=2.1
    )

    ax2.set_xlabels("")
    ax2.set_ylabels(fontsize=30)
    ax2.set_xticklabels(fontsize=30)
    ax2.set(yticks=[0.0, 0.5, 1.0, 1.5, 2.0])
    ax2.set_yticklabels([0.0, 0.5, 1.0, 1.5, 2.0], fontsize=28)

    ax2.savefig("fertility.pdf", bbox_inches="tight")

    return df
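
For context, each entry in split_lengths is presumably the number of subword pieces a word is split into, so the fertility is the mean number of subwords per word. A minimal sketch of how such values can be computed with a Hugging Face tokenizer (my own illustration, not the repo's exploration code):

from transformers import AutoTokenizer

def fertility(tokenizer, words):
    # Number of subword pieces each word is split into; the mean is the fertility.
    split_lengths = [len(tokenizer.tokenize(w)) for w in words]
    return sum(split_lengths) / len(split_lengths)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(fertility(tokenizer, ["unbelievable", "tokenization", "cat"]))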

"--with-charset=utf8" option is needed for the Mecab install

Thanks for releasing the code.
I am testing your code for Japanese.

In the "Language-specific prerequisites" section of "Setup" in README.md, I think the --with-charset=utf8 option is needed for ./configure in both "install MeCab" and "install the mecab-ipadic-20070801 dictionary", because the default encoding is euc-jp.

Thanks in advance.

Character-tokenized vs subword-tokenized in Japanese

In Section A.1, the authors say that "we select the character-tokenized Japanese BERT model because it achieved considerably higher scores on preliminary NER fine-tuning evaluations," but in my experience, a subword-tokenized model is consistently better than a character-based one.

I could reproduce the above result for NER, but the Japanese portion of WikiAnn is character-based, and when we use a subword-tokenized model, the dataset has to be converted to word-based as follows:

Character-based (original WikiAnn):

ja:高 B-LOC
ja:島 I-LOC
ja:市 I-LOC
ja:周 O
ja:辺 O

Word-based (after conversion):

ja:高島 B-LOC
ja:市 I-LOC
ja:周辺 O

(I will perform this conversion, and test the subword-tokenized model later.)
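
A possible way to do this conversion (a rough sketch of what I mean; it assumes the word boundaries come from an external segmenter such as MeCab, and the merged word keeps the tag of its first character):

def chars_to_words(char_tags, words):
    """Merge character-level (char, tag) pairs into word-level pairs.

    char_tags: list of (character, BIO tag) pairs from WikiAnn
    words:     word segmentation of the same text, e.g. from MeCab
    """
    merged, i = [], 0
    for word in words:
        merged.append((word, char_tags[i][1]))  # keep the tag of the word's first character
        i += len(word)
    return merged

char_tags = [("高", "B-LOC"), ("島", "I-LOC"), ("市", "I-LOC"), ("周", "O"), ("辺", "O")]
print(chars_to_words(char_tags, ["高島", "市", "周辺"]))
# [('高島', 'B-LOC'), ('市', 'I-LOC'), ('周辺', 'O')]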

All the other datasets are word-based. I have tested the character-tokenized model cl-tohoku/bert-base-japanese-char, which is used in the paper, and the subword-tokenized model cl-tohoku/bert-base-japanese (with only one seed, seed = 1). We can see that the subword-tokenized model is consistently better than the character-based model.

                              SA      UDP (UAS/LAS)    POS
Monolingual (paper)           88.0    94.7 / 93.0      98.1
Character-tokenized (mine)    88.4    94.8 / 93.1      98.1
Subword-tokenized (mine)      91.1    95.0 / 93.4      98.2

It would be great if you could confirm this result.

Version of UD-Treebanks used for Tokenizer Experiments

Hello,

I was trying to reproduce some of the experiments in the paper (related to the tokenizer metrics), and I am getting slightly different values for the fertility and continuation metrics. I was wondering whether using a different version of the UD treebanks (I am using v2.8) might be causing this discrepancy. It would be great if you could tell me which version was used for the experiments in the paper.

Thanks
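
For reference, this is roughly how I compute the two metrics (a sketch; the treebank path below is a placeholder, and I assume WordPiece-style "##" continuation markers):

from transformers import AutoTokenizer

def tokenizer_metrics(tokenizer, words):
    pieces_per_word = [tokenizer.tokenize(w) for w in words]
    all_pieces = [p for pieces in pieces_per_word for p in pieces]
    fertility = len(all_pieces) / len(words)                              # mean subwords per word
    continuation = sum(p.startswith("##") for p in all_pieces) / len(all_pieces)
    return fertility, continuation

# Read the word forms from a UD .conllu file (path is a placeholder).
words = []
with open("path/to/treebank.conllu", encoding="utf-8") as f:
    for line in f:
        if line.strip() and not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():        # skip multi-word token ranges like "1-2"
                words.append(cols[1])

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer_metrics(tok, words))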
