
Introduction

This repository contains research code for the ACL 2021 paper "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models". Feel free to use this code to re-run our experiments or run new experiments on your own data.

Setup

General  
  1. Clone this repo
git clone git@github.com:Adapter-Hub/hgiyt.git
  2. Install PyTorch (we used v1.7.1 - code may not work as expected for older or newer versions) in a new Python (>=3.6) virtual environment
pip install torch===1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
  3. Initialize the submodules
git submodule update --init --recursive
  4. Install the adapter-transformers library and its dependencies
pip install lib/adapter-transformers
pip install -r requirements.txt
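
To confirm the environment is set up correctly, a quick check like the following can help (a minimal sketch of our own; it only assumes that adapter-transformers installs under the usual transformers package name):

# Minimal environment check (run inside the virtual environment).
import torch
import transformers  # adapter-transformers is installed under this package name

print("PyTorch:", torch.__version__)             # expected: 1.7.1+cu110
print("CUDA available:", torch.cuda.is_available())
print("Transformers:", transformers.__version__)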
Pretraining  
  1. Install Nvidia Apex for automatic mixed-precision (amp / fp16) training
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  2. Install wiki-bert-pipeline dependencies
pip install -r lib/wiki-bert-pipeline/requirements.txt
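
For reference, this is roughly how the Apex amp API is used for mixed-precision training (a minimal standalone sketch with a placeholder model, optimizer, and batch - not our pretraining script):

# Hypothetical fp16 training step with Apex amp; model, optimizer, and batch are placeholders.
import torch
from apex import amp

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Wrap model and optimizer for automatic mixed precision ("O1" = mixed precision).
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(8, 10).cuda()
loss = model(inputs).sum()

# Scale the loss so fp16 gradients do not underflow, then backpropagate.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()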
Language-specific prerequisites  

To use the Japanese monolingual model, install the morphological parser MeCab with the mecab-ipadic-20070801 dictionary:

  1. Install gdown for easy downloads from Google Drive
pip install gdown
  2. Download and install MeCab
gdown https://drive.google.com/uc?id=0B4y35FiV1wh7cENtOXlicTFaRUE
tar -xvzf mecab-0.996.tar.gz
cd mecab-0.996
./configure --with-charset=utf8
make
make check
sudo make install
  3. Download and install the mecab-ipadic-20070801 dictionary
gdown https://drive.google.com/uc?id=0B4y35FiV1wh7MWVlSDBCSXZMTXM
tar -xvzf mecab-ipadic-2.7.0-20070801.tar.gz
cd mecab-ipadic-2.7.0-20070801
./configure --with-charset=utf8
make
sudo make install
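
To verify the installation, you can feed a short Japanese string through the mecab binary (a small sanity check of our own; it only assumes mecab is on your PATH after make install):

# MeCab sanity check: prints one morpheme per line, terminated by "EOS".
import subprocess

result = subprocess.run(
    ["mecab"],
    input="吾輩は猫である。",          # any short Japanese sentence will do
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
print(result.stdout)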

Data

We unfortunately cannot host the datasets used in our paper in this repo. However, we provide download links (wherever possible) and instructions or scripts to preprocess the data for finetuning and for pretraining.

Experiments

Our scripts are largely borrowed from the transformers and adapter-transformers libraries. For pretrained models and adapters we rely on the ModelHub and AdapterHub. However, even if you haven't used them before, running our scripts should be pretty straightforward :).

We provide instructions on how to execute our finetuning scripts here and our pretraining script here.

Models

Our pretrained models are also available in the ModelHub: https://huggingface.co/hgiyt. Feel free to finetune them with our scripts or use them in your own code.
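
For example, one of the monolingual models can be loaded with the standard transformers API (a sketch; the model identifier below is a placeholder - substitute any model id listed under https://huggingface.co/hgiyt):

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "hgiyt/<model-id>"   # placeholder: pick a model from the ModelHub page above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

inputs = tokenizer("Example sentence in the model's language.", return_tensors="pt")
outputs = model(**inputs)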

Citation & Authors

@inproceedings{rust-etal-2021-good,
      title     = {How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models}, 
      author    = {Phillip Rust and Jonas Pfeiffer and Ivan Vuli{\'c} and Sebastian Ruder and Iryna Gurevych},
      year      = {2021},
      booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational
                  Linguistics, {ACL} 2021, Online, August 1-6, 2021},
      url       = {https://arxiv.org/abs/2012.15613},
      pages     = {3118--3135}
}

Contact Person: Phillip Rust, [email protected]

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.


hgiyt's Issues

Calculating the same fertility twice in plot_fertility()

Thanks for sharing your code!
I'm trying to test my tokenizers with your code.

In tokenizer_exploration_utils.py, it seems like you're calculating the same fertility twice in plot_fertility(). I can find the same pattern in plot_proportion_continuation() and plot_proportion_unks().

In plot_fertility(), lines 332-340 and 342-353 appear to do the same calculation: collecting language, fertility, and model into a dataframe.

Also, the function only returns the second dataframe. Is there any difference between the two parts?
Thanks :)

import numpy as np
import pandas as pd
import seaborn as sns


def plot_fertility(language_ud_dicts):
    sns.set(style="whitegrid")

    width = 512.14963
    sns.set(
        rc={
            "axes.spines.bottom": True,
            "axes.spines.left": True,
            "axes.spines.right": False,
            "axes.spines.top": False,
            "font.size": 12,
            "axes.labelsize": 12,
            "axes.grid": False,
            "legend.fontsize": 10,
            "ytick.left": True,
            "xtick.major.size": 8,
            "ytick.major.size": 8,
            "pgf.texsystem": "lualatex",
            "text.latex.preamble": r"\usepackage{xcolor}",
            "text.usetex": True,
        },
        style="whitegrid",
    )

    colors = ["indianred", "skyblue", "dodgerblue", "royalblue", "navy"]
    sns.set_palette(sns.color_palette(colors))
    sns.set_context("notebook")  # use notebook or talk

    titles = ["Mono", "mBERT"]
    for i, language_ud_dict in enumerate(language_ud_dicts):

        languages = []
        values = []
        for k, v in language_ud_dict.items():
            languages.append(r"\textsc{%s}" % k)
            values.append(np.mean(v["split_lengths"]))
        d = {"languages": languages, "fertility": values}
        df = pd.DataFrame(data=d).sort_values(ascending=True, by="fertility")

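    # Second loop: recomputes the same fertility values, but collects all models
    # into a single combined dataframe, which is what actually gets plotted and returned.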
    d = {"Language": [], "Fertility": [], "Model": []}
    for i, language_ud_dict in enumerate(language_ud_dicts):

        languages = []
        values = []
        for k, v in language_ud_dict.items():
            languages.append(r"\textsc{%s}" % k)
            values.append(np.mean(v["split_lengths"]))
        d["Language"] += languages
        d["Fertility"] += values
        d["Model"] += [titles[i] for _ in values]
    df = pd.DataFrame(data=d).sort_values(ascending=True, by="Language")

    ax2 = sns.catplot(
        kind="bar", x="Language", y="Fertility", hue="Model", data=df, legend=False, height=5, aspect=2.1
    )

    ax2.set_xlabels("")
    ax2.set_ylabels(fontsize=30)
    ax2.set_xticklabels(fontsize=30)
    ax2.set(yticks=[0.0, 0.5, 1.0, 1.5, 2.0])
    ax2.set_yticklabels([0.0, 0.5, 1.0, 1.5, 2.0], fontsize=28)

    ax2.savefig("fertility.pdf", bbox_inches="tight")

    return df
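
For context, each entry in split_lengths is presumably the number of subword pieces a word is split into, so the fertility is the mean number of subwords per word. A minimal sketch of how such values can be computed with a Hugging Face tokenizer (my own illustration, not the repo's exploration code):

from transformers import AutoTokenizer

def fertility(tokenizer, words):
    # Number of subword pieces each word is split into; the mean is the fertility.
    split_lengths = [len(tokenizer.tokenize(w)) for w in words]
    return sum(split_lengths) / len(split_lengths)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(fertility(tokenizer, ["unbelievable", "tokenization", "cat"]))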

"--with-charset=utf8" option is needed for the Mecab install

Thanks for releasing the code.
I am testing your code for Japanese.

In the "Language-specific prerequisites" section of "Setup" in README.md, I think the --with-charset=utf8 option is needed for ./configure in both "install MeCab" and "install the mecab-ipadic-20070801 dictionary", because the default encoding is euc-jp.

Thanks in advance.

Character-tokenized vs subword-tokenized in Japanese

In Section A.1, the authors say that "we select the character-tokenized Japanese BERT model because it achieved considerably higher scores on preliminary NER fine-tuning evaluations," but in my experience, a subword-tokenized model is consistently better than a character-based one.

I could reproduce the above result for NER, but the Japanese portion of WikiAnn is character-based, and when we use a subword-tokenized model, the dataset has to be converted to word-based as follows:

Character-based (original WikiAnn):

ja:高 B-LOC
ja:島 I-LOC
ja:市 I-LOC
ja:周 O
ja:辺 O

Word-based (after conversion):

ja:高島 B-LOC
ja:市 I-LOC
ja:周辺 O

(I will perform this conversion, and test the subword-tokenized model later.)
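
A possible way to do this conversion (a rough sketch of what I mean; it assumes the word boundaries come from an external segmenter such as MeCab, and the merged word keeps the tag of its first character):

def chars_to_words(char_tags, words):
    """Merge character-level (char, tag) pairs into word-level pairs.

    char_tags: list of (character, BIO tag) pairs from WikiAnn
    words:     word segmentation of the same text, e.g. from MeCab
    """
    merged, i = [], 0
    for word in words:
        merged.append((word, char_tags[i][1]))  # keep the tag of the word's first character
        i += len(word)
    return merged

char_tags = [("高", "B-LOC"), ("島", "I-LOC"), ("市", "I-LOC"), ("周", "O"), ("辺", "O")]
print(chars_to_words(char_tags, ["高島", "市", "周辺"]))
# [('高島', 'B-LOC'), ('市', 'I-LOC'), ('周辺', 'O')]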

All the other datasets are word-based. I have tested the character-tokenized model cl-tohoku/bert-base-japanese-char, which is used in the paper, and the subword-tokenized model cl-tohoku/bert-base-japanese (with only one seed, seed = 1). We can see that the subword-tokenized model is consistently better than the character-based model.

                              SA      UDP (UAS/LAS)    POS
Monolingual (paper)           88.0    94.7 / 93.0      98.1
Character-tokenized (mine)    88.4    94.8 / 93.1      98.1
Subword-tokenized (mine)      91.1    95.0 / 93.4      98.2

It would be great if you could confirm this result.

Version of UD-Treebanks used for Tokenizer Experiments

Hello,

I was trying to reproduce some of the experiments in the paper (related to the tokenizer metrics), and I am getting slightly different values for the fertility and continuation metrics. I was wondering whether using a different version of the UD treebanks (I am using v2.8) might be causing this discrepancy. It would be great if you could tell me which version was used for the experiments in the paper.

Thanks
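
For reference, this is roughly how I compute the two metrics (a sketch; the treebank path below is a placeholder, and I assume WordPiece-style "##" continuation markers):

from transformers import AutoTokenizer

def tokenizer_metrics(tokenizer, words):
    pieces_per_word = [tokenizer.tokenize(w) for w in words]
    all_pieces = [p for pieces in pieces_per_word for p in pieces]
    fertility = len(all_pieces) / len(words)                              # mean subwords per word
    continuation = sum(p.startswith("##") for p in all_pieces) / len(all_pieces)
    return fertility, continuation

# Read the word forms from a UD .conllu file (path is a placeholder).
words = []
with open("path/to/treebank.conllu", encoding="utf-8") as f:
    for line in f:
        if line.strip() and not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():        # skip multi-word token ranges like "1-2"
                words.append(cols[1])

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer_metrics(tok, words))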
