
PoLitBert - Polish RoBERTa model

A Polish RoBERTa model trained on Polish Wikipedia, Polish literature and Oscar. Our major assumption is that good-quality text will give a good model.

We believe in open science and knowledge sharing, thus we decided to share complete code, params, experiment details and tensorboards.

Experiments setup and goals

During the experiments, we want to examine:

  • the impact of different learning rate schedulers on training speed and accuracy; we tested:
    • linear schedule with warmup
    • cyclic schedules: cosine, triangular
  • the impact of training time on final accuracy
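
To make the compared schedule shapes concrete, here is a minimal sketch of the three families as plain Python functions. The function names, learning rates, warmup length and cycle length are illustrative assumptions, not the settings used in the experiments, and the cyclic variants use a fixed cycle length with no warmup or period growth.

```python
import math


def linear_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


def triangular(step, cycle_steps, base_lr, peak_lr):
    """Cyclic triangular schedule: rise linearly for half a cycle, then fall."""
    position = (step % cycle_steps) / cycle_steps  # position in the cycle, in [0, 1)
    scale = 1.0 - abs(2.0 * position - 1.0)        # goes 0 -> 1 -> 0 over one cycle
    return base_lr + (peak_lr - base_lr) * scale


def cosine_cyclic(step, cycle_steps, base_lr, peak_lr):
    """Cyclic cosine schedule: start each cycle at peak_lr and anneal to base_lr."""
    position = (step % cycle_steps) / cycle_steps
    return base_lr + 0.5 * (peak_lr - base_lr) * (1.0 + math.cos(math.pi * position))
```

Evaluating each function over the full range of steps and plotting the result is a quick way to compare the shapes before committing to a long training run.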

Data

Data processing for training

Our main assumption is that good-quality text should produce a good language model. So far the most popular Polish dataset has been the Polish Wikipedia dump; however, this text is characterized by formal language. The second source of text is the Polish part of the Oscar corpus, which is text crawled from the Polish internet. When we investigated this corpus in more detail, it turned out to contain many foreign sentences (in Russian, English, German, etc.), sentences that are too short, and ungrammatical sentences (such as word enumerations).

We prepared a few cleaning heuristics:

  • remove sentences shorter than
  • remove non-Polish sentences
  • remove ungrammatical sentences (without verbs and with too many nouns)
  • perform sentence tokenization and save each sentence on a new line, adding an empty line after each document
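
The heuristics above can be sketched as a single classification function. This is not the project's process_sentences.py, only an illustration under stated assumptions: `min_chars` and `max_nouns` are hypothetical thresholds (the actual values are not given here), and `detect_language` and `count_verbs_nouns` are caller-supplied callables standing in for a language detector and a part-of-speech tagger.

```python
def classify_sentence(sentence, min_chars, max_nouns, detect_language, count_verbs_nouns):
    """Return "valid" or the name of the first heuristic the sentence fails.

    min_chars, max_nouns: hypothetical thresholds (not taken from the project).
    detect_language(text) -> language code such as "pl" (caller-supplied).
    count_verbs_nouns(text) -> (n_verbs, n_nouns) from any POS tagger (caller-supplied).
    """
    text = sentence.strip()
    if len(text) < min_chars:
        return "invalid_length"
    if detect_language(text) != "pl":
        return "non_polish"
    n_verbs, n_nouns = count_verbs_nouns(text)
    if n_verbs == 0 and n_nouns > max_nouns:
        return "ungrammatical"
    return "valid"


def format_documents(documents):
    """Yield output lines: one sentence per line, an empty line after each document."""
    for sentences in documents:
        for sentence in sentences:
            yield sentence
        yield ""
```

Returning the rejection reason rather than a boolean makes it easy to count how many sentences each heuristic removes.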

The data was cleaned with the process_sentences.py script; the whole process is presented in the polish_process_data.ipynb notebook.

Summary of Cleaned Polish Oscar corpus

| File | All lines | All sentences | Invalid length sent. | Non-Polish sent. | Ungrammatical sent. | Valid sentences |
|---|---|---|---|---|---|---|
| corpus_oscar_2020-04-10_32M_lines.txt | 32 000 506 | 94 332 394 | 1 796 371 | 296 093 | 8 100 750 | 84 139 180 |
| corpus_oscar_2020-04-10_64M_lines.txt | 32 000 560 | 96 614 563 | 1 777 586 | 491 789 | 7 869 507 | 86 475 681 |
| corpus_oscar_2020-04-10_96M_lines.txt | 32 001 738 | 96 457 553 | 1 796 083 | 302 598 | 7 908 090 | 86 450 782 |
| corpus_oscar_2020-04-10_128M_lines.txt | 32 002 212 | 97 761 040 | 1 919 071 | 305 924 | 7 891 846 | 87 644 199 |
| corpus_oscar_2020-04-10_128M_above_lines.txt | 17 519 467 | 53 446 884 | 1 090 714 | 212 657 | 4 343 296 | 47 800 217 |

Training, testing dataset stats

| Train corpus | Lines | Words | Characters |
|---|---|---|---|
| Polish Wikipedia (2020-03) | 11 748 343 | 181 560 313 | 1 309 416 493 |
| Books | 81 140 395 | 829 404 801 | 5 386 053 287 |
| Oscar (32M part, cleaned) | 112 466 497 | 1 198 735 834 | 8 454 177 161 |
| Total | 205 355 235 | 2 209 700 948 | 15 149 646 941 |

For testing we take ~10% of each corpus

| Test corpus | Lines | Words | Characters |
|---|---|---|---|
| Polish Wikipedia (2020-03) | 1 305 207 | 21 333 280 | 155 403 453 |
| Books | 9 007 716 | 93 141 853 | 610 111 989 |
| Oscar (32M part, cleaned) | 14 515 735 | 157 303 490 | 1 104 855 397 |
| Total | 24 828 658 | 271 778 623 | 1 870 370 839 |

Protocol for training Polish RoBERTa with fairseq

General recipe of the final data preparation and model training process:

  1. Prepare a large text file data.txt (e.g. Wikipedia text) in which each sentence is on a new line and articles are separated by two newlines.
  2. Take 10-15M lines and prepare another file for sentencepiece (the vocabulary builder); again, each sentence is on one line.
  3. Train sentencepiece vocabulary and save it in fairseq format vocab.fairseq.txt.
  4. Encode data.txt with trained sentencepiece model to data.sp.txt.
  5. Preprocess data.sp.txt with fairseq-preprocess.
  6. Run training.
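
Step 3 involves turning a SentencePiece vocabulary into a fairseq dictionary file. The sketch below shows one common way to do this, assuming the usual formats: a SentencePiece `.vocab` file with one `piece<TAB>score` entry per line, and a fairseq dictionary with one `symbol count` entry per line. The function name and the dummy count of 100 are illustrative; whether the project's notebook performs the conversion exactly this way is not stated here.

```python
def spm_vocab_to_fairseq_dict(vocab_lines, dummy_count=100):
    """Convert SentencePiece .vocab lines (piece<TAB>score) to fairseq dict lines.

    fairseq adds its own special symbols, so the SentencePiece ones are skipped.
    dummy_count is a placeholder frequency (an assumption, not a real count).
    """
    special = {"<unk>", "<s>", "</s>", "<pad>"}
    dict_lines = []
    for line in vocab_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        piece = line.split("\t")[0]
        if piece in special:
            continue
        dict_lines.append(f"{piece} {dummy_count}")
    return dict_lines
```

In practice the input lines would be read from the `.vocab` file written by SentencePiece training, and the returned lines written to vocab.fairseq.txt, one per line.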

Detailed data preparation steps for fairseq (vocabulary generation and binarization) are available in a separate notebook, polish_roberta_vocab.ipynb.

Commands needed to reproduce fairseq models with various training protocols may be found in polish_roberta_training.ipynb.

Pretrained models and vocabs

KLEJ evaluation

All models were evaluated on 26.07.2020 on the 9 KLEJ benchmark tasks. The results below were obtained with the fine-tuning scripts from Polish RoBERTa without any further tweaks, which suggests that the potential of the models may not have been fully utilized yet.

| Model | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| PoLitBert_v32k_linear_50k | 92.3 | 91.5 | 92.2 | 64 | 89.8 | 76.1 | 60.2 | 97.9 | 87.6 | 83.51 |
| PoLitBert_v32k_linear_50k_2ep | 91.9 | 91.8 | 90.9 | 64.6 | 89.1 | 75.9 | 59.8 | 97.9 | 87.9 | 83.31 |
| PoLitBert_v32k_tri_125k | 93.6 | 91.7 | 91.8 | 62.4 | 90.3 | 75.7 | 59 | 97.4 | 87.2 | 83.23 |
| PoLitBert_v32k_linear_125k_2ep | 94.3 | 92.1 | 92.8 | 64 | 90.6 | 79.1 | 51.7 | 94.1 | 88.7 | 83.04 |
| PoLitBert_v32k_tri_50k | 93.9 | 91.7 | 92.1 | 57.6 | 88.8 | 77.9 | 56.6 | 96.5 | 87.7 | 82.53 |
| PoLitBert_v32k_linear_125k | 94 | 91.3 | 91.8 | 61.1 | 90.4 | 78.1 | 50.8 | 95.8 | 88.2 | 82.39 |
| PoLitBert_v50k_linear_50k | 92.8 | 92.3 | 91.7 | 57.7 | 90.3 | 80.6 | 42.2 | 97.4 | 88.5 | 81.50 |
| PoLitBert_v32k_cos1_2_50k | 92.5 | 91.6 | 90.7 | 60.1 | 89.5 | 73.5 | 49.1 | 95.2 | 87.5 | 81.08 |
| PoLitBert_v32k_cos1_5_50k | 93.2 | 90.7 | 89.5 | 51.7 | 89.5 | 74.3 | 49.1 | 97.1 | 87.5 | 80.29 |

A comparison with other developed models is available in the continuously updated leaderboard of evaluation tasks.

Details of models training

Link to PoLitBert research log (same as below).

| Experiment | Model name | Vocab size | Scheduler | BSZ | WPB | Steps | Train tokens | Train loss | Valid loss | Best (test) loss |
|---|---|---|---|---|---|---|---|---|---|---|
| #1 | PoLitBert_v32k_linear_50k (tensorboard) | 32k | linear decay | 8 192 | 4,07E+06 | 50 000 | 2,03E+11 | 1,502 | 1,460 | 1,422 |
| #2 | PoLitBert_v32k_tri_50k (tensorboard) | 32k | triangular | 8 192 | 4,07E+06 | 50 000 | 2,03E+11 | 1,473 | 1,436 | 1,402 |
| #3 | PoLitBert_v32k_cos1_50k (tensorboard) | 32k | cosine mul=1 | 8 192 | 4,07E+06 | 23 030 | 9,37E+10 | 10,930 | 11,000 | 1,832 |
| #4 | PoLitBert_v32k_cos1_2_50k (tensorboard) | 32k | cosine mul=1 peak=0.0005 | 8 192 | 4,07E+06 | 50 000 | 2,03E+11 | 1,684 | 1,633 | 1,595 |
| #5 | PoLitBert_v32k_cos1_3_50k (tensorboard) | 32k | cosine mul=2 | 8 192 | 4,07E+06 | 3 735 | 1,52E+10 | 10,930 | | |
| #6 | PoLitBert_v32k_cos1_4_50k (tensorboard) | 32k | cosine mul=2 grad-clip=0.9 | 8 192 | 4,07E+06 | 4 954 | 2,02E+10 | 10,910 | 10,940 | 2,470 |
| #8 | PoLitBert_v32k_tri_125k (tensorboard) | 32k | triangular | 8 192 | 4,07E+06 | 125 000 | 5,09E+11 | 1,435 | 1,313 | 1,363 |
| #9 | PoLitBert_v32k_cos1_5_50k (tensorboard) | 32k | cosine, mul=2, grad-clip=0.9 | 8 192 | 4,07E+06 | 125 000 | 5,09E+11 | 1,502 | 1,358 | 1,426 |
| #10 | PoLitBert_v32k_linear_125k (tensorboard) | 32k | linear decay | 8 192 | 4,07E+06 | 125 000 | 5,09E+11 | 1,322 | 1,218 | 1,268 |
| #11 | PoLitBert_v50k_linear_50k (tensorboard) | 50k | linear decay | 8 192 | 4,07E+06 | 50 000 | 2,04E+11 | 1,546 | 1,439 | 1,480 |
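
The Train tokens column is consistent with Steps × WPB, reading WPB as the average number of tokens per batch (an assumption based on the `wpb` field in fairseq training logs). A minimal sketch of that arithmetic follows; the helper name is illustrative, and because the table's WPB value is rounded, the products only match the table approximately.

```python
def train_tokens(steps, tokens_per_batch):
    """Approximate number of tokens seen during training."""
    return steps * tokens_per_batch


WPB = 4.07e6  # rounded value from the table

# (steps, train tokens reported in the table)
reported = [(50_000, 2.03e11), (125_000, 5.09e11), (23_030, 9.37e10)]

for steps, table_value in reported:
    estimate = train_tokens(steps, WPB)
    relative_gap = abs(estimate - table_value) / table_value
    print(f"{steps:>7} steps: estimate {estimate:.3e}, relative gap {relative_gap:.2%}")
```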

Used libraries

Installation dependencies and problems

  • langdetect needs an additional package
    • install it with sudo apt-get install libicu-dev
  • sentencepiece was installed from source code

Acknowledgements

This is the joint work of the companies Ermlab Software and Literacka.

Part of the work was financed from the grant of The Polish National Centre for Research and Development no. POIR.01.01.01-00-1213/19, the beneficiary of which was Literacka. Project title "Asystent wydawniczy - oprogramowanie do analizy treści, wykorzystujące algorytmy sztucznej inteligencji w celu zautomatyzowania procesu wydawniczego i predykcji sukcesów rynkowych publikacji."

We would like to express our gratitude to the NVIDIA Inception Programme and Amazon AWS for providing free GPU credits. Thank you!

Authors:

Also appreciate the help from

About Ermlab Software

Ermlab - Polish machine learning company

politbert's Issues

Cleaned Oscar dataset

Hello,

I am curious about the Cleaned Polish Oscar corpus.

Are the linked corpus_oscar_2020-04-10*lines.txt files already cleaned versions, or should I run the process_sentences.py script on them? From what I understood, there are non-Polish and invalid-length sentences in these files?

Thanks in advance!
