LEPISZCZE

This is the official code implementation for the experiments in the LEPISZCZE benchmark paper, "This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish" (NeurIPS 2022), by Łukasz Augustyniak, Kamil Tagowski, Albert Sawczyn, Denis Janiak, Roman Bartusiak, Adrian Szymczak, Marcin Wątroba, Arkadiusz Janz, Piotr Szymański, Mikołaj Morzy, Tomasz Kajdanowicz, and Maciej Piasecki.

Resources

LEPISZCZE benchmark resources

| Name | Description | URL |
| --- | --- | --- |
| Leaderboard | LEPISZCZE Leaderboard | LEPISZCZE |
| Library | clarin-pl/embeddings: our library with predefined NLP pipelines for text classification, pair text classification, and sequence labeling tasks | GitHub |
| Experiments dashboard | Weights & Biases dashboard with our experiments | W&B |
| Datasets | LEPISZCZE datasets are accessible through our HuggingFace Hub organization page. | HuggingFace |
| KLEJ-Datasets | Datasets for the KLEJ benchmark are accessible through the Allegro HuggingFace organization page. | HuggingFace |

Citation

@inproceedings{augustyniak2022lepiszcze,
 author = {Augustyniak, Lukasz and Tagowski, Kamil and Sawczyn, Albert and Janiak, Denis and Bartusiak, Roman and Szymczak, Adrian and Janz, Arkadiusz and Szyma\'{n}ski, Piotr and W\k{a}troba, Marcin and Morzy, Miko\l aj and Kajdanowicz, Tomasz and Piasecki, Maciej},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
 pages = {21805--21818},
 publisher = {Curran Associates, Inc.},
 title = {This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish},
 url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/890b206ebb79e550f3988cb8db936f42-Paper-Datasets_and_Benchmarks.pdf},
 volume = {35},
 year = {2022}
}

Contact

If you have any questions or concerns about the LEPISZCZE benchmark, feel free to contact us:

DVC Repository Access

Due to the size of the pipeline output data, we do not provide public access to our DVC remote repository. However, if you are interested in any of the data artifacts, don't hesitate to get in touch with us.

Installation

The repository can be set up via Poetry or Docker.

Requirements installation and environment setup via poetry

Prerequisites:

  • Python: 3.9+
  • Poetry [LINK].
  • CUDA 11.3+ for GPU support (Recommended)

Installation

poetry install

For GPU support

poetry run poe force-torch-cuda
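
To confirm that the GPU build of PyTorch was picked up after the step above, a quick check like the following can help. This is a minimal sketch and not part of the repository: the helper name `cuda_status` is hypothetical, and it relies only on `torch.cuda.is_available()` and `torch.version.cuda`.

```python
def cuda_status() -> str:
    """Report whether PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # imported lazily so the check also runs before installation
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # torch.version.cuda is the CUDA version this PyTorch build targets
        return f"CUDA available (PyTorch built with CUDA {torch.version.cuda})"
    return "torch is installed, but no CUDA device is visible"


if __name__ == "__main__":
    print(cuda_status())
```

Run it inside the Poetry environment so that it inspects the same PyTorch installation the experiments use.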

Using docker image

Building image

docker build . -f docker/Dockerfile -t lepiszcze

After the container is set up, activate the LEPISZCZE conda environment:

conda activate LEPISZCZE

Reproducibility

Our experiments can be reproduced with DVC and logged to W&B: set up your W&B token, then run the dvc repro command.
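
The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the repository's tooling: it assumes the W&B token is supplied through the `WANDB_API_KEY` environment variable (running `wandb login` is another common route), and the helper names and the optional stage argument are hypothetical.

```python
import os
import subprocess
from typing import List, Mapping, Optional


def build_repro_command(stage: Optional[str] = None) -> List[str]:
    """Return the `dvc repro` command, optionally limited to one stage."""
    command = ["dvc", "repro"]
    if stage:
        command.append(stage)  # `dvc repro <target>` reproduces only that target
    return command


def has_wandb_token(env: Mapping[str, str] = os.environ) -> bool:
    """Check that a non-empty W&B API key is present (assumed token setup)."""
    return bool(env.get("WANDB_API_KEY", "").strip())


def reproduce(stage: Optional[str] = None) -> None:
    """Run `dvc repro` from the repository root once the token is present."""
    if not has_wandb_token():
        raise RuntimeError("Set WANDB_API_KEY before running the pipeline.")
    subprocess.run(build_repro_command(stage), check=True)
```

Calling `reproduce()` with no argument runs the whole pipeline; passing a stage name from the repository's DVC configuration restricts the run to that stage.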

DISCLAIMER Reproducing the full pipeline could take more than 2000 hours on a single GPU device. We advise executing stages in parallel on multiple GPU devices.
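
As a rough illustration of why parallel execution matters, the estimate below divides the 2000-hour figure across several GPUs. It assumes fully independent stages and ideal scaling, which a real pipeline with stage dependencies will not reach, so the numbers are optimistic lower bounds. The function name is hypothetical.

```python
def ideal_wall_clock_hours(total_gpu_hours: float, num_gpus: int) -> float:
    """Best-case wall-clock time when work splits evenly across GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return total_gpu_hours / num_gpus


# 2000 hours is the single-GPU figure quoted in the disclaimer above.
for gpus in (1, 4, 8, 16):
    print(f"{gpus:>2} GPU(s): ~{ideal_wall_clock_hours(2000, gpus):.0f} h")
```

Under these assumptions, 8 GPUs bring the run down to about 250 hours, which is still more than ten days of continuous compute.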

Experiments

Experiment configs can be found under configs.

DISCLAIMER For some of the datasets, we had to manually limit the maximum sequence length to 512 for the hyperparameter search.
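
A length cap like this can be pictured as truncating each example before it reaches the model. The sketch below is illustrative only and is not the repository's implementation: it operates on already-tokenized sequence labeling examples and keeps tokens and labels aligned. In practice the limit usually applies to the subword tokens produced by the model's tokenizer, so the helper name and the token-level cut are assumptions.

```python
from typing import List, Sequence, Tuple

MAX_SEQUENCE_LENGTH = 512  # the limit mentioned in the disclaimer above


def truncate_example(
    tokens: Sequence[str],
    labels: Sequence[str],
    max_length: int = MAX_SEQUENCE_LENGTH,
) -> Tuple[List[str], List[str]]:
    """Cut tokens and labels to the same maximum length, keeping them aligned."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return list(tokens[:max_length]), list(labels[:max_length])
```

Examples shorter than the limit pass through unchanged; longer ones lose everything after position 512, which is the trade-off such a cap accepts.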

Model hyperparameter configurations can be accessed via the W&B dashboard. Example: [LINK]

Datasets configurations

| dataset name | task type | input_column_name(s) | target_column_name | description |
| --- | --- | --- | --- | --- |
| clarin-pl/kpwr-ner | sequence labeling (named entity recognition) | tokens | ner | KPWR-NER is a part of the Polish Corpus of Wrocław University of Technology (KPWr). Its objective is recognition of named entities, e.g., people, institutions etc. |
| clarin-pl/polemo2-official | classification (sentiment analysis) | text | target | A corpus of consumer reviews from 4 domains: medicine, hotels, products and school. |
| clarin-pl/2021-punctuation-restoration | punctuation restoration | text_in | text_out | Dataset contains original texts and ASR output. It is a part of PolEval 2021 Competition. |
| clarin-pl/nkjp-pos | sequence labeling (part-of-speech tagging) | tokens | pos_tags | NKJP-POS is a part of the National Corpus of Polish. Its objective is part-of-speech tagging, e.g., nouns, verbs, adjectives, adverbs, etc. |
| clarin-pl/aspectemo | sequence labeling (sentiment classification) | tokens | labels | AspectEmo Corpus is an extended version of a publicly available PolEmo 2.0 corpus of Polish customer reviews used in many projects on the use of different methods in sentiment analysis. |
| laugustyniak/political-advertising-pl | sequence labeling (political advertising) | tokens | tags | First publicly open dataset for detecting specific text chunks and categories of political advertising in the Polish language. |
| laugustyniak/abusive-clauses-pl | classification (abusive-clauses) | text | class | Dataset with Polish abusive clauses examples. |
| allegro/klej-dyk | pair classification (question answering)* | (question, answer) | target | The Did You Know (pol. Czy wiesz?) dataset consists of human-annotated question-answer pairs. |
| allegro/klej-psc | pair classification (text summarization)* | (extract_text, summary_text) | label | The Polish Summaries Corpus contains news articles and their summaries. |
| allegro/klej-cdsc-e | pair classification (textual entailment)* | (sentence_A, sentence_B) | entailment_judgment | The Polish sentence pairs which are human-annotated for textual entailment. |
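
The column names listed above can be kept in a small lookup so that loading code does not hard-code them per dataset. The sketch below is one possible arrangement, not the configuration format used in this repository: `DatasetColumns`, `DATASET_COLUMNS`, and `load_with_columns` are hypothetical names, the `train` split is an assumption, and some datasets may need extra arguments depending on the installed `datasets` version.

```python
from typing import Dict, NamedTuple, Tuple


class DatasetColumns(NamedTuple):
    input_columns: Tuple[str, ...]
    target_column: str


# Column names copied from the datasets configuration table above.
DATASET_COLUMNS: Dict[str, DatasetColumns] = {
    "clarin-pl/kpwr-ner": DatasetColumns(("tokens",), "ner"),
    "clarin-pl/polemo2-official": DatasetColumns(("text",), "target"),
    "clarin-pl/2021-punctuation-restoration": DatasetColumns(("text_in",), "text_out"),
    "clarin-pl/nkjp-pos": DatasetColumns(("tokens",), "pos_tags"),
    "clarin-pl/aspectemo": DatasetColumns(("tokens",), "labels"),
    "laugustyniak/political-advertising-pl": DatasetColumns(("tokens",), "tags"),
    "laugustyniak/abusive-clauses-pl": DatasetColumns(("text",), "class"),
    "allegro/klej-dyk": DatasetColumns(("question", "answer"), "target"),
    "allegro/klej-psc": DatasetColumns(("extract_text", "summary_text"), "label"),
    "allegro/klej-cdsc-e": DatasetColumns(("sentence_A", "sentence_B"), "entailment_judgment"),
}


def load_with_columns(name: str, split: str = "train"):
    """Load one split from the HuggingFace Hub together with its column config."""
    # Imported lazily: needs the `datasets` package and network access.
    from datasets import load_dataset

    columns = DATASET_COLUMNS[name]  # KeyError for datasets not in the table
    return load_dataset(name, split=split), columns
```

For example, `load_with_columns("allegro/klej-dyk")` would return the split along with the `(question, answer)` input pair and the `target` column name, which is enough to build a pair classification example.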

lepiszcze's Issues

Models proposed to test

It would be helpful to see these models from HF on the leaderboard:

  • xlm-roberta-base (base of HerBERT)
  • xlm-roberta-large (base of HerBERT)
  • facebook/xlm-roberta-xl - needs more VRAM
  • facebook/xlm-roberta-xxl - needs more VRAM
  • sdadas/polish-distilroberta - something smaller
  • sdadas/polish-roberta-base-v2 - maybe better than HerBERT
  • sdadas/polish-roberta-large-v2 - maybe better than HerBERT

Other models:

  • Flair
  • word2vec
  • GloVe
  • FastText
  • some non-vector-based models from scikit-learn, e.g., NB, SVM, XGB, logistic regression
