clarin-pl / embeddings

Embeddings: state-of-the-art text representations for natural language processing tasks. The initial version of the library focuses on the Polish language.

Home Page: https://clarin-pl.github.io/embeddings/

License: MIT License

Languages: Python 54.81%, Dockerfile 0.22%, HTML 12.17%, Makefile 0.05%, Jupyter Notebook 32.57%, CSS 0.10%, TeX 0.09%
Topics: languagemodel, nlp, nlp-machine-learning, fine-tuning, classification, sequence-tagging, benchmark, lm

embeddings's Introduction

CLARIN Embeddings

This library was used during the development of the LEPISZCZE benchmark (NeurIPS 2022).

Installation

pip install clarinpl-embeddings

Example

Text-classification with polemo2 dataset and transformer-based embeddings

from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

pipeline = LightningClassificationPipeline(
    dataset_name_or_path="clarin-pl/polemo2-official",
    embedding_name_or_path="allegro/herbert-base-cased",
    input_column_name="text",
    target_column_name="target",
    output_path="."
)

print(pipeline.run())

⚠️ For now, the default pipeline hyperparameters may give poor results. They are subject to change in future releases. We encourage users to use the Optimized Pipelines to select appropriate hyperparameters.

Conventions

We use many HuggingFace concepts, such as models (https://huggingface.co/models) and datasets (https://huggingface.co/datasets), to make our library as easy to use as possible. We want to enable users to create, customise, test, and execute NLP / NLU / SLU tasks as quickly as possible. Moreover, we provide easy-to-use static embeddings trained by CLARIN-PL.

Pipelines

We share predefined pipelines for common NLP tasks, along with corresponding scripts. For Transformer-based pipelines we use PyTorch Lightning ⚡ trainers with Transformers AutoModels. For pipelines based on static embeddings we use the Flair library under the hood.

Transformer embedding based pipelines (e.g. BERT, RoBERTa, HerBERT):

Task | Class | Script
Text classification | LightningClassificationPipeline | evaluate_lightning_document_classification.py
Sequence labeling | LightningSequenceLabelingPipeline | evaluate_lightning_sequence_labeling.py

Run classification task

An example with non-default arguments:

python evaluate_lightning_document_classification.py \
    --embedding-name-or-path allegro/herbert-base-cased \
    --dataset-name clarin-pl/polemo2-official \
    --input-columns-name text \
    --target-column-name target

Run sequence labeling task

An example with the default language model and dataset:

python evaluate_lightning_sequence_labeling.py

Compatible datasets

Most datasets in the HuggingFace repository should be compatible with our pipelines. The datasets listed below were tested by the authors.

dataset name | task type | input_column_name(s) | target_column_name | description
clarin-pl/kpwr-ner | sequence labeling (named entity recognition) | tokens | ner | KPWR-NER is a part of the Polish Corpus of Wrocław University of Technology (KPWr). Its objective is recognition of named entities, e.g., people, institutions, etc.
clarin-pl/polemo2-official | classification (sentiment analysis) | text | target | A corpus of consumer reviews from 4 domains: medicine, hotels, products and school.
clarin-pl/2021-punctuation-restoration | punctuation restoration | text_in | text_out | The dataset contains original texts and ASR output. It is a part of the PolEval 2021 competition.
clarin-pl/nkjp-pos | sequence labeling (part-of-speech tagging) | tokens | pos_tags | NKJP-POS is a part of the National Corpus of Polish. Its objective is part-of-speech tagging, e.g., nouns, verbs, adjectives, adverbs, etc.
clarin-pl/aspectemo | sequence labeling (sentiment classification) | tokens | labels | AspectEmo Corpus is an extended version of the publicly available PolEmo 2.0 corpus of Polish customer reviews, used in many projects on different methods of sentiment analysis.
laugustyniak/political-advertising-pl | sequence labeling (political advertising) | tokens | tags | The first publicly open dataset for detecting specific text chunks and categories of political advertising in the Polish language.
laugustyniak/abusive-clauses-pl | classification (abusive-clauses) | text | class | A dataset with examples of Polish abusive clauses.
allegro/klej-dyk | pair classification (question answering)* | (question, answer) | target | The Did You Know (pol. Czy wiesz?) dataset consists of human-annotated question-answer pairs.
allegro/klej-psc | pair classification (text summarization)* | (extract_text, summary_text) | label | The Polish Summaries Corpus contains news articles and their summaries.
allegro/klej-cdsc-e | pair classification (textual entailment)* | (sentence_A, sentence_B) | entailment_judgment | Polish sentence pairs that are human-annotated for textual entailment.

* Only the pair classification task is supported for now.
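
The column names in the table correspond to the input_column_name and target_column_name arguments used in the pipeline examples. A minimal sketch of that mapping follows. The TESTED_DATASETS dictionary and the pipeline_kwargs helper are hypothetical conveniences, not part of the library, and only three rows of the table are copied.

```python
from typing import Dict

# Hypothetical lookup table, copied from three rows of the table above.
TESTED_DATASETS: Dict[str, Dict[str, str]] = {
    "clarin-pl/kpwr-ner": {"input": "tokens", "target": "ner"},
    "clarin-pl/polemo2-official": {"input": "text", "target": "target"},
    "clarin-pl/nkjp-pos": {"input": "tokens", "target": "pos_tags"},
}


def pipeline_kwargs(dataset_name: str, output_path: str = ".") -> Dict[str, str]:
    """Build keyword arguments in the shape used by the pipeline examples."""
    columns = TESTED_DATASETS[dataset_name]
    return {
        "dataset_name_or_path": dataset_name,
        "input_column_name": columns["input"],
        "target_column_name": columns["target"],
        "output_path": output_path,
    }


kwargs = pipeline_kwargs("clarin-pl/kpwr-ner")
print(kwargs)
```

The resulting dictionary could then be unpacked into a pipeline constructor together with an embedding_name_or_path of your choice.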

Passing task model and task training parameters to predefined flair pipelines

Model and training parameters can be controlled via the task_model_kwargs and task_train_kwargs parameters, which can be populated using the advanced config. A tutorial on how to use configs can be found in the /tutorials directory of the repository. Two types of config are defined in our library: BasicConfig and AdvancedConfig. In summary, the BasicConfig takes arguments and automatically assigns them to the proper keyword group, while the AdvancedConfig takes keyword groups as input, which should already be correctly mapped.

The available configs are listed below:

Lightning:

  • LightningBasicConfig
  • LightningAdvancedConfig
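
The difference between the two config styles can be illustrated with a toy example. The sketch below is not the library's implementation: the KEYWORD_GROUPS routing table and the group_arguments function are hypothetical, and they only reuse group names that appear in the advanced config example in this README.

```python
from typing import Any, Dict

# Hypothetical routing table: which flat argument belongs to which keyword group.
KEYWORD_GROUPS: Dict[str, str] = {
    "learning_rate": "task_model_kwargs",
    "max_epochs": "task_train_kwargs",
    "accelerator": "task_train_kwargs",
    "num_workers": "dataloader_kwargs",
}


def group_arguments(**flat: Any) -> Dict[str, Dict[str, Any]]:
    """Route flat arguments into keyword groups (the 'basic' style)."""
    grouped: Dict[str, Dict[str, Any]] = {}
    for name, value in flat.items():
        if name not in KEYWORD_GROUPS:
            raise KeyError(f"No keyword group known for argument {name!r}")
        grouped.setdefault(KEYWORD_GROUPS[name], {})[name] = value
    return grouped


# Basic style: pass flat arguments and let the helper sort them into groups.
basic = group_arguments(learning_rate=0.01, max_epochs=1, accelerator="cpu")

# Advanced style: the caller supplies the already-grouped dictionaries.
advanced = {
    "task_model_kwargs": {"learning_rate": 0.01},
    "task_train_kwargs": {"max_epochs": 1, "accelerator": "cpu"},
}

print(basic == advanced)
```

The basic style is shorter to write, while the advanced style gives direct control over every group at the cost of knowing where each argument belongs.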

Example with polemo2 dataset

Lightning pipeline

from embeddings.config.lightning_config import LightningBasicConfig
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

config = LightningBasicConfig(
    learning_rate=0.01, max_epochs=1, max_seq_length=128, finetune_last_n_layers=0,
    accelerator="cpu"
)

pipeline = LightningClassificationPipeline(
    embedding_name_or_path="allegro/herbert-base-cased",
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name=["text"],
    target_column_name="target",
    load_dataset_kwargs={
        "train_domains": ["hotels", "medicine"],
        "dev_domains": ["hotels", "medicine"],
        "test_domains": ["hotels", "medicine"],
        "text_cfg": "text",
    },
    output_path=".",
    config=config
)

You can also define an advanced config with populated keyword arguments. In general, these keyword arguments are passed to the corresponding objects when a specific pipeline is constructed, so tracing them through the pipeline code shows which arguments can be set in the config kwargs.

from embeddings.config.lightning_config import LightningAdvancedConfig

config = LightningAdvancedConfig(
    finetune_last_n_layers=0,
    task_train_kwargs={
        "max_epochs": 1,
        "devices": "auto",
        "accelerator": "cpu",
        "deterministic": True,
    },
    task_model_kwargs={
        "learning_rate": 5e-4,
        "use_scheduler": False,
        "optimizer": "AdamW",
        "adam_epsilon": 1e-8,
        "warmup_steps": 100,
        "weight_decay": 0.0,
    },
    datamodule_kwargs={
        "downsample_train": 0.01,
        "downsample_val": 0.01,
        "downsample_test": 0.05,
    },
    dataloader_kwargs={"num_workers": 0},
)
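
One generic way to find out which keyword arguments an object accepts is Python's standard inspect.signature. The sketch below applies it to a stand-in class. ToyTaskModel is hypothetical and its parameters are copied from the task_model_kwargs example above, so in practice the same call would have to be pointed at the real class that a pipeline constructs.

```python
import inspect
from typing import List


class ToyTaskModel:
    """Hypothetical stand-in for whatever object receives task_model_kwargs."""

    def __init__(
        self,
        learning_rate: float = 5e-4,
        use_scheduler: bool = False,
        optimizer: str = "AdamW",
        weight_decay: float = 0.0,
    ) -> None:
        self.learning_rate = learning_rate
        self.use_scheduler = use_scheduler
        self.optimizer = optimizer
        self.weight_decay = weight_decay


def accepted_keywords(cls: type) -> List[str]:
    """List the keyword arguments accepted by a class constructor."""
    signature = inspect.signature(cls.__init__)
    return [name for name in signature.parameters if name != "self"]


keywords = accepted_keywords(ToyTaskModel)
print(keywords)
```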

Available embedding models for Polish

Instead of the allegro/herbert-base-cased model, users can pass any model from the HuggingFace Hub that is compatible with Transformers or with our library.

Embedding | Type | Description
clarin-pl/herbert-kgr10 | bert | HerBERT Large trained on supplementary data - the KGR10 corpus.

Optimized pipelines

Transformers embeddings

Task | Optimized Pipeline
Lightning Text Classification | OptimizedLightingClassificationPipeline
Lightning Sequence Labeling | OptimizedLightingSequenceLabelingPipeline

Example with Text Classification

Optimized pipelines can be run with the following snippet:

from embeddings.config.lighting_config_space import LightingTextClassificationConfigSpace
from embeddings.pipeline.lightning_hps_pipeline import OptimizedLightingClassificationPipeline

pipeline = OptimizedLightingClassificationPipeline(
    config_space=LightingTextClassificationConfigSpace(
        embedding_name_or_path="allegro/herbert-base-cased"
    ),
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name="text",
    target_column_name="target",
).persisting(best_params_path="best_params.yaml", log_path="hps_log.pickle")
df, metadata = pipeline.run()

Training model with obtained parameters

After the parameter search, we can train a model with the best parameters found. First, however, we have to set the output_path parameter, which OptimizedLightingClassificationPipeline does not generate automatically.

metadata["output_path"] = "."

Now we can train the pipeline:

from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

pipeline = LightningClassificationPipeline(**metadata)
results = pipeline.run()
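
The need to set output_path first follows from ordinary Python keyword unpacking: **metadata passes every key as a keyword argument, so a required argument that is missing from the dictionary raises a TypeError. A minimal sketch, assuming output_path is a required argument. The toy_pipeline function is a hypothetical stand-in, not part of the library, and the metadata keys are copied from the examples above.

```python
from typing import Any, Dict


def toy_pipeline(
    embedding_name_or_path: str,
    dataset_name_or_path: str,
    output_path: str,
    **other: Any,
) -> str:
    """Hypothetical stand-in that requires output_path."""
    return f"{embedding_name_or_path} -> {output_path}"


# Toy metadata without output_path (assumed shape, for illustration only).
metadata: Dict[str, Any] = {
    "embedding_name_or_path": "allegro/herbert-base-cased",
    "dataset_name_or_path": "clarin-pl/polemo2-official",
}

try:
    toy_pipeline(**metadata)
    missing_raised = False
except TypeError:
    # The required keyword argument output_path was not supplied.
    missing_raised = True

metadata["output_path"] = "."
result = toy_pipeline(**metadata)
print(missing_raised, result)
```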

Selection of best embedding model

Instead of searching with a single embedding model, we can search over multiple embedding models by passing them as a list to the ConfigSpace.

pipeline = OptimizedLightingClassificationPipeline(
    config_space=LightingTextClassificationConfigSpace(
        embedding_name_or_path=["allegro/herbert-base-cased", "clarin-pl/roberta-polish-kgr10"]
    ),
    dataset_name_or_path="clarin-pl/polemo2-official",
    input_column_name="text",
    target_column_name="target",
).persisting(best_params_path="best_params.yaml", log_path="hps_log.pickle")
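
Conceptually, passing a list turns the embedding name into one more categorical dimension of the search space. The toy random search below illustrates only that idea. It does not reproduce the library's search procedure, the embedding names are placeholders, and toy_objective with its made-up scores is a hypothetical substitute for training and evaluating a model.

```python
import random
from typing import List, Tuple

# Hypothetical placeholders; a real search would use actual model names.
EMBEDDINGS: List[str] = ["embedding-a", "embedding-b"]
LEARNING_RATES: List[float] = [1e-5, 5e-5, 1e-4]


def toy_objective(embedding: str, learning_rate: float) -> float:
    """Placeholder score; a real objective would train and evaluate a model."""
    # Made-up numbers for illustration only, not measured results.
    base = {"embedding-a": 0.5, "embedding-b": 0.6}[embedding]
    return base - abs(learning_rate - 5e-5) * 1000


def random_search(n_trials: int, seed: int = 0) -> Tuple[str, float, float]:
    """Sample (embedding, learning rate) pairs and keep the best-scoring one."""
    rng = random.Random(seed)
    best: Tuple[str, float, float] = ("", 0.0, float("-inf"))
    for _ in range(n_trials):
        embedding = rng.choice(EMBEDDINGS)
        learning_rate = rng.choice(LEARNING_RATES)
        score = toy_objective(embedding, learning_rate)
        if score > best[2]:
            best = (embedding, learning_rate, score)
    return best


best_embedding, best_learning_rate, best_score = random_search(n_trials=20)
print(best_embedding, best_learning_rate, best_score)
```

The embedding is sampled exactly like any other hyperparameter, which is why a single search can compare models and tune them at the same time.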

Citation

The paper describing the library is available on arXiv and in the proceedings of NeurIPS 2022.

@inproceedings{augustyniak2022lepiszcze,
 author = {Augustyniak, Lukasz and Tagowski, Kamil and Sawczyn, Albert and Janiak, Denis and Bartusiak, Roman and Szymczak, Adrian and Janz, Arkadiusz and Szyma\'{n}ski, Piotr and W\k{a}troba, Marcin and Morzy, Miko\l aj and Kajdanowicz, Tomasz and Piasecki, Maciej},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
 pages = {21805--21818},
 publisher = {Curran Associates, Inc.},
 title = {This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish},
 url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/890b206ebb79e550f3988cb8db936f42-Paper-Datasets_and_Benchmarks.pdf},
 volume = {35},
 year = {2022}
}

embeddings's People

Contributors

adrianszymczak, asawczyn, deepception, djaniak, gw98-github, koconjan, ktagowski, laugustyniak, lruczu2, markowanga, mkossakowski19, riomus

embeddings's Issues

Add NER dataset and task

Part I: KPWr

  • Add NER dataset to HF
  • Add NER dataset to library
  • Adapt appropriate configuration for NER tag evaluation
  • Add NER sequence tagging test to pipeline
  • Add NER example

Part II: PolEval 2018 (nested NER tags)

  • Add PolEval NER dataset to HF
  • Add Nested NER dataset to library
  • Adapt appropriate configuration for Nested NER tag evaluation
  • Add Nested NER example

Create a project roadmap

  • how do we want to load data from local drives
  • do we want to prepare special loaders for data - tensorflow datasets, spacy datasets

Sequence Labeling spelling

Labeling is the American spelling, labelling is the British spelling.

In our code and filenames, we use both spellings interchangeably.
Sequence labeling is more common in the literature so we could use this one.

ColumnCorpusTransformation saving to conll

Data in CoNLL format is saved to a file with a .csv extension, although it should probably be saved to a .tsv file because the data is tab-separated.

column_format = self._save_to_conll(
    hf_datadict[subset_name], output_path.joinpath(f"{subset_name}.csv")
)
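
A minimal sketch of writing tab-separated, CoNLL-style columns to a .tsv file with the standard library. The save_to_conll function below is a hypothetical stand-in, not the library's _save_to_conll method, and the token/tag sentences are toy data.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence, Tuple

Sentence = Sequence[Tuple[str, str]]  # (token, tag) pairs


def save_to_conll(sentences: Sequence[Sentence], path: Path) -> None:
    """Write one token<TAB>tag pair per line, with a blank line after each sentence."""
    lines: List[str] = []
    for sentence in sentences:
        lines.extend(f"{token}\t{tag}" for token, tag in sentence)
        lines.append("")  # sentence separator
    path.write_text("\n".join(lines), encoding="utf-8")


# Toy data for illustration only.
toy_sentences = [
    [("Jan", "B-PER"), ("mieszka", "O")],
    [("Wrocław", "B-LOC")],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    output_file = Path(tmp_dir) / "train.tsv"  # .tsv matches the tab-separated content
    save_to_conll(toy_sentences, output_file)
    written = output_file.read_text(encoding="utf-8")

print(written)
```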

Check sequence tagging metrics for POS

  • Check whether metrics are calculated correctly for POS tags
  • Check whether tag names are correctly parsed by the seqeval library and add a test for this case
    • Add NER dataset for sequence tagging

Update README

Update README:

  • Add example with NER/POS
  • Add examples for use with predefined pipelines not only custom one

Add WSD dataset

HuggingFace

  • Add versioning (plWordNet may have different versions) and add metadata
    • How to add graph data to HF (separate dataset, or merged with the WSD dataset)

TODO (only Text-based pipeline):

  • Add WSD dataset to HF
  • Add WSD dataset to library
  • Add WSD to pipeline
  • Add WSD example

Graph (HF)

  • Test HF proof of concept:
    • Fed texts with ID and TEXT
    • Graph data as a HF dataset attribute

Add hyperparameter-search pipeline for task models

TODO List:

Hypersearch Pipeline:

  • Add other pipelines
  • Add tests

Other:

  • #91
  • #95
    • Sequence Labelling
    • Text Classification
    • Pair Text Classification
  • Update README

Done:

  • #97
  • Add support for custom optuna search space configuration
  • Add retraining task after selection
  • Add optuna to pyproject.toml
  • Add logging of best found parameters configuration
  • @ktagowski Create Pipeline for Hyperparameter search
  • @djaniak Decide which library to use (optuna/hyperopt/ray)
  • #90
  • Add option to save optuna dataframe log
  • @djaniak Run libraries examples
  • @ktagowski Prepare proof-of-concept (notebook)
  • Add persister for best found parameters

Abandoned:

  • Add support for parsing Pipelines parameters from YAML

Test kgr10-based LM with domain fine-tuning

==============================herbert-base=================================

2021-08-09 23:57:17,425 loading file herbert-base/best-model.pt
2021-08-09 23:57:31,239     0.728
2021-08-09 23:57:31,239 
Results:
- F-score (micro) 0.728
- F-score (macro) 0.6648
- Accuracy 0.728

By class:
              precision    recall  f1-score   support

       minus     0.7716    0.8171    0.7937       339
        plus     0.6137    0.8678    0.7190       227
        zero     0.9813    0.8898    0.9333       118
         amb     0.5455    0.1324    0.2130       136

   micro avg     0.7280    0.7280    0.7280       820
   macro avg     0.7280    0.6768    0.6648       820
weighted avg     0.7206    0.7280    0.6968       820
 samples avg     0.7280    0.7280    0.7280       820


==============================herbert-kgr10=================================

2021-08-10 00:03:07,082 loading file herbert-kgr10/best-model.pt
2021-08-10 00:03:34,942         0.7671
2021-08-10 00:03:34,943
Results:
- F-score (micro) 0.7671
- F-score (macro) 0.6484
- Accuracy 0.7671

By class:
              precision    recall  f1-score   support

       minus     0.7429    0.9292    0.8257       339
        plus     0.7143    0.9031    0.7977       227
        zero     1.0000    0.9153    0.9558       118
         amb     1.0000    0.0074    0.0146       136

   micro avg     0.7671    0.7671    0.7671       820
   macro avg     0.8643    0.6887    0.6484       820
weighted avg     0.8146    0.7671    0.7021       820
 samples avg     0.7671    0.7671    0.7671       820

Tested on clarin-pl/polemo2-official.
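
As a sanity check on the two reports above, the macro and weighted F1 averages can be recomputed from the per-class rows: macro F1 is the unweighted mean of the per-class F1 scores, and weighted F1 weights each class by its support. The helper names below are illustrative. The numbers are copied from the logs, so differences in the last digit from rounding are expected.

```python
from typing import Dict, Tuple

# (f1-score, support) per class, copied from the two reports above.
REPORTS: Dict[str, Dict[str, Tuple[float, int]]] = {
    "herbert-base": {
        "minus": (0.7937, 339),
        "plus": (0.7190, 227),
        "zero": (0.9333, 118),
        "amb": (0.2130, 136),
    },
    "herbert-kgr10": {
        "minus": (0.8257, 339),
        "plus": (0.7977, 227),
        "zero": (0.9558, 118),
        "amb": (0.0146, 136),
    },
}


def macro_f1(report: Dict[str, Tuple[float, int]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for f1, _ in report.values()) / len(report)


def weighted_f1(report: Dict[str, Tuple[float, int]]) -> float:
    """Mean of the per-class F1 scores, weighted by class support."""
    total_support = sum(support for _, support in report.values())
    return sum(f1 * support for f1, support in report.values()) / total_support


for name, report in REPORTS.items():
    print(name, round(macro_f1(report), 4), round(weighted_f1(report), 4))
```

The comparison shows why herbert-kgr10 has the higher micro F1 but the lower macro F1: its F1 on the amb class is close to zero, and the macro average gives that class the same weight as the others.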
