stepantita / news-contest

This repository contains Jupyter notebooks detailing the experiments conducted in our research paper on Ukrainian news classification. We introduce a framework for simple classification dataset creation with minimal labeling effort, and further compare several pretrained models for the Ukrainian language.

Home Page: https://link.springer.com/chapter/10.1007/978-3-031-14841-5_37

Topics: benchmarking, dataset-creation, deep-learning, jupyter-notebooks, news-classification, nlp, nlp-research, pretrained-models, text-classification, transfer-learning

📰 Ukrainian News Classification Experiments

Welcome to the repository containing Jupyter notebooks and findings from our research paper on Ukrainian news classification. Dive in to discover our methodology, key findings, and comparisons of several pretrained models for the Ukrainian language. These experiments were conducted by Stepan Tytarenko, whose XLM-R-based solution won the in-class competition.


🎯 Abstract

In the vast expanse of natural language processing, languages like Ukrainian face a pressing issue: the lack of datasets. This paper unveils a pioneering approach to dataset creation with minimal overhead, setting the stage for Ukrainian news classification.


📌 Key Findings

  • ukr-RoBERTa, ukr-ELECTRA, and XLM-R are the crème de la crème of models.
  • XLM-R is the go-to for longer texts.
  • ukr-RoBERTa is a beacon for shorter sequences.
  • NB-SVM baseline? A dark horse with commendable performance on a large dataset!
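
For readers unfamiliar with the NB-SVM baseline: in its usual formulation, it scales binarized bag-of-words features by a Naive Bayes log-count ratio and then trains a linear classifier on the scaled features. The snippet below is a minimal sketch of the ratio step only, in plain Python. The function names, the toy documents, and the smoothing value `alpha=1.0` are illustrative assumptions and are not taken from the notebooks in this repository.

```python
import math
from collections import Counter


def log_count_ratio(docs, labels, positive, alpha=1.0):
    """Return {token: r} with r = log((p / |p|_1) / (q / |q|_1)).

    p and q are smoothed document counts of each token in the positive
    class and in all other classes (one-vs-rest). `alpha` is an assumed
    smoothing constant.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    pos, neg = Counter(), Counter()
    for doc, label in zip(docs, labels):
        target = pos if label == positive else neg
        target.update(set(doc))  # binarized features: presence, not frequency
    p = {tok: alpha + pos[tok] for tok in vocab}
    q = {tok: alpha + neg[tok] for tok in vocab}
    p_norm, q_norm = sum(p.values()), sum(q.values())
    return {tok: math.log((p[tok] / p_norm) / (q[tok] / q_norm)) for tok in vocab}


def scale_features(doc, ratios):
    """Map a tokenized document to its ratio-scaled binary features."""
    return {tok: ratios[tok] for tok in set(doc) if tok in ratios}


# Toy, made-up data purely for illustration.
docs = [["match", "goal", "team"], ["team", "win"],
        ["budget", "vote"], ["vote", "law", "team"]]
labels = ["sport", "sport", "politics", "politics"]

ratios = log_count_ratio(docs, labels, positive="sport")
features = scale_features(["goal", "vote", "unseen"], ratios)
```

Tokens that are more typical of the positive class get a positive weight, tokens typical of the other classes get a negative one, and unseen tokens are dropped. A linear model (an SVM or logistic regression) would then be fit on these scaled features, one per class.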

🔬 Experiments

Our experiments spread across:

  1. Small training set with titles only 📃
  2. Small training set, full text immersion 📜
  3. Large training set, titles at the forefront 📃📃
  4. Large training set, full text deluge 📜📜

Each model was given a time window of 24 hours on a single P100 GPU.
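
The four settings above form a 2 x 2 grid (training-set size by input field), each run under the same compute budget. One way to write that grid down is sketched here. The class name `ExperimentConfig` and the field values `"title"` and `"full_text"` are hypothetical labels chosen for this example, not identifiers from the notebooks.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ExperimentConfig:
    train_size: str              # "small" or "large"
    text_field: str              # "title" or "full_text" (assumed names)
    time_budget_hours: int = 24  # per-model window stated above
    gpu: str = "P100"            # single GPU stated above


def build_grid():
    """Enumerate the four settings in the same order as the list above."""
    sizes = ("small", "large")
    fields = ("title", "full_text")
    return [ExperimentConfig(size, field) for size, field in product(sizes, fields)]


grid = build_grid()
for cfg in grid:
    print(cfg)
```

Keeping the budget and hardware as fields of the configuration makes it explicit that every model is compared under the same constraints.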

📊 Benchmark Results

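| Model           | Short texts / small training set | Long texts / small training set | Short texts / large training set | Long texts / large training set |
|-----------------|----------------------------------|---------------------------------|----------------------------------|---------------------------------|
| NB-SVM baseline | 0.533                            | 0.900                           | 0.708                            | 0.910                           |
| mBERT           | 0.790                            | 0.910                           | 0.675                            | 0.907                           |
| Slavic BERT     | 0.636                            | 0.907                           | 0.620                            | 0.940                           |
| ukr-RoBERTa     | 0.853                            | 0.948                           | 0.903                            | 0.950                           |
| ukr-ELECTRA     | 0.685                            | 0.950                           | 0.745                            | 0.948                           |
| XLM-R           | 0.840                            | 0.915                           | 0.909                            | 0.915                           |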

🏆 Note: XLM-R takes the gold with an F1 score of 0.95 on the large full-text training set.
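
As a reminder of how an F1 score for a multi-class task can be computed, here is a small dependency-free sketch. It uses macro averaging (the unweighted mean of per-class F1), which is an assumption: this README does not say which averaging the reported scores use. The labels and predictions are made up for illustration.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Made-up example: one "sport" article is misclassified as "politics".
y_true = ["sport", "sport", "politics", "politics"]
y_pred = ["sport", "politics", "politics", "politics"]

per_class = f1_per_class(y_true, y_pred)
score = macro_f1(y_true, y_pred)
```

With macro averaging every class counts equally regardless of how many articles it has, so a rare category that is classified poorly pulls the score down as much as a frequent one.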


🧐 Observations

  • mBERT & Slavic BERT: Not the stars of the show when it comes to F1-scores.
  • ukr-RoBERTa: Climbs the ranks, especially on short-text terrain.
  • ukr-ELECTRA: Balances the act between different text lengths.
  • XLM-R: Reigns supreme with long texts, but faces hurdles with short ones.

Dataset

Want the dataset? Fetch it on Kaggle.


Citation

If our work aids your research, show some love with a citation:

D. Panchenko et al. (2021). Ukrainian News Corpus As Text Classification Benchmark.
