A curated list of resources for processing the Slovak language.
- Slovak resources by Essential Data
- Slovak speech and language processing at KEMT FEI TUKE with tools, demos and language resources.
- Spelling Dictionary
- List of common names, abbreviations, pejoratives and neologisms.
- tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
- models trained on UD
- implementation in Python/PyTorch, command-line interface, web service interface
- license: Apache v2.0
- tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
- models trained on UD
- implementation in Python/dyNET, command-line interface, web service interface
- license: Apache v2.0
- tokenization, stemming
- tokenization, segmentation
- implementation in C++
- license: GPL v3.0
- UPOS, UD
- models trained on UD
- implementation in Python/PyTorch, command-line interface
- license: MIT
- tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
- models trained on UD
- implementation in C++, bindings in Java, Python, Perl, C#, command-line interface, web service interface
- license: MPL v2.0
- tokenization, stemming, lemmatization, diacritic restoration, POS (SNK), NER
- web service interface only
- license: ?
- tokenization, segmentation, lemmatization, POS (OpenNLP, SNK), UD (CoreNLP), NER
- implementation in Java/DL4J
- license: GNU AGPLv3
- Web-based Visualisation of Slovak word vectors
- Lemmatization for 25 languages
- implementation in Python
- Slovak trained on UDP corpus
- automatic POS (SNK)
- source: web
- deduplicated
- source: Common Crawl
- automatic POS (AUT, TreeTagger)
- source: web
- no annotation
- Twitter part
- tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
- manual annotation
- format: conllu
- source: SNK
- tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
- format: conllu
- source: Slovak UD, SNK
- form, lemma, POS (SNK)
- source: SNK
- form, lemma, POS (Multext East)
- source: Europarl
- speech, vectors, language
- automatic POS (SNK)
- source: Acquis, Europarl, EU-journal, EC-Europa, OPUS
- automatic POS (SNK)
- source: Acquis, Europarl, EU-journal, EC-Europa, OPUS
- sentence aligned, POS
- Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian
- source: "1984" novel
- Parallel web corpus with a Slovak part
- 3.3 million English-Slovak sentences
- source: Twitter
- download data
- automatic annotation
- source: Wikipedia
- source: Wikipedia, Common Crawl
- source: Common Crawl
- source: Wikipedia
- Slovak RoBERTa base language model
- trained on web corpus
- Transformer models for machine translation
- Slovak, English, Finnish, Swedish, Spanish, French
- VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
- Facebook's Wav2Vec2 base model pretrained on the 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on the transcribed Slovak (sk) data
- multilingual BERT, trained on Wikipedia
- Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages.
- Flores101: Large-Scale Multilingual Machine Translation
- Baseline pretrained models for small and large tracks of WMT 21 Large-Scale Multilingual Machine Translation competition.
- Includes Slovak language
- For fairseq
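
Several of the treebanks listed above are distributed in the `conllu` (CoNLL-U) format: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. The following is a minimal sketch of reading such data with only the Python standard library. The function name `read_conllu` and the two-word sample sentence are made up for illustration and are not taken from any of the listed corpora; a dedicated CoNLL-U library is likely a better choice for real work.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in the order defined by Universal Dependencies.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries.

    Hypothetical helper for illustration: it keeps every token line
    (including multiword-token ranges such as "1-2") and skips comments.
    """
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level metadata such as "# text = ..." is ignored here.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(FIELDS, columns)))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Made-up sample, not from a real treebank; XPOS and FEATS are left empty ("_").
SAMPLE = (
    "# text = Mama varí.\n"
    "1\tMama\tmama\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tvarí\tvariť\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for tokens in read_conllu(SAMPLE):
        for token in tokens:
            print(token["form"], token["lemma"], token["upos"], sep="\t")
```

To process a downloaded treebank, the same function can be applied to the contents of a `.conllu` file opened with `encoding="utf-8"`.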