marcodimarek / resources

This project forked from slovak-nlp/resources

A curated list of resources, such as tools and datasets, useful for processing the Slovak language

Resources

A curated list of resources for processing the Slovak language.

Pages

Slovak resources by Essential Data
Slovak speech and language processing at KEMT FEI TUKE with tools, demos and language resources.

Tools

Slovak Hunspell

Spelling Dictionary
List of common names, abbreviations, pejoratives and neologisms.

Stanza

tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
models trained on UD
implementation in Python/PyTorch, command-line interface, web service interface
license: Apache v2.0

NLP-Cube

tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
models trained on UD
implementation in Python/DyNet, command-line interface, web service interface
license: Apache v2.0

Slovak Elasticsearch

tokenization, stemming

Slovak lexer

tokenization, segmentation
implementation in C++
license: GPL v3.0

dl4dp

UPOS, UD
models trained on UD
implementation in Python/PyTorch, command-line interface
license: MIT

UDPipe

tokenization, segmentation, UPOS, XPOS (SNK), lemmatization, UD
models trained on UD
implementation in C++, bindings in Java, Python, Perl, C#, command-line interface, web service interface
license: MPL v2.0

NLP4SK

tokenization, stemming, lemmatization, diacritic restoration, POS (SNK), NER
web service interface only
license: ?

NLP Tools

tokenization, segmentation, lemmatization, POS (OpenNLP, SNK), UD (CoreNLP), NER
implementation in Java/DL4J
license: GNU AGPLv3

Semä

web-based visualisation of Slovak word vectors

Simplemma

lemmatization for 25 languages
implementation in Python
Slovak trained on UDP corpus

Corpora, datasets, vocabularies

Web

Common Crawl

SkTenTen

automatic POS (SNK)
source: web

Oscar

deduplicated
source: Common Crawl

Aranea

automatic POS (AUT, TreeTagger)
source: web

HC Corpora

no annotation
Twitter part

Morpho-syntactic

Slovak Universal Dependencies

tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
manual annotation
format: conllu
source: SNK
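
Treebanks distributed as conllu files store one token per line with ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines starting with `#`, and a blank line between sentences. The following is a minimal reading sketch using only the Python standard library, not the loader of any tool listed here. The function name `parse_conllu` and the sample sentence with its annotations are made up for illustration and are not taken from the Slovak UD data; the XPOS column is left as `_` rather than guessing SNK tags.

```python
# Column names defined by the CoNLL-U format (10 tab-separated fields per token).
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts.

    Hypothetical helper for illustration only.
    """
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
            continue
        tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    if tokens:
        # Handle a final sentence that is not followed by a blank line.
        sentences.append(tokens)
    return sentences


# Illustrative sample only; the annotations are an assumption, not corpus data.
SAMPLE = "\n".join([
    "# text = Mama varí obed.",
    "\t".join(["1", "Mama", "mama", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "varí", "variť", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "obed", "obed", "NOUN", "_", "_", "2", "obj", "_", "SpaceAfter=No"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for tok in sentence:
            print(tok["form"], tok["lemma"], tok["upos"], tok["head"], tok["deprel"])
```

All values are kept as strings, so `head` would need an `int()` conversion before building a dependency tree. For real work, the parsers listed under Tools read and write this format directly.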

Artificial Treebank with Ellipsis

tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
format: conllu
source: Slovak UD, SNK

Morphological vocabulary

form, lemma, POS (SNK)
source: SNK

MULTEXT-East free lexicons 4.0

form, lemma, POS (Multext East)

Parallel

VoxPopuli

source: Europarl
speech, vectors, language

Czech-Slovak Parallel Corpus

automatic POS (SNK)
source: Acquis, Europarl, EU-journal, EC-Europa, OPUS

English-Slovak Parallel Corpus

automatic POS (SNK)
source: Acquis, Europarl, EU-journal, EC-Europa, OPUS

MULTEXT-East "1984" annotated corpus

sentence aligned, POS
Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian
source: "1984" novel

Paracrawl

parallel web corpus with a Slovak part
3.3 million English-Slovak sentences

Sentiment

Twitter sentiment for 15 European languages

source: Twitter

NER

Cross-lingual Name Tagging and Linking for 282 Languages

download data
automatic annotation
source: Wikipedia

Wordnet

Slovak Wordnet

Models

Word embeddings

ELMo word embeddings

source: Wikipedia, Common Crawl

fastText word embeddings - Common Crawl

source: Common Crawl

fastText word embeddings - Wikipedia

source: Wikipedia

Transformers

SlovakBERT

Slovak RoBERTa base language model
trained on web corpus

HuggingFace Translation models

Transformer models for machine translation
Slovak, English, Finnish, Swedish, Spanish, French

VoxPopuli Slovak model

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Facebook's Wav2Vec2 base model, pretrained on the 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on the transcribed Slovak (sk) data

m-BERT

multilingual BERT, trained on Wikipedia

Language Agnostic BERT model

Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages.

Flores

Flores101: Large-Scale Multilingual Machine Translation
Baseline pretrained models for small and large tracks of WMT 21 Large-Scale Multilingual Machine Translation competition.
includes the Slovak language
for use with fairseq
