word-embeddings-eval-hy's Introduction

Word Embeddings for the Armenian Language: Intrinsic and Extrinsic Evaluation

Pretrained Embeddings

We release pre-trained word embeddings:

  • 200-dimensional GloVe vectors (.text);
  • 300-dimensional CBOW (.text) and SkipGram (.text) vectors;
  • 200-dimensional fastText vectors (.text, .bin), trained with the SkipGram architecture and character n-grams up to length 3.
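As a rough illustration of how the .text files might be read, here is a minimal sketch. It assumes the files use the common plain-text layout (one token per line followed by its space-separated vector components, optionally preceded by a word2vec-style "<vocab_size> <dim>" header line); the released files should be checked against this assumption. The function name load_text_vectors is hypothetical.

```python
def load_text_vectors(lines):
    """Parse text-format vectors from an iterable of lines.

    Assumed layout (not confirmed by the README): "<token> <v1> <v2> ...",
    with an optional "<vocab_size> <dim>" header as the first line.
    """
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Skip a word2vec-style header; GloVe-style files usually have none.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # ignore blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Usage with a tiny in-memory sample instead of a real file.
sample = ["2 3", "տուն 0.1 0.2 0.3", "գիրք 0.4 0.5 0.6"]
vectors = load_text_vectors(sample)
```

For a real file, the same function can be given an open file handle, for example open(path, encoding="utf-8"), since it only iterates over lines.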

The training data for these models was collected from various sources:

a. Wikipedia;
b. fiction texts taken from the open part of the EANC corpus;
c. HC Corpora containing blogs and news articles collected by Hans Christensen from public sources in 2011;
d. the digitized and reviewed part of the Armenian Soviet Encyclopedia (as of February 2018) taken from Wikisource;
e. texts from news websites on the following topics: economics, events, art, sports, law, politics, blogs and interviews.

The texts were preprocessed by lowercasing all tokens and removing punctuation and digits. The final dataset contained 90.5 million tokens.
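The preprocessing described above could be approximated as follows. This is only a sketch: the tokenizer actually used is not stated here, so whitespace splitting, replacing punctuation with a space (rather than deleting it), and the use of Unicode categories to detect punctuation and digits are all assumptions. The function name preprocess is hypothetical.

```python
import unicodedata


def preprocess(text):
    """Lowercase text, drop punctuation and decimal digits, split on whitespace."""
    cleaned = []
    for ch in text.lower():
        category = unicodedata.category(ch)
        # "P*" covers punctuation, including Armenian marks such as "։";
        # "Nd" covers decimal digits.
        if category.startswith("P") or category == "Nd":
            cleaned.append(" ")
        else:
            cleaned.append(ch)
    return "".join(cleaned).split()


tokens = preprocess("Բարև, Երևան։ 2018")
```

Applying the same normalization to any text that is looked up in the vectors keeps the vocabulary consistent with the training data.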

Word Analogy Task

In addition, we publish an adaptation of the word analogy task (Mikolov et al., 2013a) for the Armenian language to serve as a benchmark for intrinsic evaluation of vectors. The task contains 5 semantic and 8 syntactic sections, with 15,646 analogy questions in total.
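To make the task concrete, here is a small sketch of how an analogy question "a is to b as c is to ?" is commonly scored with the vector-offset approach: the answer is the vocabulary word whose vector is closest (by cosine similarity) to b - a + c, excluding the three question words. The exact evaluation protocol used for this benchmark is described in the paper and may differ; skipping questions with out-of-vocabulary words is an assumption, and all function names and the toy English vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding a, b and c themselves."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Accuracy over (a, b, c, expected) questions whose words are all in vocabulary."""
    answered = [q for q in questions if all(w in vectors for w in q)]
    if not answered:
        return 0.0
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in answered)
    return correct / len(answered)


# Toy 2-dimensional vectors, for illustration only.
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}
prediction = solve_analogy(toy, "man", "woman", "king")
```

With real embeddings, the linear scan over the vocabulary is slow; a matrix-based implementation would be preferable for all questions of the benchmark.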

News Texts Dataset

For extrinsic evaluation of vectors in a classification task, we release a dataset of over 12,000 news articles from iLur.am, categorized into 7 classes: sport, politics, weather, economy, accidents, art, society. The articles are split into train (2242k tokens) and test (425k tokens) sets.

For more details, refer to the paper.

word-embeddings-eval-hy's Issues

Suitable for both Eastern and Western Armenian?

I'm working on preparing models for the next release of Stanford's Stanza Python software: https://stanfordnlp.github.io/stanza

The next release of Universal Dependencies distinguishes between Eastern and Western Armenian. Is it suitable to use the word vectors you host for both dialects?

https://github.com/UniversalDependencies/UD_Armenian-ArmTDP/
https://github.com/UniversalDependencies/UD_Western_Armenian-ArmTDP/

If I count the words from the UD datasets that are present in the Glove 200 file, for example, 97% of the words in the Eastern Armenian dataset appear here, and only 88% of the words in the Western Armenian dataset appear in the word vectors. This makes me think there's a bit of an issue with the coverage of these word vectors. If these vectors are not ideal, do you have any recommendations for others that would be, or do you have any intention of adding more Western Armenian words to the dataset?

Thanks!
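A coverage check like the one described in this issue could be sketched as below. It assumes the UD treebanks are in CoNLL-U format (tab-separated columns, word form in the second column, comment lines starting with "#") and measures coverage over lowercased word types; whether the figures above were computed over types or tokens, or with lowercasing, is not stated. The function names are hypothetical.

```python
def conllu_forms(lines):
    """Extract word forms (column 2) from CoNLL-U lines."""
    forms = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence separators and comment lines
        cols = line.split("\t")
        # Keep plain integer IDs only; this skips multiword-token ranges
        # such as "1-2" and empty nodes such as "1.1".
        if len(cols) < 2 or not cols[0].isdigit():
            continue
        forms.append(cols[1])
    return forms


def coverage(forms, vocab):
    """Fraction of distinct lowercased forms that appear in vocab."""
    types = {form.lower() for form in forms}
    if not types:
        return 0.0
    return sum(t in vocab for t in types) / len(types)


# Tiny in-memory example in place of a real treebank and vector file.
sample = [
    "# text = A b c",
    "1\tA\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tb\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tc\t_\t_\t_\t_\t_\t_\t_\t_",
    "",
]
ratio = coverage(conllu_forms(sample), {"a", "b"})
```

Here vocab would be the set of tokens read from the vector file, and the lowercasing matches the preprocessing stated for the training data.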
