
devmount / germanwordembeddings


Toolkit to obtain and preprocess German text corpora, train models and evaluate them with generated test sets. Built with Gensim and TensorFlow.

Home Page: https://devmount.github.io/GermanWordEmbeddings/

License: MIT License

Python 34.10% Shell 0.77% Jupyter Notebook 65.13%
deep-learning deep-neural-networks evaluation gensim german-language model natural-language-processing neural-network nlp training word-embeddings word2vec

germanwordembeddings's Introduction

👋🌍

I'm Andreas. I'm a freelance web developer, creator and consultant in Berlin, Germany. I love God and I do all my work unto him. I create and maintain websites, build applications based on web technologies and also do some machine learning and NLP work. I occasionally publish articles and tutorials about web development and related topics on DEV, Medium and Twitter. If you like my work, just drop me a 'thank you' or sponsor me.

  • 💬 Ask me about programming (JavaScript, Python, PHP, CSS, Vue.js), balancing work and family, or freelancing
  • 📫 Contact me via my website, Keybase or email
  • 🤓 Fun fact: My first own computer was an Acer Travelmate 220 laptop with both a floppy and a CD drive! What was yours?

I love Open Source Software and joined GitHub 10 years ago. Since then I've pushed 9322 commits, opened 384 issues, submitted 617 and reviewed 509 pull requests, received 899 stars across 55 personal projects and contributed to 6 public repositories.

The most used languages across my projects are:

C++ PHP Vue CSS Blade TeX HTML Python Other

Happy coding!


germanwordembeddings's People

Contributors

dependabot[bot], devmount, prabhakar267, sohalt, undefdev


germanwordembeddings's Issues

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

When I try to run the preprocessing, I get the following error on macOS:

  File "preprocessing.py", line 92, in <module>
    stop_words = [replace_umlauts(token) for token in stopwords.words('german')]
  File "preprocessing.py", line 52, in replace_umlauts
    res = res.replace('Γ€', 'ae')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

when running:

python preprocessing.py dewiki.xml corpus/dewiki.corpus -psub
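The byte 0xc3 is the first byte of a UTF-8 encoded umlaut, so this error typically means the script is running under Python 2, where plain string literals are bytes. A minimal Python 3 sketch of the umlaut replacement (the helper name follows the traceback; the exact replacement mapping is an assumption):

```python
def replace_umlauts(text):
    """Replace German umlauts and eszett with ASCII transliterations.

    Under Python 3 all str literals are Unicode, which avoids the
    ascii-codec error raised under Python 2.
    """
    replacements = {
        'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
        'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
        'ß': 'ss',
    }
    for umlaut, ascii_form in replacements.items():
        text = text.replace(umlaut, ascii_form)
    return text
```

Alternatively, under Python 2 the script would need unicode literals (`u'ä'`) and explicitly decoded input.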

UnpicklingError when using your model

Hey 👋
I'm currently trying to get a sarcasm classifier (link) running. I have the code and the original corpus, but I'm missing the German w2v model to generate the feature vectors.
After some googling I found your model (german.model). However, I get the following error message when I include it in the code:

Traceback (most recent call last):
  File "/Users/marisaschmidt/Desktop/sarcasm-detection/Classifiers/adversialsatire/src/classifier.py", line 6, in <module>
    import keras, retriever, argparse, os, numpy, evaluation, feature_extractors
  File "/Users/marisaschmidt/Desktop/sarcasm-detection/Classifiers/adversialsatire/src/retriever.py", line 6, in <module>
    import os, feature_extractors, numpy, classifier, random, sqlite3
  File "/Users/marisaschmidt/Desktop/sarcasm-detection/Classifiers/adversialsatire/src/feature_extractors.py", line 14, in <module>
    vector_model = gensim.models.Word2Vec.load(w2v_model).wv
  File "/opt/anaconda3/envs/ba/lib/python3.6/site-packages/gensim/models/word2vec.py", line 975, in load
    return super(Word2Vec, cls).load(*args, **kwargs)
  File "/opt/anaconda3/envs/ba/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 629, in load
    model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
  File "/opt/anaconda3/envs/ba/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 278, in load
    return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
  File "/opt/anaconda3/envs/ba/lib/python3.6/site-packages/gensim/utils.py", line 425, in load
    obj = unpickle(fname)
  File "/opt/anaconda3/envs/ba/lib/python3.6/site-packages/gensim/utils.py", line 1332, in unpickle
    return _pickle.load(f, encoding='latin1')
_pickle.UnpicklingError: invalid load key, '6'.

Do you have any idea where the problem lies?

Best regards!

Use already trained model

Hello, I am very new and I don't know how to load the german.model file. Could you give further information on which library was used to save it and how to load it?

plain text to segments

I have plain German text without any full stops or punctuation. Is it possible to get sentence segments with this library?

Function deprecated

training.py contains a deprecated word2vec function call for saving the model after training:

Traceback (most recent call last):
  File "training.py", line 56, in <module>
    model.save_word2vec_format(args.target, binary=True)
  File "/home/timo/.local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1312, in save_word2vec_format
    raise DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead.")
DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead.

This should be updated.
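A minimal sketch of the fix: Gensim moved the word2vec I/O onto the KeyedVectors object held in `model.wv`, as the warning itself suggests (the wrapper function is hypothetical, shown only to illustrate the changed call):

```python
def save_vectors(model, target):
    # Deprecated: model.save_word2vec_format(target, binary=True)
    # The same I/O now lives on the KeyedVectors object:
    model.wv.save_word2vec_format(target, binary=True)
```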

Choose model type based on file extension

The model type could be chosen based on the file extension instead of always forcing binary, which would make this a generic embeddings evaluation system.

(https://github.com/devmount/GermanWordEmbeddings/blob/master/evaluation.py#L329)

trained_model = gensim.models.KeyedVectors.load_word2vec_format(args.model.strip(), binary=True)

If it's a text file, binary can be set to False (its default value); otherwise the current flow can continue unchanged.
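A sketch of the suggested change; the set of extensions treated as plain text is an assumption:

```python
import os

def model_is_binary(path):
    # Treat files with a known plain-text extension as text vectors and
    # everything else as word2vec binary format.
    text_extensions = {'.txt', '.vec'}
    return os.path.splitext(path)[1].lower() not in text_extensions

# trained_model = gensim.models.KeyedVectors.load_word2vec_format(
#     args.model.strip(), binary=model_is_binary(args.model.strip()))
```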

Error trying to load "german.model"

Hi!

Thanks for making these embeddings available :) With which version of gensim did you save the model? I'm having problems loading it, e.g.:

>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load("german.model")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dsbatista/virtual_envs/python2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1412, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/Users/dsbatista/virtual_envs/python2/lib/python2.7/site-packages/gensim/utils.py", line 276, in load
    obj = unpickle(fname)
  File "/Users/dsbatista/virtual_envs/python2/lib/python2.7/site-packages/gensim/utils.py", line 940, in unpickle
    return _pickle.loads(f.read())
cPickle.UnpicklingError: invalid load key, '6'.
>>> gensim.__version__
'2.3.0'

Python 2.7.10

Tests should be lowercased

Problem

I was using the repository to test my trained model. One issue I faced (which could be improved) is that all test queries are capitalized, while for many reasons someone may have lowercased every single word during training.

Improvement

Add an input parameter that lets the user choose whether to lowercase every query.
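A sketch of such a parameter on top of the existing argparse setup (the flag name and helper are hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('model', help='path of the model to evaluate')
parser.add_argument('-l', '--lowercase', action='store_true',
                    help='lowercase every test query before lookup')
args = parser.parse_args(['german.model', '--lowercase'])  # example invocation

def prepare_query(word, lowercase):
    # Applied to each test word before it is looked up in the model.
    return word.lower() if lowercase else word
```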

Error in preprocessing.py

sanket@QADeepL:~/DeepL/WordEmbeddings$ python GermanWordEmbeddings/preprocessing.py dewiki.xml corpus/dewiki.corpus -psub
Using TensorFlow backend.
Traceback (most recent call last):
  File "GermanWordEmbeddings/preprocessing.py", line 107, in <module>
    logging.info('preprocessing of {} sentences finished!'.format(i))
NameError: name 'i' is not defined
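The variable `i` is only bound inside the corpus loop, so it is undefined when the loop body never ran (for instance on an empty or unreadable input). A sketch of the likely fix, with the loop shape assumed from the traceback:

```python
import logging

def preprocess(sentences):
    # i is only bound inside the loop; initialize it so the final log
    # statement works even when the corpus yields no sentences at all.
    i = 0
    for i, sentence in enumerate(sentences, start=1):
        pass  # ... per-sentence preprocessing goes here ...
    logging.info('preprocessing of {} sentences finished!'.format(i))
    return i
```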

model evaluation & two error messages

Great script!!!
After loading german.model the evaluation yielded nice results:
nouns: SI/PL: 26.8% (134/500), 60.0% (300/500), 100.0% (500/500)
nouns: PL/SI: 21.2% (106/500), 53.8% (269/500), 100.0% (500/500)
adjectives: GR/KOM: 40.8% (194/475), 69.9% (332/475), 95.0% (475/500)
adjectives: KOM/GR: 36.1% (170/471), 64.1% (302/471), 94.2% (471/500)
adjectives: GR/SUP: 10.3% (27/262), 31.3% (82/262), 52.4% (262/500)
adjectives: SUP/GR: 7.6% (21/276), 21.4% (59/276), 55.2% (276/500)
adjectives: KOM/SUP: 23.4% (60/256), 55.1% (141/256), 51.2% (256/500)
adjectives: SUP/KOM: 31.6% (83/263), 54.4% (143/263), 52.6% (263/500)
verbs (pres): INF/1SP: 78.0% (320/410), 95.6% (392/410), 82.0% (410/500)
verbs (pres): 1SP/INF: 71.8% (296/412), 93.7% (386/412), 82.4% (412/500)
verbs (pres): INF/2PP: 56.7% (208/367), 74.4% (273/367), 73.4% (367/500)
verbs (pres): 2PP/INF: 65.5% (237/362), 76.2% (276/362), 72.4% (362/500)
verbs (pres): 1SP/2PP: 54.4% (185/340), 65.3% (222/340), 68.0% (340/500)
verbs (pres): 2PP/1SP: 62.3% (223/358), 76.5% (274/358), 71.6% (358/500)
verbs (past): INF/3SV: 49.2% (210/427), 91.1% (389/427), 85.4% (427/500)
verbs (past): 3SV/INF: 40.3% (173/429), 92.8% (398/429), 85.8% (429/500)
verbs (past): INF/3PV: 78.4% (352/449), 97.1% (436/449), 89.8% (449/500)
verbs (past): 3PV/INF: 84.6% (369/436), 97.0% (423/436), 87.2% (436/500)
verbs (past): 3SV/3PV: 78.1% (356/456), 93.4% (426/456), 91.2% (456/500)
verbs (past): 3PV/3SV: 85.3% (390/457), 95.6% (437/457), 91.4% (457/500)
total correct: 52.0% (4114/7906)
total top 10: 75.4% (5960/7906)
total coverage: 79.1% (7906/10000)

However, when I tried model.similarity('wort', 'klang') I got: KeyError("word '%s' not in vocabulary" % word). And running python vocabulary.py DEmodel DEmodel.vocab gave me: 'ascii' codec can't encode character u'\u2020' in position 0: ordinal not in range(128).
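The distributed model appears to keep original casing (German nouns capitalized) and transliterates umlauts to ae/oe/ue, so 'Wort' is likely in the vocabulary while 'wort' is not. A guarded lookup avoids the KeyError (a sketch; `in` works on KeyedVectors in both old and new Gensim versions):

```python
def safe_similarity(vectors, w1, w2):
    # Return None instead of raising KeyError when a word is missing.
    # Query with original casing and transliterated umlauts,
    # e.g. 'Wort' and 'Klang' rather than 'wort' and 'klang'.
    if w1 not in vectors or w2 not in vectors:
        return None
    return vectors.similarity(w1, w2)
```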

Invalid Load Key Error

I tried loading your model in gensim, but got the error:

  return _pickle.load(f, encoding='latin1')
_pickle.UnpicklingError: invalid load key, '6'.

I plan to use your model and further train it on my closed-domain application.

Provide citation info

I am using your work for writing my bachelor thesis, and on GitHub pages, I read that you created this project in the context of your bachelor thesis. Is it possible to cite you?

Unless instructed differently, I will link to your GitHub repository, but it would be great if you could add a DOI badge to your repo.

german.model done with skip-gram or CBOW?

You say: 'I found the following parameter configuration to be optimal to train German language models with word2vec:
  • a corpus as big as possible (and as diverse as possible without being informal)
  • filtering of punctuation and stopwords
  • forming bigram tokens
  • using skip-gram as training algorithm with hierarchical softmax
  • window size between 5 and 10
  • dimensionality of feature vectors of 300 or more
  • using negative sampling with 10 samples
  • ignoring all words with total frequency lower than 50'

So you used skip-gram and not CBOW for training. What were the differences?
Can one also access the bigram frequencies from german.model?

saving full model?

Great work you have done here in this repo, thanks a lot! Would you mind saving your model as a full model, in order to be able to continue training it with new words?
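For reference, the difference between the two save formats in Gensim (a sketch; `model` stands for a trained Word2Vec instance):

```python
def save_model(model, full_path, vectors_path):
    # The full model keeps the training state and can be reloaded with
    # Word2Vec.load() to continue training; the word2vec binary format
    # keeps only the vectors and is smaller, but is not trainable.
    model.save(full_path)
    model.wv.save_word2vec_format(vectors_path, binary=True)
```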
