
it_vectors_wiki_spacy's Introduction

Italian word embeddings

Data source

The source for the data is the Italian Wikipedia, downloaded from Wikipedia Dumps.

Preprocessing

The goal is to produce a single text file with the content of the Wikipedia pages, tokenized and joined with single spaces. The usual approach to tokenization is to remove punctuation, but I also want word embeddings for punctuation, because I don't want to discard any information provided by an input sentence. To produce this kind of input, and to keep the tokenization used to train the word embeddings aligned with the tokenization I use at runtime, I chose SpaCy for its power and speed. SpaCy ships with word embeddings of this kind for the English language.
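A minimal sketch of this kind of tokenization (not the repository's exact preprocessing script; the function name is illustrative):

import spacy

nlp = spacy.blank('it')  # blank Italian pipeline: rule-based tokenizer only

def whitespace_tokenized(line):
    # Keep punctuation as separate tokens so it also gets an embedding;
    # drop pure-whitespace tokens to obtain a clean single-space join.
    return ' '.join(t.text for t in nlp(line) if not t.is_space)

print(whitespace_tokenized("L'Italia è una repubblica."))
# e.g. "L' Italia è una repubblica ."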

Two types of preprocessing have been tried:

  1. using spacy-dev-resources
  2. using wikiextractor + SpaCy for tokenization

Training word embeddings

GloVe is used to train the vectors and produce a text file in the following format:

number_of_vectors vector_length
WORD1 values_of_word_1
WORD2 values_of_word_2
...
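For reference, a minimal sketch of reading this text format (the file name vectors.txt is a placeholder):

import numpy as np

vectors = {}
with open('vectors.txt', encoding='utf-8') as f:
    n_vectors, vector_length = map(int, f.readline().split())
    for line in f:
        parts = line.rstrip('\n').split(' ')
        vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

assert len(vectors) == n_vectors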

Preparing SpaCy vectors

From the text-file representation of the word embeddings, a binary representation is built, ready to be loaded into SpaCy.

The whole SpaCy model (a blank Italian nlp + the word vectors) is saved and packaged using the script number 3.
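A minimal sketch of this step, assuming spaCy v2.x (this is not the repository's script, which also handles packaging):

import numpy as np
import spacy

nlp = spacy.blank('it')
with open('vectors.txt', encoding='utf-8') as f:
    f.readline()  # skip the 'number_of_vectors vector_length' header
    for line in f:
        parts = line.rstrip('\n').split(' ')
        nlp.vocab.set_vector(parts[0], np.asarray(parts[1:], dtype='float32'))

nlp.to_disk('it_vectors_wiki_lg')  # a blank Italian nlp + the word vectors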

Using the model

Option 1: run the preceding steps to train the vectors, then load them with nlp.vocab.vectors.from_disk('path').

Option 2: install the complete model from the latest release with pip, using the following command:

pip install -U https://github.com/MartinoMensio/it_vectors_wiki_spacy/releases/download/v1.0.1/it_vectors_wiki_lg-1.0.1.tar.gz

then simply load the model in SpaCy with nlp = spacy.load('it_vectors_wiki_lg').
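A quick usage check (the example sentence is arbitrary; the vectors are 300-dimensional):

import spacy

nlp = spacy.load('it_vectors_wiki_lg')
doc = nlp("Roma è la capitale d'Italia.")
print(doc[0].vector.shape)        # (300,)
print(doc[0].similarity(doc[5]))  # similarity between 'Roma' and 'Italia'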

If you want to use the vectors in another environment (outside SpaCy), you can find the raw embeddings in the vectors-1.0 release.

Evaluation

The questions-words-ITA.txt file comes from http://hlt.isti.cnr.it/wordembeddings/ and is part of the paper:

@inproceedings{berardi2015word,
  title={Word Embeddings Go to Italy: A Comparison of Models and Training Datasets.},
  author={Berardi, Giacomo and Esuli, Andrea and Marcheggiani, Diego},
  booktitle={IIR},
  year={2015}
}

The preprocessing together with the newer Wikipedia dump gives the following result (script accuracy.py): 58.14% accuracy, which seems an improvement over the scores reported in the paper.
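For reference, a minimal sketch of this kind of analogy evaluation (not the repository's accuracy.py; it assumes a vocabulary list words and a row-wise L2-normalized matrix of their vectors):

import numpy as np

def analogy_accuracy(questions, words, matrix):
    # questions: iterable of (a, b, c, d) word tuples; predict the word
    # whose vector is closest to vec(b) - vec(a) + vec(c), count hits on d
    index = {w: i for i, w in enumerate(words)}
    correct = total = 0
    for a, b, c, d in questions:
        if not all(w in index for w in (a, b, c, d)):
            continue  # skip out-of-vocabulary questions
        target = matrix[index[b]] - matrix[index[a]] + matrix[index[c]]
        scores = matrix @ (target / np.linalg.norm(target))
        for w in (a, b, c):
            scores[index[w]] = -np.inf  # exclude the question words
        correct += int(words[int(scores.argmax())] == d)
        total += 1
    return correct / total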

it_vectors_wiki_spacy's People

Contributors

martinomensio

Forkers

elegagli

it_vectors_wiki_spacy's Issues

error: bad escape \p at position 257

Hi, I downloaded the vectors, but when I try to load them with:

model = spacy.load('it_vectors_wiki_lg')

it gives me the error:

error: bad escape \p at position 257

Warnings with spacy version 2.3.4

Hi Martino,

I've seen your implementation and it is really useful. Great work!

My environment runs with python 3.7.9 and spacy 2.3.4.
I've installed the package as you suggested in the readme, so with this command

pip install -U https://github.com/MartinoMensio/it_vectors_wiki_spacy/releases/download/v1.0.1/it_vectors_wiki_lg-1.0.1.tar.gz

I am able to run spaCy with your embeddings, but I get the following warnings.

.../python3.7/site-packages/spacy/util.py:275: UserWarning: [W031] Model 'it_vectors_wiki_lg' (1.0.1) requires spaCy v2.1 and is incompatible with the current spaCy version (2.3.4). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate

.../python3.7/site-packages/spacy/_ml.py:287: UserWarning: [W020] Unnamed vectors. This won't allow multiple vectors models to be loaded. (Shape: (962148, 300))

Even though it seems to run smoothly, could you check whether the model is running correctly? Or, even better, recreate the model and publish a new release for this spaCy version?

Thanks!
Best regards,
Nicola

Preprocessing step + tokenization: handled by the model, or to be done before submitting text?

Hi Martino,

I've read that you created this spaCy model starting from a blank Italian nlp plus word vectors. Does your spaCy wrapper already manage the preprocessing and tokenization? I mean, when submitting a text to spaCy:

model = spacy.load('it_vectors_wiki_lg')
result = model(text)

Is your model already able to execute the proper preprocessing and tokenization, or does the text have to be a preprocessed and tokenized string?

I was wondering because I read from this url (section Tokenization) that in general spaCy already manages preprocessing and tokenization. Since you created a custom model, does the tokenization (and preprocessing) work as well as with the other spaCy Italian models?
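For reference, a blank pipeline plus vectors still includes spaCy's rule-based Italian tokenizer, so raw text can be passed directly (a minimal check):

import spacy

nlp = spacy.load('it_vectors_wiki_lg')
doc = nlp('Non è necessario tokenizzare il testo in anticipo.')
print([token.text for token in doc])  # tokens produced by the model itself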
