
it_vectors_wiki_spacy's Introduction

Italian word embeddings

Data source

The source for the data is the Italian Wikipedia, downloaded from Wikipedia Dumps.

Preprocessing

The goal is to produce a single text file with the content of the Wikipedia pages, tokenized and joined with single spaces. The usual approach to tokenization is to remove punctuation, but I also want word embeddings for punctuation, because I don't want to discard any information provided by an input sentence. To produce this kind of input, and to keep the tokenization used to train the word embeddings aligned with the tokenization I use at runtime, I chose SpaCy for its power and speed. SpaCy ships with word embeddings of this kind for the English language.
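A minimal sketch of this kind of tokenization (not the repository's exact preprocessing script; the function name is illustrative):

import spacy

nlp = spacy.blank('it')  # blank Italian pipeline: rule-based tokenizer only

def whitespace_tokenized(line):
    # Keep punctuation as separate tokens so it also gets an embedding;
    # drop pure-whitespace tokens to obtain a clean single-space join.
    return ' '.join(t.text for t in nlp(line) if not t.is_space)

print(whitespace_tokenized("L'Italia è una repubblica."))
# e.g. "L' Italia è una repubblica ."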

Two types of preprocessing have been tried:

  1. using spacy-dev-resources
  2. using wikiextractor + SpaCy for tokenization

Training word embeddings

GloVe is used to train the vectors and produce a text file in the following format:

number_of_vectors vector_length
WORD1 values_of_word_1
WORD2 values_of_word_2
...
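For reference, a minimal sketch of reading this text format (the file name vectors.txt is a placeholder):

import numpy as np

vectors = {}
with open('vectors.txt', encoding='utf-8') as f:
    n_vectors, vector_length = map(int, f.readline().split())
    for line in f:
        parts = line.rstrip('\n').split(' ')
        vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

assert len(vectors) == n_vectors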

Preparing SpaCy vectors

From the text-file representation of the word embeddings, a binary representation is built, ready to be loaded into SpaCy.

The whole SpaCy model (a blank Italian nlp + the word vectors) is saved and packaged using the script number 3.
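A minimal sketch of this step, assuming spaCy v2.x (this is not the repository's script, which also handles packaging):

import numpy as np
import spacy

nlp = spacy.blank('it')
with open('vectors.txt', encoding='utf-8') as f:
    f.readline()  # skip the 'number_of_vectors vector_length' header
    for line in f:
        parts = line.rstrip('\n').split(' ')
        nlp.vocab.set_vector(parts[0], np.asarray(parts[1:], dtype='float32'))

nlp.to_disk('it_vectors_wiki_lg')  # a blank Italian nlp + the word vectors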

Using the model

Option 1: run the preceding steps to train the vectors, then load them with nlp.vocab.vectors.from_disk('path').

Option 2: install the complete model from the latest release with pip, using the following command:

pip install -U https://github.com/MartinoMensio/it_vectors_wiki_spacy/releases/download/v1.0.1/it_vectors_wiki_lg-1.0.1.tar.gz

then simply load the model in SpaCy with nlp = spacy.load('it_vectors_wiki_lg').
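A quick usage check (the example sentence is arbitrary; the vectors are 300-dimensional):

import spacy

nlp = spacy.load('it_vectors_wiki_lg')
doc = nlp("Roma è la capitale d'Italia.")
print(doc[0].vector.shape)        # (300,)
print(doc[0].similarity(doc[5]))  # similarity between 'Roma' and 'Italia'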

If you want to use the vectors in another environment (outside SpaCy), you can find the raw embeddings in the vectors-1.0 release.

Evaluation

The questions-words-ITA.txt file comes from http://hlt.isti.cnr.it/wordembeddings/ and is part of the paper:

@inproceedings{berardi2015word,
  title={Word Embeddings Go to Italy: A Comparison of Models and Training Datasets.},
  author={Berardi, Giacomo and Esuli, Andrea and Marcheggiani, Diego},
  booktitle={IIR},
  year={2015}
}

The preprocessing together with the newer Wikipedia dump gives the following result (script accuracy.py): 58.14% accuracy, which seems an improvement over the scores reported in the paper.
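For reference, a minimal sketch of this kind of analogy evaluation (not the repository's accuracy.py; it assumes a vocabulary list words and a row-wise L2-normalized matrix of their vectors):

import numpy as np

def analogy_accuracy(questions, words, matrix):
    # questions: iterable of (a, b, c, d) word tuples; predict the word
    # whose vector is closest to vec(b) - vec(a) + vec(c), count hits on d
    index = {w: i for i, w in enumerate(words)}
    correct = total = 0
    for a, b, c, d in questions:
        if not all(w in index for w in (a, b, c, d)):
            continue  # skip out-of-vocabulary questions
        target = matrix[index[b]] - matrix[index[a]] + matrix[index[c]]
        scores = matrix @ (target / np.linalg.norm(target))
        for w in (a, b, c):
            scores[index[w]] = -np.inf  # exclude the question words
        correct += int(words[int(scores.argmax())] == d)
        total += 1
    return correct / total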

it_vectors_wiki_spacy's People

Contributors

martinomensio

Forkers

elegagli

it_vectors_wiki_spacy's Issues

error: bad escape \p at position 257

Hi, I downloaded the vectors, but when I try to load them with:

model = spacy.load('it_vectors_wiki_lg')

it gives me the error:

error: bad escape \p at position 257

Warnings with spacy version 2.3.4

Hi Martino,

I've seen your implementation and it is really useful. Great work!

My environment runs with python 3.7.9 and spacy 2.3.4.
I've installed the package as you suggested in the readme, so with this command

pip install -U https://github.com/MartinoMensio/it_vectors_wiki_spacy/releases/download/v1.0.1/it_vectors_wiki_lg-1.0.1.tar.gz

I am able to run spaCy with your embeddings, but I get the following warnings.

.../python3.7/site-packages/spacy/util.py:275: UserWarning: [W031] Model 'it_vectors_wiki_lg' (1.0.1) requires spaCy v2.1 and is incompatible with the current spaCy version (2.3.4). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate

.../python3.7/site-packages/spacy/_ml.py:287: UserWarning: [W020] Unnamed vectors. This won't allow multiple vectors models to be loaded. (Shape: (962148, 300))

Even though it seems to run smoothly, could you check whether the model is running correctly? Or, even better, recreate the model and publish a new release for this spaCy version?

Thanks!
Best regards,
Nicola

Preprocessing step + tokenization: handled by the model, or to be done before submitting text?

Hi Martino,

I've read that you created this spaCy model starting from a blank Italian nlp plus word vectors. Does your spaCy wrapper already manage the preprocessing and tokenization? I mean, when submitting a text to spaCy:

model = spacy.load('it_vectors_wiki_lg')
result = model(text)

Is your model already able to execute the proper preprocessing and tokenization, or does the text have to be a preprocessed and tokenized string?

I was wondering because I read from this url (section Tokenization) that in general spaCy already manages preprocessing and tokenization. Since you created a custom model, does the tokenization (and preprocessing) work as well as with the other spaCy Italian models?
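For reference, a blank pipeline plus vectors still includes spaCy's rule-based Italian tokenizer, so raw text can be passed directly (a minimal check):

import spacy

nlp = spacy.load('it_vectors_wiki_lg')
doc = nlp('Non è necessario tokenizzare il testo in anticipo.')
print([token.text for token in doc])  # tokens produced by the model itself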
