I am having problems running the script. I created my environment from the `muda_env.yml` file, and when I test on a small document I run into a "coref error". It seems to be related to multi-word token expansion: for example, the Spanish contraction "al" is split into the two words "a" and "el". See the log below for a full example.
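For what it's worth, the expansion itself can be reproduced with stanza alone. This is a minimal sketch, assuming stanza 1.6.0 and the default Spanish `ancora` package (both of which the log below reports loading); the sentence is taken from one of the warnings:

```python
import stanza

# Default Spanish models, per the log below (downloads on first run).
stanza.download("es", verbose=False)
nlp = stanza.Pipeline("es", processors="tokenize,mwt", verbose=False)

doc = nlp("Al menos 10 personas han muerto.")
for token in doc.sentences[0].tokens:
    # A multi-word token such as "Al" covers two syntactic words,
    # so the surface token no longer maps 1:1 onto the word list.
    print(token.text, "->", [word.text for word in token.words])
# Al -> ['A', 'el']
# menos -> ['menos']
# ...
```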
```
2024-01-10 16:53:32 INFO: Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 22.0MB/s]
2024-01-10 16:53:33 INFO: Loading these models for language: en (English):
=================================
| Processor | Package |
---------------------------------
| tokenize | combined |
| pos | combined_charlm |
| lemma | combined_nocharlm |
| depparse | combined_charlm |
=================================
2024-01-10 16:53:33 INFO: Using device: cuda
2024-01-10 16:53:33 INFO: Loading: tokenize
2024-01-10 16:53:35 INFO: Loading: pos
2024-01-10 16:53:36 INFO: Loading: lemma
2024-01-10 16:53:36 INFO: Loading: depparse
2024-01-10 16:53:36 INFO: Done loading processors!
2024-01-10 16:53:36 INFO: Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 21.0MB/s]
2024-01-10 16:53:37 WARNING: Language es package default expects mwt, which has been added
2024-01-10 16:53:38 INFO: Loading these models for language: es (Spanish):
===============================
| Processor | Package |
-------------------------------
| tokenize | ancora |
| mwt | ancora |
| pos | ancora_charlm |
| lemma | ancora_nocharlm |
| depparse | ancora_charlm |
===============================
2024-01-10 16:53:38 INFO: Using device: cuda
2024-01-10 16:53:38 INFO: Loading: tokenize
2024-01-10 16:53:38 INFO: Loading: mwt
2024-01-10 16:53:38 INFO: Loading: pos
2024-01-10 16:53:38 INFO: Loading: lemma
2024-01-10 16:53:38 INFO: Loading: depparse
2024-01-10 16:53:38 INFO: Done loading processors!
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Due to multiword token expansion or an alignment issue, the original text has been replaced by space-separated expanded tokens.
docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Varios', 'enmascarados', 'irrumpieron', 'en', 'el', 'estudio', 'de', 'el', 'canal', 'público', 'TC', 'durante', 'una', 'emisión', ',', 'obligando', 'a', 'el', 'personal', 'a', 'tirar', 'se', 'a', 'el', 'suelo', '.']
Entities: []
docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['A', 'el', 'menos', '10', 'personas', 'han', 'muerto', 'desde', 'que', 'el', 'lunes', 'se', 'declarara', 'el', 'estado', 'de', 'excepción', 'en', 'Ecuador', '.']
Entities: []
docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Este', 'se', 'declaró', 'después', 'de', 'que', 'un', 'famoso', 'gánster', 'desapareciera', 'de', 'su', 'celda', 'en', 'prisión', '.', 'No', 'está', 'claro', 'si', 'el', 'incidente', 'en', 'el', 'estudio', 'de', 'televisión', 'de', 'Guayaquil', 'está', 'relacionado', 'con', 'la', 'desaparición', 'de', 'una', 'prisión', 'de', 'la', 'misma', 'ciudad', 'de', 'el', 'jefe', 'de', 'la', 'banda', 'de', 'los', 'Choneros', ',', 'Adolfo', 'Macías', 'Villamar', ',', 'o', 'Fito', ',', 'como', 'es', 'más', 'conocido', '.']
Entities: []
docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['En', 'el', 'vecino', 'Perú', ',', 'el', 'gobierno', 'ordenó', 'el', 'despliegue', 'inmediato', 'de', 'una', 'fuerza', 'policial', 'en', 'la', 'frontera', 'para', 'evitar', 'que', 'la', 'inestabilidad', 'se', 'extienda', 'a', 'el', 'país', '.']
Entities: []
docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Ecuador', 'es', 'uno', 'de', 'los', 'principales', 'exportadores', 'de', 'plátano', 'de', 'el', 'mundo', ',', 'pero', 'también', 'exporta', 'petróleo', ',', 'café', ',', 'cacao', ',', 'camarones', 'y', 'productos', 'pesqueros', '.', 'El', 'aumento', 'de', 'la', 'violencia', 'en', 'el', 'país', 'andino', ',', 'dentro', 'y', 'fuera', 'de', 'sus', 'prisiones', ',', 'se', 'ha', 'vinculado', 'a', 'los', 'enfrentamientos', 'entre', 'cárteles', 'de', 'la', 'droga', ',', 'tanto', 'extranjeros', 'como', 'locales', ',', 'por', 'el', 'control', 'de', 'las', 'rutas', 'de', 'la', 'cocaína', 'hacia', 'Estados', 'Unidos', 'y', 'Europa', '.']
Entities: []
docs = (self._ensure_doc(text) for text in texts)
Loading the dataset...
Extracting: 9it [00:00, 23.15it/s]
Some weights of BertModel were not initialized from the model checkpoint at SpanBERT/spanbert-large-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
coref error
coref error
```
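Since the warnings come from `spacy/language.py`, I assume the stanza pipeline is being wrapped through spacy-stanza. If so, the mismatch is visible directly on the `Doc` (a hedged sketch of a possible repro; MuDA's actual call site may differ, and the `processors` list here just mirrors the log):

```python
import spacy_stanza

# Wrap the same Spanish stanza pipeline that the log shows loading.
nlp = spacy_stanza.load_pipeline(
    "es", processors="tokenize,mwt,pos,lemma,depparse"
)
doc = nlp("Al menos 10 personas han muerto.")

# Per the UserWarning above, the original text is replaced by the
# space-separated expanded tokens, so the Doc no longer matches the
# input string ("Al" has become "A el"):
print(doc.text)  # something like "A el menos 10 personas han muerto ."
```

If character offsets downstream are computed against this rebuilt text, they no longer map back to the original string, which would explain why the entity spans are dropped (`Entities: []` in every warning) and, presumably, why the coreference step then fails with the "coref error" above. Is there a recommended way to run the Spanish pipeline so that the coref component survives the MWT expansion?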