Comments (6)
Never mind, I saw someone else's comment about marian-vocab. It might be useful to add that to the documentation somewhere? Thanks!
from opus-mt-train.
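For anyone landing here: marian-vocab reads tokenized text on stdin and writes a YAML token-to-id map to stdout. As a rough illustration of the kind of mapping it produces, here is a toy Python sketch (the reserved `</s>`/`<unk>` ids follow Marian's convention; the function name and data are made up for illustration, this is not the real tool):

```python
from collections import Counter

def build_marian_style_vocab(lines, max_size=32000):
    """Toy approximation of marian-vocab's output: reserved tokens first,
    then remaining tokens by descending frequency, as a token -> id map."""
    counts = Counter(tok for line in lines for tok in line.split())
    vocab = {"</s>": 0, "<unk>": 1}  # Marian's reserved entries
    for tok, _ in counts.most_common():
        if tok in vocab:
            continue
        if len(vocab) >= max_size:
            break
        vocab[tok] = len(vocab)
    return vocab

# Concatenating source and target text before building gives one shared vocab:
src = ["das ist ein test", "noch ein test"]
trg = ["this is a test", "another test"]
vocab = build_marian_style_vocab(src + trg)
print(len(vocab))  # 11: two reserved tokens + nine unique tokens
```

The point is simply that both language sides go into a single shared vocab, which is why the combined size can be much larger than either side alone.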
Actually, I'm going to reopen this as I'm still confused. I successfully created a vocab.yml file by concatenating the source and target vocabs and passing them to marian-vocab, but when I try to convert my packaged model to Huggingface I get an error (the converter's assertion message is formatted as `Original vocab size {opus_state.cfg['vocab_size']} and new vocab size {len(tokenizer.encoder)} mismatched`):

AssertionError: Original vocab size 32001 and new vocab size 61724 mismatched

Is there something I'm missing here? Should I ask this question over at Huggingface?
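The mismatch is what the message says: the model was trained with a 32k SentencePiece vocab (32001 entries), while the concatenated source+target vocab has 61724 entries. A minimal sketch of the check the converter performs (function and variable names here are illustrative, not the actual Huggingface converter code):

```python
def check_vocab_size(model_vocab_size, vocab):
    """Mimic the converter's assertion: the number of entries in vocab.yml
    must equal the vocab_size recorded in the model's config."""
    assert model_vocab_size == len(vocab), (
        f"Original vocab size {model_vocab_size} and "
        f"new vocab size {len(vocab)} mismatched"
    )

# The trained model's config says 32001, but a vocab built by concatenating
# the source and target vocabs can easily be much larger:
trained_size = 32001                                 # from the model config
merged_vocab = {f"tok{i}": i for i in range(61724)}  # hypothetical merged vocab

try:
    check_vocab_size(trained_size, merged_vocab)
except AssertionError as err:
    print(err)  # Original vocab size 32001 and new vocab size 61724 mismatched
```

So the fix has to come from the data side: the vocab file handed to the converter must be the one the model was actually trained with, not a larger merged one.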
Looked at the commits and figured out that the old .yml vocab can be produced by using the USE_SPM_VOCAB=0 flag when creating the data. Does that mean that, for now, I should train all models with that flag if I want to port them over to Huggingface?
Yes, sorry, this is what you need for now, but I will talk to the people at Huggingface about also supporting the plain-text vocab files that are taken from the SentencePiece models. It's a bit of a moving target.
Great, thanks. Happy to follow up with them or help with the porting if I can. Feel free to leave this issue up or close it if you think it's not directly relevant for this project, up to you.
Changed it now to have USE_SPM_VOCAB=0 as the default. That seems more backward compatible with everything ...
Related Issues (20)
- What's the dataset used for training opus-mt-en-de
- Language Code Difference
- What is tatoeba-langtune?
- Preprocessing Script Question
- Korean Finetuning
- Multilingual Tuned Model Translating everything to "sssssssss"
- What could cause widely varying inference time when using pre-trained opus-mt-en-fr model with python transformers library?
- Wrong tokenizer/vocab for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model
- How to translate from english to Japan?
- Using OPUS-MT with DeepSpeed
- update Dockerfile.gpu--fixed
- different sizes of dictionaries in different models
- Reproduced crash on Opus-mt-en-de model using string "J" and "J-10"
- Unable to find current origin/master revision in submodule path
- Hyperparameters used for pretrained models?
- how to train our dataset
- Unbelievably High BLEU scores from finetuning...
- Data for Brazilian Portuguese
- Lack of transparency on used training data. - Does finetuning make sense?
- preprocess.sh [: ==: unary operator expected