Hi, here is the case. I pretrained a language model on English

Is it possible to extend a trained BPE model's merge operations? (subword-nmt, open issue)

pluiez commented on May 20, 2024

Is it possible to extend a trained BPE model's merge operations?

Comments (2)

rsennrich commented on May 20, 2024

Technically, you can just concatenate the two BPE files (called codes_file in the README), and this should achieve your desired result. I did this back in 2015 to combine Cyrillic and Latin merge operations for Russian. Three things to pay attention to:

The first line of the file gives some version info. You can remove this line from the second file before you append it to the first.
The order of the files matters, since you will get different segmentations depending on the order of the merge operations.
If there's Latin-alphabet text in the Japanese file, there is a chance that the English tokenization changes in rare cases. To prevent this, you'd have to use only the first 32000 merge operations for English text.
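
The steps above could be sketched roughly as follows. This is a minimal illustration, not part of subword-nmt: the helper names (read_merges, concat_codes, head_codes) and the file names in the usage note are hypothetical, and the sketch assumes that both codes files are UTF-8 text with one merge operation per line and an optional first line starting with "#version:". Refusing to mix files with different version lines is an extra precaution added here, not something stated above.

```python
from pathlib import Path

VERSION_PREFIX = "#version:"  # assumed form of the version line in a codes file


def read_merges(codes_path):
    """Return (version_line_or_None, merge_lines) for one codes file."""
    lines = Path(codes_path).read_text(encoding="utf-8").splitlines()
    version = None
    if lines and lines[0].startswith(VERSION_PREFIX):
        version, lines = lines[0], lines[1:]
    # Skip blank lines so that they are not mistaken for merge operations.
    return version, [line for line in lines if line.strip()]


def concat_codes(first_path, second_path, out_path):
    """Append the merges of the second file to those of the first.

    Only the version line of the first file is kept. The merges of the
    first file come first, so they keep their priority.
    Returns (number of merges from first file, number from second file).
    """
    version_a, merges_a = read_merges(first_path)
    version_b, merges_b = read_merges(second_path)
    # Precaution (an assumption of this sketch): do not mix format versions.
    if version_a != version_b:
        raise ValueError(f"version mismatch: {version_a!r} vs {version_b!r}")
    header = [version_a] if version_a else []
    out_lines = header + merges_a + merges_b
    Path(out_path).write_text("\n".join(out_lines) + "\n", encoding="utf-8")
    return len(merges_a), len(merges_b)


def head_codes(codes_path, num_merges, out_path):
    """Write a codes file containing only the first num_merges merges."""
    version, merges = read_merges(codes_path)
    header = [version] if version else []
    out_lines = header + merges[:num_merges]
    Path(out_path).write_text("\n".join(out_lines) + "\n", encoding="utf-8")
```

For example, concat_codes("en.codes", "ja.codes", "joint.codes") would put the English merges first, and head_codes("joint.codes", 32000, "en_only.codes") would then give a file with only the first 32000 merges for segmenting English text. As far as I know, apply_bpe also accepts a --merges option that limits how many operations are applied, which may make the second helper unnecessary.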

pluiez commented on May 20, 2024

Thank you very much!
