
sonar's Introduction

SONAR

[Paper] [Demo]

We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks.

Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. We also provide a single text decoder, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations.

SONAR stands for Sentence-level multimOdal and laNguage-Agnostic Representations

The full list of supported languages (along with download links) can be found below.

SONAR Architecture:


Text results


Speech results


Installing

You can install SONAR with pip install sonar-space. Note that there is another sonar package on pip that IS NOT this project; make sure to use sonar-space in your dependencies.

If you want to install SONAR manually, you can install it locally. SONAR depends mainly on fairseq2 and can be installed using the following commands (tested with python=3.8):

pip install --upgrade pip
pip install -e .

If fairseq2 does not provide a build for your machine, check the readme of that project to build it locally.

Usage

fairseq2 will automatically download models into your $TORCH_HOME/hub directory upon using the commands below.
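
If you want the checkpoints stored elsewhere, overriding $TORCH_HOME before the first import should redirect the downloads. A minimal sketch under that assumption (the cache path below is a placeholder):

import os

# Placeholder path (an assumption for this sketch); must be set before the
# first import of sonar/fairseq2 so the download manager picks it up.
os.environ["TORCH_HOME"] = "/path/to/model/cache"

from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline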

Compute text sentence embeddings with SONAR:

from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
t2vec_model = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
                                           tokenizer="text_sonar_basic_encoder")
sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
embeddings = t2vec_model.predict(sentences, source_lang="eng_Latn")
print(embeddings.shape)
# torch.Size([2, 1024])
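
Since SONAR embeddings live in a shared multilingual space, sentences in different languages can be compared directly. A minimal sketch (not part of the official API) using cosine similarity:

import torch.nn.functional as F

emb_en = t2vec_model.predict(["My name is SONAR."], source_lang="eng_Latn")
emb_fr = t2vec_model.predict(["Je m'appelle SONAR."], source_lang="fra_Latn")
# Semantically equivalent sentences should score close to 1.0.
print(F.cosine_similarity(emb_en, emb_fr).item())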

Reconstruct text from SONAR embeddings

from sonar.inference_pipelines.text import EmbeddingToTextModelPipeline
vec2text_model = EmbeddingToTextModelPipeline(decoder="text_sonar_basic_decoder",
                                              tokenizer="text_sonar_basic_encoder")
reconstructed = vec2text_model.predict(embeddings, target_lang="eng_Latn", max_seq_len=512)
# max_seq_len is a keyword argument passed to the fairseq2 BeamSearchSeq2SeqGenerator.
print(reconstructed)
# ['My name is SONAR.', 'I can embed the sentences into vector space.']
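
Because the encoder and decoder are separate pipelines, a precomputed embedding can be decoded into any supported language without re-encoding the source. A small sketch reusing the two pipelines above (the French output shown is only indicative):

embeddings = t2vec_model.predict(["My name is SONAR."], source_lang="eng_Latn")
# Decode the same embedding into French instead of English.
print(vec2text_model.predict(embeddings, target_lang="fra_Latn", max_seq_len=512))
# e.g. ['Mon nom est SONAR.']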

Translate text with SONAR

from sonar.inference_pipelines.text import TextToTextModelPipeline
t2t_model = TextToTextModelPipeline(encoder="text_sonar_basic_encoder",
                                    decoder="text_sonar_basic_decoder",
                                    tokenizer="text_sonar_basic_encoder")  # tokenizer is attached to both encoder and decoder cards

sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
t2t_model.predict(sentences, source_lang="eng_Latn", target_lang="fra_Latn")
# ['Mon nom est SONAR.', "Je peux intégrer les phrases dans l'espace vectoriel."]

Compute speech sentence embeddings with SONAR

from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
s2vec_model = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")

s2vec_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
                     "./tests/integration_tests/data/audio_files/audio_2.wav"]).shape
# torch.Size([2, 1024])
import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"

s2vec_model.predict([inp]).shape
# torch.Size([1, 1024])
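
Speech and text encoders target the same embedding space, so a speech embedding can be compared directly with a text embedding. A sketch reusing t2vec_model from the text section above (the transcript is taken from the speech-to-text example below):

import torch.nn.functional as F

speech_emb = s2vec_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav"])
text_emb = t2vec_model.predict(["Television reports show white smoke coming from the plant."],
                               source_lang="eng_Latn")
# High cosine similarity suggests the audio and transcript encode the same meaning.
print(F.cosine_similarity(speech_emb, text_emb).item())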

Speech-to-text translation with SONAR

from sonar.inference_pipelines.speech import SpeechToTextModelPipeline

s2t_model = SpeechToTextModelPipeline(encoder="sonar_speech_encoder_eng",
                                      decoder="text_sonar_basic_decoder",
                                      tokenizer="text_sonar_basic_decoder")

import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"

# passing loaded audio files
s2t_model.predict([inp], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.']

# passing multiple wav files 
s2t_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
                   "./tests/integration_tests/data/audio_files/audio_2.wav"], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.',
# 'These couples may choose to make an adoption plan for their baby.']
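
The speech pipelines also accept a device argument (the SpeechToEmbeddingModelPipeline signature in the issue reports below shows encoder, device, and fbank_dtype parameters), so inference can run on GPU. A sketch assuming SpeechToTextModelPipeline exposes the same parameter:

import torch

s2t_model_gpu = SpeechToTextModelPipeline(encoder="sonar_speech_encoder_eng",
                                          decoder="text_sonar_basic_decoder",
                                          tokenizer="text_sonar_basic_decoder",
                                          device=torch.device("cuda"))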

Predicting sentence similarity with BLASER 2.0 models

BLASER 2.0 is a family of models for automatic evaluation of machine translation quality based on SONAR embeddings. They predict cross-lingual semantic similarity between the translation and the source (optionally, also using a reference translation).

from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
from sonar.models.blaser.loader import load_blaser_model

blaser_ref = load_blaser_model("blaser_2_0_ref").eval()
blaser_qe = load_blaser_model("blaser_2_0_qe").eval()
text_embedder = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")

src_embs = text_embedder.predict(["Le chat s'assit sur le tapis."], source_lang="fra_Latn")
ref_embs = text_embedder.predict(["The cat sat on the mat."], source_lang="eng_Latn")
mt_embs = text_embedder.predict(["The cat sat down on the carpet."], source_lang="eng_Latn")

print(blaser_ref(src=src_embs, ref=ref_embs, mt=mt_embs).item())  # 4.688
print(blaser_qe(src=src_embs, mt=mt_embs).item())  # 4.708
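
The BLASER models operate on batches, so several candidate translations can be scored against one source in a single call. A minimal sketch reusing the models and embeddings above (the candidate sentences are made up for illustration):

candidates = ["The cat sat down on the carpet.", "The dog sat on the mat."]
cand_embs = text_embedder.predict(candidates, source_lang="eng_Latn")
# Repeat the single source embedding to match the candidate batch size.
scores = blaser_qe(src=src_embs.repeat(len(candidates), 1), mt=cand_embs)
print(scores.squeeze(-1).tolist())  # higher = closer to the source meaning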

Detailed model cards with more examples: facebook/blaser-2.0-ref, facebook/blaser-2.0-qe.

Demo notebooks

More complete demo notebooks are available in the repository.

Supported languages and download links

The SONAR text encoder & decoder support 200 languages. SONAR speech encoders support 37 languages.

Available text encoders/decoders
model link
encoder download
decoder download
finetuned decoder download
tokenizer download

All 200 languages from the No Language Left Behind project are supported.
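
Language codes follow the NLLB convention: an ISO 639-3 language code joined with an ISO 15924 script code (for example, English is eng_Latn). If you work with two-letter codes, a small hand-rolled mapping is one option; the helper below is hypothetical, not part of the SONAR API:

# Hypothetical helper (not part of the SONAR API): map ISO 639-1 codes to
# NLLB-style codes for a few common languages.
TWO_LETTER_TO_SONAR = {
    "en": "eng_Latn",
    "de": "deu_Latn",
    "fr": "fra_Latn",
    "pt": "por_Latn",
}
print(TWO_LETTER_TO_SONAR["en"])  # 'eng_Latn'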

Available speech encoders
lang_code language link
arb modern standard arabic download
asm assamese download
bel belarusian download
ben bengali download
bos bosnian download
bul bulgarian download
cat catalan download
ces czech download
cmn mandarin chinese download
cym welsh download
dan danish download
deu german download
est estonian download
fin finnish download
fra french download
guj gujarati download
heb hebrew download
hin hindi download
hrv croatian download
ind indonesian download
ita italian download
jpn japanese download
kan kannada download
kor korean download
lao lao download
lit lithuanian download
lvs standard latvian download
mal malayalam download
mar marathi download
mkd macedonian download
mlt maltese download
npi nepali download
nld dutch download
ory odia download
pan punjabi download
pes western persian download
pol polish download
por portuguese download
ron romanian download
rus russian download
slk slovak download
slv slovenian download
snd sindhi download
srp serbian download
spa spanish download
swe swedish download
swh swahili download
tam tamil download
tel telugu download
tgl tagalog download
tha thai download
tur turkish download
ukr ukrainian download
urd urdu download
uzn northern uzbek download
vie vietnamese download
yue yue chinese download

Citation Information

Please cite the paper when referencing the SONAR embedding space, encoders and decoders as:

@misc{Duquenne:2023:sonar_arxiv,
  author = {Paul-Ambroise Duquenne and Holger Schwenk and Benoit Sagot},
  title = {{SONAR:} Sentence-Level Multimodal and Language-Agnostic Representations},
  publisher = {arXiv},
  year = {2023},
  url = {https://arxiv.org/abs/2308.11466},
}

Contributing

See the CONTRIBUTING file for how to help out.

License

SONAR code is released under the MIT license (see CODE_LICENSE).

Some of the SONAR models are released under the same MIT license, BUT BEWARE: some of them are released under a non-commercial license (see NC_MODEL_LICENSE). Please refer to LICENSE for the details.

sonar's Issues

Possible language-specific alignment issue with Alternative Spelling or Capitalization Rules, for future improvement of the SONAR cross-lingual vector space and the BLASER quality measure

Problem description, for possible scientific research, with more details: Alternative Spelling rules in some languages for benchmarking embedding models

Colab to reproduce (.ipynb and .py) with a quick SONAR and BLASER test:
SONAR_BLASER_Alternative_Spelling_or_Capitalization_Rules_TEST.zip

For SONAR and BLASER 2.0, we can observe a decrease (sometimes significant) in the similarity metric for words/sentences written with Alternative Spelling or Capitalization Rules:

  • Word-level results (EN-DE, same EN-DE word in Alternative Spelling)

  • Word-level results (DE-DE word in Alternative Spelling)

  • Word-level results (EN-DE test Capitalization Rules)

  • Sentence-level results (EN-DE, same EN-DE with German Alternative Spelling and Capitalization Rules)

  • Sentence-level results (one German language, sentence with words written in Alternative Spelling)

Error downloading Mandarin speech encoder

Code to reproduce:

import torch
from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
s2vec_model = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_cmn", device=torch.device("cuda"))

Expected behavior: the code works.
Actual behavior: HTTP Error 403: Forbidden.

Full traceback:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 s2vec_model = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_cmn", device=torch.device("cuda"))

File ~/.conda/envs/external-sonar/lib/python3.10/site-packages/sonar/inference_pipelines/speech.py:394, in SpeechToEmbeddingModelPipeline.__init__(self, encoder, device, fbank_dtype)
    391 super().__init__(fbank_dtype)
    393 if isinstance(encoder, str):
--> 394     encoder = load_sonar_speech_model(encoder, device=device, progress=False)
    395 self.model = encoder.to(device).eval()

File ~/.conda/envs/external-sonar/lib/python3.10/site-packages/fairseq2/models/utils/model_loader.py:182, in ModelLoader.__call__(self, model_name_or_card, force, progress, device, dtype)
    179 # Load the checkpoint.
    180 uri = card.field("checkpoint").as_uri()
--> 182 pathname = self.download_manager.download_checkpoint(
    183     uri, card.name, force=force, progress=progress
    184 )
    186 checkpoint = load_checkpoint(
    187     pathname,
    188     card.name,
    189     map_location="cpu",
    190     converter=partial(self._upgrade_checkpoint, config=config),
    191 )
    193 try:
    194     # Try to construct the model on the meta device.

File ~/.conda/envs/external-sonar/lib/python3.10/site-packages/fairseq2/assets/download_manager.py:119, in DefaultAssetDownloadManager.download_checkpoint(self, uri, model_name, checkpoint_name, shard_idx, force, progress)
    115     display_name = f"{display_name} (shard {shard_idx})"
    117 pathname = self._get_pathname(uri, sub_dir="checkpoints")
--> 119 self._download_file(uri, pathname, display_name, force, progress)
    121 return pathname

File ~/.conda/envs/external-sonar/lib/python3.10/site-packages/fairseq2/assets/download_manager.py:223, in DefaultAssetDownloadManager._download_file(self, uri, pathname, display_name, force, progress)
    221     response = urlopen(uri)
    222 except HTTPError as ex:
--> 223     raise_connection_error(ex)
    225 with response, NamedTemporaryFile(delete=False, dir=pathname.parent) as fp:
    226     headers = response.info()

File ~/.conda/envs/external-sonar/lib/python3.10/site-packages/fairseq2/assets/download_manager.py:221, in DefaultAssetDownloadManager._download_file(self, uri, pathname, display_name, force, progress)
    218     _print_progress(f"Downloading the {display_name}...")
    220 try:
--> 221     response = urlopen(uri)
    222 except HTTPError as ex:
    223     raise_connection_error(ex)

File ~/.conda/envs/external-sonar/lib/python3.10/urllib/request.py:216, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    214 else:
    215     opener = _opener
--> 216 return opener.open(url, data, timeout)

File ~/.conda/envs/external-sonar/lib/python3.10/urllib/request.py:525, in OpenerDirector.open(self, fullurl, data, timeout)
    523 for processor in self.process_response.get(protocol, []):
    524     meth = getattr(processor, meth_name)
--> 525     response = meth(req, response)
    527 return response

File ~/.conda/envs/external-sonar/lib/python3.10/urllib/request.py:634, in HTTPErrorProcessor.http_response(self, request, response)
    631 # According to RFC 2616, "2xx" code indicates that the client's
    632 # request was successfully received, understood, and accepted.
    633 if not (200 <= code < 300):
--> 634     response = self.parent.error(
    635         'http', request, response, code, msg, hdrs)
    637 return response

File ~/.conda/envs/external-sonar/lib/python3.10/urllib/request.py:563, in OpenerDirector.error(self, proto, *args)
    561 if http_err:
    562     args = (dict, 'default', 'http_error_default') + orig_args
--> 563     return self._call_chain(*args)

File ~/.conda/envs/external-sonar/lib/python3.10/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
    494 for handler in handlers:
    495     func = getattr(handler, meth_name)
--> 496     result = func(*args)
    497     if result is not None:
    498         return result

File ~/.conda/envs/external-sonar/lib/python3.10/urllib/request.py:643, in HTTPDefaultErrorHandler.http_error_default(self, req, fp, code, msg, hdrs)
    642 def http_error_default(self, req, fp, code, msg, hdrs):
--> 643     raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError: HTTP Error 403: Forbidden

Training on lower precision

Hi, great work done here!
Have you tried training or running inference with the models at lower precision? What is the performance loss from doing so?

Finetuning Speech Encoders further

Hi,

I tried finetuning the Swahili speech encoder, but the performance only increases to 9.6 BLEU from a base BLEU score of 7.5 with your already finetuned encoder. I finetuned the speech encoder for 5 epochs with augmented data. I am not willing to try more epochs, as the performance increase is not what I had imagined. I finetuned with about 30 hours of data. The MSE loss in the last epoch was 1.5e-6. Is there a different approach that might help achieve a better BLEU?

Also, where is the finetuned decoder checkpoint that, as I read in the paper, does well for Swahili? When I try to use it, I get the error ValueError: The input sequence length must be less than or equal to the maximum sequence length (512), but is 513 instead, which I do not get with the normal decoder. All my audio clips are 30 seconds or shorter.

Thank you for your time!

Language Code Mappings [Text & Speech]

Hi Team,

Is there a clear mapping between languages in the two-letter format (e.g. en, de, fr, pt, ...) and the format used by SONAR? Is there a conversion script somewhere, or a clear mapping and explanation of the language codes?

In particular, it seems there is a speech format:
https://github.com/facebookresearch/SONAR/blob/main/sonar/cards/sonar_speech_encoder.yaml

And there is a text format:
https://github.com/facebookresearch/SONAR/blob/main/sonar/cards/text_sonar_basic_encoder.yaml

Thank you.

RuntimeError at Ray Serve Deployment: Mismatched Devices

When running the model under Ray Serve, I encountered a RuntimeError suggesting a device mismatch ("Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!").

Error logs:

(ServeReplica:default_FastAPIDeployment pid=3985235) ERROR 2023-08-25 08:10:39,337 default_FastAPIDeployment default_FastAPIDeployment#IxFVYy KOFVbAbqev /embedding default replica.py:636 - Request failed due to RayTaskError(DataPipelineError):
(ServeReplica:default_FastAPIDeployment pid=3985235) Traceback (most recent call last):
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 633, in invoke_single
(ServeReplica:default_FastAPIDeployment pid=3985235)     result = await method_to_call(*request_args, **request_kwargs)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/ray/serve/_private/http_util.py", line 411, in __call__
(ServeReplica:default_FastAPIDeployment pid=3985235)     await self._asgi_app(
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/fastapi/applications.py", line 289, in __call__
(ServeReplica:default_FastAPIDeployment pid=3985235)     await super().__call__(scope, receive, send)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
(ServeReplica:default_FastAPIDeployment pid=3985235)     await self.middleware_stack(scope, receive, send)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
(ServeReplica:default_FastAPIDeployment pid=3985235)     raise exc
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
(ServeReplica:default_FastAPIDeployment pid=3985235)     await self.app(scope, receive, _send)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
(ServeReplica:default_FastAPIDeployment pid=3985235)     raise exc
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
(ServeReplica:default_FastAPIDeployment pid=3985235)     await self.app(scope, receive, sender)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
(ServeReplica:default_FastAPIDeployment pid=3985235)     raise e
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
(ServeReplica:default_FastAPIDeployment pid=3985235)     await self.app(scope, receive, send)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
(ServeReplica:default_FastAPIDeployment pid=3985235)     await route.handle(scope, receive, send)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
(ServeReplica:default_FastAPIDeployment pid=3985235)     await self.app(scope, receive, send)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
(ServeReplica:default_FastAPIDeployment pid=3985235)     response = await func(request)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/fastapi/routing.py", line 273, in app
(ServeReplica:default_FastAPIDeployment pid=3985235)     raw_response = await run_endpoint_function(
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
(ServeReplica:default_FastAPIDeployment pid=3985235)     return await dependant.call(**values)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/Github/sonar_testing/source/deployments/fast_api.py", line 119, in encode_sentences
(ServeReplica:default_FastAPIDeployment pid=3985235)     embeddings = ray.get(ref)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
(ServeReplica:default_FastAPIDeployment pid=3985235)     return fn(*args, **kwargs)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeReplica:default_FastAPIDeployment pid=3985235)     return func(*args, **kwargs)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/ray/_private/worker.py", line 2524, in get
(ServeReplica:default_FastAPIDeployment pid=3985235)     raise value.as_instanceof_cause()
(ServeReplica:default_FastAPIDeployment pid=3985235) ray.exceptions.RayTaskError(DataPipelineError): ray::ServeReplica:default_SentenceEncoder.handle_request() (pid=3985222, ip=192.168.4.101)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
(ServeReplica:default_FastAPIDeployment pid=3985235)     return forward_call(*args, **kwargs)
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/sonar/models/sonar_text/model.py", line 112, in forward
(ServeReplica:default_FastAPIDeployment pid=3985235)     sentence_embeddings = self.sentence_embedding_pooling(
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/sonar/models/sonar_text/model.py", line 96, in sentence_embedding_pooling
(ServeReplica:default_FastAPIDeployment pid=3985235)     sentence_embedding = torch.einsum(
(ServeReplica:default_FastAPIDeployment pid=3985235)   File "/data/share/user/simon.choi/.virtualenv/sonar_testing/lib/python3.10/site-packages/torch/functional.py", line 378, in einsum
(ServeReplica:default_FastAPIDeployment pid=3985235)     return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
(ServeReplica:default_FastAPIDeployment pid=3985235) RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Fix:
It can be fixed by specifying the device when creating padding_mask at source/models/sonar_text/model.py:80:

if padding_mask is None:
    padding_mask = torch.zeros(seqs.shape[:2], device=seqs.device)

How to Finetune with X->English data?

I have a dataset with audio in language X and corresponding English translations. Should I finetune the encoder to match the vector space of the encoded English text, or should I finetune the decoder after freezing the X audio encoder parameters?

Thank you for your response!

embedding -> text pipeline

Thanks for your amazing work on this project.
Curious if you plan to create a simple wrapper for an embedding-to-text model pipeline? Basically a decoder-only pipeline that leverages precomputed embeddings to translate into a variety of languages, rather than having to re-create the embeddings with the text-to-text pipeline over and over again.

Thanks!

change max seq len

Is there a way to change max_seq_len from 514 to 1024, for example?
Alternatively, is there a way to compute the current sequence length of a text, to avoid overly long inputs? The tokenizer doesn't have a method to return tokens.
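
One possible workaround for the length check, reusing the t2vec_model pipeline from the usage section above and assuming it exposes its fairseq2 tokenizer as a tokenizer attribute (an assumption; check your installed version):

# Hedged sketch: count tokens before embedding, assuming the pipeline exposes
# its fairseq2 tokenizer as `tokenizer` (an assumption; check your version).
token_encoder = t2vec_model.tokenizer.create_encoder(lang="eng_Latn")
n_tokens = len(token_encoder("Some possibly long input text."))
print(n_tokens)  # consider splitting inputs whose token count approaches 512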

[INPUT] Text (or Speech) Length of Blaser 2.0

For translation quality estimation with BLASER 2.0, I think there is no limit on the text (or speech) length. However, from my personal perspective, I do not think the estimate will be accurate if the text (or speech) is too long.

So, what text length and speech length (for source, reference, and hypothesis) do you recommend?
