bytedance / salmonn
SALMONN: Speech Audio Language Music Open Neural Network
Home Page: https://bytedance.github.io/SALMONN/
License: Apache License 2.0
Hello, where is the source for SALMONN?
Nice work! But I'm wondering how SALMONN can handle the speaker verification task. What are the prompt, input, and output?
Is there a way to run inference on a 24GB GPU? An A100-SXM-80GB is not accessible for now.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 23.65 GiB total capacity; 23.16 GiB already allocated; 34.31 MiB free; 23.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
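As the trace itself suggests, when reserved memory far exceeds allocated memory the allocator may be fragmenting; a minimal sketch of the suggested `max_split_size_mb` workaround (128 MiB is just an illustrative starting value, and the variable must be set before the first CUDA allocation):

```python
import os

# Cap the size of cached allocator blocks to reduce fragmentation.
# Must be set before torch makes its first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

Equivalently, export the variable in the shell before launching the script. Separately, loading the LLM part in 8-bit (if the codebase exposes such an option) can roughly halve its memory footprint.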
Hello, will the training and model code for the stage-1 speech Q-Former be open-sourced?
Why a Contributors section: A "Contributors" section in a repo gives credit to and acknowledges
the people who have helped with the project, fosters a sense of community, and helps others
know who to contact with questions or issues related to the project.
Issue type
@TCL606 Kindly assign this issue to me! I would love to work on it! Thank you!
Hello,
I'm writing to inquire about the training data for the model, specifically for Task Level 3, which includes audio-based storytelling (Story) and speech audio co-reasoning (SAC) tasks.
In your review responses, you mentioned that "We will release our source and training data to provide all implementation details if the paper is accepted." I have been able to find the source code and some of the training data, but I'm having trouble locating the datasets for Task Level 3.
Could you please provide some guidance on where I might find these datasets, or if they are not yet available, could you provide an estimated timeline for when they might be released?
Thank you for your time and for your contributions to the field.
Best,
Enis
I'm a little confused about the paper.
In stage 3 of training, the BEATs download link is broken.
Hello!
SALMONN is excellent work, but I have a small question about using it: which version of Vicuna does SALMONN 7B use? Is it Vicuna v1.1, the same as SALMONN 13B?
Since Qwen-Audio uses Clotho as its AAC test set while you use AudioCaps, I wonder which model is better on the same test set. Figure 1 of the Qwen-Audio paper claims it beats SALMONN, but I'm not sure whether that comparison is objective.
Is there any more information about how the audio-text aligner is implemented? Since audio and text sequences have different lengths, it is hard to imagine how they could be trained into the same embedding space.
Thanks.
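One common way to bridge the length mismatch (a minimal sketch, not SALMONN's actual implementation) is a Q-Former-style module: a fixed number of learned query vectors cross-attend over the variable-length audio features, so the output length is constant regardless of the input length and can be projected into the LLM embedding space:

```python
import torch
import torch.nn as nn

class QueryPooler(nn.Module):
    """Sketch: fixed learned queries cross-attend over variable-length
    audio features, yielding a fixed-length representation."""
    def __init__(self, dim=64, num_queries=4, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio_feats):  # (batch, T, dim) for any T
        b = audio_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, audio_feats, audio_feats)
        return out  # (batch, num_queries, dim), independent of T
```

Whatever the audio length T, the output shape is fixed, which is what lets it line up with text token embeddings during training.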
I noticed there is a prompt_pattern parameter in your code. For music, do I need to modify it? Can you briefly describe the process of training this model and the datasets used?
Great work! I wonder whether the paper has been published. If so, could you please provide a link to it? Thank you very much!
Code of Conduct: We propose adding a comprehensive Code of Conduct to our repository to ensure
a safe, respectful, and inclusive environment for all contributors and users. This code will
serve as a guideline for behavior, promoting diversity, reducing conflicts, and attracting a
wider range of perspectives.
Issue type
@ Kindly assign this issue to me! I would love to work on it!
In our lab, we urgently need to train this model ourselves. When will you publish the paper and the training code?
Looking at the repository, there is only a link to a Gradio demo; the model weights have not been released.
In the "How to inference in CLI" section, there is a typo in the word "requried." It should be "required." Here's the corrected sentence:
Original: "Our environment: The python version is 3.9.17, and other requried packages can be installed with the following command..."
Corrected: "Our environment: The python version is 3.9.17, and other required packages can be installed with the following command..."
How do I run your code with distributed training? I tried setting "use_distributed: True" in your configuration file, but it does not work; it seems to support only single-GPU mode.
Thank you very much for being able to open source your complete code of SALMONN.
Although you've given a clear presentation in both your paper and code, there are still some points that puzzle me after reading them:
What is the difference in your training settings between stage 1 and stage 2 on the AST task?
The paper says you use LibriSpeech-960h (280k) and the GigaSpeech M-set at stage 1, and then the same LibriSpeech-960h (also 280k) at stage 2. What changes in the training setup on LibriSpeech from stage 1 to stage 2? Did you train without any instruction during stage 1, or just change the instructions used?
How did you get the 200k samples of GigaSpeech used at stage 2?
I notice the GigaSpeech subset used at stage 2 is 200k, close to the size of the GigaSpeech S-set (220k), and according to the paper you used the whole GigaSpeech M-set (680k) during stage 1. So what exactly are the 200k GigaSpeech samples at stage 2? Were they randomly selected from the M-set?
Could performance on downstream tasks be hurt by having so many preset instructions for instruction tuning?
According to the recently released code, there are many instructions for a single downstream task (for instance, 15 instructions for the ASR task). From my point of view, one problem that is hard to avoid is that instructions for different downstream tasks may share similar patterns, and these similarities could mislead the model into an unintended task during inference, especially with a low beam setting. So I want to know whether, in your opinion or in your experiments, more instructions for tuning is better or fewer is better, because I'm uncertain which case better prevents this kind of similarity.
I failed to find any information about these issues in either the paper or the code, so I'm looking forward to your answers.
Thank you again for taking the time to read my issue. Hoping for your early reply!
The recognized output cannot stop; it keeps repeating the same sentence.
This is a phone call between two people.
The first person says: "Hello, yes?"
The second person replies: "Hello, what do you need?"
The first person says: "I'd like to ask what your price is?"
The second person replies: "Our price is three hundred and sixty dollars."
The first person says: "Ah, that's too expensive. So what is the price?"
The second person replies: "Our price is three hundred and sixty dollars."
(The last two lines then repeat indefinitely.)
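Degenerate repetition loops like this are usually suppressed at decoding time with a repetition penalty or an n-gram ban (e.g. `no_repeat_ngram_size` in HuggingFace-style `generate`). A minimal sketch of the n-gram ban idea, written as plain Python for clarity rather than taken from SALMONN's code:

```python
def banned_next_tokens(generated, n=3):
    """Tokens that would complete an n-gram already present in
    `generated` (the core idea behind no-repeat-n-gram decoding)."""
    if len(generated) < n:
        return set()
    prefix = tuple(generated[-(n - 1):])  # last n-1 tokens
    banned = set()
    for i in range(len(generated) - n + 1):
        # If an earlier (n-1)-gram matches the current suffix,
        # ban the token that followed it.
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned
```

At each decoding step, the sampler would mask out the returned token IDs before picking the next token, making it impossible to emit the same n-gram twice.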