
DeepSeek LLM

Model Download | Quick Start | Evaluation Results | License | Citation

Paper Link👁️

1. Introduction

Introducing DeepSeek LLM, an advanced language model comprising 67 billion parameters. It has been trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. In order to foster research, we have made DeepSeek LLM 7B/67B Base and DeepSeek LLM 7B/67B Chat open source for the research community.

  • Superior General Capabilities: DeepSeek LLM 67B Base outperforms Llama2 70B Base in areas such as reasoning, coding, math, and Chinese comprehension.

  • Proficient in Coding and Math: DeepSeek LLM 67B Chat exhibits outstanding performance in coding (HumanEval Pass@1: 73.78) and mathematics (GSM8K 0-shot: 84.1, Math 0-shot: 32.6). It also demonstrates remarkable generalization abilities, as evidenced by its exceptional score of 65 on the Hungarian National High School Exam.

  • Mastery in Chinese Language: Based on our evaluation, DeepSeek LLM 67B Chat surpasses GPT-3.5 in Chinese.

2. Model Downloads

We release DeepSeek LLM 7B/67B, including both base and chat models, to the public. To support a broader and more diverse range of research within both academic and commercial communities, we also provide access to intermediate checkpoints of the base model from its training process. Please note that use of these models is subject to the terms outlined in the License section. Commercial usage is permitted under these terms.

Huggingface

Model Sequence Length Download
DeepSeek LLM 7B Base 4096 🤗 HuggingFace
DeepSeek LLM 7B Chat 4096 🤗 HuggingFace
DeepSeek LLM 67B Base 4096 🤗 HuggingFace
DeepSeek LLM 67B Chat 4096 🤗 HuggingFace

Intermediate Checkpoints

We host the intermediate checkpoints of DeepSeek LLM 7B/67B on AWS S3 (Simple Storage Service). These files can be downloaded using the AWS Command Line Interface (CLI).

# using AWS CLI

# DeepSeek-LLM-7B-Base
aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-7B-Base <local_path> --recursive --request-payer

# DeepSeek-LLM-67B-Base
aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-67B-Base <local_path> --recursive --request-payer

3. Evaluation Results

Base Model

We evaluate our models and some baseline models on a series of representative benchmarks, both in English and Chinese. More results can be found in the evaluation folder. In this part, the evaluation results we report are based on the internal, non-open-source hai-llm evaluation framework. Please note that there may be slight discrepancies when using the converted HuggingFace models.

Model                  HellaSwag  TriviaQA  MMLU    GSM8K   HumanEval  BBH     C-Eval  CMMLU   ChineseQA
Shots                  0-shot     5-shot    5-shot  8-shot  0-shot     3-shot  5-shot  5-shot  5-shot
LLaMA-2-7B             75.6       63.8      45.8    15.5    14.6       38.5    33.9    32.6    21.5
LLaMA-2-70B            84.0       79.5      69.0    58.4    28.7       62.9    51.4    53.1    50.2
DeepSeek LLM 7B Base   75.4       59.7      48.2    17.4    26.2       39.5    45.0    47.2    78.0
DeepSeek LLM 67B Base  84.0       78.9      71.3    63.4    42.7       68.7    66.1    70.8    87.6

Note: ChineseQA is an in-house benchmark, inspired by TriviaQA.

Chat Model

Never Seen Before Exam

To address data contamination and tuning toward specific test sets, we designed fresh problem sets to assess the capabilities of open-source LLMs. The evaluation results indicate that DeepSeek LLM 67B Chat performs exceptionally well on never-before-seen exams.


Hungarian National High-School Exam: In line with Grok-1, we have evaluated the model's mathematical capabilities using the Hungarian National High School Exam. This exam comprises 33 problems, and the model's scores are determined through human annotation. We follow the scoring metric in the solution.pdf to evaluate all models.

[Figure: model scores on the Hungarian National High School Exam]

Remark: We have rectified an error from our initial evaluation. In this revised version, we have omitted the lowest scores for questions 16, 17, 18, as well as for the aforementioned image. Evaluation details are here.


Instruction Following Evaluation: On Nov 15th, 2023, Google released an instruction following evaluation dataset. They identified 25 types of verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We use the prompt-level loose metric to evaluate all models. Here, we used the first version released by Google for the evaluation. For results on Google's revised test set, please refer to the numbers in our paper.

[Figure: instruction following evaluation results, prompt-level loose metric]

LeetCode Weekly Contest: To assess the coding proficiency of the model, we have utilized problems from the LeetCode Weekly Contest (Weekly Contest 351-372, Bi-Weekly Contest 108-117, from July 2023 to Nov 2023). We have obtained these problems by crawling data from LeetCode, which consists of 126 problems with over 20 test cases for each. The evaluation metric employed is akin to that of HumanEval. In this regard, if a model's outputs successfully pass all test cases, the model is considered to have effectively solved the problem. The model's coding capabilities are depicted in the Figure below, where the y-axis represents the pass@1 score on in-domain human evaluation testing, and the x-axis represents the pass@1 score on out-domain LeetCode Weekly Contest problems.

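[Figure: pass@1 on in-domain human evaluation testing (y-axis) versus pass@1 on out-domain LeetCode Weekly Contest problems (x-axis)]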

The specific questions and test cases will be released soon. Stay tuned!
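The pass criterion described for the LeetCode evaluation (a problem counts as solved only if the generated solution passes every test case) can be sketched as follows. This is a minimal illustration, not the evaluation harness used for the reported numbers: the `TestCase` layout, the function names, and the two toy problems are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical layout: a test case is (input arguments, expected output).
TestCase = Tuple[tuple, object]


def solves_problem(solution: Callable, test_cases: List[TestCase]) -> bool:
    """Return True only if the solution passes every test case."""
    for args, expected in test_cases:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failed test case.
            return False
    return True


def pass_at_1(solutions: Dict[str, Callable], tests: Dict[str, List[TestCase]]) -> float:
    """Fraction of problems whose single generated solution passes all tests."""
    solved = sum(solves_problem(solutions[name], tests[name]) for name in tests)
    return solved / len(tests)


# Toy example with two made-up problems; the second solution is wrong on purpose.
solutions = {
    "add": lambda a, b: a + b,
    "square": lambda x: x + x,
}
tests = {
    "add": [((1, 2), 3), ((0, 0), 0)],
    "square": [((2,), 4), ((3,), 9)],
}
print(pass_at_1(solutions, tests))
```

The wrong `square` solution passes its first test but fails the second, so only one of the two toy problems counts as solved.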


Standard Benchmark

Model TriviaQA MMLU GSM8K HumanEval BBH C-Eval CMMLU ChineseQA
DeepSeek LLM 7B Base 59.7 48.2 17.4 26.2 39.5 45.0 47.2 78.0
DeepSeek LLM 67B Base 78.9 71.3 63.4 42.7 68.7 66.1 70.8 87.6
DeepSeek LLM 7B Chat 57.9 49.4 62.6 48.2 42.3 47.0 49.7 75.0
DeepSeek LLM 67B Chat 81.5 71.1 84.1 73.8 71.7 65.2 67.8 85.1

Note: We evaluate chat models with 0-shot for MMLU, GSM8K, C-Eval, and CMMLU. More evaluation results can be found here.

Revisit Multi-Choice Question Benchmarks

Based on our experimental observations, we have discovered that enhancing benchmark performance using multi-choice (MC) questions, such as MMLU, CMMLU, and C-Eval, is a relatively straightforward task. By incorporating multi-choice questions from Chinese exams, we have achieved exceptional results, as depicted in the table below:

Model MMLU C-Eval CMMLU
DeepSeek LLM 7B Chat 49.4 47.0 49.7
DeepSeek LLM 7B Chat + MC 60.9 71.3 73.8

Note: +MC represents the addition of 20 million Chinese multiple-choice questions collected from the web. It is important to note that we conducted deduplication for the C-Eval validation set and CMMLU test set to prevent data contamination. This addition not only improves Chinese multiple-choice benchmarks but also enhances English benchmarks. However, we observed that it does not enhance the model's knowledge performance on other evaluations that do not utilize the multiple-choice style in the 7B setting. As a result, we made the decision to not incorporate MC data in the pre-training or fine-tuning process, as it would lead to overfitting on benchmarks.

4. Pre-Training Details

Data

Our primary goal is to holistically enhance the dataset's richness and variety. To achieve this, we've implemented multiple methods and established a distributed, frequent-checkpointing batch processing system, named "cc_cleaner", to bolster our data pipeline.

Our minimum viable solution starts from RefinedWeb + CCNet. We greatly appreciate their selfless dedication to the research of AGI.

We have also significantly incorporated deterministic randomization into our data pipeline. This approach enables us to continuously enhance our data throughout the lengthy and unpredictable training process.

  • Data Composition: Our training data comprises a diverse mix of Internet text, math, code, books, and self-collected data respecting robots.txt. In addition to the diverse content, we place a high priority on personal privacy and copyright protection. All content containing personal information or subject to copyright restrictions has been removed from our dataset.

  • Dataset Pruning: Our system employs heuristic rules and models to refine our training data. Our filtering process removes low-quality web data while preserving precious low-resource knowledge. It aims to improve overall corpus quality and remove harmful or toxic content.

  • Deduplication: Our advanced deduplication system, using MinhashLSH, strictly removes duplicates both at document and string levels. This rigorous deduplication process ensures exceptional data uniqueness and integrity, especially crucial in large-scale datasets.
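To make the MinHash idea behind the deduplication step more concrete, here is a small pure-Python sketch. It is not the pipeline described above: the shingle size, the number of hash functions, the band count, and all function names are assumptions chosen for illustration. It estimates Jaccard similarity between documents from MinHash signatures and derives LSH band keys, so that documents sharing any band key become duplicate candidates.

```python
import hashlib
from typing import List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[str]:
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(item: str, seed: int) -> int:
    """Seeded 64-bit hash; each seed acts as one hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """For each hash function, keep the smallest hash value over all shingles."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_band_keys(signature: List[int], bands: int = 16) -> Set[Tuple[int, tuple]]:
    """Split the signature into bands; documents sharing a key are candidates."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "large language models are trained on trillions of tokens"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("a vs b:", estimated_jaccard(sig_a, sig_b))
print("a vs c:", estimated_jaccard(sig_a, sig_c))
print("a and c share a band:", bool(lsh_band_keys(sig_a) & lsh_band_keys(sig_c)))
```

In a real pipeline the band keys would be used as bucket keys in a hash table, so that only documents landing in the same bucket are compared in detail.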

Pre-Training

DeepSeek LLM models use the same architecture as LLaMA, an auto-regressive transformer decoder model. The 7B model uses Multi-Head Attention (MHA) while the 67B model uses Grouped-Query Attention (GQA).

We pre-trained DeepSeek language models on a vast dataset of 2 trillion tokens, with a sequence length of 4096 and AdamW optimizer. The 7B model's training involved a batch size of 2304 and a learning rate of 4.2e-4 and the 67B model was trained with a batch size of 4608 and a learning rate of 3.2e-4. We employ a multi-step learning rate schedule in our training process. The learning rate begins with 2000 warmup steps, and then it is stepped to 31.6% of the maximum at 1.6 trillion tokens and 10% of the maximum at 1.8 trillion tokens.
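The multi-step schedule can be written as a small function of the training step. This is a sketch under stated assumptions, not the training code: it assumes the batch size counts sequences (so each step consumes `batch_size * seq_len` tokens) and that the warmup is linear, neither of which is specified above. The function name and signature are illustrative.

```python
def learning_rate(step: int, max_lr: float, batch_size: int,
                  seq_len: int = 4096, warmup_steps: int = 2000) -> float:
    """Multi-step schedule: warmup, then drops at 1.6T and 1.8T tokens."""
    # Assumption: batch_size counts sequences, so tokens per step is batch_size * seq_len.
    tokens_seen = step * batch_size * seq_len
    if step < warmup_steps:
        # Assumption: linear warmup that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if tokens_seen < 1.6e12:
        return max_lr
    if tokens_seen < 1.8e12:
        return 0.316 * max_lr  # 31.6% of the maximum
    return 0.1 * max_lr        # 10% of the maximum


# 7B setting from the text: batch size 2304, maximum learning rate 4.2e-4.
for step in (0, 2000, 100_000, 170_000, 195_000):
    print(step, learning_rate(step, max_lr=4.2e-4, batch_size=2304))
```

Under these assumptions one step of the 7B run covers 2304 * 4096 = 9,437,184 tokens, so the two drops fall at roughly steps 169,543 and 190,735.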

We release the training loss curve and several benchmark metrics curves, as detailed below.

[Figures: training loss curve and benchmark metric curves]

5. Quick Start

Installation

In a Python >= 3.8 environment, install the necessary dependencies by running the following command:

pip install -r requirements.txt

Inference with Huggingface's Transformers

You can directly employ Huggingface's Transformers for model inference.

Text Completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/deepseek-llm-67b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Chat Completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/deepseek-llm-67b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Who are you?"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)

If you prefer not to use the provided apply_chat_template function, you can also interact with our model by following the sample template below. Note that messages should be replaced by your input.

User: {messages[0]['content']}

Assistant: {messages[1]['content']}<|end▁of▁sentence|>User: {messages[2]['content']}

Assistant:

Note: By default (add_special_tokens=True), our tokenizer automatically adds a bos_token (<|begin▁of▁sentence|>) before the input text. Additionally, since the system prompt is not compatible with this version of our models, we DO NOT RECOMMEND including the system prompt in your input.
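As an illustration of the sample template, the sketch below builds the prompt string from a list of messages. The helper name `build_prompt` is hypothetical and not part of the repository. It leaves out the bos_token because, as noted, the tokenizer adds it by default, and it rejects system messages in line with the recommendation above.

```python
EOS_TOKEN = "<|end▁of▁sentence|>"


def build_prompt(messages):
    """Render alternating user/assistant messages using the sample template."""
    prompt = ""
    for message in messages:
        if message["role"] == "user":
            # Each user turn ends with the cue for the assistant to answer.
            prompt += f"User: {message['content']}\n\nAssistant:"
        elif message["role"] == "assistant":
            # A completed assistant turn is closed with the end-of-sentence token.
            prompt += f" {message['content']}{EOS_TOKEN}"
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return prompt


messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am an AI assistant."},
    {"role": "user", "content": "What can you do?"},
]
print(build_prompt(messages))
```

The resulting string can be passed to the tokenizer in the same way as the text completion example.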

Inference with vLLM

You can also employ vLLM for high-throughput inference.

Text Completion

from vllm import LLM, SamplingParams

tp_size = 4 # Tensor Parallelism
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
model_name = "deepseek-ai/deepseek-llm-67b-base"
llm = LLM(model=model_name, trust_remote_code=True, gpu_memory_utilization=0.9, tensor_parallel_size=tp_size)

prompts = [
    "If everyone in a country loves one another,",
    "The research should also focus on the technologies",
    "To determine if the label is correct, we need to"
]
outputs = llm.generate(prompts, sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

Chat Completion

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tp_size = 4 # Tensor Parallelism
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
model_name = "deepseek-ai/deepseek-llm-67b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, trust_remote_code=True, gpu_memory_utilization=0.9, tensor_parallel_size=tp_size)

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "What can you do?"}],
    [{"role": "user", "content": "Explain Transformer briefly."}],
]
# Avoid adding bos_token repeatedly
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

sampling_params.stop = [tokenizer.eos_token]
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

6. FAQ

Could You Provide the tokenizer.model File for Model Quantization?

DeepSeek LLM uses the HuggingFace Tokenizer to implement the byte-level BPE algorithm, with specially designed pre-tokenizers to ensure optimal performance. Currently, there is no direct way to convert the tokenizer into a SentencePiece tokenizer. We are contributing to open-source quantization methods to facilitate the use of the HuggingFace Tokenizer.

GGUF(llama.cpp)

We have submitted a PR to the popular quantization repository llama.cpp to fully support all HuggingFace pre-tokenizers, including ours.

While waiting for the PR to be merged, you can generate your GGUF model using the following steps:

git clone https://github.com/DOGEwbx/llama.cpp.git
cd llama.cpp
git checkout regex_gpt2_preprocess
# set up the environment according to README
make
python3 -m pip install -r requirements.txt
# generate GGUF model
python convert-hf-to-gguf.py <MODEL_PATH> --outfile <GGUF_PATH> --model-name deepseekllm
# use q4_0 quantization as an example
./quantize <GGUF_PATH> <OUTPUT_PATH> q4_0
./main -m <OUTPUT_PATH> -n 128 -p <PROMPT>

GPTQ(exllamav2)

UPDATE: exllamav2 now supports the HuggingFace Tokenizer. Please pull the latest version and try it out.

GPU Memory Usage

We profile the peak memory usage of inference for 7B and 67B models at different batch size and sequence length settings.

For DeepSeek LLM 7B, we utilize 1 NVIDIA A100-PCIE-40GB GPU for inference.

Batch Size   Sequence Length
             256        512        1024       2048       4096
1            13.29 GB   13.63 GB   14.47 GB   16.37 GB   21.25 GB
2            13.63 GB   14.39 GB   15.98 GB   19.82 GB   29.59 GB
4            14.47 GB   15.82 GB   19.04 GB   26.65 GB   OOM
8            15.99 GB   18.71 GB   25.14 GB   35.19 GB   OOM
16           19.06 GB   24.52 GB   37.28 GB   OOM        OOM

For DeepSeek LLM 67B, we utilize 8 NVIDIA A100-PCIE-40GB GPUs for inference.

Batch Size   Sequence Length
             256        512        1024       2048       4096
1            16.92 GB   17.11 GB   17.66 GB   20.01 GB   33.23 GB
2            17.04 GB   17.28 GB   18.55 GB   25.27 GB   OOM
4            17.20 GB   17.80 GB   21.28 GB   33.71 GB   OOM
8            17.59 GB   19.25 GB   25.69 GB   OOM        OOM
16           18.17 GB   21.69 GB   34.54 GB   OOM        OOM

7. Limitation

While DeepSeek LLMs have demonstrated impressive capabilities, they are not without their limitations. Here are some potential drawbacks of such models:

  1. Over-reliance on training data: These models are trained on vast amounts of text data, which can introduce biases present in the data. They may inadvertently generate biased or discriminatory responses, reflecting the biases prevalent in the training data.

  2. Hallucination: The model sometimes generates responses or outputs that may sound plausible but are factually incorrect or unsupported. This can occur when the model relies heavily on the statistical patterns it has learned from the training data, even if those patterns do not align with real-world knowledge or facts.

  3. Repetition: The model may repeat itself in its generated responses. This repetition can manifest in various ways, such as repeating certain phrases or sentences, generating redundant information, or producing repetitive structures in the generated text. This issue can make the output of LLMs less diverse and less engaging for users.

8. License

This code repository is licensed under the MIT License. The use of DeepSeek LLM Base/Chat models is subject to the Model License. DeepSeek LLM series (including Base and Chat) supports commercial use.

9. Citation

@article{deepseek-llm,
  author = {DeepSeek-AI},
  title = {DeepSeek LLM: Scaling Open-Source Language Models with Longtermism},
  journal = {arXiv preprint arXiv:2401.02954},
  year = {2024},
  url = {https://github.com/deepseek-ai/DeepSeek-LLM}
}

10. Contact

If you have any questions, please raise an issue or contact us at [email protected].


deepseek-llm's Issues

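DeepSeek 7B Chat LoRA works great!

Dear DeepSeek team,

I am writing to express my gratitude for your team's highly creative work. I noticed that the repository has no scripts or tutorials for LoRA fine-tuning, and that llama-factory has not been adapted for LoRA fine-tuning of the DeepSeek 7B chat model. Nevertheless, after testing LoRA fine-tuning myself, I am very impressed by your team's work.

I really appreciate the effort your team put into developing the DeepSeek 7B chat model. The model performs very well under LoRA fine-tuning, which was a pleasant surprise. I have shared my LoRA fine-tuning experience in a tutorial published on GitHub. If needed, I can turn it into a script and submit a PR.

Thank you again for your work. I look forward to your future innovations and contributions.

DeepSeek 7B chat LoRA tutorial: https://github.com/datawhalechina/self-llm/blob/master/DeepSeek/04-DeepSeek-7B-chat%20Lora%20%E5%BE%AE%E8%B0%83.md
Repository: https://github.com/datawhalechina/self-llm.git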

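Some questions about the model metrics

Why does Deepseek-Math-7B-rl already reach 88.2% while DeepSeek-LLM-67B Chat only reaches 84%? The 67B general-purpose model is worse at math than the 7B math-specialized model.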

About LR schedule

Why can the initial LR be so much higher than that of llama2-70b?
Also, would such an LR decay schedule be remarkably better than a routine cosine decay schedule?

Training data distribution

Hi, the paper is very detailed in most aspects, but the training data is not mentioned in as much detail.

Specifically, I am interested in the following:

  • The composition of the training dataset, including the types of data (e.g., text, code, images) and the sources of the data.
  • How the sampling ratio for each subset is determined, e.g., which principle is followed.

Nice figure

image

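The figure looks nice, but the axes were changed.
Please don't change them next time. It looks good but makes the figure harder to read.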

67B-Instructor – will it be released shortly/ever?

I was excited to use this model for coding, but it looks like I'm better off sticking to the 33b until an Instructor model is uploaded. Can we expect that anytime soon? If not, any sort of timeframe would be highly appreciated! (...I am presuming it's not "if" but "when" for optimism's sake (= )

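About the System Prompt

First of all, thank you for your excellent open-source work!

The System Prompt mentioned in your technical report is similar to the first prompt sentence in your DeepSeek coder template.

However, for the 67B chat model you released, I checked the tokenizer_config.json you published and found that the template has no place to insert this prompt.

Where should the System Prompt be added?
Also, do you plan to develop a function call or agent version of the model?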

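How should Deepseek SFT data that contains a system message be handled?

Hello, if my SFT data contains a system message, what should the model input look like?
I am using LLama_factory for Deepseek SFT, and the model input looks like this: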

<|begin▁of▁sentence|>You are a helpful assistant.User: Query

Assistant: RESPONSE<|end▁of▁sentence|>User: Query

Assistant: RESPONSE<|end▁of▁sentence|>

Is this an appropriate way to handle the system message?

question on "Revisit Multi-Choice Question Benchmarks"

Thanks so much for sharing the findings and insights about "Multi-Choice Question Benchmarks". I have a quick question about the 20 million Chinese MC questions that led to overfitting without generalizing to other tasks: do the data consist of questions with options only, or do the answers also include some explanation?

Thank you again for your great work!

Is the compute calculation wrong for Chinchilla in the paper?

In the paper, Eq. 2 lists the Chinchilla compute calculation as

$$6N_2 = 72 n_\text{layer}d_\text{model}^2 + 6n_\text{vocab}d_\text{model}$$

The first term comes from the $6ND$ estimate for non-embedding FLOPs (exclude lm_head parameters as well, maybe because of the tied embeddings), but the second term is not what Chinchilla used to calculate embedding FLOPs, see Appendix F from the Chinchilla paper, total forward pass FLOPs include embeddings and logits calculations.

So, $N_2$ should be larger than what is used in the paper (double the second term)?
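To make the size of the difference raised in this question concrete, the sketch below evaluates $6N_2$ both as written in the equation above and with the second term doubled, as the question suggests. It does not settle which reading is correct. The function names and the model dimensions in the example are hypothetical and are not taken from the paper.

```python
def six_n2_as_written(n_layer: int, d_model: int, n_vocab: int) -> int:
    """6*N_2 as quoted above: 72*n_layer*d_model^2 + 6*n_vocab*d_model."""
    return 72 * n_layer * d_model**2 + 6 * n_vocab * d_model


def six_n2_doubled_vocab_term(n_layer: int, d_model: int, n_vocab: int) -> int:
    """Variant suggested in the question: the vocabulary term is doubled."""
    return 72 * n_layer * d_model**2 + 12 * n_vocab * d_model


# Hypothetical model shape, chosen only to show the order of magnitude.
n_layer, d_model, n_vocab = 30, 4096, 100_000
written = six_n2_as_written(n_layer, d_model, n_vocab)
doubled = six_n2_doubled_vocab_term(n_layer, d_model, n_vocab)
print(f"as written:  {written:.3e} FLOPs per token")
print(f"doubled:     {doubled:.3e} FLOPs per token")
print(f"relative gap: {(doubled - written) / written:.2%}")
```

The gap between the two variants is exactly `6 * n_vocab * d_model`, so it matters most for small models with large vocabularies.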

Missing files in released pretrain ckpts

The pytorch_model-00013-of-00014.bin and pytorch_model-00014-of-00014.bin files are missing from the intermediate checkpoint "DeepSeek-LLM-67B-Base-Intermediate-1400B" at the AWS link.
Could you please update the model files?

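Help reproducing the TriviaQA results

Hello, I am trying to reproduce the results of the base models (7B and 67B) on TriviaQA. Even with the prompt format from the tech report, my results are still about 7 points off. Could you provide the code needed to reproduce them? Thank you for your help.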

LeetCode Weekly Contest Data

Great work!
Would you be willing to share the test data for the LeetCode Weekly Contest? It would be very helpful for the community.

Inquiry about Prompt Engineering and Handling Toxicity/Hallucination

Hello, I'm seeking guidance on prompt engineering and managing toxicity/hallucination in scenarios where the system prompt is not compatible with the model. Could you provide advice or best practices for prompt engineering in such cases? Additionally, how can we effectively address issues related to toxicity and hallucination without a compatible system prompt? Any insights or examples would be greatly appreciated. Thank you for your assistance.

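Question about using vLLM

Hello! I have a question about the official vLLM code:
prompts = [tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) for messages in messages_list]
After this step, the result is actually a list of strings. However, tokens such as <|begin▁of▁sentence|> should really be concatenated as special tokens. Is this usage correct?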

AWS CLI usage problem and access to the deepseek-ai S3 bucket

Problem description:

I am trying to copy files from the deepseek-ai S3 bucket to my local machine with the AWS CLI. The command I use is aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-7B-Base <local_path> --recursive --request-payer.

However, I ran into two errors:

InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records.
Service Unavailable: An error occurred (503) when calling the ListObjectsV2 operation (reached max retries: 4).

Actual result:

The InvalidAccessKeyId and Service Unavailable errors occurred.

Additional information:

I would like to know how to access the deepseek-ai S3 bucket correctly, and whether a specific Access Key ID or endpoint parameter is required.

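Help reproducing the AlignBench evaluation results

I noticed your model's SOTA performance on alignbench, so I tried to reproduce it.

  • I used the 67B-Chat model released on HuggingFace (does it correspond to the DPO version in the Tech Report?)
  • The result in the Tech Report, judged by GPT4, is 6.69
  • I generated the answers myself, uploaded them to AlignBench, and their CritiqueLLM judge gave a score of 5.68, as shown below
Model,Professional Knowledge,Chinese Understanding,Basic Tasks,Math,Writing,Open-ended QA,Role Play,Logical Reasoning,Chinese Reasoning,Chinese Language,Overall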
deepseek67b,6.870967741935484,6.086206896551724,6.661764705882353,4.901785714285714,6.613333333333333,7.394736842105263,6.431034482758621,4.478260869565218,4.690023291925466,6.676340667094462,5.683181979509964

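As I understand it, this is below expectations (although I did not control for all variables). I suspect the generation process is the cause. My generate code, written by loosely following the example on HuggingFace, is shown below. I only changed the temperature parameter to follow the official setting and left everything else at the defaults.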

        question = sample['question']
        temperature = sample['temperature']
        messages = [
            {
                "role": "user",
                "content": question
            }
        ]
        input_tensor = self.tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
        outputs = self.model.generate(input_tensor.to(self.model.device), temperature=temperature,max_new_tokens=2048)
        answer = self.tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
        return answer

If I want to reproduce a score close to the one in the tech report, is there a more correct template? Thanks!

Scaling laws data

I am researching scaling laws across models and architectures, among other things, and was wondering if you could share the logs, training losses, and validation results of the models you ran for the scaling law experiments in DeepSeek LLM. Any other similar losses or results would also be interesting. They do not need to be well curated; anything can be helpful.
Thanks

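LoRA SFT on the deepseek 67b base model

Thank you for sharing such a good model.
I fine-tuned the 67b base model with SFT on 50,000 multi-turn samples for one epoch. At test time, however, the model outputs garbled text.

image

The SFT framework I used is llama factory.
The experiment parameters are as follows: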

deepspeed --num_gpus 2 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage sft \
    --model_name_or_path /data/origin_models/deepseek-llm-67b-base \
    --do_train \
    --dataset 50k_multiple,self_cognition \
    --template alpaca \
    --finetuning_type lora \
    --quantization_bit 4 \
    --lora_target q_proj,v_proj \
    --output_dir output-deepseek-67b-sft \
    --overwrite_cache \
    --overwrite_output_dir true \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 10 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --learning_rate 2e-4 \
    --num_train_epochs 2.0 \
    --plot_loss \
    --lora_rank 64 \
    --lora_alpha 128 \
    --cutoff_len 4096 \
    --ddp_find_unused_parameters False \
    --preprocessing_num_workers 20 \
    --save_total_limit 1 \
    --flash_attn

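The parameters used to load the model at test time are as follows:
python src/web_demo.py \
    --model_name_or_path /data/origin_models/deepseek-llm-67b-base \
    --template alpaca \
    --finetuning_type lora \
    --quantization_bit 4 \
    --checkpoint_dir /home/output-deepseek-deepctrl-67b-sft

Training hardware:

  • Two A800 80G GPUs
  • cuda: 12.1
  • torch 2.1.0
  • transformers 4.34.1

I tried various combinations of repetition_penalty, temperature, and top_p, but the problem persists.

My question is: could this be caused by the LoRA rank or the learning rate being too small, leaving the SFT training badly under-trained?

Thanks again!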

German umlaut missing with deepseek-llm on llama

Here are the responses from a few models. deepseek-llm cannot output "ö" and "ü":

%ollama run orca2:13b "Please repeat: wäre, Tür, höchstens"
wäre, Tür, höchstens

Translation: would be, door, at most

%ollama run codellama:34b "Please repeat: wäre, Tür, höchstens"

Wäre, Tür, höchstens.

%ollama run deepseek-llm:67b-chat "Please repeat: wäre, Tür, höchstens"
To complete this task, I will first listen to the audio file provided and write down the German words that are spoken. Then,
I will repeat those words in a clear manner for you.

Step 1: Listen to the audio file and identify the German words being spoken. In this case, the words are "wäre", "Tr"
(door), and "hchstens" (at most).

Step 2: Repeat each word in a clear manner.
- wäre -> I would say this as "vare".
- Tr -> Pronounced like "tuer", which means door.
- hchstens -> This is pronounced like "hkhs-tens" and it translates to "at most."

Is this a problem with the model or with ollama?
