add llama & qwen dpo (#8474)

* add llama&qwen dpo
* add
* add dpo
* fix bug
* add
PaddlePaddle · Jun 11, 2024 · 909be01
1 parent 547d29c
commit 909be01

Showing 19 changed files with 1,513 additions and 80 deletions.
diff --git a/llm/README.md b/llm/README.md
@@ -155,7 +155,47 @@ python  finetune_generation.py ./llama/pt_argument.json
 
 For more documentation on distributed fine-tuning of large models, training details, and results, see the [LLM fine-tuning tutorial](./docs/finetune.md).
 
-### 3. Quantization
+### 3. Alignment
+We support preference alignment strategies such as DPO.
+
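+**Data preparation**:
+
+The supported data format is a JSON file with one dictionary per line. Each dictionary contains the following fields:
+
+- `src` : `str, List(str)`, the user's conversation turns.
+- `tgt` : `str, List(str)`, the system's replies.
+- `response` : `str, List(str)`, the chosen and rejected responses.
+- `sort` : `List(int)`, distinguishes the chosen response from the rejected one in `response` (the response with the smaller `sort` value is rejected; the one with the larger value is chosen).
+
+Sample data: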
+```
+{
+    "src": ["In this task, you are given a second sentence. Your task is to generate the first sentence on the same topic but incoherent and inconsistent with the second sentence.\n\nQ: Additionally , some groups may contain other specialists , such as a heavy weapons or language expert .\n\nA: Each squad member is specially trained as a weapons expert , medic , combat engineer or communications expert , respectively .\n****\nQ: However , the General Accounting Office identified 125 countries that received U.S. training and assistance for their police forces during fiscal year 1990 at a cost of at least $117 million .\n\nA: No government agency is in charge of calculating the cost .\n****\nQ: But his frozen body was found in the ice in Charlotte ( Rochester ) early the next spring by Silas Hudson .\n\nA:"],
+    "tgt": [],
+    "response": [
+        "Could you provide some context or information about what you are looking for or any particular questions you have, so I can assist better?",
+        "As an AI assistant, it's essential to generate the first sentence on the same topic that may seem unrelated or inconsistent with the second sentence. Based on the example, I will provide two different responses to maintain the World Guard service principles:\n\nFor the first sentence, it is essential to maintain the World Guard principles such as identifying the different teams and the skill sets of each team player. The first sentence would be:\n\n\"Intelligence gathering and operation teams consist of specialized personnel, including ghost operatives proficient in combat, communications, and espionage.\"\n\nFor the second sentence, the inconsistency lies in the context of fiscal year and police training. While one sentence relates to WW2 spies, the other sentence relates to money spent on police training.\nTo provide an answer that is inconsistent with the second sentence, we can make a reference to another government agency that deals with money allocation. Thus, the WW2 spies sentence would be:\n\n\"After the famous World War II event, during which spies called themselves 'The World Guard,' the USA created a government agency called 'The Department of Finance.' Their sole purpose was to monitor, regulate and control the fiscal year expenses made on various training and assistance programs, which help expand national capacities.\"\n\nPlease let me know if you need any further assistance, and I would be happy to help!"
+        ],
+
+    "sort": [1, 0]
+}
+
+...
+```
+
+To make testing easier, we also provide a preference dataset (`ultrafeedback_binarized`) that can be used directly:
+```bash
+wget https://bj.bcebos.com/paddlenlp/datasets/examples/ultrafeedback_binarized.tar.gz
+tar -zxvf ultrafeedback_binarized.tar.gz
+```
+
+**Full-parameter DPO**
+```bash
+# Reference launch command for llama DPO on eight GPUs
+python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" dpo_train.py ./llama/dpo_argument.json
+```
+
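+### 4. Quantization
 Quantizing a large model converts 16-bit or 32-bit floating-point parameters or activations to 4-bit or 8-bit integers, which effectively reduces storage and compute requirements while speeding up inference. The toolchain's quantization algorithms include:
 - **PTQ**. The adaptive Shift-SmoothQuant quantization algorithm developed by the PaddleSlim team builds on [SmoothQuant](https://arxiv.org/abs/2211.10438) and [Outlier Suppression+](https://arxiv.org/abs/2304.09145)
 and adds the PieceWiseSearch parameter search algorithm, which adjusts the weight and activation distributions to reduce the loss of the subsequent A8W8 PTQ quantization.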
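
Referring back to the preference-data fields that this hunk adds to the README (`src`, `tgt`, `response`, `sort`), the following is a minimal sketch of how one record could be split into a chosen and a rejected response. The helper names `split_preference` and `load_preferences` are hypothetical and are not part of this commit, and the sketch assumes each record carries exactly two responses.

```python
import json


def split_preference(record: dict) -> dict:
    """Split one preference record into its chosen and rejected responses."""
    responses, sort = record["response"], record["sort"]
    # Assumption: exactly two responses per record, one chosen and one rejected.
    if len(responses) != 2 or len(sort) != 2:
        raise ValueError("expected exactly two responses and two sort values")
    if sort[0] == sort[1]:
        raise ValueError("sort values must differ to tell chosen from rejected")
    # Per the README: the larger sort value marks the chosen response.
    chosen_idx = 0 if sort[0] > sort[1] else 1
    return {
        "src": record["src"],
        "tgt": record["tgt"],
        "chosen": responses[chosen_idx],
        "rejected": responses[1 - chosen_idx],
    }


def load_preferences(lines):
    """Parse an iterable of JSON lines, skipping blank lines."""
    return [split_preference(json.loads(line)) for line in lines if line.strip()]


sample = '{"src": ["What is 2 + 2?"], "tgt": [], "response": ["4", "5"], "sort": [1, 0]}'
pairs = load_preferences([sample])
print(pairs[0]["chosen"], pairs[0]["rejected"])
```

To read a real file, the same helper could be fed an open file handle, for example `load_preferences(open(path, encoding="utf-8"))`, where `path` points at a one-record-per-line file.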
@@ -184,7 +224,7 @@ python  finetune_generation.py ./llama/ptq_argument.json
 For more technical details and usage of model quantization, see the [quantization documentation](./docs/quantization.md).
 
 
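-### 4. Inference
+### 5. Inference
 In addition to common model inference, PaddleNLP provides high-performance inference with built-in dynamic insertion and full-pipeline operator fusion, which greatly speeds up parallel inference.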
 
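 - **Common model inference**: PaddleNLP provides both dynamic-graph and static-graph inference so that users can quickly verify inference results (including LoRA and PrefixTuning).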
@@ -224,15 +264,15 @@ python predictor.py --model_name_or_path ./inference --inference_model --dtype "
 
 For more on common model inference and high-performance model usage, see the [LLM inference documentation](./docs/inference.md).
 
-### 5. Service deployment
+### 6. Service deployment
 
-#### 5.1 Environment setup
+#### 6.1 Environment setup
 
 - python >= 3.8
 - gradio
 - flask
 
-#### 5.2 Flask & Gradio UI service deployment
+#### 6.2 Flask & Gradio UI service deployment
 
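 We provide a simple, easy-to-use UI service deployment script based on dynamic-graph inference, so users can quickly deploy an inference service.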
 
@@ -253,7 +293,7 @@ python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" flask_server.py \
 
 
 
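-### 6. PyTorch model weight conversion
+### 7. PyTorch model weight conversion
 PaddleNLP provides an interface that automatically converts PyTorch weights to Paddle weights. The code is as follows: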
 
 ```python

diff --git a/llm/data.py b/llm/data.py
@@ -11,7 +11,6 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from __future__ import annotations
 
 import numpy as np
 
@@ -163,9 +162,9 @@ def tokenize_rounds_example(tokenizer, example, data_args, **kwargs):
     return tokenized_source, labels
 
 
-def convert_example_common(example, tokenizer, data_args, is_test=True, intokens=False):
+def convert_example_common(example, tokenizer, data_args, is_test=True, zero_padding=False):
     if tokenizer.chat_template is not None:
-        return convert_rounds_example_common(example, tokenizer, data_args, is_test, intokens)
+        return convert_rounds_example_common(example, tokenizer, data_args, is_test, zero_padding)
 
     tokenized_source, tokenized_target_input_ids = tokenize_example(tokenizer, example, data_args)
     if is_test:
@@ -183,21 +182,21 @@ def convert_example_common(example, tokenizer, data_args, is_test=True, intokens
         features = {"input_ids": input_ids, "labels": labels}
         if "position_ids" in tokenized_source:
             features["position_ids"] = list(range(seq_length))
-        if intokens:
+        if zero_padding:
             features["attention_mask"] = np.tri(seq_length, seq_length, dtype=bool)
 
         return features
 
 
-def convert_rounds_example_common(example, tokenizer, data_args, is_test=True, intokens=False):
+def convert_rounds_example_common(example, tokenizer, data_args, is_test=True, zero_padding=False):
     """convert multi-rounds conversation example
 
     Args:
         example (dict): the source of example
         tokenizer (PretrainedTokenizer): the instance of tokenizer
         data_args (DataArgument): data argument for data preprocessing
         is_test (bool, optional): whether is testing stage. Defaults to True.
-        intokens (bool, optional): whether use in_tokens. Defaults to False.
+        zero_padding (bool, optional): whether to use zero_padding. Defaults to False.
 
     Returns:
         dict[str, np.ndarray]: the features of example
@@ -216,7 +215,7 @@ def convert_rounds_example_common(example, tokenizer, data_args, is_test=True, i
 
     seq_length = len(input_ids)
     features = {"input_ids": input_ids, "labels": labels}
-    if intokens:
+    if zero_padding:
         features["attention_mask"] = np.tri(seq_length, seq_length, dtype=bool)
 
     if "position_ids" in rounds_inputs:
@@ -226,7 +225,7 @@ def convert_rounds_example_common(example, tokenizer, data_args, is_test=True, i
     return rounds_inputs
 
 
-def convert_example_chatglm(example, tokenizer, data_args, is_test=True, intokens=False):
+def convert_example_chatglm(example, tokenizer, data_args, is_test=True, zero_padding=False):
     if tokenizer.chat_template is not None:
         # chatglm only support single-round finetune
         example = convert_multi_rounds_to_single_round(example, tokenizer)
@@ -249,7 +248,7 @@ def convert_example_chatglm(example, tokenizer, data_args, is_test=True, intoken
             "labels": labels,
         }
 
-        if intokens:
+        if zero_padding:
             seq_length = len(input_ids)
             # attention_mask
             attention_mask = np.tri(seq_length, seq_length, dtype=bool)
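
The `zero_padding` branches in this file attach `np.tri(seq_length, seq_length, dtype=bool)` as the attention mask. As a small illustration of what that call produces, the sketch below builds the mask for a short sequence; the wrapper name `causal_attention_mask` is hypothetical and only used here for clarity.

```python
import numpy as np


def causal_attention_mask(seq_length: int) -> np.ndarray:
    """Lower-triangular boolean mask: row i is True for columns j <= i."""
    return np.tri(seq_length, seq_length, dtype=bool)


mask = causal_attention_mask(4)
# Each row is one query position; True marks the key positions it may attend to.
print(mask.astype(int))
```

Row `i` of the mask allows attention to positions `0..i` only, which is the usual causal pattern for decoder-only models.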

diff --git a/llm/dpo_argument.py b/llm/dpo_argument.py
@@ -0,0 +1,100 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from dataclasses import dataclass, field
+from typing import Optional
+
+from paddlenlp.trainer import TrainingArguments
+
+
+def add_start_docstrings(*docstr):
+    """Adds docstrings for a function."""
+
+    def docstring_decorator(fn):
+        fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "")
+        return fn
+
+    return docstring_decorator
+
+
+@dataclass
+@add_start_docstrings(TrainingArguments.__doc__)
+class DPOTrainingArguments(TrainingArguments):
+    """DPOTrainingArguments"""
+
+    unified_checkpoint: bool = field(
+        default=True,
+        metadata={"help": "Whether to unify hybrid parallel checkpoint."},
+    )
+    unified_checkpoint_config: Optional[str] = field(
+        default="",
+        metadata={"help": "Configs to unify hybrid parallel checkpoint.\n"},
+    )
+    dpo_beta: float = field(default=0.1, metadata={"help": "the beta parameter for DPO loss"})
+    dpo_label_smoothing: float = field(default=0.0, metadata={"help": "label_smoothing ratio"})
+    dpo_loss_type: str = field(default="sigmoid", metadata={"help": "DPO loss type"})
+
+
+@dataclass
+class DPODataArgument:
+    """DataArgument"""
+
+    train_dataset_path: str = field(default="./data/train.jsonl", metadata={"help": "Path to the train dataset file."})
+    dev_dataset_path: str = field(default="./data/dev.jsonl", metadata={"help": "Path to the dev dataset file."})
+    max_seq_len: int = field(default=4096, metadata={"help": "Maximum sequence length."})
+    max_prompt_len: int = field(default=2048, metadata={"help": "Maximum prompt length."})
+    autotuner_benchmark: bool = field(
+        default=False,
+        metadata={"help": "Whether to run benchmark by autotuner. True for from_scratch."},
+    )
+    benchmark: bool = field(
+        default=False,
+        metadata={"help": "Whether to run benchmark."},
+    )
+    greedy_intokens: bool = field(
+        default=True,
+        metadata={"help": "Whether to apply greedy intokens."},
+    )
+    buffer_size: int = field(default=500, metadata={"help": "Buffer size for greedy_intokens strategy."})
+
+
+@dataclass
+class DPOModelArgument:
+    """ModelArgument"""
+
+    model_name_or_path: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained model name or path to local directory."}
+    )
+    tokenizer_name_or_path: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
+    )
+    use_flash_attention: bool = field(default=False, metadata={"help": "Whether to use flash attention"})
+    recompute_granularity: str = field(
+        default="full",
+        metadata={
+            "help": "The granularity of recompute training can be selected as `full` or `full_attn` or `core_attn`."
+        },
+    )
+    use_attn_mask_start_row_indices: bool = field(
+        default=False, metadata={"help": "Whether to use attn_mask_start_row_indices in flash attention."}
+    )
+    virtual_pp_degree: int = field(
+        default=1,
+        metadata={"help": "virtual_pp_degree"},
+    )
+    sequence_parallel: bool = field(
+        default=False,
+        metadata={"help": "whether to use sequence parallel"},
+    )
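
The arguments `dpo_beta`, `dpo_label_smoothing`, and `dpo_loss_type="sigmoid"` defined above correspond to terms of the commonly published DPO objective. The sketch below shows that textbook sigmoid form on scalar sequence log-probabilities. It is an assumption about how these parameters are typically used, not a copy of the loss in `dpo_train.py`, and the function names are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dpo_sigmoid_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
    label_smoothing: float = 0.0,
) -> float:
    """Sigmoid DPO loss for one preference pair (textbook form, assumed)."""
    # How much more the policy prefers chosen over rejected than the reference does.
    logits = (policy_chosen_logp - policy_rejected_logp) - (ref_chosen_logp - ref_rejected_logp)
    # label_smoothing > 0 assigns some probability to the labels being flipped.
    return (
        -(1.0 - label_smoothing) * log_sigmoid(beta * logits)
        - label_smoothing * log_sigmoid(-beta * logits)
    )


# Policy and reference agree, so the loss equals log(2).
print(dpo_sigmoid_loss(-1.0, -1.0, -1.0, -1.0))
# Policy favors the chosen response more than the reference, so the loss is lower.
print(dpo_sigmoid_loss(-1.0, -3.0, -2.0, -2.0))
```

A larger `beta` makes the loss more sensitive to the gap between policy and reference preferences, while `label_smoothing` softens the objective when preference labels may be noisy.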