diff --git a/examples/text_matching/diffcse/README.md b/examples/text_matching/diffcse/README.md new file mode 100644 index 000000000000..d08bd3bbe6d0 --- /dev/null +++ b/examples/text_matching/diffcse/README.md @@ -0,0 +1,169 @@ +# 无监督语义匹配模型 [DiffCSE](https://arxiv.org/pdf/2204.10298.pdf) + +借鉴 [DiffCSE](https://arxiv.org/pdf/2204.10298.pdf) 的思路,实现了 DiffCSE 模型。相比于 SimCSE 模型,DiffCSE模型会更关注语句之间的差异性,具有精确的向量表示能力。DiffCSE 模型同样适合缺乏监督数据,但是又有大量无监督数据的匹配和检索场景。 + +## 快速开始 +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +``` +DiffCSE/ +├── model.py # DiffCSE 模型组网代码 +├── custom_ernie.py # 为适配 DiffCSE 模型,对ERNIE模型进行了部分修改 +├── data.py # 无监督语义匹配训练数据、测试数据的读取逻辑 +├── run_diffcse.py # 模型训练、评估、预测的主脚本 +├── utils.py # 包括一些常用的工具式函数 +├── run_train.sh # 模型训练的脚本 +├── run_eval.sh # 模型评估的脚本 +└── run_infer.sh # 模型预测的脚本 +``` + +### 模型训练 +默认使用无监督模式进行训练 DiffCSE,模型训练数据的数据样例如下所示,每行表示一条训练样本: +```shell +全年地方财政总收入3686.81亿元,比上年增长12.3%。 +“我对案情并不十分清楚,所以没办法提出批评,建议,只能希望通过质询,要求检察院对此做出说明。”他说。 +据调查结果显示:2015年微商行业总体市场规模达到1819.5亿元,预计2016年将达到3607.3亿元,增长率为98.3%。 +前往冈仁波齐需要办理目的地包含日喀则和阿里地区的边防证,外转沿途有一些补给点,可购买到干粮和饮料。 +``` + +可以运行如下命令,开始模型训练并且进行模型测试。 + +```shell +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_train" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "train" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --generator_name "ernie-3.0-base-zh" \ + --discriminator_name "ernie-3.0-base-zh" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --train_set_file "your train_set path" \ + --eval_set_file "your dev_set path" \ + --save_dir "./checkpoints" \ + --log_dir ${log_dir} \ + --save_steps "50000" \ + --eval_steps "1000" \ + --epochs "3" \ + --batch_size "32" \ + --mlm_probability "0.15" \ + --lambda_weight "0.15" \ + --learning_rate "3e-5" \ + --weight_decay "0.01" \ + --warmup_proportion "0.01" \ + --seed "0" \ + --device "gpu" +``` + +可支持配置的参数: +* `mode`:可选,用于指明本次运行是模型训练、模型评估还是模型预测,仅支持[train, eval, infer]三种模式;默认为 infer。 +* `encoder_name`:可选,DiffCSE模型中用于向量抽取的模型名称;默认为 ernie-3.0-base-zh。 +* `generator_name`: 可选,DiffCSE模型中生成器的模型名称;默认为 ernie-3.0-base-zh。 +* `discriminator_name`: 可选,DiffCSE模型中判别器的模型名称;默认为 rocketqa-zh-dureader-query-encoder。 +* `max_seq_length`:可选,ERNIE-Gram 模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `output_emb_size`:可选,向量抽取模型输出向量的维度;默认为32。 +* `train_set_file`:可选,用于指定训练集的路径。 +* `eval_set_file`:可选,用于指定验证集的路径。 +* `save_dir`:可选,保存训练模型的目录; +* `log_dir`:可选,训练训练过程中日志的输出目录; +* `save_steps`:可选,用于指定模型训练过程中每隔多少 step 保存一次模型。 +* `eval_steps`:可选,用于指定模型训练过程中每隔多少 step,使用验证集评估一次模型。 +* `epochs`: 模型训练轮次,默认为3。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `mlm_probability`:可选,利用生成器预测时,控制单词掩码的比例,默认为0.15。 +* `lambda_weight`:可选,控制RTD任务loss的占比,默认为0.15。 +* `learning_rate`:可选,Fine-tune 的最大学习率;默认为5e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.01。 +* `warmup_proportion`:可选,学习率 warmup 策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到 learning_rate, 而后再缓慢衰减,默认为0.01。 +* `seed`:可选,随机种子,默认为1000. +* `device`: 选用什么设备进行训练,可选 cpu 或 gpu。如使用 gpu 训练则参数 gpus 指定GPU卡号。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── best +│   ├── model_state.pdparams +│   ├── tokenizer_config.json +│   ├── special_tokens_map.json +│   └── vocab.txt +└── ... +``` + +### 模型评估 +在模型评估时,需要使用带有标签的数据,以下展示了几条模型评估数据样例,每行表示一条训练样本,每行共计包含3列,分别是query1, query2, label: +```shell +右键单击此电脑选择属性,如下图所示 右键单击此电脑选择属性,如下图所示 5 +好医生解密||是什么,让美洲大蠊能美容还能救命 解密美洲大蠊巨大药用价值 1 +蒜香蜜汁烤鸡翅的做法 外香里嫩一口爆汁蒜蓉蜜汁烤鸡翅的做法 3 +项目计划书 篇2 简易项目计划书(参考模板) 2 +夏天幼儿园如何正确使用空调? 老师们该如何正确使用空调,让孩子少生病呢? 
3 +``` + + +可以运行如下命令,进行模型评估。 + +```shell +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_eval" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "eval" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --eval_set_file "your dev_set path" \ + --ckpt_dir "./checkpoints/best" \ + --batch_size "32" \ + --seed "0" \ + --device "gpu" +``` +可支持配置的参数: +* `ckpt_dir`: 用于指定进行模型评估的checkpoint路径。 + +其他参数解释同上。 + +### 基于动态图模型预测 +在模型预测时,需要给定待预测的两条文本,以下展示了几条模型预测的数据样例,每行表示一条训练样本,每行共计包含2列,分别是query1, query2: +```shell +韩国现代摩比斯2015招聘 韩国现代摩比斯2015校园招聘信息 +《DNF》封号减刑方法 被封一年怎么办? DNF封号减刑方法 封号一年怎么减刑 +原神手鞠游戏三个刷新位置一览 手鞠游戏三个刷新位置一览 +``` + +可以运行如下命令,进行模型预测: +```shell +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_infer" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "infer" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --infer_set_file "your test_set path \ + --ckpt_dir "./checkpoints/best" \ + --save_infer_path "./infer_result.txt" \ + --batch_size "32" \ + --seed "0" \ + --device "gpu" +``` + +可支持配置的参数: +* `infer_set_file`: 可选,用于指定测试集的路径。 +* `save_infer_path`: 可选,用于保存模型预测结果的文件路径。 + +其他参数解释同上。 待模型预测结束后,会将结果保存至save_infer_path参数指定的文件中。 + + +## Reference +[1] Chuang Y S , Dangovski R , Luo H , et al. DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings[J]. arXiv e-prints, 2022. https://arxiv.org/pdf/2204.10298.pdf. diff --git a/examples/text_matching/diffcse/custom_ernie.py b/examples/text_matching/diffcse/custom_ernie.py new file mode 100644 index 000000000000..d31e67ab523c --- /dev/null +++ b/examples/text_matching/diffcse/custom_ernie.py @@ -0,0 +1,1175 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +from paddlenlp.transformers import PretrainedModel, register_base_model + +__all__ = [ + 'ErnieModel', 'ErniePretrainedModel', 'ErnieForSequenceClassification', + 'ErnieForTokenClassification', 'ErnieForQuestionAnswering', + 'ErnieForPretraining', 'ErniePretrainingCriterion', 'ErnieForMaskedLM', + 'ErnieForMultipleChoice' +] + + +class ErnieEmbeddings(nn.Layer): + r""" + Include embeddings from word, position and token_type embeddings. 
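+
+    The final embedding is the element-wise sum of the word, position and
+    token_type embeddings (plus an optional task_type embedding when
+    `use_task_id` is True), followed by layer normalization and dropout.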
+ """ + + def __init__(self, + vocab_size, + hidden_size=768, + hidden_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=2, + pad_token_id=0, + weight_attr=None, + task_type_vocab_size=3, + task_id=0, + use_task_id=False): + super(ErnieEmbeddings, self).__init__() + + self.word_embeddings = nn.Embedding(vocab_size, + hidden_size, + padding_idx=pad_token_id, + weight_attr=weight_attr) + self.position_embeddings = nn.Embedding(max_position_embeddings, + hidden_size, + weight_attr=weight_attr) + self.token_type_embeddings = nn.Embedding(type_vocab_size, + hidden_size, + weight_attr=weight_attr) + self.use_task_id = use_task_id + self.task_id = task_id + if self.use_task_id: + self.task_type_embeddings = nn.Embedding(task_type_vocab_size, + hidden_size, + weight_attr=weight_attr) + self.layer_norm = nn.LayerNorm(hidden_size) + self.dropout = nn.Dropout(hidden_dropout_prob) + + def forward(self, + input_ids, + token_type_ids=None, + position_ids=None, + task_type_ids=None): + if position_ids is None: + # maybe need use shape op to unify static graph and dynamic graph + #seq_length = input_ids.shape[1] + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=1) + position_ids = seq_length - ones + position_ids.stop_gradient = True + if token_type_ids is None: + token_type_ids = paddle.zeros_like(input_ids, dtype="int64") + input_embedings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = input_embedings + position_embeddings + token_type_embeddings + if self.use_task_id: + if task_type_ids is None: + task_type_ids = paddle.ones_like(input_ids, + dtype="int64") * self.task_id + task_type_embeddings = self.task_type_embeddings(task_type_ids) + embeddings = embeddings + task_type_embeddings + embeddings = self.layer_norm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class ErniePooler(nn.Layer): + + def __init__(self, hidden_size, weight_attr=None): + super(ErniePooler, self).__init__() + self.dense = nn.Linear(hidden_size, + hidden_size, + weight_attr=weight_attr) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class ErniePretrainedModel(PretrainedModel): + r""" + An abstract class for pretrained ERNIE models. It provides ERNIE related + `model_config_file`, `pretrained_init_configuration`, `resource_files_names`, + `pretrained_resource_files_map`, `base_model_prefix` for downloading and + loading pretrained models. + Refer to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. 
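+    Subclasses share the `init_weights` hook defined below, which re-initializes
+    Linear and Embedding weights from a normal distribution using the configured
+    `initializer_range` and resets the epsilon of LayerNorm layers to 1e-12.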
+ """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + # Deprecated, alias for ernie-1.0-base-zh + "ernie-1.0": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "ernie-1.0-base-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "ernie-1.0-large-zh-cw": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 3072, # it is 3072 instead of 4096 + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "ernie-tiny": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 600, + "num_attention_heads": 16, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 50006, + "pad_token_id": 0, + }, + "ernie-2.0-base-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 18000, + }, + "ernie-2.0-large-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "intermediate_size": 4096, # special for large model + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 4, + "vocab_size": 12800, + }, + "ernie-2.0-base-en": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "ernie-2.0-base-en-finetuned-squad": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "ernie-2.0-large-en": { + "attention_probs_dropout_prob": 0.1, + "intermediate_size": 4096, # special for ernie-2.0-large-en + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "rocketqa-zh-dureader-query-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + 
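+            # identical architecture to ernie-1.0-base-zh; this checkpoint is
+            # used as the sentence encoder in the README's example commands.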
"num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "rocketqa-zh-dureader-para-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "rocketqa-v1-marco-query-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "rocketqa-v1-marco-para-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "rocketqa-zh-dureader-cross-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "rocketqa-v1-marco-cross-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "ernie-3.0-xbase-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "intermediate_size": 4096, # special for large model + "hidden_size": 1024, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 16, + "num_hidden_layers": 20, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000 + }, + "ernie-3.0-base-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "task_type_vocab_size": 3, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000 + }, + "ernie-3.0-medium-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "intermediate_size": 3072, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 6, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000 + }, + "ernie-3.0-mini-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 384, + "intermediate_size": 1536, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 6, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000 + }, + "ernie-3.0-micro-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 384, + 
"intermediate_size": 1536, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 4, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000 + }, + "ernie-3.0-nano-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 312, + "intermediate_size": 1248, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 4, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000 + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + # Deprecated, alias for ernie-1.0-base-zh + "ernie-1.0": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/ernie_v1_chn_base.pdparams", + "ernie-1.0-base-zh": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/ernie_v1_chn_base.pdparams", + "ernie-1.0-large-zh-cw": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/ernie_1.0_large_zh_cw.pdparams", + "ernie-tiny": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_tiny/ernie_tiny.pdparams", + "ernie-2.0-base-zh": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_2.0/ernie_2.0_base_zh.pdparams", + "ernie-2.0-large-zh": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_2.0/ernie_2.0_large_zh.pdparams", + "ernie-2.0-base-en": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_v2_base/ernie_v2_eng_base.pdparams", + "ernie-2.0-base-en-finetuned-squad": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_v2_base/ernie_v2_eng_base_finetuned_squad.pdparams", + "ernie-2.0-large-en": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_v2_large/ernie_v2_eng_large.pdparams", + "rocketqa-zh-dureader-query-encoder": + "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_zh_dureader_query_encoder.pdparams", + "rocketqa-zh-dureader-para-encoder": + "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_zh_dureader_para_encoder.pdparams", + "rocketqa-v1-marco-query-encoder": + "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_v1_marco_query_encoder.pdparams", + "rocketqa-v1-marco-para-encoder": + "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_v1_marco_para_encoder.pdparams", + "rocketqa-zh-dureader-cross-encoder": + "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_zh_dureader_cross_encoder.pdparams", + "rocketqa-v1-marco-cross-encoder": + "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_v1_marco_cross_encoder.pdparams", + "ernie-3.0-base-zh": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams", + "ernie-3.0-xbase-zh": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_xbase_zh.pdparams", + "ernie-3.0-medium-zh": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh.pdparams", + "ernie-3.0-mini-zh": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_mini_zh.pdparams", + "ernie-3.0-micro-zh": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_micro_zh.pdparams", + "ernie-3.0-nano-zh": + "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_nano_zh.pdparams", + } + } + base_model_prefix = "ernie" + + def init_weights(self, layer): + """ 
Initialization hook """ + if isinstance(layer, (nn.Linear, nn.Embedding)): + # only support dygraph, use truncated_normal and make it inplace + # and configurable later + if isinstance(layer.weight, paddle.Tensor): + layer.weight.set_value( + paddle.tensor.normal(mean=0.0, + std=self.initializer_range if hasattr( + self, "initializer_range") else + self.ernie.config["initializer_range"], + shape=layer.weight.shape)) + elif isinstance(layer, nn.LayerNorm): + layer._epsilon = 1e-12 + + +@register_base_model +class ErnieModel(ErniePretrainedModel): + r""" + The bare ERNIE Model transformer outputting raw hidden-states. + This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. + Refer to the superclass documentation for the generic methods. + This model is also a Paddle `paddle.nn.Layer `__ subclass. Use it as a regular Paddle Layer + and refer to the Paddle documentation for all matter related to general usage and behavior. + Args: + vocab_size (int): + Vocabulary size of `inputs_ids` in `ErnieModel`. Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `ErnieModel`. + hidden_size (int, optional): + Dimensionality of the embedding layer, encoder layers and pooler layer. Defaults to `768`. + num_hidden_layers (int, optional): + Number of hidden layers in the Transformer encoder. Defaults to `12`. + num_attention_heads (int, optional): + Number of attention heads for each attention layer in the Transformer encoder. + Defaults to `12`. + intermediate_size (int, optional): + Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors + to ff layers are firstly projected from `hidden_size` to `intermediate_size`, + and then projected back to `hidden_size`. Typically `intermediate_size` is larger than `hidden_size`. + Defaults to `3072`. + hidden_act (str, optional): + The non-linear activation function in the feed-forward layer. + ``"gelu"``, ``"relu"`` and any other paddle supported activation functions + are supported. Defaults to `"gelu"`. + hidden_dropout_prob (float, optional): + The dropout probability for all fully connected layers in the embeddings and encoder. + Defaults to `0.1`. + attention_probs_dropout_prob (float, optional): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + Defaults to `0.1`. + max_position_embeddings (int, optional): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + type_vocab_size (int, optional): + The vocabulary size of the `token_type_ids`. + Defaults to `2`. + initializer_range (float, optional): + The standard deviation of the normal initializer for initializing all weight matrices. + Defaults to `0.02`. + + .. note:: + A normal_initializer initializes weight matrices as normal distributions. + See :meth:`ErniePretrainedModel._init_weights()` for how weights are initialized in `ErnieModel`. + pad_token_id(int, optional): + The index of padding token in the token vocabulary. + Defaults to `0`. 
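+        task_type_vocab_size (int, optional):
+            The vocabulary size of `task_type_ids`, only used when `use_task_id`
+            is `True`. Defaults to `3`.
+        task_id (int, optional):
+            The task id used to build `task_type_ids` when they are not provided.
+            Defaults to `0`.
+        use_task_id (bool, optional):
+            Whether to add task type embeddings on top of the word, position and
+            token type embeddings. Defaults to `False`.
+
+        .. note::
+            This is a local copy of the PaddleNLP ERNIE implementation, modified
+            so that `forward` accepts an extra `cls_input` tensor. When given,
+            `cls_input` replaces the embedding at the first ([CLS]) position
+            before the transformer encoder; DiffCSE uses this to condition the
+            discriminator on the MLP-projected sentence embedding.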
+ """ + + def __init__(self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=2, + initializer_range=0.02, + pad_token_id=0, + task_type_vocab_size=3, + task_id=0, + use_task_id=False): + super(ErnieModel, self).__init__() + self.pad_token_id = pad_token_id + self.initializer_range = initializer_range + weight_attr = paddle.ParamAttr( + initializer=nn.initializer.TruncatedNormal( + mean=0.0, std=self.initializer_range)) + self.embeddings = ErnieEmbeddings(vocab_size, hidden_size, + hidden_dropout_prob, + max_position_embeddings, + type_vocab_size, pad_token_id, + weight_attr, task_type_vocab_size, + task_id, use_task_id) + encoder_layer = nn.TransformerEncoderLayer( + hidden_size, + num_attention_heads, + intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=0, + weight_attr=weight_attr, + normalize_before=False) + self.encoder = nn.TransformerEncoder(encoder_layer, num_hidden_layers) + self.pooler = ErniePooler(hidden_size, weight_attr) + self.apply(self.init_weights) + + def forward(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + task_type_ids=None, + cls_input=None): + r""" + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + token_type_ids (Tensor, optional): + Segment token indices to indicate different portions of the inputs. + Selected in the range ``[0, type_vocab_size - 1]``. + If `type_vocab_size` is 2, which means the inputs have two portions. + Indices can either be 0 or 1: + - 0 corresponds to a *sentence A* token, + - 1 corresponds to a *sentence B* token. + Its data type should be `int64` and it has a shape of [batch_size, sequence_length]. + Defaults to `None`, which means we don't add segment embeddings. + position_ids (Tensor, optional): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0, + max_position_embeddings - 1]``. + Shape as `[batch_size, num_tokens]` and dtype as int64. Defaults to `None`. + attention_mask (Tensor, optional): + Mask used in multi-head attention to avoid performing attention on to some unwanted positions, + usually the paddings or the subsequent positions. + Its data type can be int, float and bool. + When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. + When the data type is int, the `masked` tokens have `0` values and the others have `1` values. + When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. + It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. + For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], + [batch_size, num_attention_heads, sequence_length, sequence_length]. + We use whole-word-mask in ERNIE, so the whole word will have the same value. For example, "使用" as a word, + "使" and "用" will have the same value. + Defaults to `None`, which means nothing needed to be prevented attention to. + Returns: + tuple: Returns tuple (``sequence_output``, ``pooled_output``). 
+ With the fields: + - `sequence_output` (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. + - `pooled_output` (Tensor): + The output of first token (`[CLS]`) in sequence. + We "pool" the model by simply taking the hidden state corresponding to the first token. + Its data type should be float32 and its shape is [batch_size, hidden_size]. + Example: + .. code-block:: + import paddle + from paddlenlp.transformers import ErnieModel, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieModel.from_pretrained('ernie-1.0') + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + sequence_output, pooled_output = model(**inputs) + """ + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype( + self.pooler.dense.weight.dtype) * -1e4, + axis=[1, 2]) + # For 2D attention_mask from tokenizer + elif attention_mask.ndim == 2: + attention_mask = paddle.unsqueeze( + attention_mask, axis=[1, 2]).astype(paddle.get_default_dtype()) + attention_mask = (1.0 - attention_mask) * -1e4 + attention_mask.stop_gradient = True + + embedding_output = self.embeddings(input_ids=input_ids, + position_ids=position_ids, + token_type_ids=token_type_ids, + task_type_ids=task_type_ids) + + if cls_input is not None: + embedding_output = paddle.concat( + [cls_input.unsqueeze(1), embedding_output[:, 1:, :]], axis=1) + + encoder_outputs = self.encoder(embedding_output, attention_mask) + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output + + +class ErnieForSequenceClassification(ErniePretrainedModel): + r""" + Ernie Model with a linear layer on top of the output layer, + designed for sequence classification/regression tasks like GLUE tasks. + Args: + ernie (ErnieModel): + An instance of `paddlenlp.transformers.ErnieModel`. + num_classes (int, optional): + The number of classes. Default to `2`. + dropout (float, optional): + The dropout probability for output of ERNIE. + If None, use the same value as `hidden_dropout_prob` + of `paddlenlp.transformers.ErnieModel` instance. Defaults to `None`. + """ + + def __init__(self, ernie, num_classes=2, dropout=None): + super(ErnieForSequenceClassification, self).__init__() + self.num_classes = num_classes + self.ernie = ernie # allow ernie to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self. + ernie.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.ernie.config["hidden_size"], + num_classes) + self.apply(self.init_weights) + + def forward(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + Tensor: Returns tensor `logits`, a tensor of the input text classification logits. + Shape as `[batch_size, num_classes]` and dtype as float32. + Example: + .. 
code-block:: + import paddle + from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieForSequenceClassification.from_pretrained('ernie-1.0') + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + """ + _, pooled_output = self.ernie(input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + return logits + + +class ErnieForQuestionAnswering(ErniePretrainedModel): + """ + Ernie Model with a linear layer on top of the hidden-states + output to compute `span_start_logits` and `span_end_logits`, + designed for question-answering tasks like SQuAD. + Args: + ernie (`ErnieModel`): + An instance of `ErnieModel`. + """ + + def __init__(self, ernie): + super(ErnieForQuestionAnswering, self).__init__() + self.ernie = ernie # allow ernie to be config + self.classifier = nn.Linear(self.ernie.config["hidden_size"], 2) + self.apply(self.init_weights) + + def forward(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + With the fields: + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + Example: + .. code-block:: + import paddle + from paddlenlp.transformers import ErnieForQuestionAnswering, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieForQuestionAnswering.from_pretrained('ernie-1.0') + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + """ + + sequence_output, _ = self.ernie(input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask) + + logits = self.classifier(sequence_output) + logits = paddle.transpose(logits, perm=[2, 0, 1]) + start_logits, end_logits = paddle.unstack(x=logits, axis=0) + + return start_logits, end_logits + + +class ErnieForTokenClassification(ErniePretrainedModel): + r""" + ERNIE Model with a linear layer on top of the hidden-states output layer, + designed for token classification tasks like NER tasks. + Args: + ernie (`ErnieModel`): + An instance of `ErnieModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of ERNIE. + If None, use the same value as `hidden_dropout_prob` + of `ErnieModel` instance `ernie`. Defaults to `None`. 
+ """ + + def __init__(self, ernie, num_classes=2, dropout=None): + super(ErnieForTokenClassification, self).__init__() + self.num_classes = num_classes + self.ernie = ernie # allow ernie to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self. + ernie.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.ernie.config["hidden_size"], + num_classes) + self.apply(self.init_weights) + + def forward(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + Tensor: Returns tensor `logits`, a tensor of the input token classification logits. + Shape as `[batch_size, sequence_length, num_classes]` and dtype as `float32`. + Example: + .. code-block:: + import paddle + from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieForTokenClassification.from_pretrained('ernie-1.0') + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + """ + sequence_output, _ = self.ernie(input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask) + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + return logits + + +class ErnieLMPredictionHead(nn.Layer): + r""" + Ernie Model with a `language modeling` head on top. + """ + + def __init__( + self, + hidden_size, + vocab_size, + activation, + embedding_weights=None, + weight_attr=None, + ): + super(ErnieLMPredictionHead, self).__init__() + + self.transform = nn.Linear(hidden_size, + hidden_size, + weight_attr=weight_attr) + self.activation = getattr(nn.functional, activation) + self.layer_norm = nn.LayerNorm(hidden_size) + self.decoder_weight = self.create_parameter( + shape=[vocab_size, hidden_size], + dtype=self.transform.weight.dtype, + attr=weight_attr, + is_bias=False) if embedding_weights is None else embedding_weights + self.decoder_bias = self.create_parameter( + shape=[vocab_size], dtype=self.decoder_weight.dtype, is_bias=True) + + def forward(self, hidden_states, masked_positions=None): + if masked_positions is not None: + hidden_states = paddle.reshape(hidden_states, + [-1, hidden_states.shape[-1]]) + hidden_states = paddle.tensor.gather(hidden_states, + masked_positions) + # gather masked tokens might be more quick + hidden_states = self.transform(hidden_states) + hidden_states = self.activation(hidden_states) + hidden_states = self.layer_norm(hidden_states) + hidden_states = paddle.tensor.matmul( + hidden_states, self.decoder_weight, + transpose_y=True) + self.decoder_bias + return hidden_states + + +class ErniePretrainingHeads(nn.Layer): + + def __init__( + self, + hidden_size, + vocab_size, + activation, + embedding_weights=None, + weight_attr=None, + ): + super(ErniePretrainingHeads, self).__init__() + self.predictions = ErnieLMPredictionHead(hidden_size, vocab_size, + activation, embedding_weights, + weight_attr) + self.seq_relationship = nn.Linear(hidden_size, + 2, + weight_attr=weight_attr) + + def forward(self, sequence_output, pooled_output, masked_positions=None): + prediction_scores = self.predictions(sequence_output, 
masked_positions) + seq_relationship_score = self.seq_relationship(pooled_output) + return prediction_scores, seq_relationship_score + + +class ErnieForPretraining(ErniePretrainedModel): + r""" + Ernie Model with a `masked language modeling` head and a `sentence order prediction` head + on top. + """ + + def __init__(self, ernie): + super(ErnieForPretraining, self).__init__() + self.ernie = ernie + weight_attr = paddle.ParamAttr( + initializer=nn.initializer.TruncatedNormal( + mean=0.0, std=self.ernie.initializer_range)) + self.cls = ErniePretrainingHeads( + self.ernie.config["hidden_size"], + self.ernie.config["vocab_size"], + self.ernie.config["hidden_act"], + embedding_weights=self.ernie.embeddings.word_embeddings.weight, + weight_attr=weight_attr, + ) + + self.apply(self.init_weights) + + def forward(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + masked_positions=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + tuple: Returns tuple (``prediction_scores``, ``seq_relationship_score``). + With the fields: + - `prediction_scores` (Tensor): + The scores of masked token prediction. Its data type should be float32. + If `masked_positions` is None, its shape is [batch_size, sequence_length, vocab_size]. + Otherwise, its shape is [batch_size, mask_token_num, vocab_size]. + - `seq_relationship_score` (Tensor): + The scores of next sentence prediction. + Its data type should be float32 and its shape is [batch_size, 2]. + """ + with paddle.static.amp.fp16_guard(): + outputs = self.ernie(input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask) + sequence_output, pooled_output = outputs[:2] + prediction_scores, seq_relationship_score = self.cls( + sequence_output, pooled_output, masked_positions) + return prediction_scores, seq_relationship_score + + +class ErniePretrainingCriterion(paddle.nn.Layer): + r""" + The loss output of Ernie Model during the pretraining: + a `masked language modeling` head and a `next sentence prediction (classification)` head. + """ + + def __init__(self, with_nsp_loss=True): + super(ErniePretrainingCriterion, self).__init__() + self.with_nsp_loss = with_nsp_loss + #self.loss_fn = paddle.nn.loss.CrossEntropyLoss(ignore_index=-1) + + def forward(self, + prediction_scores, + seq_relationship_score, + masked_lm_labels, + next_sentence_labels=None): + """ + Args: + prediction_scores(Tensor): + The scores of masked token prediction. Its data type should be float32. + If `masked_positions` is None, its shape is [batch_size, sequence_length, vocab_size]. + Otherwise, its shape is [batch_size, mask_token_num, vocab_size] + seq_relationship_score(Tensor): + The scores of next sentence prediction. Its data type should be float32 and + its shape is [batch_size, 2] + masked_lm_labels(Tensor): + The labels of the masked language modeling, its dimensionality is equal to `prediction_scores`. + Its data type should be int64. If `masked_positions` is None, its shape is [batch_size, sequence_length, 1]. + Otherwise, its shape is [batch_size, mask_token_num, 1] + next_sentence_labels(Tensor): + The labels of the next sentence prediction task, the dimensionality of `next_sentence_labels` + is equal to `seq_relation_labels`. 
Its data type should be int64 and + its shape is [batch_size, 1] + Returns: + Tensor: The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean of `next_sentence_loss`. + Its data type should be float32 and its shape is [1]. + """ + + with paddle.static.amp.fp16_guard(): + masked_lm_loss = F.cross_entropy(prediction_scores, + masked_lm_labels, + ignore_index=-1, + reduction='none') + + if not self.with_nsp_loss: + return paddle.mean(masked_lm_loss) + + next_sentence_loss = F.cross_entropy(seq_relationship_score, + next_sentence_labels, + reduction='none') + return paddle.mean(masked_lm_loss), paddle.mean(next_sentence_loss) + + +class ErnieOnlyMLMHead(nn.Layer): + + def __init__(self, hidden_size, vocab_size, activation, embedding_weights): + super().__init__() + self.predictions = ErnieLMPredictionHead( + hidden_size=hidden_size, + vocab_size=vocab_size, + activation=activation, + embedding_weights=embedding_weights) + + def forward(self, sequence_output, masked_positions=None): + prediction_scores = self.predictions(sequence_output, masked_positions) + return prediction_scores + + +class ErnieForMaskedLM(ErniePretrainedModel): + """ + Ernie Model with a `masked language modeling` head on top. + Args: + ernie (:class:`ErnieModel`): + An instance of :class:`ErnieModel`. + """ + + def __init__(self, ernie): + super(ErnieForMaskedLM, self).__init__() + self.ernie = ernie + self.cls = ErnieOnlyMLMHead( + self.ernie.config["hidden_size"], + self.ernie.config["vocab_size"], + self.ernie.config["hidden_act"], + embedding_weights=self.ernie.embeddings.word_embeddings.weight) + + self.apply(self.init_weights) + + def forward(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + Tensor: Returns tensor `prediction_scores`, The scores of masked token prediction. + Its data type should be float32 and shape is [batch_size, sequence_length, vocab_size]. + Example: + .. code-block:: + import paddle + from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieForMaskedLM.from_pretrained('ernie-1.0') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + print(logits.shape) + # [1, 17, 18000] + """ + + outputs = self.ernie(input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask) + sequence_output = outputs[0] + prediction_scores = self.cls(sequence_output, masked_positions=None) + return prediction_scores + + +class ErnieForMultipleChoice(ErniePretrainedModel): + """ + Ernie Model with a linear layer on top of the hidden-states output layer, + designed for multiple choice tasks like RocStories/SWAG tasks. + + Args: + ernie (:class:`ErnieModel`): + An instance of ErnieModel. + num_choices (int, optional): + The number of choices. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Ernie. + If None, use the same value as `hidden_dropout_prob` of `ErnieModel` + instance `ernie`. Defaults to None. 
+ """ + + def __init__(self, ernie, num_choices=2, dropout=None): + super(ErnieForMultipleChoice, self).__init__() + self.num_choices = num_choices + self.ernie = ernie + self.dropout = nn.Dropout(dropout if dropout is not None else self. + ernie.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.ernie.config["hidden_size"], 1) + self.apply(self.init_weights) + + def forward(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + r""" + The ErnieForMultipleChoice forward method, overrides the __call__() special method. + Args: + input_ids (Tensor): + See :class:`ErnieModel` and shape as [batch_size, num_choice, sequence_length]. + token_type_ids(Tensor, optional): + See :class:`ErnieModel` and shape as [batch_size, num_choice, sequence_length]. + position_ids(Tensor, optional): + See :class:`ErnieModel` and shape as [batch_size, num_choice, sequence_length]. + attention_mask (list, optional): + See :class:`ErnieModel` and shape as [batch_size, num_choice, sequence_length]. + Returns: + Tensor: Returns tensor `reshaped_logits`, a tensor of the multiple choice classification logits. + Shape as `[batch_size, num_choice]` and dtype as `float32`. + """ + # input_ids: [bs, num_choice, seq_l] + input_ids = input_ids.reshape(shape=( + -1, input_ids.shape[-1])) # flat_input_ids: [bs*num_choice,seq_l] + + if position_ids is not None: + position_ids = position_ids.reshape(shape=(-1, + position_ids.shape[-1])) + if token_type_ids is not None: + token_type_ids = token_type_ids.reshape( + shape=(-1, token_type_ids.shape[-1])) + + if attention_mask is not None: + attention_mask = attention_mask.reshape( + shape=(-1, attention_mask.shape[-1])) + + _, pooled_output = self.ernie(input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask) + pooled_output = self.dropout(pooled_output) + + logits = self.classifier(pooled_output) # logits: (bs*num_choice,1) + reshaped_logits = logits.reshape( + shape=(-1, self.num_choices)) # logits: (bs, num_choice) + + return reshaped_logits diff --git a/examples/text_matching/diffcse/data.py b/examples/text_matching/diffcse/data.py new file mode 100644 index 000000000000..6b8cee1c1978 --- /dev/null +++ b/examples/text_matching/diffcse/data.py @@ -0,0 +1,143 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
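+
+"""Data helpers for the DiffCSE example: dataloader construction, example
+conversion, text/text-pair readers and BERT-style token masking (80% [MASK],
+10% random, 10% unchanged) used to feed the generator."""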
+ +import paddle + +import os +import random +import numpy as np + + +def get_special_tokens(): + return ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"] + + +def get_special_token_ids(tokenizer): + special_tokens = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"] + return tokenizer.convert_tokens_to_ids(special_tokens) + + +def get_special_token_dict(tokenizer): + special_tokens = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"] + special_token_dict = dict( + zip(special_tokens, tokenizer.convert_tokens_to_ids(special_tokens))) + return special_token_dict + + +def create_dataloader(dataset, + mode="train", + batch_size=1, + batchify_fn=None, + trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, + batch_size=batch_size, + shuffle=shuffle) + return paddle.io.DataLoader(dataset=dataset, + batch_sampler=batch_sampler, + collate_fn=batchify_fn, + return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + result = [] + for key, text in example.items(): + if "label" in key: + # do_evaluate + result += [example["label"]] + else: + # do_train + encoded_inputs = tokenizer(text=text, + max_seq_len=max_seq_length, + return_attention_mask=True) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + attention_mask = encoded_inputs["attention_mask"] + result += [input_ids, token_type_ids, attention_mask] + return result + + +def read_text_single(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip() + yield {"text_a": data, "text_b": data} + + +def masked_fill(x, mask, value): + y = paddle.full(x.shape, value, x.dtype) + return paddle.where(mask, y, x) + + +def mask_tokens(batch_inputs, tokenizer, mlm_probability=0.15): + """ + Description: Mask input_ids for masked language modeling: 80% MASK, 10% random, 10% original + """ + mlm_inputs = batch_inputs.clone() + mlm_labels = batch_inputs.clone() + + probability_matrix = paddle.full(mlm_inputs.shape, mlm_probability) + + special_tokens_mask = paddle.cast(paddle.zeros(mlm_inputs.shape), + dtype=bool) + for special_token_id in get_special_token_ids(tokenizer): + special_tokens_mask |= (mlm_inputs == special_token_id) + + probability_matrix = masked_fill(probability_matrix, special_tokens_mask, + 0.0) + + masked_indices = paddle.cast(paddle.bernoulli(probability_matrix), + dtype=bool) + mlm_labels = masked_fill(mlm_labels, ~masked_indices, -100) + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = paddle.cast(paddle.bernoulli( + paddle.full(mlm_inputs.shape, 0.8)), + dtype=bool) & masked_indices + mlm_inputs = masked_fill(mlm_inputs, indices_replaced, + tokenizer.mask_token_id) + + # 10% of the time, we replace masked input tokens with random word + indices_random = paddle.cast( + paddle.bernoulli(paddle.full(mlm_inputs.shape, 0.5)), + dtype=bool) & masked_indices & ~indices_replaced + random_words = paddle.randint(0, + len(tokenizer), + mlm_inputs.shape, + dtype=mlm_inputs.dtype) + mlm_inputs = paddle.where(indices_random, random_words, mlm_inputs) + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return mlm_inputs, mlm_labels + + +def read_text_pair(data_path, is_infer=False): + with 
open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if is_infer: + if len(data[0]) == 0 or len(data[1]) == 0: + continue + yield {"text_a": data[0], "text_b": data[1]} + else: + if len(data[0]) == 0 or len(data[1]) == 0 or len(data[2]) == 0: + continue + yield {"text_a": data[0], "text_b": data[1], "label": data[2]} diff --git a/examples/text_matching/diffcse/model.py b/examples/text_matching/diffcse/model.py new file mode 100644 index 000000000000..7739757d5350 --- /dev/null +++ b/examples/text_matching/diffcse/model.py @@ -0,0 +1,349 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddlenlp.transformers import AutoTokenizer, AutoModel, ErnieForMaskedLM + +from data import mask_tokens +from custom_ernie import ErnieModel as CustomErnie + + +class ProjectionMLP(nn.Layer): + + def __init__(self, in_dim): + super(ProjectionMLP, self).__init__() + hidden_dim = in_dim * 2 + out_dim = in_dim + affine = False + list_layers = [ + nn.Linear(in_dim, hidden_dim, bias_attr=False), + nn.BatchNorm1D(hidden_dim), + nn.ReLU() + ] + list_layers += [ + nn.Linear(hidden_dim, out_dim, bias_attr=False), + nn.BatchNorm1D(out_dim) + ] + self.net = nn.Sequential(*list_layers) + + def forward(self, x): + return self.net(x) + + +class Similarity(nn.Layer): + """ + Dot product or cosine similarity + """ + + def __init__(self, temp): + super(Similarity, self).__init__() + self.temp = temp + self.cos = nn.CosineSimilarity(axis=-1) + self.record = None + self.pos_avg = 0.0 + self.neg_avg = 0.0 + + def forward(self, x, y, one_vs_one=False): + if one_vs_one: + sim = self.cos(x, y) + return sim + + x = x.unsqueeze(1) + y = y.unsqueeze(0) + sim = self.cos(x, y) + self.record = sim.detach() + min_size = min(self.record.shape[0], self.record.shape[1]) + num_item = self.record.shape[0] * self.record.shape[1] + self.pos_avg = paddle.diag(self.record).sum().item() / min_size + self.neg_avg = (self.record.sum().item() - paddle.diag( + self.record).sum().item()) / (num_item - min_size) + return sim / self.temp + + +class Encoder(nn.Layer): + + def __init__(self, pretrained_model_name, temp=0.05, output_emb_size=None): + super(Encoder, self).__init__() + self.ptm = AutoModel.from_pretrained(pretrained_model_name) + # if output_emb_size is greater than 0, then add Linear layer to reduce embedding_size + self.output_emb_size = output_emb_size + self.mlp = ProjectionMLP(self.ptm.config['hidden_size']) + + if output_emb_size is not None: + self.emb_reduce_linear = nn.Linear(self.ptm.config['hidden_size'], + output_emb_size) + + self.temp = temp + self.sim = Similarity(temp) + + def get_pooled_embedding(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + with_pooler=False): + # Note: cls_embedding is poolerd embedding with act tanh + sequence_output, cls_embedding = self.ptm(input_ids, 
token_type_ids, + position_ids, attention_mask) + if not with_pooler: + ori_cls_embedding = sequence_output[:, 0, :] + else: + ori_cls_embedding = cls_embedding + + mlp_cls_embedding = self.mlp(ori_cls_embedding) + if self.output_emb_size is not None: + cls_embedding = self.emb_reduce_linear(mlp_cls_embedding) + + return cls_embedding, mlp_cls_embedding + + def cosine_sim(self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + key_token_type_ids=None, + key_position_ids=None, + key_attention_mask=None, + with_pooler=False): + query_cls_embedding, _ = self.get_pooled_embedding( + query_input_ids, + query_token_type_ids, + query_position_ids, + query_attention_mask, + with_pooler=with_pooler) + key_cls_embedding, _ = self.get_pooled_embedding( + key_input_ids, + key_token_type_ids, + key_position_ids, + key_attention_mask, + with_pooler=with_pooler) + + cosine_sim = self.sim(query_cls_embedding, + key_cls_embedding, + one_vs_one=True) + return cosine_sim + + def forward(self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + key_token_type_ids=None, + key_position_ids=None, + key_attention_mask=None, + with_pooler=False): + query_cls_embedding, mlp_query_cls_embedding = self.get_pooled_embedding( + query_input_ids, + query_token_type_ids, + query_position_ids, + query_attention_mask, + with_pooler=with_pooler) + key_cls_embedding, mlp_key_cls_embedding = self.get_pooled_embedding( + key_input_ids, + key_token_type_ids, + key_position_ids, + key_attention_mask, + with_pooler=with_pooler) + + cosine_sim = self.sim(query_cls_embedding, key_cls_embedding) + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + loss = F.cross_entropy(input=cosine_sim, label=labels) + + mlp_cls_embedding = paddle.concat( + [mlp_query_cls_embedding, mlp_key_cls_embedding], axis=0) + return loss, mlp_cls_embedding + + +class Discriminator(nn.Layer): + + def __init__(self, ptm_model_name): + super(Discriminator, self).__init__() + self.ptm = CustomErnie.from_pretrained(ptm_model_name) + self.classifier = nn.Linear(self.ptm.config["hidden_size"], 2) + + def forward(self, + input_ids, + labels, + cls_input, + token_type_ids=None, + attention_mask=None): + sequence_output, _ = self.ptm(input_ids, + token_type_ids=token_type_ids, + attention_mask=attention_mask, + cls_input=cls_input) + pred_scores = self.classifier(sequence_output) + loss = F.cross_entropy(input=pred_scores, label=labels) + + return loss, pred_scores.argmax(-1) + + +class DiffCSE(nn.Layer): + + def __init__(self, + encoder_name, + generator_name, + discriminator_name, + enc_tokenizer, + gen_tokenizer, + dis_tokenizer, + temp=0.05, + output_emb_size=32, + mlm_probability=0.15, + lambda_weight=0.15): + super(DiffCSE, self).__init__() + self.encoder_name = encoder_name + self.generator_name = generator_name + self.discriminator_name = discriminator_name + self.enc_tokenizer = enc_tokenizer + self.gen_tokenizer = gen_tokenizer + self.dis_tokenizer = dis_tokenizer + self.temp = temp + self.output_emb_size = output_emb_size + self.mlm_probability = mlm_probability + self.lambda_weight = lambda_weight + + self.encoder = Encoder(encoder_name, + temp=temp, + output_emb_size=output_emb_size) + self.generator = ErnieForMaskedLM.from_pretrained(generator_name) + self.discriminator = Discriminator(discriminator_name) + + self.rtd_acc = 0.0 + 
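+        # rtd_rep_acc / rtd_fix_acc track how often the discriminator correctly
+        # flags replaced tokens and keeps unchanged ones; all three accuracy
+        # fields are refreshed in train_forward.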
self.rtd_rep_acc = 0.0 + self.rtd_fix_acc = 0.0 + + def train_forward(self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + key_token_type_ids=None, + query_attention_mask=None, + key_attention_mask=None): + + # extract senmantic vector with encoder and then comput CL loss + loss, mlp_cls_embedding = self.encoder( + query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask) + + with paddle.no_grad(): + # mask tokens for query and key input_ids and then predict mask token with generator + input_ids = paddle.concat([query_input_ids, key_input_ids], axis=0) + if self.encoder_name != self.generator_name: + input_ids = self.encode_by_generator(input_ids) + attention_mask = paddle.concat( + [query_attention_mask, key_attention_mask], axis=0) + mlm_input_ids, _ = mask_tokens(input_ids, + self.gen_tokenizer, + mlm_probability=self.mlm_probability) + # predict tokens using generator + pred_tokens = self.generator( + mlm_input_ids, attention_mask=attention_mask).argmax(axis=-1) + + pred_tokens = pred_tokens.detach() + + if self.generator_name != self.discriminator_name: + pred_tokens = self.encode_by_discriminator(pred_tokens) + input_ids = self.encode_by_discriminator(input_ids) + + pred_tokens[:, 0] = self.dis_tokenizer.cls_token_id + e_inputs = pred_tokens * attention_mask + replaced = pred_tokens != input_ids + e_labels = paddle.cast(replaced, dtype="int64") * attention_mask + rtd_loss, prediction = self.discriminator(e_inputs, + e_labels, + cls_input=mlp_cls_embedding) + loss = loss + self.lambda_weight * rtd_loss + + rep = (e_labels == 1) * attention_mask + fix = (e_labels == 0) * attention_mask + self.rtd_rep_acc = float((prediction * rep).sum() / rep.sum()) + self.rtd_fix_acc = float(1.0 - (prediction * fix).sum() / fix.sum()) + self.rtd_acc = float(((prediction == e_labels) * attention_mask).sum() / + attention_mask.sum()) + + return loss, rtd_loss + + def encode_by_generator(self, batch_tokens): + new_tokens = [] + for one_tokens in batch_tokens: + one_gen_tokens = self.enc_tokenizer.convert_ids_to_tokens( + one_tokens.tolist()) + new_tokens.append( + self.gen_tokenizer.convert_tokens_to_ids(one_gen_tokens)) + + return paddle.to_tensor(new_tokens) + + def encode_by_discriminator(self, batch_tokens): + new_tokens = [] + for one_tokens in batch_tokens: + one_gen_tokens = self.gen_tokenizer.convert_ids_to_tokens( + one_tokens.tolist()) + new_tokens.append( + self.dis_tokenizer.convert_tokens_to_ids(one_gen_tokens)) + + return paddle.to_tensor(new_tokens) + + def test_forward(self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + key_token_type_ids=None, + query_attention_mask=None, + key_attention_mask=None): + + # compute cosine similarity for query and key text + cos_sim = self.encoder.cosine_sim( + query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask) + + return cos_sim + + def forward(self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + key_token_type_ids=None, + query_attention_mask=None, + key_attention_mask=None, + mode="train"): + if mode == "train": + return self.train_forward(query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + 
query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask) + else: + return self.test_forward(query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask) diff --git a/examples/text_matching/diffcse/run_diffcse.py b/examples/text_matching/diffcse/run_diffcse.py new file mode 100644 index 000000000000..1d69a717ac26 --- /dev/null +++ b/examples/text_matching/diffcse/run_diffcse.py @@ -0,0 +1,393 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time +import random +import argparse +import numpy as np +from scipy import stats +from functools import partial + +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup +from visualdl import LogWriter + +from model import DiffCSE, Encoder +from utils import set_seed, eval_metric +from data import read_text_single, read_text_pair, convert_example, create_dataloader + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--mode", choices=["train", "eval", "infer"], default="infer", help="Select which mode to run model, defaults to infer.") +parser.add_argument("--encoder_name", type=str, help="The sentence_encoder name or path that you wanna train based on.") +parser.add_argument("--generator_name", type=str, help="The generator model name or path that you wanna train based on.") +parser.add_argument("--discriminator_name", type=str, help="The discriminator model name or path that you wanna train based on.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--train_set_file", type=str, help="The full path of train_set_file.") +parser.add_argument("--eval_set_file", type=str, help="The full path of eval_set_file.") +parser.add_argument("--infer_set_file", type=str, help="The full path of infer_set_file.") +parser.add_argument("--ckpt_dir", default=None, type=str, help="The ckpt directory where the model checkpoints will be loaded when doing evalution/inference.") +parser.add_argument("--save_dir", default="./checkpoints", type=str, help="The directory where the model checkpoints will be written.") +parser.add_argument("--log_dir", default=None, type=str, help="The directory where log will be written.") +parser.add_argument("--save_infer_path", default="./infer_result.txt", type=str, help="The save directory where the inference result will be written.") +parser.add_argument("--save_steps", type=int, default=10000, help="Step interval for saving checkpoint.") 
+parser.add_argument("--eval_steps", type=int, default=10000, help="Step interval for evaluation.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override ecpochs.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proption over the training process.") +parser.add_argument("--temp", default=0.05, type=float, help="Temperature for softmax.") +parser.add_argument("--mlm_probability", default=0.15, type=float, help="The ratio for masked language model.") +parser.add_argument("--lambda_weight", default=0.15, type=float, help="The weight for RTD loss.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def do_infer(model, tokenizer, data_loader): + assert isinstance( + model, Encoder), "please make sure that model is instance of Encoder." + sims = [] + model.eval() + with paddle.no_grad(): + for batch in data_loader: + query_input_ids, query_token_type_ids, query_attention_mask, key_input_ids, key_token_type_ids, key_attention_mask = batch + cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + key_input_ids=key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + sims.append(cosine_sim.numpy()) + sims = np.concatenate(sims, axis=0) + model.train() + return sims + + +def do_eval(model, tokenizer, data_loader): + assert isinstance( + model, Encoder), "please make sure that model is instance of Encoder." 
+ sims, labels = [], [] + model.eval() + with paddle.no_grad(): + for batch in data_loader: + query_input_ids, query_token_type_ids, query_attention_mask, key_input_ids, key_token_type_ids, key_attention_mask, label = batch + cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + key_input_ids=key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + sims.append(cosine_sim.numpy()) + labels.append(label.numpy()) + + sims = np.concatenate(sims, axis=0) + labels = np.concatenate(labels, axis=0) + score = eval_metric(labels, sims) + model.train() + return score + + +def do_train(model, tokenizer, train_data_loader, dev_data_loader, writer=None): + num_training_steps = args.max_steps if args.max_steps > 0 else len( + train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + args.warmup_proportion) + + decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + global_step = 0 + best_score = 0. + tic_train = time.time() + model = paddle.DataParallel(model) + model.train() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, query_attention_mask, key_input_ids, key_token_type_ids, key_attention_mask = batch + + loss, rtd_loss = model(query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask) + + global_step += 1 + if global_step % (args.eval_steps // 10) == 0 and rank == 0: + print( + "global step {}, epoch: {}, batch: {}, loss: {:.5f}, rtd_loss: {:.5f}, rtd_acc: {:.5f}, rtd_rep_acc: {:.5f}, rtd_fix_acc: {:.5f}, pos_avg: {:.5f}, neg_avg: {:.5f}, speed: {:.2f} step/s" + .format(global_step, epoch, step, loss.item(), + rtd_loss.item(), model._layers.rtd_acc, + model._layers.rtd_rep_acc, + model._layers.rtd_fix_acc, + model._layers.encoder.sim.pos_avg, + model._layers.encoder.sim.neg_avg, + (args.eval_steps // 10) / + (time.time() - tic_train))) + writer.add_scalar(tag="train/loss", + step=global_step, + value=loss.item()) + writer.add_scalar(tag="train/rtd_loss", + step=global_step, + value=rtd_loss.item()) + writer.add_scalar(tag="train/rtd_acc", + step=global_step, + value=model._layers.rtd_acc) + writer.add_scalar(tag="train/rtd_rep_acc", + step=global_step, + value=model._layers.rtd_rep_acc) + writer.add_scalar(tag="train/rtd_fix_acc", + step=global_step, + value=model._layers.rtd_fix_acc) + + tic_train = time.time() + + if global_step % args.eval_steps == 0 and rank == 0: + score = do_eval(model._layers.encoder, tokenizer, + dev_data_loader) + print("Evaluation - score:{:.5f}".format(score)) + + if best_score < score: + print( + "best checkpoint has been updated: from last best_score {} --> new score {}." 
+ .format(best_score, score)) + best_score = score + # save best model + save_dir = os.path.join(args.save_dir, "best") + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, + "model_state.pdparams") + paddle.save(model._layers.encoder.state_dict(), + save_param_path) + tokenizer.save_pretrained(save_dir) + + writer.add_scalar(tag="eval/score", + step=global_step, + value=score) + model.train() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, + "checkpoint_{}".format(global_step)) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model._layers.encoder.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + if args.max_steps > 0 and global_step >= args.max_steps: + return model + + +if __name__ == "__main__": + # set running environment + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + # define tokenizer for processing data + tokenizer = ppnlp.transformers.AutoTokenizer.from_pretrained( + args.encoder_name) + trans_func = partial(convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + if args.mode == "train": + start_time = time.time() + + # load data + train_ds = load_dataset(read_text_single, + data_path=args.train_set_file, + lazy=False) + dev_ds = load_dataset(read_text_pair, + data_path=args.eval_set_file, + lazy=False) + gen_tokenizer = ppnlp.transformers.AutoTokenizer.from_pretrained( + args.generator_name) + dis_tokenizer = ppnlp.transformers.AutoTokenizer.from_pretrained( + args.discriminator_name) + + # intializing DiffCSE model + model = DiffCSE(encoder_name=args.encoder_name, + generator_name=args.generator_name, + discriminator_name=args.discriminator_name, + enc_tokenizer=tokenizer, + gen_tokenizer=gen_tokenizer, + dis_tokenizer=dis_tokenizer, + temp=args.temp, + output_emb_size=args.output_emb_size, + mlm_probability=args.mlm_probability, + lambda_weight=args.lambda_weight) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=0), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # key_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # tilte_segment + Pad(axis=0, pad_val=0), # attention_mask + ): [data for data in fn(samples)] + dev_batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=0), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # key_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # tilte_segment + Pad(axis=0, pad_val=0), # attention_mask + Stack(dtype="int64"), # labels + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader(train_ds, + mode="train", + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + dev_data_loader = create_dataloader(dev_ds, + mode="eval", + batch_size=args.batch_size, + batchify_fn=dev_batchify_fn, + trans_fn=trans_func) + + with 
LogWriter(logdir=os.path.join(args.log_dir, "scalar")) as writer: + do_train(model, + tokenizer, + train_data_loader, + dev_data_loader, + writer=writer) + + end_time = time.time() + print("running time {} s".format(end_time - start_time)) + + if args.mode == "eval": + start_time = time.time() + # initalizing encoder model for eval + model = Encoder(args.encoder_name, + temp=args.temp, + output_emb_size=args.output_emb_size) + # load model from saved checkpoint + if args.ckpt_dir: + init_from_ckpt = os.path.join(args.ckpt_dir, "model_state.pdparams") + if os.path.isfile(init_from_ckpt): + print( + "*************************initializing model from {}*****************************" + .format(init_from_ckpt)) + state_dict = paddle.load(init_from_ckpt) + model.set_dict(state_dict) + + dev_ds = load_dataset(read_text_pair, + data_path=args.eval_set_file, + lazy=False) + + dev_batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=0), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # key_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # tilte_segment + Pad(axis=0, pad_val=0), # attention_mask + Stack(dtype="int64"), # labels + ): [data for data in fn(samples)] + + dev_data_loader = create_dataloader(dev_ds, + mode="eval", + batch_size=args.batch_size, + batchify_fn=dev_batchify_fn, + trans_fn=trans_func) + + score = do_eval(model, tokenizer, dev_data_loader) + print("Evaluation - score:{:.5f}".format(score)) + + end_time = time.time() + print("running time {} s".format(end_time - start_time)) + + if args.mode == "infer": + start_time = time.time() + # initalizing encoder model for eval + model = Encoder(args.encoder_name, + temp=args.temp, + output_emb_size=args.output_emb_size) + # load model from saved checkpoint + if args.ckpt_dir: + init_from_ckpt = os.path.join(args.ckpt_dir, "model_state.pdparams") + if os.path.isfile(init_from_ckpt): + print( + "*************************initializing model from {}*****************************" + .format(init_from_ckpt)) + state_dict = paddle.load(init_from_ckpt) + model.set_dict(state_dict) + + infer_ds = load_dataset(read_text_pair, + data_path=args.infer_set_file, + lazy=False, + is_infer=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=0), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # key_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # tilte_segment + Pad(axis=0, pad_val=0), # attention_mask + ): [data for data in fn(samples)] + + infer_data_loader = create_dataloader(infer_ds, + mode="infer", + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + cosin_sim = do_infer(model, tokenizer, infer_data_loader) + + with open(args.save_infer_path, "w", encoding="utf-8") as f: + for idx, cos in enumerate(cosin_sim): + msg = "{} --> {}\n".format(idx, cos) + f.write(msg) + print("Inference result has been saved to : {}".format( + args.save_infer_path)) + + end_time = time.time() + print("running time {} s".format(end_time - start_time)) diff --git a/examples/text_matching/diffcse/run_eval.sh b/examples/text_matching/diffcse/run_eval.sh new file mode 100644 index 000000000000..0c5512c2f4d3 --- /dev/null +++ b/examples/text_matching/diffcse/run_eval.sh @@ -0,0 +1,15 @@ +gpu_ids=0 +export 
CUDA_VISIBLE_DEVICES=${gpu_ids}
+
+log_dir="log_eval"
+python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \
+    run_diffcse.py \
+    --mode "eval" \
+    --encoder_name "rocketqa-zh-dureader-query-encoder" \
+    --max_seq_length "128" \
+    --output_emb_size "32" \
+    --eval_set_file "your dev_set path" \
+    --ckpt_dir "./checkpoints/best" \
+    --batch_size "32" \
+    --seed "0" \
+    --device "gpu"
diff --git a/examples/text_matching/diffcse/run_infer.sh b/examples/text_matching/diffcse/run_infer.sh
new file mode 100644
index 000000000000..7df8a573cd8a
--- /dev/null
+++ b/examples/text_matching/diffcse/run_infer.sh
@@ -0,0 +1,16 @@
+gpu_ids=0
+export CUDA_VISIBLE_DEVICES=${gpu_ids}
+
+log_dir="log_infer"
+python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \
+    run_diffcse.py \
+    --mode "infer" \
+    --encoder_name "rocketqa-zh-dureader-query-encoder" \
+    --max_seq_length "128" \
+    --output_emb_size "32" \
+    --infer_set_file "your test_set path" \
+    --ckpt_dir "./checkpoints/best" \
+    --save_infer_path "./infer_result.txt" \
+    --batch_size "32" \
+    --seed "0" \
+    --device "gpu"
diff --git a/examples/text_matching/diffcse/run_train.sh b/examples/text_matching/diffcse/run_train.sh
new file mode 100644
index 000000000000..9ed49a4b37e8
--- /dev/null
+++ b/examples/text_matching/diffcse/run_train.sh
@@ -0,0 +1,27 @@
+gpu_ids=0
+export CUDA_VISIBLE_DEVICES=${gpu_ids}
+
+log_dir="log_train"
+python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \
+    run_diffcse.py \
+    --mode "train" \
+    --encoder_name "rocketqa-zh-dureader-query-encoder" \
+    --generator_name "ernie-3.0-base-zh" \
+    --discriminator_name "ernie-3.0-base-zh" \
+    --max_seq_length "128" \
+    --output_emb_size "32" \
+    --train_set_file "your train_set path" \
+    --eval_set_file "your dev_set path" \
+    --save_dir "./checkpoints" \
+    --log_dir ${log_dir} \
+    --save_steps "50000" \
+    --eval_steps "1000" \
+    --batch_size "32" \
+    --epochs "3" \
+    --mlm_probability "0.15" \
+    --lambda_weight "0.15" \
+    --learning_rate "3e-5" \
+    --weight_decay "0.01" \
+    --warmup_proportion "0.01" \
+    --seed "0" \
+    --device "gpu"
diff --git a/examples/text_matching/diffcse/utils.py b/examples/text_matching/diffcse/utils.py
new file mode 100644
index 000000000000..d70b434f8286
--- /dev/null
+++ b/examples/text_matching/diffcse/utils.py
@@ -0,0 +1,35 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+import numpy as np
+
+import paddle
+from scipy import stats
+
+
+def set_seed(seed=0):
+    random.seed(seed)
+    np.random.seed(seed)
+    paddle.seed(seed)
+
+
+def masked_fill(x, mask, value):
+    y = paddle.full(x.shape, value, x.dtype)
+    return paddle.where(mask, y, x)
+
+
+def eval_metric(labels, preds):
+    spearman_corr = stats.spearmanr(labels, preds).correlation
+    return spearman_corr
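The best checkpoint saved by `run_diffcse.py` contains only the encoder weights, so standalone inference can reuse the `Encoder` class from `model.py`. Below is a minimal sketch, not part of the patch itself; the checkpoint path, the example sentences, and the `max_seq_len` tokenizer keyword are assumptions that may need adjusting to your PaddleNLP version and training run:

```python
import paddle
import paddle.nn.functional as F
from paddlenlp.transformers import AutoTokenizer

from model import Encoder

# Assumed names/paths; align them with your own run.
encoder_name = "rocketqa-zh-dureader-query-encoder"
ckpt_path = "./checkpoints/best/model_state.pdparams"

tokenizer = AutoTokenizer.from_pretrained(encoder_name)
model = Encoder(encoder_name, temp=0.05, output_emb_size=32)
model.set_dict(paddle.load(ckpt_path))
model.eval()


def encode(text):
    # Tokenize one sentence and return its reduced CLS embedding.
    feats = tokenizer(text, max_seq_len=128)
    input_ids = paddle.to_tensor([feats["input_ids"]])
    token_type_ids = paddle.to_tensor([feats["token_type_ids"]])
    emb, _ = model.get_pooled_embedding(input_ids, token_type_ids)
    return emb


with paddle.no_grad():
    sim = F.cosine_similarity(encode("原神手鞠游戏三个刷新位置一览"),
                              encode("手鞠游戏三个刷新位置一览"))
print(sim.item())
```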