The code for the 2019 OpenAI paper Fine-Tuning Language Models from Human Preferences:
https://github.com/openai/lm-human-preferences
The core logic lives in the PPOTrainer.step() function:
https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L281
The code is excerpted and annotated below:
```python
# class PPOTrainer:
# def step(self):
# Sample the queries (prompts)
queries = self.sample_queries()  # BxS
# The policy model (the LLM being trained with RL) generates responses. The state value
# function reuses the policy model's backbone and predicts a state value for every output
# step, shape BxS2
rollouts = self.policy.respond(queries, length=self.hparams.task.response_length)
responses = rollouts['responses']  # BxS2
logprobs = rollouts['logprobs']  # BxS2
values = rollouts['values']  # BxS2
rollouts['queries'] = queries
# BxS2; the reference model (the frozen SFT LLM) evaluates the log-probs of the same responses
ref_logprobs = self.ref_policy.analyze_responses(queries, responses)['logprobs']
# The reward model scores each query-response pair: for a text sequence of length S+S2
# it produces a single scalar score
# shapes: scores is B, queries is BxS, responses is BxS2
scores, postprocessed_responses, score_stats = self.score_fn(queries, responses)
# Following the formula in the InstructGPT paper [4] (see Figure 1 below), the reward
# rewards[B, S2] is the reward-model score minus the KL divergence between the RL model
# and the SFT model. The KL term has shape [B, S2], while the reward-model score has
# shape [B]; the score is added to the last column, rewards[:, -1] += scores, i.e. it
# rewards the response as a whole. (A sketch of compute_rewards follows Figure 1 below.)
rewards, non_score_reward, kl_coef = self.compute_rewards(
    scores=scores,
    logprobs=logprobs,  # BxS2
    ref_logprobs=ref_logprobs)
rollouts['rewards'] = rewards  # BxS2
# With the rollouts prepared, i.e. a sequence of (s_t, a_t, r_t), training proper begins:
# s_t, the state at step t, is the model's output from step 1 through step t-1, logprobs[B, :t]
# a_t, the action at step t, is the token the model emits at step t, logprobs[B, t]
# r_t, the reward at step t, is rewards[B, t] computed above
train_stats = self.train(rollouts=rollouts)
```
![Figure 1: The PPO objective from the InstructGPT paper [4]](https://prod-files-secure.s3.us-west-2.amazonaws.com/3b35e189-489b-4426-951c-63fcc8ad4752/22cd5282-5923-4060-bcdc-ae04641826df/image-20230316193658270.png)
Figure 1: The PPO objective from the InstructGPT paper [4]
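Concretely, the per-token reward described in the comments above is $r_t = -\beta\,\big(\log \pi^{RL}(a_t\mid s_t) - \log \pi^{SFT}(a_t\mid s_t)\big)$ at every position, with the scalar reward-model score added at the last position. The NumPy sketch below only illustrates this construction; it is a simplified stand-in for the repository's compute_rewards (for instance, the KL coefficient, which can be adapted over training, is fixed here).

```python
import numpy as np

def compute_rewards_sketch(scores, logprobs, ref_logprobs, kl_coef=0.2):
    """Simplified illustration of the reward construction (not the repository's exact code).

    scores:       [B]      scalar reward-model score per query-response pair
    logprobs:     [B, S2]  per-token log-probs of the sampled response under the RL policy
    ref_logprobs: [B, S2]  per-token log-probs of the same response under the SFT reference
    """
    kl = logprobs - ref_logprobs            # [B, S2] per-token KL penalty term
    non_score_reward = -kl_coef * kl        # [B, S2]
    rewards = non_score_reward.copy()
    rewards[:, -1] += scores                # scalar score credits the whole response at its last token
    return rewards, non_score_reward

# toy example: batch of 2, response length 4
logprobs = np.log(np.full((2, 4), 0.5))
ref_logprobs = np.log(np.full((2, 4), 0.4))
scores = np.array([1.0, -0.5])
rewards, _ = compute_rewards_sketch(scores, logprobs, ref_logprobs)
print(rewards.shape)  # (2, 4); only the last column includes the reward-model score
```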
PPOTrainer.train() calls train_minibatch, which in turn calls self.loss to obtain the ppo_loss of equation (2) above and then takes a gradient step to update the model (a rough sketch of this epoch/minibatch schedule follows this paragraph). The core PPO loss computation is walked through below.
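The broad shape of that update schedule is that the same batch of rollouts is reused for several PPO epochs and split into minibatches, each getting one gradient step. The sketch below is only illustrative: the function and hyperparameter names (`ppo_update_sketch`, `n_epochs`, `n_minibatches`) are hypothetical, and the actual implementation lives in train_policy.py.

```python
import numpy as np

def ppo_update_sketch(rollouts, n_epochs=4, n_minibatches=8):
    """Hypothetical sketch of the epoch/minibatch schedule, not the repository's code."""
    batch_size = rollouts['rewards'].shape[0]
    for _ in range(n_epochs):                      # reuse the same rollouts several times
        perm = np.random.permutation(batch_size)   # reshuffle each epoch
        for idx in np.array_split(perm, n_minibatches):
            minibatch = {k: v[idx] for k, v in rollouts.items()}
            # here: loss = self.loss(minibatch); compute gradients; apply the update
            _ = minibatch

# toy usage with dummy rollouts (B=16, S2=4)
rollouts = {'rewards': np.zeros((16, 4)), 'logprobs': np.zeros((16, 4)), 'values': np.zeros((16, 4))}
ppo_update_sketch(rollouts)
```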
https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L319
```python
# class PPOTrainer:
# def loss(self, rollouts):
values = rollouts['values']  # BxS2
old_logprob = rollouts['logprobs']  # BxS2
rewards = rollouts['rewards']  # BxS2; only the last column contains the reward-model score
# Compute the Generalized Advantage Estimator (GAE), shape BxS2.
# See equations (11) and (12) of the PPO paper, shown in Figure 2 below.
# GAE lowers the variance of the reward estimate, one of the main problems that the
# family of policy-gradient tricks sets out to solve.
# See the section "The origin of GAE" for details.
lastgaelam = 0
advantages_reversed = []
gen_length = self.hparams.task.response_length  # S2
for t in reversed(range(gen_length)):
    nextvalues = values[:, t + 1] if t < gen_length - 1 else 0.0
    delta = rewards[:, t] + self.hparams.ppo.gamma * nextvalues - values[:, t]
    lastgaelam = delta + self.hparams.ppo.gamma * self.hparams.ppo.lam * lastgaelam
    advantages_reversed.append(lastgaelam)
# a list of S2 tensors, each of shape B
advantages = tf.stack(advantages_reversed[::-1], axis=1)  # get A of shape BxS2
# Normalize (whiten) the advantages and stop gradients through them
# (a sketch of the whitening step follows this code block)
advantages = utils.whiten(advantages)
advantages = tf.stop_gradient(advantages)  # Shouldn't do anything, but better not to think about it
# Build the target V_t of the training pairs (s_t, V_t) used to fit the state value function
# BxS2
# `returns` is the state value function's target:
# 1. the naive version is the single-sample Monte Carlo estimate: target = sum of r(s_t', a_t') for t' >= t
# 2. a better version bootstraps: target = r(s_t, a_t) + V(s_t'), where t' = t + 1
# 3. here an even better version is used, based on n-step rewards via GAE:
#    target = A(s_t, a_t) + V(s_t) = R(s_t, a_t) + V(s_t') - V(s_t) + V(s_t) = R(s_t, a_t) + V(s_t')
#    where R(s_t, a_t) is effectively a step-reward estimate produced by GAE, which has lower
#    variance than the single-sample estimate r(s_t, a_t)
returns = advantages + values
# Re-run the value function on the current states and fit it to the targets built above
# with a squared-error loss
outputs = self.policy.analyze_responses_op(rollouts['queries'], rollouts['responses'])  # state value function predictions
vpred = outputs['values']  # BxS2; the value function reuses the policy LM's backbone, adding a scalar head
vpredclipped = tf.clip_by_value(vpred, values - self.hparams.ppo.cliprange_value, values + self.hparams.ppo.cliprange_value)
vf_losses1 = tf.square(vpred - returns)
vf_losses2 = tf.square(vpredclipped - returns)
vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))  # value function loss
vf_clipfrac = tf.reduce_mean(tf.cast(tf.greater(vf_losses2, vf_losses1), tf.float32))
# The samples were not generated by the current policy model (they lag by a few epochs),
# so importance sampling is applied to the data generated by the old policy
logprob = outputs['logprobs']  # BxS2
ratio = tf.exp(logprob - old_logprob)  # importance-sampling trick used by PPO for off-policy data
pg_losses = -advantages * ratio  # policy gradient with the importance-sampling and advantage tricks
# The clipping here comes from the PPO paper's clipped surrogate objective
pg_losses2 = -advantages * tf.clip_by_value(ratio, 1.0 - self.hparams.ppo.cliprange, 1.0 + self.hparams.ppo.cliprange)
pg_loss = tf.reduce_mean(tf.maximum(pg_losses, pg_losses2))  # policy gradient loss
# Actor-critic: the final loss combines the policy-gradient loss of the policy model (the actor)
# with the value-function loss of the value model (the critic)
loss = pg_loss + self.hparams.ppo.vf_coef * vf_loss
```
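The utils.whiten call referenced above normalizes the advantages to zero mean and unit variance, keeping the scale of the policy-gradient term stable across batches. Below is a minimal NumPy sketch of what such a whitening helper does; the repository's implementation may differ in details such as the epsilon or an option to keep the mean.

```python
import numpy as np

def whiten_sketch(values: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance normalization (illustrative; see utils.whiten in the repo)."""
    return (values - values.mean()) / (values.std() + 1e-8)

advantages = np.array([[0.5, -1.0, 2.0, 0.1],
                       [1.5, -0.2, 0.0, -0.8]])
print(whiten_sketch(advantages).mean())  # ~0.0
print(whiten_sketch(advantages).std())   # ~1.0
```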
![Figure 2: The GAE formulas in PPO [7]](https://prod-files-secure.s3.us-west-2.amazonaws.com/3b35e189-489b-4426-951c-63fcc8ad4752/a8230f61-d458-457d-af1c-f0da6ac56516/image.png)
Figure 2: The GAE formulas in PPO [7]
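For reference, the truncated GAE used here (equations (11) and (12) of the PPO paper [7], shown in Figure 2) can be written as

$$
\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l\,\delta_{t+l},
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
$$

which the loop in the code above evaluates through the equivalent backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$, starting from the last generated token.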
The GAE hyperparameters used in OpenAI's 2020 paper [3] are shown in the figure below. Note that $\gamma=1$ (i.e., delayed rewards are not discounted at all) is appropriate here because trajectories have finite length (the policy model emits only a finite number of tokens) and the reward arrives only at the very end of the trajectory.
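To make the effect of $\gamma=1$ concrete, the toy below replays the GAE loop from the loss() excerpt in plain NumPy on a single trajectory whose only reward arrives at the last token (the zero value baseline and the specific numbers are chosen purely for illustration): with $\gamma=1$ the final reward reaches early tokens attenuated only by $\lambda$, whereas $\gamma<1$ would additionally discount that credit.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """NumPy replay of the backward GAE loop from loss() above, for one trajectory."""
    T = rewards.shape[0]
    advantages = np.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        nextvalue = values[t + 1] if t < T - 1 else 0.0
        delta = rewards[t] + gamma * nextvalue - values[t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages[t] = lastgaelam
    return advantages

rewards = np.array([0.0, 0.0, 0.0, 1.0])  # reward-model score only at the final token
values = np.zeros(4)                      # zero baseline, to isolate the effect of gamma
print(gae(rewards, values, gamma=1.0, lam=0.95))   # first token gets 0.95**3 of the final reward
print(gae(rewards, values, gamma=0.95, lam=0.95))  # gamma < 1 shrinks that credit further
```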
