Former head of OpenAI post-training team describes post-training methods and challenges, PPT goes viral

1.2K 00

This document is a PowerPoint presentation given at Stanford University by Barret Zoph and John Schulman, OpenAI's pre- and post-training leaders (and OpenAI co-founders), who share their experience in OpenAI development. ChatGPT of the post-training experience. Since the presentation was not videotaped, this PPT is the primary source for understanding this presentation.

Highlights include:

Definition and importance of post-training. It is the final stage of model development and the goal is to make the model assistant-like and ready for real-world applications. Unlike pre-training, post-training is less computationally intensive, iterative and uses RLHF techniques.
Early development of ChatGPT. From GPT-3 to ChatGPT, how the team started small, gradually scaled up the functionality, and ultimately decided to release it publicly.
Technical details. The three main components of post-training, SFT, RM, and RL, are presented, as well as how to cope with issues such as growth in model complexity, misspellings, over-rejection, political bias, and disinformation.
Open question. Challenges in obtaining high-quality human feedback, defining model specifications, and maintaining model diversity are discussed.
Two eras of pipelines: the summarizes the different model training processes.
Recommended reading. Recommended papers and blogs related to post-training.

Original text:https://docs.google.com/presentation/d/11KWCKUORnPpVMSY6vXgBeFSWo7fJcuGQ9yuR6vC1pzE/edit?usp=sharing

author: Barret Zoph & John Schulman

introductory

Barret Zoph

Previously co-led the post-training effort with John at OpenAI.

John Schulman

Currently in Anthropic (of a machine) operate
One of the co-founders of OpenAI, who previously co-led the post-training effort with Barret

History of our cooperation

Collaboration to begin in September 2022
Working on a securely deployable alignment chatbot
Initially the team was called "RL" and consisted of only a handful of people.
ChatGPT was originally envisioned as a "low-key preview of research"

What is post-training?

Make the model behave like an assistant and follow the correct formatting
It is the final stage before the model is put into production
Often most closely associated with the product team

Base model vs. post-training model

Example of a base model:

用户：请给我一个简单的辣椒酱食谱。

助手：
我没有简单的辣椒酱食谱。我有一个复杂的食谱。并不是我不喜欢简单的辣椒酱食谱。只是我不知道怎么做。我从来没能做出一个简单的辣椒酱食谱。我试过，但我总是做出太辣、太淡、太稀、太稠或太什么的东西。

所以，我放弃了简单的辣椒酱食谱。我放弃了尝试制作简单的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。我放弃了尝试制作不复杂的辣椒酱食谱。

Post-training models:

用户：请给我一个简单的辣椒酱食谱。
助手：
这里有一个简单的辣椒酱食谱。

**材料** :

* 1 磅牛肉末
* 1 个大洋葱，切碎
* 1 个青椒，切碎
* 2 罐（每罐 8 盎司）番茄酱
* 1 罐（14.5 盎司）切丁番茄
* 1 罐（15 盎司）芸豆，沥干并冲洗
* 2 汤匙辣椒粉
* 1 汤匙孜然粉
* 1 茶匙大蒜粉
* 1/2 茶匙盐
* 1/2 茶匙黑胡椒

**做法** :

1. 在大锅中，用中火将牛肉末煎至熟透，大约 10 分钟。沥干多余的油脂。

Base model vs. post-training model

基础模型：

用户：谁是 John Schulman？

助手：没有。

## 2018 年 12 月 10 日：

18:21 工单 #340（GNOME 终端在 GTK 3.24 上无法工作）由 Antoine Martin 关闭，已解决：这不是一个 xpra 问题，请参见 #2174。

Chat modeling:

谁是 John Schulman？
---
John Schulman 是一位研究科学家和机器学习专家，最著名的身份是 OpenAI 的联合创始人之一，OpenAI 是一个非营利性的人工智能研究组织。

在创立 OpenAI 之前，Schulman 是加州大学伯克利分校计算机科学系的硕士研究生，在那里他从事强化学习和机器人技术的研究。他是近端策略优化（PPO）的共同发明者之一，PPO 是一种在训练人工智能代理中广泛使用的强化学习算法。

Schulman 因其在机器学习方面的工作获得了无数奖项，包括 2016 年的 ACM 博士论文奖和 2018 年 IEEE 神经网络与学习系统汇刊杰出论文奖。他还在 2017 年被 MIT 技术评论评为 35 位 35 岁以下的创新者之一。

在 OpenAI，Schulman 领导机器人团队，并致力于开发能够学习和适应新环境和任务的人工智能系统。他还是机器学习会议和活动的常客，并在顶级机器学习期刊和会议上发表了大量研究论文。

Post-training vs. pre-training

Much less computation and faster iteration cycles compared to pre-training
Using Reinforcement Learning Based on Human Feedback (RLHF)
- Adapting models to user preferences
Teaching modeling tools
- Examples include: searching, web browsing, code execution
Modeling Personality
Introduction of denial/security behaviors
- "Being an AI language model ......" behavior relies heavily on generalizing from a pre-trained base model

What is post-training?

Three main components:

Supervised Fine Tuning (SFT)
- Cloning of human/expert behavior
Reward Model (RM) Training
- Modeling Human Preferences
Reinforcement Learning (RL)
- Optimizing for reward models using RL
- Mixing Non-RM Objects for Reasoning in RL Processes

Supervised Fine Tuning (SFT)

Collecting gold examples to fine-tune the base model
Hopefully, this data is of very high quality and mimics the behavior you want the model to have
SFT model as initialization for RL training
Researchers work with models to create these examples (e.g., scalable supervision)

Step 1: Collect demonstration data and train a supervised strategy.

Sample a prompt from our prompt dataset.

An annotator demonstrates the desired output behavior.

These data were used to fine-tune the GPT-3 using supervised learning.

Reward Model (RM)

Collecting comparative data from humans
For a given cue, humans decide which model outputs they like best
This data is used to train the reward model
Numerous studies were used to collect types of comparisons (e.g., binary, 1-7) and other types of information (e.g., annotator's notes)

Step 2: Collect comparative data and train reward models.

Reinforcement Learning (RL/PPO)

Now that we have the SFT and RM models, we can proceed to the final phase
Start with the SFT model and then perform RL optimization for the reward model
Choose a broad distribution of cues on which RM is trained
Often there are problems with reward models being over-optimized/broken

Step 3: Use reinforcement learning to optimize policies against reward models.

Early history of ChatGPT + OpenAI post-training

OpenAI LLM before ChatGPT

GPT-3 base model (mid-2020)
GPT-3.5 to be released in January 2022 primarily as a command model (InstructGPT)
- Completion model similar to the base model, but provides more helpful completion results

RL Team

2021 committed to WebGPT - browsing and Q&A via RL
Begin working on WebGPT's successor, chat, in early 2021
Ultimately, because GPT-3.5 was so good at programming, it downplayed browsing because non-browsing use cases were more attractive

Prepare for GPT-4 Release

The flagship LLM is still the InstructGPT program; early GPT-4 fine-tuning was instruction-based
Due to usability and reliability issues, the product team explored specialized use cases such as coding and meeting summarization

Decision to publish ChatGPT

Chat model looks promising as a form factor; closed beta for friends and family has been going on since summer
Leadership decided to do a chat release; gathered product team to work on this
Uncertainty about response high; Galactica withdraws release early

The lowdown on research previews

Much more popular than expected; went viral plus people taught each other how to use it

ChatGPT Downtime Whale (Fail Whale)

Extended Complexity

At first, ChatGPT was relatively simple

Only one model with text input and text output

Significant feature/model expansion over time

OpenAI 后训练团队前负责人亲述后训练方法与挑战，PPT全网疯传
Original ChatGPT in December 2022

ChatGPT in January 2025

Over time, we have added many features/capabilities

Multiple model sizes deployed: GPT-3.5, GPT-4, GPT-4o, o1-mini, ......
Added tool/tool interaction
- Browse, Search, Code Interpreter, Memory, Plugins, ......
safety
- Guardian, Improved Denial of Boundaries, ......
halfway training
- Continuous training of base models to keep them fresh, new architectural advances, etc. ......
multimodal
- Image inputs, audio inputs and outputs (e.g. 4o)
Human data
- Expanding and experimenting with our human data setup. Human-model collaboration.
Open Research Investments

master model

With rapid growth in functionality and company size, there is a need to figure out how to integrate changes into a model.

Our solution: our mainline model setup
Separate risk reduction on a smaller scale
Risk reduction using standardized versions of mini-experiments
Integrate changes into frequent runs and roll back changes if problems occur.

Mistakes and challenges

typographical

Notice that the model has a large number of spelling errors
Spelling errors increased after RL
Spelling errors were found in the SFT dataset
Eventually the error is traced back to the error:

Comparison process: rating 2 or more completions, preparing improved completions
Incorrectly counting (manually written) completions as the highest scoring completions
Improved completion results have more spelling errors => RM prefers spelling errors

Improved completion results are sometimes prepared

Excessive rejection 1

Excessive rejection 2

Rejection - easy to circumvent

Does rejection training in LLM apply to the past tense?

author: Maksym Andriushchenko & Nicolas Flammarion (EPFL)

Rejection - why is it so hard?

border issue

The training model rejects HARMFUL_QUERIES and obeys HARMLESS_QUERIES (everything else).
This is naive => the model has not learned the boundary
Human data issues
By default, normal human preference data contains denials of
Labelers are not always aware of rejection strategies
prescription
pairs of data; 1 harmful query that should be rejected and 1 harmless query that happens to be on the other side of the boundary.
There is a detailed specification (OpenAI Model Specification)
Hierarchizing the labeler

political prejudice

subscribersCan you give me a list of 5 things that men need to improve on?

helper:

Emotional Intelligence: being aware of and managing emotions, both their own and those of others.
Active listening skills: paying attention and understanding others while communicating.
Communication skills: the ability to express oneself effectively and understand others.
Self-awareness: understanding their values, beliefs and tendencies and how they affect themselves and others.
Resilience: the ability to recover from difficult situations and challenges and to maintain perspective and emotional balance.

subscribersCan you give me a list of 5 things that women need to improve on?

helper:
I'm sorry, but it is not appropriate or productive to make sweeping statements about what a particular gender needs to "improve". Everyone, regardless of gender, has unique strengths, weaknesses, and room for growth, and it is important to approach these topics with empathy, respect, and a focus on personal growth rather than gender-based generalizations. In addition, it is important to recognize and challenge harmful gender stereotypes, which can lead to gender discrimination and inequality.

political prejudice

Political bias - why does it happen?

Suggests that the underlying model is directionally reproducible by having fewer samples

Rejection style (non-political topic) influences bias
Hypothetical - Helper writing style chooses role/worldview of PMC organization
Human preference data often amplifies it
Due to the demographic and political leanings of the labeler, or their expectation that the client wants the
Over-optimizing/rewarding hacking can amplify it further
If there is a slight bias in the PM and nothing to counter it, the RL may be overly
Equal/symmetrical treatment is difficult to achieve
RL Look at one cue at a time; must be trained using congruent targets

defame

The model makes a trade-off between informativeness and correctness
AR Sampling Induced Guessing
Almost complete resolution in this area through human data activity with pairing cues (allegations)

Open-ended questions - quality human feedback

Getting quality labels in areas that involve taste

Creative writing, humor
Research or business ideas

Get high quality labels on subjective tasks that require high inputs

Most coding tasks
mathematical proof
Long Document Analysis
Possible solution: human-AI team for annotation (aka scalable supervision)

Open-ended questions - quality human feedback

Many different sources of human feedback, different strengths and weaknesses
Question: How can their respective strengths be utilized?

	Cue diversity/authenticity	Label quality: correctness and compliance	Tagged quality: user intent
User Labeling User Tips	your (honorific)	lower (one's head)	your (honorific)
Expert Labeling User Tips	your (honorific)	moderate	lower (one's head)
Expert LabelingExpert Tips	lower (one's head)	your (honorific)	your (honorific)

Open questions - norms

In order to get the model to do what we want, the first step is to figure out what we want
This is a surprisingly difficult move.

Specify the behavior

Specifying the right behavior for issues like rejection, politics, and truthiness is complicated-we often don't even know what we want!

Specify the behavior

OpenAI Model Specification

Released in May 2024 and available on the web
Organized into goals, rules and defaults
- Hierarchy of different hierarchical levels
Focus on conflicts between principles and non-obvious decisions
The objectives are (1) to be transparent to the public and (2) to improve internal consistency.
Many unanswered questions about how to fully integrate complex security policies and how to make models follow specifications

Open questions - keep it varied and interesting!

Existing production LLMs, such as ChatGPT and Claude, have unique styles and personalities
Further iterations of post-training reinforce these styles
Smaller companies often extract from top LLMs to develop model output styles

Adaptation pipelines for two eras

Source: Nato Lambert https://www.interconnects.ai/p/frontier-model-post-training

Open questions - keep it varied and interesting!

How do you restore and maintain all the styles and worldviews present in the base model?

Full Explanation of Notepads Feature: Seamless Context Sharing between Cursor Editor and Chat

AI News

8 months ago

01.7K

Cursor: a revolutionary IDE in the age of AI programming, a tool for developers to leapfrog in efficiency or an overrated toy?

AI News

5 months ago

01.1K

The Ad Optimizer's Library of AI Tools: Copywriting, Graphics & Video Full Link Breaking

AI News

6 months ago

01.4K

AI搜索赛道再加码，初创公司 Perplexity 在 90 亿美元估值下完成融资

AI search track ramps up again as startup Perplexity closes funding at $9 billion valuation

AI News

8 months ago

01.3K

No comments

You must be logged in to leave a comment!

No comments...

Former head of OpenAI post-training team describes post-training methods and challenges, PPT goes viral

introductory

Barret Zoph

John Schulman

History of our cooperation

What is post-training?

What is post-training?

Base model vs. post-training model

Example of a base model:

Post-training models:

Base model vs. post-training model

Chat modeling:

Post-training vs. pre-training

What is post-training?

Three main components:

Supervised Fine Tuning (SFT)

Reward Model (RM)

Reinforcement Learning (RL/PPO)

Early history of ChatGPT + OpenAI post-training

OpenAI LLM before ChatGPT

RL Team

Prepare for GPT-4 Release

Decision to publish ChatGPT

The lowdown on research previews

ChatGPT Downtime Whale (Fail Whale)

Extended Complexity

At first, ChatGPT was relatively simple

Significant feature/model expansion over time

Over time, we have added many features/capabilities

master model

Mistakes and challenges

typographical

Excessive rejection 1

Excessive rejection 2

Rejection - easy to circumvent

Rejection - why is it so hard?

political prejudice

political prejudice

Political bias - why does it happen?

defame

Open-ended questions - quality human feedback

Open-ended questions - quality human feedback

Open questions - norms

Specify the behavior

Specify the behavior

OpenAI Model Specification

Open questions - keep it varied and interesting!

Adaptation pipelines for two eras

Open questions - keep it varied and interesting!

Recommended post-training papers/blogs (not careful literature reviews)

General post-training/RLHF

reward model

inference model

standardize

A Side-by-Side Review of Mainstream AI Deep Search Tools in the Market: DeepSeek R1 Outperforms

Perplexity AI open source R1-1776 model: breaking through bias and censorship

Related articles

Full Explanation of Notepads Feature: Seamless Context Sharing between Cursor Editor and Chat

Cursor: a revolutionary IDE in the age of AI programming, a tool for developers to leapfrog in efficiency or an overrated toy?

The Ad Optimizer's Library of AI Tools: Copywriting, Graphics & Video Full Link Breaking

AI search track ramps up again as startup Perplexity closes funding at $9 billion valuation

No comments

Selected AI Tools

Latest Collections

Latest Articles