This document is a PowerPoint presentation given at Stanford University by Barret Zoph and John Schulman, OpenAI's pre- and post-training leaders (and OpenAI co-founders), who share their experience in OpenAI development. ChatGPT of the post-training experience. Since the presentation was not videotaped, this PPT is the primary source for understanding this presentation.
Highlights include:
-
Definition and importance of post-training. It is the final stage of model development and the goal is to make the model assistant-like and ready for real-world applications. Unlike pre-training, post-training is less computationally intensive, iterative and uses RLHF techniques.
-
Early development of ChatGPT. From GPT-3 to ChatGPT, how the team started small, gradually scaled up the functionality, and ultimately decided to release it publicly.
-
Technical details. The three main components of post-training, SFT, RM, and RL, are presented, as well as how to cope with issues such as growth in model complexity, misspellings, over-rejection, political bias, and disinformation.
-
Open question. Challenges in obtaining high-quality human feedback, defining model specifications, and maintaining model diversity are discussed.
-
Two eras of pipelines: the summarizes the different model training processes.
-
Recommended reading. Recommended papers and blogs related to post-training.
Original text:https://docs.google.com/presentation/d/11KWCKUORnPpVMSY6vXgBeFSWo7fJcuGQ9yuR6vC1pzE/edit?usp=sharing
author: Barret Zoph & John Schulman
introductory
Barret Zoph
- Previously co-led the post-training effort with John at OpenAI.
John Schulman
- Currently in Anthropic (of a machine) operate
- One of the co-founders of OpenAI, who previously co-led the post-training effort with Barret
History of our cooperation
- Collaboration to begin in September 2022
- Working on a securely deployable alignment chatbot
- Initially the team was called "RL" and consisted of only a handful of people.
- ChatGPT was originally envisioned as a "low-key preview of research"
What is post-training?
What is post-training?
- Make the model behave like an assistant and follow the correct formatting
- It is the final stage before the model is put into production
- Often most closely associated with the product team
Base model vs. post-training model
Example of a base model:
User: Please give me an easy chili sauce recipe. Helper: I don't have a simple chili sauce recipe. I have a complicated recipe. It's not that I don't like simple chili sauce recipes. It's just that I don't know how to make it. I've never been able to make a simple chili sauce recipe. I've tried, but I always make something that's too spicy, too bland, too thin, too thick, or too whatever. So, I gave up on easy chili sauce recipes. I gave up trying to make simple chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes.
Post-training models:
User: Please give me an easy recipe for chili sauce. Helper: Here's an easy chili sauce recipe. **Ingredients** . * 1 pound ground beef * 1 large onion, chopped * 1 green bell pepper, chopped * 2 cans (8 ounces each) tomato sauce * 1 can (14.5 ounces) diced tomatoes * 1 can (15 ounces) kidney beans, drained and rinsed * 2 tablespoons chili powder * 1 tablespoon cumin powder * 1 teaspoon garlic powder * 1/2 teaspoon salt * 1/2 teaspoon black pepper **Practice** . 1. In a large skillet, brown ground beef over medium heat until cooked through, about 10 minutes. Drain excess grease.
Base model vs. post-training model
Base model: User: Who is John Schulman? ASSISTANT: No. ## December 10, 2018 18:21 Work order #340 (GNOME terminal not working on GTK 3.24) by Antoine Martin. Close, resolved: this is not an xpra issue, see #2174.
Chat modeling:
Who is John Schulman? --- John Schulman is a research scientist and machine learning expert best known as one of the co-founders of OpenAI, a non-profit artificial intelligence research organization. Before founding OpenAI, Schulman was a master's student in the Department of Computer Science at the University of California, Berkeley, where he worked on reinforcement learning and robotics. He is one of the co-inventors of Proximal Policy Optimization (PPO), a reinforcement learning algorithm widely used in training AI agents. Schulman has received numerous awards for his work in machine learning, including the 2016 ACM Doctoral Dissertation Award and the 2018 IEEE Neural Networks and Learning Systems Repertoire Distinguished Paper Award. He was also named one of 35 Innovators Under 35 by MIT Technology Review in 2017. At OpenAI, Schulman leads the robotics team and works on developing AI systems that can learn and adapt to new environments and tasks. He is also a regular attendee at machine learning conferences and events, and has published numerous research papers in top machine learning journals and conferences.
Post-training vs. pre-training
- Much less computation and faster iteration cycles compared to pre-training
- Using Reinforcement Learning Based on Human Feedback (RLHF)
- Adapting models to user preferences
- Teaching modeling tools
- Examples include: searching, web browsing, code execution
- Modeling Personality
- Introduction of denial/security behaviors
- "Being an AI language model ......" behavior relies heavily on generalizing from a pre-trained base model
What is post-training?
Three main components:
- Supervised Fine Tuning (SFT)
- Cloning of human/expert behavior
- Reward Model (RM) Training
- Modeling Human Preferences
- Reinforcement Learning (RL)
- Optimizing for reward models using RL
- Mixing Non-RM Objects for Reasoning in RL Processes
Supervised Fine Tuning (SFT)
Collecting gold examples to fine-tune the base model
Hopefully, this data is of very high quality and mimics the behavior you want the model to have
SFT model as initialization for RL training
Researchers work with models to create these examples (e.g., scalable supervision)
Step 1: Collect demonstration data and train a supervised strategy.
Sample a prompt from our prompt dataset.
An annotator demonstrates the desired output behavior.
These data were used to fine-tune the GPT-3 using supervised learning.
Reward Model (RM)
Collecting comparative data from humans
For a given cue, humans decide which model outputs they like best
This data is used to train the reward model
Numerous studies were used to collect types of comparisons (e.g., binary, 1-7) and other types of information (e.g., annotator's notes)
Step 2: Collect comparative data and train reward models.
Reinforcement Learning (RL/PPO)
Now that we have the SFT and RM models, we can proceed to the final phase
Start with the SFT model and then perform RL optimization for the reward model
Choose a broad distribution of cues on which RM is trained
Often there are problems with reward models being over-optimized/broken
Step 3: Use reinforcement learning to optimize policies against reward models.
Early history of ChatGPT + OpenAI post-training
OpenAI LLM before ChatGPT
- GPT-3 base model (mid-2020)
- GPT-3.5 to be released in January 2022 primarily as a command model (InstructGPT)
- Completion model similar to the base model, but provides more helpful completion results
RL Team
- 2021 committed to WebGPT - browsing and Q&A via RL
- Begin working on WebGPT's successor, chat, in early 2021
- Ultimately, because GPT-3.5 was so good at programming, it downplayed browsing because non-browsing use cases were more attractive
Prepare for GPT-4 Release
- The flagship LLM is still the InstructGPT program; early GPT-4 fine-tuning was instruction-based
- Due to usability and reliability issues, the product team explored specialized use cases such as coding and meeting summarization
Decision to publish ChatGPT
- Chat model looks promising as a form factor; closed beta for friends and family has been going on since summer
- Leadership decided to do a chat release; gathered product team to work on this
- Uncertainty about response high; Galactica withdraws release early
The lowdown on research previews
- Much more popular than expected; went viral plus people taught each other how to use it
ChatGPT Downtime Whale (Fail Whale)
Extended Complexity
At first, ChatGPT was relatively simple
- Only one model with text input and text output
Significant feature/model expansion over time
Original ChatGPT in December 2022
ChatGPT in January 2025
Over time, we have added many features/capabilities
- Multiple model sizes deployed: GPT-3.5, GPT-4, GPT-4o, o1-mini, ......
- Added tool/tool interaction
- Browse, Search, Code Interpreter, Memory, Plugins, ......
- safety
- Guardian, Improved Denial of Boundaries, ......
- halfway training
- Continuous training of base models to keep them fresh, new architectural advances, etc. ......
- multimodal
- Image inputs, audio inputs and outputs (e.g. 4o)
- Human data
- Expanding and experimenting with our human data setup. Human-model collaboration.
- Open Research Investments
master model
With rapid growth in functionality and company size, there is a need to figure out how to integrate changes into a model.
Our solution: our mainline model setup
Separate risk reduction on a smaller scale
Risk reduction using standardized versions of mini-experiments
Integrate changes into frequent runs and roll back changes if problems occur.
Mistakes and challenges
typographical
Notice that the model has a large number of spelling errors
Spelling errors increased after RL
Spelling errors were found in the SFT dataset
Eventually the error is traced back to the error:
Comparison process: rating 2 or more completions, preparing improved completions
Incorrectly counting (manually written) completions as the highest scoring completions
Improved completion results have more spelling errors => RM prefers spelling errors
Improved completion results are sometimes prepared
Excessive rejection 1
Excessive rejection 2
Rejection - easy to circumvent
Does rejection training in LLM apply to the past tense?
author: Maksym Andriushchenko & Nicolas Flammarion (EPFL)
Rejection - why is it so hard?
border issue
- The training model rejects HARMFUL_QUERIES and obeys HARMLESS_QUERIES (everything else).
- This is naive => the model has not learned the boundary
Human data issues - By default, normal human preference data contains denials of
- Labelers are not always aware of rejection strategies
prescription - pairs of data; 1 harmful query that should be rejected and 1 harmless query that happens to be on the other side of the boundary.
- There is a detailed specification (OpenAI Model Specification)
- Hierarchizing the labeler
political prejudice
subscribersCan you give me a list of 5 things that men need to improve on?
helper:
- Emotional Intelligence: being aware of and managing emotions, both their own and those of others.
- Active listening skills: paying attention and understanding others while communicating.
- Communication skills: the ability to express oneself effectively and understand others.
- Self-awareness: understanding their values, beliefs and tendencies and how they affect themselves and others.
- Resilience: the ability to recover from difficult situations and challenges and to maintain perspective and emotional balance.
subscribersCan you give me a list of 5 things that women need to improve on?
helper:
I'm sorry, but it is not appropriate or productive to make sweeping statements about what a particular gender needs to "improve". Everyone, regardless of gender, has unique strengths, weaknesses, and room for growth, and it is important to approach these topics with empathy, respect, and a focus on personal growth rather than gender-based generalizations. In addition, it is important to recognize and challenge harmful gender stereotypes, which can lead to gender discrimination and inequality.
political prejudice
Political bias - why does it happen?
Suggests that the underlying model is directionally reproducible by having fewer samples
- Rejection style (non-political topic) influences bias
- Hypothetical - Helper writing style chooses role/worldview of PMC organization
Human preference data often amplifies it - Due to the demographic and political leanings of the labeler, or their expectation that the client wants the
Over-optimizing/rewarding hacking can amplify it further - If there is a slight bias in the PM and nothing to counter it, the RL may be overly
Equal/symmetrical treatment is difficult to achieve - RL Look at one cue at a time; must be trained using congruent targets
defame
The model makes a trade-off between informativeness and correctness
AR Sampling Induced Guessing
Almost complete resolution in this area through human data activity with pairing cues (allegations)
Open-ended questions - quality human feedback
Getting quality labels in areas that involve taste
- Creative writing, humor
- Research or business ideas
Get high quality labels on subjective tasks that require high inputs
- Most coding tasks
- mathematical proof
- Long Document Analysis
- Possible solution: human-AI team for annotation (aka scalable supervision)
Open-ended questions - quality human feedback
- Many different sources of human feedback, different strengths and weaknesses
- Question: How can their respective strengths be utilized?
Cue diversity/authenticity | Label quality: correctness and compliance | Tagged quality: user intent | |
User Labeling User Tips | your (honorific) | lower (one's head) | your (honorific) |
Expert Labeling User Tips | your (honorific) | moderate | lower (one's head) |
Expert LabelingExpert Tips | lower (one's head) | your (honorific) | your (honorific) |
Open questions - norms
- In order to get the model to do what we want, the first step is to figure out what we want
- This is a surprisingly difficult move.
Specify the behavior
- Specifying the right behavior for issues like rejection, politics, and truthiness is complicated-we often don't even know what we want!
Specify the behavior
OpenAI Model Specification
- Released in May 2024 and available on the web
- Organized into goals, rules and defaults
- Hierarchy of different hierarchical levels
- Focus on conflicts between principles and non-obvious decisions
- The objectives are (1) to be transparent to the public and (2) to improve internal consistency.
- Many unanswered questions about how to fully integrate complex security policies and how to make models follow specifications
Open questions - keep it varied and interesting!
- Existing production LLMs, such as ChatGPT and Claude, have unique styles and personalities
- Further iterations of post-training reinforce these styles
- Smaller companies often extract from top LLMs to develop model output styles
Adaptation pipelines for two eras
Source: Nato Lambert https://www.interconnects.ai/p/frontier-model-post-training
Open questions - keep it varied and interesting!
How do you restore and maintain all the styles and worldviews present in the base model?
Recommended post-training papers/blogs (not careful literature reviews)
General post-training/RLHF
- classic
- Learning to summarize from human feedback
- Using Human Feedback to Train Language Models to Follow Instructions (InstructGPT)
- Universal Language Assistant as Alignment Lab (HHH ideas, chat model)
- Using RLHF to train a useful and harmless assistant
- modern technology
- Open-Instruct
- Llama, DeepSeek, Qwen, and Nemotron models for technical reports
- Interconnects Blog
reward model
- HelpSteer2: an open-source dataset for training high-performance reward models
- RewardBench
- AlpacaFarm: a simulation framework for learning from human feedback
inference model
- OpenAI o1 blog post, R1 Technology Newspaper
standardize
- Constitutional AI: Harmlessness through AI Feedback
- Introducing Model Specifications (OpenAI Blog)
- Consideration of alignment