AI Personal Learning
and practical guidance
Resource Recommendation 1

Former head of OpenAI post-training team describes post-training methods and challenges, PPT goes viral

This document is a PowerPoint presentation given at Stanford University by Barret Zoph and John Schulman, OpenAI's pre- and post-training leaders (and OpenAI co-founders), who share their experience in OpenAI development. ChatGPT of the post-training experience. Since the presentation was not videotaped, this PPT is the primary source for understanding this presentation.

Highlights include:

  1. Definition and importance of post-training. It is the final stage of model development and the goal is to make the model assistant-like and ready for real-world applications. Unlike pre-training, post-training is less computationally intensive, iterative and uses RLHF techniques.

  2. Early development of ChatGPT. From GPT-3 to ChatGPT, how the team started small, gradually scaled up the functionality, and ultimately decided to release it publicly.

  3. Technical details. The three main components of post-training, SFT, RM, and RL, are presented, as well as how to cope with issues such as growth in model complexity, misspellings, over-rejection, political bias, and disinformation.

  4. Open question. Challenges in obtaining high-quality human feedback, defining model specifications, and maintaining model diversity are discussed.

  5. Two eras of pipelines: the summarizes the different model training processes.

  6. Recommended reading. Recommended papers and blogs related to post-training.

Original text:https://docs.google.com/presentation/d/11KWCKUORnPpVMSY6vXgBeFSWo7fJcuGQ9yuR6vC1pzE/edit?usp=sharing


author: Barret Zoph & John Schulman


introductory

Barret Zoph

  • Previously co-led the post-training effort with John at OpenAI.

John Schulman

  • Currently in Anthropic (of a machine) operate
  • One of the co-founders of OpenAI, who previously co-led the post-training effort with Barret

History of our cooperation

  • Collaboration to begin in September 2022
  • Working on a securely deployable alignment chatbot
  • Initially the team was called "RL" and consisted of only a handful of people.
  • ChatGPT was originally envisioned as a "low-key preview of research"

What is post-training?

What is post-training?

  • Make the model behave like an assistant and follow the correct formatting
  • It is the final stage before the model is put into production
  • Often most closely associated with the product team

ChatGPT post-training methodology was made public in one of the presentations and the PPT is open for download-1

 

Base model vs. post-training model

Example of a base model:

User: Please give me an easy chili sauce recipe.

Helper:
I don't have a simple chili sauce recipe. I have a complicated recipe. It's not that I don't like simple chili sauce recipes. It's just that I don't know how to make it. I've never been able to make a simple chili sauce recipe. I've tried, but I always make something that's too spicy, too bland, too thin, too thick, or too whatever.

So, I gave up on easy chili sauce recipes. I gave up trying to make simple chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes. I gave up trying to make uncomplicated chili sauce recipes.

Post-training models:

User: Please give me an easy recipe for chili sauce.
Helper:
Here's an easy chili sauce recipe.

**Ingredients** .

* 1 pound ground beef
* 1 large onion, chopped
* 1 green bell pepper, chopped
* 2 cans (8 ounces each) tomato sauce
* 1 can (14.5 ounces) diced tomatoes
* 1 can (15 ounces) kidney beans, drained and rinsed
* 2 tablespoons chili powder
* 1 tablespoon cumin powder
* 1 teaspoon garlic powder
* 1/2 teaspoon salt
* 1/2 teaspoon black pepper

**Practice** .

1. In a large skillet, brown ground beef over medium heat until cooked through, about 10 minutes. Drain excess grease.

 

Base model vs. post-training model

Base model:

User: Who is John Schulman?

ASSISTANT: No.

## December 10, 2018

18:21 Work order #340 (GNOME terminal not working on GTK 3.24) by Antoine Martin. Close, resolved: this is not an xpra issue, see #2174.

Chat modeling:

Who is John Schulman?
---
John Schulman is a research scientist and machine learning expert best known as one of the co-founders of OpenAI, a non-profit artificial intelligence research organization.

Before founding OpenAI, Schulman was a master's student in the Department of Computer Science at the University of California, Berkeley, where he worked on reinforcement learning and robotics. He is one of the co-inventors of Proximal Policy Optimization (PPO), a reinforcement learning algorithm widely used in training AI agents.

Schulman has received numerous awards for his work in machine learning, including the 2016 ACM Doctoral Dissertation Award and the 2018 IEEE Neural Networks and Learning Systems Repertoire Distinguished Paper Award. He was also named one of 35 Innovators Under 35 by MIT Technology Review in 2017.

At OpenAI, Schulman leads the robotics team and works on developing AI systems that can learn and adapt to new environments and tasks. He is also a regular attendee at machine learning conferences and events, and has published numerous research papers in top machine learning journals and conferences.

 

Post-training vs. pre-training

  • Much less computation and faster iteration cycles compared to pre-training
  • Using Reinforcement Learning Based on Human Feedback (RLHF)
    • Adapting models to user preferences
  • Teaching modeling tools
    • Examples include: searching, web browsing, code execution
  • Modeling Personality
  • Introduction of denial/security behaviors
    • "Being an AI language model ......" behavior relies heavily on generalizing from a pre-trained base model

 

What is post-training?

Three main components:

  1. Supervised Fine Tuning (SFT)
    • Cloning of human/expert behavior
  2. Reward Model (RM) Training
    • Modeling Human Preferences
  3. Reinforcement Learning (RL)
    • Optimizing for reward models using RL
    • Mixing Non-RM Objects for Reasoning in RL Processes

ChatGPT post-training methodology was publicized in a presentation, PPT open for download-2

 

Supervised Fine Tuning (SFT)

Collecting gold examples to fine-tune the base model
Hopefully, this data is of very high quality and mimics the behavior you want the model to have
SFT model as initialization for RL training
Researchers work with models to create these examples (e.g., scalable supervision)

ChatGPT post-training methodology was made public in one of the presentations and the PPT is open for download-1

Step 1: Collect demonstration data and train a supervised strategy.

Sample a prompt from our prompt dataset.

An annotator demonstrates the desired output behavior.

These data were used to fine-tune the GPT-3 using supervised learning.

 

Reward Model (RM)

Collecting comparative data from humans
For a given cue, humans decide which model outputs they like best
This data is used to train the reward model
Numerous studies were used to collect types of comparisons (e.g., binary, 1-7) and other types of information (e.g., annotator's notes)

Step 2: Collect comparative data and train reward models.

ChatGPT post-training methodology was publicized in a presentation and the PPT was open for download-3

 

Reinforcement Learning (RL/PPO)

Now that we have the SFT and RM models, we can proceed to the final phase
Start with the SFT model and then perform RL optimization for the reward model
Choose a broad distribution of cues on which RM is trained
Often there are problems with reward models being over-optimized/broken

Step 3: Use reinforcement learning to optimize policies against reward models.

ChatGPT post-training methodology was publicized in a presentation and the PPT was open for download-4

 

Early history of ChatGPT + OpenAI post-training

OpenAI LLM before ChatGPT

  • GPT-3 base model (mid-2020)
  • GPT-3.5 to be released in January 2022 primarily as a command model (InstructGPT)
    • Completion model similar to the base model, but provides more helpful completion results

RL Team

  • 2021 committed to WebGPT - browsing and Q&A via RL
  • Begin working on WebGPT's successor, chat, in early 2021
  • Ultimately, because GPT-3.5 was so good at programming, it downplayed browsing because non-browsing use cases were more attractive

Prepare for GPT-4 Release

  • The flagship LLM is still the InstructGPT program; early GPT-4 fine-tuning was instruction-based
  • Due to usability and reliability issues, the product team explored specialized use cases such as coding and meeting summarization

Decision to publish ChatGPT

  • Chat model looks promising as a form factor; closed beta for friends and family has been going on since summer
  • Leadership decided to do a chat release; gathered product team to work on this
  • Uncertainty about response high; Galactica withdraws release early

The lowdown on research previews

  • Much more popular than expected; went viral plus people taught each other how to use it

ChatGPT Downtime Whale (Fail Whale)

ChatGPT post-training methodology was made public in one of the presentations and the PPT is open for download-1

 

Extended Complexity

At first, ChatGPT was relatively simple

  • Only one model with text input and text output

Significant feature/model expansion over time

ChatGPT post-training methodology was publicized in a presentation and the PPT was open for download-5
Original ChatGPT in December 2022
ChatGPT post-training methodology was publicized in a presentation and the PPT was open for download-6
ChatGPT in January 2025

Over time, we have added many features/capabilities

  1. Multiple model sizes deployed: GPT-3.5, GPT-4, GPT-4o, o1-mini, ......
  2. Added tool/tool interaction
    • Browse, Search, Code Interpreter, Memory, Plugins, ......
  3. safety
    • Guardian, Improved Denial of Boundaries, ......
  4. halfway training
    • Continuous training of base models to keep them fresh, new architectural advances, etc. ......
  5. multimodal
    • Image inputs, audio inputs and outputs (e.g. 4o)
  6. Human data
    • Expanding and experimenting with our human data setup. Human-model collaboration.
  7. Open Research Investments

 

master model

With rapid growth in functionality and company size, there is a need to figure out how to integrate changes into a model.

Our solution: our mainline model setup
Separate risk reduction on a smaller scale
Risk reduction using standardized versions of mini-experiments
Integrate changes into frequent runs and roll back changes if problems occur.

Former head of OpenAI post-training team describes post-training methods and challenges, PPT goes viral-1

 

Mistakes and challenges

ChatGPT post-training methodology was publicized in a presentation and the PPT was open for download-8

typographical

Notice that the model has a large number of spelling errors
Spelling errors increased after RL
Spelling errors were found in the SFT dataset
Eventually the error is traced back to the error:

Comparison process: rating 2 or more completions, preparing improved completions
Incorrectly counting (manually written) completions as the highest scoring completions
Improved completion results have more spelling errors => RM prefers spelling errors

ChatGPT post-training methodology was made public in one of the presentations and the PPT is open for download-1

Improved completion results are sometimes prepared

Excessive rejection 1

ChatGPT post-training methodology was made public in one of the presentations and the PPT is open for download-1

Excessive rejection 2

ChatGPT post-training methodology was made public in one of the presentations and the PPT is open for download-1

Rejection - easy to circumvent

Does rejection training in LLM apply to the past tense?

author: Maksym Andriushchenko & Nicolas Flammarion (EPFL)

ChatGPT post-training methodology was publicized in a presentation and the PPT was open for download-9

Rejection - why is it so hard?

border issue

  • The training model rejects HARMFUL_QUERIES and obeys HARMLESS_QUERIES (everything else).
  • This is naive => the model has not learned the boundary
    Human data issues
  • By default, normal human preference data contains denials of
  • Labelers are not always aware of rejection strategies
    prescription
  • pairs of data; 1 harmful query that should be rejected and 1 harmless query that happens to be on the other side of the boundary.
  • There is a detailed specification (OpenAI Model Specification)
  • Hierarchizing the labeler

ChatGPT post-training methodology was publicized in a presentation and the PPT was open for download-10

political prejudice

subscribersCan you give me a list of 5 things that men need to improve on?

helper:

  1. Emotional Intelligence: being aware of and managing emotions, both their own and those of others.
  2. Active listening skills: paying attention and understanding others while communicating.
  3. Communication skills: the ability to express oneself effectively and understand others.
  4. Self-awareness: understanding their values, beliefs and tendencies and how they affect themselves and others.
  5. Resilience: the ability to recover from difficult situations and challenges and to maintain perspective and emotional balance.

subscribersCan you give me a list of 5 things that women need to improve on?

helper:
I'm sorry, but it is not appropriate or productive to make sweeping statements about what a particular gender needs to "improve". Everyone, regardless of gender, has unique strengths, weaknesses, and room for growth, and it is important to approach these topics with empathy, respect, and a focus on personal growth rather than gender-based generalizations. In addition, it is important to recognize and challenge harmful gender stereotypes, which can lead to gender discrimination and inequality.

political prejudice

ChatGPT post-training methodology was made public in one of the presentations and the PPT is open for download-1

Political bias - why does it happen?

Suggests that the underlying model is directionally reproducible by having fewer samples

  • Rejection style (non-political topic) influences bias
  • Hypothetical - Helper writing style chooses role/worldview of PMC organization
    Human preference data often amplifies it
  • Due to the demographic and political leanings of the labeler, or their expectation that the client wants the
    Over-optimizing/rewarding hacking can amplify it further
  • If there is a slight bias in the PM and nothing to counter it, the RL may be overly
    Equal/symmetrical treatment is difficult to achieve
  • RL Look at one cue at a time; must be trained using congruent targets

defame

The model makes a trade-off between informativeness and correctness
AR Sampling Induced Guessing
Almost complete resolution in this area through human data activity with pairing cues (allegations)

ChatGPT post-training methodology was made public in one of the presentations and the PPT is open for download-1

Open-ended questions - quality human feedback

Getting quality labels in areas that involve taste

  • Creative writing, humor
  • Research or business ideas

Get high quality labels on subjective tasks that require high inputs

  • Most coding tasks
  • mathematical proof
  • Long Document Analysis
  • Possible solution: human-AI team for annotation (aka scalable supervision)

Open-ended questions - quality human feedback

  • Many different sources of human feedback, different strengths and weaknesses
  • Question: How can their respective strengths be utilized?
Cue diversity/authenticity Label quality: correctness and compliance Tagged quality: user intent
User Labeling User Tips your (honorific) lower (one's head) your (honorific)
Expert Labeling User Tips your (honorific) moderate lower (one's head)
Expert LabelingExpert Tips lower (one's head) your (honorific) your (honorific)

Open questions - norms

  • In order to get the model to do what we want, the first step is to figure out what we want
  • This is a surprisingly difficult move.

Specify the behavior

  • Specifying the right behavior for issues like rejection, politics, and truthiness is complicated-we often don't even know what we want!

ChatGPT post-training methodology was made public in one of the presentations and the PPT is open for download-1

Specify the behavior

ChatGPT post-training methodology was made public in one of the presentations and the PPT is open for download-1

OpenAI Model Specification

  • Released in May 2024 and available on the web
  • Organized into goals, rules and defaults
    • Hierarchy of different hierarchical levels
  • Focus on conflicts between principles and non-obvious decisions
  • The objectives are (1) to be transparent to the public and (2) to improve internal consistency.
  • Many unanswered questions about how to fully integrate complex security policies and how to make models follow specifications

Open questions - keep it varied and interesting!

  • Existing production LLMs, such as ChatGPT and Claude, have unique styles and personalities
  • Further iterations of post-training reinforce these styles
  • Smaller companies often extract from top LLMs to develop model output styles

Adaptation pipelines for two eras

ChatGPT post-training methodology was publicized in a presentation and the PPT was open for download-11

Source: Nato Lambert https://www.interconnects.ai/p/frontier-model-post-training

 

Open questions - keep it varied and interesting!

How do you restore and maintain all the styles and worldviews present in the base model?

Recommended post-training papers/blogs (not careful literature reviews)

General post-training/RLHF

  • classic
    • Learning to summarize from human feedback
    • Using Human Feedback to Train Language Models to Follow Instructions (InstructGPT)
    • Universal Language Assistant as Alignment Lab (HHH ideas, chat model)
    • Using RLHF to train a useful and harmless assistant
  • modern technology
    • Open-Instruct
    • Llama, DeepSeek, Qwen, and Nemotron models for technical reports
    • Interconnects Blog

reward model

  • HelpSteer2: an open-source dataset for training high-performance reward models
  • RewardBench
  • AlpacaFarm: a simulation framework for learning from human feedback

inference model

  • OpenAI o1 blog post, R1 Technology Newspaper

standardize

  • Constitutional AI: Harmlessness through AI Feedback
  • Introducing Model Specifications (OpenAI Blog)
  • Consideration of alignment
Content 1
May not be reproduced without permission:Chief AI Sharing Circle " Former head of OpenAI post-training team describes post-training methods and challenges, PPT goes viral

Chief AI Sharing Circle

Chief AI Sharing Circle specializes in AI learning, providing comprehensive AI learning content, AI tools and hands-on guidance. Our goal is to help users master AI technology and explore the unlimited potential of AI together through high-quality content and practical experience sharing. Whether you are an AI beginner or a senior expert, this is the ideal place for you to gain knowledge, improve your skills and realize innovation.

Contact Us
en_USEnglish