Z Highlights
- Intuition works for about half of the job. It's helpful when there is a clear product direction, for example, and it's just a matter of doing the final fine-tuning and trying to understand the target users and the exact problem to be solved, because that situation is closer to the traditional product launch process. But in the early stages of a project, that's not the case at all. Sometimes we just have some unknown capabilities.
- But here, every two months, computers are able to do new things that have never been done before. You need to understand how these technological changes will affect your product, and the answer may be that the impact is considerable, so it's really interesting to watch how AI is evolving from the inside.
- We've found that Claude is actually good at writing evals and grading them. So we can automate a lot of that process for you, but only if you tell us what success looks like; then we can actually go and improve it over time.
- Models will become smarter at an accelerated rate, and that's part of what makes all of this possible. Another very exciting thing is seeing models interact the way we humans do.
New Roles and Challenges in AI: Conversations and Explorations
Sarah: Hello everyone!
Kevin: Sarah, you are the queen of AI investing.
Sarah: That's a title we'll never use again, but it's great to be here with you both. I had two different ideas for our closing discussion. The first was a product-launch duel, because both of these guys have the ability to just hit the "ship" button, and I thought, come on, let's just announce everything you're going to launch in the next 6 to 12 months and completely ignore all the internal guidelines.
The second was that we'd redesign Instagram together, because both of them actually ran Instagram, but those plans have since been completely scrapped. So, let's just share our insights as friends. That's going to sound kind of boring, but I'm really looking forward to hearing what you both have to share. Anyway, this is a relatively new role for both of you. Kevin, you've done a lot of really different and interesting things before, so what was the reaction of your friends and team when you took the role?
Kevin: Overall it's exciting. It's one of the most interesting and impactful positions out there, and there's so much to explore. I've never had a product role that was so challenging, interesting, and sleep-depriving. It encompasses all the challenges of a typical product role, like figuring out who your users are and what problems you can solve. But usually when you're developing a product, you're working from a relatively fixed technical base: you know what resources are available, and then you build the best product possible.
But here, every two months, computers are able to do new things that have never been done before, and you need to understand how these technological changes are going to affect your product. The answer is probably that the impact will be quite big, so it's really interesting to watch AI evolve from the inside, and I'm enjoying it.
Sarah: Mike, what about you? I remember hearing the news and thinking to myself that it was surprising to actually get the founder of Instagram to work on a project that already existed.
Mike: Yeah, my three favorite reactions are: people who know me say it makes sense and that I'll have fun there. Then some people say, you don't need to work, why bother with this? If you really know me, you know that I just can't stop, I simply can't help myself. The third reaction is that it's interesting that they actually got the founder of Instagram. It's true that not a lot of companies could do what it takes to interest me, but there are probably three I would be interested in. So the reaction varied depending on how well you know me, especially if you'd witnessed me in that semi-retired state, which lasted about six weeks, and then I was like, what do I do next?
Kevin: We were having dinner with a group of friends recently and you were exuding a childlike sense of excitement, and it struck me that you said you were learning about all these enterprise aspects. It's different from the consumer work we did at Instagram; now it's about serving other businesses and working inside a research-driven organization. What's been the biggest surprise so far?
Mike: Those are really the two most rewarding aspects of this job, and completely new experiences for me. At 18 I made a vow, very much in keeping with an 18-year-old's nature, that every year would be different, that I didn't want to live the same year over and over again. Because of that, sometimes I think, "Do you want to make another social product?" It feels too repetitive: first, your standards get blurred, and second, it also feels a bit like repeating the same thing over and over. So, the enterprise side is really refreshing. I'm curious about your experience with that as well. With consumers you get feedback in real time, whereas I imagine enterprise is more like investing, where the cycle is a lot longer. You have that initial conversation, and then you think, "They seem to like me," and then you find out the project is in the approval process, and then it's about six months before you get to the actual deployment stage, and only then do you know whether it's a good fit. So you have to get used to a different timeline.
I'll ask why it hasn't moved forward yet and they'll say, Mike, you've only been here two months; this thing is already going through the process and will eventually fall into place. One does have to get used to this different pace. But the interesting thing is that once the product is live, you can contact the customer directly, and they'll come to you, talk about the experience, and confirm the results. Whereas with consumers, you can mostly only analyze them in aggregate through data science. Of course you can invite one or two to come and talk, but they won't have enough financial incentive to give you detailed feedback on your strengths and weaknesses. So this approach is different, but it's also very fulfilling.
Sarah: Kevin, you've been involved in so many types of product development before, how much does your intuition play a role in these projects?
Kevin: Yes, before answering your question I'd like to add something about the enterprise side as well. In the enterprise space, the focus is not necessarily on the product itself. There is also a buyer who has their own goals. You can build the best product in the world, and everyone in the company may be happy to use it, but that doesn't necessarily settle it. I was in a meeting with a large enterprise customer earlier and they said, "This is great, we're happy with it, and so on. But we have a requirement: we want to know 60 days before any new product goes live." I thought to myself, "I'd like to know 60 days in advance too."
It's very different indeed, and it's interesting because at OpenAI we have products for consumers, businesses, and developers all at the same time, so we're experimenting on almost all fronts. In terms of intuition, it works for roughly half of the job. For example, when you have a clear product direction, like when you're close to releasing Advanced Voice Mode or Canvas and you're doing the final fine-tuning, trying to understand the target user and the exact problem you're trying to solve, that's when intuition is helpful, because that situation is closer to the traditional product release process.
Still, in the early stages of a project, it's not like that at all. Sometimes we just have some unknown capabilities. For example, you may be training a new model and think it has a certain capability, but you're not sure, the research team isn't sure, no one is sure. It might work, like a statue slowly emerging from the mist, but that capability is an emergent property of the model. So you don't know if it's actually going to work, or whether it will be 60% effective, 90% effective, or 99% effective. And for a model that's 60% effective versus 90% or 99%, the corresponding product form is completely different. You're kind of in a waiting mode. I don't know if you've ever had the feeling that, from time to time, you go talk to the research team and ask how it's going, how the model training is going, any new insights, and they say it's research, we're still working on it, we're not sure, it's an exploratory process. But it's also fun, because we're all discovering new things together, just with a certain amount of randomness.
Uncertainty and adaptation in AI product development: from prototype to user feedback
Mike: It reminds me most of the Instagram days, like Apple's announcements at WWDC, where you're thinking this could either be very good for us or it could be disruptive for us. And now it's kind of similar, but your own company is disrupting you internally, which feels cool, but at the same time it feels like the product roadmap is completely disrupted.
Sarah: What does this cycle look like for you? You describe it as "looking through the fog" to find the next set of features. So, can you plan without knowing exactly what's going to happen? Also, what's the iterative cycle of discovering new features and integrating them into the product like?
Mike: In terms of intelligence, you can at least get a rough sense that "it's moving in this direction." So you can build products around that and make decisions accordingly. Overall, there are three ways to deal with this. First, the progress of intelligence is unpredictable, but at least a general trend can be seen. The second is to decide which capabilities to invest in from a product perspective and then fine-tune them with the research team, something like Artifacts, where we invested a lot of time between research and product. Canvas is the same thing: you're doing co-design, co-research, and co-fine-tuning. It's a real privilege to be able to work at this company and participate in that kind of design. And then there's also investing in capabilities, like OpenAI's voice mode, or the computer use work we released this week. You're like, "Okay, 60% now, good progress, keep it up."
So what we try to do is get the designers involved in the process early on, while knowing that you're not making a final bet. As the saying about experiments goes, the result of an experiment should be learning, not a perfect product every time. The same applies when working with a research team: the result should be a demo or something inspiring that sparks product ideas, not a predictable product process where you think, "this has eliminated the risk, which means that when the research lands, it has to look exactly like this."
Kevin: What I also like is that some parts of the research are at least product-oriented, especially in the post-training phase, as Mike said. The other part of the research is more academic. So we would sometimes hear about certain capabilities at conferences and think, we'd really like to do this too, and then one of the researchers on the team would say, we've actually been able to do this for about three months. And we'd be surprised and ask, really? What's going on? And they'd say, we didn't think it was important, so we moved on to something else. But sometimes you really do get some magic moments.
Sarah: One of the things we often consider when investing is what you can do if a model has a 60% success rate at a particular task instead of 99%. For many tasks, even at around 60%, the task itself is still very important and valuable. So, how are you assessing task progress internally? And then, how do you think about making failures graceful in the product, or helping users get through this "transition"? Not so much "we just need to wait for the model to get better", but how do you actually deal with it?
Kevin: There are actually a lot of things you can do with a model that's 60% correct, but you need to design specifically for that. You have to expect more manual intervention in the system, rather than relying entirely on automation. For example, take GitHub Copilot, which was the first product to really make people realize that AI could be used not just for Q&A but for real, economically valuable work. When it was released, I don't know exactly which model it was based on, but I do know it must have been several generations ago. So I can guarantee you that that model was nowhere near perfect at coding.
Sarah: That would have been based on GPT-2, a fairly small model.
Kevin: True, but it was still valuable because it saved you a lot of effort in writing code, and while it may not have produced perfect code, it at least got most of it done for you, and you just needed to edit it. So an experience like that is totally viable. We're going to see something similar, especially in the shift to agents and longer task formats: even if it isn't perfect, if it saves you 5 to 10 minutes, it's still valuable. More importantly, if the model understands what it's unsure of and reaches out to you to ask, "I'm not sure about this, can you help me?", then the human-plus-model combination will be much better than 60%.
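(A minimal sketch of the "ask when unsure" pattern Kevin describes: a 60%-accurate model still saves time if it escalates to a person instead of guessing. The `call_model` helper and the confidence convention are illustrative assumptions, not any particular API.)

```python
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    confidence: float  # model's self-reported confidence, 0.0 to 1.0

def call_model(task: str) -> Draft:
    """Hypothetical helper standing in for a real model call that also elicits confidence."""
    raise NotImplementedError

def complete_task(task: str, threshold: float = 0.8) -> str:
    draft = call_model(task)
    if draft.confidence >= threshold:
        return draft.text  # confident enough: fully automated path
    # Below threshold: hand the draft to a person rather than shipping a guess.
    print(f"Model is unsure ({draft.confidence:.0%}). Draft:\n{draft.text}")
    return input("Please edit or approve the draft: ")
```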
Mike: That percentage is like a threshold line for AI, a sort of Mendoza line, and it's usually very uneven: a model may perform very well on some tests and not so well on others. It also helps us when we're running pilot programs with customers, especially when we get feedback from two companies on the same day. Sometimes a customer will say, this solves all of our problems, we've been trying to do this for three months, thank you!
But that doesn't mean it's better than other models; we also come across situations where it is worse. So it's essential to understand that. You can do a lot of internal evaluation, but when it comes to actually putting the model into real-world applications, you realize that, just like when you're doing a design, you might think it's perfect at first, but once it's in front of users you realize you were wrong. Models have a similar feel: we try our best to come up with reasonable judgments, but each customer has their own customized dataset, their own internal needs, and their own way of prompting the model. So when the model is actually put out into the world, it shows up almost like a double and gives you different results.
Kevin: I'm curious if you feel the same way. Right now the models aren't limited by intelligence, they're limited by evals. Models are actually capable of doing more, and of being more accurate in a wider range of domains, but their current performance is far from their full potential. The key is how to teach them, how to get them to learn what you need about a specific topic; those things may not be in their initial training set, but they're capable of it if you teach them.
Mike: Yes, we see this all the time. A couple of years ago there were a lot of exciting AI apps, when everyone was just focused on launching cool AI features and didn't do any evaluation at all. Now everyone assumes the new model should be better, but they never actually wrote evals because they were rushing to ship AI features. The hardest part is getting people to stop and think about what success really is. What problem are you actually solving? Often the product manager will change, and the new product manager will take over and start asking, so what does success look like? Let's write some evals.
We've found that Claude is actually good at writing evals and grading them. So we can automate a lot of this process for you, but only if you tell us what success looks like; then we can actually go and make incremental improvements. That process is often the key to getting a task from 60% to 85%. If you come to Anthropic for an interview someday, you may see a part of our interview process that asks you to improve a bad eval into a good one. We want to see how you think; that talent is hard to find elsewhere, so we're working hard to develop those skills. If we could teach people one thing, it would be this.
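(A minimal sketch of the LLM-graded eval loop Mike is describing: the model produces an answer, and a second call grades it against a plain-language definition of success. The `call_model` helper, the example case, and the grading prompt are illustrative assumptions, not Anthropic's actual tooling.)

```python
def call_model(prompt: str) -> str:
    """Hypothetical helper standing in for a real model API call."""
    raise NotImplementedError

EVAL_CASES = [
    # Each case pairs an input with a plain-language statement of what success means.
    {"input": "Summarize this support ticket: ...",
     "criteria": "Mentions the customer's core complaint and the requested refund."},
]

GRADER_PROMPT = """You are grading a model output.
Task: {input}
Success criteria: {criteria}
Output to grade: {output}
Answer PASS or FAIL, then give one sentence of justification."""

def run_eval() -> None:
    passes = 0
    for case in EVAL_CASES:
        output = call_model(case["input"])
        verdict = call_model(GRADER_PROMPT.format(output=output, **case))
        passes += verdict.strip().upper().startswith("PASS")
    print(f"Pass rate: {passes}/{len(EVAL_CASES)}")
```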
Kevin: This is really an important point. Writing evals is going to be one of the core skills of future product managers.
Mike: We actually discussed this internally, and maybe this is a bit of inside baseball, but it's interesting. We have research product managers who specialize in model capabilities and model development, and we have product managers who are more responsible for product surfaces or APIs. Then we realized that the role of the product manager building AI-driven features in 2024 and 2025 is becoming more like the former and less like the latter. For example, we released a code analysis feature where Claude can actually analyze CSVs and write code for you. The product manager is responsible for getting it to maybe 80% good, and the rest of the quality comes down to the work you do on evals and prompts. So these two product manager roles are now gradually merging.
Kevin: Yes, exactly right. We set up a bootcamp where every product manager learned to write evals and to understand the difference between good and bad evals. We're certainly not done with this process yet and need to keep iterating and improving, but it's really a key part of making great AI products.
Sarah: For people who want to do well at building AI products or research-driven products in the future, we can't attend your bootcamp, Kevin. So how do we develop the intuition to become good at evals and iteration cycles?
Kevin: You can use the model itself to do this. For example, if you ask the model directly "what makes a good eval" or "give me some example evals", it will give you a good answer.
Mike: This is very important, and if you listen to people like Andrej Karpathy and others who have spent a lot of time in this field, they will all say that nothing beats looking at the data. People often get into the dilemma where they have some eval, the new model scores 80% on it, and they're afraid to ship the new model because it isn't perfect. But in fact, if you look at some of the actual cases, you'll find that the model is already good enough; it's just that the eval itself isn't well constructed.
It's even funny: every model release has a model card, and for some evals, when we look at the golden answer, I'm not sure a human would actually say that, or the math question itself is a little bit wrong. Getting to 100% perfection is very difficult because even the grading itself is very challenging. So my suggestion for developing intuition is to look at the actual answers, or even sample them, and ask, "maybe we should evolve the eval methodology, or even if the eval results look harsh, the overall vibe might be good." That's why it's so important to dive into the data and really touch it.
Kevin: I also think it's going to be interesting to see how this process evolves as we move toward longer or more agentic tasks. When the task is something like "here's a math problem, four-digit addition, and there's a right answer", you know what good looks like and it's easy to judge. As the model starts to do longer, fuzzier things, like "help me find a hotel in New York City", there is still a notion of what's right, but a lot of it involves personalization. If you ask any two perfectly capable people, they might make completely different decisions. So you're judging against a much looser standard. It's going to be an interesting process for us; we're going to have to evolve again and redefine the rubrics, like we're constantly reinventing things.
Mike: When you think about it, at both of our labs there's actually some notion of "what it looks like for capability to develop over time". It looks a little like a career ladder, where you take on bigger, longer-term tasks. Maybe evals will start to look more like performance reviews. I'm in the middle of performance review season right now, so that metaphor is on my mind. Does the model meet your expectations of what a competent person would accomplish? Does it exceed expectations? Did it do it faster, or discover a restaurant you didn't know existed? In that case it's more complex and subtle than the usual right-or-wrong criteria.
Kevin: Not to mention the fact that humans are still writing these evals while the models are approaching or surpassing human performance on certain tasks. Sometimes people even prefer the model's answers to the human's. So what does that mean if humans are writing your evals?
Sarah: Evals are obviously key, and we should all go spend time with these models and teach ourselves how to write them. So what other skills should product managers learn? You're both on this learning path right now.
Mike: Prototyping with these models is an underrated skill. Our best product managers do this: when we're discussing whether the UI should be this or that, before a designer even opens Figma, our product managers, or sometimes our engineers, will say, "OK, I used Claude to mock up an A/B test of what each of these two UIs would look like." I just think that's cool; we can prototype more options in a shorter amount of time and evaluate them more quickly. So the skill of prototyping with these tools is very useful.
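(A minimal sketch of that prototyping workflow, assuming the same hypothetical `call_model` helper as in the earlier sketches: ask the model for two self-contained HTML mockups of the same feature and open them side by side. The variant names and prompts are purely illustrative.)

```python
def call_model(prompt: str) -> str:
    """Hypothetical helper standing in for a real model API call."""
    raise NotImplementedError

# Two competing UI directions for the same feature, described in plain language.
VARIANTS = {
    "a_inline_edit": "Editing happens inline, directly inside the list.",
    "b_side_panel": "Editing opens in a side panel next to the list.",
}

def prototype_ui(feature: str) -> None:
    for name, approach in VARIANTS.items():
        html = call_model(
            f"Create a single self-contained HTML file mocking up {feature}. "
            f"Design constraint: {approach} Use no external assets."
        )
        with open(f"mock_{name}.html", "w") as f:
            f.write(html)  # open both files in a browser to compare the options
```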
Kevin: That's an excellent point. I also agree with you that this will also push product managers to dive deeper into the technology stack, and maybe that requirement will change over time. For example, if you were doing database technology in 2005, you may have needed to dive into it in a completely different way, whereas doing database technology now may not require mastering all the basics because so many levels of abstraction have been established. That's not to say that every product manager needs to be a researcher; having an understanding of these technologies, taking the time to learn their language, and developing an intuition for how these things work all go a long way for product managers.
Mike: The other aspect is that you're dealing with a stochastic, non-deterministic system. It's product design in a world where you can't fully control the model's output; you can only do the best you can. So what kind of feedback mechanisms do you need to close the loop? How do you tell when the model has gone off track? How do you collect feedback quickly? What safeguards do you want to put in place? How do you know how the model will behave at scale? These questions require understanding the model's output not just for a single user, but across a large number of users per day. That requires a very different way of thinking: previously, a bug report might be that clicking a button didn't perform the expected action, and that type of problem was easier to recognize and solve.
Kevin: Maybe that will change in five years, when people have gotten used to all of this. But we're still at the stage of getting used to non-deterministic user interfaces, especially for people who aren't technical and aren't used to this in the products they use. It goes completely against our intuition from the last 25 years of using computers, where the same inputs produced the same output; that no longer holds. And it's not just that we need to adapt to this change when building our products, we also need to put ourselves in the shoes of the people using them and think about what it means for them. There are some downsides to this, but there are also some really cool upsides. So it's really interesting to think about how we can use it to our advantage in different ways.
Mike: I remember we did a lot of rolling user research at Instagram. Researchers would bring in different people every week and test prototypes each time, and we do something similar at Anthropic. Interestingly, what often surprised me in those sessions was the way people actually used the product; it's always interesting to see how users react to new features and what their use cases are. Now, half of this research is about how users react, and the other half is about how the model behaves in that context. And sometimes you see it do really well.
So there's a sense of pride, especially when the model responds well in a user research session. And it's frustrating when the model misinterprets the intent and you realize it's gone ten pages deep into the wrong answer. So, in a way, it's learning to have a "zen" mindset about the uncertainty in this environment, letting go of the sense of control and accepting what happens.
Rapid adaptation and education of AI technology: from consumers to business users
Sarah: Both of you have worked on consumer experiences that quickly taught hundreds of millions of people new behaviors. How are you thinking about educating end users now that these AI products are becoming even more ubiquitous than those were, especially when even product managers and technologists themselves don't have much intuition about how to use these technologies? The scale you're dealing with is massive, and these technologies are so counterintuitive.
Kevin: It's amazing how quickly we adapt. I was talking to someone the other day about their first ride in a Waymo (a driverless car). Who here has been in a Waymo? If you haven't, when you leave here, take a Waymo in San Francisco to wherever you're going. It's an amazing experience. They said that for the first 30 seconds you're thinking, "Oh my God, watch out for that cyclist," and five minutes later, "Oh my God, I'm living in the future." But ten minutes later, you're bored and on your phone.
That's how quickly we become accustomed to absolute magic. The same thing has happened with ChatGPT, which came out less than two years ago and was a real shock at the time. Now, if we go back and use the original GPT-3.5, it feels terrible.
Sarah: Everyone will say it's stupid.
Kevin: And yet everything we're shipping today, everything you're building, feels like magic, and 12 months from now we'll be in disbelief that we ever used this stuff, because that's how quickly the field is evolving. What amazes me even more is how quickly people adapt, because even as we do our best to push things forward, there's a lot of excitement. People understand that the world is moving in this direction, and we have to do what we can to keep it moving in the best possible direction. It is happening, and it is moving very fast.
Mike: One of the things we're trying to improve right now is making the product literally an educational tool, which we didn't do early on, and the direction we're moving is teaching Claude more about Claude itself. Before, it would just say it's an AI created by Anthropic, describe what its training data contains, and so on, but now it can literally say, "Here's how to use this feature." User studies showed that people would ask, "How do I use this?" and Claude might answer, "I don't know, have you tried looking it up online?" That answer doesn't help at all.
So now we're trying to ground it in the actual product. What it can do now is say, "Here's the link to the documentation, here are the steps, I can help you." These models are actually very effective at solving UI problems and user confusion, and we should be using them more to solve those problems.
Sarah: Things must be different when it comes to change management in an organization, right? Because then there were existing ways of doing things and organizational processes. So how do you look at educating the whole organization to help them improve productivity or other changes that might come about?
Mike: The enterprise side is really interesting, because while these products have millions of users, most of the core users are still early adopters and people who like technology, with a long tail of everyone else. When you get into the enterprise and deploy the product into an organization, there are usually people who aren't very tech savvy. It's cool to see some of those non-technical users get their first exposure to a chat-based LLM and watch how they react. You get the opportunity to run training sessions, teach them how to use it, and provide educational materials. We need to learn from these practices and figure out how to teach the next 100 million people to use these technologies.
Kevin: These user interfaces usually have some core users who are excited to teach others to use them. For example, OpenAI has custom GPTs and organizations typically create thousands of them. This provides an opportunity for core users to create something that makes AI easier and more immediately valuable to people who may not know how to use it. That's a cool place where you can find some core user groups who will actually become evangelists.
Sarah: I have to ask you this because your organizations are basically power users, so you live in your own little world of the future. I have a question, but feel free to redirect me if you don't want to answer it. Mike, what are you doing with computer use? What do you all use it for?
Mike: From an internal perspective, as Kevin mentioned earlier about "when is it ready", there was a period where we had to get confident that it was good enough to put out, even though it was early and would make mistakes, and ask how we could make it as good as it could be.
One of the most interesting use cases was when we were running a test and someone wanted to see if the AI could order pizza for us. It ended up actually ordering it, everything went smoothly, and the pizza was delivered to the office. That was a cool moment, an iconic moment so to speak, even if it was Domino's (not a particularly high-end pizza); the point is that the AI did it. Moments like that are very interesting. Of course, the pizza was a bit over-ordered, but I was hungry enough to try it.
Now we're seeing some really interesting early use cases, and one of them is UI testing. At Instagram, we had almost no UI tests because they were hard to write and fragile; they would often fail because of things like a button's position changing, and you'd have to rewrite a lot. Now, computer use is very effective at running "does it work as expected" UI tests, which basically ask "does it do what you want it to do". That's very interesting.
Another direction we've started to dig into is agentic applications that require a lot of data processing. For example, in our support and finance teams, a lot of the PR forms were very tedious and repetitive, involving a lot of manual time pulling data from one source and putting it into another. Whenever I talk about computer use, I use the phrase "heavy lifting": we want to automate this tedious work so that people can focus on more creative things instead of clicking 30 times for every operation.
Sarah: Kevin, we have a number of teams experimenting with the o1 model. Obviously it can do more complex things. But if you're already using a model like GPT-4 in your application, you can't simply drop it in as a one-to-one replacement. Can you give us some guidance on this? How are you using it internally?
Kevin: One thing that a lot of people probably don't realize is that what some of our most advanced customers do, and what we do internally, is not actually use a single model for a task. You end up combining models into workflows with coordination between them, using each model for what it's best at. o1 is very good at reasoning, but it takes time to think, it isn't multimodal, and it certainly has some other limitations.
Sarah: Reasoning is a fundamental issue for this group, I realize.
Kevin: Yes, you're probably familiar with the idea of scaling up pretraining. You go through versions like GPT-2, 3, 4, and so on, with bigger and bigger pretraining runs. The models get "smarter", or rather they know more and more, but they're more like System 1 thinking: you ask a question and they give you the answer right away, like text completion.
Sarah: Yes, if I ask you a question now, you just start outputting the answer and keep going.
Kevin: Don't you find that human intuition about how other people operate can often help you form hypotheses about how models work? You ask me a question and I might go off track into the wrong sentence, and it's hard to recover from that point; that can happen with models too. So you have that kind of ever-growing pretraining. The o1 models are actually a different way of scaling intelligence, and it happens at query time. So, unlike System 1 thinking, where you ask me a question and I give you the answer right away, it will pause, like if I asked you a question.
If I asked you to solve a Sudoku or do the New York Times Connections puzzle, you'd start thinking about how the words group together: these four might or might not go together, maybe it's these instead... You form hypotheses from what you already know, then falsify or confirm them, and keep reasoning. That's exactly how scientific breakthroughs arise and how we answer hard questions, and that's what we're teaching the models to do. Right now, they think for 30 to 60 seconds and then answer. Imagine what happens when they can think for five hours or even five days.
So, it's a whole new way of scaling intelligence, and we think it's just getting started; we're at the GPT-1 stage of this new kind of reasoning. But as always, one model isn't used for everything, right? Sometimes when you ask me a question, you don't want me to wait 60 seconds to answer, you want the answer right away. So we end up using our models together in many different ways.
Cybersecurity, for example, is an area where you might think the models aren't applicable. They can hallucinate, and it seems like a domain where hallucination is unacceptable, but you can fine-tune models so that they excel at certain tasks and are very precise about particular kinds of inputs and outputs, and then have those models work together. You'd have models checking the output of other models, realizing something isn't right, and asking them to retry. So ultimately, we get great value by orchestrating models together to accomplish specific tasks. It's like how humans accomplish complex tasks: we usually have people with different skills collaborating to get a hard thing done.
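(A minimal sketch of the "models checking models" workflow Kevin describes: a fast model drafts an answer, a reasoning model reviews it, and the draft is retried when the review fails. The model names and the `call_model(model, prompt)` helper are illustrative assumptions, not OpenAI's actual setup.)

```python
def call_model(model: str, prompt: str) -> str:
    """Hypothetical helper standing in for a call to a specific model."""
    raise NotImplementedError

def answer_with_review(question: str, max_retries: int = 2) -> str:
    draft = call_model("fast-general-model", question)
    for _ in range(max_retries):
        review = call_model(
            "reasoning-model",
            f"Question: {question}\nDraft answer: {draft}\n"
            "Reply OK if the draft is correct; otherwise describe what is wrong.",
        )
        if review.strip().upper().startswith("OK"):
            return draft  # the checker is satisfied, ship the draft
        # Feed the critique back and ask the drafting model to try again.
        draft = call_model(
            "fast-general-model",
            f"{question}\nYour previous answer had problems: {review}\nTry again.",
        )
    return draft  # fall back to the latest draft after exhausting retries
```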
Anticipating the Future of AI: Proactivity, Asynchronous Interactions and Personalized Experiences
Sarah: You have to tell us something about the future and what's coming. You don't have to give a release date, and I understand you don't know, but how far out can you actually see in AI right now? If you can see into the future, let me know. But say it's six or twelve months out: what kind of experience do you imagine becoming possible or commonplace?
Mike: I think about this all the time, and there are two words I'd plant in everyone's mind. The first is "proactive": how do models become more proactive? For example, once they get to know you and start monitoring some of your information (assuming you authorize them to), they might read your email in a way that's helpful rather than intrusive and pick up on interesting trends. Or the model could give you a proactive summary as you start your day: here's what's happening today and which conversations you might be involved in; I've done some research for you; with your next meeting coming up, here's what you might want to talk about; I see you have an upcoming presentation, and here's a first draft I've prepared. Proactivity like that will be very powerful in the future.
The other word is "asynchronous". The o1 model is an early interface for this exploratory phase: it can do a lot, and it tells you what it's doing as it goes. You can sit and wait for it, but you can also say, "it'll think about this for a while, I'll go do something else and come back later, or it'll tell me when it's done." It's like expanding the time dimension: it's not just answering the question you asked in the moment, it can come back to you on its own, which will be interesting. And when you ask a question, it could say, "Okay, I'll go think about it, do some research, I might need to ask someone else a few questions, then I'll give an initial answer, verify it once more, and you'll hear back from me in an hour."
That breaks the constraint of getting an answer right away. It lets you do things like, "here's a whole little project plan, go expand it", or "don't just change one thing on the screen, fix this bug", or "tweak the PRD for me based on these three new market conditions". Being able to drive change along those dimensions is what I'm personally most excited about on the product side.
Kevin: Yes, I totally agree with all the points you made. Models will become smarter at an accelerated rate, and that's part of what makes all of this possible. Another very exciting thing is seeing models interact the way we humans do. Right now you interact with these models mostly by typing, and I communicate with many of my friends on WhatsApp and other platforms by typing too. But I can also talk and see. We recently introduced Advanced Voice Mode. I was meeting people in Korea and Japan and was often with someone who didn't understand my language at all; before, we couldn't communicate. But now I could say, "ChatGPT, I want you to act as a translator. When I speak in English, please translate it into Korean, and when you hear Korean, tell me in English." Suddenly I had a universal translator for business conversations. It felt like magic.
Think about what this technology could do, not just in business settings. Imagine how much more willing people would be to travel to new places if you no longer had to worry about not speaking the language, because you had a universal translator, like the one in Star Trek, right in your pocket. Experiences like this will become commonplace in the future, but they'll still feel magical, and I'm very excited about this technology combined with everything Mike just said.
Sarah: One of my favorite pastimes right now is watching TikTok videos of young people talking to voice mode, pouring their hearts out in all sorts of ways, and it's just amazing to watch. It reminds me of the old terms "digital natives" and "mobile natives". I'm a big believer in AI myself, but I never thought I'd be interacting with it this way. A 14-year-old, though, just assumes they can do this with AI.
Kevin: Have you ever used this on your kids?
Sarah: I haven't yet; my kids are 5 and 7.
Kevin: Definitely give it a try, though. My kids are 8 and 10, and they often ask in the car, "Can I talk to ChatGPT?" Then they ask the strangest questions and have odd conversations with the AI, but they don't mind talking to an AI at all.
Sarah: In fact, one of my favorite experiences, and maybe we can end by asking what the most amazing behavior is that you've seen lately (whether from a child or anyone else), is this: when I was a kid, I was lucky if my parents read to me, and it was great if I got to choose the books, otherwise my dad would say, "We're going to read this physics paper I'm interested in." My kids, and I don't know if this is a Bay Area way of parenting, will say, "Okay, Mom, make a picture to go with it. I want a story about a dragon and a unicorn, and here's the setting, and I'll tell you what happens." And then that story gets created in real time. That's a big ask, and I'm glad they believe and know it's possible, but it's really wild to create your own entertainment this way. So what's the most surprising behavior you've seen in your products lately?
Mike: It's a behavior and a relationship. People are really starting to understand the nuances of Claude, or of a new model that's just been released; they pick up on the subtleties. The behavior is almost like making friends, building two-way empathy with what's happening. People will say, "The new model feels smarter, but maybe a little more distant." It's that kind of nuance. As a product person, it gives me more empathy for people's mindset when they use our products. You're not just shipping a product, you're shipping intelligence and a personality, and that's why the relationship matters. If a friend showed up and said, "I upgraded and I'm 2% better at math, but I've become a bit different in other ways," you'd think, I'll have to adapt a little, and maybe be a little worried. It's been an interesting journey for me, understanding people's mindset when they use our products.
Kevin: Yes. The behavior of the model is definitely part of the product. The personality of the model is critical, and there are interesting questions, like how customizable it should be. Should OpenAI's model have one uniform personality and Claude have his own distinct personality? Do people choose a model because they like a certain personality? It's a very human phenomenon; we make friends with different people because we're drawn to different people. It's an interesting topic to think about. We did something recently that spread quickly on Twitter: people started asking the model, "Based on what you know about me, based on all of our past interactions, how would you describe me?" And the model would respond with a description based on all those past interactions. You start interacting with the model almost as if it were a person or an entity. It's very interesting to see how people react to that.