By Krish Maniar and William Fu-Hinthorn
When we write prompts, we try to communicate our intent to large language models (LLMs) so they can apply our instructions to complex data. But clearly expressing every nuance up front is hard. Prompt engineering is usually done through manual trial and error, testing and tuning, but tools like DSPy and promptim demonstrate the value of "prompt programming" and systematic prompt optimization: they bridge the gap between intent and instruction by measuring and testing against real data. In this post, we:
- Selected five datasets with verifiable outcomes for benchmarking prompt optimization
- Implemented and compared five methods for systematically improving prompts
- Evaluated three models (`gpt-4o`, `claude-sonnet`, and `o1`) as prompt optimizers
Our conclusions:
- We recommend `claude-sonnet` for prompt optimization (it outperformed `o1`)
- Prompt optimization is most effective on tasks where the model lacks domain knowledge
- In those cases, prompt optimization can improve accuracy by roughly 200% over the baseline prompt
- Prompt optimization can also be viewed as a form of long-term memory: learning directly from the data and adapting to the task over time
What did we test?
We benchmarked five popular prompt optimization methods (each explained in detail below):
- Few-shot prompting: use training examples as demonstrations of the desired behavior
- Meta-prompting: use an LLM to analyze and improve the prompt
- Meta-prompting with reflection: let the LLM think through and critique its proposed changes before submitting the updated prompt
- Prompt gradients: generate per-example "text gradients" as improvement suggestions, then apply those suggestions in a separate LLM call
- Evolutionary optimization: explore the prompt space through controlled mutation
We ran these methods with three optimizer models (o1, GPT-4o, and Claude-3.5-Sonnet) on five datasets representing common tasks, aiming to answer these core questions:
- When is prompt optimization most effective?
- Which frontier models are best at prompt optimization?
- Which algorithms are most reliable?
Algorithms
We tested five prompt optimization methods, each embodying a different theory of how prompts improve:
Few-shot prompting
This is the simplest approach: we select up to 50 examples from the training set (sampled over multiple training epochs) and include them in the prompt as demonstrations of the desired behavior. Learning is cheap (no LLM calls are needed to propose changes), but token costs at test time are higher, since demonstrations usually contain more content than direct instructions.
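In code, this amounts to sampling and string formatting. Below is a minimal sketch, assuming each training example is a dict with "input" and "output" keys (an illustrative format, not the benchmark's actual schema):

```python
import random

def build_few_shot_prompt(base_prompt: str, train_set: list, k: int = 50) -> str:
    """Append up to k sampled training examples as demonstrations."""
    demos = random.sample(train_set, min(k, len(train_set)))
    rendered = "\n\n".join(
        f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in demos
    )
    return f"{base_prompt}\n\nExamples of the desired behavior:\n\n{rendered}"
```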
Meta-prompting
This is the most basic form of instruction tuning. We first run the target LLM on the examples and score its outputs (note: this requires setting up an evaluator). We then give a meta-prompting LLM the inputs, outputs, reference outputs (if available), and the current prompt's scores on those outputs, and ask it to write a better prompt. This process repeats over minibatches of data, with periodic evaluation on a held-out development (dev) set, keeping the highest-scoring prompt.
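A minimal sketch of one step of this loop, with `target_llm` and `meta_llm` as placeholder callables (string in, string out) standing in for real model calls, and `evaluate` as the scoring function described above:

```python
def meta_prompt_step(current_prompt, batch, target_llm, meta_llm, evaluate):
    """Run the target model on a minibatch, score it, and rewrite the prompt."""
    rows = []
    for ex in batch:
        output = target_llm(f"{current_prompt}\n\nInput: {ex['input']}")
        rows.append(
            f"Input: {ex['input']}\nOutput: {output}\n"
            f"Reference: {ex.get('output')}\nScore: {evaluate(output, ex)}"
        )
    # Ask the optimizer model to propose an improved prompt.
    return meta_llm(
        "You are improving a prompt for another model.\n\n"
        f"Current prompt:\n{current_prompt}\n\n"
        "Results on a minibatch:\n\n" + "\n\n".join(rows) + "\n\n"
        "Write an improved prompt."
    )
```

An outer loop would call this over minibatches and retain whichever candidate prompt scores best on the dev set.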
Meta-prompting with reflection
Provide "think" and "critique" tools based on the meta-prompts. These tools only allow LLM to record reflections in the draft area before submitting cue updates, in order to utilize more computational power to analyze previous cues and discover hidden patterns in the data before submitting the final version.
Prompt gradients
Inspired by Pryzant et al.'s paper Automatic Prompt Optimization, this approach splits the optimization process into several steps:
- Score the outputs of the current prompt
- Have an LLM generate concrete feedback (the "gradients") for the failing examples
- Update the prompt based on those "gradients"
The core idea is that collecting fine-grained feedback before making changes yields more targeted improvement suggestions than plain meta-prompting.
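A minimal sketch of one gradient step, under the same placeholder conventions as above (`meta_llm` is a string-to-string callable; `failures` pairs each failing example with the bad output from the scoring phase):

```python
def prompt_gradient_step(current_prompt, failures, meta_llm):
    """Critique each failure ('text gradients'), then apply the critiques."""
    # Generate one targeted suggestion per failing example.
    gradients = [
        meta_llm(
            f"Prompt:\n{current_prompt}\n\nInput: {ex['input']}\n"
            f"Model output: {output}\nExpected: {ex['output']}\n\n"
            "In one or two sentences, say what the prompt should change "
            "to avoid this failure."
        )
        for ex, output in failures
    ]
    # Apply all the gradients in a single rewrite call.
    return meta_llm(
        f"Current prompt:\n{current_prompt}\n\n"
        "Revise it to incorporate these suggestions:\n"
        + "\n".join(f"- {g}" for g in gradients)
    )
```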
Evolutionary optimization
These algorithms run in "generations", each a separate phase of optimization. In every generation, the algorithm applies semi-random "mutations" to the prompts (in our experiments, LLM-generated prompt updates of different types) and then retains the best-performing prompts.
For these experiments, we used PhaseEvo, the state-of-the-art technique proposed by Cui et al., which combines a "text gradient" approach with more global mutation strategies to explore the prompt space at a larger scale and escape local optima.
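The generational skeleton looks roughly like this (a simplified sketch, not PhaseEvo itself: `mutate` is an LLM-driven semi-random rewrite of a prompt, and `fitness` scores a prompt, e.g. by dev-set accuracy):

```python
import random

def evolve_prompt(seed_prompt, mutate, fitness,
                  pop_size=8, survivors=2, generations=5):
    """Generational search over prompt space: mutate, score, select."""
    population = [seed_prompt]
    for _ in range(generations):
        # Refill the population by mutating randomly chosen survivors.
        while len(population) < pop_size:
            population.append(mutate(random.choice(population)))
        # Selection: keep only the best-scoring prompts for the next round.
        population = sorted(population, key=fitness, reverse=True)[:survivors]
    return population[0]
```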
Datasets
We created five datasets for the benchmark:
- Support email routing (3): each incoming email is classified and routed to one of 3 handlers.
- Support email routing (10): like (1), but with 10 handlers; the task is harder because each handler's "domain of expertise" is left less explicit.
- Multilingual math: the LLM must solve math word problems and spell out the answers in one of 5 languages. The target language is determined by the topic of the question (sports → Korean, space → Arabic, cooking → German, music → English, wildlife → Russian), and the optimizer has to discover this hidden pattern from the data.
- Email assistant (simple): a synthetic dataset testing whether prompt optimization helps on tasks where the LLM already has domain knowledge. The LLM must decide whether an email should be ignored, answered, or escalated to the user.
- Email assistant (preferences): similar to the previous dataset, but with subtler rules. We modeled the ground-truth labels on the email preferences of a busy and "idiosyncratic" tech executive.
Results
On all five datasets, we ran OpenAI's GPT-4o and o1 and Anthropic's Claude-3.5-Sonnet as the meta-prompting (optimizer) LLMs. The target LLM was GPT-4o-mini throughout (i.e., we used the other models to optimize GPT-4o-mini's prompt).
The following figure shows the optimization results for different datasets and algorithms:
Average relative improvement across all test datasets: 100% means accuracy doubled; 200% means accuracy tripled.
The results show that Claude was the more stable optimizer model, more reliable than o1. o1 also has drawbacks in processing time, cost, and API reliability (OpenAI's endpoints sometimes incorrectly flagged our requests as ToS violations). We therefore currently recommend Claude-3.5-Sonnet as the preferred model for prompt optimization, and will revisit this recommendation as o3 and other new models become available.
Our findings
Overall, these results support the existing literature showing that large language models (LLMs) can perform well at prompt engineering. The experiments also reveal when prompt optimization will (and will not) be effective.
- Meta-prompting is particularly useful for discovering clear patterns such as rules or preferences, especially when that information lies outside the LLM's original knowledge. This means you can define the desired behavior by example and rely on the optimizer to translate that behavior into instructions for other LLMs, so long as they can follow reasonable instructions. This makes a declarative prompt-programming model possible.
- Meta-prompting (instruction tuning) is less useful for conveying nuances of preference. For example, on the simple email classification dataset, every instruction-tuning method underperformed few-shot prompting. In that dataset, classification depended mainly on fuzzy rules and conditional judgments rather than explicit rules.
- Combining few-shot prompting with instruction tuning can yield complementary gains, consistent with the findings of Opsahl-Ong et al. and Wan et al. Few-shot examples can convey more information than terse instructions, but they cannot cover the complex conditions and rules that enterprise agents often need. Conversely, prompt optimization via reflection, "text gradients", or evolutionary algorithms can make more targeted improvements based on observed performance and dataset characteristics, while also improving token efficiency.
- Meta-prompting does not give models new capabilities. For example, on the multilingual math dataset, GPT-4o-mini never exceeded a 65% pass rate even in the optimized configurations, mainly due to reasoning errors. An optimizer can instruct the model in how to present its work (and reasoning exemplars can sometimes induce better patterns of thinking), but it does not unlock stronger reasoning skills or deeper domain-specific knowledge.
Beyond evaluation
We've been building LangSmith to help teams systematically evaluate LLM applications. Good evaluations let you identify problems and understand system behavior. But the datasets and metrics you build while evaluating unlock something even more valuable: systematically improving performance through optimization.
The datasets in our experiments worked well for optimization because they have clear, verifiable outcomes:
- Routing decisions with ground-truth labels
- Math answers that can be verified
- Language constraints that can be checked programmatically
This matters because optimizing against fuzzy or unreliable metrics tends to make prompts worse, not better. An LLM judging outputs against vague criteria will tend to optimize for its own biases rather than your actual needs.
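To make "verifiable" concrete, here are simplified stand-ins (not the benchmark's actual evaluators) for the three kinds of checks above, each decidable in plain code without an LLM judge:

```python
import re

def routing_correct(predicted: str, expected: str) -> bool:
    """Routing: exact match against the ground-truth handler label."""
    return predicted.strip().lower() == expected.strip().lower()

def math_answer_correct(output: str, expected: float) -> bool:
    """Math: parse the last number in the output and compare it."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    return bool(numbers) and abs(float(numbers[-1]) - expected) < 1e-6

def mostly_hangul(output: str) -> bool:
    """Language: check that most letters are Hangul (the Korean constraint)."""
    letters = [ch for ch in output if ch.isalpha()]
    korean = [ch for ch in letters if "\uac00" <= ch <= "\ud7a3"]
    return bool(letters) and len(korean) / len(letters) > 0.5
```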
If you're already tracking application performance in LangSmith, you've laid the groundwork for effective prompt optimization. The same datasets that help you understand failures can also drive systematic improvement. Data, metrics, and learning form a closed loop.
Prompt optimization as long-term memory
Optimization is learning, so we can think of prompt optimization as a special form of long-term memory, one that captures "always-on" behavioral patterns.
Where traditional memory systems store information in databases (vector, graph, or other formats), prompt optimization stores it directly in the agent's prompt, keeping it always available and letting it influence every decision. This approach is particularly well suited to core patterns such as behavioral rules, style preferences, and key personality traits.
The "learn and improve" process closely resembles classic prompt optimization, differing mainly in when updates are scheduled and where they are stored. The techniques and learning algorithms used for prompt optimization may also apply to memory systems, a direction we are actively investigating.
Implications
These results support our observation (shared by the DSPy researchers and others) that LLM-driven prompt optimization can systematically improve prompts and automate what is today a manual, trial-and-error-dominated prompt engineering process. Making this approach more accessible can help everyone build better, stronger systems.
However, it is not a one-size-fits-all solution. Our optimized prompts were not optimal on the test sets, and the gains varied across tasks. Prompt optimization should be treated as one tool in the LLM application optimization toolbox, not the only approach.
We plan to integrate these insights directly into LangSmith to help teams move beyond manual prompt engineering. The goal is not to eliminate human judgment, but to make decisions more systematic and data-driven.
Reproducing the experiments
You can reproduce these experiments by running the `all_sweeps.sh` script in the GitHub repository.
Appendix
Training dynamics
The sections above focused on the prompts' final performance on the test set. Below, we show training-dynamics curves for each dataset on the dev set. Compared with final scores alone, these curves show how the different algorithms fit each dataset, revealing whether an algorithm learned unstably or overfit the data and thus failed to deliver consistent gains.
Support email routing (3 classes)
Most optimizers improved on the baseline prompt, with the gradient and evolutionary methods performing similarly. Notably, Claude outperformed GPT-4o across all methods; with plain meta-prompting, however, neither Claude nor GPT-4o improved significantly on the dev set.
Support email routing (10 classes)
With GPT-4o, the meta-prompting and meta-prompting+reflection setups failed to learn the classification rules in this dataset. A common pattern emerges here: if the curve stays flat, the algorithm has failed to learn; if it shoots rapidly toward a perfect score, it is probably overfitting. The best test-set performance tends to come from algorithms that improve steadily on the dev set.
Multilingual math
Training performance on this dataset was discontinuous: some setups saw no significant gains until round 2 or 3 (or even later). This highlights the importance of tracking the full edit history rather than only the last one or two attempts: an LLM acting as meta-optimizer can derive more effective update strategies from that history.
Comparing the optimized prompts
While we ultimately care more about downstream metrics than the exact text of the prompts, it is still worth examining what the optimizers learned to change and which changes produced the gains.
Here we compare four optimization algorithms on the 10-class support email dataset:
All four algorithms learned the main classification rules. The gradient method, however, was weaker at handling fuzzy boundaries, while the other methods tended to formulate "priority rules" or build decision trees to pin down the classification criteria.
Behavioral differences also appear when comparing optimizer models under the same algorithm. For example, o1 seems to prefer combining techniques (e.g., synthesizing few-shot examples alongside step-by-step instructions) and uses its characteristic separator ("-") to delimit rule sets, whereas Claude is more concise and direct yet still learned the priority rules and domain mappings. GPT-4o, by contrast, produced the least information-dense prompts.