Relying solely on trial and error will only take you so far in the field of AI. When you're deploying a prompt at scale, running it hundreds or even thousands of times daily, understanding its failure rate becomes crucial. By identifying the most common scenarios in which it fails, you can take action, conducting A/B tests to determine which changes to the prompt yield significant improvements in performance.
Prompt engineering doesn't mean discovering a single magical phrase that makes the AI do what you say. Instead, it's a systematic approach to refining your prompts and constructing a framework that consistently delivers trustworthy results at scale. AI behaves differently than traditional software - responses vary, incorrect answers are stated with confidence, and processing large volumes of outputs consumes both time and financial resources.
To overcome these challenges, I examined all the tricks and workarounds I relied on in 2020 to get the most out of GPT-3. I discarded anything no longer needed in the GPT-4 era, leaving me with a core set of evergreen principles that are transferable across different models. These principles will remain relevant for GPT-5, LLaMA 4, Stable Diffusion XXL, or any other model you might utilize in the future. Those principles form the basis of the most popular prompt engineering course on Udemy (with over 80,000 students enrolled), and the prompt engineering book I released through O'Reilly.
The Five Principles of Prompting
- Give Direction: Describe the desired style in detail, or reference a relevant persona.
- Specify Format: Define what rules to follow, and the required structure of the response.
- Provide Examples: Insert a diverse set of test cases where the task was done correctly.
- Evaluate Quality: Identify errors and rate responses, testing what drives performance.
- Divide Labor: Split tasks into multiple steps, chained together for complex goals.
I treat these principles as a step-by-step guide, progressing through them until the prompt performs consistently. For straightforward tasks, applying just one or two principles often suffices to achieve the required quality. However, in some cases, it may take weeks of iterating through all five principles before arriving at the optimal structure. To illustrate how these principles can be put into practice, I created a Prompt Engineering template that includes a case study demonstrating their application. Whenever you intend to use a prompt repeatedly - whether it's 10, 100, or 1000 times, as part of a web app or a saved template - you can work through this worksheet to significantly enhance your results.
A Case Study
To show you how the template works, let’s run through a case study where we’ll apply the prompt engineering principles to optimize a prompt we plan to use in production.
Imagine you're a product manager at a news aggregator startup working on automatically generating summaries of articles. Being able to quickly understand the key points from long-form journalism is valuable for busy professionals trying to stay informed. Most articles can be summed up in a few sentences, but often a lot of nuance and context is lost. You could hire a team of editors, but the business is just getting started and you want to keep costs low while being able to handle a large volume of articles.
Naive Prompt
It strikes you that AI could probably handle this summarization task quite well, so you start with the most basic prompt template possible:
Please summarize the following news article:
{article_text}
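In production this prompt wouldn't be pasted into ChatGPT by hand; it would be sent via an API call. Here's a minimal sketch, assuming the OpenAI Python SDK's chat-completions interface; the model name is illustrative, and the client is passed in so you can swap in whichever provider you use:

```python
# A minimal sketch of running the naive prompt through a chat-completions
# style API. The model name is illustrative; the client object is whatever
# SDK client you have configured (e.g. OpenAI()).
def summarize(article_text: str, client, model: str = "gpt-4o") -> str:
    prompt = f"Please summarize the following news article:\n\n{article_text}"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Keeping the prompt template inside a function like this also makes it easy to version and A/B test later.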
It makes sense to start with a naive prompt like this, because why put more effort into prompting if it works well out of the box? These models are already very good, so don't waste time on prompt engineering if you don't have to. Now let's test it on a real article.
a) Can it summarize a straightforward, factual news story?
US antitrust enforcers will investigate leading AI companies Microsoft, Nvidia and OpenAI
U.S. antitrust authorities are set to investigate leading AI companies, namely Microsoft, Nvidia, and OpenAI, amid growing concerns over monopolistic behavior in the rapidly advancing AI industry. The Department of Justice will probe Nvidia, a key supplier of AI chips, while the Federal Trade Commission (FTC) will scrutinize Microsoft and OpenAI, closely tied in business. This action aligns with President Joe Biden's push for stricter regulation of Big Tech. The FTC aims to ensure that dominant firms do not exert undue influence or gain unfair competitive advantages. This investigation highlights the increased focus on maintaining fair competition in the AI sector, which heavily relies on Nvidia’s semiconductors and significant investments from tech giants like Microsoft.
This looks quite good - it picked out the key facts and condensed them into an easy-to-read summary.
b) Can it handle something a little more nuanced and opinion-driven?
Missing cues, Trump claims right to prosecute political enemies
In a recent interview with Fox News' Sean Hannity, former President Donald Trump repeatedly refused to pledge that he wouldn't prosecute his political enemies if re-elected. Despite Hannity's efforts to guide Trump to politically safer responses, Trump openly acknowledged the host's attempts and insisted that he would have the right to pursue retribution against his opponents. This candor from Trump highlights his intent to use presidential powers for personal vengeance, a stance he doesn't seem to hide. The New York Times noted that Trump's approach effectively places the rule of law on the ballot for the upcoming election.
Hmmm, this starts to get at the heart of the problem. The model is able to extract the main argument, but a lot of the character and critical tone of the original piece is lost. There is no mention of Hannity’s ties to Trump. The summary sounds more definitive and factual than the columnist likely intended. We're also still getting back unstructured text - if we want to use these summaries in our product and build features around them, it would help to get the response in a structured format.
1. Give Direction
Describe the desired style in detail, or reference a relevant persona.
The first principle is the easiest to implement, because we really just have to brief the AI better on our preferences for the task. Let's modify the prompt to give the AI more guidance on how we want the task done:
Please summarize the following news article, preserving nuance and the original author's intended tone as much as possible. The summary should give the reader a concise understanding of the key points, but not state claims more definitively than the original article did:
{article_text}
Now when we feed it the same complex opinion article as before, we get back:
In an interview with Fox News' Sean Hannity, Donald Trump repeatedly avoided opportunities to pledge that he would not use presidential power to target his political enemies if elected. Despite Hannity's attempts to guide Trump into making politically advantageous statements, Trump openly dismissed the suggestions. When pressed on whether he would avoid abusing power, Trump instead asserted he would have the right to retaliate against his foes after the election. This candidness from Trump suggests he is not hiding his intentions to potentially misuse presidential powers, indicating his focus on revenge. A New York Times analysis suggests that Trump's stance effectively places the rule of law as a key issue in the upcoming election.
This is an improvement - the model is trying harder to preserve the original tone. But we're still dealing with paragraphs of text rather than structured data. Let's address that next.
2. Specify Format
Define what rules to follow, and the required structure of the response.
The second principle is the most important for automation, because you don't want to be copying and pasting back and forth with ChatGPT if you can call an API instead. One popular format is JSON (JavaScript Object Notation), which developers can easily parse to display information on a website or pass along to another API call.
We can prompt the model to return its response in JSON format:
Please summarize the following news article, preserving nuance and the original author's
intended tone as much as possible. The summary should give the reader a concise understanding
of the key points, but not state claims more definitively than the original article did.
Return the summary in this JSON format:
{ "summary": "text of the summary here", "keyPoints": ["list of 3-5 key takeaways"] }
{article_text}
The model returns:
{
"summary": "In a recent interview with Sean Hannity on Fox News, Donald Trump openly discussed his intentions regarding the prosecution of political enemies, despite Hannity's attempts to steer him toward more politically palatable responses. Trump repeatedly missed cues to assure voters that he would not abuse presidential power for retribution, instead suggesting he felt justified in seeking revenge. This behavior underscores his willingness to use presidential powers against his foes, a stance that Hannity tried but failed to mitigate during the interview.",
"keyPoints": [
"Hannity's efforts to guide Trump to deny intentions of abusing power failed.",
"Trump openly acknowledged Hannity's attempts but chose not to follow the cues.",
"Trump expressed his belief that he has the right to pursue retribution against his political enemies.",
"Trump's candidness about his intentions raises concerns about his willingness to abuse presidential powers."
]
}
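Once the model is returning JSON, your application code needs to parse it. Models sometimes wrap JSON in markdown code fences, so it pays to parse defensively. A minimal sketch using only the standard library (the fence-stripping regex is one of several reasonable approaches):

```python
import json
import re

def parse_summary_response(raw: str) -> dict:
    """Parse the model's JSON reply, tolerating markdown code fences."""
    # Strip ```json ... ``` fences if the model added them
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    text = match.group(1) if match else raw.strip()
    data = json.loads(text)
    # Validate the structure we prompted for before using it downstream
    if "summary" not in data or "keyPoints" not in data:
        raise ValueError("response missing expected fields")
    return data
```

If `json.loads` raises, you can retry the request; newer APIs also offer dedicated JSON output modes that make this failure rarer.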
Great, now we're getting well-structured data that will be much easier to work with programmatically. The key points provide a nice high-level overview. The overall summary quality is decent, but let's see if we can improve it further by providing some examples.
3. Provide Examples
Insert a diverse set of test cases where the task was done correctly.
Adding a few examples of high-quality article summaries in the target format (what AI researchers call few-shot prompting) could help the model better understand exactly what we're looking for. We get these summaries by running the prompt on other articles, and then correcting the examples to align them with our desired response. The resulting prompt is below:
Please summarize the following news article, preserving nuance and the original author's intended tone as much as possible. The summary should give the reader a concise understanding of the key points, but not state claims more definitively than the original article did.
Here are some examples of high-quality summaries in the desired format:
Example 1
{ "summary": "Mike Johnson wants to help Donald Trump with his legal problems. He has a plan to stop giving money to certain lawyers and to investigate them. One idea is to make a new rule so that Trump can move his cases to different courts with judges who might like him more. This could help Trump pardon himself. The rule might not be allowed and probably won't happen this year because the other party is in charge. But if Mike Johnson's party wins more elections, they might try this plan again.", "keyPoints": [ "Mike Johnson wants to help Trump with his legal problems.", "He plans to stop giving money to some lawyers and investigate them.", "One idea is to let Trump move his cases to courts with friendlier judges.", "This could help Trump pardon himself.", "The plan might not happen this year, but they might try again if they win more elections." ] }
Example 2
{ "summary": "Jordan Rubin answers why former President Trump's trials might not be done before the election. He says that even though people want answers quickly, the law has rules that make trials take a long time. These rules mean we might not know what happens in Trump's trials until after the election. If Trump becomes president again, he might stop the trials altogether.", "keyPoints": [ "People want Trump's trials to be fast, but the law has rules that make them take longer.", "The law says the person on trial has a right to a quick trial, but it doesn't mean the public gets to see it quickly.", "The trials might not happen before the election because of these rules.", "If Trump wins the election again, he could stop the trials.", "People can think about the charges and decide who to vote for, even if the trials aren't done." ] }
Now please summarize this article:
{article_text}
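As the number of examples grows, it helps to assemble the few-shot prompt programmatically rather than maintain one long string. A sketch of how that assembly might look (the instruction text is from the prompt above; the function name is my own):

```python
import json

INSTRUCTIONS = (
    "Please summarize the following news article, preserving nuance and the "
    "original author's intended tone as much as possible. The summary should "
    "give the reader a concise understanding of the key points, but not state "
    "claims more definitively than the original article did."
)

def build_few_shot_prompt(examples: list, article_text: str) -> str:
    """Assemble the instructions, numbered examples, and target article."""
    parts = [
        INSTRUCTIONS,
        "Here are some examples of high-quality summaries in the desired format:",
    ]
    for i, example in enumerate(examples, start=1):
        parts.append(f"Example {i}")
        parts.append(json.dumps(example))  # examples double as format spec
    parts.append("Now please summarize this article:")
    parts.append(article_text)
    return "\n\n".join(parts)
```

Storing the examples as Python dicts means they also serve as a schema you can validate responses against.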
When we provide the same opinion piece, with these examples, the model returns:
{
"summary": "During a recent interview with Fox News' Sean Hannity, Donald Trump repeatedly missed opportunities to assure viewers that he wouldn't use presidential power to target his political enemies. Hannity, a close ally of Trump, tried to steer him towards making such reassurances, but Trump openly rejected the cues. Instead, Trump acknowledged Hannity's efforts and stated that he feels justified in seeking retribution against his opponents if he wins the election. This candid approach highlights Trump's ongoing intent to use presidential power for personal vendettas, despite Hannity's attempts to mitigate these perceptions.",
"keyPoints": [
"Sean Hannity tried to get Trump to pledge not to use presidential power against political foes.",
"Trump ignored Hannity's cues and acknowledged the guidance openly.",
"Trump stated he feels justified in seeking retribution against his political enemies.",
"This interaction underscores Trump's ongoing intent to use presidential power for personal revenge.",
"Hannity's efforts to mitigate these perceptions were unsuccessful."
]
}
The summary now does an excellent job capturing the overall vibe of the piece - the opinions and tone are well preserved. The key points also hit the article's main arguments dead on, and Hannity’s allegiance to Trump is noted. We could probably put this prompt into production as is, but let's evaluate it more rigorously first.
4. Evaluate Quality
Identify errors and rate responses, testing what drives performance.
To really assess how this prompt performs, we should test it on a larger sample of articles spanning different topics, styles and sources. Ideally, we compile at least 20-30 articles and manually write out high-quality reference summaries for each. Then we can run the prompt on each article multiple times to account for variance in the model's responses.
Some metrics we could evaluate:
- ROUGE score (a measure of similarity) of generated summaries vs reference summaries made by humans
- Qualitative evaluation of summary quality on a 1-5 scale (by multiple human raters)
- Classifying the key points as fully supported/partially supported/unsupported by the article text (this could be done by another call to an LLM)
- Sentiment analysis of summaries vs original articles to see if they match
- Fraction of summaries containing factual inaccuracies (checked by humans)
Going with ROUGE score (and asking ChatGPT for the code to calculate it, of course), we get the following output (also interpreted by ChatGPT):
Scores for Response 1: {'rouge1': Score(precision=0.10967741935483871, recall=0.7157894736842105, fmeasure=0.1902097902097902), 'rouge2': Score(precision=0.04038772213247173, recall=0.26595744680851063, fmeasure=0.07012622720897616), 'rougeL': Score(precision=0.06451612903225806, recall=0.42105263157894735, fmeasure=0.11188811188811187)}
Scores for Response 2: {'rouge1': Score(precision=0.0967741935483871, recall=0.7317073170731707, fmeasure=0.17094017094017092), 'rouge2': Score(precision=0.027463651050080775, recall=0.20987654320987653, fmeasure=0.04857142857142857), 'rougeL': Score(precision=0.0532258064516129, recall=0.4024390243902439, fmeasure=0.09401709401709402)}
Given the focus on the style of the content, **Response 2** is preferable as it has a higher recall in ROUGE-1, meaning it captures more content from the reference text. Despite having lower precision and F1 scores compared to Response 1, Response 2 effectively includes a broader range of information from the original text, which aligns with the priority of maximizing content coverage.
Now that we have a metric to optimize against, we can run the prompt 10 or 100 times and calculate the average score, then test a new version of the prompt and calculate the difference. One of the benefits of ROUGE as a metric is that it can be calculated instantly rather than incurring the cost and delay of human evaluation. We can use the insights to further refine the prompt, and also understand the failure modes to watch out for.
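To make the metric concrete, here is a from-scratch sketch of ROUGE-1, assuming simple lowercase whitespace tokenization. Production code would typically use the `rouge-score` library instead, which adds stemming plus the ROUGE-2 and ROUGE-L variants shown in the output above:

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """ROUGE-1: unigram overlap between a candidate and a reference summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as in the reference
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    fmeasure = (
        2 * precision * recall / (precision + recall) if overlap else 0.0
    )
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}
```

Because summaries are much shorter than the articles they compress, precision against the full reference will naturally be low, which is why the analysis above leans on recall.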
5. Divide Labor
Split tasks into multiple steps, chained together for complex goals.
As we evaluate the prompt's outputs, we may find that while the summaries are generally strong, the model occasionally slips up and makes factual errors, or fails to capture a key point. To address this, we could add an extra step where a separate prompt (perhaps with an alternative model that has access to the web, like Perplexity or Google’s Gemini) evaluates the summary and lists out any errors or omissions it notices.
We could also break out key point extraction into its own specialized prompt. The outputs of that prompt could then be inserted into the final summary.
In the end, our summarization system might involve 3-4 prompts working together in a sequential chain (often implemented in LangChain, a popular AI engineering library).
- Key point extraction prompt
- Main summarization prompt
- Fact checking/error detection prompt
- (Optional) human review prompt for uncertain cases
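The chain above can be sketched as a simple pipeline where each step reads shared state and adds its own output. The step functions here are hypothetical stubs; in practice each would call a model with its own prompt, and libraries like LangChain provide richer abstractions for the same pattern:

```python
def run_summary_chain(article_text: str, steps) -> dict:
    """Run a sequence of steps, each receiving and extending shared state.

    A minimal sketch of a prompt chain: each step is a function of the
    state dict, standing in for a model call with its own prompt.
    """
    state = {"article": article_text}
    for step in steps:
        state.update(step(state))
    return state

# Hypothetical stand-ins for the prompts in the chain above.
def extract_key_points(state):
    return {"key_points": ["point 1", "point 2"]}

def write_summary(state):
    return {"summary": f"Summary covering {len(state['key_points'])} points."}

def fact_check(state):
    return {"fact_check": "no errors found"}
```

Keeping each step's output in a shared state dict makes it easy to log intermediate results and route uncertain cases to human review.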
Let’s take a look at what Perplexity returns when fact-checking the summary:
The summary and key points are partially accurate but contain some inaccuracies based on the interview excerpts provided in the search results.
It is true that Sean Hannity tried to get Trump to reassure viewers that he would not use presidential power against political foes, as seen in this quote: "Will you pledge to restore equal justice, equal application of our laws, end this practice of weaponization?" However, Trump did not explicitly acknowledge or reject Hannity's guidance. Instead, he responded: "I know you want me to say something so nice, but I don't want to look naive. When this election is over, based on what they have done, I would have every right to go after them."
The summary correctly states that Trump feels justified in seeking retribution against his political opponents if elected, as evidenced by his statements: "When this election is over, based on what they've done, I would have every right to go after them, and it would be easy because it's Joe Biden." and "Revenge does take time. I will say that, and sometimes revenge can be justified. Phil, I have to be honest. You know, sometimes it can."
However, the claim that this interaction underscores Trump's "ongoing intent to use presidential power for personal vendettas" is not fully supported by the provided excerpts. While Trump expressed a belief in his right to pursue retribution, he did not explicitly state an intent to use presidential power for personal vengeance.
With an evaluation step in place, we can now trigger any number of next steps, for example going back to an earlier stage and re-running the summary with the additional context Perplexity provided about the truthfulness of certain statements. Alternatively, we could generate three different summaries and then use an evaluation prompt to choose the best one. The possibilities for improvement are endless, which is what makes prompt engineering so much fun!
We can continue to optimize each individual component of these prompts, and experiment with different models and approaches at each step. Prompt engineering is ultimately about architecting reliable and effective systems with language models as the core building blocks. By rigorously evaluating quality and dividing labor between a chain of prompts, we can build powerful applications that leverage the strengths of these models while mitigating their weaknesses.