
Prompt Engineering for Generative AI
Transcript
Intro
Phil Winder: Hello. Welcome to another episode of GOTO Book Club. My name's Phil Winder, and I'm the CEO of Winder AI. Since 2013, Winder AI has been developing AI applications. And if you need AI engineers for your next project, then give me a call. I'm here today with James Phoenix and Mike Taylor. They are the authors of a brand new-ish book, end of 2024, "Prompt Engineering for Generative AI." I've got it right here, proof. Today, I'm hoping to basically delve through the book and ask lots of questions, follow-up questions that I had based upon all of the interesting insight that was provided in the book. Mike, would you like to go first just to introduce yourself, tell us about what you do, and what made you interested in prompt engineering, to begin with?
Mike Taylor: Sure, Phil. Thanks for having us here. I also love that you use the word delve, by the way, because that's one of the words that we had to search for the book and make sure that it wasn't in too often because it's what ChatGPT uses quite a lot. I got into prompt engineering in 2020 because I was actually just leaving my first company. I founded a marketing agency, grew it to 50 people, and then I was looking for something to do during the COVID lockdowns, and got access to GPT-3. So, the rest is history. I managed to start replacing all my work with AI and then eventually started to work in AI full-time. We could talk a bit more about that, but that was my journey.
Phil Winder: Perfect. James, over to you.
James Phoenix: Nice to meet everyone. I'm James Phoenix. I'm a software engineer, indie hacker, and predominantly been using GPT and OpenAI for a good couple of years now. I really enjoy using it for coding. So, I'm really, really into the whole Cursor stuff and sort of, you know, replacing any part of my workflow. I do a range of different projects for clients. I've got one client at the moment who I'm helping build some sort of rank tracker for LLMs and, yeah, classification pipelines and that kind of stuff. So, yeah.
Phil Winder: Interesting. Well, thank you again for the book. I found it really, really interesting. I think I'd like to start off by asking a few general questions about the use of LLMs and generative AI, in general, just to bring everybody up to speed if they're not fully aware of what it is.
Recommended talk: How to Leverage Reinforcement Learning • Phil Winder & Rebecca Nugent • GOTO 2021
Key Aspects of Prompt Engineering & Its Definition
Phil Winder: I guess let's get the big one over and done with. How would you define prompt engineering? What is prompt engineering?
Mike Taylor: Obviously, everyone thinks about it differently. Some people think of it as just kind of adding some magic word on the end of the prompt that tricks the AI into doing a good job, and that's what it started as, I think, in a lot of cases. But we take a broader view of prompt engineering being this rigorous process of figuring out, you know, through testing, through evaluation, what combination of inputs leads to the best outputs and the most reliable outputs as well because I would say, you know, you don't need prompt engineering that much when you're just using ChatGPT as a consumer. But if you're building an AI application, it's going to be very important. There's a great source of improvement in accuracy just in getting the context right; rewriting the prompt in a certain way will give you much better results. So, it's that kind of process for really applying the scientific method as much as possible with these LLMs to test and learn, "Okay, when I write it this way with this context, I get this type of response."
Phil Winder: So, it's a lot about rigor there. It's engineering rigor that you're trying to apply to a slightly non-deterministic system, depending on how you configure it.
Mike Taylor: Exactly. Yeah. And I come from a marketing background. You know, growth marketing has a great culture of A-B testing and that kind of came naturally to me. And then, James Phoenix, I think you could talk a bit more about this, but coming from a data science background as well, I think that's another rich profile for people getting into AI because it comes naturally that you start to really think about what are the inputs, let's look at the raw data, let's look at patterns, let's see what opportunities there are to improve performance.
Phil Winder: Absolutely. James, what do you think?
James Phoenix: I think definitely, and then there's also some other dimensions that you'll often do whilst you're working in this kind of environment. Things like using more powerful models, switching the models quite a lot, and as well as that playing around with the parameters like temperature and log probabilities. It's one of these things where prompt engineering is definitely part of the stack. But as you know, an AI engineer or someone working in that space, you'll often use a mixture of techniques alongside prompt engineering. So, you know, changing the models, you might do fine-tuning. Like, there's a variety of different techniques that you're kind of trying to apply to maximize that output from a non-deterministic model.
Phil Winder: I guess one of the challenges I always have with my AI projects, data science projects, in general, is the problem definition, is defining the thing that you're trying to solve. And so, does that still apply in prompt engineering? How important is the goal? I think you mentioned the word goal there, Mike.
Mike Taylor: It's like 80% of the work. I find quite often when I work with clients, they don't have any formal process for evaluation, and that becomes most of the hard work. Because once you have a way to measure whether this is a good response or a bad response that doesn't require the CEO of the company to come and check each response manually, once you have a programmatic evaluation method that you can run after every response, that's when it really blooms. You know, it really opens up the amount of things you can do in prompt engineering. You can A-B test, especially if you're not a domain expert in this area. So, say you're working with a team of lawyers, like, James, one of your projects, it's a real bottleneck if you have to go back to the lawyers every time to check whether the response is good or not. You need to build up a test set, different test cases, you need to know the right answer in those cases, and then you need to set up some evaluation metric that you can use to optimize. Once you have that, you can do A-B testing. You can build more interesting architectures with retries and things like that. So, that tends to be the main key. And if you don't have that, you can't really do prompt engineering.
James Phoenix: Just to add onto this, I think the other thing that's really interesting is, for example, it's much harder to evaluate content generation versus a classification pipeline because you can easily see when a classification has gone wrong. So, depending upon what you're trying to do as the goal, if your goal is to produce, you know, social media posts, the evaluation isn't necessarily as important there. I mean, you could make the argument that brand text and guidelines are quite important, but there's less of an effort on that side. Whilst if you've got very wrong classifications, that's going to have a lot more of an impact on downstream data and, you know, the applications that are consuming that data. There's a question of how well the evaluations have to work and also how easy it is to tell whether it's good or bad. If it's classification, a binary classification, it's very easy to see if it's wrong. It's a lot less easy to see whether it's wrong when you have a human evaluating this piece of marketing material versus that piece of marketing material because it is a lot more vague. There's a lot more nuance to language than, you know, binary classification as well. Those are all the types of things you also run into as well.
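For illustration, here is a minimal sketch of the kind of programmatic evaluation loop Mike describes, assuming a hypothetical support-ticket classification task, a tiny hand-labelled test set, and the OpenAI Python SDK; the prompts, labels, and model choice are placeholder assumptions rather than anything from the book.

```python
# A minimal evaluation harness: run a prompt variant against a labelled
# test set several times and report how often it returns the right answer.
# The test cases, prompts, and model here are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TEST_CASES = [  # hand-labelled examples a domain expert signed off on once
    {"text": "I was charged twice this month", "label": "billing"},
    {"text": "The app crashes when I upload a file", "label": "bug"},
    {"text": "Can you add dark mode?", "label": "feature_request"},
]

def classify_ticket(text: str, prompt_template: str) -> str:
    """Ask the model to classify a single ticket using the given prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt_template.format(text=text)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

def evaluate(prompt_template: str, runs: int = 5) -> float:
    """Accuracy over the test set, repeated to surface non-determinism."""
    correct = total = 0
    for _ in range(runs):
        for case in TEST_CASES:
            prediction = classify_ticket(case["text"], prompt_template)
            correct += int(case["label"] in prediction)
            total += 1
    return correct / total

# A/B test two prompt variants and keep whichever scores higher.
variant_a = "Classify this support ticket as billing, bug, or feature_request:\n{text}"
variant_b = ("You are a support triage assistant. Reply with exactly one word: "
             "billing, bug, or feature_request.\nTicket: {text}")
print(evaluate(variant_a), evaluate(variant_b))
```

In practice the test set would be far larger and the scoring stricter, but the shape (test cases, a metric, repeated runs, and an A/B comparison of prompt variants) is the point.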
Misconceptions About Prompt Engineering
Phil Winder: Cracking. I guess before we dive any deeper, are there any misconceptions or anything that we'd need to sort of clear up before we dig in?
Mike Taylor: One of them is just that prompt engineering isn't an art as much as it is a science. I would say that prompting is an art and, you know, there's a creative element to coming up with new ideas for prompts. But I wouldn't call it prompt engineering unless you're doing testing and, you know, trying to work at scale. And that's something I struggle with as well as someone who has a Udemy course with James, and we wrote the book as well. Those are targeted more towards a technical audience, but initially, a lot of the audience was non-technical. We get a huge demand from people who don't know how to code, who want to read the book or, you know, want to get better at prompting. And it's just kind of two different things, right? Like, there's only so much testing you can do manually without, like, you know, being able to code and being able to run it 1,000 times or 10,000 times and see how often it breaks. So, that's something that we still don't have, I think, a good handle on in the industry, whether...you need to know how to code to be a prompt engineer because when I deliver training, quite often the non-technical people say, "This is too technical." And then the technical people say, "This is too simplistic." And so, what I try to do is I call it prompt engineering if it's for a technical audience, and then I just call it prompting if it's not. But not everyone has that same definition in their head. So, that's the thing I would love for more people to clear up.
Practical Applications and Tools
Phil Winder: You're almost saying that it's actually more like programming than it is writing. So, do you consider prompt engineering to be a new style of programming?
Mike Taylor: It feels that way. I really like Andrej Karpathy, his take on it, that it's basically a new kind of abstraction on top of machine learning, which was an abstraction on top of programming initially, right? Rather than building a special-purpose computer that's the size of a room to do a specific counting task. That was the first programming, right? It was actually literally building the computer. Then we had general-purpose computers where you just have to write the program and run it. Then you have machine learning where you just need to give it the data and it will write the program. We have these pre-trained models. Now, we just need to kind of write in plain English what you want and then find some way of evaluating it. It does feel like a new way of programming. James, I mean, I don't know if you want to talk a bit more about your work with Cursor. You're very deep into using...
James Phoenix: I think the one thing...
Mike Taylor: I don't know how much code you're actually writing anymore or if you're going to see yourself more as an engineering manager of AI agents.
James Phoenix: For sure. Cursor is definitely like writing probably 60% to 70% of my code. I will say that when I'm learning something new, I actually manually type it out by hand. I think that's a really good way to still learn. Otherwise, you're sort of blindly copying and pasting. But, yeah, definitely what I'm starting to find is even in engineering workflows, you can specifically have prompts to just take that piece of work. So, the good one is if you're in Composer for the whole day in Cursor, you can have a separate prompt or a separate notepad that will generate a progress report in a specifically standardized way. Like, what were the key learnings today, the key blockers, the next steps? You just say, "Generate a progress report." And because Cursor has now got an agent mode, it can also run Linux commands to get the right date and time and format and that kind of stuff. Or you can do things like have a prompt to generate a git commit message based on git conventional commits, such as prefixing with fix, feat, or chore.
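As a small illustration of that kind of subroutine, here is a hedged sketch of a prompt that turns the staged git diff into a Conventional Commits message; the prompt wording, model, and helper function are illustrative assumptions, not what Cursor does internally.

```python
# Sketch: generate a Conventional Commits message (fix/feat/chore/...) from
# the staged diff. Prompt wording and model choice are illustrative assumptions.
import subprocess
from openai import OpenAI

client = OpenAI()

def commit_message() -> str:
    # Grab whatever is currently staged for commit.
    diff = subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    ).stdout
    prompt = (
        "Write a single-line git commit message for the diff below, following "
        "the Conventional Commits format (e.g. 'fix: ...', 'feat: ...', 'chore: ...'). "
        "Return only the message.\n\n" + diff
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(commit_message())
```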
You can kind of automate these smaller routines or subroutines that you're doing as a programmer, as well as the code that you're also generating. I definitely find that, yeah, LLMs are speeding us all up. I think going back to what Mike was saying about, you know, you've got this non-technical versus technical audience, I think with the technical audience, it's very much more focused around scientific rigor and experimentation when it comes to AI engineering and prompt engineering. Then I think for the non-technical audience, what I always recommend is just enriching the context as much as possible because they're not going to sit down and run O1 Pro 100 times or O1 100 times because it's just the latency is too large. But if they get the right context in there, then it's probably going to be good enough if they're just using one response or two responses. So, I think, yeah, putting an emphasis on the richness of context is really great for a non-technical audience, for sure.
Five Principles of Prompting
Phil Winder: Interesting. Let's move on to the book. One of the big headline topics that you kind of go back to throughout the book is these five principles of prompting. I guess let's have a little bit of a background. So it's a nice catchy title. I can see why you've done it. But I guess why did you come up with them in the first place? Like, why not six or four?
Mike Taylor: Good question. It came from self-preservation essentially. When GPT-4 came out, we had a bit of a panic, I think, because a lot of the prompt engineering techniques that we used to use with GPT-3 just weren't really necessary anymore. You didn't have to threaten it to return JSON. They started to follow instructions much better. All frontier models are pretty good at following instructions. You don't need to kind of hack the prompt as much as you used to. But, you know, looking ahead, I thought, "Okay, we're just about to write a book. You know, we're creating a course as well. We don't want to have to update those two assets too often. And so, let's think really deeply about what would form the core of these principles. What things, you know, were we doing with GPT-3 that we're still doing today with GPT-4? And looking ahead, like, what do we think will still be useful with GPT-5 and so on?"
That was really just an attempt to make sure we didn't have to do version 2 of the book three weeks after it was released in Britain. We boiled it down and tried to condense the principles as much as possible. Then actually, we were pretty happy when OpenAI came out with their principles. They have a prompt engineering guide now. It came out after we'd already written most of the book. I quickly checked it, and I was like, "That principle maps to this principle." It felt good. But I feel like pretty much anyone who works with these tools will arrive at a similar set of principles. And there's nothing magic to it. Anyone who's very experienced will get these straight away and will recognize them. But starting with those five principles in the first chapter just really helps people ramp up if they're not as familiar, if they don't have as much experience in prompt engineering yet.
Phil Winder: Yes. James, what do you think about the five? Like, have you got any ones that you particularly like?
James Phoenix: I think division of labor is something that I use quite a lot because when you're combining and composing multiple prompt chains, you are essentially breaking down a larger problem into a series of sub-problems that basically once solved will solve the larger problem. And I find that works incredibly well. I think, you know, that it also works quite well with imperative programming languages. You don't really want to do everything up until this point because you don't have certain data. Maybe you have to go to the database at that point. And just from the nature of things, you'll find that when you're doing prompt chaining and creating these chains, they naturally don't all sit together anyway within the code and within the data pipeline flow for backend applications. So, yeah, definitely division of labor, I think that's a really good one. It's probably my favorite, for sure.
Phil Winder: I think it might help actually if we just give a little example there, James. So, if division of labor is your favorite, could you just walk through how the average person might do that for a simple task like writing an email or something? How would you approach that in terms of dividing?
James Phoenix: So, if we're thinking about writing an email, the first thing you might want to do is gather some relevant context about that person or that company. The other thing you might want to do is gather context about previous emails that have been sent because there might be several email threads that are happening, which actually would be really beneficial to know about. So, those are the kinds of things you would do as upstream chains. Then the second thing you might do is then generate the email. Then the third thing is you might have a human approval loop or a human approval stage. And then the fourth thing is you might get another LLM to critique the email and look for any spelling mistakes or discrepancies or things that are missing. And then you would then have a final step to say, "Send the email."
So, rather than just say, "Let's create an email from this email," there's a step one, which is gathering the right relevant context. So, that might be gathering some additional information, or maybe you've got a database full of the people that you're talking with, or it could just be also gathering relevant emails from your Gmail or the API. The second stage is generate the email, and then you've got maybe some human approval step or a human in the loop step, and then you've got a critique step, and then send the email. So, rather than trying to do everything in one step, you're kind of breaking that down into about four steps. And the reason why is that then you can have a deterministic way of doing that while still using LLMs within that workflow.
Mike Taylor: Yeah. If you find you're having trouble with one of the steps, you can then continue to break that down further. So, one of the things I found quite good for creative tasks, like the actual generation of the email, is split that into, "Okay, write the hook first, and then based on this hook, write the email." And just by splitting that into two smaller tasks, I find that you end up getting much more creative responses. You can test whether this model is just not good at writing a hook, like a way to draw people in, or whether this model is just not good at taking a hook and writing it out. You might find that you use two different models for those two things. Like, you might have to use a much better model for the creative task than you need for the actual email writing task itself.
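To make the division of labor concrete, here is a minimal sketch of that email pipeline as a chain of small prompts, including Mike's hook-first split; the helper names, prompts, and model choices are illustrative assumptions.

```python
# Sketch of the email pipeline as a chain of small prompts: gather context,
# write a hook, write the email from the hook, critique, then hand off to a
# human. Prompts, models, and helper names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    """One prompt in, one completion out."""
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def draft_email(recipient_notes: str, thread_history: str) -> dict:
    # Step 1: context gathered upstream (database, Gmail API, etc.) is passed in.
    context = f"Notes on recipient:\n{recipient_notes}\n\nPrevious thread:\n{thread_history}"
    # Step 2a: a stronger model writes just the hook (the opening line).
    hook = ask(f"{context}\n\nWrite one compelling opening line for a follow-up email.",
               model="gpt-4o")
    # Step 2b: a cheaper model expands the hook into the full email.
    email = ask(f"{context}\n\nOpening line: {hook}\n\nWrite the rest of the email.")
    # Step 3: a critique pass looks for problems before a human approves it.
    critique = ask("Review this email for spelling mistakes, missing details, or claims "
                   f"not supported by the context.\n\n{context}\n\nEmail:\n{email}")
    return {"email": email, "critique": critique}  # human approval and send happen downstream
```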
James Phoenix: The other thing to note here is that how cognitively capable the models are determines when you need to split tasks into multiple sub-tasks. So, when you had GPT-3, it was quite advantageous to have more division of labor across smaller, easier tasks. Now that you've got o1 and o1 Pro, you can essentially just one-shot entire scripts or one-shot entire processes. Obviously, that retrieval step might be done separately. But if we're viewing it very much like editing a coding file, you could maybe do all of that in one go. So, your division of labor now moves up a level: "Okay, now I can solve a larger problem and do division of labor with o1 to then maybe analyze, I don't know, 10 or 15 files and refactor all those files in different ways." So, you can still kind of use that principle even with a larger model. You're just basically using it to make o1 solve things that it can't inherently solve in a one-shot process, if that makes sense, because every model has some sort of top-end capability. You basically have to discover when o1 consistently fails at doing some type of task. And that's when you would start breaking that down into a more deterministic workflow and using division of labor to then make o1 or o1 Pro do something that it can't do without division of labor, if that makes sense.
Phil Winder: That makes sense. Then two questions sort of pop into my mind. The first is, it was really interesting to note that despite the focus on prompt engineering and working with language models, in general, at least half of the steps that you described there, James, were getting data, ingesting data from other sources in order to use it in the context of generation. And it's almost like you're still writing software effectively. You're still plugging things together, you know, to generate the final thing. So, I found that interesting. But the second point is, you mentioned there that you're constantly trying to find the capabilities of the model that you're using. Have you had any experience of maintenance of a solution over the lifespan of multiple different models? Like, what happens when OpenAI deprecates, you know, GPT-4 or something? Does that mean all of your stuff is suddenly going to break because suddenly the capabilities of the model have changed?
James Phoenix: I think, in general, things get better. There are some nuances to that where, you know, like, if you're comparing, like, the new reasoning models versus the chat models, then they're completely different paradigms. But in general, classifications probably get more accurate. The output of the text feels kind of more human. That being said, you know, there are scenarios where you can get regressions because the newer model doesn't work with the old prompt in the same way. And we actually did experience that for one of our products called Vexpower when we had an automation script that would basically generate a large part of the course from listening to a video transcript and ingesting the course materials and sort of helping with the FAQ and the exercise generation. We found that when we moved from, I think, GPT-3.5 to GPT-4 and GPT-4 Turbo, the original prompt just basically broke, and I had to go in and re-architect it. So, yes, you can get changes.
I think that is also happening less and less. And the reason for that is a lot of developers are now using something called structured output parsing, which is a natively supported output format from OpenAI, and Anthropic offers something similar, which basically allows you to define Pydantic models in Python, or Zod models, for data validation. And the models that you can use with it, so, for example, GPT-4o mini, have been fine-tuned to do JSON decoding and generation of JSON in a very deterministic way, so that when the model does produce JSON, it doesn't run into validation errors when you're parsing that string into a JSON type. We are finding that the more you structure outputs, potentially the less is going to change in terms of the data that's coming out. So, the structured outputs API is definitely worth looking at. It's something that I use actively on most of my projects. And that's been a massive improvement versus what we had to do two years ago, where you had to specifically put in the prompt, "This is the kind of JSON structure that I want." And sometimes it wouldn't conform to that.
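For reference, a minimal sketch of the structured-output pattern James describes, using the OpenAI Python SDK's parse helper with a Pydantic schema; the schema itself is an illustrative assumption.

```python
# Sketch: structured outputs with a Pydantic schema, so the model's JSON is
# validated before anything downstream consumes it. The schema is illustrative.
from pydantic import BaseModel
from openai import OpenAI

class TicketClassification(BaseModel):
    category: str   # e.g. "billing", "bug", "feature_request"
    urgency: int    # 1 (low) to 5 (high)
    summary: str

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Classify the support ticket."},
        {"role": "user", "content": "I was charged twice and need a refund today."},
    ],
    response_format=TicketClassification,
)
ticket = completion.choices[0].message.parsed  # a validated TicketClassification
print(ticket.category, ticket.urgency)
```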
Mike Taylor: With the new capabilities as well, you find that even if your old code doesn't break, it's just a really inefficient way to do things because the older models tend to cost more and the new models tend to be more...they have more capabilities. So, I had a recent example where I had a client. They were doing video transcription, and they were using a service for the transcription of the audio. And then they're taking that transcript and then doing some kind of entity extraction from that transcript. We found with Google Gemini, which takes audio as a native input, it doesn't need a transcription model on top. It doesn't need Whisper, which is what they were using before. Now, you can just dump the whole thing into Gemini, and they don't need to chunk it up into different sections first. They didn't need to transcribe it first. Gemini just does it all in one shot. Because they have, I think it's like a million or two million token input, you can put the whole call transcript in there...oh, actually the whole call audio in there and then just get back out entities. So, it just massively simplified the whole pipeline once we got that working. It just took a lot of effort to optimize the prompt, but now it's at the same sort of 75% accuracy, something like that, that the previous system was, but at a much lower cost. I think it's about 60% lower cost.
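A hedged sketch of the simplified pipeline Mike describes, passing the call audio natively to Gemini via the google-generativeai SDK instead of transcribing it first; the model name, file, and prompt wording are assumptions for illustration.

```python
# Sketch: send call audio straight to Gemini and ask for entities, skipping the
# separate Whisper transcription step. Model name and prompt are assumptions.
import google.generativeai as genai

genai.configure(api_key="...")  # your API key

audio = genai.upload_file("sales_call.mp3")          # hypothetical audio file
model = genai.GenerativeModel("gemini-1.5-pro")       # long-context, audio-native model
response = model.generate_content([
    audio,
    "Extract every company, person, and product mentioned in this call "
    "and return them as a JSON object with those three keys.",
])
print(response.text)
```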
Phil Winder: Much lower maintenance burden as well.
Recommended talk: Large Language Models: Friend, Foe, or Otherwise • Alex Castrounis • GOTO 2023
Balancing Specificity and Creativity in AI Models
Phil Winder: I think we sort of almost highlighted a little bit there about talking about task decomposition and specifying JSON output structure formats and things, about specificity, about being specific about what you want the model to do. But there was a really interesting quote from your book that I found interesting. "If your prompt is overly specific, there might not be enough samples in the training data to generate a response that's consistent with all your criteria." So, that's saying that you can go so deep sometimes that you actually end up in a space within the model that doesn't have any previous examples. Therefore what you get out might not be appropriate. I guess that might be more of a problem for image models, possibly, than it is for text because I would guess there's maybe more gaps in the training data for image models than there are for texts.
James Phoenix: There's a really good example I can give of this. I was doing basically an interview for a job and sort of a placement there. And they had created a custom data structure to basically map their kind of backend system to the React front end. And you can't rely on plain prompt engineering if you have a custom data structure, because these models don't have any data in the pre-training data about that type of custom data structure. So, the only thing you can really do at that point is few-shot learning with their own custom data structures, or building a fine-tuned model, which takes a lot more time.
When you're working with coding, if you're creating kind of a custom way of doing some type of CRUD operation and you're not using standard backend, so you're maybe using a JSON file as your backend, and you have your own custom data structure, and you have your own kind of ways of manipulating that JSON in an API layer that gets exposed to the front end, then all of that is very difficult to actually work with in terms of figuring out how does that work? How can we automate parts of that? How can we know which API endpoints to call because the data structure is specifically custom? Now, if you start using Postgres and you give it the Postgres tables or, you know, maybe you're using MongoDB, there's enough data in the pre-training to know what types of operations you can do on any type of database, in general. That's where, you know, there is a trade-off between if you do start doing things in a different way that is off the standard path, then you're going to end up where there's probably less likely to be data in pre-training for that. Another example of this is if a package changes tomorrow, there's probably no pre-training data in that foundational model about that package at this point in time. Therefore, if you try and generate code about that newer package, you're going to get older package code that's generated. So, you know, you've got to be very careful about specifically those types of problems.
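A minimal sketch of the few-shot fallback James describes for a data structure too custom to appear in pre-training; the "block operation" format and the examples here are entirely hypothetical.

```python
# Sketch: few-shot prompting for a custom data structure the model has never
# seen in pre-training. The "block operation" format and examples are made up.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_EXAMPLES = """\
Request: add a paragraph block saying "Welcome"
Output: {"op": "insert", "node": {"kind": "para", "children": [{"kind": "text", "value": "Welcome"}]}}

Request: delete the block with id 42
Output: {"op": "delete", "target_id": 42}
"""

def to_custom_structure(request: str) -> str:
    """Map a plain-English request onto the in-house block operation format."""
    prompt = (
        "Convert the request into our internal block operation format. "
        "Follow the examples exactly.\n\n"
        f"{FEW_SHOT_EXAMPLES}\nRequest: {request}\nOutput:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(to_custom_structure('add a paragraph block saying "Thanks for reading"'))
```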
Mike Taylor: I've seen the same thing. Specifically on image models, which is the part of the book that I handled most of, you used to get to these really sparse areas in latent space where there just...you know, there might be a lot of pictures of Ronald McDonald, and there might be a lot of pictures of astronauts, but there's not that many pictures of Ronald McDonald as an astronaut, right? And then the more variables you layer on top of that, the more likely the model is to get confused and not know how much of Ronald McDonald to put in there versus how much of an astronaut to put in there because there's some kind of inherent conflict. And that could lead to really creative results in some cases, but also quite often it leads to poor results when you step too far out of the training material.
The solution there is, you know, you can train new weights for the model. You can use DreamBooth, which is a fine-tuning technique, or LoRA, and give it examples. But then when you're training the model, fine-tuning the model, then you're in some ways kind of constraining the creativity of the model as well. So, this is always a trade-off, because, and this is true of text generation as well, the more examples you give it and the more you steer it in one direction, the less freedom it has to come up with something that you didn't think of. Usually, to avoid getting stuck in this trap, I'll try prompt engineering first. I'll try and see what it comes up with just natively. Then if it's too out of bounds, then I'll start to layer on a few few-shot examples or, you know, step towards fine-tuning. But I don't jump straight to few-shot and fine-tuning for that reason. I want to give it an opportunity to surprise me and give me something that I wouldn't have thought of, so that then I can adapt my vision accordingly.
Phil Winder: How do you actually go about finding that middle ground, that perfect balance between specificity and creativity? Is it an iterative process based upon the problem that you're trying to solve, or are there tricks of the trade?
Mike Taylor: For image generation, in particular, it's much more iterative. I think image models are starting to get smarter, starting to get better at adherence to the prompt and character consistency as well, especially some of the video models where character consistency is really important, because you can't have your character change their face halfway through the video. It doesn't work that way. So, that is something, I think, that's a very strong active development space. And the models will just get better natively at it. But until it works better out of the box, it is very iterative right now. And it just takes a lot of trial and error.
Unlocking RAG: The Power of Iterative Steps, Query Planning, and Routing in AI Systems
Phil Winder: The next sort of bunch of questions I had were all sort of starting to dig into the use of RAG, really. Actually, I think that was really interesting. So, the quote that I pulled out was, "The real unlock is realizing that every part of the system can be broken down to a series of iterative steps." And we talked about that, and we touched upon it earlier with your comment, James, where you were talking about task decomposition and how the newer models are actually able to do that themselves. But what struck me, I think, with one of your examples, you mentioned in there that using the words, literally the phrase "step by step," forces it into this mode of planning, effectively. It made me wonder what other words are hidden within the models that force certain behaviors? And why are they there?
James Phoenix: I think maybe it's just the way that the model learned from the pre-training data, that when it has that phrase, the next bit of information that it sees or produces is more based on a reasoning chain. Thinking step by step, or let's explore this, or let's think through all the steps, or let's use a chain of thought. All of those were early ways to get some type of reasoning out of the model and to use that reasoning to then hopefully produce a better answer. It's interesting because a lot of reasoning models do this now. So, o1 Pro will do reasoning for you. I do think chain of thought is becoming a less-used technique. It's still used in chat models. But specifically for reasoning models, they're already doing that. And they're doing a turn-based approach where they go through a series of steps, and they'll use the reasoning tokens within that single step, which then goes into the input of the next step. I would actually say that we're using those types of things less. I think the principles that are kind of standard are things like giving direction, for example, you know, getting that relevant context in is still very important. Specifying format, again, is something we're using quite a lot with structured outputs. And few-shot learning is still really important. But I would say that chain-of-thought prompting is probably going away, at least with reasoning models, for sure.
Mike Taylor: The new one I've been seeing a lot of and soon to be baked into most models is more like self-evaluation or backtracking. I don't know if you've seen this when using Claude, where it will start writing out the answer, and it'll say, "Oh, no, sorry, I made a mistake there," and then it will change its direction. So, it's kind of evaluating itself. So, rather than a chain of thought, which is like planning ahead, then you have this kind of other concept of like after it started to generate, does it backtrack or change direction? But I would say, generally, the reason why they're there in the training set is just that this is what people do, right? When I was running my marketing agency, I would tell people to plan what they're going to do for a presentation before they create the presentation. They would tell them to write an outline for a blog post before they wrote the blog post. And that's because thinking step by step through a problem really helps people access high-level thinking and make sure they don't forget anything, make sure that they are doing things in the right way to achieve a goal. If step by step reasoning helps humans, then LLMs are kind of like a human brain simulator, if you think about it. They're trying to predict what a human would say in that situation. It makes sense that the techniques that work for people will just work for LLMs as well. That's what they've seen in the training data. And when people think step by step, they tend to end up producing better results. So, it's kind of a way to steer them towards those better results.
Phil Winder: It's just specifically the use of that word that I find a bit weird. It's not a word that I'd used in the past, or that you'd have read much in the past. And then all of a sudden, it's a word that everybody's using because it steers the model so much.
Mike Taylor: Because it's the word that they specifically used in one of the scientific papers.
Phil Winder: Exactly. Then everyone bakes that into their libraries and blog posts and prompts. And then you might learn from that second order or third order by reading our book or whatever. But, yeah, you don't have to specifically use that phrase. Actually, it's a big misconception. I think a lot of people think that... They think, "Oh, you have to use the phrase 'Let's think step by step.'" But actually, any combination of words for getting it to think through the problem first is going to work.
James Phoenix: I think there's also some emotional prompting that still works with o1 Pro, for example. So, if you tell o1 Pro, "I'm not really bothered about this answer," it will give you something pretty small. But if you say, "Oh, I have to get this right because my livelihood depends on this," o1 Pro is going to think for much longer, and it will give you a much longer output as well. So, even emotional prompting still works. We don't have to say "think step by step" anymore. But emotional prompting definitely still works with o1 Pro.
Phil Winder: Interesting. I don't think that was one of your recipes in the book there, emotional prompting. Your next book should be a self-help book. I think it'll be useful there.
Mike Taylor: Therapy with LLMs.
Phil Winder: Absolutely. There were quite a few other strategies that were mentioned. I think the one that jumped out to me is the query planning prompting, attempting to address multiple user intents. Could you talk a little bit more about query planning?
James Phoenix: Query planning is basically where you have a user query, and then rather than sending that straight to an LLM, you can decompose that down into a series of intents, and then you can work out the order of those intents and then execute those. So, that can be useful to specifically figure out what to do at what point. There's also something else that you should be aware of called routing, which is another way of doing query planning, where you can basically take a user query, and you have a bunch of destinations. Let's say you've got three destinations, A, B, and C. You can take a user query, and you can specifically find what type of route should that user query be forwarded to. Let's say you've got three functions, one's generate a summary, one's write an email, and one's look for some order information, if the user query is like, "I really want to summarize this information," the LLM router will basically take that query and decide, "Okay, it needs to go to this generate summary route," and it will just forward over the user query. So, that's another approach.
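For illustration, a minimal sketch of that routing step: an LLM classifies the query into one of three destinations and the code dispatches it; the route names and handler stubs are assumptions.

```python
# Sketch: an LLM router that classifies a user query into one of three
# destinations and forwards it. Route names and handlers are illustrative stubs.
from openai import OpenAI

client = OpenAI()

def generate_summary(query: str) -> str: return f"[summary for: {query}]"
def write_email(query: str) -> str: return f"[email draft for: {query}]"
def lookup_order(query: str) -> str: return f"[order info for: {query}]"

ROUTES = {
    "generate_summary": generate_summary,
    "write_email": write_email,
    "lookup_order": lookup_order,
}

def route(query: str) -> str:
    # Ask the model to pick exactly one destination for this query.
    decision = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": ("Pick the best route for this query. Reply with exactly one of: "
                        "generate_summary, write_email, lookup_order.\n\nQuery: " + query),
        }],
        temperature=0,
    ).choices[0].message.content.strip()
    handler = ROUTES.get(decision, generate_summary)  # fall back if the reply is off-route
    return handler(query)

print(route("I really want to summarize this information"))
```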
Query planning is basically where you're trying to decompose the query, figure out all the individually related intents, the nested intents and dependencies there, and execute those. And then you've got routing, which I've just described. There's also in the OpenAI now, natively, they do support something called parallel tool execution, or parallel function calling, which basically means you can have a query that has mixed intents. So, if we say we have a function that can send an email, if I have a user query that says, for example, "I want to send an email to John and Jane," and you've had like, you know, "And I want to say this to John, and I want to say this to Jane," OpenAI's function calling or their tool calling natively now supports the ability to recognize multiple intents and then execute those in parallel.
Query planning is very useful. Routing is useful. And we now also have native support for being able to handle mixed intent using, you know, standard packages with parallel function calling. So, yeah, there's a range of things. I think routing is a very interesting one. You can obviously use that for forwarding requests. And then we've got parallel function calling, which is great for handling those kinds of multiple intents within the same user query. I would say, though, that if you had maybe 5 or 10 intents in a query, that's when, you know, manually breaking that down and figuring out those things is probably going to be more useful than sending it straight to an agent and relying on the agent to specifically find, you know, what tool calls to generate and what tool calls to execute on your backend because, yeah, basically, there are more things that could go wrong and you have less control.
That's why you've got a lot of these frameworks like LangGraph, which are basically trying to, you know, not always generate every single tool call in the kind of recursive while loop. They are kind of breaking these things out into DAGs, so directed acyclic graphs, where you have a series of functions the flow hops along before it goes back into the agentic loop. And you can do that natively in Python. And the way you should do that is, let's say you've got a tool call, rather than just returning it straight back to the agent, you could run three other Python functions inside that tool's Python function. You've got step one, step two, step three. So, you can kind of already just write your own kind of LangGraph approaches to this problem, where you don't always want the agent to be responsible for generating the flow of information. Therefore, you know, the agent can call tool A, but tool A also does three things inside of that tool. So, there's a variety of different things that you should be looking at. Routing is important, obviously, tool calling is important, but, yeah, you don't always have to rely on the agent to decide exactly what tool calls need to be generated and which ones need to be called.
Phil Winder: Okay. Makes sense. Thanks.
Understanding Agentic Systems: Beyond Prompt-Based AI
Phil Winder: James, you mentioned the word agent, which I think is a heavily overloaded term at the moment. It's like the word AI. I kind of struggle to use it because it's so broad. But one thing that struck me when you started talking about agents in your book was the parallels to reinforcement learning. So, as you know, I wrote a book on reinforcement learning, and that's all about the idea of learning through trial and error and having actions in an environment and feeding those signals back to the agent and using a reward signal to be able to decide whether it did well. The parallel here is that it's almost sounding like the agents need a similar kind of structure. It's almost like thinking of the prompt as the reward definition. That's what decides whether it's rewarding or not. The user context is kind of defining the actions, and then there's the environment and the various tools that can be used by the agent. So, I guess, let me step back a little bit. How do agents differ from standard prompt-based approaches?
James Phoenix: With a prompt-based approach, you basically have a Python function or some type of code that will, you know, call an LLM, it'll produce an output, and you're sort of putting that into your existing software architecture, your existing software where you've got an AWS Lambda, this little bit. You kind of do that as a fuzzy, non-deterministic step. And then you're kind of embedding that into your existing workflows. How that differs from an agent is that an agent is basically a different architecture, which has a higher amount of control, a higher amount of autonomy. And rather than you as the programmer imperatively determining what's going to happen when this code is executed at runtime, you're basically relying on the agent's prompt, its tools and the tool definitions that it has, and the context that it has in the prompt to decide what to do given a series of messages that have already happened before.
Then once those messages are basically processed, they will generate some type of tool calls, or maybe not. They'll maybe just reply. And then after that, you have to have some type of stopping criteria. So, generally, the easy way of writing it is something like: while there are still tool calls to do, keep going with the agent; when we've hit a stop or finish reason, or there are no tool calls left, exit the agent's while loop. That being said, you can also create your own objectives. So, if the agent does stop prematurely, you can also add an additional message saying, "No, we haven't stopped yet. You haven't hit this programmatically defined goal." So, imagine you're trying to get 100 leads, and you've got this agent that's using a Google search tool and maybe a web page reading tool. You can also say, you know, if it's finished early because there are no tool calls left, you can add an additional message in there and tell it, "No, you actually need to keep going."
The thing with agents is they have these tools. The tools give them the ability to execute. You know, the more tools that it has, the more useful the agent becomes because it can bundle those tools and execute those tools in a variety of different ways. So, it is essentially building a computation graph at runtime as it executes these tools. Now, the problem with them is, obviously, you're giving a lot more autonomy to the agent. And if one of those steps is wrong, you're going to get compound error across all the rest of the steps. So, having reduced error is really, really important. And if it's got too many tools, that can be a problem, or if the user query isn't specific enough or can't be solved by the given tool set, that's also a problem. You've also got problems with if one of the tools fails, what does the agent do in that scenario? So, maybe it couldn't connect to the database, it couldn't get the output. What does the agent do?
There are lots of different problems that happen, both from the DevOps side, from IO issues, from, you know, user issues in terms of the user query. And a lot of people are trying to figure out how to innovate in that space. So, adding human-in-the-loop steps is one approach. People are building robust execution workflows to make sure that steps don't fail so that the agent always succeeds. If there's an IO error, it will retry for that step. At its heart, an agent has tools. It's basically in some type of while True loop. It's doing all of those tool calls until it thinks that it's finished with the result.
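A minimal sketch of the agent loop James describes, using OpenAI tool calling: keep going while the model asks for tools, execute them, feed the results back, and stop when it answers directly or a step budget runs out; the single web_search tool is a stand-in assumption.

```python
# Sketch of a bare agent loop: call the model with tools, execute any tool
# calls it requests, append the results, and stop when it answers directly.
# The web_search tool is an illustrative stand-in.
import json
from openai import OpenAI

client = OpenAI()

def web_search(query: str) -> str:
    return f"[stub results for: {query}]"  # swap in a real search API

TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web. Use this whenever you need fresh facts.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def run_agent(goal: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):  # hard cap so a confused agent can't loop forever
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS
        )
        message = response.choices[0].message
        if not message.tool_calls:          # stopping criterion: no more tool calls
            return message.content
        messages.append(message)            # keep the assistant turn in the history
        for call in message.tool_calls:     # execute each requested tool
            args = json.loads(call.function.arguments)
            result = web_search(**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return "Stopped: step budget exhausted."

print(run_agent("Find three companies building LLM rank trackers."))
```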
Phil Winder: Mike, just following on from that then, how much of a role does prompting still have in agentic systems then? Does it make it harder in some sense because you've got disparate agents working independently?
Mike Taylor: It's a good question. And I would say the system prompt for the agent still needs a lot of work because you want to protect against those edge cases. So, in some respects, it's more important because, you know, when the agent runs, it's going to do a lot of steps itself. And therefore, that system prompt needs to be much better than it would be if it was just doing one step because, like James said, those errors compound. If it chooses the wrong steps, then that's a bad thing. So, all of the traditional prompt engineering techniques still apply. It's just that they have higher leverage because now the agent is going out on its own with your instructions. You better make sure your instructions are good.
In some respects, it also takes away from prompt engineering because the big difference actually between reinforcement learning and agents, just to make that link, is that with reinforcement learning, there's some objective truth, typically, right? You can calculate it quickly. Whereas with agents, they have to decide whether they've done a good job or not based on your goal. It's a much fuzzier reward mechanism and self-determined in some ways. So, I think that's why we're not seeing that many true agents in production because the current crop of models just aren't as reliable in terms of deciding whether they've done a good job and also in deciding what to do based on their observations.
As we get better models, then those loops will get tighter, and they'll make fewer mistakes. But it's a bit of a trap in that if the AI can't do that task very well, it's also probably not going to be very good at judging whether that task has been done well, to some degree. So, I think that's the major difference. That's why when people talk about agents from a marketing perspective, I know you hear Salesforce or Microsoft talk about agents, they're not really talking about agents. They're talking about quite a deterministic chain of prompts still, and maybe some retrieval with RAG. It's very rare that you have a true agent in production these days, although that should change quite a lot in the next year.
James Phoenix: Just to add on to that, like, the other thing for the prompt engineering side is, obviously, you can write the tool definitions more explicitly. If the tool definitions are very detailed and tell exactly when to use this tool, when to not use this tool, all that information is really valuable to the agent because it will pick different tools based on those tool descriptions and the arguments of those tool descriptions, etc., which is generally in a JSON schema specification. So, that also has an impact. You can also do hybrids, by the way. So, you can have like the agent uses the search function and it calls a search function and that search function could use something like Q-learning where it has a Google search, a tabu search, a variety of different searches, and then it could use Q-learning inside of that tool call. So, it's also possible to have an agent that has a mixture of an LLM-based agent with a tool call that will use different types of searches based on an updated Q-learning table.
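As a small illustration of the more explicit tool definitions James mentions, here is a sketch of a JSON-schema tool spec whose description spells out when to use, and not use, the tool; the wording and fields are illustrative assumptions.

```python
# Sketch: a tool definition whose description tells the agent exactly when it
# should and should not be used. Wording and fields are illustrative assumptions.
ORDER_LOOKUP_TOOL = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": (
            "Look up a customer's order by order ID. Use this ONLY when the user "
            "provides or asks about a specific order number. Do NOT use it for "
            "general shipping policy questions; answer those directly."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID, e.g. 'ORD-10234'. Never guess this value.",
                },
            },
            "required": ["order_id"],
        },
    },
}
```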
Phil Winder: I think, obviously, that's where most of the innovation is going on at the moment. Be interesting to see where that goes. But I think we've got to the end of our time slot. So, I think the only thing that remains is to thank you both for joining me today. It's been really interesting. To everybody else, the book again, "Prompt Engineering for Generative AI," I'll plug it for you. There you go. Really great book. Read it almost like in a couple of nights, really. Just kept going and kept going. It's a really fun book, really interesting, and I really enjoyed it. So, I definitely recommend it. Any final closing words? Anything you want to plug?
Mike Taylor: I appreciate you shouting out the book and also reading it so quickly as well. Yeah, I'm glad it's digestible. One thing I would like to end with is just to remind people not to get too anxious about how fast everything is moving. Because if you're following everything on Twitter or Reddit and you're seeing new things come out every single week, it can get anxiety-inducing thinking, "How am I going to keep up with all this?" But what I found is that after a couple of years of being in the mix, very few things actually change. Even though the models are getting better, they're getting better at a predictable rate. Costs are coming down at a predictable rate. So, it's just a case of zooming out a little bit, thinking ahead and going, "Okay, well, if I start working on this project now, where are things going to be in six months?" I would say that even though things are moving fast, if you're in a specific niche or a specific domain, you cannot drive yourself crazy trying to keep up with everything. Ultimately, if something really big happens, other people will tell you about it. That would be my advice for coping, is just kind of pick a niche, learn everything about that niche, and then don't worry too much about all the other craziness that's happening.
James Phoenix: Nice. I think my plug will be if you haven't checked out Cursor yet, definitely give it a go. I think the other thing as well is this idea of bottom-up coding. You can give a very large goal to Claude and then break that down into lots of different smaller tasks, which you work through in chat or Composer. But then there's also this top-down approach to coding where you give o1 Pro 10 files and you say, "Generate me three or five new files," which is a completely different paradigm. And when do you use either of these? Claude Sonnet has very low latency, but it has more regressions and more hallucinations, while o1 Pro has incredibly high latency, but very high accuracy. There's something to be said for knowing when you should pick Claude and when you should pick o1 Pro. There are also scenarios where you'll use Claude and it will generate code, and you'll kind of get stuck in a loop and it can't really figure it out. You jump to o1 Pro, go make a cup of coffee or a tea, and then you come back and it's figured it out on the first go. So, have a think about when you should be using these kinds of reasoning models for your development work versus when you should be using a lighter chat model to be doing quicker edits or doing that bottom-up approach to coding rather than the top-down. Have a think about that and obviously let me know. Yeah.
Phil Winder: Okay, thank you both. See you later.
Mike Taylor: All right. Thanks, Phil.
James Phoenix: All right. Thanks. Bye.
About the speakers

Mike Taylor ( author )
Founder & Co-Author of "Prompt Engineering for Generative AI"

James Phoenix ( author )
Co-Author of "Prompt Engineering for Generative AI"

Phil Winder ( interviewer )
CEO of Winder.AI, author of "Reinforcement Learning"