
GOTO State of the Art Series: Building AI Applications That Actually Work - From Prompt Engineering to Synthetic Market Research


About the experts

Mike Taylor (expert)

Founder & Co-Author of "Prompt Engineering for Generative AI"


Intro

From Proof of Concept to Production Reality

Hi, I'm Mike Taylor, and this is part of the GOTO State of the Art YouTube series. We're going to go through what it takes to build an AI application using Prompt Engineering today. We're specifically going to talk about synthetic market research, because that's the topic that I'm most interested in right now. I have built startups specifically doing that.

That's what's freshest in my mind at the minute. But we're going to cover all of the things that you would normally run into when you are building an AI application. We're going to go through some of the prompt engineering we had to do, and we're going to talk about some of the challenges an organization would face when trying to implement or scale an AI application like this.

I'm also going to take you behind the scenes and show which practical strategies proved most effective in overcoming those obstacles. And I'll talk a little bit about how this has evolved over the past couple of years, because I've been doing prompt engineering since 2020 and co-wrote a book on the topic last year.

It's really changed a lot, as you can imagine. There are some common misconceptions and outdated practices that still persist. We can talk through those, too.

Let me just show you what I built, and we'll kind of work backwards from there.

This is Rally, or Ask Rally, an application I built. The idea is ChatGPT, but chatting to many GPTs. It's a traditional chat interface, except for what happens when you submit your prompt, say, "How do you feel about tariffs?" or whatever you're trying to figure out from your audience.

That prompt is then going to every one of the AI personas in this audience so you can click and see this specific person has these demographics, this background. We have a whole audience here of 100 AI personas. They've all been AI generated themselves and they'll give AI generated responses. You can see that this is a streaming application.

It's generating the responses and bringing them back. Then we have a combined response at the end. You can click in and see what the specific persona said. Then you can click in and see who said it and maybe have a sense of why they might like or not like the responses that you're giving. You can also do voting, and you can do image uploads, you can change model providers as well.

That's if you're a paying customer. We also have past sessions, as well as a way to create audiences. I'll just quickly show you that and then we'll talk about how we built it: you put in a name and a description, and when you hit generate, it will generate personas that match that description.

That is the overall application. Now let's walk backwards through what we had to do to build this tool. That's going to really help you understand what sort of challenges you'll run into when you're building AI applications generally.

The first thing we had to do was create a proof of concept, because I don't think it makes sense in this day and age to build a whole AI application unless you know it's going to work, because quite often these things don't live up to the hype, and they're not as useful as you expect.

You might spend a few months building it and not actually create something useful for your customers. That's actually pretty common. One of the first things I did was I created a version of this in Jupyter notebooks. The reason I use Jupyter notebooks is that I find it personally quite easy to just write a few scripts and then see how they work.

The very first Jupyter notebook would literally just take the prompt and then send it individually to each AI persona. I found that it got pretty promising results. One thing you'll see here is that the responses are all quite different, and they have a different personality. Some people are mentioning different things, like this person spoke about local businesses and jobs.
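To make that concrete, here's a minimal sketch of what that first notebook loop might look like, using the official OpenAI Python client. The personas, question, and model name below are placeholders for illustration, not what Rally actually uses.

```python
# Minimal proof-of-concept: send the same question to each persona, one at a time.
# The personas and model name below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

personas = [
    {"name": "Maria", "background": "34, nurse in Austin, shops at local markets"},
    {"name": "Dev", "background": "52, runs a small import business in New Jersey"},
]

question = "How do you feel about tariffs?"

for persona in personas:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Roleplay as this persona: {persona['background']}"},
            {"role": "user", "content": question},
        ],
    )
    print(persona["name"], "->", response.choices[0].message.content)
```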

Initially, though, the results were a little bit disappointing. They all looked very similar. That was a real problem, because what's the point in messaging 100 different AI personas if they're all going to say the same thing? You might as well use ChatGPT. That was one of the big questions I was asking myself when I built this: why wouldn't you just use ChatGPT? I had to come up with some good reasons before it was even worth the engineering effort.

What I found is that with a little bit of tweaking to the prompts that we use, we could get them sounding more human. Specifically, what we did is we added more examples to the prompts, and I can show you that.

Let me just bring up the OpenAI playground. I'm just going to show you some of the prompts that we're using here. I do a lot of my testing in the OpenAI playground just because it's much easier to test something quickly and then I'll put that into the Jupyter notebooks and test it at scale.

You can see here a prompt with GPT-4 and an example persona. A lot of the work was in the instructions we had to give. We said you need to roleplay as this persona, but then we worked a lot on the finer instructions, for example, respond with your inner thoughts from a first-person perspective.

That seemed to really improve results. We would submit different questions and then see the responses. I would say, "How do you feel about tariffs?", get the response, and then reset and try again. What I was watching for was that the responses were not all the same, and that if I put a different persona in here, I would get different responses.

That's when I started to get a little bit more excited about the project. We worked a lot on the instructions, and we also found a few specific things as we were building it. Once we had built the application, people would complain that the personas kept mentioning their background information, prefacing every answer with something like "as someone who works in...". That gets boring. It's not really how actual people talk; they don't name-drop their background that much.

You can see even here, she just still does it. We had to dial it down a bit.

That's why we added specific instructions like "do not mention your background information." We talked about it being an inner monologue, and we actually doubled down on that and put a few different things in here. The specific details don't matter too much, because they will be different for your application.

But this should give you a sense of the type of legwork you have to do to get your prompt working. You typically need to give it some context. Like, in this case, we're injecting the persona. But then you also have to really work on the instructions within that where the context is applied.
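As a rough illustration of that shape - context injected, then instructions refined - here's what such a persona prompt template could look like. The wording is ours, not the exact production prompt.

```python
# Illustrative persona prompt template; the exact production wording differs.
PERSONA_SYSTEM_PROMPT = """You are roleplaying as the persona described below.

<persona>
{persona_description}
</persona>

Instructions:
- Respond with your inner thoughts, from a first-person perspective.
- Treat your answer as an inner monologue, not a polished public statement.
- Do not mention your background information or job title unprompted.
- It is fine to be critical, dismissive, or negative if the persona would be.
"""

def build_messages(persona_description: str, question: str) -> list[dict]:
    # Inject the persona context into the system prompt, then ask the question.
    return [
        {"role": "system", "content": PERSONA_SYSTEM_PROMPT.format(persona_description=persona_description)},
        {"role": "user", "content": question},
    ]
```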

A lot of that work is really just looking at the responses, seeing what you like or don't like about them, and then adding specific instructions to counter or emphasize that. The other really big thing that works, and I think this is something people miss with prompt engineering, something that has changed a lot over time, is examples. In the early days of prompt engineering, people focused a lot on the instructions, and that's still useful.

Recommended talk: Optimizing a Prompt for Production • Mike Taylor • GOTO 2024

Breaking the AI Positivity Curse

In the example responses, you can see that there are some genuinely negative ones, like "they annoy me so much." The reason I added those is that I wanted the personas to be a little bit meaner than the average chatbot response.

I think the feedback we get from users is that if I give you feedback on your idea or marketing campaign or the design of your product, then it's much more useful if I give you honest feedback, if I give you negative feedback sometimes. Otherwise, you can just ask your mom and she'll tell you it's great.

ChatGPT is like your mom that just tells you, "Oh, your idea is so good. It's definitely going to succeed." And then it doesn't. And you'll be unhappy. So I think that adding more examples is how we broke that positivity curse.
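One way to do that in practice is to add a few negative-toned example exchanges as few-shot turns before the real question. The examples below are made up for illustration; they're not the ones in Rally's prompt.

```python
# Few-shot examples that push responses away from default chatbot positivity.
# These example exchanges are invented for illustration.
FEW_SHOT_EXAMPLES = [
    {"role": "user", "content": "What do you think of subscription pricing for this app?"},
    {"role": "assistant", "content": "Honestly, another subscription? They annoy me so much. I'd rather pay once and be done with it."},
    {"role": "user", "content": "Would you try a plant-based burger at a fast food chain?"},
    {"role": "assistant", "content": "Maybe once out of curiosity, but the last one I tried was mushy and overpriced, so I'm skeptical."},
]

def build_messages_with_examples(system_prompt: str, question: str) -> list[dict]:
    # Few-shot turns sit between the system prompt and the real question.
    return [
        {"role": "system", "content": system_prompt},
        *FEW_SHOT_EXAMPLES,
        {"role": "user", "content": question},
    ]
```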

You have to do this process for every prompt in your system, and it's very rare that you have just one prompt. We have one prompt for queries that generates the individual persona responses, and another prompt that generates the combined response. That's something to think about for latency: you have to wait for all of the individual responses to finish before you can generate the combined one.

Now, the reason this is so fast is that we're actually calling all of these in parallel. We're doing 100 API calls, each one triggered at the same time, using the async version of the API client. This is super important if you're building an AI application. The biggest thing you have to worry about is latency, because people are used to software being fast.

It can be fast these days, unless it's using AI - AI is the slowest thing in your application. So anything you can do to speed it up or to make it look like it's sped up is better. Doing requests in parallel is really important, doing requests in the background. And then, in this case, we have streaming.

So you saw that the requests streamed in over time. So the user has something to look at while they're waiting for the full answer. Those are really important techniques.
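Here's a sketch of that fan-out pattern: all persona calls fired at once with the async client, then a second prompt that combines the answers once they're all in. Function names and the summarization prompt are ours, for illustration, and streaming is omitted for brevity.

```python
# Fan one question out to every persona in parallel, then summarize the answers.
# Assumes the async OpenAI client; names and prompts are illustrative.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def ask_persona(persona: dict, question: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Roleplay as this persona: {persona['background']}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

async def ask_audience(personas: list[dict], question: str) -> tuple[list[str], str]:
    # All 100 persona calls are triggered at the same time.
    answers = await asyncio.gather(*(ask_persona(p, question) for p in personas))
    # Second prompt: the combined response can only run once every answer is in.
    summary = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Summarize these audience responses:\n\n" + "\n\n".join(answers)}],
    )
    return list(answers), summary.choices[0].message.content

# answers, combined = asyncio.run(ask_audience(personas, "How do you feel about tariffs?"))
```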

The other thing that was difficult - and I think this has changed a lot over the past couple of years - is that the model is very important. It used to be when you were building an AI application two years ago, that the only sensible provider was OpenAI. Some people were experimenting with open source models. And in our case we're using Groq, which is not "grok" with a K - that's Elon Musk's AI model.

Recommended talk: Prompt Engineering for Generative AI • James Phoenix, Mike Taylor & Phil Winder • GOTO 2025

The Multi-Model Landscape: Beyond OpenAI

This is Groq with a Q, a chip company that hosts a number of open source models. We do have open source models in the product, but they're the least popular - almost nobody on our platform uses open source, although we experiment with it extensively.

The landscape used to be dominated by OpenAI as the only production-ready option, unless you were using open source. But at that time, open source models weren't very good. The landscape has changed dramatically and continues evolving weekly, with new models potentially becoming the best for weeks or months at a time.

Currently in our testing, we're finding that Anthropic is actually slightly better than OpenAI. Google is about 20% better than OpenAI as well. We find that Anthropic excels at creative tasks, while Google is quite good at replicating human behavior, making it the most accurate for testing purposes. OpenAI remains the most reliable API, so for large-scale projects, we use OpenAI.

I recently did a project for 5,000 personas. When you're generating 5,000 personas and querying them rather than just 100, it really changes the game. That's when using OpenAI makes a big difference.

Another great feature from OpenAI is structured responses. This isn't just using JSON mode like other providers offer - you can actually pass in a specific Pydantic object into OpenAI, and it's basically guaranteed to return as that object. I didn't get any retry issues when using this, which was very important when doing 5,000 different queries with 20 questions for each persona.
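Here's roughly what that looks like with the parse helper in recent versions of the OpenAI Python SDK and a Pydantic model. The survey schema below is a hypothetical example, not our actual one.

```python
# Structured outputs: the response is parsed directly into a Pydantic object.
# The SurveyAnswer schema is a hypothetical example.
from pydantic import BaseModel
from openai import OpenAI

class SurveyAnswer(BaseModel):
    question_id: str
    answer: str
    sentiment: str  # e.g. "positive", "neutral", "negative"

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Answer the survey question as the persona described earlier."},
        {"role": "user", "content": "q1: How do you feel about tariffs?"},
    ],
    response_format=SurveyAnswer,  # response comes back matching this schema
)

answer: SurveyAnswer = completion.choices[0].message.parsed
print(answer.sentiment, answer.answer)
```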

These are the real challenges when you're scaling up and start hitting rate limits. One thing you might not know is that OpenAI has different usage tiers, and in some organizations you can't always get to the higher tiers right away. You can see here that OpenAI allows you to pay $5 and start using the API.

Scaling Challenges: Rate Limits, Tiers, and Performance

API Pricing and Scaling Challenges

They do have a free version of the API in specific geographies, but the first tier is $5 with a $100 monthly spending limit. If you have a client that wants to run multiple projects or release a product into production, you need to move off tier one quickly.

Our costs went past that fast - we spent about $2,000 last month. Tier two is better, with higher rate limits as you scale up. For us, because we're doing 100 requests at a time, it was important to get up to tier five.

One thing we did was use an account from a past project where we'd already spent $1,000 and the account had been active for 30+ days. That's one of those gotchas you might not think about until you start a project and realize you're now 30 days behind, waiting to reach tier five. You can also talk to OpenAI directly.
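A generic way to survive lower-tier rate limits while still fanning requests out is to cap concurrency and back off on rate-limit errors. This is a common pattern, not necessarily what Rally does internally.

```python
# Generic pattern for working within rate limits: cap concurrency with a
# semaphore and retry with exponential backoff on rate-limit errors.
import asyncio
from openai import AsyncOpenAI, RateLimitError

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(20)  # tune to your tier's requests-per-minute limit

async def ask_with_backoff(messages: list[dict], retries: int = 5) -> str:
    async with semaphore:
        for attempt in range(retries):
            try:
                response = await client.chat.completions.create(model="gpt-4o", messages=messages)
                return response.choices[0].message.content
            except RateLimitError:
                await asyncio.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError("Still rate limited after all retries")
```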

Evaluation Tools and Testing

The other difficult aspect is evaluations. We have a very ugly eval tool, because we spent all our time on the front end of the app. We set it up with different test cases and questions we want to ask people, preload them into the API, and run evaluations. This lets us test different providers and models.

We have different audiences here as well. This is basically a thin wrapper on top of our API that gives us extra metrics and evaluations.
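The core of such a wrapper can be as simple as a loop over preloaded questions, audience members, and model identifiers, recording every answer for scoring later. The names, model ids, and ask() function below are illustrative placeholders.

```python
# Skeleton of a thin eval wrapper: run preloaded test questions for each
# audience member against several models and collect the answers for scoring.
TEST_QUESTIONS = [
    "How do you feel about tariffs?",
    "Would you pay $20 a month for this product?",
]

MODELS_TO_COMPARE = ["provider-a/model-x", "provider-b/model-y"]  # placeholders

def run_eval(ask, audience: list[dict]) -> list[dict]:
    results = []
    for model in MODELS_TO_COMPARE:
        for question in TEST_QUESTIONS:
            for persona in audience:
                answer = ask(model=model, persona=persona, question=question)
                results.append({
                    "model": model,
                    "question": question,
                    "persona": persona["name"],
                    "answer": answer,
                })
    return results
```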

Evaluations have become much more important in the past few years. Before evaluations became standard, people were just using good vibes to check whether the model was good. Now the average team is much more formal with evaluations.

Key Metrics We Track

We calculate several metrics:

  • Latency: How long it takes for results to come back
  • Token counting: To understand cost for different models
  • Fidelity: We use a vector database to check how close the response was to the persona. If the response was similar to the persona's background, we've done a good job because the persona is shining through
  • Diversity: How different the response was from what was asked. We want people to come up with unexpected insights, so we optimize for that
  • AI Judge: Based on various criteria, we give responses a score out of five

In this case, we got a really high score on this test, but on other tests it doesn't do as well. Some models perform better than others.
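As a rough sketch of how the fidelity, diversity, and judge metrics above could be computed, here is one approach using embeddings and a second model call; the models, prompt, and scoring details are illustrative, not Rally's exact implementation. Latency can be measured by timing each call, and token counts read from the usage field on each response.

```python
# Illustrative scoring of a single persona response: fidelity and diversity
# via embedding similarity, plus an AI-judge score from a second model call.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    data = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(data.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_response(persona_background: str, question: str, answer: str) -> dict:
    fidelity = cosine(embed(answer), embed(persona_background))  # persona shining through
    diversity = 1 - cosine(embed(answer), embed(question))       # distance from the question asked
    judge = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Score this survey answer from 1 to 5 for how realistic and human it sounds. "
                   f"Reply with just the number.\n\n{answer}"}],
    )
    return {
        "fidelity": fidelity,
        "diversity": diversity,
        "judge_score": judge.choices[0].message.content.strip(),
    }
```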

Approach and Tools

This is how we do evaluations. It doesn't have to be clever or use fancy tools necessarily. You can do this in a Jupyter notebook.

The overall approach is doing evaluations, talking to customers, figuring out what they like and don't like about responses, formalizing that into evaluations, then testing and optimizing until you find the system that presents the right results from the model with the right context.

Synthetic Research Applications

We're doing blog posts and experiments on synthetic audiences. For example, we tested Game of Preferences and saw how well it performed versus real world data. This is really powerful and what makes me excited for synthetic research.

If you can predict with 85% accuracy how well a human would have chosen different things, then you don't have to pay for a big study and can test many more things than you otherwise would have the budget for. I see a lot of promise in this industry.

Future Directions

One paper that really inspired me to dig into this topic was the generative agents paper. Many people in AI have seen this one because it has a cool Pokemon-style graphic on the front.

Recommended talk: ChatGPT from Scratch: How to Train an Enterprise AI Assistant • Phil Winder • GOTO 2023

The Future of Synthetic Research: From Chat to Simulation

The general idea is that they got a bunch of AI agents in a small town, like a virtual town based on the game Pokemon. They let them walk around and interact with each other so they would talk to each other, go to school, come back, and plan parties.

This is super early, but it's a sign of things to come. What we want to do is not just message AI agents 100 at a time and get feedback on responses, but actually let them loose in a virtual world and see how they interact.

This may be of limited use in terms of how they interact in the small town. But you can imagine if you let them loose on your website and let them decide what to click on, that could be really interesting data. You could see how information spreads over time.

Here's an example from the paper where Isabella was planning a Valentine's Day party at a cafe. She told a bunch of different people. Some of them didn't tell anyone, but some people did, and some of those people told other people as well. Sometimes it even came back to Isabella.

The most promising part is being able to look at actual network diffusion and see how agents interact in spaces, in specific markets, and which messages propagate and which ones don't. As an economics nerd who used to run a marketing agency, this fascinates me - the ability to drop your message into a simulation and see how well it does. This is what a lot of people are working towards with varying degrees of success.

Right now, the real challenge is that these AI agents aren't that reliable. They go off the rails quite often and get stuck in loops. But as the models get better and cheaper to run, these more interactive sessions are going to be possible.

There's another paper by the same author, Joon Sung Park, often referred to as the "1,000 agents" paper, that's been very influential on this company. He found that when you have a deep amount of information about a person - in this case, a two-hour interview transcript where they talk about themselves - you can predict with quite a high degree of accuracy how they would respond to different surveys and how they would play specific economic behavioral games.

This inspired a lot of people to get into this field. If you can combine those two things - take a digital twin of somebody and put them into a virtual environment - you can get to quite an interesting scenario where you could practice different interactions, test different marketing messages, try different business structures and product ideas, and see how well they work before you have any consequences in the real world.

One day, what I'd like to do is before we record a session like this, go and actually do the session, then send it to the virtual version of all of you. Then you give me the feedback so that the real you doesn't have to sit through a bad session ever again. That's where all of this is going, or at least where it promises to go.