Key Takeaways 🔗
- Prompt Engineering Techniques: Understanding and applying techniques like zero-shot, few-shot, chain-of-thought, and self-consistency can significantly enhance the effectiveness of large language models (LLMs) in software development and regulatory contexts.
- Prompt Engineering Technique Usage: Start with simple techniques and work up to more complex ones depending on the accuracy and behavior of the model being used.
- Similarity to Software Engineering: Good prompt engineering shares similarities with good software engineering practices, such as breaking down complex tasks, being specific, and providing clear instructions.
- Tools like Cursor: Utilizing tools like Cursor can help software engineers move up levels of abstraction, focusing more on requirements and higher-level design rather than low-level code, which is beneficial in regulated industries like medical devices.
- Verification Challenges with LLMs: While LLMs can generate code and assist in understanding large codebases, careful verification and validation are necessary due to potential inaccuracies or inconsistencies in outputs.
- LLMs in Requirements Verification: LLMs can aid in verifying if code meets requirements or in parsing documentation for compliance, but this requires careful prompt engineering and human oversight.
- Understanding Limitations: Recognizing the limitations of LLMs, such as difficulties with specific tasks like counting letters or performing certain mathematical operations, is crucial for effective use.
- Test-Driven Development (TDD) Principles: Incorporating TDD principles when working with LLMs can help ensure outputs are correct and reliable, which is essential in safety-critical domains like medical devices.
- Importance of Tokenization: Understanding how LLMs process information through tokenization can explain some of their limitations and guide more effective prompt engineering strategies.
Resources 🔗
- https://aws.amazon.com/what-is/prompt-engineering/
- https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api
- https://platform.openai.com/docs/guides/prompt-engineering
- https://www.promptingguide.ai/techniques
- https://partyrock.aws/u/js2222/zEj353AmT/Prompt-Engineering-Guide-Introduction
- https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prompt-generator
- https://github.com/innolitics/llm-toolkit
- https://youtu.be/oFfVt3S51T4?si=OLmyMoZS5GXj3Sqn
Transcript 🔗
Introduction to Prompt Engineering 🔗
Nicholas: Okay, so this week, we're gonna be using this time to discuss prompt engineering. It's something I was talking with Reece about a little bit while trying to come up with an idea for this time, and he suggested it. And it was a perfect suggestion, because I was slow to answer when Matt did his 10x time and asked if anyone had ever used a large language model, like ChatGPT or anything like that. I actually hadn't, until about two weeks ago, when Matt helped me set some stuff up.
So this presentation is going to be kind of from the perspective of me actually learning all of this this past week of how to use these techniques of prompt engineering. So yeah, without further ado, let me share my screen, and I will show you all what I have put together.
Nicholas: Starting at a high level, some definitions. What exactly is prompt engineering? And to pare it down even further, what is a prompt? It's a request to generative AI to perform some sort of task. Large language models are not omniscient, and they kind of have the same shortcomings as humans in the sense that if you give them a vague request or an open-ended prompt, they're going to struggle. This is something that, as I've tinkered around with large language models, I have actually encountered myself.
I have a quote here for what prompt engineering is from AWS, which is "a process where you guide generative AI solutions to generate desired outputs." Another definition: who is a prompt engineer? I put this up here just because it's a term that gets thrown around a lot for, like, prompt engineering or becoming a prompt engineer and those sorts of things. So one definition, again from AWS, is that a prompt engineer is someone who develops tools and context to aid a large language model when answering a user's prompt.
Techniques of Prompt Engineering 🔗
Nicholas: Now, transitioning from the definitions to techniques of prompt engineering, now that we've established what a prompt is and who a prompt engineer is, we start with the simplest technique: zero-shot. Zero-shot is probably the most straightforward, in the sense that you simply ask the model without any further context, and the model gives you an output. There are no extra bells or whistles to this prompt; it's just a direct request.
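As a rough illustration of the zero-shot technique described above, here is what such a prompt might look like when built up in code; the review text and labels are placeholder examples rather than ones from the presentation.

```python
# Zero-shot: the prompt states the task directly, with no examples or reasoning steps.
# This string would be sent to the model as-is.
zero_shot_prompt = (
    "Classify the sentiment of the following product review as Positive or Negative.\n"
    'Review: "The device stopped charging after two days."\n'
    "Sentiment:"
)
print(zero_shot_prompt)
```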
From here, we move on to the next technique that builds on that: few-shot. Few-shot is where you start to add a little more information or context to your prompt. Unlike zero-shot, where it's kind of just point blank, here you are giving the model something to chew on, something to consider when formulating a response to you. It takes advantage of something called in-context learning, in which the user provides some examples within the prompt to help guide the model toward an answer aligned with their expectations.
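For contrast, a sketch of the same placeholder task as a few-shot prompt: a couple of worked examples are embedded in the prompt so the model can pick up the expected labels and format through in-context learning.

```python
# Few-shot: a few labeled examples precede the actual question.
few_shot_prompt = (
    "Classify the sentiment of each review as Positive or Negative.\n\n"
    'Review: "Setup took two minutes and it just works."\n'
    "Sentiment: Positive\n\n"
    'Review: "The battery died within a week."\n'
    "Sentiment: Negative\n\n"
    'Review: "The device stopped charging after two days."\n'
    "Sentiment:"
)
print(few_shot_prompt)
```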
Building off few-shot, now we have chain-of-thought. Chain-of-thought is like few-shot in the sense that you're providing some sort of help or demonstration to the model, but specifically with chain-of-thought prompting, you help the model reason through a task. You provide intermediate reasoning steps, which the model then adopts, making its response a little more robust to whatever you're asking.
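A sketch of a chain-of-thought prompt, where the worked example includes the intermediate reasoning rather than only the final answer; the arithmetic word problem is just an illustrative placeholder.

```python
# Chain-of-thought: the example demonstrates the reasoning steps, not just the answer,
# nudging the model to reason step by step before it responds.
cot_prompt = (
    "Q: A clinic has 4 infusion pumps and buys 3 boxes of 2 pumps each. "
    "How many pumps does it have now?\n"
    "A: It starts with 4 pumps. 3 boxes of 2 pumps is 6 pumps. 4 + 6 = 10. The answer is 10.\n\n"
    "Q: A lab has 5 sensors and receives 4 packs of 3 sensors each. "
    "How many sensors does it have now?\n"
    "A:"
)
print(cot_prompt)
```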
And then we get to self-consistency. This is probably the most complex one we've demonstrated here. Self-consistency builds on the foundation of the techniques so far, specifically few-shot and chain-of-thought. You can probably see in this example that there's a little more depth given; the idea is that you're providing that additional context and those reasoning steps across multiple examples.
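Self-consistency is typically implemented by sampling several chain-of-thought completions at a non-zero temperature and keeping the answer they agree on most often. A minimal sketch follows; `ask_model` is a hypothetical stand-in for whatever chat-completion call your provider exposes, and the answer parsing is deliberately naive.

```python
import re
from collections import Counter

def ask_model(prompt: str, temperature: float = 0.7) -> str:
    ...  # hypothetical wrapper around your LLM provider's API; temperature > 0 varies the reasoning

def final_number(completion: str | None) -> str | None:
    # Naive parse: treat the last number in the completion as the final answer.
    numbers = re.findall(r"-?\d+", completion or "")
    return numbers[-1] if numbers else None

def self_consistent_answer(cot_prompt: str, samples: int = 5) -> str | None:
    # Sample several reasoning paths and return the most common final answer.
    answers = [final_number(ask_model(cot_prompt)) for _ in range(samples)]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```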
Best Practices in Prompt Engineering 🔗
Nicholas: Moving on from the techniques, we've now laid a foundation for some definitions and looked at some techniques. Now, we're gonna look at best practices. Because if you're like me, having this plethora of information is nice, but then you get decision paralysis whenever you have the terminal open and you want to talk to the large language model. You don't necessarily know where to start. Now that I have a bunch of tools to choose from, which one do I choose? It's better to just boil it down to some basics. And so here's some best practices that can help guide your prompting.
Keeping instructions or requests at the beginning of the prompt. This is something that, if you look through all of the examples that I give, it's a template that is quite frequently used. Then separating instructions and context with some sort of delimiter. Here, they've done it very explicitly with quotes.
Being specific and detailed with your expectations for what the output of the model should be. The contrast here is between a very open-ended prompt and a very descriptive one. The other pitfall is a prompt you think is descriptive but is really just verbose and not especially precise.
Also, give the large language model a framework to abide by when you are using it. For example, reading in a CSV file, or parsing descriptors out of a paragraph and asking it to format the output in a specific way.
The big thing here is starting with the simplest technique and progressing to more involved techniques as needed. Here you have an example of someone starting with zero-shot and then, after that, going to few-shot.
Finally, don't be so negative. Give the model some paths forward instead of just restrictions. Instead of saying, "Perform this task for me and don't do this, this, or this," it's better to say, "Perform this task for me, work within this framework, and try to perform this set of actions while providing an answer for me."
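To make these best practices concrete, here is a sketch of a prompt that puts the instructions first, separates them from the context with a delimiter, specifies the output format, and phrases constraints as things to do rather than a list of don'ts. The column names and delimiter are just illustrative choices.

```python
# Instructions first, context after a clear delimiter, an explicit output format,
# and positive guidance rather than a list of prohibitions.
report_text = "..."  # placeholder for the document you want processed

structured_prompt = (
    "Extract every adverse event mentioned in the text below and return a CSV with the "
    "columns: event, severity, date. Use one row per event and write 'unknown' for any "
    "field the text does not state.\n\n"
    '"""\n'
    f"{report_text}\n"
    '"""'
)
print(structured_prompt)
```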
Comparing Prompt Engineering to Software Engineering 🔗
Nicholas: Now, with all that, I'd like to open the discussion up to the floor. I have a list of questions that I posted in the Slack channel that we could discuss, or if anyone has any questions or comments about the presentation, we can also address those.
Reece: This is great. Thanks, Nico. I like seeing the overview and the examples for all these different techniques. I agree, it's hard to sometimes understand the definitions until you start playing around with them a little bit more. And it's funny; it feels like we're coming up with a vocabulary to describe something that we haven't even really had an opportunity to fully define yet. So it's very interesting to see it kind of piece together.
Nicholas: Yeah, yeah, it was definitely something interesting to tackle. Again, going back to seeing just like a blank screen and being like, "Oh, what do I type into here? Is it like I'm talking to a person? Is it like I'm not talking to a person?" Having the framework and some definitions in place kind of helped, like, okay, I can abstract that sort of, in a weird way, social anxiety out of this, so I can kind of give a little bit of a better experience for myself.
Bimba: Yeah, that's a really useful summary of prompt engineering. I'm curious if other people were thinking this as Nico was going through the best practices: I was comparing good prompt engineering to just good software engineering. I feel like there's a lot of overlap. I think one of the slides said you want to put your instructions at the top, or maybe it was the context at the top and then instructions at the bottom. Some programming languages have best practices around where you put things, like defining the variables up front. I don't know if anyone else was thinking the same thing, like, okay, there are lots of similarities between good prompt engineering and just good software engineering.
Reece: Yeah, definitely, I would agree with that. And I guess, to compare it to software engineering, the other end of the spectrum is to compare it to talking to people, or delegating to people versus delegating to machines. It's somewhere in the middle: you can't be as free-form as you can with people, but you can be more generic than you would be in a programming language. Finding the happy medium between those two seems like the inherent challenge.
David: Yeah, I was gonna say something really similar to Reece. To me, it's almost like writing requirements or writing task cards for a project, because it is in English, but you need to be precise. A good software requirement isn't so vague that you don't know whether it's verified or not. And, you know, this is a little bit of a tangent, but is anyone here using Cursor lately for programming?
The Role of Tools Like Cursor 🔗
David: I was playing with it a little over the weekend, and I'd used it a couple of times before. I also listened to the podcast that Yujan posted, the Lex Fridman interview with the Cursor team. If anyone watched that, I thought the first third and the last third were good; in the middle third, I thought they were saying some silly things that didn't make sense to me.
But some of the stuff they were talking about is how we're moving up a level of abstraction. And the way I think about the design history files that we're creating for our projects is, like, you've got the code at the bottom, and then you can say, "Well, why is this line of code here?" And that kind of, in the ideal case, takes you up to a requirement, maybe a detailed requirement. And then you can ask, "Well, why do we have that requirement?" You can go up to maybe a system-level requirement and then up to a user need. And in a way, each time you're asking why and moving up a level of abstraction.
And it feels to me like interacting with the LLMs is kind of like moving up a level from the code to a higher level, and using good prompt engineering kind of then turns into, like, writing good requirements.
Bimba: Oh, yeah, that's pretty interesting. I haven't seen the Lex interview, but yeah, I know that when I was saying software engineering, I guess I meant more like, I don't know, I feel like this probably applies more if you're trying to create, like, an agent. But like the stuff you were demoing last time, Yujan, I could see someone using, you know, like even a TDD kind of approach where you're like, okay, here's a bunch of expected inputs and output pairs, and you just keep making the prompt better until it passes on all of those things.
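A rough sketch of the TDD-style loop Bimba describes: keep a small set of expected input/output pairs and re-run them whenever the prompt changes, treating the pairs as the test suite. The `ask_model` helper and the sentiment examples are hypothetical placeholders.

```python
def ask_model(prompt: str) -> str:
    ...  # hypothetical wrapper around your LLM provider's chat-completion API

# Expected input/output pairs act as the "test suite" for the prompt.
cases = [
    ("The update bricked my device.", "Negative"),
    ("Setup was painless and fast.", "Positive"),
]

def evaluate(prompt_template: str) -> float:
    # Return the fraction of cases the current prompt template gets right.
    passed = 0
    for review, expected in cases:
        reply = ask_model(prompt_template.format(review=review)) or ""
        passed += expected.lower() in reply.lower()
    return passed / len(cases)

# Keep revising the template until evaluate() returns 1.0 on your cases.
template = (
    "Classify the sentiment of this review as Positive or Negative.\n"
    'Review: "{review}"\nSentiment:'
)
score = evaluate(template)
```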
Moving Up Levels of Abstraction 🔗
Yujan: Yeah, if I'm hearing correctly, I think what you're saying, Bimba, is that large language models, or prompting, are similar to software engineering: you've got to break down the complexity. And what Reece is saying is that it's similar to learning how to delegate to a coworker. The more times you do it, the better you get at delegating in general, but also at delegating to that coworker. We've probably all had times where you have a task in mind and you know the right person on the team for the job, right? And the only way you picked up that intuition is by interacting with that person many, many times.
That's kind of how I've thought about LLMs. It might sound a bit silly, but I think it's a reasonable parallel. Now you can kind of know which LLM is the right one for a given task, and there are really only three to choose from, the big three from OpenAI, Anthropic, and Google. And the best way to build that intuition is just to get as much mileage as possible, to try a bunch of stuff. Techniques like prompt engineering will maximize the chance of success, but ultimately I think you've just got to get the miles in.
David: Well, I think you would default to working at the highest level of abstraction that you can, but in practice the information content just isn't there, so it's not like you'll only be staying at the top levels. And this is another thing they talked about in the Lex Fridman interview: you're still gonna have to go up and down, right? It's hard for me to believe that the English language is gonna get to the point where you can specify, with the precision you need, what a program will do.
But one other thing I'd say is interesting is, you know, we've been helping a lot of companies who have the code create the design history file to go along with the code and verify it, and so on. But it's interesting—you think of going the other way, right? Like, sometimes you're writing the requirements, and then it's generating the code, and then vice versa. And having the LLMs kind of keep it all consistent is just an interesting thing where it's kind of blending the two types of projects together.
Limitations and Challenges of LLMs 🔗
Reece: Well, it's curious, because we're talking about almost two completely different kinds of problems. If you're going up from the code to requirements and user needs, it's like a reduction in information. And then going the other way, it's like you're spreading out into the details. And so there's necessarily information loss that happens there that the LLM is gonna have to fill in the gaps. And its ability to do that effectively currently is one of the major gaps in using the tool, right? Like, you can't just set it loose and expect it to crank something out.
I don't know if anyone else has thought this, but I've certainly found it: one of the challenges with using LLMs is that it feels very mentally taxing to use them for any non-trivially complex task, because the amount of verification you have to do after the fact can be exhausting, unless you're really smart about using the tool in a way that lets you easily verify it. So, like, making it write its own unit tests, or maybe using a language like Rust where the compiler will yell at you until you get certain memory safety guarantees, right? Or using some other type of automated verification. Even telling it to write its own tests, I always worry that it's just gonna write a test that's not right or that passes for reasons that don't make sense. So finding ways to trust the output more feels like a central challenge on the code construction side.
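One way to address the trust problem Reece raises is to keep the test cases human-written and let the model write only the implementation, so the verification doesn't depend on the model grading its own work. A minimal sketch, with a placeholder function name standing in for whatever the model generated:

```python
# parse_dose is a placeholder name for a function the LLM was asked to write.
def parse_dose(text: str) -> float:
    ...  # paste the model-generated implementation here

def run_human_written_checks() -> None:
    # The expected values come from a person, not from the model.
    assert parse_dose("Administer 2.5 mg daily") == 2.5
    assert parse_dose("Dose: 10 mg") == 10.0
```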
JP: Has anyone worked in the middle of the two things that David was talking about, like where you already have the codebase, you have a set of requirements that haven't been made in parallel as the codebase was being generated, and now you wanna see if your codebase has the code to fulfill the requirements and maybe point you towards, you know, "Yeah, all this is covered; these items are not there," but the codebase is pretty extensive.
Yujan: Are you asking, like, given a set of requirements, can you verify that all the requirements were met using LLMs or using something like Cursor?
JP: Yes, essentially, yep.
Yujan: I haven't tried it on a real project, but I have used Cursor to help me understand larger codebases, and it does seem to work pretty well. Whenever I ask it, "Point me to the code that does this," it's worked pretty well for me, and I could imagine using it to ask, "Is this requirement met, and where is it implemented?" I bet that would work fairly well, although I haven't really tried it.
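A sketch of what that kind of requirement-coverage question might look like as a prompt, assuming you can supply the relevant source files yourself or let a tool like Cursor pull them in as context; the requirement text and wording are only illustrative.

```python
requirement = "REQ-042: The system shall log out users after 15 minutes of inactivity."
source_excerpt = "..."  # placeholder for the relevant source files or tool-supplied context

coverage_prompt = (
    "Given the requirement and the source code below, answer three things:\n"
    "1. Is the requirement implemented? (yes / no / partially)\n"
    "2. Which files and functions implement it? Quote the relevant lines.\n"
    "3. What, if anything, appears to be missing?\n\n"
    f"Requirement:\n{requirement}\n\n"
    f'Source code:\n"""\n{source_excerpt}\n"""'
)
print(coverage_prompt)
```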
Using LLMs for Code Verification 🔗
Bimba: This is somewhat similar, but I had to use it recently. One of the requirements we had for a client was that the cloud service providers we integrate with had to meet a certain bar: they had to be scale-tested with thousands of users, and they had to have a third party do some auditing. So there was a long list of compliance and scalability criteria the cloud service provider had to satisfy, we had a list of cloud service providers, and they provided a bunch of documentation. I just asked ChatGPT to find what in the documentation I could use to support whether each of these requirements for the CSP was met. That's not code, but it did a very good job of that.
Ethan: I think that's a really good example of taking a bunch of information and finding the piece that you need. One issue that I run into pretty often, and it might just be that I'm using the wrong model or need to tweak my prompt a little, is that if you want the LLM to produce a machine-readable output, like a CSV file, that's where I run into a lot of issues. I've had it, instead of just finishing the CSV file, keep repeating the last two lines over and over, and I can't figure out why.
And I think part of that is just because of the way these things are trained. They're language models, and no one talks in CSV, at least I don't. So it's a little bit of a different language, and while the model can figure it out, it's not a perfect solution every time. There is kind of a disconnect sometimes between natural language and machine-readable output. There have definitely been a lot of advances with outputting code, which is really nice, but sometimes I've run into issues with outputting in a machine-readable format.
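A common workaround for the structured-output problems Ethan describes is to ask for JSON with an explicit schema and then convert it to CSV locally, validating along the way, rather than having the model emit CSV directly. A sketch under those assumptions, with `ask_model` again a hypothetical API wrapper:

```python
import csv
import io
import json

def ask_model(prompt: str) -> str:
    ...  # hypothetical wrapper around your LLM provider's chat-completion API

extraction_prompt = (
    "From the report below, return ONLY a JSON array of objects with the keys "
    '"finding", "location", and "severity". Do not add commentary.\n\n'
    "Report:\n..."  # placeholder report text
)

reply = ask_model(extraction_prompt) or "[]"
rows = json.loads(reply)  # raises if the model strayed from JSON; validate before trusting

# Convert the validated JSON to CSV locally instead of asking the model to emit CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["finding", "location", "severity"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```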
Another thing I ran into was the context. I was talking about this with Matt. The task was to simplify a radiological report that had information about a patient. And I thought I'd be clever and say, "Oh, you are a data scientist, and you're tasked with collecting data, simplifying it, and producing a CSV." I gave it this kind of fancy prompt about what the LLM was supposed to do, and it didn't produce the output I wanted. Matt suggested getting rid of that context, and then, yeah, it was fine. So there is a point where sometimes context is not your friend. It takes a little bit of experimentation, for sure.
Kris: It might also be saying something about data scientists: "This isn't what I wanted."
Matt: But it does get at this aspect of prompt engineering where you're doing a kind of translation into some language. By "language" I mean the way in which somebody else would speak that's best suited to whatever task you want done, and you don't always know how to frame it in those terms; you're just guessing.
Understanding LLM Limitations Through Tokenization 🔗
Matt: Just a fun fact while I have the floor. There was a paper whose whole contribution was adding "Let's reason through the steps." They just added that to a bunch of prompts and evaluated it on a bunch of benchmarks, and it has 3,000 citations. I wish I had done grad school a little bit later in life.
Kris: I find it interesting to think about LLMs as programming in natural language. There are all sorts of levels you could think about it on, but has anybody ever watched the sci-fi movie "Arrival"?
Reece: One of my favorites.
Kris: So, language is ultimately about communicating an idea, or, in the programming sense, it's pretty much all imperative. Actually, it's all imperative, right? It's all about executing something, and there's really no other concern there unless it's comments, and those don't really count. And one thing to remember about LLMs is that they are not execution machines. So we're fundamentally expecting imperative behavior out of a machine that's not an execution machine.
And, you know, a lot of the weird behaviors that you guys are talking about, and the logical weirdness that can happen—like, there have been plenty of times where I ask it to provide me some Python code, and it gives me something that at first glance looks pretty reasonable. But then you point out an error, and then, of course, it's gonna be like, "Oh, yes, of course you're right. Here's the corrected whatever." And then it causes another problem, or that there's something contradictory about it. And you point it out, and then, of course, it's like, "Of course you're right. Here's the—" you know, and then you get into this really weird loop.
You know, ultimately we've got a machine that isn't an execution machine, and it isn't exactly imperative either. It's trying to be everything, but ultimately it can't be. That's not to say it's not useful. I've definitely been pretty amazed at times, like, "Oh, wow, that actually worked great for me," and I'm really happy when that happens. But the fact that we have to provide it intermediate steps to get the answer to a relatively simple logical problem does kind of make you wonder: should you trust it with much more than that? And therein lies the problem.
Bimba: Yeah, which is why the TDD comparison kept coming up in my head. If you think about floating point arithmetic, when you're adding two numbers you're going to get the mathematically correct result most of the time, but there are still some decimals that the floating point format can't represent, right? It basically works all the time, though. That's obviously much less the case when you ask ChatGPT to do anything, but by using prompt engineering to put restrictions on it and make it do exactly what we need, I don't know exactly how yet, but I think TDD is gonna be pretty useful for doing a lot of this stuff.
Kris: Oh, certainly. I mean, it's useful, no doubt. It kind of reminds me of the translation engine that Google—I don't know if they still use it, but they were kind of translating everything into an intermediate language that was actually based on English. And so as long as the languages had a high similarity to English, it was more or less fine. But then, once you went to, like, Asian languages where a lot of the concepts are quite different, it had a lot of trouble dealing with that.
Kind of like in "Arrival," the language you use influences how you think and what you focus on. It's true for us, and I think it's probably true for LLMs as well.
Conclusion and Next Steps 🔗
Yujan: Let me share something real quick, just in the two minutes we have. There's this website called OpenRouter, and what's really cool about this is you can try out a bunch of different LLMs at once. And you can ask it, like, the same question. This has been really useful for me just to get a sense of what all these LLMs can do, like the capabilities, but also it's just been interesting to see what answers you get that kind of come out of them.
I'm just gonna do one. There's a question that everyone asks: "How many 'R's are in the word 'strawberry'?" You can ask all these LLMs this one question, and they've historically struggled with it. And I would say, if you're going to dive into how LLMs work to get a better understanding of how to practically use them and what their limitations are, I think tokenization is what you should focus on.
These are like LLM optical illusions. Optical illusions help us understand how the human visual cortex works by finding edge cases: why does this image look like it's moving? That tells us something about how the visual cortex is processing it. Tokenization plays the same role for LLMs. I think it gives some intuition about why LLMs are bad at math and why they can't spell "strawberry." You know, this one thinks there are two 'R's; a lot of them think there are two 'R's, actually. But if you ask it to double-check, it'll get it right the second time.
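You can see the tokenization effect Yujan is describing directly. The sketch below uses OpenAI's open-source tiktoken library to show how a word is split into multi-character chunks; the model sees those chunks rather than individual letters, which is part of why letter-counting questions trip it up. The exact split depends on the tokenizer.

```python
import tiktoken  # OpenAI's open-source tokenizer library

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("strawberry")

# The model never sees individual letters, only these multi-character pieces,
# which is one reason counting the R's is harder than it looks.
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]
print(pieces)  # a few chunks such as ['str', 'aw', 'berry'], depending on the tokenizer
```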
Tokenization, I think, is the fundamental reason why tasks that we think should be relatively easy cognitively are difficult for these models. Perhaps it's a conversation for another time, but this is also a useful tool if you want to play around with all the different LLMs. You can get a sense of which prompt engineering strategies work well with which models. I don't think it's really necessary, though; it's probably better to just pick a good one, get used to it, and start incorporating it into your workflow. And if you'd like to optimize a little further, then you can use a tool like this.
But I think the important thing is to just start getting more mileage out of it. And if you look at the generative AI policy that I posted, we have team licenses of all these things, and these team licenses also ensure that the data is not being used to train any of their models. So if anyone wants to try it out, wants to incorporate it more, you know, definitely read that and ping the appropriate person to get access to this stuff.
That's all I had. We're at time. Great conversation, Nico. And yeah, see y'all. Have a great rest of your day. And Kris, nice to have you on the team. Welcome.
Kris: Thanks. Glad to be here.
David: Thanks. Thanks, everyone.