
May 28, 2025
Cursor co-founder Arvid coined "Prompt Design" back in 2023. Even though it's been almost 2 years - a lifetime in AI engineering - I still think it's the most fitting term for the process of writing and optimizing prompts.
At Parahelp, we're building AI customer agents for teams like Perplexity, Framer, Replit, and ElevenLabs. The great thing about customer support is that you have a clear success metric for how your agent performs in a real job: the % of tickets resolved end-to-end.
We're constantly trying to increase this number by building more capable customer support agents. Part of this often involves spending hundreds of hours optimizing just a few hundred lines of prompting. Most of that time isn't spent writing the prompts themselves, but on figuring out how to evaluate them, running the evaluations, finding edge cases, testing them in the real world, and iterating on what we learn.
To explain some of the key learnings behind a real prompt more clearly, I'll share our "manager" prompt and approximately one-fourth of our planning prompt.
Let's start with the planning prompt:
The first thing to highlight is that o1-med (we're now using o3-med) was the first model to perform well on our evaluations for this prompt. Two things make this planning prompt especially hard:
The full prompt contains ~1.5K tokens of dynamic information about the ticket so far: message history, relevant learnings from our memory system, company policies, etc. The model therefore has access to some of the information relevant to replying to the user, but rarely all of it. Getting a model to understand that it shouldn't be confident it has complete information (or assume what data tool calls will return) is difficult.
The plan must include all potential paths based on what tool calls (like search_helpcenter or search_subscription) return and the rules for different outcomes. For refund requests, the plan must consider all paths based on the purchase date, country, plan type, etc., as the refund rules vary according to these parameters. We call the number of paths a model can reliably handle "model RAM" - it's a key metric that affects a lot of our prompting and even architecture, as we have tricks to make it work when a model doesn't have enough RAM to handle certain complex scenarios.
For the first challenge, we let the model chain multiple tool-call steps using variable names: <> for tool call results, {{}} for specific policies. This way, it can plan across multiple tool calls without needing their outputs.
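To make the variable-name trick concrete, here is a minimal, hypothetical sketch (not our actual prompt) of a plan excerpt that references future tool-call results by name instead of by value; the {{refund_policy}} placeholder is made up for illustration:

```python
# Hypothetical sketch of a plan excerpt that chains tool calls by variable
# name. <...> stands in for a tool-call result that does not exist yet,
# {{...}} for a specific policy; the policy name is illustrative only.
plan_excerpt = """
1. Call search_subscription -> store the result as <subscription>
2. Call search_helpcenter("refund") -> store the result as <helpcenter_results>
3. Compare <subscription>.purchase_date against {{refund_policy}} to decide
   which refund path applies.
"""
```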
For the second challenge, switching to o1/o3 was the most significant unlock, followed by using XML if blocks with conditions. This made the model stricter (and let us parse the XML for evals), but I think it also performs better because it taps into the model's coding-logic capabilities from pre-training.
It's intentional that we allow only "if" blocks and never an "else" block: forcing the model to define an explicit condition for every path is something we've found increases performance in evals.
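For illustration, here is a simplified, made-up example of what if-only branching can look like in a plan; the conditions, actions, and policy placeholder are hypothetical, not taken from our prompt:

```python
# Hypothetical sketch of if-only branching: every path has an explicit
# condition and there is no <else> fallback. Conditions, actions, and the
# {{refund_policy}} placeholder are illustrative only.
plan_branch = """
<if condition="<subscription>.purchase_date is within {{refund_policy}}.window">
  <action>Offer the refund and confirm the amount with the user.</action>
</if>
<if condition="<subscription>.purchase_date is outside {{refund_policy}}.window">
  <action>Explain that the purchase falls outside the refund window.</action>
</if>
"""
```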
The second prompt is our manager prompt:
This prompt is probably more similar to prompts you've seen before. I still think it's worth sharing as a practical example of general prompt-design rules: specify the order in which the model should think, use markdown and XML for structure, assign a role for the model to assume (here, a manager), and use words like "Important" and "ALWAYS" to draw attention to critical instructions.
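As a rough, hypothetical skeleton (not our actual manager prompt) of how those rules combine:

```python
# Hypothetical skeleton showing the general rules above: an assigned role,
# an explicit thinking order, markdown + XML structure, and emphasis words
# for critical instructions. All content is illustrative.
manager_prompt_skeleton = """
# Role
You are a manager reviewing a support agent's proposed reply.

# Instructions
1. First, read the ticket history and the relevant policies.
2. Then, check the proposed reply against each policy in order.
3. Finally, output your decision in the <verdict> format below.

IMPORTANT: ALWAYS reject a reply that makes a commitment (refund,
discount, escalation) that the policies do not explicitly allow.

<verdict>accept | reject</verdict>
"""
```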
There are many other topics we would love to cover - our token-first agent architecture (vs. workflow-first), how we run evals, our memory system, why great retrieval is still underrated, and more.
If you want to work on these problems daily, consider joining us at Parahelp. We're hiring exceptional engineers and would love for you to join us. Email me the coolest project you've built: anker@parahelp.com.