Prompt Engineering Is Like Playing Whac-a-Mole—And the Stakes Are Higher Than Ever

Iz Beltagy
Jul 3, 2024

I’m Iz Beltagy, a co-founder and the Chief Scientist at Spiffy.ai, where we’re helping brands clear the pathway to purchase through turnkey and ultra-safe AI solutions. Over the coming weeks, my team and I will share a series of articles examining the impact of conversational AI on the rapidly evolving world of retail commerce. First up: the potential (and limitations) of prompt engineering.

As you’re reading this, thousands of companies are scrambling to build AI products that understand their customers' contextual needs as well as they understand how to support the path to purchase. The default foundation for these products is, unsurprisingly, GPT-4. That’s not necessarily a bad thing. But relying on GPT-4 alone to build a consumer-facing AI product is like relying on a Swiss Army knife to remodel your home: it’s handy but hardly sufficient for what you actually need.

AI agents that run on Large Language Model (LLM) prompts are unpredictable, producing wildly different outputs that developers have to monitor constantly. The result? Inaccurate or downright unacceptable outcomes.

Take Air Canada, for example, which faced a PR nightmare after its chatbot botched an explanation of the airline’s refund policy. Despite arguing that the chatbot had a mind of its own, Air Canada was ordered to compensate the passenger, including damages to cover interest. This is just one of many stark reminders that when AI fails, the company pays the price, both financially and reputationally.

This leads to the billion-dollar question: How can companies deploy reliable, future-proof AI agents? A common tactic is prompt engineering. But as I’ll explain, this approach can create more problems than it solves and leaves teams confused and frustrated.

In this article, I’ll explain:

  • Why prompt engineering is the wrong approach to building customer trust
  • What finetuning is, and how and when to use it
  • Why owning your model is critical to successfully implementing AI

When Prompt Engineering Turns into Whac-a-Mole

The internet loves buzzing about prompt engineering. Some tout it as the key to generating magical outcomes with GPT-4, while the World Economic Forum hailed it as the number one “job of the future.”

At first glance, prompt engineering seems like the fastest path to deploy AI agents that mirror human skills and brand values: You craft a set of meticulous instructions (prompts) and your model generates an output that’s virtually indistinguishable from a human.

However, every change to the prompt, no matter how modest, can yield dramatically different outputs. You create a new prompt to fix bugs, but as soon as you fix them, more appear. Before you know it, you’re playing a high-stakes game of Whac-a-Mole that leaves you with a brittle product that’s unsuitable for consumers.
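
To make the Whac-a-Mole dynamic concrete, here is a minimal sketch, assuming the OpenAI Python SDK and a hypothetical pair of system prompts, that compares how two nearly identical instructions can steer the same question differently:

```python
# A minimal sketch of prompt brittleness, assuming the OpenAI Python SDK (v1+)
# and an OPENAI_API_KEY in the environment. The system prompts and question
# below are hypothetical examples, not Spiffy's actual prompts.
from openai import OpenAI

client = OpenAI()

PROMPT_V1 = "You are a helpful retail assistant. Always mention our refund policy."
PROMPT_V2 = "You are a helpful retail assistant. Mention our refund policy when relevant."

question = "Can I return these ski boots after 30 days?"

for label, system_prompt in [("v1", PROMPT_V1), ("v2", PROMPT_V2)]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        temperature=0,  # even at temperature 0, small prompt edits can change the answer
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)
```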

This is the fundamental limitation countless companies face: You can assemble a team of the world’s best prompt engineers and still end up with a broken product. If you want to move away from shiny objects toward robust tools, the answer is data—lots of it.

Train Smarter, Not Harder

Retail companies are sitting on goldmines of data that can turn an unreliable chatbot into an industrial-grade copilot capable of thoughtful product recommendations, hyper-personalized explanations, product comparisons, and more. The key is training your model on your own data so it becomes a natural extension of your company, functioning like your best sales team member instead of a clunky add-on.

That’s where finetuning comes in.

Spiffy solves the generic GPT-4 problem by feeding a company’s key data directly into our model parameters (I will detail this further in future posts). That baked-in, 100% accurate knowledge is what sets finetuning apart from prompt engineering. Instead of feeding the model a few sentences of instructions at a time and hoping for the best, we train it on thousands, even millions, of data points. The finetuned model reproduces the brand voice and produces more reliable outputs. Better yet, it continuously improves as more training data accumulates. Over time, this positive feedback loop has a compounding effect on the model’s efficacy.
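
To give a sense of what this looks like in practice, here is a minimal, hypothetical sketch of supervised finetuning data in the JSONL chat format that many finetuning APIs accept. The brand, products, and answers are invented for illustration; a real dataset would contain thousands or millions of such examples:

```python
# Minimal sketch of a supervised finetuning dataset in JSONL chat format.
# The brand voice, products, and answers below are invented for illustration;
# a real dataset would contain thousands to millions of such examples.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You are Acme Outdoors' expert gear advisor."},
            {"role": "user", "content": "What ski length should a 5'6\" beginner pick?"},
            {"role": "assistant", "content": "For a beginner at 5'6\", we usually recommend skis around 150-160 cm; shorter skis are easier to turn while you build confidence."},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are Acme Outdoors' expert gear advisor."},
            {"role": "user", "content": "Do you carry 3-season tents under $200?"},
            {"role": "assistant", "content": "Yes, our Trailhead 2 is a 3-season, two-person tent at $179, and it's our most popular pick for weekend backpackers."},
        ]
    },
]

with open("finetune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

A finetuning job then updates the model’s weights on a file like this, so the knowledge lives in the parameters rather than in the prompt.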

A prompt-engineered GPT-4 model might sound like a sales agent, but it won’t sound like your sales agent. It will be generic, unpredictable, and incapable of improving over time, because you can’t train it on more data.

How Efficiency Fits Into the Equation

Prompt engineering isn’t just inferior when it comes to building consumer-facing AI—it’s inefficient too.

Effective prompts for GPT-4 require intentional word choice to avoid misalignment between what you mean and what the model thinks you mean. These long instruction sequences quickly eat up your tokens, especially since you have to feed them to the model over and over again. And wasted tokens mean wasted money.

As AI projects become more complex, reducing your sequence length is essential to preserving resources. Our finetuning approach accomplishes exactly that: because the model inherently understands your context and tasks, it can generate ideal responses with minimal prompting, with no need for lengthy, detailed instructions.
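
To put rough numbers on this, here is an illustrative sketch using the open-source tiktoken tokenizer. The prompts, request volume, and per-token price are placeholder assumptions, not actual GPT-4 rates:

```python
# Rough illustration of how prompt length compounds into cost.
# Uses the open-source tiktoken tokenizer; the prompts, request volume, and
# per-token price below are placeholders, not actual GPT-4 rates.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

LONG_PROMPT = (
    "You are a retail assistant for an outdoor gear store. Always greet the "
    "customer warmly, never recommend competitors, cite our 30-day refund "
    "policy verbatim when asked about returns, ask one clarifying question "
    "before recommending a product, and keep answers under 120 words..."
)
SHORT_PROMPT = "Recommend gear."  # a finetuned model already knows the rest

PRICE_PER_1K_TOKENS = 0.03   # hypothetical input price, USD
REQUESTS_PER_DAY = 100_000   # hypothetical traffic

for name, prompt in [("long", LONG_PROMPT), ("short", SHORT_PROMPT)]:
    tokens = len(enc.encode(prompt))
    daily_cost = tokens / 1000 * PRICE_PER_1K_TOKENS * REQUESTS_PER_DAY
    print(f"{name}: {tokens} tokens per request -> ${daily_cost:,.2f}/day")
```

Multiply that difference across every turn of every conversation and the savings from a short prompt compound quickly.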

It’s Time to Own Your Own Model

In the near future, every company will be an AI company. But those who rely on prompt-engineered GPT models will get left in the dust. If you use the same processes everyone else uses, you can only get the same results everyone else gets.

GPT-4 is fine for prototyping. But if you want a sustainable and scalable AI strategy, you need to own your own model.

Take Curated, for example. The brand connects passionate Experts with consumers so that together they can find the perfect product for high-consideration purchases like ski and snowboard equipment, golf clubs, and camping and hiking gear. When Curated began experimenting with AI, they found that their GPT-4-based co-pilot showed some potential but lacked the depth and nuance their team needed to feel comfortable using it to augment their workflows. After meeting with Curated, we realized they needed a model that was personalized to them and could be continuously improved to adapt to new information.

We worked with Curated to build a finetuned AI model that mirrored their 8,000 agents’ expertise and helped more consumers make informed purchase decisions based on tailored recommendations and guidance. In just six months, Curated saw $1M in annualized incremental sales via the co-pilot and a 52% increase in conversions compared to GPT-4.

Today, Spiffy serves as the co-pilot for all 8,000+ experts on Curated, enabling them to handle more customers, make better recommendations, and close more sales.

If you’re interested in building an agent for your company with Spiffy, we’d love to talk.
