The "Looks Good" Problem in Gen AI
If it’s easy to get a good-looking result, it’s equally likely that the result isn’t a solution. To check whether it is in fact a good result, we need experimentation. What’s an experiment? You’re often on your own to figure that out.
The experiment cycle is not intuitive. We learned in school that scientific experiments test hypotheses through falsifiability and observation, and that definition is correct. Like most broad definitions, though, it’s unhelpful for designing and conducting experiments in your specific situation.
A pragmatic definition is more useful: experiments should be designed so you don’t get duped by a good-looking answer. In my experience, this is a better guiding principle for assembling an experiment. It’s easy to get good-looking results using the tools of data science. AutoML has removed a lot of the labor in algorithm selection and parameter tuning. Large language models have made 80% solutions significantly easier to attain through prompt engineering and emerging no-code/low-code tool suites (Copilot and Microsoft/Google have many tools available as a service).
The primary reasons for good results that aren’t solutions:
Not defining what failure looks like
Hidden assumptions
Key to this cycle is defining scenarios. Scenarios define the context, objectives, and expectations. They’re critical to identifying assumptions. Assumptions don’t exist in the abstract ether. They’re context-based. Scenarios give you a context.
Finally, failure measurements are important to define. These are measurements that stand on their own instead of being defined as “not-good”. The presence of hallucinations is a good example.
A summary can be quite bad even if it has zero hallucinations. And a “good” summary that’s concise, coherent, and informative can still fail if there are hallucinations (even one may be enough).
This is a “failure” metric, as opposed to a success metric like usefulness.
Success and failure are contextual. A good metric for your solution could be a terrible metric for another. 99% accuracy is great for a lot of things. The 1% failure rate is a problem if it’s people clicking phishing emails.
When choosing failure metrics, look for ones that identify “deal breakers”.
Gen AI and the “looks good” problem
I use Gen AI at work. I develop systems with it at work. I play with developing systems off-work. Yet, I still get caught by the “looks good” problem. It shows up when I have a reasonable system that retrieves information supporting an answer or draft that I want, and a prompt for a language model to write the document or provide the answer. On a casual reading of the outputs, it’s good 8 out of 10 times.
That’s good enough to move to the next stage and then come back later based on the assumption that if we start at 8 out of 10, we can improve it up to 9 out of 10, or 95 out of 100. Most things in the world work this way. A little effort followed up by more effort pays off.
With Gen AI systems, this usually results in “prompt begging”. A desperate plea to the Gen AI model to please just do the thing that you are asking it to do and follow the rules as they are written, plus don’t try to add anything new and especially stop hallucinating!
Let’s take a Q&A chatbot and break down this cycle of frustration.
“We want to ask questions about HR material” chatbot
This is a common use case provided in tutorials, advertising, and demos. The solution is a valuable one: make it easier for employees to navigate the tranche of documents that corporations accumulate as they grow. If an employee can interact directly with all the documents, a lot of time is saved answering requests for information.
It also increases the likelihood of an employee looking for an answer, because the effort needed to determine whether a procedure is properly followed is much lower than navigating dozens of SharePoint sites.
The ability to do this is nearly drag-and-drop these days. Take your documents and provide them to a system that makes them searchable. The system can automatically search for you, and an LLM can explain the results. This is often referred to as RAG: Retrieval Augmented Generation. Basically, it chops your documents into smaller parts. Your question is converted into a form that can find the sections (chunks) that are similar (relevant). Those sections go to the language model, which then creates a tailored answer rather than a general answer based on its training data.
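The retrieval half of that loop is simple enough to sketch. This is a toy illustration, not a real implementation: it stands in a bag-of-words cosine similarity for the embedding model, and every function name here is hypothetical.

```python
from collections import Counter
from math import sqrt

def chunk(text, size=40):
    """Split a document into overlapping word windows ("chunks")."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size // 2)]

def similarity(a, b):
    """Cosine similarity on word counts -- a crude stand-in for embeddings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    return sorted(chunks, key=lambda c: similarity(question, c), reverse=True)[:k]

def build_prompt(question, context_chunks):
    """Paste the retrieved chunks into the prompt the LLM actually sees."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}"
```

Notice how many places this can quietly go wrong: the chunking can split a procedure mid-list, and the similarity step can miss the one relevant section entirely. Real systems swap in proper embeddings, but the failure points stay the same.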
And I guarantee that if you do this, the results will not only look good, they’ll be impressively good. As a result, these are often in demos because, well, it’s a solution to a problem and it works well.
What’s trickier is whether it works well for you and the scenarios you will use it for.
You have to figure out, for yourself, whether it’s working for you.
Identify what failure looks like
Failures that guard against Looks Good solutions
The details are often in the failures.
Let’s look at possible failures:
The chatbot will sometimes miss some of the relevant sources for answers (which means it depends on some sort of information retrieval system).
Think of this as missing one or two steps of a required procedure for submitting an expense report.
The chatbot produces reasonably accurate answers to standard queries. But technical questions about in-house acronyms, procedure names, industry-specific language, and other tribal knowledge go wrong.
Data handling is common across all industries. But specific requirements for handling sensitive customer data, health care data, and student records can be specific to the niche the company works in. The chatbot will say you don’t need to follow a standard when you really do.
The chatbot omits only the more esoteric requirements from required procedures.
The chatbot just never gets the specifics right. Those few essential requirements for your industry or company never show up.
In short, it works well for 80% of the scenarios. That leaves 20% for failure, which can be a completely acceptable level. How much failure you can tolerate, and what types of failures you can tolerate, is specific to your needs and wants.
General Looks Good failures:
Does not provide complete information (often called recall)
Provides irrelevant information (often called precision)
Does not handle idiosyncratic knowledge
Answers a different question altogether
Sometimes gives the right answer, other times the wrong answer, based on small changes in the wording of your question.
This is a very common problem. Ask the same question different ways for all of your tests.
Where failure occurs in the processes that generate an answer
Let’s look at some of the areas for failure:
The information is in the HR manuals, but the information retrieval fails to provide them to the model (a failure in the RAG system).
The chatbot does not provide an answer when there should be one
The chatbot tries to assemble an answer from partial information
The chatbot fabricates an answer to satisfy the user’s question.
The RAG process is working, but the Gen AI model is not using it in the way you need it to.
The chatbot has the information, but does not provide it in full.
The chatbot has the information, but changes important wording.
The chatbot has the information, but omits some details if the questions are general.
The chatbot has the information, but it merges details that are separate (like a checklist).
General process failures:
These scenarios are common issues with Gen AI and can be broken down into these general failures. To identify if they are happening to you, ask these three questions.
How do we know the chatbot is getting all the information that is needed?
What does Gen AI do when it lacks information?
When does Gen AI change the information?
All three areas will happen in any Gen AI product you build.
There’s no escaping these scenarios, only mitigating them.
General Assumptions lurking in the background
The 1st assumption: We have all the information for a user’s question in the database/RAG/index.
This is the primary assumption, and it impacts many of the projects I’ve seen. In reality, when an answer is bad, it’s because that information doesn’t exist or isn’t clearly explained in the documents.
If your documents don’t explain the meanings of critical words, the chatbot will default to a global understanding of them. Definitions are as important for Gen AI models as they are for humans.
Instructional documents, like HR documents, are often caught between two worlds: the general case and the company- or industry-specific technical scenarios. There’s a gap between those two, and that’s often where a user wants the most help.
The 2nd assumption: The information defines and describes
People love acronyms, and rightly so. Giving specific names to processes or parts of an organization provides meaningful, specific definitions that improve how we work and how we collaborate. Acronyms are also inscrutable to new hires, who take months to figure out what common terms mean.
People also love adapting common phrases and concepts to specific implementations that don’t resemble the original concept. Case in point: OKRs versus KPIs. I can go to John Doerr’s book, Measure What Matters, and provide you definitions of both and the contrast between them. I can’t tell you the difference between the two at the companies I’ve worked for.
And people love jargon-y phrases like “shift-left”. I’ll be honest, I’ve heard this term for over half a decade with respect to programmers and cybersecurity. I still have no clue what it actually means, much less how this makes sense to people in the meetings. From questions I’ve asked, it’s from a convention that shows software development cycles with developers being on the left side and the final product release on the right. Shift-left means developers work on security issues early in the process. I’m not a developer, and this phrase made no sense. It came from a convention that is common in developer organizations, not data science (which uses circles).
We assume we have complete information when we read these documents because we have a whole world of context within the company of what these words mean to us and the people reading it. OKRs (which is also an abbreviation), shift-left, ABZ… Not sure what that last one is? Could be “A, then B, all the way to Z. Every journey begins with a single step” as an inspirational quote poster. Or something I just made up.
Failures often occur because we intuitively read into a document the context we have. Then we fail to supply that context to a Gen AI solution.
The 3rd assumption: We assume success is because of something we did. In reality, the process that produces the answer depends on everything that happened before it.
That 80% success may have nothing to do with the documents you’ve provided. The model is just supplying a generic answer that’s true across 80% of organizations and HR manuals.
The documents you gave it might introduce some wording or additional details, but the default response would have worked as well.
You only know this if you try both the generic and the grounded responses.
Just straight up ask the model to do something with absolutely nothing supporting its answer. If that gets to 60% or 70%, or even 80% passing, then your solution may very well not be a solution at all. If the pass rate is the same as your grounded chatbot using the company’s manuals, then what have you gained by storing, retrieving, and implementing a chatbot that costs money? Probably little to nothing.
In experimentation, this is commonly referred to as a null hypothesis. In practice, it starts with trying the laziest solution and seeing how well it does.
Ignore all the hard work and just ask a general chatbot like Gemini for its answer. That’s your starting point. Get those answers, and grade them. Find what worked, and what failed. Classify those failures. Then, focus on correcting those failures with your project.
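Once you have hand-graded both sets of answers, the comparison itself is just bookkeeping. A minimal sketch, assuming you record a pass/fail boolean per question for each system (the function name and dictionary keys are my own invention):

```python
def compare_to_baseline(questions, baseline_grades, grounded_grades):
    """Tally where grounding actually helped, question by question.

    `baseline_grades` are pass/fail booleans for the generic (ungrounded)
    chatbot; `grounded_grades` are for your RAG system. Both were graded
    by hand -- this just organizes the comparison.
    """
    pairs = list(zip(questions, baseline_grades, grounded_grades))
    return {
        "baseline_pass_rate": sum(baseline_grades) / len(questions),
        "grounded_pass_rate": sum(grounded_grades) / len(questions),
        # Questions only the grounded system passes: the value you added.
        "gained": [q for q, b, g in pairs if g and not b],
        # Questions grounding made worse: a failure class worth classifying.
        "lost": [q for q, b, g in pairs if b and not g],
    }
```

The "gained" list is the honest measure of your project. If it is short and the pass rates are close, the generic model was doing the work all along.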
Break the cycle of hidden assumptions
Identify and test your assumptions with scenarios
Scenarios are very helpful for defining success and failure. In short, you create a hypothetical situation where someone needs information, wants creative help with a project, or needs notes taken.
Define a scenario by filling out these categories:
Who is using the tool?
What problem do they have?
What is their expectation for a good answer?
What action do they take after an answer?
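If it helps to keep the scenarios uniform across a team, the four categories above fit naturally into a small record. A sketch only; the field names and the example scenario are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One hypothetical situation to grade against (fields illustrative)."""
    who: str           # who is using the tool?
    problem: str       # what problem do they have?
    expectation: str   # what does a good answer look like to them?
    next_action: str   # what do they do after getting an answer?

new_hire_expenses = Scenario(
    who="first-week hire",
    problem="doesn't know how to submit an expense report",
    expectation="every required step, in order, with the right form names",
    next_action="files the report without asking a coworker",
)
```

Writing scenarios as structured records rather than prose makes it harder to skip a category, and gives you a list you can iterate over when grading.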
Action-oriented scenarios are incredibly useful for identifying what you need and what success looks like. With more than one scenario, you can identify how assumptions and requirements change between scenarios.
Part of the process of testing assumptions and creating scenarios is finding the boundaries of success. You can see the scope of failures and define the limits of your solution. If the solution is limited and doesn’t capture common or necessary scenarios, then it’s not a good solution! Other solutions are possible, and you can skip this one as not achievable with what you have.
When asked why you’re moving on to a different project, you’re armed with the best response: why it won’t work.
In a scenario, we assume the chatbot knows our specific situation because we intuitively know the surrounding context within the company and within the user’s expectations.
Write things down and collaborate
To use the guidelines above and create scenarios, you must write things down. If you’re working with a team, list the assumptions you make, look at the data you have, and ask “does this cover everything?” from the perspective of a first-day hire.
Do not try to assign small parts to different people, and then try to merge the insights into a bigger picture. Get everyone on the same page by collaborating on scenarios that each person can come up with. The scenarios are the common ground that we all start from.
After assembling the scenarios, collaborate on which assumptions are common to them all and which are unique. What does failure look like across all scenarios, and what’s specific to a few (or just one)?
Now it’s easier to construct your set of measurements and experiments, because you know what you’re testing and have agreed on why it’s important.
Measuring things
The metric you use doesn’t have to be sophisticated to start with. One thing to keep in mind is whether it’s symmetric: a symmetric scale has the same number of options for negative grades as for positive grades, anchored around a neutral middle option.
Stick with an odd number of 3, 5, or 7 options. Give each option a name, not just a raw number, so that it’s descriptive and intuitive. Instead of 1-5, have “Strongly Approve”, “Approve”, “Neither Approve nor Disapprove”, “Disapprove”, and “Strongly Disapprove”. You’ll recognize it from many of the polls and surveys you’ve been bombarded with.
With an odd number, you can group the grades into positive, negative, and neutral (think of Gallup Poll approval ratings). But the shift from “Approve” to “Strongly Approve” is a valuable insight in itself. An odd number of named options gives you flexibility.
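That grouping is mechanical once the named options map to signed scores. A minimal sketch of the idea, with the scale values and function name being my own choices:

```python
# A symmetric 5-point scale: named options, anchored on a neutral middle.
SCALE = {
    "Strongly Disapprove": -2,
    "Disapprove": -1,
    "Neither Approve nor Disapprove": 0,
    "Approve": 1,
    "Strongly Approve": 2,
}

def summarize(grades):
    """Group named grades into positive / neutral / negative counts,
    Gallup-style, while the raw grades keep the finer shifts visible."""
    scores = [SCALE[g] for g in grades]
    return {
        "positive": sum(s > 0 for s in scores),
        "neutral": sum(s == 0 for s in scores),
        "negative": sum(s < 0 for s in scores),
    }
```

Keep the named grades as your source of truth and derive the three-way grouping from them, never the reverse, so a migration from “Approve” to “Strongly Approve” is still observable later.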
You’ve got results, now what?
Excel is your friend, and if you have Microsoft Office’s suite, Copilot as well. You will be tempted to have this all in a single Excel workbook, have things linked, and have results displayed side by side in a dashboard or chart. Resist this impulse.
Instead, put the results in a folder for each set of results that you (and your team) grade. Have the Excel sheet in that folder along with all of the outputs. Plus, have a Word document explaining the circumstances, assumptions, and what you’re looking for in the experiment. That document will evolve over time. Track what you used in the past so you don’t make assumptions about what you did 6 months ago.
Finally, have a comments column. Write in this, a lot. Provide explanations of what was and what wasn’t good in each response. Tease apart both of these for each output you grade. Nothing is perfect, so make it clear what worked and what didn’t. You will want this for identifying conceptual trends.
Prior to Gen AI, you would have to read through these manually, take notes, and tease out what’s common. It’s a lot of work, and it’s a great instance where Gen AI can summarize trends, themes, common problems, and common successes. It’s a great use of Gen AI because you’re looking for direction, not decisions.
Iterate or abandon
If you have 10 or 15 scenarios and you measure improvement against them, you can track improvements of different responses in the context of their assumptions. Sure, you’re getting better results in the aggregate, but 5 out of 10 scenarios don’t pass. You’re getting “improvement”. You’re getting feedback that looks good, but it isn’t actually a good solution.
When you iterate, you should see improvement in both the number of good responses from the solution and the scenario coverage. If scenarios are not passing or not getting good enough grades, then stop. Either limit the scope of the solution, or move to another solution.
When you inevitably need to justify why, you’re not just falling back on a single aggregated grade. How do you justify saying “a 5% improvement is not feasible”? It’s much easier, and much more rigorous, to say “We needed a solution to cover these 10 scenarios in which a user would use it. Only 6 out of 10 succeed, and the remaining 4 have not shown reasonable improvement.”
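Scenario coverage is a simple enough tally to automate. A sketch under stated assumptions: the input structure (scenario name mapped to per-response pass/fail booleans) and the threshold rule are illustrative, not prescriptive.

```python
def scenario_coverage(grades_by_scenario, pass_threshold=1.0):
    """Report which scenarios pass, not just the aggregate grade.

    `grades_by_scenario` maps a scenario name to the pass/fail booleans
    of its graded responses. A scenario passes when its pass fraction
    meets `pass_threshold` (default: every response must pass).
    """
    passing = {
        name: sum(g) >= pass_threshold * len(g)
        for name, g in grades_by_scenario.items()
    }
    covered = sum(passing.values())
    failing = [name for name, ok in passing.items() if not ok]
    return covered, len(passing), failing
```

Reporting "6 of 10 scenarios covered, these 4 failing" is exactly the justification described above, generated directly from the grades rather than reconstructed from memory.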