How to Effectively Test AI Assistants

22 Aug 2024

Introduction:

Testing AI Assistants before launch is important, but how we test is perhaps even more important. Having designed & developed chatbots for the last 6 years, I’ve seen how ineffective testing can lead to wasted time, misaligned feature prioritisation, and delayed product launches.

The challenge? Test users often behave differently from real users, leading to skewed results and misguided improvements.

In this post, I’ll share my approach to testing AI Assistants, drawing both from 6+ years of experience working in the Conversational AI space and from insights shared by industry experts.

The Problem:

Many developers spend countless hours testing their AI Assistants, only to find that the results don’t translate to real-world usage. Why? Because:

  1. Test users interact differently with AI Assistants compared to actual users.

  2. Testing environments often fail to replicate real-world scenarios.

  3. As a result, we often optimise for the wrong metrics based on artificial test conditions.

To illustrate this, let’s look at an example from the UNPARSED conference I recently attended. Nikoletta Ventoura and Spencer Hazel presented their work on a Voice Assistant for post-surgery patient check-ins. They found a stark difference between tester and patient behaviour:

When asked, “Before we finish, do you have any questions?”:

  • Testers typically gave simple yes/no responses or asked direct questions.

  • Actual patients used this as an opportunity to share concerns or provide general feedback.

Nikoletta Ventoura and Spencer Hazel also found that testers’ behaviour depends heavily on the instructions they are given: with an explicit instruction to ask questions, 61% of testers asked questions; without such an instruction, only 12% did.

This difference between how testers and actual users behave can lead to misaligned development priorities and suboptimal user experiences.

So, how can we test more effectively?

How to Test AI Assistants Effectively:

1. Analyse Existing Conversations

Before diving into development, I always start by analysing past conversations with real users, if available. This provides invaluable insights into actual user behaviour and needs.

Example: In a recent project building a customer service chatbot for a property management company, we analysed hundreds of transcripts of conversations between CS agents and real users. This revealed common pain points and language patterns that we incorporated into our AI Assistant’s training data.

Link to the project case-study: https://parslabs.org/portfolio/property-management-chatbot
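To make this step concrete, here is a minimal sketch of how such an analysis could start. It assumes transcripts exported as JSON files containing a list of speaker/text turns, and the file path is hypothetical; the logic is intentionally simple and just surfaces the most frequent user messages.

```python
from collections import Counter
import json
import re

# A minimal sketch, assuming each transcript is a JSON list of
# {"speaker": "agent" | "user", "text": "..."} turns.
def load_user_messages(path: str) -> list[str]:
    with open(path, encoding="utf-8") as f:
        transcript = json.load(f)
    return [turn["text"] for turn in transcript if turn["speaker"] == "user"]

def top_user_phrases(messages: list[str], n: int = 20) -> list[tuple[str, int]]:
    # Normalise lightly (lowercase, strip punctuation) so that
    # "Where is my invoice?" and "where is my invoice" count together.
    normalised = [re.sub(r"[^\w\s]", "", m.lower()).strip() for m in messages]
    return Counter(normalised).most_common(n)

if __name__ == "__main__":
    messages = load_user_messages("transcripts/2024-07.json")  # hypothetical path
    for phrase, count in top_user_phrases(messages):
        print(f"{count:>4}  {phrase}")
```

In practice you would group paraphrases together (for example by clustering), but even raw frequency counts quickly show which topics dominate real conversations.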

2. Be Selective With Your Test Participants

While internal testing is valuable, recognise its limitations. Your team’s familiarity with the product can blind you to issues new users might face.

Whenever possible, involve external testers who have lived experience related to your AI Assistant’s goal.

Example: For a mental health support chatbot aimed at people who had experienced discrimination, we asked people with that lived experience to test our AI Assistant anonymously. Their interactions were notably more authentic than the conversations our internal team had with the chatbot.

Link to the project case-study: https://parslabs.org/portfolio/anti-discrimination-chatbot

Also, avoid priming testers with overly specific instructions like “Ask questions about X”. Instead, provide general guidelines that allow for more natural interactions.

Example: Instead of saying “Ask the chatbot about product features,” we might say, “Imagine you’re considering purchasing this product. Interact with the chatbot as you normally would when shopping online.”

3. Don’t Aim For Perfection Before Launch

Release your Minimum Viable Product (MVP) as soon as it’s reasonably functional, then iterate based on real user data.

If you aren’t comfortable with a full release, consider releasing your AI Assistant to a small group of real users before the public launch. This allows you to gather authentic interaction data and make necessary adjustments.
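If you want to limit exposure during such a soft launch, a percentage-based rollout gate is one simple option. The sketch below is an illustration only, assuming users have stable IDs; the percentage and user IDs are hypothetical.

```python
import hashlib

ROLLOUT_PERCENT = 5  # hypothetical: route 5% of users to the new assistant

def in_rollout(user_id: str, percent: int = ROLLOUT_PERCENT) -> bool:
    # Hash the user ID into a bucket 0-99 so the same user is
    # consistently in or out of the rollout group across sessions.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

if __name__ == "__main__":
    for uid in ["user-001", "user-002", "user-003"]:  # hypothetical IDs
        route = "new AI Assistant" if in_rollout(uid) else "existing flow"
        print(f"{uid} -> {route}")
```

Hashing the user ID keeps each user’s experience stable, so nobody flips between the old and new flow from one session to the next.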

Example: On some of my past projects, our team rigorously tested the AI Assistant against all the different types of questions we could think of, only to find out after launch that:

1) real users were mostly interested in a small percentage of the questions we had covered and already written content for, or

2) real users asked completely different questions, in completely different words.

This often leads to a redesign of your conversation flow. Lessons were learnt.
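One way to catch this gap sooner after launch is to compare real user questions against the questions you prepared content for. The sketch below is a simplified illustration: the example questions are hypothetical, and the similarity check uses Python’s standard-library difflib, which you would likely replace with embedding-based matching in practice.

```python
from difflib import SequenceMatcher

# Hypothetical examples of questions you wrote content for pre-launch.
prepared_questions = [
    "what are your opening hours",
    "how do i report a maintenance issue",
    "how can i pay my rent online",
]

# Hypothetical sample of real post-launch user questions.
real_user_questions = [
    "my heating broke, who do i call",
    "can i pay rent with a bank transfer",
    "when are you open",
]

def best_match(question: str, candidates: list[str]) -> tuple[str, float]:
    # Return the prepared question most similar to the real one,
    # using a simple character-level similarity ratio.
    scored = [(c, SequenceMatcher(None, question, c).ratio()) for c in candidates]
    return max(scored, key=lambda pair: pair[1])

for q in real_user_questions:
    match, score = best_match(q, prepared_questions)
    status = "covered" if score > 0.6 else "GAP"
    print(f"[{status}] {q!r} -> closest prepared: {match!r} ({score:.2f})")
```

Questions flagged as gaps tell you where real users’ wording or interests diverge from what you tested, which is exactly the signal to drive your next iteration.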

Conclusion:

Effective testing of AI Assistants requires a delicate balance between controlled testing and real-world usage. By focusing on authentic user experiences, recruiting relevant testers, and iterating based on actual usage data, we can create AI Assistants that truly meet user needs.

Remember, the goal isn’t to create a perfect Assistant before launch, but to create a solid foundation that can be rapidly improved based on real user interactions. By following these steps, you’ll be well on your way to developing an AI Assistant that resonates with your target audience and provides genuine value.

What strategies have you found effective in testing AI Assistants? I’d love to hear about your experiences in the comments!

— —

Hi, my name is Lena Shakurova. I’m a Chatbot Developer & Conversation Designer and a co-founder of ParsLabs, a multidisciplinary chatbot development agency. I share my experience working in the Conversational AI space on LinkedIn: https://www.linkedin.com/in/lena-shakurova/