How to evaluate LLMs: Full LLM evaluation guide
May 20, 2025
How to evaluate LLMs: Full LLM Evaluation Guide
All you need to know if you are just getting started and are confused.
Building with LLMs feels like trying to ship a product on quicksand. Same input, different output. Hallucinations. Inconsistency. And don’t even get me started on jailbreaks.
This post is a ractical guide to making LLMs more reliable, based on a talk I recently gave at PyData London. I’ll walk you through a framework we use at ParsLabs to go from experiments to production. It’s been shaped by 100+ real-world AI automation projects, and yes, it’s still evolving, just like everything in this space.
Let’s dive in.
Why this matters
A few weeks ago, I shared a list of 61 LLM eval tools on LinkedIn. That post blew up — 40,000+ views, 200+ submissions.
Here’s what I learned:
Most teams are still not doing any LLM evaluation
Many rely on “vibe checks” (just reading the output and going by gut)
A surprising number are writing custom Python eval code instead of using libraries
Very few use structured, repeatable frameworks
This tells me one thing: we’re still winging it. We need a system.
The risks of skipping evaluation
If you’re not evaluating LLM output, you’re launching blind. Here’s what can go wrong:
Hallucinations → confident lies that waste user time and destroy trust
Inconsistent output → users get confused, support tickets pile up
Harmful responses → bias, stereotypes, misinformation
Jailbreaks → attackers bypass safety filters
Data/PII leaks → legal and reputational nightmares
LLMs are not deterministic. That’s the core problem. You can’t “hope” your app behaves. You need systems that actively test, monitor, and catch issues — before users do.
The 3-Stage Framework: Experiment → Monitor → Improve
Let’s break it down.
1. Experiment
1.1. Define what “good” looks like
Before writing prompts, define your test set. It should:
Cover happy paths, edge cases, sensitive topics, jailbreaking attempts
Include reference answers or custom criteria
Be rooted in real examples — not made-up ones
To make sure your test set represents real world data:
Invite subject matter experts to give you input on what good means in your domain
Consult existing documentation to find hints, e.g. customer support scripts, HR docs, onboarding documents
Review past conversations or production logs
Tip: Build tests before prompts. Yes, even for LLMs.
1.2. Choose your metrics
There’s no universal “best” metric — choose what fits your use case. A few categories:
Simple checks: regex, syntax match
Statistical metrics: precision, recall, BLEU, ROUGE
ML-based: BERTScore, sentiment analysis
LLM-as-a-judge: use one LLM to evaluate another’s output against custom criteria
Example: “Was this response polite?” → Yes/No + explanation.
Start simple. Select just a few metrics that make sense to you and go test it. You can evolve over time.
1.3. Run experiments
When testing:
Start with vibe check
Then run automated evals
Finish with manual review (to verify your metrics make sense)
Use tools like:
Prompt playgrounds with version control (e.g., Agenta, LangWatch, Langfuse, e.t.c.)
Unit, integration, and end-to-end tests
LLM test libraries (e.g., DeepEval, RAGAs)
2. Monitor (in production)
2.1. Add real-time guardrails
Two types:
Input guardrails: catch bad user prompts early (e.g., profanity, jailbreak attempts)
Output guardrails: catch bad generations (e.g., hallucinations, unsafe content)
Don’t overdo real-time filters — they add latency. Only use for high-risk apps. Instead, bake most of your checks into the test set.
Tools:
OpenAI and AWS moderation APIs
Guardrails libraries (e.g., Guardrails AI)
Custom LLM-as-a-judge filters
Always check for hallucinations and syntax issues before showing output.
2.2. Log everything
Track:
Input, output, and context
Prompt version and model
Any fallback or retry logic
User feedback (explicit or implicit)
Without logs, you can’t fix what breaks.
2.3. Set up alerts
Don’t wait for users to tell you it’s broken. Use alerts for:
No-response loops
Dangerous output
Repeated fallback events
3. Improve (post-production)
Logs are gold. Review them to:
Spot failures and blind spots
Update your test set
Track custom KPIs
Debug regressions after model or prompt changes
Reminder: your test set is never done. It evolves with your product.
Final takeaways
As developers, it's our responsibility to make sure LLM outputs are reliable.
Here’s what to remember:
Start with tests, not prompts
Measure what matters for your use case
Use automation to scale judgment, not replace it
Log and analyze everything in production
Improve constantly — your test set is a living thing
And most importantly: reliability is your job, not OpenAI’s. If your product misbehaves, users will blame you, not the model vendor.
Own it.