Learn how to create and run acceptance tests to prevent regressions in your AI agent's behavior.
What is Acceptance Testing for AI Agents?
AI agents produce non-deterministic outputs, making traditional testing difficult. Acceptance testing helps you:
- Prevent regressions: Ensure new versions don't break existing behavior
- Validate changes: Test improvements before deploying
- Document expected behavior: Tests serve as living documentation
- Build confidence: Deploy knowing critical paths work
Accessing Acceptance Testing
- Log in to console.flutch.ai
- Navigate to Agents → Select your agent
- Go to Acceptance Testing tab
URL: https://console.flutch.ai/agents/{agentId}/acceptance-testing
Creating Test Buckets
Test Buckets are collections of related test cases (like test suites).
Create a Bucket
- Click Create Bucket
- Enter details:
- Name: "Customer Support Flows"
- Description: "Tests for common customer questions"
- Tags: support, customer, FAQ
- Click Create
Bucket Organization
Organize buckets by:
- Feature area: Onboarding, Billing, Support
- User journey: New user, Returning user, Power user
- Complexity: Simple, Medium, Complex
- Version: v1.0.0, v2.0.0
Example buckets:
```bash
📦 Onboarding Questions
📦 Billing Inquiries
📦 Technical Support
📦 Product Information
📦 Edge Cases
```
Creating Test Cases
Add Test Case to Bucket
- Open bucket
- Click Add Test Case
- Fill in test details
Test Case Fields
Input Message:
```bash
What are your pricing plans?
```
Expected Output Type:
- Contains: Output must contain specific text
- Exact Match: Output must match exactly (rare for LLMs)
- Regex: Output must match regex pattern
- Not Contains: Output must NOT contain text
Expected Output (example for "Contains"):
```bash
Basic plan
Pro plan
Enterprise plan
```
Validation Rules:
- ✅ Logical AND: All items must be present
- ⚠️ Partial match: Substring matching, case-insensitive by default
- 🔍 Multiple values: One per line
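Conceptually, a "Contains" check with multiple values behaves like the minimal Python sketch below (illustrative only, not the platform's implementation): case-insensitive substring matching combined with a logical AND across all expected values.

```python
def contains_all(output: str, expected_values: list[str]) -> bool:
    """Pass only if every expected value appears in the output (logical AND).

    Each value is treated as a case-insensitive substring check.
    """
    haystack = output.lower()
    return all(value.lower() in haystack for value in expected_values)

# All three plan names must appear, regardless of casing or surrounding text
output = "We offer three plans: Basic, Pro, and Enterprise."
print(contains_all(output, ["basic", "pro", "enterprise"]))  # True
```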
Advanced Test Cases
Regex validation:
bashInput: "What's the price of the Pro plan?" Expected (Regex): \$\d+/month
Matches: "$49/month", "$99/month", etc.
Metadata validation:
json{ "confidence": "> 0.7", "sources_cited": true, "response_length": "< 500" }
Multi-turn conversations:
```bash
Turn 1:
  Input: "I need help with billing"
  Expected: "billing" (contains)

Turn 2:
  Input: "How do I cancel?"
  Expected: "cancel", "subscription" (contains)
```
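A multi-turn test sends each message in order within the same conversation and checks the expectations per turn, roughly like the sketch below (`send_message` is a hypothetical helper standing in for a call to your agent, not part of the Flutch SDK):

```python
def run_multi_turn(turns, send_message):
    """Run a list of (input, expected_terms) turns in a single conversation."""
    conversation_id = None
    for user_input, expected_terms in turns:
        # send_message is assumed to return (reply_text, conversation_id)
        reply, conversation_id = send_message(user_input, conversation_id)
        missing = [t for t in expected_terms if t.lower() not in reply.lower()]
        if missing:
            return False, f"Turn {user_input!r} is missing: {missing}"
    return True, "all turns passed"

turns = [
    ("I need help with billing", ["billing"]),
    ("How do I cancel?", ["cancel", "subscription"]),
]
```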
Running Tests
Run Single Test
- Open test case
- Click Run Test
- View results
Result details:
- ✅ Passed: Output met expectations
- ❌ Failed: Output didn't match
- ⚠️ Warning: Partial match or low confidence
Run Entire Bucket
- Open bucket
- Click Run All Tests (top right)
- Watch progress (tests run in parallel)
- View summary
Summary shows:
- Total tests: 25
- Passed: 23 (92%)
- Failed: 2 (8%)
- Duration: 45 seconds
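Conceptually, a bucket run executes its test cases in parallel and aggregates the results, similar to this sketch (illustrative Python; `run_test_case` stands in for sending the input to your agent and checking the expectation):

```python
from concurrent.futures import ThreadPoolExecutor

def run_bucket(test_cases, run_test_case, max_workers=8):
    """Run all test cases concurrently and return a pass/fail summary."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_test_case, test_cases))  # True = passed
    passed = sum(1 for passed_flag in results if passed_flag)
    total = len(results)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": f"{100 * passed // total}%",
    }
```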
Scheduled Test Runs
Automate testing:
- Bucket settings → Schedule
- Choose frequency:
- Daily (9 AM)
- Before each deployment
- On demand only
- Set notification email
- Save
Tests run automatically on the chosen schedule, and the results are emailed to you.
Analyzing Test Results
Test Run Details
Click on test run to see:
Overview:
- Run ID: run_abc123
- Started: 2025-01-20 14:30 UTC
- Duration: 45s
- Status: 2 failures
Per-Test Results:
```bash
✅ Pricing question - Passed
   Input: "What are your pricing plans?"
   Expected: Basic, Pro, Enterprise (contains)
   Actual: "We offer three plans: Basic ($9/mo), Pro ($49/mo), and Enterprise (custom pricing)."
   Match: 100%

❌ Cancellation flow - Failed
   Input: "How do I cancel my subscription?"
   Expected: "billing", "cancel" (contains)
   Actual: "You can manage your account settings here."
   Match: 50% (missing "billing")
   Diff:
   - Expected: "billing"
   + Actual: "account settings"
```
Diff Viewer
For failures, see side-by-side comparison:
```diff
Expected Output:
+ To cancel, go to billing settings

Actual Output:
- You can manage your account settings
```
The diff helps you see exactly what changed.
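If you want a comparable diff locally, for example when triaging exported results, Python's standard difflib module produces similar output:

```python
import difflib

expected = "To cancel, go to billing settings"
actual = "You can manage your account settings"

# unified_diff works on sequences of lines; here each side is a single line
for line in difflib.unified_diff([expected], [actual],
                                 fromfile="expected", tofile="actual", lineterm=""):
    print(line)
```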
Trend Analysis
View test pass rate over time:
```bash
Jan 15: 95% pass (20/21)
Jan 16: 92% pass (23/25)
Jan 17: 88% pass (22/25)  ⚠️ Declining
Jan 18: 84% pass (21/25)  🚨 Alert
```
Spot regressions early.
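A simple script over historical pass rates can raise the same kind of alert (illustrative; the threshold is an arbitrary choice):

```python
def flag_regressions(history, drop_threshold=3):
    """Flag any run whose pass rate drops more than drop_threshold points vs. the previous run."""
    alerts = []
    for (_, prev_rate), (day, rate) in zip(history, history[1:]):
        if prev_rate - rate > drop_threshold:
            alerts.append(f"{day}: pass rate fell from {prev_rate}% to {rate}%")
    return alerts

history = [("Jan 15", 95), ("Jan 16", 92), ("Jan 17", 88), ("Jan 18", 84)]
print(flag_regressions(history))  # flags Jan 17 and Jan 18
```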
Test Strategies for LLMs
Flexible Matching
LLMs phrase the same answer differently from run to run. Use flexible assertions:
❌ Bad (too strict):
bashExpected: "The price is $49 per month for the Pro plan."
✅ Good (flexible):
bashExpected (Contains): - "49" - "month" - "Pro"
Test Intent, Not Wording
❌ Bad:
bashExpected: "I apologize for the inconvenience."
✅ Good:
bashExpected (Contains): - "apolog" OR "sorry" - "inconvenience" OR "trouble"
Focus on Critical Information
Test that important facts are present:
bashInput: "What's included in the Pro plan?" Expected (Contains): - "unlimited projects" - "5 team members" - "priority support" NOT Expected (Don't test): - Exact phrasing - Greeting format - Transition words
Use Confidence Thresholds
json{ "expected_confidence": "> 0.7", "max_response_time": "< 5000ms" }
Fail the test if the agent is uncertain or too slow.
Best Practices
1. Start Small
Begin with 5-10 critical tests:
- Most common questions
- Edge cases that broke before
- Brand-critical responses
Expand over time.
2. Test User Journeys
Group related questions:
```bash
Bucket: New User Onboarding
1. "What is your product?"
2. "How do I sign up?"
3. "Do you offer a free trial?"
4. "How do I get started?"
```
3. Update Tests with Agent
When improving agent:
- Update test expectations first
- Deploy new version
- Run tests
- Verify improvements worked
4. Use Tags
Tag tests for filtering:
```bash
Tags: critical, billing, regression, v2.0.0
```
Run only critical tests before deployment:
```bash
flutch test --tags critical
```
5. Document Why
Add notes to tests:
```bash
Test: Pricing question
Why: Users complained about missing Enterprise pricing
Added: 2025-01-15
```
6. Review Failures
Not all failures are bugs:
- Agent improved response (update test)
- Test was too strict (relax expectations)
- Actual regression (fix agent)
CI/CD Integration
GitHub Actions
```yaml
- name: Run Acceptance Tests
  run: |
    flutch test run <agent-id> --bucket critical
    flutch test run <agent-id> --bucket regression
```
Fail deployment if tests fail.
Pre-Deployment Gates
Configure in Flutch:
Settings → Deployment → Require Tests:
- ✅ Run critical tests before deployment
- ✅ Block deployment if tests fail
- ⚠️ Allow deployment with warnings
Exporting Test Results
Export for reporting:
```bash
# Export as JSON
flutch test export <run-id> --format json

# Export as CSV
flutch test export <run-id> --format csv
```
Useful for:
- Stakeholder reports
- Historical analysis
- Integration with BI tools
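For example, an exported JSON run can be summarized in a few lines of Python (the field names below are assumptions about the export schema; check your exported file for the actual structure):

```python
import json

# "results" and "status" are assumed field names, not a documented schema
with open("run_abc123.json") as f:
    run = json.load(f)

results = run["results"]
passed = sum(1 for result in results if result["status"] == "passed")
print(f"{passed}/{len(results)} tests passed ({100 * passed // len(results)}%)")
```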
Common Testing Patterns
Regression Tests
After fixing a bug:
```bash
Input: [message that caused bug]
Expected: [correct behavior]
Tags: regression, bug-123
```
Golden Examples
Perfect responses to keep:
bashInput: "What makes your product different?" Expected (Contains): - "unique feature X" - "competitive advantage Y" Tags: golden, marketing
Boundary Tests
Edge cases:
bashInput: "" (empty message) Expected: "Please ask a question" Input: "a" * 10000 (very long) Expected: "Your message is too long"
Multi-Language Tests
If supporting multiple languages:
```bash
Bucket: Spanish Support
Input: "¿Cuáles son sus planes de precios?"
Expected: "Basic", "Pro", "Enterprise"
```
Troubleshooting
"All tests failing"
Possible causes:
- Agent is down (check status)
- API key invalid (check settings)
- Model changed (responses different)
"Tests flaky"
LLM outputs vary slightly. Make tests more flexible:
- Use "Contains" instead of "Exact"
- Allow alternative phrasings
- Check metadata instead of exact text
"Tests take too long"
Optimization:
- Run tests in parallel (automatic)
- Reduce number of tests
- Use a faster model for test runs (e.g., gpt-3.5 instead of gpt-4)
Next Steps
- Debug Failed Tests: Debugging Guide
- Monitor in Production: Measures & Analytics
- Deploy with Confidence: Deploy Guide
Tip: Run tests after every significant prompt change!
Screenshots Needed
TODO: Add screenshots for:
- Test buckets list view
- Create test case form
- Test run results page
- Diff viewer for failed test
- Trend analysis chart