Acceptance Testing

Learn how to create and run acceptance tests to prevent regressions in your AI agent's behavior.

What is Acceptance Testing for AI Agents?

AI agents produce non-deterministic outputs, making traditional testing difficult. Acceptance testing helps you:

  • Prevent regressions: Ensure new versions don't break existing behavior
  • Validate changes: Test improvements before deploying
  • Document expected behavior: Tests serve as living documentation
  • Build confidence: Deploy knowing critical paths work

Accessing Acceptance Testing

  1. Log in to console.flutch.ai
  2. Navigate to Agents → Select your agent
  3. Go to Acceptance Testing tab

URL: https://console.flutch.ai/agents/{agentId}/acceptance-testing

Creating Test Buckets

Test Buckets are collections of related test cases (like test suites).

Create a Bucket

  1. Click Create Bucket
  2. Enter details:
    • Name: "Customer Support Flows"
    • Description: "Tests for common customer questions"
    • Tags: support, customer, FAQ
  3. Click Create

Bucket Organization

Organize buckets by:

  • Feature area: Onboarding, Billing, Support
  • User journey: New user, Returning user, Power user
  • Complexity: Simple, Medium, Complex
  • Version: v1.0.0, v2.0.0

Example buckets:

bash
📦 Onboarding Questions
📦 Billing Inquiries
📦 Technical Support
📦 Product Information
📦 Edge Cases

Creating Test Cases

Add Test Case to Bucket

  1. Open bucket
  2. Click Add Test Case
  3. Fill in test details

Test Case Fields

Input Message:

bash
What are your pricing plans?

Expected Output Type:

  • Contains: Output must contain specific text
  • Exact Match: Output must match exactly (rarely useful for LLM outputs)
  • Regex: Output must match regex pattern
  • Not Contains: Output must NOT contain text

Expected Output (example for "Contains"):

bash
Basic plan
Pro plan
Enterprise plan

Validation Rules:

  • ✅ Logical AND: All items must be present
  • ⚠️ Partial match: Substring matching, case-insensitive by default
  • 🔍 Multiple values: One per line
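
To make these rules concrete, here is a minimal sketch of a "Contains" check in Python. It only illustrates the logic described above (logical AND over one value per line, case-insensitive); the function and its return format are made up for this example and are not Flutch's implementation.

python
def contains_all(output: str, expected_values: list[str]) -> dict:
    """Check that every expected value appears in the output (logical AND)."""
    lowered = output.lower()
    # Case-insensitive substring check, mirroring the default behaviour above
    missing = [value for value in expected_values if value.lower() not in lowered]
    return {
        "passed": not missing,   # all values must be present
        "missing": missing,      # values that were not found
        "match": 1 - len(missing) / len(expected_values) if expected_values else 1.0,
    }

# Example: the pricing test case, checked against a sample response
result = contains_all(
    "We offer three plans: Basic ($9/mo), Pro ($49/mo), and Enterprise (custom pricing).",
    ["Basic", "Pro", "Enterprise"],
)
# result["passed"] is True; result["match"] is 1.0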

Advanced Test Cases

Regex validation:

bash
Input: "What's the price of the Pro plan?"

Expected (Regex):
\$\d+/month

Matches: "$49/month", "$99/month", etc.
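
If you want to sanity-check a pattern before saving it as a test, a quick local run is enough. The snippet below simply exercises the pattern above with Python's re module; it says nothing about how Flutch itself evaluates regexes.

python
import re

# The pattern from the example above: a dollar amount followed by "/month"
pattern = re.compile(r"\$\d+/month")

for sample in ["$49/month", "$99/month", "Contact sales for pricing"]:
    print(sample, "->", bool(pattern.search(sample)))
# $49/month -> True
# $99/month -> True
# Contact sales for pricing -> False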

Metadata validation:

json
{
  "confidence": "> 0.7",
  "sources_cited": true,
  "response_length": "< 500"
}
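
The threshold strings ("> 0.7", "< 500") read as simple comparisons against the response metadata. Below is a small sketch of that idea in Python; the metadata dict and rule format are illustrative assumptions, not Flutch's schema.

python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def check_rule(value, rule):
    """Evaluate a rule like '> 0.7', or compare literals like True directly."""
    if isinstance(rule, str):
        parts = rule.split()
        if len(parts) == 2 and parts[0] in OPS:
            return OPS[parts[0]](value, float(parts[1]))
    return value == rule

# Hypothetical metadata returned alongside the agent's response
metadata = {"confidence": 0.82, "sources_cited": True, "response_length": 312}
rules = {"confidence": "> 0.7", "sources_cited": True, "response_length": "< 500"}

passed = all(check_rule(metadata[key], rule) for key, rule in rules.items())  # True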

Multi-turn conversations:

bash
Turn 1:
  Input: "I need help with billing"
  Expected: "billing" (contains)

Turn 2:
  Input: "How do I cancel?"
  Expected: "cancel", "subscription" (contains)

Running Tests

Run Single Test

  1. Open test case
  2. Click Run Test
  3. View results

Result details:

  • ✅ Passed: Output met expectations
  • ❌ Failed: Output didn't match
  • ⚠️ Warning: Partial match or low confidence

Run Entire Bucket

  1. Open bucket
  2. Click Run All Tests (top right)
  3. Watch progress (tests run in parallel)
  4. View summary

Summary shows:

  • Total tests: 25
  • Passed: 23 (92%)
  • Failed: 2 (8%)
  • Duration: 45 seconds

Scheduled Test Runs

Automate testing:

  1. Bucket settings → Schedule
  2. Choose frequency:
    • Daily (9 AM)
    • Before each deployment
    • On demand only
  3. Set notification email
  4. Save

Tests run automatically and the results are emailed to you.

Analyzing Test Results

Test Run Details

Click on a test run to see:

Overview:

  • Run ID: run_abc123
  • Started: 2025-01-20 14:30 UTC
  • Duration: 45s
  • Status: 2 failures

Per-Test Results:

bash
✅ Pricing question - Passed
   Input: "What are your pricing plans?"
   Expected: Basic, Pro, Enterprise (contains)
   Actual: "We offer three plans: Basic ($9/mo), Pro ($49/mo),
           and Enterprise (custom pricing)."
   Match: 100%

❌ Cancellation flow - Failed
   Input: "How do I cancel my subscription?"
   Expected: "billing", "cancel" (contains)
   Actual: "You can manage your account settings here."
   Match: 0% (missing "billing", "cancel")

   Diff:
   - Expected: "billing", "cancel"
   + Actual: "account settings"

Diff Viewer

For failures, see side-by-side comparison:

diff
Expected Output:
- To cancel, go to billing settings

Actual Output:
+ You can manage your account settings

Helps understand what changed.

Trend Analysis

View test pass rate over time:

bash
Jan 15: 95% pass (20/21)
Jan 16: 92% pass (23/25)
Jan 17: 88% pass (22/25) ⚠️ Declining
Jan 18: 84% pass (21/25) 🚨 Alert

Spot regressions early.
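
If you also track these counts yourself (for example from exported results), spotting a drop is a one-line comparison of consecutive pass rates. A small sketch using the numbers above:

python
# (date, passed, total) per run, as in the table above
runs = [("Jan 15", 20, 21), ("Jan 16", 23, 25), ("Jan 17", 22, 25), ("Jan 18", 21, 25)]

previous = None
for date, passed, total in runs:
    rate = passed / total
    flag = " <- declining" if previous is not None and rate < previous else ""
    print(f"{date}: {rate:.0%} pass ({passed}/{total}){flag}")
    previous = rate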

Test Strategies for LLMs

Flexible Matching

LLMs word their responses differently from run to run. Use flexible assertions:

❌ Bad (too strict):

bash
Expected: "The price is $49 per month for the Pro plan."

✅ Good (flexible):

bash
Expected (Contains):
- "49"
- "month"
- "Pro"

Test Intent, Not Wording

❌ Bad:

bash
Expected: "I apologize for the inconvenience."

✅ Good:

bash
Expected (Contains):
- "apolog" OR "sorry"
- "inconvenience" OR "trouble"

Focus on Critical Information

Test that important facts are present:

bash
Input: "What's included in the Pro plan?"

Expected (Contains):
- "unlimited projects"
- "5 team members"
- "priority support"

NOT Expected (Don't test):
- Exact phrasing
- Greeting format
- Transition words

Use Confidence Thresholds

json
{
  "expected_confidence": "> 0.7",
  "max_response_time": "< 5000ms"
}

Fail the test if the agent is uncertain or too slow.

Best Practices

1. Start Small

Begin with 5-10 critical tests:

  • Most common questions
  • Edge cases that broke before
  • Brand-critical responses

Expand over time.

2. Test User Journeys

Group related questions:

bash
Bucket: New User Onboarding
  1. "What is your product?"
  2. "How do I sign up?"
  3. "Do you offer a free trial?"
  4. "How do I get started?"

3. Update Tests with Agent

When improving the agent:

  1. Update test expectations first
  2. Deploy new version
  3. Run tests
  4. Verify improvements worked

4. Use Tags

Tag tests for filtering:

bash
Tags: critical, billing, regression, v2.0.0

Run only critical tests before deployment:

bash
flutch test --tags critical

5. Document Why

Add notes to tests:

bash
Test: Pricing question
Why: Users complained about missing Enterprise pricing
Added: 2025-01-15

6. Review Failures

Not all failures are bugs:

  • Agent improved response (update test)
  • Test was too strict (relax expectations)
  • Actual regression (fix agent)

CI/CD Integration

GitHub Actions

yaml
- name: Run Acceptance Tests
  run: |
    flutch test run <agent-id> --bucket critical
    flutch test run <agent-id> --bucket regression

Fail deployment if tests fail.

Pre-Deployment Gates

Configure in Flutch:

Settings → Deployment → Require Tests:

  • ✅ Run critical tests before deployment
  • ✅ Block deployment if tests fail
  • ⚠️ Allow deployment with warnings

Exporting Test Results

Export for reporting:

bash
# Export as JSON
flutch test export <run-id> --format json

# Export as CSV
flutch test export <run-id> --format csv

Useful for:

  • Stakeholder reports
  • Historical analysis
  • Integration with BI tools
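
For quick reporting you can summarise an exported run with a short script. The field names below (a list of results with a "status" value) are an assumption about the export's shape; check the actual JSON you get from flutch test export and adjust accordingly.

python
import json
from collections import Counter

# Assumed export shape: a list of per-test results, each with a "status" field.
with open("run_abc123.json") as f:
    results = json.load(f)

counts = Counter(test["status"] for test in results)
total = sum(counts.values())
if total:
    print(f"{counts.get('passed', 0)}/{total} passed ({counts.get('passed', 0) / total:.0%})")
for status, count in counts.items():
    print(f"  {status}: {count}")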

Common Testing Patterns

Regression Tests

After fixing a bug:

bash
Input: [message that caused bug]
Expected: [correct behavior]
Tags: regression, bug-123

Golden Examples

Perfect responses to keep:

bash
Input: "What makes your product different?"
Expected (Contains):
  - "unique feature X"
  - "competitive advantage Y"
Tags: golden, marketing

Boundary Tests

Edge cases:

bash
Input: "" (empty message)
Expected: "Please ask a question"

Input: "a" * 10000 (very long)
Expected: "Your message is too long"

Multi-Language Tests

If supporting multiple languages:

bash
Bucket: Spanish Support
  Input: "¿Cuáles son sus planes de precios?"
  Expected: "Basic", "Pro", "Enterprise"

Troubleshooting

"All tests failing"

Possible causes:

  • Agent is down (check status)
  • API key invalid (check settings)
  • Model changed (responses different)

"Tests flaky"

LLM outputs vary slightly. Make tests more flexible:

  • Use "Contains" instead of "Exact"
  • Allow alternative phrasings
  • Check metadata instead of exact text

"Tests take too long"

Optimization:

  • Run tests in parallel (automatic)
  • Reduce number of tests
  • Use a faster model for testing (e.g., gpt-3.5 instead of gpt-4)

Tip: Run tests after every significant prompt change!

Screenshots Needed

TODO: Add screenshots for:

  • Test buckets list view
  • Create test case form
  • Test run results page
  • Diff viewer for failed test
  • Trend analysis chart