Learn how to create and run acceptance tests to prevent regressions in your AI agent's behavior.
What is Acceptance Testing for AI Agents?
AI agents produce non-deterministic outputs, making traditional testing difficult. Acceptance testing helps you:
- Prevent regressions: Ensure new versions don't break existing behavior
- Validate changes: Test improvements before deploying
- Document expected behavior: Tests serve as living documentation
- Build confidence: Deploy knowing critical paths work
Accessing Acceptance Testing
- Log in to console.flutch.ai
- Navigate to Agents → Select your agent
- Go to Acceptance Testing tab
URL: https://console.flutch.ai/agents/{agentId}/acceptance-testing
Creating Test Buckets
Test Buckets are collections of related test cases (like test suites).
Create a Bucket
- Click Create Bucket
- Enter details:
- Name: "Customer Support Flows"
- Description: "Tests for common customer questions"
- Tags: support, customer, FAQ
- Click Create
Bucket Organization
Organize buckets by:
- Feature area: Onboarding, Billing, Support
- User journey: New user, Returning user, Power user
- Complexity: Simple, Medium, Complex
- Version: v1.0.0, v2.0.0
Example buckets:
```bash
📦 Onboarding Questions
📦 Billing Inquiries
📦 Technical Support
📦 Product Information
📦 Edge Cases
```
Creating Test Cases
Add Test Case to Bucket
- Open bucket
- Click Add Test Case
- Fill in test details
Test Case Fields
Input Message:
```bash
What are your pricing plans?
```
Expected Output Type:
- Contains: Output must contain specific text
- Exact Match: Output must match exactly (rare for LLMs)
- Regex: Output must match regex pattern
- Not Contains: Output must NOT contain text
Expected Output (example for "Contains"):
```bash
Basic plan
Pro plan
Enterprise plan
```
Validation Rules:
- ✅ Logical AND: All items must be present
- ⚠️ Partial match: Substring matching, case-insensitive by default
- 🔍 Multiple values: One per line
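Conceptually, a "Contains" check with multiple values behaves like the minimal Python sketch below (illustrative only, not the platform's implementation): case-insensitive substring matching combined with a logical AND across all expected values.

```python
def contains_all(output: str, expected_values: list[str]) -> bool:
    """Pass only if every expected value appears in the output (logical AND).

    Each value is treated as a case-insensitive substring check.
    """
    haystack = output.lower()
    return all(value.lower() in haystack for value in expected_values)

# All three plan names must appear, regardless of casing or surrounding text
output = "We offer three plans: Basic, Pro, and Enterprise."
print(contains_all(output, ["basic", "pro", "enterprise"]))  # True
```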
Advanced Test Cases
Regex validation:
bashInput: "What's the price of the Pro plan?" Expected (Regex): \$\d+/month
Matches: "$49/month", "$99/month", etc.
Metadata validation:
json{ "confidence": "> 0.7", "sources_cited": true, "response_length": "< 500" }
Multi-turn conversations:
```bash
Turn 1:
  Input: "I need help with billing"
  Expected: "billing" (contains)

Turn 2:
  Input: "How do I cancel?"
  Expected: "cancel", "subscription" (contains)
```
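A multi-turn test sends each message in order within the same conversation and checks the expectations per turn, roughly like the sketch below (`send_message` is a hypothetical helper standing in for a call to your agent, not part of the Flutch SDK):

```python
def run_multi_turn(turns, send_message):
    """Run a list of (input, expected_terms) turns in a single conversation."""
    conversation_id = None
    for user_input, expected_terms in turns:
        # send_message is assumed to return (reply_text, conversation_id)
        reply, conversation_id = send_message(user_input, conversation_id)
        missing = [t for t in expected_terms if t.lower() not in reply.lower()]
        if missing:
            return False, f"Turn {user_input!r} is missing: {missing}"
    return True, "all turns passed"

turns = [
    ("I need help with billing", ["billing"]),
    ("How do I cancel?", ["cancel", "subscription"]),
]
```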
Running Tests
Run Single Test
- Open test case
- Click Run Test
- View results
Result details:
- ✅ Passed: Output met expectations
- ❌ Failed: Output didn't match
- ⚠️ Warning: Partial match or low confidence
Run Entire Bucket
- Open bucket
- Click Run All Tests (top right)
- Watch progress (tests run in parallel)
- View summary
Summary shows:
- Total tests: 25
- Passed: 23 (92%)
- Failed: 2 (8%)
- Duration: 45 seconds
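Conceptually, a bucket run executes its test cases in parallel and aggregates the results, similar to this sketch (illustrative Python; `run_test_case` stands in for sending the input to your agent and checking the expectation):

```python
from concurrent.futures import ThreadPoolExecutor

def run_bucket(test_cases, run_test_case, max_workers=8):
    """Run all test cases concurrently and return a pass/fail summary."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_test_case, test_cases))  # True = passed
    passed = sum(1 for passed_flag in results if passed_flag)
    total = len(results)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": f"{100 * passed // total}%",
    }
```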
Scheduled Test Runs
Automate testing:
- Bucket settings → Schedule
- Choose frequency:
- Daily (9 AM)
- Before each deployment
- On demand only
- Set notification email
- Save
Tests run automatically on the chosen schedule, and the results are emailed to you.
Analyzing Test Results
Test Run Details
Click on test run to see:
Overview:
- Run ID: run_abc123
- Started: 2025-01-20 14:30 UTC
- Duration: 45s
- Status: 2 failures
Per-Test Results:
```bash
✅ Pricing question - Passed
   Input: "What are your pricing plans?"
   Expected: Basic, Pro, Enterprise (contains)
   Actual: "We offer three plans: Basic ($9/mo), Pro ($49/mo), and Enterprise (custom pricing)."
   Match: 100%

❌ Cancellation flow - Failed
   Input: "How do I cancel my subscription?"
   Expected: "billing", "cancel" (contains)
   Actual: "You can manage your account settings here."
   Match: 50% (missing "billing")
   Diff:
   - Expected: "billing"
   + Actual: "account settings"
```
Diff Viewer
For failures, see side-by-side comparison:
```diff
Expected Output:
+ To cancel, go to billing settings

Actual Output:
- You can manage your account settings
```
The diff helps you see exactly what changed.
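If you want a comparable diff locally, for example when triaging exported results, Python's standard difflib module produces similar output:

```python
import difflib

expected = "To cancel, go to billing settings"
actual = "You can manage your account settings"

# unified_diff works on sequences of lines; here each side is a single line
for line in difflib.unified_diff([expected], [actual],
                                 fromfile="expected", tofile="actual", lineterm=""):
    print(line)
```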
Trend Analysis
View test pass rate over time:
```bash
Jan 15: 95% pass (20/21)
Jan 16: 92% pass (23/25)
Jan 17: 88% pass (22/25)  ⚠️ Declining
Jan 18: 84% pass (21/25)  🚨 Alert
```
Spot regressions early.
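A simple script over historical pass rates can raise the same kind of alert (illustrative; the threshold is an arbitrary choice):

```python
def flag_regressions(history, drop_threshold=3):
    """Flag any run whose pass rate drops more than drop_threshold points vs. the previous run."""
    alerts = []
    for (_, prev_rate), (day, rate) in zip(history, history[1:]):
        if prev_rate - rate > drop_threshold:
            alerts.append(f"{day}: pass rate fell from {prev_rate}% to {rate}%")
    return alerts

history = [("Jan 15", 95), ("Jan 16", 92), ("Jan 17", 88), ("Jan 18", 84)]
print(flag_regressions(history))  # flags Jan 17 and Jan 18
```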
Test Strategies for LLMs
Flexible Matching
LLMs phrase the same answer differently from run to run. Use flexible assertions:
❌ Bad (too strict):
bashExpected: "The price is $49 per month for the Pro plan."
✅ Good (flexible):
bashExpected (Contains): - "49" - "month" - "Pro"
Test Intent, Not Wording
❌ Bad:
bashExpected: "I apologize for the inconvenience."
✅ Good:
bashExpected (Contains): - "apolog" OR "sorry" - "inconvenience" OR "trouble"
Focus on Critical Information
Test that important facts are present:
bashInput: "What's included in the Pro plan?" Expected (Contains): - "unlimited projects" - "5 team members" - "priority support" NOT Expected (Don't test): - Exact phrasing - Greeting format - Transition words
Use Confidence Thresholds
json{ "expected_confidence": "> 0.7", "max_response_time": "< 5000ms" }
Fail the test if the agent is uncertain or too slow.
Best Practices
1. Start Small
Begin with 5-10 critical tests:
- Most common questions
- Edge cases that broke before
- Brand-critical responses
Expand over time.
2. Test User Journeys
Group related questions:
```bash
Bucket: New User Onboarding
1. "What is your product?"
2. "How do I sign up?"
3. "Do you offer a free trial?"
4. "How do I get started?"
```
3. Update Tests with Agent
When improving agent:
- Update test expectations first
- Deploy new version
- Run tests
- Verify improvements worked
4. Use Tags
Tag tests for filtering:
```bash
Tags: critical, billing, regression, v2.0.0
```
Run only critical tests before deployment:
```bash
flutch test --tags critical
```
5. Document Why
Add notes to tests:
```bash
Test: Pricing question
Why: Users complained about missing Enterprise pricing
Added: 2025-01-15
```
6. Review Failures
Not all failures are bugs:
- Agent improved response (update test)
- Test was too strict (relax expectations)
- Actual regression (fix agent)
CI/CD Integration
GitHub Actions
```yaml
- name: Run Acceptance Tests
  run: |
    flutch test run <agent-id> --bucket critical
    flutch test run <agent-id> --bucket regression
```
Fail deployment if tests fail.
Pre-Deployment Gates
Configure in Flutch:
Settings → Deployment → Require Tests:
- ✅ Run critical tests before deployment
- ✅ Block deployment if tests fail
- ⚠️ Allow deployment with warnings
Exporting Test Results
Export for reporting:
```bash
# Export as JSON
flutch test export <run-id> --format json

# Export as CSV
flutch test export <run-id> --format csv
```
Useful for:
- Stakeholder reports
- Historical analysis
- Integration with BI tools
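For example, an exported JSON run can be summarized in a few lines of Python (the field names below are assumptions about the export schema; check your exported file for the actual structure):

```python
import json

# "results" and "status" are assumed field names, not a documented schema
with open("run_abc123.json") as f:
    run = json.load(f)

results = run["results"]
passed = sum(1 for result in results if result["status"] == "passed")
print(f"{passed}/{len(results)} tests passed ({100 * passed // len(results)}%)")
```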
Common Testing Patterns
Regression Tests
After fixing a bug:
```bash
Input: [message that caused bug]
Expected: [correct behavior]
Tags: regression, bug-123
```
Golden Examples
Perfect responses to keep:
bashInput: "What makes your product different?" Expected (Contains): - "unique feature X" - "competitive advantage Y" Tags: golden, marketing
Boundary Tests
Edge cases:
bashInput: "" (empty message) Expected: "Please ask a question" Input: "a" * 10000 (very long) Expected: "Your message is too long"
Multi-Language Tests
If supporting multiple languages:
```bash
Bucket: Spanish Support
Input: "¿Cuáles son sus planes de precios?"
Expected: "Basic", "Pro", "Enterprise"
```
Troubleshooting
"All tests failing"
Possible causes:
- Agent is down (check status)
- API key invalid (check settings)
- Model changed (responses different)
"Tests flaky"
LLM outputs vary slightly. Make tests more flexible:
- Use "Contains" instead of "Exact"
- Allow alternative phrasings
- Check metadata instead of exact text
"Tests take too long"
Optimization:
- Run tests in parallel (automatic)
- Reduce number of tests
- Use a faster model for test runs (e.g., gpt-3.5 instead of gpt-4)
Next Steps
- Debug Failed Tests: Debugging Guide
- Monitor in Production: Measures & Analytics
- Deploy with Confidence: Deploy Guide
Tip: Run tests after every significant prompt change!
Screenshots Needed
TODO: Add screenshots for:
- Test buckets list view
- Create test case form
- Test run results page
- Diff viewer for failed test
- Trend analysis chart