Sources & Research

Last updated: February 13, 2026

We believe in transparency. This page explains how we arrived at every number on our website, which figures are internal metrics and which are industry claims, and where you can verify our sources.

Our product metrics

These numbers come from our own system and are directly measurable. They are not estimates or projections.

315 support intents

Our classification model is trained on 315 distinct intent categories covering common support topics like password resets, billing questions, order status, refund requests, and more. This is the count of categories in our production model. You can browse the full list on our intents page.

92% accuracy

Measured on a held-out test set of real customer support messages. This is top-1 accuracy: the percentage of messages where the model's highest-confidence prediction matches the correct intent label. We re-evaluate this metric as we add new intents and training data.
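
For clarity, here is the arithmetic behind that figure as a minimal sketch; the function and variable names are illustrative, not taken from our evaluation code.

    # Minimal sketch of top-1 accuracy on a held-out test set.
    # `predictions` holds the model's highest-confidence intent per message;
    # `labels` holds the correct intent assigned by a human reviewer.
    def top1_accuracy(predictions: list[str], labels: list[str]) -> float:
        assert len(predictions) == len(labels)
        correct = sum(p == t for p, t in zip(predictions, labels))
        return correct / len(labels)

    # 92% accuracy means 92 of every 100 test messages get the right intent.
    print(top1_accuracy(["billing", "refund_request"], ["billing", "password_reset"]))  # 0.5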

1-3s end-to-end response

End-to-end latency from API request to classification response, measured in production. Ranges from 1-3 seconds depending on server load and network conditions. This includes network overhead, model inference, and response generation. The model is a purpose-built classifier (not a generative LLM).
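
For illustration, end-to-end latency here is the wall-clock time around a full request/response cycle measured from the client side. A minimal sketch, assuming a generic HTTP client; the endpoint URL, payload shape, and key are placeholders, not our documented API:

    # Client-side latency measurement (hypothetical endpoint, payload, and key).
    import statistics
    import time

    import requests  # third-party: pip install requests

    API_URL = "https://api.example.com/classify"  # placeholder URL

    def measure_latency(message: str, api_key: str) -> float:
        """Seconds from sending the request to receiving the full response."""
        start = time.perf_counter()
        requests.post(
            API_URL,
            json={"message": message},
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=10,
        )
        return time.perf_counter() - start

    samples = [measure_latency("Where is my order?", "test-key") for _ in range(20)]
    print("p50:", statistics.median(samples))
    print("p95:", statistics.quantiles(samples, n=20)[18])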

$0.20 / $0.30 per resolution

These are our prices, not estimates: $0.20 per classification-only resolution (Base tier) and $0.30 when an automated action is triggered (Automation tier). You are only charged for successful resolutions (high-confidence or user-confirmed); wrong classifications and explicit rejections are free.
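
A minimal sketch of that billing rule, with illustrative outcome records rather than our actual billing code:

    # Illustrative pricing arithmetic: only successful resolutions are billed.
    BASE_PRICE = 0.20        # classification-only (Base tier)
    AUTOMATION_PRICE = 0.30  # automated action triggered (Automation tier)

    def billable_amount(outcomes: list[dict]) -> float:
        """Sum charges for successful resolutions; wrong or rejected ones cost nothing."""
        total = 0.0
        for o in outcomes:
            if o["resolved"]:  # high-confidence or user-confirmed
                total += AUTOMATION_PRICE if o["automated"] else BASE_PRICE
            # wrong classifications, explicit rejections, escalations: $0.00
        return total

    tickets = [
        {"resolved": True, "automated": False},   # $0.20
        {"resolved": True, "automated": True},    # $0.30
        {"resolved": False, "automated": False},  # escalated, free
    ]
    print(f"${billable_amount(tickets):.2f}")  # $0.50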

$5 in free credits

Every new account receives $5.00 in credits, enough for 25 Base resolutions or 16 Automation resolutions. No credit card required to start.
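
The resolution counts are simple division against the prices above; working in cents avoids floating-point rounding:

    # How far $5.00 in free credits goes at each tier (amounts in cents).
    credit_cents = 500
    print(credit_cents // 20)  # 25 Base resolutions at $0.20 each
    print(credit_cents // 30)  # 16 Automation resolutions at $0.30 each, with change left over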

Industry claims & comparisons

Our landing page compares Supp to DIY LLM-based support. Here is the research behind each claim.

“40-60% of support tickets are repetitive”

This is a well-established industry finding, supported by multiple sources:

  • Zendesk CX Trends Report (2023): Reports that common, repetitive questions make up the majority of support ticket volume across their customer base of 100,000+ companies.
  • Freshworks / Freshdesk research: Estimates 40-50% of incoming support tickets are repetitive, routine questions that could be automated.
  • IBM Watson contact center research: Found similar patterns in enterprise contact centers, with simple, predictable inquiries making up the bulk of volume.
  • Harvard Business Review (2017): “Kick-Ass Customer Service” research found that the vast majority of support interactions are for routine issues that don't require human judgment.

“LLM response time: 1-5 seconds”

This refers to the full round-trip time for an LLM-powered support resolution, not just the raw API call:

  • Raw API latency: Modern LLMs (GPT-4o, Claude Sonnet) return first tokens in 200-500ms, but full responses for support-length answers take 1-3 seconds.
  • Smaller models: GPT-4o-mini and Claude Haiku are faster (500ms-1.5s for classification-length tasks), but still 3-10x slower than a purpose-built classifier.
  • Full pipeline: A production LLM support system typically adds prompt construction, retrieval (RAG), guardrails, and response parsing on top of the raw API call, pushing total latency to 2-5 seconds.

We use the range “1-5 seconds” to reflect the full spectrum from optimized lightweight setups to production RAG pipelines. The key comparison point is that a dedicated classifier avoids the token generation step entirely.
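
As a rough demonstration of why token generation dominates, the sketch below times first-token versus full-response latency for a streaming chat completion. It assumes the OpenAI Python SDK with an API key in the environment; the model choice and prompt are illustrative, and your numbers will vary:

    # Timing first-token vs. full-response latency for a streaming LLM call.
    # Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
    import time

    from openai import OpenAI

    client = OpenAI()
    start = time.perf_counter()
    first_token_at = None

    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": "Classify this ticket: 'I forgot my password.'"}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter() - start
    total = time.perf_counter() - start

    print(f"first token: {(first_token_at or total):.2f}s, full response: {total:.2f}s")
    # A dedicated classifier returns one label, so the "full response" portion collapses.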

“LLM cost per resolution: $0.50 – $2.00+”

Important context: This is the total cost of ownership per resolved ticket, not the raw API cost of a single LLM call.

  • Raw API cost per call: A single GPT-4o-mini or Claude Haiku classification call costs roughly $0.001-$0.05 depending on prompt length. This is inexpensive.
  • Multi-turn resolution: Most support tickets require multiple LLM calls (classify, retrieve context, generate response, handle follow-up). A multi-turn conversation can cost $0.10-$0.50+ in API costs alone.
  • Engineering and infrastructure: Building and maintaining a production LLM support pipeline requires prompt engineering, RAG infrastructure, guardrails, hosting, monitoring, and ongoing tuning. These costs are amortized per ticket.
  • Industry benchmarks: Enterprise AI support platforms (Intercom Fin, Zendesk AI) typically charge $0.99-$2.00 per AI-resolved conversation, reflecting the true total cost of LLM-powered support.

The $0.50-$2.00+ range reflects what teams actually spend per resolution when they build or buy LLM-powered support. If you're only making a single API call for classification (comparable to what Supp does), the raw cost is much lower — but you won't get confidence scoring, automatic escalation, or the “wrong answers are free” guarantee.
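
To make the raw API arithmetic concrete, here is an illustrative calculation. The per-million-token prices and token counts are assumptions chosen to fall inside the ranges above, not quotes from any vendor's current rate card:

    # Illustrative LLM cost arithmetic; prices and token counts are assumptions.
    INPUT_PRICE_PER_M = 2.50    # assumed $ per 1M input tokens (mid-tier model)
    OUTPUT_PRICE_PER_M = 10.00  # assumed $ per 1M output tokens

    def call_cost(input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * INPUT_PRICE_PER_M
                + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

    # Single classification call: a prompt listing intents, a one-label response.
    print(f"${call_cost(1500, 10):.4f}")   # roughly $0.004 per call

    # Multi-turn resolution: five calls, each carrying retrieved context and a reply.
    print(f"${sum(call_cost(8000, 500) for _ in range(5)):.3f}")  # roughly $0.125 raw API cost

    # The $0.50-$2.00+ range layers amortized engineering, RAG infrastructure,
    # guardrails, hosting, and monitoring on top of this raw spend.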

“LLMs hallucinate — no built-in confidence scoring”

LLM hallucination is a well-documented phenomenon in AI research. Our specific claim is:

  • Hallucination: LLMs can generate plausible but incorrect answers with high apparent confidence. This is well-established in research from OpenAI, Anthropic, Google, and academic institutions.
  • Confidence scoring: Some LLMs provide log probabilities, but these don't directly map to “is this classification correct?” in the way a purpose-built classifier's softmax output does. Building reliable confidence thresholds on LLM logprobs requires significant calibration.
  • Our approach: Supp's classifier produces a calibrated confidence score per intent. When confidence is below threshold, we don't guess — we escalate to a human. No charge.
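
A minimal sketch of that thresholding behavior, using the documented default threshold of 0.80; the function and return shape are illustrative, not our actual API response format:

    # Illustrative threshold logic: below-threshold classifications escalate, free of charge.
    CONFIDENCE_THRESHOLD = 0.80  # documented default

    def handle(intent: str, confidence: float) -> dict:
        if confidence >= CONFIDENCE_THRESHOLD:
            return {"action": "auto_resolve", "intent": intent, "billed": True}
        return {"action": "escalate_to_human", "intent": intent, "billed": False}

    print(handle("refund_request", 0.93))  # auto-resolved, billed
    print(handle("refund_request", 0.55))  # escalated to a human, no charge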

To be fair, LLMs are improving rapidly at factual accuracy, and some providers are adding confidence features. Our comparison reflects the current state of general-purpose LLMs applied to support classification without custom calibration.

“Setup: days to weeks” for DIY LLM support

Building a production-ready LLM support system requires prompt engineering, testing across edge cases, building guardrails to prevent harmful outputs, setting up hosting infrastructure, implementing monitoring, and creating fallback/escalation paths. Based on developer experience reports and industry timelines, this typically takes days for a minimal viable setup and weeks for a production-grade system. Supp's setup is a single script tag because the model, infrastructure, and escalation logic are pre-built.

Batch processing discount

We offer 50% off all tiers for batch processing (1-3 hour response time). This is our pricing decision — batch processing allows us to optimize infrastructure costs, and we pass the savings to you.
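
In concrete terms, the batch prices are simple arithmetic on the standard per-resolution prices:

    # Batch pricing: 50% off the standard per-resolution prices.
    for tier, price in [("Base", 0.20), ("Automation", 0.30)]:
        print(f"{tier}: ${price * 0.5:.2f} per batch resolution")
    # Base: $0.10, Automation: $0.15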

Corrections

We audit our own claims regularly. Here are corrections we've made:

  • Feb 2026: Changed landing page response time claims from “<2 seconds” to “1-3 seconds” across all pages to match the measured production range documented on this page. Fixed “classified in milliseconds” to “classified in seconds.”
  • Feb 2026: Corrected brand slide pricing from fictional Free/Pro/Enterprise tiers to actual pay-per-resolution pricing ($0.20/$0.30). Fixed ROI calculations to use real pricing model.
  • Feb 2026: Corrected “On-device ML” claim in brand materials to “Purpose-built ML” (model runs on server, not on user devices).
  • Feb 2026: Updated Terms of Service OAuth provider list to include Jira and Intercom. Fixed section numbering errors (8.x to 9.x under Section 9).
  • Feb 2026: Removed “90-day key expiry” claim. This feature was documented but not yet implemented. Updated to accurately describe manual key rotation.
  • Feb 2026: Corrected default confidence threshold from 0.70 to 0.80 in API documentation, matching actual server behavior.
  • Feb 2026: Adjusted LLM speed comparison from “2-5 seconds” to “1-5 seconds” and clarified that LLM cost comparison reflects total cost of ownership, not raw API cost.

Our approach to claims

  • Internal metrics (315 intents, 92% accuracy, 1-3s) are directly measured from our production system and updated as our model evolves.
  • We verify against our own code. Every claim on this site is cross-referenced against the actual implementation in our codebase. If the docs say a feature exists, we confirm it's actually shipped.
  • Industry comparisons are sourced from published research, vendor documentation, and widely-cited benchmarks. We aim to use conservative ranges rather than cherry-picked extremes.
  • Cost comparisons use total cost of ownership rather than raw API cost, because that reflects what teams actually spend.
  • We update this page as the industry evolves. If you believe any claim is inaccurate, please let us know.