AI Engineering · Voice Infrastructure

AICC-SOF

AI Voice Call Center for University Admissions

Self-hosted on Mac Mini M4 · Python 3.11+ · FastAPI + LiveKit + Asterisk · Zero AI API spend

~540ms
End-to-end AI voice latency
(Kokoro route, Mac Mini M4)
<₹1/min
Full AI call cost via SIP
(vs ₹4–6/min Twilio-only)
14
Prometheus metrics
+ 9 Grafana panels
7
CI pipeline jobs
(GitHub Actions, all must pass)
"Everything runs locally on Mac Mini M4. No paid AI APIs, no cloud LLM, no vendor lock-in. The full STT → LLM → TTS pipeline completes in ~540ms on the Kokoro route. Telephony (SIP/Twilio) and Google Sheets are the only external dependencies." — AICC-SOF README

What It Does


The Problem

University admissions teams run outbound call campaigns to thousands of prospective students. The usual options each fall short: Twilio-backed AI calling costs ₹4–6/min, human agents are slow and inconsistent, and cloud LLM APIs add recurring spend plus latency.

What was needed: a fully self-hosted system that could run structured admissions conversations, score leads intelligently, and keep costs under ₹1/min — with zero AI API spend.

The Solution

AICC-SOF is a production-grade AI call center that runs entirely on a Mac Mini M4. It reads pending leads from Google Sheets, scores them with XGBoost, and dials the highest-scoring leads first through a 7-state conversation machine.

Every call runs STT → LLM → TTS locally: MLX Whisper on the M4 Neural Engine for STT (~60–120ms), Ollama phi3.5:3.8b on the Metal GPU for the LLM (~70–100ms to first token), and Kokoro-82M ONNX for TTS (~30–80ms). End-to-end latency is ~540ms on the recommended route.

Core Capabilities

Outbound Campaigns

Reads pending leads from Google Sheets, scores with XGBoost, dials highest-scoring leads first. Up to MAX_CONCURRENT_CALLS simultaneous AI conversations (default: 3).

Inbound Handling

Same AI agent handles students who call your number. Auto-appends inbound callers to Google Sheets as new leads. DNC-checked and E.164-validated before engaging.

7-State Conversation Machine

Structured admissions flow: Greeting → Stream Detection → Marks Collection → Course Suggestion → Interest Detection → Persuasion → Closing. Extracts name, stream, marks, and interest level.

Lead Intelligence

XGBoost pre-call scoring (0–1), post-call quality scoring on 6 axes (0–10 each), and post-call enrollment probability prediction — all written back to Google Sheets.

TRAI NDNC Compliance

Checks India's Do Not Call registry before every call. Fails open (the call proceeds) if the registry API is unreachable. NDNC-listed numbers are automatically added to the internal DNC list, which is backed by SQLite in WAL mode.

Operational Intelligence

Call recording (PCM + SQLite index), human escalation on 11 trigger phrases (Asterisk AMI Redirect), email nurture drip (Day 0/2/5), A/B script testing (50/50 deterministic split), dropout recovery campaigns.
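A deterministic 50/50 split means the same lead always lands in the same script variant, even across retries and restarts. The page does not say how the split is computed; the sketch below assumes a hash of a stable lead identifier, and `ab_variant` is a hypothetical name used only for illustration.

```python
import hashlib


def ab_variant(lead_id: str) -> str:
    """Map a stable lead identifier to script variant 'A' or 'B'.

    Assumption: the identifier (for example the phone number) never changes
    for a given lead, so the assignment is reproducible without storing it.
    """
    digest = hashlib.sha256(lead_id.encode("utf-8")).digest()
    # The parity of the first hash byte is close to uniform, giving ~50/50.
    return "A" if digest[0] % 2 == 0 else "B"
```

Hashing rather than random choice keeps a lead in one variant if the call is retried, which keeps the A/B outcome counts clean.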


System Architecture


Campaign Engine

Asyncio polling loop reads pending leads from Google Sheets every CAMPAIGN_POLL_INTERVAL seconds. Scores each lead via XGBoost, sorts by score descending, checks TRAI NDNC + internal DNC, then acquires an asyncio.Semaphore slot before dialing.
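The ordering and concurrency cap described here can be sketched as follows. This is a minimal illustration, not the project's code: `Lead`, `run_batch`, and the `dial` callable are hypothetical, the DNC check is reduced to a set lookup, and the Google Sheets polling and XGBoost scoring are assumed to have already produced the `score` field.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Iterable, List, Set

MAX_CONCURRENT_CALLS = 3  # default stated in the text


@dataclass
class Lead:
    phone: str
    score: float  # pre-call lead score in [0, 1]


async def run_batch(
    leads: Iterable[Lead],
    dnc: Set[str],
    dial: Callable[[Lead], Awaitable[str]],
    max_concurrent: int = MAX_CONCURRENT_CALLS,
) -> List[str]:
    """Dial non-DNC leads, highest score first, with bounded concurrency."""
    slots = asyncio.Semaphore(max_concurrent)
    ordered = sorted(leads, key=lambda lead: lead.score, reverse=True)

    async def worker(lead: Lead) -> str:
        async with slots:  # wait for a free call slot before dialing
            return await dial(lead)

    tasks = [
        asyncio.create_task(worker(lead))
        for lead in ordered
        if lead.phone not in dnc  # skip numbers on the do-not-call list
    ]
    return await asyncio.gather(*tasks)
```

Because tasks are created in score order and the semaphore wakes waiters in FIFO order, higher-scoring leads get call slots first while at most `max_concurrent` calls are in flight.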

Telephony Router

Prefers SIP via Asterisk AMI (panoramisk, async). Falls back to Twilio REST if SIP fails. Both paths terminate in LiveKit SIP, which bridges the PSTN call into a WebRTC room where the agent session runs.
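The SIP-first, Twilio-second behaviour amounts to a try/fallback around two dialers. The sketch below assumes each path is exposed as an async callable that returns a call identifier and raises on failure; `place_call`, `dial_sip`, and `dial_twilio` are hypothetical names, and no real panoramisk or Twilio calls are shown.

```python
from typing import Awaitable, Callable, Tuple

Dialer = Callable[[str], Awaitable[str]]


async def place_call(number: str, dial_sip: Dialer, dial_twilio: Dialer) -> Tuple[str, str]:
    """Try the SIP path first; use Twilio only if the SIP attempt raises.

    Returns (provider, call_id) so the provider can be recorded per call.
    """
    try:
        return "sip", await dial_sip(number)
    except Exception:  # a real router would likely catch narrower errors
        # The SIP originate failed, so fall through to the paid fallback path.
        return "twilio", await dial_twilio(number)
```

Returning the provider name alongside the call identifier makes it straightforward to label per-call metrics by provider.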

Agent Session

Per-call asyncio pipeline: subscribes to LiveKit audio track (5s poll timeout), synthesizes greeting, then enters the VAD-gated STT turn loop. Drives AdmissionFlow state machine and PersuasionLayer. Connects to Redis for session memory (TTL 1h).

AdmissionFlow

7-state machine defined in flows/admission_flow.py. Extracts stream (keyword matching), marks (regex \d{2,3}%?), and maps them to course recommendations from intake.json — 20 departments, seat counts, stream mappings.
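Marks extraction and banding can be illustrated with a few lines of regex handling. This sketch uses the more detailed pattern quoted for the MARKS_COLLECTION state and the band thresholds quoted for COURSE_SUGGESTION; the function names, the rejection of values above 100, and the treatment of fractional values between 74 and 75 as "medium" are assumptions.

```python
import re
from typing import Optional

# Two or three digits, an optional single decimal place, an optional percent sign.
MARKS_RE = re.compile(r"(\d{2,3}(?:\.\d)?)\s*%?")


def extract_marks(utterance: str) -> Optional[float]:
    """Return the first plausible percentage in the utterance, or None."""
    for match in MARKS_RE.finditer(utterance):
        value = float(match.group(1))
        if value <= 100:  # assumption: ignore numbers that cannot be a percentage
            return value
    return None


def marks_band(marks: float) -> str:
    """Bucket marks into the high / medium / low bands used for course lookup."""
    if marks >= 75:
        return "high"
    if marks >= 50:
        return "medium"
    return "low"
```

When `extract_marks` returns `None`, the flow would re-ask the question rather than advance.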

IntakeConfig Singleton

intake.json is the single source of truth for all 20 departments (B.Tech CSE, BCA, MBA, LLB, …), stream keywords, seat counts, and course recommendations. Loaded once at startup; restart to reload.

Observability Stack

Self-hosted Prometheus scrapes 14 metrics at GET /metrics. Auto-provisioned Grafana (9 panels, no manual config). AlertManager with 4 alert rules and email routing. structlog JSON/pretty logging.


Deep Dive


Latency Budget — Mac Mini M4

All latencies include the 300ms silence-detection window. The recommended route (MLX Whisper + Ollama phi3.5 + Kokoro ONNX) achieves sub-600ms perceived latency.

Route                                 STT       LLM       TTS       Total
──────────────────────────────────────────────────────────────────────
MLX + phi3.5 + Kokoro                 ~80ms     ~90ms     ~50ms     ~540ms
MLX + phi3.5 + Edge TTS               ~80ms     ~90ms     ~300ms    ~770ms
MLX + Groq (hedge fires) + Kokoro     ~80ms     ~650ms    ~50ms     ~1080ms
faster-whisper + phi3.5 + Kokoro      ~250ms    ~90ms     ~50ms     ~690ms
faster-whisper + phi3.5 + XTTS MPS    ~250ms    ~90ms     ~700ms    ~1340ms
Note: All routes include 300ms of silence detection. The silence threshold was tuned from 1200ms → 400ms → 300ms (9 frames × 32ms ≈ 288ms) to minimize turn-detection latency.
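The Total column can be reproduced by adding the silence window to the three stage latencies. The sketch below does that arithmetic with the rounded per-stage figures from the table; `ROUTES` and `perceived_latency_ms` are illustrative names. Four of the five routes match the table exactly, while the Kokoro route sums to 520ms, about 20ms under the quoted ~540ms; the page does not break down that remainder.

```python
SILENCE_MS = 300  # silence-detection window included in every route

# (STT, LLM, TTS) latencies in milliseconds, taken from the table above.
ROUTES = {
    "MLX + phi3.5 + Kokoro": (80, 90, 50),
    "MLX + phi3.5 + Edge TTS": (80, 90, 300),
    "MLX + Groq (hedge fires) + Kokoro": (80, 650, 50),
    "faster-whisper + phi3.5 + Kokoro": (250, 90, 50),
    "faster-whisper + phi3.5 + XTTS MPS": (250, 90, 700),
}


def perceived_latency_ms(stt: int, llm: int, tts: int, silence: int = SILENCE_MS) -> int:
    """End of speech to first audio: silence window plus the three stages."""
    return silence + stt + llm + tts


totals = {name: perceived_latency_ms(*stages) for name, stages in ROUTES.items()}
```

The 650ms LLM figure on the hedge route is the 500ms hedge window plus roughly 150ms for the fallback's first token.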

STT — Speech to Text

Primary: mlx-whisper using Apple M4 Neural Engine + Metal GPU. Fallback: faster-whisper (CTranslate2, CPU int8). Both implement the same transcribe(pcm_bytes) → str interface via get_stt_client() factory. Selection: STT_PROVIDER=mlx|faster-whisper|auto (auto = mlx on macOS, faster-whisper elsewhere).

Silero VAD gates STT: every incoming 32ms audio frame (16kHz) runs through Silero VAD ONNX (~2–3ms/frame vs ~5ms PyTorch). VAD accumulates speech frames and waits 300ms of silence (9 frames) before sending the buffer to Whisper — one transcription per utterance instead of one per 20ms chunk, eliminating ~50 redundant STT calls/second.

VAD is offloaded to a thread pool via asyncio.to_thread so that ONNX inference does not block the asyncio event loop. An RMS energy fallback activates if torch is unavailable.
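The gating logic (buffer while the caller is speaking, flush after 9 silent frames) can be sketched independently of the VAD model. Here `is_speech` stands in for the Silero or RMS decision and is a hypothetical callable; the generator is synchronous for clarity, and keeping short pauses inside an utterance is an assumption.

```python
from typing import Callable, Iterable, Iterator, List

FRAME_MS = 32
SILENCE_FRAMES = 9  # 9 x 32ms of trailing silence ends an utterance


def utterances(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    silence_frames: int = SILENCE_FRAMES,
) -> Iterator[bytes]:
    """Yield one PCM buffer per utterance instead of one per frame."""
    buffer: List[bytes] = []
    silent_run = 0
    for frame in frames:
        if is_speech(frame):
            buffer.append(frame)
            silent_run = 0
        elif buffer:
            # Keep short pauses inside the utterance; count them as silence.
            buffer.append(frame)
            silent_run += 1
            if silent_run >= silence_frames:
                # Drop the trailing silence and hand the utterance to STT.
                yield b"".join(buffer[: len(buffer) - silent_run])
                buffer, silent_run = [], 0
    if buffer:  # the stream ended mid-utterance
        yield b"".join(buffer[: len(buffer) - silent_run])
```

In an async pipeline the per-frame decision could be wrapped as `await asyncio.to_thread(is_speech, frame)` to keep inference off the event loop.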

LLM Router — Hedge Pattern

The LLMRouter in ai/llm.py implements a buffer-then-commit strategy that prevents broken audio when the primary provider is slow or cold-starting.

Primary (Ollama phi3.5:3.8b)

Metal GPU · ~70–100ms first token · Free · LLM_HEDGE_DELAY_MS=500 window

If first token arrives within 500ms → commit. Forward all tokens to TTS queue.

Fallbacks (Groq → Gemini)

Groq: llama-3.1-8b-instant ~150ms · Gemini: 1.5-flash ~200–400ms

If hedge window expires → cancel primary (zero tokens forwarded). Start fallback fresh. No partial token mixing in TTS queue.

This guarantees: if the primary is cancelled mid-stream, zero tokens have been forwarded to TTS. The fallback starts completely fresh — preventing the bug where partial Ollama tokens were already being synthesised when Groq started.
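A minimal sketch of the buffer-then-commit idea, assuming each provider is an async generator function that yields tokens. `hedged_stream`, `primary`, and `fallback` are hypothetical names rather than the actual LLMRouter API, and only a single fallback is shown.

```python
import asyncio
from typing import AsyncGenerator, Callable

TokenStream = Callable[[str], AsyncGenerator[str, None]]


async def hedged_stream(
    prompt: str,
    primary: TokenStream,
    fallback: TokenStream,
    hedge_delay_s: float = 0.5,  # mirrors LLM_HEDGE_DELAY_MS=500
) -> AsyncGenerator[str, None]:
    """Forward primary tokens only if the first one arrives within the window."""
    stream = primary(prompt)
    try:
        # Nothing is forwarded until the first token is actually in hand.
        first = await asyncio.wait_for(stream.__anext__(), timeout=hedge_delay_s)
    except asyncio.TimeoutError:
        # Window expired with zero tokens forwarded: drop the primary and
        # start the fallback from scratch.
        await stream.aclose()
        async for token in fallback(prompt):
            yield token
        return
    except StopAsyncIteration:
        return  # the primary finished without producing any tokens

    # Commit: the primary is responsive, so stream everything it produces.
    yield first
    async for token in stream:
        yield token
```

Because the commit decision is made before anything is yielded, a consumer such as the TTS queue never sees tokens from two providers in the same turn.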

TTS — Text to Speech

Three providers implement synthesize(text) → bytes and synthesize_sentences(token_stream, on_audio_chunk). The sentence-streaming interface starts playing the first sentence while later sentences are still being synthesised, cutting perceived latency.

Provider                      Latency (M4)            Mode                  Notes
Kokoro-82M ONNX (primary)     ~30–80ms/sentence       Offline, 24kHz PCM    4 voices: bf_emma, af_sky, am_adam, bm_george · ~300 MB download
Edge TTS (fallback 1)         ~200–350ms              Network, 24kHz PCM    Indian English en-IN-NeerjaNeural · MP3→PCM via pydub
Coqui XTTS v2 (fallback 2)    ~700ms (M4 MPS GPU)     Offline, 24kHz PCM    Voice cloning from assets/reference_voice.wav · ~3–5s on CPU

Selection: TTS_PROVIDERS=kokoro,edge,xtts — first entry is primary, rest are fallbacks in order.
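Sentence streaming and ordered fallback can both be sketched without any TTS engine. Below, `sentences` groups an incoming token stream into complete sentences, and `synthesize_with_fallback` tries providers in the order given by a list such as TTS_PROVIDERS. Both function names, the boundary rule (a terminator followed by whitespace), and the provider callables are assumptions for illustration.

```python
import re
from typing import Callable, Iterable, Iterator, List, Tuple

# A sentence ends at ., ! or ? when followed by whitespace (so "85.5" is not split).
SENTENCE_END = re.compile(r"[.!?](?=\s)")


def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentences as soon as each one completes."""
    buf = ""
    for token in tokens:
        buf += token
        match = SENTENCE_END.search(buf)
        while match:
            yield buf[: match.end()].strip()
            buf = buf[match.end():]
            match = SENTENCE_END.search(buf)
    if buf.strip():  # flush whatever remains when the stream ends
        yield buf.strip()


def synthesize_with_fallback(
    text: str, providers: List[Tuple[str, Callable[[str], bytes]]]
) -> Tuple[str, bytes]:
    """Try each (name, synthesize) pair in order; return the first success."""
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except Exception:
            continue  # provider failed, move to the next one in the list
    raise RuntimeError("all TTS providers failed")
```

Feeding each completed sentence to the synthesizer as it appears is what lets playback of the first sentence start while the LLM is still generating the rest.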


7-State Admission Machine


State 1
GREETING
"Hello {name}, calling from {university}. Do you have a couple of minutes?"
Exit: positive/neutral → STREAM_DETECTION  |  negative → CLOSING
State 2
STREAM_DETECTION
"Which stream are you from — Science, Commerce, or Arts?" Detects via keyword matching against intake.json stream_keywords. Re-asks if stream not detected.
Exit: stream extracted → MARKS_COLLECTION
State 3
MARKS_COLLECTION
"What percentage did you score in your 12th board exams?" Extracts via regex \d{2,3}(?:\.\d)?\s*%?. Re-asks if not matched.
Exit: marks extracted → COURSE_SUGGESTION
State 4
COURSE_SUGGESTION
"Based on your {marks} in {stream}, {courses} would be a great fit." Courses from intake.json course_recommendations by stream × marks band (≥75% high, 50–74% medium, <50% low). Injects urgency if seats ≤ 5: "Only N seats left!"
Exit: auto-advance → INTEREST_DETECTION
State 5
INTEREST_DETECTION
"Would you like to speak with our counselor for a free 15-minute session?" PersuasionLayer classifies intent into 5 buckets: interested, hesitant, confused, negative, neutral.
Exit: interested → CLOSING  |  negative → CLOSING  |  hesitant/confused → PERSUASION
State 6
PERSUASION
Targeted persuasion templates: success rate, merit scholarships, placement stats. Hesitant → highlight outcomes. Confused → simplify, ask easier question. Max 2 persuasion attempts per call.
Exit: interested → CLOSING  |  negative → CLOSING  |  max attempts → CLOSING
State 7
CLOSING → ENDED
Thank/goodbye message tuned to interest level. Triggers PostCallPipeline: quality scoring (6-axis LLM, 0–10), enrollment probability (XGBoost), email nurture scheduling, and Google Sheets write-back.
Exit: always → ENDED
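The exit rules above fit naturally into a transition table. The sketch below encodes them with an `Enum`; the event names are hypothetical labels for the exit conditions, and staying in the current state on any unlisted event (which covers the re-ask behaviour) is an assumption.

```python
from enum import Enum, auto


class State(Enum):
    GREETING = auto()
    STREAM_DETECTION = auto()
    MARKS_COLLECTION = auto()
    COURSE_SUGGESTION = auto()
    INTEREST_DETECTION = auto()
    PERSUASION = auto()
    CLOSING = auto()
    ENDED = auto()


# (current state, event) -> next state, following the exit rules listed above.
TRANSITIONS = {
    (State.GREETING, "positive"): State.STREAM_DETECTION,
    (State.GREETING, "neutral"): State.STREAM_DETECTION,
    (State.GREETING, "negative"): State.CLOSING,
    (State.STREAM_DETECTION, "stream_extracted"): State.MARKS_COLLECTION,
    (State.MARKS_COLLECTION, "marks_extracted"): State.COURSE_SUGGESTION,
    (State.COURSE_SUGGESTION, "auto"): State.INTEREST_DETECTION,
    (State.INTEREST_DETECTION, "interested"): State.CLOSING,
    (State.INTEREST_DETECTION, "negative"): State.CLOSING,
    (State.INTEREST_DETECTION, "hesitant"): State.PERSUASION,
    (State.INTEREST_DETECTION, "confused"): State.PERSUASION,
    (State.PERSUASION, "interested"): State.CLOSING,
    (State.PERSUASION, "negative"): State.CLOSING,
    (State.PERSUASION, "max_attempts"): State.CLOSING,
    (State.CLOSING, "done"): State.ENDED,
}


def next_state(state: State, event: str) -> State:
    """Return the next state; unlisted events keep the current state (re-ask)."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the rules in one table makes each exit condition easy to cover with a unit test.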

Persuasion Layer — 5 Intent Classes

interested

"yes", "sure", "tell me more", "definitely" → Push for counseling appointment

hesitant

"maybe", "not sure", "let me think" → Highlight success rate, scholarships

confused

"what", "don't understand", "explain" → Simplify, ask an easier question

negative

"no", "not interested", "busy" → Accept gracefully, mark low interest

neutral

everything else → Let LLM respond freely without directive override

Escalation triggers (11 phrases including "speak to a person", "real agent", "human", "transfer me"): AI delivers handoff message and Asterisk AMI Redirect transfers to counselor extension.
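A keyword-based version of the five-way classification might look like the sketch below. The phrase lists are the examples quoted above, not the full lists; the matching order (negative, hesitant, confused, interested) and the use of word-boundary matching are assumptions chosen so that "not sure" is not read as "sure". Only the four escalation phrases named in the text are included, not all 11.

```python
import re

# Checked in this order; the order itself is an assumption.
INTENT_PHRASES = [
    ("negative", ["not interested", "no", "busy"]),
    ("hesitant", ["maybe", "not sure", "let me think"]),
    ("confused", ["what", "don't understand", "explain"]),
    ("interested", ["yes", "sure", "tell me more", "definitely"]),
]

# Only the four escalation phrases named in the text.
ESCALATION_PHRASES = ["speak to a person", "real agent", "human", "transfer me"]


def _contains(text: str, phrase: str) -> bool:
    """Whole-word match, so 'no' does not fire inside 'not' or 'know'."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text) is not None


def classify_intent(utterance: str) -> str:
    text = utterance.lower()
    for intent, phrases in INTENT_PHRASES:
        if any(_contains(text, phrase) for phrase in phrases):
            return intent
    return "neutral"  # everything else: let the LLM respond freely


def needs_escalation(utterance: str) -> bool:
    text = utterance.lower()
    return any(_contains(text, phrase) for phrase in ESCALATION_PHRASES)
```

Checking escalation before intent on each turn would let a request for a human override the persuasion flow.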


Metrics, Dashboards & CI


14 Prometheus Metrics

Metric                         Type         Description
──────────────────────────────────────────────────────────────────────
aicc_calls_total               Counter      Calls by outcome + provider
aicc_call_duration_seconds     Histogram    Full call duration distribution
aicc_active_calls              Gauge        Currently active concurrent calls
aicc_stt_latency_ms            Histogram    STT transcription latency per utterance
aicc_llm_first_token_ms        Histogram    LLM time-to-first-token
aicc_tts_first_chunk_ms        Histogram    TTS first audio chunk latency
aicc_interest_total            Counter      Calls by interest level (high/medium/low/unknown)
aicc_lead_score                Histogram    Pre-call XGBoost lead score distribution
aicc_enrollment_probability    Histogram    Post-call enrollment probability distribution
aicc_call_quality_score        Histogram    Post-call quality score 0–10 (6-axis LLM)
aicc_escalations_total         Counter      Calls escalated to a human counselor
aicc_language_total            Counter      Calls by detected language (en / hi)
aicc_ab_variant_total          Counter      A/B outcomes by variant + outcome
aicc_leads_skipped_total       Counter      Leads skipped by reason (dnc, ndnc, score, recall-interval)

9 Grafana Panels

Active Calls

Real-time gauge of concurrent AI conversations

Calls Today

Total calls in the current calendar day

Success Rate

Completed calls / total calls (%)

Outcome Breakdown

Pie: completed, no_answer, failed, escalated

Interest Distribution

Pie: high, medium, low, unknown

STT Latency p95

95th percentile speech-to-text latency

LLM Latency p95

95th percentile time-to-first-token

TTS Latency p95

95th percentile first audio chunk

Calls by Provider

SIP vs Twilio over time

Auto-provisioned on first start. No manual Grafana config needed.

AlertManager Rules

warning
HighCallFailureRate
Failure rate > 40% for 5 consecutive minutes
warning
NoCalls
Zero calls for 2 hours during campaign window
warning
HighLatency
STT p95 > 2s or LLM p95 > 4s
info
ActiveCallsZero
Active calls gauge stuck at 0 during campaign hours

7-Job CI Pipeline (GitHub Actions)

All 7 jobs must pass before merging to main. Heavy model dependencies (torch, TTS, livekit, mlx_whisper) are excluded from CI via lazy import guards and unittest.mock stubs.

1
Lint

ruff — style, unused imports, bugbear, import order

2
Type Check

mypy — type mismatches, wrong args, missing returns

3
Unit Tests

pytest — state machine, persuasion, entity extraction

4
Integration Tests

pytest — end-to-end StateManager flow (runs after unit tests)

5
Security Scan

bandit — hardcoded secrets, unsafe calls, injection risks

6
Dependency Security

pip-audit — known CVEs in requirements-ci.txt packages

7
Import Sanity

Core modules import without crashing (optional deps guarded)
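The lazy import guards mentioned for the CI setup can be as small as a helper that returns `None` when a heavy dependency is missing. This is a generic sketch of the pattern rather than the project's code; `optional_import` is a hypothetical helper name.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Import a heavy optional dependency on demand; None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None
```

Calling the helper inside the function that needs the dependency, rather than at module top level, lets core modules import in CI without torch or mlx_whisper installed. In tests, a stub can be injected with `unittest.mock.patch.dict(sys.modules, {"mlx_whisper": MagicMock()})` so the same code path runs against a mock.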


Under ₹1/Min, Zero AI API Spend


Component                                  Cost/Min        Notes
──────────────────────────────────────────────────────────────────────
SIP/GSM trunk                              ₹0.50–₹0.80     Varies by provider and destination
Ollama LLM (phi3.5:3.8b)                   ₹0.00           Runs locally on M4 Metal GPU
MLX Whisper STT                            ₹0.00           Runs locally on M4 Neural Engine
Kokoro ONNX TTS                            ₹0.00           Runs locally, no network required
LiveKit (self-hosted)                      ₹0.00           Docker container
Redis + Asterisk + Prometheus + Grafana    ₹0.00           Self-hosted Docker
Total (SIP + Kokoro route)                 ~₹0.50–₹0.80    Full AI call at under ₹1/min
Total (Twilio fallback fires)              ~₹4–₹6          Only activates if SIP fails

Monthly Estimate — 3 Concurrent Calls, 8h/day

~240
Calls/day (max)
20–30 calls/hour at 2–3 min avg
~4,800
Calls/month (max)
20 working days × 240
~₹7,800
Max monthly cost
@ ₹0.65/min avg, 2.5 min avg call
~₹60,000
Equivalent Twilio-only
~8× the cost for the same volume
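The monthly figures follow from simple arithmetic. The sketch below reproduces them, assuming 20 working days (which is what ~4,800 calls/month implies at 240 calls/day) and taking ₹5/min as the midpoint of the ₹4–6 Twilio-only range; `monthly_cost` is an illustrative helper.

```python
CALLS_PER_DAY = 240      # 30 calls/hour x 8 hours
WORKING_DAYS = 20        # assumption implied by ~4,800 calls/month
AVG_CALL_MIN = 2.5       # average call length in minutes
SIP_RATE = 0.65          # rupees per minute, SIP + Kokoro route
TWILIO_RATE = 5.0        # rupees per minute, midpoint of the 4-6 range


def monthly_cost(calls_per_day: int, days: int, avg_min: float, rate: float) -> float:
    """Total monthly spend in rupees for a given per-minute rate."""
    return calls_per_day * days * avg_min * rate


calls_per_month = CALLS_PER_DAY * WORKING_DAYS
sip_cost = round(monthly_cost(CALLS_PER_DAY, WORKING_DAYS, AVG_CALL_MIN, SIP_RATE))
twilio_cost = round(monthly_cost(CALLS_PER_DAY, WORKING_DAYS, AVG_CALL_MIN, TWILIO_RATE))
ratio = twilio_cost / sip_cost
```

With these inputs the month comes to 4,800 calls and 12,000 call-minutes, costing about ₹7,800 over SIP versus about ₹60,000 on Twilio alone, a ratio of roughly 7.7.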

Technology Stack


AI / ML
mlx-whisper (STT, M4 Neural Engine) · faster-whisper (STT fallback) · Ollama phi3.5:3.8b · Groq llama-3.1-8b-instant · Gemini 1.5 Flash · Kokoro-82M ONNX (TTS) · Edge TTS en-IN-NeerjaNeural · Coqui XTTS v2 · Silero VAD ONNX · XGBoost (lead scoring + enrollment probability) · langdetect (en/hi)
Telephony / Infrastructure
Asterisk AMI · panoramisk (async AMI) · LiveKit (WebRTC) · LiveKit SIP Bridge · Twilio REST (fallback) · Docker Compose · Redis (session memory, TTL 1h)
Backend / API
Python 3.11+ · FastAPI · uvicorn · asyncio · Pydantic-Settings · gspread (Google Sheets) · HMAC API key auth · SQLite WAL (DNC, recordings, nurture)
Observability / Quality
Prometheus · Grafana · AlertManager · structlog (JSON/pretty) · ruff (lint + format) · mypy (strict) · pytest + asyncio · bandit (security) · pip-audit (CVE scan) · GitHub Actions CI
Integrations
TRAI NDNC registry · Google Sheets API · SMTP email nurture · JEE/NEET candidate auto-import