AICC-SOF
AI Voice Call Center for University Admissions
Self-hosted on Mac Mini M4 · Python 3.11+ · FastAPI + LiveKit + Asterisk · Zero AI API spend
~540ms end-to-end latency (Kokoro route, Mac Mini M4)
Under ₹1/min per call (vs ₹4–6/min Twilio-only)
14 Prometheus metrics + 9 Grafana panels
7 CI jobs (GitHub Actions, all must pass)
"Everything runs locally on Mac Mini M4. No paid AI APIs, no cloud LLM, no vendor lock-in. The full STT → LLM → TTS pipeline completes in ~540ms on the Kokoro route. Telephony (SIP/Twilio) and Google Sheets are the only external dependencies." — AICC-SOF README
01 — Overview
What It Does
The Problem
University admissions teams run outbound call campaigns to thousands of prospective students. Traditional options are expensive: Twilio-backed AI costs ₹4–6/min, human agents are slow and inconsistent, and cloud LLM APIs add recurring API spend plus latency.
What was needed: a fully self-hosted system that could run structured admissions conversations, score leads intelligently, and keep costs under ₹1/min — with zero AI API spend.
The Solution
AICC-SOF is a production-grade AI call center that runs entirely on Mac Mini M4. It reads pending leads from Google Sheets, scores them with XGBoost, and dials highest-scoring leads first through a 7-state conversation machine.
Every call runs STT → LLM → TTS locally: MLX Whisper for STT on the M4 GPU via MLX (~60–120ms), Ollama phi3.5:3.8b on the Metal GPU (~70–100ms to first token), and Kokoro-82M ONNX for TTS (~30–80ms per sentence). Including the 300ms silence-detection window, the total is ~540ms end to end on the recommended route.
Core Capabilities
Outbound Campaigns
Reads pending leads from Google Sheets, scores with XGBoost, dials highest-scoring leads first. Up to MAX_CONCURRENT_CALLS simultaneous AI conversations (default: 3).
Inbound Handling
Same AI agent handles students who call your number. Auto-appends inbound callers to Google Sheets as new leads. DNC-checked and E.164-validated before engaging.
7-State Conversation Machine
Structured admissions flow: Greeting → Stream Detection → Marks Collection → Course Suggestion → Interest Detection → Persuasion → Closing. Extracts name, stream, marks, and interest level.
Lead Intelligence
XGBoost pre-call scoring (0–1), post-call quality scoring on 6 axes (0–10 each), and post-call enrollment probability prediction — all written back to Google Sheets.
TRAI NDNC Compliance
Checks India's Do Not Call registry before every call. Fails open (the call proceeds) if the registry API is unreachable. Numbers found on the NDNC registry are added automatically to the internal DNC list, which is backed by SQLite in WAL mode.
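The fail-open behaviour can be illustrated with a minimal sketch. The function name `may_dial`, the `ndnc_lookup` callable, and the in-memory `set` standing in for the SQLite-backed DNC list are all assumptions made for illustration, not the project's actual code.

```python
from typing import Callable, Set


def may_dial(phone: str, internal_dnc: Set[str],
             ndnc_lookup: Callable[[str], bool]) -> bool:
    """Return True if the number may be dialed (hypothetical helper)."""
    if phone in internal_dnc:
        return False
    try:
        registered = ndnc_lookup(phone)  # True means the number is on the registry
    except Exception:
        # Fail-open: the registry could not be reached, so the call is allowed.
        return True
    if registered:
        # Remember the number locally so it is blocked on later campaigns too.
        internal_dnc.add(phone)
        return False
    return True
```

Failing open favours campaign throughput over strictness; a deployment with tighter compliance needs could return `False` in the `except` branch instead.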
Operational Intelligence
Call recording (PCM + SQLite index), human escalation on 11 trigger phrases (Asterisk AMI Redirect), email nurture drip (Day 0/2/5), A/B script testing (50/50 deterministic split), dropout recovery campaigns.
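One common way to get a deterministic 50/50 split is to hash a stable key and take one bit of the digest. The sketch below assumes the lead's phone number is the key and that the variants are labelled "A" and "B"; both are illustrative choices, not details confirmed by the project.

```python
import hashlib


def ab_variant(phone: str) -> str:
    """Assign a script variant from a stable hash of the phone number."""
    digest = hashlib.sha256(phone.encode("utf-8")).digest()
    # The lowest bit of the first byte is close to uniformly distributed.
    return "A" if digest[0] % 2 == 0 else "B"
```

Because the assignment depends only on the key, a lead who is called back after a dropout hears the same script variant as on the first call.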
02 — Architecture
System Architecture
Campaign Engine
Asyncio polling loop reads pending leads from Google Sheets every CAMPAIGN_POLL_INTERVAL seconds. Scores each lead via XGBoost, sorts by score descending, checks TRAI NDNC + internal DNC, then acquires an asyncio.Semaphore slot before dialing.
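The score, sort, filter, and semaphore steps described above could look roughly like the following. The `score`, `is_blocked`, and `dial` callables are placeholders for the XGBoost scorer, the DNC checks, and the telephony router; their names and signatures are assumptions.

```python
import asyncio


async def run_campaign(leads, score, is_blocked, dial, max_concurrent=3):
    """Dial leads highest score first, at most max_concurrent at a time."""
    slots = asyncio.Semaphore(max_concurrent)
    ranked = sorted(leads, key=score, reverse=True)

    async def dial_one(lead):
        try:
            await dial(lead)
        finally:
            slots.release()  # free the slot even if the call fails

    tasks = []
    for lead in ranked:
        if is_blocked(lead):
            continue  # skipped leads never consume a slot
        await slots.acquire()  # wait here until a call slot is free
        tasks.append(asyncio.create_task(dial_one(lead)))
    await asyncio.gather(*tasks)
```

Acquiring the slot before creating the task means the loop itself pauses when all lines are busy, so the highest-scoring remaining lead is always the next one dialed.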
Telephony Router
Prefers SIP via Asterisk AMI (panoramisk, async). Falls back to Twilio REST if SIP fails. Both paths terminate in LiveKit SIP, which bridges the PSTN call into a WebRTC room where the agent session runs.
Agent Session
Per-call asyncio pipeline: subscribes to LiveKit audio track (5s poll timeout), synthesizes greeting, then enters the VAD-gated STT turn loop. Drives AdmissionFlow state machine and PersuasionLayer. Connects to Redis for session memory (TTL 1h).
AdmissionFlow
7-state machine defined in flows/admission_flow.py. Extracts stream (keyword matching), marks (regex \d{2,3}%?), and maps them to course recommendations from intake.json — 20 departments, seat counts, stream mappings.
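A minimal sketch of keyword-based stream detection and regex-based marks extraction follows. The keyword table is hypothetical (the real one lives in intake.json), the pattern adds an optional decimal and whitespace to the regex quoted above, and the band thresholds follow the ≥75 / 50–74 / <50 split used for course suggestions.

```python
import re

# Hypothetical keyword table; the project loads its own from intake.json.
STREAM_KEYWORDS = {
    "science": ("science", "pcm", "pcb"),
    "commerce": ("commerce", "accounts"),
    "arts": ("arts", "humanities"),
}
MARKS_RE = re.compile(r"(\d{2,3}(?:\.\d)?)\s*%?")


def extract_stream(utterance):
    """Return the first stream whose keyword appears as a whole word."""
    text = utterance.lower()
    for stream, keywords in STREAM_KEYWORDS.items():
        if any(re.search(rf"\b{re.escape(k)}\b", text) for k in keywords):
            return stream
    return None  # the flow would re-ask


def extract_marks(utterance):
    """Return the first 2-3 digit number as a percentage, or None."""
    match = MARKS_RE.search(utterance)
    if match is None:
        return None
    value = float(match.group(1))
    return value if value <= 100 else None


def marks_band(marks):
    if marks >= 75:
        return "high"
    if marks >= 50:
        return "medium"
    return "low"
```

Whole-word matching avoids false hits such as "arts" inside "starts"; a bare number pattern will still pick up years or phone digits, so a real flow would want to constrain the context in which it is applied.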
IntakeConfig Singleton
intake.json is the single source of truth for all 20 departments (B.Tech CSE, BCA, MBA, LLB, …), stream keywords, seat counts, and course recommendations. Loaded once at startup; restart to reload.
Observability Stack
Self-hosted Prometheus scrapes 14 metrics at GET /metrics. Auto-provisioned Grafana (9 panels, no manual config). AlertManager with 4 alert rules and email routing. structlog JSON/pretty logging.
03 — AI Pipeline
Deep Dive
Latency Budget — Mac Mini M4
All latencies include the 300ms silence-detection window. The recommended route (MLX Whisper + Ollama phi3.5 + Kokoro ONNX) achieves sub-600ms perceived latency.
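As a rough check, the per-stage figures quoted for the recommended route can be added up. Treating the stages as strictly sequential and additive is an assumption made here; the stage names in the dictionary are illustrative.

```python
# (low, high) latency in milliseconds for each stage of the recommended route.
STAGES_MS = {
    "silence_window": (300, 300),
    "stt_mlx_whisper": (60, 120),
    "llm_first_token": (70, 100),
    "tts_kokoro_first_sentence": (30, 80),
}


def budget(stages):
    """Sum the lower and upper bounds across all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = budget(STAGES_MS)
print(f"{low}-{high} ms")  # 460-600 ms
```

The ~540ms headline figure sits inside this 460–600ms range, and more than half of it is the fixed silence-detection window rather than model inference.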
STT — Speech to Text
Primary: mlx-whisper, which runs on the M4 GPU through Apple's MLX framework (Metal). Fallback: faster-whisper (CTranslate2, CPU int8). Both implement the same transcribe(pcm_bytes) → str interface via the get_stt_client() factory. Selection: STT_PROVIDER=mlx|faster-whisper|auto (auto = mlx on macOS, faster-whisper elsewhere).
Silero VAD gates STT: every incoming 32ms audio frame (16kHz) runs through Silero VAD ONNX (~2–3ms/frame vs ~5ms PyTorch). VAD accumulates speech frames and waits for about 300ms of silence (9 frames × 32ms = 288ms) before sending the buffer to Whisper. The result is one transcription per utterance instead of one per 20ms chunk, eliminating ~50 redundant STT calls/second.
VAD is offloaded to a thread pool via asyncio.to_thread so that ONNX inference does not block the asyncio event loop. An RMS energy fallback activates if torch is unavailable.
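The gating logic can be sketched as a small buffer that releases one utterance after nine consecutive silent frames, with the VAD call pushed to a worker thread. The class and function names are hypothetical, and `vad` stands for any blocking callable that returns True for speech.

```python
import asyncio

SILENCE_FRAMES = 9  # 9 x 32 ms = 288 ms, roughly the 300 ms window


class UtteranceGate:
    """Collects frames and releases one buffer per utterance."""

    def __init__(self, silence_frames: int = SILENCE_FRAMES):
        self.silence_frames = silence_frames
        self._reset()

    def _reset(self):
        self._buffer = bytearray()
        self._silent_run = 0
        self._in_speech = False

    def feed(self, frame: bytes, is_speech: bool):
        if is_speech:
            self._in_speech = True
            self._silent_run = 0
        elif not self._in_speech:
            return None  # ignore silence before any speech
        else:
            self._silent_run += 1
        self._buffer.extend(frame)
        if self._silent_run < self.silence_frames:
            return None
        utterance = bytes(self._buffer)
        self._reset()
        return utterance


async def gate_stream(frames, vad, gate):
    """Yield complete utterances from an async stream of PCM frames."""
    async for frame in frames:
        # Run the blocking VAD call off the event loop thread.
        is_speech = await asyncio.to_thread(vad, frame)
        utterance = gate.feed(frame, is_speech)
        if utterance is not None:
            yield utterance
```

Short pauses inside a sentence stay in the buffer because the silent-frame counter resets whenever speech resumes; only a full run of nine silent frames closes the utterance.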
LLM Router — Hedge Pattern
The LLMRouter in ai/llm.py implements a buffer-then-commit strategy that prevents broken audio when the primary provider is slow or cold-starting.
Primary (Ollama phi3.5:3.8b)
Metal GPU · ~70–100ms first token · Free · LLM_HEDGE_DELAY_MS=500 window
If first token arrives within 500ms → commit. Forward all tokens to TTS queue.
Fallbacks (Groq → Gemini)
Groq: llama-3.1-8b-instant ~150ms · Gemini: 1.5-flash ~200–400ms
If hedge window expires → cancel primary (zero tokens forwarded). Start fallback fresh. No partial token mixing in TTS queue.
This guarantees: if the primary is cancelled mid-stream, zero tokens have been forwarded to TTS. The fallback starts completely fresh — preventing the bug where partial Ollama tokens were already being synthesised when Groq started.
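The buffer-then-commit idea can be sketched with plain asyncio. This is an illustration under stated assumptions, not the code in ai/llm.py: `hedged_tokens`, `TokenSource`, and the helper names are hypothetical, and each provider is modelled as a zero-argument callable returning an async token iterator.

```python
import asyncio
from typing import AsyncIterator, Callable

TokenSource = Callable[[], AsyncIterator[str]]
_DONE = object()  # sentinel: the primary ended without producing a token


async def _first_or_done(it):
    try:
        return await it.__anext__()
    except StopAsyncIteration:
        return _DONE


async def hedged_tokens(primary: TokenSource, fallback: TokenSource,
                        hedge_delay_s: float = 0.5):
    it = primary().__aiter__()
    try:
        # Hold back the first token until it arrives inside the hedge window.
        first = await asyncio.wait_for(_first_or_done(it), timeout=hedge_delay_s)
    except asyncio.TimeoutError:
        # wait_for has cancelled the pending read and nothing was forwarded,
        # so the fallback starts from a clean slate.
        async for token in fallback():
            yield token
        return
    if first is _DONE:
        return
    yield first  # commit: from here on only the primary is used
    async for token in it:
        yield token
```

Because the decision is made before the first token is yielded, the consumer (the TTS queue in this design) never sees output from two providers in one reply.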
TTS — Text to Speech
Three providers implement synthesize(text) → bytes and synthesize_sentences(token_stream, on_audio_chunk). The sentence-streaming interface starts playing the first sentence while later sentences are still being synthesised, cutting perceived latency.
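Sentence streaming can be sketched as an async generator that cuts the token stream at sentence boundaries and hands each sentence to the synthesizer as soon as it is complete. The function names and the explicit `synthesize` parameter are assumptions for illustration; the boundary rule (terminal punctuation followed by whitespace) is deliberately simple.

```python
import asyncio
import re

# Split after ., ! or ? when followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def sentences(token_stream):
    """Yield complete sentences as soon as the token stream closes them."""
    buffer = ""
    async for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        for complete in parts[:-1]:
            if complete.strip():
                yield complete.strip()
        buffer = parts[-1]  # the unfinished tail stays buffered
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream


async def stream_sentences_to_tts(token_stream, synthesize, on_audio_chunk):
    async for sentence in sentences(token_stream):
        on_audio_chunk(synthesize(sentence))
```

The first audio chunk is produced after the first sentence boundary rather than after the full reply, which is where the perceived-latency saving comes from.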
| Provider | Latency (M4) | Mode | Notes |
|---|---|---|---|
| Kokoro-82M ONNX (primary) | ~30–80ms/sentence | Offline, 24kHz PCM | 4 voices: bf_emma, af_sky, am_adam, bm_george · ~300 MB download |
| Edge TTS (fallback 1) | ~200–350ms | Network, 24kHz PCM | Indian English en-IN-NeerjaNeural · MP3→PCM via pydub |
| Coqui XTTS v2 (fallback 2) | ~700ms (M4 MPS GPU) | Offline, 24kHz PCM | Voice cloning from assets/reference_voice.wav · ~3–5s on CPU |
Selection: TTS_PROVIDERS=kokoro,edge,xtts — first entry is primary, rest are fallbacks in order.
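The ordered-fallback rule can be sketched as follows. The `registry` mapping of provider names to callables and both function names are hypothetical; only the comma-separated TTS_PROVIDERS format comes from the text above.

```python
def parse_providers(value: str):
    """Split a TTS_PROVIDERS value into an ordered list of provider names."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    if not names:
        raise ValueError("TTS_PROVIDERS must name at least one provider")
    return names


def synthesize_with_fallback(text, order, registry):
    """Try each provider in order and return (name, audio) from the first success."""
    errors = []
    for name in order:
        try:
            return name, registry[name](text)
        except Exception as exc:  # any provider failure moves to the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the audio makes it straightforward to label metrics or logs with the provider that actually served the request.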
04 — Conversation
7-State Admission Machine
Stream Detection
Matches the caller's reply against intake.json stream_keywords. Re-asks if stream not detected.
Marks Collection
Matches marks with the regex \d{2,3}(?:\.\d)?\s*%?. Re-asks if not matched.
Course Suggestion
Looks up intake.json course_recommendations by stream × marks band (≥75% high, 50–74% medium, <50% low). Injects urgency if seats ≤ 5: "Only N seats left!"
Persuasion Layer — 5 Intent Classes
interested
"yes", "sure", "tell me more", "definitely" → Push for counseling appointment
hesitant
"maybe", "not sure", "let me think" → Highlight success rate, scholarships
confused
"what", "don't understand", "explain" → Simplify, ask an easier question
negative
"no", "not interested", "busy" → Accept gracefully, mark low interest
neutral
everything else → Let LLM respond freely without directive override
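The five intent classes can be sketched as phrase matching on whole words. The phrases are the ones listed above; the evaluation order (negative, hesitant, confused, interested) is an assumption chosen so that "not sure" is not read as "sure", and the function name is hypothetical.

```python
import re

# Checked in this order; the first class with a matching phrase wins.
INTENT_PHRASES = [
    ("negative", ("not interested", "no", "busy")),
    ("hesitant", ("maybe", "not sure", "let me think")),
    ("confused", ("what", "don't understand", "explain")),
    ("interested", ("yes", "sure", "tell me more", "definitely")),
]


def classify_intent(utterance: str) -> str:
    text = utterance.lower()
    for intent, phrases in INTENT_PHRASES:
        for phrase in phrases:
            # Word boundaries stop "no" from matching inside "not" or "know".
            if re.search(rf"\b{re.escape(phrase)}\b", text):
                return intent
    return "neutral"  # the LLM responds without a directive override
```

Keyword classification is cheap enough to run on every turn; ambiguous replies that mix classes are resolved purely by the ordering, which is the main thing to tune in a sketch like this.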
Escalation triggers (11 phrases, including "speak to a person", "real agent", "human", and "transfer me"): the AI delivers a handoff message, then an Asterisk AMI Redirect transfers the call to the counselor extension.
05 — Observability
Metrics, Dashboards & CI
14 Prometheus Metrics
| Metric | Type | Description |
|---|---|---|
| aicc_calls_total | Counter | Calls by outcome + provider |
| aicc_call_duration_seconds | Histogram | Full call duration distribution |
| aicc_active_calls | Gauge | Currently active concurrent calls |
| aicc_stt_latency_ms | Histogram | STT transcription latency per utterance |
| aicc_llm_first_token_ms | Histogram | LLM time-to-first-token |
| aicc_tts_first_chunk_ms | Histogram | TTS first audio chunk latency |
| aicc_interest_total | Counter | Calls by interest level (high/medium/low/unknown) |
| aicc_lead_score | Histogram | Pre-call XGBoost lead score distribution |
| aicc_enrollment_probability | Histogram | Post-call enrollment probability distribution |
| aicc_call_quality_score | Histogram | Post-call quality score 0–10 (6-axis LLM) |
| aicc_escalations_total | Counter | Calls escalated to a human counselor |
| aicc_language_total | Counter | Calls by detected language (en / hi) |
| aicc_ab_variant_total | Counter | A/B outcomes by variant + outcome |
| aicc_leads_skipped_total | Counter | Leads skipped by reason (dnc, ndnc, score, recall-interval) |
9 Grafana Panels
Active Calls
Real-time gauge of concurrent AI conversations
Calls Today
Total calls in the current calendar day
Success Rate
Completed calls / total calls (%)
Outcome Breakdown
Pie: completed, no_answer, failed, escalated
Interest Distribution
Pie: high, medium, low, unknown
STT Latency p95
95th percentile speech-to-text latency
LLM Latency p95
95th percentile time-to-first-token
TTS Latency p95
95th percentile first audio chunk
Calls by Provider
SIP vs Twilio over time
Auto-provisioned on first start. No manual Grafana config needed.
AlertManager Rules
7-Job CI Pipeline (GitHub Actions)
All 7 jobs must pass before merging to main. Heavy model dependencies (torch, TTS, livekit, mlx_whisper) are excluded from CI via lazy import guards and unittest.mock stubs.
Lint
ruff — style, unused imports, bugbear, import order
Type Check
mypy — type mismatches, wrong args, missing returns
Unit Tests
pytest — state machine, persuasion, entity extraction
Integration Tests
pytest — end-to-end StateManager flow (runs after unit tests)
Security Scan
bandit — hardcoded secrets, unsafe calls, injection risks
Dependency Security
pip-audit — known CVEs in requirements-ci.txt packages
Import Sanity
Core modules import without crashing (optional deps guarded)
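The lazy import guards and mock stubs mentioned for CI can be sketched with the standard library alone. `optional_import` is a hypothetical helper name; the stubbing shown uses `unittest.mock.patch.dict` on `sys.modules`, which is one common way to make a heavy dependency importable in tests.

```python
import importlib
import sys
from unittest import mock


def optional_import(name: str):
    """Return the module if it can be imported, otherwise None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# In a test, a heavy dependency can be replaced by a stub for the duration
# of a block, so code that imports it lazily still runs in CI.
with mock.patch.dict(sys.modules, {"mlx_whisper": mock.MagicMock()}):
    stubbed = optional_import("mlx_whisper")
```

Code that calls `optional_import` at use time rather than at module import time keeps the core modules importable on machines where torch, TTS, livekit, or mlx_whisper are not installed.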
06 — Cost Engineering
Under ₹1/Min, Zero AI API Spend
| Component | Cost/Min | Notes |
|---|---|---|
| SIP/GSM trunk | ₹0.50–₹0.80 | Varies by provider and destination |
| Ollama LLM (phi3.5:3.8b) | ₹0.00 | Runs locally on M4 Metal GPU |
| MLX Whisper STT | ₹0.00 | Runs locally on the M4 GPU (MLX) |
| Kokoro ONNX TTS | ₹0.00 | Runs locally, no network required |
| LiveKit (self-hosted) | ₹0.00 | Docker container |
| Redis + Asterisk + Prometheus + Grafana | ₹0.00 | Self-hosted Docker |
| Total — SIP + Kokoro route | ~₹0.50–₹0.80 | Full AI call at under ₹1/min |
| Total — Twilio fallback fires | ~₹4–₹6 | Only activates if SIP fails |
Monthly Estimate — 3 Concurrent Calls, 8h/day
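A rough upper bound for the SIP + Kokoro route can be worked out from the per-minute trunk rates in the table above. The sketch assumes all three lines are busy for the full 8 hours and 30 calling days per month; both are assumptions, so real spend should come in lower. Rates are kept in paise to avoid floating-point rounding.

```python
def monthly_trunk_cost(concurrent_calls, hours_per_day, days, rate_paise_per_min):
    """Return (call minutes, cost in rupees) at full utilisation."""
    minutes = concurrent_calls * hours_per_day * 60 * days
    return minutes, minutes * rate_paise_per_min // 100


minutes, low_inr = monthly_trunk_cost(3, 8, 30, 50)   # ₹0.50/min
_, high_inr = monthly_trunk_cost(3, 8, 30, 80)        # ₹0.80/min
print(minutes, low_inr, high_inr)  # 43200 21600 34560
```

At full utilisation that is 43,200 call minutes and roughly ₹21,600–₹34,560 per month in trunk charges, with the AI components contributing nothing to the per-minute cost.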
07 — Stack