AI Engineering · Voice Infrastructure

AICC-SOF

AI Voice Call Center for University Admissions

Self-hosted on Mac Mini M4 · Python 3.11+ · FastAPI + LiveKit + Asterisk · Zero AI API spend

~540ms

End-to-end AI voice latency
(Kokoro route, Mac Mini M4)

<₹1/min

Full AI call cost via SIP
(vs ₹4–6/min Twilio-only)

14

Prometheus metrics
+ 9 Grafana panels

7

CI pipeline jobs
(GitHub Actions, all must pass)

"Everything runs locally on Mac Mini M4. No paid AI APIs, no cloud LLM, no vendor lock-in. The full STT → LLM → TTS pipeline completes in ~540ms on the Kokoro route. Telephony (SIP/Twilio) and Google Sheets are the only external dependencies." — AICC-SOF README


01 — Overview

What It Does

The Problem

University admissions teams run outbound call campaigns to thousands of prospective students. Traditional options are expensive: Twilio-backed AI costs ₹4–6/min, human agents are slow and inconsistent, and cloud LLM APIs add recurring API spend plus latency.

What was needed: a fully self-hosted system that could run structured admissions conversations, score leads intelligently, and keep costs under ₹1/min — with zero AI API spend.

The Solution

AICC-SOF is a production-grade AI call center that runs entirely on a Mac Mini M4. It reads pending leads from Google Sheets, scores them with XGBoost, and dials the highest-scoring leads first, driving each call through a 7-state conversation machine.

Every call runs STT → LLM → TTS locally: MLX Whisper on the M4 Neural Engine for STT (~60–120ms), Ollama phi3.5:3.8b on the Metal GPU for the LLM (~70–100ms to first token), and Kokoro-82M ONNX for TTS (~30–80ms). Total end-to-end latency is ~540ms on the recommended route.

Core Capabilities

Outbound Campaigns

Reads pending leads from Google Sheets, scores with XGBoost, dials highest-scoring leads first. Up to MAX_CONCURRENT_CALLS simultaneous AI conversations (default: 3).

Inbound Handling

Same AI agent handles students who call your number. Auto-appends inbound callers to Google Sheets as new leads. DNC-checked and E.164-validated before engaging.

7-State Conversation Machine

Structured admissions flow: Greeting → Stream Detection → Marks Collection → Course Suggestion → Interest Detection → Persuasion → Closing. Extracts name, stream, marks, and interest level.

Lead Intelligence

XGBoost pre-call scoring (0–1), post-call quality scoring on 6 axes (0–10 each), and post-call enrollment probability prediction — all written back to Google Sheets.

TRAI NDNC Compliance

Checks India's Do Not Call registry before every call. Fails open (the call proceeds) if the registry API is unreachable. Numbers found on the NDNC registry are added automatically to the internal DNC list, which is backed by SQLite in WAL mode.

Operational Intelligence

Call recording (PCM + SQLite index), human escalation on 11 trigger phrases (Asterisk AMI Redirect), email nurture drip (Day 0/2/5), A/B script testing (50/50 deterministic split), dropout recovery campaigns.
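The "50/50 deterministic split" for A/B script testing can be illustrated with a hash-based assignment. This is a minimal sketch, not the project's code: the function name `ab_variant`, the `experiment` label, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def ab_variant(phone_e164: str, experiment: str = "script_v1") -> str:
    """Deterministically map a lead to variant "A" or "B".

    Hashing the phone number (instead of calling random()) means the same
    lead lands in the same variant across retries and process restarts.
    """
    digest = hashlib.sha256(f"{experiment}:{phone_e164}".encode("utf-8")).digest()
    # The first byte of a SHA-256 digest is even about half of the time.
    return "A" if digest[0] % 2 == 0 else "B"


if __name__ == "__main__":
    print(ab_variant("+919876543210"))
```

Including the experiment label in the hash input means a new experiment reshuffles leads rather than reusing the previous split.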

02 — Architecture

System Architecture

Campaign Intelligence

Google Sheets: Lead Database

Campaign Manager: Poll · Score · Batch

Lead Scorer: XGBoost 0–1

DNC Manager: TRAI NDNC · Internal

Telephony Layer

Telephony Router: SIP primary

Asterisk AMI: SIP trunk

Twilio REST: SIP fallback

LiveKit SIP: WebRTC bridge

Agent Session: Per-call pipeline

AI Conversation Pipeline

Silero VAD: ONNX ~2–3ms/frame

MLX Whisper STT: Neural Engine ~80ms

LLM Router: Ollama · Groq · Gemini

Kokoro TTS: ONNX ~30–80ms

Post-Call Pipeline

Quality Scorer: 6-axis LLM grading

Enrollment Prob: XGBoost post-call

Email Nurture: SMTP drip Day 0/2/5

Sheets Write-back: outcome · score · prob

Campaign Engine

Asyncio polling loop reads pending leads from Google Sheets every CAMPAIGN_POLL_INTERVAL seconds. Scores each lead via XGBoost, sorts by score descending, checks TRAI NDNC + internal DNC, then acquires an asyncio.Semaphore slot before dialing.
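The score-sort-filter-dial sequence above can be sketched with a plain `asyncio.Semaphore`. The names `Lead`, `run_batch`, `is_blocked`, and `dial` are hypothetical stand-ins; the real campaign manager also polls Google Sheets and calls the XGBoost scorer, which this sketch assumes has already happened.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Iterable


@dataclass
class Lead:
    phone: str
    score: float  # pre-call score in [0, 1], assumed already computed


async def run_batch(
    leads: Iterable[Lead],
    is_blocked: Callable[[str], bool],
    dial: Callable[[Lead], Awaitable[None]],
    max_concurrent: int = 3,
) -> list[str]:
    """Dial allowed leads, best score first, with a cap on calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    dialed: list[str] = []

    async def worker(lead: Lead) -> None:
        async with semaphore:  # wait here until a call slot is free
            dialed.append(lead.phone)
            await dial(lead)

    ordered = sorted(leads, key=lambda lead: lead.score, reverse=True)
    allowed = [lead for lead in ordered if not is_blocked(lead.phone)]
    await asyncio.gather(*(worker(lead) for lead in allowed))
    return dialed
```

The DNC check runs before a slot is acquired, so blocked numbers never occupy one of the concurrent call slots.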

Telephony Router

Prefers SIP via Asterisk AMI (panoramisk, async). Falls back to Twilio REST if SIP fails. Both paths terminate in LiveKit SIP, which bridges the PSTN call into a WebRTC room where the agent session runs.
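The SIP-first, Twilio-second behaviour can be sketched as a small wrapper around two dial callables. `originate_with_fallback` and the `Dialer` type are hypothetical; the actual router talks to Asterisk AMI through panoramisk and to the Twilio REST API, neither of which is shown here.

```python
import asyncio
from typing import Awaitable, Callable

# Takes an E.164 number and returns a call identifier (shape is an assumption).
Dialer = Callable[[str], Awaitable[str]]


async def originate_with_fallback(
    number: str, sip_dial: Dialer, twilio_dial: Dialer
) -> tuple[str, str]:
    """Try the SIP trunk first; use Twilio only if the SIP attempt raises."""
    try:
        return "sip", await sip_dial(number)
    except Exception as exc:  # broad on purpose: any SIP failure triggers fallback
        print(f"SIP originate failed ({exc!r}); falling back to Twilio")
        return "twilio", await twilio_dial(number)
```

Returning the provider name alongside the call id gives the caller what it needs to label a metric such as `aicc_calls_total` by provider.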

Agent Session

Per-call asyncio pipeline: subscribes to LiveKit audio track (5s poll timeout), synthesizes greeting, then enters the VAD-gated STT turn loop. Drives AdmissionFlow state machine and PersuasionLayer. Connects to Redis for session memory (TTL 1h).

AdmissionFlow

7-state machine defined in flows/admission_flow.py. Extracts stream (keyword matching), marks (regex \d{2,3}%?), and maps them to course recommendations from intake.json — 20 departments, seat counts, stream mappings.
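Marks extraction and banding might look like the sketch below. It uses the pattern quoted for the MARKS_COLLECTION state and the bands quoted for COURSE_SUGGESTION (≥75% high, 50–74% medium, <50% low). The function names, the added word boundary, and the 0–100 range check are assumptions, not the contents of flows/admission_flow.py.

```python
import re
from typing import Optional

# Two or three digits, optional single decimal, optional percent sign.
# The leading \b is an addition made for this sketch.
MARKS_RE = re.compile(r"\b(\d{2,3}(?:\.\d)?)\s*%?")


def extract_marks(utterance: str) -> Optional[float]:
    """Return the first number in the 0-100 range, or None if none is found."""
    for match in MARKS_RE.finditer(utterance):
        value = float(match.group(1))
        if 0 <= value <= 100:
            return value
    return None


def marks_band(marks: float) -> str:
    """Map a percentage to the band used to look up course recommendations."""
    if marks >= 75:
        return "high"
    if marks >= 50:
        return "medium"
    return "low"


if __name__ == "__main__":
    marks = extract_marks("I got 82.5% in my boards")
    print(marks, marks_band(marks))
```

A bare pattern like this also matches numbers that are not marks (the "12" in "12th", for example), so a real flow would likely need extra guards or a re-ask on implausible values.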

IntakeConfig Singleton

intake.json is the single source of truth for all 20 departments (B.Tech CSE, BCA, MBA, LLB, …), stream keywords, seat counts, and course recommendations. Loaded once at startup; restart to reload.

Observability Stack

Self-hosted Prometheus scrapes 14 metrics at GET /metrics. Auto-provisioned Grafana (9 panels, no manual config). AlertManager with 4 alert rules and email routing. structlog JSON/pretty logging.

03 — AI Pipeline

Deep Dive

Latency Budget — Mac Mini M4

All latencies include the 300ms silence-detection window. The recommended route (MLX Whisper + Ollama phi3.5 + Kokoro ONNX) achieves sub-600ms perceived latency.

Route	STT	LLM	TTS	Total
MLX + phi3.5 + Kokoro	~80ms	~90ms	~50ms	~540ms
MLX + phi3.5 + Edge TTS	~80ms	~90ms	~300ms	~770ms
MLX + Groq (hedge fires) + Kokoro	~80ms	~650ms	~50ms	~1080ms
faster-whisper + phi3.5 + Kokoro	~250ms	~90ms	~50ms	~690ms
faster-whisper + phi3.5 + XTTS MPS	~250ms	~90ms	~700ms	~1340ms

Note: All routes include 300ms silence detection. The silence threshold was tuned from 1200ms → 400ms → 300ms (9 frames × 32ms = 288ms, nominally 300ms) to minimize turn-detection latency.
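The totals in the latency table can be reproduced by adding the 300ms silence window to the three per-stage figures. The sketch below uses only numbers from the table; the function and variable names are illustrative. For the recommended route the stage figures sum to 520ms, about 20ms under the listed ~540ms, and the source does not break out where the remainder goes.

```python
SILENCE_WINDOW_MS = 300  # end-of-turn silence window included in every route

# Per-stage figures (STT, LLM, TTS) in milliseconds, copied from the table.
ROUTES = {
    "MLX + phi3.5 + Kokoro": (80, 90, 50),
    "MLX + phi3.5 + Edge TTS": (80, 90, 300),
    "MLX + Groq (hedge fires) + Kokoro": (80, 650, 50),
    "faster-whisper + phi3.5 + Kokoro": (250, 90, 50),
    "faster-whisper + phi3.5 + XTTS MPS": (250, 90, 700),
}


def perceived_latency_ms(stt: int, llm: int, tts: int) -> int:
    """Silence window plus the three pipeline stages."""
    return SILENCE_WINDOW_MS + stt + llm + tts


totals = {route: perceived_latency_ms(*stages) for route, stages in ROUTES.items()}

if __name__ == "__main__":
    for route, total in totals.items():
        print(f"{route:<36} {total} ms")
```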

STT — Speech to Text

Primary: mlx-whisper using Apple M4 Neural Engine + Metal GPU. Fallback: faster-whisper (CTranslate2, CPU int8). Both implement the same transcribe(pcm_bytes) → str interface via get_stt_client() factory. Selection: STT_PROVIDER=mlx|faster-whisper|auto (auto = mlx on macOS, faster-whisper elsewhere).
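The `STT_PROVIDER` selection rule can be written as a small pure function. This is a sketch of the rule as described, not the body of `get_stt_client()`; `resolve_stt_provider` and its `system` parameter are hypothetical and exist only to make the rule testable.

```python
import platform
from typing import Optional

VALID_PROVIDERS = ("mlx", "faster-whisper")


def resolve_stt_provider(setting: str = "auto", system: Optional[str] = None) -> str:
    """Resolve STT_PROVIDER to a concrete backend name.

    "auto" picks mlx on macOS (platform.system() == "Darwin") and
    faster-whisper everywhere else; explicit values are passed through.
    """
    system = system or platform.system()
    if setting == "auto":
        return "mlx" if system == "Darwin" else "faster-whisper"
    if setting in VALID_PROVIDERS:
        return setting
    raise ValueError(f"Unknown STT_PROVIDER: {setting!r}")
```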

Silero VAD gates STT: every incoming 32ms audio frame (16kHz) runs through Silero VAD ONNX (~2–3ms/frame vs ~5ms PyTorch). VAD accumulates speech frames and waits 300ms of silence (9 frames) before sending the buffer to Whisper — one transcription per utterance instead of one per 20ms chunk, eliminating ~50 redundant STT calls/second.

VAD is offloaded to a thread pool via asyncio.to_thread to unblock the asyncio event loop during ONNX inference. RMS energy fallback activates if torch is unavailable.
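The gating logic (accumulate speech frames, flush after 9 consecutive silent frames) can be sketched independently of Silero. In this sketch the per-frame speech decision is assumed to be precomputed and passed in as a boolean; `segment_utterances` is a hypothetical name, and the real pipeline would obtain that flag from the VAD model.

```python
from typing import Iterable, Iterator

SILENCE_FRAMES = 9  # 9 frames x 32 ms, roughly 300 ms of trailing silence


def segment_utterances(
    frames: Iterable[tuple[bytes, bool]], silence_frames: int = SILENCE_FRAMES
) -> Iterator[bytes]:
    """Yield one PCM buffer per utterance from (frame, is_speech) pairs."""
    buffer = bytearray()
    silent_run = 0
    for pcm, is_speech in frames:
        if is_speech:
            buffer.extend(pcm)
            silent_run = 0  # a short pause inside an utterance resets the count
        elif buffer:  # silence before any speech is ignored
            silent_run += 1
            if silent_run >= silence_frames:
                yield bytes(buffer)  # one buffer goes to STT per utterance
                buffer.clear()
                silent_run = 0
```

Only speech frames are buffered here, so pauses shorter than the threshold are dropped from the audio sent to STT; whether the real implementation keeps them is not stated in the source.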

LLM Router — Hedge Pattern

The LLMRouter in ai/llm.py implements a buffer-then-commit strategy that prevents broken audio when the primary provider is slow or cold-starting.

Primary (Ollama phi3.5:3.8b)

Metal GPU · ~70–100ms first token · Free · LLM_HEDGE_DELAY_MS=500 window

If first token arrives within 500ms → commit. Forward all tokens to TTS queue.

Fallbacks (Groq → Gemini)

Groq: llama-3.1-8b-instant ~150ms · Gemini: 1.5-flash ~200–400ms

If hedge window expires → cancel primary (zero tokens forwarded). Start fallback fresh. No partial token mixing in TTS queue.

The guarantee is that when the primary is cancelled, zero of its tokens have been forwarded to TTS. The fallback starts completely fresh, which prevents the earlier bug where partial Ollama tokens were already being synthesised when Groq started.

TTS — Text to Speech

Three providers implement synthesize(text) → bytes and synthesize_sentences(token_stream, on_audio_chunk). The sentence-streaming interface starts playing the first sentence while later sentences are still being synthesised, cutting perceived latency.

Provider	Latency (M4)	Mode	Notes
Kokoro-82M ONNX (primary)	~30–80ms/sentence	Offline, 24kHz PCM	4 voices: bf_emma, af_sky, am_adam, bm_george · ~300 MB download
Edge TTS (fallback 1)	~200–350ms	Network, 24kHz PCM	Indian English `en-IN-NeerjaNeural` · MP3→PCM via pydub
Coqui XTTS v2 (fallback 2)	~700ms (M4 MPS GPU)	Offline, 24kHz PCM	Voice cloning from `assets/reference_voice.wav` · ~3–5s on CPU

Selection: TTS_PROVIDERS=kokoro,edge,xtts — first entry is primary, rest are fallbacks in order.
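Sentence streaming depends on turning a token stream into complete sentences as early as possible. The sketch below shows one way to do that grouping; `sentences_from_tokens` and the punctuation-based boundary rule are assumptions, and the real `synthesize_sentences` would hand each sentence to a TTS provider instead of yielding text.

```python
import asyncio
import re
from typing import AsyncIterator

# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def sentences_from_tokens(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed LLM tokens into sentences so TTS can start on the first one."""
    pending = ""
    async for token in tokens:
        pending += token
        parts = SENTENCE_END.split(pending)
        for sentence in parts[:-1]:  # everything but the last part is complete
            if sentence.strip():
                yield sentence.strip()
        pending = parts[-1]
    if pending.strip():  # flush whatever is left when the stream ends
        yield pending.strip()
```

A sentence is emitted only once the whitespace after its final punctuation arrives, which keeps a decimal such as "82.5" from being split mid-number.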

04 — Conversation

7-State Admission Machine

State 1

GREETING

"Hello {name}, calling from {university}. Do you have a couple of minutes?"

Exit: positive/neutral → STREAM_DETECTION | negative → CLOSING

State 2

STREAM_DETECTION

"Which stream are you from — Science, Commerce, or Arts?" Detects via keyword matching against intake.json stream_keywords. Re-asks if stream not detected.

Exit: stream extracted → MARKS_COLLECTION

State 3

MARKS_COLLECTION

"What percentage did you score in your 12th board exams?" Extracts via regex \d{2,3}(?:\.\d)?\s*%?. Re-asks if not matched.

Exit: marks extracted → COURSE_SUGGESTION

State 4

COURSE_SUGGESTION

"Based on your {marks} in {stream}, {courses} would be a great fit." Courses from intake.json course_recommendations by stream × marks band (≥75% high, 50–74% medium, <50% low). Injects urgency if seats ≤ 5: "Only N seats left!"

Exit: auto-advance → INTEREST_DETECTION

State 5

INTEREST_DETECTION

"Would you like to speak with our counselor for a free 15-minute session?" PersuasionLayer classifies intent into 5 buckets: interested, hesitant, confused, negative, neutral.

Exit: interested → CLOSING | negative → CLOSING | hesitant/confused → PERSUASION

State 6

PERSUASION

Targeted persuasion templates: success rate, merit scholarships, placement stats. Hesitant → highlight outcomes. Confused → simplify, ask easier question. Max 2 persuasion attempts per call.

Exit: interested → CLOSING | negative → CLOSING | max attempts → CLOSING

State 7

CLOSING → ENDED

Thank/goodbye message tuned to interest level. Triggers PostCallPipeline: quality scoring (6-axis LLM, 0–10), enrollment probability (XGBoost), email nurture scheduling, and Google Sheets write-back.

Exit: always → ENDED
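The exit rules for the seven states can be collected into one transition function. This is a sketch, not the `AdmissionFlow` class: `next_state` and its parameters are hypothetical, and two behaviours the text leaves open are marked as assumptions in the comments (any non-negative reply to the greeting proceeds, and a neutral reply in INTEREST_DETECTION stays in place).

```python
from enum import Enum
from typing import Optional

MAX_PERSUASION_ATTEMPTS = 2


class State(Enum):
    GREETING = "greeting"
    STREAM_DETECTION = "stream_detection"
    MARKS_COLLECTION = "marks_collection"
    COURSE_SUGGESTION = "course_suggestion"
    INTEREST_DETECTION = "interest_detection"
    PERSUASION = "persuasion"
    CLOSING = "closing"
    ENDED = "ended"  # terminal marker reached after CLOSING


def next_state(
    state: State,
    intent: Optional[str] = None,
    slot_filled: bool = False,
    persuasion_attempts: int = 0,
) -> State:
    """Return the state that follows `state` for the given turn outcome."""
    if state is State.GREETING:
        # Assumption: anything other than a negative reply continues the call.
        return State.CLOSING if intent == "negative" else State.STREAM_DETECTION
    if state is State.STREAM_DETECTION:
        return State.MARKS_COLLECTION if slot_filled else state  # re-ask
    if state is State.MARKS_COLLECTION:
        return State.COURSE_SUGGESTION if slot_filled else state  # re-ask
    if state is State.COURSE_SUGGESTION:
        return State.INTEREST_DETECTION  # auto-advance
    if state is State.INTEREST_DETECTION:
        if intent in ("interested", "negative"):
            return State.CLOSING
        if intent in ("hesitant", "confused"):
            return State.PERSUASION
        return state  # Assumption: neutral replies keep the state unchanged.
    if state is State.PERSUASION:
        if intent in ("interested", "negative"):
            return State.CLOSING
        if persuasion_attempts >= MAX_PERSUASION_ATTEMPTS:
            return State.CLOSING
        return state
    return State.ENDED  # CLOSING and ENDED both lead to ENDED
```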

Persuasion Layer — 5 Intent Classes

interested

"yes", "sure", "tell me more", "definitely" → Push for counseling appointment

hesitant

"maybe", "not sure", "let me think" → Highlight success rate, scholarships

confused

"what", "don't understand", "explain" → Simplify, ask an easier question

negative

"no", "not interested", "busy" → Accept gracefully, mark low interest

neutral

everything else → Let LLM respond freely without directive override

Escalation triggers (11 phrases including "speak to a person", "real agent", "human", "transfer me"): AI delivers handoff message and Asterisk AMI Redirect transfers to counselor extension.
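The five intent classes can be approximated with phrase matching over the example phrases listed above. This is a sketch only: `classify_intent`, the whole-word matching, and the evaluation order are assumptions. Checking "hesitant" before "interested" matters because "not sure" contains "sure".

```python
import re

# Example phrases from the intent list; the order of the classes is an assumption.
INTENT_PHRASES = [
    ("negative", ["no", "not interested", "busy"]),
    ("hesitant", ["maybe", "not sure", "let me think"]),
    ("confused", ["what", "don't understand", "explain"]),
    ("interested", ["yes", "sure", "tell me more", "definitely"]),
]


def classify_intent(utterance: str) -> str:
    """Return the first intent whose phrase appears as whole words, else neutral."""
    text = utterance.lower()
    for intent, phrases in INTENT_PHRASES:
        for phrase in phrases:
            # \b keeps "no" from matching inside "not" or "know".
            if re.search(rf"\b{re.escape(phrase)}\b", text):
                return intent
    return "neutral"


if __name__ == "__main__":
    print(classify_intent("Hmm, I'm not sure yet"))
```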

05 — Observability

Metrics, Dashboards & CI

14 Prometheus Metrics

Metric	Type	Description
aicc_calls_total	Counter	Calls by `outcome` + `provider`
aicc_call_duration_seconds	Histogram	Full call duration distribution
aicc_active_calls	Gauge	Currently active concurrent calls
aicc_stt_latency_ms	Histogram	STT transcription latency per utterance
aicc_llm_first_token_ms	Histogram	LLM time-to-first-token
aicc_tts_first_chunk_ms	Histogram	TTS first audio chunk latency
aicc_interest_total	Counter	Calls by interest level (high/medium/low/unknown)
aicc_lead_score	Histogram	Pre-call XGBoost lead score distribution
aicc_enrollment_probability	Histogram	Post-call enrollment probability distribution
aicc_call_quality_score	Histogram	Post-call quality score 0–10 (6-axis LLM)
aicc_escalations_total	Counter	Calls escalated to a human counselor
aicc_language_total	Counter	Calls by detected language (en / hi)
aicc_ab_variant_total	Counter	A/B outcomes by `variant` + `outcome`
aicc_leads_skipped_total	Counter	Leads skipped by `reason` (dnc, ndnc, score, recall-interval)

9 Grafana Panels

Active Calls

Real-time gauge of concurrent AI conversations

Calls Today

Total calls in the current calendar day

Success Rate

Completed calls / total calls (%)

Outcome Breakdown

Pie: completed, no_answer, failed, escalated

Interest Distribution

Pie: high, medium, low, unknown

STT Latency p95

95th percentile speech-to-text latency

LLM Latency p95

95th percentile time-to-first-token

TTS Latency p95

95th percentile first audio chunk

Calls by Provider

SIP vs Twilio over time

Auto-provisioned on first start. No manual Grafana config needed.

AlertManager Rules

warning

HighCallFailureRate

Failure rate > 40% for 5 consecutive minutes

warning

NoCalls

Zero calls for 2 hours during campaign window

warning

HighLatency

STT p95 > 2s or LLM p95 > 4s

info

ActiveCallsZero

Active calls gauge stuck at 0 during campaign hours

7-Job CI Pipeline (GitHub Actions)

All 7 jobs must pass before merging to main. Heavy model dependencies (torch, TTS, livekit, mlx_whisper) are excluded from CI via lazy import guards and unittest.mock stubs.

Lint

ruff — style, unused imports, bugbear, import order

Type Check

mypy — type mismatches, wrong args, missing returns

Unit Tests

pytest — state machine, persuasion, entity extraction

Integration Tests

pytest — end-to-end StateManager flow (runs after unit tests)

Security Scan

bandit — hardcoded secrets, unsafe calls, injection risks

Dependency Security

pip-audit — known CVEs in requirements-ci.txt packages

Import Sanity

Core modules import without crashing (optional deps guarded)

06 — Cost Engineering

Under ₹1/Min, Zero AI API Spend

Component	Cost/Min	Notes
SIP/GSM trunk	₹0.50–₹0.80	Varies by provider and destination
Ollama LLM (phi3.5:3.8b)	₹0.00	Runs locally on M4 Metal GPU
MLX Whisper STT	₹0.00	Runs locally on M4 Neural Engine
Kokoro ONNX TTS	₹0.00	Runs locally, no network required
LiveKit (self-hosted)	₹0.00	Docker container
Redis + Asterisk + Prometheus + Grafana	₹0.00	Self-hosted Docker
Total — SIP + Kokoro route	~₹0.50–₹0.80	Full AI call at under ₹1/min
Total — Twilio fallback fires	~₹4–₹6	Only activates if SIP fails

Monthly Estimate — 3 Concurrent Calls, 8h/day

~240

Calls/day (max)

20–30 calls/hour at 2–3 min avg

~4,800

Calls/month (max)

20 working days × 240

~₹7,800

Max monthly cost

@ ₹0.65/min avg, 2.5 min avg call

~₹60,000

Equivalent Twilio-only

~8× the cost for the same volume
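The monthly figures follow from calls × average minutes × rate per minute. The sketch below reproduces them from the numbers quoted above; the ₹5/min Twilio rate is an assumed midpoint of the ₹4–6 range, and the function name is illustrative.

```python
def monthly_cost_inr(calls: int, avg_minutes: float, rate_per_min: float) -> float:
    """Total monthly spend in rupees for a given call volume and per-minute rate."""
    return calls * avg_minutes * rate_per_min


CALLS_PER_MONTH = 4_800
AVG_CALL_MINUTES = 2.5

sip_cost = monthly_cost_inr(CALLS_PER_MONTH, AVG_CALL_MINUTES, 0.65)
# Assumption: 5.0 is the midpoint of the quoted Twilio range of 4 to 6 per minute.
twilio_cost = monthly_cost_inr(CALLS_PER_MONTH, AVG_CALL_MINUTES, 5.0)

if __name__ == "__main__":
    print(f"SIP route:   Rs {sip_cost:,.0f}")
    print(f"Twilio-only: Rs {twilio_cost:,.0f}")
    print(f"Ratio:       {twilio_cost / sip_cost:.1f}x")
```

At these inputs the volume is 12,000 call minutes per month, and the Twilio-only figure works out to roughly 7.7 times the SIP figure.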

07 — Stack

Technology Stack

AI / ML

mlx-whisper (STT, M4 Neural Engine) · faster-whisper (STT fallback) · Ollama phi3.5:3.8b · Groq llama-3.1-8b-instant · Gemini 1.5 Flash · Kokoro-82M ONNX (TTS) · Edge TTS en-IN-NeerjaNeural · Coqui XTTS v2 · Silero VAD ONNX · XGBoost (lead scoring + enrollment probability) · langdetect (en/hi)

Telephony / Infrastructure

Asterisk AMI · panoramisk (async AMI) · LiveKit (WebRTC) · LiveKit SIP Bridge · Twilio REST (fallback) · Docker Compose · Redis (session memory, TTL 1h)

Backend / API

Python 3.11+ · FastAPI · uvicorn · asyncio · Pydantic-Settings · gspread (Google Sheets) · HMAC API key auth · SQLite WAL (DNC, recordings, nurture)

Observability / Quality

Prometheus · Grafana · AlertManager · structlog (JSON/pretty) · ruff (lint + format) · mypy (strict) · pytest + asyncio · bandit (security) · pip-audit (CVE scan) · GitHub Actions CI

Integrations

TRAI NDNC registry · Google Sheets API · SMTP email nurture · JEE/NEET candidate auto-import