AI/ML Research · Computer Vision · EdTech

CognEn

Cognitive Engagement Analyzer

v1.0 Browser: Next.js 15 · Webcam · Chrome Extension
v2.0 Server: Python FastAPI · Multi-person CCTV

86.5%
Behavior accuracy
(hybrid real-world test set)
<0.1ms
1D-CNN inference P95
(isolated classifier)
96.3%
Seat occupancy accuracy
(v2.0 dense environments)
55,100
Annotated temporal sequences
(hybrid dataset, 6 classes)
"Context-Gate Override, ConfusionPenaltyLoss, and AR(1) temporal autocorrelation resolve critical failure modes documented in state-of-the-art engagement monitoring literature — yielding +12.4% precision for using_phone, +18.5% macro-F1, and temporal consistency 0.742 → 0.931." — CognEn README, Novel Algorithmic Contributions

Two Deployments, One ML Core


The Problem

Existing engagement monitoring systems apply a uniform downward-gaze penalty that cannot distinguish a student reading a book (attentive) from a student texting on their phone (distracted) — the biometric signals are visually homologous.

Rule-based spatial heuristics (the prior state of the art) achieve only 80.8% accuracy and fail under partial occlusion and object-detector failures. Networks trained without structured temporal data barely exceed random chance (21.3% on a 6-class problem vs 16.7% chance).

The Approach

CognEn resolves biometric ambiguity through five novel algorithms: context-conditional scoring with hard-rule post-processing, an asymmetric loss function penalizing high-cost pedagogical errors, AR(1) temporal fusion for sustained behavioral continuity, a Shapely-based prioritized seat mapper, and rolling-window OLS fatigue onset detection.

The result: 86.5% behavior accuracy on a hybrid real-world test set, 5.7 percentage points above the deterministic spatial baseline (80.8%), while the temporal consistency coefficient improves from 0.742 to 0.931.

v1.0 vs v2.0 — System Comparison

Feature | v1.0 Browser | v2.0 Server
────────────────────────────────────────
Primary Target | Single student, desk webcam | Classroom, multi-student CCTV
Input Source | 640×480 @ 5 FPS webcam | 1280×720 RTSP / file / USB
Tracking | Single person (first detected face) | Multi-person (ByteTrack MOT)
Face Detection | face-api.js TinyFaceDetector | YOLOv8-m (person) + MediaPipe FaceMesh
Object Detection | YOLOv8-nano ONNX (WebGPU/WASM) | YOLOv8-s (PyTorch/GPU)
Behavior Classifier | 1D-CNN ONNX (browser runtime) | 1D-CNN ONNX (server runtime)
Pipeline Throughput | ~125 ms/frame | ~35 ms/frame (~28 FPS)
Classifier Latency | <0.02ms/subject | <0.02ms/subject
Output | Dashboard + PDF report + Firebase | Room analytics WebSocket @ 5Hz
Extension | Chrome MV3 (Google Meet / Zoom) | n/a
Seat Mapping | n/a | Shapely polygon IoU (96.3% accuracy)

Core equation: v2.0 = multi_person_detection + tracking + [v1.0_per_person_pipeline × N] + room_analytics. Both versions share the identical 1D-CNN temporal classification core.

Pipeline Architecture


v1.0 — Browser Pipeline Next.js 15

Webcam: 640×480 @ 5 FPS
face-api.js: TinyFaceDetector
YOLOv8-nano: every 3rd frame
MediaPipe Hands: every 3rd frame
68-pt Landmarks: EAR · MAR · Gaze
Objects: book · phone
Hand Proximity: interaction score
Feature Fusion: 13-dim vector
1D-CNN Classifier: 15-frame window
Context-Gate Override
MLEngine EMA: α=0.15
Score 0–100 + Behavior Label
Dashboard: Firebase · PDF

v2.0 — Server Pipeline FastAPI + GPU

Video Source: RTSP · File · Camera
YOLOv8-m: Person Detection
YOLOv8-s: Object Detection
YOLOv8-pose: 17 Keypoints
ByteTrack: Persistent Track IDs
SeatMapper: Shapely IoU ≥0.30
Per-Person Loop × N: MediaPipe FaceMesh
Feature Fusion: 13-dim / frame
1D-CNN + Gate: <0.02ms
Room Analytics: per-seat attention
WebSocket 5Hz → Dashboard

6 Behavior Classes

attentive_screen
attentive_reading
distracted
using_phone
drowsy
absent

Screenshots


Screenshot (v1.0 · Google Meet · Live Dashboard Share): CognEn Live Dashboard screen-shared during a Google Meet session, displaying live biometric data, head pose calculations, and attention trend metrics.

Screenshot (v1.0 · Google Meet · Active Session View): Active Google Meet video call screen showing real-time cognitive engagement indexes, head pose stability tracking, and attention scores stream.

Screenshot (v1.0 · Zoom Integration · Multi-person Conference): Zoom Workplace meeting window with multiple video participants during a remote review meeting under cognitive analysis.

Screenshot (v1.0 · Zoom Integration · Active Meeting View): Zoom Project Review Meeting window showing participant video feeds and active engagement monitoring overview.


5 Novel Algorithms


ALGORITHM 01
Context-Gate Override Algorithm
A novel post-processing algorithm that intercepts the 1D-CNN output and modifies class probabilities using deterministic spatial heuristics. Resolves the biometric ambiguity between visually homologous states — phone use vs. reading — by mathematically compressing confidence scores when spatial context is unclear, or artificially boosting/suppressing specific classes (e.g., prioritising using_phone over attentive_reading when a phone object and hand interaction are detected in close proximity).
+12.4% precision (using_phone) · +9.1% recall (attentive_reading)
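
The gating idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than the project's actual rule set: the function name `context_gate`, the proximity threshold, the boost factor, the temperature-style flattening used for "compression", and the choice to treat "phone detected but hand not near it" as the unclear case.

```python
CLASSES = ["attentive_screen", "attentive_reading", "distracted",
           "using_phone", "drowsy", "absent"]


def context_gate(probs, phone_detected, hand_proximity,
                 proximity_threshold=0.7, boost=3.0, temperature=2.0):
    """Adjust classifier probabilities with spatial context (hypothetical rules).

    probs: dict mapping class name -> probability from the 1D-CNN.
    phone_detected: True if the object detector reports a phone.
    hand_proximity: hand-object interaction score in [0, 1].
    """
    adjusted = dict(probs)
    if phone_detected and hand_proximity >= proximity_threshold:
        # Clear context: favour using_phone over attentive_reading.
        adjusted["using_phone"] *= boost
        adjusted["attentive_reading"] /= boost
    elif phone_detected:
        # Unclear context: flatten the distribution so no class is over-confident.
        adjusted = {c: p ** (1.0 / temperature) for c, p in adjusted.items()}
    # Renormalise so the result is still a probability distribution.
    total = sum(adjusted.values())
    return {c: p / total for c, p in adjusted.items()}


cnn_output = {"attentive_screen": 0.05, "attentive_reading": 0.5,
              "distracted": 0.05, "using_phone": 0.3,
              "drowsy": 0.05, "absent": 0.05}
gated = context_gate(cnn_output, phone_detected=True, hand_proximity=0.9)
print(max(gated, key=gated.get))
```

With no phone in the frame the sketch returns the classifier output unchanged, so the gate only intervenes when spatial evidence exists.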
ALGORITHM 02
Asymmetric ConfusionPenaltyLoss Function
An asymmetric loss function that aligns the optimisation objective with real-world pedagogical utility. Applies a 3.0× penalty multiplier when the model predicts attentive_reading for a student who is actually in the using_phone state (a high-cost pedagogical error), while the standard 1.0× multiplier applies to less critical misclassifications. Standard cross-entropy treats all errors equally; this function does not. Trained with label smoothing 0.05 and AdamW with cosine annealing.
+18.5% improvement in macro-F1 over standard cross-entropy
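
One plausible way to realise such a loss is sketched below in plain Python for a single sample. The source gives only the 3.0× multiplier and the 0.05 label smoothing; how the multiplier enters the loss is an assumption here. This sketch scales label-smoothed cross-entropy by the expected penalty, which is 1 plus 2 times the probability assigned to attentive_reading when the true class is using_phone. The names `softmax` and `confusion_penalty_loss` are illustrative.

```python
import math

CLASSES = ["attentive_screen", "attentive_reading", "distracted",
           "using_phone", "drowsy", "absent"]
PHONE = CLASSES.index("using_phone")
READING = CLASSES.index("attentive_reading")


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confusion_penalty_loss(logits, target, penalty=3.0, smoothing=0.05):
    """Label-smoothed cross-entropy, up-weighted for phone -> reading errors."""
    probs = softmax(logits)
    k = len(probs)
    # Smoothed target: (1 - smoothing) on the true class plus smoothing / k everywhere.
    smooth = [smoothing / k + (1.0 - smoothing if i == target else 0.0)
              for i in range(k)]
    ce = -sum(t * math.log(max(p, 1e-12)) for t, p in zip(smooth, probs))
    # Expected penalty: 1.0 for every confusion except true=phone, predicted=reading.
    weight = 1.0
    if target == PHONE:
        weight += (penalty - 1.0) * probs[READING]
    return weight * ce


# The same-sized mistake costs more when a phone user is labelled as reading.
phone_as_reading = confusion_penalty_loss([0, 4, 0, 0, 0, 0], PHONE)
reading_as_phone = confusion_penalty_loss([0, 0, 0, 4, 0, 0], READING)
print(phone_as_reading > reading_as_phone)
```

Weighting by the predicted probability instead of the argmax keeps the penalty differentiable, which matters if the same form is ported to a training framework.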
ALGORITHM 03
Multimodal AR(1) Temporal Fusion + EMA Smoothing
A multimodal feature fusion algorithm that normalises and combines 13 disparate signals (EAR, PERCLOS, MAR, gaze deviation, head pose, blink rate, fatigue index, object one-hots, hand-object proximity) under AR(1) temporal dynamics with feature-tuned weights (gazeDeviation: 0.12, temporalPattern: 0.03). The final 0–100 engagement score is smoothed via Exponential Moving Average with decay factor α=0.15, eliminating high-frequency state oscillation.
Temporal consistency: 0.742 → 0.931 (with AR(1) autocorrelation)
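
The smoothing step can be sketched as a standard exponential moving average. The source states α=0.15 but not which term it multiplies; this sketch assumes α weights the newest score, so each output keeps 85% of the previous smoothed value. The function name `ema_smooth` is illustrative.

```python
def ema_smooth(scores, alpha=0.15):
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = []
    prev = None
    for x in scores:
        # The first sample seeds the average; later samples are blended in.
        prev = x if prev is None else alpha * x + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed


# A score that flickers between 80 and 20 every frame is strongly damped.
raw = [80, 20] * 10
smooth = ema_smooth(raw)
print(max(raw) - min(raw), round(max(smooth[10:]) - min(smooth[10:]), 1))
```

A small α trades responsiveness for stability: a genuine change in behaviour takes several frames to show up in the displayed score.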
ALGORITHM 04 · v2.0
Prioritized Seat Mapping Assignment Algorithm
A spatial algorithm that maps ByteTrack bounding boxes to physical classroom seats using Shapely polygons. Three-step prioritisation: (1) primary validation via centroid containment, (2) fallback via Intersection over Union (IoU ≥ 0.30), (3) tie-breaker heuristic for overlapping detections. Decouples persistent physical identities (seat A, seat B) from transient bounding-box track IDs, which ByteTrack re-assigns on every re-entry.
96.3% seat occupancy tracking accuracy in dense environments
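
The three-step prioritisation can be sketched without dependencies by simplifying seats to axis-aligned rectangles. This is a stand-in for the Shapely polygons the project uses, not its implementation, and the tie-breaker shown (highest IoU wins) is an assumption, since the source does not specify the heuristic. All function names are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def contains(seat, point):
    """True if the point lies inside the rectangular seat region."""
    x, y = point
    return seat[0] <= x <= seat[2] and seat[1] <= y <= seat[3]


def assign_seat(box, seats, min_iou=0.30):
    """Map one tracked bounding box to a seat ID, or None if nothing fits."""
    centroid = ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)
    # Step 1: seats whose region contains the box centroid.
    candidates = [sid for sid, seat in seats.items() if contains(seat, centroid)]
    # Step 2: fall back to seats with sufficient overlap.
    if not candidates:
        candidates = [sid for sid, seat in seats.items()
                      if iou(box, seat) >= min_iou]
    if not candidates:
        return None
    # Step 3 (assumed tie-breaker): pick the candidate with the highest IoU.
    return max(candidates, key=lambda sid: iou(box, seats[sid]))


seats = {"A": (0, 0, 100, 100), "B": (100, 0, 200, 100)}
print(assign_seat((10, 10, 90, 90), seats))
```

Because the result is a seat ID, downstream analytics stay stable even when the tracker hands the same student a new track ID.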
ALGORITHM 05 · v1.0
Fatigue Onset Detection Algorithm
A rolling-window statistical algorithm for early detection of cognitive fatigue. Employs Ordinary Least Squares (OLS) linear regression on the attention score over a 5-minute rolling window to calculate the degradation slope. Anomaly triggered on a composite condition: slope below -0.5, fit confidence R² above 0.4, and average score below 60. All three conditions must hold simultaneously to avoid false positives from transient dips.
Early fatigue detection before attention score reaches critical threshold
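
A minimal sketch of the composite check follows. It assumes the caller passes only the samples that fall inside the current 5-minute window and that timestamps are in minutes, so the slope is in score points per minute; the source does not state the slope's unit. The function names are illustrative.

```python
def ols_slope_r2(ts, ys):
    """Least-squares slope and R² of ys regressed on ts."""
    n = len(ts)
    mean_t, mean_y = sum(ts) / n, sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0:
        return 0.0, 0.0
    slope = sxy / sxx
    # For simple linear regression, R² equals the squared correlation.
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, r2


def fatigue_onset(ts, ys, slope_max=-0.5, r2_min=0.4, mean_max=60.0):
    """True only when all three conditions hold for the window."""
    if len(ts) < 3:
        return False
    slope, r2 = ols_slope_r2(ts, ys)
    mean_score = sum(ys) / len(ys)
    return slope < slope_max and r2 > r2_min and mean_score < mean_max


minutes = [0, 1, 2, 3, 4, 5]
print(fatigue_onset(minutes, [70, 64, 58, 52, 46, 40]))  # steady decline, low mean
print(fatigue_onset(minutes, [95, 94, 93, 92, 91, 90]))  # declining but still high
```

The second call shows why the mean condition matters: a clean downward trend alone does not trigger while the student's score is still high.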

1D-CNN Architecture & Dataset


1D-CNN Temporal Classifier

A 3-layer Convolutional Neural Network (1D) evaluates a sliding window of 15 frames (3 seconds at 5 FPS) to classify the current behavioral state. Parameters: ~20,550 — lightweight enough for browser ONNX runtime (WebGPU → WASM fallback).
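
The sliding window itself can be sketched as a fixed-length buffer that discards the oldest frame once 15 frames are held. The class name `FrameWindow` and its methods are hypothetical; only the window size (15) and the feature dimension (13) come from the description above.

```python
from collections import deque


class FrameWindow:
    """Fixed-length buffer of per-frame feature vectors (hypothetical helper)."""

    def __init__(self, size=15, dim=13):
        self.size = size
        self.dim = dim
        self.frames = deque(maxlen=size)  # oldest frame drops out automatically

    def push(self, features):
        """Add one frame; return True once a full window is available."""
        if len(features) != self.dim:
            raise ValueError(f"expected {self.dim} features, got {len(features)}")
        self.frames.append(list(features))
        return self.ready()

    def ready(self):
        return len(self.frames) == self.size

    def as_sequence(self):
        """Return the window as a list of frames, oldest first."""
        return [list(f) for f in self.frames]


window = FrameWindow()
for i in range(16):
    full = window.push([float(i)] * 13)
print(full, len(window.as_sequence()))
```

At 5 FPS the classifier can therefore produce its first label after 3 seconds and a fresh one on every subsequent frame.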

Training Configuration
────────────────────────────────────────
Optimizer: AdamW (weight decay 1×10⁻⁴)
Learning rate: 1×10⁻³ → CosineAnnealingLR → min 1×10⁻⁵
Batch size: 64
Max epochs: 50 (early stopping patience: 15)
Loss function: ConfusionPenaltyLoss (label smoothing 0.05)
Parameters: ~20,550
Inference: <0.02ms/subject (ONNX runtime)
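
The learning-rate schedule in the configuration above follows the usual closed form for cosine annealing. The sketch below assumes the annealing period equals the 50-epoch maximum, which the source does not state explicitly; `cosine_annealing_lr` is an illustrative name.

```python
import math


def cosine_annealing_lr(epoch, lr_max=1e-3, lr_min=1e-5, t_max=50):
    """lr(epoch) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * epoch / t_max))."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 25, 50):
    print(epoch, f"{cosine_annealing_lr(epoch):.6f}")
```

The rate starts at 1×10⁻³, passes the midpoint halfway through, and reaches the 1×10⁻⁵ floor at the final epoch.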

Dataset Specification — 55,100 Annotated Sequences

A hybrid dataset combining class-conditional synthetic sequences with real-world annotated clips to introduce lighting variance, partial occlusion, and detection jitter.

42,600
Training sequences
AR(1) synthetic (ρ=0.60–0.90) + DAiSEE/EngageNet clips, heavily augmented
1,800
Validation holdout
Deterministic 15% holdout, unaugmented
3,000
Test set
500/class, independent seed=43, strictly no leakage
6
Balanced classes
Equal class distribution across all splits
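
The AR(1) synthetic sequences can be illustrated with a small generator for one feature. It uses the textbook AR(1) recursion, with the noise scaled so the stationary standard deviation equals `std`. The per-class means and deviations would come from the project's class-conditional settings, which the source does not list, so the values in the usage line are placeholders.

```python
import random


def ar1_sequence(mean, std, rho, length=15, seed=0):
    """Generate x_t = mean + rho * (x_{t-1} - mean) + noise for one feature."""
    rng = random.Random(seed)
    # Scale the innovation noise so the process variance stays at std ** 2.
    noise_std = std * (1.0 - rho ** 2) ** 0.5
    x = rng.gauss(mean, std)
    seq = [x]
    for _ in range(length - 1):
        x = mean + rho * (x - mean) + rng.gauss(0.0, noise_std)
        seq.append(x)
    return seq


# Placeholder statistics for a single feature over one 15-frame window.
window = ar1_sequence(mean=0.30, std=0.05, rho=0.8, length=15, seed=1)
print(len(window))
```

Higher ρ values in the stated 0.60 to 0.90 range make consecutive frames more alike, which is what gives the classifier a temporal structure to learn.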

Augmentation strategy: Temporal time-warping (±20%), localised feature dropout (8%), Gaussian gaze jitter (σ=0.05), targeted object-context injection. Each 15-frame window = 3 seconds of behaviour at 5 Hz sampling.
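
Two of these augmentations are sketched below under stated assumptions: time-warping is implemented as linear-interpolation resampling that keeps the window length fixed, and the position of the gaze feature inside the vector (`gaze_index`) is hypothetical because the source does not give the feature ordering.

```python
import random


def time_warp(seq, factor):
    """Resample a list of feature vectors at `factor`x speed, same length.

    factor in [0.8, 1.2] corresponds to the ±20% range; when factor > 1 the
    final frame is held once the source sequence runs out.
    """
    n = len(seq)
    warped = []
    for i in range(n):
        pos = min(i * factor, n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        # Linear interpolation between the two neighbouring frames.
        warped.append([(1.0 - frac) * a + frac * b
                       for a, b in zip(seq[lo], seq[hi])])
    return warped


def gaze_jitter(seq, gaze_index, sigma=0.05, rng=None):
    """Add Gaussian noise to one feature column, leaving the others untouched."""
    rng = rng or random.Random()
    jittered = [list(frame) for frame in seq]
    for frame in jittered:
        frame[gaze_index] += rng.gauss(0.0, sigma)
    return jittered


window = [[float(i), 0.0] for i in range(15)]
slow = time_warp(window, 0.8)
noisy = gaze_jitter(window, gaze_index=1, rng=random.Random(0))
print(len(slow), len(noisy))
```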

Feature preprocessing: Gaze and yaw/pitch angles normalised to [−1, 1]. Biometrics (EAR, PERCLOS) clamped to [0, 1]. Object semantics mapped to one-hot vectors. Hand-object interactions encoded as continuous proximity coefficients.
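
These preprocessing rules reduce to a few small helpers, sketched here. The 90-degree normalisation range and the two-item object vocabulary are assumptions for illustration; the source names book and phone as detected objects but does not give the angle range or the exact one-hot layout.

```python
def clamp(value, lo, hi):
    """Restrict value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def normalise_angle(degrees, max_abs_degrees=90.0):
    """Map an angle to [-1, 1], saturating beyond the assumed range."""
    return clamp(degrees / max_abs_degrees, -1.0, 1.0)


OBJECTS = ["book", "phone"]  # assumed vocabulary


def one_hot(label, vocab=OBJECTS):
    """One-hot encode an object label; unknown labels map to all zeros."""
    return [1.0 if label == item else 0.0 for item in vocab]


raw = {"yaw": 45.0, "ear": 1.3, "object": "phone"}
features = [normalise_angle(raw["yaw"]), clamp(raw["ear"], 0.0, 1.0)]
features += one_hot(raw["object"])
print(features)
```

Keeping every input in a bounded range means no single signal dominates the convolution filters purely because of its units.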


Performance Metrics


0.865
Behavior Accuracy
Hybrid real-world test set · target ≥0.85 ✓
0.852
F1-macro
All 6 classes · target ≥0.80 ✓
0.914
Precision (phone)
using_phone class · target ≥0.90 ✓
0.873
Recall (reading)
attentive_reading class · target ≥0.85 ✓
0.931
Temporal Consistency
Coefficient · target ≥0.90 ✓ (was 0.742 without AR(1))
1.000
Tracking MOTA (v2.0)
ByteTrack · target ≥0.75 ✓
0.963
Seat Occupancy (v2.0)
Shapely polygon IoU · target ≥0.95 ✓
<0.1ms
Inference P95
1D-CNN ONNX isolated latency · target <100ms ✓
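
For reference, macro-F1 as reported above is the unweighted mean of per-class F1 scores. The sketch below computes it from a confusion matrix whose rows are true classes and columns are predicted classes; the toy matrix is illustrative and is not project data.

```python
def macro_f1(confusion):
    """Unweighted mean of per-class F1 from a square confusion matrix.

    confusion[r][c] = number of samples with true class r predicted as c.
    """
    k = len(confusion)
    scores = []
    for cls in range(k):
        tp = confusion[cls][cls]
        fp = sum(confusion[r][cls] for r in range(k)) - tp
        fn = sum(confusion[cls]) - tp
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k


toy = [[1, 1],
       [0, 2]]
print(round(macro_f1(toy), 4))
```

Because every class counts equally, a weak minority class pulls macro-F1 down even when overall accuracy stays high.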

Ablation Study — Component Penalties

Component Removed | Impact | Magnitude
────────────────────────────────────────
Context-Gate Override | Reintroduces biometric ambiguity (phone vs. reading) | −12.4% precision (using_phone) · −9.1% recall (attentive_reading)
ConfusionPenaltyLoss → standard cross-entropy | Removes optimisation of pedagogical edge-cases | −18.5% macro-F1
AR(1) Temporal Autocorrelation | Model flickers between transient states | Temporal consistency: 0.931 → 0.742 (−0.189)

Baseline Comparisons

Baseline Method | Accuracy | Failure Mode
────────────────────────────────────────
Naive Temporal (1D-CNN, uniform sequences) | 21.3% | Network topology alone is insufficient without structured temporal data (random chance = 16.7%)
Deterministic Spatial Heuristic (rule-based) | 80.8% | Severe degradation under partial occlusion and object-detector failures
CognEn (proposed) | 86.5% | +5.7 percentage points over heuristic baseline; robust to occlusion via Context-Gate

Synthetic test set (seed=43): All classification metrics score 1.000, demonstrating perfect learning of the AR(1) temporal structure. The hybrid real-world scores (0.865 accuracy, 0.852 F1) are the credible production estimates. Real-classroom accuracy is estimated at 0.82–0.88 pending additional fine-tuning on DAiSEE/EngageNet.

Technology Stack


v1.0 — Browser Stack Next.js 15

Framework
Next.js 15 (App Router, Turbopack) · TypeScript · Tailwind CSS · Radix UI · Recharts
Perception / ML (Browser)
face-api.js (TinyFaceDetector) · 68-point landmark model · YOLOv8-nano ONNX · onnxruntime-web (WebGPU → WASM) · MediaPipe Hands Lite
Auth / Storage / AI
Firebase Auth · Firestore · Google Genkit · Gemini (PDF report generation)
Extension
Chrome Extension Manifest V3 · Offscreen Document API

v2.0 — Server Stack FastAPI + GPU

Backend
Python 3 · FastAPI · uvicorn · WebSocket (5Hz push) · OpenCV VideoCapture
Perception / ML (GPU)
YOLOv8-m (person detection) · YOLOv8-s (object detection) · YOLOv8-pose (17 COCO keypoints) · MediaPipe FaceMesh (478-pt → 68-pt ibug) · 1D-CNN ONNX (20,550 params) · PyTorch (training only)
Tracking / Spatial
ByteTrack (boxmot) · Shapely (polygon IoU)
Testing
pytest (61 tests) · Evaluation suite (accuracy + F1 + latency)

Privacy & Security: No raw video stored — only derived feature vectors and metrics are persisted. All inference runs locally (in-browser for v1.0, on local GPU for v2.0). Track IDs (not names) are used by default in v2.0. Analytics auto-purge after 24 hours. The system never transmits raw video to any external service.