AI/ML Research · Computer Vision · EdTech

CognEn

Cognitive Engagement Analyzer

v1.0 Browser: Next.js 15 · Webcam · Chrome Extension

v2.0 Server: Python FastAPI · Multi-person CCTV

86.5%

Behavior accuracy
(hybrid real-world test set)

<0.1ms

1D-CNN inference P95
(isolated classifier)

96.3%

Seat occupancy accuracy
(v2.0 dense environments)

55,100

Annotated temporal sequences
(hybrid dataset, 6 classes)

"Context-Gate Override, ConfusionPenaltyLoss, and AR(1) temporal autocorrelation resolve critical failure modes documented in state-of-the-art engagement monitoring literature — yielding +12.4% precision for using_phone, +18.5% macro-F1, and temporal consistency 0.742 → 0.931." — CognEn README, Novel Algorithmic Contributions


01 — Overview

Two Deployments, One ML Core

The Problem

Existing engagement monitoring systems apply a uniform downward-gaze penalty that cannot distinguish a student reading a book (attentive) from a student texting on their phone (distracted) — the biometric signals are visually homologous.

Rule-based spatial heuristics (the prior state of the art) achieve only 80.8% accuracy and fail under partial occlusion and object-detector failures. Networks trained without structured temporal data barely exceed random chance (21.3% on a 6-class problem vs 16.7% chance).

The Approach

CognEn resolves biometric ambiguity through five novel algorithms: context-conditional scoring with hard-rule post-processing, an asymmetric loss function penalizing high-cost pedagogical errors, AR(1) temporal fusion for sustained behavioral continuity, a Shapely-based prioritized seat mapper, and rolling-window OLS fatigue onset detection.

The result: 86.5% behavior accuracy on a hybrid real-world test set, 5.7 percentage points above the deterministic spatial baseline (80.8%), while the temporal consistency coefficient improves from 0.742 to 0.931.

v1.0 vs v2.0 — System Comparison

Feature	v1.0 Browser	v2.0 Server
Primary Target	Single student, desk webcam	Classroom, multi-student CCTV
Input Source	640×480 @ 5 FPS webcam	1280×720 RTSP / file / USB
Tracking	Single person (first detected face)	Multi-person (ByteTrack MOT)
Face Detection	face-api.js TinyFaceDetector	YOLOv8-m (person) + MediaPipe FaceMesh
Object Detection	YOLOv8-nano ONNX (WebGPU/WASM)	YOLOv8-s (PyTorch/GPU)
Behavior Classifier	1D-CNN ONNX (browser runtime)	1D-CNN ONNX (server runtime)
Pipeline Throughput	~125 ms/frame	~35 ms/frame (~28 FPS)
Classifier Latency	<0.02ms/subject	<0.02ms/subject
Output	Dashboard + PDF report + Firebase	Room analytics WebSocket @ 5Hz
Extension	Chrome MV3 (Google Meet / Zoom)	—
Seat Mapping	—	Shapely polygon IoU (96.3% accuracy)

Core equation: v2.0 = multi_person_detection + tracking + [v1.0_per_person_pipeline × N] + room_analytics. Both versions share the identical 1D-CNN temporal classification core.

02 — Architecture

Pipeline Architecture

v1.0 — Browser Pipeline Next.js 15

Webcam: 640×480 @ 5 FPS

face-api.js: TinyFaceDetector

YOLOv8-nano: every 3rd frame

MediaPipe Hands: every 3rd frame

68-pt Landmarks: EAR · MAR · Gaze

Objects: book · phone

Hand Proximity: interaction score

Feature Fusion: 13-dim vector

1D-CNN Classifier: 15-frame window

Context-Gate Override

MLEngine EMA: α=0.15

Score 0–100 + Behavior Label

Dashboard: Firebase · PDF

v2.0 — Server Pipeline FastAPI + GPU

Video Source: RTSP · File · Camera

YOLOv8-m: Person Detection

YOLOv8-s: Object Detection

YOLOv8-pose: 17 Keypoints

ByteTrack: Persistent Track IDs

SeatMapper: Shapely IoU ≥0.30

Per-Person Loop × N: MediaPipe FaceMesh

Feature Fusion: 13-dim / frame

1D-CNN + Gate: <0.02ms

Room Analytics: per-seat attention

WebSocket 5Hz → Dashboard

6 Behavior Classes

✓

attentive_screen

✓

attentive_reading

⚠

distracted

✗

using_phone

⚠

drowsy

—

absent

03 — Visuals

Screenshots

CognEn Live Dashboard screen-shared during a Google Meet session, displaying live biometric data, head pose calculations, and attention trend metrics

v1.0 · Google Meet · Live Dashboard Share

Active Google Meet video call screen showing real-time cognitive engagement indexes, head pose stability tracking, and attention scores stream

v1.0 · Google Meet · Active Session View

Zoom Workplace meeting window with multiple video participants during a remote review meeting under cognitive analysis

v1.0 · Zoom Integration · Multi-person Conference

Zoom Project Review Meeting window showing participant video feeds and active engagement monitoring overview

v1.0 · Zoom Integration · Active Meeting View

04 — Novel Contributions

5 Novel Algorithms

ALGORITHM 01

Context-Gate Override Algorithm

A novel post-processing algorithm that intercepts the 1D-CNN output and modifies class probabilities using deterministic spatial heuristics. Resolves the biometric ambiguity between visually homologous states — phone use vs. reading — by mathematically compressing confidence scores when spatial context is unclear, or artificially boosting/suppressing specific classes (e.g., prioritising using_phone over attentive_reading when a phone object and hand interaction are detected in close proximity).

+12.4% precision (using_phone) · +9.1% recall (attentive_reading)
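The gating step described above can be illustrated with a minimal, dependency-free sketch. It is not the project's implementation: the function name `context_gate`, the proximity threshold, the boost factor, and the temperature used to compress confidence are all hypothetical values chosen for illustration. Only the class names and the overall behaviour (boost or suppress on clear context, compress on unclear context) come from the description.

```python
CLASSES = ["attentive_screen", "attentive_reading", "distracted",
           "using_phone", "drowsy", "absent"]


def context_gate(probs, phone_detected, book_detected, hand_proximity,
                 proximity_threshold=0.6, boost=3.0, temperature=2.0):
    """Adjust classifier probabilities using spatial context.

    probs: dict mapping each class name to its softmax probability.
    All numeric defaults are illustrative assumptions, not project values.
    """
    gated = dict(probs)
    if phone_detected and hand_proximity >= proximity_threshold:
        # Clear phone interaction: favour using_phone over attentive_reading.
        gated["using_phone"] *= boost
        gated["attentive_reading"] /= boost
    elif book_detected and not phone_detected:
        # Clear reading context: favour attentive_reading over using_phone.
        gated["attentive_reading"] *= boost
        gated["using_phone"] /= boost
    elif phone_detected or book_detected:
        # An object is present but the evidence is ambiguous (for example a
        # phone far from the hand): flatten the distribution via p ** (1/T).
        gated = {k: v ** (1.0 / temperature) for k, v in gated.items()}
    # With no detected object the probabilities pass through unchanged.
    total = sum(gated.values())
    return {k: v / total for k, v in gated.items()}


cnn_output = {"attentive_screen": 0.05, "attentive_reading": 0.50,
              "distracted": 0.05, "using_phone": 0.30,
              "drowsy": 0.05, "absent": 0.05}
gated = context_gate(cnn_output, phone_detected=True,
                     book_detected=False, hand_proximity=0.9)
print(max(gated, key=gated.get))
```

In this example the raw classifier output leans towards attentive_reading, and the detected phone in close hand proximity shifts the decision to using_phone.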

ALGORITHM 02

Asymmetric ConfusionPenaltyLoss Function

An asymmetric loss function that aligns the optimisation objective with real-world pedagogical utility. Applies a 3.0× penalty multiplier if the model predicts "reading" when the student is actually using a "phone" (a high-cost pedagogical error), while standard 1.0× multipliers apply to less critical misclassifications. Standard cross-entropy treats all errors equally — this function does not. Uses label smoothing 0.05 and AdamW with cosine annealing.

+18.5% improvement in macro-F1 over standard cross-entropy
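One plausible reading of this loss, for a single sample, is a label-smoothed cross-entropy scaled by a multiplier that depends on the (true class, predicted class) pair. The sketch below follows that reading in plain Python. The 3.0× multiplier and the 0.05 smoothing come from the description; the smoothing convention (ε spread uniformly over all classes, as PyTorch does), the use of the arg-max prediction to select the multiplier, and the names `PENALTY` and `confusion_penalty_loss` are assumptions.

```python
import math

CLASSES = ["attentive_screen", "attentive_reading", "distracted",
           "using_phone", "drowsy", "absent"]

# (true class, predicted class) -> multiplier; every other pair uses 1.0.
PENALTY = {("using_phone", "attentive_reading"): 3.0}


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confusion_penalty_loss(logits, true_idx, smoothing=0.05):
    """Label-smoothed cross-entropy scaled by an asymmetric penalty."""
    probs = softmax(logits)
    n = len(probs)
    # Smoothed target: (1 - eps) on the true class plus eps / n everywhere.
    target = [smoothing / n] * n
    target[true_idx] += 1.0 - smoothing
    ce = -sum(t * math.log(max(p, 1e-12)) for t, p in zip(target, probs))
    pred_idx = max(range(n), key=probs.__getitem__)
    weight = PENALTY.get((CLASSES[true_idx], CLASSES[pred_idx]), 1.0)
    return weight * ce


phone = CLASSES.index("using_phone")
costly = confusion_penalty_loss([0, 3, 0, 1, 0, 0], phone)    # predicts reading
ordinary = confusion_penalty_loss([0, 0, 3, 1, 0, 0], phone)  # predicts distracted
print(costly / ordinary)
```

The two calls have the same cross-entropy, so their ratio isolates the penalty multiplier. A batched training version would average this quantity over samples.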

ALGORITHM 03

Multimodal AR(1) Temporal Fusion + EMA Smoothing

A multimodal feature fusion algorithm that normalises and combines 13 disparate signals (EAR, PERCLOS, MAR, gaze deviation, head pose, blink rate, fatigue index, object one-hots, hand-object proximity) under AR(1) temporal dynamics with feature-tuned weights (gazeDeviation: 0.12, temporalPattern: 0.03). The final 0–100 engagement score is smoothed via Exponential Moving Average with decay factor α=0.15, eliminating high-frequency state oscillation.

Temporal consistency: 0.742 → 0.931 (with AR(1) autocorrelation)
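The EMA smoothing stage can be sketched in a few lines. The α=0.15 value and the 0–100 score range come from the description above; the class name `EmaSmoother`, the convention that α weights the newest sample, and seeding the average with the first observation are assumptions.

```python
class EmaSmoother:
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""

    def __init__(self, alpha=0.15):
        self.alpha = alpha
        self.value = None

    def update(self, raw_score):
        raw = min(100.0, max(0.0, float(raw_score)))  # keep within 0..100
        if self.value is None:
            self.value = raw  # assumed: seed with the first observation
        else:
            self.value = self.alpha * raw + (1.0 - self.alpha) * self.value
        return self.value


smoother = EmaSmoother()
for raw in (80, 20, 80, 80):
    print(round(smoother.update(raw), 2))
```

A single outlier frame (the drop to 20) moves the smoothed score by only 15% of the jump, which is how a small α suppresses high-frequency oscillation.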

ALGORITHM 04 · v2.0

Prioritized Seat Mapping Assignment Algorithm

A spatial algorithm that maps ByteTrack bounding boxes to physical classroom seats using Shapely polygons. Three-step prioritisation: (1) primary validation via centroid containment, (2) fallback via Intersection over Union (IoU ≥ 0.30), (3) tie-breaker heuristic for overlapping detections. Decouples persistent physical identities (seat A, seat B) from transient bounding-box track IDs, which ByteTrack re-assigns on every re-entry.

96.3% seat occupancy tracking accuracy in dense environments
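The three-step prioritisation can be sketched without dependencies by replacing Shapely polygons with axis-aligned rectangles, which is a simplification of what the description states. The function names and the tie-breaker (containment first, then the highest IoU wins a contested seat) are assumptions; only the IoU ≥ 0.30 threshold and the containment-then-IoU order come from the text.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def contains(box, point):
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def assign_seats(seats, tracks, iou_min=0.30):
    """Map seat IDs to track IDs.

    seats:  dict seat_id -> box, tracks: dict track_id -> box.
    """
    candidates = []
    for track_id, tbox in tracks.items():
        centroid = ((tbox[0] + tbox[2]) / 2.0, (tbox[1] + tbox[3]) / 2.0)
        for seat_id, sbox in seats.items():
            overlap = iou(sbox, tbox)
            if contains(sbox, centroid):
                candidates.append((0, -overlap, seat_id, track_id))  # step 1
            elif overlap >= iou_min:
                candidates.append((1, -overlap, seat_id, track_id))  # step 2
    assignment, used_tracks = {}, set()
    # Step 3 (assumed tie-breaker): containment beats IoU-only matches, and
    # within a tier the largest overlap claims the seat first.
    for _, _, seat_id, track_id in sorted(candidates):
        if seat_id not in assignment and track_id not in used_tracks:
            assignment[seat_id] = track_id
            used_tracks.add(track_id)
    return assignment


seats = {"A": (0, 0, 100, 100), "B": (100, 0, 200, 100)}
print(assign_seats(seats, {7: (10, 10, 90, 90), 9: (95, 5, 190, 95)}))
```

Because the result is keyed by seat, a student who leaves and returns under a new track ID still appears in the same seat.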

ALGORITHM 05 · v1.0

Fatigue Onset Detection Algorithm

A rolling-window statistical algorithm for early detection of cognitive fatigue. Employs Ordinary Least Squares (OLS) linear regression on the attention score over a 5-minute rolling window to calculate the degradation slope. An anomaly is triggered on a composite condition: slope below -0.5, fit confidence R² above 0.4, and average score over the window below 60. All three conditions must hold simultaneously to avoid false positives from transient dips.

Early fatigue detection before attention score reaches critical threshold
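The composite trigger can be sketched with a closed-form OLS fit in plain Python. The three thresholds and the 5-minute window come from the description; the slope unit (score points per minute), the `(timestamp_seconds, score)` sample format, and the function names are assumptions.

```python
def ols_fit(xs, ys):
    """Return (slope, r_squared) of the least-squares line through the points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0:
        return 0.0, 0.0
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return sxy / sxx, r2


def fatigue_onset(samples, window_s=300, slope_max=-0.5,
                  r2_min=0.4, mean_max=60.0):
    """samples: chronological list of (timestamp_seconds, attention_score)."""
    if len(samples) < 2:
        return False
    latest = samples[-1][0]
    window = [(t, s) for t, s in samples if latest - t <= window_s]
    if len(window) < 2:
        return False
    # Assumed unit: minutes, so the slope is in score points per minute.
    xs = [(t - window[0][0]) / 60.0 for t, _ in window]
    ys = [s for _, s in window]
    slope, r2 = ols_fit(xs, ys)
    mean_score = sum(ys) / len(ys)
    # All three conditions must hold at the same time.
    return slope < slope_max and r2 > r2_min and mean_score < mean_max


declining = [(t, 65 - 3 * (t / 60)) for t in range(0, 301, 10)]
print(fatigue_onset(declining))
```

Requiring a good fit (R²) alongside the slope is what filters out noisy windows where a steep slope is produced by one or two outliers.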

05 — ML Core

1D-CNN Architecture & Dataset

1D-CNN Temporal Classifier

A 3-layer Convolutional Neural Network (1D) evaluates a sliding window of 15 frames (3 seconds at 5 FPS) to classify the current behavioral state. Parameters: ~20,550 — lightweight enough for browser ONNX runtime (WebGPU → WASM fallback).

Input: (Batch, 13, 15), 13 features × 15 frames

Layer 1: Conv1D + BatchNorm

Layer 2: Conv1D + BatchNorm

Layer 3: Conv1D + BatchNorm

GAP: Global Avg Pooling

Output: Linear + Softmax

6 Classes: probabilities over the 6 behaviors
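The (Batch, 13, 15) input implies a per-subject rolling buffer of the last 15 feature vectors, transposed to channels-first order before inference. A minimal sketch of such a buffer follows; the class name `FeatureWindow` and its methods are hypothetical, and nested lists stand in for whatever tensor type the ONNX runtime receives.

```python
from collections import deque

N_FEATURES = 13  # features per frame
WINDOW = 15      # frames per window (3 seconds at 5 FPS)


class FeatureWindow:
    """Rolling buffer holding the most recent WINDOW feature vectors."""

    def __init__(self):
        self.frames = deque(maxlen=WINDOW)  # oldest frame drops automatically

    def push(self, features):
        if len(features) != N_FEATURES:
            raise ValueError(f"expected {N_FEATURES} features, got {len(features)}")
        self.frames.append(list(features))

    def ready(self):
        return len(self.frames) == WINDOW

    def as_model_input(self):
        """Return nested lists shaped (1, N_FEATURES, WINDOW)."""
        if not self.ready():
            raise RuntimeError("window not yet full")
        # Transpose (frames, features) to (features, frames), add a batch axis.
        return [[[frame[f] for frame in self.frames]
                 for f in range(N_FEATURES)]]


window = FeatureWindow()
for i in range(20):
    window.push([float(i)] * N_FEATURES)
batch = window.as_model_input()
print(len(batch), len(batch[0]), len(batch[0][0]))
```

After 20 pushes the buffer holds frames 5 through 19, so each classification sees only the latest 3 seconds of behaviour.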

Training Configuration


Optimizer: AdamW (weight decay 1×10⁻⁴)

Learning rate: 1×10⁻³ → CosineAnnealingLR → min 1×10⁻⁵

Batch size: 64

Max epochs: 50 (early stopping patience: 15)

Loss function: ConfusionPenaltyLoss (label smoothing 0.05)

Parameters: ~20,550

Inference: <0.02ms/subject (ONNX runtime)
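For reference, the learning-rate schedule above has a simple closed form (the same one PyTorch's `CosineAnnealingLR` follows when stepped once per epoch). The sketch below evaluates it in plain Python; setting the cosine period `t_max` equal to the 50-epoch limit is an assumption, since the configuration does not state it.

```python
import math


def cosine_lr(epoch, t_max=50, lr_max=1e-3, lr_min=1e-5):
    """Cosine-annealed learning rate at a given epoch (0 <= epoch <= t_max)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 25, 50):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

The rate starts at 1×10⁻³, passes the midpoint between the two bounds halfway through, and reaches 1×10⁻⁵ at the final epoch.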

Dataset Specification — 55,100 Annotated Sequences

A hybrid dataset combining class-conditional synthetic sequences with real-world annotated clips to introduce lighting variance, partial occlusion, and detection jitter.

42,600

Training sequences

AR(1) synthetic (ρ=0.60–0.90) + DAiSEE/EngageNet clips, heavily augmented

1,800

Validation holdout

Deterministic 15% holdout, unaugmented

3,000

Test set

500/class, independent seed=43, strictly no leakage

Balanced classes

Equal class distribution across all splits
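To illustrate what an AR(1) synthetic feature trace looks like, the sketch below generates one feature channel with a chosen autocorrelation ρ. It is a generic AR(1) generator, not the project's dataset code: the function name, the class-conditional mean and spread passed in, and the scaling of the noise term to keep the variance stationary are assumptions.

```python
import random


def ar1_sequence(mean, rho, sigma, length=15, rng=None):
    """Generate x_t = mean + rho * (x_{t-1} - mean) + noise.

    The noise standard deviation is sigma * sqrt(1 - rho^2), so the
    sequence keeps a stationary standard deviation of sigma.
    """
    rng = rng or random.Random()
    innovation_sd = sigma * (1.0 - rho ** 2) ** 0.5
    x = rng.gauss(mean, sigma)
    sequence = [x]
    for _ in range(length - 1):
        x = mean + rho * (x - mean) + rng.gauss(0.0, innovation_sd)
        sequence.append(x)
    return sequence


# One 15-frame gaze-deviation trace; mean and sigma are illustrative values.
trace = ar1_sequence(mean=0.3, rho=0.8, sigma=0.1, rng=random.Random(43))
print(len(trace))
```

Higher ρ values produce smoother traces in which consecutive frames stay close together, which is the sustained behavioural continuity the classifier is meant to learn.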

Augmentation strategy: Temporal time-warping (±20%), localised feature dropout (8%), Gaussian gaze jitter (σ=0.05), targeted object-context injection. Each 15-frame window = 3 seconds of behaviour at 5 Hz sampling.

Feature preprocessing: Gaze and yaw/pitch angles normalised to [−1, 1]. Biometrics (EAR, PERCLOS) clamped to [0, 1]. Object semantics mapped to one-hot vectors. Hand-object interactions encoded as continuous proximity coefficients.
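The preprocessing rules can be sketched for a subset of the feature vector. The ranges, the one-hot object encoding, and the continuous proximity coefficient come from the paragraph above; the function signature, the ±90° scaling limit for angles, and the book/phone ordering are assumptions, and the output covers only 7 of the 13 dimensions.

```python
OBJECT_CLASSES = ("book", "phone")  # assumed order of the one-hot slots


def clamp(value, low, high):
    return float(max(low, min(high, value)))


def preprocess(yaw_deg, pitch_deg, ear, perclos, detected_object,
               hand_proximity, max_angle_deg=90.0):
    """Return a partial feature vector (7 of the 13 dimensions)."""
    one_hot = [1.0 if detected_object == name else 0.0
               for name in OBJECT_CLASSES]
    return [
        clamp(yaw_deg / max_angle_deg, -1.0, 1.0),    # angles scaled to [-1, 1]
        clamp(pitch_deg / max_angle_deg, -1.0, 1.0),
        clamp(ear, 0.0, 1.0),                         # biometrics clamped to [0, 1]
        clamp(perclos, 0.0, 1.0),
        *one_hot,                                     # object semantics
        clamp(hand_proximity, 0.0, 1.0),              # continuous proximity
    ]


print(preprocess(45.0, -120.0, 0.3, 1.2, "phone", 0.8))
```

Out-of-range readings, such as the -120° pitch and the 1.2 PERCLOS in the example, are saturated at the range limit instead of being passed through to the classifier.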

06 — Validation

Performance Metrics

0.865

Behavior Accuracy

Hybrid real-world test set · target ≥0.85 ✓

0.852

F1-macro

All 6 classes · target ≥0.80 ✓

0.914

Precision (phone)

using_phone class · target ≥0.90 ✓

0.873

Recall (reading)

attentive_reading class · target ≥0.85 ✓

0.931

Temporal Consistency

Coefficient · target ≥0.90 ✓ (was 0.742 without AR(1))

1.000

Tracking MOTA (v2.0)

ByteTrack · target ≥0.75 ✓

0.963

Seat Occupancy (v2.0)

Shapely polygon IoU · target ≥0.95 ✓

<0.1ms

Inference P95

1D-CNN ONNX isolated latency · target <100ms ✓

Ablation Study — Component Penalties

Component Removed	Impact	Magnitude
Context-Gate Override	Reintroduces biometric ambiguity (phone vs. reading)	−12.4% precision (using_phone) · −9.1% recall (attentive_reading)
ConfusionPenaltyLoss → standard cross-entropy	Removes optimisation of pedagogical edge-cases	−18.5% macro-F1
AR(1) Temporal Autocorrelation	Model flickers between transient states	Temporal consistency: 0.931 → 0.742 (−0.189)

Baseline Comparisons

Baseline Method	Accuracy	Failure Mode
Naive Temporal (1D-CNN, uniform sequences)	21.3%	Network topology alone is insufficient without structured temporal data (random chance = 16.7%)
Deterministic Spatial Heuristic (rule-based)	80.8%	Severe degradation under partial occlusion and object-detector failures
CognEn (proposed)	86.5%	+5.7 percentage points over heuristic baseline; robust to occlusion via Context-Gate

Synthetic test set (seed=43): All classification metrics score 1.000, demonstrating perfect learning of the AR(1) temporal structure. The hybrid real-world scores (0.865 accuracy, 0.852 F1) are the credible production estimates. Real-classroom accuracy is estimated at 0.82–0.88 pending additional fine-tuning on DAiSEE/EngageNet.

07 — Stack

Technology Stack

v1.0 — Browser Stack Next.js 15

Framework

Next.js 15 (App Router, Turbopack) · TypeScript · Tailwind CSS · Radix UI · Recharts

Perception / ML (Browser)

face-api.js (TinyFaceDetector) · 68-point landmark model · YOLOv8-nano ONNX · onnxruntime-web (WebGPU → WASM) · MediaPipe Hands Lite

Auth / Storage / AI

Firebase Auth · Firestore · Google Genkit · Gemini (PDF report generation)

Extension

Chrome Extension Manifest V3 · Offscreen Document API

v2.0 — Server Stack FastAPI + GPU

Backend

Python 3 · FastAPI · uvicorn · WebSocket (5Hz push) · OpenCV VideoCapture

Perception / ML (GPU)

YOLOv8-m (person detection) · YOLOv8-s (object detection) · YOLOv8-pose (17 COCO keypoints) · MediaPipe FaceMesh (478-pt → 68-pt ibug) · 1D-CNN ONNX (20,550 params) · PyTorch (training only)

Tracking / Spatial

ByteTrack (boxmot) · Shapely (polygon IoU)

Testing

pytest (61 tests) · Evaluation suite (accuracy + F1 + latency)

Privacy & Security: No raw video stored — only derived feature vectors and metrics are persisted. All inference runs locally (in-browser for v1.0, on local GPU for v2.0). Track IDs (not names) are used by default in v2.0. Analytics auto-purge after 24 hours. The system never transmits raw video to any external service.