Spaces:
Sleeping
MrrrMe - Privacy-First Multi-Modal Emotion Detection System
18-Week Specialization Project | Breda University of Applied Sciences
Real-time emotion analysis combining facial expressions, voice tonality, and text sentiment with conversational AI for empathetic human-computer interaction.
Project Information
Program: AI & Data Science - Applied Data Science
Institution: Breda University of Applied Sciences, Netherlands
Duration: 18 weeks (February - June 2026)
Current Status: Week 15 of 18
Team: Musaed Al-Fareh, Michon Goddijn, Lorena Kraljić
Overview
Problem Statement
Traditional emotion recognition systems face critical limitations:
- Single-modality approaches miss contextual emotional cues
- High latency unsuitable for natural conversation (5-8 seconds typical)
- Cloud dependencies raise privacy concerns
- Inability to detect genuine versus masked emotions
Solution
MrrrMe implements a privacy-first, multi-modal emotion detection system:
- Fuses facial expressions (40%), voice tonality (30%), and linguistic content (30%)
- Processes everything locally with no cloud dependencies
- Achieves sub-2-second response times (1.5-2.5s total)
- Generates empathetic conversational responses via Groq Cloud API
- Web-based interface with customizable 3D avatars
System Architecture
High-Level Architecture
Browser Client (Next.js 16 + React 19)
│
├─ Camera Stream (30 FPS)
├─ Microphone Audio (16kHz)
└─ WebSocket Connection
│
▼
Nginx Reverse Proxy (Port 7860)
│
├─ Frontend Server (Next.js) :3001
├─ Backend API (FastAPI) :8000
└─ Avatar TTS (XTTS v2) :8765
│
▼
Processing Pipeline
│
├─ Vision: ViT-Face-Expression → Face Emotion
├─ Audio: HuBERT-Large → Voice Emotion
└─ Text: DistilRoBERTa → Text Sentiment
│
▼
Fusion Engine (Quality-Aware Weighted Average)
│
▼
Groq Cloud API (Llama 3.1 8B Instant)
│
▼
Coqui XTTS v2 (Multi-lingual TTS)
│
▼
3D Avatar (Avaturn SDK + Three.js)
Data Flow
Input Processing:
- Video frames (640x480) → OpenCV Haar Cascade face detection → ViT-Face-Expression (100ms)
- Audio chunks (16kHz) → Silero VAD speech detection → HuBERT-Large (50ms)
- Speech buffer → Whisper distil-large-v3 transcription (0.37-1.04s)
- Transcript → DistilRoBERTa sentiment + rule overrides (100ms)
Fusion & Response:
- Quality-aware weight adjustment based on signal quality
- Weighted fusion: fused = 0.4×face + 0.3×voice + 0.3×text
- Conflict resolution and masking detection
- LLM context preparation with user summary and emotion state
- Groq API response generation (1-2s)
- Coqui XTTS v2 synthesis with viseme generation (2-4s)
- Avatar lip-sync playback in browser
Technology Stack
Computer Vision
| Component | Technology | Inference Time | Purpose |
|---|---|---|---|
| Face Detection | OpenCV Haar Cascade | <10ms | Locate face in frame |
| Emotion Recognition | ViT-Face-Expression (trpakov) | ~100ms | 7-class emotion (FER2013) |
| Mapping | 7-class to 4-class | <1ms | Neutral, Happy, Sad, Angry |
Emotion Mapping:
- FER2013 Classes: angry, disgust, fear, happy, sad, surprise, neutral
- MrrrMe Classes: Neutral, Happy, Sad, Angry
Audio Processing
| Component | Technology | Inference Time | Purpose |
|---|---|---|---|
| Speech-to-Text | Whisper distil-large-v3 | 0.37-1.04s | Transcription |
| Voice Emotion | HuBERT-Large (superb) | ~50ms | Prosody analysis |
| Speech Detection | Silero VAD | <5ms | Activity detection |
Natural Language
| Component | Technology | Inference Time | Purpose |
|---|---|---|---|
| Sentiment | DistilRoBERTa (emotion-distilroberta) | ~100ms | Text emotion |
| LLM | Groq Cloud (Llama 3.1 8B Instant) | 1-2s | Response generation |
| TTS | Coqui XTTS v2 | 2-4s | Voice synthesis |
Voice Options: Ana Florence (female), Damien Black (male)
Languages: 16 supported (en, nl, fr, de, it, es, ja, zh, pt, pl, tr, ru, cs, ar, hu, ko)
Frontend & Infrastructure
| Component | Technology | Purpose |
|---|---|---|
| Frontend | Next.js 16 + React 19 + TypeScript | Web interface |
| 3D Engine | React Three Fiber + Three.js 0.180 | Avatar rendering |
| Avatar | Avaturn SDK + Ready Player Me | Custom avatars |
| Styling | Tailwind CSS v4 | Design system |
| Backend | FastAPI + Uvicorn | WebSocket + REST API |
| Database | SQLite | User auth + sessions |
| Proxy | Nginx | Reverse proxy |
| Container | Docker + CUDA 11.8 | Deployment |
Project Structure
MrrrMe/
│
├── avatar-frontend/ # Next.js 16 Web Application
│ ├── app/
│ │ ├── api/
│ │ │ └── avaturn-proxy/ # CORS proxy for avatar assets
│ │ ├── app/ # Main application (authenticated)
│ │ │ └── page.tsx # Avatar + emotion UI + WebSocket
│ │ ├── login/ # Authentication page
│ │ │ └── page.tsx
│ │ ├── page.tsx # Landing page
│ │ ├── layout.tsx # Root layout
│ │ └── globals.css # Design system (light/dark mode)
│ ├── public/
│ │ └── idle-animation.glb # Avatar idle animation (Git LFS)
│ ├── package.json # Node dependencies (React 19)
│ ├── next.config.ts # Next.js standalone output
│ └── tsconfig.json
│
├── mrrrme/ # Python Backend Package
│ ├── backend/ # FastAPI Modular Backend (v2.0)
│ │ ├── auth/
│ │ │ ├── database.py # SQLite init + helpers
│ │ │ ├── models.py # Pydantic request models
│ │ │ └── routes.py # /api/signup, /api/login, /api/logout
│ │ ├── debug/
│ │ │ └── routes.py # /api/debug/users, /api/debug/sessions
│ │ ├── models/
│ │ │ └── loader.py # Async AI model initialization
│ │ ├── processing/
│ │ │ ├── audio.py # Audio chunk handling
│ │ │ ├── fusion.py # Emotion fusion algorithm
│ │ │ ├── speech.py # Speech-end pipeline
│ │ │ └── video.py # Video frame processing
│ │ ├── session/
│ │ │ ├── manager.py # Token validation + history
│ │ │ └── summary.py # AI conversation summaries (Groq)
│ │ ├── utils/
│ │ │ ├── helpers.py # Avatar URL, service check
│ │ │ └── patches.py # GPU/TensorBoard patches
│ │ ├── __init__.py # Apply patches on import
│ │ ├── app.py # FastAPI app + CORS + routes
│ │ ├── config.py # Configuration constants
│ │ └── websocket.py # WebSocket message handler
│ │
│ ├── audio/
│ │ ├── voice_assistant.py # Coqui XTTS v2 integration
│ │ ├── voice_emotion.py # HuBERT emotion detection
│ │ └── whisper_transcription.py # Whisper STT + Silero VAD
│ │
│ ├── avatar/
│ │ └── avatar_controller.py # Avatar TTS communication
│ │
│ ├── database/
│ │ ├── db_manager.py # Database operations wrapper
│ │ └── db_tool.py # CLI tool for DB management
│ │
│ ├── nlp/
│ │ ├── llm_generator_groq.py # Groq API (dual personality)
│ │ └── text_sentiment.py # DistilRoBERTa + rule overrides
│ │
│ ├── vision/
│ │ ├── async_face_processor.py # Async face worker (unused in web)
│ │ └── face_processor.py # ViT-Face-Expression integration
│ │
│ ├── utils/
│ │ └── weight_finder.py # Model weight locator
│ │
│ ├── config.py # Global configuration
│ ├── main.py # Desktop app entry (Pygame UI)
│ ├── backend_new.py # Modular backend entry point
│ └── backend_server_old.py # Legacy monolithic backend
│
├── avatar/ # Avatar TTS Backend Service
│ ├── speak_server.py # Coqui XTTS v2 FastAPI server
│ └── static/ # Generated audio files (runtime)
│
├── model/ # Neural Network Architectures
│ ├── AU_model.py # Action Unit detection (research)
│ ├── AutomaticWeightedLoss.py # Multi-task learning loss
│ └── MLT.py # Multi-task learning architecture
│
├── weights/ # Pre-trained Models (Git LFS)
│ ├── ir50.pth # Face recognition backbone (117 MB)
│ ├── mobilefacenet_model_best.pth # Lightweight face (12 MB)
│ └── raf-db-model_best.pth # RAF-DB emotion (228 MB)
│
├── Dockerfile # Multi-stage build (CUDA 11.8)
├── nginx.spaces.conf # Nginx reverse proxy config
├── requirements_docker.txt # Python dependencies
├── app.py # Hugging Face Spaces entry
├── .gitattributes # Git LFS configuration
└── .gitignore
Performance Metrics
Processing Latency (RTX 3090)
| Component | Latency | Technology |
|---|---|---|
| Face Detection | 8-15ms | OpenCV Haar Cascade |
| Face Emotion | 80-120ms | ViT-Face-Expression |
| Voice Emotion | 40-60ms | HuBERT-Large (per 3s chunk) |
| Transcription | 370ms-1.04s | Whisper distil-large-v3 |
| Text Sentiment | 90-110ms | DistilRoBERTa |
| Fusion | <5ms | Weighted average |
| LLM Response | 1-2s | Groq Cloud API |
| TTS Synthesis | 2-4s | Coqui XTTS v2 |
| Total | 1.5-2.5s | End-to-end response time |
Accuracy
| Modality | Accuracy | Dataset/Notes |
|---|---|---|
| Face Only | 70-75% | ViT on FER2013 |
| Voice Only | 76.8% | HuBERT on IEMOCAP |
| Text Only | 81.2% | DistilRoBERTa + rule overrides |
| Multi-Modal | 85-88% | Weighted fusion (estimated) |
Resource Usage
- CPU: 15-25% (Intel i7-12700K)
- GPU: 40-60% (NVIDIA RTX 3090)
- RAM: 6-8 GB
- VRAM: 3-4 GB
Efficiency Optimizations
| Metric | Before | After | Gain |
|---|---|---|---|
| Frame Processing | 100% | 5% | 20x efficiency |
| Voice Processing | Always on | 72.4% active | 1.4x efficiency |
| Memory Usage | 12 GB | 6-8 GB | 33% reduction |
| Response Time | 5-8s | 1.5-2.5s | 3-4x faster |
Installation
Prerequisites
- Python 3.11+
- Node.js 20+
- NVIDIA GPU with 4GB+ VRAM (recommended, CPU fallback available)
- CUDA 11.8+ (for GPU acceleration)
- Git LFS
Local Development
Backend Setup:
# Clone repository
git clone https://github.com/YourUsername/MrrrMe.git
cd MrrrMe
git lfs install
git lfs pull
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# venv\Scripts\activate # Windows
# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install dependencies
pip install -r requirements_docker.txt
# Configure environment
echo "GROQ_API_KEY=your_groq_api_key_here" > .env
Frontend Setup:
cd avatar-frontend
npm install
npm run build
cd ..
Start Services (3 terminals):
# Terminal 1: Avatar TTS Backend
cd avatar
python speak_server.py
# Terminal 2: Main Backend
python mrrrme/backend_new.py
# Terminal 3: Frontend (development)
cd avatar-frontend
npm run dev
Access: http://localhost:3000
Docker Deployment
# Build
docker build -t mrrrme:latest .
# Run with GPU
docker run --gpus all -p 7860:7860 mrrrme:latest
# Run CPU only
docker run -p 7860:7860 mrrrme:latest
Hugging Face Spaces
Automatic deployment configured:
- Push to Hugging Face repository
- Enable persistent storage in Space settings
- Add
GROQ_API_KEYsecret - Automatic rebuild and deployment
Configuration
Emotion Fusion Weights
File: mrrrme/config.py or mrrrme/backend/config.py
# Default balanced weights
FUSION_WEIGHTS = {
'face': 0.40, # Facial expressions
'voice': 0.30, # Vocal prosody
'text': 0.30 # Linguistic sentiment
}
# Dynamically adjusted during runtime based on:
# - Face quality score (size, position, confidence)
# - Voice activity detection (speech vs silence)
# - Text length (short inputs reduce text weight)
LLM Configuration
# Response styles
LLM_RESPONSE_STYLE = "brief" # 60 tokens, 1-2 sentences
LLM_RESPONSE_STYLE = "balanced" # 150 tokens, 2-3 sentences (default)
LLM_RESPONSE_STYLE = "detailed" # 250 tokens, more elaborate
# Personality modes
PERSONALITY = "therapist" # Empathetic, exploratory
PERSONALITY = "coach" # Practical, action-oriented
Model Selection
# mrrrme/config.py
WHISPER_MODEL = "distil-whisper/distil-large-v3"
TEXT_SENTIMENT_MODEL = "j-hartmann/emotion-english-distilroberta-base"
VOICE_EMOTION_MODEL = "superb/hubert-large-superb-er"
# Timing
TRANSCRIPTION_BUFFER_SEC = 3.0
AUDIO_SR = 16000
CLIP_SECONDS = 1.2
API Reference
WebSocket Protocol
Client → Server:
// Authentication
{"type": "auth", "token": "session_token"}
// Video frame
{"type": "video_frame", "frame": "data:image/jpeg;base64,..."}
// Audio chunk
{"type": "audio_chunk", "audio": "base64_webm_data"}
// User finished speaking
{"type": "speech_end", "text": "transcribed_speech"}
// Update preferences
{"type": "preferences", "voice": "female|male", "language": "en|nl", "personality": "therapist|coach"}
// Request greeting
{"type": "request_greeting"}
Server → Client:
// Face emotion update
{
"type": "face_emotion",
"emotion": "Happy",
"confidence": 0.87,
"probabilities": [0.05, 0.87, 0.04, 0.04],
"quality": 0.92
}
// Voice emotion update
{"type": "voice_emotion", "emotion": "Happy"}
// LLM response with avatar
{
"type": "llm_response",
"text": "Response text",
"emotion": "Happy",
"intensity": 0.75,
"audio_url": "/static/uuid.mp3",
"visemes": [{"t": 0.0, "blend": {"jawOpen": 0.5}}]
}
// Error
{"type": "error", "message": "Error description"}
REST Endpoints
POST /api/signup - Create user account
POST /api/login - Authenticate and create session
POST /api/logout - End session and generate summary
GET /api/debug/users - View all users and summaries
GET /api/debug/sessions - View active sessions
GET /health - Health check
GET / - Service status
Development Timeline
Completed (Weeks 1-7)
- Multi-modal emotion detection pipeline
- ViT-Face-Expression for facial analysis (70-75% accuracy)
- HuBERT-Large voice emotion (76.8% accuracy)
- Whisper transcription with intelligent VAD
- DistilRoBERTa sentiment with rule-based overrides
- Groq Cloud API integration (Llama 3.1 8B)
- Coqui XTTS v2 multi-lingual TTS (16 languages)
- Next.js 16 web interface with TypeScript
- Avaturn SDK 3D avatar system
- WebSocket real-time communication
- SQLite authentication and session management
- AI-generated conversation summaries
- Docker containerization with GPU support
- Event-driven processing (600x efficiency gain)
- Quality-aware dynamic fusion weights
Planned (Weeks 8-18)
Weeks 8-9: Core Stability
- Error handling improvements
- Unit test coverage
- Performance profiling
- Bug fixes
Weeks 10-12: Avatar Enhancement
- Advanced emotion-to-expression mapping
- Smooth animation transitions
- Eye gaze tracking
- Idle behavior polish
Weeks 13-15: UI/UX Refinement
- Emotion timeline visualization
- Conversation export (CSV/JSON)
- Advanced settings interface
- Accessibility improvements
Week 16: Memory & Context
- Extended conversation memory (20+ turns)
- Emotion timeline graphs
- Session statistics
- Export functionality
Week 17: Testing
- User testing (15+ participants)
- Feedback collection
- Bug fixes
- Performance tuning
Week 18: Demo Preparation
- Professional demo video (3-5 min)
- Presentation materials
- Final documentation
- Deployment guide
Key Features
Multi-Modal Fusion
- Weighted combination of three modalities
- Quality-aware dynamic weight adjustment
- Conflict resolution algorithm
- Event-driven updates (only recalculates on user speech)
Emotion Processing
- 4-class model: Neutral, Happy, Sad, Angry
- Face: ViT-Face-Expression with quality scoring
- Voice: HuBERT-Large with speech activity detection
- Text: DistilRoBERTa with rule-based overrides
Conversational AI
- Groq Cloud API for fast inference (1-2s)
- Dual personalities: Therapist (empathetic) and Coach (action-focused)
- Three response styles: brief, balanced, detailed
- Conversation history and user context
Avatar System
- Customizable 3D avatars (Avaturn SDK)
- Realistic lip-sync with XTTS v2 visemes
- Emotion-driven expressions
- 16-language support
Privacy & Security
- Local emotion processing (no cloud upload)
- User authentication with hashed passwords
- Session-based access control
- AI summaries stored per-user only
- No face recognition or identification
Technical Implementation
Fusion Algorithm
def fuse_emotions(face_probs, voice_probs, text_probs, weights):
"""
Quality-aware weighted fusion
Args:
face_probs: [4] Neutral, Happy, Sad, Angry probabilities
voice_probs: [4] Voice emotion probabilities
text_probs: [4] Text sentiment probabilities
weights: dict with 'face', 'voice', 'text' keys
Returns:
fused_emotion: str
intensity: float (0-1)
"""
fused = (
weights['face'] * face_probs +
weights['voice'] * voice_probs +
weights['text'] * text_probs
)
fused = fused / (fused.sum() + 1e-8)
emotion_idx = fused.argmax()
emotion = ['Neutral', 'Happy', 'Sad', 'Angry'][emotion_idx]
intensity = float(fused.max())
return emotion, intensity
Dynamic Weight Adjustment
Weights automatically adjust based on:
- Face quality < 0.5: Reduce face weight by 30%
- No voice activity: Reduce voice weight by 50%
- Text length < 10: Reduce text weight by 30%
All weights normalized to sum to 1.0 after adjustment.
Event-Driven Processing
Problem: Processing every frame/chunk wastes compute
Solution: Only update fusion when user finishes speaking
# Main loop: Use cached fusion result
fused_emotion, intensity = fusion_engine.fuse(force=False) # Returns cache
# On speech end: Force recalculation
fused_emotion, intensity = fusion_engine.fuse(force=True) # Recalculates
Result: 600x reduction in fusion calculations
Database Schema
Users Table
users (
user_id TEXT PRIMARY KEY,
username TEXT UNIQUE NOT NULL,
password_hash TEXT NOT NULL,
created_at TIMESTAMP
)
Sessions Table
sessions (
session_id TEXT PRIMARY KEY,
user_id TEXT,
token TEXT UNIQUE,
created_at TIMESTAMP,
is_active BOOLEAN
)
Messages Table
messages (
message_id INTEGER PRIMARY KEY,
session_id TEXT,
role TEXT, -- 'user' or 'assistant'
content TEXT,
emotion TEXT, -- Detected/generated emotion
timestamp TIMESTAMP
)
Summaries Table
user_summaries (
user_id TEXT PRIMARY KEY,
summary_text TEXT, -- AI-generated summary
updated_at TIMESTAMP
)
Known Issues
Current Limitations
- Single-user processing (one face at a time)
- Lighting sensitivity (performance degrades in low light)
- English and Dutch fully tested, other languages experimental
- Requires 4GB+ VRAM for optimal performance
- 4-class emotions may miss subtle nuances
Known Bugs
- Empty frame error in cv2.cvtColor (workaround in place)
- Audio buffer alignment issues with some microphones
- Occasional WebSocket disconnection on slow networks
Planned Improvements
- Action Unit detection for masking (genuine vs forced emotion)
- Multi-user face tracking
- Edge device optimization (Jetson Nano)
- Mobile app (React Native)
- Additional language support
- Real-time emotion timeline
Research References
Key Papers:
- Hu et al. (2025) - "OpenFace 3.0: Lightweight Multitask Facial Behavior Analysis"
- Radford et al. (2023) - "Whisper: Robust Speech Recognition via Weak Supervision"
- Hsu et al. (2021) - "HuBERT: Self-Supervised Speech Representation Learning"
- Liu et al. (2019) - "RoBERTa: Robustly Optimized BERT Pretraining"
Datasets:
- FER2013: Facial expression recognition (7 emotions)
- IEMOCAP: Interactive emotional dyadic motion capture
- RAF-DB: Real-world Affective Faces Database
- SST-2: Stanford Sentiment Treebank
Technologies:
- ViT-Face-Expression: Vision Transformer for FER
- HuBERT: Self-supervised speech representation
- Whisper: Distilled large-v3 for ASR
- Llama 3.1: Large language model
- Coqui XTTS v2: Multi-lingual TTS
Team
Musaed Al-Fareh - Project Lead
AI & Data Science Student
Email: [email protected]
LinkedIn: linkedin.com/in/musaed-alfareh-a365521b9
Michon Goddijn - AI & Data Science Student
Email: [email protected]
Lorena Kraljić - Tourism Student
Email: [email protected]
Course: Applied Data Science - Artificial Intelligence
Program: BUAŚ Classroom Specialisation 2025-2026
License
MIT License
Component Licenses:
- ViT-Face-Expression: MIT
- Whisper: MIT
- HuBERT: MIT
- Llama 3.1: Llama 2 Community License
- Coqui XTTS v2: Mozilla Public License 2.0
Acknowledgments
- Breda University of Applied Sciences
- OpenFace 3.0 Team
- OpenAI (Whisper)
- Meta AI (HuBERT, Llama)
- Hugging Face (Model Hub)
- Groq (LLM API)
- Coqui (TTS)
Contact
Repository: GitHub - MrrrMe
Live Demo: Hugging Face Spaces
Email: [email protected]
For bug reports or feature requests, open an issue on GitHub.
Last Updated: December 10, 2024
Version: 2.0.0
Status: Active Development (Week 7/18)