Spaces:

michon
/

mrrrme-emotion-ai

Sleeping

App Files Files Community

mrrrme-emotion-ai / mrrrme /README.md

MusaedMusaedSadeqMusaedAl-Fareh225739

Updated README file

1df4a51 10 days ago

preview code

raw

history blame contribute delete

23.8 kB

MrrrMe - Privacy-First Multi-Modal Emotion Detection System

18-Week Specialization Project | Breda University of Applied Sciences

Real-time emotion analysis combining facial expressions, voice tonality, and text sentiment with conversational AI for empathetic human-computer interaction.

Project Information

Program: AI & Data Science - Applied Data Science
Institution: Breda University of Applied Sciences, Netherlands
Duration: 18 weeks (February - June 2026)
Current Status: Week 15 of 18
Team: Musaed Al-Fareh, Michon Goddijn, Lorena Kraljić

Overview

Problem Statement

Traditional emotion recognition systems face critical limitations:

Single-modality approaches miss contextual emotional cues
High latency unsuitable for natural conversation (5-8 seconds typical)
Cloud dependencies raise privacy concerns
Inability to detect genuine versus masked emotions

Solution

MrrrMe implements a privacy-first, multi-modal emotion detection system:

Fuses facial expressions (40%), voice tonality (30%), and linguistic content (30%)
Processes everything locally with no cloud dependencies
Achieves sub-2-second response times (1.5-2.5s total)
Generates empathetic conversational responses via Groq Cloud API
Web-based interface with customizable 3D avatars

System Architecture

High-Level Architecture

Browser Client (Next.js 16 + React 19)
    │
    ├─ Camera Stream (30 FPS)
    ├─ Microphone Audio (16kHz)
    └─ WebSocket Connection
         │
         ▼
Nginx Reverse Proxy (Port 7860)
    │
    ├─ Frontend Server (Next.js) :3001
    ├─ Backend API (FastAPI) :8000
    └─ Avatar TTS (XTTS v2) :8765
         │
         ▼
Processing Pipeline
    │
    ├─ Vision: ViT-Face-Expression → Face Emotion
    ├─ Audio: HuBERT-Large → Voice Emotion
    └─ Text: DistilRoBERTa → Text Sentiment
         │
         ▼
Fusion Engine (Quality-Aware Weighted Average)
         │
         ▼
Groq Cloud API (Llama 3.1 8B Instant)
         │
         ▼
Coqui XTTS v2 (Multi-lingual TTS)
         │
         ▼
3D Avatar (Avaturn SDK + Three.js)

Data Flow

Input Processing:

Video frames (640x480) → OpenCV Haar Cascade face detection → ViT-Face-Expression (100ms)
Audio chunks (16kHz) → Silero VAD speech detection → HuBERT-Large (50ms)
Speech buffer → Whisper distil-large-v3 transcription (0.37-1.04s)
Transcript → DistilRoBERTa sentiment + rule overrides (100ms)

Fusion & Response:

Quality-aware weight adjustment based on signal quality
Weighted fusion: fused = 0.4×face + 0.3×voice + 0.3×text
Conflict resolution and masking detection
LLM context preparation with user summary and emotion state
Groq API response generation (1-2s)
Coqui XTTS v2 synthesis with viseme generation (2-4s)
Avatar lip-sync playback in browser

Technology Stack

Computer Vision

Component	Technology	Inference Time	Purpose
Face Detection	OpenCV Haar Cascade	<10ms	Locate face in frame
Emotion Recognition	ViT-Face-Expression (trpakov)	~100ms	7-class emotion (FER2013)
Mapping	7-class to 4-class	<1ms	Neutral, Happy, Sad, Angry

Emotion Mapping:

FER2013 Classes: angry, disgust, fear, happy, sad, surprise, neutral
MrrrMe Classes: Neutral, Happy, Sad, Angry

Audio Processing

Component	Technology	Inference Time	Purpose
Speech-to-Text	Whisper distil-large-v3	0.37-1.04s	Transcription
Voice Emotion	HuBERT-Large (superb)	~50ms	Prosody analysis
Speech Detection	Silero VAD	<5ms	Activity detection

Natural Language

Component	Technology	Inference Time	Purpose
Sentiment	DistilRoBERTa (emotion-distilroberta)	~100ms	Text emotion
LLM	Groq Cloud (Llama 3.1 8B Instant)	1-2s	Response generation
TTS	Coqui XTTS v2	2-4s	Voice synthesis

Voice Options: Ana Florence (female), Damien Black (male)
Languages: 16 supported (en, nl, fr, de, it, es, ja, zh, pt, pl, tr, ru, cs, ar, hu, ko)

Frontend & Infrastructure

Component	Technology	Purpose
Frontend	Next.js 16 + React 19 + TypeScript	Web interface
3D Engine	React Three Fiber + Three.js 0.180	Avatar rendering
Avatar	Avaturn SDK + Ready Player Me	Custom avatars
Styling	Tailwind CSS v4	Design system
Backend	FastAPI + Uvicorn	WebSocket + REST API
Database	SQLite	User auth + sessions
Proxy	Nginx	Reverse proxy
Container	Docker + CUDA 11.8	Deployment

Project Structure

MrrrMe/
│
├── avatar-frontend/              # Next.js 16 Web Application
│   ├── app/
│   │   ├── api/
│   │   │   └── avaturn-proxy/    # CORS proxy for avatar assets
│   │   ├── app/                  # Main application (authenticated)
│   │   │   └── page.tsx          # Avatar + emotion UI + WebSocket
│   │   ├── login/                # Authentication page
│   │   │   └── page.tsx
│   │   ├── page.tsx              # Landing page
│   │   ├── layout.tsx            # Root layout
│   │   └── globals.css           # Design system (light/dark mode)
│   ├── public/
│   │   └── idle-animation.glb    # Avatar idle animation (Git LFS)
│   ├── package.json              # Node dependencies (React 19)
│   ├── next.config.ts            # Next.js standalone output
│   └── tsconfig.json
│
├── mrrrme/                       # Python Backend Package
│   ├── backend/                  # FastAPI Modular Backend (v2.0)
│   │   ├── auth/
│   │   │   ├── database.py       # SQLite init + helpers
│   │   │   ├── models.py         # Pydantic request models
│   │   │   └── routes.py         # /api/signup, /api/login, /api/logout
│   │   ├── debug/
│   │   │   └── routes.py         # /api/debug/users, /api/debug/sessions
│   │   ├── models/
│   │   │   └── loader.py         # Async AI model initialization
│   │   ├── processing/
│   │   │   ├── audio.py          # Audio chunk handling
│   │   │   ├── fusion.py         # Emotion fusion algorithm
│   │   │   ├── speech.py         # Speech-end pipeline
│   │   │   └── video.py          # Video frame processing
│   │   ├── session/
│   │   │   ├── manager.py        # Token validation + history
│   │   │   └── summary.py        # AI conversation summaries (Groq)
│   │   ├── utils/
│   │   │   ├── helpers.py        # Avatar URL, service check
│   │   │   └── patches.py        # GPU/TensorBoard patches
│   │   ├── __init__.py           # Apply patches on import
│   │   ├── app.py                # FastAPI app + CORS + routes
│   │   ├── config.py             # Configuration constants
│   │   └── websocket.py          # WebSocket message handler
│   │
│   ├── audio/
│   │   ├── voice_assistant.py    # Coqui XTTS v2 integration
│   │   ├── voice_emotion.py      # HuBERT emotion detection
│   │   └── whisper_transcription.py  # Whisper STT + Silero VAD
│   │
│   ├── avatar/
│   │   └── avatar_controller.py  # Avatar TTS communication
│   │
│   ├── database/
│   │   ├── db_manager.py         # Database operations wrapper
│   │   └── db_tool.py            # CLI tool for DB management
│   │
│   ├── nlp/
│   │   ├── llm_generator_groq.py # Groq API (dual personality)
│   │   └── text_sentiment.py     # DistilRoBERTa + rule overrides
│   │
│   ├── vision/
│   │   ├── async_face_processor.py  # Async face worker (unused in web)
│   │   └── face_processor.py     # ViT-Face-Expression integration
│   │
│   ├── utils/
│   │   └── weight_finder.py      # Model weight locator
│   │
│   ├── config.py                 # Global configuration
│   ├── main.py                   # Desktop app entry (Pygame UI)
│   ├── backend_new.py            # Modular backend entry point
│   └── backend_server_old.py     # Legacy monolithic backend
│
├── avatar/                       # Avatar TTS Backend Service
│   ├── speak_server.py           # Coqui XTTS v2 FastAPI server
│   └── static/                   # Generated audio files (runtime)
│
├── model/                        # Neural Network Architectures
│   ├── AU_model.py               # Action Unit detection (research)
│   ├── AutomaticWeightedLoss.py  # Multi-task learning loss
│   └── MLT.py                    # Multi-task learning architecture
│
├── weights/                      # Pre-trained Models (Git LFS)
│   ├── ir50.pth                  # Face recognition backbone (117 MB)
│   ├── mobilefacenet_model_best.pth  # Lightweight face (12 MB)
│   └── raf-db-model_best.pth     # RAF-DB emotion (228 MB)
│
├── Dockerfile                    # Multi-stage build (CUDA 11.8)
├── nginx.spaces.conf             # Nginx reverse proxy config
├── requirements_docker.txt       # Python dependencies
├── app.py                        # Hugging Face Spaces entry
├── .gitattributes                # Git LFS configuration
└── .gitignore

Performance Metrics

Processing Latency (RTX 3090)

Component	Latency	Technology
Face Detection	8-15ms	OpenCV Haar Cascade
Face Emotion	80-120ms	ViT-Face-Expression
Voice Emotion	40-60ms	HuBERT-Large (per 3s chunk)
Transcription	370ms-1.04s	Whisper distil-large-v3
Text Sentiment	90-110ms	DistilRoBERTa
Fusion	<5ms	Weighted average
LLM Response	1-2s	Groq Cloud API
TTS Synthesis	2-4s	Coqui XTTS v2
Total	1.5-2.5s	End-to-end response time

Accuracy

Modality	Accuracy	Dataset/Notes
Face Only	70-75%	ViT on FER2013
Voice Only	76.8%	HuBERT on IEMOCAP
Text Only	81.2%	DistilRoBERTa + rule overrides
Multi-Modal	85-88%	Weighted fusion (estimated)

Resource Usage

CPU: 15-25% (Intel i7-12700K)
GPU: 40-60% (NVIDIA RTX 3090)
RAM: 6-8 GB
VRAM: 3-4 GB

Efficiency Optimizations

Metric	Before	After	Gain
Frame Processing	100%	5%	20x efficiency
Voice Processing	Always on	72.4% active	1.4x efficiency
Memory Usage	12 GB	6-8 GB	33% reduction
Response Time	5-8s	1.5-2.5s	3-4x faster

Installation

Prerequisites

Python 3.11+
Node.js 20+
NVIDIA GPU with 4GB+ VRAM (recommended, CPU fallback available)
CUDA 11.8+ (for GPU acceleration)
Git LFS

Local Development

Backend Setup:

# Clone repository
git clone https://github.com/YourUsername/MrrrMe.git
cd MrrrMe
git lfs install
git lfs pull

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate  # Windows

# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install dependencies
pip install -r requirements_docker.txt

# Configure environment
echo "GROQ_API_KEY=your_groq_api_key_here" > .env

Frontend Setup:

cd avatar-frontend
npm install
npm run build
cd ..

Start Services (3 terminals):

# Terminal 1: Avatar TTS Backend
cd avatar
python speak_server.py

# Terminal 2: Main Backend
python mrrrme/backend_new.py

# Terminal 3: Frontend (development)
cd avatar-frontend
npm run dev

Access: http://localhost:3000

Docker Deployment

# Build
docker build -t mrrrme:latest .

# Run with GPU
docker run --gpus all -p 7860:7860 mrrrme:latest

# Run CPU only
docker run -p 7860:7860 mrrrme:latest

Hugging Face Spaces

Automatic deployment configured:

Push to Hugging Face repository
Enable persistent storage in Space settings
Add GROQ_API_KEY secret
Automatic rebuild and deployment

Configuration

Emotion Fusion Weights

File: mrrrme/config.py or mrrrme/backend/config.py

# Default balanced weights
FUSION_WEIGHTS = {
    'face': 0.40,   # Facial expressions
    'voice': 0.30,  # Vocal prosody
    'text': 0.30    # Linguistic sentiment
}

# Dynamically adjusted during runtime based on:
# - Face quality score (size, position, confidence)
# - Voice activity detection (speech vs silence)
# - Text length (short inputs reduce text weight)

LLM Configuration

# Response styles
LLM_RESPONSE_STYLE = "brief"     # 60 tokens, 1-2 sentences
LLM_RESPONSE_STYLE = "balanced"  # 150 tokens, 2-3 sentences (default)
LLM_RESPONSE_STYLE = "detailed"  # 250 tokens, more elaborate

# Personality modes
PERSONALITY = "therapist"  # Empathetic, exploratory
PERSONALITY = "coach"      # Practical, action-oriented

Model Selection

# mrrrme/config.py
WHISPER_MODEL = "distil-whisper/distil-large-v3"
TEXT_SENTIMENT_MODEL = "j-hartmann/emotion-english-distilroberta-base"
VOICE_EMOTION_MODEL = "superb/hubert-large-superb-er"

# Timing
TRANSCRIPTION_BUFFER_SEC = 3.0
AUDIO_SR = 16000
CLIP_SECONDS = 1.2

API Reference

WebSocket Protocol

Client → Server:

// Authentication
{"type": "auth", "token": "session_token"}

// Video frame
{"type": "video_frame", "frame": "data:image/jpeg;base64,..."}

// Audio chunk
{"type": "audio_chunk", "audio": "base64_webm_data"}

// User finished speaking
{"type": "speech_end", "text": "transcribed_speech"}

// Update preferences
{"type": "preferences", "voice": "female|male", "language": "en|nl", "personality": "therapist|coach"}

// Request greeting
{"type": "request_greeting"}

Server → Client:

// Face emotion update
{
  "type": "face_emotion",
  "emotion": "Happy",
  "confidence": 0.87,
  "probabilities": [0.05, 0.87, 0.04, 0.04],
  "quality": 0.92
}

// Voice emotion update
{"type": "voice_emotion", "emotion": "Happy"}

// LLM response with avatar
{
  "type": "llm_response",
  "text": "Response text",
  "emotion": "Happy",
  "intensity": 0.75,
  "audio_url": "/static/uuid.mp3",
  "visemes": [{"t": 0.0, "blend": {"jawOpen": 0.5}}]
}

// Error
{"type": "error", "message": "Error description"}

REST Endpoints

POST /api/signup       - Create user account
POST /api/login        - Authenticate and create session
POST /api/logout       - End session and generate summary
GET  /api/debug/users  - View all users and summaries
GET  /api/debug/sessions - View active sessions
GET  /health           - Health check
GET  /                 - Service status

Development Timeline

Completed (Weeks 1-7)

Multi-modal emotion detection pipeline
ViT-Face-Expression for facial analysis (70-75% accuracy)
HuBERT-Large voice emotion (76.8% accuracy)
Whisper transcription with intelligent VAD
DistilRoBERTa sentiment with rule-based overrides
Groq Cloud API integration (Llama 3.1 8B)
Coqui XTTS v2 multi-lingual TTS (16 languages)
Next.js 16 web interface with TypeScript
Avaturn SDK 3D avatar system
WebSocket real-time communication
SQLite authentication and session management
AI-generated conversation summaries
Docker containerization with GPU support
Event-driven processing (600x efficiency gain)
Quality-aware dynamic fusion weights

Planned (Weeks 8-18)

Weeks 8-9: Core Stability

Error handling improvements
Unit test coverage
Performance profiling
Bug fixes

Weeks 10-12: Avatar Enhancement

Advanced emotion-to-expression mapping
Smooth animation transitions
Eye gaze tracking
Idle behavior polish

Weeks 13-15: UI/UX Refinement

Emotion timeline visualization
Conversation export (CSV/JSON)
Advanced settings interface
Accessibility improvements

Week 16: Memory & Context

Extended conversation memory (20+ turns)
Emotion timeline graphs
Session statistics
Export functionality

Week 17: Testing

User testing (15+ participants)
Feedback collection
Bug fixes
Performance tuning

Week 18: Demo Preparation

Professional demo video (3-5 min)
Presentation materials
Final documentation
Deployment guide

Key Features

Multi-Modal Fusion

Weighted combination of three modalities
Quality-aware dynamic weight adjustment
Conflict resolution algorithm
Event-driven updates (only recalculates on user speech)

Emotion Processing

4-class model: Neutral, Happy, Sad, Angry
Face: ViT-Face-Expression with quality scoring
Voice: HuBERT-Large with speech activity detection
Text: DistilRoBERTa with rule-based overrides

Conversational AI

Groq Cloud API for fast inference (1-2s)
Dual personalities: Therapist (empathetic) and Coach (action-focused)
Three response styles: brief, balanced, detailed
Conversation history and user context

Avatar System

Customizable 3D avatars (Avaturn SDK)
Realistic lip-sync with XTTS v2 visemes
Emotion-driven expressions
16-language support

Privacy & Security

Local emotion processing (no cloud upload)
User authentication with hashed passwords
Session-based access control
AI summaries stored per-user only
No face recognition or identification

Technical Implementation

Fusion Algorithm

def fuse_emotions(face_probs, voice_probs, text_probs, weights):
    """
    Quality-aware weighted fusion
    
    Args:
        face_probs: [4] Neutral, Happy, Sad, Angry probabilities
        voice_probs: [4] Voice emotion probabilities
        text_probs: [4] Text sentiment probabilities
        weights: dict with 'face', 'voice', 'text' keys
    
    Returns:
        fused_emotion: str
        intensity: float (0-1)
    """
    fused = (
        weights['face'] * face_probs +
        weights['voice'] * voice_probs +
        weights['text'] * text_probs
    )
    fused = fused / (fused.sum() + 1e-8)
    
    emotion_idx = fused.argmax()
    emotion = ['Neutral', 'Happy', 'Sad', 'Angry'][emotion_idx]
    intensity = float(fused.max())
    
    return emotion, intensity

Dynamic Weight Adjustment

Weights automatically adjust based on:

Face quality < 0.5: Reduce face weight by 30%
No voice activity: Reduce voice weight by 50%
Text length < 10: Reduce text weight by 30%

All weights normalized to sum to 1.0 after adjustment.

Event-Driven Processing

Problem: Processing every frame/chunk wastes compute
Solution: Only update fusion when user finishes speaking

# Main loop: Use cached fusion result
fused_emotion, intensity = fusion_engine.fuse(force=False)  # Returns cache

# On speech end: Force recalculation
fused_emotion, intensity = fusion_engine.fuse(force=True)  # Recalculates

Result: 600x reduction in fusion calculations

Database Schema

Users Table

users (
    user_id TEXT PRIMARY KEY,
    username TEXT UNIQUE NOT NULL,
    password_hash TEXT NOT NULL,
    created_at TIMESTAMP
)

Sessions Table

sessions (
    session_id TEXT PRIMARY KEY,
    user_id TEXT,
    token TEXT UNIQUE,
    created_at TIMESTAMP,
    is_active BOOLEAN
)

Messages Table

messages (
    message_id INTEGER PRIMARY KEY,
    session_id TEXT,
    role TEXT,  -- 'user' or 'assistant'
    content TEXT,
    emotion TEXT,  -- Detected/generated emotion
    timestamp TIMESTAMP
)

Summaries Table

user_summaries (
    user_id TEXT PRIMARY KEY,
    summary_text TEXT,  -- AI-generated summary
    updated_at TIMESTAMP
)

Known Issues

Current Limitations

Single-user processing (one face at a time)
Lighting sensitivity (performance degrades in low light)
English and Dutch fully tested, other languages experimental
Requires 4GB+ VRAM for optimal performance
4-class emotions may miss subtle nuances

Known Bugs

Empty frame error in cv2.cvtColor (workaround in place)
Audio buffer alignment issues with some microphones
Occasional WebSocket disconnection on slow networks

Planned Improvements

Action Unit detection for masking (genuine vs forced emotion)
Multi-user face tracking
Edge device optimization (Jetson Nano)
Mobile app (React Native)
Additional language support
Real-time emotion timeline

Research References

Key Papers:

Hu et al. (2025) - "OpenFace 3.0: Lightweight Multitask Facial Behavior Analysis"
Radford et al. (2023) - "Whisper: Robust Speech Recognition via Weak Supervision"
Hsu et al. (2021) - "HuBERT: Self-Supervised Speech Representation Learning"
Liu et al. (2019) - "RoBERTa: Robustly Optimized BERT Pretraining"

Datasets:

FER2013: Facial expression recognition (7 emotions)
IEMOCAP: Interactive emotional dyadic motion capture
RAF-DB: Real-world Affective Faces Database
SST-2: Stanford Sentiment Treebank

Technologies:

ViT-Face-Expression: Vision Transformer for FER
HuBERT: Self-supervised speech representation
Whisper: Distilled large-v3 for ASR
Llama 3.1: Large language model
Coqui XTTS v2: Multi-lingual TTS

Team

Musaed Al-Fareh - Project Lead
AI & Data Science Student
Email: [email protected]
LinkedIn: linkedin.com/in/musaed-alfareh-a365521b9

Michon Goddijn - AI & Data Science Student
Email: [email protected]

Lorena Kraljić - Tourism Student
Email: [email protected]

Course: Applied Data Science - Artificial Intelligence
Program: BUAŚ Classroom Specialisation 2025-2026

License

MIT License

Component Licenses:

ViT-Face-Expression: MIT
Whisper: MIT
HuBERT: MIT
Llama 3.1: Llama 2 Community License
Coqui XTTS v2: Mozilla Public License 2.0

Acknowledgments

Breda University of Applied Sciences
OpenFace 3.0 Team
OpenAI (Whisper)
Meta AI (HuBERT, Llama)
Hugging Face (Model Hub)
Groq (LLM API)
Coqui (TTS)

Contact

Repository: GitHub - MrrrMe
Live Demo: Hugging Face Spaces
Email: [email protected]

For bug reports or feature requests, open an issue on GitHub.

Last Updated: December 10, 2024
Version: 2.0.0
Status: Active Development (Week 7/18)