# 🧠 RAG Architecture & Vector Embeddings
## Overview
GigMatch AI uses **Retrieval-Augmented Generation (RAG)** with **vector embeddings** to perform intelligent semantic matching between workers and gigs. This goes far beyond simple keyword matching!
## 🏗️ Architecture
```
┌────────────────────────────────────────────────────────────┐
│                       DATA INGESTION                       │
├────────────────────────────────────────────────────────────┤
│  50 Workers + 50 Gigs (JSON)                               │
│          ↓                                                 │
│  Text Enrichment (skills, bio, location, etc.)             │
│          ↓                                                 │
│  HuggingFace Embeddings (all-MiniLM-L6-v2)                 │
│          ↓                                                 │
│  Vector Storage (ChromaDB)                                 │
└────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│                       QUERY PIPELINE                       │
├────────────────────────────────────────────────────────────┤
│  User Query (worker profile or gig post)                   │
│          ↓                                                 │
│  Convert to Search Query                                   │
│          ↓                                                 │
│  Embed Query (HuggingFace)                                 │
│          ↓                                                 │
│  Semantic Search (Vector Similarity)                       │
│          ↓                                                 │
│  Retrieve Top K Results                                    │
│          ↓                                                 │
│  Calculate Match Scores                                    │
│          ↓                                                 │
│  Return Results to Agent                                   │
└────────────────────────────────────────────────────────────┘
```
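The Text Enrichment step in the ingestion box can be sketched as a small function that flattens one record into a single text blob before embedding. This is a minimal sketch, assuming each worker record is a dict with `name`, `skills`, `experience`, `location`, and an optional `bio`; the helper name `enrich_worker` and the sample record are hypothetical and not taken from the project code.

```python
from typing import Any


def enrich_worker(worker: dict[str, Any]) -> str:
    """Flatten one worker record into a single text blob for embedding.

    Field names are assumptions based on the examples in this document.
    """
    parts = [
        f"Name: {worker['name']}",
        f"Skills: {', '.join(worker['skills'])}",
        f"Experience: {worker['experience']}",
        f"Location: {worker['location']}",
    ]
    # The bio is optional; skip it rather than embedding an empty field.
    if worker.get("bio"):
        parts.append(f"Bio: {worker['bio']}")
    return "\n".join(parts)


# Made-up record for illustration only.
sample = {
    "name": "Marco",
    "skills": ["plumbing", "pipe repair"],
    "experience": "5 years",
    "location": "Rome",
}
print(enrich_worker(sample))
```

Keeping the field labels (`Skills:`, `Location:`) in the text gives the embedding model some context about what each value means. Gig posts would presumably follow the same pattern with their own fields.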
## 🦙 LlamaIndex Integration
### Why LlamaIndex?
1. **Sponsor Recognition** - LlamaIndex is a hackathon sponsor 🎉
2. **Production-Ready** - Battle-tested RAG framework
3. **Easy Integration** - Simple API for vector operations
4. **Flexible** - Supports multiple vector stores and embeddings
### Implementation
```python
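import chromadb
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Initialize embedding model
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Create documents with rich text (name, skills, location come from worker_data)
worker_doc = Document(
    text=f"Name: {name}, Skills: {skills}, Location: {location}...",
    metadata=worker_data,
)
documents = [worker_doc]

# Create the ChromaDB-backed vector store
chroma_client = chromadb.PersistentClient(path="./chroma_db")  # example path
chroma_collection = chroma_client.get_or_create_collection("workers")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create vector index (the vector store is passed via storage_context)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)

# Query (as_query_engine also needs an LLM configured for answer synthesis)
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Looking for plumber in Rome...")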
```
## 🤗 HuggingFace Embeddings
### Model: all-MiniLM-L6-v2
**Why this model?**
- ✅ Fast inference (only 23M parameters)
- ✅ Good quality embeddings (384 dimensions)
- ✅ Pre-trained on semantic similarity
- ✅ HuggingFace sponsor recognition 🤗
**Performance:**
- Embedding time: ~20ms per text
- Vector size: 384 dimensions
- Cosine similarity for matching
### How Embeddings Work
1. **Text → Vector**: Each worker/gig is converted to a 384-dimensional vector
2. **Semantic Meaning**: Similar meanings = similar vectors
3. **Cosine Similarity**: Measure the angle between vectors (score from -1 to 1; closely related texts score near 1)
4. **Top K**: Return K most similar vectors
**Example:**
```python
text1 = "Experienced plumber, pipe repair, Rome"
text2 = "Looking for plumbing services, leak fix, Rome"
# After embedding (illustrative values):
vec1 = [0.23, -0.45, 0.67, ...] # 384 dimensions
vec2 = [0.21, -0.43, 0.69, ...] # 384 dimensions
# Cosine similarity: 0.94 (very similar!)
```
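The similarity and Top K steps described above can be written out in a few lines of plain Python. This is a minimal sketch using toy 3-dimensional vectors rather than real 384-dimensional embeddings; the function names `cosine_similarity` and `top_k` are illustrative and are not the project's or any library's API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here: a zero vector matches nothing.
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float], docs: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Return the k document ids with the highest similarity to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors (real embeddings have 384 dimensions).
query_vec = [0.23, -0.45, 0.67]
docs = {
    "plumber": [0.21, -0.43, 0.69],
    "electrician": [-0.11, 0.52, -0.33],
}
for doc_id, score in top_k(query_vec, docs, k=2):
    print(f"{doc_id}: {score:.2f}")
```

In practice the vector store performs this search, usually with an approximate index instead of a full scan. The brute-force version is only meant to show what is being computed.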
## 📊 ChromaDB Vector Store
### Why ChromaDB?
- ✅ Simple local setup (no server needed)
- ✅ Fast vector search
- ✅ Native Python API
- ✅ Persistence support
- ✅ Perfect for demo/hackathon
### Collections
**Workers Collection:**
- 50 worker profiles
- Indexed by skills, experience, location
- Searchable by semantic similarity
**Gigs Collection:**
- 50 gig posts
- Indexed by requirements, project details
- Searchable by semantic similarity
## 🎯 Semantic Matching Algorithm
### Traditional Keyword Matching (OLD)
```python
# Problem: Only finds exact keyword matches
if "plumbing" in worker_skills and "plumbing" in gig_requirements:
    score += 1  # Match!
```
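To make the limitation concrete, the sketch below runs an exact-keyword check on two made-up skill lists. The helper `keyword_match` is hypothetical and only illustrates the failure mode; it is not the project's scoring code.

```python
def keyword_match(worker_skills: list[str], gig_requirements: list[str]) -> int:
    """Count requirements that appear verbatim (case-insensitive) in the worker's skills."""
    skills = {s.lower() for s in worker_skills}
    return sum(1 for req in gig_requirements if req.lower() in skills)


# Same trade, different wording: exact matching scores zero.
print(keyword_match(["pipe repair", "leak detection"], ["plumbing"]))  # 0
# Identical wording is the only case that scores.
print(keyword_match(["plumbing"], ["plumbing"]))  # 1
```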
### Semantic Matching with RAG (NEW)
```
# Solution: Understands meaning and context
Query: "Need someone to fix leaking pipes"
Embedding: [0.23, -0.45, 0.67, ...]

Worker 1: "Plumber, pipe repair specialist"
Embedding: [0.21, -0.43, 0.69, ...]
Similarity: 0.94 ← HIGH MATCH!

Worker 2: "Electrician, wiring expert"
Embedding: [-0.11, 0.52, -0.33, ...]
Similarity: 0.12 ← LOW MATCH

# Semantic search finds Worker 1 even though
# the word "plumbing" wasn't explicitly mentioned!
```
### Advantages
1. **Synonym Understanding**: "plumber" ≈ "pipe specialist"
2. **Context Awareness**: "fix pipes" ≈ "repair plumbing"
3. **Related Concepts**: "garden" ≈ "landscaping" ≈ "outdoor"
4. **Multi-language**: Can handle slight variations, though all-MiniLM-L6-v2 is trained primarily on English text
5. **Fuzzy Matching**: Typos and variations often still match
## 🔬 Match Score Calculation
### Components
1. **Semantic Similarity** (70% weight)
   - Cosine similarity from vector embeddings
   - Range: 0.0 to 1.0
   - Higher = better semantic match
2. **Keyword Overlap** (20% weight)
   - Exact skill matches
   - Experience level alignment
   - Calculated as: matched_skills / required_skills
3. **Location Match** (10% weight)
   - Geographic proximity
   - Remote work consideration
   - Binary: 1.0 (same location/remote) or 0.5 (different)
### Final Formula
```python
semantic_score = cosine_similarity(query_vec, doc_vec)
keyword_score = len(matched_skills) / len(required_skills)
location_score = 1.0 if location_match else 0.5

final_score = (
    semantic_score * 0.7 +
    keyword_score * 0.2 +
    location_score * 0.1
) * 100  # Convert to 0-100 scale
```
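As a worked example, the formula above can be wrapped in a function and run on one candidate. This is a sketch under stated assumptions: the function name `match_score`, the use of sets for skills, and the choice to score 0 when a gig lists no required skills are all illustrative and not confirmed by the project code. The input values are made up.

```python
def match_score(
    semantic_score: float,
    matched_skills: set[str],
    required_skills: set[str],
    location_match: bool,
) -> float:
    """Combine the three components with the 70/20/10 weights on a 0-100 scale."""
    # Guard against gigs that list no required skills (assumed to score 0 here).
    keyword_score = (
        len(matched_skills) / len(required_skills) if required_skills else 0.0
    )
    location_score = 1.0 if location_match else 0.5
    return (semantic_score * 0.7 + keyword_score * 0.2 + location_score * 0.1) * 100


score = match_score(
    semantic_score=0.94,
    matched_skills={"plumbing"},
    required_skills={"plumbing", "leak repair"},
    location_match=True,
)
# 0.94 * 0.7 + 0.5 * 0.2 + 1.0 * 0.1 = 0.858
print(round(score, 1))  # 85.8
```

Because the location component never drops below 0.5, every candidate receives at least 5 points from location alone, which is worth keeping in mind when comparing low scores.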
## 📈 Performance & Scalability
### Current Setup (Demo)
- 50 workers + 50 gigs = 100 vectors
- Average query time: ~100ms
- Embedding model loaded in memory: ~100MB
- Total memory usage: ~200MB
### Production Scaling
**For 10,000 entries:**
- ✅ Still fast (<500ms per query)
- ✅ ChromaDB handles easily
- ✅ Consider batch embedding for ingestion
**For 100,000+ entries:**
- Use hosted vector DB (Pinecone, Weaviate)
- Batch processing for embeddings
- Caching layer for frequent queries
- GPU acceleration for embedding
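Two of the items above, batch processing and a caching layer, can be sketched with the standard library alone. This is a minimal sketch, assuming an in-process cache is sufficient; `fake_embed` is a hypothetical stand-in for the real embedding call, and `batched` and `cached_embed` are illustrative names rather than project code.

```python
from functools import lru_cache
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(text: str) -> tuple[float, ...]:
    """Hypothetical stand-in for the real embedding call (not a real model)."""
    return (float(len(text)), float(text.count(" ")))


@lru_cache(maxsize=1024)
def cached_embed(text: str) -> tuple[float, ...]:
    """Memoize embeddings so repeated queries skip the model call."""
    return fake_embed(text)


texts = [f"worker {i}" for i in range(10)]
for batch in batched(texts, batch_size=4):
    print(len(batch))  # 4, then 4, then 2

cached_embed("plumber in Rome")
cached_embed("plumber in Rome")  # Served from the cache.
print(cached_embed.cache_info().hits)  # 1
```

A multi-process deployment would likely need a shared cache instead of `lru_cache`, since each process keeps its own copy.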
## 🎨 Benefits for the Hackathon
### Why This is WOW
1. **Not Just LLM Calls**: Real vector database with semantic search
2. **Sponsor Integration**: LlamaIndex 🦙 + HuggingFace 🤗
3. **Production Patterns**: Proper RAG architecture
4. **Scalable**: Easy to extend to 1000s of entries
5. **Explainable**: Can show similarity scores
### Demo Impact
Judges will see:
- ✅ "Powered by LlamaIndex + HuggingFace"
- ✅ Semantic similarity scores in results
- ✅ Better matches than keyword search
- ✅ 100 entries in vector database
- ✅ Real-time vector search
## 🔮 Future Enhancements
### Easy Wins
- [ ] Add filters (location, budget, experience)
- [ ] Implement hybrid search (semantic + keyword)
- [ ] Add reranking with cross-encoders
- [ ] Cache popular queries
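The first item, filters, could start as a simple post-filter over retrieved results before moving the filtering into the vector store query itself. This is a minimal sketch, assuming each result is a dict; the keys `hourly_rate` and `years_experience`, the function name `filter_results`, and the sample data are hypothetical and not taken from the project.

```python
from typing import Any, Optional


def filter_results(
    results: list[dict[str, Any]],
    location: Optional[str] = None,
    max_rate: Optional[float] = None,
    min_years: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Keep only results that satisfy every filter that was supplied."""
    kept = []
    for item in results:
        if location is not None and item["location"].lower() != location.lower():
            continue
        if max_rate is not None and item["hourly_rate"] > max_rate:
            continue
        if min_years is not None and item["years_experience"] < min_years:
            continue
        kept.append(item)
    return kept


# Made-up retrieval results for illustration only.
results = [
    {"name": "A", "location": "Rome", "hourly_rate": 30.0, "years_experience": 5},
    {"name": "B", "location": "Milan", "hourly_rate": 25.0, "years_experience": 8},
    {"name": "C", "location": "Rome", "hourly_rate": 60.0, "years_experience": 10},
]
print([r["name"] for r in filter_results(results, location="rome", max_rate=40.0)])  # ['A']
```

Post-filtering can return fewer than K results, so retrieving a larger candidate set first is a common workaround.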
### Advanced
- [ ] Fine-tune embedding model on gig data
- [ ] Multi-modal embeddings (add images)
- [ ] Graph relationships between skills
- [ ] Temporal embeddings (availability matching)
## 📚 Code Examples
### Creating the Index
```python
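import chromadb
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# 1. Load data
workers = load_workers_from_json()

# 2. Create documents
documents = []
for worker in workers:
    text = f"""
    Name: {worker['name']}
    Skills: {', '.join(worker['skills'])}
    Experience: {worker['experience']}
    Location: {worker['location']}
    """
    # Chroma metadata values must be flat (str, int, float), so join list fields
    metadata = {**worker, "skills": ", ".join(worker["skills"])}
    doc = Document(text=text, metadata=metadata)
    documents.append(doc)

# 3. Create vector store
chroma_client = chromadb.PersistentClient(path="./chroma_db")  # example path
chroma_collection = chroma_client.get_or_create_collection("workers")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# 4. Build index (embed_model is the HuggingFaceEmbedding from the Implementation section)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)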
```
### Querying the Index
```python
# 1. Create query
query = f"""
Looking for: {', '.join(required_skills)}
Location: {location}
Experience: {experience_level}
"""

# 2. Get query engine (this also calls the configured LLM to synthesize an answer;
#    index.as_retriever(similarity_top_k=5).retrieve(query) returns only the matches)
query_engine = index.as_query_engine(similarity_top_k=5)

# 3. Execute query
response = query_engine.query(query)

# 4. Extract results
for node in response.source_nodes:
    worker_data = node.metadata
    similarity_score = node.score
    print(f"Match: {worker_data['name']}, Score: {similarity_score}")
```
## 🎯 Key Takeaways
1. **RAG = Better Matches**: Semantic understanding > keyword matching
2. **LlamaIndex = Easy**: Production RAG in <100 lines of code
3. **HuggingFace = Quality**: Great embeddings, sponsor recognition
4. **ChromaDB = Fast**: Local vector store, perfect for demo
5. **Scalable = Future-proof**: Architecture works at scale
---
**This is what makes GigMatch AI stand out in the hackathon!** 🚀