---
title: Multi-GGUF LLM Inference
emoji: 🧠
colorFrom: pink
colorTo: purple
sdk: gradio
sdk_version: 5.25.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Chat inference for GGUF models with llama.cpp & Gradio
---

This Gradio app enables **chat-based inference** on various GGUF models using `llama.cpp` and `llama-cpp-python`. The application features:

- **Real-Time Web Search Integration:** Uses DuckDuckGo to retrieve up-to-date context; debug output is displayed in real time.
- **Streaming Token-by-Token Responses:** Users see the generated answer as it comes in.
- **Response Cancellation:** A cancel button allows stopping response generation in progress.
- **Customizable Prompts & Generation Parameters:** Adjust the system prompt (with dynamic date insertion), temperature, token limits, and more.
- **Memory-Safe Design:** Loads one model at a time with proper memory management, ideal for deployment on Hugging Face Spaces.
- **Rate Limit Handling:** Implements exponential backoff to cope with DuckDuckGo API rate limits.
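
As a rough illustration of how the streaming and cancellation features listed above can fit together, here is a minimal sketch. The helper name `stream_reply` and the use of a `threading.Event` as the cancel flag are assumptions made for this example, not taken from `app.py`. With `llama-cpp-python`, the token source would typically be built from the chunks returned by `create_chat_completion(..., stream=True)`.

```python
import threading


def stream_reply(token_source, cancel_event):
    """Yield the growing reply text, stopping early once cancel_event is set.

    token_source: any iterable of text fragments (hypothetical stand-in for
                  the chunks produced by the model).
    cancel_event: a threading.Event that a cancel button handler would set.
    """
    text = ""
    for token in token_source:
        # Check the flag before appending, so a cancelled reply stops growing.
        if cancel_event.is_set():
            break
        text += token
        # Yielding the accumulated text lets a Gradio generator handler
        # redraw the chat message after every token.
        yield text


if __name__ == "__main__":
    cancel = threading.Event()
    for partial in stream_reply(["Hel", "lo", ", ", "world"], cancel):
        print(partial)
```

Because the function is a generator, it can be returned directly from a Gradio event handler; the cancel button only needs to call `cancel_event.set()`.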

### 🔄 Supported Models:

- `Qwen/Qwen2.5-7B-Instruct-GGUF` → `qwen2.5-7b-instruct-q2_k.gguf`
- `unsloth/gemma-3-4b-it-GGUF` → `gemma-3-4b-it-Q4_K_M.gguf`
- `unsloth/Phi-4-mini-instruct-GGUF` → `Phi-4-mini-instruct-Q4_K_M.gguf`
- `MaziyarPanahi/Meta-Llama-3.1-8B-Instruct-GGUF` → `Meta-Llama-3.1-8B-Instruct.Q2_K.gguf`
- `unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF` → `DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf`
- `MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF` → `Mistral-7B-Instruct-v0.3.IQ3_XS.gguf`
- `Qwen/Qwen2.5-Coder-7B-Instruct-GGUF` → `qwen2.5-coder-7b-instruct-q2_k.gguf`
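
The repository-to-file pairs above map naturally onto a lookup table. The sketch below is one possible way to express it; the names `MODELS` and `download_model` are hypothetical and may not match `app.py`. It assumes the files are fetched with `hf_hub_download` from the `huggingface_hub` package, which is imported lazily so the table can be used without it.

```python
# Repo ID -> GGUF filename, copied from the list above.
MODELS = {
    "Qwen/Qwen2.5-7B-Instruct-GGUF": "qwen2.5-7b-instruct-q2_k.gguf",
    "unsloth/gemma-3-4b-it-GGUF": "gemma-3-4b-it-Q4_K_M.gguf",
    "unsloth/Phi-4-mini-instruct-GGUF": "Phi-4-mini-instruct-Q4_K_M.gguf",
    "MaziyarPanahi/Meta-Llama-3.1-8B-Instruct-GGUF": "Meta-Llama-3.1-8B-Instruct.Q2_K.gguf",
    "unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF": "DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf",
    "MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF": "Mistral-7B-Instruct-v0.3.IQ3_XS.gguf",
    "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF": "qwen2.5-coder-7b-instruct-q2_k.gguf",
}


def download_model(repo_id):
    """Download the GGUF file for repo_id and return its local path.

    Assumption: the file is pulled from the Hugging Face Hub. A KeyError is
    raised for repositories that are not in MODELS.
    """
    filename = MODELS[repo_id]
    from huggingface_hub import hf_hub_download  # lazy import

    return hf_hub_download(repo_id=repo_id, filename=filename)
```

The returned path could then be passed as `model_path` when constructing a `llama_cpp.Llama` instance.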

### ⚙️ Features:

- **Model Selection:** Select from multiple GGUF models.
- **Customizable Prompts & Parameters:** Set a system prompt (e.g., automatically including today’s date), adjust temperature, token limits, and more.
- **Chat-style Interface:** Interactive Gradio UI with streaming token-by-token responses.
- **Real-Time Web Search & Debug Output:** Leverages DuckDuckGo to fetch recent context, with a dedicated debug panel showing web search progress and results.
- **Response Cancellation:** Cancel in-progress answer generation using a cancel button.
- **Memory-Safe & Rate-Limit Resilient:** Loads one model at a time with proper cleanup and incorporates exponential backoff to handle API rate limits.
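
The last item combines two ideas that can be sketched with the standard library alone. Everything named here (`SingleModelHolder`, `with_backoff`, the `loader` callable, the retry count and delays) is an assumption for illustration; the actual app may structure this differently.

```python
import gc
import random
import time


class SingleModelHolder:
    """Keep at most one model in memory at a time (hypothetical helper)."""

    def __init__(self, loader):
        self._loader = loader  # callable: model name -> model object
        self.name = None
        self.model = None

    def get(self, name):
        if name != self.name:
            # Drop the old reference first so its memory can be reclaimed
            # before the new model is loaded.
            self.model = None
            self.name = None
            gc.collect()
            self.model = self._loader(name)
            self.name = name
        return self.model


def with_backoff(fn, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on failure with exponentially growing delays."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus a little jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

In practice `fn` would wrap the web search call, and a real implementation would likely catch only the rate-limit exception raised by the search client rather than every `Exception`.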

Ideal for deploying multiple GGUF chat models on Hugging Face Spaces with a robust, user-friendly interface!

For further details, check the [Spaces configuration guide](https://huggingface.co/docs/hub/spaces-config-reference).