Commit · 21bc165
Parent(s): 64e67e1

update readme
Files changed:

- .vscode/settings.json (+0 -3)
- README.md (+56 -41)
- docs/CONFIGURATION.md (+0 -104)
- docs/MCP_USAGE.md (+0 -275)
- docs/STATISTICS_ENDPOINT.md (+0 -427)
- docs/deployment/DEPLOYMENT.md (+0 -300)
- docs/deployment/QUICKSTART.md (+0 -148)
- docs/deployment/mcp-client-examples.md (+0 -295)
.vscode/settings.json (DELETED)

```json
{
    "kiroAgent.configureMCP": "Enabled",
}
```
README.md (CHANGED)

````diff
@@ -17,7 +17,7 @@ tags:
 
 # 📊 HuggingFace EDA MCP Server
 
-> 🎉 Submission for the [HuggingFace 1st Birthday Hackathon](https://huggingface.co/
+> 🎉 Submission for the [HuggingFace 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)
 
 An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.
 
@@ -32,6 +32,61 @@ Whether you're a ML engineer, data scientist, or researcher, dataset exploration
 - Ask your AI assistant to build reports and visualizations
 - **Content search**: Find specific examples in datasets using text search
 
+<p align="center">
+  <a href="https://www.youtube.com/watch?v=XdP7zGSb81k">
+    <img src="https://img.shields.io/badge/▶️_Demo_Video-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Demo Video">
+  </a>
+
+  <a href="https://www.linkedin.com/posts/khalil-guetari-00a61415a_mcp-server-for-huggingface-datasets-discovery-activity-7400587711838842880-2K8p">
+    <img src="https://img.shields.io/badge/LinkedIn_Post-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn Post">
+  </a>
+
+  <a href="https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp">
+    <img src="https://img.shields.io/badge/🤗_Try_it_on_HF_Spaces-FFD21E?style=for-the-badge" alt="HF Space">
+  </a>
+</p>
+
+## MCP Client Configuration
+
+Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.
+
+**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
+
+### With URL
+
+```json
+{
+  "mcpServers": {
+    "hf-eda-mcp": {
+      "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
+      "headers": {
+        "hf-api-token": "<HF_TOKEN>"
+      }
+    }
+  }
+}
+```
+
+### With mcp-remote
+
+```json
+{
+  "mcpServers": {
+    "hf-eda-mcp": {
+      "command": "npx",
+      "args": [
+        "mcp-remote",
+        "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
+        "--transport",
+        "streamable-http",
+        "--header",
+        "hf-api-token: <HF_TOKEN>"
+      ]
+    }
+  }
+}
+```
+
 ## Available Tools
 
 ### `get_dataset_metadata`
@@ -135,46 +190,6 @@ Some features require datasets with `builder_name="parquet"`:
 - Graceful fallback from statistics API to sample-based analysis
 - Descriptive error messages with suggestions for common issues
 
-## MCP Client Configuration
-
-Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.
-
-**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
-
-### With URL
-
-```json
-{
-  "mcpServers": {
-    "hf-eda-mcp": {
-      "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
-      "headers": {
-        "hf-api-token": "<HF_TOKEN>"
-      }
-    }
-  }
-}
-```
-
-### With mcp-remote
-
-```json
-{
-  "mcpServers": {
-    "hf-eda-mcp": {
-      "command": "npx",
-      "args": [
-        "mcp-remote",
-        "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
-        "--transport",
-        "streamable-http",
-        "--header",
-        "hf-api-token: <HF_TOKEN>"
-      ]
-    }
-  }
-}
-```
 
 ## Project Structure
 
````
docs/CONFIGURATION.md (DELETED)

# Configuration Guide

The HF EDA MCP Server uses a centralized configuration system that supports both environment variables and command-line arguments.

## Configuration Module

The configuration is managed by the `src/hf_eda_mcp/config.py` module, which provides:

- `ServerConfig` dataclass with all configuration options
- Environment variable loading with `ServerConfig.from_env()`
- Global configuration management with `get_config()` and `set_config()`
- Logging setup and validation utilities
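
As a rough illustration, the dataclass presumably resembles the following sketch; the field names here are inferred from the options documented below and are not copied from the actual `src/hf_eda_mcp/config.py`:

```python
# Hedged sketch only: field names inferred from the documented options
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class ServerConfig:
    port: int = 7860
    host: str = "127.0.0.1"
    mcp_enabled: bool = True
    share: bool = False
    hf_token: Optional[str] = None
    log_level: str = "INFO"
    cache_dir: Optional[str] = None
    max_cache_size: int = 1000
    max_sample_size: int = 50000
    max_concurrent: int = 10
    request_timeout: int = 300

    @classmethod
    def from_env(cls) -> "ServerConfig":
        # Each field falls back to its dataclass default when the variable is unset
        return cls(
            port=int(os.getenv("HF_EDA_PORT", "7860")),
            host=os.getenv("HF_EDA_HOST", "127.0.0.1"),
            mcp_enabled=os.getenv("HF_EDA_MCP_ENABLED", "true").lower() == "true",
            share=os.getenv("HF_EDA_SHARE", "false").lower() == "true",
            hf_token=os.getenv("HF_TOKEN"),
            log_level=os.getenv("HF_EDA_LOG_LEVEL", "INFO"),
            cache_dir=os.getenv("HF_EDA_CACHE_DIR"),
            max_cache_size=int(os.getenv("HF_EDA_MAX_CACHE_SIZE", "1000")),
            max_sample_size=int(os.getenv("HF_EDA_MAX_SAMPLE_SIZE", "50000")),
            max_concurrent=int(os.getenv("HF_EDA_MAX_CONCURRENT", "10")),
            request_timeout=int(os.getenv("HF_EDA_REQUEST_TIMEOUT", "300")),
        )
```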

## Configuration Options

### Server Settings

- `HF_EDA_PORT` (default: 7860) - Server port
- `HF_EDA_HOST` (default: 127.0.0.1) - Server host
- `HF_EDA_MCP_ENABLED` (default: true) - Enable MCP server functionality
- `HF_EDA_SHARE` (default: false) - Enable public sharing via Gradio

### Authentication

- `HF_TOKEN` - HuggingFace access token for private datasets

### Logging

- `HF_EDA_LOG_LEVEL` (default: INFO) - Logging level (DEBUG, INFO, WARNING, ERROR)

### Performance and Caching

- `HF_EDA_CACHE_DIR` - Directory for caching datasets (optional)
- `HF_EDA_MAX_CACHE_SIZE` (default: 1000) - Maximum cache size in MB
- `HF_EDA_MAX_SAMPLE_SIZE` (default: 50000) - Maximum sample size for tools
- `HF_EDA_MAX_CONCURRENT` (default: 10) - Maximum concurrent requests
- `HF_EDA_REQUEST_TIMEOUT` (default: 300) - Request timeout in seconds

## How Configuration is Used

### Server Startup

The server loads configuration from environment variables and applies command-line overrides:

```python
from hf_eda_mcp.config import ServerConfig
from hf_eda_mcp.server import launch_server

config = ServerConfig.from_env()
launch_server(config)
```

### Tools Integration

All EDA tools (metadata, sampling, analysis) use the global configuration:

```python
from hf_eda_mcp.config import get_config

config = get_config()
# Tools respect config.max_sample_size, config.cache_dir, config.hf_token
```

### Dataset Service

The `DatasetService` is initialized with configuration values:

```python
service = DatasetService(
    cache_dir=config.cache_dir,
    token=config.hf_token
)
```

## Configuration Priority

1. Command-line arguments (highest priority)
2. Environment variables
3. Default values (lowest priority)
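
A sketch of how that precedence could be applied; the argument handling here is illustrative, not the project's actual CLI code:

```python
# Hypothetical override step: CLI arguments (when given) win over env vars,
# which in turn win over the dataclass defaults
import argparse
from hf_eda_mcp.config import ServerConfig

parser = argparse.ArgumentParser()
parser.add_argument("--port", type=int)
parser.add_argument("--cache-dir")
args = parser.parse_args()

config = ServerConfig.from_env()       # env vars over defaults
if args.port is not None:
    config.port = args.port            # CLI over env
if args.cache_dir is not None:
    config.cache_dir = args.cache_dir
```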

## Example Usage

### Environment Variables

```bash
export HF_TOKEN="your_token_here"
export HF_EDA_CACHE_DIR="/tmp/hf-cache"
export HF_EDA_MAX_SAMPLE_SIZE=25000
pdm run hf-eda-mcp
```

### Command Line

```bash
pdm run hf-eda-mcp --cache-dir /tmp/cache --max-sample-size 25000 --verbose
```

### Configuration File

Copy `config.example.env` to `.env` and modify as needed, then load with:

```bash
source .env
pdm run hf-eda-mcp
```

## Validation

The configuration system includes validation for:

- Port ranges (1024-65535)
- Cache directory permissions
- Sample size limits
- Timeout values

Invalid configurations will cause the server to exit with helpful error messages.
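
For illustration, those checks might look roughly like this sketch (the function name and field names are assumptions):

```python
# Hypothetical validation pass mirroring the rules listed above
def validate_config(config) -> None:
    if not (1024 <= config.port <= 65535):
        raise ValueError(f"Port {config.port} is outside the allowed range 1024-65535")
    if config.max_sample_size <= 0:
        raise ValueError("Maximum sample size must be positive")
    if config.request_timeout <= 0:
        raise ValueError("Request timeout must be positive")
```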
docs/MCP_USAGE.md (DELETED)

# MCP Server Usage Guide

## Overview

The HF EDA MCP Server provides four main tools for exploratory data analysis of HuggingFace datasets via the Model Context Protocol (MCP).

## Available MCP Tools

The following 4 tools are automatically exposed by Gradio when `mcp_server=True`:

### 1. `get_dataset_metadata`

Retrieve comprehensive metadata for a HuggingFace dataset.

**Parameters:**
- `dataset_id` (string): HuggingFace dataset identifier (e.g., 'imdb', 'squad')
- `config_name` (string, optional): Configuration name for multi-config datasets

**Returns:** JSON object with dataset metadata including size, features, splits, and configuration details.

### 2. `get_dataset_sample`

Retrieve a sample of rows from a HuggingFace dataset.

**Parameters:**
- `dataset_id` (string): HuggingFace dataset identifier
- `split` (string, default: 'train'): Dataset split to sample from
- `num_samples` (number, default: 10): Number of samples to retrieve (max: 10000)
- `config_name` (string, optional): Configuration name for multi-config datasets

**Returns:** JSON object with sampled data and metadata.

### 3. `analyze_dataset_features`

Perform exploratory analysis on dataset features with automatic optimization.

**Parameters:**
- `dataset_id` (string): HuggingFace dataset identifier
- `split` (string, default: 'train'): Dataset split to analyze
- `sample_size` (number, default: 1000): Number of samples for analysis (max: 50000, only used for fallback)
- `config_name` (string, optional): Configuration name for multi-config datasets

**Returns:** JSON object with comprehensive feature analysis including:
- Feature types (numerical, categorical, text, image, audio)
- Statistical measures (mean, median, std, histograms)
- Missing value analysis
- Unique value counts
- Sample values

**Analysis Methods:**
- **Primary**: Uses HuggingFace Dataset Viewer API statistics when available (parquet datasets)
  - Analyzes the full dataset without downloading data
  - Provides complete statistics with histograms
  - More efficient and accurate
- **Fallback**: Sample-based analysis for non-parquet datasets
  - Downloads and analyzes a sample of the dataset
  - Computes statistics locally

### 4. `search_text_in_dataset`

Search for text in text columns of a dataset using the Dataset Viewer API.

**Parameters:**
- `dataset_id` (string): HuggingFace dataset identifier
- `config_name` (string): Configuration name (required for search)
- `split` (string): Dataset split to search in
- `query` (string): Search query text
- `offset` (number, default: 0): Offset for pagination
- `length` (number, default: 10): Number of results to return (max: 100)

**Returns:** JSON object with search results including:
- `features`: List of features from the dataset, including column names and data types
- `rows`: List of matching rows with content from each column
- `num_rows_total`: Total number of examples in the split
- `num_rows_per_page`: Number of examples in the current page
- `partial`: Whether the response is partial (true if the dataset is too large to search completely)

**Limitations:**
- Only text columns are searched
- Only parquet datasets are supported (`builder_name="parquet"`)
- Search is performed by the Dataset Viewer API, not locally

**Validation:**
- The tool validates that the dataset is in parquet format before attempting search
- The tool validates that the dataset has at least one text/string column
- If validation fails, a descriptive error message is returned with suggestions
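
Because each tool is exposed as a regular Gradio endpoint, it can also be exercised programmatically. The sketch below uses `gradio_client` and assumes the tool is reachable under its function name as `api_name` and accepts its documented parameters positionally; adjust to the actual endpoint signature:

```python
# Illustrative only: api_name and argument order are assumptions
from gradio_client import Client

client = Client("http://localhost:7860")
metadata = client.predict(
    "imdb",        # dataset_id
    "plain_text",  # config_name
    api_name="/get_dataset_metadata",
)
print(metadata)
```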

## MCP Client Configuration

### Using with Claude Desktop

Add this configuration to your MCP settings:

```json
{
  "mcpServers": {
    "hf-eda-mcp-server": {
      "command": "pdm",
      "args": ["run", "hf-eda-mcp"],
      "env": {
        "HF_TOKEN": "your_huggingface_token_here"
      }
    }
  }
}
```

### Using with Hosted Server

If the server is running on a remote host:

```json
{
  "mcpServers": {
    "hf-eda-mcp-server": {
      "url": "https://your-server.com/gradio_api/mcp/sse",
      "headers": {
        "hf-api-token": "your_huggingface_token_here"
      }
    }
  }
}
```

## Starting the Server

### Local Development

```bash
# Start with MCP server enabled (default)
pdm run hf-eda-mcp

# Start on custom port
pdm run hf-eda-mcp --port 8080

# Start with verbose logging
pdm run hf-eda-mcp --verbose

# Start without MCP server functionality
pdm run hf-eda-mcp --no-mcp

# Start with custom host (listen on all interfaces)
pdm run hf-eda-mcp --host 0.0.0.0

# Start with public sharing enabled
pdm run hf-eda-mcp --share

# Start with custom cache directory
pdm run hf-eda-mcp --cache-dir /path/to/cache

# Start with custom maximum sample size
pdm run hf-eda-mcp --max-sample-size 100000
```

### Server Modes

The server provides both a web interface and MCP server functionality in a single application. When MCP is enabled, Gradio automatically exposes the 4 EDA functions as MCP tools while still providing the web interface for direct interaction.

### Environment Variables

The server supports comprehensive configuration via environment variables:

#### Authentication
- `HF_TOKEN`: HuggingFace access token for private datasets (optional)

#### Server Configuration
- `HF_EDA_PORT`: Server port (default: 7860)
- `HF_EDA_HOST`: Server host (default: 127.0.0.1)
- `HF_EDA_MCP_ENABLED`: Enable MCP server functionality (default: true)
- `HF_EDA_SHARE`: Enable public sharing via Gradio (default: false)

#### Logging Configuration
- `HF_EDA_LOG_LEVEL`: Logging level - DEBUG, INFO, WARNING, ERROR (default: INFO)

#### Performance and Caching
- `HF_EDA_CACHE_DIR`: Directory for caching datasets (optional)
- `HF_EDA_MAX_CACHE_SIZE`: Maximum cache size in MB (default: 1000)
- `HF_EDA_MAX_SAMPLE_SIZE`: Maximum sample size for analysis (default: 50000)
- `HF_EDA_MAX_CONCURRENT`: Maximum concurrent requests (default: 10)
- `HF_EDA_REQUEST_TIMEOUT`: Request timeout in seconds (default: 300)

### Configuration Examples

#### Production Configuration

```bash
export HF_TOKEN="your_token_here"
export HF_EDA_HOST="0.0.0.0"
export HF_EDA_PORT="8080"
export HF_EDA_LOG_LEVEL="WARNING"
export HF_EDA_CACHE_DIR="/var/cache/hf-eda"
export HF_EDA_MAX_CONCURRENT="20"
pdm run hf-eda-mcp
```

#### Development Configuration

```bash
export HF_TOKEN="your_token_here"
export HF_EDA_LOG_LEVEL="DEBUG"
export HF_EDA_CACHE_DIR="./cache"
pdm run hf-eda-mcp --verbose
```

## Dataset Viewer Statistics Integration

The `analyze_dataset_features` tool automatically uses HuggingFace's Dataset Viewer API when available, providing significant benefits:

### Benefits
- **Full Dataset Analysis**: Analyzes entire datasets instead of samples
- **No Download Required**: Statistics are pre-computed by HuggingFace
- **Richer Statistics**: Includes histograms, frequencies, and multi-modal support
- **Better Performance**: Faster response times with caching

### Supported Datasets

Statistics are available for datasets with `builder_name="parquet"`. The tool automatically:

1. Checks if Dataset Viewer statistics are available
2. Uses full dataset statistics when available
3. Falls back to sample-based analysis for other datasets

### Supported Data Types

The analysis tool provides comprehensive statistics for multiple data types:

- **Numerical** (int, float): min, max, mean, median, std, histograms
- **Categorical** (class_label, string_label): frequencies, unique counts
- **Boolean** (bool): True/False distributions
- **Text** (string_text): character length statistics, histograms
- **Image** (image): dimension statistics, histograms
- **Audio** (audio): duration statistics (seconds), histograms
- **List** (list): length statistics, histograms

### Response Indicators

Check the `sample_info` field in the response:

- `sampling_method: "dataset_viewer_api"` - Using full dataset statistics
- `sampling_method: "sequential_head"` - Using sample-based analysis
- `represents_full_dataset: true/false` - Whether analysis covers the full dataset

## Example Usage

Once connected to an MCP client, you can use the tools like this:

```
# Get metadata for the IMDB dataset
Use the get_dataset_metadata tool with dataset_id="imdb"

# Sample 5 rows from the training split
Use the get_dataset_sample tool with dataset_id="imdb", split="train", num_samples=5

# Analyze features of the GLUE dataset (CoLA configuration)
Use the analyze_dataset_features tool with dataset_id="glue", config_name="cola", sample_size=500

# Search for text in the IMDB dataset
Use the search_text_in_dataset tool with dataset_id="imdb", config_name="plain_text", split="train", query="great movie", offset=0, length=10

# Search for a specific term in the SQuAD dataset
Use the search_text_in_dataset tool with dataset_id="squad", config_name="plain_text", split="train", query="president", offset=0, length=5
```

## API Endpoints

When the server is running, you can also access the tools via HTTP API:

- **MCP Schema**: `http://localhost:7860/gradio_api/mcp/schema`
- **API Documentation**: `http://localhost:7860/?view=api`
- **Web Interface**: `http://localhost:7860`
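
For example, a quick sanity check that the MCP schema is being served (a minimal sketch using `requests`):

```python
import requests

# Fetch the published MCP tool schema from a locally running server
resp = requests.get("http://localhost:7860/gradio_api/mcp/schema", timeout=10)
resp.raise_for_status()
print(resp.json())  # lists the tools the server currently exposes
```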

## Troubleshooting

### Authentication Issues
- Ensure the `HF_TOKEN` environment variable is set for private datasets
- Check that your HuggingFace token has appropriate permissions

### Dataset Not Found
- Verify the dataset ID is correct and exists on HuggingFace Hub
- Check if the dataset requires authentication

### Performance Issues
- Reduce `sample_size` for large datasets
- Use streaming mode (enabled by default) for better memory efficiency

### Search Tool Issues
- **Dataset not in parquet format**: The search tool only works with parquet datasets. If you get a "DatasetNotParquetError", try using a different dataset or check if the dataset has a parquet configuration
- **No text columns found**: The search tool requires at least one text/string column. If you get a "NoTextColumnsError", verify that the dataset has text columns by checking the dataset metadata first
docs/STATISTICS_ENDPOINT.md (DELETED)

# Dataset Viewer Statistics Endpoint Integration

## Overview

The HuggingFace Dataset Viewer API provides a `/statistics` endpoint that offers comprehensive statistics for datasets with `builder_name="parquet"`. This endpoint is significantly more efficient and complete than sample-based analysis.
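
To make this concrete, here is an illustrative direct query against the public Dataset Viewer statistics endpoint using `requests`; the dataset, config, and split values are examples:

```python
import os
import requests

# Query pre-computed statistics for one split; no dataset download involved
resp = requests.get(
    "https://datasets-server.huggingface.co/statistics",
    params={"dataset": "stanfordnlp/imdb", "config": "plain_text", "split": "train"},
    headers={"Authorization": f"Bearer {os.environ.get('HF_TOKEN', '')}"},
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()
print(payload["num_examples"], "examples,", len(payload["statistics"]), "columns")
```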

## Key Benefits

### 1. Full Dataset Coverage
- **Before**: Analysis based on samples (default 1,000 examples)
- **After**: Statistics computed on the entire dataset (e.g., 25,000 examples for IMDB train split)

### 2. No Data Download Required
- **Before**: Download and process samples from the dataset
- **After**: Retrieve pre-computed statistics via API call

### 3. More Complete Statistics

The endpoint provides detailed statistics for multiple modalities:

#### Numerical Features (int, float)
- **Basic statistics**: min, max, mean, median, std
- **Missing values**: nan_count, nan_proportion
- **Distribution**: histogram with bin_edges and hist counts

Example response:
```json
{
  "column_type": "float",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 0,
    "max": 2,
    "mean": 1.67206,
    "median": 1.8,
    "std": 0.38714,
    "histogram": {
      "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
      "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2]
    }
  }
}
```

#### Categorical Features (class_label, string_label)
- **Unique values**: n_unique count
- **Frequencies**: Complete frequency distribution for all categories
- **Missing values**: nan_count, nan_proportion
- **No label tracking**: no_label_count, no_label_proportion (for class_label)

Example response:
```json
{
  "column_type": "class_label",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "no_label_count": 0,
    "no_label_proportion": 0,
    "n_unique": 2,
    "frequencies": {
      "unacceptable": 2528,
      "acceptable": 6023
    }
  }
}
```

#### Text Features (string_text)
- **Length statistics**: min, max, mean, median, std (character count)
- **Missing values**: nan_count, nan_proportion
- **Distribution**: histogram of text lengths

Example response:
```json
{
  "column_type": "string_text",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 6,
    "max": 231,
    "mean": 40.70074,
    "median": 37,
    "std": 19.14431,
    "histogram": {
      "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
      "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]
    }
  }
}
```

#### Boolean Features (bool)
- **Frequencies**: Distribution of True/False values
- **Missing values**: nan_count, nan_proportion

Example response:
```json
{
  "column_type": "bool",
  "column_statistics": {
    "nan_count": 3,
    "nan_proportion": 0.15,
    "frequencies": {
      "False": 7,
      "True": 10
    }
  }
}
```

#### Image Features (image)
- **Dimension statistics**: min, max, mean, median, std (for width/height)
- **Missing values**: nan_count, nan_proportion
- **Distribution**: histogram of image dimensions

Example response:
```json
{
  "column_type": "image",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0.0,
    "min": 256,
    "max": 873,
    "mean": 327.99339,
    "median": 341.0,
    "std": 60.07286,
    "histogram": {
      "hist": [1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2],
      "bin_edges": [256, 318, 380, 442, 504, ...]
    }
  }
}
```

#### Audio Features (audio)
- **Duration statistics**: min, max, mean, median, std (in seconds)
- **Missing values**: nan_count, nan_proportion
- **Distribution**: histogram of audio durations

Example response:
```json
{
  "column_type": "audio",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 1.02,
    "max": 15,
    "mean": 13.93042,
    "median": 14.77,
    "std": 2.63734,
    "histogram": {
      "hist": [32, 25, 18, 24, 22, 17, 18, 19, 55, 1770],
      "bin_edges": [1.02, 2.418, 3.816, 5.214, 6.612, ...]
    }
  }
}
```

#### List Features (list)
- **Length statistics**: min, max, mean, median, std (list length)
- **Missing values**: nan_count, nan_proportion
- **Distribution**: histogram of list lengths

Example response:
```json
{
  "column_type": "list",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0.0,
    "min": 1,
    "max": 3,
    "mean": 1.01741,
    "median": 1.0,
    "std": 0.13146,
    "histogram": {
      "hist": [11177, 196, 1],
      "bin_edges": [1, 2, 3, 3]
    }
  }
}
```

## Implementation

### Architecture

```
analyze_dataset_features()
    ↓
Try: get_dataset_statistics()  [Dataset Viewer API]
    ↓
If available (parquet format):
    → Use full dataset statistics
    → Cache results
    → Return converted analysis
    ↓
If not available:
    → Fall back to sample-based analysis
    → Load samples via streaming
    → Compute statistics locally
```

### Key Components

#### 1. DatasetViewerAdapter
- `get_dataset_statistics()`: Fetch statistics from API
- `check_statistics_availability()`: Check if statistics are available for a dataset

#### 2. DatasetService
- `get_dataset_statistics()`: Wrapper with caching and error handling
- Automatic fallback to sample-based analysis
- Statistics cache directory: `cache/statistics/`

#### 3. Analysis Tool
- `_convert_viewer_statistics_to_analysis()`: Convert API format to our analysis format
- Seamless integration with existing analysis pipeline

### Caching Strategy

Statistics are cached with the same TTL as other metadata (default: 1 hour):

```
cache/
├── metadata/     # Dataset metadata
├── samples/      # Sample data
└── statistics/   # Dataset Viewer statistics
    └── {dataset}_{config}_{split}_stats.json
```
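
A hypothetical helper mirroring that layout; the name-mangling of the dataset id is an assumption, not the project's actual code:

```python
from pathlib import Path

def stats_cache_path(cache_dir: str, dataset: str, config: str, split: str) -> Path:
    # "stanfordnlp/imdb" -> "stanfordnlp_imdb" so the id is a valid file name
    safe_dataset = dataset.replace("/", "_")
    return Path(cache_dir) / "statistics" / f"{safe_dataset}_{config}_{split}_stats.json"
```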

## Usage Examples

### Automatic Selection

```python
from hf_eda_mcp.tools.analysis import analyze_dataset_features

# Automatically uses Dataset Viewer statistics if available
result = analyze_dataset_features(
    dataset_id="stanfordnlp/imdb",
    split="train"
)

# Check which method was used
print(result['sample_info']['sampling_method'])
# Output: "dataset_viewer_api" or "sequential_head"

print(result['sample_info']['represents_full_dataset'])
# Output: True (full dataset) or False (sample)
```

### Check Availability

```python
from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter

adapter = DatasetViewerAdapter(token="your_token")
availability = adapter.check_statistics_availability("stanfordnlp/imdb")

print(availability)
# {
#   'available': True,
#   'configs': ['plain_text'],
#   'reason': 'Statistics available for 1 config(s)'
# }
```

### Direct Statistics Access

```python
from hf_eda_mcp.services.dataset_service import DatasetService

service = DatasetService(token="your_token")
stats = service.get_dataset_statistics(
    dataset_id="stanfordnlp/imdb",
    split="train",
    config_name="plain_text"
)

if stats:
    print(f"Full dataset: {stats['num_examples']} examples")
    print(f"Columns: {len(stats['statistics'])}")
else:
    print("Statistics not available, use sample-based analysis")
```

## Comparison: Before vs After

### IMDB Dataset Example

#### Before (Sample-based)
```python
{
    'dataset_info': {
        'sample_size_used': 1000,
        'sample_size_requested': 1000,
    },
    'sample_info': {
        'sampling_method': 'sequential_head',
        'represents_full_dataset': True,  # Only if sample >= requested
    },
    'features': {
        'text': {
            'feature_type': 'text',
            'statistics': {
                'count': 1000,
                'avg_length': 1311.289,
                'min_length': 65,
                'max_length': 6103,
                # Limited to sample
            }
        }
    },
    'summary': 'Analyzed 2 features from 1000 samples | Types: 1 categorical, 1 text'
}
```

#### After (Dataset Viewer)
```python
{
    'dataset_info': {
        'sample_size_used': 25000,  # Full dataset
        'sample_size_requested': 25000,
    },
    'sample_info': {
        'sampling_method': 'dataset_viewer_api',
        'represents_full_dataset': True,  # Always true
        'partial': False
    },
    'features': {
        'text': {
            'feature_type': 'text',
            'statistics': {
                'count': 25000,  # Full dataset
                'mean_length': 1325.06964,
                'min_length': 52,
                'max_length': 13704,
                'histogram': {
                    'bin_edges': [52, 1418, 2784, ...],
                    'hist': [17426, 5384, 1490, ...]
                }
            }
        }
    },
    'summary': 'Analyzed 2 features from 25000 samples | Types: 1 categorical, 1 text'
}
```

## Supported Data Types

The Dataset Viewer statistics endpoint supports comprehensive analysis for multiple data types:

| Data Type | Feature Type | Statistics Provided |
|-----------|--------------|---------------------|
| `int`, `float` | numerical | min, max, mean, median, std, histogram |
| `class_label`, `string_label` | categorical | frequencies, n_unique, no_label tracking |
| `bool` | boolean | True/False frequencies |
| `string_text` | text | character length stats (min, max, mean, median, std), histogram |
| `image` | image | dimension statistics, histogram |
| `audio` | audio | duration statistics (seconds), histogram |
| `list` | list | length statistics, histogram |

### Data Type Mapping

Our analysis tool automatically maps Dataset Viewer types to our internal types:

```python
Dataset Viewer Type → Our Feature Type
─────────────────────────────────────
int, float          → numerical
class_label         → categorical
string_label        → categorical
bool                → boolean
string_text         → text
image               → image
audio               → audio
list                → list
```

## Limitations

### Dataset Requirements
- Only works for datasets with `builder_name="parquet"`
- Not all datasets on HuggingFace Hub have this format
- Automatic fallback to sample-based analysis for other formats

### API Availability
- Requires internet connection
- Subject to HuggingFace API rate limits
- May fail for private datasets without proper authentication

## Error Handling

The implementation includes robust error handling:

1. **Check availability first**: Verify dataset supports statistics
2. **Graceful fallback**: Automatically use sample-based analysis if unavailable
3. **Caching**: Reduce API calls and improve performance
4. **Logging**: Clear messages about which method is being used

## Performance Impact

### API Call Overhead
- Initial call: ~1-2 seconds
- Cached calls: <10ms
- No data download required

### Sample-based Analysis
- Download time: Varies by dataset size
- Processing time: ~1-5 seconds for 1000 samples
- Network bandwidth: Depends on sample size

## Future Enhancements

1. **Parallel requests**: Fetch statistics for multiple splits simultaneously
2. **Partial statistics**: Support datasets with partial statistics
3. **Custom aggregations**: Add more statistical measures
4. **Visualization**: Generate plots from histogram data

## References

- [HuggingFace Dataset Viewer Documentation](https://huggingface.co/docs/dataset-viewer/info)
- [Statistics Endpoint Specification](https://huggingface.co/docs/dataset-viewer/statistics)
docs/deployment/DEPLOYMENT.md (DELETED)

# Deployment Guide

This guide covers different deployment options for the hf-eda-mcp server.

## Table of Contents

- [Local Development](#local-development)
- [Docker Deployment](#docker-deployment)
- [HuggingFace Spaces](#huggingface-spaces)
- [Production Considerations](#production-considerations)

---

## Local Development

### Prerequisites

- Python 3.13+
- PDM (Python package manager)
- HuggingFace account (optional, for private datasets)

### Setup

1. Clone the repository:

```bash
git clone https://github.com/your-username/hf-eda-mcp.git
cd hf-eda-mcp
```

2. Install dependencies:

```bash
pdm install
```

3. Configure environment variables:

```bash
cp config.example.env .env
# Edit .env and add your HF_TOKEN if needed
```

4. Run the server:

```bash
pdm run hf-eda-mcp
```

The server will start on `http://localhost:7860` with MCP enabled.

---

## Docker Deployment

### Build the Image

```bash
docker build -t hf-eda-mcp:latest .
```

### Run with Docker

```bash
docker run -d \
  --name hf-eda-mcp-server \
  -p 7860:7860 \
  -e HF_TOKEN=your_token_here \
  -v hf-cache:/app/cache \
  hf-eda-mcp:latest
```

### Run with Docker Compose

1. Create a `.env` file with your configuration:

```bash
HF_TOKEN=your_token_here
```

2. Start the service:

```bash
docker-compose up -d
```

3. View logs:

```bash
docker-compose logs -f
```

4. Stop the service:

```bash
docker-compose down
```

### Docker Configuration Options

Environment variables you can set:

- `HF_TOKEN`: HuggingFace API token
- `GRADIO_SERVER_NAME`: Server host (default: `0.0.0.0`)
- `GRADIO_SERVER_PORT`: Server port (default: `7860`)
- `HF_HOME`: Cache directory for HuggingFace
- `MCP_SERVER_ENABLED`: Enable MCP server (default: `true`)

---

## HuggingFace Spaces

### Deployment Steps

1. **Create a new Space**:
   - Go to https://huggingface.co/spaces
   - Click "Create new Space"
   - Choose "Gradio" as the SDK
   - Select SDK version 5.49.1 or higher

2. **Upload files**:

```bash
# Copy files to Spaces directory
cp -r src/ spaces/
cp README.md LICENSE spaces/

# Initialize git in spaces directory
cd spaces
git init
git remote add origin https://huggingface.co/spaces/YOUR-USERNAME/hf-eda-mcp
```

3. **Configure the Space**:
   - Copy `spaces/README.md` as the Space's README
   - Ensure `spaces/app.py` is set as the app file
   - Add `spaces/requirements.txt` for dependencies

4. **Set secrets** (for private datasets):
   - Go to Space settings
   - Add `HF_TOKEN` as a secret

5. **Deploy**:

```bash
git add .
git commit -m "Initial deployment"
git push origin main
```

### Space Configuration

The Space will automatically:

- Install dependencies from `requirements.txt`
- Run `app.py` as the entry point
- Expose the MCP server at `/gradio_api/mcp/sse`

### Accessing the Space

Your MCP server will be available at:

```
https://YOUR-USERNAME-hf-eda-mcp.hf.space/gradio_api/mcp/sse
```

---

## Production Considerations

### Security

1. **Authentication**:
   - Use environment variables for sensitive data
   - Never commit tokens to version control
   - Rotate tokens regularly

2. **Access Control**:
   - Consider implementing rate limiting
   - Use HTTPS for all connections
   - Validate all input parameters

3. **Secrets Management**:
   - Use Docker secrets or environment files
   - For Spaces, use the built-in secrets feature
   - Consider using a secrets manager (AWS Secrets Manager, HashiCorp Vault)

### Performance

1. **Caching**:
   - Enable persistent cache volumes
   - Configure appropriate cache sizes
   - Monitor cache hit rates

2. **Resource Limits**:
   - Set memory limits in Docker
   - Configure appropriate timeouts
   - Monitor CPU and memory usage

3. **Scaling**:
   - Use load balancers for multiple instances
   - Consider horizontal scaling for high traffic
   - Monitor response times and adjust resources

### Monitoring

1. **Logging**:
   - Configure structured logging
   - Use log aggregation tools (ELK, Splunk)
   - Monitor error rates

2. **Metrics**:
   - Track request counts and latencies
   - Monitor cache performance
   - Set up alerts for errors

3. **Health Checks**:
   - Implement health check endpoints
   - Configure container health checks
   - Set up uptime monitoring
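
A minimal liveness probe along those lines, as a sketch; the URL and the bare HTTP 200 criterion are assumptions to adapt to your deployment:

```python
import requests

def is_healthy(base_url: str = "http://localhost:7860") -> bool:
    # Treat any HTTP 200 from the web UI as "alive"
    try:
        return requests.get(base_url, timeout=5).status_code == 200
    except requests.RequestException:
        return False
```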

### Backup and Recovery

1. **Data Backup**:
   - Backup cache volumes regularly
   - Document configuration settings
   - Version control all code

2. **Disaster Recovery**:
   - Document deployment procedures
   - Test recovery processes
   - Maintain rollback capabilities

---

## Deployment Checklist

### Pre-Deployment

- [ ] All tests passing
- [ ] Dependencies up to date
- [ ] Security scan completed
- [ ] Documentation updated
- [ ] Environment variables configured
- [ ] Secrets properly managed

### Deployment

- [ ] Build successful
- [ ] Health checks passing
- [ ] MCP endpoints accessible
- [ ] Tools functioning correctly
- [ ] Logs being collected
- [ ] Monitoring configured

### Post-Deployment

- [ ] Verify all tools work
- [ ] Check performance metrics
- [ ] Monitor error rates
- [ ] Test with MCP clients
- [ ] Document any issues
- [ ] Update runbooks

---

## Troubleshooting

### Common Issues

1. **Server won't start**:
   - Check Python version (3.13+ required)
   - Verify all dependencies installed
   - Check port availability
   - Review logs for errors

2. **MCP connection fails**:
   - Verify server is running
   - Check firewall settings
   - Confirm correct URL/port
   - Test with curl or browser

3. **Dataset access errors**:
   - Verify HF_TOKEN is set
   - Check token permissions
   - Confirm dataset exists
   - Test with public dataset first

4. **Performance issues**:
   - Check cache configuration
   - Monitor resource usage
   - Reduce sample sizes
   - Enable caching

### Getting Help

- Check logs: `docker logs hf-eda-mcp-server`
- Review documentation: See `MCP_USAGE.md`
- Open an issue: GitHub repository
- Community support: HuggingFace forums

---

## Next Steps

After deployment:

1. Configure MCP clients (see `deployment/mcp-client-examples.md`)
2. Test all tools with various datasets
3. Set up monitoring and alerts
4. Document any custom configurations
5. Share your Space with the community!
docs/deployment/QUICKSTART.md
DELETED
@@ -1,148 +0,0 @@
# Quick Start Guide

Get hf-eda-mcp running in minutes!

## Choose Your Deployment Method

### 🚀 Option 1: Local Development (Fastest)

```bash
# Install dependencies
pdm install

# Set up the environment (optional for public datasets)
cp config.example.env .env
# Edit .env and add HF_TOKEN if needed

# Run the server
pdm run hf-eda-mcp
```

The server runs at `http://localhost:7860`.

---

### 🐳 Option 2: Docker (Recommended for Production)

```bash
# Build the image
docker build -t hf-eda-mcp:latest .

# Run the container
docker run -d \
  --name hf-eda-mcp-server \
  -p 7860:7860 \
  -e HF_TOKEN=your_token_here \
  hf-eda-mcp:latest
```

Or use Docker Compose:

```bash
# Create a .env file with HF_TOKEN
echo "HF_TOKEN=your_token_here" > .env

# Start the service
docker-compose up -d
```

The server runs at `http://localhost:7860`.

---

### ☁️ Option 3: HuggingFace Spaces (Easiest for Sharing)

1. Create a new Gradio Space on HuggingFace
2. Copy the files from the `spaces/` directory to your Space
3. Set `HF_TOKEN` as a secret in the Space settings (if needed)
4. Push to deploy

Your server will be available at `https://YOUR-USERNAME-hf-eda-mcp.hf.space`.

---

## Connect an MCP Client

### Kiro IDE

Add to `.kiro/settings/mcp.json`:

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "pdm",
      "args": ["run", "hf-eda-mcp"],
      "disabled": false
    }
  }
}
```

### Claude Desktop

Add to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "python",
      "args": ["-m", "hf_eda_mcp"],
      "env": {
        "PYTHONPATH": "/path/to/hf-eda-mcp/src"
      }
    }
  }
}
```

---

## Test the Server

### Using the Web Interface

1. Open `http://localhost:7860` in your browser
2. Try the tools with a sample dataset like "squad"
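
You can also probe the endpoint from a script. A minimal sketch, assuming the server from Option 1 or 2 is running locally on the default port:

```python
import urllib.request

# Assumes the server is listening on the default host and port (7860).
with urllib.request.urlopen("http://localhost:7860/") as response:
    print("Server responded with HTTP", response.status)
```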

### Using an MCP Client

Ask your AI assistant:

```
"Get metadata for the squad dataset"
"Show me 5 samples from the train split of squad"
"Analyze the features of the squad dataset"
```

---

## Common Issues

**Server won't start?**
- Check your Python version: `python --version` (3.13+ is required)
- Install dependencies: `pdm install`

**Can't access private datasets?**
- Set `HF_TOKEN` in your `.env` file
- Get a token from: https://huggingface.co/settings/tokens
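
To rule out a bad token, you can verify it directly with `huggingface_hub` (a quick sketch; assumes `HF_TOKEN` is already set in your environment):

```python
import os

from huggingface_hub import HfApi

# Assumes HF_TOKEN was exported or loaded from your .env file beforehand.
api = HfApi(token=os.environ["HF_TOKEN"])
print("Token belongs to:", api.whoami()["name"])
```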

**Port 7860 already in use?**
- Change the port: `GRADIO_SERVER_PORT=8080 pdm run hf-eda-mcp`

---

## Next Steps

- 📖 Read the full [Deployment Guide](DEPLOYMENT.md)
- 🔧 See the [MCP Client Examples](mcp-client-examples.md)
- 📚 Check the [MCP Usage Documentation](../MCP_USAGE.md)

---

## Need Help?

- Check the logs: `docker logs hf-eda-mcp-server` (Docker)
- Review the documentation in `docs/`
- Open an issue on GitHub
docs/deployment/mcp-client-examples.md
DELETED
@@ -1,295 +0,0 @@
# MCP Client Configuration Examples

This document provides configuration examples for connecting various MCP clients to the hf-eda-mcp server.

## Table of Contents

- [Kiro IDE](#kiro-ide)
- [Claude Desktop](#claude-desktop)
- [Custom MCP Client](#custom-mcp-client)
- [Environment Variables](#environment-variables)

---

## Kiro IDE

### Workspace Configuration

Create or edit `.kiro/settings/mcp.json` in your workspace:

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "docker",
      "args": [
        "run",
        "--rm",
        "-i",
        "-p", "7860:7860",
        "--env-file", ".env",
        "hf-eda-mcp:latest"
      ],
      "env": {
        "HF_TOKEN": "${HF_TOKEN}"
      },
      "disabled": false,
      "autoApprove": [
        "get_dataset_metadata",
        "get_dataset_sample",
        "analyze_dataset_features"
      ]
    }
  }
}
```

### User-Level Configuration

Edit `~/.kiro/settings/mcp.json` for global configuration:

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "pdm",
      "args": ["run", "hf-eda-mcp"],
      "env": {
        "HF_TOKEN": "your_token_here"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}
```

### Using HuggingFace Spaces

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "url": "https://your-username-hf-eda-mcp.hf.space/gradio_api/mcp/sse",
      "disabled": false,
      "autoApprove": ["get_dataset_metadata"]
    }
  }
}
```

---

## Claude Desktop

### Configuration File Location

- **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
- **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
- **Linux**: `~/.config/Claude/claude_desktop_config.json`
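
If you script your client setup, you can resolve this path programmatically. A small sketch based on the paths above (the helper name is hypothetical, not part of any SDK):

```python
import os
import platform
from pathlib import Path

# Hypothetical helper: maps each OS to the config path listed above.
def claude_config_path() -> Path:
    system = platform.system()
    if system == "Darwin":
        return Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"
    if system == "Windows":
        return Path(os.environ["APPDATA"]) / "Claude" / "claude_desktop_config.json"
    return Path.home() / ".config/Claude/claude_desktop_config.json"

print(claude_config_path())
```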

### Local Server Configuration

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "python",
      "args": ["-m", "hf_eda_mcp"],
      "env": {
        "HF_TOKEN": "your_token_here",
        "PYTHONPATH": "/path/to/hf-eda-mcp/src"
      }
    }
  }
}
```

### Docker Configuration

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "docker",
      "args": [
        "run",
        "--rm",
        "-i",
        "-p", "7860:7860",
        "-e", "HF_TOKEN=your_token_here",
        "hf-eda-mcp:latest"
      ]
    }
  }
}
```

### HuggingFace Spaces Configuration

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "url": "https://your-username-hf-eda-mcp.hf.space/gradio_api/mcp/sse"
    }
  }
}
```

---

## Custom MCP Client

### Python Client Example

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Connect to the local server
    server_params = StdioServerParameters(
        command="python",
        args=["-m", "hf_eda_mcp"],
        env={"HF_TOKEN": "your_token_here"}
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            # Initialize the connection
            await session.initialize()

            # List available tools
            tools = await session.list_tools()
            print("Available tools:", tools)

            # Call a tool
            result = await session.call_tool(
                "get_dataset_metadata",
                arguments={"dataset_id": "squad"}
            )
            print("Result:", result)

if __name__ == "__main__":
    asyncio.run(main())
```
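
The same SDK can also connect to a hosted deployment over SSE instead of stdio. A minimal sketch, assuming the `mcp` package's SSE transport helper and the Spaces endpoint shown earlier (transport module paths can differ between SDK versions):

```python
import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client  # assumed import path for the SSE transport

async def main():
    # Placeholder URL: substitute your own Space's SSE endpoint.
    url = "https://your-username-hf-eda-mcp.hf.space/gradio_api/mcp/sse"

    async with sse_client(url) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "get_dataset_metadata",
                arguments={"dataset_id": "squad"}
            )
            print("Result:", result)

if __name__ == "__main__":
    asyncio.run(main())
```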

### JavaScript/TypeScript Client Example

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function main() {
  const transport = new StdioClientTransport({
    command: "python",
    args: ["-m", "hf_eda_mcp"],
    env: {
      HF_TOKEN: process.env.HF_TOKEN ?? ""
    }
  });

  const client = new Client({
    name: "hf-eda-client",
    version: "1.0.0"
  }, {
    capabilities: {}
  });

  await client.connect(transport);

  // List tools
  const tools = await client.listTools();
  console.log("Available tools:", tools);

  // Call a tool
  const result = await client.callTool({
    name: "get_dataset_metadata",
    arguments: {
      dataset_id: "squad"
    }
  });
  console.log("Result:", result);

  await client.close();
}

main().catch(console.error);
```

---

## Environment Variables

### Authentication

- `HF_TOKEN`: HuggingFace API token (optional for public datasets, required for private datasets)

### Optional Variables

- `HF_HOME`: Directory for the HuggingFace cache (default: `~/.cache/huggingface`)
- `HF_DATASETS_CACHE`: Directory for the datasets cache
- `TRANSFORMERS_CACHE`: Directory for the transformers cache
- `GRADIO_SERVER_NAME`: Server host (default: `0.0.0.0`)
- `GRADIO_SERVER_PORT`: Server port (default: `7860`)
- `MCP_SERVER_ENABLED`: Enable the MCP server (default: `true`)

### Example .env File

```bash
# HuggingFace Authentication
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Cache Configuration
HF_HOME=/path/to/cache
HF_DATASETS_CACHE=/path/to/cache/datasets
TRANSFORMERS_CACHE=/path/to/cache/transformers

# Server Configuration
GRADIO_SERVER_NAME=0.0.0.0
GRADIO_SERVER_PORT=7860
MCP_SERVER_ENABLED=true
```
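
If you want these values in a Python script rather than only in the server process, a small sketch with `python-dotenv` (an extra dependency; the server does not necessarily load `.env` this way) shows how the file ends up in the environment:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Reads key=value pairs from .env into the process environment.
load_dotenv()
print("Token configured:", bool(os.getenv("HF_TOKEN")))
print("Server port:", os.getenv("GRADIO_SERVER_PORT", "7860"))
```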

---

## Deployment Options Comparison

| Option | Pros | Cons | Best For |
|--------|------|------|----------|
| **Local (PDM)** | Fast, easy debugging | Requires Python setup | Development |
| **Docker** | Isolated, reproducible | Requires Docker | Production, CI/CD |
| **HF Spaces** | Hosted, no maintenance | Limited control | Public sharing |

---

## Troubleshooting

### Connection Issues

1. **Server not starting**: Check the logs for errors and verify that dependencies are installed
2. **Authentication failed**: Verify that `HF_TOKEN` is set correctly
3. **Port already in use**: Change `GRADIO_SERVER_PORT` to a different port
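
For issues 1 and 3, a quick socket probe tells you whether anything is already listening on the expected port (a generic sketch, not specific to this server):

```python
import socket

# connect_ex returns 0 when something accepts the connection on the port.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(2)
    in_use = sock.connect_ex(("127.0.0.1", 7860)) == 0

print("Port 7860 is", "in use" if in_use else "free")
```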

### Tool Execution Issues

1. **Dataset not found**: Verify that the dataset ID is correct on the HuggingFace Hub
2. **Permission denied**: Ensure `HF_TOKEN` has access to the private dataset
3. **Timeout errors**: Increase the timeout settings or use smaller sample sizes

### Docker Issues

1. **Image build fails**: Ensure all dependencies in `pyproject.toml` are compatible
2. **Container exits immediately**: Check the logs with `docker logs hf-eda-mcp-server`
3. **Cache not persisting**: Verify the volume mounts in `docker-compose.yml`

---

## Additional Resources

- [MCP Protocol Documentation](https://modelcontextprotocol.io/)
- [Gradio MCP Integration](https://www.gradio.app/guides/gradio-and-mcp)
- [HuggingFace Hub Documentation](https://huggingface.co/docs/hub/index)
- [Project Repository](https://github.com/your-username/hf-eda-mcp)