---
title: HuggingFace EDA MCP Server
short_description: MCP server to explore and analyze HuggingFace datasets
emoji: πŸ“Š
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.0
app_file: src/app.py
pinned: false
license: apache-2.0
app_port: 7860
tags:
- building-mcp-track-enterprise
- building-mcp-track-consumer
---
# πŸ“Š HuggingFace EDA MCP Server
> πŸŽ‰ Submission for the [Gradio MCP 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)
An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.
Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of the workflow. This server automates the tedious parts, such as fetching metadata, sampling data, and computing statistics, so you can focus on what matters: finding and understanding the right data for your task.
**Use cases:**
- **Dataset discovery**:
  - Inspect metadata, schemas, and samples to evaluate datasets before use
  - Use it in conjunction with the HuggingFace MCP `search_dataset` tool for even more powerful dataset discovery
- **Exploratory data analysis**:
  - Analyze feature distributions, detect missing values, and review statistics
  - Ask your AI assistant to build reports and visualizations
- **Content search**: Find specific examples in datasets using text search
<p align="center">
<a href="https://www.youtube.com/watch?v=XdP7zGSb81k">
<img src="https://img.shields.io/badge/▢️_Demo_Video-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Demo Video">
</a>
&nbsp;
<a href="https://www.linkedin.com/posts/khalil-guetari-00a61415a_mcp-server-for-huggingface-datasets-discovery-activity-7400587711838842880-2K8p">
<img src="https://img.shields.io/badge/LinkedIn_Post-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn Post">
</a>
&nbsp;
<a href="https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp">
<img src="https://img.shields.io/badge/πŸ€—_Try_it_on_HF_Spaces-FFD21E?style=for-the-badge" alt="HF Space">
</a>
</p>
## MCP Client Configuration
Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.
**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
### With URL
```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
      "headers": {
        "hf-api-token": "<HF_TOKEN>"
      }
    }
  }
}
```
### With mcp-remote
```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
        "--transport",
        "streamable-http",
        "--header",
        "hf-api-token: <HF_TOKEN>"
      ]
    }
  }
}
```
## Available Tools
### `get_dataset_metadata`
Retrieve comprehensive metadata about a HuggingFace dataset.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | βœ… | - | HuggingFace dataset identifier (e.g., `imdb`, `squad`, `glue`) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
**Returns:** Dataset size, feature schema, split info, available configurations, download stats, tags, download size, description, and more.
---
### `get_dataset_sample`
Retrieve sample rows from a dataset for quick exploration.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | βœ… | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to sample from |
| `num_samples` | int | ❌ | `10` | Number of samples to retrieve (max: 10,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
| `streaming` | bool | ❌ | `True` | Use streaming mode for efficient loading |
**Returns:** Sample data rows with schema information and sampling metadata.
---
### `analyze_dataset_features`
Perform exploratory data analysis on dataset features with automatic optimization.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | βœ… | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to analyze |
| `sample_size` | int | ❌ | `1000` | Number of samples for analysis (max: 50,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
**Returns:** Feature types, statistics (mean, std, min, max for numerical), distributions, histograms, and missing value analysis. Supports numerical, categorical, text, image, and audio data types.
---
### `search_text_in_dataset`
Search for text in dataset columns using the Dataset Viewer API.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | βœ… | - | Full dataset identifier (e.g., `stanfordnlp/imdb`) |
| `config_name` | string | βœ… | - | Configuration name |
| `split` | string | βœ… | - | Split name |
| `query` | string | βœ… | - | Search query |
| `offset` | int | ❌ | `0` | Pagination offset |
| `length` | int | ❌ | `10` | Number of results to return |
**Returns:** Matching rows with highlighted search results. Only works on parquet datasets with text columns.
---
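You can also call these tools programmatically from any MCP client. Below is a minimal sketch using the official MCP Python SDK (`pip install mcp`); the tool names and the `hf-api-token` header are the ones documented above, while the exact structure of each tool's response is determined by the server.

```python
# Minimal sketch: list the tools and call get_dataset_metadata on the hosted server.
# Assumes `pip install mcp` and an HF_TOKEN environment variable.
import asyncio
import os

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

SERVER_URL = "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/"

async def main() -> None:
    headers = {"hf-api-token": os.environ["HF_TOKEN"]}
    async with streamablehttp_client(SERVER_URL, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])  # should include the four tools above
            result = await session.call_tool(
                "get_dataset_metadata", {"dataset_id": "stanfordnlp/imdb"}
            )
            for block in result.content:
                print(getattr(block, "text", block))

asyncio.run(main())
```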
## How It Works
### API Integrations
The server leverages multiple HuggingFace APIs:
| API | Used For |
|-----|----------|
| **[Hub API](https://huggingface.co/docs/huggingface_hub/guides/hf_api)** | Dataset metadata, repository info, download stats |
| **[Dataset Viewer API](https://huggingface.co/docs/dataset-viewer)** | Full dataset statistics, text search, parquet row access |
| **[datasets library](https://huggingface.co/docs/datasets)** | Streaming data loading, sample extraction |
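For reference, the first two of these can be exercised directly; a short sketch using `huggingface_hub` and plain HTTP against the Dataset Viewer API (these are the public APIs the server builds on, not its internal adapters; the `datasets` streaming path is sketched in the next subsection):

```python
# Sketch of the underlying public APIs (not the server's internal wrappers).
import os

import requests
from huggingface_hub import HfApi

token = os.environ.get("HF_TOKEN")

# Hub API: repository-level metadata (downloads, tags, card data, ...)
info = HfApi(token=token).dataset_info("stanfordnlp/imdb")
print(info.id, info.downloads, info.tags[:5])

# Dataset Viewer API: pre-computed per-column statistics for parquet datasets
resp = requests.get(
    "https://datasets-server.huggingface.co/statistics",
    params={"dataset": "stanfordnlp/imdb", "config": "plain_text", "split": "train"},
    headers={"Authorization": f"Bearer {token}"} if token else None,
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("num_examples"))
```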
### Data Loading Strategy
- **Streaming mode** (default): Uses `datasets.load_dataset(..., streaming=True)` to avoid downloading entire datasets. Samples are taken from an iterator, minimizing memory footprint (see the sketch after this list).
- **Statistics API**: For parquet datasets, `analyze_dataset_features` first attempts to fetch pre-computed statistics from the Dataset Viewer API (`/statistics` endpoint), providing full dataset coverage without sampling.
- **Fallback**: If statistics aren't available, analysis falls back to sample-based computation.
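The streaming path corresponds to the standard `datasets` pattern of slicing an iterable dataset; a simplified sketch (the actual tool adds validation, caching, and the fallbacks described above):

```python
# Simplified sketch of streaming-based sampling; the real tool layers caching,
# input validation, and error handling on top of this.
from itertools import islice

from datasets import load_dataset

def sample_rows(dataset_id: str, split: str = "train",
                num_samples: int = 10, config_name: str | None = None) -> list[dict]:
    ds = load_dataset(dataset_id, name=config_name, split=split, streaming=True)
    # islice pulls only the first num_samples rows from the iterator;
    # the full dataset is never downloaded.
    return list(islice(ds, num_samples))

rows = sample_rows("stanfordnlp/imdb", num_samples=5)
print(list(rows[0].keys()))
```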
### Caching
Results are cached locally to reduce API calls:
| Cache Type | TTL | Location |
|------------|-----|----------|
| Metadata | 1 hour | `~/.cache/hf_eda_mcp/metadata/` |
| Samples | 1 hour | `~/.cache/hf_eda_mcp/samples/` |
| Statistics | 1 hour | `~/.cache/hf_eda_mcp/statistics/` |
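An illustrative sketch of how such a TTL cache can work (this is not the server's actual implementation, just the general pattern of keying on-disk JSON files by request parameters and expiring them by modification time):

```python
# Illustrative TTL cache sketch (not the server's actual code).
import hashlib
import json
import time
from pathlib import Path

CACHE_ROOT = Path.home() / ".cache" / "hf_eda_mcp"
TTL_SECONDS = 3600  # 1 hour

def cache_path(kind: str, **params) -> Path:
    # Key the cache file on a stable hash of the request parameters.
    key = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
    return CACHE_ROOT / kind / f"{key}.json"

def cached_get(kind: str, compute, **params):
    path = cache_path(kind, **params)
    if path.exists() and time.time() - path.stat().st_mtime < TTL_SECONDS:
        return json.loads(path.read_text())
    result = compute(**params)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```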
### Parquet Requirements
Some features require datasets with `builder_name="parquet"` (whether a given dataset supports them can be checked up front, as sketched after this list):
- **Text search** (`search_text_in_dataset`): Only parquet datasets are searchable
- **Full statistics**: Pre-computed stats are only available for parquet datasets
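A hedged sketch of such a check using the Dataset Viewer `/is-valid` endpoint (field names follow the Dataset Viewer API documentation and may evolve):

```python
# Check whether search and pre-computed statistics are available for a dataset.
import requests

def viewer_capabilities(dataset_id: str) -> dict:
    resp = requests.get(
        "https://datasets-server.huggingface.co/is-valid",
        params={"dataset": dataset_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

caps = viewer_capabilities("stanfordnlp/imdb")
print(caps.get("search"), caps.get("statistics"))
```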
### Error Handling
- Automatic retry with exponential backoff for transient network errors (sketched after this list)
- Graceful fallback from statistics API to sample-based analysis
- Descriptive error messages with suggestions for common issues
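A minimal sketch of the retry pattern (illustrative only; the server's actual attempt counts, delays, and retried exception types may differ):

```python
# Illustrative retry-with-exponential-backoff decorator (not the exact server code).
import functools
import time

def retry(max_attempts: int = 3, base_delay: float = 1.0,
          exceptions: tuple = (ConnectionError, TimeoutError)):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        return wrapper
    return decorator

@retry()
def fetch_metadata(dataset_id: str):
    ...  # network call that may fail transiently
```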
## Project Structure
```
src/hf_eda_mcp/
β”œβ”€β”€ server.py             # Gradio app with MCP server setup
β”œβ”€β”€ config.py             # Server configuration (env vars, defaults)
β”œβ”€β”€ validation.py         # Input validation for all tools
β”œβ”€β”€ error_handling.py     # Retry logic, error formatting
β”œβ”€β”€ tools/                # MCP tools (exposed via Gradio)
β”‚   β”œβ”€β”€ metadata.py       # get_dataset_metadata
β”‚   β”œβ”€β”€ sampling.py       # get_dataset_sample
β”‚   β”œβ”€β”€ analysis.py       # analyze_dataset_features
β”‚   └── search.py         # search_text_in_dataset
β”œβ”€β”€ services/             # Business logic layer
β”‚   └── dataset_service.py          # Caching, data loading, statistics
└── integrations/
    β”œβ”€β”€ dataset_viewer_adapter.py   # Dataset Viewer API client
    └── hf_client.py                # HuggingFace Hub API wrapper (HfApi)
```
## Local Development
### Setup
```bash
# Install pdm (Homebrew shown; `pipx install pdm` also works)
brew install pdm
# Clone the repository
git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
cd hf-eda-mcp
# Install dependencies
pdm install
# Set your HuggingFace token
export HF_TOKEN=hf_xxx
# or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)
# Run the server
pdm run hf-eda-mcp
```
The server starts at `http://localhost:7860`, with the MCP endpoint at `/gradio_api/mcp/`.
## License
Apache License 2.0