---
title: HuggingFace EDA MCP Server
short_description: MCP server to explore and analyze HuggingFace datasets
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.0
app_file: src/app.py
pinned: false
license: apache-2.0
app_port: 7860
tags:
  - building-mcp-track-enterprise
  - building-mcp-track-consumer
---
# π HuggingFace EDA MCP Server

> Submission for the [Gradio MCP 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)
An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.

Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of your workflow. This server automates the tedious parts (fetching metadata, sampling data, computing statistics) so you can focus on what matters: finding and understanding the right data for your task.
**Use cases:**

- **Dataset discovery**:
  - Inspect metadata, schemas, and samples to evaluate datasets before use
  - Combine it with the HuggingFace MCP `search_dataset` tool for even more powerful dataset discovery
- **Exploratory data analysis**:
  - Analyze feature distributions, detect missing values, and review statistics
  - Ask your AI assistant to build reports and visualizations
- **Content search**: Find specific examples in datasets using text search
<p align="center">
  <a href="https://www.youtube.com/watch?v=XdP7zGSb81k">
    <img src="https://img.shields.io/badge/▶️_Demo_Video-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Demo Video">
  </a>
  <a href="https://www.linkedin.com/posts/khalil-guetari-00a61415a_mcp-server-for-huggingface-datasets-discovery-activity-7400587711838842880-2K8p">
    <img src="https://img.shields.io/badge/LinkedIn_Post-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn Post">
  </a>
  <a href="https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp">
    <img src="https://img.shields.io/badge/🤗_Try_it_on_HF_Spaces-FFD21E?style=for-the-badge" alt="HF Space">
  </a>
</p>
## MCP Client Configuration

Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.

**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
### With URL

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
      "headers": {
        "hf-api-token": "<HF_TOKEN>"
      }
    }
  }
}
```
### With mcp-remote

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
        "--transport",
        "streamable-http",
        "--header",
        "hf-api-token: <HF_TOKEN>"
      ]
    }
  }
}
```
## Available Tools

### `get_dataset_metadata`

Retrieve comprehensive metadata about a HuggingFace dataset.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | Yes | - | HuggingFace dataset identifier (e.g., `imdb`, `squad`, `glue`) |
| `config_name` | string | No | `None` | Configuration name for multi-config datasets |

**Returns:** Dataset size, features schema, splits info, configurations, download stats, tags, download size, description, and more.
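As a rough illustration of what this tool does with a Hub record, here is a hypothetical sketch of the summarisation step. The field names (`id`, `downloads`, `tags`, `cardData`) mirror Hub API dataset records, but the tool's actual output shape may differ:

```python
from typing import Any

def summarize_metadata(info: dict[str, Any]) -> dict[str, Any]:
    """Condense a Hub dataset-info record into a compact summary (illustrative)."""
    return {
        "dataset_id": info.get("id"),
        "downloads": info.get("downloads", 0),
        "tags": info.get("tags", []),
        "configs": [c["config_name"] for c in info.get("cardData", {}).get("configs", [])],
    }

# Stub record for demonstration, not fetched from the Hub:
stub = {
    "id": "imdb",
    "downloads": 12345,
    "tags": ["text-classification"],
    "cardData": {"configs": [{"config_name": "plain_text"}]},
}
print(summarize_metadata(stub))
```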
---

### `get_dataset_sample`

Retrieve sample rows from a dataset for quick exploration.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | Yes | - | HuggingFace dataset identifier |
| `split` | string | No | `train` | Dataset split to sample from |
| `num_samples` | int | No | `10` | Number of samples to retrieve (max: 10,000) |
| `config_name` | string | No | `None` | Configuration name for multi-config datasets |
| `streaming` | bool | No | `True` | Use streaming mode for efficient loading |

**Returns:** Sample data rows with schema information and sampling metadata.
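The streaming default means samples are drawn from an iterator without downloading the full dataset. A minimal sketch of that idea, assuming the `datasets` library (the `take_samples` helper is illustrative, not part of the server):

```python
from itertools import islice

def take_samples(rows, num_samples=10):
    """Pull the first `num_samples` rows from any iterable without materializing it."""
    return list(islice(rows, num_samples))

# With a real dataset this would look like:
#   from datasets import load_dataset
#   ds = load_dataset("imdb", split="train", streaming=True)
#   samples = take_samples(ds, 10)
print(take_samples(range(100), 3))  # → [0, 1, 2]
```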
---

### `analyze_dataset_features`

Perform exploratory data analysis on dataset features with automatic optimization.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | Yes | - | HuggingFace dataset identifier |
| `split` | string | No | `train` | Dataset split to analyze |
| `sample_size` | int | No | `1000` | Number of samples for analysis (max: 50,000) |
| `config_name` | string | No | `None` | Configuration name for multi-config datasets |

**Returns:** Feature types, statistics (mean, std, min, max for numerical features), distributions, histograms, and missing-value analysis. Supports numerical, categorical, text, image, and audio data types.
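For a numeric feature, the sample-based path boils down to summary statistics plus a missing-value count. An illustrative sketch (the helper name and exact output keys are assumptions, not the tool's real schema):

```python
import statistics

def numeric_summary(values):
    """Mean/std/min/max plus missing-value count for one numeric feature."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "std": statistics.stdev(present) if len(present) > 1 else 0.0,
        "min": min(present),
        "max": max(present),
    }

print(numeric_summary([1.0, 2.0, None, 3.0]))
```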
---

### `search_text_in_dataset`

Search for text in dataset columns using the Dataset Viewer API.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | Yes | - | Full dataset identifier (e.g., `stanfordnlp/imdb`) |
| `config_name` | string | Yes | - | Configuration name |
| `split` | string | Yes | - | Split name |
| `query` | string | Yes | - | Search query |
| `offset` | int | No | `0` | Pagination offset |
| `length` | int | No | `10` | Number of results to return |

**Returns:** Matching rows with highlighted search results. Only works on parquet datasets with text columns.
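Under the hood this maps onto the Dataset Viewer `/search` endpoint. A sketch of how such a request could be built (the helper is illustrative; the server's actual client code may differ):

```python
from urllib.parse import urlencode

BASE = "https://datasets-server.huggingface.co/search"

def build_search_url(dataset_id, config_name, split, query, offset=0, length=10):
    """Assemble a Dataset Viewer /search request URL."""
    params = {"dataset": dataset_id, "config": config_name, "split": split,
              "query": query, "offset": offset, "length": length}
    return f"{BASE}?{urlencode(params)}"

url = build_search_url("stanfordnlp/imdb", "plain_text", "train", "great movie")
print(url)
# An authenticated GET on this URL (Authorization: Bearer <HF_TOKEN>) returns matching rows.
```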
---

## How It Works

### API Integrations

The server leverages multiple HuggingFace APIs:

| API | Used For |
|-----|----------|
| **[Hub API](https://huggingface.co/docs/huggingface_hub/guides/hf_api)** | Dataset metadata, repository info, download stats |
| **[Dataset Viewer API](https://huggingface.co/docs/dataset-viewer)** | Full dataset statistics, text search, parquet row access |
| **[datasets library](https://huggingface.co/docs/datasets)** | Streaming data loading, sample extraction |

### Data Loading Strategy

- **Streaming mode** (default): Uses `datasets.load_dataset(..., streaming=True)` to avoid downloading entire datasets. Samples are taken from an iterator, minimizing memory footprint.
- **Statistics API**: For parquet datasets, `analyze_dataset_features` first attempts to fetch pre-computed statistics from the Dataset Viewer API (`/statistics` endpoint), providing full dataset coverage without sampling.
- **Fallback**: If statistics aren't available, analysis falls back to sample-based computation.
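The statistics-then-fallback flow can be sketched as follows (the function names are illustrative stubs, not the server's real API):

```python
def get_feature_stats(dataset_id, fetch_precomputed, compute_from_sample):
    """Prefer full-coverage precomputed stats; fall back to sample-based ones."""
    try:
        return {"source": "dataset-viewer", "stats": fetch_precomputed(dataset_id)}
    except Exception:
        return {"source": "sample", "stats": compute_from_sample(dataset_id)}

# Stub demonstration: the precomputed path fails, so we fall back.
def fail(_):
    raise RuntimeError("statistics not available")

def sampled(_):
    return {"rows_analyzed": 1000}

print(get_feature_stats("imdb", fail, sampled))
```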
### Caching

Results are cached locally to reduce API calls:

| Cache Type | TTL | Location |
|------------|-----|----------|
| Metadata | 1 hour | `~/.cache/hf_eda_mcp/metadata/` |
| Samples | 1 hour | `~/.cache/hf_eda_mcp/samples/` |
| Statistics | 1 hour | `~/.cache/hf_eda_mcp/statistics/` |
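A minimal sketch of such a file-based TTL cache, assuming SHA-256-keyed filenames and mtime-based freshness (both assumptions, not confirmed implementation details):

```python
import hashlib
import json
import time
from pathlib import Path

# Assumed defaults mirroring the table above; the real layout may differ.
CACHE_DIR = Path.home() / ".cache" / "hf_eda_mcp" / "metadata"
TTL_SECONDS = 3600

def cached_get(key, fetch, cache_dir=CACHE_DIR, ttl=TTL_SECONDS):
    """Return a cached JSON result if fresh, otherwise call `fetch` and store it."""
    path = Path(cache_dir) / (hashlib.sha256(key.encode()).hexdigest() + ".json")
    if path.exists() and time.time() - path.stat().st_mtime < ttl:
        return json.loads(path.read_text())  # fresh cache hit
    result = fetch()                          # miss or stale entry: refetch
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```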
### Parquet Requirements

Some features require datasets with `builder_name="parquet"`:

- **Text search** (`search_text_in_dataset`): Only parquet datasets are searchable
- **Full statistics**: Pre-computed stats are only available for parquet datasets

### Error Handling

- Automatic retry with exponential backoff for transient network errors
- Graceful fallback from statistics API to sample-based analysis
- Descriptive error messages with suggestions for common issues
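The retry behaviour can be sketched like this (attempt counts and delays are illustrative, not the server's actual settings):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Retry `fn` on transient errors, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```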
## Project Structure

```
src/hf_eda_mcp/
├── server.py              # Gradio app with MCP server setup
├── config.py              # Server configuration (env vars, defaults)
├── validation.py          # Input validation for all tools
├── error_handling.py      # Retry logic, error formatting
├── tools/                 # MCP tools (exposed via Gradio)
│   ├── metadata.py        # get_dataset_metadata
│   ├── sampling.py        # get_dataset_sample
│   ├── analysis.py        # analyze_dataset_features
│   └── search.py          # search_text_in_dataset
├── services/              # Business logic layer
│   └── dataset_service.py # Caching, data loading, statistics
└── integrations/
    ├── dataset_viewer_adapter.py # Dataset Viewer API client
    └── hf_client.py              # HuggingFace Hub API wrapper (HfApi)
```
## Local Development

### Setup

```bash
# Install pdm (e.g., via Homebrew; see the pdm docs for other install methods)
brew install pdm

# Clone the repository
git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
cd hf-eda-mcp

# Install dependencies
pdm install

# Set your HuggingFace token
export HF_TOKEN=hf_xxx
# or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)

# Run the server
pdm run hf-eda-mcp
```

The server starts at `http://localhost:7860`, with the MCP endpoint at `/gradio_api/mcp/`.
## License

Apache License 2.0