---
title: HuggingFace EDA MCP Server
short_description: MCP server to explore and analyze HuggingFace datasets
emoji: 📊
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.0
app_file: src/app.py
pinned: false
license: apache-2.0
app_port: 7860
tags:
  - building-mcp-track-enterprise
  - building-mcp-track-consumer
---

📊 HuggingFace EDA MCP Server

🎉 Submission for the Gradio MCP 1st Birthday Hackathon

An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.

Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of your workflow. This server automates the tedious parts, such as fetching metadata, sampling data, and computing statistics, so you can focus on what matters: finding and understanding the right data for your task.

Use cases:

  • Dataset discovery:
    • Inspect metadata, schemas, and samples to evaluate datasets before use
    • Use it alongside the HuggingFace MCP search_dataset tool for even more powerful dataset discovery
  • Exploratory data analysis:
    • Analyze feature distributions, detect missing values, and review statistics
    • Ask your AI assistant to build reports and visualizations
  • Content search: Find specific examples in datasets using text search

Demo Video   LinkedIn Post   HF Space

MCP Client Configuration

Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.

Hosted endpoint: https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/

With URL

{
  "mcpServers": {
    "hf-eda-mcp": {
      "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
      "headers": {
        "hf-api-token": "<HF_TOKEN>"
      }
    }
  }
}

With mcp-remote

{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
        "--transport",
        "streamable-http",
        "--header",
        "hf-api-token: <HF_TOKEN>"
      ]
    }
  }
}
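
For a quick scripted check, the same connection can also be made from Python. The sketch below assumes the official MCP Python SDK (the `mcp` package) and a valid `HF_TOKEN` in the environment; it simply lists the tools exposed by the hosted server.

```python
import asyncio
import os

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

SERVER_URL = "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/"

async def main() -> None:
    # The hf-api-token header is forwarded to the server, as in the configs above.
    headers = {"hf-api-token": os.environ["HF_TOKEN"]}
    async with streamablehttp_client(SERVER_URL, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```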

Available Tools

get_dataset_metadata

Retrieve comprehensive metadata about a HuggingFace dataset.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_id | string | ✅ | - | HuggingFace dataset identifier (e.g., imdb, squad, glue) |
| config_name | string | ❌ | None | Configuration name for multi-config datasets |

Returns: Dataset size, features schema, splits info, configurations, download stats, tags, download size, description and more.


get_dataset_sample

Retrieve sample rows from a dataset for quick exploration.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_id | string | ✅ | - | HuggingFace dataset identifier |
| split | string | ❌ | train | Dataset split to sample from |
| num_samples | int | ❌ | 10 | Number of samples to retrieve (max: 10,000) |
| config_name | string | ❌ | None | Configuration name for multi-config datasets |
| streaming | bool | ❌ | True | Use streaming mode for efficient loading |

Returns: Sample data rows with schema information and sampling metadata.


analyze_dataset_features

Perform exploratory data analysis on dataset features with automatic optimization.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_id | string | ✅ | - | HuggingFace dataset identifier |
| split | string | ❌ | train | Dataset split to analyze |
| sample_size | int | ❌ | 1000 | Number of samples for analysis (max: 50,000) |
| config_name | string | ❌ | None | Configuration name for multi-config datasets |

Returns: Feature types, statistics (mean, std, min, max for numerical), distributions, histograms, and missing value analysis. Supports numerical, categorical, text, image, and audio data types.


search_text_in_dataset

Search for text in dataset columns using the Dataset Viewer API.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataset_id | string | ✅ | - | Full dataset identifier (e.g., stanfordnlp/imdb) |
| config_name | string | ✅ | - | Configuration name |
| split | string | ✅ | - | Split name |
| query | string | ✅ | - | Search query |
| offset | int | ❌ | 0 | Pagination offset |
| length | int | ❌ | 10 | Number of results to return |

Returns: Matching rows with highlighted search results. Only works on parquet datasets with text columns.
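
As a rough illustration of how these tools are invoked, the fragment below reuses the `session` from the connection sketch in the MCP Client Configuration section (so it only runs inside that `async with` block); argument names follow the parameter tables above, and the dataset, config, and query values are just examples.

```python
# Hypothetical calls, assuming an already-initialized MCP `session`
# from the connection sketch earlier in this README.
sample = await session.call_tool(
    "get_dataset_sample",
    {"dataset_id": "stanfordnlp/imdb", "split": "train", "num_samples": 5},
)

hits = await session.call_tool(
    "search_text_in_dataset",
    {
        "dataset_id": "stanfordnlp/imdb",
        "config_name": "plain_text",
        "split": "train",
        "query": "masterpiece",
        "length": 5,
    },
)
```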


How It Works

API Integrations

The server leverages multiple HuggingFace APIs:

| API | Used For |
| --- | --- |
| Hub API | Dataset metadata, repository info, download stats |
| Dataset Viewer API | Full dataset statistics, text search, parquet row access |
| datasets library | Streaming data loading, sample extraction |
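
For example, the Hub API row corresponds roughly to a `huggingface_hub` call like the one below (a sketch only; the server's own wrapper in `hf_client.py` may expose a different surface):

```python
from huggingface_hub import HfApi

# Repository-level metadata for a dataset: downloads, tags, files, card data, ...
info = HfApi().dataset_info("stanfordnlp/imdb")
print(info.id, info.downloads, info.tags[:5])
```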

Data Loading Strategy

  • Streaming mode (default): Uses datasets.load_dataset(..., streaming=True) to avoid downloading entire datasets. Samples are taken from an iterator, minimizing memory footprint.
  • Statistics API: For parquet datasets, analyze_dataset_features first attempts to fetch pre-computed statistics from the Dataset Viewer API (/statistics endpoint), providing full dataset coverage without sampling.
  • Fallback: If statistics aren't available, analysis falls back to sample-based computation (see the sketch below).
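
A condensed sketch of this strategy (not the server's actual code), using the public Dataset Viewer `/statistics` endpoint with a streamed sample as the fallback:

```python
from itertools import islice

import requests
from datasets import load_dataset

DATASET, CONFIG, SPLIT = "stanfordnlp/imdb", "plain_text", "train"

# 1) Try pre-computed, full-dataset statistics (available for parquet datasets).
resp = requests.get(
    "https://datasets-server.huggingface.co/statistics",
    params={"dataset": DATASET, "config": CONFIG, "split": SPLIT},
    timeout=30,
)
if resp.ok:
    stats = resp.json().get("statistics", [])
    print(f"pre-computed statistics for {len(stats)} columns")
else:
    # 2) Fall back to sample-based analysis over a streaming iterator.
    stream = load_dataset(DATASET, CONFIG, split=SPLIT, streaming=True)
    sample = list(islice(stream, 1000))
    print(f"computed statistics from {len(sample)} streamed rows")
```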

Caching

Results are cached locally to reduce API calls:

| Cache Type | TTL | Location |
| --- | --- | --- |
| Metadata | 1 hour | ~/.cache/hf_eda_mcp/metadata/ |
| Samples | 1 hour | ~/.cache/hf_eda_mcp/samples/ |
| Statistics | 1 hour | ~/.cache/hf_eda_mcp/statistics/ |
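
The caching layer lives in `dataset_service.py`; a minimal sketch of the file-based TTL idea (not the actual implementation) could look like this:

```python
import json
import time
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "hf_eda_mcp" / "metadata"
TTL_SECONDS = 3600  # 1 hour, matching the table above

def cached(key: str, fetch):
    """Return the cached JSON payload for `key` if still fresh, else refetch and store it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{key.replace('/', '__')}.json"
    if path.exists() and time.time() - path.stat().st_mtime < TTL_SECONDS:
        return json.loads(path.read_text())
    payload = fetch()
    path.write_text(json.dumps(payload))
    return payload
```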

Parquet Requirements

Some features require datasets with builder_name="parquet":

  • Text search (search_text_in_dataset): Only parquet datasets are searchable
  • Full statistics: Pre-computed stats are only available for parquet datasets

Error Handling

  • Automatic retry with exponential backoff for transient network errors (see the sketch below)
  • Graceful fallback from statistics API to sample-based analysis
  • Descriptive error messages with suggestions for common issues
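
A generic version of the retry-with-backoff pattern described above (a sketch; `error_handling.py` may differ in detail):

```python
import random
import time

import requests

def with_retries(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry `fn` on transient network errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (requests.ConnectionError, requests.Timeout):
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2**attempt + random.random())
```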

Project Structure

src/hf_eda_mcp/
├── server.py                 # Gradio app with MCP server setup
├── config.py                 # Server configuration (env vars, defaults)
├── validation.py             # Input validation for all tools
├── error_handling.py         # Retry logic, error formatting
├── tools/                    # MCP tools (exposed via Gradio)
│   ├── metadata.py           # get_dataset_metadata
│   ├── sampling.py           # get_dataset_sample
│   ├── analysis.py           # analyze_dataset_features
│   └── search.py             # search_text_in_dataset
├── services/                 # Business logic layer
│   └── dataset_service.py    # Caching, data loading, statistics
└── integrations/
    ├── dataset_viewer_adapter.py  # Dataset Viewer API client
    └── hf_client.py               # HuggingFace Hub API wrapper (HfApi)

Local Development

Setup

# Install pdm
brew install pdm

# Clone the repository
git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
cd hf-eda-mcp

# Install dependencies
pdm install

# Set your HuggingFace token
export HF_TOKEN=hf_xxx
# or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)

# Run the server
pdm run hf-eda-mcp

The server starts at http://localhost:7860 with the MCP endpoint at /gradio_api/mcp/.
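
To sanity-check a local run, you can hit the MCP schema endpoint that Gradio-based MCP servers expose next to the MCP route (the exact path below is assumed from the Gradio MCP docs and may vary across SDK versions):

```python
import requests

# Fetch the tool schema the local server exposes over MCP.
resp = requests.get("http://localhost:7860/gradio_api/mcp/schema", timeout=10)
resp.raise_for_status()
print(resp.text[:300])
```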

License

Apache License 2.0