---
title: HuggingFace EDA MCP Server
short_description: MCP server to explore and analyze HuggingFace datasets
emoji: πŸ“Š
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.0
app_file: src/app.py
pinned: false
license: apache-2.0
app_port: 7860
tags:
- building-mcp-track-enterprise
- building-mcp-track-consumer
---
# πŸ“Š HuggingFace EDA MCP Server
> πŸŽ‰ Submission for the [Gradio MCP 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)
An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.
Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of the workflow. This server automates the tedious parts, such as fetching metadata, sampling data, and computing statistics, so you can focus on what matters: finding and understanding the right data for your task.
**Use cases:**
- **Dataset discovery**:
  - Inspect metadata, schemas, and samples to evaluate datasets before use
  - Use it in conjunction with the HuggingFace MCP `search_dataset` tool for even more powerful dataset discovery
- **Exploratory data analysis**:
  - Analyze feature distributions, detect missing values, and review statistics
  - Ask your AI assistant to build reports and visualizations
- **Content search**: Find specific examples in datasets using text search
<p align="center">
<a href="https://www.youtube.com/watch?v=XdP7zGSb81k">
<img src="https://img.shields.io/badge/▢️_Demo_Video-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Demo Video">
</a>
&nbsp;
<a href="https://www.linkedin.com/posts/khalil-guetari-00a61415a_mcp-server-for-huggingface-datasets-discovery-activity-7400587711838842880-2K8p">
<img src="https://img.shields.io/badge/LinkedIn_Post-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn Post">
</a>
&nbsp;
<a href="https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp">
<img src="https://img.shields.io/badge/πŸ€—_Try_it_on_HF_Spaces-FFD21E?style=for-the-badge" alt="HF Space">
</a>
</p>
## MCP Client Configuration
Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.
**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
### With URL
```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
      "headers": {
        "hf-api-token": "<HF_TOKEN>"
      }
    }
  }
}
```
### With mcp-remote
```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
        "--transport",
        "streamable-http",
        "--header",
        "hf-api-token: <HF_TOKEN>"
      ]
    }
  }
}
```
## Available Tools
### `get_dataset_metadata`
Retrieve comprehensive metadata about a HuggingFace dataset.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | βœ… | - | HuggingFace dataset identifier (e.g., `imdb`, `squad`, `glue`) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
**Returns:** Dataset size, feature schema, split info, available configurations, download stats, tags, download size, description, and more.
---
### `get_dataset_sample`
Retrieve sample rows from a dataset for quick exploration.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | βœ… | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to sample from |
| `num_samples` | int | ❌ | `10` | Number of samples to retrieve (max: 10,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
| `streaming` | bool | ❌ | `True` | Use streaming mode for efficient loading |
**Returns:** Sample data rows with schema information and sampling metadata.
---
### `analyze_dataset_features`
Perform exploratory data analysis on dataset features with automatic optimization.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | βœ… | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to analyze |
| `sample_size` | int | ❌ | `1000` | Number of samples for analysis (max: 50,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
**Returns:** Feature types, statistics (mean, std, min, max for numerical), distributions, histograms, and missing value analysis. Supports numerical, categorical, text, image, and audio data types.
---
### `search_text_in_dataset`
Search for text in dataset columns using the Dataset Viewer API.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | βœ… | - | Full dataset identifier (e.g., `stanfordnlp/imdb`) |
| `config_name` | string | βœ… | - | Configuration name |
| `split` | string | βœ… | - | Split name |
| `query` | string | βœ… | - | Search query |
| `offset` | int | ❌ | `0` | Pagination offset |
| `length` | int | ❌ | `10` | Number of results to return |
**Returns:** Matching rows with highlighted search results. Only works on parquet datasets with text columns.
---
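You can also call these tools programmatically from any MCP client. Below is a minimal sketch using the official MCP Python SDK (`pip install mcp`); the tool names and the `hf-api-token` header are the ones documented above, while the exact structure of each tool's response is determined by the server.

```python
# Minimal sketch: list the tools and call get_dataset_metadata on the hosted server.
# Assumes `pip install mcp` and an HF_TOKEN environment variable.
import asyncio
import os

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

SERVER_URL = "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/"

async def main() -> None:
    headers = {"hf-api-token": os.environ["HF_TOKEN"]}
    async with streamablehttp_client(SERVER_URL, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])  # should include the four tools above
            result = await session.call_tool(
                "get_dataset_metadata", {"dataset_id": "stanfordnlp/imdb"}
            )
            for block in result.content:
                print(getattr(block, "text", block))

asyncio.run(main())
```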
## How It Works
### API Integrations
The server leverages multiple HuggingFace APIs:
| API | Used For |
|-----|----------|
| **[Hub API](https://huggingface.co/docs/huggingface_hub/guides/hf_api)** | Dataset metadata, repository info, download stats |
| **[Dataset Viewer API](https://huggingface.co/docs/dataset-viewer)** | Full dataset statistics, text search, parquet row access |
| **[datasets library](https://huggingface.co/docs/datasets)** | Streaming data loading, sample extraction |
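For reference, the first two of these can be exercised directly; a short sketch using `huggingface_hub` and plain HTTP against the Dataset Viewer API (these are the public APIs the server builds on, not its internal adapters; the `datasets` streaming path is sketched in the next subsection):

```python
# Sketch of the underlying public APIs (not the server's internal wrappers).
import os

import requests
from huggingface_hub import HfApi

token = os.environ.get("HF_TOKEN")

# Hub API: repository-level metadata (downloads, tags, card data, ...)
info = HfApi(token=token).dataset_info("stanfordnlp/imdb")
print(info.id, info.downloads, info.tags[:5])

# Dataset Viewer API: pre-computed per-column statistics for parquet datasets
resp = requests.get(
    "https://datasets-server.huggingface.co/statistics",
    params={"dataset": "stanfordnlp/imdb", "config": "plain_text", "split": "train"},
    headers={"Authorization": f"Bearer {token}"} if token else None,
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("num_examples"))
```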
### Data Loading Strategy
- **Streaming mode** (default): Uses `datasets.load_dataset(..., streaming=True)` to avoid downloading entire datasets. Samples are taken from an iterator, minimizing memory footprint (see the sketch after this list).
- **Statistics API**: For parquet datasets, `analyze_dataset_features` first attempts to fetch pre-computed statistics from the Dataset Viewer API (`/statistics` endpoint), providing full dataset coverage without sampling.
- **Fallback**: If statistics aren't available, analysis falls back to sample-based computation.
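The streaming path corresponds to the standard `datasets` pattern of slicing an iterable dataset; a simplified sketch (the actual tool adds validation, caching, and the fallbacks described above):

```python
# Simplified sketch of streaming-based sampling; the real tool layers caching,
# input validation, and error handling on top of this.
from itertools import islice

from datasets import load_dataset

def sample_rows(dataset_id: str, split: str = "train",
                num_samples: int = 10, config_name: str | None = None) -> list[dict]:
    ds = load_dataset(dataset_id, name=config_name, split=split, streaming=True)
    # islice pulls only the first num_samples rows from the iterator;
    # the full dataset is never downloaded.
    return list(islice(ds, num_samples))

rows = sample_rows("stanfordnlp/imdb", num_samples=5)
print(list(rows[0].keys()))
```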
### Caching
Results are cached locally to reduce API calls:
| Cache Type | TTL | Location |
|------------|-----|----------|
| Metadata | 1 hour | `~/.cache/hf_eda_mcp/metadata/` |
| Samples | 1 hour | `~/.cache/hf_eda_mcp/samples/` |
| Statistics | 1 hour | `~/.cache/hf_eda_mcp/statistics/` |
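An illustrative sketch of how such a TTL cache can work (this is not the server's actual implementation, just the general pattern of keying on-disk JSON files by request parameters and expiring them by modification time):

```python
# Illustrative TTL cache sketch (not the server's actual code).
import hashlib
import json
import time
from pathlib import Path

CACHE_ROOT = Path.home() / ".cache" / "hf_eda_mcp"
TTL_SECONDS = 3600  # 1 hour

def cache_path(kind: str, **params) -> Path:
    # Key the cache file on a stable hash of the request parameters.
    key = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
    return CACHE_ROOT / kind / f"{key}.json"

def cached_get(kind: str, compute, **params):
    path = cache_path(kind, **params)
    if path.exists() and time.time() - path.stat().st_mtime < TTL_SECONDS:
        return json.loads(path.read_text())
    result = compute(**params)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```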
### Parquet Requirements
Some features require datasets with `builder_name="parquet"` (whether a given dataset supports them can be checked up front, as sketched after this list):
- **Text search** (`search_text_in_dataset`): Only parquet datasets are searchable
- **Full statistics**: Pre-computed stats are only available for parquet datasets
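A hedged sketch of such a check using the Dataset Viewer `/is-valid` endpoint (field names follow the Dataset Viewer API documentation and may evolve):

```python
# Check whether search and pre-computed statistics are available for a dataset.
import requests

def viewer_capabilities(dataset_id: str) -> dict:
    resp = requests.get(
        "https://datasets-server.huggingface.co/is-valid",
        params={"dataset": dataset_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

caps = viewer_capabilities("stanfordnlp/imdb")
print(caps.get("search"), caps.get("statistics"))
```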
### Error Handling
- Automatic retry with exponential backoff for transient network errors (sketched after this list)
- Graceful fallback from statistics API to sample-based analysis
- Descriptive error messages with suggestions for common issues
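A minimal sketch of the retry pattern (illustrative only; the server's actual attempt counts, delays, and retried exception types may differ):

```python
# Illustrative retry-with-exponential-backoff decorator (not the exact server code).
import functools
import time

def retry(max_attempts: int = 3, base_delay: float = 1.0,
          exceptions: tuple = (ConnectionError, TimeoutError)):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        return wrapper
    return decorator

@retry()
def fetch_metadata(dataset_id: str):
    ...  # network call that may fail transiently
```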
## Project Structure
```
src/hf_eda_mcp/
β”œβ”€β”€ server.py             # Gradio app with MCP server setup
β”œβ”€β”€ config.py             # Server configuration (env vars, defaults)
β”œβ”€β”€ validation.py         # Input validation for all tools
β”œβ”€β”€ error_handling.py     # Retry logic, error formatting
β”œβ”€β”€ tools/                # MCP tools (exposed via Gradio)
β”‚   β”œβ”€β”€ metadata.py       # get_dataset_metadata
β”‚   β”œβ”€β”€ sampling.py       # get_dataset_sample
β”‚   β”œβ”€β”€ analysis.py       # analyze_dataset_features
β”‚   └── search.py         # search_text_in_dataset
β”œβ”€β”€ services/             # Business logic layer
β”‚   └── dataset_service.py          # Caching, data loading, statistics
└── integrations/
    β”œβ”€β”€ dataset_viewer_adapter.py   # Dataset Viewer API client
    └── hf_client.py                # HuggingFace Hub API wrapper (HfApi)
```
## Local Development
### Setup
```bash
# Install pdm (Homebrew shown; `pipx install pdm` also works)
brew install pdm
# Clone the repository
git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
cd hf-eda-mcp
# Install dependencies
pdm install
# Set your HuggingFace token
export HF_TOKEN=hf_xxx
# or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)
# Run the server
pdm run hf-eda-mcp
```
The server starts at `http://localhost:7860`, with the MCP endpoint at `/gradio_api/mcp/`.
## License
Apache License 2.0