KhalilGuetari committed
Commit 21bc165 · 1 Parent(s): 64e67e1

update readme
.vscode/settings.json DELETED
@@ -1,3 +0,0 @@
- {
-   "kiroAgent.configureMCP": "Enabled",
- }
README.md CHANGED
@@ -17,7 +17,7 @@ tags:

# 📊 HuggingFace EDA MCP Server

- > 🎉 Submission for the [HuggingFace 1st Birthday Hackathon](https://huggingface.co/spaces/huggingface/hf-1st-birthday-hackathon)
+ > 🎉 Submission for the [HuggingFace 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)

An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.

@@ -32,6 +32,61 @@ Whether you're a ML engineer, data scientist, or researcher, dataset exploration
- Ask your AI assistant to build reports and visualizations
- **Content search**: Find specific examples in datasets using text search

+ <p align="center">
+   <a href="https://www.youtube.com/watch?v=XdP7zGSb81k">
+     <img src="https://img.shields.io/badge/▶️_Demo_Video-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Demo Video">
+   </a>
+   &nbsp;
+   <a href="https://www.linkedin.com/posts/khalil-guetari-00a61415a_mcp-server-for-huggingface-datasets-discovery-activity-7400587711838842880-2K8p">
+     <img src="https://img.shields.io/badge/LinkedIn_Post-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn Post">
+   </a>
+   &nbsp;
+   <a href="https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp">
+     <img src="https://img.shields.io/badge/🤗_Try_it_on_HF_Spaces-FFD21E?style=for-the-badge" alt="HF Space">
+   </a>
+ </p>
+
+ ## MCP Client Configuration
+
+ Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.
+
+ **Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
+
+ ### With URL
+
+ ```json
+ {
+   "mcpServers": {
+     "hf-eda-mcp": {
+       "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
+       "headers": {
+         "hf-api-token": "<HF_TOKEN>"
+       }
+     }
+   }
+ }
+ ```
+
+ ### With mcp-remote
+
+ ```json
+ {
+   "mcpServers": {
+     "hf-eda-mcp": {
+       "command": "npx",
+       "args": [
+         "mcp-remote",
+         "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
+         "--transport",
+         "streamable-http",
+         "--header",
+         "hf-api-token: <HF_TOKEN>"
+       ]
+     }
+   }
+ }
+ ```
+
## Available Tools

### `get_dataset_metadata`
@@ -135,46 +190,6 @@ Some features require datasets with `builder_name="parquet"`:
- Graceful fallback from statistics API to sample-based analysis
- Descriptive error messages with suggestions for common issues

- ## MCP Client Configuration
-
- Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.
-
- **Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
-
- ### With URL
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
-       "headers": {
-         "hf-api-token": "<HF_TOKEN>"
-       }
-     }
-   }
- }
- ```
-
- ### With mcp-remote
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "npx",
-       "args": [
-         "mcp-remote",
-         "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
-         "--transport",
-         "streamable-http",
-         "--header",
-         "hf-api-token: <HF_TOKEN>"
-       ]
-     }
-   }
- }
- ```

## Project Structure

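The JSON configs in the updated README cover IDE-style clients; for scripting against the hosted endpoint directly, here is a minimal sketch using the streamable-HTTP client from the official `mcp` Python SDK (the import path is an assumption about that SDK, not code from this repo), passing the same `hf-api-token` header as the configs above:

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client  # assumed SDK import path

ENDPOINT = "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/"

async def main() -> None:
    # Same header the README's client configs use; only needed for private/gated datasets.
    headers = {"hf-api-token": "<HF_TOKEN>"}
    async with streamablehttp_client(ENDPOINT, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```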
docs/CONFIGURATION.md DELETED
@@ -1,104 +0,0 @@
- # Configuration Guide
-
- The HF EDA MCP Server uses a centralized configuration system that supports both environment variables and command-line arguments.
-
- ## Configuration Module
-
- The configuration is managed by the `src/hf_eda_mcp/config.py` module, which provides:
-
- - `ServerConfig` dataclass with all configuration options
- - Environment variable loading with `ServerConfig.from_env()`
- - Global configuration management with `get_config()` and `set_config()`
- - Logging setup and validation utilities
-
- ## Configuration Options
-
- ### Server Settings
- - `HF_EDA_PORT` (default: 7860) - Server port
- - `HF_EDA_HOST` (default: 127.0.0.1) - Server host
- - `HF_EDA_MCP_ENABLED` (default: true) - Enable MCP server functionality
- - `HF_EDA_SHARE` (default: false) - Enable public sharing via Gradio
-
- ### Authentication
- - `HF_TOKEN` - HuggingFace access token for private datasets
-
- ### Logging
- - `HF_EDA_LOG_LEVEL` (default: INFO) - Logging level (DEBUG, INFO, WARNING, ERROR)
-
- ### Performance and Caching
- - `HF_EDA_CACHE_DIR` - Directory for caching datasets (optional)
- - `HF_EDA_MAX_CACHE_SIZE` (default: 1000) - Maximum cache size in MB
- - `HF_EDA_MAX_SAMPLE_SIZE` (default: 50000) - Maximum sample size for tools
- - `HF_EDA_MAX_CONCURRENT` (default: 10) - Maximum concurrent requests
- - `HF_EDA_REQUEST_TIMEOUT` (default: 300) - Request timeout in seconds
-
- ## How Configuration is Used
-
- ### Server Startup
- The server loads configuration from environment variables and applies command-line overrides:
-
- ```python
- from hf_eda_mcp.config import ServerConfig
- from hf_eda_mcp.server import launch_server
-
- config = ServerConfig.from_env()
- launch_server(config)
- ```
-
- ### Tools Integration
- All EDA tools (metadata, sampling, analysis) use the global configuration:
-
- ```python
- from hf_eda_mcp.config import get_config
-
- config = get_config()
- # Tools respect config.max_sample_size, config.cache_dir, config.hf_token
- ```
-
- ### Dataset Service
- The `DatasetService` is initialized with configuration values:
-
- ```python
- service = DatasetService(
-     cache_dir=config.cache_dir,
-     token=config.hf_token
- )
- ```
-
- ## Configuration Priority
-
- 1. Command-line arguments (highest priority)
- 2. Environment variables
- 3. Default values (lowest priority)
-
- ## Example Usage
-
- ### Environment Variables
- ```bash
- export HF_TOKEN="your_token_here"
- export HF_EDA_CACHE_DIR="/tmp/hf-cache"
- export HF_EDA_MAX_SAMPLE_SIZE=25000
- pdm run hf-eda-mcp
- ```
-
- ### Command Line
- ```bash
- pdm run hf-eda-mcp --cache-dir /tmp/cache --max-sample-size 25000 --verbose
- ```
-
- ### Configuration File
- Copy `config.example.env` to `.env` and modify as needed, then load with:
- ```bash
- source .env
- pdm run hf-eda-mcp
- ```
-
- ## Validation
-
- The configuration system includes validation for:
- - Port ranges (1024-65535)
- - Cache directory permissions
- - Sample size limits
- - Timeout values
-
- Invalid configurations will cause the server to exit with helpful error messages.
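To make the deleted guide's priority order concrete: here is a minimal, hypothetical sketch of what the env-var layer of `ServerConfig.from_env()` could look like. The field and variable names mirror the tables above, but the real class in `src/hf_eda_mcp/config.py` may differ.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class ServerConfig:
    # Dataclass defaults are the lowest-priority layer.
    port: int = 7860
    host: str = "127.0.0.1"
    log_level: str = "INFO"
    max_sample_size: int = 50000
    hf_token: Optional[str] = None
    cache_dir: Optional[str] = None

    @classmethod
    def from_env(cls) -> "ServerConfig":
        # Environment variables override the defaults (middle priority).
        return cls(
            port=int(os.environ.get("HF_EDA_PORT", "7860")),
            host=os.environ.get("HF_EDA_HOST", "127.0.0.1"),
            log_level=os.environ.get("HF_EDA_LOG_LEVEL", "INFO"),
            max_sample_size=int(os.environ.get("HF_EDA_MAX_SAMPLE_SIZE", "50000")),
            hf_token=os.environ.get("HF_TOKEN"),
            cache_dir=os.environ.get("HF_EDA_CACHE_DIR"),
        )

config = ServerConfig.from_env()
# Command-line flags (e.g. --port) would then be applied on top as the highest-priority layer.
```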
docs/MCP_USAGE.md DELETED
@@ -1,275 +0,0 @@
- # MCP Server Usage Guide
-
- ## Overview
-
- The HF EDA MCP Server provides four main tools for exploratory data analysis of HuggingFace datasets via the Model Context Protocol (MCP).
-
- ## Available MCP Tools
-
- The following 4 tools are automatically exposed by Gradio when `mcp_server=True`:
-
- ### 1. `get_dataset_metadata`
- Retrieve comprehensive metadata for a HuggingFace dataset.
-
- **Parameters:**
- - `dataset_id` (string): HuggingFace dataset identifier (e.g., 'imdb', 'squad')
- - `config_name` (string, optional): Configuration name for multi-config datasets
-
- **Returns:** JSON object with dataset metadata including size, features, splits, and configuration details.
-
- ### 2. `get_dataset_sample`
- Retrieve a sample of rows from a HuggingFace dataset.
-
- **Parameters:**
- - `dataset_id` (string): HuggingFace dataset identifier
- - `split` (string, default: 'train'): Dataset split to sample from
- - `num_samples` (number, default: 10): Number of samples to retrieve (max: 10000)
- - `config_name` (string, optional): Configuration name for multi-config datasets
-
- **Returns:** JSON object with sampled data and metadata.
-
- ### 3. `analyze_dataset_features`
- Perform exploratory analysis on dataset features with automatic optimization.
-
- **Parameters:**
- - `dataset_id` (string): HuggingFace dataset identifier
- - `split` (string, default: 'train'): Dataset split to analyze
- - `sample_size` (number, default: 1000): Number of samples for analysis (max: 50000, only used for fallback)
- - `config_name` (string, optional): Configuration name for multi-config datasets
-
- **Returns:** JSON object with comprehensive feature analysis including:
- - Feature types (numerical, categorical, text, image, audio)
- - Statistical measures (mean, median, std, histograms)
- - Missing value analysis
- - Unique value counts
- - Sample values
-
- **Analysis Methods:**
- - **Primary**: Uses HuggingFace Dataset Viewer API statistics when available (parquet datasets)
-   - Analyzes the full dataset without downloading data
-   - Provides complete statistics with histograms
-   - More efficient and accurate
- - **Fallback**: Sample-based analysis for non-parquet datasets
-   - Downloads and analyzes a sample of the dataset
-   - Computes statistics locally
-
- ### 4. `search_text_in_dataset`
- Search for text in text columns of a dataset using the Dataset Viewer API.
-
- **Parameters:**
- - `dataset_id` (string): HuggingFace dataset identifier
- - `config_name` (string): Configuration name (required for search)
- - `split` (string): Dataset split to search in
- - `query` (string): Search query text
- - `offset` (number, default: 0): Offset for pagination
- - `length` (number, default: 10): Number of results to return (max: 100)
-
- **Returns:** JSON object with search results including:
- - `features`: List of features from the dataset, including column names and data types
- - `rows`: List of matching rows with content from each column
- - `num_rows_total`: Total number of examples in the split
- - `num_rows_per_page`: Number of examples in the current page
- - `partial`: Whether the response is partial (true if the dataset is too large to search completely)
-
- **Limitations:**
- - Only text columns are searched
- - Only parquet datasets are supported (builder_name="parquet")
- - Search is performed by the Dataset Viewer API, not locally
-
- **Validation:**
- - The tool validates that the dataset is in parquet format before attempting search
- - The tool validates that the dataset has at least one text/string column
- - If validation fails, a descriptive error message is returned with suggestions
-
- ## MCP Client Configuration
-
- ### Using with Claude Desktop
-
- Add this configuration to your MCP settings:
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp-server": {
-       "command": "pdm",
-       "args": ["run", "hf-eda-mcp"],
-       "env": {
-         "HF_TOKEN": "your_huggingface_token_here"
-       }
-     }
-   }
- }
- ```
-
- ### Using with Hosted Server
-
- If the server is running on a remote host:
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp-server": {
-       "url": "https://your-server.com/gradio_api/mcp/sse",
-       "headers": {
-         "hf-api-token": "your_huggingface_token_here"
-       }
-     }
-   }
- }
- ```
-
- ## Starting the Server
-
- ### Local Development
- ```bash
- # Start with MCP server enabled (default)
- pdm run hf-eda-mcp
-
- # Start on custom port
- pdm run hf-eda-mcp --port 8080
-
- # Start with verbose logging
- pdm run hf-eda-mcp --verbose
-
- # Start without MCP server functionality
- pdm run hf-eda-mcp --no-mcp
-
- # Start with custom host (listen on all interfaces)
- pdm run hf-eda-mcp --host 0.0.0.0
-
- # Start with public sharing enabled
- pdm run hf-eda-mcp --share
-
- # Start with custom cache directory
- pdm run hf-eda-mcp --cache-dir /path/to/cache
-
- # Start with custom maximum sample size
- pdm run hf-eda-mcp --max-sample-size 100000
- ```
-
- ### Server Modes
-
- The server provides both a web interface and MCP server functionality in a single application. When MCP is enabled, Gradio automatically exposes the 4 EDA functions as MCP tools while still providing the web interface for direct interaction.
-
- ### Environment Variables
-
- The server supports comprehensive configuration via environment variables:
-
- #### Authentication
- - `HF_TOKEN`: HuggingFace access token for private datasets (optional)
-
- #### Server Configuration
- - `HF_EDA_PORT`: Server port (default: 7860)
- - `HF_EDA_HOST`: Server host (default: 127.0.0.1)
- - `HF_EDA_MCP_ENABLED`: Enable MCP server functionality (default: true)
- - `HF_EDA_SHARE`: Enable public sharing via Gradio (default: false)
-
- #### Logging Configuration
- - `HF_EDA_LOG_LEVEL`: Logging level - DEBUG, INFO, WARNING, ERROR (default: INFO)
-
- #### Performance and Caching
- - `HF_EDA_CACHE_DIR`: Directory for caching datasets (optional)
- - `HF_EDA_MAX_CACHE_SIZE`: Maximum cache size in MB (default: 1000)
- - `HF_EDA_MAX_SAMPLE_SIZE`: Maximum sample size for analysis (default: 50000)
- - `HF_EDA_MAX_CONCURRENT`: Maximum concurrent requests (default: 10)
- - `HF_EDA_REQUEST_TIMEOUT`: Request timeout in seconds (default: 300)
-
- ### Configuration Examples
-
- #### Production Configuration
- ```bash
- export HF_TOKEN="your_token_here"
- export HF_EDA_HOST="0.0.0.0"
- export HF_EDA_PORT="8080"
- export HF_EDA_LOG_LEVEL="WARNING"
- export HF_EDA_CACHE_DIR="/var/cache/hf-eda"
- export HF_EDA_MAX_CONCURRENT="20"
- pdm run hf-eda-mcp
- ```
-
- #### Development Configuration
- ```bash
- export HF_TOKEN="your_token_here"
- export HF_EDA_LOG_LEVEL="DEBUG"
- export HF_EDA_CACHE_DIR="./cache"
- pdm run hf-eda-mcp --verbose
- ```
-
- ## Dataset Viewer Statistics Integration
-
- The `analyze_dataset_features` tool automatically uses HuggingFace's Dataset Viewer API when available, providing significant benefits:
-
- ### Benefits
- - **Full Dataset Analysis**: Analyzes entire datasets instead of samples
- - **No Download Required**: Statistics are pre-computed by HuggingFace
- - **Richer Statistics**: Includes histograms, frequencies, and multi-modal support
- - **Better Performance**: Faster response times with caching
-
- ### Supported Datasets
- Statistics are available for datasets with `builder_name="parquet"`. The tool automatically:
- 1. Checks if Dataset Viewer statistics are available
- 2. Uses full dataset statistics when available
- 3. Falls back to sample-based analysis for other datasets
-
- ### Supported Data Types
- The analysis tool provides comprehensive statistics for multiple data types:
- - **Numerical** (int, float): min, max, mean, median, std, histograms
- - **Categorical** (class_label, string_label): frequencies, unique counts
- - **Boolean** (bool): True/False distributions
- - **Text** (string_text): character length statistics, histograms
- - **Image** (image): dimension statistics, histograms
- - **Audio** (audio): duration statistics (seconds), histograms
- - **List** (list): length statistics, histograms
-
- ### Response Indicators
- Check the `sample_info` field in the response:
- - `sampling_method: "dataset_viewer_api"` - Using full dataset statistics
- - `sampling_method: "sequential_head"` - Using sample-based analysis
- - `represents_full_dataset: true/false` - Whether analysis covers the full dataset
-
- ## Example Usage
-
- Once connected to an MCP client, you can use the tools like this:
-
- ```
- # Get metadata for the IMDB dataset
- Use the get_dataset_metadata tool with dataset_id="imdb"
-
- # Sample 5 rows from the training split
- Use the get_dataset_sample tool with dataset_id="imdb", split="train", num_samples=5
-
- # Analyze features of the GLUE dataset (CoLA configuration)
- Use the analyze_dataset_features tool with dataset_id="glue", config_name="cola", sample_size=500
-
- # Search for text in the IMDB dataset
- Use the search_text_in_dataset tool with dataset_id="imdb", config_name="plain_text", split="train", query="great movie", offset=0, length=10
-
- # Search for a specific term in the SQuAD dataset
- Use the search_text_in_dataset tool with dataset_id="squad", config_name="plain_text", split="train", query="president", offset=0, length=5
- ```
-
- ## API Endpoints
-
- When the server is running, you can also access the tools via HTTP API:
-
- - **MCP Schema**: `http://localhost:7860/gradio_api/mcp/schema`
- - **API Documentation**: `http://localhost:7860/?view=api`
- - **Web Interface**: `http://localhost:7860`
-
- ## Troubleshooting
-
- ### Authentication Issues
- - Ensure the `HF_TOKEN` environment variable is set for private datasets
- - Check that your HuggingFace token has appropriate permissions
-
- ### Dataset Not Found
- - Verify the dataset ID is correct and exists on HuggingFace Hub
- - Check if the dataset requires authentication
-
- ### Performance Issues
- - Reduce `sample_size` for large datasets
- - Use streaming mode (enabled by default) for better memory efficiency
-
- ### Search Tool Issues
- - **Dataset not in parquet format**: The search tool only works with parquet datasets. If you get a "DatasetNotParquetError", try using a different dataset or check if the dataset has a parquet configuration
- - **No text columns found**: The search tool requires at least one text/string column. If you get a "NoTextColumnsError", verify that the dataset has text columns by checking the dataset metadata first
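Because `search_text_in_dataset` delegates to the Dataset Viewer API, its behavior can be previewed without the MCP server at all. A minimal sketch against the public `datasets-server.huggingface.co/search` endpoint (the `rows`, `num_rows_total`, and `partial` fields correspond to the tool's documented return shape; the exact response layout is an assumption based on the Dataset Viewer docs):

```python
import os

import requests

token = os.environ.get("HF_TOKEN")  # only needed for private/gated datasets
resp = requests.get(
    "https://datasets-server.huggingface.co/search",
    params={
        "dataset": "stanfordnlp/imdb",
        "config": "plain_text",
        "split": "train",
        "query": "great movie",
        "offset": 0,
        "length": 10,
    },
    headers={"Authorization": f"Bearer {token}"} if token else {},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()
print(result["num_rows_total"], "matching rows; partial:", result["partial"])
for item in result["rows"]:
    print(item["row"]["text"][:80])  # 'text' is the IMDB text column
```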
docs/STATISTICS_ENDPOINT.md DELETED
@@ -1,427 +0,0 @@
- # Dataset Viewer Statistics Endpoint Integration
-
- ## Overview
-
- The HuggingFace Dataset Viewer API provides a `/statistics` endpoint that offers comprehensive statistics for datasets with `builder_name="parquet"`. This endpoint is significantly more efficient and complete than sample-based analysis.
-
- ## Key Benefits
-
- ### 1. Full Dataset Coverage
- - **Before**: Analysis based on samples (default 1,000 examples)
- - **After**: Statistics computed on the entire dataset (e.g., 25,000 examples for IMDB train split)
-
- ### 2. No Data Download Required
- - **Before**: Download and process samples from the dataset
- - **After**: Retrieve pre-computed statistics via API call
-
- ### 3. More Complete Statistics
- The endpoint provides detailed statistics for multiple modalities:
-
- #### Numerical Features (int, float)
- - **Basic statistics**: min, max, mean, median, std
- - **Missing values**: nan_count, nan_proportion
- - **Distribution**: histogram with bin_edges and hist counts
-
- Example response:
- ```json
- {
-   "column_type": "float",
-   "column_statistics": {
-     "nan_count": 0,
-     "nan_proportion": 0,
-     "min": 0,
-     "max": 2,
-     "mean": 1.67206,
-     "median": 1.8,
-     "std": 0.38714,
-     "histogram": {
-       "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
-       "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2]
-     }
-   }
- }
- ```
-
- #### Categorical Features (class_label, string_label)
- - **Unique values**: n_unique count
- - **Frequencies**: Complete frequency distribution for all categories
- - **Missing values**: nan_count, nan_proportion
- - **No label tracking**: no_label_count, no_label_proportion (for class_label)
-
- Example response:
- ```json
- {
-   "column_type": "class_label",
-   "column_statistics": {
-     "nan_count": 0,
-     "nan_proportion": 0,
-     "no_label_count": 0,
-     "no_label_proportion": 0,
-     "n_unique": 2,
-     "frequencies": {
-       "unacceptable": 2528,
-       "acceptable": 6023
-     }
-   }
- }
- ```
-
- #### Text Features (string_text)
- - **Length statistics**: min, max, mean, median, std (character count)
- - **Missing values**: nan_count, nan_proportion
- - **Distribution**: histogram of text lengths
-
- Example response:
- ```json
- {
-   "column_type": "string_text",
-   "column_statistics": {
-     "nan_count": 0,
-     "nan_proportion": 0,
-     "min": 6,
-     "max": 231,
-     "mean": 40.70074,
-     "median": 37,
-     "std": 19.14431,
-     "histogram": {
-       "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
-       "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]
-     }
-   }
- }
- ```
-
- #### Boolean Features (bool)
- - **Frequencies**: Distribution of True/False values
- - **Missing values**: nan_count, nan_proportion
-
- Example response:
- ```json
- {
-   "column_type": "bool",
-   "column_statistics": {
-     "nan_count": 3,
-     "nan_proportion": 0.15,
-     "frequencies": {
-       "False": 7,
-       "True": 10
-     }
-   }
- }
- ```
-
- #### Image Features (image)
- - **Dimension statistics**: min, max, mean, median, std (for width/height)
- - **Missing values**: nan_count, nan_proportion
- - **Distribution**: histogram of image dimensions
-
- Example response:
- ```json
- {
-   "column_type": "image",
-   "column_statistics": {
-     "nan_count": 0,
-     "nan_proportion": 0.0,
-     "min": 256,
-     "max": 873,
-     "mean": 327.99339,
-     "median": 341.0,
-     "std": 60.07286,
-     "histogram": {
-       "hist": [1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2],
-       "bin_edges": [256, 318, 380, 442, 504, ...]
-     }
-   }
- }
- ```
-
- #### Audio Features (audio)
- - **Duration statistics**: min, max, mean, median, std (in seconds)
- - **Missing values**: nan_count, nan_proportion
- - **Distribution**: histogram of audio durations
-
- Example response:
- ```json
- {
-   "column_type": "audio",
-   "column_statistics": {
-     "nan_count": 0,
-     "nan_proportion": 0,
-     "min": 1.02,
-     "max": 15,
-     "mean": 13.93042,
-     "median": 14.77,
-     "std": 2.63734,
-     "histogram": {
-       "hist": [32, 25, 18, 24, 22, 17, 18, 19, 55, 1770],
-       "bin_edges": [1.02, 2.418, 3.816, 5.214, 6.612, ...]
-     }
-   }
- }
- ```
-
- #### List Features (list)
- - **Length statistics**: min, max, mean, median, std (list length)
- - **Missing values**: nan_count, nan_proportion
- - **Distribution**: histogram of list lengths
-
- Example response:
- ```json
- {
-   "column_type": "list",
-   "column_statistics": {
-     "nan_count": 0,
-     "nan_proportion": 0.0,
-     "min": 1,
-     "max": 3,
-     "mean": 1.01741,
-     "median": 1.0,
-     "std": 0.13146,
-     "histogram": {
-       "hist": [11177, 196, 1],
-       "bin_edges": [1, 2, 3, 3]
-     }
-   }
- }
- ```
-
- ## Implementation
-
- ### Architecture
-
- ```
- analyze_dataset_features()
-     ↓
- Try: get_dataset_statistics() [Dataset Viewer API]
-     ↓
- If available (parquet format):
-     → Use full dataset statistics
-     → Cache results
-     → Return converted analysis
-
- If not available:
-     → Fall back to sample-based analysis
-     → Load samples via streaming
-     → Compute statistics locally
- ```
-
- ### Key Components
-
- #### 1. DatasetViewerAdapter
- - `get_dataset_statistics()`: Fetch statistics from API
- - `check_statistics_availability()`: Check if statistics are available for a dataset
-
- #### 2. DatasetService
- - `get_dataset_statistics()`: Wrapper with caching and error handling
- - Automatic fallback to sample-based analysis
- - Statistics cache directory: `cache/statistics/`
-
- #### 3. Analysis Tool
- - `_convert_viewer_statistics_to_analysis()`: Convert API format to our analysis format
- - Seamless integration with existing analysis pipeline
-
- ### Caching Strategy
-
- Statistics are cached with the same TTL as other metadata (default: 1 hour):
-
- ```
- cache/
- ├── metadata/      # Dataset metadata
- ├── samples/       # Sample data
- └── statistics/    # Dataset Viewer statistics
-     └── {dataset}_{config}_{split}_stats.json
- ```
-
- ## Usage Examples
-
- ### Automatic Selection
-
- ```python
- from hf_eda_mcp.tools.analysis import analyze_dataset_features
-
- # Automatically uses Dataset Viewer statistics if available
- result = analyze_dataset_features(
-     dataset_id="stanfordnlp/imdb",
-     split="train"
- )
-
- # Check which method was used
- print(result['sample_info']['sampling_method'])
- # Output: "dataset_viewer_api" or "sequential_head"
-
- print(result['sample_info']['represents_full_dataset'])
- # Output: True (full dataset) or False (sample)
- ```
-
- ### Check Availability
-
- ```python
- from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter
-
- adapter = DatasetViewerAdapter(token="your_token")
- availability = adapter.check_statistics_availability("stanfordnlp/imdb")
-
- print(availability)
- # {
- #   'available': True,
- #   'configs': ['plain_text'],
- #   'reason': 'Statistics available for 1 config(s)'
- # }
- ```
-
- ### Direct Statistics Access
-
- ```python
- from hf_eda_mcp.services.dataset_service import DatasetService
-
- service = DatasetService(token="your_token")
- stats = service.get_dataset_statistics(
-     dataset_id="stanfordnlp/imdb",
-     split="train",
-     config_name="plain_text"
- )
-
- if stats:
-     print(f"Full dataset: {stats['num_examples']} examples")
-     print(f"Columns: {len(stats['statistics'])}")
- else:
-     print("Statistics not available, use sample-based analysis")
- ```
-
- ## Comparison: Before vs After
-
- ### IMDB Dataset Example
-
- #### Before (Sample-based)
- ```python
- {
-     'dataset_info': {
-         'sample_size_used': 1000,
-         'sample_size_requested': 1000,
-     },
-     'sample_info': {
-         'sampling_method': 'sequential_head',
-         'represents_full_dataset': True,  # Only if sample >= requested
-     },
-     'features': {
-         'text': {
-             'feature_type': 'text',
-             'statistics': {
-                 'count': 1000,
-                 'avg_length': 1311.289,
-                 'min_length': 65,
-                 'max_length': 6103,
-                 # Limited to sample
-             }
-         }
-     },
-     'summary': 'Analyzed 2 features from 1000 samples | Types: 1 categorical, 1 text'
- }
- ```
-
- #### After (Dataset Viewer)
- ```python
- {
-     'dataset_info': {
-         'sample_size_used': 25000,  # Full dataset
-         'sample_size_requested': 25000,
-     },
-     'sample_info': {
-         'sampling_method': 'dataset_viewer_api',
-         'represents_full_dataset': True,  # Always true
-         'partial': False
-     },
-     'features': {
-         'text': {
-             'feature_type': 'text',
-             'statistics': {
-                 'count': 25000,  # Full dataset
-                 'mean_length': 1325.06964,
-                 'min_length': 52,
-                 'max_length': 13704,
-                 'histogram': {
-                     'bin_edges': [52, 1418, 2784, ...],
-                     'hist': [17426, 5384, 1490, ...]
-                 }
-             }
-         }
-     },
-     'summary': 'Analyzed 2 features from 25000 samples | Types: 1 categorical, 1 text'
- }
- ```
-
- ## Supported Data Types
-
- The Dataset Viewer statistics endpoint supports comprehensive analysis for multiple data types:
-
- | Data Type | Feature Type | Statistics Provided |
- |-----------|--------------|---------------------|
- | `int`, `float` | numerical | min, max, mean, median, std, histogram |
- | `class_label`, `string_label` | categorical | frequencies, n_unique, no_label tracking |
- | `bool` | boolean | True/False frequencies |
- | `string_text` | text | character length stats (min, max, mean, median, std), histogram |
- | `image` | image | dimension statistics, histogram |
- | `audio` | audio | duration statistics (seconds), histogram |
- | `list` | list | length statistics, histogram |
-
- ### Data Type Mapping
-
- Our analysis tool automatically maps Dataset Viewer types to our internal types:
-
- ```
- Dataset Viewer Type → Our Feature Type
- ─────────────────────────────────────
- int, float   → numerical
- class_label  → categorical
- string_label → categorical
- bool         → boolean
- string_text  → text
- image        → image
- audio        → audio
- list         → list
- ```
-
- ## Limitations
-
- ### Dataset Requirements
- - Only works for datasets with `builder_name="parquet"`
- - Not all datasets on HuggingFace Hub have this format
- - Automatic fallback to sample-based analysis for other formats
-
- ### API Availability
- - Requires internet connection
- - Subject to HuggingFace API rate limits
- - May fail for private datasets without proper authentication
-
- ## Error Handling
-
- The implementation includes robust error handling:
-
- 1. **Check availability first**: Verify dataset supports statistics
- 2. **Graceful fallback**: Automatically use sample-based analysis if unavailable
- 3. **Caching**: Reduce API calls and improve performance
- 4. **Logging**: Clear messages about which method is being used
-
- ## Performance Impact
-
- ### API Call Overhead
- - Initial call: ~1-2 seconds
- - Cached calls: <10ms
- - No data download required
-
- ### Sample-based Analysis
- - Download time: Varies by dataset size
- - Processing time: ~1-5 seconds for 1000 samples
- - Network bandwidth: Depends on sample size
-
- ## Future Enhancements
-
- 1. **Parallel requests**: Fetch statistics for multiple splits simultaneously
- 2. **Partial statistics**: Support datasets with partial statistics
- 3. **Custom aggregations**: Add more statistical measures
- 4. **Visualization**: Generate plots from histogram data
-
- ## References
-
- - [HuggingFace Dataset Viewer Documentation](https://huggingface.co/docs/dataset-viewer/info)
- - [Statistics Endpoint Specification](https://huggingface.co/docs/dataset-viewer/statistics)
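The endpoint documented above can also be exercised directly, which doubles as a preview of item 4 in the future enhancements (plotting histograms). A minimal sketch against the public Dataset Viewer `/statistics` endpoint, with response field names assumed to follow the examples above:

```python
import requests
import matplotlib.pyplot as plt

resp = requests.get(
    "https://datasets-server.huggingface.co/statistics",
    params={"dataset": "stanfordnlp/imdb", "config": "plain_text", "split": "train"},
    timeout=30,
)
resp.raise_for_status()
stats = resp.json()
print(f"Full dataset: {stats['num_examples']} examples")

# Plot the character-length histogram of the first text column.
for col in stats["statistics"]:
    if col["column_type"] == "string_text":
        hist = col["column_statistics"]["histogram"]
        edges, counts = hist["bin_edges"], hist["hist"]
        # bin_edges has one more entry than hist: use left edges plus per-bin widths.
        widths = [b - a for a, b in zip(edges, edges[1:])]
        plt.bar(edges[:-1], counts, width=widths, align="edge")
        plt.xlabel(f"{col['column_name']} length (characters)")
        plt.ylabel("count")
        plt.show()
        break
```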
docs/deployment/DEPLOYMENT.md DELETED
@@ -1,300 +0,0 @@
- # Deployment Guide
-
- This guide covers different deployment options for the hf-eda-mcp server.
-
- ## Table of Contents
-
- - [Local Development](#local-development)
- - [Docker Deployment](#docker-deployment)
- - [HuggingFace Spaces](#huggingface-spaces)
- - [Production Considerations](#production-considerations)
-
- ---
-
- ## Local Development
-
- ### Prerequisites
-
- - Python 3.13+
- - PDM (Python package manager)
- - HuggingFace account (optional, for private datasets)
-
- ### Setup
-
- 1. Clone the repository:
- ```bash
- git clone https://github.com/your-username/hf-eda-mcp.git
- cd hf-eda-mcp
- ```
-
- 2. Install dependencies:
- ```bash
- pdm install
- ```
-
- 3. Configure environment variables:
- ```bash
- cp config.example.env .env
- # Edit .env and add your HF_TOKEN if needed
- ```
-
- 4. Run the server:
- ```bash
- pdm run hf-eda-mcp
- ```
-
- The server will start on `http://localhost:7860` with MCP enabled.
-
- ---
-
- ## Docker Deployment
-
- ### Build the Image
-
- ```bash
- docker build -t hf-eda-mcp:latest .
- ```
-
- ### Run with Docker
-
- ```bash
- docker run -d \
-   --name hf-eda-mcp-server \
-   -p 7860:7860 \
-   -e HF_TOKEN=your_token_here \
-   -v hf-cache:/app/cache \
-   hf-eda-mcp:latest
- ```
-
- ### Run with Docker Compose
-
- 1. Create a `.env` file with your configuration:
- ```bash
- HF_TOKEN=your_token_here
- ```
-
- 2. Start the service:
- ```bash
- docker-compose up -d
- ```
-
- 3. View logs:
- ```bash
- docker-compose logs -f
- ```
-
- 4. Stop the service:
- ```bash
- docker-compose down
- ```
-
- ### Docker Configuration Options
-
- Environment variables you can set:
-
- - `HF_TOKEN`: HuggingFace API token
- - `GRADIO_SERVER_NAME`: Server host (default: `0.0.0.0`)
- - `GRADIO_SERVER_PORT`: Server port (default: `7860`)
- - `HF_HOME`: Cache directory for HuggingFace
- - `MCP_SERVER_ENABLED`: Enable MCP server (default: `true`)
-
- ---
-
- ## HuggingFace Spaces
-
- ### Deployment Steps
-
- 1. **Create a new Space**:
-    - Go to https://huggingface.co/spaces
-    - Click "Create new Space"
-    - Choose "Gradio" as the SDK
-    - Select SDK version 5.49.1 or higher
-
- 2. **Upload files**:
- ```bash
- # Copy files to Spaces directory
- cp -r src/ spaces/
- cp README.md LICENSE spaces/
-
- # Initialize git in spaces directory
- cd spaces
- git init
- git remote add origin https://huggingface.co/spaces/YOUR-USERNAME/hf-eda-mcp
- ```
-
- 3. **Configure the Space**:
-    - Copy `spaces/README.md` as the Space's README
-    - Ensure `spaces/app.py` is set as the app file
-    - Add `spaces/requirements.txt` for dependencies
-
- 4. **Set secrets** (for private datasets):
-    - Go to Space settings
-    - Add `HF_TOKEN` as a secret
-
- 5. **Deploy**:
- ```bash
- git add .
- git commit -m "Initial deployment"
- git push origin main
- ```
-
- ### Space Configuration
-
- The Space will automatically:
- - Install dependencies from `requirements.txt`
- - Run `app.py` as the entry point
- - Expose the MCP server at `/gradio_api/mcp/sse`
-
- ### Accessing the Space
-
- Your MCP server will be available at:
- ```
- https://YOUR-USERNAME-hf-eda-mcp.hf.space/gradio_api/mcp/sse
- ```
-
- ---
-
- ## Production Considerations
-
- ### Security
-
- 1. **Authentication**:
-    - Use environment variables for sensitive data
-    - Never commit tokens to version control
-    - Rotate tokens regularly
-
- 2. **Access Control**:
-    - Consider implementing rate limiting
-    - Use HTTPS for all connections
-    - Validate all input parameters
-
- 3. **Secrets Management**:
-    - Use Docker secrets or environment files
-    - For Spaces, use the built-in secrets feature
-    - Consider using a secrets manager (AWS Secrets Manager, HashiCorp Vault)
-
- ### Performance
-
- 1. **Caching**:
-    - Enable persistent cache volumes
-    - Configure appropriate cache sizes
-    - Monitor cache hit rates
-
- 2. **Resource Limits**:
-    - Set memory limits in Docker
-    - Configure appropriate timeouts
-    - Monitor CPU and memory usage
-
- 3. **Scaling**:
-    - Use load balancers for multiple instances
-    - Consider horizontal scaling for high traffic
-    - Monitor response times and adjust resources
-
- ### Monitoring
-
- 1. **Logging**:
-    - Configure structured logging
-    - Use log aggregation tools (ELK, Splunk)
-    - Monitor error rates
-
- 2. **Metrics**:
-    - Track request counts and latencies
-    - Monitor cache performance
-    - Set up alerts for errors
-
- 3. **Health Checks**:
-    - Implement health check endpoints
-    - Configure container health checks
-    - Set up uptime monitoring
-
- ### Backup and Recovery
-
- 1. **Data Backup**:
-    - Backup cache volumes regularly
-    - Document configuration settings
-    - Version control all code
-
- 2. **Disaster Recovery**:
-    - Document deployment procedures
-    - Test recovery processes
-    - Maintain rollback capabilities
-
- ---
-
- ## Deployment Checklist
-
- ### Pre-Deployment
-
- - [ ] All tests passing
- - [ ] Dependencies up to date
- - [ ] Security scan completed
- - [ ] Documentation updated
- - [ ] Environment variables configured
- - [ ] Secrets properly managed
-
- ### Deployment
-
- - [ ] Build successful
- - [ ] Health checks passing
- - [ ] MCP endpoints accessible
- - [ ] Tools functioning correctly
- - [ ] Logs being collected
- - [ ] Monitoring configured
-
- ### Post-Deployment
-
- - [ ] Verify all tools work
- - [ ] Check performance metrics
- - [ ] Monitor error rates
- - [ ] Test with MCP clients
- - [ ] Document any issues
- - [ ] Update runbooks
-
- ---
-
- ## Troubleshooting
-
- ### Common Issues
-
- 1. **Server won't start**:
-    - Check Python version (3.13+ required)
-    - Verify all dependencies installed
-    - Check port availability
-    - Review logs for errors
-
- 2. **MCP connection fails**:
-    - Verify server is running
-    - Check firewall settings
-    - Confirm correct URL/port
-    - Test with curl or browser
-
- 3. **Dataset access errors**:
-    - Verify HF_TOKEN is set
-    - Check token permissions
-    - Confirm dataset exists
-    - Test with public dataset first
-
- 4. **Performance issues**:
-    - Check cache configuration
-    - Monitor resource usage
-    - Reduce sample sizes
-    - Enable caching
-
- ### Getting Help
-
- - Check logs: `docker logs hf-eda-mcp-server`
- - Review documentation: See `MCP_USAGE.md`
- - Open an issue: GitHub repository
- - Community support: HuggingFace forums
-
- ---
-
- ## Next Steps
-
- After deployment:
-
- 1. Configure MCP clients (see `deployment/mcp-client-examples.md`)
- 2. Test all tools with various datasets
- 3. Set up monitoring and alerts
- 4. Document any custom configurations
- 5. Share your Space with the community!
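For the health checks recommended in the deleted guide above, one simple liveness probe is to poll the MCP schema endpoint the server already exposes (documented in `MCP_USAGE.md`). A minimal sketch whose exit code could back a Docker `HEALTHCHECK`:

```python
import sys

import requests

URL = "http://localhost:7860/gradio_api/mcp/schema"  # documented MCP schema endpoint

def healthy(url: str = URL, timeout: float = 5.0) -> bool:
    # The schema endpoint only answers once Gradio and the MCP server are up.
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    sys.exit(0 if healthy() else 1)
```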
docs/deployment/QUICKSTART.md DELETED
@@ -1,148 +0,0 @@
- # Quick Start Guide
-
- Get hf-eda-mcp running in minutes!
-
- ## Choose Your Deployment Method
-
- ### 🚀 Option 1: Local Development (Fastest)
-
- ```bash
- # Install dependencies
- pdm install
-
- # Set up environment (optional for public datasets)
- cp config.example.env .env
- # Edit .env and add HF_TOKEN if needed
-
- # Run the server
- pdm run hf-eda-mcp
- ```
-
- Server runs at: `http://localhost:7860`
-
- ---
-
- ### 🐳 Option 2: Docker (Recommended for Production)
-
- ```bash
- # Build the image
- docker build -t hf-eda-mcp:latest .
-
- # Run the container
- docker run -d \
-   --name hf-eda-mcp-server \
-   -p 7860:7860 \
-   -e HF_TOKEN=your_token_here \
-   hf-eda-mcp:latest
- ```
-
- Or use Docker Compose:
-
- ```bash
- # Create .env file with HF_TOKEN
- echo "HF_TOKEN=your_token_here" > .env
-
- # Start the service
- docker-compose up -d
- ```
-
- Server runs at: `http://localhost:7860`
-
- ---
-
- ### ☁️ Option 3: HuggingFace Spaces (Easiest for Sharing)
-
- 1. Create a new Gradio Space on HuggingFace
- 2. Copy files from `spaces/` directory to your Space
- 3. Set `HF_TOKEN` as a secret in Space settings (if needed)
- 4. Push to deploy
-
- Your server will be at: `https://YOUR-USERNAME-hf-eda-mcp.hf.space`
-
- ---
-
- ## Connect an MCP Client
-
- ### Kiro IDE
-
- Add to `.kiro/settings/mcp.json`:
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "pdm",
-       "args": ["run", "hf-eda-mcp"],
-       "disabled": false
-     }
-   }
- }
- ```
-
- ### Claude Desktop
-
- Add to `claude_desktop_config.json`:
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "python",
-       "args": ["-m", "hf_eda_mcp"],
-       "env": {
-         "PYTHONPATH": "/path/to/hf-eda-mcp/src"
-       }
-     }
-   }
- }
- ```
-
- ---
-
- ## Test the Server
-
- ### Using the Web Interface
-
- 1. Open `http://localhost:7860` in your browser
- 2. Try the tools with a sample dataset like "squad"
-
- ### Using an MCP Client
-
- Ask your AI assistant:
-
- ```
- "Get metadata for the squad dataset"
- "Show me 5 samples from the train split of squad"
- "Analyze the features of the squad dataset"
- ```
-
- ---
-
- ## Common Issues
-
- **Server won't start?**
- - Check Python version: `python --version` (need 3.13+)
- - Install dependencies: `pdm install`
-
- **Can't access private datasets?**
- - Set `HF_TOKEN` in your `.env` file
- - Get token from: https://huggingface.co/settings/tokens
-
- **Port 7860 already in use?**
- - Change port: `GRADIO_SERVER_PORT=8080 pdm run hf-eda-mcp`
-
- ---
-
- ## Next Steps
-
- - 📖 Read the full [Deployment Guide](DEPLOYMENT.md)
- - 🔧 See [MCP Client Examples](mcp-client-examples.md)
- - 📚 Check [MCP Usage Documentation](../MCP_USAGE.md)
-
- ---
-
- ## Need Help?
-
- - Check logs: `docker logs hf-eda-mcp-server` (Docker)
- - Review documentation in `docs/`
- - Open an issue on GitHub
docs/deployment/mcp-client-examples.md DELETED
@@ -1,295 +0,0 @@
- # MCP Client Configuration Examples
-
- This document provides configuration examples for connecting various MCP clients to the hf-eda-mcp server.
-
- ## Table of Contents
-
- - [Kiro IDE](#kiro-ide)
- - [Claude Desktop](#claude-desktop)
- - [Custom MCP Client](#custom-mcp-client)
- - [Environment Variables](#environment-variables)
-
- ---
-
- ## Kiro IDE
-
- ### Workspace Configuration
-
- Create or edit `.kiro/settings/mcp.json` in your workspace:
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "docker",
-       "args": [
-         "run",
-         "--rm",
-         "-i",
-         "-p", "7860:7860",
-         "--env-file", ".env",
-         "hf-eda-mcp:latest"
-       ],
-       "env": {
-         "HF_TOKEN": "${HF_TOKEN}"
-       },
-       "disabled": false,
-       "autoApprove": [
-         "get_dataset_metadata",
-         "get_dataset_sample",
-         "analyze_dataset_features"
-       ]
-     }
-   }
- }
- ```
-
- ### User-Level Configuration
-
- Edit `~/.kiro/settings/mcp.json` for global configuration:
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "pdm",
-       "args": ["run", "hf-eda-mcp"],
-       "env": {
-         "HF_TOKEN": "your_token_here"
-       },
-       "disabled": false,
-       "autoApprove": []
-     }
-   }
- }
- ```
-
- ### Using HuggingFace Spaces
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "url": "https://your-username-hf-eda-mcp.hf.space/gradio_api/mcp/sse",
-       "disabled": false,
-       "autoApprove": ["get_dataset_metadata"]
-     }
-   }
- }
- ```
-
- ---
-
- ## Claude Desktop
-
- ### Configuration File Location
-
- - **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
- - **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
- - **Linux**: `~/.config/Claude/claude_desktop_config.json`
-
- ### Local Server Configuration
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "python",
-       "args": ["-m", "hf_eda_mcp"],
-       "env": {
-         "HF_TOKEN": "your_token_here",
-         "PYTHONPATH": "/path/to/hf-eda-mcp/src"
-       }
-     }
-   }
- }
- ```
-
- ### Docker Configuration
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "docker",
-       "args": [
-         "run",
-         "--rm",
-         "-i",
-         "-p", "7860:7860",
-         "-e", "HF_TOKEN=your_token_here",
-         "hf-eda-mcp:latest"
-       ]
-     }
-   }
- }
- ```
-
- ### HuggingFace Spaces Configuration
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "url": "https://your-username-hf-eda-mcp.hf.space/gradio_api/mcp/sse"
-     }
-   }
- }
- ```
-
- ---
-
- ## Custom MCP Client
-
- ### Python Client Example
-
- ```python
- import asyncio
- from mcp import ClientSession, StdioServerParameters
- from mcp.client.stdio import stdio_client
-
- async def main():
-     # Connect to local server
-     server_params = StdioServerParameters(
-         command="python",
-         args=["-m", "hf_eda_mcp"],
-         env={"HF_TOKEN": "your_token_here"}
-     )
-
-     async with stdio_client(server_params) as (read, write):
-         async with ClientSession(read, write) as session:
-             # Initialize the connection
-             await session.initialize()
-
-             # List available tools
-             tools = await session.list_tools()
-             print("Available tools:", tools)
-
-             # Call a tool
-             result = await session.call_tool(
-                 "get_dataset_metadata",
-                 arguments={"dataset_id": "squad"}
-             )
-             print("Result:", result)
-
- if __name__ == "__main__":
-     asyncio.run(main())
- ```
-
- ### JavaScript/TypeScript Client Example
-
- ```typescript
- import { Client } from "@modelcontextprotocol/sdk/client/index.js";
- import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";
-
- async function main() {
-   const transport = new StdioClientTransport({
-     command: "python",
-     args: ["-m", "hf_eda_mcp"],
-     env: {
-       HF_TOKEN: process.env.HF_TOKEN
-     }
-   });
-
-   const client = new Client({
-     name: "hf-eda-client",
-     version: "1.0.0"
-   }, {
-     capabilities: {}
-   });
-
-   await client.connect(transport);
-
-   // List tools
-   const tools = await client.listTools();
-   console.log("Available tools:", tools);
-
-   // Call a tool
-   const result = await client.callTool({
-     name: "get_dataset_metadata",
-     arguments: {
-       dataset_id: "squad"
-     }
-   });
-   console.log("Result:", result);
-
-   await client.close();
- }
-
- main().catch(console.error);
- ```
-
- ---
-
- ## Environment Variables
-
- ### Required Variables
-
- - `HF_TOKEN`: HuggingFace API token (optional for public datasets, required for private datasets)
-
- ### Optional Variables
-
- - `HF_HOME`: Directory for HuggingFace cache (default: `~/.cache/huggingface`)
- - `HF_DATASETS_CACHE`: Directory for datasets cache
- - `TRANSFORMERS_CACHE`: Directory for transformers cache
- - `GRADIO_SERVER_NAME`: Server host (default: `0.0.0.0`)
- - `GRADIO_SERVER_PORT`: Server port (default: `7860`)
- - `MCP_SERVER_ENABLED`: Enable MCP server (default: `true`)
-
- ### Example .env File
-
- ```bash
- # HuggingFace Authentication
- HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
-
- # Cache Configuration
- HF_HOME=/path/to/cache
- HF_DATASETS_CACHE=/path/to/cache/datasets
- TRANSFORMERS_CACHE=/path/to/cache/transformers
-
- # Server Configuration
- GRADIO_SERVER_NAME=0.0.0.0
- GRADIO_SERVER_PORT=7860
- MCP_SERVER_ENABLED=true
- ```
-
- ---
-
- ## Deployment Options Comparison
-
- | Option | Pros | Cons | Best For |
- |--------|------|------|----------|
- | **Local (PDM)** | Fast, easy debugging | Requires Python setup | Development |
- | **Docker** | Isolated, reproducible | Requires Docker | Production, CI/CD |
- | **HF Spaces** | Hosted, no maintenance | Limited control | Public sharing |
-
- ---
-
- ## Troubleshooting
-
- ### Connection Issues
-
- 1. **Server not starting**: Check logs for errors, verify dependencies installed
- 2. **Authentication failed**: Verify `HF_TOKEN` is set correctly
- 3. **Port already in use**: Change `GRADIO_SERVER_PORT` to a different port
-
- ### Tool Execution Issues
-
- 1. **Dataset not found**: Verify dataset ID is correct on HuggingFace Hub
- 2. **Permission denied**: Ensure `HF_TOKEN` has access to private datasets
- 3. **Timeout errors**: Increase timeout settings or use smaller sample sizes
-
- ### Docker Issues
-
- 1. **Image build fails**: Ensure all dependencies in `pyproject.toml` are compatible
- 2. **Container exits immediately**: Check logs with `docker logs hf-eda-mcp-server`
- 3. **Cache not persisting**: Verify volume mounts in `docker-compose.yml`
-
- ---
-
- ## Additional Resources
-
- - [MCP Protocol Documentation](https://modelcontextprotocol.io/)
- - [Gradio MCP Integration](https://www.gradio.app/guides/gradio-and-mcp)
- - [HuggingFace Hub Documentation](https://huggingface.co/docs/hub/index)
- - [Project Repository](https://github.com/your-username/hf-eda-mcp)
- - [Project Repository](https://github.com/your-username/hf-eda-mcp)