KhalilGuetari committed
Commit 21bc165 · 1 Parent(s): 64e67e1

update readme
.vscode/settings.json DELETED
@@ -1,3 +0,0 @@
- {
-   "kiroAgent.configureMCP": "Enabled",
- }
README.md CHANGED
@@ -17,7 +17,7 @@ tags:

# 📊 HuggingFace EDA MCP Server

- > 🎉 Submission for the [HuggingFace 1st Birthday Hackathon](https://huggingface.co/spaces/huggingface/hf-1st-birthday-hackathon)
+ > 🎉 Submission for the [HuggingFace 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)

An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.

@@ -32,6 +32,61 @@ Whether you're a ML engineer, data scientist, or researcher, dataset exploration
- Ask your AI assistant to build reports and visualizations
- **Content search**: Find specific examples in datasets using text search

+ <p align="center">
+   <a href="https://www.youtube.com/watch?v=XdP7zGSb81k">
+     <img src="https://img.shields.io/badge/▶️_Demo_Video-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Demo Video">
+   </a>
+   &nbsp;
+   <a href="https://www.linkedin.com/posts/khalil-guetari-00a61415a_mcp-server-for-huggingface-datasets-discovery-activity-7400587711838842880-2K8p">
+     <img src="https://img.shields.io/badge/LinkedIn_Post-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn Post">
+   </a>
+   &nbsp;
+   <a href="https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp">
+     <img src="https://img.shields.io/badge/🤗_Try_it_on_HF_Spaces-FFD21E?style=for-the-badge" alt="HF Space">
+   </a>
+ </p>
+
+ ## MCP Client Configuration
+
+ Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.
+
+ **Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
+
+ ### With URL
+
+ ```json
+ {
+   "mcpServers": {
+     "hf-eda-mcp": {
+       "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
+       "headers": {
+         "hf-api-token": "<HF_TOKEN>"
+       }
+     }
+   }
+ }
+ ```
+
+ ### With mcp-remote
+
+ ```json
+ {
+   "mcpServers": {
+     "hf-eda-mcp": {
+       "command": "npx",
+       "args": [
+         "mcp-remote",
+         "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
+         "--transport",
+         "streamable-http",
+         "--header",
+         "hf-api-token: <HF_TOKEN>"
+       ]
+     }
+   }
+ }
+ ```
+
## Available Tools

### `get_dataset_metadata`
@@ -135,46 +190,6 @@ Some features require datasets with `builder_name="parquet"`:
- Graceful fallback from statistics API to sample-based analysis
- Descriptive error messages with suggestions for common issues

- ## MCP Client Configuration
-
- Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.
-
- **Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
-
- ### With URL
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
-       "headers": {
-         "hf-api-token": "<HF_TOKEN>"
-       }
-     }
-   }
- }
- ```
-
- ### With mcp-remote
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "npx",
-       "args": [
-         "mcp-remote",
-         "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
-         "--transport",
-         "streamable-http",
-         "--header",
-         "hf-api-token: <HF_TOKEN>"
-       ]
-     }
-   }
- }
- ```

## Project Structure

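The JSON configs in the updated README cover IDE-style clients; for scripting against the hosted endpoint directly, here is a minimal sketch using the streamable-HTTP client from the official `mcp` Python SDK (the import path is an assumption about that SDK, not code from this repo), passing the same `hf-api-token` header as the configs above:

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client  # assumed SDK import path

ENDPOINT = "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/"

async def main() -> None:
    # Same header the README's client configs use; only needed for private/gated datasets.
    headers = {"hf-api-token": "<HF_TOKEN>"}
    async with streamablehttp_client(ENDPOINT, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```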
docs/CONFIGURATION.md DELETED
@@ -1,104 +0,0 @@
- # Configuration Guide
-
- The HF EDA MCP Server uses a centralized configuration system that supports both environment variables and command-line arguments.
-
- ## Configuration Module
-
- The configuration is managed by the `src/hf_eda_mcp/config.py` module, which provides:
-
- - `ServerConfig` dataclass with all configuration options
- - Environment variable loading with `ServerConfig.from_env()`
- - Global configuration management with `get_config()` and `set_config()`
- - Logging setup and validation utilities
-
- ## Configuration Options
-
- ### Server Settings
- - `HF_EDA_PORT` (default: 7860) - Server port
- - `HF_EDA_HOST` (default: 127.0.0.1) - Server host
- - `HF_EDA_MCP_ENABLED` (default: true) - Enable MCP server functionality
- - `HF_EDA_SHARE` (default: false) - Enable public sharing via Gradio
-
- ### Authentication
- - `HF_TOKEN` - HuggingFace access token for private datasets
-
- ### Logging
- - `HF_EDA_LOG_LEVEL` (default: INFO) - Logging level (DEBUG, INFO, WARNING, ERROR)
-
- ### Performance and Caching
- - `HF_EDA_CACHE_DIR` - Directory for caching datasets (optional)
- - `HF_EDA_MAX_CACHE_SIZE` (default: 1000) - Maximum cache size in MB
- - `HF_EDA_MAX_SAMPLE_SIZE` (default: 50000) - Maximum sample size for tools
- - `HF_EDA_MAX_CONCURRENT` (default: 10) - Maximum concurrent requests
- - `HF_EDA_REQUEST_TIMEOUT` (default: 300) - Request timeout in seconds
-
- ## How Configuration is Used
-
- ### Server Startup
- The server loads configuration from environment variables and applies command-line overrides:
-
- ```python
- from hf_eda_mcp.config import ServerConfig
- from hf_eda_mcp.server import launch_server
-
- config = ServerConfig.from_env()
- launch_server(config)
- ```
-
- ### Tools Integration
- All EDA tools (metadata, sampling, analysis) use the global configuration:
-
- ```python
- from hf_eda_mcp.config import get_config
-
- config = get_config()
- # Tools respect config.max_sample_size, config.cache_dir, config.hf_token
- ```
-
- ### Dataset Service
- The `DatasetService` is initialized with configuration values:
-
- ```python
- service = DatasetService(
-     cache_dir=config.cache_dir,
-     token=config.hf_token
- )
- ```
-
- ## Configuration Priority
-
- 1. Command-line arguments (highest priority)
- 2. Environment variables
- 3. Default values (lowest priority)
-
- ## Example Usage
-
- ### Environment Variables
- ```bash
- export HF_TOKEN="your_token_here"
- export HF_EDA_CACHE_DIR="/tmp/hf-cache"
- export HF_EDA_MAX_SAMPLE_SIZE=25000
- pdm run hf-eda-mcp
- ```
-
- ### Command Line
- ```bash
- pdm run hf-eda-mcp --cache-dir /tmp/cache --max-sample-size 25000 --verbose
- ```
-
- ### Configuration File
- Copy `config.example.env` to `.env` and modify as needed, then load with:
- ```bash
- source .env
- pdm run hf-eda-mcp
- ```
-
- ## Validation
-
- The configuration system includes validation for:
- - Port ranges (1024-65535)
- - Cache directory permissions
- - Sample size limits
- - Timeout values
-
- Invalid configurations will cause the server to exit with helpful error messages.
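To make the deleted guide's priority order concrete: here is a minimal, hypothetical sketch of what the env-var layer of `ServerConfig.from_env()` could look like. The field and variable names mirror the tables above, but the real class in `src/hf_eda_mcp/config.py` may differ.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class ServerConfig:
    # Dataclass defaults are the lowest-priority layer.
    port: int = 7860
    host: str = "127.0.0.1"
    log_level: str = "INFO"
    max_sample_size: int = 50000
    hf_token: Optional[str] = None
    cache_dir: Optional[str] = None

    @classmethod
    def from_env(cls) -> "ServerConfig":
        # Environment variables override the defaults (middle priority).
        return cls(
            port=int(os.environ.get("HF_EDA_PORT", "7860")),
            host=os.environ.get("HF_EDA_HOST", "127.0.0.1"),
            log_level=os.environ.get("HF_EDA_LOG_LEVEL", "INFO"),
            max_sample_size=int(os.environ.get("HF_EDA_MAX_SAMPLE_SIZE", "50000")),
            hf_token=os.environ.get("HF_TOKEN"),
            cache_dir=os.environ.get("HF_EDA_CACHE_DIR"),
        )

config = ServerConfig.from_env()
# Command-line flags (e.g. --port) would then be applied on top as the highest-priority layer.
```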
docs/MCP_USAGE.md DELETED
@@ -1,275 +0,0 @@
- # MCP Server Usage Guide
-
- ## Overview
-
- The HF EDA MCP Server provides four main tools for exploratory data analysis of HuggingFace datasets via the Model Context Protocol (MCP).
-
- ## Available MCP Tools
-
- The following 4 tools are automatically exposed by Gradio when `mcp_server=True`:
-
- ### 1. `get_dataset_metadata`
- Retrieve comprehensive metadata for a HuggingFace dataset.
-
- **Parameters:**
- - `dataset_id` (string): HuggingFace dataset identifier (e.g., 'imdb', 'squad')
- - `config_name` (string, optional): Configuration name for multi-config datasets
-
- **Returns:** JSON object with dataset metadata including size, features, splits, and configuration details.
-
- ### 2. `get_dataset_sample`
- Retrieve a sample of rows from a HuggingFace dataset.
-
- **Parameters:**
- - `dataset_id` (string): HuggingFace dataset identifier
- - `split` (string, default: 'train'): Dataset split to sample from
- - `num_samples` (number, default: 10): Number of samples to retrieve (max: 10000)
- - `config_name` (string, optional): Configuration name for multi-config datasets
-
- **Returns:** JSON object with sampled data and metadata.
-
- ### 3. `analyze_dataset_features`
- Perform exploratory analysis on dataset features with automatic optimization.
-
- **Parameters:**
- - `dataset_id` (string): HuggingFace dataset identifier
- - `split` (string, default: 'train'): Dataset split to analyze
- - `sample_size` (number, default: 1000): Number of samples for analysis (max: 50000, only used for fallback)
- - `config_name` (string, optional): Configuration name for multi-config datasets
-
- **Returns:** JSON object with comprehensive feature analysis including:
- - Feature types (numerical, categorical, text, image, audio)
- - Statistical measures (mean, median, std, histograms)
- - Missing value analysis
- - Unique value counts
- - Sample values
-
- **Analysis Methods:**
- - **Primary**: Uses HuggingFace Dataset Viewer API statistics when available (parquet datasets)
-   - Analyzes the full dataset without downloading data
-   - Provides complete statistics with histograms
-   - More efficient and accurate
- - **Fallback**: Sample-based analysis for non-parquet datasets
-   - Downloads and analyzes a sample of the dataset
-   - Computes statistics locally
-
- ### 4. `search_text_in_dataset`
- Search for text in text columns of a dataset using the Dataset Viewer API.
-
- **Parameters:**
- - `dataset_id` (string): HuggingFace dataset identifier
- - `config_name` (string): Configuration name (required for search)
- - `split` (string): Dataset split to search in
- - `query` (string): Search query text
- - `offset` (number, default: 0): Offset for pagination
- - `length` (number, default: 10): Number of results to return (max: 100)
-
- **Returns:** JSON object with search results including:
- - `features`: List of features from the dataset, including column names and data types
- - `rows`: List of matching rows with content from each column
- - `num_rows_total`: Total number of examples in the split
- - `num_rows_per_page`: Number of examples in the current page
- - `partial`: Whether the response is partial (true if the dataset is too large to search completely)
-
- **Limitations:**
- - Only text columns are searched
- - Only parquet datasets are supported (builder_name="parquet")
- - Search is performed by the Dataset Viewer API, not locally
-
- **Validation:**
- - The tool validates that the dataset is in parquet format before attempting search
- - The tool validates that the dataset has at least one text/string column
- - If validation fails, a descriptive error message is returned with suggestions
-
- ## MCP Client Configuration
-
- ### Using with Claude Desktop
-
- Add this configuration to your MCP settings:
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp-server": {
-       "command": "pdm",
-       "args": ["run", "hf-eda-mcp"],
-       "env": {
-         "HF_TOKEN": "your_huggingface_token_here"
-       }
-     }
-   }
- }
- ```
-
- ### Using with Hosted Server
-
- If the server is running on a remote host:
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp-server": {
-       "url": "https://your-server.com/gradio_api/mcp/sse",
-       "headers": {
-         "hf-api-token": "your_huggingface_token_here"
-       }
-     }
-   }
- }
- ```
-
- ## Starting the Server
-
- ### Local Development
- ```bash
- # Start with MCP server enabled (default)
- pdm run hf-eda-mcp
-
- # Start on custom port
- pdm run hf-eda-mcp --port 8080
-
- # Start with verbose logging
- pdm run hf-eda-mcp --verbose
-
- # Start without MCP server functionality
- pdm run hf-eda-mcp --no-mcp
-
- # Start with custom host (listen on all interfaces)
- pdm run hf-eda-mcp --host 0.0.0.0
-
- # Start with public sharing enabled
- pdm run hf-eda-mcp --share
-
- # Start with custom cache directory
- pdm run hf-eda-mcp --cache-dir /path/to/cache
-
- # Start with custom maximum sample size
- pdm run hf-eda-mcp --max-sample-size 100000
- ```
-
- ### Server Modes
-
- The server provides both a web interface and MCP server functionality in a single application. When MCP is enabled, Gradio automatically exposes the 4 EDA functions as MCP tools while still providing the web interface for direct interaction.
-
- ### Environment Variables
-
- The server supports comprehensive configuration via environment variables:
-
- #### Authentication
- - `HF_TOKEN`: HuggingFace access token for private datasets (optional)
-
- #### Server Configuration
- - `HF_EDA_PORT`: Server port (default: 7860)
- - `HF_EDA_HOST`: Server host (default: 127.0.0.1)
- - `HF_EDA_MCP_ENABLED`: Enable MCP server functionality (default: true)
- - `HF_EDA_SHARE`: Enable public sharing via Gradio (default: false)
-
- #### Logging Configuration
- - `HF_EDA_LOG_LEVEL`: Logging level - DEBUG, INFO, WARNING, ERROR (default: INFO)
-
- #### Performance and Caching
- - `HF_EDA_CACHE_DIR`: Directory for caching datasets (optional)
- - `HF_EDA_MAX_CACHE_SIZE`: Maximum cache size in MB (default: 1000)
- - `HF_EDA_MAX_SAMPLE_SIZE`: Maximum sample size for analysis (default: 50000)
- - `HF_EDA_MAX_CONCURRENT`: Maximum concurrent requests (default: 10)
- - `HF_EDA_REQUEST_TIMEOUT`: Request timeout in seconds (default: 300)
-
- ### Configuration Examples
-
- #### Production Configuration
- ```bash
- export HF_TOKEN="your_token_here"
- export HF_EDA_HOST="0.0.0.0"
- export HF_EDA_PORT="8080"
- export HF_EDA_LOG_LEVEL="WARNING"
- export HF_EDA_CACHE_DIR="/var/cache/hf-eda"
- export HF_EDA_MAX_CONCURRENT="20"
- pdm run hf-eda-mcp
- ```
-
- #### Development Configuration
- ```bash
- export HF_TOKEN="your_token_here"
- export HF_EDA_LOG_LEVEL="DEBUG"
- export HF_EDA_CACHE_DIR="./cache"
- pdm run hf-eda-mcp --verbose
- ```
-
- ## Dataset Viewer Statistics Integration
-
- The `analyze_dataset_features` tool automatically uses HuggingFace's Dataset Viewer API when available, providing significant benefits:
-
- ### Benefits
- - **Full Dataset Analysis**: Analyzes entire datasets instead of samples
- - **No Download Required**: Statistics are pre-computed by HuggingFace
- - **Richer Statistics**: Includes histograms, frequencies, and multi-modal support
- - **Better Performance**: Faster response times with caching
-
- ### Supported Datasets
- Statistics are available for datasets with `builder_name="parquet"`. The tool automatically:
- 1. Checks if Dataset Viewer statistics are available
- 2. Uses full dataset statistics when available
- 3. Falls back to sample-based analysis for other datasets
-
- ### Supported Data Types
- The analysis tool provides comprehensive statistics for multiple data types:
- - **Numerical** (int, float): min, max, mean, median, std, histograms
- - **Categorical** (class_label, string_label): frequencies, unique counts
- - **Boolean** (bool): True/False distributions
- - **Text** (string_text): character length statistics, histograms
- - **Image** (image): dimension statistics, histograms
- - **Audio** (audio): duration statistics (seconds), histograms
- - **List** (list): length statistics, histograms
-
- ### Response Indicators
- Check the `sample_info` field in the response:
- - `sampling_method: "dataset_viewer_api"` - Using full dataset statistics
- - `sampling_method: "sequential_head"` - Using sample-based analysis
- - `represents_full_dataset: true/false` - Whether analysis covers the full dataset
-
- ## Example Usage
-
- Once connected to an MCP client, you can use the tools like this:
-
- ```
- # Get metadata for the IMDB dataset
- Use the get_dataset_metadata tool with dataset_id="imdb"
-
- # Sample 5 rows from the training split
- Use the get_dataset_sample tool with dataset_id="imdb", split="train", num_samples=5
-
- # Analyze features of the GLUE dataset (CoLA configuration)
- Use the analyze_dataset_features tool with dataset_id="glue", config_name="cola", sample_size=500
-
- # Search for text in the IMDB dataset
- Use the search_text_in_dataset tool with dataset_id="imdb", config_name="plain_text", split="train", query="great movie", offset=0, length=10
-
- # Search for a specific term in the SQuAD dataset
- Use the search_text_in_dataset tool with dataset_id="squad", config_name="plain_text", split="train", query="president", offset=0, length=5
- ```
-
- ## API Endpoints
-
- When the server is running, you can also access the tools via HTTP API:
-
- - **MCP Schema**: `http://localhost:7860/gradio_api/mcp/schema`
- - **API Documentation**: `http://localhost:7860/?view=api`
- - **Web Interface**: `http://localhost:7860`
-
- ## Troubleshooting
-
- ### Authentication Issues
- - Ensure the `HF_TOKEN` environment variable is set for private datasets
- - Check that your HuggingFace token has appropriate permissions
-
- ### Dataset Not Found
- - Verify the dataset ID is correct and exists on HuggingFace Hub
- - Check if the dataset requires authentication
-
- ### Performance Issues
- - Reduce `sample_size` for large datasets
- - Use streaming mode (enabled by default) for better memory efficiency
-
- ### Search Tool Issues
- - **Dataset not in parquet format**: The search tool only works with parquet datasets. If you get a "DatasetNotParquetError", try using a different dataset or check if the dataset has a parquet configuration
- - **No text columns found**: The search tool requires at least one text/string column. If you get a "NoTextColumnsError", verify that the dataset has text columns by checking the dataset metadata first
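Because `search_text_in_dataset` delegates to the Dataset Viewer API, its behavior can be previewed without the MCP server at all. A minimal sketch against the public `datasets-server.huggingface.co/search` endpoint (the `rows`, `num_rows_total`, and `partial` fields correspond to the tool's documented return shape; the exact response layout is an assumption based on the Dataset Viewer docs):

```python
import os

import requests

token = os.environ.get("HF_TOKEN")  # only needed for private/gated datasets
resp = requests.get(
    "https://datasets-server.huggingface.co/search",
    params={
        "dataset": "stanfordnlp/imdb",
        "config": "plain_text",
        "split": "train",
        "query": "great movie",
        "offset": 0,
        "length": 10,
    },
    headers={"Authorization": f"Bearer {token}"} if token else {},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()
print(result["num_rows_total"], "matching rows; partial:", result["partial"])
for item in result["rows"]:
    print(item["row"]["text"][:80])  # 'text' is the IMDB text column
```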
docs/STATISTICS_ENDPOINT.md DELETED
@@ -1,427 +0,0 @@
- # Dataset Viewer Statistics Endpoint Integration
-
- ## Overview
-
- The HuggingFace Dataset Viewer API provides a `/statistics` endpoint that offers comprehensive statistics for datasets with `builder_name="parquet"`. This endpoint is significantly more efficient and complete than sample-based analysis.
-
- ## Key Benefits
-
- ### 1. Full Dataset Coverage
- - **Before**: Analysis based on samples (default 1,000 examples)
- - **After**: Statistics computed on the entire dataset (e.g., 25,000 examples for IMDB train split)
-
- ### 2. No Data Download Required
- - **Before**: Download and process samples from the dataset
- - **After**: Retrieve pre-computed statistics via API call
-
- ### 3. More Complete Statistics
- The endpoint provides detailed statistics for multiple modalities:
-
- #### Numerical Features (int, float)
- - **Basic statistics**: min, max, mean, median, std
- - **Missing values**: nan_count, nan_proportion
- - **Distribution**: histogram with bin_edges and hist counts
-
- Example response:
- ```json
- {
-   "column_type": "float",
-   "column_statistics": {
-     "nan_count": 0,
-     "nan_proportion": 0,
-     "min": 0,
-     "max": 2,
-     "mean": 1.67206,
-     "median": 1.8,
-     "std": 0.38714,
-     "histogram": {
-       "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
-       "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2]
-     }
-   }
- }
- ```
-
- #### Categorical Features (class_label, string_label)
- - **Unique values**: n_unique count
- - **Frequencies**: Complete frequency distribution for all categories
- - **Missing values**: nan_count, nan_proportion
- - **No label tracking**: no_label_count, no_label_proportion (for class_label)
-
- Example response:
- ```json
- {
-   "column_type": "class_label",
-   "column_statistics": {
-     "nan_count": 0,
-     "nan_proportion": 0,
-     "no_label_count": 0,
-     "no_label_proportion": 0,
-     "n_unique": 2,
-     "frequencies": {
-       "unacceptable": 2528,
-       "acceptable": 6023
-     }
-   }
- }
- ```
-
- #### Text Features (string_text)
- - **Length statistics**: min, max, mean, median, std (character count)
- - **Missing values**: nan_count, nan_proportion
- - **Distribution**: histogram of text lengths
-
- Example response:
- ```json
- {
-   "column_type": "string_text",
-   "column_statistics": {
-     "nan_count": 0,
-     "nan_proportion": 0,
-     "min": 6,
-     "max": 231,
-     "mean": 40.70074,
-     "median": 37,
-     "std": 19.14431,
-     "histogram": {
-       "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
-       "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]
-     }
-   }
- }
- ```
-
- #### Boolean Features (bool)
- - **Frequencies**: Distribution of True/False values
- - **Missing values**: nan_count, nan_proportion
-
- Example response:
- ```json
- {
-   "column_type": "bool",
-   "column_statistics": {
-     "nan_count": 3,
-     "nan_proportion": 0.15,
-     "frequencies": {
-       "False": 7,
-       "True": 10
-     }
-   }
- }
- ```
-
- #### Image Features (image)
- - **Dimension statistics**: min, max, mean, median, std (for width/height)
- - **Missing values**: nan_count, nan_proportion
- - **Distribution**: histogram of image dimensions
-
- Example response:
- ```json
- {
-   "column_type": "image",
-   "column_statistics": {
-     "nan_count": 0,
-     "nan_proportion": 0.0,
-     "min": 256,
-     "max": 873,
-     "mean": 327.99339,
-     "median": 341.0,
-     "std": 60.07286,
-     "histogram": {
-       "hist": [1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2],
-       "bin_edges": [256, 318, 380, 442, 504, ...]
-     }
-   }
- }
- ```
-
- #### Audio Features (audio)
- - **Duration statistics**: min, max, mean, median, std (in seconds)
- - **Missing values**: nan_count, nan_proportion
- - **Distribution**: histogram of audio durations
-
- Example response:
- ```json
- {
-   "column_type": "audio",
-   "column_statistics": {
-     "nan_count": 0,
-     "nan_proportion": 0,
-     "min": 1.02,
-     "max": 15,
-     "mean": 13.93042,
-     "median": 14.77,
-     "std": 2.63734,
-     "histogram": {
-       "hist": [32, 25, 18, 24, 22, 17, 18, 19, 55, 1770],
-       "bin_edges": [1.02, 2.418, 3.816, 5.214, 6.612, ...]
-     }
-   }
- }
- ```
-
- #### List Features (list)
- - **Length statistics**: min, max, mean, median, std (list length)
- - **Missing values**: nan_count, nan_proportion
- - **Distribution**: histogram of list lengths
-
- Example response:
- ```json
- {
-   "column_type": "list",
-   "column_statistics": {
-     "nan_count": 0,
-     "nan_proportion": 0.0,
-     "min": 1,
-     "max": 3,
-     "mean": 1.01741,
-     "median": 1.0,
-     "std": 0.13146,
-     "histogram": {
-       "hist": [11177, 196, 1],
-       "bin_edges": [1, 2, 3, 3]
-     }
-   }
- }
- ```
-
- ## Implementation
-
- ### Architecture
-
- ```
- analyze_dataset_features()
-     ↓
- Try: get_dataset_statistics() [Dataset Viewer API]
-     ↓
- If available (parquet format):
-     → Use full dataset statistics
-     → Cache results
-     → Return converted analysis
-
- If not available:
-     → Fall back to sample-based analysis
-     → Load samples via streaming
-     → Compute statistics locally
- ```
-
- ### Key Components
-
- #### 1. DatasetViewerAdapter
- - `get_dataset_statistics()`: Fetch statistics from API
- - `check_statistics_availability()`: Check if statistics are available for a dataset
-
- #### 2. DatasetService
- - `get_dataset_statistics()`: Wrapper with caching and error handling
- - Automatic fallback to sample-based analysis
- - Statistics cache directory: `cache/statistics/`
-
- #### 3. Analysis Tool
- - `_convert_viewer_statistics_to_analysis()`: Convert API format to our analysis format
- - Seamless integration with existing analysis pipeline
-
- ### Caching Strategy
-
- Statistics are cached with the same TTL as other metadata (default: 1 hour):
-
- ```
- cache/
- ├── metadata/      # Dataset metadata
- ├── samples/       # Sample data
- └── statistics/    # Dataset Viewer statistics
-     └── {dataset}_{config}_{split}_stats.json
- ```
-
- ## Usage Examples
-
- ### Automatic Selection
-
- ```python
- from hf_eda_mcp.tools.analysis import analyze_dataset_features
-
- # Automatically uses Dataset Viewer statistics if available
- result = analyze_dataset_features(
-     dataset_id="stanfordnlp/imdb",
-     split="train"
- )
-
- # Check which method was used
- print(result['sample_info']['sampling_method'])
- # Output: "dataset_viewer_api" or "sequential_head"
-
- print(result['sample_info']['represents_full_dataset'])
- # Output: True (full dataset) or False (sample)
- ```
-
- ### Check Availability
-
- ```python
- from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter
-
- adapter = DatasetViewerAdapter(token="your_token")
- availability = adapter.check_statistics_availability("stanfordnlp/imdb")
-
- print(availability)
- # {
- #   'available': True,
- #   'configs': ['plain_text'],
- #   'reason': 'Statistics available for 1 config(s)'
- # }
- ```
-
- ### Direct Statistics Access
-
- ```python
- from hf_eda_mcp.services.dataset_service import DatasetService
-
- service = DatasetService(token="your_token")
- stats = service.get_dataset_statistics(
-     dataset_id="stanfordnlp/imdb",
-     split="train",
-     config_name="plain_text"
- )
-
- if stats:
-     print(f"Full dataset: {stats['num_examples']} examples")
-     print(f"Columns: {len(stats['statistics'])}")
- else:
-     print("Statistics not available, use sample-based analysis")
- ```
-
- ## Comparison: Before vs After
-
- ### IMDB Dataset Example
-
- #### Before (Sample-based)
- ```python
- {
-     'dataset_info': {
-         'sample_size_used': 1000,
-         'sample_size_requested': 1000,
-     },
-     'sample_info': {
-         'sampling_method': 'sequential_head',
-         'represents_full_dataset': True,  # Only if sample >= requested
-     },
-     'features': {
-         'text': {
-             'feature_type': 'text',
-             'statistics': {
-                 'count': 1000,
-                 'avg_length': 1311.289,
-                 'min_length': 65,
-                 'max_length': 6103,
-                 # Limited to sample
-             }
-         }
-     },
-     'summary': 'Analyzed 2 features from 1000 samples | Types: 1 categorical, 1 text'
- }
- ```
-
- #### After (Dataset Viewer)
- ```python
- {
-     'dataset_info': {
-         'sample_size_used': 25000,  # Full dataset
-         'sample_size_requested': 25000,
-     },
-     'sample_info': {
-         'sampling_method': 'dataset_viewer_api',
-         'represents_full_dataset': True,  # Always true
-         'partial': False
-     },
-     'features': {
-         'text': {
-             'feature_type': 'text',
-             'statistics': {
-                 'count': 25000,  # Full dataset
-                 'mean_length': 1325.06964,
-                 'min_length': 52,
-                 'max_length': 13704,
-                 'histogram': {
-                     'bin_edges': [52, 1418, 2784, ...],
-                     'hist': [17426, 5384, 1490, ...]
-                 }
-             }
-         }
-     },
-     'summary': 'Analyzed 2 features from 25000 samples | Types: 1 categorical, 1 text'
- }
- ```
-
- ## Supported Data Types
-
- The Dataset Viewer statistics endpoint supports comprehensive analysis for multiple data types:
-
- | Data Type | Feature Type | Statistics Provided |
- |-----------|--------------|---------------------|
- | `int`, `float` | numerical | min, max, mean, median, std, histogram |
- | `class_label`, `string_label` | categorical | frequencies, n_unique, no_label tracking |
- | `bool` | boolean | True/False frequencies |
- | `string_text` | text | character length stats (min, max, mean, median, std), histogram |
- | `image` | image | dimension statistics, histogram |
- | `audio` | audio | duration statistics (seconds), histogram |
- | `list` | list | length statistics, histogram |
-
- ### Data Type Mapping
-
- Our analysis tool automatically maps Dataset Viewer types to our internal types:
-
- ```
- Dataset Viewer Type → Our Feature Type
- ─────────────────────────────────────
- int, float   → numerical
- class_label  → categorical
- string_label → categorical
- bool         → boolean
- string_text  → text
- image        → image
- audio        → audio
- list         → list
- ```
-
- ## Limitations
-
- ### Dataset Requirements
- - Only works for datasets with `builder_name="parquet"`
- - Not all datasets on HuggingFace Hub have this format
- - Automatic fallback to sample-based analysis for other formats
-
- ### API Availability
- - Requires internet connection
- - Subject to HuggingFace API rate limits
- - May fail for private datasets without proper authentication
-
- ## Error Handling
-
- The implementation includes robust error handling:
-
- 1. **Check availability first**: Verify dataset supports statistics
- 2. **Graceful fallback**: Automatically use sample-based analysis if unavailable
- 3. **Caching**: Reduce API calls and improve performance
- 4. **Logging**: Clear messages about which method is being used
-
- ## Performance Impact
-
- ### API Call Overhead
- - Initial call: ~1-2 seconds
- - Cached calls: <10ms
- - No data download required
-
- ### Sample-based Analysis
- - Download time: Varies by dataset size
- - Processing time: ~1-5 seconds for 1000 samples
- - Network bandwidth: Depends on sample size
-
- ## Future Enhancements
-
- 1. **Parallel requests**: Fetch statistics for multiple splits simultaneously
- 2. **Partial statistics**: Support datasets with partial statistics
- 3. **Custom aggregations**: Add more statistical measures
- 4. **Visualization**: Generate plots from histogram data
-
- ## References
-
- - [HuggingFace Dataset Viewer Documentation](https://huggingface.co/docs/dataset-viewer/info)
- - [Statistics Endpoint Specification](https://huggingface.co/docs/dataset-viewer/statistics)
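The endpoint documented above can also be exercised directly, which doubles as a preview of item 4 in the future enhancements (plotting histograms). A minimal sketch against the public Dataset Viewer `/statistics` endpoint, with response field names assumed to follow the examples above:

```python
import requests
import matplotlib.pyplot as plt

resp = requests.get(
    "https://datasets-server.huggingface.co/statistics",
    params={"dataset": "stanfordnlp/imdb", "config": "plain_text", "split": "train"},
    timeout=30,
)
resp.raise_for_status()
stats = resp.json()
print(f"Full dataset: {stats['num_examples']} examples")

# Plot the character-length histogram of the first text column.
for col in stats["statistics"]:
    if col["column_type"] == "string_text":
        hist = col["column_statistics"]["histogram"]
        edges, counts = hist["bin_edges"], hist["hist"]
        # bin_edges has one more entry than hist: use left edges plus per-bin widths.
        widths = [b - a for a, b in zip(edges, edges[1:])]
        plt.bar(edges[:-1], counts, width=widths, align="edge")
        plt.xlabel(f"{col['column_name']} length (characters)")
        plt.ylabel("count")
        plt.show()
        break
```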
docs/deployment/DEPLOYMENT.md DELETED
@@ -1,300 +0,0 @@
- # Deployment Guide
-
- This guide covers different deployment options for the hf-eda-mcp server.
-
- ## Table of Contents
-
- - [Local Development](#local-development)
- - [Docker Deployment](#docker-deployment)
- - [HuggingFace Spaces](#huggingface-spaces)
- - [Production Considerations](#production-considerations)
-
- ---
-
- ## Local Development
-
- ### Prerequisites
-
- - Python 3.13+
- - PDM (Python package manager)
- - HuggingFace account (optional, for private datasets)
-
- ### Setup
-
- 1. Clone the repository:
- ```bash
- git clone https://github.com/your-username/hf-eda-mcp.git
- cd hf-eda-mcp
- ```
-
- 2. Install dependencies:
- ```bash
- pdm install
- ```
-
- 3. Configure environment variables:
- ```bash
- cp config.example.env .env
- # Edit .env and add your HF_TOKEN if needed
- ```
-
- 4. Run the server:
- ```bash
- pdm run hf-eda-mcp
- ```
-
- The server will start on `http://localhost:7860` with MCP enabled.
-
- ---
-
- ## Docker Deployment
-
- ### Build the Image
-
- ```bash
- docker build -t hf-eda-mcp:latest .
- ```
-
- ### Run with Docker
-
- ```bash
- docker run -d \
-   --name hf-eda-mcp-server \
-   -p 7860:7860 \
-   -e HF_TOKEN=your_token_here \
-   -v hf-cache:/app/cache \
-   hf-eda-mcp:latest
- ```
-
- ### Run with Docker Compose
-
- 1. Create a `.env` file with your configuration:
- ```bash
- HF_TOKEN=your_token_here
- ```
-
- 2. Start the service:
- ```bash
- docker-compose up -d
- ```
-
- 3. View logs:
- ```bash
- docker-compose logs -f
- ```
-
- 4. Stop the service:
- ```bash
- docker-compose down
- ```
-
- ### Docker Configuration Options
-
- Environment variables you can set:
-
- - `HF_TOKEN`: HuggingFace API token
- - `GRADIO_SERVER_NAME`: Server host (default: `0.0.0.0`)
- - `GRADIO_SERVER_PORT`: Server port (default: `7860`)
- - `HF_HOME`: Cache directory for HuggingFace
- - `MCP_SERVER_ENABLED`: Enable MCP server (default: `true`)
-
- ---
-
- ## HuggingFace Spaces
-
- ### Deployment Steps
-
- 1. **Create a new Space**:
-    - Go to https://huggingface.co/spaces
-    - Click "Create new Space"
-    - Choose "Gradio" as the SDK
-    - Select SDK version 5.49.1 or higher
-
- 2. **Upload files**:
- ```bash
- # Copy files to Spaces directory
- cp -r src/ spaces/
- cp README.md LICENSE spaces/
-
- # Initialize git in spaces directory
- cd spaces
- git init
- git remote add origin https://huggingface.co/spaces/YOUR-USERNAME/hf-eda-mcp
- ```
-
- 3. **Configure the Space**:
-    - Copy `spaces/README.md` as the Space's README
-    - Ensure `spaces/app.py` is set as the app file
-    - Add `spaces/requirements.txt` for dependencies
-
- 4. **Set secrets** (for private datasets):
-    - Go to Space settings
-    - Add `HF_TOKEN` as a secret
-
- 5. **Deploy**:
- ```bash
- git add .
- git commit -m "Initial deployment"
- git push origin main
- ```
-
- ### Space Configuration
-
- The Space will automatically:
- - Install dependencies from `requirements.txt`
- - Run `app.py` as the entry point
- - Expose the MCP server at `/gradio_api/mcp/sse`
-
- ### Accessing the Space
-
- Your MCP server will be available at:
- ```
- https://YOUR-USERNAME-hf-eda-mcp.hf.space/gradio_api/mcp/sse
- ```
-
- ---
-
- ## Production Considerations
-
- ### Security
-
- 1. **Authentication**:
-    - Use environment variables for sensitive data
-    - Never commit tokens to version control
-    - Rotate tokens regularly
-
- 2. **Access Control**:
-    - Consider implementing rate limiting
-    - Use HTTPS for all connections
-    - Validate all input parameters
-
- 3. **Secrets Management**:
-    - Use Docker secrets or environment files
-    - For Spaces, use the built-in secrets feature
-    - Consider using a secrets manager (AWS Secrets Manager, HashiCorp Vault)
-
- ### Performance
-
- 1. **Caching**:
-    - Enable persistent cache volumes
-    - Configure appropriate cache sizes
-    - Monitor cache hit rates
-
- 2. **Resource Limits**:
-    - Set memory limits in Docker
-    - Configure appropriate timeouts
-    - Monitor CPU and memory usage
-
- 3. **Scaling**:
-    - Use load balancers for multiple instances
-    - Consider horizontal scaling for high traffic
-    - Monitor response times and adjust resources
-
- ### Monitoring
-
- 1. **Logging**:
-    - Configure structured logging
-    - Use log aggregation tools (ELK, Splunk)
-    - Monitor error rates
-
- 2. **Metrics**:
-    - Track request counts and latencies
-    - Monitor cache performance
-    - Set up alerts for errors
-
- 3. **Health Checks**:
-    - Implement health check endpoints
-    - Configure container health checks
-    - Set up uptime monitoring
-
- ### Backup and Recovery
-
- 1. **Data Backup**:
-    - Backup cache volumes regularly
-    - Document configuration settings
-    - Version control all code
-
- 2. **Disaster Recovery**:
-    - Document deployment procedures
-    - Test recovery processes
-    - Maintain rollback capabilities
-
- ---
-
- ## Deployment Checklist
-
- ### Pre-Deployment
-
- - [ ] All tests passing
- - [ ] Dependencies up to date
- - [ ] Security scan completed
- - [ ] Documentation updated
- - [ ] Environment variables configured
- - [ ] Secrets properly managed
-
- ### Deployment
-
- - [ ] Build successful
- - [ ] Health checks passing
- - [ ] MCP endpoints accessible
- - [ ] Tools functioning correctly
- - [ ] Logs being collected
- - [ ] Monitoring configured
-
- ### Post-Deployment
-
- - [ ] Verify all tools work
- - [ ] Check performance metrics
- - [ ] Monitor error rates
- - [ ] Test with MCP clients
- - [ ] Document any issues
- - [ ] Update runbooks
-
- ---
-
- ## Troubleshooting
-
- ### Common Issues
-
- 1. **Server won't start**:
-    - Check Python version (3.13+ required)
-    - Verify all dependencies installed
-    - Check port availability
-    - Review logs for errors
-
- 2. **MCP connection fails**:
-    - Verify server is running
-    - Check firewall settings
-    - Confirm correct URL/port
-    - Test with curl or browser
-
- 3. **Dataset access errors**:
-    - Verify HF_TOKEN is set
-    - Check token permissions
-    - Confirm dataset exists
-    - Test with public dataset first
-
- 4. **Performance issues**:
-    - Check cache configuration
-    - Monitor resource usage
-    - Reduce sample sizes
-    - Enable caching
-
- ### Getting Help
-
- - Check logs: `docker logs hf-eda-mcp-server`
- - Review documentation: See `MCP_USAGE.md`
- - Open an issue: GitHub repository
- - Community support: HuggingFace forums
-
- ---
-
- ## Next Steps
-
- After deployment:
-
- 1. Configure MCP clients (see `deployment/mcp-client-examples.md`)
- 2. Test all tools with various datasets
- 3. Set up monitoring and alerts
- 4. Document any custom configurations
- 5. Share your Space with the community!
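For the health checks recommended in the deleted guide above, one simple liveness probe is to poll the MCP schema endpoint the server already exposes (documented in `MCP_USAGE.md`). A minimal sketch whose exit code could back a Docker `HEALTHCHECK`:

```python
import sys

import requests

URL = "http://localhost:7860/gradio_api/mcp/schema"  # documented MCP schema endpoint

def healthy(url: str = URL, timeout: float = 5.0) -> bool:
    # The schema endpoint only answers once Gradio and the MCP server are up.
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    sys.exit(0 if healthy() else 1)
```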
docs/deployment/QUICKSTART.md DELETED
@@ -1,148 +0,0 @@
- # Quick Start Guide
-
- Get hf-eda-mcp running in minutes!
-
- ## Choose Your Deployment Method
-
- ### 🚀 Option 1: Local Development (Fastest)
-
- ```bash
- # Install dependencies
- pdm install
-
- # Set up environment (optional for public datasets)
- cp config.example.env .env
- # Edit .env and add HF_TOKEN if needed
-
- # Run the server
- pdm run hf-eda-mcp
- ```
-
- Server runs at: `http://localhost:7860`
-
- ---
-
- ### 🐳 Option 2: Docker (Recommended for Production)
-
- ```bash
- # Build the image
- docker build -t hf-eda-mcp:latest .
-
- # Run the container
- docker run -d \
-   --name hf-eda-mcp-server \
-   -p 7860:7860 \
-   -e HF_TOKEN=your_token_here \
-   hf-eda-mcp:latest
- ```
-
- Or use Docker Compose:
-
- ```bash
- # Create .env file with HF_TOKEN
- echo "HF_TOKEN=your_token_here" > .env
-
- # Start the service
- docker-compose up -d
- ```
-
- Server runs at: `http://localhost:7860`
-
- ---
-
- ### ☁️ Option 3: HuggingFace Spaces (Easiest for Sharing)
-
- 1. Create a new Gradio Space on HuggingFace
- 2. Copy files from `spaces/` directory to your Space
- 3. Set `HF_TOKEN` as a secret in Space settings (if needed)
- 4. Push to deploy
-
- Your server will be at: `https://YOUR-USERNAME-hf-eda-mcp.hf.space`
-
- ---
-
- ## Connect an MCP Client
-
- ### Kiro IDE
-
- Add to `.kiro/settings/mcp.json`:
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "pdm",
-       "args": ["run", "hf-eda-mcp"],
-       "disabled": false
-     }
-   }
- }
- ```
-
- ### Claude Desktop
-
- Add to `claude_desktop_config.json`:
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "python",
-       "args": ["-m", "hf_eda_mcp"],
-       "env": {
-         "PYTHONPATH": "/path/to/hf-eda-mcp/src"
-       }
-     }
-   }
- }
- ```
-
- ---
-
- ## Test the Server
-
- ### Using the Web Interface
-
- 1. Open `http://localhost:7860` in your browser
- 2. Try the tools with a sample dataset like "squad"
-
- ### Using an MCP Client
-
- Ask your AI assistant:
-
- ```
- "Get metadata for the squad dataset"
- "Show me 5 samples from the train split of squad"
- "Analyze the features of the squad dataset"
- ```
-
- ---
-
- ## Common Issues
-
- **Server won't start?**
- - Check Python version: `python --version` (need 3.13+)
- - Install dependencies: `pdm install`
-
- **Can't access private datasets?**
- - Set `HF_TOKEN` in your `.env` file
- - Get token from: https://huggingface.co/settings/tokens
-
- **Port 7860 already in use?**
- - Change port: `GRADIO_SERVER_PORT=8080 pdm run hf-eda-mcp`
-
- ---
-
- ## Next Steps
-
- - 📖 Read the full [Deployment Guide](DEPLOYMENT.md)
- - 🔧 See [MCP Client Examples](mcp-client-examples.md)
- - 📚 Check [MCP Usage Documentation](../MCP_USAGE.md)
-
- ---
-
- ## Need Help?
-
- - Check logs: `docker logs hf-eda-mcp-server` (Docker)
- - Review documentation in `docs/`
- - Open an issue on GitHub
docs/deployment/mcp-client-examples.md DELETED
@@ -1,295 +0,0 @@
- # MCP Client Configuration Examples
-
- This document provides configuration examples for connecting various MCP clients to the hf-eda-mcp server.
-
- ## Table of Contents
-
- - [Kiro IDE](#kiro-ide)
- - [Claude Desktop](#claude-desktop)
- - [Custom MCP Client](#custom-mcp-client)
- - [Environment Variables](#environment-variables)
-
- ---
-
- ## Kiro IDE
-
- ### Workspace Configuration
-
- Create or edit `.kiro/settings/mcp.json` in your workspace:
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "docker",
-       "args": [
-         "run",
-         "--rm",
-         "-i",
-         "-p", "7860:7860",
-         "--env-file", ".env",
-         "hf-eda-mcp:latest"
-       ],
-       "env": {
-         "HF_TOKEN": "${HF_TOKEN}"
-       },
-       "disabled": false,
-       "autoApprove": [
-         "get_dataset_metadata",
-         "get_dataset_sample",
-         "analyze_dataset_features"
-       ]
-     }
-   }
- }
- ```
-
- ### User-Level Configuration
-
- Edit `~/.kiro/settings/mcp.json` for global configuration:
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "pdm",
-       "args": ["run", "hf-eda-mcp"],
-       "env": {
-         "HF_TOKEN": "your_token_here"
-       },
-       "disabled": false,
-       "autoApprove": []
-     }
-   }
- }
- ```
-
- ### Using HuggingFace Spaces
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "url": "https://your-username-hf-eda-mcp.hf.space/gradio_api/mcp/sse",
-       "disabled": false,
-       "autoApprove": ["get_dataset_metadata"]
-     }
-   }
- }
- ```
-
- ---
-
- ## Claude Desktop
-
- ### Configuration File Location
-
- - **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
- - **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
- - **Linux**: `~/.config/Claude/claude_desktop_config.json`
-
- ### Local Server Configuration
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "python",
-       "args": ["-m", "hf_eda_mcp"],
-       "env": {
-         "HF_TOKEN": "your_token_here",
-         "PYTHONPATH": "/path/to/hf-eda-mcp/src"
-       }
-     }
-   }
- }
- ```
-
- ### Docker Configuration
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "command": "docker",
-       "args": [
-         "run",
-         "--rm",
-         "-i",
-         "-p", "7860:7860",
-         "-e", "HF_TOKEN=your_token_here",
-         "hf-eda-mcp:latest"
-       ]
-     }
-   }
- }
- ```
-
- ### HuggingFace Spaces Configuration
-
- ```json
- {
-   "mcpServers": {
-     "hf-eda-mcp": {
-       "url": "https://your-username-hf-eda-mcp.hf.space/gradio_api/mcp/sse"
-     }
-   }
- }
- ```
-
- ---
-
- ## Custom MCP Client
-
- ### Python Client Example
-
- ```python
- import asyncio
- from mcp import ClientSession, StdioServerParameters
- from mcp.client.stdio import stdio_client
-
- async def main():
-     # Connect to local server
-     server_params = StdioServerParameters(
-         command="python",
-         args=["-m", "hf_eda_mcp"],
-         env={"HF_TOKEN": "your_token_here"}
-     )
-
-     async with stdio_client(server_params) as (read, write):
-         async with ClientSession(read, write) as session:
-             # Initialize the connection
-             await session.initialize()
-
-             # List available tools
-             tools = await session.list_tools()
-             print("Available tools:", tools)
-
-             # Call a tool
-             result = await session.call_tool(
-                 "get_dataset_metadata",
-                 arguments={"dataset_id": "squad"}
-             )
-             print("Result:", result)
-
- if __name__ == "__main__":
-     asyncio.run(main())
- ```
-
- ### JavaScript/TypeScript Client Example
-
- ```typescript
- import { Client } from "@modelcontextprotocol/sdk/client/index.js";
- import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";
-
- async function main() {
-   const transport = new StdioClientTransport({
-     command: "python",
-     args: ["-m", "hf_eda_mcp"],
-     env: {
-       HF_TOKEN: process.env.HF_TOKEN
-     }
-   });
-
-   const client = new Client({
-     name: "hf-eda-client",
-     version: "1.0.0"
-   }, {
-     capabilities: {}
-   });
-
-   await client.connect(transport);
-
-   // List tools
-   const tools = await client.listTools();
-   console.log("Available tools:", tools);
-
-   // Call a tool
-   const result = await client.callTool({
-     name: "get_dataset_metadata",
-     arguments: {
-       dataset_id: "squad"
-     }
-   });
-   console.log("Result:", result);
-
-   await client.close();
- }
-
- main().catch(console.error);
- ```
-
- ---
-
- ## Environment Variables
-
- ### Required Variables
-
- - `HF_TOKEN`: HuggingFace API token (optional for public datasets, required for private datasets)
-
- ### Optional Variables
-
- - `HF_HOME`: Directory for HuggingFace cache (default: `~/.cache/huggingface`)
- - `HF_DATASETS_CACHE`: Directory for datasets cache
- - `TRANSFORMERS_CACHE`: Directory for transformers cache
- - `GRADIO_SERVER_NAME`: Server host (default: `0.0.0.0`)
- - `GRADIO_SERVER_PORT`: Server port (default: `7860`)
- - `MCP_SERVER_ENABLED`: Enable MCP server (default: `true`)
-
- ### Example .env File
-
- ```bash
- # HuggingFace Authentication
- HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
-
- # Cache Configuration
- HF_HOME=/path/to/cache
- HF_DATASETS_CACHE=/path/to/cache/datasets
- TRANSFORMERS_CACHE=/path/to/cache/transformers
-
- # Server Configuration
- GRADIO_SERVER_NAME=0.0.0.0
- GRADIO_SERVER_PORT=7860
- MCP_SERVER_ENABLED=true
- ```
-
- ---
-
- ## Deployment Options Comparison
-
- | Option | Pros | Cons | Best For |
- |--------|------|------|----------|
- | **Local (PDM)** | Fast, easy debugging | Requires Python setup | Development |
- | **Docker** | Isolated, reproducible | Requires Docker | Production, CI/CD |
- | **HF Spaces** | Hosted, no maintenance | Limited control | Public sharing |
-
- ---
-
- ## Troubleshooting
-
- ### Connection Issues
-
- 1. **Server not starting**: Check logs for errors, verify dependencies installed
- 2. **Authentication failed**: Verify `HF_TOKEN` is set correctly
- 3. **Port already in use**: Change `GRADIO_SERVER_PORT` to a different port
-
- ### Tool Execution Issues
-
- 1. **Dataset not found**: Verify dataset ID is correct on HuggingFace Hub
- 2. **Permission denied**: Ensure `HF_TOKEN` has access to private datasets
- 3. **Timeout errors**: Increase timeout settings or use smaller sample sizes
-
- ### Docker Issues
-
- 1. **Image build fails**: Ensure all dependencies in `pyproject.toml` are compatible
- 2. **Container exits immediately**: Check logs with `docker logs hf-eda-mcp-server`
- 3. **Cache not persisting**: Verify volume mounts in `docker-compose.yml`
-
- ---
-
- ## Additional Resources
-
- - [MCP Protocol Documentation](https://modelcontextprotocol.io/)
- - [Gradio MCP Integration](https://www.gradio.app/guides/gradio-and-mcp)
- - [HuggingFace Hub Documentation](https://huggingface.co/docs/hub/index)
- - [Project Repository](https://github.com/your-username/hf-eda-mcp)
- - [Project Repository](https://github.com/your-username/hf-eda-mcp)