kshitijthakkar committed
Commit ae24574 · 1 Parent(s): 54d748d

docs: Add comprehensive JOB_SUBMISSION.md guide with accurate pricing


- Add complete job submission documentation for HF Jobs and Modal
- Include accurate per-second pricing for both platforms
- HuggingFace Jobs: $0.40-2.50/hr (based on HF Spaces GPU pricing)
- Modal: $0.59-6.25/hr (verified rates from Modal pricing)
- Correct billing model: both platforms use per-second billing (no minimums)
- Add hardware selection guide with auto-selection logic
- Include cost estimation, monitoring, and troubleshooting sections
- Provide step-by-step submission workflow with examples
- Add cost comparison tables and optimization tips
- Update README.md with corrected technology stack details

References:
- https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs
- https://huggingface.co/docs/hub/en/spaces-gpus
- https://modal.com/pricing

Files changed (2):
  1. JOB_SUBMISSION.md +971 -0
  2. README.md +2 -2
JOB_SUBMISSION.md ADDED
# Job Submission Guide

This guide explains how to submit agent evaluation jobs to run on cloud infrastructure using TraceMind-AI.

## Table of Contents

- [Overview](#overview)
- [Infrastructure Options](#infrastructure-options)
  - [HuggingFace Jobs](#huggingface-jobs)
  - [Modal](#modal)
- [Prerequisites](#prerequisites)
- [Hardware Selection Guide](#hardware-selection-guide)
- [Submitting a Job](#submitting-a-job)
- [Cost Estimation](#cost-estimation)
- [Monitoring Jobs](#monitoring-jobs)
- [Understanding Job Results](#understanding-job-results)
- [Troubleshooting](#troubleshooting)
- [Advanced Configuration](#advanced-configuration)

---

## Overview

TraceMind-AI allows you to submit SMOLTRACE evaluation jobs to two cloud platforms:

1. **HuggingFace Jobs** - Managed compute with GPU/CPU options
2. **Modal** - Serverless compute with pay-per-second billing

Both platforms:
- ✅ Run the same SMOLTRACE evaluation engine
- ✅ Push results automatically to HuggingFace datasets
- ✅ Appear in the TraceMind leaderboard when complete
- ✅ Collect OpenTelemetry traces and GPU metrics
- ✅ Bill **per-second** with no minimum duration

**Choose based on your needs**:
- **HuggingFace Jobs**: Best if you already have an HF Pro subscription ($9/month)
- **Modal**: Best if you need H200/H100 GPUs or want to avoid subscriptions

**Pricing Sources**:
- [HuggingFace Jobs Documentation](https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs)
- [HuggingFace Spaces GPU Pricing](https://huggingface.co/docs/hub/en/spaces-gpus)
- [Modal GPU Pricing](https://modal.com/pricing)

---

## Infrastructure Options

### HuggingFace Jobs

**What it is**: Managed compute platform from HuggingFace with dedicated GPU/CPU instances.

**Pricing Model**: Subscription-based ($9/month HF Pro) plus **per-second** GPU charges

**Hardware Options** (pricing from [HF Spaces GPU pricing](https://huggingface.co/docs/hub/en/spaces-gpus)):
- `cpu-basic` - 2 vCPU, 16GB RAM (Free with Pro)
- `cpu-upgrade` - 8 vCPU, 32GB RAM (Free with Pro)
- `t4-small` - NVIDIA T4 16GB, 4 vCPU, 15GB RAM ($0.40/hr = $0.000111/sec)
- `t4-medium` - NVIDIA T4 16GB, 8 vCPU, 30GB RAM ($0.60/hr = $0.000167/sec)
- `l4x1` - NVIDIA L4 24GB, 8 vCPU, 30GB RAM ($0.80/hr = $0.000222/sec)
- `l4x4` - 4x NVIDIA L4 96GB total, 48 vCPU, 186GB RAM ($3.80/hr = $0.001056/sec)
- `a10g-small` - NVIDIA A10G 24GB ($1.00/hr = $0.000278/sec)
- `a10g-large` - NVIDIA A10G 24GB (more compute) ($1.50/hr = $0.000417/sec)
- `a10g-largex2` - 2x NVIDIA A10G 48GB total ($3.00/hr = $0.000833/sec)
- `a10g-largex4` - 4x NVIDIA A10G 96GB total ($5.00/hr = $0.001389/sec)
- `a100-large` - NVIDIA A100 80GB, 12 vCPU, 142GB RAM ($2.50/hr = $0.000694/sec)
- `v5e-1x1` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x2` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x4` - Google Cloud TPU v5e (pricing TBD)

*Note: Jobs billing is **per-second** with no minimum. You only pay for actual compute time used.*
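
The per-second rates above are simply the hourly rates divided by 3600. A quick sanity check (hourly prices taken from the list above):

```python
# Convert published hourly rates to the per-second rates listed above.
HOURLY_RATES = {
    "t4-small": 0.40,
    "a10g-large": 1.50,
    "a100-large": 2.50,
}

def per_second(hourly_usd: float) -> float:
    """Per-second rate under per-second billing (no minimums)."""
    return hourly_usd / 3600

for flavor, hourly in HOURLY_RATES.items():
    print(f"{flavor}: ${per_second(hourly):.6f}/sec")
# t4-small: $0.000111/sec
# a10g-large: $0.000417/sec
# a100-large: $0.000694/sec
```
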

**Pros**:
- Simple authentication (HuggingFace token)
- Integrated with the HF ecosystem
- Job dashboard at https://huggingface.co/jobs
- Reliable infrastructure

**Cons**:
- Requires an HF Pro subscription ($9/month)
- Slightly more expensive than Modal for most GPUs
- Fewer hardware options than Modal (no H100/H200)

**When to use**:
- ✅ You already have an HF Pro subscription
- ✅ You want simplicity and reliability
- ✅ You prefer HuggingFace ecosystem integration
- ✅ You prefer managed infrastructure

### Modal

**What it is**: Serverless compute platform with pay-per-second billing for CPU and GPU workloads.

**Pricing Model**: Pay-per-second usage (no subscription required)

**Hardware Options**:
- `cpu` - Physical core (2 vCPU equivalent) ($0.0000131/core/sec, min 0.125 cores)
- `gpu_t4` - NVIDIA T4 16GB ($0.000164/sec ≈ $0.59/hr)
- `gpu_l4` - NVIDIA L4 24GB ($0.000222/sec ≈ $0.80/hr)
- `gpu_a10` - NVIDIA A10G 24GB ($0.000306/sec ≈ $1.10/hr)
- `gpu_l40s` - NVIDIA L40S 48GB ($0.000542/sec ≈ $1.95/hr)
- `gpu_a100` - NVIDIA A100 40GB ($0.000583/sec ≈ $2.10/hr)
- `gpu_a100_80gb` - NVIDIA A100 80GB ($0.000694/sec ≈ $2.50/hr)
- `gpu_h100` - NVIDIA H100 80GB ($0.001097/sec ≈ $3.95/hr)
- `gpu_h200` - NVIDIA H200 141GB ($0.001261/sec ≈ $4.54/hr)
- `gpu_b200` - NVIDIA B200 192GB ($0.001736/sec ≈ $6.25/hr)
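
With per-second billing, a job's compute cost is just rate × runtime. A small estimator using the rates listed above (compute only; LLM API usage and egress are billed separately):

```python
# Per-second rates from the hardware list above (USD).
MODAL_RATES_PER_SEC = {
    "gpu_t4": 0.000164,
    "gpu_l40s": 0.000542,
    "gpu_a100_80gb": 0.000694,
    "gpu_h200": 0.001261,
}

def estimate_cost(hardware: str, runtime_minutes: float) -> float:
    """Compute-only cost estimate; excludes LLM API costs and network egress."""
    return MODAL_RATES_PER_SEC[hardware] * runtime_minutes * 60

# A 20-minute run on an L40S:
print(f"${estimate_cost('gpu_l40s', 20):.2f}")  # → $0.65
```
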

**Pros**:
- Pay-per-second (no hourly minimums)
- Wide range of GPUs (including H200, H100)
- No subscription required
- Real-time logs and monitoring
- Fast cold starts

**Cons**:
- Requires Modal account setup
- Need to configure API tokens (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET)
- Network egress charges apply
- Less integrated with the HF ecosystem

**When to use**:
- ✅ You want to minimize costs (generally cheaper than HF Jobs)
- ✅ You need access to the latest GPUs (H200, H100, B200)
- ✅ You prefer serverless architecture
- ✅ You don't have an HF Pro subscription
- ✅ You want more GPU options and flexibility

---

## Prerequisites

### For Viewing the Leaderboard (Free)

**Required**:
- HuggingFace account (free)
- HuggingFace token with **Read** permission

**How to get one**:
1. Go to https://huggingface.co/settings/tokens
2. Create a new token with **Read** permission
3. Copy the token (starts with `hf_...`)
4. Add it in the TraceMind Settings tab

### For Submitting Jobs to HuggingFace Jobs

**Required**:
1. **HuggingFace Pro** subscription ($9/month)
   - Sign up at https://huggingface.co/pricing
   - **Must add a credit card** for GPU compute charges
2. HuggingFace token with **Read + Write + Run Jobs** permissions
3. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

**How to set up**:
1. Subscribe to HF Pro: https://huggingface.co/pricing
2. Add a credit card for compute charges
3. Create a token with all permissions:
   - Go to https://huggingface.co/settings/tokens
   - Click "New token"
   - Select: **Read**, **Write**, **Run Jobs**
   - Copy the token
4. Add API keys in TraceMind Settings:
   - HuggingFace Token
   - OpenAI API Key (if testing OpenAI models)
   - Anthropic API Key (if testing Claude models)
   - etc.

### For Submitting Jobs to Modal

**Required**:
1. Modal account (free to create, pay-per-use)
2. Modal API token (Token ID + Token Secret)
3. HuggingFace token with **Read + Write** permissions
4. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

**How to set up**:
1. Create a Modal account:
   - Go to https://modal.com
   - Sign up (GitHub or email)
2. Create an API token:
   - Go to https://modal.com/settings/tokens
   - Click "Create token"
   - Copy the **Token ID** (starts with `ak-...`)
   - Copy the **Token Secret** (starts with `as-...`)
3. Add credentials in TraceMind Settings:
   - Modal Token ID
   - Modal Token Secret
   - HuggingFace Token (Read + Write)
   - LLM provider API keys

---

## Hardware Selection Guide

### Auto-Selection (Recommended)

Set hardware to **`auto`** to let TraceMind automatically select the optimal hardware based on:
- Model size (extracted from the model name)
- Provider type (API vs local)
- Infrastructure (HF Jobs vs Modal)

**Auto-selection logic**:

**For API Models** (provider = `litellm` or `inference`):
- Always uses **CPU** (no GPU needed)
- HF Jobs: `cpu-basic`
- Modal: `cpu`

**For Local Models** (provider = `transformers`):

*Memory estimation for agentic workloads*:
- Model weights (FP16): ~2GB per 1B params
- KV cache for long contexts: ~1.5-2x model size
- Inference overhead: ~20-30% additional
- **Total: ~4-5GB per 1B params for safe execution**
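
That arithmetic can be sketched as follows; the multipliers are midpoints of the heuristic ranges above, not measured requirements:

```python
def estimate_vram_gb(params_billion: float) -> float:
    """Heuristic VRAM estimate for agentic workloads, per the figures above."""
    weights_gb = 2.0 * params_billion     # FP16 weights: ~2GB per 1B params
    with_kv_cache = weights_gb * 1.75     # weights + KV cache: ~1.5-2x model size
    return with_kv_cache * 1.25           # ~20-30% inference overhead

# The midpoints land at ~4.4GB per 1B params, inside the 4-5GB rule of thumb.
print(round(estimate_vram_gb(3), 1))  # 13.1 → a 3B model fits a 16GB T4
```
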

**HuggingFace Jobs**:

| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `t4-small` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `t4-small` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `a10g-large` | 24GB | Llama-3.1-8B, Mistral-7B |
| 13B+ | `a100-large` | 80GB | Llama-3.1-70B, Qwen-14B |

**Modal**:

| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `gpu_t4` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `gpu_t4` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `gpu_l40s` | 48GB | Llama-3.1-8B, Mistral-7B |
| 13B - 24B | `gpu_a100_80gb` | 80GB | Llama-2-13B, Qwen-14B |
| 25B - 48B | `gpu_a100_80gb` | 80GB | Gemma-27B, Yi-34B |
| 49B+ | `gpu_h200` | 141GB | Llama-3.1-70B, Qwen-72B |
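
The tables above can be sketched as a selection function. This is a simplified illustration, not TraceMind's actual implementation; the `infra` values `"hf"`/`"modal"` and the 7B fallback are assumptions made for the example:

```python
import re

def pick_hardware(model_id: str, provider: str, infra: str) -> str:
    """Map (model, provider, infrastructure) to a hardware flavor per the tables above."""
    if provider in ("litellm", "inference"):          # API models never need a GPU
        return "cpu-basic" if infra == "hf" else "cpu"
    # Pull the parameter count out of names like "Llama-3.1-8B-Instruct".
    match = re.search(r"(\d+(?:\.\d+)?)[bB](?![A-Za-z0-9])", model_id)
    size_b = float(match.group(1)) if match else 7.0  # assume 7B when unknown
    if infra == "hf":
        if size_b <= 5:
            return "t4-small"
        if size_b <= 12:
            return "a10g-large"
        return "a100-large"
    if size_b <= 5:
        return "gpu_t4"
    if size_b <= 12:
        return "gpu_l40s"
    if size_b <= 48:
        return "gpu_a100_80gb"
    return "gpu_h200"

print(pick_hardware("meta-llama/Llama-3.1-8B-Instruct", "transformers", "hf"))  # a10g-large
print(pick_hardware("openai/gpt-4", "litellm", "modal"))                        # cpu
```
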
### Manual Selection

If you know your model's requirements, you can select hardware manually:

**CPU Jobs** (API models like GPT-4, Claude):
- HF Jobs: `cpu-basic` or `cpu-upgrade`
- Modal: `cpu`

**Small Models** (1B-5B params):
- HF Jobs: `t4-small` (16GB VRAM)
- Modal: `gpu_t4` (16GB VRAM)
- Examples: Llama-3.2-3B, Gemma-2B, Qwen-2.5-3B

**Medium Models** (6B-12B params):
- HF Jobs: `a10g-small` or `a10g-large` (24GB VRAM)
- Modal: `gpu_l40s` (48GB VRAM)
- Examples: Llama-3.1-8B, Mistral-7B, Qwen-2.5-7B

**Large Models** (13B-24B params):
- HF Jobs: `a100-large` (80GB VRAM)
- Modal: `gpu_a100_80gb` (80GB VRAM)
- Examples: Llama-2-13B, Qwen-14B, Mistral-22B

**Very Large Models** (25B+ params):
- HF Jobs: `a100-large` (80GB VRAM) - may need quantization
- Modal: `gpu_h200` (141GB VRAM) - recommended
- Examples: Llama-3.1-70B, Qwen-72B, Gemma-27B

**Cost vs Performance Trade-offs**:
- T4: Cheapest GPU, good for small models
- L4: Newer architecture, better performance than T4
- A10G: Good balance of cost and performance for medium models
- L40S: Best for 7B-12B models (Modal only)
- A100: Industry standard for large models
- H200: Latest GPU with massive VRAM (141GB), best for 70B+ models

---

## Submitting a Job

### Step 1: Navigate to the New Evaluation Screen

1. Open TraceMind-AI
2. Click **▶️ New Evaluation** in the sidebar
3. You'll see a comprehensive configuration form

### Step 2: Configure Infrastructure

**Infrastructure Provider**:
- Choose `HuggingFace Jobs` or `Modal`

**Hardware**:
- Use `auto` (recommended) or select specific hardware
- See the [Hardware Selection Guide](#hardware-selection-guide)

### Step 3: Configure Model

**Model**:
- Enter a model ID (e.g., `openai/gpt-4`, `meta-llama/Llama-3.1-8B-Instruct`)
- Use HuggingFace format: `organization/model-name`

**Provider**:
- `litellm` - For API models (OpenAI, Anthropic, etc.)
- `inference` - For the HuggingFace Inference API
- `transformers` - For local models loaded with transformers

**HF Inference Provider** (optional):
- Leave empty unless using the HF Inference API
- Example: `openai-community/gpt2` for HF-hosted models

**HuggingFace Token** (optional):
- Leave empty if already configured in Settings
- Only needed for private models

### Step 4: Configure Agent

**Agent Type**:
- `tool` - Function calling agents only
- `code` - Code execution agents only
- `both` - Hybrid agents (recommended)

**Search Provider**:
- `duckduckgo` - Free, no API key required (recommended)
- `serper` - Requires a Serper API key
- `brave` - Requires a Brave Search API key

**Enable Optional Tools**:
- Select additional tools for the agent:
  - `google_search` - Google Search (requires API key)
  - `duckduckgo_search` - DuckDuckGo Search
  - `visit_webpage` - Web page scraping
  - `python_interpreter` - Python code execution
  - `wikipedia_search` - Wikipedia queries
  - `user_input` - User interaction (not recommended for batch eval)

### Step 5: Configure Test Dataset

**Dataset Name**:
- Default: `kshitijthakkar/smoltrace-tasks`
- Or use your own HuggingFace dataset
- Format: `username/dataset-name`

**Dataset Split**:
- Default: `train`
- Other options: `test`, `validation`

**Difficulty Filter**:
- `all` - All difficulty levels (recommended)
- `easy` - Easy tasks only
- `medium` - Medium tasks only
- `hard` - Hard tasks only

**Parallel Workers**:
- Default: `1` (sequential execution)
- Higher values (2-10) for faster execution
- ⚠️ Increases memory usage and the risk of hitting API rate limits

### Step 6: Configure Output & Monitoring

**Output Format**:
- `hub` - Push to HuggingFace datasets (recommended)
- `json` - Save locally (requires an output directory)

**Output Directory**:
- Only for `json` format
- Example: `./evaluation_results`

**Enable OpenTelemetry Tracing**:
- ✅ Recommended - Collects detailed execution traces
- Traces appear in the TraceMind trace visualization

**Enable GPU Metrics**:
- ✅ Recommended for GPU jobs
- Collects GPU utilization, memory, temperature, and CO2 emissions
- No effect on CPU jobs

**Private Datasets**:
- ☐ Make result datasets private on HuggingFace
- Default: public datasets

**Debug Mode**:
- ☐ Enable verbose logging for troubleshooting
- Default: off

**Quiet Mode**:
- ☐ Reduce output verbosity
- Default: off

**Run ID** (optional):
- Auto-generated UUID if left empty
- Custom ID for tracking specific runs

**Job Timeout**:
- Default: `1h` (1 hour)
- Other examples: `30m`, `2h`, `3h`
- The job is terminated if it exceeds the timeout
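
Timeout strings like `30m` or `2h` follow the usual duration shorthand. A parser sketch under that assumed format (TraceMind's exact parsing rules may differ):

```python
def parse_timeout(spec: str) -> int:
    """Convert a duration like '30m', '2h', or '90s' to seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    value, unit = spec[:-1], spec[-1].lower()
    if unit not in units or not value.isdigit():
        raise ValueError(f"invalid timeout: {spec!r}")
    return int(value) * units[unit]

print(parse_timeout("1h"))   # 3600
print(parse_timeout("30m"))  # 1800
```
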

### Step 7: Estimate Cost (Optional but Recommended)

1. Click the **💰 Estimate Cost** button
2. Wait for the AI-powered cost analysis
3. Review:
   - Estimated total cost
   - Estimated duration
   - Hardware selection (if auto)
   - Historical data (if available)

**Cost Estimation Sources**:
- **Historical Data**: Based on previous runs of the same model in the leaderboard
- **MCP AI Analysis**: AI-powered estimation using Gemini 2.5 Flash (if no historical data)

### Step 8: Submit the Job

1. Review all configurations
2. Click the **🚀 Submit Evaluation** button
3. Wait for the confirmation message
4. Copy the job ID for tracking

**The confirmation message includes**:
- ✅ Job submission status
- Job ID and platform-specific ID
- Hardware selected
- Estimated duration
- Monitoring instructions

### Example: Submit a HuggingFace Jobs Evaluation

```
Infrastructure: HuggingFace Jobs
Hardware: auto → a10g-large
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
→ Estimated Cost: $1.25
→ Duration: 25 minutes
→ Hardware: a10g-large (auto-selected)

Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ HF Job ID: username/job_abc123
→ Monitor at: https://huggingface.co/jobs
```

### Example: Submit a Modal Evaluation

```
Infrastructure: Modal
Hardware: auto → L40S
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
→ Estimated Cost: $0.95
→ Duration: 20 minutes
→ Hardware: gpu_l40s (auto-selected)

Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ Modal Call ID: modal-job_xyz789
→ Monitor at: https://modal.com/apps
```

---

## Cost Estimation

### Understanding Cost Estimates

TraceMind provides AI-powered cost estimation before you submit jobs:

**Historical Data** (most accurate):
- Based on actual runs of the same model
- Shows average cost and duration from past evaluations
- Displays the number of historical runs used

**MCP AI Analysis** (when no historical data exists):
- Powered by Google Gemini 2.5 Flash
- Analyzes model size, hardware, and provider
- Estimates cost based on typical usage patterns
- Includes a detailed breakdown and recommendations
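
The historical estimate is essentially an average over past runs of the same model. A sketch with made-up history records (the field names are illustrative, not TraceMind's actual schema):

```python
from statistics import mean

# Hypothetical past runs of the same model (cost in USD, duration in minutes).
history = [
    {"cost": 1.30, "duration_min": 26},
    {"cost": 1.18, "duration_min": 23},
    {"cost": 1.27, "duration_min": 25},
]

estimate = {
    "runs_used": len(history),
    "avg_cost": round(mean(r["cost"] for r in history), 2),
    "avg_duration_min": round(mean(r["duration_min"] for r in history), 1),
}
print(estimate)  # {'runs_used': 3, 'avg_cost': 1.25, 'avg_duration_min': 24.7}
```
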

### Cost Factors

**For HuggingFace Jobs**:
1. **Hardware per-second rate** (see [Infrastructure Options](#huggingface-jobs))
2. **Evaluation duration** (actual runtime only, billed per-second)
3. **LLM API costs** (if using API models like GPT-4)
4. **HF Pro subscription** ($9/month, required)

**For Modal**:
1. **Hardware per-second rate** (no minimums)
2. **Evaluation duration** (actual runtime only)
3. **Network egress** (data transfer out)
4. **LLM API costs** (if using API models)

### Cost Optimization Tips

**Use auto hardware selection**:
- Automatically picks the cheapest suitable hardware for your model
- Avoids over-provisioning (e.g., an H200 for a 3B model)

**Choose the right infrastructure**:
- **If you have HF Pro**: Use HF Jobs (you're already paying the subscription)
- **If you don't have HF Pro**: Use Modal (no subscription required)
- **For the latest GPUs (H200/H100)**: Use Modal (HF Jobs doesn't offer them)

**Optimize model selection**:
- Smaller models (3B-7B) are roughly 10x cheaper than large models (70B)
- API models (GPT-4o-mini) are often cheaper than local 70B models

**Reduce test count**:
- Use the difficulty filter (`easy` only) for quick validation
- Test with a small dataset first, then scale up

**Parallel workers**:
- Keep at 1 for sequential execution (cheapest)
- Increase only if time is critical (raises API call volume and cost)

**Example Cost Comparison**:

| Model | Hardware | Infrastructure | Duration | HF Jobs Cost | Modal Cost |
|-------|----------|----------------|----------|--------------|------------|
| GPT-4 (API) | CPU | Either | 5 min | Free* | ~$0.00* |
| Llama-3.1-8B | A10G-large | HF Jobs | 25 min | $0.63** | N/A |
| Llama-3.1-8B | L40S | Modal | 20 min | N/A | $0.65** |
| Llama-3.1-70B | A100-80GB | Both | 45 min | $1.88** | $1.88** |
| Llama-3.1-70B | H200 | Modal only | 35 min | N/A | $2.65** |

\* Plus LLM API costs (OpenAI/Anthropic/etc. - not included)
\** Per-second billing, actual runtime only (no minimums)

---

## Monitoring Jobs

### HuggingFace Jobs

**Via the HuggingFace Dashboard**:
1. Go to https://huggingface.co/jobs
2. Find your job in the list
3. Click it to view details and logs

**Via the TraceMind Job Monitoring Tab**:
1. Click **📈 Job Monitoring** in the sidebar
2. See all your submitted jobs
3. Get real-time status updates
4. Click a job to view its logs

**Job Statuses**:
- `pending` - Waiting for resources
- `running` - Currently executing
- `completed` - Finished successfully
- `failed` - Error occurred (check logs)
- `cancelled` - Manually stopped
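
Programmatically, waiting on those statuses is a simple polling loop. The `fetch_status` callable below is a stand-in for whatever client call your platform provides; here it is stubbed with a fixed sequence:

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def wait_for_job(fetch_status, poll_seconds: float = 30.0) -> str:
    """Poll until the job reaches a terminal status, then return it."""
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)

# Stubbed status sequence standing in for a real API:
states = iter(["pending", "running", "running", "completed"])
print(wait_for_job(lambda: next(states), poll_seconds=0))  # completed
```
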
### Modal

**Via the Modal Dashboard**:
1. Go to https://modal.com/apps
2. Find your app: `smoltrace-eval-{job_id}`
3. Click it to view real-time logs and metrics

**Via the TraceMind Job Monitoring Tab**:
1. Click **📈 Job Monitoring** in the sidebar
2. See all your submitted jobs
3. Modal jobs show as `submitted` (check the Modal dashboard for details)

### Viewing Job Logs

**HuggingFace Jobs**:
```
1. Go to the Job Monitoring tab
2. Click on your job
3. Click the "View Logs" button
4. See real-time output from SMOLTRACE
```

**Modal**:
```
1. Go to https://modal.com/apps
2. Find your app
3. Click the "Logs" tab
4. See streaming output in real-time
```

### Expected Job Duration

**API Models** (litellm provider):
- CPU job: 2-5 minutes for 100 tests
- No model download required
- Depends on API rate limits

**Local Models** (transformers provider):
- Model download: 5-15 minutes (one-time per job)
  - 3B model: ~6GB download
  - 8B model: ~16GB download
  - 70B model: ~140GB download
- Evaluation: 10-30 minutes for 100 tests
- Total: 15-45 minutes typical

**Progress Indicators**:
1. ⏳ Job queued (0-2 minutes)
2. 🔄 Downloading model (5-15 minutes for first run)
3. 🧪 Running evaluation (10-30 minutes)
4. 📤 Uploading results to HuggingFace (1-2 minutes)
5. ✅ Complete

---

## Understanding Job Results

### Where Results Are Stored

**HuggingFace Datasets** (if output_format = "hub"):

SMOLTRACE creates four datasets for each evaluation:

1. **Leaderboard Dataset**: `huggingface/smolagents-leaderboard`
   - Aggregate statistics for the run
   - Appears in the TraceMind Leaderboard tab
   - Public, shared across all users

2. **Results Dataset**: `{your_username}/agent-results-{model}-{timestamp}`
   - Individual test case results
   - Success/failure, execution time, tokens, cost
   - Links to the traces dataset

3. **Traces Dataset**: `{your_username}/agent-traces-{model}-{timestamp}`
   - OpenTelemetry traces (if enable_otel = True)
   - Detailed execution steps, LLM calls, tool usage
   - Viewable in the TraceMind Trace Visualization

4. **Metrics Dataset**: `{your_username}/agent-metrics-{model}-{timestamp}`
   - GPU metrics (if enable_gpu_metrics = True)
   - GPU utilization, memory, temperature, CO2 emissions
   - Time-series data for each test
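
Building one of those `{your_username}/agent-{kind}-{model}-{timestamp}` names can be sketched as below; the sanitization (dropping the org prefix) and the timestamp format are illustrative assumptions, not SMOLTRACE's exact rules:

```python
from datetime import datetime, timezone

def results_repo(username: str, model_id: str, kind: str = "results") -> str:
    """Build a '{username}/agent-{kind}-{model}-{timestamp}' dataset name."""
    model = model_id.split("/")[-1]  # drop the organization prefix
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{username}/agent-{kind}-{model}-{stamp}"

name = results_repo("alice", "meta-llama/Llama-3.1-8B-Instruct")
print(name)  # e.g. alice/agent-results-Llama-3.1-8B-Instruct-20250101-120000
```
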
**Local JSON Files** (if output_format = "json"):
- Saved to `output_dir` on the job machine
- Not automatically uploaded to HuggingFace
- Useful for local testing

### Viewing Results in TraceMind

**Step 1: Refresh the Leaderboard**
1. Go to the **📊 Leaderboard** tab
2. Click the **Load Leaderboard** button
3. Your new run appears in the table

**Step 2: View Run Details**
1. Click on your run in the leaderboard
2. See detailed test results:
   - Individual test cases
   - Success/failure breakdown
   - Execution times
   - Token usage
   - Costs

**Step 3: Visualize Traces** (if enable_otel = True)
1. From run details, click on a test case
2. Click the **View Trace** button
3. See the OpenTelemetry waterfall diagram
4. Analyze:
   - LLM calls and durations
   - Tool executions
   - Reasoning steps
   - GPU metrics overlay (if a GPU job)

**Step 4: Ask Questions About Results**
1. Go to the **🤖 Agent Chat** tab
2. Ask questions like:
   - "Analyze my latest evaluation run"
   - "Why did test case 5 fail?"
   - "Compare my run with the top model"
   - "What was the cost breakdown?"

### Interpreting Results

**Key Metrics**:

| Metric | Description | Good Value |
|--------|-------------|------------|
| **Success Rate** | % of tests passed | >90% excellent, >70% good |
| **Avg Duration** | Time per test case | <5s good, <10s acceptable |
| **Total Cost** | Cost for all tests | Varies by model |
| **Tokens Used** | Total tokens consumed | Lower is better |
| **CO2 Emissions** | Carbon footprint | Lower is better |
| **GPU Utilization** | GPU usage % | >60% efficient |

**Common Patterns**:

**High accuracy, low cost**:
- ✅ Excellent model for production
- Examples: GPT-4o-mini, Claude-3-Haiku, Gemini-1.5-Flash

**High accuracy, high cost**:
- ✅ Best for quality-critical tasks
- Examples: GPT-4, Claude-3.5-Sonnet, Gemini-1.5-Pro

**Low accuracy, low cost**:
- ⚠️ May need prompt optimization or a better model
- Examples: Small local models (<3B params)

**Low accuracy, high cost**:
- ❌ Poor choice; investigate or switch models
- May indicate configuration issues

---

## Troubleshooting

### Job Submission Failures

**Error: "HuggingFace token not configured"**
- **Cause**: Missing or invalid HF token
- **Fix**:
  1. Go to the Settings tab
  2. Add an HF token with "Read + Write + Run Jobs" permissions
  3. Click "Save API Keys"

**Error: "HuggingFace Pro subscription required"**
- **Cause**: HF Jobs requires a Pro subscription
- **Fix**:
  1. Subscribe at https://huggingface.co/pricing ($9/month)
  2. Add a credit card for GPU charges
  3. Try again

**Error: "Modal credentials not configured"**
- **Cause**: Missing Modal API tokens
- **Fix**:
  1. Go to https://modal.com/settings/tokens
  2. Create a new token
  3. Copy the Token ID and Token Secret
  4. Add them in the Settings tab
  5. Try again

**Error: "Modal package not installed"**
- **Cause**: Modal SDK missing (should not happen in the hosted Space)
- **Fix**: Contact support or run locally with `pip install modal`

### Job Execution Failures

**Job stuck in "Pending" status**
- **Cause**: High demand for GPU resources
- **Fix**:
  - Wait 5-10 minutes
  - Try different hardware (e.g., T4 instead of A100)
  - Try the other infrastructure (Modal vs HF Jobs)

**Job fails with "Out of Memory"**
- **Cause**: Model too large for the selected hardware
- **Fix**:
  - Use a larger GPU (A100-80GB or H200)
  - Or use `auto` hardware selection
  - Or reduce `parallel_workers` to 1

**Job fails with "Model not found"**
- **Cause**: Invalid model ID or private model
- **Fix**:
  - Check the model ID format: `organization/model-name`
  - For private models, add an HF token with access
  - Verify the model exists on the HuggingFace Hub

**Job fails with "API key not set"**
- **Cause**: Missing LLM provider API key
- **Fix**:
  1. Go to the Settings tab
  2. Add the API key for your provider (OpenAI, Anthropic, etc.)
  3. Submit the job again

**Job fails with "Rate limit exceeded"**
- **Cause**: Too many API requests
- **Fix**:
  - Reduce `parallel_workers` to 1
  - Use a different model with higher rate limits
  - Wait and retry later

**Modal job fails with "Authentication failed"**
- **Cause**: Invalid Modal tokens
- **Fix**:
  1. Go to https://modal.com/settings/tokens
  2. Create a new token (the old one may be expired)
  3. Update the tokens in the Settings tab

### Results Not Appearing

**Results not in the leaderboard after the job completes**
- **Cause**: Dataset upload failed or not configured
- **Fix**:
  - Check the job logs for errors
  - Verify `output_format` was set to "hub"
  - Verify the HF token has "Write" permission
  - Manually refresh the leaderboard (click "Load Leaderboard")

**Traces not appearing**
- **Cause**: OpenTelemetry not enabled
- **Fix**:
  - Re-run the evaluation with `enable_otel = True`
  - Check that the traces dataset exists on your HF profile

**GPU metrics not showing**
- **Cause**: GPU metrics not enabled, or a CPU job
- **Fix**:
  - Re-run with `enable_gpu_metrics = True`
  - Verify the job used GPU hardware (not CPU)
  - Check that the metrics dataset exists

---
808
+
+ ## Advanced Configuration
+
+ ### Custom Test Datasets
+
+ **Create your own test dataset**:
+
+ 1. Use **🔬 Synthetic Data Generator** tab:
+    - Configure domain and tools
+    - Generate custom tasks
+    - Push to HuggingFace Hub
+
+ 2. Use generated dataset in evaluation:
+    - Set `dataset_name` to your dataset: `{username}/dataset-name`
+    - Configure agent with matching tools
+
+ **Dataset Format Requirements**:
+ ```python
+ {
+     "task_id": "task_001",
+     "prompt": "What's the weather in Tokyo?",
+     "expected_tool": "get_weather",
+     "difficulty": "easy",
+     "category": "tool_usage"
+ }
+ ```
+
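Validating records locally before pushing a dataset avoids failed jobs later. This is a hedged sketch: the required fields come from the example above, while the `easy`/`medium`/`hard` scale is an assumption (the example only shows `"easy"`).

```python
REQUIRED_FIELDS = {"task_id", "prompt", "expected_tool", "difficulty", "category"}
DIFFICULTIES = {"easy", "medium", "hard"}  # assumed scale; docs show "easy"

def validate_task(record: dict) -> list:
    """Return a list of problems with a task record; empty list means valid."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - record.keys())]
    if "difficulty" in record and record["difficulty"] not in DIFFICULTIES:
        problems.append(f"unknown difficulty: {record['difficulty']!r}")
    return problems
```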
+ ### Environment Variables
+
+ **LLM Provider API Keys** (in Settings):
+ - `OPENAI_API_KEY` - OpenAI API
+ - `ANTHROPIC_API_KEY` - Anthropic API
+ - `GOOGLE_API_KEY` or `GEMINI_API_KEY` - Google Gemini API
+ - `COHERE_API_KEY` - Cohere API
+ - `MISTRAL_API_KEY` - Mistral API
+ - `TOGETHER_API_KEY` - Together AI API
+ - `GROQ_API_KEY` - Groq API
+ - `REPLICATE_API_TOKEN` - Replicate API
+ - `ANYSCALE_API_KEY` - Anyscale API
+
+ **Infrastructure Credentials**:
+ - `HF_TOKEN` - HuggingFace token
+ - `MODAL_TOKEN_ID` - Modal token ID
+ - `MODAL_TOKEN_SECRET` - Modal token secret
+
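Checking that required credentials are set before submitting a job gives a clear error up front instead of a cryptic authentication failure mid-run. A small sketch using the standard library:

```python
import os

def require_env(*names):
    """Fail fast with a clear message when any credential is unset."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {n: os.environ[n] for n in names}
```

For example, a Modal submission path might call `require_env("HF_TOKEN", "MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")` before doing any work.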
+ ### Parallel Execution
+
+ **Use `parallel_workers` to speed up evaluation**:
+
+ - `1` - Sequential execution (default, safest)
+ - `2-4` - Moderate parallelism (2-4x faster)
+ - `5-10` - High parallelism (5-10x faster, risky)
+
+ **Trade-offs**:
+ - ✅ **Faster**: Near-linear speedup with worker count
+ - ⚠️ **Higher cost**: More API calls per minute
+ - ⚠️ **Rate limits**: May hit provider rate limits
+ - ⚠️ **Memory**: Increases GPU memory usage
+
+ **Recommendations**:
+ - API models: Keep at 1 (avoid rate limits)
+ - Local models: Can use 2-4 if the GPU has enough VRAM
+ - Production runs: Use 1 for reliability
+
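The worker setting can be thought of as a thin wrapper over a thread pool. This is an illustrative sketch (not the tool's implementation) of how `parallel_workers` might dispatch evaluation tasks while preserving result order:

```python
from concurrent.futures import ThreadPoolExecutor

def run_tasks(tasks, evaluate, parallel_workers: int = 1):
    """Run `evaluate` over tasks, preserving input order in the results.
    parallel_workers=1 degrades to plain sequential execution."""
    if parallel_workers == 1:
        return [evaluate(task) for task in tasks]
    with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
        return list(pool.map(evaluate, tasks))  # map keeps input order
```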
+ ### Private Datasets
+
+ **Make results private**:
+
+ 1. Set `private = True` in the job configuration
+ 2. Results will be private on your HuggingFace profile
+ 3. Only you can view them in the leaderboard (if using a private leaderboard dataset)
+
+ **Use cases**:
+ - Proprietary models
+ - Confidential evaluation data
+ - Internal benchmarking
+
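A hypothetical job configuration with privacy enabled might look like the following; the field names mirror the settings described in this guide, and the commented upload call shows how the flag would typically flow through to the Hub (the `datasets` library's `push_to_hub` accepts a `private` parameter):

```python
# Hypothetical job configuration; field names follow this guide's settings.
job_config = {
    "model_id": "my-org/private-model",
    "dataset_name": "my-org/confidential-eval",
    "output_format": "hub",
    "private": True,  # results dataset is created as private on the Hub
}

# Downstream, the uploader would forward the flag, e.g. (not executed here):
# results.push_to_hub(repo_id, private=job_config["private"])
```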
+ ---
+
+ ## Quick Reference
+
+ ### Job Submission Checklist
+
+ Before submitting a job, verify:
+
+ - [ ] Infrastructure selected (HF Jobs or Modal)
+ - [ ] Hardware configured (auto or manual)
+ - [ ] Model ID is correct
+ - [ ] Provider matches model type
+ - [ ] API keys configured in Settings
+ - [ ] Dataset name is valid
+ - [ ] Output format is "hub" for TraceMind integration
+ - [ ] OpenTelemetry tracing enabled (if you want traces)
+ - [ ] GPU metrics enabled (if using GPU)
+ - [ ] Cost estimate reviewed
+ - [ ] Timeout is sufficient for your model size
+
+ ### Common Model Configurations
+
+ **OpenAI GPT-4**:
+ ```
+ Model: openai/gpt-4
+ Provider: litellm
+ Hardware: auto → cpu-basic
+ Infrastructure: Either (HF Jobs or Modal)
+ Estimated Cost: API costs only
+ ```
+
+ **Anthropic Claude-3.5-Sonnet**:
+ ```
+ Model: anthropic/claude-3.5-sonnet
+ Provider: litellm
+ Hardware: auto → cpu-basic
+ Infrastructure: Either (HF Jobs or Modal)
+ Estimated Cost: API costs only
+ ```
+
+ **Meta Llama-3.1-8B**:
+ ```
+ Model: meta-llama/Llama-3.1-8B-Instruct
+ Provider: transformers
+ Hardware: auto → a10g-large (HF) or gpu_l40s (Modal)
+ Infrastructure: Modal (cheaper for short jobs)
+ Estimated Cost: $0.75-1.50
+ ```
+
+ **Meta Llama-3.1-70B**:
+ ```
+ Model: meta-llama/Llama-3.1-70B-Instruct
+ Provider: transformers
+ Hardware: auto → a100-large (HF) or gpu_h200 (Modal)
+ Infrastructure: Modal (if available), else HF Jobs
+ Estimated Cost: $3.00-8.00
+ ```
+
+ **Qwen-2.5-Coder-32B**:
+ ```
+ Model: Qwen/Qwen2.5-Coder-32B-Instruct
+ Provider: transformers
+ Hardware: auto → a100-large (HF) or gpu_a100_80gb (Modal)
+ Infrastructure: Either
+ Estimated Cost: $2.00-4.00
+ ```
+
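Since both platforms bill per second with no minimums, an estimate is just the hourly rate scaled by runtime. A small sketch; the rates table is illustrative (within the ranges quoted in this guide), so always check current platform pricing:

```python
# Illustrative USD/hour rates; always verify against current pricing pages.
HOURLY_RATES = {
    "a10g-large": 1.50,
    "a100-large": 2.50,
}

def estimate_cost(hardware: str, runtime_seconds: float) -> float:
    """Per-second billing with no minimums: cost scales with exact runtime."""
    return round(HOURLY_RATES[hardware] / 3600 * runtime_seconds, 4)
```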
+ ---
+
+ ## Next Steps
+
+ After submitting your first job:
+
+ 1. **Monitor progress** in Job Monitoring tab
+ 2. **View results** in Leaderboard when complete
+ 3. **Analyze traces** in Trace Visualization
+ 4. **Ask questions** in Agent Chat about your results
+ 5. **Compare** with other models using Compare feature
+ 6. **Optimize** model selection based on cost/accuracy trade-offs
+ 7. **Generate** custom test datasets for your domain
+ 8. **Share** your results with the community
+
+ For more help:
+ - [USER_GUIDE.md](USER_GUIDE.md) - Complete screen-by-screen walkthrough
+ - [MCP_INTEGRATION.md](MCP_INTEGRATION.md) - MCP client architecture details
+ - [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture overview
+ - GitHub Issues: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)
README.md CHANGED
@@ -327,9 +327,9 @@ To prevent rate limits during evaluation:
  - **Agent Framework**: smolagents 1.22.0+
  - **MCP Integration**: MCP Python SDK + smolagents MCPClient
  - **Data Source**: HuggingFace Datasets API
- - **Authentication**: HuggingFace OAuth
+ - **Authentication**: HuggingFace OAuth (planned)
  - **AI Models**:
-   - Agent: Qwen/Qwen2.5-Coder-32B-Instruct (HF API)
+   - Agent: Google Gemini 2.5 Flash
    - MCP Server: Google Gemini 2.5 Flash
  - **Cloud Platforms**: HuggingFace Jobs + Modal