sabaridsnfuji committed on
Commit c7643a1 · verified · 1 Parent(s): fda14af

updated the code

Files changed (3)
  1. README.md +87 -12
  2. app.py.py +501 -0
  3. requirements.txt +9 -0
README.md CHANGED
@@ -1,14 +1,89 @@
1
- ---
2
- title: Image Analysis Qwen2.5 VL
3
- emoji: 👁
4
- colorFrom: yellow
5
- colorTo: red
6
- sdk: gradio
7
- sdk_version: 5.38.0
8
- app_file: app.py
9
- pinned: false
10
- license: mit
11
- short_description: 🚀 Powerful Vision-Language Models for Advanced Image Unders
12
  ---
13
 
14
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
1
+ # 🖼️ Qwen2.5-VL Image Analysis
2
+
3
+ A powerful vision-language model interface for advanced image understanding tasks, built with Gradio and optimized for Hugging Face Spaces.
4
+
5
+ ## 🚀 Features
6
+
7
+ - **Multi-Model Support**: Choose between Qwen2.5-VL-3B (faster) and Qwen2.5-VL-7B (higher quality)
8
+ - **Advanced Image Analysis**: OCR, object detection, scene description, emotion analysis
9
+ - **Real-time Streaming**: See responses generated in real-time
10
+ - **Customizable Parameters**: Adjust temperature, top-p, top-k, and more
11
+ - **User-friendly Interface**: Clean, modern UI with example prompts
12
+
13
+ ## 🤖 Supported Models
14
+
15
+ ### Qwen2.5-VL-3B-Instruct
16
+ - **Speed**: ⚡ Fast inference
17
+ - **Memory**: Lower GPU memory requirements
18
+ - **Use Case**: Quick analysis, general image understanding
19
+
20
+ ### Qwen2.5-VL-7B-Instruct
21
+ - **Quality**: 🔬 Higher accuracy and detail
22
+ - **Memory**: Higher GPU memory requirements
23
+ - **Use Case**: Complex analysis, detailed OCR, research tasks
24
+
25
+ ## 💡 Use Cases
26
+
27
+ - 📄 **Document OCR**: Extract and transcribe text from images
28
+ - 🔍 **Object Detection**: Identify and count objects in images
29
+ - 🎨 **Art Analysis**: Analyze composition, colors, and artistic style
30
+ - 📊 **Chart Interpretation**: Understand graphs, charts, and diagrams
31
+ - 😊 **Emotion Detection**: Identify emotions and moods in images
32
+ - 🏛️ **Scene Understanding**: Describe locations, settings, and contexts
33
+ - 🔒 **Safety Analysis**: Identify potential hazards or safety concerns
34
+
35
+ ## 🛠️ Usage Instructions
36
+
37
+ 1. **Load Model**: Select your preferred model (3B or 7B) and click "Load Selected Model"
38
+ 2. **Upload Image**: Click the image upload area and select your image
39
+ 3. **Ask Question**: Enter your question about the image in the text box
40
+ 4. **Analyze**: Click "Analyze Image" or press Enter to get the AI response
41
+ 5. **Adjust Settings**: Use the Advanced Settings accordion for fine-tuning
42
+
43
+ ## 📝 Example Prompts
44
+
45
+ - "Describe this image in detail"
46
+ - "What text is visible in this image?"
47
+ - "Count the number of people in this image"
48
+ - "What emotions are expressed in this image?"
49
+ - "Analyze the artistic style and composition"
50
+ - "What safety concerns can you identify?"
51
+
52
+ ## ⚙️ Advanced Settings
53
+
54
+ - **Max New Tokens**: Control response length (1-4096)
55
+ - **Temperature**: Adjust creativity (0.1-2.0)
56
+ - **Top-p**: Control diversity (0.1-1.0)
57
+ - **Top-k**: Vocabulary limit per step (1-100)
58
+ - **Repetition Penalty**: Prevent repetitive text (1.0-2.0)
59
+ - **Stream Output**: Enable real-time response streaming
60
+
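
These knobs map onto the standard `transformers` sampling arguments. A minimal sketch of that mapping, using the app's slider defaults (nothing here is specific to this Space beyond those default values):

```python
from transformers import GenerationConfig

# Sketch: the Advanced Settings expressed as a transformers GenerationConfig.
# Values mirror the app's slider defaults.
generation_config = GenerationConfig(
    max_new_tokens=1024,      # Max New Tokens (1-4096)
    do_sample=True,           # required for temperature/top-p/top-k to apply
    temperature=0.6,          # creativity (0.1-2.0)
    top_p=0.9,                # nucleus-sampling diversity (0.1-1.0)
    top_k=50,                 # vocabulary limit per step (1-100)
    repetition_penalty=1.2,   # discourage repetitive text (1.0-2.0)
)
# Passed along as model.generate(**inputs, generation_config=generation_config).
```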
61
+ ## 🔧 Technical Details
62
+
63
+ - **Framework**: Gradio 4.0+
64
+ - **Models**: Qwen2.5-VL series by Alibaba Cloud
65
+ - **Hardware**: GPU-optimized for CUDA devices
66
+ - **Precision**: FP16 for efficient inference
67
+
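
A minimal loading sketch under these settings (FP16 weights, automatic device placement), mirroring the `from_pretrained` call in `app.py.py`; swap in the 7B model ID for the larger checkpoint:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"

# FP16 weights with automatic GPU placement, as used by the app.
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
).eval()
```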
68
+ ## 📋 Requirements
69
+
70
+ - Python 3.8+
71
+ - CUDA-compatible GPU (recommended)
72
+ - 8GB+ GPU memory for 3B model
73
+ - 16GB+ GPU memory for 7B model
74
+
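
One hedged way to pick a checkpoint from the memory actually available; the helper below is illustrative and not part of the app, with thresholds following the rough guidance above:

```python
import torch

def pick_model() -> str:
    """Illustrative helper: choose a checkpoint by available GPU memory."""
    if not torch.cuda.is_available():
        return "Qwen/Qwen2.5-VL-3B-Instruct"  # CPU fallback: smaller model
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    # ~16 GB+ comfortably fits the 7B model in FP16; otherwise stay with 3B.
    return "Qwen/Qwen2.5-VL-7B-Instruct" if total_gb >= 16 else "Qwen/Qwen2.5-VL-3B-Instruct"
```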
75
+ ## 🚀 Deployment
76
+
77
+ This app is optimized for Hugging Face Spaces with automatic GPU detection and model loading.
78
+
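
The launch call in `app.py.py` uses the standard Spaces settings (bind to all interfaces on port 7860); a trimmed sketch:

```python
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("Placeholder UI - the full interface is assembled in app.py.py.")

# Spaces serves the app on port 7860 and expects it to listen on 0.0.0.0.
demo.launch(server_name="0.0.0.0", server_port=7860, show_error=True)
```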
79
+ ## 📄 License
80
+
81
+ This project uses the Qwen2.5-VL models which have their own licensing terms. Please refer to the original model repositories for license information.
82
+
83
+ ## 🤝 Contributing
84
+
85
+ Feel free to submit issues, suggestions, or pull requests to improve this application!
86
+
87
  ---
88
 
89
+ Built with ❤️ using [Gradio](https://gradio.app/) and [Qwen2.5-VL](https://huggingface.co/Qwen) models.
app.py.py ADDED
@@ -0,0 +1,501 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Qwen2.5-VL Image Analysis App for Hugging Face Spaces
4
+ A powerful vision-language model interface for image understanding tasks.
5
+ """
6
+
7
+ import os
8
+ import time
9
+ from threading import Thread
10
+ import gradio as gr
11
+ import torch
12
+ from PIL import Image
13
+ from transformers import (
14
+ Qwen2_5_VLForConditionalGeneration,
15
+ AutoProcessor,
16
+ TextIteratorStreamer,
17
+ )
18
+
19
+ # Constants
20
+ MAX_MAX_NEW_TOKENS = 4096
21
+ DEFAULT_MAX_NEW_TOKENS = 1024
22
+ MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "8192"))
23
+ MAX_SEQUENCE_LENGTH = 12288
24
+
25
+ # Device configuration
26
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
27
+ print(f"🚀 Using device: {device}")
28
+
29
+ # Global variables for models
30
+ model_7b = None
31
+ processor_7b = None
32
+ model_3b = None
33
+ processor_3b = None
34
+
35
+ def load_model_with_progress(model_name, progress=gr.Progress()):
36
+ """Load model and processor with progress tracking"""
37
+ global model_7b, processor_7b, model_3b, processor_3b
38
+
39
+ try:
40
+ if model_name == "Qwen/Qwen2.5-VL-7B-Instruct":
41
+ if model_7b is None:
42
+ progress(0.1, desc="🔄 Loading Qwen2.5-VL-7B-Instruct...")
43
+ processor_7b = AutoProcessor.from_pretrained(
44
+ model_name,
45
+ trust_remote_code=True
46
+ )
47
+ progress(0.5, desc="🔄 Loading model weights...")
48
+ model_7b = Qwen2_5_VLForConditionalGeneration.from_pretrained(
49
+ model_name,
50
+ trust_remote_code=True,
51
+ torch_dtype=torch.float16,
52
+ device_map="auto",
53
+ low_cpu_mem_usage=True,
54
+ ).eval()
55
+ progress(1.0, desc="✅ 7B model loaded successfully!")
56
+ return "✅ Qwen2.5-VL-7B-Instruct loaded and ready!", True
57
+ return "✅ Qwen2.5-VL-7B-Instruct already loaded!", True
58
+ else: # 3B model
59
+ if model_3b is None:
60
+ progress(0.1, desc="🔄 Loading Qwen2.5-VL-3B-Instruct...")
61
+ processor_3b = AutoProcessor.from_pretrained(
62
+ model_name,
63
+ trust_remote_code=True
64
+ )
65
+ progress(0.5, desc="🔄 Loading model weights...")
66
+ model_3b = Qwen2_5_VLForConditionalGeneration.from_pretrained(
67
+ model_name,
68
+ trust_remote_code=True,
69
+ torch_dtype=torch.float16,
70
+ device_map="auto",
71
+ low_cpu_mem_usage=True,
72
+ ).eval()
73
+ progress(1.0, desc="✅ 3B model loaded successfully!")
74
+ return "✅ Qwen2.5-VL-3B-Instruct loaded and ready!", True
75
+ return "✅ Qwen2.5-VL-3B-Instruct already loaded!", True
76
+ except Exception as e:
77
+ error_msg = f"❌ Failed to load {model_name}: {str(e)}"
78
+ print(error_msg)
79
+ return error_msg, False
80
+
81
+ def get_loaded_model(model_name):
82
+ """Get already loaded model and processor"""
83
+ global model_7b, processor_7b, model_3b, processor_3b
84
+
85
+ if model_name == "Qwen/Qwen2.5-VL-7B-Instruct":
86
+ return model_7b, processor_7b
87
+ else:
88
+ return model_3b, processor_3b
89
+
90
+ def generate_image_response(model_name: str,
91
+ text: str,
92
+ image: Image.Image,
93
+ max_new_tokens: int = 1024,
94
+ max_sequence_length: int = 8192,
95
+ temperature: float = 0.6,
96
+ top_p: float = 0.9,
97
+ top_k: int = 50,
98
+ repetition_penalty: float = 1.2,
99
+ stream_output: bool = True):
100
+ """
101
+ Generate responses using the selected model for image input.
102
+ Always yields exactly 2 values: (raw_text, markdown_text)
103
+ """
104
+ if image is None:
105
+ yield "❌ Please upload an image first.", "❌ Please upload an image first."
106
+ return
107
+
108
+ if not text.strip():
109
+ yield "❌ Please enter a question about the image.", "❌ Please enter a question about the image."
110
+ return
111
+
112
+ # Check if model is loaded
113
+ model, processor = get_loaded_model(model_name)
114
+ if model is None or processor is None:
115
+ yield "❌ Please select and load a model first.", "❌ Please select and load a model first."
116
+ return
117
+
118
+ try:
119
+ # Prepare messages
120
+ messages = [{
121
+ "role": "user",
122
+ "content": [
123
+ {"type": "image", "image": image},
124
+ {"type": "text", "text": text},
125
+ ]
126
+ }]
127
+
128
+ # Apply chat template
129
+ prompt_full = processor.apply_chat_template(
130
+ messages,
131
+ tokenize=False,
132
+ add_generation_prompt=True
133
+ )
134
+
135
+ # Prepare inputs with sequence length limit
136
+ inputs = processor(
137
+ text=[prompt_full],
138
+ images=[image],
139
+ return_tensors="pt",
140
+ padding=True,
141
+ truncation=True,
142
+ max_length=min(max_sequence_length, MAX_SEQUENCE_LENGTH)
143
+ ).to(device)
144
+
145
+ if stream_output:
146
+ # Streaming generation
147
+ streamer = TextIteratorStreamer(
148
+ processor.tokenizer,
149
+ skip_prompt=True,
150
+ skip_special_tokens=True
151
+ )
152
+
153
+ generation_kwargs = {
154
+ **inputs,
155
+ "streamer": streamer,
156
+ "max_new_tokens": max_new_tokens,
157
+ "do_sample": True,
158
+ "temperature": temperature,
159
+ "top_p": top_p,
160
+ "top_k": top_k,
161
+ "repetition_penalty": repetition_penalty,
162
+ "pad_token_id": processor.tokenizer.eos_token_id,
163
+ }
164
+
165
+ thread = Thread(target=model.generate, kwargs=generation_kwargs)
166
+ thread.start()
167
+
168
+ buffer = ""
169
+ try:
170
+ for new_text in streamer:
171
+ buffer += new_text
172
+ time.sleep(0.01)
173
+ # Always yield exactly 2 values
174
+ yield buffer, buffer
175
+ thread.join()
176
+ except Exception as stream_error:
177
+ thread.join()
178
+ error_msg = f"❌ Streaming Error: {str(stream_error)}"
179
+ yield error_msg, error_msg
180
+ else:
181
+ # Complete generation
182
+ generation_kwargs = {
183
+ **inputs,
184
+ "max_new_tokens": max_new_tokens,
185
+ "do_sample": True,
186
+ "temperature": temperature,
187
+ "top_p": top_p,
188
+ "top_k": top_k,
189
+ "repetition_penalty": repetition_penalty,
190
+ "pad_token_id": processor.tokenizer.eos_token_id,
191
+ }
192
+
193
+ with torch.no_grad():
194
+ outputs = model.generate(**generation_kwargs)
195
+
196
+ # Decode response
197
+ generated_ids = outputs[0][len(inputs['input_ids'][0]):]
198
+ response = processor.tokenizer.decode(generated_ids, skip_special_tokens=True)
199
+ # Always yield exactly 2 values
200
+ yield response, response
201
+
202
+ except Exception as e:
203
+ error_msg = f"❌ Error: {str(e)}"
204
+ print(f"Generation error: {e}")
205
+ # Always yield exactly 2 values
206
+ yield error_msg, error_msg
207
+
208
+ # Define comprehensive examples for image inference
209
+ image_examples = [
210
+ ["Describe this image in detail.", None],
211
+ ["What objects can you see in this image? Count them if possible.", None],
212
+ ["Extract and transcribe any text visible in this image.", None],
213
+ ["Analyze the composition, colors, and artistic style of this image.", None],
214
+ ["What emotions or mood does this image convey?", None],
215
+ ["Identify any people in the image and describe what they are doing.", None],
216
+ ["What is the setting or location of this image?", None],
217
+ ["Are there any safety concerns or hazards visible in this image?", None]
218
+ ]
219
+
220
+ # Custom CSS for better UI
221
+ css = """
222
+ .gradio-container {
223
+ max-width: 1200px !important;
224
+ margin: auto !important;
225
+ }
226
+ .submit-btn {
227
+ background: linear-gradient(45deg, #2980b9, #3498db) !important;
228
+ color: white !important;
229
+ border-radius: 8px !important;
230
+ font-weight: bold !important;
231
+ border: none !important;
232
+ }
233
+ .submit-btn:hover {
234
+ background: linear-gradient(45deg, #3498db, #5dade2) !important;
235
+ transform: translateY(-1px) !important;
236
+ box-shadow: 0 4px 8px rgba(0,0,0,0.2) !important;
237
+ }
238
+ .clear-btn {
239
+ background: linear-gradient(45deg, #e74c3c, #ec7063) !important;
240
+ color: white !important;
241
+ border-radius: 8px !important;
242
+ border: none !important;
243
+ }
244
+ .output-box {
245
+ border: 2px solid #3498db !important;
246
+ border-radius: 12px !important;
247
+ padding: 20px !important;
248
+ background: linear-gradient(145deg, #f8f9fa, #e9ecef) !important;
249
+ box-shadow: inset 0 2px 4px rgba(0,0,0,0.1) !important;
250
+ }
251
+ .model-info {
252
+ background: linear-gradient(145deg, #ebf3fd, #d6eaf8) !important;
253
+ border-radius: 10px !important;
254
+ padding: 15px !important;
255
+ border-left: 4px solid #3498db !important;
256
+ }
257
+ .image-upload {
258
+ border: 2px dashed #3498db !important;
259
+ border-radius: 10px !important;
260
+ padding: 20px !important;
261
+ text-align: center !important;
262
+ }
263
+ """
264
+
265
+ def create_interface():
266
+ """Create the main Gradio interface"""
267
+ with gr.Blocks(css=css, title="🖼️ Qwen2.5-VL Image Analysis", theme=gr.themes.Soft()) as demo:
268
+ gr.Markdown("""
269
+ # 🖼️ **Qwen2.5-VL Image Analysis**
270
+ ### 🚀 Powerful Vision-Language Models for Advanced Image Understanding
271
+
272
+ Upload any image and ask questions about it! Perfect for OCR, object detection, image description, and more.
273
+ """)
274
+
275
+ with gr.Row():
276
+ with gr.Column(scale=1):
277
+ gr.Markdown("### 🤖 **Model Selection**")
278
+
279
+ model_choice = gr.Dropdown(
280
+ choices=["Qwen/Qwen2.5-VL-3B-Instruct", "Qwen/Qwen2.5-VL-7B-Instruct"],
281
+ label="🔮 Select Vision-Language Model",
282
+ value="Qwen/Qwen2.5-VL-3B-Instruct",
283
+ info="Choose your model: 3B (faster) or 7B (better quality)",
284
+ elem_classes="model-dropdown"
285
+ )
286
+
287
+ model_status = gr.Textbox(
288
+ label="📊 Model Status",
289
+ value="⏳ Click 'Load Model' to initialize...",
290
+ interactive=False,
291
+ elem_classes="model-status"
292
+ )
293
+
294
+ load_model_btn = gr.Button(
295
+ "🚀 Load Selected Model",
296
+ elem_classes="load-model-btn",
297
+ variant="primary"
298
+ )
299
+
300
+ gr.Markdown("---")
301
+ gr.Markdown("### 📝 **Input Section**")
302
+
303
+ image_query = gr.Textbox(
304
+ label="🤔 Your Question About the Image",
305
+ placeholder="e.g., 'Describe this image in detail', 'What text is visible?', 'Count the objects'...",
306
+ lines=3,
307
+ value="Describe this image in detail."
308
+ )
309
+
310
+ image_upload = gr.Image(
311
+ type="pil",
312
+ label="📸 Upload Your Image",
313
+ elem_classes="image-upload"
314
+ )
315
+
316
+ with gr.Row():
317
+ clear_btn = gr.Button("🗑️ Clear All", elem_classes="clear-btn", variant="secondary")
318
+ analyze_btn = gr.Button("🔍 Analyze Image", elem_classes="submit-btn", variant="primary", scale=2)
319
+
320
+ gr.Examples(
321
+ examples=image_examples,
322
+ inputs=[image_query, image_upload],
323
+ label="💡 **Try These Example Prompts:**"
324
+ )
325
+
326
+ with gr.Accordion("⚙️ **Advanced Settings**", open=False):
327
+ with gr.Row():
328
+ max_new_tokens = gr.Slider(
329
+ label="📝 Max New Tokens",
330
+ minimum=1,
331
+ maximum=MAX_MAX_NEW_TOKENS,
332
+ step=1,
333
+ value=DEFAULT_MAX_NEW_TOKENS,
334
+ info="Maximum tokens in the response"
335
+ )
336
+
337
+ max_sequence_length = gr.Slider(
338
+ label="📏 Max Sequence Length",
339
+ minimum=1024,
340
+ maximum=MAX_SEQUENCE_LENGTH,
341
+ step=256,
342
+ value=8192,
343
+ info="Total sequence length (input + output)"
344
+ )
345
+
346
+ with gr.Row():
347
+ temperature = gr.Slider(
348
+ label="🌡️ Temperature",
349
+ minimum=0.1,
350
+ maximum=2.0,
351
+ step=0.1,
352
+ value=0.6,
353
+ info="Controls creativity (higher = more creative)"
354
+ )
355
+
356
+ top_p = gr.Slider(
357
+ label="🎯 Top-p",
358
+ minimum=0.1,
359
+ maximum=1.0,
360
+ step=0.05,
361
+ value=0.9,
362
+ info="Controls diversity"
363
+ )
364
+
365
+ with gr.Row():
366
+ top_k = gr.Slider(
367
+ label="🔝 Top-k",
368
+ minimum=1,
369
+ maximum=100,
370
+ step=1,
371
+ value=50,
372
+ info="Vocabulary limit per step"
373
+ )
374
+
375
+ repetition_penalty = gr.Slider(
376
+ label="🔄 Repetition Penalty",
377
+ minimum=1.0,
378
+ maximum=2.0,
379
+ step=0.05,
380
+ value=1.2,
381
+ info="Prevents repetitive text"
382
+ )
383
+
384
+ stream_output = gr.Checkbox(
385
+ label="📡 Stream Output",
386
+ value=True,
387
+ info="Show response in real-time"
388
+ )
389
+
390
+ with gr.Column(scale=1):
391
+ gr.Markdown("### 📤 **Analysis Results**")
392
+
393
+ with gr.Column(elem_classes="output-box"):
394
+ output = gr.Textbox(
395
+ label="🤖 AI Response",
396
+ interactive=False,
397
+ lines=15,
398
+ show_copy_button=True,
399
+ placeholder="Upload an image and click 'Analyze Image' to see the AI's response here..."
400
+ )
401
+
402
+ with gr.Accordion("📋 **Formatted Output**", open=False):
403
+ markdown_output = gr.Markdown()
404
+
405
+ with gr.Column(elem_classes="model-info"):
406
+ gr.Markdown("""
407
+ ### 🤖 **Model Information**
408
+
409
+ **⚡ Qwen2.5-VL-3B-Instruct**:
410
+ - Lightweight vision-language model
411
+ - Good performance with faster speed
412
+ - Ideal for quick analysis tasks
413
+
414
+ **🔬 Qwen2.5-VL-7B-Instruct**:
415
+ - Advanced multimodal AI model
416
+ - Excellent for detailed analysis, OCR, complex reasoning
417
+ - Best quality but slower inference
418
+
419
+ ---
420
+
421
+ ### 💡 **Usage Tips**
422
+ - Load a model first using the "Load Selected Model" button
423
+ - Upload clear, high-resolution images for best results
424
+ - Be specific in your questions for more detailed answers
425
+ - Try different prompts: analysis, OCR, counting, emotions, etc.
426
+ - Enable streaming to see responses in real-time
427
+
428
+ ### 🔧 **Perfect for:**
429
+ - 📄 Document OCR and text extraction
430
+ - 🔍 Object detection and counting
431
+ - 🎨 Art and image analysis
432
+ - 📊 Chart and graph interpretation
433
+ - 😊 Emotion and mood detection
434
+ - 🏛️ Scene and location identification
435
+ """)
436
+
437
+ # Event handlers
438
+ def clear_all():
439
+ return "", None, "", ""
440
+
441
+ def analyze_image_wrapper(*args):
442
+ """Wrapper to properly handle generator output for Gradio"""
443
+ try:
444
+ for result in generate_image_response(*args):
445
+ yield result
446
+ except Exception as e:
447
+ error_msg = f"❌ Analysis Error: {str(e)}"
448
+ yield error_msg, error_msg
449
+
450
+ # Load model event
451
+ load_model_btn.click(
452
+ fn=load_model_with_progress,
453
+ inputs=[model_choice],
454
+ outputs=[model_status]
455
+ )
456
+
457
+ # Clear all event
458
+ clear_btn.click(
459
+ fn=clear_all,
460
+ outputs=[image_query, image_upload, output, markdown_output]
461
+ )
462
+
463
+ # Analyze image event
464
+ analyze_btn.click(
465
+ fn=analyze_image_wrapper,
466
+ inputs=[
467
+ model_choice, image_query, image_upload,
468
+ max_new_tokens, max_sequence_length, temperature,
469
+ top_p, top_k, repetition_penalty, stream_output
470
+ ],
471
+ outputs=[output, markdown_output]
472
+ )
473
+
474
+ # Auto-analyze on Enter key in query box
475
+ image_query.submit(
476
+ fn=analyze_image_wrapper,
477
+ inputs=[
478
+ model_choice, image_query, image_upload,
479
+ max_new_tokens, max_sequence_length, temperature,
480
+ top_p, top_k, repetition_penalty, stream_output
481
+ ],
482
+ outputs=[output, markdown_output]
483
+ )
484
+
485
+ return demo
486
+
487
+ # Main application
488
+ if __name__ == "__main__":
489
+ print("🚀 Initializing Qwen2.5-VL Image Analysis App for Hugging Face Spaces...")
490
+
491
+ # Create and launch the interface
492
+ demo = create_interface()
493
+
494
+ # Launch with Hugging Face Spaces settings
495
+ demo.launch(
496
+ server_name="0.0.0.0",
497
+ server_port=7860,
498
+ share=False,
499
+ show_error=True,
500
+ debug=False
501
+ )
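
For reference, the core inference path of the file above, condensed into a standalone sketch (model ID, image path, and prompt are placeholders; loading mirrors `load_model_with_progress`, generation mirrors the non-streaming branch of `generate_image_response`):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
).eval()

image = Image.open("example.jpg")  # placeholder image path
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.6)

# Strip the prompt tokens before decoding, as the app does.
answer = processor.tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```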
requirements.txt ADDED
@@ -0,0 +1,9 @@
1
+ torch>=2.0.0
2
+ torchvision>=0.15.0
3
+ transformers>=4.49.0
4
+ gradio>=4.0.0
5
+ accelerate>=0.20.0
6
+ pillow>=9.0.0
7
+ spaces>=0.19.0
8
+ numpy>=1.21.0
9
+ requests>=2.25.0