Spaces: sabaridsnfuji / Image-Analysis-Qwen2.5-VL

sabaridsnfuji committed on Jul 18

Commit c7643a1 (verified) · 1 Parent(s): fda14af

updated the code

Files changed (3)

README.md +87 -12
app.py.py +501 -0
requirements.txt +9 -0

README.md CHANGED

@@ -1,14 +1,89 @@
----
-title: Image Analysis Qwen2.5 VL
-emoji: 👁
-colorFrom: yellow
-colorTo: red
-sdk: gradio
-sdk_version: 5.38.0
-app_file: app.py
-pinned: false
-license: mit
-short_description: 🚀 Powerful Vision-Language Models for Advanced Image Unders
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# 🖼️ Qwen2.5-VL Image Analysis
+A powerful vision-language model interface for advanced image understanding tasks, built with Gradio and optimized for Hugging Face Spaces.
+## 🚀 Features
+- **Multi-Model Support**: Choose between Qwen2.5-VL-3B (faster) and Qwen2.5-VL-7B (higher quality)
+- **Advanced Image Analysis**: OCR, object detection, scene description, emotion analysis
+- **Real-time Streaming**: See responses generated in real-time
+- **Customizable Parameters**: Adjust temperature, top-p, top-k, and more
+- **User-friendly Interface**: Clean, modern UI with example prompts
+## 🤖 Supported Models
+### Qwen2.5-VL-3B-Instruct
+- **Speed**: ⚡ Fast inference
+- **Memory**: Lower GPU memory requirements
+- **Use Case**: Quick analysis, general image understanding
+### Qwen2.5-VL-7B-Instruct
+- **Quality**: 🔬 Higher accuracy and detail
+- **Memory**: Higher GPU memory requirements
+- **Use Case**: Complex analysis, detailed OCR, research tasks
+## 💡 Use Cases
+- 📄 **Document OCR**: Extract and transcribe text from images
+- 🔍 **Object Detection**: Identify and count objects in images
+- 🎨 **Art Analysis**: Analyze composition, colors, and artistic style
+- 📊 **Chart Interpretation**: Understand graphs, charts, and diagrams
+- 😊 **Emotion Detection**: Identify emotions and moods in images
+- 🏛️ **Scene Understanding**: Describe locations, settings, and contexts
+- 🔒 **Safety Analysis**: Identify potential hazards or safety concerns
+## 🛠️ Usage Instructions
+1. **Load Model**: Select your preferred model (3B or 7B) and click "Load Selected Model"
+2. **Upload Image**: Click the image upload area and select your image
+3. **Ask Question**: Enter your question about the image in the text box
+4. **Analyze**: Click "Analyze Image" or press Shift+Enter in the question box to get the AI response
+5. **Adjust Settings**: Use the Advanced Settings accordion for fine-tuning
+## 📝 Example Prompts
+- "Describe this image in detail"
+- "What text is visible in this image?"
+- "Count the number of people in this image"
+- "What emotions are expressed in this image?"
+- "Analyze the artistic style and composition"
+- "What safety concerns can you identify?"
+## ⚙️ Advanced Settings
+- **Max New Tokens**: Control response length (1-4096)
+- **Temperature**: Adjust creativity (0.1-2.0)
+- **Top-p**: Control diversity (0.1-1.0)
+- **Top-k**: Vocabulary limit per step (1-100)
+- **Repetition Penalty**: Prevent repetitive text (1.0-2.0)
+- **Stream Output**: Enable real-time response streaming
+## 🔧 Technical Details
+- **Framework**: Gradio 4.0+
+- **Models**: Qwen2.5-VL series by Alibaba Cloud
+- **Hardware**: GPU-optimized for CUDA devices
+- **Precision**: FP16 for efficient inference
+## 📋 Requirements
+- Python 3.8+
+- CUDA-compatible GPU (recommended)
+- 8GB+ GPU memory for 3B model
+- 16GB+ GPU memory for 7B model
+## 🚀 Deployment
+This app is optimized for Hugging Face Spaces: it detects a GPU automatically and loads the selected model on demand.
+## 📄 License
+This project uses the Qwen2.5-VL models which have their own licensing terms. Please refer to the original model repositories for license information.
+## 🤝 Contributing
+Feel free to submit issues, suggestions, or pull requests to improve this application!
 ---
+Built with ❤️ using [Gradio](https://gradio.app/) and [Qwen2.5-VL](https://huggingface.co/Qwen) models.
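
The sampling controls listed under Advanced Settings in the README (temperature, top-k, top-p) can be illustrated with a small, dependency-free sketch. This is not the `transformers` implementation; the function name `sampling_distribution` and the toy logits are assumptions made for illustration, and the filters are applied in the usual temperature, then top-k, then top-p order.

```python
import math


def sampling_distribution(logits, temperature=0.6, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering.

    Hypothetical helper for illustration only; defaults mirror the app's slider defaults.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    ranked = ranked[:max(1, top_k)]
    # Softmax over the survivors (subtract the maximum for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, prob in zip(ranked, probs):
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the remaining probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {index: prob / norm for index, prob in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]
    print(sampling_distribution(toy_logits, temperature=1.0, top_k=2, top_p=1.0))
```

Lowering `top_k` or `top_p` shrinks the set of candidate tokens, while lowering `temperature` concentrates probability on the highest-scoring ones.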

app.py.py ADDED

	@@ -0,0 +1,501 @@

+#!/usr/bin/env python3
+"""
+Qwen2.5-VL Image Analysis App for Hugging Face Spaces
+A powerful vision-language model interface for image understanding tasks.
+"""
+import os
+import time
+from threading import Thread
+import gradio as gr
+import torch
+from PIL import Image
+from transformers import (
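+    Qwen2_5_VLForConditionalGeneration as Qwen2VLForConditionalGeneration,  # Qwen2.5-VL checkpoints need the Qwen2.5-VL class (transformers>=4.49)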
+    AutoProcessor,
+    TextIteratorStreamer,
+)
+# Constants
+MAX_MAX_NEW_TOKENS = 4096
+DEFAULT_MAX_NEW_TOKENS = 1024
+MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "8192"))
+MAX_SEQUENCE_LENGTH = 12288
+# Device configuration
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+print(f"🚀 Using device: {device}")
+# Global variables for models
+model_7b = None
+processor_7b = None
+model_3b = None
+processor_3b = None
+def load_model_with_progress(model_name, progress=gr.Progress()):
+    """Load model and processor with progress tracking"""
+    global model_7b, processor_7b, model_3b, processor_3b
+    try:
+        if model_name == "Qwen/Qwen2.5-VL-7B-Instruct":
+            if model_7b is None:
+                progress(0.1, desc="🔄 Loading Qwen2.5-VL-7B-Instruct...")
+                processor_7b = AutoProcessor.from_pretrained(
+                    model_name,
+                    trust_remote_code=True
+                )
+                progress(0.5, desc="🔄 Loading model weights...")
+                model_7b = Qwen2VLForConditionalGeneration.from_pretrained(
+                    model_name,
+                    trust_remote_code=True,
+                    torch_dtype=torch.float16,
+                    device_map="auto",
+                    low_cpu_mem_usage=True,
+                ).eval()
+                progress(1.0, desc="✅ 7B model loaded successfully!")
+                return "✅ Qwen2.5-VL-7B-Instruct loaded and ready!", True
+            return "✅ Qwen2.5-VL-7B-Instruct already loaded!", True
+        else:  # 3B model
+            if model_3b is None:
+                progress(0.1, desc="🔄 Loading Qwen2.5-VL-3B-Instruct...")
+                processor_3b = AutoProcessor.from_pretrained(
+                    model_name,
+                    trust_remote_code=True
+                )
+                progress(0.5, desc="🔄 Loading model weights...")
+                model_3b = Qwen2VLForConditionalGeneration.from_pretrained(
+                    model_name,
+                    trust_remote_code=True,
+                    torch_dtype=torch.float16,
+                    device_map="auto",
+                    low_cpu_mem_usage=True,
+                ).eval()
+                progress(1.0, desc="✅ 3B model loaded successfully!")
+                return "✅ Qwen2.5-VL-3B-Instruct loaded and ready!", True
+            return "✅ Qwen2.5-VL-3B-Instruct already loaded!", True
+    except Exception as e:
+        error_msg = f"❌ Failed to load {model_name}: {str(e)}"
+        print(error_msg)
+        return error_msg, False
+def get_loaded_model(model_name):
+    """Get already loaded model and processor"""
+    global model_7b, processor_7b, model_3b, processor_3b
+    if model_name == "Qwen/Qwen2.5-VL-7B-Instruct":
+        return model_7b, processor_7b
+    else:
+        return model_3b, processor_3b
+def generate_image_response(model_name: str,
+                           text: str,
+                           image: Image.Image,
+                           max_new_tokens: int = 1024,
+                           max_sequence_length: int = 8192,
+                           temperature: float = 0.6,
+                           top_p: float = 0.9,
+                           top_k: int = 50,
+                           repetition_penalty: float = 1.2,
+                           stream_output: bool = True):
+    """
+    Generate responses using the selected model for image input.
+    Always yields exactly 2 values: (raw_text, markdown_text)
+    """
+    if image is None:
+        yield "❌ Please upload an image first.", "❌ Please upload an image first."
+        return
+    if not text.strip():
+        yield "❌ Please enter a question about the image.", "❌ Please enter a question about the image."
+        return
+    # Check if model is loaded
+    model, processor = get_loaded_model(model_name)
+    if model is None or processor is None:
+        yield "❌ Please select and load a model first.", "❌ Please select and load a model first."
+        return
+    try:
+        # Prepare messages
+        messages = [{
+            "role": "user",
+            "content": [
+                {"type": "image", "image": image},
+                {"type": "text", "text": text},
+            ]
+        }]
+        # Apply chat template
+        prompt_full = processor.apply_chat_template(
+            messages,
+            tokenize=False,
+            add_generation_prompt=True
+        )
+        # Prepare inputs with sequence length limit
+        inputs = processor(
+            text=[prompt_full],
+            images=[image],
+            return_tensors="pt",
+            padding=True,
+            truncation=True,
+            max_length=min(max_sequence_length, MAX_SEQUENCE_LENGTH)
+        ).to(device)
+        if stream_output:
+            # Streaming generation
+            streamer = TextIteratorStreamer(
+                processor.tokenizer,
+                skip_prompt=True,
+                skip_special_tokens=True
+            )
+            generation_kwargs = {
+                **inputs,
+                "streamer": streamer,
+                "max_new_tokens": max_new_tokens,
+                "do_sample": True,
+                "temperature": temperature,
+                "top_p": top_p,
+                "top_k": top_k,
+                "repetition_penalty": repetition_penalty,
+                "pad_token_id": processor.tokenizer.eos_token_id,
+            }
+            thread = Thread(target=model.generate, kwargs=generation_kwargs)
+            thread.start()
+            buffer = ""
+            try:
+                for new_text in streamer:
+                    buffer += new_text
+                    time.sleep(0.01)
+                    # Always yield exactly 2 values
+                    yield buffer, buffer
+                thread.join()
+            except Exception as stream_error:
+                thread.join()
+                error_msg = f"❌ Streaming Error: {str(stream_error)}"
+                yield error_msg, error_msg
+        else:
+            # Complete generation
+            generation_kwargs = {
+                **inputs,
+                "max_new_tokens": max_new_tokens,
+                "do_sample": True,
+                "temperature": temperature,
+                "top_p": top_p,
+                "top_k": top_k,
+                "repetition_penalty": repetition_penalty,
+                "pad_token_id": processor.tokenizer.eos_token_id,
+            }
+            with torch.no_grad():
+                outputs = model.generate(**generation_kwargs)
+            # Decode response
+            generated_ids = outputs[0][len(inputs['input_ids'][0]):]
+            response = processor.tokenizer.decode(generated_ids, skip_special_tokens=True)
+            # Always yield exactly 2 values
+            yield response, response
+    except Exception as e:
+        error_msg = f"❌ Error: {str(e)}"
+        print(f"Generation error: {e}")
+        # Always yield exactly 2 values
+        yield error_msg, error_msg
+# Define comprehensive examples for image inference
+image_examples = [
+    ["Describe this image in detail.", None],
+    ["What objects can you see in this image? Count them if possible.", None],
+    ["Extract and transcribe any text visible in this image.", None],
+    ["Analyze the composition, colors, and artistic style of this image.", None],
+    ["What emotions or mood does this image convey?", None],
+    ["Identify any people in the image and describe what they are doing.", None],
+    ["What is the setting or location of this image?", None],
+    ["Are there any safety concerns or hazards visible in this image?", None]
+]
+# Custom CSS for better UI
+css = """
+.gradio-container {
+    max-width: 1200px !important;
+    margin: auto !important;
+}
+.submit-btn {
+    background: linear-gradient(45deg, #2980b9, #3498db) !important;
+    color: white !important;
+    border-radius: 8px !important;
+    font-weight: bold !important;
+    border: none !important;
+}
+.submit-btn:hover {
+    background: linear-gradient(45deg, #3498db, #5dade2) !important;
+    transform: translateY(-1px) !important;
+    box-shadow: 0 4px 8px rgba(0,0,0,0.2) !important;
+}
+.clear-btn {
+    background: linear-gradient(45deg, #e74c3c, #ec7063) !important;
+    color: white !important;
+    border-radius: 8px !important;
+    border: none !important;
+}
+.output-box {
+    border: 2px solid #3498db !important;
+    border-radius: 12px !important;
+    padding: 20px !important;
+    background: linear-gradient(145deg, #f8f9fa, #e9ecef) !important;
+    box-shadow: inset 0 2px 4px rgba(0,0,0,0.1) !important;
+}
+.model-info {
+    background: linear-gradient(145deg, #ebf3fd, #d6eaf8) !important;
+    border-radius: 10px !important;
+    padding: 15px !important;
+    border-left: 4px solid #3498db !important;
+}
+.image-upload {
+    border: 2px dashed #3498db !important;
+    border-radius: 10px !important;
+    padding: 20px !important;
+    text-align: center !important;
+}
+"""
+def create_interface():
+    """Create the main Gradio interface"""
+    with gr.Blocks(css=css, title="🖼️ Qwen2.5-VL Image Analysis", theme=gr.themes.Soft()) as demo:
+        gr.Markdown("""
+        # 🖼️ **Qwen2.5-VL Image Analysis**
+        ### 🚀 Powerful Vision-Language Models for Advanced Image Understanding
+        Upload any image and ask questions about it! Perfect for OCR, object detection, image description, and more.
+        """)
+        with gr.Row():
+            with gr.Column(scale=1):
+                gr.Markdown("### 🤖 **Model Selection**")
+                model_choice = gr.Dropdown(
+                    choices=["Qwen/Qwen2.5-VL-3B-Instruct", "Qwen/Qwen2.5-VL-7B-Instruct"],
+                    label="🔮 Select Vision-Language Model",
+                    value="Qwen/Qwen2.5-VL-3B-Instruct",
+                    info="Choose your model: 3B (faster) or 7B (better quality)",
+                    elem_classes="model-dropdown"
+                )
+                model_status = gr.Textbox(
+                    label="📊 Model Status",
+                    value="⏳ Click 'Load Selected Model' to initialize...",
+                    interactive=False,
+                    elem_classes="model-status"
+                )
+                load_model_btn = gr.Button(
+                    "🚀 Load Selected Model",
+                    elem_classes="load-model-btn",
+                    variant="primary"
+                )
+                gr.Markdown("---")
+                gr.Markdown("### 📝 **Input Section**")
+                image_query = gr.Textbox(
+                    label="🤔 Your Question About the Image",
+                    placeholder="e.g., 'Describe this image in detail', 'What text is visible?', 'Count the objects'...",
+                    lines=3,
+                    value="Describe this image in detail."
+                )
+                image_upload = gr.Image(
+                    type="pil",
+                    label="📸 Upload Your Image",
+                    elem_classes="image-upload"
+                )
+                with gr.Row():
+                    clear_btn = gr.Button("🗑️ Clear All", elem_classes="clear-btn", variant="secondary")
+                    analyze_btn = gr.Button("🔍 Analyze Image", elem_classes="submit-btn", variant="primary", scale=2)
+                gr.Examples(
+                    examples=image_examples,
+                    inputs=[image_query, image_upload],
+                    label="💡 **Try These Example Prompts:**"
+                )
+                with gr.Accordion("⚙️ **Advanced Settings**", open=False):
+                    with gr.Row():
+                        max_new_tokens = gr.Slider(
+                            label="📝 Max New Tokens",
+                            minimum=1,
+                            maximum=MAX_MAX_NEW_TOKENS,
+                            step=1,
+                            value=DEFAULT_MAX_NEW_TOKENS,
+                            info="Maximum tokens in the response"
+                        )
+                        max_sequence_length = gr.Slider(
+                            label="📏 Max Sequence Length",
+                            minimum=1024,
+                            maximum=MAX_SEQUENCE_LENGTH,
+                            step=256,
+                            value=8192,
+                            info="Maximum input length in tokens; longer prompts are truncated"
+                        )
+                    with gr.Row():
+                        temperature = gr.Slider(
+                            label="🌡️ Temperature",
+                            minimum=0.1,
+                            maximum=2.0,
+                            step=0.1,
+                            value=0.6,
+                            info="Controls creativity (higher = more creative)"
+                        )
+                        top_p = gr.Slider(
+                            label="🎯 Top-p",
+                            minimum=0.1,
+                            maximum=1.0,
+                            step=0.05,
+                            value=0.9,
+                            info="Controls diversity"
+                        )
+                    with gr.Row():
+                        top_k = gr.Slider(
+                            label="🔝 Top-k",
+                            minimum=1,
+                            maximum=100,
+                            step=1,
+                            value=50,
+                            info="Vocabulary limit per step"
+                        )
+                        repetition_penalty = gr.Slider(
+                            label="🔄 Repetition Penalty",
+                            minimum=1.0,
+                            maximum=2.0,
+                            step=0.05,
+                            value=1.2,
+                            info="Prevents repetitive text"
+                        )
+                    stream_output = gr.Checkbox(
+                        label="📡 Stream Output",
+                        value=True,
+                        info="Show response in real-time"
+                    )
+            with gr.Column(scale=1):
+                gr.Markdown("### 📤 **Analysis Results**")
+                with gr.Column(elem_classes="output-box"):
+                    output = gr.Textbox(
+                        label="🤖 AI Response",
+                        interactive=False,
+                        lines=15,
+                        show_copy_button=True,
+                        placeholder="Upload an image and click 'Analyze Image' to see the AI's response here..."
+                    )
+                    with gr.Accordion("📋 **Formatted Output**", open=False):
+                        markdown_output = gr.Markdown()
+                with gr.Column(elem_classes="model-info"):
+                    gr.Markdown("""
+                    ### 🤖 **Model Information**
+                    **⚡ Qwen2.5-VL-3B-Instruct**:
+                    - Lightweight vision-language model
+                    - Good performance with faster speed
+                    - Ideal for quick analysis tasks
+                    **🔬 Qwen2.5-VL-7B-Instruct**:
+                    - Advanced multimodal AI model
+                    - Excellent for detailed analysis, OCR, complex reasoning
+                    - Best quality but slower inference
+                    ---
+                    ### 💡 **Usage Tips**
+                    - Load a model first using the "Load Selected Model" button
+                    - Upload clear, high-resolution images for best results
+                    - Be specific in your questions for more detailed answers
+                    - Try different prompts: analysis, OCR, counting, emotions, etc.
+                    - Enable streaming to see responses in real-time
+                    ### 🔧 **Perfect for:**
+                    - 📄 Document OCR and text extraction
+                    - 🔍 Object detection and counting
+                    - 🎨 Art and image analysis
+                    - 📊 Chart and graph interpretation
+                    - 😊 Emotion and mood detection
+                    - 🏛️ Scene and location identification
+                    """)
+        # Event handlers
+        def clear_all():
+            return "", None, "", ""
+        def analyze_image_wrapper(*args):
+            """Wrapper to properly handle generator output for Gradio"""
+            try:
+                for result in generate_image_response(*args):
+                    yield result
+            except Exception as e:
+                error_msg = f"❌ Analysis Error: {str(e)}"
+                yield error_msg, error_msg
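+        # Load model event: the loader returns (message, success); only the message goes to the status box
+        load_model_btn.click(
+            fn=lambda name, progress=gr.Progress(): load_model_with_progress(name, progress)[0],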
+            inputs=[model_choice],
+            outputs=[model_status]
+        )
+        # Clear all event
+        clear_btn.click(
+            fn=clear_all,
+            outputs=[image_query, image_upload, output, markdown_output]
+        )
+        # Analyze image event
+        analyze_btn.click(
+            fn=analyze_image_wrapper,
+            inputs=[
+                model_choice, image_query, image_upload,
+                max_new_tokens, max_sequence_length, temperature,
+                top_p, top_k, repetition_penalty, stream_output
+            ],
+            outputs=[output, markdown_output]
+        )
+        # Run the analysis when the query box is submitted (Shift+Enter, since the box is multi-line)
+        image_query.submit(
+            fn=analyze_image_wrapper,
+            inputs=[
+                model_choice, image_query, image_upload,
+                max_new_tokens, max_sequence_length, temperature,
+                top_p, top_k, repetition_penalty, stream_output
+            ],
+            outputs=[output, markdown_output]
+        )
+    return demo
+# Main application
+if __name__ == "__main__":
+    print("🚀 Initializing Qwen2.5-VL Image Analysis App for Hugging Face Spaces...")
+    # Create and launch the interface
+    demo = create_interface()
+    # Launch with Hugging Face Spaces settings
+    demo.launch(
+        server_name="0.0.0.0",
+        server_port=7860,
+        share=False,
+        show_error=True,
+        debug=False
+    )
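
The streaming branch of `generate_image_response` above runs `model.generate` on a worker thread while the generator drains a `TextIteratorStreamer` and yields the growing buffer twice (once per output component). A minimal stand-in for that producer/consumer pattern, without any model, might look like the sketch below; `fake_generate`, `stream_response`, and the `_DONE` sentinel are hypothetical names, and a plain `queue.Queue` plays the role of the streamer.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def fake_generate(chunks, out_queue):
    """Stand-in for model.generate(streamer=...): emits text chunks, then the sentinel."""
    try:
        for chunk in chunks:
            out_queue.put(chunk)
    finally:
        # Always signal completion so the consumer cannot block forever.
        out_queue.put(_DONE)


def stream_response(chunks):
    """Yield (raw_text, markdown_text) pairs as the accumulated text grows."""
    out_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(chunks, out_queue))
    worker.start()
    buffer = ""
    try:
        while True:
            item = out_queue.get()
            if item is _DONE:
                break
            buffer += item
            # Two values per step, matching the two Gradio output components.
            yield buffer, buffer
    finally:
        # Join the worker whether the stream finished or the consumer stopped early.
        worker.join()


if __name__ == "__main__":
    for raw, _markdown in stream_response(["A cat ", "on a ", "sofa."]):
        print(raw)
```

Putting the sentinel in a `finally` block means the consumer loop still terminates if the producer raises partway through.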

requirements.txt ADDED

	@@ -0,0 +1,9 @@

+torch>=2.0.0
+torchvision>=0.15.0
+transformers>=4.49.0
+gradio>=4.0.0
+accelerate>=0.20.0
+pillow>=9.0.0
+spaces>=0.19.0
+numpy>=1.21.0
+requests>=2.25.0
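
The version floors above can be checked against an environment before the app starts. The following is a rough sketch that handles only `name>=x.y.z` lines like the ones in this file; `parse_requirement`, `version_tuple`, and `unmet_requirements` are hypothetical helpers, and a real project would more likely rely on `pip` or the `packaging` library for full version semantics.

```python
from importlib import metadata


def parse_requirement(line):
    """Split 'name>=1.2.3' into ('name', (1, 2, 3))."""
    name, _, minimum = line.strip().partition(">=")
    return name, tuple(int(part) for part in minimum.split("."))


def version_tuple(version):
    """Keep only leading numeric components, e.g. '2.3.1+cu121' -> (2, 3, 1)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def unmet_requirements(lines, installed=None):
    """Return human-readable problems; `installed` maps name -> version for testing."""
    problems = []
    for line in lines:
        name, minimum = parse_requirement(line)
        if installed is not None:
            found = installed.get(name)
        else:
            try:
                found = metadata.version(name)
            except metadata.PackageNotFoundError:
                found = None
        if found is None:
            problems.append(f"{name}: not installed")
            continue
        have = version_tuple(found)
        # Pad both tuples with zeros so '2.0' compares equal to '2.0.0'.
        width = max(len(have), len(minimum))
        have_padded = have + (0,) * (width - len(have))
        need_padded = minimum + (0,) * (width - len(minimum))
        if have_padded < need_padded:
            need = ".".join(str(part) for part in minimum)
            problems.append(f"{name}: found {found}, need >={need}")
    return problems


if __name__ == "__main__":
    print(unmet_requirements(["torch>=2.0.0", "gradio>=4.0.0"]))
```

Passing an explicit `installed` mapping keeps the check testable without touching the real environment.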