---
title: SWE-Model-Arena
emoji: 🎯
colorFrom: green
colorTo: red
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
hf_oauth: true
pinned: false
short_description: Chatbot arena for software engineering tasks
---
# SWE-Model-Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering

Welcome to **SWE-Model-Arena**, an open-source platform for evaluating software engineering-focused foundation models (FMs), particularly large language models (LLMs). SWE-Model-Arena benchmarks models in the iterative, context-rich workflows characteristic of software engineering (SE) tasks.
## Key Features

- **Multi-Round Conversational Workflows**: Evaluate models through extended, context-dependent interactions that mirror real-world SE processes.
- **RepoChat Integration**: Automatically inject repository context (issues, commits, PRs) into conversations for more realistic evaluations.
- **Advanced Evaluation Metrics**: Assess models with a comprehensive suite of metrics (see the metrics sketch after this list):
  - **Traditional ranking metrics**: Elo ratings and win rates to measure overall model performance
  - **Network-based metrics**: Eigenvector centrality and PageRank to identify influential models in head-to-head comparisons
  - **Community detection metrics**: Newman modularity to reveal clusters of models with similar capabilities
  - **Consistency metrics**: Self-play match analysis to quantify model determinism and reliability
  - **Efficiency metrics**: Conversation efficiency index to measure response quality relative to length
- **Transparent, Open-Source Leaderboard**: View real-time model rankings across diverse SE workflows with full transparency.
- **Intelligent Request Filtering**: Employ `gpt-oss-safeguard-20b` as a guardrail that automatically filters out non-software-engineering requests, keeping evaluations focused and relevant (see the guardrail sketch after this list).
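
Below is a minimal sketch of how pairwise votes can feed these metrics, assuming a standard Elo update (K = 32) and `networkx` for the graph-based scores. The model names and vote list are hypothetical, and the arena's actual implementation may differ:

```python
import networkx as nx
from networkx.algorithms import community

# Hypothetical votes: (winner, loser) from anonymous head-to-head battles.
votes = [("model_a", "model_b"), ("model_c", "model_a"), ("model_a", "model_b")]

# Elo ratings: standard update rule with K-factor 32.
ratings, K = {}, 32
for winner, loser in votes:
    r_w = ratings.setdefault(winner, 1000.0)
    r_l = ratings.setdefault(loser, 1000.0)
    expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
    ratings[winner] = r_w + K * (1.0 - expected_w)
    ratings[loser] = r_l - K * (1.0 - expected_w)

# Network metrics: draw a loser -> winner edge per vote, so PageRank
# (and likewise eigenvector centrality) concentrates on strong models.
G = nx.DiGraph()
for winner, loser in votes:
    G.add_edge(loser, winner)
pagerank = nx.pagerank(G)

# Community detection: Newman modularity clustering on the undirected graph.
clusters = community.greedy_modularity_communities(G.to_undirected())

print(ratings, pagerank, [set(c) for c in clusters])
```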
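
And a minimal sketch of the request-filtering step, assuming the safeguard model is served behind an OpenAI-compatible endpoint (the base URL and policy wording are placeholders, not the platform's actual configuration):

```python
from openai import OpenAI

# Any OpenAI-compatible server hosting the safeguard model would work here;
# the base_url and policy text below are assumptions for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

POLICY = (
    "You are a request filter for a software engineering arena. "
    "Reply ALLOW if the user request is a software engineering task "
    "(coding, debugging, review, design, testing, DevOps); otherwise reply DENY."
)

def is_se_request(prompt: str) -> bool:
    """Classify a prompt as SE-related or not via the safeguard model."""
    response = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": prompt},
        ],
    )
    return "ALLOW" in response.choices[0].message.content.upper()

print(is_se_request("Why does my asyncio task never get cancelled?"))
```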
## Why SWE-Model-Arena?

Existing evaluation frameworks (e.g., [LMArena](https://lmarena.ai)) often do not address the complex, iterative nature of SE tasks. SWE-Model-Arena fills critical gaps by:

- Supporting context-rich, multi-turn evaluations to capture iterative workflows
- Integrating repository-level context through RepoChat to simulate real-world development scenarios (see the sketch after this list)
- Providing multidimensional metrics for nuanced model comparisons
- Focusing on the full breadth of SE tasks beyond just code generation
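
As an illustration of the repository-context idea (not RepoChat's actual implementation), here is a sketch that pulls recent issues, commits, and pull requests from the public GitHub REST API and prepends them to the user's prompt:

```python
import requests

def fetch_repo_context(owner: str, repo: str, n: int = 3) -> str:
    """Summarize recent repository activity via the public GitHub REST API."""
    base = f"https://api.github.com/repos/{owner}/{repo}"
    # Note: the /issues endpoint also lists PRs; filtering is omitted for brevity.
    issues = requests.get(f"{base}/issues", params={"state": "open", "per_page": n}).json()
    commits = requests.get(f"{base}/commits", params={"per_page": n}).json()
    pulls = requests.get(f"{base}/pulls", params={"state": "open", "per_page": n}).json()
    lines = [f"Repository: {owner}/{repo}"]
    lines += [f"Issue #{i['number']}: {i['title']}" for i in issues]
    lines += [f"Commit {c['sha'][:7]}: {c['commit']['message'].splitlines()[0]}" for c in commits]
    lines += [f"PR #{p['number']}: {p['title']}" for p in pulls]
    return "\n".join(lines)

context = fetch_repo_context("SE-Arena", "SWE-Model-Arena")
prompt = f"{context}\n\nUser question: How should I fix the flaky CI job?"
```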
## How It Works

1. **Submit a Prompt**: Sign in and enter your SE-related task (optionally include a repository URL for RepoChat context)
2. **Compare Responses**: Two anonymous models respond to your query
3. **Continue the Conversation**: Test contextual understanding over multiple rounds
4. **Vote**: Choose the better model at any point, with the ability to re-assess after further turns (a sketch of the resulting record follows this list)
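
For a sense of what one battle involves, here is a sketch of the record such a flow could produce; all field names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Battle:
    prompt: str                      # the user's SE task
    repo_url: str | None             # optional RepoChat context source
    model_a: str                     # identities stay hidden until after voting
    model_b: str
    rounds: list[tuple[str, str]] = field(default_factory=list)  # (response_a, response_b) per turn
    vote: str | None = None          # "A", "B", or "tie"; can be revised after later turns

battle = Battle(prompt="Refactor this parser for readability",
                repo_url=None, model_a="model_x", model_b="model_y")
battle.rounds.append(("...response from A...", "...response from B..."))
battle.vote = "A"
```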
## Getting Started

### Prerequisites

- A [Hugging Face](https://huggingface.co) account

### Usage

1. Navigate to the [SWE-Model-Arena platform](https://huggingface.co/spaces/SE-Arena/SWE-Model-Arena)
2. Sign in with your Hugging Face account
3. Enter your SE task prompt (optionally include a repository URL for RepoChat)
4. Engage in multi-round interactions and vote on model performance
## Contributing

We welcome contributions from the community! Here's how you can help:

1. **Submit SE Tasks**: Share your real-world SE problems to enrich our evaluation dataset
2. **Report Issues**: Found a bug or have a feature request? Open an issue in this repository
3. **Enhance the Codebase**: Fork the repository, make your changes, and submit a pull request
## Privacy Policy

Your interactions are anonymized and used solely for improving SWE-Model-Arena and FM benchmarking. By using SWE-Model-Arena, you agree to our Terms of Service.
## Future Plans

- **Analysis of Real-World SE Workloads**: Identify common patterns and challenges in user-submitted tasks
- **Multi-Round Evaluation Metrics**: Develop specialized metrics for assessing model adaptation over successive turns
- **Expanded FM Coverage**: Include multimodal and domain-specific foundation models
- **Advanced Context Compression**: Integrate techniques like [LongRoPE](https://github.com/microsoft/LongRoPE) and [SelfExtend](https://github.com/datamllab/LongLM) to manage long-term memory in multi-round conversations
## Contact

For inquiries or feedback, please [open an issue](https://github.com/SE-Arena/SWE-Model-Arena/issues/new) in this repository. We welcome your contributions and suggestions!
## Citation

Made with ❤️ for SWE-Model-Arena. If this work is useful to you, please consider citing our vision paper:

```bibtex
@inproceedings{zhao2025se,
  title={SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering},
  author={Zhao, Zhimin},
  booktitle={2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge)},
  pages={78--81},
  year={2025},
  organization={IEEE}
}
```