zhimin-z committed
Commit 351e03f · 1 parent: d83f64e

refine

README.md CHANGED
@@ -20,10 +20,11 @@ Welcome to **SWE-Model-Arena**, an open-source platform designed for evaluating
 - **Multi-Round Conversational Workflows**: Evaluate models through extended, context-dependent interactions that mirror real-world SE processes.
 - **RepoChat Integration**: Automatically inject repository context (issues, commits, PRs) into conversations for more realistic evaluations.
 - **Advanced Evaluation Metrics**: Assess models using a comprehensive suite of metrics including:
-  - Traditional metrics
-  - Network-based metrics
-  - Community detection
-  - Consistency
+  - **Traditional ranking metrics**: Elo ratings and win rates to measure overall model performance
+  - **Network-based metrics**: Eigenvector centrality and PageRank to identify influential models in head-to-head comparisons
+  - **Community detection metrics**: Newman modularity to reveal clusters of models with similar capabilities
+  - **Consistency metrics**: Self-play match analysis to quantify model determinism and reliability
+  - **Efficiency metrics**: Conversation efficiency index to measure response quality relative to length
 - **Transparent, Open-Source Leaderboard**: View real-time model rankings across diverse SE workflows with full transparency.
 - **Intelligent Request Filtering**: Employ `gpt-oss-safeguard-20b` as a guardrail to automatically filter out non-software-engineering-related requests, ensuring focused and relevant evaluations.
 
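
The "traditional ranking metrics" named in the added lines (Elo ratings and win rates) can be illustrated with a minimal sketch. This is not the arena's actual implementation, which this diff does not show. The battle record format `(model_a, model_b, winner)`, the function names, the K-factor of 32, the starting rating of 1000, and the choice to count a tie as half a win are all assumptions made for illustration.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(battles, k=32.0, initial=1000.0):
    """Sequentially update Elo ratings from head-to-head results.

    `battles` is a list of (model_a, model_b, winner) tuples, where
    winner is "a", "b", or "tie" (hypothetical format, assumed here).
    """
    score_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, winner in battles:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        s_a = score_for_a[winner]
        # The two updates are equal and opposite, so the rating sum is conserved.
        ratings[model_a] += k * (s_a - e_a)
        ratings[model_b] += k * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)


def win_rates(battles):
    """Fraction of battles won per model; a tie counts as half a win (assumption)."""
    points = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in battles:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "a":
            points[model_a] += 1.0
        elif winner == "b":
            points[model_b] += 1.0
        else:
            points[model_a] += 0.5
            points[model_b] += 0.5
    return {model: points[model] / games[model] for model in games}


# Made-up sample data, for illustration only.
sample_battles = [
    ("m1", "m2", "a"),
    ("m2", "m3", "tie"),
    ("m1", "m3", "a"),
]
print(elo_ratings(sample_battles))
print(win_rates(sample_battles))
```

Sequential Elo depends on the order of the battles, whereas win rate does not; a leaderboard that needs order-independent ratings might fit them in batch instead, which is outside this sketch.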