zhimin-z committed · Commit 351e03f · 1 Parent(s): d83f64e
Files changed (1): README.md (+5 −4)
README.md CHANGED
@@ -20,10 +20,11 @@ Welcome to **SWE-Model-Arena**, an open-source platform designed for evaluating
 - **Multi-Round Conversational Workflows**: Evaluate models through extended, context-dependent interactions that mirror real-world SE processes.
 - **RepoChat Integration**: Automatically inject repository context (issues, commits, PRs) into conversations for more realistic evaluations.
 - **Advanced Evaluation Metrics**: Assess models using a comprehensive suite of metrics including:
-- Traditional metrics: Elo score and average win rate
-- Network-based metrics: Eigenvector centrality, PageRank score
-- Community detection: Newman modularity score
-- Consistency score: Quantify model determinism and reliability through self-play matches
+- **Traditional ranking metrics**: Elo ratings and win rates to measure overall model performance
+- **Network-based metrics**: Eigenvector centrality and PageRank to identify influential models in head-to-head comparisons
+- **Community detection metrics**: Newman modularity to reveal clusters of models with similar capabilities
+- **Consistency metrics**: Self-play match analysis to quantify model determinism and reliability
+- **Efficiency metrics**: Conversation efficiency index to measure response quality relative to length
 - **Transparent, Open-Source Leaderboard**: View real-time model rankings across diverse SE workflows with full transparency.
 - **Intelligent Request Filtering**: Employ `gpt-oss-safeguard-20b` as a guardrail to automatically filter out non-software-engineering-related requests, ensuring focused and relevant evaluations.
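The Elo ratings named in the metrics above can be derived by replaying pairwise model battles. The following is a minimal, illustrative sketch — not SWE-Model-Arena's actual implementation — assuming a standard fixed K-factor update; the function names, K value, and model names are hypothetical.

```python
# Illustrative Elo-rating sketch (hypothetical names; not the platform's code).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Replay a small battle log of (model_a, model_b, score for model_a) tuples.
battles = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-y", 0.5),
    ("model-y", "model-x", 1.0),
]
ratings: dict[str, float] = {}
for a, b, s in battles:
    ra = ratings.setdefault(a, 1000.0)
    rb = ratings.setdefault(b, 1000.0)
    ratings[a], ratings[b] = update_elo(ra, rb, s)
```

Because each update is zero-sum, the total rating mass stays constant; the average win rate mentioned alongside Elo can be computed from the same battle log by counting wins per model.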