Collections

Discover the best community collections!

Collections including paper arxiv:2509.21117

AI Paper of the Day

A collection of papers that I think are interesting, one added each day

Can Large Language Models Understand Context?

Paper • 2402.00858 • Published Feb 1, 2024 • 23
OLMo: Accelerating the Science of Language Models

Paper • 2402.00838 • Published Feb 1, 2024 • 85
Self-Rewarding Language Models

Paper • 2401.10020 • Published Jan 18, 2024 • 151
SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Paper • 2401.17072 • Published Jan 30, 2024 • 25

Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

Paper • 2508.09789 • Published Aug 13 • 5
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

Paper • 2508.13186 • Published Aug 14 • 18
ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents

Paper • 2508.04038 • Published Aug 6 • 1
Prompt Orchestration Markup Language

Paper • 2508.13948 • Published Aug 19 • 48

Preference Leakage: A Contamination Problem in LLM-as-a-judge

Paper • 2502.01534 • Published Feb 3 • 40
Great Models Think Alike and this Undermines AI Oversight

Paper • 2502.04313 • Published Feb 6 • 33
CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

Paper • 2504.10823 • Published Apr 15 • 15
CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

Paper • 2507.18392 • Published Jul 24 • 19

TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

Paper • 2509.21117 • Published Sep 25 • 29

Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments

Paper • 2508.08791 • Published Aug 12 • 16
WideSearch: Benchmarking Agentic Broad Info-Seeking

Paper • 2508.07999 • Published Aug 11 • 110
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

Paper • 2509.21117 • Published Sep 25 • 29
