JustRL: Simplicity at Scale
🚀 Competitive RL Performance Without Complex Techniques 🌟

Overview

JustRL demonstrates that competitive reinforcement learning performance for small language models doesn't require complex multi-stage pipelines or dynamic schedules. Using a minimal recipe with single-stage training and fixed hyperparameters, we achieve state-of-the-art results on mathematical reasoning tasks.

We release two models:

  • JustRL-DeepSeek-1.5B, trained from DeepSeek-R1-Distill-Qwen-1.5B
  • JustRL-Nemotron-1.5B, trained from OpenMath-Nemotron-1.5B

Both models use identical hyperparameters without per-model tuning, demonstrating the robustness of our approach.

Figure: AIME24 performance curves over thousands of training steps, scaling from a weak base (DeepSeek-R1-Distill-Qwen-1.5B) and a strong base (OpenMath-Nemotron-1.5B).

Key Highlights

✨ Simplicity: Single-stage training with fixed hyperparameters, without multi-stage pipelines or dynamic schedules

📈 Stability: Smooth, monotonic improvement over 4,000+ training steps without collapses or oscillations

🎯 Performance: State-of-the-art results at 1.5B scale, matching or exceeding more complex approaches

💰 Efficiency: Comparable or better performance with 2× less compute than multi-stage methods

🔓 Open: Complete evaluation scripts and model weights released

Performance

JustRL-DeepSeek-1.5B (Based on DeepSeek-R1-Distill-Qwen-1.5B)

| Model | AIME24 (@32) | AIME25 (@32) | AMC23 (@32) | MATH-500 (@4) | Minerva (@4) | OlympiadBench (@4) | HMMT25 (@32) | BRUMO25 (@32) | CMIMC25 (@32) | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-1.5B | 29.90 | 22.40 | 63.82 | 84.90 | 34.65 | 45.95 | 13.44 | 30.94 | 12.89 | 37.65 |
| DeepScaleR-1.5B-Preview | 40.21 | 28.65 | 73.83 | 89.30 | 39.34 | 52.79 | 18.96 | 40.00 | 21.00 | 44.88 |
| ProRL-V2 | 51.87 | 35.73 | 88.75 | 92.00 | 49.03 | 67.84 | 19.38 | 47.29 | 25.86 | 53.08 |
| BroRL | 57.50 | 36.88 | / | 92.14 | 49.08 | 61.54 | / | / | / | / |
| JustRL-DeepSeek-1.5B | 52.60 | 38.75 | 91.02 | 91.65 | 51.47 | 67.99 | 21.98 | 52.71 | 25.63 | 54.87 |

Beyond the raw numbers, the real question is whether this simplicity comes at a computational cost. It doesn't. We use half of ProRL-V2's compute budget while running a single-stage recipe with fixed hyperparameters. BroRL requires 4.9× more compute, increasing rollouts to 512 per example to essentially exhaustively explore the solution space. Our approach achieves competitive performance without this computational overhead.

JustRL-Nemotron-1.5B (Based on OpenMath-Nemotron-1.5B)

| Model | AIME24 (@32) | AIME25 (@32) | AMC23 (@32) | MATH-500 (@4) | Minerva (@4) | OlympiadBench (@4) | HMMT25 (@32) | BRUMO25 (@32) | CMIMC25 (@32) | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenMath-Nemotron-1.5B | 58.75 | 48.44 | 90.55 | 92.40 | 26.93 | 71.70 | 30.10 | 61.67 | 30.08 | 56.74 |
| QUESTA-Nemotron-1.5B | 71.56 | 62.08 | 93.44 | 92.95 | 32.08 | 72.28 | 40.94 | 67.50 | 41.48 | 63.81 |
| JustRL-Nemotron-1.5B | 69.69 | 62.92 | 96.02 | 94.15 | 30.24 | 76.59 | 40.63 | 66.88 | 41.72 | 64.32 |

We achieve a 64.32% average, slightly outperforming QuestA's 63.81% and leading on five of nine benchmarks. The gap is narrow, which makes sense: both approaches are pushing the boundaries of what is achievable at 1.5B scale. The key difference is how we get there: we use 2× less compute and reach slightly better average performance without designing a complex curriculum as QuestA does.

Training Recipe

Our approach is deliberately minimal:

Core Algorithm: Standard GRPO with binary outcome rewards

  • Reward: Simple DAPO verifier (string-matching, no SymPy)
  • Training: Single-stage, no curriculum or stage transitions
  • Hyperparameters: Fixed throughout (no adaptive schedules)
  • Data: DAPO-Math-17k without filtering or dynamic sampling
  • Length Control: 16K context cap (no explicit penalties)
  • Stabilization: Only "clip higher" for gradient stability (see the sketch below this list)
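
For illustration only, a minimal sketch of two of the ingredients above: the binary string-matching outcome reward and a GRPO-style clipped surrogate whose upper clipping bound is looser than the lower one ("clip higher"). The function names and the eps_low/eps_high defaults are assumptions made for this sketch, not our exact training settings; see the blog for the actual hyperparameters.

import torch

def grpo_clip_higher_loss(logp_new, logp_old, advantages,
                          eps_low=0.2, eps_high=0.28):
    """GRPO-style clipped surrogate with an asymmetric ("clip higher") bound.

    logp_new, logp_old: per-token log-probs under the current / rollout policy.
    advantages: group-normalized outcome advantages, broadcast to tokens.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # The upper bound (1 + eps_high) is looser than the lower bound (1 - eps_low).
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    # Pessimistic (elementwise min) objective, negated for minimization.
    return -torch.min(unclipped, clipped).mean()

def binary_outcome_reward(response: str, gold_answer: str) -> float:
    """Toy string-matching verifier: 1.0 if the final boxed answer matches the
    reference after whitespace stripping, else 0.0 (no SymPy involved)."""
    start = response.rfind("\\boxed{")
    if start == -1:
        return 0.0
    answer = response[start + len("\\boxed{"):].split("}")[0]
    return float(answer.strip() == gold_answer.strip())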

Detailed hyperparameters and comparisons of training techniques with other methods can be found in our blog.

Training Data

We train on DAPO-Math-17k, a curated dataset of mathematical problems. No offline difficulty filtering or online dynamic sampling is used.
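
A minimal loading sketch, assuming a copy of DAPO-Math-17k is hosted on the Hugging Face Hub; the dataset id below is a placeholder, not necessarily the exact copy we trained on.

from datasets import load_dataset

# Placeholder id: substitute the DAPO-Math-17k copy you intend to use.
dataset = load_dataset("your-namespace/DAPO-Math-17k", split="train")
print(len(dataset), dataset[0])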

Usage

Basic Inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "hbx/JustRL-Nemotron-1.5B"  # or JustRL-DeepSeek-1.5B
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

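# Replace <problem> with the problem statement you want the model to solve.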
prompt = """<problem>

Please reason step by step, and put your final answer within \\boxed{}."""

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=16384,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)

Batch Inference with vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="hbx/JustRL-Nemotron-1.5B",
    tensor_parallel_size=1,
    max_model_len=32768
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=16384,
)

problems = [...]  # Your list of problems
responses = llm.generate(problems, sampling_params)
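
Since llm.generate consumes raw prompt strings, you may want to format each problem with the model's chat template first, mirroring the basic-inference example above. A minimal sketch (the helper name to_chat_prompt is ours):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hbx/JustRL-Nemotron-1.5B")

def to_chat_prompt(problem: str) -> str:
    # Wrap a raw problem in the chat template, as in the basic-inference example.
    content = problem + "\n\nPlease reason step by step, and put your final answer within \\boxed{}."
    return tokenizer.apply_chat_template(
        [{"role": "user", "content": content}],
        tokenize=False,
        add_generation_prompt=True,
    )

prompts = [to_chat_prompt(p) for p in problems]
responses = llm.generate(prompts, sampling_params)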

Reproduction

We provide evaluation scripts based on POLARIS; the evaluation script is available here.

Citation

@misc{he2025justrl,
  title         = {JustRL: Scaling a 1.5B LLM with a Simple RL Recipe},
  author        = {Bingxiang He and Zekai Qu and Zeyuan Liu and Yinghao Chen and Yuxin Zuo and Cheng Qian and Kaiyan Zhang and Weize Chen and Chaojun Xiao and Ganqu Cui and Ning Ding and Zhiyuan Liu},
  howpublished  = {\url{https://relieved-cafe-fe1.notion.site/JustRL-Scaling-a-1-5B-LLM-with-a-Simple-RL-Recipe-24f6198b0b6b80e48e74f519bfdaf0a8}},
  note          = {Notion Blog},
  year          = {2025},
  month         = {Nov},
  day           = {4}
}