
Kyrgyz LLM Evaluation Dataset

Welcome to KyrgyzLLM-Bench, the Kyrgyz LLM Evaluation Dataset: your one-stop solution for evaluating Large Language Models (LLMs) in Kyrgyz. This toolkit helps you measure model performance across diverse domains and question types specific to the Kyrgyz language, so your models can be more accurate, robust, and helpful for Kyrgyz-speaking users. Whether you're a researcher, developer, or practitioner, this dataset is tailored to help your Kyrgyz-capable LLM thrive.

Paper
Model

Kyrgyz LLM Evaluation in the Tech Landscape

Quick facts:

  • Language Support: Kyrgyz (ky)

  • Audience: Researchers, developers, and the Kyrgyz NLP community

  • Native Kyrgyz-language datasets: KyrgyzMMLU and KyrgyzRC (reading comprehension)

  • Translated benchmarks: commonsense reasoning & understanding (HellaSwag, WinoGrande), reading comprehension (BoolQ), mathematics (GSM8K), robustness & factuality (TruthfulQA)

  • Tooling: first-class support for Lighteval (see lighteval/ scripts) and lm-evaluation-harness (see lm_harness/ scripts)

πŸ” What's Inside?

KyrgyzLLM-Bench is a comprehensive suite purpose-built to evaluate LLMs' deep understanding and reasoning in Kyrgyz. It combines natively authored benchmarks with carefully translated and post-edited international tasks to provide broad and culturally grounded coverage.

  • Language: Kyrgyz (ky)
  • Components:
    • KyrgyzMMLU (native, multiple-choice, 7,977 questions)
    • KyrgyzRC (native, reading comprehension, 400 questions)
    • Translated benchmarks: HellaSwag, WinoGrande, BoolQ, GSM8K, TruthfulQA (manually post-edited)
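
All components are published as Hugging Face datasets (repository IDs are listed in the citation at the end of this README). Below is a minimal sketch of pulling them with the `datasets` library; split and column names are left to the individual dataset cards and may differ.

```python
# Minimal sketch: load every benchmark component from the Hugging Face Hub.
# Repository IDs come from the citation in this README; splits/columns are
# defined by the individual dataset cards.
from datasets import load_dataset

NATIVE = ["TTimur/kyrgyzMMLU", "TTimur/kyrgyzRC"]
TRANSLATED = [
    "TTimur/hellaswag_kg",
    "TTimur/winogrande_kg",
    "TTimur/boolq_kg",
    "TTimur/gsm8k_kg",
    "TTimur/truthfulqa_kg",
]

for repo_id in NATIVE + TRANSLATED:
    ds = load_dataset(repo_id)  # loads all available splits
    print(repo_id, {split: len(ds[split]) for split in ds})
```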

🧠 Diverse and Deep Evaluation Domains

KyrgyzLLM-Bench spans foundational sciences, humanities, and applied domains relevant to the Kyrgyz national curriculum and public knowledge.

KyrgyzMMLU (native, multiple-choice)

  • Total: 7,977 questions written by curriculum experts
  • Subjects and counts:
    • Math: 1,169
    • Physics: 1,228
    • Geography: 640
    • Biology: 1,550
    • Kyrgyz Language: 360
    • Kyrgyz Literature: 1,169
    • Kyrgyz History: 440
    • Medicine: 216
    • Chemistry: 1,205
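
To make the multiple-choice format concrete, here is a hypothetical sketch of turning one KyrgyzMMLU-style item into a zero-shot prompt. The field names (`question`, `choices`, `answer`) and the example item are illustrative assumptions, not the dataset's confirmed schema.

```python
# Illustrative only: field names and the example item are assumptions,
# not the confirmed KyrgyzMMLU schema.
LETTERS = "ABCD"

def format_prompt(item: dict) -> str:
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(item["choices"]))
    return f"{item['question']}\n{options}\nЖооп:"  # "Жооп" = "Answer"

example = {
    "question": "Кыргызстандын борбору кайсы шаар?",  # "Which city is the capital of Kyrgyzstan?"
    "choices": ["Ош", "Бишкек", "Каракол", "Нарын"],
    "answer": 1,  # index of the correct option -> "B"
}
print(format_prompt(example))
```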

KyrgyzRC (native, reading comprehension)

  • Total: 400 multiple-choice questions (4 options, 1 correct)
  • Sources: Kyrgyz Wikipedia, national news, literature, and school-style math word problems
  • Skills evaluated: factual understanding, inference, vocabulary-in-context, multi-sentence reasoning
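
Below is a rough, non-authoritative sketch of one common way to score such 4-option items with a causal LM: pick the option whose tokens receive the highest log-likelihood given the passage and question. This is not the exact scoring code used by Lighteval or lm-evaluation-harness, and the model name is just one entry from the results tables below.

```python
# Rough sketch of log-likelihood scoring for 4-option items.
# Not the exact Lighteval / lm-evaluation-harness implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # any causal LM from the results tables
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def option_logprob(context: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the context."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts tokens 1..N-1
    targets = full_ids[0, 1:]
    token_lp = logprobs[torch.arange(targets.shape[0]), targets]
    option_len = full_ids.shape[1] - ctx_len               # boundary tokenization is approximate
    return token_lp[-option_len:].sum().item()

def predict(passage: str, question: str, options: list[str]) -> int:
    context = f"{passage}\nСуроо: {question}"              # "Суроо" = "Question"
    return max(range(len(options)), key=lambda i: option_logprob(context, options[i]))
```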

Translated Benchmarks (with manual post-editing)

  • Commonsense reasoning: HellaSwag, WinoGrande
  • Reading comprehension: BoolQ
  • Mathematics: GSM8K
  • Robustness/factuality: TruthfulQA

Translation pipeline: dual-model machine translation (Claude 4 Sonnet, Gemini 2.5 Flash), ensemble comparison, expert post-editing, and quality checks (incl. back-translation sampling).
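
For illustration, a small sketch of what the back-translation sampling step could look like. `translate` is a hypothetical stand-in (the actual pipeline uses Claude 4 Sonnet and Gemini 2.5 Flash plus expert post-editing), and the field names are assumptions.

```python
# Hypothetical sketch of the back-translation sampling check.
# `translate` is a stand-in for whatever MT model/API is used;
# field names ("source_en", "text_ky") are assumptions.
import random

def translate(text: str, src: str, tgt: str) -> str:
    raise NotImplementedError("plug in an MT model or API here")

def sample_for_review(items: list[dict], k: int = 50, seed: int = 0):
    """Draw k post-edited Kyrgyz items and back-translate them to English
    so reviewers can compare against the original English source."""
    rng = random.Random(seed)
    for item in rng.sample(items, k):
        yield {
            "original_en": item["source_en"],
            "post_edited_ky": item["text_ky"],
            "back_translated_en": translate(item["text_ky"], src="ky", tgt="en"),
        }
```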

⚡ Turbocharge Your Evaluations with Lighteval 🚀

If you want to evaluate models with Lighteval, please see README_lighteval.md: all installation steps, supported Kyrgyz tasks, example commands (HF and local), and leaderboard task files are documented there.
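
If you prefer lm-evaluation-harness, here is a hedged sketch of its Python entry point. The Kyrgyz task names below are placeholders; the real task definitions live in the lm_harness/ scripts.

```python
# Hedged sketch using lm-evaluation-harness's Python API.
# Task names are placeholders; use the definitions shipped in lm_harness/.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-0.5B-Instruct",
    tasks=["kyrgyz_mmlu", "kyrgyz_rc"],  # placeholder task names
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```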

📊 Results

Below are the benchmark results for Kyrgyz in both zero-shot and few-shot settings.
Higher scores indicate better performance; the metric is accuracy for most tasks and quasi-exact match (QEM) for GSM8K.


πŸ”οΈ Kyrgyz Zero-Shot Evaluation Results

| Model | KyrgyzMMLU | KyrgyzRC | WinoGrande | BoolQ | HellaSwag | GSM8K | TruthfulQA | Average |
|---|---|---|---|---|---|---|---|---|
| **Qwen** | | | | | | | | |
| Qwen2.5-0.5B-Instruct | 27.4 | 53.2 | 51.5 | 37.9 | 14.6 | 0.7 | 33.5 | 31.3 |
| Qwen2.5-1.5B-Instruct | 27.9 | 60.5 | 50.1 | 38.6 | 22.9 | 0.7 | 32.5 | 33.3 |
| Qwen2.5-3B-Instruct | 28.6 | 66.0 | 50.5 | 59.4 | 22.0 | 0.7 | 34.2 | 37.3 |
| Qwen2.5-7B-Instruct | 31.5 | 70.0 | 48.7 | 56.3 | 10.0 | 1.1 | 34.1 | 36.0 |
| Qwen3-0.6B | 26.0 | 61.8 | 49.8 | 38.0 | 11.1 | 0.7 | 29.9 | 31.0 |
| Qwen3-1.7B | 27.9 | 61.8 | 48.9 | 40.4 | 24.6 | 0.7 | 29.6 | 33.4 |
| Qwen3-4B | 30.3 | 68.2 | 49.0 | 38.3 | 24.5 | 0.7 | 32.9 | 34.8 |
| Qwen3-8B | 32.1 | 71.8 | 51.0 | 39.2 | 24.6 | 0.7 | 34.7 | 36.3 |
| **Gemma** | | | | | | | | |
| gemma-3-1b-it | 26.7 | 58.2 | 50.0 | 37.9 | 24.4 | 0.7 | 34.0 | 33.1 |
| gemma-3-270m | 27.5 | 56.8 | 48.3 | 37.9 | 17.4 | 0.7 | 34.7 | 31.9 |
| gemma-3-4b-it | 30.3 | 70.2 | 50.6 | 58.3 | 24.6 | 0.7 | 34.7 | 38.5 |
| **Meta-Llama** | | | | | | | | |
| Llama-3.1-8B-Instruct | 31.0 | 75.2 | 50.6 | 50.3 | 26.6 | 0.7 | 33.7 | 38.3 |
| Llama-3.2-1B-Instruct | 26.3 | 58.2 | 49.4 | 38.3 | 0.2 | 0.7 | 30.1 | 29.0 |
| Llama-3.2-3B-Instruct | 27.8 | 64.2 | 49.1 | 43.1 | 24.5 | 0.7 | 31.5 | 34.4 |

Zero-shot evaluation results on Kyrgyz benchmarks (%). The metric is accuracy, except for GSM8K which uses QEM. Higher is better.
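
For reference, the Average column is an unweighted mean over the seven task scores; the snippet below reproduces the first Qwen row.

```python
# The Average column: unweighted mean of the seven task scores,
# rounded to one decimal place.
def row_average(scores: dict[str, float]) -> float:
    return round(sum(scores.values()) / len(scores), 1)

qwen25_05b_zero_shot = {
    "KyrgyzMMLU": 27.4, "KyrgyzRC": 53.2, "WinoGrande": 51.5, "BoolQ": 37.9,
    "HellaSwag": 14.6, "GSM8K": 0.7, "TruthfulQA": 33.5,
}
print(row_average(qwen25_05b_zero_shot))  # 31.3, matching the table above
```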


πŸ”οΈ Kyrgyz Few-Shot Evaluation Results

| Model | KyrgyzMMLU | KyrgyzRC | WinoGrande | BoolQ | HellaSwag | GSM8K | TruthfulQA | Average |
|---|---|---|---|---|---|---|---|---|
| **Qwen** | | | | | | | | |
| Qwen2.5-0.5B-Instruct | 25.4 | 54.0 | 49.7 | 61.0 | 25.9 | 2.2 | 33.4 | 35.9 |
| Qwen2.5-1.5B-Instruct | 28.7 | 67.5 | 50.1 | 58.0 | 26.5 | 6.1 | 32.9 | 38.5 |
| Qwen2.5-3B-Instruct | 34.0 | 73.2 | 51.3 | 57.4 | 23.7 | 9.5 | 34.4 | 40.5 |
| Qwen2.5-7B-Instruct | 38.5 | 74.8 | 50.4 | 64.6 | 17.8 | 32.1 | 36.2 | 44.9 |
| Qwen3-0.6B | 26.8 | 59.5 | 50.1 | 60.1 | 26.4 | 4.3 | 30.0 | 36.8 |
| Qwen3-1.7B | 30.8 | 71.2 | 48.6 | 62.0 | 25.2 | 18.5 | 30.3 | 41.0 |
| Qwen3-4B | 38.5 | 77.2 | 48.1 | 74.0 | 24.7 | 51.5 | 32.5 | 49.4 |
| Qwen3-8B | 44.5 | 81.8 | 50.6 | 76.9 | 26.4 | 60.0 | 35.8 | 53.7 |
| **Gemma** | | | | | | | | |
| gemma-3-1b-it | 26.5 | 38.0 | 48.9 | 62.8 | 23.5 | 3.2 | 31.3 | 33.5 |
| gemma-3-270m | 27.0 | 53.2 | 48.7 | 61.5 | 27.6 | 1.4 | 36.6 | 36.6 |
| gemma-3-4b-it | 29.5 | 25.0 | 49.6 | 62.1 | 24.6 | 0.0 | 50.0 | 34.5 |
| **Meta-Llama** | | | | | | | | |
| Llama-3.1-8B-Instruct | 38.1 | 80.5 | 51.6 | 75.5 | 21.9 | 37.0 | 34.4 | 48.5 |
| Llama-3.2-1B-Instruct | 26.1 | 45.8 | 49.7 | 62.0 | 25.8 | 2.7 | 30.3 | 34.7 |
| Llama-3.2-3B-Instruct | 29.4 | 64.8 | 48.9 | 62.3 | 25.3 | 12.9 | 32.9 | 39.6 |

Few-shot evaluation results on Kyrgyz benchmarks (%). All tasks are 5-shot, except for HellaSwag (10-shot). The metric is accuracy, except for GSM8K which uses QEM. Higher is better.

💡 Contributions Welcome!

Have ideas, bug fixes, or want to add a custom task? We'd love for you to be part of the journey! Contributions help grow and enhance the capabilities of KyrgyzLLM-Bench.

📜 Citation

Thanks for using KyrgyzLLM-Bench, where large language models meet Kyrgyz precision and creativity! Let's build smarter models together. 🚀

If you find this dataset useful in your research, please cite it as follows:

@article{KyrgyzLLM-Bench,
  title={Bridging the Gap in Less-Resourced Languages: Building a Benchmark for Kyrgyz Language Models},
  author={Timur Turatali and Aida Turdubaeva and Islam Zhenishbekov and Zhoomart Suranbaev and Anton Alekseev and Rustem Izmailov},
  year={2025},
  url={https://huggingface.co/datasets/TTimur/kyrgyzMMLU,
       https://huggingface.co/datasets/TTimur/kyrgyzRC,
       https://huggingface.co/datasets/TTimur/winogrande_kg,
       https://huggingface.co/datasets/TTimur/boolq_kg,
       https://huggingface.co/datasets/TTimur/truthfulqa_kg,
       https://huggingface.co/datasets/TTimur/gsm8k_kg,
       https://huggingface.co/datasets/TTimur/hellaswag_kg}
}