Kyrgyz LLM Evaluation Dataset
Welcome to KyrgyzLLM-Bench, the Kyrgyz LLM Evaluation Dataset: your one-stop solution for evaluating Large Language Models (LLMs) in Kyrgyz. This toolkit helps you measure model performance across diverse domains and question types specific to the Kyrgyz language, so your models can be more accurate, robust, and helpful for Kyrgyz-speaking users. Whether you're a researcher, developer, or practitioner, this dataset is tailored to help your Kyrgyz-capable LLM thrive.
Quick facts:
- Language Support: Kyrgyz (ky)
- Audience: Researchers, developers, and the Kyrgyz NLP community
- Native Kyrgyz datasets: KyrgyzMMLU and KyrgyzRC (reading comprehension)
- Translated benchmarks: commonsense reasoning & understanding (HellaSwag, WinoGrande), reading comprehension (BoolQ), mathematics (GSM8K), robustness & factuality (TruthfulQA)
- Tooling: first-class support with Lighteval (see lighteval/scripts) and LM_harness (see lm_harness/scripts); a quick-start evaluation sketch follows below
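Assuming the harness route goes through the standard lm-evaluation-harness Python API, a minimal sketch of a run might look like the following. The Kyrgyz task names here are hypothetical placeholders; the actual task configs ship in lm_harness/scripts.

```python
# Minimal sketch (not the repository's official script): evaluating a HF model on
# Kyrgyz tasks via the lm-evaluation-harness Python API. The task names below are
# hypothetical placeholders; the real configs live in lm_harness/scripts.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-0.5B-Instruct",  # any causal LM from the tables below
    tasks=["kyrgyzmmlu", "kyrgyzrc"],                     # hypothetical task names
    num_fewshot=5,                                        # 5-shot, matching the few-shot setup
)
print(results["results"])                                 # per-task metrics
```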
What's Inside?
KyrgyzLLM-Bench is a comprehensive suite purpose-built to evaluate LLMs' deep understanding and reasoning in Kyrgyz. It combines natively authored benchmarks with carefully translated and post-edited international tasks to provide broad and culturally grounded coverage.
- Language: Kyrgyz (ky)
- Components:
- KyrgyzMMLU (native, multiple-choice, 7,977 questions)
- KyrgyzRC (native, reading comprehension, 400 questions)
- Translated benchmarks: HellaSwag, WinoGrande, BoolQ, GSM8K, TruthfulQA (manually post-edited)
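All components are published as Hugging Face datasets under the TTimur namespace (the full list of repository IDs appears in the citation section). A minimal loading sketch with the `datasets` library is shown below; split and field names vary per dataset, so inspect each one after loading.

```python
# Minimal sketch: pulling the benchmark components from the Hugging Face Hub.
# Split and column names are not assumed here; print the dataset to inspect them.
from datasets import load_dataset

kyrgyz_mmlu = load_dataset("TTimur/kyrgyzMMLU")
kyrgyz_rc = load_dataset("TTimur/kyrgyzRC")
gsm8k_kg = load_dataset("TTimur/gsm8k_kg")

print(kyrgyz_mmlu)  # shows available splits and features
print(kyrgyz_rc)
print(gsm8k_kg)
```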
Diverse and Deep Evaluation Domains
KyrgyzLLM-Bench spans foundational sciences, humanities, and applied domains relevant to the Kyrgyz national curriculum and public knowledge.
KyrgyzMMLU (native, multiple-choice)
- Total: 7,977 questions written by curriculum experts
- Subjects and counts:
- Math: 1,169
- Physics: 1,228
- Geography: 640
- Biology: 1,550
- Kyrgyz Language: 360
- Kyrgyz Literature: 1,169
- Kyrgyz History: 440
- Medicine: 216
- Chemistry: 1,205
KyrgyzRC (native, reading comprehension)
- Total: 400 multiple-choice questions (4 options, 1 correct)
- Sources: Kyrgyz Wikipedia, national news, literature, and school-style math word problems
- Skills evaluated: factual understanding, inference, vocabulary-in-context, multi-sentence reasoning
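Scoring such a task reduces to plain multiple-choice accuracy. Here is a minimal sketch; the field names (question, choices, answer) are assumptions for illustration, not the dataset's actual schema.

```python
# Minimal sketch: accuracy for a 4-option multiple-choice task such as KyrgyzRC.
# Field names (question, choices, answer) are illustrative assumptions; check the
# real dataset schema before use.
from typing import Callable

def mc_accuracy(examples: list[dict], predict: Callable[[str, list[str]], int]) -> float:
    """predict(question, choices) returns the index of the model's chosen option."""
    correct = sum(
        int(predict(ex["question"], ex["choices"]) == ex["answer"])  # gold answer as index
        for ex in examples
    )
    return correct / len(examples)

# Trivial baseline that always picks the first option:
demo = [{"question": "…", "choices": ["A", "B", "C", "D"], "answer": 2}]
print(mc_accuracy(demo, lambda q, opts: 0))  # 0.0
```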
Translated Benchmarks (with manual post-editing)
- Commonsense reasoning: HellaSwag, WinoGrande
- Reading comprehension: BoolQ
- Mathematics: GSM8K
- Robustness/factuality: TruthfulQA
Translation pipeline: dual-model machine translation (Claude 4 Sonnet, Gemini 2.5 Flash), ensemble comparison, expert post-editing, and quality checks (incl. back-translation sampling).
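To make the quality-check step concrete, here is an illustrative sketch (not the authors' actual tooling) of how low-agreement items between the two MT systems could be flagged for expert post-editing and how a back-translation sample could be drawn; all field names are hypothetical.

```python
# Illustrative sketch only, not the project's actual QC code: flag items where the two
# machine translations diverge most (candidates for expert post-editing) and draw a
# random sample for back-translation spot checks. Field names are hypothetical.
import random
from difflib import SequenceMatcher

def agreement(a: str, b: str) -> float:
    """Rough surface similarity between two candidate Kyrgyz translations."""
    return SequenceMatcher(None, a, b).ratio()

def triage(items: list[dict], threshold: float = 0.6, sample_size: int = 20):
    flagged = [it for it in items if agreement(it["mt_a_ky"], it["mt_b_ky"]) < threshold]
    back_translation_sample = random.sample(items, k=min(sample_size, len(items)))
    return flagged, back_translation_sample
```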
Turbocharge Your Evaluations with Lighteval
If you want to evaluate models with Lighteval, please see README_lighteval.md: all installation steps, supported Kyrgyz tasks, example commands (HF and local), and leaderboard task files are documented there.
- Guide: README_lighteval.md
Results
Below are the benchmark results for Kyrgyz in both zero-shot and few-shot settings.
Higher scores indicate better performance (accuracy for most tasks, quasi-exact match (QEM) for GSM8K).
Kyrgyz Zero-Shot Evaluation Results
| Model | KyrgyzMMLU | KyrgyzRC | WinoGrande | BoolQ | HellaSwag | GSM8K | TruthfulQA | Average |
|---|---|---|---|---|---|---|---|---|
| Qwen | ||||||||
| Qwen2.5-0.5B-Instruct | 27.4 | 53.2 | 51.5 | 37.9 | 14.6 | 0.7 | 33.5 | 31.3 |
| Qwen2.5-1.5B-Instruct | 27.9 | 60.5 | 50.1 | 38.6 | 22.9 | 0.7 | 32.5 | 33.3 |
| Qwen2.5-3B-Instruct | 28.6 | 66.0 | 50.5 | 59.4 | 22.0 | 0.7 | 34.2 | 37.3 |
| Qwen2.5-7B-Instruct | 31.5 | 70.0 | 48.7 | 56.3 | 10.0 | 1.1 | 34.1 | 36.0 |
| Qwen3-0.6B | 26.0 | 61.8 | 49.8 | 38.0 | 11.1 | 0.7 | 29.9 | 31.0 |
| Qwen3-1.7B | 27.9 | 61.8 | 48.9 | 40.4 | 24.6 | 0.7 | 29.6 | 33.4 |
| Qwen3-4B | 30.3 | 68.2 | 49.0 | 38.3 | 24.5 | 0.7 | 32.9 | 34.8 |
| Qwen3-8B | 32.1 | 71.8 | 51.0 | 39.2 | 24.6 | 0.7 | 34.7 | 36.3 |
| Gemma | ||||||||
| gemma-3-1b-it | 26.7 | 58.2 | 50.0 | 37.9 | 24.4 | 0.7 | 34.0 | 33.1 |
| gemma-3-270m | 27.5 | 56.8 | 48.3 | 37.9 | 17.4 | 0.7 | 34.7 | 31.9 |
| gemma-3-4b-it | 30.3 | 70.2 | 50.6 | 58.3 | 24.6 | 0.7 | 34.7 | 38.5 |
| Meta-Llama | ||||||||
| Llama-3.1-8B-Instruct | 31.0 | 75.2 | 50.6 | 50.3 | 26.6 | 0.7 | 33.7 | 38.3 |
| Llama-3.2-1B-Instruct | 26.3 | 58.2 | 49.4 | 38.3 | 0.2 | 0.7 | 30.1 | 29.0 |
| Llama-3.2-3B-Instruct | 27.8 | 64.2 | 49.1 | 43.1 | 24.5 | 0.7 | 31.5 | 34.4 |
Zero-shot evaluation results on Kyrgyz benchmarks (%). The metric is accuracy, except for GSM8K which uses QEM. Higher is better.
Kyrgyz Few-Shot Evaluation Results
| Model | KyrgyzMMLU | KyrgyzRC | WinoGrande | BoolQ | HellaSwag | GSM8K | TruthfulQA | Average |
|---|---|---|---|---|---|---|---|---|
| Qwen | ||||||||
| Qwen2.5-0.5B-Instruct | 25.4 | 54.0 | 49.7 | 61.0 | 25.9 | 2.2 | 33.4 | 35.9 |
| Qwen2.5-1.5B-Instruct | 28.7 | 67.5 | 50.1 | 58.0 | 26.5 | 6.1 | 32.9 | 38.5 |
| Qwen2.5-3B-Instruct | 34.0 | 73.2 | 51.3 | 57.4 | 23.7 | 9.5 | 34.4 | 40.5 |
| Qwen2.5-7B-Instruct | 38.5 | 74.8 | 50.4 | 64.6 | 17.8 | 32.1 | 36.2 | 44.9 |
| Qwen3-0.6B | 26.8 | 59.5 | 50.1 | 60.1 | 26.4 | 4.3 | 30.0 | 36.8 |
| Qwen3-1.7B | 30.8 | 71.2 | 48.6 | 62.0 | 25.2 | 18.5 | 30.3 | 41.0 |
| Qwen3-4B | 38.5 | 77.2 | 48.1 | 74.0 | 24.7 | 51.5 | 32.5 | 49.4 |
| Qwen3-8B | 44.5 | 81.8 | 50.6 | 76.9 | 26.4 | 60.0 | 35.8 | 53.7 |
| Gemma | ||||||||
| gemma-3-1b-it | 26.5 | 38.0 | 48.9 | 62.8 | 23.5 | 3.2 | 31.3 | 33.5 |
| gemma-3-270m | 27.0 | 53.2 | 48.7 | 61.5 | 27.6 | 1.4 | 36.6 | 36.6 |
| gemma-3-4b-it | 29.5 | 25.0 | 49.6 | 62.1 | 24.6 | 0.0 | 50.0 | 34.5 |
| Meta-Llama | ||||||||
| Llama-3.1-8B-Instruct | 38.1 | 80.5 | 51.6 | 75.5 | 21.9 | 37.0 | 34.4 | 48.5 |
| Llama-3.2-1B-Instruct | 26.1 | 45.8 | 49.7 | 62.0 | 25.8 | 2.7 | 30.3 | 34.7 |
| Llama-3.2-3B-Instruct | 29.4 | 64.8 | 48.9 | 62.3 | 25.3 | 12.9 | 32.9 | 39.6 |
Few-shot evaluation results on Kyrgyz benchmarks (%). All tasks are 5-shot, except for HellaSwag (10-shot). The metric is accuracy, except for GSM8K which uses QEM. Higher is better.
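For reference, QEM compares extracted final answers rather than raw output strings. A minimal sketch of the idea follows; the exact answer extraction and normalization used by the evaluation harness may differ.

```python
# Minimal sketch of a quasi-exact-match (QEM) style check for GSM8K answers.
# The harness's real extraction/normalization may differ; this only illustrates
# comparing the final numeric answer instead of the raw generation.
import re

def extract_number(text: str) -> str | None:
    """Return the last number-like token in the text, with thousands separators removed."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def quasi_exact_match(prediction: str, reference: str) -> bool:
    pred = extract_number(prediction)
    return pred is not None and pred == extract_number(reference)

print(quasi_exact_match("Жообу: баары болуп 42 сом.", "42"))  # True
```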
Contributions Welcome!
Have ideas, bug fixes, or want to add a custom task? We'd love for you to be part of the journey! Contributions help grow and enhance KyrgyzLLM-Bench.
Citation
Thanks for using KyrgyzLLM-Bench, where large language models meet Kyrgyz precision and creativity! Let's build smarter models together.
If you find this dataset useful in your research, please cite it as follows:
@article{KyrgyzLLM-Bench,
  title={Bridging the Gap in Less-Resourced Languages: Building a Benchmark for Kyrgyz Language Models},
  author={Timur Turatali and Aida Turdubaeva and Islam Zhenishbekov and Zhoomart Suranbaev and Anton Alekseev and Rustem Izmailov},
  year={2025},
  url={https://huggingface.co/datasets/TTimur/kyrgyzMMLU,
       https://huggingface.co/datasets/TTimur/kyrgyzRC,
       https://huggingface.co/datasets/TTimur/winogrande_kg,
       https://huggingface.co/datasets/TTimur/boolq_kg,
       https://huggingface.co/datasets/TTimur/truthfulqa_kg,
       https://huggingface.co/datasets/TTimur/gsm8k_kg,
       https://huggingface.co/datasets/TTimur/hellaswag_kg}
}
