QIMMA validates benchmarks before evaluating models, ensuring reported scores reflect genuine Arabic language capability in LLMs.

If you've been tracking Arabic LLM evaluation, you've probably noticed a growing tension: the number of benchmarks and leaderboards is expanding rapidly, but are we actually measuring what we think we're measuring? We built QIMMA (قمّة, Arabic for "summit") to answer that question systematically. Instead of aggregating existing Arabic benchmarks as-is and running models on them, we applied a rigorous quality validation pipeline before any evaluation took place. What we found was sobering: even widely used, well-regarded Arabic benchmarks contain systematic quality issues that can quietly corrupt evaluation results.

This post walks through what QIMMA is, how we built it, what problems we found, and what the model rankings look like once you clean things up.

Arabic is spoken by over 400 million people across diverse dialects and cultural contexts, yet the Arabic NLP evaluation landscape remains fragmented. A few key pain points have motivated this work:

- **Translation issues.** Many Arabic benchmarks are translations from English, which introduces distributional shifts. Questions that feel natural in English become awkward or culturally misaligned in Arabic, making benchmark data less representative of how Arabic is naturally used.
- **Absent quality validation.** Even native Arabic benchmarks are often released without rigorous quality checks. Annotation inconsistencies, incorrect gold answers, encoding errors, and cultural bias in ground-truth labels have all been documented in established resources.
- **Reproducibility gaps.** Evaluation scripts and per-sample outputs are rarely released publicly, making it hard to audit results or build on prior work.
- **Coverage fragmentation.** Existing leaderboards cover isolated tasks and narrow domains, making holistic model assessment difficult.

To illustrate where QIMMA sits relative to existing platforms: QIMMA is the only platform combining all five properties: open source, predominantly native Arabic content, systematic quality validation, code evaluation, and public per-sample inference outputs.

QIMMA consolidates 109 subsets from 14 source benchmarks into a unified evaluation suite of over 52,000 samples, spanning 7 domains:

This is the methodological heart of QIMMA. Before running a single model, we applied a multi-stage validation pipeline to every sample in every benchmark.

Each sample was independently evaluated by two state-of-the-art LLMs:

We chose two models with strong Arabic capability but different training data compositions, so that their combined judgment is more robust than either alone. Each model scores a sample against a 10-point rubric, with binary scores (0 or 1) per criterion:

A sample is flagged if either model scores it below 7/10. Where both models agree on elimination, the sample is dropped immediately; where only one model flags it, it proceeds to human review in Stage 2.

Flagged samples are reviewed by native Arabic speakers with cultural and dialectal familiarity. Human annotators make final calls on:

For culturally sensitive content, multiple perspectives are considered, since "correctness" can genuinely vary across Arab regions.
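To make the filtering rule concrete, here is a minimal sketch of the routing logic described above. It is an illustration only: the `JudgeResult` container, the `route_sample` helper, and the fixed threshold are placeholders, not QIMMA's actual code.

```python
from dataclasses import dataclass

ELIMINATION_THRESHOLD = 7  # a sample is flagged when a judge scores it below 7/10


@dataclass
class JudgeResult:
    """Rubric outcome from one LLM judge: ten binary criteria summed to a 0-10 score."""
    criteria: list[int]  # ten 0/1 scores, one per rubric criterion

    @property
    def total(self) -> int:
        return sum(self.criteria)


def route_sample(judge_a: JudgeResult, judge_b: JudgeResult) -> str:
    """Stage 1 decision rule as described in the post (hypothetical implementation).

    - Both judges flag the sample -> dropped immediately.
    - Exactly one judge flags it  -> escalated to Stage 2 human review.
    - Neither judge flags it      -> kept as-is.
    """
    a_flags = judge_a.total < ELIMINATION_THRESHOLD
    b_flags = judge_b.total < ELIMINATION_THRESHOLD
    if a_flags and b_flags:
        return "eliminate"
    if a_flags or b_flags:
        return "human_review"
    return "keep"


# Example: judge A finds four failing criteria (6/10), judge B finds one (9/10),
# so the sample is escalated rather than dropped outright.
print(route_sample(JudgeResult([1, 1, 0, 1, 0, 1, 1, 0, 1, 0]),
                   JudgeResult([1, 1, 1, 1, 1, 1, 1, 0, 1, 1])))  # -> "human_review"
```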
The pipeline revealed recurring quality issues across benchmarks: not isolated errors, but systematic patterns reflecting gaps in how the benchmarks were originally constructed. Among them:

- False or mismatched gold indices, factually wrong answers, and missing or raw-text answers.
- Corrupt or illegible text, spelling and grammar errors, and duplicate samples.
- Stereotype reinforcement and monolithic generalizations about diverse communities.
- Misalignment of gold answers with evaluation protocols.

Code benchmarks required a different intervention. Rather than discarding samples, we refined the Arabic problem statements in 3LM's Arabic adaptations of HumanEval+ and MBPP+, leaving task identifiers, reference solutions, and test suites completely unchanged.

QIMMA uses LightEval, EvalPlus, and FannOrFlop as its evaluation frameworks, chosen for consistency, multilingual community adoption, and reproducibility.

QIMMA standardizes prompting by question format, with six template types:

All prompts are in Arabic. For MizanQA and ArabCulture, benchmark-specific system prompts from the original papers are preserved.

Results are as of April 2026 and cover the top 10 evaluated models. Visit the live leaderboard for current rankings.

Across the full leaderboard (46 models), a clear but imperfect size-performance correlation emerges. However, there are interesting exceptions: