NetHack Research · 2026 Eval

Arabic-Native LLMs in 2026: The New Quality Bar for GCC Production Workloads

The era of "passable translation" is over. The 2026 bar for Arabic LLMs in GCC production is native reasoning in Modern Standard Arabic and the Khaleeji dialects — with domain accuracy, culturally-calibrated refusals, and the eval discipline to prove it.

NetHack Research NetHack ResearchIn-house AI infrastructure analyst team May 21, 2026 · 13 min read

Why Arabic was hard, and what changed

For most of the last decade Arabic occupied an awkward middle ground in NLP. It was big enough — over four hundred million speakers, an official UN language — that no serious multilingual system could ignore it, and difficult enough that the major labs treated it as a second-class citizen. The result was a long stretch in which Arabic support on frontier models meant "we can translate it," not "we can reason in it." The 2024–25 cycle quietly closed that gap. The 2026 cycle has moved the goalposts.

The technical reasons Arabic was hard are worth restating. Arabic has rich templatic morphology — a single trilateral root spawns dozens of derived forms — and a model that tokenises Arabic with English-trained byte-pair merges typically wastes two to three tokens per word relative to a well-tuned tokenizer. Diacritics are optional in real-world text, so the same surface string can be several different words and the disambiguating signal lives in context. Arabic is right-to-left, which is harmless to a transformer but leaks loss into surrounding tooling — string libraries, RTL UI, bidi edge cases — in ways that show up as quality regressions you cannot trace.

Dialect fragmentation is the harder problem. Modern Standard Arabic (MSA) is the written register of newspapers, contracts and formal speech, roughly uniform from Rabat to Muscat. The spoken language, by contrast, splinters into mutually-distinct regional dialects: Khaleeji in the Gulf, Levantine in the Sham, Egyptian, Maghrebi, Iraqi — each with internal sub-dialects. A model that handles MSA beautifully can still fall flat the moment a Riyadh customer service log appears in Najdi, or a Dubai voice agent gets a question in Emirati. Real GCC users code-switch constantly on top of that: an Arabic sentence with English technical nouns is unremarkable in WhatsApp and lethal to a model trained on clean monolingual data.

What changed between 2023 and 2026 was a stack of compounding improvements rather than a single breakthrough. Tokenizers got serious about Arabic — SentencePiece variants with Arabic-aware merges cut per-word token counts materially. Pretraining corpora grew an order of magnitude — Common Crawl Arabic stopped being a curiosity, supplemented by curated legal, medical and government corpora out of UAE, KSA and Egyptian institutions. Instruction-tuning datasets moved from "translated from English" to "written natively by Arabic-speaking annotators," which transformed register, idiom and refusal style. And RLHF with Arabic-speaking labelers became standard practice. By the time Falcon-Arabic, Jais-30B/70B and AceGPT had each iterated through their second or third generation, the floor for "Arabic-capable" had risen to a place considered frontier-grade two years earlier.

The five evaluation axes that actually matter in 2026

When you stop measuring Arabic with BLEU and start measuring it the way a buyer actually experiences it, five axes do most of the discriminating work. We use this grid for every Arabic-model evaluation we run at NetCity, and it is the grid we recommend to customers building their own.

Instruction-following in MSA

The first question is simply: when you ask the model to do something in correct Modern Standard Arabic, does it do that thing in correct Modern Standard Arabic? "Correct" here means register-appropriate — formal verb forms, agreement on gender and number, proper use of nominal sentences where required, and an absence of the awkward English-translated phrasing that gives away a model trained primarily on translated instructions. The cleanest in-house test is to take 50–100 prompts you would actually give to a model in production, write them in clean MSA, and have two native-Arabic reviewers rate the responses on a 1–5 Likert for correctness and register. Public benches: ArabicMMLU as a screen, with a strong note that ArabicMMLU is mostly multiple-choice and does not exercise long-form generation register.

Dialect fidelity

This is where most models still fail. When a user prompts in Khaleeji — "إيش رايك في هذا العقد؟" — does the model answer in Khaleeji, default to MSA, or worst-case answer in Egyptian or generic dialect-mixed Arabic? Default-to-MSA is acceptable in many product contexts. Default-to-Egyptian is jarring to a Gulf customer. Mixed-dialect answers are immediately unprofessional. The simplest test: 50 prompts in Khaleeji across a few sub-dialects (Emirati, Najdi, Hijazi, Kuwaiti, Qatari) and a binary judgement per response — appropriate register or not. The honest reality in 2026 is that very few public models are evaluated on this axis at all; you almost always have to test it yourself.

Code-switching robustness

Real GCC prompts are not monolingual. They are an Arabic sentence with the technical noun in English ("نريد API يدعم OAuth 2.0"), or an English sentence with the entity in Arabic ("the لجنة said the deadline is الخميس"). A model that handles code-switching gracefully is one that does not lose state at the script boundary — it tracks the entity in whichever script it appeared, reasons across the mix, and replies in whichever script the user implicitly chose. Models that were trained heavily on translated Arabic data tend to suffer here, because the training data never code-switched. The test is mechanical: take 40 real production prompts, classify them by mix, and rate the response for both correctness and naturalness of the chosen reply script.

Domain accuracy

This is the axis that decides whether the model is shippable in regulated workloads. Can it answer a question about a UAE Federal Decree-Law in language a junior associate could rely on? Can it extract the correct ICD-10 code from an Arabic discharge summary? Can it pull the obligor and the interest rate out of a Murabaha contract written in Arabic legal register? Public benchmarks do not measure this — ArabicTruthfulQA is closer than most, but is general-knowledge, not domain-specific. The only honest test is your own corpus, scored by domain experts in that domain. We typically build a 100-prompt domain pack per vertical (legal, medical, financial, real-estate) and use it as the dominant signal in any Arabic-model decision.

Refusal calibration

The most underrated axis. Many models trained primarily on English RLHF data exhibit Western-default refusal behaviour translated into Arabic — refusing to discuss alcohol regulation in the UAE, refusing to engage with shariah-compliant finance in the same nuanced way they engage with conventional finance, or producing hedge-laden non-answers on culturally normal topics. The other failure mode is the opposite: models that have had little Arabic-language safety training and are happy to produce material that would never pass internal review at a GCC institution. The test is a 50-prompt cultural calibration pack, mixing genuinely sensitive prompts with normal-but-culturally-specific prompts, and rating refusals for both appropriateness and helpfulness. No public benchmark we are aware of measures this well; it has to be built in-house.

Public benchmarks: what they tell you (and what they don't)

Public Arabic benchmarks are useful as a coarse filter. They are not useful as a production decision. The honest summary of the 2026 leaderboard landscape is that the benchmarks worth knowing are ArabicMMLU, AlGhafa, HELM-Arabic, ArabicSuperGLUE and ArabicTruthfulQA — and that each tells you something narrow.

ArabicMMLU is the Arabic counterpart to the English MMLU multiple-choice knowledge benchmark, covering subjects from STEM to humanities at multiple education levels. Scores are useful for a rough ranking of general knowledge in MSA. It does not test generation quality, dialect handling, or long-form reasoning. A model that scores well here is at least not catastrophically bad at MSA knowledge; nothing more should be read into the number.

AlGhafa is a native-Arabic-built benchmark from TII covering a wider range of tasks — reading comprehension, sentiment, paraphrase, NLI and similar — in true Arabic, not translated English. It is one of the better screens for general competence in MSA and is widely cited by the Arabic-first model cards. Its main blind spot is the same as ArabicMMLU's: dialects and long-form generation are under-represented.

HELM-Arabic — Stanford CRFM's HELM extended into Arabic — gives you a multi-task, multi-metric view across a broad scenario set. The strength is breadth; the weakness is that it inherits HELM's academic-measurement bias and is not enterprise-relevant. Treat it as literature-search input, not a buying decision.

ArabicSuperGLUE ports SuperGLUE into Arabic. Most modern models saturate it. Useful for catching regressions; not useful for distinguishing frontier from near-frontier.

ArabicTruthfulQA tests truthfulness on questions where the most likely human answer is wrong, including some culturally-Arab-context items. One of the few benches that touches cultural calibration at all, narrowly.

What none of these measure well: long-context Arabic reasoning over real documents, dialect fidelity in either direction, legal-Arabic accuracy, medical-Arabic accuracy, code-switching, and refusal calibration tuned to GCC norms. If you make a model decision on benchmarks alone, you are betting that your production traffic looks like the average academic benchmark. It does not. Use the public benches as a screen to eliminate clearly weaker candidates, then run domain-specific eval on your own data before you ship.

The 2026 model landscape — five families worth deploying

The 2026 landscape for Arabic-capable LLMs is narrower than the English one and easier to reason about. Five families cover essentially every serious deployment we see in the GCC today.

Falcon-Arabic (TII, UAE)

UAE-sovereign provenance, strong MSA, permissive license.

StrengthsBuilt by the Technology Innovation Institute in Abu Dhabi specifically for Arabic-first usage. Strong MSA generation, native instruction tuning, and a real commitment to Arabic in the model card rather than as a checkbox.

WeaknessesDialect coverage is still strongest in Gulf-adjacent registers; less robust on Maghrebi. Open-weight ecosystem tooling lags Llama/Qwen for niche inference optimisations.

Best forSovereign-by-default GCC deployments where UAE provenance matters, and MSA-dominant workloads (gov, legal, formal customer comms).

LicensePermissive open-weight per the Falcon family's stated terms — read the current license before commercial deployment.

Jais / Jais-Adapted (Inception · G42)

Frontier-class Arabic, multiple tiers, both sovereign-self-hosted and managed.

StrengthsThe Jais family — including the 30B and 70B tiers — is one of the most heavily-invested Arabic-first programmes in the world, out of Inception (the G42 AI subsidiary) in collaboration with MBZUAI and Cerebras. Strong across MSA, instruction-following and Arabic-English code-switching, and available both as open-weight downloads and via managed endpoints including Microsoft Azure. As reported on the model's own card, Jais variants are competitive with or exceed comparable-size multilingual models on Arabic-native benchmarks.

WeaknessesDialect fidelity varies by sub-dialect. The largest tiers need serious GPU footprint to serve at low latency.

Best forProduction Arabic workloads that need frontier-class quality and either GCC-sovereign hosting or a managed regional endpoint. Both the 30B and 70B tiers are sensible defaults, with 30B usually the better latency/cost compromise.

LicenseOpen weights under the Jais community license terms; managed access via partner clouds. Check the current license text for commercial restrictions.

AceGPT (KAUST-affiliated)

Research-grade Arabic alignment, strong instruction-following.

StrengthsEmerging from a KAUST-affiliated research collaboration, AceGPT focuses specifically on Arabic alignment — RLHF with Arabic-speaking labelers, Arabic-native instruction data, and explicit work on cultural and value alignment. Strong on instruction-following and produces well-structured Arabic responses.

WeaknessesSmaller community and tooling ecosystem than Falcon or Jais; less proven at high-traffic production scale.

Best forTeams that want Arabic-aligned behaviour as the primary design property — especially in education, content moderation and culturally-sensitive customer-facing surfaces.

LicenseOpen weights under the licence terms published with each release. Confirm commercial use terms before deployment.

Qwen 2.5 / Qwen 3 (Alibaba)

Surprisingly strong multilingual including Arabic; broad open weights.

StrengthsQwen has quietly become one of the strongest multilingual open-weight families in the world, with credible Arabic performance — especially in the 32B and 72B tiers and in the Qwen 3 generation. Strong instruction-following, strong tool use, mature open-source tooling, and a deep variant catalog (chat, instruct, vision, code).

WeaknessesNot Arabic-first by design — Arabic is a multilingual capability rather than the centre of the training programme. Dialect fidelity and culturally-tuned refusals are weaker than the Arabic-first families.

Best forMultilingual products that need credible Arabic alongside Chinese, English and other major languages, and for cost-sensitive deployments where Qwen's strong open-weight tooling matters.

LicensePer the Qwen family terms — generally permissive for commercial use, with specific clauses worth reading in full.

Frontier API models — GPT-4 class, Claude 3.5 class, Gemini 1.5 Pro

Strongest raw quality — sovereignty & cost caveats apply.

StrengthsThe frontier API families from OpenAI, Anthropic and Google have closed most of the historical gap on Arabic. On long-context reasoning, complex instruction-following and multi-step tasks, they remain the highest-quality option for many Arabic workloads in 2026, particularly in MSA.

WeaknessesTwo structural ones. First, sovereignty: these are foreign-jurisdiction APIs, and the residency picture for prompts, logs and metadata frequently fails GCC regulated-sector requirements — see our prior piece on the sovereign AI stack. Second, cost: Arabic tokenises at a lower characters-per-token ratio than English, so the same workload costs materially more per request on a per-token API.

Best forNon-regulated workloads where raw quality dominates and per-token cost is acceptable; and as the upper fallback in a hybrid stack where the local model handles the long tail.

LicenseClosed API, vendor-specific commercial terms.

Building your own Arabic eval harness

The single highest-leverage thing a GCC team can do this year is to stop deciding Arabic-model questions on the basis of public benchmarks, and start running a small, disciplined, in-house eval harness on real production prompts. The work is less than you think. The payoff is more than you think.

The playbook we use, and recommend, has four steps.

  1. Sample 200–500 production prompts. Pull real, anonymised prompts from the top use-cases your product actually serves — customer support, document Q&A, agent tool calls, summarisation, whatever you ship. Stratify the sample across MSA, Khaleeji, code-switched, and any other registers that show up at material volume. Anonymise carefully; the value of the eval is destroyed if reviewers see PII.
  2. Have two native reviewers per prompt rate on a 5-point Likert across the five axes above — instruction-following, dialect fidelity, code-switching, domain accuracy, refusal calibration. Compute Cohen's kappa between reviewers; if it's below 0.5, your rubric is ambiguous and needs tightening before the numbers mean anything.
  3. Run the same prompts through three to five candidate models — for most GCC teams, that's Falcon-Arabic, Jais (30B or 70B), one frontier API, and one Qwen tier. Randomise and blind the source so reviewers cannot anchor on the brand.
  4. Score, regret, decide. Aggregate per axis and per use-case. Look at the worst-case responses, not just the average; an Arabic-language product that fails 5% of the time in production is failing every twentieth customer, and the failures will be the ones that get screenshotted.

A minimal eval-row structure that has worked well for us looks like this:

{
  "prompt_id":     "p-0042",
  "register":      "khaleeji",          // msa | khaleeji | levantine | mixed
  "use_case":      "contract_qa",
  "domain":        "legal",
  "prompt_ar":     "...",
  "prompt_en_gloss":"...",              // for non-Arabic reviewers
  "reference":     "...",               // optional gold answer
  "responses": {
    "falcon-arabic": "...",
    "jais-30b":      "...",
    "qwen-3-72b":    "...",
    "gpt-4-class":   "..."
  },
  "ratings": [
    {"reviewer":"R1","model":"jais-30b",
     "instr":5,"dialect":4,"codeswitch":5,"domain":4,"refusal":5,
     "notes":"natural Khaleeji, slight verb agreement slip"}
  ]
}

Three pitfalls to avoid. Reviewers default to MSA — if you don't actively reward dialect-appropriate responses, you'll train your own rubric to penalise the very behaviour you said you wanted. Vendors goodhart leaderboards — be sceptical of any model whose published Arabic scores improved sharply right around a benchmark release. And prompt engineering can disguise model weaknesses for two weeks and then break in week three when traffic distribution shifts; eval against the production distribution, not the demo distribution.

Production patterns that work

Four patterns recur across the Arabic-LLM deployments we ship for customers. They are cheap to build and produce material quality wins relative to the naïve single-model architecture.

1. Router-of-models

Route by language and register, not by feature.

WhatA lightweight classifier (or even a small LLM) at the top of the request path inspects the prompt and routes: MSA → model A, Khaleeji → model B, English → model C, code-and-technical → model D. Each downstream model can be the strongest available for its slice, rather than the best general-purpose compromise.

Why it winsThe compromise model never has to be best at everything. A 1–2 day build typically lifts user-visible quality more than a full month spent fine-tuning a single all-rounder.

2. Arabic-first prompting

Write the system prompt in Arabic, not translated English.

WhatFor Arabic-dominant products, write the system prompt natively in Arabic — by a fluent author — rather than translating an English original. Resist the temptation to keep both languages in the system prompt "for safety"; pick one and commit.

Why it winsArabic-native system prompts produce more on-register output and dramatically reduce the "machine-translated" feel of responses, especially on dialect tasks. The lift is material on every Arabic-first model we have tested.

3. Hybrid extraction → reasoning

Cheap small model does the document work; large model does the thinking.

WhatHeavy doc-extraction work — OCR cleanup, section detection, entity extraction, table parsing in Arabic documents — runs in a small Arabic-tuned model. The structured output is then handed to a larger model for the reasoning step. Each model does the job it is best at.

Why it winsFor document-heavy workloads (legal contracts, medical records, real-estate filings) the cost per request drops 5–10× versus running the whole pipeline on the large model, with equal or better accuracy because the small model is specifically tuned for Arabic structure.

4. Sovereign-default with fallback chain

Local-first, frontier-fallback, residency-tagged.

WhatA primary local Arabic model handles the bulk of traffic. A frontier API is configured as a fallback for the long tail — but the request router stamps every request with a residency tag so the audit log can prove which prompts left the region and why. If a prompt is flagged as regulated, the fallback is suppressed and the request fails closed.

Why it winsYou get most of the quality of the frontier without surrendering the residency story across the whole product. The hard part is the residency tagging discipline, not the fallback itself.

Where NetCity AI Cloud fits

NetCity AI Cloud exists to make Arabic-native production deployment a default, not a project. We host the Arabic-first model families — Falcon-Arabic, Jais, AceGPT — alongside the strongest multilingual models (Qwen, Llama, DeepSeek) on UAE-resident GPUs, behind OpenAI-compatible endpoints. Migrating a workload from a foreign API to a sovereign Arabic-capable model on NetCity is typically a base-URL change plus an evaluation pass, not a re-platforming exercise.

The piece most teams miss when assembling this themselves is the data. Our Domain Datasets are pre-curated Arabic corpora for the verticals where public benchmarks have nothing useful to say — legal, medical, real-estate, hospitality. They are PII-cleared, licensed locally, and ready to fine-tune against the day you start. They exist precisely to close the gap public Arabic benches do not measure.

The piece teams underinvest in is review. We embed an Arabic eval harness in every deployment, and we operate it with human Arabic-native reviewers in the loop — the same Cohen-kappa-disciplined process described above, running on your real traffic, against your real candidate models, against your real domain. The eval harness becomes a living artefact of the product, not a one-time benchmark.

For the wider architectural picture, see our earlier pieces on the sovereign AI stack and on the 2026 model-hosting landscape. The Arabic story sits inside both: the right model is the one you are allowed to deploy, that meets your domain bar, on infrastructure you can hand to an auditor without redaction.

Frequently asked questions

Is Jais better than GPT-4 for Arabic in 2026?

On many Arabic-first tasks Jais (especially the 70B tier) is competitive with or close to GPT-4-class quality, particularly in MSA and well-served Khaleeji sub-dialects. On long-context multi-step reasoning the frontier APIs typically retain an edge. The right question is rarely "which is better" in the abstract; it is "which is better on my domain, my registers, my latency budget and my sovereignty constraints." Run the eval.

Do I need separate models for MSA and Khaleeji?

Not necessarily, but for high-stakes customer-facing surfaces in the Gulf the answer is increasingly yes — or at minimum a router that picks the right model per request. A single-model strategy is fine for internal tools and for products where MSA-only responses are acceptable.

Can a single fine-tune cover all GCC dialects?

It can cover them adequately, particularly if your training data is well-balanced across Emirati, Najdi, Hijazi, Kuwaiti and Qatari. It rarely covers them excellently in a single pass. For premium quality, dialect-specific LoRA adapters layered on a strong Arabic base — selected per request — outperform a single monolithic fine-tune.

Is open-source Arabic good enough for regulated workloads (banking, healthcare)?

Yes, with the same caveats as for English: you need a disciplined eval, a clean fine-tune on domain data, an audit trail, and human review for high-risk outputs. The Arabic-first open-weight families plus Qwen now meet the floor for regulated deployment in most GCC sectors, provided the eval and audit work is actually done.

Should I translate English prompts to Arabic, or write them natively?

Write natively. Translation introduces register slippage and English idiom into the prompt, which then biases the model toward translated-English-style responses. The lift from writing system prompts and user-facing templates in Arabic from scratch is one of the highest-ROI changes a team can make.

How do I measure cultural appropriateness, not just linguistic correctness?

Build a 50–100 prompt cultural calibration pack covering culturally-normal but Western-default-refused topics (alcohol regulation, religious finance, regional politics, family law), and rate responses by native reviewers for both appropriateness and helpfulness. Track refusal rate alongside quality; refusal rate that is too high on normal prompts is as bad as refusal rate that is too low on sensitive ones.

What's the smallest Arabic-capable model I can self-host?

In 2026, useful Arabic generation begins around the 7B parameter mark for narrow tasks (classification, short-form answers, retrieval-grounded responses). For general-purpose assistant behaviour in Arabic, plan for 30B-class as a practical floor; for premium quality, 70B-class or frontier. Below 7B, Arabic quality tends to degrade faster than English quality at the same size.

How does Arabic tokenization affect my token bill?

Materially. Arabic typically tokenises to more tokens per character than English on tokenizers tuned primarily for English, which means the same logical request can cost 1.5–2× as many tokens on a foreign per-token API. On Arabic-aware tokenizers (Falcon-Arabic, Jais, the better multilingual ones) the gap narrows substantially. Factor tokenizer choice into your cost model — it is not a footnote.

NetHack Research logo
About NetHack Research NetHack Research is the in-house AI-infrastructure analyst team at NetCity Technologies LLC. We benchmark, deploy and stress-test every major model-hosting platform monthly so you don't have to. Editorial standards: no paid placement; methodology published per article.

Evaluate Arabic models on your data with our team

Tell us your use-case, your registers and your domain. A NetCity engineer will scope a real eval harness against Falcon-Arabic, Jais, AceGPT, Qwen and a frontier comparator on your traffic — with Arabic-native reviewers in the loop and a written recommendation at the end.

Talk to a NetCity engineer