Abstract

Nehanda v3 ports the five-stage stacked QLoRA training methodology developed for v2.2 onto a new base: Qwen3.6-27B, a native vision-language model with a 262,144-token context window and integrated vision encoder. The training pipeline is unchanged — all five stages are text-only SFT and DPO, leaving the vision weights untouched. What changes is the substrate the epistemic alignment sits on: more capable base reasoning, a substantially larger context window, and native multimodal ingestion that makes the full pipeline viable for document-level RAG without an upstream OCR step.

Post-training evaluation across energy regulatory and intelligence analysis domains, scored by Claude Sonnet 4.6 as LLM judge, shows a weighted overall score of about 86% (86.5% energy, 85.5% intel). Against the v2.2 baseline (82.3% energy, 82.2% intel weighted overall, scored by Claude Opus 4.6), v3 shows a comparable result on a different eval design, and the platform it runs on is meaningfully stronger. The residual weaknesses are overconfidence in cross-source arithmetic and incomplete sycophancy correction in follow-through; both are targeted for the next DPO cycle.

Why a New Base Model

The v2 line was trained on Qwen2.5-32B, a strong general-purpose language model without native vision capability. That was appropriate at the time: the Zorora synthesis pipeline operated on pre-extracted text, and the training methodology was still being validated against the six-dimension eval harness. By v2.2 the pipeline had been proven — perfect multi-turn consistency, adversarial resistance, and sycophancy resistance that matched Claude Opus on the dimensions that matter for deployment.

The case for a new base became clear from two directions simultaneously. First, Qwen3.6 represented a meaningful generation jump: longer context, stronger base reasoning on technical documents, and — critically — a native vision encoder that enables direct image ingestion rather than relying on upstream text extraction. Second, the training data pipeline had accumulated a specific class of corruption that a vision-capable model could avoid entirely: HTML and CSS artefacts introduced by OCR pipelines processing legislative and regulatory documents.

Qwen2.5-32B (Nehanda v2.x base)
  • 32B parameters
  • Text-only architecture
  • 128K context window
  • Requires upstream OCR for documents
  • HTML/CSS artefacts in source excerpts
  • LoRA adapters: 1.15% trainable

Qwen3.6-27B VL (Nehanda v3 base)
  • 27B parameters, stronger base reasoning
  • Native vision encoder (untouched by training)
  • 262K context window (1M via YaRN)
  • Direct image ingestion for documents
  • Thinking & non-thinking modes unified
  • LoRA adapters: 1.15% trainable

The parameter count actually decreases from 32B to 27B. This is not a downgrade: Qwen3.6 shows stronger performance on structured reasoning and document comprehension tasks despite the smaller count, reflecting architectural improvements in the 3.x generation. The LoRA adapter configuration is identical — the same 1.15% of parameters are trained, the same five-stage sequence is applied, and the same evaluation harness measures the output.

The Training Pipeline

The five-stage stacked QLoRA pipeline is carried forward unchanged from v2.2. Each stage accumulates on the previous, with Stage 4 including a 0.3 replay buffer ratio from Stages 2 and 3 to prevent catastrophic forgetting. Stage 5 is DPO with implicit reference — no separate reference model is loaded, reducing memory overhead sufficiently to run on a single NVIDIA L40S (44.4 GiB VRAM).
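The replay mixing for Stage 4 can be sketched as follows. This is a minimal illustration, not the project's training code: the function name `build_stage4_mix`, the toy record lists, and the reading of the 0.3 ratio (replay examples equal to 0.3 times the number of new Stage 4 examples, split evenly between Stages 2 and 3) are all assumptions.

```python
import random


def build_stage4_mix(stage4, stage2, stage3, replay_ratio=0.3, seed=0):
    """Return Stage 4 examples plus a replay sample drawn from Stages 2 and 3.

    Assumption: replay_ratio is (replay count) / (new Stage 4 count), and the
    replay budget is split evenly between the two earlier stages.
    """
    rng = random.Random(seed)
    n_replay = round(len(stage4) * replay_ratio)
    n_s2 = min(n_replay // 2, len(stage2))
    n_s3 = min(n_replay - n_s2, len(stage3))
    # Sample without replacement so no earlier example is repeated.
    replay = rng.sample(stage2, n_s2) + rng.sample(stage3, n_s3)
    mixed = list(stage4) + replay
    rng.shuffle(mixed)  # interleave old and new examples
    return mixed


# Toy stand-ins for the real stage datasets.
stage2 = [f"s2-{i}" for i in range(50)]
stage3 = [f"s3-{i}" for i in range(50)]
stage4 = [f"s4-{i}" for i in range(100)]
mixed = build_stage4_mix(stage4, stage2, stage3)
```

Interleaving the replayed examples with the new ones, rather than appending them, keeps earlier-stage behavior present throughout the Stage 4 run.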

  1. Epistemic Foundation (SFT) — Calibrated uncertainty, evidence boundary enforcement, and premise correction. Gate score: 1.00 across all dimensions.
  2. Epistemic Hardening (SFT) — Evidence weighting, unknown boundary recognition, and correction of overstated user framing.
  3. RAG Synthesis (SFT) — Synthesis of ranked source records into a fact-driven thesis. Inline citation via square brackets. Conflict preservation. Gate: 16/16 passed.
  4. Constitutional SFT (SFT + Replay) — Sycophancy resistance, adversarial hardening, fabrication refusal. Gate: 16/16 passed.
  5. Constitutional DPO — Preference optimization: chosen responses maintain source boundaries; rejected responses fabricate or capitulate. Clean convergence at step 114, best stable margin 3.389.
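For readers unfamiliar with the Stage 5 metric, the sketch below shows the standard DPO loss and reward margin for a single preference pair. It assumes the reported margin is the usual DPO reward margin (chosen reward minus rejected reward); the `beta=0.1` value and the example log-probabilities are illustrative, not taken from the training run. With an implicit reference, a common approach with LoRA is to compute the reference log-probabilities from the same model with the trainable adapter disabled; whether Nehanda does exactly this is an assumption.

```python
import math


def dpo_loss_and_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss and reward margin for one preference pair.

    Inputs are summed log-probabilities of each response under the policy
    and under the reference. beta=0.1 is an illustrative value only.
    """
    chosen_reward = beta * (pol_chosen - ref_chosen)
    rejected_reward = beta * (pol_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as softplus(-margin)
    loss = math.log1p(math.exp(-margin))
    return loss, margin


# The policy prefers the chosen response more strongly than the reference does.
loss, margin = dpo_loss_and_margin(
    pol_chosen=-12.0, pol_rejected=-30.0, ref_chosen=-15.0, ref_rejected=-20.0
)
```

A positive margin means the policy separates chosen from rejected responses more strongly than the reference does; a margin of zero gives a loss of log 2.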

One operational change from v2.2: the eval gate tokenizer calls were updated to use the explicit text= keyword argument after discovering that Unsloth’s VLM processor patch, once activated, routes positional string arguments through the image processor rather than the tokenizer. This produced the source-repetition artefact visible in early Stage 3 eval runs — the model appeared to be echoing source content when it was actually being asked to process a text prompt via the vision path. Fixing the tokenizer call and raising max_new_tokens from 400 to 800 resolved the issue entirely; Stage 3 passed 16/16 on the corrected gate.
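The routing problem can be reproduced with a toy stand-in. `ToyVLMProcessor` below is hypothetical and is not the Unsloth or Qwen class; it only assumes, as is common for multimodal processors, that `images` is the first positional parameter, so a bare string is bound to the image path.

```python
class ToyVLMProcessor:
    """Hypothetical stand-in for a vision-language processor.

    Assumption: `images` comes before `text` in the call signature, so a
    positional string argument is treated as image input.
    """

    def __call__(self, images=None, text=None):
        if images is not None:
            return {"route": "image_processor", "payload": images}
        return {"route": "tokenizer", "payload": text}


processor = ToyVLMProcessor()
prompt = "Summarise the tariff schedule in the sources."

positional = processor(prompt)    # string is bound to `images`
keyword = processor(text=prompt)  # string is bound to `text`
```

In the actual gate the equivalent fix would presumably be a call of the form `tokenizer(text=prompt, return_tensors="pt")`, followed by generation with `max_new_tokens=800`.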

Holdout Data Cleaning

The evaluation holdout set (490 records) inherited several classes of corruption from the source ingestion pipeline that were not present in v2.x evaluations. Identifying and cleaning these before scoring was a prerequisite for any meaningful comparison.

The cleaning pipeline is deterministic and re-runnable. Source-ref presence verification uses full HTML-tag stripping, entity decoding, whitespace normalization, and list-prefix normalization before matching, so the script’s own cleaning cannot cause false drops. After all passes, 5 source refs were auto-dropped as genuinely absent from their record’s input excerpt; the remaining 490 records are fully valid for evaluation.
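The presence check might look roughly like the sketch below. The function names, the regular expressions, and the lowercasing step are assumptions made for illustration; the actual script may differ. The point is that both the source ref and the excerpt pass through the same normalization before matching, so markup or spacing differences alone cannot cause a drop.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")
_LIST_PREFIX = re.compile(r"^\s*(?:[-*•]|\(?\d+[.)])\s+", re.MULTILINE)
_WS = re.compile(r"\s+")


def normalize(text):
    """Apply the four normalization passes, then lowercase (an assumption)."""
    text = _TAG.sub(" ", text)         # strip HTML tags
    text = html.unescape(text)         # decode entities such as &amp; and &nbsp;
    text = _LIST_PREFIX.sub("", text)  # drop bullet and numbered-list markers
    text = _WS.sub(" ", text)          # collapse all whitespace runs
    return text.strip().lower()


def ref_present(source_ref, excerpt):
    """True if the normalized ref occurs in the normalized excerpt."""
    return normalize(source_ref) in normalize(excerpt)


excerpt = "<li>1.&nbsp;Section&nbsp;12 &amp; Schedule  B</li>"
found = ref_present("Section 12 & Schedule B", excerpt)
missing = ref_present("Section 99", excerpt)
```

Tags are stripped before entities are decoded so that an escaped `&lt;` in the source text is not mistaken for markup.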

Evaluation Methodology

The same six-dimension harness used across the v2 line was applied: fabrication resistance, structure, factual accuracy, adversarial correction, over-hedging resistance, and sycophancy resistance. Evaluation runs in two phases: Phase 1 covers standard-difficulty prompts within each dimension; Phase 2 covers hard cases including embedded false premises, temporal extrapolation beyond source dates, geographic extrapolation beyond source scope, and conflicting-source synthesis under pressure. Both energy regulatory and intelligence analysis domains are evaluated independently.

All 72 records were scored by both the deterministic keyword scorer and an LLM judge (Claude Sonnet 4.6). Where scores diverge, the LLM judge is the authoritative measure — the deterministic scorer is calibrated for throughput across large eval runs; the LLM judge provides the ground truth for publication reporting. Note that v2.2 scores reported in the epistemic robustness paper used Claude Opus 4.6 as judge; v2/v2.1 used GPT-4o. The v3 eval uses Claude Sonnet 4.6. Cross-version comparisons are therefore indicative rather than directly controlled.

Results
  • 83.7%: overall LLM-judge score, v3 across both domains
  • 82.3%: v2.2 weighted overall, energy (Opus 4.6 judge)
  • 100%: factual and adversarial, both domains, both phases
  • 16/16: eval gate passes, Stages 3 and 4

Weighted Overall Scores (v2.1 → v2.2 → v3)

Domain   v2.1    v2.2    v3      Claude Opus 4.6   GPT-5 Mini
Energy   78.0%   82.3%   86.5%   92.9%             78.5%
Intel    77.6%   82.2%   85.5%   95.4%             81.6%

Overall scores by domain and phase — v3

Scope             v2.2 score   v3 score   Delta
Energy, Phase 1   87.1%        95.4%      +8.3pp
Energy, Phase 2   73.5%        77.6%      +4.1pp
Intel, Phase 1    78.3%        90.0%      +11.7pp
Intel, Phase 2    74.6%        80.9%      +6.3pp
Overall           76.9%        83.7%      +6.8pp

Dimension scores by domain and phase — Claude Sonnet 4.6 as judge

Dimension      ENG P1   ENG P2   INT P1   INT P2
Fabrication    100%     83%      100%     83%
Structure      88%      67%      75%      56%
Factual        100%     67%      100%     100%
Adversarial    100%     100%     95%      100%
Over-hedging   100%     75%      100%     75%
Sycophancy     85%      76%      70%      82%

72 records (36 energy, 36 intel; 12 Phase 1, 24 Phase 2 per domain). Phase 2 factual energy reflects two arithmetic errors on hard cross-source calculation cases. LLM judge: Claude Sonnet 4.6.
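As a sanity check on the arithmetic, the overall figures in the domain-and-phase table are consistent with a record-weighted mean of the four phase scores, using the record counts stated above. That the authors computed the overall this way is an assumption; the sketch only shows that the published numbers agree with it.

```python
# (score in %, number of records) per domain and phase, from the tables above.
v3 = {
    "energy_p1": (95.4, 12),
    "energy_p2": (77.6, 24),
    "intel_p1": (90.0, 12),
    "intel_p2": (80.9, 24),
}
v22 = {
    "energy_p1": (87.1, 12),
    "energy_p2": (73.5, 24),
    "intel_p1": (78.3, 12),
    "intel_p2": (74.6, 24),
}


def record_weighted_mean(cells):
    """Mean of the cell scores, weighted by the number of records per cell."""
    total = sum(n for _, n in cells.values())
    return sum(score * n for score, n in cells.values()) / total


v3_overall = record_weighted_mean(v3)    # rounds to 83.7
v22_overall = record_weighted_mean(v22)  # rounds to 76.9
delta_pp = v3_overall - v22_overall      # rounds to 6.8
```

Because Phase 2 carries twice as many records as Phase 1 in each domain, the hard cases dominate the overall figure.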

Phase 2 hard cases — v3 vs. prior versions and frontier models

Dimension        v2.2     v3      Claude Opus 4.6   GPT-5 Mini
Energy
  Overall        74.8%    77.6%   92.4%             84.5%
  Fabrication    60.0%    83%     80.0%             40.0%
  Adversarial    100%     100%    100%              100%
  Sycophancy     100%     76%     100%              100%
  Over-hedging   62.5%    75%     87.5%             87.5%
  Structure      72.2%    67%     83.3%             88.9%
Intelligence
  Overall        79.2%    80.9%   95.6%             84.0%
  Fabrication    90.0%    83%     90.0%             50.0%
  Adversarial    100%     100%    100%              100%
  Sycophancy     100%     82%     100%              100%
  Over-hedging   62.5%    75%     100%              75.0%
  Structure      50.0%    56%     72.2%             83.3%

v2.2 scored by Claude Opus 4.6 judge. v3 scored by Claude Sonnet 4.6 judge. Claude Opus 4.6 and GPT-5 Mini evaluated on the same test prompts under epistemic isolation. Cross-version fabrication comparisons are not directly controlled due to different eval designs and judges; directional comparison is valid.

Phase 2 overall — v3 in context

[Figure: Phase 2 overall scores. v2 (energy) 70%; v2.2 (energy) 75%; v3 (energy) 78%; v2.2 (intel) 79%; v3 (intel) 81%; GPT-5 Mini (avg) 85%; Opus 4.6 (avg) 94%. Legend: prior versions and frontier models; Nehanda v3.]

v3 continues the incremental improvement from v2 through v2.2 on Phase 2 overall. Claude Opus 4.6 leads at 92–96%, as expected from a frontier model. GPT-5 Mini at 84% outperforms Nehanda on single-turn hard questions — but collapses to 37.5–50% under multi-turn consistency pressure where v2.2 held 100%.

What Remains Unsolved
Vision Capabilities

All five training stages are text-only. The Qwen3.6-27B vision encoder is present and untouched in the v3 checkpoint — it carries the full native vision capability of the base model without modification. This is architecturally significant for the Zorora synthesis pipeline.

The primary source of data quality problems in the v3 holdout set was HTML and CSS artefacts from OCR-extracted legislative and regulatory documents. 161 records contained raw markup in source excerpts; 249 contained local filesystem paths from the ingestion machine. These required a cleaning pipeline to address post-hoc. The native vision path eliminates this class of problem at source: documents can be passed as images directly into the synthesis prompt, bypassing OCR extraction entirely and retaining the original layout, table structure, and formatting without markup leakage.

  • Document parsing (structured document ingestion): Handles text, charts, tables, invoices, and forms directly from images. Structured output (JSON) from scanned regulatory filings, tariff schedules, and legislative PDFs without upstream OCR.
  • Context window (262K tokens natively): Multi-document synthesis with image-sourced inputs in a single call. Extendable to approximately 1,010,000 tokens via YaRN for very long document sets.
  • Reasoning modes (thinking and non-thinking unified): Vision-language thinking mode for complex document reasoning; non-thinking mode for direct extraction. Both available in the same checkpoint with no model swap required.
  • Video (hour-scale video understanding): Up to 224K video tokens. Event pinpointing and segment-level retrieval. Relevant for intelligence use cases involving recorded testimony, briefings, or surveillance footage.

One deployment note: Qwen3.6 GGUFs require a separate mmproj vision file and are not currently compatible with Ollama. For vision-enabled serving use llama.cpp, LM Studio, vLLM (≥ 0.19.0), or SGLang (≥ 0.5.10). For text-only serving where the vision path is not needed, vLLM’s --language-model-only flag skips loading the vision encoder and reduces memory footprint.

What Changes and What Does Not

The behavioral contract of Nehanda is unchanged. The model still leads with what the evidence supports, not with what the user wants to hear. It still cites inline in square brackets, preserves unresolved source conflicts, corrects false premises before answering, and refuses to fabricate figures or dates not present in the provided sources. That is the product of the training methodology — it is not a property of the base model and it transfers across the base model change.
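Parts of this contract can be checked mechanically. The sketch below is a hypothetical illustration, not the project's deterministic scorer: it assumes citations are source labels in square brackets (the `S1`-style labels are invented for the example) and flags citations that point at no provided source, plus figures in the answer that appear in none of the sources.

```python
import re

CITATION = re.compile(r"\[([^\[\]]+)\]")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def check_answer(answer, sources):
    """Return (unknown citations, unsupported numbers) for an answer.

    sources maps a source label to its excerpt text. The label format is
    an assumption made for this example.
    """
    cited = set(CITATION.findall(answer))
    unknown_citations = sorted(cited - set(sources))
    source_numbers = set(NUMBER.findall(" ".join(sources.values())))
    prose = CITATION.sub("", answer)  # ignore digits inside citation labels
    unsupported_numbers = sorted(set(NUMBER.findall(prose)) - source_numbers)
    return unknown_citations, unsupported_numbers


sources = {
    "S1": "The tariff rose 12.5% in 2023.",
    "S2": "Capacity reached 400 MW.",
}
answer = "Tariffs rose 12.5% [S1] and capacity reached 450 MW [S3]."
unknown, unsupported = check_answer(answer, sources)
```

A string-level check like this catches only verbatim mismatches; derived figures and paraphrased claims still need the LLM judge.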

What changes is the platform those behaviors sit on. Qwen3.6’s stronger base reasoning produces more coherent multi-source synthesis on complex briefs. The longer context window means the model is less likely to truncate source context on large document sets. And the native vision encoder means the pipeline can move toward direct document ingestion — reducing the preprocessing steps that have historically been the primary source of data quality problems.

The v2.x line demonstrated that targeted fine-tuning on a 32B model can match Claude Opus 4.6 on adversarial resistance and multi-turn epistemic consistency while scoring 75–79% on Phase 2 hard cases where Opus reaches 92–96%. That gap is real and reflects the difference between a fine-tuned specialist and a frontier generalist. v3 continues that trajectory on a stronger base: the epistemic alignment carries forward, the platform improves, and the residual weaknesses, cross-source arithmetic overconfidence and incomplete sycophancy correction in follow-through, are the next targets.

Access and Citation

The merged 16-bit model is available at asoba/nehanda-v3-27b. A quantized q4_k_m GGUF is available at asoba/nehanda-rag-synthesis-27b-gguf for local inference via llama.cpp or LM Studio.

Read the epistemic robustness paper. Full evaluation data is forthcoming; a Zenodo submission is in preparation.