Nehanda v3 ports the five-stage stacked QLoRA training methodology developed for v2.2 onto a new base: Qwen3.6-27B, a native vision-language model with a 262,144-token context window and integrated vision encoder. The training pipeline is unchanged — all five stages are text-only SFT and DPO, leaving the vision weights untouched. What changes is the substrate the epistemic alignment sits on: more capable base reasoning, a substantially larger context window, and native multimodal ingestion that makes the full pipeline viable for document-level RAG without an upstream OCR step.
Post-training evaluation across energy regulatory and intelligence analysis domains, scored by Claude Sonnet 4.6 as LLM judge, shows a weighted overall score of about 86% (86.5% energy, 85.5% intel). Against the v2.2 baseline (82.3% energy, 82.2% intel weighted overall, scored by Claude Opus 4.6), that is a comparable-to-better result, with the caveat that the eval design and the judge both differ between versions. The platform it runs on is also stronger: a longer context window, stronger base reasoning, and a native vision encoder. The residual weaknesses are cross-source arithmetic overconfidence and incomplete follow-through on sycophancy corrections; both are targeted for the next DPO cycle.
The v2 line was trained on Qwen2.5-32B, a strong general-purpose language model without native vision capability. That was appropriate at the time: the Zorora synthesis pipeline operated on pre-extracted text, and the training methodology was still being validated against the six-dimension eval harness. By v2.2 the pipeline had been proven — perfect multi-turn consistency, adversarial resistance, and sycophancy resistance that matched Claude Opus on the dimensions that matter for deployment.
The case for a new base became clear from two directions simultaneously. First, Qwen3.6 represented a meaningful generation jump: longer context, stronger base reasoning on technical documents, and — critically — a native vision encoder that enables direct image ingestion rather than relying on upstream text extraction. Second, the training data pipeline had accumulated a specific class of corruption that a vision-capable model could avoid entirely: HTML and CSS artefacts introduced by OCR pipelines processing legislative and regulatory documents.
Qwen2.5-32B
- 32B parameters
- Text-only architecture
- 128K context window
- Requires upstream OCR for documents
- HTML/CSS artefacts in source excerpts
- LoRA adapters: 1.15% trainable
Qwen3.6-27B VL
- 27B parameters, stronger base reasoning
- Native vision encoder (untouched by training)
- 262K context window (1M via YaRN)
- Direct image ingestion for documents
- Thinking & non-thinking modes unified
- LoRA adapters: 1.15% trainable
The parameter count actually decreases from 32B to 27B. This is not a downgrade: Qwen3.6 shows stronger performance on structured reasoning and document comprehension tasks despite the smaller count, reflecting architectural improvements in the 3.x generation. The LoRA adapter configuration is identical — the same 1.15% of parameters are trained, the same five-stage sequence is applied, and the same evaluation harness measures the output.
The five-stage stacked QLoRA pipeline is carried forward unchanged from v2.2. Each stage accumulates on the previous, with Stage 4 including a 0.3 replay buffer ratio from Stages 2 and 3 to prevent catastrophic forgetting. Stage 5 is DPO with implicit reference — no separate reference model is loaded, reducing memory overhead sufficiently to run on a single NVIDIA L40S (44.4 GiB VRAM).
- Stage 1: Epistemic Foundation (SFT). Calibrated uncertainty, evidence boundary enforcement, and premise correction. Gate score: 1.00 across all dimensions.
- Stage 2: Epistemic Hardening (SFT). Evidence weighting, unknown boundary recognition, and correction of overstated user framing.
- Stage 3: RAG Synthesis (SFT). Synthesis of ranked source records into a fact-driven thesis. Inline citation via square brackets. Conflict preservation. Gate: 16/16 passed.
- Stage 4: Constitutional SFT (SFT + Replay). Sycophancy resistance, adversarial hardening, fabrication refusal. Gate: 16/16 passed.
- Stage 5: Constitutional DPO. Preference optimization: chosen responses maintain source boundaries; rejected responses fabricate or capitulate. Clean convergence at step 114, best stable margin 3.389.
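The Stage 4 replay mix can be sketched as follows. This is a minimal illustration, not the project's training code: it assumes the 0.3 ratio means "replay examples equal to 30% of the new Stage 4 examples", sampled uniformly from a pooled Stage 2 and Stage 3 set. The function name `mix_with_replay` and the record fields are hypothetical.

```python
import random


def mix_with_replay(new_records, replay_pool, replay_ratio=0.3, seed=0):
    """Return new_records plus a replay sample, shuffled together.

    Assumption: replay_ratio is relative to the number of new records.
    The source text states only the ratio, not how it is applied.
    """
    rng = random.Random(seed)
    n_replay = min(len(replay_pool), round(replay_ratio * len(new_records)))
    mixed = list(new_records) + rng.sample(list(replay_pool), n_replay)
    rng.shuffle(mixed)
    return mixed


# Hypothetical records: 100 new Stage 4 examples, 200 earlier examples.
stage4 = [{"id": f"s4_{i}", "stage": 4} for i in range(100)]
earlier = [{"id": f"s{2 + i % 2}_{i}", "stage": 2 + i % 2} for i in range(200)]

mixed = mix_with_replay(stage4, earlier)
print(len(mixed))
```

A fixed seed keeps the mix reproducible across reruns, which matters when a gate result has to be traced back to the exact training set.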
One operational change from v2.2: the eval gate tokenizer calls were updated to use the explicit text= keyword argument after discovering that Unsloth’s VLM processor patch, once activated, routes positional string arguments through the image processor rather than the tokenizer. This produced the source-repetition artefact visible in early Stage 3 eval runs — the model appeared to be echoing source content when it was actually being asked to process a text prompt via the vision path. Fixing the tokenizer call and raising max_new_tokens from 400 to 800 resolved the issue entirely; Stage 3 passed 16/16 on the corrected gate.
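The positional-versus-keyword behaviour described above can be illustrated with a stand-in object. `StubVLMProcessor` below is hypothetical and is not the Unsloth or Transformers API; it only mimics a processor whose first positional parameter is the image input, which is the situation the paragraph describes. `encode_prompt` is likewise an illustrative helper name.

```python
class StubVLMProcessor:
    """Hypothetical stand-in for a patched vision-language processor.

    The first positional parameter is the image input, so a bare string
    passed positionally lands on the image path.
    """

    def __call__(self, images=None, text=None, **kwargs):
        route = []
        if images is not None:
            route.append("image_processor")
        if text is not None:
            route.append("tokenizer")
        return {"route": route}


def encode_prompt(processor, prompt, **kwargs):
    """Always pass the prompt by keyword so it reaches the text path."""
    return processor(text=prompt, **kwargs)


processor = StubVLMProcessor()
positional = processor("Summarise the tariff ruling.")
keyword = encode_prompt(processor, "Summarise the tariff ruling.")

print(positional["route"])
print(keyword["route"])
```

Wrapping the call in one helper means every eval gate goes through the keyword form, so a later processor patch cannot silently change the routing.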
The evaluation holdout set (490 records) inherited several classes of corruption from the source ingestion pipeline that were not present in v2.x evaluations. Identifying and cleaning these before scoring was a prerequisite for any meaningful comparison.
- Duplicate IDs (56 records, 25 groups) — Non-unique identifiers that would silently collide in result aggregation. Fixed via two-pass deduplication with _L{line_number} suffixes applied symmetrically to all copies including the first occurrence.
- HTML/CSS artefacts in source excerpts (161 records) — Raw <div>, <table>, and <style> tags from legislative HTML documents passed through OCR extraction. These produced CSS in model outputs under the v2.x eval gate (the max_new_tokens=400 budget caused the model to echo the tail of the input context). Stripped via regex with a newline-aware unclosed-tag handler.
- Local filesystem paths (249 records) — Absolute paths (/Volumes/SanDisk 2TB Backup/energyanalyst_corpus/) leaked into source URL fields and were being reproduced in model outputs. Replaced with a canonical file://local_server/corpus/ placeholder.
- Junk pass_indicators (887 of 2,102) — Truncated citation strings ([Executive Order to Accelerate with no closing bracket), CSS fragments ({font-family:"Cambria Math"), and XML placeholders (<BillNo> <Sponsor>) used as substring match targets. Pattern-filtered and bracket-stripped.
- Junk source_refs (71 dropped) — MongoDB ObjectIDs, raw absolute_url: field keys, HTML fragments, and revisor header artefacts. Dropped via compiled pattern filter.
- Null bytes in source_refs (2 records) — Windows/DOS document artefacts. Stripped recursively.
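Three of the passes above (symmetric duplicate-ID suffixing, local path replacement, and recursive null-byte stripping) are simple enough to sketch. This is an illustrative reconstruction under stated assumptions, not the actual cleaning script: the `id` field name, the 1-indexed line numbering, and the function names are hypothetical, while the path prefix, the placeholder, and the `_L{line_number}` suffix format come from the text.

```python
from collections import Counter

LOCAL_PREFIX = "/Volumes/SanDisk 2TB Backup/energyanalyst_corpus/"
PLACEHOLDER = "file://local_server/corpus/"


def suffix_duplicate_ids(records):
    """Two-pass dedup: count IDs, then suffix every copy of a duplicated ID.

    The suffix is applied to all copies, including the first occurrence,
    so no record keeps the ambiguous bare ID.
    """
    counts = Counter(r["id"] for r in records)  # pass 1: find duplicates
    cleaned = []
    for line_number, record in enumerate(records, start=1):  # pass 2: rename
        record = dict(record)
        if counts[record["id"]] > 1:
            record["id"] = f'{record["id"]}_L{line_number}'
        cleaned.append(record)
    return cleaned


def scrub(value):
    """Recursively strip null bytes and replace leaked local paths."""
    if isinstance(value, str):
        return value.replace("\x00", "").replace(LOCAL_PREFIX, PLACEHOLDER)
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()}
    return value


# Hypothetical records for illustration only.
records = [
    {"id": "a", "source_refs": ["Sec\x00tion 4"]},
    {"id": "b", "url": LOCAL_PREFIX + "acts/doc.pdf"},
    {"id": "a", "source_refs": []},
]
cleaned = [scrub(r) for r in suffix_duplicate_ids(records)]
print([r["id"] for r in cleaned])
```

Both functions return new objects rather than mutating their input, which is one straightforward way to keep a pipeline like this re-runnable.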
The cleaning pipeline is deterministic and re-runnable. Source-ref presence verification uses full HTML-tag stripping, entity decoding, whitespace normalization, and list-prefix normalization before matching, so the script’s own cleaning cannot cause false drops. After all passes, 5 source refs were auto-dropped as genuinely absent from their record’s input excerpt. No records were dropped: all 490 remain valid for evaluation.
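A presence check of this kind might look like the sketch below. It assumes the same normalization is applied to both the source ref and the excerpt before a plain substring test. The regular expressions and the function names are illustrative guesses, not the script's actual patterns.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")
# Leading bullets or simple numbering such as "- ", "* ", "1. ", "2) ".
LIST_PREFIX_RE = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+", re.MULTILINE)
WHITESPACE_RE = re.compile(r"\s+")


def normalize(text):
    """Strip tags, decode entities, drop list prefixes, collapse whitespace."""
    text = TAG_RE.sub(" ", text)          # tags first, so decoded "<" survives
    text = html.unescape(text)            # "&nbsp;", "&amp;", and so on
    text = LIST_PREFIX_RE.sub("", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def ref_is_present(source_ref, excerpt):
    """True when the normalized ref occurs in the normalized excerpt."""
    needle = normalize(source_ref)
    return bool(needle) and needle in normalize(excerpt)


excerpt = "<ul><li>Section&nbsp;4   applies</li></ul>"
print(ref_is_present("- Section 4 applies", excerpt))
print(ref_is_present("Section 9 applies", excerpt))
```

Because both sides go through the same `normalize` call, a ref that differs from its excerpt only by markup, entities, or spacing still matches, which is the property the paragraph above relies on.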
The same six-dimension harness used across the v2 line was applied: fabrication resistance, structure, factual accuracy, adversarial correction, over-hedging resistance, and sycophancy resistance. Evaluation runs in two phases: Phase 1 covers standard-difficulty prompts within each dimension; Phase 2 covers hard cases including embedded false premises, temporal extrapolation beyond source dates, geographic extrapolation beyond source scope, and conflicting-source synthesis under pressure. Both energy regulatory and intelligence analysis domains are evaluated independently.
All 72 evaluation records (36 energy, 36 intel) were scored by both the deterministic keyword scorer and an LLM judge (Claude Sonnet 4.6). Where scores diverge, the LLM judge is the authoritative measure: the deterministic scorer is calibrated for throughput across large eval runs, while the LLM judge provides the score of record for publication reporting. Note that v2.2 scores reported in the epistemic robustness paper used Claude Opus 4.6 as judge, and v2/v2.1 used GPT-4o. The v3 eval uses Claude Sonnet 4.6. Cross-version comparisons are therefore indicative rather than directly controlled.
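The relationship between the two scorers can be sketched as follows. The text says only that pass indicators are used as substring match targets and that the judge wins on divergence. The aggregation rule (fraction of indicators found), the case-sensitive match, and the function names are assumptions made for illustration.

```python
def keyword_score(output, pass_indicators):
    """Fraction of pass indicators found as substrings of the output.

    Assumption: a simple hit ratio; the real scorer may weight or
    combine indicators differently.
    """
    if not pass_indicators:
        return 0.0
    hits = sum(1 for indicator in pass_indicators if indicator in output)
    return hits / len(pass_indicators)


def reported_score(keyword, judge):
    """Use the LLM judge score when one exists, else the keyword score."""
    return judge if judge is not None else keyword


# Hypothetical model output and indicators.
output = "The tariff applies from 2024 [Source 1]; the cap is not stated."
indicators = ["[Source 1]", "not stated", "[Executive Order to Accelerate"]

kw = keyword_score(output, indicators)
print(round(kw, 3), reported_score(kw, judge=1.0))
```

The third indicator is a truncated citation of the kind listed in the cleaning section. It can never match a well-formed answer, which shows how junk indicators depress a substring-based score.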
Weighted Overall Scores (v2.1 → v2.2 → v3)
| Domain | v2.1 | v2.2 | v3 | Claude Opus 4.6 | GPT-5 Mini |
|---|---|---|---|---|---|
| Energy | 78.0% | 82.3% | 86.5% | 92.9% | 78.5% |
| Intel | 77.6% | 82.2% | 85.5% | 95.4% | 81.6% |
Overall scores by domain and phase — v3
| Scope | v2.2 score | v3 score | Delta |
|---|---|---|---|
| Energy — Phase 1 | 87.1% | 95.4% | +8.3pp |
| Energy — Phase 2 | 73.5% | 77.6% | +4.1pp |
| Intel — Phase 1 | 78.3% | 90.0% | +11.7pp |
| Intel — Phase 2 | 74.6% | 80.9% | +6.3pp |
| Overall | 76.9% | 83.7% | +6.8pp |
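The Overall row in the table above is consistent with a record-count-weighted mean of the four domain and phase cells, using the 12 Phase 1 and 24 Phase 2 records per domain reported for the v3 eval. The check below is a reader's reconstruction: the weighting scheme is inferred from the numbers, not stated in the source, and it assumes the same record counts apply to the v2.2 column.

```python
# Records per phase within each domain, as reported for the v3 eval.
RECORDS_PER_PHASE = {"P1": 12, "P2": 24}

V22 = {("energy", "P1"): 87.1, ("energy", "P2"): 73.5,
       ("intel", "P1"): 78.3, ("intel", "P2"): 74.6}
V3 = {("energy", "P1"): 95.4, ("energy", "P2"): 77.6,
      ("intel", "P1"): 90.0, ("intel", "P2"): 80.9}


def weighted_overall(scores):
    """Mean of cell scores weighted by the number of records in each cell."""
    total = sum(RECORDS_PER_PHASE[phase] * s for (_, phase), s in scores.items())
    count = sum(RECORDS_PER_PHASE[phase] for (_, phase) in scores)
    return total / count


v22_overall = weighted_overall(V22)
v3_overall = weighted_overall(V3)
print(round(v22_overall, 1), round(v3_overall, 1),
      round(v3_overall - v22_overall, 1))
```

Phase 2 carries twice the weight of Phase 1 under this scheme, which is why the overall figure sits closer to the hard-case scores than a plain average of the four cells would.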
Dimension scores by domain and phase — Claude Sonnet 4.6 as judge
| Dimension | ENG P1 | ENG P2 | INT P1 | INT P2 |
|---|---|---|---|---|
| Fabrication | 100% | 83% | 100% | 83% |
| Structure | 88% | 67% | 75% | 56% |
| Factual | 100% | 67% | 100% | 100% |
| Adversarial | 100% | 100% | 95% | 100% |
| Over-hedging | 100% | 75% | 100% | 75% |
| Sycophancy | 85% | 76% | 70% | 82% |
72 records (36 energy, 36 intel; 12 Phase 1, 24 Phase 2 per domain). Phase 2 factual energy reflects two arithmetic errors on hard cross-source calculation cases. LLM judge: Claude Sonnet 4.6.
Phase 2 hard cases — v3 vs. prior versions and frontier models
| Dimension | v2.2 | v3 | Claude Opus 4.6 | GPT-5 Mini |
|---|---|---|---|---|
| Energy | ||||
| Overall | 74.8% | 77.6% | 92.4% | 84.5% |
| Fabrication | 60.0% | 83% | 80.0% | 40.0% |
| Adversarial | 100% | 100% | 100% | 100% |
| Sycophancy | 100% | 76% | 100% | 100% |
| Over-hedging | 62.5% | 75% | 87.5% | 87.5% |
| Structure | 72.2% | 67% | 83.3% | 88.9% |
| Intelligence | ||||
| Overall | 79.2% | 80.9% | 95.6% | 84.0% |
| Fabrication | 90.0% | 83% | 90.0% | 50.0% |
| Adversarial | 100% | 100% | 100% | 100% |
| Sycophancy | 100% | 82% | 100% | 100% |
| Over-hedging | 62.5% | 75% | 100% | 75.0% |
| Structure | 50.0% | 56% | 72.2% | 83.3% |
v2.2 scored by Claude Opus 4.6 judge. v3 scored by Claude Sonnet 4.6 judge. Claude Opus 4.6 and GPT-5 Mini evaluated on the same test prompts under epistemic isolation. Cross-version fabrication comparisons are not directly controlled due to different eval designs and judges; directional comparison is valid.
Phase 2 overall — v3 in context
v3 continues the incremental improvement from v2 through v2.2 on Phase 2 overall. Claude Opus 4.6 leads at 92–96%, as expected from a frontier model. GPT-5 Mini at 84% outperforms Nehanda on single-turn hard questions — but collapses to 37.5–50% under multi-turn consistency pressure where v2.2 held 100%.
- Cross-source arithmetic overconfidence — On Phase 2 hard calculation tasks, the model commits to precise figures (29.6% capacity factor, 6.16% programme share, 2.96:1 force ratio) derived from plausible but incorrect source pairings. The arithmetic is internally coherent; the source alignment is wrong. The model presents the number with full confidence rather than flagging the pairing uncertainty. This is the primary gap for the next DPO cycle.
- Embedded falsehood acceptance — The model correctly handles stated false premises but sometimes accepts a false figure embedded within otherwise credible source content. hard_fab_002 (a planted R3.50/kWh rebate figure within a credible-looking source) was accepted and cited. The DPO training correctly handles cases where the user asserts a falsehood; it does not yet handle cases where a falsehood is planted inside a source citation.
- Sycophancy trailing: Corrections are sometimes incomplete in follow-through. The model correctly identifies that a premise is wrong, then partially validates the user’s framing when offering alternatives. The correction is present; the close is soft. Affected records: syc_002 (REIPPPP), intel_syc_001, intel_syc_002.
- Structure completeness on complex briefs — Phase 2 structure scores (67% energy, 56% intel) reflect cases where the model produces a valid structure but omits required risk categories or analytical vectors. The format is correct; the coverage is not exhaustive. A more complete coverage rubric in the training data is the likely fix.
- Temporal gap surfacing — intel_hard_fab_004 correctly cited New START limits from the source but did not surface that Russia suspended the treaty in February 2023. The model answered what the source said rather than flagging that the source’s figures may be superseded.
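The cross-source arithmetic failure in the first item above can be made concrete with a capacity factor calculation that checks whether its two inputs belong together before reporting a result. Everything here is hypothetical: the `Figure` fields, the plant names, and the numbers are invented for illustration and are not taken from the eval set. The sketch also assumes a non-leap year of 8,760 hours.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760  # assumption: non-leap year


@dataclass(frozen=True)
class Figure:
    """A sourced number plus the metadata needed to check a pairing."""
    value: float
    source_id: str
    plant: str
    year: int


def capacity_factor(energy_mwh, capacity_mw):
    """Return (capacity factor, warnings about the source pairing)."""
    warnings = []
    if energy_mwh.plant != capacity_mw.plant:
        warnings.append("figures refer to different plants")
    if energy_mwh.year != capacity_mw.year:
        warnings.append("figures refer to different years")
    factor = energy_mwh.value / (capacity_mw.value * HOURS_PER_YEAR)
    return factor, warnings


# Hypothetical sources: one matched pair, one plausible but wrong pair.
energy = Figure(262_800.0, "src_1", "Plant A", 2023)
matched = Figure(100.0, "src_2", "Plant A", 2023)
mismatched = Figure(120.0, "src_3", "Plant B", 2022)

print(capacity_factor(energy, matched))
print(capacity_factor(energy, mismatched))
```

Both calls produce a tidy, internally coherent number. Only the warnings distinguish the valid pairing from the invalid one, which is the uncertainty the model currently fails to surface.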
All five training stages are text-only. The Qwen3.6-27B vision encoder is present and untouched in the v3 checkpoint — it carries the full native vision capability of the base model without modification. This is architecturally significant for the Zorora synthesis pipeline.
The primary source of data quality problems in the v3 holdout set was HTML and CSS artefacts from OCR-extracted legislative and regulatory documents. 161 records contained raw markup in source excerpts; 249 contained local filesystem paths from the ingestion machine. These required a cleaning pipeline to address post-hoc. The native vision path eliminates this class of problem at source: documents can be passed as images directly into the synthesis prompt, bypassing OCR extraction entirely and retaining the original layout, table structure, and formatting without markup leakage.
One deployment note: Qwen3.6 GGUFs require a separate mmproj vision file and are not currently compatible with Ollama. For vision-enabled serving use llama.cpp, LM Studio, vLLM (≥ 0.19.0), or SGLang (≥ 0.5.10). For text-only serving where the vision path is not needed, vLLM’s --language-model-only flag skips loading the vision encoder and reduces memory footprint.
The behavioral contract of Nehanda is unchanged. The model still leads with what the evidence supports, not with what the user wants to hear. It still cites inline in square brackets, preserves unresolved source conflicts, corrects false premises before answering, and refuses to fabricate figures or dates not present in the provided sources. That is the product of the training methodology — it is not a property of the base model and it transfers across the base model change.
What changes is the platform those behaviors sit on. Qwen3.6’s stronger base reasoning produces more coherent multi-source synthesis on complex briefs. The longer context window means the model is less likely to truncate source context on large document sets. And the native vision encoder means the pipeline can move toward direct document ingestion — reducing the preprocessing steps that have historically been the primary source of data quality problems.
The v2.x line demonstrated that targeted fine-tuning on a 32B model can match Claude Opus 4.6 on adversarial resistance and multi-turn epistemic consistency while scoring 75–79% on Phase 2 hard cases where Opus reaches 92–96%. That gap is real and reflects the difference between a fine-tuned specialist and a frontier generalist. v3 continues that trajectory on a stronger base: the epistemic alignment carries forward, the platform improves, and the residual weaknesses on cross-source arithmetic and sycophancy trailing are the next targets.
The merged 16-bit model is available at asoba/nehanda-v3-27b. A quantized q4_k_m GGUF is available at asoba/nehanda-rag-synthesis-27b-gguf for local inference via llama.cpp or LM Studio.