DAR

Deontic Reasoning with Agentic Harnesses

Guangyao Dou1, William Jurayj1, Nils Holzenberger2, Benjamin Van Durme1

1Johns Hopkins University  ·  2Télécom Paris, Institut Polytechnique de Paris

Data is available on Hugging Face as part of DeonticBench.


Figure 1. Direct reasoning vs. Deontic Agentic Reasoning (DAR). In direct reasoning (left), the full statute and case facts are placed in the prompt, and the model produces an answer in a single pass. In DAR (right), the statute is placed as a file in the harness, and the model examines it on the fly using general-purpose tools (grep, sed, cat, Python).

Abstract

Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and heavily cross-referenced, so even with the full text in the prompt, models may fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model reads the statute on demand through tool calls. We evaluate DAR under multiple harnesses on the hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.

At a Glance

9

Models

Frontier (GPT-5.1, GPT-5.2, Claude Sonnet 4.5) and open-source (Qwen3.5 family, Qwen3-Coder, Qwen3-235B, Kimi K2).

4

Agentic harnesses

Terminus-2 and Terminus-KIRA in the main study; Claude Code and Codex CLI in the appendix — all vs. direct solving.

15–30%

Frontier gain

Frontier models gain 15–30 percentage points on SARA-Numeric under Terminus-KIRA, while weaker models drop 11–23 points and burn up to 4× the tokens.

Two Reasoning Paradigms

baseline

Direct reasoning

The model receives the full statute, the case facts, and the question in a single prompt and produces an answer in one pass — the configuration used in most prior deontic-reasoning evaluations.

ours

Deontic Agentic Reasoning

The statute is placed as a file (statute.txt) in a harness. The model receives only the case facts and question, then issues tool calls (sed, grep, cat, Python) to read targeted portions of the statute on demand, accumulating observations as it explores.
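
The on-demand access pattern described above can be sketched in a few lines of Python. This is only an illustration, not the Terminus-2 or Terminus-KIRA implementation: the helper names (`grep_statute`, `read_lines`, `Transcript`) and the three-line statute text are hypothetical, and a scripted sequence of tool calls stands in for the model's own decisions.

```python
import re
from dataclasses import dataclass, field

# Hypothetical stand-in for statute.txt; real statutes are far longer.
STATUTE = """\
Sec. 1. A tax is imposed on taxable income as defined in Sec. 2.
Sec. 2. Taxable income means gross income minus deductions under Sec. 3.
Sec. 3. A deduction is allowed for each dependent of the taxpayer.
"""


def grep_statute(text: str, pattern: str) -> list[tuple[int, str]]:
    """Return (1-based line number, line) pairs matching a regex, like `grep -n`."""
    return [
        (i, line)
        for i, line in enumerate(text.splitlines(), start=1)
        if re.search(pattern, line)
    ]


def read_lines(text: str, start: int, end: int) -> list[str]:
    """Return lines start..end inclusive (1-based), like `sed -n 'start,endp'`."""
    return text.splitlines()[start - 1:end]


@dataclass
class Transcript:
    """Accumulates (action, observation) pairs as the exploration proceeds."""
    steps: list[tuple[str, str]] = field(default_factory=list)

    def record(self, action: str, observation: str) -> None:
        self.steps.append((action, observation))


transcript = Transcript()

# Scripted exploration: locate the definition, then follow its cross-reference.
hits = grep_statute(STATUTE, r"Taxable income means")
transcript.record("grep 'Taxable income means'", repr(hits))

deduction_rule = read_lines(STATUTE, 3, 3)
transcript.record("sed -n '3,3p'", repr(deduction_rule))

for action, observation in transcript.steps:
    print(action, "->", observation)
```

The point of the sketch is that only the matched lines enter the transcript, not the whole statute; in a real harness the model chooses each pattern and line range itself.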

Key Findings

Frontier models gain from DAR

Under Terminus-KIRA, GPT-5.2 climbs from 30% to 60% on SARA-Numeric, Claude Sonnet 4.5 from 36% to 54%, and GPT-5.1 gains 15 points on SARA-Numeric while remaining saturated near 0.86 on Airline. The harness turns latent statute-reading ability into delivered accuracy.

Open-source models fail under the same harness

The same scaffold hurts weaker models. On SARA-Numeric, Qwen3.5-35B drops 34% → 11% and Qwen3.5-122B 37% → 20%. On Airline, every open-source model collapses to near-zero once placed in Terminus-2 or Terminus-KIRA.

Token cost explodes for weaker models

Under Terminus-2, Qwen3.5-122B averages 401k tokens per trial and Qwen3-235B 303k, up to roughly 4× what frontier models consume. Extra turns inflate already-shaky reasoning into longer, more confident wrong answers.

Harnesses amplify capability, not judgment

For capable models the harness enables self-directed retrieval and error recovery, as the Mismanaged Geniuses Hypothesis predicts. For weaker models it is a confidence amplifier: it grants interactive access to tools, but not the judgment to use them well.

Harness Comparison

[Figure 2 panels: SARA-Numeric (accuracy), Airline (accuracy), SARA-Binary (macro-F1), USCIS-AAO (macro-F1). Legend: Direct Solving, Terminus-2, Terminus-KIRA.]

Figure 2. Direct Solving vs. Terminus-2 vs. Terminus-KIRA across nine models on the four hard DeonticBench tasks. Each task is allotted a 10-minute budget; trials that exceed the budget, fail to parse, or raise harness runtime errors are counted as incorrect. Agentic harnesses lift frontier models but degrade open-source models, most severely on the numerical tasks.
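
The scoring rule in the caption (over-budget, unparseable, and errored trials all count as incorrect) can be written as a small function. The sketch below is a hypothetical illustration, not the benchmark's evaluation code: the `Trial` record, the `parse_answer` helper, the sample trials, and the choice to apply the 10-minute budget per trial are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

BUDGET_SECONDS = 600  # 10-minute budget from the caption, applied per trial here


@dataclass
class Trial:
    elapsed_seconds: float
    raw_answer: Optional[str]  # None models a harness runtime error
    gold: float


def parse_answer(raw: str) -> Optional[float]:
    """Hypothetical parser: accept a number with an optional '$' and commas."""
    try:
        return float(raw.strip().lstrip("$").replace(",", ""))
    except ValueError:
        return None


def is_correct(trial: Trial) -> bool:
    """Apply the three failure rules before comparing with the gold answer."""
    if trial.raw_answer is None:               # harness runtime error
        return False
    if trial.elapsed_seconds > BUDGET_SECONDS:  # exceeded the budget
        return False
    parsed = parse_answer(trial.raw_answer)
    if parsed is None:                          # failed to parse
        return False
    return parsed == trial.gold                 # exact match


def accuracy(trials: list[Trial]) -> float:
    return sum(is_correct(t) for t in trials) / len(trials)


# Made-up trials, one per outcome type.
trials = [
    Trial(120, "$4,500", 4500),               # correct
    Trial(700, "4500", 4500),                 # right value, but over budget
    Trial(90, "about four thousand", 4500),   # unparseable
    Trial(30, None, 4500),                    # runtime error
    Trial(200, "4400", 4500),                 # parsed, but wrong
]
print(accuracy(trials))  # 0.2
```

Under this rule a harness that times out is penalized exactly like one that answers wrongly, so long agentic runs can lower accuracy even when the underlying reasoning is sound.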

Token Usage

[Figure 3 legend: Direct Solving, Terminus-2, Terminus-KIRA.]

Figure 3. Average tokens consumed per trial under Direct Solving, Terminus-2, and Terminus-KIRA. Agentic harnesses append each action's output to the next iteration's input, so weaker open-source models spend far more tokens — up to roughly 4× the frontier — without a matching accuracy gain.
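
The caption's mechanism, where each action's output is appended to the next iteration's input, implies that input tokens grow faster than linearly with the number of turns. The sketch below works through that arithmetic under simplifying assumptions that are not from the paper: a fixed prompt size, a fixed number of tokens added per turn, and no context truncation or caching. The function name and the numbers are illustrative only.

```python
def total_input_tokens(prompt_tokens: int, tokens_per_turn: int, turns: int) -> int:
    """Sum the context sent on every turn when the history is appended each time."""
    total = 0
    context = prompt_tokens
    for _ in range(turns):
        total += context            # the whole context is re-sent this turn
        context += tokens_per_turn  # action + observation appended for next turn
    return total


# Illustrative numbers only: doubling the turns more than triples the tokens.
print(total_input_tokens(1000, 500, 10))  # 32500
print(total_input_tokens(1000, 500, 20))  # 115000
```

Under these assumptions the total equals `turns * prompt_tokens + tokens_per_turn * turns * (turns - 1) / 2`, so a model that needs twice as many turns to finish pays well over twice the token cost.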

Detailed Results

Model             Harness        SARA-Num    Airline     SARA-Bin    USCIS-AAO
                                 (accuracy)  (accuracy)  (macro F1)  (macro F1)
Qwen3-Coder-480B  direct         0.249       0.021       0.591       0.338
                  codex          0.086       0.000       0.598       0.427
                  terminus-2     0.143       0.000       0.793       0.408
                  terminus-kira  0.143       0.013       0.766       0.378
                  claude-code    0.343*      0.050*      0.800*      0.505*
Qwen3.5-122B      direct         0.370*      0.150*      0.753       0.780*
                  codex          0.229       0.013       0.799       0.775
                  terminus-2     0.200       0.038       0.800       0.603
                  terminus-kira  0.200       0.050       0.823*      0.764
                  claude-code    0.286       0.113       0.793       0.730
Qwen3.5-35B       direct         0.340       0.137*      0.740       0.477
                  terminus-2     0.229       0.013       0.833*      0.607
                  terminus-kira  0.114       0.013       0.829       0.718*
                  claude-code    0.371*      0.088       0.833*      0.603
Qwen3.5-397B      direct         0.528       0.192*      0.782       0.727
                  terminus-2     0.286       0.013       0.833       0.708
                  terminus-kira  0.771*      0.000       0.906*      0.778*
                  claude-code    0.514       0.100       0.889       0.643
Qwen3-235B        direct         0.321*      0.128*      0.665       0.531
                  codex          0.114       0.000       0.721*      0.509
                  terminus-2     0.171       0.013       0.598       0.689*
                  terminus-kira  0.286       0.038       0.665       0.668
Kimi-K2           direct         0.084       0.090*      0.684       0.518
                  codex          0.200       0.000       0.733       0.553
                  terminus-2     0.114       0.000       0.593       0.533
                  terminus-kira  0.229*      0.050       0.885*      0.668*
GPT-5.2           direct         0.303       0.025       0.597*      0.779
                  codex          0.343       0.000       0.464       0.819*
                  terminus-2     0.514       0.188       0.531       0.781
                  terminus-kira  0.600*      0.363*      0.569       0.713

Codex CLI, Terminus-2, Terminus-KIRA, and Claude Code harnesses on DeonticBench (Appendix Table 1). Accuracy columns report exact-match accuracy; Macro F1 columns report macro-averaged F1. The best score per (model, metric) row group, bolded in the original table, is marked here with an asterisk (ties share the mark).
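
For readers unfamiliar with the two metrics named in the table caption, the sketch below computes exact-match accuracy and macro-averaged F1 from scratch. It is a generic illustration rather than the benchmark's scoring script: the labels and predictions are made up, and averaging over the union of gold and predicted labels is an assumption (it matches the common default in libraries such as scikit-learn).

```python
def exact_match_accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of predictions identical to the gold answer."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up binary task where the model always predicts the majority class.
gold = ["yes", "yes", "yes", "no"]
pred = ["yes", "yes", "yes", "yes"]

print(exact_match_accuracy(gold, pred))  # 0.75
print(round(macro_f1(gold, pred), 3))    # 0.429
```

The example shows why the binary tasks report macro F1: always guessing the majority class scores 0.75 on accuracy but only about 0.43 on macro F1, because the missed class contributes an F1 of zero.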

Built on DeonticBench

Benchmark

DeonticBench: A Benchmark for Reasoning over Rules

Guangyao Dou, Luis Brena, Akhil Deo, William Jurayj, Jingyu Zhang, Nils Holzenberger, Benjamin Van Durme

DAR is evaluated on the hard subsets of DeonticBench, a benchmark for deontic reasoning over real-world legal and regulatory statutes — U.S. federal income tax (SARA), airline baggage policies, state housing law, and USCIS immigration appeals — with verified reference Prolog programs for every case.
