DeonticBench

A Benchmark for Reasoning over Rules

Guangyao Dou¹, Luis Brena¹, Akhil Deo¹, William Jurayj¹, Jingyu Zhang¹, Nils Holzenberger², Benjamin Van Durme¹

¹ Johns Hopkins University  ·  ² Télécom Paris

Figure 1. Walkthrough of a DeonticBench instance in the symbolic setting. (1) Given the full problem context, the model performs deontic reasoning to identify and apply the relevant rules. (2) The LLM translates the problem into Prolog code. (3) The generated Prolog is executed by the SWI-Prolog solver. The illustrated example is a 2017 tax-liability case.
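Steps (2) and (3) of the walkthrough can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the benchmark's own harness: the helper names `swipl_command` and `run_prolog`, the `main` entry goal, and the toy program are all hypothetical, and the sketch assumes that a failed or timed-out run is treated as "no answer". It relies only on the standard `swipl` options `-q`, `-g`, and `-t`.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List, Optional

# Hypothetical stand-in for model-generated Prolog (not a real benchmark case).
TOY_PROGRAM = """
tax(Income, Tax) :- Tax is Income * 10 // 100.
main :- tax(50000, T), write(T), nl.
"""


def swipl_command(program_path: str, goal: str = "main") -> List[str]:
    """Build a SWI-Prolog command line that runs `goal` and then exits."""
    # -q suppresses the banner, -g runs the goal, -t halt ends the session.
    return ["swipl", "-q", "-g", goal, "-t", "halt", program_path]


def run_prolog(program: str, goal: str = "main",
               timeout_s: float = 10.0) -> Optional[str]:
    """Execute Prolog source and return its stdout, or None if no answer.

    Assumption: a missing solver, a timeout, a non-zero exit status, or
    empty output all count as "no answer".
    """
    if shutil.which("swipl") is None:
        return None  # SWI-Prolog is not installed on this machine
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "case.pl"
        path.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                swipl_command(str(path), goal),
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    print(run_prolog(TOY_PROGRAM))
```

Returning `None` instead of raising keeps solver failures separate from wrong answers, which matters when errors are later split into incorrect answers and abstentions.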

Abstract

DeonticBench is a benchmark for evaluating LLMs on deontic reasoning over real-world legal and regulatory statutes. Given case facts and statutory rules, models must derive legally correct answers, either by generating executable Prolog programs (few-shot or zero-shot) or by answering directly in natural language. It spans five domains: two over U.S. federal income tax (SARA Numeric and SARA Binary), airline baggage policies, state housing and eviction law, and USCIS immigration appeals. Every case ships with a verified reference Prolog program, enabling rigorous, executable evaluation of rule-following beyond surface-level pattern matching.

At a Glance

5

Real-world domains

Tax (numeric and binary tasks), airline, housing, and immigration law, drawn from actual statutes and case records.

3

Solving modes

Few-shot Prolog, zero-shot Prolog, and direct natural-language answering.

6,232

Verified tasks

Each with a ground-truth label and a reference Prolog program.

Dataset

| Domain | Description | Label | Hard | Whole |
|---|---|---|---|---|
| SARA Numeric | U.S. federal income tax (§1, §2, §63, §151, §152, …) | Integer (tax owed, $) | 35 | 100 |
| SARA Binary | Entailment / contradiction over individual tax statute clauses | 0 / 1 | 30 | 276 |
| Airline | Airline baggage fee policies | Integer (total cost, $) | 80 | 300 |
| Housing | U.S. state housing and eviction law (50 states) | "yes" / "no" | 78 | 5314 |
| USCIS-AAO | USCIS Administrative Appeals Office immigration cases | "Accepted" / "Dismissed" | 28 | 242 |

Each split is available on Hugging Face. Every entry contains a natural-language question, a ground-truth label, and a verified reference_prolog program encoding the applicable rules and case facts. Domain names link to their original source datasets; USCIS-AAO is introduced in this work.
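The entry layout described above can be sketched as a small Python record with a per-domain label check. Only `reference_prolog` is a field name taken from this page; the names `question` and `label`, the domain keys, and the `validate_entry` helper are assumptions made for illustration and may not match the released files.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Label = Union[int, str]


@dataclass(frozen=True)
class DeonticEntry:
    question: str          # natural-language question (assumed field name)
    label: Label           # ground-truth label (assumed field name)
    reference_prolog: str  # verified reference program (name from this page)


def _is_int(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


# Label formats follow the Dataset table; the dictionary keys are hypothetical.
LABEL_CHECKS: Dict[str, Callable[[object], bool]] = {
    "sara_numeric": _is_int,
    "sara_binary": lambda v: _is_int(v) and v in (0, 1),
    "airline": _is_int,
    "housing": lambda v: v in ("yes", "no"),
    "uscis_aao": lambda v: v in ("Accepted", "Dismissed"),
}


def validate_entry(domain: str, entry: DeonticEntry) -> List[str]:
    """Return a list of problems found in an entry (empty if it looks valid)."""
    problems: List[str] = []
    if domain not in LABEL_CHECKS:
        return [f"unknown domain: {domain}"]
    if not entry.question.strip():
        problems.append("empty question")
    if not entry.reference_prolog.strip():
        problems.append("empty reference_prolog")
    if not LABEL_CHECKS[domain](entry.label):
        problems.append(f"label {entry.label!r} does not fit domain {domain}")
    return problems
```

A check like this is mainly useful as a guard when loading a split, so that a label in the wrong format is caught before scoring.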

Three Solving Modes

few-shot

Few-shot Prolog

The model is given the statute text and 1–2 worked Prolog examples, then writes Prolog for the new case.

zero-shot

Zero-shot Prolog

The model writes Prolog with only the statute text — no worked examples to imitate.

direct

Direct answer

The model answers the question in natural language, with no symbolic intermediate representation.
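The difference between the three modes comes down to what the prompt contains and what output is requested. The sketch below makes that concrete; the prompt wording, the `build_prompt` helper, and the choice to include the statute text in direct mode are assumptions for illustration, not the prompts used in the paper.

```python
from typing import List, Sequence, Tuple

MODES = ("few-shot", "zero-shot", "direct")


def build_prompt(mode: str, statute: str, case: str,
                 examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a prompt for one solving mode.

    `examples` holds (case text, worked Prolog) pairs and is used only in
    few-shot mode. All wording here is hypothetical.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    parts: List[str] = ["Statute:\n" + statute]
    if mode == "few-shot":
        if not examples:
            raise ValueError("few-shot mode needs at least one worked example")
        for i, (ex_case, ex_prolog) in enumerate(examples, start=1):
            parts.append(
                f"Example {i} case:\n{ex_case}\n"
                f"Example {i} Prolog:\n{ex_prolog}"
            )
    parts.append("Case:\n" + case)
    if mode == "direct":
        # No symbolic intermediate representation is requested.
        parts.append("Answer the question directly in natural language.")
    else:
        parts.append("Write a Prolog program that derives the answer.")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = build_prompt(
        "few-shot",
        statute="(statute text)",
        case="(new case facts and question)",
        examples=[("(worked case)", "answer(0).")],
    )
    print(demo)
```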

Key Findings

Reasoning over rules is hard for every model

On the hard subsets, even the best configuration reaches only 44.4% accuracy on SARA Numeric and 46.8 macro-F1 on Housing. No single model leads across all five domains, and bootstrap confidence intervals stay wide.

A clear frontier vs. open-source gap

Open-source models lag in few- and zero-shot Prolog and are highly prompt-sensitive: on SARA Numeric, Qwen3-235B rises from a near-zero 0.7 with few-shot Prolog to 32.1 with direct prompting. The gap narrows on binary tasks.

Failure modes are domain-dependent

Legal domains (Housing, USCIS-AAO) are bottlenecked by rule selection; SARA tasks by fact extraction; Airline by arithmetic precision. Most errors are confident wrong answers, not abstentions.

Training helps, but not enough

SFT and RL (DPO, Dr. GRPO) improve Prolog quality and binary-task accuracy, but current RL methods still fail to reliably solve precise numeric reasoning. Robust, executable rule reasoning remains open.

Main Results

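| Model | Setting | SARA Num. (Accuracy) | Airline (Accuracy) | SARA Bin. (Macro F1) | USCIS-AAO (Macro F1) | Housing (Macro F1) |
|---|---|---|---|---|---|---|
| GPT-4.1 | Few-Shot | 23.7 | 41.5 | 39.1 | 53.0 | 46.6 |
| GPT-4.1 | Zero-Shot | 6.7 | 1.7 | 40.5 | 55.5 | 44.7 |
| GPT-4.1 | Direct | 18.8 | 6.7 | 30.3 | 50.9 | 20.2 |
| O3 | Few-Shot | 15.2 | **90.8** | 29.5 | 49.4 | 43.0 |
| O3 | Zero-Shot | **44.4** | 18.5 | 32.3 | 48.5 | 39.6 |
| O3 | Direct | 33.5 | 37.8 | 59.5 | 52.6 | 20.8 |
| GPT-5.1 | Few-Shot | 33.1 | 52.3 | 28.7 | 61.4 | **46.8** |
| GPT-5.1 | Zero-Shot | 44.0 | 40.2 | 20.2 | 65.9 | 41.2 |
| GPT-5.1 | Direct | 15.9 | 28.4 | 54.0 | **71.5** | 18.4 |
| GPT-5.2 | Few-Shot | 17.1 | 77.9 | 24.3 | 37.6 | 41.1 |
| GPT-5.2 | Zero-Shot | 27.6 | 2.5 | 16.0 | 51.5 | 33.4 |
| GPT-5.2 | Direct | 20.7 | 29.9 | 25.8 | 58.3 | 17.4 |
| Kimi K2 Instruct | Few-Shot | 9.3 | 42.6 | 48.5 | 38.2 | 37.1 |
| Kimi K2 Instruct | Zero-Shot | 10.1 | 0.0 | 52.7 | 43.4 | 39.8 |
| Kimi K2 Instruct | Direct | 8.4 | 0.9 | 68.4 | 51.8 | 24.9 |
| Claude Sonnet 4.5 | Few-Shot | 21.6 | 85.8 | 28.7 | 63.1 | 42.9 |
| Claude Sonnet 4.5 | Zero-Shot | 21.9 | 5.7 | 43.9 | 63.0 | 45.0 |
| Claude Sonnet 4.5 | Direct | 41.2 | 6.2 | **70.0** | 7.2 | 32.1 |
| Gemini 2.5 Flash | Few-Shot | 2.7 | 28.4 | 44.7 | 31.8 | 43.4 |
| Gemini 2.5 Flash | Zero-Shot | 0.9 | 0.6 | 16.5 | 33.4 | 46.0 |
| Gemini 2.5 Flash | Direct | 30.5 | 18.1 | 61.1 | 45.9 | 30.2 |
| Qwen3-235B | Few-Shot | 0.7 | 22.9 | 37.8 | 24.4 | 38.6 |
| Qwen3-235B | Zero-Shot | 8.7 | 4.6 | 27.2 | 35.8 | 43.5 |
| Qwen3-235B | Direct | 32.1 | 12.8 | 66.5 | 53.1 | 25.7 |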

Results on the hard subsets. SARA Numeric and Airline report accuracy (±$1 tolerance); SARA Binary, USCIS-AAO, and Housing report macro-F1. The best score per domain is bolded. Values are means over K generations; 95% bootstrap confidence intervals are reported in the paper.
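The three quantities named in the caption (accuracy within a ±$1 tolerance, macro-F1, and a bootstrap confidence interval) can be sketched in plain Python as below. This is an illustration under assumptions, not the paper's scoring code: it assumes a missing prediction (`None`) counts as wrong, that macro-F1 averages over the classes present in the gold labels, and that the interval is a simple percentile bootstrap. Scores are returned as fractions in [0, 1], whereas the table reports them on a 0 to 100 scale.

```python
import random
from typing import Callable, Hashable, List, Optional, Sequence, Tuple


def tolerance_accuracy(preds: Sequence[Optional[float]],
                       golds: Sequence[float], tol: float = 1.0) -> float:
    """Fraction of predictions within `tol` dollars of the gold amount."""
    if not golds:
        raise ValueError("need at least one example")
    hits = sum(1 for p, g in zip(preds, golds)
               if p is not None and abs(p - g) <= tol)
    return hits / len(golds)


def macro_f1(preds: Sequence[Optional[Hashable]],
             golds: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over the classes in `golds`."""
    if not golds:
        raise ValueError("need at least one example")
    scores: List[float] = []
    for cls in set(golds):
        tp = sum(1 for p, g in zip(preds, golds) if p == cls and g == cls)
        fp = sum(1 for p, g in zip(preds, golds) if p == cls and g != cls)
        fn = sum(1 for p, g in zip(preds, golds) if p != cls and g == cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_ci(metric: Callable[[list, list], float],
                 preds: Sequence, golds: Sequence,
                 n_resamples: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for `metric` over resampled examples."""
    rng = random.Random(seed)
    n = len(golds)
    stats: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stats.append(metric([preds[i] for i in idx], [golds[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_resamples - 1))]
    upper = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return lower, upper
```

With hard subsets of only a few dozen cases, each example shifts the score by several points, which is one reason intervals computed this way tend to be wide.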

Performance Decomposition

[Figure: two panels of stacked bar charts, SARA Numeric and Airline, showing correct, incorrect, and abstention rates per model and prompting strategy.]

Figure 2. Performance decomposition for SARA Numeric and Airline. Each model shows three bars (left to right: Direct, Zero-Shot, Few-Shot), split into correct, incorrect, and abstention rates. A large fraction of errors are confident incorrect answers rather than abstentions. Prolog solving raises abstention, while direct prompting trades it for more wrong answers, reflecting a trade-off between coverage and reliability.
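The decomposition in Figure 2 amounts to sorting each prediction into one of three buckets. A minimal sketch for the numeric tasks follows; it assumes that an abstention is represented as `None` (for example, when generated Prolog yields no answer) and reuses the ±$1 tolerance from the results table. The function name `decompose` is hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Sequence

OUTCOMES = ("correct", "incorrect", "abstention")


def decompose(preds: Sequence[Optional[float]], golds: Sequence[float],
              tol: float = 1.0) -> Dict[str, float]:
    """Split predictions into correct, incorrect, and abstention rates."""
    if not golds:
        raise ValueError("need at least one example")
    counts: Counter = Counter()
    for pred, gold in zip(preds, golds):
        if pred is None:
            counts["abstention"] += 1   # no answer was produced
        elif abs(pred - gold) <= tol:
            counts["correct"] += 1      # within the dollar tolerance
        else:
            counts["incorrect"] += 1    # a confident but wrong answer
    return {name: counts[name] / len(golds) for name in OUTCOMES}


if __name__ == "__main__":
    print(decompose([100, 50, None, 7], [100, 60, 3, 8]))
```

Keeping incorrect answers and abstentions in separate buckets is what exposes the coverage versus reliability trade-off: two settings with the same accuracy can differ sharply in how often they return a wrong answer.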

Related Paper

Built on DeonticBench

DAR: Deontic Reasoning with Agentic Harnesses

Guangyao Dou, William Jurayj, Nils Holzenberger, Benjamin Van Durme

A key challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and heavily cross-referenced, so models may fail to locate the rules needed for a particular step. Deontic Agentic Reasoning (DAR) places the statute as a file in a harness environment and lets the model examine it on demand with general-purpose tools (grep, sed, cat, Python). Evaluated on the hard subsets of DeonticBench, agentic harnesses push the frontier on deontic reasoning, but unevenly: frontier models gain 15–30% on SARA-Numeric under the Terminus-KIRA harness, while weaker open-source models degrade by 11–23% and consume up to 4× more tokens on the same tasks.

Citation

@article{dou2026deonticbench,
  title={DeonticBench: A Benchmark for Reasoning over Rules},
  author={Dou, Guangyao and Brena, Luis and Deo, Akhil and Jurayj, William and Zhang, Jingyu and Holzenberger, Nils and Van Durme, Benjamin},
  journal={arXiv preprint arXiv:2604.04443},
  year={2026}
}