Public benchmark

cite-bench

A public citation-grounding benchmark with a downloadable 500-row pack, protected scoring key, and aggregate evaluation built for legal AI teams that want to measure trust, not just fluency.

Four-label scoring contract · Private grading key · Aggregate-only public results

cite-bench-v1 · 500 public rows · VERIFIED / NOT_FOUND / MISATTRIBUTED / CITATION_UNRESOLVED

Same pack. Different systems.

Static benchmark snapshots on the public 500, shown here as quick orientation tiles before you upload your own run.

GPT-5.4-mini

Public pack · reasoning high

Accuracy: 69.2%
Macro F1: 0.4978
Weighted F1: 0.7096
Correct: 346 / 500

Observed baseline on the public 500-row pack using the public cite-bench runner and aggregate scoring flow.

LawEngine Reference

Protected internal reference lane

Accuracy: 100%
Macro F1: 1.0000
Weighted F1: 1.0000
Correct: 500 / 500

Internal scorer-parity reference on the same public pack. This is the benchmark-native ceiling, not a public baseline submission.
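The three headline metrics in the tiles above can be reproduced locally from any list of (gold, predicted) label pairs. A minimal pure-Python sketch of that arithmetic, using the benchmark's four contract labels; this is an illustrative re-derivation of standard accuracy / macro F1 / support-weighted F1, not the protected scorer itself:

```python
from collections import Counter

LABELS = ["VERIFIED", "NOT_FOUND", "MISATTRIBUTED", "CITATION_UNRESOLVED"]

def score(gold, pred):
    """Return (accuracy, macro F1, support-weighted F1) over the four labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    per_label_f1 = {}
    for lbl in LABELS:
        prec = tp[lbl] / (tp[lbl] + fp[lbl]) if tp[lbl] + fp[lbl] else 0.0
        rec = tp[lbl] / (tp[lbl] + fn[lbl]) if tp[lbl] + fn[lbl] else 0.0
        per_label_f1[lbl] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    support = Counter(gold)
    accuracy = sum(tp.values()) / len(gold)
    macro_f1 = sum(per_label_f1.values()) / len(LABELS)          # unweighted mean
    weighted_f1 = sum(per_label_f1[l] * support[l] for l in LABELS) / len(gold)
    return accuracy, macro_f1, weighted_f1
```

Macro F1 averages the four per-label F1 scores equally, so a rare label like MISATTRIBUTED counts as much as VERIFIED; weighted F1 scales each label's F1 by how often it appears in the gold labels, which is why the two numbers diverge in the baseline tile.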

Run locally. Score against the protected key.

01

Download the public pack

Pull the public 500-row cite-bench pack and the submission template directly from the public repo.

02

Run your system locally

Use your own model, prompt, or product flow to emit a simple submission CSV with id and predicted_status.

03

Upload for aggregate scoring

LawEngine scores against the protected key and returns overall accuracy, F1, source-family breakdowns, and a confusion matrix.

Your model must choose one of four exact labels.

Each benchmark row contains a citation and a quoted passage. Your system should output a CSV with exactly two columns, id and predicted_status, where predicted_status is one of these exact strings:

VERIFIED

The citation and quoted text match

Use VERIFIED when the quoted passage really appears in the cited authority and the citation is substantively correct.

NOT_FOUND

The quoted text is not there

Use NOT_FOUND when the quoted passage cannot be found in the cited authority or the current benchmark corpus.

MISATTRIBUTED

The quote exists, but under a different cite

Use MISATTRIBUTED when the quoted language is real but belongs to a different authority than the one provided.

CITATION_UNRESOLVED

The citation cannot be resolved cleanly

Use CITATION_UNRESOLVED when the citation string itself cannot be tied to a live authority in the benchmark corpus.
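The four labels form a simple decision tree: first ask whether the citation resolves at all, then whether the quote appears under that citation, then whether it appears anywhere else in the corpus. A hedged sketch of that routing, where `resolve_citation` and `find_quote` are hypothetical hooks into your own retrieval layer, not part of cite-bench:

```python
def label_row(citation, quote, resolve_citation, find_quote):
    """Route a (citation, quote) pair to one of the four contract labels.

    resolve_citation(citation) -> authority id, or None (hypothetical hook)
    find_quote(quote)          -> set of authority ids containing the quote
    """
    authority = resolve_citation(citation)
    if authority is None:
        return "CITATION_UNRESOLVED"  # cite can't be tied to a live authority
    sources = find_quote(quote)
    if authority in sources:
        return "VERIFIED"             # quote really appears under the given cite
    if sources:
        return "MISATTRIBUTED"        # quote is real, but under a different cite
    return "NOT_FOUND"                # quote is nowhere in the benchmark corpus
```

Ordering matters: CITATION_UNRESOLVED is checked first because the other three labels all presuppose a resolvable authority to compare against.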

Score a submission

Public pack upload and aggregate scoring.
Expected header: id,predicted_status
Same-origin upload, backend-scored, aggregate results only.

Public pack essentials

Version: cite-bench-v1
Rows: 500
Formats: CSV / JSON
Scoring: Private key

Label contract

VERIFIED · NOT_FOUND · MISATTRIBUTED · CITATION_UNRESOLVED

Accepted submission formats

  • CSV upload with id,predicted_status
  • JSON for local or private tooling