Public benchmark

cite-bench

A public citation-grounding benchmark with a downloadable 500-row pack, protected scoring key, and aggregate evaluation built for legal AI teams that want to measure trust, not just fluency.

Four-label scoring contract · Private grading key · Aggregate-only public results

cite-bench-v1 · 500 public rows · VERIFIED / NOT_FOUND / MISATTRIBUTED / CITATION_UNRESOLVED

Same pack. Different systems.

Static benchmark snapshots on the public 500, shown here as quick orientation tiles before you upload your own run.

GPT-5.4-mini

Public pack · reasoning high

Accuracy: 69.2%
Macro F1: 0.4978
Weighted F1: 0.7096
Correct: 346 / 500

Observed baseline on the public 500-row pack using the public cite-bench runner and aggregate scoring flow.

LawEngine Reference

Protected internal reference lane

Accuracy: 100%
Macro F1: 1.0000
Weighted F1: 1.0000
Correct: 500 / 500

Internal scorer-parity reference on the same public pack. This is the benchmark-native ceiling, not a public baseline submission.
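The three headline metrics in the tiles above can be reproduced locally from any list of (gold, predicted) label pairs. A minimal pure-Python sketch of that arithmetic, using the benchmark's four contract labels; this is an illustrative re-derivation of standard accuracy / macro F1 / support-weighted F1, not the protected scorer itself:

```python
from collections import Counter

LABELS = ["VERIFIED", "NOT_FOUND", "MISATTRIBUTED", "CITATION_UNRESOLVED"]

def score(gold, pred):
    """Return (accuracy, macro F1, support-weighted F1) over the four labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    per_label_f1 = {}
    for lbl in LABELS:
        prec = tp[lbl] / (tp[lbl] + fp[lbl]) if tp[lbl] + fp[lbl] else 0.0
        rec = tp[lbl] / (tp[lbl] + fn[lbl]) if tp[lbl] + fn[lbl] else 0.0
        per_label_f1[lbl] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    support = Counter(gold)
    accuracy = sum(tp.values()) / len(gold)
    macro_f1 = sum(per_label_f1.values()) / len(LABELS)          # unweighted mean
    weighted_f1 = sum(per_label_f1[l] * support[l] for l in LABELS) / len(gold)
    return accuracy, macro_f1, weighted_f1
```

Macro F1 averages the four per-label F1 scores equally, so a rare label like MISATTRIBUTED counts as much as VERIFIED; weighted F1 scales each label's F1 by how often it appears in the gold labels, which is why the two numbers diverge in the baseline tile.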

Run locally. Score against the protected key.

01

Download the public pack

Pull the public 500-row cite-bench pack and the submission template directly from the public repo.

02

Run your system locally

Use your own model, prompt, or product flow to emit a simple submission CSV with id and predicted_status.

03

Upload for aggregate scoring

LawEngine scores against the protected key and returns overall accuracy, F1, source-family breakdowns, and a confusion matrix.

Your model must choose one of four exact labels.

Each benchmark row contains a citation and a quoted passage. Your system should output a CSV with exactly two columns, id and predicted_status, where predicted_status is one of these exact strings:

VERIFIED

The citation and quoted text match

Use VERIFIED when the quoted passage really appears in the cited authority and the citation is substantively correct.

NOT_FOUND

The quoted text is not there

Use NOT_FOUND when the quoted passage cannot be found in the cited authority or the current benchmark corpus.

MISATTRIBUTED

The quote exists, but under a different cite

Use MISATTRIBUTED when the quoted language is real but belongs to a different authority than the one provided.

CITATION_UNRESOLVED

The citation cannot be resolved cleanly

Use CITATION_UNRESOLVED when the citation string itself cannot be tied to a live authority in the benchmark corpus.
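The four labels form a simple decision tree: first ask whether the citation resolves at all, then whether the quote appears under that citation, then whether it appears anywhere else in the corpus. A hedged sketch of that routing, where `resolve_citation` and `find_quote` are hypothetical hooks into your own retrieval layer, not part of cite-bench:

```python
def label_row(citation, quote, resolve_citation, find_quote):
    """Route a (citation, quote) pair to one of the four contract labels.

    resolve_citation(citation) -> authority id, or None (hypothetical hook)
    find_quote(quote)          -> set of authority ids containing the quote
    """
    authority = resolve_citation(citation)
    if authority is None:
        return "CITATION_UNRESOLVED"  # cite can't be tied to a live authority
    sources = find_quote(quote)
    if authority in sources:
        return "VERIFIED"             # quote really appears under the given cite
    if sources:
        return "MISATTRIBUTED"        # quote is real, but under a different cite
    return "NOT_FOUND"                # quote is nowhere in the benchmark corpus
```

Ordering matters: CITATION_UNRESOLVED is checked first because the other three labels all presuppose a resolvable authority to compare against.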

Score a submission

Public pack upload and aggregate scoring.
Expected header: id,predicted_status
Same-origin upload, backend-scored, aggregate results only.

Public pack essentials

Version: cite-bench-v1
Rows: 500
Formats: CSV / JSON
Scoring: Private key

Label contract

VERIFIED · NOT_FOUND · MISATTRIBUTED · CITATION_UNRESOLVED

Accepted submission formats

  • CSV upload with id,predicted_status
  • JSON for local or private tooling