GPT-5.4-mini
Public pack · reasoning high
Observed baseline on the public 500-row pack using the public cite-bench runner and aggregate scoring flow.
Public benchmark
A public citation-grounding benchmark with a downloadable 500-row pack, a protected scoring key, and aggregate evaluation, built for legal AI teams that want to measure trust, not just fluency.
Reference runs
Static benchmark snapshots on the public 500-row pack, shown here as quick orientation tiles before you upload your own run.
Public pack · reasoning high
Observed baseline on the public 500-row pack using the public cite-bench runner and aggregate scoring flow.
Protected internal reference lane
Internal scorer-parity reference on the same public pack. This is the benchmark-native ceiling, not a public baseline submission.
How it works
Pull the public 500-row cite-bench pack and the submission template directly from the public repo.
Use your own model, prompt, or product flow to emit a simple submission CSV with id and predicted_status.
LawEngine scores against the protected key and returns overall accuracy, F1, source-family breakdowns, and a confusion matrix.
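The submission side of this flow can be sketched as a few lines of Python. This is a minimal, unofficial sketch, not the cite-bench runner itself; the file name and the `write_submission` helper are assumptions for illustration.

```python
import csv

# The four label strings accepted by the benchmark's label contract.
ALLOWED_STATUSES = {"VERIFIED", "NOT_FOUND", "MISATTRIBUTED", "CITATION_UNRESOLVED"}

def write_submission(predictions, path="submission.csv"):
    """Write a two-column submission CSV.

    predictions: iterable of (id, predicted_status) pairs produced by
    your own model, prompt, or product flow.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "predicted_status"])  # required header
        for row_id, status in predictions:
            if status not in ALLOWED_STATUSES:
                raise ValueError(f"invalid predicted_status: {status!r}")
            writer.writerow([row_id, status])
```

However you generate the labels, the upload only needs the two columns shown here; all scoring against the protected key happens on the LawEngine side.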
Submission contract
Each benchmark row contains a citation and a quoted passage. Your system should output a CSV with exactly two columns, id and predicted_status, where predicted_status is one of these exact strings:
Use VERIFIED when the quoted passage really appears in the cited authority and the citation is substantively correct.
Use NOT_FOUND when the quoted passage cannot be found in the cited authority or the current benchmark corpus.
Use MISATTRIBUTED when the quoted language is real but belongs to a different authority than the one provided.
Use CITATION_UNRESOLVED when the citation string itself cannot be tied to a live authority in the benchmark corpus.
Upload
Benchmark pack
Label contract
Accepted submission formats