Just Think AI

LLM-as-Judge

Using an LLM to evaluate the quality of another LLM's output.

LLM-as-judge is an eval technique where you use a separate (usually stronger) LLM to score or compare model outputs against a rubric. You prompt the judge with the question, the response, and evaluation criteria — and it returns a score or a preference between two outputs.
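A minimal sketch of that loop, assuming a single-score rubric. `call_llm` is a stand-in for whatever client you actually use (OpenAI, Anthropic, a local model); it is stubbed here so the example runs on its own.

```python
# Minimal LLM-as-judge sketch: build a prompt with the question, the
# response, and the criteria, then parse an integer score out of the
# judge's reply.

JUDGE_PROMPT = """You are an impartial judge.

Question:
{question}

Response:
{response}

Criteria: {criteria}

Rate the response from 1 (poor) to 5 (excellent).
Reply with only the integer score."""


def call_llm(prompt: str) -> str:
    # Stub: a real implementation would send `prompt` to a judge model
    # and return its text completion.
    return "4"


def judge(question: str, response: str, criteria: str) -> int:
    prompt = JUDGE_PROMPT.format(
        question=question, response=response, criteria=criteria
    )
    return int(call_llm(prompt).strip())


score = judge("What is the capital of France?", "Paris.", "factual accuracy")
print(score)  # integer in 1..5
```

Forcing "integer only" output keeps parsing trivial; production setups often ask for JSON with a score plus a short justification instead.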

It's the main technique for evaluating open-ended output quality at scale when you can't define exact-match criteria. Common rubrics: faithfulness (does the answer only claim what the source supports?), completeness (does it cover all required points?), tone (does it match the required style?), and safety (does it avoid prohibited content?).
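One way to encode those rubric dimensions is as separate per-criterion prompts, so the judge scores each dimension on its own rather than giving one holistic number. The criterion wording below is illustrative, not a fixed standard.

```python
# Structured rubric: one prompt per criterion, scored independently.

RUBRIC = {
    "faithfulness": "Does the answer only claim what the source supports?",
    "completeness": "Does it cover all required points?",
    "tone": "Does it match the required style?",
    "safety": "Does it avoid prohibited content?",
}


def rubric_prompts(question: str, response: str) -> dict:
    # Build one judge prompt per rubric dimension.
    return {
        name: (
            f"Question: {question}\n"
            f"Response: {response}\n"
            f"Criterion: {desc}\n"
            f"Score 1-5, integer only."
        )
        for name, desc in RUBRIC.items()
    }


prompts = rubric_prompts("Summarize the memo.", "The memo says X.")
print(sorted(prompts))
```

Per-criterion scores also make regressions easier to localize: a drop in faithfulness with stable tone points at a different fix than the reverse.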

Gotchas: LLM judges have biases (they prefer longer answers, they prefer well-structured prose, and they tend to agree with whichever option is presented first). Mitigation: swap option order and average, use structured rubrics rather than holistic ratings, and always spot-check with humans. A strong judge (GPT-4o, Claude Sonnet+) evaluating a weaker model gives more signal than a weak judge evaluating a strong one.
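The order-swap mitigation can be sketched as follows for pairwise comparison: run the judge twice with A and B swapped, and only count a win when both orderings agree. `pairwise_judge` is a deterministic stub standing in for a real judge-model call; its length heuristic is there only to make the example self-contained.

```python
# Order-swap debiasing for pairwise LLM-as-judge comparisons.

def pairwise_judge(question: str, first: str, second: str) -> str:
    # Stub with a deliberate position flavor: prefers whichever option
    # is strictly longer, else the first one shown. A real call would
    # prompt a judge model with both options and parse its preference.
    return "second" if len(second) > len(first) else "first"


def debiased_compare(question: str, a: str, b: str) -> str:
    r1 = pairwise_judge(question, a, b)  # A shown first
    r2 = pairwise_judge(question, b, a)  # B shown first
    if r1 == "first" and r2 == "second":
        return "A"  # judge picked A in both orderings
    if r1 == "second" and r2 == "first":
        return "B"  # judge picked B in both orderings
    # The verdict flipped with presentation order: treat it as a tie
    # driven by position bias rather than a real quality difference.
    return "tie"


print(debiased_compare("Q?", "short", "a much longer answer"))
```

Averaging numeric scores across both orderings is the analogous fix for scalar-rubric judging.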

Bring this to your business

Knowing the term is one thing. Shipping it is another.

We do two-week AI Sprints — one term, one workflow, into production by Day 10.