LLM-as-a-Judge: Where Do Its Signals Break, When Do They Hold, and What Should “Evaluation” Mean?
What precisely is being measured when a judge LLM assigns a 1–5 (or pairwise) rating? Most “correctness/faithfulness/completeness” rubrics are project-specific. Without task-grounded definitions, a scalar rating can drift from business outcomes (e.g., “useful marketing post” vs. “high completeness”). Surveys of LLM-as-a-judge (LAJ) note that rubric ambiguity and prompt template choices materially shift scores and human…
