How do you know if your LLM application is working well? Traditional ML evaluation — accuracy, precision, recall — doesn't translate directly to generative AI. When the output is free-form text, "correct" is subjective, context-dependent, and multi-dimensional.

Yet evaluation is critical. Without it, you're shipping AI applications with no quality guarantee, no way to detect regressions, and no data to drive improvements.

The Evaluation Dimensions

LLM outputs should be evaluated across multiple independent dimensions:

Faithfulness

Does the response accurately reflect the provided context? In RAG systems, this measures whether the model hallucinates information not present in the retrieved documents. This is arguably the most important dimension for enterprise applications.

Relevance

Does the response actually answer the user's question? A response can be factually correct and well-written but completely miss what the user asked.

Completeness

Does the response cover all aspects of the query? Partial answers that address only the easiest part of a complex question are a common failure mode.

Coherence

Is the response well-structured, logically consistent, and easy to understand? This matters especially for longer outputs like reports and analyses.

Safety

Does the response avoid harmful content, respect privacy, and comply with usage policies? This includes factual harm (confidently wrong answers), representational harm (biased or stereotyping content), and information hazards (leaking sensitive data).

Evaluation Methods

1. Reference-Based Evaluation

Compare model outputs against "gold standard" reference answers. Metrics like ROUGE and BERTScore measure similarity between generated and reference text. This works for well-defined tasks with clear correct answers but falls apart for open-ended generation.
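To make the idea concrete, here is a minimal sketch of ROUGE-1 (unigram overlap F1) using only the standard library. Real evaluations typically use a library such as rouge-score, which also handles stemming and longer n-grams; this simplified version is just to show what "similarity to a reference" means mechanically.

```python
from collections import Counter

def rouge1_f1(generated: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over unigram overlap between two texts."""
    gen_tokens = Counter(generated.lower().split())
    ref_tokens = Counter(reference.lower().split())
    # Overlap counts each shared token up to its minimum frequency in either text.
    overlap = sum((gen_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the cat sat on the mat",
    "a cat was sitting on the mat",
)
```

Note how the paraphrase scores well below 1.0 despite being semantically equivalent; that gap is exactly why embedding-based metrics like BERTScore exist, and why all reference-based metrics struggle with open-ended generation.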

2. LLM-as-Judge

Use a separate LLM to evaluate the output of your application LLM. The judge model scores outputs on specific criteria (faithfulness, relevance, coherence) using structured evaluation prompts.

This approach scales well and correlates reasonably with human judgment. But it has biases: LLMs tend to prefer verbose responses, favor their own generation style, and struggle with domain-specific accuracy.

Best practices for LLM-as-Judge:

  • Use a judge model different from the model being evaluated
  • Provide specific rubrics rather than asking for general quality scores
  • Evaluate each dimension separately rather than asking for a single score
  • Calibrate against human judgments on a representative sample
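The practices above can be sketched as code. This is an illustrative example, not a prescribed implementation: the rubric wording, the 1-5 scale, and the JSON response format are all assumptions you would tune for your application, and the actual LLM call is left out (only the prompt construction and response parsing are shown).

```python
import json
import re

# Hypothetical rubric: one dimension (faithfulness), specific anchors per score,
# and a structured JSON response rather than a free-form quality judgment.
FAITHFULNESS_RUBRIC = """You are evaluating faithfulness: does the answer \
use only information from the provided context?
Score 1: contains claims not supported by the context.
Score 3: mostly grounded, with minor unsupported details.
Score 5: every claim is supported by the context.

Context: {context}
Question: {question}
Answer: {answer}

Respond with JSON: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def build_judge_prompt(context: str, question: str, answer: str) -> str:
    return FAITHFULNESS_RUBRIC.format(
        context=context, question=question, answer=answer
    )

def parse_judge_response(raw: str) -> dict:
    """Extract the JSON object from the judge's reply; fail loudly otherwise."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError(f"no JSON found in judge response: {raw!r}")
    result = json.loads(match.group())
    if not 1 <= result["score"] <= 5:
        raise ValueError(f"score out of range: {result['score']}")
    return result
```

In practice you would define one such rubric per dimension and send each prompt to a judge model from a different provider or family than the model under test, in line with the first best practice above.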

3. Human Evaluation

The gold standard for quality assessment, but expensive and slow. Use human evaluation for:

  • Calibrating automated metrics (does your LLM-as-Judge agree with humans?)
  • Evaluating nuanced quality dimensions (helpfulness, tone, brand alignment)
  • High-stakes applications where automated evaluation isn't trusted

4. Functional Testing

Test specific behaviors rather than overall quality:

  • Does the model refuse to answer questions outside its scope?
  • Does it correctly cite sources when instructed?
  • Does it follow output format specifications?
  • Does it handle edge cases (empty input, very long input, adversarial input)?
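Unlike the subjective dimensions, these behaviors can be checked deterministically. A sketch of what such checks might look like; the citation pattern, refusal markers, and the `ask()` function are all hypothetical placeholders for your application's conventions:

```python
import re

def has_citation(response: str) -> bool:
    """Format check: response cites at least one source like [1]."""
    return bool(re.search(r"\[\d+\]", response))

def is_refusal(response: str) -> bool:
    """Heuristic refusal detector for out-of-scope queries (illustrative
    phrase list; a real check might use a classifier instead)."""
    markers = ("i can't help", "outside the scope", "i'm not able")
    return any(m in response.lower() for m in markers)

def within_length(response: str, max_words: int = 200) -> bool:
    """Spec check: response respects a maximum length."""
    return len(response.split()) <= max_words

# Example functional assertions against a hypothetical app entry point `ask()`:
# assert has_citation(ask("What does section 3 of the policy say?"))
# assert is_refusal(ask("Give me medical dosage advice."))
```

Because these checks are cheap and deterministic, they belong in your regular test suite and can run on every commit, unlike LLM-as-Judge scoring which costs an API call per example.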

Building an Evaluation Pipeline

  1. Define dimensions: Choose 3-5 evaluation dimensions relevant to your application
  2. Create evaluation datasets: 100-500 representative queries with annotations
  3. Implement automated scoring: LLM-as-Judge for subjective dimensions, deterministic checks for format compliance
  4. Calibrate with humans: Have domain experts rate a subset and ensure automated scores correlate
  5. Integrate into CI/CD: Run evaluation on every prompt change, model update, or configuration change
  6. Monitor in production: Sample and evaluate live responses continuously
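Step 5, the CI/CD gate, can be as simple as comparing per-dimension mean scores against thresholds and failing the build on regressions. A minimal sketch, assuming scores have already been produced upstream by the automated scorers; the dimension names and threshold values are illustrative:

```python
# Hypothetical per-dimension minimums for a CI quality gate.
THRESHOLDS = {"faithfulness": 4.0, "relevance": 3.5, "format_compliance": 1.0}

def gate(results: list[dict]) -> list[str]:
    """Return the dimensions whose mean score falls below its threshold.

    An empty list means the change passes the quality gate.
    """
    failures = []
    for dim, minimum in THRESHOLDS.items():
        mean = sum(r[dim] for r in results) / len(results)
        if mean < minimum:
            failures.append(f"{dim}: {mean:.2f} < {minimum}")
    return failures

# Example: scores for two evaluated responses from the eval dataset.
results = [
    {"faithfulness": 5, "relevance": 4, "format_compliance": 1},
    {"faithfulness": 4, "relevance": 3, "format_compliance": 1},
]
failures = gate(results)
```

Comparing means against fixed thresholds is the simplest possible gate; teams often also compare against the previous baseline run so that any statistically meaningful drop blocks the change, even when the absolute score is still above threshold.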

The Bottom Line

LLM evaluation is hard but not optional. The organizations shipping reliable AI applications have invested in systematic evaluation — automated scoring calibrated against human judgment, integrated into their development workflow, and continuously monitoring production quality. Without evaluation, you're guessing.