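Dissecting the Math Behind "RAG Evaluation (RAGAS)": How Do We Quantify the "Negative Contribution" of Retrieval Results to Answer Generation? - 智猿学院: technical lectures on front-end and back-end development, databases, AI, and cloud computing

Hello, everyone.

Today we will take a close look at a topic that is critical in Retrieval-Augmented Generation (RAG) systems yet often overlooked: how to quantify the "negative contribution" of retrieval results to the final generated answer. RAG systems combine the strengths of retrieval and generative models to deliver answers that are more accurate, more current, and more traceable. A common misconception, however, is that adding retrieval always helps. It does not. Poor retrieval results, whether irrelevant, misleading, or incomplete, can become a liability for answer generation: they degrade system performance and can even introduce hallucinations.

Using RAGAS, a widely used RAG evaluation framework, as our example, we will examine how its core metrics capture and quantify these negative contributions from both a mathematical and an engineering perspective. As a programmer, I will not stop at theory: we will also work through code and show how to build a rigorous evaluation workflow.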

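I. Introduction: Challenges and Evaluation Needs in RAG Systems

RAG is an important advance in the large language model (LLM) field. It addresses inherent weaknesses of pure LLMs in knowledge freshness, factual accuracy, and explainability. By retrieving relevant information from an external knowledge base before generation, a RAG model can:

Improve accuracy: answers are generated from factual evidence.
Reduce hallucination: the model is less prone to fabricating information.
Improve explainability: answer sources are provided so users can verify them.
Stay current: the knowledge base can be updated independently of the model.

A typical RAG workflow consists of:

User query (Query): the user asks a question.
Retrieval: the system retrieves relevant documents or text passages (Context) from the knowledge base based on the query.
Generation: the LLM receives the query and the retrieved context and generates the final answer (Answer).

This seemingly linear pipeline is full of potential pitfalls, however. The quality of the retrieval stage sets the ceiling for the generation stage.

If the retrieved context is irrelevant to the query, the LLM may be misled and produce an off-topic answer.
If the retrieved context contains incorrect information, the LLM may propagate those errors.
If the retrieved context is incomplete, the LLM may be unable to give a complete and accurate answer, and may even try to fill in the gaps itself, which leads to hallucination.

We refer to all of these cases as the "negative contribution" of retrieval results to answer generation. It is more than a simple "error": it means wasted resources, diluted LLM attention, and potential damage to final answer quality. We therefore need a rigorous way to identify, measure, and track these negative contributions. That is exactly the core value of evaluation frameworks such as RAGAS.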

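II. RAGAS Overview: Core Metrics and Philosophy

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework designed specifically for evaluating RAG systems. Its key idea is to use an LLM as the judge, which avoids the cost, latency, and subjectivity of manual annotation. The core philosophy of RAGAS is that a high-quality RAG system should produce answers with all of the following properties:

Faithful to the context (Faithfulness): everything in the answer should be grounded in the retrieved context, with no hallucination.
Relevant to the query (Answer Relevancy): the answer should address the user's query directly and accurately.
High context relevancy (Context Relevancy): the retrieved context should contain only information relevant to the query, without redundancy or noise.
High context recall (Context Recall): the retrieved context should contain all the key information needed to answer the query.

Together these metrics form a comprehensive evaluation scheme that exposes problems in a RAG system from several angles, and many of those problems map directly onto the negative contribution discussed here.

The table below summarizes the core RAGAS metrics and how each relates to negative contribution:

RAGAS metric	Range	Ideal score	Relation to "negative contribution"
Context Relevancy	[0, 1]	1	A low score means the retrieved results contain a lot of irrelevant information that adds noise and distracts the LLM. Negative contribution: noise.
Faithfulness	[0, 1]	1	A low score means the answer contains hallucinations not grounded in the context. Negative contribution: hallucination and unsupported claims (partly caused by insufficient or misleading context).
Context Recall	[0, 1]	1	A low score means the retrieved results lack key information needed to answer the question. Negative contribution: missing information.
Answer Relevancy	[0, 1]	1	A low score means the answer drifts away from the query. Negative contribution: off-topic answers (possibly induced by irrelevant context).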

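III. Quantifying "Negative Contribution" in Depth: RAGAS Metrics and Their Mathematical Derivation

We now examine each RAGAS metric, the mathematical idea behind it, and its LLM-based judging mechanism, and make explicit how it quantifies the negative contribution of retrieval. Note that the "mathematical derivation" in RAGAS mostly consists of turning an LLM's language understanding, reasoning, and judgment into a score. Unlike the closed-form formulas of classical statistics or machine learning, it is a "soft derivation" grounded in semantics and logic.

A. Context Relevancy: Quantifying Noise and Distraction

Concept: Context Relevancy measures what fraction of the retrieved context is actually useful for producing the final answer.
Negative contribution: A low Context Relevancy score means the retrieved results contain a lot of noise that does not help answer the question. This noise increases the LLM's processing load, can dilute the genuinely useful information, and can pull the LLM's attention away from the core question. The negative contribution shows up as wasted compute, slower inference, and lower answer quality.

Mathematical derivation and LLM evaluation mechanism:
RAGAS uses an LLM to judge how much each sentence (or smaller semantic unit) in the context contributes to the answer.
The core idea: given the user query, the generated answer, and the retrieved context, the LLM is asked to decide whether each individual statement in the context is needed to produce the answer. (This walkthrough conditions the judgment on both the query and the answer; the `context_relevancy` metric shipped with RAGAS takes only the question and the contexts as input.)

Atomic statement extraction: the LLM first splits the retrieved context into a set of independent, individually assessable atomic statements.
Relevance judgment: for each atomic statement $s_i$ in the context, the LLM decides, based on the query and the answer, whether $s_i$ is "relevant" or "necessary" for producing the answer. This judgment is usually elicited with a carefully designed prompt.
For example, the LLM might be asked: "Is the following sentence necessary for answering the question '[query]' and producing the answer '[answer]'? Reply 'Yes' or 'No'."

Let $I(s_i, \text{query}, \text{answer})$ be an indicator function that equals 1 when the LLM judges statement $s_i$ to be relevant and 0 otherwise.
Computing Context Relevancy:
Context Relevancy is computed as:
$$
\text{Context Relevancy} = \frac{\sum_{i=1}^{N_c} I(s_i, \text{query}, \text{answer})}{N_c}
$$
where $N_c$ is the total number of atomic statements in the context.

The negative contribution here can be read directly as $1 - \text{Context Relevancy}$. The higher this value, the more irrelevant information the context contains and the more it interferes with answer generation.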
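The ratio above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RAGAS implementation: `context_relevancy_score` and `keyword_judge` are hypothetical names, the judge looks only at the query for brevity, and the keyword-overlap rule merely stands in for the LLM call so that the snippet runs offline.

```python
from typing import Callable, List


def context_relevancy_score(
    statements: List[str],
    query: str,
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of context statements that the judge marks as relevant to the query."""
    if not statements:
        return 0.0  # assumption: an empty context contributes nothing useful
    relevant = sum(1 for s in statements if judge(s, query))
    return relevant / len(statements)


def keyword_judge(statement: str, query: str) -> bool:
    """Toy stand-in for the LLM verdict: relevant if it shares a content word with the query."""
    stopwords = {"what", "is", "the", "of", "a", "in"}
    query_words = {w.strip("?.,").lower() for w in query.split()} - stopwords
    statement_words = {w.strip("?.,").lower() for w in statement.split()}
    return bool(query_words & statement_words)


statements = [
    "Paris is the capital and most populous city of France.",
    "The Eiffel Tower is a famous landmark in Paris.",
    "The French Revolution began in 1789.",
    "Mount Fuji is the highest mountain in Japan.",
]
score = context_relevancy_score(statements, "What is the capital of France?", keyword_judge)
negative_contribution = 1 - score
print(score, negative_contribution)
```

Only the first statement shares a content word with the query, so one of four statements counts as relevant. Passing the judge in as a callable keeps the arithmetic separate from the verdict source, so a real LLM-backed judge can be substituted without touching the scoring function.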

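Code example (using the RAGAS library):

The RAGAS library hides the complexity of using an LLM as a judge. We only need to supply the data and configure the evaluator.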

import os
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_relevancy
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document

# Make sure the OpenAI API key is set
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# 1. Prepare mock data
# In a real application this data comes from your RAG system
data_samples = [
    {
        "query": "Who developed the theory of relativity?",
        "answer": "Albert Einstein developed the theory of relativity, which includes both special and general relativity.",
        "contexts": [
            Document(page_content="Albert Einstein was a German-born theoretical physicist."),
            Document(page_content="He developed the theory of relativity, one of the two pillars of modern physics."),
            Document(page_content="Einstein is also known for his mass-energy equivalence formula E = mc²."),
            Document(page_content="He received the Nobel Prize in Physics in 1921 for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect. He was born in Ulm, Germany."),
            Document(page_content="The theory of relativity is a revolutionary scientific theory developed in the early 20th century.") # This sentence is somewhat redundant for "who developed" but relevant to the topic.
        ],
        "ground_truth": "Albert Einstein developed the theory of relativity." # For Context Recall, not Context Relevancy
    },
    {
        "query": "What is the capital of France?",
        "answer": "The capital of France is Paris, a major European city.",
        "contexts": [
            Document(page_content="Paris is the capital and most populous city of France."),
            Document(page_content="It is also known for its fashion, cuisine, art, and culture."),
            Document(page_content="France is a country located in Western Europe."),
            Document(page_content="The Eiffel Tower is a famous landmark in Paris."),
            Document(page_content="The Louvre Museum is another prominent attraction in Paris. The French Revolution began in 1789.") # Last part is irrelevant
        ],
        "ground_truth": "Paris is the capital of France."
    },
    {
        "query": "Explain quantum entanglement.",
        "answer": "Quantum entanglement is a phenomenon where two or more particles become linked in such a way that they share the same fate, even when separated by vast distances. Measuring the state of one instantly influences the state of the other, defying classical intuition.",
        "contexts": [
            Document(page_content="Quantum entanglement is a physical phenomenon that occurs when a group of particles is generated, interact, or share spatial proximity in a way such that the quantum state of each particle cannot be described independently of the others, even when the particles are separated by a large distance."),
            Document(page_content="Bell's theorem is a fundamental concept in quantum mechanics related to entanglement."),
            Document(page_content="It is a cornerstone of quantum mechanics and has been experimentally verified."),
            Document(page_content="The measurement of one entangled particle instantly influences the other, regardless of the distance between them."),
            Document(page_content="Classical physics cannot explain entanglement. Quantum computing leverages entanglement.") # Last part is somewhat related but not directly needed for "explain"
        ],
        "ground_truth": "Quantum entanglement is a phenomenon where particles become linked and their states are correlated instantly, regardless of distance."
    }
]

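# RAGAS expects a Dataset object as input.
# The contexts field must be a list of lists of strings, so page_content is
# extracted from each Document below before the Dataset is built.
# Column names depend on the RAGAS version (newer releases expect `ground_truth` as a plain string).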
ragas_dataset_data = {
    "question": [s["query"] for s in data_samples],
    "answer": [s["answer"] for s in data_samples],
    "contexts": [[doc.page_content for doc in s["contexts"]] for s in data_samples],
    "ground_truths": [[s["ground_truth"]] for s in data_samples], # ground_truths expects a list of lists, even for a single ground truth
}
ragas_dataset = Dataset.from_dict(ragas_dataset_data)

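# 2. Initialize the LLM used for evaluation
# RAGAS uses an LLM to carry out its internal judgments
# A stronger model such as GPT-4 or Anthropic Claude is recommended for more accurate evaluation
# If you use an open-source model, make sure it can follow complex instructions and produce structured output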
llm_for_eval = ChatOpenAI(model="gpt-4o-mini", temperature=0) # gpt-4o-mini is cost-effective for demos

# 3. Run the evaluation
print("Starting Context Relevancy evaluation...")
result = evaluate(
    dataset=ragas_dataset,
    metrics=[context_relevancy],
    llm=llm_for_eval,
    # A metric can also be given its own judge LLM if needed;
    # how to configure that depends on the RAGAS version you have installed.
)

print("\n--- Context Relevancy Evaluation Results ---")
print(result)

# Inspect the Context Relevancy score of each sample
df = result.to_pandas()
print("\nDataFrame of results:")
print(df[['question', 'contexts', 'answer', 'context_relevancy']])

# Negative contribution analysis: 1 - Context Relevancy
df['context_relevancy_negative_contribution'] = 1 - df['context_relevancy']
print("\nContext Relevancy Negative Contribution:")
print(df[['question', 'context_relevancy', 'context_relevancy_negative_contribution']])

# Walkthrough:
# Take the second sample, "What is the capital of France?"
# Contexts: "Paris is the capital and most populous city of France.", "It is also known for its fashion, cuisine, art, and culture.", "France is a country located in Western Europe.", "The Eiffel Tower is a famous landmark in Paris.", "The Louvre Museum is another prominent attraction in Paris. The French Revolution began in 1789."
# Answer: "The capital of France is Paris, a major European city."
# The LLM should judge "The French Revolution began in 1789." to be irrelevant to the question about the capital of France.
# If the computed Context Relevancy is low, such irrelevant statements have pulled the score down and constitute a negative contribution.

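B. Faithfulness: Quantifying Hallucination and Drift

Concept: Faithfulness measures what fraction of the information in the generated answer can be directly inferred from, or is supported by, the retrieved context.
Negative contribution: A low Faithfulness score means the generated answer contains hallucinated information the model made up, or information that looks plausible but has no basis in the provided context. This is one of the most serious failures of a RAG system because it directly damages the accuracy and credibility of the answer. Hallucination is a property of LLMs, but low faithfulness is often caused by retrieved context that is insufficient or ambiguous, or by the LLM failing to use the context correctly.

Mathematical derivation and LLM evaluation mechanism:
RAGAS uses an LLM to judge whether each atomic fact in the answer is supported by the context.

Atomic fact extraction: the LLM first splits the generated answer into a set of independent, verifiable atomic statements. For example, "Albert Einstein developed the theory of relativity, which includes both special and general relativity." can be split into "Albert Einstein developed the theory of relativity." and "The theory of relativity includes both special and general relativity."
Support judgment: for each atomic fact $f_j$ in the answer, the LLM decides, based on the context, whether $f_j$ is supported by the context. This is again elicited with a prompt.
For example, the LLM might be asked: "Can the following fact be inferred from the provided context? Fact: '[fact_j]'. Context: '[context]'. Reply 'Yes' or 'No'."

Let $S(f_j, \text{context})$ be an indicator function that equals 1 when the LLM judges fact $f_j$ to be supported by the context and 0 otherwise.
Computing Faithfulness:
Faithfulness is computed as:
$$
\text{Faithfulness} = \frac{\sum_{j=1}^{N_f} S(f_j, \text{context})}{N_f}
$$
where $N_f$ is the total number of atomic facts in the answer.

The negative contribution here can be read as $1 - \text{Faithfulness}$. The higher this value, the larger the hallucinated share of the answer and the less the answer can be trusted.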
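In practice the judge replies in free text, so its verdicts have to be parsed before the ratio can be computed. The sketch below assumes replies that begin with "Yes" or "No" (a hypothetical reply format, not the actual RAGAS output schema) and treats anything it cannot parse as unsupported, which is a deliberately conservative choice. `parse_verdict` and `faithfulness_score` are illustrative names.

```python
from typing import List


def parse_verdict(reply: str) -> int:
    """Map a judge reply to 1 (supported) or 0 (unsupported or unparseable)."""
    normalized = reply.strip().lower()
    return 1 if normalized.startswith("yes") else 0


def faithfulness_score(replies: List[str]) -> float:
    """Share of answer statements whose judge reply indicates support by the context."""
    if not replies:
        return 0.0  # assumption: no extractable statements means nothing was verified
    return sum(parse_verdict(r) for r in replies) / len(replies)


# One reply per atomic statement extracted from the answer (hypothetical judge output).
replies = [
    "Yes. The context states that Einstein developed the theory of relativity.",
    "No, the context never mentions a light bulb.",
]
score = faithfulness_score(replies)
print(f"faithfulness={score:.2f}, negative contribution={1 - score:.2f}")
```

Counting unparseable replies as 0 biases the score downward rather than upward; a stricter pipeline might instead retry the judge or flag the sample for human review.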

Code example (using the RAGAS library):

# ... (data preparation and LLM initialization as above) ...
from ragas.metrics import faithfulness

# Run the Faithfulness evaluation
print("\nStarting Faithfulness evaluation...")
result_faithfulness = evaluate(
    dataset=ragas_dataset,
    metrics=[faithfulness],
    llm=llm_for_eval,
)

print("\n--- Faithfulness Evaluation Results ---")
print(result_faithfulness)

df_faithfulness = result_faithfulness.to_pandas()
print("\nDataFrame of faithfulness results:")
print(df_faithfulness[['question', 'answer', 'faithfulness']])

# Negative contribution analysis: 1 - Faithfulness
df_faithfulness['faithfulness_negative_contribution'] = 1 - df_faithfulness['faithfulness']
print("\nFaithfulness Negative Contribution (Hallucination Rate):")
print(df_faithfulness[['question', 'faithfulness', 'faithfulness_negative_contribution']])

# Walkthrough:
# Suppose that for the query "Who developed the theory of relativity?" the answer is
# "Albert Einstein developed the theory of relativity, and he also invented the light bulb."
# The contexts never say that Einstein invented the light bulb.
# The LLM splits the answer into two facts:
# 1. "Albert Einstein developed the theory of relativity." -> Context supports this (S=1)
# 2. "He also invented the light bulb." -> Context does NOT support this (S=0)
# Faithfulness = (1 + 0) / 2 = 0.5
# Negative contribution = 1 - 0.5 = 0.5, i.e. 50% of the facts in the answer are not supported by the context.

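C. Context Recall: Quantifying Missing Information

Concept: Context Recall measures what fraction of the ground-truth information (ground_truth) is contained in the retrieved context.
Negative contribution: A low Context Recall score means the retrieval system failed to fetch all the key information needed to answer the query. This negative contribution does not actively inject harmful information; rather, the missing key information prevents the LLM from producing a complete, accurate, or even correct answer. The result may be an incomplete or inaccurate answer, or a hallucination produced by the LLM to compensate for the gap (Faithfulness will catch the hallucination, but its root cause may lie here).

Mathematical derivation and LLM evaluation mechanism:
Evaluating Context Recall requires a ground_truth answer. An LLM judges whether each atomic fact in the ground_truth can be found in the context.

Atomic fact extraction: the LLM first splits the ground_truth answer into a set of independent, verifiable atomic facts.
Presence judgment: for each atomic fact $g_k$ in the ground_truth, the LLM decides, based on the retrieved context, whether $g_k$ can be found in the context. This is again elicited with a prompt.
For example, the LLM might be asked: "Can the following fact be found in the provided context? Fact: '[ground_truth_fact_k]'. Context: '[context]'. Reply 'Yes' or 'No'."

Let $P(g_k, \text{context})$ be an indicator function that equals 1 when the LLM judges fact $g_k$ to be present in the context and 0 otherwise.
Computing Context Recall:
Context Recall is computed as:
$$
\text{Context Recall} = \frac{\sum_{k=1}^{N_g} P(g_k, \text{context})}{N_g}
$$
where $N_g$ is the total number of atomic facts in the ground_truth.

The negative contribution here can be read as $1 - \text{Context Recall}$. The higher this value, the more key information the retrieved context is missing and the higher the risk of an incomplete or inaccurate answer.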
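For diagnosis, the list of ground-truth statements that were not found is often more useful than the score alone. The following sketch, assuming the per-statement verdicts have already been obtained from the judge, returns both; `context_recall_report` is a hypothetical helper, not part of RAGAS.

```python
from typing import Dict, List, Tuple


def context_recall_report(verdicts: Dict[str, bool]) -> Tuple[float, List[str]]:
    """Return (recall, missing statements) from per-statement presence verdicts."""
    if not verdicts:
        return 0.0, []  # assumption: no ground-truth statements means nothing to recall
    missing = [fact for fact, found in verdicts.items() if not found]
    recall = (len(verdicts) - len(missing)) / len(verdicts)
    return recall, missing


# Verdicts for the two atomic facts of a ground-truth answer (hypothetical judge output).
verdicts = {
    "Albert Einstein developed the theory of relativity.": True,
    "It fundamentally changed our understanding of space and time.": False,
}
recall, missing = context_recall_report(verdicts)
print(f"context recall={recall:.2f}, negative contribution={1 - recall:.2f}")
for fact in missing:
    print("missing from context:", fact)
```

The missing statements point at what the retriever should have surfaced, which makes them a natural starting set for retrieval test queries.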

Code example (using the RAGAS library):

# ... (data preparation and LLM initialization as above) ...
from ragas.metrics import context_recall

# Run the Context Recall evaluation
print("\nStarting Context Recall evaluation...")
result_recall = evaluate(
    dataset=ragas_dataset,
    metrics=[context_recall],
    llm=llm_for_eval,
)

print("\n--- Context Recall Evaluation Results ---")
print(result_recall)

df_recall = result_recall.to_pandas()
print("\nDataFrame of context recall results:")
# Adjust the ground-truth column name to the one your RAGAS version emits
print(df_recall[['question', 'ground_truths', 'contexts', 'context_recall']])

# Negative contribution analysis: 1 - Context Recall
df_recall['context_recall_negative_contribution'] = 1 - df_recall['context_recall']
print("\nContext Recall Negative Contribution (Missing Information Rate):")
print(df_recall[['question', 'context_recall', 'context_recall_negative_contribution']])

# Walkthrough:
# Suppose that for the query "Who developed the theory of relativity and what is its significance?",
# the ground_truth is "Albert Einstein developed the theory of relativity, which fundamentally changed our understanding of space and time."
# while the retrieved contexts contain only "Albert Einstein developed the theory of relativity."
# The LLM splits the ground_truth into two facts:
# 1. "Albert Einstein developed the theory of relativity." -> Context supports this (P=1)
# 2. "It fundamentally changed our understanding of space and time." -> Context does NOT support this (P=0)
# Context Recall = (1 + 0) / 2 = 0.5
# Negative contribution = 1 - 0.5 = 0.5, i.e. 50% of the key information is missing.

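D. Answer Relevancy: Quantifying Off-Topic Answers

Concept: Answer Relevancy measures how relevant the generated answer is to the original query.
Negative contribution: A low Answer Relevancy score means the generated answer fails to address the user's question directly and clearly, drifting off topic or supplying irrelevant or secondary information. This negative contribution shows up directly in the user experience: even when context quality and faithfulness are both good, an off-topic answer is useless. Retrieval quality can be a cause, for example when a large amount of related but non-essential context induces the LLM to produce a less relevant answer.

Mathematical derivation and LLM evaluation mechanism:
Answer Relevancy is harder to evaluate than the other metrics because it requires measuring how well the query and the answer match semantically. RAGAS typically uses the following strategy:

Question generation: the LLM is asked to work backwards from the generated answer and produce several "potential questions" (synthetic_questions) that the answer could be responding to.
Relevancy scoring: for each generated synthetic_question_i, the semantic similarity between the original query and synthetic_question_i is computed. In RAGAS this is the cosine similarity between the embeddings of the two questions rather than a score requested from the LLM.
A simpler alternative is to have the LLM judge directly, from the query and the answer, how well the answer addresses the query. This is usually a continuous score between 0 and 1.

Let $\mathrm{Sim}(\text{query}, \text{answer})$ be the scoring function for the semantic relevance of the answer to the query.
Computing Answer Relevancy:
In the simplified variant, Answer Relevancy is just the relevance score the LLM returns:
$$
\text{Answer Relevancy} = \mathrm{Sim}(\text{query}, \text{answer})
$$
In the question-generation variant, it is the average similarity between the generated potential questions and the original query:
$$
\text{Answer Relevancy} = \frac{1}{M} \sum_{i=1}^{M} \text{Similarity}(\text{query}, \text{synthetic\_question}_i)
$$
where $M$ is the number of generated potential questions.

The negative contribution here can be read as $1 - \text{Answer Relevancy}$. The higher this value, the further the answer drifts from the topic and the more effort the user needs to find the information they asked for.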
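The averaged-similarity formula can be made concrete with cosine similarity over embedding vectors. The sketch below is only an illustration: the 2-D vectors are toy stand-ins for real embeddings, and `cosine_similarity` and `answer_relevancy_score` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_relevancy_score(
    query_vec: Sequence[float],
    synthetic_vecs: List[Sequence[float]],
) -> float:
    """Mean cosine similarity between the query and each synthetic question."""
    if not synthetic_vecs:
        return 0.0  # assumption: no synthetic questions means no evidence of relevance
    total = sum(cosine_similarity(query_vec, v) for v in synthetic_vecs)
    return total / len(synthetic_vecs)


# Toy vectors: the first synthetic question matches the query exactly,
# the second is orthogonal to it, and the third is partially aligned.
query_vec = [1.0, 0.0]
synthetic_vecs = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
score = answer_relevancy_score(query_vec, synthetic_vecs)
print(f"answer relevancy={score:.3f}, negative contribution={1 - score:.3f}")
```

The three similarities are 1.0, 0.0, and 0.6, so the mean is about 0.533. Note that cosine similarity can in principle be negative, so a score built this way is not strictly guaranteed to stay inside [0, 1].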

Code example (using the RAGAS library):

# ... (data preparation and LLM initialization as above) ...
from ragas.metrics import answer_relevancy

# Run the Answer Relevancy evaluation
print("\nStarting Answer Relevancy evaluation...")
result_ans_rel = evaluate(
    dataset=ragas_dataset,
    metrics=[answer_relevancy],
    llm=llm_for_eval,
)

print("\n--- Answer Relevancy Evaluation Results ---")
print(result_ans_rel)

df_ans_rel = result_ans_rel.to_pandas()
print("\nDataFrame of answer relevancy results:")
print(df_ans_rel[['question', 'answer', 'answer_relevancy']])

# Negative contribution analysis: 1 - Answer Relevancy
df_ans_rel['answer_relevancy_negative_contribution'] = 1 - df_ans_rel['answer_relevancy']
print("\nAnswer Relevancy Negative Contribution (Off-topic Rate):")
print(df_ans_rel[['question', 'answer_relevancy', 'answer_relevancy_negative_contribution']])

# Walkthrough:
# Suppose the query is "Who developed the theory of relativity?" and the answer is
# "The theory of relativity is a cornerstone of modern physics, profoundly impacting our understanding of space, time, and gravity."
# The answer is correct in itself, but it explains the significance of the theory instead of answering "who developed it".
# Questions generated back from this answer will differ from the original query, so the Answer Relevancy score is low.
# A high negative contribution therefore indicates an off-topic answer.

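IV. Combined Evaluation and Negative-Contribution Weights

Each of the four RAGAS metrics quantifies the negative contribution of retrieval to answer generation from a different angle. Taken together, they give a more complete picture of RAG system performance.

How should the metrics be combined?
Usually each metric is analyzed on its own to diagnose problems in individual components of the RAG system. In some scenarios, however, a single aggregated negative-contribution score, or a weighted overall score, is needed.

For example, a simple aggregate is:
$$
\text{Overall Negative Contribution} = w_1(1 - \text{Context Relevancy}) + w_2(1 - \text{Faithfulness}) + w_3(1 - \text{Context Recall}) + w_4(1 - \text{Answer Relevancy})
$$
where $w_i$ is the weight of the corresponding metric and $\sum_i w_i = 1$.

Relative importance of the metrics and weighting:
Weights should be chosen according to the application scenario and business requirements.

Regulatory, medical, or financial domains: Faithfulness should carry a very high weight. Hallucination is unacceptable in these settings, and the harm of its negative contribution far outweighs the others.
Information retrieval or question answering systems: Context Recall and Answer Relevancy may matter more, to ensure that information is complete and answers are direct.
Resource-constrained systems: Context Relevancy may matter more, to avoid unnecessary LLM computation and cost.

Example table: metric changes and negative-contribution analysis across RAG system iterations

Suppose we are iterating on a RAG system and have recorded evaluation results for two versions:

Metric	Version 1 Score	Version 1 Negative Contribution	Version 2 Score	Version 2 Negative Contribution	Change (V2-V1)
Context Relevancy	0.75	0.25	0.88	0.12	+0.13
Faithfulness	0.80	0.20	0.95	0.05	+0.15
Context Recall	0.60	0.40	0.70	0.30	+0.10
Answer Relevancy	0.85	0.15	0.90	0.10	+0.05

Analysis:

From Version 1 to Version 2 every metric improved, which means every type of negative contribution decreased.
Faithfulness improved the most (negative contribution down from 0.20 to 0.05). This may indicate that we improved the LLM's prompt engineering or supplied clearer context, which suppressed hallucination.
Context Relevancy also improved markedly, indicating that the retriever is filtering out irrelevant information better.
Context Recall and Answer Relevancy improved as well, but by smaller margins. This may mean there is still room to improve information completeness or how directly the answer matches the query.

Laying the results out in a table like this makes the direction for improving the RAG system easy to see.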
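To see how the weighted aggregate behaves on the two versions in the table, the sketch below applies the formula with one illustrative set of weights (0.3 for Faithfulness and Context Recall, 0.2 for the other two). The weights and the helper name `overall_negative_contribution` are assumptions made for this example.

```python
from typing import Dict


def overall_negative_contribution(
    scores: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of (1 - score) over all metrics; weights must sum to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same metrics")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[m] * (1.0 - scores[m]) for m in scores)


# Illustrative weights (an assumption, not a recommendation).
weights = {
    "context_relevancy": 0.2,
    "faithfulness": 0.3,
    "context_recall": 0.3,
    "answer_relevancy": 0.2,
}
# Scores copied from the Version 1 and Version 2 columns of the table.
version_1 = {"context_relevancy": 0.75, "faithfulness": 0.80,
             "context_recall": 0.60, "answer_relevancy": 0.85}
version_2 = {"context_relevancy": 0.88, "faithfulness": 0.95,
             "context_recall": 0.70, "answer_relevancy": 0.90}

v1_total = overall_negative_contribution(version_1, weights)
v2_total = overall_negative_contribution(version_2, weights)
print(round(v1_total, 3), round(v2_total, 3))
```

With these weights the aggregate falls from roughly 0.26 to roughly 0.149. Context Recall remains the largest single term in Version 2 (0.3 × 0.30 = 0.09), which is consistent with the analysis that recall still has the most room to improve.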

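V. Practical Challenges and Advanced Strategies

RAGAS provides a powerful evaluation framework, but applying it in practice still raises several challenges:

Limitations of the LLM as a judge:
- Cost and speed: LLM API calls cost money, and evaluating a large number of samples can take a long time.
- Consistency and bias: different LLMs, and even the same model across runs, can give slightly different verdicts. The LLM may also be affected by biases in its training data.
- Handling complex instructions: the design of the evaluation prompts is critical, because the LLM has to understand and carry out complex judgments accurately.
- "Hallucinating judges": the LLM itself can hallucinate, which can make the evaluation results inaccurate.
Ways to make evaluation more robust:
- More samples: increase the number of evaluation samples to obtain statistically more meaningful results.
- CoT (Chain of Thought) prompting: ask the judge LLM to explain its reasoning before giving its final verdict. This improves the transparency and accuracy of the evaluation.
- Adversarial samples: build samples containing ambiguous, contradictory, or incomplete information to test how the RAG system behaves in extreme cases.
- Human-in-the-Loop (HITL): for critical or disputed samples, have human experts review the results to calibrate the accuracy of the LLM judge.
- Use a high-quality LLM: choose a stronger, more reliable LLM as the judge (for example GPT-4o or Claude 3 Opus).
How to use these negative-contribution metrics to iterate on and improve a RAG system:
- Low Context Relevancy:
  - Diagnosis: the retriever returns too much irrelevant or redundant information.
  - Improvement strategies:
    - Improve the retriever: use a more precise retrieval approach (for example semantic retrieval combined with BM25) and add a reranking model (for example a Sentence Transformers cross-encoder or Cohere Rerank) to filter and reorder the initial results.
    - Context compression: before passing the context to the LLM, compress it with an LLM or a dedicated model (for example LLM-based summarization) to remove redundant information.
    - Finer chunking: tune the document chunking strategy so that each chunk is more focused.
- Low Faithfulness:
  - Diagnosis: the LLM hallucinates when generating the answer, or cannot answer entirely from the context.
  - Improvement strategies:
    - Improve the LLM prompt: emphasize instructions such as "answer only from the provided context" and "do not make up information".
    - Fine-tune the LLM: if hallucination is severe, consider fine-tuning the LLM on domain data so that it prefers "I don't know" to fabrication.
    - Increase context capacity: if missing context is causing the LLM to guess, try a larger context window, or improve retrieval so that it supplies more complete information.
    - Confidence estimation: have the LLM estimate its confidence in the generated answer, and trigger further retrieval or warn the user when confidence is low.
- Low Context Recall:
  - Diagnosis: the retrieval system fails to find all the key information needed to answer the question.
  - Improvement strategies:
    - Improve indexing: make sure the knowledge base has full coverage, with suitable chunking and metadata management.
    - Widen retrieval: retrieve more documents in the retrieval stage, then use a reranker to select the best ones.
    - Combine retrieval methods: for example, run keyword retrieval (BM25) and semantic retrieval (vector database) together and merge the results (taking the union favors recall, taking the intersection favors precision).
    - Query expansion or rewriting: before retrieval, use an LLM to expand or rewrite the original query into several related queries to raise recall.
- Low Answer Relevancy:
  - Diagnosis: the generated answer does not address the user's question directly; it may be off topic or too general.
  - Improvement strategies:
    - Improve the generation prompt: explicitly instruct the LLM to "answer the question directly", "be concise", or "focus on the core information".
    - Answer rewriting or summarization: after the LLM produces a draft answer, use another LLM step to rewrite or summarize it so that it is more relevant to the original query.
    - Adjust the LLM temperature: lower the temperature parameter so that the LLM produces more deterministic, more direct answers with less digression.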
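One simple way to check the judge's run-to-run consistency is to repeat the same evaluation several times and look at the spread of each metric. The sketch below uses made-up scores to show the bookkeeping; `summarize_runs` is a hypothetical helper and the numbers are not measurements.

```python
import statistics
from typing import Dict, List


def summarize_runs(runs: List[float]) -> Dict[str, float]:
    """Mean and sample standard deviation of one metric across repeated evaluation runs."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {"mean": statistics.mean(runs), "stdev": statistics.stdev(runs)}


# Hypothetical faithfulness scores from five repeated evaluations of the same dataset.
faithfulness_runs = [0.80, 0.82, 0.78, 0.80, 0.80]
summary = summarize_runs(faithfulness_runs)
print(f"mean={summary['mean']:.3f}, stdev={summary['stdev']:.4f}")
```

If the standard deviation is comparable to the improvement you are trying to detect between two system versions, the difference should not be trusted without more samples or more runs.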
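The diagnosis list above can be turned into a small triage helper that flags the weakest metrics first. This is a rough sketch: the 0.8 threshold, the `PLAYBOOK` wording, and the function name `triage` are all assumptions chosen for illustration.

```python
from typing import Dict, List

# Hypothetical playbook condensed from the diagnosis list in this section.
PLAYBOOK: Dict[str, str] = {
    "context_relevancy": "retriever returns noise: add reranking, compress context, refine chunking",
    "faithfulness": "answer not grounded: tighten the generation prompt, improve context coverage",
    "context_recall": "key facts missing: widen retrieval, combine BM25 with vector search, rewrite queries",
    "answer_relevancy": "answer off topic: make the prompt more direct, lower the temperature",
}


def triage(scores: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Return remediation hints for metrics below the threshold, worst metric first."""
    low = sorted((m for m in scores if scores[m] < threshold), key=lambda m: scores[m])
    return [f"{m} ({scores[m]:.2f}): {PLAYBOOK[m]}" for m in low]


# Version 1 scores from the iteration table in section IV.
version_1 = {
    "context_relevancy": 0.75,
    "faithfulness": 0.80,
    "context_recall": 0.60,
    "answer_relevancy": 0.85,
}
hints = triage(version_1)
for hint in hints:
    print(hint)
```

For the Version 1 scores this flags Context Recall first and Context Relevancy second; Faithfulness sits exactly at the threshold and is not flagged because the comparison is strict.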

VI. Code Practice: Building a Quantifiable RAG Evaluation Workflow

We now combine the RAGAS metrics above into a complete evaluation workflow.

import os
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
)
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document
import pandas as pd

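# Make sure the OpenAI API key is set
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# 1. Generate mock data
# In a real project this data usually comes from your RAG system logs or a test set.
# ground_truth is the human-annotated gold answer; it is essential for Context Recall.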
data_samples = [
    {
        "query": "Who developed the theory of relativity?",
        "answer": "Albert Einstein developed the theory of relativity, which dramatically changed our understanding of space and time.",
        "contexts": [
            Document(page_content="Albert Einstein was a German-born theoretical physicist."),
            Document(page_content="He developed the theory of relativity, one of the two pillars of modern physics."),
            Document(page_content="Einstein is also known for his mass-energy equivalence formula E = mc²."),
            Document(page_content="He received the Nobel Prize in Physics in 1921 for his services to theoretical physics. He also enjoyed sailing and playing the violin.") # Last sentence is somewhat irrelevant
        ],
        "ground_truth": "Albert Einstein developed the theory of relativity."
    },
    {
        "query": "What are the main causes of climate change?",
        "answer": "The primary causes of climate change are human activities, especially the burning of fossil fuels, deforestation, and industrial processes, leading to increased greenhouse gas concentrations.",
        "contexts": [
            Document(page_content="The burning of fossil fuels (coal, oil, and natural gas) for energy production is the largest contributor to greenhouse gas emissions."),
            Document(page_content="Deforestation reduces the amount of carbon dioxide absorbed from the atmosphere."),
            Document(page_content="Industrial processes and agriculture also release significant amounts of greenhouse gases."),
            Document(page_content="Natural factors like volcanic eruptions and solar radiation changes can also influence climate, but their impact is generally smaller compared to human activities."), # Context includes some natural causes, which might not be "main" for the answer's focus
            Document(page_content="Climate change leads to rising sea levels and more extreme weather events.") # Irrelevant to "causes"
        ],
        "ground_truth": "Main causes of climate change include burning fossil fuels, deforestation, and industrial emissions."
    },
    {
        "query": "Explain the concept of 'black holes'.",
        "answer": "A black hole is a region of spacetime where gravity is so strong that nothing—no particles or even electromagnetic radiation such as light—can escape from it. It's formed from the remnants of a large star after its gravitational collapse.",
        "contexts": [
            Document(page_content="A black hole is a region of spacetime where gravity is so strong that nothing, not even light, can escape."),
            Document(page_content="The theory of general relativity predicts that a sufficiently compact mass can deform spacetime to form a black hole."),
            Document(page_content="They are believed to form when massive stars collapse at the end of their life cycle."),
            Document(page_content="The event horizon is the boundary beyond which no escape is possible. Stephen Hawking did significant work on black holes.") # Last sentence is related but not directly part of "concept explanation"
        ],
        "ground_truth": "A black hole is a region of spacetime with extremely strong gravity, preventing anything from escaping, formed from collapsed massive stars."
    },
    {
        "query": "What is the capital of Japan?",
        "answer": "Tokyo is the bustling capital city of Japan, known for its vibrant culture and technological advancements.",
        "contexts": [
            Document(page_content="Tokyo is the capital of Japan."),
            Document(page_content="It is the most populous metropolitan area in the world."),
            Document(page_content="Japan is an island nation in East Asia."),
            Document(page_content="The country is famous for its anime and sushi cuisine."),
            Document(page_content="Mount Fuji is Japan's highest mountain.") # Irrelevant info
        ],
        "ground_truth": "The capital of Japan is Tokyo."
    },
    {
        "query": "Who wrote 'Hamlet'?",
        "answer": "William Shakespeare, the renowned English playwright, penned the tragic play 'Hamlet'. He also wrote 'Romeo and Juliet'.",
        "contexts": [
            Document(page_content="William Shakespeare wrote 'Hamlet', one of his most famous tragedies."),
            Document(page_content="He is widely regarded as the greatest writer in the English language."),
            Document(page_content="'Romeo and Juliet' is another well-known play by Shakespeare."),
            Document(page_content="Shakespeare lived in the 16th and 17th centuries. He had three children.") # Irrelevant info
        ],
        "ground_truth": "William Shakespeare wrote 'Hamlet'."
    }
]

# Convert to the input format RAGAS expects
ragas_dataset_data = {
    "question": [s["query"] for s in data_samples],
    "answer": [s["answer"] for s in data_samples],
    "contexts": [[doc.page_content for doc in s["contexts"]] for s in data_samples],
    "ground_truths": [[s["ground_truth"]] for s in data_samples], # ground_truths expects a list of lists
}
ragas_dataset = Dataset.from_dict(ragas_dataset_data)

# 2. Initialize the LLM used for evaluation
# A cost-effective model is used for the demo; upgrade it for real workloads
llm_for_eval = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# llm_for_eval = ChatOpenAI(model="gpt-4", temperature=0) # For higher quality evaluation

# 3. Define all RAGAS evaluation metrics
all_metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
]

# 4. Run the evaluation
print("Starting RAGAS comprehensive evaluation...")
result = evaluate(
    dataset=ragas_dataset,
    metrics=all_metrics,
    llm=llm_for_eval,
)

print("\n--- RAGAS Comprehensive Evaluation Results ---")
print(result)

# 5. Parse the results and relate them to "negative contribution"
df_results = result.to_pandas()
print("\nDataFrame of all RAGAS results:")
print(df_results)

# Compute the negative contribution of each metric
df_results['faithfulness_neg_contrib'] = 1 - df_results['faithfulness']
df_results['answer_relevancy_neg_contrib'] = 1 - df_results['answer_relevancy']
df_results['context_relevancy_neg_contrib'] = 1 - df_results['context_relevancy']
df_results['context_recall_neg_contrib'] = 1 - df_results['context_recall']

print("\nDetailed Negative Contribution Analysis per Sample:")
print(df_results[[
    'question',
    'faithfulness', 'faithfulness_neg_contrib',
    'answer_relevancy', 'answer_relevancy_neg_contrib',
    'context_relevancy', 'context_relevancy_neg_contrib',
    'context_recall', 'context_recall_neg_contrib'
]])

# Further analysis: aggregate negative contribution
# Weights can be set as needed; here Faithfulness and Context Recall are weighted more heavily
weights = {
    'faithfulness_neg_contrib': 0.3,
    'answer_relevancy_neg_contrib': 0.2,
    'context_relevancy_neg_contrib': 0.2,
    'context_recall_neg_contrib': 0.3,
}

df_results['weighted_overall_neg_contrib'] = (
    df_results['faithfulness_neg_contrib'] * weights['faithfulness_neg_contrib'] +
    df_results['answer_relevancy_neg_contrib'] * weights['answer_relevancy_neg_contrib'] +
    df_results['context_relevancy_neg_contrib'] * weights['context_relevancy_neg_contrib'] +
    df_results['context_recall_neg_contrib'] * weights['context_recall_neg_contrib']
)

print("\nWeighted Overall Negative Contribution per Sample:")
print(df_results[['question', 'weighted_overall_neg_contrib']])

# 6. Extending RAGAS with custom evaluation logic (conceptual sketch)
# RAGAS lets you define your own metrics. This usually involves:
# - subclassing the ragas.metrics.base.Metric class
# - implementing its _score_batch or _score method
# - using a custom LLM prompt inside that method to carry out the specific judgment.
# For example, to evaluate context redundancy rather than just relevance, you could define a new metric
# that asks the LLM whether the context contains repeated information, or "distractor" content that
# looks related but does not help generate the answer.
#
# The base-class hooks and EvaluationMode member names differ between RAGAS versions,
# so treat the code below as pseudocode and check it against the version you have installed.

# from typing import List, Optional
# from ragas.metrics.base import EvaluationMode, Metric
#
# class CustomRedundancyMetric(Metric):
#     name: str = "custom_redundancy"
#     evaluation_mode: EvaluationMode = EvaluationMode.qc  # question + contexts (version-dependent name)
#     llm: ChatOpenAI
#
#     def __init__(self, llm: ChatOpenAI, name: str = "custom_redundancy", evaluation_mode: EvaluationMode = EvaluationMode.qc):
#         super().__init__(name=name, evaluation_mode=evaluation_mode)
#         self.llm = llm
#
#     def _score_batch(self, dataset: Dataset, callbacks: Optional[list] = None) -> List[float]:
#         # Implement your custom logic here using self.llm
#         # For each sample in the dataset, prompt the LLM to assess redundancy,
#         # then map its 'YES'/'NO' reply to 0/1 (or to a continuous score).
#         scores = []
#         for row in dataset:
#             question = row['question']
#             contexts = row['contexts']  # a list of strings
#             # Combine contexts for a single redundancy check (pairwise checks are another option)
#             full_context_str = "\n".join(contexts)
#
#             prompt = f"""Given the following question and context, identify if the context contains redundant information that is not necessary to answer the question, or if there are highly repetitive statements.
# Question: {question}
# Context: {full_context_str}
#
# Is there significant redundancy in the context? Respond with 'YES' or 'NO'. If YES, also briefly explain why.
# """
#             response = self.llm.invoke(prompt)
#             # Parse the reply; checking the prefix avoids matching "YES" inside the explanation
#             if response.content.strip().upper().startswith("YES"):
#                 scores.append(0.0)  # Redundancy exists, lower score
#             else:
#                 scores.append(1.0)  # No significant redundancy, higher score
#         return scores
#
# # Then you can add CustomRedundancyMetric(llm=llm_for_eval) to your metrics list.

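This workflow shows how to evaluate a RAG system comprehensively with RAGAS, and how quantifying the negative contribution of each metric helps diagnose problems and guide optimization.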

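VII. Outlook: Beyond RAGAS, Toward Finer-Grained Negative-Contribution Analysis

RAGAS is an excellent starting point, but RAG evaluation and negative-contribution analysis still have plenty of room to grow:

Fine-grained context analysis:
- Context ordering quality: does the order of the retrieved passages affect LLM performance? Is the best information always ranked first?
- Contradiction detection: if the retrieved context contains mutually contradictory information, how does the LLM handle it? This is a severe form of negative contribution.
- Information density and granularity: is the context too scattered or too dense? How do we find the granularity that avoids both information overload and information shortage?
Combining with explainable AI (XAI) techniques:
- Use XAI methods such as attention visualization or gradient analysis to understand which parts of the context the LLM actually relied on when generating the answer, and which parts it ignored or misused. This locates the source of a negative contribution more precisely.
Real-time evaluation and A/B testing:
- Integrate evaluation into the CI/CD pipeline for automated, real-time monitoring of RAG system performance.
- Use A/B testing in production to validate RAG optimization strategies (a new retriever, a different reranker, improved prompts) and measure their effect on user experience and business metrics directly.
Multimodal RAG evaluation:
- With the rise of multimodal LLMs, RAG systems may need to handle retrieval results in several modalities, such as images, audio, and video. Future evaluation frameworks will need to be extended to quantify the negative contribution of multimodal context.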
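As a rough idea of what a CI/CD evaluation gate could look like, the sketch below compares a candidate build's scores with a stored baseline and reports any metric that dropped by more than a tolerance. The tolerance, the candidate scores, and the name `find_regressions` are hypothetical; the baseline reuses the Version 2 scores from section IV.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Metrics whose candidate score fell below the baseline by more than the tolerance."""
    # A metric missing from the candidate is treated as 0.0 and therefore flagged.
    return [m for m in baseline if candidate.get(m, 0.0) < baseline[m] - tolerance]


baseline = {
    "context_relevancy": 0.88,
    "faithfulness": 0.95,
    "context_recall": 0.70,
    "answer_relevancy": 0.90,
}
# Hypothetical scores from evaluating a new build on the same test set.
candidate = {
    "context_relevancy": 0.87,
    "faithfulness": 0.96,
    "context_recall": 0.65,
    "answer_relevancy": 0.90,
}

regressions = find_regressions(baseline, candidate)
exit_code = 1 if regressions else 0
print("regressions:", regressions, "exit code:", exit_code)
```

In a pipeline, a nonzero exit code would fail the job (for example via `sys.exit(exit_code)`). The tolerance should be set with the judge's run-to-run variance in mind, otherwise the gate will fail on noise.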

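VIII. Conclusion

Quantifying the negative contribution of retrieval results to answer generation is not merely an academic exercise; it is the bridge that takes a RAG system from passively accepting whatever is retrieved to actively optimizing it. With tools such as RAGAS we can systematically identify and measure noise, hallucination, missing information, and off-topic answers in the retrieval and generation process. These quantified metrics give us clear diagnostic evidence and a clear direction for improvement, so that iterating on a RAG system becomes a data-driven practice rather than blind trial and error. Understanding and using these evaluation frameworks effectively is a core skill for building robust, efficient, and reliable RAG systems.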
