RAG 训练集中长文档切片导致召回下降的工程化修复与优化实践

大家好，今天我们来深入探讨一个在构建基于检索增强生成（RAG）系统时经常遇到的问题：长文档切片导致的召回率下降，以及如何通过工程化的手段进行修复与优化。

RAG系统，简单来说，就是先通过检索步骤从文档库中找到相关信息，再利用大型语言模型（LLM）生成最终答案。文档切片是 RAG 系统中至关重要的一环，它直接影响着检索的准确性和效率。将长文档切分成更小的块（chunks）可以提高检索速度，并减少 LLM 处理单个文档的压力。然而，不合理的切片策略往往会导致关键信息被分割到不同的 chunk 中，从而降低召回率，最终影响 RAG 系统的整体性能。

问题分析：切片策略与召回率的关系

为什么不合理的切片会导致召回率下降？主要有以下几个原因：

语义割裂： 最直接的原因是切片破坏了文档的语义连贯性。如果一个关键的句子或段落被分割到两个不同的 chunk 中，那么查询时很可能无法同时检索到这两个 chunk，导致重要信息丢失。例如，一个描述“使用新型材料 X 可以有效提升电池续航”的句子，如果被切分到两个 chunk 中，查询 “提升电池续航的方法” 时，可能就无法同时检索到这两个 chunk。
上下文信息丢失： LLM 在生成答案时，依赖于一定的上下文信息。如果 chunk 太小，缺乏足够的上下文，LLM 可能无法准确理解 chunk 的含义，从而生成错误的答案。
索引偏差： 基于文本的索引方法 (例如，TF-IDF 或 embedding based) 通常依赖于词频或向量相似度。如果重要的上下文信息被分散到不同的 chunk 中，索引可能无法准确捕捉到关键信息之间的关联性。
查询与文档结构不匹配： 用户的查询通常具有一定的结构和上下文。如果文档的切片方式与用户的查询结构不匹配，就可能导致检索失败。例如，如果文档是按照章节划分的，而用户的查询跨越了多个章节，那么基于章节的切片方式可能就无法很好地满足查询需求。

工程化修复与优化策略

针对以上问题，我们可以从以下几个方面入手，进行工程化的修复与优化：

1. 优化切片算法

1.1 固定大小切片 (Fixed-size Chunking)：

这是最简单的切片方法，将文档按照固定大小（例如，512 个 token）进行切分。虽然简单易行，但容易导致语义割裂。

def fixed_size_chunking(text, chunk_size=512):
  """
  将文本按照固定大小切分。

  Args:
    text: 输入文本。
    chunk_size: 每个 chunk 的大小（token数）。

  Returns:
    一个包含所有 chunk 的列表。
  """
  tokens = text.split()  # 简单地按空格分词，实际应用中应使用更高级的分词器
  chunks = []
  for i in range(0, len(tokens), chunk_size):
    chunk = " ".join(tokens[i:i + chunk_size])
    chunks.append(chunk)
  return chunks

# 示例
text = "This is a long document with multiple sentences. We want to split it into chunks. Each chunk should have a fixed size."
chunks = fixed_size_chunking(text, chunk_size=10)
print(chunks)

1.2 基于分隔符切片 (Separator-based Chunking)：

这种方法利用文档中的分隔符（例如，句号、换行符、标题等）进行切分，尽可能保持语义的完整性。

import re

def separator_based_chunking(text, separators=["nn", "n", ". ", "!", "?"]):
  """
  基于分隔符切分文本。

  Args:
    text: 输入文本。
    separators: 分隔符列表，按照优先级排序。

  Returns:
    一个包含所有 chunk 的列表。
  """
  chunks = [text]
  for sep in separators:
    new_chunks = []
    for chunk in chunks:
      split_chunks = re.split(sep, chunk)
      new_chunks.extend(split_chunks)
    chunks = new_chunks
  return chunks

# 示例
text = "This is the first sentence. This is the second sentence!nThis is a new paragraph.nnThis is another new paragraph."
chunks = separator_based_chunking(text)
print(chunks)

1.3 递归切片 (Recursive Chunking)：

递归切片结合了分隔符切片和固定大小切片的优点。首先尝试使用高优先级的分隔符切分，如果切分后的 chunk 大小仍然超过阈值，则继续使用低优先级的分隔符切分，直到所有 chunk 的大小都满足要求。

import re

def recursive_chunking(text, separators=["nn", "n", ". ", "!", "?"], chunk_size=512):
  """
  递归切分文本。

  Args:
    text: 输入文本。
    separators: 分隔符列表，按照优先级排序。
    chunk_size: 每个 chunk 的大小（token数）。

  Returns:
    一个包含所有 chunk 的列表。
  """
  chunks = [text]
  final_chunks = []
  for chunk in chunks:
    if len(chunk.split()) <= chunk_size:
      final_chunks.append(chunk)
    else:
      new_chunks = []
      for sep in separators:
        split_chunks = re.split(sep, chunk)
        new_chunks.extend(split_chunks)
      if all(len(c.split()) > chunk_size for c in new_chunks): # 如果所有子chunk还是太大，则放弃分割
          final_chunks.append(chunk) # 保留原始chunk
      else:
          final_chunks.extend(new_chunks)

  # 再次检查，对大于chunk_size的chunk进行强制分割
  result_chunks = []
  for chunk in final_chunks:
      if len(chunk.split()) > chunk_size:
          result_chunks.extend(fixed_size_chunking(chunk, chunk_size))
      else:
          result_chunks.append(chunk)
  return result_chunks

# 示例
text = "This is the first sentence. This is the second sentence!nThis is a new paragraph.nnThis is another new paragraph. This paragraph is very long and contains many words. This paragraph is very long and contains many words. This paragraph is very long and contains many words. This paragraph is very long and contains many words. This paragraph is very long and contains many words. This paragraph is very long and contains many words. This paragraph is very long and contains many words."
chunks = recursive_chunking(text, chunk_size=50)
print(chunks)

1.4 基于语义的切片 (Semantic Chunking)：

这种方法利用自然语言处理技术（例如，句子嵌入、主题模型等）来识别文档中的语义边界，并根据语义边界进行切分。这种方法可以最大程度地保持语义的完整性，但计算成本较高。

# 以下代码需要安装 sentence-transformers 库： pip install sentence-transformers
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def semantic_chunking(text, chunk_size=512, threshold=0.7):
  """
  基于语义相似度切分文本。

  Args:
    text: 输入文本。
    chunk_size: 每个 chunk 的大小（token数）。
    threshold: 相似度阈值，低于该阈值则进行切分。

  Returns:
    一个包含所有 chunk 的列表。
  """
  sentences = re.split(r'(?<!w.w.)(?<![A-Z][a-z].)(?<=.|?)s', text)  # 更精确的句子分割
  model = SentenceTransformer('all-MiniLM-L6-v2')
  embeddings = model.encode(sentences)

  chunks = []
  current_chunk = ""
  current_embedding = np.zeros(embeddings.shape[1])
  current_length = 0

  for i, sentence in enumerate(sentences):
      sentence_embedding = embeddings[i]
      sentence_length = len(sentence.split())

      if current_length + sentence_length <= chunk_size:
          current_chunk += sentence + " "
          current_embedding += sentence_embedding
          current_length += sentence_length
      else:
          if current_chunk: # 避免第一个句子就超过chunk size
              chunks.append(current_chunk.strip())
              current_chunk = sentence + " "
              current_embedding = sentence_embedding
              current_length = sentence_length
          else: # 强制分割超长句子
              sub_sentences = fixed_size_chunking(sentence, chunk_size=chunk_size)
              chunks.extend(sub_sentences)
              current_chunk = ""
              current_embedding = np.zeros(embeddings.shape[1])
              current_length = 0
              continue

  if current_chunk:
      chunks.append(current_chunk.strip())

  return chunks

# 示例
text = "This is the first sentence. This is the second sentence. This sentence is related to the previous two. However, this sentence is completely different. It talks about another topic."
chunks = semantic_chunking(text, chunk_size=100, threshold=0.7)
print(chunks)

选择合适的切片算法需要根据具体的应用场景和文档特点进行权衡。 一般来说，递归切片和语义切片可以获得更好的效果，但计算成本也更高。

2. 引入 Chunk Overlap

为了解决语义割裂的问题，可以引入 chunk overlap 机制。即在切分文档时，让相邻的 chunk 之间有一定的重叠，这样可以确保关键信息在至少一个 chunk 中完整存在。

def chunk_with_overlap(text, chunk_size=512, overlap=128):
  """
  带重叠的切片。

  Args:
    text: 输入文本。
    chunk_size: 每个 chunk 的大小（token数）。
    overlap: 重叠的 token 数。

  Returns:
    一个包含所有 chunk 的列表。
  """
  tokens = text.split()
  chunks = []
  for i in range(0, len(tokens), chunk_size - overlap):
    chunk = " ".join(tokens[i:i + chunk_size])
    chunks.append(chunk)
  return chunks

# 示例
text = "This is a long document with multiple sentences. We want to split it into chunks. Each chunk should have some overlap."
chunks = chunk_with_overlap(text, chunk_size=10, overlap=3)
print(chunks)

Chunk overlap 的大小需要根据具体的文档内容和查询方式进行调整。一般来说，overlap 越大，召回率越高，但同时也会增加索引的大小和检索的时间。

3. 优化索引策略

即使有了良好的切片策略，也需要配合合适的索引策略才能充分发挥其优势。

3.1 元数据索引 (Metadata Indexing)：

为每个 chunk 添加元数据，例如，文档标题、章节标题、关键词等。在查询时，可以利用元数据进行过滤，提高检索的准确性。

def add_metadata(chunks, document_title, chapter_title=None):
  """
  为 chunk 添加元数据。

  Args:
    chunks: chunk 列表。
    document_title: 文档标题。
    chapter_title: 章节标题（可选）。

  Returns:
    一个包含带有元数据的 chunk 的列表。
  """
  metadata_chunks = []
  for chunk in chunks:
    metadata = {
        "document_title": document_title,
        "chapter_title": chapter_title,
        "content": chunk
    }
    metadata_chunks.append(metadata)
  return metadata_chunks

# 示例
text = "This is a section about the introduction. This is the content of the introduction."
chunks = separator_based_chunking(text)
metadata_chunks = add_metadata(chunks, document_title="My Document", chapter_title="Introduction")
print(metadata_chunks)

3.2 分层索引 (Hierarchical Indexing)：

建立多层索引，例如，先按照文档进行索引，再按照章节进行索引，最后按照 chunk 进行索引。在查询时，可以先从高层索引开始，逐步缩小检索范围，提高检索效率。

3.3 基于 Embedding 的索引 (Embedding-based Indexing)：

将每个 chunk 转换为向量表示（embedding），并使用向量数据库（例如，Faiss、Annoy）进行索引。在查询时，将查询语句也转换为向量，然后在向量数据库中查找最相似的 chunk。

# 需要安装 faiss-cpu 库： pip install faiss-cpu
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def embedding_based_indexing(chunks):
  """
  基于 embedding 的索引。

  Args:
    chunks: chunk 列表。

  Returns:
    一个 faiss 索引和一个 chunk 列表。
  """
  model = SentenceTransformer('all-MiniLM-L6-v2')
  embeddings = model.encode(chunks)
  dimension = embeddings.shape[1]
  index = faiss.IndexFlatL2(dimension)  # 使用 L2 距离
  index.add(embeddings)
  return index, chunks

def search_embedding_index(index, chunks, query, top_k=5):
  """
  在 embedding 索引中搜索。

  Args:
    index: faiss 索引。
    chunks: chunk 列表。
    query: 查询语句。
    top_k: 返回最相似的 chunk 的数量。

  Returns:
    一个包含最相似的 chunk 的列表。
  """
  model = SentenceTransformer('all-MiniLM-L6-v2')
  query_embedding = model.encode([query])
  D, I = index.search(query_embedding, top_k)  # D 是距离，I 是索引
  results = [chunks[i] for i in I[0]]
  return results

# 示例
text = "This is the first sentence. This is the second sentence. This sentence is related to the previous two. However, this sentence is completely different. It talks about another topic."
chunks = semantic_chunking(text, chunk_size=100, threshold=0.7)
index, indexed_chunks = embedding_based_indexing(chunks)
query = "sentences about the same topic"
results = search_embedding_index(index, indexed_chunks, query)
print(results)

4. 查询优化

用户的查询方式多种多样，需要针对不同的查询方式进行优化。

4.1 查询扩展 (Query Expansion)：

利用同义词、近义词等扩展查询语句，增加检索的覆盖面。

from nltk.corpus import wordnet
import nltk

# 下载 wordnet 数据集，如果尚未下载
try:
    wordnet.synsets('example')
except LookupError:
    nltk.download('wordnet')

def get_synonyms(word):
    """
    获取单词的同义词。
    """
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
    return synonyms

def query_expansion(query):
    """
    扩展查询语句。
    """
    words = query.split()
    expanded_words = []
    for word in words:
        synonyms = get_synonyms(word)
        if synonyms:
            expanded_words.extend(synonyms)
        else:
            expanded_words.append(word)
    return " ".join(expanded_words)

# 示例
query = "improve battery life"
expanded_query = query_expansion(query)
print(expanded_query) # 输出可能包含 "improve better battery life lifespan" 等词语

4.2 查询重写 (Query Rewriting)：

利用 LLM 将查询语句重写为更清晰、更准确的表达，提高检索的准确性。

# 需要 OpenAI API 密钥
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"

def query_rewriting(query):
    """
    利用 LLM 重写查询语句。
    """
    prompt = f"Rewrite the following query to be more specific and clear:n{query}nRewritten query:"
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        max_tokens=50,
        n=1,
        stop=None,
        temperature=0.5,
    )
    rewritten_query = response.choices[0].text.strip()
    return rewritten_query

# 示例
query = "battery life"
rewritten_query = query_rewriting(query)
print(rewritten_query) # 输出可能包含 "How to improve battery life on a smartphone?" 等更具体的问题

4.3 上下文感知查询 (Context-aware Querying)：

如果用户进行了多次查询，可以将之前的查询历史作为上下文信息，用于优化当前的查询。

5. 评估与迭代

上述方法只是提供了一些通用的指导原则，实际应用中需要根据具体的数据和场景进行调整和优化。

5.1 评估指标：

召回率 (Recall)： 衡量系统能够检索到所有相关文档的能力。
准确率 (Precision)： 衡量系统检索到的文档中，有多少是真正相关的。
F1 值 (F1-score)： 召回率和准确率的调和平均值，综合衡量系统的性能。
MRR (Mean Reciprocal Rank)： 对于每个查询，返回结果中第一个相关文档的排名的倒数的平均值。

5.2 迭代优化：

收集用户反馈，了解用户对检索结果的满意度。
分析检索失败的案例，找出问题所在。
针对问题，调整切片策略、索引策略或查询优化策略。
重新评估系统的性能，并进行迭代优化。

案例分析：一个实际场景的优化过程

假设我们正在构建一个 RAG 系统，用于检索公司内部的知识库文档。知识库包含大量的长文档，例如，产品手册、技术文档、FAQ 等。

初始状态：

使用固定大小切片，chunk 大小为 512 个 token。
使用基于 TF-IDF 的索引。
用户反馈召回率较低，很多相关文档无法检索到。

优化过程：

问题分析： 发现固定大小切片导致语义割裂，很多关键信息被分割到不同的 chunk 中。
改进方案：
- 将切片算法改为递归切片，并引入 chunk overlap。
- 使用 SentenceTransformer 将 chunk 转换为向量表示，并使用 Faiss 进行索引。
- 引入查询扩展，利用同义词扩展查询语句。
评估结果：
- 召回率显著提升。
- 准确率略有下降，但 F1 值整体提升。
- 用户满意度明显提高。
进一步优化：
- 针对特定的查询场景，调整 SentenceTransformer 的模型。
- 利用 LLM 对查询语句进行重写，提高查询的准确性。

通过以上优化，RAG 系统的性能得到了显著提升，能够更好地满足用户的需求。

优化策略的一些想法

通过上述的工程化修复和优化策略，我们可以有效地提升 RAG 系统在处理长文档时的召回率。关键在于选择合适的切片算法，优化索引策略，并针对不同的查询方式进行优化。记住，评估和迭代是持续改进 RAG 系统的关键。