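Dear listeners, programming enthusiasts, and AI agent developers:

Hello everyone!

Today we take on a challenging, cutting-edge topic: how to build a LangGraph agent with "self-healing" capability. In the real world, AI agents have to interact with many external APIs and tools. These dependencies are not always reliable: they can fail at any time because of network problems, authentication failures, malformed data, or service outages. Traditional error handling, such as simple retries or raising the error directly, is often not enough.

Our goal is to go beyond traditional error handling. When a tool API fails, the agent should not only detect the problem but also, like an experienced engineer, analyze the cause and generate an alternative plan or new logic to get around it. That is what we mean by "self-healing", and it makes the agent considerably more resilient, autonomous, and practical.

This talk starts from LangGraph basics and moves on to the architecture, core mechanisms, and detailed code for self-healing, and then to future enhancements and extensions. I hope it gives you new ideas and practical experience for building smarter, more robust AI agent systems.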

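I. LangGraph basics: a toolkit for building intelligent agents

Before diving into self-healing, let us quickly recap LangGraph's core concepts. LangGraph is an extension of the LangChain framework that lets us define complex agent behavior as a directed graph. Unlike a DAG, the graph may contain cycles, which is what makes agent loops possible. It models the agent as a state machine: each node is an operation (for example, calling the LLM to reason, or executing a tool), and the edges define the state-transition logic.

1. Core components of LangGraph:

- State: LangGraph maintains a central state object that is shared and updated throughout the execution of the graph. It usually holds the message history, tool outputs, and intermediate results. We typically define it with a TypedDict so that it is clear and typed.
- Nodes: nodes are the basic execution units of the graph. A node can be an agent node that calls the LLM, a tool node that runs a specific external tool, or an ordinary Python function.
- Edges: edges connect nodes and define the paths along which the state moves.
  - Direct edges: lead from one node straight to another and are always followed.
  - Conditional edges: choose the next node based on the current state or the previous node's output. They are the key to complex decision logic.
- Entry and end points: graph.set_entry_point() defines the node where the graph starts, and the END node marks where execution terminates.

2. How LangGraph works:

When the graph starts, it executes from the entry point. After a node finishes, LangGraph uses the defined edges and conditional logic to decide which node to activate next. This continues until the END node is reached or the state machine gets stuck in a state from which it cannot continue. This graph structure suits multi-step agents with decision-making ability, because it expresses the agent's think-act-observe loop clearly.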
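The control flow described above can be illustrated without LangGraph at all. The following is a minimal plain-Python sketch, not LangGraph's API: the `run` function, the `END` sentinel, and the `increment`/`route` names are all hypothetical and exist only to show how a node returns a partial state update and a conditional edge picks the next node.

```python
from typing import Callable, Dict

State = Dict[str, int]
END = "__end__"  # sentinel name chosen for this sketch only


def increment(state: State) -> State:
    """Node: return a partial update that bumps the counter."""
    return {"count": state["count"] + 1}


def route(state: State) -> str:
    """Conditional edge: loop back until the counter reaches 3."""
    return "increment" if state["count"] < 3 else END


def run(nodes: Dict[str, Callable[[State], State]],
        routers: Dict[str, Callable[[State], str]],
        entry: str, state: State, max_steps: int = 100) -> State:
    """Execute nodes starting at `entry` until a router returns END."""
    current = entry
    for _ in range(max_steps):
        if current == END:
            return state
        state = {**state, **nodes[current](state)}  # merge the node's update
        current = routers[current](state)           # choose the next node
    raise RuntimeError("step limit reached without hitting END")


final_state = run({"increment": increment}, {"increment": route},
                  entry="increment", state={"count": 0})
print(final_state)
```

The step limit plays the role of a safety net against graphs that never reach the end node, which matters once cycles are allowed.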

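II. Core concept: what is "self-healing"?

In traditional software engineering, "self-healing" usually means that a system, on detecting a component failure, automatically takes steps to restore its function, such as restarting a service, rolling back a configuration, or switching to a standby resource.

In the context of AI agents, "self-healing" takes on a deeper meaning:

- Beyond plain retries: the agent does not merely retry the failed operation; it tries to understand the context and cause of the failure.
- Intelligent diagnosis: the LLM's reasoning ability is used to "diagnose" the tool failure and identify the likely problem type (for example, an authentication error, a network problem, or a data-format mismatch).
- Generating alternative logic: based on the diagnosis, the LLM produces a new action plan or alternative strategy, much as a senior developer would. This may include:
  - Calling another tool with similar functionality.
  - Modifying the original input parameters to match what the tool expects or to sidestep a known problem.
  - Breaking a complex task into smaller, feasible subtasks.
  - (In more advanced scenarios) generating temporary code that runs in a sandbox. This talk focuses on "writing logic" by adjusting strategy with the existing tools.
- Dynamically adjusting the execution path: because LangGraph is a state machine, it can change the subsequent path based on the agent's reasoning and decisions, and so execute the generated alternative logic.

This capability turns the agent from a passive executor into an active problem solver that stays resilient and keeps the task moving in the face of uncertainty and external failures.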

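III. Challenges and strategies for building a "self-healing" LangGraph

Self-healing is not easy to implement; it involves design decisions at several levels.

1. Challenge 1: how do we detect a tool failure?

- Strategy: Python exception handling. This is the most direct and effective approach. In the LangGraph tool node, wrap the call to the external tool in a try-except block and catch whatever may be raised (requests.exceptions.ConnectionError, json.JSONDecodeError, KeyError, ValueError, and so on).
- Strategy: error structures returned by the tool. Some APIs do not raise on a logical failure; they return a JSON structure containing an error code or message. In that case we must inspect the tool's output for such structures and treat them as failures.
- Strategy: timeouts. For tool calls that hang, set a timeout (the timeout parameter in the requests library); when it expires, an exception is raised as well.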
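The first two strategies can be combined in one wrapper. The sketch below is illustrative only: `call_with_detection` and `flaky_lookup` are hypothetical names, and the convention that a failing tool returns a dict with an `"error"` key is an assumption, since real APIs signal logical errors in many different shapes. A timeout surfaces as an exception, so it is covered by the same `except` branch.

```python
from typing import Any, Callable, Dict, Tuple


def call_with_detection(tool: Callable[..., Any], **kwargs: Any) -> Tuple[bool, Any]:
    """Return (ok, payload): the result on success, an error description on failure."""
    try:
        result = tool(**kwargs)
    except Exception as exc:  # strategy 1: the tool raised (timeouts land here too)
        return False, f"{type(exc).__name__}: {exc}"
    # strategy 2: the call "succeeded" but the payload is error-shaped (assumed convention)
    if isinstance(result, dict) and "error" in result:
        return False, f"ToolReportedError: {result['error']}"
    return True, result


def flaky_lookup(city: str) -> Dict[str, Any]:
    """Hypothetical tool that fails in two different ways."""
    if city == "nowhere":
        return {"error": "unknown city"}
    if city == "slow":
        raise TimeoutError("request timed out")
    return {"city": city, "temp_c": 21}


print(call_with_detection(flaky_lookup, city="Paris"))
print(call_with_detection(flaky_lookup, city="nowhere"))
print(call_with_detection(flaky_lookup, city="slow"))
```

Normalizing both kinds of failure into one return shape means the rest of the graph only has to check a single flag.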

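2. Challenge 2: how do we capture the failure information?

Once a failure is detected, we need to record as much relevant information as possible so that the LLM can diagnose it effectively.

- Strategy: extend the LangGraph state. Add a field to the State (for example, error) that stores the context of the failure.
- What to capture:
  - Name of the failed tool: which tool went wrong.
  - Original input: the arguments passed to the tool.
  - Error type and message: the type and details of the Python exception (e.__class__.__name__, str(e)).
  - Stack trace: the full stack trace (traceback.format_exc()) is valuable for diagnosing complex problems. The LLM may not "read" it directly, but it can be part of the context.
  - Attempt count: how many times a tool or a repair has been attempted, to prevent infinite loops.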
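A small helper can gather the fields listed above into one record. This is a sketch under assumptions: the function name `build_error_record` and the `attempts` key are hypothetical, and the other key names simply mirror the ones used later in this talk.

```python
import traceback
from typing import Any, Dict


def build_error_record(tool_name: str, tool_input: Dict[str, Any],
                       exc: BaseException, attempts: int) -> Dict[str, Any]:
    """Collect the failure context that the diagnosing LLM will need."""
    return {
        "tool_name": tool_name,
        "tool_input": tool_input,
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        # Format the traceback from the exception object itself, so the helper
        # also works when called outside the except block.
        "traceback_str": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "attempts": attempts,
    }


try:
    int("not-a-number")
except ValueError as err:
    record = build_error_record("parse_number", {"text": "not-a-number"}, err, attempts=1)

print(record["error_type"], "|", record["error_message"])
```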

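3. Challenge 3: how do we get the LLM to understand the failure and generate alternative logic?

This is the core, and the hardest part, of the self-healing mechanism. The LLM has to switch from "executor" to "diagnostic expert" and "strategy planner".

- Strategy: careful prompt engineering.
  - Role: tell the LLM explicitly that it is a "fault-diagnosis expert" whose task is to analyze tool failures and propose repairs.
  - Full context: include the current LangGraph State (all message history and earlier tool outputs) and the detailed error information (tool name, input, error type, error message, stack trace) in the prompt.
  - Explicit output format: require the LLM to output the repair plan in a structured format such as JSON so that the agent can parse and execute it. This greatly simplifies the automation that follows.
  - Allowed repair types: list in the prompt the repair actions the LLM may choose, for example:
    - call_alternative_tool: call another tool.
    - retry_with_modified_input: modify the original tool's input and retry.
    - breakdown_problem: break the problem down and ask the agent to re-plan.
    - report_failure_to_user: report to the user if no repair is possible.
- The LLM as "strategy planner": the LLM can combine the error information with its general knowledge and its understanding of what the tools do to propose a reasonable alternative. For example:
  - If WeatherAPI fails authentication, the LLM might suggest "try PublicWeatherTool, or search for 'free weather API'".
  - If ProductSearchAPI fails because a particular parameter value is invalid, the LLM might suggest "change that parameter to a more general value, or look the product up with WebSearchTool".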
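Because the plan comes back as free text, it is worth validating before acting on it. The following sketch assumes the four action names listed above; the required fields per action and the function name `parse_healing_plan` are hypothetical choices for illustration, not a fixed schema.

```python
import json
from typing import Any, Dict

# Assumed minimal schema: the fields each action needs before it can be executed.
REQUIRED_FIELDS = {
    "call_alternative_tool": {"tool_name", "tool_args"},
    "retry_with_modified_input": {"tool_args"},
    "breakdown_problem": {"sub_tasks"},
    "report_failure_to_user": set(),
}


def _give_up(reason: str) -> Dict[str, Any]:
    """Fallback plan used whenever the LLM output cannot be trusted."""
    return {"action": "report_failure_to_user", "reason": reason}


def parse_healing_plan(raw: str) -> Dict[str, Any]:
    """Parse the LLM's reply and fall back to a safe plan if it is malformed."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return _give_up(f"invalid JSON: {exc.msg}")
    if not isinstance(plan, dict) or plan.get("action") not in REQUIRED_FIELDS:
        return _give_up("unknown or missing action")
    missing = REQUIRED_FIELDS[plan["action"]] - set(plan)
    if missing:
        return _give_up(f"missing fields: {sorted(missing)}")
    return plan


good = parse_healing_plan(
    '{"action": "call_alternative_tool", "tool_name": "WebSearchTool", '
    '"tool_args": {"query": "free weather API"}}'
)
bad = parse_healing_plan("Sure! Here is my plan...")
incomplete = parse_healing_plan('{"action": "call_alternative_tool"}')
print(good["action"], bad["action"], incomplete["reason"])
```

Falling back to a "report to user" plan keeps a malformed reply from turning into a second, harder-to-diagnose failure.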

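4. Challenge 4: how do we execute the alternative logic?

Once the LLM has produced a repair plan, LangGraph has to put it into practice.

- Strategy: a dedicated "healing execution" node. Create an execute_healing_plan node that parses the JSON plan generated by the LLM.
- Dynamic tool calls: if the plan is call_alternative_tool, the execute_healing_plan node must be able to look up the named tool dynamically and call it with the arguments supplied by the LLM.
- State modification and redirection: if the plan is retry_with_modified_input, the execute_healing_plan node modifies the LangGraph State, and a conditional edge then redirects the flow back to the original agent_node so that the agent retries with the new input.
- Conditional edges and routing: using LangGraph's conditional edges, we choose the next path according to the outcome of execute_healing_plan (success, failure, or further reasoning by the agent required).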
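The dispatch logic of such a node can be sketched in plain Python. Everything here is hypothetical: the two weather tools, the `TOOLS` registry, and the `"next"` key that stands in for the routing decision a conditional edge would make.

```python
from typing import Any, Callable, Dict


def primary_weather(city: str) -> str:
    """Hypothetical tool that is currently down."""
    raise ConnectionError("primary service is down")


def public_weather(city: str) -> str:
    """Hypothetical fallback tool."""
    return f"{city}: 21C (public source)"


TOOLS: Dict[str, Callable[..., str]] = {
    "primary_weather": primary_weather,
    "public_weather": public_weather,
}


def execute_plan(plan: Dict[str, Any], failed_tool: str) -> Dict[str, Any]:
    """Run a healing plan and say where the graph should go next."""
    action = plan.get("action")
    if action == "call_alternative_tool":
        name = plan.get("tool_name", "")
    elif action == "retry_with_modified_input":
        name = failed_tool
    else:  # anything else ends the healing loop with a report
        return {"next": "report", "output": plan.get("reason", "no usable plan")}
    tool = TOOLS.get(name)
    if tool is None:
        return {"next": "handle_error", "error": f"unknown tool: {name}"}
    try:
        return {"next": "agent", "output": tool(**plan.get("tool_args", {}))}
    except Exception as exc:  # the repair itself failed: go around again
        return {"next": "handle_error", "error": f"{type(exc).__name__}: {exc}"}


alt = execute_plan({"action": "call_alternative_tool", "tool_name": "public_weather",
                    "tool_args": {"city": "Paris"}}, failed_tool="primary_weather")
retry = execute_plan({"action": "retry_with_modified_input",
                      "tool_args": {"city": "Paris"}}, failed_tool="primary_weather")
print(alt)
print(retry)
```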

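IV. Architecture: a "self-healing" LangGraph

We now combine the strategies above into a concrete LangGraph architecture.

1. LangGraph State definition:

We need a State object that can carry all the necessary information.

| Field | Type | Description |
| --- | --- | --- |
| `messages` | `List[BaseMessage]` | The full conversation history between the agent, the user, and the tools. |
| `tool_output` | `Union[str, dict, None]` | Output of the most recent tool call. When the tool fails, this field may contain error information or be empty. |
| `error` | `Union[Dict[str, Any], None]` | Detailed error information when a tool fails, including `tool_name`, `tool_input`, `error_type`, `error_message`, `traceback_str`, and so on. `None` if there is no error. |
| `healing_plan` | `Union[Dict[str, Any], None]` | The strategy or action plan generated by the LLM to repair the current error, stored as structured JSON. |
| `healing_attempts_count` | `int` | Number of repair attempts made for the current error, to prevent infinite loops. |
| `original_human_query` | `str` | The user's original query; useful when a repair strategy needs to start over or report to the user. |
| `last_agent_decision` | `Union[Dict[str, Any], None]` | The agent's most recent LLM decision, in particular the tool and arguments it tried to call, kept as context for the repair. |
| `available_tools` | `List[Callable]` | The tools available to the agent and the healing logic, for dynamic lookup and invocation. (In practice this is usually a globally registered tool list.) |
| `max_healing_attempts_per_error` | `int` | Maximum number of repair attempts for a single error. |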

class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], lambda x, y: x + y] # accumulated messages
    tool_output: Union[str, dict, None] # result of the tool execution
    error: Union[Dict[str, Any], None] # error information
    healing_plan: Union[Dict[str, Any], None] # healing plan generated by the LLM
    healing_attempts_count: int # repair attempts for the current error
    original_human_query: str # the user's original query
    last_agent_decision: Union[Dict[str, Any], None] # the agent's last decision (tool name, arguments)
    available_tools: List[Callable] # available tools (passed in at initialization)
    max_healing_attempts_per_error: int # maximum number of repair attempts

2. Core node design:

| Node name | Responsibility |
Self-Healing LangGraph: Generating Alternative Logic on Tool Failure

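Introduction: the resilience challenge for AI agents and the road ahead

Colleagues, ladies and gentlemen:

Welcome to today's technical talk. We will focus on building smarter, more robust AI agent systems, and in particular on giving them "self-healing" capability in the face of external uncertainty.

As AI technology advances rapidly, large language models (LLMs) have become the core driver of intelligent agents. These agents are no longer just tools for information retrieval or text generation; they are designed to carry out multi-step tasks, interacting with external tools (APIs, databases, file systems, and so on) to perceive their environment, plan actions, and reach a goal. LangGraph, a component of the LangChain ecosystem, offers an intuitive and powerful way to orchestrate these behaviors by modeling them as a state machine over a directed graph that may contain cycles.

The real world, however, is messier than we tend to expect. The tool APIs an agent depends on are not always stable. Network fluctuations, overloaded services, expired credentials, mismatched data formats, and even unexpected changes in API behavior can all make a tool call fail. Traditional error handling, such as retrying a few times or reporting the error straight to the user, often interrupts the task, degrades the user experience, and can have serious consequences in critical business scenarios.

Our deep challenge is this: how do we go beyond passive error handling so that, when a tool API goes down, the agent can diagnose the problem as an experienced engineer would and "write" alternative logic or a new strategy to avoid or resolve it, keeping the task running? This is the core idea of the "self-healing" LangGraph we discuss today.

In this talk we will:

- Review LangGraph basics: its core components and how it works.
- Examine the idea of "self-healing" and what it means specifically for AI agents.
- Design a LangGraph architecture that detects, diagnoses, and repairs tool failures.
- Walk through a detailed code implementation that turns the theory into a runnable agent.
- Look at future enhancements and extensions that could further improve the agent's resilience and autonomy.

I hope this talk helps you master the key techniques for building smarter, more reliable AI agents and gives you a solid foundation for future AI system development.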

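II. LangGraph core concepts: orchestrating agent behavior as a graph

LangGraph abstracts agent behavior as a graph of nodes and edges. This model is a good fit for decision loops, multi-step reasoning, and tool use.

2.1 State management

At the heart of LangGraph is a shared, mutable state. All nodes cooperate by reading from or updating this state. We usually define the agent state with a TypedDict so that it is clear and typed.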

from typing import TypedDict, Annotated, List, Union, Callable, Dict, Any
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END
import random
import json
import traceback
import functools
import os

# For demonstration: fall back to a mock LLM when OPENAI_API_KEY is not set.
if os.getenv("OPENAI_API_KEY"):
    llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
else:
    print("WARNING: OPENAI_API_KEY not set. Using a mock LLM for demonstration.")

    def _after(text, marker, default):
        """Return the rest of the line that follows `marker`, or `default` if it is absent."""
        if marker not in text:
            return default
        lines = text.split(marker, 1)[1].splitlines()
        value = lines[0].strip() if lines else ""
        return value or default

    class MockChatOpenAI:
        def __init__(self, model, temperature):
            self.model = model
            self.temperature = temperature

        def invoke(self, messages, **kwargs):
            last_message = messages[-1]
            if not isinstance(last_message, HumanMessage):
                return AIMessage(content="Mock AI says: Hello!")
            content = last_message.content.lower()
            call_id = f"mock_call_{random.randint(1000, 9999)}"  # tool calls need an id
            # Healing-plan prompts are checked first because they also mention tool names.
            if "healing plan" in content or "修复计划" in content:
                tool_input = _after(content, "failed tool input:", "the failed request")
                if "connectionerror" in content:  # network failure in this demo: retry
                    plan = {"action": "retry_with_modified_input",
                            "modification": "Adding a retry tag to input",
                            "original_tool_name": _after(content, "failed tool name:", "buggy_api_tool"),
                            "tool_args": {"input_data": f"{tool_input} (retried)"}}
                elif "valueerror" in content:  # authentication failure in this demo
                    plan = {"action": "call_alternative_tool", "tool_name": "search_tool",
                            "tool_args": {"query": f"public API alternative for {tool_input}"}}
                else:  # generic error
                    plan = {"action": "call_alternative_tool", "tool_name": "search_tool",
                            "tool_args": {"query": f"general solution for {tool_input}"}}
                return AIMessage(content=json.dumps(plan))
            if "alternative tool" in content:  # result of a healing step: just answer
                return AIMessage(content=f"Mock AI response to: {last_message.content}")
            if "buggy" in content:
                return AIMessage(content="Attempting buggy tool.", tool_calls=[
                    {"name": "buggy_api_tool", "args": {"input_data": "mock input for " + content}, "id": call_id}])
            if "search" in content:
                return AIMessage(content="Okay, I'll try searching.", tool_calls=[
                    {"name": "search_tool", "args": {"query": "mock search query for " + content}, "id": call_id}])
            return AIMessage(content=f"Mock AI response to: {last_message.content}")

        def bind_tools(self, tools_list):
            # Simplified mock: a real model would prepare for tool calling here.
            return self

    llm = MockChatOpenAI(model="gpt-4o", temperature=0.7)

class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], lambda x, y: x + y] # accumulated messages
    tool_output: Union[str, dict, None] # result of the tool execution
    error: Union[Dict[str, Any], None] # error information
    healing_plan: Union[Dict[str, Any], None] # healing plan generated by the LLM
    healing_attempts_count: int # repair attempts for the current error
    original_human_query: str # the user's original query
    last_agent_decision: Union[Dict[str, Any], None] # the agent's last decision (tool name, arguments)
    available_tools: List[Callable] # available tools (passed in at initialization)
    max_healing_attempts_per_error: int # maximum number of repair attempts

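2.2 Nodes

Nodes are the "steps" the agent executes. In LangGraph, a node can be:

- An LLM call: the agent reasons, plans, or generates text.
- A tool call: the agent interacts with the outside world.
- A Python function: any custom logic.

2.3 Edges

Edges define the transition logic between nodes.

- Direct edge: after a node finishes, execution moves unconditionally to the next node.
- Conditional edge: the next node is chosen dynamically from the previous node's output or the current state. This is the key to complex decision flows.

2.4 Agent execution flow

A typical LangGraph agent runs as follows:

1. The user provides input (HumanMessage).
2. agent_node (the LLM) receives the input, reasons, and may decide to call a tool.
3. If it decides to call a tool, the tool-call request is sent to tool_node.
4. tool_node executes the tool.
5. The output of tool_node (ToolMessage) is returned to agent_node, and the agent continues reasoning or produces a final response based on the tool result.
6. Steps 2-5 repeat until the agent produces a final response or decides to end the task.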

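III. The deep challenge: an agent's "self-healing" capability

Self-healing, as the name suggests, means that a system can automatically detect and diagnose a component failure and act to restore its function. This is especially important for AI agents, because they are usually goal-directed and multi-step, and a failure at one step can break the whole task chain.

3.1 Levels and goals of self-healing

- Detection: recognize that a tool API call has failed.
- Diagnosis: understand the type and likely cause of the failure. Is it a network problem, an authentication failure, a bad parameter, or an unavailable service?
- Decision: plan a repair strategy based on the diagnosis. Retry, switch to a backup tool, modify the parameters, or ask a human for help?
- Execution: carry out the chosen strategy, for example by calling another tool or by changing the agent's internal state to trigger different behavior.
- Learning (advanced): learn from failures and repair attempts to improve future repair strategies.

This talk focuses on the four core stages: detection, diagnosis, decision, and execution.

3.2 Self-healing versus traditional error handling

| Aspect | Traditional error handling | Self-healing agent |
| --- | --- | --- |
| Error response | Log, raise an exception, or return an error message. | Try to understand the error and actively look for a solution. |
| Course of action | Preset retry logic, or abort. | Dynamically generate a new action plan or logic; highly adaptable. |
| Intelligence | Low; based on hard-coded rules. | Higher; uses the LLM's reasoning for diagnosis and decisions. |
| Resilience | Limited; easily interrupted by unanticipated errors. | Stronger; can keep the task moving when some components fail. |
| User experience | The user may have to intervene manually or start over. | Largely seamless; the user may not notice that an underlying failure occurred and was repaired. |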

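IV. Architecture and implementation of the "self-healing" LangGraph

Now let us build the "self-healing" agent step by step.

4.1 Defining the tools: normal operation and simulated failure

We start with two tools: a simulated search tool that works normally, and a simulated API tool that may fail at random.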

@tool
def search_tool(query: str) -> str:
    """
    A simulated search tool for looking up information on the internet.
    Args:
        query (str): The search query string.
    Returns:
        str: A summary of the search results.
    """
    print(f"\n--- Executing search_tool with query: '{query}' ---")
    lowered = query.lower()  # compare in lower case so the phrases below can match
    if "public api alternative" in lowered:
        return f"Found a public API alternative for '{query.replace('public API alternative for ', '')}': MockPublicAPI.com"
    elif "general solution" in lowered:
        return f"Found general solutions for '{query.replace('general solution for ', '')}': Try restarting service, checking logs, or updating dependencies."
    return f"Search results for '{query}': Information about {query} found. (simulated)"

@tool
def buggy_api_tool(input_data: str) -> str:
    """
    A simulated API tool that may fail at random.
    Args:
        input_data (str): The input data passed to the API.
    Returns:
        str: The API's processing result.
    Raises:
        Exception: Randomly simulates a network error or an authentication error.
    """
    print(f"\n--- Executing buggy_api_tool with input: '{input_data}' ---")
    # "success" appears three times to raise the odds of a successful call.
    failure_type = random.choice(["network_error", "auth_error", "success", "success", "success"])

    if failure_type == "network_error":
        print("!!! Simulating Network Error for buggy_api_tool !!!")
        raise ConnectionError("Simulated: cannot connect to the API service; check the network or service status.")
    elif failure_type == "auth_error":
        print("!!! Simulating Authentication Error for buggy_api_tool !!!")
        raise ValueError("Simulated: API authentication failed; check the API key or credentials.")
    else:
        print(f"Buggy API Tool Success: Processed '{input_data}'")
        return f"Processed '{input_data}' successfully via Buggy API."

# Bind the tools to the LLM
tools = [search_tool, buggy_api_tool]
llm_with_tools = llm.bind_tools(tools)

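4.2 Node function implementations

We define the following core nodes:

- call_agent: the agent (LLM) reasons and decides, and may generate a tool call.
- call_tool: executes the tool safely, catching all exceptions and updating the state.
- handle_error: records the failure in the state and increments the repair-attempt counter.
- propose_healing_plan: the LLM acts as a fault-diagnosis expert and generates a healing plan from the error information.
- execute_healing_plan: parses the LLM-generated healing plan and tries to execute it.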

# Node functions
def call_agent(state: AgentState) -> AgentState:
    """
    Agent node: the LLM reasons over the current message history or requests a tool call.
    """
    print("\n--- Entering call_agent node ---")
    # Work on a copy so the hint below does not mutate the shared state in place.
    messages = list(state["messages"])

    # If a retry plan is pending, use the agent's previous decision as the basis for this turn.
    if state["healing_plan"] and state["healing_plan"].get("action") == "retry_with_modified_input":
        last_decision = state["last_agent_decision"]
        if last_decision:
            tool_name = last_decision.get("tool_name")
            tool_args = state["healing_plan"].get("tool_args", {})
            # For simplicity we let the LLM think again with the modified input as a hint,
            # rather than replaying the original tool call, which would need more plumbing.
            messages.append(HumanMessage(content=f"Previous attempt to use '{tool_name}' failed. I have modified the input and am ready to try again or find an alternative. Original query: {state['original_human_query']}. Modified input suggestion: {tool_args.get('input_data')}"))

    response = llm_with_tools.invoke(messages)

    new_messages = [response]
    new_state = {"messages": new_messages}

    # Record the LLM's latest decision, in particular any tool call.
    if response.tool_calls:
        new_state["last_agent_decision"] = {
            "tool_name": response.tool_calls[0]["name"],
            "tool_args": response.tool_calls[0]["args"]
        }
    else:
        new_state["last_agent_decision"] = None

    # The agent is reasoning afresh, so clear any pending healing state.
    new_state["healing_plan"] = None
    new_state["error"] = None # the agent produced a response, so drop the previous error
    new_state["healing_attempts_count"] = 0 # reset the repair-attempt counter

    return new_state

def call_tool(state: AgentState) -> AgentState:
    """
    Tool node: executes the tool call requested by the agent (LLM) and catches exceptions.
    """
    print("\n--- Entering call_tool node ---")
    messages = state["messages"]
    last_message = messages[-1]

    new_state = {"tool_output": None, "error": None} # cleared before every tool call

    if not getattr(last_message, "tool_calls", None):
        print("Error: call_tool node entered without tool_calls in the last message.")
        new_state["error"] = {
            "tool_name": "N/A",
            "tool_input": "N/A",
            "error_type": "NoToolCallsError",
            "error_message": "Agent did not request a tool call.",
            "traceback_str": ""
        }
        return new_state

    tool_call = last_message.tool_calls[0]
    tool_name = tool_call["name"]
    tool_args = tool_call["args"]

    try:
        # Look up and execute the tool
        selected_tool = next(t for t in state["available_tools"] if t.name == tool_name)
        tool_output = selected_tool.invoke(tool_args)

        new_messages = [ToolMessage(tool_output, tool_call_id=tool_call["id"])]
        new_state["messages"] = new_messages
        new_state["tool_output"] = tool_output
        new_state["error"] = None # clear any previous error
        new_state["healing_attempts_count"] = 0 # reset the counter on success
        print(f"--- Tool '{tool_name}' executed successfully. ---")

    except Exception as e:
        error_message = f"Tool '{tool_name}' failed: {e}"
        print(f"!!! {error_message} !!!")

        # Capture the detailed error information
        error_info = {
            "tool_name": tool_name,
            "tool_input": tool_args,
            "error_type": e.__class__.__name__,
            "error_message": str(e),
            "traceback_str": traceback.format_exc()
        }

        new_messages = [ToolMessage(f"Error executing {tool_name}: {str(e)}", tool_call_id=tool_call["id"])]
        new_state["messages"] = new_messages
        new_state["tool_output"] = None
        new_state["error"] = error_info

        # The graph routes a failed tool call straight to propose_healing_plan without
        # passing through handle_error, so the attempt is counted here.
        new_state["healing_attempts_count"] = state["healing_attempts_count"] + 1

    return new_state

def handle_error(state: AgentState) -> AgentState:
    """
    Error-handling node: logs the error and increments the repair-attempt counter.
    """
    print("\n--- Entering handle_error node ---")
    current_error = state["error"]
    healing_attempts = state["healing_attempts_count"] + 1

    print(f"Error detected for tool '{current_error.get('tool_name')}' (Attempt {healing_attempts})")
    print(f"Error message: {current_error.get('error_message')}")

    new_state = {
        "healing_attempts_count": healing_attempts,
        # Keep the error field current in case later error-handling logic needs it
        "error": current_error
    }
    return new_state

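def propose_healing_plan(state: AgentState) -> AgentState:
    """
    Healing-strategy node: the LLM acts as a fault-diagnosis expert and generates a healing plan from the error information.
    """
    print("\n--- Entering propose_healing_plan node ---")
    error_info = state["error"] or {}
    current_messages = state["messages"]

    # Enforce the attempt cap in code instead of relying on the LLM to notice it.
    if state["healing_attempts_count"] > state["max_healing_attempts_per_error"]:
        return {"healing_plan": {"action": "report_unfixable", "reason": "Maximum number of healing attempts reached."}}

    prompt_template = """You are a senior AI agent fault-diagnosis expert and strategy planner.
Your task is to analyze why the agent's tool call failed and propose a workable healing plan.
Output the healing plan strictly as JSON, with no other text or explanation.

Conversation history so far:
{chat_history}

Failed tool name: {last_tool_name}
Failed tool input: {last_tool_input}
Error type: {error_type}
Error message: {error_message}
(Note: the full stack trace has been captured but is omitted here to keep the prompt concise; assume it exists.)

Propose a healing plan based on this information.
The healing plan must use one of the following JSON formats:

1. Call an alternative tool
{{
    "action": "call_alternative_tool",
    "tool_name": "<name of the alternative tool; must be one of the available tools>",
    "tool_args": {{ "<parameter name>": "<parameter value>", ... }},
    "reason": "<why you chose this alternative tool and these arguments>"
}}
For example, if a search API fails you can try another search API. If a specific API requires authentication and fails, you can try a public, general-purpose search tool that needs no authentication to obtain related information.

2. Modify the input and retry the original tool
{{
    "action": "retry_with_modified_input",
    "original_tool_name": "<name of the tool that failed>",
    "tool_args": {{ "<parameter name of the original tool>": "<modified parameter value>", ... }},
    "modification": "<what you changed in the input, and why>",
    "reason": "<why you chose to retry with modified input>"
}}
For example, if the API failed because a parameter value was badly formatted, you can correct that parameter. If the input was too long, you can truncate or simplify it.

3. Break down the problem (not executed directly for now; it exercises your planning ability)
{{
    "action": "breakdown_problem",
    "sub_tasks": ["<subtask 1>", "<subtask 2>", ...],
    "reason": "<why you chose to break down the problem>"
}}
This option needs further orchestration by the agent and is not executed directly for now, but you may propose it.

4. Report that the failure cannot be fixed (ends the current healing loop)
{{
    "action": "report_unfixable",
    "reason": "<why it cannot be fixed, and what the user should do>"
}}

Keep in mind:
- Your goal is to let the agent complete the original task, even if the path changes.
- Prefer the most direct repair that is most likely to succeed.
- If repeated repairs keep failing, report_unfixable may be the right final answer.
- The available tools are: {available_tool_names}.

Generate the healing plan JSON:
"""

    tool_names = [t.name for t in state["available_tools"]]

    # Use the HumanMessage and AIMessage contents from the history as context for the LLM
    chat_history_str = "\n".join([
        f"{m.type.capitalize()}: {m.content}"
        for m in current_messages
        if isinstance(m, (HumanMessage, AIMessage))
    ])

    prompt_content = prompt_template.format(
        chat_history=chat_history_str,
        last_tool_name=error_info.get("tool_name", "Unknown"),
        last_tool_input=error_info.get("tool_input", "{}"),
        error_type=error_info.get("error_type", "UnknownError"),
        error_message=error_info.get("error_message", "No specific error message."),
        available_tool_names=", ".join(tool_names)
    )

    print("--- Proposing healing plan with LLM ---")
    print(f"Prompt to LLM:\n{prompt_content[:500]}...\n") # print part of the prompt

    healing_plan_str = ""
    try:
        response = llm.invoke([HumanMessage(content=prompt_content)])
        healing_plan_str = response.content.strip()
        if healing_plan_str.startswith("`"):
            # Tolerate a reply wrapped in a Markdown code fence.
            healing_plan_str = healing_plan_str.strip("`").strip()
            if healing_plan_str.startswith("json"):
                healing_plan_str = healing_plan_str[len("json"):]
        healing_plan = json.loads(healing_plan_str)
        print(f"--- LLM proposed healing plan: {json.dumps(healing_plan, indent=2)} ---")

        return {"healing_plan": healing_plan}
    except json.JSONDecodeError as e:
        print(f"!!! LLM response was not valid JSON for healing plan: {healing_plan_str}. Error: {e} !!!")
        # If the LLM fails to produce valid JSON, treat the error as unfixable
        return {"healing_plan": {"action": "report_unfixable", "reason": f"LLM failed to generate valid JSON for healing plan: {e}"}}
    except Exception as e:
        print(f"!!! Error invoking LLM for healing plan: {e} !!!")
        return {"healing_plan": {"action": "report_unfixable", "reason": f"Failed to get healing plan from LLM: {e}"}}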

def execute_healing_plan(state: AgentState) -> AgentState:
    """
    Healing-execution node: parses the LLM-generated healing plan and tries to execute it.
    """
    print("\n--- Entering execute_healing_plan node ---")
    healing_plan = state["healing_plan"]

    if not healing_plan:
        print("Error: execute_healing_plan node entered without a healing plan.")
        return {"healing_plan": {"action": "report_unfixable", "reason": "No healing plan provided."}}

    action = healing_plan.get("action")
    new_state = {}
    # Note: "messages" uses an additive reducer, so each branch returns only the NEW messages.

    if action == "call_alternative_tool":
        tool_name = healing_plan.get("tool_name")
        tool_args = healing_plan.get("tool_args", {})
        reason = healing_plan.get("reason", "No reason provided.")

        print(f"--- Healing Plan: Calling alternative tool '{tool_name}' with args {tool_args}. Reason: {reason} ---")

        try:
            selected_tool = next(t for t in state["available_tools"] if t.name == tool_name)
            tool_output = selected_tool.invoke(tool_args)

            # Add the alternative tool's output to the history and clear the error so the agent can continue.
            # A HumanMessage is used because no AI tool call exists for a ToolMessage to answer.
            new_state["messages"] = [
                HumanMessage(content=f"[Healing] Alternative tool '{tool_name}' returned: {tool_output}")
            ]
            new_state["tool_output"] = tool_output
            new_state["error"] = None
            new_state["healing_plan"] = None
            new_state["healing_attempts_count"] = 0
            print(f"--- Alternative tool '{tool_name}' executed successfully as part of healing. ---")

        except Exception as e:
            error_message = f"!!! Alternative tool '{tool_name}' failed during healing: {e} !!!"
            print(error_message)
            # The alternative tool failed too: record a new error so the healing loop runs again
            error_info = {
                "tool_name": tool_name,
                "tool_input": tool_args,
                "error_type": e.__class__.__name__,
                "error_message": str(e),
                "traceback_str": traceback.format_exc()
            }
            new_state["messages"] = [
                HumanMessage(content=f"[Healing] Error executing alternative tool {tool_name} during healing: {str(e)}")
            ]
            new_state["tool_output"] = None
            new_state["error"] = error_info
            # healing_attempts_count is incremented by the handle_error node
            new_state["healing_plan"] = None # clear the plan so a new one can be generated

    elif action == "retry_with_modified_input":
        original_tool_name = healing_plan.get("original_tool_name")
        tool_args = healing_plan.get("tool_args", {})
        modification = healing_plan.get("modification", "No specific modification.")
        reason = healing_plan.get("reason", "No reason provided.")

        print(f"--- Healing Plan: Retrying original tool '{original_tool_name}' with modified input: {tool_args}. Modification: {modification}. Reason: {reason} ---")

        # Simulate the agent's original tool call with the new arguments,
        # so that call_tool can execute it directly.
        mock_tool_call_message = AIMessage(
            content=f"As per healing plan, I will retry '{original_tool_name}' with modified input: {tool_args}",
            tool_calls=[{"name": original_tool_name, "args": tool_args, "id": f"retry_call_{random.randint(1000,9999)}"}]
        )

        new_state["messages"] = [mock_tool_call_message]
        new_state["last_agent_decision"] = { # record the new tool-call intent
            "tool_name": original_tool_name,
            "tool_args": tool_args
        }
        new_state["error"] = None # clear the error before retrying
        new_state["healing_plan"] = None # clear the plan

    elif action == "report_unfixable":
        reason = healing_plan.get("reason", "No specific reason.")
        print(f"--- Healing Plan: Reporting unfixable. Reason: {reason} ---")
        # Produce a final error report for the user
        last_error = (state["error"] or {}).get("error_message", "unknown error")
        new_state["messages"] = [
            AIMessage(content=f"I could not automatically repair the earlier tool failure: {last_error}. Reason: {reason}. Please check and assist.")
        ]
        new_state["error"] = None # clear the error
        new_state["healing_attempts_count"] = 0 # reset the counter
        # The plan is deliberately kept so the router after this node can end the run.

    elif action == "breakdown_problem":
        sub_tasks = healing_plan.get("sub_tasks", [])
        reason = healing_plan.get("reason", "No reason provided.")
        print(f"--- Healing Plan: Breaking down problem into sub-tasks: {sub_tasks}. Reason: {reason} ---")
        # Let the agent think again with new guidance
        new_state["messages"] = [
            HumanMessage(content=f"The previous task hit an obstacle. The problem has been broken down into these subtasks: {', '.join(sub_tasks)}. Please re-plan and solve them.")
        ]
        new_state["error"] = None # clear the error
        new_state["healing_plan"] = None # clear the plan
        new_state["healing_attempts_count"] = 0 # reset the counter

    else:
        # Unknown action: treat it as unfixable rather than returning an empty update
        new_state["healing_plan"] = {"action": "report_unfixable", "reason": f"Unknown healing action: {action}"}

    return new_state


4.3 Routing logic

Conditional edges are the key to self-healing. We decide where to go next from the `error` field and `healing_attempts_count` in the state.

```python
# Routing functions
def route_agent_or_tool(state: AgentState) -> str:
    """
    Decide from the agent's latest message whether to run a tool or finish.
    """
    last_message = state["messages"][-1]
    if last_message.tool_calls:
        return "call_tool"
    return "end" # no tool call: treat the message as the final response

def route_healing_or_continue(state: AgentState) -> str:
    """
    After a tool call: enter the healing flow if there is an error, otherwise return to the agent.
    """
    if state["error"]:
        if state["healing_attempts_count"] < state["max_healing_attempts_per_error"]:
            print(f"\n--- Routing to healing: Error detected, attempt {state['healing_attempts_count']} of {state['max_healing_attempts_per_error']} ---")
        else:
            print(f"\n--- Max healing attempts reached ({state['max_healing_attempts_per_error']}) ---")
        # Both cases go through propose_healing_plan, which is responsible for
        # producing a report_unfixable plan once the limit is reached.
        return "propose_healing_plan"
    # No error: the tool succeeded, so hand control back to the agent
    print("\n--- Routing to agent: No error, tool executed successfully or healing completed ---")
    return "call_agent"

def route_after_healing_execution(state: AgentState) -> str:
    """
    After the healing plan has been executed, decide the next step from the result.
    """
    healing_plan = state["healing_plan"]
    last_message = state["messages"][-1]
    if healing_plan and healing_plan.get("action") == "report_unfixable":
        print("\n--- Routing to end: Healing plan reported unfixable ---")
        return END # reported as unfixable: the agent stops
    elif state["error"]:
        print("\n--- Routing to handle_error: Healing plan execution failed, another error detected ---")
        return "handle_error" # the repair itself raised a new error
    # execute_healing_plan clears the plan, so the remaining cases are inferred
    # from the message it appended.
    elif isinstance(last_message, AIMessage) and last_message.tool_calls:
        print("\n--- Routing to call_tool: Retrying original tool with modified input ---")
        return "call_tool" # a retry was queued as a synthetic tool call
    elif isinstance(last_message, AIMessage):
        print("\n--- Routing to end: Final report already produced ---")
        return END # the unfixable report has been written for the user
    else:
        print("\n--- Routing to call_agent: Healing successful or problem broken down ---")
        return "call_agent" # let the agent continue reasoning
```
4.4 Building the LangGraph

We use StateGraph to build the agent.

def build_self_healing_graph(tools_list: List[Callable], max_healing_attempts: int = 3):
    """
    Build the self-healing LangGraph.
    Note: tools_list and max_healing_attempts are not read here; the nodes take the
    tool list and the attempt limit from the initial state.
    """
    workflow = StateGraph(AgentState)

    # 1. Add the nodes
    workflow.add_node("call_agent", call_agent)
    workflow.add_node("call_tool", call_tool)
    workflow.add_node("handle_error", handle_error)
    workflow.add_node("propose_healing_plan", propose_healing_plan)
    workflow.add_node("execute_healing_plan", execute_healing_plan)

    # 2. Set the entry point
    workflow.set_entry_point("call_agent")

    # 3. Add the edges
    # After the agent reasons, either call a tool or finish
    workflow.add_conditional_edges(
        "call_agent",
        route_agent_or_tool,
        {
            "call_tool": "call_tool",
            "end": END
        }
    )

    # After a tool runs, either continue with the agent or enter the healing flow
    workflow.add_conditional_edges(
        "call_tool",
        route_healing_or_continue,
        {
            "propose_healing_plan": "propose_healing_plan", # an error occurred
            "call_agent": "call_agent", # no error: the agent continues
        }
    )

    # After error handling, always generate a healing strategy
    workflow.add_edge("handle_error", "propose_healing_plan")

    # After a healing strategy is generated, always execute it
    workflow.add_edge("propose_healing_plan", "execute_healing_plan")

    # After the healing plan is executed, route on the result
    workflow.add_conditional_edges(
        "execute_healing_plan",
        route_after_healing_execution,
        {
            END: END, # reported as unfixable: stop
            "handle_error": "handle_error", # the repair produced another error
            "call_tool": "call_tool", # the plan is to retry a tool
            "call_agent": "call_agent" # repair succeeded or problem broken down: back to the agent
        }
    )

    # When the maximum number of attempts is reached, the flow still passes through
    # propose_healing_plan, which is expected to produce a report_unfixable plan.

    app = workflow.compile()
    return app

4.5 Running the agent

We can now initialize and run the self-healing agent.

# Build the agent
self_healing_agent_app = build_self_healing_graph(tools_list=tools, max_healing_attempts=3)

# Helper that runs the agent
async def run_agent_with_healing(query: str):
    initial_state = AgentState(
        messages=[HumanMessage(content=query)],
        tool_output=None,
        error=None,
        healing_plan=None,
        healing_attempts_count=0,
        original_human_query=query,
        last_agent_decision=None,
        available_tools=tools,
        max_healing_attempts_per_error=3 # at most 3 repair attempts per error
    )

    print(f"\n===== Starting Agent for query: '{query}' =====")
    final_messages = list(initial_state["messages"])
    # astream() is the async counterpart of stream(). By default each chunk maps
    # a node name to the state update that node returned.
    async for chunk in self_healing_agent_app.astream(initial_state):
        print(chunk)
        for update in chunk.values():
            # Collect the new messages for the final summary
            if isinstance(update, dict) and update.get("messages"):
                final_messages.extend(update["messages"])

    print("\n===== Agent Execution Finished =====")
    for msg in final_messages:
        if isinstance(msg, HumanMessage):
            print(f"Human: {msg.content}")
        elif isinstance(msg, AIMessage):
            if msg.tool_calls:
                print(f"AI (Tool Call): {msg.tool_calls}")
            else:
                print(f"AI: {msg.content}")
        elif isinstance(msg, ToolMessage):
            print(f"Tool Output: {msg.content}")
    print("====================================")
    return final_messages[-1].content if final_messages else "No response."

# Test cases
async def test_healing_agent():
    print("--- Test Case 1: Buggy API fails, then self-heals with alternative tool ---")
    await run_agent_with_healing("Please use the buggy API with input 'data for problematic service' to get some info.")

    print("\n--- Test Case 2: Buggy API fails, then self-heals by retrying with modified input (simulated) ---")
    await run_agent_with_healing("I need to process 'critical data segment' using the buggy API. If it fails due to network, just retry with 'critical data segment (retry)'")

    print("\n--- Test Case 3: Buggy API fails multiple times, eventually reports unfixable ---")
    # Simulating repeated failure requires buggy_api_tool to fail several times in a row.
    # Because the tool is random, a given run may not produce consecutive failures;
    # the unfixable report only appears once max_healing_attempts_per_error is reached.
    await run_agent_with_healing("Process 'highly sensitive data' using the buggy API. This is important but can fail.")

    print("\n--- Test Case 4: Normal operation (search tool) ---")
    await run_agent_with_healing("What is the capital of France?")

# Run the tests
import asyncio
if __name__ == "__main__":
    asyncio.run(test_healing_agent())

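In the code above, buggy_api_tool is designed to fail with some probability. When it fails:

1. The call_tool node catches the exception and stores the error information in state["error"].
2. The route_healing_or_continue function detects state["error"] and routes the flow to propose_healing_plan.
3. The LLM in the propose_healing_plan node receives the detailed error context and, following the prompt, generates a healing plan in JSON. For example, if it identifies an authentication error it may suggest calling search_tool to find an alternative; for a network error it may suggest retry_with_modified_input.
4. The execute_healing_plan node parses the JSON plan.
   - If the plan is call_alternative_tool, it looks up and calls the named alternative tool and returns the result to the agent.
   - If the plan is retry_with_modified_input, it appends a message to the agent state that simulates a new tool-call request, and the flow is then routed back to the call_tool node for the retry.
   - If the plan is report_unfixable, the agent generates a message telling the user the problem cannot be repaired, and stops.
5. route_after_healing_execution uses the outcome of execute_healing_plan to decide whether to return to call_agent so the agent can continue, to enter handle_error again (if the repair itself failed), or to end.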

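V. Enhancements and extensions: building a stronger self-healing agent

The implementation so far is a working starting point for a self-healing agent, but there are many directions in which it can be improved and extended.

5.1 More sophisticated repair strategies

- Multi-round repair: let the agent try several different strategies on the same error. For example, first modify the input, then try an alternative tool, then break the problem down.
- Backtracking and state recovery: in some complex failure scenarios the agent may need to roll back to an earlier successful state and re-plan from there. This requires finer-grained state snapshots and management.
- User involvement: when the agent really cannot repair the problem on its own, it can ask the user for help and provide a detailed diagnostic report and suggestions.
- Learning failure patterns: a long-running agent can collect a history of tool failures and successful repairs, and use it to train a smaller model or refine the prompt so that common problems are diagnosed and repaired faster and more accurately.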
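The multi-round idea can be expressed as a simple escalation ladder. This is only a sketch: the ordering in `STRATEGY_LADDER` follows the example in the first bullet, and the function name `next_strategy` is hypothetical. A real agent might instead pass the list of already-tried strategies to the LLM and let it choose.

```python
from typing import List, Optional

# Assumed escalation order, cheapest repair first.
STRATEGY_LADDER = [
    "retry_with_modified_input",
    "call_alternative_tool",
    "breakdown_problem",
]


def next_strategy(tried: List[str]) -> Optional[str]:
    """Return the first strategy not yet tried for this error, or None if exhausted."""
    for strategy in STRATEGY_LADDER:
        if strategy not in tried:
            return strategy
    return None  # ladder exhausted: time to involve the user


tried: List[str] = []
while (strategy := next_strategy(tried)) is not None:
    print("trying", strategy)
    tried.append(strategy)  # in a real agent, only after the attempt has failed
print("escalating to user after", len(tried), "attempts")
```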

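5.2 Security and sandboxing

- A sandbox for LLM-generated code: if the LLM is allowed to generate arbitrary Python code as repair logic, that code must run in a strongly isolated sandbox to prevent malicious code injection or damage to the system. This usually involves technologies such as Docker containers or gVisor.
- Allowlists and denylists for tool calls: strictly limit which tools the agent may call, so that it cannot invoke unauthorized or potentially risky tools.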
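An allowlist check can sit between plan generation and plan execution. The sketch below assumes the plan shape used earlier in this talk (`tool_name` for alternative tools, `original_tool_name` for retries); the function name `is_plan_permitted` and the contents of `ALLOWED_TOOLS` are illustrative.

```python
from typing import Any, Dict, FrozenSet

# Assumed allowlist for this demo; a real deployment would load it from configuration.
ALLOWED_TOOLS: FrozenSet[str] = frozenset({"search_tool", "buggy_api_tool"})


def is_plan_permitted(plan: Dict[str, Any],
                      allowed: FrozenSet[str] = ALLOWED_TOOLS) -> bool:
    """Reject any healing plan that would call a tool outside the allowlist."""
    action = plan.get("action")
    if action == "call_alternative_tool":
        return plan.get("tool_name") in allowed
    if action == "retry_with_modified_input":
        return plan.get("original_tool_name") in allowed
    return True  # the remaining actions do not call a tool directly


print(is_plan_permitted({"action": "call_alternative_tool", "tool_name": "search_tool"}))
print(is_plan_permitted({"action": "call_alternative_tool", "tool_name": "shell_exec"}))
```

Checking the plan rather than the prompt matters here, because the LLM can name a tool that was never offered to it.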

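5.3 Observability and debugging

- Detailed logging: log every node entry and exit, state change, tool call, error, and repair attempt. This is essential for debugging and for understanding the agent's behavior.
- Tracing (LangSmith): use tracing tools such as LangSmith from the LangChain ecosystem to visualize the agent's execution path, including transitions between nodes, LLM inputs and outputs, and tool results. This helps a great deal in understanding a complex self-healing flow.
- Dashboards and alerts: build a monitoring dashboard that shows the agent's health, error rate, repair success rate, and similar metrics in real time, and raises alerts when critical problems occur.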
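Node-level logging can be added without touching the node bodies by wrapping them in a decorator. This is a minimal sketch using only the standard library; the decorator name `traced`, the logger name, and the `demo_node` function are hypothetical.

```python
import functools
import logging
from typing import Any, Callable, Dict

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("self_healing_agent")  # logger name is an arbitrary choice

Node = Callable[[Dict[str, Any]], Dict[str, Any]]


def traced(node: Node) -> Node:
    """Wrap a node function so that entry, exit, and exceptions are logged."""
    @functools.wraps(node)
    def wrapper(state: Dict[str, Any]) -> Dict[str, Any]:
        logger.info("enter %s", node.__name__)
        try:
            update = node(state)
        except Exception:
            logger.exception("node %s raised", node.__name__)
            raise
        logger.info("exit %s; updated keys: %s", node.__name__, sorted(update))
        return update
    return wrapper


@traced
def demo_node(state: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a graph node: bumps the attempt counter."""
    return {"healing_attempts_count": state["healing_attempts_count"] + 1}


result = demo_node({"healing_attempts_count": 0})
print(result)
```

Because `functools.wraps` preserves the function name, a wrapped node could still be registered under its original name.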

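5.4 Performance optimization

- Caching LLM calls: for repeated or similar LLM calls, consider a cache to reduce latency and cost.
- Concurrent execution: where it does not affect correctness, look for opportunities to run things concurrently inside the agent, such as calling several unrelated tools at once.

5.5 Dynamic tool registration

Allow the agent to discover and register new tools at run time, for example by querying a tool registry or by creating tool wrappers dynamically at the LLM's instruction. This further improves the agent's flexibility and adaptability.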
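A run-time registry could look roughly like the following. The `ToolRegistry` class and its methods are hypothetical and deliberately minimal; they are not part of LangChain or LangGraph.

```python
from typing import Any, Callable, Dict, List, Tuple


class ToolRegistry:
    """Minimal in-memory registry that tools can be added to while the agent runs."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tuple[Callable[..., Any], str]] = {}

    def register(self, name: str, func: Callable[..., Any], description: str) -> None:
        if name in self._tools:
            raise ValueError(f"tool already registered: {name}")
        self._tools[name] = (func, description)

    def names(self) -> List[str]:
        """Sorted tool names, e.g. for listing in a prompt."""
        return sorted(self._tools)

    def call(self, name: str, **kwargs: Any) -> Any:
        func, _description = self._tools[name]  # raises KeyError for unknown tools
        return func(**kwargs)


registry = ToolRegistry()
registry.register("echo", lambda text: text, "Return the input unchanged.")
registry.register("shout", lambda text: text.upper(), "Return the input in upper case.")
print(registry.names(), registry.call("shout", text="fallback found"))
```

Refusing duplicate names keeps a dynamically discovered tool from silently replacing one the agent already relies on.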

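VI. The foundation of resilient AI agents

Today we explored how to build a LangGraph agent with "self-healing" capability. We saw how LangGraph's state-machine model combines with the LLM's reasoning ability so that the agent can:

- Detect failures of external tool APIs.
- Diagnose the likely cause of a failure.
- Generate an alternative action plan or logic.
- Execute these repair strategies dynamically, overcoming the obstacle and keeping the task moving.

This "self-healing" capability is a key step in moving AI agents from "useful" to "reliable" and "autonomous". It lets agents show greater resilience and adaptability in an uncertain real world. The LLM is no longer just a text-generation engine; it is also a reasoner and strategy planner that can drive the agent to manage and improve itself in complex environments.

As our understanding of agent systems deepens and LLM capabilities keep improving, we have reason to believe that self-healing agents will become valuable assistants across many industries.

Thank you for listening!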
