Understanding 'Self-Aware Resource Management': How Can an Agent Sense Its Token Balance and Proactively Reduce Its 'Thinking Word Count'? - 智猿学院

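Distinguished programming experts, AI architects, and colleagues with a passion for resource management in intelligent systems, hello and welcome!

Today we take a close look at a topic of growing importance when building agents: Self-Aware Resource Management. Specifically, we focus on one core question: how can an agent sense its own token balance and, based on it, proactively reduce its "thinking word count", that is, the length of the responses it generates? This affects not only cost, but also the agent's operating efficiency, response speed, and ultimately the user experience.

In the era of agents driven by large language models (LLMs), we have given agents unprecedented intelligence and capability. That capability comes at a price. Every interaction with an LLM, whether sending the input prompt or receiving the output completion, consumes compute and is usually billed in tokens. An agent that thinks at length and answers verbosely can exhaust its budget without noticing, or add significant latency. It therefore becomes essential for the agent to behave like a shrewd accountant: to keep an eye on its "finances", meaning its token balance, and to tighten its belt when necessary.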

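1. Understanding Token Consumption and Resource Limits in LLMs

Before looking at the self-awareness mechanism, we need a clear picture of the role tokens play in the LLM ecosystem.

1.1 What a Token Is and How It Is Billed

A token is the basic unit in which an LLM processes text. It is not the same as a word: a common English word is often a single token, longer or rarer words are split into several, and a Chinese character typically takes one or two tokens depending on the tokenizer. LLM providers (such as OpenAI, Anthropic, and Google) typically bill API calls along two dimensions:

Input tokens: all the text you send to the LLM, including system instructions, the user query, conversation history, tool descriptions, and so on.
Output tokens: the response text the LLM generates.

Output tokens are usually priced higher than input tokens. If an agent generates lengthy responses without restraint, its costs therefore climb quickly.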
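The two-rate billing model described above can be sketched as a small calculation. The prices below are placeholders chosen only so that output costs four times as much as input; they are not any provider's actual rates, and `TokenPricing` and `call_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Prices in USD per million tokens (placeholder values, not real rates)."""
    input_per_million: float
    output_per_million: float


def call_cost(pricing: TokenPricing, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call under a two-rate pricing model."""
    return (input_tokens * pricing.input_per_million
            + output_tokens * pricing.output_per_million) / 1_000_000


# Hypothetical rates: output tokens cost 4x input tokens.
example_pricing = TokenPricing(input_per_million=2.5, output_per_million=10.0)

# Same prompt, two different response lengths.
short_reply = call_cost(example_pricing, input_tokens=1_000, output_tokens=200)
long_reply = call_cost(example_pricing, input_tokens=1_000, output_tokens=2_000)
print(f"short: ${short_reply:.4f}, long: ${long_reply:.4f}")
```

With these assumed rates, the longer reply costs five times as much even though the prompt is identical, which is why output length is the first lever worth controlling.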

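1.2 Context Window Limit

Beyond cost, a more fundamental constraint is the size of the context window. Every LLM has a maximum context length, for example 128k tokens for GPT-4o. The total number of input and output tokens the model can "remember" in a single call is therefore bounded. Lengthy responses are appended to the conversation history and resent as input on later turns, so an agent that keeps producing them reaches the limit quickly. Older history then has to be truncated, which hurts the agent's long-term memory and coherence.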
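As a minimal sketch of this constraint, the helper below computes how much room is left for output once the input is counted. It assumes the context window is shared between prompt and completion; `max_output_room` and the `safety_margin` parameter are hypothetical, and some models additionally impose a separate output cap that is not modeled here.

```python
def max_output_room(context_limit: int, input_tokens: int, safety_margin: int = 0) -> int:
    """Return how many output tokens still fit in the context window.

    context_limit: the model's maximum context length (input + output).
    input_tokens:  tokens already used by the prompt and history.
    safety_margin: extra headroom for estimation error (an assumption of this sketch).
    """
    return max(0, context_limit - input_tokens - safety_margin)


# A 128k window with 120k tokens of history leaves little room for the answer.
print(max_output_room(128_000, 120_000, safety_margin=1_000))
```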

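1.3 Latency and User Experience

A longer response takes the LLM longer to generate. For applications that need real-time interaction, such as chatbots or intelligent assistants, high latency seriously damages the user experience. Reducing the "thinking word count" saves money and also noticeably improves response speed.

Sensing and managing tokens is therefore a key step in taking an agent from "intelligent" to "efficiently intelligent".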

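2. Architecture of a Self-Aware Agent: Adding a Resource Management Module

To make an agent token-aware, external monitoring alone is not enough. The agent itself must internalize this awareness and fold it into its decision loop. We can extend the traditional agent architecture (perceive, plan, act) with a dedicated resource management module.

A self-aware agent architecture might contain the following key components:

Perception Module: receives user input and reads the agent's internal state.
Resource Management Module:
- Token Estimator: predicts the number of input tokens and likely output tokens.
- Token Tracker: records the tokens actually consumed.
- Budget Controller: maintains the total budget and current balance, and issues warnings or limits according to policy.
- Policy Engine: decides resource allocation and action strategy based on the budget state and task priority.
Planning Module: draws up an action plan from the perceived information and the resource management module's advice.
Action Module: executes the plan, including calling the LLM and using tools.
Reflection/Learning Module: evaluates the results of actions, updates the agent's state, and may adjust the resource management strategy.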
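To make the Policy Engine's role concrete, here is a minimal sketch that maps the budget state to a coarse tier which the planning module could consult. `BudgetTier` and `decide_tier` are hypothetical names; the 0.4 and 0.2 default thresholds mirror the ones used in this article's later code, and a real policy would likely also weigh task priority.

```python
from enum import Enum


class BudgetTier(Enum):
    NORMAL = "normal"        # plenty of budget left
    TIGHT = "tight"          # start asking for conciseness
    CRITICAL = "critical"    # only the most essential output
    DEPLETED = "depleted"    # refuse further LLM calls


def decide_tier(remaining: int, total: int,
                tight_at: float = 0.4, critical_at: float = 0.2) -> BudgetTier:
    """Map the remaining share of the budget to a tier."""
    if total <= 0 or remaining <= 0:
        return BudgetTier.DEPLETED
    ratio = remaining / total
    if ratio <= critical_at:
        return BudgetTier.CRITICAL
    if ratio <= tight_at:
        return BudgetTier.TIGHT
    return BudgetTier.NORMAL


print(decide_tier(remaining=700, total=2000))
```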

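This article focuses mainly on the resource management module and its interaction with the planning and action modules, in order to "sense the token balance and proactively reduce the thinking word count".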

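3. Sensing the Token Balance: The Agent's "Financial Statement"

To sense its token balance, an agent needs two core abilities: estimating (before acting) and tracking (after acting).

3.1 Estimating Tokens: Predicting Before Acting

Each time the agent is about to call the LLM, it should try to estimate how many tokens the call may consume. This involves:

Computing input tokens exactly: this is relatively easy, using tiktoken (recommended by OpenAI) or another model-specific tokenizer library.
Estimating likely output tokens: this is the hard part. The output length depends on many factors, including the complexity of the prompt, the expected type of response, and the model's own tendencies.

Code example: estimating tokens with tiktoken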

import tiktoken
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

class TokenEstimator:
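    def __init__(self, model_name="gpt-4o"):
        """
        Initialize the token estimator.
        Args:
            model_name (str): Name of the model whose tokenizer is used for estimation.
                              OpenAI models can be looked up by name; other models may need a custom encoder.
        """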
        try:
            self.encoding = tiktoken.encoding_for_model(model_name)
            logging.info(f"Initialized TokenEstimator for model: {model_name}")
        except KeyError:
            logging.warning(f"Model {model_name} not found in tiktoken. Using cl100k_base encoding.")
            self.encoding = tiktoken.get_encoding("cl100k_base")

    def estimate_tokens(self, text: str) -> int:
        """
        Estimate the number of tokens in the given text.
        Args:
            text (str): The text to measure.
        Returns:
            int: The estimated token count.
        """
        if not text:
            return 0
        return len(self.encoding.encode(text))

    def estimate_prompt_tokens(self, messages: list[dict]) -> int:
        """
        Estimate the token count of a messages list in OpenAI chat API format.
        Accounts for each message's role and content plus the structural overhead.
        Reference: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
        Args:
            messages (list[dict]): Chat messages, e.g. [{"role": "user", "content": "hello"}].
        Returns:
            int: The estimated token count.
        """
        if not messages:
            return 0

        # Fixed per-message overhead (values follow the cookbook above; other models may differ)
        tokens_per_message = 3  # message framing around role and content
        tokens_per_name = 1     # extra token when a "name" field is present

        num_tokens = 0
        for message in messages:
            num_tokens += tokens_per_message
            for key, value in message.items():
                if isinstance(value, str):  # skip None or non-text content
                    num_tokens += self.estimate_tokens(value)
                if key == "name":
                    num_tokens += tokens_per_name
        num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
        return num_tokens

# Example usage
# estimator = TokenEstimator(model_name="gpt-4o")
# print(f"Text 'Hello, world!' tokens: {estimator.estimate_tokens('Hello, world!')}")
# messages = [
#     {"role": "system", "content": "You are a helpful assistant."},
#     {"role": "user", "content": "What is the capital of France?"}
# ]
# print(f"Prompt messages tokens: {estimator.estimate_prompt_tokens(messages)}")

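Heuristics for estimating output tokens:

User expectations: follow what the user asks for in the prompt (for example, "summarize in three sentences").
Task type: summarization tasks are expected to produce shorter output; explanatory tasks, longer output.
Historical average: the agent can record the output lengths of earlier, similar tasks and use the average as a baseline.
Share of the context: if the input context is already long, conservatively allocate fewer tokens to the output.
Preset maximum: set a hard max_tokens parameter, although this is a limit rather than an estimate.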
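The "historical average" heuristic, combined with a preset maximum, could be sketched as follows. `OutputLengthHistory`, the task-type labels, and the default estimate of 300 tokens are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Optional


class OutputLengthHistory:
    """Keeps a running average of output lengths per task type."""

    def __init__(self, default_estimate: int = 300):
        self.default_estimate = default_estimate  # used until data is available
        self._totals = defaultdict(int)
        self._counts = defaultdict(int)

    def record(self, task_type: str, output_tokens: int) -> None:
        """Store the actual output length observed for a finished call."""
        self._totals[task_type] += output_tokens
        self._counts[task_type] += 1

    def estimate(self, task_type: str, hard_cap: Optional[int] = None) -> int:
        """Return the average seen so far, optionally clipped to a hard cap."""
        count = self._counts.get(task_type, 0)
        estimate = self._totals[task_type] // count if count else self.default_estimate
        return min(estimate, hard_cap) if hard_cap is not None else estimate


history = OutputLengthHistory()
history.record("summary", 100)
history.record("summary", 200)
print(history.estimate("summary"), history.estimate("explanation"))
```

An estimate obtained this way can be passed to a check such as `can_afford` before the call, then corrected with the real `completion_tokens` afterwards.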

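3.2 Tracking Tokens: Auditing After Acting

After each LLM call completes, the API response usually reports the prompt_tokens and completion_tokens actually consumed. The agent must capture these numbers and apply them to its internal "token balance".

Code example: a token budget manager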

class TokenBudgetManager:
    def __init__(self, total_budget_tokens: int, model_name: str = "gpt-4o"):
        """
        Initialize the token budget manager.
        Args:
            total_budget_tokens (int): The agent's total token budget for the given session or task.
            model_name (str): Name of the LLM used for token estimation.
        """
        self.total_budget = total_budget_tokens
        self.current_used_tokens = 0
        self.estimator = TokenEstimator(model_name=model_name)
        logging.info(f"TokenBudgetManager initialized with total budget: {self.total_budget} tokens.")

    def track_usage(self, input_tokens: int, output_tokens: int):
        """
        Record the tokens actually consumed by one LLM call.
        Args:
            input_tokens (int): Input tokens used by this call.
            output_tokens (int): Output tokens used by this call.
        """
        self.current_used_tokens += (input_tokens + output_tokens)
        logging.info(f"Tokens used: Input={input_tokens}, Output={output_tokens}. "
                     f"Total used: {self.current_used_tokens}/{self.total_budget}")

    def get_remaining_budget(self) -> int:
        """
        Get the remaining token budget.
        Returns:
            int: The number of tokens left (never negative).
        """
        return max(0, self.total_budget - self.current_used_tokens)

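    def is_budget_critical(self, threshold_percentage: float = 0.2) -> bool:
        """
        Check whether the budget is approaching a critical level.
        Args:
            threshold_percentage (float): A fraction between 0 and 1 (despite the name);
                                          the budget is critical once the remaining share is at or below it.
        Returns:
            bool: True if the budget is near the critical level.
        """
        if self.total_budget <= 0:
            return True  # avoid division by zero: a non-positive budget is always critical
        remaining_ratio = self.get_remaining_budget() / self.total_budget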
        if remaining_ratio <= threshold_percentage:
            logging.warning(f"Budget is critical! Remaining: {self.get_remaining_budget()} tokens "
                            f"({remaining_ratio*100:.2f}% of total).")
            return True
        return False

    def is_budget_depleted(self) -> bool:
        """
        Check whether the budget has been used up.
        Returns:
            bool: True if the budget is depleted.
        """
        if self.current_used_tokens >= self.total_budget:
            logging.error(f"Budget depleted! Total used: {self.current_used_tokens}/{self.total_budget}")
            return True
        return False

    def can_afford(self, estimated_input_tokens: int, estimated_output_tokens: int) -> bool:
        """
        Check whether the agent can afford the estimated token cost of a call.
        Args:
            estimated_input_tokens (int): Estimated input tokens.
            estimated_output_tokens (int): Estimated output tokens.
        Returns:
            bool: True if the call fits in the remaining budget.
        """
        if self.current_used_tokens + estimated_input_tokens + estimated_output_tokens > self.total_budget:
            logging.warning(f"Cannot afford. Estimated cost: {estimated_input_tokens + estimated_output_tokens}, "
                            f"remaining budget: {self.get_remaining_budget()}")
            return False
        return True

    def reset_budget(self):
        """Reset the budget manager, typically at the start of a new session or task."""
        self.current_used_tokens = 0
        logging.info("TokenBudgetManager reset.")

# Example usage
# budget_manager = TokenBudgetManager(total_budget_tokens=2000)
# estimator = TokenEstimator()
#
# # Simulate the first call
# prompt_tokens_1 = estimator.estimate_prompt_tokens([{"role": "user", "content": "Tell me a story."}])
# estimated_output_1 = 500 # assume the output is estimated at 500 tokens
# if budget_manager.can_afford(prompt_tokens_1, estimated_output_1):
#     print(f"Affordable. Prompt tokens: {prompt_tokens_1}, Estimated output: {estimated_output_1}")
#     # LLM call happens, get actual usage
#     actual_input_1, actual_output_1 = prompt_tokens_1, 450 # simulated actual usage
#     budget_manager.track_usage(actual_input_1, actual_output_1)
# else:
#     print("Not affordable, need to adjust.")
#
# # Simulate a follow-up call; the budget may be getting tight
# prompt_tokens_2 = estimator.estimate_prompt_tokens([{"role": "user", "content": "Continue the story."}])
# estimated_output_2 = 800 # assume the output is estimated at 800 tokens
# if budget_manager.can_afford(prompt_tokens_2, estimated_output_2):
#     print(f"Affordable. Prompt tokens: {prompt_tokens_2}, Estimated output: {estimated_output_2}")
#     actual_input_2, actual_output_2 = prompt_tokens_2, 780
#     budget_manager.track_usage(actual_input_2, actual_output_2)
# else:
#     print("Not affordable, need to adjust.")
#     # Agent will trigger response reduction strategies here
#
# budget_manager.is_budget_critical()
# budget_manager.is_budget_depleted()

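4. Planning and Acting: Strategies for Proactively Reducing the "Thinking Word Count"

When the agent's resource management module senses that the token budget is tight, it needs to trigger strategies that proactively reduce output tokens, i.e. the "thinking word count". These strategies fall into two categories: pre-processing (prompt engineering) and post-processing.

4.1 Pre-processing Strategies: Intervening Before the LLM Call

This is the most effective and the recommended approach, because it directly shapes the LLM's generation and so avoids producing redundant content in the first place.

4.1.1 Explicitly Instructing the LLM to Be Concise

State in the system prompt or the user prompt that the LLM should shorten its response. This is the most direct method.

General conciseness instructions:
- "Please answer concisely."
- "Explain in as few words as possible."
- "Provide only the key information."
Length-limiting instructions:
- "Summarize in three sentences."
- "Answer in no more than 50 words."
- "Reply using no more than 100 tokens." (Models cannot count their own tokens precisely, so compliance with token-count instructions varies.)
Formatting instructions:
- "Use a bulleted list."
- "Give names only, without explanations."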
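The three kinds of instruction listed above can be composed programmatically rather than hard-coded. The sketch below is one way to do it; `build_length_instruction` and its parameters are hypothetical, and the exact wording that works best will depend on the model.

```python
from typing import Optional


def build_length_instruction(max_sentences: Optional[int] = None,
                             max_words: Optional[int] = None,
                             bullets: bool = False,
                             names_only: bool = False) -> str:
    """Compose a conciseness instruction from general, length, and format parts."""
    parts = ["Answer concisely."]                       # general instruction
    if max_sentences is not None:                       # length limits
        parts.append(f"Use at most {max_sentences} sentences.")
    if max_words is not None:
        parts.append(f"Do not exceed {max_words} words.")
    if bullets:                                         # formatting instructions
        parts.append("Format the answer as a bullet list.")
    if names_only:
        parts.append("Give names only, without explanations.")
    return " ".join(parts)


print(build_length_instruction(max_sentences=3, bullets=True))
```

The resulting string can be appended to the system prompt, with stricter arguments chosen as the budget tightens.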

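4.1.2 Adjusting the max_tokens Parameter

When calling the LLM API, the max_tokens parameter directly caps the number of tokens the LLM may generate. When the budget is tight, the agent can lower this value dynamically.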
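As an alternative to fixed tiers, the cap can shrink continuously with the remaining budget. This is a minimal sketch of that idea, not a prescribed formula; `scaled_max_tokens` is a hypothetical helper, and its default ceiling of 500 and floor of 50 simply mirror DEFAULT_MAX_OUTPUT_TOKENS and MIN_MAX_OUTPUT_TOKENS in this article's ConciseAgent.

```python
def scaled_max_tokens(remaining: int, total: int,
                      ceiling: int = 500, floor: int = 50) -> int:
    """Scale max_tokens linearly with the remaining share of the budget.

    The result never exceeds `ceiling` and never drops below `floor`.
    """
    if total <= 0:
        return floor
    ratio = min(1.0, max(0.0, remaining / total))  # clamp to [0, 1]
    return max(floor, int(ceiling * ratio))


for remaining in (2000, 1000, 100):
    print(remaining, scaled_max_tokens(remaining, total=2000))
```

Because max_tokens truncates rather than shortens, a cap like this works best together with a conciseness instruction that asks for roughly the same length.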

4.1.3 Few-shot Examples

Provide concise examples that steer the LLM toward imitating the same terse style.
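A minimal sketch of how such examples might be placed in a chat-format prompt follows. The example question and answer pairs and the `with_few_shot` helper are illustrative assumptions; note that every example adds input tokens, so only a few short ones are worth including.

```python
# Illustrative question/answer pairs that demonstrate the desired brevity.
CONCISE_EXAMPLES = [
    ("What is the capital of France?", "Paris."),
    ("What does HTTP stand for?", "HyperText Transfer Protocol."),
]


def with_few_shot(system_prompt: str, user_query: str,
                  examples=CONCISE_EXAMPLES) -> list[dict]:
    """Build a messages list: system prompt, example turns, then the real query."""
    messages = [{"role": "system", "content": system_prompt}]
    for question, answer in examples:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": user_query})
    return messages


for message in with_few_shot("You are a terse assistant.", "Who wrote Hamlet?"):
    print(message["role"], "->", message["content"])
```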

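4.1.4 Adjusting the Agent's Role or Tone

For example, change the agent's role from "detailed explainer" to "terse summarizer".

Code example: dynamically adjusting the prompt and max_tokens

We will create a ConciseAgent class that integrates TokenBudgetManager and TokenEstimator and adjusts its LLM calls dynamically according to the budget state.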

# import openai  # only needed for the real API call; the mock below does not require it

# Mock the OpenAI API response, because real API calls consume real resources and time
class MockOpenAI:
    def __init__(self):
        self.call_count = 0

    def chat_completion_create(self, model, messages, max_tokens, temperature):
        self.call_count += 1
        # The agent puts its conciseness instruction in the system message (first);
        # the user query is assumed to be the last message.
        system_content = messages[0]['content'].lower()
        prompt_content = messages[-1]['content'].lower()

        # Simulate LLM responses for different situations
        if "concise" in system_content or "concise" in prompt_content or max_tokens < 100:
            response_text = f"This is a concise response number {self.call_count}, as requested. It fits within {max_tokens} tokens."
        elif "summarize" in prompt_content:
            response_text = f"Here is a summary for call {self.call_count}. It is relatively brief."
        else:
            # Parentheses let the adjacent f-strings concatenate across lines
            response_text = (
                f"This is a detailed and potentially lengthy response for call {self.call_count}. "
                f"It elaborates on the query to provide comprehensive information, "
                f"demonstrating the agent's full capability. "
                f"The current max_tokens setting was {max_tokens}. "
                f"This is a paragraph of filler text to make it longer. "
                f"Another paragraph of filler to ensure it crosses typical short response lengths."
            )

        # Estimate the tokens actually used
        estimator = TokenEstimator()
        actual_output_tokens = estimator.estimate_tokens(response_text)
        # Cap the simulated completion_tokens. A real API enforces max_tokens as a hard cap
        # (the reply is cut off); the +20 slack exists only so this mock can exercise the over-limit path.
        actual_output_tokens = min(actual_output_tokens, max_tokens + 20)

        # Mimic the API response format
        return {
            "choices": [{"message": {"content": response_text}}],
            "usage": {
                "prompt_tokens": estimator.estimate_prompt_tokens(messages),
                "completion_tokens": actual_output_tokens,
                "total_tokens": estimator.estimate_prompt_tokens(messages) + actual_output_tokens
            }
        }

# Stand-in for openai.chat.completions.create
# In a real application you would import the openai library and call its method directly
# For the simulation, we create a proxy function
mock_openai_client = MockOpenAI()
def call_llm_api(model, messages, max_tokens, temperature):
    return mock_openai_client.chat_completion_create(model=model, messages=messages, max_tokens=max_tokens, temperature=temperature)

class ConciseAgent:
    DEFAULT_MAX_OUTPUT_TOKENS = 500
    CRITICAL_MAX_OUTPUT_TOKENS = 150
    MIN_MAX_OUTPUT_TOKENS = 50 # even under extreme pressure, allow at least a few tokens

    def __init__(self, agent_name: str, total_session_budget: int, llm_model: str = "gpt-4o"):
        self.name = agent_name
        self.llm_model = llm_model
        self.budget_manager = TokenBudgetManager(total_session_budget, model_name=llm_model)
        self.estimator = TokenEstimator(model_name=llm_model)
        self.conversation_history: list[dict] = []
        logging.info(f"{self.name} Agent initialized with budget: {total_session_budget} tokens.")

    def _get_adjusted_prompt(self, user_query: str, current_max_tokens: int) -> list[dict]:
        """
        Adjust the LLM prompt dynamically from the current budget state and the user query.
        Args:
            user_query (str): The user's original query.
            current_max_tokens (int): Maximum output tokens allowed for this LLM call.
        Returns:
            list[dict]: The adjusted messages list for the LLM API call.
        """
        messages = self.conversation_history.copy()

        system_message = {"role": "system", "content": f"You are a helpful assistant named {self.name}."}

        # Add a conciseness instruction that matches the budget state
        if self.budget_manager.is_budget_critical(threshold_percentage=0.2):
            system_message["content"] += (
                f" Your remaining budget is low ({self.budget_manager.get_remaining_budget()} tokens). "
                f"Please be extremely concise and prioritize only the most critical information. "
                f"Limit your response strictly to approximately {current_max_tokens} tokens."
            )
            logging.warning(f"{self.name}: Budget critical, adding strong conciseness instruction to prompt.")
        elif self.budget_manager.is_budget_critical(threshold_percentage=0.4):
            system_message["content"] += (
                f" Your budget is getting tight. Please aim for conciseness. "
                f"Try to keep your response within {current_max_tokens} tokens."
            )
            logging.info(f"{self.name}: Budget getting tight, adding conciseness instruction.")
        else:
            system_message["content"] += (
                f" You can provide a moderately detailed response, "
                f"but still be mindful of efficiency. Max tokens allowed: {current_max_tokens}."
            )

        messages.insert(0, system_message) # put the system message first
        messages.append({"role": "user", "content": user_query})

        return messages

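    def respond(self, user_query: str) -> str:
        """
        Main entry point for answering a user query, including the self-aware reduction logic.
        Args:
            user_query (str): The user's query text.
        Returns:
            str: The response text generated by the agent.
        """
        if self.budget_manager.is_budget_depleted():
            logging.error(f"{self.name}: Budget depleted. Cannot respond to '{user_query}'.")
            return "Sorry, my thinking budget is exhausted, so I cannot provide any further detailed replies."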

        # 1. Estimate the tokens this call may consume
        # First decide, from the budget state, how many tokens to allocate to the output
        max_output_tokens_for_this_call = self.DEFAULT_MAX_OUTPUT_TOKENS
        if self.budget_manager.is_budget_critical(threshold_percentage=0.2):
            max_output_tokens_for_this_call = self.CRITICAL_MAX_OUTPUT_TOKENS
        elif self.budget_manager.is_budget_critical(threshold_percentage=0.4):
            max_output_tokens_for_this_call = int(self.DEFAULT_MAX_OUTPUT_TOKENS * 0.5) # moderate reduction

        # Make sure max_output_tokens_for_this_call never drops below the minimum
        max_output_tokens_for_this_call = max(self.MIN_MAX_OUTPUT_TOKENS, max_output_tokens_for_this_call)

        # Adjust the prompt dynamically
        messages_to_send = self._get_adjusted_prompt(user_query, max_output_tokens_for_this_call)

        # Estimate input tokens (including the adjusted prompt)
        estimated_input_tokens = self.estimator.estimate_prompt_tokens(messages_to_send)

        # Check whether this call is affordable (estimated input plus the adjusted output cap)
        if not self.budget_manager.can_afford(estimated_input_tokens, max_output_tokens_for_this_call):
            # Still unaffordable after the reduction: shrink further or refuse
            logging.warning(f"{self.name}: Even with adjusted max_tokens ({max_output_tokens_for_this_call}), "
                            f"still cannot afford. Attempting minimal response.")
            max_output_tokens_for_this_call = self.MIN_MAX_OUTPUT_TOKENS
            messages_to_send = self._get_adjusted_prompt(user_query, max_output_tokens_for_this_call)
            estimated_input_tokens = self.estimator.estimate_prompt_tokens(messages_to_send)

            if not self.budget_manager.can_afford(estimated_input_tokens, max_output_tokens_for_this_call):
                logging.error(f"{self.name}: Cannot afford even minimal response. Budget too low.")
                return "Sorry, my budget is extremely tight, so I cannot provide any reply."

        logging.info(f"{self.name}: Calling LLM with estimated input {estimated_input_tokens} "
                     f"and max_output_tokens {max_output_tokens_for_this_call}.")

        # 2. Call the LLM API
        try:
            # The real LLM call would look like this:
            # response = openai.chat.completions.create(
            #     model=self.llm_model,
            #     messages=messages_to_send,
            #     max_tokens=max_output_tokens_for_this_call,
            #     temperature=0.7 # adjust as needed
            # )
            response = call_llm_api(
                model=self.llm_model,
                messages=messages_to_send,
                max_tokens=max_output_tokens_for_this_call,
                temperature=0.7
            )

            response_content = response.get("choices")[0].get("message").get("content")
            usage = response.get("usage")

            actual_input_tokens = usage.get("prompt_tokens")
            actual_output_tokens = usage.get("completion_tokens")

            # 3. Track the tokens actually consumed
            self.budget_manager.track_usage(actual_input_tokens, actual_output_tokens)

            # Update the conversation history
            self.conversation_history.append({"role": "user", "content": user_query})
            self.conversation_history.append({"role": "assistant", "content": response_content})

            return response_content

        except Exception as e:
            logging.error(f"{self.name}: Error calling LLM API: {e}")
            return "Sorry, I ran into a problem while handling your request."

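# Simulate agent interactions
print("--- Starting agent simulation ---")
agent = ConciseAgent("SmartAssistant", total_session_budget=2000) # assume a total budget of 2000 tokens

# Turn 1: the budget is ample, so the agent may answer in detail
print("\n[User]: Tell me about the latest advances in AI, especially generative AI.\n")
response1 = agent.respond("Tell me about the latest advances in AI, especially generative AI.")
print(f"[SmartAssistant]: {response1}")
print(f"Remaining budget: {agent.budget_manager.get_remaining_budget()} tokens")
print(f"Budget tight: {agent.budget_manager.is_budget_critical(0.4)}") # has the moderate threshold been reached?

# Turn 2: meant to show the budget tightening, at which point the agent is told to be concise.
# The mock's replies are short, so with a 2000-token budget the thresholds may not actually
# be crossed; lower total_session_budget to observe the conciseness tiers.
print("\n[User]: OK, please summarize the recent breakthroughs of generative AI in multimodal understanding.\n")
response2 = agent.respond("OK, please summarize the recent breakthroughs of generative AI in multimodal understanding.")
print(f"[SmartAssistant]: {response2}")
print(f"Remaining budget: {agent.budget_manager.get_remaining_budget()} tokens")
print(f"Budget critical: {agent.budget_manager.is_budget_critical(0.2)}") # has the critical threshold been reached?

# Turn 3: once the budget is critical, the agent becomes extremely concise
print("\n[User]: So what impact will multimodal AI have on future human-computer interaction? One sentence.\n")
response3 = agent.respond("So what impact will multimodal AI have on future human-computer interaction? One sentence.")
print(f"[SmartAssistant]: {response3}")
print(f"Remaining budget: {agent.budget_manager.get_remaining_budget()} tokens")
print(f"Budget depleted: {agent.budget_manager.is_budget_depleted()}")

# Turn 4: once the budget is depleted, the agent refuses to reply
print("\n[User]: I have one more question, about AI ethics.\n")
response4 = agent.respond("I have one more question, about AI ethics.")
print(f"[SmartAssistant]: {response4}")
print(f"Remaining budget: {agent.budget_manager.get_remaining_budget()} tokens")
print(f"Budget depleted: {agent.budget_manager.is_budget_depleted()}")

print("--- Agent simulation finished ---")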

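4.2 Post-processing Strategies: Intervening After the LLM Has Generated

When the LLM does not fully obey the conciseness instruction, or when the output cannot be controlled effectively at the pre-processing stage, post-processing can serve as a remedy. It is usually less efficient, however: the expensive output tokens have already been paid for, and information may be lost during compression.

Truncation: cut off whatever exceeds the length limit. This is the crudest method and can leave the text incomplete or ungrammatical.
Re-summarization: summarize the original output with another (possibly smaller and cheaper) LLM. This adds an API call and extra latency.
Keyword Extraction: pull the core keywords or phrases out of the long text.
Structured Information Extraction: if the task is to extract specific information, keep only the extracted structured data.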
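Truncation does not have to cut mid-sentence. The sketch below keeps whole sentences until the limit is reached. To stay dependency-free it counts whitespace-separated words as a crude stand-in for tokens (an assumption; a real tokenizer such as tiktoken could be passed in as `count`), and the sentence splitter only handles simple English punctuation.

```python
import re
from typing import Callable


def approx_tokens(text: str) -> int:
    """Crude token proxy: the number of whitespace-separated words."""
    return len(text.split())


def truncate_at_sentence(text: str, max_tokens: int,
                         count: Callable[[str], int] = approx_tokens) -> str:
    """Keep leading sentences while their combined count stays within max_tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, used = [], 0
    for sentence in sentences:
        cost = count(sentence)
        if used + cost > max_tokens:
            break  # adding this sentence would exceed the limit
        kept.append(sentence)
        used += cost
    # May be empty if even the first sentence is too long; the caller can
    # then fall back to hard truncation or a short fixed message.
    return " ".join(kept)


print(truncate_at_sentence("One two three. Four five. Six seven eight nine.", 5))
```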

Code example: a simple post-processing summarizer (as a fallback)

class PostProcessor:
    def __init__(self, estimator: TokenEstimator, model_name: str = "gpt-3.5-turbo"):
        self.estimator = estimator
        self.summary_model = model_name # a cheaper model can be used for re-summarization

    def summarize_text(self, text: str, max_summary_tokens: int) -> str:
        """
        Try to re-summarize the text with a (hypothetical) LLM.
        In a real application this would be another LLM API call.
        Args:
            text (str): The original text.
            max_summary_tokens (int): Maximum number of tokens for the summary.
        Returns:
            str: The summarized text.
        """
        original_tokens = self.estimator.estimate_tokens(text)
        if original_tokens <= max_summary_tokens:
            return text # the original is already short enough, no summary needed

        logging.info(f"Post-processing: Summarizing text (original {original_tokens} tokens) "
                     f"to max {max_summary_tokens} tokens.")

        # Simulate the LLM summarization step
        # In a real scenario, this would be another LLM API call:
        # response = openai.chat.completions.create(
        #     model=self.summary_model,
        #     messages=[
        #         {"role": "system", "content": f"Summarize the following text concisely, "
        #                                       f"limiting it to approximately {max_summary_tokens} tokens."},
        #         {"role": "user", "content": text}
        #     ],
        #     max_tokens=max_summary_tokens,
        #     temperature=0.4
        # )
        # return response.choices[0].message.content

        # Simulated summary result
        if len(text) > 200: # simplified simulation
            summary = text[:int(len(text) * (max_summary_tokens / original_tokens * 0.8))] + "..." # rough truncation
            if self.estimator.estimate_tokens(summary) > max_summary_tokens:
                summary = summary[:int(len(summary) * (max_summary_tokens / self.estimator.estimate_tokens(summary) * 0.9))] + "..."
            return summary
        else:
            return text

# A PostProcessor call can be added to ConciseAgent.respond after the LLM response is received
# For example:
# if actual_output_tokens > max_output_tokens_for_this_call * 1.2: # the output overshoots by too much
#     post_processor = PostProcessor(self.estimator)
#     response_content = post_processor.summarize_text(response_content, max_output_tokens_for_this_call)
#     # Note: the post-processed tokens must be recounted and budget_manager updated
#     # This gets complicated, because PostProcessor may itself call an LLM and incur new cost
#     # Pre-processing therefore remains the first choice.

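4.3 Strategy Summary and Comparison

The table below summarizes the main reduction strategies with their advantages and drawbacks:

Strategy type	Method	Advantages	Drawbacks	When to use
Pre-processing	Explicit conciseness instruction	Steers the model toward short responses and avoids waste	The model may not fully comply; over-restricting may hurt quality	First choice, especially when the budget is tight
Pre-processing	Dynamic `max_tokens`	Hard limit on output length, precise cost control	The response may be cut off, losing information	Combine with a conciseness instruction, as a hard upper bound
Pre-processing	Few-shot examples	Guides style and improves conciseness	Lengthens the prompt and may consume more input tokens	When a specific concise style is wanted
Pre-processing	Role/tone adjustment	Changes model behavior and affects output length at the root	Alters the agent's overall "personality"; use with care	Long-running tasks or specific agent types
Post-processing	Truncation	Simple and fast	Crude; high risk of losing key information; hurts readability	Last resort in emergencies
Post-processing	Re-summarization	Relatively high quality, well controllable	Adds an LLM call, increasing cost and latency	When the LLM ignored the instruction and the risk of information loss is high
Post-processing	Keyword/structured extraction	Suited to specific information-extraction tasks	Task-specific only; cannot produce a natural-language summary	When specific information is needed rather than free text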

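5. Advanced Considerations and Optimizations

5.1 Context Window Management

Managing input tokens matters just as much as managing output tokens. The agent should learn to:

Summarize conversation history: when the history grows too long, condense older turns to save input tokens.
Prune by relevance: identify and remove the parts of the history that are irrelevant to the current task.
Describe tools dynamically: provide a tool's detailed description only when it is needed, rather than sending every tool's information each time.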
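The simplest form of history management is a sliding window: keep the system messages and as many of the most recent turns as fit within an input budget. The sketch below shows that variant only; it does not summarize or score relevance. `prune_history` is a hypothetical helper, and word counting again stands in for a real tokenizer.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Crude token proxy: the number of whitespace-separated words."""
    return len(text.split())


def prune_history(messages: list[dict], max_input_tokens: int,
                  count: Callable[[str], int] = approx_tokens) -> list[dict]:
    """Keep system messages plus the newest turns that fit in max_input_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    used = sum(count(m["content"]) for m in system)
    kept = []
    for message in reversed(others):        # walk from newest to oldest
        cost = count(message["content"])
        if used + cost > max_input_tokens:
            break                           # everything older is dropped too
        kept.append(message)
        used += cost
    return system + list(reversed(kept))    # restore chronological order


history = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "a b c d"},
    {"role": "assistant", "content": "e f"},
    {"role": "user", "content": "g h i"},
]
print(prune_history(history, max_input_tokens=7))
```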

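5.2 Cost-Aware Model Selection

Different LLMs differ in cost and capability. When the budget is tight, the agent can pick a smaller, cheaper model for simple tasks and reserve the expensive model for complex, high-value ones.

For example: use gpt-3.5-turbo for simple chat or a first-pass summary, and switch to gpt-4o only when high quality or complex reasoning is required.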
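A routing rule along these lines could be sketched as follows. The model names are the ones mentioned above; `choose_model`, the `needs_complex_reasoning` flag (which the caller would have to determine somehow), and the 0.2 cutoff are assumptions of this sketch.

```python
def choose_model(needs_complex_reasoning: bool, remaining_ratio: float,
                 cheap_model: str = "gpt-3.5-turbo",
                 strong_model: str = "gpt-4o",
                 min_ratio_for_strong: float = 0.2) -> str:
    """Pick the strong model only for complex tasks while enough budget remains."""
    if needs_complex_reasoning and remaining_ratio > min_ratio_for_strong:
        return strong_model
    return cheap_model


print(choose_model(needs_complex_reasoning=True, remaining_ratio=0.5))
print(choose_model(needs_complex_reasoning=True, remaining_ratio=0.1))
```

When budgets are tracked in tokens, as in this article, the two models' different prices would also need to be reflected, for instance by tracking cost in currency instead.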

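5.3 User Preferences and Personalization

Users differ in how long they want responses to be. The agent can let users set a "verbosity" preference and honor it whenever the budget allows.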
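One way to reconcile the user's preference with the budget is to treat the preference as a target and the budget as a cap. The verbosity levels and token targets below are hypothetical values chosen for illustration.

```python
# Hypothetical mapping from a user-selected verbosity level to a token target.
VERBOSITY_TARGETS = {"brief": 100, "normal": 300, "detailed": 600}


def resolve_max_tokens(preference: str, budget_cap: int) -> int:
    """Honor the user's preferred length, but never exceed what the budget allows."""
    target = VERBOSITY_TARGETS.get(preference, VERBOSITY_TARGETS["normal"])
    return min(target, budget_cap)


print(resolve_max_tokens("detailed", budget_cap=150))
```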

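5.4 Learning and Adaptation

A more advanced agent can even learn. By observing which conciseness instructions work in which situations, and how satisfied users are with responses of different lengths, the agent can refine its dynamic reduction strategy. This may involve reinforcement learning or simple statistical analysis.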
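The "simple statistical analysis" end of that spectrum might look like the sketch below, which tracks how often each instruction wording kept the output within the requested length. `InstructionStats` is a hypothetical class; it ignores user satisfaction and context, which a fuller approach would need to include.

```python
from collections import defaultdict
from typing import Optional


class InstructionStats:
    """Tracks, per instruction wording, how often the model stayed within the limit."""

    def __init__(self):
        self._trials = defaultdict(int)
        self._successes = defaultdict(int)

    def record(self, instruction: str, requested_tokens: int, actual_tokens: int) -> None:
        """Log one call: a success means the output did not exceed the request."""
        self._trials[instruction] += 1
        if actual_tokens <= requested_tokens:
            self._successes[instruction] += 1

    def compliance_rate(self, instruction: str) -> float:
        trials = self._trials.get(instruction, 0)
        return self._successes.get(instruction, 0) / trials if trials else 0.0

    def best(self) -> Optional[str]:
        """Return the instruction with the highest compliance rate so far."""
        if not self._trials:
            return None
        return max(self._trials, key=self.compliance_rate)


stats = InstructionStats()
stats.record("Answer concisely.", requested_tokens=100, actual_tokens=120)
stats.record("Use at most 3 sentences.", requested_tokens=100, actual_tokens=80)
print(stats.best())
```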

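6. Challenges and Limitations

Promising as self-aware resource management is, it still faces several challenges:

Accuracy of token estimation: predicting output tokens in particular is hard, and models differ in how well they obey length instructions.
Risk of information loss: cutting too aggressively can drop key information and reduce the agent's usefulness.
Balancing efficiency and quality: finding the right trade-off between saving cost and giving high-quality, helpful responses is an ongoing challenge.
Unpredictable LLM behavior: even with explicit instructions, an LLM sometimes produces responses that exceed expectations in length.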

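Closing Remarks

Self-aware resource management is a cornerstone of robust, efficient, and economical LLM agents. By giving an agent a human-like sense of its own "thinking resources", and the ability to adjust its "thinking word count" according to the budget, we can substantially lower operating costs, improve system performance, and give users a smoother, smarter interaction. This is an evolving field; as LLM technology and agent frameworks mature, we expect to see more innovative resource management strategies and practices.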