解析 ‘Batch Process Orchestration’：利用 Agent 处理数百万条离线任务时的并发控制与错误隔离策略 - 智猿学院-前后端，数据库，人工智能，云计算等领域前沿技术讲座

各位同行，下午好。今天，我们将深入探讨一个在现代分布式系统中极其关键且富有挑战性的话题：批处理编排（Batch Process Orchestration）。具体来说，我们将聚焦于如何利用代理（Agent）处理数百万乃至数十亿条离线任务时，设计和实现健壮的并发控制（Concurrency Control）与错误隔离（Error Isolation）策略。

在数据爆炸式增长的今天，离线批处理任务无处不在——无论是大数据分析、机器学习模型训练、数据仓库ETL、报告生成，还是用户数据同步。这些任务的共同特点是：数据量庞大、处理时间长、对实时性要求相对宽松，但对最终一致性和系统稳定性要求极高。当任务规模达到数百万甚至更高量级时，传统的单机处理或简单脚本将难以为继。我们需要一个分布式、可伸缩、容错性强的系统来完成这项工作。

而“代理”的概念，在这里指的是独立、自治的执行单元，它们从任务队列中获取任务，执行处理逻辑，并将结果提交。这些代理协同工作，共同完成大规模的批处理任务。

理解批处理与代理

在深入技术细节之前，我们先明确一些基础概念。

批处理的特性

离线批处理任务通常具有以下核心特性：

大规模性 (Scalability)： 需要处理的数据量巨大，单个任务或整个批次可能包含数百万到数十亿条记录。
异步性 (Asynchronous)： 任务通常不是实时响应，而是提交后在后台执行，等待完成。
可重试性 (Retriable)： 由于分布式系统的复杂性，任务失败是常态，系统必须具备重试能力。
最终一致性 (Eventual Consistency)： 允许短时间内的不一致，最终所有数据都会达到一致状态。
资源密集型 (Resource Intensive)： 任务可能消耗大量CPU、内存、I/O或网络带宽。

代理 (Agent) 的角色与优势

在这里，一个“代理”是一个独立的进程、容器或服务实例，它的核心职责是：

从任务源获取任务。
执行预定义的任务逻辑。
报告任务状态和结果。

使用代理模式进行批处理编排的优势显而易见：

解耦 (Decoupling)： 任务的生产、调度与执行完全分离，互不影响。
可扩展性 (Scalability)： 当任务量增加时，只需增加代理实例即可扩展处理能力。
故障隔离 (Fault Isolation)： 单个代理的故障不会影响其他代理或整个系统的稳定性。
资源管理 (Resource Management)： 可以根据代理的职责为其分配不同的资源配额。

任务模型

一个典型的离线任务可以被抽象为以下结构：

{
    "task_id": "unique_uuid_for_this_task",
    "task_type": "data_cleansing_v2",
    "input_data_uri": "s3://my-bucket/raw-data/part_001.json",
    "output_data_uri": "s3://my-bucket/processed-data/part_001_cleaned.json",
    "parameters": {
        "schema_version": "1.0",
        "validation_level": "strict"
    },
    "metadata": {
        "created_at": "2023-10-27T10:00:00Z",
        "priority": 5,
        "source_system": "crm"
    },
    "current_status": "PENDING",
    "retries_attempted": 0,
    "max_retries": 3,
    "last_error": null
}

代理会接收这样的任务描述，根据 task_type 和 parameters 执行相应的业务逻辑。

批处理编排架构概览

一个典型的批处理编排系统通常包含以下核心组件：

组件名称	职责	常用技术选型
任务源	产生待处理的任务。	数据库 (PostgreSQL, MySQL), 文件系统 (S3, HDFS), 消息队列 (Kafka, RabbitMQ)
调度器/编排器	负责任务的生成、分发、状态追踪、重试管理和整体流程控制。	Apache Airflow, Temporal, Cadence, 自研服务
任务队列	暂存待处理任务，实现任务的异步传递和削峰填谷。	Kafka, RabbitMQ, Redis Streams/List, AWS SQS
代理集群	从任务队列中获取任务并执行实际处理逻辑的计算单元。	Docker容器, Kubernetes Pods, VM实例, Serverless Functions (Lambda)
结果存储	存储任务处理后的结果、状态或元数据。	数据库 (PostgreSQL, MongoDB), 对象存储 (S3), 数据湖 (HDFS)
监控与日志	收集系统运行指标、日志，以便发现问题和优化性能。	Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), DataDog

其基本工作流程可以概括为：

任务生成： 调度器或上游服务根据业务需求生成大量任务元数据，并将其存储在任务源或直接推送到任务队列。
任务分发： 调度器从任务源读取任务，或直接将新生成的任务推送到任务队列。
任务拉取： 代理集群中的各个代理实例从任务队列中拉取任务。
任务执行： 代理执行任务的业务逻辑，可能涉及数据读取、计算、写入等操作。
状态更新与结果提交： 代理将任务执行结果（成功、失败、部分成功）和状态更新报告给调度器或直接写入结果存储。
错误处理与重试： 如果任务失败，调度器根据预设策略进行重试，或将任务移至死信队列 (Dead Letter Queue, DLQ)。
监控与告警： 监控系统实时收集各组件的运行指标，并在出现异常时发出告警。

并发控制策略

处理数百万条离线任务，并发是必然的。但无序的并发可能导致系统过载、资源耗尽、数据不一致甚至服务崩溃。因此，精心设计的并发控制策略至关重要。

1. 代理层面的并发控制

每个代理实例内部如何并行处理任务是并发控制的第一道防线。

多线程/多进程模型

Python等语言提供了方便的并发模型。

多线程 (ThreadPoolExecutor)： 适用于I/O密集型任务（如网络请求、文件读写），因为GIL (Global Interpreter Lock) 限制了Python线程的并行计算能力。
多进程 (ProcessPoolExecutor)： 适用于CPU密集型任务，每个进程都有独立的Python解释器和内存空间，可以充分利用多核CPU。

示例代码：Python 代理内部的并发处理

import time
import random
import json
import logging
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(processName)s - %(threadName)s - %(message)s')
logger = logging.getLogger(__name__)

def simulate_task_processing(task_payload):
    """
    模拟单个任务的处理逻辑。
    可能发生瞬时错误或永久错误。
    """
    task_id = task_payload['task_id']
    data = task_payload['data']
    operation = task_payload['operation']
    retries = task_payload.get('retries', 0)

    logger.info(f"[{task_id}] Starting processing (retry: {retries}). Data: {data[:20]}...")

    # 模拟I/O或计算耗时
    time.sleep(random.uniform(0.1, 0.5))

    # 模拟瞬时故障 (可重试)
    if random.random() < 0.05 and retries < task_payload.get('max_retries', 3):
        logger.warning(f"[{task_id}] Simulated transient failure.")
        raise RuntimeError(f"Transient error for task {task_id}")

    # 模拟永久故障 (不可重试，或需要特殊处理)
    if random.random() < 0.01:
        logger.error(f"[{task_id}] Simulated permanent failure.")
        raise ValueError(f"Permanent error for task {task_id}: Invalid data format")

    logger.info(f"[{task_id}] Successfully completed.")
    return {"task_id": task_id, "status": "COMPLETED", "result": f"Processed {data}"}

class AgentWorker:
    def __init__(self, agent_id, concurrency_type='thread', max_workers=5):
        self.agent_id = agent_id
        if concurrency_type == 'thread':
            self.executor = ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix=f"Agent-{agent_id}-Thread")
            logger.info(f"Agent {agent_id} initialized with ThreadPoolExecutor (max_workers={max_workers}).")
        elif concurrency_type == 'process':
            self.executor = ProcessPoolExecutor(max_workers=max_workers)
            logger.info(f"Agent {agent_id} initialized with ProcessPoolExecutor (max_workers={max_workers}).")
        else:
            raise ValueError("concurrency_type must be 'thread' or 'process'")

        self.running = True
        self.processed_count = 0

    def submit_task(self, task_payload_str):
        """
        将任务提交到内部执行器。
        """
        task_payload = json.loads(task_payload_str)
        future = self.executor.submit(simulate_task_processing, task_payload)
        return future

    def shutdown(self):
        logger.info(f"Agent {self.agent_id} shutting down executor.")
        self.running = False
        self.executor.shutdown(wait=True)
        logger.info(f"Agent {self.agent_id} executor shut down. Total tasks processed: {self.processed_count}")

# 实际的代理会从队列中获取任务，并调用 submit_task
# For example:
# agent = AgentWorker("agent-001", concurrency_type='thread', max_workers=10)
# while agent.running:
#     task_str = get_task_from_queue() # This would be a real queue pull
#     if task_str:
#         future = agent.submit_task(task_str)
#         # You'd typically manage these futures, check their status, handle results/errors
#         # For simplicity here, we omit the full queue consumption loop.
#     else:
#         time.sleep(1)

异步 I/O 模型

对于大量并发的I/O密集型任务，如爬虫、API调用等，asyncio 配合 aiohttp 等库可以实现高效的并发，而无需多线程/多进程的开销。

2. 队列层面的并发控制

任务队列是连接调度器和代理的桥梁，它本身也提供了重要的并发控制机制。

消费者组 (Consumer Groups)： Kafka、RabbitMQ 等消息队列支持消费者组，允许多个代理实例共同消费一个主题/队列，每个消息只会被组内的一个消费者处理。这天然地实现了任务的并行处理和负载均衡。
预取限制 (Prefetch Limits / QoS)： RabbitMQ 的 basic_qos 命令允许消费者限制一次从队列中拉取的消息数量。这可以防止单个代理在处理能力有限的情况下，一次性拉取过多任务导致内存溢出或长时间不响应，从而实现背压（Backpressure）。
- 例如，设置 prefetch_count=10 意味着一个代理最多同时处理10个任务。
分区 (Partitioning)： 对于Kafka等支持分区的队列，将任务分成多个分区，可以显著提高并行度。不同的代理可以消费不同的分区，减少锁竞争，提高吞吐量。任务可以基于其ID或其他属性进行哈希分区。

3. 编排器层面的并发控制

编排器作为系统的“大脑”，可以从宏观上调控整个系统的并发行为。

全局速率限制 (Global Rate Limiting)：
- 令牌桶 (Token Bucket) 或漏桶 (Leaky Bucket) 算法： 限制整个系统或特定任务类型在单位时间内被提交或处理的任务数量。这可以保护下游系统（如数据库、外部API）不被批处理任务压垮。
- 实现方式： 可以使用 Redis 实现一个分布式速率限制器。

示例代码：基于 Redis 的分布式令牌桶速率限制器

import redis
import time

class RedisRateLimiter:
    def __init__(self, redis_host='localhost', redis_port=6379, redis_db=0,
                 rate_limit_key="global_task_rate", capacity=100, fill_rate=10):
        """
        :param capacity: 令牌桶的容量
        :param fill_rate: 每秒填充的令牌数量
        """
        self.redis_client = redis.StrictRedis(host=redis_host, port=redis_port, db=redis_db, decode_responses=True)
        self.key = rate_limit_key
        self.capacity = capacity
        self.fill_rate = fill_rate
        self._init_bucket()

    def _init_bucket(self):
        # 初始化令牌桶，如果不存在则设置初始值
        if not self.redis_client.exists(self.key + ":tokens"):
            self.redis_client.set(self.key + ":tokens", self.capacity)
            self.redis_client.set(self.key + ":last_refill_time", time.time())

    def allow_request(self, tokens_needed=1):
        """
        尝试从令牌桶中获取令牌。
        :param tokens_needed: 需要的令牌数量
        :return: True如果允许，False如果拒绝
        """
        now = time.time()

        # 使用Lua脚本保证原子性
        lua_script = """
        local key_tokens = KEYS[1]
        local key_last_refill = KEYS[2]
        local capacity = tonumber(ARGV[1])
        local fill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])
        local tokens_needed = tonumber(ARGV[4])

        local last_refill_time = tonumber(redis.call('get', key_last_refill))
        if not last_refill_time then
            last_refill_time = now
            redis.call('set', key_last_refill, now)
        end

        local current_tokens = tonumber(redis.call('get', key_tokens))
        if not current_tokens then
            current_tokens = capacity
            redis.call('set', key_tokens, capacity)
        end

        local tokens_to_add = math.floor((now - last_refill_time) * fill_rate)
        current_tokens = math.min(capacity, current_tokens + tokens_to_add)
        redis.call('set', key_last_refill, now)
        redis.call('set', key_tokens, current_tokens)

        if current_tokens >= tokens_needed then
            redis.call('decrby', key_tokens, tokens_needed)
            return 1
        else
            return 0
        end
        """

        result = self.redis_client.eval(lua_script, 2, 
                                        self.key + ":tokens", self.key + ":last_refill_time",
                                        self.capacity, self.fill_rate, now, tokens_needed)
        return result == 1

# Example Usage:
if __name__ == "__main__":
    limiter = RedisRateLimiter(capacity=10, fill_rate=2) # Max 10 tokens, refills 2 tokens per second

    print("--- Testing Rate Limiter ---")
    for i in range(20):
        if limiter.allow_request():
            print(f"Request {i+1}: ALLOWED")
        else:
            print(f"Request {i+1}: DENIED, waiting...")
            time.sleep(0.5) # Wait a bit for tokens to refill
        time.sleep(0.1) # Simulate some interval between requests

资源配额 (Resource Quotas)： 对于多租户系统，编排器可以为不同的租户或任务类型设置独立的资源配额，限制它们可以同时运行的代理数量或可以消费的任务量，防止“吵闹的邻居”问题。
背压机制 (Backpressure)： 当代理集群处理能力不足，任务队列积压严重时，编排器应减缓新任务的生成或分发速度。这可以通过监控队列深度来实现：当队列深度超过阈值时，编排器暂停或降低新任务的提交速率。

4. 外部资源并发控制

批处理任务经常需要访问共享的外部资源，如数据库、文件存储、第三方API。这些资源的并发访问也需要严格控制。

数据库连接池： 限制应用（代理）与数据库之间建立的并发连接数，防止数据库过载。
API调用速率限制： 遵守外部API提供商的速率限制，避免被封禁。可以在代理内部实现，也可以通过一个共享的代理层（如API网关）实现。
共享资源锁 (分布式锁)： 当多个代理可能尝试修改同一份数据或执行某个关键操作时，需要分布式锁来保证原子性。Redis 或 ZooKeeper 是常见的分布式锁实现。

示例代码：基于 Redis 的分布式锁

import redis
import time
import uuid

class RedisDistributedLock:
    def __init__(self, redis_host='localhost', redis_port=6379, redis_db=0):
        self.redis_client = redis.StrictRedis(host=redis_host, port=redis_port, db=redis_db, decode_responses=True)

    def acquire_lock(self, lock_name, acquire_timeout=10, lock_timeout=5):
        """
        尝试获取分布式锁。
        :param lock_name: 锁的名称
        :param acquire_timeout: 尝试获取锁的超时时间（秒）
        :param lock_timeout: 锁的自动释放时间（秒），防止死锁
        :return: 锁的值 (唯一标识) 如果获取成功，否则为 None
        """
        identifier = str(uuid.uuid4()) # 锁的唯一标识，用于安全释放
        end_time = time.time() + acquire_timeout
        while time.time() < end_time:
            # 使用 SET NX EX 命令原子性地设置锁
            # NX: Only set the key if it does not already exist.
            # EX: Set the specified expire time, in seconds.
            if self.redis_client.set(lock_name, identifier, ex=lock_timeout, nx=True):
                return identifier
            time.sleep(0.01) # 短暂等待后重试
        return None

    def release_lock(self, lock_name, identifier):
        """
        释放分布式锁。
        只有持有正确标识符的客户端才能释放锁。
        :param lock_name: 锁的名称
        :param identifier: 获取锁时返回的唯一标识
        :return: True如果释放成功，否则为False
        """
        # 使用Lua脚本保证检查和删除的原子性
        lua_script = """
        if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("del", KEYS[1])
        else
            return 0
        end
        """
        result = self.redis_client.eval(lua_script, 1, lock_name, identifier)
        return result == 1

# Example Usage:
if __name__ == "__main__":
    lock_manager = RedisDistributedLock()
    resource_id = "shared_resource_A"

    print(f"Agent 1 trying to acquire lock for {resource_id}...")
    id1 = lock_manager.acquire_lock(resource_id, acquire_timeout=5, lock_timeout=10)
    if id1:
        print(f"Agent 1 acquired lock with identifier: {id1}")
        # Simulate critical section work
        time.sleep(3)
        print("Agent 1 finishing critical section.")
        lock_manager.release_lock(resource_id, id1)
        print("Agent 1 released lock.")
    else:
        print("Agent 1 failed to acquire lock.")

    # Simulate another agent trying to acquire the same lock
    print(f"nAgent 2 trying to acquire lock for {resource_id}...")
    id2 = lock_manager.acquire_lock(resource_id, acquire_timeout=5, lock_timeout=10)
    if id2:
        print(f"Agent 2 acquired lock with identifier: {id2}")
        time.sleep(2)
        print("Agent 2 finishing critical section.")
        lock_manager.release_lock(resource_id, id2)
        print("Agent 2 released lock.")
    else:
        print("Agent 2 failed to acquire lock (might be held by Agent 1 or expired).")

错误隔离策略

在处理数百万条任务时，错误是不可避免的。网络波动、数据库瞬时故障、第三方API不稳定、甚至任务本身的数据问题都可能导致失败。关键在于如何设计系统，使得单个任务的失败或少数代理的崩溃不会蔓延，导致整个系统瘫痪。这就是错误隔离的核心目标。

1. 任务级别的错误处理

这是最细粒度的错误处理，发生在任务执行的业务逻辑内部。

Try-Except 块： 在任务处理函数内部，使用异常捕获机制来处理预期的错误情况。
- 瞬时错误 (Transient Errors)： 如网络超时、数据库连接中断。这些错误通常可以通过重试解决。
- 持久错误 (Permanent Errors)： 如无效的输入数据、业务逻辑错误。这些错误重试也无济于事，应标记为失败并记录。
优雅降级： 当遇到某些非关键性错误时，可以尝试执行备用逻辑或使用默认值，而不是直接失败整个任务。
错误码/状态： 任务处理结果应包含清晰的错误码或状态，方便调度器或监控系统进行后续处理。

2. 代理级别的错误隔离

代理是执行任务的最小自治单元，其自身的稳定性对整个系统至关重要。

沙箱机制 (Sandboxing)： 隔离潜在有害或资源密集型任务，防止它们影响代理上的其他任务。
- 进程隔离： 使用 subprocess 模块或 multiprocessing 库在独立的进程中运行任务。一个进程的崩溃不会导致其他进程崩溃。
- 容器化 (Containerization)： 将每个代理运行在一个独立的Docker容器中，或在Kubernetes Pod中。容器提供了强大的资源隔离（CPU、内存、网络），即使容器内部的任务崩溃，也只影响该容器。
心跳机制与健康检查 (Heartbeats & Health Checks)：
- 代理定期向调度器或服务注册中心发送心跳，表明自己仍然存活并正常工作。
- 调度器定期对代理执行健康检查（如HTTP GET请求），检查其是否响应，是否能正常拉取任务。
- 对于长时间无心跳或健康检查失败的代理，调度器应将其标记为不健康，并停止向其分配新任务，甚至触发其重启或替换。
资源限制 (Resource Limits)：
- 在部署代理时，通过 cgroups (Linux) 或 Kubernetes 的资源限制（requests 和 limits）来限制代理的CPU、内存使用。这可以防止单个代理因某个任务失控而耗尽宿主机的资源。
线程/进程隔离： 如前所述，使用 ProcessPoolExecutor 可以确保一个任务工作进程的崩溃不会影响代理主进程或其他工作进程。

示例代码：Agent 使用 multiprocessing 进行进程隔离

（此代码基于前面 AgentWorker 类的修改，省略了部分重复逻辑）

# ... (logging setup, simulate_task_processing function, RedisRateLimiter, RedisDistributedLock from previous examples) ...

class TaskAgent:
    def __init__(self, agent_id, redis_host='localhost', redis_port=6379, redis_db=0,
                 concurrency_type='thread', max_workers=5):
        self.agent_id = agent_id
        self.redis_client = redis.StrictRedis(host=redis_host, port=redis_port, db=redis_db)
        self.task_queue_key = "batch_tasks"
        self.dead_letter_queue_key = "dlq_batch_tasks"
        self.processing_tasks_key = f"agent:{agent_id}:processing" # Set to track tasks currently processed by this agent
        self.lock_manager = RedisDistributedLock(redis_host, redis_port, redis_db) # For external resource locking

        if concurrency_type == 'thread':
            self.executor = ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix=f"Agent-{agent_id}-Worker")
            logger.info(f"Agent {agent_id} initialized with ThreadPoolExecutor (max_workers={max_workers}).")
        elif concurrency_type == 'process':
            self.executor = ProcessPoolExecutor(max_workers=max_workers)
            logger.info(f"Agent {agent_id} initialized with ProcessPoolExecutor (max_workers={max_workers}).")
        else:
            raise ValueError("concurrency_type must be 'thread' or 'process'")

        self.running = True
        self.processed_count = 0

    def _execute_task_with_retry_and_dlq(self, task_str):
        """
        包装任务执行，包含重试和死信队列逻辑。
        这是一个在工作线程/进程中执行的函数。
        """
        task_payload = json.loads(task_str)
        task_id = task_payload['task_id']
        max_retries = task_payload.get('max_retries', 3)

        # Record task as processing by this agent
        self.redis_client.sadd(self.processing_tasks_key, task_id)

        try:
            # Simulate external resource access with distributed lock for some tasks
            if random.random() < 0.05:
                lock_id = self.lock_manager.acquire_lock(f"task_resource_lock:{task_id}", acquire_timeout=2, lock_timeout=5)
                if not lock_id:
                    raise RuntimeError(f"Could not acquire lock for task {task_id}")
                logger.info(f"[{task_id}] Acquired distributed lock.")
                # Perform critical section operations
                time.sleep(random.uniform(0.1, 0.2)) 
                self.lock_manager.release_lock(f"task_resource_lock:{task_id}", lock_id)
                logger.info(f"[{task_id}] Released distributed lock.")

            result = simulate_task_processing(task_payload)
            logger.info(f"[{task_id}] Task completed successfully.")
            # In a real system, store result in DB or S3
            # self.redis_client.hset(f"task_results", task_id, json.dumps(result))
            self.processed_count += 1
            return True
        except (RuntimeError, ValueError) as e:
            current_retries = task_payload.get('retries', 0)
            task_payload['retries'] = current_retries + 1
            task_payload['last_error'] = str(e)

            if task_payload['retries'] <= max_retries:
                retry_delay = 2 ** task_payload['retries'] # Exponential backoff
                logger.warning(f"[{task_id}] Error: {e}. Retrying in {retry_delay}s (attempt {task_payload['retries']}/{max_retries}).")
                time.sleep(retry_delay) # Simulate backoff delay before re-queueing
                self.redis_client.rpush(self.task_queue_key, json.dumps(task_payload)) # Re-queue for retry
            else:
                logger.critical(f"[{task_id}] Failed after {max_retries} retries. Moving to DLQ.")
                self.redis_client.rpush(self.dead_letter_queue_key, json.dumps(task_payload)) # Move to DLQ
            return False
        except Exception as e:
            logger.exception(f"[{task_id}] Unexpected error during task processing.")
            task_payload['last_error'] = str(e)
            self.redis_client.rpush(self.dead_letter_queue_key, json.dumps(task_payload)) # Unexpected errors also go to DLQ
            return False
        finally:
            # Always remove task from processing set when it's done (success or final failure)
            self.redis_client.srem(self.processing_tasks_key, task_id)

    def start_consuming(self):
        logger.info(f"Agent {self.agent_id} starting consumption. PID: {os.getpid()}")
        pending_futures = set()

        while self.running:
            # Try to add new tasks to the executor until max_workers is reached or queue is empty
            while len(pending_futures) < self.executor._max_workers:
                task_item = self.redis_client.blpop(self.task_queue_key, timeout=0.5) # Blocking pop with timeout
                if task_item:
                    _, task_str = task_item
                    future = self.executor.submit(self._execute_task_with_retry_and_dlq, task_str.decode('utf-8'))
                    pending_futures.add(future)
                else:
                    # No tasks in queue, break to check existing futures
                    break

            # Check for completed tasks
            if pending_futures:
                done_futures = {f for f in pending_futures if f.done()}
                for future in done_futures:
                    try:
                        future.result() # Retrieve result, re-raises exceptions from worker
                    except Exception as e:
                        logger.error(f"[{self.agent_id}] A worker future raised an unhandled exception: {e}")
                    pending_futures.remove(future)
            else:
                # If no pending futures and no new tasks, sleep briefly to avoid busy-waiting
                if not self.redis_client.exists(self.task_queue_key) and self.redis_client.scard(self.processing_tasks_key) == 0:
                    logger.debug(f"[{self.agent_id}] No tasks in queue and no active processing. Sleeping...")
                    time.sleep(1)

            # Periodically log processed count for monitoring
            if self.processed_count > 0 and self.processed_count % 100 == 0:
                logger.info(f"[{self.agent_id}] Processed {self.processed_count} tasks so far. Active futures: {len(pending_futures)}")

    def stop_consuming(self):
        logger.info(f"Agent {self.agent_id} stopping consumption.")
        self.running = False
        self.executor.shutdown(wait=True) # Wait for all submitted tasks to complete
        logger.info(f"Agent {self.agent_id} shut down. Total processed: {self.processed_count}")

# To run this example:
# 1. Ensure Redis is running (e.g., `docker run --name some-redis -p 6379:6379 -d redis`)
# 2. Run the TaskProducer (e.g., `python producer.py`) to fill the 'batch_tasks' queue.
# 3. Run multiple agent instances in separate terminals:
#    `python agent.py agent-001 thread 5` (Agent 1, using threads, 5 workers)
#    `python agent.py agent-002 process 3` (Agent 2, using processes, 3 workers)
#    `python agent.py agent-003 thread 8` (Agent 3, using threads, 8 workers)
# 4. Observe logs for task processing, retries, and DLQ messages.

3. 编排器级别的错误恢复与隔离

编排器在系统层面提供更高级的错误恢复和隔离机制。

重试机制 (Retry Mechanisms)：
- 瞬时错误重试： 对于网络故障、数据库死锁等瞬时错误，编排器应自动重试任务。
- 指数退避 (Exponential Backoff)： 每次重试失败后，逐渐增加重试间隔时间，例如 1s, 2s, 4s, 8s… 避免对故障服务造成持续压力。
- 最大重试次数： 设置任务的最大重试次数，防止无限重试。
- 死信队列 (Dead Letter Queue, DLQ)： 当任务达到最大重试次数后仍然失败，或遇到无法重试的持久错误时，将其移至死信队列。DLQ中的任务可以用于人工分析、修复数据或触发告警。
熔断器 (Circuit Breaker Pattern)：
- 当对某个下游服务（如外部API、特定数据库）的调用持续失败时，熔断器模式会“打开”电路，在一段时间内停止所有对该服务的调用，直接返回失败。
- 这可以防止故障服务被持续压垮，并给它恢复的时间。一段时间后，熔断器会进入半开状态，允许少量请求通过，如果成功则关闭，否则继续打开。
幂等性 (Idempotency)：
- 设计任务时确保其操作是幂等的，即多次执行同一个任务（带有相同的输入）只会产生一次效果。这对于重试机制至关重要，可以避免重复处理导致的数据不一致。
- 实现方法：使用唯一的任务ID作为操作的键，进行条件更新；在写入前检查记录是否存在；使用事务。
监控与告警 (Monitoring & Alerting)：
- 关键指标： 监控任务成功率、失败率、重试率、处理延迟、队列深度、代理健康状态等。
- 告警： 当指标超出预设阈值时，及时触发告警，通知运维人员介入。例如，DLQ深度持续增加、任务失败率飙升、代理大量崩溃等。
回滚/补偿机制 (Rollback/Compensation)：
- 对于涉及多个步骤且需要强一致性的复杂业务流程，如果某个步骤失败，可能需要回滚之前已成功的步骤，或执行补偿事务来撤销已完成的操作。

实践案例与高级考量

将这些策略落地时，还需要考虑一些高级因素。

数据一致性： 大部分批处理场景接受最终一致性。但对于需要强一致性的场景，可能需要分布式事务或更复杂的协调机制（如两阶段提交），但通常会牺牲一部分性能和可用性。
任务状态管理： 维护任务的完整生命周期状态（Pending, Processing, Succeeded, Failed, Retrying, Aborted）。这些状态通常存储在数据库中，供调度器查询和更新。
可观测性 (Observability)：
- 日志聚合： 将所有代理和编排器的日志集中收集到ELK Stack、Splunk等系统中，便于搜索、分析和故障排查。
- 指标收集： 使用Prometheus、Grafana等工具收集并可视化系统的各项指标，提供仪表盘。
- 分布式追踪： 使用OpenTelemetry、Zipkin等工具追踪一个任务在整个分布式系统中的调用链，帮助定位性能瓶颈和错误源。
弹性伸缩 (Auto-scaling)： 根据任务队列的深度、代理的CPU/内存利用率等指标，自动增加或减少代理实例的数量。Kubernetes的Horizontal Pod Autoscaler (HPA) 是实现这一目标的利器。
蓝绿部署/金丝雀发布 (Blue/Green, Canary Releases)： 在升级代理代码时，采用这些部署策略可以最大限度地减少风险。先将新版本部署到一小部分流量或一个独立的集群，验证无误后再逐步扩大。

展望未来

批处理编排领域仍在不断演进。

机器学习驱动的调度： 利用ML模型预测任务的资源需求和执行时间，优化调度决策，提高资源利用率。
无服务器函数 (Serverless Functions) 作为代理： AWS Lambda, Google Cloud Functions, Azure Functions 等无服务器平台可以作为轻量级、按需付费的代理，极大地简化了运维和弹性伸缩。
更复杂的编排工具： Apache Airflow, Temporal, Cadence 等工具提供了强大的DAG (Directed Acyclic Graph) 任务定义、工作流版本控制、状态持久化和重试机制，使得构建复杂的批处理工作流变得更加容易。

在处理数百万条离线任务时，并发控制与错误隔离是构建稳定、高效、可扩展批处理系统的基石。通过在代理、队列和编排器等不同层面实施精细的策略，并结合持续的监控与可观测性，我们能够从容应对大规模任务处理带来的挑战，确保数据处理的最终成功。