Exception Propagation in Coroutines: How to Catch and Handle Errors Gracefully in Asynchronous Flows

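Good afternoon, everyone, and welcome to today's technical lecture.

Today we will take a close look at a topic that is central to modern asynchronous programming: how exceptions propagate in coroutines. As our systems grow more complex, asynchronous operations and concurrent tasks are everywhere, and how gracefully and efficiently we catch and handle errors in these asynchronous flows is a key measure of an application's robustness. As professional programmers, we need code that not only runs, but also copes properly when something goes wrong instead of crashing, hanging, or producing bugs that are hard to trace.

We will use Kotlin coroutines as our example, examine how their exception handling works internally, and share a set of practical patterns and best practices for building more stable, reliable asynchronous applications.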

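Chapter 1: The Foundations of Coroutines and the Challenges of Exception Handling

Before we dig into exception propagation, we need a clear picture of what coroutines are. Many languages implement them, including Kotlin, Python and JavaScript (through async/await). They offer a lightweight concurrency model that lets us write asynchronous code that reads like synchronous code.

1.1 Coroutines: The Magic of Lightweight Concurrency

Compared with traditional threads, coroutines have several distinctive traits:

Lightweight: Creating and switching coroutines costs far less than creating and switching threads, so a single application can run thousands of coroutines at once.
Cooperative: A coroutine gives up the CPU through explicit suspend and resume operations rather than being preempted by the operating system scheduler as threads are. This is more efficient, but it also places new demands on the programming model.
Structured concurrency: This is one of the core design ideas of Kotlin coroutines. It enforces parent-child relationships between coroutines, which makes lifecycle management and error propagation much easier to control.

1.2 The Pain of Exceptions in Asynchronous Programming

In traditional synchronous programming, exception handling is fairly intuitive: when a function throws and does not catch the exception itself, the exception travels up the call stack until a try-catch block catches it or the program terminates.

In asynchronous programming, however, the traditional notion of a call stack breaks down. A coroutine may pause at a suspension point, hand control to other coroutines, and resume at some later moment, possibly on a different thread. This means:

A "broken" call stack: Exceptions no longer simply bubble up one continuous call stack.
Independence versus coupling of concurrent tasks: Several coroutines may run in parallel. Should the failure of one affect the others?
"Lost" exceptions: If an asynchronous task throws and nothing catches the exception, it can be swallowed silently, leading to behavior that is hard to diagnose.
Resource leaks: If resources are not released promptly when an exception occurs, the result can be memory leaks, unclosed file handles, and similar problems.

These challenges make exception handling in coroutines something that has to be designed with care.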

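Chapter 2: Exception Handling Basics in Kotlin Coroutines

Kotlin coroutines provide a powerful and flexible set of mechanisms for handling exceptions in asynchronous operations. At their core are structured concurrency and the Job hierarchy.

2.1 `Job`: Coroutine Lifecycle and Parent-Child Relationships

In Kotlin coroutines, a Job is the handle to a coroutine. It represents a cancellable unit of computation and manages the coroutine's lifecycle (from active to completed or cancelled). More importantly, Jobs form a parent-child hierarchy.

Parent-child relationship: When you start a new coroutine inside a CoroutineScope, the new coroutine's Job automatically becomes a child of the scope's Job.
Propagation rules:
- Cancelling a parent Job cancels its children. This is the heart of structured concurrency: it guarantees resource cleanup and prevents zombie coroutines.
- A failed child Job cancels its parent. If a parent coroutine starts a child and the child throws an uncaught exception, the exception travels upward and cancels the parent together with all of its other children.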

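Let's look at a code example to understand these propagation rules:

import kotlinx.coroutines.*

fun main() = runBlocking {
    try {
        // coroutineScope acts as the parent: it waits for all of its children
        // and rethrows the first child failure to its caller.
        coroutineScope {
            println("Parent scope started.")

            // Child 1: would finish at 100 ms, but is cancelled at about 50 ms
            launch {
                delay(100)
                println("Child Coroutine 1 finished successfully. (Will it ever?)")
            }

            // Child 2: throws an exception
            launch {
                delay(50)
                println("Child Coroutine 2 started, will throw exception.")
                throw IllegalArgumentException("Something went wrong in Child 2!")
            }

            // Child 3: never runs to completion
            launch {
                delay(200)
                println("Child Coroutine 3 finished successfully. (Will it ever?)")
            }
        }
        println("Parent scope completed.")
    } catch (e: IllegalArgumentException) {
        println("Caught exception from parent scope: ${e.message}")
    }
}

Output:

Parent scope started.
Child Coroutine 2 started, will throw exception.
Caught exception from parent scope: Something went wrong in Child 2!

From the output we can see:

Child Coroutine 2 throws, and its exception travels up to the parent.
The parent scope receives the child's exception and is itself cancelled.
The parent's cancellation propagates to its other children, Child Coroutine 1 and Child Coroutine 3.
Both are still suspended in delay() at that moment (Child Coroutine 1 until 100 ms, Child Coroutine 3 until 200 ms), so neither println ever runs.
Finally, coroutineScope rethrows the original exception to its caller, where the try-catch handles it.

Note that waiting for a launched coroutine with Job.join() would not work here: join() only waits for completion and never rethrows the Job's exception. Had the children been launched directly in runBlocking, the failure would have cancelled runBlocking itself and been thrown out of main.

This behavior is structured concurrency at work: logically related tasks are managed as one unit, and when one part fails, the whole unit is dealt with properly.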

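2.2 How `launch` and `async` Differ in Exception Behavior

In Kotlin coroutines we normally start new coroutines with launch and async. The two differ in one key respect when it comes to exceptions:

2.2.1 `launch`: Exceptions Propagate Immediately

launch starts a coroutine and returns no result. It is mainly used for fire-and-forget tasks or for work that is done for its side effects.
When a coroutine started with launch throws an uncaught exception, the exception propagates to its parent Job immediately. If it is a root coroutine (no parent Job), the exception is handled by the CoroutineExceptionHandler in its context or, if there is none, passed to the thread's uncaught exception handler.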

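Example:

import kotlinx.coroutines.*

fun main() = runBlocking {
    val handler = CoroutineExceptionHandler { _, e ->
        println("Handler caught: ${e.message}")
    }
    // An independent scope whose Job has no parent
    val scope = CoroutineScope(Job() + handler)

    // Start a coroutine with launch that will throw
    val job = scope.launch {
        println("Launch coroutine started.")
        delay(100)
        throw IllegalStateException("Exception from launched coroutine!")
    }

    try {
        job.join() // waits for the job to finish; does NOT rethrow its exception
        println("join() returned normally, job.isCancelled = ${job.isCancelled}")
    } catch (e: Exception) {
        println("Never reached: ${e.message}")
    }

    scope.cancel() // clean up the scope
}

Output:

Launch coroutine started.
Handler caught: Exception from launched coroutine!
join() returned normally, job.isCancelled = true

The try-catch around job.join() catches nothing: join() only waits for the Job to finish and does not rethrow the exception it failed with. The exception is delivered to the CoroutineExceptionHandler instead. Without a handler it would go to the thread's uncaught exception handler, which by default prints the stack trace.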

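2.2.2 `async`: Exceptions Are Delivered Through `await()`

async starts a coroutine and returns a Deferred, which is a Job that carries a result. Calling await() either returns the result or rethrows the exception the coroutine failed with.
How the exception travels depends on where the async coroutine sits in the Job hierarchy:
- As a root coroutine (for example GlobalScope.async) or a direct child of a supervisor (SupervisorJob or supervisorScope), the exception is stored in the Deferred and rethrown only when await() is called. If await() is never called, the exception is silently dropped; a CoroutineExceptionHandler is never invoked for async.
- As a child of an ordinary Job (for example async called directly inside runBlocking, launch or coroutineScope), the failure is additionally propagated to the parent right away, cancelling it and its other children whether or not await() is called. A try-catch around await() alone does not stop that propagation.

Example (using supervisorScope so that the deferred behavior is visible):

import kotlinx.coroutines.*
import java.io.IOException

fun main() = runBlocking {
    supervisorScope {
        val deferred = async<Int> {
            println("Async coroutine started.")
            delay(100)
            throw ArithmeticException("Division by zero in async!")
        }

        println("After async launch, before await.")
        try {
            val result = deferred.await() // the exception is rethrown here
            println("Result: $result")
        } catch (e: ArithmeticException) {
            println("Caught exception from async: ${e.message}")
        }

        println("Main coroutine continues.")

        // A case where await() is never called
        val deferredNoAwait = async<Int> {
            println("Async coroutine (no await) started.")
            delay(50)
            throw IOException("Network error (no await)!")
        }
        // deferredNoAwait.await() // deliberately not called

        delay(200) // give deferredNoAwait time to run
        println("deferredNoAwait.isCancelled = ${deferredNoAwait.isCancelled}")
    }
    println("Main coroutine finished.")
}

Output:

After async launch, before await.
Async coroutine started.
Caught exception from async: Division by zero in async!
Main coroutine continues.
Async coroutine (no await) started.
deferredNoAwait.isCancelled = true
Main coroutine finished.

Here we can see:

Both async coroutines are direct children of supervisorScope, so a failure does not cancel the scope.
The first async's exception is rethrown by deferred.await() and caught by the try-catch around it.
The second async, deferredNoAwait, also fails, but because await() is never called, nobody ever sees the exception. This is exactly the pitfall to watch for! The Deferred ends up in the cancelled state, and no CoroutineExceptionHandler is invoked.
Had the same async calls been made directly inside runBlocking (an ordinary parent Job), each failure would have cancelled runBlocking immediately, and the exception would have been thrown out of main regardless of the try-catch.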

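Table: Exception behavior of launch vs. async

Feature	`launch`	`async`
Return value	`Job` (no result)	`Deferred<T>` (a `Job` with a result)
When the exception propagates	Immediately, to the parent `Job` or, for a root coroutine, to the `CoroutineExceptionHandler`	Stored in the `Deferred` and rethrown by `await()`; a child of an ordinary `Job` also cancels its parent immediately
`try-catch`	Inside the `launch` block, or around an enclosing `coroutineScope`; `Job.join()` does not rethrow	Around `deferred.await()`, or around an enclosing `coroutineScope` when `async` is a child
Uncaught exception	Cancels the parent `Job` and travels up the hierarchy	For a root or supervised `async`, silently dropped if `await()` is never called
Typical use	Side-effect tasks whose result is not needed	Computations that return a result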
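To make the `try-catch` row concrete, here is a minimal sketch of handling a failing child `async` by wrapping it in `coroutineScope`. The names `failingLoad` and `loadBoth` are hypothetical and exist only for this illustration.

```kotlin
import kotlinx.coroutines.*

// Hypothetical loader that always fails; stands in for any suspending call.
suspend fun failingLoad(): Int {
    delay(50)
    throw IllegalStateException("load failed")
}

// Runs two async children inside coroutineScope. If either child fails,
// coroutineScope cancels the other one and rethrows the original exception,
// so a single try-catch around the scope is enough.
suspend fun loadBoth(): String =
    try {
        coroutineScope {
            val a = async { failingLoad() }
            val b = async { delay(1_000); 2 }
            "sum=${a.await() + b.await()}"
        }
    } catch (e: IllegalStateException) {
        "failed: ${e.message}"
    }
```

Calling `runBlocking { loadBoth() }` should return the fallback string without cancelling the caller, because the failure stays inside `coroutineScope` and is rethrown to the surrounding try-catch.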

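2.3 `CoroutineExceptionHandler`: Catching Exceptions from Root Coroutines

CoroutineExceptionHandler is a special CoroutineContext element for catching unhandled exceptions in root coroutines.

Root coroutine: A coroutine that has no parent Job, or whose parent is a SupervisorJob (discussed later).
Purpose: It gives you a hook for logging, reporting, or other global handling once every other exception handling mechanism has been bypassed.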

How to use it:

import kotlinx.coroutines.*

@OptIn(DelicateCoroutinesApi::class)
fun main() = runBlocking {
    val handler = CoroutineExceptionHandler { _, exception ->
        println("Handler caught: ${exception.message}")
    }
    val childHandler = CoroutineExceptionHandler { _, exception ->
        println("Child handler caught: ${exception.message}")
    }

    // Scenario 1: a root coroutine with a handler
    val job1 = GlobalScope.launch(handler) {
        println("Root coroutine started.")
        delay(100)
        throw IllegalStateException("Exception from root coroutine!")
    }
    job1.join()

    // Scenario 2: a handler installed on a child coroutine is ignored.
    // The child's exception goes to its parent and ends up in the root's handler.
    val job2 = GlobalScope.launch(handler) {
        println("Parent (root) started.")
        launch(childHandler) { // has a handler, but its parent is an ordinary Job
            delay(50)
            throw IllegalArgumentException("Exception from child!")
        }
        delay(200)
        println("Parent finished. (never printed)")
    }
    job2.join()

    println("End of main.")
}

Output:

Root coroutine started.
Handler caught: Exception from root coroutine!
Parent (root) started.
Handler caught: Exception from child!
End of main.

Scenario 2 shows that the child's own CoroutineExceptionHandler never sees the exception, because the child has an ordinary (non-supervisor) parent Job. When the child fails, structured concurrency sends the exception up to the parent, which is cancelled; since the parent is a root coroutine, the exception is finally handled by the handler installed on the parent. The same applies to coroutines launched directly inside runBlocking: they are children of runBlocking, so a handler passed to launch there has no effect, and the exception is thrown by runBlocking itself.

Key takeaway: a CoroutineExceptionHandler takes effect only in these two cases:

The coroutine is a root coroutine (for example one started with GlobalScope.launch, or one started in a CoroutineScope whose Job has no parent).
The coroutine is a direct child of a SupervisorJob or supervisorScope. Its exception does not travel upward; it is passed to the CoroutineExceptionHandler in the coroutine's own context (inherited from the scope or passed to launch), if there is one.

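Chapter 3: Advanced Exception Handling Mechanisms and Patterns

Now that we understand the Job hierarchy, the difference between launch and async, and CoroutineExceptionHandler, we can move on to more advanced mechanisms and practical patterns.

3.1 `SupervisorJob`: Letting Children Fail Independently

SupervisorJob is a special implementation of Job that changes the exception propagation rules.

Characteristics: When a child fails, a SupervisorJob is not cancelled by it and does not cancel the child's siblings. Children are allowed to fail independently.
Use: It suits situations where the failure of one task should not affect the whole, for example loading several independent blocks of data on one UI screen, where one failed block should not bring down the entire screen.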

How to use it:

import kotlinx.coroutines.*

fun main() = runBlocking {
    val supervisor = SupervisorJob()
    // The SupervisorJob replaces runBlocking's Job in the scope's context
    val scope = CoroutineScope(coroutineContext + supervisor)

    val handler = CoroutineExceptionHandler { _, exception ->
        println("Caught exception in SupervisorJob handler: ${exception.message}")
    }

    // Scenario 1: one child fails without affecting the others
    val child1 = scope.launch(handler) { // direct child of the supervisor, so its handler is used
        println("Child 1 (supervisor) started.")
        delay(100)
        throw IllegalArgumentException("Error in child 1!")
    }

    val child2 = scope.launch {
        println("Child 2 (supervisor) started.")
        delay(200)
        println("Child 2 (supervisor) finished successfully.")
    }

    child1.join() // wait for child1 to complete (or fail)
    child2.join() // wait for child2 to complete (or fail)

    println("All supervisor children attempted.")

    // Scenario 2: cancelling the SupervisorJob itself still cancels all of its children
    val newScope = CoroutineScope(coroutineContext + SupervisorJob())
    val child3 = newScope.launch {
        println("Child 3 (new supervisor scope) started.")
        delay(500)
        println("Child 3 (new supervisor scope) finished.")
    }
    val child4 = newScope.launch {
        println("Child 4 (new supervisor scope) started.")
        delay(500)
        throw IllegalStateException("Error in child 4!") // never reached
    }

    // Cancel the SupervisorJob explicitly (here by cancelling its scope)
    delay(100) // give the children time to start
    newScope.cancel("Cancelling new supervisor scope explicitly.")
    println("New supervisor scope cancelled.")

    // join() returns normally for a cancelled Job, so inspect its state instead
    child3.join()
    println("Child 3 cancelled: ${child3.isCancelled}")
    child4.join()
    println("Child 4 cancelled: ${child4.isCancelled}")
    println("End of main.")
}

Output:

Child 1 (supervisor) started.
Child 2 (supervisor) started.
Caught exception in SupervisorJob handler: Error in child 1!
Child 2 (supervisor) finished successfully.
All supervisor children attempted.
Child 3 (new supervisor scope) started.
Child 4 (new supervisor scope) started.
New supervisor scope cancelled.
Child 3 cancelled: true
Child 4 cancelled: true
End of main.

Key observations:

Under a SupervisorJob, child1 fails but child2 still completes successfully.
child1's exception is delivered to the CoroutineExceptionHandler passed to its launch call instead of travelling upward and cancelling runBlocking.
If the SupervisorJob itself is cancelled (or its CoroutineScope is cancelled), it still cancels all of its children.

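When should you use SupervisorJob?

When the failure of one independent task should not affect the others.
For concurrent work at the UI level, such as loading several images or data sets in parallel, where one failure should not take the whole screen down.
Together with a CoroutineExceptionHandler, so that child failures are logged or reported in one central place instead of propagating elsewhere.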
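The "independent parallel loads" case above can be sketched with `supervisorScope`, the scoped counterpart of `SupervisorJob`. This is only an illustration under assumptions: `loadSections` and its `loadSection` parameter are hypothetical names, not part of any library.

```kotlin
import kotlinx.coroutines.*

// Loads several independent sections in parallel. A failed section becomes
// a fallback string instead of cancelling its siblings.
// `loadSection` is a hypothetical suspending loader supplied by the caller.
suspend fun loadSections(
    names: List<String>,
    loadSection: suspend (String) -> String
): List<String> = supervisorScope {
    // Each async is a direct child of the supervisor scope.
    val deferreds = names.map { name -> async { loadSection(name) } }
    deferreds.mapIndexed { i, d ->
        try {
            d.await()
        } catch (e: CancellationException) {
            throw e // never swallow cancellation
        } catch (e: Exception) {
            "${names[i]}: unavailable (${e.message})"
        }
    }
}
```

Because every `async` here is a direct child of the supervisor scope, a failure stays inside its own `Deferred` until `await()` is called, which is where the per-section try-catch picks it up.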

3.2 Exception Aggregation

What does the coroutine framework do when several children of the same Job fail?

The rule is that the first exception wins. When the first child fails, the parent is cancelled and cancels the remaining children. If one of them throws another exception while it is being cancelled (for example from a finally block), that exception is attached to the first one as a suppressed exception. On the JVM you can read it from the suppressed property of the first exception.

CancellationException is treated specially: cancellation exceptions are ignored during aggregation, so a sibling that simply ends with a CancellationException adds nothing to the result.

import kotlinx.coroutines.*
import java.io.IOException

@OptIn(DelicateCoroutinesApi::class)
fun main() = runBlocking {
    val handler = CoroutineExceptionHandler { _, exception ->
        println("Handler got $exception with suppressed ${exception.suppressed.contentToString()}")
    }
    val job = GlobalScope.launch(handler) {
        launch {
            try {
                delay(Long.MAX_VALUE) // cancelled when the sibling below fails
            } finally {
                throw IllegalStateException("Business logic error!") // the second exception
            }
        }
        launch {
            delay(100)
            throw IOException("Network error!") // the first exception
        }
        delay(Long.MAX_VALUE)
    }
    job.join()
}

Sample output (the exact format may vary slightly between versions):

Handler got java.io.IOException: Network error! with suppressed [java.lang.IllegalStateException: Business logic error!]

The handler is invoked exactly once, with the first exception (IOException). The IllegalStateException thrown from the finally block of the cancelled sibling is not lost: it is attached as a suppressed exception. This is very helpful for debugging and for understanding multiple points of failure in a system.

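3.3 Exception Handling in `Flow`

Kotlin Flow is a powerful tool for working with asynchronous data streams. It also provides dedicated operators for handling exceptions in a stream.

3.3.1 The `catch` Operator

The catch operator catches any exception thrown by the upstream Flow; it does not see exceptions thrown downstream of it, such as in collect. Once an exception is caught, no more upstream elements are processed, and you can emit new elements, rethrow the exception, or simply let the flow complete.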

import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*
import java.io.IOException

fun main() = runBlocking {
    flow {
        emit(1)
        emit(2)
        throw IOException("Flow error!")
        emit(3) // never executed
    }
    .onEach { println("Emitting $it") }
    .catch { e: Throwable ->
        println("Caught flow error: ${e.message}")
        emit(-1) // a new element can be emitted after catching
    }
    .collect { value -> println("Collected $value") }

    println("--- Another Flow Example (rethrow) ---")

    try {
        flow {
            emit(10)
            throw IllegalStateException("Critical flow error!")
        }
        .catch { e: Throwable ->
            println("Caught critical flow error: ${e.message}, rethrowing.")
            throw e // rethrow the exception
        }
        .collect { value -> println("Collected $value") }
    } catch (e: Exception) {
        println("Caught rethrown flow error in collector: ${e.message}")
    }
}

Output:

Emitting 1
Collected 1
Emitting 2
Collected 2
Caught flow error: Flow error!
Collected -1
--- Another Flow Example (rethrow) ---
Collected 10
Caught critical flow error: Critical flow error!, rethrowing.
Caught rethrown flow error in collector: Critical flow error!

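3.3.2 The `onCompletion` Operator

The onCompletion operator is invoked when the flow completes, whether normally or because of an exception. It receives a Throwable? parameter that is non-null when the flow completed with an exception. This is very useful for resource cleanup.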

import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

fun main() = runBlocking {
    flow {
        emit("Start processing...")
        delay(100)
        throw RuntimeException("Processing failed!")
    }
    .onEach { println(it) }
    .onCompletion { cause ->
        if (cause != null) {
            println("Flow completed with error: ${cause.message}")
        } else {
            println("Flow completed successfully.")
        }
    }
    .catch { /* swallow the exception so it does not reach collect */ } // without this, collect() would rethrow it
    .collect() // completes normally because catch handled the exception
}

Output:

Start processing...
Flow completed with error: Processing failed!

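3.3.3 The `retry` and `retryWhen` Operators

For recoverable, transient errors, the retry and retryWhen operators let you retry the upstream Flow.

retry(retries: Long): Simply retries up to the given number of times.
retryWhen(predicate: suspend FlowCollector<T>.(cause: Throwable, attempt: Long) -> Boolean): Offers more flexible retry logic; the decision to retry can depend on the exception type, the attempt number (starting at 0), and other conditions.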

import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*
import java.io.IOException

var attempt = 0

fun main() = runBlocking {
    flow {
        attempt++ // count every attempt, including the successful one
        if (attempt <= 2) {
            println("Attempt $attempt: Simulating network error...")
            throw IOException("Network is down!")
        }
        emit("Data fetched successfully on attempt $attempt")
    }
    .retry(2) // retry at most 2 times
    // .retryWhen { cause, attempt ->
    //     if (cause is IOException && attempt < 2) {
    //         delay(100 * (attempt + 1)) // linear backoff: 100 ms, then 200 ms
    //         true // returning true means "retry"
    //     } else {
    //         false // otherwise give up
    //     }
    // }
    .catch { e -> println("Final error after retries: ${e.message}") }
    .collect { println(it) }
}

Output:

Attempt 1: Simulating network error...
Attempt 2: Simulating network error...
Data fetched successfully on attempt 3
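As a rough sketch of how `retryWhen` can express a real exponential backoff, the extension below doubles the delay on every retry. The name `retryWithBackoff` and its parameters are assumptions made for this illustration; only `retryWhen` and `delay` come from kotlinx.coroutines.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*
import java.io.IOException

// Hypothetical helper: retries only IOException, at most `maxRetries` times,
// doubling the delay before each retry. `attempt` starts at 0.
fun <T> Flow<T>.retryWithBackoff(
    maxRetries: Long = 3,
    initialDelayMs: Long = 100
): Flow<T> = retryWhen { cause, attempt ->
    if (cause is IOException && attempt < maxRetries) {
        delay(initialDelayMs shl attempt.toInt()) // initial, 2x, 4x, ...
        true  // retry the upstream flow
    } else {
        false // give up and let the exception continue downstream
    }
}
```

Any exception that is not an IOException, or any failure after the retry budget is used up, continues downstream, where a `catch` operator can still handle it.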

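Table: Flow exception handling operators

Operator	What it does	Notes
`catch`	Catches exceptions from the upstream flow. Can emit new elements, rethrow, or let the flow complete.	Once it catches, the upstream flow stops. Downstream exceptions are not caught.
`onCompletion`	Runs an action when the flow completes, successfully or not; useful for resource cleanup.	Receives a `Throwable?` that is non-null if the flow completed with an exception.
`retry`	Retries the upstream flow a fixed number of times when it fails.	Suited to predictable, transient errors.
`retryWhen`	Uses custom logic based on the exception and the attempt number to decide whether to retry.	Suited to retry strategies that need finer control (such as exponential backoff).
`onErrorCollect`	Switches to another flow after an exception.	Appeared only in early preview versions of Flow; in current releases use `catch` followed by `emitAll` on the fallback flow.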
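The "switch to another flow" row can be sketched with `catch` and `emitAll`. The helper name `withFallback` is hypothetical; it is one possible shape, not an operator provided by the library.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// Hypothetical helper: emits from `primary`; if `primary` fails,
// reports the error and continues with the elements of `fallback`.
fun <T> withFallback(primary: Flow<T>, fallback: Flow<T>): Flow<T> =
    primary.catch { e ->
        println("primary failed: ${e.message}; switching to fallback")
        emitAll(fallback) // the collector keeps receiving elements
    }
```

Elements emitted by `primary` before the failure have already reached the collector, so the fallback's elements are appended after them rather than replacing them.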

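3.4 `withContext` and Exception Handling

withContext is a suspending function that switches the coroutine's context. It runs its block under a new Job that is a child of the current coroutine's Job, and any exception thrown inside the block is rethrown to the caller at the withContext call, in line with structured concurrency.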

import kotlinx.coroutines.*

fun main() = runBlocking {
    try {
        withContext(Dispatchers.IO) {
            println("Inside withContext (IO dispatcher).")
            delay(50)
            throw IllegalStateException("Error from IO context!")
        }
    } catch (e: Exception) {
        println("Caught exception from withContext: ${e.message}")
    }

    println("Main coroutine continues.")
}

Output:

Inside withContext (IO dispatcher).
Caught exception from withContext: Error from IO context!
Main coroutine continues.

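One benefit of withContext is that it forms a natural try-catch boundary: an exception thrown inside the block is thrown out of the withContext call and can be caught by a surrounding try-catch.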

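3.5 Cancellation and Exceptions: `CancellationException`

In coroutines, cancellation is cooperative. When a coroutine is cancelled, its suspending functions throw a CancellationException at the next suspension point. Unlike other exceptions, a CancellationException is not treated as a failure; it is a normal way for a coroutine to stop.

Characteristics:
- A CancellationException does not cancel the parent Job: a child that ends with it counts as cancelled, not failed.
- A CoroutineExceptionHandler does not receive CancellationException.
- You should normally not catch and swallow CancellationException unless you know exactly what you are doing. If you do catch it, rethrow it so that cancellation can keep propagating.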

import kotlinx.coroutines.*

fun main() = runBlocking {
    val job = launch {
        try {
            delay(500)
            println("Coroutine completed.")
        } catch (e: CancellationException) {
            println("Coroutine was cancelled: ${e.message}")
            // resources can be cleaned up here
            // throw e // rethrow if cancellation should keep propagating
        } catch (e: Exception) {
            println("Caught other exception: ${e.message}")
        } finally {
            println("Cleanup in finally block.")
        }
    }

    delay(100)
    job.cancel("Explicitly cancelled by main.") // cancel the job
    job.join()
    println("Main finished.")
}

Output:

Coroutine was cancelled: Explicitly cancelled by main.
Cleanup in finally block.
Main finished.

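Here the CancellationException is caught and the finally block runs to clean up. Because the catch block does not rethrow, any code after the try block would keep running in the already cancelled coroutine until its next suspension point, which is usually not what you want. Rethrow the exception unless you have a specific reason not to.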
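A minimal sketch of the recommended shape follows: rethrow the CancellationException and run any suspending cleanup inside `withContext(NonCancellable)`. The function `cancellableWork` and its `log` parameter are hypothetical and only there to make the behavior observable.

```kotlin
import kotlinx.coroutines.*

// Hypothetical worker that records what happened in `log`.
suspend fun cancellableWork(log: MutableList<String>) {
    try {
        delay(10_000)
        log += "completed"
    } catch (e: CancellationException) {
        log += "cancelled"
        throw e // rethrow so the coroutine really ends as cancelled
    } finally {
        // Suspending calls in a cancelled coroutine throw immediately,
        // so suspending cleanup has to run in NonCancellable.
        withContext(NonCancellable) {
            delay(10) // stands in for a suspending close/flush call
            log += "cleaned up"
        }
    }
}
```

If the coroutine running `cancellableWork` is cancelled while it is suspended in `delay`, the log should end up containing "cancelled" followed by "cleaned up", and the Job finishes in the cancelled state.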

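Chapter 4: Graceful Error Handling in Practice

Combining the mechanisms above gives us a few graceful error handling patterns.

4.1 Error Handling in the UI/ViewModel Layer

In UI applications such as Android apps, a ViewModel or Presenter layer usually manages business logic and UI state.

Use a SupervisorJob: Create a CoroutineScope for the ViewModel with a SupervisorJob as its Job. When one data loading task fails, the ViewModel's other coroutines are not cancelled and can keep running.
Centralized error reporting: Log errors in the scope's CoroutineExceptionHandler and notify the UI so it can show an error message.
try-catch around async/await: If you fetch data with async, handle failures with a try-catch around await(). When the async coroutines are children of the current coroutine, put the try-catch around an enclosing coroutineScope so that the failure does not also cancel the outer coroutine.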

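import kotlinx.coroutines.*
import java.io.IOException

// A simplified ViewModel. On Android you would pass Dispatchers.Main;
// the default here lets the sample run on a plain JVM.
class MyViewModel(
    private val repository: DataRepository,
    dispatcher: CoroutineDispatcher = Dispatchers.Default
) {
    private val viewModelJob = SupervisorJob() // use a SupervisorJob
    private val coroutineScope = CoroutineScope(dispatcher + viewModelJob + CoroutineExceptionHandler { _, throwable ->
        // Handles every uncaught exception from the scope's direct children
        _errorMessage.value = "An unexpected error occurred: ${throwable.localizedMessage}"
        Log.e("MyViewModel", "Unhandled exception in coroutine", throwable)
    })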

    private val _loadingState = MutableLiveData<Boolean>()
    val loadingState: LiveData<Boolean> = _loadingState

    private val _data = MutableLiveData<String>()
    val data: LiveData<String> = _data

    private val _errorMessage = MutableLiveData<String>()
    val errorMessage: LiveData<String> = _errorMessage

    fun loadData() {
        coroutineScope.launch {
            _loadingState.value = true
            _errorMessage.value = null // clear the previous error

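            try {
                // Load several pieces of data in parallel. coroutineScope keeps a
                // failed async from also failing the outer launch: the failure is
                // rethrown here, into the try-catch.
                val (user, products) = coroutineScope {
                    val userDataDeferred = async { repository.fetchUser() }
                    val productDataDeferred = async { repository.fetchProducts() }
                    userDataDeferred.await() to productDataDeferred.await()
                }

                _data.value = "User: ${user.name}, Products: ${products.size}"
            } catch (e: CancellationException) {
                throw e // let cancellation (e.g. from onCleared) propagate
            } catch (e: Exception) {
                // Business-level failure rethrown by coroutineScope
                _errorMessage.value = "Failed to load data: ${e.localizedMessage}"
                Log.e("MyViewModel", "Error loading data", e)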
            } finally {
                _loadingState.value = false
            }
        }
    }

    fun doIndependentTask() {
        coroutineScope.launch { // a failure here does not affect loadData or other tasks
            delay(100)
            println("Doing independent task...")
            throw IllegalStateException("Independent task failed!")
        }
    }

    fun onCleared() {
        viewModelJob.cancel() // cancel all coroutines when the ViewModel is cleared
    }
}

// A mock data repository
class DataRepository {
    suspend fun fetchUser(): User {
        delay(500)
        // Simulate a failure while fetching user data
        if (System.currentTimeMillis() % 2 == 0L) { // fails occasionally
            throw IOException("Network error fetching user!")
        }
        return User("Alice")
    }

    suspend fun fetchProducts(): List<Product> {
        delay(300)
        return listOf(Product("Laptop"), Product("Mouse"))
    }
}

data class User(val name: String)
data class Product(val name: String)

// Stand-ins for Android components such as LiveData and Log
interface LiveData<T> { val value: T? }
class MutableLiveData<T> : LiveData<T> { override var value: T? = null }
object Log { fun e(tag: String, msg: String, t: Throwable?) { println("[$tag] ERROR: $msg, Exception: ${t?.message}") } }

fun main() = runBlocking {
    val viewModel = MyViewModel(DataRepository())
    viewModel.loadData()
    viewModel.doIndependentTask() // try an independent task

    delay(1000) // give the coroutines time to run
    viewModel.onCleared()
}

Sample output (varies because the failure is random):

Doing independent task...
[MyViewModel] ERROR: Unhandled exception in coroutine, Exception: Independent task failed!
[MyViewModel] ERROR: Error loading data, Exception: Network error fetching user!

Here:

MyViewModel's coroutineScope uses a SupervisorJob, so the exception in doIndependentTask does not cancel loadData.
The scope's CoroutineExceptionHandler catches the exception from doIndependentTask.
The try-catch in loadData catches the exception caused by repository.fetchUser() and updates _errorMessage.

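4.2 Wrapping Errors in the Business Logic Layer

In lower-level business logic or data layers, we usually do not want to throw low-level exceptions such as IOException or SQLException directly. Instead, we wrap them in domain-specific exceptions or use a Result type to represent success or failure.

Using the Result type:
The Kotlin standard library provides Result<T>, which neatly wraps either a success value or an exception.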

import kotlinx.coroutines.*
import java.io.IOException

sealed class CustomError(message: String) : Exception(message) {
    class NetworkError(message: String, val code: Int? = null) : CustomError(message)
    class BusinessLogicError(message: String, val errorCode: String? = null) : CustomError(message)
    // `cause` already exists on Throwable, so it must be overridden
    class UnknownError(message: String, override val cause: Throwable? = null) : CustomError(message)
}

class UserService(private val apiService: ApiService) {
    suspend fun getUserProfile(userId: String): Result<User> {
        return try {
            val dto = apiService.fetchUserDto(userId)
            Result.success(dto.toDomain())
        } catch (e: CancellationException) {
            throw e // never wrap cancellation in a Result
        } catch (e: IOException) {
            Result.failure(CustomError.NetworkError("Failed to connect to server: ${e.message}"))
        } catch (e: Exception) {
            Result.failure(CustomError.UnknownError("An unexpected error occurred: ${e.message}", e))
        }
    }
}

// A mock API service
class ApiService {
    suspend fun fetchUserDto(userId: String): UserDto {
        delay(200)
        if (userId == "error_user") {
            throw IOException("Failed to fetch user DTO for $userId")
        }
        return UserDto("John Doe", "[email protected]")
    }
}

// Domain model used by this example
data class User(val name: String, val email: String)

data class UserDto(val name: String, val email: String) {
    fun toDomain(): User = User(name, email)
}

fun main() = runBlocking {
    val userService = UserService(ApiService())

    // Success case
    val successResult = userService.getUserProfile("valid_user")
    successResult.onSuccess { user ->
        println("User fetched: ${user.name}, ${user.email}")
    }.onFailure { error ->
        println("Error fetching user: ${error.message}")
    }

    // Failure case
    val failureResult = userService.getUserProfile("error_user")
    failureResult.onSuccess { user ->
        println("User fetched: ${user.name}, ${user.email}")
    }.onFailure { error ->
        println("Error fetching user: ${error.message}")
        if (error is CustomError.NetworkError) {
            println("Specific network error handling: ${error.message}")
        }
    }
}

Output:

User fetched: John Doe, [email protected]
Error fetching user: Failed to connect to server: Failed to fetch user DTO for error_user
Specific network error handling: Failed to connect to server: Failed to fetch user DTO for error_user

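The advantages of this pattern are:

Type safety: The function signature states explicitly that the call may succeed or fail.
Visible failure path: The caller has to unwrap the Result to reach the value, which makes it much harder to overlook the failure case.
Clear error classification: A sealed class can define domain-specific error types.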
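One caveat when combining Result with coroutines is that a broad catch can accidentally turn a cancellation into an ordinary failure. A small helper along the following lines is one way to avoid that; `runCatchingCancellable` is a hypothetical name, not a standard library function.

```kotlin
import kotlinx.coroutines.*

// Hypothetical helper: like runCatching for suspending code, but it lets
// CancellationException through so that wrapping a call in Result never
// hides the fact that the coroutine was cancelled.
suspend fun <T> runCatchingCancellable(block: suspend () -> T): Result<T> =
    try {
        Result.success(block())
    } catch (e: CancellationException) {
        throw e // cancellation is not a business failure
    } catch (e: Exception) {
        Result.failure(e)
    }
```

A service method could then wrap its suspending call in this helper and map the failure to a domain-specific error, rather than repeating the same catch chain in every method.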

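4.3 Global Exception Capture and Logging

Any exception that the business logic layer ultimately fails to catch should end up in a global capture mechanism that logs it, shows the user a generic error message, or even reports it to a crash analytics service.

The JVM thread's default uncaught exception handler: Thread.setDefaultUncaughtExceptionHandler catches exceptions that no CoroutineExceptionHandler has handled and that finally reach the thread.
CoroutineExceptionHandler as the last line of defense: Make sure every root coroutine has at least one CoroutineExceptionHandler that logs the exception.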

import kotlinx.coroutines.*

@OptIn(DelicateCoroutinesApi::class)
fun main() = runBlocking {
    // Last resort: the JVM-wide default uncaught exception handler
    Thread.setDefaultUncaughtExceptionHandler { t: Thread, e: Throwable ->
        println("JVM default handler: thread ${t.name} caught: ${e.message}")
    }

    val globalExceptionHandler = CoroutineExceptionHandler { _, exception ->
        println("Global CoroutineExceptionHandler caught: ${exception.message}")
    }

    // Scenario 1: a root coroutine without a CoroutineExceptionHandler.
    // The exception ends up in the thread's uncaught exception handler.
    // (A coroutine launched directly in runBlocking is a child, not a root:
    // its failure would cancel runBlocking and be thrown out of main.)
    GlobalScope.launch {
        delay(50)
        println("Root coroutine without handler throwing.")
        throw RuntimeException("Root coroutine error!")
    }.join()

    // Scenario 2: a root coroutine that uses globalExceptionHandler
    GlobalScope.launch(globalExceptionHandler) {
        delay(100)
        println("GlobalScope launch with explicit handler throwing.")
        throw IllegalStateException("GlobalScope error!")
    }.join()

    // Scenario 3: a child of a SupervisorJob; its exception goes to the
    // CoroutineExceptionHandler in the scope's context
    val supervisorScope = CoroutineScope(SupervisorJob() + globalExceptionHandler)
    supervisorScope.launch {
        delay(150)
        println("Supervisor child throwing.")
        throw NullPointerException("NPE in supervisor child!")
    }.join()

    supervisorScope.cancel() // release resources
    println("Main function finished.")
}

Sample output (the worker thread name may differ):

Root coroutine without handler throwing.
JVM default handler: thread DefaultDispatcher-worker-1 caught: Root coroutine error!
GlobalScope launch with explicit handler throwing.
Global CoroutineExceptionHandler caught: GlobalScope error!
Supervisor child throwing.
Global CoroutineExceptionHandler caught: NPE in supervisor child!
Main function finished.

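Summary and recommendations:

Structured concurrency is the foundation: Always use the parent-child relationships between coroutines to manage lifecycles and error propagation.
launch propagates immediately, async delivers through await(): Understand the core difference and choose according to your needs. With async, always call await() to observe the exception; join() does not rethrow it.
try-catch for local errors: Wrap suspending functions or await() calls that may throw.
SupervisorJob for independent tasks: Use it when the failure of one coroutine should not affect its siblings.
CoroutineExceptionHandler for root coroutines or supervisor scopes: Catch exceptions that were not handled anywhere else and log or report them.
Flow operators: catch, onCompletion, retry and retryWhen provide powerful stream-style error handling.
Error wrapping: In the business logic layer, wrap low-level technical exceptions in domain-specific exceptions or use the Result type to improve readability and maintainability.
Logging: Make sure every exception is logged so that it can be debugged and monitored.

Conclusion

Exception propagation in coroutines is a subtle yet powerful system. Through structured concurrency, the Job hierarchy, and the different behaviors of launch and async, it gives us flexible error handling. Mastering SupervisorJob and CoroutineExceptionHandler, together with Flow's declarative error handling operators, helps us build robust and responsive asynchronous applications. Understanding and following these principles is key to writing high-quality concurrent code.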
