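PyTorch/TensorFlow Custom `autograd`: Automatic Differentiation for Complex Gradients - 智猿学院: lectures on cutting-edge technology in front-end and back-end development, databases, AI, cloud computing, and more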

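Alright folks, today we're talking about custom autograd in PyTorch and TensorFlow, which boils down to teaching the machine how to compute gradients that are a bit more complicated. It sounds mysterious, but really it just gives us more flexible control over model training so we can pull off some unusual tricks.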

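Opening: why customize autograd?

That said, the automatic differentiation built into PyTorch and TensorFlow covers most needs, so why do it by hand? Simple: for the cases where the built-in gradient computation falls short. For example:

Non-differentiable points: some operations are not differentiable everywhere in the mathematical sense, such as ReLU at 0. The framework picks a default there, but sometimes you want finer control.
Efficiency: for some custom operations, stitching together built-in operators can make the gradient computation slow. Writing the backward pass yourself might make it several times faster.
Research: academics always want to try something new, and for that custom autograd is a must.
Showing off: admit it, sometimes you just want to flex your coding skills.

In short, custom autograd is about more freedom and more control.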

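Custom autograd in PyTorch: starting from scratch

Customizing autograd in PyTorch mainly involves two parts:

Define a class that inherits from torch.autograd.Function: this class defines the logic of the forward pass (forward) and the backward pass (backward).
Save some information in the forward pass: the backward pass uses it to compute gradients.

Let's go straight to the code, starting with a simple example: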

import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a tensor containing the input
        and return a tensor containing the output. ctx is a context object
        that can be used to stash information for the backward pass.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive the gradient with respect to the
        output, and we need to compute and return the gradient with respect
        to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()  # copy so grad_output itself is not modified
        grad_input[input < 0] = 0
        return grad_input

# Use the custom ReLU
my_relu = MyReLU.apply

# Test
x = torch.randn(5, requires_grad=True)
y = my_relu(x)
y.sum().backward()

print(x.grad)

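Code walkthrough:

MyReLU inherits from torch.autograd.Function.
The forward method:
- ctx.save_for_backward(input): saves the input tensor, which the backward pass needs.
- input.clamp(min=0): the core of ReLU, which sets every value below 0 to 0.
The backward method:
- input, = ctx.saved_tensors: retrieves the input saved earlier from ctx.
- grad_input = grad_output.clone(): copies the gradient of the output.
- grad_input[input < 0] = 0: the ReLU gradient. Where the input is below 0 the gradient is 0; elsewhere the local derivative is 1, so grad_output passes through unchanged. With this mask the point 0 itself gets derivative 1, whereas the built-in torch.relu uses 0 there.
my_relu = MyReLU.apply: apply is the key. It turns our custom Function into something you can call like an ordinary PyTorch function.
print(x.grad) prints the gradient.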
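
One way to sanity-check a hand-written backward is to compare it against finite differences. The sketch below assumes torch.autograd.gradcheck is used for that, with double-precision inputs chosen away from 0 (where ReLU has no derivative). The sample values and tolerances are illustrative, not part of the lecture.

```python
import torch

class MyReLU(torch.autograd.Function):
    """Same ReLU as above, repeated so this snippet runs on its own."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

# gradcheck compares the analytic backward with numerical finite differences.
# It expects double precision, and the inputs here stay away from the kink at 0.
x = torch.tensor([-1.5, -0.3, 0.4, 2.0], dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(MyReLU.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```

If the backward were wrong, gradcheck would raise an error describing the mismatch instead of returning.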

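Key point: the ctx object

The ctx object is the core of custom autograd. It lets you pass information from the forward pass to the backward pass. ctx.save_for_backward() saves tensors, and ctx.saved_tensors retrieves them. Other kinds of information, such as a Python scalar, can be stored too, as plain attributes on ctx (for example ctx.alpha = alpha) rather than through save_for_backward().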
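
To make the two storage routes concrete, here is a minimal sketch. PowN is a hypothetical Function made up for illustration: it saves its tensor input with save_for_backward and keeps the integer exponent as a plain attribute on ctx.

```python
import torch

class PowN(torch.autograd.Function):
    """Hypothetical example: y = x ** n for a Python integer n."""

    @staticmethod
    def forward(ctx, x, n):
        ctx.save_for_backward(x)  # tensors go through save_for_backward
        ctx.n = n                 # non-tensor state is a plain attribute
        return x ** n

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_x = grad_output * ctx.n * x ** (ctx.n - 1)
        # One return value per forward argument; n is not differentiable.
        return grad_x, None

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = PowN.apply(x, 3)
y.sum().backward()
print(x.grad)  # derivative of x^3 is 3 * x^2
```

Note that backward returns None for n: every argument of forward needs a slot in the return value, even the ones that cannot receive a gradient.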

A more complex example: a custom linear layer

Let's do something a bit more complex: a custom linear (fully connected) layer:

import torch

class MyLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, bias=None):
        """
        Forward pass: y = xW^T + b
        """
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias
        return output

    @staticmethod
    def backward(ctx, grad_output):
        """
        Backward pass:
        grad_input = grad_output @ W
        grad_weight = grad_output^T @ X
        grad_bias = grad_output.sum(dim=0)
        """
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_output.mm(weight)
        # Must match weight's shape (out_features, in_features)
        grad_weight = grad_output.t().mm(input)
        grad_bias = grad_output.sum(dim=0) if bias is not None else None
        return grad_input, grad_weight, grad_bias

# Use the custom linear layer
my_linear = MyLinear.apply

# Test
batch_size = 32
in_features = 64
out_features = 128

x = torch.randn(batch_size, in_features, requires_grad=True)
W = torch.randn(out_features, in_features, requires_grad=True)
b = torch.randn(out_features, requires_grad=True)

y = my_linear(x, W, b)
loss = y.mean()
loss.backward()

print(x.grad.shape)
print(W.grad.shape)
print(b.grad.shape)

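Code walkthrough:

MyLinear also inherits from torch.autograd.Function.
The forward method:
- Saves input, weight and bias.
- Implements the linear layer: y = xW^T + b.
The backward method:
- Computes the gradients for input, weight and bias.
- Returns the gradients in the same order as the inputs of forward. Each gradient must also have the same shape as the input it belongs to (for weight, that is (out_features, in_features)).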
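
A quick way to check a backward like this is to compare it with what PyTorch computes for the built-in op. The sketch below uses a condensed variant of the layer (bias is required here, just to keep it short) with grad_weight computed as grad_output^T @ X so that its shape matches weight, and compares against torch.nn.functional.linear. The small sizes and the seed are arbitrary choices.

```python
import torch
import torch.nn.functional as F

class MyLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, bias):
        ctx.save_for_backward(input, weight)
        return input.mm(weight.t()) + bias

    @staticmethod
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        grad_input = grad_output.mm(weight)      # (batch, in_features)
        grad_weight = grad_output.t().mm(input)  # (out_features, in_features)
        grad_bias = grad_output.sum(dim=0)       # (out_features,)
        return grad_input, grad_weight, grad_bias

torch.manual_seed(0)
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
W = torch.randn(5, 3, dtype=torch.double, requires_grad=True)
b = torch.randn(5, dtype=torch.double, requires_grad=True)

# Gradients from the hand-written backward.
custom = torch.autograd.grad(MyLinear.apply(x, W, b).mean(), (x, W, b))
# Gradients from PyTorch's own linear op on the same inputs.
reference = torch.autograd.grad(F.linear(x, W, b).mean(), (x, W, b))

for name, g_custom, g_ref in zip(("x", "W", "b"), custom, reference):
    print(name, tuple(g_custom.shape), torch.allclose(g_custom, g_ref))
```

If grad_weight were computed as X^T @ grad_output instead, it would come out with shape (in_features, out_features), and autograd would reject it because it does not match weight.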

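Some PyTorch autograd pitfalls

In-place operations: avoid in-place operations such as x += 1 where you can. An in-place operation can overwrite a value that autograd saved for the backward pass, and backward() then raises a RuntimeError. If you must use one, be very careful.
Gradient accumulation: PyTorch accumulates gradients into .grad by default. If you compute gradients once per iteration, you have to clear them yourself before the next backward pass (optimizer.zero_grad()).
requires_grad: gradients are only computed for tensors with requires_grad=True. If a gradient comes back as None, check whether you forgot to set requires_grad.
Gradients of non-leaf nodes: by default PyTorch keeps .grad only for leaf tensors. To inspect the gradient of an intermediate tensor, call retain_grad() on it.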
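
Three of these pitfalls are easy to reproduce in a few lines. The following is a small sketch with made-up values, not code from the lecture; it relies on the fact that exp() saves its output for the backward pass.

```python
import torch

# 1) In-place operations: exp() saves its output for the backward pass,
#    so modifying that output in place invalidates the saved value.
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.exp()
y.add_(1)
try:
    y.sum().backward()
    inplace_failed = False
except RuntimeError:
    inplace_failed = True
print("in-place broke backward:", inplace_failed)

# 2) Non-leaf gradients: retain_grad() keeps .grad on an intermediate tensor.
x = torch.tensor([1.0, 2.0], requires_grad=True)
h = x * 3
h.retain_grad()
(h ** 2).sum().backward()
print(h.grad)  # gradient of sum(h^2) with respect to h is 2h
print(x.grad)  # chain rule: 3 * 2h

# 3) Accumulation: a second backward adds to x.grad instead of replacing it.
first = x.grad.clone()
(x * 3).sum().backward()
accumulated = x.grad.clone()
x.grad.zero_()  # clear by hand, the job optimizer.zero_grad() does for parameters
print(accumulated - first)
```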

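Custom gradients in TensorFlow: tf.custom_gradient

In TensorFlow, custom gradients are mainly written with the tf.custom_gradient decorator, which lets you define the gradient computation of a function yourself.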

import tensorflow as tf

@tf.custom_gradient
def my_relu(x):
    def grad(dy):
        return dy * tf.cast(x > 0, tf.float32)  # dy is the upstream gradient
    return tf.nn.relu(x), grad

# Test
x = tf.Variable(tf.random.normal((5,)), dtype=tf.float32)

with tf.GradientTape() as tape:
    y = my_relu(x)
    loss = tf.reduce_sum(y)  # computed inside the tape so that it is recorded

grad = tape.gradient(loss, x)
print(grad)

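Code walkthrough:

The @tf.custom_gradient decorator: tells TensorFlow that this function supplies its own gradient.
The my_relu(x) function:
- Defines the forward pass: tf.nn.relu(x).
- Defines an inner function grad(dy) that computes the gradient. dy is the upstream gradient.
- Returns the forward result together with the gradient function.
The grad(dy) function:
- Computes the ReLU gradient: 0 where x <= 0 and 1 where x > 0, multiplied by the upstream gradient dy.
tf.GradientTape(): TensorFlow's gradient recorder. Operations have to run inside the with block to be recorded.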
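
The same pattern also helps when the automatically derived gradient is numerically fragile. The sketch below follows the log(1 + exp(x)) example described in the TensorFlow documentation for tf.custom_gradient. The function names and the test point x = 100 are choices made here for illustration. At that point exp(x) overflows in float32, so the naive gradient is expected to come out as nan, while the hand-written one stays finite.

```python
import tensorflow as tf

def log1pexp_naive(x):
    # Plain autodiff version, for comparison.
    return tf.math.log(1.0 + tf.exp(x))

@tf.custom_gradient
def log1pexp(x):
    e = tf.exp(x)

    def grad(dy):
        # Algebraically equal to e / (1 + e), but stays finite when e overflows.
        return dy * (1.0 - 1.0 / (1.0 + e))

    return tf.math.log(1.0 + e), grad

x = tf.constant(100.0)

with tf.GradientTape() as tape:
    tape.watch(x)  # x is a plain tensor, so it is watched explicitly
    y = log1pexp_naive(x)
naive_grad = tape.gradient(y, x)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = log1pexp(x)
custom_grad = tape.gradient(y, x)

print(naive_grad.numpy(), custom_grad.numpy())
```

As in my_relu, the inner grad closes over a value from the forward pass (here e), which is how information gets from forward to backward without a ctx object.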

A more complex example: a custom linear layer (TensorFlow)

import tensorflow as tf

@tf.custom_gradient
def my_linear(x, w, b):
    def grad(dy):
        dw = tf.matmul(tf.transpose(x), dy)
        db = tf.reduce_sum(dy, axis=0)
        dx = tf.matmul(dy, w, transpose_b=True)
        return dx, dw, db

    y = tf.matmul(x, w) + b
    return y, grad

# Test
batch_size = 32
in_features = 64
out_features = 128

x = tf.Variable(tf.random.normal((batch_size, in_features), dtype=tf.float32))
W = tf.Variable(tf.random.normal((in_features, out_features), dtype=tf.float32))
b = tf.Variable(tf.random.normal((out_features,), dtype=tf.float32))

with tf.GradientTape() as tape:
    y = my_linear(x, W, b)
    loss = tf.reduce_mean(y)

gradients = tape.gradient(loss, [x, W, b])

print(gradients[0].shape)
print(gradients[1].shape)
print(gradients[2].shape)

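Code walkthrough:

The my_linear(x, w, b) function:
- Implements the linear layer: y = xW + b. Note that w has shape (in_features, out_features) here, the transpose of the layout used in the PyTorch example.
- Defines the inner function grad(dy) that computes the gradients.
- Returns the forward result together with the gradient function.
The grad(dy) function:
- Computes the gradients for x, w and b.
- Returns the gradients in the same order as the arguments of my_linear.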

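Some tf.custom_gradient pitfalls

Gradients should be Tensors: grad should return a Tensor for each input of the decorated function, in the same order. Returning the wrong number of gradients makes TensorFlow raise an error.
tf.GradientTape(): in eager mode, gradients are recorded with tf.GradientTape(), and only operations executed inside the with block are recorded.
Variables: trainable tf.Variable objects are watched by the tape automatically. To differentiate with respect to a plain tensor, call tape.watch() on it first.
Data types: make sure all tensors have consistent dtypes, otherwise you may hit type errors.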
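
The difference between variables and plain tensors is easy to see on a toy function. This is a minimal sketch with made-up values: the gradient with respect to an unwatched constant comes back as None, while a tf.Variable or a watched tensor gives the expected derivative.

```python
import tensorflow as tf

x = tf.constant(3.0)  # plain tensor, not a tf.Variable
v = tf.Variable(3.0)  # trainable variable, watched automatically

# persistent=True allows more than one gradient() call on the same tape.
with tf.GradientTape(persistent=True) as tape:
    y_unwatched = x * x
    y_var = v * v

with tf.GradientTape() as tape2:
    tape2.watch(x)  # explicitly ask the tape to track the plain tensor
    y_watched = x * x

g_unwatched = tape.gradient(y_unwatched, x)
g_var = tape.gradient(y_var, v)
g_watched = tape2.gradient(y_watched, x)
print(g_unwatched, g_var.numpy(), g_watched.numpy())
```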

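PyTorch vs TensorFlow: custom autograd compared

Feature	PyTorch	TensorFlow
Core class/decorator	`torch.autograd.Function`	`tf.custom_gradient`
Forward/backward pass	`forward` and `backward` methods	The forward function and an inner `grad` function
Context object	`ctx`	No explicit context object; information is passed through the closure
Flexibility	More flexible, with full control over the gradient computation	Fairly flexible, but less free than PyTorch
Learning curve	Somewhat steep; you need to understand `torch.autograd.Function`	Gentler; `tf.custom_gradient` is easier to pick up
Debugging	Relatively hard; you need to inspect the computation graph carefully	Relatively easy; TensorFlow's debugging tools are more complete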

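When should you use custom autograd?

Performance bottleneck: when you find that the gradient computation of some operation is the bottleneck, consider custom autograd.
Special requirements: when you need special gradient logic, such as gradient clipping or a gradient penalty.
Research and exploration: when you want to try new algorithms or model structures.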
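
As one illustration of the "special requirements" case, here is a minimal sketch of clipping a gradient inside the backward pass. ClipGrad and its limit parameter are hypothetical names invented for this example: the forward pass is the identity, and the backward pass clamps each element of the incoming gradient.

```python
import torch

class ClipGrad(torch.autograd.Function):
    """Hypothetical helper: identity in forward, clamps the gradient in backward."""

    @staticmethod
    def forward(ctx, x, limit):
        ctx.limit = limit    # plain Python float, stored as an attribute
        return x.view_as(x)  # identity; returns a view of x

    @staticmethod
    def backward(ctx, grad_output):
        # Clamp every element of the incoming gradient to [-limit, limit].
        # limit itself gets no gradient, hence the trailing None.
        return grad_output.clamp(-ctx.limit, ctx.limit), None

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
scale = torch.tensor([10.0, -10.0, 0.1])

loss = (ClipGrad.apply(x, 0.5) * scale).sum()
loss.backward()
print(x.grad)
```

Without the ClipGrad call, x.grad would simply equal scale; with it, the two large entries are cut down to plus or minus 0.5 and the small one passes through.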

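Summary:

Custom autograd is an advanced feature of deep learning frameworks that gives you more flexible control over how a model trains. The learning curve is somewhat steep, but mastering it will take you further in deep learning. PyTorch is the more flexible of the two, and TensorFlow is easier to get started with. Which framework you choose depends on your preferences and your project's needs.

That's it for today. I hope you got something out of it. If you found it useful, remember to give it a like! Class dismissed!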
