Liger Kernel优化：利用Triton重写HuggingFace模型算子以减少显存占用

大家好，今天我将和大家分享一种优化HuggingFace模型，特别是大型Transformer模型的方法：利用Triton重写模型算子以减少显存占用。

1. 背景：HuggingFace模型与显存瓶颈

HuggingFace的Transformers库为我们提供了丰富的预训练模型，极大地简化了NLP任务的开发流程。然而，随着模型规模的不断扩大，如BERT、GPT-3、LLaMA等，其庞大的参数量和中间激活值给显存带来了巨大的压力。在实际应用中，我们经常会遇到以下问题：

显存溢出（Out of Memory, OOM）： 训练或推理过程中，显存不足导致程序崩溃。
Batch Size受限： 为了避免OOM，不得不降低Batch Size，降低了硬件利用率，延长了训练/推理时间。
无法部署大型模型： 在资源有限的设备上（如边缘设备），无法部署大型模型。

因此，优化HuggingFace模型的显存占用变得至关重要。常见的优化方法包括模型压缩（量化、剪枝、知识蒸馏）、梯度累积、混合精度训练等。今天我们要介绍的是一种更底层的优化方法：重写Kernel。

2. 什么是Kernel？为什么要重写？

在深度学习框架中，Kernel是指实现特定操作（如矩阵乘法、卷积、激活函数）的底层代码。这些Kernel通常由框架的底层库（如CUDA、cuDNN）提供，并经过高度优化。

那么，为什么我们要重写已经高度优化的Kernel呢？原因主要有以下几点：

框架通用Kernel的局限性： 框架提供的通用Kernel为了适应各种场景，往往无法针对特定模型或算子进行定制优化。
内存访问模式优化： 深度学习模型的性能很大程度上取决于内存访问模式。通过定制Kernel，我们可以更精细地控制内存访问，减少不必要的内存读写，从而降低显存占用。
算子融合（Operator Fusion）： 将多个小的算子融合为一个大的Kernel，可以减少Kernel启动的开销，并减少中间结果的显存占用。

3. Triton：一个易于使用的Kernel编写工具

手动编写CUDA Kernel需要深入了解CUDA编程模型，学习成本较高。Triton是一个由OpenAI开发的开源编程语言和编译器，旨在简化Kernel的编写过程。Triton具有以下优点：

类Python语法： 降低了学习门槛，即使不熟悉CUDA也能快速上手。
自动并行化： 通过声明式编程，Triton编译器可以自动将代码并行化到GPU上。
高性能： 通过手动控制内存访问模式，可以实现与手写CUDA Kernel相当的性能。
易于集成： 可以方便地与PyTorch等深度学习框架集成。

4. Liger：一种针对Transformer模型的Kernel优化方法

Liger（Lightweight and Efficient Global Attention with Recurrence）是一种针对Transformer模型的Kernel优化方法，旨在减少全局注意力机制的显存占用。它的核心思想是：

将全局注意力操作分解为多个小的Kernel： 将原本需要一次性计算的全局注意力矩阵分解为多个小的Block，逐个计算。
利用循环（Recurrence）机制： 通过循环迭代的方式，逐步计算全局注意力矩阵，避免一次性加载整个矩阵到显存中。

5. 使用Triton实现Liger Kernel

下面，我们通过一个具体的例子来演示如何使用Triton实现Liger Kernel，并将其集成到HuggingFace模型中。

5.1 环境准备

首先，我们需要安装Triton：

pip install triton==2.1.0
pip install torch

5.2 实现Liger Kernel

我们以Transformer模型中的Self-Attention算子为例，使用Triton实现Liger Kernel。假设我们已经有了Query、Key、Value三个矩阵：Q, K, V。传统的Self-Attention计算如下：

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        assert self.head_dim * num_heads == embed_dim, "embed_dim must be divisible by num_heads"

        self.W_q = nn.Linear(embed_dim, embed_dim)
        self.W_k = nn.Linear(embed_dim, embed_dim)
        self.W_v = nn.Linear(embed_dim, embed_dim)
        self.W_o = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, embed_dim = x.size()

        q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Scaled Dot-Product Attention
        attention_scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attention_probs = F.softmax(attention_scores, dim=-1)
        context = torch.matmul(attention_probs, v)

        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, embed_dim)
        output = self.W_o(context)

        return output

现在，我们使用Triton重写 Self-Attention 中最耗显存的 torch.matmul(q, k.transpose(-2, -1)) 和 torch.matmul(attention_probs, v) 这两个矩阵乘法。

import triton
import triton.language as tl

@triton.jit
def _scaled_dot_product_attention_kernel(
    Q, K, V, output,
    Q_row_stride, K_row_stride, V_row_stride, output_row_stride,
    Q_col_stride, K_col_stride, V_col_stride, output_col_stride,
    seq_len_q, seq_len_kv, head_dim,
    BLOCK_SIZE_Q: tl.constexpr, BLOCK_SIZE_KV: tl.constexpr,
    ATTENTION_MASK: tl.constexpr
):
    """
    Triton kernel for scaled dot-product attention.

    Args:
        Q: Pointer to the query matrix.
        K: Pointer to the key matrix.
        V: Pointer to the value matrix.
        output: Pointer to the output matrix.
        Q_row_stride: Row stride of the query matrix.
        K_row_stride: Row stride of the key matrix.
        V_row_stride: Row stride of the value matrix.
        output_row_stride: Row stride of the output matrix.
        Q_col_stride: Column stride of the query matrix.
        K_col_stride: Column stride of the key matrix.
        V_col_stride: Column stride of the value matrix.
        output_col_stride: Column stride of the output matrix.
        seq_len_q: Sequence length of the query matrix.
        seq_len_kv: Sequence length of the key/value matrices.
        head_dim: Dimension of the attention head.
        BLOCK_SIZE_Q: Block size for the query matrix.
        BLOCK_SIZE_KV: Block size for the key/value matrices.
        ATTENTION_MASK: Whether to apply an attention mask.
    """

    row_idx = tl.program_id(0)
    col_idx = tl.program_id(1)
    head_idx = tl.program_id(2)
    batch_idx = tl.program_id(3)

    # Offsets for the query, key, and value matrices
    q_offset = batch_idx * Q_row_stride + head_idx * seq_len_q * head_dim + row_idx * BLOCK_SIZE_Q * head_dim
    k_offset = batch_idx * K_row_stride + head_idx * seq_len_kv * head_dim + col_idx * BLOCK_SIZE_KV * head_dim
    v_offset = batch_idx * V_row_stride + head_idx * seq_len_kv * head_dim + col_idx * BLOCK_SIZE_KV * head_dim

    # Load query, key, and value blocks
    q = tl.load(Q + q_offset, mask=tl.arange(0, BLOCK_SIZE_Q) < seq_len_q, other=0.0)
    k = tl.load(K + k_offset, mask=tl.arange(0, BLOCK_SIZE_KV) < seq_len_kv, other=0.0)
    v = tl.load(V + v_offset, mask=tl.arange(0, BLOCK_SIZE_KV) < seq_len_kv, other=0.0)

    # Compute attention scores
    attention_scores = tl.dot(q, tl.trans(k))

    # Apply scaling
    attention_scores = attention_scores / (head_dim ** 0.5)

    # Apply attention mask (optional)
    if ATTENTION_MASK:
        mask = (row_idx * BLOCK_SIZE_Q[:, None] + tl.arange(0, BLOCK_SIZE_Q)[:, None]) >= (col_idx * BLOCK_SIZE_KV[None, :] + tl.arange(0, BLOCK_SIZE_KV)[None, :])
        attention_scores = tl.where(mask, attention_scores, -float('inf'))

    # Compute attention probabilities
    attention_probs = tl.softmax(attention_scores)

    # Compute context vector
    context = tl.dot(attention_probs, v)

    # Store output
    output_offset = batch_idx * output_row_stride + head_idx * seq_len_q * head_dim + row_idx * BLOCK_SIZE_Q * head_dim
    tl.store(output + output_offset, context, mask=tl.arange(0, BLOCK_SIZE_Q) < seq_len_q)

def scaled_dot_product_attention(q, k, v, attention_mask=False):
    """
    Wrapper function to call the Triton kernel for scaled dot-product attention.

    Args:
        q: Query tensor.
        k: Key tensor.
        v: Value tensor.
        attention_mask: Whether to apply an attention mask.

    Returns:
        Output tensor.
    """

    batch_size, num_heads, seq_len_q, head_dim = q.shape
    _, _, seq_len_kv, _ = k.shape

    # Output tensor
    output = torch.empty_like(q)

    # Define block sizes
    BLOCK_SIZE_Q = 32 # Choose appropriate block size
    BLOCK_SIZE_KV = 32 # Choose appropriate block size

    # Launch the kernel
    grid = (triton.cdiv(seq_len_q, BLOCK_SIZE_Q), triton.cdiv(seq_len_kv, BLOCK_SIZE_KV), num_heads, batch_size)

    _scaled_dot_product_attention_kernel[grid](
        q, k, v, output,
        q.stride(0), k.stride(0), v.stride(0), output.stride(0),
        q.stride(3), k.stride(3), v.stride(3), output.stride(3),
        seq_len_q, seq_len_kv, head_dim,
        BLOCK_SIZE_Q=BLOCK_SIZE_Q, BLOCK_SIZE_KV=BLOCK_SIZE_KV,
        ATTENTION_MASK=attention_mask
    )

    return output

5.3 集成到HuggingFace模型

现在，我们将上述Triton Kernel集成到HuggingFace模型中。我们需要修改SelfAttention类中的forward函数，将原先的torch.matmul替换为我们自定义的scaled_dot_product_attention函数。

class SelfAttentionWithTriton(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        assert self.head_dim * num_heads == embed_dim, "embed_dim must be divisible by num_heads"

        self.W_q = nn.Linear(embed_dim, embed_dim)
        self.W_k = nn.Linear(embed_dim, embed_dim)
        self.W_v = nn.Linear(embed_dim, embed_dim)
        self.W_o = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, embed_dim = x.size()

        q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Scaled Dot-Product Attention (using Triton kernel)
        context = scaled_dot_product_attention(q, k, v)

        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, embed_dim)
        output = self.W_o(context)

        return output

5.4 测试与验证

最后，我们需要测试和验证我们的修改是否有效，并评估显存占用情况。我们可以使用以下代码进行测试：

# Example Usage
embed_dim = 256
num_heads = 8
seq_len = 128
batch_size = 4

# Generate random input
x = torch.randn(batch_size, seq_len, embed_dim).cuda()

# Instantiate the original SelfAttention module
attention = SelfAttention(embed_dim, num_heads).cuda()

# Instantiate the SelfAttentionWithTriton module
attention_triton = SelfAttentionWithTriton(embed_dim, num_heads).cuda()

# Perform forward pass
output = attention(x)
output_triton = attention_triton(x)

# Verify the output
print(f"Output shape: {output.shape}")
print(f"Output (Triton) shape: {output_triton.shape}")

# Check if the outputs are close
print(f"Output diff: {torch.abs(output - output_triton).mean()}")

# Measure memory usage (using torch.cuda.memory_allocated)
torch.cuda.reset_peak_memory_stats()
output = attention(x)
memory_allocated_original = torch.cuda.max_memory_allocated()
print(f"Original Attention Memory Allocated: {memory_allocated_original / (1024**2):.2f} MB")

torch.cuda.reset_peak_memory_stats()
output_triton = attention_triton(x)
memory_allocated_triton = torch.cuda.max_memory_allocated()
print(f"Triton Attention Memory Allocated: {memory_allocated_triton / (1024**2):.2f} MB")

通过比较原始Self-Attention和使用Triton优化的Self-Attention的显存占用，我们可以验证我们的优化是否有效。

6. 其他优化策略

除了Liger Kernel之外，我们还可以采用其他优化策略来进一步减少显存占用：

Kernel Fusion： 将多个小的Kernel融合为一个大的Kernel，减少Kernel启动的开销，并减少中间结果的显存占用。
Zero-Copy： 尽量避免不必要的内存拷贝，例如，可以直接在GPU上进行数据预处理。
Memory Pooling： 使用内存池技术，预先分配一块大的内存，然后将小的内存块分配给不同的算子，避免频繁的内存分配和释放。
混合精度训练 (Mixed Precision Training)： 使用 FP16 (半精度浮点数) 代替 FP32 (单精度浮点数) 训练模型，可以显著减少显存占用和提高计算速度。通常与 torch.cuda.amp.autocast 结合使用。

7. 总结与未来展望

通过使用Triton重写HuggingFace模型算子，特别是全局注意力机制相关的算子，我们可以有效地减少显存占用，提高训练/推理效率，并能够在资源有限的设备上部署大型模型。未来，我们可以进一步探索更复杂的Kernel优化方法，并将其应用到更多的HuggingFace模型中。 Triton 提供了一个相对简单的方式去优化底层算子，可以更高效地利用硬件资源。定制 Kernel 可以带来显著的性能提升和显存优化，尤其是在处理大型 Transformer 模型时。

Liger Kernel优化：利用Triton重写HuggingFace模型算子以减少显存占用

发表回复 取消回复

发表回复取消回复