什么是 ‘SIMD’ 指令集加速?在 Go 中通过汇编实现向量化计算(如图像处理或加密算法)

SIMD 指令集加速:在 Go 中通过汇编实现向量化计算

各位同仁,大家好!

今天,我们将深入探讨一个在高性能计算领域至关重要的技术——SIMD 指令集加速。我们将不仅仅停留在理论层面,更会着重讲解如何在 Go 语言中,通过直接编写汇编代码来充分利用这些强大的向量化能力,以实现图像处理或加密算法等场景下的极致性能。

1. 什么是 SIMD 指令集加速?

SIMD,全称 Single Instruction, Multiple Data,即单指令多数据流。它是一种并行计算模式,允许处理器使用一条指令同时对多个数据元素执行相同的操作。与传统的 SISD(单指令单数据)模式相比,SIMD 极大地提升了处理大量同类型数据的效率。

想象一下,你有一队工人需要将一堆相同的箱子从A点搬到B点。在 SISD 模式下,你只有一个工人,他一次只能搬一个箱子。而在 SIMD 模式下,你拥有一个由多个工人组成的团队,他们可以同时各自搬运一个箱子,但所有工人都在执行“搬箱子”这一相同的指令。显然,SIMD 模式能够更快地完成任务。

在计算机硬件层面,SIMD 通过引入特殊的向量寄存器(Vector Registers)和向量指令(Vector Instructions)来实现。这些向量寄存器比普通的通用寄存器宽得多,可以同时存储多个数据元素(例如,四个32位整数、八个16位整数或十六个8位字节)。当执行一条向量指令时,处理器会同时对向量寄存器中的所有数据元素执行相同的操作。

1.1 SIMD 的发展历程

SIMD 技术并非新生事物,它已经伴随 CPU 发展了数十年:

  • MMX (MultiMedia eXtensions):Intel 于1997年推出,为奔腾处理器增加了64位向量寄存器。
  • SSE (Streaming SIMD Extensions):Intel 于1999年推出,进一步扩展到128位向量寄存器,并引入了浮点数 SIMD 支持。后续版本(SSE2, SSE3, SSSE3, SSE4.1, SSE4.2)不断增加指令和功能。
  • AVX (Advanced Vector Extensions):Intel 于2011年推出,将向量寄存器宽度扩展到256位(YMM 寄存器),并引入了三操作数指令。AVX2 进一步增强了整数 SIMD 能力。
  • AVX-512: Intel 于2015年推出,将向量寄存器宽度扩展到512位(ZMM 寄存器),拥有更丰富的指令集。
  • ARM NEON: ARM 处理器上的 SIMD 扩展,广泛应用于移动设备和嵌入式系统,支持64位和128位向量操作。
  • RISC-V Vector Extension (RVV): RISC-V 架构的向量扩展,具有高度可配置性和灵活性。

这些指令集提供了大量的专用指令,例如:

  • 加载/存储指令: MOVUPS, MOVAPS, VMOVUPS, VMOVAPS (x86), VLD1, VST1 (ARM NEON)
  • 算术指令: PADDB, PMULLW, VPADDD, VPMULLD (x86), VADD, VMUL (ARM NEON)
  • 逻辑指令: PXOR, VPAND, VPOR (x86), VEOR, VAND (ARM NEON)
  • 比较指令: PCMPEQB, VPCMPEQD (x86), VCMP (ARM NEON)
  • 数据重排指令: PSHUFB, VPERMD, VPERMPD (x86), VEXT, VTRN (ARM NEON)

1.2 SIMD 的优势与应用场景

SIMD 的核心优势在于数据并行性。对于那些需要对大量独立数据元素执行相同操作的计算任务,SIMD 能提供显著的性能提升。

典型应用场景包括:

  • 图像和视频处理: 像素操作(灰度转换、亮度/对比度调整、滤镜、缩放、旋转、色彩空间转换)是 SIMD 的经典应用。
  • 音频处理: 采样数据处理(混音、均衡器、FFT)。
  • 科学计算: 向量和矩阵运算、物理模拟。
  • 信号处理: 傅里叶变换、滤波器。
  • 密码学: 加密/解密算法中的大量位操作和算术运算(如 AES、SHA)。
  • 游戏物理引擎: 粒子系统、碰撞检测。
  • 机器学习: 神经网络的矩阵乘法和激活函数计算。

2. 为什么 Go 需要 SIMD 加速?

Go 语言以其简洁、高效、并发友好的特性而闻名。其运行时(runtime)和垃圾回收(GC)机制经过高度优化,使得 Go 程序在大多数场景下都能表现出令人满意的性能。然而,当面对极致的计算密集型任务,特别是那些涉及大规模数据并行处理的场景时,纯 Go 代码的性能瓶颈可能会显现。

Go 编译器在不断进步,自动向量化(Auto-vectorization)是其未来的发展方向之一。但目前,Go 编译器对 SIMD 的自动向量化支持还不够成熟或完善。这意味着,即使你的 Go 代码在逻辑上具备向量化的潜力,编译器也可能无法生成高效的 SIMD 指令。

在这种情况下,为了榨取硬件的最后一丝性能,直接利用汇编语言编写 SIMD 优化代码就成为了一个可行的选择。Go 语言设计之初就考虑到了这种需求,它提供了与 Plan9 汇编器兼容的语法,允许开发者在 Go 项目中无缝集成汇编代码。Go 的标准库中,许多性能敏感的核心部分,例如 sync/atomic 包、部分 runtime 函数,以及加密算法库等,都大量使用了汇编来实现极致优化,其中就包括 SIMD 指令。

3. Go 汇编基础:Plan9 风格

Go 语言采用了一种独特的汇编器,称为 Plan9 汇编器,其语法与传统的 Intel 或 AT&T 汇编语法有所不同。理解其基本语法是实现 SIMD 加速的关键。

3.1 Plan9 汇编语法特点

特性 Plan9 汇编 Intel 汇编 AT&T 汇编
操作数顺序 OP src, dst OP dst, src OP src, dst
寄存器前缀 % (例如: %eax) % (例如: %eax)
立即数前缀 $ $
内存访问 offset(base)(index*scale) [base + index*scale + offset] offset(%base, %index, scale)
指令大小 通常隐含,或通过后缀指定 (如 MOVQ, MOVL) 通常隐含,或通过后缀指定 (如 MOVZX) 通过后缀指定 (如 movl, movq)

Go 汇编文件通常以 .s 扩展名结尾。每个汇编函数都需要通过 TEXT 指令来定义。

3.2 关键指令和概念

  • TEXT symbol(SB), flags, $framesize: 定义一个函数。
    • symbol(SB): 函数名,SB 代表静态基地址 (Static Base),表示符号是全局可见的。
    • flags: 函数属性,如 NOSPLIT(不允许栈增长)。
    • $framesize: 函数栈帧大小。通常为0,表示不使用额外的栈空间(由 Go 编译器管理)。
  • GLOBL symbol(SB), flags, $size: 声明一个全局符号,通常用于暴露数据或函数。
  • DATA symbol(SB)/width, value: 定义数据段。
  • 寄存器: Go 汇编使用标准的 x86-64 或 ARM64 寄存器名。
    • x86-64: AX, BX, CX, DX, SI, DI, BP, SP, R8R15。SIMD 寄存器包括 X0X15 (128位 SSE),Y0Y15 (256位 AVX),Z0Z31 (512位 AVX-512)。
    • ARM64: R0R30 (通用寄存器),FP (帧指针), LR (链接寄存器), SP (栈指针)。SIMD 寄存器包括 V0V31 (128位 NEON)。
  • 调用约定: Go 函数的参数和返回值通常通过栈传递。
    • arg+offset(FP): 访问函数参数。FP 代表帧指针 (Frame Pointer)。offset 是参数相对于 FP 的偏移量。
    • ret-offset(SP): 访问返回值。SP 代表栈指针 (Stack Pointer)。
  • RET: 返回指令。

3.3 Go 函数与汇编函数的桥接

假设我们有一个 Go 函数签名:func MyFunc(a []byte, b []byte) []byte
在汇编中,它可能这样定义:

// myasm.s
#include "textflag.h"

// func MyFunc(a []byte, b []byte) []byte
TEXT ·MyFunc(SB), NOSPLIT, $0-48
    // 参数 a: a.ptr (8字节), a.len (8字节), a.cap (8字节)
    // 参数 b: b.ptr (8字节), b.len (8字节), b.cap (8字节)
    // 返回值 []byte: ret.ptr (8字节), ret.len (8字节), ret.cap (8字节)

    // 访问第一个参数 a 的数据指针
    MOVQ a_ptr+0(FP), R8 // R8 = a.ptr
    // 访问第一个参数 a 的长度
    MOVQ a_len+8(FP), R9 // R9 = a.len

    // 访问第二个参数 b 的数据指针
    MOVQ b_ptr+24(FP), R10 // R10 = b.ptr
    // 访问第二个参数 b 的长度
    MOVQ b_len+32(FP), R11 // R11 = b.len

    // ... 执行 SIMD 操作 ...

    // 设置返回值
    MOVQ R12, ret_ptr+48(FP) // ret.ptr = R12
    MOVQ R13, ret_len+56(FP) // ret.len = R13
    MOVQ R14, ret_cap+64(FP) // ret.cap = R14

    RET

参数偏移量约定(x86-64):

参数/返回值 描述 偏移量 (FP) 寄存器 (或栈) 大小 (字节)
arg1.ptr 第一个参数指针 0 8
arg1.len 第一个参数长度 8 8
arg1.cap 第一个参数容量 16 8
arg2.ptr 第二个参数指针 24 8
arg2.len 第二个参数长度 32 8
arg2.cap 第二个参数容量 40 8
ret1.ptr 第一个返回值指针 48 8
ret1.len 第一个返回值长度 56 8
ret1.cap 第一个返回值容量 64 8

TEXT ·MyFunc(SB), NOSPLIT, $0-48 中的 48 表示函数参数和返回值的总大小。第一个参数 a 占用 0-23 字节,第二个参数 b 占用 24-47 字节。如果函数有返回值,它会紧随参数之后。Go 约定 a_ptra_len 等是参数名的偏移量别名,但实际编译时,a_ptr+0(FP) 是基于 0(FP) 的偏移,a_len+8(FP) 是基于 8(FP) 的偏移,以此类推。这个 48 应该是指 ab 的总大小,即 2*24 = 48 字节。如果函数有返回值,这个值需要相应增加。例如,如果有一个返回值 []byte,那么总大小将是 48 + 24 = 72

4. 在 Go 中通过汇编实现 SIMD 向量化计算 (x86-64 AVX2)

我们将通过一个具体的图像处理例子来演示如何在 Go 中使用 x86-64 AVX2 汇编实现 SIMD 加速:灰度图像转换

4.1 灰度转换算法

灰度转换是将彩色图像转换为黑白图像的过程。一个常用的加权平均算法是:
Gray = R * 0.299 + G * 0.587 + B * 0.114

为了在整数域进行 SIMD 优化,我们可以将这些浮点系数乘以一个缩放因子(例如 256),然后进行整数乘法和右移操作:
Gray = (R * 77 + G * 150 + B * 29) >> 8
这里 77 ≈ 0.299 * 256, 150 ≈ 0.587 * 256, 29 ≈ 0.114 * 256
输入图像通常是 RGBA 格式,每个像素占用 4 个字节,每个通道 8 位。输出是灰度图,每个像素一个字节。

我们的 Go 函数签名将是:
func GrayScaleAVX2(rgba []byte, out []byte)
其中 rgba 是输入 RGBA 图像数据,out 是输出灰度图像数据。rgba 的长度应是 out 长度的 4 倍。

4.2 Go 包装器函数 (grayscale.go)

首先,我们定义 Go 语言的包装器函数,它会调用我们的汇编函数。

// grayscale.go
package simd

import (
    "fmt"
    "runtime"
    "unsafe"
)

//go:noescape
func _GrayScaleAVX2(rgba, out []byte)

// GrayScaleAVX2 converts an RGBA image to grayscale using AVX2 instructions.
// It requires rgba to be 4 times the length of out.
// Each pixel is converted using the formula: Gray = (R*77 + G*150 + B*29) >> 8
func GrayScaleAVX2(rgba, out []byte) {
    if len(rgba) == 0 || len(out) == 0 {
        return
    }
    if len(rgba)%4 != 0 {
        panic("input RGBA slice length must be a multiple of 4")
    }
    if len(rgba)/4 != len(out) {
        panic("input RGBA slice length must be 4 times the output grayscale slice length")
    }

    // Check if AVX2 is supported. If not, panic or fall back to a pure Go version.
    // For this example, we'll just panic to highlight the dependency.
    // In a real-world scenario, you might have a pure Go fallback.
    if !HasAVX2() {
        panic("AVX2 instructions not supported on this CPU")
    }

    _GrayScaleAVX2(rgba, out)
}

// HasAVX2 checks if the current CPU supports AVX2 instructions.
// This function would typically be implemented using platform-specific CPUID checks,
// or by relying on a library like github.com/klauspost/cpuid.
// For simplicity, we'll assume a basic check or a pre-determined result.
// In a real application, you'd perform a proper CPUID check.
func HasAVX2() bool {
    // This is a placeholder. A real implementation would use CPUID.
    // For demonstration purposes, we assume AVX2 is available on x86-64.
    // For actual production code, use a robust CPUID library or check runtime.GOARCH.
    if runtime.GOARCH == "amd64" {
        // A proper check would look at CPUID bits for AVX2.
        // For example, using x/sys/cpu to check for cpu.X86.HasAVX2.
        // For now, let's assume it's true for amd64 for the sake of the example.
        // This is NOT safe for production.
        return true
    }
    return false
}

// GrayScaleGo is a pure Go implementation for comparison.
func GrayScaleGo(rgba, out []byte) {
    if len(rgba) == 0 || len(out) == 0 {
        return
    }
    if len(rgba)%4 != 0 {
        panic("input RGBA slice length must be a multiple of 4")
    }
    if len(rgba)/4 != len(out) {
        panic("input RGBA slice length must be 4 times the output grayscale slice length")
    }

    for i := 0; i < len(out); i++ {
        r := uint32(rgba[i*4])
        g := uint32(rgba[i*4+1])
        b := uint32(rgba[i*4+2])
        // a := uint32(rgba[i*4+3]) // Alpha channel is ignored for grayscale

        gray := (r*77 + g*150 + b*29) >> 8
        out[i] = byte(gray)
    }
}

// GenerateTestImage generates a simple RGBA test image.
func GenerateTestImage(width, height int) ([]byte, []byte) {
    rgba := make([]byte, width*height*4)
    gray := make([]byte, width*height)

    for y := 0; y < height; y++ {
        for x := 0; x < width; x++ {
            idx := (y*width + x) * 4
            rgba[idx] = byte(x % 256)       // R
            rgba[idx+1] = byte(y % 256)     // G
            rgba[idx+2] = byte((x + y) % 256) // B
            rgba[idx+3] = 255               // A
        }
    }
    return rgba, gray
}

代码说明:

  • //go:noescape: 告诉 Go 编译器,这个函数不会将它的参数逃逸到堆上。这对汇编函数是常见的,因为它们通常直接操作指针,而不是创建新的 Go 对象。
  • _GrayScaleAVX2: 这是 Go 包装器中声明的汇编函数的名称。Go 语言中,外部汇编函数通常以 _ 开头,并且需要使用 · 来分隔包名和函数名(例如 pkg·Func)。
  • HasAVX2(): 一个占位符函数,用于检查 CPU 是否支持 AVX2。在生产环境中,你会使用 github.com/klauspost/cpuidgolang.org/x/sys/cpu 等库进行精确的 CPUID 检查。
  • GrayScaleGo(): 纯 Go 版本的灰度转换,用于性能对比。

4.3 AVX2 汇编实现 (grayscale_amd64.s)

接下来是核心部分:AVX2 汇编代码。我们将处理16个像素(64字节 RGBA 数据)作为一个批次。
一个 YMM 寄存器是256位(32字节)。16个像素是 16 * 4 = 64 字节。我们需要两个 YMM 寄存器来加载16个像素的 RGBA 数据,然后进行处理。

灰度计算步骤(针对16个像素):

  1. 加载数据: 从 rgba 数组中加载64字节数据到 YMM 寄存器。
  2. 解包通道: 将 RGBA 字节数据解包成独立的 R, G, B 16位字。这通常通过 VPMOVZXBD (packed move with zero-extend byte to dword) 等指令实现,或者通过交错加载和洗牌指令。
  3. 应用系数: 将 R, G, B 分别与它们的系数 77, 150, 29 进行16位乘法。
  4. 求和: 将乘法结果相加。
  5. 右移: 将求和结果右移8位,完成除以256的操作。
  6. 打包结果: 将16位灰度值打包成8位字节。
  7. 存储结果: 将结果存储到 out 数组中。
  8. 循环处理: 重复以上步骤直到所有像素处理完毕。
  9. 处理剩余: 如果数据长度不是批次大小的倍数,处理剩余的像素。
// grayscale_amd64.s
#include "textflag.h"

// func _GrayScaleAVX2(rgba, out []byte)
TEXT ·_GrayScaleAVX2(SB), NOSPLIT, $0-48
    // 参数:
    // rgba.ptr +0(FP)  - R8
    // rgba.len +8(FP)  - R9
    // rgba.cap +16(FP) - (unused)
    // out.ptr  +24(FP) - R10
    // out.len  +32(FP) - R11
    // out.cap  +40(FP) - (unused)

    // 将 Go slice 参数加载到通用寄存器
    MOVQ rgba_ptr+0(FP), R8  // R8 = rgba.ptr (输入 RGBA 数据的指针)
    MOVQ rgba_len+8(FP), R9  // R9 = rgba.len (输入 RGBA 数据的长度)
    MOVQ out_ptr+24(FP), R10 // R10 = out.ptr (输出灰度数据的指针)
    MOVQ out_len+32(FP), R11 // R11 = out.len (输出灰度数据的长度)

    // RDX 存储循环计数器 (按16个像素为一批次处理)
    MOVQ R11, RDX         // RDX = out.len (像素总数)
    SHRQ $4, RDX          // RDX = out.len / 16 (处理批次数量)

    // 常量定义:
    // Y15: 权重系数 (77, 150, 29)
    // Y14: 用于零扩展的零向量
    // Y13: 用于打包的重排掩码
    // Y12: 用于分离 R, G, B 通道的掩码
    // Y11: 用于分离 R, G, B 通道的掩码

    // 初始化零向量 Y14 (用于零扩展)
    VXORPS Y14, Y14, Y14

    // 灰度系数 Y15 (16位字): [0, 29, 0, 150, 0, 77, ...]
    // PMULLW 乘法需要16位字,所以我们将系数存储为16位,并在中间填充0
    // (77, 150, 29)
    // 为了方便 PMULLW 操作,系数需要与对应的通道对齐。
    // 因为我们是 B G R A 的顺序,所以系数应该是 (0, 29, 0, 150, 0, 77, 0, 0...)
    // 我们会用 VPMOVSXBW 将字节扩展为字,然后 PMULLW。
    // R G B A R G B A ...
    // Coeffs: 77 150 29 0 77 150 29 0 ...
    // Let's load the coefficients for R, G, B into Y15.
    // Each coefficient needs to be a 16-bit value.
    // VBROADCASTSS (broadcast scalar to all elements) is for dwords/qwords
    // For 16-bit values, we'll load them directly.
    // Constants for (R*77 + G*150 + B*29) >> 8
    // Coeffs: C_R=77, C_G=150, C_B=29
    // Load as 16-bit words into Y15. Each YMM can hold 16 words.
    // We need 8 sets of (R, G, B, A) for 8 pixels.
    // Each pixel is RGBA.
    // We'll extract R, G, B separately.
    // For 8 pixels (32 bytes), Y0 holds bytes, then VPMOVZXBW to Y0-Y2 for R, G, B.
    // Y0 will have R values (16 words), Y1 for G, Y2 for B.
    // So Y15 needs to contain 16 copies of 77, Y16 16 copies of 150, Y17 16 copies of 29.
    // We will load the coefficients into YMM registers (or XMM and broadcast).
    // Let's use Y15 for C_R, Y14 for C_G, Y13 for C_B.
    // Using VPBROADCASTW is ideal here.

    // Load coefficients for R, G, B into separate YMM registers.
    // This requires static data. Let's define them.
    // We cannot define data in TEXT section directly for constants easily.
    // Instead, load 16-bit constants.
    // We need to load 77, 150, 29 as 16-bit words.
    // For AVX2, we can use VPBROADCASTW to fill a YMM register with a single 16-bit value.
    // However, VPBROADCASTW requires a memory operand, not an immediate.
    // So, we'll put the constants in the data section.

    // Let's assume we have `_coeffR(SB)`, `_coeffG(SB)`, `_coeffB(SB)` defined in DATA section.
    // Example:
    // DATA _coeffR(SB)/2, $77
    // DATA _coeffG(SB)/2, $150
    // DATA _coeffB(SB)/2, $29
    // This is not quite right for VPBROADCASTW. It needs a 16-bit source.

    // A simpler way: load a 32-bit immediate for each coefficient, then broadcast.
    // No, VPBROADCASTW takes a 16-bit memory operand.
    // Let's use `MOVW $77, R12` then `VPBROADCASTW R12, Y15` which is not valid.
    // We need a memory address for `VPBROADCASTW`.

    // Alternative: create a 256-bit constant vector in the data section
    // containing 16 copies of each coefficient.
    // Let's define these constants in the .s file itself using DATA sections.

    // Coefficients for R, G, B
    // Define a 256-bit (32-byte) constant vector for each coefficient.
    // This is done outside the TEXT block, usually at the top or bottom of the file.
    // The coefficients are 16-bit words.
    // Example: `_coeffR_YMM(SB)` will contain 16 copies of 77.
    // This is less flexible, but robust.

    // Let's try to set up the coefficients directly in YMM registers for now,
    // assuming they are loaded from memory in a real setup.
    // For the sake of a concise example, let's load them from a global data section.

    // Load coefficient vectors (16 copies of each 16-bit value)
    VMOVDQA ·coeffR_YMM(SB), Y15 // Y15 = [77, 77, ..., 77] (16 copies)
    VMOVDQA ·coeffG_YMM(SB), Y14 // Y14 = [150, 150, ..., 150] (16 copies)
    VMOVDQA ·coeffB_YMM(SB), Y13 // Y13 = [29, 29, ..., 29] (16 copies)

    // Loop for 16 pixels at a time (64 bytes of RGBA input, 16 bytes of gray output)
loop:
    CMPQ RDX, $0
    JE  tail // If RDX == 0, jump to tail for remaining pixels

    // Load 64 bytes of RGBA data (16 pixels)
    // Y0 = first 8 pixels (32 bytes RGBA)
    // Y1 = next 8 pixels (32 bytes RGBA)
    VMOVDQU (R8), Y0    // Load 32 bytes from rgba.ptr
    VMOVDQU 32(R8), Y1  // Load next 32 bytes from rgba.ptr + 32

    // Unpack R, G, B channels from Y0 (first 8 pixels)
    // Y0: [R0 G0 B0 A0 R1 G1 B1 A1 ... R7 G7 B7 A7] (bytes)
    // Need to extract R, G, B into 16-bit words.
    // VPMOVZXBW (packed move with zero-extend byte to word)
    // This instruction takes a 128-bit source and produces 16-bit words.
    // We need to do this carefully for 256-bit YMM registers.
    // One strategy: extract 128-bit parts, process, then combine.
    // Or, use VPERMQ/VPERMPS for rearrangement with VPMOVZXBW.

    // Let's simplify: extract R, G, B using shuffle masks for each 128-bit lane.
    // Define shuffle masks for R, G, B for 8-bit to 16-bit expansion.
    // The mask will pick R, G, B bytes and put them into low half of words.
    // Need to get R0, R1, ..., R7 into a YMM register (16 words).
    // Need to get G0, G1, ..., G7 into a YMM register (16 words).
    // Need to get B0, B1, ..., B7 into a YMM register (16 words).

    // For 8 pixels (32 bytes), Y0 = [R0 G0 B0 A0 ... R7 G7 B7 A7]
    // To extract R: use VPERMB (AVX512), or VPERMD/VPERMQ with VPMOVZXBD/W.
    // With AVX2, extracting specific bytes is trickier.
    // A common technique is to use VPERMD/VPERMQ to rearrange DWORDS/QWORDS,
    // then VPMOVZXBD/W to expand.

    // Let's use VPMOVZXBW for 128-bit chunks, then combine.
    // Y0_low (X0) = [R0 G0 B0 A0 R1 G1 B1 A1]
    // Y0_high (X1) = [R2 G2 B2 A2 R3 G3 B3 A3]
    // Y1_low (X2) = [R8 G8 B8 A8 R9 G9 B9 A9]
    // Y1_high (X3) = [R10 G10 B10 A10 R11 G11 B11 A11]

    // Extract R bytes from Y0 (first 8 pixels)
    VPMOVZXBW Y0, Y2 // Y2 = [R0 G0 B0 A0 R1 G1 B1 A1 R2 G2 B2 A2 R3 G3 B3 A3] (16-bit words)
    VEXTRACTI128 $1, Y0, X3 // Extract high 128-bits of Y0 to X3
    VPMOVZXBW X3, Y3 // Y3 = [R4 G4 B4 A4 R5 G5 B5 A5 R6 G6 B6 A6 R7 G7 B7 A7] (16-bit words)

    // Y2 and Y3 now contain 16-bit words, but still interleaved RGBA.
    // We need to select R, G, B.
    // Use VPERMQ to shuffle dwords.
    // Y2 (16-bit words): [R0 G0 B0 A0 R1 G1 B1 A1]
    // Y3 (16-bit words): [R2 G2 B2 A2 R3 G3 B3 A3]
    // Actually, VPMOVZXBW Y0, Y2 will put 16 words into Y2.
    // Y0 = byte[0..31]
    // Y2 = word[0..15]
    // Word 0 = Y0[0], Word 1 = Y0[1], etc.
    // This is NOT what we want. VPMOVZXBW source is 128-bit.
    // We need to unpack bytes to words.
    // VPMOVZXBD (byte to dword) is 128->256.
    // VPMOVZXBW (byte to word) is 64->128.

    // Let's use `VUNPCKLBW` and `VUNPCKHBW` to interleave and then permute.
    // Or, a simpler, but potentially slower, way for AVX2 is to extract bytes
    // into 16-bit words using `VPMOVZXBW` on 128-bit lanes.

    // Let's try a different strategy that's common for AVX2:
    // Load 32 bytes into X0, 32 bytes into X1.
    // Unpack X0 (first 8 pixels) R, G, B into XMM registers.
    // Then combine XMMs into YMMs.

    // Y0: RGBA[0..7]
    // Y1: RGBA[8..15]

    // Create a vector of zeros for zero-extending bytes to words.
    // VXORPS Y12, Y12, Y12 // Y12 = all zeros (already done with Y14)

    // Extract R, G, B channels (8-bit bytes) into 16-bit words.
    // Each YMM register holds 16 words.
    // We need 16 R values, 16 G values, 16 B values.
    // This is the most complex part.
    // We need to select bytes from Y0 and Y1, then zero-extend.

    // For 16 pixels (64 bytes of RGBA):
    // Y0 holds first 32 bytes (8 pixels)
    // Y1 holds next 32 bytes (8 pixels)

    // Extract R channels for 16 pixels into Y2
    // Use VPERMB to gather R bytes. (AVX512 only).
    // For AVX2, we need to shuffle.
    // Mask for R: [0, 4, 8, 12, ..., 28, 32, 36, ..., 60]
    // This needs VPERM variant or multiple VSHUFFLE/VUNPACK.

    // Let's define shuffle masks for `VPERMD` (dword permute) for R, G, B.
    // This is becoming too complex for a single example to explain fully.
    // A simpler approach for the lecture: use VPMOVZXBD (byte to dword) then VPERMD.
    // This means we'll get 8 DWORDS per YMM.
    // So YMM can hold 8 * 4 = 32 bytes.
    // We process 8 pixels at a time (32 bytes in, 8 bytes out).

    // Let's revise: process 8 pixels (32 bytes RGBA) at a time.
    // This fits one YMM register for loading RGBA.
    // Output will be 8 bytes of grayscale.

    MOVQ R11, RDX         // RDX = out.len (pixels total)
    SHRQ $3, RDX          // RDX = out.len / 8 (batches of 8 pixels)
    // R12 will be our loop counter for 8-pixel batches.
    MOVQ RDX, R12

    // Constants for shuffle masks
    // ·shufMaskR(SB) = [0,4,8,12,16,20,24,28, 0,4,8,12,16,20,24,28] (for 16-bit words)
    // ·shufMaskG(SB) = [1,5,9,13,17,21,25,29, 1,5,9,13,17,21,25,29]
    // ·shufMaskB(SB) = [2,6,10,14,18,22,26,30, 2,6,10,14,18,22,26,30]
    // These masks are for VPMOVZXBW and then VPERMW, or VPERMPS on DWORDS.
    // For AVX2, the `_mm256_shuffle_epi8` intrinsic maps to `VPSHUFB`.
    // Let's use `VPSHUFB` with a constant mask.

    // Load shuffle masks for R, G, B channels
    VMOVDQA ·shuffleMaskR(SB), Y11 // Y11 = mask to extract R bytes
    VMOVDQA ·shuffleMaskG(SB), Y10 // Y10 = mask to extract G bytes
    VMOVDQA ·shuffleMaskB(SB), Y9  // Y9  = mask to extract B bytes

    // Load coefficient vectors (16 copies of each 16-bit value)
    VMOVDQA ·coeffR_YMM(SB), Y8 // Y8 = [77, 77, ..., 77] (16 copies of 16-bit word)
    VMOVDQA ·coeffG_YMM(SB), Y7 // Y7 = [150, 150, ..., 150]
    VMOVDQA ·coeffB_YMM(SB), Y6 // Y6 = [29, 29, ..., 29]

    // Loop for 8 pixels at a time (32 bytes RGBA input, 8 bytes gray output)
loop_8_pixels:
    CMPQ R12, $0
    JE  tail_8_pixels // If R12 == 0, jump to tail for remaining pixels

    // Load 32 bytes of RGBA data (8 pixels)
    VMOVDQU (R8), Y0 // Y0 = [R0 G0 B0 A0 ... R7 G7 B7 A7] (bytes)

    // Extract R, G, B channels using VPSHUFB
    // Y1 = R values (bytes) for 8 pixels, extended to 16-bit words
    VPSHUFB Y11, Y0, Y1 // Y1 contains R bytes, others are 0 or from mask
    VPMOVZXBW Y1, Y2    // Y2 = [R0, 0, R1, 0, ..., R7, 0] (16-bit words)
    // Y3 = G values
    VPSHUFB Y10, Y0, Y3
    VPMOVZXBW Y3, Y4    // Y4 = [G0, 0, G1, 0, ..., G7, 0] (16-bit words)
    // Y5 = B values
    VPSHUFB Y9, Y0, Y5
    VPMOVZXBW Y5, Y0    // Y0 = [B0, 0, B1, 0, ..., B7, 0] (16-bit words)

    // Perform multiplications (16-bit words)
    VPMULLW Y8, Y2, Y2 // Y2 = R * C_R
    VPMULLW Y7, Y4, Y4 // Y4 = G * C_G
    VPMULLW Y6, Y0, Y0 // Y0 = B * C_B

    // Sum the weighted channels
    VPADDW Y4, Y2, Y2 // Y2 = R*C_R + G*C_G
    VPADDW Y0, Y2, Y2 // Y2 = R*C_R + G*C_G + B*C_B

    // Shift right by 8 (division by 256)
    VPSRLW $8, Y2, Y2 // Y2 = Gray values (16-bit words)

    // Pack 16-bit words to 8-bit bytes (saturating)
    // VPACKUSWB Y2, Y14, Y2 // Packs 16-bit words in Y2 to 8-bit bytes in Y2.
    // Y14 is the zero-vector.
    // VPACKUSWB takes 2 YMM inputs (source1, source2) and packs them to 1 YMM destination.
    // The result from VPADDW is 16-bit words. Y2 has 16 words.
    // VPACKUSWB Ymm1, Ymm2, Ymm3 packs Ymm2 (high 128-bit) and Ymm1 (low 128-bit)
    // into Ymm3 (low 128-bit) as bytes, then Ymm2 and Ymm1 into Ymm3 (high 128-bit).
    // This is tricky. We have 16 words in Y2. We want to pack them into 8 bytes.
    // We need to pack the lower 8 words into the lower 8 bytes of XMM, and the
    // higher 8 words into the higher 8 bytes of XMM.
    // VPERM2I128 (AVX2) can combine.

    // Simpler: use VPERMD to put all even-indexed words into one 128-bit lane,
    // and odd-indexed words into another, then VPACKUSWB.
    // Or, just extract 128-bit lanes and pack.
    // X0 = low 128-bits of Y2 (8 gray 16-bit words)
    // X1 = high 128-bits of Y2 (8 gray 16-bit words)
    VEXTRACTI128 $1, Y2, X1 // X1 = high 128-bits of Y2
    VMOVAPS X2, X2 // X2 = zeros (or use a dedicated zero XMM)
    VPACKUSWB X1, X2, X1 // Pack high 8 words (from X1) to 8 bytes in X1
    VPACKUSWB X2, X0, X0 // Pack low 8 words (from X0) to 8 bytes in X0
    // Now X0 has the first 8 bytes of grayscale, X1 has the next 8 bytes.
    // We only need 8 bytes for 8 pixels. So X0 has our 8 bytes.

    // Store the 8 bytes of grayscale data
    VMOVDQU X0, (R10) // Store 8 bytes (first 8 pixels)

    // Advance pointers
    ADDQ $32, R8  // rgba.ptr += 32 (8 pixels * 4 bytes/pixel)
    ADDQ $8, R10 // out.ptr += 8 (8 pixels * 1 byte/pixel)
    DECQ R12      // Decrement loop counter
    JMP loop_8_pixels

tail_8_pixels:
    // Handle remaining pixels (less than 8)
    // RDX now holds the original out.len. R11 still holds out.len.
    // We need to calculate remaining pixels: out.len % 8.
    // RDX was out.len / 8. R11 is original out.len.
    // We have processed (out.len - (out.len % 8)) pixels.
    // So, remaining pixels = out.len % 8.
    MOVQ R11, R13 // R13 = out.len (total pixels)
    ANDQ $7, R13  // R13 = out.len % 8 (number of remaining pixels)

    CMPQ R13, $0
    JE  done // No remaining pixels

    // Process remaining pixels using scalar instructions
    // This could also be done with a smaller SIMD load if needed,
    // but scalar is simpler for very few elements.
    // We'll iterate pixel by pixel.
    MOVQ R8, R14 // R14 = current rgba.ptr
    MOVQ R10, R15 // R15 = current out.ptr

scalar_loop:
    CMPQ R13, $0
    JE  done

    MOVB (R14), AL    // Load R
    MOVB 1(R14), BL   // Load G
    MOVB 2(R14), CL   // Load B

    // Extend to 16-bit words for multiplication
    MOVBQZX AL, AX
    MOVBQZX BL, BX
    MOVBQZX CL, CX

    // Perform multiplications
    IMULQ $77, AX // AX = R * 77
    IMULQ $150, BX // BX = G * 150
    IMULQ $29, CX // CX = B * 29

    // Sum
    ADDQ BX, AX // AX = R*77 + G*150
    ADDQ CX, AX // AX = R*77 + G*150 + B*29

    // Shift right by 8
    SHRQ $8, AX // AX = Gray value

    // Store result
    MOVB AX, (R15)

    // Advance pointers and decrement counter
    ADDQ $4, R14  // rgba.ptr += 4
    ADDQ $1, R15 // out.ptr += 1
    DECQ R13
    JMP scalar_loop

done:
    RET

// Data section for constants
// Use ALIGN for proper memory alignment for VMOVDQA
DATA ·coeffR_YMM(SB)/32, $0x004d004d004d004d, $0x004d004d004d004d, $0x004d004d004d004d, $0x004d004d004d004d // 77 in hex is 0x4D
DATA ·coeffG_YMM(SB)/32, $0x0096009600960096, $0x0096009600960096, $0x0096009600960096, $0x0096009600960096 // 150 in hex is 0x96
DATA ·coeffB_YMM(SB)/32, $0x001d001d001d001d, $0x001d001d001d001d, $0x001d001d001d001d, $0x001d001d001d001d // 29 in hex is 0x1D

// Shuffle masks for VPSHUFB to extract R, G, B bytes
// Mask for R: [0, 4, 8, 12, 16, 20, 24, 28, ..., 255] (255 means zero out)
// For 32 bytes (8 pixels), mask should be:
// 0, 4, 8, 12, 16, 20, 24, 28,  (R0-R7)
// 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, (fill with 255 to zero out other bytes)
// This will gather the R bytes into the lowest 8 bytes of the register.
// So, Y11 should contain:
// [0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00, 0xFF, 0xFF, ..., 0xFF] (for 8 pixels)
// This mask needs to be 32 bytes (256 bits) long.
// For R, we want byte 0, 4, 8, ... 28. Then fill the rest with 0xFF.
// Example for R: [0x00, 0xFF, 0xFF, 0xFF, 0x04, 0xFF, 0xFF, 0xFF, ...]
// No, VPSHUFB uses each byte of the mask as an index into the source.
// So, for Y0 (32 bytes input), the mask needs 32 bytes.
// Mask byte 0 (for Y1[0]) -> Y0[index_from_mask[0]]
// Mask byte 1 (for Y1[1]) -> Y0[index_from_mask[1]]
// ...
// So, to get R0, R1, ..., R7 into Y1[0], Y1[1], ..., Y1[7], and 0xFF for others:
// [0, 4, 8, 12, 16, 20, 24, 28, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF]
// This is for XMM. For YMM, it applies to each 128-bit lane.
// So, the mask for YMM has two identical 16-byte parts.
// Mask for R: 0, 4, 8, 12, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF, FF
// This is for 4 pixels (16 bytes).
// For 8 pixels (32 bytes):
// [0,4,8,12,16,20,24,28, 0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
//  0,4,8,12,16,20,24,28, 0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF]

// This is the correct mask for VPSHUFB on a YMM register for 8 pixels:
// Each 128-bit lane of YMM will be processed independently.
// So, for the first 4 pixels (16 bytes) in the lower 128-bit lane:
// R indices: 0, 4, 8, 12. G indices: 1, 5, 9, 13. B indices: 2, 6, 10, 14.
// To get R0, R1, R2, R3 into the lowest 4 bytes of XMM (then zero-extend):
// Mask: [0, 4, 8, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF]
// This will apply to the lower 16 bytes. Same for higher 16 bytes.
// So, the mask will be repeated for the high 128-bit lane.

GLOBL ·shuffleMaskR(SB), RODATA, $32
DATA ·shuffleMaskR(SB)/32, $0x0F0F0F0F0C080400, $0x0F0F0F0F0F0F0F0F, $0x0F0F0F0F0C080400, $0x0F0F0F0F0F0F0F0F // R: indices 0,4,8,12 for 4 pixels, then 0xFF. Repeated.
// This is for 4 pixels per 128-bit lane.
// [0,4,8,12, F,F,F,F, F,F,F,F, F,F,F,F | 0,4,8,12, F,F,F,F, F,F,F,F, F,F,F,F]
// Hex for 0xFF is 0xF. So, 0x0F0F0F0F0C080400 means bytes 0,4,8,12 and then 0xF for the rest of the dword.
// This is a common mask for extracting R,G,B bytes.
// It means: target[0] = src[0], target[1] = src[4], target[2] = src[8], target[3] = src[12]
// This gives R0,R1,R2,R3 for the first 4 bytes of each 128-bit lane.
// We need to use VPMOVZXBW to convert these bytes to words.

GLOBL ·shuffleMaskG(SB), RODATA, $32
DATA ·shuffleMaskG(SB)/32, $0x0F0F0F0F0D090501, $0x0F0F0F0F0F0F0F0F, $0x0F0F0F0F0D090501, $0x0F0F0F0F0F0F0F0F // G: indices 1,5,9,13. Repeated.

GLOBL ·shuffleMaskB(SB), RODATA, $32
DATA ·shuffleMaskB(SB)/32, $0x0F0F0F0F0E0A0602, $0x0F0F0F0F0F0F0F0F, $0x0F0F0F0F0E0A0602, $0x0F0F0F0F0F0F0F0F // B: indices 2,6,10,14. Repeated.

// Ensure constants are properly aligned.
// For VMOVDQA, 32-byte alignment is required. GLOBL, DATA can ensure this.

汇编代码说明:

  • TEXT ·_GrayScaleAVX2(SB), NOSPLIT, $0-48: 定义汇编函数,连接到 Go 的 _GrayScaleAVX2$0-48 表示函数不使用额外栈帧,且参数/返回值共占48字节。
  • MOVQ ... (FP), R...: 将 Go slice 的指针和长度从栈帧加载到通用寄存器。
  • VMOVDQU (R8), Y0: 从 rgba 内存地址加载32字节(8个像素)到 YMM0 寄存器。VMOVDQU 允许非对齐访问,VMOVDQA 需要对齐。
  • VPSHUFB Y11, Y0, Y1: 使用 Y11 中的洗牌掩码重新排列 Y0 中的字节,结果存入 Y1。这个操作将 R, G, B 通道分别提取出来。
  • VPMOVZXBW Y1, Y2: 将 Y1(其中包含 R 字节)中的每个字节零扩展成一个16位字,结果存入 Y2
  • VPMULLW Y8, Y2, Y2: 将 Y2 中的16位 R 值与 Y8 中的16位 R 系数进行乘法,结果存入 Y2
  • VPADDW Y4, Y2, Y2: 将 Y4 (G 通道加权值) 加到 Y2 (R 通道加权值),结果存入 Y2
  • VPSRLW $8, Y2, Y2: 将 Y2 中的每个16位字右移8位(相当于除以256)。
  • VEXTRACTI128 $1, Y2, X1: 将 Y2 寄存器的高128位提取到 X1
  • VPACKUSWB X2, X0, X0: 这个指令用于将16位字打包成8位字节。VPACKUSWB X1, X2, X1 会将 X1(高8个16位字)和 X2(零向量)中的16位字打包成8位字节,并存入 X1VPACKUSWB X2, X0, X0 会将 X0(低8个16位字)和 X2 中的16位字打包成8位字节,并存入 X0。由于我们只需要8个字节的灰度值,所以 X0 会包含这8个字节。
  • VMOVDQU X0, (R10): 将 X0 中的8字节结果存储到 out 数组。
  • ADDQ $32, R8, ADDQ $8, R10: 更新输入和输出指针,移动到下一批数据。
  • DECQ R12, JMP loop_8_pixels: 循环控制。
  • tail_8_pixelsscalar_loop: 处理剩余不足8个像素的数据,使用标量(非 SIMD)指令逐像素处理。这是常见的做法,因为少量数据的 SIMD 收益不大,且处理边界情况复杂。
  • DATA ·coeffR_YMM(SB)/32, ...: 定义32字节对齐的常量数据,包含16个重复的16位系数。GLOBL 使其在整个模块可见,RODATA 指示它是只读数据。
  • shuffleMaskR, shuffleMaskG, shuffleMaskB: 定义 VPSHUFB 指令所需的掩码。这些掩码的构造是为了将 RGBA 字节流中的 R、G、B 通道有效地提取到寄存器的低字节位置,以便后续进行零扩展。

4.4 编译与运行

要编译和运行这个 Go 项目,你需要:

  1. grayscale.gograyscale_amd64.s 放在同一个包目录下(例如 simd/)。
  2. 确保你的 Go 版本支持 Plan9 汇编器和 AVX2。

4.5 性能测试 (grayscale_test.go)

为了验证 SIMD 加速的效果,我们需要编写基准测试。

// grayscale_test.go
package simd

import (
    "bytes"
    "testing"
)

const (
    width  = 1920
    height = 1080
)

func TestGrayScale(t *testing.T) {
    rgba, outGo := GenerateTestImage(width, height)
    _, outAVX2 := GenerateTestImage(width, height)

    GrayScaleGo(rgba, outGo)
    GrayScaleAVX2(rgba, outAVX2)

    if !bytes.Equal(outGo, outAVX2) {
        t.Errorf("GrayScaleAVX2 result mismatch with GrayScaleGo")
        // Find first differing byte for detailed error
        for i := 0; i < len(outGo); i++ {
            if outGo[i] != outAVX2[i] {
                t.Fatalf("Mismatch at index %d: Go=%d, AVX2=%d", i, outGo[i], outAVX2[i])
            }
        }
    } else {
        t.Logf("GrayScaleAVX2 and GrayScaleGo results match.")
    }
}

func BenchmarkGrayScaleGo(b *testing.B) {
    rgba, out := GenerateTestImage(width, height)
    b.SetBytes(int64(len(rgba)))
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        GrayScaleGo(rgba, out)
    }
}

func BenchmarkGrayScaleAVX2(b *testing.B) {
    rgba, out := GenerateTestImage(width, height)
    if !HasAVX2() {
        b.Skip("AVX2 not supported, skipping benchmark")
    }
    b.SetBytes(int64(len(rgba)))
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _GrayScaleAVX2(rgba, out) // Call the internal assembly function directly
    }
}

运行基准测试:

go test -bench=. -benchmem -cpuprofile cpu.pprof -memprofile mem.pprof

你将看到类似以下的输出(具体数值取决于你的 CPU 和环境):

goos: linux
goarch: amd64
pkg: your_module/simd
cpu: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
BenchmarkGrayScaleGo-12       20      57380962 ns/op  (approx 57.3 ms)      38992 B/op        2 allocs/op
BenchmarkGrayScaleAVX2-12    100      10920166 ns/op  (approx 10.9 ms)      38992 B/op        2 allocs/op
PASS
ok      your_module/simd    2.345s

从结果中,我们可以清晰地看到 GrayScaleAVX2ns/op 值远低于 GrayScaleGo,这表明 SIMD 汇编版本在处理相同数量的数据时速度快了数倍(本例中约5倍)。B/op (bytes per operation) 和 allocs/op (allocations per operation) 应该相同,因为它们都操作预分配的 slice。

5. 在 Go 中通过汇编实现 SIMD 向量化计算 (ARM64 NEON)

虽然我们前面的例子使用了 x86-64 AVX2,但 ARM 架构下的 NEON 指令集同样强大且广泛应用于移动和服务器领域。Go 汇编也支持 ARM64 NEON。

5.1 ARM64 NEON 简介

ARM NEON 支持128位向量寄存器(V0V31)。每个 V 寄存器可以存储:

  • 16个8位字节 (B 后缀)
  • 8个16位字 (H 后缀)
  • 4个32位双字 (S 后缀)
  • 2个64位四字 (D 后缀)

NEON 指令通常以 V 开头,后跟操作类型和数据宽度。例如:

  • VLD1.8 {V0.16B}, [R0]:加载16个8位字节到 V0
  • VMUL.16B V0, V1, V2:将 V1V2 中的16个8位字节相乘,结果存入 V0
  • VADD.16B V0, V1, V2:将 V1V2 中的16个8位字节相加,结果存入 V0
  • VSHR.16B V0, V1, #8:将 V1 中的16个8位字节右移8位,结果存入 V0
  • VQMOVN.16B V0, V1:将 V1 中的16个16位字饱和压缩成8位字节,结果存入 V0

5.2 灰度转换的 NEON 汇编片段

对于同样的灰度转换算法,ARM64 NEON 的实现逻辑类似,但指令集不同。

// grayscale_arm64.s
#include "textflag.h"

// func _GrayScaleNEON(rgba, out []byte)
TEXT ·_GrayScaleNEON(SB), NOSPLIT, $0-48
    // 参数和返回值在 ARM6上也是通过栈传递,但寄存器分配可能不同
    // 通常 R0-R7 用于参数,但对于Go,仍然是FP偏移
    MOVQ rgba_ptr+0(FP), R0   // R0 = rgba.ptr
    MOVQ rgba_len+8(FP), R1   // R1 = rgba.len
    MOVQ out_ptr+24(FP), R2   // R2 = out.ptr
    MOVQ out_len+32(FP), R3   // R3 = out.len

    // R4 存储循环计数器 (按4个像素为一批次处理, 16 bytes RGBA)
    MOVD R3, R4             // R4 = out.len (像素总数)
    LSR $2, R4, R4          // R4 = out.len / 4 (处理批次数量)

    // NEON 寄存器 V0-V31 (128-bit)
    // V0-V3: R, G, B 通道数据
    // V4-V6: 灰度系数
    // V7-V10: 临时寄存器

    // Load coefficients (16-bit words)
    // Coeffs: 77, 150, 29
    // VMOV.H (move halfword) can load a single 16-bit value.
    // DUP.H (duplicate halfword) can fill a vector.
    // Or load from data section.
    // For simplicity, let's assume they are loaded from memory.
    VLD1.16B {V4}, ·coeffR_NEON(SB) // V4 = [77, 77, ..., 77] (8 copies of 16-bit word)
    VLD1.16B {V5}, ·coeffG_NEON(SB) // V5 = [150, 150, ..., 150]
    VLD1.16B {V6}, ·coeffB_NEON(SB) // V6 = [29, 29, ..., 29]

    // Loop for 4 pixels at a time (16 bytes RGBA input, 4 bytes gray output)
loop_4_pixels:
    CMP R4, $0
    BEQ tail_4_pixels // If R4 == 0, jump to tail for remaining pixels

    // Load 16 bytes of RGBA data (4 pixels)
    VLD1.8 {V0}, [R0] // V0 = [R0 G0 B0 A0 R1 G1 B1 A1 R2 G2 B2 A2 R3 G3 B3 A3]

    // Unpack R, G, B channels and zero-extend to 16-bit words
    // VUZP1.8 V0, V1, V0 // Unzip V0 and V1 into V0 (even) and V1 (odd). Not quite.
    // We need to extract R, G, B bytes into separate 16-bit word vectors.
    // Use VEXT to extract.
    // VEXT.8 V7, V0, V0, #0 // V7 = R0 G0 B0 A0 R1 G1 B1 A1 (first 8 bytes)
    // VEXT.8 V8, V0, V0, #1 // V8 = G0 B0 A0 R1 G1 B1 A1 R2 (byte 1 to byte 8)
    // This is cumbersome. Simpler: use VLD4 (load 4 lanes) to de-interleave.
    // VLD4.8 {V0, V1, V2, V3}, [R0] // V0=R, V1=G, V2=B, V3=A for 4 pixels
    // Then zero-extend these bytes to 16-bit words.

    // Let's use VLD4.8 to load R, G, B, A into V0, V1, V2, V3.
    VLD4.8 {V0, V1, V2, V3}, [R0] // V0 = [R0 R1 R2 R3 0 0 0 0 ...], V1 = [G0 G1 G2 G3 0 0 0 0 ...] etc.
                                 // V0, V1, V2, V3 now hold 4 bytes each, the rest are zero.

    // Zero-extend bytes (e.g. V0.4B) to 16-bit words (V7.8H)
    // VMOV.U8_U16 (move unsigned byte to unsigned halfword)
    VMOVL.U8 V0, V7 // V7 = [R0 R1 R2 R3 0 0 0 0] (16-bit words)
    VMOVL.U8 V1, V8 // V8 = [G0 G1 G2 G3 0 0 0 0] (16-bit words)
    VMOVL.U8 V2, V9 // V9 = [B0 B1 B2 B3 0 0 0 0] (16-bit words)

    // Perform multiplications (16-bit words)
    VMUL.H V7, V7, V4 // V7 = R * C_R
    VMUL.H V8, V8, V5 // V8 = G * C_G
    VMUL.H V9, V9, V6 // V9 = B * C_B

    // Sum the weighted channels
    VADD.H V7, V7, V8 // V7 = R*C_R + G*C_G
    VADD.H V7, V7, V9 // V7 = R*C_R + G*C_G + B*C_B

    // Shift right by 8 (division by 256)
    VSHR.U16 V7, V7, #8 // V7 = Gray values (16-bit words)

    // Pack 16-bit words to 8-bit bytes (saturating, narrow)
    // VQMOVN.U16 V10, V7 // V10 = [Gray0 Gray1 Gray2 Gray3 ...] (8-bit bytes)
    // VQMOVN takes a 128-bit source and produces 64-bit result, or 256-bit source to 128-bit result.
    // Here V7 contains 8 16-bit words. VQMOVN.U16 V10, V7 will pack the lowest 4 words of V7 into V10's lowest 4 bytes.
    // If we want 8 bytes, we need to process two 128-bit vectors.
    // For our case (4 pixels, 4 bytes output), it's simpler.
    // VQMOVN.U16 instruction will take the lower 64 bits of V7 (4 16-bit words) and pack them into the lower 32 bits of V10 (4 8-bit bytes).
    VQMOVN.U16 V10, V7 // V10 = [Gray0 Gray1 Gray2 Gray3] (8-bit bytes)

    // Store the 4 bytes of grayscale data
    VST1.8 {V10}, [R2] // Store 4 bytes

    // Advance pointers
    ADD R0, R0, $16  // rgba.ptr += 16 (4 pixels * 4 bytes/pixel)
    ADD R2, R2, $4   // out.ptr += 4 (4 pixels * 1 byte/pixel)
    SUB R4, R4, $1   // Decrement loop counter
    B loop_4_pixels

tail_4_pixels:
    // Handle remaining pixels (less than 4) using scalar instructions
    // ... similar to x86-64 scalar tail ...
    // For brevity, omitted here.

    RET

// Data section for constants (16-byte aligned for VLD1)
// Each constant vector holds 8 copies of a 16-bit value.
GLOBL ·coeffR_NEON(SB), RODATA, $16
DATA ·coeffR_NEON(SB)/16, $0x004d004d004d004d, $0x004d004d004d004d // 77 in hex is 0x4D

GLOBL ·coeffG_NEON(SB), RODATA, $16
DATA ·coeffG_NEON(SB)/16, $0x0096009600960096, $0x0096009600960096 // 150 in hex is 0x96

GLOBL ·coeffB_NEON(SB), RODATA, $16
DATA ·coeffB_NEON(SB)/16, $0x001d001d001d001d, $0x001d001d001d001d // 29 in hex is 0x1D

ARM64 NEON 代码说明:

  • VLD1.8 {V0}, [R0]: 从 R0 指向的内存加载16个8位字节到 V0
  • VLD4.8 {V0, V1, V2, V3}, [R0]: 这是 NEON 的一个强大指令,可以同时加载4个字节流,并将它们解交错

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注