Analyzing the source of WordPress's `wp_kses()`: how whitelist-based HTML filtering prevents XSS attacks

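Good evening, everyone! Welcome to today's talk, "WordPress Anti-XSS Playbook: A Deep Dive into the wp_kses() Source." No small talk tonight; we go straight to the substance and take apart wp_kses(), the function that stands guard at WordPress's door, to see how its whitelist keeps XSS payloads out.

First, a quick warm-up: what exactly is an XSS attack?

1. XSS Attacks: The "Brats" of the Web

XSS stands for Cross-Site Scripting. Picture a gang of brats roaming the web: they try every trick to sneak malicious JavaScript into your site, for instance through a comment form. When you or another visitor views that comment, the script runs in the visitor's browser, where it can steal cookies, rewrite the page, or act on the visitor's behalf.

For example: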

<script>alert("XSS攻击！");</script>

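If your site has no protection and lets users submit content containing this code, every browser that renders the page will pop up an alert box with the message from the script. Real payloads are far more elaborate and can hide in seemingly harmless links, images, or even CSS.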
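To make the risk concrete, here is a minimal sketch in plain PHP (no WordPress needed). It assumes a made-up $comment variable holding attacker-supplied text and shows how encoding on output turns the payload into inert text.

```php
<?php
// Hypothetical comment text submitted by an attacker.
$comment = '<script>alert("XSS");</script>';

// Unsafe: echoing $comment as-is would let the browser run the script.
// Safer: encode HTML metacharacters so the payload is displayed as text.
$safe = htmlspecialchars( $comment, ENT_QUOTES, 'UTF-8' );

echo $safe, PHP_EOL;
// prints: &lt;script&gt;alert(&quot;XSS&quot;);&lt;/script&gt;
```

Encoding everything is the right choice when no markup is wanted at all. A whitelist filter is for the case where some HTML must survive.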

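2. wp_kses(): The Site's "Whitelist Security Guard"

To keep those brats from causing damage, WordPress provides wp_kses(). Its core idea is a whitelist: only the HTML tags, attributes, and URL protocols that appear on the whitelist are allowed in user-submitted content; everything else is filtered out.

Think of wp_kses() as a security guard holding a list of permitted HTML tags and attributes. Whatever is on the list gets through; everything else is stopped at the door.

3. Basic Usage of wp_kses()

The basic call looks like this:

$filtered_html = wp_kses( $string, $allowed_html, $allowed_protocols );

$string: the HTML string to filter.
$allowed_html: an array defining the allowed HTML tags and attributes, or the name of a predefined context such as 'post'.
$allowed_protocols: an optional array of allowed URL protocols (for example http, https, mailto). When it is omitted, the list returned by wp_allowed_protocols() is used.

The return value is the filtered HTML string.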
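As a small illustration of the second parameter, the sketch below passes the context name 'post' instead of an array. The helper name demo_filter_comment() is made up for this example, and the strip_tags() fallback exists only so the snippet also runs outside WordPress; it is not equivalent to wp_kses().

```php
<?php
/**
 * Hypothetical helper: filter markup with the predefined 'post' whitelist.
 * Outside WordPress (wp_kses() undefined) it falls back to strip_tags()
 * so the sketch still runs on its own.
 */
function demo_filter_comment( string $html ): string {
    if ( function_exists( 'wp_kses' ) ) {
        // A context name instead of an array selects a predefined whitelist.
        return wp_kses( $html, 'post' );
    }
    // Stand-in for plain PHP: removes every tag, keeps the text.
    return strip_tags( $html );
}

echo demo_filter_comment( '<em>hi</em><script>x()</script>' ), PHP_EOL;
```

Inside WordPress, wp_kses_post( $html ) is the usual shorthand for the same call.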

A simple example:

$string = '<p>This is a <strong>bold</strong> text. <script>alert("XSS");</script></p>';

$allowed_html = array(
    'p' => array(),
    'strong' => array(),
);

$filtered_html = wp_kses( $string, $allowed_html );

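echo $filtered_html; // Output: <p>This is a <strong>bold</strong> text. alert("XSS");</p>

In this example only the p and strong tags are allowed, so the <script> and </script> tags are stripped without mercy. Note that wp_kses() removes the tags themselves, not the text between them: alert("XSS"); stays behind as plain text that the browser no longer executes.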

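4. Inside the wp_kses() Source: Taking It Apart

Now let's look at how wp_kses() implements whitelist filtering. The code lives in wp-includes/kses.php and is fairly long, so we only pick out the key parts. The listings here are abridged and lightly simplified, and details differ between WordPress versions, so treat them as a reading guide and check the file shipped with your own installation.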

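The wp_kses() entry point

function wp_kses( $content, $allowed_html, $allowed_protocols = array() ) {
    // Fall back to the default protocol list when none is passed in.
    if ( empty( $allowed_protocols ) ) {
        $allowed_protocols = wp_allowed_protocols();
    }

    // Pre-processing: strip NUL bytes and normalize HTML entities.
    $content = wp_kses_no_null( $content, array( 'slash_zero' => 'keep' ) );
    $content = wp_kses_normalize_entities( $content );

    // Let plugins pre-filter the content through the 'pre_kses' hook.
    $content = wp_kses_hook( $content, $allowed_html, $allowed_protocols );

    // Split the content into tags and text, then filter every tag.
    return wp_kses_split( $content, $allowed_html, $allowed_protocols );
}

The entry point is short. It fills in the default protocol list, cleans up NUL bytes and entities, runs the pre_kses hook, and then hands everything to wp_kses_split(), where the real work happens.

Note that there is no fallback for an empty $allowed_html: an empty array simply means that no tags are allowed. The globals $allowedposttags and $allowedtags, which hold the default whitelists for post content and for plain text such as comments, only come into play when a context name is passed (it is resolved through wp_kses_allowed_html()) or when you call the wrappers wp_kses_post() and wp_kses_data().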
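Before any tag matching happens, kses discards NUL bytes and similar control characters (the job of wp_kses_no_null()). The following is a rough stand-in written for this article, not the core implementation; the name demo_strip_control_chars() is hypothetical. It shows why the step matters: some browsers have historically ignored such bytes inside a URL scheme.

```php
<?php
/**
 * Hypothetical stand-in for the "remove NUL and control bytes" step.
 * Not the core implementation; it only shows why the step exists.
 */
function demo_strip_control_chars( string $text ): string {
    // Drop ASCII control characters except tab, line feed and carriage return.
    return preg_replace( '/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text );
}

// A NUL byte hides the scheme from a naive "starts with javascript:" check.
$sneaky = "java\0script:alert(1)";

echo demo_strip_control_chars( $sneaky ), PHP_EOL;
// prints: javascript:alert(1)
```

Once the hidden bytes are gone, the later protocol check sees the scheme for what it is.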

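About WP_HTML_Tag_Processor

WordPress 6.2 introduced the WP_HTML_Tag_Processor class as part of the new HTML API, and it is easy to assume that wp_kses() was rebuilt on top of it. It was not: the class has no kses() method, and the whitelist logic is still implemented by the procedural functions in wp-includes/kses.php (newer releases may hand individual parsing steps to the HTML API). The Tag Processor also does not rely on PHP's DOMDocument; it is a purpose-built scanner that walks the HTML string tag by tag and lets you read or modify attributes in place.

Put simply, WP_HTML_Tag_Processor is a precise surgical tool for finding tags and editing their attributes, while wp_kses() remains the guard that decides which tags and attributes are let in at all.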
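For orientation, this is roughly how WP_HTML_Tag_Processor is driven in practice. The helper name demo_lazy_images() is invented for this sketch, and the class_exists() guard is only there so the snippet does nothing harmful on older WordPress versions or in plain PHP.

```php
<?php
/**
 * Hypothetical helper: add loading="lazy" to every <img> using the HTML API.
 * Returns the input unchanged when WP_HTML_Tag_Processor is unavailable
 * (WordPress older than 6.2, or plain PHP), so the sketch runs anywhere.
 */
function demo_lazy_images( string $html ): string {
    if ( ! class_exists( 'WP_HTML_Tag_Processor' ) ) {
        return $html;
    }

    $processor = new WP_HTML_Tag_Processor( $html );

    // Visit each <img> tag in document order and set one attribute.
    while ( $processor->next_tag( 'img' ) ) {
        $processor->set_attribute( 'loading', 'lazy' );
    }

    return $processor->get_updated_html();
}
```

The class edits tags it is pointed at; it does not decide what is safe. That decision stays with the kses whitelist.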

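wp_kses_split(): cutting the content into tokens

The real filtering starts in wp_kses_split(). It scans the string for anything that looks like a tag or a comment and hands each match to a callback; plain text between the tags is left alone.

function wp_kses_split( $content, $allowed_html, $allowed_protocols ) {
    global $pass_allowed_html, $pass_allowed_protocols;

    // The callback only receives the regex matches, so the whitelists
    // are handed to it through globals.
    $pass_allowed_html      = $allowed_html;
    $pass_allowed_protocols = $allowed_protocols;

    // Match HTML comments, tag-like spans, and stray ">" characters.
    // (Simplified; newer releases use a more elaborate pattern.)
    return preg_replace_callback(
        '%(<!--.*?(-->|$))|(<[^>]*(>|$)|>)%',
        '_wp_kses_split_callback',
        $content
    );
}

The function first stores the allowed tags and protocols where the callback can reach them. It then uses preg_replace_callback() to walk over every tag-like token in the string; the callback, _wp_kses_split_callback(), forwards each token to wp_kses_split2().

So whenever a start tag, an end tag, or a comment is found, wp_kses_split2() decides its fate.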
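The token-by-token idea can be tried out with a toy filter. This is a deliberately naive sketch written for this article, with a hypothetical function name; it checks tag names only and passes attributes through untouched, so it must not be used as a real sanitizer.

```php
<?php
/**
 * Toy tag filter: removes every tag whose name is not in $allowed_tags.
 * Attributes are passed through untouched, so this is NOT safe for real use.
 */
function demo_filter_tags( string $html, array $allowed_tags ): string {
    return preg_replace_callback(
        '%<\s*/?\s*([a-zA-Z0-9-]+)[^>]*>%',
        static function ( array $m ) use ( $allowed_tags ): string {
            $name = strtolower( $m[1] );
            // Keep the whole tag if its name is whitelisted, else drop it.
            return in_array( $name, $allowed_tags, true ) ? $m[0] : '';
        },
        $html
    );
}

echo demo_filter_tags( '<p>Hi <script>evil()</script></p>', array( 'p' ) ), PHP_EOL;
// prints: <p>Hi evil()</p>
```

As with wp_kses(), the disallowed tags disappear while the text between them is kept.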

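wp_kses_split2(): checking one tag

function wp_kses_split2( $content, $allowed_html, $allowed_protocols ) {
    $content = wp_kses_stripslashes( $content );

    // The match was a lone ">" character: encode it.
    if ( '<' !== substr( $content, 0, 1 ) ) {
        return '&gt;';
    }

    // ... HTML comment handling omitted ...

    // Seriously malformed tag: drop it.
    if ( ! preg_match( '%^<\s*(/\s*)?([a-zA-Z0-9-]+)([^>]*)>?$%', $content, $matches ) ) {
        return '';
    }

    $slash    = trim( $matches[1] );
    $elem     = $matches[2];
    $attrlist = $matches[3];

    // A context name such as 'post' is resolved to its whitelist array.
    if ( ! is_array( $allowed_html ) ) {
        $allowed_html = wp_kses_allowed_html( $allowed_html );
    }

    // The element is not on the whitelist: remove the tag.
    if ( ! isset( $allowed_html[ strtolower( $elem ) ] ) ) {
        return '';
    }

    // Closing tags never carry attributes.
    if ( '' !== $slash ) {
        return "</$elem>";
    }

    // Opening tag: filter its attributes.
    return wp_kses_attr( $elem, $attrlist, $allowed_html, $allowed_protocols );
}

This function extracts the tag name and checks it against the whitelist. If the tag is not listed, the tag is removed outright.

If the tag is listed, wp_kses_attr() takes over. It parses the attribute string with wp_kses_hair() and then asks wp_kses_attr_check() about every attribute; only attributes that the whitelist names for this element are written back, and optional value checks such as maxlen are applied along the way. A style attribute is additionally cleaned by safecss_filter_attr().

Attributes that carry URLs also get a protocol check, covered next.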
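The attribute pass can be pictured with a small stand-alone sketch. It assumes the attributes have already been parsed into a name => value array (the real parser is considerably more careful), and demo_filter_attributes() is a name invented for this example.

```php
<?php
/**
 * Toy attribute filter: keeps only attributes listed for the given tag.
 * $attributes is an already-parsed name => value map (parsing is skipped here).
 */
function demo_filter_attributes( string $tag, array $attributes, array $allowed_html ): array {
    $allowed = $allowed_html[ strtolower( $tag ) ] ?? array();
    $kept    = array();

    foreach ( $attributes as $name => $value ) {
        // An attribute survives only if the whitelist names it for this tag.
        if ( ! empty( $allowed[ strtolower( $name ) ] ) ) {
            $kept[ $name ] = $value;
        }
    }

    return $kept;
}

$allowed_html = array( 'a' => array( 'href' => true, 'title' => true ) );
$attributes   = array( 'href' => 'https://example.com', 'onclick' => 'steal()' );

print_r( demo_filter_attributes( 'a', $attributes, $allowed_html ) );
// Only the href entry remains; onclick is dropped.
```

Event handlers such as onclick never make it through unless someone puts them on the whitelist, which is the whole point of the design.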

URL attributes and the protocol check

    // Abridged excerpt from the classic attribute parser wp_kses_hair().
    $uris = wp_kses_uri_attributes();

    // ... while parsing each name="value" pair ...
    if ( in_array( strtolower( $attrname ), $uris, true ) ) {
        $thisval = wp_kses_bad_protocol( $thisval, $allowed_protocols );
    }

There is no separate kses_bad_protocol() method. While the attribute string is being parsed, every attribute whose name is on the list returned by wp_kses_uri_attributes() (href, src, action, poster, and others) has its value passed through wp_kses_bad_protocol(), which checks the URL's protocol against the allowed protocol list.

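The wp_kses_bad_protocol() function

function wp_kses_bad_protocol( $content, $allowed_protocols ) {
    $content    = wp_kses_no_null( $content );
    $iterations = 0;

    // Strip disallowed protocols repeatedly until nothing changes,
    // but give up after six rounds.
    do {
        $original_content = $content;
        $content          = wp_kses_bad_protocol_once( $content, $allowed_protocols );
    } while ( $original_content !== $content && ++$iterations < 6 );

    // Still changing after six rounds: treat the value as hostile.
    if ( $original_content !== $content ) {
        return '';
    }

    return $content;
}

This function first removes NUL characters and then calls wp_kses_bad_protocol_once() in a loop, so that nested tricks such as javascript:javascript:alert(1) are peeled off layer by layer. Recent versions also return early when the value plainly starts with an allowed http:// or https://.

The wp_kses_bad_protocol_once() function

function wp_kses_bad_protocol_once( $content, $allowed_protocols, $count = 1 ) {
    // Split at the first colon, whether literal or written as an entity.
    $content2 = preg_split( '/:|&#0*58;|&#x0*3a;|&colon;/i', $content, 2 );

    if ( isset( $content2[1] ) && ! preg_match( '%/\?%', $content2[0] ) ) {
        $content  = trim( $content2[1] );
        $protocol = wp_kses_bad_protocol_once2( $content2[0], $allowed_protocols );
        // ... special handling of nested "feed:" URLs omitted ...
        $content = $protocol . $content;
    }

    return $content;
}

This function looks for the first colon, including colons disguised as the entities &#58;, &#x3a;, or &colon;, and treats whatever precedes it as the candidate protocol. That candidate is handed to wp_kses_bad_protocol_once2() for the actual whitelist check.

The wp_kses_bad_protocol_once2() function

function wp_kses_bad_protocol_once2( $scheme, $allowed_protocols ) {
    // Undo common obfuscation before comparing.
    $scheme = wp_kses_decode_entities( $scheme );
    $scheme = preg_replace( '/\s/', '', $scheme ); // Remove whitespace.
    $scheme = wp_kses_no_null( $scheme );
    $scheme = strtolower( $scheme );

    foreach ( (array) $allowed_protocols as $one_protocol ) {
        if ( strtolower( $one_protocol ) === $scheme ) {
            return "$scheme:";
        }
    }

    // Not on the whitelist: the protocol is dropped.
    return '';
}

This function decodes entities, removes whitespace and NUL characters, lowercases the candidate, and compares it with each allowed protocol. If it matches, the protocol is returned with its colon; otherwise an empty string is returned. The effect is that a disallowed protocol is cut off while the rest of the value is kept: with only http and https allowed, javascript:alert(1) becomes alert(1), which is no longer a script URL.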
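The protocol check boils down to "peel off the scheme, compare, repeat". Below is a simplified stand-alone sketch of that loop. The function name demo_strip_bad_protocol() is hypothetical, and unlike WordPress it only understands a literal colon, not entity-encoded ones.

```php
<?php
/**
 * Simplified sketch of protocol stripping (not the WordPress implementation).
 * Disallowed schemes are removed from the front of $url one at a time.
 */
function demo_strip_bad_protocol( string $url, array $allowed ): string {
    for ( $i = 0; $i < 6; $i++ ) {
        $parts = explode( ':', $url, 2 );

        // No colon, or the colon sits inside a path/query: nothing to check.
        if ( count( $parts ) < 2 || preg_match( '%[/?#]%', $parts[0] ) ) {
            return $url;
        }

        // Normalize the candidate scheme before comparing.
        $scheme = strtolower( preg_replace( '/\s/', '', $parts[0] ) );
        if ( in_array( $scheme, $allowed, true ) ) {
            return $url;
        }

        // Disallowed scheme: drop it and re-check what is left.
        $url = ltrim( $parts[1] );
    }

    // Six schemes were stripped in a row: give up and return nothing.
    return '';
}

$allowed = array( 'http', 'https' );

echo demo_strip_bad_protocol( 'https://example.com', $allowed ), PHP_EOL;
// prints: https://example.com
echo demo_strip_bad_protocol( 'javascript:javascript:alert(1)', $allowed ), PHP_EOL;
// prints: alert(1)
```

Looping matters: a single pass over javascript:javascript:alert(1) would leave a working javascript: URL behind.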

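5. Custom Whitelists: Take Control of Your Security

WordPress ships with default whitelists, but real projects often need their own.

Using the wp_kses_allowed_html filter

The wp_kses_allowed_html filter lets you modify the whitelist that wp_kses() uses for a given context.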

add_filter( 'wp_kses_allowed_html', 'my_custom_kses_allowed_html', 10, 2 );

function my_custom_kses_allowed_html( $allowed_html, $context ) {
    if ( 'post' === $context ) { // Only change the whitelist for post content.
        $allowed_html['iframe'] = array(
            'src'             => true,
            'width'           => true,
            'height'          => true,
            'frameborder'     => true,
            'allowfullscreen' => true,
        );
    }
    return $allowed_html;
}

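In this example we add the iframe tag to the post-content whitelist and allow its src, width, height, frameborder, and allowfullscreen attributes. Keep in mind that anyone whose content is filtered with the 'post' context can then embed frames, so only do this when you trust those authors.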
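The same filter works in the other direction, too. The sketch below tightens the post whitelist instead of loosening it; the callback name demo_narrow_allowed_html() and the choice of form and button are illustrative assumptions, and the function_exists() guard only keeps the snippet runnable outside WordPress.

```php
<?php
/**
 * Hypothetical filter callback: remove <form> and <button> from the
 * whitelist used for the 'post' context; other contexts are untouched.
 */
function demo_narrow_allowed_html( array $allowed_html, $context ): array {
    if ( 'post' === $context ) {
        unset( $allowed_html['form'], $allowed_html['button'] );
    }
    return $allowed_html;
}

// Register the callback only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'wp_kses_allowed_html', 'demo_narrow_allowed_html', 10, 2 );
}
```

Removing what you do not need is usually the safer customization, since every extra tag is one more thing to reason about.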

Building a custom whitelist array

You can also build your own whitelist array and pass it straight to wp_kses().

$string = '<p>This is a <strong>bold</strong> text. <a href="https://example.com">Link</a></p>';

$allowed_html = array(
    'p' => array(),
    'strong' => array(),
    'a' => array(
        'href' => true, // Protocols are restricted by the third argument to wp_kses().
        'title' => true,
    ),
);

$filtered_html = wp_kses( $string, $allowed_html, array( 'http', 'https' ) );

echo $filtered_html;

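In this example we define a custom whitelist that allows the p, strong, and a tags. On the a tag only href and title are allowed, and because the third argument to wp_kses() is array( 'http', 'https' ), href values may only use the http and https protocols.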

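6. Caveats: Security Is Never a Small Matter

Don't loosen the whitelist lightly: unless you have a solid reason, leave it alone. Every tag or attribute you add widens the attack surface for XSS.
Escape on output: besides filtering with wp_kses(), escape data at the point where you print it, using the function that matches the context: esc_html() for text, esc_attr() for attribute values, esc_url() for URLs.
Keep WordPress updated: security fixes ship continuously, so update regularly to keep your site protected.
Use security plugins: plugins such as Wordfence or Sucuri Security can add further layers of protection.
Audit your code: review your themes and plugins regularly for security holes.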

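7. Summary: The Core Idea of wp_kses()

wp_kses() applies a whitelist: only the HTML tags, attributes, and protocols on the list may appear in user-submitted content, and everything else is filtered out.

Its philosophy can be summed up as: better to turn away a thousand innocents than to let one attacker through.

This does mean that some perfectly legitimate HTML gets filtered out, but for the sake of security it is a price worth paying.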

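8. Bonus: Frequently Asked Questions

Q: Why was my HTML removed by wp_kses()?

A: Most likely it contains tags or attributes that are not on the whitelist. Check which whitelist is in effect, and extend it only if you really need to and understand the security risk.

Q: Can wp_kses() prevent XSS completely?

A: wp_kses() blocks most XSS attacks, but it cannot guarantee 100% safety. Attack techniques keep evolving, and wp_kses() itself may contain bugs. Combine it with the other measures described here, above all escaping on output.

Q: Does wp_kses() affect site performance?

A: wp_kses() has to scan and filter the HTML, so it does cost some server time. For comment-sized or post-sized input the cost is usually small; if you filter very large strings or call it many times per request, measure it and consider caching the filtered result.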
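For the first question, a quick way to see what the filter took out is to compare the tag names before and after. The two helpers below are hypothetical debugging aids written for this article in plain PHP; they look at tag names only, not attributes.

```php
<?php
/** Collect the distinct, lowercased names of all opening tags in $html. */
function demo_tag_names( string $html ): array {
    preg_match_all( '%<\s*([a-zA-Z0-9-]+)%', $html, $m );
    return array_values( array_unique( array_map( 'strtolower', $m[1] ) ) );
}

/** List tag names present in $before but missing from $after. */
function demo_removed_tags( string $before, string $after ): array {
    return array_values( array_diff( demo_tag_names( $before ), demo_tag_names( $after ) ) );
}

// $after would normally be the result of wp_kses( $before, ... ).
$before = '<p>Hi <script>x()</script></p>';
$after  = '<p>Hi x()</p>';

print_r( demo_removed_tags( $before, $after ) );
// The resulting list contains a single entry: script
```

Logging this list while developing makes it obvious which entries are missing from your whitelist.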

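That's all for today's talk. I hope you now have a deeper understanding of wp_kses() and are better equipped to protect your own WordPress site. Remember: security is never a small matter, so stay alert!

Thanks for watching, and see you next time!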
