Analyzing the source of WordPress's `wp_kses()` function: whitelist-based HTML content filtering

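Hello everyone, and welcome! I'm your speaker today, an old coder who has spent years crawling around in piles of code. Today we're going to talk about WordPress's wp_kses() function. Think of it as the "security guard" of the HTML world: its job is to filter content and help keep your site safe.

Intro: why do we need wp_kses()?

Imagine you let users submit HTML in comments or posts. If a malicious user slips in a <script>, that script runs in the browser of everyone who views the page. It can steal login cookies or quietly perform actions as a logged-in administrator, and then your site is in real trouble. Or the user inserts a malicious link that lures visitors to a phishing site, which hurts your users.

So, to defend against XSS (Cross-Site Scripting) attacks, we have to filter the HTML that users submit. wp_kses() is the tool WordPress gives us for that. It works from a whitelist: only the tags and attributes approved in advance get through, and everything else is thrown out!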

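wp_kses(): the core of whitelist filtering

The core idea of wp_kses() is: nothing I don't recognize gets in! You hand it a set of allowed HTML tags and attributes. Only tags and attributes on that whitelist are kept, and everything else is removed without mercy.

The signature of wp_kses()

Let's take a look at what wp_kses() really looks like:

wp_kses( string $content, array[]|string $allowed_html, string[] $allowed_protocols = array() ): string

$content: the HTML string to filter, i.e. what the user submitted. (Older WordPress versions name this parameter $string.)
$allowed_html: an associative array defining the allowed HTML tags and attributes, or the name of a predefined context such as 'post'. This is the key one!
$allowed_protocols: a whitelist array of allowed URL protocols, such as http, https and mailto.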

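$allowed_html: defining the whitelist rules

The structure of the $allowed_html array matters a lot, because it decides which tags and attributes survive. It looks like this:

$allowed_html = [
    'tag_name' => [
        'attribute_name'  => true,                // allow the attribute, no extra checks on its value
        'attribute_name2' => [ 'maxlen' => 100 ], // allow the attribute only if its value passes the listed checks
        // Note: an empty string ('') is treated as "not allowed" in current WordPress versions; use true instead.
    ],
    'tag_name2' => [
        // more attribute definitions...
    ],
    // more tag definitions...
];

tag_name: the name of an allowed HTML tag, such as a, p or img.
attribute_name: the name of an allowed attribute, such as href, src or class.
The value for an attribute can be:
- true: allow the attribute with no extra checks on its value.
- an array of checks: allow the attribute only when its value passes every listed check, for example [ 'maxlen' => 100 ]. This array is not a protocol list. URL protocols are restricted through $allowed_protocols.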
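
To make the array form concrete, here is a minimal sketch. It assumes, to the best of my knowledge of current WordPress versions, that an array value is read as check => value pairs (maxlen is one of the built-in checks). The whitelist and the 20-byte limit are made up for illustration, and the call is guarded with function_exists() so the file also loads outside WordPress.

```php
<?php
// Hypothetical whitelist: the tag/attribute choices and the limit are illustrative only.
$allowed_html = [
    'a' => [
        'href'  => true,               // kept as long as the URL passes the protocol check
        'title' => [ 'maxlen' => 20 ], // dropped when the value is longer than 20 bytes
    ],
];

$input = '<a href="https://example.com" title="This title is far too long to keep">link</a>';

// wp_kses() only exists once WordPress is loaded.
if ( function_exists( 'wp_kses' ) ) {
    echo wp_kses( $input, $allowed_html );
}
```

Inside WordPress this should keep the link and its href but drop the over-long title attribute.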

An example: a simple $allowed_html

$allowed_html = [
    'a' => [
        'href' => true,
        'title' => true,
        'class' => true,
        'rel' => true,
        'target' => true,
    ],
    'p' => [
        'class' => true,
        'style' => true,
    ],
    'img' => [
        'src' => true,
        'alt' => true,
        'title' => true,
        'width' => true,
        'height' => true,
    ],
    'br' => [],
    'strong' => [],
    'em' => [],
];

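This example allows the <a>, <p>, <img>, <br>, <strong> and <em> tags.

For <a>, the href, title, class, rel and target attributes are allowed. No per-attribute checks are set, but the URL in href still goes through the protocol check.
For <p>, the class and style attributes are allowed. WordPress still runs style values through its own CSS filter.
For <img>, the src, alt, title, width and height attributes are allowed, and src is protocol-checked like href.
For <br>, <strong> and <em>, the empty array means the tag itself is allowed but none of its attributes are.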

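$allowed_protocols: restricting URL protocols

To keep malicious links out, the $allowed_protocols parameter restricts which URL protocols are accepted. If you leave it empty, WordPress falls back to the list returned by wp_allowed_protocols(), which includes http, https, ftp, mailto, news, irc, gopher, nntp, telnet, mms, rtsp, svn, tel, fax and xmpp, among others.

You can pass your own $allowed_protocols to allow more or fewer protocols.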
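
The idea behind the protocol whitelist can be sketched in a few lines of plain PHP. This is only an illustration of the technique, not the WordPress implementation: sketch_protocol_allowed() is a hypothetical helper, and it assumes that a URL without a "scheme:" prefix counts as relative and is let through.

```php
<?php
/**
 * Sketch only: returns true when $url is relative or uses a scheme from $allowed_protocols.
 * sketch_protocol_allowed() is a hypothetical helper, not a WordPress function.
 */
function sketch_protocol_allowed( string $url, array $allowed_protocols ): bool {
    // Remove whitespace and control characters, which browsers tolerate inside a scheme.
    $url = preg_replace( '/[\x00-\x20]+/', '', $url );

    // No "scheme:" prefix means we treat the URL as relative.
    if ( ! preg_match( '/^([a-z][a-z0-9+.\-]*):/i', $url, $m ) ) {
        return true;
    }

    return in_array( strtolower( $m[1] ), $allowed_protocols, true );
}

$allowed = [ 'http', 'https', 'mailto' ];
var_dump( sketch_protocol_allowed( 'https://example.com', $allowed ) ); // bool(true)
var_dump( sketch_protocol_allowed( 'JavaScript:alert(1)', $allowed ) ); // bool(false)
var_dump( sketch_protocol_allowed( '/about', $allowed ) );              // bool(true)
```

A real filter has more to worry about, such as schemes hidden behind HTML entities, so treat this as a picture of the idea and leave the actual checking to wp_kses().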

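How wp_kses() works inside (simplified)

The real implementation of wp_kses() is fairly involved, but the core idea can be summed up with the following simplified sketch: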

// A simplified sketch of the whitelist idea. This is NOT the real wp_kses() source;
// kses_sketch() and sketch_url_scheme() are illustrative names.
function kses_sketch( string $html, array $allowed_html, array $allowed_protocols = [ 'http', 'https', 'mailto' ] ): string {
  // Attributes whose values are URLs and therefore get a protocol check.
  $uri_attributes = [ 'href', 'src', 'action', 'cite' ];

  // 1. Split the input into tags and the text between them (a crude tokenizer).
  $tokens = preg_split( '/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );

  $output = '';
  foreach ( $tokens as $token ) {
    if ( ! preg_match( '/^<(\/?)([a-z][a-z0-9]*)([^>]*)>$/i', $token, $m ) ) {
      // Text, or anything that is not a well-formed tag, is HTML-escaped.
      $output .= htmlspecialchars( $token, ENT_QUOTES, 'UTF-8', false );
      continue;
    }

    $tag_name = strtolower( $m[2] );
    if ( ! isset( $allowed_html[ $tag_name ] ) ) {
      continue; // 2. Tag not in the whitelist: drop the tag, keep the surrounding text.
    }

    if ( '/' === $m[1] ) {
      $output .= '</' . $tag_name . '>';
      continue;
    }

    $output .= '<' . $tag_name;

    // 3. Keep only whitelisted attributes (double-quoted values only, for brevity).
    preg_match_all( '/([a-z][a-z0-9_-]*)\s*=\s*"([^"]*)"/i', $m[3], $attributes, PREG_SET_ORDER );
    foreach ( $attributes as $attribute ) {
      $attr_name  = strtolower( $attribute[1] );
      $attr_value = $attribute[2];

      if ( empty( $allowed_html[ $tag_name ][ $attr_name ] ) ) {
        continue; // attribute not in the whitelist
      }

      // 4. URL attributes must use a whitelisted protocol; relative URLs pass.
      $scheme = sketch_url_scheme( $attr_value );
      if ( in_array( $attr_name, $uri_attributes, true ) && '' !== $scheme && ! in_array( $scheme, $allowed_protocols, true ) ) {
        continue;
      }

      $output .= ' ' . $attr_name . '="' . htmlspecialchars( $attr_value, ENT_QUOTES, 'UTF-8', false ) . '"';
    }
    $output .= '>';
  }

  return $output;
}

// Returns the lowercased scheme of $url, or '' when there is none.
function sketch_url_scheme( string $url ): string {
  return preg_match( '/^\s*([a-z][a-z0-9+.-]*):/i', $url, $m ) ? strtolower( $m[1] ) : '';
}

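This sketch simplifies the real wp_kses() implementation, but it shows the core logic:

Parse the HTML: split the HTML string into tags, attributes and text.
Tag filtering: check whether each tag is in $allowed_html.
Attribute filtering: check whether each attribute is listed under its tag in $allowed_html.
Protocol filtering: if the attribute value is a URL, check that its protocol is on the protocol whitelist.
Output: keep only the tags, attributes and text that passed the filters.
Escaping: text content is HTML-escaped (htmlspecialchars() under the hood) so it cannot be read as markup. The real wp_kses() is gentler here and mostly normalizes entities in the text.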

Using wp_kses(): an example

$user_input = '<p class="highlight">This is some <strong>bold</strong> text with <a href="https://example.com" onclick="alert(\'XSS\')">a link</a>.</p><script>alert("XSS");</script>';

$allowed_html = [
    'p' => [
        'class' => true,
    ],
    'strong' => [],
    'a' => [
        'href' => true,
    ],
];

$filtered_html = wp_kses( $user_input, $allowed_html );

echo "Original input:\n";
echo $user_input . "\n\n";

echo "Filtered output:\n";
echo $filtered_html . "\n";

// Output:
// Original input:
// <p class="highlight">This is some <strong>bold</strong> text with <a href="https://example.com" onclick="alert('XSS')">a link</a>.</p><script>alert("XSS");</script>
//
// Filtered output:
// <p class="highlight">This is some <strong>bold</strong> text with <a href="https://example.com">a link</a>.</p>alert("XSS");

In this example:

The <script> and </script> tags are removed because script is not in $allowed_html. Note that wp_kses() strips the tags only: the text between them, alert("XSS");, stays behind as harmless plain text.
The onclick attribute on <a> is removed because it is not in $allowed_html.
The class attribute on <p> is kept because it is in $allowed_html.
The <strong> tag is kept because it is in $allowed_html.

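wp_kses_post() and wp_kses_data(): the predefined whitelists WordPress ships

WordPress ships a few predefined whitelists so developers don't have to build one every time:

wp_kses_post(): filters post content. The allowed tags include <a>, <em>, <strong>, <cite>, <code>, <ul>, <ol>, <li>, <dl>, <dt>, <dd>, <b>, <i>, <q>, <del>, <ins>, <pre>, <abbr>, <acronym>, <p>, <br>, <hr>, <h1> through <h6> and <img>, plus many others such as <div>, <span>, <table> and <blockquote>. <script> and <iframe> are not on the default list. It is a relatively loose whitelist, suited to post content.
wp_kses_data(): filters with a much stricter list that only allows a handful of inline tags, roughly <a>, <abbr>, <b>, <blockquote>, <cite>, <code>, <del>, <em>, <i>, <q>, <s> and <strong>. It suits short pieces of user-submitted text, such as comments.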
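
Tag lists like the ones above drift between WordPress versions, so it can be worth asking WordPress directly. The sketch below assumes it runs inside WordPress (it is guarded so it does nothing elsewhere) and uses wp_kses_allowed_html() to print what the 'post' context allows. The sample input is made up.

```php
<?php
// Illustrative input only.
$html = '<p>Hello <strong>world</strong><script>alert(1)</script></p>';

// These functions only exist once WordPress is loaded.
if ( function_exists( 'wp_kses_post' ) ) {
    echo wp_kses_post( $html ), "\n"; // filtered with the 'post' context rules
    echo wp_kses_data( $html ), "\n"; // filtered with the stricter default rules

    // Inspect what a context allows instead of trusting a list in an article.
    $post_rules = wp_kses_allowed_html( 'post' );
    echo implode( ', ', array_keys( $post_rules ) ), "\n";
}
```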

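How do you pick the right whitelist?

Picking the right whitelist matters. Too strict, and users can't publish normal content. Too loose, and you open a security hole.

Some suggestions:

Choose by use case: pick the whitelist that matches the kind of content you are filtering. Post content can go through wp_kses_post(), for example, while user comments deserve a stricter list.
Least privilege: allow only the tags and attributes you actually need. Don't allow extra ones just because it's convenient.
Review regularly: go over the whitelist from time to time to make sure it is still safe and still fits.
Custom whitelists: if the predefined whitelists don't fit, define your own.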
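
One way to apply least privilege to a custom list is to start from an existing rule set and cut it down. The sketch below is just one possible approach: myplugin_narrow_rules() is a hypothetical helper, and the small stand-in array is there only so the code runs outside WordPress.

```php
<?php
/**
 * Sketch: narrow a base rule set down to a few tags and drop the style attribute.
 * myplugin_narrow_rules() is a hypothetical helper, not part of WordPress.
 */
function myplugin_narrow_rules( array $base_rules, array $keep_tags ): array {
    // Keep only the tags named in $keep_tags.
    $rules = array_intersect_key( $base_rules, array_flip( $keep_tags ) );

    // Remove the style attribute from every remaining tag.
    foreach ( $rules as $tag => $attributes ) {
        if ( is_array( $attributes ) ) {
            unset( $rules[ $tag ]['style'] );
        }
    }

    return $rules;
}

// Inside WordPress the base can come from wp_kses_allowed_html( 'post' );
// the stand-in array below is made up so the sketch runs on its own.
$base = function_exists( 'wp_kses_allowed_html' )
    ? wp_kses_allowed_html( 'post' )
    : [ 'p' => [ 'class' => true, 'style' => true ], 'a' => [ 'href' => true ], 'table' => [] ];

$rules = myplugin_narrow_rules( $base, [ 'a', 'p', 'strong' ] );
print_r( array_keys( $rules ) );
```

The resulting array can be passed to wp_kses() as $allowed_html.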

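Things to watch when customizing the whitelist

Build a custom whitelist carefully, or you may introduce a security hole yourself.

Allow only necessary attributes: stick to the ones you need, such as href, src, title and alt.
Restrict URL protocols: use the $allowed_protocols parameter to limit which protocols URLs may use.
Avoid the style attribute: style can carry malicious content, so leave it out if you can. If you must allow it, filter its value strictly.
Escape on output: anything that is not supposed to contain HTML at all should be escaped when it is printed (for example with esc_html() or esc_attr()) rather than passed through a whitelist.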

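Limitations of wp_kses()

wp_kses() is a powerful HTML filter, but it has limits:

It does not stop every XSS attack: wp_kses() only filters HTML tags and attributes. It does not make a value safe to drop into a <script> block, an inline event handler or a URL. Those contexts need their own escaping, for example esc_js(), esc_url() or esc_attr().
Performance: parsing and filtering take time, and the cost is most noticeable on large HTML strings.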

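Summary

wp_kses() is one of the most important security functions in WordPress. Working from a whitelist, it filters user-submitted HTML effectively and guards against XSS attacks. Understanding how wp_kses() works and how to use it is essential for building secure WordPress themes and plugins.

Hands-on exercise

Let's run through a simple exercise. Say you're building a comment system and need to filter the comments users submit.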

function sanitize_comment_content( $comment_content ) {
  $allowed_html = [
    'a' => [
      'href' => true,
      'rel' => true,
      'class' => true,
      'target' => true,
    ],
    'p' => [],
    'br' => [],
    'strong' => [],
    'em' => [],
    'code' => [],
    'pre' => [],
  ];

  $allowed_protocols = [ 'http', 'https', 'mailto' ];

  $sanitized_content = wp_kses( $comment_content, $allowed_html, $allowed_protocols );

  return $sanitized_content;
}

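// Usage example
// Assume the comment arrives in a POST request. WordPress adds slashes to $_POST, so unslash it first.
$comment = isset( $_POST['comment'] ) ? wp_unslash( $_POST['comment'] ) : '';

$sanitized_comment = sanitize_comment_content( $comment );

// Save $sanitized_comment to the database
// ...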

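In this example we define a sanitize_comment_content() function that filters comment content.

We define an $allowed_html array that only allows the <a>, <p>, <br>, <strong>, <em>, <code> and <pre> tags.
We use the $allowed_protocols array to limit URL protocols to http, https and mailto.
We call wp_kses() to filter the comment content.
Finally, we save the filtered comment to the database.

One last reminder

There are no small matters in security! Take user input seriously, use wp_kses() sensibly, and review your code regularly to keep your site safe.

That's it for today's talk. I hope you got something out of it. See you next time!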
