WordPress Source Code Deep Dive: the `Block` `HTML` Parser and the Internals of the `WP_HTML_Tag_Processor` Class

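Good evening, everyone! I'm Old Coder, and tonight we're going to talk about something fun: the HTML parser behind WordPress Blocks, that mysterious WP_HTML_Tag_Processor class. Don't let the long name scare you. It's really just a little monster that eats HTML, and one that WordPress raises itself.

As we all know, the Block editor makes content creation modular: each Block is like a Lego brick. But those bricks still have to become HTML before a browser can show them. So how do we process that HTML efficiently and accurately? That's where WP_HTML_Tag_Processor shines.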

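1. What is WP_HTML_Tag_Processor?

Put simply, WP_HTML_Tag_Processor is a class for scanning and modifying HTML strings; it has shipped with WordPress core since version 6.2. It is not a full HTML parser in the way a DOM parser is, because it builds no document tree. Its goal is performance: fast, lightweight, tag-level processing aimed at the specific needs of WordPress Blocks.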

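It mainly addresses the following needs:

Locating specific tags: finding a particular HTML tag, such as <img src="..."> or <div class="my-block">.
Reading tag attributes: extracting attribute values such as src, class, and alt.
Modifying tag attributes: changing attribute values, for example adding a new class name to the class attribute.
Handling self-closing tags: dealing correctly with tags written like <img /> and <br />.
Skipping irrelevant content quickly: moving past HTML that needs no processing as fast as possible.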

2. Basic usage of WP_HTML_Tag_Processor

Let's start with a simple example:

<?php

$html = '<div class="my-block"><img src="image.jpg" alt="My Image" /></div>';

$processor = new WP_HTML_Tag_Processor( $html );

// Find the first img tag.
if ( $processor->next_tag( 'img' ) ) {
  // Read the src attribute.
  $src = $processor->get_attribute( 'src' );
  echo "Image Source: " . esc_html( $src ) . "\n";

  // Change the alt attribute.
  $processor->set_attribute( 'alt', 'Updated Image Description' );
  echo "Updated HTML: " . esc_html( $processor->get_updated_html() ) . "\n";
} else {
  echo "No img tag found.\n";
}

?>

This example uses WP_HTML_Tag_Processor to find the first img tag, read its src attribute, and then change its alt attribute. $processor->next_tag( 'img' ) moves to the next tag named img; $processor->get_attribute( 'src' ) returns the value of src; $processor->set_attribute( 'alt', 'Updated Image Description' ) sets alt to 'Updated Image Description'; and $processor->get_updated_html() returns the resulting HTML.

Now let's go through the core methods of WP_HTML_Tag_Processor one by one:

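__construct( string $html ): the constructor. It takes the HTML string to process and initializes the scanner.
load_HTML( string $html ): loads a new HTML string. As far as I can tell the core class has no such method (it exists only in the simplified sketch in section 3); to process another string, create a new WP_HTML_Tag_Processor.
next_tag( string|array|null $query = null ): advances to the next matching tag. With null it stops at the next tag of any name; with a string it matches that tag name. The array form is a query, not a list of tag names, for example next_tag( array( 'tag_name' => 'img', 'class_name' => 'wp-image' ) ). To match several tag names, call next_tag() without a name and compare get_tag() yourself.
get_tag(): returns the name of the current tag in upper case, for example "IMG" or "DIV", or null when no tag is matched.
get_attribute( string $name ): returns the value of the given attribute on the current tag, true for a boolean attribute written without a value, or null if the attribute is absent.
set_attribute( string $name, string|bool $value ): sets the given attribute on the current tag.
remove_attribute( string $name ): removes the given attribute from the current tag.
get_updated_html(): returns the HTML string with every change made so far applied. If nothing was changed, it simply returns the original HTML.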
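Since next_tag() accepts an associative query array (keys such as tag_name and class_name), a small sketch may make that form concrete. It assumes WordPress 6.2 or later is loaded; the helper name demo_mark_external_links() and the class names external and is-external are made up for illustration. add_class() is a core method that the list above does not cover.

```php
<?php
// Sketch only: requires WordPress 6.2+ for WP_HTML_Tag_Processor.
// demo_mark_external_links() is a hypothetical helper, not a core function.
function demo_mark_external_links( string $html ): string {
    $processor = new WP_HTML_Tag_Processor( $html );

    // Query array: visit only <a> tags that carry the class "external".
    $query = array(
        'tag_name'   => 'a',
        'class_name' => 'external',
    );

    while ( $processor->next_tag( $query ) ) {
        // Add or overwrite rel, add a class, and drop an inline handler.
        $processor->set_attribute( 'rel', 'noopener noreferrer' );
        $processor->add_class( 'is-external' );
        $processor->remove_attribute( 'onclick' );
    }

    // Changes are applied to the string when the updated HTML is requested.
    return $processor->get_updated_html();
}
```

Inside WordPress you would call it as demo_mark_external_links( $content ) and print or return the result.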

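3. The internals of WP_HTML_Tag_Processor (simplified)

To get a feel for how WP_HTML_Tag_Processor works, let's look at a simplified implementation (error handling, performance optimizations and other details are omitted). It is a teaching sketch built on regular expressions: it mimics the public behavior, not the actual core code.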

<?php

// Teaching sketch only (PHP 8.0+). It imitates the public API with regular
// expressions and does not reproduce how core implements the class.
class Simple_HTML_Tag_Processor {

  private $html = '';
  private $position = 0;      // Byte offset where the next search starts.
  private $current_tag = null; // Data about the tag we are stopped on.

  public function __construct( string $html = '' ) {
    $this->html = $html;
  }

  public function load_HTML( string $html ): void {
    $this->html = $html;
    $this->position = 0;
    $this->current_tag = null;
  }

  // Unlike core, this sketch treats an array as a plain list of tag names.
  public function next_tag( string|array|null $tag_name = null ): bool {
    // Write pending changes to the previous tag back before moving on.
    $this->flush_current_tag();
    $this->current_tag = null;

    $wanted = null === $tag_name ? null : array_map( 'strtolower', (array) $tag_name );
    $start_tag_pattern = '/<([a-zA-Z][a-zA-Z0-9]*)/'; // Matches the start of an opening tag.

    while ( preg_match( $start_tag_pattern, $this->html, $matches, PREG_OFFSET_CAPTURE, $this->position ) ) {
      $tag_start = $matches[0][1];
      $tag_name_found = strtolower( $matches[1][0] );
      $this->position = $tag_start + strlen( $matches[0][0] );

      if ( null !== $wanted && ! in_array( $tag_name_found, $wanted, true ) ) {
        continue; // Not a tag we are looking for.
      }

      // Found a matching tag.
      $full_tag = $this->extract_full_tag( $tag_start );
      $this->current_tag = [
        'name'     => $tag_name_found,
        'start'    => $tag_start,
        'length'   => strlen( $full_tag ), // Length currently occupied in $this->html.
        'full_tag' => $full_tag,
      ];
      $this->position = $tag_start + strlen( $full_tag );
      return true;
    }

    // No matching tag found.
    return false;
  }

  // Returns the text from "<" up to and including the closing ">".
  // A ">" inside a quoted attribute value does not end the tag.
  private function extract_full_tag( int $start_position ): string {
    $length = strlen( $this->html );
    $quote = null;

    for ( $i = $start_position; $i < $length; $i++ ) {
      $char = $this->html[ $i ];
      if ( null !== $quote ) {
        if ( $char === $quote ) {
          $quote = null;
        }
      } elseif ( '"' === $char || "'" === $char ) {
        $quote = $char;
      } elseif ( '>' === $char ) {
        return substr( $this->html, $start_position, $i - $start_position + 1 );
      }
    }

    // Unterminated tag: return the rest of the document.
    return substr( $this->html, $start_position );
  }

  // Core returns upper-case names; this sketch returns lower case.
  public function get_tag(): ?string {
    return $this->current_tag ? $this->current_tag['name'] : null;
  }

  // Naive pattern: handles quoted values only, and needs whitespace before the name.
  private function attribute_pattern( string $attribute_name ): string {
    return '/(?<=\s)' . preg_quote( $attribute_name, '/' ) . '\s*=\s*(["\'])(.*?)\1/is';
  }

  public function get_attribute( string $attribute_name ): ?string {
    if ( ! $this->current_tag ) {
      return null;
    }

    if ( preg_match( $this->attribute_pattern( $attribute_name ), $this->current_tag['full_tag'], $matches ) ) {
      return html_entity_decode( $matches[2], ENT_QUOTES );
    }

    return null;
  }

  public function set_attribute( string $attribute_name, string $attribute_value ): void {
    $tag = $this->current_tag['full_tag'] ?? '';
    if ( ! str_ends_with( $tag, '>' ) ) {
      return; // No current tag, or the tag is unterminated.
    }

    $pattern = $this->attribute_pattern( $attribute_name );
    $new_attribute = $attribute_name . '="' . htmlspecialchars( $attribute_value, ENT_QUOTES ) . '"';

    if ( preg_match( $pattern, $tag ) ) {
      // Attribute exists: replace it. A callback keeps "$" and "\" in the value literal.
      $tag = preg_replace_callback( $pattern, static fn() => $new_attribute, $tag, 1 );
    } else {
      // Attribute missing: insert it before the closing ">" or "/>".
      $tag_end_pos = strlen( $tag ) - ( str_ends_with( $tag, '/>' ) ? 2 : 1 );
      $tag = rtrim( substr( $tag, 0, $tag_end_pos ) ) . ' ' . $new_attribute . substr( $tag, $tag_end_pos );
    }

    $this->current_tag['full_tag'] = $tag;
  }

  // Splices the (possibly modified) current tag back into the document.
  private function flush_current_tag(): void {
    if ( ! $this->current_tag ) {
      return;
    }
    $tag = $this->current_tag;
    $this->html = substr_replace( $this->html, $tag['full_tag'], $tag['start'], $tag['length'] );
    $this->current_tag['length'] = strlen( $tag['full_tag'] );
    $this->position = $tag['start'] + strlen( $tag['full_tag'] );
  }

  public function get_updated_html(): string {
    $this->flush_current_tag();
    return $this->html;
  }
}

// Example usage:
$html = '<div class="my-block"><img src="image.jpg" alt="My Image" data-id="123"/></div>';
$processor = new Simple_HTML_Tag_Processor( $html );

if ( $processor->next_tag( 'img' ) ) {
  echo "Tag Name: " . $processor->get_tag() . "\n";
  echo "Original HTML: " . $html . "\n";

  $src = $processor->get_attribute( 'src' );
  echo "Image Source: " . $src . "\n";

  $processor->set_attribute( 'alt', 'Updated Image Description' );
  $processor->set_attribute( 'data-id', '456' );

  echo "Updated HTML: " . $processor->get_updated_html() . "\n";
} else {
  echo "No img tag found.\n";
}

?>

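This simplified Simple_HTML_Tag_Processor illustrates the basic cycle that WP_HTML_Tag_Processor offers: find a tag, read or edit its attributes, and emit the updated string.

next_tag(): uses a regular expression to find the next tag. Note the PREG_OFFSET_CAPTURE flag, which reports the offset of each match within the string so that later steps know where the tag sits.
get_attribute(): uses a regular expression to extract an attribute value.
set_attribute(): uses a regular expression to replace the value of an existing attribute, or appends a new attribute at the end of the tag.
get_updated_html(): returns the HTML with the modified tag spliced back into the original string by plain string replacement.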

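4. Advanced usage of WP_HTML_Tag_Processor

WP_HTML_Tag_Processor also has some more advanced uses for more complex HTML.

Handling nested tags: the processor walks tags in document order and does not track nesting for you, so you have to keep that state yourself. For example, to look for img tags inside a div, first move to the div with next_tag( 'div' ) and then keep calling next_tag(). To know where the div ends you also need to visit closing tags (the 'tag_closers' => 'visit' query option) and count depth. Bookmarks (set_bookmark(), seek(), release_bookmark()) let you jump back to a tag you have already visited.
Handling comments and DOCTYPE: WP_HTML_Tag_Processor skips HTML comments and DOCTYPE declarations and does not mistake them for tags.
Performance optimizations: internally the class is written for speed, for instance by avoiding unnecessary string copies and by moving quickly past markup that a query does not care about.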
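The nested-tag case can be sketched roughly as follows. This assumes WordPress 6.2 or later is loaded and relies on the 'tag_closers' => 'visit' query option and is_tag_closer(); the helper name demo_count_images_in_first_div() is hypothetical, and the depth counting only works if the div tags in the input are balanced.

```php
<?php
// Sketch only: requires WordPress 6.2+ for WP_HTML_Tag_Processor.
// demo_count_images_in_first_div() is a hypothetical helper name.
function demo_count_images_in_first_div( string $html ): int {
    $processor = new WP_HTML_Tag_Processor( $html );

    // Move to the first opening <div>; nothing to count if there is none.
    if ( ! $processor->next_tag( 'div' ) ) {
        return 0;
    }

    $depth  = 1; // We are inside one <div> now.
    $images = 0;

    // Visit opening AND closing tags so we can tell when the div ends.
    while ( $depth > 0 && $processor->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
        $tag = $processor->get_tag(); // Upper-case name, e.g. "DIV".

        if ( 'DIV' === $tag ) {
            $depth += $processor->is_tag_closer() ? -1 : 1;
        } elseif ( 'IMG' === $tag && ! $processor->is_tag_closer() ) {
            $images++;
        }
    }

    return $images;
}
```

The manual depth counter is exactly the kind of hand-maintained state the text warns about; for real structural work, a parser that understands nesting is the safer choice.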

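5. Limitations of WP_HTML_Tag_Processor

WP_HTML_Tag_Processor is not a cure-all. Its limitations include:

Not a full HTML parser: it understands tags and attributes but not document structure, so it cannot reason about unclosed tags, incorrect nesting, and the like.
Regular expressions (in the simplified sketch): the Simple_HTML_Tag_Processor above depends on regular expressions, which can be slow and fragile on complex HTML. The core class, as far as I know, avoids this by scanning the string directly with functions such as strpos() and strcspn().
Complex state management: handling nested tags means tracking state by hand, which is easy to get wrong.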
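To show what scanning without regular expressions can look like, here is a small self-contained sketch built on strpos(), strspn() and strcspn(). It is my own illustration, not code taken from core; demo_find_next_tag() is a hypothetical helper, and it deliberately ignores comments, so a tag written inside <!-- --> would still be reported.

```php
<?php
// Hypothetical helper: returns array( offset, lower-case tag name ) for the
// next opening tag at or after $offset, or null if there is none.
function demo_find_next_tag( string $html, int $offset = 0 ): ?array {
    $letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $length  = strlen( $html );

    while ( $offset < $length ) {
        // Jump straight to the next "<" instead of testing every byte.
        $at = strpos( $html, '<', $offset );
        if ( false === $at ) {
            return null;
        }

        // An opening tag name must start with an ASCII letter.
        if ( $at + 1 < $length && 1 === strspn( $html, $letters, $at + 1, 1 ) ) {
            // The name runs until whitespace, "/" or ">".
            $name_length = strcspn( $html, " \t\f\r\n/>", $at + 1 );
            return array( $at, strtolower( substr( $html, $at + 1, $name_length ) ) );
        }

        // A stray "<", a closing tag, or a comment: keep looking.
        $offset = $at + 1;
    }

    return null;
}

var_dump( demo_find_next_tag( 'a < b <!-- c --> <IMG src="x">' ) );
```

The stray "<" and the comment opener are skipped because neither is followed by a letter, so the call reports the IMG tag at offset 17.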

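6. WP_HTML_Tag_Processor in WordPress Blocks

WP_HTML_Tag_Processor plays an important role in WordPress Blocks. Typical uses are:

Extracting Block attributes: reading values out of a Block's HTML, such as the src of an image Block or the href of a link.
Updating Block attributes: changing a Block's HTML, for example rewriting the alt of an image Block or adding a new CSS class name. On the server this usually happens while the Block is being rendered.
Converting Blocks: turning an old Block into a new one when the change amounts to editing tags and attributes. Restructuring the markup itself (adding or removing elements) is beyond what this class does.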
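As a sketch of the "updating" case on the server, a callback on the render_block filter can adjust a Block's markup just before output. This assumes WordPress 6.2 or later; the function name demo_add_class_to_image_block() and the class demo-framed are invented for illustration, and the add_filter() call is guarded so the file can be loaded outside WordPress without a fatal error.

```php
<?php
// Sketch only: requires WordPress 6.2+ for WP_HTML_Tag_Processor.
// demo_add_class_to_image_block() and "demo-framed" are hypothetical names.
function demo_add_class_to_image_block( string $block_content, array $block ): string {
    // Only touch core image Blocks; leave everything else unchanged.
    if ( 'core/image' !== ( $block['blockName'] ?? null ) ) {
        return $block_content;
    }

    $processor = new WP_HTML_Tag_Processor( $block_content );

    // Add a class to the first <img> in the rendered Block, if any.
    if ( $processor->next_tag( 'img' ) ) {
        $processor->add_class( 'demo-framed' );
    }

    return $processor->get_updated_html();
}

// Register the callback only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'render_block', 'demo_add_class_to_image_block', 10, 2 );
}
```

Working on the rendered string this way avoids building a DOM for what is a one-attribute change.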

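7. Summary

WP_HTML_Tag_Processor is an important tool in the WordPress Block ecosystem. It offers an efficient, lightweight way to read and modify HTML, and it underpins tasks such as extracting Block attributes, updating them, and converting Blocks. It has its limitations, but for the specific scenarios that WordPress Blocks create it remains a very practical choice.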

8. Hands-on practice

Let's try a slightly more involved example: suppose we want to add loading="lazy" to every img tag to speed up page loads.

<?php

$html = '<div class="content"><img src="image1.jpg" alt="Image 1"><p>Some text</p><img src="image2.jpg" alt="Image 2"></div>';

$processor = new WP_HTML_Tag_Processor( $html );

while ( $processor->next_tag( 'img' ) ) {
  // get_attribute() returns null only when the attribute is absent.
  if ( null === $processor->get_attribute( 'loading' ) ) {
    $processor->set_attribute( 'loading', 'lazy' );
  }
}

echo esc_html( $processor->get_updated_html() );

?>

This code walks over every img tag in the HTML string and adds loading="lazy" to any tag that does not already have a loading attribute.

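9. A few additional notes

Performance: the regex-based sketch in section 3 will bottleneck on its regular expressions when handling large HTML strings. The core WP_HTML_Tag_Processor does not, to my knowledge, depend on regular expressions in the same way; if you need tree-level operations on large documents, consider a different HTML parser.
The WP_HTML_Tag_Processor API may change in future WordPress versions. When you use it, check the latest official WordPress documentation.
WP_HTML_Tag_Processor is designed to be simple and easy to use. It does not offer the power of a DOM parser, but for the specific needs of WordPress Blocks it is enough.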
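On the point about avoiding unnecessary string copies: one common technique, which as I understand it is close in spirit to how the core class batches its changes, is to record each edit as (offset, length, replacement) against the original string and then build the result in a single pass. The sketch below is self-contained and purely illustrative; demo_apply_edits() is a hypothetical helper and it assumes the edits do not overlap.

```php
<?php
// Hypothetical helper: applies non-overlapping edits in one pass.
// Each edit is array( start, length, replacement ), measured against the
// ORIGINAL string, so earlier edits never shift the offsets of later ones.
function demo_apply_edits( string $html, array $edits ): string {
    // Process edits from left to right.
    usort( $edits, static fn( array $a, array $b ): int => $a[0] <=> $b[0] );

    $output = '';
    $cursor = 0; // How far into the original string we have copied.

    foreach ( $edits as [ $start, $length, $text ] ) {
        // Copy the untouched text before the edit, then the replacement.
        $output .= substr( $html, $cursor, $start - $cursor ) . $text;
        $cursor  = $start + $length;
    }

    // Copy whatever follows the last edit.
    return $output . substr( $html, $cursor );
}

// Replace the alt value of the first tag and add an attribute to the second.
echo demo_apply_edits(
    '<img alt="a"><img>',
    array(
        array( 17, 0, ' alt=""' ), // Insert before the final ">".
        array( 10, 1, 'b' ),       // Replace "a" with "b".
    )
), "\n";
// Prints: <img alt="b"><img alt="">
```

Compared with calling substr_replace() once per change, this copies each part of the document once, however many attributes are modified.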

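That's all for today's talk. I hope you now have a deeper understanding of WP_HTML_Tag_Processor. Remember, technology never stops moving, and it takes a steady appetite for learning to become a real expert. If you have any questions, go ahead and ask.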
