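Hands-On: Using an AI Simulator to Test How Different Bots Prefer to Crawl Your Website - 智猿学院: talks on cutting-edge technology in front end, back end, databases, AI, and cloud computing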

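Good afternoon, fellow engineers!

Today we take a close look at an area that is both challenging and full of opportunity: using an AI simulator to test and understand how different crawlers (bots) differ in the paths they prefer when crawling our website. Search engine optimization (SEO), content distribution, and even site security all depend on crawler behavior. We are not just building websites; we are holding a silent conversation with many kinds of automated agents. Understanding how these agents "think" and "act" is key to improving site performance and visibility.

As a programming practitioner, I know how wide the gap between theory and practice can be. This talk will therefore not stop at concepts. Together we will build a simplified AI crawler simulator and use code examples and step-by-step analysis to understand how it works and where it applies.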

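1. The complexity of the crawler world and why we need to understand it

Our website does not exist in isolation. It is visited continuously by automated programs, which we call crawlers or bots, and they come from many sources:

Search engine crawlers (such as Googlebot, Bingbot): they discover, fetch, and index content so that users can find it through search.
Social media crawlers (such as Facebook Crawler, Twitterbot): they fetch linked content to generate preview cards.
Content aggregator crawlers: they collect news, articles, or other specific content types from many sources.
Price comparison crawlers: they fetch product information and prices.
Monitoring crawlers: they check site availability, performance, or changes to specific content.
Malicious bots: data theft, spam, DDoS attacks, and so on.

Although they all "visit" the site, their goals, behavior patterns, resource limits, and ways of interpreting site structure differ widely. As a result, they show clearly different preferences in the paths they crawl.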

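Why does understanding these preferences matter?

SEO: make sure search engine crawlers discover and index our most important content efficiently. If they drift away from the core paths, important pages may never be indexed.
Crawl budget: a site's crawl budget is limited. Knowing crawler preferences helps steer bots toward high-value pages instead of spending the budget on low-priority or duplicate content.
Performance and bandwidth: poorly behaved crawling can overload the server and consume bandwidth unnecessarily.
Content distribution strategy: understanding what aggregator crawlers prefer helps us lay out content so that it is more likely to be discovered and spread.
Security and anti-scraping: recognizing the patterns of abnormal or malicious bots lets us apply effective countermeasures.
Site architecture validation: simulating crawler behavior exposes latent problems in internal linking and navigation design.

Traditional testing methods such as server log analysis are useful, but they are lagging and passive. Logs tell us what happened; they cannot easily answer "if I change A, what will the crawler do?" More importantly, we cannot experiment freely in production without risking real user experience and SEO rankings. This is where an AI simulator comes in.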
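To make the "what happened" view concrete, here is a minimal sketch of log analysis that counts requests per bot and path. The sample lines, the `BOT_MARKERS` table, and the `hits_by_bot` helper are all hypothetical, and the regular expression assumes the common "combined" access-log layout. Matching on User-Agent substrings is only an approximation, because user agents can be spoofed.

```python
import re
from collections import Counter

# Hypothetical sample lines in the "combined" access-log layout.
# The IP addresses come from the documentation range 192.0.2.0/24.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/May/2024:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.1 - - [10/May/2024:10:00:05 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.2 - - [10/May/2024:10:01:00 +0000] "GET /products/ HTTP/1.1" 200 900 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '192.0.2.3 - - [10/May/2024:10:02:00 +0000] "GET / HTTP/1.1" 200 300 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0"',
]

# Hypothetical marker table: lowercase substring -> label used in the report.
BOT_MARKERS = {"googlebot": "Googlebot", "bingbot": "Bingbot"}

LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def hits_by_bot(lines):
    """Count requests per (bot label, path); unmatched agents are 'other'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # Skip lines that do not follow the assumed layout
        user_agent = match.group("ua").lower()
        bot = next(
            (label for marker, label in BOT_MARKERS.items() if marker in user_agent),
            "other",
        )
        counts[(bot, match.group("path"))] += 1
    return counts

if __name__ == "__main__":
    for (bot, path), count in sorted(hits_by_bot(SAMPLE_LOG).items()):
        print(f"{bot:10s} {path:12s} {count}")
```

A report like this shows which paths each bot actually requested, but it cannot tell us how the bot would react to a changed link structure.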

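2. The AI simulator: more than a simple crawler

An AI simulator, in this context, is not just a web crawler. It is a virtual system that models the decision process and behavior of different crawler agents and their interaction with the site environment. Its core parts are:

An accurate model of the site environment: the simulator needs a copy or abstract model of the site that is as realistic as possible, including page content, link structure, response times, robots.txt, sitemap.xml, and so on.
A behavior model for each crawler agent: this is where the "AI" lives. For each crawler we define its crawl strategy, priorities, resource limits, and the way it responds to signals from the site.
Interaction and feedback loop: during a run, the crawler adjusts its next steps according to the "responses" of the simulated site.
Data collection and analysis: the simulator records crawl paths, visited pages, and problems encountered for later analysis.

Benefits of an AI simulator:

Safe: all tests run in an isolated environment and do not touch the production site.
Controllable: environment parameters such as network latency and server load can be set precisely.
Reproducible: the same configuration can be run repeatedly; with a fixed random seed and a deterministic link order, the results are identical.
Forward-looking: crawler behavior can be predicted before a launch or a major redesign.
Efficient: several crawler simulations can run in parallel to speed up testing.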

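3. Building a simplified AI crawler simulator: core components and practice

We will build the simulator in Python. Its rich library ecosystem suits network programming, data processing, and machine learning.

3.1 Overview of the core components

A basic AI crawler simulator needs at least the following components:

Website Model: represents the structure and content of the site under test.
Bot Agent: models the behavior and decision logic of each crawler.
Simulation Engine: drives the simulation and manages state.
Data Analyzer: collects, processes, and visualizes the results.

3.2 Building the website model: Representing the Web

This is the foundation of the simulation. We cannot run a full web server inside the simulator, so we need an abstract representation of the site.

Method 1: static crawl and graph representation

The most direct approach is to crawl a small number of key pages of the target site in advance and abstract their internal link structure as a graph.

Data structure:

We represent the link graph as a dictionary. Each key is a page URL, and each value is an object holding the page title, a content snippet, and the URLs of all outgoing links.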

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

class Page:
    def __init__(self, url, title="", content_snippet="", outgoing_links=None):
        self.url = url
        self.title = title
        self.content_snippet = content_snippet
        self.outgoing_links = set(outgoing_links) if outgoing_links else set()
        self.depth = -1  # Depth of the page in the site structure, or the depth at which a crawler found it
        self.last_modified = None  # Can stand in for the server's Last-Modified header

    def __repr__(self):
        return f"Page(url='{self.url}', title='{self.title[:30]}...', links={len(self.outgoing_links)})"

class WebsiteModel:
    def __init__(self, base_url):
        self.base_url = base_url
        self.pages = {} # {url: Page object}
        self.domain = urlparse(base_url).netloc

    def add_page(self, page_obj):
        if page_obj.url not in self.pages:
            self.pages[page_obj.url] = page_obj

    def get_page(self, url):
        return self.pages.get(url)

    def build_from_crawl(self, start_url, max_pages=100, max_depth=3):
        """
        Build the website model by crawling a real site.
        Note: this is a simplified fetcher used only to build the model; it does not simulate bot behavior.
        """
        queue = [(start_url, 0)]
        visited = set()

        print(f"Building website model from {start_url} (max_pages={max_pages}, max_depth={max_depth})...")

        while queue and len(self.pages) < max_pages:
            current_url, current_depth = queue.pop(0)

            if current_url in visited or current_depth > max_depth:
                continue

            visited.add(current_url)
            print(f"  Fetching: {current_url} (Depth: {current_depth})")

            try:
                response = requests.get(current_url, timeout=5)
                response.raise_for_status()  # Raise on HTTP error status codes
                soup = BeautifulSoup(response.text, 'html.parser')

                # soup.title.string is None when <title> is empty or contains nested tags
                title = (soup.title.string or "") if soup.title else ""
                body = soup.find('body')
                content_snippet = body.get_text(separator=' ', strip=True)[:200] if body else ""

                outgoing_links = set()
                for a_tag in soup.find_all('a', href=True):
                    # Resolve relative links and drop "#fragment" so one page is not stored twice
                    link = urljoin(current_url, a_tag['href']).split('#')[0]
                    # Keep same-domain links only
                    if urlparse(link).netloc == self.domain:
                        outgoing_links.add(link)
                        if link not in visited and len(self.pages) < max_pages:
                            queue.append((link, current_depth + 1))

                page_obj = Page(current_url, title, content_snippet, outgoing_links)
                page_obj.depth = current_depth  # Depth of the page when the model was built
                self.add_page(page_obj)

            except requests.exceptions.RequestException as e:
                print(f"    Error fetching {current_url}: {e}")
            except Exception as e:
                print(f"    Error processing {current_url}: {e}")

        print(f"Website model built with {len(self.pages)} pages.")
        return self.pages

    def get_all_urls(self):
        return list(self.pages.keys())

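    # Simulated robots.txt and sitemap.xml
    def allows_crawl(self, url, user_agent="*"):
        # Simplified: a real implementation would parse the robots.txt file.
        # All paths are allowed unless explicitly disallowed,
        # which here means the single rule "Disallow: /admin/".
        # user_agent is accepted for interface compatibility but not used yet.
        parsed_url = urlparse(url)
        if parsed_url.path.startswith('/admin/'):
            return False
        return True

    def get_sitemap_urls(self):
        # A real implementation would parse the sitemap.xml file.
        # If the caller has set sitemap_priority_urls on the model, use that list.
        explicit_urls = getattr(self, 'sitemap_priority_urls', None)
        if explicit_urls:
            return list(explicit_urls)
        # Otherwise, for simplicity, treat a subset of pages as the sitemap entries
        sitemap_priority_urls = [
            url for url, page in self.pages.items()
            if '/blog/' in url or '/products/' in url  # Assume blog and product pages are the sitemap focus
        ]
        return sitemap_priority_urls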

# Example: build the website model
# website_url = "http://quotes.toscrape.com/"  # A simple test site
# website_model = WebsiteModel(website_url)
# website_model.build_from_crawl(website_url, max_pages=50, max_depth=2)
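The `allows_crawl` method above hard-codes a single rule. As a sketch of what real robots.txt parsing could look like, the standard library's `urllib.robotparser` can evaluate rules per user agent. The robots.txt text and the `make_robots_checker` helper below are made up for illustration; this is one possible approach, not the only one.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A bot that matches a named group follows
# only that group, so the Googlebot group repeats the /admin/ rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /admin/
Disallow: /search/
"""

def make_robots_checker(robots_txt):
    """Return a function (user_agent, url) -> bool backed by RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # Parse in-memory text, no network access
    return parser.can_fetch

can_fetch = make_robots_checker(ROBOTS_TXT)

if __name__ == "__main__":
    for agent in ("Googlebot", "Bingbot"):
        for path in ("/blog/", "/search/q", "/admin/dashboard"):
            url = "http://example.com" + path
            print(f"{agent:10s} {path:18s} allowed={can_fetch(agent, url)}")
```

A checker like this could replace the body of `allows_crawl`, which would let different simulated bots see different rules for the same site.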

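Method 2: configuration file or API

For large or dynamic sites, crawling everything up front may not be practical. Instead we can:

Obtain the page list and metadata from an existing sitemap or API.
Describe the key pages and their link relationships in a simplified configuration file.
Embed a small web server that simulates the response logic of specific pages.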
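As an illustration of the first option, the sketch below reads page URLs and optional `lastmod` values from sitemap XML with the standard library. The sample sitemap and the `parse_sitemap` helper are hypothetical. The namespace is the one defined by the sitemaps.org protocol. Sitemap index files and gzip-compressed sitemaps are not handled.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content for the example site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>http://example.com/blog/</loc></url>
</urlset>
"""

def parse_sitemap(xml_text):
    """Return a list of (url, lastmod or None) tuples from sitemap XML text."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url_el.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod))
    return entries

if __name__ == "__main__":
    for loc, lastmod in parse_sitemap(SAMPLE_SITEMAP):
        print(loc, lastmod)
```

The resulting URLs could seed the website model, and the `lastmod` values could fill the `last_modified` field of each `Page`.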

In this talk we will mainly simulate on top of the static crawl and graph representation.

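3.3 Bot Agent: modeling behavior

This is the core of the simulator. It defines how each crawler chooses the next page to fetch.

Base class BaseBot:

Every crawler inherits from a base class that provides the common interface and state management.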

import random
import time
from collections import deque

class BaseBot:
    def __init__(self, name, website_model, crawl_budget=100, politeness_delay=0.1):
        self.name = name
        self.website_model = website_model
        self.crawl_budget = crawl_budget  # Maximum number of pages to crawl
        self.politeness_delay = politeness_delay  # Delay between consecutive fetches, in seconds
        self.visited_urls = set()
        self.crawl_queue = deque()
        self.crawl_path = []  # Crawl path, in visit order
        self.current_depth = {}  # Depth at which each URL was discovered
        self.pages_crawled_count = 0
        self.errors_count = 0
        self.start_time = None
        self.end_time = None

    def initialize_crawl(self, start_url):
        if not self.website_model.get_page(start_url):
            print(f"Warning: Start URL {start_url} not in website model.")
            return False
        self.crawl_queue.append((start_url, 0))
        self.current_depth[start_url] = 0
        self.start_time = time.time()
        return True

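    def choose_next_url(self):
        """
        Core method: pick the next URL to crawl according to the bot's strategy.
        Implemented by subclasses.
        """
        raise NotImplementedError("Subclasses must implement choose_next_url method.")

    def crawl_page(self, url, depth):
        """Simulate fetching one page."""
        if not self.website_model.allows_crawl(url, self.name):
            # print(f"[{self.name}] Disallowed by robots.txt: {url}")
            return None, []  # Blocked by the simulated robots.txt

        page = self.website_model.get_page(url)
        if not page:
            # print(f"[{self.name}] Page not found in model: {url}")
            # errors_count is not incremented here: run() already counts every
            # failed fetch once, so counting here as well would double-count.
            return None, []  # Page is missing from the model; treated as an error

        self.visited_urls.add(url)
        self.crawl_path.append(url)
        self.pages_crawled_count += 1
        self.current_depth[url] = depth

        # Simulate network latency and processing time
        time.sleep(self.politeness_delay)

        # print(f"[{self.name}] Crawled: {url} (Depth: {depth})")
        # Sort the links: iteration order of a set of strings can differ between
        # Python processes, which would make otherwise identical runs diverge.
        return page, sorted(page.outgoing_links)  # Page object and its outgoing links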

    def run(self, start_url):
        if not self.initialize_crawl(start_url):
            return

        print(f"[{self.name}] Starting crawl from {start_url}...")

        while self.crawl_queue and self.pages_crawled_count < self.crawl_budget:
            url, depth = self.choose_next_url()

            if url is None:
                # The queue is empty or the selection strategy returned None
                # print(f"[{self.name}] No more URLs to choose. Queue size: {len(self.crawl_queue)}")
                break

            if url in self.visited_urls:
                continue  # Already visited

            page, links = self.crawl_page(url, depth)

            if page:
                queued = {item[0] for item in self.crawl_queue}  # URLs already waiting in the queue
                for link in links:
                    if link not in self.visited_urls and link not in queued:
                        # Only enqueue links that exist in the website model
                        if self.website_model.get_page(link):
                            self.crawl_queue.append((link, depth + 1))
                            queued.add(link)
            else:
                self.errors_count += 1  # Fetch failed (blocked or missing page)

        self.end_time = time.time()
        print(f"[{self.name}] Crawl finished. Pages crawled: {self.pages_crawled_count}, Errors: {self.errors_count}")
        return self.crawl_path

    def get_metrics(self):
        duration = self.end_time - self.start_time if self.start_time and self.end_time else 0
        return {
            "name": self.name,
            "pages_crawled": self.pages_crawled_count,
            "unique_pages_crawled": len(self.visited_urls),
            "errors": self.errors_count,
            "crawl_duration_sec": round(duration, 2),
            "avg_crawl_speed_pages_sec": round(self.pages_crawled_count / duration, 2) if duration > 0 else 0,
            "crawl_depth_distribution": self._get_depth_distribution(),
            "path_length": len(self.crawl_path)
        }

    def _get_depth_distribution(self):
        depths = [self.current_depth.get(url, 0) for url in self.visited_urls]
        distribution = {}
        for d in depths:
            distribution[d] = distribution.get(d, 0) + 1
        return distribution

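Different types of bot agents:

We will implement a few representative crawlers and show how different implementations of choose_next_url produce different preferences.

1. SimpleBFSBot (breadth-first crawler)

Preference: crawl pages closer to the start page first.
Strategy: a queue, first in, first out.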

class SimpleBFSBot(BaseBot):
    def __init__(self, name, website_model, crawl_budget=100, politeness_delay=0.1):
        super().__init__(name, website_model, crawl_budget, politeness_delay)

    def choose_next_url(self):
        while self.crawl_queue:
            url, depth = self.crawl_queue.popleft()
            if url not in self.visited_urls:
                return url, depth
        return None, None  # Queue is empty: no more URLs to choose from

2. SimpleDFSBot (depth-first crawler)

Preference: crawl deeper pages first.
Strategy: a stack, last in, first out.

class SimpleDFSBot(BaseBot):
    def __init__(self, name, website_model, crawl_budget=100, politeness_delay=0.1):
        super().__init__(name, website_model, crawl_budget, politeness_delay)
        # Use a plain list as a stack. BaseBot.initialize_crawl() and BaseBot.run()
        # only call append() on the frontier, which a list supports as well, so
        # neither method needs to be overridden here.
        self.crawl_queue = []

    def choose_next_url(self):
        # Pop from the end (LIFO): the most recently discovered link is crawled
        # first, which imitates the recursive descent of a depth-first search.
        while self.crawl_queue:
            url, depth = self.crawl_queue.pop()
            if url not in self.visited_urls:
                return url, depth
        return None, None
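To see the two strategies side by side without the rest of the simulator, here is a small standalone sketch. The `GRAPH` dictionary and the `crawl_order` function are hypothetical and independent of the classes above. For brevity it marks a URL as seen when it is enqueued, while the bots above mark it when it is crawled.

```python
from collections import deque

# Hypothetical link graph: URL -> ordered list of outgoing links.
GRAPH = {
    "/": ["/blog/", "/products/"],
    "/blog/": ["/blog/a1", "/blog/a2"],
    "/products/": ["/products/x"],
    "/blog/a1": [],
    "/blog/a2": [],
    "/products/x": [],
}

def crawl_order(graph, start, depth_first=False):
    """Return the visit order for a FIFO (BFS) or LIFO (DFS) frontier."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # The only difference between the strategies is which end we take from.
        url = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

if __name__ == "__main__":
    print("BFS:", crawl_order(GRAPH, "/"))
    print("DFS:", crawl_order(GRAPH, "/", depth_first=True))
```

With this graph, the FIFO frontier finishes both first-level pages before any article, while the LIFO frontier reaches /products/x before it visits /blog/ at all.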

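3. GooglebotLikeBot (simulated search engine crawler)

Preferences:
- Freshness: crawl recently or frequently updated pages first.
- Importance/authority: pages that receive more internal links are usually considered more important.
- Sitemap first: give priority to URLs listed in sitemap.xml.
- robots.txt compliance: follow robots.txt rules strictly.
- Crawl depth: balance discovering new content against going deeper.
- URL structure: some URL patterns (such as /blog/, /category/) may receive a higher weight.
Strategy: choose the next URL with a weighted scoring system.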
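The authority signal is usually approximated by counting inbound links, which requires one pass over the whole link graph. A minimal sketch of that pass follows; the `LINK_GRAPH` data and the `inbound_link_counts` function are hypothetical, and real search engines use far richer signals than a plain count.

```python
from collections import Counter

# Hypothetical link graph: URL -> list of outgoing links.
LINK_GRAPH = {
    "/": ["/blog/", "/products/"],
    "/blog/": ["/blog/a1", "/"],
    "/blog/a1": ["/blog/", "/"],
    "/products/": ["/"],
}

def inbound_link_counts(link_graph):
    """Count, for every known page, how many other pages link to it."""
    counts = Counter({url: 0 for url in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):  # Count each source page at most once
            if target != source and target in link_graph:
                counts[target] += 1
    return counts

if __name__ == "__main__":
    for url, count in inbound_link_counts(LINK_GRAPH).most_common():
        print(f"{url:12s} inbound={count}")
```

A table like this can be computed once when the model is built and then looked up during scoring.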

class GooglebotLikeBot(BaseBot):
    def __init__(self, name, website_model, crawl_budget=100, politeness_delay=0.1):
        super().__init__(name, website_model, crawl_budget, politeness_delay)
        self.priority_queue = [] # (priority_score, url, depth)
        self.sitemap_urls = set(website_model.get_sitemap_urls())

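    def initialize_crawl(self, start_url):
        if not self.website_model.get_page(start_url):
            print(f"Warning: Start URL {start_url} not in website model.")
            return False

        # Put the start URL into the priority queue with its computed score
        initial_score = self._calculate_priority(start_url, 0)
        self.priority_queue.append((initial_score, start_url, 0))
        # BaseBot.run() loops only while self.crawl_queue is non-empty, so seed it
        # as well. Without this, the loop exits at once and this bot crawls nothing.
        # URL selection still comes from priority_queue; run() stops when
        # choose_next_url() returns (None, None) or the budget is used up.
        self.crawl_queue.append((start_url, 0))
        self.current_depth[start_url] = 0
        self.start_time = time.time()
        return True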

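    def _calculate_priority(self, url, depth):
        score = 0
        page = self.website_model.get_page(url)

        if not page:
            return -1  # Invalid URL: lowest priority

        # 1. Sitemap first
        if url in self.sitemap_urls:
            score += 10  # URLs listed in the sitemap get a high priority

        # 2. Link count (a rough stand-in for PageRank/importance)
        # Ideally, pages linked from more internal pages would score higher.
        # This simplified version uses outgoing links instead: a page with many
        # outgoing links is assumed to be a hub and therefore more important.
        # Counting inbound links is more accurate but requires preprocessing the whole site graph.
        if page.outgoing_links:
            score += min(len(page.outgoing_links) / 5, 5)  # At most 5 points

        # 3. URL structure preference
        if '/blog/' in url or '/products/' in url or '/category/' in url:
            score += 3  # Prefer content or product pages

        # 4. Freshness (a fuller model would add points for a recent last_modified)
        # Page.last_modified is never populated in this demo, so freshness is
        # reduced to a small exploration bonus for pages not yet visited.
        if url not in self.visited_urls:
            score += 1

        # 5. Depth penalty (avoid descending forever; balance depth against breadth)
        score -= depth * 0.5  # The deeper the page, the lower the priority

        # 6. Randomness (models some exploratory behavior)
        score += random.uniform(0, 0.5)

        return score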

    def choose_next_url(self):
        # Scan the queue and pick the entry with the highest stored score.
        # Scores are computed once, when a URL is discovered, and are not recomputed here.
        # The real Googlebot is more complex and may maintain a long-lived priority queue.
        if not self.priority_queue:
            return None, None

        # Drop URLs that have already been visited
        self.priority_queue = [(s, u, d) for s, u, d in self.priority_queue if u not in self.visited_urls]

        if not self.priority_queue:
            return None, None

        # Find the URL with the highest priority
        best_score = -float('inf')
        best_url_info = None
        best_index = -1

        for i, (score, url, depth) in enumerate(self.priority_queue):
            if score > best_score:
                best_score = score
                best_url_info = (url, depth)
                best_index = i
            elif score == best_score:
                # Equal scores: pick one at random so the path is not fixed
                if random.random() < 0.5:
                    best_url_info = (url, depth)
                    best_index = i

        if best_index != -1:
            self.priority_queue.pop(best_index)  # Remove the chosen URL from the queue
            return best_url_info
        return None, None

    def crawl_page(self, url, depth):
        page, links = super().crawl_page(url, depth)
        if page:
            for link in links:
                if link not in self.visited_urls and link not in [item[1] for item in self.priority_queue]:
                    if self.website_model.get_page(link):
                        # Add the newly discovered link to the priority queue
                        new_score = self._calculate_priority(link, depth + 1)
                        self.priority_queue.append((new_score, link, depth + 1))
        return page, links
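The `choose_next_url` method above scans the whole list on every call, which costs O(n) per selection. One common alternative is a binary heap, sketched below with the standard `heapq` module. The `PriorityFrontier` class is hypothetical and not part of the lecture's code. It assumes scores are fixed at push time, so it cannot re-score URLs later.

```python
import heapq
import itertools

class PriorityFrontier:
    """Max-priority frontier: pop() returns the highest-scoring URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()                 # Every URL ever pushed, even after it is popped
        self._counter = itertools.count()  # Tie-breaker: earlier pushes win on equal scores

    def push(self, url, depth, score):
        """Queue a URL once; return False if it was already pushed before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score to pop the maximum first.
        heapq.heappush(self._heap, (-score, next(self._counter), url, depth))
        return True

    def pop(self):
        """Return (url, depth) for the best entry, or None when empty."""
        if not self._heap:
            return None
        _, _, url, depth = heapq.heappop(self._heap)
        return url, depth

    def __len__(self):
        return len(self._heap)

if __name__ == "__main__":
    frontier = PriorityFrontier()
    frontier.push("/about/", 1, 1.0)
    frontier.push("/blog/", 1, 13.5)
    frontier.push("/products/", 1, 13.5)
    while len(frontier):
        print(frontier.pop())
```

Push and pop are O(log n), and the counter makes ties deterministic, which helps when comparing runs.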

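3.4 Simulation engine and execution

The simulation engine instantiates the bots, calls their run method, and coordinates the whole process. In our design, the run method of BaseBot already contains most of the engine logic, so we only need a main program to start the bots.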

# main_simulation.py

# ... (Previous classes: Page, WebsiteModel, BaseBot, SimpleBFSBot, SimpleDFSBot, GooglebotLikeBot) ...

def run_simulation(website_model, start_url, bot_configs):
    """
    Run several bot simulations.
    :param website_model: a website model that has already been built
    :param start_url: the URL where each simulation starts
    :param bot_configs: a list of bot configurations, for example
                        [{'name': 'BFS Bot', 'type': SimpleBFSBot, 'budget': 50}, ...]
    :return: the results for all bots (metrics and crawl_path)
    """
    all_results = []
    bot_instances = []

    # Instantiate all bots
    for config in bot_configs:
        bot_type = config['type']
        bot_name = config['name']
        crawl_budget = config.get('budget', 100)
        politeness_delay = config.get('politeness_delay', 0.05)  # Slightly shorter default to speed up the simulation

        bot = bot_type(bot_name, website_model, crawl_budget, politeness_delay)
        bot_instances.append(bot)

    # Run each bot
    for bot in bot_instances:
        print(f"\n--- Running simulation for {bot.name} ---")
        path = bot.run(start_url)
        metrics = bot.get_metrics()
        all_results.append({
            "bot_name": bot.name,
            "metrics": metrics,
            "crawl_path": path
        })
    return all_results
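The loop above runs the bots one after another. Since most of each run is spent sleeping in the politeness delay, the runs could also overlap in threads. The sketch below shows the idea with stand-in jobs. The `run_in_parallel` and `fake_crawl` names are hypothetical. It assumes each job keeps its own state and only reads shared data, which needs to be checked before applying it to real bot objects.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_in_parallel(jobs, max_workers=4):
    """Run zero-argument callables concurrently.

    jobs maps a name to a callable; the result maps each name to its return value.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(job) for name, job in jobs.items()}
        # result() blocks until the job finishes and re-raises its exception, if any.
        return {name: future.result() for name, future in futures.items()}

def fake_crawl(name, pages, delay=0.01):
    """Stand-in for bot.run(): sleeps per page and returns a made-up crawl path."""
    def job():
        path = []
        for i in range(pages):
            time.sleep(delay)  # Simulated politeness delay
            path.append(f"/{name}/page{i}")
        return path
    return job

if __name__ == "__main__":
    started = time.time()
    results = run_in_parallel({
        "bfs": fake_crawl("bfs", 5),
        "dfs": fake_crawl("dfs", 5),
    })
    print({name: len(path) for name, path in results.items()})
    print(f"elapsed: {time.time() - started:.2f}s")
```

With real bots, each job could be a small lambda that calls `bot.run(start_url)`. Any bot that draws from the shared `random` module would then no longer produce the same path on every run, even with a fixed seed.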

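3.5 Data analysis and visualization: reading the simulation results

After the simulation finishes, we analyze the collected data to understand the preferences of each bot.

Key metrics:

Pages crawled: how many pages were fetched in total.
Unique pages: how many distinct pages were fetched.
Crawl depth distribution: how many pages the bot visited at each depth level.
Crawl path: which pages were visited, and in what order.
Hot and cold pages: which pages every bot visits, and which pages are ignored.
Error rate: the number of pages that could not be accessed or were blocked.
Time efficiency: how long the crawl took.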
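Two small helpers can turn "which pages, in what order" into numbers that are easy to compare across bots. The sketch below computes the Jaccard overlap of two paths and the first position at which they diverge. Both function names and the sample paths are hypothetical.

```python
def path_overlap(path_a, path_b):
    """Jaccard similarity of the page sets: 1.0 = same pages, 0.0 = disjoint."""
    a, b = set(path_a), set(path_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def first_divergence(path_a, path_b):
    """Index of the first step where the paths differ, or None if they are identical."""
    for index, (url_a, url_b) in enumerate(zip(path_a, path_b)):
        if url_a != url_b:
            return index
    if len(path_a) != len(path_b):
        return min(len(path_a), len(path_b))  # One path is a prefix of the other
    return None

if __name__ == "__main__":
    # Hypothetical crawl paths from two bots.
    bfs_path = ["/", "/blog/", "/products/", "/blog/a1"]
    dfs_path = ["/", "/products/", "/products/x", "/blog/"]
    print("overlap:", path_overlap(bfs_path, dfs_path))
    print("first divergence at step:", first_divergence(bfs_path, dfs_path))
```

A low overlap under the same budget means the bots spend that budget on different parts of the site. An early divergence means their preferences differ from the very first choice.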

Using Pandas and Matplotlib (conceptual demonstration):

import pandas as pd
import matplotlib.pyplot as plt  # Conceptual only; the plotting calls below are commented out

def analyze_results(results):
    print("\n--- Simulation Analysis ---")

    # 1. Summary metrics
    metrics_df = pd.DataFrame([res['metrics'] for res in results])
    print("\nOverall Metrics:")
    print(metrics_df.set_index('name'))

    # 2. Crawl path comparison
    print("\nCrawl Path Comparison:")
    for res in results:
        print(f"\nBot: {res['bot_name']}")
        print(f"  Path Length: {len(res['crawl_path'])}")
        # print(f"  Path Sample: {res['crawl_path'][:10]}...")  # Paths can be long; show only the first 10
        print(f"  First 5 URLs Crawled: {res['crawl_path'][:5]}")

    # 3. Page visit frequency
    all_visited_urls = {}
    for res in results:
        for url in res['crawl_path']:
            all_visited_urls[url] = all_visited_urls.get(url, 0) + 1

    most_visited_pages = sorted(all_visited_urls.items(), key=lambda item: item[1], reverse=True)[:10]
    print("\nTop 10 Most Visited Pages Across All Bots:")
    for url, count in most_visited_pages:
        print(f"  {url}: {count} visits")

    # 4. Crawl depth distribution comparison
    print("\nCrawl Depth Distribution:")
    depth_data = {}
    for res in results:
        depth_data[res['bot_name']] = res['metrics']['crawl_depth_distribution']

    # Convert to a DataFrame for display and charting
    depth_df = pd.DataFrame(depth_data).fillna(0).sort_index()
    print(depth_df)

    # Conceptual chart: crawl depth distribution
    # plt.figure(figsize=(10, 6))
    # depth_df.plot(kind='bar', figsize=(12, 7))
    # plt.title('Crawl Depth Distribution by Bot')
    # plt.xlabel('Crawl Depth')
    # plt.ylabel('Number of Pages')
    # plt.xticks(rotation=45)
    # plt.tight_layout()
    # plt.show()

    # Conceptual chart: Unique Pages Crawled
    # metrics_df.set_index('name')['unique_pages_crawled'].plot(kind='bar', figsize=(8, 5))
    # plt.title('Unique Pages Crawled by Bot')
    # plt.ylabel('Count')
    # plt.xticks(rotation=45)
    # plt.tight_layout()
    # plt.show()

4. Putting it together: running and analyzing

Now let us combine all the components and run a complete simulation.

if __name__ == "__main__":
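    # 1. Build the website model
    # Note: to crawl a real site, use one you are allowed to crawl safely,
    # such as your own blog or a test site.
    # For a locally hosted site, make sure it is running on the given port.
    # website_url = "http://localhost:8000/"  # Assumes a local server is running
    # website_model = WebsiteModel(website_url)
    # website_model.build_from_crawl(website_url, max_pages=50, max_depth=2)

    # For this demo we build a small fictional site model by hand,
    # which avoids network requests and any dependency on an external site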
    print("--- Manually building a sample website model ---")
    sample_base_url = "http://example.com"
    website_model = WebsiteModel(sample_base_url)

    # Define some pages
    home_page = Page(sample_base_url + "/", "Home Page", "Welcome to our site.",
                     outgoing_links={sample_base_url + "/blog/", sample_base_url + "/products/", sample_base_url + "/about/"})
    blog_page = Page(sample_base_url + "/blog/", "Our Blog", "Latest articles.",
                     outgoing_links={sample_base_url + "/blog/article1", sample_base_url + "/blog/article2", sample_base_url + "/category/tech"})
    product_page = Page(sample_base_url + "/products/", "Our Products", "Check out our amazing products.",
                        outgoing_links={sample_base_url + "/products/itemA", sample_base_url + "/products/itemB", sample_base_url + "/category/gadgets"})
    about_page = Page(sample_base_url + "/about/", "About Us", "Learn more about our company.",
                      outgoing_links={sample_base_url + "/team/", sample_base_url + "/contact/"})
    article1_page = Page(sample_base_url + "/blog/article1", "Article One", "Content of article one.",
                         outgoing_links={sample_base_url + "/blog/article2", sample_base_url + "/category/tech"})
    article2_page = Page(sample_base_url + "/blog/article2", "Article Two", "Content of article two.",
                         outgoing_links={sample_base_url + "/blog/", sample_base_url + "/category/ai"})
    itemA_page = Page(sample_base_url + "/products/itemA", "Item A", "Details for item A.",
                      outgoing_links={sample_base_url + "/products/itemB"})
    itemB_page = Page(sample_base_url + "/products/itemB", "Item B", "Details for item B.",
                      outgoing_links={sample_base_url + "/products/"})
    category_tech_page = Page(sample_base_url + "/category/tech", "Tech Category", "All about technology.")
    category_gadgets_page = Page(sample_base_url + "/category/gadgets", "Gadgets Category", "Cool gadgets.")
    category_ai_page = Page(sample_base_url + "/category/ai", "AI Category", "Latest AI news.")
    team_page = Page(sample_base_url + "/team/", "Our Team", "Meet the team.")
    contact_page = Page(sample_base_url + "/contact/", "Contact Us", "Get in touch.")
    admin_page = Page(sample_base_url + "/admin/dashboard", "Admin Dashboard", "Sensitive content.")  # A page that the simulated robots.txt disallows

    website_model.add_page(home_page)
    website_model.add_page(blog_page)
    website_model.add_page(product_page)
    website_model.add_page(about_page)
    website_model.add_page(article1_page)
    website_model.add_page(article2_page)
    website_model.add_page(itemA_page)
    website_model.add_page(itemB_page)
    website_model.add_page(category_tech_page)
    website_model.add_page(category_gadgets_page)
    website_model.add_page(category_ai_page)
    website_model.add_page(team_page)
    website_model.add_page(contact_page)
    website_model.add_page(admin_page)

    # Simulated sitemap.xml listing the important pages
    website_model.sitemap_priority_urls = {
        sample_base_url + "/",
        sample_base_url + "/blog/",
        sample_base_url + "/products/",
        sample_base_url + "/blog/article1",
        sample_base_url + "/blog/article2",
        sample_base_url + "/products/itemA",
        sample_base_url + "/products/itemB"
    }

    print(f"Website model has {len(website_model.pages)} pages.")

    # 2. Configure the bots
    start_url = sample_base_url + "/"
    bot_configs = [
        {'name': 'BFS Bot', 'type': SimpleBFSBot, 'budget': 15, 'politeness_delay': 0.01},
        {'name': 'DFS Bot', 'type': SimpleDFSBot, 'budget': 15, 'politeness_delay': 0.01},
        {'name': 'Googlebot-like', 'type': GooglebotLikeBot, 'budget': 15, 'politeness_delay': 0.01},
    ]

    # 3. Run the simulation
    simulation_results = run_simulation(website_model, start_url, bot_configs)

    # 4. Analyze the results
    analyze_results(simulation_results)

    # 5. More detailed pairwise path comparison
    print("\n--- Detailed Path Differences ---")
    for i, res1 in enumerate(simulation_results):
        for j, res2 in enumerate(simulation_results):
            if i >= j:
                continue
            bot1_name = res1['bot_name']
            bot2_name = res2['bot_name']
            path1 = set(res1['crawl_path'])
            path2 = set(res2['crawl_path'])

            common_pages = path1.intersection(path2)
            unique_to_bot1 = path1 - path2
            unique_to_bot2 = path2 - path1

            print(f"\nComparing {bot1_name} vs {bot2_name}:")
            print(f"  Common Pages ({len(common_pages)}): {list(common_pages)[:5]}...")
            print(f"  Unique to {bot1_name} ({len(unique_to_bot1)}): {list(unique_to_bot1)[:5]}...")
            print(f"  Unique to {bot2_name} ({len(unique_to_bot2)}): {list(unique_to_bot2)[:5]}...")

            # Pages the Googlebot-like bot reached that the other bot did not.
            # The Googlebot-like bot can be either side of the pair, so check both.
            for gb_name, other_name, gb_only in ((bot1_name, bot2_name, unique_to_bot1),
                                                 (bot2_name, bot1_name, unique_to_bot2)):
                if gb_name != 'Googlebot-like':
                    continue
                print(f"\n  {gb_name} prioritized these (vs {other_name}):")
                for url in sorted(gb_only):
                    page = website_model.get_page(url)
                    if page:
                        print(f"    - {url} (Title: {page.title})")

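Simulation results table (illustrative example only; the 14-page sample site above has 13 pages reachable from the home page, so an actual run crawls at most 13 pages per bot and its figures will differ):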

Bot Name	Pages Crawled	Unique Pages	Errors	Crawl Duration (s)	Avg Speed (pages/s)	Depth Distribution
BFS Bot	15	13	1	0.15	100	{0: 1, 1: 4, 2: 8}
DFS Bot	15	14	1	0.15	100	{0: 1, 1: 1, 2: 2, 3: 4, 4: 7}
Googlebot-like	15	14	0	0.15	100	{0: 1, 1: 3, 2: 7, 3: 4}

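Reading the differences:

BFS Bot fetches all first-level links before any second-level link. Its paths are usually shallow, but its coverage is broad.
DFS Bot follows one path downward until it runs out of links or hits a limit, so its depth distribution leans toward deeper levels.
Googlebot-like Bot decides by scoring link counts, URL structure, the sitemap, and other signals. It may reach pages it scores as "important" (for example, /blog/article1) early, whereas BFS gets to them only after finishing the shallower level, and DFS only if it happens to descend along that path. No bot fetches /admin/dashboard: allows_crawl blocks it, and in the sample site no page links to it, so it is never even discovered.

Comparing these results, we can draw conclusions such as:

If BFS Bot misses deep pages that you consider important, your internal link structure may not be flat enough, or the entry points to those pages may not be prominent.
If Googlebot-like Bot visits certain pages often while other bots pay them less attention, those pages are good candidates for further content and SEO work.
If one bot has an unusually high error rate, parts of the site may be unfriendly to that bot (for example, JavaScript rendering problems or malformed URLs).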
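One way to check whether a bot "misses" important pages is to record how early each priority URL appears in its crawl path and what share of the budget went to priority pages. The sketch below does both. The function names and the sample data are hypothetical.

```python
def first_visit_positions(crawl_path, priority_urls):
    """Map each priority URL to its 1-based position in the path, or None if never crawled."""
    positions = {}
    for index, url in enumerate(crawl_path, start=1):
        if url in priority_urls and url not in positions:
            positions[url] = index
    return {url: positions.get(url) for url in priority_urls}

def priority_share(crawl_path, priority_urls):
    """Fraction of crawled pages that are priority pages (0.0 for an empty path)."""
    if not crawl_path:
        return 0.0
    hits = sum(1 for url in crawl_path if url in priority_urls)
    return hits / len(crawl_path)

if __name__ == "__main__":
    # Hypothetical crawl path and priority list.
    path = ["/", "/about/", "/blog/", "/team/", "/blog/a1"]
    priority = ["/blog/", "/blog/a1", "/products/"]
    print(first_visit_positions(path, priority))
    print(f"share of budget on priority pages: {priority_share(path, priority):.0%}")
```

A priority URL that maps to None was never reached within the budget, which points to the linking problem described above.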

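5. Advanced topics and outlook

Our current simulator is a basic version, but it can be extended to handle more complex scenarios:

JavaScript rendering: modern sites rely heavily on JavaScript to generate content. The simulator can integrate a headless browser through a tool such as Playwright or Selenium to model how a crawler executes JavaScript.
Dynamic content and personalization: a site may return different content depending on the IP address, User-Agent, or other parameters of the user (or crawler). The simulator can model these variants.
Rate limiting and politeness: model the site's rate limits and the crawler's politeness delay more precisely, along with their effect on crawl efficiency.
Content similarity / duplication detection: crawlers try to recognize duplicate content. The simulator can integrate a text similarity algorithm to estimate whether a bot wastes budget on duplicate pages.
Incremental crawling and updates: model how crawlers handle content updates and whether they discover new content effectively.
Reinforcement Learning (RL): model the crawler as an RL agent that learns a crawl strategy inside the simulated environment, which may surface new and unexpected crawl preferences. This can yield more realistic behavior models and may even reveal unknown crawl traps on the site.
Distributed simulation: for large sites, a single simulator may not be enough to simulate crawling tens of thousands of pages. A distributed simulation environment can be built instead.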
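As a sketch of the duplicate-detection idea, the example below compares two page texts by the Jaccard similarity of their word shingles. The `shingles` and `similarity` functions, the shingle size of 3, and the sample texts are illustrative assumptions. Production systems typically use scalable variants such as MinHash or SimHash.

```python
def shingles(text, size=3):
    """Return the set of consecutive word tuples ("shingles") of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(text_a, text_b, size=3):
    """Jaccard similarity of the shingle sets: 1.0 = identical, 0.0 = nothing shared."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    # Hypothetical product pages that differ in a single word.
    page_red = "red widget with free shipping and two year warranty"
    page_blue = "blue widget with free shipping and two year warranty"
    page_other = "contact our support team by email or phone today"
    print(similarity(page_red, page_blue))
    print(similarity(page_red, page_other))
```

In the simulator, a bot could skip or down-rank a page whose similarity to an already crawled page exceeds a chosen threshold.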

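Challenges and limitations:

Model accuracy: the real behavior of crawlers is complex and keeps changing. Our model is only an approximation.
Environment realism: a simulator can hardly reproduce every dynamic behavior of a real site, such as server response times and CDN effects.
Computing resources: as the simulation grows in complexity and scale, the required computing resources rise sharply.

Despite these challenges, an AI crawler simulator remains a powerful tool for understanding, predicting, and optimizing crawler behavior. It gives us a safe sandbox for in-depth experiments and analysis without affecting production.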

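Outlook and practical value

Using an AI simulator to test how different bots prefer to crawl a site is a valuable part of modern site operations and SEO strategy. It turns us from passive observers into active designers and optimizers, gives us a deeper understanding of how crawlers move through a site, and helps us get more value from the site.

I hope that this walk through website modeling, bot agent design, simulation execution, and result analysis has given you the core ideas and practical methods for building such a simulator. As AI technology develops, crawler simulators will become more capable and will reveal more about how crawlers behave.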