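What Is 'Data Masking for LLMs'? Automatically Masking PII Before Logs Are Sent to an External Inference Service - 智猿学院: lectures on frontier technology in front-end and back-end development, databases, AI, and cloud computing

Data Masking in Practice for Large Language Model (LLM) Applications: Protecting PII in Logs

Large language models (LLMs) are reshaping software development. They can understand, generate, and process human language, which gives applications capabilities they never had before. That power comes with an equally new challenge for data privacy. When we send application logs, user input, or other sensitive data to an external LLM inference service, keeping personally identifiable information (PII) from leaking becomes a central concern. This lecture explores "Data Masking for LLMs", with a focus on automatically masking PII before logs are sent to an external inference service.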

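Introduction: Data Privacy Challenges in the LLM Era

In the age of AI, and of LLMs in particular, data plays a central role. For an LLM to understand user intent, offer relevant help, or diagnose a problem, an application usually has to send it logs, user queries, system state, and similar information. These data streams often carry a large amount of sensitive PII: names, email addresses, phone numbers, home addresses, health information, even financial data.

Sending data that contains PII straight to an external LLM service carries risks that cannot be ignored, whatever privacy commitments the provider makes:

Data breach risk: The third-party service may have security vulnerabilities that expose the PII.
Compliance issues: Strict privacy regulations such as GDPR, CCPA, and HIPAA hold the company responsible for how PII is handled. Transmitting unmasked data can lead to heavy fines and litigation.
Loss of trust: Users are increasingly worried about data privacy. A company that fails to protect their data seriously damages user trust and its own brand.
Model memorization and regurgitation: An LLM may memorize PII in its training data and reproduce it later. Even at inference time, if the prompt contains PII, the model may expose it in a response or use it to infer other sensitive information.
Loss of data control: Once data leaves your control, you have far less say over how it is stored, processed, or used for later model training.

For these reasons, an automated, efficient, and reliable data masking mechanism, applied before any data that might contain PII reaches an external LLM inference service, is no longer optional. It is a requirement. This lecture covers the knowledge and techniques needed to build one.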

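PII and Sensitive Data: Definitions and Risks in the LLM Context

First we need to be clear about what PII is and what makes it special in LLM applications.

What is PII?

Personally identifiable information (PII) is any information that can identify, contact, or locate an individual, either on its own or in combination with other information. It includes both direct and indirect identifiers.

Category	Direct identifiers	Indirect identifiers
General	Name, email address, phone number, Social Security number (SSN), passport number, driver's license number, credit card number, IP address, MAC address, device ID	Date of birth, gender, ethnicity, nationality, geographic location, occupation, education level, income, user behavior data
Health	Medical record number, medical records, health insurance information	Diagnoses, treatment plans, medication information
Financial	Bank account number, transaction records, payslips	Credit score, investment portfolio
Biometric	Fingerprints, facial recognition data, retina scans	Voice samples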

Common examples of PII in logs:

Application logs are a breeding ground for PII, especially in debugging, monitoring, and user support scenarios. Typical log entries include:

User registration: User 'John Doe' registered with email 'john.doe@example.com' and phone '+1-555-123-4567'.
API request/response bodies: API call to /checkout, payload: {"userId": "user123", "shippingAddress": "123 Main St, Anytown, USA", "creditCardLast4": "4321"}
Error messages: Failed to process order for customer ID 'cust987', error: 'Invalid name provided: J. Doe'.
Search queries: User searched for 'Dr. Emily White's clinic hours'.
System events: User 'Alice Smith' logged in from IP '203.0.113.42'.

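Risks specific to LLMs:

The strength of an LLM is its grasp of context and its ability to generate human-like text. The same strength creates risks that are specific to PII:

Pattern recognition and inference: An LLM can recognize obvious PII, and it can also use context to connect pieces of information that look unrelated, identifying a person indirectly. For example, even with the name masked, a log that describes "a female software engineer living in Berkeley, California, born on May 1, 1980" may let the model infer that this is a well-known person whose details are public.
Generative PII: While generating text, an LLM may accidentally "create" PII that looks real, or recombine PII fragments from its training data into new identifying information.
The black-box problem: External LLM services are usually black boxes. We cannot fully know how they process, store, or use the data we send. Even when a provider promises not to store user data, its temporary internal processing can still be a risk.
Fine-tuning contamination: If your data is used to fine-tune a model and it contains PII, the model "learns" and internalizes that PII, which may then leak in later interactions.

We therefore need a robust mechanism that removes or transforms PII before the data reaches the LLM service.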

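Core Concepts and Techniques of Data Masking

Data masking transforms sensitive data so that it is no longer sensitive but still retains some of its function. The goal is to protect privacy while keeping the data as usable as possible, so that the LLM can still do its job and the data can still serve analysis, debugging, and similar purposes.

Goals of data masking:

Protect privacy: The primary goal is to prevent PII from leaking.
Ensure compliance: Meet data protection regulations such as GDPR, CCPA, and HIPAA.
Preserve data utility: Masked data should keep enough structure and meaning for the LLM or other systems to keep processing it without losing all context. A masked phone number, for example, should still look like a phone number rather than a random string.
Reduce risk: Lower the financial, legal, and reputational exposure that follows a data breach.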

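Common data masking techniques:

Nullification/Deletion: The most thorough option. The sensitive data is removed outright.
Substitution: Replace real PII with fake data of the same format and type. For example, swap a real name for a made-up one, or a real address for an email address that is valid in form but belongs to no one.
Redaction/Partial masking: Replace part of the sensitive value with a special character such as X or *, or replace the whole field with a placeholder. For example, a credit card number shows only its last four digits, XXXX-XXXX-XXXX-1234, and an email address appears as j****@example.com.
Shuffling/Permutation: Randomly shuffle the values within one column so they no longer line up with their original records, while the distribution of the column is preserved. For example, reassign every customer's name to some other customer.
Encryption: Convert the data to ciphertext that requires a key to decrypt. For an external LLM service, ciphertext the service cannot decrypt is effectively masked. If the service has to decrypt the data to process it, key management needs great care.
Hashing: Run the data through a hash function to produce a fixed-length string. A hash is one-way by design, yet the same input always yields the same output. This is useful when you need to check uniqueness without exposing the original value.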
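The partial masking and shuffling techniques above can be sketched in a few lines of standard-library Python. This is only an illustration: the function names `mask_card_number`, `mask_email`, and `shuffle_column` are made up for this example, and a real system would need to handle many more input shapes.

```python
import random


def mask_card_number(card: str, keep_last: int = 4, mask_char: str = "X") -> str:
    """Mask every digit except the last `keep_last`, leaving separators in place."""
    total_digits = sum(ch.isdigit() for ch in card)
    digits_seen = 0
    out = []
    for ch in card:
        if ch.isdigit():
            digits_seen += 1
            # A digit survives only if it is among the final `keep_last` digits.
            out.append(ch if digits_seen > total_digits - keep_last else mask_char)
        else:
            out.append(ch)
    return "".join(out)


def mask_email(email: str, mask_char: str = "*") -> str:
    """Keep the first character of the local part and the whole domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return mask_char * len(email)  # not shaped like an email: mask everything
    return local[0] + mask_char * (len(local) - 1) + "@" + domain


def shuffle_column(values: list, seed=None) -> list:
    """Return a shuffled copy of a column, breaking the link between row and value."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


print(mask_card_number("4111-2222-3333-1234"))  # XXXX-XXXX-XXXX-1234
print(mask_email("jdoe@example.com"))           # j***@example.com
print(shuffle_column(["Ann", "Bo", "Cy"], seed=7))
```

Note that shuffling keeps every real value in the data set, so it suits internal analytics better than logs that leave your infrastructure.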

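Considerations when choosing a masking technique:

Data type: Different kinds of PII call for different masking methods.
Utility requirements: How much context does the LLM need? Can it still understand the intent after masking?
Reversibility: Will the original data ever need to be recovered? Most masking is irreversible.
Performance: Can the masking step handle a large volume of logs in real time or near real time?
Security: Does the masking method itself have weaknesses?
Compliance: Does the chosen method satisfy every applicable regulation?

For logs sent to an external LLM service, we generally prefer irreversible methods from which the original PII cannot be derived: full substitution, partial masking, or hashing combined with substitution.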

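Challenges of Data Masking

Building a data masking system for LLMs raises a series of challenges:

Context retention and utility: Masking that is too aggressive can destroy the context of a log, so the LLM can no longer understand or respond correctly. If every date becomes [DATE], for example, the LLM cannot make sense of an instruction like "book the meeting for next Tuesday". Finding the balance between privacy and utility is essential.
Unstructured and semi-structured data: Logs come as free text, JSON, key-value pairs, and other loosely structured formats, which makes PII much harder to find than in a structured database.
Balancing high recall against high precision:
- False positives: Non-PII is mistaken for PII and masked, and useful information may be lost. For example, "Apple" is tagged as a company name and masked even though it refers to the fruit.
- False negatives: Real PII goes undetected and leaks. This is the more serious risk.
Language and domain specificity: PII patterns vary with language and domain. The rules for recognizing Chinese names, for instance, are very different from those for English names, and detecting PII in medical text may require knowledge of specialist terminology.
Scalability: Modern applications can produce enormous volumes of log data. The masking system must process real-time or near-real-time log streams efficiently.
An evolving definition of PII: As technology and regulation change, the definition of PII may widen, so the masking system has to be configurable and updatable.
Ambiguous or deliberately obscured PII: Users or systems may express PII in non-standard ways, such as "my phone is 555-1234 area code 212".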
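To make the precision and recall trade-off concrete, the following sketch scores a detector against hand-labeled spans. It assumes, purely for illustration, that both the detector output and the labels are sets of `(start, end, type)` tuples and that only exact matches count; `precision_recall` is a hypothetical helper, not part of any library.

```python
def precision_recall(predicted, expected):
    """Score exact (start, end, type) matches between predictions and labels."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    # Precision: how many flagged spans were real PII (low value = false positives).
    precision = true_positives / len(predicted) if predicted else 1.0
    # Recall: how much real PII was flagged (low value = false negatives, i.e. leaks).
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall


# Hand-labeled ground truth for one log line (hypothetical offsets).
expected = {(5, 15, "PERSON"), (17, 33, "EMAIL")}
# What the detector returned: one correct span, one false positive, one miss.
predicted = {(17, 33, "EMAIL"), (40, 45, "ORG")}

precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.50
```

Because a missed entity is a leak, a masking pipeline will usually tune for recall first and accept some loss of precision.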

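Architectural Considerations for a Data Masking Pipeline

An effective data masking system usually needs a clear architecture for handling the flow of log data.

[Figure: data masking pipeline architecture]

Log sources: Applications, services, CDNs, API gateways, and so on produce the raw log data.
Log collectors: Gather logs from the various sources, for example Fluentd, Logstash, or Vector.
Message queue: Provides buffering, decoupling, and scalability, and absorbs bursts of log traffic. Examples are Kafka, RabbitMQ, and AWS Kinesis.
PII detection engine: The core of the pipeline. It analyzes log content and identifies the PII in it, using rules, regular expressions, NLP models, or a combination.
Data masking engine: Applies the appropriate masking technique based on the detection engine's output and the predefined policy.
Policy management: Central management of masking rules, such as which fields to mask, which method to use, and allowlists and blocklists.
Auditing and monitoring: Records masking operations, monitors the performance and accuracy of the system, and raises alerts when PII gets through unmasked.
Masked log output: Sends the masked logs on to their destinations, such as the external LLM inference service, secure log storage, or an analytics platform.

This layered architecture keeps the system modular and scalable, and lets each component be updated and tuned independently.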
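The flow from source to masked output can be sketched as a small generator pipeline. This is a toy model under stated assumptions: an in-memory list stands in for the collector and message queue, a single email regex stands in for the detection engine, and the names `detect_emails`, `mask_spans`, and `run_pipeline` are invented for this example.

```python
import re
from typing import Callable, Iterable, Iterator, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)
Detector = Callable[[str], List[Span]]

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def detect_emails(text: str) -> List[Span]:
    """Stand-in detection engine: one regex detector."""
    return [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]


def mask_spans(text: str, spans: List[Span]) -> str:
    """Masking engine: swap each span for a typed placeholder.

    Assumes the spans do not overlap.
    """
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def run_pipeline(lines: Iterable[str], detectors: List[Detector], audit: list) -> Iterator[str]:
    """Pull lines from a source, detect, mask, write an audit record, emit."""
    for line in lines:
        spans = [span for detect in detectors for span in detect(line)]
        audit.append({"entities_masked": len(spans)})  # counts only, never raw values
        yield mask_spans(line, spans)


audit_log: list = []
source = ["login by a@b.io ok", "heartbeat"]
masked = list(run_pipeline(source, [detect_emails], audit_log))
print(masked)     # ['login by [EMAIL] ok', 'heartbeat']
print(audit_log)  # [{'entities_masked': 1}, {'entities_masked': 0}]
```

Passing detectors in as a list is one way to keep the detection and masking stages independently replaceable, which is the point of the layered design.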

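Technical Deep Dive: Implementing Data Masking (Code Examples)

We now turn to implementation details, using Python examples to show common techniques for PII detection and masking. The examples use a few popular libraries: re for regular expressions, Faker for generating fake data, and spaCy or Presidio for more advanced PII detection driven by natural language processing (NLP).

A. PII Detection Techniques

PII detection is the first step in data masking, and the most critical one.

1. Regular expression matching:

For PII types with a well-defined pattern, such as email addresses, phone numbers, IP addresses, and credit card numbers, regular expressions are efficient and accurate.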

import re

def detect_pii_with_regex(text):
    """
    Detect common PII in text using regular expressions.
    """
    detected_pii = []

    # 1. Email addresses
    email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    for match in re.finditer(email_pattern, text):
        detected_pii.append({"type": "EMAIL", "value": match.group(0), "start": match.start(), "end": match.end()})

    # 2. US phone numbers (simple forms: xxx-xxx-xxxx, (xxx) xxx-xxxx, xxx.xxx.xxxx)
    phone_pattern = r"(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})"
    for match in re.finditer(phone_pattern, text):
        detected_pii.append({"type": "PHONE_NUMBER", "value": match.group(0), "start": match.start(), "end": match.end()})

    # 3. IP addresses (IPv4)
    ip_pattern = r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b"
    for match in re.finditer(ip_pattern, text):
        detected_pii.append({"type": "IP_ADDRESS", "value": match.group(0), "start": match.start(), "end": match.end()})

    # 4. Credit card numbers (digits, optionally separated by spaces or hyphens).
    #    A simplified version; real use needs stricter validation.
    credit_card_pattern = r"\b(?:\d[ -]*?){13,16}\b" # 13-16 digits
    for match in re.finditer(credit_card_pattern, text):
        # Simple guard against bare years or dates being reported
        if not re.match(r"^\d{4}$", match.group(0)):
             detected_pii.append({"type": "CREDIT_CARD", "value": match.group(0), "start": match.start(), "end": match.end()})

    # 5. URLs (simple form)
    url_pattern = r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+"
    for match in re.finditer(url_pattern, text):
        detected_pii.append({"type": "URL", "value": match.group(0), "start": match.start(), "end": match.end()})

    # Note: real-world patterns are more complex and must account for
    # internationalization, more formats, and context.
    return detected_pii

# Example
log_entry = "User John Doe (john.doe@example.com) from 192.168.1.1 tried to log in. Phone: (123) 456-7890. Payment processed with card 4111-2222-3333-4444. More info at http://example.com/profile."
detected = detect_pii_with_regex(log_entry)
print("Detected PII with Regex:")
for item in detected:
    print(item)
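The credit card pattern above accepts any run of 13 to 16 digits, so order numbers and tracking IDs can be flagged as cards. One common form of the stricter validation the comment calls for is the Luhn checksum. The sketch below uses a hypothetical helper named `luhn_valid`; whether a candidate that fails the check should still be masked is a policy decision, since skipping it trades recall for precision.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True  (a well-known test number)
print(luhn_valid("4111-2222-3333-4444"))  # False (fails the checksum)
```

A detector could attach the result as a confidence signal on each match instead of dropping failures outright, so that mistyped card numbers are still masked.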

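2. Named entity recognition (NER) with NLP libraries:

Regular expressions struggle with PII in unstructured text, such as names of people, organizations, and places. NER is an NLP technique that identifies entities of particular kinds in text. spaCy is a very popular Python NLP library with NER support for many languages. Microsoft Presidio is a library built specifically for PII detection and masking; it combines several kinds of detectors, including regular expressions, NER, and checksums.

The demonstration here uses spaCy. If it is not installed yet, run pip install spacy and then python -m spacy download en_core_web_sm.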

import spacy

# Load the small English model, downloading it on first use
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading en_core_web_sm model for spaCy...")
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

def detect_pii_with_ner(text):
    """
    Detect PII in text using spaCy's NER.
    Entity types kept: PERSON (people), GPE (geopolitical entities such as
    cities and countries), ORG (organizations), and LOC (other locations).
    """
    doc = nlp(text)
    detected_pii = []
    for ent in doc.ents:
        # Keep only the entity types we treat as PII
        if ent.label_ in ["PERSON", "GPE", "ORG", "LOC"]:
            detected_pii.append({"type": ent.label_, "value": ent.text, "start": ent.start_char, "end": ent.end_char})
    return detected_pii

# Example
log_entry_ner = "Customer Alice Smith from New York reported an issue with Google Cloud services. Her manager Bob Johnson will follow up."
detected_ner = detect_pii_with_ner(log_entry_ner)
print("\nDetected PII with NER (spaCy):")
for item in detected_ner:
    print(item)

# A detector that combines regular expressions with NER is more powerful.
# Presidio, for example, offers a more complete integration:
# from presidio_analyzer import AnalyzerEngine
# analyzer = AnalyzerEngine()
# results = analyzer.analyze(text=log_entry_ner, language='en')
# print("\nDetected PII with Presidio:")
# for res in results:
#     print(res.entity_type, res.start, res.end, res.score)
# (Presidio is a separate install: pip install presidio-analyzer)

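B. Masking Strategies and Implementation

Once PII has been detected, the next step is to apply a masking strategy.

1. Redaction/Substitution:

The most common strategy is to replace PII with a placeholder or with fake data.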

from faker import Faker

# Faker generates realistic-looking fake data
fake = Faker('en_US') # 'en_US' selects the US English locale

def mask_pii_redact(text, pii_entities, placeholder="[REDACTED]"):
    """
    Replace detected PII entities with typed placeholders such as [EMAIL_REDACTED].
    `placeholder` is the fallback for an entity that has no type.
    Replacement runs from the end of the text so earlier offsets stay valid.
    """
    masked_text = list(text) # a list of characters can be edited in place

    # Sort by end offset, descending, so a replacement never shifts pending offsets
    for entity in sorted(pii_entities, key=lambda x: x['end'], reverse=True):
        start, end = entity['start'], entity['end']
        # Prefer a type-specific placeholder, e.g. [EMAIL_REDACTED]
        specific_placeholder = f"[{entity['type']}_REDACTED]" if entity.get('type') else placeholder
        masked_text[start:end] = list(specific_placeholder)

    return "".join(masked_text)

def mask_pii_substitute(text, pii_entities):
    """
    Replace detected PII entities with fake data generated by Faker.
    """
    masked_text = list(text)

    for entity in sorted(pii_entities, key=lambda x: x['end'], reverse=True):
        start, end = entity['start'], entity['end']
        original_value = entity['value']
        replacement = original_value # unknown types are left unchanged

        if entity['type'] == 'EMAIL':
            replacement = fake.email()
        elif entity['type'] == 'PHONE_NUMBER':
            # Faker's phone format may differ from the original
            replacement = fake.phone_number()
            # Crudely trim or pad to the original length
            if len(replacement) > len(original_value):
                replacement = replacement[:len(original_value)]
            elif len(replacement) < len(original_value):
                replacement += 'X' * (len(original_value) - len(replacement))
        elif entity['type'] == 'IP_ADDRESS':
            replacement = fake.ipv4_public()
        elif entity['type'] == 'PERSON':
            replacement = fake.name()
        elif entity['type'] == 'GPE' or entity['type'] == 'LOC':
            replacement = fake.city()
        elif entity['type'] == 'ORG':
            replacement = fake.company()
        elif entity['type'] == 'CREDIT_CARD':
            # For cards, mask most of the number and keep the last 4 digits
            replacement = "XXXX-XXXX-XXXX-" + original_value[-4:]
        elif entity['type'] == 'URL':
            replacement = fake.url()

        masked_text[start:end] = list(replacement)

    return "".join(masked_text)

# Masking example
log_entry_to_mask = "User Alice Smith (alice.smith@example.com) from 192.168.1.100 called +1-800-555-1234. Card: 1111-2222-3333-4444. Lives in London."

# Combine the regex and NER detectors
all_detected_pii = detect_pii_with_regex(log_entry_to_mask) + detect_pii_with_ner(log_entry_to_mask)
# Drop duplicate or overlapping entities, preferring the longer match.
# This is simplified; production code needs a fuller conflict-resolution strategy.
unique_pii = []
for pii in all_detected_pii:
    is_overlap = False
    for existing_pii in unique_pii:
        # Two spans overlap if the later start is before the earlier end
        if max(pii['start'], existing_pii['start']) < min(pii['end'], existing_pii['end']):
            # On overlap, keep whichever span is longer
            if (pii['end'] - pii['start']) > (existing_pii['end'] - existing_pii['start']):
                unique_pii.remove(existing_pii)
                unique_pii.append(pii)
            is_overlap = True
            break
    if not is_overlap:
        unique_pii.append(pii)

print("\nAll detected PII (combined and de-duplicated):")
for item in unique_pii:
    print(item)

# Mask by redaction
masked_redacted = mask_pii_redact(log_entry_to_mask, unique_pii)
print(f"\nMasked (Redacted): {masked_redacted}")

# Mask by substitution
masked_substituted = mask_pii_substitute(log_entry_to_mask, unique_pii)
print(f"Masked (Substituted): {masked_substituted}")

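2. Hashing:

Hashing suits cases that need consistent masking, where the same PII always produces the same hash. That makes it possible to follow one "user" or "entity" across log lines without exposing the original value.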

import hashlib

def mask_pii_hash(text, pii_entities, salt="my_secret_salt"):
    """
    Replace each detected PII entity with its SHA-256 hash.
    The salt makes precomputed (rainbow table) lookups impractical.
    """
    masked_text = list(text)

    for entity in sorted(pii_entities, key=lambda x: x['end'], reverse=True):
        start, end = entity['start'], entity['end']
        original_value = entity['value']

        # Hash the original value concatenated with the salt
        salted_value = (original_value + salt).encode('utf-8')
        hashed_value = hashlib.sha256(salted_value).hexdigest()

        # The full hex digest is long and changes the shape of the text.
        # In practice you might keep only a truncated digest.
        masked_text[start:end] = list(f"[{entity['type']}_HASH_{hashed_value}]")

    return "".join(masked_text)

def dedupe_overlaps(pii_list):
    """Same simplified overlap handling as before: on overlap, keep the longer span."""
    unique = []
    for pii in pii_list:
        is_overlap = False
        for existing in unique:
            if max(pii['start'], existing['start']) < min(pii['end'], existing['end']):
                if (pii['end'] - pii['start']) > (existing['end'] - existing['start']):
                    unique.remove(existing)
                    unique.append(pii)
                is_overlap = True
                break
        if not is_overlap:
            unique.append(pii)
    return unique

# Hashing example
log_entry_hash = "User Bob (bob@example.com) accessed system from 203.0.113.1. Bob's ID: 12345."
detected_for_hash = detect_pii_with_regex(log_entry_hash) + detect_pii_with_ner(log_entry_hash)
unique_pii_for_hash = dedupe_overlaps(detected_for_hash)

masked_hashed = mask_pii_hash(log_entry_hash, unique_pii_for_hash, salt="my_prod_salt_123")
print(f"\nMasked (Hashed): {masked_hashed}")

# Consistency: the same email address produces the same hash in a second log line
log_entry_hash_2 = "Another log by Bob (bob@example.com). His IP: 203.0.113.1."
detected_for_hash_2 = detect_pii_with_regex(log_entry_hash_2) + detect_pii_with_ner(log_entry_hash_2)
unique_pii_for_hash_2 = dedupe_overlaps(detected_for_hash_2)

masked_hashed_2 = mask_pii_hash(log_entry_hash_2, unique_pii_for_hash_2, salt="my_prod_salt_123")
print(f"Masked (Hashed, second log): {masked_hashed_2}")

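Note: a hash is one-way, but when the set of possible PII values is small (only a handful of users, say, or the space of valid phone numbers), an attacker can recover a value by hashing every candidate and comparing. A salt defeats precomputed rainbow tables, but it stops this kind of guessing only while the salt itself stays secret. Treat the salt as a secret key, make it long and random, and pair it with a strong hash algorithm such as SHA-256.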
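One way to act on this note is to use a keyed hash (HMAC) from the Python standard library in place of plain string concatenation. This is a sketch of a swapped-in technique, not the method used elsewhere in this lecture: the helper name `pseudonymize`, the truncation length, and the key handling are all assumptions for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, entity_type: str, length: int = 16) -> str:
    """Return a stable, keyed token for `value`.

    The same value and key always give the same token, so an entity can be
    followed across log lines; without the key, tokens cannot be recomputed.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncation keeps logs readable; shorter tokens raise the collision chance.
    return f"[{entity_type}_{digest[:length]}]"


# Hypothetical key for the demo only. A real key would come from a secrets manager.
demo_key = b"replace-with-a-long-random-secret"

token_a = pseudonymize("bob@example.com", demo_key, "EMAIL")
token_b = pseudonymize("bob@example.com", demo_key, "EMAIL")
token_c = pseudonymize("bob@example.com", b"a-different-key", "EMAIL")

print(token_a == token_b)  # True: consistent under the same key
print(token_a == token_c)  # False: a different key gives unrelated tokens
```

Rotating the key breaks the link between old and new tokens, which can be useful when correlation should only be possible within a limited time window.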

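3. Format-preserving encryption (FPE) (conceptual):

FPE is an advanced masking technique that encrypts data while keeping its original format. A 16-digit credit card number is still 16 digits after encryption, and an encrypted email address still has a valid email format. This is very useful when masked data has to flow into existing systems that depend on a particular format, such as databases or legacy applications. FPE algorithms such as FF1 or FF3-1 are generally more complex than plain hashing or substitution and need a dedicated library or service. Python has no simple built-in library for FPE, so it usually means integrating a specialized cryptography library or service. For that reason FPE is covered here only as a concept, without a Python example.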

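C. Building an Integrated Masking Pipeline

We now wrap the detection and masking logic above in a more general PII_MaskingService class that simulates a simplified masking pipeline.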

import re
import spacy
from faker import Faker
import hashlib
import json # log entries may be JSON

# Initialize the NLP model and Faker
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

fake = Faker('en_US')

class PII_MaskingService:
    def __init__(self, salt="default_llm_masking_salt"):
        self.salt = salt
        self.regex_patterns = {
            "EMAIL": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
            "PHONE_NUMBER": r"(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})",
            "IP_ADDRESS": r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b",
            "CREDIT_CARD": r"\b(?:\d[ -]*?){13,16}\b", # simplified; real use needs stricter validation
            "URL": r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+",
            # More can be added, e.g. SSNs or national ID numbers
        }
        self.ner_entity_types = ["PERSON", "GPE", "ORG", "LOC"]

    def _detect_pii_with_regex(self, text):
        detected_pii = []
        for pii_type, pattern in self.regex_patterns.items():
            for match in re.finditer(pattern, text):
                if pii_type == "CREDIT_CARD" and re.match(r"^\d{4}$", match.group(0)): # skip bare 4-digit years
                    continue
                detected_pii.append({"type": pii_type, "value": match.group(0), "start": match.start(), "end": match.end()})
        return detected_pii

    def _detect_pii_with_ner(self, text):
        doc = nlp(text)
        detected_pii = []
        for ent in doc.ents:
            if ent.label_ in self.ner_entity_types:
                detected_pii.append({"type": ent.label_, "value": ent.text, "start": ent.start_char, "end": ent.end_char})
        return detected_pii

    def _merge_and_deduplicate_pii(self, pii_list):
        """
        Merge the detector outputs and drop overlapping or duplicate entities.
        A fuller version would also rank entities by PII type priority.
        """
        unique_pii = []
        for pii_candidate in pii_list:
            is_overlap = False
            for existing_pii in unique_pii:
                # Two spans overlap if the later start is before the earlier end
                if max(pii_candidate['start'], existing_pii['start']) < min(pii_candidate['end'], existing_pii['end']):
                    # On overlap, prefer the longer match.
                    # A type priority such as EMAIL > PERSON is an alternative.
                    if (pii_candidate['end'] - pii_candidate['start']) > (existing_pii['end'] - existing_pii['start']):
                        # The new entity is longer: it replaces the old one
                        unique_pii.remove(existing_pii)
                        unique_pii.append(pii_candidate)
                    is_overlap = True
                    break
            if not is_overlap:
                unique_pii.append(pii_candidate)

        # Sort by end offset, descending, so mask_text can replace back to front
        unique_pii.sort(key=lambda x: x['end'], reverse=True)
        return unique_pii

    def detect_pii(self, text):
        """
        Detect PII using both regular expressions and NER.
        """
        regex_pii = self._detect_pii_with_regex(text)
        ner_pii = self._detect_pii_with_ner(text)

        all_pii = regex_pii + ner_pii
        return self._merge_and_deduplicate_pii(all_pii)

    def mask_entity(self, entity_type, original_value, masking_strategy="SUBSTITUTION"):
        """
        Mask a single PII value according to the given strategy.
        """
        if masking_strategy == "REDACTION":
            return f"[{entity_type}_REDACTED]"
        elif masking_strategy == "SUBSTITUTION":
            if entity_type == 'EMAIL': return fake.email()
            if entity_type == 'PHONE_NUMBER': return fake.phone_number()
            if entity_type == 'IP_ADDRESS': return fake.ipv4_public()
            if entity_type == 'PERSON': return fake.name()
            if entity_type in ['GPE', 'LOC']: return fake.city()
            if entity_type == 'ORG': return fake.company()
            if entity_type == 'CREDIT_CARD': return "XXXX-XXXX-XXXX-" + original_value[-4:]
            if entity_type == 'URL': return fake.url()
            return f"[{entity_type}_MASKED]" # fallback for unknown types
        elif masking_strategy == "HASHING":
            salted_value = (original_value + self.salt).encode('utf-8')
            hashed_value = hashlib.sha256(salted_value).hexdigest()
            return f"[{entity_type}_HASH_{hashed_value[:10]}]" # truncated to keep the log short
        else:
            return original_value # unknown strategy: leave the value unmasked

    def mask_text(self, text, masking_strategy="SUBSTITUTION"):
        """
        Mask every PII entity found in the given text.
        """
        pii_entities = self.detect_pii(text)
        masked_text_chars = list(text)

        for entity in pii_entities:
            start, end = entity['start'], entity['end']
            original_value = entity['value']

            masked_value = self.mask_entity(entity['type'], original_value, masking_strategy)

            # Replace the span with a list slice assignment
            masked_text_chars[start:end] = list(masked_value)

            # A replacement can change the text length. Entities arrive sorted by
            # end offset, descending (see _merge_and_deduplicate_pii), so the
            # offsets of entities still to be processed are never shifted.
            # Nested or overlapping spans would need more careful handling.

        return "".join(masked_text_chars)

    def mask_log_entry(self, log_entry, masking_strategy="SUBSTITUTION"):
        """
        Mask a log entry given as a plain string, a JSON string, or a parsed dict.
        """
        if isinstance(log_entry, dict):
            # Recursively mask every string value in the dict
            masked_dict = {}
            for key, value in log_entry.items():
                if isinstance(value, str):
                    masked_dict[key] = self.mask_text(value, masking_strategy)
                elif isinstance(value, dict):
                    masked_dict[key] = self.mask_log_entry(value, masking_strategy)
                elif isinstance(value, list):
                    masked_list = []
                    for item in value:
                        if isinstance(item, str):
                            masked_list.append(self.mask_text(item, masking_strategy))
                        elif isinstance(item, dict):
                            masked_list.append(self.mask_log_entry(item, masking_strategy))
                        else:
                            masked_list.append(item)
                    masked_dict[key] = masked_list
                else:
                    masked_dict[key] = value
            return masked_dict
        elif isinstance(log_entry, str):
            try:
                # Try to parse the entry as JSON
                data = json.loads(log_entry)
            except json.JSONDecodeError:
                # Not JSON: treat it as plain text
                return self.mask_text(log_entry, masking_strategy)
            if isinstance(data, dict):
                return json.dumps(self.mask_log_entry(data, masking_strategy))
            # Valid JSON but not an object (array, number, bare string): mask the raw text
            return self.mask_text(log_entry, masking_strategy)
        else:
            return log_entry # neither str nor dict: returned unchanged

# Instantiate the masking service
masker = PII_MaskingService(salt="my_llm_prod_secret_salt_456")

# Sample log entries
sample_log_1 = "User 'Emily White' (emily.white@example.com) from IP 192.168.1.5 accessed service X. Her phone is +1-202-555-0123. Location: Seattle, WA."
sample_log_2 = "Error: Invalid transaction for customer 987654321. Card 5123-XXXX-XXXX-9876 failed. Contact support at support@example.com."
sample_log_3_json = {
    "timestamp": "2023-10-27T10:00:00Z",
    "level": "INFO",
    "message": "User login attempt.",
    "user": {
        "id": "user123",
        "name": "Jane Doe",
        "email": "jane.doe@example.com",
        "ip_address": "203.0.113.123"
    },
    "details": "Client from New York using browser Chrome."
}
sample_log_4_complex = "Meeting scheduled for Dr. Smith, 10 AM, at 789 Pine St, Springfield, IL. Contact: dr.smith@example.com. Client ID: C_123456789."

print("--- Original Logs ---")
print(f"Log 1: {sample_log_1}")
print(f"Log 2: {sample_log_2}")
print(f"Log 3: {json.dumps(sample_log_3_json, indent=2)}")
print(f"Log 4: {sample_log_4_complex}")

print("\n--- Masked Logs (Substitution Strategy) ---")
print(f"Log 1: {masker.mask_log_entry(sample_log_1, 'SUBSTITUTION')}")
print(f"Log 2: {masker.mask_log_entry(sample_log_2, 'SUBSTITUTION')}")
print(f"Log 3: {masker.mask_log_entry(sample_log_3_json, 'SUBSTITUTION')}")
print(f"Log 4: {masker.mask_log_entry(sample_log_4_complex, 'SUBSTITUTION')}")

print("\n--- Masked Logs (Redaction Strategy) ---")
print(f"Log 1: {masker.mask_log_entry(sample_log_1, 'REDACTION')}")
print(f"Log 2: {masker.mask_log_entry(sample_log_2, 'REDACTION')}")
print(f"Log 3: {masker.mask_log_entry(sample_log_3_json, 'REDACTION')}")
print(f"Log 4: {masker.mask_log_entry(sample_log_4_complex, 'REDACTION')}")

print("\n--- Masked Logs (Hashing Strategy) ---")
print(f"Log 1: {masker.mask_log_entry(sample_log_1, 'HASHING')}")
print(f"Log 2: {masker.mask_log_entry(sample_log_2, 'HASHING')}")
print(f"Log 3: {masker.mask_log_entry(sample_log_3_json, 'HASHING')}")
print(f"Log 4: {masker.mask_log_entry(sample_log_4_complex, 'HASHING')}")

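The PII_MaskingService class shows a reasonably complete masking flow:

Initialization: Sets the salt, the regular expression patterns, and the NER entity types.
PII detection (detect_pii): Combines regular expressions with NER and resolves overlapping entities.
Entity masking (mask_entity): Masks a single PII value with the chosen strategy (substitution, redaction, or hashing).
Text masking (mask_text): Iterates over the detected PII and replaces each entity with its masked value. The key point is to replace from the end of the text backward, so that a replacement that changes the string length does not invalidate the offsets of the entities still to be processed.
Log entry masking (mask_log_entry): Extends the service beyond plain string logs by recognizing JSON logs, a common format in modern applications, and processing them recursively.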
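The reason for replacing from the end of the text is easiest to see with a small experiment. The sketch below applies the same two replacements in both orders, using offsets computed against the original string. The function names are made up for this demonstration.

```python
def replace_back_to_front(text, spans):
    """Apply (start, end, replacement) spans from the last one to the first."""
    for start, end, replacement in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


def replace_front_to_back_naive(text, spans):
    """Buggy on purpose: reuses original offsets after the text has changed length."""
    for start, end, replacement in sorted(spans, key=lambda s: s[0]):
        text = text[:start] + replacement + text[end:]
    return text


text = "Alice called Bob"
# Offsets refer to the ORIGINAL string: "Alice" is 0-5, "Bob" is 13-16.
spans = [(0, 5, "[PERSON]"), (13, 16, "[PERSON]")]

print(replace_back_to_front(text, spans))        # [PERSON] called [PERSON]
print(replace_front_to_back_naive(text, spans))  # [PERSON] call[PERSON]Bob
```

In the naive order the first replacement lengthens the text by three characters, so the second span lands in the wrong place and the name "Bob" survives in the output. That is exactly the kind of silent leak a masking system has to avoid.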

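Advanced Considerations and Best Practices

Building a robust PII masking system takes more than writing code. The following factors and practices also matter:

Policy-driven masking: Keep the masking rules (which PII types to mask, which strategy to use, exceptions for specific fields) in external configuration files such as JSON or YAML instead of hard-coding them. Policies can then change without touching the code.
Context-aware masking: Some information is not PII on its own but becomes PII in context. In "my account number is 12345", the number needs masking if it is a bank account. This calls for deeper context analysis or for business rules.
Allowlists and blocklists: Maintaining lists of terms known to be safe (allowlist) or known to be sensitive (blocklist) improves detection accuracy and reduces false positives.
Multilingual support: If your application serves users worldwide, PII detection and masking must cover several languages, because PII patterns differ greatly between them.
Performance optimization: For high-throughput log streams, consider:
- Parallel processing: Use multiple CPU cores to process logs in parallel.
- Batching: Process several log entries at a time.
- Incremental processing: Process only newly produced logs.
- Precompiled regular expressions: re.compile() avoids recompiling a pattern on every call.
- Efficient NLP models: Choose a lightweight, fast NER model.
Error handling and logging: Record errors that occur during masking, such as detector failures and masking failures. Also record which log entries were masked, along with before-and-after comparisons (for audit purposes only, under strict access control).
Continuous testing and validation:
- Unit tests: Write test cases for every regular expression, NER model, and masking function.
- Integration tests: Test the masking pipeline end to end.
- Red teaming: Actively try to get around the masking and look for PII that goes undetected.
- Sample review: Have people review samples of masked logs regularly to confirm that PII is found and handled correctly and that the data remains useful.
Data governance and compliance: Treat data masking as part of a broader data governance strategy. Work closely with legal and compliance teams to make sure the masking approach satisfies all applicable regulations (GDPR, CCPA, HIPAA, and others).
Reversible vs. irreversible: For data sent to an external LLM, irreversible masking is strongly recommended. If the original data must remain accessible internally for debugging or other reasons, enforce strict access control and auditing.
Secure context: Run the masking service itself in a secure, isolated environment so that it does not become an attack target.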
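Policy-driven masking, allowlists, and precompiled patterns can be combined in a small sketch like the one below. The policy schema (`rules`, `allowlist`, `placeholder_template`) is invented for this example and is not a standard format; in practice the JSON would be loaded from a file or a configuration service.

```python
import json
import re

# Hypothetical policy document, embedded as a string to keep the example self-contained.
POLICY_JSON = r"""
{
  "placeholder_template": "[{type}_REDACTED]",
  "allowlist": ["support@example.com"],
  "rules": [
    {"type": "EMAIL", "pattern": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"},
    {"type": "IP_ADDRESS", "pattern": "\\b(?:[0-9]{1,3}\\.){3}[0-9]{1,3}\\b"}
  ]
}
"""


def load_policy(raw_json):
    """Parse the policy and compile each pattern once, up front."""
    policy = json.loads(raw_json)
    rules = [(rule["type"], re.compile(rule["pattern"])) for rule in policy["rules"]]
    return rules, set(policy.get("allowlist", [])), policy["placeholder_template"]


def apply_policy(text, rules, allowlist, template):
    """Apply each rule in order; allowlisted values pass through unchanged."""
    for pii_type, pattern in rules:

        def replace(match, pii_type=pii_type):
            value = match.group(0)
            return value if value in allowlist else template.format(type=pii_type)

        text = pattern.sub(replace, text)
    return text


rules, allowlist, template = load_policy(POLICY_JSON)
line = "mail bob@corp.io or support@example.com from 10.0.0.7"
print(apply_policy(line, rules, allowlist, template))
# mail [EMAIL_REDACTED] or support@example.com from [IP_ADDRESS_REDACTED]
```

Adding a new PII type or exempting a known-safe address then only requires a change to the policy document, not to the code.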

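Future Trends in Data Masking

Data privacy and LLMs are both moving targets, and data masking techniques keep evolving with them:

AI-driven PII detection: LLMs themselves may be used to identify PII more accurately, including ambiguous and context-dependent PII. This has to be designed with care to avoid the privacy paradox of "fighting magic with magic", where the detection step itself exposes the data it is meant to protect.
Convergence with privacy-enhancing technologies (PETs):
- Homomorphic encryption: Allows computation on encrypted data without decrypting it. In theory this is ideal, but the computational cost is very high and the technology is not yet mature.
- Differential privacy: Adds noise to data to obscure individuals while preserving overall patterns.
- Federated learning: Models are trained on local devices, and only model updates, not raw data, are shared.
Standardization and interoperability: As data privacy grows in importance, more industry standards and open-source tools may appear that make PII detection and masking easier to implement.
Real-time and edge masking: Masking data as it is generated, or on edge devices close to the user, reduces the risk of PII traveling across the network.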

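Summary and Outlook

Data masking is a cornerstone of using large language models responsibly while protecting user privacy and staying compliant. It is more than a technical challenge: it is a governance problem that needs continuous attention, iterative improvement, and collaboration across functions. By applying automated, multi-strategy PII detection and masking early in the data flow, we can minimize risk and make sure the power of LLMs serves the business in a safe and compliant way, earning user trust and building a sustainable AI ecosystem.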