What Is 'Semantic Audio Routing'? Choosing the 'Comfort' or 'Execute' Branch in the Graph Based on the Emotional Tone of the User's Speech

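Colleagues, and developers who care about the future of human-computer interaction:

Welcome to today's lecture. We will take a close look at a frontier area with real potential: "Semantic Audio Routing". Traditionally, audio routing has relied on the physical properties of the signal, on connection topology, or on simple switching logic. With advances in artificial intelligence, particularly natural language processing (NLP) and speech emotion recognition, we can now give audio routing a deeper, "semantic" level of understanding.

Today we will work through one concrete scenario: the system listens to the emotional tone of the user's speech and routes the conversation, or the follow-up action, to a different branch, for example a "comfort" branch or an "execute" branch. This is more than a simple decision tree; behind it sit speech processing, emotion analysis, intent recognition, and a decision mechanism. As a programming expert, I will walk you through building such a system step by step, from theory to practice.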

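1. The Nature and Value of Semantic Audio Routing

1.1 What Is Semantic Audio Routing?

Semantic audio routing, as the name suggests, means routing and processing audio based on its "meaning", or semantics. Here "semantics" covers more than the words produced by speech-to-text: it also includes the speaker's emotion, tone, intent, and even implied context. It goes beyond traditional audio processing, which works with physical attributes such as frequency, volume, and channel.

In our scenario, semantic audio routing comes down to three things:

Perceiving the user's emotion: recognizing the emotion carried by the user's voice (joy, sadness, anger, calm, and so on).
Understanding the user's intent: parsing the explicit or implicit intent in what the user says (seeking comfort, asking for a task to be carried out, expressing dissatisfaction, and so on).
Deciding and routing: combining emotion and intent to steer the subsequent audio output or system action into one of the predefined branches (such as the "comfort" flow or the "execute" flow).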

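1.2 Limitations of Traditional Audio Routing

To appreciate the value of semantic audio routing, it helps to review where traditional audio routing falls short:

Mechanical: traditional routing is usually preset and static, and cannot adapt to real-time user input. A simple voice assistant will act when the user says "play music", but it cannot tell whether the user is in a good mood and wants to enjoy music, or is irritated and wants music to calm down.
No context: traditional systems struggle to follow the context of a conversation, so the interaction feels stiff and unnatural.
No emotion recognition: this is the central difference. A system that cannot perceive emotion can only give standardized replies when the user's mood shifts, and those replies may be badly timed, which hurts the user experience.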

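1.3 Potential Applications of Semantic Audio Routing

Beyond the "comfort" and "execute" branches discussed today, semantic audio routing has applications in many areas:

Customer service and call centers: prioritize agitated or anxious callers and direct them to a dedicated de-escalation or resolution team.
Smart homes: adjust lighting, music, room temperature, and other settings according to the tone and mood of household members.
In-vehicle systems: monitor the driver's emotional state and offer relaxing music or voice prompts when the driver is tired or irritable.
Education and health: detect a learner's emotional state and adjust the teaching content; track changes in a patient's mood and offer psychological support.
Games and entertainment: let game characters react more believably to the emotion in the player's voice.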

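2. Core Components and Architecture of a Semantic Audio Routing System

Building a semantic audio routing system takes several cooperating modules. We can decouple them into the following core components:

Component	Main function	Key technologies
Audio input module	Captures the user's speech and preprocesses it (noise reduction, framing, pre-emphasis, and so on).	Microphone arrays, acoustic front-end processing, digital signal processing (DSP)
Speech-to-text (STT)	Converts the speech signal into a text string that can be processed.	Deep learning (ASR models such as RNN-T and Conformer), acoustic models, language models
Speech emotion recognition (SER)	Extracts emotional information (joy, anger, sadness, and so on) from the raw speech signal or its features.	Acoustic feature extraction (MFCC, intonation, energy), machine learning (SVM, DNN, RNN, Transformer)
Natural language understanding (NLU) / intent recognition	Parses the user's intent and key entities from the STT output text.	Text classification, sequence labeling, deep learning (BERT, GPT, and so on), rule engines
Decision and routing engine	Combines the emotion and intent results to make the final routing decision.	Rule engines, decision trees, reinforcement learning, multimodal fusion
Audio output module	Generates and plays the audio that matches the decision (TTS, prerecorded speech).	Text-to-speech (TTS), audio players, sound effect synthesis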
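The data flow between these components can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the lecture's implementation: the `Analysis` dataclass, the `run_pipeline` function, and the lambda stand-ins for each stage are all hypothetical names introduced for illustration. The point it shows is that SER works on the audio directly while NLU works on the STT transcript, and the router sees both results.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Analysis:
    transcript: str
    emotion: str
    intent: str
    branch: str

def run_pipeline(audio: bytes,
                 stt: Callable[[bytes], str],
                 ser: Callable[[bytes], str],
                 nlu: Callable[[str], str],
                 route: Callable[[str, str], str]) -> Analysis:
    transcript = stt(audio)          # text path: audio -> words
    emotion = ser(audio)             # acoustic path: does not depend on the transcript
    intent = nlu(transcript)         # intent is derived from the words only
    branch = route(emotion, intent)  # the router combines both signals
    return Analysis(transcript, emotion, intent, branch)

# Hypothetical stand-ins for the real modules
result = run_pipeline(
    b"\x00\x00",
    stt=lambda audio: "play some music",
    ser=lambda audio: "sad",
    nlu=lambda text: "play_music" if "play" in text else "express_feeling",
    route=lambda emotion, intent: "comfort" if intent == "express_feeling" else "execute",
)
print(result)
```

Passing the stages in as callables keeps each module replaceable, which matches the decoupled design in the table.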

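2.1 Audio Input and Preprocessing

This is where the whole flow begins. High-quality audio input is the foundation for every later analysis step.

Key tasks:

Sampling and digitization: converting the analog sound wave into a digital signal.
Noise reduction: removing background noise to improve speech clarity.
Framing and windowing: splitting the continuous speech signal into short frames and applying a window function to reduce spectral leakage.
Voice activity detection (VAD): finding the segments that contain speech and dropping silence to reduce computation.

Example code (Python with pyaudio and webrtcvad):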
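The framing and windowing step in the list above can be sketched with NumPy. This is a minimal sketch: the 0.97 pre-emphasis coefficient, the 25 ms frame length, and the 10 ms hop are common defaults chosen here as assumptions, and the function names are illustrative.

```python
import numpy as np

def preemphasis(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Cut the signal into overlapping frames and apply a Hamming window to each."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i holds the sample indices of frame i
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

# One second of synthetic noise stands in for recorded speech
signal = np.random.default_rng(0).standard_normal(16000)
frames = frame_signal(preemphasis(signal))
print(frames.shape)  # (98, 400)
```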

import pyaudio
import wave
import collections
import webrtcvad
import numpy as np

class AudioProcessor:
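    def __init__(self, rate=16000, chunk_size=480, vad_aggressiveness=3):
        """
        Initialize the audio processor.
        :param rate: sample rate in Hz (webrtcvad accepts 8000, 16000, 32000 or 48000)
        :param chunk_size: samples read per buffer (480 samples is 30 ms at 16000 Hz; VAD frames are re-cut in process_with_vad)
        :param vad_aggressiveness: VAD aggressiveness, 0-3; 3 is the most aggressive about filtering out non-speech
        """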
        self.rate = rate
        self.chunk_size = chunk_size
        self.vad = webrtcvad.Vad(vad_aggressiveness)
        self.audio = pyaudio.PyAudio()

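    def record_audio(self, duration=5, output_filename="recorded_audio.wav"):
        """
        Record audio for a fixed duration.
        :param duration: recording length in seconds
        :param output_filename: path of the WAV file to write
        :return: the recorded audio as raw bytes (16-bit mono PCM)
        """
        print(f"Recording {duration} seconds of audio...")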
        stream = self.audio.open(format=pyaudio.paInt16,
                                 channels=1,
                                 rate=self.rate,
                                 input=True,
                                 frames_per_buffer=self.chunk_size)
        frames = []
        for _ in range(0, int(self.rate / self.chunk_size * duration)):
            data = stream.read(self.chunk_size)
            frames.append(data)

        print("Recording finished.")
        stream.stop_stream()
        stream.close()

        # Save the recording to a WAV file
        wf = wave.open(output_filename, 'wb')
        wf.setnchannels(1)
        wf.setsampwidth(self.audio.get_sample_size(pyaudio.paInt16))
        wf.setframerate(self.rate)
        wf.writeframes(b''.join(frames))
        wf.close()

        return b''.join(frames)

    def process_with_vad(self, audio_data):
        """
        Run VAD over the audio and keep only the speech segments.
        :param audio_data: raw audio bytes (16-bit mono PCM)
        :return: audio bytes containing speech, plus a short lead-in before each speech onset
        """
        # webrtcvad accepts frames of 10, 20 or 30 ms; 10 ms is used here
        frame_length_ms = 10

        frames = self._frame_generator(frame_length_ms, audio_data, self.rate)

        # Holds up to 10 frames (100 ms) of non-speech so that speech onsets are not clipped
        ring_buffer = collections.deque(maxlen=10)
        voiced_frames = []

        for frame in frames:
            is_speech = self.vad.is_speech(frame.bytes, self.rate)

            if is_speech:
                ring_buffer.append(frame)
                voiced_frames.extend(list(ring_buffer))
                ring_buffer.clear()
            else:
                ring_buffer.append(frame)

        # Join the kept frames back into one byte string
        processed_audio_data = b''.join([f.bytes for f in voiced_frames])
        return processed_audio_data

    # Helper class for VAD framing
    class Frame(object):
        def __init__(self, bytes, timestamp, duration):
            self.bytes = bytes
            self.timestamp = timestamp
            self.duration = duration

    def _frame_generator(self, frame_duration_ms, audio, sample_rate):
        """
        Yield fixed-duration frames from raw PCM bytes.
        """
        n = int(sample_rate * (frame_duration_ms / 1000.0) * 2) # bytes per frame: 2 bytes per sample (paInt16)
        offset = 0
        timestamp = 0.0
        duration = (n / sample_rate) / 2 # frame duration in seconds
        while offset + n <= len(audio): # <= so that a final, exactly full frame is not dropped
            yield self.Frame(audio[offset:offset + n], timestamp, duration)
            timestamp += duration
            offset += n

    def close(self):
        self.audio.terminate()

# Example usage
if __name__ == "__main__":
    processor = AudioProcessor()
    # Record and save audio
    raw_audio = processor.record_audio(duration=3, output_filename="raw_input.wav")

    # Alternatively, load audio from a file for VAD processing
    # with wave.open("raw_input.wav", 'rb') as wf:
    #     rate = wf.getframerate()
    #     audio_data = wf.readframes(wf.getnframes())

    # Run VAD on the raw recording
    processed_audio = processor.process_with_vad(raw_audio)

    # Save the processed audio to a file
    with wave.open("vad_processed_audio.wav", 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(processor.audio.get_sample_size(pyaudio.paInt16))
        wf.setframerate(processor.rate)
        wf.writeframes(processed_audio)

    print("VAD finished; speech segments saved to vad_processed_audio.wav")
    processor.close()

This code records audio with pyaudio and then uses the webrtcvad library for voice activity detection, filtering out the non-speech parts. The step matters in real-time speech processing: it reduces the compute load on downstream modules and can improve their accuracy.
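webrtcvad is strict about its input: it expects 16-bit mono PCM at 8000, 16000, 32000 or 48000 Hz, cut into frames of exactly 10, 20 or 30 ms. A small helper can validate those settings before any audio reaches the VAD. The sketch below uses only the standard library; `vad_frame_bytes` and `split_frames` are hypothetical helper names, not part of webrtcvad.

```python
VALID_RATES = (8000, 16000, 32000, 48000)
VALID_FRAME_MS = (10, 20, 30)

def vad_frame_bytes(sample_rate, frame_ms, sample_width=2):
    """Return the byte length of one VAD frame, rejecting unsupported settings."""
    if sample_rate not in VALID_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if frame_ms not in VALID_FRAME_MS:
        raise ValueError(f"unsupported frame duration: {frame_ms} ms")
    return sample_rate * frame_ms // 1000 * sample_width

def split_frames(pcm, sample_rate=16000, frame_ms=30):
    """Split raw PCM bytes into full frames; a trailing partial frame is discarded."""
    n = vad_frame_bytes(sample_rate, frame_ms)
    return [pcm[i:i + n] for i in range(0, len(pcm) - n + 1, n)]

print(vad_frame_bytes(16000, 10))                 # 320
print(len(split_frames(bytes(1000), 16000, 10)))  # 3
```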

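2.2 Speech-to-Text (STT)

The STT module converts the preprocessed speech signal into text, which is the input for intent recognition. Emotion can be recognized directly from the audio, but the text content is essential for understanding what the user wants.

Key tasks:

Acoustic model: maps acoustic features to phonemes or subword units.
Language model: uses context to predict the most likely word sequence.
Decoder: searches for the best word sequence.

Example tools:

Google Cloud Speech-to-Text API
Baidu AI Speech API
OpenAI Whisper (local deployment or API)
Kaldi (open source; complex but powerful)
AssemblyAI

Example code (Python with AssemblyAI; the Whisper fallback is left as commented pseudocode):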
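To make the decoder's role concrete, here is a minimal sketch of greedy CTC decoding, the simplest search strategy for acoustic models trained with CTC. It is a stand-in for illustration, not the decoder of any particular service: it uses no language model, and the three-symbol vocabulary and the per-frame scores are made up.

```python
import numpy as np

def ctc_greedy_decode(frame_scores, vocab, blank=0):
    """Pick the best symbol per frame, merge consecutive repeats, then drop blanks."""
    best = np.argmax(frame_scores, axis=1)
    decoded = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            decoded.append(vocab[idx])
        prev = idx
    return "".join(decoded)

vocab = ["_", "h", "i"]                  # index 0 is the CTC blank (hypothetical vocabulary)
path = [1, 1, 0, 2, 2]                   # per-frame winners: h h _ i i
frame_scores = np.eye(len(vocab))[path]  # one-hot scores standing in for model output
print(ctc_greedy_decode(frame_scores, vocab))  # hi
```

Production decoders usually replace the per-frame argmax with a beam search that also scores candidates with the language model.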

import assemblyai as aai
import os

# Assumes the assemblyai package is installed and an API key is configured
# aai.settings.api_key = os.environ.get("ASSEMBLYAI_API_KEY")

class STTService:
    def __init__(self, api_key=None):
        if api_key:
            aai.settings.api_key = api_key
        # Alternatively, use a local Whisper model
        # self.whisper_model = None
        # try:
        #     import whisper
        #     self.whisper_model = whisper.load_model("small") # or "base", "medium", "large"
        # except ImportError:
        #     print("Warning: OpenAI Whisper not installed. Using AssemblyAI.")

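    def transcribe_audio(self, audio_file_path):
        """
        Transcribe an audio file to text.
        :param audio_file_path: path to the audio file
        :return: the transcript, or None on failure
        """
        # Try AssemblyAI first (a hosted service, so no local model is required)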
        try:
            transcriber = aai.Transcriber()
            transcript = transcriber.transcribe(audio_file_path)
            if transcript.status == aai.TranscriptStatus.completed:
                return transcript.text
            else:
                print(f"AssemblyAI transcription failed: {transcript.error}")
                return None
        except Exception as e:
            print(f"AssemblyAI call failed: {e}")
            # A local Whisper fallback could be added here:
            # if self.whisper_model:
            #     try:
            #         result = self.whisper_model.transcribe(audio_file_path)
            #         return result["text"]
            #     except Exception as we:
            #         print(f"Whisper transcription failed: {we}")
            #         return None
            # else:
            #     return None
            return None # simplified: this example does not fall back to Whisper

# Example usage
if __name__ == "__main__":
    # Assumes a VAD-processed audio file "vad_processed_audio.wav" exists
    stt_service = STTService(api_key="YOUR_ASSEMBLYAI_API_KEY") # replace with your API key

    if os.path.exists("vad_processed_audio.wav"):
        text = stt_service.transcribe_audio("vad_processed_audio.wav")
        if text:
            print(f"Transcript: '{text}'")
        else:
            print("Could not transcribe the audio.")
    else:
        print("Run the AudioProcessor example first to generate vad_processed_audio.wav.")

Note: in practice, choose an STT service or model that fits your project. For local deployment, OpenAI Whisper is a solid option, though the larger models need capable hardware.

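2.3 Speech Emotion Recognition (SER)

This is one of the core pieces of semantic audio routing. The goal of SER is to recognize the speaker's emotional state from the voice.

Key tasks:

Acoustic feature extraction: extracting emotion-related features from the raw speech signal.
Emotion classification model: classifying the extracted features with a machine learning or deep learning model.

2.3.1 Emotion Features

The acoustic features related to emotion fall into three groups:

Prosodic features:
- Pitch (fundamental frequency, F0): the vibration frequency of the speaker's vocal folds, related to intonation and perceived pitch. A higher F0 may indicate excitement or anger; a lower F0 may indicate sadness or fatigue.
- Energy / loudness: the intensity of the speech. High energy may indicate excitement or anger; low energy may indicate sadness or calm.
- Speech rate: how fast the speaker talks. Fast speech may indicate excitement or anxiety; slow speech may indicate sadness or hesitation.
Spectral features:
- Mel-frequency cepstral coefficients (MFCCs): model how the human auditory system perceives sound and capture timbre well.
- Linear prediction cepstral coefficients (LPCCs): describe the resonance characteristics of the vocal tract.
- Spectral entropy: measures how random or disordered the spectrum is.
Voice quality features:
- Jitter and shimmer: measure cycle-to-cycle variation in the pitch period and in amplitude respectively; they relate to the roughness and stability of the voice.
- Harmonics-to-noise ratio (HNR): the ratio of harmonic components to noise components in the voice.

2.3.2 Emotion Classification Models

Traditional machine learning: support vector machines (SVM), random forests, k-nearest neighbors (KNN), and similar classifiers applied to the extracted acoustic features.
Deep learning:
- Recurrent neural networks (RNN) and their variants (LSTM, GRU): suited to sequence data and able to capture temporal dependencies in speech.
- Convolutional neural networks (CNN): suited to local features and usable for extracting spatial patterns from spectrograms.
- Transformer: self-attention models have made notable progress in speech emotion recognition and can capture long-range dependencies.
- Pretrained models: fine-tuning models pretrained on large amounts of speech (such as Wav2Vec 2.0 and HuBERT) can improve performance considerably.

Example code (Python with librosa for feature extraction, scikit-learn for a simple classifier):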
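Two of the features listed above are easy to compute by hand, which helps show what they measure. The sketch below computes spectral entropy and RMS energy for a single frame with NumPy. Normalizing the entropy to the range 0 to 1 is a choice made here for readability, and the function names are illustrative.

```python
import numpy as np

def spectral_entropy(frame, eps=1e-12):
    """Normalized spectral entropy: near 0 for a pure tone, near 1 for noise."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    p = power / (power.sum() + eps)          # treat the power spectrum as a distribution
    entropy = -(p * np.log2(p + eps)).sum()
    return float(entropy / np.log2(len(p)))  # divide by the maximum possible entropy

def rms_energy(frame):
    """Root-mean-square energy of one frame."""
    return float(np.sqrt(np.mean(frame ** 2)))

sr, n = 16000, 512
t = np.arange(n) / sr
tone = np.sin(2 * np.pi * 1000 * t)                  # 1 kHz tone, falls exactly on an FFT bin
noise = np.random.default_rng(0).standard_normal(n)  # white noise

print(spectral_entropy(tone), spectral_entropy(noise))
print(rms_energy(tone))
```

A tone concentrates its power in one frequency bin, so its entropy is close to zero, while noise spreads power across all bins and scores close to one.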

import librosa
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

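# Assume we have audio data labeled with emotions.
# A real project needs a dataset of audio files with matching emotion labels,
# for example: {'audio_path': 'path/to/audio1.wav', 'emotion': 'happy'}
#              {'audio_path': 'path/to/audio2.wav', 'emotion': 'sad'}

# Generate random features and labels for illustration (real features come from real data)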
def generate_dummy_data(num_samples=100):
    features = []
    labels = []
    emotions = ['happy', 'sad', 'angry', 'neutral']

    for i in range(num_samples):
        # Simulated MFCCs (13 dimensions)
        mfcc = np.random.rand(13) * 10
        # Simulated pitch (1 dimension)
        pitch = np.random.rand(1) * 200 + 50
        # Simulated energy (1 dimension)
        energy = np.random.rand(1) * 100 + 10

        # Concatenate all features
        combined_features = np.concatenate((mfcc, pitch, energy))
        features.append(combined_features)
        labels.append(np.random.choice(emotions))

    return np.array(features), np.array(labels)

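def extract_features(audio_path, sr=16000):
    """
    Extract MFCC, pitch and energy features from an audio file.
    :param audio_path: path to the audio file
    :param sr: sample rate to resample to
    :return: combined feature vector (13 MFCC means, mean pitch, mean RMS energy), or None on failure
    """
    try:
        y, sr = librosa.load(audio_path, sr=sr)

        # 1. MFCCs
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        mfccs_mean = np.mean(mfccs.T, axis=0) # average over time

        # 2. Pitch (F0)
        # piptrack returns (frequency_bins, frames) matrices that are mostly zero,
        # so keep the strongest bin per frame and average only the non-zero pitches
        pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
        strongest = magnitudes.argmax(axis=0)
        frame_pitches = pitches[strongest, np.arange(pitches.shape[1])]
        voiced = frame_pitches[frame_pitches > 0]
        pitch_mean = float(np.mean(voiced)) if voiced.size > 0 else 0.0

        # 3. Energy (RMS energy)
        rms = librosa.feature.rms(y=y)
        energy_mean = np.mean(rms)

        # Combine all features
        combined_features = np.concatenate((mfccs_mean, [pitch_mean], [energy_mean]))
        return combined_features
    except Exception as e:
        print(f"Feature extraction failed: {audio_path}, error: {e}")
        return None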

class EmotionRecognizer:
    def __init__(self, model=None, scaler=None):
        self.model = model if model else SVC(kernel='rbf', C=1.0, gamma='scale')
        self.scaler = scaler if scaler else StandardScaler()
        self.label_encoder = {} # maps emotion labels to integer ids and back

    def train(self, audio_files, emotions, sr=16000):
        """
        Train the emotion recognition model.
        :param audio_files: list of audio file paths
        :param emotions: list of matching emotion labels
        :param sr: sample rate
        """
        X = []
        y_labels = []

        # Extract features
        for i, audio_path in enumerate(audio_files):
            features = extract_features(audio_path, sr)
            if features is not None:
                X.append(features)
                y_labels.append(emotions[i])

        if not X:
            print("No valid features were extracted; cannot train the model.")
            return

        X = np.array(X)
        y_labels = np.array(y_labels)

        # Encode labels
        unique_emotions = np.unique(y_labels)
        for i, emo in enumerate(unique_emotions):
            self.label_encoder[emo] = i
            self.label_encoder[i] = emo # two-way mapping

        y = np.array([self.label_encoder[label] for label in y_labels])

        # Standardize the features
        X_scaled = self.scaler.fit_transform(X)

        # Train the model
        print(f"Training the model on {len(X_scaled)} samples")
        self.model.fit(X_scaled, y)
        print("Model training finished.")

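    def predict(self, audio_path, sr=16000):
        """
        Predict the emotion of an audio file.
        :param audio_path: path to the audio file
        :param sr: sample rate
        :return: predicted emotion label (string), or "unknown" if features could not be extracted
        """
        features = extract_features(audio_path, sr)
        if features is None:
            return "unknown"

        features = features.reshape(1, -1) # reshape to a 2D array with one row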
        features_scaled = self.scaler.transform(features)

        prediction = self.model.predict(features_scaled)
        predicted_emotion_id = prediction[0]

        return self.label_encoder.get(predicted_emotion_id, "unknown")

# 示例使用
if __name__ == "__main__":
    # Suppose you have some wav files and matching labels
    # audio_files = ["./data/happy_01.wav", "./data/sad_01.wav", ...]
    # emotions = ["happy", "sad", ...]

    # For the demo, synthetic audio files and labels stand in for a real dataset.
    # A real application needs real recordings.

    # Create some dummy audio files for the demo
    if not os.path.exists("temp_audio_data"):
        os.makedirs("temp_audio_data")

    dummy_audio_files = []
    dummy_emotions = []
    emotions_list = ['happy', 'sad', 'angry', 'neutral', 'surprise']

    # Each run generates a different random set of files
    for i in range(20): # 20个样本
        emotion = np.random.choice(emotions_list)
        filename = f"temp_audio_data/dummy_{emotion}_{i}.wav"

        # Generate a random waveform
        duration = np.random.uniform(1.0, 3.0) # 1 to 3 seconds
        sr = 16000
        t = np.linspace(0, duration, int(sr * duration), endpoint=False)

        # Give each emotion a crude signature (a heavy simplification)
        if emotion == 'happy':
            y = 0.5 * np.sin(2 * np.pi * 440 * t) + 0.2 * np.random.randn(len(t))
        elif emotion == 'sad':
            y = 0.3 * np.sin(2 * np.pi * 220 * t) + 0.1 * np.random.randn(len(t))
        elif emotion == 'angry':
            y = 0.7 * np.sin(2 * np.pi * 600 * t) + 0.3 * np.random.randn(len(t))
        else: # neutral and surprise share one waveform, so the classifier cannot tell them apart
            y = 0.4 * np.sin(2 * np.pi * 330 * t) + 0.15 * np.random.randn(len(t))

        # Scale the peak amplitude to 0.8
        y = y / np.max(np.abs(y)) * 0.8

        # Save as a wav file
        from scipy.io.wavfile import write
        write(filename, sr, (y * 32767).astype(np.int16)) # convert to 16-bit integers

        dummy_audio_files.append(filename)
        dummy_emotions.append(emotion)

    # Split into training and test sets
    train_files, test_files, train_emotions, test_emotions = train_test_split(
        dummy_audio_files, dummy_emotions, test_size=0.3, random_state=42
    )

    ser = EmotionRecognizer()
    ser.train(train_files, train_emotions)

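    # Evaluate the model on the test set
    predictions = [ser.predict(f) for f in test_files]
    accuracy = accuracy_score(test_emotions, predictions)
    print(f"\nModel accuracy on the test set: {accuracy:.2f}")

    # Try predicting a new file (for example the VAD-processed file from earlier)
    if os.path.exists("vad_processed_audio.wav"):
        predicted_emotion = ser.predict("vad_processed_audio.wav")
        print(f"\nPredicted emotion for 'vad_processed_audio.wav': {predicted_emotion}")
    else:
        print("\nMake sure 'vad_processed_audio.wav' exists to run the emotion prediction.")

    # Clean up the dummy audio files
    # import shutil
    # shutil.rmtree("temp_audio_data")

Note: the code above is a framework that extracts basic acoustic features with librosa and classifies emotion with scikit-learn's SVC. It is a heavily simplified example meant to show the flow. Real speech emotion recognition needs a richer feature set (intonation, energy, more MFCC coefficients plus their first- and second-order deltas, and so on), a larger and accurately labeled dataset, and more capable deep learning models. For real projects, fine-tuning a pretrained deep learning model (such as a Transformer-based speech model) is strongly recommended.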
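The first- and second-order deltas mentioned in the note can be added with a few lines of NumPy. This is a minimal sketch using plain finite differences via `np.gradient`; librosa also provides `librosa.feature.delta` for this purpose. The random matrix stands in for a real MFCC matrix, and the function names are illustrative.

```python
import numpy as np

def add_deltas(features):
    """Stack features with their first and second time differences.

    features has shape (n_coeffs, n_frames), like the output of librosa.feature.mfcc.
    """
    delta = np.gradient(features, axis=1)  # change from frame to frame
    delta2 = np.gradient(delta, axis=1)    # change of that change
    return np.vstack([features, delta, delta2])

def pool_stats(features):
    """Summarize a variable-length sequence as a fixed-length vector (mean and std per row)."""
    return np.concatenate([features.mean(axis=1), features.std(axis=1)])

mfcc = np.random.default_rng(0).standard_normal((13, 50))  # stand-in for real MFCCs
vector = pool_stats(add_deltas(mfcc))
print(vector.shape)  # (78,)
```

Keeping the standard deviation alongside the mean preserves some information about how much each coefficient moves over the utterance, which the mean-only features in the example above discard.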

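2.4 Natural Language Understanding (NLU) / Intent Recognition

The NLU module works out the user's intent from the STT output text. For example, the intent of "我很难过" ("I feel sad") is "express feeling", while the intent of "播放一首悲伤的音乐" ("play a sad song") is "play music".

Key tasks:

Tokenization: splitting the text into individual words or symbols.
Part-of-speech (POS) tagging: identifying the grammatical role of each word.
Named entity recognition (NER): identifying proper nouns in the text (people, places, organizations, times, and so on).
Intent classification: assigning the whole sentence to a predefined intent category.
Slot filling: extracting key information (slot values) from the sentence; in "播放[摇滚乐]" ("play [rock music]"), "摇滚乐" is the slot value.

2.4.1 Intent Recognition Methods

Rule engine: suited to simple, unambiguous intents; recognizes them by keyword matching or regular expressions.
Traditional machine learning: converts text into numeric features (TF-IDF, word vectors) and classifies intent with a classifier such as SVM or naive Bayes.
Deep learning:
- RNN/LSTM/GRU: capture context along the text sequence.
- CNN: extracts local text features.
- Transformer (BERT, RoBERTa, XLM-R, and so on): built on self-attention, these models handle long-range dependencies and contextual meaning well and are widely used for text representation and understanding. A specific task is usually handled by fine-tuning a pretrained model.

Example code (Python with spaCy for basic NLP, scikit-learn for simple intent classification):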
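Slot filling for simple commands can be sketched with regular expressions. The patterns below are hypothetical and cover only the phrasing used in this lecture's examples; a production system would typically use a sequence-labeling model instead.

```python
import re

# Hypothetical patterns: one per intent, with named groups as slots
SLOT_PATTERNS = {
    "play_music": re.compile(r"播放(?P<music>[^，。！？]+)"),
    "control_device": re.compile(r"(?P<action>打开|关掉)(?P<device>[^，。！？]+)"),
}

def fill_slots(intent, text):
    """Return the slot values found in text for the given intent, or an empty dict."""
    pattern = SLOT_PATTERNS.get(intent)
    match = pattern.search(text) if pattern else None
    return match.groupdict() if match else {}

print(fill_slots("play_music", "播放摇滚乐"))          # {'music': '摇滚乐'}
print(fill_slots("control_device", "打开客厅的灯。"))  # {'action': '打开', 'device': '客厅的灯'}
```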

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load the spaCy Chinese model
try:
    nlp = spacy.load("zh_core_web_sm")
except OSError:
    print("Downloading the spaCy Chinese model 'zh_core_web_sm'...")
    spacy.cli.download("zh_core_web_sm")
    nlp = spacy.load("zh_core_web_sm")

class IntentRecognizer:
    def __init__(self, nlp_model=None):
        self.nlp = nlp_model if nlp_model else spacy.load("zh_core_web_sm")
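        self.pipeline = Pipeline([
            # token_pattern=None silences the warning scikit-learn emits when a custom tokenizer is supplied
            ('tfidf', TfidfVectorizer(tokenizer=self._spacy_tokenizer, token_pattern=None)),
            ('clf', LinearSVC())
        ])
        self.intent_labels = []

    def _spacy_tokenizer(self, text):
        """Tokenize with spaCy, dropping stop words, punctuation and whitespace."""
        # token.text is used rather than token.lemma_: the small Chinese pipeline ships no lemmatizer, so lemma_ may be empty
        return [token.text for token in self.nlp(text) if not token.is_stop and not token.is_punct and not token.is_space]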

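    def train(self, texts, intents):
        """
        Train the intent recognition model.
        :param texts: list of texts
        :param intents: list of matching intent labels
        """
        self.intent_labels = list(set(intents)) # remember every intent label seen in training
        print(f"Training the intent recognition model on {len(texts)} samples")
        self.pipeline.fit(texts, intents)
        print("Intent recognition model training finished.")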

    def predict_intent(self, text):
        """
        Predict the intent of a text.
        :param text: input text
        :return: predicted intent label
        """
        if not self.intent_labels:
            return "unknown_intent"

        # Try the keyword rules first. They are checked in order, so a sentence such as
        # "我很生气，放一些摇滚乐" matches "我很" and is labeled express_feeling.
        if "播放" in text or "放歌" in text:
            return "play_music"
        if "打开" in text or "启动" in text:
            return "control_device"
        if "我感觉" in text or "我很" in text or "想说" in text:
            return "express_feeling"

        # If no rule matched, fall back to the machine learning model
        prediction = self.pipeline.predict([text])
        return prediction[0]

# Example usage
if __name__ == "__main__":
    # Toy training data (kept in Chinese because the pipeline uses the Chinese spaCy model)
    texts = [
        "我感觉非常沮丧，什么都不想做。",
        "现在心情很不好，想找人说说话。",
        "帮我播放一首轻快的音乐。",
        "打开客厅的灯。",
        "我今天很开心！",
        "把空调温度调到25度。",
        "我很生气，放一些摇滚乐。",
        "我想听一些放松的白噪音。",
        "我感觉有点孤独。",
        "请关掉卧室的窗帘。",
        "我有点焦虑，需要一些安慰。",
        "放一首周杰伦的歌。"
    ]
    intents = [
        "express_feeling",
        "express_feeling",
        "play_music",
        "control_device",
        "express_feeling",
        "control_device",
        "play_music",
        "play_music",
        "express_feeling",
        "control_device",
        "express_feeling",
        "play_music"
    ]

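    # Split into training and test sets
    train_texts, test_texts, train_intents, test_intents = train_test_split(
        texts, intents, test_size=0.3, random_state=42
    )

    intent_rec = IntentRecognizer(nlp_model=nlp) # reuse the model loaded above
    intent_rec.train(train_texts, train_intents)

    # Evaluate the recognizer (keyword rules first, then the model) on the test set
    predictions = [intent_rec.predict_intent(t) for t in test_texts]
    print("\nIntent recognition evaluation report:")
    # zero_division=0 avoids warnings when a class is missing from this tiny test set
    print(classification_report(test_intents, predictions, zero_division=0))

    # Predict new sentences
    test_sentence1 = "我现在很烦躁，放点劲爆的音乐。"
    test_sentence2 = "我今天工作不顺利，心情很差。"
    test_sentence3 = "把灯关了。"

    print(f"\nIntent of '{test_sentence1}': {intent_rec.predict_intent(test_sentence1)}")
    print(f"Intent of '{test_sentence2}': {intent_rec.predict_intent(test_sentence2)}")
    print(f"Intent of '{test_sentence3}': {intent_rec.predict_intent(test_sentence3)}")

Note: this intent recognition example combines a simple rule engine with a scikit-learn model. For complex intent recognition, real applications usually rely on a deep-learning NLU framework (such as Rasa or DeepPavlov) or a fine-tuned model such as BERT, which cope better with multiple intents, entity extraction, and context management.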

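2.5 Decision and Routing Engine

This is the brain of the semantic audio routing system. It combines the results of speech emotion recognition (SER) and intent recognition (NLU) to decide whether the user's request goes to the "comfort" branch or the "execute" branch.

The decision logic can take several forms:

Rule engine: the most direct approach, a series of IF-THEN rules.
Decision tree: more structured rules that can handle more complex conditional branches.
Machine learning / reinforcement learning: if the decision logic is very complex and needs to adapt, a classifier or a reinforcement learning model can be trained to learn the best routing policy.

2.5.1 A Rule-Based Decision Engine

We will demonstrate with a simple rule-based decision engine.

Example decision matrix:

User emotion	User intent	Routing branch	Example output
Sad / low	Express feeling	Comfort	"I can hear that you are having a hard time. Please don't worry, we are here to support you."
Sad / low	Play music	Execute	"Sure, playing some soothing music for you."
Angry / irritated	Express feeling	Comfort	"I understand that you are angry. Take a deep breath; I am here to listen."
Angry / irritated	Play music	Execute	"Sure, playing some rock music to let it out."
Happy / excited	Express feeling	Comfort	"Glad to hear you are so happy! Anything you would like to share?"
Happy / excited	Play music	Execute	"Sure, playing some upbeat music for you."
Neutral / calm	Express feeling	Comfort	"Hello, how can I help you?"
Neutral / calm	Control device	Execute	"Done, the living room light is on."
…	…	…	…

Example code (Python):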
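The rule-engine option can be sketched as an ordered list of IF-THEN rules in which the first match wins. This is a minimal sketch under stated assumptions, not the engine used later in this lecture: the `decide_branch` function, the 0.6 confidence threshold, and the idea of treating a low-confidence emotion as neutral are all introduced here for illustration.

```python
# Ordered IF-THEN rules: (condition on (emotion, intent), branch). First match wins.
RULES = [
    (lambda e, i: i == "express_feeling", "comfort"),
    (lambda e, i: e == "sad" and i == "control_device", "comfort"),
    (lambda e, i: i in ("play_music", "control_device"), "execute"),
]

def decide_branch(emotion_probs, intent, min_confidence=0.6):
    """Pick the most likely emotion, distrust it if confidence is low, then apply the rules."""
    emotion, confidence = max(emotion_probs.items(), key=lambda kv: kv[1])
    if confidence < min_confidence:
        emotion = "neutral"  # assumption: an uncertain emotion should not change the routing
    for condition, branch in RULES:
        if condition(emotion, intent):
            return branch
    return "comfort"  # default when nothing matches

print(decide_branch({"sad": 0.9, "neutral": 0.1}, "control_device"))  # comfort
print(decide_branch({"sad": 0.5, "neutral": 0.5}, "control_device"))  # execute
```

Gating on confidence is one way to keep a noisy emotion classifier from overriding a clear request.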

class RoutingEngine:
    def __init__(self):
        # Routing rules; this could also be a richer decision tree or lookup table
        self.routing_rules = {
            # (emotion, intent): branch
            ('sad', 'express_feeling'): 'comfort',
            ('sad', 'play_music'): 'execute_play_music_calm',
            ('sad', 'control_device'): 'comfort', # comfort first when the user is feeling low

            ('angry', 'express_feeling'): 'comfort',
            ('angry', 'play_music'): 'execute_play_music_rock',
            ('angry', 'control_device'): 'execute', # an angry user probably wants the action done right away

            ('happy', 'express_feeling'): 'comfort',
            ('happy', 'play_music'): 'execute_play_music_happy',
            ('happy', 'control_device'): 'execute',

            ('neutral', 'express_feeling'): 'comfort',
            ('neutral', 'play_music'): 'execute_play_music_neutral',
            ('neutral', 'control_device'): 'execute',

            # Fallback rules for when the emotion is not recognized
            ('unknown', 'unknown_intent'): 'comfort', # comfort first when nothing is known
            ('unknown', 'express_feeling'): 'comfort',
            ('unknown', 'play_music'): 'execute_play_music_neutral',
            ('unknown', 'control_device'): 'execute',

            # More combinations...
        }

        # Concrete action for each branch (simplified to reply text)
        self.branch_actions = {
            'comfort': {
                'sad': "我听到您现在很难过，请不要担心，我们在这里支持您。",
                'angry': "我理解您现在很生气，请深呼吸，我在这里听您说。",
                'happy': "很高兴您这么开心！有什么想分享的吗？",
                'neutral': "您好，有什么可以帮助您的吗？",
                'default': "我在这里，您有什么需要帮助的吗？"
            },
            'execute': {
                'play_music_calm': "好的，为您播放一首舒缓的音乐。",
                'play_music_rock': "好的，为您播放一些摇滚乐来发泄一下。",
                'play_music_happy': "好的，为您播放一首欢快的音乐。",
                'play_music_neutral': "好的，为您播放音乐。",
                'control_device': "好的，已为您执行操作。",
                'default': "好的，我将尝试为您执行请求。"
            }
        }

    def determine_route(self, emotion, intent):
        """
        Decide the route from emotion and intent.
        :param emotion: recognized emotion (e.g., 'sad', 'angry', 'happy', 'neutral', 'unknown')
        :param intent: recognized intent (e.g., 'express_feeling', 'play_music', 'control_device', 'unknown_intent')
        :return: route name: 'comfort', 'execute', or a refined execute route such as 'execute_play_music_rock'
        """
        key = (emotion, intent)
        route = self.routing_rules.get(key)

        if route:
            return route

        # No exact match: fall back to intent-only rules
        if intent == 'express_feeling':
            return 'comfort'
        elif intent in ['play_music', 'control_device']:
            return 'execute'

        return 'comfort' # default to comfort

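    def get_action_message(self, route, emotion=None, intent=None):
        """
        Return the reply text for a route and its context.
        :param route: route returned by determine_route
        :param emotion: user emotion
        :param intent: user intent
        :return: reply text
        """
        if route == 'comfort':
            return self.branch_actions['comfort'].get(emotion, self.branch_actions['comfort']['default'])
        elif route.startswith('execute'):
            # Refined routes such as 'execute_play_music_rock' also belong to the execute branch;
            # the intent and emotion below select the concrete reply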
            if intent == 'play_music':
                if emotion == 'sad':
                    return self.branch_actions['execute']['play_music_calm']
                elif emotion == 'angry':
                    return self.branch_actions['execute']['play_music_rock']
                elif emotion == 'happy':
                    return self.branch_actions['execute']['play_music_happy']
                else:
                    return self.branch_actions['execute']['play_music_neutral']
            elif intent == 'control_device':
                return self.branch_actions['execute']['control_device']
            else:
                return self.branch_actions['execute']['default']
        return "我不太明白您的意思。"

# Example usage
if __name__ == "__main__":
    router = RoutingEngine()

    # Scenario 1: the user is sad and expresses a feeling
    emotion1 = 'sad'
    intent1 = 'express_feeling'
    route1 = router.determine_route(emotion1, intent1)
    message1 = router.get_action_message(route1, emotion=emotion1, intent=intent1)
    print(f"Emotion: {emotion1}, intent: {intent1} -> route: {route1}, reply: {message1}")

    # Scenario 2: the user is angry and asks for music
    emotion2 = 'angry'
    intent2 = 'play_music'
    route2 = router.determine_route(emotion2, intent2)
    message2 = router.get_action_message(route2, emotion=emotion2, intent=intent2)
    print(f"Emotion: {emotion2}, intent: {intent2} -> route: {route2}, reply: {message2}")

    # Scenario 3: the user is happy and asks to control a device
    emotion3 = 'happy'
    intent3 = 'control_device'
    route3 = router.determine_route(emotion3, intent3)
    message3 = router.get_action_message(route3, emotion=emotion3, intent=intent3)
    print(f"Emotion: {emotion3}, intent: {intent3} -> route: {route3}, reply: {message3}")

    # Scenario 4: the user is neutral and expresses a feeling
    emotion4 = 'neutral'
    intent4 = 'express_feeling'
    route4 = router.determine_route(emotion4, intent4)
    message4 = router.get_action_message(route4, emotion=emotion4, intent=intent4)
    print(f"Emotion: {emotion4}, intent: {intent4} -> route: {route4}, reply: {message4}")

    # Scenario 5: unknown intent
    emotion5 = 'sad'
    intent5 = 'unknown_intent'
    route5 = router.determine_route(emotion5, intent5)
    message5 = router.get_action_message(route5, emotion=emotion5, intent=intent5)
    print(f"Emotion: {emotion5}, intent: {intent5} -> route: {route5}, reply: {message5}")

Note: the routing engine here is a simple rule implementation built on a Python dictionary. A production system may need a more robust rule-engine framework (Drools, for example), or a decision tree model or a reinforcement-learning policy to handle more complex decision logic.
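The routing table stores refined names such as 'execute_play_music_rock' alongside the plain 'comfort' and 'execute' branches. One way to keep downstream code simple is to split a route name into its top-level branch and an optional detail before dispatching on it. The helper below is a hypothetical sketch, not part of the engine above.

```python
def split_route(route):
    """Split a route name into (branch, detail); detail is None for plain branches."""
    branch, _, detail = route.partition("_")
    return branch, (detail or None)

for name in ("comfort", "execute", "execute_play_music_rock"):
    print(name, "->", split_route(name))
```

With this split, callers can branch on `comfort` versus `execute` first and only then look at the detail, so adding a new refined route does not require touching every comparison.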

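2.6 Audio Output Module

This is the last step of the interaction with the user. Based on the routing result from the decision engine, the system generates and plays the matching audio.

Key tasks:

Text-to-speech (TTS): converting the text reply into natural speech. To make "comfort" or "execute" responses sound right, the TTS system may need to support emotional speech synthesis, that is, synthesizing speech in a specified emotional style.
Audio playback: playing prerecorded audio files or TTS output.
Dynamic content generation: for the "execute" branch, the system may need to play music or sound effects, or control external devices.

Example tools:

Google Cloud Text-to-Speech API
Baidu AI Speech API
Microsoft Azure Speech Service
PaddleSpeech (open source, can be deployed locally)
OpenTTS (open source)

Example code (Python with gTTS and pydub):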
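Many TTS services accept SSML, the W3C markup for speech synthesis, and its `prosody` element is one common way to request a slower, lower delivery for a comforting reply. The sketch below only builds the SSML string. The prosody values per branch are assumptions chosen for illustration, and how much of SSML a given provider honors varies, so check the provider's documentation before relying on it.

```python
from xml.sax.saxutils import escape

# Hypothetical prosody settings per routing branch
PROSODY_BY_BRANCH = {
    "comfort": {"rate": "slow", "pitch": "-2st"},     # slower and slightly lower
    "execute": {"rate": "medium", "pitch": "medium"},  # neutral delivery
}

def to_ssml(text, branch):
    """Wrap reply text in an SSML prosody element chosen by branch."""
    p = PROSODY_BY_BRANCH.get(branch, PROSODY_BY_BRANCH["execute"])
    return (f'<speak><prosody rate="{p["rate"]}" pitch="{p["pitch"]}">'
            f'{escape(text)}</prosody></speak>')

print(to_ssml("I am here to listen.", "comfort"))
```

Escaping the text keeps characters such as `&` and `<` in a reply from breaking the markup.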

from gtts import gTTS
from pydub import AudioSegment
from pydub.playback import play
import os

class AudioOutputService:
    def __init__(self, lang='zh-CN'): # gTTS language tag for Mandarin; recent gTTS releases treat 'zh-cn' as deprecated
        self.lang = lang
        self.output_dir = "audio_responses"
        if not os.path.exists(self.output_dir):
            os.makedirs(self.output_dir)

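    def speak(self, text, emotion_tone=None):
        """
        Convert text to speech and play it.
        :param text: text to synthesize
        :param emotion_tone: emotional tone (optional; ignored by gTTS, usable with a TTS service that supports it)
        """
        print(f"System reply: {text}")
        try:
            # gTTS has no emotion control, so the tone is carried by the wording of the reply itself
            tts = gTTS(text=text, lang=self.lang)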
            filename = os.path.join(self.output_dir, f"response_{hash(text)}.mp3")
            tts.save(filename)

            audio = AudioSegment.from_mp3(filename)
            play(audio)
            # os.remove(filename) # optional: delete the file after playback
        except Exception as e:
            print(f"TTS or playback failed: {e}")

    def play_music(self, music_type="default"):
        """
        Play music of the given type.
        :param music_type: music type (e.g., 'calm', 'rock', 'happy')
        """
        print(f"System is playing {music_type} music...")
        # A real project would call a music playback API or a local music library here.
        # For the demo we only print a message.
        if music_type == 'calm':
            print("Playing soothing light music...")
        elif music_type == 'rock':
            print("Playing energetic rock music...")
        elif music_type == 'happy':
            print("Playing upbeat pop music...")
        else:
            print("Playing default background music...")

        # Example of actual playback (requires local music files)
        # try:
        #     music_file = os.path.join("music_library", f"{music_type}.mp3")
        #     if os.path.exists(music_file):
        #         music = AudioSegment.from_mp3(music_file)
        #         play(music)
        #     else:
        #         print(f"No {music_type} music file found.")
        # except Exception as e:
        #     print(f"Music playback failed: {e}")

# Example usage (assumes RoutingEngine from the previous section is defined)
if __name__ == "__main__":
    output_service = AudioOutputService()
    router = RoutingEngine() # create a fresh routing engine

    # Scenario 1: comfort
    emotion = 'sad'
    intent = 'express_feeling'
    route = router.determine_route(emotion, intent)
    message = router.get_action_message(route, emotion=emotion, intent=intent)
    output_service.speak(message)

    # Scenario 2: execute, play rock music
    emotion = 'angry'
    intent = 'play_music'
    route = router.determine_route(emotion, intent)
    message = router.get_action_message(route, emotion=emotion, intent=intent)
    output_service.speak(message) # tell the user that music is about to play
    output_service.play_music(music_type='rock') # actually play the music

    # Scenario 3: execute, control a device (simplified to a text reply)
    emotion = 'neutral'
    intent = 'control_device'
    route = router.determine_route(emotion, intent)
    message = router.get_action_message(route, emotion=emotion, intent=intent)
    output_service.speak(message)

Note: gTTS is a free, easy-to-use TTS library, but it does not support emotional speech synthesis. Where emotional synthesis is needed, use a more capable cloud service (such as the Baidu, Google, or Microsoft TTS APIs), which typically offer multiple voices and options for adjusting speaking style. The pydub library is used to process and play audio files.

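3. System Integration and the Real-Time Interaction Flow

Now we will put all the components together and simulate a complete real-time interaction flow.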

import time
import os
import shutil

# Import the classes defined earlier
# from audio_processor import AudioProcessor
# from stt_service import STTService
# from emotion_recognizer import EmotionRecognizer
# from intent_recognizer import IntentRecognizer
# from routing_engine import RoutingEngine
# from audio_output_service import AudioOutputService

# For convenience, all classes are defined in this one file to avoid cross-file imports.
# Make sure every class from the earlier code blocks is copied here or imported correctly.

# === AudioProcessor ===
import pyaudio
import wave
import collections
import webrtcvad
import numpy as np

class AudioProcessor:
    def __init__(self, rate=16000, chunk_size=480, vad_aggressiveness=3):
        self.rate = rate
        self.chunk_size = chunk_size
        self.vad = webrtcvad.Vad(vad_aggressiveness)
        self.audio = pyaudio.PyAudio()

    def record_audio_chunk(self, chunk_duration_ms=1000):
        """
        Record one chunk of audio (used for real-time stream processing).
        :param chunk_duration_ms: recording length in milliseconds
        :return: the recorded audio as raw bytes
        """
        frames_to_record = int(self.rate / self.chunk_size * (chunk_duration_ms / 1000))

        stream = self.audio.open(format=pyaudio.paInt16,
                                 channels=1,
                                 rate=self.rate,
                                 input=True,
                                 frames_per_buffer=self.chunk_size)

        frames = []
        for _ in range(0, frames_to_record):
            data = stream.read(self.chunk_size, exception_on_overflow=False)
            frames.append(data)

        stream.stop_stream()
        stream.close()
        return b''.join(frames)

    def process_with_vad(self, audio_data):
        frame_length_ms = 10
        frames = self._frame_generator(frame_length_ms, audio_data, self.rate)

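        # Use a large buffer so that complete speech segments are captured.
        # Real applications usually run VAD as a stream; this is simplified for the demo.
        # Side effect: pauses shorter than 2 seconds between speech frames are kept.
        ring_buffer = collections.deque(maxlen=int(self.rate * 2 / (self.rate * frame_length_ms / 1000))) # 2-second buffer (200 frames of 10 ms)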
        voiced_frames = []

        for frame in frames:
            is_speech = self.vad.is_speech(frame.bytes, self.rate)

            if is_speech:
                voiced_frames.extend(list(ring_buffer)) # keep the buffered frames that precede this speech frame, then empty the buffer
                ring_buffer.clear()
                voiced_frames.append(frame)
            else:
                ring_buffer.append(frame)

        processed_audio_data = b''.join([f.bytes for f in voiced_frames])
        return processed_audio_data

    class Frame(object):
        def __init__(self, bytes, timestamp, duration):
            self.bytes = bytes
            self.timestamp = timestamp
            self.duration = duration

    def _frame_generator(self, frame_duration_ms, audio, sample_rate):
        n = int(sample_rate * (frame_duration_ms / 1000.0) * 2)
        offset = 0
        timestamp = 0.0
        duration = (n / sample_rate) / 2
        while offset + n <= len(audio):  # stop before a partial trailing frame
            yield self.Frame(audio[offset:offset + n], timestamp, duration)
            timestamp += duration
            offset += n

    def close(self):
        self.audio.terminate()

# === STTService ===
import assemblyai as aai

class STTService:
    def __init__(self, api_key=None):
        if api_key:
            aai.settings.api_key = api_key
        else:
            print("Warning: AssemblyAI API Key not provided. STT might fail.")

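    def transcribe_audio(self, audio_file_path):
        try:
            # The intent rules match Chinese keywords, so request Chinese
            # transcription instead of the service's default language.
            config = aai.TranscriptionConfig(language_code="zh")
            transcriber = aai.Transcriber(config=config)
            transcript = transcriber.transcribe(audio_file_path)
            if transcript.status == aai.TranscriptStatus.completed:
                return transcript.text
            else:
                print(f"AssemblyAI 转录失败: {transcript.error}")
                return None
        except Exception as e:
            print(f"AssemblyAI 调用失败: {e}")
            return None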

# === EmotionRecognizer ===
import librosa
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from scipy.io.wavfile import write

class EmotionRecognizer:
    def __init__(self, model=None, scaler=None):
        self.model = model if model else SVC(kernel='rbf', C=1.0, gamma='scale')
        self.scaler = scaler if scaler else StandardScaler()
        self.label_encoder = {}
        self.is_trained = False

    def _extract_features(self, audio_path, sr=16000):
        try:
            y, sr = librosa.load(audio_path, sr=sr)
            # 13 MFCCs, averaged over time
            mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
            mfccs_mean = np.mean(mfccs.T, axis=0)
            # piptrack leaves bins without a detected pitch at 0, so average
            # only the non-zero estimates
            pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
            detected = pitches[pitches > 0]
            pitch_mean = np.mean(detected) if detected.size > 0 else 0.0
            # Mean RMS energy
            rms = librosa.feature.rms(y=y)
            energy_mean = np.mean(rms)
            combined_features = np.concatenate((mfccs_mean, [pitch_mean], [energy_mean]))
            return combined_features
        except Exception as e:
            print(f"特征提取失败: {audio_path}, 错误: {e}")
            return None

    def train(self, audio_files, emotions, sr=16000):
        X = []
        y_labels = []
        for i, audio_path in enumerate(audio_files):
            features = self._extract_features(audio_path, sr)
            if features is not None:
                X.append(features)
                y_labels.append(emotions[i])

        if not X:
            print("没有提取到有效特征，无法训练模型。")
            return

        X = np.array(X)
        y_labels = np.array(y_labels)

        unique_emotions = np.unique(y_labels)
        for i, emo in enumerate(unique_emotions):
            self.label_encoder[emo] = i
            self.label_encoder[i] = emo

        y = np.array([self.label_encoder[label] for label in y_labels])

        X_scaled = self.scaler.fit_transform(X)
        self.model.fit(X_scaled, y)
        self.is_trained = True
        print("情感识别模型训练完成。")

    def predict(self, audio_path, sr=16000):
        if not self.is_trained:
            print("Warning: 情感识别模型未训练。返回 'unknown'。")
            return "unknown"

        features = self._extract_features(audio_path, sr)
        if features is None:
            return "unknown"

        features = features.reshape(1, -1)
        features_scaled = self.scaler.transform(features)

        prediction = self.model.predict(features_scaled)
        predicted_emotion_id = prediction[0]

        return self.label_encoder.get(predicted_emotion_id, "unknown")

# === IntentRecognizer ===
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

try:
    nlp_zh = spacy.load("zh_core_web_sm")
except OSError:
    print("下载 spaCy 中文模型 'zh_core_web_sm'...")
    spacy.cli.download("zh_core_web_sm")
    nlp_zh = spacy.load("zh_core_web_sm")

class IntentRecognizer:
    def __init__(self, nlp_model=None):
        self.nlp = nlp_model if nlp_model else nlp_zh
        self.pipeline = Pipeline([
            ('tfidf', TfidfVectorizer(tokenizer=self._spacy_tokenizer)),
            ('clf', LinearSVC())
        ])
        self.intent_labels = []
        self.is_trained = False

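    def _spacy_tokenizer(self, text):
        # Use token.text rather than token.lemma_: the Chinese pipeline has no
        # lemmatizer, so lemma_ may be empty and would collapse every token.
        return [token.text for token in self.nlp(text) if not token.is_stop and not token.is_punct and not token.is_space]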

    def train(self, texts, intents):
        self.intent_labels = list(set(intents))
        self.pipeline.fit(texts, intents)
        self.is_trained = True
        print("意图识别模型训练完成。")

    def predict_intent(self, text):
        if not self.is_trained:
            print("Warning: 意图识别模型未训练。返回 'unknown_intent'。")
            return "unknown_intent"

        # Keyword rules run first and take precedence over the trained classifier
        if "播放" in text or "放歌" in text or "听音乐" in text:
            return "play_music"
        if "打开" in text or "启动" in text or "关闭" in text or "调" in text:
            return "control_device"
        if "感觉" in text or "心情" in text or "很难过" in text or "很开心" in text or "很生气" in text or "沮丧" in text or "焦虑" in text:
            return "express_feeling"

        prediction = self.pipeline.predict([text])
        return prediction[0]

# === RoutingEngine ===
class RoutingEngine:
    def __init__(self):
        self.routing_rules = {
            ('sad', 'express_feeling'): 'comfort',
            ('sad', 'play_music'): 'execute_play_music_calm',
            ('sad', 'control_device'): 'comfort', 

            ('angry', 'express_feeling'): 'comfort',
            ('angry', 'play_music'): 'execute_play_music_rock',
            ('angry', 'control_device'): 'execute', 

            ('happy', 'express_feeling'): 'comfort',
            ('happy', 'play_music'): 'execute_play_music_happy',
            ('happy', 'control_device'): 'execute',

            ('neutral', 'express_feeling'): 'comfort',
            ('neutral', 'play_music'): 'execute_play_music_neutral',
            ('neutral', 'control_device'): 'execute',

            ('unknown', 'unknown_intent'): 'comfort',
            ('unknown', 'express_feeling'): 'comfort',
            ('unknown', 'play_music'): 'execute_play_music_neutral',
            ('unknown', 'control_device'): 'execute',
        }

        self.branch_actions = {
            'comfort': {
                'sad': "我听到您现在很难过，请不要担心，我们在这里支持您。",
                'angry': "我理解您现在很生气，请深呼吸，我在这里听您说。",
                'happy': "很高兴您这么开心！有什么想分享的吗？",
                'neutral': "您好，有什么可以帮助您的吗？",
                'default': "我在这里，您有什么需要帮助的吗？"
            },
            'execute': {
                'play_music_calm': "好的，为您播放一首舒缓的音乐。",
                'play_music_rock': "好的，为您播放一些摇滚乐来发泄一下。",
                'play_music_happy': "好的，为您播放一首欢快的音乐。",
                'play_music_neutral': "好的，为您播放音乐。",
                'control_device': "好的，已为您执行操作。",
                'default': "好的，我将尝试为您执行请求。"
            }
        }

    def determine_route(self, emotion, intent):
        key = (emotion, intent)
        route = self.routing_rules.get(key)

        if route:
            return route

        if intent == 'express_feeling':
            return 'comfort'
        elif intent in ['play_music', 'control_device']:
            return 'execute'

        return 'comfort'

    def get_action_message(self, route, emotion=None, intent=None):
        if route == 'comfort':
            return self.branch_actions['comfort'].get(emotion, self.branch_actions['comfort']['default'])
        elif route.startswith('execute'):
            # Also matches fine-grained routes such as 'execute_play_music_calm'
            if intent == 'play_music':
                if emotion == 'sad':
                    return self.branch_actions['execute']['play_music_calm']
                elif emotion == 'angry':
                    return self.branch_actions['execute']['play_music_rock']
                elif emotion == 'happy':
                    return self.branch_actions['execute']['play_music_happy']
                else:
                    return self.branch_actions['execute']['play_music_neutral']
            elif intent == 'control_device':
                return self.branch_actions['execute']['control_device']
            else:
                return self.branch_actions['execute']['default']
        return "我不太明白您的意思。"

# === AudioOutputService ===
from gtts import gTTS
from pydub import AudioSegment
from pydub.playback import play

class AudioOutputService:
    def __init__(self, lang='zh-cn'):
        self.lang = lang
        self.output_dir = "audio_responses"
        if not os.path.exists(self.output_dir):
            os.makedirs(self.output_dir)

    def speak(self, text, emotion_tone=None):
        print(f"系统回复: {text}")
        try:
            tts = gTTS(text=text, lang=self.lang)
            filename = os.path.join(self.output_dir, f"response_{hash(text)}.mp3")
            tts.save(filename)
            audio = AudioSegment.from_mp3(filename)
            play(audio)
            # os.remove(filename)
        except Exception as e:
            print(f"TTS或播放失败: {e}")

    def play_music(self, music_type="default"):
        print(f"系统正在播放 {music_type} 音乐...")
        # A real project would call a music playback API or a local music library here
        if music_type == 'calm':
            print("播放舒缓的轻音乐...")
        elif music_type == 'rock':
            print("播放劲爆的摇滚乐...")
        elif music_type == 'happy':
            print("播放欢快的流行乐...")
        else:
            print("播放默认背景音乐...")
        # Simulate the playback duration
        time.sleep(2)  # pretend to play for 2 seconds

# --- Main pipeline integration ---
class SemanticAudioRouterSystem:
    def __init__(self, stt_api_key=None):
        self.audio_processor = AudioProcessor()
        self.stt_service = STTService(api_key=stt_api_key)
        self.emotion_recognizer = EmotionRecognizer()
        self.intent_recognizer = IntentRecognizer()
        self.routing_engine = RoutingEngine()
        self.audio_output_service = AudioOutputService()

        # Directory for temporary audio files
        self.temp_audio_dir = "temp_sar_audio"
        if not os.path.exists(self.temp_audio_dir):
            os.makedirs(self.temp_audio_dir)

        # Train the emotion and intent models on synthetic data; a real system would load trained models
        self._train_dummy_models()

    def _train_dummy_models(self):
        print("准备训练虚拟情感和意图识别模型...")
        # Build synthetic data for the emotion recognizer
        dummy_audio_files = []
        dummy_emotions = []
        emotions_list = ['happy', 'sad', 'angry', 'neutral', 'surprise']

        # Labels are drawn at random, so the generated files differ between runs
        for i in range(50):  # 50 training samples
            emotion = np.random.choice(emotions_list)
            filename = os.path.join(self.temp_audio_dir, f"dummy_emo_{emotion}_{i}.wav")
            duration = np.random.uniform(1.0, 3.0)
            sr = 16000
            t = np.linspace(0, duration, int(sr * duration), endpoint=False)

            # Sine tones with different pitch and noise levels stand in for real emotional speech
            if emotion == 'happy':
                y = 0.5 * np.sin(2 * np.pi * 440 * t) + 0.2 * np.random.randn(len(t))
            elif emotion == 'sad':
                y = 0.3 * np.sin(2 * np.pi * 220 * t) + 0.1 * np.random.randn(len(t))
            elif emotion == 'angry':
                y = 0.7 * np.sin(2 * np.pi * 600 * t) + 0.3 * np.random.randn(len(t))
            else:  # 'neutral' and 'surprise' share the same tone
                y = 0.4 * np.sin(2 * np.pi * 330 * t) + 0.15 * np.random.randn(len(t))
            y = y / np.max(np.abs(y)) * 0.8  # peak-normalize to 0.8
            write(filename, sr, (y * 32767).astype(np.int16))
            dummy_audio_files.append(filename)
            dummy_emotions.append(emotion)

        self.emotion_recognizer.train(dummy_audio_files, dummy_emotions)

        # Build synthetic data for the intent recognizer
        texts = [
            "我感觉非常沮丧，什么都不想做。", "现在心情很不好，想找人说说话。",
            "帮我播放一首轻快的音乐。", "打开客厅的灯。",
            "我今天很开心！", "把空调温度调到25度。",
            "我很生气，放一些摇滚乐。", "我想听一些放松的白噪音。",
            "我感觉有点孤独。", "请关掉卧室的窗帘。",
            "我有点焦虑，需要一些安慰。", "放一首周杰伦的歌。",
            "我心情很差，想静一静。", "帮我把窗帘拉开。",
            "我很高兴，放首动感的歌。", "我有点难过，可以给我讲个笑话吗？"
        ]
        intents = [
            "express_feeling", "express_feeling",
            "play_music", "control_device",
            "express_feeling", "control_device",
            "play_music", "play_music",
            "express_feeling", "control_device",
            "express_feeling", "play_music",
            "express_feeling", "control_device",
            "play_music", "express_feeling"  # asking for a joke also counts as seeking comfort
        ]
        self.intent_recognizer.train(texts, intents)
        print("虚拟情感和意图识别模型训练完成。")

    def process_user_input(self, audio_data):
        """
        Process one utterance of live user speech.
        :param audio_data: the user's raw speech as bytes
        :return: the system's reply text and the chosen route
        """
        print("\n--- 开始处理用户输入 ---")

        # 1. Voice activity detection (VAD)
        print("1. 进行语音活动检测 (VAD)...")
        vad_processed_audio = self.audio_processor.process_with_vad(audio_data)
        if not vad_processed_audio:
            print("未检测到有效语音。")
            return "对不起，我没有听到您的声音。", None

        # Save the VAD-filtered audio to a temporary WAV file
        temp_vad_file = os.path.join(self.temp_audio_dir, "current_vad_input.wav")
        with wave.open(temp_vad_file, 'wb') as wf:
            wf.setnchannels(1)
            wf.setsampwidth(self.audio_processor.audio.get_sample_size(pyaudio.paInt16))
            wf.setframerate(self.audio_processor.rate)
            wf.writeframes(vad_processed_audio)

        # 2. Speech emotion recognition (SER)
        print("2. 识别用户情感...")
        emotion = self.emotion_recognizer.predict(temp_vad_file)
        print(f"   识别到的情感: {emotion}")

        # 3. Speech-to-text (STT)
        print("3. 语音转文本 (STT)...")
        text_input = self.stt_service.transcribe_audio(temp_vad_file)
        if not text_input:
            print("STT失败，无法理解内容。")
            return "对不起，我没能听清您说了什么。", None
        print(f"   转录文本: '{text_input}'")

        # 4. Intent recognition (NLU)
        print("4. 识别用户意图...")
        intent = self.intent_recognizer.predict_intent(text_input)
        print(f"   识别到的意图: {intent}")

        # 5. Decision and routing
        print("5. 决策路由...")
        route = self.routing_engine.determine_route(emotion, intent)
        response_message = self.routing_engine.get_action_message(route, emotion=emotion, intent=intent)
        print(f"   决策路由到: {route}, 响应信息: '{response_message}'")

        return response_message, route

    def run_interactive_session(self, num_interactions=3):
        print("\n--- 语义音频路由系统启动 ---")
        print("请说话，系统将根据您的情绪和意图进行回应。")
        print("每次说完请等待提示。")

        for i in range(num_interactions):
            print(f"\n--- 第 {i+1} 次交互 ---")
            print("请说话（录制3秒）...")

            # Record the raw audio in one fixed-length block; a real system could stream instead
            raw_input_audio = self.audio_processor.record_audio_chunk(chunk_duration_ms=3000)

            system_response, route = self.process_user_input(raw_input_audio)

            # 6. Audio output
            self.audio_output_service.speak(system_response)
            if route == 'execute_play_music_calm':
                self.audio_output_service.play_music(music_type='calm')
            elif route == 'execute_play_music_rock':
                self.audio_output_service.play_music(music_type='rock')
            elif route == 'execute_play_music_happy':
                self.audio_output_service.play_music(music_type='happy')
            elif route == 'execute_play_music_neutral':
                self.audio_output_service.play_music(music_type='neutral')
            # The generic 'execute' route covers device control and other non-music
            # actions; here it only prints, where a real system would call a device API
            elif route == 'execute':
                 print("（已模拟执行设备控制或其他操作）")

            time.sleep(1) # 稍作停顿

        print("\n--- 交互会话结束 ---")
        self.cleanup()

    def cleanup(self):
        self.audio_processor.close()
        if os.path.exists(self.temp_audio_dir):
            shutil.rmtree(self.temp_audio_dir)
        if os.path.exists(self.audio_output_service.output_dir):
            shutil.rmtree(self.audio_output_service.output_dir)
        print("临时文件已清理。")

# Run the system
if __name__ == "__main__":
    # Replace with your own AssemblyAI API key
    # You can register at https://www.assemblyai.com/ for a free quota
    ASSEMBLYAI_API_KEY = os.environ.get("ASSEMBLYAI_API_KEY", "YOUR_ASSEMBLYAI_API_KEY")

    # Check whether the API key has been set
    if ASSEMBLYAI_API_KEY == "YOUR_ASSEMBLYAI_API_KEY":
        print("*************************************************************")
        print("WARNING: 请设置 ASSEMBLYAI_API_KEY 环境变量或在代码中替换 'YOUR_ASSEMBLYAI_API_KEY'。")
        print("         否则STT服务将无法正常工作。")
        print("*************************************************************")
        # exit() # without an API key you can exit here, or continue and let STT fail

    system = SemanticAudioRouterSystem(stt_api_key=ASSEMBLYAI_API_KEY)
    system.run_interactive_session(num_interactions=3)
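
The routing table in the listing returns fine-grained route names such as execute_play_music_calm alongside the plain comfort and execute branches. The following is a minimal standalone sketch, not part of the listing, of how such names can be resolved and split into a branch and a detail without any microphone or API key. The names ROUTING_RULES, determine_route, and split_route are illustrative assumptions, and the table holds only a small subset of the rules.

```python
from typing import Dict, Tuple

# A small subset of (emotion, intent) -> route rules, for illustration only.
ROUTING_RULES: Dict[Tuple[str, str], str] = {
    ("sad", "express_feeling"): "comfort",
    ("sad", "play_music"): "execute_play_music_calm",
    ("angry", "play_music"): "execute_play_music_rock",
    ("angry", "control_device"): "execute",
}


def determine_route(emotion: str, intent: str) -> str:
    """Look up an exact rule first, then fall back on the intent alone."""
    route = ROUTING_RULES.get((emotion, intent))
    if route is not None:
        return route
    if intent in ("play_music", "control_device"):
        return "execute"
    return "comfort"  # the safe default when nothing else matches


def split_route(route: str) -> Tuple[str, str]:
    """Split 'execute_play_music_calm' into ('execute', 'play_music_calm')."""
    branch, _, detail = route.partition("_")
    return branch, detail


if __name__ == "__main__":
    for emotion, intent in [("sad", "play_music"), ("angry", "control_device"),
                            ("surprise", "small_talk")]:
        route = determine_route(emotion, intent)
        print(emotion, intent, "->", route, split_route(route))
```

Splitting on the first underscore keeps the branch decision (comfort or execute) separate from the detail, so a caller can dispatch on the branch and pass the detail along as a parameter.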

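Before running, make sure you have done the following:

Install all dependencies: pip install numpy scipy pyaudio webrtcvad librosa scikit-learn spacy gtts pydub assemblyai
Download the spaCy Chinese model: python -m spacy download zh_core_web_sm
Install FFmpeg: pydub needs FFmpeg to read the MP3 files that gTTS produces. On Linux this is usually sudo apt-get install ffmpeg, on macOS brew install ffmpeg; on Windows, download it and add it to PATH.
Set the AssemblyAI API key: replace "YOUR_ASSEMBLYAI_API_KEY" in the code, or set the ASSEMBLYAI_API_KEY environment variable.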
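
The checklist above can also be verified programmatically before the first run. The sketch below is one possible preflight check using only the standard library. The function name preflight and the module list are assumptions made for illustration; note that the import names differ from the pip package names in some cases (scikit-learn is imported as sklearn).

```python
import importlib.util
import os
import shutil

# Import names (not pip names) of the packages the listing relies on.
REQUIRED_MODULES = [
    "numpy", "scipy", "pyaudio", "webrtcvad", "librosa", "sklearn",
    "spacy", "zh_core_web_sm", "gtts", "pydub", "assemblyai",
]


def preflight(modules=REQUIRED_MODULES, env_var="ASSEMBLYAI_API_KEY"):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing Python module: {name}")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    if not os.environ.get(env_var):
        problems.append(f"environment variable {env_var} is not set")
    return problems


if __name__ == "__main__":
    issues = preflight()
    if issues:
        print("Environment is not ready:")
        for issue in issues:
            print(" -", issue)
    else:
        print("All prerequisites found.")
```

The check only confirms that modules and tools can be located; it does not validate the API key or open the microphone.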

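4. Challenges and Future Directions

4.1 Challenges

Real-time performance and latency: the whole chain, from speech input through processing and decision to output, must finish within an acceptable delay, especially in live interaction.
Robustness: noise, accents, varying speech rate, speaker differences, and multiple speakers in real environments all reduce recognition accuracy.
Subtlety and ambiguity of emotion: human emotion is complex; slight shifts in mood, mixed emotions, and sarcasm are hard to recognize accurately.
Data dependence: high-quality speech-emotion and intent datasets are essential for training, yet such data is usually hard to collect and label.
Context understanding: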
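
The latency challenge listed above becomes easier to reason about once each stage of the pipeline is timed separately. The following is a minimal sketch of such instrumentation. StageTimer is a hypothetical helper, and the sleep calls are stand-ins for the real VAD, STT, and routing steps rather than measurements of them.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.timings_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.timings_ms[name] = self.timings_ms.get(name, 0.0) + elapsed_ms

    def total_ms(self):
        return sum(self.timings_ms.values())

    def slowest(self):
        return max(self.timings_ms, key=self.timings_ms.get)


timer = StageTimer()
with timer.stage("vad"):
    time.sleep(0.01)  # stand-in for local voice activity detection
with timer.stage("stt"):
    time.sleep(0.05)  # stand-in for a network speech-to-text call
with timer.stage("routing"):
    pass              # a dictionary lookup is effectively instantaneous

for name, ms in timer.timings_ms.items():
    print(f"{name}: {ms:.1f} ms")
print(f"total: {timer.total_ms():.1f} ms, slowest stage: {timer.slowest()}")
```

Wrapping each call in process_user_input with such a context manager would show which stage dominates the end-to-end delay before any optimization is attempted.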
