Python中的拓扑数据分析（TDA）：利用持续同调进行特征提取与模型构建

大家好！今天我们来聊聊一个相对新兴但潜力巨大的数据分析领域：拓扑数据分析（Topological Data Analysis，TDA）。我们将重点关注如何利用Python进行TDA，特别是使用持续同调（Persistent Homology）进行特征提取，并将其应用于机器学习模型的构建。

1. 拓扑数据分析（TDA）简介

传统的数据分析方法，例如统计学和机器学习，主要关注数据的统计性质，如均值、方差、相关性等。然而，对于复杂的数据集，这些方法可能无法捕捉到数据内在的“形状”和“连接性”。这就是TDA发挥作用的地方。

TDA的核心思想是利用拓扑学的概念来研究数据的形状。拓扑学关注的是在连续变形下保持不变的性质，例如连通性、孔洞的数量等。TDA将数据视为一个拓扑空间，并通过计算其拓扑特征来描述数据的结构。

TDA的主要优势包括：

对噪声不敏感：拓扑特征在一定程度上对噪声和扰动具有鲁棒性。
无需坐标系：TDA可以处理没有明确坐标系的数据，例如图数据。
高维数据处理：TDA可以有效地处理高维数据，并提取有意义的特征。
无需参数调整：在持续同调计算过程中，不需要调整参数，减少了主观因素的影响。

2. 持续同调（Persistent Homology）

持续同调是TDA中最核心的概念之一。它是一种用于计算拓扑空间在不同尺度下同调群变化的技术。同调群可以理解为描述拓扑空间中不同维度“孔洞”的代数结构。

简单来说，持续同调的工作流程如下：

构建过滤（Filtration）：将原始数据点逐渐连接起来，形成一系列嵌套的拓扑空间。常见的过滤方式包括Čech过滤、Vietoris-Rips过滤等。
计算同调群：对于每个过滤步骤，计算其同调群。
追踪同调类的生命周期：追踪每个同调类（代表一个“孔洞”）在过滤过程中的出现和消失。
生成持续图（Persistence Diagram）：将每个同调类的“出生”时间和“死亡”时间绘制在一个二维平面上，形成持续图。

持续图的解读：

持续图上的每个点代表一个同调类（例如一个孔洞）。
点的x坐标代表同调类的“出生”时间（即它在过滤过程中出现的时间）。
点的y坐标代表同调类的“死亡”时间（即它在过滤过程中消失的时间）。
点离对角线越远，说明这个同调类的生命周期越长，也越重要。

3. Python中的TDA工具：GUDHI

GUDHI (Geometric Understanding in Higher Dimensions) 是一个用 C++ 编写，并提供 Python 接口的 TDA 开源库。它提供了丰富的函数和类，用于构建过滤、计算持续同调、生成持续图等。

GUDHI的安装：

pip install gudhi

一个简单的GUDHI示例：计算点云的持续同调

import gudhi
import numpy as np
import matplotlib.pyplot as plt

# 1. 生成一个随机点云
points = np.random.rand(100, 2)

# 2. 创建一个 Vietoris-Rips 复形
rips_complex = gudhi.RipsComplex(points=points, max_edge_length=0.5)
simplex_tree = rips_complex.create_simplex_tree(max_dimension=2)

# 3. 计算持续同调
diag = simplex_tree.persistence()

# 4. 绘制持续图
gudhi.plot_persistence_diagram(diag)
plt.show()

# 5. 获取所有持久化pair
persistence_intervals = simplex_tree.persistence_intervals_in_dimension(0) # Dimension 0: connected components
persistence_intervals_1 = simplex_tree.persistence_intervals_in_dimension(1) # Dimension 1: loops (holes)

print("Persistence intervals in dimension 0:", persistence_intervals)
print("Persistence intervals in dimension 1:", persistence_intervals_1)

代码解释：

gudhi.RipsComplex(points=points, max_edge_length=0.5): 创建一个Vietoris-Rips复形。points是点云数据，max_edge_length是连接两个点的最大距离。
simplex_tree.create_simplex_tree(max_dimension=2): 从Vietoris-Rips复形构建一个单纯复形树，max_dimension是单纯复形的最高维度。
simplex_tree.persistence(): 计算持续同调。
gudhi.plot_persistence_diagram(diag): 绘制持续图。
simplex_tree.persistence_intervals_in_dimension(0): 获取0维同调群（连通分支）的持久化区间。
simplex_tree.persistence_intervals_in_dimension(1): 获取1维同调群（环/洞）的持久化区间。

4. TDA在特征提取中的应用

持续图本身可以被视为数据的拓扑特征。然而，直接使用持续图作为机器学习模型的输入通常比较困难，因为它是一个点集，而不是一个固定长度的向量。因此，我们需要将持续图转换为数值特征向量。

常见的持续图特征提取方法包括：

持续图统计量：计算持续图上点的统计量，例如点的数量、最长生命周期的点、平均生命周期等。
持续景观（Persistence Landscape）：将持续图转换为一个函数，该函数描述了不同尺度下同调类的“高度”。
持续图像（Persistence Image）：将持续图转换为一个图像，其中每个像素的值代表该位置附近的同调类的密度。
向量化方法： 使用各种核函数将持续图转化为向量。

4.1 持续图统计量

import gudhi
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def persistence_diagram_stats(diagram):
  """
  Calculates statistics from a persistence diagram.

  Args:
    diagram: A persistence diagram (list of (birth, death) tuples).

  Returns:
    A dictionary containing the statistics.
  """
  if not diagram:
    return {"num_points": 0, "max_persistence": 0, "avg_persistence": 0}

  birth_times = [point[0] for point in diagram]
  death_times = [point[1] for point in diagram]
  persistences = [death - birth for birth, death in diagram]

  stats = {
      "num_points": len(diagram),
      "max_persistence": np.max(persistences),
      "avg_persistence": np.mean(persistences)
  }
  return stats

# 示例
points = np.random.rand(50, 2)
rips_complex = gudhi.RipsComplex(points=points, max_edge_length=0.5)
simplex_tree = rips_complex.create_simplex_tree(max_dimension=2)
diag = simplex_tree.persistence()

# 提取0维和1维同调类的持续图
diag_0 = [pair[1] for pair in diag if pair[0] == 0]
diag_1 = [pair[1] for pair in diag if pair[0] == 1]

stats_0 = persistence_diagram_stats(diag_0)
stats_1 = persistence_diagram_stats(diag_1)

print("0-dimensional homology statistics:", stats_0)
print("1-dimensional homology statistics:", stats_1)

4.2 持续景观（Persistence Landscape）

import gudhi
import numpy as np
import matplotlib.pyplot as plt

def plot_persistence_landscape(diagram, resolution=100, max_persistence=None):
  """
  Plots the persistence landscape of a persistence diagram.

  Args:
    diagram: A persistence diagram (list of (birth, death) tuples).
    resolution: The number of points to sample along the x-axis.
    max_persistence: The maximum persistence value to consider. If None, it's
      calculated from the diagram.
  """
  if not diagram:
    print("Empty diagram, cannot plot persistence landscape.")
    return

  birth_times = [point[0] for point in diagram]
  death_times = [point[1] for point in diagram]
  persistences = [death - birth for birth, death in diagram]

  if max_persistence is None:
    max_persistence = np.max(persistences)

  min_birth = np.min(birth_times)
  max_death = np.max(death_times)

  x = np.linspace(min_birth, max_death, resolution)
  landscape = np.zeros_like(x)

  for birth, death in diagram:
    persistence = death - birth
    midpoint = (birth + death) / 2
    height = persistence / 2

    # Calculate the landscape function for this point
    triangle = np.maximum(0, height - np.abs(x - midpoint))
    landscape = np.maximum(landscape, triangle)

  plt.plot(x, landscape)
  plt.xlabel("Parameter")
  plt.ylabel("Persistence Landscape")
  plt.title("Persistence Landscape")
  plt.show()

# 示例
points = np.random.rand(50, 2)
rips_complex = gudhi.RipsComplex(points=points, max_edge_length=0.5)
simplex_tree = rips_complex.create_simplex_tree(max_dimension=2)
diag = simplex_tree.persistence()

# 提取0维和1维同调类的持续图
diag_0 = [pair[1] for pair in diag if pair[0] == 0]
diag_1 = [pair[1] for pair in diag if pair[0] == 1]

plot_persistence_landscape(diag_0)
plot_persistence_landscape(diag_1)

4.3 持续图像（Persistence Image）

import gudhi
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

def persistence_image(diagram, resolution=[20,20], bandwidth=1.0, im_range=None):
    """
    Computes the persistence image of a persistence diagram.

    Args:
        diagram: A persistence diagram (list of (birth, death) tuples).
        resolution: The resolution of the persistence image (width, height).
        bandwidth: The bandwidth of the Gaussian kernel.
        im_range: The range of the persistence image (xmin, xmax, ymin, ymax).  If None, calculated from diagram.

    Returns:
        A 2D numpy array representing the persistence image.
    """
    if not diagram:
        return np.zeros(resolution)

    birth_times = [point[0] for point in diagram]
    death_times = [point[1] for point in diagram]

    if im_range is None:
        xmin = min(birth_times)
        xmax = max(birth_times)
        ymin = min(death_times)
        ymax = max(death_times)
    else:
        xmin, xmax, ymin, ymax = im_range

    width, height = resolution
    image = np.zeros((height, width))

    x = np.linspace(xmin, xmax, width)
    y = np.linspace(ymin, ymax, height)

    for birth, death in diagram:
        for i in range(height):
            for j in range(width):
                # Gaussian kernel centered at (birth, death)
                kernel = np.exp(-((x[j] - birth)**2 + (y[i] - death)**2) / (2 * bandwidth**2))
                image[i, j] += kernel

    return image

# 示例
points = np.random.rand(50, 2)
rips_complex = gudhi.RipsComplex(points=points, max_edge_length=0.5)
simplex_tree = rips_complex.create_simplex_tree(max_dimension=2)
diag = simplex_tree.persistence()

# 提取0维和1维同调类的持续图
diag_0 = [pair[1] for pair in diag if pair[0] == 0]
diag_1 = [pair[1] for pair in diag if pair[0] == 1]

image_0 = persistence_image(diag_0, resolution=[32, 32])
image_1 = persistence_image(diag_1, resolution=[32, 32])

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.imshow(image_0, cmap='viridis', origin='lower')
plt.title("Persistence Image (0-dim)")
plt.colorbar()

plt.subplot(1, 2, 2)
plt.imshow(image_1, cmap='viridis', origin='lower')
plt.title("Persistence Image (1-dim)")
plt.colorbar()

plt.show()

5. TDA与机器学习模型

提取了拓扑特征之后，就可以将其用于训练机器学习模型。

示例：使用持续图统计量训练分类器

import gudhi
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def persistence_diagram_stats(diagram):
    """
    Calculates statistics from a persistence diagram.

    Args:
        diagram: A persistence diagram (list of (birth, death) tuples).

    Returns:
        A dictionary containing the statistics.
    """
    if not diagram:
      return {"num_points": 0, "max_persistence": 0, "avg_persistence": 0}

    birth_times = [point[0] for point in diagram]
    death_times = [point[1] for point in diagram]
    persistences = [death - birth for birth, death in diagram]

    stats = {
        "num_points": len(diagram),
        "max_persistence": np.max(persistences),
        "avg_persistence": np.mean(persistences)
    }
    return stats

def extract_tda_features(points):
    """
    Extracts TDA features from a point cloud.

    Args:
        points: A point cloud (numpy array).

    Returns:
        A numpy array of TDA features.
    """
    rips_complex = gudhi.RipsComplex(points=points, max_edge_length=0.5)
    simplex_tree = rips_complex.create_simplex_tree(max_dimension=2)
    diag = simplex_tree.persistence()

    diag_0 = [pair[1] for pair in diag if pair[0] == 0]
    diag_1 = [pair[1] for pair in diag if pair[0] == 1]

    stats_0 = persistence_diagram_stats(diag_0)
    stats_1 = persistence_diagram_stats(diag_1)

    # Combine features into a single vector
    features = [stats_0["num_points"], stats_0["max_persistence"], stats_0["avg_persistence"],
                stats_1["num_points"], stats_1["max_persistence"], stats_1["avg_persistence"]]
    return np.array(features)

# 1. 生成模拟数据
np.random.seed(42)
num_samples = 100
points_class_0 = np.random.rand(num_samples, 2)
points_class_1 = np.random.rand(num_samples, 2) + 0.5  # Shift class 1 slightly

# Create labels
labels = np.concatenate([np.zeros(num_samples), np.ones(num_samples)])

# Combine data
all_points = np.concatenate([points_class_0, points_class_1])

# 2. 提取TDA特征
tda_features = np.array([extract_tda_features(points) for points in all_points])

# 3. 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(tda_features, labels, test_size=0.3, random_state=42)

# 4. 训练分类器
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# 5. 评估模型
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

代码解释：

extract_tda_features(points): 从点云数据提取TDA特征。
使用train_test_split函数将数据集划分为训练集和测试集。
使用RandomForestClassifier训练分类器。
使用accuracy_score评估模型性能。

6. 更高级的应用

TDA的应用场景非常广泛，包括但不限于：

图像分析：识别图像中的形状和结构。
时间序列分析：检测时间序列中的模式和异常。
材料科学：研究材料的微观结构。
生物学：分析蛋白质结构和基因表达数据。
网络分析：发现网络中的社区结构和重要节点。

7. 总结一下

今天我们学习了拓扑数据分析的基本概念，特别是持续同调。我们还学习了如何使用Python中的GUDHI库计算持续同调，并提取持续图特征。最后，我们展示了如何将拓扑特征应用于机器学习模型的构建。TDA为我们提供了一种全新的视角来理解和分析数据，它能够捕捉到数据内在的形状和连接性，从而帮助我们解决传统方法难以解决的问题。希望今天的讲座能够激发大家对TDA的兴趣，并在实际项目中尝试应用它。

8. 持续探索，不断学习

TDA是一个快速发展的领域，有很多值得探索的方向。希望今天的内容能给大家带来启发，并在实际应用中尝试和创新。欢迎大家继续深入学习，共同推动TDA的发展！

更多IT精英技术系列讲座，到智猿学院

Python中的拓扑数据分析（TDA）：利用持续同调进行特征提取与模型构建

发表回复 取消回复

发表回复取消回复