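Java and Elasticsearch: Shard Allocation Failing? The Disk Watermark Mechanism Explained

Hello, everyone! Today we look at a problem that comes up often when operating an Elasticsearch cluster: shard allocation failures, and one of their most common causes, the disk watermark mechanism. We will cover how disk watermarks work, how to configure them, and how to troubleshoot them, with practical code examples along the way to help you keep your cluster healthy.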

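1. A Brief Review of Shard Allocation in Elasticsearch

Before we get to disk watermarks, let's briefly review how Elasticsearch allocates shards. Elasticsearch splits each index into shards and spreads them across the nodes of the cluster. This distributed design provides high availability and scalability. When a node joins or leaves, Elasticsearch reallocates shards automatically to keep the cluster balanced and the data fully replicated.

Many factors influence shard allocation, and disk space is one of the most important. If a node is running low on disk space, Elasticsearch may refuse to allocate new shards to it in order to protect the data and keep the cluster stable.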

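2. Disk Watermarks: Guardians of the Cluster

Disk watermarks are a protection mechanism built into Elasticsearch that tracks disk usage on every node. When a node's disk usage crosses a configured threshold, Elasticsearch takes action to keep the disk from filling up completely, which helps avoid data loss and node failures.

Elasticsearch defines three watermark levels: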

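low watermark: defaults to 85%. Once disk usage exceeds this threshold, Elasticsearch stops allocating new shards to the node. The one exception is the primary shards of newly created indices, which can still be placed there; their replicas cannot.
high watermark: defaults to 90%. Once disk usage exceeds this threshold, Elasticsearch tries to relocate shards from the node to other nodes with enough free space.
flood stage watermark: defaults to 95%. Once disk usage exceeds this threshold, Elasticsearch places a read-only block (index.blocks.read_only_allow_delete) on every index that has at least one shard on the node. Documents in those indices can no longer be indexed or updated, although searches and index deletion still work.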

All three thresholds are configurable and can be tuned to your needs.
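
To make the three levels concrete, here is a minimal sketch that maps a used-disk percentage to the watermark it has crossed, using the default thresholds listed above. The class and method names (WatermarkLevels, classify) are hypothetical and exist only for illustration; this is not how Elasticsearch implements the check internally.

```java
/** Hypothetical helper: maps a used-disk percentage to the watermark level it has crossed. */
final class WatermarkLevels {

    enum Level { OK, LOW, HIGH, FLOOD_STAGE }

    // Default thresholds from the list above, expressed as used-disk percentages.
    static final double LOW_PERCENT = 85.0;
    static final double HIGH_PERCENT = 90.0;
    static final double FLOOD_STAGE_PERCENT = 95.0;

    /** Returns the highest watermark that the given usage exceeds, or OK if none. */
    static Level classify(double usedPercent) {
        if (usedPercent > FLOOD_STAGE_PERCENT) {
            return Level.FLOOD_STAGE;
        }
        if (usedPercent > HIGH_PERCENT) {
            return Level.HIGH;
        }
        if (usedPercent > LOW_PERCENT) {
            return Level.LOW;
        }
        return Level.OK;
    }

    public static void main(String[] args) {
        for (double used : new double[] {70.0, 87.5, 92.0, 96.3}) {
            System.out.println(used + "% used -> " + classify(used));
        }
    }
}
```

A node at 87.5% would stop receiving new shards, one at 92% would also have shards relocated away, and one at 96.3% would cause its indices to be write-blocked.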

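Why do we need disk watermarks?

Imagine there were no watermarks. When a node ran out of disk space, the following could happen:

Data loss: Elasticsearch can no longer write new data, so incoming documents are lost.
Cluster instability: a full disk can crash the node, which in turn destabilizes the whole cluster.
Degraded performance: a nearly full disk slows down I/O, which hurts both indexing and search speed.

Disk watermarks act like a guard that keeps watch over disk usage and steps in before trouble starts, keeping the cluster healthy.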

3. Configuring Disk Watermarks

Disk watermarks are configured through the cluster settings API. The most commonly used settings are:

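Setting	Description	Default
`cluster.routing.allocation.disk.threshold_enabled`	Whether the disk watermark checks are enabled.	`true`
`cluster.routing.allocation.disk.watermark.low`	Low watermark threshold. A percentage (for example `85%`) refers to used disk space; an absolute size (for example `500GB`) refers to the free space that must remain.	`85%`
`cluster.routing.allocation.disk.watermark.high`	High watermark threshold. Either a percentage or an absolute size.	`90%`
`cluster.routing.allocation.disk.watermark.flood_stage`	Flood stage watermark threshold. Either a percentage or an absolute size.	`95%`
`cluster.routing.allocation.disk.watermark.low.max_headroom`	Caps the free space that a percentage-based low watermark requires. With a cap of `100gb`, a node that is more than 85% full can still receive shards as long as it has more than 100 GB free.	`200GB` when the low watermark is not set explicitly (recent 8.x releases)
`cluster.routing.allocation.disk.watermark.high.max_headroom`	Caps the free space that a percentage-based high watermark requires. With a cap of `50gb`, a node that is more than 90% full is not treated as over the high watermark, and keeps its shards, as long as it has more than 50 GB free.	`150GB` when the high watermark is not set explicitly (recent 8.x releases)
`cluster.info.update.interval`	How often Elasticsearch checks disk usage on each node.	`30s`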
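
The interaction between a percentage watermark and its max_headroom cap is easier to see with numbers. The sketch below computes how much free space a watermark demands on disks of different sizes, assuming the cap simply limits the percentage-derived amount. The class and method names (HeadroomCalculator, requiredFreeBytes) are hypothetical, and the 200 GB cap is used purely as an example value.

```java
/** Hypothetical helper: free space required by a percentage watermark with a max_headroom cap. */
final class HeadroomCalculator {

    static final long GB = 1024L * 1024L * 1024L;

    /**
     * Returns the free bytes the watermark requires: the amount implied by the
     * percentage, but never more than maxHeadroomBytes.
     */
    static long requiredFreeBytes(long totalBytes, double watermarkPercent, long maxHeadroomBytes) {
        long byPercentage = Math.round(totalBytes * (100.0 - watermarkPercent) / 100.0);
        return Math.min(byPercentage, maxHeadroomBytes);
    }

    public static void main(String[] args) {
        long cap = 200L * GB; // example cap, not read from any cluster

        // Small disk: 15% of 500 GB is 75 GB, below the cap, so the percentage decides.
        System.out.println(requiredFreeBytes(500L * GB, 85.0, cap) / GB + " GB required on a 500 GB disk");

        // Large disk: 15% of 10 TB is 1536 GB, so the cap of 200 GB takes over.
        System.out.println(requiredFreeBytes(10240L * GB, 85.0, cap) / GB + " GB required on a 10 TB disk");
    }
}
```

On large disks the cap avoids reserving terabytes of space that a plain 85% rule would otherwise keep empty.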

Configuring disk watermarks with the REST API:

The following request sets the disk watermarks through the Elasticsearch REST API:

```
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "75%",
    "cluster.routing.allocation.disk.watermark.high": "80%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "90%"
  }
}
```

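This request:

Enables the disk watermark checks (if they are not already enabled).
Sets the low watermark to 75%.
Sets the high watermark to 80%.
Sets the flood stage watermark to 90%.

Configuring disk watermarks with the Java API: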

```
import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

import java.io.IOException;

public class DiskWatermarkConfigurator {

    private final RestHighLevelClient client;

    public DiskWatermarkConfigurator(RestHighLevelClient client) {
        this.client = client;
    }

    /** Applies the three watermarks as transient cluster settings; returns true if the cluster acknowledged them. */
    public boolean configureDiskWatermarks(String lowWatermark, String highWatermark, String floodStageWatermark) throws IOException {
        Settings settings = Settings.builder()
                .put("cluster.routing.allocation.disk.threshold_enabled", true)
                .put("cluster.routing.allocation.disk.watermark.low", lowWatermark)
                .put("cluster.routing.allocation.disk.watermark.high", highWatermark)
                .put("cluster.routing.allocation.disk.watermark.flood_stage", floodStageWatermark)
                .build();

        ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
        request.transientSettings(settings);

        ClusterUpdateSettingsResponse response = client.cluster().putSettings(request, RequestOptions.DEFAULT);

        // Report success only if the cluster actually acknowledged the update.
        if (response.isAcknowledged()) {
            System.out.println("Disk watermarks configured successfully.");
        } else {
            System.out.println("Disk watermark update was not acknowledged.");
        }
        return response.isAcknowledged();
    }

    public static void main(String[] args) throws IOException {
        // Assumes a node listening on localhost:9200 over HTTP; replace with your own host and port.
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            DiskWatermarkConfigurator configurator = new DiskWatermarkConfigurator(client);
            configurator.configureDiskWatermarks("75%", "80%", "90%");
        }
    }
}
```

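This code uses the Elasticsearch Java High Level REST Client to update the cluster settings. It builds a Settings object holding the watermark values and sends it to the cluster in a ClusterUpdateSettingsRequest. Keep in mind that the High Level REST Client was deprecated in Elasticsearch 7.15 in favor of the newer Java API Client, so this example applies to the 7.x client.

About transient and persistent settings:

The requests above use transient settings. A transient setting only lasts until the next full cluster restart. After a restart it is lost, and the value falls back to a persistent setting, the elasticsearch.yml configuration, or the default.

To make a setting survive restarts, use a persistent setting instead. Persistent settings are stored in the cluster state and are kept across a full cluster restart. Recent Elasticsearch releases recommend persistent settings over transient ones.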
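
As a rough illustration of the persistent variant, the sketch below builds the same PUT /_cluster/settings request with the body wrapped in "persistent" instead of "transient". It uses only the JDK's java.net.http classes (Java 11 or later) so it stays independent of any Elasticsearch client library. The class name PersistentWatermarkRequest and the localhost URL are assumptions made for the example.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper: builds a persistent disk watermark update without sending it. */
final class PersistentWatermarkRequest {

    /** Builds the JSON body; the arguments are percentage strings such as "75%". */
    static String body(String low, String high, String floodStage) {
        return "{\"persistent\":{"
                + "\"cluster.routing.allocation.disk.watermark.low\":\"" + low + "\","
                + "\"cluster.routing.allocation.disk.watermark.high\":\"" + high + "\","
                + "\"cluster.routing.allocation.disk.watermark.flood_stage\":\"" + floodStage + "\""
                + "}}";
    }

    /** Builds the PUT request against baseUrl, for example "http://localhost:9200". */
    static HttpRequest build(String baseUrl, String low, String high, String floodStage) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body(low, high, floodStage)))
                .build();
    }

    public static void main(String[] args) {
        // The URL is an assumption; point it at one of your own nodes.
        HttpRequest request = build("http://localhost:9200", "75%", "80%", "90%");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body("75%", "80%", "90%"));
        // To send it: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

The only difference from the transient request is the top-level key, which is why switching between the two scopes is a one-word change.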

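Choosing watermark thresholds:

Choosing the right thresholds matters. Set them too low and you trigger unnecessary shard relocations and lose performance. Set them too high and a disk may fill up, leading to data loss and node failures.

In general, tune them to your cluster and your workload. The following ranges are a reasonable starting point:

low watermark: 75% to 85%.
high watermark: 80% to 90%.
flood stage watermark: 90% to 95%.

Also take disk capacity and data growth into account. If your disks are small or your data grows quickly, lower the thresholds accordingly.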
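
Whatever values you choose, the three thresholds have to stay in order: Elasticsearch rejects an update in which the low watermark is above the high watermark, or the high watermark is above the flood stage. A small pre-flight check along these lines can catch the mistake before the request is sent. This is only a sketch for percentage values; the class and method names (WatermarkValidator, parsePercent, isValidOrder) are hypothetical.

```java
/** Hypothetical helper: sanity-checks percentage watermarks before they are sent to the cluster. */
final class WatermarkValidator {

    /** Parses a value such as "85%" into 85.0. Absolute sizes like "500GB" are not handled in this sketch. */
    static double parsePercent(String value) {
        String trimmed = value.trim();
        if (!trimmed.endsWith("%")) {
            throw new IllegalArgumentException("Expected a percentage, got: " + value);
        }
        return Double.parseDouble(trimmed.substring(0, trimmed.length() - 1));
    }

    /** Returns true if 0 <= low <= high <= floodStage <= 100. */
    static boolean isValidOrder(String low, String high, String floodStage) {
        double l = parsePercent(low);
        double h = parsePercent(high);
        double f = parsePercent(floodStage);
        return l >= 0.0 && l <= h && h <= f && f <= 100.0;
    }

    public static void main(String[] args) {
        System.out.println(isValidOrder("75%", "80%", "90%")); // thresholds in ascending order
        System.out.println(isValidOrder("90%", "80%", "95%")); // low above high
    }
}
```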

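4. Troubleshooting and Fixing Shard Allocation Failures

When Elasticsearch cannot allocate shards to a node because of disk usage, the log usually contains a message like the following: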

[WARN ][r.a.a.DiskThresholdMonitor] [node-1] high disk watermark [90%] exceeded on [node-1][...][/path/to/data] free: 50GB[9.5%], shards will be relocated away from this node

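This message says that disk usage on node-1 has exceeded the high watermark (90%), so Elasticsearch will try to relocate shards from it to other nodes.

Troubleshooting steps:

Check the node's disk usage: run df -h or use another disk monitoring tool to confirm that the disk really is short on space.
Check the Elasticsearch logs: read the logs carefully for messages related to the failed allocation to find out why it failed.
Check the cluster settings: retrieve the current disk watermark settings with the following request: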
```
GET /_cluster/settings?include_defaults=true&flat_settings=true
```
Verify that cluster.routing.allocation.disk.watermark.low, cluster.routing.allocation.disk.watermark.high, and cluster.routing.allocation.disk.watermark.flood_stage have the values you expect.
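Check the allocation explanation: send a _cluster/reroute command with the explain parameter to see why a shard cannot be allocated. Adding dry_run=true computes the result without changing the cluster. Note that the older allocate command with allow_primary was removed in Elasticsearch 5.0 and split into allocate_replica, allocate_stale_primary, and allocate_empty_primary; the example uses allocate_replica.
```
POST /_cluster/reroute?explain=true&dry_run=true
{
  "commands": [
    {
      "allocate_replica": {
        "index": "your_index",
        "shard": 0,
        "node": "your_node"
      }
    }
  ]
}
```
Replace your_index, the shard number, and your_node with real values. The response explains each allocation decision in detail, including any decision based on insufficient disk space. For shards that are currently unassigned, the cluster allocation explain API (GET /_cluster/allocation/explain) reports the same kind of decisions.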
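
The first troubleshooting step can also be done from the cluster's point of view with the cat allocation API, for example GET /_cat/allocation?h=node,disk.percent, which lists each node with its used-disk percentage. The sketch below parses that kind of plain-text output and returns the nodes at or above a threshold. The sample output in main is made up for illustration, the class name CatAllocationParser is hypothetical, and the parser assumes the node name comes first and the percentage second.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds nodes with high disk usage in "node disk.percent" text output. */
final class CatAllocationParser {

    /** Returns the names of nodes whose disk.percent is at or above thresholdPercent. */
    static List<String> nodesAbove(String catOutput, int thresholdPercent) {
        List<String> result = new ArrayList<>();
        for (String line : catOutput.split("\\R")) {
            String[] cols = line.trim().split("\\s+");
            // Skip blank lines, header rows, and rows without a numeric percentage (such as UNASSIGNED).
            if (cols.length < 2 || !cols[1].matches("\\d+")) {
                continue;
            }
            if (Integer.parseInt(cols[1]) >= thresholdPercent) {
                result.add(cols[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample output; real output comes from the cat allocation API.
        String sample = "node-1 91\nnode-2 62\nnode-3 86\nUNASSIGNED\n";
        System.out.println("Nodes above the low watermark: " + nodesAbove(sample, 85));
    }
}
```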

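Solutions:

Free up disk space: this is the most direct fix. Delete indices, log files, or other data you no longer need.
Add disk capacity: if nodes run short of space regularly, consider giving them larger disks.
Adjust the watermark thresholds: if the disks still have plenty of room but the watermarks keep triggering, which is common on very large disks, you can raise the thresholds moderately. Make sure the new values still protect the cluster.
Relocate shards: if one node is short on space, move shards to other nodes that have room. You can move them manually with the _cluster/reroute API or let Elasticsearch relocate them automatically.
Disable the disk watermarks (not recommended): in extreme cases you can turn the watermark checks off. This is strongly discouraged because it raises the risk of data loss and node failures. If you must do it, put other safeguards in place to protect the cluster.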

Code example: moving a shard manually

The following example moves a shard manually. It sends a _cluster/reroute request with the move command through the low-level REST client that the Java High Level REST Client wraps:

```
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

import java.io.IOException;

public class ShardMigrator {

    private final RestHighLevelClient client;

    public ShardMigrator(RestHighLevelClient client) {
        this.client = client;
    }

    /** Asks the cluster to move one shard copy from fromNode to toNode; returns true if the request was accepted. */
    public boolean migrateShard(String indexName, int shardId, String fromNode, String toNode) throws IOException {
        // Build the reroute body with a single "move" command.
        Request request = new Request("POST", "/_cluster/reroute");
        request.setJsonEntity(String.format(
                "{\"commands\":[{\"move\":{\"index\":\"%s\",\"shard\":%d,\"from_node\":\"%s\",\"to_node\":\"%s\"}}]}",
                indexName, shardId, fromNode, toNode));

        // performRequest throws a ResponseException (an IOException) if the cluster rejects the command.
        Response response = client.getLowLevelClient().performRequest(request);

        if (response.getStatusLine().getStatusCode() == 200) {
            System.out.println("Shard relocation requested successfully.");
            return true;
        } else {
            System.out.println("Shard relocation request returned: " + response.getStatusLine());
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes a node listening on localhost:9200 over HTTP; replace with your own host and port.
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            ShardMigrator migrator = new ShardMigrator(client);
            boolean success = migrator.migrateShard("your_index", 0, "node-1", "node-2");

            if (success) {
                // Relocation runs in the background; check GET /_cat/recovery or the logs for progress.
                System.out.println("Shard relocation started.");
            } else {
                System.out.println("Shard relocation was not started.");
            }
        }
    }
}
```

This example uses the reroute move command to relocate one shard copy from one node to another. Replace your_index, the shard number, node-1, and node-2 with real values. The move command relocates an existing, started shard copy, so unlike the allocate family of commands it takes no allow_primary flag. An accepted request only means the relocation has started; the copy itself happens in the background.

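5. Preventive Measures

To avoid shard allocation failures in the first place, take the following measures:

Monitor disk usage: check disk usage on every node regularly so that you spot problems early.
Plan disk capacity: when sizing a cluster, account for how fast your data grows and provision disks accordingly.
Clean up data regularly: remove data you no longer need to free up space. Index lifecycle management (ILM) can automate this.
Optimize your indices: leaner mappings take less disk space. For example, use more compact data types and drop fields you do not need.
Set up alerts: configure disk usage alerts so that you are notified as soon as usage crosses a chosen threshold.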
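
As a very small starting point for the monitoring and alerting items above, the sketch below reads the usage of the file system that holds a given path with the JDK's FileStore API and compares it with a threshold. The class name DiskUsageAlert, the default path ".", and the 80% alert threshold are assumptions for the example; a production setup would normally rely on a proper monitoring system instead.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: checks local disk usage against an alert threshold. */
final class DiskUsageAlert {

    /** Returns the used share of the disk as a percentage between 0 and 100. */
    static double usedPercent(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            throw new IllegalArgumentException("totalBytes must be positive");
        }
        return 100.0 * (totalBytes - usableBytes) / totalBytes;
    }

    /** Returns true if usage has reached the threshold. */
    static boolean shouldAlert(double usedPercent, double thresholdPercent) {
        return usedPercent >= thresholdPercent;
    }

    public static void main(String[] args) throws IOException {
        // Pass the Elasticsearch data path as the first argument; "." is only a fallback for the demo.
        Path dataPath = Paths.get(args.length > 0 ? args[0] : ".");
        FileStore store = Files.getFileStore(dataPath);

        double used = usedPercent(store.getTotalSpace(), store.getUsableSpace());
        double threshold = 80.0; // example value, chosen below the default low watermark of 85%

        System.out.printf("Disk usage for %s: %.1f%%%n", dataPath.toAbsolutePath(), used);
        if (shouldAlert(used, threshold)) {
            System.out.println("ALERT: disk usage is at or above " + threshold + "%");
        }
    }
}
```

Alerting a few points below the low watermark leaves time to react before allocation is affected.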

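6. Summary and Review

Today we took a close look at shard allocation failures in Elasticsearch, with a focus on the disk watermark mechanism. We covered how watermarks work, how to configure them, and how to troubleshoot them, along with practical code examples. I hope this helps you keep your cluster healthy and avoid the data loss and outages that a full disk can cause. Remember that prevention beats cure: monitoring disk usage regularly, planning capacity sensibly, and taking preventive measures are the keys to a stable Elasticsearch cluster.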