Elasticsearch `elasticsearch-py`：构建高性能搜索与日志分析 - 智猿学院-前后端，数据库，人工智能，云计算等领域前沿技术讲座

好的，各位观众老爷们，欢迎来到今天的 Elasticsearch 专题讲座！我是你们的导游（兼搬砖工），今天咱们就来聊聊如何用 Python 的 elasticsearch-py 库，打造高性能的搜索和日志分析系统。

开场白：Elasticsearch，你凭什么这么火？

话说这年头，数据量蹭蹭往上涨，想从海量数据里捞点有用的信息，那可不是件容易事。传统的数据库查询，就像大海捞针，捞半天捞不着，捞着了也累个半死。Elasticsearch 这货，就是来拯救咱们于水火的。

它是一款基于 Lucene 的分布式搜索和分析引擎，能快速、近乎实时地存储、搜索和分析海量数据。简单来说，它就像一个超级索引，能把你的数据组织得井井有条，让你嗖嗖嗖地找到想要的东西。

主角登场：elasticsearch-py，Python 的好基友

光有 Elasticsearch 还不行，咱们还得有个趁手的工具来跟它交流。elasticsearch-py 就是 Python 社区为 Elasticsearch 打造的官方客户端。有了它，我们就能用 Python 代码轻松地操作 Elasticsearch，实现各种骚操作。

第一部分：环境搭建，万丈高楼平地起

安装 Elasticsearch：
- 如果你是 Mac 用户，用 Homebrew 一条命令搞定：
```
brew install elasticsearch
brew tap AdoptOpenJDK/openjdk
brew install --cask adoptopenjdk8
```
  启动 Elasticsearch：
```
brew services start elasticsearch
```
- 如果你是 Linux 用户，可以下载 Elasticsearch 的安装包，然后解压、配置、启动。具体步骤可以参考 Elasticsearch 官方文档，这里就不赘述了。
安装 elasticsearch-py：
```
pip install elasticsearch
```
或者，如果你喜欢用 Poetry 管理依赖：
```
poetry add elasticsearch
```

验证安装：

打开 Python 解释器，输入以下代码：

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

if es.ping():
    print("Elasticsearch 连接成功！")
else:
    print("Elasticsearch 连接失败！")

如果输出 "Elasticsearch 连接成功！"，恭喜你，环境搭建成功！

第二部分：基本操作，小试牛刀

连接 Elasticsearch：
```
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
```
这里的 host 和 port 分别是 Elasticsearch 服务器的地址和端口。默认情况下，Elasticsearch 运行在 localhost:9200。

创建索引：

索引是 Elasticsearch 中存储数据的逻辑容器，类似于数据库中的表。

index_name = "my_index"

# 定义索引的 mapping，指定字段类型等
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "content": {"type": "text"},
            "author": {"type": "keyword"},
            "publish_date": {"type": "date", "format": "yyyy-MM-dd"}
        }
    }
}

if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name, body=mapping)
    print(f"索引 '{index_name}' 创建成功！")
else:
    print(f"索引 '{index_name}' 已经存在！")

mapping 定义了索引中每个字段的类型。常用的字段类型包括：

text: 用于全文搜索，会被分词。
keyword: 用于精确匹配，不会被分词。
date: 用于存储日期，可以指定日期格式。
integer, long, float, double: 用于存储数字。
boolean: 用于存储布尔值。

添加文档：

文档是 Elasticsearch 中存储数据的基本单元，类似于数据库中的行。

doc1 = {
    "title": "Elasticsearch 快速入门",
    "content": "本文介绍了 Elasticsearch 的基本概念和使用方法。",
    "author": "张三",
    "publish_date": "2023-10-26"
}

doc2 = {
    "title": "Python Elasticsearch Client",
    "content": "This article demonstrates how to use the Elasticsearch-py client.",
    "author": "李四",
    "publish_date": "2023-10-27"
}

es.index(index=index_name, document=doc1)
es.index(index=index_name, document=doc2)

es.indices.refresh(index=index_name) # 刷新索引，确保文档可以被搜索

print("文档添加成功！")

es.index() 方法用于添加文档。index 参数指定要添加文档的索引，document 参数指定要添加的文档内容。

搜索文档：

# 简单搜索
query = {
    "query": {
        "match": {
            "content": "Elasticsearch"
        }
    }
}

response = es.search(index=index_name, body=query)

print("搜索结果：")
for hit in response['hits']['hits']:
    print(hit['_source'])

# 复杂搜索
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"content": "Elasticsearch"}},
                {"match": {"title": "入门"}}
            ],
            "filter": [
                {"term": {"author": "张三"}}
            ]
        }
    }
}

response = es.search(index=index_name, body=query)

print("n复杂搜索结果：")
for hit in response['hits']['hits']:
    print(hit['_source'])

es.search() 方法用于搜索文档。index 参数指定要搜索的索引，body 参数指定搜索条件。

match 查询：用于全文搜索，会对搜索关键词进行分词。
term 查询：用于精确匹配，不会对搜索关键词进行分词。
bool 查询：用于组合多个查询条件。
- must: 必须满足的条件。
- should: 应该满足的条件（满足的越多，得分越高）。
- must_not: 必须不满足的条件。
- filter: 过滤条件，不参与评分。

更新文档：

doc_id = response['hits']['hits'][0]['_id'] # 获取文档 ID

es.update(index=index_name, id=doc_id, body={"doc": {"content": "Elasticsearch 更新后的内容。"}})

es.indices.refresh(index=index_name) # 刷新索引，确保文档可以被搜索

print("文档更新成功！")

es.update() 方法用于更新文档。index 参数指定要更新的索引，id 参数指定要更新的文档 ID，body 参数指定要更新的内容。

删除文档：

es.delete(index=index_name, id=doc_id)

es.indices.refresh(index=index_name) # 刷新索引，确保文档可以被搜索

print("文档删除成功！")

es.delete() 方法用于删除文档。index 参数指定要删除的索引，id 参数指定要删除的文档 ID。

删除索引：

es.indices.delete(index=index_name)

print("索引删除成功！")

es.indices.delete() 方法用于删除索引。

第三部分：高级用法，更上一层楼

聚合分析：

聚合分析是 Elasticsearch 的强大功能之一，可以对数据进行分组、统计、计算等操作。

query = {
    "size": 0, # 不返回原始文档
    "aggs": {
        "authors": {
            "terms": {
                "field": "author"
            }
        }
    }
}

response = es.search(index=index_name, body=query)

print("作者统计：")
for bucket in response['aggregations']['authors']['buckets']:
    print(f"{bucket['key']}: {bucket['doc_count']}")

这个例子统计了每个作者的文章数量。

size: 0 表示不返回原始文档，只返回聚合结果。
aggs 定义聚合操作。
terms 聚合：用于分组统计。
- field: 指定要分组的字段。

滚动查询：

当数据量非常大时，一次性获取所有数据可能会导致内存溢出。滚动查询可以分批获取数据。

scroll_response = es.search(
    index=index_name,
    scroll='1m', # 设置滚动时间
    size=100,    # 每次返回的文档数量
    body={"query": {"match_all": {}}} # 查询所有文档
)

scroll_id = scroll_response['_scroll_id']

while True:
    hits = scroll_response['hits']['hits']
    if not hits:
        break

    for hit in hits:
        print(hit['_source'])

    scroll_response = es.scroll(scroll_id=scroll_id, scroll='1m') # 继续滚动
    scroll_id = scroll_response['_scroll_id']

scroll='1m' 设置滚动时间为 1 分钟。
size=100 设置每次返回的文档数量为 100。
es.scroll() 方法用于继续滚动。

批量操作：

批量操作可以提高数据处理效率。

from elasticsearch import helpers

actions = [
    {
        "_index": index_name,
        "_source": {
            "title": "批量添加文档 1",
            "content": "批量添加文档的内容 1。",
            "author": "王五",
            "publish_date": "2023-10-28"
        }
    },
    {
        "_index": index_name,
        "_source": {
            "title": "批量添加文档 2",
            "content": "批量添加文档的内容 2。",
            "author": "赵六",
            "publish_date": "2023-10-29"
        }
    }
]

helpers.bulk(es, actions)

es.indices.refresh(index=index_name) # 刷新索引，确保文档可以被搜索

print("批量添加文档成功！")

helpers.bulk() 方法用于批量操作。

第四部分：日志分析实战，让日志说话

Elasticsearch 在日志分析领域应用广泛。我们可以将日志数据导入 Elasticsearch，然后利用 Elasticsearch 的搜索和聚合功能进行分析。

模拟日志数据：

import datetime
import random

log_levels = ["INFO", "WARNING", "ERROR"]
services = ["web", "db", "cache"]

def generate_log_data(num_logs):
    logs = []
    for i in range(num_logs):
        timestamp = datetime.datetime.now() - datetime.timedelta(seconds=random.randint(0, 3600))
        log_level = random.choice(log_levels)
        service = random.choice(services)
        message = f"This is a log message from {service} service with level {log_level}."
        log = {
            "timestamp": timestamp.isoformat(),
            "log_level": log_level,
            "service": service,
            "message": message
        }
        logs.append(log)
    return logs

logs = generate_log_data(100)

# 批量导入日志数据
actions = [
    {
        "_index": "logs",
        "_source": log
    }
    for log in logs
]

helpers.bulk(es, actions)
es.indices.refresh(index="logs")

print("日志数据导入成功！")

搜索特定级别的日志：

query = {
    "query": {
        "term": {
            "log_level": "ERROR"
        }
    }
}

response = es.search(index="logs", body=query)

print("ERROR 级别的日志：")
for hit in response['hits']['hits']:
    print(hit['_source'])

统计每个服务的日志数量：

query = {
    "size": 0,
    "aggs": {
        "services": {
            "terms": {
                "field": "service"
            }
        }
    }
}

response = es.search(index="logs", body=query)

print("服务日志统计：")
for bucket in response['aggregations']['services']['buckets']:
    print(f"{bucket['key']}: {bucket['doc_count']}")

按时间范围搜索日志


import datetime
start_time = datetime.datetime.now() - datetime.timedelta(minutes=10)
end_time = datetime.datetime.now()

query = {
"query": {
"range": {
"timestamp": {
"gte": start_time.isoformat(),
"lte": end_time.isoformat()
}
}
}
}

response = es.search(index="logs", body=query)

print("最近10分钟的日志:")
for hit in response[‘hits’][‘hits’]:
print(hit[‘_source’])



**第五部分：性能优化，让你的搜索飞起来**

1.  **选择合适的字段类型：**

    不同的字段类型有不同的索引方式和存储方式，选择合适的字段类型可以提高搜索效率。例如，对于不需要分词的字段，应该使用 `keyword` 类型。

2.  **合理设置分片数量：**

    分片是 Elasticsearch 中数据水平扩展的基本单元。合理设置分片数量可以提高并发搜索能力。一般来说，每个节点的分片数量应该控制在一定的范围内（例如，每个节点不超过 20 个分片）。

3.  **使用缓存：**

    Elasticsearch 提供了多种缓存机制，可以提高搜索效率。例如，可以开启查询缓存，缓存常用的查询结果。

4.  **优化查询语句：**

    编写高效的查询语句可以减少 Elasticsearch 的计算量，提高搜索效率。例如，尽量避免使用 `wildcard` 查询和 `regexp` 查询，因为这些查询的性能比较差。

**总结：Elasticsearch，你的数据管家**

Elasticsearch 是一款功能强大的搜索和分析引擎，可以帮助我们快速、高效地处理海量数据。`elasticsearch-py` 是 Python 社区为 Elasticsearch 打造的官方客户端，让我们能够用 Python 代码轻松地操作 Elasticsearch。

希望今天的讲座能对大家有所帮助。记住，数据是宝贵的，Elasticsearch 可以帮你挖掘数据的价值！

**结束语：**

感谢各位的观看，如果大家有什么问题，欢迎在评论区留言。下次再见！

发表回复 取消回复

发表回复取消回复