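Building a Distributed Log Analysis Platform with ELK
Technical Background
Log volume in modern distributed systems grows rapidly, and inspecting logs machine by machine can no longer keep up with real-time analysis of large data sets. Operations teams need a solution that aggregates logs from multiple sources, supports full-text search, and provides visual insight. The ELK stack (Elasticsearch, Logstash, Kibana) was built for this purpose; combined with the lightweight shipper Filebeat, it forms a complete enterprise-grade log processing pipeline.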
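Architecture Topology
A typical deployment contains the following roles:
- Collection layer: Filebeat runs on the application servers, reads log files incrementally, and forwards them
- Processing layer: a Logstash cluster cleans the data, extracts fields, and converts formats
- Storage layer: a sharded Elasticsearch cluster provides distributed indexing and search
- Presentation layer: Kibana provides dashboards and ad hoc queries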
Environment Planning
| Node ID | Role | Internal address |
|---|---|---|
| es-master | Master node / data node | 172.16.0.10 |
| es-data | Data node | 172.16.0.11 |
| web-01 | Application server / log shipper | 172.16.0.20 |
Elasticsearch Cluster Deployment
Start with system-level tuning and dependency installation:
# Run on all ES nodes
echo "vm.max_map_count=262144" >> /etc/sysctl.conf
sysctl -p
# Install OpenJDK 11
dnf install java-11-openjdk -y
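Before moving on, it can help to confirm that the kernel setting actually took effect. The following is a minimal Python sketch rather than part of the original procedure: the helper names are hypothetical, and it assumes a Linux host that exposes the value at /proc/sys/vm/max_map_count.

```python
from pathlib import Path

# Minimum value written by the sysctl step above.
REQUIRED_MAX_MAP_COUNT = 262144


def max_map_count_ok(raw_value: str, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Return True when the kernel value meets the required minimum."""
    return int(raw_value.strip()) >= required


def read_max_map_count(path: str = "/proc/sys/vm/max_map_count") -> str:
    """Read the live kernel value (Linux only; the path is an assumption)."""
    return Path(path).read_text()
```

On an ES node, `max_map_count_ok(read_max_map_count())` should return True once `sysctl -p` has been run.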
Create a dedicated service account and the data directories:
groupadd esgroup && useradd esuser -g esgroup
mkdir -p /var/lib/elasticsearch/{data,logs}
chown -R esuser:esgroup /var/lib/elasticsearch
Example core configuration in /etc/elasticsearch/elasticsearch.yml (master node):
cluster.name: log-analytics-prod
node.name: es-master-01
node.roles: [master, data, ingest]
path.data: /var/lib/elasticsearch/data
path.logs: /var/lib/elasticsearch/logs
network.host: 172.16.0.10
http.port: 9200
transport.port: 9300
discovery.seed_hosts: ["172.16.0.10:9300", "172.16.0.11:9300"]
cluster.initial_master_nodes: ["es-master-01"]
# Security and performance parameters
xpack.security.enabled: false
bootstrap.memory_lock: true
indices.memory.index_buffer_size: 30%
Start the service and verify the cluster state:
systemctl enable --now elasticsearch
curl -s 'http://172.16.0.10:9200/_cluster/health?pretty'
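The health response is JSON, so the check can also be scripted. The sketch below is one possible way to do that in Python, assuming the two-node layout from the environment table; the function names are hypothetical, and only the `status`, `number_of_nodes`, and `unassigned_shards` fields of the `_cluster/health` response are used.

```python
import json
from typing import List
from urllib.request import urlopen


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/_cluster/health and decode the JSON body."""
    with urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def health_problems(health: dict, expected_nodes: int) -> List[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        problems.append(f"only {nodes} of {expected_nodes} nodes joined")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} shards are unassigned")
    return problems
```

A call such as `health_problems(fetch_health("http://172.16.0.10:9200"), expected_nodes=2)` would then list anything that needs attention before the remaining components are installed.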
Visualization Plugin Configuration
Deploy the elasticsearch-head management UI (optional):
# Install the Node.js runtime
curl -fsSL https://rpm.nodesource.com/setup_16.x | bash -
yum install -y nodejs
# Fetch the front-end component
cd /opt
git clone https://github.com/mobz/elasticsearch-head.git
cd elasticsearch-head
npm install
npm run start &
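Cross-origin requests (CORS) must be enabled in the ES configuration so that the browser-based UI can reach the cluster: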
http.cors.enabled: true
http.cors.allow-origin: "*"
Log Shipper Configuration
Install Filebeat on the application server in place of a traditional Logstash collector to reduce resource consumption:
rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
cat > /etc/yum.repos.d/elastic.repo << 'EOF'
[elastic-7.x]
name=Elastic repository for 7.x packages
baseurl=https://artifacts.elastic.co/packages/7.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
EOF
yum install filebeat -y
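Filebeat pipeline configuration in /etc/filebeat/filebeat.yml:
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/*.log
      - /var/log/application/*.log
    fields:
      service: web-frontend
      dc: beijing
    # NOTE: this pattern treats every line that does not start with a
    # yyyy-MM-dd date as a continuation of the previous event. nginx access-log
    # lines start with the client address, so they belong in a separate input
    # without multiline settings.
    multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
    multiline.negate: true
    multiline.match: after
output.logstash:
  hosts: ["172.16.0.10:5044", "172.16.0.11:5044"]
  loadbalance: true
processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_docker_metadata: ~
  - drop_fields:
      fields: ["agent.ephemeral_id", "agent.id"]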
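To see what the three multiline settings do to a stream of lines, the grouping rule can be imitated in a few lines of Python. This is only an illustration of the `negate: true` plus `match: after` combination, not Filebeat's implementation; the function name and the sample log lines are made up.

```python
import re
from typing import Iterable, List


def group_multiline(lines: Iterable[str], pattern: str) -> List[str]:
    """Imitate multiline.negate: true with multiline.match: after.

    A line matching the pattern starts a new event; every other line is
    appended to the event that precedes it.
    """
    start_of_event = re.compile(pattern)
    events: List[str] = []
    for line in lines:
        if start_of_event.search(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events


DATE_PREFIX = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}"

app_lines = [
    "2024-01-15 08:30:00 ERROR request failed",
    "Traceback (most recent call last):",
    "  File \"app.py\", line 10, in handle",
    "2024-01-15 08:30:01 INFO recovered",
]
app_events = group_multiline(app_lines, DATE_PREFIX)
```

With these inputs the stack trace stays attached to its ERROR line, giving two events. Lines that never match the date prefix, such as access-log lines that begin with an IP address, would all be folded into a single event by the same rule.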
Setting Up the Processing Layer
Logstash handles the ETL work, with its configuration files organized into functional modules:
# /etc/logstash/conf.d/pipeline.conf
input {
  beats {
    port => 5044
    # No json codec here: Filebeat already sends structured events, and the
    # message field holds plain access-log text that grok parses below.
  }
}
filter {
  if [fields][service] == "web-frontend" {
    grok {
      match => {
        "message" => "%{IPORHOST:client_ip} - %{USER:auth} \[%{HTTPDATE:request_time}\] \"%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}\" %{NUMBER:status} %{NUMBER:bytes} \"%{DATA:referrer}\" \"%{DATA:user_agent}\""
      }
    }
    date {
      match => ["request_time", "dd/MMM/yyyy:HH:mm:ss Z"]
      target => "@timestamp"
    }
    geoip {
      source => "client_ip"
      target => "geo"
    }
    mutate {
      convert => {
        "status" => "integer"
        "bytes" => "integer"
      }
      # Filebeat 7.x reports its version under agent.version
      remove_field => ["message", "[agent][version]"]
    }
  }
}
output {
  if [fields][service] == "web-frontend" {
    elasticsearch {
      hosts => ["http://172.16.0.10:9200", "http://172.16.0.11:9200"]
      index => "logs-web-%{+yyyy.MM.dd}"
      template_name => "web-logs-template"
      template_overwrite => true
    }
  }
}
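The filter chain above (grok, date, mutate) and the daily index name can be approximated in plain Python to check expectations against a sample line. This is a rough sketch, not a reimplementation of Logstash: the regular expression is a simplified stand-in for the grok pattern, the function names are hypothetical, and the sample line is invented.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Simplified equivalent of the grok pattern for the nginx combined log format.
ACCESS_LOG_RE = re.compile(
    r'(?P<client_ip>\S+) - (?P<auth>\S+) \[(?P<request_time>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<status>\d+) (?P<bytes>\d+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_access_line(line: str) -> Optional[dict]:
    """Extract fields, convert types, and derive a UTC @timestamp."""
    match = ACCESS_LOG_RE.match(line)
    if match is None:
        return None  # Logstash would tag such an event with _grokparsefailure
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["bytes"] = int(event["bytes"])
    # Same layout as the date filter's "dd/MMM/yyyy:HH:mm:ss Z"
    local_time = datetime.strptime(event["request_time"], "%d/%b/%Y:%H:%M:%S %z")
    event["@timestamp"] = local_time.astimezone(timezone.utc)
    return event


def daily_index(event: dict, prefix: str = "logs-web-") -> str:
    """Build the index name the way %{+yyyy.MM.dd} does, from the UTC timestamp."""
    return prefix + event["@timestamp"].strftime("%Y.%m.%d")


sample = ('203.0.113.7 - - [15/Jan/2024:06:30:00 +0800] '
          '"GET /index.html?x=1 HTTP/1.1" 200 612 "-" "curl/8.0"')
parsed = parse_access_line(sample)
```

One detail this makes visible: the index date is taken from the UTC timestamp, so a request logged at 06:30 local time in a +0800 zone is written to the index for the previous day.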
Validate the pipeline configuration first, then start the service:
/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/
systemctl enable --now logstash
Deploying the Visualization Platform
The Kibana configuration centers on connectivity to the ES cluster:
# /etc/kibana/kibana.yml
server.port: 5601
server.host: "172.16.0.10"
elasticsearch.hosts: ["http://172.16.0.10:9200", "http://172.16.0.11:9200"]
kibana.index: ".kibana"
i18n.locale: "zh-CN"
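Open http://172.16.0.10:5601 and create an index pattern (for example logs-web-* for the indices written by the Logstash pipeline). Defining index lifecycle management (ILM) policies per business domain is recommended, so that data moves automatically through the hot, warm, and cold phases.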
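As a rough illustration of what such a policy could look like, the sketch below builds a request body for `PUT _ilm/policy/<name>` in Python. The phase thresholds, priorities, and the function name are assumptions chosen for the example, not values from this guide. Actually relocating shards between tiers also depends on nodes carrying tier roles or allocation attributes, which the environment described here does not define.

```python
import json


def build_ilm_policy(warm_after_days: int, cold_after_days: int, delete_after_days: int) -> dict:
    """Build an ILM policy body with hot, warm, cold, and delete phases."""
    if not 0 < warm_after_days < cold_after_days < delete_after_days:
        raise ValueError("phase ages must increase: warm < cold < delete")
    return {
        "policy": {
            "phases": {
                "hot": {
                    "min_age": "0ms",
                    "actions": {"set_priority": {"priority": 100}},
                },
                "warm": {
                    "min_age": f"{warm_after_days}d",
                    "actions": {
                        # Merge segments once the index is no longer written to
                        "forcemerge": {"max_num_segments": 1},
                        "set_priority": {"priority": 50},
                    },
                },
                "cold": {
                    "min_age": f"{cold_after_days}d",
                    "actions": {"set_priority": {"priority": 0}},
                },
                "delete": {
                    "min_age": f"{delete_after_days}d",
                    "actions": {"delete": {}},
                },
            }
        }
    }


# Example thresholds (assumed): warm after 7 days, cold after 30, delete after 90.
policy_json = json.dumps(build_ilm_policy(7, 30, 90), indent=2)
```

The resulting JSON could be pasted into Kibana Dev Tools or sent with curl; the policy then has to be referenced from an index template through `index.lifecycle.name` before it applies to new indices.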
Verification and Tuning
Write a test document to simulate incoming data:
curl -X POST "172.16.0.10:9200/test-metrics/_doc?pretty" -H 'Content-Type: application/json' -d'
{
"service": "payment-api",
"latency_ms": 245,
"status": "success",
"@timestamp": "2024-01-15T08:30:00Z"
}'
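For more than a handful of test documents, the `_bulk` endpoint is usually a better fit than repeated single-document requests. The sketch below only builds the newline-delimited request body; the function name and the second sample document are made up, and sending the body (for example with curl and `Content-Type: application/x-ndjson`) is left out.

```python
import json
from typing import Iterable


def build_bulk_body(index: str, docs: Iterable[dict]) -> str:
    """Return an NDJSON body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"service": "payment-api", "latency_ms": 245, "status": "success",
     "@timestamp": "2024-01-15T08:30:00Z"},
    {"service": "payment-api", "latency_ms": 980, "status": "timeout",
     "@timestamp": "2024-01-15T08:30:05Z"},
]
body = build_bulk_body("test-metrics", docs)
```

Saved to a file, the body could be posted to `172.16.0.10:9200/_bulk` with curl's `--data-binary` option, which preserves the newlines.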
Key monitoring metrics:
- Cluster health: GET _cluster/health
- Indexing rate: GET _stats/indexing
- Node resource usage: GET _nodes/stats/process,os,jvm
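The indexing statistics are cumulative counters, so a rate has to be derived from two samples taken some time apart. The following sketch shows one way to compute it from two `GET _stats/indexing` responses, reading `_all.primaries.indexing.index_total`; the function name and the sample numbers are illustrative only.

```python
def indexing_rate(previous: dict, current: dict, interval_seconds: float) -> float:
    """Documents indexed per second on primary shards between two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    before = previous["_all"]["primaries"]["indexing"]["index_total"]
    after = current["_all"]["primaries"]["indexing"]["index_total"]
    # The counter can drop when indices are deleted; treat that as zero growth.
    return max(after - before, 0) / interval_seconds


first = {"_all": {"primaries": {"indexing": {"index_total": 1000}}}}
second = {"_all": {"primaries": {"indexing": {"index_total": 1600}}}}
rate = indexing_rate(first, second, interval_seconds=30)
```

Primary-shard counts are used here so that replica writes are not counted twice; reading `_all.total` instead would include them.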
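For production, enable TLS to encrypt traffic, configure a snapshot repository for disaster recovery, and use Curator to automate the cleanup of expired indices.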
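The selection logic behind age-based cleanup is simple enough to sketch directly. The Python below picks daily indices older than a retention window based on the date in the index name; it is an illustration of the idea rather than how Curator works internally, and the function name, the 7-day retention, and the sample index list are assumptions.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def expired_indices(names: Iterable[str], today: date, keep_days: int,
                    prefix: str = "logs-web-") -> List[str]:
    """Return indices named <prefix>yyyy.MM.dd that are older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue  # leave unrelated indices such as .kibana alone
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date; skip rather than guess
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


existing = ["logs-web-2024.01.01", "logs-web-2024.01.10",
            "logs-web-2024.01.15", ".kibana", "logs-web-bad"]
to_delete = expired_indices(existing, today=date(2024, 1, 15), keep_days=7)
```

The returned names could then be passed to the delete index API, ideally after printing them for review.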