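CoreDNS Troubleshooting and Optimization in Practice: From Timeouts to High Availability
An In-Depth Postmortem of a Kubernetes DNS Outage
First Symptoms: The Mystery of the Payment Service Timeouts
At 11:42 an alert fired: the payment service was unreachable. A check with kubectl showed every Pod and Service in a healthy state: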
kubectl get pods -n production | grep -E 'payment|order'
payment-svc-8d4f6b7c-x2k9m 1/1 Running 0 45d
order-processor-6c8d9f4-x7q2w 1/1 Running 0 45d
kubectl get svc -n production
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
payment-service ClusterIP 10.102.144.200 <none> 8080/TCP
order-processor ClusterIP 10.102.145.88 <none> 8080/TCP
Further testing showed that connecting directly by IP worked, but DNS resolution failed:
kubectl exec -it payment-svc-8d4f6b7c-x2k9m -n production -- nslookup order-processor.production.svc.cluster.local
;; connection timed out; no servers could be reached
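The same check can be scripted so that a DNS failure is told apart from a connectivity failure. The following is a minimal sketch using only the Python standard library; the helper name `probe_dns` is hypothetical and not part of the original investigation.

```python
import socket
import time


def probe_dns(hostname):
    """Resolve hostname through the system resolver and time the attempt."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
        # Each entry ends with a sockaddr tuple whose first element is the IP.
        addresses = sorted({info[4][0] for info in infos})
        error = None
    except socket.gaierror as exc:
        # Resolution failed: no address was returned for this name.
        addresses, error = [], str(exc)
    return {
        "addresses": addresses,
        "error": error,
        "seconds": time.monotonic() - start,
    }
```

Run inside the affected Pod against `order-processor.production.svc.cluster.local`, a result with an error and a long `seconds` value, combined with a successful connection to the ClusterIP, would point at DNS rather than at the service itself.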
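Root Cause Analysis: Five Critical Problems
- Misconfigured ndots: with the default of 5, any name with fewer than five dots is tried against every search domain first, amplifying DNS query volume by 80%
- No node-level cache: every DNS request had to travel across nodes to reach CoreDNS
- Resource limits too tight: CPU usage reached 97m of a 100m limit, starving goroutines
- Inefficient caching: insufficient memory caused frequent cache evictions
- Fixed replica count: no autoscaling mechanism
Solution: A Six-Step Optimization Strategy
1. Adjust the ndots parameter
Lower ndots from 5 to 2 to reduce the number of DNS queries: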
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: ndots
        value: "2"
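To see why ndots matters, the sketch below mimics how a glibc-style stub resolver walks the search list. It is an illustration, not the resolver's actual code: the function name `candidate_queries` is hypothetical, `api.example.com` is a placeholder external name, and the search domains are the ones kubelet typically writes into `/etc/resolv.conf` for a Pod in the `production` namespace.

```python
def candidate_queries(name, ndots, search):
    """Return the names a glibc-style stub resolver tries, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, try the name as-is last.
    return expanded + [absolute]


# Assumed search list for a Pod in the "production" namespace.
SEARCH = ["production.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for ndots in (5, 2):
    tried = candidate_queries("api.example.com", ndots, SEARCH)
    # An external name only resolves in its absolute form, so every
    # candidate before it is a wasted round trip.
    wasted = tried.index("api.example.com.")
    print(f"ndots={ndots}: {wasted} wasted lookups before the answer")
```

With ndots set to 5 the external name is tried against all three search domains before the absolute form; with ndots set to 2 the absolute form goes first. Each candidate is usually queried for both A and AAAA records, so the real packet count is roughly double.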
2. Deploy NodeLocal DNSCache
Run a local DNS cache on every node:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nodelocaldns
  namespace: kube-system
spec:
  # apps/v1 requires a selector that matches the Pod template labels
  selector:
    matchLabels:
      k8s-app: nodelocaldns
  template:
    metadata:
      labels:
        k8s-app: nodelocaldns
    spec:
      hostNetwork: true
      containers:
        - name: node-cache
          image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.0
          args:
            - "-localip=169.254.20.10"
            - "-conf=/etc/Corefile/Corefile"
          ports:
            - containerPort: 53
              protocol: UDP
            - containerPort: 53
              protocol: TCP
3. Tune the CoreDNS configuration
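apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        # Only the tuned plugins are shown; the kubernetes plugin must stay
        # in this server block for cluster names to resolve.
        cache {
            # success CAPACITY [TTL]: up to 99840 positive answers, cached for at most 30s
            success 99840 30
            # denial CAPACITY [TTL]: negative answers, cached for at most 60s
            denial 9984 60
            # prefetch AMOUNT [DURATION [PERCENTAGE%]]
            prefetch 120 20m 25%
        }
        forward . 8.8.8.8 8.8.4.4 {
            max_concurrent 1000
            prefer_tcp
        }
    }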
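The `success 99840 30` line sets both a capacity and a maximum TTL, which is easy to misread. The sketch below illustrates how the TTL part is understood to behave, assuming the cache plugin's documented minimum TTL default of 5 seconds; `cached_ttl` is a hypothetical helper written for this illustration, not CoreDNS code.

```python
def cached_ttl(record_ttl, max_ttl=30, min_ttl=5):
    """Clamp an upstream record TTL into the cache's [min_ttl, max_ttl] window."""
    return max(min_ttl, min(record_ttl, max_ttl))


# Record TTLs below are arbitrary examples.
for upstream in (300, 10, 1):
    print(f"upstream TTL {upstream}s -> cached for {cached_ttl(upstream)}s")
```

A long-lived record is therefore held for at most 30 seconds, which keeps Service changes visible quickly but sends popular names back upstream more often. That trade-off is what prefetching is meant to soften.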
4. Adjust resource requests and limits
resources:
  requests:
    cpu: "200m"
    memory: "150Mi"
  limits:
    cpu: "500m"
    memory: "250Mi"
5. Deploy the cluster-proportional autoscaler
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-cpa-config
data:
  linear: |
    {"nodesPerReplica": 4, "preventSinglePointFailure": true}
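The effect of these parameters can be estimated before rollout. The following sketch approximates the autoscaler's linear mode as described in its README; `linear_replicas` and the node and core counts are illustrative assumptions, not values from the incident.

```python
import math


def linear_replicas(nodes, cores, nodes_per_replica=None, cores_per_replica=None,
                    min_replicas=1, max_replicas=None,
                    prevent_single_point_failure=False):
    """Approximate the replica count chosen by linear mode."""
    by_nodes = math.ceil(nodes / nodes_per_replica) if nodes_per_replica else 0
    by_cores = math.ceil(cores / cores_per_replica) if cores_per_replica else 0
    # Whichever ratio asks for more replicas wins, subject to the minimum.
    replicas = max(by_nodes, by_cores, min_replicas)
    if prevent_single_point_failure and nodes > 1:
        # Keep at least two replicas once the cluster has more than one node.
        replicas = max(replicas, 2)
    if max_replicas is not None:
        replicas = min(replicas, max_replicas)
    return replicas


# Hypothetical cluster sizes, assuming 8 cores per node.
for nodes in (1, 3, 8, 40):
    count = linear_replicas(nodes, cores=nodes * 8, nodes_per_replica=4,
                            prevent_single_point_failure=True)
    print(f"{nodes} nodes -> {count} CoreDNS replicas")
```

With `nodesPerReplica` set to 4, a 40-node cluster ends up with 10 replicas, while any cluster of two to eight nodes stays at the two-replica floor.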
6. Configure a PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coredns-pdb
  # A PDB only covers Pods in its own namespace
  namespace: kube-system
spec:
  minAvailable: 2
  selector:
    matchLabels:
      k8s-app: kube-dns
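How much room this budget leaves for node drains depends on the replica count. The sketch below shows the arithmetic for an integer `minAvailable`, as a simplified model of what the PDB reports in `status.disruptionsAllowed`; the helper name is hypothetical.

```python
def disruptions_allowed(healthy_pods, min_available):
    """Voluntary evictions permitted right now for an integer minAvailable."""
    return max(0, healthy_pods - min_available)


# Replica counts below are illustrative.
for healthy in (2, 3, 5):
    allowed = disruptions_allowed(healthy, min_available=2)
    print(f"{healthy} healthy CoreDNS Pods -> {allowed} evictions allowed")
```

With exactly two healthy replicas, `minAvailable: 2` permits no voluntary evictions, so a node drain waits until another replica is running. On small clusters where the autoscaler settles at two replicas, that interaction is worth checking before maintenance.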
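Verification and Monitoring
# Measure resolution time
time nslookup google.com
# Check resource usage
kubectl top pods -n kube-system -l k8s-app=kube-dns
# Check cache metrics through a port-forward (the CoreDNS image has no shell or curl to exec into)
kubectl port-forward -n kube-system deploy/coredns 9153:9153 &
sleep 2
curl -s localhost:9153/metrics | grep coredns_cache
kill %1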
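Raw counters are hard to read at a glance, so it can help to turn the metrics text into a single hit ratio. The sketch below parses Prometheus exposition text and assumes the cache plugin exposes `coredns_cache_hits_total` and `coredns_cache_misses_total`; metric names can differ between CoreDNS versions, so check them against your own `/metrics` output. The sample values are made up for illustration.

```python
def cache_hit_ratio(metrics_text):
    """Return hits / (hits + misses) from Prometheus text, or None if empty."""
    hits = misses = 0.0
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        metric = name_and_labels.split("{", 1)[0]
        if metric == "coredns_cache_hits_total":
            hits += float(value)
        elif metric == "coredns_cache_misses_total":
            misses += float(value)
    total = hits + misses
    return hits / total if total else None


# Fabricated sample in the exposition format, for illustration only.
SAMPLE = """\
# TYPE coredns_cache_hits_total counter
coredns_cache_hits_total{server="dns://:53",type="denial",zones="."} 150
coredns_cache_hits_total{server="dns://:53",type="success",zones="."} 750
coredns_cache_misses_total{server="dns://:53",zones="."} 100
"""

print(f"cache hit ratio: {cache_hit_ratio(SAMPLE):.0%}")
```

Because the counters are cumulative since process start, comparing two scrapes taken a few minutes apart gives a more honest picture of current behavior than a single reading.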
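Best-Practice Checklist
- ✓ Set ndots: 2
- ✓ Deploy NodeLocal DNSCache
- ✓ Tune the CoreDNS configuration
- ✓ Set sensible resource requests and limits
- ✓ Deploy the cluster-proportional autoscaler
- ✓ Configure a PodDisruptionBudget
- ✓ Set up monitoring and alerting
- ✓ Verify DNS resolution regularly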