Fixing Prometheus-operator failures when scraping kubelet metrics

Problem description

After deploying with the Prometheus-operator chart, the kubelet scrape job failed to pull metrics on some clusters.
The scrape targets returned a 500 error.

Possible cause

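The kubelet on the cluster's nodes has not been configured to allow its metrics to be scraped: bearer-token (webhook) authentication is not enabled, so the scrape requests from Prometheus fail.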

Solution

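Reference link
On every node, edit the kubelet configuration file at /etc/systemd/system/kubelet.service.d/10-kubeadm.conf.
The commands are as follows. Remember to back up the configuration file before changing it.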

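# Path of the kubeadm drop-in file that holds the kubelet flags
KUBEADM_SYSTEMD_CONF=/etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# Delete any line containing cadvisor-port=0
sed -e "/cadvisor-port=0/d" -i "$KUBEADM_SYSTEMD_CONF"
# Enable bearer-token webhook authentication if it is not already set
if ! grep -q "authentication-token-webhook=true" "$KUBEADM_SYSTEMD_CONF"; then
  sed -e "s/--authorization-mode=Webhook/--authentication-token-webhook=true --authorization-mode=Webhook/" -i "$KUBEADM_SYSTEMD_CONF"
fi
# Reload the systemd units and restart the kubelet so the new flags take effect
systemctl daemon-reload
systemctl restart kubelet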
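
Since the steps above call for a backup before editing, the following is a minimal sketch that wraps the same two edits in a function and keeps a copy of the original file. The function name patch_kubelet_conf and the .bak suffix are assumptions made for illustration, not part of kubeadm, and the sketch assumes GNU sed (for the -i flag).

```shell
#!/usr/bin/env bash
# Sketch only: patch_kubelet_conf is a hypothetical helper, not a kubeadm tool.
# It applies the same sed edits as the commands above to the file given as $1.
patch_kubelet_conf() {
  local conf="$1"

  # Stop early if the file does not exist.
  [ -f "$conf" ] || { echo "not found: $conf" >&2; return 1; }

  # Keep a copy of the untouched file; do not overwrite an existing backup.
  [ -e "$conf.bak" ] || cp -p "$conf" "$conf.bak"

  # Delete any line containing cadvisor-port=0.
  sed -e "/cadvisor-port=0/d" -i "$conf"

  # Add the token webhook flag only when it is missing, so reruns are harmless.
  if ! grep -q "authentication-token-webhook=true" "$conf"; then
    sed -e "s/--authorization-mode=Webhook/--authentication-token-webhook=true --authorization-mode=Webhook/" -i "$conf"
  fi
}
```

On a node this would be called as patch_kubelet_conf /etc/systemd/system/kubelet.service.d/10-kubeadm.conf, followed by systemctl daemon-reload and systemctl restart kubelet. If the result is wrong, the .bak copy can be moved back into place.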

Similar issues

Prometheus fails to scrape metrics from the kube-controller-manager and kube-scheduler components

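This happens because kube-controller-manager and kube-scheduler are configured to bind to 127.0.0.1, so their metrics endpoints cannot be reached from outside the node.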

# Change the bind address from 127.0.0.1 to 0.0.0.0
sed -e "s/- --address=127.0.0.1/- --address=0.0.0.0/" -i /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -e "s/- --address=127.0.0.1/- --address=0.0.0.0/" -i /etc/kubernetes/manifests/kube-scheduler.yaml
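
To confirm what a manifest currently binds to, before or after the change, a small check such as the following may help. It is only a sketch: bind_address_of is a hypothetical helper name, and it assumes the flag appears in the manifest as a list item of the form "- --address=...", as in the sed commands above.

```shell
#!/usr/bin/env bash
# Sketch only: bind_address_of is a hypothetical helper.
# It prints the value of the first "- --address=" argument found in the
# manifest given as $1, or prints nothing if the flag is absent.
bind_address_of() {
  sed -n 's/^[[:space:]]*- --address=\(.*\)$/\1/p' "$1" | head -n 1
}
```

For example, bind_address_of /etc/kubernetes/manifests/kube-scheduler.yaml would be expected to print 0.0.0.0 once the edit has been applied.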

A better solution

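In practice, the cadvisor job that ships with prometheus-operator's kubelet scraping turned out not to collect labels on some nodes.
It can be replaced with the kubernetes-cadvisor job, which scrapes cAdvisor through the API server proxy.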

# cadvisor scrape job
- job_name: kubernetes-cadvisor
  kubernetes_sd_configs:
  - role: node
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
  metric_relabel_configs:
  - action: replace
    source_labels: [id]
    regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
    target_label: rkt_container_name
    replacement: '${2}-${1}'
  - action: replace
    source_labels: [id]
    regex: '^/system\.slice/(.+)\.service$'
    target_label: systemd_service_name
    replacement: '${1}'
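
To make two of the relabel rules in this job concrete, here is a rough shell imitation of what they compute. Both function names are hypothetical, and sed's extended regular expressions are only an approximation of Prometheus's RE2 matching; the sketch is meant for reasoning about the rules, not as a replacement for them.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and merely mimic the rules above.

# Mirrors the __metrics_path__ rule: the node name is substituted for ${1}.
cadvisor_proxy_path() {
  printf '/api/v1/nodes/%s/proxy/metrics/cadvisor\n' "$1"
}

# Mirrors the systemd_service_name rule: prints the service name captured
# from a cgroup id, or prints nothing when the id does not match.
systemd_service_name() {
  printf '%s\n' "$1" | sed -nE 's#^/system\.slice/(.+)\.service$#\1#p'
}
```

The path printed by cadvisor_proxy_path can also be fetched by hand with kubectl get --raw, which is a quick way to check that the API server proxy returns cAdvisor metrics for a given node before pointing Prometheus at it.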