IoT Edge Clusters: Further Configuration of Kubernetes Events Alert Notifications

2025-08-29 12:17:47

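Goals

Previous article:

"Implementing Alert Notifications Based on Kubernetes Events for IoT Edge Clusters"

Alert recovery notifications - evaluated and found infeasible

Reason: an alert and its recovery are separate, completely unrelated events. Alerts are Warning level and recoveries are Normal level, so enabling recovery notifications would cause all Normal events to be sent, and that volume is enormous. Moreover, without considerable experience and patience, it is impossible to tell which Normal event corresponds to the recovery of a given alert.

Continuous alerting while an issue remains unrecovered - a built-in default capability; no extra configuration is needed.
Show resource names in the alert content, such as node and pod names

Allow specific nodes and workloads to be silenced, with the ability to adjust this dynamically

For example, node worker-1 in cluster 001 undergoes planned maintenance; monitoring stops during the maintenance window and resumes once it is complete.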

Configuration

Showing resource names in the alert content

Several typical kinds of events:

apiVersion: v1
count: 101557
eventTime: null
firstTimestamp: "2022-04-08T03:50:47Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{prometheus}
  kind: Pod
  name: prometheus-rancher-monitoring-prometheus-0
  namespace: cattle-monitoring-system
kind: Event
lastTimestamp: "2022-04-14T11:39:19Z"
message: 'Readiness probe failed: Get "http://10.42.0.87:9090/-/ready": context deadline
  exceeded (Client.Timeout exceeded while awaiting headers)'
metadata:
  creationTimestamp: "2022-04-08T03:51:17Z"
  name: prometheus-rancher-monitoring-prometheus-0.16e3cf53f0793344
  namespace: cattle-monitoring-system
reason: Unhealthy
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: master-1
type: Warning

apiVersion: v1
count: 116
eventTime: null
firstTimestamp: "2022-04-13T02:43:26Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{grafana}
  kind: Pod
  name: rancher-monitoring-grafana-57777cc795-2b2x5
  namespace: cattle-monitoring-system
kind: Event
lastTimestamp: "2022-04-14T11:18:56Z"
message: 'Readiness probe failed: Get "http://10.42.0.90:3000/api/health": context
  deadline exceeded (Client.Timeout exceeded while awaiting headers)'
metadata:
  creationTimestamp: "2022-04-14T11:18:57Z"
  name: rancher-monitoring-grafana-57777cc795-2b2x5.16e5548dd2523a13
  namespace: cattle-monitoring-system
reason: Unhealthy
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: master-1
type: Warning

apiVersion: v1
count: 20958
eventTime: null
firstTimestamp: "2022-04-11T10:34:51Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{lb-port-1883}
  kind: Pod
  name: svclb-emqx-dt22t
  namespace: emqx
kind: Event
lastTimestamp: "2022-04-14T11:39:48Z"
message: Back-off restarting failed container
metadata:
  creationTimestamp: "2022-04-11T10:34:51Z"
  name: svclb-emqx-dt22t.16e4d11e2b9efd27
  namespace: emqx
reason: BackOff
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: worker-1
type: Warning

apiVersion: v1
count: 21069
eventTime: null
firstTimestamp: "2022-04-11T10:34:48Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{lb-port-80}
  kind: Pod
  name: svclb-traefik-r5p8t
  namespace: kube-system
kind: Event
lastTimestamp: "2022-04-14T11:44:59Z"
message: Back-off restarting failed container
metadata:
  creationTimestamp: "2022-04-11T10:34:48Z"
  name: svclb-traefik-r5p8t.16e4d11daf0b79ce
  namespace: kube-system
reason: BackOff
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: worker-1
type: Warning

{
  "metadata": {
    "name": "event-exporter-79544df9f7-xj4t5.16e5c540dc32614f",
    "namespace": "monitoring",
    "uid": "baf2f642-2383-4e22-87e0-456b6c3eaf4e",
    "resourceVersion": "14043444",
    "creationTimestamp": "2022-04-14T13:08:40Z"
  },
  "reason": "Pulled",
  "message": "Container image \"ghcr.io/opsgenie/kubernetes-event-exporter:v0.11\" already present on machine",
  "source": {
    "component": "kubelet",
    "host": "worker-2"
  },
  "firstTimestamp": "2022-04-14T13:08:40Z",
  "lastTimestamp": "2022-04-14T13:08:40Z",
  "count": 1,
  "type": "Normal",
  "eventTime": null,
  "reportingComponent": "",
  "reportingInstance": "",
  "involvedObject": {
    "kind": "Pod",
    "namespace": "monitoring",
    "name": "event-exporter-79544df9f7-xj4t5",
    "uid": "b77d3e13-fa9e-484b-8a5a-d1afc9edec75",
    "apiVersion": "v1",
    "resourceVersion": "14043435",
    "fieldPath": "spec.containers{event-exporter}",
    "labels": {
      "app": "event-exporter",
      "pod-template-hash": "79544df9f7",
      "version": "v1"
    }
  }
}

We can add more fields to the alert message, including:

Node: {{ .Source.Host }}
Pod: {{ .InvolvedObject.Name }}
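
To make the field mapping concrete, here is a minimal Python sketch that pulls the same values out of an event represented as a plain dictionary, shaped like the JSON sample above. It is an illustration only: `resource_names` is a hypothetical helper, not part of the exporter, which does this through the Go template expressions just listed.

```python
# Hypothetical helper: extract the node and resource name from a Kubernetes
# Event given as a plain dict (same shape as the JSON sample above).
def resource_names(event: dict) -> dict:
    source = event.get("source") or {}
    involved = event.get("involvedObject") or {}
    return {
        "node": source.get("host", ""),    # corresponds to {{ .Source.Host }}
        "kind": involved.get("kind", ""),  # corresponds to {{ .InvolvedObject.Kind }}
        "name": involved.get("name", ""),  # corresponds to {{ .InvolvedObject.Name }}
    }


# Trimmed-down copy of one of the BackOff events shown earlier.
sample_event = {
    "reason": "BackOff",
    "type": "Warning",
    "source": {"component": "kubelet", "host": "worker-1"},
    "involvedObject": {
        "kind": "Pod",
        "namespace": "emqx",
        "name": "svclb-emqx-dt22t",
    },
}

print(resource_names(sample_event))
```

Missing keys fall back to empty strings, so an event without a `source.host` still yields a usable (if incomplete) result.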

Putting this together, the modified event-exporter-cfg YAML is as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: monitoring
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"
        - drop:
            - type: "Normal"
          match:
            - receiver: "feishu"
    receivers:
      - name: "dump"
        stdout: {}
      - name: "feishu"
        webhook:
          endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
          headers:
            Content-Type: application/json
          layout:
            msg_type: interactive
            card:
              config:
                wide_screen_mode: true
                enable_forward: true
              header:
                title:
                  tag: plain_text
                  content: xxx test K3S cluster alerts
                template: red
              elements:
                - tag: div
                  text:
                    tag: lark_md
                    content: "**EventID:**  {{ .UID }}\n**EventNamespace:**  {{ .InvolvedObject.Namespace }}\n**EventName:**  {{ .InvolvedObject.Name }}\n**EventType:**  {{ .Type }}\n**EventKind:**  {{ .InvolvedObject.Kind }}\n**EventReason:**  {{ .Reason }}\n**EventTime:**  {{ .LastTimestamp }}\n**EventMessage:**  {{ .Message }}\n**EventComponent:**  {{ .Source.Component }}\n**EventHost:**  {{ .Source.Host }}\n**EventLabels:**  {{ toJson .InvolvedObject.Labels}}\n**EventAnnotations:**  {{ toJson .InvolvedObject.Annotations}}"
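
For orientation, the `layout` section above corresponds roughly to the JSON body sketched below once the template placeholders are filled in. This Python snippet is an illustration under stated assumptions: `build_card` is a hypothetical helper, only a few of the template fields are included, and the exact body the exporter sends may differ.

```python
import json


# Hypothetical helper mirroring the `layout` section of the ConfigMap above.
# Only a subset of the template fields is rendered to keep the sketch short.
def build_card(event: dict, title: str) -> dict:
    involved = event.get("involvedObject", {})
    source = event.get("source", {})
    lines = [
        f"**EventNamespace:**  {involved.get('namespace', '')}",
        f"**EventName:**  {involved.get('name', '')}",
        f"**EventReason:**  {event.get('reason', '')}",
        f"**EventHost:**  {source.get('host', '')}",
    ]
    return {
        "msg_type": "interactive",
        "card": {
            "config": {"wide_screen_mode": True, "enable_forward": True},
            "header": {
                "title": {"tag": "plain_text", "content": title},
                "template": "red",
            },
            "elements": [
                {
                    "tag": "div",
                    "text": {"tag": "lark_md", "content": "\n".join(lines)},
                }
            ],
        },
    }


event = {
    "reason": "BackOff",
    "source": {"component": "kubelet", "host": "worker-1"},
    "involvedObject": {"kind": "Pod", "namespace": "emqx", "name": "svclb-emqx-dt22t"},
}

card = build_card(event, "K3S cluster alert")
print(json.dumps(card, indent=2))
```

Printing the payload this way is a convenient check that the node and pod names end up in the card text before touching the real webhook.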

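Silencing specific nodes and workloads

For example, node worker-1 in cluster 001 undergoes planned maintenance; monitoring stops during the maintenance window and resumes once it is complete.

Continue modifying the event-exporter-cfg YAML as follows: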

apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: monitoring
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"
        - drop:
            - type: "Normal"
            - source:
                host: "worker-1"
            - namespace: "cattle-monitoring-system"
            - name: "*emqx*"
            - kind: "Pod|Deployment|ReplicaSet"
            - labels:
                version: "dev"
          match:
            - receiver: "feishu"
    receivers:
      - name: "dump"
        stdout: {}
      - name: "feishu"
        webhook:
          endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
          headers:
            Content-Type: application/json
          layout:
            msg_type: interactive
            card:
              config:
                wide_screen_mode: true
                enable_forward: true
              header:
                title:
                  tag: plain_text
                  content: xxx test K3S cluster alerts
                template: red
              elements:
                - tag: div
                  text:
                    tag: lark_md
                    content: "**EventID:**  {{ .UID }}\n**EventNamespace:**  {{ .InvolvedObject.Namespace }}\n**EventName:**  {{ .InvolvedObject.Name }}\n**EventType:**  {{ .Type }}\n**EventKind:**  {{ .InvolvedObject.Kind }}\n**EventReason:**  {{ .Reason }}\n**EventTime:**  {{ .LastTimestamp }}\n**EventMessage:**  {{ .Message }}\n**EventComponent:**  {{ .Source.Component }}\n**EventHost:**  {{ .Source.Host }}\n**EventLabels:**  {{ toJson .InvolvedObject.Labels}}\n**EventAnnotations:**  {{ toJson .InvolvedObject.Annotations}}"

The default drop rule is - type: "Normal", which means no alerts are sent for Normal-level events.

We now add the following rules:

            - source:
                host: "worker-1"
            - namespace: "cattle-monitoring-system"
            - name: "*emqx*"
            - kind: "Pod|Deployment|ReplicaSet"
            - labels:
                version: "dev"

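... host: "worker-1": do not alert on node worker-1;
... namespace: "cattle-monitoring-system": do not alert on the namespace cattle-monitoring-system;
... name: "*emqx*": do not alert on objects whose name (usually the pod name) contains emqx;
... kind: "Pod|Deployment|ReplicaSet": do not alert on Pods, Deployments, or ReplicaSets (that is, ignore alerts related to applications and components);
... version: "dev": do not alert on objects carrying the label version: "dev" (this can be used to silence alerts for a specific application).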
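
The behavior described by this list (an event is discarded as soon as any one drop entry matches it) can be modeled in a few lines of Python. This is a simplified model, not the exporter's own code: `rule_matches` and `is_dropped` are hypothetical names, the `source.host` rule is flattened to a `host` key, values are compared with full regular-expression matches as an assumption of this sketch, and the `name` and `kind` rules are left out for brevity.

```python
import re


def rule_matches(rule: dict, event: dict) -> bool:
    """A rule matches only if every field it sets matches the event."""
    involved = event.get("involvedObject", {})
    values = {
        "type": event.get("type", ""),
        "host": event.get("source", {}).get("host", ""),
        "namespace": involved.get("namespace", ""),
        "kind": involved.get("kind", ""),
    }
    for field, pattern in rule.items():
        if field == "labels":
            labels = involved.get("labels", {})
            # Every listed label must be present with a matching value.
            if any(not re.fullmatch(p, labels.get(k, "")) for k, p in pattern.items()):
                return False
        elif not re.fullmatch(pattern, values.get(field, "")):
            return False
    return True


def is_dropped(drop_rules: list, event: dict) -> bool:
    """An event is dropped if at least one drop rule matches it."""
    return any(rule_matches(rule, event) for rule in drop_rules)


DROP_RULES = [
    {"type": "Normal"},
    {"host": "worker-1"},
    {"namespace": "cattle-monitoring-system"},
    {"labels": {"version": "dev"}},
]

on_worker_1 = {
    "type": "Warning",
    "source": {"host": "worker-1"},
    "involvedObject": {"kind": "Pod", "namespace": "emqx", "name": "svclb-emqx-dt22t"},
}
on_master_1 = {
    "type": "Warning",
    "source": {"host": "master-1"},
    "involvedObject": {"kind": "Node", "namespace": "", "name": "master-1"},
}

print(is_dropped(DROP_RULES, on_worker_1))
print(is_dropped(DROP_RULES, on_master_1))
```

Because the rules are independent, ending the maintenance window for worker-1 amounts to removing the single `host` entry; the other rules are unaffected.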

Final result

As shown in the figure:

[Figure: the resulting Feishu alert card]
