
prometheus+grafana+alertmanager Installation and Configuration Guide


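1. Overview of the components to install:

  • prometheus:
    • Server-side daemon. It pulls the metrics (monitoring data) collected by the exporters on each endpoint and stores them in its built-in time series database (TSDB). Data is retained for 15 days by default; the retention period can be changed with a startup flag.
    • The Prometheus project provides many official exporters.
    • Listens on port 9090 by default and serves a web UI for graphing and queries, as well as an HTTP query API.
    • Evaluates alerting rules (which you must write yourself) and sends firing alerts to alertmanager, which delivers them through its configured notification channels.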
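  • grafana:
    • Prometheus's built-in web UI is very basic, so grafana is used for dashboards.
    • grafana is dedicated visualization software that supports many data sources; Prometheus is only one of them.
    • It has built-in alerting, and alert rules can be configured directly on a panel. However, such rules do not support template variables (the variables a dashboard defines for convenient display), so every metric on every host needs its own rule, which makes the feature impractical.
    • Default listening port: 3000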
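  • node_exporter:
    • Agent side. One of the many official Prometheus exporters, installed on every monitored host.
    • Collects host and system information such as cpu, mem, disk, network, filesystem and many other basic metrics, and publishes them over HTTP for the Prometheus server to scrape.
    • Default listening port: 9100
  • cadvisor:
    • Agent side, installed on docker hosts. Collects runtime data for the host and its docker containers.
    • Runs as a container itself and listens on port 8080 (the published host port is configurable, and mapping it to a different port is recommended).
    • Provides a basic graph page as well as a metrics endpoint for scraping.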
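  • alertmanager:
    • Receives alerts sent by Prometheus, groups them by rule, and controls how they are delivered (notification frequency, inhibition, routing to different notification backends, silences, and so on).
    • Supports many notification backends, such as email, webhook, wechat (WeChat Work), and several commercial alerting platforms.
    • Default listening port: 9093
  • blackbox_exporter:
    • One of the official Prometheus exporters. It collects monitoring data by probing over http, dns, tcp and icmp.
    • Can run on the Prometheus server node or on a separate node.
    • Default listening port: 9115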
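  • nginx:
    • Prometheus and alertmanager have no built-in authentication, so nginx sits in front of them to provide basic authentication and https.
    • Every component above exposes its own port. In the docker-compose deployment the containers are therefore placed on the same network, and nginx alone publishes the external entry ports, which keeps management simple.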

2. prometheus-server

2.1 Official links:

  • Official documentation: https://prometheus.io/docs/introduction/overview/

  • GitHub project and downloads: https://github.com/prometheus/prometheus

2.2 Installing the Prometheus server

2.2.1 Download and install on Linux (CentOS 7)

  • Create a system user to run the Prometheus server process, with /var/lib/prometheus as its home directory and data directory:
~]# useradd -r -m -d /var/lib/prometheus prometheus
  • Download and install the Prometheus server, using 2.14.0 as an example:
 wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
 tar -xf prometheus-2.14.0.linux-amd64.tar.gz  -C /usr/local/
 cd /usr/local
 ln -sv prometheus-2.14.0.linux-amd64 prometheus
  • Create a unit file so that systemd manages Prometheus:
 vim /usr/lib/systemd/system/prometheus.service
 [Unit]
 Description=The Prometheus 2 monitoring system and time series database.
 Documentation=https://prometheus.io
 After=network.target
 [Service]
 EnvironmentFile=-/etc/sysconfig/prometheus
 User=prometheus
 ExecStart=/usr/local/prometheus/prometheus \
 --storage.tsdb.path=/var/lib/prometheus \
 --config.file=/usr/local/prometheus/prometheus.yml \
 --web.listen-address=0.0.0.0:9090 \
 --web.external-url= $PROM_EXTRA_ARGS
 Restart=on-failure
 StartLimitInterval=1
 RestartSec=3
 [Install]
 WantedBy=multi-user.target
  • Other runtime flags: ./prometheus --help

  • Start the service:

systemctl daemon-reload
systemctl start prometheus.service
  • Remember to open the firewall port:
iptables -I INPUT -p tcp --dport 9090 -s NETWORK/MASK -j ACCEPT
  • Access from a browser:
http://IP:PORT

2.2.2 Installing with docker:

  • image: prom/prometheus
  • Start command (older docker versions reject relative host paths in -v, hence $(pwd)):
$ docker run --name prometheus -d -v "$(pwd)/prometheus":/etc/prometheus/ -v "$(pwd)/db/":/prometheus -p 9090:9090 prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.listen-address="0.0.0.0:9090" --storage.tsdb.path=/prometheus --web.console.libraries=/usr/share/prometheus/console_libraries --web.console.templates=/usr/share/prometheus/consoles --storage.tsdb.retention=30d

2.3 Prometheus configuration:

2.3.1 Startup flags

  • Common startup flags:
--config.file=/etc/prometheus/prometheus.yml # main configuration file
--web.listen-address="0.0.0.0:9090"     # listen address and port
--storage.tsdb.path=/prometheus     # data directory
--web.console.libraries=/usr/share/prometheus/console_libraries
--web.console.templates=/usr/share/prometheus/consoles  # console libraries and templates
--storage.tsdb.retention=60d  # data retention, default 15d (newer releases prefer --storage.tsdb.retention.time)
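2.3.2 Configuration files:

  • The main Prometheus configuration file is prometheus.yml.

    It consists mainly of the global, rule_files, scrape_configs, alerting, remote_write and remote_read sections:

 - global: global settings;

 - rule_files: paths of the alerting rule files;

 - scrape_configs:
    The set of scrape configurations. It defines the monitored targets and the parameters that describe how to scrape their metrics;
    normally each scrape configuration corresponds to one job,
    and the targets of a job are either listed statically (static_configs) or discovered automatically through one of the service discovery mechanisms Prometheus supports;
  - job_name: 'nodes'
    static_configs:    # static list; each host:port is scraped at host:port/metrics
      - targets: ['localhost:9100']
      - targets: ['172.20.94.1:9100']
  - job_name: 'docker_host'
    file_sd_configs:   # file-based service discovery; each host:port defined in the files (yml or json) is scraped at host:port/metrics
      - files:
          - ./sd_files/docker_host.yml
        refresh_interval: 30s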
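  • alerting (alertmanagers):

The set of Alertmanager instances Prometheus can use, and the parameters for talking to them;

each Alertmanager is either listed statically (static_configs) or discovered automatically through a service discovery mechanism Prometheus supports;

  • remote_write:
Configures "remote write". Define this section when Prometheus must store data in an external storage system (for example InfluxDB);
Prometheus then sends samples over HTTP to the adapter specified by the URL;
  • remote_read:
Configures "remote read". Prometheus hands incoming queries to the adapter specified by the URL;
the adapter translates the query into one the remote storage understands and converts the response into a format Prometheus can use;
  • Alerting rule files: *.yml
    • Define the alerting rules
    • Only take effect when listed under rule_files: in the main configuration file
 rule_files:
   - "test_rules.yml"  # path of the alerting rule file
  • Service discovery target files: yaml and json are both supported
    • These must also be referenced from the main configuration file
    file_sd_configs:
    - files:
        - ./sd_files/http.yml
      refresh_interval: 30s
    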

2.3.3 A simple configuration example:

  • prometheus.yml example
global:
  scrape_interval: 15s       # scrape metrics every 15 seconds
  evaluation_interval: 15s   # evaluate alerting rules every 15 seconds
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]   # where alerts are pushed, normally the Alertmanager address
rule_files:
  - "test_rules.yml"   # path of the alerting rule file
scrape_configs:
  - job_name: 'node'   # job name of your own choosing
    static_configs:    # static configuration: list the ip:port to scrape directly
      - targets: ['localhost:9100']
  - job_name: 'CDG-MS'
    honor_labels: true
    metrics_path: '/prometheus'
    static_configs:
      - targets: ['localhost:8089']
    relabel_configs:
      - target_label: env
        replacement: dev
  - job_name: 'eureka'
    file_sd_configs:   # file-based service discovery
      - files:
          - "/app/enmonster/basic/prometheus/prometheus-2.2.1.linux-amd64/eureka.json"   # json and yml are both supported
        refresh_interval: 30s   # the file is re-read every 30s, so no manual reload is needed after editing it
    relabel_configs:
      - source_labels: [__job_name__]
        regex: (.*)
        target_label: job
        replacement: ${1}
      - target_label: env
        replacement: dev
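
The relabel_configs entries in the example use the default `replace` action. As a rough illustration of what that action does, here is a minimal Python sketch; `relabel_replace` is a hypothetical helper written for this guide, not part of Prometheus, and it ignores details such as dropping a label whose new value is empty.

```python
import re


def relabel_replace(labels, target_label, replacement="$1",
                    source_labels=(), regex="(.*)", separator=";"):
    """Apply one 'replace' relabel rule and return a new label dict."""
    # Source label values are joined with the separator; missing labels count as "".
    value = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, so fullmatch is used here.
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)  # no match: labels stay untouched
    # Translate $1 / ${1} references into Python's \g<1> syntax.
    template = re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


# Equivalent of "- target_label: env / replacement: dev"
print(relabel_replace({"job": "CDG-MS"}, "env", "dev"))
```

With no source_labels the joined value is the empty string, which the default regex `(.*)` still matches, so the rule simply attaches a fixed `env` label to every target of the job.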
  • Alerting rule file example:

    [root@host40 monitor-bak]# cat prometheus/rules/docker_monitor.yml
    groups:
      - name: "container monitor"
        rules:
          - alert: "Container down: env1"
            expr: time() - container_last_seen{name="env1"} > 60
            for: 30s
            labels:
              severity: critical
            annotations:
              summary: "Container down: {{$labels.instance}} name={{$labels.name}}"


  • File-based service discovery target files: *.yml
    [root@host40 monitor]# cat prometheus/sd_files/virtual_lan.yml 
    - targets: ['10.10.11.179:9100']
    - targets: ['10.10.11.178:9100']
    
    [root@host40 monitor]# cat prometheus/sd_files/tcp.yml 
    - targets: ['10.10.11.178:8001']
      labels:
        server_name: http_download
    - targets: ['10.10.11.178:3307']
      labels:
        server_name: xiaojing_db
    - targets: ['10.10.11.178:3001']
      labels:
        server_name: test_web
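
Because Prometheus re-reads these files on its own, they are a convenient place for scripts to publish targets. The sketch below writes the JSON form of such a file; `write_file_sd` and the file name are hypothetical, and the write goes through a temporary file so that Prometheus should not pick up a half-written file.

```python
import json
import os
import tempfile


def write_file_sd(path, groups):
    """Atomically write file_sd target groups as JSON.

    groups: iterable of (targets, labels) pairs.
    """
    payload = [{"targets": list(t), "labels": dict(l)} for t, l in groups]
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename over the target.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(payload, fh, indent=2)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the prometheus user must be able to read it
    os.replace(tmp, path)
    return payload
```

Calling `write_file_sd("sd_files/tcp.json", [(["10.10.11.178:8001"], {"server_name": "http_download"})])` would produce the JSON equivalent of the first entry in tcp.yml above.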
    
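  2.3.5 Other configuration

    • Many Prometheus settings are coupled with other components, so they are covered together with the component concerned.

    2.4 prometheus web-gui

    • Web UI address: http://ip:port, for example http://10.10.11.40:9090/
    • alerts: view the alerting rules
    • graph: query the collected metrics, with simple graphing
    • status: Prometheus runtime configuration and information about the monitored hosts
    • See the web-gui pages themselves for details

    3. node_exporter

    3.1 Introduction

    • node_exporter is installed on each monitored node. It collects host monitoring data and serves it over HTTP for Prometheus to scrape.

    • Project and documentation: https://github.com/prometheus/node_exporter

    • The Prometheus project provides many kinds of exporters. The list is at: https://prometheus.io/docs/instrumenting/exporters/

    3.2 Installing node_exporter

    3.2.1 Download and install on Linux (CentOS 7):

    • Download and extract

      wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
      tar xf node_exporter-0.18.1.linux-amd64.tar.gz -C /usr/local/
      cd /usr/local
      ln -sv node_exporter-0.18.1.linux-amd64/ node_exporter
      
    • Create the user:
      useradd -r -m -d /var/lib/prometheus prometheus
      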
    • Configure the unit file:
      vim /usr/lib/systemd/system/node_exporter.service
      [Unit]
      Description=Prometheus exporter for machine metrics, written in Go with pluggable metric collectors.
      Documentation=https://github.com/prometheus/node_exporter
      After=network.target
      [Service]
      EnvironmentFile=-/etc/sysconfig/node_exporter
      User=prometheus
      ExecStart=/usr/local/node_exporter/node_exporter \
      $NODE_EXPORTER_OPTS
      Restart=on-failure
      StartLimitInterval=1
      RestartSec=3
      [Install]
      WantedBy=multi-user.target 
      
    • Start the service:
      systemctl daemon-reload
      systemctl start node_exporter.service
      
    • Check manually that metrics can be fetched:
      curl http://localhost:9100/metrics
      
    • Open the firewall:
      iptables -I INPUT -p tcp --dport 9100 -s NET/MASK -j ACCEPT
      

    3.2.2 Installing with docker

    • image: quay.io/prometheus/node-exporter, prom/node-exporter

    • Start command:

      docker run -d --net="host" --pid="host" -v "/:/host:ro,rslave" --name monitor-node-exporter --restart always quay.io/prometheus/node-exporter --path.rootfs=/host --web.listen-address=:9100
      
    • Some older docker versions fail with: Error response from daemon: linux mounts: Could not find source mount of /

      Fix: -v "/:/host:ro,rslave" -> -v "/:/host:ro"

    3.3 Configuring node_exporter

    • Enabling and disabling collectors:

      ./node_exporter --help  # lists all supported collectors; enable or disable each one as needed
      
      For example, --no-collector.cpu stops collecting cpu information
      
    • Textfile Collector:
      The startup flag --collector.textfile.directory="DIR" enables the textfile collector.
      It reads the metrics from every *.prom file in that directory; the metrics must be in the Prometheus text exposition format.
      

      Example:

      echo my_batch_job_completion_time $(date +%s) > /path/to/directory/my_batch_job.prom.$$
      mv /path/to/directory/my_batch_job.prom.$$ /path/to/directory/my_batch_job.prom            
      echo 'role{role="application_server"} 1' > /path/to/directory/role.prom.$$
      mv /path/to/directory/role.prom.$$ /path/to/directory/role.prom    
      # other sample lines in the same format:
      # rpc_duration_seconds{quantile="0.5"} 4773
      # http_request_duration_seconds_bucket{le="0.5"} 129389
      

      In other words, when node_exporter does not collect a metric you need, a script can gather it and write it to a file, and node_exporter exposes it for Prometheus to scrape.
      This removes the need for a pushgateway.
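
The same write-then-rename pattern as the shell example can be scripted. The following is a minimal Python sketch under the assumption that the script may write to the textfile directory; `write_prom_file` is a hypothetical helper, and it does no escaping of label values.

```python
import os
import tempfile


def write_prom_file(directory, name, metrics):
    """Write metrics to <directory>/<name>.prom atomically.

    metrics: iterable of (metric_name, labels_dict, value) tuples.
    """
    lines = []
    for metric, labels, value in metrics:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{metric}{{{label_str}}}" if label_str else metric
        lines.append(f"{series} {value}")
    # The temporary file does not end in .prom, so the collector ignores it
    # until the rename makes the complete file visible.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    os.chmod(tmp, 0o644)
    final = os.path.join(directory, name + ".prom")
    os.replace(tmp, final)
    return final
```

Run from cron, a script like this could publish, for example, the completion time of a batch job without any extra daemon.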

    • The prom format and the query syntax are covered later.

    3.4 Configuring Prometheus to scrape node_exporter

    • Example: prometheus.yml

      scrape_configs:
        # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
        - job_name: 'prometheus'
          # metrics_path defaults to '/metrics'
          # scheme defaults to 'http'.
          static_configs:
            - targets: ['localhost:9090']

        - job_name: 'nodes'
          static_configs:
            - targets: ['localhost:9100']
            - targets: ['172.20.94.1:9100']

        - job_name: 'node_real_lan'
          file_sd_configs:
            - files:
                - ./sd_files/real_lan.yml
              refresh_interval: 30s
          params:   # optional
            collect[]:
              - cpu
              - meminfo
              - diskstats
              - netdev
              - netstat
              - filefd
              - filesystem
              - xfs


    4. cadvisor

    4.1 Official links:

    • https://github.com/google/cadvisor
    • image: gcr.io/google_containers/cadvisor[:v0.36.0] # requires access to Google
    • image: google/cadvisor:v0.33.0 # docker hub image, older than the Google-hosted one

    4.2 docker run

    sudo docker run \
      --volume=/:/rootfs:ro \
      --volume=/var/run:/var/run:ro \
      --volume=/sys:/sys:ro \
      --volume=/var/lib/docker/:/var/lib/docker:ro \
      --volume=/dev/disk/:/dev/disk:ro \
      --publish=9080:8080 \
      --detach=true \
      --name=cadvisor \
      --privileged \
      --device=/dev/kmsg \
      google/cadvisor:v0.33.0
    

    4.3 View simple single-host graphs in the web UI

    • http://ip:port

    4.4 Configuring Prometheus to scrape cadvisor

    • Example:

      - job_name: 'docker'
        static_configs:
          - targets: ['localhost:9080']


    5. grafana

    5.1 Official links

    • grafana downloads: https://grafana.com/grafana/download
    • grafana dashboards: https://grafana.com/grafana/dashboards/

    5.2 Installing grafana

    5.2.1 Installing on Linux (CentOS 7)

    • Download and install
      wget https://dl.grafana.com/oss/release/grafana-7.2.2-1.x86_64.rpm
      sudo yum install grafana-7.2.2-1.x86_64.rpm
      
    • Prepare the service file:
      [Unit]
      Description=Grafana instance
      Documentation=http://docs.grafana.org
      Wants=network-online.target
      After=network-online.target
      After=postgresql.service mariadb.service mysqld.service
      
      [Service]
      EnvironmentFile=/etc/sysconfig/grafana-server
      User=grafana
      Group=grafana
      Type=notify
      Restart=on-failure
      WorkingDirectory=/usr/share/grafana
      RuntimeDirectory=grafana
      RuntimeDirectoryMode=0750
      ExecStart=/usr/sbin/grafana-server  \
      --config=${CONF_FILE}  \
      --pidfile=${PID_FILE_DIR}/grafana-server.pid\
      --packaging=rpm  \
      cfg:default.paths.logs=${LOG_DIR}  \
      cfg:default.paths.data=${DATA_DIR} \
      cfg:default.paths.plugins=${PLUGINS_DIR} \
      cfg:default.paths.provisioning=${PROVISIONING_CFG_DIR}
      
      LimitNOFILE=10000
      TimeoutStopSec=20
      
      [Install]
      WantedBy=multi-user.target
      
    • Start grafana
      systemctl enable grafana-server.service
      systemctl restart grafana-server.service
      

      Listens on port 3000 by default

    • Open the firewall:

      iptables -I INPUT -p tcp --dport 3000 -s NET/MASK -j ACCEPT
      

    5.2.2 Installing with docker

    • image: grafana/grafana
      docker run -d --name=grafana -p 3000:3000 grafana/grafana:7.2.2
      

    5.3 Basic grafana workflow

    • Web UI:
      http://ip:port
      

      The first login requires setting up credentials.
      Version 7.2 asks you to log in and then reset the password; the initial username and password are both admin.

    • Workflow:

      • Add a data source
      • Add a dashboard and configure its panels; ready-made dashboard templates for many services can be downloaded from https://grafana.com/grafana/dashboards/
      • Import a template by json, link or template ID
      • View the dashboard
    • Frequently used template IDs:
      • node-exporter: cn/8919,en/11074
      • k8s: 13105
      • docker: 12831
      • alertmanager: 9578
      • blackbox_exporter: 9965
    • Resetting the admin password:
      Check the Grafana configuration file to find the path of grafana.db
      Configuration file: /etc/grafana/grafana.ini
      [paths]
      ;data = /var/lib/grafana
      [database]
      # For "sqlite3" only, path relative to data_path setting
      ;path = grafana.db
      So the full path of grafana.db is:
      /var/lib/grafana/grafana.db
      
      Change the admin password with sqlite3
      sqlite3 /var/lib/grafana/grafana.db
      sqlite> update user set password = 
      '59acf18b94d7eb0694c61e60ce44c110c7a683ac6a8f09580d626f90f4a242000746579358d77dd9e570e83fa24faa88a8a6', 
      salt = 'F3FAxVm33R' where login = 'admin';
      .exit
      
      Log in with admin / admin
      

    5.4 grafana alerting:

    • Configure the smtp server and sender mailbox for grafana-server
      vim /etc/grafana/grafana.ini
      [smtp]
      enabled =  true
      host = smtp.126.com:465
      user = USER@126.com
      password = PASS
      skip_verify = false
      from_address = USER@126.com
      from_name = Grafana Alert
      
    • Add a Notification Channel in the grafana UI
      Alerting -> Notification Channel
      Use send test before saving
      
    • Open a dashboard and add alert rules

    • grafana (7.2.2 at the time of writing) does not support template variables in alert queries, so its alerting is of little practical use. Use alertmanager in production.

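    6. Prometheus and PromQL:

    6.1 PromQL overview

    • PromQL is the language Prometheus uses to query its database. It turns the metrics collected by the exporters and stored in the database into graphs, and it is also used to write alerting rules.

    • Prometheus itself offers:

      • a multi-dimensional data model, with time series identified by a metric name and key/value pairs
      • a flexible query language (PromQL) that takes advantage of these dimensions
      • no dependency on distributed storage; single server nodes are autonomous
      • multiple modes of graphing and dashboard support

    6.2 Components of the Prometheus ecosystem:

    • prometheus server
    • client libraries for instrumenting application code
    • push gateway
    • exporters
    • alertmanager

    6.3 Metrics

    6.3.1 Metric types

    • gauges: a single numeric value that can go up and down, for example:

      • node_boot_time_seconds

      node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"} 1574040030

    • counters: cumulative counts that only increase.

    • histograms: count observations in buckets to describe the distribution of the data, from which quantiles such as the median and other percentiles can be estimated.

    • summaries: quantiles computed over sampled observations.

    6.3.2 label

    • node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"}

      In this example, instance and job are labels

      • job: the job_name defined in prometheus.yml
      • instance: host:port
    • You can also define your own labels in the configuration files, for example:
      - targets: ['10.10.11.178:3001']
        labels:
          server_name: test_web
      

      A label added this way can then be used when querying data in Prometheus:

      metric{server_name=...,}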
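
To make the relationship between metric name, labels and value concrete, here is a small Python sketch that splits one line of the text exposition format into its parts. `parse_sample` is a hypothetical helper for illustration only; it ignores optional timestamps, comments and escaping rules.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (metric_name, labels_dict, value) for one exposition-format line."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))


name, labels, value = parse_sample(
    'node_boot_time_seconds{instance="10.10.11.40:9100",job="node_real_lan"} 1574040030'
)
print(name, labels["job"], value)
```

A custom label such as server_name would simply show up as one more key in the returned dictionary.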
      

    6.4 PromQL expressions

    • PromQL expressions are the basic statements grafana uses to draw graphs and Prometheus uses to define alerting rules, so being able to read them matters.

    6.4.1 An example first:

    • CPU usage:
      (1-((sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance))/(sum(increase(node_cpu_seconds_total[1m])) by (instance)))) * 100
      

      Metrics used:

      node_cpu_seconds_total         # total cpu time
      node_cpu_seconds_total{mode="idle"} # idle cpu time; the mode label takes the values user, system, steal, softirq, irq, nice, iowait, idle
      

      Functions used:

      increase( [1m]) # increase over the last 1 minute
      sum()
      sum() by (TAG) # TAG is a label; here instance identifies the machine. Values are summed per host; without it, multiple hosts collapse into a single line.
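
The arithmetic behind the CPU usage expression can be checked by hand. The sketch below applies the same formula to two snapshots of the per-mode counters of one instance; `cpu_usage_percent` and the sample numbers are made up for illustration, and counter resets are not handled.

```python
def cpu_usage_percent(start, end):
    """CPU usage in percent between two snapshots of node_cpu_seconds_total.

    start/end: dicts mapping mode -> cumulative cpu seconds, already summed
    over all cores of one instance.
    """
    idle_increase = end["idle"] - start["idle"]
    total_increase = sum(end[mode] - start[mode] for mode in end)
    if total_increase <= 0:
        raise ValueError("counters did not advance between the snapshots")
    # Same shape as the PromQL: (1 - idle_increase / total_increase) * 100
    return (1 - idle_increase / total_increase) * 100


before = {"idle": 1000.0, "user": 200.0, "system": 100.0}
after = {"idle": 1045.0, "user": 210.0, "system": 105.0}
print(cpu_usage_percent(before, after))  # 45 of 60 cpu-seconds were idle
```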

      6.4.2 Label selectors

      • Matching operators:

        = # equal: Select labels that are exactly equal to the provided string.
        != # not equal: Select labels that are not equal to the provided string.
        =~ # regex match: Select labels that regex-match the provided string.
        !~ # regex non-match: Select labels that do not regex-match the provided string.

      • Examples:

        node_cpu_seconds_total{mode="idle"} # mode: a label, an attribute that comes with the metric
        api_http_requests_total{method="POST", handler="/messages"}

        http_requests_total{environment=~"staging|testing|development",method!="GET"}

      • Note: a selector must specify a name or at least one label matcher that does not match the empty string

        {job=~".*"} # Bad!
        {job=~".+"} # Good!
        {job=~".*",method="get"} # Good!
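
The four matching operators can be modelled in a few lines. This is only a sketch of the semantics: `matches` and `select` are hypothetical helpers, and Python's `re` module stands in for the RE2 syntax Prometheus uses. Regex matchers are fully anchored, which is why `fullmatch` is used.

```python
import re


def matches(labels, name, op, value):
    """Evaluate one label matcher against a series' labels."""
    actual = labels.get(name, "")  # a missing label behaves like an empty string
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")


def select(series, matchers):
    """Keep the series for which every matcher holds."""
    return [s for s in series if all(matches(s, *m) for m in matchers)]
```

The "Bad!" example follows directly: `.*` also matches the empty string, so `{job=~".*"}` would select every series, including those without a job label.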

      6.4.3 Operations

      • Time units:

        s - seconds
        m - minutes
        h - hours
        d - days
        w - weeks
        y - years

      • Operators:

        + (addition)
        - (subtraction)
        * (multiplication)
        / (division)
        % (modulo)
        ^ (power/exponentiation)
        == (equal)
        != (not-equal)
        > (greater-than)
        < (less-than)
        >= (greater-or-equal)
        <= (less-or-equal)

      [...]

              ... > 40
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "{{$labels.instance}}: CPU usage is high"
            description: "{{$labels.instance}}: CPU usage is above 40%"
            value: "{{$value}}"
        - alert: "CPU usage above 90%"
          expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by(instance)* 100) > 90
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "{{$labels.instance}}: CPU usage at 90%"
            description: "{{$labels.instance}}: CPU usage above 90% for more than 5mins"
            value: "{{$value}}"

      • If the configuration files contain non-ASCII text such as Chinese, save them as UTF-8; otherwise loading fails.

      7.6 Configuring alertmanager

      • Full documentation: https://prometheus.io/docs/alerting/latest/configuration/
      • Main configuration file: alertmanager.yml
      • Template files: *.tmpl
      • Only the few settings needed here are described; see the official documentation for the full reference.

      7.6.1 alertmanager.yml

      • The main configuration file needs:
        • global: sender mailbox settings
        • templates: the email template files (alertmanager's default template is used when omitted)
        • route (with nested routes): routing rules, for example which label values go to which backend
        • receivers: the notification backends: email, wechat, webhook and so on

      • An example first:

        vim alertmanager.yml
        global:
          smtp_smarthost: 'xxx'
          smtp_from: 'xxx'
          smtp_auth_username: 'xxx'
          smtp_auth_password: 'xxx'
          smtp_require_tls: false
        templates:
          - '/alertmanager/template/*.tmpl'
        route:
          receiver: 'default-receiver'
          group_wait: 1s        # wait before the first notification for a new group
          group_interval: 1s    # interval between notifications for a group
          repeat_interval: 1s   # interval before a notification is repeated
          group_by: [cluster, alertname]
          routes:
            - receiver: test
              group_wait: 1s
              match_re:
                severity: test
        receivers:
          - name: 'default-receiver'
            email_configs:
              - to: 'xx@xx.xx'
                html: '{{ template "xx.html" . }}'
                headers: { Subject: " {{ .CommonAnnotations.summary }}" }
          - name: 'test'
            email_configs:
              - to: 'xxx@xx.xx'
                html: '{{ template "xx.html" . }}'
                headers: { Subject: "Second route match test" }

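        vim test.tmpl
        {{ define "xx.html" }}
        <table border="1">
          <tr><td>Alert</td><td>Disk</td><td>Alert threshold</td><td>Start time</td></tr>
          {{ range $i, $alert := .Alerts }}
          <tr><td>{{ index $alert.Labels "alertname" }}</td><td>{{ index $alert.Labels "instance" }}</td><td>{{ index $alert.Annotations "value" }}</td><td>{{ $alert.StartsAt }}</td></tr>
          {{ end }}
        </table>
        {{ end }}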

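      • Details:

        global:

          resolve_timeout:   # time after which an alert is declared resolved if it is no longer reported

          Other mail-related settings are as in the example.

        route:   # root route that every alert enters; defines the dispatch policy

          group_by: ['LABEL_NAME', 'alertname', 'cluster', 'job', 'instance', ...]

            Labels by which incoming alerts are regrouped. For example, the many alerts carrying cluster=A and alertname=LatencyHigh are batched into one group.

          group_wait: 30s

            When a new alert group is created, wait at least group_wait before sending the first notification. This leaves time to collect several alerts for the same group and send them together.

          group_interval: 5m

            After the first notification, wait group_interval before notifying about new alerts added to the group.

          repeat_interval: 5m

            After a notification has been sent successfully, wait repeat_interval before sending it again.

          match:
            label_name: NAME

            Exact match on labels; matching alerts are sent to the receiver.

          match_re:
            label_name: <regex>, ...

            Regular expression match; matching alerts are sent to the receiver.

          receiver: receiver_name

            Sends alerts that satisfy match and match_re to a notification backend (email, webhook, pagerduty, wechat, ...).
            The root route must have a default receiver, otherwise: err="root route must specify a default receiver"

          routes:
            - <route> ...

            Nested routing rules; several can be configured.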
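
How an alert ends up at a receiver can be sketched as a walk down the routing tree. The code below is a simplified model, not Alertmanager's implementation: `pick_receiver` is a hypothetical helper, it assumes the default behaviour where the first matching child route wins (no `continue: true`), and it uses Python's `re` in place of Go regular expressions. match_re patterns are anchored, hence `fullmatch`.

```python
import re

# The route section of the example alertmanager.yml, as a plain dict.
ROUTE = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "test", "match_re": {"severity": "test"}},
    ],
}


def pick_receiver(route, labels):
    """Return the receiver name an alert with these labels would be routed to."""
    for child in route.get("routes", []):
        exact = child.get("match", {})
        regex = child.get("match_re", {})
        if all(labels.get(k, "") == v for k, v in exact.items()) and \
           all(re.fullmatch(p, labels.get(k, "")) for k, p in regex.items()):
            # A child without its own receiver inherits the parent's.
            inherited = dict(child)
            inherited.setdefault("receiver", route["receiver"])
            return pick_receiver(inherited, labels)
    # No child matched: the alert stays with this route's receiver.
    return route["receiver"]


print(pick_receiver(ROUTE, {"alertname": "CPU", "severity": "test"}))
print(pick_receiver(ROUTE, {"alertname": "CPU", "severity": "critical"}))
```

This also shows why the root route needs a receiver: it is the fallback for every alert that no child route claims.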
      
      

        templates:
          [ - <filepath> ... ]

          Template files, for example the email notification page template.

        receivers:
          - <receiver> ...   # list

          - name: receiver_name   # the name referenced by route.receiver

            email_configs:        # email notifications

              - to: <tmpl_string>
                send_resolved: <boolean> | default = false      # whether to send a notification once the alert is resolved

          Sets the mailbox that receives alert emails; a separate sender mailbox can also be set per receiver. See the official documentation:
          https://prometheus.io/docs/alerting/latest/configuration/#email_config

          - name: ...
            wechat_configs:
              - send_resolved: <boolean> | default = false
                api_secret: <secret> | default = global.wechat_api_secret
                api_url: <string> | default = global.wechat_api_url
                corp_id: <string> | default = global.wechat_api_corp_id
                message: <tmpl_string> | default = '{{ template "wechat.default.message" . }}'
                agent_id: <string> | default = '{{ template "wechat.default.agent_id" . }}'
                to_user: <string> | default = '{{ template "wechat.default.to_user" . }}'
                to_party: <string> | default = '{{ template "wechat.default.to_party" . }}'
                to_tag: <string> | default = '{{ template "wechat.default.to_tag" . }}'
                # Notes
                #   to_user: WeChat Work user ID
                #   to_party: ID of the group to notify
                #   corp_id: unique ID of the WeChat Work account, shown under "My Company"
                #   agent_id: application ID; open the custom application under application management to see it
                #   api_secret: application secret
                #
                #   Register for WeChat Work: https://work.weixin.qq.com
                #   WeChat API documentation: https://work.weixin.qq.com/api/doc#90002/90151/90854

          WeChat Work notification settings.

        inhibit_rules:
          - source_match:
              severity: 'critical'
            target_match:
              severity: 'warning'
            equal: ['alertname', 'dev', 'instance']

          Inhibition settings.
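
The inhibit rule above reads: a warning alert is muted while a critical alert with the same alertname, dev and instance labels is firing. A minimal Python sketch of that check follows; `is_inhibited` is a hypothetical helper, alerts are plain label dictionaries, and a label missing on both sides is treated as equal.

```python
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}


def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` is muted by some firing alert under `rule`."""
    if not all(target.get(k) == v for k, v in rule["target_match"].items()):
        return False  # the rule does not apply to this alert at all
    for source in firing_alerts:
        if source is target:
            continue
        if not all(source.get(k) == v for k, v in rule["source_match"].items()):
            continue
        # Every label listed under `equal` must carry the same value on both alerts.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


critical = {"alertname": "CPU", "instance": "a:9100", "severity": "critical"}
warning = {"alertname": "CPU", "instance": "a:9100", "severity": "warning"}
print(is_inhibited(warning, [critical, warning], RULE))
```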

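      7.6.2 Configuring WeChat Work notifications

      • Register a company: https://work.weixin.qq.com

        An unverified company can be registered (up to 200 members); binding a personal WeChat account is enough to use the web console.

        WeChat API documentation: https://work.weixin.qq.com/api/doc#90002/90151/90854

      • After registering, bind a personal WeChat account and scan the QR code to enter the admin console.

      • The application that sends alerts must be created first, which is straightforward.

      • Parameters to note:

        • corp_id: unique ID of the WeChat Work account, shown under "My Company"
        • agent_id: application ID; open the custom application under application management to see it
        • api_secret: application secret
        • to_user: WeChat Work user ID
        • to_party: ID of the group to notify; in the contacts, click the dots next to the group name to see it
      • Configuration example:

        receivers:
          - name: 'default'
            email_configs:
              - to: 'XXX'
                send_resolved: true
            wechat_configs:
              - send_resolved: true
                corp_id: 'XXX'
                api_secret: 'XXX'
                agent_id: 1000002
                to_user: XXX
                to_party: 2
                message: '{{ template "wechat.html" . }}'
      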
      • template:

        • alertmanager's default WeChat template is ugly and verbose, so a custom template is used; the default email template is acceptable.

        • Example 1:

        cat wechat.tmpl
        {{ define "wechat.html" }}
        {{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts.Firing }}
        [@Alert~]
        Instance: {{ .Labels.instance }}
        Summary: {{ .Annotations.summary }}
        Details: {{ .Annotations.description }}
        Value: {{ .Annotations.value }}
        Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
        {{ end }}{{ end -}}
        {{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts.Resolved }}
        [@Resolved~]
        Instance: {{ .Labels.instance }}
        Summary: {{ .Annotations.summary }}
        Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
        Resolved: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
        {{ end }}{{ end -}}
        {{- end }}
      

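      7.6.3 Time zone in alert templates:

      • Source: https://blog.csdn.net/knight_zhou/article/details/106323719

      • Custom Alertmanager notification templates render times in UTC by default. Adding 28800e9 nanoseconds (8 hours) shifts them to UTC+8. The layout string passed to Format must be Go's reference time, 2006-01-02 15:04:05.

        Trigger time: {{ .StartsAt.Format "2006-01-02 15:04:05" }} 
        After the change: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}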
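
The constant 28800e9 is a duration in nanoseconds: 28800 seconds, or 8 hours. The short Python sketch below reproduces the same shift outside of a Go template so the result can be checked; `to_utc_plus_8` and the sample timestamp are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def to_utc_plus_8(starts_at_utc):
    """Shift a UTC timestamp by 8 hours and format it like the template does."""
    offset = timedelta(seconds=28800e9 / 1e9)  # 28800e9 ns == 28800 s == 8 h
    return (starts_at_utc + offset).strftime("%Y-%m-%d %H:%M:%S")


print(to_utc_plus_8(datetime(2020, 5, 24, 18, 30, 0, tzinfo=timezone.utc)))
```

An alert that started at 18:30 UTC is therefore shown as 02:30 on the following day.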
        

      7.7 Common Prometheus alerting rules:

      • A very useful page with many ready-made rules: https://awesome-prometheus-alerts.grep.to/rules

      7.7.1 Container alerts: alert when a container goes down

      vim rules/docker_monitor.yml
      groups:
        - name: "container monitor"
          rules:
            - alert: "Container down: env1"
              expr: time() - container_last_seen{name="env1"} > 60
              for: 30s
              labels:
                severity: critical
              annotations:
                summary: "Container down: {{$labels.instance}} name={{$labels.name}}"
      

      Note:

      This metric can only detect that a container is down; it cannot reliably detect recovery. Even if the container never starts successfully, a resolved notification arrives after a while.
      

      7.7.2 Alerting rules for CPU, disk IO, disk usage, memory usage, TCP and network traffic:

      groups:
      - name: Host status alerts
        rules:
        - alert: Host status
          expr: up == 0
          for: 1m
          labels:
            status: critical
          annotations:
            summary: "{{$labels.instance}}: server down"
            description: "{{$labels.instance}}: server has not responded for more than 5 minutes"

        - alert: CPU usage
          expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 60
          for: 1m
          labels:
            status: warning
          annotations:
            summary: "{{$labels.instance}} CPU usage is too high!"
            description: "{{$labels.instance}} CPU usage is above 60% (current: {{$value}}%)"
      - alert: cpu使用率过高告警  # 查询提供了hostname label
        expr: (100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 10
      nodename) (node_uname_info) > 85
        for: 5m
        labels:
       region: 成都
        annotations:
       summary: "{{$labels.instance}}({{$labels.nodename}})CPU使用率过高!"
       description: '服务器{{$labels.instance}}({{$labels.nodename}})CPU使用率超过85%(
      $value}}%)'       
      - alert: 系统负载过高
        expr: (node_load1/count without (cpu, mode) (node_cpu_seconds_total{mode="system"}
      nodename) (node_uname_info)>1.1
        for: 3m
        labels:
       region: 成都
        annotations:
       summary: "{{$labels.instance}}({{$labels.nodename}})系统负载过高!"
       description: '{{$labels.instance}}({{$labels.nodename}})当前负载超标率 {{printf 
      
      - alert: 内存不足告警
        expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)* o
      nodename) (node_uname_info) > 80
        for: 3m
        labels:
       region: 成都
        annotations:
       summary: "{{$labels.instance}}({{$labels.nodename}})内存使用率过高!"
       description: '服务器{{$labels.instance}}({{$labels.nodename}})内存使用率超过80%(
      $value}}%)'
        - alert: IO操作耗时
       expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100)  102400
       for: 1m
       labels:
      status: 严重告警
       annotations:
      summary: "{{$labels.mountpoint}} 流入网络带宽过高!"
      description: "{{$labels.mountpoint }}流入网络带宽持续2分钟高于100M. RX带宽使用率{
        - alert: 网络流出
       expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|d
      instance)) / 100) > 102400
       for: 1m
       labels:
      status: 严重告警
       annotations:
      summary: "{{$labels.mountpoint}} 流出网络带宽过高!"
      description: "{{$labels.mountpoint }}流出网络带宽持续2分钟高于100M. RX带宽使用率{
        - alert: network in
       expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1
       for: 1m
       labels:
      name: network
      severity: Critical
       annotations:
      summary: "{{$labels.mountpoint}} 流入网络带宽过高"
      description: "{{$labels.mountpoint }}流入网络异常,高于100M"
      value: "{{ $value }}"        
        - alert: network out
       expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 
       for: 1m
       labels:
      name: network
      severity: Critical
       annotations:
      summary: "{{$labels.mountpoint}} 发送网络带宽过高"
      description: "{{$labels.mountpoint }}发送网络异常,高于100M"
      value: "{{ $value }}" 
      
        - alert: TCP会话
       expr: node_netstat_Tcp_CurrEstab > 1000
       for: 1m
       labels:
      status: 严重告警
       annotations:
      summary: "{{$labels.mountpoint}} TCP_ESTABLISHED过高!"
      description: "{{$labels.mountpoint }} TCP_ESTABLISHED大于1000%(目前使用:{{$valu
        - alert: 磁盘容量
       expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_b
      > 80
       for: 1m
       labels:
      status: 严重告警
       annotations:
      summary: "{{$labels.mountpoint}} 磁盘分区使用率过高!"
      description: "{{$labels.mountpoint }} 磁盘分区使用大于80%(目前使用:{{$value}}%)"    
      - alert: 硬盘空间不足告警  # 查询结果多了hostname等label
        expr: (100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_by
      )* on(instance) group_left(nodename) (node_uname_info)> 80
        for: 3m
        labels:
       region: 成都
        annotations:
       summary: "{{$labels.instance}}({{$labels.nodename}})硬盘使用率过高!"
       description: '服务器{{$labels.instance}}({{$labels.nodename}})硬盘使用率超过80%(
      $value}}%)'
        - alert: volume fullIn fourdaysd # 预计磁盘4天后写满
       expr: predict_linear(node_filesystem_free_bytes[2h], 4 * 24 * 3600) < 0
       for: 5m
       labels:
      name: disk
      severity: Critical
       annotations:
      summary: "{{$labels.mountpoint}} 预计主机可用磁盘空间4天后将写满"
      description: "{{$labels.mountpoint }}" 
      value: "{{ $value }}%"  
        - alert: disk write rate
       expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024
       for: 1m
       labels:
      name: disk
      severity: Critical
       annotations:
      summary: "disk write rate (instance {{ $labels.instance }})"
      description: "磁盘写入速率大于50MB/s"
      value: "{{ $value }}%" 
        - alert: disk read latency
       expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_complet
       for: 1m
       labels:
      name: disk
      severity: Critical
       annotations:
      summary: "unusual disk read latency (instance {{ $labels.instance }})"
      description: "磁盘读取延迟大于100毫秒"
      value: "{{ $value }}%" 
        - alert: disk write latency
       expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_compl
       for: 1m
       labels:
      name: disk
      severity: Critical
       annotations:
      summary: "unusual disk write latency (instance {{ $labels.instance }})"
      description: "磁盘写入延迟大于100毫秒"
      value: "{{ $value }}%" 
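
The "volume full in four days" rule above relies on predict_linear, which fits a straight line through the samples of the range and extrapolates it. The sketch below shows the idea with an ordinary least-squares fit; this `predict_linear` is a simplified stand-in written for this guide, and it extrapolates from the last sample's timestamp rather than from the rule evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate a series linearly.

    samples: list of (unix_timestamp, value) pairs, oldest first.
    Returns the fitted value `seconds_ahead` after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Free bytes shrinking by 6 every minute: the rule fires once the prediction is < 0.
free = [(0, 100.0), (60, 94.0), (120, 88.0)]
print(predict_linear(free, 4 * 24 * 3600) < 0)
```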
      

      7.8 alertmanager 管理api

      GET /-/healthy  
      GET /-/ready  
      POST /-/reload
      
      • Examples:
      curl -u monitor:fosafer.com 127.0.0.1:9093/-/healthy
          OK
      [root@host40 monitor]# curl -XPOST -u monitor:fosafer.com 127.0.0.1:9093/-/reload
      failed to reload config: yaml: unmarshal errors:
      line 26: field receiver already set in type config.plain
      

      The reload call is equivalent to: docker exec -it monitor-alertmanager kill -1 1 (sending SIGHUP to the process), except that the API call prints the error when the reload fails.
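The reload endpoint can be wrapped so that a rejected configuration is noticed by scripts. The following is a minimal sketch, assuming the endpoint answers HTTP 200 on success; `AM_URL`, `AM_AUTH` and `am_reload` are illustrative names, and the default password is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: reload Alertmanager over HTTP and fail loudly if the config is rejected.
# AM_URL / AM_AUTH are hypothetical variables, not Alertmanager settings.
AM_URL="${AM_URL:-http://127.0.0.1:9093}"
AM_AUTH="${AM_AUTH:-monitor:changeme}"

am_reload() {
    local body code
    body="$(mktemp)"
    # -s: quiet, -o: store the response body, -w: print only the HTTP status code
    code="$(curl -s -o "$body" -w '%{http_code}' -u "$AM_AUTH" -XPOST "$AM_URL/-/reload")"
    if [ "$code" = "200" ]; then
        echo "reload ok"
        rm -f "$body"
        return 0
    fi
    # on failure the body carries the message, e.g. "failed to reload config: ..."
    echo "reload failed (HTTP $code): $(cat "$body")" >&2
    rm -f "$body"
    return 1
}
```

Calling `am_reload || exit 1` after editing alertmanager.yml stops a deployment script as soon as the new configuration is refused.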

      8.blackbox_exporter

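      8.1 Introduction to blackbox_exporter

      • blackbox_exporter is one of the exporters officially provided by Prometheus. It collects monitoring data by probing over http, dns, tcp and icmp.

      • Official repository: https://github.com/prometheus/blackbox_exporter

      • Use cases:

        HTTP probes
          define request header fields
          check the HTTP status / HTTP response headers / HTTP body content
        TCP probes
          monitor the port status of business components
          define and monitor application-layer protocols
        ICMP probes
          host liveness checks
        POST probes
          interface connectivity
        SSL certificate expiry time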
        

      8.2 Installing blackbox_exporter

      8.2.1 Binary installation on Linux (CentOS 7)

      • Download and extract:
        wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.18.0/blackbox_exporter-0.18.0.linux-amd64.tar.gz
        tar -xf blackbox_exporter-0.18.0.linux-amd64.tar.gz -C /usr/local/
        cd /usr/local 
        ln -sv blackbox_exporter-0.18.0.linux-amd64 blackbox_exporter
        cd blackbox_exporter
        ./blackbox_exporter --version
        
      • Add a systemd service unit:
        vim /lib/systemd/system/blackbox_exporter.service
        [Unit]
        Description=blackbox_exporter
        After=network.target
        [Service]
        User=root
        Type=simple
        ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
        Restart=on-failure
        [Install]
        WantedBy=multi-user.target
        
        systemctl daemon-reload
        systemctl enable blackbox_exporter
        systemctl start blackbox_exporter
        
      • Default listening port: 9115

      8.2.2 Installing blackbox_exporter with Docker

      • image: prom/blackbox-exporter:master

      • docker run:

        docker run --rm -d -p 9115:9115 --name blackbox_exporter -v `pwd`:/config prom/blackbox-exporter:master --config.file=/config/blackbox.yml
        

      8.3 Configuring blackbox_exporter

      • Default configuration file:

      • The default blackbox_exporter configuration already covers most needs. For custom settings, see the official documentation and the example configuration file in the project:

        • https://github.com/prometheus/blackbox_exporter/blob/master/example.yml
        cat blackbox.yml
        modules:
          http_2xx:
            prober: http
          http_post_2xx:
            prober: http
            http:
              method: POST
          tcp_connect:
            prober: tcp
          pop3s_banner:
            prober: tcp
            tcp:
              query_response:
              - expect: "^+OK"
              tls: true
              tls_config:
                insecure_skip_verify: false
          ssh_banner:
            prober: tcp
            tcp:
              query_response:
              - expect: "^SSH-2.0-"
          irc_banner:
            prober: tcp
            tcp:
              query_response:
              - send: "NICK prober"
              - send: "USER prober prober prober :prober"
              - expect: "PING :([^ ]+)"
                send: "PONG ${1}"
              - expect: "^:[^ ]+ 001"
          icmp:
            prober: icmp
        

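      8.4 Configuring prometheus:

      • Official documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config

      • Reference article: https://blog.csdn.net/qq_25934401/article/details/84325356

      • Labels available during relabeling:

        labels:
        job: the job_name of the scrape config
        __address__: <host>:<port> of the target
        instance: defaults to __address__ if it is not set during relabeling
        __scheme__: scheme of the target
        __metrics_path__: metrics path of the target
        __param_<name>: value of the first URL parameter named <name>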
        

      8.4.1 HTTP/HTTPS probe example:

      scrape_configs:
        - job_name: 'blackbox'
          metrics_path: /probe
          params:
            module: [http_2xx]  # Look for a HTTP 200 response.
          static_configs:
            - targets:
              - http://prometheus.io    # Target to probe with http.
              - https://prometheus.io   # Target to probe with https.
              - http://example.com:8080 # Target to probe with http on port 8080.
          relabel_configs:
            - source_labels: [__address__]
              target_label: __param_target
            - source_labels: [__param_target]
              target_label: instance
            - target_label: __address__
              replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.
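
The three relabel steps move the configured target into the `target` URL parameter and point the scrape at the exporter itself. As a rough sketch, the request Prometheus ends up sending can be rebuilt by hand; `blackbox_probe_url` is a hypothetical helper name, and unlike Prometheus it does not URL-encode the target.

```shell
#!/usr/bin/env bash
# Sketch: print the URL that is effectively scraped for one target
# once relabeling has been applied.
blackbox_probe_url() {
    # $1: exporter host:port (final __address__)
    # $2: module name        (params.module)
    # $3: probed target      (__param_target, also copied to the instance label)
    local exporter="$1" module="$2" target="$3"
    printf 'http://%s/probe?module=%s&target=%s\n' "$exporter" "$module" "$target"
}
```

For example, `curl -s "$(blackbox_probe_url 127.0.0.1:9115 http_2xx prometheus.io)"` fetches the same metrics page that the `blackbox` job scrapes for that target.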
      

      8.4.2 TCP probe example:

      - job_name: "blackbox_telnet_port"
        scrape_interval: 5s
        metrics_path: /probe
        params:
          module: [tcp_connect]
        static_configs:
          - targets: [ '1x3.x1.xx.xx4:443' ]
            labels:
              group: 'xxx IDC machine room IP monitoring'
          - targets: ['10.xx.xx.xxx:443']
            labels:
              group: 'Process status of nginx(main) server'
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 10.xxx.xx.xx:9115
      

      8.4.3 ICMP probe example:

      - job_name: 'blackbox00_ping_idc_ip'
        scrape_interval: 10s
        metrics_path: /probe
        params:
          module: [icmp]  # ping
        static_configs:
          - targets: [ '1x.xx.xx.xx' ]
            labels:
              group: 'xxnginx virtual IP'
        relabel_configs:
          - source_labels: [__address__]
            regex: (.*)(:80)?
            target_label: __param_target
            replacement: ${1}
          - source_labels: [__param_target]
            regex: (.*)
            target_label: ping
            replacement: ${1}
          - source_labels: []
            regex: .*
            target_label: __address__
            replacement: 1x.xxx.xx.xx:9115
      

      8.4.4 POST probe example:

      - job_name: 'blackbox_http_2xx_post'
        scrape_interval: 10s
        metrics_path: /probe
        params:
          module: [http_post_2xx_query]
        static_configs:
          - targets:
            - https://xx.xxx.com/api/xx/xx/fund/query.action
            labels:
              group: 'Interface monitoring'
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 1x.xx.xx.xx:9115  # The blackbox exporter's real hostname:port.
      

      8.4.5 SSL certificate expiry monitoring:

      cat << 'EOF' > prometheus.yml
      rule_files:
        - ssl_expiry.rules
      scrape_configs:
        - job_name: 'blackbox'
          metrics_path: /probe
          params:
            module: [http_2xx]  # Look for a HTTP 200 response.
          static_configs:
            - targets:
              - example.com  # Target to probe
          relabel_configs:
            - source_labels: [__address__]
              target_label: __param_target
            - source_labels: [__param_target]
              target_label: instance
            - target_label: __address__
              replacement: 127.0.0.1:9115  # Blackbox exporter.
      EOF
      cat << 'EOF' > ssl_expiry.rules
      groups:
        - name: ssl_expiry.rules
          rules:
            - alert: SSLCertExpiringSoon
              expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30
              for: 10m
      EOF
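
The alert expression compares the certificate's expiry timestamp with the current time, both in Unix seconds, against a 30-day window. The same arithmetic can be sketched in shell, for instance to sanity-check a value read from the exporter; both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the arithmetic behind
#   probe_ssl_earliest_cert_expiry - time() < 86400 * 30

cert_days_left() {
    # $1: expiry as Unix epoch seconds, $2: "now" (defaults to the current time)
    local expiry_epoch="$1" now_epoch="${2:-$(date +%s)}"
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

cert_expiring_soon() {
    # exit status 0 when fewer than 30 days remain, mirroring the alert condition
    local expiry_epoch="$1" now_epoch="${2:-$(date +%s)}"
    [ $(( expiry_epoch - now_epoch )) -lt $(( 86400 * 30 )) ]
}
```

`cert_days_left "$expiry"` prints the whole days remaining, and `cert_expiring_soon "$expiry" && echo "renew"` follows the same threshold as the rule.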
      

      8.5 Inspecting a probe:

      • For example (the URL is quoted so that the shell does not interpret the & characters):
        curl 'http://172.16.10.65:9115/probe?target=prometheus.io&module=http_2xx&debug=true'
        

      8.6 Adding alerts:

      • Whether icmp, tcp, http and post probes succeed can be read from the probe_success metric:
        probe_success == 0 ## connectivity failure
        probe_success == 1 ## connectivity OK
        
      • The alert rule checks whether this metric equals 0 and fires when it does:

        [sss@prometheus01 prometheus]$ cat rules/blackbox-alert.rules
        groups:
          - name: blackbox_network_stats
            rules:
              - alert: blackbox_network_stats
                expr: probe_success == 0
                for: 1m
                labels:
                  severity: critical
                annotations:
                  summary: "Instance {{ $labels.instance }} is down"
                  description: "This requires immediate action!"
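
Before relying on the alert, it can help to check by hand what `probe_success` reports for a target. The sketch below reads a /probe response on stdin and succeeds only when the sample is 1; `probe_succeeded` is an illustrative name.

```shell
#!/usr/bin/env bash
# Sketch: turn the probe_success sample of a /probe response into an exit status.
probe_succeeded() {
    # Metric lines look like "probe_success 1"; "# HELP" / "# TYPE" lines are ignored
    # because their first field is "#".
    awk '$1 == "probe_success" { found = 1; ok = ($2 == 1) }
         END { exit !(found && ok) }'
}
```

Usage, with the exporter address taken from the examples above: `curl -s 'http://127.0.0.1:9115/probe?target=prometheus.io&module=http_2xx' | probe_succeeded && echo up || echo down`.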

      9.Deploying a complete prometheus monitoring system with docker-compose

      • Deployment host: 10.10.11.40

      9.1 Components to deploy:

       prometheus
       alertmanager
       grafana
       nginx
       node_exporter
       cadvisor
       blackbox_exporter
      
      • image:
       prom/prometheus
       prom/alertmanager
       quay.io/prometheus/node-exporter  ,prom/node-exporter
       gcr.io/google_containers/cadvisor[:v0.36.0]  # requires access to Google
       google/cadvisor:v0.33.0 # Docker Hub image; older than the gcr.io version
       grafana/grafana
       nginx
      
      • After pulling the images, re-tag them and push them to the local Harbor registry:
        image: 10.10.11.40:80/base/nginx:1.19.3
        image: 10.10.11.40:80/base/prometheus:2.22.0
        image: 10.10.11.40:80/base/grafana:7.2.2
        image: 10.10.11.40:80/base/alertmanager:0.21.0
        image: 10.10.11.40:80/base/node_exporter:1.0.1
        image: 10.10.11.40:80/base/cadvisor:v0.33.0
        image: 10.10.11.40:80/base/blackbox-exporter:0.18.0
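
The pull, re-tag and push steps are the same for every image, so they can be wrapped in a small helper. This is a sketch only; `REGISTRY` and `mirror_image` are illustrative names, and the registry path is the one used in this article.

```shell
#!/usr/bin/env bash
# Sketch: pull an upstream image, re-tag it for the local registry, and push it.
REGISTRY="${REGISTRY:-10.10.11.40:80/base}"

mirror_image() {
    # $1: upstream image reference, $2: name:tag inside the local registry
    local src="$1" dst="$REGISTRY/$2"
    docker pull "$src" &&
    docker tag "$src" "$dst" &&
    docker push "$dst"
}
```

For example, `mirror_image nginx:1.19.3 nginx:1.19.3` produces the `10.10.11.40:80/base/nginx:1.19.3` image listed above; the chain stops at the first failing docker command.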
        
        

      9.2 Deployment layout

      • Directory layout:
        mkdir /home/deploy/monitor
        cd /home/deploy/monitor
        
        [root@host40 monitor]# tree
        .
        ├── alertmanager
        │   ├── alertmanager.yml
        │   ├── db
        │   │   ├── nflog
        │   │   └── silences
        │   └── templates
        │    └── wechat.tmpl
        ├── blackbox_exporter
        │   └── blackbox.yml
        ├── docker-compose.yml
        ├── grafana
        │   └── db
        │    ├── grafana.db
        │    ├── plugins
            ...
        ├── nginx
        │   ├── auth
        │   └── nginx.conf
        ├── node-exporter
        │   └── textfiles
        ├── node_exporter_install_docker.sh
        ├── prometheus
        │   ├── db
        │   ├── prometheus.yml
        │   ├── rules
        │   │   ├── docker_monitor.yml
        │   │   ├── system_monitor.yml
        │   │   └── tcp_monitor.yml
        │   └── sd_files
        │    ├── docker_host.yml
        │    ├── http.yml
        │    ├── icmp.yml
        │    ├── real_lan.yml
        │    ├── real_wan.yml
        │    ├── sedFDm5Rw
        │    ├── tcp.yml
        │    ├── virtual_lan.yml
        │    └── virtual_wan.yml
        └── sd_controler.sh
        
      • File required for nginx basic authentication:
        [root@host40 monitor-bak]# ls nginx/auth/ -a
        .  ..  .htpasswd
        
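      • Permissions for some of the mounted paths:
        The db directories of prometheus, grafana and alertmanager need mode 777.
        The individually mounted config files alertmanager.yml, prometheus.yml and nginx.conf need mode 666.
        For better security, put the config files into a dedicated directory, mount that directory, and point the startup arguments in command at the config file.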
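
The directory tree and the permissions described above can be prepared in one step. A minimal sketch follows, assuming the layout shown in the tree output; `make_monitor_layout` is a hypothetical helper, and the config files themselves still have to be copied in afterwards.

```shell
#!/usr/bin/env bash
# Sketch: create the bind-mount directories used by the compose file
# and open up the three db directories as described above.
make_monitor_layout() {
    local base="${1:-/home/deploy/monitor}"
    mkdir -p "$base"/alertmanager/db "$base"/alertmanager/templates \
             "$base"/blackbox_exporter "$base"/grafana/db \
             "$base"/nginx/auth "$base"/node-exporter/textfiles \
             "$base"/prometheus/db "$base"/prometheus/rules \
             "$base"/prometheus/sd_files
    # the db directories must be writable from inside the containers
    chmod 777 "$base"/alertmanager/db "$base"/grafana/db "$base"/prometheus/db
}
```

Running `make_monitor_layout /home/deploy/monitor` is idempotent, so it can be repeated safely on a host that already has part of the tree.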
        

      9.3 docker-compose.yml

      [root@host40 monitor-bak]# cat docker-compose.yml
      version: "3"
      services:

        nginx:
          image: 10.10.11.40:80/base/nginx:1.19.3
          hostname: nginx
          container_name: monitor-nginx
          restart: always
          privileged: false
          ports:
            - 3001:3000
            - 9090:9090
            - 9093:9093
          volumes:
            - ./nginx/nginx.conf:/etc/nginx/nginx.conf
            - ./nginx/auth:/etc/nginx/basic_auth
          networks:
            monitor:
              aliases:
                - nginx
          logging:
            driver: json-file
            options:
              max-file: '5'
              max-size: 50m

        prometheus:
          image: 10.10.11.40:80/base/prometheus:2.22.0
          container_name: monitor-prometheus
          hostname: prometheus
          restart: always
          privileged: true
          volumes:
            - ./prometheus/db/:/prometheus/
            - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
            - ./prometheus/rules/:/etc/prometheus/rules/
            - ./prometheus/sd_files/:/etc/prometheus/sd_files/
          command:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--web.console.libraries=/usr/share/prometheus/console_libraries'
            - '--web.console.templates=/usr/share/prometheus/consoles'
            - '--storage.tsdb.retention=60d'
          networks:
            monitor:
              aliases:
                - prometheus
          logging:
            driver: json-file
            options:
              max-file: '5'
              max-size: 50m

        grafana:
          image: 10.10.11.40:80/base/grafana:7.2.2
          container_name: monitor-grafana
          hostname: grafana
          restart: always
          privileged: true
          volumes:
            - ./grafana/db/:/var/lib/grafana
          networks:
            monitor:
              aliases:
                - grafana
          logging:
            driver: json-file
            options:
              max-file: '5'
              max-size: 50m

        alertmanager:
          image: 10.10.11.40:80/base/alertmanager:0.21.0
          container_name: monitor-alertmanager
          hostname: alertmanager
          restart: always
          privileged: true
          volumes:
            - ./alertmanager/db/:/alertmanager
            - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
            - ./alertmanager/templates/:/etc/alertmanager/templates
          networks:
            monitor:
              aliases:
                - alertmanager
          logging:
            driver: json-file
            options:
              max-file: '5'
              max-size: 50m

        node-exporter:
          image: 10.10.11.40:80/base/node_exporter:1.0.1
          container_name: monitor-node-exporter
          hostname: host40
          restart: always
          privileged: true
          volumes:
            - /:/host:ro,rslave
            - ./node-exporter/textfiles/:/textfiles
          network_mode: "host"
          command:
            - '--path.rootfs=/host'
            - '--web.listen-address=:9100'
            - '--collector.textfile.directory=/textfiles'
          logging:
            driver: json-file
            options:
              max-file: '5'
              max-size: 50m

        cadvisor:
          image: 10.10.11.40:80/base/cadvisor:v0.33.0
          container_name: monitor-cadvisor
          hostname: cadvisor
          restart: always
          privileged: true
          volumes:
            - /:/rootfs:ro
            - /var/run:/var/run:ro
            - /sys:/sys:ro
            - /var/lib/docker/:/var/lib/docker:ro
            - /dev/disk/:/dev/disk:ro
          ports:
            - 9080:8080
          networks:
            monitor:
          logging:
            driver: json-file
            options:
              max-file: '5'
              max-size: 50m

        blackbox_exporter:
          image: 10.10.11.40:80/base/blackbox-exporter:0.18.0
          container_name: monitor-blackbox
          hostname: blackbox-exporter
          restart: always
          privileged: true
          volumes:
            - ./blackbox_exporter/:/etc/blackbox_exporter
          networks:
            monitor:
              aliases:
                - blackbox
          command:
            - '--config.file=/etc/blackbox_exporter/blackbox.yml'
          logging:
            driver: json-file
            options:
              max-file: '5'
              max-size: 50m

      networks:
        monitor:
          ipam:
            config:
              - subnet: 192.168.17.0/24
      

      9.4 nginx

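      • Neither prometheus nor alertmanager ships with authentication, so nginx sits in front of them to handle routing and basic auth, and proxies all backend ports in one place for easier management.

      • Default ports of each component:

       prometheus: 9090
       grafana: 3000
       alertmanager: 9093
       node_exporter: 9100
       cadvisor: 8080 (client side)
      
      • Basic authentication with the stock nginx image:
       echo monitor:`openssl passwd -crypt 123456` > .htpasswd
      
      • A config file that is bind-mounted as a single file is not refreshed inside the container after it is edited on the host; the workaround used here is the following (mounting the directory instead of the file also works):

        chmod 666 nginx.conf
        
      • Reload the configuration inside the nginx container:
        docker exec -it monitor-nginx nginx -s reload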
        
      • nginx.conf

        [root@host40 monitor-bak]# cat nginx/nginx.conf
        user nginx;
        worker_processes auto;
        error_log /var/log/nginx/error.log;
        pid /run/nginx.pid;
        include /usr/share/nginx/modules/*.conf;
        events {
            worker_connections 10240;
        }
        http {
            log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                            '$status $body_bytes_sent "$http_referer" '
                            '"$http_user_agent" "$http_x_forwarded_for"';
            access_log /var/log/nginx/access.log main;
            sendfile on;
            tcp_nopush on;
            tcp_nodelay on;
            keepalive_timeout 65;
            types_hash_max_size 2048;
            include /etc/nginx/mime.types;
            default_type application/octet-stream;

            proxy_connect_timeout 500ms;
            proxy_send_timeout 1000ms;
            proxy_read_timeout 3000ms;
            proxy_buffers 64 8k;
            proxy_busy_buffers_size 128k;
            proxy_temp_file_write_size 64k;
            proxy_redirect off;
            proxy_next_upstream error invalid_header timeout http_502 http_504;
            proxy_http_version 1.1;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Real-Port $remote_port;
            proxy_set_header Host $http_host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            client_max_body_size 10m;
            client_body_buffer_size 512k;
            client_body_timeout 180;
            client_header_timeout 10;
            send_timeout 240;
            gzip on;
            gzip_min_length 1k;
            gzip_buffers 4 16k;
            gzip_comp_level 2;
            gzip_types application/javascript application/x-javascript text/css text/javascript image/jpeg image/gif image/png;
            gzip_vary off;
            gzip_disable "MSIE [1-6].";

            server {
                listen 3000;
                server_name _;

                location / {
                    proxy_pass http://grafana:3000;
                }
            }

            server {
                listen 9090;
                server_name _;

                location / {
                    auth_basic "auth for monitor";
                    auth_basic_user_file /etc/nginx/basic_auth/.htpasswd;
                    proxy_pass http://prometheus:9090;
                }
            }

            server {
                listen 9093;
                server_name _;

                location / {
                    auth_basic "auth for monitor";
                    auth_basic_user_file /etc/nginx/basic_auth/.htpasswd;
                    proxy_pass http://alertmanager:9093;
                }
            }
        }

        9.5 prometheus

        • Note: the db directory must be writable; give it mode 777.

        9.5.1 Main configuration file: prometheus.yml

        [root@host40 monitor-bak]# cat prometheus/prometheus.yml
        # my global config
        global:
          scrape_interval:  15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
          evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
          # scrape_timeout is set to the global default (10s).
        # Alertmanager configuration
        alerting:
          alertmanagers:
            - static_configs:
                - targets: ["alertmanager:9093"]
        # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
        rule_files:
          - "rules/*.yml"
        # A scrape configuration containing exactly one endpoint to scrape:
        # Here it's Prometheus itself.
        scrape_configs:
          # The job name is added as a label `job=` to any timeseries scraped from this config.
          - job_name: 'prometheus'
            static_configs:
              - targets: ['localhost:9090']

          - job_name: 'alertmanager'
            static_configs:
              - targets: ['alertmanager:9093']

          - job_name: 'node_real_lan'
            file_sd_configs:
              - files:
                  - ./sd_files/real_lan.yml
                refresh_interval: 30s

          - job_name: 'node_virtual_lan'
            file_sd_configs:
              - files:
                  - ./sd_files/virtual_lan.yml
                refresh_interval: 30s

          - job_name: 'node_real_wan'
            file_sd_configs:
              - files:
                  - ./sd_files/real_wan.yml
                refresh_interval: 30s

          - job_name: 'node_virtual_wan'
            file_sd_configs:
              - files:
                  - ./sd_files/virtual_wan.yml
                refresh_interval: 30s

          - job_name: 'docker_host'
            file_sd_configs:
              - files:
                  - ./sd_files/docker_host.yml
                refresh_interval: 30s

          - job_name: 'tcp'
            metrics_path: /probe
            params:
              module: [tcp_connect]
            file_sd_configs:
              - files:
                  - ./sd_files/tcp.yml
                refresh_interval: 30s
            relabel_configs:
              - source_labels: [__address__]
                target_label: __param_target
              - source_labels: [__param_target]
                target_label: instance
              - target_label: __address__
                replacement: blackbox:9115

          - job_name: 'http'
            metrics_path: /probe
            params:
              module: [http_2xx]
            file_sd_configs:
              - files:
                  - ./sd_files/http.yml
                refresh_interval: 30s
            relabel_configs:
              - source_labels: [__address__]
                target_label: __param_target
              - source_labels: [__param_target]
                target_label: instance
              - target_label: __address__
                replacement: blackbox:9115

          - job_name: 'icmp'
            metrics_path: /probe
            params:
              module: [icmp]
            file_sd_configs:
              - files:
                  - ./sd_files/icmp.yml
                refresh_interval: 30s
            relabel_configs:
              - source_labels: [__address__]
                target_label: __param_target
              - source_labels: [__param_target]
                target_label: instance
              - target_label: __address__
                replacement: blackbox:9115
        

        9.5.2 File-based service discovery for all nodes:

        • Write the targets to be monitored into the target file of the corresponding job. For example:
        ls prometheus/sd_files/
        docker_host.yml  http.yml  icmp.yml  real_lan.yml  real_wan.yml  sedFDm5Rw  tcp.yml  virtual_lan.yml  virtual_wan.yml
        
         cat prometheus/sd_files/docker_host.yml
         - targets: ['10.10.11.178:9080']
         - targets: ['10.10.11.99:9080']
         - targets: ['10.10.11.40:9080']
         - targets: ['10.10.11.35:9080']
         - targets: ['10.10.11.45:9080']
         - targets: ['10.10.11.46:9080']
         - targets: ['10.10.11.48:9080']
         - targets: ['10.10.11.47:9080']
         - targets: ['10.10.11.65:9081']
         - targets: ['10.10.11.61:9080']
         - targets: ['10.10.11.66:9080']
         - targets: ['10.10.11.68:9080']
         - targets: ['10.10.11.98:9080']
         - targets: ['10.10.11.75:9080']
         - targets: ['10.10.11.97:9080']
         - targets: ['10.10.11.179:9080']
        
         cat prometheus/sd_files/tcp.yml
         - targets: ['10.10.11.178:8001']
           labels:
             server_name: http_download
         - targets: ['10.10.11.178:3307']
           labels:
             server_name: xiaojing_db
         - targets: ['10.10.11.178:3001']
           labels:
             server_name: test_web
        

        9.5.3 Rules files:

        • docker rules:
        cat prometheus/rules/docker_monitor.yml
        groups:
          - name: "container monitor"
            rules:
              - alert: "Container down: env1"
                expr: time() - container_last_seen{name="env1"} > 60
                for: 30s
                labels:
                  severity: critical
                annotations:
                  summary: "Container down: {{$labels.instance}} name={{$labels.name}}"

        • tcp rules:
        cat prometheus/rules/tcp_monitor.yml
        groups:
          - name: blackbox_network_stats
            rules:
              - alert: blackbox_network_stats
                expr: probe_success == 0
                for: 1m
                labels:
                  severity: critical
                annotations:
                  summary: "Instance {{ $labels.instance }} ,server-name: {{ $labels.server_name }} is down"
                  description: "Connection failed..."

        • system rules: # cpu, mem, disk, network, filesystem...
        cat prometheus/rules/system_monitor.yml
        groups:
          - name: "system info"
            rules:
              - alert: "Host down"
                expr: up == 0
                for: 3m
                labels:
                  severity: critical
                annotations:
                  summary: "{{$labels.instance}}: host down"
                  description: "{{$labels.instance}}: host has been unreachable for more than 3 minutes"
              - alert: "System load too high"
                expr: (node_load1/count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))* on(instance) group_left(nodename) (node_uname_info) > 1.1
                for: 3m
                labels:
                  severity: warning
                annotations:
                  summary: "{{$labels.instance}}: system load too high"
                  description: "{{$labels.instance}}: system load too high."
                  value: "{{$value}}"
              - alert: "CPU usage above 90%"
                expr: 100-(avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 90
                for: 3m
                labels:
                  severity: critical
                annotations:
                  summary: "{{$labels.instance}}: CPU usage 90%"
                  description: "{{$labels.instance}}: CPU usage above 90%."
                  value: "{{$value}}"
              - alert: "Memory usage above 80%"
                expr: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)* on(instance) group_left(nodename) (node_uname_info) > 80
                for: 3m
                labels:
                  severity: critical
                annotations:
                  summary: "{{$labels.instance}}: memory usage 80%"
                  description: "{{$labels.instance}}: memory usage above 80%"
                  value: "{{$value}}"
              - alert: "IO time above 60%"
                expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) 85
                for: 3m
                labels:
                  severity: longtime
                annotations:
                  summary: "{{$labels.instance}}: disk partition usage above 85%"
                  description: "{{$labels.instance}}: disk partition usage above 85%"
                  value: "{{$value}}"
              - alert: "Disk will be full in 4 days"
                expr: predict_linear(node_filesystem_free_bytes[2h], 4 * 24 * 3600) < 0
                for: 3m
                labels:
                  severity: longtime
                annotations:
                  summary: "{{$labels.instance}}: a disk partition is predicted to be full within 4 days"
                  description: "{{$labels.instance}}: a disk partition is predicted to be full within 4 days"
                  value: "{{$value}}"

          9.6 alertmanager:

          • Note: the db directory must be writable.

          • Main configuration file:

          cat alertmanager/alertmanager.yml
          global:
            resolve_timeout: 5m
            smtp_smarthost: 'smtphz.qiye.163.com:25'
            smtp_from: 'XXX@fosafer.com'
            smtp_auth_username: 'XXX@fosafer.com'
            smtp_auth_password: 'XXX'
            smtp_hello: 'qiye.163.com'
            smtp_require_tls: true
          route:
            group_by: ['instance']
            group_wait: 30s
            receiver: default
            routes:
              - group_interval: 3m
                repeat_interval: 10m
                match:
                  severity: warning
                receiver: 'default'
              - group_interval: 3m
                repeat_interval: 30m
                match:
                  severity: critical
                receiver: 'default'
              - group_interval: 5m
                repeat_interval: 24h
                match:
                  severity: longtime
                receiver: 'default'
          templates:
            - ./templates/*.tmpl
          receivers:
            - name: 'default'
              email_configs:
                - to: 'xiangkaihua@fosafer.com'
                  send_resolved: true
              wechat_configs:
                - send_resolved: true
                  corp_id: 'XXX'
                  api_secret: 'XXX'
                  agent_id: 1000002
                  to_user: XXX
                  to_party: 2
                  message: '{{ template "wechat.html" . }}'
            - name: 'critical'
              email_configs:
                - to: '342382676@qq.com'
                  send_resolved: true
                - to: 'xiangkaihua@fosafer.com'
                  send_resolved: true

        • Alert template file:
          cat alertmanager/templates/wechat.tmpl
          {{ define "wechat.html" }}
          {{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts.Firing }}
          [@Alert~]
          Instance: {{ .Labels.instance }}
          Summary: {{ .Annotations.summary }}
          Details: {{ .Annotations.description }}
          Value: {{ .Annotations.value }}
          Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
          {{ end }}{{ end -}}
          {{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts.Resolved }}
          [@Resolved~]
          Instance: {{ .Labels.instance }}
          Summary: {{ .Annotations.summary }}
          Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
          Resolved: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
          {{ end }}{{ end -}}
          {{- end }}
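
The `28800e9` in the template is a duration in nanoseconds: 8 × 3600 × 10^9 ns = 8 hours, which shifts the UTC timestamps of the alert to UTC+8 before formatting. The same shift can be sketched in shell to cross-check a notification time; `utc8_time` is a hypothetical helper, and `date -d @...` assumes GNU date.

```shell
#!/usr/bin/env bash
# Sketch: format a Unix timestamp the way the template does,
# i.e. UTC plus 8 hours, as "YYYY-MM-DD HH:MM:SS".
utc8_time() {
    # $1: Unix epoch seconds (UTC)
    date -u -d "@$(( $1 + 8 * 3600 ))" '+%Y-%m-%d %H:%M:%S'
}

# 8 hours expressed in nanoseconds, the unit used by .Add in the template
EIGHT_HOURS_NS=$(( 8 * 3600 * 1000000000 ))
```

`utc8_time "$(date +%s)"` prints the current time in the same form that the WeChat message shows for `StartsAt`.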
          

        9.7 grafana

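        • Only the volume needs to be mounted; the configuration file needs no changes. The db directory stays small and persists the settings and dashboards.

        10.Client deployment

        10.1 Monitored hosts without docker: install node_exporter standalone

        • Installation script:
          http://10.10.11.178:8001/node_exporter_install.sh
          

        10.2 Monitored hosts running docker: install node_exporter and cadvisor with docker

        • Installation script:
          http://10.10.11.178:8001/node_exporter_install_docker.sh
          
        • Required images: docker hosts that have not added the 10.10.11.40:80 registry can download the saved images, load them first, and then run the installation:
          http://10.10.11.178:8001/monitor-client.tgz
          

        11.Using and maintaining prometheus

        11.1 Adding and removing monitored nodes with a script

        • Every job uses file-based service discovery, so writing a target into its sd_file is enough; the configuration does not need to be reloaded.

        • On top of that, a text-processing script serves as a front end to the sd_files: targets are added and removed from the command line, without editing the files by hand.

        • Script name: sd_controler.sh

        • Usage: run ./sd_controler.sh without arguments to print the usage message.

        • The complete script: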

          [root@host40 monitor]# cat sd_controler.sh
          #!/bin/bash
          # version: 1.0
          # Description: add | del | show instances (e.g. IP:PORT entries) in the prometheus file_sd files.
          # rl | vl | dk | rw | vw | tcp | http | icmp : short names of the prometheus jobs; each one maps to one sd_file.
          # tcp | http | icmp : a port alone rarely tells which service went down, so these entries must be
          #   added with a label (server_name by default); whoever receives the alert e-mail then sees
          #   immediately which service failed.
          # Only one instance can be added or deleted per call. For bulk changes, edit the files with vim
          #   or wrap this script in a for loop.

          ### vars
          SD_DIR=./prometheus/sd_files
          DOCKER_SD=$SD_DIR/docker_host.yml
          RL_HOST_SD=$SD_DIR/real_lan.yml
          VL_HOST_SD=$SD_DIR/virtual_lan.yml
          RW_HOST_SD=$SD_DIR/real_wan.yml
          VW_HOST_SD=$SD_DIR/virtual_wan.yml

          TCP_SD=$SD_DIR/tcp.yml
          HTTP_SD=$SD_DIR/http.yml
          ICMP_SD=$SD_DIR/icmp.yml

          SDFILE=

          ### funcs
          usage(){
              echo -e "Usage: $0 < rl | vl | dk | rw | vw | tcp | http | icmp > < add | del | show > [ IP:PORT | FQDN ] [ server-name ]"
              echo -e " example: \n\t node add:\t $0 rl add | del 10.10.10.10:9100\n\t tcp,http,icmp add:\t $0 tcp add 10.10.10.10:3306 web-mysql\n\t del:\t $0 http del www.baidu.com\n\t show:\t $0 rl | vl | dk | rw | vw | tcp | http | icmp show."
              exit
          }

          add(){
              # $1: SDFILE, $2: IP:PORT
              # append the target only if it is not listed yet
              grep -qF "'$2'" "$1" 2>/dev/null || echo -e "- targets: ['$2']" >> "$1"
          }

          del(){
              # $1: SDFILE, $2: IP:PORT
              sed -i "/'$2'/d" "$1"
          }

          add_with_label(){
              # $1: SDFILE, $2: [IP:[PORT]|FQDN], $3: SERVER-NAME
              LABEL_01="server_name"
              if ! grep -qF "'$2'" "$1" 2>/dev/null;then
                  echo -e "- targets: ['$2']" >> "$1"
                  echo -e "  labels:" >> "$1"
                  echo -e "    ${LABEL_01}: $3" >> "$1"
              fi
          }

          del_with_label(){
              # $1: SDFILE, $2: [IP:[PORT]|FQDN]
              # delete the targets line plus the two label lines that follow it
              NUM=$(grep -nF "'$2'" "$1" | head -n1 | cut -d: -f1)
              [ -z "$NUM" ] && return 1
              ENDNUM=$((NUM + 2))
              sed -i "${NUM},${ENDNUM}d" "$1"
          }

          action(){
              if [ "$1" == "add" ];then
                  add $SDFILE $2
              elif [ "$1" == "del" ];then
                  del $SDFILE $2
              elif [ "$1" == "show" ];then
                  cat $SDFILE
              fi
          }

          action_with_label(){
              if [ "$1" == "add" ];then
                  add_with_label $SDFILE $2 $3
              elif [ "$1" == "del" ];then
                  del_with_label $SDFILE $2 $3
              elif [ "$1" == "show" ];then
                  cat $SDFILE
              fi
          }

          ### main code
          [ "$2" == "" ] || [[ ! "$2" =~ ^(add|del|show)$ ]] && usage

          curl --version &>/dev/null || { echo -e "no curl found. " && exit 15; }

          # node targets: check that /metrics answers before adding, unless -f is given
          if [[ $1 =~ ^(rl|vl|rw|vw|dk)$ ]] && [ "$2" == "add" ];then
              [ "$3" == "" ] && usage

              if [ "$4" != "-f" ];then
                  COOD=$(curl -IL -o /dev/null --retry 3 --connect-timeout 3 -s -w "%{http_code}" http://$3/metrics)
                  [ "$COOD" != "200" ] && echo -e "http://$3/metrics is not reachable. check it again, or use -f to ignore this check." && exit 11
              fi
          fi

          if [[ $1 =~ ^(tcp|http|icmp)$ ]] && [ "$2" == "add" ];then
              [ "$4" == "" ] && echo -e "a server-name is required when adding tcp, http or icmp targets." && usage
          fi

          case $1 in
          rl)
              SDFILE=$RL_HOST_SD
              action $2 $3 && echo $2 OK
              ;;
          vl)
              SDFILE=$VL_HOST_SD
              action $2 $3 && echo $2 OK
              ;;
          dk)
              SDFILE=$DOCKER_SD
              action $2 $3 && echo $2 OK
              ;;
          rw)
              SDFILE=$RW_HOST_SD
              action $2 $3 && echo $2 OK
              ;;
          vw)
              SDFILE=$VW_HOST_SD
              action $2 $3 && echo $2 OK
              ;;
          tcp)
              SDFILE=$TCP_SD
              action_with_label $2 $3 $4 && echo $2 OK
              ;;
          http)
              SDFILE=$HTTP_SD
              action_with_label $2 $3 $4 && echo $2 OK
              ;;
          icmp)
              SDFILE=$ICMP_SD
              action_with_label $2 $3 $4 && echo $2 OK
              ;;
          *)
              usage
              ;;
          esac

        Article link: https://www.yunweipai.com/39155.html
