首页 Mysql教程数据库监控:MySQL/PostgreSQL 关键指标与告警阈值

数据库监控:MySQL/PostgreSQL 关键指标与告警阈值

运维派隶属马哥教育旗下专业运维社区,是国内成立最早的IT运维技术社区,欢迎关注公众号:yunweipai
领取学习更多免费Linux云计算、Python、Docker、K8s教程关注公众号:马哥linux运维

引子:一次被监控救回来的库

去年 Q2 一次大促。凌晨 1 点,订单库的 P99 延迟从 80ms 涨到 1.2s,业务没炸,但订单量明显下滑。

值班工程师打开 Grafana,看到三个现象:

  1. MySQL Threads_running 从 30 涨到 380
  2. InnoDB 行锁等待时间(Innodb_row_lock_time)从 50ms/s 涨到 4s/s
  3. Slow_queries 每分钟从 12 涨到 900+

但当时 CPU 只有 60%,iops 也只用了 50%。如果只看 CPU 不会觉得有问题。

继续往下查:

SHOW ENGINE INNODB STATUS\G

输出里有一段:

LATEST DETECTED DEADLOCK
------------------------
2026-04-15 01:00:12 0x7f8b9c001700
*** (1) TRANSACTION:
TRANSACTION 37281973, ACTIVE 0.001 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 3287123, query id 9382 ...
UPDATE orders SET status='paid' WHERE id=12345

定位是订单状态更新的高频路径上,行锁竞争激烈。orders 表有 8000 万行,热点 status 字段没建合适索引,每次更新都是范围扫描 + 锁升级。

修复路径:

  1. 紧急 kill 长事务:KILL 37281973
  2. 临时方案:把热点订单走单独的连接池,限流
  3. 根本修复:加索引 CREATE INDEX idx_orders_status_updated ON orders(status, updated_at)
  4. 拆表:把 90 天前的热数据归档到 orders_archive

从告警到定位到修复,花了 17 分钟。如果当时没把上面那三个指标告警阈值定好,至少要 1 小时才能反应过来。

这就是数据库监控的核心价值:在业务感知之前先发现异常。但「定好指标 + 定对阈值」是难点。阈值定低了天天狼来了,定高了真出事了又没告警。


一、问题背景

数据库是「有状态 + 数据价值最高」的基础组件。监控的核心目标有四层:

  1. 可用性:服务在线、连接池健康、主从复制正常
  2. 性能:QPS / TPS / 延迟、锁等待、缓冲命中率
  3. 容量:连接数、表空间、磁盘空间、binlog / WAL 增长
  4. 变更:DDL 变更、参数变更、慢查询、异常 SQL

MySQL 和 PostgreSQL 的监控体系差异较大:

维度MySQLPostgreSQL
内置监控SHOW STATUSSHOW ENGINE INNODB STATUSperformance_schemainformation_schemapg_stat_* 视图、pg_stat_statements 扩展、EXPLAIN (ANALYZE)
主从复制监控SHOW REPLICA STATUS(8.0.22+) / SHOW SLAVE STATUS(5.7 / 8.0 旧版)pg_stat_replication
锁监控performance_schema.data_locks(8.0+) / information_schema.innodb_trxpg_locks + pg_stat_activity
慢查询slow_query_log + performance_schema.events_statements_summary_by_digestlog_min_duration_statement + pg_stat_statements
缓冲池SHOW ENGINE INNODB STATUS 中 Buffer Pool 信息pg_stat_database 中 blks_hit / blks_read
死锁SHOW ENGINE INNODB STATUS 中 LATEST DETECTED DEADLOCKlog 中 deadlock detected
主流 Exportermysqld_exporterpostgres_exporter、pgbouncer_exporter
Grafana DashboardPercona MySQL Overview、MySQL InnoDB MetricsPostgreSQL Overview、pganalyze、Crunchy
高可用方案MHA / Orchestrator / Group Replication / InnoDB ClusterPatroni / repmgr / PgBouncer + keepalived
逻辑复制5.7 简单 / 8.0 增强(mysqlbinlog 配合)内建 publication / subscription
主从延迟单位秒(Seconds_Behind_Master字节(pg_stat_replication LSN 差) + 时间(PG 10+ replay_lag)

监控常见痛点:

  • 指标太多不知道看什么:MySQL 有 400+ status 变量,PG 有 200+ 系统视图
  • 版本差异:MySQL 5.7/8.0 字段名变化;PG 12/13/14/15 视图字段调整
  • 阈值不会定:照搬网上”CPU > 80% 告警”,结果 CPU 永远跑 30%
  • 告警风暴:一个慢查询触发 20 个告警
  • 指标语义不清:Threads_connected 和 Threads_running 区别
  • 生产环境不敢装新组件:装个 Exporter 都担心影响数据库

二、适用场景

不是所有业务都要上全指标监控,按规模选:

团队现状推荐方案
单库 < 100 QPS,几张表慢查询日志 + SHOW STATUS + 自建脚本
中小规模,1-10 个 MySQL / PG 实例Prometheus + mysqld_exporter / postgres_exporter + Grafana
大规模,50+ 实例,多集群Prometheus + 多种 Exporter + 自研聚合层 + 告警分级
金融 / 政企商业监控(Datadog / New Relic / 自研 APM)+ 国产化 Exporter
多云 / 跨 regionThanos / Cortex / VictoriaMetrics 聚合

不推荐 Prometheus 监控的场景:

  • 业务量极小,且 DBA 就在机器旁
  • 数据库要求强一致性,且不允许任何 Exporter 占用资源
  • 已有商业 DB 监控产品且合规要求必须用

三、核心知识点

3.1 MySQL 关键版本差异

监控脚本的版本兼容性必须先看版本:

维度MySQL 5.7MySQL 8.0
主从复制查询SHOW SLAVE STATUSSHOW REPLICA STATUS(8.0.22+) / SHOW SLAVE STATUS(兼容旧版)
行锁监控information_schema.innodb_trx + innodb_locks + innodb_lock_waitsperformance_schema.data_locks + data_lock_waits(5.7 的 innodb_locks 在 8.0 已删除)
性能模式performance_schema 默认开启 events_statements_% 部分默认开启更全,但消耗更多内存
系统表mysql.user 明文密码可看5.7 升级后需 ALTER USER 重置
索引提示USE INDEX / FORCE INDEX
默认字符集latin1utf8mb4
主键不强制不强制,但缺主键问题更明显
直方图ANALYZE TABLE ... UPDATE HISTOGRAM
锁视图information_schema.innodb_locksperformance_schema.data_locks
EXPLAIN 输出JSON 格式可选JSON 格式默认,tree 格式可选
隐藏索引ALTER TABLE ... ALTER INDEX ... INVISIBLE
资源组CREATE RESOURCE GROUP
窗口函数8.0.2+ 支持
CTE8.0.1+ 支持
Group Replication5.7 基础8.0 单主 / 多主模式更稳定
InnoDB Cluster5.7 基础8.0 内置 mysqlsh
默认 binlog 格式ROW(建议)ROW(同)
默认 binlog 过期永不过期30 天(binlog_expire_logs_seconds
默认密码认证mysql_native_passwordcaching_sha2_password
performance_schema 默认 instrument部分大部分开启

重点:5.7 升级到 8.0 时,监控脚本要把 innodb_locks 改成 data_locksSHOW SLAVE STATUS 用 SHOW REPLICA STATUS(也可保留兼容)。

3.2 PostgreSQL 关键版本差异

维度PG 12PG 13PG 14PG 15PG 16
分区表增强增强增强增强增强
pg_stat_statements字段齐全加 total_exec_time 拆分拆 shared_blks_* 字段加 wal_records / wal_fpi 字段字段调整
pg_stat_activity基础加 leader_pid加 query_id 字段
逻辑复制基础增强增强增强增强
pg_stat_replication基础基础加 replay_lag
默认隔离级别READ COMMITTED
归档pg_basebackup 基础增强增强增强增强
监控视图 pg_stat_io新增(IO 详情)
pg_stat_wal新增
JIT增强
pg_stat_progress_*部分部分增加 analyze / cluster / vacuum / create_index
监控 WAL 生成速率通过 LSN 推算pg_stat_wal
Btree 去重新增(deduplication)

监控 PG 时,pg_stat_statements 字段名每个大版本可能微调,必须以实际版本为准。PG 16+ 的 pg_stat_io 是排查 IO 瓶颈的关键视图。

3.3 MySQL 关键指标分类

把 MySQL 400+ status 变量按用途归类:

3.3.1 连接类

指标含义异常表现
Threads_connected当前连接数接近 max_connections
Threads_running正在执行查询的线程持续 > 30 说明有积压
Threads_created创建过的线程数持续增长说明连接池没复用
Connection_errors_*连接错误数持续增长说明被拒绝
Aborted_connects失败的连接尝试增长说明密码错或网络问题
Max_used_connections历史最大连接数接近 max_connections 时告警
Connection_errors_max_connections超过 max 拒绝次数增长 = 业务被打回

注意:Threads_connected 包含 sleep 状态的线程,不代表当前压力。Threads_running 才是真正在干活的线程。

3.3.2 吞吐类

指标含义异常表现
Questions服务器收到的查询数(COM_PING 等除外)异常突增 / 突减
Com_selectSELECT 总数异常突增
Com_insert/update/deleteDML 总数异常突增
Innodb_rows_inserted/updated/deleted/read影响行数异常突增
Bytes_received/sent网络流量异常突增
Created_tmp_tables内存临时表数突增
Created_tmp_disk_tables磁盘临时表数突增(说明 tmp_table_size 不足)

QPS / TPS 计算(基于两次采样):

# QPS
mysqladmin -uroot -p extended-status -r -i 1 | awk '/Queries/{q=$4} /Questions/{print "QPS:", $4-q}'

# TPS(提交 + 回滚)
mysqladmin -uroot -p extended-status -r -i 1 | awk '/Com_commit/{c=$4} /Com_rollback/{print "TPS:", $4+c}'

3.3.3 InnoDB 类

指标含义异常表现
Innodb_buffer_pool_pages_total总页数
Innodb_buffer_pool_pages_free空闲页持续 < 5% 说明内存不足
Innodb_buffer_pool_pages_data数据页持续高位
Innodb_buffer_pool_pages_misc杂项页增长说明 hash index / metadata 占用
Innodb_buffer_pool_read_requests逻辑读
Innodb_buffer_pool_reads物理读/ read_requests < 99% 命中率低
Innodb_data_reads/writes物理 I/O持续高值
Innodb_log_waits写日志等待> 0 说明 redo log 满
Innodb_row_lock_waits行锁等待次数突增
Innodb_row_lock_time行锁总等待时间突增
Innodb_deadlocks死锁次数突增
Innodb_history_list_lengthundo 长度持续 > 1000 说明 purge 跟不上
Innodb_log_writesredo 写次数突增
Innodb_os_log_fsyncsfsync 次数突增
Innodb_os_log_pending_fsyncs待 fsync 数> 0 卡顿
Innodb_os_log_pending_writes待写 WAL 数> 0 卡顿
Innodb_data_pending_fsyncs数据文件 fsync 队列> 0 卡顿
Innodb_data_pending_reads/writes数据文件 IO 队列> 0 IO 饱和
Innodb_mutex_os_waitsmutex 等待突增 = 热点
Innodb_rows_read读取行数/ Com_select 比例异常高说明全表扫描

3.3.4 复制类(5.7 / 8.0.22-)

SHOW SLAVE STATUS\G

关键字段:

字段含义异常
Slave_IO_RunningIO 线程No 表示 IO 中断
Slave_SQL_RunningSQL 线程No 表示回放中断
Seconds_Behind_Master主从延迟(秒)> 0 持续增长
Relay_Log_Spacerelay log 大小异常增长
Last_IO_ErrorIO 错误信息非空
Last_SQL_ErrorSQL 错误信息非空
Master_Log_File / Read_Master_Log_Pos主库位点不更新说明 IO 停
Relay_Master_Log_File / Exec_Master_Log_Pos已执行位点不更新说明 SQL 停

8.0.22+ 等价命令:

SHOW REPLICA STATUS\G

字段名换成 Replica_IO_RunningReplica_SQL_RunningSeconds_Behind_Source

3.3.5 Group Replication 监控(8.0+)

SELECT FROM performance_schema.replication_group_members;

关键字段:

字段含义异常
MEMBER_ID成员 UUID
MEMBER_HOST / MEMBER_PORT地址端口
MEMBER_STATEONLINE / RECOVERING / OFFLINE / ERRORONLINE 才是正常
MEMBER_ROLEPRIMARY / SECONDARY单主模式下应只有 1 PRIMARY
MEMBER_VERSIONMySQL 版本不一致告警

Group Replication 延迟:

SELECT
  MEMBER_ID,
  COUNT_TRANSACTIONS_IN_QUEUE AS queue_count,
  COUNT_TRANSACTIONS_CHECKED AS checked,
  COUNT_CONFLICTS_DETECTED AS conflicts,
  COUNT_TRANSACTIONS_REMOTE_APPLIED AS remote_applied
FROM performance_schema.replication_group_member_stats;

queue_count > 0 持续增长 = 二级节点跟不上。

3.3.6 临时表 / 排序类

指标含义异常
Created_tmp_disk_tables磁盘临时表数突增说明内存临时表不够
Created_tmp_tables内存临时表数突增
Sort_rows排序行数突增
Sort_merge_passes归并排序次数突增说明 sort_buffer 不足
Select_full_join没有用索引的 join> 0 是问题
Select_range_check每行后检查范围> 0 是问题
Select_scan全表扫描次数突增
Select_full_range_join范围 join突增

3.3.7 performance_schema 表(8.0+)

行锁和事务等待重点看:

-- 8.0+ 行锁视图
SELECT FROM performance_schema.data_locks LIMIT 10;

-- 8.0+ 锁等待关系
SELECT FROM performance_schema.data_lock_waits LIMIT 10;

-- 当前正在等待的事务
SELECT
  t.trx_id,
  t.trx_state,
  t.trx_started,
  TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) AS duration_sec,
  t.trx_mysql_thread_id,
  t.trx_query,
  t.trx_rows_modified,
  t.trx_tables_in_use,
  t.trx_tables_locked,
  t.trx_lock_structs
FROM information_schema.innodb_trx t
WHERE TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) > 30
ORDER BY t.trx_started;

SQL 统计(按 digest 聚合):

SELECT
  SCHEMA_NAME,
  digest,
  digest_text,
  count_star AS exec_count,
  sum_timer_wait / 1e9 AS total_ms,
  avg_timer_wait / 1e9 AS avg_ms,
  max_timer_wait / 1e9 AS max_ms,
  sum_lock_time / 1e9 AS lock_ms,
  sum_rows_examined AS rows_examined,
  sum_rows_sent AS rows_sent,
  sum_no_index_used AS no_index_count,
  sum_errors AS error_count,
  sum_warnings AS warning_count
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME IS NOT NULL
ORDER BY sum_timer_wait DESC
LIMIT 20;

3.3.8 InnoDB 历史列表与 purge

SHOW ENGINE INNODB STATUS\G

关键段落:

TRANSACTIONS
------------
Trx id counter 37281975
Purge done for trx's n:o < 37281960 undo n:o < 0
History list length 2547

History list length 持续 > 1000 说明 purge 跟不上,常见于:

  • 大事务未提交
  • innodb_purge_threads 默认 4,但可以调到 8 / 16
  • innodb_purge_batch_size 太小

Trx id counter 增长太快说明写入速率高,需要监控。

3.4 PostgreSQL 关键系统视图

3.4.1 pg_stat_activity(活跃会话)

字段含义
pid进程 ID
usename用户
application_name应用名
client_addr客户端 IP
stateactive / idle / idle in transaction / fastpath function call
query_start查询开始时间
state_change状态变更时间
wait_event_type等待类型(Lock / IO / LWLock / Activity)
wait_event具体等待事件
query当前 SQL
backend_xid / backend_xmin事务 ID
leader_pidparallel worker 领头(13+)
query_idSQL hash(16+)

state = 'idle in transaction' 是关键告警指标:事务挂着不提交,锁不释放。

排查活跃锁等待:

SELECT
  pg_class.relname,
  pg_locks.locktype,
  pg_locks.mode,
  pg_locks.granted,
  pg_stat_activity.usename,
  pg_stat_activity.query,
  pg_stat_activity.query_start,
  pg_stat_activity.state_change,
EXTRACT(EPOCH FROM (now() - pg_stat_activity.query_start)) AS duration_sec
FROM pg_locks
JOIN pg_class ON pg_locks.relation = pg_class.oid
JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE NOT pg_locks.granted
ORDER BY pg_stat_activity.query_start;

3.4.2 pg_stat_statements(SQL 统计)

需要先在 postgresql.conf 启用:

shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = top
pg_stat_statements.track_utility = on
pg_stat_statements.save = on

重启后创建扩展:

CREATE EXTENSION pg_stat_statements;

-- 验证
SELECT count(*) FROM pg_stat_statements;
-- 预期:> 0(执行过 SQL 才有)

关键字段:

字段含义
querySQL 模板(参数化)
calls调用次数
total_exec_time总执行时间(毫秒)
mean_exec_time平均执行时间
max_exec_time最大执行时间
rows总返回行数
shared_blks_hit共享缓冲命中块数
shared_blks_read共享缓冲读盘块数
temp_blks_read/written临时块读写
queryidSQL 哈希 ID
total_plan_time计划时间(13+)
wal_records / wal_fpiWAL 生成(13+)

缓存命中率:

SELECT
  sum(shared_blks_hit) / nullif(sum(shared_blks_hit) + sum(shared_blks_read), 0) AS hit_rate
FROM pg_stat_statements;

TOP 慢查询:

SELECT
  substring(query for 100) AS query_preview,
  calls,
  round(total_exec_time::numeric / 1000, 2) AS total_sec,
  round(mean_exec_time::numeric / 1000, 4) AS mean_sec,
  round(max_exec_time::numeric / 1000, 2) AS max_sec,
  rows,
  shared_blks_hit,
  shared_blks_read
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

3.4.3 pg_stat_database(库级)

字段含义
datname库名
numbackends当前连接数
xact_commit提交事务数
xact_rollback回滚事务数
blks_read物理读块数
blks_hit缓冲命中块数
tup_returned/fetched/inserted/updated/deleted行数统计
conflicts恢复冲突数
deadlocks死锁数
blk_read_time / blk_write_timeI/O 累计时间(需 track_io_timing
session_time / idle_in_transaction_time累计时间(13+)
active_timeactive 状态累计时间(14+)

需要 track_io_timing = on 才能拿到 blk_read_time / blk_write_time

3.4.4 pg_stat_replication(复制状态)

字段含义
pidwalsender 进程
usename复制用户
application_name备库标识
statestartup / catchup / streaming / backup / stopping
sent_lsn已发送 LSN
write_lsn备库已写 LSN
flush_lsn备库已刷盘 LSN
replay_lsn备库已回放 LSN
replay_lag / write_lag / flush_lag延迟(PG 10+)
sync_state / sync_priority同步优先级

延迟字节数计算:

SELECT
  application_name,
  state,
  pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag_bytes,
  replay_lag,
  write_lag,
  flush_lag
FROM pg_stat_replication;

3.4.5 pg_locks(锁视图)

SELECT
  pg_class.relname,
  pg_locks.locktype,
  pg_locks.mode,
  pg_locks.granted,
  pg_stat_activity.usename,
  pg_stat_activity.query,
  pg_stat_activity.query_start
FROM pg_locks
JOIN pg_class ON pg_locks.relation = pg_class.oid
JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE NOT pg_locks.granted
ORDER BY pg_locks.granted;

3.4.6 pg_stat_progress_vacuum(vacuum 进度,PG 9.6+)

SELECT
  pid,
  datname,
  relid::regclass,
  phase,
  heap_blks_total,
  heap_blks_scanned,
  heap_blks_vacuumed,
  index_vacuum_count,
  max_dead_tuples,
  num_dead_tuples
FROM pg_stat_progress_vacuum;

phase 字段:initializing / scanning heap / vacuuming indexes / vacuuming heap / cleaning up indexes / truncating heap / performing final cleanup

3.4.7 pg_stat_io(IO 详情,PG 16+)

SELECT FROM pg_stat_io;

字段:backend_type、object、context、reads、writes、extends、hits、read_time、write_time。

这是 PG 16 引入的关键视图,可以直接看到每类对象的 IO 分布。

3.5 关键告警阈值(基线参考)

下面给的阈值是经验值,不是绝对标准。所有阈值必须结合业务基线调整:

指标建议阈值说明
Threads_running> 30 持续 5 min高于 CPU 核心数 × 2 持续 5 min 告警
Buffer Pool 命中率< 99% 持续 10 min命中率低 = 物理 I/O 多
Innodb_row_lock_waits> 100/min突增说明行锁竞争
Innodb_deadlocks> 0/min立即告警
Innodb_history_list_length> 1000undo 清理慢
Slow_queries> N/min(N 依业务定)突增
主从延迟> 60s立即告警
复制 IO / SQL 线程No立即告警
磁盘空间> 80%业务影响前 1-2 天告警
连接数使用率> 80%接近 max_connections
Idle in transaction> 30s立即告警
pg_stat_replication.replay_lag> 60s立即告警
Cache hit rate< 99%持续 10 min 告警
WAL 堆积> 10 GB告警
Checkpoint 间隔< 30s说明 checkpoint 频繁
select_full_join 增量> 0任何增量都该看
Innodb_log_waits> 0redo log 不足
Group Replication queue_count> 100持续 > 5 min 告警
pg_stat_activity active 数> max_connections × 0.8接近极限
tmp_blks_written 速率> 100 MB/s临时表撑爆

四、整体排查或实施思路

按”先基础后深度,先指标后 SQL”:

  1. 第一步:基础资源监控(1 周)
    • CPU、内存、磁盘、网络、连接数
    • Prometheus node_exporter 即可
  2. 第二步:DB 内置指标(1-2 周)
    • 装 mysqld_exporter / postgres_exporter
    • 配置慢查询日志
    • 启用 performance_schema / pg_stat_statements
  3. 第三步:连接池监控(1 周)
    • ProxySQL / MaxScale(MySQL)
    • pgbouncer_exporter(PG)
    • 应用侧连接池(HikariCP、Druid)
  4. 第四步:复制与高可用(1-2 周)
    • 主从延迟、备库健康
    • MHA / Orchestrator / Patroni / Group Replication 状态
  5. 第五步:SQL 维度(2-4 周)
    • 慢查询分析
    • 全表扫描监控
    • 索引使用率
  6. 第六步:告警分级(持续)
    • P0:业务中断
    • P1:性能下降
    • P2:资源告警
    • P3:优化建议
  7. 第七步:自愈(高级)
    • 长事务自动 kill
    • 连接池自动扩容
    • 慢查询自动加入限流

五、实战步骤

5.1 MySQL 5.7 / 8.0 启用慢查询

# /etc/my.cnf 或 /etc/mysql/mysql.conf.d/mysqld.cnf
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 2  # 2 秒以上记为慢查询
log_queries_not_using_indexes = 1  # 没走索引的也记
log_slow_extra = ON  # 8.0 才有
log_throttle_queries_not_using_indexes = 100  # 每分钟最多记 100 条
min_examined_row_limit = 1000  # 扫描行数 < 1000 不记

应用:

# 在线打开(不重启)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;
SET GLOBAL log_queries_not_using_indexes = 'ON';

# 验证
SHOW VARIABLES LIKE 'slow_query_log%';
SHOW VARIABLES LIKE 'long_query_time';

5.2 MySQL 启用 performance_schema

# /etc/my.cnf
[mysqld]
performance_schema = ON
performance_schema_max_table_instances = 200
performance_schema_max_mutex_classes = 200
performance_schema_max_digest_length = 4096
performance_schema_max_index_stat = 10000

5.7 默认开启,8.0 默认更全。验证:

SHOW VARIABLES LIKE 'performance_schema';
-- 预期:ON

SELECT FROM performance_schema.setup_consumers WHERE enabled='YES';

-- 8.0 默认情况下,可以打开更详细的 events_statements 维度
UPDATE performance_schema.setup_consumers
SET enabled = 'YES'
WHERE name LIKE 'events_statements_%';

注意:performance_schema 默认开启会带来 5-10% 的性能开销,关键业务库在压测后再决定是否长期开启。

5.3 PostgreSQL 启用 pg_stat_statements

# /var/lib/pgsql/16/data/postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = top
pg_stat_statements.track_utility = on
pg_stat_statements.save = on
track_io_timing = on
log_min_duration_statement = 2000  # 单位 ms,记录 2 秒以上
log_lock_waits = on
log_temp_files = 0  # 记录所有临时文件
log_autovacuum_min_duration = 0  # 记录 autovacuum
log_checkpoint = on
log_connections = on
log_disconnections = on
log_line_prefix = '%m [%p] %q%u@%d/%a from %h '

重启后:

CREATE EXTENSION pg_stat_statements;

-- 验证
SELECT count(*) FROM pg_stat_statements;
-- 预期:> 0(执行过 SQL 才有)

5.4 安装 mysqld_exporter

# 1. 下载
VERSION="0.15.1"
wget https://github.com/prometheus/mysqld_exporter/releases/download/v${VERSION}/mysqld_exporter-${VERSION}.linux-amd64.tar.gz
tar xzf mysqld_exporter-${VERSION}.linux-amd64.tar.gz
sudo mv mysqld_exporter-${VERSION}.linux-amd64/mysqld_exporter /usr/local/bin/

# 2. 创建监控账号
mysql -uroot -p <<EOF
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'exporter_password_here' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
GRANT SELECT ON performance_schema.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
EOF

# 3. 配置 .my.cnf
sudo tee /etc/.mysqld_exporter.cnf <<'EOF'
[client]
user=exporter
password=exporter_password_here
host=127.0.0.1
port=3306
EOF
sudo chmod 600 /etc/.mysqld_exporter.cnf

# 4. 启动
mysqld_exporter \
  --config.my-cnf=/etc/.mysqld_exporter.cnf \
  --web.listen-address=:9104 \
  --collect.global_status \
  --collect.info_schema.innodb_metrics \
  --collect.info_schema.processlist \
  --collect.perf_schema.eventsstatements \
  --collect.perf_schema.eventswaits \
  --collect.perf_schema.file_events \
  --collect.perf_schema.tableio_waits

密码从环境变量或密钥管理读,不写在配置里。

5.4.1 systemd 服务

# /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=MySQL Exporter
After=network.target

[Service]
Type=simple
User=mysqld_exporter
EnvironmentFile=/etc/mysqld_exporter.env
ExecStart=/usr/local/bin/mysqld_exporter \
  --config.my-cnf=/etc/.mysqld_exporter.cnf \
  --web.listen-address=:9104
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
# /etc/mysqld_exporter.env(权限 600)
MYSQLD_EXPORTER_PASSWORD=xxx
sudo useradd -r -s /bin/false mysqld_exporter
sudo chown mysqld_exporter:mysqld_exporter /etc/mysqld_exporter.env
sudo chmod 600 /etc/mysqld_exporter.env
sudo systemctl daemon-reload
sudo systemctl enable --now mysqld_exporter
sudo systemctl status mysqld_exporter

5.5 安装 postgres_exporter

VERSION="0.15.0"
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v${VERSION}/postgres_exporter-${VERSION}.linux-amd64.tar.gz
tar xzf postgres_exporter-${VERSION}.linux-amd64.tar.gz
sudo mv postgres_exporter-${VERSION}.linux-amd64/postgres_exporter /usr/local/bin/

# 创建监控账号
sudo -u postgres psql <<EOF
CREATE USER postgres_exporter WITH PASSWORD 'xxx' INHERIT;
GRANT pg_read_all_stats TO postgres_exporter;
GRANT CONNECT ON DATABASE postgres TO postgres_exporter;
EOF

# 启动
DATA_SOURCE_NAME="postgresql://postgres_exporter:xxx@localhost:5432/postgres" \
  postgres_exporter --web.listen-address=:9187

5.5.1 systemd 服务

# /etc/systemd/system/postgres_exporter.service
[Unit]
Description=PostgreSQL Exporter
After=network.target

[Service]
Type=simple
User=postgres_exporter
Environment=DATA_SOURCE_NAME=postgresql://postgres_exporter:xxx@localhost:5432/postgres
ExecStart=/usr/local/bin/postgres_exporter \
  --web.listen-address=:9187 \
  --extend.query-path=/etc/postgres_exporter/queries.yaml
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

5.6 安装 pgbouncer_exporter

pgbouncer 自带 prometheus 兼容接口(1.21+),也可单独装 pgbouncer_exporter。

# /etc/pgbouncer.ini 添加
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
admin_users = pgb_admin
stats_users = pgb_stats
pool_mode = transaction
max_client_conn = 4000
default_pool_size = 20
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 600
# 验证
curl http://localhost:6432/metrics | head -30

5.7 安装 ProxySQL Exporter(MySQL 连接池)

ProxySQL 自带 prometheus 兼容接口(v2.0+):

-- ProxySQL 启用 metrics
SET admin-metrics_password='xxx';
LOAD ADMIN VARIABLES TO RUNTIME;
SAVE ADMIN VARIABLES TO DISK;
curl -u admin:xxx http://localhost:6032/metrics

5.8 Prometheus 抓取配置

# prometheus.yml
scrape_configs:
- job_name: 'mysqld'
    static_configs:
      - targets:
          - 'mysql-prod-1:9104'
          - 'mysql-prod-2:9104'
        labels:
          env: prod
          role: master

- job_name: 'postgres'
    static_configs:
      - targets:
          - 'pg-prod-1:9187'
          - 'pg-prod-2:9187'
        labels:
          env: prod
          role: master

- job_name: 'pgbouncer'
    static_configs:
      - targets:
          - 'pgbouncer-1:6432'
          - 'pgbouncer-2:6432'

- job_name: 'proxysql'
    static_configs:
      - targets:
          - 'proxysql-1:6032'

5.9 Grafana Dashboard 选型

Dashboard ID名称适用
7362MySQL OverviewMySQL 通用
11157MySQL InnoDB MetricsMySQL InnoDB 详细
12740Percona MySQLPercona 分支
9628PostgreSQL DatabasePG 通用
455PostgreSQL OverviewPG 通用
14881pgBouncer连接池
11807PG ReplicationPG 主从
17378MySQL Group ReplicationMGR
12227PatroniPatroni HA

Dashboard ID 可能因版本升级而变化,以 Grafana 官方仓库实际为准。

5.10 告警规则(MySQL)

# /etc/prometheus/rules/mysql.yml
groups:
- name: mysql_alerts
    rules:
      - alert: MySQLDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL 实例 {{ $labels.instance }} 不可达"
          description: "mysqld_exporter 无法连接 MySQL"

      - alert: MySQLTooManyConnections
        expr: |
          (mysql_global_status_threads_connected
            / mysql_global_variables_max_connections) * 100 > 80
for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL 连接数使用率 {{ $value | printf \"%.1f\" }}%"

      - alert: MySQLHighThreadsRunning
        expr: mysql_global_status_threads_running > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL Threads_running {{ $value }} 持续高"

      - alert: MySQLReplicationBroken
        expr: |
          mysql_slave_status_slave_io_running == 0
          or
          mysql_slave_status_slave_sql_running == 0
for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL 复制中断"
          description: "实例 {{ $labels.instance }} 复制线程停止"

      - alert: MySQLReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL 主从延迟 {{ $value }}s"

      - alert: MySQLBufferPoolLowHitRate
        expr: |
          (sum(rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]))
            - sum(rate(mysql_global_status_innodb_buffer_pool_reads[5m])))
            / sum(rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]))
            < 0.99
for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MySQL Buffer Pool 命中率 {{ $value | printf \"%.3f\" }}"

      - alert: MySQLDeadlocks
        expr: |
          increase(mysql_global_status_innodb_deadlocks[5m]) > 0
for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL 检测到死锁"

      - alert: MySQLSlowQueries
        expr: |
          increase(mysql_global_status_slow_queries[5m]) > 50
for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL 慢查询激增"

      - alert: MySQLHistoryListLong
        expr: mysql_global_status_innodb_history_list_length > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MySQL history list length {{ $value }}"

      - alert: MySQLLogWaits
        expr: increase(mysql_global_status_innodb_log_waits[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL 写 redo log 等待"

      - alert: MySQLGroupReplicationOffline
        expr: |
          count by (group_name) (
            mysql_perf_schema_replication_group_members_member_state{state="ONLINE"}
          ) < 3
for: 2m
        labels:
          severity: critical
        annotations:
          summary: "MySQL Group Replication 成员不足"

5.11 告警规则(PostgreSQL)

groups:
  - name: postgres_alerts
    rules:
      - alert: PostgresDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL {{ $labels.instance }} 不可达"

      - alert: PostgresTooManyConnections
        expr: |
          (sum(pg_stat_activity_count) by (instance)
            / on(instance) pg_settings_max_connections) * 100 > 80
for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL 连接数使用率 {{ $value | printf \"%.1f\" }}%"

      - alert: PostgresIdleInTransaction
        expr: |
          count(pg_stat_activity{state="idle in transaction"}) > 5
for: 2m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL 有 {{ $value }} 个 idle in transaction 会话"

      - alert: PostgresReplicationLag
        expr: |
          pg_replication_lag > 60
for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL 复制延迟 {{ $value }}s"

      - alert: PostgresLowCacheHitRate
        expr: |
          (sum(rate(pg_stat_database_blks_hit[5m])) by (instance)
            / (sum(rate(pg_stat_database_blks_hit[5m])) by (instance)
              + sum(rate(pg_stat_database_blks_read[5m])) by (instance))) < 0.99
for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL 缓存命中率 {{ $value | printf \"%.3f\" }}"

      - alert: PostgresDeadlocks
        expr: |
          increase(pg_stat_database_deadlocks[5m]) > 0
for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL 检测到死锁"

      - alert: PostgresLongRunningQuery
        expr: |
          (time() - pg_stat_activity_query_start_seconds) > 600
for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL 有运行超过 10 分钟的查询"
          description: "query: {{ $labels.query | default \"(null)\" }}"

      - alert: PostgresWALLag
        expr: |
          (pg_stat_replication_sent_lsn - pg_stat_replication_replay_lsn) > 100000000
for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL WAL 堆积"

      - alert: PostgresLockWaits
        expr: |
          count(pg_stat_activity{wait_event_type="Lock"}) > 10
for: 2m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL 有 {{ $value }} 个会话在等待锁"

      - alert: PatroniReplicaLag
        expr: |
          patroni_replica_lag > 60
for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Patroni 副本延迟 {{ $value }}s"

5.12 长事务自动 Kill 脚本(MySQL)

kill_long_tx.sh

#!/bin/bash
# 杀 MySQL 长事务(运行超过 60 秒的非复制线程)
# 注意:先 dry-run,确认再真杀

DRY_RUN=${DRY_RUN:-true}# 默认 dry-run
THRESHOLD_SECONDS=60
LOG_FILE="/var/log/kill_long_tx.log"

exec >> "${LOG_FILE}" 2>&1
echo "=== $(date '+%Y-%m-%d %H:%M:%S') run ==="

if [[ -z "${MYSQL_ROOT_PASSWORD}" ]]; then
echo "MYSQL_ROOT_PASSWORD 未设置,退出"
exit 1
fi

MYSQL_CMD="mysql -uroot -p${MYSQL_ROOT_PASSWORD} -h127.0.0.1 -B -N"

# 找出长事务
LONG_TRX=$(${MYSQL_CMD} -e "
SELECT
  CONCAT_WS('\t',
    trx_id,
    trx_started,
    TIMESTAMPDIFF(SECOND, trx_started, NOW()),
    trx_mysql_thread_id,
    REPLACE(REPLACE(IFNULL(trx_query, ''), '\n', ' '), '\t', ' '))
FROM information_schema.innodb_trx
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > ${THRESHOLD_SECONDS}
  AND trx_mysql_thread_id != CONNECTION_ID()
" 2>/dev/null)

if [[ -z "${LONG_TRX}" ]]; then
echo "No long transactions."
exit 0
fi

echo "Found long transactions:"
echo "${LONG_TRX}"
echo "---"

if [[ "${DRY_RUN}" == "true" ]]; then
echo "DRY_RUN=true, not killing."
exit 0
fi

# 杀
echo "${LONG_TRX}" | while IFS=$'\t' read -r trx_id started duration thread_id query; do
echo "Killing thread ${thread_id} (duration ${duration}s)..."
${MYSQL_CMD} -e "KILL ${thread_id};" 2>&1
echo "Killed ${thread_id}"
done

定时跑:

# /etc/cron.d/kill-long-tx
*/5 * * * * root DRY_RUN=false /usr/local/bin/kill_long_tx.sh

权限:

sudo chmod 700 /usr/local/bin/kill_long_tx.sh
sudo chown root:root /usr/local/bin/kill_long_tx.sh

风险:KILL 线程会回滚事务,可能造成业务侧短时错误。生产环境必须先 DRY_RUN=true 观察 1-2 周,确认误杀率 < 0.1% 再开 DRY_RUN=false。

5.13 PG 长事务自动 Kill

#!/bin/bash
# PG 长事务 / idle in transaction 自动 kill
DRY_RUN=${DRY_RUN:-true}
THRESHOLD_SECONDS=300
LOG_FILE="/var/log/kill_pg_long_tx.log"

exec >> "${LOG_FILE}" 2>&1
echo "=== $(date '+%Y-%m-%d %H:%M:%S') run ==="

if [[ -z "${PGPASSWORD}" ]]; then
echo "PGPASSWORD 未设置,退出"
exit 1
fi

PIDS=$(psql -U postgres -h 127.0.0.1 -t -A -F, -c "
SELECT pid
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'idle in transaction (aborted)')
  AND now() - state_change > interval '${THRESHOLD_SECONDS} seconds'
  AND pid != pg_backend_pid()
  AND application_name <> 'postgres_exporter'
" 2>/dev/null)

if [[ -z "${PIDS}" ]]; then
echo "No long idle in transaction."
exit 0
fi

echo "Found long idle in transaction: ${PIDS}"
if [[ "${DRY_RUN}" == "true" ]]; then
echo "DRY_RUN=true, not killing."
exit 0
fi

for pid in ${PIDS}; do
echo "Terminating PID ${pid}..."
  psql -U postgres -h 127.0.0.1 -c "SELECT pg_terminate_backend(${pid});" 2>&1
done

5.14 慢查询自动分析

analyze_slow_queries.sh

#!/bin/bash
# 解析 MySQL 慢查询日志,TOP 10 发送邮件

SLOW_LOG="/var/log/mysql/slow.log"
REPORT="/tmp/slow_query_report_$(date +%Y%m%d).txt"
TOP_N=10
MAIL_TO="dba@example.com"

# 用 pt-query-digest 解析
pt-query-digest "${SLOW_LOG}" --limit "${TOP_N}" > "${REPORT}" 2>&1

# 邮件
mail -s "[DB] Slow Query Report $(date +%Y-%m-%d)" \
"${MAIL_TO}" < "${REPORT}"

# 清理 30 天前报告
find /tmp -name "slow_query_report_*.txt" -mtime +30 -delete

5.15 pt-query-digest 完整示例

# 1. 实时分析
pt-query-digest --processlist h=localhost,u=root,p=xxx

# 2. 分析慢日志
pt-query-digest /var/log/mysql/slow.log

# 3. 输出到文件
pt-query-digest /var/log/mysql/slow.log > report.txt

# 4. 只看特定时间范围
pt-query-digest --since '2026-06-17 00:00:00' --until '2026-06-17 23:59:59' /var/log/mysql/slow.log

# 5. 看 general log
pt-query-digest --type genlog /var/log/mysql/general.log

# 6. 看 binlog(需要先解析)
mysqlbinlog --verbose /var/lib/mysql/mysql-bin.000123 | pt-query-digest --type binlog

# 7. 按用户过滤
pt-query-digest --filter '($event->{arg} =~ m/^SELECT/i)' /var/log/mysql/slow.log

# 8. 输出 JSON(便于二次处理)
pt-query-digest --output json /var/log/mysql/slow.log > slow.json

5.16 PG 慢查询分析工具对比

工具优点缺点
pg_stat_statements内建,无需额外组件看不到具体参数值
log_min_duration_statement + grep能看到完整 SQL文本分析不便
pg_query_governor自动 SQL 拦截商业
pgBadger完整 HTML 报表需要解析 log
auto_explain自动 EXPLAINlog 体量大
pg_stat_plans计划级别统计需额外安装

auto_explain 配置(自动记录慢查询的执行计划):

# postgresql.conf
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '2s'
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_format = 'json'
auto_explain.log_nested_statements = on

5.17 完整监控告警架构

                    ┌──────────────────┐
                    │  MySQL/Postgres  │
                    └────────┬─────────┘
                             │ (连接 + 慢日志 + binlog)
                             ▼
        ┌────────────────────────────────┐
        │  mysqld_exporter / postgres_   │
        │  exporter / pgbouncer_export   │
        └────────┬───────────────────────┘
                 │ /metrics
                 ▼
        ┌────────────────────────────────┐
        │       Prometheus               │
        │  + AlertManager + 告警规则     │
        └────────┬───────────────────────┘
                 │
        ┌────────┴─────────┐
        ▼                  ▼
   ┌─────────┐        ┌─────────┐
   │ Grafana │        │ 钉钉/   │
   │ 面板    │        │ 飞书/   │
   │         │        │ Slack   │
   └─────────┘        └─────────┘
                 ▲
                 │ (慢查询邮件)
                 │
        ┌────────┴───────────────┐
        │ 慢查询分析器           │
        │ (pt-query-digest /     │
        │  pg_query_governor)    │
        └────────────────────────┘

5.18 自动巡检脚本(每日)

daily_check.sh

#!/bin/bash
# MySQL 每日巡检
MYSQL="mysql -uroot -p${MYSQL_ROOT_PASSWORD} -h127.0.0.1 -B -N"
REPORT="/tmp/mysql_daily_$(date +%Y%m%d).txt"

exec > "${REPORT}" 2>&1

echo "=== $(date) 巡检 ==="

# 1. 连接数
echo "[连接数]"
${MYSQL} -e "SHOW STATUS LIKE 'Threads_connected'"
${MYSQL} -e "SHOW STATUS LIKE 'Threads_running'"
${MYSQL} -e "SHOW STATUS LIKE 'Max_used_connections'"

# 2. QPS / TPS
echo "[QPS/TPS]"
${MYSQL} -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Questions','Com_commit','Com_rollback')"

# 3. 缓冲池
echo "[Buffer Pool]"
${MYSQL} -e "
SELECT
  Variable_name,
  Variable_value
FROM performance_schema.global_status
WHERE Variable_name LIKE 'Innodb_buffer_pool_pages_%'
"

# 4. 长事务
echo "[长事务]"
${MYSQL} -e "
SELECT
  trx_id,
  trx_state,
  TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS duration_sec,
  trx_mysql_thread_id,
  LEFT(trx_query, 100) AS query_preview
FROM information_schema.innodb_trx
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 10
ORDER BY trx_started
LIMIT 10
"

# 5. 复制状态
echo "[复制状态]"
${MYSQL} -e "SHOW SLAVE STATUS\G" 2>/dev/null || ${MYSQL} -e "SHOW REPLICA STATUS\G"

# 6. 表空间
echo "[表空间]"
${MYSQL} -e "
SELECT
  table_schema,
  ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS size_mb,
  COUNT(*) AS table_count
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql','information_schema','performance_schema','sys')
GROUP BY table_schema
ORDER BY size_mb DESC
LIMIT 20
"

# 7. 慢查询数量
echo "[慢查询]"
${MYSQL} -e "SHOW GLOBAL STATUS LIKE 'Slow_queries'"

echo "=== 巡检结束 ==="

5.19 PG 自动巡检脚本

#!/bin/bash
# PG 每日巡检
PSQL="psql -U postgres -h 127.0.0.1 -A -F, -t"
REPORT="/tmp/pg_daily_$(date +%Y%m%d).csv"

exec > "${REPORT}" 2>&1
echo "metric,value"

# 1. 连接数
${PSQL} -c "SELECT 'connections', count(*) FROM pg_stat_activity"

# 2. 库级事务
${PSQL} -c "
SELECT
  'db_' || datname || '_xact_commit',
  xact_commit
FROM pg_stat_database
WHERE datname NOT IN ('template0','template1')
"

# 3. 长事务
${PSQL} -c "
SELECT
  'long_tx_count',
  count(*)
FROM pg_stat_activity
WHERE state IN ('active','idle in transaction')
  AND now() - xact_start > interval '5 minutes'
"

# 4. 复制延迟
${PSQL} -c "
SELECT
  'repl_lag_' || application_name,
  EXTRACT(EPOCH FROM replay_lag)::int
FROM pg_stat_replication
WHERE replay_lag IS NOT NULL
"

# 5. 表膨胀(用 pg_stat_user_tables.n_dead_tup)
${PSQL} -c "
SELECT
  'dead_tups_' || schemaname || '_' || relname,
  n_dead_tup
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20
"

# 6. 缓存命中率
${PSQL} -c "
SELECT
  'cache_hit_rate',
  ROUND(sum(blks_hit)::numeric / nullif(sum(blks_hit) + sum(blks_read), 0), 4)
FROM pg_stat_database
WHERE datname NOT IN ('template0','template1')
"

六、常用命令

6.1 连接 / 状态

# 当前连接
mysql -uroot -p -e "SHOW PROCESSLIST"

# 完整进程列表(不受 truncated_ground 限制)
mysql -uroot -p -e "SHOW FULL PROCESSLIST"

# 当前事务
mysql -uroot -p -e "
SELECT * FROM information_schema.innodb_trx\G
"

# 按主机统计连接
mysql -uroot -p -e "
SELECT
  SUBSTRING_INDEX(host, ':', 1) AS client_ip,
  count(*) AS conn_count
FROM information_schema.processlist
GROUP BY client_ip
ORDER BY conn_count DESC
"

# PG 类似
psql -U postgres -c "
SELECT
  client_addr,
  state,
  count(*)
FROM pg_stat_activity
GROUP BY client_addr, state
ORDER BY count DESC
"

6.2 锁 / 等待

# MySQL 当前锁等待
mysql -uroot -p -e "
SELECT
  r.trx_id AS waiting_trx,
  r.trx_mysql_thread_id AS waiting_thread,
  r.trx_query AS waiting_query,
  b.trx_id AS blocking_trx,
  b.trx_mysql_thread_id AS blocking_thread,
  b.trx_query AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
"

# 8.0+ 等价
mysql -uroot -p -e "
SELECT
  waiting_pid,
  waiting_query,
  blocking_pid,
  blocking_query
FROM performance_schema.data_lock_waits
JOIN performance_schema.events_statements_current w ON w.thread_id = waiting_pid
"

# PG 当前锁等待
psql -U postgres -c "
SELECT
  pg_class.relname,
  pg_locks.mode,
  pg_stat_activity.usename,
  pg_stat_activity.query,
  EXTRACT(EPOCH FROM (now() - pg_stat_activity.query_start)) AS wait_sec
FROM pg_locks
JOIN pg_class ON pg_locks.relation = pg_class.oid
JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE NOT pg_locks.granted
ORDER BY query_start
"

6.3 主从状态

# MySQL 主从
mysql -uroot -p -e "SHOW SLAVE STATUS\G"
mysql -uroot -p -e "SHOW REPLICA STATUS\G"# 8.0.22+

# MySQL Group Replication
mysql -uroot -p -e "SELECT * FROM performance_schema.replication_group_members"
mysql -uroot -p -e "SELECT * FROM performance_schema.replication_group_member_stats"

# PG 主从
psql -U postgres -c "SELECT * FROM pg_stat_replication"
psql -U postgres -c "SELECT pg_is_in_recovery()"

# PG 备库回放延迟(pg 10+)
psql -U postgres -c "
SELECT
  now() - pg_last_xact_replay_timestamp() AS replay_lag
"

6.4 表 / 索引

# MySQL 表大小
mysql -uroot -p -e "
SELECT
  table_schema,
  table_name,
  ROUND((data_length + index_length) / 1024 / 1024, 2) AS total_mb,
  table_rows
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql','information_schema','performance_schema','sys')
ORDER BY (data_length + index_length) DESC
LIMIT 20
"

# MySQL 索引使用情况
mysql -uroot -p -e "
SELECT
  object_schema,
  object_name,
  index_name,
  count_star,
  count_read,
  count_write,
  count_fetch,
  count_insert,
  count_update,
  count_delete
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE object_schema NOT IN ('mysql','sys')
ORDER BY count_star DESC
LIMIT 20
"

# PG 表大小
psql -U postgres -c "
SELECT
  schemaname,
  relname,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  pg_size_pretty(pg_relation_size(relid)) AS table_size,
  pg_size_pretty(pg_indexes_size(relid)) AS index_size,
  n_live_tup,
  n_dead_tup
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20
"

# PG 索引使用
psql -U postgres -c "
SELECT
  schemaname,
  relname,
  indexrelname,
  idx_scan,
  idx_tup_read,
  idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC
LIMIT 20
"

6.5 杀线程

# MySQL 杀线程
mysql -uroot -p -e "KILL 12345"
mysql -uroot -p -e "KILL CONNECTION 12345"# 同时断开连接

# PG 杀会话
psql -U postgres -c "SELECT pg_terminate_backend(12345)"
psql -U postgres -c "SELECT pg_cancel_backend(12345)"# 仅取消查询

风险:kill 不会自动通知应用,应用端会拿到连接异常,需要应用层重试。

6.6 binlog / WAL

# MySQL binlog 列表
mysql -uroot -p -e "SHOW BINARY LOGS"
mysql -uroot -p -e "SHOW MASTER STATUS"

# 解析 binlog
mysqlbinlog --verbose --base64-output=DECODE-ROWS /var/lib/mysql/mysql-bin.000123

# PG WAL 位置
psql -U postgres -c "SELECT pg_current_wal_lsn()"
psql -U postgres -c "SELECT pg_walfile_name()"
psql -U postgres -c "SELECT * FROM pg_ls_waldir()"

# PG WAL 归档状态
psql -U postgres -c "SELECT * FROM pg_stat_archiver"

6.7 性能基准

# MySQL sysbench 基准
sysbench oltp_read_write \
  --mysql-host=127.0.0.1 \
  --mysql-user=root \
  --mysql-password=xxx \
  --mysql-db=sbtest \
  --tables=10 \
  --table-size=1000000 \
  --threads=64 \
  --time=60 \
  --report-interval=5 \
  run

# PG pgbench 基准
pgbench -i -s 100 pgbench
pgbench -c 64 -j 8 -T 60 pgbench

七、配置示例

7.1 MySQL 主配置(生产级)

# /etc/my.cnf
[mysqld]
# 基础
user = mysql
port = 3306
bind-address = 0.0.0.0
datadir = /var/lib/mysql
socket = /var/lib/mysql/mysql.sock
pid-file = /var/run/mysqld/mysqld.pid
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
skip-name-resolve
default-time-zone = '+08:00'

# 连接
max_connections = 2000
max_user_connections = 1900
wait_timeout = 600
interactive_timeout = 600

# InnoDB
innodb_buffer_pool_size = 64G
innodb_buffer_pool_instances = 8
innodb_buffer_pool_chunk_size = 128M
innodb_log_file_size = 4G
innodb_log_files_in_group = 2
innodb_log_buffer_size = 64M
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT
innodb_file_per_table = 1
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
innodb_purge_threads = 8
innodb_change_buffering = all
innodb_adaptive_hash_index = 1

# binlog
server-id = 100
log_bin = /var/lib/mysql/mysql-bin
binlog_format = ROW
binlog_row_image = FULL
sync_binlog = 1
expire_logs_days = 7
max_binlog_size = 1G
binlog_group_commit_sync_delay = 0

# 慢查询
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 2
log_queries_not_using_indexes = 1
log_throttle_queries_not_using_indexes = 100

# performance_schema
performance_schema = ON
performance_schema_max_table_instances = 2000

# 复制
gtid_mode = ON
enforce_gtid_consistency = ON
relay_log = /var/lib/mysql/relay-bin
relay_log_recovery = ON
log_slave_updates = ON
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 8
slave_preserve_commit_order = ON

[mysqldump]
quick
quote-names
max_allowed_packet = 1G

[client]
default-character-set = utf8mb4

7.2 PostgreSQL 主配置(生产级)

# /var/lib/pgsql/16/data/postgresql.conf
# 基础
listen_addresses = '*'
port = 5432
max_connections = 1000
superuser_reserved_connections = 10
unix_socket_directories = '/var/run/postgresql'
timezone = 'Asia/Shanghai'
log_timezone = 'Asia/Shanghai'
shared_buffers = 16GB
huge_pages = try
temp_buffers = 16MB
work_mem = 64MB
hash_mem_multiplier = 2.0
maintenance_work_mem = 2GB
autovacuum_work_mem = 1GB

# 预写日志
wal_level = replica
wal_buffers = 64MB
wal_compression = on
max_wal_size = 4GB
min_wal_size = 1GB
max_wal_senders = 10
wal_keep_size = 1GB
wal_sender_timeout = 60s
wal_receiver_timeout = 60s

# 复制
max_replication_slots = 10
hot_standby = on
hot_standby_feedback = on
wal_receiver_status_interval = 10s

# 计划器
random_page_cost = 1.1
effective_cache_size = 48GB
default_statistics_target = 500

# 自动 vacuum / analyze
autovacuum = on
autovacuum_max_workers = 4
autovacuum_naptime = 30s
autovacuum_vacuum_threshold = 100
autovacuum_analyze_threshold = 100
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.025
autovacuum_vacuum_cost_limit = 1000
autovacuum_vacuum_cost_delay = 10ms

# checkpoint
checkpoint_completion_target = 0.9
checkpoint_timeout = 15min
bgwriter_lru_maxpages = 1000
bgwriter_delay = 50ms

# 日志
logging_collector = on
log_destination = 'stderr'
log_directory = 'log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1d
log_rotation_size = 100MB
log_min_duration_statement = 2000
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
log_line_prefix = '%m [%p] %q%u@%d/%a from %h '
log_statement = 'ddl'

# 监控扩展
shared_preload_libraries = 'pg_stat_statements,auto_explain'
pg_stat_statements.max = 10000
pg_stat_statements.track = top
auto_explain.log_min_duration = '5s'
auto_explain.log_analyze = on
auto_explain.log_format = 'json'

# 资源
track_io_timing = on
track_activities = on
track_counts = on
track_functions = pl

# 锁
max_locks_per_transaction = 256
deadlock_timeout = 1s
lock_timeout = 30s
idle_in_transaction_session_timeout = 10min
statement_timeout = 0

7.3 mysqld_exporter systemd 单元

# /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=Prometheus MySQL Exporter
After=network.target mysqld.service
Wants=mysqld.service

[Service]
Type=simple
User=mysqld_exporter
Group=mysqld_exporter
EnvironmentFile=-/etc/default/mysqld_exporter
ExecStart=/usr/local/bin/mysqld_exporter \
  --config.my-cnf=/etc/.mysqld_exporter.cnf \
  --web.listen-address=:9104 \
  --web.tls-cert-file=/etc/ssl/certs/exporter.crt \
  --web.tls-key-file=/etc/ssl/private/exporter.key
Restart=on-failure
RestartSec=5s

# 安全加固
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/log/mysqld_exporter

[Install]
WantedBy=multi-user.target

7.4 AlertManager 路由

# alertmanager.yml
route:
receiver: 'default'
group_by: ['alertname', 'instance', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warning'
    - match_re:
        severity: critical
      receiver: 'dingtalk-critical'

receivers:
- name: 'default'
    webhook_configs:
      - url: 'http://alert-relay/alert'

- name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<key>'
        send_resolved: true

- name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/xxx'
        channel: '#db-alerts'

- name: 'dingtalk-critical'
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'

7.5 ProxySQL 配置

# /etc/proxysql.cnf
datadir="/var/lib/proxysql"

admin_variables=
{
  admin_credentials="admin:xxx;cluster_admin:yyy"
  mysql_ifaces="0.0.0.0:6032"
}

mysql_variables=
{
  threads=4
  max_connections=2048
  default_query_delay=0
  default_query_timeout=36000000
  have_compress=true
  poll_timeout=2000
  interfaces="0.0.0.0:6033"
  default_schema="information_schema"
  stacksize=1048576
  server_version="8.0.36"
  connect_timeout_server=3000
  monitor_username="proxysql_monitor"
  monitor_password="xxx"
  monitor_history=600000
  monitor_connect_interval=60000
  monitor_ping_interval=10000
  monitor_read_only_interval=1500
  monitor_read_only_timeout=500
  ping_interval_server_msec=120000
  ping_timeout_server=500
  commands_stats=true
  sessions_sort=true
  connect_retries_on_failure=10
}

mysql_servers =
(
  { address="mysql-1", port=3306, hostgroup=0, max_connections=1000 },
  { address="mysql-2", port=3306, hostgroup=0, max_connections=1000 },
  { address="mysql-1", port=3306, hostgroup=1, max_connections=1000 },
  { address="mysql-2", port=3306, hostgroup=1, max_connections=1000 }
)

mysql_replication_hostgroups =
(
  { writer_hostgroup=0, reader_hostgroup=1, comment="MySQL Replication" }
)

mysql_query_rules =
(
  {
    rule_id=100
    active=1
    match_pattern="^SELECT .* FOR UPDATE$"
    destination_hostgroup=0
    apply=1
  },
  {
    rule_id=200
    active=1
    match_pattern="^SELECT .*"
    destination_hostgroup=1
    apply=1
  }
)

7.6 pgBouncer 配置

# /etc/pgbouncer/pgbouncer.ini
[databases]
app_db = host=pg-1 port=5432 dbname=app
app_db_ro = host=pg-2 port=5432 dbname=app

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
admin_users = pgb_admin
stats_users = pgb_stats
pool_mode = transaction
max_client_conn = 4000
default_pool_size = 20
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 600
server_lifetime = 3600
server_connect_timeout = 15
query_wait_timeout = 120
client_idle_timeout = 0
client_login_timeout = 60
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
stats_period = 60

7.7 Patroni 配置

# /etc/patroni.yml
scope: pg-cluster
namespace: /pg-cluster
name: pg-1

restapi:
listen: 0.0.0.0:8008
connect_address: pg-1:8008

etcd:
hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379

bootstrap:
dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10

postgresql:
listen: 0.0.0.0:5432
connect_address: pg-1:5432
data_dir: /var/lib/pgsql/16/data
pgpass: /tmp/pgpass0
authentication:
    superuser:
      username: postgres
      password: xxx
    replication:
      username: replicator
      password: xxx
parameters:
    unix_socket_directories: '/var/run/postgresql'

tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false

八、日志或指标观察方法

8.1 MySQL 错误日志

# 默认位置
/var/log/mysqld.log
/var/log/mysql/error.log

# 关键关键字
grep -iE "error|warning|innodb|deadlock|abort|kill" /var/log/mysqld.log | tail -50

# 重启日志
grep "ready for connections" /var/log/mysqld.log

# Binlog Dump 线程问题
grep "Slave I/O" /var/log/mysqld.log

8.2 MySQL 慢查询日志分析

# 直接 grep
grep -c "Query_time:" /var/log/mysql/slow.log

# TOP 10 慢查询
pt-query-digest /var/log/mysql/slow.log --limit 10

# 看某时间段
grep "Time: 2026-06-17T0[0-3]" /var/log/mysql/slow.log | head

8.3 performance_schema 常用查询

# 当前等待事件最多的 SQL
mysql -uroot -p -e "
SELECT
  digest_text,
  count_star,
  sum_timer_wait / 1e9 AS total_ms
FROM performance_schema.events_statements_summary_by_digest
WHERE digest_text IS NOT NULL
ORDER BY sum_timer_wait DESC
LIMIT 10
"

# 哪些表被全表扫描
mysql -uroot -p -e "
SELECT
  object_schema,
  object_name,
  count_read,
  count_fetch
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE index_name IS NULL
  AND count_read > 0
ORDER BY count_read DESC
LIMIT 10
"

# 哪些索引从未被用
mysql -uroot -p -e "
SELECT
  object_schema,
  object_name,
  index_name
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE index_name IS NOT NULL
  AND count_star = 0
  AND object_schema NOT IN ('mysql','sys')
ORDER BY object_schema, object_name
"

# 当前文件 IO 热点
mysql -uroot -p -e "
SELECT
  file_name,
  event_name,
  count_star,
  sum_timer_wait / 1e9 AS total_ms
FROM performance_schema.file_io_summary_by_instance
ORDER BY sum_timer_wait DESC
LIMIT 10
"

8.4 PostgreSQL 日志

# 默认位置
/var/log/postgresql/postgresql-16-main.log

# 错误日志
grep -iE "error|fatal|panic" /var/log/postgresql/postgresql-16-main.log | tail -50

# 慢查询
grep "duration:" /var/log/postgresql/postgresql-16-main.log | tail -20

# Checkpoint 记录
grep "checkpoint" /var/log/postgresql/postgresql-16-main.log | tail

# 自动 vacuum 记录
grep "automatic vacuum" /var/log/postgresql/postgresql-16-main.log | tail

# 死锁
grep "deadlock detected" /var/log/postgresql/postgresql-16-main.log

8.5 PG 关键视图巡检

# 表膨胀 Top 10
psql -U postgres -c "
SELECT
  schemaname || '.' || relname AS table_name,
  n_live_tup,
  n_dead_tup,
  ROUND(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
  last_autovacuum,
  last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10
"

# 长事务 Top 10
psql -U postgres -c "
SELECT
  pid,
  usename,
  state,
  now() - xact_start AS xact_duration,
  LEFT(query, 100) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '1 second'
ORDER BY xact_start
LIMIT 10
"

# 未使用索引
psql -U postgres -c "
SELECT
  schemaname || '.' || relname AS table_name,
  indexrelname AS index_name,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20
"

# 复制槽状态
psql -U postgres -c "
SELECT
  slot_name,
  plugin,
  slot_type,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size
FROM pg_replication_slots
"

8.6 Buffer Pool 状态观察

SHOW ENGINE INNODB STATUS\G

关键段落:

----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 8589934592
Dictionary memory allocated 133193
Buffer pool size   524272
Free buffers       1024
Database pages     523000
Old database pages 193245
Modified db pages  0

Free buffers / Buffer pool size < 5% 说明内存吃紧。

8.0+ 可以查 Buffer Pool 各索引命中率:

SELECT
  pool_id,
  hot_size,
  warm_size
FROM information_schema.innodb_buffer_pool_stats;

九、排查路径

9.1 MySQL 连接数爆满

现象:应用报 Too many connections

排查顺序:

  1. 看当前连接数:
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';
  1. 看哪些主机占用最多:
SELECT
  SUBSTRING_INDEX(host, ':', 1) AS client_ip,
  db,
  command,
  count(*)
FROM information_schema.processlist
GROUP BY client_ip, db, command
ORDER BY count(*) DESC
LIMIT 10;
  1. 看是否有 sleep 但不释放:
SELECT
  count(*)
FROM information_schema.processlist
WHERE command = 'Sleep'
  AND time > 600;
  1. 看连接错误原因:
SHOW STATUS LIKE 'Connection_errors%';
  1. 处理:
    • 应用层修连接池配置(HikariCP maxLifetime)
    • 临时 SET GLOBAL max_connections = 3000 缓解
    • 加 ProxySQL / RDS Proxy 中间层
    • 杀 sleep 线程:KILL <id>

9.2 MySQL 主从延迟

现象:备库追不上主库,业务读备库拿到旧数据。

排查顺序:

  1. 看延迟数值:
SHOW SLAVE STATUS\G
-- 看 Seconds_Behind_Master
-- 或 8.0.22+:
SHOW REPLICA STATUS\G
-- 看 Seconds_Behind_Source
  1. 看主库位点:
SHOW MASTER STATUS;
  1. 看 relay log 是否堆积:
SHOW SLAVE STATUS\G
-- 看 Relay_Log_Space
  1. 看备库回放速率:
SHOW SLAVE STATUS\G
-- 对比 Exec_Master_Log_Pos 的增长速率
  1. 常见原因:
    • 备库配置低(CPU/磁盘差)
    • 单线程回放:开启 slave_parallel_workers > 1
    • 大事务:Binlog_Group_Commit 看大事务
    • 网络抖动
    • 主库大量 DDL
  2. 处理:
    • 加并行复制
    • 拆大事务
    • 优化备库 IO(切 SSD)
    • 网络专线优化

9.3 MySQL 死锁

现象:应用报 Deadlock found when trying to get lock; try restarting transaction

排查:

  1. SHOW ENGINE INNODB STATUS\G 找 LATEST DETECTED DEADLOCK
  2. 启用 innodb_print_all_deadlocks = ON 把所有死锁都打到错误日志
  3. 看应用代码是否正确处理死锁(必须重试)
  4. 加合适的索引(让锁粒度更细)
  5. 调整事务粒度(避免长事务)

9.4 PostgreSQL 长事务 / 锁等待

现象:应用请求 hang 住。

排查:

-- 1. 看当前 active
SELECT FROM pg_stat_activity WHERE state = 'active';

-- 2. 看锁等待
SELECT
  pg_class.relname,
  pg_locks.mode,
  pg_stat_activity.usename,
  pg_stat_activity.query,
EXTRACT(EPOCH FROM (now() - pg_stat_activity.query_start)) AS wait_sec
FROM pg_locks
JOIN pg_class ON pg_locks.relation = pg_class.oid
JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE NOT pg_locks.granted;

-- 3. 看 idle in transaction
SELECT
  pid,
  usename,
  state,
  now() - state_change AS idle_sec,
LEFT(query, 100) AS query_preview
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'idle in transaction (aborted');

-- 4. 杀
SELECT pg_terminate_backend(<pid>);
SELECT pg_cancel_backend(<pid>);

9.5 数据库磁盘突增

现象:df -h 报警。

排查:

# 1. 看哪个目录涨
du -sh /var/lib/mysql/* | sort -h | tail
du -sh /var/lib/pgsql/* | sort -h | tail

# 2. MySQL binlog 占用
mysql -uroot -p -e "SHOW BINARY LOGS"

# 3. PG WAL 占用
psql -U postgres -c "SELECT * FROM pg_ls_waldir() ORDER BY size DESC LIMIT 10"

# 4. PG 表膨胀
psql -U postgres -c "
SELECT
  schemaname || '.' || relname AS table_name,
  pg_size_pretty(pg_relation_size(relid)) AS size
FROM pg_stat_user_tables
ORDER BY pg_relation_size(relid) DESC
LIMIT 20
"

# 5. 大表索引膨胀
psql -U postgres -c "
SELECT
  schemaname || '.' || relname AS table_name,
  indexrelname AS index_name,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20
"

处理:

  • MySQL:PURGE BINARY LOGS TO 'mysql-bin.000123';
  • MySQL:清理 relay-log 文件
  • PG:VACUUM FULL <table>(注意锁表,建议低峰)
  • PG:增加 wal_keep_size 限制,启用归档

9.6 PG 主从切换后脑裂

Patroni + etcd 选举脑裂排查:

# 1. 看 Patroni 状态
patronictl -c /etc/patroni.yml list

# 2. 看 etcd 集群
etcdctl member list
etcdctl endpoint status --cluster -w table

# 3. 看每个 Patroni 节点 leader 标识
curl -s http://pg-1:8008/patroni | jq . | grep -E "role|state"

# 4. 检查 split-brain
for h in pg-1 pg-2 pg-3; do
echo "$h:"
  curl -s http://$h:8008/patroni | jq -r '"  state=\(.state) role=\(.role) timeline=\(.timeline)"'
done

每个节点应只有 1 个 leader,否则脑裂。

9.7 故障树:MySQL 慢查询爆增

MySQL 慢查询突增
├─ 业务量上涨
│  └─ 验证:QPS 同步上涨
├─ 索引失效
│  └─ 验证:EXPLAIN 走全表扫描
├─ 统计信息过期
│  └─ 验证:执行计划 rows 估计与实际偏差大
├─ 锁等待严重
│  └─ 验证:Innodb_row_lock_time 突增
├─ Buffer Pool 命中率低
│  └─ 验证:Innodb_buffer_pool_reads 突增
├─ 磁盘 IO 饱和
│  └─ 验证:iostat await > 10ms
├─ 临时表落盘
│  └─ 验证:Created_tmp_disk_tables 突增
└─ 业务慢 SQL 上线
   └─ 验证:pt-query-digest TOP 1 是新 SQL

十、风险提醒

10.1 KILL 风险

  • KILL <thread_id>:仅取消查询,连接保留。事务会被回滚。
  • KILL CONNECTION <thread_id>:断开连接,事务回滚。
  • 风险:应用端会拿到 Query execution was interrupted 或 Lost connection,需要应用重试。
  • 必要条件:DRY_RUN 跑 1-2 周,确认无误杀再加到 cron。

10.2 DROP / TRUNCATE 风险

  • DROP TABLE:不可恢复,必须先备份
  • TRUNCATE TABLE:不可回滚
  • 必须双确认 + 备份:
mysqldump -uroot -p --single-transaction --master-data=2 db_name table_name > /backup/table_name_$(date +%Y%m%d).sql
# 确认备份成功(> 0 KB)后再执行
mysql -uroot -p db_name -e "DROP TABLE table_name;"

10.3 性能参数风险

  • innodb_buffer_pool_size 调整必须先看机器空闲内存
  • max_connections 增大必须看 open_files_limit
  • long_query_time 调小会导致 slow log 暴涨

10.4 复制操作风险

  • 主从切换:RESET SLAVE 会清空所有复制信息
  • 主从切换前必须确认备库 Seconds_Behind_Master = 0
  • GTID 模式下注意 SET GTID_PURGED 不要乱设

10.5 数据库版本升级

  • 5.7 → 8.0 升级必须先用 mysql_upgrade --check
  • 升级前必须看官方 release notes 的 breaking changes
  • 主从升级先升级从库,再升级主库
  • 升级窗口前必须备份

10.6 Exporter 风险

  • Exporter 账号密码必须从环境变量或密钥管理读取
  • Exporter 账号权限最小化(只 SELECT,不要给 DELETE/DROP)
  • Exporter 自身要监控(挂了也要告警)

10.7 自动 kill 风险

  • 默认 DRY_RUN=true
  • 必须设置白名单:复制线程、内部 DBA 会话不能 kill
  • 必须有审计日志

10.8 数据迁移风险

  • 走 gh-ost / pt-online-schema-change 加索引,不要直接 ALTER TABLE
  • 走 mysqldumper / mydumper 并行备份,不要单线程 mysqldump
  • PG 大表 ALTER 走 pg_repack,不锁表

十一、验证方式

11.1 Exporter 验证

# 端口在监听
ss -lntp | grep 9104

# 拉取指标
curl -s http://localhost:9104/metrics | head -20
curl -s http://localhost:9104/metrics | grep -E "mysql_up|mysql_global_status"

# 验证指标有数据
curl -s http://localhost:9104/metrics | grep "mysql_global_status_threads_connected"

11.2 Prometheus 验证

# 在 Prometheus 控制台执行
mysql_up
mysql_global_status_threads_connected
mysql_global_status_threads_running
rate(mysql_global_status_questions[1m])

11.3 Grafana 验证

  • Dashboard 能看到数据
  • Variables 下拉能选到目标
  • 时间范围切换正常
  • 告警状态对(绿色 / 红色)

11.4 告警演练

定期演练(每季度):

# 制造一个假的死锁场景
# 在测试库跑
mysql -uroot -p test_db -e "
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
SELECT SLEEP(30);
COMMIT;
"

# 同时在另一个会话
mysql -uroot -p test_db -e "
START TRANSACTION;
UPDATE accounts SET balance = balance + 100 WHERE id = 1;
COMMIT;
"

# 观察 alertmanager 是否触发死锁告警

11.5 自愈验证

# 1. 制造长事务
mysql -uroot -p -e "
START TRANSACTION;
SELECT SLEEP(120);
"

# 2. 跑 kill_long_tx.sh DRY_RUN=true,看输出
DRY_RUN=true /usr/local/bin/kill_long_tx.sh
cat /var/log/kill_long_tx.log

# 3. 确认无误杀后改 DRY_RUN=false

11.6 数据库健康度自检脚本

#!/bin/bash
# db_health_check.sh

if [[ -z "${MYSQL_ROOT_PASSWORD}" ]]; then
echo "MYSQL_ROOT_PASSWORD 未设置"
exit 1
fi

MYSQL="mysql -uroot -p${MYSQL_ROOT_PASSWORD} -h127.0.0.1 -B -N"

echo "=== MySQL 健康度 ==="

# 连接数使用率
USED=$(${MYSQL} -e "SHOW STATUS LIKE 'Threads_connected'" | awk '{print $2}')
MAX=$(${MYSQL} -e "SHOW VARIABLES LIKE 'max_connections'" | awk '{print $2}')
PCT=$((USED * 100 / MAX))
echo "连接使用率: ${USED}/${MAX} (${PCT}%)"

# Buffer Pool 命中率
HIT=$(${MYSQL} -e "
SELECT
  ROUND(
    (1 - (
      VARIABLE_VALUE / (
        SELECT VARIABLE_VALUE
        FROM performance_schema.global_status
        WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests'
      )
    )) * 100, 2)
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads'
" 2>/dev/null)
echo "Buffer Pool 命中率: ${HIT}%"

# 慢查询
SLOW=$(${MYSQL} -e "SHOW GLOBAL STATUS LIKE 'Slow_queries'" | awk '{print $2}')
echo "累计慢查询: ${SLOW}"

# 复制状态
${MYSQL} -e "SHOW SLAVE STATUS\G" 2>/dev/null | grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind"

十二、回滚方案

12.1 Exporter 升级回滚

# 1. 备份旧版本
sudo cp /usr/local/bin/mysqld_exporter /usr/local/bin/mysqld_exporter.bak

# 2. 升级
sudo systemctl stop mysqld_exporter
sudo cp mysqld_exporter-new /usr/local/bin/mysqld_exporter
sudo systemctl start mysqld_exporter

# 3. 异常回滚
sudo systemctl stop mysqld_exporter
sudo cp /usr/local/bin/mysqld_exporter.bak /usr/local/bin/mysqld_exporter
sudo systemctl start mysqld_exporter

12.2 告警规则回滚

# Prometheus 规则目录
RULES_DIR=/etc/prometheus/rules

# 1. 备份当前规则
tar czf /backup/prometheus_rules_$(date +%Y%m%d_%H%M%S).tar.gz ${RULES_DIR}

# 2. 修改规则文件
vi ${RULES_DIR}/mysql.yml

# 3. 验证语法
promtool check rules ${RULES_DIR}/*.yml

# 4. 热加载
curl -X POST http://prometheus:9090/-/reload

# 5. 异常回滚
tar xzf /backup/prometheus_rules_*.tar.gz -C /
curl -X POST http://prometheus:9090/-/reload

12.3 参数变更回滚

# 1. 备份当前 my.cnf
sudo cp /etc/my.cnf /etc/my.cnf.bak

# 2. 修改
sudo vi /etc/my.cnf

# 3. 在线应用(不重启)
mysql -uroot -p -e "SET GLOBAL innodb_buffer_pool_size = 128*1024*1024*1024;"

# 4. 异常回滚
mysql -uroot -p -e "SET GLOBAL innodb_buffer_pool_size = 64*1024*1024*1024;"

12.4 数据库版本升级回滚

  • 5.7 → 8.0 一旦升级不能回退(系统表格式变了)
  • 必须先做从库升级验证,确认业务 SQL 全部兼容
  • 主从切换必须先在测试环境演练

12.5 自愈脚本回滚

# 1. 立即停掉定时任务
sudo rm /etc/cron.d/kill-long-tx

# 2. 跑一次 DRY_RUN 看看现在没杀是不是真没问题
DRY_RUN=true /usr/local/bin/kill_long_tx.sh

# 3. 如果业务恢复,恢复原 crontab
sudo cp /backup/cron.d/kill-long-tx /etc/cron.d/

十三、生产环境注意事项

13.1 性能开销

  • mysqld_exporter 默认 1 个连接 + 多个一次性 SHOW,占用很小
  • postgres_exporter 默认 1 个连接,建议给独立账号
  • pgbouncer_exporter 每 60s 抓一次,影响可忽略
  • performance_schema 默认开启带来 5-10% 开销,关键业务库要测

13.2 权限最小化

-- mysqld_exporter 账号
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
GRANT SELECT ON performance_schema.* TO 'exporter'@'localhost';

-- postgres_exporter 账号
GRANT pg_read_all_stats TO postgres_exporter;

13.3 凭据管理

  • 密码写到 .my.cnf 必须 chmod 600
  • systemd 单元用 EnvironmentFile 引用,文件权限 600
  • 生产环境用 HashiCorp Vault / 阿里云 KMS / 腾讯云密钥管理系统
  • 定期轮转(季度)

13.4 跨实例标签

# Prometheus scrape config
scrape_configs:
- job_name: 'mysqld'
    static_configs:
      - targets: ['mysql-1:9104']
        labels:
          env: prod
          role: master
          cluster: us-east
          biz: orders
      - targets: ['mysql-2:9104']
        labels:
          env: prod
          role: slave
          cluster: us-east
          biz: orders

13.5 告警收敛

  • group_by: ['alertname', 'instance'] 避免重复告警
  • group_wait: 30sgroup_interval: 5m 避免告警风暴
  • 重要告警走 PagerDuty / 钉钉值班,普通走 Slack 频道

13.6 高可用 Exporter

Exporter 自身挂了也要告警:

- alert: MysqlExporterDown
  expr: up{job="mysqld"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "MySQL Exporter 不可达"

Exporter 部署方式:

  • 主备两台机器各跑一个
  • 或加 Keepalived VIP
  • 或用 consul 做服务发现

13.7 长期归档

  • Prometheus 长期存储走 Thanos / Cortex / VictoriaMetrics
  • 关键指标保留 1 年
  • 慢查询日志归档到对象存储

13.8 多租户 / 合规

  • 监控数据脱敏(密码、用户身份证号)
  • 监控账号审计(每季度 review 权限)
  • 跨 region 监控走专线 + 加密

十四、总结

数据库监控的核心是「在业务感知之前发现异常」。落到 Prometheus + Grafana + Exporter 这套体系,关键要把握三件事:

  1. 指标选对:MySQL 看 Buffer Pool 命中率、Threads_running、行锁等待、复制延迟;PG 看 cache hit rate、idle in transactionpg_stat_replication lag、WAL 堆积
  2. 阈值定对:所有阈值都是经验值,必须结合业务基线调整;连接池 > 80%、命中率 < 99%、主从延迟 > 60s、死锁 > 0 都是常见红线
  3. 闭环跑通:告警 → 定位 → 修复 → 验证 → 回滚 → 复盘,每一步都要有具体动作

最容易踩的坑:

  • 看 CPU 看不到 DB 真实问题(锁等待/IO 等待 CPU 不高)
  • Threads_connected 和 Threads_running 混淆
  • SELECT FOR UPDATE 走备库导致脏读
  • Exporter 密码写死在配置文件
  • 长事务 kill 不通知应用端

数据库监控和普通的 Web 服务监控不一样:

  • DB 是有状态的,重启 / 切换有成本
  • DB 性能瓶颈往往在锁和 IO,不是 CPU
  • DB 的”假死”(连接还在但查询慢)比真死更危险
  • DB 的告警阈值必须分 P0/P1/P2/P3,不能一刀切

最后强调一点:监控不是装上 Exporter 就完了,而是要长期看、长期调。业务在变、数据量在涨、SQL 在迭代,告警阈值和指标都要相应调整。每季度做一次告警 review,看哪些告警从没用过(删)、哪些告警永远在响(调阈值或加自愈)。

DBA 真正的工作不是救火,而是让火不发生。

数据库监控:MySQL/PostgreSQL 关键指标与告警阈值插图

本文链接:https://www.yunweipai.com/49300.html

网友评论comments

发表回复

您的电子邮箱地址不会被公开。

暂无评论

Copyright © 2012-2022 YUNWEIPAI.COM - 运维派 京ICP备16064699号-6
扫二维码
扫二维码
返回顶部