引子:一次被监控救回来的库
去年 Q2 一次大促。凌晨 1 点,订单库的 P99 延迟从 80ms 涨到 1.2s,业务没炸,但订单量明显下滑。
值班工程师打开 Grafana,看到三个现象:
- MySQL
Threads_running从 30 涨到 380 - InnoDB 行锁等待时间(
Innodb_row_lock_time)从 50ms/s 涨到 4s/s Slow_queries每分钟从 12 涨到 900+
但当时 CPU 只有 60%,iops 也只用了 50%。如果只看 CPU 不会觉得有问题。
继续往下查:
SHOW ENGINE INNODB STATUS\G
输出里有一段:
LATEST DETECTED DEADLOCK
------------------------
2026-04-15 01:00:12 0x7f8b9c001700
*** (1) TRANSACTION:
TRANSACTION 37281973, ACTIVE 0.001 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 3287123, query id 9382 ...
UPDATE orders SET status='paid' WHERE id=12345
定位是订单状态更新的高频路径上,行锁竞争激烈。orders 表有 8000 万行,热点 status 字段没建合适索引,每次更新都是范围扫描 + 锁升级。
修复路径:
- 紧急 kill 长事务:
KILL 37281973 - 临时方案:把热点订单走单独的连接池,限流
- 根本修复:加索引
CREATE INDEX idx_orders_status_updated ON orders(status, updated_at) - 拆表:把 90 天前的热数据归档到
orders_archive
从告警到定位到修复,花了 17 分钟。如果当时没把上面那三个指标告警阈值定好,至少要 1 小时才能反应过来。
这就是数据库监控的核心价值:在业务感知之前先发现异常。但「定好指标 + 定对阈值」是难点。阈值定低了天天狼来了,定高了真出事了又没告警。
一、问题背景
数据库是「有状态 + 数据价值最高」的基础组件。监控的核心目标有四层:
- 可用性:服务在线、连接池健康、主从复制正常
- 性能:QPS / TPS / 延迟、锁等待、缓冲命中率
- 容量:连接数、表空间、磁盘空间、binlog / WAL 增长
- 变更:DDL 变更、参数变更、慢查询、异常 SQL
MySQL 和 PostgreSQL 的监控体系差异较大:
| 维度 | MySQL | PostgreSQL |
|---|---|---|
| 内置监控 | SHOW STATUS、SHOW ENGINE INNODB STATUS、performance_schema、information_schema | pg_stat_* 视图、pg_stat_statements 扩展、EXPLAIN (ANALYZE) |
| 主从复制监控 | SHOW REPLICA STATUS(8.0.22+) / SHOW SLAVE STATUS(5.7 / 8.0 旧版) | pg_stat_replication |
| 锁监控 | performance_schema.data_locks(8.0+) / information_schema.innodb_trx | pg_locks + pg_stat_activity |
| 慢查询 | slow_query_log + performance_schema.events_statements_summary_by_digest | log_min_duration_statement + pg_stat_statements |
| 缓冲池 | SHOW ENGINE INNODB STATUS 中 Buffer Pool 信息 | pg_stat_database 中 blks_hit / blks_read |
| 死锁 | SHOW ENGINE INNODB STATUS 中 LATEST DETECTED DEADLOCK | log 中 deadlock detected |
| 主流 Exporter | mysqld_exporter | postgres_exporter、pgbouncer_exporter |
| Grafana Dashboard | Percona MySQL Overview、MySQL InnoDB Metrics | PostgreSQL Overview、pganalyze、Crunchy |
| 高可用方案 | MHA / Orchestrator / Group Replication / InnoDB Cluster | Patroni / repmgr / PgBouncer + keepalived |
| 逻辑复制 | 5.7 简单 / 8.0 增强(mysqlbinlog 配合) | 内建 publication / subscription |
| 主从延迟单位 | 秒(Seconds_Behind_Master) | 字节(pg_stat_replication LSN 差) + 时间(PG 10+ replay_lag) |
监控常见痛点:
- 指标太多不知道看什么:MySQL 有 400+ status 变量,PG 有 200+ 系统视图
- 版本差异:MySQL 5.7/8.0 字段名变化;PG 12/13/14/15 视图字段调整
- 阈值不会定:照搬网上”CPU > 80% 告警”,结果 CPU 永远跑 30%
- 告警风暴:一个慢查询触发 20 个告警
- 指标语义不清:
Threads_connected和Threads_running区别 - 生产环境不敢装新组件:装个 Exporter 都担心影响数据库
二、适用场景
不是所有业务都要上全指标监控,按规模选:
| 团队现状 | 推荐方案 |
|---|---|
| 单库 < 100 QPS,几张表 | 慢查询日志 + SHOW STATUS + 自建脚本 |
| 中小规模,1-10 个 MySQL / PG 实例 | Prometheus + mysqld_exporter / postgres_exporter + Grafana |
| 大规模,50+ 实例,多集群 | Prometheus + 多种 Exporter + 自研聚合层 + 告警分级 |
| 金融 / 政企 | 商业监控(Datadog / New Relic / 自研 APM)+ 国产化 Exporter |
| 多云 / 跨 region | Thanos / Cortex / VictoriaMetrics 聚合 |
不推荐 Prometheus 监控的场景:
- 业务量极小,且 DBA 就在机器旁
- 数据库要求强一致性,且不允许任何 Exporter 占用资源
- 已有商业 DB 监控产品且合规要求必须用
三、核心知识点
3.1 MySQL 关键版本差异
监控脚本的版本兼容性必须先看版本:
| 维度 | MySQL 5.7 | MySQL 8.0 |
|---|---|---|
| 主从复制查询 | SHOW SLAVE STATUS | SHOW REPLICA STATUS(8.0.22+) / SHOW SLAVE STATUS(兼容旧版) |
| 行锁监控 | information_schema.innodb_trx + innodb_locks + innodb_lock_waits | performance_schema.data_locks + data_lock_waits(5.7 的 innodb_locks 在 8.0 已删除) |
| 性能模式 | performance_schema 默认开启 events_statements_% 部分 | 默认开启更全,但消耗更多内存 |
| 系统表 | mysql.user 明文密码可看 | 5.7 升级后需 ALTER USER 重置 |
| 索引提示 | USE INDEX / FORCE INDEX | 同 |
| 默认字符集 | latin1 | utf8mb4 |
| 主键 | 不强制 | 不强制,但缺主键问题更明显 |
| 直方图 | 无 | ANALYZE TABLE ... UPDATE HISTOGRAM |
| 锁视图 | information_schema.innodb_locks | performance_schema.data_locks |
EXPLAIN 输出 | JSON 格式可选 | JSON 格式默认,tree 格式可选 |
| 隐藏索引 | 无 | ALTER TABLE ... ALTER INDEX ... INVISIBLE |
| 资源组 | 无 | CREATE RESOURCE GROUP |
| 窗口函数 | 无 | 8.0.2+ 支持 |
| CTE | 无 | 8.0.1+ 支持 |
| Group Replication | 5.7 基础 | 8.0 单主 / 多主模式更稳定 |
| InnoDB Cluster | 5.7 基础 | 8.0 内置 mysqlsh |
| 默认 binlog 格式 | ROW(建议) | ROW(同) |
| 默认 binlog 过期 | 永不过期 | 30 天(binlog_expire_logs_seconds) |
| 默认密码认证 | mysql_native_password | caching_sha2_password |
| performance_schema 默认 instrument | 部分 | 大部分开启 |
重点:5.7 升级到 8.0 时,监控脚本要把 innodb_locks 改成 data_locks,SHOW SLAVE STATUS 用 SHOW REPLICA STATUS(也可保留兼容)。
3.2 PostgreSQL 关键版本差异
| 维度 | PG 12 | PG 13 | PG 14 | PG 15 | PG 16 |
|---|---|---|---|---|---|
| 分区表 | 增强 | 增强 | 增强 | 增强 | 增强 |
pg_stat_statements | 字段齐全 | 加 total_exec_time 拆分 | 拆 shared_blks_* 字段 | 加 wal_records / wal_fpi 字段 | 字段调整 |
pg_stat_activity | 基础 | 加 leader_pid | 同 | 同 | 加 query_id 字段 |
| 逻辑复制 | 基础 | 增强 | 增强 | 增强 | 增强 |
pg_stat_replication | 基础 | 基础 | 加 replay_lag | 同 | 同 |
| 默认隔离级别 | READ COMMITTED | 同 | 同 | 同 | 同 |
| 归档 | pg_basebackup 基础 | 增强 | 增强 | 增强 | 增强 |
监控视图 pg_stat_io | 无 | 无 | 无 | 无 | 新增(IO 详情) |
pg_stat_wal | 无 | 新增 | 同 | 同 | 同 |
| JIT | 增强 | 同 | 同 | 同 | 同 |
pg_stat_progress_* | 部分 | 部分 | 增加 analyze / cluster / vacuum / create_index | 同 | 同 |
| 监控 WAL 生成速率 | 通过 LSN 推算 | 同 | 同 | pg_stat_wal | 同 |
| Btree 去重 | 无 | 无 | 新增(deduplication) | 同 | 同 |
监控 PG 时,pg_stat_statements 字段名每个大版本可能微调,必须以实际版本为准。PG 16+ 的 pg_stat_io 是排查 IO 瓶颈的关键视图。
3.3 MySQL 关键指标分类
把 MySQL 400+ status 变量按用途归类:
3.3.1 连接类
| 指标 | 含义 | 异常表现 |
|---|---|---|
Threads_connected | 当前连接数 | 接近 max_connections |
Threads_running | 正在执行查询的线程 | 持续 > 30 说明有积压 |
Threads_created | 创建过的线程数 | 持续增长说明连接池没复用 |
Connection_errors_* | 连接错误数 | 持续增长说明被拒绝 |
Aborted_connects | 失败的连接尝试 | 增长说明密码错或网络问题 |
Max_used_connections | 历史最大连接数 | 接近 max_connections 时告警 |
Connection_errors_max_connections | 超过 max 拒绝次数 | 增长 = 业务被打回 |
注意:Threads_connected 包含 sleep 状态的线程,不代表当前压力。Threads_running 才是真正在干活的线程。
3.3.2 吞吐类
| 指标 | 含义 | 异常表现 |
|---|---|---|
Questions | 服务器收到的查询数(COM_PING 等除外) | 异常突增 / 突减 |
Com_select | SELECT 总数 | 异常突增 |
Com_insert/update/delete | DML 总数 | 异常突增 |
Innodb_rows_inserted/updated/deleted/read | 影响行数 | 异常突增 |
Bytes_received/sent | 网络流量 | 异常突增 |
Created_tmp_tables | 内存临时表数 | 突增 |
Created_tmp_disk_tables | 磁盘临时表数 | 突增(说明 tmp_table_size 不足) |
QPS / TPS 计算(基于两次采样):
# QPS
mysqladmin -uroot -p extended-status -r -i 1 | awk '/Queries/{q=$4} /Questions/{print "QPS:", $4-q}'
# TPS(提交 + 回滚)
mysqladmin -uroot -p extended-status -r -i 1 | awk '/Com_commit/{c=$4} /Com_rollback/{print "TPS:", $4+c}'
3.3.3 InnoDB 类
| 指标 | 含义 | 异常表现 |
|---|---|---|
Innodb_buffer_pool_pages_total | 总页数 | – |
Innodb_buffer_pool_pages_free | 空闲页 | 持续 < 5% 说明内存不足 |
Innodb_buffer_pool_pages_data | 数据页 | 持续高位 |
Innodb_buffer_pool_pages_misc | 杂项页 | 增长说明 hash index / metadata 占用 |
Innodb_buffer_pool_read_requests | 逻辑读 | – |
Innodb_buffer_pool_reads | 物理读 | / read_requests < 99% 命中率低 |
Innodb_data_reads/writes | 物理 I/O | 持续高值 |
Innodb_log_waits | 写日志等待 | > 0 说明 redo log 满 |
Innodb_row_lock_waits | 行锁等待次数 | 突增 |
Innodb_row_lock_time | 行锁总等待时间 | 突增 |
Innodb_deadlocks | 死锁次数 | 突增 |
Innodb_history_list_length | undo 长度 | 持续 > 1000 说明 purge 跟不上 |
Innodb_log_writes | redo 写次数 | 突增 |
Innodb_os_log_fsyncs | fsync 次数 | 突增 |
Innodb_os_log_pending_fsyncs | 待 fsync 数 | > 0 卡顿 |
Innodb_os_log_pending_writes | 待写 WAL 数 | > 0 卡顿 |
Innodb_data_pending_fsyncs | 数据文件 fsync 队列 | > 0 卡顿 |
Innodb_data_pending_reads/writes | 数据文件 IO 队列 | > 0 IO 饱和 |
Innodb_mutex_os_waits | mutex 等待 | 突增 = 热点 |
Innodb_rows_read | 读取行数 | / Com_select 比例异常高说明全表扫描 |
3.3.4 复制类(5.7 / 8.0.22-)
SHOW SLAVE STATUS\G
关键字段:
| 字段 | 含义 | 异常 |
|---|---|---|
Slave_IO_Running | IO 线程 | No 表示 IO 中断 |
Slave_SQL_Running | SQL 线程 | No 表示回放中断 |
Seconds_Behind_Master | 主从延迟(秒) | > 0 持续增长 |
Relay_Log_Space | relay log 大小 | 异常增长 |
Last_IO_Error | IO 错误信息 | 非空 |
Last_SQL_Error | SQL 错误信息 | 非空 |
Master_Log_File / Read_Master_Log_Pos | 主库位点 | 不更新说明 IO 停 |
Relay_Master_Log_File / Exec_Master_Log_Pos | 已执行位点 | 不更新说明 SQL 停 |
8.0.22+ 等价命令:
SHOW REPLICA STATUS\G
字段名换成 Replica_IO_Running、Replica_SQL_Running、Seconds_Behind_Source。
3.3.5 Group Replication 监控(8.0+)
SELECT * FROM performance_schema.replication_group_members;
关键字段:
| 字段 | 含义 | 异常 |
|---|---|---|
MEMBER_ID | 成员 UUID | – |
MEMBER_HOST / MEMBER_PORT | 地址端口 | – |
MEMBER_STATE | ONLINE / RECOVERING / OFFLINE / ERROR | ONLINE 才是正常 |
MEMBER_ROLE | PRIMARY / SECONDARY | 单主模式下应只有 1 PRIMARY |
MEMBER_VERSION | MySQL 版本 | 不一致告警 |
Group Replication 延迟:
SELECT
MEMBER_ID,
COUNT_TRANSACTIONS_IN_QUEUE AS queue_count,
COUNT_TRANSACTIONS_CHECKED AS checked,
COUNT_CONFLICTS_DETECTED AS conflicts,
COUNT_TRANSACTIONS_REMOTE_APPLIED AS remote_applied
FROM performance_schema.replication_group_member_stats;
queue_count > 0 持续增长 = 二级节点跟不上。
3.3.6 临时表 / 排序类
| 指标 | 含义 | 异常 |
|---|---|---|
Created_tmp_disk_tables | 磁盘临时表数 | 突增说明内存临时表不够 |
Created_tmp_tables | 内存临时表数 | 突增 |
Sort_rows | 排序行数 | 突增 |
Sort_merge_passes | 归并排序次数 | 突增说明 sort_buffer 不足 |
Select_full_join | 没有用索引的 join | > 0 是问题 |
Select_range_check | 每行后检查范围 | > 0 是问题 |
Select_scan | 全表扫描次数 | 突增 |
Select_full_range_join | 范围 join | 突增 |
3.3.7 performance_schema 表(8.0+)
行锁和事务等待重点看:
-- 8.0+ 行锁视图
SELECT * FROM performance_schema.data_locks LIMIT 10;
-- 8.0+ 锁等待关系
SELECT * FROM performance_schema.data_lock_waits LIMIT 10;
-- 当前正在等待的事务
SELECT
t.trx_id,
t.trx_state,
t.trx_started,
TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) AS duration_sec,
t.trx_mysql_thread_id,
t.trx_query,
t.trx_rows_modified,
t.trx_tables_in_use,
t.trx_tables_locked,
t.trx_lock_structs
FROM information_schema.innodb_trx t
WHERE TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) > 30
ORDER BY t.trx_started;
SQL 统计(按 digest 聚合):
SELECT
SCHEMA_NAME,
digest,
digest_text,
count_star AS exec_count,
sum_timer_wait / 1e9 AS total_ms,
avg_timer_wait / 1e9 AS avg_ms,
max_timer_wait / 1e9 AS max_ms,
sum_lock_time / 1e9 AS lock_ms,
sum_rows_examined AS rows_examined,
sum_rows_sent AS rows_sent,
sum_no_index_used AS no_index_count,
sum_errors AS error_count,
sum_warnings AS warning_count
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME IS NOT NULL
ORDER BY sum_timer_wait DESC
LIMIT 20;
3.3.8 InnoDB 历史列表与 purge
SHOW ENGINE INNODB STATUS\G
关键段落:
TRANSACTIONS
------------
Trx id counter 37281975
Purge done for trx's n:o < 37281960 undo n:o < 0
History list length 2547
History list length 持续 > 1000 说明 purge 跟不上,常见于:
- 大事务未提交
innodb_purge_threads默认 4,但可以调到 8 / 16innodb_purge_batch_size太小
Trx id counter 增长太快说明写入速率高,需要监控。
3.4 PostgreSQL 关键系统视图
3.4.1 pg_stat_activity(活跃会话)
| 字段 | 含义 |
|---|---|
pid | 进程 ID |
usename | 用户 |
application_name | 应用名 |
client_addr | 客户端 IP |
state | active / idle / idle in transaction / fastpath function call |
query_start | 查询开始时间 |
state_change | 状态变更时间 |
wait_event_type | 等待类型(Lock / IO / LWLock / Activity) |
wait_event | 具体等待事件 |
query | 当前 SQL |
backend_xid / backend_xmin | 事务 ID |
leader_pid | parallel worker 领头(13+) |
query_id | SQL hash(16+) |
state = 'idle in transaction' 是关键告警指标:事务挂着不提交,锁不释放。
排查活跃锁等待:
SELECT
pg_class.relname,
pg_locks.locktype,
pg_locks.mode,
pg_locks.granted,
pg_stat_activity.usename,
pg_stat_activity.query,
pg_stat_activity.query_start,
pg_stat_activity.state_change,
EXTRACT(EPOCH FROM (now() - pg_stat_activity.query_start)) AS duration_sec
FROM pg_locks
JOIN pg_class ON pg_locks.relation = pg_class.oid
JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE NOT pg_locks.granted
ORDER BY pg_stat_activity.query_start;
3.4.2 pg_stat_statements(SQL 统计)
需要先在 postgresql.conf 启用:
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = top
pg_stat_statements.track_utility = on
pg_stat_statements.save = on
重启后创建扩展:
CREATE EXTENSION pg_stat_statements;
-- 验证
SELECT count(*) FROM pg_stat_statements;
-- 预期:> 0(执行过 SQL 才有)
关键字段:
| 字段 | 含义 |
|---|---|
query | SQL 模板(参数化) |
calls | 调用次数 |
total_exec_time | 总执行时间(毫秒) |
mean_exec_time | 平均执行时间 |
max_exec_time | 最大执行时间 |
rows | 总返回行数 |
shared_blks_hit | 共享缓冲命中块数 |
shared_blks_read | 共享缓冲读盘块数 |
temp_blks_read/written | 临时块读写 |
queryid | SQL 哈希 ID |
total_plan_time | 计划时间(13+) |
wal_records / wal_fpi | WAL 生成(13+) |
缓存命中率:
SELECT
sum(shared_blks_hit) / nullif(sum(shared_blks_hit) + sum(shared_blks_read), 0) AS hit_rate
FROM pg_stat_statements;
TOP 慢查询:
SELECT
substring(query for 100) AS query_preview,
calls,
round(total_exec_time::numeric / 1000, 2) AS total_sec,
round(mean_exec_time::numeric / 1000, 4) AS mean_sec,
round(max_exec_time::numeric / 1000, 2) AS max_sec,
rows,
shared_blks_hit,
shared_blks_read
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
3.4.3 pg_stat_database(库级)
| 字段 | 含义 |
|---|---|
datname | 库名 |
numbackends | 当前连接数 |
xact_commit | 提交事务数 |
xact_rollback | 回滚事务数 |
blks_read | 物理读块数 |
blks_hit | 缓冲命中块数 |
tup_returned/fetched/inserted/updated/deleted | 行数统计 |
conflicts | 恢复冲突数 |
deadlocks | 死锁数 |
blk_read_time / blk_write_time | I/O 累计时间(需 track_io_timing) |
session_time / idle_in_transaction_time | 累计时间(13+) |
active_time | active 状态累计时间(14+) |
需要 track_io_timing = on 才能拿到 blk_read_time / blk_write_time。
3.4.4 pg_stat_replication(复制状态)
| 字段 | 含义 |
|---|---|
pid | walsender 进程 |
usename | 复制用户 |
application_name | 备库标识 |
state | startup / catchup / streaming / backup / stopping |
sent_lsn | 已发送 LSN |
write_lsn | 备库已写 LSN |
flush_lsn | 备库已刷盘 LSN |
replay_lsn | 备库已回放 LSN |
replay_lag / write_lag / flush_lag | 延迟(PG 10+) |
sync_state / sync_priority | 同步优先级 |
延迟字节数计算:
SELECT
application_name,
state,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag_bytes,
replay_lag,
write_lag,
flush_lag
FROM pg_stat_replication;
3.4.5 pg_locks(锁视图)
SELECT
pg_class.relname,
pg_locks.locktype,
pg_locks.mode,
pg_locks.granted,
pg_stat_activity.usename,
pg_stat_activity.query,
pg_stat_activity.query_start
FROM pg_locks
JOIN pg_class ON pg_locks.relation = pg_class.oid
JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE NOT pg_locks.granted
ORDER BY pg_locks.granted;
3.4.6 pg_stat_progress_vacuum(vacuum 进度,PG 9.6+)
SELECT
pid,
datname,
relid::regclass,
phase,
heap_blks_total,
heap_blks_scanned,
heap_blks_vacuumed,
index_vacuum_count,
max_dead_tuples,
num_dead_tuples
FROM pg_stat_progress_vacuum;
phase 字段:initializing / scanning heap / vacuuming indexes / vacuuming heap / cleaning up indexes / truncating heap / performing final cleanup。
3.4.7 pg_stat_io(IO 详情,PG 16+)
SELECT * FROM pg_stat_io;
字段:backend_type、object、context、reads、writes、extends、hits、read_time、write_time。
这是 PG 16 引入的关键视图,可以直接看到每类对象的 IO 分布。
3.5 关键告警阈值(基线参考)
下面给的阈值是经验值,不是绝对标准。所有阈值必须结合业务基线调整:
| 指标 | 建议阈值 | 说明 |
|---|---|---|
Threads_running | > 30 持续 5 min | 高于 CPU 核心数 × 2 持续 5 min 告警 |
| Buffer Pool 命中率 | < 99% 持续 10 min | 命中率低 = 物理 I/O 多 |
Innodb_row_lock_waits | > 100/min | 突增说明行锁竞争 |
Innodb_deadlocks | > 0/min | 立即告警 |
Innodb_history_list_length | > 1000 | undo 清理慢 |
Slow_queries | > N/min(N 依业务定) | 突增 |
| 主从延迟 | > 60s | 立即告警 |
| 复制 IO / SQL 线程 | No | 立即告警 |
| 磁盘空间 | > 80% | 业务影响前 1-2 天告警 |
| 连接数使用率 | > 80% | 接近 max_connections |
Idle in transaction | > 30s | 立即告警 |
pg_stat_replication.replay_lag | > 60s | 立即告警 |
| Cache hit rate | < 99% | 持续 10 min 告警 |
| WAL 堆积 | > 10 GB | 告警 |
| Checkpoint 间隔 | < 30s | 说明 checkpoint 频繁 |
select_full_join 增量 | > 0 | 任何增量都该看 |
Innodb_log_waits | > 0 | redo log 不足 |
Group Replication queue_count | > 100 | 持续 > 5 min 告警 |
pg_stat_activity active 数 | > max_connections × 0.8 | 接近极限 |
tmp_blks_written 速率 | > 100 MB/s | 临时表撑爆 |
四、整体排查或实施思路
按”先基础后深度,先指标后 SQL”:
- 第一步:基础资源监控(1 周)
- CPU、内存、磁盘、网络、连接数
- Prometheus node_exporter 即可
- 第二步:DB 内置指标(1-2 周)
- 装 mysqld_exporter / postgres_exporter
- 配置慢查询日志
- 启用 performance_schema / pg_stat_statements
- 第三步:连接池监控(1 周)
- ProxySQL / MaxScale(MySQL)
- pgbouncer_exporter(PG)
- 应用侧连接池(HikariCP、Druid)
- 第四步:复制与高可用(1-2 周)
- 主从延迟、备库健康
- MHA / Orchestrator / Patroni / Group Replication 状态
- 第五步:SQL 维度(2-4 周)
- 慢查询分析
- 全表扫描监控
- 索引使用率
- 第六步:告警分级(持续)
- P0:业务中断
- P1:性能下降
- P2:资源告警
- P3:优化建议
- 第七步:自愈(高级)
- 长事务自动 kill
- 连接池自动扩容
- 慢查询自动加入限流
五、实战步骤
5.1 MySQL 5.7 / 8.0 启用慢查询
# /etc/my.cnf 或 /etc/mysql/mysql.conf.d/mysqld.cnf
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 2 # 2 秒以上记为慢查询
log_queries_not_using_indexes = 1 # 没走索引的也记
log_slow_extra = ON # 8.0 才有
log_throttle_queries_not_using_indexes = 100 # 每分钟最多记 100 条
min_examined_row_limit = 1000 # 扫描行数 < 1000 不记
应用:
# 在线打开(不重启)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;
SET GLOBAL log_queries_not_using_indexes = 'ON';
# 验证
SHOW VARIABLES LIKE 'slow_query_log%';
SHOW VARIABLES LIKE 'long_query_time';
5.2 MySQL 启用 performance_schema
# /etc/my.cnf
[mysqld]
performance_schema = ON
performance_schema_max_table_instances = 200
performance_schema_max_mutex_classes = 200
performance_schema_max_digest_length = 4096
performance_schema_max_index_stat = 10000
5.7 默认开启,8.0 默认更全。验证:
SHOW VARIABLES LIKE 'performance_schema';
-- 预期:ON
SELECT * FROM performance_schema.setup_consumers WHERE enabled='YES';
-- 8.0 默认情况下,可以打开更详细的 events_statements 维度
UPDATE performance_schema.setup_consumers
SET enabled = 'YES'
WHERE name LIKE 'events_statements_%';
注意:performance_schema 默认开启会带来 5-10% 的性能开销,关键业务库在压测后再决定是否长期开启。
5.3 PostgreSQL 启用 pg_stat_statements
# /var/lib/pgsql/16/data/postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = top
pg_stat_statements.track_utility = on
pg_stat_statements.save = on
track_io_timing = on
log_min_duration_statement = 2000 # 单位 ms,记录 2 秒以上
log_lock_waits = on
log_temp_files = 0 # 记录所有临时文件
log_autovacuum_min_duration = 0 # 记录 autovacuum
log_checkpoint = on
log_connections = on
log_disconnections = on
log_line_prefix = '%m [%p] %q%u@%d/%a from %h '
重启后:
CREATE EXTENSION pg_stat_statements;
-- 验证
SELECT count(*) FROM pg_stat_statements;
-- 预期:> 0(执行过 SQL 才有)
5.4 安装 mysqld_exporter
# 1. 下载
VERSION="0.15.1"
wget https://github.com/prometheus/mysqld_exporter/releases/download/v${VERSION}/mysqld_exporter-${VERSION}.linux-amd64.tar.gz
tar xzf mysqld_exporter-${VERSION}.linux-amd64.tar.gz
sudo mv mysqld_exporter-${VERSION}.linux-amd64/mysqld_exporter /usr/local/bin/
# 2. 创建监控账号
mysql -uroot -p <<EOF
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'exporter_password_here' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
GRANT SELECT ON performance_schema.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
EOF
# 3. 配置 .my.cnf
sudo tee /etc/.mysqld_exporter.cnf <<'EOF'
[client]
user=exporter
password=exporter_password_here
host=127.0.0.1
port=3306
EOF
sudo chmod 600 /etc/.mysqld_exporter.cnf
# 4. 启动
mysqld_exporter \
--config.my-cnf=/etc/.mysqld_exporter.cnf \
--web.listen-address=:9104 \
--collect.global_status \
--collect.info_schema.innodb_metrics \
--collect.info_schema.processlist \
--collect.perf_schema.eventsstatements \
--collect.perf_schema.eventswaits \
--collect.perf_schema.file_events \
--collect.perf_schema.tableio_waits
密码从环境变量或密钥管理读,不写在配置里。
5.4.1 systemd 服务
# /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=MySQL Exporter
After=network.target
[Service]
Type=simple
User=mysqld_exporter
EnvironmentFile=/etc/mysqld_exporter.env
ExecStart=/usr/local/bin/mysqld_exporter \
--config.my-cnf=/etc/.mysqld_exporter.cnf \
--web.listen-address=:9104
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
# /etc/mysqld_exporter.env(权限 600)
MYSQLD_EXPORTER_PASSWORD=xxx
sudo useradd -r -s /bin/false mysqld_exporter
sudo chown mysqld_exporter:mysqld_exporter /etc/mysqld_exporter.env
sudo chmod 600 /etc/mysqld_exporter.env
sudo systemctl daemon-reload
sudo systemctl enable --now mysqld_exporter
sudo systemctl status mysqld_exporter
5.5 安装 postgres_exporter
VERSION="0.15.0"
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v${VERSION}/postgres_exporter-${VERSION}.linux-amd64.tar.gz
tar xzf postgres_exporter-${VERSION}.linux-amd64.tar.gz
sudo mv postgres_exporter-${VERSION}.linux-amd64/postgres_exporter /usr/local/bin/
# 创建监控账号
sudo -u postgres psql <<EOF
CREATE USER postgres_exporter WITH PASSWORD 'xxx' INHERIT;
GRANT pg_read_all_stats TO postgres_exporter;
GRANT CONNECT ON DATABASE postgres TO postgres_exporter;
EOF
# 启动
DATA_SOURCE_NAME="postgresql://postgres_exporter:xxx@localhost:5432/postgres" \
postgres_exporter --web.listen-address=:9187
5.5.1 systemd 服务
# /etc/systemd/system/postgres_exporter.service
[Unit]
Description=PostgreSQL Exporter
After=network.target
[Service]
Type=simple
User=postgres_exporter
Environment=DATA_SOURCE_NAME=postgresql://postgres_exporter:xxx@localhost:5432/postgres
ExecStart=/usr/local/bin/postgres_exporter \
--web.listen-address=:9187 \
--extend.query-path=/etc/postgres_exporter/queries.yaml
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
5.6 安装 pgbouncer_exporter
pgbouncer 自带 prometheus 兼容接口(1.21+),也可单独装 pgbouncer_exporter。
# /etc/pgbouncer.ini 添加
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
admin_users = pgb_admin
stats_users = pgb_stats
pool_mode = transaction
max_client_conn = 4000
default_pool_size = 20
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 600
# 验证
curl http://localhost:6432/metrics | head -30
5.7 安装 ProxySQL Exporter(MySQL 连接池)
ProxySQL 自带 prometheus 兼容接口(v2.0+):
-- ProxySQL 启用 metrics
SET admin-metrics_password='xxx';
LOAD ADMIN VARIABLES TO RUNTIME;
SAVE ADMIN VARIABLES TO DISK;
curl -u admin:xxx http://localhost:6032/metrics
5.8 Prometheus 抓取配置
# prometheus.yml
scrape_configs:
- job_name: 'mysqld'
static_configs:
- targets:
- 'mysql-prod-1:9104'
- 'mysql-prod-2:9104'
labels:
env: prod
role: master
- job_name: 'postgres'
static_configs:
- targets:
- 'pg-prod-1:9187'
- 'pg-prod-2:9187'
labels:
env: prod
role: master
- job_name: 'pgbouncer'
static_configs:
- targets:
- 'pgbouncer-1:6432'
- 'pgbouncer-2:6432'
- job_name: 'proxysql'
static_configs:
- targets:
- 'proxysql-1:6032'
5.9 Grafana Dashboard 选型
| Dashboard ID | 名称 | 适用 |
|---|---|---|
| 7362 | MySQL Overview | MySQL 通用 |
| 11157 | MySQL InnoDB Metrics | MySQL InnoDB 详细 |
| 12740 | Percona MySQL | Percona 分支 |
| 9628 | PostgreSQL Database | PG 通用 |
| 455 | PostgreSQL Overview | PG 通用 |
| 14881 | pgBouncer | 连接池 |
| 11807 | PG Replication | PG 主从 |
| 17378 | MySQL Group Replication | MGR |
| 12227 | Patroni | Patroni HA |
Dashboard ID 可能因版本升级而变化,以 Grafana 官方仓库实际为准。
5.10 告警规则(MySQL)
# /etc/prometheus/rules/mysql.yml
groups:
- name: mysql_alerts
rules:
- alert: MySQLDown
expr: mysql_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "MySQL 实例 {{ $labels.instance }} 不可达"
description: "mysqld_exporter 无法连接 MySQL"
- alert: MySQLTooManyConnections
expr: |
(mysql_global_status_threads_connected
/ mysql_global_variables_max_connections) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL 连接数使用率 {{ $value | printf \"%.1f\" }}%"
- alert: MySQLHighThreadsRunning
expr: mysql_global_status_threads_running > 50
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL Threads_running {{ $value }} 持续高"
- alert: MySQLReplicationBroken
expr: |
mysql_slave_status_slave_io_running == 0
or
mysql_slave_status_slave_sql_running == 0
for: 1m
labels:
severity: critical
annotations:
summary: "MySQL 复制中断"
description: "实例 {{ $labels.instance }} 复制线程停止"
- alert: MySQLReplicationLag
expr: mysql_slave_status_seconds_behind_master > 60
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL 主从延迟 {{ $value }}s"
- alert: MySQLBufferPoolLowHitRate
expr: |
(sum(rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]))
- sum(rate(mysql_global_status_innodb_buffer_pool_reads[5m])))
/ sum(rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]))
< 0.99
for: 10m
labels:
severity: warning
annotations:
summary: "MySQL Buffer Pool 命中率 {{ $value | printf \"%.3f\" }}"
- alert: MySQLDeadlocks
expr: |
increase(mysql_global_status_innodb_deadlocks[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "MySQL 检测到死锁"
- alert: MySQLSlowQueries
expr: |
increase(mysql_global_status_slow_queries[5m]) > 50
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL 慢查询激增"
- alert: MySQLHistoryListLong
expr: mysql_global_status_innodb_history_list_length > 1000
for: 10m
labels:
severity: warning
annotations:
summary: "MySQL history list length {{ $value }}"
- alert: MySQLLogWaits
expr: increase(mysql_global_status_innodb_log_waits[5m]) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL 写 redo log 等待"
- alert: MySQLGroupReplicationOffline
expr: |
count by (group_name) (
mysql_perf_schema_replication_group_members_member_state{state="ONLINE"}
) < 3
for: 2m
labels:
severity: critical
annotations:
summary: "MySQL Group Replication 成员不足"
5.11 告警规则(PostgreSQL)
groups:
- name: postgres_alerts
rules:
- alert: PostgresDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "PostgreSQL {{ $labels.instance }} 不可达"
- alert: PostgresTooManyConnections
expr: |
(sum(pg_stat_activity_count) by (instance)
/ on(instance) pg_settings_max_connections) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL 连接数使用率 {{ $value | printf \"%.1f\" }}%"
- alert: PostgresIdleInTransaction
expr: |
count(pg_stat_activity{state="idle in transaction"}) > 5
for: 2m
labels:
severity: warning
annotations:
summary: "PostgreSQL 有 {{ $value }} 个 idle in transaction 会话"
- alert: PostgresReplicationLag
expr: |
pg_replication_lag > 60
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL 复制延迟 {{ $value }}s"
- alert: PostgresLowCacheHitRate
expr: |
(sum(rate(pg_stat_database_blks_hit[5m])) by (instance)
/ (sum(rate(pg_stat_database_blks_hit[5m])) by (instance)
+ sum(rate(pg_stat_database_blks_read[5m])) by (instance))) < 0.99
for: 10m
labels:
severity: warning
annotations:
summary: "PostgreSQL 缓存命中率 {{ $value | printf \"%.3f\" }}"
- alert: PostgresDeadlocks
expr: |
increase(pg_stat_database_deadlocks[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "PostgreSQL 检测到死锁"
- alert: PostgresLongRunningQuery
expr: |
(time() - pg_stat_activity_query_start_seconds) > 600
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL 有运行超过 10 分钟的查询"
description: "query: {{ $labels.query | default \"(null)\" }}"
- alert: PostgresWALLag
expr: |
(pg_stat_replication_sent_lsn - pg_stat_replication_replay_lsn) > 100000000
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL WAL 堆积"
- alert: PostgresLockWaits
expr: |
count(pg_stat_activity{wait_event_type="Lock"}) > 10
for: 2m
labels:
severity: warning
annotations:
summary: "PostgreSQL 有 {{ $value }} 个会话在等待锁"
- alert: PatroniReplicaLag
expr: |
patroni_replica_lag > 60
for: 5m
labels:
severity: warning
annotations:
summary: "Patroni 副本延迟 {{ $value }}s"
5.12 长事务自动 Kill 脚本(MySQL)
kill_long_tx.sh:
#!/bin/bash
# 杀 MySQL 长事务(运行超过 60 秒的非复制线程)
# 注意:先 dry-run,确认再真杀
DRY_RUN=${DRY_RUN:-true}# 默认 dry-run
THRESHOLD_SECONDS=60
LOG_FILE="/var/log/kill_long_tx.log"
exec >> "${LOG_FILE}" 2>&1
echo "=== $(date '+%Y-%m-%d %H:%M:%S') run ==="
if [[ -z "${MYSQL_ROOT_PASSWORD}" ]]; then
echo "MYSQL_ROOT_PASSWORD 未设置,退出"
exit 1
fi
MYSQL_CMD="mysql -uroot -p${MYSQL_ROOT_PASSWORD} -h127.0.0.1 -B -N"
# 找出长事务
LONG_TRX=$(${MYSQL_CMD} -e "
SELECT
CONCAT_WS('\t',
trx_id,
trx_started,
TIMESTAMPDIFF(SECOND, trx_started, NOW()),
trx_mysql_thread_id,
REPLACE(REPLACE(IFNULL(trx_query, ''), '\n', ' '), '\t', ' '))
FROM information_schema.innodb_trx
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > ${THRESHOLD_SECONDS}
AND trx_mysql_thread_id != CONNECTION_ID()
" 2>/dev/null)
if [[ -z "${LONG_TRX}" ]]; then
echo "No long transactions."
exit 0
fi
echo "Found long transactions:"
echo "${LONG_TRX}"
echo "---"
if [[ "${DRY_RUN}" == "true" ]]; then
echo "DRY_RUN=true, not killing."
exit 0
fi
# 杀
echo "${LONG_TRX}" | while IFS=$'\t' read -r trx_id started duration thread_id query; do
echo "Killing thread ${thread_id} (duration ${duration}s)..."
${MYSQL_CMD} -e "KILL ${thread_id};" 2>&1
echo "Killed ${thread_id}"
done
定时跑:
# /etc/cron.d/kill-long-tx
*/5 * * * * root DRY_RUN=false /usr/local/bin/kill_long_tx.sh
权限:
sudo chmod 700 /usr/local/bin/kill_long_tx.sh
sudo chown root:root /usr/local/bin/kill_long_tx.sh
风险:KILL 线程会回滚事务,可能造成业务侧短时错误。生产环境必须先 DRY_RUN=true 观察 1-2 周,确认误杀率 < 0.1% 再开 DRY_RUN=false。
5.13 PG 长事务自动 Kill
#!/bin/bash
# PG 长事务 / idle in transaction 自动 kill
DRY_RUN=${DRY_RUN:-true}
THRESHOLD_SECONDS=300
LOG_FILE="/var/log/kill_pg_long_tx.log"
exec >> "${LOG_FILE}" 2>&1
echo "=== $(date '+%Y-%m-%d %H:%M:%S') run ==="
if [[ -z "${PGPASSWORD}" ]]; then
echo "PGPASSWORD 未设置,退出"
exit 1
fi
PIDS=$(psql -U postgres -h 127.0.0.1 -t -A -F, -c "
SELECT pid
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'idle in transaction (aborted)')
AND now() - state_change > interval '${THRESHOLD_SECONDS} seconds'
AND pid != pg_backend_pid()
AND application_name <> 'postgres_exporter'
" 2>/dev/null)
if [[ -z "${PIDS}" ]]; then
echo "No long idle in transaction."
exit 0
fi
echo "Found long idle in transaction: ${PIDS}"
if [[ "${DRY_RUN}" == "true" ]]; then
echo "DRY_RUN=true, not killing."
exit 0
fi
for pid in ${PIDS}; do
echo "Terminating PID ${pid}..."
psql -U postgres -h 127.0.0.1 -c "SELECT pg_terminate_backend(${pid});" 2>&1
done
5.14 慢查询自动分析
analyze_slow_queries.sh:
#!/bin/bash
# 解析 MySQL 慢查询日志,TOP 10 发送邮件
SLOW_LOG="/var/log/mysql/slow.log"
REPORT="/tmp/slow_query_report_$(date +%Y%m%d).txt"
TOP_N=10
MAIL_TO="dba@example.com"
# 用 pt-query-digest 解析
pt-query-digest "${SLOW_LOG}" --limit "${TOP_N}" > "${REPORT}" 2>&1
# 邮件
mail -s "[DB] Slow Query Report $(date +%Y-%m-%d)" \
"${MAIL_TO}" < "${REPORT}"
# 清理 30 天前报告
find /tmp -name "slow_query_report_*.txt" -mtime +30 -delete
5.15 pt-query-digest 完整示例
# 1. 实时分析
pt-query-digest --processlist h=localhost,u=root,p=xxx
# 2. 分析慢日志
pt-query-digest /var/log/mysql/slow.log
# 3. 输出到文件
pt-query-digest /var/log/mysql/slow.log > report.txt
# 4. 只看特定时间范围
pt-query-digest --since '2026-06-17 00:00:00' --until '2026-06-17 23:59:59' /var/log/mysql/slow.log
# 5. 看 general log
pt-query-digest --type genlog /var/log/mysql/general.log
# 6. 看 binlog(需要先解析)
mysqlbinlog --verbose /var/lib/mysql/mysql-bin.000123 | pt-query-digest --type binlog
# 7. 按用户过滤
pt-query-digest --filter '($event->{arg} =~ m/^SELECT/i)' /var/log/mysql/slow.log
# 8. 输出 JSON(便于二次处理)
pt-query-digest --output json /var/log/mysql/slow.log > slow.json
5.16 PG 慢查询分析工具对比
| 工具 | 优点 | 缺点 |
|---|---|---|
pg_stat_statements | 内建,无需额外组件 | 看不到具体参数值 |
log_min_duration_statement + grep | 能看到完整 SQL | 文本分析不便 |
pg_query_governor | 自动 SQL 拦截 | 商业 |
pgBadger | 完整 HTML 报表 | 需要解析 log |
auto_explain | 自动 EXPLAIN | log 体量大 |
pg_stat_plans | 计划级别统计 | 需额外安装 |
auto_explain 配置(自动记录慢查询的执行计划):
# postgresql.conf
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '2s'
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_format = 'json'
auto_explain.log_nested_statements = on
5.17 完整监控告警架构
┌──────────────────┐
│ MySQL/Postgres │
└────────┬─────────┘
│ (连接 + 慢日志 + binlog)
▼
┌────────────────────────────────┐
│ mysqld_exporter / postgres_ │
│ exporter / pgbouncer_export │
└────────┬───────────────────────┘
│ /metrics
▼
┌────────────────────────────────┐
│ Prometheus │
│ + AlertManager + 告警规则 │
└────────┬───────────────────────┘
│
┌────────┴─────────┐
▼ ▼
┌─────────┐ ┌─────────┐
│ Grafana │ │ 钉钉/ │
│ 面板 │ │ 飞书/ │
│ │ │ Slack │
└─────────┘ └─────────┘
▲
│ (慢查询邮件)
│
┌────────┴───────────────┐
│ 慢查询分析器 │
│ (pt-query-digest / │
│ pg_query_governor) │
└────────────────────────┘
5.18 自动巡检脚本(每日)
daily_check.sh:
#!/bin/bash
# MySQL 每日巡检
MYSQL="mysql -uroot -p${MYSQL_ROOT_PASSWORD} -h127.0.0.1 -B -N"
REPORT="/tmp/mysql_daily_$(date +%Y%m%d).txt"
exec > "${REPORT}" 2>&1
echo "=== $(date) 巡检 ==="
# 1. 连接数
echo "[连接数]"
${MYSQL} -e "SHOW STATUS LIKE 'Threads_connected'"
${MYSQL} -e "SHOW STATUS LIKE 'Threads_running'"
${MYSQL} -e "SHOW STATUS LIKE 'Max_used_connections'"
# 2. QPS / TPS
echo "[QPS/TPS]"
${MYSQL} -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Questions','Com_commit','Com_rollback')"
# 3. 缓冲池
echo "[Buffer Pool]"
${MYSQL} -e "
SELECT
Variable_name,
Variable_value
FROM performance_schema.global_status
WHERE Variable_name LIKE 'Innodb_buffer_pool_pages_%'
"
# 4. 长事务
echo "[长事务]"
${MYSQL} -e "
SELECT
trx_id,
trx_state,
TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS duration_sec,
trx_mysql_thread_id,
LEFT(trx_query, 100) AS query_preview
FROM information_schema.innodb_trx
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 10
ORDER BY trx_started
LIMIT 10
"
# 5. 复制状态
echo "[复制状态]"
${MYSQL} -e "SHOW SLAVE STATUS\G" 2>/dev/null || ${MYSQL} -e "SHOW REPLICA STATUS\G"
# 6. 表空间
echo "[表空间]"
${MYSQL} -e "
SELECT
table_schema,
ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS size_mb,
COUNT(*) AS table_count
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql','information_schema','performance_schema','sys')
GROUP BY table_schema
ORDER BY size_mb DESC
LIMIT 20
"
# 7. 慢查询数量
echo "[慢查询]"
${MYSQL} -e "SHOW GLOBAL STATUS LIKE 'Slow_queries'"
echo "=== 巡检结束 ==="
5.19 PG 自动巡检脚本
#!/bin/bash
# PG 每日巡检
PSQL="psql -U postgres -h 127.0.0.1 -A -F, -t"
REPORT="/tmp/pg_daily_$(date +%Y%m%d).csv"
exec > "${REPORT}" 2>&1
echo "metric,value"
# 1. 连接数
${PSQL} -c "SELECT 'connections', count(*) FROM pg_stat_activity"
# 2. 库级事务
${PSQL} -c "
SELECT
'db_' || datname || '_xact_commit',
xact_commit
FROM pg_stat_database
WHERE datname NOT IN ('template0','template1')
"
# 3. 长事务
${PSQL} -c "
SELECT
'long_tx_count',
count(*)
FROM pg_stat_activity
WHERE state IN ('active','idle in transaction')
AND now() - xact_start > interval '5 minutes'
"
# 4. 复制延迟
${PSQL} -c "
SELECT
'repl_lag_' || application_name,
EXTRACT(EPOCH FROM replay_lag)::int
FROM pg_stat_replication
WHERE replay_lag IS NOT NULL
"
# 5. 表膨胀(用 pg_stat_user_tables.n_dead_tup)
${PSQL} -c "
SELECT
'dead_tups_' || schemaname || '_' || relname,
n_dead_tup
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20
"
# 6. 缓存命中率
${PSQL} -c "
SELECT
'cache_hit_rate',
ROUND(sum(blks_hit)::numeric / nullif(sum(blks_hit) + sum(blks_read), 0), 4)
FROM pg_stat_database
WHERE datname NOT IN ('template0','template1')
"
六、常用命令
6.1 连接 / 状态
# 当前连接
mysql -uroot -p -e "SHOW PROCESSLIST"
# 完整进程列表(不受 truncated_ground 限制)
mysql -uroot -p -e "SHOW FULL PROCESSLIST"
# 当前事务
mysql -uroot -p -e "
SELECT * FROM information_schema.innodb_trx\G
"
# 按主机统计连接
mysql -uroot -p -e "
SELECT
SUBSTRING_INDEX(host, ':', 1) AS client_ip,
count(*) AS conn_count
FROM information_schema.processlist
GROUP BY client_ip
ORDER BY conn_count DESC
"
# PG 类似
psql -U postgres -c "
SELECT
client_addr,
state,
count(*)
FROM pg_stat_activity
GROUP BY client_addr, state
ORDER BY count DESC
"
6.2 锁 / 等待
# MySQL 当前锁等待
mysql -uroot -p -e "
SELECT
r.trx_id AS waiting_trx,
r.trx_mysql_thread_id AS waiting_thread,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx,
b.trx_mysql_thread_id AS blocking_thread,
b.trx_query AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
"
# 8.0+ 等价
mysql -uroot -p -e "
SELECT
waiting_pid,
waiting_query,
blocking_pid,
blocking_query
FROM performance_schema.data_lock_waits
JOIN performance_schema.events_statements_current w ON w.thread_id = waiting_pid
"
# PG 当前锁等待
psql -U postgres -c "
SELECT
pg_class.relname,
pg_locks.mode,
pg_stat_activity.usename,
pg_stat_activity.query,
EXTRACT(EPOCH FROM (now() - pg_stat_activity.query_start)) AS wait_sec
FROM pg_locks
JOIN pg_class ON pg_locks.relation = pg_class.oid
JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE NOT pg_locks.granted
ORDER BY query_start
"
6.3 主从状态
# MySQL 主从
mysql -uroot -p -e "SHOW SLAVE STATUS\G"
mysql -uroot -p -e "SHOW REPLICA STATUS\G"# 8.0.22+
# MySQL Group Replication
mysql -uroot -p -e "SELECT * FROM performance_schema.replication_group_members"
mysql -uroot -p -e "SELECT * FROM performance_schema.replication_group_member_stats"
# PG 主从
psql -U postgres -c "SELECT * FROM pg_stat_replication"
psql -U postgres -c "SELECT pg_is_in_recovery()"
# PG 备库回放延迟(pg 10+)
psql -U postgres -c "
SELECT
now() - pg_last_xact_replay_timestamp() AS replay_lag
"
6.4 表 / 索引
# MySQL 表大小
mysql -uroot -p -e "
SELECT
table_schema,
table_name,
ROUND((data_length + index_length) / 1024 / 1024, 2) AS total_mb,
table_rows
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql','information_schema','performance_schema','sys')
ORDER BY (data_length + index_length) DESC
LIMIT 20
"
# MySQL 索引使用情况
mysql -uroot -p -e "
SELECT
object_schema,
object_name,
index_name,
count_star,
count_read,
count_write,
count_fetch,
count_insert,
count_update,
count_delete
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE object_schema NOT IN ('mysql','sys')
ORDER BY count_star DESC
LIMIT 20
"
# PG 表大小
psql -U postgres -c "
SELECT
schemaname,
relname,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_relation_size(relid)) AS table_size,
pg_size_pretty(pg_indexes_size(relid)) AS index_size,
n_live_tup,
n_dead_tup
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20
"
# PG 索引使用
psql -U postgres -c "
SELECT
schemaname,
relname,
indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC
LIMIT 20
"
6.5 杀线程
# MySQL 杀线程
mysql -uroot -p -e "KILL 12345"
mysql -uroot -p -e "KILL CONNECTION 12345"# 同时断开连接
# PG 杀会话
psql -U postgres -c "SELECT pg_terminate_backend(12345)"
psql -U postgres -c "SELECT pg_cancel_backend(12345)"# 仅取消查询
风险:kill 不会自动通知应用,应用端会拿到连接异常,需要应用层重试。
6.6 binlog / WAL
# MySQL binlog 列表
mysql -uroot -p -e "SHOW BINARY LOGS"
mysql -uroot -p -e "SHOW MASTER STATUS"
# 解析 binlog
mysqlbinlog --verbose --base64-output=DECODE-ROWS /var/lib/mysql/mysql-bin.000123
# PG WAL 位置
psql -U postgres -c "SELECT pg_current_wal_lsn()"
psql -U postgres -c "SELECT pg_walfile_name()"
psql -U postgres -c "SELECT * FROM pg_ls_waldir()"
# PG WAL 归档状态
psql -U postgres -c "SELECT * FROM pg_stat_archiver"
6.7 性能基准
# MySQL sysbench 基准
sysbench oltp_read_write \
--mysql-host=127.0.0.1 \
--mysql-user=root \
--mysql-password=xxx \
--mysql-db=sbtest \
--tables=10 \
--table-size=1000000 \
--threads=64 \
--time=60 \
--report-interval=5 \
run
# PG pgbench 基准
pgbench -i -s 100 pgbench
pgbench -c 64 -j 8 -T 60 pgbench
七、配置示例
7.1 MySQL 主配置(生产级)
# /etc/my.cnf
[mysqld]
# 基础
user = mysql
port = 3306
bind-address = 0.0.0.0
datadir = /var/lib/mysql
socket = /var/lib/mysql/mysql.sock
pid-file = /var/run/mysqld/mysqld.pid
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
skip-name-resolve
default-time-zone = '+08:00'
# 连接
max_connections = 2000
max_user_connections = 1900
wait_timeout = 600
interactive_timeout = 600
# InnoDB
innodb_buffer_pool_size = 64G
innodb_buffer_pool_instances = 8
innodb_buffer_pool_chunk_size = 128M
innodb_log_file_size = 4G
innodb_log_files_in_group = 2
innodb_log_buffer_size = 64M
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT
innodb_file_per_table = 1
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
innodb_purge_threads = 8
innodb_change_buffering = all
innodb_adaptive_hash_index = 1
# binlog
server-id = 100
log_bin = /var/lib/mysql/mysql-bin
binlog_format = ROW
binlog_row_image = FULL
sync_binlog = 1
expire_logs_days = 7
max_binlog_size = 1G
binlog_group_commit_sync_delay = 0
# 慢查询
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 2
log_queries_not_using_indexes = 1
log_throttle_queries_not_using_indexes = 100
# performance_schema
performance_schema = ON
performance_schema_max_table_instances = 2000
# 复制
gtid_mode = ON
enforce_gtid_consistency = ON
relay_log = /var/lib/mysql/relay-bin
relay_log_recovery = ON
log_slave_updates = ON
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 8
slave_preserve_commit_order = ON
[mysqldump]
quick
quote-names
max_allowed_packet = 1G
[client]
default-character-set = utf8mb4
7.2 PostgreSQL 主配置(生产级)
# /var/lib/pgsql/16/data/postgresql.conf
# 基础
listen_addresses = '*'
port = 5432
max_connections = 1000
superuser_reserved_connections = 10
unix_socket_directories = '/var/run/postgresql'
timezone = 'Asia/Shanghai'
log_timezone = 'Asia/Shanghai'
shared_buffers = 16GB
huge_pages = try
temp_buffers = 16MB
work_mem = 64MB
hash_mem_multiplier = 2.0
maintenance_work_mem = 2GB
autovacuum_work_mem = 1GB
# 预写日志
wal_level = replica
wal_buffers = 64MB
wal_compression = on
max_wal_size = 4GB
min_wal_size = 1GB
max_wal_senders = 10
wal_keep_size = 1GB
wal_sender_timeout = 60s
wal_receiver_timeout = 60s
# 复制
max_replication_slots = 10
hot_standby = on
hot_standby_feedback = on
wal_receiver_status_interval = 10s
# 计划器
random_page_cost = 1.1
effective_cache_size = 48GB
default_statistics_target = 500
# 自动 vacuum / analyze
autovacuum = on
autovacuum_max_workers = 4
autovacuum_naptime = 30s
autovacuum_vacuum_threshold = 100
autovacuum_analyze_threshold = 100
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.025
autovacuum_vacuum_cost_limit = 1000
autovacuum_vacuum_cost_delay = 10ms
# checkpoint
checkpoint_completion_target = 0.9
checkpoint_timeout = 15min
bgwriter_lru_maxpages = 1000
bgwriter_delay = 50ms
# 日志
logging_collector = on
log_destination = 'stderr'
log_directory = 'log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1d
log_rotation_size = 100MB
log_min_duration_statement = 2000
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
log_line_prefix = '%m [%p] %q%u@%d/%a from %h '
log_statement = 'ddl'
# 监控扩展
shared_preload_libraries = 'pg_stat_statements,auto_explain'
pg_stat_statements.max = 10000
pg_stat_statements.track = top
auto_explain.log_min_duration = '5s'
auto_explain.log_analyze = on
auto_explain.log_format = 'json'
# 资源
track_io_timing = on
track_activities = on
track_counts = on
track_functions = pl
# 锁
max_locks_per_transaction = 256
deadlock_timeout = 1s
lock_timeout = 30s
idle_in_transaction_session_timeout = 10min
statement_timeout = 0
7.3 mysqld_exporter systemd 单元
# /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=Prometheus MySQL Exporter
After=network.target mysqld.service
Wants=mysqld.service
[Service]
Type=simple
User=mysqld_exporter
Group=mysqld_exporter
EnvironmentFile=-/etc/default/mysqld_exporter
ExecStart=/usr/local/bin/mysqld_exporter \
--config.my-cnf=/etc/.mysqld_exporter.cnf \
--web.listen-address=:9104 \
--web.tls-cert-file=/etc/ssl/certs/exporter.crt \
--web.tls-key-file=/etc/ssl/private/exporter.key
Restart=on-failure
RestartSec=5s
# 安全加固
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/log/mysqld_exporter
[Install]
WantedBy=multi-user.target
7.4 AlertManager 路由
# alertmanager.yml
route:
receiver: 'default'
group_by: ['alertname', 'instance', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
- match:
severity: warning
receiver: 'slack-warning'
- match_re:
severity: critical
receiver: 'dingtalk-critical'
receivers:
- name: 'default'
webhook_configs:
- url: 'http://alert-relay/alert'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: '<key>'
send_resolved: true
- name: 'slack-warning'
slack_configs:
- api_url: 'https://hooks.slack.com/xxx'
channel: '#db-alerts'
- name: 'dingtalk-critical'
webhook_configs:
- url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'
7.5 ProxySQL 配置
# /etc/proxysql.cnf
datadir="/var/lib/proxysql"
admin_variables=
{
admin_credentials="admin:xxx;cluster_admin:yyy"
mysql_ifaces="0.0.0.0:6032"
}
mysql_variables=
{
threads=4
max_connections=2048
default_query_delay=0
default_query_timeout=36000000
have_compress=true
poll_timeout=2000
interfaces="0.0.0.0:6033"
default_schema="information_schema"
stacksize=1048576
server_version="8.0.36"
connect_timeout_server=3000
monitor_username="proxysql_monitor"
monitor_password="xxx"
monitor_history=600000
monitor_connect_interval=60000
monitor_ping_interval=10000
monitor_read_only_interval=1500
monitor_read_only_timeout=500
ping_interval_server_msec=120000
ping_timeout_server=500
commands_stats=true
sessions_sort=true
connect_retries_on_failure=10
}
mysql_servers =
(
{ address="mysql-1", port=3306, hostgroup=0, max_connections=1000 },
{ address="mysql-2", port=3306, hostgroup=0, max_connections=1000 },
{ address="mysql-1", port=3306, hostgroup=1, max_connections=1000 },
{ address="mysql-2", port=3306, hostgroup=1, max_connections=1000 }
)
mysql_replication_hostgroups =
(
{ writer_hostgroup=0, reader_hostgroup=1, comment="MySQL Replication" }
)
mysql_query_rules =
(
{
rule_id=100
active=1
match_pattern="^SELECT .* FOR UPDATE$"
destination_hostgroup=0
apply=1
},
{
rule_id=200
active=1
match_pattern="^SELECT .*"
destination_hostgroup=1
apply=1
}
)
7.6 pgBouncer 配置
# /etc/pgbouncer/pgbouncer.ini
[databases]
app_db = host=pg-1 port=5432 dbname=app
app_db_ro = host=pg-2 port=5432 dbname=app
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
admin_users = pgb_admin
stats_users = pgb_stats
pool_mode = transaction
max_client_conn = 4000
default_pool_size = 20
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 600
server_lifetime = 3600
server_connect_timeout = 15
query_wait_timeout = 120
client_idle_timeout = 0
client_login_timeout = 60
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
stats_period = 60
7.7 Patroni 配置
# /etc/patroni.yml
scope: pg-cluster
namespace: /pg-cluster
name: pg-1
restapi:
listen: 0.0.0.0:8008
connect_address: pg-1:8008
etcd:
hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
postgresql:
use_pg_rewind: true
parameters:
wal_level: replica
max_wal_senders: 10
max_replication_slots: 10
postgresql:
listen: 0.0.0.0:5432
connect_address: pg-1:5432
data_dir: /var/lib/pgsql/16/data
pgpass: /tmp/pgpass0
authentication:
superuser:
username: postgres
password: xxx
replication:
username: replicator
password: xxx
parameters:
unix_socket_directories: '/var/run/postgresql'
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
八、日志或指标观察方法
8.1 MySQL 错误日志
# 默认位置
/var/log/mysqld.log
/var/log/mysql/error.log
# 关键关键字
grep -iE "error|warning|innodb|deadlock|abort|kill" /var/log/mysqld.log | tail -50
# 重启日志
grep "ready for connections" /var/log/mysqld.log
# Binlog Dump 线程问题
grep "Slave I/O" /var/log/mysqld.log
8.2 MySQL 慢查询日志分析
# 直接 grep
grep -c "Query_time:" /var/log/mysql/slow.log
# TOP 10 慢查询
pt-query-digest /var/log/mysql/slow.log --limit 10
# 看某时间段
grep "Time: 2026-06-17T0[0-3]" /var/log/mysql/slow.log | head
8.3 performance_schema 常用查询
# 当前等待事件最多的 SQL
mysql -uroot -p -e "
SELECT
digest_text,
count_star,
sum_timer_wait / 1e9 AS total_ms
FROM performance_schema.events_statements_summary_by_digest
WHERE digest_text IS NOT NULL
ORDER BY sum_timer_wait DESC
LIMIT 10
"
# 哪些表被全表扫描
mysql -uroot -p -e "
SELECT
object_schema,
object_name,
count_read,
count_fetch
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE index_name IS NULL
AND count_read > 0
ORDER BY count_read DESC
LIMIT 10
"
# 哪些索引从未被用
mysql -uroot -p -e "
SELECT
object_schema,
object_name,
index_name
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE index_name IS NOT NULL
AND count_star = 0
AND object_schema NOT IN ('mysql','sys')
ORDER BY object_schema, object_name
"
# 当前文件 IO 热点
mysql -uroot -p -e "
SELECT
file_name,
event_name,
count_star,
sum_timer_wait / 1e9 AS total_ms
FROM performance_schema.file_io_summary_by_instance
ORDER BY sum_timer_wait DESC
LIMIT 10
"
8.4 PostgreSQL 日志
# 默认位置
/var/log/postgresql/postgresql-16-main.log
# 错误日志
grep -iE "error|fatal|panic" /var/log/postgresql/postgresql-16-main.log | tail -50
# 慢查询
grep "duration:" /var/log/postgresql/postgresql-16-main.log | tail -20
# Checkpoint 记录
grep "checkpoint" /var/log/postgresql/postgresql-16-main.log | tail
# 自动 vacuum 记录
grep "automatic vacuum" /var/log/postgresql/postgresql-16-main.log | tail
# 死锁
grep "deadlock detected" /var/log/postgresql/postgresql-16-main.log
8.5 PG 关键视图巡检
# 表膨胀 Top 10
psql -U postgres -c "
SELECT
schemaname || '.' || relname AS table_name,
n_live_tup,
n_dead_tup,
ROUND(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10
"
# 长事务 Top 10
psql -U postgres -c "
SELECT
pid,
usename,
state,
now() - xact_start AS xact_duration,
LEFT(query, 100) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
AND now() - xact_start > interval '1 second'
ORDER BY xact_start
LIMIT 10
"
# 未使用索引
psql -U postgres -c "
SELECT
schemaname || '.' || relname AS table_name,
indexrelname AS index_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20
"
# 复制槽状态
psql -U postgres -c "
SELECT
slot_name,
plugin,
slot_type,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size
FROM pg_replication_slots
"
8.6 Buffer Pool 状态观察
SHOW ENGINE INNODB STATUS\G
关键段落:
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 8589934592
Dictionary memory allocated 133193
Buffer pool size 524272
Free buffers 1024
Database pages 523000
Old database pages 193245
Modified db pages 0
Free buffers / Buffer pool size < 5% 说明内存吃紧。
8.0+ 可以查 Buffer Pool 各索引命中率:
SELECT
pool_id,
hot_size,
warm_size
FROM information_schema.innodb_buffer_pool_stats;
九、排查路径
9.1 MySQL 连接数爆满
现象:应用报 Too many connections。
排查顺序:
- 看当前连接数:
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';
- 看哪些主机占用最多:
SELECT
SUBSTRING_INDEX(host, ':', 1) AS client_ip,
db,
command,
count(*)
FROM information_schema.processlist
GROUP BY client_ip, db, command
ORDER BY count(*) DESC
LIMIT 10;
- 看是否有 sleep 但不释放:
SELECT
count(*)
FROM information_schema.processlist
WHERE command = 'Sleep'
AND time > 600;
- 看连接错误原因:
SHOW STATUS LIKE 'Connection_errors%';
- 处理:
- 应用层修连接池配置(HikariCP maxLifetime)
- 临时
SET GLOBAL max_connections = 3000缓解 - 加 ProxySQL / RDS Proxy 中间层
- 杀 sleep 线程:
KILL <id>
9.2 MySQL 主从延迟
现象:备库追不上主库,业务读备库拿到旧数据。
排查顺序:
- 看延迟数值:
SHOW SLAVE STATUS\G
-- 看 Seconds_Behind_Master
-- 或 8.0.22+:
SHOW REPLICA STATUS\G
-- 看 Seconds_Behind_Source
- 看主库位点:
SHOW MASTER STATUS;
- 看 relay log 是否堆积:
SHOW SLAVE STATUS\G
-- 看 Relay_Log_Space
- 看备库回放速率:
SHOW SLAVE STATUS\G
-- 对比 Exec_Master_Log_Pos 的增长速率
- 常见原因:
- 备库配置低(CPU/磁盘差)
- 单线程回放:开启
slave_parallel_workers > 1 - 大事务:
Binlog_Group_Commit看大事务 - 网络抖动
- 主库大量 DDL
- 处理:
- 加并行复制
- 拆大事务
- 优化备库 IO(切 SSD)
- 网络专线优化
9.3 MySQL 死锁
现象:应用报 Deadlock found when trying to get lock; try restarting transaction。
排查:
SHOW ENGINE INNODB STATUS\G找LATEST DETECTED DEADLOCK- 启用
innodb_print_all_deadlocks = ON把所有死锁都打到错误日志 - 看应用代码是否正确处理死锁(必须重试)
- 加合适的索引(让锁粒度更细)
- 调整事务粒度(避免长事务)
9.4 PostgreSQL 长事务 / 锁等待
现象:应用请求 hang 住。
排查:
-- 1. 看当前 active
SELECT * FROM pg_stat_activity WHERE state = 'active';
-- 2. 看锁等待
SELECT
pg_class.relname,
pg_locks.mode,
pg_stat_activity.usename,
pg_stat_activity.query,
EXTRACT(EPOCH FROM (now() - pg_stat_activity.query_start)) AS wait_sec
FROM pg_locks
JOIN pg_class ON pg_locks.relation = pg_class.oid
JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE NOT pg_locks.granted;
-- 3. 看 idle in transaction
SELECT
pid,
usename,
state,
now() - state_change AS idle_sec,
LEFT(query, 100) AS query_preview
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'idle in transaction (aborted');
-- 4. 杀
SELECT pg_terminate_backend(<pid>);
SELECT pg_cancel_backend(<pid>);
9.5 数据库磁盘突增
现象:df -h 报警。
排查:
# 1. 看哪个目录涨
du -sh /var/lib/mysql/* | sort -h | tail
du -sh /var/lib/pgsql/* | sort -h | tail
# 2. MySQL binlog 占用
mysql -uroot -p -e "SHOW BINARY LOGS"
# 3. PG WAL 占用
psql -U postgres -c "SELECT * FROM pg_ls_waldir() ORDER BY size DESC LIMIT 10"
# 4. PG 表膨胀
psql -U postgres -c "
SELECT
schemaname || '.' || relname AS table_name,
pg_size_pretty(pg_relation_size(relid)) AS size
FROM pg_stat_user_tables
ORDER BY pg_relation_size(relid) DESC
LIMIT 20
"
# 5. 大表索引膨胀
psql -U postgres -c "
SELECT
schemaname || '.' || relname AS table_name,
indexrelname AS index_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20
"
处理:
- MySQL:
PURGE BINARY LOGS TO 'mysql-bin.000123'; - MySQL:清理
relay-log文件 - PG:
VACUUM FULL <table>(注意锁表,建议低峰) - PG:增加
wal_keep_size限制,启用归档
9.6 PG 主从切换后脑裂
Patroni + etcd 选举脑裂排查:
# 1. 看 Patroni 状态
patronictl -c /etc/patroni.yml list
# 2. 看 etcd 集群
etcdctl member list
etcdctl endpoint status --cluster -w table
# 3. 看每个 Patroni 节点 leader 标识
curl -s http://pg-1:8008/patroni | jq . | grep -E "role|state"
# 4. 检查 split-brain
for h in pg-1 pg-2 pg-3; do
echo "$h:"
curl -s http://$h:8008/patroni | jq -r '" state=\(.state) role=\(.role) timeline=\(.timeline)"'
done
每个节点应只有 1 个 leader,否则脑裂。
9.7 故障树:MySQL 慢查询爆增
MySQL 慢查询突增
├─ 业务量上涨
│ └─ 验证:QPS 同步上涨
├─ 索引失效
│ └─ 验证:EXPLAIN 走全表扫描
├─ 统计信息过期
│ └─ 验证:执行计划 rows 估计与实际偏差大
├─ 锁等待严重
│ └─ 验证:Innodb_row_lock_time 突增
├─ Buffer Pool 命中率低
│ └─ 验证:Innodb_buffer_pool_reads 突增
├─ 磁盘 IO 饱和
│ └─ 验证:iostat await > 10ms
├─ 临时表落盘
│ └─ 验证:Created_tmp_disk_tables 突增
└─ 业务慢 SQL 上线
└─ 验证:pt-query-digest TOP 1 是新 SQL
十、风险提醒
10.1 KILL 风险
KILL <thread_id>:仅取消查询,连接保留。事务会被回滚。KILL CONNECTION <thread_id>:断开连接,事务回滚。- 风险:应用端会拿到
Query execution was interrupted或Lost connection,需要应用重试。 - 必要条件:DRY_RUN 跑 1-2 周,确认无误杀再加到 cron。
10.2 DROP / TRUNCATE 风险
DROP TABLE:不可恢复,必须先备份TRUNCATE TABLE:不可回滚- 必须双确认 + 备份:
mysqldump -uroot -p --single-transaction --master-data=2 db_name table_name > /backup/table_name_$(date +%Y%m%d).sql
# 确认备份成功(> 0 KB)后再执行
mysql -uroot -p db_name -e "DROP TABLE table_name;"
10.3 性能参数风险
innodb_buffer_pool_size调整必须先看机器空闲内存max_connections增大必须看open_files_limitlong_query_time调小会导致 slow log 暴涨
10.4 复制操作风险
- 主从切换:
RESET SLAVE会清空所有复制信息 - 主从切换前必须确认备库
Seconds_Behind_Master = 0 - GTID 模式下注意
SET GTID_PURGED不要乱设
10.5 数据库版本升级
- 5.7 → 8.0 升级必须先用
mysql_upgrade --check - 升级前必须看官方 release notes 的 breaking changes
- 主从升级先升级从库,再升级主库
- 升级窗口前必须备份
10.6 Exporter 风险
- Exporter 账号密码必须从环境变量或密钥管理读取
- Exporter 账号权限最小化(只 SELECT,不要给 DELETE/DROP)
- Exporter 自身要监控(挂了也要告警)
10.7 自动 kill 风险
- 默认 DRY_RUN=true
- 必须设置白名单:复制线程、内部 DBA 会话不能 kill
- 必须有审计日志
10.8 数据迁移风险
- 走 gh-ost / pt-online-schema-change 加索引,不要直接 ALTER TABLE
- 走 mysqldumper / mydumper 并行备份,不要单线程 mysqldump
- PG 大表 ALTER 走 pg_repack,不锁表
十一、验证方式
11.1 Exporter 验证
# 端口在监听
ss -lntp | grep 9104
# 拉取指标
curl -s http://localhost:9104/metrics | head -20
curl -s http://localhost:9104/metrics | grep -E "mysql_up|mysql_global_status"
# 验证指标有数据
curl -s http://localhost:9104/metrics | grep "mysql_global_status_threads_connected"
11.2 Prometheus 验证
# 在 Prometheus 控制台执行
mysql_up
mysql_global_status_threads_connected
mysql_global_status_threads_running
rate(mysql_global_status_questions[1m])
11.3 Grafana 验证
- Dashboard 能看到数据
- Variables 下拉能选到目标
- 时间范围切换正常
- 告警状态对(绿色 / 红色)
11.4 告警演练
定期演练(每季度):
# 制造一个假的死锁场景
# 在测试库跑
mysql -uroot -p test_db -e "
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
SELECT SLEEP(30);
COMMIT;
"
# 同时在另一个会话
mysql -uroot -p test_db -e "
START TRANSACTION;
UPDATE accounts SET balance = balance + 100 WHERE id = 1;
COMMIT;
"
# 观察 alertmanager 是否触发死锁告警
11.5 自愈验证
# 1. 制造长事务
mysql -uroot -p -e "
START TRANSACTION;
SELECT SLEEP(120);
"
# 2. 跑 kill_long_tx.sh DRY_RUN=true,看输出
DRY_RUN=true /usr/local/bin/kill_long_tx.sh
cat /var/log/kill_long_tx.log
# 3. 确认无误杀后改 DRY_RUN=false
11.6 数据库健康度自检脚本
#!/bin/bash
# db_health_check.sh
if [[ -z "${MYSQL_ROOT_PASSWORD}" ]]; then
echo "MYSQL_ROOT_PASSWORD 未设置"
exit 1
fi
MYSQL="mysql -uroot -p${MYSQL_ROOT_PASSWORD} -h127.0.0.1 -B -N"
echo "=== MySQL 健康度 ==="
# 连接数使用率
USED=$(${MYSQL} -e "SHOW STATUS LIKE 'Threads_connected'" | awk '{print $2}')
MAX=$(${MYSQL} -e "SHOW VARIABLES LIKE 'max_connections'" | awk '{print $2}')
PCT=$((USED * 100 / MAX))
echo "连接使用率: ${USED}/${MAX} (${PCT}%)"
# Buffer Pool 命中率
HIT=$(${MYSQL} -e "
SELECT
ROUND(
(1 - (
VARIABLE_VALUE / (
SELECT VARIABLE_VALUE
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests'
)
)) * 100, 2)
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads'
" 2>/dev/null)
echo "Buffer Pool 命中率: ${HIT}%"
# 慢查询
SLOW=$(${MYSQL} -e "SHOW GLOBAL STATUS LIKE 'Slow_queries'" | awk '{print $2}')
echo "累计慢查询: ${SLOW}"
# 复制状态
${MYSQL} -e "SHOW SLAVE STATUS\G" 2>/dev/null | grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind"
十二、回滚方案
12.1 Exporter 升级回滚
# 1. 备份旧版本
sudo cp /usr/local/bin/mysqld_exporter /usr/local/bin/mysqld_exporter.bak
# 2. 升级
sudo systemctl stop mysqld_exporter
sudo cp mysqld_exporter-new /usr/local/bin/mysqld_exporter
sudo systemctl start mysqld_exporter
# 3. 异常回滚
sudo systemctl stop mysqld_exporter
sudo cp /usr/local/bin/mysqld_exporter.bak /usr/local/bin/mysqld_exporter
sudo systemctl start mysqld_exporter
12.2 告警规则回滚
# Prometheus 规则目录
RULES_DIR=/etc/prometheus/rules
# 1. 备份当前规则
tar czf /backup/prometheus_rules_$(date +%Y%m%d_%H%M%S).tar.gz ${RULES_DIR}
# 2. 修改规则文件
vi ${RULES_DIR}/mysql.yml
# 3. 验证语法
promtool check rules ${RULES_DIR}/*.yml
# 4. 热加载
curl -X POST http://prometheus:9090/-/reload
# 5. 异常回滚
tar xzf /backup/prometheus_rules_*.tar.gz -C /
curl -X POST http://prometheus:9090/-/reload
12.3 参数变更回滚
# 1. 备份当前 my.cnf
sudo cp /etc/my.cnf /etc/my.cnf.bak
# 2. 修改
sudo vi /etc/my.cnf
# 3. 在线应用(不重启)
mysql -uroot -p -e "SET GLOBAL innodb_buffer_pool_size = 128*1024*1024*1024;"
# 4. 异常回滚
mysql -uroot -p -e "SET GLOBAL innodb_buffer_pool_size = 64*1024*1024*1024;"
12.4 数据库版本升级回滚
- 5.7 → 8.0 一旦升级不能回退(系统表格式变了)
- 必须先做从库升级验证,确认业务 SQL 全部兼容
- 主从切换必须先在测试环境演练
12.5 自愈脚本回滚
# 1. 立即停掉定时任务
sudo rm /etc/cron.d/kill-long-tx
# 2. 跑一次 DRY_RUN 看看现在没杀是不是真没问题
DRY_RUN=true /usr/local/bin/kill_long_tx.sh
# 3. 如果业务恢复,恢复原 crontab
sudo cp /backup/cron.d/kill-long-tx /etc/cron.d/
十三、生产环境注意事项
13.1 性能开销
- mysqld_exporter 默认 1 个连接 + 多个一次性 SHOW,占用很小
- postgres_exporter 默认 1 个连接,建议给独立账号
- pgbouncer_exporter 每 60s 抓一次,影响可忽略
- performance_schema 默认开启带来 5-10% 开销,关键业务库要测
13.2 权限最小化
-- mysqld_exporter 账号
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
GRANT SELECT ON performance_schema.* TO 'exporter'@'localhost';
-- postgres_exporter 账号
GRANT pg_read_all_stats TO postgres_exporter;
13.3 凭据管理
- 密码写到
.my.cnf必须chmod 600 - systemd 单元用
EnvironmentFile引用,文件权限 600 - 生产环境用 HashiCorp Vault / 阿里云 KMS / 腾讯云密钥管理系统
- 定期轮转(季度)
13.4 跨实例标签
# Prometheus scrape config
scrape_configs:
- job_name: 'mysqld'
static_configs:
- targets: ['mysql-1:9104']
labels:
env: prod
role: master
cluster: us-east
biz: orders
- targets: ['mysql-2:9104']
labels:
env: prod
role: slave
cluster: us-east
biz: orders
13.5 告警收敛
group_by: ['alertname', 'instance']避免重复告警group_wait: 30s、group_interval: 5m避免告警风暴- 重要告警走 PagerDuty / 钉钉值班,普通走 Slack 频道
13.6 高可用 Exporter
Exporter 自身挂了也要告警:
- alert: MysqlExporterDown
expr: up{job="mysqld"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "MySQL Exporter 不可达"
Exporter 部署方式:
- 主备两台机器各跑一个
- 或加 Keepalived VIP
- 或用 consul 做服务发现
13.7 长期归档
- Prometheus 长期存储走 Thanos / Cortex / VictoriaMetrics
- 关键指标保留 1 年
- 慢查询日志归档到对象存储
13.8 多租户 / 合规
- 监控数据脱敏(密码、用户身份证号)
- 监控账号审计(每季度 review 权限)
- 跨 region 监控走专线 + 加密
十四、总结
数据库监控的核心是「在业务感知之前发现异常」。落到 Prometheus + Grafana + Exporter 这套体系,关键要把握三件事:
- 指标选对:MySQL 看 Buffer Pool 命中率、Threads_running、行锁等待、复制延迟;PG 看 cache hit rate、
idle in transaction、pg_stat_replicationlag、WAL 堆积 - 阈值定对:所有阈值都是经验值,必须结合业务基线调整;连接池 > 80%、命中率 < 99%、主从延迟 > 60s、死锁 > 0 都是常见红线
- 闭环跑通:告警 → 定位 → 修复 → 验证 → 回滚 → 复盘,每一步都要有具体动作
最容易踩的坑:
- 看 CPU 看不到 DB 真实问题(锁等待/IO 等待 CPU 不高)
- Threads_connected 和 Threads_running 混淆
SELECT FOR UPDATE走备库导致脏读- Exporter 密码写死在配置文件
- 长事务 kill 不通知应用端
数据库监控和普通的 Web 服务监控不一样:
- DB 是有状态的,重启 / 切换有成本
- DB 性能瓶颈往往在锁和 IO,不是 CPU
- DB 的”假死”(连接还在但查询慢)比真死更危险
- DB 的告警阈值必须分 P0/P1/P2/P3,不能一刀切
最后强调一点:监控不是装上 Exporter 就完了,而是要长期看、长期调。业务在变、数据量在涨、SQL 在迭代,告警阈值和指标都要相应调整。每季度做一次告警 review,看哪些告警从没用过(删)、哪些告警永远在响(调阈值或加自愈)。
DBA 真正的工作不是救火,而是让火不发生。

本文链接:https://www.yunweipai.com/49300.html





网友评论comments