[PHP Troubleshooting] 2018-07-02 Analysis of the FPM Idle Drop

Time: 2023-03-30 02:03:19 PHP

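Hitch (顺风车) Operations R&D Team, Huang Tao

Symptom

One machine has recently been triggering cpu-idle drop alerts, as shown in the monitoring chart.

Cause analysis

Review the monitoring metrics for that period (php-fpm-idle, cpu-idle, io-wait, io-write, and so on):

(1) php-fpm-idle bottomed out abruptly twice today, once around 12:00 and once around 16:30. The weekly view shows similar sudden drops (see chart).
(2) io-wait spiked suddenly at 11:58.
(3) io-write also shows a burst of writes at 11:58.
(4) cpu-idle dipped briefly at that moment and then climbed sharply, although across the whole week the curve stays between 70 and 80. The reason for the abrupt change is analyzed in the next point.
(5) Hypothesis: a large volume of log output at that time drove io-wait up, and the php-fpm processes were held up by slow file writes, so overall responses became too slow and php-fpm-idle hit bottom. When many processes write to the same file, large numbers of concurrent writers end up waiting on each other and block.

Verifying the hypothesis

(1) Check the php-fpm slow log for that period to see where the processes were blocked. Almost all of them were blocked in fwrite calls.
(2) Check the size of the application log trace.log over time. The period in which the log file was largest is exactly the period in which fpm-idle dropped most severely.
(3) Use the sar command to verify disk writes at that time. During the drop there was indeed a huge write volume: wr_sec/s rose from a baseline of a few hundred per second to several hundred thousand.

Root cause and suggestions for optimizing log writes

There are two possible causes:

(1) Requests surged at that time.
(2) Requests did not surge, but some requests triggered unreasonable logging.

Checking cause 1: requests per second during the incident (count, then access-log timestamp; entries are in log order, so some timestamps repeat or appear out of sequence):

     32 [02/Jul/2018:12:01:12
     18 [02/Jul/2018:12:01:13
     18 [02/Jul/2018:12:01:14
     42 [02/Jul/2018:12:01:15
     30 [02/Jul/2018:12:01:16
     35 [02/Jul/2018:12:01:17
     26 [02/Jul/2018:12:01:18
     30 [02/Jul/2018:12:01:19
      1 [02/Jul/2018:12:01:22
    108 [02/Jul/2018:12:01:22 (this entry and the one for 12:01:24 are garbled in the source)
     17 [02/Jul/2018:12:01:25
      1 [02/Jul/2018:12:01:27
      1 [02/Jul/2018:12:01:29
      1 [02/Jul/2018:12:01:30
      9 [02/Jul/2018:12:01:33
      1 [02/Jul/2018:12:01:31
      1 [02/Jul/2018:12:01:32
    146 [02/Jul/2018:12:01:33
     62 [02/Jul/2018:12:01:34
     44 [02/Jul/2018:12:01:35
      1 [02/Jul/2018:12:01:37
      1 [02/Jul/2018:12:01:38
      1 [02/Jul/2018:12:01:41
      2 [02/Jul/2018:12:01:44
      2 [02/Jul/2018:12:01:50
      1 [02/Jul/2018:12:01:45
     12 [02/Jul/2018:12:01:50
      2 [02/Jul/2018:12:01:45
      7 [02/Jul/2018:12:01:50
      1 [02/Jul/2018:12:01:46
      1 [02/Jul/2018:12:01:50
      1 [02/Jul/2018:12:01:46
     15 [02/Jul/2018:12:01:50
      7 [02/Jul/2018:12:01:48
      2 [02/Jul/2018:12:01:50
      1 [02/Jul/2018:12:01:48
    342 [02/Jul/2018:12:01:50
     65 [02/Jul/2018:12:01:51
     46 [02/Jul/2018:12:01:52
     54 [02/Jul/2018:12:01:53
      1 [02/Jul/2018:12:01:55
      1 [02/Jul/2018:12:01:56
      1 [02/Jul/2018:12:01:57
      1 [02/Jul/2018:12:01:59
     16 [02/Jul/2018:12:02:03
      1 [02/Jul/2018:12:02:01
      1 [02/Jul/2018:12:02:02
     42 [02/Jul/2018:12:02:03
      1 [02/Jul/2018:12:02:02
    187 [02/Jul/2018:12:02:03
     39 [02/Jul/2018:12:02:04
     40 [02/Jul/2018:12:02:05
     25 [02/Jul/2018:12:02:06
     44 [02/Jul/2018:12:02:07
     29 [02/Jul/2018:12:02:08

Under normal conditions QPS is around 30, but during the incident it was very unstable, swinging widely between high and low. For example, 12:01:50 shows 342 requests, while the preceding ten seconds were mostly in single digits. Why? Requests in the preceding tens of seconds were blocked and could not be answered until 12:01:50. Overall traffic did not increase significantly, so cause 1 is ruled out.

Checking cause 2: look up the log entries by traceId for that period and inspect what was written. The write turned out to be enormous: a single line is 119 KB, and 33555 such lines were written, for a total of 33555 × 119 KB = 3993045 KB ≈ 3899 MB. It is almost certain that this log line is the problem.

Optimization suggestion: limit the length of strings in the underlying logging class to avoid bulk writes like this.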
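
The optimization suggestion above (capping string length in the underlying logging class) can be illustrated with a minimal sketch. The service in the article is PHP; this sketch uses Python only to show the idea. The function names, the 4096-byte cap, and the truncation marker are all assumptions made for illustration and do not come from the original logging class.

```python
# Hypothetical cap: the article does not say what limit was chosen.
MAX_LOG_BYTES = 4096


def truncate_log_message(message: str, max_bytes: int = MAX_LOG_BYTES) -> str:
    """Cap a log message at max_bytes of UTF-8 and note how much was dropped.

    The returned string is slightly longer than max_bytes because of the
    appended marker; the payload itself never exceeds the cap.
    """
    encoded = message.encode("utf-8")
    if len(encoded) <= max_bytes:
        return message
    # Cut on the byte budget, then discard any partial multi-byte character
    # left at the end of the slice.
    kept = encoded[:max_bytes].decode("utf-8", errors="ignore")
    dropped = len(encoded) - len(kept.encode("utf-8"))
    return f"{kept}...[truncated {dropped} bytes]"


def estimated_volume_kb(lines: int, kb_per_line: float) -> float:
    """Total log volume in KB for a given line count and per-line size."""
    return lines * kb_per_line


if __name__ == "__main__":
    # Figures from the incident: 33555 lines of 119 KB each.
    before_kb = estimated_volume_kb(33555, 119)
    # Same line count with the assumed 4 KB cap (marker overhead ignored).
    after_kb = estimated_volume_kb(33555, MAX_LOG_BYTES / 1024)
    print(f"before: {before_kb / 1024:.0f} MB, after: {after_kb / 1024:.0f} MB")
```

With the assumed 4 KB cap, the same 33555 lines would amount to roughly 131 MB instead of about 3899 MB. The right cap depends on how much of each message is actually needed for debugging, so treat the number as a placeholder.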