当前位置: 首页 > 科技观察

使用Trace-Event解决系统不能深度睡眠的问题

时间:2023-03-15 09:12:43 科技观察

使用Trace-Event解决系统无法进入深度睡眠问题本文转载请联系邂逅Linux公众号。最近遇到一个问题,系统不能sleep到c7s,只能sleep到c3。-state的值比较大)这时候第一感觉是系统是不是很忙,用perttop查看消耗cpu的进程和hot函数:perftop-E100--stdio>perf-top.txt19。85%性能[。]__symbols__insert7.68%perf[.]rb_next4.60%libc-2.26.so[.]__strcmp_sse2_unaligned4.20%libelf-0.168.so[.]gelf_getsym3.92%perf[.]dso__load_sym3.86%libc-2.26.so[.]_int_malloc3.60%libc-2.26.so[.]__libc_calloc3.30%libc-2.26.so[.]vfprintf2.95%perf[.]rb_insert_color2.61%[kernel][k]prepare_exit_to_usermode2.51%perf[.]machine__map_x86_64_entry_trampolines2.31%perf[.]symbol__new2.22%[kernel][k]do_syscall_642.11%libc-2.26.so[.]__strlen_avx2发现只有系统中的perf工具比较耗cpu:(然后我想到了不是系统中某个进程的ghost,cpu不让休眠到c7s,这时候用traceevent监听系统中的sched_switch事件,用trace-cmd工具监听所有cpu上的sched_switch(进程切换)事件持续30秒:#trace-cmdrecord-esched:sched_switch-M-1sleep30CPU0datarecordedatoffset=0x63e000102400bytesinsizeCPU1datarecordedatoffset=0x6570008192bytesinsizeCPU2datarecordedatoffset=0x659003bytesinsizeCPUdatarecordedatoffset=0x65e00020480bytesinsize使用trace-cmdreport查看监控结果,但是查看这样的原始数据不够直观,没有统计到某个进程被切换到:#trace-cmdreportcpus=4trace-cmd-19794[001]225127.464466:sched_switch:trace-cmd:19794[120]S==>swapper/1:0[120]trace-cmd-19795[003]225127.464601:sched_switch:trace-cmd:19795[120]S==>swapper/3:0[120]sleep-19796[002]225127.464792:sched_switch:sleep:19796[120]S==>swapper/2:0[120]-0[003]225127.471948:sched_switch:swapper/3:0[120]R==>rcu_sched:11[120]rcu_sched-11[003]225127.471950:sched_switch:rcu_sched:11[120]W==>swapper/3:0[120]-0[003]225127.479959:sched_switch:swapper/3:0[120]R==>rcu_sched:11[120]rcu_sched-11[003]225127.479960:sched_switch:rcu_sched:11[120]W==>swapper/3:0[120]-0[003]225127.487959:sched_switch:swapper/3:0[120]R==>rcu_sched:11[120]rcu_sched-11[003]225127.487961:sched_switch:rcu_sched:11[120]W==>交换器/3:0[120]<空闲>-0[002]225127。491959:sched_switch:swapper/2:0[120]R==>kworker/2:2:19735[120]kworker/2:2-19735[002]225127.491972:sched_switch:kworker/2:2:19735[120]W==>swapper/2:0[120]]trace-cmd报告结果使用正则表达式过滤,然后排序统计:trace-cmdreport|grep-o'==>[^]\+:\?'|sort|uniq-c3==>irqbalance:10343==>khugepaged:4320==>ksoftirqd/0:101==>ksoftirqd/1:1818==>ksoftirqd/3:301==>kthreadd:197981==>kthreadd:24==>kworker/0:0:197851==>kworker/0:1:197365==>kworker/0:1:197985==>kworker/0:1H:36453==>kworker/0:2:1961419==>kworker/1:1:766530==>tuned:19498...发现一个可疑线程被调优,30秒内切换运行30次。其他线程是常规线程。检查此时系统是否开启了tuned服务:确实是系统tuned服务开启了,然后拉起一个名为tuned的线程。查看tuned服务的配置文件:localhost:/home/jeff#tuned-admactiveCurrentactiveprofile:sap-hanalocalhost:/home/jeff#cat/usr/lib/tuned/sap-hana/tuned.conf[main]summary=OptimizeforSAPNetWeaver,SAPHANAandHANAbasedproducts[cpu]force_latency=70找到关于cpu项,设置强制延迟时间为70秒force_latency=70,这是为了优化HANA数据库。force_latency是如何工作的?经过查找,发现这个值被设置到/dev/cpu_dma_latency中。使用lsof/dev/cpu_dma_latency,发现tuned线程确实在操作这个文件#lsof/dev/cpu_dma_latencyCOMMANDPIDUSERFDTYPEDEVICESIZE/OFFNODENAMEtuned18734root9wCHR11,6400tdev/cpu_dma_latencyLinux内核文档也解释了/dev/cpu_dma_latency文件。如果你想写它,你需要打开它并写入数据而不关闭它。如果释放文件描述符,它将返回到默认值。也证实了上面的lsof/dev/cpu_dma_latency有输出结果。https://github.com/torvalds/linux/blob/v5.8/Documentation/trace/coresight/coresight-cpu-debug.rst如PMQoS文档中所述,请求的参数将保持无效,直到文件描述被发布。例如:#exec3<>/dev/cpu_dma_latency;echo0>&3...Dosomework……#exec3<>-查看/dev/cpu_dma_latency文件的内容,确实是70,也就是(force_latency=70)localhost:/home/jeff#cat/dev/cpu_dma_latency|hexdump-Cv0000000046000000|F...|localhost:/home/jeff#echo$((0x46))70现在查看系统中cpu各个休眠状态的描述和延迟时间值:#cd/sys/devices/system/cpu/cpu0/cpuidle/#forstatein*;doecho-e\"STATE:$state\t\DESC:$(cat$state/desc)\t\NAME:$(cat$state/name)\t\LATENCY:$(cat$state/latency)\t\RESIDENCY:$(cat$state/residency)》done发现C3状态的延迟时间为33微秒,C4延迟时间为133微秒,所以(force_latency=70),系统只能休眠到C3。(延迟时间是从休眠状态唤醒到运行状态的时间)STATE:state0DESC:CPPUIDLECOREPOLLIDLENAME:POLLLATENCY:0RESIDENCY:0STATE:state1DESC:MWAIT0x00NAME:C1LATENCY:2RESIDENCY:2STATE:state2DESC:MWAIT0x01NAME:C1ELATENCY:10RESIDENCY:20STATE:state3DESC:MWAIT0x10NAME:C3LATENCY:33RESIDENCY:100STATE:state4DESC:MWAIT0x20NAME:C6LATENCY:133RESIDENCY:400STATE:state5DESC:MWAIT0x32NAME:C7sLATENCY:166RESIDENCY:500此时关闭tuned服务,然后查看/dev/cpu_dma_latency的值,变为默认2000秒localhost:/home/jeff#tuned-admofflocalhost:/home/jeff#cat/dev/cpu_dma_latency|hexdump-Cv0000000000943577|..5w|localhost:/home/jeff#echo$((0x77359400))2000000000然后验证,系统可以sle此时ep到C7s,这个问题就解决了:)解决这个问题,主要是利用linux内核本身提供的trace-event。所以任何功能都不可小觑,内核就是这样,一般看起来枯燥乏味的功能,经过一些工程师以非常认真的态度打磨后,潜力还是很大的:)