To repost this article, please contact the 邂逅Linux WeChat official account.

John Garry reported a bug to the Linux kernel community: after upgrading the kernel to v5.10-rcX, some colleagues saw tasks hang when running dd followed by sync:

Some guys internally upgraded to v5.10-rcX and start to see a hang after dd + sync for a large file:
- mount /dev/sda1 (ext4 filesystem) to directory /mnt;
- run "if=/dev/zero of=test1 bs=1M count=2000" on directory /mnt;
- run "sync"

and get:

The kernel printed the stacks of the hung tasks:

[367.912761] INFO: task jbd2/sdb1-8:3602 blocked for more than 120 seconds.
[367.919618] Not tainted 5.10.0-rc1-109488-g32ded76956b6 #948
[367.925776] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[367.933579] ... io_schedule+0x1c/0xe8
[367.957948] bit_wait_io+0x18/0x68
[367.961346] __wait_on_bit+0x78/0xf0
[367.964919] out_of_line_wait_on_bit+0x8c/0xb0
[367.969356] __wait_on_buffer+0x30/0x40
[367.973188] jbd2_journal_commit_transaction+0x1370/0x1958
[367.978661] kjournald2+0xcc/0x260
[367.982061] kthread+0x150/0x158
[367.985288] ret_from_fork+0x10/0x34
[367.988860] INFO: task sync:3823 blocked for more than 120 seconds.
[367.995102] Not tainted 5.10.0-rc1-109488-g32ded76956b6 #948
[368.001265] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[368.009067] ... +0x30c/0x670
[368.026804] schedule+0x70/0x108
[368.030025] jbd2_log_wait_commit+0xbc/0x158
[368.034290] ext4_sync_fs+0x188/0x1c8
[368.037947] sync_fs_one_sb+0x30/0x40
[368.041606] iterate_supers+0x9c/0x138
[368.045350] ksys_sync+0x64/0xc0
[368.048569] __arm64_sys_sync+0x10/0x20
[368.052398] el0_svc_common.constprop.3+0x68/0x170
[368.057177] do_el0_svc+0x24/0x90
[368.060482] el0_sync_handler+0x118/0x168
[368.064478] el0_sync+0x158/0x180

The report also noted that on CPU 100, which is bound to hardware queue 8, the dispatched and completed counts did not match, yet there was no inflight I/O. This situation caused three days and nights of agony and a keyboard covered in fallen hair.

estuary:/sys/kernel/debug/block/sda/hctx8$ cat cpu100/dispatched
...
(...30:02.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/block/sda/sda1/inflight), the inflight count is 0.

Ming Lei then suggested collecting the output of every file under /sys/kernel/debug/block/sda and /sys/block/sda/device:

hello chenxiang,
Can you collect the debugfs log via the following commands after the io hang is triggered?
1) debugfs log:
(cd /sys/kernel/debug/block/sda && find . -type f -exec grep -aH . {} \;)
2) scsi sysfs info:
(cd /sys/block/sda/device && find . -type f -exec grep -aH . {} \;)
Suppose the disk is /dev/sda.

chenxiang replied:

~$ cd /sys/kernel/debug/block/sdb && find . -type f -exec grep -aH . {} \;
...
./hctx9/tags:cleared=3891
./hctx9/tags:bits_per_word=64
./hctx9/tags:map_nr=63
./hctx9/tags:alloc_hint={3264,3265,0,3731,2462,842,0,0,1278,27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2424,0,0,0,346,3,3191,235,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,88,0,0,285,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1165,538,0,372,277,3476,0,0,0,111,0,2081,0,112,0,0,0,0,904,1127,0,0,0,113,0,0,0,0,0,0,321,0}
./hctx9/tags:wake_batch=8
./hctx9/tags:wake_index=7
./hctx9/tags:ws_active=0
./hctx9/tags:ws={
./hctx9/tags:{.wait_cnt=8, .wait=inactive},
./hctx9/tags:{.wait_cnt=8, .wait=inactive},
./hctx9/tags:{.wait_cnt=8, .wait=inactive},
./hctx9/tags:{.wait_cnt=8, .wait=inactive},
./hctx9/tags:{.wait_cnt=8, .wait=inactive},
./hctx9/tags:{.wait_cnt=8, .wait=inactive},
./hctx9/tags:{.wait_cnt=8, .wait=inactive},
./hctx9/tags:{.wait_cnt=8, .wait=inactive},
./hctx9/tags:}
./hctx9/tags:round_robin=1
./hctx9/tags:min_shallow_depth=4294967295
./hctx9/ctx_map:00000000: 00
...

Ming Lei looked through the debug output and, in about the time it takes to drink a cup of Da Hong Pao tea, suggested testing this patch:

Please try the following patch:

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 60c7a7d74852..03c6d0620bfd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1703,8 +1703,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 		break;
 	case BLK_STS_RESOURCE:
 	case BLK_STS_ZONE_RESOURCE:
-		if (atomic_read(&sdev->device_busy) ||
-		    scsi_device_blocked(sdev))
+		if (scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
 		break;
 	default:

chenxiang tested it and reported that the problem was fixed, his heart full of the pure joy of a Linux practitioner and his eyes full of reverence. In his words: I tested the patch more than 100 times in two environments (where the issue used to occur frequently), and the issue did not occur.

That settled the problem. If I, the editor, had been handling it, I would probably have asked him to trigger a crash and then slowly analyzed the dump, which is honestly a few hundred levels below this :) Pinpointing the problem from nothing but this debug information is what real expertise in the block subsystem looks like. Gaining a firm footing in a subsystem and mastering it this thoroughly is no small feat!
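A dump collected with `find . -type f -exec grep -aH . {} \;` is just a long list of `path:content` lines, so it can help to regroup it by file before reading it. The following is a minimal sketch of that idea, not the tooling anyone in the thread used. The function names `parse_grep_dump` and `tag_fields` are hypothetical, the sample lines only imitate the `hctx9/tags` output quoted above, and the split on the first colon assumes that the paths themselves contain no colon (which would not hold for some sysfs device paths).

```python
from collections import defaultdict


def parse_grep_dump(text):
    """Group `grep -aH .` output lines by file path.

    Each line is expected to look like `<path>:<content>`. The split is
    made on the first colon, so paths containing a colon are NOT handled.
    """
    files = defaultdict(list)
    for line in text.splitlines():
        path, sep, content = line.partition(":")
        if not sep:
            continue  # not a `path:content` line, e.g. a shell prompt
        files[path].append(content)
    return dict(files)


def tag_fields(lines):
    """Collect the plain `key=value` lines of one file into a dict."""
    fields = {}
    for line in lines:
        key, sep, value = line.partition("=")
        # Struct-style lines such as `{.wait_cnt=8, .wait=inactive},`
        # do not start with a plain identifier and are skipped.
        if sep and key.isidentifier():
            fields[key] = value
    return fields


# Sample lines in the format of the dump (illustrative subset only).
sample = """\
./hctx9/tags:cleared=3891
./hctx9/tags:bits_per_word=64
./hctx9/tags:map_nr=63
./hctx9/tags:wake_batch=8
./hctx9/tags:ws_active=0
./hctx9/tags:{.wait_cnt=8, .wait=inactive},
./hctx9/tags:round_robin=1
./hctx9/ctx_map:00000000: 00
"""

dump = parse_grep_dump(sample)
tags = tag_fields(dump["./hctx9/tags"])
print(tags["ws_active"], tags["wake_batch"])
```

With the dump regrouped this way, fields such as `ws_active` can be compared across all hardware queues in a loop instead of by eye.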

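The dispatched versus completed comparison mentioned in the report can also be scripted rather than checked one CPU at a time. The sketch below is only an illustration under stated assumptions: the article shows a `cpu100/dispatched` file, but the sibling file name `completed` and the idea that each file holds whitespace-separated integers are assumptions here, and the helper names and the counter values in the fake directory tree are invented for the demonstration. On a real system the function would be pointed at a directory such as `/sys/kernel/debug/block/sda/hctx8`, which typically requires root.

```python
import os
import tempfile


def read_counters(path):
    """Return the whitespace-separated integers stored in one file."""
    with open(path) as f:
        return [int(tok) for tok in f.read().split()]


def find_mismatches(hctx_dir):
    """List (cpu, dispatched, completed) where the two totals differ.

    Assumes every `cpuN` directory holds a `dispatched` and a
    `completed` file (the latter name is an assumption).
    """
    mismatches = []
    for name in sorted(os.listdir(hctx_dir)):
        cpu_dir = os.path.join(hctx_dir, name)
        if not (name.startswith("cpu") and os.path.isdir(cpu_dir)):
            continue
        dispatched = sum(read_counters(os.path.join(cpu_dir, "dispatched")))
        completed = sum(read_counters(os.path.join(cpu_dir, "completed")))
        if dispatched != completed:
            mismatches.append((name, dispatched, completed))
    return mismatches


def build_fake_hctx(root, counters):
    """Create a throwaway directory tree that imitates one hctx directory."""
    for cpu, (dispatched, completed) in counters.items():
        cpu_dir = os.path.join(root, cpu)
        os.makedirs(cpu_dir)
        with open(os.path.join(cpu_dir, "dispatched"), "w") as f:
            f.write(dispatched + "\n")
        with open(os.path.join(cpu_dir, "completed"), "w") as f:
            f.write(completed + "\n")


# Invented counters: cpu100 has one request dispatched but never completed.
with tempfile.TemporaryDirectory() as root:
    build_fake_hctx(root, {"cpu100": ("5 3", "5 2"), "cpu101": ("4 1", "4 1")})
    mismatches = find_mismatches(root)

for cpu, dispatched, completed in mismatches:
    print(f"{cpu}: dispatched={dispatched} completed={completed}")
```

A mismatch combined with an inflight count of 0, as in the report, is the kind of inconsistency that is hard to spot by hand across more than a hundred CPUs.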