当前位置: 首页 > 科技观察

GoCmd服务无法退出的小坑

时间:2023-03-18 11:50:56 科技观察

本文转载自微信公众号《董泽润的技术笔记》,作者董泽润。转载本文请联系董泽润技术笔记公众号。以前公司的情况。先说说背景吧。该服务在每台服务器上启动代理,用户将在指定的机器上执行任务,并将结果返回到网页。执行任务是用户自定义的脚本,通常是shell或python,会不断产生子进程和孙进程,直到执行完成或超时被杀死。最近发现的问题,经常有任务一直在运行,但实际上已经超时被kill了,输出还没有写入系统。看不到任务的执行状态,无法登录机器。下标卡在http调用上。再看这个脚本,pythonrequests默认是不设置超时的。。。总结一下现象:agent使用gocmd启动子进程,子进程也会启动孙子进程,孙子进程阻塞为一些原因。这时候如果子进程因为超时被agentkill杀掉,agent还处于wait状态。复现环境为go版本go1.16.5linux/amd64,agent使用exec.CommandContext启动任务,设置ctx超时为30s,并将结果写入bytes.Buffer,最后打印。一个简化的例子如下:~/zerun.dong/code/gotest#catwait.gopackagemainimport("bytes""context""fmt""os/exec""time")funcmain(){ctx,cancelFn:=context.WithTimeout(context.Background(),time.Second*30)defercancelFn()cmd:=exec.CommandContext(ctx,"./sleep.sh")varbbytes.Buffercmd.Stdout=&b//剧透,这里坑cmd.Stderr=&bcmd.Start()cmd.Wait()fmt.Println("recive:",b.String())}这是sleep.sh,模拟一个子进程#!/bin/shecho"insleep"sh./sleep1.sh这是sleep1.sh模拟孙子进程,这里sleep1000被屏蔽#!/bin/shsleep1000###现象启动测试等待程序,查看psaxjf|628746Ss00:00|\_-bash32818285312853132818pts/628746S00:00|\_strace./wait28531285432853132818pts/628746Sl00:00||\_./wait28543285592853132818pts/628746S00:00||\_/bin/sh/root/dongzerun/sleep.sh28559285602853132818pts/628746S00:00||\_sh/root/dongzerun/sleep1.sh28560285632853132818pts/628746S00:00||\_sleep1000等过了30s,通过psaxjf|less查看2468326903269032690?-1Ss00:00\_sshd:root@pts/632690328183281832818pts/636192Ss00:00|\_-bash32818285312853132818pts/636192S00:00|\_strace./wait28531285432853132818pts/636192Sl00:00||\_./wait1285602853132818pts/636192S00:00sh/root/dongzerun/sleep1.sh28560285632853132818pts/636192S00:00\_sleep1000通过上面的case可以看到sleep1.sh变成了孤儿进程,被init14退出了wait3.声称,那他在做什么???这时分析就卡住了,求助于我们的strace方法,查看等待程序epoll_ctl(4,EPOLL_CTL_DEL,6,{0,{u32=0,u64=0}})=0close(6)=0futex(0xc420054938,FUTEX_WAKE,1)=1waitid(P_PID,28559,{si_signo=SIGCHLD,si_code=CLD_KILLED,si_pid=28559,si_status=SIGKILL,si_utime=0,si_stime=0},WEXITED|WNOWAIT,NULL)=0卡在这里大约30s---SIGCHLD{si_signo=SIGCHLD,si_code=CLD_KILLED,si_pid=28559,si_status=SIGKILL,si_utime=0,si_stime=0}---rt_sigreturn()=0futex(0x9a0378,FUTEX_WAKE,1)=1futex(0x9a02b0,FUTEX_WAKE,1)=1wait4(28559,[{WIFSIGNALED(s)&&WTERMSIG(s)==SIGKILL}],0,{ru_utime={0,0},ru_stime={0,0},...})=28559futex(0x9a0b78,FUTEX_WAIT,0,NULL通过go源码可以看到goexecwait的时候会先执行waitid,阻塞在这里,然后wait4等待最后的退出结果。不明白这两个为什么要等。。。不过最后还是卡在了futex这里,貌似在等什么资源???打开golangpprof,再次运行程序,pprofgofunc(){err:=http.ListenAndServe(":6060",nil)iferr!=nil{fmt.Printf("failedtostartpprofmonitor:%s",err)}}()curlhttp://127.0.0.1:6060/debug/pprof/goroutine?debug=2goroutine1[chanreceive]:os/exec.(*Cmd).Wait(0xc42017a000,0x7c3d40,0x0)/usr/local/go/src/os/exec/exec.go:454+0x135main.main()/root/dongzerun/wait.go:32+0x167程序没有退出,一直卡在exec。go:454行代码,查看源码://WaitreleasesanyresourcesassociatedwiththeCmd.func(c*Cmd)Wait()error{......state,err:=c.Process.Wait()ifc.waitDone!=nil{close(c.waitDone)}c.ProcessState=statevarcopyErrorerrorforrangec.goroutine{//卡在这里iferr:=<-c.errch;err!=nil&©Error==nil{copyError=err}}c.closeDescriptors(c.closeAfterWait)......returncopyError}通过源码分析,程序wait卡在<-c.errch获取chan数据那么errch是怎么产生的呢?看cmd.Start的源码,go将cmd.Stdin、cmd.Stdout、cmd.Stderr组织成*os.File,依次写入到数组childFiles中。这个数组索引对应子进程的0,1,2Descriptor,即子进程的标准输入、输出、错误typeFfunc(*Cmd)(*os.File,error)for_,setupFd:=range[]F{(*Cmd).stdin,(*Cmd).stdout,(*Cmd).stderr}{fd,err:=setupFd(c)iferr!=nil{c.closeDescriptors(c.closeAfterStart)c.closeDescriptors(c.closeAfterWait)returnerr}c.childFiles=append(c.childFiles,fd)}c.childFiles=append(c.childFiles,c.ExtraFiles...)varerrorrorc.Process,err=os.StartProcess(c.Path,c.argv(),&os.ProcAttr{Dir:c.Dir,Files:c.childFiles,Env:dedupEnv(c.envv()),Sys:c.SysProcAttr,})执行setupFd时,会有一个key操作,开启pipe管道并封装一个Anonymousfunc,作用是将子进程的输出写入pipe或者将pipe数据写入子进程的标准输入,最后关闭pipe的匿名函数最后执行func(c*Cmd)stdin()(f*os.File,errerror){ifc.Stdin==nil{f,err=os.Open(os.DevNull)iferr!=nil{return}c.closeAfterStart=append(c.closeAfterStart,f)return}iff,ok:=c.Stdin.(*os.File);ok{returnf,nil}pr,pw,err:=os.Pipe()iferr!=nil{返回}c.closeAfterStart=append(c.closeAfterStart,pr)c.closeAfterWait=append(c.closeAfterWait,pw)c.goroutine=append(c.goroutine,func()error{_,err:=io.Copy(pw,c.Stdin)ifskip:=skipStdinCopyError;skip!=nil&&skip(err){err=nil}iferr1:=pw.Close();err==nil{err=err1}returnerr})returnpr,nil}重新运行测试用例并使用lsof查看哪些进程是打开资源root@nb1963:~/dongzerun#psaux|grepwaitroot45310.00.01221806520pts/6Sl17:240:00./waitroot47260.00.0104842144pts/6S+17:240:00grep--color=autowaitroot@nb1963:~3root@dongze1run#6:~/dongzerun#psaux|grepsleeproot45430.00.04456688pts/6S17:240:00/bin/sh/root/dongzerun/sleep.shroot45480.00.04456760pts/6S17:240:00sh/root/dongzerun/sleep1.shroot45500.007.08759:6240:00sleep1000root47840.00.0104802188pts/6S+17:240:00grep--color=autosleeproot@nb1963:~/dongzerun#root@nb1963:~/dongzerun#lsof-p4531COMMANDPIDUSERFDTYPEDEVICESIZE/OFFNODENAMEwait4531root0wCHR1,30t01029/dev/nullwait4531root1wREG8,1943714991345/root/dongzerun/nohup.outwait4531root2wREG8,1943714991345/root/dongzerun/nohup.outwait4531root3uIPv620055682150t0TCP*:6060(LISTEN)wait4531root4u00000,1009076anon_inodewait4531root5rFIFO0,90t02005473170piperoot@nb1963:~/dongzerun#lsof-p4543COMMANDPIDUSERFDTYPEDEVICESIZE/OFFNODENAMEsleep.sh4543root0rCHR1,30t01029/dev/nullsleep.sh4543root1wFIFO0,90t02005473170pipesleep.sh4543root2wFIFO0,90t02005473170pipesleep.sh4543root10rREG8,1554993949/root/dongzerun/sleep.shroot@nb1963:~/dongzerun#lsof-p4550COMMANDPIDUSERFDTYPEDEVICESIZE/OFFNODENAMEsleep4550rootmemREG8,116076649179617/usr/lib/locale/locale-archivesleep4550root0rCHR1,30t01029/dev/nullsleep4550root1wFIFO0,90t02005473170pipesleep4550root2wFIFO0,90t02005473170pipe原因总结孙进程启动后,默认会继承父进程打开的文件描述符,即节点2005473170的管道,在父进程kill-9时会清理资源并关闭打开的文件,但关闭只是引用计数减1。实际上,孙子进程仍然打开管道。回头看代理代码c.goroutine=append(c.goroutine,func()error{_,err:=io.Copy(pw,c.Stdin)ifskip:=skipStdinCopyError;skip!=nil&&skip(err){err=nil}iferr1:=pw.Close();err==nil{err=err1}returnerr})然后当子进程执行结束后,gocmd执行这个匿名函数的io.Copy,读取输出数据子进程,从不数据可读,没有超时。copy卡顿的原因在这里找到,解决办法也在那里。当子进程启动孙子进程时,加上CloseOnEec标志,但不现实。这取决于孙进程的输出日志。io。kill进程组,此时管道会真正关闭,触发io.Copy返回最终方案3,简化代码如下,主要有两个变化:SysProcAttr配置Setpgid,让子进程和孙子进程有独立的进程Groupid,也就是子进程的pid。syscall.Kill(-cmd.Process.Pid,syscall.SIGKILL)指定杀死进程时的进程组funcRun(instancestring,envmap[string]string)bool{var(cmd*exec.Cmdproc*ProcesssysProcAttr*syscall.SysProcAttr)t:=time.Now()sysProcAttr=&syscall.SysProcAttr{Setpgid:true,//让子进程有自己的pgid,相当于子进程的pidCredential:&syscall.Credential{Uid:uint32(uid),gid:uint32(gid),},}//超时控制ctx,cancel:=context.WithTimeout(context.Background(),time.Duration(j.Timeout)*time.Second)defercancel()ifj.ShellMode{cmd=exec.Command("/bin/bash","-c",j.Command)}else{cmd=exec.Command(j.cmd[0],j.cmd[1:]...)}cmd.SysProcAttr=sysProcAttrvarbbytes.Buffercmd.Stdout=&bcmd.Stderr=&biferr:=cmd.Start();err!=nil{j.Fail(t,instance,fmt.Sprintf("%s\n%s",b.String(),err.Error()),env)返回nfalse}waitChan:=make(chanstruct{},1)deferclose(waitChan)//超时杀死进程组或正常退出gofunc(){select{case<-ctx.Done():log.Warnf("timeoutkilljob%s-%s%spid:%d",j.Group,j.ID,j.Name,cmd.Process.Pid)syscall.Kill(-cmd.Process.Pid,syscall.SIGKILL)case<-waitChan:}}()iferr:=cmd.Wait();err!=nil{j.Fail(t,instance,fmt.Sprintf("%s\n%s",b.String(),err.Error()),env)returnfalse}returntrue}但是这种方法也有一个局限性。目前仅适用于类Linux平台。问题开始了,python请求库没有到处写超时...