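This article is reprinted from the WeChat official account 董泽润的技术笔记, written by 董泽润. To reprint it, please contact that account.

We all know that the smallest scheduling unit in k8s is the Pod. Every Pod has a so-called infra container, pause, which sets up the relevant namespaces before the other containers in the Pod are started. So what is the pause container? What does it look like, and what is it for?

Analyzing the source

Without further ado, here is the source, taken from the official pause.c [1]:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define STRINGIFY(x) #x
    #define VERSION_STRING(x) STRINGIFY(x)

    #ifndef VERSION
    #define VERSION HEAD
    #endif

    static void sigdown(int signo) {
      psignal(signo, "Shutting down, got signal");
      exit(0);
    }

    static void sigreap(int signo) {
      while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
    }

    int main(int argc, char **argv) {
      int i;
      for (i = 1; i ...

(The rest of the listing and the start of the discussion that followed it are missing from this copy; see [1] for the full source.)

Checking the man page on macOS, waitpid is indeed described there as "wait for process termination". Logging in to Ubuntu 18.04 to compare:

    ~# man waitpid
    WAIT(2)              Linux Programmer's Manual              WAIT(2)

    NAME
           wait, waitpid, waitid - wait for process to change state

In the Linux man page the description becomes "wait for process to change state"! It goes on:

    All of these system calls are used to wait for state changes in a
    child of the calling process, and obtain information about the child
    whose state has changed. A state change is considered to be: the
    child terminated; the child was stopped by a signal; or the child was
    resumed by a signal. In the case of a terminated child, performing a
    wait allows the system to release the resources associated with the
    child; if a wait is not performed, then the terminated child remains
    in a "zombie" state (see NOTES below).

It even thoughtfully provides test code:

    #include <sys/wait.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(int argc, char *argv[])
    {
        pid_t cpid, w;
        int wstatus;

        cpid = fork();
        if (cpid == -1) {
            perror("fork");
            exit(EXIT_FAILURE);
        }

        if (cpid == 0) {            /* Code executed by child */
            printf("Child PID is %ld\n", (long) getpid());
            if (argc == 1)
                pause();            /* Wait for signals */
            _exit(atoi(argv[1]));

        } else {                    /* Code executed by parent */
            do {
                w = waitpid(cpid, &wstatus, WUNTRACED | WCONTINUED);
                if (w == -1) {
                    perror("waitpid");
                    exit(EXIT_FAILURE);
                }

                if (WIFEXITED(wstatus)) {
                    printf("exited, status=%d\n", WEXITSTATUS(wstatus));
                } else if (WIFSIGNALED(wstatus)) {
                    printf("killed by signal %d\n", WTERMSIG(wstatus));
                } else if (WIFSTOPPED(wstatus)) {
                    printf("stopped by signal %d\n", WSTOPSIG(wstatus));
                } else if (WIFCONTINUED(wstatus)) {
                    printf("continued\n");
                }
            } while (!WIFEXITED(wstatus) && !WIFSIGNALED(wstatus));
            exit(EXIT_SUCCESS);
        }
    }

The child stays blocked in pause(), while the parent calls waitpid to wait for the child's state to change. Run the program in one session and send signals from another:

    ~$ ./a.out
    Child PID is 70718
    stopped by signal 19
    continued
    stopped by signal 19
    continued
    ^C

    ~# ps aux | grep a.out
    zerun.d+  70717  0.0  0.0   4512   744 pts/0    S+   06:48   0:00 ./a.out
    zerun.d+  70718  0.0  0.0   4512    72 pts/0    S+   06:48   0:00 ./a.out
    root      71155  0.0  0.0  16152  1060 pts/1    S+   06:49   0:00 grep --color=auto a.out
    ~#
    ~# kill -STOP 70718
    ~#
    ~# ps aux | grep a.out
    zerun.d+  70717  0.0  0.0   4512   744 pts/0    S+   06:48   0:00 ./a.out
    zerun.d+  70718  0.0  0.0   4512    72 pts/0    T+   06:48   0:00 ./a.out
    root      71173  0.0  0.0  16152  1060 pts/1    S+   06:49   0:00 grep --color=auto a.out
    ~#
    ~# kill -CONT 70718
    ~#
    ~# ps aux | grep a.out
    zerun.d+  70717  0.0  0.0   4512   744 pts/0    S+   06:48   0:00 ./a.out
    zerun.d+  70718  0.0  0.0   4512    72 pts/0    S+   06:48   0:00 ./a.out
    root      71296  0.0  0.0  16152  1056 pts/1    R+   06:49   0:00 grep --color=auto a.out

Sending STOP and CONT to the child moves it between the stopped (T) and sleeping (S) states, and the parent reports each change. It seems that a C function of the same name can be described, and may behave, differently on different operating systems. Nothing to be surprised about; I was simply being a novice :(

Shared namespaces

Anyone who has worked with Pods knows that containers in the same Pod can reach each other simply through localhost. If we think of a k8s cluster as a distributed operating system, a Pod is something like a process group, and its members must share certain things. So which namespaces are shared by default?

The environment is set up with minikube. First, the Pod definition file:

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      shareProcessNamespace: true
      containers:
      - name: nginx
        image: nginx
      - name: shell
        image: busybox
        securityContext:
          capabilities:
            add:
            - SYS_PTRACE
        stdin: true
        tty: true

The shareProcessNamespace field (stable since Kubernetes 1.17) controls whether the containers in a Pod share a PID namespace. It defaults to false, so it has to be set explicitly when sharing is wanted.

    ~$ kubectl attach -it nginx -c shell
    If you don't see a command prompt, try pressing enter.
    / # ps aux
    PID   USER     TIME  COMMAND
        1 root      0:00 /pause
        8 root      0:00 nginx: master process nginx -g daemon off;
       41 101       0:00 nginx: worker process
       42 root      0:00 sh
       49 root      0:00 ps aux

After attaching to the shell container we can see the processes of every container in the Pod, and pause is the init process with PID 1.

    / # kill -HUP 8
    / # ps aux
    PID   USER     TIME  COMMAND
        1 root      0:00 /pause
        8 root      0:00 nginx: master process nginx -g daemon off;
       42 root      0:00 sh
       50 101       0:00 nginx: worker process
       51 root      0:00 ps aux

As a test, sending HUP to the nginx master restarts its worker: PID 41 is gone and a new worker runs as PID 50. If the PID namespace is not shared, the main process of each container is PID 1 inside its own namespace.

What are the consequences of sharing the PID namespace? According to the documentation [2]:

The container process no longer has PID 1. Some container images refuse to start without PID 1 (for example, containers using systemd) or run commands like kill -HUP 1 to signal the container process. In Pods with a shared process namespace, kill -HUP 1 signals the Pod sandbox (/pause in the example above).

Processes are visible to other containers in the Pod. This includes everything visible in /proc, such as passwords passed as arguments or environment variables. These are protected only by regular Unix permissions.

Container filesystems are visible to other containers in the Pod through the /proc/$pid/root link. This makes debugging easier, but it also means that filesystem secrets are protected only by filesystem permissions.

On the host, look up the process IDs of nginx and sh, then inspect /proc/<pid>/ns to see their namespace IDs:

    ~# ls -l /proc/140756/ns
    total 0
    lrwxrwxrwx 1 root root 0 May  6 09:08 cgroup -> 'cgroup:[4026531835]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 ipc -> 'ipc:[4026532497]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 mnt -> 'mnt:[4026532561]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 net -> 'net:[4026532500]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 pid -> 'pid:[4026532498]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 pid_for_children -> 'pid:[4026532498]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 user -> 'user:[4026531837]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 uts -> 'uts:[4026532562]'
    ~# ls -l /proc/140879/ns
    total 0
    lrwxrwxrwx 1 root root 0 May  6 09:08 cgroup -> 'cgroup:[4026531835]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 ipc -> 'ipc:[4026532497]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 mnt -> 'mnt:[4026532563]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 net -> 'net:[4026532500]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 pid -> 'pid:[4026532498]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 pid_for_children -> 'pid:[4026532498]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 user -> 'user:[4026531837]'
    lrwxrwxrwx 1 root root 0 May  6 09:08 uts -> 'uts:[4026532564]'

Here cgroup, ipc, net, pid, and user are shared, while mnt and uts differ between the two containers. This result applies only to this test case.

Killing the pause container

Next, test how k8s handles the Pod when the pause container is killed. The environment is again minikube. First, the Pod definition file:

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      shareProcessNamespace: false
      containers:
      - name: nginx
        image: nginx
      - name: shell
        image: busybox
        securityContext:
          capabilities:
            add:
            - SYS_PTRACE
        stdin: true
        tty: true

After the Pod starts, find the process ID of pause and kill it:

    ~$ kubectl describe pod nginx
    ......
    Events:
      Type    Reason          Age                   From     Message
      ----    ------          ----                  ----     -------
      Normal  SandboxChanged  3m1s (x2 over 155m)   kubelet  Pod sandbox changed, it will be killed and re-created.
      Normal  Killing         3m1s (x2 over 155m)   kubelet  Stopping container nginx
      Normal  Killing         3m1s (x2 over 155m)   kubelet  Stopping container shell
      Normal  Pulling         2m31s (x3 over 156m)  kubelet  Pulling image "nginx"
      Normal  Pulling         2m28s (x3 over 156m)  kubelet  Pulling image "busybox"
      Normal  Created         2m28s (x3 over 156m)  kubelet  Created container nginx
      Normal  Started         2m28s (x3 over 156m)  kubelet  Started container nginx
      Normal  Pulled          2m28s                 kubelet  Successfully pulled image "nginx" in 2.796081224s
      Normal  Created         2m25s (x3 over 156m)  kubelet  Created container shell
      Normal  Started         2m25s (x3 over 156m)  kubelet  Started container shell
      Normal  Pulled          2m25s                 kubelet  Successfully pulled image "busybox" in 2.856292466s

When k8s detects that the pause container has failed, it restarts the Pod. This is not hard to understand: whether or not the PID namespace is shared, once the infra container exits the Pod has to be restarted, because the Pod's lifecycle is tied to that of the infra container.

References:

[1] pause.c: https://github.com/kubernetes/kubernetes/blob/master/build/pause/linux/pause.c
[2] Share process namespace: https://kubernetes.io/en/docs/tasks/configure-pod-container/share-process-namespace/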
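
As a supplement to the discussion of sigreap and the man page's note on zombies, here is a minimal sketch of what reaping means in practice. The helper names spawn_children and reap_all are hypothetical and do not appear in pause.c. The sketch also differs from pause.c in one deliberate way: it uses a blocking waitpid so that the result is deterministic, whereas sigreap passes WNOHANG because it runs inside a SIGCHLD handler and must not block.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork n children that exit immediately.
 * Returns the number of children actually created. */
int spawn_children(int n) {
  int created = 0;
  for (int i = 0; i < n; ++i) {
    pid_t pid = fork();
    if (pid == 0)
      _exit(0);  /* child: terminate right away and become a zombie */
    if (pid > 0)
      ++created; /* parent: count the child that was created */
  }
  return created;
}

/* Hypothetical helper: collect every terminated child.
 * Unlike sigreap in pause.c, this blocks (flags = 0 instead of WNOHANG),
 * which is an assumption made to keep the demonstration deterministic.
 * Returns the number of children reaped. */
int reap_all(void) {
  int reaped = 0;
  for (;;) {
    pid_t pid = waitpid(-1, NULL, 0);
    if (pid > 0) {
      ++reaped;  /* one zombie released */
      continue;
    }
    if (errno == EINTR)
      continue;  /* interrupted by a signal: retry */
    break;       /* no more children to wait for */
  }
  assert(errno == ECHILD); /* the loop ends only when no children remain */
  return reaped;
}
```

Calling spawn_children(3) and then reap_all() from a main function should yield 3 in both cases, after which a further waitpid fails with ECHILD because no child, zombie or otherwise, is left.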