
Analyzing a CPU spike on a .NET entertainment chat streaming platform

Time: 2023-03-18 22:16:56 Tech Observation

I. Background

1. The story

A while ago a friend added me on WeChat and said his program's CPU had gone straight to 100%, and asked me to help find out what was going on. Haha, a CPU that maxes out without warning puts a lot of pressure on a programmer. I asked him to capture 2 dumps while the CPU was high and send them to me for analysis.

II. WinDbg analysis

1. Was the CPU really maxed out?

To avoid heading off in the wrong direction, always start with the !tp command to verify that the CPU really is high.

    0:000> !tp
    CPU utilization: 100%
    Worker Thread: Total: 21 Running: 7 Idle: 0 MaxLimit: 32767 MinLimit: 4
    Work Request in Queue: 3
        AsyncTimerCallbackCompletion TimerInfo@00000000042d2430
        AsyncTimerCallbackCompletion TimerInfo@00000000042d2f90
        AsyncTimerCallbackCompletion TimerInfo@000000000420c150
    --------------------------------------
    Number of Timers: 0
    --------------------------------------
    Completion Port Thread:Total: 18 Free: 9 MaxFree: 8 CurrentLimit: 18 MaxLimit: 1000 MinLimit: 4

The output shows that it really is 100%, and work requests are piling up in the queue as well. With that confirmed, the next question is who caused it.

2. Who is driving the CPU up?

As usual, the first suspect is the GC. Use !t to list the threads and check whether any of them carries a GC marker.

    0:000> !t
    ThreadCount:      53
    UnstartedThread:  0
    BackgroundThread: 53
    PendingThread:    0
    DeadThread:       0
    Hosted Runtime:   no
                                                                                                 Lock
           ID OSID ThreadOBJ        State   GC Mode    GC Alloc Context                  Domain           Count Apt Exception
       4    1 1240 00000000021cdf30   2a220 Preemptive 0000000000000000:0000000000000000 00000000021d94c0 0     MTA
      23    2 4db4 00000000041cdaa0   2b220 Preemptive 0000000000000000:0000000000000000 00000000021d94c0 0     MTA (Finalizer)
    ...
      65  156 22f4 000000000b1a3f60 8029220 Preemptive 00000004527751F0:0000000452775EE8 00000000021d94c0 0     MTA (Threadpool Completion Port)
      66  205 2ef8 000000000b1a1080 8029220 Preemptive 0000000157641DE0:00000001576435B0 00000000021d94c0 0     MTA (Threadpool Completion Port)
    ...

There is no GC marker anywhere in the output, so the spike was not triggered by the GC. What next? Generally speaking, high CPU is driven by threads, so the next step is to look at the CPU specification and at each thread's stack for new clues. The stacks can be dumped with ~*e !clrstack.

    0:000> !cpuid
    CP  F/M/S   Manufacturer   MHz
     0  6,79,1  <unavailable>  2299
     1  6,79,1  <unavailable>  2299
     2  6,79,1  <unavailable>  2299
     3  6,79,1  <unavailable>  2299

    0:000> ~*e !clrstack
    OS Thread Id: 0x2cc4 (68)
            Child SP               IP Call Site
    000000000c14e758 00007ffadeb86e4a [GCFrame: 000000000c14e758]
    000000000c14e840 00007ffadeb86e4a [GCFrame: 000000000c14e840]
    000000000c14e878 00007ffadeb86e4a [HelperMethodFrame: 000000000c14e878] System.Threading.Monitor.Enter(System.Object)
    000000000c14e970 00007ffaceb40491 System.Net.ConnectionGroup.Disassociate(System.Net.Connection) [f:\dd\NDP\fx\src\net\System\Net\_ConnectionGroup.cs @ 148]
    000000000c14e9d0 00007ffaceb3fc93 System.Net.Connection.PrepareCloseConnectionSocket(System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3048]
    000000000c14eaa0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3119]
    000000000c14eb00 00007ffacebc4118 System.Net.Connection.ReadComplete(Int32, System.Net.WebExceptionStatus) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3387]
    000000000c14eb80 00007ffacebe3dc5 System.Net.LazyAsyncResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 415]
    000000000c14ebe0 00007ffacebe3d07 System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 368]
    000000000c14ec20 00007ffacf3a476f System.Net.Security._SslStream.StartFrameBody(Int32, Byte[], Int32, Int32, System.Net.AsyncProtocolRequest)
    000000000c14ec80 00007ffacebb3ed8 System.Net.Security._SslStream.ReadHeaderCallback(System.Net.AsyncProtocolRequest) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_SslStream.cs @ 1007]
    000000000c14ece0 00007ffacebae5ee System.Net.AsyncProtocolRequest.CompleteRequest(Int32) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_HelperAsyncResults.cs @ 142]
    000000000c14ed10 00007ffacf3a3567 System.Net.FixedSizeReader.CheckCompletionBeforeNextRead(Int32)
    000000000c14ed40 00007ffacebae507 System.Net.FixedSizeReader.ReadCallback(System.IAsyncResult) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_FixedSizeReader.cs @ 148]
    000000000c14ed90 00007ffacebe3dc5 System.Net.LazyAsyncResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 415]
    000000000c14edf0 00007ffadcbe3a63 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 954]
    000000000c14eec0 00007ffadcbe38f4 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 902]
    000000000c14eef0 00007ffadcbe38c2 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 891]
    000000000c14ef40 00007ffaceba60cf System.Net.ContextAwareResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_ContextAwareResult.cs @ 463]
    000000000c14ef90 00007ffacebe3d07 System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 368]
    000000000c14efd0 00007ffaceba5e2f System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [f:\dd\NDP\fx\src\net\System\Net\Sockets\_BaseOverlappedAsyncResult.cs @ 399]
    000000000c14f040 00007ffadcc2ffef System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [f:\dd\ndp\clr\src\bcl\system\threading\overlapped.cs @ 135]
    000000000c14f1f0 00007ffade9a6d93 [GCFrame: 000000000c14f1f0]

    OS Thread Id: 0x4ad4 (75)
            Child SP               IP Call Site
    ...
    000000000c94e5a0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef)
    ...

    OS Thread Id: 0x1d70 (80)
            Child SP               IP Call Site
    ...
    000000000d24e3a0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3119]
    ...

Judging from the thread stacks, this machine has 4 cores, which matches exactly the 4 threads sitting in HandleError. It looks like something is wrong on the network side, so let's switch to thread 80 and see whether there is an exception object on its stack.

    0:000> ~80s
    clr!AwareLock::Contention+0x194:
    00007ffa`deb86e40 4883e801        sub     rax,1

    0:080> !mdso
    Thread 80:
    Location          Object            Type
    ------------------------------------------------------------
    000000000d24e098  000000015765e028  System.Net.WebException
    000000000d24e0f8  0000000340b07110  System.Collections.ArrayList
    000000000d24e110  000000015765e2b8  System.Net.HttpWebRequest[]
    000000000d24e1c0  0000000340b070b8  System.Net.ConnectionGroup
    000000000d24e258  0000000144a79678  System.Net.Connection

    0:080> !mdt 000000015765e028
    000000015765e028 (System.Net.WebException)
        _className:NULL (System.String)
        _exceptionMethod:NULL (System.Reflection.MethodBase)
        _exceptionMethodString:NULL (System.String)
        _message:000000015765df70 (System.String) Length=77, String="The underlying connection was closed: The connection was closed unexpectedly."
        ...

Sure enough, there is a System.Net.WebException, and its message says the connection was closed. Here I made a bold guess: could the high CPU be caused by exceptions being thrown at a very high frequency? To check, count the WebException objects on the managed heap.

    0:080> !dumpheap -stat
    Statistics:
                  MT    Count    TotalSize Class Name
    ...
    00007ffacecc38b0    13315      2343440 System.Net.WebException
    00007ffadcdf6570    11369      1909992 System.IO.IOException
    00007ffadcdf5fb8    13380      2247840 System.ObjectDisposedException
    ...

That many exceptions is alarming. Conveniently, my friend captured two dumps, so the second one can be used for comparison.

    0:048> !dumpheap -stat
    Statistics:
                  MT    Count    TotalSize Class Name
    00007ffacecc38b0    26745      4707120 System.Net.WebException
    00007ffadcdf6570    26722      4489296 System.IO.IOException
    00007ffadcdf5fb8    28745      4829160 System.ObjectDisposedException

Within 2 minutes the number of exception objects grew by about 40,000 in total, which confirms that the program really is throwing exceptions like crazy. On Windows, both hardware and software exceptions are handled by the Windows SEH exception handling framework, which involves switching between user mode and kernel mode. Throwing at this rate will inevitably drive the CPU up. The cause is finally found. The next step is to find what triggers it.

3. Who triggers the exceptions?

Looking back at the call stack, everything under HandleError is underlying library code. The PerformIOCompletionCallback frame at the bottom of the stack shows that the work runs on an IO thread, and it ends up on an IO thread because the request was issued asynchronously. Since it is asynchronous, there should also be a lot of OverlappedData objects.

    0:080> !gchandles -stat
    Statistics:
                  MT    Count    TotalSize Class Name
    00007ffadc6f7b98    14511      1625232 System.Threading.OverlappedData
    Total 17550 objects

    Handles:
        Strong Handles:       426
        Pinned Handles:       23
        Async Pinned Handles: 14511
        Ref Count Handles:    24
        Weak Long Handles:    2430
        Weak Short Handles:   132
        SizedRef Handles:     4

This means roughly 15,000 asynchronous requests are waiting for a response at this moment, which is a fairly heavy load. It still does not point to the user code behind the exceptions, so the only option is to find out who issued which request.

    0:080> !mdso
    Thread 80:
    Location          Object            Type
    ------------------------------------------------------------
    ...
    000000000d24e488  0000000358c57918  System.Net.HttpWebRequest
    000000000d24e2e8  00000001407b5b40  System.String  "net_io_readfailure"
    ...

    0:080> !mdt -r:2 0000000358c57918
    0000000358c57918 (System.Net.HttpWebRequest)
        _Uri:0000000358c57210 (System.Uri)
            m_String:00000002407ee430 (System.String) Length=98, String="https://api.xxxx/peer_messages"
    ...

The request URL is https://api.xxxx/peer_messages, which is a third-party API. Because the underlying connection was closed, the read ended in net_io_readfailure.

Putting all the information together: when the request volume gets large, calls to https://api.xxxx/peer_messages start to fail because the remote side closes the underlying connection. The client then gets a flood of exceptions in its IO callbacks (IOException: Unable to read data from the transport connection: The connection was closed.), about 40,000 in total within 2 minutes, and that drives the CPU up. I passed this on to my friend and asked him to focus on the https://api.xxxx/peer_messages calls.

III. Summary

This production incident was mainly caused by an excessive request volume at peak time. The socket connections were closed for some reason, which produced a large number of exceptions in the asynchronous callbacks. The fix is to throttle on the calling side. According to my friend, after cutting the unnecessary calls to https://api.xxxx/peer_messages the CPU spike has not come back.
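
The caller-side throttling mentioned in the summary can be illustrated with a simple cap on concurrent requests. The sketch below is only a language-neutral illustration written in Python, not the friend's actual .NET change: `Throttle`, `call_peer_messages`, `send_burst`, the `MAX_IN_FLIGHT` value, and the simulated latency are all hypothetical. The idea is that a semaphore bounds how many calls to the third-party endpoint are in flight at once, so a burst waits locally instead of turning into thousands of pending asynchronous reads.

```python
import asyncio

MAX_IN_FLIGHT = 8  # hypothetical cap; tune to what the third-party API tolerates


class Throttle:
    """Caps the number of concurrently running calls and records the peak."""

    def __init__(self, limit):
        self._sem = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0

    async def run(self, coro_fn, *args):
        async with self._sem:  # callers wait here once `limit` calls are running
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await coro_fn(*args)
            finally:
                self.in_flight -= 1


async def call_peer_messages(i):
    """Stand-in for the real HTTP call (hypothetical); only simulates latency."""
    await asyncio.sleep(0.01)
    return i


async def send_burst(n, limit=MAX_IN_FLIGHT):
    """Fire n calls at once, but let at most `limit` of them run concurrently."""
    throttle = Throttle(limit)
    results = await asyncio.gather(
        *(throttle.run(call_peer_messages, i) for i in range(n))
    )
    return results, throttle.peak


if __name__ == "__main__":
    results, peak = asyncio.run(send_burst(50))
    print("completed:", len(results), "peak in flight:", peak)
```

A concurrency cap is only one possible form of throttling. A rate limit (requests per second) or simply dropping calls that are not needed, as the friend did, would address the same pressure.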
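
The comparison of the two dumps in the WinDbg analysis can also be done mechanically. The sketch below parses two `!dumpheap -stat` excerpts, using the exception rows quoted in this article, and reports how much each exception type grew. The helper names `parse_stat` and `growth` are hypothetical, and the parser assumes the four-column `MT Count TotalSize Class Name` layout.

```python
# Exception rows from the first dump (!dumpheap -stat)
DUMP_1 = """\
MT Count TotalSize Class Name
00007ffacecc38b0 13315 2343440 System.Net.WebException
00007ffadcdf6570 11369 1909992 System.IO.IOException
00007ffadcdf5fb8 13380 2247840 System.ObjectDisposedException
"""

# The same rows from the second dump, taken about 2 minutes later
DUMP_2 = """\
MT Count TotalSize Class Name
00007ffacecc38b0 26745 4707120 System.Net.WebException
00007ffadcdf6570 26722 4489296 System.IO.IOException
00007ffadcdf5fb8 28745 4829160 System.ObjectDisposedException
"""


def parse_stat(text):
    """Map class name -> instance count for each data row; other rows are skipped."""
    counts = {}
    for line in text.splitlines():
        parts = line.split(None, 3)  # MT, Count, TotalSize, Class Name
        if len(parts) == 4 and parts[1].isdigit():
            counts[parts[3].strip()] = int(parts[1])
    return counts


def growth(before, after):
    """Per-class increase in instance count between two dumps."""
    old, new = parse_stat(before), parse_stat(after)
    return {name: new[name] - old.get(name, 0) for name in new}


if __name__ == "__main__":
    delta = growth(DUMP_1, DUMP_2)
    for name, n in delta.items():
        print(f"{name}: +{n}")
    print("total:", sum(delta.values()))
```

With the figures from the two dumps the total comes to 44148 additional exception objects, in line with the "about 40,000" quoted in the text. Heap counts only include objects that have not been collected yet, so this is a rough indicator of the throw rate, not an exact count.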