1.背景1.讲故事10月份一个星球老朋友找到我说他们公司的程序在一次网红直播下变得无响应带货,无响应期间RabbitMQ有很多Timeout,如何找到根本原因,我就不贴聊天截图了。既然没有反应,那肯定是程序的大量线程主动或者被动挂掉了,我朋友也及时从程序里抽了一管血,然后我就去windbg看看是什么发生了什么?二、Windbg分析1.threads有什么问题如果要查看所有threads,还是老命令!t。0:000>!tThreadCount:5221UnstartedThread:0BackgroundThread:5199PendingThread:0DeadThread:21HostedRuntime:noLockDBGIDOSIDThreadOBJStateGCModeGCAllocContextDomainCountAptException20174e400000276CB778180202a020Preemptive0000000000000000:000000000000000000000276cb77c9d0-00001MTA31242cc00000276CB6CA8302b220Preemptive0000000000000000:000000000000000000000276cb77c9d0-00001MTA(Finalizer)3232b4000000276CB85D1B0102a220Preemptive0000000000000000:000000000000000000000276cb77c9d0-00001MTA(ThreadpoolWorker)26bccc00000276CBA5D2F020220Preemptive0000000000000000:000000000000000000000276cb77c9d0-00001Ukn339722400000276CBA5C0C03029220Preemptive0000000000000000:000000000000000000000276cb77c9d0-00001MTA(ThreadpoolWorker)System.IO.IOException00000279ccc56cd092329e00000027BD86FD18020220Preemptive0000000000000000:000000000000000000000276cb77c9d0-00001Ukn...从简要信息看,线程太多,一台机器不够进程,里面有。..不过我目前看到的最多的是1w+线程??????,这是一种没有线程池直接使用Thread造成的线程垃圾。可以看到线程列表中9号线程抛出了托管异常,然后查看是什么错误,使用!wpe00000279ccc56cd0命令。0:000>!wpe00000279ccc56cd0Address:00000279ccc56cd0ExceptionType:System.IO.IOExceptionMessage:无法从传输连接读取数据:现有连接被远程主机强行关闭。.InnerException:00000279ccc56b20System.Net.Sockets.SocketException现有连接被远程主机强行关闭。Stack:SPIPFunction000000791b88c97000007ffd844a1b31System.Net.Sockets.NetworkStream.Read(Byte[],Int32,Int32)000000791b88ee8000007ffd849e6f8aSystem.IO.BufferedStream.ReadByteSlow()000000791b88eeb000007ffd8312950aRabbitMQ.Client.Impl.InboundFrame.ReadFrom(System.IO.Stream,Byte[])000000791b88ef4000007ffd849e6d7dRabbitMQ.Client.Framing.Impl.Connection.MainLoopIteration()000000791b88efa000007ffd8312832fRabbitMQ.Client.Framing.Impl.Connection.MainLoop()HResult:80131620从堆栈信息来看,程序对RabbitMQ进行了远程访问,tcp连接被关闭对方。好像和朋友说的大量RabbitMQ超时有关。接下来就是查看各个线程栈,研究这些线程此时在做什么。使用~*e!clrstack命令仔细读取线程堆栈。我发现有很多方法卡在xxx.RabbitMq.RabbitMqConnection.GetInstance方法中。ChildSPIPCallSite0000008B8A9ED6A800007ffdf5246594[HelperMethodFrame_1OBJ:0000008b8a9ed6a8]System.Threading.Monitor.ReliableEnter(System.Object,BooleanByRef)0000008B8A9ED80000007ffd84a6a4a9xxx.RabbitMq.RabbitMqConnection.GetInstance(Microsoft.Extensions.Options.IOptions`1,Microsoft.Extensions.Logging.ILogger`1)0000008B8A9ED86000007ffd84a6a317xxx.RabbitMq.RabbitMqProducer..ctor(Microsoft.Extensions.Options.IOptionsSnapshot`1,Microsoft.Extensions.Logging.ILogger`1)0000008B8A9ED8A000007ffd8334817bDynamicClass.ResolveService(ILEmitResolverBuilderRuntimeContext,Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceProviderEngineScope)0000008B8A9ED93000007ffd83347d76DynamicClass.ResolveService(ILEmitResolverBuilderRuntimeContext,Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceProviderEngineScope)0000008B8A9EDE9000007ffd844f3cb3Microsoft.Extensions.DependencyInjection.ActivatorUtilities.GetService(System.IServiceProvider,System.Type,System.Type,Boolean)[/_/src/libraries/Common/src/Extensions/ActivatorUtilities/ActivatorUtilities.cs@173]DynamicClass.lambda_method196(System.Runtime.CompilerServices.Closure,System.IServiceProvider,System.Object[])0000008B8A9EDF2000007ffd84a0fc9cMicrosoft.AspNetCore.Mvc.Controllers.ControllerFactoryProvider+c__DisplayClass5_0.g__CreateController|0(Microsoft.AspNetCore.Mvc.ControllerContext)0000008B8A9EDF7000007ffd8452ce7fMicrosoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(StateByRef,ScopeByRef,System.ObjectByRef,BooleanByRef)[/_/src/Mvc/Mvc.Core/src/Infrastructure/ControllerActionInvoker.cs@285]0000008B8A9EE03000007ffd84a0fac8Microsoft.AspNetCore.Mvc.InvokerActionInvoker.ControllerInvoker()/_/src/Mvc/Mvc.Core/src/Infrastructure/ControllerActionInvoker.cs@490]0000008B8A9EE0B000007ffd845346cdMicrosoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Next(StateByRef、ScopeByRef、System.ObjectByRef、BooleanByRef)[/_/src/Mvc/Mvc.Core/src/Infrastructure/ResourceInvoker.cs@883]0000008B8A9EE240ff8AspMicrosoft。基础设施.ResourceInvoker.InvokeNextExceptionFilterAsync()[/_/src/Mvc/Mvc.Core/src/Infrastructure/ResourceInvoker.cs@1024]0000008B8A9EE2C000007ffd84534272Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Next(StateByRef,ScopeBoyRef,ScopeBoyRef,)[/_/src/Mvc/Mvc.Core/src/Infrastructure/ResourceInvoker.cs@883]0000008B8A9EE45000007ffd84a0f850Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.InvokeNextResourceFilter()[/_/src/Mvc/Mvc.Core/src/Infrastructure/ResourceInvoker.cs@976]...从调用栈信息来看,source是一个http请求,然后冻结在GetInstance下的锁上,引起了我极大的好奇,然后根据源码导出IP看看publicsealedclassRabbitMqConnection{publicstaticRabbitMqConnectionGetInstance(IOptionsoptions,ILoggerlogger){if(_uniqueInstance==null||_uniqueInstance.Connection==null||!_uniqueInstance.Connection.IsOpen){lock(_objLock){if(_uniqueInstance==null||_uniqueInstance.Connection==null||!_uniqueInstance.Connection.IsOpen){_uniqueInstance=newRabbitMqConnection(options.Value,logger);}}}return_uniqueInstance;}privateRabbitMqConnection(RabbitMqConfigconfig,ILoggerlogger){Policy.处理().Or().WaitAndRetry(6,(intretryAttempt)=>TimeSpan.FromSeconds(1.0),delegate(Exceptionex,TimeSpantime,intretryCount,Contextcontent){if(6==retryCount){throwex;}_logger.LogError(ex,$"{retryCount}:{ex.Message}");}).Execute(delegate{Connection=factory.CreateConnection();});}}从代码逻辑来看,朋友用了双重检查锁用于单实例化RabbitMQ实例。如果实例创建失败,会有6次1s的尝试。这种写法一看就没有问题。2、单例写法真的没问题吗?如果单例写法没问题,为什么会有大量线程卡在锁上?既然是单例,那肯定是rabbitmq的第一个实例化,后代直接乘凉好了好吧,带着这个问题我们再查一下双检索的方法。尼玛,我在判断单个实例的时候居然做了一个_uniqueInstance.Connection.IsOpen的判断。你知道这意味着什么吗?这意味着一旦rabbitmq在某个时刻挂了,单例条件就被打破了,无数线程排队等待实例化RabbimtMQ。要知道这些都是不实例化不放弃,然后导致程序挂掉的勇士……??????3。验证我的想法由于这个代码缺陷是从源代码中引入的,但是这些话是没有根据的,所以我不得不提供证据。验证起来非常简单。到托管堆中找到RabbitMqConnection实例,此时查看。通过ILSpy查看源码可以判断IsOpen是否为false,发现是通过CloseReason==null判断的。接下来,检查CloseReason是否不为空。0:000>!dumpheap-typeRabbitMqConnectionAddressMTSize00000277cbd7aa6800007ffd831f15703200000277ccb1306800007ffd831f1570320:000>!DumpObj/d00000277cbd7b858Name:RabbitMQ.Client.Framing.Impl.AutorecoveringConnectionMethodTable:00007ffd83235db0EEClass:00007ffd83242898Size:200(0xc8)bytesFile:G:\xxx\RabbitMQ.Client.dllFields:MTFieldOffsetTypeVTAttrValueName00007ffd823972384000180bcSystem.Boolean1instance0_disposed00007ffd82390c6840001818System.Object0instance00000277cbd7b920_eventLock00007ffd831fc230400018210...g.Impl.Connection0instance00000277cbd7d5f8_delegate0:000>!DumpObj/d00000277cbd7d5f8Name:RabbitMQ.Client.Framing.Impl.ConnectionMethodTable:00007ffd831fc230EEClass:00007ffd8322cd70Size:232(0xe8)bytesFile:G:\xxx\RabbitMQ.Client.dllFields:MTFieldOffsetTypeVTAttrValueName00007ffd823972384000198b8System.Boolean1instance0_disposed00007ffd82390c6840001998System.Object0instance00000277cbd7d6e0_eventLock00007ffd82d93d00400019a10...ualResetEventSlim0instance00000277cbd7d6f8_appContinuation00007ffd83276028400019b18...ShutdownEventArgs0instance00000279ccc56e28_closeReason0:000>!do00000277ccb13068Name:xxx.RabbitMq.RabbitMqConnectionMethodTable:00007ffd831f1570EEClass:00007ffd831eb920Size:32(0x20)bytesFile:G:\xxx\xxx.dllFields:MTFieldOffsetTypeVTAttrValueName00007ffd831f13f8400001f8...Private.CoreLib]]0instance00000278cbe4c2a0_logger00007ffd831f2ab0400002010...lient.IConnection0instance0000000000000000k__BackingField00007ffd831f1570400001d8...abbitMqConnection0static00000277cbd7aa68_uniqueInstance00007ffd82390c68400001e10System.Object0static00000277cbd7aa50_objLock从输出信息中可以很清楚的看到当前托管堆有两个RabbitMqConnection对象,其中一个果然是失败了(_closeReason=00000279ccc56e28),还有一个正在努力的newk__BackingField=00000000000000000,whichalsoverifiesassumption4.Afterhavingthisinformation,Icommunicatedwithmyfriends.ItisrecommendedtooptimizetheexceptionhandlinglogicwhenIsOpen=false,suchasreturnorthrownew,orsimplydonotuselazydetection,don'tforceyourself.AsfortheseriesofcausesthatcausedRabbitMQtonotrespond,myfriendfoundthatsomeoneelse’sAddSingletonwaschangedtoAddScopedthroughthereferencedopensourceproject.这也导致每一个Http请求都会去试探性的判断单例逻辑。别人的解决方案是,一旦上层被单例化,下层就不会再处理了,也就不会按照IsOpen的逻辑了。3.总结一下导致这个程序卡死的意外。主要有两点:RabbitMQ在单例中产生IsOpen判断问题。推荐使用自动重连属性AutomaticRecoveryEnabled和TopologyRecoveryEnabled,或者使用静态构造函数代替双重检查锁。http请求太多,测试单例逻辑,根据朋友的修改,将Scoped改成Singleton来避免。本文转载自微信公众号“行行码农闲聊技术”,可通过以下二维码关注。转载本文请联系一线码农聊聊技术公众号。