LogFileSync等待超高案例分析本文转载请联系DBA杂念公众号。监控工具DPA(SolarWinds的DatabasePerformanceAnalyzer)发现海外某Oracle数据库服务器DBCommitTime指标告警超过红色告警线(约40毫秒,黄色告警线为10毫秒,红色告警线为20毫秒),如下图所示显示对应时间段的AWR报告已经生成。分析时发现,Top5TimedEvents中,logfilesync等待事件的平均等待时间为37毫秒,logfileparallelwrite等待事件的平均等待时间为40毫秒。如果对'TanelPoder::UnderstandingLGWR,LogFileSyncWaitsandCommitPerformance'这篇文章描述的内容很熟悉的话(经典图片如下),那么等待事件日志的AvgWait(ms)指标filesyncandlogfileparallelwrite:37ms&40ms,基础上就可以判断redolog所在的磁盘I/O出现了性问题之后在用lfsdiag.sql脚本分析一下详细的统计数据,如下所示WASTORACHANDATA::通过查看“wait_time_milli”来查看日志文件同步等待的等待分布。查看高等待时间,然后查看是否可以将它们与其他相关等待事件相关联。INST_IDEVENTWAIT_TIME_MILLIWAIT_COUNT------------------------------------------------------------------------1logfilesync143631logfilesync28351logfilesync416501logfilesync849371logfilesync161462521logfilesync326066741logfilesync642633771logfilesync1282532541logfilesync25621logfileswitchcompletion11241logfileswitchcompletion291logfileswitchcompletion4191logfileswitchcompletion8211logfileswitchcompletion16351logfileswitchcompletion32971logfileswitchcompletion641331logfileswitchcompletion1283261logfileswitchcompletion25617361logfileswitchcompletion51230421logfileswitchcompletion102420201logfileparallelwrite101logfileparallelwrite201logfileparallelwrite4801logfileparallelwrite821421logfileparallelwrite161709871logfileparallelwrite327792051logfileparallelwrite643114631logfileparallelwrite128796881logfileparallelwrite256427631logfileparallelwrite512130521logfileparallelwrite1024204681logfileparallelwrite2048140201logfileparallelwrite40969211logfileparallelwrite8192961logfileparallelwrite16384181logfileparallelwrite32768181logfileparallelwrite6553681logfileparallelwrite13107221LGWRwaitforredocopy185161LGWRwaitforredocopy2201LGWRwaitforredocopy4191LGWRwaitforredocopy8201LGWRwaitforredocopy1611ORDEREDBYWAIT_TIME_MILLIINST_IDEVENTWAIT_TIME_MILLIWAIT_COUNT--------------------------------------------------------------------------1logfileswitchcompletion102420201logfileparallelwrite1024204681logfileparallelwrite2048140201logfileparallelwrite40969211logfileparallelwrite8192961logfileparallelwrite16384181logfileparallelwrite32768181logfileparallelwrite6553681logfileparallelwrite1310722REDOWRITESTATS"redowritetime"incentiseconds(100persecond)11.1:"redowritebroadcastacktime"incentiseconds(100persecond)11.2:"redowritebroadcastacktime"inmicroseconds(1000permillisecond)VERSIONINST_IDNAMEVALUEMILLISECONDS---------------------------------------------------------------------------------------------------10.2.0.5.01rewritetime955152495515240.00010.2.0.5.01rewriterlatchingtime5110.2.0.5.01rewrites1434931AWRWORSTAVGLOGFILESYNCSNAPS:从上面的数据可以看出,logfilesyncwaitingevents最多的是在32ms的区间内,logfileparallelwritewaitingevents的最大数也是在32ms的区间内。事实上,这个值已经远远超过了7ms的正常值,极不正常日志文件并行写事件是LGWR进程独有的等待事件。发生在LGWR将log_buffer中的重做日志信息写入联机重做日志文件组的成员文件中。LGWR在此事件上等待写入过程的完成。此事件等待的时间太长,表明日志文件所在的磁盘运行缓慢或发生争用。日志文件同步和日志文件并行写入是相互关联的。也就是说,如果logfileparallelwrite耗时很长,那肯定会导致logfilesync等待很长时间。如果日志文件并行写等待非常高,则可能是物理磁盘I/O问题。此外,我们还查看了redolog的切换频率,如下图,redolog_sitch_time.sql发现redolog的切换次数并不频繁,产生的归档日志量也不大。大多数情况下,每小时零或一次切换。然后我们找了一台服务器(上面指标正常的服务器)来简单测试一下IO的速度。cache等,测试采样并不详尽),但对比数据基本可以验证和反馈磁盘IO问题)。问题服务器:#timeddif=/dev/zeroof=./testbs=512kcount=2048oflag=direct2048+0recordsin2048+0recordsout1073741824bytes(1.1GB)copied,88.271seconds,12.2MB/sreal1m28.273suser0m0.010ssys0m0=:65timedifcomparedtoserver#timedif/dev/zeroof=./testbs=512kcount=2048oflag=direct2048+0recordsin2048+0recordsout1073741824bytes(1.1GB)copied,2.48344seconds,432MB/sreal0m2.485suser0m0.004ssys0m0.386s如图两台服务器上大小文件对比,所花费的时间和I/O性能有很大不同。经全面验证,该告警所在服务器的存储I/O存在性能问题。我可以报告这个问题,等待负责维护系统和存储的海外同事的回复。
