Prometheus Alerting Problem Analysis

Time: 2023-03-13 12:15:12, 科技观察

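Today I want to talk about an alerting problem I ran into while using Prometheus.

Problem analysis

While operating Prometheus recently, I noticed that alerts which should have been sent were sometimes not sent. To find the exact cause I went through some material, including the official documentation. I hope this helps others who use Prometheus.

Let's first look at some important default settings for Prometheus and Alertmanager given in the official documentation, shown below:

    # prometheus
    global:
      # How frequently to scrape targets by default.
      # (Interval for scraping monitoring data from targets.)
      [ scrape_interval: <duration> | default = 1m ]

      # How long until a scrape request times out.
      # (Timeout for scraping data from a target.)
      [ scrape_timeout: <duration> | default = 10s ]

      # How frequently to evaluate rules.
      # (Interval for evaluating alerting rules.)
      [ evaluation_interval: <duration> | default = 1m ]

    # alertmanager
    # How long to initially wait to send a notification for a group
    # of alerts. Allows to wait for an inhibiting alert to arrive or collect
    # more initial alerts for the same group. (Usually ~0s to few minutes.)
    # (Wait time before the first notification is sent.)
    [ group_wait: <duration> | default = 30s ]

    # How long to wait before sending a notification about new alerts that
    # are added to a group of alerts for which an initial notification has
    # already been sent. (Usually ~5m or more.)
    # (Interval for sending other new alerts that join the same group.)
    [ group_interval: <duration> | default = 5m ]

    # How long to wait before sending a notification again if it has already
    # been sent successfully for an alert. (Usually ~3h or more).
    # (Interval for re-sending the same alert.)
    [ repeat_interval: <duration> | default = 4h ]

With this configuration in mind, let's walk through the whole alerting flow and use it to locate the problems.

According to the figure above and the configuration, Prometheus scrapes the data and then evaluates the alerting rules. When a rule's expression is true, the alert enters the pending state. Once the expression has stayed true for at least the time configured with for, the alert enters the firing state and is pushed to Alertmanager. Alertmanager sends the notification after group_wait.

Alert delay or frequency

Following the flow above, once the data reaches Alertmanager, a larger group_wait means a longer wait before the notification arrives, which delays the alert. Conversely, if group_wait is too small, fewer alerts are batched into the first notification, so you receive notifications frequently. The value therefore has to be chosen for the specific scenario.

Alerts when there should be none

Prometheus pulls data from the target once every scrape_interval and then evaluates the rules on it. Between two evaluations the target's data may already have returned to normal. In other words, during the for window the underlying data had recovered, but the rule evaluation skipped over that recovery because it only looked at isolated points in time. When the duration is reached, the alert fires and a notification is sent. Looking at Grafana, however, the data appears normal and no alert seems justified. This is because Grafana, when it uses Prometheus as a data source, runs range queries, so what it displays is not as sparse as the data points the alert evaluation sees.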
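The flow described above can be illustrated with a small simulation. The following is only a simplified sketch in Python, not Prometheus's actual implementation: the metric function, the threshold of 80, the for duration of 120 s and the 5 s display step are hypothetical values chosen for illustration, and the sketch assumes each rule evaluation sees exactly the value present at its evaluation instant. It shows how a rule evaluated every 60 s can reach the firing state even though the metric is above the threshold only for a small part of the time.

```python
# Simplified model of alert-rule evaluation; all numbers are illustrative
# assumptions, not values taken from a real deployment.

def metric(t):
    """Hypothetical metric: 95 for the first 5 s of every minute, else 20."""
    return 95 if t % 60 < 5 else 20


def evaluate_rule(value_at, threshold, evaluation_interval, for_duration, end):
    """Evaluate `value > threshold` every evaluation_interval seconds.

    Returns a list of (timestamp, state) pairs, where state is one of
    "inactive", "pending" or "firing".
    """
    timeline = []
    active_since = None  # time at which the expression first became true
    for t in range(0, end + 1, evaluation_interval):
        if value_at(t) > threshold:
            if active_since is None:
                active_since = t
            held = t - active_since
            state = "firing" if held >= for_duration else "pending"
        else:
            active_since = None  # a false evaluation resets the `for` timer
            state = "inactive"
        timeline.append((t, state))
    return timeline


def share_above(value_at, threshold, step, end):
    """Fraction of points above the threshold when sampled every `step`
    seconds, roughly what a denser range query would display."""
    points = [value_at(t) for t in range(0, end + 1, step)]
    return sum(v > threshold for v in points) / len(points)


timeline = evaluate_rule(metric, threshold=80, evaluation_interval=60,
                         for_duration=120, end=180)
fired_at = next(t for t, state in timeline if state == "firing")
group_wait = 30  # Alertmanager default from the configuration above
dense_share = share_above(metric, threshold=80, step=5, end=180)

print(timeline)
print("first notification at about t =", fired_at + group_wait, "s")
print("share of dense points above threshold:", round(dense_share, 3))
```

In this sketch every evaluation happens to land on a spike, so the rule never sees the recovery in between, while the denser sampling shows the metric as normal most of the time. Shortening evaluation_interval or lengthening for changes how likely this coincidence is, but the right trade-off depends on the scenario.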
