Optimize IT Operations: Alert Suppression, Alert Deduplication and Forecasting
Context
A Europe-based multinational chain of personal care and fashion brand stores operates multiple data centers across the world. It uses monitoring and log analytics tools to continuously track key metrics and events from its IT assets. These tools are configured to analyze the data against static thresholds and raise alerts when out-of-range events are detected. The alerts are logged as tickets in the ticketing system, where the service desk teams must attend to and act on them.
However, the majority of these alerts are false alerts, which not only create unnecessary effort for the IT teams but also delay the response to real alerts, severely impacting revenue and customer experience. The customer is considering automating alert classification and prioritization so that the teams can address real alerts quickly. The end goal is to forecast alerts so that preventive measures can be taken to avoid service outages.
Customer Business Needs
- Reduce false alerts
- Respond to real alerts quickly
- Prevent alerts to the extent possible
- Reduce service downtime & impact
- Improve NOC/SOC & service desk efficiencies
- Improve overall customer experience
- Agility to support new environments
Challenges
- Alert noise – too many false alerts
- Identification of real alerts vs false alerts
- Delay in addressing real alerts
- Reactive measures to add capacity
- High incident volume and complexity
- Lack of historical knowledge base
- Manual classification and ticket routing
Solution Overview
CloudFabrix has deployed its Incident Room, Asset Optimization Advisor and Asset Dependency Mapping Apps. Once the Apps are installed, the existing monitoring and log analytics tools are configured to send alerts to the CloudFabrix solution instead of logging tickets directly. The Apps analyze the alerts first, detecting and suppressing false alerts using alert correlation, classification and noise reduction algorithms. Only real alerts are logged as tickets, through built-in two-way ticketing integrations.
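To make the flow concrete, the following is a minimal Python sketch of alert de-duplication followed by a simple suppression rule. It is an illustration only and does not represent CloudFabrix's actual algorithms. The alert fields (`host`, `check`, `ts`), the 5-minute grouping window and the "at least two occurrences" rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    key: tuple          # (host, check) pair used as the de-duplication key (assumed)
    first_seen: float   # timestamp of the first alert in the group, in seconds
    last_seen: float    # timestamp of the most recent alert in the group
    count: int = 1      # number of raw alerts folded into this incident


def group_alerts(alerts, window_seconds=300):
    """Fold alerts with the same (host, check) into one incident when each
    arrives within window_seconds of the previous alert for that key."""
    latest = {}      # key -> most recent incident for that key
    incidents = []   # all incidents, in order of first appearance
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["host"], alert["check"])
        current = latest.get(key)
        if current is not None and alert["ts"] - current.last_seen <= window_seconds:
            current.last_seen = alert["ts"]
            current.count += 1
        else:
            current = Incident(key, alert["ts"], alert["ts"])
            latest[key] = current
            incidents.append(current)
    return incidents


def should_ticket(incident, min_count=2):
    """Hypothetical suppression rule: a one-off alert is treated as noise,
    a repeated alert is treated as real and forwarded to ticketing."""
    return incident.count >= min_count


alerts = [
    {"host": "a", "check": "cpu", "ts": 0},
    {"host": "a", "check": "cpu", "ts": 60},
    {"host": "b", "check": "disk", "ts": 10},
    {"host": "a", "check": "cpu", "ts": 1000},
]
incidents = group_alerts(alerts)
tickets = [i for i in incidents if should_ticket(i)]
```

In this sketch, four raw alerts collapse into three incidents, and only the repeated CPU alert on host `a` would be forwarded as a ticket. A production system would likely combine several signals (topology, severity, history) rather than a single count-based rule.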
Furthermore, the data sources (metrics, logs, etc.) are integrated for ingestion and analysis, allowing CloudFabrix's advanced AI/ML algorithms to provide actionable insights such as fine-tuning thresholds, moving certain devices to dynamic thresholds that account for seasonality, and forecasting future alerts. These insights prompt preventive actions that help prioritize and optimize operations and reduce alerts. Key capabilities achieved with the adoption of the CloudFabrix solution are:
- Alert de-duplication and grouping of alerts into a single incident
- Suppression and/or auto-closure of false alerts, freeing up the IT teams' time
- Forecasting and anomaly detection that alert users before service is impacted
- Understanding of alert dependencies, improving collaboration among teams
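The idea of moving a device from a static threshold to a dynamic, seasonality-aware threshold can be sketched as follows. This is a simplified illustration, not the method used by the CloudFabrix solution: it assumes the seasonality is hour-of-day, and it sets each hour's threshold to the historical mean plus `k` standard deviations. The function names and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_thresholds(samples, k=3.0):
    """Build one threshold per hour of day from (hour, value) history:
    threshold = mean + k * population standard deviation for that hour."""
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    return {h: mean(v) + k * pstdev(v) for h, v in buckets.items()}


def is_anomalous(hour, value, thresholds, static_fallback):
    """Compare against the hour's dynamic threshold, falling back to the
    static threshold when no history exists for that hour."""
    return value > thresholds.get(hour, static_fallback)


# Hypothetical CPU utilisation history: busy at 09:00, quiet at 02:00.
history = [(9, 80), (9, 90), (2, 10), (2, 20)]
thresholds = hourly_thresholds(history)

night_spike = is_anomalous(2, 85, thresholds, static_fallback=90)
morning_load = is_anomalous(9, 85, thresholds, static_fallback=90)
```

With this history, a reading of 85 at 02:00 is flagged because it is far above what is normal for that hour, while the same reading at 09:00 is not. A single static threshold of 90 would have treated both readings identically and missed the night-time spike.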